SOFTWARE TOOLS AND DSP BOARDS SHORTEN MULTIPROCESSING DESIGN

Jun 1st, 1998

By Andrew Wilson, Editor at Large

According to Forward Concepts (Tempe, AZ), the digital-signal-processing (DSP) product market is projected to show a 33% annual increase in 1999, leading to an overall market of $11 billion. Based on these figures, semiconductor manufacturing leaders Analog Devices Inc. (ADI; Norwood, MA), Texas Instruments (TI; Dallas, TX), and Motorola (Austin, TX), among others, are competing heavily for a share of the DSP integrated-circuit (IC) market. Although TI is acknowledged as the market leader with more than 66% market share, Forward Concepts estimates that Analog Devices has more than 15% of the $200 million floating-point-processing IC market.

To address the high end of this market, both Analog Devices and TI have developed DSP ICs that are optimized for multiprocessing applications. For example, the Analog Devices Sharc ADSP-21060 IC integrates a 120-MFLOP/s peak-processing core, a 4-Mbit on-chip dual-ported SRAM, an independent 240-Mbit/s input/output (I/O) processor, a host port, and six 40-Mbyte/s link ports. With this architecture, the IC can support multiprocessing in two ways: through the external port bus or through the six link ports. As many as six Sharc ICs and a host processor can share the external port bus with a global external memory. This architecture allows each Sharc and the host to directly access the internal memory of each Sharc IC and global memory. The six-link-port multiprocessing approach allows data transfers among Sharcs at a rate of 40 Mbyte/s per link for a total I/O bandwidth of 240 Mbyte/s.
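The link-port and peak-throughput figures quoted above are simple multiples, and a short calculation makes that explicit. The following is a minimal sketch in Python; the constant and function names are illustrative, and the numbers are the peak figures quoted in the text, not measured results.

```python
# Peak figures quoted in the text for the ADSP-21060; names are illustrative.
LINK_PORTS = 6
LINK_RATE_MBYTE_S = 40     # per link port
CORE_PEAK_MFLOPS = 120     # per processor


def total_link_bandwidth(ports=LINK_PORTS, rate=LINK_RATE_MBYTE_S):
    """Aggregate link-port bandwidth in Mbyte/s if every port runs at peak."""
    return ports * rate


def cluster_peak_mflops(processors, per_core=CORE_PEAK_MFLOPS):
    """Upper bound on cluster throughput; real code rarely sustains peak."""
    return processors * per_core


print(total_link_bandwidth())    # 240
print(cluster_peak_mflops(6))    # 720
```

The second figure is an upper bound for six processors sharing the external port bus; sustained throughput depends on how often the processors contend for that bus.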

Like the ADI Sharc, the TI TMS320C40 IC contains an on-chip 60-MFLOP/s CPU and six communication ports that can transfer up to 28 Mbyte/s each for asynchronous interprocessor communications. With 2000 words of on-chip RAM, 128 words of program cache, and a boot loader, the IC's two external buses provide an address reach of 4 Gwords of unified memory space.

Shared or distributed

Robert Frankel, former chief technical officer at Spectron Microsystems (Santa Barbara, CA) and now senior scientist for TI, says, "In developing software for a multiprocessor system, designers need to determine whether they need a shared or a distributed processing and data model."

In a distributed architecture, digital-signal processors operate out of a private, local memory store. In a shared memory architecture, digital-signal processors process data from a shared global store. Hybrid designs combine the two approaches by equipping the processors with private memory and providing a global memory bank that all digital-signal processors can access. "Both the Sharc and C40 ICs are adept in supporting both distributed and shared-memory implementations, though the Sharc has been optimized for shared memory architectures," adds Frankel.
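The difference between the two models can be illustrated with a small sketch. This is not DSP code: Python threads stand in for processors, a lock stands in for bus arbitration on the shared store, and a queue stands in for link or communication ports. All names are hypothetical.

```python
import queue
import threading


def run_shared(samples, workers=2):
    """Shared model: every worker reads one global buffer; a lock guards the result."""
    total = [0]
    lock = threading.Lock()

    def work(index):
        # Each worker processes an interleaved slice of the shared store.
        partial = sum(samples[index::workers])
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def run_distributed(samples, workers=2):
    """Distributed model: each worker owns a private copy of its slice and
    sends its partial result over a message channel."""
    channel = queue.Queue()

    def work(private_memory):
        channel.put(sum(private_memory))

    threads = []
    for i in range(workers):
        private = list(samples[i::workers])   # copied into local memory
        t = threading.Thread(target=work, args=(private,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return sum(channel.get() for _ in range(workers))


data = list(range(1, 101))
print(run_shared(data), run_distributed(data))   # 5050 5050
```

Both versions compute the same sum. The shared version pays for access to common state, while the distributed version pays for copying data into private memory and passing results as messages.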

In the design of its PCI/66 ADSP-2106 parallel DSP board, Loughborough Sound Images (Loughborough, Leicester, England) used an array of six Sharc processors linked to one another via on-chip link ports and to global memory via the IC's host port. With a peak operating speed of 720 MFLOP/s, the board provides 16 Mbytes of DRAM. It is supported by the Virtuoso operating system from Eonic Systems (Herndon, VA) and the Spox DSP operating system from Spectron Microsystems (Santa Barbara, CA).

As with the ADI Sharc's link ports, the TI C40's six communication ports can be used to connect C40s in distributed point-to-point configurations. "Unlike the Sharc, however, the C40 does not provide a dedicated address space for multiprocessing or a locking facility for its external memory buses," says Frankel. "Indeed, the C40's dual external memory buses suggest a different approach to multiprocessing," he says. Because the C40 is equipped with two memory buses, system designers can implement a variety of shared, distributed, and mixed topologies. In a distributed architecture, private memory is connected to each C40 bus, and the communication ports can be used to establish communications and to pass data among devices.

In a shared memory architecture, the C40's dual memory buses can be used to create C40 clusters around both the local bus and the global bus. In this scenario, the two memory buses alleviate the bus bandwidth bottleneck associated with shared memory multiprocessors by halving the number of transactions per bus.
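The halving argument assumes that memory traffic divides evenly between the two buses. A minimal sketch of that idealized split follows; the function name and the transaction count are hypothetical, chosen only to show the arithmetic.

```python
def transactions_per_bus(total_transactions, buses):
    """Idealized even split of memory transactions across buses."""
    base, extra = divmod(total_transactions, buses)
    # Spread any remainder one transaction at a time over the first buses.
    return [base + (1 if i < extra else 0) for i in range(buses)]


print(transactions_per_bus(1000, 1))   # [1000]
print(transactions_per_bus(1000, 2))   # [500, 500]
```

In practice the split depends on where code and data are placed, so the benefit can be smaller than a clean factor of two.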

Such an approach has been taken by Spectrum Signal Processing (Burnaby, BC, Canada) in the design of its Maranello C4x-based VME64bus master DSP board. Featuring four TIM-40 module sites, ten external communication ports, and a shared memory architecture, this board can be configured with eight C4x DSPs capable of 480 MFLOP/s (see Fig. 1). Because a shared bus architecture is used, the TIM-40 module sites and the VME64 interface can both access a bank of on-board shared SRAMs.

Enter VLIW

Although many board vendors look to the Sharc processor to gain a performance advantage over the C40, TI has evolved its next generation of digital-signal processors based on a very-long-instruction-word (VLIW) design. The first product in the family, the C6201, is a 1.6-GIP (giga-instructions per second), 32-bit, fixed-point processor optimized for C programming.

Mark Siggins, a design engineer with Hunt Engineering (Brent Knoll, Somerset, England), says, "The TI C6201 [is] much faster at performing integer tasks than the C4x by factors ranging from 5.6 to more than 40. This is because the C6201's architecture includes 256-bit instruction packets, with each packet having up to eight 32-bit instructions for eight of the functional units that operate in parallel." And, because the instruction set of the C62xx allows simultaneous manipulation of 16-bit real and imaginary data, the processor is particularly useful in image-processing applications that use functions such as the fast Fourier transform (FFT).
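The packet arithmetic and the idea of carrying a 16-bit real and a 16-bit imaginary value in one 32-bit word can be sketched as follows. This is an illustration in Python, not C6201 code: the 200-MHz clock is the rate quoted for the C6201 boards in this article, the peak figure assumes one full packet issues every cycle, and the placement of the real part in the upper half-word is an arbitrary assumption.

```python
PACKET_BITS = 256
INSTRUCTION_BITS = 32
CLOCK_MHZ = 200   # clock rate quoted for the C6201 boards in this article


def instructions_per_packet(packet_bits=PACKET_BITS, instr_bits=INSTRUCTION_BITS):
    """Number of fixed-width instructions that fit in one packet."""
    return packet_bits // instr_bits


def peak_mips(clock_mhz=CLOCK_MHZ):
    """Peak rate, assuming one full packet issues on every clock cycle."""
    return instructions_per_packet() * clock_mhz


def pack_complex16(real, imag):
    """Pack signed 16-bit real and imaginary parts into one 32-bit word.
    Real goes in the upper half-word (an assumed layout)."""
    return ((real & 0xFFFF) << 16) | (imag & 0xFFFF)


def unpack_complex16(word):
    """Recover the signed 16-bit real and imaginary parts."""
    def to_signed(value):
        return value - 0x10000 if value & 0x8000 else value
    return to_signed((word >> 16) & 0xFFFF), to_signed(word & 0xFFFF)


print(instructions_per_packet())                  # 8
print(peak_mips())                                # 1600
print(unpack_complex16(pack_complex16(-3, 7)))    # (-3, 7)
```

Eight instructions per cycle at 200 MHz gives 1600 MIPS, which matches the 1.6 giga-instructions per second quoted for the part.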

Already, CowBoy Enterprises (Lewisville, TX), Signatec (Corona, CA), Pentek (Upper Saddle River, NJ), and Coreco (St. Laurent, Quebec, Canada) have revealed board-level products based on the PC104, PCI, and VME buses. CowBoy Enterprises' Furioso PC104 DSP accelerator board uses one 200-MHz C6201 processor and 16 Mbytes of synchronous DRAM (see Fig. 2). It is supported by Code Composer software from GO-DSP (Toronto, Ontario, Canada) and by emulation through a JTAG emulator pod from White Mountain DSP (Nashua, NH). According to CowBoy Enterprises, future support will include the Spox operating system and development tools from Hyperception (Dallas, TX).

However, it is still unclear whether TI will provide hardware hooks for multiprocessing. "Although there are a significant number of customers who have requested multiprocessor support," says Keith M. Reeves, president of CowBoy Enterprises, "the chip does not appear to have been designed to fit that role. I believe that Texas Instruments intended the chip to be fast enough and powerful enough to be a standalone digital-signal processor."

Mitch Reifel of TI doesn't think any hardware hooks will be forthcoming. "The history of the Transputer and TI's own experience with the C40 shows that hardware hooks are not the way to support multiprocessing," says Reifel. "The (correct) way is through innovative software in combination with design and hardware support. Texas Instruments is working closely with several companies to provide both the hardware support and the software infrastructure to support multiprocessing," he adds.

Because of the lack of hardware hooks, companies such as Pentek have engineered their own architecture to implement multiprocessing. "The use of bidirectional FIFO devices connected to the external memory interface seems to be a key element in this type of design," says Reeves of CowBoy Enterprises.
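A bidirectional FIFO can be thought of as two bounded queues, one per direction, sitting between a pair of processors. The sketch below models that behavior in Python; it is not Pentek's design, and the class name, the side labels "a" and "b", and the depth parameter are all hypothetical.

```python
from collections import deque


class BidirectionalFifo:
    """Two bounded queues, one per direction, between sides 'a' and 'b'."""

    def __init__(self, depth):
        self.depth = depth
        self._queues = {"a_to_b": deque(), "b_to_a": deque()}

    def write(self, side, word):
        """Queue a word from the given side; return False if that direction is full."""
        q = self._queues["a_to_b" if side == "a" else "b_to_a"]
        if len(q) >= self.depth:
            return False          # full: the writer must retry later
        q.append(word)
        return True

    def read(self, side):
        """Return the oldest word sent to this side, or None if nothing is waiting."""
        q = self._queues["b_to_a" if side == "a" else "a_to_b"]
        return q.popleft() if q else None


fifo = BidirectionalFifo(depth=2)
print(fifo.write("a", 10), fifo.write("a", 11), fifo.write("a", 12))  # True True False
print(fifo.read("b"), fifo.read("a"))                                 # 10 None
```

Because each direction has its own queue, one processor can keep writing while the other drains data at its own pace, and a full queue simply stalls the writer rather than the whole bus.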

Pentek, Signatec, and Coreco all use multiple C6201 processors in their products. Pentek's Model 4290 VME board supports four 200-MHz TMS320C6201 processors that provide an aggregate performance of 6400 MIPS. It consists of four identical processor cores coupled to the board's global bus through a dual-port memory structure. Each processor core contains a C6201 processor coupled to private memory and I/O resources.

Signatec's PMP-8 PCI add-in board can be populated with as many as nine processors to obtain a peak throughput of 12.8 GIPs (see Fig. 3). The board's architecture incorporates either four or eight slave C6201 DSPs that are controlled by an additional program-execution processor, also a C6201.

To support multiprocessing, Signatec offers a C preprocessor that generates all the code for the multiprocessors from a single source-code program. The preprocessor works from the user's C code and translates it into C code for the program-execution processor and the other DSPs. "To achieve the best performance, the programmer can direct the operation of the preprocessor through the use of preprocessor directives," says Tom Hunt, president of Signatec.

Coreco's Python/C6 is also an add-in PCI board. Capable of hosting four C6201 DSPs, the board provides inter-DSP communication links, up to 16 Mbytes of SDRAM, and 4 Mbytes of shared memory. For image-processing applications, the board can be connected to the company's Cobra/C6 C6201 PCI image-processor board by means of a 200-Mbyte/s auxiliary bus.

Floating point, too

Rather than develop a faster C40 processor, TI used its VelociTI VLIW architecture in the design of its latest processor, the C6701. Providing 1-GFLOP/s performance, the 167-MHz processor supports an instruction set that is a superset of the C62xx. According to TI, this makes the company the first DSP vendor to offer a code-compatible fixed- and floating-point architecture. To achieve this compatibility, floating-point capability was added to six of the eight functional units inside the C6701 CPU. These functional units are two ALU units, two auxiliary units, and two multipliers.
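The 1-GFLOP/s figure is consistent with the six floating-point-capable units and the 167-MHz clock quoted above, if each of those units completes one floating-point operation per cycle. That per-cycle rate is an assumption made for this sketch, not a statement from the article, and the names below are illustrative.

```python
FP_UNITS = 6       # functional units with floating-point capability (from the text)
CLOCK_MHZ = 167    # clock rate quoted for the C6701


def peak_mflops(fp_units=FP_UNITS, clock_mhz=CLOCK_MHZ, ops_per_unit_per_cycle=1):
    """Peak floating-point rate; one op per unit per cycle is an assumption."""
    return fp_units * clock_mhz * ops_per_unit_per_cycle


print(peak_mflops())   # 1002, i.e. roughly 1 GFLOP/s
```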

Commenting on this code-compatibility, Greg Da Silva, president of GO DSP, says that because this is the first DSP chip family with such an architecture, system designers gain an unprecedented time-to-market advantage. Indeed, using GO DSP's Code Composer IDE, designers can start development on the C6701 immediately, even though samples of the part are not planned until the second half of this year.

Already, many third-party suppliers, including Pentek, Spectrum Signal Processing, and 3L Limited (Edinburgh, Scotland), have announced support for the C6701. At press time, only Pentek and Spectrum Signal Processing had preannounced board-level products. Pentek's product, a variant of its C6201 Model 4290 VME board, dubbed Model 4291, supports four C6701 devices (see Fig. 4). Spectrum said it would shortly offer three boards: a dual PCI board, a quad VME board, and a quad CompactPCI board.

OS support

For its part, 3L Limited is expected to support the new floating-point processor with its Diamond RTOS, a multitasking, multithreading real-time operating system for developing multiprocessor systems. "By using RTOS's reconfigurable virtual communication ports between tasks," says Peter Robinson, 3L managing director, "system designers can control intertask interactions and redefine processor networks during debugging. This will be of particular benefit to current C4x designers who have existing system hardware based around C4x communication ports and are considering migrating to C67xx-based systems," he adds.

While TI drums up support for its VLIW architecture, Analog Devices is not sitting still. Two months ago, the company announced a $10 Sharc in quantities of 100,000, the ADSP-21065L, which is capable of 180 MFLOP/s. And, because the part is code-compatible with existing Sharc products, developers can choose from ADI and third-party development tools already in place. For higher-speed performance applications, Analog Devices is developing a static superscalar Sharc DSP at ADI design centers in India, Israel, and the United States that will be capable of 5 billion operations per second. Whether that part will be code-compatible with other members of the Sharc family remains to be seen.

Some industry analysts assess the TI C67x floating-point preannouncement as a marketing exercise to ensure support among its existing C40 customer base. However, system designers confronted by short product-design cycles might not care. In systems in which price and performance mandate the choice of a digital-signal processor, they will make product choices based on the low cost and availability of ICs, boards, and support tools.

FIGURE 1. The Maranello C4x-based VME64bus master DSP board from Spectrum Signal Processing contains four TIM-40 module sites, ten external communication ports, and a shared memory architecture. The board can be configured with eight C4x DSPs that are capable of 480 MFLOP/s.

FIGURE 2. The Furioso PC104 DSP board from CowBoy Enterprises, which uses one 200-MHz C6201 processor and 16 Mbytes of synchronous DRAM, is supported by Code Composer software from GO-DSP and emulation through a JTAG emulator pod from White Mountain DSP.

FIGURE 3. The Signatec PMP-8 PCI board can be populated with up to nine processors for a peak throughput of 12.8 GIPs. The board's architecture incorporates either four or eight slave C6201 DSPs that are controlled by a C6201 program-execution processor.

FIGURE 4. The Pentek Model 4290 VME board supports four 200-MHz TMS320C6201 processors that provide an aggregate performance of 6400 MIPS. Another similar board from Pentek, the Model 4291, is based on the Texas Instruments C67xx floating-point chip.
