Novel designs propel DSP imaging boards to higher performance
Editor at Large
Because image-processing tasks such as filtering, warping, compression, and statistical analysis are computationally intensive, they are often implemented as repetitive multiply/accumulate functions. In many applications, these tasks are off-loaded from the host CPU to programmable digital-signal-processor (DSP) boards. These boards closely map the architecture of signal- and image-processing algorithms and incorporate single or multiple multiplier-accumulators in the DSP pipeline. Accordingly, systems integrators are using these DSP boards to accelerate their imaging applications.
By performing parallel multiply and arithmetic-logic-unit (ALU) operations on integer or floating-point data in a single clock cycle, DSPs, such as the TMS320VC33 from Texas Instruments (TI; Dallas, TX), are capable of performing as many as 150 million floating-point operations per second (MFLOPS). By implementing multiply/accumulators (MACs) in hardware, DSP boards provide a means of increasing processing performance while off-loading the host from DSP-related tasks.
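To make the role of the multiply/accumulate concrete, the following minimal sketch filters one image row with a three-tap kernel. It is an illustration only: the function name `mac_filter`, the sample row, and the kernel weights are assumptions chosen for the example, and a DSP would execute the inner step in hardware rather than in an interpreted loop.

```python
def mac_filter(samples, coeffs):
    """Slide a kernel along a 1-D signal, one multiply/accumulate per tap."""
    taps = len(coeffs)
    out = []
    for i in range(len(samples) - taps + 1):
        acc = 0  # the running sum a hardware accumulator would hold
        for k in range(taps):
            acc += samples[i + k] * coeffs[k]  # one multiply/accumulate step
        out.append(acc)
    return out


# Hypothetical image row containing a step edge, and a 1-2-1 smoothing kernel.
row = [10, 10, 10, 50, 50, 50]
smoothed = mac_filter(row, [1, 2, 1])
print(smoothed)  # [40, 80, 160, 200]
```

Every output value costs one multiply/accumulate per kernel tap, so the work grows with both image size and kernel size. That is why a hardware MAC that completes each step in a single cycle matters for imaging.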
Despite such performance increases, designers of DSP-based image-processing boards are continually pressured by the steady performance increases of general-purpose CPUs. Decoupling memory accesses from arithmetic computation, adding MMX instructions, improving floating-point units, and implementing 128-bit instructions on an existing 64-bit data path, for example, have allowed the Intel Corp. (Santa Clara, CA) Pentium III to attain a 250-MFLOPS performance at a clock rate of 400 MHz. To meet this competitive challenge and offer even higher-performing products, image-processing vendors have turned to the power of multiple-instruction multiple-data (MIMD) microprocessors, very-long-instruction-word (VLIW) devices, and combined RISC/DSP devices.
MIMD-based architectures use a number of processors that function asynchronously and independently. At any time, different processors execute different instructions on different data. Such devices can use either shared or distributed memory based on how the MIMD processors access memory. In the design of the TI C80 single-chip, MIMD-parallel processor, all of the processors are coupled tightly through an on-chip crossbar switch that provides shared access to on-chip RAM. Capable of performing more than 2 billion operations per second (2 BOPS), the C80 chip contains a 32-bit RISC master processor with a 100-MFLOPS IEEE-based floating-point unit, four 32-bit parallel-processing DSPs, and a video controller.
The PCI-based Genesis vision processor from Matrox Electronic Systems Ltd. (Dorval, Quebec, Canada) couples a TI C80 multiprocessor with optional mezzanine frame-grabber modules and display controllers to capture, process, and display monochrome or color images (see Fig. 1). Despite the power of the C80 DSP device, Matrox recognized that many applications require processing of neighborhood operations in real time. Therefore, each Genesis processing node also contains a Neighborhood Operations Accelerator (NOA) ASIC and 64 Mbytes of SDRAM as local memory. The NOA speeds up convolutions, gray-scale and binary morphology, normalized gray-scale correlation, and lossless JPEG compression/decompression. To support the image processor, Matrox offers the Matrox Imaging Library (MIL), an image-processing library that includes ActiveMIL. This collection of ActiveX controls optimizes the hardware's processing power and functionality.
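A neighborhood operation computes each output pixel from a small window of input pixels. The sketch below shows gray-scale morphology on a 3 x 3 window as one plausible software rendering of the idea. It is not the NOA design or a MIL call; the helper names `neighborhood_op`, `erode`, and `dilate`, and the choice to copy border pixels unchanged, are assumptions made for the example.

```python
def neighborhood_op(image, reduce_fn):
    """Apply reduce_fn to the 3x3 window around every interior pixel."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # assumption: border pixels are copied as-is
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = reduce_fn(window)
    return out


def erode(image):
    """Gray-scale erosion: each pixel becomes the minimum of its window."""
    return neighborhood_op(image, min)


def dilate(image):
    """Gray-scale dilation: each pixel becomes the maximum of its window."""
    return neighborhood_op(image, max)


# Hypothetical 5x5 test image with a single bright pixel in the center.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
grown = dilate(image)   # the bright pixel spreads to the whole 3x3 interior
shrunk = erode(image)   # the isolated bright pixel disappears
```

Each interior pixel needs nine reads and a reduction, so a real-time video stream demands billions of simple operations per second. That load is what a dedicated accelerator takes off the programmable processor.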
Very long words
Despite the potential of MIMD architecture, Texas Instruments abandoned the concept for its next generation of DSPs. Instead, the company chose to develop VLIW architectures for both its C6201 fixed- and C6701 floating-point processors. As an alternative for executing more than one instruction at a time, VLIW processors, like MIMD processors, contain multiple functional units. However, in operation, VLIW processors fetch a very long instruction word containing several primitive instructions from the instruction cache and dispatch the entire VLIW for parallel execution.
In such architectures, designers attempt to eliminate the instruction-scheduling and parallel-dispatch hardware of modern microprocessors. Although this approach seems straightforward in theory, it requires compilers to generate code in which independent primitive instructions are grouped together for parallel execution. Moving complexity from the hardware to the compiler means that simpler and faster processors can be constructed, but at the expense of compiler complexity. In operation, the multiple functional units must be kept busy, which requires enough instruction-level parallelism in a code sequence to fill the available operation slots.
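The grouping task that falls to a VLIW compiler can be sketched with a simple greedy packer. This is an illustration, not the algorithm of any vendor's compiler. It assumes that every operation takes one cycle, that each operation writes a distinct register (so only read-after-write dependencies matter), and that any operation can use any slot. The function name `pack_vliw` and the sample operations are hypothetical.

```python
def pack_vliw(ops, slots):
    """Greedily pack operations into instruction words of `slots` operations.

    ops is a list of (dest_register, [source_registers]) in program order.
    Returns a list of instruction words, each a list of operation indices.
    """
    words = []
    ready_at = {}  # register -> first word in which its value can be read
    for index, (dest, sources) in enumerate(ops):
        # An operation cannot issue before all of its inputs are produced.
        word = max((ready_at.get(src, 0) for src in sources), default=0)
        # Skip words whose slots are already full.
        while word < len(words) and len(words[word]) >= slots:
            word += 1
        while len(words) <= word:
            words.append([])
        words[word].append(index)
        ready_at[dest] = word + 1  # result is available in the next word
    return words


# Hypothetical code sequence: t4 = (a*b + c*d) + e*f
ops = [
    ("t0", ["a", "b"]),
    ("t1", ["c", "d"]),
    ("t2", ["t0", "t1"]),
    ("t3", ["e", "f"]),
    ("t4", ["t2", "t3"]),
]
print(pack_vliw(ops, slots=2))  # [[0, 1], [2, 3], [4]]
print(pack_vliw(ops, slots=8))  # [[0, 1, 3], [2], [4]]
```

Widening the instruction word from two slots to eight still yields three words here, because the chain of dependent additions limits the available parallelism. Most of the extra slots simply go unfilled.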
With a performance of up to 1600 million instructions per second (MIPS) at a clock rate of 200 MHz, the C6201 processor is based on the TI VelociTI VLIW architecture. With 32 general-purpose registers of 32-bit word length and eight independent functional units that provide six ALUs and two 16-bit multipliers for a 32-bit result, the C6201 can produce two multiply-accumulates (MACs) per cycle--for a total of 400 million MACs per second (MMACS).
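The peak figures quoted for the C6201 follow directly from the clock rate and the number of operations that can start in each cycle. The short calculation below reproduces them. The helper name `peak_rate_millions` is an assumption for the example, and the results are theoretical peaks that presume every slot is filled on every cycle.

```python
def peak_rate_millions(clock_mhz, per_cycle):
    """Peak rate in millions per second: clock (MHz) x operations per cycle."""
    return clock_mhz * per_cycle


# Eight functional units, each able to start one instruction per cycle.
c6201_mips = peak_rate_millions(200, 8)
# Two multipliers, so two multiply-accumulates per cycle.
c6201_mmacs = peak_rate_millions(200, 2)

print(c6201_mips)   # 1600
print(c6201_mmacs)  # 400
```

Sustained throughput on real code is lower whenever the compiler cannot find enough independent operations to occupy all eight units.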
With high MAC throughputs, the C6201 processors are highly applicable to image-processing tasks. As a result, Coreco Inc. (St. Laurent, Quebec, Canada) uses the processor in its PCI-based Cobra/C6 image processor (see Fig. 2). Central to the Cobra/C6 architecture is the board's image gateway, a multiport-transfer controller that interconnects image acquisition circuits; host, DSP, and pixel processors; and an auxiliary bus.
Sustaining total transfers of 720 Mbytes/s among any five ports, the image gateway performs transfers independently among different sections of the Cobra/C6 board. Like the Matrox Genesis board, the Cobra/C6 board also supports a pixel processor that operates on pixel data at rates to 200 Mbytes/s. This pixel processor accelerates point-to-point and neighborhood image processing. For added processing capability, the Cobra/C6 uses the Coreco Auxiliary Bus to communicate image data to additional processing cards such as the Coreco Python/C6 multiple-C6201 card.
The TI floating-point version of the C6000 series is the C6711, a device that is also based on the VelociTI VLIW architecture. With a performance of up to 900 MFLOPS at 150 MHz, the C6711 contains 32 general-purpose registers of 32-bit word length and eight independent functional units. The eight functional units are composed of four floating-/fixed-point ALUs, two fixed-point ALUs, and two floating-/fixed-point multipliers. They can produce two MACs per cycle for a total of 300 MMACS.
Even before the C6711 device is shipped, several companies have pre-announced board-level products. For example, Traquair Data Systems (Ithaca, NY) is going to offer the Heron1-C6701, a self-contained processing unit that features a C6701 DSP module with RAM and FLASH ROM. Developed by Hunt Engineering (Brent Knoll, Somerset, England), the unit is one of a family of options for use with Traquair's Heron products. Other modules are expected to include the HEGD6, a digital-camera interface that can be used with Heron DSP hardware and TIM-40 or PC/104 DSP systems from Traquair and Hunt. With four 8-bit RS-422 camera inputs, the board can be connected to one 24- or 32-bit camera, two 16-bit cameras, or four 8-bit cameras.
Texas Instruments is not the only supplier of VLIW processors. The TriMedia TM-1300 from Philips (Eindhoven, The Netherlands) combines a 166-MHz CPU with on-chip input/output (I/O) ports and coprocessing units. Achieving up to 6.5 BOPS, the TM-1300 has an instruction set that includes multimedia and DSP operations to accelerate the SIMD computations common in multimedia applications. These operations combine multiple simple operations into a single VLIW instruction that can implement up to 12 traditional microprocessor operations in a single clock cycle.
Special multimedia operations are invoked with function-call syntax similar to C/C++ and are automatically scheduled to take advantage of the TriMedia's VLIW implementation. As with other operations generated by the compiler, the scheduler takes care of register allocation, operation packing, and flow analysis. The processor's five-issue-slot instruction length enables up to five simultaneous operations to be scheduled into a single VLIW instruction. These operations can simultaneously target any five of the CPU's 27 pipelined functional units in one clock cycle.
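The benefit of a SIMD-style multimedia operation can be illustrated by averaging four 8-bit pixels that are packed into one 32-bit word. The sketch below is a generic software emulation, not the TriMedia instruction set or its compiler interface. The names `pack4`, `unpack4`, and `packed_avg_u8`, and the pixel values, are assumptions made for the example.

```python
LANE_MASK = 0xFEFEFEFE  # clears the low bit of each 8-bit lane


def pack4(pixels):
    """Pack four 8-bit values into a 32-bit word, first pixel in the top byte."""
    word = 0
    for p in pixels:
        word = (word << 8) | (p & 0xFF)
    return word


def unpack4(word):
    """Split a 32-bit word back into four 8-bit values."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def packed_avg_u8(a, b):
    """Average four unsigned 8-bit lanes at once, rounding down.

    Per lane, (a + b) // 2 == (a & b) + ((a ^ b) >> 1). Masking before the
    shift stops a bit from leaking out of one lane into its neighbor.
    """
    return (a & b) + (((a ^ b) & LANE_MASK) >> 1)


# Hypothetical pixels from two image rows to be blended.
row_a = [16, 32, 255, 0]
row_b = [48, 64, 1, 255]
blended = unpack4(packed_avg_u8(pack4(row_a), pack4(row_b)))
print(blended)  # [32, 48, 128, 127]
```

One word-wide operation replaces four separate pixel averages. A processor that provides such an operation as a single instruction, and can issue several per cycle, multiplies that gain again.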
For real-time imaging, vision, and DSP applications, Alacron (Nashua, NH) uses four TriMedia devices in a PCI-based image processor called FastImage (see Fig. 3). With 8-16 Mbytes of 400-Mbyte/s SDRAM available per processor, the board can run at more than 2 GFLOPS. Image acquisition is achieved from one of four selectable composite NTSC/PAL video streams and up to four digital or three analog video streams generated by a line- or area-scan camera. Separate digital video ports provide independent high-speed data input/output. A crossbar switch provides I/O from external data sources and interprocessor communications, and all on-board and host PCI functions can be addressed by each processor via a PCI bridge. The board can also output continuous composite or S-Video NTSC/PAL video along with SVGA video.
Other available VLIW processors are also targeting image-processing applications. For example, the MAP1000 from Equator Technologies (Campbell, CA) is designed to replace hard-wired multimedia and conventional microprocessors by integrating image-processing functions into a VLIW processor. In operation, the device can concurrently process audio, video, communications, and three-dimensional graphics. Central to MAP1000 operation is a data-transfer engine and a 200-MHz VLIW processor that can perform 3.2 billion 16-bit MACs, 1.6 billion 32-bit floating-point operations, and 20 billion pixel-level operations per second. In general-purpose computation, the CPU issues 800 MIPS with on-chip instruction and data caches.
Supported by the company's iMMediaC compiler, the MAP1000, like the TI C6201, can be programmed in C code. For developers wishing to evaluate the device, the company has developed the Maui, a PCI evaluation and development board that contains the MAP1000 processor and support for SVGA and S-Video input and output. A library of image-computing functions is also available from the Image Computing Systems Laboratory of the University of Washington (Seattle, WA).
At present, the MAP1000 is under evaluation by Tektronix Inc. (Beaverton, OR) to increase the performance of its Picture Quality Analysis System (PQAS). By incorporating the Equator Technologies MAP1000 processor into a multiprocessor architecture, Tektronix is seeking to increase the performance of the PQAS to enable real-time, in-service video applications.
The PQAS uses current-generation processors to help MPEG equipment manufacturers and TV-system operators optimize their products. However, these processors require programming in assembly language. By using the Equator MAP1000 processor programmed in C language, Tektronix is working to develop, share, and reuse common software modules for the system.
Combining DSP and RISC
Until now, separate DSPs and CPUs have been necessary for a number of image-processing designs. But now, with decreasing IC linewidths, semiconductor vendors are realizing that the benefits of both processors can be combined into one. In Austin, TX, at its Computing Enhancement Group (CEG), Intel is involved in a joint development with Analog Devices Inc. (Norwood, MA) to define the next-generation DSPs. The new processors are expected to include StrongARM RISC CPU cores from Advanced RISC Machines (Cambridge, England) with traditional multiplier-accumulator-based DSP technology from ADI.
For developers not prepared to wait, combined RISC/DSP-based ICs are becoming available from Hyperstone Electronics (Cupertino, CA). These devices are slated to target the embedded systems market where portable, low-power, high-performance systems need to be deployed at low cost.
By combining a RISC processor with a DSP instruction set and on-chip microcontroller functions, the Hyperstone E1-32 is typical of the latest combined RISC/DSP processors. By operating in parallel to the ALU and load/store unit, the DSP can execute a dedicated DSP instruction set. During DSP latency instruction cycles, the ALU and load/store unit can execute other instructions that allow a peak performance of up to 300 MOPS at 100 MHz. For developers interested in using the device, Hyperstone also offers a PCI development board based around the device (see Fig. 4).
In a joint agreement, Bergdata (Bonn, Germany) and Hyperstone have developed an embedded biometric identification system using the E1-32 processor. According to Franz Veit, Bergdata president and chief executive officer, the intelligent scanner system can scan, analyze, and identify a fingerprint without the need of a PC; it resides on a card that is the size of a matchbox.
As combined RISC/DSP processors become available, they will reduce the cost of smart cameras and portable imaging applications such as biometric analyzers. But to keep ahead of the speed and capability of low-cost mass-market microprocessors such as the Intel Pentium, developers of image-processing boards must incorporate new ideas, techniques, or devices. To meet these demands, companies such as Infinite Technologies (Richardson, TX), Lexra (Waltham, MA), and Sandcraft (Santa Clara, CA) are offering custom-tailorable cores that developers can combine into their designs. At present, the Sandcraft SR1-GX core can perform 1.6 GFLOPS at a 400-MHz clock rate.
Unfortunately, board-level vendors that use such devices must generally invest heavily in new processor design. Whereas large vendors such as Matrox have already shown their willingness to invest in this type of technology, others have opted for off-the-shelf processors and field-programmable gate arrays to build their image-processing boards.
In essence, the use of cores such as those from Infinite Technologies, Lexra, and Sandcraft appears to be limited to large companies with VLSI design experience and that build high-volume products. Because of limited resources, board vendors are opting to adopt off-the-shelf devices and couple them with newly or previously developed pixel processors to increase the image-processing performance of their products.
FIGURE 1. The PCI-based Matrox Genesis vision processor uses the C80 multiprocessor and optional mezzanine frame-grabber modules and display controllers to capture and display monochrome or color images. Each processing node on the Genesis also contains a neighborhood operations accelerator ASIC to accelerate convolutions, gray-scale and binary morphology, and normalized gray-scale correlation. An image-processing library optimizes the processing power and functionality of the hardware.
FIGURE 2. Coreco uses the Texas Instruments C6201 processor in its Cobra/C6 PCI-based image processor. Sustaining total transfers of 720 Mbytes/s among any five ports, the board's image gateway performs transfers independently among different sections of the Cobra/C6 board. The board also supports a pixel processor that operates on pixel data at rates to 200 Mbytes/s.
FIGURE 3. Four Philips TriMedia devices are installed on the Alacron FastImage PCI-based image processor. Designed for real-time imaging, vision, and DSP applications, the board can capture any one of four selectable composite NTSC/PAL video streams and up to four digital or three analog video streams generated by a line- or area-scan camera.
FIGURE 4. Combining a RISC processor with a DSP instruction set and on-chip microcontroller functions, the Hyperstone PC development board takes advantage of the company's E1-32 combined RISC/DSP processor.
For information on additional suppliers of DSP imaging boards, see the 1999 Vision Systems Design Buyers Guide (Vision Systems Design, Feb. 1999, p. 90).
Nashua, NH 03060
Fax: (603) 891-2745
Analog Devices Inc.
Norwood, MA 02062
Cambridge CB1 9JN, England
Fax: (44) 1223-400410
53173 Bonn, Germany
+49 (0) 228/3680-110
Fax: +49 (0) 228/3680-118
St. Laurent, Quebec, Canada H4T 1V8
Fax: (514) 333-1388
Campbell, CA 95008
Fax: (408) 371-9106
Brent Knoll, Somerset, TA9 4BP, England
(+44) 1278 760188
Fax: (+44) 1278 760199
E-mail: sales@hunteng.demon.co.uk
Cupertino, CA 95014
Fax: (408) 257-0713
E-mail: info@hyperstone.com
Image Computing Systems Laboratory
University of Washington
Seattle, WA 98195
Fax: (206) 543-0977
Richardson, TX 75081
Fax: (972) 437-7810
Santa Clara, CA 95052
Waltham, MA 02453
Fax: (781) 899-5769
Matrox Electronic Systems Ltd.
Dorval, Quebec, Canada H9P 2T4
Fax: (514) 822-6273
Philips Professional Imaging
5600 MD Eindhoven, The Netherlands
Santa Clara, CA 95054
Fax: (408) 490-3111
Beaverton, OR 97077
Fax: (503) 627-7995
Fax: (972) 995-4360
Traquair Data Systems
Ithaca, NY 14850
Fax: (607) 266-8221