# Back to Basics: Wavelet Processing on SHARC DSPs

Wavelet processing in real time offers significant performance benefits for the radar, sonar, and image-processing communities. Now that digital-signal-processing (DSP) technology is reaching the sophistication necessary to take analytical research into embedded solutions, many system designers will need to evaluate and select DSP products for their tasks.

**By Ray Hardison**

Wavelet processing in real time offers significant performance benefits for the radar, sonar, and image-processing communities. Now that digital-signal-processing (DSP) technology is reaching the sophistication necessary to take analytical research into embedded solutions, many system designers will need to evaluate and select DSP products for their tasks.

#### Conventional spectral analysis

Before discussing wavelet analysis, a brief review of conventional spectral analysis is needed. Conventional spectral analysis is normally thought of as Fourier analysis, which represents a signal as a series of sinusoids of different frequencies. For each sinusoid, a complex value, magnitude and phase, quantifies the sinusoid's presence in the signal for a sample period. Fourier transformations use addition and multiplication operations in the time domain to perform a convolution of the signal over the sample period with discrete sinusoids. The method that is most often used in real-time embedded systems is the fast Fourier transform (FFT). The FFT uses cross-product multiplication and accumulation based on binary pairs to achieve a fast execution time for a set of sinusoids. Each of these sinusoids has its frequency determined by the size of the FFT and the sample period to which it is applied.

The FFT performs best when it is used on signals that are time invariant and well behaved. Any dramatic change or transient in the signal induces effects that can dominate the calculated results. Consequently, most FFT operations follow a windowing operation that applies a modulator to the sample. The modulator smoothly takes the start and end points of the sample to zero, thereby preventing abrupt transients from occurring. These windows induce a loss to the signal that needs to be corrected, after the calculations, if accurate magnitudes are to be achieved. The correction factor is based on the concept of the signal maintaining a constant level during the sample duration. Signals with rapidly variant levels are, at any instance of time, erratically attenuated by the windowing operation and improperly corrected. A solution for this problem is to overlap successive FFTs and average their results.

You need to increase the size of the FFT to obtain high resolution. The larger the FFT, the longer the sample period represented by it. The more detail you see, the longer the period that is represented by each sinusoidal quantity. Any nonperiodic changes that occur during the sample period get averaged out. An example would be a tone that is rapidly shifting up and down in frequency. The FFT would show a presence of that tone for all the sinusoids it is shifted through. The fact that it was shifting is visually apparent in the results, but the information pertaining to the shift rate is lost. You can not look at the spectrum produced and determine if the line was shifting up, down, or back and forth during the sample period. You should slow down the shifting enough so that successive FFTs can each represent some of the shift. Then, you can see the effect in the spectrums if you display them in a stacked format, such as a waterfall display.

When used with high resolution, overlapping and averaging result in a significant amount of instantaneous information being lost. For this reason, FFTs are best used in applications where the information is changing slowly or periodically. Examples include radiated noise measurements, speech processing, and mechanical structure analysis. When the application needs to evaluate transient, nonperiodic phenomena, wavelet analysis becomes useful.

#### Wavelet analysis

Just as a Fourier transform represents a signal as a set of sinusoids, a wavelet transform represents a signal as a set of wavelets. Wavelets are brief oscillating waveforms that are only a few cycles in duration. Wavelet sets are derived from a primary wavelet, the basis of the set, and contain members that have similarity with the basis but are dilated in definition. A dilated wavelet is a longer duration version of the basis and therefore can represent slower transitions in the signal.

There are many established wavelet sets, the most recognized being those discovered by Ingrid Daubechies in the late 1980s. Wavelet sets are developed to optimize the transform to the application. This, unlike Fourier transforms, means a significant amount of wavelet analysis must precede an application to develop the proper wavelet sets to use. There is a large volume of published information available on the types of wavelets to use for specific applications, plus several commercial software tools that support wavelet analysis modeling.

Wavelet transforms are conceptually the same as Fourier transforms in that a sample period of the signal is convolved against each member of the set of wavelets, with the results being the magnitude and location in time of that wavelet's occurrence. This can result in a sample period being quantified with a small number of values, and is the reason wavelets are very useful in data compression. Just as a Fourier transform needs a fast algorithm to be useful in real-time embedded applications, the wavelet transform also uses an algorithmic trick to reduce the computation time (see Fig. 1).

FIGURE 1. Pyramid algorithm |

null

The pyramid algorithm gets its name from its graphical depiction. It operates on a sample of the signal by applying the wavelets as paired linear filters (high pass and low pass) and decimating the results prior to the next pair's evaluation. This reduces the number of samples that need to be worked with as each pair of the wavelet set is used. When graphically depicted, a large segment hierarchically trees down to smaller sub-segments, symmetrically deriving a pyramid shape. This operation reduces the processor cycles required to a product of the number of samples in the sample period and a constant that is a function of the wavelet being used. This constant depends on the number of terms a wavelet requires in its linear filter and can, as in the case of the Daubechies wavelets, consist of just a few terms. For small sets, wavelet transformations can use less processor cycles than an equivalent FFT for the sample period. ^{1 }

Both Fourier and wavelet transforms can be executed backwards to obtain the inverse transform. Taking a signal through a FFT, and then an inverse FFT, results in the signal losing most of the non-periodic effects because the FFT can only represent them with sinusoids. Likewise, an inverse wavelet transform can decrease, or lose altogether, periodic signal components. These two transformation methods sit at the extreme ends of a broad range of transforms. In between can be found wavelet packets and cosine packets. These two transformations reach towards each other with wavelet packets containing a larger number of cycles and, therefore, extracting more periodic information from the signal. Cosine packets, being short duration sinusoids, obtain time-location information for non-periodic signal components. These two waveforms are sometimes referred to as time-frequency waveforms and can be useful in processing signals that have a mix of periodic and non-periodic components, such as digitized music.

A wavelet transformation has the disadvantage of requiring a significant amount of preapplication analysis to determine the appropriate wavelets. However, the results can be a smaller set of values to represent the signal and a potentially faster transformation time than an equivalent FFT. Wavelet packets can be used when the application has some periodic signal components that need to be quantified, but the computation time will increase if more linear filter terms are needed to implement the packet.

#### Wavelets and SHARC

In evaluating DSPs for real-time wavelet processing, the SHARC (ADSP-2106x) provides useful resources. As shown in Fig. 2, the SHARC contains features that can be important to a wavelet transformation algorithm, such as

- Two independent dual-ported banks of SRAM with up to 0.5 Mbytes capacity on-chip
- Dual address and data buses for on-chip and off-chip memory
- On-chip I/O processor with DMA
- Six external high-speed link ports (transfer rates to 40 Mbytes/s each)
- Two external high-speed serial ports (transfer rates to 40 Mbits/s each)
- 32-bit floating-point, with 40-bit extended floating-point capability
- Parallel floating-point computations capable of a peak performance up to 120 MFLOPS.

FIGURE 2. Internal configuration of SHARC |

null

The dual-banked internal memory is dual-ported for independent core and DMA accesses. The dual address and data buses are configured to support either concurrent access of each internal memory bank, at zero wait-state, or internal and external memories. This memory bus configuration, coupled with parallel ALU, multiplier, and shifter units can perform floating-point operations at a rate that is three times the clock rate of the SHARC. This architecture has been optimized to compute multiply and accumulate (MAC) functions that are the primary computational loops for FFTs, digital filters, and wavelet transforms. Applications can be configured to place computational components, such as filter coefficients and state vectors, into both banks to support an interleave effect for data fetching, thereby decreasing data fetch cycles. The large memory size, 0.5 Mbyte on an ADSP-21060, supports large tables of coefficients, large sample vectors, groups of state vectors, and program code space.

In addition to having a capable internal architecture, the SHARC supports multiple external high-speed interconnections. This permits multi-processing algorithms with streaming I/O data flow and shared memory models. Wavelet transforms containing a large number of sets can be divided among processors using combinations of both.

The SHARC performs its floating-point computations in a 40-bit format that can be automatically converted to a 32-bit format or maintained as a 40-bit format by setting a register bit. For certain transformations the additional precision may be beneficial enough to warrant the additional memory required. A 40-bit float-point number occupies a 48-bit memory location, and can be moved in a single cycle over the PM data bus.

The SHARC can be programmed in either C language or Assembly language. There are many sources available for C routines that perform wavelet transforms. Practical experience has shown that for tight loop computations, hand crafting a function in Assembly language that is C callable can yield a significant increase in performance. For example, the Ixthos IXLibs-21k contains over 200 hand-optimized vector math functions. These functions are not only optimized for tight loop efficiency, but also take advantage of the dual-banked memory architecture to maximize memory fetch efficiency.

Timing analysis with a SHARC and C routines performing wavelet transformations have shown real-time wavelet processing is as practical as FFT processing. Optimized wavelet transformation functions, coupled with efficient wavelet sets can actually outperform conventional methods.

#### Wavelets in a real-time application

Using wavelets in a real-time application for a four-SHARC, VMEbus-based board provides a number of system-level resources to the four SHARCs that enhance their ability to perform the application (see Fig. 3). In addition to the link port interconnections of the four SHARCs (A1, A2, B1, and B2), the board also provides

- Off-board link port connections that can be used to connect SHARCs across boards
- IXI interface mezzanines for external I/O support
- Global Resource Switch (GRS) for managing both SHARC and VME access of three IXR memory modules
- Board Controller to provide VMEbus management and Host processor communications to SHARCs
- Personality Module and Multi-Processor Resources (MPR) to provide cross-board semaphore and signal control constructs.

FIGURE 3. Block diagram of a four-SHARC board |

null

The example image-processing application enhances the edges of an image while cleaning up any noise effects or "snow." A wavelet transformation of the input signal followed by an inverse wavelet transformation does the image processing. The wavelet transform represents the edges and discontinuities of the image and eliminates any periodic effects such as noise. The wavelet set used has been developed and optimized to the SHARC such that one SHARC can perform the transformation on one scan line of data faster than it can be received.

Figure 4 shows the process flow for this application. The application uses a 12-bit video rate analog-to-digital converter IXI mezzanine card in addition to a 12-bit video rate digital-to-analog converter IXI mezzanine card. SHARC A1 acquires input data from the analog-to-digital IXI card and preconditions the data to a floating-point representation. Once a complete scan line has been received and processed, A1 sends the data on a link port to SHARC A2. This send is merely a data move into an output queue followed by a call to a C function that handles the transfer via DMA. Once the send function returns, A1 starts working on the next scan line.

FIGURE 4. Process flow for simple DWT filter |

null

Meanwhile, SHARC A2 is waiting on data for its input queue. Once the data sent by A1 arrive, A2 starts performing wavelet transforms on the received scan line. When A2 completes the transforms, it sends the transform results to SHARC B2 across a link port in the same way that A1 did. SHARC A2 at that point looks at its input queue to see if it has another scan line of data; if not, A2 waits on the queue.

SHARC B2 is also waiting on its input queue, and once data are available, B2 performs an inverse transform and sends the resulting scan line of data on to SHARC B1 via a link port. SHARC B1 converts the data and sends it out to the digital-to-analog IXI card.

This design uses a "bucket-brigade" approach to moving the data from input to output. Delays depend on the processing times of A2 and B2 and, at maximum, can take no longer than the amount of time to acquire three scan lines. Because each SHARC is working on one scan line of data, up to 4 scan lines can reside in the processing loop at any time. If A2 and B2 can perform their combined task in a total time that is less than one scan line in duration, the delay from input to output will be two times the duration of a scan line.

If either A2 or B2 can not complete their combined task in two scan lines of duration, then the process can not keep up with the input, and data will eventually be loss. Assuming that work can not be balanced among the four SHARCs to permit the data flow to keep pace with I/O, additional SHARCs would be needed to work in parallel on the transforms. This implies that the wavelet transform is using a large set of wavelets or the scan duration is short with a large volume of information.

To expand the design to deal with higher demand requirements, you must remove the serialized data flow and introduce some parallelism in addition to more SHARCs. The Ixthos IXZ16 16-SHARC board provides IXI mezzanines, IXR memory modules, and special IXQ cluster memory modules. A SHARC can still be used with an IXI module to acquire the data, but in this example, it will put the data into an IXR memory area and set an MPR state flag. Once the flag is set, a group of SHARCs that have been waiting for the flag attack the data in the IXR. Each executes a branch of the pyramid algorithm and sends its results to another group of SHARCs via link ports. The second group inverse-transforms the data, and sums their results into a section of IXR memory. This area of IXR space is controlled by an MPR semaphore, which is reset by the SHARC handling the output each time it removes the results. Figure 5 shows the application implemented with 14 SHARCs on the IXZ16. Figure 6 shows how the processors map to the pyramid algorithm.

FIGURE 5. Parallel implementation of DWT filter |

null

The application in Fig. 5 uses a one-dimensional approach and works on one scan line at a time. This serves the purpose since horizontal transitions appear in the scan line data, but vertical transitions exist between scan lines. However, if the results are not satisfactory, the image could be worked in a two-dimensional approach, processing the total frame at once as an array. Different SHARCs could be allocated for rows and columns. Control mechanisms would still be the same as previously described with the exception being the way the data would be moved into and out of the IXR. The two SHARCs dedicated to I/O would need to work the data as multi-buffer frames with one frame undergoing the processing while the other two are filled and emptied.

FIGURE 6. Processor partitioning of pyramid algorithm |

null

Wavelet processing is expected to become as established with DSP designers as FFTs are now. Knowing when to use one, or the other, or something in between, calls for looking at the application, doing some up-front analysis, and using a library of optimized routines.

**RAY HARDISON** is manager, software engineering, Ixthos Inc., Leesburg, VA

#### Reference

- Bruce et al,,
*Wavelet Analysis, , IEEE Spectrum,*IEEE Press, 26 (Oct. 1996).

#### Further Reading

- Morgan,
*Multiresolution Signal Analysis and Wavelet Decomposition,*Embedded System Programming, Miller Freeman, 30 (July 1996). - Morgan,
*Why Wavelets?,*Embedded System Programming, Miller Freeman, 131 (October 1997). - Smith,
*Handbook of Real-time Fast Fourier Transforms,*IEEE Press, 9 (1995) - Wickerhauser,
*Adapted Wavelet Analysis from Theory to Software,*IEEE Press, 213 (1994). - Analog Devices, ADSP-2106x SHARC User's Manual, First Edition 1995, Analog Devices Inc., Chapters 1 & 2.