[dsp TIPS&TRICKS] Mark Borgerding

Turning Overlap-Save into a Multiband Mixing, Downsampling Filter Bank

n this article, we show how to mixing, filtering, and decimation of mul- filter coefficients. The actual value extend the popular overlap-save tiple arbitrary channels much faster than depends on the relative strengths of the (OS) fast convolution filtering tech- direct time-domain implementation. platform in question (CPU pipelining, nique to create a flexible and compu- zero-overhead looping, memory tationally efficient bank of filters, SOMETHING OLD addressing modes, etc.). On a desktop Iwith frequency translation and decima- AND SOMETHING NEW processor with a highly optimized FFT tion implemented in the frequency The necessary conditions for vanilla- library, the value may be as low as 16. domain. In addition, we supply some tips flavored fast convolution are covered On a fixed-point DSP with a single-cycle for choosing appropriate fast Fourier pretty well in the literature. However, the multiply-accumulate instruction, the transform (FFT) size. choice of FFT size is not. Filtering multi- efficiency value can be over 50. Fast convolution is a well-known and ple channels from the same forward FFT Fast convolution refers to the block- powerful filtering technique. All but the requires special conditions not detailed wise use of circular convolution to shortest (FIR) in textbooks. To downsample and shift accomplish linear convolution. Fast con- filters can be implemented more effi- those channels in the frequency domain volution can be accomplished by OA or ciently in the frequency domain than requires still more conditions. OS methods. OS is also known as “over- when performed directly in the time The first section is meant to be a lap-scrap” [5]. In OA filtering, each signal domain. The longer the filter impulse quick reminder of the basics before we data block contains only as many sam- response, the greater the speed advan- extend the OS fast convolution tech- ples as allows circular convolution to be tage of fast convolution. nique. If you feel comfortable with these equivalent to linear convolution. The sig- When more than one output is fil- concepts, skip ahead. On the other hand, nal data block is zero-padded prior to the tered from a single input, some parts of if this review does not jog your memory, FFT to prevent the filter impulse the fast convolution algorithm are redun- check your favorite DSP book for “Fast response from “wrapping around” the dant. Removing this redundancy increas- Convolution,” “Overlap-Add” (OA), end of the sequence. OA filtering adds es fast convolution’s speed even more. “Overlap-Save,” or “Overlap-Scrap” the input-on transient from one block Sample rate change by decimation [1]–[5]. with the input-off transient from the pre- (downsampling) and frequency transla- vious block. tion (mixing) techniques can also be REVIEW OF FAST CONVOLUTION In OS filtering, shown in Figure 1, no incorporated efficiently in the frequency The convolution theorem tells us that zero-padding is performed on the input domain. These concepts can be combined multiplication in the frequency domain data, thus the circular convolution is not to create a flexible and efficient bank of is equivalent to convolution in the time equivalent to linear convolution. The filters. Such a filter bank can implement domain [1]. Circular convolution is portions that “wrap around” are useless achieved by multiplying two discrete and discarded. To compensate for this, Fourier transforms (DFTs) to effect con- the last part of the previous input block volution of the time sequences that the is used as the beginning of the next “DSP Tips and Tricks” introduces prac- transforms represent. By using the FFT block. OS requires no addition of tran- tical tips and tricks of design and to implement the DFT, the computa- sients, making it faster than OA. The OS implementation of tional complexity of circular convolu- filtering method is recommended as the algorithms so that you may be able tion is approximately O(Nlog 2N ) basis for the techniques outlined in the to incorporate them into your instead of O(N 2 ), as in direct linear remainder of this article. The nomencla- designs. We welcome readers who convolution. Although very small FIR ture “FFTN” in Figure 1 indicates that an enjoy reading this column to submit filters are most efficiently implemented FFT’s input sequence is zero-padded to a their contributions. Contact Associate with direct convolution, fast convolu- length of N samples, if necessary, before Editors Rick Lyons ([email protected]) tion is the clear winner as the FIR filters performing the N-point FFT. or Amy Bell ([email protected]). get longer. Conventional wisdom places For clarity, the following list defines the efficiency crossover point at 25–30 the symbols used in this article.

IEEE SIGNAL PROCESSING MAGAZINE [158] MARCH 2006 1053-5888/06/$20.00©2006IEEE SYMBOL CONVENTIONS x(n) Input data sequence Overlap Input: Repeat h(n), Length = P h(n) FIR filter impulse P−1 of N Samples response x(n) y(n)=x(n)∗h(n) Convolution of x(n) FFTN and h(n) L Number of new input FFTN samples consumed per data block P Length of h(n) Discard P−1 of N Samples N = L + P − 1 FFT size −1 y(n) V = N/(P − 1) Overlap factor, the FFTN ratio of FFT length to P−1 Samples filter transient length ( ) = ( ) ∗ ( ) D Decimation factor [FIG1] Overlap-save (OS) filtering, y n x n h n .

CHOICE OF FFT SIZE: COMPLEXITY IS ■ FFT accuracy—the numerical of the additional coefficients without RELATED TO FILTER LENGTH AND accuracy of fast convolution filtering increasing the computational workload. OVERLAP FACTOR is dependent on the error introduced Processing a single block of input data by the FFT-to-IFFT (inverse FFT) FREQUENCY DOMAIN that produces L outputs incurs a compu- round trip. For floating-point imple- DOWNSAMPLING: IS tational cost related to Nlog(N) = mentations, this may be negligible, YOUR FRIEND (L + P − 1) log(L + P − 1). The compu- but fixed-point processing can lose The most intuitive method for reducing tational cost per sample is related to significant dynamic range in these sample rate in the frequency domain is (L + P − 1) log(L + P − 1)/L . The filter transforms. to simply perform a smaller IFFT using length P is generally a fixed parameter, ■ Latency—Fast convolution filtering only those frequency bins of interest. so choosing L such that this equation is process increases delay by at least L This is not a 100% replacement for time minimized gives the theoretically opti- samples. The longer the FFT, the domain downsampling. This simple mal FFT length. It may be worth men- longer the latency. method will cause ripples (Gibbs phe- tioning that the log base is the radix of While there is no substitute for nomenon) at FFT buffer boundaries the FFT. The only difference between dif- benchmarking on the target platform, in caused by the multiplication in the fre- ferent-based logarithms is a scaling fac- the absence of benchmarks, choosing a quency domain by a rectangular window. tor. This is irrelevant to “big O” power-of-two FFT length about four The sampling theorem tells us that scalability, which generally excludes con- times the length of the FIR filter is a sampling a continuous signal aliases stant scaling factors. good rule of thumb. energy at frequencies higher than the Larger FFT sizes are more costly to (half the signal sample rate) perform, but they also produce more FILTERING MULTIPLE CHANNELS: back into the baseband spectrum (below usable (nonwraparound) output samples. REUSE THE FORWARD FFT the Nyquist rate). This is equally true for In theory, one is penalized more for It is often desirable to apply multiple fil- decimation of a digital sequence as it is choosing too small an overlap factor, V, ters against the same input sample for analog-to-digital conversion [6]. than too large. In practice, the price of sequence. In these cases, the advantages Aliasing is a natural part of downsam- large FFTs paid in computation and/or of fast convolution filtering become even pling. To accurately implement down- numerical accuracy may suggest a differ- greater. The computationally expensive sampling in the frequency domain, it is ent conclusion. operations are the forward FFT, the IFFT, necessary to preserve this behavior. It Some other factors to consider while and the multiplication of the frequency should be noted that, with a suitable deciding the overlap factor for fast con- responses. The forward FFT needs to be antialiasing filter, the energy outside the volution filtering include: computed just once. This is roughly a selected bins might be negligible. This ■ FFT speed—It is common for actual “Buy one filter. Get one 40% off” sale. discussion is concerned with the steps FFT computational cost to differ To realize this computational cost necessary for equivalence. The designer greatly from theory, e.g., due to mem- savings for two or more filters, all filters should decide how much aliasing, if any, ory caching. (“In theory there is no must have the same impulse response is necessary. difference between theory and prac- length. This condition can always be Decimation (i.e., downsampling) can tice. In practice there is.” — Yogi achieved by zero padding the shorter fil- be performed exactly in the frequency Berra, American philosopher and ters. Alternately, the engineer may domain by coherently adding the fre- occasional baseball manager.) redesign the shorter filter(s) to make use quency components to be aliased [7].

IEEE SIGNAL PROCESSING MAGAZINE [159] MARCH 2006 [dsp TIPS&TRICKS] continued

tiple of the decimation rate D.

j2πnf /f P − 1 = K1D Channel 1 e 1 s h1(n) D1 2) The FFT length must be a multiple of the decimation rate D. x(n) y (n) * D 1 L + P − 1 = K2D where j2πnf /f h (n) ■ Channel k e k s k Dk D is the decimation rate or the least common multiple of the deci- y (n) mation rates for multiple channels * D k ■ K1 and K2 are integers. Note if the overlap factor V is an inte- [FIG2] Conceptual model of a filter bank. ger, then the first condition implies the second. It is worth noting that others have also explored the concepts of rate − Overlap Input: Repeat P 1 of N Samples conversion in OA/OS [8]. Also note that x(n) decimation by large primes can lead to FFTN To Additional FFT inefficiency. It may be wise to deci- Channels mate by such factors in the time domain. Channel 1 Mix, MIXING AND OS FILTERING: h1(n), Rotate DFT FFTN Length = P Sequence ROTATE THE FREQUENCY DOMAIN FOR COARSE MIXING Mixing, or frequency shifting, is the mul- Nf1 tiplication of an input signal by a com- Nrot1 = Round V Decimate, Vfs Wrap/Sum Sequence plex sinusoid [1]. It is equivalent to Into Desired N/D1 Bins convolving the frequency spectrum of an − Discard (P 1)/D1 of N/D1 Samples input signal with the spectrum of a sinu- − FFT 1 soid. In other words, the frequency spec- N/D 1 y1(n) trum is shifted by the mixing frequency. − (P 1)/D1 Samples It is possible to implement time- domain mixing in the frequency domain by simply rotating the DFT sequence, [FIG3] OS filter bank. but there are limitations: 1) The precision with which one can The following octave/Matlab code Fx(769:1024); mix a signal by rotating a DFT demonstrates how to swap the order of x_freq_dom_dec = sequence is limited by the resolution an inverse DFT and decimation. ifft(Fx_alias)/4; of the DFT. % Make up a completely The sequences x_time_dom_dec and 2) The mixing precision is limited fur- random frequency spectrum x_freq_dom_dec are equal to each other. ther by the fact that we don’t use a Fx = randn(1,1024) + The above sample code assumes a com- complete buffer of output in fast con- i*randn(1,1024); plex time-domain sequence for generali- volution. We use only L samples. We ty. The division by four in the last step must restrict the mixing to the subset % Time-domain decimation — accounts for the difference in scaling fac- of frequencies whose periods complete inverse transform then tors between the IFFT sizes. As various in those L samples. Otherwise, phase decimate FFT libraries handle scaling differently, discontinuities occur. That is, one can x_full_rate = ifft(Fx); the designer should keep this in mind only shift in multiples of V bins. x_time_dom_dec = during implementation. It’s worth not- The number of “bins” to rotate is x_full_rate(1:4:1024); ing that this discussion assumes the FFT %Retain every fourth sample length is a multiple of the decimation Nfr Nrot = round • V,(1) rate D. That is, N/D must be an integer. Vfs % Frequency-domain To implement time-domain decima- decimation, alias first, tion in the frequency domain as part of where fr is the desired mixing frequency then inverse transform fast convolution filtering, the following and fs is the sampling frequency. The Fx_alias = Fx(1:256) + conditions must be met. second limitation may be overcome by Fx(257:512) + Fx(513:768) + 1) The FIR filter order must be a mul- using a bank of filters corresponding to

IEEE SIGNAL PROCESSING MAGAZINE [160] MARCH 2006 different phases. However, this increase use of simple concepts whenever possi- AUTHOR in design/code complexity probably does ble. (“Things should be described as sim- Mark Borgerding is a principal engineer not outweigh the meager cost of multi- ply as possible, but no simpler.”—A. at 3dB Labs, Inc., a small company spe- plying by a complex phasor. Einstein.) We owe it to ourselves as engi- cializing in DSP consulting and contract If coarse-grained mixing is unaccept- neers to realize those simple concepts as engineering services. He is often found able, mixing in the time domain is a bet- efficiently as possible. lurking on the comp.dsp newsgroup or ter solution. The general solution to The familiar and simple concepts tinkering with his KISSFFT library. allow multiple channels with multiple shown in Figure 2 may be used for the mixing frequencies is to postpone the design of mixed, filtered, and decimated REFERENCES mixing operation until the filtered, deci- channels. The design may be implement- [1] A. Oppenheimer and R. Schafer, Discrete-Time mated data is back in the time domain. ed more efficiently using the equivalent Signal Processing. Upper Saddle River, NJ: Prentice- Hall, 1989. If mixing is performed in the time structure shown in Figure 3. [2] L. Rabiner and B. Gold, Theory and Application domain: of Digital Signal Processing. Englewood Cliffs, NJ: ■ All filters must be specified in SUMMARY Prentice-Hall, 1975. terms of the input frequency (i.e., In this article, we outlined considera- [3] R. Lyons, Understanding Digital Signal Processing, 2/E. Upper Saddle River, NJ: Prentice- nonshifted) spectrum. tions for implementing multiple OS Hall, 2004. ■ The complex sinusoid used for mix- channels with decimation and mixing in [4] S. Orfanidis, Introduction to Signal Processing. ing the output signal must be created the frequency domain, as well as supply- Englewood Cliffs, NJ: Prentice-Hall, 1995. at the output rate. ing recommendations for choosing FFT [5] M. Frerking, Digital Signal Processing in size. We also provided implementation Communication Systems. New York: Chapman & Hall, 1994. PUTTING IT ALL TOGETHER guidance to streamline this powerful [6] R. Crochiere and L. Rabiner, Multirate Digital By making efficient implementations of multichannel filtering, down-conversion, Signal Processing. Englewood Cliffs, NJ: Prentice- conceptually simple tools, we help our- and decimation process. Hall, 1983. selves to create simple designs that are as [7] M. Boucheret, I. Mortensen, and H. Favaro, “Fast efficient as they are easy to describe. ACKNOWLEDGMENTS convolution filter banks for satellite payloads with on-board processing,” IEEE J. Select. Areas. Humans are affected greatly by the sim- I would like to thank my wife, Elaine, for Commun., vol. 17, no. 2, pp. 238–248, Feb. 1999. plicity of the concepts and tools used in helping me find the time to write this, [8] S. Muramatsu and H. Kiya, “Extended overlap- designing and describing a system. We and David Evans, for being a DSP mentor add and -save methods for multirate signal process- ing,” IEEE Trans. Signal Processing, vol. 45, no. 9, owe it to ourselves as humans to make and sounding board. pp. 2376–2380, Sep. 1997. [SP]

[dsp HISTORY] continued from page 157

REFERENCES [1] P. Elias, “Predictive coding I,” IRE Trans. Inform. [6] B.S. Atal and M.R. Schroeder, “Adaptive predic- [10] T.E. Tremain, “The government standard linear Theory, vol. IT-1 no. 1, pp. 16–24, Mar. 1955. tive coding of speech,” Bell Syst. Tech. J., vol. 49 no. predictive coding algorithm: LPC10,” Speech 8, pp. 1973–1986, Oct. 1970. Technol., vol. 1, pp. 40–49, Apr. 1982. [2] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Cambridge, [7] B.S. Atal and S.L. Hanauer, “Speech analysis and [11] B.S. Atal and J.R. Remde, “A new model of MA: MIT Press, 1949. synthesis by linear prediction of the speech wave,” J. LPC excitation for producing natural-sounding Acoust. Soc. Amer., vol. 50, pp. 637–655, Aug. 1971. speech at low bit rates,” in Proc. ICASSP’82, May [3] C.E. Shannon, “A mathematical theory of com- 1982, pp. 614–617. munication,” Bell Syst. Tech. J., vol. 27, pp. 379–423, [8] S. Saito, Fukumura, and F. Itakura, “Theoretical 623–656, 1948. consideration of the statistical optimum recognition [12] B.S. Atal and M.R. Schroeder, “Stochastic coding of the spectral density of speech”, J. Acoust. Soc. of speech signals at very low bit rates,” in Proc. Int. [4] P. Elias, “Predictive coding II,” IRE Trans. Inform. Japan, Jan. 1967. Conf. Commun., ICC’84, May 1984, pp. 1610–1613. Theory, vol. IT-1 no. 1, pp. 24–33, Mar. 1955. [9] F. Itakura and S. Saito, “A statistical method for [13] M.R. Schroeder and B.S. Atal, “Code-excited lin- [5] B.S. Atal and M.R. Schroeder, “Predictive coding estimation of speech spectral density and formant ear prediction (CELP): High-quality speech at very of speech,” in Proc. 1967 Conf. Communications and frequencies,” Electron. Commun. Japan, vol. 53-A, low bit rates,” in Proc ICASSP’85, Mar. 1985, pp. Proc., Nov. 1967, pp. 360–361. pp. 36–43, 1970. 937–940. [SP]

IEEE SIGNAL PROCESSING MAGAZINE [161] MARCH 2006