MUS421 Lecture 8C FFT : The Filter-Bank Summation (FBS) Method for Fourier Analysis, Modification, and Resynthesis

Julius O. Smith III ([email protected]) Center for Computer Research in Music and Acoustics (CCRMA) Department of Music, Stanford University Stanford, California 94305

June 27, 2020

Outline

• Two Views of the Short Time Fourier Transform
  – OverLap-Add (OLA)
  – Filter-Bank Summation (FBS)
• DFT Filter Bank
• STFT Filter Bank
• Downsampled STFT Filter Bank
• STFT Modifications (LTI and Time Varying)

Two Views of the Short Time Fourier Transform (STFT)

The Short-Time Fourier Transform (STFT) is defined by (Allen and Rabiner ’77):

$$X_m(\omega_k) = \sum_{n=-\infty}^{\infty} [w(n - mR)\,x(n)]\,e^{-j\omega_k n}$$

where R is the window “hop size”.

There are two Fourier dual interpretations of this formula:

1. The OverLap-Add (OLA) interpretation sees the STFT as a time-ordered sequence of Fourier transforms, one transform every R samples in time
2. The Filter-Bank Summation (FBS) interpretation sees the same formula as a downsampled filter bank, where k is the channel (bandpass-filter) index, and R is the downsampling factor applied to the filter-bank output

We are familiar with the OLA point of view, which is ideal for many STFT applications such as LTI filtering. We will now study the FBS point of view, regarding the STFT as a downsampled filter bank.

Overlap-Add Interpretation

1. Apply a window to x(n), selecting data near time mR
2. Fourier Transform each time-windowed signal

The STFT is viewed as a time sequence of spectra, one per frame, with the frames overlapping in time:

[Figure: Overlap-Add view of the STFT. Spectral magnitudes |Xm(ωk)| versus frequency ωk, one spectrum per frame at times 0, R, 2R, ...]

Filter-Bank Interpretation of the STFT

We can group the terms in the STFT definition differently to obtain the filter-bank interpretation:

$$X_m(\omega_k) = \sum_{n=-\infty}^{\infty} \underbrace{[x(n)e^{-j\omega_k n}]}_{x_k(n)}\, w(n - mR) = [x_k * \mathrm{Flip}(w)](mR)$$

The filter-bank interpretation operates as follows:

1. Shift (rotate along the unit circle) the spectrum of x, taking frequency ωk to 0 (i.e., modulate x by $e^{-j\omega_k n}$ in the time domain)
2. Lowpass-filter about frequency 0 using the window w
3. Downsample by factor R

The STFT is interpreted as a frequency-ordered set of narrow-band time-domain signals:

[Figure: Filter Bank Summation view of the STFT. Channel magnitudes |Xm(ωk)| versus time tm = mRT, one narrow-band signal per channel index k.]

Filter Banks

Let’s build a filter bank in which each frequency band gets translated (“heterodyned”) down to 0 Hz (dc) and lowpassed by a running sum “dc-pass” filter:

The Running-Sum Lowpass Filter

Impulse Response:

$$h(n) \triangleq \begin{cases} 1, & n = 0, 1, 2, \ldots, N-1 \\ 0, & \text{otherwise} \end{cases}$$

Implementation:

[Diagram: x(n) → h → y(n)]

$$y(n) = (h * x)(n) \triangleq \sum_{m=-\infty}^{\infty} h(m)\,x(n-m) = \sum_{m=0}^{N-1} x(n-m) = x(n) + x(n-1) + \cdots + x(n-N+1)$$

(“Unnormalized moving average” FIR filter)

Transfer Function:

$$H(z) = 1 + z^{-1} + \cdots + z^{-N+1} = \frac{1 - z^{-N}}{1 - z^{-1}}$$
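As a small illustration of the two forms of H(z), the following Python sketch (standard library only) computes the running sum both directly, as N additions per output, and recursively, using the difference equation y(n) = y(n−1) + x(n) − x(n−N) implied by the closed form. The function names and the test signal are made up for this example.

```python
def running_sum_direct(x, N):
    """y(n) = x(n) + x(n-1) + ... + x(n-N+1), taking x(n) = 0 for n < 0."""
    return [sum(x[max(0, n - N + 1): n + 1]) for n in range(len(x))]


def running_sum_recursive(x, N):
    """Same filter via H(z) = (1 - z^-N)/(1 - z^-1): y(n) = y(n-1) + x(n) - x(n-N)."""
    y = []
    prev = 0.0
    for n, xn in enumerate(x):
        delayed = x[n - N] if n >= N else 0.0  # x(n-N), zero before the signal starts
        prev = prev + xn - delayed
        y.append(prev)
    return y


x = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 4.0]  # arbitrary test signal
N = 5
y_direct = running_sum_direct(x, N)
y_recursive = running_sum_recursive(x, N)
```

The recursive form needs only two additions per sample regardless of N, at the cost of accumulating round-off in floating point over long signals.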

Running-Sum Frequency Response

$$H(e^{j\omega T}) = \frac{1 - e^{-j\omega NT}}{1 - e^{-j\omega T}} = \frac{e^{-j\omega NT/2}}{e^{-j\omega T/2}}\,\frac{\sin(\omega NT/2)}{\sin(\omega T/2)} \triangleq N e^{-j\omega(N-1)T/2}\,\mathrm{asinc}_N(\omega T)$$

• $e^{-j\omega(N-1)T/2}$ = linear phase term ↔ delay of (N − 1)/2 samples (half of FIR order)
• DC Gain = N
• We could use a moving average instead of a running sum (h ← h/N) to obtain unity dc gain.

Example running-sum frequency response for N = 5:

[Figure: Running Sum Amplitude Response, N = 5. Magnitude versus normalized frequency ωT.]

• Gain at dc is N = 5
• Nulls occur at ωT = ±2π/5 and ±4π/5. These are the sinusoidal frequencies having respectively one and two periods under the 5-sample “rectangular window”. (Three periods would need at least 2 · 3 = 6 samples, so 6π/5 doesn’t “fit”.)
• Since the passband about dc is not flat, it is better to call this a “dc-pass filter” rather than a “lowpass filter.” We could also call it a limited-resolution “dc sampling filter.”
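The dc gain, the null locations, and the closed-form frequency response can be checked numerically. The sketch below (Python, standard library only) evaluates the running-sum response by direct summation and compares it with N e^{−jω(N−1)T/2} asinc_N(ωT), taking asinc_N(ωT) = sin(NωT/2) / [N sin(ωT/2)] as implied by the formula above. The helper names are illustrative, and only |ωT| ≤ π is evaluated.

```python
import cmath
import math


def H_direct(wT, N):
    """Running-sum frequency response by direct summation of exp(-j*wT*n)."""
    return sum(cmath.exp(-1j * wT * n) for n in range(N))


def asinc(wT, N):
    """Aliased sinc sin(N*wT/2) / (N*sin(wT/2)); equals 1 in the limit wT -> 0."""
    s = math.sin(wT / 2)
    if abs(s) < 1e-12:  # only wT = 0 occurs for |wT| <= pi
        return 1.0
    return math.sin(N * wT / 2) / (N * s)


def H_closed(wT, N):
    """Closed form N * exp(-j*wT*(N-1)/2) * asinc_N(wT)."""
    return N * cmath.exp(-1j * wT * (N - 1) / 2) * asinc(wT, N)


N = 5
dc_gain = abs(H_direct(0.0, N))
null_mags = [abs(H_direct(2 * math.pi * k / N, N)) for k in (1, 2, -1, -2)]
grid = [-math.pi + 2 * math.pi * i / 200 for i in range(201)]
max_err = max(abs(H_direct(w, N) - H_closed(w, N)) for w in grid)
```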

Modulation by a Complex Sinusoid

[Diagram: x(n) is multiplied by e^{−jωcn} to produce xc(n)]

Given a signal expressed as a sum of sinusoids,

$$x(n) = \sum_{k=1}^{N_x} a_k e^{j\omega_k nT}, \quad a_k \in \mathbb{C},$$

then

$$x_c(n) \triangleq x(n)\,e^{-j\omega_c nT} = \sum_{k=1}^{N_x} a_k e^{j(\omega_k - \omega_c)nT}.$$

• Frequency ωk is down-shifted to ωk − ωc
• In particular, frequency ωc (the “center frequency”) is down-shifted to dc

Making a Bandpass Filter from a Lowpass Filter

[Diagram: x(n) is multiplied by e^{−jωcn} to give xc(n), which is filtered by the LPF h to give yc(n), which is multiplied by e^{jωcn} to give y(n)]

• The “running sum” LowPass Filter (LPF) has a “passband” width of zero. Therefore, the corresponding BandPass Filter (BPF) attempts to sample the spectrum at one frequency ωc.
• “Transition Band” from ωT = 0 to ωT ≈ ±2π/N
• “Stop Band” includes |ωT| > 2π/N
• Recall N = 5 example (running-sum LPF):

[Figure: Running Sum Amplitude Response, N = 5. Magnitude versus normalized frequency ωT.]
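The demodulate, lowpass, remodulate structure can be sketched in a few lines of Python (standard library only). The example feeds the bandpass filter one complex sinusoid exactly at the center frequency and one at the first null of the running-sum response; the function name, the choice ωc = 2π/N, and the signal length are assumptions made for illustration, with T = 1.

```python
import cmath
import math


def bandpass_running_sum(x, wc, N):
    """Demodulate by exp(-j*wc*n), running-sum lowpass, remodulate by exp(+j*wc*n)."""
    xc = [x[n] * cmath.exp(-1j * wc * n) for n in range(len(x))]
    yc = [sum(xc[max(0, n - N + 1): n + 1]) for n in range(len(x))]
    return [yc[n] * cmath.exp(1j * wc * n) for n in range(len(x))]


N = 5
wc = 2 * math.pi / N   # center frequency (radians per sample)
L = 40                 # signal length, long enough to pass the start-up transient

in_band = [cmath.exp(1j * wc * n) for n in range(L)]
at_null = [cmath.exp(1j * (wc + 2 * math.pi / N) * n) for n in range(L)]

gain_in = abs(bandpass_running_sum(in_band, wc, N)[-1])     # expect N
gain_null = abs(bandpass_running_sum(at_null, wc, N)[-1])   # expect ~0
```

A sinusoid at ωc comes out with gain N, while one offset by 2π/N lands on a null of the running-sum response and is rejected once the filter memory is full.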

Uniform Running-Sum Filter Banks

For a length N running-sum filter, make N bandpass filters tuned to center frequencies

$$\omega_k T \triangleq k\,\frac{2\pi}{N}, \quad k = 0, 1, 2, \ldots, N-1$$

Example filter-bank channel frequency responses for N = 5:

[Figure: Modulated-Running-Sum Filter Bank, N = 5. Magnitude versus normalized frequency ωT for each channel.]

• Also shown is the frequency-response sum (later shown to be constant and equal to N)
• Analogous to time-domain sampling (its dual)
• Channel filters can be moved anywhere in [−π, π) via resampling
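The claim that the channel frequency responses sum to the constant N can be checked numerically ahead of the later derivation. This Python sketch (standard library only) sums the complex responses H(e^{j(ω−ωk)T}) over k on a frequency grid; the helper names and grid size are arbitrary choices for the example.

```python
import cmath
import math


def H(wT, N):
    """Running-sum frequency response at normalized frequency wT."""
    return sum(cmath.exp(-1j * wT * n) for n in range(N))


def channel_sum(wT, N):
    """Sum of the N channel responses, channel k being H shifted to center 2*pi*k/N."""
    return sum(H(wT - 2 * math.pi * k / N, N) for k in range(N))


N = 5
grid = [-math.pi + 2 * math.pi * i / 64 for i in range(65)]
sums = [channel_sum(w, N) for w in grid]
max_dev = max(abs(s - N) for s in sums)   # deviation from the constant N
```

The sum is constant because only the n = 0 tap survives when the N modulating exponentials are added, which is the frequency-domain counterpart of sampling in time.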

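System Diagram of the Running-Sum Filter Bank

[Diagram: the input x(n) feeds N parallel channels. In channel k the input is multiplied by e^{−jωknT} (by 1 for k = 0) to give xk(n), which is filtered by h to give yk(n), for k = 0, 1, ..., N − 1.]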

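One Filter Bank Channel

[Diagram: x(n) is multiplied by e^{−jωkn} to give xk(n), which is filtered by h to give yk(n)]

The kth channel computes:

$$\begin{aligned}
y_k(n) &= (h * x_k)(n) = \sum_{m=0}^{N-1} h(m)\,x_k(n-m) \\
&= (x_k * h)(n) = \sum_{m=n-(N-1)}^{n} x_k(m)\,h(n-m) \\
&= \sum_{m=n-(N-1)}^{n} x(m)\,e^{-j\omega_k mT}\,\mathrm{Shift}_{n,m}(\mathrm{Flip}(h)) \\
&= \sum_{m=n-(N-1)}^{n} x(m)\,e^{-j\omega_k mT}
\end{aligned}$$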

DFT Filter Bank

Definition: The Length N Discrete Fourier Transform (DFT) is defined as

$$X(k) \triangleq \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi nk/N}$$

Observation:

$$X(k) = y_k(N-1)$$

That is, if we feed the signal x(0 : N − 1) to our bandpass filterbank, then the output at time n = N − 1 is the DFT of x(0 : N − 1).

• More generally, for all n, we made a DFT filter bank
• The Sliding DFT is what you get when you advance successive DFTs by one sample:

$$X_n(k) \triangleq \sum_{m=0}^{N-1} x(n+m)\,e^{-j2\pi mk/N}$$

  When n = LN − 1 for any integer L, the Sliding DFT is the same as the DFT Filter Bank. At other times, they differ by a linear phase term. (Exercise: find the linear phase term.)
• The Sliding DFT redefines the time origin each sample (each modulation term within the DFT starts at time 0), while the DFT Filter Bank does not redefine the time origin (modulation terms are “free running”). Since “DFT time” repeats every N samples, the two treatments coincide every N samples (i.e., $e^{j\omega_k(n+LN)T} = e^{j\omega_k nT}$ for every integer L).
• When N is a power of 2, the DFT can be implemented using the Fast Fourier Transform (FFT) using only O(N log2(N)) operations per transform. Uniform FIR filter banks are typically implemented in practice using the FFT.
• Note that the channel bandwidths are narrow compared with half the sampling rate (especially for large N), so that the filter bank output signals yk(n) are oversampled, in general.
• We will later look at downsampling the channel signals yk(n) to obtain a “hopping FFT” filter bank.
• “Sliding” and “hopping” FFTs are instances of the discrete-time Short Time Fourier Transform (STFT).
• The STFT normally also uses a non-rectangular window.

Inverse DFT and the DFT Filter Bank Sum

Definition: The Length N Inverse Discrete Fourier Transform (IDFT) can be shown [1] to be

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\,e^{j2\pi nk/N}, \quad n = 0, 1, 2, \ldots, N-1$$

Since X(k) = yk(N − 1), we have

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} y_k(N-1)\,e^{j2\pi nk/N}, \quad n = 0, 1, 2, \ldots, N-1.$$

More generally, we will later show that the DFT Filter Bank can be inverted (for all n) by remodulating and summing (and possibly scaling) the N filter bank channels, i.e.,

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} y_k(n)\,e^{j2\pi nk/N}, \quad n = 0, 1, 2, \ldots$$

[1] http://ccrma.stanford.edu/~jos/mdft/
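The observation X(k) = yk(N − 1), and the recovery of x(0 : N − 1) from those channel outputs by the IDFT, can be verified with a short Python sketch (standard library only). The function names and the test signal are invented for the example; the channel output is computed straight from the running-sum formula for yk(n) with T = 1.

```python
import cmath
import math


def dft(x):
    """Length-N DFT by direct summation."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]


def filter_bank_output(x, N, n):
    """y_k(n) = sum_{m=n-(N-1)}^{n} x(m) exp(-j*w_k*m), k = 0..N-1, with x(m)=0 for m<0."""
    out = []
    for k in range(N):
        wk = 2 * math.pi * k / N
        out.append(sum(x[m] * cmath.exp(-1j * wk * m)
                       for m in range(max(0, n - N + 1), n + 1)))
    return out


x = [0.3, -1.2, 2.0, 0.7, -0.5, 1.1, 0.0, 0.9]  # arbitrary test signal
N = len(x)

X = dft(x)
y = filter_bank_output(x, N, N - 1)              # channel outputs at time N-1
max_err = max(abs(a - b) for a, b in zip(X, y))

# Inverse DFT applied to the channel outputs at time N-1 recovers x(0:N-1)
x_rec = [sum(y[k] * cmath.exp(2j * math.pi * n * k / N) for k in range(N)) / N
         for n in range(N)]
rec_err = max(abs(x_rec[n] - x[n]) for n in range(N))
```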

Computational Example

% PARAMETERS
N = 10;      % number of filters = DFT length
fs = 1000;   % sampling frequency (arbitrary)
D = 1;       % duration in seconds

% COMPUTATION
L = ceil(fs*D)+1;          % signal duration (samples)
n = 0:L-1;                 % discrete-time axis (samples)
t = n/fs;                  % discrete-time axis (sec)
x = chirp(t,0,D,fs/2);     % sine sweep from 0 Hz to fs/2 Hz
%x = echirp(t,0,D,fs/2);   % "analytic" chirp sweep
x = x(1:L);                % trim trailing zeros at end
h = ones(1,N);             % Simple DFT lowpass = rectangular window
X = zeros(N,L);            % X will be the filter bank output
for k=1:N
  wk = 2*pi*(k-1)/N;
  xk = exp(-j*wk*n).* x;   % Modulation by complex exponentials
  X(k,:) = filter(h,1,xk);
end

% RESULTS
clf; figure(1);
subplot(N+1,1,1);
plot(x,'-k');
ylabel('x(n)');
axis tight;
title(sprintf(['%d-channel DFT filterbank output ', ...
               'for a linear chirp input'],N));
for k=1:N
  subplot(N+1,1,k+1);
  plot(real(X(k,:)),'-k');
  axis tight;
  ylabel(sprintf('Re{X_%d(n)}',k-1));
end
xlabel('Time (samples)');

10-Channel DFT Filter Bank Chirp Response: Output Real Part versus Time in each Channel

[Figure: chirp input x(n) and the real part Re{Xk(n)} of each channel output, k = 0, ..., 9, versus time (samples).]

Output Modulus versus Time in Each Channel

[Figure: 10-channel DFT filterbank output for a linear chirp input. Input x(n) and channel moduli |Xk(n)|, k = 0, ..., 9, versus time (samples).]

Complex Chirp Response: Output Modulus versus Time in each Channel

[Figure: 10-channel DFT filterbank output for a linear chirp input. Input modulus |x(n)| and channel moduli |Xk(n)|, k = 0, ..., 9, versus time (samples).]

STFT Filterbank

The STFT

$$X_m(\omega_k) = \sum_{n=-\infty}^{\infty} [x(n)\,e^{-j\omega_k n}]\,w(n - mR)$$

may be computed by the following operations:

• Demodulate x(n) to get $x_k(n) \triangleq e^{-j\omega_k n}x(n)$
  – multiplication by $e^{-j\omega_k n}$ shifts ωk down to dc
• Next, convolve with $\tilde{w} \triangleq \mathrm{Flip}(w)$
• Downsample by the (integer) factor R

Thus, we can write

$$X_m(\omega_k) = \sum_{n=-\infty}^{\infty} [x(n)\,e^{-j\omega_k n}]\,\tilde{w}(mR - n) = (x_k * \tilde{w})(mR), \qquad \tilde{w} = \mathrm{Flip}(w)$$
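The equivalence of the two groupings can be illustrated numerically. In this Python sketch (standard library only), one function evaluates the STFT definition directly for a given frame m and bin k, and the other follows the filter-bank recipe: demodulate, convolve with Flip(w), and keep the sample at time mR. The window is stored on indices 0..M−1; the signal, window, and parameter values are arbitrary choices for illustration.

```python
import cmath
import math


def stft_by_definition(x, w, R, N, m, k):
    """X_m(w_k) = sum_n w(n - mR) x(n) exp(-j*w_k*n), with w supported on 0..M-1."""
    wk = 2 * math.pi * k / N
    total = 0j
    for n in range(len(x)):
        i = n - m * R                      # window index n - mR
        if 0 <= i < len(w):
            total += w[i] * x[n] * cmath.exp(-1j * wk * n)
    return total


def stft_by_filter_bank(x, w, R, N, m, k):
    """Demodulate x, convolve with Flip(w), and keep only the output sample at t = mR."""
    wk = 2 * math.pi * k / N
    xk = [x[n] * cmath.exp(-1j * wk * n) for n in range(len(x))]   # demodulate

    def channel(t):
        # (xk * Flip(w))(t) = sum_i w(i) xk(t + i), because Flip(w)(t - n) = w(n - t)
        return sum(w[i] * xk[t + i] for i in range(len(w)) if t + i < len(xk))

    return channel(m * R)                  # downsample by R


M, R, N = 8, 4, 8
w = [0.54 - 0.46 * math.cos(2 * math.pi * i / (M - 1)) for i in range(M)]
x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(40)]

max_diff = max(abs(stft_by_definition(x, w, R, N, m, k) -
                   stft_by_filter_bank(x, w, R, N, m, k))
               for m in range(8) for k in range(N))
```

In a real filter bank the channel signal would be computed for every t and then decimated; evaluating it only at t = mR, as here, is what makes the downsampled form cheap.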

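One Channel of STFT Filterbank (R = 1)

Time domain:

[Diagram: One Channel of the STFT Filter Bank. x(n) is multiplied by e^{−jωkn} to give xk(n), which is filtered by w̃ to give Xn(ωk).]

Frequency domain:

[Diagram: One Channel of the STFT Filter Bank. X(ω) is shifted by −ωk to give X(ω + ωk), which is multiplied by W to give W(ω)X(ω + ωk).]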

STFT Filter Bank

The STFT filter-bank channels are arranged in parallel:

[Diagram: the input x(n) feeds N parallel channels. In channel k, x(n) is multiplied by e^{−jωkn}, filtered by Flip(w) to give Xn(ωk), and downsampled by R to give Xm(ωk), for k = 0, 1, ..., N − 1.]

This has been called a phase vocoder filter bank (Portnoff 1976).

Hamming Window Filter Bank Chirp Response, Output Modulus versus Time

[Figure: 8-channel Hamming-windowed DFT filterbank output for a linear chirp input. Input x(n) and channel outputs Xk(n), k = 0, ..., 7, versus time (samples).]

Replacing rectangular with Hamming window gives

• Improved channel isolation
• Doubled channel bandwidth

Hamming Window, Complex Chirp Response, Output Modulus versus Time

[Figure: 10-channel Hamming-windowed DFT filterbank output for a linear chirp input. Input modulus |x(n)| and channel moduli |Xk(n)|, k = 0, ..., 9, versus time (samples).]

FBS Window Constraints for R = 1

Recall that in OLA, perfect reconstruction required only that the window meet a constant overlap-add (COLA) constraint:

$$\sum_{m=-\infty}^{\infty} w(n - mR) = c$$

where c ≠ 0 is any constant (always true for R = 1).

The filterbank summation (FBS) is interpreted as a demodulation (frequency shift by −ωk) and subsequent lowpass filtering by w. Therefore, to resynthesize our original signal, we need to remodulate each baseband signal and sum up the channels. For R = 1 (no downsampling), the FB sum is given by

$$\begin{aligned}
\hat{x}(m) &= \sum_{k=0}^{N-1} X_m(\omega_k)\,e^{j\omega_k m} \\
&= \sum_{k=0}^{N-1}\left[\sum_{n=-\infty}^{\infty} x(n)\,w(n-m)\,e^{-j\omega_k n}\right]e^{j\omega_k m} \\
&= \sum_{n=-\infty}^{\infty} x(n)\,w(n-m)\sum_{k=0}^{N-1} e^{j\omega_k(m-n)} \\
&= \sum_{n=-\infty}^{\infty} x(n)\,w(\underbrace{n-m}_{-rN})\,N\sum_{r=-\infty}^{\infty}\delta(m-n-rN) \\
&= N\sum_{r=-\infty}^{\infty} x(m-rN)\,\tilde{w}(rN) \\
&= N\,w(0)\,x(m) \quad \text{if } w(rN) = 0,\ \forall r \neq 0
\end{aligned}$$

Thus, perfect reconstruction is assured if the window is zero at all nonzero multiples of N. Since normally our windows are shorter than N, this always holds for R = 1.
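The result of this derivation can be checked numerically. The Python sketch below (standard library only) computes the R = 1 filter-bank sum for a window shorter than N, where it should return N w(0) x(m), and for a window of length N + 1 with a nonzero sample at lag N, where the derivation predicts an extra term N w̃(N) x(m − N). The window is stored through w̃ = Flip(w) on lags 0, 1, ..., and all names and numeric values are invented for the example.

```python
import cmath
import math


def fbs_sum(x, w_flip, N, m):
    """x_hat(m) = sum_k X_m(w_k) exp(j*w_k*m) for R = 1.

    w_flip[l] holds Flip(w)(l) = w(-l) for l = 0..len(w_flip)-1; x(n) = 0 for n < 0.
    """
    total = 0j
    for k in range(N):
        wk = 2 * math.pi * k / N
        Xmk = sum(w_flip[l] * x[m - l] * cmath.exp(-1j * wk * (m - l))
                  for l in range(len(w_flip)) if m - l >= 0)
        total += Xmk * cmath.exp(1j * wk * m)
    return total


N = 8
x = [math.sin(0.4 * n) + 0.1 * n for n in range(24)]   # arbitrary test signal

# Window shorter than N: perfect reconstruction up to the scale factor N*w(0)
w_short = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1]
err_short = max(abs(fbs_sum(x, w_short, N, m) - N * w_short[0] * x[m])
                for m in range(len(x)))

# Window of length N+1 with a nonzero sample at lag N: not Nyquist(N)
w_long = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.02, 0.3]


def predicted_long(m):
    """N * [w~(0) x(m) + w~(N) x(m - N)], the two surviving terms of the sum over r."""
    delayed = x[m - N] if m >= N else 0.0
    return N * (w_long[0] * x[m] + w_long[N] * delayed)


err_long = max(abs(fbs_sum(x, w_long, N, m) - predicted_long(m))
               for m in range(len(x)))
```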

Nyquist(N) Windows

Above, we derived that the FBS reconstruction sum gives

$$\hat{x}(n) = N\sum_{r=-\infty}^{\infty} \tilde{w}(rN)\,x(n - rN)$$

where w̃ = Flip(w).

• From this we see that if the window length M is less than the DFT size N (M < N), only the r = 0 term is nonzero and reconstruction is perfect. If the window is longer than the DFT size (M > N), we can still get perfect reconstruction if

  w(rN) = 0, |r| = 1, 2, ...   [w is Nyquist(N)]

  When this holds, we say the window is Nyquist(N)
• The Nyquist(N) property is the Fourier dual of the weak COLA constraint
• Portnoff windows (see below) make use of this result; they are longer than the DFT size, and therefore the DFT must be preceded by a manual time-aliasing step

Portnoff Windows

Portnoff (1976) observed that any window w of the form

$$w(n) = \hat{\omega}(n)\,\mathrm{sinc}(n/N)$$

being Nyquist(N) by construction, will obey the weak COLA constraint, where N is the number of spectral samples, $\hat{\omega}(n)$ is any window function whatsoever, and

$$\mathrm{sinc}(n) \triangleq \frac{\sin(\pi n)}{\pi n}$$

as usual (the unit-amplitude sinc function with zeros at all nonzero integers).
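A quick numerical check of the Nyquist(N) property: the Python sketch below (standard library only) builds a window of the Portnoff form using a raised-cosine taper as the arbitrary factor, then samples it at nonzero multiples of N. The taper, the window half-length, and the function names are assumptions chosen for the example, not values from the slides.

```python
import math


def sinc(t):
    """Unit-amplitude sinc, sin(pi*t)/(pi*t), with sinc(0) = 1."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)


def portnoff_window(N, half_len, taper):
    """w(n) = taper(n) * sinc(n/N) for n = -half_len..half_len, returned as {n: w(n)}."""
    return {n: taper(n) * sinc(n / N) for n in range(-half_len, half_len + 1)}


N = 32          # number of spectral samples (DFT size)
half_len = 96   # window extends three DFT lengths to either side of n = 0


def raised_cosine(n):
    """Illustrative taper; any window function could be used here."""
    return 0.5 + 0.5 * math.cos(math.pi * n / half_len)


w = portnoff_window(N, half_len, raised_cosine)
at_multiples_of_N = [w[r * N] for r in (-3, -2, -1, 1, 2, 3)]
center = w[0]
```

Because sinc(n/N) vanishes whenever n is a nonzero multiple of N, the product is Nyquist(N) no matter which taper is used.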

Comparison of Hamming and Portnoff Windows

[Figure: four panels. Top row: amplitude versus time (samples) for a conventional window [Hamming(31)] and a Portnoff window [Sinc: Nyquist(32)]. Bottom row: magnitude (dB) versus normalized frequency (cycles/sample) for the same two windows.]

Portnoff Window Application Notes

• In STFT systems, a window w longer than N samples must be time-aliased about length N after applying it to a length M > N segment of input data.
• Choosing M ≫ N allows multiple side lobes of the sinc function to alias in on the main lobe
• This gives channel filters in the frequency domain which are sharper bandpass filters. I.e., there is less channel cross-talk in the frequency domain.
• However, the time-aliasing corresponds to undersampling in the frequency domain, implying less robustness to spectral/temporal modifications
• Since the hop size needs to be less than N (except for the rectangular window), the overall filter bank based on a Portnoff window remains oversampled as a filter bank
• For further discussion, see Dolson 1986 (full citation in the text bibliography [2])

[2] https://ccrma.stanford.edu/~jos/sasp/Bibliography.html

COLA/Nyquist OLA/FBS Duality

Let Cola(N) denote constant overlap-add using hop size N. Then we have (by the Poisson summation formula):

  w ∈ Nyquist(N) ⇔ W ∈ Cola(2π/N)      (FBS)
  w ∈ Cola(R)    ⇔ W ∈ Nyquist(2π/R)   (OLA)

• For perfect-reconstruction (PR) STFT processing, we prefer both window conditions w ∈ Cola(R) and w ∈ Nyquist(N), in which case the window transform satisfies W ∈ Cola(2π/N) and W ∈ Nyquist(2π/R)
• Window lengths M shorter than the FFT length N trivially satisfy the Nyquist(N) property

Perfect Reconstruction (PR) STFTs

We can consider Cola(R) and Nyquist(N) windows and window-transforms to be a routine requirement for a fixed-resolution time-frequency sampling system.

Multiresolution time-frequency sampling systems may be created by

• Combining rectangular regions from fixed-resolution time-frequency samples
• Filter Banks (invertible hence PR)
• Constant-Q Filter Banks (not PR, hence only approximately invertible)

Specific Windows

• Recall that the rectangular window transform is Nyquist(2π/M), implying the rectangular window itself is Cola(M), which is obvious.
• The window transform for the Hamming family is Nyquist(4π/M), implying that Hamming windows are Cola(M/2), which we also knew.
• The rectangular window transform is also Nyquist(K2π/M) for any integer 1 ≤ K ≤ M/2, implying that all hop sizes given by R = M/K for K = 1, 2, 3, ..., M/2 are COLA.
• Because its side lobes are the same width as the sinc sidelobes, the Hamming window transform is also Nyquist(K2π/M), for any integer 2 ≤ K ≤ M/2, implying hop sizes R = M/K are good, for K = 2, 3, ..., M/2. Thus, the available hop sizes for the Hamming window family include all of those for the rectangular window except one (R = M).

Hamming Window Transform and Frame-Rate

[Figure: Hamming Window Transform, frame rate = 0.06. Magnitude (dB) versus normalized frequency (cycles/sample), with the folding frequency, the frame rate, and the frame-rate harmonics marked.]

• Hop size = half of window length: R = M/2
• Because the window transform has nulls at the frame rate 2π/R and all of its harmonics, this (“periodic”) Hamming window must be COLA in the time domain at hop size R = M/2.

Filter Bank Reconstruction

[Diagram: each downsampled channel signal $X_{\tilde{m}}(\omega_k)$, k = 0, ..., N − 1, is stretched (upsampled) by R, passed through an interpolation filter f, and remodulated by e^{jωkn}; the channels are then summed to give x(n). Stages from left to right: channel signals (STFT), stretch, interpolation filters, remodulators, filter-bank sum.]

• Since the channel signals are downsampled, we expect to need bandlimited interpolation (filters f above) in each channel prior to remodulation
• However, we know from the OLA development that interpolation is not required as long as the window w is Cola(R)
  – The downsampled channel signals provide a sequence of spectra that can be IFFTd and overlap-added to reconstruct the signal
  – Aliasing due to downsampling by R > 1 vanishes!

Remarks on Filter Bank Reconstruction

• From studying the overlap-add framework, we know that the inverse STFT is exact when the window w(n) is Cola(R), that is, when AliasR(w) is constant. In only these cases can the STFT be considered a perfect reconstruction filter bank.
• From the Poisson Summation Formula in the last lecture, we know that a condition equivalent to the COLA condition is that the window transform W(ω) have notches at the frame rate and all its harmonics, i.e., W(2πk/R) = 0 for k = 1, 2, 3, ..., R − 1.
• In the present context (filter-bank point of view), perfect reconstruction appears impossible for R > 1, because for ideal reconstruction after downsampling, the channel anti-aliasing filter (w) and interpolation filter (f) have to be ideal lowpass filters. This is a true conclusion in any single channel, but not for the filter bank as a whole. We know, for example, from the overlap-add interpretation of the STFT that perfect reconstruction occurs for hop-sizes greater than 1 as long as the COLA condition is met. This is an interesting paradox to which we will return shortly.
• What we would expect in the filter-bank context is that the reconstruction can be made arbitrarily accurate given better and better lowpass filters w and f which cut off at ωc = π/R (the folding frequency associated with down-sampling by R). This is the right way to think about the STFT when spectral modifications are involved.

A supplementary lecture [3] develops the general topic of perfect reconstruction filter banks, and derives various STFT processors as special cases. See especially the polyphase representation of filter banks.

[3] http://ccrma.stanford.edu/~jos/JFB/

Anti-Aliasing w. Downsampling Factor R

In OLA, the hop size R is governed by the COLA constraint

$$\sum_{m=-\infty}^{\infty} w(n + mR) = \text{constant}$$

In FBS, R is the downsampling factor in each of the filterbank channels, and thus the window w serves as the anti-aliasing filter:

[Diagram: x(n) is multiplied by e^{−jωkn} (demodulator) to give xk(n), filtered by w (anti-aliasing filter) to give Xn(ωk), and decimated by R to give the channel signal (STFT) $X_{\tilde{m}}(\omega_k)$.]

We see that to avoid aliasing, W(ω) must be bandlimited to (−π/R, π/R).

Properly Anti-Aliasing Window Transforms

[Figure: window transform W(ω) on [−π, π], with its main lobe confined to (−π/R, π/R).]

For simplicity, define window-transform bandlimits at first zero-crossings about the main lobe. Given the first zero of W(ω) at

$$K\,\frac{2\pi}{M} \le \frac{\pi}{R},$$

we obtain

$$R_{\max} = \frac{M}{2K}$$

K   Window Type (Length M)     Rmax      RCola
1   Rectangular                M/2       M
2   Generalized Hamming        M/4       M/2
3   Blackman Family            M/6       M/3
L   L-term Blackman-Harris     M/(2L)    M/L

• Any R ≤ Rmax suppresses aliasing well
• Note: COLA holds for above windows at R = 2Rmax ⇒ perfect reconstruction AND heavy aliasing in FFT-bin signals $X_{\tilde{m}}(\omega_k)$!

How can this work?

[Figure: window transform W(ω) on [−π, π] with marks at ±π/R and ±2π/R; the portions of the main lobe beyond ±π/R are labeled “Aliases”.]

Answer to the Paradox:

• All aliasing is canceled in the reconstruction
• See “Perfect Reconstruction Filter Banks”
• Aliasing-cancellation is disturbed by spectral modifications
• For robustness in the presence of spectral modifications, keep R ≤ Rmax = M/(2K)
• For compression, use R = 2Rmax = M/K and a “post-window” (i.e., a synthesis filter, as we’ll see)

Hop Sizes for WOLA

In the weighted overlap-add method, with the output window equal to the input window, we have the following modification of the recommended maximum hop-size table:

K   In and Out Window (Length M)   Rmax        RCola
1   Rectangular                    M/2         M
2   Generalized Hamming            M/6         M/3
3   Blackman Family                M/10        M/5
K   K-term Blackman-Harris         M/(4K−2)    M/(2K−1)

Notes:

• Rmax is equal to 2π divided by the main lobe width in “side lobes”, while
• RCola is 2π divided by the first notch frequency in the window transform (lowest available frame rate at which all frame-rate harmonics are notched).
• For windows in the Blackman-Harris families, and with main-lobe widths defined from zero-crossing to zero-crossing, RCola = 2Rmax.
• Below RCola = M/(2K − 1) for K-term BH windows, the next “good value” is obtained by moving the first frame-rate harmonic to the next notch in the window transform:

$$R_k = \frac{2\pi}{\dfrac{2\pi}{R_{\text{Cola}}} + k\,\dfrac{2\pi}{M}} = \frac{R_{\text{Cola}}\,M}{k\,R_{\text{Cola}} + M}, \quad k = 1, 2, \ldots$$

  Of course, this must be an integer for exact results.
• For a Chebyshev window, on the other hand, all “sufficiently small” hop sizes R ≤ Rmax are considered equally good, because the worst-case side-lobe level is the same for all such values of R. Since the actual frame-rate harmonics will sample the side-lobe peaks nonuniformly, some variation in performance with R can be observed (better than expected if many nulls are sampled).

Constant-Overlap-Add (COLA) Cases

• Weak COLA: Window transform has zeros at frame-rate + harmonics:

$$W(\omega_k) = 0, \quad k = 1, 2, \ldots, R-1, \quad \omega_k \triangleq \frac{2\pi k}{R}$$

  – Perfect OLA reconstruction
  – Relies on aliasing cancellation in frequency domain
  – Aliasing cancellation is disturbed by spectral modifications
• Strong COLA: Window transform is bandlimited consistent with downsampling by the frame rate:

$$W(\omega) = 0, \quad |\omega| \ge \pi/R$$

  – Perfect OLA reconstruction
  – No aliasing
  – better for spectral modifications
  – Time-domain window infinitely long in ideal case

Recall Hamming Overlap-Add Example

Modified-Hamming Overlap-Add Example

Matlab code:

M = 33;          % window length
w = hamming(M);  % window
R = (M-1)/2;     % maximum hop size
w(M) = 0;        % make 'periodic Hamming' (COLA)
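The same construction can be checked in Python (standard library only). The sketch assumes the symmetric Hamming formula 0.54 − 0.46 cos(2πn/(M−1)), zeroes the last sample as in the Matlab lines above, and overlap-adds a finite number of frames at hop R = (M−1)/2; the frame count and helper name are arbitrary.

```python
import math

M = 33
# Symmetric Hamming window, then zero the last sample to make it "periodic"
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]
w[M - 1] = 0.0
R = (M - 1) // 2   # hop size 16


def ola_sum(w, R, n, num_frames):
    """Sum over frames m = 0..num_frames-1 of w(n - m*R), with w supported on 0..M-1."""
    return sum(w[n - m * R] for m in range(num_frames) if 0 <= n - m * R < len(w))


num_frames = 8
# Steady-state region: every n here is covered by two consecutive frames
steady = [ola_sum(w, R, n, num_frames) for n in range(R, num_frames * R)]
ripple = max(steady) - min(steady)
```

In the steady-state region each sample is covered by two frames whose cosine terms are half a period apart and cancel, leaving the constant 2 × 0.54 = 1.08.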

[Figure: Overlapped Hamming Windows, M=33, R=16, N=256 (COLA periodic Hamming window). Normalized amplitude versus time (samples).]

Hamming Window Transform and Frame-Rate

Below is the zero-padded DFT of the modified Hamming window we’re using (w(M − 1) ← 0) with the frame-rate harmonics marked.

[Figure: Hamming Window Transform, frame rate = 0.06. Magnitude (dB) versus normalized frequency (cycles/sample), with the folding frequency, the frame rate, and the frame-rate harmonics marked.]

In this example (R = M/2), the upper half of the main lobe aliases into the lower half of the main lobe. (In fact, all energy above the folding frequency 0.5/R aliases into the lower half of the main lobe.) While this window and hop size still give perfect reconstruction under the STFT, spectral modifications will disturb the aliasing cancellation during reconstruction. This “undersampled” configuration is suitable as a basis for compression.

Kaiser Overlap-Add Example

Matlab code:

M = 33;                            % Window length
beta = 8;
w = kaiser(M,beta);
R = floor(1.7*(M-1)/(beta+1));     % ROUGH estimate

Kaiser OLA Waveforms

[Figure: Overlapped Kaiser Windows, M=33, R=6, N=256. Normalized amplitude versus time (samples).]

Kaiser OLA, Steady State

[Figure: Steady-State Kaiser Overlap-Add Error, M=33, R=6. Measured and predicted overlap-add amplitude versus time (samples).]

• The “predicted” OLA is computed as sum of R sinusoids weighted by W(ωk) according to the Poisson Summation Formula
• The Poisson Summation Formula gives exact results to within numerical precision
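The measured-versus-predicted comparison can be reproduced in outline with plain Python (standard library only). To avoid needing a Bessel function, the sketch substitutes a symmetric Hamming window for the Kaiser window of the slides, so the numbers differ from the figure; the point is only that the direct overlap-add and the sum of R sinusoids weighted by W(ωk)/R agree. All function names are made up for the example.

```python
import cmath
import math


def ola_direct(w, R, n):
    """Steady-state overlap-add sum over all frames m of w(n - m*R), w on 0..M-1."""
    return sum(w[i] for i in range(n % R, len(w), R))


def window_transform(w, omega):
    """W(omega) = sum_n w(n) exp(-j*omega*n)."""
    return sum(w[n] * cmath.exp(-1j * omega * n) for n in range(len(w)))


def ola_poisson(w, R, n):
    """Poisson summation: (1/R) * sum_{k=0}^{R-1} W(2*pi*k/R) exp(j*2*pi*k*n/R)."""
    total = 0j
    for k in range(R):
        wk = 2 * math.pi * k / R
        total += window_transform(w, wk) * cmath.exp(1j * wk * n)
    return total / R


M, R = 33, 6
# Stand-in window (assumption): symmetric Hamming instead of the slides' Kaiser
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]

measured = [ola_direct(w, R, n) for n in range(R)]      # one period of the OLA sum
predicted = [ola_poisson(w, R, n) for n in range(R)]
max_diff = max(abs(p - m) for p, m in zip(predicted, measured))
max_imag = max(abs(p.imag) for p in predicted)
ripple = max(measured) - min(measured)                   # deviation from exact COLA
```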

Kaiser Window Transform and Frame-Rate

[Figure: Kaiser Window Transform, frame rate = 0.17. Magnitude (dB) versus normalized frequency (cycles/sample), with the folding frequency, the frame rate, and the frame-rate harmonics marked.]

This example, which represents a reasonably high-quality audio STFT, will be robust in the presence of spectral modifications because the folding frequency lies above the main lobe of the Kaiser window transform. Remember that, for robustness in the presence of spectral modifications, the frame rate should be more than twice the highest main-lobe frequency.

FBS Fixed Modifications

Let’s set R = 1 (“sliding FFT”) and look at spectral modifications in the FBS case.

Consider applying a fixed (time-invariant) filter H(ωk) to each Xm(ωk) before resynthesizing the signal:

$$Y_m(\omega_k) = X_m(\omega_k)\,H(\omega_k)$$

where H(ωk) is the sampled frequency response of a filter with impulse response

$$h(n) = \frac{1}{N}\sum_{k=0}^{N-1} H(\omega_k)\,e^{j\omega_k n}, \quad n = 0, \ldots, N-1$$

Let’s examine the result this has on the signal in the time domain:

$$\begin{aligned}
y(m) &= \frac{1}{N}\sum_{k=0}^{N-1} Y_m(\omega_k)\,e^{j\omega_k m} \\
&= \frac{1}{N}\sum_{k=0}^{N-1} X_m(\omega_k)\,H(\omega_k)\,e^{j\omega_k m} \\
&= \frac{1}{N}\sum_{k=0}^{N-1}\left\{\sum_{n=-\infty}^{\infty} x(n)\,w(n-m)\,e^{-j\omega_k n}\right\}H(\omega_k)\,e^{j\omega_k m} \\
&= \sum_{n=-\infty}^{\infty} x(n)\,w(n-m)\,\frac{1}{N}\sum_{k=0}^{N-1} H(\omega_k)\,e^{j\omega_k(m-n)} \\
&= \sum_{n=-\infty}^{\infty} x(n)\,[w(n-m)\,h(m-n)] \\
&= \sum_{n=-\infty}^{\infty} x(n)\,[\tilde{w}(m-n)\,h(m-n)] \\
&= (x * [\tilde{w}\cdot h])(m)
\end{aligned}$$

We see that the result is x convolved with a windowed version of the impulse response h. This is in contrast to the OLA technique where the result gave us a windowed x filtered by h without the window having any effect on the filter, provided it obeys the COLA constraint and sufficient zero padding is used to avoid time aliasing. In other words, FBS gives

$$y = x * [\tilde{w}\cdot h] \;\leftrightarrow\; X\cdot[\tilde{W} * H]$$

while OLA gives

$$y = x * [W(0)\cdot h] \;\leftrightarrow\; X\cdot[W(0)\cdot H]$$

• In FBS, the analysis window w smoothes the filter frequency response by time-limiting the corresponding impulse response.
• In OLA, the analysis window can only affect scaling.

For these reasons, FFT implementations of FIR filters normally use the Overlap-Add method.
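The FBS result y = x ∗ [w̃ · h] can be confirmed numerically. In this Python sketch (standard library only), the window is stored as w̃ = Flip(w) on lags 0..M−1 with M ≤ N, each sliding-DFT frame is multiplied by H(ωk), and the remodulated channel sum is compared with a direct convolution of x with the windowed impulse response. The signal, window, filter taps, and names are arbitrary illustrative choices.

```python
import cmath
import math

N = 8
w_flip = [1.0, 0.9, 0.7, 0.4, 0.2]                 # Flip(w) on lags 0..M-1, M = 5 <= N
h = [0.5, 0.3, -0.2, 0.1, 0.05, 0.0, 0.0, 0.0]     # length-N impulse response
# Sampled frequency response H(w_k) = DFT of h
H = [sum(h[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
     for k in range(N)]


def X(x, m, k):
    """X_m(w_k) = sum_n x(n) w(n - m) exp(-j*w_k*n) for R = 1, written with l = m - n."""
    wk = 2 * math.pi * k / N
    return sum(w_flip[l] * x[m - l] * cmath.exp(-1j * wk * (m - l))
               for l in range(len(w_flip)) if m - l >= 0)


def y_fbs(x, m):
    """(1/N) * sum_k X_m(w_k) H(w_k) exp(j*w_k*m): filter each channel, remodulate, sum."""
    return sum(X(x, m, k) * H[k] * cmath.exp(2j * math.pi * k * m / N)
               for k in range(N)) / N


def y_conv(x, m):
    """Direct convolution of x with the windowed impulse response Flip(w)(l) * h(l)."""
    return sum(x[m - l] * w_flip[l] * h[l] for l in range(len(w_flip)) if m - l >= 0)


x = [math.cos(0.5 * n) + 0.2 * n for n in range(20)]   # arbitrary test signal
max_diff = max(abs(y_fbs(x, m) - y_conv(x, m)) for m in range(len(x)))
```

Note that the tap h(4) = 0.05 is scaled by w̃(4) = 0.2 and any taps beyond lag M − 1 would be discarded entirely, which is the time-limiting effect described in the bullets above.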

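Time Varying Modifications in FBS

Consider now applying a time varying modification.

$$Y_m(\omega_k) = X_m(\omega_k)\,H_m(\omega_k) \quad (R = 1)$$

where

$$H_m(\omega_k) \;\leftrightarrow\; h_m(n) = \frac{1}{N}\sum_{k=0}^{N-1} H_m(\omega_k)\,e^{j\omega_k n}$$

Here $h_m(n)$ refers to the nth tap of the FIR filter at time m.

$$\begin{aligned}
y(m) &= \frac{1}{N}\sum_{k=0}^{N-1} Y_m(\omega_k)\,e^{j\omega_k m} \\
&= \frac{1}{N}\sum_{k=0}^{N-1} X_m(\omega_k)\,H_m(\omega_k)\,e^{j\omega_k m} \\
&= \frac{1}{N}\sum_{k=0}^{N-1}\left\{\sum_{n=-\infty}^{\infty} x(n)\,w(n-m)\,e^{-j\omega_k n}\right\}H_m(\omega_k)\,e^{j\omega_k m} \\
&= \sum_{n=-\infty}^{\infty} x(n)\,w(n-m)\,\frac{1}{N}\sum_{k=0}^{N-1} H_m(\omega_k)\,e^{j\omega_k(m-n)} \\
&= \sum_{n=-\infty}^{\infty} x(n)\,[w(n-m)\,h_m(m-n)] \\
&= \sum_{n=-\infty}^{\infty} x(n)\,[\tilde{w}(m-n)\,h_m(m-n)] \\
&= x(m)\,[\tilde{w}(0)\,h_m(0)] + x(m-1)\,[\tilde{w}(1)\,h_m(1)] + \cdots \\
&= (x * [\tilde{w}\cdot h_m])(m)
\end{aligned}$$

Hence, the result is the convolution of x with the windowed version of hm.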

• We saw that in OLA with time varying modifications and R = 1 (a “sliding” DFT), the window served as a lowpass filter on each individual tap of the FIR filter being implemented.
• In FBS, there is no limitation on how fast the filter hm may vary with time, but its length is limited to that of the window w.
• In OLA, there is no limit on length (just add more zero-padding), but the filter taps are band-limited to the spectral width of the window.
• FBS filters are time-limited by w, while OLA filters are band-limited by w (another dual relation).
• Recall for comparison that each frame in the OLA method is filtered according to

$$Y_m = X_m\cdot H_m = [X * W_m]\cdot H_m \;\leftrightarrow\; \underbrace{[x\cdot w_m]}_{x_m} * h_m$$

  where wm denotes ShiftmR(w).
• Time-varying FBS filters are instantly in “steady state”
• FBS filters must be changed very slowly to avoid clicks and pops (discontinuity distortion is likely when the filter changes)

For more details, see [Allen and Rabiner, 1977].

STFT Summary and Conclusions

The short-time Fourier transform (STFT) may be viewed either as an overlap-add (OLA) processor, or as a filter bank sum (FBS).

• We derived two conditions for perfect reconstruction which are Fourier duals of each other:
  – For OLA, the window must overlap-add to a constant in the time domain. By the Poisson summation formula, this is equivalent to having window transform nulls at all nonzero multiples of the frame rate 2π/R.
  – For FBS, the window transform must overlap-add to a constant in the frequency domain, and this is equivalent to having window nulls in the time domain at all nonzero multiples of the transform size N.
• STFT filter banks are oversampled except when using the rectangular window of length M = N and a hop size R = N.
• Critical sampling is desired for compression systems, but it is problematic in conjunction with spectral modifications. (Aliasing no longer canceled.)
• STFT filter banks are also uniform filter banks, as opposed to “constant Q”.
  – In some audio applications, it is preferable to use non-uniform filter banks which approximate the auditory filter bank.
  – Some pointers can be found in the Bark bilinear transform paper [4].
  – We will next study a particular case (an octave filter bank).
• Approximate constant-Q filter banks are easily synthesized from STFT filter banks by summing adjacent frequency channels. However,
  – when K adjacent FFT bins are summed, the hop size in the time domain should be reduced by K.
  – A refinement of bin-summing is to multiply the K FFT bins by a K × K matrix which produces output samples from a K-bin-wide filter band over K successive time steps. As we will see, the K × K DFT matrix can be used for this purpose.

[4] http://ccrma.stanford.edu/~jos/bbt/
