<<

General Discrete-Time Model of Speech Production

Digital Speech Processing— Lecture 9

Short-Time Methods- Voiced Speech: • AVP(z)G(z)V(z)R(z) Introduction Unvoiced Speech:

1 • ANN(z)V(z)R(z) 2

Short-Time Fourier Analysis Why STFT for Speech Signals

• represent signal by sum of sinusoids or • steady state sounds, like vowels, are produced complex exponentials as it leads to convenient by periodic excitation of a linear system => solutions to problems (formant estimation, pitch speech spectrum is the product of the excitation period estimation, analysis-by-synthesis spectrum and the vocal tract frequency response methods), and insight into the signal itself • speech is a time-varying signal => need more sophisticated analysis to reflect time varying • such Fourier representations provide properties – convenient means to determine response to a sum of – changes occur at syllabic rates (~10 times/sec) sinusoids for linear systems – over fixed time intervals of 10-30 msec, properties of – clear evidence of signal properties that are obscured most speech signals are relatively constant (when is in the original signal this not the case)

3 4

Frequency Domain Processing Overview of Lecture

• define time-varying (STFT) analysis method • define synthesis method from time-varying FT (filter-bank summation, overlap addition) • show how time-varying FT can be viewed in terms of a bank of filters model • Coding: • computation methods based on using FFT – transform, subband, homomorphic, channel vocoders • application to vocoders, spectrum displays, • Restoration/Enhancement/Modification: format estimation, pitch period estimation – noise and reverberation removal, helium restoration, time-scale modifications (speed-up and slow-down of speech) 5 6

1 Frequency and the DTFT DTFT and DFT of Speech

• sinusoids The DTFT and the DFT for the infinite duration jnωω00− jn signal could be calculated (the DTFT) and xn ( )==+ cos(ω0 n ) ( e e )/ 2 approximated (the DFT) by the following: where ω0 is the frequency (in radians) of the sinusoid ∞ • the Discrete-Time Fourier Transform (DTFT ) Xe (jjmωω )= ∑ xme ( )− ( DTFT ) ∞ m=−∞ jjnωω− Xe ( )== xne ( ) x (n) L−1 ∑ DTFT {} − jLkm(2π / ) n=−∞ Xk()= ∑ xmwme ( )( ) ,kL=−0,1,..., 1 π m=0 1 jjnωω-1 j ω xn()== Xe ( ) e dω DTFT {} Xe ( ) = Xe()jω ( DFT ) ∫ ωπ=(2kL / ) 2π −π where ω is the frequency variable of Xe (jω ) using a value of L =25000 we get the following plot

7 8

25000-Point DFT of Speech

Magnitude Short-Time Fourier Transform (STFT) Log Magnitude (dB)

9 10

Short-Time Fourier Transform Definition of STFT

∞ jjmωωˆˆˆˆ− • speech is not a stationary signal, i.e., it Xenˆ ( )=−∑ xmwnme ( ) ( ) both n and ωˆ are variables m=−∞ has properties that change with time •− wn (ˆˆ m ) is a real window which determines the portion of xn ( ) jωˆ • thus a single representation based on all that is used in the computation of Xenˆ ( ) the samples of a speech utterance, for the most part, has no meaning • instead, we define a time-dependent Fourier transform (TDFT or STFT) of speech that changes periodically as the speech properties change over time

11 12

2 Short Time Fourier Transform Short-Time Fourier Transform

• STFT is a function of two variables, the time index, nˆ , which • alternative form of STFT (based on change of variables) is is discrete, and the frequency variable, ω ˆ, which is continuous ∞ ∞ jjnmωωˆˆ−−()ˆ jjmωωˆˆ− Xenˆ ( )=− wmxnme ( ) (ˆ ) Xenˆ ()=− xmwnme ()(ˆ ) ∑ ∑ m=−∞ m=−∞ ∞ ˆˆ ˆ =−⇒ DTFT (xmwn ( ) ( m )) n fixed, ω ˆ variable =−exnmwme− jnωωˆˆ∑ ()()ˆ jm m=−∞ • if we define ∞ % jjmωωˆˆˆ Xenˆ ( )=−∑ xnmwme ( ) ( ) m=−∞ jωˆ • then Xenˆ ( ) can be expressed as (using mm′ =− ) jjnjjnωωˆˆ−−ˆˆ ωω ˆˆ ˆ Xenˆ ()== e Xe% n () eDTFT ⎣⎦⎡⎤ xnmwm ( +− )()

13 14

STFT-Different Time Origins Time Origin for STFT

• the STFT can be viewed as having two different time origins 1. time origin tied to signal xn ( ) mnx=−ˆ ⇒ [0] ∞ jjmωωˆˆˆ − Xenˆ ()=−∑ xmwnme ()( ) m=−∞ =− DTFT ⎣⎦⎡⎤xmwn( ) (ˆˆ m ) , n fixed, ωˆ variable Time origin tied to 2. time origin tied to window signal wm (− ) ∞ window wmxnm[][]−+ˆ jjnωωˆˆ−−ˆ ˆ jm ω ˆ Xenˆ ()=+− e∑ xnmwme ( )() m=−∞ = eXe− jnωωˆˆˆ % () j − jnωˆ ˆ =−+ewmxnmnDTFT ⎣⎦⎡⎤( ) (ˆˆ ) , fixed, ωˆ variable

15 16

Interpretations of STFT Fourier Transform Interpretation

jωˆ • there are 2 distinct interpretations of Xenˆ ( ) jωˆ • consider Xenˆ ( ) as the normal Fourier transform of the sequence 1. assume nXeˆ is fixed, then (jωˆ ) is simply the normal Fourier nˆ wn(ˆˆ−−∞<<∞ mxm ) ( ), m for fixed n . transform of the sequence wn (ˆ −−∞<<∞=> mxm ) ( ), m for •− the window wn (ˆ m ) slides along the sequence xm ( ) and defines a jωˆ fixed nXˆ, nˆ ( e ) has the same properties as a normal Fourier new STFT for every value of nˆ transform • what are the conditions for the existence of the STFT jωˆ ˆ 2. consider Xenˆ ( ) as a function of the time index n with ωˆ fixed. •− the sequence wn (ˆ mxm ) ( ) must be absolutely summable for all jωˆ ˆ − jnωˆ ˆ values of nˆ Then Xenˆ ( ) is in the form of a convolution of the signal xne ( ) with the window wn (ˆ ). This leads to an interpretation in the form of - since |xn (ˆ ) |≤ L (32767 for 16-bit sampling) − jnωˆ ˆ ˆ linear filtering of the frequency modulated signal xne (ˆˆ ) by wn ( ). - since |wn ( ) |≤ 1 (normalized window levels) - since window duration is usually finite ˆˆ - we will now consider each of these interpretations of the STFT in •− wn ( mxm ) ( ) is absolutely summable for all n a lot more detail 17 18

3 Frequencies for STFT Signal Recovery from STFT

jωˆ • the STFT is periodic in ω with period 2π , i.e., • since for a given value of nXˆ ,nˆ ( e ) has the same properties as a jjkωωπˆˆ()+2 normal Fourier transform, we can recover the input sequence exactly Xennˆˆ()=∀ Xe ( ), k • since Xe (jωˆ ) is the normal Fourier transform of the windowed • can use any of several frequency variables nˆ sequence wn (ˆ − mxm ) ( ), then to express STFT, including 1 π wn (ˆ −= mxm ) ( ) X ( ejjmωωˆˆ ) e dωˆ --ωˆ =Ωˆ TT (where is the sampling period for ∫ nˆ 2π −π xm ( )) to represent analog radian frequency, •≠ assuming the window satisfies the property that w (00 ) ( a trivial

jTΩˆ requirement), then by evaluating the inverse Fourier transform giving Xenˆ ( ) when mn= ˆ , we obtain ˆ ˆ --ωπˆˆ==22fFT or ωπ to represent normalized 1 π xn (ˆ )= X ( ejjnωωˆ ) eˆ dωˆ ˆ ∫ nˆ frequency (0≤≤f 1 ) or analog frequency 20π w()−π ˆ jf2π ˆ j 2πFTˆ (0≤≤FFs =1 /T ), giving Xenˆ ( ) or Xenˆ ( ) 19 20

Signal Recovery from STFT Properties of STFT

jωˆ Xenˆ ( )=−DTFT [ wnmxm (ˆˆ ) ( )] n fixed, ωˆ variable π 1 jjnωωˆ ˆ • relation to short-time power density function xn (ˆ )= Xnˆ ( e ) e dωˆ ∫ jjωωˆˆ2 jj ωω ˆˆ∗ 20π w()−π ˆ Sennˆˆ ( )==⋅= | Xe ( ) | Xe nn ˆˆ ( ) Xe ( )DTFT [] Rk n ˆ ( ) n fixed •≠ with the requirement that wxn (00 ) , the sequence (ˆ ) ∞ jωˆ ˆˆ Rk()=− wnmxmwnm (ˆˆ )( )( −−+⇔ kxmk )( ) S()e can be recovered exactly from Xe (jjωω ), if Xe ( ) is known nˆ ∑ nˆ nnˆˆ m=−∞ for all values of ωˆ over one complete period • Relation to regular Xe (jωˆ ) (assuming it exists) - sample-by-sample recovery process ∞ jjmωωˆˆ− jωˆ Xe()==DTFT [] xm ()∑ xme () - Xenˆ ( ) must be known for every value of nˆ and for all ωˆ m=−∞ ˆ π can also recover sequence wn (− mxm ) ( ) but can't guarantee jjjjnωθωθθˆˆ1 −−−() ˆ Xe()= We ( )( Xe ) e dθ ˆ nˆ ∫ that xm ( ) can be recovered since w (nm− ) can equal 0 2π −π ˆ −−jjnjθθˆ θ ⎣⎦⎡⎤wn()()()−↔ m xm We e ∗ Xe ()

21 22

Properties of STFT Alternative Forms of STFT

Alternative forms of Xe (jωˆ ) • assume Xe (jωˆ ) exists nˆ 1. real and imaginary parts ∞ jjmωωˆˆ− Xe (jjωωˆˆ )=+ Re⎡⎤⎡⎤ Xe ( ) j Im Xe ( j ω ˆ ) Xe( )==DTFT [] xm ()∑ xme () nnˆˆ⎣⎦⎣⎦ n ˆ m=−∞ =−ajb()ωωˆˆ () π nnˆˆ 1 ˆ Xe()jjjjnωθωθθˆˆ= We (−−− )( Xe() ) e dθ aXe()ωˆ = Re⎡⎤ (jωˆ ) nˆ ∫ nnˆˆ⎣⎦ 2π −π bXe()ωˆ =− Im⎡⎤ (jωˆ ) • limiting case nnˆˆ⎣⎦ •− when xm ( ) and wn (ˆ m ) are both real (usually the case) wn()ˆˆ=−∞<<∞⇔12 n We (jωˆ ) =πδ () ωˆ can show that abˆˆ (ωωωˆˆˆ ) is symmetric in , and ( ) is 1 π nn X() ejjjnjωωθθωˆˆˆ=−2πδ ()( θ Xe()−− ) eˆ d θ = Xe () anti-symmetric in ωˆ nˆ 2π ∫ −π 2. magnitude and phase

i.e., we get the same thing no matter where the window is jjωωˆˆjθωnˆ ()ˆ Xennˆˆ ( )= | Xe ( ) | e shifted jωˆ • can relate |Xennˆˆ ( ) | and θω (ˆ ) to abnnˆˆ (ωωˆˆ ) and ( ) 23 24

4 Role of Window in STFT Role of Window in STFT

The window wn (ˆ − m ) does the following: • then for fixed nˆ , the normal Fourier transform of the 1. chooses portion of xm ( ) to be analyzed product wn (ˆ − mxm ) ( ) is the convolution of the transforms 2. window shape determines the nature of Xe (jωˆ ) nˆ of wn(ˆ − m ) and xm( ) Since Xe (jωˆ ) (for fixed nˆ ) is the normal FT of wn(ˆ − mxm ) ( ), nˆ •− for fixed nwnmWeeˆˆ , the FT of ( ) is (−−jjnωωˆˆ )ˆ --thus then if we consider the normal FT's of both xn ( ) and wn ( ) 1 π individually, we get Xe()jjjnjωθθωθˆˆ= We (−− ) eˆ Xe (() − ) dθ nˆ ∫ ∞ 2π −π Xe (jjmωωˆˆ )= ∑ xme ( ) − m=−∞ •− and replacing θθ by gives ∞ π jjmωωˆˆ− 1 We()= ∑ wme () Xe (jjjnjωθθωθˆˆ )= Wee ( )ˆ Xe (()+ ) dθ m=−∞ nˆ ∫ 2π −π

25 26

Interpretation of Role of Window Windows in STFT

jωˆ ˆ jjωωˆˆ • for Xenˆ ( ) to represent the short-time spectral properties of xn ( ) • Xenˆ ( ) is the convolution of Xe ( ) with the FT of the shifted inside the window => We (jθ ) should be much narrower in frequency window sequence We (−−jjnωωˆˆ ) e ˆ than significant spectral regions of Xe (jωˆ )--i.e., almost an impulse • Xe (jωˆ ) really doesn't have meaning since xn (ˆ ) varies with time; in frequency consider xn (ˆ ) defined for window duration and extended for all time • consider rectangular and Hamming windows, where width of the jωˆ to have the same properties⇒ then Xe ( ) does exist with properties main spectral lobe is inversely proportional to window length, and side that reflect the sound within the window (can also consider xn (ˆ )= 0 lobe levels are essentially independent of window length outside the window and define Xe (jωˆ ) appropriately--but this is another case) Rectangular Window: flat window of length L samples; first Bottom Line: Xe (jωˆ ) is a smoothed version of the FT of the part zero in frequency response occurs at FS/L, with sidelobe levels nˆ of -14 dB or lower of xn (ˆ ) that is within the window w . Hamming Window: raised cosine window of length L

samples; first zero in frequency response occurs at 2FS/L, with 27 sidelobe levels of -40 dB or lower 28

Frequency Responses of Windows Windows

L=2M+1-point Hamming window and its corresponding DTFT 29 30

5 Effect of Window Length-HW Effect of Window Length-HW

31 32

Effect of Window Length-RW Effect of Window Length-HW

33 34

Relation to Short-Time Autocorrelation Short-Time Autocorrelation and STFT

jωˆ Xenˆ () is the discrete-time Fourier transform of wnmxm [ˆ − ][] for each value of nˆ, then it is seen that jjωωˆˆ2* jj ωω ˆˆ Sennˆˆ()|()|()()== Xe Xe nn ˆˆ Xe is the Fourier transform of ∞ ˆ ˆ Rlnˆ ()=−∑ wnmx [ ][][mwn−− l mxm ][ + l ] m=−∞ which is the short-time autocorrelation function of the previous chapter. Thus the above equations relate the short-time spectrum to the short-time autocorrelation,

35 36

6 Summary of FT view of STFT

jω • interpret Xenˆ ( ) as the normal Fourier transform of the sequence wn()(),ˆ −−∞<<∞ mxm m • properties of this Fourier transform depend on the window Linear Filtering jω o frequency resolution of Xenˆ ( ) varies inversely with the length of the window => want long windows for high resolution Interpretation of STFT o want xn ( ) to be relatively stationary (non-time-varying) during duration of window for most stable spectrum => want short windows ⇔ as usual in speech processing, there needs to be a compromise between good temporal resolution (short windows) and good frequency resolution (long windows)

37 38

Linear Filtering Interpretation Linear Filtering Interpretation

1. modulation-lowpass filter form (nn rather than ˆ ) ∞ jjmωωˆˆ− Xen ( )=−∑ xme ( ) wnm ( ) m=−∞ 1. modulation-lowpass filter form: − jnωˆ ∞ =∗wn( )() xne ( ) , n variable, ωˆ fixed jjmωωˆˆ− Xen ()=−∑ xme () wnm ( ), m=−∞ 1 π = We()(jjjnθθωθ Xe()+ ˆ ) e dθ n variable, ωˆ fixed 2π ∫ −π =∗()xne()− jnωˆ wn () 2. bandpass filter-demodulation =∗−xn( )cos(ωˆ n ) wn ( ) ∞ () jjnmωωˆˆ−−() Xn ()ewmxnme=−∑ ()( ) jxn()()sin()ωˆ n∗ wn () m=−∞ ∞ =−ajbnn()ωωˆˆ () =−ewmexnm− jnωωˆˆ∑ (() jm )( ) m=−∞ =∗ewnexnn− jnωωˆˆ[( ( ) jn ) ( )], variable, ωˆ fixed 39 40

Linear Filtering Interpretation Linear Filtering Interpretation 2. bandpass filter-demodulation form Lowpass filter frequency response Xe (jjnjnωωˆˆ )=∗ e− ⎡⎤ wne ( ) ω ˆ xn ( ) , n variable, ωˆ fixed n ⎣⎦()

• complex bandpass filter output modulated by signal e− jnωˆ Bandpass filter frequency response • if We (jθ ) is lowpass, then filter is bandpass around θω= ˆ • all real computation for lower half structure

41 42

7 Linear Filtering Interpretation Summary-STFT

• assume normal FT of xn ( ) exists jθ xn ( )↔ Xe ( ) (recall that ωˆ is a particular frequency) Short-Time Fourier Transform (STFT) xne ( )−+jnωθωˆˆ↔ Xe ( j() ) ∞ => spectrum of xn ( ) at frequency ωˆ is shifted to zero frequency; jjmωωˆˆˆ − Xenˆ ()=−∑ xmwnme [][ ] , • since the STFT is a convolution, the FT of the STFT is the product m=−∞ of the individual FT's, i.e., −∞

43 44

Summary Summary – Modulation/Lowpass Filter Short-Time Fourier Transform (STFT) Short-Time Fourier Transform (STFT) ∞ ∞ jjmωωˆˆˆˆ− jjmωωˆˆˆˆ− Xenˆ ()=−−∞<<∞≤<∑ xmwnme [][ ] , n ,0ωˆ 2π Xenˆ ()=−−∞<<∞≤<∑ xmwnme [][ ] , n ,0ωˆ 2π m=−∞ m=−∞ π ωˆL−1 ωˆ ωˆ 2 ωˆ1 0 2 0 ωˆ0 0 R 2R 3R n^ n ˆ n n jjmωωˆˆ− jωˆˆ− jmω Filter Bank:Xn (exmewnm )=− [ ] [ ] DFT:Xenˆ ( )=−() xmwnme [ ] [ˆ ] ∑ () ∑ mnL=−+1 mnL=−+ˆ 1 jωˆ ˆ Xe()jjnωωˆˆ=− xne []− wnm [ ] Xenˆ ()DFT[][=−() xmwnm ] n () 02,0,,2,...≤<ωπˆ nRRˆ = 45 =∗()xne[]− jnωˆ wn [] 46

Summary – Bandpass Filter/Demodulation Summary – Modulation Short-Time Fourier Transform (STFT) ∞ Modulation jjmωωˆˆˆˆ− ˆ Xenˆ ()=−−∞<<∞≤<∑ xmwnme [][ ] , n ,0ω 2π jnωωωˆˆ j jn m=−∞ xn[] e↔∗ Xe ( ) FTe ( ) ωˆL−1 jω ωˆ2 =∗−Xe()δ (ωωˆ ) ωˆ 1 j ()ωω− ˆ ˆ ω0 = Xe() n X()e jω Xe()j ()ωω− ˆ ∞ jjnmωωˆˆ−−() Filter Bank:Xn (exnmewm )=−∑ () [ ] [ ] m=−∞ Xe()jjnjnωωˆˆ e− ⎡⎤ ([])[] wne ω ˆ xn n =∗⎣⎦ -W 0 W 47 ωˆ −W ωˆ ωˆ +W48

8 STFT Magnitude Only

• for many applications you only need the magnitude of the STFT(not the phase) Sampling Rates of STFT • in such cases, the bandpass filter implementation is less complex, since 12/ |Xe (jωˆ ) |⎡⎤ a22 (ˆˆ ) b ( ) nnn=+⎣⎦ωω 12/ =| Xe%%()|()()jωˆ =+⎡⎤ a22ωωˆˆ b nnn⎣⎦%

49 50

Sampling Rates of STFT Sampling Rate in Time

• need to sample STFT in both time and frequency to • to determine the sampling rate in time, we take a linear filtering view produce an unaliased representation from which x(n) can 1. Xe (jωˆ ) is the output of a filter with impulse response wn ( ) be exactly recovered n % jωˆ • sampling rates lower than the theoretical minimum rate 2. We ( ) is a lowpass response with effective bandwidth of B Hertz jjωωˆˆ can be used, in either time or frequency, and x(n) can •=> thus the effective bandwidth of Xenn ( ) is B Hertz Xe ( ) has still be exactly recovered from the aliased (under- to be sampled at a rate of 2B samples/second to avoid aliasing sampled) short-time transform Example: Hamming Window – this is useful for spectral estimation, pitch estimation, formant wn ( )=−054 . 046 . cos(2101πnL / (−≤≤− )) nL estimation, speech spectrograms, vocoders = 0 otherwise – for applications where the signal is modified, e.g., speech 2F enhancement, cannot undersample STFT and still recover ⇒≈BLFBs (Hz); for = 400, =10 , 000 Hz =>= 50 Hz => need modified signal exactly L s rate of 100/sec (every 100 samples) for sampling rate in time

51 52

Sampling Rate in Frequency “Total” Sampling Rate of STFT

jωˆ ˆ • since Xen ( ) is periodic in ωπ with period 2 , it is only necessary to sample over an • the “total” sampling rate for the STFT is the product of the sampling interval of length 2π rates in time and frequency, i.e.,

•==− need to determine an appropriate finite set of frequencies, ωπˆk 2011kNk / , , ,..., N SR = SR(time) x SR(frequency) jωˆ at which Xen ( ) must be specified to exactly recover xn ( ) = 2B x L samples/sec jωˆ B = frequency bandwidth of window (Hz) • use the Fourier transform interpretation of Xen ( ) jωˆ L = time width of window (samples) 1. if the window wn ( ) is time-limited, then the inverse transform of Xn ( e ) is time-limited jωˆ • for most windows of interest, B is a multiple of F /L, i.e., 2. the sampling theorem requres that we sample Xe ( ) in the frequency dimension at a S n B = C F /L (Hz), C=1 for Rectangular Window rate of at least twice its ('symmetric') "time width" S jωˆ C=2 for Hamming Window 3. since the inverse Fourier transform of Xen ( ) is the signal xmwnm ( ) (− ) and this signal SR = 2C FS samples/second is of duration Lwn samples (the duration of ( )), then according to the sampling theorem jωˆ • can define an ‘oversampling rate’ of Xen ( ) must be sampled (in frequency) at the set of frequencies SR/ FS = 2C = oversampling rate of STFT as compared to 2π k ωˆk ==− ,kL01 , ,..., 1 (where L / 2 is the effective width of the window) conventional sampling representation of x(n) L for RW, 2C=2; for HW 2C=4 => range of oversampling is 2-4 in order to exactly recover xn ( ) from X ( ejωˆk ) n this oversampling gives a very flexible representation of the speech signal

• thus for a Hamming window of duration L=400 samples, we require that the STFT be evaluated at at least 400 uniformly spaced frequencies around the unit circle 53 54

9 Mathematical Basis for Sampling the STFT Sampling the STFT

•==−∞<<∞ assume sample in time at nnˆ r rR , r ⎛⎞2π and in frequency at ωωˆˆ==k ⎜⎟kk , =01 , ,..., N − 1 ⎝⎠N • sample values 22ππ jk∞ − jkm NN XerR ( )=−∑ wrRmxme [ ] [ ] m=−∞ 22ππ − jkrR jk NN = eXe% rR() 2π 2π jk ∞ − jkm N N ′ ′ Xe% rR()=+∑ xrRm [ ]wme()−=+=( set mrRmmm; ) m=−∞ • define DFT-type notation 22ππ jk− jkrR NN XkrrR ( )== X ( e ) e Xk% r ( ) 55 56

Sampling the STFT What We Have Learned So Far

∞ jjmωωˆˆˆ − • DFT Notation 1.()()()Xenˆ =−∑ xmwnme m=−∞ 22ππ jk− jkrR NN function of nnˆ for sampled ˆ (looks like a time sequence) XkrrR [ ]== X ( e ) e Xk% r [ ] = ω •−≠≤≤− let wm [ ]00 for mL 1 (finite duration window with no zero-valued samples) function of ωωˆ = for sampled nˆ (looks like a transform) 2π jωˆ L−1 − jkm ˆ N Xenˆ ( ) (no sampling rate reduction) defined for n =≤≤123 , , ,...; 0 ωˆ π Xk% r []=+−∑ xrRmwme [ ] [ ] m=0 jωˆ ˆˆ 2.Xenˆ ( )=−⇒ DTFT⎣⎦⎡⎤ xmwnm ( ) ( ) n fixed, ωˆ variable (rkN fixed, 01≤≤ − ) with time origin tied to xn (ˆ ) •≤ if LN then (DFT defined with no aliasing⇒ can recover sequence jjnωωˆˆ− ˆ ˆˆ exactly using inverse DFT) Xenˆ ( )=+−⇒ e DTFT⎣⎦⎡⎤ xnmwm ( ) ( ) n fixed, ωˆ variable 2π N−1 jkm 1 N with time origin tied to w (−m) xrR [+−= mw ] [ m ]∑ X% r [ k ] e N k =0 jωˆ 3. Interpretations of Xenˆ ( ) (rmN fixed, 01≤≤− ) 1.nXexmwnmˆˆ fixed, ωωˆ ==−⇒ variable; (jωˆ ) DTFT⎡⎤ ( ) ( ) DFT View •≤ if RL (IDFT defined with no aliasing), then all samples nˆ ⎣⎦ ˆ jjnωωˆˆ− 2. nn==∗⇒ variable, ωˆ fixed; Xenˆ ( ) xne ( ) wn ( ) Linear Filtering can be recovered from Xkr [ ]() R>⇒ L gaps in sequence view ⇒ filter bank implementation 57 58

What We Have Learned So Far What We Have Learned So Far 4. Signal Recovery from STFT 1 π ˆ jjmωωˆˆ 6. Sampling Rates in Time and Frequency xmwn ( ) (−= m ) Xnˆ ( e ) e dωˆ ∫ jω 2π −π 1. time: We ( ) has bandwidth of B Hertz ⇒ 2 B samples/sec rate π 1 ˆ 2F xn()ˆ = X ( ejjnωωˆˆ ) e dωˆ S ∫ nˆ Hamming Window: B = (Hz) 20πw()−π L jω 5. Linear Filtering Interpretation 2. frequency: wn% ( ) is time limited to L samples ⇒ inverse of Xen ( ) is 1. modulation-lowpass filter ⇒=∗ Xe (jjωˆ ) wn ( )⎡⎤ xne ( ) − ωˆn , also time limited ⇒ need to sample in frequency at twice the (effective) n ⎣⎦ time width of the time-limited sequence ⇒ L frequency samples 1 π nnˆ == variable, ωˆ fixed Xe (jjjjnωθθωθˆˆ ) WeXe ( ) (()+ ) edθ 3. total Sampling Rate: 2BL⋅ samples/sec n 2π ∫ −π - B = frequency bandwidth of the window (Hz) 2. bandpass filer-demodulation ⇒= Xe (jjnjnωωˆˆ ) e− ⎡⎤ wne ( ) ω ˆ ∗ xn ( ) , n ⎣⎦() - L = effective time width of the window (samples)

nnˆ = variable, ωˆ fixed BCFL=⋅S / (Hz) ⇒ Sampling Rate=⋅=22BL CFS samples/second - for Rectangular Window, C = 1 - for Hamming Window, C = 2

59 60

10 Spectrographic Displays

• Sound Spectrograph-one of the earliest embodiments of the time- dependent spectrum analysis techniques – 2-second utterance repeatedly modulates a variable frequency oscillator, then bandpass filtered, and the average energy at a given time and frequency is measured and used as a crude measure of the Spectrographic Displays STFT – thus energy is recorded by an ingenious electro-mechanical system on special electrostatic paper called teledeltos paper – result is a two-dimensional representation of the time-dependent spectrum-with vertical intensity being spectrum level at a given frequency, and horizontal intensity being spectral level at a given time- with spectrum magnitude being represented by the darkness of the marking – wide bandpass filters (300 Hz bandwidth) provide good temporal resolution and poor frequency resolution (resolve pitch pulses in time but not in frequency)—called wideband spectrogram – narrow bandpass filters (45 Hz bandwidth) provide good frequency resolution and poor time resolution (resolve pitch pulses in frequency, but not in time)—called narrowband spectrogram

61 62

Conventional Spectrogram (Every Digital Speech Spectrograms salt breeze comes from the sea) • wideband spectrogram • follows broad spectral peaks (formants) over time • resolves most individual pitch periods as vertical striations since the IR of the analyzing filter is comparable in duration to a pitch period • what happens for low pitch males—high pitch females • for unvoiced speech there are no vertical pitch striations

• narrowband spectrogram • individual harmonics are resolved in voiced regions • formant frequencies are still in evidence • usually can see fundamental frequency • unvoiced regions show no strong structure 63 64

Digital Speech Spectrograms Digital Speech Spectrograms

• Speech Parameters (“This is a test”): Top Panel: – sampling rate: 16 kHz 3 msec (48 – speech duration: 1.406 seconds samples) window – speaker: male • Wideband Spectrogram Parameters: Second Panel: – analysis window: Hamming window 6 msec (96 – analysis window duration: 6 msec (96 samples) samples) window – analysis window shift: 0.625 msec (10 samples) Third Panel: – FFT size: 512 9 msec (144 – dynamic range of spectral log magnitudes: 40 dB sample) window • Narrowband Spectrogram Parameters: – analysis window: Hamming window Fourth Panel: – analysis window duration: 60 msec (960 samples) 30 msec (480 – analysis window shift: 6 msec (96 samples) sample) window – FFT size: 1024 – dynamic range of spectral log magnitudes: 40 dB 65 66

11 Spectrogram Comparisons Spectrogram - Male nfft= 1024 ,LOverlap== 80 , 75

67 68

Spectrogram - Female nfft==1024 ,LOverlap 80 , = 75

Overlap Addition (OLA) Method

69 70

Overlap Addition (OLA) Method Overlap Addition (OLA) Method

⎡⎤jjnωωkk yn()= ∑∑⎢⎥ Xm ( e ) e • based on normal FT interpretation of short-time spectrum mk⎣⎦ • summation is for overlapping analysis sections jωk DFT/ IDFT Xe( )←⎯⎯⎯⎯→= ym () xmwnm ()(ˆ − ) jωk nnˆˆ • for each value of mXe where m ( ) is measured, do an inverse FT to give jω k ynm ( )=− Lxnwmn ( ) ( ) (where L is the size of the FT) • can reconstruct x(m) by computing IDFT of Xnˆ ( e ) and y(nynLxnwmn)()()()==∑∑m − dividing out the window (assumed non-zero for all samples) mm •=> this process gives Lx(m) signal values of for each window • a basic property of the window is N−1 We()j 0 == We (jωk )| wn () window can be moved by L samples and the process repeated ωk =0 ∑ n=0 jωk • since any set of samples of the window are equivalent (by sampling arguments), • since Xenˆ ( ) is "undersampled" in time, it is highly susceptible then if wn ( ) is sampled often enough we get (independent of n ) to aliasing errors => need more robust synthesis procedure ∑wm (−= n ) We (j 0 ) m yn( )= LxnWe ( ) (j 0 ) using overlap-added sections

71 72

12 Overlap Addition of Bartlett Overlap Addition of Hamming Window and Hann Windows L=128

73 74

Window Spectra Hamming Window Spectra

DTFTs of even-length, odd-length and modified odd-to-even DTFTof Bartlett (triangular), Hann and Hamming windows 75 length Hamming windows; zeros spaced at 2π/R give perfect 76 reconstruction using OLA (even-length window)

Overlap Addition (OLA) Method Overlap Addition (OLA) Method

• w(n) is an L-point Hamming window with R=L/4 • assume x(n)=0 for n<0 • time overlap of 4:1 for HW • first analysis section begins at n=L/4

77 78

13 Overlap Addition (OLA) Method

• 4-overlapping sections contribute to each interval Filter Bank Summation • N-point FFT’s done using L speech samples, with N-L zeros padded at end to (FBS) allow modifications without significant aliasing effects • for a given value of n y(n)=x(n)w(R-n)+x(n)w(2R- n)+x(n)w(3R-n)+x(n)w(4R- n)=x(n)[w(R-n)+w(2R-n)+w(3R- n)+w(4R-n)]=x(n) W(ej0)/R

79 80

Filter Bank Summation Filter Bank Summation

• the filter bank interpretation of the STFT shows that for

jωk any frequency ωkn ,Xe ( ) is a lowpass representation

of the signal in a band centered at ωk (nn= ˆ for FBS) • define a bandpass filter and substitute it in the ∞ equation to give jjnωωkk− jmωk Xenk ( )=− e xnmw ( ) ()me ∑ jnωk m=−∞ hnkk ()= wne () ∞ where wmkk ( ) is the lowpass window used at frequency ω Xe()jjnωωkk e− xnmhm ( )() (we have generalized the structure to allow a different nk=−∑ m=−∞ lowpass window at each frequency ωk ).

81 82

Filter Bank Interpretation of

Filter Bank Summation STFT (case: wk[n]=w[n])

wn[]←−−−→ We (jω )

-ω0 ω0 ω

hn[]=←−−−→=⊗ wne []jωωk njn He (jjωω ) W ( e ) FTe (k ) ∞∞ jnωωkk jn− jnω −− j() ωω k n FT() e===−∑∑ e e e δω ( ωk ) nn=−∞ =−∞

jjωω j ()ωω− k He()=⊗−= We ()δω ( ωk ) We ( ) “single-sided, bandpass”

ωk-ω0 ωk ωk+ ω0 ω jjnωωkk− vnkn[]==⋅∗ Xe ( ) e[] xnhn [] [] Ve()jjjωωω⎡⎤ He ()() Xe FT()e− jnωk k =⋅⊗⎣⎦ ⎡⎤He()()jjωω Xe ( ) =⋅⊗+⎣⎦δω ωk =⋅⎡⎤Xe()(jω Wej ()ωω− k ) ⊗+δω ( ω ) ⎣⎦k j()ωω+ jω 83 =⋅Xe()()k We “lowpass” 84

14 Filter Bank Summation Filter Bank Summation

jωk • thus Xen ( ) is obtained by bandpass filtering xn ( ) followed by modulation with the complex exponential e− jnωk . We can express this in the form ∞ yn ( )==− Xe (jjnωωkk ) e xnmhm ( ) ( ) kn∑ k m=−∞

• thus ynk ( ) is the output of a bandpass filter with impulse

response hnk ( )

85 86

Filter Bank Summation Filter Bank Summation

• a practical method for reconstructing xn ( ) from the STFT is as follows 1. assume we know Xe (jωk ) for a set of N frequencies {ω }, k=−01 , ,..., N 1 nk • consider a set of N bandpass filters, uniformly spaced, so that the entire 2. assume we have a set of N bandpass filters with impulse responses frequency band is covered hn()==− wne ()jnωk , k01 ,,..., N 1 kk 2π k 3. assume wn ( ) is an ideal lowpass filter with cutoff frequency ω ω ==− ,kN01 , ,..., 1 k pk k N • also assume window the same for all channels, i.e.,

wnk ( )= wn ( ), kN=−01, ,..., 1 • if we add together all the bandpass outputs, the composite response is NN−−11 j() He% (jjωω ) H ( e ) We (ωω− k ) ==∑∑k kk==00 - the frequency response of the bandpass filter is j •≥ if We (ωk ) is properly sampled in frequency ( N L ), where L is the j()ωω− He (jω )= We (k ) kk window duration, then it can be shown that N−1 1 j() We (ωω− k )=∀ w (0 ) ω FBS Formula 87 ∑ 88 N k =0

Filter Bank Summation Filter Bank Summation

• derivation of FBS formula wn ( )←⎯⎯⎯→FT/ IFT We (jω ) • if We (jω ) is sampled in frequency at N uniformly spaced points, the inverse discrete Fourier transform 1/N of the sampled version of We (jωk ) is (recall that sampling ⇒⇔⇒ multiplication convolution aliasing) 1 N−∞1 ∑∑We (jjnωωkk ) e=+ wn ( rN ) N kr==−∞0 • an aliased version of wn ( ) is obtained.

89 90

15 Filter Bank Summation Filter Bank Summation

• If wn ( ) is of duration L samples, then • the impulse response of the composite filter bank system is wn ( )=<00 , n , n ≥ L NN−−11 hn% () h () n wne ()jnωk Nw ()() n • and no aliasing occurs due to sampling in frequency ==∑∑k =0 δ kk==00 jω of We ( ). In this case if we evaluate the aliased • thus the composite output is formula for n = 0 , we get yn ()=∗= xn () hn% () Nw ()()0 xn N−1 • thus for FBS method, the reconstructed signal is 1 jω We()()k = w0 NN−−11 ∑ jjnωω N yn () y () n X ( ekk ) e Nw ()() xn k =0 ==∑∑kn =0 • the FBS formula is seen to be equivalent to the formula kk==00 • if Xe (jωk ) is sampled properly in frequency, and is independent of the above, since (according to the sampling theorem) any n shape of wn ( ) set of NWe uniformly spaced samples of (jω ) is adequate

91 92

Filter Bank Summation Filter Bank Summation

22ππ NN−−11jk jkn NN yn()==∑∑ ykn () n X ( e ) e kk==00 22ππ N−1 ⎡⎤− jkmjkn =−∑∑⎢⎥xmwn()( me ) NN e km=0 ⎣⎦ 2π N −1 jknm()− =−∑∑xmwn()( m ) e N mk=0 1/N ∞ =−−−∑∑xmwn()( m ) Nδ ( n m rN ) mr=−∞ ∞ y() n=− N∑ w ( rN )( x n rN ) r =−∞ •≠wn( )00 for ≤≤−⇒≥ n L 1 if N L then need only r = 0 term y(n)= Nw(0 )x(n) •< if NL then in order for ynxn ( ) = ( ) you need the condition wrN ( ) ==±±012 , r , ,... • 'undersampled' representation can still work--at least in theory

93 94

FBS Reconstruction in Non- Summary of FBS Method Overlapping Bands jω • perfect reconstruction of x(n) from Xn ( e ) is possible using FBS under the following conditions: • assume window length for all bands is L samples 1. w(n) is a finite duration filter/window • assume the same window is used for N equally spaced jω frequency bands with analysis frequencies 2. Xen ( ) is sampled properly in both time and frequency j 2π k • perfect reconstruction of x(n) from X ( e ωk ) is also possible using ω ==− ,kN01 , ,..., 1 n k N FBS under the following condition: • where NL can be less than jω We ( ) is perfectly bandlimited • assume w(n) is an ideal lowpass filter with cutoff frequency

jωk • To avoid time aliasing, Xen ( ) must be evaluated at at least L uniformly spaced π ωp = frequencies, where L is the window duration N -since window of length L samples has frequency bandwidth of from 24ππ/LL (for RW) to / (for HW), the bandpass filters in FBS overlap example with N=6 in frequency since the analysis frequencies are 2011π kLk / ,=− , ,..., L equally spaced ideal jωk • there is a way (at least theoretically) for Xen ( ) to be evaluated in filters non-overlapping bands and for which x(n) can still be exactly recovered 95 96

16 FBS Reconstruction in Non- FBS Reconstruction in Non-Overlapping Bands Overlapping Bands impulse response of ideal lowpass filter with cutoff frequency π/N • the composite impulse response for the FBS system is NN−−11 jnωω jn for ideal LPF we have hn% ()==∑∑ wne ()kk wn () e • kk==00 sin(ππnN / ) sin( r ) 1 wn()=== , wrN ( )δ () r • defining a composite of the terms being summed as ππnrNN NN−−11 giving hn%()= δ () n pn ( ) ==∑∑ ejnωk ejknN2π / kk==00 • other cases where perfect reconstruction is obtained • we get for hn%( ) 1. wn ( ) is of finite length L≤ N and causal (no images) hn%()= wn ()() pn 2. wn( ) has length > N and has the property • it is easy to show that p(n) is a periodic train of impulses of the form wn()==1 / N ,for n rN0 ∞ ==≠=±±0012 for nrNrrr (0 , , , ,...) pn()=− Nδ ( n rN ) ∑ giving hn%()==− pnwn () ()δ ( n rN0 ) r =−∞ jω − jrNω 0 • giving for hn% ( ) the expression He%()=⇒=− e yn ()( xn rN0 ) ∞ hn% ( )=− N∑ wrN ( )δ ( n rN ) r =−∞ • thus the composite impulse response is the window sequence sampled 97 98 at intervals of N samples

Summary of FBS Reconstruction Practical Implementation of FBS

• for perfect reconstruction using FBS methods 1. w(n) does not need to be either time-limited or frequency-limited to exactly reconstruct

jωk x(n) from Xn ( e ) 2. w(n) just needs equally spaced zeros, spaced N samples apart for theoretically perfect reconstruction • exact reconstruction of the input is possible with a number of frequency channels less than that required by the sampling theorem • key issue is how to design digital filters that match these criteria 99 100

FBS and OLA Comparisons

•←⎯⎯⎯ filter bank summation method duals → overlap addition method -- one depends on sampling relation in frequency -- one depends on sampling relation in time • FBS requires sampling in frequency be such that the window FBS and OLA Comparisons transform We (jω ) obeys the relation 1 N−1 ∑ We (j()ωω− k )= w (0 ) any ω N k=0 • OLA requires that sampling in time be such that the window obeys the relation ∞ ∑ wrR (−= n ) We (j 0 )/ R any n r =−∞ • the key to Short-Time Fourier Analysis is the ability to modify the short-time spectrum (via quantization, noise enhancement, signal enhancement, speed-up/slow-down,etc) and recover

101 an "unaliased" modified signal 102

17 Overlap Addition (OLA) Method

jωk • assume Xen ( ) sampled with period R samples in time

jjωωkk YerrR ()=≤≤− X (), e r integer, 01 k N • the Overlap Add Method is based on the summation ∞−⎡⎤1 N 1 yn ( )= ⎢⎥ Y ( ejjnωωkk ) e OLA Method ∑∑N r rk=−∞⎣⎦⎢⎥ =0

jωk • for each value of rYe , compute the inverse transform of r ( ) giving the sequences

ymr ()=−−∞<<∞ xmwrRm ()( ), m • the signal at time nn is obtained by summing the values at time

of all the sequences, ymr ( ) that overlap at time n , giving ∞∞ yn () y () n xn () wrR ( n ) ==∑∑r − rr=−∞ =−∞

jωk • if w(n) has a bandlimited FT and if Xn ( e ) is properly sampled in time (i.e., R small enough to avoid aliasing) then ∞ ∑ wrR (−≈ n ) We (j 0 )/ R -- independent of n , and r =−∞ j 0 yn ( )= xnWe ( ) ( )/ R -- exact reconstruction of x(n) 103

18