Short-Time Fourier Analysis Why STFT for Speech Signals Overview
Digital Speech Processing, Lecture 9
Short-Time Fourier Analysis Methods: Introduction

General Discrete-Time Model of Speech Production
• Voiced speech: $A_V\,P(z)\,G(z)\,V(z)\,R(z)$
• Unvoiced speech: $A_N\,N(z)\,V(z)\,R(z)$

Short-Time Fourier Analysis
• represent the signal by a sum of sinusoids or complex exponentials, as this leads to convenient solutions to problems (formant estimation, pitch period estimation, analysis-by-synthesis methods) and gives insight into the signal itself
• such Fourier representations provide
  – a convenient means to determine the response of linear systems to a sum of sinusoids
  – clear evidence of signal properties that are obscured in the original signal

Why STFT for Speech Signals
• steady-state sounds, like vowels, are produced by periodic excitation of a linear system => the speech spectrum is the product of the excitation spectrum and the vocal tract frequency response
• speech is a time-varying signal => need more sophisticated analysis to reflect time-varying properties
  – changes occur at syllabic rates (~10 times/sec)
  – over fixed time intervals of 10-30 msec, properties of most speech signals are relatively constant (when is this not the case?)

Overview of Lecture
• define time-varying Fourier transform (STFT) analysis method
• define synthesis method from time-varying FT (filter-bank summation, overlap addition)
• show how time-varying FT can be viewed in terms of a bank-of-filters model
• computation methods based on using the FFT
• application to vocoders, spectrum displays, formant estimation, pitch period estimation

Frequency Domain Processing
• Coding:
  – transform, subband, homomorphic, channel vocoders
• Restoration/Enhancement/Modification:
  – noise and reverberation removal, helium restoration, time-scale modifications (speed-up and slow-down of speech)

Frequency and the DTFT
• sinusoids
$$x(n) = \cos(\omega_0 n) = \left(e^{j\omega_0 n} + e^{-j\omega_0 n}\right)/2$$
where $\omega_0$ is the frequency (in radians) of the sinusoid
• the Discrete-Time Fourier Transform (DTFT)
$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\,e^{-j\omega n} = \mathrm{DTFT}\{x(n)\}$$
$$x(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\omega})\,e^{j\omega n}\,d\omega = \mathrm{DTFT}^{-1}\{X(e^{j\omega})\}$$
where $\omega$ is the frequency variable of $X(e^{j\omega})$

DTFT and DFT of Speech
The DTFT and the DFT for the infinite-duration signal could be calculated (the DTFT) and approximated (the DFT) by the following:
$$X(e^{j\omega}) = \sum_{m=-\infty}^{\infty} x(m)\,e^{-j\omega m} \qquad \text{(DTFT)}$$
$$X(k) = \sum_{m=0}^{L-1} x(m)\,w(m)\,e^{-j(2\pi/L)km}, \quad k = 0, 1, \ldots, L-1 \qquad \text{(DFT)}$$
$$X(k) = X(e^{j\omega})\Big|_{\omega = 2\pi k/L}$$
(here $X(e^{j\omega})$ is understood as the DTFT of the windowed segment $x(m)\,w(m)$). Using a value of $L = 25000$ we get the following plot.

25000-Point DFT of Speech
[Figure: 25000-point DFT of speech; panels: Magnitude, Log Magnitude (dB)]

Short-Time Fourier Transform (STFT)

Short-Time Fourier Transform
• speech is not a stationary signal, i.e., it has properties that change with time
• thus a single representation based on all the samples of a speech utterance, for the most part, has no meaning
• instead, we define a time-dependent Fourier transform (TDFT or STFT) of speech that changes periodically as the speech properties change over time

Definition of STFT
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} x(m)\,w(\hat{n}-m)\,e^{-j\hat{\omega} m}$$
both $\hat{n}$ and $\hat{\omega}$ are variables
• $w(\hat{n}-m)$ is a real window which determines the portion of $x(\hat{n})$ that is used in the computation of $X_{\hat{n}}(e^{j\hat{\omega}})$

Short-Time Fourier Transform
• the STFT is a function of two variables: the time index, $\hat{n}$, which is discrete, and the frequency variable, $\hat{\omega}$, which is continuous
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} x(m)\,w(\hat{n}-m)\,e^{-j\hat{\omega} m} = \mathrm{DTFT}\big(x(m)\,w(\hat{n}-m)\big) \;\Rightarrow\; \hat{n} \text{ fixed, } \hat{\omega} \text{ variable}$$

Short-Time Fourier Transform
• an alternative form of the STFT (based on a change of variables) is
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} w(m)\,x(\hat{n}-m)\,e^{-j\hat{\omega}(\hat{n}-m)} = e^{-j\hat{\omega}\hat{n}} \sum_{m=-\infty}^{\infty} x(\hat{n}-m)\,w(m)\,e^{j\hat{\omega} m}$$
• if we define
$$\tilde{X}_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} x(\hat{n}-m)\,w(m)\,e^{j\hat{\omega} m}$$
• then $X_{\hat{n}}(e^{j\hat{\omega}})$ can be expressed as (using $m' = -m$)
$$X_{\hat{n}}(e^{j\hat{\omega}}) = e^{-j\hat{\omega}\hat{n}}\,\tilde{X}_{\hat{n}}(e^{j\hat{\omega}}) = e^{-j\hat{\omega}\hat{n}}\,\mathrm{DTFT}\big[x(\hat{n}+m)\,w(-m)\big]$$

STFT: Different Time Origins
• the STFT can be viewed as having two different time origins
1. time origin tied to the signal $x(n)$
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} x(m)\,w(\hat{n}-m)\,e^{-j\hat{\omega} m} = \mathrm{DTFT}\big[x(m)\,w(\hat{n}-m)\big], \quad \hat{n} \text{ fixed, } \hat{\omega} \text{ variable}$$
2. time origin tied to the window signal $w(-m)$
$$X_{\hat{n}}(e^{j\hat{\omega}}) = e^{-j\hat{\omega}\hat{n}} \sum_{m=-\infty}^{\infty} x(\hat{n}+m)\,w(-m)\,e^{-j\hat{\omega} m} = e^{-j\hat{\omega}\hat{n}}\,\tilde{X}_{\hat{n}}(e^{j\hat{\omega}}) = e^{-j\hat{\omega}\hat{n}}\,\mathrm{DTFT}\big[w(-m)\,x(\hat{n}+m)\big], \quad \hat{n} \text{ fixed, } \hat{\omega} \text{ variable}$$

Time Origin for STFT
[Figure: time origin tied to the signal; time origin tied to the window, $w[-m]\,x[\hat{n}+m]$, where $m = -\hat{n} \Rightarrow x[0]$]

Interpretations of STFT
• there are 2 distinct interpretations of $X_{\hat{n}}(e^{j\hat{\omega}})$
1. assume $\hat{n}$ is fixed; then $X_{\hat{n}}(e^{j\hat{\omega}})$ is simply the normal Fourier transform of the sequence $w(\hat{n}-m)\,x(m)$, $-\infty < m < \infty$ => for fixed $\hat{n}$, $X_{\hat{n}}(e^{j\hat{\omega}})$ has the same properties as a normal Fourier transform
2. consider $X_{\hat{n}}(e^{j\hat{\omega}})$ as a function of the time index $\hat{n}$ with $\hat{\omega}$ fixed. Then $X_{\hat{n}}(e^{j\hat{\omega}})$ is in the form of a convolution of the signal $x(\hat{n})\,e^{-j\hat{\omega}\hat{n}}$ with the window $w(\hat{n})$. This leads to an interpretation in the form of linear filtering of the frequency-modulated signal $x(\hat{n})\,e^{-j\hat{\omega}\hat{n}}$ by $w(\hat{n})$.
• we will now consider each of these interpretations of the STFT in a lot more detail

Fourier Transform Interpretation
• consider $X_{\hat{n}}(e^{j\hat{\omega}})$ as the normal Fourier transform of the sequence $w(\hat{n}-m)\,x(m)$, $-\infty < m < \infty$, for fixed $\hat{n}$
• the window $w(\hat{n}-m)$ slides along the sequence $x(m)$ and defines a new STFT for every value of $\hat{n}$
• what are the conditions for the existence of the STFT?
• the sequence $w(\hat{n}-m)\,x(m)$ must be absolutely summable for all values of $\hat{n}$
  – since $|x(\hat{n})| \le L$ (32767 for 16-bit sampling)
  – since $|w(\hat{n})| \le 1$ (normalized window levels)
  – since window duration is usually finite
• therefore $w(\hat{n}-m)\,x(m)$ is absolutely summable for all $\hat{n}$

Frequencies for STFT
• the STFT is periodic in $\hat{\omega}$ with period $2\pi$, i.e.,
$$X_{\hat{n}}(e^{j(\hat{\omega}+2\pi k)}) = X_{\hat{n}}(e^{j\hat{\omega}}), \quad \forall k$$
• can use any of several frequency variables to express the STFT, including
  – $\hat{\omega} = \hat{\Omega} T$ (where $T$ is the sampling period for $x(m)$) to represent analog radian frequency, giving $X_{\hat{n}}(e^{j\hat{\Omega} T})$
  – $\hat{\omega} = 2\pi \hat{f}$ or $\hat{\omega} = 2\pi \hat{F} T$ to represent normalized frequency ($0 \le \hat{f} \le 1$) or analog frequency ($0 \le \hat{F} \le F_s = 1/T$), giving $X_{\hat{n}}(e^{j2\pi \hat{f}})$ or $X_{\hat{n}}(e^{j2\pi \hat{F} T})$

Signal Recovery from STFT
• since for a given value of $\hat{n}$, $X_{\hat{n}}(e^{j\hat{\omega}})$ has the same properties as a normal Fourier transform, we can recover the input sequence exactly
• since $X_{\hat{n}}(e^{j\hat{\omega}})$ is the normal Fourier transform of the windowed sequence $w(\hat{n}-m)\,x(m)$, then
$$w(\hat{n}-m)\,x(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X_{\hat{n}}(e^{j\hat{\omega}})\,e^{j\hat{\omega} m}\,d\hat{\omega}$$
• assuming the window satisfies the property that $w(0) \ne 0$ (a trivial requirement), then by evaluating the inverse Fourier transform when $m = \hat{n}$, we obtain
$$x(\hat{n}) = \frac{1}{2\pi\,w(0)}\int_{-\pi}^{\pi} X_{\hat{n}}(e^{j\hat{\omega}})\,e^{j\hat{\omega}\hat{n}}\,d\hat{\omega}$$

Signal Recovery from STFT
$$x(\hat{n}) = \frac{1}{2\pi\,w(0)}\int_{-\pi}^{\pi} X_{\hat{n}}(e^{j\hat{\omega}})\,e^{j\hat{\omega}\hat{n}}\,d\hat{\omega}$$
• with the requirement that $w(0) \ne 0$, the sequence $x(\hat{n})$ can be recovered exactly from $X_{\hat{n}}(e^{j\hat{\omega}})$, if $X_{\hat{n}}(e^{j\hat{\omega}})$ is known for all values of $\hat{\omega}$ over one complete period
  – sample-by-sample recovery process
  – $X_{\hat{n}}(e^{j\hat{\omega}})$ must be known for every value of $\hat{n}$ and for all $\hat{\omega}$
• can also recover the sequence $w(\hat{n}-m)\,x(m)$ for a single fixed $\hat{n}$, but can't guarantee that $x(m)$ can be recovered from it, since $w(\hat{n}-m)$ can equal 0

Properties of STFT
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \mathrm{DTFT}\big[w(\hat{n}-m)\,x(m)\big], \quad \hat{n} \text{ fixed, } \hat{\omega} \text{ variable}$$
• relation to short-time power density function
$$S_{\hat{n}}(e^{j\hat{\omega}}) = |X_{\hat{n}}(e^{j\hat{\omega}})|^2 = X_{\hat{n}}(e^{j\hat{\omega}}) \cdot X_{\hat{n}}^{*}(e^{j\hat{\omega}}) = \mathrm{DTFT}\big[R_{\hat{n}}(k)\big], \quad \hat{n} \text{ fixed}$$
$$R_{\hat{n}}(k) = \sum_{m=-\infty}^{\infty} w(\hat{n}-m)\,x(m)\,w(\hat{n}-k-m)\,x(m+k) \;\Leftrightarrow\; S_{\hat{n}}(e^{j\hat{\omega}})$$
• relation to the regular $X(e^{j\hat{\omega}})$ (assuming it exists)
$$X(e^{j\hat{\omega}}) = \mathrm{DTFT}\big[x(m)\big] = \sum_{m=-\infty}^{\infty} x(m)\,e^{-j\hat{\omega} m}$$
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} W(e^{-j\theta})\,X(e^{j(\hat{\omega}-\theta)})\,e^{-j\theta\hat{n}}\,d\theta$$
$$\big[w(\hat{n}-m)\,x(m)\big] \;\leftrightarrow\; W(e^{-j\theta})\,e^{-j\theta\hat{n}} * X(e^{j\theta})$$

Properties of STFT
• assume $X(e^{j\hat{\omega}})$ exists
$$X(e^{j\hat{\omega}}) = \mathrm{DTFT}\big[x(m)\big] = \sum_{m=-\infty}^{\infty} x(m)\,e^{-j\hat{\omega} m}$$
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} W(e^{-j\theta})\,X(e^{j(\hat{\omega}-\theta)})\,e^{-j\theta\hat{n}}\,d\theta$$
• limiting case
$$w(\hat{n}) = 1, \; -\infty < \hat{n} < \infty \;\Leftrightarrow\; W(e^{j\hat{\omega}}) = 2\pi\,\delta(\hat{\omega})$$
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} 2\pi\,\delta(-\theta)\,X(e^{j(\hat{\omega}-\theta)})\,e^{-j\theta\hat{n}}\,d\theta = X(e^{j\hat{\omega}})$$
i.e., we get the same thing no matter where the window is shifted

Alternative Forms of STFT
Alternative forms of $X_{\hat{n}}(e^{j\hat{\omega}})$:
1. real and imaginary parts
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \mathrm{Re}\big[X_{\hat{n}}(e^{j\hat{\omega}})\big] + j\,\mathrm{Im}\big[X_{\hat{n}}(e^{j\hat{\omega}})\big] = a_{\hat{n}}(\hat{\omega}) - j\,b_{\hat{n}}(\hat{\omega})$$
$$a_{\hat{n}}(\hat{\omega}) = \mathrm{Re}\big[X_{\hat{n}}(e^{j\hat{\omega}})\big], \qquad b_{\hat{n}}(\hat{\omega}) = -\mathrm{Im}\big[X_{\hat{n}}(e^{j\hat{\omega}})\big]$$
• when $x(m)$ and $w(\hat{n}-m)$ are both real (usually the case), can show that $a_{\hat{n}}(\hat{\omega})$ is symmetric in $\hat{\omega}$, and $b_{\hat{n}}(\hat{\omega})$ is anti-symmetric in $\hat{\omega}$
2. magnitude and phase
$$X_{\hat{n}}(e^{j\hat{\omega}}) = |X_{\hat{n}}(e^{j\hat{\omega}})|\,e^{j\theta_{\hat{n}}(\hat{\omega})}$$
• can relate $|X_{\hat{n}}(e^{j\hat{\omega}})|$ and $\theta_{\hat{n}}(\hat{\omega})$ to $a_{\hat{n}}(\hat{\omega})$ and $b_{\hat{n}}(\hat{\omega})$

Role of Window in STFT
The window $w(\hat{n}-m)$ does the following:
1. chooses the portion of $x(m)$ to be analyzed
2. window shape determines the nature of $X_{\hat{n}}(e^{j\hat{\omega}})$
Since $X_{\hat{n}}(e^{j\hat{\omega}})$ (for fixed $\hat{n}$) is the normal FT of $w(\hat{n}-m)\,x(m)$, then if we consider the normal FTs of both $x(n)$ and $w(n)$ individually, we get
$$X(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} x(m)\,e^{-j\hat{\omega} m}$$
$$W(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} w(m)\,e^{-j\hat{\omega} m}$$

Role of Window in STFT
• then for fixed $\hat{n}$, the normal Fourier transform of the product $w(\hat{n}-m)\,x(m)$ is the convolution of the transforms of $w(\hat{n}-m)$ and $x(m)$
• for fixed $\hat{n}$, the FT of $w(\hat{n}-m)$ is $W(e^{-j\hat{\omega}})\,e^{-j\hat{\omega}\hat{n}}$, thus
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} W(e^{-j\theta})\,e^{-j\theta\hat{n}}\,X(e^{j(\hat{\omega}-\theta)})\,d\theta$$
• and replacing $\theta$ by $-\theta$ gives
$$X_{\hat{n}}(e^{j\hat{\omega}}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} W(e^{j\theta})\,e^{j\theta\hat{n}}\,X(e^{j(\hat{\omega}+\theta)})\,d\theta$$

Interpretation of Role of Window
• $X_{\hat{n}}(e^{j\hat{\omega}})$ is the convolution of $X(e^{j\hat{\omega}})$ with the FT of the shifted window sequence, $W(e^{-j\hat{\omega}})\,e^{-j\hat{\omega}\hat{n}}$
• $X(e^{j\hat{\omega}})$ really doesn't have meaning since $x(\hat{n})$ varies with time; consider $x(\hat{n})$ defined for the window duration and extended for all time to have the same properties ⇒ then $X(e^{j\hat{\omega}})$ does exist, with properties that reflect the sound within the window (can also consider $x(\hat{n}) = 0$ outside the window and define $X(e^{j\hat{\omega}})$ appropriately, but this is another case)
Bottom Line: $X_{\hat{n}}(e^{j\hat{\omega}})$ is a smoothed version of the FT of the part of $x(\hat{n})$ that is within the window $w$.

Windows in STFT
• for $X_{\hat{n}}(e^{j\hat{\omega}})$ to represent the short-time spectral properties of $x(\hat{n})$ inside the window => $W(e^{j\theta})$ should be much narrower in frequency than significant spectral regions of $X(e^{j\hat{\omega}})$, i.e., almost an impulse in frequency
• consider rectangular and Hamming windows, where the width of the main spectral lobe is inversely proportional to window length, and side lobe levels are essentially independent of window length
Rectangular Window: flat window of length $L$ samples; first zero in frequency response occurs at $F_S/L$, with sidelobe levels of about -13 dB or lower
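The STFT definition, the relation between the two time origins, and the sample-by-sample recovery formula can be checked numerically. The following is a minimal Python sketch, not part of the lecture: the function names (`hamming`, `stft_signal_origin`, `stft_window_origin`, `recover_sample`) are made up for this illustration, the window is assumed to be a causal Hamming window of length L (so w(0) = 0.08, which is nonzero), the signal is treated as zero outside its stored samples, and the integral over one period is replaced by a sum over N >= L equally spaced frequencies, which should be exact for a window of finite length L.

```python
import cmath
import math


def hamming(L):
    """Causal Hamming window, w[k] for k = 0..L-1 (assumed window choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (L - 1)) for k in range(L)]


def stft_signal_origin(x, w, n_hat, omega):
    """X_n(e^{jw}) = sum_m x[m] w[n_hat - m] e^{-j w m} (origin tied to signal)."""
    total = 0j
    for k, wk in enumerate(w):      # k = n_hat - m runs over the window support
        m = n_hat - k
        if 0 <= m < len(x):         # x is taken as zero outside 0..len(x)-1
            total += x[m] * wk * cmath.exp(-1j * omega * m)
    return total


def stft_window_origin(x, w, n_hat, omega):
    """X~_n(e^{jw}) = sum_m x[n_hat + m] w[-m] e^{-j w m} (origin tied to window)."""
    total = 0j
    for k, wk in enumerate(w):      # k = -m, so e^{-j w m} becomes e^{+j w k}
        idx = n_hat - k
        if 0 <= idx < len(x):
            total += x[idx] * wk * cmath.exp(1j * omega * k)
    return total


def recover_sample(x, w, n_hat, N):
    """Recover x[n_hat] from N frequency samples of the STFT (needs N >= len(w)).

    Discrete counterpart of x(n) = 1/(2 pi w(0)) * integral of X_n(e^{jw}) e^{jwn}.
    """
    acc = 0j
    for k in range(N):
        omega = 2 * math.pi * k / N
        acc += stft_signal_origin(x, w, n_hat, omega) * cmath.exp(1j * omega * n_hat)
    return (acc / (N * w[0])).real


if __name__ == "__main__":
    # Synthetic two-tone test signal (a stand-in for speech, chosen arbitrarily).
    x = [math.cos(0.2 * n) + 0.5 * math.cos(0.5 * n) for n in range(200)]
    w = hamming(31)
    n_hat, omega = 100, 0.7

    a = stft_signal_origin(x, w, n_hat, omega)
    b = cmath.exp(-1j * omega * n_hat) * stft_window_origin(x, w, n_hat, omega)
    print("two time origins differ by:", abs(a - b))
    print("recovery error:", abs(recover_sample(x, w, n_hat, 64) - x[n_hat]))
```

Both printed differences should be at the level of floating-point round-off. Choosing N smaller than the window length would let time-aliased terms leak into the sum, which is why the sketch requires N >= L.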