Short-Time Fourier Analysis Why STFT for Speech Signals Overview

Short-Time Fourier Analysis Why STFT for Speech Signals Overview

General Discrete-Time Model of Speech Production Digital Speech Processing— Lecture 9 Short-Time Fourier Analysis Methods- Voiced Speech: • AVP(z)G(z)V(z)R(z) Introduction Unvoiced Speech: 1 • ANN(z)V(z)R(z) 2 Short-Time Fourier Analysis Why STFT for Speech Signals • represent signal by sum of sinusoids or • steady state sounds, like vowels, are produced complex exponentials as it leads to convenient by periodic excitation of a linear system => solutions to problems (formant estimation, pitch speech spectrum is the product of the excitation period estimation, analysis-by-synthesis spectrum and the vocal tract frequency response methods), and insight into the signal itself • speech is a time-varying signal => need more sophisticated analysis to reflect time varying • such Fourier representations provide properties – convenient means to determine response to a sum of – changes occur at syllabic rates (~10 times/sec) sinusoids for linear systems – over fixed time intervals of 10-30 msec, properties of – clear evidence of signal properties that are obscured most speech signals are relatively constant (when is in the original signal this not the case) 3 4 Frequency Domain Processing Overview of Lecture • define time-varying Fourier transform (STFT) analysis method • define synthesis method from time-varying FT (filter-bank summation, overlap addition) • show how time-varying FT can be viewed in terms of a bank of filters model • Coding: • computation methods based on using FFT – transform, subband, homomorphic, channel vocoders • application to vocoders, spectrum displays, • Restoration/Enhancement/Modification: format estimation, pitch period estimation – noise and reverberation removal, helium restoration, time-scale modifications (speed-up and slow-down of speech) 5 6 1 Frequency and the DTFT DTFT and DFT of Speech • sinusoids The DTFT and the DFT for the infinite duration jnωω00− jn signal could be calculated (the DTFT) and xn( )==+ cos(ω0 n ) ( e e )/ 2 approximated (the DFT) by the following: where ω0 is the frequency (in radians) of the sinusoid ∞ • the Discrete-Time Fourier Transform (DTFT ) Xe(jjmωω )= ∑ xme ( )− ( DTFT ) ∞ m=−∞ jjnωω− Xe( )== xne ( ) x (n) L−1 ∑ DTFT {} − jLkm(2π / ) n=−∞ Xk()= ∑ xmwme ( )( ) ,kL=−0,1,..., 1 π m=0 1 jjnωω-1 j ω xn()== Xe ( ) e dω DTFT {} Xe ( ) = Xe()jω ( DFT ) ∫ ωπ=(2kL / ) 2π −π where ω is the frequency variable of Xe(jω ) using a value of L=25000 we get the following plot 7 8 25000-Point DFT of Speech Magnitude Short-Time Fourier Transform (STFT) Log Magnitude (dB) 9 10 Short-Time Fourier Transform Definition of STFT ∞ jjmωωˆˆˆˆ− • speech is not a stationary signal, i.e., it Xenˆ ( )=−∑ xmwnme ( ) ( ) both n and ωˆ are variables m=−∞ has properties that change with time •− wn(ˆˆ m ) is a real window which determines the portion of xn( ) jωˆ • thus a single representation based on all that is used in the computation of Xenˆ ( ) the samples of a speech utterance, for the most part, has no meaning • instead, we define a time-dependent Fourier transform (TDFT or STFT) of speech that changes periodically as the speech properties change over time 11 12 2 Short Time Fourier Transform Short-Time Fourier Transform • STFT is a function of two variables, the time index, nˆ, which • alternative form of STFT (based on change of variables) is is discrete, and the frequency variable, ωˆ, which is continuous ∞ ∞ jjnmωωˆˆ−−()ˆ jjmωωˆˆ− Xenˆ ( )=− wmxnme ( ) (ˆ ) Xenˆ ()=− xmwnme ()(ˆ ) ∑ ∑ m=−∞ m=−∞ ∞ ˆˆ ˆ =−⇒ DTFT (xmwn ( ) ( m )) n fixed, ωˆ variable =−exnmwme− jnωωˆˆ∑ ()()ˆ jm m=−∞ • if we define ∞ % jjmωωˆˆˆ Xenˆ ( )=−∑ xnmwme ( ) ( ) m=−∞ jωˆ • then Xenˆ ( ) can be expressed as (using mm′ =− ) jjnjjnωωˆˆ−−ˆˆ ωω ˆˆ ˆ Xenˆ ()== e Xe% n () eDTFT ⎣⎦⎡⎤ xnmwm ( +− )() 13 14 STFT-Different Time Origins Time Origin for STFT • the STFT can be viewed as having two different time origins 1. time origin tied to signal xn( ) mnx=−ˆ ⇒ [0] ∞ jjmωωˆˆˆ − Xenˆ ()=−∑ xmwnme ()( ) m=−∞ =− DTFT ⎣⎦⎡⎤xmwn( ) (ˆˆ m ) , n fixed, ωˆ variable Time origin tied to 2. time origin tied to window signal wm(− ) ∞ window wmxnm[][]−+ˆ jjnωωˆˆ−−ˆ ˆ jm ω ˆ Xenˆ ()=+− e∑ xnmwme ( )() m=−∞ = eXe− jnωωˆˆˆ % () j − jnωˆ ˆ =−+ewmxnmnDTFT ⎣⎦⎡⎤( ) (ˆˆ ) , fixed, ωˆ variable 15 16 Interpretations of STFT Fourier Transform Interpretation jωˆ • there are 2 distinct interpretations of Xenˆ ( ) jωˆ • consider Xenˆ ( ) as the normal Fourier transform of the sequence 1. assume nXeˆ is fixed, then (jωˆ ) is simply the normal Fourier nˆ wn(ˆˆ−−∞<<∞ mxm ) ( ), m for fixed n. transform of the sequence wn(ˆ −−∞<<∞=> mxm ) ( ), m for •− the window wn(ˆ m ) slides along the sequence xm( ) and defines a jωˆ fixed nXˆ, nˆ ( e ) has the same properties as a normal Fourier new STFT for every value of nˆ transform • what are the conditions for the existence of the STFT jωˆ ˆ 2. consider Xenˆ ( ) as a function of the time index n with ωˆ fixed. •− the sequence wn(ˆ mxm ) ( ) must be absolutely summable for all jωˆ ˆ − jnωˆ ˆ values of nˆ Then Xenˆ ( ) is in the form of a convolution of the signal xne( ) with the window wn(ˆ ). This leads to an interpretation in the form of - since |xn (ˆ ) |≤ L (32767 for 16-bit sampling) − jnωˆ ˆ ˆ linear filtering of the frequency modulated signal xne(ˆˆ ) by wn( ). - since |wn ( ) |≤ 1 (normalized window levels) - since window duration is usually finite ˆˆ - we will now consider each of these interpretations of the STFT in •− wn( mxm ) ( ) is absolutely summable for all n a lot more detail 17 18 3 Frequencies for STFT Signal Recovery from STFT jωˆ • the STFT is periodic in ω with period 2π , i.e., • since for a given value of nXˆ,nˆ ( e ) has the same properties as a jjkωωπˆˆ()+2 normal Fourier transform, we can recover the input sequence exactly Xennˆˆ()=∀ Xe ( ), k • since Xe(jωˆ ) is the normal Fourier transform of the windowed • can use any of several frequency variables nˆ sequence wn(ˆ − mxm ) ( ), then to express STFT, including 1 π wn(ˆ −= mxm ) ( ) X ( ejjmωωˆˆ ) e dωˆ --ωˆ =Ωˆ TT (where is the sampling period for ∫ nˆ 2π −π xm( )) to represent analog radian frequency, •≠ assuming the window satisfies the property that w(00 ) ( a trivial jTΩˆ requirement), then by evaluating the inverse Fourier transform giving Xenˆ ( ) when mn= ˆ, we obtain ˆ ˆ --ωπˆˆ==22fFT or ωπ to represent normalized 1 π xn(ˆ )= X ( ejjnωωˆ ) eˆ dωˆ ˆ ∫ nˆ frequency (0≤≤f 1 ) or analog frequency 20π w()−π ˆ jf2π ˆ j 2πFTˆ (0≤≤FFs =1 /T ), giving Xenˆ ( ) or Xenˆ ( ) 19 20 Signal Recovery from STFT Properties of STFT jωˆ Xenˆ ( )=−DTFT [ wnmxm (ˆˆ ) ( )] n fixed, ωˆ variable π 1 jjnωωˆ ˆ • relation to short-time power density function xn(ˆ )= Xnˆ ( e ) e dωˆ ∫ jjωωˆˆ2 jj ωω ˆˆ∗ 20π w()−π ˆ Sennˆˆ( )==⋅= | Xe ( ) | Xe nn ˆˆ ( ) Xe ( )DTFT [] Rk n ˆ ( ) n fixed •≠ with the requirement that wxn(00 ) , the sequence (ˆ ) ∞ jωˆ ˆˆ Rk()=− wnmxmwnm (ˆˆ )( )( −−+⇔ kxmk )( ) S()e can be recovered exactly from Xe(jjωω ), if Xe( ) is known nˆ ∑ nˆ nnˆˆ m=−∞ for all values of ωˆ over one complete period • Relation to regular Xe(jωˆ ) (assuming it exists) - sample-by-sample recovery process ∞ jjmωωˆˆ− jωˆ Xe()==DTFT [] xm ()∑ xme () - Xenˆ ( ) must be known for every value of nˆ and for all ωˆ m=−∞ ˆ π can also recover sequence wn(− mxm ) ( ) but can't guarantee jjjjnωθωθθˆˆ1 −−−() ˆ Xe()= We ( )( Xe ) e dθ ˆ nˆ ∫ that xm( ) can be recovered since w(nm− ) can equal 0 2π −π ˆ −−jjnjθθˆ θ ⎣⎦⎡⎤wn()()()−↔ m xm We e ∗ Xe () 21 22 Properties of STFT Alternative Forms of STFT Alternative forms of Xe(jωˆ ) • assume Xe(jωˆ ) exists nˆ 1. real and imaginary parts ∞ jjmωωˆˆ− Xe(jjωωˆˆ )=+ Re⎡⎤⎡⎤ Xe ( ) j Im Xe ( j ω ˆ ) Xe( )==DTFT [] xm ()∑ xme () nnˆˆ⎣⎦⎣⎦ n ˆ m=−∞ =−ajb()ωωˆˆ () π nnˆˆ 1 ˆ Xe()jjjjnωθωθθˆˆ= We (−−− )( Xe() ) e dθ aXe()ωˆ = Re⎡⎤ (jωˆ ) nˆ ∫ nnˆˆ⎣⎦ 2π −π bXe()ωˆ =− Im⎡⎤ (jωˆ ) • limiting case nnˆˆ⎣⎦ •− when xm( ) and wn(ˆ m ) are both real (usually the case) wn()ˆˆ=−∞<<∞⇔12 n We (jωˆ ) =πδ () ωˆ can show that abˆˆ(ωωωˆˆˆ ) is symmetric in , and ( ) is 1 π nn X() ejjjnjωωθθωˆˆˆ=−2πδ ()( θ Xe()−− ) eˆ d θ = Xe () anti-symmetric in ωˆ nˆ 2π ∫ −π 2. magnitude and phase i.e., we get the same thing no matter where the window is jjωωˆˆjθωnˆ ()ˆ Xennˆˆ( )= | Xe ( ) | e shifted jωˆ • can relate |Xennˆˆ ( ) | and θω(ˆ ) to abnnˆˆ(ωωˆˆ ) and ( ) 23 24 4 Role of Window in STFT Role of Window in STFT The window wn(ˆ − m ) does the following: • then for fixed nˆ, the normal Fourier transform of the 1. chooses portion of xm( ) to be analyzed product wn(ˆ − mxm ) ( ) is the convolution of the transforms 2. window shape determines the nature of Xe(jωˆ ) nˆ of wn(ˆ − m ) and xm( ) Since Xe(jωˆ ) (for fixed nˆ) is the normal FT of wn(ˆ − mxm ) ( ), nˆ •− for fixed nwnmWeeˆˆ, the FT of ( ) is (−−jjnωωˆˆ )ˆ --thus then if we consider the normal FT's of both xn( ) and wn( ) 1 π individually, we get Xe()jjjnjωθθωθˆˆ= We (−− ) eˆ Xe (() − ) dθ nˆ ∫ ∞ 2π −π Xe(jjmωωˆˆ )= ∑ xme ( ) − m=−∞ •− and replacing θθ by gives ∞ π jjmωωˆˆ− 1 We()= ∑ wme () Xe(jjjnjωθθωθˆˆ )= Wee ( )ˆ Xe (()+ ) dθ m=−∞ nˆ ∫ 2π −π 25 26 Interpretation of Role of Window Windows in STFT jωˆ ˆ jjωωˆˆ • for Xenˆ ( ) to represent the short-time spectral properties of xn( ) • Xenˆ ( ) is the convolution of Xe( ) with the FT of the shifted inside the window => We(jθ ) should be much narrower in frequency window sequence We(−−jjnωωˆˆ ) e ˆ than significant spectral regions of Xe(jωˆ )--i.e., almost an impulse • Xe(jωˆ ) really doesn't have meaning since xn(ˆ ) varies with time; in frequency consider xn(ˆ ) defined for window duration and extended for all time • consider rectangular and Hamming windows, where width of the jωˆ to have the same properties⇒ then Xe( ) does exist with properties main spectral lobe is inversely proportional to window length, and side that reflect the sound within the window (can also consider xn(ˆ )= 0 lobe levels are essentially independent of window length outside the window and define Xe(jωˆ ) appropriately--but this is another case) Rectangular Window: flat window of length L samples; first Bottom Line: Xe(jωˆ ) is a smoothed version of the FT of the part zero in frequency response occurs at FS/L, with sidelobe levels nˆ of -14 dB or lower of xn(ˆ ) that is within the window w.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    18 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us