Time Domain Methods in Speech Processing

Digital Speech Processing, Lectures 7-8: Time Domain Methods in Speech Processing

General Synthesis Model
[Figure: block diagram of the general synthesis model, with labels T1, T2, P1, P2, V, voiced sound amplitude, unvoiced sound amplitude, and radiation model $R(z) = 1 - \alpha z^{-1}$]
• Log Areas, Reflection Coefficients, Formants, Vocal Tract Polynomial, Articulatory Parameters, …
• Pitch Detection, Voiced/Unvoiced/Silence Detection, Gain Estimation, Vocal Tract Parameter Estimation, Glottal Pulse Shape, Radiation Model

Overview
[Figure: speech or music, A(x,t) → Signal Processing → representation of speech]
• representations: formants, reflection coefficients, voiced-unvoiced-silence, pitch, sounds of language, speaker identification, emotions
• time domain processing => direct operations on the speech waveform
• frequency domain processing => direct operations on a spectral representation of the signal
[Figure: x[n] → system → zero crossing rate, level crossing rate, energy, autocorrelation]
• simple processing
• enables various types of feature estimation

Basics
• 8 kHz sampled speech (bandwidth < 4 kHz)
• properties of speech change with time
• excitation goes from voiced to unvoiced
• peak amplitude varies with the sound being produced
• pitch varies within and across voiced sounds
• periods of silence where background signals are seen
• the key issue is whether we can create simple time-domain processing methods that enable us to measure/estimate speech representations reliably and accurately

Fundamental Assumptions
• properties of the speech signal change relatively slowly with time (5-10 sounds per second)
– over very short (5-20 msec) intervals => uncertainty due to small amount of data, varying pitch, varying amplitude
– over medium length (20-100 msec) intervals => uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech
– over long (100-500 msec) intervals => uncertainty due to large amount of sound changes
• there is always uncertainty in short-time measurements and estimates from speech signals

Compromise Solution
• "short-time" processing methods => short segments of the speech signal are "isolated" and "processed" as if they were short segments from a "sustained" sound with fixed (non-time-varying) properties
– this short-time processing is periodically repeated for the duration of the waveform
– these short analysis segments, or "analysis frames", often overlap one another
– the results of short-time processing can be a single number (e.g., an estimate of the pitch period within the frame), or a set of numbers (an estimate of the formant frequencies for the analysis frame)
– the end result of the processing is a new, time-varying sequence that serves as a new representation of the speech signal

Frame-by-Frame Processing in Successive Windows
[Figure: Frames 1 to 5 laid over a speech waveform, 75% frame overlap]
• 75% frame overlap => frame length = L, frame shift = R = L/4
• Frame 1: samples 0, 1, ..., L−1, i.e. Frame1 = {x[0], x[1], …, x[L−1]}
• Frame 2: samples R, R+1, ..., R+L−1, i.e. Frame2 = {x[R], x[R+1], …, x[R+L−1]}
• Frame 3: samples 2R, 2R+1, ..., 2R+L−1, i.e. Frame3 = {x[2R], x[2R+1], …, x[2R+L−1]}
• Frame 4: samples 3R, 3R+1, ..., 3R+L−1
• …

Frame-by-Frame Processing in Successive Windows
[Figure: Frames 1 to 4 laid over a speech waveform, 50% frame overlap]
• Speech is processed frame-by-frame in overlapping intervals until the entire region of speech is covered by at least one such frame
• Results of the analysis of individual frames are used to derive model parameters in some manner
• Representation goes from time sample $x[n]$, $n = \ldots, 0, 1, 2, \ldots$ to parameter vector $\mathbf{f}[m]$, $m = 0, 1, 2, \ldots$, where n is the time index and m is the frame index

Frames and Windows
FS = 16,000 samples/second
L = 641 samples (equivalent to 40 msec frame (window) length)
R = 240 samples (equivalent to 15 msec frame (window) shift)
Frame rate of 66.7 frames/second

Short-Time Processing
[Figure: speech waveform x[n] → short-time processing → speech representation f[m]]
• $x[n]$ = samples at 8000/sec rate (e.g., 2 seconds of 4 kHz bandlimited speech, $x[n]$, $0 \le n \le 16000$)
• $\mathbf{f}[m] = \{f_1[m], f_2[m], \ldots, f_L[m]\}$ = vectors at 100/sec rate, $1 \le m \le 200$; L is the size of the analysis vector (e.g., 1 for pitch period estimate, 12 for autocorrelation estimates, etc.)

Generic Short-Time Processing
$Q_{\hat{n}} = \left( \sum_{m=-\infty}^{\infty} T(x[m])\, \tilde{w}[n-m] \right) \Big|_{n=\hat{n}}$
[Figure: x[n] → T( ) → T(x[n]) → $\tilde{w}[n]$ → $Q_{\hat{n}}$]
• T( ) is a linear or non-linear transformation
• $\tilde{w}[n]$ is a window sequence (usually finite length)
• $Q_{\hat{n}}$ is a sequence of local weighted average values of the sequence $T(x[n])$ at time $n = \hat{n}$

Short-Time Energy
$E = \sum_{m=-\infty}^{\infty} x^2[m]$
-- this is the long-term definition of signal energy
-- there is little or no utility of this definition for time-varying signals
$E_{\hat{n}} = \sum_{m=\hat{n}-N+1}^{\hat{n}} x^2[m] = x^2[\hat{n}-N+1] + \ldots + x^2[\hat{n}]$
-- short-time energy in the vicinity of time $\hat{n}$
-- here $T(x) = x^2$, and $\tilde{w}[n] = 1$ for $0 \le n \le N-1$, $\tilde{w}[n] = 0$ otherwise

Computation of Short-Time Energy
[Figure: window $\tilde{w}[n-m]$ sliding across the sequence of squared samples]
• window jumps/slides across the sequence of squared values, selecting the interval for processing
• what happens to $E_{\hat{n}}$ as the sequence jumps by 2, 4, 8, ..., L samples ($E_{\hat{n}}$ is a lowpass function, so it can be decimated without loss of information; why is $E_{\hat{n}}$ lowpass?)
• effects of decimation depend on L; if L is small, then $E_{\hat{n}}$ is a lot more variable than if L is large (window bandwidth changes with L!)
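The framing and rectangular-window energy definitions above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the helper names `frame_signal` and `short_time_energy` and the synthetic test signal (low-level noise followed by a 200 Hz tone) are assumptions made for the example. It uses the slide's values FS = 16,000, L = 641, R = 240, and also checks that computing the energy once per frame is the same as decimating the sample-rate energy by R.

```python
import numpy as np

def frame_signal(x, L, R):
    """Split x into overlapping frames of length L taken every R samples."""
    x = np.asarray(x, dtype=float)
    if len(x) < L:
        return np.empty((0, L))
    num_frames = 1 + (len(x) - L) // R
    # Row k holds x[k*R], x[k*R + 1], ..., x[k*R + L - 1].
    idx = np.arange(L)[None, :] + R * np.arange(num_frames)[:, None]
    return x[idx]

def short_time_energy(x, L, R):
    """Rectangular-window short-time energy, one value per frame."""
    frames = frame_signal(x, L, R)
    return np.sum(frames ** 2, axis=1)

# Hypothetical test signal: 0.5 s of weak noise, then 0.5 s of a 200 Hz tone.
fs, L, R = 16000, 641, 240
rng = np.random.default_rng(0)
t = np.arange(fs // 2) / fs
x = np.concatenate([0.01 * rng.standard_normal(fs // 2),
                    np.sin(2 * np.pi * 200 * t)])

E = short_time_energy(x, L, R)

# Energy at every sample position (window start k = 0, 1, 2, ...),
# then decimated by R: this matches the frame-rate energy.
E_full = np.convolve(x ** 2, np.ones(L), mode="valid")
E_decimated = E_full[::R]

print("frame rate (frames/s):", fs / R)
print("number of frames:", len(E))
print("energy of first and last frame:", E[0], E[-1])
```

The energy of the tone frames is several orders of magnitude above that of the noise frames, which is the property used to separate speech from background signal.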
Effects of Window
$Q_{\hat{n}} = T(x[n]) * \tilde{w}[n] \,\big|_{n=\hat{n}} = x'[n] * \tilde{w}[n] \,\big|_{n=\hat{n}}$
• $\tilde{w}[n]$ serves as a lowpass filter on $T(x[n])$, which often has a lot of high frequencies (most non-linearities introduce significant high frequency energy; think of what $x[n] \cdot x[n]$ does in frequency)
• often we extend the definition of $Q_{\hat{n}}$ to include a pre-filtering term so that $x[n]$ itself is filtered to a region of interest
[Figure: x[n] → Linear Filter → $\hat{x}[n]$ → T( ) → $T(\hat{x}[n])$ → $\tilde{w}[n]$ → $Q_{\hat{n}} = Q_n \big|_{n=\hat{n}}$]

Short-Time Energy
• serves to differentiate voiced and unvoiced sounds in speech from silence (background signal)
• natural definition of energy of weighted signal is:
$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} \left[ x[m]\, \tilde{w}[\hat{n}-m] \right]^2$ (sum of squares of portion of signal)
-- concentrates measurement at sample $\hat{n}$, using weighting $\tilde{w}[\hat{n}-m]$
$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} x^2[m]\, \tilde{w}^2[\hat{n}-m] = \sum_{m=-\infty}^{\infty} x^2[m]\, h[\hat{n}-m]$, with $h[n] = \tilde{w}^2[n]$
[Figure: short-time energy system, x[n] → ( )² → x²[n] → h[n] → $E_{\hat{n}} = E_n \big|_{n=\hat{n}}$; sampling rates FS, FS, FS/R]

Short-Time Energy Properties
• depends on choice of h[n], or equivalently, window $\tilde{w}[n]$
– if $\tilde{w}[n]$ has very long duration and constant amplitude ($\tilde{w}[n] = 1$, n = 0, 1, ..., L−1), $E_n$ would not change much over time, and would not reflect the short-time amplitudes of the sounds of the speech
– very long duration windows correspond to narrowband lowpass filters
– want $E_n$ to change at a rate comparable to the changing sounds of the speech => this is the essential conflict in all speech processing, namely we need a short duration window to be responsive to rapid sound changes, but short windows will not provide sufficient averaging to give a smooth and reliable energy function

Windows
• consider two windows, $\tilde{w}[n]$
– rectangular window:
• h[n] = 1, 0 ≤ n ≤ L−1 and 0 otherwise
– Hamming window (raised cosine window):
• h[n] = 0.54 − 0.46 cos(2πn/(L−1)), 0 ≤ n ≤ L−1 and 0 otherwise
– rectangular window gives equal weight to all L samples in the window (n, ..., n−L+1)
– Hamming window gives most weight to middle samples and tapers off strongly at the beginning and the end of the window
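The identity between the windowed sum of squares and a linear filtering of x²[n] by h[n] = w²[n] can be checked numerically. The sketch below is illustrative only: the function names and the random test signal are assumptions, and signals are taken to be zero before n = 0. It builds both windows from the formulas above, evaluates the energy both ways, and compares them.

```python
import numpy as np

def rectangular_window(L):
    """w[n] = 1 for 0 <= n <= L-1."""
    return np.ones(L)

def hamming_window(L):
    """w[n] = 0.54 - 0.46 cos(2*pi*n/(L-1)) for 0 <= n <= L-1."""
    n = np.arange(L)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))

def energy_by_filtering(x, w):
    """E[n] = sum_m x^2[m] h[n - m] with h = w^2, for n = 0..len(x)-1."""
    h = w ** 2
    return np.convolve(x ** 2, h)[:len(x)]

def energy_direct(x, w, n_hat):
    """E at one time index: sum over m of (x[m] * w[n_hat - m])^2."""
    L = len(w)
    total = 0.0
    for m in range(max(0, n_hat - L + 1), n_hat + 1):
        total += (x[m] * w[n_hat - m]) ** 2
    return total

# Hypothetical test signal, used only to exercise the two definitions.
rng = np.random.default_rng(1)
x = rng.standard_normal(400)
L = 51

E_rect = energy_by_filtering(x, rectangular_window(L))
E_ham = energy_by_filtering(x, hamming_window(L))

for n_hat in (0, 25, 200, 399):
    direct = energy_direct(x, hamming_window(L), n_hat)
    print(n_hat, direct, E_ham[n_hat])
```

Because the Hamming window tapers toward its ends, its energy contour never exceeds the rectangular one for the same L; it weights the middle of the analysis interval most heavily.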
Rectangular and Hamming Windows
[Figure: rectangular and Hamming windows, L = 21 samples]

Window Frequency Responses
• rectangular window
$H(e^{j\Omega T}) = \frac{\sin(\Omega L T / 2)}{\sin(\Omega T / 2)}\, e^{-j\Omega T (L-1)/2}$
• first zero occurs at f = Fs/L = 1/(LT) (or Ω = (2π)/(LT)) => nominal cutoff frequency of the equivalent "lowpass" filter
• Hamming window
$w_H[n] = 0.54\, w_R[n] - 0.46 \cos(2\pi n/(L-1))\, w_R[n]$
• can decompose the Hamming window frequency response into a combination of three terms

RW and HW Frequency Responses
• log magnitude response of RW and HW
• bandwidth of HW is approximately twice the bandwidth of RW
• attenuation of more than 40 dB for HW outside the passband, versus 14 dB for RW
• stopband attenuation is essentially independent of L, the window duration => increasing L simply decreases window bandwidth
• L needs to be larger than a pitch period (or severe fluctuations will occur in $E_n$), but smaller than a sound duration (or $E_n$ will not adequately reflect the changes in the speech signal)
There is no perfect value of L, since a pitch period can be as short as 20 samples (500 Hz at a 10 kHz sampling rate) for a high pitch child or female, and up to 250 samples (40 Hz pitch at a 10 kHz sampling rate) for a low pitch male; a compromise value of L on the order of 100-200 samples for a 10 kHz sampling rate is often used in practice

Window Frequency Responses
[Figure: log magnitude responses of rectangular windows and Hamming windows, L = 21, 41, 61, 81, 101]

Short-Time Energy
• Short-time energy computation:
$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} \left( x[m]\, \tilde{w}[\hat{n}-m] \right)^2 = \sum_{m=-\infty}^{\infty} (x[m])^2\, \tilde{w}^2[\hat{n}-m]$
• For an L-point rectangular window, $\tilde{w}[m] = 1$, m = 0, 1, ..., L−1
• giving
$E_{\hat{n}} = \sum_{m=\hat{n}-L+1}^{\hat{n}} (x[m])^2$

Short-Time Energy using RW/HW
[Figure: $E_{\hat{n}}$ contours computed with rectangular and Hamming windows, L = 51, 101, 201, 401]
• as L increases, the plots tend to converge (however you are smoothing sound energies)
• short-time energy provides the basis for distinguishing voiced from unvoiced speech regions, and for medium-to-high SNR recordings, can even be used to find regions of silence/background signal

Short-Time Energy for AGC
• Can use an IIR filter to define short-time energy, e.g., …

Recursive Short-Time Energy
• $u[n-m-1]$ implies the condition $n - m - 1 \ge 0$, or $m \le n - 1$, giving
$\sigma^2[n] = \sum_{m=-\infty}^{n-1} \ldots$
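The window frequency-response statements earlier in these notes (first zero of the rectangular window at Fs/L, roughly doubled bandwidth and much lower sidelobes for the Hamming window) can be checked with a zero-padded FFT. This is a rough numerical sketch under assumed values: L = 51, a 10 kHz sampling rate, and the helper names `window_response_db` and `first_null_bin` are choices made for the example, not part of the lecture.

```python
import numpy as np

def window_response_db(w, nfft):
    """Log magnitude response of window w, normalized to 0 dB at DC."""
    W = np.abs(np.fft.rfft(w, nfft))
    return 20 * np.log10(np.maximum(W, 1e-12) / W[0])

def first_null_bin(resp_db):
    """Index of the first local minimum, i.e. the edge of the main lobe."""
    for k in range(1, len(resp_db) - 1):
        if resp_db[k] < resp_db[k - 1] and resp_db[k] <= resp_db[k + 1]:
            return k
    raise ValueError("no local minimum found")

fs = 10000            # assumed sampling rate in Hz
L = 51                # assumed window length in samples
nfft = 64 * L         # zero-padding chosen so that Fs/L falls exactly on bin 64

n = np.arange(L)
w_rect = np.ones(L)
w_ham = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))

rect_db = window_response_db(w_rect, nfft)
ham_db = window_response_db(w_ham, nfft)

null_rect = first_null_bin(rect_db)
null_ham = first_null_bin(ham_db)

# Peak sidelobe level: largest response beyond the main lobe.
side_rect = rect_db[null_rect:].max()
side_ham = ham_db[null_ham:].max()

print("rectangular first null (Hz):", null_rect * fs / nfft, "vs Fs/L =", fs / L)
print("Hamming first null (Hz):", null_ham * fs / nfft)
print("peak sidelobe (dB): rectangular", side_rect, "Hamming", side_ham)
```

Repeating the run with a different L moves both nulls in proportion to 1/L while leaving the sidelobe levels nearly unchanged, which is the sense in which stopband attenuation is essentially independent of the window duration.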
