Fundamental Frequency Detection
Jan Černocký, Valentina Hubeika
{cernocky|ihubeika}@fit.vutbr.cz
DCGM FIT BUT Brno

Agenda
• Fundamental frequency characteristics.
• Issues.
• Autocorrelation method, AMDF, NCCF.
• Formant impact reduction.
• Long-term predictor.
• Cepstrum.
• Improvements in fundamental frequency detection.

Recap – speech production and its model
[Figure: the speech production process and its source–filter model.]

Introduction
• The fundamental frequency (pitch) is the frequency at which the vocal cords oscillate: F0.
• The period of the fundamental frequency (the pitch period) is T0 = 1/F0.
• The term "lag" denotes the pitch period expressed in samples: L = T0 Fs, where Fs is the sampling frequency.

Fundamental Frequency Utilization
• Speech synthesis – melody generation.
• Coding:
  – In simple encoders such as LPC, the bit rate can be reduced by transmitting separately the vocal-tract parameters, the energy, a voiced/unvoiced flag and the pitch F0.
  – More complex encoders (such as RPE-LTP or ACELP in GSM cell phones) use a long-term predictor (LTP). The LTP is a filter with a "long" impulse response that, however, contains only a few non-zero coefficients.

Fundamental Frequency Characteristics
• F0 ranges from 50 Hz (males) to 400 Hz (children); with Fs = 8000 Hz these frequencies correspond to lags from L = 160 down to 20 samples. Note that for low F0 the pitch period approaches the frame length (20 ms, which corresponds to 160 samples). A short numeric sketch of this correspondence follows the methods overview below.
• The pitch range within a single speaker can reach a 2:1 ratio.
• Pitch shows a typical contour within different phones; the small period-to-period changes (∆F0 < 10 Hz) characterize the speaker but are difficult to estimate. In radio engineering such small shifts are called "jitter".
• F0 is influenced by many factors – melody, mood, stress, etc. Professional speakers show larger F0 changes (greater voice "modulation"); ordinary speech is usually rather monotonous.

Issues in Fundamental Frequency Detection
• Even voiced phones are never purely periodic! Only clean singing can be purely periodic; speech generated with F0 = const sounds monotonous.
• Purely voiced or purely unvoiced excitation does not exist either – the excitation is usually mixed (noise at higher frequencies).
• Pitch is difficult to estimate in low-energy segments.
• A high F0 can be confused with a low first formant F1 (females, children).
• After transmission over a telephone line (300–3400 Hz) the fundamental harmonic of the pitch is no longer present, only its higher harmonics, so simple filtering cannot be used to capture the pitch.

Methods Used in Fundamental Frequency Detection
• Autocorrelation and NCCF, applied to the original signal, to the so-called center-clipped signal, or to the linear-prediction error.
• Utilization of the prediction error of linear prediction.
• Cepstral method.
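To make the F0/lag correspondence from the characteristics above concrete, here is a minimal Python sketch (an illustration, not part of the original slides) that converts between fundamental frequency and lag under the assumption Fs = 8000 Hz used throughout:

    # Relation between F0, the pitch period T0 = 1/F0 and the lag L = T0 * Fs.
    # Fs = 8000 Hz is assumed, as in the slides.
    Fs = 8000  # sampling frequency [Hz]

    def f0_to_lag(f0, fs=Fs):
        """Lag in samples: L = T0 * Fs = Fs / F0."""
        return round(fs / f0)

    def lag_to_f0(lag, fs=Fs):
        """Fundamental frequency [Hz] corresponding to a lag in samples."""
        return fs / lag

    print(f0_to_lag(50))   # 160 samples (low male voice)
    print(f0_to_lag(400))  # 20 samples (child voice)
    print(lag_to_f0(87))   # ~92 Hz, the example lag L = 87 found later in the slides

The search interval of 20–160 samples used later for the ACF maximum follows directly from these two boundary values.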
Autocorrelation Function – ACF

R(m) = \sum_{n=0}^{N-1-m} s(n) s(n+m)    (1)

The symmetry property of the autocorrelation coefficients gives:

R(m) = \sum_{n=m}^{N-1} s(n) s(n-m)    (2)

[Figure: the whole signal and one frame of the signal.]

[Figure: shift illustration – the frame and its shifted copy.]

[Figure: the calculated autocorrelation function R(m).]

Lag Estimation. Voiced/Unvoiced Phones
Lag estimation using the ACF: look for the maximum of the function

R(m) = \sum_{n=0}^{N-1-m} c[s(n)] c[s(n+m)]    (3)

A frame can be classified as voiced/unvoiced by comparing the found maximum with the zeroth (i.e. the maximum) autocorrelation coefficient. The constant α must be chosen experimentally:

R_{\max} < \alpha R(0) \Rightarrow \text{unvoiced}
R_{\max} \ge \alpha R(0) \Rightarrow \text{voiced}    (4)

[Figure: ACF maximum estimation – the maximum inside the search interval 20–160 samples gives the lag (L = 87 for the given figure).]

A short numeric sketch of this ACF-based procedure follows the cross-correlation discussion below.

AMDF
Earlier, when multiplication was computationally expensive, the autocorrelation function was replaced by the AMDF (Average Magnitude Difference Function):

R_D(m) = \sum_{n=0}^{N-1-m} |s(n) - s(n+m)|,    (5)

where, on the contrary, the minimum has to be identified.

[Figure: AMDF of one frame – the minima mark the pitch period.]

Cross-Correlation Function
The drawback of the ACF is the progressive "shortening" of the segment from which the coefficients are computed. Here we want to use the whole signal ⇒ CCF. b denotes the beginning of the frame:

CCF(m) = \sum_{n=b}^{b+N-1} s(n) s(n-m)    (6)

[Figure: the shift in the CCF calculation – the current frame and the past signal, k = 87.]

The shift in the CCF is a problem, as the shifted signal may have a much higher energy!

[Figure: CCF computed against a high-energy portion of the past signal – "big values!"]
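The following numpy sketch (an illustrative implementation, not code from the slides) puts the ACF-based procedure together: R(m) of eq. (1) is evaluated for one frame over the lag interval 20–160 samples, the maximum gives the lag, the voiced/unvoiced test of eq. (4) is applied, and the AMDF of eq. (5) is included as the cheaper alternative. The frame length, the sampling frequency and the threshold alpha = 0.3 are assumptions chosen for the example.

    import numpy as np

    def acf(s, m):
        """R(m) = sum_{n=0}^{N-1-m} s(n) s(n+m), eq. (1)."""
        N = len(s)
        return float(np.sum(s[:N - m] * s[m:]))

    def amdf(s, m):
        """R_D(m) = sum_{n=0}^{N-1-m} |s(n) - s(n+m)|, eq. (5); its minimum marks the lag."""
        N = len(s)
        return float(np.sum(np.abs(s[:N - m] - s[m:])))

    def estimate_f0_acf(frame, fs=8000, lag_min=20, lag_max=160, alpha=0.3):
        """Return (f0, lag) for a voiced frame, (0.0, None) for an unvoiced one."""
        lags = np.arange(lag_min, lag_max + 1)
        r = np.array([acf(frame, int(m)) for m in lags])
        best = int(np.argmax(r))
        r_max, lag = r[best], int(lags[best])
        # Voiced/unvoiced decision of eq. (4): compare R_max with alpha * R(0).
        if r_max < alpha * acf(frame, 0):
            return 0.0, None          # unvoiced
        return fs / lag, lag          # voiced: F0 = Fs / L

    # Usage on a synthetic 100 Hz tone; a 40 ms frame (320 samples at 8 kHz) is used
    # here so that even the longest lag still overlaps the frame.
    fs = 8000
    n = np.arange(320)
    frame = np.sin(2 * np.pi * 100 * n / fs)
    print(estimate_f0_acf(frame, fs))   # expected: lag 80 samples -> F0 = 100 Hz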
Normalized Cross-Correlation Function
The difference in energy can be compensated by normalization ⇒ NCCF:

NCCF(m) = \frac{\sum_{n=b}^{b+N-1} s(n) s(n-m)}{\sqrt{E_1 E_2}}    (7)

E_1 and E_2 are the energies of the original and of the shifted signal:

E_1 = \sum_{n=b}^{b+N-1} s^2(n), \quad E_2 = \sum_{n=b}^{b+N-1} s^2(n-m)    (8)

[Figure: CCF and NCCF for a "good example".]

[Figure: CCF and NCCF for a "bad example".]

Drawback: these methods do not suppress the influence of the formants, which results in additional maxima in the ACF (or additional minima in the AMDF).

Center clipping – a signal preprocessing step applied before the ACF: we are interested only in the signal peaks. Define a so-called clipping level c_L. We can either "leave out" the values from the interval ⟨−c_L, +c_L⟩, or substitute the values of the signal by +1 and −1 where it crosses the levels +c_L and −c_L, respectively:

c_1[s(n)] = \begin{cases} s(n) - c_L & \text{for } s(n) > c_L \\ 0 & \text{for } -c_L \le s(n) \le c_L \\ s(n) + c_L & \text{for } s(n) < -c_L \end{cases}    (9)

c_2[s(n)] = \begin{cases} +1 & \text{for } s(n) > c_L \\ 0 & \text{for } -c_L \le s(n) \le c_L \\ -1 & \text{for } s(n) < -c_L \end{cases}    (10)

[Figure: a speech frame clipped with the clipping level 9562 – the original signal, clipping variant c_1, and clipping variant c_2.]

Clipping Level Value Estimation
As the speech signal s(n) is nonstationary, the clipping level changes and has to be estimated anew for every frame in which pitch is detected. A simple method is to derive the clipping level from the absolute maximum value in the frame:

c_L = k \max_{n=0 \ldots N-1} |s(n)|,    (11)

where the constant k is chosen between 0.6 and 0.8.
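To complement the description of center clipping, here is a short numpy sketch (illustrative only, with k = 0.7 picked from the recommended 0.6–0.8 range) of the clipping level of eq. (11) and of both clipping variants c_1 (eq. 9) and c_2 (eq. 10); the clipped frame is what would then be passed to the ACF of eq. (3):

    import numpy as np

    def clipping_level(frame, k=0.7):
        """c_L = k * max|s(n)|, eq. (11); k is usually chosen between 0.6 and 0.8."""
        return k * float(np.max(np.abs(frame)))

    def center_clip_c1(frame, c_L):
        """Eq. (9): shift the peaks towards zero, zero out everything inside <-c_L, +c_L>."""
        out = np.zeros_like(frame, dtype=float)
        out[frame > c_L] = frame[frame > c_L] - c_L
        out[frame < -c_L] = frame[frame < -c_L] + c_L
        return out

    def center_clip_c2(frame, c_L):
        """Eq. (10): three-level clipping to +1 / 0 / -1."""
        return np.where(frame > c_L, 1.0, np.where(frame < -c_L, -1.0, 0.0))

    # Usage: a 100 Hz "pitch" with an added 700 Hz ripple standing in for a formant;
    # after clipping, the ripple is largely removed and the ACF maxima become cleaner.
    fs = 8000
    n = np.arange(320)
    frame = np.sin(2 * np.pi * 100 * n / fs) + 0.3 * np.sin(2 * np.pi * 700 * n / fs)
    c_L = clipping_level(frame)
    clipped = center_clip_c2(frame, c_L)

The clipping level (and, if desired, the choice between c_1 and c_2) is a per-frame decision; the slides only prescribe that c_L is re-estimated for every analysed frame.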