Fundamental Frequency Detection

Fundamental Frequency Detection

Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika cernocky|ihubeika @fit.vutbr.cz { } DCGM FIT BUT Brno Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 1/37 Agenda Fundamental frequency characteristics. • Issues. • Autocorrelation method, AMDF, NFFC. • Formant impact reduction. • Long time predictor. • Cepstrum. • Improvements in fundamental frequency detection. • Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 2/37 Recap – speech production and its model Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 3/37 Introduction Fundamental frequency, pitch, is the frequency vocal cords oscillate on: F . • 0 The period of fundamental frequency, (pitch period) is T = 1 . • 0 F0 The term “lag” denotes the pitch period expressed in samples : L = T Fs, where Fs • 0 is the sampling frequency. Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 4/37 Fundamental Frequency Utilization speech synthesis – melody generation. • coding • – in simple encoding such as LPC, reduction of the bit-stream can be reached by separate transfer of the articulatory tract parameters, energy, voiced/unvoiced sound flag and pitch F0. – in more complex encoders (such as RPE-LTP or ACELP in the GSM cell phones) long time predictor LTP is used. LPT is a filter with a “long” impulse response which however contains only few non-zero components. Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 5/37 Fundamental Frequency Characteristics F takes the values from 50 Hz (males) to 400 Hz (children), with Fs=8000 Hz these • 0 frequencies correspond to the lags L=160 to 20 samples. It can be seen, that with low values F0 approaches the frame length (20 ms, which corresponds to 160 samples). The difference in pitch within a speaker can reach to the 2:1 relation. • Pitch is characterized by a typical behaviour within different phones; small changes • after the first period characterize the speaker (∆F0 < 10 Hz), but difficult to estimate. In radio-techniques these small shifts are called “jitter”. F is influenced by many factors – usually the melody, mood, distress, etc. Values of • 0 the changes of F0 are higher (greater voice “modulation”) in case of professional speakers. Common people speech is usually rather monotonous. Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 6/37 Issues in Fundamental Frequency Detection Even voiced phones are never purely periodic! Only clean singing can be purely • periodic. Speech generated with F0=const is monotonous. Purely voiced or unvoiced excitation does not exist either. Usually, excitation is • compound (noise at higher frequencies). Difficult estimation of pitch with low energy. • High F can be affected by the low formant F (females, children). • 0 1 During transmission over land line (300–3400 Hz) the basic harmonic of pitch is not • presented but its folds (higher harmonics). So simple filtering for purpose of capturing pitch would not work. Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 7/37 Methods Used in Fundamental Frequency Detection Autocorrelation + NCCF, which is applied on the original signal, further on the • so-called clipped signal and linear prediction error. Utilization of the error predictor in linear prediction. • Cepstral method. • Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 8/37 Autocorrelation Function – ACF N−1−m R(m)= s(n)s(n + m) (1) X n=0 The symmetry property of the autocorrelation coefficients gives: N−1 R(m)= s(n)s(n m) (2) X − n=m Fundamental Frequency Detection Jan Cernock´y,ValentinaHubeika,DCGMFITBUTBrnoˇ 9/37 The whole signal and one frame of a signal 4 x 104 3 2 1 0 −1 −2 −3 −40 2000 4000 6000 8000 10000 12000 2 x 104 1.5 1 0.5 0 −0.5 −1 −1.50 50 100 150 200 250 300 350 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 10/37 Shift illustration x 104 1.5 1 0.5 0 −0.5 −1 50 100 150 200 250 300 350 400 x 104 1.5 1 0.5 0 −0.5 −1 50 100 150 200 250 300 350 400 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 11/37 Calculated autocorrelation function x 109 10 8 6 4 2 R(m) 0 −2 −4 −6 −8 0 50 100 150 200 250 300 m Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 12/37 Lag Estimation. Voiced/Unvoiced Phones Lag estimation using ACF. Looking for the maximum of the function: N−1−m R(m)= c[s(n)]c[s(n + m)] (3) X n=0 Phones can be determined as voice/unvoiced by comparing the found maximum to the zero’s (maximum) autocorrelation coefficient. The constant α must be chosen experimentally. Rmax < αR(0) unvoiced ⇒ (4) Rmax αR(0) voiced ≥ ⇒ Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 13/37 ACF Maximum estimation lag (L=87 for the given figure): x 109 ⇒ 10 max lag 8 6 4 2 R(m) 0 −2 −4 −6 −8 20−160 − search interval 0 50 100 150 200 250 300 m Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 14/37 AMDF Earlier, when multiplication was computationally expensive, the autocorrelation function was substituted with AMDF (Average Magnitude Difference Function): N−1−m RD(m)= s(n) s(n + m) , (5) X | − | n=0 where on the contrary the minimum had to be identified. Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 15/37 x 106 2.5 2 1.5 1 0.5 0 50 100 150 200 250 300 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 16/37 Cross-Correlation Function The drawback of the ACF is a stepwise “shortening” of the segment, the coefficients are computed from. Here, we want to use the whole signal CCF. b indicates the beginning ⇒ of the frame: b+N−1 CCF (m)= s(n)s(n m) (6) X − n=b Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 17/37 The shift in the CCF calculation: x 104 1.5 1 0.5 0 −0.5 −1 −1.5 50 100 150 200 250 300 350 400 x 104 N−1 1.5 1 past signal 0.5 0 −0.5 −1 k=87 50 100 150 200 250 300 350 400 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 18/37 The CCF shift is a problem as the shifted signal has much higher energy! 3000 2000 1000 0 −1000 50 100 150 200 250 300 350 400 N−1 3000 2000 big values ! 1000 0 −1000 50 100 150 200 250 300 350 400 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 19/37 Normalized cross-correlation function The difference of the energy can be compensated by normalization: NCCF − zr+N 1 s(n)s(n m) CCF (m)= Pn=zr − (7) √E1E2 E1 a E2 are the energies of the original and the shifted signal: zr+N−1 zr+N−1 E = s2(n) E = s2(n m) (8) 1 X 2 X − n=zr n=zr Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 20/37 CCF and NCCF for a “good example” x 109 10 5 0 −5 50 100 150 200 250 300 1 0.5 0 −0.5 50 100 150 200 250 300 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 21/37 CCF and NCCF for a “bad example” x 107 1.5 1 0.5 0 −0.5 −1 −1.5 −2 50 100 150 200 250 300 1 0.5 0 −0.5 50 100 150 200 250 300 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 22/37 Drawback: the methods do not suppress the formants influence (results additional maxima in ACF or AMDF). Center Clipping - a signal preprocessing before ACF: we are interested only in the signal picks. Define so called clipping level cL. We can either “leave out” the the values from the interval < cL, +cL >. Or we can substitute the values of the signal by 1 and -1 where it crosses − the levels cL and cL, respectively: − s(n) cL pro s(n) >cL − c1[s(n)] = 0 pro cL s(n) cL (9) − ≤ ≤ s(n)+ cL pro s(n) < cL − +1 pro s(n) >c L c2[s(n)] = 0 pro cL s(n) cL (10) − ≤ ≤ 1 pro s(n) < cL − − Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 23/37 Figures illustrate clipping into frames on a speech signal with the clipping level 9562: 2 x 104 1.5 1 0.5 original 0 −0.5 −1 −1.50 50 100 150 samples 200 250 300 350 2 x 104 1.5 1 0.5 clip 1 0 −0.5 −1 −1.50 50 100 150 samples 200 250 300 350 1 0.8 0.6 0.4 0.2 0 clip 2 −0.2 −0.4 −0.6 −0.8 −1 0 50 100 150 samples 200 250 300 350 Fundamental Frequency Detection Jan Cernock´y,ˇ Valentina Hubeika, DCGM FIT BUT Brno 24/37 Clipping Level Value Estimation As a speech signal s(n) is a nonstationary signal, the slipping level changes and it is necessary to estimate it for every frame, for which pitch is predicted. A simple method is to estimate the clipping level from the absolute maximum value in the frame: cL = k max x(n) , (11) n=0...N−1 | | where the constant k is selected between 0.6 and 0.8.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    37 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us