Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg

The fundamental frequency variation spectrum

Kornel Laskowski 1, Mattias Heldner 2 and Jens Edlund 2
1 interACT, Carnegie Mellon University, Pittsburgh PA, USA
2 Centre for Speech Technology, KTH, Stockholm, Sweden

Abstract

This paper describes a recently introduced vector-valued representation of fundamental frequency variation – the FFV spectrum – which has a number of desirable properties. In particular, it is instantaneous, continuous, distributed, and well-suited to application of standard acoustic modeling techniques. We show what the representation looks like, and how it can be used to model prosodic sequences.

Introduction

While speech recognition systems have long ago transitioned from localization to spectral (vector-valued) formant representations, prosodic processing continues to rely squarely on a pitch tracker's ability to identify a peak, corresponding to the fundamental frequency (F0) of the speaker. Peak localization in acoustic signals is particularly prone to error, and pitch trackers (cf. de Cheveigné & Kawahara, 2002) and downstream speech processing applications (Shriberg & Stolcke, 2004) employ dynamic programming, non-linear filtering, and linearization to improve robustness. These methods introduce long-term dependencies which violate the temporal locality of the F0 estimate, whose measurement error may be better handled by statistical modeling than by (linear) rule-based schemes. Even if a robust, local, analytic, statistical estimate of absolute pitch were available, applications require a representation of pitch variation and go to considerable additional effort to identify a speaker-dependent quantity for normalization (e.g. Edlund & Heldner, 2005).

In the current work, we describe a recently derived representation of fundamental frequency variation (see also Laskowski, Edlund, & Heldner, 2008a, 2008b; Laskowski, Wölfel, Heldner, & Edlund, in press), which implicitly addresses most if not all of the above issues. This spectral representation, which we will refer to here as the fundamental frequency variation (FFV) spectrum, is (1) instantaneous, not relying on adjacent frames; (2) continuous, defined for all frames; (3) distributed; and (4) potentially sparse, making it suitable for the application of standard acoustic modeling techniques, including bottom-up, continuous statistical sequence learning.

In previous work, we have shown that this representation is useful for modeling prosodic sequences for prediction of speaker change in the context of conversational spoken dialogue systems (Laskowski et al., 2008a, 2008b); however, the representation is potentially useful for any prosodic sequence modeling task.

The fundamental frequency variation spectrum

Instantaneous variation in pitch is normally computed by determining a single scalar, the fundamental frequency, at two temporally adjacent instants and forming their difference. F0 represents the frequency of the first harmonic in a spectral representation of a frame of audio, and is undefined for signals without harmonic structure. In the context of speech processing applications, we view the localization of the first harmonic and the subsequent differencing of two adjacent estimates as a case of suboptimal feature compression and premature inference, since the goal of such applications is not the accurate estimation of pitch. Instead, we want to leverage the fact that all harmonics are equally spaced in adjacent frames, and use every element of a spectral representation to yield a representation of the F0 delta.

To this end, we propose a vector-valued representation of pitch variation, inspired by vanishing-point perspective, a technique used in architectural drawing and grounded in projective geometry. While the standard inner product between two vectors can be viewed as the summation of pair-wise products, with pairs selected by orthonormal projection onto a point at infinity, the proposed vanishing-point product induces a 1-point perspective projection onto a point at τ (Figure 1). When applied to two vectors representing a signal's spectral content, FL and FR, at two temporally adjacent instants, the vanishing-point product yields the standard dot product between FL and a dilated version of FR, or between FR and a dilated version of FL, for positive and negative values of τ, respectively.
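To make the vanishing-point product concrete, the toy sketch below (our own illustration, not code from the paper) compares one magnitude spectrum against dilated versions of the other. For simplicity, the dilation is parameterized here directly as a factor of 2 to the power rho (i.e., in octaves) rather than through the vanishing-point location τ itself; the mapping from τ to a dilation factor is made explicit later in the text. The product is cosine-normalized, so it peaks when the dilation matches the change in F0.

```python
import numpy as np

def dilated(mag, rho):
    """Evaluate a magnitude spectrum at the scaled bin positions 2**(-rho) * k
    (linear interpolation; positions beyond the last bin are treated as zero)."""
    k = np.arange(len(mag), dtype=float)
    return np.interp(2.0 ** (-rho) * k, k, mag, left=0.0, right=0.0)

def vanishing_point_product(mag_left, mag_right, rho):
    """Cosine-normalized dot product between a dilated 'left' magnitude spectrum
    and an undilated 'right' one; rho is the dilation in octaves (an illustrative
    parameterization, not the vanishing-point coordinate tau itself)."""
    a = dilated(mag_left, rho)
    b = mag_right
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy check: a harmonic signal whose F0 rises by half an octave (100 Hz -> 141 Hz)
# between the two instants should make the product peak near rho = +0.5.
fs, n = 8000, 512
t = np.arange(n) / fs
harm = lambda f0: np.abs(np.fft.rfft(sum(np.sin(2 * np.pi * f0 * h * t) for h in range(1, 6))))
left, right = harm(100.0), harm(141.0)
rhos = np.linspace(-1.0, 1.0, 81)
print(rhos[np.argmax([vanishing_point_product(left, right, r) for r in rhos])])
```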

Figure 1. The standard dot product shown as an orthonormal projection onto a point at infinity (left panel), and the proposed vanishing-point product, which reduces to the former as τ → ±∞ (right panel).

The degree of dilation is controlled by the magnitude of τ. The proposed vector-valued representation of pitch variation is the vanishing-point product, evaluated over a continuum of τ. For each analysis window, centered at time t, we compute the short-time frequency representation of the left-half and the right-half portion of the window, leading to FL and FR, respectively, using two asymmetrical windows which are mirror images of each other, as shown in Figure 2.

Figure 2. Left and right windows used for the computation of FL and FR, respectively, consisting of asymmetrical Hamming and Hann window halves. T0 is 4 ms, and T1 is 12 ms, for a full analysis window width of 32 ms. A 32 ms Hamming window is shown for comparison.

FL and FR are N ≡ 512-point Fourier transforms, computed every 8 ms. The peaks of the two windows are 8 ms apart. The FFV spectrum is then given by

$$
g[r] =
\begin{cases}
\dfrac{\sum_k \bigl|\tilde{F}_L(2^{-4r/N}k)\bigr|\cdot\bigl|F_R^{*}[k]\bigr|}
      {\sqrt{\sum_k \bigl|\tilde{F}_L(2^{-4r/N}k)\bigr|^{2}\cdot\sum_k \bigl|F_R^{*}[k]\bigr|^{2}}}\,, & r \ge 0 \\[2ex]
\dfrac{\sum_k \bigl|F_L[k]\bigr|\cdot\bigl|\tilde{F}_R^{*}(2^{+4r/N}k)\bigr|}
      {\sqrt{\sum_k \bigl|F_L[k]\bigr|^{2}\cdot\sum_k \bigl|\tilde{F}_R^{*}(2^{+4r/N}k)\bigr|^{2}}}\,, & r < 0
\end{cases}
$$

where, in each case, summation is from k = −N/2 + 1 to k = N/2; for convenience, r varies over the same range as k. Normalization ensures that g[r] is an energy-independent representation. The frequency-scaled, interpolated values $\tilde{F}_L$ and $\tilde{F}_R$ are given by

$$
\begin{aligned}
\tilde{F}_L(2^{-\rho}k) &= \alpha_L\,F_L\bigl[\lfloor 2^{-\rho}k\rfloor\bigr] + (1-\alpha_L)\,F_L\bigl[\lceil 2^{-\rho}k\rceil\bigr],\\
\tilde{F}_R(2^{+\rho}k) &= \alpha_R\,F_R\bigl[\lfloor 2^{+\rho}k\rfloor\bigr] + (1-\alpha_R)\,F_R\bigl[\lceil 2^{+\rho}k\rceil\bigr],
\end{aligned}
$$

where

$$
\alpha_L = \lceil 2^{-\rho}k\rceil - 2^{-\rho}k, \qquad
\alpha_R = \lceil 2^{+\rho}k\rceil - 2^{+\rho}k.
$$

A sample FFV spectrum, for a voiced frame, is shown in Figure 3; for unvoiced frames, the peak tends to be much lower and the tails much higher. The position of the peak, with respect to r = 0, indicates the current rate of fundamental frequency variation. The sample FFV spectrum shown in Figure 3 thus indicates a single frame with a slightly negative slope, that is, a slightly falling pitch.
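As a concrete, deliberately unoptimized sketch of this per-frame computation, the code below assumes 16 kHz audio, so that the 32 ms analysis window is exactly N = 512 samples; the window construction follows our reading of Figure 2, and all names are ours rather than those of any released implementation.

```python
import numpy as np

FS = 16000                     # sampling rate assumed here (32 ms then equals 512 samples)
N = 512                        # DFT length
T0, T1 = 0.004, 0.012          # window parameters from Figure 2, in seconds

def ffv_windows(fs=FS):
    """Left/right analysis windows, per our reading of Figure 2: mirror-image
    asymmetric windows with peaks 2*T0 = 8 ms apart, built from a slow outer
    Hamming half (T1 = 12 ms) and a faster inner Hann half (2*T0 = 8 ms)."""
    n0, n1 = int(T0 * fs), int(T1 * fs)                  # 64 and 192 samples
    left = np.zeros(2 * (n0 + n1))                       # 512 samples = 32 ms
    left[:n1] = np.hamming(2 * n1)[:n1]                  # rising (outer) half
    left[n1:n1 + 2 * n0] = np.hanning(4 * n0)[2 * n0:]   # falling (inner) half
    return left, left[::-1]                              # right window is the mirror image

def ffv_spectrum(frame, fs=FS):
    """FFV spectrum g[r], r = -N/2+1 .. N/2, of one 32 ms frame: a cosine-
    normalized product between one magnitude spectrum and a version of the
    other dilated by 2**(4r/N), evaluated by linear interpolation.  This sketch
    uses only the non-negative-frequency half of the (symmetric) magnitude
    spectra, whereas the text sums over k = -N/2+1 .. N/2."""
    w_left, w_right = ffv_windows(fs)
    mag_l = np.abs(np.fft.fft(frame * w_left, N))[: N // 2 + 1]
    mag_r = np.abs(np.fft.fft(frame * w_right, N))[: N // 2 + 1]
    k = np.arange(N // 2 + 1, dtype=float)
    g = np.zeros(N)
    for i, r in enumerate(range(-N // 2 + 1, N // 2 + 1)):
        if r >= 0:   # compare a dilated left spectrum against the right spectrum
            a = np.interp(2.0 ** (-4.0 * r / N) * k, k, mag_l, left=0.0, right=0.0)
            b = mag_r
        else:        # compare the left spectrum against a dilated right spectrum
            a = mag_l
            b = np.interp(2.0 ** (+4.0 * r / N) * k, k, mag_r, left=0.0, right=0.0)
        g[i] = np.dot(a, b) / (np.sqrt(np.sum(a * a) * np.sum(b * b)) + 1e-12)
    return g

# Usage: slide over a 1-D sample array x with the 8 ms step mentioned in the text,
# computing one FFV spectrum per 32 ms frame:
#   spectra = [ffv_spectrum(x[s : s + N]) for s in range(0, len(x) - N, int(0.008 * FS))]
```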


Figure 3. A sample fundamental frequency variation spectrum. The x-axis is in octaves per 8 ms.

Figure 4. Filters in two versions of the filterbank. The x-axis is in octaves per second; note that the filterbank is applied to frames in which FL and FR are computed at instants separated by 0.008 s. Two extremity filters at (−2, −1) and (+1, +2) octaves per frame are not shown.

Filterbank

Rather than locating the peak in the FFV spectrum, we utilize the representation as is, and apply a filterbank. The filterbank (FB NEW, shown in Figure 4) attempts to capture meaningful prosodic variation, and contains a conservative trapezoidal filter for perceptually "flat" pitch ('t Hart, Collier, & Cohen, 1990); two trapezoidal filters for "slowly changing" pitch; and two trapezoidal filters for "rapidly changing" pitch. In addition, it contains two rectangular extremity filters with spans of (−2, −1) and (+1, +2) octaves per frame, as we have observed that unvoiced frames have flat rather than decaying tails. This filterbank reduces the input space to 7 scalars per frame.

We show what a "spectrogram" representation looks like when FFV spectra from consecutive frames are stacked alongside one another, in Figure 5, as well as what the representation looks like after being passed through filterbank FB NEW of Figure 4.

Figure 5. Spectrogram for a 500 ms fragment of audio (top panel, upper frequency of 2 kHz); the FFV spectrogram for the same fragment (middle panel); and the same FFV spectrogram (bottom panel) after being passed through the FB NEW filterbank as shown in Figure 4.
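To make the dimensionality reduction concrete, the sketch below applies a bank of filters of the general shape described above (one "flat", two "slowly changing", two "rapidly changing", and two rectangular extremity filters) to a single FFV spectrum. The paper does not list the trapezoid breakpoints of FB NEW, so the numeric values used here are placeholders of our own choosing, not the published filter parameters.

```python
import numpy as np

def trapezoid(r, flat_lo, flat_hi, ramp):
    """Trapezoidal filter over the FFV abscissa r: 1 on [flat_lo, flat_hi],
    with linear ramps of width `ramp` on either side."""
    rise = np.clip((r - (flat_lo - ramp)) / ramp, 0.0, 1.0)
    fall = np.clip(((flat_hi + ramp) - r) / ramp, 0.0, 1.0)
    return np.minimum(rise, fall)

def rectangle(r, lo, hi):
    """Rectangular (extremity) filter over [lo, hi]."""
    return ((r >= lo) & (r <= hi)).astype(float)

def apply_fb_new(g, N=512):
    """Reduce an N-point FFV spectrum g[r] to 7 scalars for one frame.
    Filter placement mirrors the description of FB NEW (one 'flat', two
    'slowly changing', two 'rapidly changing', two extremity filters);
    the numeric breakpoints are illustrative placeholders only."""
    r = np.arange(-N // 2 + 1, N // 2 + 1) * (4.0 / N)  # octaves per 8 ms frame separation
    filters = [
        trapezoid(r, -0.05, +0.05, 0.05),  # perceptually "flat" pitch
        trapezoid(r, -0.30, -0.10, 0.10),  # slowly falling
        trapezoid(r, +0.10, +0.30, 0.10),  # slowly rising
        trapezoid(r, -0.80, -0.35, 0.15),  # rapidly falling
        trapezoid(r, +0.35, +0.80, 0.15),  # rapidly rising
        rectangle(r, -2.0, -1.0),          # extremity filter (flat unvoiced tails)
        rectangle(r, +1.0, +2.0),          # extremity filter
    ]
    return np.array([float(np.dot(f, g)) for f in filters])
```

Applying such a filterbank to every column of the FFV spectrogram yields one 7-dimensional feature vector per frame, which serves as the observation for the sequence models of the next section.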

Modeling FFV spectra sequences

In order to transition from vectors of frame-by-frame FFV spectra passed through a filterbank to something more like what we normally associate with intonation, such as flat, falling, and rising pitch movements, sequences of FFV spectra need to be modeled. A standard option for modeling sequences involves training hidden Markov models (HMMs). In previous work, we have used fully-connected HMMs consisting of four states with one Gaussian per state (see Figure 6). However, other HMM topologies are also possible.

Figure 6. A fully-connected hidden Markov model (HMM) consisting of four states with one Gaussian per state.
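As a sketch of such a model, the snippet below sets up and trains a four-state, fully-connected HMM with a single Gaussian per state on sequences of 7-dimensional filterbank outputs. It uses the hmmlearn package as a stand-in, since the text does not name a toolkit; the default GaussianHMM transition structure is ergodic, matching Figure 6, and the diagonal covariance is an assumption of ours.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # stand-in toolkit; the paper does not name one

def train_ffv_hmm(sequences, n_states=4, seed=0):
    """Train a fully-connected (ergodic) HMM with one Gaussian per state on a
    list of sequences, each of shape (n_frames, 7): FFV filterbank outputs."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, random_state=seed)
    model.fit(X, lengths)
    return model

# Classification (e.g. between prosodic or turn-taking classes) then amounts to
# training one HMM per class and picking the class whose model assigns the test
# sequence the highest log-likelihood:  argmax_c  models[c].score(test_seq)
```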

Discussion

We have derived a continuous and instantaneous vector representation of variation in fundamental frequency and given a detailed description of the steps involved, including a graphical demonstration of both the form of the representation and its evolution in time. We have also suggested a method for modeling sequences with HMMs and utilizing the representation in a classification task.

Initial experiments along these lines show that such HMMs, when trained on dialogue data, corroborate research on human turn-taking behavior in conversations. These experiments also suggest that the representation is suitable for direct, principled, continuous modeling (as in automatic speech recognition) of prosodic sequences, which does not require peak identification, dynamic time warping, median filtering, landmark detection, linearization, or mean pitch estimation and subtraction (Laskowski et al., 2008a, 2008b).

We expect the method to be especially useful in situations where online processing is required, such as in conversational spoken dialogue systems. Further experiments will test the method in real systems, for example to support turn-taking decisions. We will also explore the use of the FFV spectrum in combination with other sources of information, such as durational patterns in interaction control.

Immediate next steps include fine-tuning the filterbanks and the HMM topologies, and testing the results on other tasks where pitch movements are expected to play a role, such as the attitudinal coloring of short feedback utterances (e.g. Edlund, House, & Skantze, 2005; Wallers, Edlund, & Skantze, 2006), speaker verification, and automatic speech recognition for tonal languages.

Acknowledgements

We would like to thank Tanja Schultz and Rolf Carlson for encouragement of this collaboration, and Anton Batliner, Rob Malkin, Rich Stern, Ashish Venugopal, and Matthias Wölfel for discussions on several occasions. The work presented here was funded in part by the Swedish Research Council (VR) project 2006-2172.

References

't Hart, J., Collier, R., & Cohen, A. (1990). A perceptual study of intonation: An experimental-phonetic approach to speech melody. Cambridge: Cambridge University Press.

de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917-1930.

Edlund, J., & Heldner, M. (2005). Exploring prosody in interaction control. Phonetica, 62(2-4), 215-226.

Edlund, J., House, D., & Skantze, G. (2005). The effects of prosodic features on the interpretation of clarification ellipses. In Proceedings of Interspeech 2005 (pp. 2389-2392). Lisbon, Portugal.

Laskowski, K., Edlund, J., & Heldner, M. (2008a). An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems. In Proceedings of ICASSP 2008. Las Vegas, NV, USA.

Laskowski, K., Edlund, J., & Heldner, M. (2008b). Machine learning of prosodic sequences using the fundamental frequency variation spectrum. In Proceedings of Speech Prosody 2008. Campinas, Brazil.

Laskowski, K., Wölfel, M., Heldner, M., & Edlund, J. (in press). Computing the fundamental frequency variation spectrum in conversational spoken dialogue systems. In Proceedings of Acoustics '08 Paris. Paris, France.

Shriberg, E., & Stolcke, A. (2004). Direct modeling of prosody: An overview of applications in automatic speech processing. In Proceedings of Speech Prosody 2004 (pp. 575-582). Nara, Japan.

Wallers, Å., Edlund, J., & Skantze, G. (2006). The effects of prosodic features on the interpretation of synthesised backchannels. In E. André, L. Dybkjaer, W. Minker, H. Neumann & M. Weber (Eds.), Proceedings of Perception and Interactive Technologies (PIT'06) (pp. 183-187). Springer.