Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg

The fundamental frequency variation spectrum

Kornel Laskowski 1, Mattias Heldner 2 and Jens Edlund 2
1 interACT, Carnegie Mellon University, Pittsburgh PA, USA
2 Centre for Speech Technology, KTH, Stockholm, Sweden

Abstract

This paper describes a recently introduced vector-valued representation of fundamental frequency variation – the FFV spectrum – which has a number of desirable properties. In particular, it is instantaneous, continuous, distributed, and well-suited to application of standard acoustic modeling techniques. We show what the representation looks like, and how it can be used to model prosodic sequences.

Introduction

While speech recognition systems have long ago transitioned from localization to spectral (vector-valued) formant representations, prosodic processing continues to rely squarely on a pitch tracker's ability to identify a peak, corresponding to the fundamental frequency (F0) of the speaker. Peak localization in acoustic signals is particularly prone to error, and pitch trackers (cf. de Cheveigné & Kawahara, 2002) and downstream speech processing applications (Shriberg & Stolcke, 2004) employ dynamic programming, non-linear filtering, and linearization to improve robustness. These methods introduce long-term dependencies which violate the temporal locality of the F0 estimate, whose measurement error may be better handled by statistical modeling than by (linear) rule-based schemes. Even if a robust, local, analytic, statistical estimate of absolute pitch were available, applications require a representation of pitch variation and go to considerable additional effort to identify a speaker-dependent quantity for normalization (e.g. Edlund & Heldner, 2005).

In the current work, we describe a recently derived representation of fundamental frequency variation (see also Laskowski, Edlund, & Heldner, 2008a, 2008b; Laskowski, Wölfel, Heldner, & Edlund, in press), which implicitly addresses most if not all of the above issues. This spectral representation, which we will refer to here as the fundamental frequency variation (FFV) spectrum, is (1) instantaneous, not relying on adjacent frames; (2) continuous, defined for all frames; (3) distributed; and (4) potentially sparse, making it suitable for the application of standard acoustic modeling techniques, including bottom-up, continuous statistical sequence learning.

In previous work, we have shown that this representation is useful for modeling prosodic sequences for prediction of speaker change in the context of conversational spoken dialogue systems (Laskowski et al., 2008a, 2008b); however, the representation is potentially useful for any prosodic sequence modeling task.

The fundamental frequency variation spectrum

Instantaneous variation in pitch is normally computed by determining a single scalar, the fundamental frequency, at two temporally adjacent instants and forming their difference. F0 represents the frequency of the first harmonic in a spectral representation of a frame of audio, and is undefined for signals without harmonic structure. In the context of speech processing applications, we view the localization of the first harmonic and the subsequent differencing of two adjacent estimates as a case of suboptimal feature compression and premature inference, since the goal of such applications is not the accurate estimation of pitch. Instead, we want to leverage the fact that all harmonics are equally spaced in adjacent frames, and use every element of a spectral representation to yield a representation of the F0 delta.

To this end, we propose a vector-valued representation of pitch variation, inspired by vanishing-point perspective, a technique used in architectural drawing and grounded in projective geometry. While the standard inner product between two vectors can be viewed as the summation of pair-wise products, with pairs selected by orthonormal projection onto a point at infinity, the proposed vanishing-point product induces a 1-point perspective projection onto a point at τ (Figure 1). When applied to two vectors representing a signal's spectral content, FL and FR, at two temporally adjacent instants, the vanishing-point product yields the standard dot product between FL and a dilated version of FR, or between FR and a dilated version of FL, for positive and negative values of τ, respectively.
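To make the vanishing-point product concrete, the toy sketch below (our own illustration, not code from the paper) compares one magnitude spectrum against dilated versions of the other. For simplicity, the dilation is parameterized here directly as a factor of 2 to the power rho (i.e., in octaves) rather than through the vanishing-point location τ itself; the mapping from τ to a dilation factor is made explicit later in the text. The product is cosine-normalized, so it peaks when the dilation matches the change in F0.

```python
import numpy as np

def dilated(mag, rho):
    """Evaluate a magnitude spectrum at the scaled bin positions 2**(-rho) * k
    (linear interpolation; positions beyond the last bin are treated as zero)."""
    k = np.arange(len(mag), dtype=float)
    return np.interp(2.0 ** (-rho) * k, k, mag, left=0.0, right=0.0)

def vanishing_point_product(mag_left, mag_right, rho):
    """Cosine-normalized dot product between a dilated 'left' magnitude spectrum
    and an undilated 'right' one; rho is the dilation in octaves (an illustrative
    parameterization, not the vanishing-point coordinate tau itself)."""
    a = dilated(mag_left, rho)
    b = mag_right
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy check: a harmonic signal whose F0 rises by half an octave (100 Hz -> 141 Hz)
# between the two instants should make the product peak near rho = +0.5.
fs, n = 8000, 512
t = np.arange(n) / fs
harm = lambda f0: np.abs(np.fft.rfft(sum(np.sin(2 * np.pi * f0 * h * t) for h in range(1, 6))))
left, right = harm(100.0), harm(141.0)
rhos = np.linspace(-1.0, 1.0, 81)
print(rhos[np.argmax([vanishing_point_product(left, right, r) for r in rhos])])
```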

Figure 1. The standard dot product shown as an orthonormal projection onto a point at infinity (left panel), and the proposed vanishing-point product, which reduces to the former as τ → ±∞ (right panel).

The degree of dilation is controlled by the magnitude of τ. The proposed vector-valued representation of pitch variation is the vanishing-point product, evaluated over a continuum of τ. For each analysis window, centered at time t, we compute the short-time frequency representation of the left-half and the right-half portion of the window, leading to FL and FR, respectively, using two asymmetrical windows which are mirror images of each other, as shown in Figure 2.

Figure 2. Left and right windows used for the computation of FL and FR, respectively, consisting of asymmetrical Hamming and Hann window halves. T0 is 4 ms, and T1 is 12 ms, for a full analysis window width of 32 ms. A 32 ms Hamming window is shown for comparison.

FL and FR are N ≡ 512-point Fourier transforms, computed every 8 ms. The peaks of the two windows are 8 ms apart. The FFV spectrum is then given by

$$
g[r] =
\begin{cases}
\dfrac{\sum_k \bigl|\tilde{F}_L(2^{-4r/N}k)\bigr|\cdot\bigl|F_R^{*}[k]\bigr|}
      {\sqrt{\sum_k \bigl|\tilde{F}_L(2^{-4r/N}k)\bigr|^{2}\cdot\sum_k \bigl|F_R^{*}[k]\bigr|^{2}}}\,, & r \ge 0 \\[2ex]
\dfrac{\sum_k \bigl|F_L[k]\bigr|\cdot\bigl|\tilde{F}_R^{*}(2^{+4r/N}k)\bigr|}
      {\sqrt{\sum_k \bigl|F_L[k]\bigr|^{2}\cdot\sum_k \bigl|\tilde{F}_R^{*}(2^{+4r/N}k)\bigr|^{2}}}\,, & r < 0
\end{cases}
$$

where, in each case, summation is from k = −N/2 + 1 to k = N/2; for convenience, r varies over the same range as k. Normalization ensures that g[r] is an energy-independent representation. The frequency-scaled, interpolated values $\tilde{F}_L$ and $\tilde{F}_R$ are given by

$$
\begin{aligned}
\tilde{F}_L(2^{-\rho}k) &= \alpha_L\,F_L\bigl[\lfloor 2^{-\rho}k\rfloor\bigr] + (1-\alpha_L)\,F_L\bigl[\lceil 2^{-\rho}k\rceil\bigr],\\
\tilde{F}_R(2^{+\rho}k) &= \alpha_R\,F_R\bigl[\lfloor 2^{+\rho}k\rfloor\bigr] + (1-\alpha_R)\,F_R\bigl[\lceil 2^{+\rho}k\rceil\bigr],
\end{aligned}
$$

where

$$
\alpha_L = \lceil 2^{-\rho}k\rceil - 2^{-\rho}k, \qquad
\alpha_R = \lceil 2^{+\rho}k\rceil - 2^{+\rho}k.
$$

A sample FFV spectrum, for a voiced frame, is shown in Figure 3; for unvoiced frames, the peak tends to be much lower and the tails much higher. The position of the peak, with respect to r = 0, indicates the current rate of fundamental frequency variation. The sample FFV spectrum shown in Figure 3 thus indicates a single frame with a slightly negative slope, that is, a slightly falling pitch.
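As a concrete, deliberately unoptimized sketch of this per-frame computation, the code below assumes 16 kHz audio, so that the 32 ms analysis window is exactly N = 512 samples; the window construction follows our reading of Figure 2, and all names are ours rather than those of any released implementation.

```python
import numpy as np

FS = 16000                     # sampling rate assumed here (32 ms then equals 512 samples)
N = 512                        # DFT length
T0, T1 = 0.004, 0.012          # window parameters from Figure 2, in seconds

def ffv_windows(fs=FS):
    """Left/right analysis windows, per our reading of Figure 2: mirror-image
    asymmetric windows with peaks 2*T0 = 8 ms apart, built from a slow outer
    Hamming half (T1 = 12 ms) and a faster inner Hann half (2*T0 = 8 ms)."""
    n0, n1 = int(T0 * fs), int(T1 * fs)                  # 64 and 192 samples
    left = np.zeros(2 * (n0 + n1))                       # 512 samples = 32 ms
    left[:n1] = np.hamming(2 * n1)[:n1]                  # rising (outer) half
    left[n1:n1 + 2 * n0] = np.hanning(4 * n0)[2 * n0:]   # falling (inner) half
    return left, left[::-1]                              # right window is the mirror image

def ffv_spectrum(frame, fs=FS):
    """FFV spectrum g[r], r = -N/2+1 .. N/2, of one 32 ms frame: a cosine-
    normalized product between one magnitude spectrum and a version of the
    other dilated by 2**(4r/N), evaluated by linear interpolation.  This sketch
    uses only the non-negative-frequency half of the (symmetric) magnitude
    spectra, whereas the text sums over k = -N/2+1 .. N/2."""
    w_left, w_right = ffv_windows(fs)
    mag_l = np.abs(np.fft.fft(frame * w_left, N))[: N // 2 + 1]
    mag_r = np.abs(np.fft.fft(frame * w_right, N))[: N // 2 + 1]
    k = np.arange(N // 2 + 1, dtype=float)
    g = np.zeros(N)
    for i, r in enumerate(range(-N // 2 + 1, N // 2 + 1)):
        if r >= 0:   # compare a dilated left spectrum against the right spectrum
            a = np.interp(2.0 ** (-4.0 * r / N) * k, k, mag_l, left=0.0, right=0.0)
            b = mag_r
        else:        # compare the left spectrum against a dilated right spectrum
            a = mag_l
            b = np.interp(2.0 ** (+4.0 * r / N) * k, k, mag_r, left=0.0, right=0.0)
        g[i] = np.dot(a, b) / (np.sqrt(np.sum(a * a) * np.sum(b * b)) + 1e-12)
    return g

# Usage: slide over a 1-D sample array x with the 8 ms step mentioned in the text,
# computing one FFV spectrum per 32 ms frame:
#   spectra = [ffv_spectrum(x[s : s + N]) for s in range(0, len(x) - N, int(0.008 * FS))]
```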


Figure 3. A sample fundamental frequency variation spectrum. The x-axis is in octaves per 8 ms.

Figure 4. Filters in two versions of the filterbank. The x-axis is in octaves per second; note that the filterbank is applied to frames in which FL and FR are computed at instants separated by 0.008 s. Two extremity filters at (−2, −1) and (+1, +2) octaves per frame are not shown.

Filterbank

Rather than locating the peak in the FFV spectrum, we utilize the representation as is, and apply a filterbank. The filterbank (FB NEW, shown in Figure 4) attempts to capture meaningful prosodic variation, and contains a conservative trapezoidal filter for perceptually "flat" pitch ('t Hart, Collier, & Cohen, 1990); two trapezoidal filters for "slowly changing" pitch; and two trapezoidal filters for "rapidly changing" pitch. In addition, it contains two rectangular extremity filters with spans of (−2, −1) and (+1, +2) octaves per frame, as we have observed that unvoiced frames have flat rather than decaying tails. This filterbank reduces the input space to 7 scalars per frame.

We show what a "spectrogram" representation looks like when FFV spectra from consecutive frames are stacked alongside one another, in Figure 5, as well as what the representation looks like after being passed through filterbank FB NEW of Figure 4.

Figure 5. Spectrogram for a 500 ms fragment of audio (top panel, upper frequency of 2 kHz); the FFV spectrogram for the same fragment (middle panel); and the same FFV spectrogram (bottom panel) after being passed through the FB NEW filterbank as shown in Figure 4.
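To make the dimensionality reduction concrete, the sketch below applies a bank of filters of the general shape described above (one "flat", two "slowly changing", two "rapidly changing", and two rectangular extremity filters) to a single FFV spectrum. The paper does not list the trapezoid breakpoints of FB NEW, so the numeric values used here are placeholders of our own choosing, not the published filter parameters.

```python
import numpy as np

def trapezoid(r, flat_lo, flat_hi, ramp):
    """Trapezoidal filter over the FFV abscissa r: 1 on [flat_lo, flat_hi],
    with linear ramps of width `ramp` on either side."""
    rise = np.clip((r - (flat_lo - ramp)) / ramp, 0.0, 1.0)
    fall = np.clip(((flat_hi + ramp) - r) / ramp, 0.0, 1.0)
    return np.minimum(rise, fall)

def rectangle(r, lo, hi):
    """Rectangular (extremity) filter over [lo, hi]."""
    return ((r >= lo) & (r <= hi)).astype(float)

def apply_fb_new(g, N=512):
    """Reduce an N-point FFV spectrum g[r] to 7 scalars for one frame.
    Filter placement mirrors the description of FB NEW (one 'flat', two
    'slowly changing', two 'rapidly changing', two extremity filters);
    the numeric breakpoints are illustrative placeholders only."""
    r = np.arange(-N // 2 + 1, N // 2 + 1) * (4.0 / N)  # octaves per 8 ms frame separation
    filters = [
        trapezoid(r, -0.05, +0.05, 0.05),  # perceptually "flat" pitch
        trapezoid(r, -0.30, -0.10, 0.10),  # slowly falling
        trapezoid(r, +0.10, +0.30, 0.10),  # slowly rising
        trapezoid(r, -0.80, -0.35, 0.15),  # rapidly falling
        trapezoid(r, +0.35, +0.80, 0.15),  # rapidly rising
        rectangle(r, -2.0, -1.0),          # extremity filter (flat unvoiced tails)
        rectangle(r, +1.0, +2.0),          # extremity filter
    ]
    return np.array([float(np.dot(f, g)) for f in filters])
```

Applying such a filterbank to every column of the FFV spectrogram yields one 7-dimensional feature vector per frame, which serves as the observation for the sequence models of the next section.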

Modeling FFV spectra sequences

In order to transition from vectors of frame-by-frame FFV spectra passed through a filterbank to something more like what we normally associate with intonation, such as flat, falling, and rising pitch movements, sequences of FFV spectra need to be modeled. A standard option for modeling sequences involves training hidden Markov models (HMMs). In previous work, we have used fully-connected HMMs consisting of four states with one Gaussian per state (see Figure 6). However, other HMM topologies are also possible.

Figure 6. A fully-connected hidden Markov model (HMM) consisting of four states with one Gaussian per state.
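As a sketch of such a model, the snippet below sets up and trains a four-state, fully-connected HMM with a single Gaussian per state on sequences of 7-dimensional filterbank outputs. It uses the hmmlearn package as a stand-in, since the text does not name a toolkit; the default GaussianHMM transition structure is ergodic, matching Figure 6, and the diagonal covariance is an assumption of ours.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # stand-in toolkit; the paper does not name one

def train_ffv_hmm(sequences, n_states=4, seed=0):
    """Train a fully-connected (ergodic) HMM with one Gaussian per state on a
    list of sequences, each of shape (n_frames, 7): FFV filterbank outputs."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, random_state=seed)
    model.fit(X, lengths)
    return model

# Classification (e.g. between prosodic or turn-taking classes) then amounts to
# training one HMM per class and picking the class whose model assigns the test
# sequence the highest log-likelihood:  argmax_c  models[c].score(test_seq)
```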

Discussion

We have derived a continuous and instantaneous vector representation of variation in fundamental frequency and given a detailed description of the steps involved, including a graphical demonstration of both the form of the representation and its evolution in time. We have also suggested a method for modeling sequences with HMMs and utilizing the representation in a classification task.

Initial experiments along these lines show that such HMMs, when trained on dialogue data, corroborate research on human turn-taking behavior in conversations. These experiments also suggest that the representation is suitable for direct, principled, continuous modeling (as in automatic speech recognition) of prosodic sequences, which does not require peak identification, dynamic time warping, median filtering, landmark detection, linearization, or mean pitch estimation and subtraction (Laskowski et al., 2008a, 2008b).

We expect the method to be especially useful in situations where online processing is required, such as in conversational spoken dialogue systems. Further experiments will test the method in real systems, for example to support turn-taking decisions. We will also explore the use of the FFV spectrum in combination with other sources of information, such as durational patterns in interaction control.

Immediate next steps include fine-tuning the filterbanks and the HMM topologies, and testing the results on other tasks where pitch movements are expected to play a role, such as the attitudinal coloring of short feedback utterances (e.g. Edlund, House, & Skantze, 2005; Wallers, Edlund, & Skantze, 2006), speaker verification, and automatic speech recognition for tonal languages.

Acknowledgements

We would like to thank Tanja Schultz and Rolf Carlson for encouragement of this collaboration, and Anton Batliner, Rob Malkin, Rich Stern, Ashish Venugopal, and Matthias Wölfel for discussions on several occasions. The work presented here was funded in part by the Swedish Research Council (VR) project 2006-2172.

References

't Hart, J., Collier, R., & Cohen, A. (1990). A perceptual study of intonation: An experimental-phonetic approach to speech melody. Cambridge: Cambridge University Press.

de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917-1930.

Edlund, J., & Heldner, M. (2005). Exploring prosody in interaction control. Phonetica, 62(2-4), 215-226.

Edlund, J., House, D., & Skantze, G. (2005). The effects of prosodic features on the interpretation of clarification ellipses. In Proceedings of Interspeech 2005 (pp. 2389-2392). Lisbon, Portugal.

Laskowski, K., Edlund, J., & Heldner, M. (2008a). An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems. In Proceedings of ICASSP 2008. Las Vegas, NV, USA.

Laskowski, K., Edlund, J., & Heldner, M. (2008b). Machine learning of prosodic sequences using the fundamental frequency variation spectrum. In Proceedings of Speech Prosody 2008. Campinas, Brazil.

Laskowski, K., Wölfel, M., Heldner, M., & Edlund, J. (in press). Computing the fundamental frequency variation spectrum in conversational spoken dialogue systems. In Proceedings of Acoustics '08 Paris. Paris, France.

Shriberg, E., & Stolcke, A. (2004). Direct modeling of prosody: An overview of applications in automatic speech processing. In Proceedings of Speech Prosody 2004 (pp. 575-582). Nara, Japan.

Wallers, Å., Edlund, J., & Skantze, G. (2006). The effects of prosodic features on the interpretation of synthesised backchannels. In E. André, L. Dybkjaer, W. Minker, H. Neumann & M. Weber (Eds.), Proceedings of Perception and Interactive Technologies (PIT'06) (pp. 183-187). Springer.