
GAUSSIAN PROCESS DYNAMICAL MODELS FOR NONPARAMETRIC SPEECH REPRESENTATION AND SYNTHESIS

Gustav Eje Henter1,∗ Marcus R. Frean2 W. Bastiaan Kleijn1,2

1School of Electrical Engineering, KTH – Royal Institute of Technology, Stockholm, Sweden
2School of Engineering and Computer Science, Victoria University of Wellington, New Zealand

∗This research is supported by the LISTA project. The project LISTA acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 256230.

ABSTRACT

We propose Gaussian process dynamical models (GPDMs) as a new, nonparametric paradigm in acoustic models of speech. These use multidimensional, continuous state-spaces to overcome familiar issues with discrete-state, HMM-based speech models. The added dimensions allow the state to represent and describe more than just temporal structure as systematic differences in mean, rather than as mere correlations in a residual (which dynamic features or AR-HMMs do). Being based on Gaussian processes, the models avoid restrictive parametric or linearity assumptions on signal structure. We outline GPDM theory, and describe model setup and initialization schemes relevant to speech applications. Experiments demonstrate subjectively better quality of synthesized speech than from comparable HMMs. In addition, there is evidence for unsupervised discovery of salient speech structure.

Index Terms— acoustic models, stochastic models, nonparametric speech synthesis

1. INTRODUCTION

Hidden Markov models (HMMs) [1] constitute the dominant paradigm in model-based speech recognition and synthesis (e.g., HTS [2]). HMMs are probabilistic, allowing them to deal with uncertainty in a principled manner, and strike an attractive balance between complexity and descriptive power: they avoid restrictive assumptions such as limited memory or linearity, but can still be trained efficiently on large databases.

Unfortunately, HMMs are not satisfactory stochastic representations of speech feature sequences [3]. Sampling from HMMs trained on speech acoustic data reveals several shortcomings of the model, in that durations are incorrect and the sound is warbly and unnatural. Contemporary model-based speech synthesis systems, HMM-based or not, therefore do not sample from stochastic models for signal generation.

In this paper we introduce a new paradigm of nonparametric, nonlinear probabilistic modelling of speech, as exemplified by Gaussian process dynamical models (GPDMs). This approach has the potential to overcome all principal issues with HMMs and provide more realistic acoustic models. Like HMMs, GPDMs may later serve as building blocks for constructing arbitrary speech utterances. In the remainder of the text, we motivate and describe GPDMs in the context of acoustic models of speech signals, and present concrete results from a synthesis application.

2. INTRODUCING GPDMS FOR SPEECH

We here explain the benefits of moving from Markov chains to continuous, multidimensional state-spaces, and introduce GPDMs as nonparametric dynamical models of speech.

2.1. Continuous, multidimensional state-spaces

Let Y = (Y_1 ... Y_N) be a sequence of observations, here speech features, and let X = (X_1 ... X_N) be the corresponding sequence of unobserved latent-state values. The features are typically continuous and real, y_t ∈ R^D, with D between 10 and 100. We consider Y a D×N matrix.

HMMs have a discrete state-space, x_t ∈ {1, ..., M} ∀t. This is sufficient to model piecewise i.i.d. processes, but is not a good fit for speech since 1) HMMs have stepwise constant evolution, while speech mostly changes continuously, and 2) the implicit geometric state-duration distribution of the underlying Markov chain has much greater variance than natural speech sound durations. Dynamic features and hidden semi-Markov models [4] have been proposed to deal with issues 1) and 2), respectively. Both shortcomings can however be addressed simultaneously, by considering a continuous state-space to represent incremental progress and intermediate sounds [5]. This is the approach explored here.
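To make issue 2) concrete: an HMM state with self-transition probability p has a geometrically distributed duration, with mean 1/(1−p) and variance p/(1−p)². The minimal Python sketch below illustrates the resulting spread; the value p = 0.9 is a hypothetical example, not a parameter estimated in this paper.

```python
# Duration statistics of one HMM state with self-transition probability p.
# The state duration is geometric: mean 1/(1-p), variance p/(1-p)^2.
# p = 0.9 is a hypothetical value chosen for illustration only.
p = 0.9
mean_frames = 1.0 / (1.0 - p)        # 10 frames expected duration
std_frames = p ** 0.5 / (1.0 - p)    # about 9.5 frames standard deviation
print(f"mean = {mean_frames:.1f} frames, st. dev. = {std_frames:.1f} frames")
# The spread is nearly as large as the mean itself, i.e., far more
# variable than natural speech sound durations.
```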
Typical speech HMMs use left-right Markov chains to encode long-range dependencies between features at different times, specifically the sequential order of sounds in an utterance. Other dependence-modelling is less structured. Short-range time-dependencies can be described as time-correlated deviations from the state-conditional feature mean, e.g., using dynamic features [3]. This enables gradual changes in expected value. Variation between comparable times in distinct realizations is usually only modelled as Gaussian deviations from a single state-conditional mean value.

In practice, these correlation-based approaches fail to capture important structure in speech variation, and do not produce realistic speech samples [6]. The sampled speech has a rapidly-varying, warbly quality to it due to the large magnitude of the noise-driven deviations from the feature mean. To obtain more pleasant-sounding output, speech synthesizers therefore generally avoid sampling, and only produce the most likely output sequence. This is known as maximum likelihood parameter generation (MLPG) [7].

The models considered here can use multidimensional state-spaces x_t ∈ R^Q to represent structured differences between realizations. The added state-space dimensions may for instance track different pronunciations for the same utterance, e.g., stress-dependent pitch and formant evolution.

As both the continuous state-space and the extra dimensions give us flexibility to explain more empirically observed variation as systematic differences in mean, less variability will be attributed to residual, random variation. The estimated noise magnitude will thus decrease, making samples less warbly and more realistic. We now consider a specific model on such state spaces, based on Gaussian processes.

2.2. Gaussian process dynamical models

A dynamical model for Y is defined by 1) an initial state distribution f_{X_1}(x_1), 2) a stochastic mapping f_{X_{t+1}|X_t}(x_{t+1} | x_t) describing state-space dynamics, and 3) a state-conditional observation distribution f_{Y_t|X_t}(y_t | x_t). In speech, x_t may represent the state of the speaker—particularly the sound being produced—while y_t represents the current acoustic features. In a simple HMM describing a speech utterance or phone (for synthesis), f_{X_{t+1}|X_t} is usually a left-right Markov chain. The output f_{Y_t|X_t} is often Gaussian for synthesis tasks, but GMMs are common in recognition. In this paper, however, both f_{X_{t+1}|X_t} and f_{Y_t|X_t} will be modelled as continuous-space densities, using stochastic regression based on Gaussian processes (GPs). (For a review of Gaussian processes please consult [8].) The resulting construction is known as a Gaussian process dynamical model, GPDM [9, 10].

For the output mapping f_{Y_t|X_t}, GPDMs use a technique known as Gaussian process latent-variable models (GP-LVMs) [11]. These assume the output is a product of Gaussian processes, one for each y_t-dimension, with a shared kernel k_Y(x, x') that depends on the latent variables X_t. The processes are conditionally independent given x_t, similar to assuming diagonal covariance matrices in conventional HMMs. The conditional output distribution becomes

f(y | x, β, w) = \frac{1}{\sqrt{(2π)^{DN} |K_Y(x, β)|^D}} \prod_{d=1}^{D} w_d^N \exp(-\frac{1}{2} w_d^2 y_d K_Y^{-1}(x, β) y_d^T),   (1)

where the kernel matrix has entries (K_Y)_{t,t'} = k_Y(x_t, x_{t'} | β), β being a set of kernel hyperparameters. The scale factors w_d compensate for different variances in different output dimensions. The entries of X are assumed Gaussian and i.i.d.

Using GP-LVMs for the output mapping essentially assumes that acoustic features y_t have similar characteristics (mean and standard deviations) for similar speech states x_t, e.g., the same phone being spoken, though the details depend on the chosen k_Y. This is similar in principle to HMMs, but is more flexible and does not assume that x_t is quantized.
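For concreteness, the log of density (1) can be evaluated as follows. This is a minimal numpy sketch under the notation above; the function name and the kernel-callback interface are our own choices for exposition, not part of any established GPDM codebase.

```python
import numpy as np

def gplvm_log_density(Y, X, w, k_Y):
    """Log of eq. (1): D independent GPs over the N latent positions X,
    sharing the kernel matrix K_Y, with per-dimension scales w.
    Y: D x N observations; X: Q x N latent positions; w: length-D array;
    k_Y: callback returning the N x N kernel matrix for given X."""
    D, N = Y.shape
    K = k_Y(X)                              # N x N kernel matrix
    L = np.linalg.cholesky(K)               # stable log-determinant
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    # Quadratic forms y_d K^{-1} y_d^T, one per output dimension d
    alpha = np.linalg.solve(K, Y.T)         # N x D, equals K^{-1} Y^T
    quad = np.einsum('dn,nd->d', Y, alpha)  # length-D vector
    return (-0.5 * D * N * np.log(2.0 * np.pi)
            - 0.5 * D * logdet
            + N * np.sum(np.log(w))
            - 0.5 * np.sum(w**2 * quad))
```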

GP-LVMs were designed as probabilistic, local, nonlinear extensions of principal component analysis (PCA), and MAP estimation in a GP-LVM will therefore attempt to attribute as much as possible of the observed acoustic y-variation to variations in the underlying speaker state X_t.

GP-LVMs assume X is i.i.d., and have no memory to account for context or to smooth estimated latent-space positions over time. GPDMs endow the GP-LVM states with simple, first-order autoregressive dynamics f_{∆X_t|X_t}, so that ∆X_t = X_{t+1} − X_t is a stochastic function of X_t. (Higher-order dynamics and other constructions are also possible [9].) Specifically, the next-step distributions for the ∆X_t components are assumed to be given by separate Gaussian processes (with a shared kernel k_X(x, x')), conditionally independent of other dimensions and of Y given x_t. The joint probability distribution is more involved than for the GP-LVM, as the dynamics map a space onto itself. It can be written

f(x | α) = f_{X_1}(x_1) \frac{1}{\sqrt{(2π)^{Q(N-1)} |K_X(x, α)|^Q}} \exp(-\frac{1}{2} tr(∆x K_X^{-1}(x, α) ∆x^T)),   (2)

where ∆x = (x_2 − x_1, ..., x_N − x_{N−1}). This distribution is not Gaussian as K_X depends on x, and fair sampling requires Metropolis-type algorithms. An approximation called mean prediction [9] exists for sequentially generating latent-space trajectories of high likelihood, analogous to MLPG.
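The sketch below illustrates such sequential generation from learned first-order dynamics, assuming an estimated latent trajectory X and a kernel callback k_X. Taking each step as the GP posterior mean corresponds to mean prediction; supplying a random generator instead samples each step from the Gaussian next-step distribution (cf. the approximate sampling procedure described in Section 4.1). All names and interfaces are illustrative.

```python
import numpy as np

def generate_trajectory(X, k_X, x0, T, rng=None):
    """Sequentially generate a latent trajectory from learned dynamics.
    X: Q x N training latent positions x_1..x_N; the regression targets
    are the steps dx_t = x_{t+1} - x_t. k_X(A, B) returns the matrix of
    kernel values between rows of A (n x Q) and B (m x Q). With rng=None
    each step is the GP posterior mean (mean prediction); with a
    np.random.Generator, each step is sampled from the next-step Gaussian."""
    Xin = X[:, :-1].T                       # (N-1) x Q inputs x_t
    dX = np.diff(X, axis=1).T               # (N-1) x Q targets dx_t
    K = k_X(Xin, Xin)                       # (N-1) x (N-1) kernel matrix
    Kinv_dX = np.linalg.solve(K, dX)
    traj = [np.asarray(x0, float)]
    for _ in range(T - 1):
        x = traj[-1]
        k_star = k_X(x[None, :], Xin)[0]    # cross-covariances to data
        step = k_star @ Kinv_dX             # GP posterior mean of dx
        if rng is not None:
            # Shared predictive variance across dimensions (shared kernel)
            var = k_X(x[None, :], x[None, :])[0, 0] \
                  - k_star @ np.linalg.solve(K, k_star)
            step = step + rng.standard_normal(len(x)) * np.sqrt(max(var, 0.0))
        traj.append(x + step)
    return np.stack(traj, axis=1)           # Q x T generated trajectory
```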
Using GPs to describe state dynamics represents an assumption that the state of the speaker, and thus the acoustic output, evolves similarly when the state is similar, quite like how HMMs work but without the discretization.

By endowing all hyperparameters with priors, a fully Bayesian nonparametric dynamical model is obtained. MAP estimating the unobserved GPDM variables α, β, W, and X is straightforward using gradient methods. However, there are many local optima and the estimated latent-space trajectories are typically quite noisy, since the GPDM tries to place as much variation as possible in the latent space.

To prevent such overfitting, the unknown X can be integrated out using Monte-Carlo EM, leading to low process noise and highly realistic samples in a motion capture application [10] (video). Alternatively, one may choose a fixed α with low noise to obtain smooth dynamics, as we do here.
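As an illustration of gradient-based MAP estimation with the hyperparameters held fixed, the sketch below optimizes the latent trajectory with L-BFGS. It reuses the gplvm_log_density sketch above and takes the dynamics density (2) as a supplied callback; finite-difference gradients are used for brevity, although analytic gradients are standard in practice.

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate_X(Y, X0, w, k_Y, log_dyn_density, n_updates=1000):
    """Gradient-based MAP estimate of the latent trajectory X with fixed
    hyperparameters. The objective is the negative log joint: the output
    density (1), via the gplvm_log_density sketch above, plus the dynamics
    density (2), supplied here as the callback log_dyn_density(X)."""
    def objective(x_flat):
        X = x_flat.reshape(X0.shape)
        return -(gplvm_log_density(Y, X, w, k_Y) + log_dyn_density(X))

    res = minimize(objective, X0.ravel(), method='L-BFGS-B',
                   options={'maxiter': n_updates})
    return res.x.reshape(X0.shape)
```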
3. IMPLEMENTING GPDMS FOR SPEECH

GPDMs were first introduced to model motion capture data, and we are unaware of any prior applications to speech.¹ We here discuss specific issues in using GPDMs as speech acoustic models, and propose an initialization scheme for sequential signals such as speech utterances.

¹A reviewer noted that [12] provides an application to ASR; it was published while this paper was in review.

3.1. Feature representation

To create speech features suitable for GPDMs and to synthesize speech, we employed the widely used STRAIGHT system [13]. The STRAIGHT analysis yields: 1) an F0 contour with voiced/unvoiced indication, 2) a filter spectrum, and 3) an aperiodicity spectrum. To get a more compact feature set, we represented the two spectra by 40 and 10 MFCCs, respectively, and downsampled the features to 100 fps. We also removed the mean of each component over the dataset.

The relative scale of the STRAIGHT outputs is arbitrary. Even though the scaling factors w in (1) can in principle compensate for different feature SNRs, we normalized all dimensions to have unit noise magnitude, as this is beneficial for PCA-based initialization schemes. Component SNRs were estimated by fitting a third-order AR-process to each dimension and looking at the magnitude of the driving noise.

The HTS system [2] uses a mixture of continuous and discrete distributions to represent voiced pitch or unvoiced excitations. This is unsuitable for GPs, which are designed for continuous data spaces. For simplicity, we have restricted ourselves to voiced speech in this initial work.
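A minimal sketch of this normalization follows; the least-squares AR fit and the function name are our own illustration of the procedure just described.

```python
import numpy as np

def normalize_noise(features, order=3):
    """Scale each feature dimension to unit driving-noise magnitude.
    features: D x N array. An AR model of the given order is fit to each
    dimension by least squares; the residual (driving-noise) standard
    deviation is then divided out."""
    out = np.empty_like(features, dtype=float)
    for d, y in enumerate(features):
        # Regress y_t on its `order` previous values
        cols = [y[order - k - 1:len(y) - k - 1] for k in range(order)]
        A = np.stack(cols, axis=1)                  # (N-order) x order
        target = y[order:]
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef                   # driving-noise estimate
        out[d] = y / resid.std()
    return out
```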

3.2. Covariance functions

Because the feature data mean has been removed, it is appropriate to use zero-mean GPs. We then only need to specify k_X and k_Y to have a fully defined model.

For the dynamics, we chose a simple squared exponential (RBF) kernel with a noise term,

k_X(x, x') = α_1 \exp(-\frac{α_2}{2} ‖x − x'‖^2) + α_3^{-1} δ_{x,x'}.   (3)

Linear and higher-order kernel terms are left as future work.

A similar RBF kernel with a noise term is an appealing choice also for k_Y, to model smooth output with some residual variation. Rapid changes and localized discontinuities such as plosives can be modelled with, e.g., neural network kernel functions [8], though that has not been pursued here.
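In code, kernel (3) amounts to the following; the hyperparameter defaults shown are placeholders rather than the values used in our experiments.

```python
import numpy as np

def k_X(A, B, alpha1=1.0, alpha2=1.0, alpha3=100.0):
    """Eq. (3): RBF kernel with a noise term. A: n x Q, B: m x Q.
    The Kronecker delta is implemented as an indicator of coinciding
    inputs, so the noise term only appears on exact matches."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return alpha1 * np.exp(-0.5 * alpha2 * sq) + (1.0 / alpha3) * (sq == 0.0)
```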

3.3. Advanced initialization

As the likelihood function for the latent x has many local optima, the starting position x^{(0)} in MAP estimation is highly influential in determining the quality of the final model. We hence went to some lengths to compute a starting position that well expresses our expectations on process behaviour.

Initializing the latent-space variable trajectory by PCA, as in [10], ignores the time dimension of the data and produces a model where acoustically similar frames will evolve similarly regardless of utterance position, like in a (non-hidden) Markov chain. This is precisely what we strive to avoid.

As the most important variation in speech occurs along the time dimension, we initialized the first latent coordinate by the time from utterance start, as an indicator of progress through the sentence. Multiple training utterances were aligned by dynamic time warping. Remaining x-dimensions were initialized by PCA, so that points at comparable times were spaced closer or farther according to acoustic similarity.

As the scale of the first latent dimension is arbitrary relative to the other axes, it was rescaled according to ‖∆x_1^{(0)}‖^2 = \frac{1}{Q−1} ‖∆x_{2:Q}^{(0)}‖^2, to have comparable RMS ∆x-magnitude to the remaining dimensions. Finally, the mean was removed and all axes were rescaled equally so that ‖x^{(0)}‖^2 = DN, to match the Gaussian prior mean and variance.
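Putting the steps together, the initialization can be sketched as below, assuming Q ≥ 2 and features that have already been aligned and normalized as in Section 3.1. Names and interfaces are illustrative.

```python
import numpy as np

def init_latents(times, features, Q, D):
    """Sketch of the initialization in Sec. 3.3. times: length-N vector of
    (DTW-aligned) time from utterance start; features: D x N normalized
    feature matrix. Returns a Q x N starting trajectory X0."""
    N = len(times)
    X0 = np.empty((Q, N))
    X0[0] = times                                  # progress through utterance
    # Remaining dimensions: PCA scores of the (centred) features
    Yc = features - features.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    X0[1:] = s[:Q - 1, None] * Vt[:Q - 1]
    # Rescale the time axis for comparable RMS step magnitude
    d1 = np.sum(np.diff(X0[0]) ** 2)
    drest = np.sum(np.diff(X0[1:], axis=1) ** 2)
    X0[0] *= np.sqrt(drest / (Q - 1) / d1)
    # Centre, then rescale all axes equally so that ||X0||^2 = D*N
    X0 -= X0.mean(axis=1, keepdims=True)
    X0 *= np.sqrt(D * N / np.sum(X0 ** 2))
    return X0
```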

4. EXPERIMENTS

In order to assess the properties of GPDMs as stochastic models of speech, we performed experiments with utterance synthesis and speech representation, and contrasted the results against comparable HMMs.

4.1. Speech generation

For synthesis applications, we are interested in the quality of samples and maximum probability output of our speech models. This was investigated in a subjective listening test.

The test was based on a dataset containing the fully voiced utterances "I'll willingly marry Marilyn" and "our lawyer will allow your rule," each spoken three times by a single, male speaker. Separate models were trained on the voiced frames of each of the two utterances. For GPDMs, we set Q = 1, fixed the dynamics hyperparameters at 20 dB SNR to get smooth latent trajectories, and performed 1000 gradient updates for training. Left-right HMMs with 40 states per second were also trained on the data using the Baum-Welch algorithm. By not including dynamic features we obtain comparable models that cannot pass any information between frames, apart from the current position within the utterance.

The listening test included the voiced sections of raw database utterances, speech resynthesized from training-data features, mean-predicted output and random samples² from the GPDM, and MLPG output and random samples from the HMMs. Eight subjects were asked to rate these deterministic and stochastic signal sources on a scale from 0 (completely unnatural) to 100 (completely natural) using a MUSHRA-like interface. The resulting scores are summarized in Table 1.

²We used a fast, approximate procedure for sequentially generating latent trajectories. It is similar to mean prediction, but each new point is sampled from the Gaussian next-step distribution, rather than just taking its mean.

                     "I'll willingly..."     "Our lawyer..."
Sound source         Mean     St. dev.       Mean     St. dev.
Database speech      100      0              100      0
Resynthesized        66       19             70       16
GPDM mean pred.      74       15             78       18
GPDM sampled         25       11             23       16
HMM MLPG             63       13             62       15
HMM sampled          16       18             10       10

Table 1. Naturalness scores from the listening test.

In the test, subjects judged GPDM output as more natural than that of HMMs, both for deterministic signals and sampled output. The differences are significant according to paired t-tests (p = 0.0017 and 0.014, respectively). GPDM duration modelling, in particular, is noticeably better, as the continuous state-space can represent incremental progress through the utterance. Sampling consistently scored below deterministic output due to the unnatural, uncorrelated noise.

Interestingly, subjects preferred GPDM mean-predicted output over speech resynthesized from the training-data features (p = 0.038). A possible explanation is that the mean-predicted output is smoother and more stylized.
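The significance tests above are ordinary paired t-tests over per-listener scores; a minimal sketch with hypothetical (not actual) data:

```python
from scipy.stats import ttest_rel

# Hypothetical per-listener naturalness scores for one utterance;
# these are NOT the data underlying Table 1.
gpdm_scores = [74, 70, 81, 65, 77, 72, 79, 74]
hmm_scores = [63, 60, 70, 58, 66, 61, 67, 62]

t_stat, p_value = ttest_rel(gpdm_scores, hmm_scores)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```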

4.2. Speech representation

In a second experiment, we explored how the additional state-space dimensions in GPDMs can represent multimodal distributions and speech variability. For this, we used six examples of the utterance "our lawyer will allow your rule," three times pronounced with the stress on "lawyer," and three times instead stressing the word "allow."

If not handled properly, data inconsistencies like this can degrade the quality of traditional speech synthesis systems. However, a GPDM with Q = 3 trained on the data correctly separates the two prosodic variations in the latent space—shown in blue (on top) and red (below) in Figure 1—and can represent both pronunciation patterns simultaneously. At the same time, the three curves from each variant are placed close together, meaning that information is shared between these examples and a common stochastic representation has been learned. Note that this structure was not imposed beforehand (as is typically necessary to model the situation with an HMM), but was recovered automatically from the data.

[Figure 1: latent-space trajectories separated in 3D; axes x_{1t}, x_{2t}, and x_{3t}.]

5. CONCLUSIONS AND FUTURE WORK

We have described how models with continuous, multidimensional state-spaces can avoid the shortcomings of traditional, discrete-state hidden Markov models of speech. Such representations can be constructed without restrictive parametric assumptions using Gaussian process dynamical models. The advantages of nonparametric speech modelling through GPDMs include automatic structure discovery and more natural synthesized speech, as affirmed by experimental evidence.

Further efforts are needed, particularly for the training stage, to realize the full potential of the models and to apply them as building blocks in arbitrary speech synthesis. Work is presently underway to address these limitations. We are also investigating adding dynamic features for improved, frame-correlated noise modelling.

6. REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[2] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. W. Black, and K. Tokuda, "The HMM-based speech synthesis system version 2.0," in Proc ISCA SSW6, 2007, vol. 6, pp. 294–299.

[3] H. Zen, K. Tokuda, and T. Kitamura, "Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences," Comput Speech Lang, vol. 21, no. 1, pp. 153–173, 2007.

[4] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Hidden semi-Markov model based speech synthesis," in Proc ICSLP 2004, 2004, pp. 1393–1396.

[5] G. E. Henter and W. B. Kleijn, "Intermediate-state HMMs to capture continuously-changing signal features," in Proc Interspeech 2011, 2011, vol. 12, pp. 1817–1820.

[6] M. Shannon, H. Zen, and W. Byrne, "The effect of using normalized models in statistical speech synthesis," in Proc Interspeech 2011, 2011, vol. 12.

[7] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc ICASSP 2000, 2000, pp. 1315–1318.

[8] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.

[9] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models," in Proc NIPS 2005, 2006, vol. 18, pp. 1441–1448.

[10] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE T Pattern Anal, vol. 30, no. 2, pp. 283–298, Feb. 2008.

[11] N. D. Lawrence, "The Gaussian process latent variable model," Tech. Rep. CS-06-03, The University of Sheffield, Department of Computer Science, 2006.

[12] H. Park and C. D. Yoo, "Gaussian process dynamical models for phoneme classification," in NIPS 2011 Workshop on Bayesian Nonparametrics: Hope or Hype?, 2011.

[13] H. Kawahara, "STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds," Acoust Sci & Tech, vol. 27, no. 6, pp. 349–353, 2006.