Generative Model-Based Text-to-Speech Synthesis
Heiga Zen (Google London), February rd, @MIT

Outline
Generative TTS
Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks
Beyond parametric TTS Learned features WaveNet End-to-end
Conclusion & future topics

Text-to-speech as sequence-to-sequence mapping
Automatic speech recognition (ASR): speech → “Hello my name is Heiga Zen”
Machine translation (MT): “Hello my name is Heiga Zen” → “Ich heiße Heiga Zen”
Text-to-speech synthesis (TTS): “Hello my name is Heiga Zen” → speech
Speech production process
[Figure: speech production: a sound source (fundamental frequency; voiced: pulse, unvoiced: noise) driven by air flow is shaped by vocal-tract transfer characteristics; speech carries information by modulation of the carrier wave.]
Typical flow of TTS system
TEXT → NLP frontend (text analysis: sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation; discrete → discrete) → Speech backend (speech synthesis: prosody prediction, waveform generation; discrete → continuous) → SYNTHESIZED SPEECH
Speech synthesis approaches

Rule-based, formant synthesis []
Sample-based, concatenative synthesis []
Model-based, generative synthesis

p(speech | text = “Hello, my name is Heiga Zen.”)
Probabilistic formulation of TTS

Random variables
X : speech waveforms (training data), observed
W : transcriptions (training data), observed
w : given text, observed
x : synthesized speech, unobserved

Synthesis
• Estimate the posterior predictive distribution p(x | w, X, W)
• Sample x̄ from the posterior distribution

Probabilistic formulation
Introduce auxiliary variables (representation) + factorize dependency:

p(x | w, X, W) = Σ_{∀l} Σ_{∀L} ∫∫∫ p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(L | W) p(λ) / p(X) do dO dλ

where
O, o : acoustic features
L, l : linguistic features
λ : model
Approximation (1)
Approximate the sum & integral by best point estimates (like MAP) []:

p(x | w, X, W) ≈ p(x | ô)

where
{ô, l̂, λ̂, Ô, L̂} = argmax_{o, l, λ, O, L} p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(L | W) p(λ)
Approximation (2)

Joint → step-by-step maximization []:

Ô = argmax_O p(X | O)              Extract acoustic features
L̂ = argmax_L p(L | W)              Extract linguistic features
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)     Learn mapping
l̂ = argmax_l p(l | w)              Predict linguistic features
ô = argmax_o p(o | l̂, λ̂)          Predict acoustic features
x̄ ∼ f(ô) = p(x | ô)               Synthesize waveform

Representations: acoustic, linguistic, mapping
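The step-by-step maximization above can be sketched as a pipeline of stub functions. This is an illustrative skeleton, not a real TTS toolkit: the function names and the toy bodies (characters as "phones", arbitrary numeric "acoustic features", a trivial upsampling "vocoder") are all made up to show where each argmax/sampling stage would live.

```python
# Sketch of the step-by-step TTS pipeline; every body is a toy stand-in.
def extract_linguistic_features(text):
    """l_hat = argmax_l p(l | w): text analysis (toy: character 'phones')."""
    return list(text.lower().replace(" ", ""))

def predict_acoustic_features(linguistic):
    """o_hat = argmax_o p(o | l_hat, lambda_hat): one toy acoustic value
    per linguistic unit (a trained model would go here)."""
    return [float(ord(p) % 7) for p in linguistic]

def synthesize_waveform(acoustic):
    """x_bar ~ f(o_hat) = p(x | o_hat): vocoder stand-in that upsamples
    each acoustic frame to 3 'samples'."""
    return [a for a in acoustic for _ in range(3)]

l_hat = extract_linguistic_features("hi")
o_hat = predict_acoustic_features(l_hat)
x_bar = synthesize_waveform(o_hat)
print(len(l_hat), len(o_hat), len(x_bar))  # 2 2 6
```

The point of the decomposition is visible in the code: each stage consumes only the previous stage's point estimate, which is exactly the approximation the joint formulation relaxes later in the talk.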
Representation – Linguistic features

Hello, world.            Sentence: length, ...
Hello, world.            Phrase: intonation, ...
hello world              Word: POS, grammar, ...
h-e2 l-ou1 w-er1-l-d     Syllable: stress, tone, ...
h e l ou w er l d        Phone: voicing, manner, ...

→ Based on knowledge about spoken language
• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules
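The multi-level features above can be sketched as a phone-level feature bundle that pools information from the sentence/phrase/word/syllable levels. The feature names, the toy voicing rule, and the values below are made up for illustration; a real frontend derives them from a lexicon, tagger, and phonology rules.

```python
# Illustrative phone-level linguistic feature bundle (all values are toy).
def phone_features(phone, syllable_stress, word_pos):
    """Combine information from several levels of the linguistic
    hierarchy into one phone-level feature dict."""
    return {
        "phone": phone,
        "voicing": phone not in {"h", "t", "k"},   # toy voicing rule, not real phonology
        "syllable_stress": syllable_stress,
        "word_pos": word_pos,
    }

# 'hello' -> phones h e l ou; first syllable carries stress 2 (per slide)
feats = [phone_features(p, 2, "interjection") for p in ["h", "e", "l", "ou"]]
print(feats[0]["voicing"], feats[1]["voicing"])  # False True
```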
Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o):
• Vocal source e(n): pulse train (voiced) / white noise (unvoiced); fundamental frequency, aperiodicity, voicing
• Vocal tract filter h(n): cepstrum, LPC, ...
• Speech: x(n) = h(n) * e(n) (overlap/shift windowing for analysis)

→ Needs to solve an inverse problem
• Estimate parameters from signals
• Use estimated parameters (e.g., cepstrum) as acoustic features
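The forward direction of the source-filter model x(n) = h(n) * e(n) can be sketched in a few lines: a pulse-train excitation convolved with a short vocal-tract impulse response. The period and filter taps below are toy values; a real vocoder estimates h(n) from cepstra or LPC (the inverse problem noted above).

```python
# Minimal source-filter synthesis sketch: x(n) = h(n) * e(n).
def pulse_train(n_samples, period):
    """Voiced excitation e(n): unit pulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def convolve(e, h):
    """x(n) = sum_k h(k) e(n - k): filter the excitation."""
    x = [0.0] * len(e)
    for n in range(len(e)):
        for k in range(len(h)):
            if n - k >= 0:
                x[n] += h[k] * e[n - k]
    return x

e = pulse_train(12, period=4)   # F0 = sample_rate / 4 (toy)
h = [1.0, 0.5, 0.25]            # toy vocal-tract impulse response
x = convolve(e, h)
print(x[:6])  # [1.0, 0.5, 0.25, 0.0, 1.0, 0.5]
```

Swapping the pulse train for white noise gives the unvoiced branch of the same model.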
Representation – Mapping

Rule-based, formant synthesis []

Ô = argmax_O p(X | O)              Vocoder analysis
L̂ = argmax_L p(L | W)              Text analysis
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)     Extract rules
l̂ = argmax_l p(l | w)              Text analysis
ô = argmax_o p(o | l̂, λ̂)          Apply rules
x̄ ∼ f(ô) = p(x | ô)               Vocoder synthesis

→ Hand-crafted rules on knowledge-based features
Representation – Mapping

HMM-based [], statistical parametric synthesis []

Ô = argmax_O p(X | O)              Vocoder analysis
L̂ = argmax_L p(L | W)              Text analysis
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)     Train HMMs
l̂ = argmax_l p(l | w)              Text analysis
ô = argmax_o p(o | l̂, λ̂)          Parameter generation
x̄ ∼ f(ô) = p(x | ô)               Vocoder synthesis

→ Replace rules by an HMM-based generative acoustic model

Outline
Generative TTS
Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks
Beyond parametric TTS Learned features WaveNet End-to-end
Conclusion & future topics

HMM-based generative acoustic model for TTS
• Context-dependent subword HMMs
• Decision trees to cluster & tie HMM states → interpretable

[Figure: linguistic features l_1 ... l_N map to a discrete hidden state sequence q_1, q_2, ..., which generates the continuous acoustic feature sequence o_1, o_2, ..., o_T.]
p(o | l, λ) = Σ_{∀q} [ ∏_{t=1}^{T} p(o_t | q_t, λ) ] P(q | l, λ)    (q_t: hidden state at time t)
            = Σ_{∀q} [ ∏_{t=1}^{T} N(o_t ; μ_{q_t}, Σ_{q_t}) ] P(q | l, λ)
HMM-based generative acoustic model for TTS
• Non-smooth, step-wise statistics → smoothing is essential
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
• Data fragmentation → ineffective, local representation

A lot of research work has been done to address these issues.
Alternative acoustic model

HMM: handles variable length & alignment
Decision tree: maps linguistic → acoustic

[Figure: binary decision tree of yes/no questions mapping linguistic features l to statistics of acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features
→ Replace the tree with a general-purpose regression model: an artificial neural network

FFNN-based acoustic model for TTS [6]
[Figure: feed-forward network; input: frame-level linguistic feature l_t, hidden layer h_t, target: frame-level acoustic feature o_t.]
h_t = g(W_hl l_t + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_oh, b_h, b_o}
λ̂ = argmin_λ Σ_t ‖ o_t − ô_t ‖²

ô_t ≈ E[o_t | l_t] → replaces decision trees & Gaussian distributions
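One forward pass of this FFNN acoustic model fits in a few lines. The sketch below uses tanh as the hidden activation g and toy dimensions (3 linguistic features → 2 hidden units → 1 acoustic feature); all weight values are made up, and training would fit them by the squared-error criterion above.

```python
# One forward pass: h_t = g(W_hl l_t + b_h), o_hat_t = W_oh h_t + b_o.
import math

def matvec(W, v):
    """Matrix-vector product with plain lists."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def ffnn_forward(l_t, W_hl, b_h, W_oh, b_o):
    """Map one frame of linguistic features l_t to predicted acoustic
    features o_hat_t (an estimate of E[o_t | l_t] after training)."""
    h_t = [math.tanh(a + b) for a, b in zip(matvec(W_hl, l_t), b_h)]
    return [a + b for a, b in zip(matvec(W_oh, h_t), b_o)]

# Toy parameters: 3 linguistic features -> 2 hidden units -> 1 acoustic feature
W_hl = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b_h = [0.0, 0.1]
W_oh = [[1.0, -1.0]]
b_o = [0.2]
print(ffnn_forward([1.0, 0.0, 1.0], W_hl, b_h, W_oh, b_o))
```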
RNN-based acoustic model for TTS []
[Figure: recurrent network with recurrent connections in the hidden layer; input: frame-level linguistic features l_{t−1}, l_t, l_{t+1}, target: frame-level acoustic features o_{t−1}, o_t, o_{t+1}.]
h_t = g(W_hl l_t + W_hh h_{t−1} + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_hh, W_oh, b_h, b_o}
λ̂ = argmin_λ Σ_t ‖ o_t − ô_t ‖²

FFNN: ô_t ≈ E[o_t | l_t]    RNN: ô_t ≈ E[o_t | l_1, ..., l_t]
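The recurrent step can be illustrated with scalar weights: unlike the FFNN, the prediction at frame t depends on the whole input history l_1 ... l_t through the carried hidden state. The weight values below are toy choices for illustration, not trained parameters.

```python
# Sketch of h_t = g(W_hl l_t + W_hh h_{t-1} + b_h) with scalar weights.
import math

def rnn_predict(l_seq, w_hl=0.5, w_hh=0.9, b_h=0.0, w_oh=1.0, b_o=0.0):
    """Return predicted acoustic features o_hat_1..o_hat_T for a sequence
    of scalar linguistic features, approximating E[o_t | l_1, ..., l_t]."""
    h, outputs = 0.0, []
    for l_t in l_seq:
        h = math.tanh(w_hl * l_t + w_hh * h + b_h)   # recurrent hidden state
        outputs.append(w_oh * h + b_o)
    return outputs

print(rnn_predict([1.0, 0.0, 0.0]))
```

After the input pulse at t = 1, the output decays smoothly but stays nonzero: the hidden state acts as memory, which is what lets the RNN produce smoothly varying acoustic trajectories.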
NN-based generative acoustic model for TTS

• Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [, 8]
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features []
• Data fragmentation → distributed representation []

The NN-based approach is now mainstream in research & products:
• Models: FFNN [6], MDN [], RNN [], Highway network [], GAN []
• Products: e.g., Google []

Outline
Generative TTS
Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks
Beyond parametric TTS Learned features WaveNet End-to-end
Conclusion & future topics

NN-based generative model for TTS
[Figure: speech production process: sound source (fundamental frequency; voiced: pulse, unvoiced: noise) shaped by vocal-tract transfer characteristics; information carried by modulation of the carrier wave via air flow.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform
Knowledge-based features → learned features
Unsupervised feature learning

[Figure: speech side: an auto-encoder maps the raw FFT spectrum x(t) to an acoustic feature o(t) and reconstructs x̃(t); text side: 1-hot word representations w(n) (e.g., "hello", "world") map to linguistic features l(n).]
• Speech: auto-encoder on FFT spectra [, ] → positive results
• Text: word [6], phone & syllable [] → less positive
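The text-side inputs and outputs in the figure can be sketched concretely: a 1-hot word representation w(n) and a learned embedding lookup standing in for the linguistic feature l(n). The vocabulary and embedding values below are made up for illustration; real embeddings come from training (e.g., word2vec-style models).

```python
# Illustrative 1-hot word input and learned-embedding output (toy values).
vocab = ["hello", "world"]

def one_hot(word):
    """1-hot representation w(n) of a word over the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

# A trained model would learn this table; toy values here.
embedding = {"hello": [0.2, -0.1], "world": [-0.3, 0.4]}

def linguistic_feature(word):
    """Learned feature l(n): an embedding lookup replaces hand-crafted features."""
    return embedding[word]

print(one_hot("world"))            # [0.0, 1.0]
print(linguistic_feature("world"))
```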
Relax approximation: Joint acoustic feature extraction & model training
Two-step optimization → joint optimization []:

Ô = argmax_O p(X | O)
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)
    ↓
{λ̂, Ô} = argmax_{λ, O} p(X | O) p(O | L̂, λ) p(λ)
Joint source-filter & acoustic model optimization:
• HMM [8, , ]
• NN [, ]
Relax approximation: Joint acoustic feature extraction & model training
Mixed-phase cepstral analysis + LSTM-RNN []
[Figure: mixed-phase source-filter model: pulse train p_n and noise drive filters G(z) and H_u(z) to produce speech s_n; cepstra and their derivatives d_t^(v), d_t^(u) link vocoder analysis to an LSTM-RNN over linguistic features l_t, trained jointly via forward and back propagation.]
Relax approximation: Direct mapping from linguistic to waveform
No explicit acoustic features