Generative Model-Based Text-to-Speech Synthesis

Heiga Zen (Google London)
February 3rd, 2017 @ MIT

Outline

Generative TTS

Generative acoustic models for parametric TTS
• Hidden Markov models (HMMs)
• Neural networks

Beyond parametric TTS
• Learned features
• WaveNet
• End-to-end

Conclusion & future topics

Text-to-speech as sequence-to-sequence mapping

Automatic speech recognition (ASR)
  speech → "Hello, my name is Heiga Zen"

Machine translation (MT)
  "Hello, my name is Heiga Zen" → "Ich heiße Heiga Zen"

Text-to-speech synthesis (TTS)
  "Hello, my name is Heiga Zen" → speech

Speech production process

[Figure: the source-filter view of speech production. Linguistic information (text/concept) modulates a carrier wave (air flow): the sound source (voiced: pulse train at the fundamental frequency; unvoiced: noise) is shaped by the frequency transfer characteristics of the vocal tract to produce speech.]

Typical flow of TTS system

TEXT
  ↓ NLP frontend (discrete → discrete): text analysis
      sentence segmentation, word segmentation, text normalization,
      part-of-speech tagging, pronunciation, prosody prediction
  ↓ Speech backend (discrete → continuous): speech synthesis
      waveform generation
SYNTHESIZED SPEECH

Speech synthesis approaches

• Rule-based, formant synthesis [1]
• Sample-based, concatenative synthesis [2]
• Model-based, generative synthesis

p(speech = x | text = "Hello, my name is Heiga Zen.")

Probabilistic formulation of TTS

Random variables
  X : speech waveforms (training data), observed
  W : transcriptions (training data), observed
  w : given text, observed
  x : synthesized speech, unobserved

Synthesis
• Estimate the posterior predictive distribution p(x | w, X, W)
• Sample \bar{x} from the posterior distribution

Probabilistic formulation

Introduce auxiliary variables (representation) + factorize dependency:

p(x | w, X, W) = \sum_{\forall l} \sum_{\forall L} \iiint p(x | o)\, p(o | l, \lambda)\, p(l | w)\,
                 p(X | O)\, p(O | L, \lambda)\, p(\lambda)\, p(L | W) / p(X) \; do\, dO\, d\lambda

where
  O, o : acoustic features
  L, l : linguistic features
  λ    : model (mapping)

Approximation (1)

Approximate the sums & integrals by best point estimates (MAP-like) [3]:

p(x | w, X, W) \approx p(x | \hat{o})

where

\{\hat{o}, \hat{l}, \hat{O}, \hat{L}, \hat{\lambda}\}
  = \arg\max_{o, l, O, L, \lambda} p(x | o)\, p(o | l, \lambda)\, p(l | w)\, p(X | O)\, p(O | L, \lambda)\, p(\lambda)\, p(L | W)

Approximation (2)

Joint → step-by-step maximization [3]:

\hat{O} = \arg\max_O p(X | O)                                    [Extract acoustic features]
\hat{L} = \arg\max_L p(L | W)                                    [Extract linguistic features]
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)    [Learn mapping]
\hat{l} = \arg\max_l p(l | w)                                    [Predict linguistic features]
\hat{o} = \arg\max_o p(o | \hat{l}, \hat{\lambda})               [Predict acoustic features]
\bar{x} \sim f_x(\hat{o}) = p(x | \hat{o})                       [Synthesize waveform]

Representations: acoustic features, linguistic features, mapping
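As an illustration only (not from the talk), the sketch below walks through these steps in Python: random arrays stand in for the extracted features, and a least-squares linear-Gaussian model stands in for the learned mapping λ.

    import numpy as np

    # Minimal runnable sketch of the step-by-step pipeline above. A linear-Gaussian
    # mapping (fit by ordinary least squares, its maximum-likelihood estimate)
    # stands in for the real acoustic model; random arrays stand in for features
    # extracted by vocoder/text analysis. All names and sizes are illustrative.
    rng = np.random.default_rng(0)
    L_hat = rng.normal(size=(1000, 20))             # L^: linguistic features (training)
    O_hat = (L_hat @ rng.normal(size=(20, 3))
             + 0.1 * rng.normal(size=(1000, 3)))    # O^: acoustic features (training)

    # Learn mapping: lambda^ = argmax_lambda p(O^ | L^, lambda) p(lambda).
    # For a linear model with Gaussian noise and a flat prior this is least squares.
    lam, *_ = np.linalg.lstsq(L_hat, O_hat, rcond=None)

    # Synthesis: l^ from text analysis, then o^ = argmax_o p(o | l^, lambda^),
    # which for a Gaussian predictive distribution is just its mean.
    l_new = rng.normal(size=(1, 20))
    o_hat = l_new @ lam                             # predicted acoustic features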

Representation – Linguistic features

Hello, world.          Sentence: length, ...
Hello, world.          Phrase: intonation, ...
hello world            Word: POS, grammar, ...
h-e2 l-ou1 w-er1-l-d   Syllable: stress, tone, ...
h e l ou w er l d      Phone: voicing, manner, ...

→ Based on knowledge about spoken language
• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules

Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o):

[Figure: a vocal source (pulse train for voiced sounds, white noise for unvoiced; fundamental frequency, aperiodicity, voicing) produces the excitation e(n), which drives a vocal tract filter h(n) (cepstrum, LPC, ...), giving speech x(n) = h(n) * e(n); analysis uses overlapping, shifted windows.]

→ Needs to solve an inverse problem
• Estimate parameters from signals
• Use estimated parameters (e.g., cepstrum) as acoustic features
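A toy sketch of this source-filter model, assuming an arbitrary all-pole filter and a fixed F0 (illustrative values, not from the talk):

    import numpy as np
    from scipy.signal import lfilter

    # Toy source-filter synthesis, x(n) = h(n) * e(n): a voiced excitation
    # (pulse train at F0) is passed through an all-pole, LPC-like vocal-tract
    # filter. The filter coefficients are arbitrary but stable.
    fs, f0 = 16000, 120                   # sampling rate [Hz], fundamental freq [Hz]
    e = np.zeros(fs // 10)                # 100 ms of excitation e(n)
    e[:: fs // f0] = 1.0                  # voiced source: one pulse every fs/F0 samples
    # For unvoiced sounds, use white noise instead: e = np.random.randn(e.size)
    a = np.array([1.0, -1.8, 0.97])       # all-pole filter denominator (poles inside unit circle)
    x = lfilter([1.0], a, e)              # speech x(n) = h(n) * e(n)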

Representation – Mapping

Rule-based, formant synthesis [1]:

\hat{O} = \arg\max_O p(X | O)                                    [Vocoder analysis]
\hat{L} = \arg\max_L p(L | W)                                    [Text analysis]
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)    [Extract rules]
\hat{l} = \arg\max_l p(l | w)                                    [Text analysis]
\hat{o} = \arg\max_o p(o | \hat{l}, \hat{\lambda})               [Apply rules]
\bar{x} \sim f_x(\hat{o}) = p(x | \hat{o})                       [Vocoder synthesis]

→ Hand-crafted rules on knowledge-based features

Representation – Mapping

HMM-based [4], statistical parametric synthesis [5]:

\hat{O} = \arg\max_O p(X | O)                                    [Vocoder analysis]
\hat{L} = \arg\max_L p(L | W)                                    [Text analysis]
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)    [Train HMMs]
\hat{l} = \arg\max_l p(l | w)                                    [Text analysis]
\hat{o} = \arg\max_o p(o | \hat{l}, \hat{\lambda})               [Parameter generation]
\bar{x} \sim f_x(\hat{o}) = p(x | \hat{o})                       [Vocoder synthesis]

→ Replace rules by an HMM-based generative acoustic model

HMM-based generative acoustic model for TTS

• Context-dependent subword HMMs
• Decision trees to cluster & tie HMM states → interpretable

[Figure: discrete linguistic features l_1 ... l_N determine a discrete hidden state sequence q_1, q_2, ..., which emits the continuous acoustic feature sequence o_1 ... o_T.]

p(o | l, \lambda) = \sum_{\forall q} \prod_{t=1}^{T} p(o_t | q_t, \lambda)\, P(q | l, \lambda)
                  = \sum_{\forall q} \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_{q_t}, \Sigma_{q_t})\, P(q | l, \lambda)

where q_t is the hidden state at time t.
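The sum over all state sequences is computed efficiently by the forward algorithm; below is a made-up 3-state left-to-right example with unit-covariance Gaussians (an illustration, not a real TTS model):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Forward algorithm computing p(o | l, lambda) = sum over state sequences q
    # of prod_t N(o_t; mu_{q_t}, Sigma_{q_t}) P(q | l, lambda), for a toy
    # 3-state left-to-right HMM with identity covariances.
    T, D, S = 6, 2, 3
    rng = np.random.default_rng(1)
    o = rng.normal(size=(T, D))                   # acoustic feature sequence o_1..o_T
    mu = rng.normal(size=(S, D))                  # state output means mu_q
    A = np.array([[0.6, 0.4, 0.0],                # left-to-right transition matrix
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
    b = np.stack([multivariate_normal.pdf(o, mean=mu[s]) for s in range(S)], axis=1)
    alpha = np.zeros((T, S))
    alpha[0, 0] = b[0, 0]                         # always start in the first state
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]      # recursive sum over q_{t-1}
    likelihood = alpha[-1, -1]                    # p(o | l, lambda), ending in the last state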

Limitations of the HMM-based approach:
• Non-smooth, step-wise statistics → smoothing is essential
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
• Data fragmentation → ineffective, local representation

A lot of research work has been done to address these issues.

Alternative acoustic model

• HMM: handles variable length & alignment
• Decision tree: maps linguistic → acoustic

[Figure: a binary decision tree of yes/no questions on the linguistic features l, whose leaves store statistics of the acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features

→ Replace the tree with a general-purpose regression model: an artificial neural network

FFNN-based acoustic model for TTS [6]

[Figure: frame-level linguistic features l_t (input) pass through a hidden layer h_t to predict frame-level acoustic features o_t (target).]

h_t = g(W_{hl} l_t + b_h)
\hat{o}_t = W_{oh} h_t + b_o
\lambda = \{W_{hl}, W_{oh}, b_h, b_o\}
\hat{\lambda} = \arg\min_{\lambda} \sum_t \| o_t - \hat{o}_t \|^2

\hat{o}_t \approx E[o_t | l_t]  →  replaces decision trees & Gaussian distributions
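A minimal numpy rendering of this forward pass (dimensions and weights are arbitrary; training by backpropagation is omitted):

    import numpy as np

    # Forward pass of the frame-level FFNN acoustic model: linguistic features in,
    # acoustic features out, one hidden layer with tanh as g(.). lambda^ would be
    # found by minimizing sum_t ||o_t - o^_t||^2 over the training data.
    rng = np.random.default_rng(2)
    D_l, D_h, D_o = 300, 256, 40                 # linguistic / hidden / acoustic dims
    W_hl = rng.normal(scale=0.01, size=(D_h, D_l)); b_h = np.zeros(D_h)
    W_oh = rng.normal(scale=0.01, size=(D_o, D_h)); b_o = np.zeros(D_o)

    def predict_frame(l_t):
        h_t = np.tanh(W_hl @ l_t + b_h)          # h_t = g(W_hl l_t + b_h)
        return W_oh @ h_t + b_o                  # o^_t = W_oh h_t + b_o ~ E[o_t | l_t]

    o_hat = predict_frame(rng.normal(size=D_l))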

RNN-based acoustic model for TTS [7]

[Figure: as the FFNN, with recurrent connections between the hidden layers of consecutive frames.]

h_t = g(W_{hl} l_t + W_{hh} h_{t-1} + b_h)
\hat{o}_t = W_{oh} h_t + b_o
\lambda = \{W_{hl}, W_{hh}, W_{oh}, b_h, b_o\}
\hat{\lambda} = \arg\min_{\lambda} \sum_t \| o_t - \hat{o}_t \|^2

FFNN: \hat{o}_t \approx E[o_t | l_t]      RNN: \hat{o}_t \approx E[o_t | l_1, ..., l_t]
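The same in numpy, showing how the recurrent term threads past context into each prediction (again with arbitrary dimensions):

    import numpy as np

    # The RNN recurrence above: the hidden state summarizes past frames, so the
    # prediction approximates E[o_t | l_1, ..., l_t] rather than E[o_t | l_t].
    rng = np.random.default_rng(3)
    D_l, D_h, D_o, T = 300, 128, 40, 50
    W_hl = rng.normal(scale=0.01, size=(D_h, D_l))
    W_hh = rng.normal(scale=0.01, size=(D_h, D_h))
    W_oh = rng.normal(scale=0.01, size=(D_o, D_h))
    b_h, b_o = np.zeros(D_h), np.zeros(D_o)

    h, o_hat = np.zeros(D_h), []
    for l_t in rng.normal(size=(T, D_l)):         # frame-level linguistic features
        h = np.tanh(W_hl @ l_t + W_hh @ h + b_h)  # h_t = g(W_hl l_t + W_hh h_{t-1} + b_h)
        o_hat.append(W_oh @ h + b_o)              # o^_t = W_oh h_t + b_o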

Heiga Zen Generative Model-Based Text-to-Speech Synthesis February ￿rd,￿￿￿￿ ￿￿of￿￿ NN-based approach is now mainstream in research & products Models: FFNN [6], MDN [￿￿], RNN [￿], Highway network [￿￿], GAN [￿￿] • Products: e.g., Google [￿￿] •

NN-based generative acoustic model for TTS

Non-smooth, step-wise statistics • RNN predicts smoothly varying acoustic features [￿, 8] ! Dif￿cult to use high-dimensional acoustic features (e.g., raw spectra) • Layered architecture can handle high-dimensional features [￿] ! Data fragmentation • Distributed representation [￿￿] !

Heiga Zen Generative Model-Based Text-to-Speech Synthesis February ￿rd,￿￿￿￿ ￿￿of￿￿ NN-based generative acoustic model for TTS

Non-smooth, step-wise statistics • RNN predicts smoothly varying acoustic features [￿, 8] ! Dif￿cult to use high-dimensional acoustic features (e.g., raw spectra) • Layered architecture can handle high-dimensional features [￿] ! Data fragmentation • Distributed representation [￿￿] !

NN-based approach is now mainstream in research & products Models: FFNN [6], MDN [￿￿], RNN [￿], Highway network [￿￿], GAN [￿￿] • Products: e.g., Google [￿￿] •

Heiga Zen Generative Model-Based Text-to-Speech Synthesis February ￿rd,￿￿￿￿ ￿￿of￿￿ Outline

Generative TTS

Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks

Beyond parametric TTS Learned features WaveNet End-to-end

Conclusion & future topics NN-based generative model for TTS

[Figure: the speech production process from earlier, annotated with the corresponding TTS stages.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform

Unsupervised feature learning

Knowledge-based features → learned features

[Figure: left, an auto-encoder compresses a raw FFT spectrum x(t) into a low-dimensional acoustic feature o(t); right, a network maps 1-hot word representations w(n) ("hello", "world") into continuous linguistic features l(n).]

• Speech: auto-encoder on FFT spectra [9, 15] → positive results
• Text: word [16], phone & syllable [17] → less positive
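As a sketch of the idea, the snippet below uses a closed-form linear auto-encoder (PCA via the SVD) in place of the deep auto-encoders of [9, 15]; the data is random:

    import numpy as np

    # Unsupervised acoustic feature learning, sketched with a linear auto-encoder
    # computed in closed form via the SVD (equivalent to PCA). This stands in for
    # the deep auto-encoders on FFT spectra of [9, 15].
    rng = np.random.default_rng(4)
    X_spec = np.abs(rng.normal(size=(2000, 513)))   # one |FFT| spectrum x(t) per row
    mu = X_spec.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_spec - mu, full_matrices=False)

    def encode(x):
        return (x - mu) @ Vt[:64].T                 # 513-dim spectrum -> 64-dim feature o(t)

    def decode(o):
        return o @ Vt[:64] + mu                     # reconstruction (the training target)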

Relax approximation – Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

\hat{O} = \arg\max_O p(X | O)
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)
        ↓
\{\hat{O}, \hat{\lambda}\} = \arg\max_{O, \lambda} p(X | O)\, p(O | \hat{L}, \lambda)\, p(\lambda)

Joint source-filter & acoustic model optimization:
• HMM [18, 19, 20]
• NN [21, 22]

Relax approximation – Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]:

[Figure: a waveform-level synthesis structure (pulse train p_n through G(z), unvoiced excitation e_n through H_u(z), summing to speech s_n) sits on top of an LSTM-RNN that maps linguistic features l_t to voiced/unvoiced cepstra o_t^{(v)}, o_t^{(u)} and their derivatives d_t^{(v)}, d_t^{(u)}; forward and back propagation run through the whole structure.]

Relax approximation – Direct mapping from linguistic features to waveform

No explicit acoustic features:

\{\hat{O}, \hat{\lambda}\} = \arg\max_{O, \lambda} p(X | O)\, p(O | \hat{L}, \lambda)\, p(\lambda)
        ↓
\hat{\lambda} = \arg\max_{\lambda} p(X | \hat{L}, \lambda)\, p(\lambda)

Generative models for raw audio:
• LPC [23]
• WaveNet [24]
• SampleRNN [25]

WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals:

x = \{x_0, x_1, ..., x_{N-1}\} : raw waveform

p(x | \lambda) = p(x_0, x_1, ..., x_{N-1} | \lambda) = \prod_{n=0}^{N-1} p(x_n | x_0, ..., x_{n-1}, \lambda)

WaveNet [24] → p(x_n | x_0, ..., x_{n-1}, \lambda) is modeled by a convolutional NN.

Key components:
• Causal dilated convolution: captures long-term dependency
• Gated convolution + residual + skip connections: powerful non-linearity
• Softmax at output: classification rather than regression
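This factorization translates directly into ancestral sampling, one sample at a time; in the sketch below, predict_dist is a hypothetical uniform stand-in for the WaveNet stack:

    import numpy as np

    # Ancestral sampling from the AR factorization above. `predict_dist` is a
    # stand-in for the WaveNet stack: it should return the 256-way categorical
    # distribution p(x_n | x_0, ..., x_{n-1}, lambda); here it is just uniform.
    def predict_dist(history):
        return np.full(256, 1.0 / 256)

    rng = np.random.default_rng(5)
    x = []
    for n in range(16000):                  # one second of audio at 16 kHz
        p = predict_dist(x)                 # condition on all previously drawn samples
        x.append(rng.choice(256, p=p))      # draw one 8-bit sample x_n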

WaveNet – Causal dilated convolution

100 ms at 16 kHz sampling = 1,600 time steps
→ too long to be captured by a normal RNN/LSTM

Dilated convolution: the receptive field size grows exponentially with the number of layers.

[Figure: causal convolutions with dilations 1, 2, 4, 8 stacked over the input x_{n-16}, ..., x_{n-1} (hidden layers 1–3) to model p(x_n | x_{n-1}, ..., x_{n-16}).]
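A minimal numpy sketch of a causal dilated convolution stack with dilations 1, 2, 4, 8, reproducing the 16-sample receptive field of the figure (weights are arbitrary):

    import numpy as np

    # Causal dilated convolution with kernel size 2: the output at step n sees
    # only inputs n and n - dilation. Stacking dilations 1, 2, 4, 8 yields a
    # receptive field of 16 past samples.
    def causal_dilated_conv(x, w, dilation):
        pad = np.concatenate([np.zeros(dilation), x])   # left-pad: never see the future
        return w[0] * pad[:-dilation] + w[1] * pad[dilation:]

    h = np.random.default_rng(6).normal(size=100)
    for d in (1, 2, 4, 8):
        h = causal_dilated_conv(h, np.array([0.5, 0.5]), d)

    receptive_field = 1 + sum(d for d in (1, 2, 4, 8))  # (kernel-1)*d per layer = 16 samples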

WaveNet – Non-linearity

[Figure: a stack of residual blocks with skip connections. Each block applies a 2x1 dilated convolution, a gated activation, and a 1x1 convolution (256 channels) that splits into a residual path feeding the next block and a skip path; the summed skip connections pass through ReLU → 1x1 → ReLU → 1x1 into a softmax over p(x_n | x_0, ..., x_{n-1}).]
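A sketch of one such block (channel sizes follow the figure; the weights are random placeholders, not a real trained model):

    import numpy as np

    # One gated residual block: the dilated convolution output is split into
    # filter and gate halves, combined as tanh(f) * sigmoid(g), then projected
    # by 1x1 convolutions onto a residual path and a skip path.
    def gated_residual_block(x, conv_out, w_res, w_skip):
        f, g = np.split(conv_out, 2, axis=0)           # e.g. 512 ch -> 2 x 256 ch
        z = np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))    # gated activation unit
        return x + w_res @ z, w_skip @ z               # (residual out, skip out)

    C, T = 256, 100
    rng = np.random.default_rng(7)
    res, skip = gated_residual_block(rng.normal(size=(C, T)),
                                     rng.normal(size=(2 * C, T)),
                                     rng.normal(scale=0.05, size=(C, C)),
                                     rng.normal(scale=0.05, size=(C, C)))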

WaveNet – Softmax

[Figure, in three steps: an analog audio signal (amplitude over time) is sampled and quantized; each sample's amplitude becomes a category index, so the waveform becomes a sequence of discrete categories.]

Categorical distribution → a histogram over amplitude categories, which can be
• unimodal
• multimodal
• skewed
• ...
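The discretization uses 8-bit mu-law companding (as in the WaveNet paper) rather than linear quantization; a sketch:

    import numpy as np

    # 8-bit mu-law companding, used to form the 256-way softmax target:
    # compress amplitudes non-linearly (finer resolution near zero, where
    # most speech samples live), then bin into 256 categories.
    def mu_law_encode(x, mu=255):
        y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # x in [-1, 1]
        return np.round((y + 1) / 2 * mu).astype(np.int64)        # category 0..255

    def mu_law_decode(c, mu=255):
        y = 2 * (np.asarray(c, dtype=np.float64) / mu) - 1
        return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu      # back to [-1, 1]

    codes = mu_law_encode(np.linspace(-1, 1, 5))                  # -> [0, 16, 128, 239, 255]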

WaveNet – Conditional modelling

[Figure: linguistic features l are embedded into a conditioning vector h_n at time n and added, via 1x1 convolutions, inside the gated activation of every residual block, so the stack models p(x_n | x_0, ..., x_{n-1}, l).]
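A sketch of the conditioned gated activation, z = tanh(conv_f + V_f h_n) * sigmoid(conv_g + V_g h_n), with illustrative shapes:

    import numpy as np

    # Conditioning sketch: an embedding h_n of the linguistic features l is
    # projected by 1x1 weights V_f, V_g and added inside every gated activation,
    # turning the stack into a model of p(x_n | x_0, ..., x_{n-1}, l).
    def conditional_gate(conv_f, conv_g, h_n, V_f, V_g):
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
        return np.tanh(conv_f + V_f @ h_n) * sigmoid(conv_g + V_g @ h_n)

    C, E = 256, 64
    rng = np.random.default_rng(8)
    z = conditional_gate(rng.normal(size=C), rng.normal(size=C),
                         rng.normal(size=E),                      # h_n: embedded l
                         rng.normal(scale=0.05, size=(C, E)),
                         rng.normal(scale=0.05, size=(C, E)))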

WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models [23, 26, 27]:
• Stationary process within a fixed-length analysis window
  → model estimated over short (tens of ms) overlapping, shifted windows
• Linear, time-invariant filter within a frame
  → but the relationship between samples can be non-linear
• Gaussian process
  → assumes speech signals are normally distributed

WaveNet:
• Sample-by-sample, non-linear, and can take additional (conditioning) inputs
• Arbitrarily-shaped signal distribution

State-of-the-art subjective naturalness with WaveNet-based TTS [24]

[Figure: bar chart of subjective naturalness for HMM, LSTM, concatenative, and WaveNet systems.]

Relax approximation – Towards Bayesian end-to-end TTS

Integrated end-to-end:

\hat{L} = \arg\max_L p(L | W)
\hat{\lambda} = \arg\max_{\lambda} p(X | \hat{L}, \lambda)\, p(\lambda)
        ↓
\hat{\lambda} = \arg\max_{\lambda} p(X | W, \lambda)\, p(\lambda)

→ Text analysis is integrated into the model.

Relax approximation – Towards Bayesian end-to-end TTS

Bayesian end-to-end:

\hat{\lambda} = \arg\max_{\lambda} p(X | W, \lambda)\, p(\lambda)
\bar{x} \sim f_x(w, \hat{\lambda}) = p(x | w, \hat{\lambda})
        ↓
\bar{x} \sim f_x(w, X, W) = p(x | w, X, W)
       = \int p(x | w, \lambda)\, p(\lambda | X, W)\, d\lambda
       \approx \frac{1}{K} \sum_{k=1}^{K} p(x | w, \hat{\lambda}_k)    [ensemble]

→ Marginalize model parameters & architecture.
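The ensemble line maps directly onto averaging predictive distributions from K independently trained models; the models and their predict_dist interface below are hypothetical:

    import numpy as np

    # Ensemble approximation to the Bayesian predictive distribution: average
    # the per-model distributions p(x | w, lambda_k) over K trained models.
    # `models` and predict_dist(w, history) are a hypothetical interface.
    def ensemble_predict(models, w, history):
        probs = [m.predict_dist(w, history) for m in models]  # p(x_n | ..., lambda_k)
        return np.mean(probs, axis=0)                         # (1/K) * sum_k p_k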

Generative model-based text-to-speech synthesis

• Bayes formulation + factorization + approximations
• Representation: acoustic features, linguistic features, mapping
    Mapping: rules → HMM → NN
    Features: engineered → unsupervised, learned
• Fewer approximations: joint training, direct waveform modelling
    → moving towards integrated & Bayesian end-to-end TTS

Naturalness: concatenative ≲ generative
Flexibility: concatenative ≪ generative (e.g., multiple speakers)

Beyond "text"-to-speech synthesis

TTS on conversational assistants
• Texts alone aren't fully self-contained
• Need more context:
    location, to resolve homographs
    user query, to put the right emphasis

→ We need a representation that can organize the world's information & make it accessible & useful from/to TTS generative models.

Beyond "generative" TTS

Generative model-based TTS
• The model represents the process behind speech production
• Trained to minimize error against human-produced speech → learned model ≈ speaker

Speech is for communication
• Goal: maximize the amount of information received by the listener
• The "listener" is missing → include a "listener" in the training criterion / the model itself?

Heiga Zen Generative Model-Based Text-to-Speech Synthesis February ￿rd, ￿￿￿￿ ￿8of￿￿ Thanks!

References I

[1] D. Klatt. Real-time speech synthesis by rule. Journal of the ASA, 68(S1):S18, 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] K. Tokuda. Speech synthesis as a statistical problem. https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf. Invited talk given at ASRU 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II(11):2099–2107, 2000. (in Japanese).
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.

References II

[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. http://research.google.com/pubs/pub44630.html. Invited talk given at ASRU 2015.
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representation. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.

References III

[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, pages 4879–4883, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D(10):2471–2480, 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, pages 3925–3928, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, pages 577–580, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW7, pages 88–93, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, pages 5640–5644, 2016.

References IV

[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, pages 203–206, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ, 2010. (in Japanese).

[Summary figure: eight graphical models tracing the evolution of generative TTS:
(1) Bayesian; (2) auxiliary variables + factorization; (3) joint maximization; (4) step-by-step maximization (e.g., statistical parametric TTS);
(5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.]