
Paper #2729  Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, and Diederik P. Kingma
{ronw, rjryan, ebattenberg, soroosh, durk}@google.com

Overview

Summary
● TTS* in one sequence-to-sequence model
  ○ block-autoregressive normalizing flow, no vocoder
  ○ *normalized-text- or phoneme-to-speech
● Directly predict 40 ms waveform blocks at each decoder step
  ○ no overlap, no spectrograms
● End-to-end training, maximizing likelihood
● High fidelity output
  ○ trails Tacotron + WaveRNN baseline
  ○ higher sample variation, captures modes of training data?
● ~10x faster than real-time synthesis on TPU

Model
[Figure: model block diagram. Encoder: input tokens → Embed → Enc. Pre-Net → CBHG → e_1:I. Decoder: y_t-1 → Pre-Net → Attention + LSTM stack (2-layer LSTM with residual) → Linear Proj. → conditioning features c_t and stop token (EOS loss). Flow: Squeeze → xM stages of xN steps (ActNorm, Invertible 1x1 Conv, Affine Coupling) → Unsqueeze, mapping each waveform block y_t to noise z_t (flow loss).]

Background
● Tacotron [1][2]: phoneme input, mel spectrogram frame output
  ○ autoregressive decoder, each step generates a new frame
  ○ separate vocoder inverts spectrogram to waveform
    ■ e.g., WaveRNN [3], sample-by-sample autoregressive
  ○ pipeline: text / phonemes → Encoder → Decoder → spectrogram → Vocoder → waveform

Architecture
● Replace decoder post-net and vocoder with conditional normalizing flow
  ○ similar to FloWaveNet [4], WaveGlow [5] neural vocoders
● Tacotron encoder/decoder predicts flow conditioning input
● Train end-to-end, maximize likelihood of training data
● Block-autoregressive generation
  ○ waveform samples in each block generated in parallel
  ○ pipeline: text / phonemes → Encoder → Decoder → waveform blocks

Normalizing flow
● Model joint distribution of the K samples in a block: P(y_t1, y_t2, ..., y_tK | c_t)
● At each step: transform waveform block y_t into noise z_t
● Invertible network
  ○ training: transform waveform block into noise
  ○ sampling: transform noise sample into waveform block using inverse
● Change of variables: y_t = g(z_t; c_t)
● Maximize likelihood: P(y_t | c_t) = P(z_t | c_t) |det(dz_t / dy_t)|
● Coupling layers [6] ➔ fast inverse, cheap Jacobian determinant

Training
● Teacher-forced conditioning:
  P(y_t | c_t) = P(y_t | y_1:t-1, e_1:I) = P(y_t | previous waveform blocks, text)
● Flow loss:
  -log P(y) = sum_t -log P(y_t | c_t)
            = sum_t [ -log N(g^-1(y_t; c_t); 0, I) - log |det(dg^-1(y_t; c_t) / dy_t)| ]
  (spherical Gaussian prior term + Jacobian determinant term)
● EOS stop token classifier loss: P(t is last frame)
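To make the change of variables and coupling-layer mechanics above concrete, here is a minimal NumPy sketch of one conditional affine coupling step. It is an illustration only, not the paper's network: coupling_nn is a tiny dense stand-in for the deep convnet, and the half/half split, the parameter shapes, and the dimensions C and H are assumptions.

    import numpy as np

    def coupling_nn(x_a, cond, w1, b1, w2, b2):
        # Toy dense network predicting a log-scale and shift for the second half
        # of the block from the first half and the conditioning features c_t.
        h = np.tanh(np.concatenate([x_a, cond]) @ w1 + b1)
        log_s, shift = np.split(h @ w2 + b2, 2)
        return np.tanh(log_s), shift              # bound the log-scale for stability

    def coupling_forward(y, cond, params):
        # Training direction y_t -> z_t: returns z and log|det(dz/dy)|.
        y_a, y_b = np.split(y, 2)                 # first half passes through unchanged
        log_s, t = coupling_nn(y_a, cond, *params)
        z_b = (y_b - t) * np.exp(-log_s)          # elementwise affine transform
        return np.concatenate([y_a, z_b]), -np.sum(log_s)   # triangular Jacobian -> cheap log-det

    def coupling_inverse(z, cond, params):
        # Sampling direction z_t -> y_t: exact inverse of coupling_forward.
        z_a, z_b = np.split(z, 2)
        log_s, t = coupling_nn(z_a, cond, *params)
        return np.concatenate([z_a, z_b * np.exp(log_s) + t])

    # Quick check that the transform is exactly invertible.
    K, C, H = 960, 16, 64                         # block size; cond./hidden dims are assumed
    rng = np.random.default_rng(0)
    params = (rng.normal(0, 0.1, (K // 2 + C, H)), np.zeros(H),
              rng.normal(0, 0.1, (H, K)), np.zeros(K))
    y, c = rng.normal(size=K), rng.normal(size=C)
    z, logdet = coupling_forward(y, c, params)
    assert np.allclose(coupling_inverse(z, c, params), y)

A real flow step would also interleave ActNorm and invertible 1x1 convolution layers with the couplings, as in the model diagram above.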

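Reusing coupling_forward above as a stand-in for the full inverse flow g^-1, the training objective from the Training section reduces to a summed block NLL plus a stop-token cross-entropy. A minimal sketch, assuming blocks, conds, and stop_logits are same-length lists produced by teacher forcing:

    def block_nll(y_t, c_t, params):
        # -log P(y_t | c_t) via change of variables:
        # -log N(g^-1(y_t; c_t); 0, I) - log|det(dg^-1(y_t; c_t) / dy_t)|
        z_t, logdet = coupling_forward(y_t, c_t, params)          # g^-1: waveform block -> noise
        log_prior = -0.5 * np.sum(z_t ** 2 + np.log(2 * np.pi))   # spherical Gaussian term
        return -(log_prior + logdet)

    def training_loss(blocks, conds, stop_logits, params):
        # Teacher-forced loss: flow NLL summed over decoder steps + EOS stop-token loss.
        nll = sum(block_nll(y, c, params) for y, c in zip(blocks, conds))
        labels = np.zeros(len(blocks))
        labels[-1] = 1.0                                           # only the last block ends the utterance
        p = 1.0 / (1.0 + np.exp(-np.asarray(stop_logits, dtype=float)))
        eos = -np.sum(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
        return nll + eos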
Multiscale network [7]
● Wave-Tacotron generates a sequence of non-overlapping waveform blocks
  ○ K = 960 samples per block (40 ms at 24 kHz)
● Squeeze each waveform block into frames of length L = 10 samples, giving J = K/L frames
● M = 5 stages, each processing the signal at a different scale
  ○ N = 12 steps per stage
  ○ deep convnet: M·N = 60 total steps
● Sinusoidal position embeddings encode position in each frame
[Figure: flow stack y_t → Sqz(L) → (Step xN → Sqz) per stage → Unsqz → z_t; shapes run from 1 x K to L x J (fine) up to 2^(M-1)·L x J/2^(M-1) (coarse); conditioning features c_t are repeated over the J positions with position embeddings and frame-averaged at each stage.]
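The squeeze bookkeeping above is easy to sanity-check numerically. The sketch below uses one plausible time-to-channels folding convention (an assumption; the paper's exact ordering may differ) and reproduces the fine-to-coarse shapes from the diagram:

    import numpy as np

    def squeeze(x, factor):
        # Fold time into channels: (channels, time) -> (channels*factor, time/factor).
        c, t = x.shape
        return x.reshape(c, t // factor, factor).transpose(0, 2, 1).reshape(c * factor, t // factor)

    def unsqueeze(x, factor):
        # Exact inverse of squeeze.
        c, t = x.shape
        return x.reshape(c // factor, factor, t).transpose(0, 2, 1).reshape(c // factor, t * factor)

    K, L, M = 960, 10, 5
    x = np.arange(K, dtype=np.float32).reshape(1, K)   # one waveform block, shape 1 x K
    x = squeeze(x, L)                                  # L x J frames, J = K / L = 96
    print(0, x.shape)                                  # (10, 96): finest scale
    for stage in range(1, M):
        x = squeeze(x, 2)                              # each later stage halves time, doubles channels
        print(stage, x.shape)                          # coarsest: (160, 6) = 2^(M-1)*L x J/2^(M-1)
    assert np.array_equal(unsqueeze(squeeze(x, 2), 2), x)   # squeeze is losslessly invertible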

Sampling
● Invert the flow network
  ○ take the inverse of each layer, apply in reverse order
● At each step
  ○ sample noise vector z_t ~ N(0, T·I)
  ○ generate waveform block with the flow: y_t = g(z_t; c_t)
  ○ autoregressive conditioning on previous output y_t-1
● Concatenate blocks y_t to form the final signal: y = [y_1; y_2; ...; y_T]
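Putting the Sampling bullets together, generation is a short block-autoregressive loop. In this sketch, decoder_step and flow_inverse are hypothetical placeholders for the Tacotron-style decoder and the inverted flow, and the zero initial block, stop threshold, and temperature-scaling convention are assumptions:

    import numpy as np

    def synthesize(decoder_step, flow_inverse, K=960, T=0.7, max_blocks=500, seed=0):
        # Block-autoregressive sampling loop (sketch).
        #   decoder_step(prev_block) -> (cond_features c_t, stop_prob)   [placeholder]
        #   flow_inverse(z, cond)    -> waveform block y_t = g(z; c_t)   [placeholder]
        rng = np.random.default_rng(seed)
        prev = np.zeros(K)                        # all-zero block to start the autoregression (assumed)
        blocks = []
        for _ in range(max_blocks):
            cond, stop_prob = decoder_step(prev)  # conditioning from text + previous blocks
            z = np.sqrt(T) * rng.standard_normal(K)   # z_t ~ N(0, T*I) as in the Sampling section
            y = flow_inverse(z, cond)             # invert the flow: noise -> waveform block
            blocks.append(y)
            prev = y                              # autoregressive conditioning on y_{t-1}
            if stop_prob > 0.5:                   # EOS stop-token classifier ends generation
                break
        return np.concatenate(blocks)             # non-overlapping blocks form the final waveform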

Experiments

Data
● US English, single female speaker, sampled at 24 kHz
  ○ 39 hours training, 601 utterances held out
● Baselines
  ○ Tacotron-PN (post-net) + Griffin-Lim (similar to [1])
  ○ Tacotron + WaveRNN (similar to [2])
  ○ Tacotron + Flow vocoder
    ■ fully parallel (similar flow to Wave-Tacotron, 6 stages)
● Subjective listening tests rating speech naturalness
  ○ MOS on a 5 point scale

Results
● Tacotron + WaveRNN best
  ○ char / phoneme inputs roughly on par
● Wave-Tacotron trails by ~0.2 MOS points
  ○ phoneme > char
  ○ network uses capacity to model detailed waveform structure instead of pronunciation?
● Large gap to Tacotron-PN and Tacotron + Flow vocoder

Sample variation
● Generate 12 samples from the same input text
● Baselines generate very consistent samples, across vocoders
  ○ same prosody every time
  ○ Tacotron regression loss collapses to a single prosody mode?
● Wave-Tacotron has high variance
  ○ captures multimodal training distribution?
  ○ similar pattern in Flowtron [8]
  ○ useful for ASR data augmentation?

Generation speed
● Seconds to generate 5 seconds of speech
  ○ 90 input tokens, batch size 1
● Wave-Tacotron ~10x faster than real-time on TPU (2x on CPU)
  ○ slower as block size K decreases (more autoregressive steps)
● ~10x faster than Tacotron + WaveRNN on TPU (25x on CPU)
● ~2.5x slower than the fully parallel flow vocoder on CPU

Ablations
● 2 layer decoder LSTM
● 256 channels in coupling layers
● Optimal sampling temperature T = 0.7
● Deep multiscale flow is critical
● Varying block size K
  ○ quality starts degrading for K > 40 ms

References
[1] Wang, et al., Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017.
[2] Shen, et al., Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018.
[3] Kalchbrenner, et al., Efficient Neural Audio Synthesis. ICML 2018.
[4] Kim, et al., FloWaveNet: A Generative Flow for Raw Audio. ICML 2019.
[5] Prenger, et al., WaveGlow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019.
[6] Dinh, et al., NICE: Non-linear Independent Components Estimation. ICLR 2015.
[7] Dinh, et al., Density Estimation Using Real NVP. ICLR 2017.
[8] Valle, et al., Flowtron: An Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis. ICLR 2021.

Examples: https://google.github.io/tacotron/publications/wave-tacotron