Generating speech from speech: How end-to-end is too far?

Ron Weiss

SANE 2019, 2019-10-24

Joint work with Fadi Biadsy, Ye Jia, and many others in Brain, Research, Translate, Speech, and Project Euphonia

End-to-end / Direct models

[Figure: a single E2E model maps audio directly to "how's it going", vs. a conventional pipeline of acoustic model, C∘L∘G decoder, and LM rescoring producing the same output.]

● Single model trained to directly predict desired output from raw input

● Why?
  ○ Simpler...
    ■ training data ➔ e.g. don't need a lexicon
    ■ implementation ➔ single neural network, smaller model
    ■ single decoding step ➔ lower-latency inference

  ○ Directly optimize for the desired output
    ■ avoid compounding errors from intermediate predictions
    ■ retain useful information in the input, e.g. speaker identity, prosody

End-to-end speech models, a journey...

1. Speech-to-text translation [Weiss, et al., 2017] [Jia, Johnson, et al., 2019]
   Spanish speech "que tal" ➔ English text "how's it going"

2. Parrotron voice conversion and speech separation [Biadsy, et al., 2019]
   speech "how do you..." ➔ speech "how do you..." in a canonical voice

3. Translatotron speech-to-speech translation [Jia, Weiss, et al., 2019]
   Spanish speech "que tal" ➔ English speech "how's it going"

Sequence-to-sequence with attention [Bahdanau et al., 2015]

[Figure: an encoder maps inputs x_1 ... x_T to hidden states h_1 ... h_L; at each output step k, attention over the hidden states forms a context vector c_k, and the decoder emits outputs y_1 ... y_K.]

● Listen, Attend and Spell (LAS) [Chan et al., 2016] [Chorowski et al., 2015]
  ○ Encoder: Bidi LSTM stack computes a latent representation of spectrogram frames
  ○ Decoder: autoregressive LSTM next-step prediction, outputs one character at a time (sketched below)
    ■ conditioned on the entire encoded input sequence via an attention context vector
  ○ Attention aligns the input and output sequences
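To make the recurrence concrete, here is a minimal sketch of additive (Bahdanau-style) attention and the LAS encoder/decoder pieces. The layer sizes, and the use of PyTorch itself, are illustrative assumptions rather than the papers' exact configurations.

```python
# Minimal additive-attention sketch: scores align decoder state with
# encoder states; the context vector summarizes the input per step.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_out, dec_state):
        # enc_out: (batch, L, enc_dim), dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(
            self.w_enc(enc_out) + self.w_dec(dec_state).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)    # alignment over input frames
        context = (alpha * enc_out).sum(dim=1)  # context vector c_k
        return context, alpha

# Encoder: bidirectional LSTM stack over 80-dim spectrogram frames.
encoder = nn.LSTM(80, 256, num_layers=3, bidirectional=True, batch_first=True)
# Decoder cell: autoregressive, fed the previous character embedding
# concatenated with the attention context at every step.
decoder_cell = nn.LSTMCell(64 + 512, 256)

x = torch.randn(1, 200, 80)                  # one utterance, 200 frames
h, _ = encoder(x)                            # (1, 200, 512)
attention = AdditiveAttention(512, 256, 128)
context, alignment = attention(h, torch.zeros(1, 256))
```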

Bahdanau, et al., Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
Chan, et al., Listen, Attend and Spell. ICASSP 2016.
Chorowski, et al., Attention-Based Models for Speech Recognition. NeurIPS 2015.

Tacotron [Wang et al., 2017] [Shen et al., 2018]

[Figure: Tacotron 2 architecture. Input text ➔ character embedding ➔ 3 conv layers ➔ bidirectional LSTM encoder; location-sensitive attention feeds a decoder built from a 2-layer pre-net, 2 LSTM layers, a linear spectrogram projection, and a stop-token projection; a 5-conv-layer post-net refines the spectrogram, and a WaveNet vocoder with a mixture-of-logistics (MoL) output converts it to waveform samples.]

● Inputs are characters, outputs are spectrogram frames
● A separate vocoder network inverts the spectrogram to a waveform (a simple stand-in is sketched below)

Wang, et al., Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017.
Shen, et al., Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018.
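Tacotron 2 uses a WaveNet vocoder; the original Tacotron inverted predicted linear spectrograms with Griffin-Lim. As a minimal stand-in for the vocoder stage, a Griffin-Lim inversion using librosa (the STFT parameters here are assumptions):

```python
# Griffin-Lim inversion of a linear-frequency magnitude spectrogram,
# the vocoder used in the original Tacotron (Tacotron 2 uses WaveNet).
import librosa

def invert_spectrogram(mag, n_iter=60, n_fft=2048, hop_length=256):
    # mag: (1 + n_fft // 2, n_frames) magnitude spectrogram
    return librosa.griffinlim(mag, n_iter=n_iter,
                              hop_length=hop_length, n_fft=n_fft)
```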

Speech-to-text translation

Sequence-to-Sequence Models Can Directly Translate Foreign Speech

Ron Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, Zhifeng Chen

Interspeech 2017

Speech-to-text translation (ST)

[Figure: an ASR-style model consumes the Spanish speech "que tal" but is trained to emit the English translation "how's it going" rather than the transcript.]

● Replace the speech transcript with its translation
  ○ deeper LAS ASR architecture [Zhang et al., 2017]
  ○ as in "Listen and Translate" on synthetic speech [Bérard et al., 2016]

● Data: Fisher Spanish ➔ English
  ○ transcribed Spanish telephone conversations from the LDC
  ○ crowdsourced English translations of the Spanish transcripts from [Post et al., 2013]
  ○ train on 140k Fisher utterances (160 hours)

Zhang, et al., Very Deep Convolutional Networks for End-to-End Speech Recognition. ICASSP 2017.
Bérard, et al., Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. NeurIPS 2016 workshop.
Post, et al., Improved Speech-to-Text Translation with the Fisher and Callhome Spanish-English Speech Translation Corpus. IWSLT 2013.

ST models: End-to-end

Compare three approaches:

[Figure: speech encoder with a single English attention + decoder emitting "yes and have you been living here..."]

1. End-to-end ST
  ○ train a LAS model to directly predict English text from Spanish audio

2. Multi-task ST / ASR
  ○ shared encoder
  ○ 2 independent decoders with different attention networks, each emitting text in a different language

3. ASR ➔ NMT cascade
  ○ train independent Spanish ASR and text neural machine translation models
  ○ pass the top ASR hypothesis through NMT

ST models: Multi-task

Compare three approaches:

[Figure: shared speech encoder with two attention + decoder heads: English ("yes and have you been living here...") and Spanish ("si y usted hace mucho...").]

1. End-to-end ST
  ○ train a LAS model to directly predict English text from Spanish audio

2. Multi-task ST / ASR
  ○ shared encoder
  ○ 2 independent decoders with different attention networks, each emitting text in a different language

3. ASR ➔ NMT cascade
  ○ train independent Spanish ASR and text neural machine translation models
  ○ pass the top ASR hypothesis through NMT

ST models: Cascade

Compare three approaches:

[Figure: Spanish attention / decoder produces "si y usted hace mucho...", which feeds an NMT attention / decoder producing "yes and have you..."]

1. End-to-end ST
  ○ train a LAS model to directly predict English text from Spanish audio

2. Multi-task ST / ASR
  ○ shared encoder
  ○ 2 independent decoders with different attention networks, each emitting text in a different language

3. ASR ➔ NMT cascade
  ○ train independent Spanish ASR and text neural machine translation models
  ○ pass the top ASR hypothesis through NMT

Results

● BLEU score (higher is better)
● Multi-task > End-to-end ST > Cascade >> non-seq2seq baselines

Speech-to-text translation

Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, Yonghui Wu

ICASSP 2019

Scaling up

● ST1: 1M utterances, read English speech ➔ Spanish text
  ○ conversational speech translation
  ○ 1k+ hours, ~7x larger than Fisher

[Figure: E2E model maps English speech "how's it going" to Spanish text "que tal".]

● Compare with a baseline English ASR ➔ English-to-Spanish NMT cascade
  ○ ASR29: 29M transcribed English utterances
    ■ anonymized voice search logs, 40k+ hours

  ○ MT70: 70M English text ➔ Spanish text pairs
    ■ web text, a superset of ST1

Scaled-up model

● Bigger model
  ○ 5x BLSTM encoder, 8x LSTM decoder, 8-head attention
  ○ 62M parameters, ~6x larger than the Fisher ST model

[Figure: English speech ➔ encoder ➔ multihead attention ➔ decoder ➔ Spanish text; the encoder can be pretrained on ASR29, the decoder on MT70.]

● Underperforms the cascade of ASR and NMT models
  ○ but trained on 100x fewer examples

● Pretraining bridges most of the gap (a loading sketch follows the table)
  ○ encoder pretrained on ASR29
  ○ decoder pretrained on MT70

Model                 Fine-tuning set   BLEU
Cascaded              -                 56.9
Fused                 ST1               49.1
Fused + pretrained    ST1               54.6
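One plausible way to implement the pretraining is to initialize the fused model from the separately trained ASR and MT checkpoints before fine-tuning on ST1. A sketch under assumptions: the module layout, checkpoint paths, and PyTorch usage are illustrative, not the actual implementation.

```python
# Initialize the fused ST model's encoder from the ASR29-trained
# recognizer and its decoder from the MT70-trained NMT model, then
# fine-tune everything on ST1.  Sizes and paths are illustrative.
import torch
import torch.nn as nn

class FusedST(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(80, 512, num_layers=5,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(1024, 1024, num_layers=8, batch_first=True)

model = FusedST()
model.encoder.load_state_dict(torch.load("asr29_encoder.pt"))
model.decoder.load_state_dict(torch.load("mt70_decoder.pt"))
# ...then continue training (fine-tuning) on the ST1 parallel data.
```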

Weak supervision: Multi-task training

[Figure: three weakly supervised tasks share parameters: ASR29 (English speech ➔ English text) shares the encoder, MT70 (English text ➔ Spanish text) shares the decoder, and ST1 (English speech ➔ Spanish text) uses both.]

● Multi-task train with all available data
  ○ share the encoder with English ASR
  ○ share the decoder with English ➔ Spanish NMT
  ○ sample a task independently at each step (see the sketch after the table)

● Matches cascade performance

Model                              Fine-tuning set   BLEU
Cascaded                           -                 56.9
Fused                              ST1               49.1
Fused + pretrained                 ST1               54.6
Fused + pretrained + multi-task    ST1               57.1
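A toy sketch of the per-step task sampling schedule. The tiny linear layers stand in for the shared encoder and per-task decoders; the sizes, uniform sampling, and optimizer are all assumptions for illustration.

```python
# Sample one task per training step; all tasks update the shared
# encoder, each through its own head.
import random
import torch
import torch.nn.functional as F

shared_encoder = torch.nn.Linear(16, 16)
task_heads = {name: torch.nn.Linear(16, 8) for name in ("st", "asr", "mt")}
params = list(shared_encoder.parameters())
for head in task_heads.values():
    params += list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def next_batch(task):
    # Stand-in for reading one batch from ST1, ASR29, or MT70.
    return torch.randn(4, 16), torch.randn(4, 8)

for step in range(100):
    task = random.choice(("st", "asr", "mt"))   # sample a task per step
    x, y = next_batch(task)
    loss = F.mse_loss(task_heads[task](shared_encoder(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```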

Interlude: Speech-to-speech Translation, Take 1

Toward direct speech-to-speech translation

Input: Qué tal, eh, yo soy Guillermo, ¿Cómo estás?
Output: How's it going, hey, this is Guillermo, how are you?

[Figure: encoder (strided conv ➔ conv LSTM ➔ 3x Bidi LSTM) feeding a Tacotron 2-style spectrogram decoder (attention w/ pre-net ➔ 2x LSTM ➔ 5x conv post-net).]

● Join the ST encoder with a Tacotron 2 decoder

● Fisher Spanish ➔ English (TTS)
  ○ synthesize the target English speech from the translated transcript using a single-speaker Parallel WaveNet TTS [van den Oord et al., 2018]

● Fails to learn attention, just babbles

[Audio: model output vs. synthesized target]

van den Oord, et al., Parallel WaveNet: Fast High-Fidelity Speech Synthesis. ICML 2018.

Multi-task training

Input: Qué tal, eh, yo soy Guillermo, ¿Cómo estás?
Output: How's it going, hey, this is Guillermo, how are you?

[Figure: same encoder and spectrogram decoder, plus an auxiliary attention ➔ 1x LSTM ➔ softmax decoder that predicts the translated transcript.]

● Add an auxiliary decoder
  ○ predict the output (translated) transcript
  ○ helps the encoder learn a more useful internal representation
    ■ not used during inference
  ○ ~35M parameters

● Begins to pick up attention
  ○ but only learns to translate common words and short phrases

[Audio: two examples of model output vs. synthesized target]

Voice conversion

Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation

Fadi Biadsy, Ron Weiss, Pedro Moreno, Dimitri Kanevsky, Ye Jia

Interspeech 2019

Voice normalization

● Voice normalization = text-independent, many-to-one voice conversion to a canonical target voice
● Retain what is being said, not who, where, or how
  ○ i.e. discard speaker identity, accent, prosody, background noise, ...

[Figure: many speakers, accents, and acoustic conditions all map to one canonical target speaker.]

Parrotron

[Figure: input speech ➔ encoder (strided conv ➔ conv LSTM ➔ 3x Bidi LSTM); a spectrogram decoder (attention w/ pre-net ➔ 2x LSTM ➔ 5x conv) produces the output speech, and an auxiliary ASR decoder (attention ➔ 1x LSTM + softmax) predicts phonemes.]

● Same architecture: ASR encoder ➔ Tacotron 2 decoder, multi-task with an ASR decoder (loss sketched below)

● Train on ASR29
  ○ real speech input
  ○ synthesize the target speech from the transcript using a single-speaker Parallel WaveNet TTS [van den Oord et al., 2018]

van den Oord, et al., Parallel WaveNet: Fast High-Fidelity Speech Synthesis. ICML 2018.
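The multi-task objective pairs a spectrogram regression loss with an auxiliary phoneme cross-entropy. A minimal sketch; the loss types and the weighting are assumptions, not the paper's exact choices.

```python
# Combined objective: spectrogram regression for the synthesis decoder
# plus auxiliary phoneme cross-entropy from the ASR decoder, which is
# only used to shape the encoder and is discarded at inference.
import torch.nn.functional as F

def parrotron_loss(pred_spec, target_spec, phoneme_logits, phoneme_targets,
                   aux_weight=1.0):
    spec_loss = F.mse_loss(pred_spec, target_spec)
    # (batch, T, n_phonemes) -> (batch, n_phonemes, T) for cross_entropy
    asr_loss = F.cross_entropy(phoneme_logits.transpose(1, 2), phoneme_targets)
    return spec_loss + aux_weight * asr_loss
```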

Voice normalization results

● Evaluate with ASR word error rate (WER), computed as in the sketch below
● Multi-task training is important
  ○ phoneme targets perform better than graphemes

[Audio: three input/output example pairs]

Source: https://google.github.io/tacotron/publications/parrotron
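WER here is measured by running an off-the-shelf recognizer on the converted speech and scoring it against the reference transcripts, e.g. with the jiwer package (the example strings are illustrative):

```python
# Intelligibility proxy: word error rate of an off-the-shelf ASR
# system on the converted speech, vs. reference transcripts.
import jiwer

references = ["how do you do", "before google i worked at ibm"]
hypotheses = ["how do you do", "before google i worked at ibm"]
print(jiwer.wer(references, hypotheses))   # 0.0 for a perfect match
```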

Atypical speech conversion

● Task: convert "atypical" speech into speech that is more easily understood by both humans and off-the-shelf speech systems
  ○ e.g. due to deafness, or to physical/neurological conditions such as ALS, multiple sclerosis, ...

● Case study: speech from a deaf speaker
  ○ profoundly deaf since the age of 1
  ○ born in Russia, learned English as a teenager

 Source: https://www.youtube.com/watch?v=Act4NIe-sBg

I am Dimitri Kanevsky. I'm a speech research scientist at Google. Before Google I worked at IBM.

Atypical speech conversion: Results

● Adapt the normalization model using 13 hours of transcribed speech

[Audio: two examples of input, output, and output after adaptation]

Source: https://google.github.io/tacotron/publications/parrotron

● Substantially improves subjective naturalness and intelligibility

Atypical speech conversion: Demo

Source: https://www.youtube.com/watch?v=KtKGWSpppz4

Speech separation by synthesis

[Figure: same Parrotron architecture (encoder ➔ spectrogram decoder, auxiliary phoneme decoder), applied to overlapping speech.]

● Task: extract the loudest speaker in a mixture
  ○ up to 8 overlapping speakers
  ○ instantaneous mixing (sketched below)
  ○ 12 dB average SNR

● Large improvement in intelligibility
● The same architecture can be used to generate real speech from many speakers

[Audio: two input/output example pairs]
Source: https://google.github.io/tacotron/publications/parrotron
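A minimal numpy sketch of constructing such training mixtures: scale an interfering signal so the target-to-interference ratio hits a chosen SNR. The function and variable names are illustrative, and the single-interferer setup is a simplification of the slide's up-to-7-interferer mixtures.

```python
# Scale an interfering signal so the target-to-interference power
# ratio equals snr_db, then mix instantaneously (sample-wise sum).
import numpy as np

def mix_at_snr(target, interference, snr_db=12.0):
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2)
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    return target + gain * interference
```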

Speech-to-speech Translation, Take 2

Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model

Ye Jia, Ron Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu

Interspeech 2019

Translatotron: Making speech-to-speech translation work

[Figure: Spanish spectrogram ➔ 8x BLSTM encoder ➔ multihead attention ➔ Tacotron 2 decoder ➔ English spectrogram; auxiliary attention decoders predict Spanish (ASR) and English (ST) phonemes from intermediate encoder layers.]

● Apply lessons from ST and Parrotron
  ○ deeper model
    ■ 8x BLSTM encoder, 4x LSTM decoder
    ■ 69M parameters
  ○ 4-head attention
  ○ multi-task training
    ■ predict source and target phonemes from intermediate encoder outputs
  ○ encoder pretraining

● Train on Fisher Spanish ➔ English (TTS)

● Other tricks to pick up attention (sketched below)
  ○ narrow pre-net bottleneck in the autoregressive path
  ○ fewer output steps: predict two frames per decoder step
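A sketch of those two tricks. The bottleneck width, dropout rate, and decoder dimensionality are assumptions; keeping dropout active even at inference follows Tacotron 2's pre-net.

```python
# Narrow pre-net bottleneck on the autoregressive spectrogram input,
# plus a reduction factor of 2 (two frames per decoder step).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckPreNet(nn.Module):
    def __init__(self, n_mels=80, width=32, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(n_mels, width)
        self.fc2 = nn.Linear(width, width)
        self.p = p

    def forward(self, prev_frame):
        # Dropout stays on at inference time, as in Tacotron 2.
        h = F.dropout(F.relu(self.fc1(prev_frame)), self.p, training=True)
        return F.dropout(F.relu(self.fc2(h)), self.p, training=True)

# Reduction factor 2: each decoder step predicts two 80-dim frames,
# halving the number of attention steps the model must learn.
frame_projection = nn.Linear(1024, 80 * 2)
```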

Samples

Source: https://google-research.github.io/lingvo-lab/translatotron/#fisher_1

Samples 2

Source: https://google-research.github.io/lingvo-lab/translatotron/#fisher_7

Performance

● Measuring translation quality
  ○ transcribe the output speech using a LibriSpeech recognizer [Irie et al., 2019]
  ○ compute BLEU on the transcript ➔ a lower bound, due to ASR errors (see the example after the next list)

● Multi-task training to predict target phonemes is critical
● It also helps to predict source phonemes in addition to target phonemes, and to pretrain the encoder
● Underperforms the cascade baseline
● Same trend in subjective listening tests

Irie, et al., On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition. Interspeech 2019.
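The metric amounts to corpus BLEU over ASR transcripts of the synthesized audio, e.g. with sacrebleu (the example strings are illustrative); any recognition error can only lower the score, hence the lower bound.

```python
# BLEU between reference translations and ASR transcripts of the
# model's output speech.
import sacrebleu

asr_transcripts = ["how's it going hey this is guillermo how are you"]
references = [["how's it going hey this is guillermo how are you"]]
print(sacrebleu.corpus_bleu(asr_transcripts, references).score)
```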

Translatotron on ST1

● Train on ST1 Spanish ➔ English
  ○ synthesize the target English speech from the translation using Tacotron 2 [Shen et al., 2018]
● Wider layers, deeper (6-layer) decoder
  ○ ~5x larger than the Fisher model, 320M parameters

● All you need is... multi-task training
● Still underperforms the cascade

Shen, et al., Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018.

Cross-lingual voice transfer

● ST1 subset with both Spanish and English speech
  ○ ~600k utterances
  ○ but mismatched speakers

● Condition the decoder on a speaker embedding computed from a reference utterance (sketched below)
  ○ use the target English utterance during training
  ○ during inference, use the source Spanish utterance
  ○ similar to zero-shot speaker adaptation for TTS [Jia et al., 2018]

● Pretrain (and freeze) the speaker encoder on a speaker verification task [Wan et al., 2018]
  ○ trained on 8 languages, including English and Spanish

Jia, et al., Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. NeurIPS 2018.
Wan, et al., Generalized End-to-End Loss for Speaker Verification. ICASSP 2018.
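A sketch of the conditioning, assuming the d-vector is broadcast over time and concatenated to the encoder output; the exact injection point and shapes are assumptions.

```python
# Condition the translation decoder on a speaker embedding (d-vector)
# from a frozen, separately trained speaker encoder.
import torch

def condition_on_speaker(enc_out, ref_utterance, speaker_encoder):
    with torch.no_grad():                          # speaker encoder is frozen
        d_vector = speaker_encoder(ref_utterance)  # (batch, d)
    d = d_vector.unsqueeze(1).expand(-1, enc_out.size(1), -1)
    return torch.cat([enc_out, d], dim=-1)         # broadcast across time

# Training: ref_utterance is the target English recording.
# Inference: ref_utterance is the source Spanish utterance itself.
```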

Voice transfer: Samples

Source: https://google-research.github.io/lingvo-lab/translatotron/#conversational_9

Voice transfer: Performance

● Worse than the non-voice-transfer model
  ○ due to noisier training targets and a smaller training set

● Condition on reference utterances from different speakers
  ○ using the source utterance (cross-language voice transfer) is worse than using an English speaker
  ○ the speaker encoder is sensitive to language
    ■ a mismatch relative to training

Summary

Sequence-to-sequence models are pretty powerful...

● Speech-to-text translation
  ○ repurpose an ASR architecture
  ○ incorporate weak supervision from other data
    ■ multi-task training improves performance

● Voice conversion
  ○ ASR encoder + TTS decoder
  ○ train to predict synthetic speech targets
  ○ adapt to atypical speech

● Speech-to-speech translation
  ○ requires multi-task training
    ■ but no text representation is used during inference
  ○ underperforms the cascade, but better at maintaining prosody
  ○ transfers the source voice to the translated speech

End-to-end food for thought / Have we gone too far?

● Why not end-to-end?
  ○ More difficult to collect training data at scale
  ○ Different biases
    ■ e.g. does Parrotron copy pronunciation? does Translatotron prefer cognates?
  ○ Less interpretable
    ■ the cascade's intermediate hypotheses are useful

● Are direct models less powerful?
  ○ Limited to a single decode / beam search
    ■ how to represent multimodality or uncertainty?
  ○ The Tacotron decoder does greedy decoding
    ■ is this why multi-task training is critical for Translatotron?

Thanks!

Live demo

[Figure: speech input ➔ end-to-end neural network ➔ speech output.]

Extra slides

Atypical speech conversion demo 2

Source: https://www.youtube.com/watch?v=Act4NIe-sBg

Voice transfer: Samples 2

Source: https://google-research.github.io/lingvo-lab/translatotron/#conversational_2

Voice transfer: Samples 3

Source: https://google-research.github.io/lingvo-lab/translatotron/#conversational_5
