SEMI-SUPERVISED TRAINING FOR IMPROVING DATA EFFICIENCY IN END-TO-END SPEECH SYNTHESIS
Yu-An Chung∗1 Yuxuan Wang2 Wei-Ning Hsu∗1 Yu Zhang2 RJ Skerry-Ryan2
1Massachusetts Institute of Technology 2Google Inc.
ABSTRACT step using only a small amount of paired data to learn the alignment between the two representation domains. Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require In this preliminary study, we first identify the data require- a sizable set of high-quality
2. PROPOSED APPROACH 1. INTRODUCTION
Recent advances in end-to-end text-to-speech (TTS) have We use a baseline Tacotron architecture specified in [8], shown great promise. We are now able to produce natural where we use a GMM attention [9], LSTM-based decoder prosody with high audio fidelity using a much simplified with zoneout regularization [10] and phoneme inputs derived voice building pipeline [1, 2, 3]. However, such models from normalized text. We use Griffin-Lim [11] as the in- typically require a sizable dataset consisting of high-quality version algorithm to convert the predicted spectrograms to
The first conditioning approach concatenates the word vector at the first phoneme of the corresponding word and replicates the word vector across all phonemes in the word. Take in- put text “Thank you” as an example. The phoneme inputs are ‘th’, ‘a’, ‘ng’, ‘k’, ‘
Fig. 1. Illustration of conditioning encoder on pre-trained 2.1.2. A conditioning attention head word vectors. The left side shows the locations of encoder input and encoder top, where the word vectors can be incor- If we consider the phrase “Thank you”, it’s possible that the porated via a conditioning module. The right side illustrates semantics of “Thank” can help the encoder to generate more the conditioning module, where we show two methods of con- robust representation for “you”. However, the first approach ditioning. ‘
We conduct experiments to demonstrate the effectiveness of our framework. We use an internal single-speaker US English that our MCD results correlate well with our subjective per- dataset for training (fine-tuning). ception. For subjective measurements, we ran a series of side- by-side preference tests using 1000 unseen phrases of differ- 3.1. Data requirements of the baseline Tacotron ent lengths. For encoder conditioning, we used a neural network lan- To improve Tacotron’s data efficiency, first we need to un- guage model (NNLM) [14] trained on English Google News derstand its limit. We’d like to answer the following ques- 200B corpus from TensorFlow Hub as the word embedding tion: what is the maximum amount of data N that could al- module. The module maps each word to a 128-dimensional most never successfully train a baseline Tacotron to produce vector. We also tried word2vec (W2V) [17] trained on the intelligible speech? To find out N, we gradually decrease same corpus as the word embedding module. the amount of data used for training a baseline Tacotron from For decoder pre-training, we used VCTK [18], a publicly about 40 hours to about 12 minutes and listen to the synthe- available corpus containing 44 hours of speech from 109 sized speech on unseen phrases. As can be heard on our demo speakers, the majority of which have British accents. Note page, we estimated that using between 10 and 40 hours of data that there is an accent mismatch between the decoder pre- produces almost equally good synthesis, and using between 3 training (multiple speakers with British accents) and fine- and 10 hours of data causes minor degradation but still sounds tuning (single speaker with US accent) datasets. As men- very good. However, when there are only about 24 minutes tioned above, we only use the speech signals in VCTK but of data, the model fails to produce intelligible speech. When not their transcripts. there are only 12 minutes of data, the model outputs gibberish that is impossible to understand. It’s important to note that the 3.2.1. MCD objective tests transcripts in the 12 minutes data already cover all phonemes, therefore the failure is not simply due to phoneme coverage. The MCD results are shown in Table 1. We first compare Therefore, in the next section, we focus on demonstrat- the four configurations of encoder conditioning. Here we in- ing the effectiveness of our semi-supervised framework using clude the results of both NNLM and W2V word embeddings. only 24 minutes of paired data. We can see that models using NNLM always outperform their W2V counterpart. We speculate that this is because W2V 3.2. Results on small data only conveys word meanings but not the contextual or struc- tural information, which is modeled in NNLM. Our encoder conditioning and decoder pre-training approaches In terms of conditioning locations, we see that condi- can be applied to Tacotron independently or jointly. We de- tioning at encoder top always outperforms conditioning at note the model that only incorporates encoder conditioning encoder input. While feeding word vectors to the early parts as T-Enc, model that only incorporates decoder pre-training of the network seems intuitive (as in encoder input condition- as T-Dec, model that incorporates both as T-Enc-Dec, and the ing), we believe it is not the best choice in the low-resource baseline Tacotron as T-Base. setting. If the encoder weights learned from small data are We measure the synthesis quality using both objective and noisy, for example, they may “distort” well-trained word subjective tests. For the objective metric, we use mel cep- vectors. Therefore, conditioning word vectors at a higher stral distortions (MCD) [16], which measures the distance be- layer (e.g. encoder top) may lead to better generalization. tween synthesis and ground truth in the mel cepstrum space— In terms of conditioning method, we find that the simple the smaller the better. We use an evaluation set containing concatenation method always outperforms using a separate about 30 minutes (631 sentences) of unseen data. We found attention head. We also attribute this to the limited training data: although a separate attention head offers more flexibility for learning the alignment, it also introduces more trainable parameters. In summary, the best configuration for encoder conditioning is to directly concatenate word vectors obtained from a pre-trained NNLM at the encoder top. We used this configuration for T-Enc for the rest of the experiments. From Table 1 we can see that T-Enc, T-Dec, and T-Enc- Dec all achieve much lower MCD than T-Base. Among them, T-Dec achieves the best result. However, T-Enc, T-Dec, and T-Enc-Dec achieve similar MCD results (12.46, 12.09, 12.27, respectively). As shown in side-by-side comparisons below, the raters did not strongly prefer one over the other two, either.
3.2.2. Side-by-side subjective tests Fig. 2. MCD results by increasing the amount of paired data.
4. CONCLUSIONS AND DISCUSSIONS Table 2. Results of SxS subjective tests based on a 7-point rating scale. We report both rater preferences (in percentage) We have proposed a semi-supervised training framework for and p-values for each comparison. improving data efficiency in end-to-end TTS. Our framework Preference (%) leverages large-scale, publicly available, and unpaired text Competing pair p-value Former Latter Neutral and speech data to provide additional textual and acoustic knowledge to the Tacotron encoder and decoder, respectively. T-Base vs. T-Enc 3.3 65.1 31.6 1.07e-84 We have shown that our framework makes end-to-end TTS T-Base vs. T-Dec 3.2 61.8 35.0 3.47e-83 feasible in small-data regime. Specifically, a semi-supervised T-Enc vs. T-Dec 16.1 18.2 65.7 0.256 T-Enc-Dec vs. T-Dec 17.0 17.9 65.1 0.630 trained Tacotron can produce intelligible speech using just 24 minutes of paired training data. This promising result also provides some guiding principles for future data collection Table 2 shows the results of the four side-by-side (SxS) efforts for both single and multi-speaker TTS. While we used preference tests, comparing T-Base against T-Enc, T-Base Tacotron as the TTS model in this study, we believe the frame- against T-Dec, T-Enc against T-Dec, and T-Enc-Dec against work is generally applicable to other end-to-end TTS models. T-Dec. As we can see from the table, both T-Enc and T-Dec This is only a preliminary work, and there is still much significantly outperform T-Base: in both tests, raters strongly to be investigated. For example, we’ve been using phoneme preferred them over the baseline by more than 60%. Inter- inputs in this work and we’d like to understand the perfor- estingly, the raters considered T-Dec, T-Enc, and T-Enc-Dec mance tradeoffs on grapheme inputs. For leveraging textual similarly preferable. The results of SxS tests are consistent knowledge, instead of simply conditioning with word vectors, to those of MCD objective tests, and both demonstrate the a likely more effective method is to initialize the entire en- effectiveness of our semi-supervised framework. coder with a pre-trained bidirectional NNLM [19]. For de- coder pre-training, the model mismatch during pre-training 3.3. Results on other amounts of data and fine-tuning can be further studied. An analysis on what We also compare T-Base, T-Enc, T-Dec, and T-Enc-Dec kind of information are extracted from external data and how trained on other amounts of data. In Figure 2, each curve they are actually used by Tacotron is also an important future corresponds to a Tacotron variant, showing the relationship work. Lastly, since the main focus of this work is to make between the amount of paired data used for training that end-to-end TTS feasible in small-data regime instead of pro- Tacotron variant and the MCD between ground-truth audio ducing high-fidelity audio, we only used Griffin-Lim as the and synthesis from it. We can see that the largest gap be- waveform synthesizer. To produce high-fidelity speech with tween T-Base and the three semi-supervised systems occurs very little paired data, we still need to address the problem of when using only 12 minutes (0.5 shards) of paired data. The adapting neural vocoders in the semi-supervised setting. gap keeps decreasing when the amount of data increases. This phenomenon is somewhat expected, because with more 5. ACKNOWLEDGEMENTS paired data, Tacotron relies less on external knowledge for learning representations and alignments. However, semi- The authors thank Daisy Stanton, Eric Battenberg, Soroosh supervised Tacotron consistently achieves lower MCD than Mariooryad, Yinfei Yang, and the Machine Hearing and the baseline, which may indicate benefits beyond better data Google Brain teams for their helpful feedback and discus- efficiency (e.g. improved prosody). sions. 6. REFERENCES [12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor- rado, and Jeff Dean, “Distributed representations of [1] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui words and phrases and their compositionality,” in NIPS, Wu, Ron J. Weiss, Zongheng Yang Jaitly, Ying 2013. Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yan- nis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous, [13] Jeffrey Pennington, Richard Socher, and Christopher “Tacotron: Towards end-to-end speech synthesis,” in Manning, “Glove: Global vectors for word represen- INTERSPEECH, 2017. tation,” in EMNLP, 2014. [14] Yoshua Bengio, Rejean´ Ducharme, Pascal Vincent, and [2] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Christian Jauvin, “A neural probabilistic language Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng model,” Journal of Machine Learning Research, vol. Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. 3, no. 2, pp. 1137–1155, 2003. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu, “Natural TTS synthesis by conditioning wavenet on mel [15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- spectrogram predictions,” in ICASSP, 2018. gio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015. [3] Wei Ping, Kainan Peng, and Jitong Chen, “Clarinet: Parallel wave generation in end-to-end text-to-speech,” [16] R. Kubichek, “Mel-cepstral distance measure for objec- arXiv preprint arXiv:1807.07281, 2018. tive speech quality assessment,” in PacRim, 1993.
[4] Oliver Samuel Watts, Unsupervised learning for text-to- [17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey speech synthesis, Ph.D. thesis, The University of Edin- Dean, “Efficient estimation of word representations in burgh, 2013. vector space,” in ICLR Workshop, 2013.
[5] Heng Lu, Simon King, and Oliver Watts, “Combining [18] Christophe Veaux, Junichi Yamagishi, and Kirsten Mac- a vector space representation of linguistic context with Donald, “CSTR VCTK corpus: English multi-speaker a deep neural network for text-to-speech synthesis,” in corpus for CSTR voice cloning toolkit,” 2017. ISCA Workshop on Speech Synthesis, 2013. [19] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke [6] Oliver Watts, Zhizheng Wu, and Simon King, Zettlemoyer, “Deep contextualized word representa- “Sentence-level control vectors for deep neural network tions,” in NAACL-HLT, 2018. speech synthesis,” in INTERSPEECH, 2015.
[7] Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao, “Word embedding for recurrent neural network based TTS synthesis,” in ICASSP, 2015.
[8] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry- Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A. Saurous, “Style tokens: Unsuper- vised style modeling, control and transfer in end-to-end speech synthesis,” in ICML, 2018.
[9] Alex Graves, “Generating sequences with recurrent neu- ral networks,” arXiv preprint arXiv:1308.0850, 2013.
[10] David Krueger, Tegan Maharaj, Janos´ Kramar,´ Mo- hammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in ICLR, 2017.
[11] Daniel Griffin and Jae Lim, “Signal estimation from modified short-time fourier transform,” IEEE Transac- tions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.