INTERSPEECH 2020 October 25–29, 2020, Shanghai, China

Improving the Prosody of RNN-based English Text-To-Speech Synthesis by Incorporating a BERT model

Tom Kenter, Manish Sharma, Rob Clark

Google UK
{tomkenter, skmanish, rajclark}@google.com

Abstract

The prosody of currently available speech synthesis systems can be unnatural due to the systems only having access to the text, possibly enriched by linguistic information such as part-of-speech tags and parse trees. We show that incorporating a BERT model in an RNN-based speech synthesis model — where the BERT model is pretrained on large amounts of unlabeled data, and fine-tuned to the speech domain — improves prosody. Additionally, we propose a way of handling arbitrarily long sequences with BERT. Our findings indicate that small BERT models work better than big ones, and that fine-tuning the BERT part of the model is pivotal for getting good results.

1. Introduction

The voice quality of current text-to-speech (TTS) synthesis is getting close to the quality of natural human speech. In terms of prosody, however, the speech that TTS systems produce can still be unnatural along a number of different dimensions, including intonation, which conveys semantics and emotion. One of the reasons for this is that the input to speech synthesis systems typically consists of only text, read as characters or words [1, 2]. The input is usually enriched by information about syntax, using taggers and parsers [3, 4, 5]. As such, no information about semantics or world knowledge is available to these models, as this is not inferable from just the text. Furthermore, when input is enriched by syntax information, errors made by the algorithms providing this information can propagate through the system, which may degrade, rather than improve, synthesis.

We present CHiVE-BERT, an extension of the Clockwork Hierarchical Variational autoEncoder (CHiVE) prosody model that incorporates a Bidirectional Encoder Representations from Transformers (BERT) network [6], trained first on text, and then fine-tuned on a speech synthesis task – a prosody generation task in particular. The motivation for incorporating a BERT model in our overall model is twofold: i) BERT provides representations that have been proven to embody syntactic information in a more robust way than traditional parsing and tagging [7, 8, 9]; ii) BERT models can provide useful cues beyond syntax, such as word semantics and world knowledge [10].

It has been shown that BERT models can incorporate syntactic information in the text-only domain [7, 8]. While the BERT model only implicitly captures the linguistic structure of the input text, we show that its output representations can be used to replace features encoding this syntactic information explicitly.

The BERT models in the CHiVE-BERT system are pretrained on a language modeling task, which is known to lead to semantic information being incorporated into the model [7, 6]. We find this enables CHiVE-BERT to learn to correctly pronounce longer noun compounds, such as ‘diet cat food’, as the knowledge that this is more likely to be interpreted as ‘(diet (cat food))’ rather than ‘((diet cat) food)’ is incorporated into the pretrained model, whereas it is typically hard for standard methods to resolve this ambiguity. Sentences that are challenging linguistically (e.g., because they are long or otherwise difficult to parse) can benefit from the new approach based on BERT representations rather than explicit features, as any errors in the parse information no longer impair performance.

We show that human raters favour CHiVE-BERT over a current state-of-the-art model on three datasets designed to highlight different prosodic phenomena. Additionally, we propose a way of handling long sequences with BERT (which has fixed-length inputs and outputs), so it can deal with arbitrarily long input.

2. Related work

Using BERT in TTS  There are existing attempts to use BERT in TTS. Both [11] and [12] employ pretrained BERT models with Tacotron 2-style TTS models, and show gains in mean opinion scores. [13], again, use pretrained BERT, but only in the front-end components of TTS. They show improved accuracy in the front-end but do not evaluate the final speech quality. In [11, 13], large BERT models are used: 12 layers, and 768 hidden units in each (in [12] the size of the BERT models is not specified). In contrast, our proposed model uses a smaller model (only 2 layers, hidden state size 256). More importantly, instead of using the pretrained BERT model as is, we propose to update the parameters of the BERT network itself while training our prosody model.

Handling arbitrarily long input with Transformers  Perhaps the simplest way of dealing with arbitrarily long input in a Transformer model [14] – which has a fixed input length – is to split the input into consecutive segments, and process these segments separately [15]. An alternative approach is employed in [16, 17], where the input, again, is split into consecutive segments, which the model then recurses over. During evaluation, the Transformer model thus has an input size twice as large as it would usually have.

Different to the approaches described above, but similar to, e.g., [18], we propose to split the input into overlapping segments. This is different from local attention [19, 20, 21], as the Transformer architecture itself is unaltered in our case. For efficiency reasons, we do not recurse over the multiple Transformer outputs [18]. We aggregate the output of overlapping windows by selecting the representations that had most context (i.e. furthest away from the start or end of the window, cf. §3.1.2). This comes at the cost of losing the ability to model prosodic effects spanning very long contexts. We motivate this choice by noting that i) though prosodic phenomena spanning very long contexts can occur, we expect local context to be primarily important for predicting prosodic features; ii) we expect the length of most input sequences to be under the maximum input size of the BERT model, both during fine-tuning and final inference.

3. CHiVE-BERT

CHiVE-BERT extends the Clockwork Hierarchical Variational autoEncoder (CHiVE) model described in [3]. The CHiVE model is a conditional variational autoencoder (VAE) [22] with inputs and outputs that are prosodic features, namely pitch (fundamental frequency F0), energy (represented by the 0th cepstral coefficient c0) and phone durations (in number of frames). The variational embedding represents the prosody of a single utterance. Both the encoder and decoder are structured as hierarchical recurrent neural networks (RNNs) where each layer is clocked dynamically to the linguistic structure of a given utterance. For example, if a particular syllable consists of three phonemes, the encoder phoneme layer will consume three phonemes before its output state is fed to the syllable layer, whereas if the next syllable contains five phonemes, five phonemes will be consumed. Both the encoder and decoder are conditioned on features encoding information about, e.g., syllable stress and part-of-speech (cf. §4.3 for more information on these features). These features are fed to the model at the appropriate linguistic layer by being appended to the state received from the previous layer.

In CHiVE-BERT, the word-level features are replaced by WordPiece embeddings from a pretrained BERT model. This pretrained BERT model is fine-tuned as CHiVE-BERT is trained. Figure 1 shows a high-level overview of CHiVE-BERT. The preprocessor is a text-normalization front-end [23, 24], and is not updated as part of the CHiVE-BERT training procedure. As such, we will not elaborate on it further.

[Figure 1: Schematic overview of the CHiVE-BERT conditional variational auto-encoder model at training time. Labelled components include the preprocessor, the BERT model, textual features, BERT embeddings, prosodic features, the RNN encoder and decoder, the variational projection (μ, σ) with sampling, the sentence prosody embedding, and the predicted prosodic features.]

Figure 2 shows the details of the CHiVE-BERT encoder. At the top (shown in red in the figure) is a syllable-rate RNN which encodes an utterance. For each syllable in the utterance it encodes 1) the prosodic features encoded frame by frame (by the blue frame-level RNN in the lower left of the figure), 2) the phone-level features encoded by the phoneme-level RNN (in purple, in the lower right), 3) syllable-level features, 4) BERT WordPiece embeddings [25], and 5) sentence-level features. The last state of the syllable RNN is fed to the variational layer, which computes a sentence prosody embedding from it.

[Figure 2: CHiVE-BERT encoder and variational layer. Circles represent blocks of RNN cells, rectangles represent vectors. Broadcasting (vector duplication) is indicated by vectors being displayed in a lighter shade. The preprocessor, that computes the input features at various levels, is omitted from the picture, as it is run offline and, as such, not part of the network. (Best viewed in colour.)]

Similar to the way it works in the encoder, the WordPiece embeddings in the decoder are broadcasted as input to each syllable of a given word, replacing the word-level features in the original CHiVE model. The overall structure of the decoder is somewhat different to that of the encoder to account for the need to predict the number of frames in each phoneme. It is fully described in [3] and not discussed further here for brevity.

3.1. BERT model

The BERT model is run as part of the full model and gradients flow all the way through the model, up until, but not including, the WordPiece embedding lookup table. The input text is split into WordPieces by the default BERT WordPiece tokenizer, and run through the Transformer graph, which outputs an embedding for each WordPiece in the input. As we wish to provide these embeddings to the syllable-level RNNs in the encoder/decoder, which have no notion of WordPieces, the embedding corresponding to the first WordPiece of each word is selected to represent the word, and is presented to each RNN input (and replicated for each syllable, cf. Figure 2).
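To make the WordPiece-to-word mapping concrete, the sketch below shows one way to select the first-WordPiece embedding of each word and replicate it for every syllable of that word. It is a minimal illustration under our own assumptions (pre-grouped WordPieces per word, plain Python lists), not the CHiVE-BERT implementation.

```python
# Minimal sketch (not the authors' code): represent each word by the embedding
# of its first WordPiece, then replicate that embedding for every syllable of
# the word so it can be fed to the syllable-level RNNs.
from typing import List, Sequence


def first_wordpiece_embeddings(
    wordpieces_per_word: Sequence[Sequence[str]],      # e.g. [["play", "##ing"], ["outside"]]
    wordpiece_embeddings: Sequence[Sequence[float]],   # one embedding per WordPiece, in input order
) -> List[Sequence[float]]:
    """Select the embedding of the first WordPiece of each word."""
    word_embeddings, index = [], 0
    for pieces in wordpieces_per_word:
        word_embeddings.append(wordpiece_embeddings[index])  # first WordPiece stands in for the word
        index += len(pieces)                                  # skip this word's remaining WordPieces
    return word_embeddings


def broadcast_to_syllables(
    word_embeddings: Sequence[Sequence[float]],
    syllables_per_word: Sequence[int],
) -> List[Sequence[float]]:
    """Replicate each word embedding once per syllable of that word."""
    per_syllable = []
    for embedding, n_syllables in zip(word_embeddings, syllables_per_word):
        per_syllable.extend([embedding] * n_syllables)
    return per_syllable
```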
3.1.1. Freezing WordPiece embeddings after pretraining

We freeze the WordPiece embeddings of the model after pretraining [26], as not all WordPieces observed during pretraining might be observed (as often) during the fine-tuning stage. This might cause discrepancies in the way the semantic space would be updated during fine-tuning if the gradients would, in fact, flow through it. If two WordPieces are close after pretraining, but only one of them is observed frequently during fine-tuning, they might end up being separated. This separation, however, would not be informed by the fine-tuning process, but, rather, by data imbalance.

3.1.2. Handling arbitrarily long input with BERT

While the RNN layers of the CHiVE-BERT model can handle arbitrarily long input (memory permitting) as they unroll dynamically, the BERT model has a fixed input length m. We augment the BERT model to handle arbitrarily long sequences of WordPieces by splitting a sequence $wp_1, wp_2, \ldots, wp_n$ into windows of size m, if n > m, and presenting these windows to the model as a mini-batch. The windows overlap by stride size s, which we set to be (the floor of) half of the input size: $s = \lfloor (m-2)/2 \rfloor$. Each window $w_i$, for $i \in \{0, 1, \ldots, \lceil (n+2-m)/s \rceil\}$, is defined as:

$$w_i = \{\mathrm{START},\ wp_{i \cdot s + 1},\ wp_{i \cdot s + 2},\ \ldots,\ wp_{i \cdot s + (m-2)},\ \mathrm{END}\},$$

where $wp_i$ is the WordPiece at position i, and START and END are start and end tokens, respectively. We introduce two new tokens, [BREAK] and [CONT]. The START token of the first window is the standard [CLS] token, and [CONT] for any subsequent window. The END token of the last window is the standard end-of-sequence token [SEP] (which can occur earlier in the window, in which case it is followed by padding to fill up to size m), and a [BREAK] token for any window before it. To combine the $\lceil (n+2-m)/s \rceil + 1$ overlapping windows of length m to a sequence of length n again, we select the center $\lfloor s/2 \rfloor$ WordPiece embeddings from every window, except for the first one, where we start from index 0, and the last one, where we end at the [SEP] token, and pad the remainder.
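As an illustration of this windowing scheme, the sketch below splits a WordPiece sequence into overlapping windows with the [CLS]/[CONT] start tokens and [BREAK]/[SEP] end tokens described above, and recombines the per-window outputs by keeping, for every position, the output from the window in which that position is furthest from a window edge (i.e. has most context, cf. §2). This reflects our reading of the scheme rather than the authors' code; the exact index arithmetic at the sequence edges is an assumption.

```python
# Sketch of the overlapping-window scheme of §3.1.2 (an illustration, not the
# authors' implementation). `m` is the fixed BERT input size.
from typing import List, Sequence


def split_into_windows(wordpieces: Sequence[str], m: int) -> List[List[str]]:
    """Split a WordPiece sequence into overlapping windows of fixed size m."""
    payload = m - 2                       # room left after the START and END tokens
    s = payload // 2                      # stride: windows overlap by roughly half
    if len(wordpieces) <= payload:        # short input: one ordinary BERT input
        window = ["[CLS]", *wordpieces, "[SEP]"]
        return [window + ["[PAD]"] * (m - len(window))]
    windows, start = [], 0
    while True:
        chunk = list(wordpieces[start:start + payload])
        first = start == 0
        last = start + payload >= len(wordpieces)
        window = ["[CLS]" if first else "[CONT]", *chunk,
                  "[SEP]" if last else "[BREAK]"]
        windows.append(window + ["[PAD]"] * (m - len(window)))
        if last:
            return windows
        start += s


def recombine(window_outputs: Sequence[Sequence[object]], n: int, m: int) -> List[object]:
    """For each of the n original positions, keep the output from the window in
    which that position is furthest from the window edges (most context)."""
    payload = m - 2
    s = payload // 2
    best: List[object] = [None] * n
    best_margin = [-1] * n
    for w, outputs in enumerate(window_outputs):   # outputs[1:-1] align with the window payload
        start = w * s
        for j in range(payload):
            pos = start + j
            if pos >= n:
                break
            margin = min(j, payload - 1 - j)       # distance to the nearest window edge
            if margin > best_margin[pos]:
                best_margin[pos] = margin
                best[pos] = outputs[j + 1]         # +1 skips the START position
    return best
```

For m = 512 this gives a stride of s = 255, so consecutive windows share roughly half their WordPieces.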

4. Experimental setup

Here we describe the way our experiments are set up.

4.1. CHiVE-BERT model
The CHiVE-BERT model is trained on 215,650 utterances (corresponding to about 228 hours of speech). A development set of 3,000 utterances is employed to tune hyperparameters. The data was curated to have good phonetic coverage, and was recorded under studio conditions by 30 native US English speakers. While some speakers are represented better than others, all speakers are represented by a substantial number of utterances. Alignment, down to phoneme level, was done automatically.

The BERT part of the CHiVE-BERT model is trained on the same corpus the publicly available BERT_BASE model¹ was trained on² [6], using a dropout probability of 0.1 [27], Gaussian Error Linear Unit (GELU) activations [28], hidden size 256, intermediate size 1024, 4 attention heads, 2 hidden layers, and a vocabulary size of 30,522. The model has an input size of 512 (m in §3.1.2), i.e., it reads 512 WordPiece embeddings at once.

The internal size of the syllable-level RNN and of the sentence prosody embedding (the output of the variational layer) is 256. The internal size of both the frame-rate RNN and the phone-rate RNN is 64 in the encoder, and 64 and 32, respectively, in the decoder. We represent F0 and c0 values at 200Hz (i.e., we use 5ms frames). Adadelta optimization is used [29], with a decay rate ρ = 0.95 and ε = 10^-6, and an initial learning rate of 0.05, exponentially decayed over 750,000 steps. All networks are trained for 1M iterations, with a batch size of 4.

As reported in, e.g., [30], the fine-tuning process is sensitive to different initializations. We perform 10 runs for each hyperparameter setting considered, and report on the model performing best on the development set.
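The paper does not say which BERT implementation is used; purely as an illustration, the hyperparameters above map onto a Hugging Face transformers BertConfig as in the sketch below (the library choice and the variable name are our assumptions).

```python
# Illustration only (not the authors' code): the BERT hyperparameters of §4.1
# expressed as a Hugging Face `transformers` BertConfig.
from transformers import BertConfig

chive_bert_config = BertConfig(
    vocab_size=30_522,            # WordPiece vocabulary size
    hidden_size=256,              # hidden state size
    num_hidden_layers=2,          # Transformer layers
    num_attention_heads=4,
    intermediate_size=1024,
    hidden_act="gelu",            # GELU activations [28]
    hidden_dropout_prob=0.1,      # dropout probability [27]
    max_position_embeddings=512,  # input size m: 512 WordPieces at once
)
```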
4.2. Baseline

The baseline model is the prosody model currently used for popular US English voices in the Google Assistant, described in [3]. The capacity of this model (the number of parameters) is lower than the capacity of the CHiVE-BERT model, as the latter incorporates an extra Transformer network. It might be considered fair, therefore, to compare the CHiVE-BERT model to a version of the baseline that is similar in terms of capacity. We found, however, in preliminary experiments, that increasing the size of the hidden states, or the number of RNN cells in any of the layers, decreased performance, as the models started to overfit. As such, the architecture of the baseline is tuned to yield the highest performance in terms of MOS score, and hence is a strong baseline to compare the CHiVE-BERT model against.

4.3. Features

Both CHiVE-BERT and the baseline model have input features describing the textual contents at the level of sentence, words (only for the baseline), syllables and phonemes, following, e.g., [5]. The sentence-level features contain information about the speaker and their gender. Word-level features contain information about POS tags, dependency parse, phrase-level features (indicating if the phrase is part of a question or a statement), and preceding or subsequent punctuation marks. At syllable level, information such as number of phonemes and lexical stress is represented. Lastly, the phoneme-level features contain information about the position of the phoneme in the syllable, the phoneme identity, and the number of phonemes in the syllable.

CHiVE-BERT has the same input features the baseline model has, except for the features at word level, which are left out and replaced by BERT representations as described in §3.³

4.4. WaveNet

We use two WaveNets [31]: one for our baseline, and one for the CHiVE-BERT model. Both are trained on linguistic and prosodic features, using a dataset containing 115,754 utterances of speech for 9 speakers. They are trained on multiple speakers because we found that even if the models are used to generate speech for only one or two speakers, as they are in our experiments, training them on more speakers yields better results. The only difference between the two WaveNets is that the one used for the baseline does have access to features at word level, while the one used for CHiVE-BERT does not, to prevent it from getting potentially conflicting information between the BERT representation and the (noisy) word-level linguistic features.

4.5. Evaluation

To evaluate the performance of CHiVE-BERT compared to the baseline model we run 7-way side-by-side tests, where raters indicate which audio sample they think sounds better on a scale from -3 to +3, where the center of the scale means “no difference”. Every rater rates at most 6 samples, and every sample is rated by 1 to 8 raters. To gain insight into the quality on different kinds of input data, we test it on three separate datasets:

Hard lines  A test set of 286 lines, expected to be hard, containing, e.g., titles and long noun compounds;

Questions  As questions are prosodically markedly different from statements, we include a test set of 299 questions;

Generic lines  The aim of this set is to test against regression (i.e., that normal lines that already sound good are not negatively impacted). The set consists of 911 lines, typically short, for which a generic default prosody is likely to be appropriate.

¹ See https://github.com/google-research/bert.
² We also tried initializing from BERT_BASE itself in preliminary experiments. However, we found that did not yield promising results, and as it takes a long time to fine-tune, we left it out of our final experiments.
³ In preliminary experiments, we also tried combining both word-level linguistic features and BERT representations, but as the results were not promising, we left out this setting in the final experiments.
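The scores reported in Tables 1 and 3 are mean ratings on the -3 to +3 scale described in §4.5, together with an uncertainty interval. The sketch below shows one way such ratings could be aggregated; the normal-approximation confidence interval and the 99% level (matching α = 0.01) are our assumptions, as the paper does not spell out its exact procedure.

```python
# Sketch of aggregating side-by-side ratings (-3 .. +3) into a mean preference
# score with an approximate confidence interval. The normal approximation and
# the 99% level are assumptions, not the paper's stated procedure.
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def preference_score(ratings: Sequence[float], z: float = 2.576) -> Tuple[float, float]:
    """Return (mean rating, half-width of an approximate 99% confidence interval)."""
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width


# A positive mean favours CHiVE-BERT, a negative mean favours the other system,
# and 0 corresponds to "no difference" on the rating scale.
```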

5. Results and analysis

As described in §4.5, we perform tests on three datasets. The results of the tests are in Table 1. The difference between CHiVE-BERT and the baseline is most apparent for the hard lines, and is still significant for the other test sets.

Table 1: Results of comparing the CHiVE-BERT model to the baseline. All results are statistically significant, using a two-sided t-test with α = 0.01. The ♀ and ♂ symbols indicate female and male speaker, respectively.

  test               outcome
  Hard lines ♀       0.320 ± 0.069
  Hard lines ♂       0.642 ± 0.072
  Questions ♀        0.238 ± 0.082
  Questions ♂        0.504 ± 0.083
  Generic lines ♀    0.208 ± 0.054
  Generic lines ♂    0.588 ± 0.068

Audio samples of the CHiVE-BERT model and the baseline are available at https://google.github.io/chive-prosody/chive-bert/.

5.1. Model sizes

When fine-tuning for a specific NLP task, bigger BERT (or BERT-like) models tend to yield better results than smaller models [32, 33]. This was not the case in our experiments, similar to findings in, e.g., [34]. Table 2 lists the mean absolute error, in Hz, of predicted F0 when the decoder is conditioned on an all-zeros embedding, for CHiVE-BERT models incorporating BERT models of different sizes (pretrained on the same corpus, and all fine-tuned). We observe a trend where bigger models yield higher losses. To see if smaller is always better we also tried a very small model (first row in the table), which does surprisingly well, but worse than the medium-sized models.

Table 2: Mean absolute F0 error in Hz, for CHiVE-BERT models incorporating BERT models of different sizes.

  hidden state   interm. state   # attention heads   # Transformer layers   F0 error
  24             96              1                   2                      20.7
  128            512             2                   2                      18.11
  128            512             2                   4                      17.77
  256            1024            4                   2                      17.59
  256            1024            4                   4                      22.28
  768            3072            12                  6                      22.51
  768            3072            12                  8                      22.66
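The F0 error in Table 2 can be read as a frame-level mean absolute error in Hz; the sketch below shows that computation under our own assumptions about frame selection and averaging.

```python
# Sketch of the Table 2 metric as we read it: mean absolute error, in Hz,
# between predicted and reference F0 values over frames. The exact frame
# selection and averaging details are assumptions.
from typing import Sequence


def mean_absolute_f0_error(predicted_hz: Sequence[float],
                           reference_hz: Sequence[float]) -> float:
    assert len(predicted_hz) == len(reference_hz)
    return sum(abs(p - r) for p, r in zip(predicted_hz, reference_hz)) / len(reference_hz)
```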

5.2. Fine-tuning

Different from the findings in [13], but in line with, e.g., [35], taking the output of a fixed BERT model did not improve performance in our setting. We found that the BERT part of the model has to be fine-tuned. Table 3 lists the results of a side-by-side test of the fine-tuned model described above, and the exact same model where no gradients flow to the BERT part of the model at all. As we can see from this table, fine-tuning the model consistently yields better results.

Table 3: Comparison of CHiVE-BERT to the same model in which the BERT part is not fine-tuned. All results are statistically significant, using a two-sided t-test with α = 0.01.

  test               outcome
  Hard lines ♀       0.549 ± 0.076
  Hard lines ♂       0.524 ± 0.077
  Questions ♀        0.287 ± 0.077
  Questions ♂        0.229 ± 0.075
  Generic lines ♀    0.274 ± 0.058
  Generic lines ♂    0.186 ± 0.056

5.3. Negative findings

BERT models typically have a much smaller learning rate during pretraining (e.g., 10^-4) than the one we use when training CHiVE-BERT (0.05). We tried applying a similar (smaller) learning rate to just the BERT part of the model, which yielded identical or worse results than not doing this.
6. Conclusion

We presented CHiVE-BERT, an RNN-based prosody model that incorporates a BERT model, fine-tuned during training, to improve the prosody of synthesized speech. We showed that our model outperforms the state-of-the-art baseline on three datasets selected to be prosodically different, for a female and a male speaker.

We conclude from our findings that, rather than using embeddings from a pretrained BERT model, fine-tuning BERT is crucial for getting good results. Additionally, our experiments indicate that relatively small BERT models yield better results than much bigger ones, when employed in a speech synthesis context as proposed. This is at odds with findings in other NLP tasks, where bigger models tend to yield better performance.

A limitation of the work presented is that, even though both the RNNs and the BERT model we implemented can, in theory, handle arbitrarily long input [36], very long sequences will take prohibitively long to synthesize. Novel developments in model distillation, or new ways of running parts of the network in parallel, might allow for longer sequences to be dealt with. Lastly, the current evaluations were done on US English data only. It would be interesting to see if similar improvements can be obtained on other languages.

7. Acknowledgments

The authors would like to thank the Google TTS Research team, in particular Vincent Wan and Chun-an Chan, for their invaluable help at every stage of the development of the work presented here, Aliaksei Severyn for helping out with all things BERT-related, and Alexander Gutkin for helpful comments on earlier versions of the manuscript. Lastly, many thanks to the reviewers for their thoughtful and constructive suggestions.

8. References

[1] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783.

[2] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” Interspeech 2017, pp. 4006–4010, 2017.

[3] V. Wan, C.-a. Chan, T. Kenter, J. Vit, and R. Clark, “CHiVE: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network,” in International Conference on Machine Learning (ICML), 2019, pp. 3331–3340.

[4] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, “Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices,” in Interspeech 2016, 2016, pp. 2273–2277.

[5] H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4470–4474.

[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.

[7] G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?” in Annual Meeting of the Association for Computational Linguistics (ACL), 2019, pp. 3651–3657.

[8] I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Meeting of the Association for Computational Linguistics (ACL), 2019, pp. 4593–4601.

[9] J. Hewitt and C. D. Manning, “A structural probe for finding syntax in word representations,” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4129–4138.

[10] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller, “Language models as knowledge bases?” in EMNLP-IJCNLP, 2019, pp. 2463–2473.

[11] T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu, “Pre-trained text embeddings for enhanced text-to-speech synthesis,” Interspeech 2019, pp. 4430–4434, 2019.

[12] Y. Xiao, L. He, H. Ming, and F. K. Soong, “Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6704–6708.

[13] B. Yang, J. Zhong, and S. Liu, “Pre-trained text representations for improving front-end text processing in Mandarin text-to-speech synthesis,” Interspeech 2019, pp. 4480–4484, 2019.

[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[15] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, “Character-level language modeling with deeper self-attention,” in AAAI Conference on Artificial Intelligence, 2019, pp. 3159–3166.

[16] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2978–2988.

[17] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, 2019, pp. 5754–5764.

[18] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, and N. Dehak, “Hierarchical transformers for long document classification,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 838–844.

[19] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.

[20] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big bird: Transformers for longer sequences,” arXiv preprint arXiv:2007.14062, 2020.

[21] S. Hofstätter, H. Zamani, B. Mitra, N. Craswell, and A. Hanbury, “Local self-attention over long text for efficient document retrieval,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), 2020.

[22] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations (ICLR), 2014.

[23] P. Ebden and R. Sproat, “The Kestrel TTS text normalization system,” Natural Language Engineering, vol. 21, no. 3, pp. 333–353, 2015.

[24] H. Zhang, R. Sproat, A. H. Ng, F. Stahlberg, X. Peng, K. Gorman, and B. Roark, “Neural models of text normalization for speech applications,” Computational Linguistics, vol. 45, no. 2, pp. 293–337, 2019.

[25] M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5149–5152.

[26] S. Wu and M. Dredze, “Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT,” in EMNLP-IJCNLP, 2019, pp. 833–844.

[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, pp. 1929–1958, 2014.

[28] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.

[29] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[30] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. A. Smith, “Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping,” arXiv preprint arXiv:2002.06305, 2020.

[31] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in International Conference on Machine Learning, 2017, pp. 3918–3926.

[32] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith, “Linguistic knowledge and transferability of contextual representations,” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 1073–1094.

[33] A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in BERTology: What we know about how BERT works,” arXiv preprint arXiv:2002.12327, 2020.

[34] Y. Goldberg, “Assessing BERT’s syntactic abilities,” arXiv preprint arXiv:1901.05287, 2019.

[35] M. E. Peters, S. Ruder, and N. A. Smith, “To tune or not to tune? adapting pretrained representations to diverse tasks,” in Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019, pp. 7–14.

[36] R. Clark, H. Silen, T. Kenter, and R. Leith, “Evaluating long-form text-to-speech: Comparing the ratings of sentences and paragraphs,” in ISCA Speech Synthesis Workshop (SSW10), 2019.
