Linear Prediction-Based WaveNet Speech Synthesis
LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis

Min-Jae Hwang, Search Solution, Seongnam, South Korea ([email protected])
Frank Soong, Microsoft, Beijing, China ([email protected])
Eunwoo Song, Naver Corporation, Seongnam, South Korea ([email protected])
Xi Wang, Microsoft, Beijing, China ([email protected])
Hyeonjoo Kang, Yonsei University, Seoul, South Korea ([email protected])
Hong-Goo Kang, Yonsei University, Seoul, South Korea ([email protected])

arXiv:1811.11913v2 [eess.AS] 4 Mar 2020

(Work partially performed when the first author was an intern at Microsoft Research Asia.)

Abstract—We propose a linear prediction (LP)-based waveform generation method built on the WaveNet vocoding framework. WaveNet-based neural vocoders have significantly improved the quality of parametric text-to-speech (TTS) systems. However, it is challenging to train a neural vocoder effectively when the target database contains a massive amount of acoustical information such as prosody, style, or expressiveness. As a solution, approaches that generate only the vocal source component with a neural vocoder have been proposed. However, they tend to produce synthetic noise because the vocal source is handled independently, without considering the entire speech production process, so a mismatch between the vocal source and the vocal tract filter is inevitable. To address this problem, we propose the LP-WaveNet vocoder, in which the complicated interactions between the vocal source and vocal tract components are jointly trained within a mixture density network-based WaveNet model. Experimental results verify that the proposed system outperforms conventional WaveNet vocoders both objectively and subjectively. In particular, the proposed method achieves a 4.47 mean opinion score (MOS) within the TTS framework.

Index Terms—Text-to-speech, speech synthesis, WaveNet vocoder

I. INTRODUCTION

Waveform generation systems using WaveNet have significantly improved the synthesis quality of deep learning-based text-to-speech (TTS) systems [1]–[5]. Because the WaveNet vocoder generates speech samples within a single unified neural network, it does not require any hand-engineered signal processing pipeline; thus, it provides much higher synthetic quality than traditional parametric vocoders [2].

To further improve the perceptual quality of the synthesized speech, more recent neural excitation vocoders take advantage of the merits of both the linear prediction (LP) vocoder and the WaveNet structure [6]–[10]. In this framework, the formant-related spectral structure of the speech signal is decoupled by an LP analysis filter, and the WaveNet estimates only the distribution of the residual signal (i.e., the excitation). Because the physical behavior of the excitation signal is simpler than that of the speech signal, the training and generation processes become more efficient.

However, the synthesized speech is likely to be unnatural when prediction errors in estimating the excitation are propagated through the LP synthesis process. As the effect of LP synthesis is not considered during training, the synthesis output is vulnerable to variations of the LP synthesis filter.

To alleviate this problem, we propose LP-WaveNet, which enables joint training of the complicated interactions between the excitation and the LP synthesis filter. Under the assumption that the past speech samples and the LP coefficients are given as conditional information, we show that the speech and excitation distributions differ only by a constant. Furthermore, if we model the speech distribution with a mixture density network (MDN) [11], the target speech distribution can be estimated by shifting the mean parameters of the predicted mixture by an LP approximation, defined as the linear combination of past speech samples weighted by the LP coefficients. Note that LP-WaveNet is easy to train because the WaveNet only needs to model the excitation component, while the complicated spectrum modeling is embedded in the LP approximation.

In objective and subjective evaluations, we verified that the proposed LP-WaveNet outperforms conventional WaveNet-based neural vocoders. In particular, LP-WaveNet achieved a 4.47 mean opinion score in the TTS framework.

II. WAVENET-BASED SPEECH SYNTHESIS SYSTEMS

A. µ-law quantization-based WaveNet

WaveNet is a convolutional neural network (CNN)-based auto-regressive generative model that predicts the joint probability distribution of speech samples x = {x_1, x_2, ..., x_N} as follows:

p(x | h) = \prod_n p(x_n | x_{<n}, h),    (1)

where x_n, x_{<n}, and h denote the n-th speech sample, its past speech samples, and the acoustic features, respectively. By stacking multiple dilated causal convolution layers, WaveNet effectively extends its receptive field to thousands of samples.

The first proposed WaveNet, a.k.a. µ-law WaveNet [1], defines the distribution of a speech sample over 256 categorical classes of symbols obtained by 8-bit µ-law quantization of the speech samples. To model this distribution, a categorical distribution is computed by applying a softmax operation to the output of WaveNet. In the training phase, the weights of WaveNet are updated to minimize the cross-entropy loss; in the generation phase, the speech is generated auto-regressively, sample by sample.

Since µ-law WaveNet can generate the speech signal with a single unified model, it provides significantly better synthetic sound than conventional parametric vocoders. However, it is not easy to train the network when the database is large and its acoustical information, such as prosody, style, or expressiveness, is highly varied. Moreover, the synthesized sound often suffers from background noise artifacts because the target speech signal is too coarsely quantized.
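As a concrete illustration of the 8-bit µ-law quantization described above, the following minimal NumPy sketch implements the standard µ-law companding transform and its inverse, yielding the 256 categorical classes that µ-law WaveNet models; the function names are our own and the code is not taken from the paper.

```python
import numpy as np

MU = 255  # 8-bit mu-law: 256 categorical classes

def mu_law_encode(x: np.ndarray) -> np.ndarray:
    """Map waveform samples in [-1, 1] to integer class indices in [0, 255]."""
    # Mu-law companding compresses the amplitude axis so that quantization
    # noise is distributed more evenly across the perceptual loudness range.
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    # Shift from [-1, 1] to integer bins 0..255.
    return ((compressed + 1.0) / 2.0 * MU + 0.5).astype(np.int64)

def mu_law_decode(indices: np.ndarray) -> np.ndarray:
    """Invert the quantization back to waveform samples in [-1, 1]."""
    y = 2.0 * (indices.astype(np.float64) / MU) - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU
```

In training, the encoder output serves as the class target for the cross-entropy loss under the factorization of Eq. (1); in generation, each sampled class index is decoded back to a waveform sample before being fed to the next auto-regressive step.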
B. WaveNet-based excitation modeling

One effective solution is to model the excitation signal instead of the speech signal. For instance, in the ExcitNet approach [8], the excitation signal is first obtained by an LP analysis filter, and its probabilistic behavior is then trained within the WaveNet framework. During synthesis, the excitation signal is generated by the trained WaveNet and passed through an LP synthesis filter to synthesize the speech signal as follows:

x_n = e_n + \hat{x}_n,  \hat{x}_n = \sum_{i=1}^{p} \alpha_i x_{n-i},    (2)

where e_n, \hat{x}_n, p, and \alpha_i denote the n-th sample of the excitation signal, the intermediate LP approximation term, the order of the LP analysis, and the i-th LP coefficient, respectively. Note that the LP coefficients are periodically updated to match the extraction period of the acoustic features. For instance, if the acoustic features are extracted every 5 ms, the LP coefficients are also updated every 5 ms to synchronize with the feature update interval.

Because the variation in the excitation signal is constrained only by vocal cord movement, its training is much easier and the quality of the finally synthesized speech is much higher, too. However, the synthesized speech often contains unnatural artifacts because the excitation model is trained independently, without considering the effect of the LP synthesis filter; this causes a mismatch between the excitation signal and the LP synthesis filter. To address this limitation, we propose LP-WaveNet, in which both the excitation signal and the LP synthesis filter are jointly considered during training and synthesis.
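To make the analysis/synthesis relation of Eq. (2) concrete, the sketch below estimates the LP coefficients of a frame with the standard autocorrelation (Yule-Walker) method and applies the LP analysis and synthesis recursions; the per-frame handling is our own simplification, whereas in the systems described above the coefficients are refreshed at the acoustic-feature rate (e.g., every 5 ms).

```python
import numpy as np

def lp_coefficients(frame: np.ndarray, order: int) -> np.ndarray:
    """Estimate LP coefficients alpha_i by the autocorrelation method."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Yule-Walker normal equations R a = r, with light diagonal loading
    # to keep the Toeplitz matrix well conditioned.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-6 * r[0] * np.eye(order)
    return np.linalg.solve(R, r[1:order + 1])

def lp_analysis(x: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Excitation e_n = x_n - x_hat_n, x_hat_n = sum_i alpha_i x_{n-i} (Eq. (2))."""
    p, e = len(alpha), np.zeros_like(x, dtype=np.float64)
    for n in range(len(x)):
        past = x[max(0, n - p):n][::-1]      # x_{n-1}, x_{n-2}, ..., x_{n-p}
        e[n] = x[n] - np.dot(alpha[:len(past)], past)
    return e

def lp_synthesis(e: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Reconstruct x_n = e_n + sum_i alpha_i x_{n-i} auto-regressively."""
    p, x = len(alpha), np.zeros_like(e, dtype=np.float64)
    for n in range(len(e)):
        past = x[max(0, n - p):n][::-1]
        x[n] = e[n] + np.dot(alpha[:len(past)], past)
    return x
```

With identical coefficients, lp_synthesis applied to the output of lp_analysis reproduces the input exactly; at synthesis time, however, the excitation comes from the neural vocoder, so any prediction error is shaped by the same recursive filter, which is precisely the mismatch problem discussed above.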
III. LINEAR PREDICTION WAVENET VOCODER

A. Fundamental mathematics

Before introducing the proposed LP-WaveNet, the probabilistic relationship between the speech and excitation signals has to be clarified. Note that at the moment the n-th sample is generated in WaveNet's synthesis stage, \hat{x}_n in Eq. (2) can be treated as a given factor, since both the LP coefficients \alpha_i and the previously reconstructed samples {x_{n-i}} have already been estimated. Hence, we conclude that the difference between the two random variables x_n and e_n is only the known constant \hat{x}_n.

Considering the shift property of a second-order random variable, if we define the speech distribution as a mixture of Gaussians (MoG), the mixture parameters of the speech and excitation distributions differ only by a constant shift of the mean parameters:

p(x_n | x_{<n}, h) = \sum_{i=1}^{N} \frac{w_{n,i}}{\sqrt{2\pi}\, s_{n,i}} \exp\left[-\frac{(x_n - \mu_{n,i})^2}{2 s_{n,i}^2}\right],    (3)

w^x_{n,i} = w^e_{n,i},   \mu^x_{n,i} = \mu^e_{n,i} + \hat{x}_n,   s^x_{n,i} = s^e_{n,i},    (4)

where N and i denote the number and index of the mixture components, respectively; w denotes the weight of a mixture component; \mathcal{N}(\mu, s) denotes a Gaussian distribution with mean \mu and standard deviation s; and the superscripts e and x denote the excitation and the speech, respectively. Based on this observation, we propose the LP-WaveNet vocoder, in which the LP synthesis process is structurally reflected in WaveNet's training and inference processes.

B. Network architecture

The detailed architecture of LP-WaveNet is illustrated in Fig. 1. In the proposed system, the distribution of a speech sample is defined as a MoG distribution following Eq. (3), and LP-WaveNet is trained to generate the MoG parameters [w_n, \mu_n, s_n] conditioned on the input acoustic features.

In detail, the acoustic features pass through two 1-dimensional convolution layers with kernel size 3 to explicitly impose the contextual information of the feature trajectory. A residual connection with respect to the input acoustic features is then applied to make the network focus more on the current frame information. Finally, a transposed convolution is applied to upsample the temporal resolution of these features to that of the speech signal.

To generate the speech samples, the mixture parameters of the excitation signal, i.e., the mixture gain, mean, and log-standard deviation, are first predicted by the WaveNet as follows:

[z^w_n, z^\mu_n, z^s_n] = WaveNet(x_{<n}, h_n).    (5)

The derivative of the NLL loss L with respect to z^\mu_n can then be represented as follows:

\frac{\partial L}{\partial z^\mu_i} = \frac{\partial L}{\partial \mu_j} \cdot \frac{\partial \mu_j}{\partial z^\mu_i},    (8)

where the time index n is omitted for readability.
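To tie Eqs. (3)-(5) and (8) together, the following sketch converts the raw WaveNet outputs of Eq. (5) into speech-domain MoG parameters via the mean shift of Eq. (4) and evaluates the negative log-likelihood of Eq. (3). The use of PyTorch and the tensor shapes are our own illustrative assumptions; the paper does not prescribe an implementation.

```python
import math
import torch
import torch.nn.functional as F

def lp_wavenet_nll(z_w: torch.Tensor,    # [B, N] raw mixture-weight logits
                   z_mu: torch.Tensor,   # [B, N] raw excitation means
                   z_s: torch.Tensor,    # [B, N] raw log-standard deviations
                   x_hat: torch.Tensor,  # [B, 1] LP approximation from Eq. (2)
                   x: torch.Tensor       # [B, 1] target speech sample x_n
                   ) -> torch.Tensor:
    """Negative log-likelihood of x_n under the shifted MoG of Eqs. (3)-(4)."""
    log_w = F.log_softmax(z_w, dim=-1)   # mixture weights, shared per Eq. (4)
    mu_x = z_mu + x_hat                  # Eq. (4): speech mean = excitation mean + x_hat_n
    log_s = z_s                          # standard deviations, shared per Eq. (4)
    # Per-component Gaussian log-density of Eq. (3), in the log domain.
    log_prob = (-0.5 * ((x - mu_x) / log_s.exp()) ** 2
                - log_s - 0.5 * math.log(2.0 * math.pi))
    # Log-sum-exp over mixture components for numerical stability.
    return -torch.logsumexp(log_w + log_prob, dim=-1).mean()
```

Because x_hat enters the loss only through the mean shift, automatic differentiation realizes the chain rule of Eq. (8) without custom gradient code, so the excitation model is trained with the effect of the LP synthesis filter taken into account.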