
Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Aaron van den Oord 1 Yazhe Li 1 Igor Babuschkin 1 Karen Simonyan 1 Oriol Vinyals 1 Koray Kavukcuoglu 1 George van den Driessche 1 Edward Lockhart 1 Luis C. Cobo 1 Florian Stimberg 1 Norman Casagrande 1 Dominik Grewe 1 Seb Noury 1 Sander Dieleman 1 Erich Elsen 1 Nal Kalchbrenner 1 Heiga Zen 1 Alex Graves 1 Helen King 1 Tom Walters 1 Dan Belov 1

1 DeepMind Technologies, London, United Kingdom. Correspondence to: Aaron van den Oord.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

The recently-developed WaveNet architecture (van den Oord et al., 2016a) is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples at more than 20 times faster than real-time, a 1000x speed-up relative to the original WaveNet, and capable of serving multiple English and Japanese voices in a production setting.

1. Introduction

Recent successes of deep learning go beyond achieving state-of-the-art results in research benchmarks, and push the frontiers in some of the most challenging real world applications such as speech recognition (Hinton et al., 2012), image recognition (Krizhevsky et al., 2012; Szegedy et al., 2015), and machine translation (Wu et al., 2016). The recently published WaveNet (van den Oord et al., 2016a) model achieves state-of-the-art results in speech synthesis, and significantly closes the gap with natural human speech. However, it is not well suited for real world deployment due to its prohibitive generation speed. In this paper, we present a new method for distilling WaveNet into a feed-forward neural network which can synthesise equally high quality speech much more efficiently, and is deployed to millions of users.

WaveNet is one of a family of autoregressive deep generative models that have been applied with great success to data as diverse as text (Mikolov et al., 2010), images (Larochelle & Murray, 2011; Theis & Bethge, 2015; van den Oord et al., 2016c;b), video (Kalchbrenner et al., 2016), handwriting (Graves, 2013) as well as human speech and music. Modelling raw audio signals, as WaveNet does, represents a particularly extreme form of autoregression, with up to 24,000 samples predicted per second. Operating at such a high temporal resolution is not problematic during network training, where the complete sequence of input samples is already available and, thanks to the convolutional structure of the network, can be processed in parallel. When generating samples, however, each input sample must be drawn from the output distribution before it can be passed in as input at the next time step, making parallel processing impossible.

Inverse autoregressive flows (IAFs) (Kingma et al., 2016) represent a kind of dual formulation of deep autoregressive modelling, in which sampling can be performed in parallel, while the inference procedure required for likelihood estimation is sequential and slow. The goal of this paper is to marry the best features of both models: the efficient training of WaveNet and the efficient sampling of IAF networks. The bridge between them is a new form of neural network distillation (Hinton et al., 2015), which we refer to as Probability Density Distillation, where a trained WaveNet model is used as a teacher for a feedforward IAF model.

The next section describes the original WaveNet model, while Sections 3 and 4 define in detail the new, parallel version of WaveNet and the distillation process used to transfer knowledge between them. Section 5 then presents experimental results showing no loss in perceived quality for parallel versus original WaveNet, and continued superiority over previous benchmarks. We also present timings for sample generation, demonstrating a more than 1000x speed-up relative to the original WaveNet.

2. WaveNet

Autoregressive networks model the joint distribution of high-dimensional data as a product of conditional distributions using the probabilistic chain rule:

$$p(x) = \prod_t p(x_t \mid x_{<t}, \theta),$$

where $x_t$ is the $t$-th variable of $x$ and $\theta$ are the parameters of the autoregressive model.

2.1. Higher Fidelity WaveNet

For this work we made two improvements to the basic WaveNet model to enhance its audio quality for production use. Unlike previous versions of WaveNet (van den Oord et al., 2016a), where 8-bit (µ-law or PCM) audio was modelled, we increase the fidelity by modelling 16-bit audio at a sample rate of 24kHz rather than 16kHz, using a discretized mixture of logistics output distribution (Salimans et al., 2017) in place of a prohibitively large categorical output.
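The factorisation above is exactly what makes training parallel but generation sequential, as described in the introduction. A minimal NumPy sketch of the asymmetry, with `toy_conditional` as a hypothetical stand-in for the WaveNet output distribution (not the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_conditional(history):
    """Hypothetical stand-in for p(x_t | x_<t): a Gaussian whose
    mean depends only on the most recent sample."""
    mean = 0.5 * history[-1] if history else 0.0
    return mean, 0.1  # (mean, std)

# Training-time scoring: the whole sequence x is known, so every
# conditional can be evaluated at once (teacher forcing).
x = rng.standard_normal(8)
means = 0.5 * np.concatenate(([0.0], x[:-1]))      # all means in parallel
log_lik = np.sum(-0.5 * ((x - means) / 0.1) ** 2)  # up to additive constants

# Generation: each sample must be drawn before the next conditional
# even exists, forcing an inherently sequential loop.
samples = []
for _ in range(8):
    mean, std = toy_conditional(samples)
    samples.append(mean + std * rng.standard_normal())
```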


Figure 1. Visualisation of a WaveNet stack and its receptive field (van den Oord et al., 2016a). Starting from the inputs at the bottom, the WaveNet architecture has increasing levels of dilation by a factor of 2, so that each output unit, shown at the top row of the figure, can combine dependency from a large range of inputs.

3. Parallel WaveNet

The parallel WaveNet student is an inverse autoregressive flow. A white noise sample $z$, drawn from a logistic distribution, is then transformed as follows:

$$x_t = z_t \cdot s(z_{<t}, \theta) + \mu(z_{<t}, \theta).$$

Because the shift $\mu$ and scale $s$ at every timestep depend only on the noise $z_{<t}$, which is known in advance, all timesteps can be computed in a single parallel pass. Several such flows can be stacked, with the output of one flow serving as the input to the next. Because we use the same ordering in all the flows, the final distribution $p(x_t \mid z_{<t})$ is logistic, with its location and scale obtained by composing the shifts and scales of the individual flows.
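To make this transformation concrete, here is a minimal NumPy sketch of IAF sampling; `shift_and_scale_net` is a hypothetical stand-in for the causal WaveNet-like network (any real implementation must ensure that position t only sees z_<t):

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_and_scale_net(z):
    """Hypothetical stand-in for a causal WaveNet-like network:
    outputs at position t depend only on z_1 ... z_{t-1}."""
    z_prev = np.concatenate(([0.0], z[:-1]))  # shift right by one step
    mu = 0.1 * z_prev                         # toy "network" for the shift
    s = np.exp(0.05 * z_prev)                 # strictly positive scale
    return mu, s

# One IAF flow: x_t = z_t * s(z_<t) + mu(z_<t).  Because mu and s depend
# only on the noise z, which is known up front, every timestep of x is
# produced in a single parallel pass.
z = rng.logistic(loc=0.0, scale=1.0, size=16)
mu, s = shift_and_scale_net(z)
x = z * s + mu

# Stacking flows: the output of one flow becomes the input of the next.
for _ in range(3):
    mu, s = shift_and_scale_net(x)
    x = x * s + mu
```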

" T # X 4.1. Argument against MAP estimation H(PS) = E ln pS(xt z

[Figure 2: the WaveNet teacher scores the student's generated samples $x_i = g(z_i \mid z_{<i})$ with its output distribution $P(x_i \mid x_{<i})$; both networks receive the linguistic features, and the student receives input noise $z_i$.]

Figure 2. Overview of Probability Density Distillation. A pre-trained WaveNet teacher is used to score the samples x output by the student. The student is trained to minimise the KL-divergence between its distribution and that of the teacher by maximising the log-likelihood of its samples under the teacher and maximising its own entropy at the same time.
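To make the objective in Figure 2 concrete, the sketch below evaluates the decomposition $D_{\mathrm{KL}}(P_S \| P_T) = H(P_S, P_T) - H(P_S)$ from Equation 6 on a toy problem; the teacher density and the student's per-timestep shift and scale are hypothetical stand-ins for trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 16  # timesteps in one training clip

# Toy student: an IAF that shifts and scales logistic noise.
z = rng.logistic(size=T)
mu = 0.1 * np.ones(T)   # hypothetical student shift outputs
s = 0.9 * np.ones(T)    # hypothetical student scale outputs
x = z * s + mu          # student sample, drawn in one parallel pass

def teacher_logpdf(x):
    """Hypothetical teacher: a standard logistic at every timestep."""
    return -x - 2.0 * np.log1p(np.exp(-x))

# H(P_S, P_T): score the student's own sample under the teacher; for a
# convolutional teacher all timesteps are evaluated in parallel.
cross_entropy = -teacher_logpdf(x).sum()

# H(P_S): closed form for a logistic output, sum_t ln s_t + 2T,
# so maximising the entropy needs no extra sampling.
entropy = np.log(s).sum() + 2.0 * T

kl_estimate = cross_entropy - entropy  # one-sample estimate of D_KL(P_S || P_T)
```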

4.1. Argument against MAP estimation

The distillation loss defined in Section 4 minimises the KL divergence between the teacher and generator. We could instead have minimised only the cross-entropy between the teacher and generator (the standard distillation loss term (Hinton et al., 2015)), so that the samples generated by the student are as likely as possible according to the teacher. Doing so would give rise to MAP estimation. Counter-intuitively, audio samples obtained through MAP estimation do not sound as good as typical examples from the teacher: in fact they are almost completely silent, even if using conditional information such as linguistic features. This effect is not due to adversarial behaviour on the part of the teacher, but rather is a fundamental property of the data distribution which the teacher has approximated.

As an example, consider the simple case where we have audio from a white random noise source: the distribution at every timestep is N(0, 1), regardless of the samples at previous timesteps. White noise has a very specific and perceptually recognizable sound: a continual hiss. The MAP estimate of this data distribution, and thus of any generative model that matches it well, recovers the distribution mode, which is 0 at every timestep: i.e., complete silence. More generally, any highly stochastic process is liable to have a 'noiseless' and therefore atypical mode. For the KL divergence the optimum is instead to recover the full teacher distribution, which is clearly different from any single most-likely sample. Furthermore, if one changes the representation of the data (e.g., by nonlinearly pre-processing the audio signal), then the MAP estimate changes, unlike the KL divergence in Equation 6, which is invariant to the coordinate system.

4.2. Additional loss terms

Training with Probability Density Distillation alone might not sufficiently constrain the student to generate high quality audio streams. Therefore, we also introduce additional loss functions to guide the student distribution towards the desired output space.

POWER LOSS

The first additional loss we propose is the power loss, which ensures that the power in different frequency bands of the speech is on average used as much as in human speech. The power loss helps to prevent the student from collapsing to a high-entropy WaveNet mode, such as whispering. The power loss is defined as

$$\| \phi(g(z, c)) - \phi(y) \|^2, \qquad (16)$$

where $(y, c)$ is an example with conditioning from the training set, $\phi(x) = |\mathrm{STFT}(x)|^2$ and STFT stands for the Short-Term Fourier Transform. We found that $\phi(x)$ can be averaged over time before taking the Euclidean distance with little difference in effect, which means it is the average power for various frequencies that is important.
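Below is a minimal sketch of the power loss of Equation 16, assuming SciPy's STFT as the $\phi$ transform and toy stand-in waveforms; the optional time-averaging described above is included:

```python
import numpy as np
from scipy.signal import stft

def phi(audio, nperseg=256):
    """phi(x) = |STFT(x)|^2, the power spectrogram."""
    _, _, z = stft(audio, nperseg=nperseg)
    return np.abs(z) ** 2

def power_loss(generated, reference, average_time=True):
    """|| phi(g(z, c)) - phi(y) ||^2 as in Equation 16.  With
    average_time=True the spectrograms are first averaged over frames,
    so only the average power per frequency band is matched."""
    pg, pr = phi(generated), phi(reference)
    if average_time:
        pg, pr = pg.mean(axis=-1), pr.mean(axis=-1)
    return np.sum((pg - pr) ** 2)

# Toy usage: a near-silent (whispering) student output is heavily
# penalised against a reference clip y from the training set.
rng = np.random.default_rng(0)
y = rng.standard_normal(24000)         # stand-in for one second of speech
g = 0.01 * rng.standard_normal(24000)  # stand-in for a collapsed student
print(power_loss(g, y))
```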

PERCEPTUAL LOSS

In the power loss formulation given in Equation 16, one can also use a neural network instead of the STFT to conserve a perceptual property of the signal rather than total energy. In our case we have used a WaveNet-like classifier trained to predict the phones from raw audio. Because such a classifier naturally extracts high-level features that are relevant for recognising the phones, this loss term penalises bad pronunciations. A similar principle has been used in computer vision for artistic style transfer (Gatys et al., 2015), or to get better perceptual reconstruction losses, e.g., in super-resolution (Johnson et al., 2016).

We have experimented with two different ways of using the perceptual loss: the feature reconstruction loss (the Euclidean distance between feature maps in the classifier) and the style loss (the Euclidean distance between the Gram matrices (Johnson et al., 2016)). The latter produced better results in our experiments.

CONTRASTIVE LOSS

Finally, we also introduce a contrastive distillation loss as follows:

$$D_{\mathrm{KL}}\big(P_S(c_1) \,\|\, P_T(c_1)\big) - \gamma \, D_{\mathrm{KL}}\big(P_S(c_1) \,\|\, P_T(c_2)\big), \qquad (17)$$

which minimises the KL divergence between the teacher and student when both are conditioned on the same information $c_1$ (e.g., linguistic features, speaker ID, ...), but also maximises it for different conditioning pairs $c_1 \neq c_2$. In order to implement this loss, we use the output of the student $x = g(z, c_1)$ and evaluate the waveform twice under the teacher: once with the same conditioning, $P_T(x \mid c_1)$, and once with a randomly sampled conditioning input, $P_T(x \mid c_2)$. The weight for the contrastive term $\gamma$ was set to 0.3 in our experiments. The contrastive loss penalises waveforms that have high likelihood regardless of the conditioning vector.
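As a sketch of how Equation 17 could be evaluated, the code below scores the same hypothetical student sample under matched and mismatched conditioning; `teacher_logpdf` is an illustrative stand-in for the WaveNet teacher:

```python
import numpy as np

GAMMA = 0.3  # weight of the contrastive term used in the paper

def teacher_logpdf(x, c):
    """Illustrative conditional teacher: a logistic whose location
    depends on the conditioning value c."""
    u = x - 0.2 * c
    return -u - 2.0 * np.log1p(np.exp(-u))

def contrastive_loss(x, entropy_s, c1, c2):
    """D_KL(P_S(c1) || P_T(c1)) - GAMMA * D_KL(P_S(c1) || P_T(c2)).
    The same student sample x = g(z, c1) is scored twice under the
    teacher: once with the matching conditioning c1 and once with a
    randomly drawn conditioning c2."""
    kl_same = -teacher_logpdf(x, c1).sum() - entropy_s
    kl_diff = -teacher_logpdf(x, c2).sum() - entropy_s
    return kl_same - GAMMA * kl_diff

rng = np.random.default_rng(0)
x = rng.logistic(size=16)  # stand-in for the student output g(z, c1)
h_s = 2.0 * 16             # student entropy (s = 1 => ln s = 0, plus 2T)
print(contrastive_loss(x, entropy_s=h_s, c1=1.0, c2=-1.0))
```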
5. Experiments

In all our experiments we used text-to-speech models that were conditioned on linguistic features (similar to (van den Oord et al., 2016a)), providing phonetic and duration information to the network. We also conditioned the models on pitch information (the logarithm of $f_0$, the fundamental frequency) predicted by a different model. We never used ground-truth information (such as pitch or duration) extracted from human speech for generating audio samples, and the test sentences were not present in (or similar to those in) the training set.

The teacher WaveNet network was trained for 1,000,000 steps with the ADAM optimiser (Kingma & Ba, 2014) with a minibatch size of 32 audio clips, each containing 7,680 timesteps (roughly 320ms). Remarkably, a relatively short snippet of time is sufficient to train the parallel WaveNet to produce long-term coherent waveforms. The learning rate was held constant at $2 \times 10^{-4}$, and Polyak averaging (Polyak & Juditsky, 1992) was applied over the parameters. The model consists of 30 layers, grouped into 3 dilated residual block stacks of 10 layers. In every stack, the dilation rate increases by a factor of 2 in every layer, starting with rate 1 (no dilation) and reaching the maximum dilation of 512 in the last layer. The filter size of the causal dilated convolutions is 3. The number of hidden units in the gating layers is 512 (split into two groups of 256 for the two parts of the gated activation unit (van den Oord et al., 2016a)). The number of hidden units in the residual connection is 512, and in the skip connection and the 1x1 convolutions before the output layer it is 256. We used 10 mixture components for the mixture of logistics output distribution.

The student network consisted of the same WaveNet architecture layout, except with different inputs and outputs and no skip connections. The student was also trained for 1,000,000 steps with the same optimisation settings. The student typically consisted of 4 flows with 10, 10, 10, 30 layers respectively, with 64 hidden units for the residual and gating layers.

AUDIO GENERATION SPEED

We have benchmarked the sampling speed of autoregressive and distilled WaveNets on an NVIDIA P100 GPU. Both models were implemented in TensorFlow (Abadi et al., 2016) and compiled with XLA. The hidden layer activations from previous timesteps in the autoregressive model were cached with circular buffers (Paine et al., 2016). The resulting sampling speed with this implementation is 172 timesteps/second for a minibatch of size 1. The distilled model, which is more parallelizable, achieves over 500,000 timesteps/second with the same batch size of 1, resulting in a speed-up of three orders of magnitude.
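The circular-buffer caching of Paine et al. (2016) mentioned above can be sketched as follows (a hypothetical single-channel toy, not the benchmarked TensorFlow implementation): each dilated layer keeps its last `dilation` activations in a ring buffer, so a generation step costs O(1) work per layer instead of re-running the convolution over the whole history.

```python
import numpy as np

class DilatedLayerCache:
    """Ring buffer holding the last `dilation` activations of one layer,
    in the spirit of the fast-generation caching of Paine et al. (2016)."""
    def __init__(self, dilation, channels):
        self.buf = np.zeros((dilation, channels))
        self.pos = 0

    def push_and_pull(self, h):
        """Store the newest activation h; return the activation from
        `dilation` steps ago, which a filter-size-2 dilated causal
        convolution needs at the current step."""
        old = self.buf[self.pos].copy()
        self.buf[self.pos] = h
        self.pos = (self.pos + 1) % len(self.buf)
        return old

# Toy usage for one layer with dilation 4: each generation step does O(1)
# work per layer instead of re-running the convolution over all history.
cache = DilatedLayerCache(dilation=4, channels=1)
w_new, w_old = 0.7, 0.3  # hypothetical filter taps
h = np.zeros(1)
for step in range(10):
    h_delayed = cache.push_and_pull(h)
    h = np.tanh(w_new * h + w_old * h_delayed)  # stand-in for the layer
```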

AUDIO FIDELITY

In our first set of experiments, we looked at the quality of WaveNet distillation compared to the autoregressive WaveNet teacher and other baselines on data from a professional female speaker (van den Oord et al., 2016a). Table 1 gives a comparison of autoregressive WaveNet, distilled WaveNet and current production systems in terms of mean opinion score (MOS). There is no difference between the MOS scores of the distilled WaveNet (4.41 ± 0.08) and the autoregressive WaveNet (4.41 ± 0.07), and both are significantly better than the concatenative unit-selection baseline (4.19 ± 0.1).

It is also important to note that the difference between our WaveNet baseline result of 4.41 and the previously reported result of 4.21 (van den Oord et al., 2016a) is due to the improvement in audio fidelity explained in Section 2.1: modelling a sample rate of 24kHz instead of 16kHz and a bit depth of 16-bit PCM instead of 8-bit µ-law.

Method                               Subjective 5-scale MOS
16kHz, 8-bit µ-law, 25h data
  LSTM-RNN parametric (1)            3.67 ± 0.098
  HMM-driven concatenative (1)       3.86 ± 0.137
  WaveNet (1)                        4.21 ± 0.081
24kHz, 16-bit linear PCM, 65h data
  HMM-driven concatenative           4.19 ± 0.097
  Autoregressive WaveNet             4.41 ± 0.069
  Distilled WaveNet                  4.41 ± 0.078

Table 1. Comparison of WaveNet distillation with the autoregressive teacher WaveNet, unit-selection (concatenative), and previous results (1) from (van den Oord et al., 2016a). MOS stands for Mean Opinion Score.

Speaker                                 Parametric   Concatenative   Distilled WaveNet
English speaker 1 (female, 65h data)    3.88         4.19            4.41
English speaker 2 (male, 21h data)      3.96         4.09            4.34
English speaker 3 (male, 10h data)      3.77         3.65            4.47
English speaker 4 (female, 9h data)     3.42         3.40            3.97
Japanese speaker (female, 28h data)     4.07         3.47            4.23

Table 2. Comparison of MOS scores on English and Japanese with multi-speaker distilled WaveNets. Note that some speakers sound less appealing to raters and therefore receive consistently lower MOS; nevertheless, the distilled parallel WaveNet achieved significantly better results for every speaker.

MULTI-SPEAKER GENERATION

By conditioning on the speaker IDs we can construct a single parallel WaveNet model that is able to generate multiple speakers' voices and their accents. These networks require slightly more capacity than single-speaker models and thus had 30 layers in each flow. In Table 2 we show a comparison of such a distilled parallel WaveNet model with two main baselines: a parametric and a concatenative system. In the comparison, we use a number of English speakers from a single model (one of them, English speaker 1, is the same speaker as in Table 1) and a Japanese speaker from another model. For some speakers, the concatenative system gets better results than the parametric system, while for other speakers it is the opposite. The parallel WaveNet model, on the other hand, significantly outperforms both baselines for all the speakers.

ABLATION STUDIES

To analyse the importance of the loss functions introduced in Section 4.2, we show in Table 3 how the quality of the distilled WaveNet changes with different combinations of loss terms. We found that the MOS scores of these models tend to be very similar to each other (and similar to the result in Table 1). Therefore, we report subjective preference scores from a paired comparison test ("A/B test"), which we found to be more reliable for noticing small (sometimes qualitative) differences. In these tests, the subjects were asked to listen to a pair of samples and choose which they preferred, though they could choose "neutral" if they did not have any preference.

As mentioned before, the KL loss alone does not constrain the distillation process enough to obtain natural sounding speech (e.g., low-volume audio suffices for the KL), therefore we do not report preference scores with only this term. The KL loss (Section 4) combined with the power loss is enough to generate quite natural speech. Adding the perceptual loss gives a small but noticeable improvement. Adding the contrastive loss does not improve the preference scores any further, but it makes the generated speech less noisy: something most raters do not pay attention to, but which is important for production quality speech.

Losses used                             Win-Lose-Neutral
KL + Power                              60% - 15% - 25%
KL + Power + Percept                    66% - 10% - 24%
KL + Power + Percept + Contrast (*)     65% - 9% - 26%

Table 3. Performance with respect to different combinations of loss terms. We report preference comparison scores since the mean opinion scores tend to be very close and inconclusive. The last row (*), the combination of KL + Power + Perceptual + Contrastive losses, is the default model used.

As explained in Section 3, we use multiple inverse-autoregressive flows in the parallel WaveNet architecture: a model with a single flow gets a MOS score of 4.21, compared to a MOS score of 4.41 for models with multiple flows.

6. Conclusion

In this paper we have introduced a novel method for high-fidelity speech synthesis based on WaveNet (van den Oord et al., 2016a) using Probability Density Distillation.
The pro- scores from a paired comparison test (“A/B test”), which posed model achieved several orders of magnitude speed-up we found to be more reliable for noticing small (sometimes compared to the original WaveNet with no significant differ- qualitative) differences. In these tests, the subjects were ence in quality. Moreover, we have successfully transferred Parallel WaveNet: Fast High-Fidelity Speech Synthesis this algorithm to new languages and multiple speakers. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint As a result, we have been able to run a real-time speech arXiv:1503.02531, 2015. synthesis system, opening the door to many exciting future developments thanks to the flexibility of deep learning mod- Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for els. We believe that the same method presented here can real-time style transfer and super-resolution. In European be used in many different domains to achieve similar speed Conference on , pp. 694–711. Springer, improvements whilst maintaining output accuracy. 2016.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 881–889, 2015.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Springer, 2016.

Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., and Kavukcuoglu, K. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P., Salimans, T., and Welling, M. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 29–37, 2011.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.

Paine, T. L., Khorrami, P., Chang, S., Zhang, Y., Ramachandran, P., Hasegawa-Johnson, M. A., and Huang, T. S. Fast WaveNet generation algorithm. CoRR, abs/1611.09482, 2016. URL http://arxiv.org/abs/1611.09482.

Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2335–2344, 2017.

Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

Sønderby, C. K., Caballero, J., Theis, L., Shi, W., and Huszár, F. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Theis, L. and Bethge, M. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pp. 1927–1935, 2015.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016b.

van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016c.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.