Stochastic WaveNet: A Generative Latent Variable Model for Sequential Data
Guokun Lai 1 Bohan Li 1 Guoqing Zheng 1 Yiming Yang 1
1 Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Correspondence to: Guokun Lai.

Abstract

How to model the distribution of sequential data, including but not limited to speech and human motions, is an important ongoing research problem. It has been demonstrated that model capacity can be significantly enhanced by introducing stochastic latent variables into the hidden states of recurrent neural networks. Simultaneously, WaveNet, equipped with dilated convolutions, achieves astonishing empirical performance on the natural speech generation task. In this paper, we combine the ideas of stochastic latent variables and dilated convolutions, and propose a new architecture to model sequential data, termed Stochastic WaveNet, where stochastic latent variables are injected into the WaveNet structure. We argue that Stochastic WaveNet enjoys powerful distribution modeling capacity as well as the advantage of parallel training from dilated convolutions. In order to efficiently infer the posterior distribution of the latent variables, a novel inference network structure is designed based on the characteristics of the WaveNet architecture. State-of-the-art performance on benchmark datasets is obtained by Stochastic WaveNet on natural speech modeling, and high-quality human handwriting samples can be generated as well.

1. Introduction

Learning to capture the complex distribution of sequential data is an important machine learning problem and has been extensively studied in recent years. Autoregressive neural network models, including the Recurrent Neural Network (Hochreiter and Schmidhuber, 1997; Chung et al., 2014), PixelCNN (Oord et al., 2016) and WaveNet (Van Den Oord et al., 2016), have shown strong empirical performance in modeling natural language, images and human speech.

All these methods aim at learning a deterministic mapping from the data input to the output. Recently, evidence has been found (Fabius and van Amersfoort, 2014; Gan et al., 2015; Gu et al., 2015; Goyal et al., 2017; Shabanian et al., 2017) that probabilistic modeling with neural networks can benefit from uncertainty introduced into their hidden states, namely by including stochastic latent variables in the network architecture. Without such uncertainty in the hidden states, RNN, PixelCNN and WaveNet parameterize the randomness only in the final layer, by shaping an output distribution from a specific distribution family. Hence the output distribution (which is often assumed to be Gaussian for continuous data) is unimodal or a mixture of unimodal distributions given the input data, which may be insufficient to capture the complex true data distribution and to describe the complex correlations among different output dimensions (Boulanger-Lewandowski et al., 2012). Even for the non-parametrized discrete output distribution modeled by the softmax function, a phenomenon referred to as the softmax bottleneck (Yang et al., 2017a) still limits the family of output distributions. By injecting stochastic latent variables into the hidden states and transforming their uncertainty into the outputs through non-linear layers, a stochastic neural network is equipped with the ability to model data with a much richer family of distributions.

Motivated by this, numerous variants of RNN-based stochastic neural networks have been proposed. STORN (Bayer and Osendorfer, 2014) was the first to integrate stochastic latent variables into an RNN's hidden states. In VRNN (Chung et al., 2015), the prior of the stochastic latent variables is assumed to be a function of the historical data and stochastic latent variables, which allows them to capture temporal dependencies. SRNN (Fraccaro et al., 2016) and Z-forcing (Goyal et al., 2017) offer more powerful versions with augmented inference networks, which better capture the correlation between the stochastic latent variables and the whole observed sequence. Training tricks introduced in (Goyal et al., 2017; Shabanian et al., 2017) ease the training process of stochastic recurrent neural networks and lead to better empirical performance. By introducing stochasticity into the hidden states, these RNN-based models achieve better empirical performance than their deterministic counterparts.
In parallel with RNNs, WaveNet (Van Den Oord et al., 2016) provides another powerful way of modeling sequential data with dilated convolutions, especially for the natural speech generation task.
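As a rough illustration of this building block (not the authors' implementation; channel sizes and the plain ReLU residual layers are assumptions), a stack of causal convolutions with dilation rates 1, 2, 4 and 8 can be sketched as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedCausalStack(nn.Module):
    """Minimal sketch of a WaveNet-style stack of dilated causal convolutions.

    Each layer left-pads its input by dilation * (kernel_size - 1), so the
    output at step t depends only on inputs at steps <= t; all time steps can
    therefore be processed in parallel during training, unlike an RNN.
    """

    def __init__(self, channels=64, kernel_size=2, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.dilations = dilations
        self.kernel_size = kernel_size
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size, dilation=d) for d in dilations]
        )

    def forward(self, x):  # x: (batch, channels, T)
        for conv, d in zip(self.convs, self.dilations):
            pad = d * (self.kernel_size - 1)   # causal (left-only) padding
            h = conv(F.pad(x, (pad, 0)))
            x = torch.relu(h) + x              # simple residual connection
        return x
```

With kernel size 2 and dilations doubling per layer, the receptive field grows to 2^L time steps after L layers while the network depth stays logarithmic in the context length.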
2.1. Notation

Subscripts are used as indices, such as x_i or x_{i,j}. f(·) represents a general function that transforms an input vector into an output vector, and f_θ(·) is a neural network function parametrized by θ. For a sequential data sample x, T represents its length.

Figure 1. Visualization of a WaveNet structure from (Van Den Oord et al., 2016): dilated causal convolution layers with dilations 1, 2, 4 and 8 stacked between the input and the output.

2.2. Autoregressive Neural Network

The autoregressive network model is designed to model the joint distribution of high-dimensional data with a sequential structure, by factorizing the joint distribution of a data sample as

p(x) = ∏_{t=1}^{T} p_θ(x_t | x_{<t}).
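As a hedged illustration of this factorization (not code from the paper), the log-likelihood log p(x) = Σ_t log p_θ(x_t | x_{<t}) of a continuous sequence can be evaluated in parallel with any causal network; the Gaussian output head and the helper names below are assumptions.

```python
import math

import torch


def autoregressive_log_likelihood(x, causal_net, out_head):
    """Evaluate log p(x) = sum_t log p_theta(x_t | x_{<t}) under a Gaussian output.

    x          : (batch, dim, T) observed sequence.
    causal_net : any network whose output at step t depends only on its inputs
                 at steps <= t (e.g., a dilated causal stack).
    out_head   : maps hidden states to (mean, log_var), each of shape (batch, dim, T).
    """
    # Shift the sequence right by one step so the prediction for x_t sees x_{<t} only.
    x_shifted = torch.cat([torch.zeros_like(x[:, :, :1]), x[:, :, :-1]], dim=-1)
    mean, log_var = out_head(causal_net(x_shifted))
    log_prob = -0.5 * ((x - mean) ** 2 / log_var.exp() + log_var + math.log(2 * math.pi))
    return log_prob.sum(dim=(1, 2))  # one log-likelihood per sequence in the batch
```

Because every conditional is computed in one forward pass over the shifted input, training does not require the step-by-step unrolling that sequential sampling does.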
3. Stochastic WaveNet

In this section, we introduce a sequential generative model, Stochastic WaveNet, which imposes stochastic latent variables on the multi-layer dilated convolution structure. The hidden representations h_{t,l} and d_{t,l}, at time step t and layer l, are computed as

h_{t,l} = f_{θ1}(d_{t−2^{l−1}, l−1}, d_{t,l−1}),
d_{t,l} = f_{θ2}(h_{t,l}, z_{t,l}),    (4)

where f_{θ1} mimics the design of the dilated convolution in WaveNet, and f_{θ2} is a fully connected layer that summarizes the hidden states and the sampled latent variable. Different from the vanilla WaveNet, the hidden states h are stochastic because of the random samples z. We parameterize the mean and variance of the prior distributions by the hidden representations h, that is, µ_{t,l} = f_{θ3}(h_{t,l}) and log v_{t,l} = f_{θ4}(h_{t,l}). Similarly, we parameterize the emission probability p_θ(x_t | x_{<t}, z_{≤t}) from the hidden representations.

… would increase the degree of distribution mismatch between the prior and the posterior distribution, namely enlarging the KL term in the loss function.

Exploring the dependency structure in WaveNet. However, by analyzing the dependency among the outputs and hidden states of WaveNet, we find that the stochastic latent variables at time stamp t, z_{t,1:L}, would not influence the whole set of posterior outputs. So the inference network only requires the partial posterior outputs to maximize the reconstruction loss term in the loss function. Denote the set of outputs that would be influenced by z_{t,l} as s(l, t). The posterior distribution can then be modified as

q_φ(z | x) = ∏_{t=1}^{T} ∏_{l=1}^{L} q_φ(z_{t,l} | x_{s(l,t)}, …).
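To make Eq. (4) and the prior parameterization above concrete, the following is a minimal, hedged sketch of a single Stochastic WaveNet layer; the channel sizes, the ReLU activations, and the use of 1x1 convolutions as per-time-step fully connected layers are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticWaveNetLayer(nn.Module):
    """One layer l of Eq. (4):
        h_{t,l} = f_theta1(d_{t-2^{l-1}, l-1}, d_{t,l-1})
        d_{t,l} = f_theta2(h_{t,l}, z_{t,l})
    with the layer-wise Gaussian prior
        mu_{t,l} = f_theta3(h_{t,l}),  log v_{t,l} = f_theta4(h_{t,l}).
    """

    def __init__(self, channels, z_dim, layer_index):
        super().__init__()
        self.dilation = 2 ** (layer_index - 1)  # layer l looks back 2^{l-1} steps
        # f_theta1: dilated causal convolution over the previous layer's d.
        self.f_theta1 = nn.Conv1d(channels, channels, kernel_size=2, dilation=self.dilation)
        # f_theta2: per-time-step fully connected layer (1x1 conv) over [h, z].
        self.f_theta2 = nn.Conv1d(channels + z_dim, channels, kernel_size=1)
        # f_theta3 / f_theta4: prior mean and log-variance computed from h.
        self.f_theta3 = nn.Conv1d(channels, z_dim, kernel_size=1)
        self.f_theta4 = nn.Conv1d(channels, z_dim, kernel_size=1)

    def forward(self, d_prev, z=None):  # d_prev: (batch, channels, T)
        h = torch.relu(self.f_theta1(F.pad(d_prev, (self.dilation, 0))))
        mu, log_v = self.f_theta3(h), self.f_theta4(h)
        if z is None:  # generation: draw z_{t,l} from the prior via reparameterization
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_v)
        d = torch.relu(self.f_theta2(torch.cat([h, z], dim=1)))
        return d, mu, log_v
```

During training, z would instead be drawn from the inference network q_φ(z_{t,l} | ·), and the prior parameters (µ, log v) would enter the KL term of the loss; only the generative path is sketched here.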