
Stochastic WaveNet: A Generative Latent Variable Model for Sequential Data

Guokun Lai 1 Bohan Li 1 Guoqing Zheng 1 Yiming Yang 1

1 Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Correspondence to: Guokun Lai.

Presented at the ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models.

Abstract

How to model the distribution of sequential data, including but not limited to speech and human motions, is an important ongoing research problem. It has been demonstrated that model capacity can be significantly enhanced by introducing stochastic latent variables in the hidden states of recurrent neural networks. Simultaneously, WaveNet, equipped with dilated convolutions, achieves astonishing empirical performance in the natural speech generation task. In this paper, we combine the ideas from both stochastic latent variables and dilated convolutions, and propose a new architecture to model sequential data, termed Stochastic WaveNet, where stochastic latent variables are injected into the WaveNet structure. We argue that Stochastic WaveNet enjoys powerful distribution modeling capacity and the advantage of parallel training from dilated convolutions. In order to efficiently infer the posterior distribution of the latent variables, a novel inference network structure is designed based on the characteristics of the WaveNet architecture. State-of-the-art performances on benchmark datasets are obtained by Stochastic WaveNet on natural speech modeling, and high-quality human handwriting samples can be generated as well.

1. Introduction

Learning to capture the complex distribution of sequential data is an important machine learning problem and has been extensively studied in recent years. The autoregressive neural network models, including recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014), PixelCNN (Oord et al., 2016) and WaveNet (Van Den Oord et al., 2016), have shown strong empirical performance in modeling natural language, images and human speech.

All these methods are aimed at learning a deterministic mapping from the data input to the output. Recently, evidence has been found (Fabius and van Amersfoort, 2014; Gan et al., 2015; Gu et al., 2015; Goyal et al., 2017; Shabanian et al., 2017) that probabilistic modeling with neural networks can benefit from uncertainty introduced into their hidden states, namely by including stochastic latent variables in the network architecture. Without such uncertainty in the hidden states, models such as PixelCNN and WaveNet would parameterize the randomness only in the final layer, by shaping an output distribution from a specific distribution family. Hence the output distribution (which is often assumed to be Gaussian for continuous data) would be unimodal or a mixture of unimodals given the input data, which may be insufficient to capture the complex true data distribution and to describe the complex correlations among different output dimensions (Boulanger-Lewandowski et al., 2012). Even for the non-parametrized discrete output distribution modeled by the softmax function, a phenomenon referred to as the softmax bottleneck (Yang et al., 2017a) still limits the family of output distributions. By injecting stochastic latent variables into the hidden states and transforming their uncertainty into the outputs through non-linear layers, a stochastic neural network is equipped with the ability to model the data with a much richer family of distributions.

Motivated by this, numerous variants of RNN-based stochastic neural networks have been proposed. STORN (Bayer and Osendorfer, 2014) is the first to integrate stochastic latent variables into an RNN's hidden states. In VRNN (Chung et al., 2015), the prior of the stochastic latent variables is assumed to be a function of the historical data and stochastic latent variables, which allows the model to capture temporal dependencies. SRNN (Fraccaro et al., 2016) and Z-forcing (Goyal et al., 2017) offer more powerful versions with augmented inference networks, which better capture the correlation between the stochastic latent variables and the whole observed sequence. Some training tricks introduced in (Goyal et al., 2017; Shabanian et al., 2017) ease the training process for stochastic recurrent neural networks and lead to better empirical performance. By introducing stochasticity into the hidden states, these RNN-based models achieved significant improvements over vanilla RNN models on log-likelihood evaluations on multiple benchmark datasets from various domains (Goyal et al., 2017; Shabanian et al., 2017).

In parallel with RNNs, WaveNet (Van Den Oord et al., 2016) provides another powerful way of modeling sequential data with dilated convolutions, especially in the natural speech generation task.

[Figure 1. Visualization of a WaveNet structure from (Van Den Oord et al., 2016): an input layer, hidden layers with dilations 1, 2, 4 and 8, and an output layer.]

Notation. We use subscripts as indices, such as x_i or x_{i,j}. f(\cdot) represents a general function that transforms an input vector into an output vector, and f_\theta(\cdot) is a neural network function parametrized by \theta. For a sequential data sample x, T represents its length.

2.2. Autoregressive Neural Network

The autoregressive network model is designed to model the joint distribution of high-dimensional data with sequential structure, by factorizing the joint distribution of a data sample as

    p(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).
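To make the dilated-convolution structure visualized in Figure 1 concrete, the following is a minimal PyTorch sketch of a causal convolution stack with dilations 1, 2, 4 and 8. It is our own illustrative sketch, not the authors' implementation: the channel width, kernel size of 2 and ReLU activation are assumptions, and the gated activations and residual/skip connections of the original WaveNet are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        # A 1-D convolution whose output at time t depends only on inputs at times <= t.
        def __init__(self, channels, dilation):
            super().__init__()
            self.left_pad = dilation  # (kernel_size - 1) * dilation with kernel_size = 2
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):  # x: (batch, channels, time)
            return self.conv(F.pad(x, (self.left_pad, 0)))  # pad on the left only

    class DilatedStack(nn.Module):
        # Dilations 1, 2, 4, 8 as in Figure 1; the receptive field covers 16 past steps.
        def __init__(self, channels=64):
            super().__init__()
            self.layers = nn.ModuleList(CausalConv1d(channels, d) for d in (1, 2, 4, 8))

        def forward(self, x):
            for layer in self.layers:
                x = torch.relu(layer(x))
            return x

Doubling the dilation at every layer is what lets the receptive field grow exponentially with depth while every output remains a function of past inputs only, which is the property the autoregressive factorization above relies on.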

3. Stochastic WaveNet

In this section, we introduce a sequential generative model (Stochastic WaveNet), which imposes stochastic latent variables on the multi-layer dilated convolution structure. At time stamp t and layer l, the hidden state h_{t,l} and the layer output d_{t,l} are computed as

    h_{t,l} = f_{\theta_1}(d_{t-2^l, l-1}, d_{t,l-1})
    d_{t,l} = f_{\theta_2}(h_{t,l}, z_{t,l})        (4)

where f_{\theta_1} mimics the design of the dilated convolution in WaveNet, and f_{\theta_2} is a fully connected layer that summarizes the hidden states and the sampled latent variable (a minimal code sketch of one such layer is given at the end of this section). Different from the vanilla WaveNet, the hidden states h are stochastic because of the random samples z. We parameterize the mean and variance of the prior distributions by the hidden representations h, that is, \mu_{t,l} = f_{\theta_3}(h_{t,l}) and \log v_{t,l} = f_{\theta_4}(h_{t,l}). Similarly, we parameterize the emission probability p_\theta(x_t \mid x_{<t}, z_{\le t}).

[...] increase the degree of distribution mismatch between the prior and the posterior distribution, namely enlarging the KL term in the loss function.

Exploring the dependency structure in WaveNet. However, by analyzing the dependency among the outputs and hidden states of WaveNet, we find that the stochastic latent variables at time stamp t, z_{t,1:L}, do not influence the whole set of posterior outputs. Hence the inference network only requires the partial posterior outputs to maximize the reconstruction term in the loss function. Denote the set of outputs that are influenced by z_{t,l} as s(l, t). The posterior distribution can then be modified as

    q_\phi(z|x) = \prod_{t=1}^{T} \prod_{l=1}^{L} q_\phi(z_{t,l} \mid x_{s(l,t)}, z_{<t}, z_{t,<l}).

[Figure 2. (a) Generative model and (b) inference network of Stochastic WaveNet.]

The modified posterior distribution removes unnecessary conditional variables, which makes the optimization more efficient. To summarize the information from the posterior outputs x_{s(l,t)}, we design a reversed WaveNet architecture to compute the hidden feature b, illustrated in Figure 2b, where b_{t,l} is formulated as

    b_{t,l} = f(x_{s(l,t)}) = f_{\phi_1}(b_{t,l+1}, b_{t+2^{l+1}, l+1})        (8)

where we define b_{t,L} = f_{\phi_2}(x_t, x_{t+2^{L+1}}), and f_{\phi_1} and f_{\phi_2} are dilated convolution layers whose structure is a reversed version of f_{\theta_1} in Eq. (4). Finally, we infer the posterior distribution q_\phi(z_{t,l} \mid x, z_{t,<l}) from the summarized feature b_{t,l}.
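As referenced above, here is a minimal PyTorch sketch of a single generative layer, implementing Eq. (4) together with the Gaussian prior parameterization \mu_{t,l}, \log v_{t,l}. It is our own sketch under simplifying assumptions, not the authors' released code: f_{\theta_1} is taken as a dilated causal convolution, while f_{\theta_2}, f_{\theta_3} and f_{\theta_4} are realized as per-time-step (1x1 convolution) linear maps.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StochasticLayer(nn.Module):
        # One layer l of the generative model: Eq. (4) plus the Gaussian prior over z_{t,l}.
        def __init__(self, channels, z_dim, dilation):
            super().__init__()
            self.dilation = dilation
            # f_theta1: dilated causal convolution over the outputs d of layer l-1
            self.f1 = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
            # f_theta3 / f_theta4: prior mean and log-variance computed from h_{t,l}
            self.prior_mu = nn.Conv1d(channels, z_dim, kernel_size=1)
            self.prior_logv = nn.Conv1d(channels, z_dim, kernel_size=1)
            # f_theta2: per-time-step fully connected layer over (h_{t,l}, z_{t,l})
            self.f2 = nn.Conv1d(channels + z_dim, channels, kernel_size=1)

        def forward(self, d_prev):  # d_prev: (batch, channels, time)
            h = self.f1(F.pad(d_prev, (self.dilation, 0)))        # h_{t,l}, causal in time
            mu, logv = self.prior_mu(h), self.prior_logv(h)       # prior parameters
            z = mu + torch.exp(0.5 * logv) * torch.randn_like(mu) # sample z_{t,l} from the prior
            d = self.f2(torch.cat([h, z], dim=1))                 # d_{t,l} = f_theta2(h_{t,l}, z_{t,l})
            return d, mu, logv

During training, z would instead be drawn from the inference network's posterior and (mu, logv) reused in the KL term of the objective; that wiring, and the emission head on top of the final layer, are omitted here.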

The model is trained by maximizing the following annealed training objective:

    L_\lambda(x) = E_{q_\phi(z|x)}\Big[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, z_{\le t})\Big] - \lambda\, D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)        (9)

During the training process, \lambda is annealed from 0 to 1. In previous works, researchers usually adopt a linear annealing strategy (Fraccaro et al., 2016; Goyal et al., 2017). In our experiments, we find that it still increases \lambda too fast for Stochastic WaveNet. We propose to use a cosine annealing strategy instead, namely \lambda follows the function \lambda_\alpha = 1 - \cos(\alpha), where \alpha scans from 0 to \pi/2 (a short reference sketch of this schedule is given after the baseline list below). This schedule increases \lambda more slowly in the early stage of training, which helps better optimize the ELBO.

4. Experiment

This section reports the quantitative results and visualizes the generated samples for the human handwriting domain. The experiment codes are publicly accessible.

Baselines. The following sequential generative models proposed in recent years are treated as the baseline methods:

• RNN: The original recurrent neural network with the LSTM cell.

• VRNN: The generative model with the recurrent structure proposed in (Chung et al., 2015). It first formulates the prior distribution of z_t as a conditional probability given the historical data x_{<t}.

• Z-forcing: Proposed in (Goyal et al., 2017), whose architecture is similar to SRNN; it eases the training of the stochastic latent variables by adding an auxiliary cost which forces the model to use the stochastic latent variables z to reconstruct the future data x_{>t}.

• WaveNet: Proposed in (Van Den Oord et al., 2016), it produces state-of-the-art results in the speech generation task.
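As referenced above, the cosine KL-annealing schedule can be computed as follows. This is our own sketch: mapping the training progress to \alpha linearly is an assumption, since the exact pacing is not specified in the text.

    import math

    def kl_weight(step, total_annealing_steps):
        # Cosine KL-annealing: alpha scans from 0 to pi/2 and lambda = 1 - cos(alpha).
        # Assumed pacing: alpha grows linearly with the training step until annealing ends.
        progress = min(step / total_annealing_steps, 1.0)
        alpha = 0.5 * math.pi * progress
        return 1.0 - math.cos(alpha)  # 0 at the start, 1 once annealing finishes

    # Example: after 10% of the annealing steps, lambda is only about 0.012,
    # which is far smaller than the 0.1 a linear schedule would give.
    weights = [kl_weight(s, 10000) for s in (0, 1000, 5000, 10000)]

The small early values of \lambda are what keep the KL penalty from dominating before the latent variables become useful, which is the motivation given above for preferring this schedule over linear annealing.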

We evaluate different models by comparing the log-likelihood on the test set (RNN, WaveNet) or its lower bound (VRNN, SRNN, Z-forcing and our method). For a fair comparison, a multivariate Gaussian distribution with a diagonal covariance matrix is used as the output distribution for each time stamp in all experiments. The Adam optimizer (Kingma and Ba, 2014) is used for all models, and the learning rate is scheduled by cosine annealing. Following the experiment setting in (Fraccaro et al., 2016), we use one sample to approximate the variational evidence lower bound to reduce the computation cost.

4.1. Natural Speech Modeling

In the natural speech modeling task, we train the model to fit the log-likelihood function of the raw audio signals, following the experiment setting in (Fraccaro et al., 2016; Goyal et al., 2017). The raw signals, which correspond to the real-valued amplitudes, are represented as a sequence of 200-dimensional frames; each frame is 200 consecutive samples. The preprocessing and dataset segmentation are identical to (Fraccaro et al., 2016; Goyal et al., 2017). We evaluate the proposed model on the following benchmark datasets:

• Blizzard (Prahallad et al., 2013): The Blizzard Challenge 2013, a text-to-speech dataset containing 300 hours of English from a single female speaker.

• TIMIT (https://catalog.ldc.upenn.edu/ldc93s1): The TIMIT raw audio dataset, which contains 6,300 English sentences read by 630 speakers.

For the Blizzard dataset, we report the average log-likelihood over the half-second segments of the test set. For the TIMIT dataset, we report the average log-likelihood over each sequence of the test set, following the setting in (Fraccaro et al., 2016; Goyal et al., 2017). In this task, we use a 5-layer SWaveNet architecture with 1024 hidden dimensions for Blizzard and 512 for TIMIT, and the dimensions of the stochastic latent variables are 100 for both datasets.

    Method                  Blizzard            TIMIT
    RNN                     7413                26643
    VRNN                    ≥ 9392              ≥ 28982
    SRNN                    ≥ 11991             ≥ 60550
    Z-forcing (+kla)        ≥ 14226             ≥ 68903
    Z-forcing (+kla,aux)*   ≥ 15024             ≥ 70469
    WaveNet                 -5777               26074
    SWaveNet                ≥ 15708 (±274)      ≥ 72463 (±639)

Table 1. Test set log-likelihoods on the natural speech modeling task. The first group contains RNN-based models, and the second group contains WaveNet-based models. Best results are highlighted in bold. * denotes that the training objective is equipped with an auxiliary term which the other methods do not have. For SWaveNet, we report the mean (± standard deviation) produced by 10 different runs.

The experiment results are illustrated in Table 1. The proposed model has produced the best result for both datasets. Since the performance gap is not significant enough, we also report the variance of the proposed model's performance by rerunning the model with 10 random seeds, which shows consistent performance. Compared with the WaveNet model, the one without stochastic hidden states, SWaveNet gets a significant performance boost. Simultaneously, SWaveNet still enjoys the advantage of parallel training compared with RNN-based stochastic models. One common concern about SWaveNet is that it may require a larger hidden dimension for the stochastic latent variables than RNN-based models due to its multi-layer structure. However, the total dimension of the stochastic latent variables for one time stamp of SWaveNet is 500, which is twice the number used in the SRNN and Z-forcing papers (Fraccaro et al., 2016; Goyal et al., 2017). We further discuss the relationship between the number of stochastic layers and the model performance in Section 4.3.

4.2. Handwriting and Human Motion Generation

Next, we evaluate the proposed model by visualizing generated samples from the trained model. The domain we choose is human handwriting, whose writing tracks are described by a sequence of sample points. The following dataset is used to train the generative model:

IAM-OnDB (Liwicki and Bunke, 2005): This human handwriting dataset contains 13,040 handwriting lines written by 500 writers. The writing trajectories are represented as a sequence of 3-dimensional frames. Each frame is composed of two real-valued numbers, which are the (x, y) coordinates of the sample point, and a binary number indicating whether the pen is touching the paper. The data preprocessing and division are the same as in (Graves, 2013; Chung et al., 2015).
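Both tasks use simple frame-based input representations, as described in Sections 4.1 and 4.2 above. The sketch below illustrates them; the array contents and variable names are made up for illustration and are not taken from the datasets.

    import numpy as np

    # Speech (Section 4.1): raw amplitudes are cut into frames of 200 consecutive
    # samples, giving a sequence of 200-dimensional frames.
    waveform = np.random.randn(16000).astype(np.float32)  # stand-in for a raw audio clip
    n_frames = len(waveform) // 200
    speech_frames = waveform[: n_frames * 200].reshape(n_frames, 200)  # shape (T, 200)

    # Handwriting (Section 4.2): each time step is a 3-dimensional frame holding the
    # real-valued (x, y) pen coordinates and a binary pen-touching indicator.
    handwriting_frames = np.array([
        [0.12, -0.30, 1.0],  # pen down at (0.12, -0.30)
        [0.15, -0.28, 1.0],
        [0.00,  0.00, 0.0],  # pen lifted
    ], dtype=np.float32)     # shape (T, 3)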

(a) Ground Truth (b) RNN (c) VRNN (d) SWaveNet

Figure 3. Generated handwriting samples: (a) shows samples from the ground truth data; (b), (c) and (d) are from RNN, VRNN and SWaveNet, respectively. Each line is one handwriting sample.

    Method                      IAM-OnDB
    RNN (Chung et al., 2015)    1016
    VRNN (Chung et al., 2015)   ≥ 1334
    WaveNet                     1021
    SWaveNet                    ≥ 1301

Table 2. Log-likelihood results on the IAM-OnDB dataset. The best result is highlighted in bold.

The quantitative results are reported in Table 2. SWaveNet achieves a similar result to the best one, and still shows a significant improvement over the vanilla WaveNet architecture. In Figure 3, we plot the ground truth samples and the ones randomly generated from different models. Compared with RNN and VRNN, SWaveNet shows clearer results: it is easy to distinguish the boundaries of the characters, and we can observe that more of them are similar to English characters, such as "is" in the fourth line and "her" in the last line.

4.3. Influence of Stochastic Latent Variables

The most prominent distinction between SWaveNet and RNN-based stochastic neural networks is that SWaveNet utilizes the dilated convolution layers to model multi-layer stochastic latent variables rather than the one-layer latent variables in the RNN models. Here, we perform an empirical study on the number of stochastic layers in the SWaveNet model to demonstrate the efficiency of the design of multi-layer stochastic latent variables. The experiment is designed as follows. Firstly, we retain the total number of layers and only change the number of stochastic layers, namely the layers that contain stochastic latent variables. More specifically, for a SWaveNet with L layers and S stochastic layers (S ≤ L), we eliminate the stochastic latent variables in the bottom part, which is {z_{1:T,l}; l < L − S} in Eq. (4). Then, for each time stamp, when the model has D-dimensional stochastic variables in total, each stochastic layer would have ⌊D/S⌋-dimensional stochastic variables. In this experiment, we set D = 500.

We plot the experiment results in Figure 4. From the plots, we find that SWaveNet can achieve better performance with multiple stochastic layers. This demonstrates that it is helpful to encode the stochastic latent variables with a hierarchical structure. In the experiments on Blizzard and IAM-OnDB, we also observe that the performance decreases when the number of stochastic layers becomes large enough, because too large a number of stochastic layers results in too few latent variables per layer to memorize valuable information at different hierarchy levels.

[Figure 4. The influence of the number of stochastic layers of SWaveNet. Panels: (a) Blizzard, (b) TIMIT, (c) IAM-OnDB.]

We also study how the model performance is influenced by the number of stochastic latent variables. Similar to the previous experiment, we only tune the total number of stochastic latent variables and keep the rest of the settings unchanged, with 4 stochastic layers. The results are plotted in Figure 5. They demonstrate that Stochastic WaveNet benefits from even a small number of stochastic latent variables.

[Figure 5. The influence of the number of stochastic latent variables of SWaveNet. Panels: (a) Blizzard, (b) TIMIT, (c) IAM-OnDB.]

5. Conclusion

In this paper, we present a novel generative latent variable model for sequential data, named Stochastic WaveNet, which injects stochastic latent variables into the hidden states of WaveNet. A new inference network structure is designed based on the characteristics of the WaveNet architecture. Empirical results show state-of-the-art performances on various domains by leveraging the additional stochastic latent variables. Simultaneously, the training process of Stochastic WaveNet is greatly accelerated by parallel computation compared with RNN-based models. For future work, a potential research direction is to adapt the advanced training strategies (Goyal et al., 2017; Shabanian et al., 2017) designed for sequential stochastic neural networks to Stochastic WaveNet.

References

Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. (2015). A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988.

Engel, J., Resnick, C., Roberts, A., Dieleman, S., Eck, D., Simonyan, K., and Norouzi, M. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. arXiv preprint arXiv:1704.01279.

Fabius, O. and van Amersfoort, J. R. (2014). Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.

Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207.

Gan, Z., Li, C., Henao, R., Carlson, D. E., and Carin, L. (2015). Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pages 2467–2475.

Goyal, A., Sordoni, A., Côté, M.-A., Ke, N. R., and Bengio, Y. (2017). Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pages 6716–6726.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Gu, S., Ghahramani, Z., and Turner, R. E. (2015). Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems, pages 2629–2637.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Liwicki, M. and Bunke, H. (2005). IAM-OnDB: An on-line English sentence database acquired from handwritten text on a whiteboard. In Proceedings of the Eighth International Conference on Document Analysis and Recognition, pages 956–961. IEEE.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.

Oord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.

Prahallad, K., Vadapalli, A., Elluru, N., Mantena, G., Pulugundla, B., Bhaskararao, P., Murthy, H., King, S., Karaiskos, V., and Black, A. (2013). The Blizzard Challenge 2013: Indian language task. In Blizzard Challenge Workshop, volume 2013.

Semeniuta, S., Severyn, A., and Barth, E. (2017). A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390.

Shabanian, S., Arpit, D., Trischler, A., and Bengio, Y. (2017). Variational Bi-LSTMs. arXiv preprint arXiv:1711.05717.

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2017a). Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

Yang, Z., Hu, Z., Salakhutdinov, R., and Berg-Kirkpatrick, T. (2017b). Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139.

Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.