Learning Semantics of Raw Audio
Davis Foote, Daylen Yang, Mostafa Rohaninejad
Group name: “Audio Style Transfer”

Introduction
Numerically modeling semantics has produced useful and exciting results in the domains of images [1] and natural language [2]. We would like to operate on raw audio signals at a similar level of abstraction. Consider the problem of interpolating two voices into a voice that sounds “in between” the two. Averaging the signals results in the two voices speaking at once. If this operation were instead performed in a latent space whose dimensions carry semantic meaning, the result would match human perception of what the interpolation should mean.
We pursue this goal along two distinct branches. First, motivated by aesthetic results in style transfer [3] and the generation of novel styles [4], we implement a deep classifier for various musical tags, in the hope that its layer activations encode meaningful high-level audio features. Second, we perform variational inference over latent variables that can be interpreted as compressing a signal to only its global information, which is then probabilistically decoded by a deep autoregressive model.

Variational Inference
We would like a latent space in which arithmetic operations correspond to semantic operations in the observed space. To that end, we adopt the following model, adapted from [1]. We interpret the hidden variables z as the latent code. There is a prior over latent codes p_θ(z) and a conditional distribution p_θ(x | z). In order to encode a data point, we also need to learn an approximation q_φ(z | x) to the posterior distribution. All of these parameters can be optimized jointly using Auto-Encoding Variational Bayes [1]. Some very simple results (ours) of sampling from this model trained on MNIST are shown below.
[Figure: samples drawn from our model trained on MNIST.]
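For concreteness, here is a minimal sketch of the Auto-Encoding Variational Bayes objective with a Gaussian q_φ(z | x), a standard-normal prior p_θ(z), and a Bernoulli decoder p_θ(x | z), roughly as one would use for the MNIST experiment. The framework (PyTorch) and the layer sizes are illustrative assumptions, not our actual training code.

```python
# Minimal AEVB sketch (assumed sizes/framework, not the poster's actual code).
# Maximizing ELBO = E_q[log p_theta(x | z)] - KL(q_phi(z | x) || p_theta(z))
# is implemented here as minimizing its negative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z | x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_phi(z | x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # logits of p_theta(x | z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Bernoulli reconstruction term; x is expected to lie in [0, 1].
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Analytic KL(q_phi(z | x) || N(0, I)).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

Sampling from the trained model simply draws z ~ N(0, I) and passes it through the decoder.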

Style classification
We replicated the results of [5] and made a few slight changes to improve performance (training with Adam, tag merging, etc.). The model is illustrated below; ReLUs are inserted after every CONV and FC layer. Our data source was the MagnaTagATune dataset, which contains raw audio clips with tag annotations. The metric we used for this task was the AUC of each tag, averaged over all tags.
[Model diagram: 3 sec of audio @ 16 kHz → STRIDED CONV (128 filters, size 200, stride 200) → CONV (32 filters, size 8, stride 1) → MAX POOL (stride 4) → CONV (32 filters, size 8, stride 1) → MAX POOL (stride 4) → FC (100 units) → FC (50 units) → tag predictions such as “techno”.]
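For reference, here is a rough sketch of that architecture, assuming 50 output tags with one logit per tag; the framework (PyTorch), the ordering of the 100- and 50-unit layers, and the exact output size are assumptions rather than a transcription of our code (linked under Code below).

```python
# Sketch of the end-to-end raw-audio tagger of [5] as described above
# (assumed details; see the Code section for our actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RawAudioTagger(nn.Module):
    def __init__(self, num_tags=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=200, stride=200), nn.ReLU(),  # strided "frame" conv
            nn.Conv1d(128, 32, kernel_size=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 32, kernel_size=8), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(),     # FC, 100 units
            nn.Linear(100, num_tags),          # one logit per tag (e.g. "techno")
        )

    def forward(self, x):                      # x: (batch, 1, 48000) = 3 s @ 16 kHz
        return self.classifier(self.features(x))

# Multi-label training: one sigmoid per tag via binary cross-entropy.
model = RawAudioTagger()
logits = model(torch.randn(2, 1, 48000))
targets = torch.zeros_like(logits)
loss = F.binary_cross_entropy_with_logits(logits, targets)
```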

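The evaluation metric described above is computed per tag and then averaged. A small sketch, assuming scikit-learn and (num_clips, num_tags) arrays of binary labels and predicted scores; tags missing a positive or negative example in the evaluation set are skipped.

```python
# Average AUC over tags (sketch; assumes scikit-learn is available).
import numpy as np
from sklearn.metrics import roc_auc_score

def average_auc_over_tags(labels, scores):
    # labels, scores: arrays of shape (num_clips, num_tags)
    aucs = []
    for tag in range(labels.shape[1]):
        y = labels[:, tag]
        if y.min() != y.max():                 # AUC needs both classes present
            aucs.append(roc_auc_score(y, scores[:, tag]))
    return float(np.mean(aucs))
```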
Results:
[Bar chart: Average AUC over Tags, comparing E2E-Spectrogram [5], E2E-Raw [5], Deep-BoF [6], and Ours; axis from 0.82 to 0.90.]

Qualitative Results:
• “Shake It Off” by Taylor Swift: rock, guitar, drum
• “You’ve Got A Friend In Me” by Randy Newman: guitar, vocal, male
• “i hate u, i love u” by gnash, Olivia O’Brien: guitar, vocal, female
• “Time” by Hans Zimmer: slow, ambient, quiet

Does this learn a useful representation? We ran the following two experiments to try to understand what this model learned to represent:
1. “Image Fooling” [7] / “DeepDream” [4]: Initialize a signal at the input of the network as some starting audio (e.g. a clip of music) and maximize a particular output or the activations at a particular layer. All results sounded like white noise added to the initial clip.
2. Neural Style [3]: Simultaneously try to match the activations at a deep layer with one clip and the temporal correlations of activations at a set of layers (in order to eliminate indexical local temporal information) with another clip (sketched below). Results again sounded like white noise mixed with the content clip.
Potential explanation: These operations lead to regions in activation space that lie off the “audio signal manifold.” Since there is no probabilistic interpretation assigning meaning to these regions, the model has learned nothing there and we truly do just get white noise.
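A rough sketch of the second experiment, assuming the RawAudioTagger sketch above (its features stack) and precomputed, detached target activations for the content clip and temporal Gram matrices for the style clip; the loss weights and the step rule are illustrative.

```python
# Neural-style-like optimization on a raw signal (sketch; assumed helpers/weights).
import torch

def layer_activations(features, x, layer_ids):
    # Collect activations of selected layers from an nn.Sequential feature stack.
    acts, h = {}, x
    for i, layer in enumerate(features):
        h = layer(h)
        if i in layer_ids:
            acts[i] = h
    return acts

def temporal_gram(act):
    # act: (batch, channels, time); channel-channel correlations over time,
    # which discards the "indexical" local temporal information.
    b, c, t = act.shape
    return act @ act.transpose(1, 2) / (c * t)

def style_step(features, x, content_acts, style_grams, lr=1e-2, style_weight=1e3):
    # One gradient step on the raw signal x itself.
    x = x.detach().clone().requires_grad_(True)
    layer_ids = set(content_acts) | set(style_grams)
    acts = layer_activations(features, x, layer_ids)
    content_loss = sum((acts[i] - content_acts[i]).pow(2).mean() for i in content_acts)
    style_loss = sum((temporal_gram(acts[i]) - style_grams[i]).pow(2).mean()
                     for i in style_grams)
    (content_loss + style_weight * style_loss).backward()
    return (x - lr * x.grad).detach()
```

The first experiment uses the same machinery, with the objective replaced by the negative of a chosen output or layer activation.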

Conditional WaveNet
We would like to leverage the expressive power of WaveNet [8], the recent autoregressive model for raw audio, by using a Conditional WaveNet as the conditional distribution p_θ(x | z) (the diagram below illustrates this at train time). Unfortunately, such powerful models tend to get stuck in local optima in which the KL divergence penalty in the evidence lower bound (see [1]) is driven to zero and the posterior encodes no information. [9] presents the Variational Lossy Autoencoder as a solution to this problem: limiting the decoder’s expressivity to modeling only local statistics forces the posterior to learn to model global dependencies. We hope to extend this approach with WaveNet to solve our original problem.

Results: The full implementation is still in the works; see below for the details of our Conditional WaveNet implementation. The model works on image data, as shown by the MNIST samples above, however we have not yet implemented a flexible PixelCNN decoder to illustrate the effects of lossy encoding.

[Diagram (train time): a stack of dilated causal convolutional layers with dilations 1, 2, 4, and 8, mapping the Input through Hidden Layers to the Output while conditioned on a Global Latent Variable.]

An autoregressive model: discretize the audio into samples and model each sample x_t with a distribution conditioned on all previous samples and on the global conditioning variables.
Adding global conditioning: we refactored an existing WaveNet implementation to add global conditioning (see the sketch below) and trained models on the VCTK and MagnaTagATune corpora. We could not train a text-to-speech model because we did not have aligned local conditioning data, but we were able to generate samples with distinct male and female characteristics.
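In [8], global conditioning enters the gated activation unit as an extra projected term, z = tanh(W_f * x + V_f h) ⊙ σ(W_g * x + V_g h). Below is a rough sketch of one dilated causal layer with this form of conditioning; the channel sizes, names, and framework (PyTorch rather than our TensorFlow code) are illustrative assumptions.

```python
# One WaveNet-style dilated causal layer with global conditioning (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualLayer(nn.Module):
    def __init__(self, channels=64, cond_dim=16, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation               # left-pad => causal
        self.filt = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.cond_filt = nn.Linear(cond_dim, channels)        # V_f h
        self.cond_gate = nn.Linear(cond_dim, channels)        # V_g h
        self.res = nn.Conv1d(channels, channels, 1)

    def forward(self, x, h):
        # x: (batch, channels, time) audio features; h: (batch, cond_dim) global code
        xp = F.pad(x, (self.pad, 0))                          # causal padding
        f = self.filt(xp) + self.cond_filt(h).unsqueeze(-1)   # broadcast over time
        g = self.gate(xp) + self.cond_gate(h).unsqueeze(-1)
        z = torch.tanh(f) * torch.sigmoid(g)                  # gated activation unit
        return x + self.res(z)                                # residual connection
```

Stacking such layers with dilations 1, 2, 4, 8, ... gives the large receptive field shown in the diagram; the conditioning vector h can be a speaker embedding (VCTK) or the latent code z from the variational model.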
Please ask for a live demo! Our results are all audio and thus impossible to illustrate on a poster; we can demo any audio experiment mentioned here for any audio clip obtainable from the Internet.

Code
• Style classification: github.com/daylen/audio-style-classifier
• Conditional WaveNet: github.com/mostafarohani/tensorflow-wavenet
• VLAE: github.com/djfoote/audio-style-transfer

References:
[1] Kingma et al. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
[2] Mikolov et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[3] Gatys et al. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015).
[4] Mordvintsev et al. "DeepDream - a code example for visualizing Neural Networks." Google Research (2015).
[5] Dieleman et al. "End-to-end learning for music audio." IEEE, 2014.
[6] Nam et al. "A Deep Bag-of-Features Model for Music Auto-Tagging." arXiv preprint arXiv:1508.04999 (2015).
[7] Nguyen et al. "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
[8] Oord et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
[9] Chen et al. "Variational Lossy Autoencoder." arXiv preprint arXiv:1611.02731 (2016).