Advanced ML with autoregressive and latent variable models

https://github.com/janchorowski/JSALT2019_tutorials

Jan Chorowski University of Wrocław

With slides from TMLSS 2018 (Jan Chorowski, Ulrich Paquet)

Outline

• Why unsupervised learning
• Autoregressive models
– Intuitions
– Dilated convolutions
– RNNs (Intermezzo: LSTM training dynamics)
– Transformers
– Beyond sequences, summary
• Latent variable models
– VAE
– Normalizing flows
• Autoregressive + latent variable: why and how?

Deep Model = Hierarchy of Concepts

(Figure: a hierarchy of concepts: Cat, Dog, …, Moon, Banana.)

M. Zeiler, “Visualizing and Understanding Convolutional Networks”

Deep Learning history: 1986

‘hidden’ units which are not part of the input or output come to represent important features of the task domain

Deep Learning history: 2006

Stacked RBMs

Hinton, Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks”

Deep Learning history: 2012

2012: AlexNet achieves SOTA on ImageNet with fully supervised training

Deep Learning Recipe

1. Get a massive, labeled dataset D = {(x, y)}:
– Computer vision: ImageNet, 1M images
– Machine translation: European Parliament data, CommonCrawl, several million sentence pairs
– Speech recognition: 1000 h (LibriSpeech), 12000 h (Google Voice Search)
– Question answering: SQuAD, 150k questions with human answers
– …
2. Train the model to maximize log p(y|x)

Value of Labels

• Labeled data is crucial for deep learning
• But labels carry little information:
– Example: an ImageNet model has 30M weights, but ImageNet is about 1M images from 1000 classes
– Labels: 1M × 10 bits = 10 Mbits
– Raw data (128 × 128 images): ca. 500 Gbits!

Value of Unlabeled Data

“The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second.”

Geoff Hinton

https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/

Unsupervised learning recipe

1. Get a massive unlabeled dataset D = {x} (easy: unlabeled data is nearly free)

2. Train model to…???

What is the task? What is the loss?

Unsupervised learning goals

• Learn a data representation:
– Extract features for downstream tasks
– Describe the data (clusterings)
• Data generation / density estimation:
– What are my data
– Outlier detection
– Super-resolution
– Artifact removal
– Compression

2 minute exercise

Talk to your friend next to you, and tell him or her everything you can about this data set.

(The rows are correlated; there is a latent representation.)

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Learning high dimensional distributions is hard

• Assume we work with small (32×32) images
• Each data point is a real vector of size 32 × 32 × 3
• Data occupies only a tiny fraction of ℝ^(32×32×3)
• Impossible to directly fit a standard prob. distribution!

Divide and conquer

What is easier:
1. Write a paragraph at once?
2. Guess the next word?

Mary had a little
Mary had a little dog.
Mary had a little dog. It
Mary had a little dog. It was
Mary had a little dog. It was cute

Chain rule

Decompose probability of 푛-dimensional data points into product of 푛 conditional probabilities

p(x) = p(x_1, x_2, …, x_n) = p(x_1) · p(x_2 | x_1) · … · p(x_n | x_1, x_2, …, x_{n-1})
     = ∏_t p(x_t | x_1, x_2, …, x_{t-1})

Learning p(x_t | x_1, …, x_{t-1})

The good: it looks like supervised learning!

The bad: we need to handle variable-sized input.

Large n -> sparse data

Consider an n-gram LM:
p(x_t | x_{t-k+1}, …, x_{t-1}) = #(x_{t-k:t}) / #(x_{t-k:t-1})

As n → ∞, n-grams become unique:
- the probability of all but one continuation is 0
- we can't model long contexts
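To make the counting estimate concrete, here is a minimal Python sketch (the toy corpus and names are illustrative, not from the slides) of a count-based trigram model; with so little data almost every context is unique, which is exactly the sparsity problem:

```python
from collections import Counter

# Count-based trigram LM: p(x_t | x_{t-2}, x_{t-1}) = #(x_{t-2:t}) / #(x_{t-2:t-1})
corpus = "mary had a little dog it was cute".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(w1, w2, w3):
    """Probability of w3 following the context (w1, w2); 0 for unseen events."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_next("had", "a", "little"))  # 1.0 -- the only continuation ever observed
print(p_next("had", "a", "dog"))     # 0.0 -- plausible, but the count is zero
```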

Solution: Use a representation in which “llama ≈ animal”

Learning p(x_t | x_1, …, x_{t-1}) #1

Assume the distant past doesn't matter:
p(x_t | x_1, …, x_{t-1}) ≈ p(x_t | x_{t-k+1}, …, x_{t-1})

Quite popular:
• n-gram text models (estimate p by counting)
• FIR filters

How to efficiently handle large k?

Neural LM

Compute p(x_t | x_{t-k+1}, …, x_{t-1}) with a neural net.

(Diagram: table-lookup word representations are concatenated and fed to an arbitrary sub-graph / extractor, followed by a softmax over the next word.)

Looks somewhat like a convolution. Still: large “n” => many params, how to fix?

Y. Bengio et al., “A Neural Probabilistic Language Model”, NIPS 2000

Dilated Convolutions

Large (but finite) receptive fields. Few parameters.

WaveNet: https://arxiv.org/abs/1609.03499
ByteNet: https://arxiv.org/abs/1610.10099
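A minimal sketch of the idea, assuming PyTorch: a stack of causal 1-D convolutions whose dilation doubles at every layer, so the receptive field grows exponentially while the parameter count grows only linearly. This is an illustration only, not the actual WaveNet/ByteNet architecture (no gating, residual or skip connections):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Causal 1-D convolutions with dilations 1, 2, 4, ... (kernel size 2)."""
    def __init__(self, channels=32, kernel_size=2, num_layers=6):
        super().__init__()
        self.convs = nn.ModuleList()
        self.pads = []
        for i in range(num_layers):
            dilation = 2 ** i
            # Left-pad so the output at time t only sees inputs <= t (causality).
            self.pads.append((kernel_size - 1) * dilation)
            self.convs.append(nn.Conv1d(channels, channels, kernel_size,
                                        dilation=dilation))

    def forward(self, x):                       # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

model = DilatedCausalStack()
y = model(torch.randn(1, 32, 100))
print(y.shape)   # torch.Size([1, 32, 100]); receptive field = 64 time steps
```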

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Learning p(x_t | x_1, …, x_{t-1}) #2

Introduce a hidden (i.e. not directly observed) state ℎ푡.

ℎ푡 summarizes 푥1, … , 푥푡−1.

Compute h_t using this recurrence:
h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)

Recurrences are powerful

Input: a sequence of bits 1, 0, 1, 0, 0, 1
Output: the parity

Solution: the hidden state is just 1 bit – the parity so far:
h_0 = 0
h_t = XOR(h_{t-1}, x_t)
y_T = h_T
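A tiny plain-Python sketch of that recurrence (no neural network yet): the whole prefix is summarized by a single bit.

```python
def parity(bits):
    """Running XOR: a 1-bit hidden state summarizes the entire prefix."""
    h = 0                   # h_0 = 0
    for x in bits:
        h = h ^ x           # h_t = XOR(h_{t-1}, x_t)
    return h                # y_T = h_T

print(parity([1, 0, 1, 0, 0, 1]))   # 1: an odd number of ones
```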

Constant operating memory! Works for arbitrarily long sequences!

Recurrent Neural Networks (RNNs)

h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)

f, g are implemented as feedforward neural nets.
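A minimal NumPy sketch of these equations (sizes and weights are illustrative and untrained): f is a single feedforward step with shared weights, g a softmax readout over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 16                        # vocabulary size, hidden size (illustrative)
Wxh = rng.normal(0, 0.1, (H, V))     # input  -> hidden
Whh = rng.normal(0, 0.1, (H, H))     # hidden -> hidden
Why = rng.normal(0, 0.1, (V, H))     # hidden -> output logits

def f(h_prev, x_onehot):
    """h_t = f(h_{t-1}, x_t): one feedforward step, same weights at every t."""
    return np.tanh(Whh @ h_prev + Wxh @ x_onehot)

def g(h):
    """p(x_{t+1} | x_1, ..., x_t) = g(h_t): softmax over the vocabulary."""
    logits = Why @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

h = np.zeros(H)
for token in [3, 1, 4]:              # a toy input sequence
    h = f(h, np.eye(V)[token])       # a constant-size state summarizes the past
print(g(h))                          # distribution over the next token
```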


RNNs naturally handle sequences

Training: unroll and supervise at every step:
p(had | Mary)   p(a | …)   p(little | …)
Mary had a …

RNNs naturally handle sequences

Generation: sample one element at a time

Mary had a

x_1 sampled from p(x_1)

RNNs are dynamical systems

An RNN is a very deep network, and all “arrows” compute the same function! This impacts training!

RNN behavior – intuitions

A linear dynamical system computes
h_t = w · h_{t-1} = w^t · h_0
When:
1. w > 1: it diverges, lim_{t→∞} h_t = sign(h_0) · ∞
2. w = 1: it is constant, lim_{t→∞} h_t = h_0
3. −1 < w < 1: it decays to 0, lim_{t→∞} h_t = 0
4. w = −1: it flips between ±h_0
5. w < −1: it diverges and changes sign at each step

A multidimensional linear dyn. system

h_t ∈ ℝ^n, W ∈ ℝ^{n×n}, h_t = W h_{t-1}

Compute the eigendecomposition of W:
W = Q Λ Q^{-1} = Q diag(λ_11, …, λ_nn) Q^{-1}

Then
h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0 = Q diag(λ_11^t, …, λ_nn^t) Q^{-1} h_0

A multidimensional dyn. system cont.

h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0

If the largest eigenvalue has norm:
1. > 1: the system diverges
2. = 1: the system is stable or oscillates
3. < 1: the system decays to 0

This is similar to the scalar case.

A nonlinear dynamical system

ℎ푡 = tanh(푤 ⋅ ℎ푡−1)

Output is bounded (can’t diverge!)

If w > 1, it has 3 fixed points: 0, ±h_f. Starting the iteration from h_0 ≠ 0, it ends at ±h_f (it effectively remembers the sign).

If w ≤ 1, it has one fixed point: 0

A nonlinear dynamical system

h_t = tanh(w · h_{t-1}), w > 1

Gradients in RNNs

Recall the RNN equations:
h_t = f(h_{t-1}, x_t; θ)
y_t = g(h_t)

Assume supervision only at the last step: L = e(y_T). The gradient is
∂L/∂θ = Σ_t (∂L/∂h_T) · (∂h_T/∂h_t) · (∂h_t/∂θ)

Trouble with ∂h_T/∂h_t

∂h_T/∂h_t measures how much h_T changes when h_t changes.

Another look at ∂h_T/∂h_t

Let h_t = tanh(w · h_{t-1}). Then:
∂h_t/∂h_0 = (∂h_t/∂h_{t-1}) · (∂h_{t-1}/∂h_{t-2}) · … · (∂h_1/∂h_0)
∂h_{i+1}/∂h_i = tanh′(w h_i) · w
∂h_t/∂h_0 = ∏_{i=0}^{t-1} tanh′(w h_i) · w = w^t · ∏_{i=0}^{t-1} tanh′(w h_i)

The backward phase is linear!

Vanishing gradient

If ∂h_T/∂h_t = 0, the network forgets all information from time t when it reaches time T.

This makes it impossible to discover correlations between inputs at distant time steps.

This is a modeling problem.

Exploding gradient

When ∂h_T/∂h_t is large, ∂Loss/∂θ will be large too.

Making a gradient step θ ← θ − α · ∂Loss/∂θ can drastically change the network, or even destroy it.

This is not a problem of information flow, but of training stability!

Exploding gradient intuition #1

ℎ푡 = tanh(2ℎ푡−1 + 푏)

Exploding gradient intuition #2

h_0 = σ(0.5)
h_t = σ(w · h_{t-1} + b)
L = (h_50 − 0.7)^2

R. Pascanu et al., https://arxiv.org/abs/1211.5063

Summary: trouble with ∂h_T/∂h_t

RNNs are difficult to train because:

• ∂h_T/∂h_t can be 0 – vanishing gradient
• ∂h_T/∂h_t can be ∞ – exploding gradient

Both phenomena are governed by the spectral radius of the weight matrix (the magnitude of the largest eigenvalue; |w| in the scalar case).

Exploding gradient solution

Don’t do large steps.

Pick a gradient-norm threshold and scale down all gradients whose norm exceeds it. This prevents the model from making one huge learning update that destroys it.
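A minimal sketch of clipping by global gradient norm before the optimizer step, assuming PyTorch; the model, data and threshold are placeholders.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16)      # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 4, 8)                          # (time, batch, features)
out, _ = model(x)
loss = out.pow(2).mean()                           # placeholder loss

opt.zero_grad()
loss.backward()
# Rescale all gradients so that their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```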

Vanishing gradient solution: LSTM, the RNN workhorse

∂h_T/∂h_t ≈ 0: step T forgets information from step t

The dynamical system h_t = w · h_{t-1} maximally preserves information when w = 1.

The LSTM introduces a memory cell c_t that will keep information forever: c_t = c_{t-1}

Memory cell

The memory cell preserves information: c_t = c_{t-1}, so ∂c_T/∂c_t = 1.

Gates

(Diagram: an input gate σ(W_xi x_t + …) multiplies tanh(W_xc x_t + …) before it is added to c_t.)

Gates selectively load information into the memory cell:
i_t = σ(W_xi x_t + …)
c_t = c_{t-1} + i_t · tanh(W_xc x_t + …)

LSTM: the details

The hidden state is a pair:
– c_t: information in the cell, hidden from the rest of the network
– h_t: information extracted from the cell into the network

Update equations:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c_t = i_t · tanh(W_xc x_t + W_hc h_{t-1} + b_c) + f_t · c_{t-1}
h_t = o_t · tanh(c_t)
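A NumPy transcription of the equations above (a sketch only: the four weight blocks are packed into one matrix, values are random, and no training is shown).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the pre-activations of (i, f, o, g)."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates in (0, 1)
    c = i * np.tanh(g) + f * c_prev                # write new info + keep old info
    h = o * np.tanh(c)                             # expose part of the cell
    return h, c

H, X = 8, 4                                        # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4 * H, X + H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x in rng.normal(size=(5, X)):                  # a toy sequence of 5 steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)                            # (8,) (8,)
```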

RNNs summary

LSTM/GRU can remember long histories:
• #params and memory footprint are independent of sequence length
• gates control information flow
• each element is processed in constant time:
– the RNN needs to cram all of the past into a vector
– in practice this limits the receptive field

LSTMs gave us gating!

Gates decouple “if” from “what”. WaveNet & convolutional models for text use gates too!

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Learning p(x_t | x_1, …, x_{t-1}) #3

Reduce 푥1, … , 푥푡−1 to a fixed-size representation

E.g. take weighted average of 푥1 … 푥푡−1

This works really well with self-attention!

Attention ~ unbounded convolution

Slide by L. Kaiser https://nlp.stanford.edu/seminar/details/lkaiser.pdf Attention details

The past encodings form a list of key-value pairs [(k_i, v_i)], k_i ∈ ℝ^d, v_i ∈ ℝ^D.

At step t we start with a query q ∈ ℝ^d:
1. Match q to all keys: α_i = exp(qᵀ k_i) / Σ_j exp(qᵀ k_j)
2. Compute an average: R = Σ_i α_i v_i
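A NumPy sketch of these two steps (dimensions are illustrative; the 1/√d scaling used in the Transformer paper is omitted here).

```python
import numpy as np

def attend(q, K, V):
    """q: (d,) query; K: (n, d) keys; V: (n, D) values.
    1. alpha_i = exp(q.k_i) / sum_j exp(q.k_j)   2. R = sum_i alpha_i * v_i"""
    scores = K @ q                          # dot products q^T k_i
    scores -= scores.max()                  # for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ V, alpha

rng = np.random.default_rng(0)
n, d, D = 6, 4, 8                           # history length, key dim, value dim
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, D))
q = rng.normal(size=d)
R, alpha = attend(q, K, V)
print(alpha.round(2), R.shape)              # weights over the past, (8,)
```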

Attention in action

• This is part of an encoder in NMT. The attention can look forward and backward!

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html?m=1

GPT-2 Transformer in action

Prompt: The JSALT summer school is about …

Completion: The JSALT summer school is about developing creative, innovative ways to solve global challenges, whether the problem is creating sustainable energy systems, promoting economic development, improving global health or making the world more energy independent.

The JSALT summer school is funded by an international grant for its education. Students enter the JSALT program on the basis of merit, with the objective of applying their talents to global development issues. They learn how to: develop, design and produce sustainable and integrated energy systems, promote economic development, increase efficiency of energy, develop a healthy environment for the health and well being of people and the environment, develop sustainable, clean and abundant

https://talktotransformer.com/

Self Attention Summary

#params independent of history length + lots of parameter reuse:
- effectively processes very long histories

Trades compute time for better performance!
- no need to cram all of the past into a vector!
- the n-th step requires O(n) ops

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Beyond sequences

PixelRNN: a “language model for images”. Pixels are generated left-to-right, top-to-bottom.

Cond. probabilities estimated using recurrent or convolutional neural networks.

van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).

Modeling pixels

How to model pixel values?
- A Gaussian with fixed std. dev.?
- A Gaussian with tunable std. dev.?
- A distribution over discrete levels [0, 1, 2, …, 255]?

What are the implications?

Modeling pixel values

The model works best with a flexible distribution: better to use a softmax over pixel values!
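A minimal sketch of the discrete-output idea, assuming PyTorch: each pixel intensity is one of 256 classes trained with cross-entropy. The “context network” here is a placeholder linear layer, not a masked PixelCNN.

```python
import torch
import torch.nn as nn

context_dim, levels = 128, 256
head = nn.Linear(context_dim, levels)      # logits over intensities 0..255
loss_fn = nn.CrossEntropyLoss()            # categorical likelihood, not a Gaussian

# In a real PixelCNN the context would come from masked convolutions over
# previously generated pixels; here it is random just to show the shapes.
context = torch.randn(32, context_dim)     # fake per-pixel context features
target = torch.randint(0, levels, (32,))   # true pixel intensities
loss = loss_fn(head(context), target)      # negative log-likelihood in nats
print(loss.item())
```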

PixelCNN samples & completions

Salimans et al., “A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications”
van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).

Autoregressive Models Summary

The good:
• Easy to exploit correlations in data.
• Reduces data generation to many small decisions:
– Simple to define (just pick an ordering)
– Trains like a fully supervised model
– Model operations are deterministic; randomness is needed only during generation
• Often SOTA log-likelihood

Autoregressive Model Summary

The bad:
• Train/test mismatch (teacher forcing): trained on ground-truth sequences but applied to its own predictions
• Generation requires O(n) steps (training can sometimes be parallelized)
• No compact intermediate data representation; not obvious how to use it for downstream tasks.

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Ad break

I took the following slides from Ulrich Paquet. See https://www.youtube.com/watch?v=xTsnNcctvmU for a recording of his explanation!

Latent Variable Models

Intuition: the data have a simple structure

(Figure: example images, each labeled with the column that contains its bar: 2 3 4 2 1, 4 2 2 1 3.)

Data structure

We can capture most of the variability in the data through one number

z^(n) = 1, 2, 3, or 4 for each image n, even though each image is 16-dimensional

How?


Take z^(n) = 2. Draw a bar in column 2 of the image. Et voilà! You have x^(n).

Some bar-drawing process – maybe a neural network that takes z as input and outputs a 16-dimensional vector x…?

Exercise

Write or draw a function (like a multi-layer perceptron) that takes z ∈ ℝ and produces x.
Is your input one-dimensional?
Is your output 16-dimensional?
Identify all the “tunable” parameters Θ of your function.

Data manifold

The 16-dimensional images live on a 1-dimensional manifold, plus some “noise”.

(Figure: the manifold, with regions labeled “1” “2” “3” “4”, and noise around it.)

Exercise

Change the neural network to take 푧 and produce a distribution over 푥:

p_Θ(x|z)

Generation and Inference

Generation: z ~ p(z), then x ~ p_Θ(x|z)

Inference: p_Θ(z|x) = ????

Bayes says:
p_Θ(z|x) = p_Θ(x|z) p(z) / ∫ dz′ p_Θ(x|z′) p(z′)

But often it is intractable.

Inference starts with priors

(Figure: the prior over the “unobserved random variables” z has area = 1; the “observed random variables” are x.)

Inference

I give you x. Keeping x fixed, what was z?

Exercise

Assuming the largest value of p_Θ(x|z) is 1, draw p_Θ(x|z) as a function of z on the same axis as above.

Joint density (with x observed)

(Figure: the prior p(z) has area = 1; the joint p_Θ(x|z) p(z) has area = ?)

1-minute exercise: what is the area?

Posterior

(Figure: the posterior, with area = 1.)

Dividing by the marginal likelihood (evidence) scales the area back to 1...

Evidence of all data points

Area for data point n

Evidence for all data points

The product of the areas underneath the green curves

Maybe the x’s don’t even look that nice when sampled with this Θ.

Maximizing the evidence

The product of the areas underneath the green curves

By changing Θ we can make the evidence for these data points bigger...

These z’s don’t generate images like the ones in the data set…

(With this Θ, the prior doesn’t capture the data manifold well)

Maximizing the evidence

The product of the areas underneath the green curves

That’s better…!

For the sharp-sighted

The product of the areas underneath the green curves

roughly... 20% 40% 20% 20%

Generation and Learning

Generation: z ~ p(z), then x ~ p_Θ(x|z)

Training by maximum log-likelihood:
argmax_Θ log p_Θ(x)

But p_Θ(x) = ∫ dz p_Θ(x|z) p(z)

Approximate likelihood optimization

Our approach:

- Lower-bound log p_Θ(x)
- Push the lower bound up…
… hoping to increase log p_Θ(x)

Exercise

Jensen’s inequality: draw log(·) as a function and convince yourself that
log((2/3)·z_1 + (1/3)·z_2) ≥ (2/3)·log z_1 + (1/3)·log z_2
is true for any (nonnegative) setting of z_1 and z_2.

Jensen’s inequality

log ∫ dz q(z) f(z) ≥ ∫ dz q(z) log f(z)

ELBO: A likelihood bound

log p_Θ(x) = log ∫ dz p_Θ(x, z)
           = log ∫ dz q_Φ(z|x) · p_Θ(x, z) / q_Φ(z|x)
           ≥ ∫ dz q_Φ(z|x) · log [ p_Θ(x, z) / q_Φ(z|x) ]
           = E_{q_Φ(z|x)} [ log ( p_Θ(x|z) · p(z) / q_Φ(z|x) ) ]
           = E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − E_{q_Φ(z|x)} [ log ( q_Φ(z|x) / p(z) ) ]
           = E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

ELBO interpretation

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

E_{q_Φ(z|x)} [ log p_Θ(x|z) ]: the auto-encoding term!

ELBO optimization

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

ELBO is a function of 푥, Θ, and Φ

What does it mean to maximize the ELBO over Φ?

It can’t change log 푝Θ(푥)…

It tries to make the bound tight!

Exercise

Recall Jensen’s inequality:
log ∫ dz q(z) f(z) ≥ ∫ dz q(z) log f(z)

When is it an equality?

When f(z) = const

When is ELBO tight?

log p_Θ(x) = log ∫ dz q_Φ(z|x) · p_Θ(x, z) / q_Φ(z|x)
           ≥ ∫ dz q_Φ(z|x) · log [ p_Θ(x, z) / q_Φ(z|x) ] = ELBO

The bound is an equality when p_Θ(x, z) / q_Φ(z|x) = const!

What does it mean?

p_Θ(x, z) / q_Φ(z|x) = p_Θ(z|x) · p_Θ(x) / q_Φ(z|x) = const  ⇒  q_Φ(z|x) = p_Θ(z|x)

ELBO is tight when 푞Φ(푧|푥) does exact inference!

ELBO optimization

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

ELBO is a function of x, Θ, and Φ

What does it mean to maximize the ELBO over Θ?

It can only affect E_{q_Φ(z|x)} [ log p_Θ(x|z) ]!

It makes p_Θ(x|z) generate back our x! This affects log p_Θ(x)… …making room for improving q!

ELBO optimization

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

Change Φ to maximize the bound, making q_Φ(z|x) ≈ p_Θ(z|x) (similar to the E step).

Change Θ to improve log p_Θ(x) (if the bound is sufficiently tight; similar to the M step).

But we tune Φ and Θ at the same time!

ELBO interpretation

ELBO, or evidence lower bound:

log p(x) ≥ E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

where:

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] – reconstruction quality: how many nats we need to reconstruct x when someone gives us q(z|x)

KL(q_Φ(z|x) ∥ p(z)) – code transmission cost: how many nats we transmit about x by using q_Φ(z|x) rather than p(z)

Interpretation: do well at reconstructing 푥, limiting the amount of information about 푥 encoded in 푧.

The Variational Autoencoder

(Diagram: the encoder q maps x to a distribution over the latent code, q(z|x); KL(q(z|x) ∥ p(z)) ties it to the prior p(z); the decoder p maps samples z to p(x|z); the reconstruction term is E_{z~q(z|x)} [ log p(x|z) ].)

An input 푥 is put through the 푞 network to obtain a distribution over latent code 푧, 푞(푧|푥).

Samples z_1, …, z_k are drawn from q(z|x). Then k reconstructions p(x|z_k) are computed using the network p.

VAE is an Information Bottleneck

Each sample is represented as a Gaussian.

This discards information (the latent representation has low precision).

How to evaluate a VAE

Compute:

log p(x) ≥ E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

KL(q_Φ(z|x) ∥ p(z)) has a closed form for simple q_Φ(z|x).

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] can be approximated:

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] ≈ (1/k) Σ_i log p_Θ(x|z_i), where the z_i are drawn from q_Φ(z|x)

How to train a VAE?

(Diagram: the forward computation and its derivatives.)

• The forward computation involves drawing samples
• Can’t backprop through sampling

Reparameterization exercise

Assume that 푞Φ 푧 푥 = 풩(μ푧, 휎푧).

Exercise: you can sample from 풩(0,1)

Q: how to draw samples from N(μ_z, σ_z)?

A: ε_i ~ N(0,1), z_i = μ_z + σ_z · ε_i

Reparametrization to the rescue

Assume that 푞Φ 푧 푥 = 풩(μ푧, 휎푧).

Then:

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] = E_{ε~N(0,1)} [ log p_Θ(x | μ_z + σ_z · ε) ]

ε is drawn from a fixed distribution. With ε given, the computation graph is deterministic → we can backprop!
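A minimal sketch of reparameterized ELBO training, assuming PyTorch, a diagonal-Gaussian q, a standard-normal prior (so the KL term has a closed form) and a Bernoulli decoder; the architecture and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim = 16, 2
enc = nn.Linear(x_dim, 2 * z_dim)      # outputs [mu, log sigma^2]
dec = nn.Linear(z_dim, x_dim)          # outputs Bernoulli logits

def elbo(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    eps = torch.randn_like(mu)                     # eps ~ N(0, 1)
    z = mu + torch.exp(0.5 * logvar) * eps         # z = mu + sigma * eps
    rec = -F.binary_cross_entropy_with_logits(dec(z), x, reduction="none").sum(-1)
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form, per data point.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
    return (rec - kl).mean()                       # the lower bound to maximize

x = torch.rand(8, x_dim).round()                   # toy binary data
loss = -elbo(x)
loss.backward()                                    # gradients flow through mu and sigma
```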

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Exercise

p(x) = uniform(0,1) = 1 for x ∈ [0,1], 0 otherwise
y = 2x + 1
p(y) = ?

Transforming distributions

y also has a uniform distribution!

p(y) = uniform(1,3) = 1/2 for y ∈ [1,3], 0 otherwise

E. Jang, https://blog.evjang.com/2018/01/nf1.html

The Jacobian: stretching space

Let y = f(x). Take a range (x, x + Δx).
Question: how long is this range? Δx
Question: how long is (f(x), f(x + Δx))?
f(x + Δx) ≈ f(x) + (∂f/∂x)·Δx, so the new length is approximately |∂f/∂x|·Δx
https://blog.evjang.com/2018/01/nf1.html

Change of variables

y = f(x), f is a bijection
p_x(x) = p_y(f(x)) · |det ∂f(x)/∂x|
f transforms x ± Δx/2 to y ± Δy/2; the space is stretched by Δy ≈ |∂f(x)/∂x| · Δx

Idea

Start with z ~ N(0,1). Then x = f^{-1}(z) (equivalently z = f(x)), and

p_x(x) = p_z(f(x)) · |det ∂f(x)/∂x|

Tractable when:
- f is easy to invert exactly (f and f^{-1} form an exact auto-encoder!)
- det ∂f(x)/∂x is easy to compute
⇒ We need a special f!

A special form of 푓

x_1, x_2 = split(x)
y_1 = x_1
y_2 = x_2 · s(x_1) + t(x_1)
Trivial to invert! ∂y/∂x is triangular (its diagonal holds only 1s and s(x_1)), so the determinant is easy!
L. Dinh et al., “NICE”; L. Dinh et al., “Real NVP”
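A NumPy sketch of one such coupling layer, its exact inverse and its log-determinant; s and t here are arbitrary stand-ins for the small neural networks used in NICE / Real NVP.

```python
import numpy as np

def s(x1): return np.tanh(x1) + 1.5          # positive scales, for simplicity
def t(x1): return 0.3 * x1

def forward(x):
    x1, x2 = np.split(x, 2)
    y1 = x1
    y2 = x2 * s(x1) + t(x1)
    logdet = np.sum(np.log(np.abs(s(x1))))   # triangular Jacobian: det = product of scales
    return np.concatenate([y1, y2]), logdet

def inverse(y):
    y1, y2 = np.split(y, 2)
    x1 = y1
    x2 = (y2 - t(x1)) / s(x1)                # trivial, exact inverse
    return np.concatenate([x1, x2])

x = np.random.default_rng(0).normal(size=6)
y, logdet = forward(x)
print(np.allclose(inverse(y), x), logdet)    # True -> exact reconstruction
```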

Image credit: Eric Jang, https://blog.evjang.com/2018/01/nf2.html

Normalizing flows in action

Kingma et al., “Glow”; Prenger et al., “WaveGlow”

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

VAEs and sequential data

To encode a long sequence, we apply the VAE to chunks:

푧 푧 푧 푧 푧

But neighboring chunks are similar! We are encoding the same information in many 푧s! We are wasting capacity!

WaveNet + VAE

A WaveNet reconstructs the waveform using information from the past.

Latent representations are extracted at regular intervals.

The WaveNet uses information from:
1. the past recording
2. the latent vectors z
3. other conditioning, e.g. about the speaker

The encoder transmits in the z's only the information that is missing from the past recording. The whole system is a very low-bitrate codec (roughly 0.7 kbit/sec; the waveform is 16 kHz × 8 bit = 128 kbit/sec).

van den Oord et al., “Neural Discrete Representation Learning”

VAE + autoregressive models: latent collapse danger

• Purely autoregressive models: SOTA log-likelihoods
• Conditioning on latents: information passed through the bottleneck lowers the reconstruction cross-entropy
• A standard VAE model actively tries to:
– reduce information in the latents
– maximally use the autoregressive information
⇒ Collapse: the latents are not used!
• Solution: stop optimizing the KL term (free bits), make it a hyperparameter (VQ-VAE); a sketch of the VQ bottleneck follows
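A minimal sketch of a VQ-VAE-style bottleneck, assuming PyTorch: encoder outputs are snapped to the nearest prototype and a straight-through estimator passes gradients back to the encoder. The commitment and codebook losses of the full method are omitted, and all sizes are illustrative.

```python
import torch

def vq_bottleneck(z_e, codebook):
    """Quantize encoder outputs z_e (batch, d) to their nearest prototypes.
    Each latent then carries at most log2(K) bits."""
    dist = torch.cdist(z_e, codebook)              # (batch, K) pairwise distances
    idx = dist.argmin(dim=1)                       # index of the nearest prototype
    z_q = codebook[idx]
    # Straight-through estimator: forward pass uses z_q, gradient flows to z_e.
    return z_e + (z_q - z_e).detach(), idx

K, d = 64, 8                                       # 64 prototypes -> 6 bits per latent
codebook = torch.randn(K, d)
z_e = torch.randn(16, d, requires_grad=True)       # pretend encoder outputs
z_q, idx = vq_bottleneck(z_e, codebook)
z_q.sum().backward()                               # gradients reach z_e
print(idx[:5], z_e.grad.shape)
```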

Model description

WaveNet decoder conditioned on:
- latents extracted at 24–50 Hz
- the speaker

3 bottlenecks evaluated:
- AE: max 32 bits/dim
- VAE: KL(q(z|x) ∥ N(0,1)) nats (bits)
- VQ-VAE with K prototypes: log2(K) bits

Input: Waveforms, Mel Filterbanks, MFCCs

Hope: the speaker is separated from the content.

Phonemes vs Gender tradeoff

https://arxiv.org/abs/1901.08810

Summary

Model                           | Evaluate p(x) | Sample p(x)       | Extract latents | Control info in latents
Autoregressive                  | Exact & cheap | Exact & expensive | Impossible      | N/A
Latent var., VAE                | Lower bound   | Exact & cheap     | Easy            | No
Latent var., Norm. Flow         | Exact & cheap | Exact & cheap     | Easy            | No
Autoregressive cond. on latents | Lower bound   | Exact & expensive | Easy            | Yes

We will apply these ideas during JSALT’s topic “Distant supervision for representation learning”:
- Work on speech and handwriting
- Explore ways of integrating metadata and unlabeled data to control latent representations
- Focus on downstream supervised OCR and ASR tasks under low-data conditions

Thank you!

• Questions?

References – other tutorials

• https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf

Backup

ELBO: A lower bound on log p(x)

Let 푞(푧|푥) be any distribution. We can show that

log p(x) = KL(q(z|x) ∥ p(z|x)) + E_{z~q(z|x)} [ log ( p(x) p(z|x) / q(z|x) ) ]
         ≥ E_{z~q(z|x)} [ log ( p(x) p(z|x) / q(z|x) ) ]
         = E_{z~q(z|x)} [ log p(x|z) ] − KL(q(z|x) ∥ p(z))

The bound is tight for q(z|x) = p(z|x).

ELBO Derivation pt. 1 ELBO derivation pt. 2