Advanced ML with autoregressive and latent variable models

https://github.com/janchorowski/JSALT2019_tutorials

Jan Chorowski University of Wrocław

With slides from TMLSS 2018 (Jan Chorowski, Ulrich Paquet)

Outline

• Why unsupervised learning
• Autoregressive models
– Intuitions
– Dilated convolutions
– RNNs (Intermezzo: LSTM training dynamics)
– Transformers
– Beyond sequences, summary
• Latent variable models
– VAE
– Normalizing flows
• Autoregressive + latent variable: why and how?

Deep Model = Hierarchy of Concepts

(Figure: a hierarchy of concepts: Cat, Dog, …, Moon, Banana.)

M. Zeiler, “Visualizing and Understanding Convolutional Networks”

Deep Learning history: 1986

‘hidden’ units which are not part of the input or output come to represent important features of the task domain

Deep Learning history: 2006

Stacked RBMs

Hinton, Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks”

Deep Learning history: 2012

2012: AlexNet achieves SOTA on ImageNet with fully supervised training

Deep Learning Recipe

1. Get a massive, labeled dataset D = {(x, y)}:
– Computer vision: ImageNet, 1M images
– Machine translation: European Parliament data, CommonCrawl, several million sentence pairs
– Speech recognition: 1000 h (LibriSpeech), 12000 h (Google Voice Search)
– Question answering: SQuAD, 150k questions with human answers
– …
2. Train the model to maximize log p(y|x)

Value of Labels

• Labeled data is crucial for deep learning
• But labels carry little information:
– Example: an ImageNet model has 30M weights, but ImageNet is about 1M images from 1000 classes
– Labels: 1M × 10 bits = 10 Mbits
– Raw data (128 × 128 images): ca. 500 Gbits!

Value of Unlabeled Data

“The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second.”

Geoff Hinton

https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/

Unsupervised learning recipe

1. Get a massive unlabeled dataset D = {x} (easy: unlabeled data is nearly free)

2. Train model to…???

What is the task? What is the loss?

Unsupervised learning goals

• Learn a data representation:
– Extract features for downstream tasks
– Describe the data (clusterings)
• Data generation / density estimation:
– What are my data
– Outlier detection
– Super-resolution
– Artifact removal
– Compression

2 minute exercise

Talk to your friend next to you, and tell him or her everything you can about this data set.

(The rows are correlated; there is a latent representation.)

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Learning high dimensional distributions is hard

• Assume we work with small (32×32) images
• Each data point is a real vector of size 32 × 32 × 3
• Data occupies only a tiny fraction of ℝ^(32×32×3)
• Impossible to directly fit a standard prob. distribution!

Divide and conquer

What is easier:
1. Write a paragraph at once?
2. Guess the next word?

Mary had a little
Mary had a little dog.
Mary had a little dog. It
Mary had a little dog. It was
Mary had a little dog. It was cute

Chain rule

Decompose probability of 푛-dimensional data points into product of 푛 conditional probabilities

p(x) = p(x_1, x_2, …, x_n) = p(x_1) · p(x_2 | x_1) · … · p(x_n | x_1, x_2, …, x_{n-1})
     = ∏_t p(x_t | x_1, x_2, …, x_{t-1})

Learning p(x_t | x_1, …, x_{t-1})

The good: it looks like supervised learning!

The bad: we need to handle variable-sized input.

Large n -> sparse data

Consider an n-gram LM:
p(x_t | x_{t-k+1}, …, x_{t-1}) = #(x_{t-k:t}) / #(x_{t-k:t-1})

As n → ∞, n-grams become unique:
- the probability of all but one continuation is 0
- we can't model long contexts
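To make the counting estimate concrete, here is a minimal Python sketch (the toy corpus and names are illustrative, not from the slides) of a count-based trigram model; with so little data almost every context is unique, which is exactly the sparsity problem:

```python
from collections import Counter

# Count-based trigram LM: p(x_t | x_{t-2}, x_{t-1}) = #(x_{t-2:t}) / #(x_{t-2:t-1})
corpus = "mary had a little dog it was cute".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(w1, w2, w3):
    """Probability of w3 following the context (w1, w2); 0 for unseen events."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_next("had", "a", "little"))  # 1.0 -- the only continuation ever observed
print(p_next("had", "a", "dog"))     # 0.0 -- plausible, but the count is zero
```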

Solution: Use a representation in which “llama ≈ animal”

Learning p(x_t | x_1, …, x_{t-1}) #1

Assume the distant past doesn't matter:
p(x_t | x_1, …, x_{t-1}) ≈ p(x_t | x_{t-k+1}, …, x_{t-1})

Quite popular:
• n-gram text models (estimate p by counting)
• FIR filters

How to efficiently handle large k?

Neural LM

Compute p(x_t | x_{t-k+1}, …, x_{t-1}) with a neural net.

(Diagram: table-lookup word representations are concatenated and fed to an arbitrary sub-graph / extractor, followed by a softmax over the next word.)

Looks somewhat like a convolution. Still: large “n” => many params, how to fix?

Y. Bengio et al., “A Neural Probabilistic Language Model”, NIPS 2000

Dilated Convolutions

Large (but finite) receptive fields. Few parameters.

WaveNet: https://arxiv.org/abs/1609.03499
ByteNet: https://arxiv.org/abs/1610.10099
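A minimal sketch of the idea, assuming PyTorch: a stack of causal 1-D convolutions whose dilation doubles at every layer, so the receptive field grows exponentially while the parameter count grows only linearly. This is an illustration only, not the actual WaveNet/ByteNet architecture (no gating, residual or skip connections):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Causal 1-D convolutions with dilations 1, 2, 4, ... (kernel size 2)."""
    def __init__(self, channels=32, kernel_size=2, num_layers=6):
        super().__init__()
        self.convs = nn.ModuleList()
        self.pads = []
        for i in range(num_layers):
            dilation = 2 ** i
            # Left-pad so the output at time t only sees inputs <= t (causality).
            self.pads.append((kernel_size - 1) * dilation)
            self.convs.append(nn.Conv1d(channels, channels, kernel_size,
                                        dilation=dilation))

    def forward(self, x):                       # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

model = DilatedCausalStack()
y = model(torch.randn(1, 32, 100))
print(y.shape)   # torch.Size([1, 32, 100]); receptive field = 64 time steps
```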

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Learning p(x_t | x_1, …, x_{t-1}) #2

Introduce a hidden (i.e. not directly observed) state ℎ푡.

ℎ푡 summarizes 푥1, … , 푥푡−1.

Compute h_t using this recurrence:
h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)

Recurrences are powerful

Input: a sequence of bits 1, 0, 1, 0, 0, 1
Output: the parity

Solution: the hidden state is just 1 bit – the parity so far:
h_0 = 0
h_t = XOR(h_{t-1}, x_t)
y_T = h_T
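A tiny plain-Python sketch of that recurrence (no neural network yet): the whole prefix is summarized by a single bit.

```python
def parity(bits):
    """Running XOR: a 1-bit hidden state summarizes the entire prefix."""
    h = 0                   # h_0 = 0
    for x in bits:
        h = h ^ x           # h_t = XOR(h_{t-1}, x_t)
    return h                # y_T = h_T

print(parity([1, 0, 1, 0, 0, 1]))   # 1: an odd number of ones
```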

Constant operating memory! Works for arbitrarily long sequences!

Recurrent Neural Networks (RNNs)

h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)

f, g are implemented as feedforward neural nets.
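A minimal NumPy sketch of these equations (sizes and weights are illustrative and untrained): f is a single feedforward step with shared weights, g a softmax readout over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 16                        # vocabulary size, hidden size (illustrative)
Wxh = rng.normal(0, 0.1, (H, V))     # input  -> hidden
Whh = rng.normal(0, 0.1, (H, H))     # hidden -> hidden
Why = rng.normal(0, 0.1, (V, H))     # hidden -> output logits

def f(h_prev, x_onehot):
    """h_t = f(h_{t-1}, x_t): one feedforward step, same weights at every t."""
    return np.tanh(Whh @ h_prev + Wxh @ x_onehot)

def g(h):
    """p(x_{t+1} | x_1, ..., x_t) = g(h_t): softmax over the vocabulary."""
    logits = Why @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

h = np.zeros(H)
for token in [3, 1, 4]:              # a toy input sequence
    h = f(h, np.eye(V)[token])       # a constant-size state summarizes the past
print(g(h))                          # distribution over the next token
```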


RNNs naturally handle sequences

Training: unroll and supervise at every step:
p(had | Mary)   p(a | …)   p(little | …)
Mary had a …

RNNs naturally handle sequences

Generation: sample one element at a time

Mary had a

x_1 sampled from p(x_1)

RNNs are dynamical systems

An RNN is a very deep network, and all “arrows” compute the same function! This impacts training!

RNN behavior – intuitions

A linear dynamical system computes
h_t = w · h_{t-1} = w^t · h_0
When:
1. w > 1: it diverges, lim_{t→∞} h_t = sign(h_0) · ∞
2. w = 1: it is constant, lim_{t→∞} h_t = h_0
3. −1 < w < 1: it decays to 0, lim_{t→∞} h_t = 0
4. w = −1: it flips between ±h_0
5. w < −1: it diverges and changes sign at each step

A multidimensional linear dyn. system

h_t ∈ ℝ^n, W ∈ ℝ^{n×n}, h_t = W h_{t-1}

Compute the eigendecomposition of W:
W = Q Λ Q^{-1} = Q diag(λ_11, …, λ_nn) Q^{-1}

Then
h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0 = Q diag(λ_11^t, …, λ_nn^t) Q^{-1} h_0

A multidimensional dyn. system cont.

h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0

If the largest eigenvalue has norm:
1. > 1: the system diverges
2. = 1: the system is stable or oscillates
3. < 1: the system decays to 0

This is similar to the scalar case.

A nonlinear dynamical system

ℎ푡 = tanh(푤 ⋅ ℎ푡−1)

Output is bounded (can’t diverge!)

If w > 1, it has 3 fixed points: 0, ±h_f. Starting the iteration from h_0 ≠ 0, it ends at ±h_f (it effectively remembers the sign).

If w ≤ 1, it has one fixed point: 0

A nonlinear dynamical system

h_t = tanh(w · h_{t-1}), w > 1

Gradients in RNNs

Recall the RNN equations:
h_t = f(h_{t-1}, x_t; θ)
y_t = g(h_t)

Assume supervision only at the last step: L = e(y_T). The gradient is
∂L/∂θ = Σ_t (∂L/∂h_T) · (∂h_T/∂h_t) · (∂h_t/∂θ)

Trouble with ∂h_T/∂h_t

∂h_T/∂h_t measures how much h_T changes when h_t changes.

Another look at ∂h_T/∂h_t

Let h_t = tanh(w · h_{t-1}). Then:
∂h_t/∂h_0 = (∂h_t/∂h_{t-1}) · (∂h_{t-1}/∂h_{t-2}) · … · (∂h_1/∂h_0)
∂h_{i+1}/∂h_i = tanh′(w h_i) · w
∂h_t/∂h_0 = ∏_{i=0}^{t-1} tanh′(w h_i) · w = w^t · ∏_{i=0}^{t-1} tanh′(w h_i)

The backward phase is linear!

Vanishing gradient

If ∂h_T/∂h_t = 0, the network forgets all information from time t when it reaches time T.

This makes it impossible to discover correlations between inputs at distant time steps.

This is a modeling problem.

Exploding gradient

When ∂h_T/∂h_t is large, ∂Loss/∂θ will be large too.

Making a gradient step θ ← θ − α · ∂Loss/∂θ can drastically change the network, or even destroy it.

This is not a problem of information flow, but of training stability!

Exploding gradient intuition #1

ℎ푡 = tanh(2ℎ푡−1 + 푏)

Exploding gradient intuition #2

h_0 = σ(0.5)
h_t = σ(w · h_{t-1} + b)
L = (h_50 − 0.7)^2

R. Pascanu et al., https://arxiv.org/abs/1211.5063

Summary: trouble with ∂h_T/∂h_t

RNNs are difficult to train because:

• ∂h_T/∂h_t can be 0 – vanishing gradient
• ∂h_T/∂h_t can be ∞ – exploding gradient

Both phenomena are governed by the spectral radius of the weight matrix (the magnitude of the largest eigenvalue; |w| in the scalar case).

Exploding gradient solution

Don’t do large steps.

Pick a gradient-norm threshold and scale down all gradients whose norm exceeds it. This prevents the model from making one huge learning update that destroys it.
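A minimal sketch of clipping by global gradient norm before the optimizer step, assuming PyTorch; the model, data and threshold are placeholders.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16)      # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 4, 8)                          # (time, batch, features)
out, _ = model(x)
loss = out.pow(2).mean()                           # placeholder loss

opt.zero_grad()
loss.backward()
# Rescale all gradients so that their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```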

Vanishing gradient solution: LSTM, the RNN workhorse

∂h_T/∂h_t ≈ 0: step T forgets information from step t

The dynamical system h_t = w · h_{t-1} maximally preserves information when w = 1.

The LSTM introduces a memory cell c_t that will keep information forever: c_t = c_{t-1}

Memory cell

The memory cell preserves information: c_t = c_{t-1}, so ∂c_T/∂c_t = 1.

Gates

(Diagram: an input gate σ(W_xi x_t + …) multiplies tanh(W_xc x_t + …) before it is added to c_t.)

Gates selectively load information into the memory cell:
i_t = σ(W_xi x_t + …)
c_t = c_{t-1} + i_t · tanh(W_xc x_t + …)

LSTM: the details

The hidden state is a pair:
– c_t: information in the cell, hidden from the rest of the network
– h_t: information extracted from the cell into the network

Update equations:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c_t = i_t · tanh(W_xc x_t + W_hc h_{t-1} + b_c) + f_t · c_{t-1}
h_t = o_t · tanh(c_t)
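A NumPy transcription of the equations above (a sketch only: the four weight blocks are packed into one matrix, values are random, and no training is shown).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the pre-activations of (i, f, o, g)."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates in (0, 1)
    c = i * np.tanh(g) + f * c_prev                # write new info + keep old info
    h = o * np.tanh(c)                             # expose part of the cell
    return h, c

H, X = 8, 4                                        # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4 * H, X + H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x in rng.normal(size=(5, X)):                  # a toy sequence of 5 steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)                            # (8,) (8,)
```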

RNNs summary

LSTM/GRU can remember long histories:
• #params and memory footprint are independent of sequence length
• gates control information flow
• each element is processed in constant time:
– the RNN needs to cram all of the past into a vector
– in practice this limits the receptive field

LSTMs gave us gating!

Gates decouple “if” from “what”. WaveNet & convolutional models for text use gates too!

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Learning p(x_t | x_1, …, x_{t-1}) #3

Reduce 푥1, … , 푥푡−1 to a fixed-size representation

E.g. take weighted average of 푥1 … 푥푡−1

This works really well with self-attention!

Attention ~ unbounded convolution

Slide by L. Kaiser https://nlp.stanford.edu/seminar/details/lkaiser.pdf Attention details

The past encodings form a list of key-value pairs [(k_i, v_i)], k_i ∈ ℝ^d, v_i ∈ ℝ^D.

At step t we start with a query q ∈ ℝ^d:
1. Match q to all keys: α_i = exp(qᵀ k_i) / Σ_j exp(qᵀ k_j)
2. Compute an average: R = Σ_i α_i v_i
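A NumPy sketch of these two steps (dimensions are illustrative; the 1/√d scaling used in the Transformer paper is omitted here).

```python
import numpy as np

def attend(q, K, V):
    """q: (d,) query; K: (n, d) keys; V: (n, D) values.
    1. alpha_i = exp(q.k_i) / sum_j exp(q.k_j)   2. R = sum_i alpha_i * v_i"""
    scores = K @ q                          # dot products q^T k_i
    scores -= scores.max()                  # for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ V, alpha

rng = np.random.default_rng(0)
n, d, D = 6, 4, 8                           # history length, key dim, value dim
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, D))
q = rng.normal(size=d)
R, alpha = attend(q, K, V)
print(alpha.round(2), R.shape)              # weights over the past, (8,)
```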

Attention in action

• This is part of an encoder in NMT. The attention can look forward and backward!

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html?m=1

GPT-2 Transformer in action

Prompt: The JSALT summer school is about …

Completion: The JSALT summer school is about developing creative, innovative ways to solve global challenges, whether the problem is creating sustainable energy systems, promoting economic development, improving global health or making the world more energy independent.

The JSALT summer school is funded by an international grant for its education. Students enter the JSALT program on the basis of merit, with the objective of applying their talents to global development issues. They learn how to: develop, design and produce sustainable and integrated energy systems, promote economic development, increase efficiency of energy, develop a healthy environment for the health and well being of people and the environment, develop sustainable, clean and abundant

https://talktotransformer.com/

Self Attention Summary

#params independent of history length + lots of parameter reuse:
- effectively processes very long histories

Trades compute time for better performance!
- no need to cram all of the past into a vector!
- the n-th step requires O(n) ops

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Beyond sequences

PixelRNN: a “language model for images”. Pixels are generated left-to-right, top-to-bottom.

Cond. probabilities estimated using recurrent or convolutional neural networks.

van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).

Modeling pixels

How to model pixel values?
- A Gaussian with fixed std. dev.?
- A Gaussian with tunable std. dev.?
- A distribution over discrete levels [0, 1, 2, …, 255]?

What are the implications?

Modeling pixel values

The model works best with a flexible distribution: better to use a softmax over pixel values!
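A minimal sketch of the discrete-output idea, assuming PyTorch: each pixel intensity is one of 256 classes trained with cross-entropy. The “context network” here is a placeholder linear layer, not a masked PixelCNN.

```python
import torch
import torch.nn as nn

context_dim, levels = 128, 256
head = nn.Linear(context_dim, levels)      # logits over intensities 0..255
loss_fn = nn.CrossEntropyLoss()            # categorical likelihood, not a Gaussian

# In a real PixelCNN the context would come from masked convolutions over
# previously generated pixels; here it is random just to show the shapes.
context = torch.randn(32, context_dim)     # fake per-pixel context features
target = torch.randint(0, levels, (32,))   # true pixel intensities
loss = loss_fn(head(context), target)      # negative log-likelihood in nats
print(loss.item())
```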

PixelCNN samples & completions

Salimans et al., “A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications”
van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).

Autoregressive Models Summary

The good:
• Easy to exploit correlations in data.
• Reduces data generation to many small decisions:
– Simple to define (just pick an ordering)
– Trains like a fully supervised model
– Model operations are deterministic; randomness is needed only during generation
• Often SOTA log-likelihood

Autoregressive Model Summary

The bad:
• Train/test mismatch (teacher forcing): trained on ground-truth sequences but applied to its own predictions
• Generation requires O(n) steps (training can sometimes be parallelized)
• No compact intermediate data representation; not obvious how to use it for downstream tasks.

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Ad break

I took the following slides from Ulrich Paquet. See https://www.youtube.com/watch?v=xTsnNcctvmU for a recording of his explanation!

Latent Variable Models

Intuition: the data have a simple structure

(Figure: example images, each labeled with the column that contains its bar: 2 3 4 2 1, 4 2 2 1 3.)

Data structure

We can capture most of the variability in the data through one number

z^(n) = 1, 2, 3, or 4 for each image n, even though each image is 16-dimensional

How?


Take z^(n) = 2. Draw a bar in column 2 of the image. Et voilà! You have x^(n).

Some bar-drawing process – maybe a neural network that takes z as input and outputs a 16-dimensional vector x…?

Exercise

Write or draw a function (like a multi-layer perceptron) that takes z ∈ ℝ and produces x.
Is your input one-dimensional?
Is your output 16-dimensional?
Identify all the “tunable” parameters Θ of your function.

Data manifold

The 16-dimensional images live on a 1-dimensional manifold, plus some “noise”.

(Figure: the manifold, with regions labeled “1” “2” “3” “4”, and noise around it.)

Exercise

Change the neural network to take 푧 and produce a distribution over 푥:

p_Θ(x|z)

Generation and Inference

Generation: z ~ p(z), then x ~ p_Θ(x|z)

Inference: p_Θ(z|x) = ????

Bayes says:
p_Θ(z|x) = p_Θ(x|z) p(z) / ∫ dz′ p_Θ(x|z′) p(z′)

But often it is intractable.

Inference starts with priors

(Figure: the prior over the “unobserved random variables” z has area = 1; the “observed random variables” are x.)

Inference

I give you x. Keeping x fixed, what was z?

Exercise

Assuming the largest value of p_Θ(x|z) is 1, draw p_Θ(x|z) as a function of z on the same axis as above.

Joint density (with x observed)

(Figure: the prior p(z) has area = 1; the joint p_Θ(x|z) p(z) has area = ?)

1-minute exercise: what is the area?

Posterior

(Figure: the posterior, with area = 1.)

Dividing by the marginal likelihood (evidence) scales the area back to 1...

Evidence of all data points

Area for data point n

Evidence for all data points

The product of the areas underneath the green curves

Maybe the x’s don’t even look that nice when sampled with this Θ.

Maximizing the evidence

The product of the areas underneath the green curves

By changing Θ we can make the evidence for these data points bigger...

These z’s don’t generate images like the ones in the data set…

(With this Θ, the prior doesn’t capture the data manifold well)

Maximizing the evidence

The product of the areas underneath the green curves

That’s better…!

For the sharp-sighted

The product of the areas underneath the green curves

roughly... 20% 40% 20% 20%

Generation and Learning

Generation: z ~ p(z), then x ~ p_Θ(x|z)

Training by maximum log-likelihood:
argmax_Θ log p_Θ(x)

But p_Θ(x) = ∫ dz p_Θ(x|z) p(z)

Approximate likelihood optimization

Our approach:

- Lower-bound log p_Θ(x)
- Push the lower bound up…
… hoping to increase log p_Θ(x)

Exercise

Jensen’s inequality: draw log(·) as a function and convince yourself that
log((2/3)·z_1 + (1/3)·z_2) ≥ (2/3)·log z_1 + (1/3)·log z_2
is true for any (nonnegative) setting of z_1 and z_2.

Jensen’s inequality

log ∫ dz q(z) f(z) ≥ ∫ dz q(z) log f(z)

ELBO: A likelihood bound

log p_Θ(x) = log ∫ dz p_Θ(x, z)
           = log ∫ dz q_Φ(z|x) · p_Θ(x, z) / q_Φ(z|x)
           ≥ ∫ dz q_Φ(z|x) · log [ p_Θ(x, z) / q_Φ(z|x) ]
           = E_{q_Φ(z|x)} [ log ( p_Θ(x|z) · p(z) / q_Φ(z|x) ) ]
           = E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − E_{q_Φ(z|x)} [ log ( q_Φ(z|x) / p(z) ) ]
           = E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

ELBO interpretation

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

E_{q_Φ(z|x)} [ log p_Θ(x|z) ]: the auto-encoding term!

ELBO optimization

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

ELBO is a function of 푥, Θ, and Φ

What does it mean to maximize the ELBO over Φ?

It can’t change log 푝Θ(푥)…

It tries to make the bound tight!

Exercise

Recall Jensen’s inequality:
log ∫ dz q(z) f(z) ≥ ∫ dz q(z) log f(z)

When is it an equality?

When f(z) = const

When is ELBO tight?

log p_Θ(x) = log ∫ dz q_Φ(z|x) · p_Θ(x, z) / q_Φ(z|x)
           ≥ ∫ dz q_Φ(z|x) · log [ p_Θ(x, z) / q_Φ(z|x) ] = ELBO

The bound is an equality when p_Θ(x, z) / q_Φ(z|x) = const!

What does it mean?

p_Θ(x, z) / q_Φ(z|x) = p_Θ(z|x) · p_Θ(x) / q_Φ(z|x) = const  ⇒  q_Φ(z|x) = p_Θ(z|x)

ELBO is tight when 푞Φ(푧|푥) does exact inference!

ELBO optimization

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

ELBO is a function of x, Θ, and Φ

What does it mean to maximize the ELBO over Θ?

It can only affect E_{q_Φ(z|x)} [ log p_Θ(x|z) ]!

It makes p_Θ(x|z) generate back our x! This affects log p_Θ(x)… …making room for improving q!

ELBO optimization

log p_Θ(x) ≥ E_{q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

Change Φ to maximize the bound, making q_Φ(z|x) ≈ p_Θ(z|x) (similar to the E step).

Change Θ to improve log p_Θ(x) (if the bound is sufficiently tight; similar to the M step).

But we tune Φ and Θ at the same time!

ELBO interpretation

ELBO, or evidence lower bound:

log p(x) ≥ E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

where:

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] – reconstruction quality: how many nats we need to reconstruct x when someone gives us q(z|x)

KL(q_Φ(z|x) ∥ p(z)) – code transmission cost: how many nats we transmit about x by using q_Φ(z|x) rather than p(z)

Interpretation: do well at reconstructing 푥, limiting the amount of information about 푥 encoded in 푧.

The Variational Autoencoder

(Diagram: the encoder q maps x to a distribution over the latent code, q(z|x); KL(q(z|x) ∥ p(z)) ties it to the prior p(z); the decoder p maps samples z to p(x|z); the reconstruction term is E_{z~q(z|x)} [ log p(x|z) ].)

An input 푥 is put through the 푞 network to obtain a distribution over latent code 푧, 푞(푧|푥).

Samples z_1, …, z_k are drawn from q(z|x). Then k reconstructions p(x|z_k) are computed using the network p.

VAE is an Information Bottleneck

Each sample is represented as a Gaussian.

This discards information (the latent representation has low precision).

How to evaluate a VAE

Compute:

log p(x) ≥ E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] − KL(q_Φ(z|x) ∥ p(z))

KL(q_Φ(z|x) ∥ p(z)) has a closed form for simple q_Φ(z|x).

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] can be approximated:

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] ≈ (1/k) Σ_i log p_Θ(x|z_i), where the z_i are drawn from q_Φ(z|x)

How to train a VAE?

(Diagram: the forward computation and its derivatives.)

• The forward computation involves drawing samples
• Can’t backprop through sampling

Reparameterization exercise

Assume that 푞Φ 푧 푥 = 풩(μ푧, 휎푧).

Exercise: you can sample from 풩(0,1)

Q: how to draw samples from N(μ_z, σ_z)?

A: ε_i ~ N(0,1), z_i = μ_z + σ_z · ε_i

Reparametrization to the rescue

Assume that 푞Φ 푧 푥 = 풩(μ푧, 휎푧).

Then:

E_{z~q_Φ(z|x)} [ log p_Θ(x|z) ] = E_{ε~N(0,1)} [ log p_Θ(x | μ_z + σ_z · ε) ]

ε is drawn from a fixed distribution. With ε given, the computation graph is deterministic → we can backprop!
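A minimal sketch of reparameterized ELBO training, assuming PyTorch, a diagonal-Gaussian q, a standard-normal prior (so the KL term has a closed form) and a Bernoulli decoder; the architecture and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim = 16, 2
enc = nn.Linear(x_dim, 2 * z_dim)      # outputs [mu, log sigma^2]
dec = nn.Linear(z_dim, x_dim)          # outputs Bernoulli logits

def elbo(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    eps = torch.randn_like(mu)                     # eps ~ N(0, 1)
    z = mu + torch.exp(0.5 * logvar) * eps         # z = mu + sigma * eps
    rec = -F.binary_cross_entropy_with_logits(dec(z), x, reduction="none").sum(-1)
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form, per data point.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
    return (rec - kl).mean()                       # the lower bound to maximize

x = torch.rand(8, x_dim).round()                   # toy binary data
loss = -elbo(x)
loss.backward()                                    # gradients flow through mu and sigma
```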

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

Exercise

p(x) = uniform(0,1) = 1 for x ∈ [0,1], 0 otherwise
y = 2x + 1
p(y) = ?

Transforming distributions

y also has a uniform distribution!

p(y) = uniform(1,3) = 1/2 for y ∈ [1,3], 0 otherwise

E. Jang, https://blog.evjang.com/2018/01/nf1.html

The Jacobian: stretching space

Let y = f(x). Take a range (x, x + Δx).
Question: how long is this range? Δx
Question: how long is (f(x), f(x + Δx))?
f(x + Δx) ≈ f(x) + (∂f/∂x)·Δx, so the new length is approximately |∂f/∂x|·Δx
https://blog.evjang.com/2018/01/nf1.html

Change of variables

y = f(x), f is a bijection
p_x(x) = p_y(f(x)) · |det ∂f(x)/∂x|
f transforms x ± Δx/2 to y ± Δy/2; the space is stretched by Δy ≈ |∂f(x)/∂x| · Δx

Idea

Start with z ~ N(0,1). Then x = f^{-1}(z) (equivalently z = f(x)), and

p_x(x) = p_z(f(x)) · |det ∂f(x)/∂x|

Tractable when:
- f is easy to invert exactly (f and f^{-1} form an exact auto-encoder!)
- det ∂f(x)/∂x is easy to compute
⇒ We need a special f!

A special form of 푓

x_1, x_2 = split(x)
y_1 = x_1
y_2 = x_2 · s(x_1) + t(x_1)
Trivial to invert! ∂y/∂x is triangular (its diagonal holds only 1s and s(x_1)), so the determinant is easy!
L. Dinh et al., “NICE”; L. Dinh et al., “Real NVP”
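A NumPy sketch of one such coupling layer, its exact inverse and its log-determinant; s and t here are arbitrary stand-ins for the small neural networks used in NICE / Real NVP.

```python
import numpy as np

def s(x1): return np.tanh(x1) + 1.5          # positive scales, for simplicity
def t(x1): return 0.3 * x1

def forward(x):
    x1, x2 = np.split(x, 2)
    y1 = x1
    y2 = x2 * s(x1) + t(x1)
    logdet = np.sum(np.log(np.abs(s(x1))))   # triangular Jacobian: det = product of scales
    return np.concatenate([y1, y2]), logdet

def inverse(y):
    y1, y2 = np.split(y, 2)
    x1 = y1
    x2 = (y2 - t(x1)) / s(x1)                # trivial, exact inverse
    return np.concatenate([x1, x2])

x = np.random.default_rng(0).normal(size=6)
y, logdet = forward(x)
print(np.allclose(inverse(y), x), logdet)    # True -> exact reconstruction
```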

Image credit: Eric Jang, https://blog.evjang.com/2018/01/nf2.html

Normalizing flows in action

Kingma et al., “Glow”; Prenger et al., “WaveGlow”

Outline

• Why unsupervised learning • Autoregressive models – Intuitions – Dilated convolutions – RNNs (Intermezzo: LSTM training dynamics) – Transformers – Beyond sequences, summary • Latent variable models – VAE – Normalizing flows • Autoregressive + latent variable: why and how?

VAEs and sequential data

To encode a long sequence, we apply the VAE to chunks:

푧 푧 푧 푧 푧

But neighboring chunks are similar! We are encoding the same information in many 푧s! We are wasting capacity!

WaveNet + VAE

A WaveNet reconstructs the waveform using information from the past.

Latent representations are extracted at regular intervals.

The WaveNet uses information from:
1. the past recording
2. the latent vectors z
3. other conditioning, e.g. about the speaker

The encoder transmits in the z's only the information that is missing from the past recording. The whole system is a very low-bitrate codec (roughly 0.7 kbit/sec; the waveform is 16 kHz × 8 bit = 128 kbit/sec).

van den Oord et al., “Neural Discrete Representation Learning”

VAE + autoregressive models: latent collapse danger

• Purely autoregressive models: SOTA log-likelihoods
• Conditioning on latents: information passed through the bottleneck lowers the reconstruction cross-entropy
• A standard VAE model actively tries to:
– reduce information in the latents
– maximally use the autoregressive information
⇒ Collapse: the latents are not used!
• Solution: stop optimizing the KL term (free bits), make it a hyperparameter (VQ-VAE); a sketch of the VQ bottleneck follows
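A minimal sketch of a VQ-VAE-style bottleneck, assuming PyTorch: encoder outputs are snapped to the nearest prototype and a straight-through estimator passes gradients back to the encoder. The commitment and codebook losses of the full method are omitted, and all sizes are illustrative.

```python
import torch

def vq_bottleneck(z_e, codebook):
    """Quantize encoder outputs z_e (batch, d) to their nearest prototypes.
    Each latent then carries at most log2(K) bits."""
    dist = torch.cdist(z_e, codebook)              # (batch, K) pairwise distances
    idx = dist.argmin(dim=1)                       # index of the nearest prototype
    z_q = codebook[idx]
    # Straight-through estimator: forward pass uses z_q, gradient flows to z_e.
    return z_e + (z_q - z_e).detach(), idx

K, d = 64, 8                                       # 64 prototypes -> 6 bits per latent
codebook = torch.randn(K, d)
z_e = torch.randn(16, d, requires_grad=True)       # pretend encoder outputs
z_q, idx = vq_bottleneck(z_e, codebook)
z_q.sum().backward()                               # gradients reach z_e
print(idx[:5], z_e.grad.shape)
```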

Model description

WaveNet decoder conditioned on:
- latents extracted at 24–50 Hz
- the speaker

3 bottlenecks evaluated:
- AE: max 32 bits/dim
- VAE: KL(q(z|x) ∥ N(0,1)) nats (bits)
- VQ-VAE with K prototypes: log2(K) bits

Input: Waveforms, Mel Filterbanks, MFCCs

Hope: the speaker is separated from the content.

Phonemes vs Gender tradeoff

https://arxiv.org/abs/1901.08810

Summary

Model                           | Evaluate p(x) | Sample p(x)       | Extract latents | Control info in latents
Autoregressive                  | Exact & cheap | Exact & expensive | Impossible      | N/A
Latent var., VAE                | Lower bound   | Exact & cheap     | Easy            | No
Latent var., Norm. Flow         | Exact & cheap | Exact & cheap     | Easy            | No
Autoregressive cond. on latents | Lower bound   | Exact & expensive | Easy            | Yes

We will apply these ideas during JSALT’s topic “Distant supervision for representation learning”:
- Work on speech and handwriting
- Explore ways of integrating metadata and unlabeled data to control latent representations
- Focus on downstream supervised OCR and ASR tasks under low-data conditions

Thank you!

• Questions?

References – other tutorials

• https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf

Backup

ELBO: A lower bound on log p(x)

Let 푞(푧|푥) be any distribution. We can show that

log p(x) = KL(q(z|x) ∥ p(z|x)) + E_{z~q(z|x)} [ log ( p(x) p(z|x) / q(z|x) ) ]
         ≥ E_{z~q(z|x)} [ log ( p(x) p(z|x) / q(z|x) ) ]
         = E_{z~q(z|x)} [ log p(x|z) ] − KL(q(z|x) ∥ p(z))

The bound is tight for q(z|x) = p(z|x).

ELBO Derivation pt. 1 ELBO derivation pt. 2