
Advanced ML: Unsupervised learning with autoregressive and latent variable models
https://github.com/janchorowski/JSALT2019_tutorials
Jan Chorowski, University of Wrocław
With slides from TMLSS 2018 (Jan Chorowski, Ulrich Paquet)

Outline
• Why unsupervised learning
• Autoregressive models
  – Intuitions
  – Dilated convolutions
  – RNNs (intermezzo: LSTM training dynamics)
  – Transformers
  – Beyond sequences, summary
• Latent variable models
  – VAE
  – Normalizing flows
• Autoregressive + latent variable: why and how?

Deep Model = Hierarchy of Concepts
[figure: learned feature hierarchies for classes such as Cat, Dog, …, Moon, Banana]
M. Zeiler, "Visualizing and Understanding Convolutional Networks"

Deep Learning history: 1986
"'hidden' units which are not part of the input or output come to represent important features of the task domain"

Deep Learning history: 2006
Stacked RBMs
Hinton, Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks"

Deep Learning history: 2012
AlexNet: state of the art on ImageNet, fully supervised training.

Deep Learning Recipe
1. Get a massive, labeled dataset D = {(x, y)}:
   – Computer vision: ImageNet, 1M images
   – Machine translation: Europarl, CommonCrawl, several million sentence pairs
   – Speech recognition: 1000 h (LibriSpeech), 12000 h (Google Voice Search)
   – Question answering: SQuAD, 150k questions with human answers
   – …
2. Train the model to maximize log p(y|x)

Value of Labeled Data
• Labeled data is crucial for deep learning
• But labels carry little information:
  – Example: an ImageNet model has 30M weights, but ImageNet is about 1M images from 1000 classes
    Labels: 1M × 10 bits = 10 Mbits
    Raw data (128 × 128 images): ca. 500 Gbits!

Value of Unlabeled Data
"The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second." — Geoff Hinton
https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/

Unsupervised learning recipe
1. Get a massive unlabeled dataset D = {x} — easy, unlabeled data is nearly free
2. Train the model to…??? What is the task? What is the loss function?

Unsupervised learning goals
• Learn a data representation:
  – Extract features for downstream tasks
  – Describe the data (clusterings)
• Data generation / density estimation:
  – What are my data
  – Outlier detection
  – Super-resolution
  – Artifact removal
  – Compression

2 minute exercise
Talk to the person next to you and tell them everything you can about this data set:
[figure: a data matrix; the rows are correlated — it admits a compact latent representation]

Outline: Autoregressive models

Learning high dimensional distributions is hard
• Assume we work with small (32 × 32) images
• Each data point is a real vector of size 32 × 32 × 3
• Data occupies only a tiny fraction of ℝ^(32×32×3)
• Impossible to directly fit a standard probability distribution!

Divide and conquer
What is easier:
1. Writing a whole paragraph at once?
2. Guessing the next word?

Mary had a little
Mary had a little dog.
Mary had a little dog. It
Mary had a little dog. It was
Mary had a little dog. It was cute
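The "guess the next word" game is easy to make concrete in code. Below is a minimal sketch of autoregressive sampling; the `next_word_probs` table is a toy, hand-written stand-in for a learned model of p(x_t | x_1, …, x_{t-1}) (here it conditions only on the previous word), and the names are illustrative, not from the slides.

```python
import random

# Toy stand-in for a learned conditional model p(x_t | x_1, ..., x_{t-1}).
# Here the "model" only looks at the previous word (a bigram table).
next_word_probs = {
    "<SOS>":  {"Mary": 1.0},
    "Mary":   {"had": 1.0},
    "had":    {"a": 1.0},
    "a":      {"little": 1.0},
    "little": {"dog.": 0.7, "lamb.": 0.3},
    "dog.":   {"It": 0.6, "<EOS>": 0.4},
    "lamb.":  {"It": 0.6, "<EOS>": 0.4},
    "It":     {"was": 1.0},
    "was":    {"cute": 1.0},
    "cute":   {"<EOS>": 1.0},
}

def sample_sentence():
    """Generate one word at a time, each conditioned on the prefix so far."""
    words = ["<SOS>"]
    while words[-1] != "<EOS>":
        probs = next_word_probs[words[-1]]
        words.append(random.choices(list(probs), weights=list(probs.values()))[0])
    return " ".join(words[1:-1])

print(sample_sentence())   # e.g. "Mary had a little dog. It was cute"
```

A real model replaces the lookup table with a neural network; the sampling loop itself does not change.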
Chain rule
Decompose the probability of an n-dimensional data point into a product of n conditional probabilities:
p(x) = p(x_1, x_2, …, x_n)
     = p(x_1) p(x_2|x_1) ⋯ p(x_n|x_1, x_2, …, x_{n-1})
     = ∏_t p(x_t | x_1, x_2, …, x_{t-1})

Learning p(x_t | x_1, …, x_{t-1})
The good: it looks like supervised learning!
The bad: we need to handle variable-sized input.

Large n → sparse data
Consider an n-gram LM:
p(x_t | x_{t-k+1}, …, x_{t-1}) = #(x_{t-k:t}) / #(x_{t-k:t-1})
As n → ∞, n-grams become unique:
– the probability of all but one continuation is 0
– we can't model long contexts
Solution: use a representation in which "llama ≈ animal"

Learning p(x_t | x_1, …, x_{t-1}) #1
Assume the distant past doesn't matter:
p(x_t | x_1, …, x_{t-1}) ≈ p(x_t | x_{t-k+1}, …, x_{t-1})
Quite popular:
• n-gram text models (estimate p by counting)
• FIR filters
How to efficiently handle large k?

Neural LM
Compute p(x_t | x_{t-k+1}, …, x_{t-1}) with a neural net.
[figure: table lookup of word representations, concatenation, an arbitrary feature-extractor sub-graph, softmax]
Looks somewhat like a convolution. Still: large "n" ⇒ many parameters — how to fix this?
Y. Bengio et al., "A Neural Probabilistic Language Model", NIPS 2000

Dilated Convolutions
Large (but finite) receptive fields, few parameters.
WaveNet: https://arxiv.org/abs/1609.03499
ByteNet: https://arxiv.org/abs/1610.10099

Outline: Autoregressive models – RNNs (intermezzo: LSTM training dynamics)

Learning p(x_t | x_1, …, x_{t-1}) #2
Introduce a hidden (i.e. not directly observed) state h_t that summarizes x_1, …, x_{t-1}.
Compute h_t using this recurrence:
h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)

Recurrences are powerful
Input: a sequence of bits 1, 0, 1, 0, 0, 1
Output: the parity
Solution: the hidden state is just 1 bit — the parity so far:
h_0 = 0
h_t = XOR(h_{t-1}, x_t)
y_T = h_T
Constant operating memory! Works for arbitrarily long sequences!

Recurrent Neural Networks (RNNs)
h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)
f and g are implemented as feedforward neural nets.

RNNs naturally handle sequences
Training: unroll and supervise at every step:
p(had | Mary), p(a | …), p(little | …), …
Mary had a …

RNNs naturally handle sequences
Generation: sample one element at a time, starting from <SOS>, with x_1 sampled from p(x_1):
Mary had a …

RNNs are dynamical systems
An unrolled RNN is a very deep network in which all "arrows" compute the same function f.
This impacts training!

RNN behavior – intuitions
A linear dynamical system computes
h_t = w · h_{t-1} = w^t h_0
When:
1. w > 1 it diverges: lim_{t→∞} h_t = sign(h_0) · ∞
2. w = 1 it is constant: lim_{t→∞} h_t = h_0
3. −1 < w < 1 it decays to 0: lim_{t→∞} h_t = 0
4. w = −1 it flips between ±h_0
5. w < −1 it diverges, changing sign at each step

A multidimensional linear dyn. system
h_t ∈ ℝ^n, W ∈ ℝ^{n×n}
h_t = W h_{t-1}
Compute the eigendecomposition of W:
W = Q Λ Q^{-1} = Q diag(λ_11, …, λ_nn) Q^{-1}
Then
h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0 = Q diag(λ_11^t, …, λ_nn^t) Q^{-1} h_0

A multidimensional dyn. system cont.
h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0
If the largest eigenvalue has norm:
1. >1, the system diverges
2. =1, the system is stable or oscillates
3. <1, the system decays to 0
This is similar to the scalar case.

A nonlinear dynamical system
h_t = tanh(w · h_{t-1})
The output is bounded (it can't diverge!).
If w > 1 there are 3 fixed points: 0, ±h_f. Starting the iteration from h_0 ≠ 0, it ends at ±h_f (effectively remembering the sign of h_0).
If w ≤ 1 there is one fixed point: 0.
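This latching behavior is easy to check numerically. A minimal NumPy sketch (the specific values of w, h_0, and the step count are arbitrary illustrative choices):

```python
import numpy as np

def iterate(w, h0, steps=50):
    """Iterate the scalar system h_t = tanh(w * h_{t-1})."""
    h = h0
    for _ in range(steps):
        h = np.tanh(w * h)
    return h

# w > 1: two attracting fixed points +/- h_f; the sign of h_0 is remembered.
print(iterate(2.0, h0=+0.01))   # -> approx +0.957 (= +h_f for w = 2)
print(iterate(2.0, h0=-0.01))   # -> approx -0.957 (= -h_f)

# w <= 1: the only fixed point is 0; any initial state decays towards it.
print(iterate(0.9, h0=+0.5))    # -> close to 0
```

This is the mechanism by which a saturating recurrence can store one bit (the sign of the initial state); the next slides look at what the same saturation does to gradients.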
A nonlinear dynamical system
[figure: iterates of h_t = tanh(w · h_{t-1}) for w > 1]

Gradients in RNNs
Recall the RNN equations:
h_t = f(h_{t-1}, x_t; θ)
y_t = g(h_t)
Assume supervision only at the last step: L = e(y_T).
The gradient is
∂L/∂θ = (∂L/∂h_T) ∑_t (∂h_T/∂h_t)(∂h_t/∂θ)

Trouble with ∂h_T/∂h_t
∂h_T/∂h_t measures how much h_T changes when h_t changes.
[figure: the effect on h_T of perturbing h_t]

Another look at ∂h_T/∂h_t
Let h_t = tanh(w · h_{t-1}). Then:
∂h_t/∂h_0 = (∂h_t/∂h_{t-1})(∂h_{t-1}/∂h_{t-2}) ⋯ (∂h_1/∂h_0)
∂h_{t+1}/∂h_t = tanh′(w h_t) · w
∂h_t/∂h_0 = ∏_{τ=1}^{t} tanh′(w h_{τ-1}) · w = w^t ∏_{τ=1}^{t} tanh′(w h_{τ-1})
(when a factor tanh′(·) saturates to 0, the whole product vanishes)
The backward pass is linear!

Vanishing gradient
If ∂h_T/∂h_t = 0, the network forgets all information from time t when it reaches time T.
This makes it impossible to discover correlations between inputs at distant time steps.
This is a modeling problem.

Exploding gradient
When ∂h_T/∂h_t is large, ∂Loss/∂θ will be large too.
Making a gradient step θ ← θ − α ∂Loss/∂θ can then drastically change the network, or even destroy it.
This is not a problem of information flow, but of training stability!

Exploding gradient intuition #1
[figure: iterates of h_t = tanh(2 h_{t-1} + b)]

Exploding gradient intuition #2
h_0 = σ(0.5), h_t = σ(w h_{t-1} + b), L = (h_50 − 0.7)^2
[figure: the loss surface over (w, b) has a steep cliff]
R. Pascanu et al., https://arxiv.org/abs/1211.5063

Summary: trouble with ∂h_T/∂h_t
RNNs are difficult to train because:
• ∂h_T/∂h_t can be 0 — vanishing gradient
• ∂h_T/∂h_t can be ∞ — exploding gradient
Both phenomena are governed by the spectral radius of the weight matrix (the magnitude of its largest eigenvalue; |w| in the scalar case).

Exploding gradient solution
Don't take large steps: pick a gradient-norm threshold and scale down all larger gradients (gradient clipping).
This prevents the model from making a huge update and destroying itself.

Vanishing gradient solution: LSTM, the RNN workhorse
∂h_T/∂h_t ≈ 0 means that step T forgets the information from step t.
The dynamical system h_t = w h_{t-1} maximally preserves information when w = 1.
The LSTM introduces a memory cell c_t that can keep information forever:
c_t = c_{t-1}

Memory cell
The memory cell preserves information:
c_t = c_{t-1}  ⇒  ∂c_T/∂c_t = 1

Gates
[figure: the gate σ(W_xi x_t + ⋯) multiplies tanh(W_xc x_t + ⋯) before it is added to the cell c_t]
Gates selectively load information into the memory cell:
i_t = σ(W_xi x_t + ⋯)
c_t = c_{t-1} + i_t · tanh(W_xc x_t + ⋯)

LSTM: the details
The hidden state is a pair:
– c_t: information in the cell, hidden from the rest of the network
– h_t: information extracted from the cell into the network
Update equations:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c_t = i_t · tanh(W_xc x_t + W_hc h_{t-1} + b_c) + f_t · c_{t-1}
h_t = o_t · tanh(c_t)

RNNs summary
LSTM/GRU
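To make the LSTM update equations above concrete, here is a minimal NumPy sketch of a single LSTM step. The function and parameter names, the weight shapes, and the random initialization are illustrative choices, not part of the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the update equations above."""
    Wxi, Whi, bi, Wxf, Whf, bf, Wxo, Who, bo, Wxc, Whc, bc = params
    i_t = sigmoid(Wxi @ x_t + Whi @ h_prev + bi)      # input gate
    f_t = sigmoid(Wxf @ x_t + Whf @ h_prev + bf)      # forget gate
    o_t = sigmoid(Wxo @ x_t + Who @ h_prev + bo)      # output gate
    c_t = i_t * np.tanh(Wxc @ x_t + Whc @ h_prev + bc) + f_t * c_prev
    h_t = o_t * np.tanh(c_t)                          # state exposed to the network
    return h_t, c_t

# Tiny usage example with random weights (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = [rng.normal(scale=0.1, size=s) for s in
          [(n_hid, n_in), (n_hid, n_hid), (n_hid,)] * 4]
h = c = np.zeros(n_hid)
for t in range(5):                      # run over a short random input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, params)
print(h.shape, c.shape)                 # (8,) (8,)
```

Note how the cell update c_t = i_t · tanh(…) + f_t · c_{t-1} is additive: with f_t ≈ 1 the cell carries information (and gradient) across many steps, which is the fix for the vanishing-gradient problem described above.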