Deep Learning Basics Lecture 8: Autoencoder & DBM
Princeton University COS 495
Instructor: Yingyu Liang

Autoencoder

• Neural networks trained to copy their input to their output
• Contain two parts:
  • Encoder: maps the input to a hidden representation
  • Decoder: maps the hidden representation to the output
[Diagram: input x → encoder f(⋅) → hidden representation h (the code) → decoder g(⋅) → reconstruction r]

h = f(x),   r = g(h) = g(f(x))
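The composition above can be sketched in numpy. The one-layer sigmoid encoder, the linear decoder, the sizes, and the random (untrained) weights are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_code = 8, 3                      # input and code dimensions (arbitrary)
W_enc = 0.1 * rng.normal(size=(d_code, d_in))
W_dec = 0.1 * rng.normal(size=(d_in, d_code))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    """Encoder: map input x to the code h."""
    return sigmoid(W_enc @ x)

def g(h):
    """Decoder: map the code h to a reconstruction r."""
    return W_dec @ h

x = rng.normal(size=d_in)
h = f(x)                                 # hidden representation (the code)
r = g(h)                                 # reconstruction r = g(f(x))
print(h.shape, r.shape)                  # (3,) (8,)
```

Training would then adjust W_enc and W_dec so that r stays close to x.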
Why copy input to output?

• We do not actually care about the copying itself
• Interesting case: the autoencoder is NOT able to copy exactly but strives to do so
• Forced to select which aspects of the input to preserve, it can hopefully learn useful properties of the data
• Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994)

Undercomplete autoencoder
• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function L(x, r) = L(x, g(f(x)))

[Diagram: x → narrower code h → reconstruction r]

• Special case: f, g linear and L the mean squared error
• Reduces to Principal Component Analysis
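A quick numerical check of the PCA connection, assuming centered data and squared error. The dataset and the code size k below are arbitrary; the optimal linear autoencoder reconstruction is exactly the projection onto the top-k principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                  # center the data
k = 2                                   # code dimension (arbitrary)

# PCA via SVD: rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                          # top-k directions, shape (5, k)

def f(x):                               # linear encoder
    return V_k.T @ x

def g(h):                               # linear decoder
    return V_k @ h

# Reconstructing every point through the k-dim code projects onto the
# principal subspace, which is the best rank-k fit in squared error
# (Eckart-Young), i.e. what the optimal linear autoencoder computes.
R = (X @ V_k) @ V_k.T
best_rank_k = (U[:, :k] * s[:k]) @ Vt[:k]
print(np.allclose(R, best_rank_k))      # True
```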
• What about a nonlinear encoder and decoder?
• Capacity should not be too large
• Suppose we are given data x_1, x_2, …, x_n
  • The encoder maps x_i to the index i
  • The decoder maps the index i back to x_i
  • A one-dimensional h then suffices for perfect reconstruction, yet nothing useful about the data is learned
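The capacity pathology above can be made concrete. The tiny made-up dataset below is purely illustrative:

```python
# Tiny made-up dataset standing in for x_1, x_2, x_3.
data = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.1]]

def f(x):
    """'Encoder' with too much capacity: just memorize the index i of x_i."""
    return data.index(x)

def g(i):
    """'Decoder': look the point back up from its index."""
    return data[i]

# Perfect reconstruction with a one-dimensional code...
print(all(g(f(x)) == x for x in data))  # True
# ...yet nothing about the structure of the data has been learned:
# f is undefined for any x outside the training set.
```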
Regularization

• Typically NOT achieved by:
  • Keeping the encoder/decoder shallow, or
  • Using a small code size
• Regularized autoencoders: add a regularization term that encourages the model to have other properties
  • Sparsity of the representation (sparse autoencoder)
  • Robustness to noise or to missing inputs (denoising autoencoder)
  • Smallness of the derivative of the representation

Sparse autoencoder
• Constrain the code to be sparse
• Training: minimize a loss function L_R = L(x, g(f(x))) + R(h)

[Diagram: x → code h → reconstruction r]

Probabilistic view of regularizing h
• Suppose we have a probabilistic model p(h, x)
• MLE on x:
  max log p(x) = max log Σ_{h'} p(h', x)
• Hard to sum over h'
• Approximation: suppose h = f(x) gives the most likely hidden representation, and Σ_{h'} p(h', x) can be approximated by p(h, x)
• Approximate MLE on x, with h = f(x):
  max log p(h, x) = max [ log p(x|h) + log p(h) ]
• Here −log p(x|h) plays the role of the loss and −log p(h) the regularization

Sparse autoencoder
• Constrain the code to be sparse
• Laplacian prior: p(h) = (λ/2) exp(−λ ‖h‖_1), so −log p(h) = λ ‖h‖_1 + const
• Training: minimize the loss function
  L_R = L(x, g(f(x))) + λ ‖h‖_1
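The regularized objective above can be sketched directly. The architecture (one tanh layer, linear decoder), the sizes, and the random untrained weights are illustrative assumptions; only the form of the loss matters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code = 6, 4                      # sizes are arbitrary
W_enc = 0.1 * rng.normal(size=(d_code, d_in))
W_dec = 0.1 * rng.normal(size=(d_in, d_code))

def regularized_loss(x, lam):
    h = np.tanh(W_enc @ x)                  # code h = f(x)
    r = W_dec @ h                           # reconstruction g(f(x))
    recon = np.sum((x - r) ** 2)            # L(x, g(f(x))), squared error
    return recon + lam * np.sum(np.abs(h))  # + lambda * ||h||_1

x = rng.normal(size=d_in)
print(regularized_loss(x, 0.0) <= regularized_loss(x, 0.1))  # True
```

The L1 penalty only adds a nonnegative term, so minimizing it pushes coordinates of h toward exactly zero, i.e. toward a sparse code.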
Denoising autoencoder

• Traditional autoencoder: encourages g(f(⋅)) to be the identity
• Denoising autoencoder: minimize a loss function L(x, r) = L(x, g(f(x̃))), where x̃ is x + noise
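The denoising setup can be sketched as follows: corrupt x, encode and decode the corrupted copy, but measure the loss against the CLEAN input, so g(f(⋅)) must undo the corruption rather than act as the identity. The architecture and Gaussian noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W_enc = 0.1 * rng.normal(size=(d, d))
W_dec = 0.1 * rng.normal(size=(d, d))

def denoising_loss(x, sigma=0.1):
    x_tilde = x + sigma * rng.normal(size=x.shape)  # corrupt: x~ = x + noise
    h = np.tanh(W_enc @ x_tilde)                    # h = f(x~)
    r = W_dec @ h                                   # r = g(h)
    return np.sum((x - r) ** 2)                     # loss against the CLEAN x

x = rng.normal(size=d)
loss = denoising_loss(x)
print(loss >= 0.0)                                  # True
```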
Boltzmann machine

• Introduced by Ackley et al. (1985)
• General "connectionist" approach to learning arbitrary probability distributions over binary vectors
• Special case of energy model: p(x) = exp(−E(x)) / Z
• Energy model: p(x) = exp(−E(x)) / Z
• Boltzmann machine: special case of energy model with
  E(x) = −x^T U x − b^T x
  where U is the weight matrix and b is the bias parameter
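For a tiny dimension the model can be instantiated exactly, which also shows why the partition function is the bottleneck: computing Z requires summing over all 2^d binary states. The weights U and b below are arbitrary:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d = 4                                    # tiny, so 2^d states are enumerable
A = 0.1 * rng.normal(size=(d, d))
U = (A + A.T) / 2                        # symmetric weight matrix
b = 0.1 * rng.normal(size=d)             # bias parameter

def energy(x):
    return -x @ U @ x - b @ x            # E(x) = -x^T U x - b^T x

# Brute-force partition function over all binary vectors (intractable
# beyond small d, which is the central difficulty with these models).
states = [np.array(s, dtype=float) for s in product([0, 1], repeat=d)]
Z = sum(np.exp(-energy(x)) for x in states)
probs = [np.exp(-energy(x)) / Z for x in states]
print(np.isclose(sum(probs), 1.0))       # True
```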
Boltzmann machine with latent variables

• Some variables are not observed:
  x = (x_v, x_h),  x_v visible, x_h hidden
  E(x) = −x_v^T R x_v − x_v^T W x_h − x_h^T S x_h − b^T x_v − c^T x_h
• Universal approximator of probability mass functions

Maximum likelihood
• Suppose we are given data X = {x_v^(1), x_v^(2), …, x_v^(n)}
• Maximum likelihood maximizes
  log p(X) = Σ_i log p(x_v^(i))
  where
  p(x_v) = Σ_{x_h} p(x_v, x_h) = (1/Z) Σ_{x_h} exp(−E(x_v, x_h))
• Z = Σ_{x_v, x_h} exp(−E(x_v, x_h)): the partition function, difficult to compute

Restricted Boltzmann machine
• Invented under the name harmonium (Smolensky, 1986)
• Popularized under the name restricted Boltzmann machine by Hinton and collaborators
• Special case of Boltzmann machine with latent variables:
  p(v, h) = exp(−E(v, h)) / Z
  where the energy function is
  E(v, h) = −v^T W h − b^T v − c^T h
  with weight matrix W and biases b, c
• Partition function: Z = Σ_v Σ_h exp(−E(v, h))
Figure from Deep Learning, Goodfellow, Bengio and Courville
• The conditional distributions are factorial:
  p(h|v) = p(v, h) / p(v) = Π_j p(h_j | v)
  with p(h_j = 1 | v) = σ(c_j + v^T W_{:,j}), where σ is the logistic function
• Similarly,
  p(v|h) = p(v, h) / p(h) = Π_i p(v_i | h)
  with p(v_i = 1 | h) = σ(b_i + W_{i,:} h)
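Because both conditionals factorize, each layer can be sampled in one shot given the other, which is what makes block Gibbs sampling in an RBM cheap. A sketch with arbitrary sizes and random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 3                          # sizes and weights are arbitrary
W = 0.1 * rng.normal(size=(n_v, n_h))
b = np.zeros(n_v)                        # visible bias
c = np.zeros(n_h)                        # hidden bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v):
    p_h = sigmoid(c + v @ W)                       # p(h_j = 1 | v), all j at once
    h = (rng.random(n_h) < p_h).astype(float)      # sample the whole hidden layer
    p_v = sigmoid(b + W @ h)                       # p(v_i = 1 | h), all i at once
    v_new = (rng.random(n_v) < p_v).astype(float)  # sample the visible layer
    return v_new, h

v = rng.integers(0, 2, size=n_v).astype(float)
v, h = gibbs_step(v)
print(v.shape, h.shape)                  # (6,) (3,)
```

Iterating gibbs_step yields a Markov chain whose stationary distribution is p(v, h); such chains are the workhorse of RBM training.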
Deep Boltzmann machine

• Special case of energy model. Take 3 hidden layers and ignore the biases:
  p(v, h^1, h^2, h^3) = exp(−E(v, h^1, h^2, h^3)) / Z
• Energy function:
  E(v, h^1, h^2, h^3) = −v^T W^1 h^1 − (h^1)^T W^2 h^2 − (h^2)^T W^3 h^3
  with weight matrices W^1, W^2, W^3
• Partition function: Z = Σ_{v, h^1, h^2, h^3} exp(−E(v, h^1, h^2, h^3))
Figure from Deep Learning, Goodfellow, Bengio and Courville