Deep Learning Basics Lecture 8: Autoencoder & DBM
Princeton University COS 495
Instructor: Yingyu Liang

Autoencoder

• Neural networks trained to attempt to copy their input to their output

• Contain two parts:
  • Encoder: maps the input to a hidden representation
  • Decoder: maps the hidden representation to the output

Autoencoder

[Figure: autoencoder diagram, with input $x$, hidden representation (the code) $h$, and reconstruction $r$]

Autoencoder

[Figure: encoder $f(\cdot)$ maps the input $x$ to the code $h$; decoder $g(\cdot)$ maps $h$ to the reconstruction $r$]

$$h = f(x), \quad r = g(h) = g(f(x))$$
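To make the encoder/decoder structure concrete, here is a minimal sketch in PyTorch (my own illustration, not from the slides; the layer sizes and the ReLU activation are arbitrary choices):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: r = g(f(x)). Sizes are illustrative."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: input x -> hidden representation (the code) h
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder g: code h -> reconstruction r
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)   # h = f(x)
        r = self.decoder(h)   # r = g(h) = g(f(x))
        return r, h
```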

Why copy the input to the output?

• We do not really care about the copying itself

• Interesting case: the autoencoder is NOT able to copy exactly but strives to do so
• It is thereby forced to select which aspects of the input to preserve, and thus hopefully learns useful properties of the data

• Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994)

Undercomplete autoencoder

• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function $L(x, r) = L(x, g(f(x)))$

[Figure: undercomplete autoencoder, where the code $h$ has smaller dimension than the input $x$ and the reconstruction $r$]

• Special case: $f, g$ linear and $L$ the mean squared error
• Reduces to Principal Component Analysis
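A hedged sketch of one training step for this objective, assuming the `Autoencoder` module sketched earlier and mean squared error as $L$:

```python
import torch
import torch.nn as nn

def train_step(model, x, optimizer):
    """One gradient step on L(x, g(f(x))) with MSE reconstruction loss (illustrative)."""
    r, h = model(x)
    loss = nn.functional.mse_loss(r, x)   # L(x, r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With linear $f, g$ and mean squared error, the optimal encoder/decoder span the same subspace as PCA, which is the special case noted above.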

Undercomplete autoencoder

• What about nonlinear encoders and decoders?

• Capacity should not be too large

• Suppose we are given data $x_1, x_2, \ldots, x_n$
  • Encoder maps $x_i$ to $i$
  • Decoder maps $i$ back to $x_i$
  • A one-dimensional code $h$ suffices for perfect reconstruction, yet nothing useful about the data is learned

Regularization

• Typically NOT done by
  • keeping the encoder/decoder shallow, or
  • using a small code size

• Regularized autoencoder: add a regularization term that encourages the model to have other properties
  • Sparsity of the representation (sparse autoencoder)
  • Robustness to noise or to missing inputs (denoising autoencoder)
  • Smallness of the derivative of the representation

Sparse autoencoder

• Constrain the code to be sparse
• Training: minimize a loss function $L_R = L(x, g(f(x))) + R(h)$

[Figure: sparse autoencoder, with input $x$, sparse code $h$, and reconstruction $r$]

Probabilistic view of regularizing $h$

• Suppose we have a probabilistic model $p(h, x)$
• MLE on $x$:
  $$\max \log p(x) = \max \log \sum_{h'} p(h', x)$$
• Problem: hard to sum over $h'$
• Approximation: suppose $h = f(x)$ gives the most likely hidden representation, and $\sum_{h'} p(h', x)$ can be approximated by $p(h, x)$

Probabilistic view of regularizing $h$

• Suppose we have a probabilistic model $p(h, x)$
• Approximate MLE on $x$, with $h = f(x)$:
  $$\max \log p(h, x) = \max \big[ \log p(x \mid h) + \log p(h) \big]$$
  where the $\log p(x \mid h)$ term corresponds to the (negative of the) loss and the $\log p(h)$ term corresponds to the (negative of the) regularization
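As a worked illustration (my own concrete choices, not on the slide): a Gaussian reconstruction model and a Laplacian prior turn the two terms into a squared-error loss and an $\ell_1$ regularizer, which is exactly the sparse autoencoder on the next slide.

```latex
% Illustrative choices (assumptions of this sketch):
%   p(x | h) = N(x; g(h), I)                            (Gaussian reconstruction model)
%   p(h)     = prod_i (lambda/2) exp(-lambda |h_i|)     (Laplacian prior)
\begin{align*}
  -\log p(x \mid h) &= \tfrac{1}{2}\, \| x - g(h) \|_2^2 + \mathrm{const}
    && \text{(reconstruction loss } L\text{)} \\
  -\log p(h) &= \lambda \, \| h \|_1 + \mathrm{const}
    && \text{(sparsity regularization } R(h)\text{)}
\end{align*}
```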

Sparse autoencoder

• Constrain the code to be sparse
• Laplacian prior: $p(h) = \frac{\lambda}{2} \exp(-\lambda \|h\|_1)$
• Training: minimize a loss function
  $$L_R = L(x, g(f(x))) + \lambda \|h\|_1$$
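A minimal PyTorch sketch of this objective, assuming the `Autoencoder` module from earlier; the weight `lam` corresponds to $\lambda$ and its value is arbitrary:

```python
import torch
import torch.nn as nn

def sparse_ae_loss(model, x, lam=1e-3):
    """Reconstruction loss plus L1 penalty on the code: L(x, g(f(x))) + lam * ||h||_1."""
    r, h = model(x)
    recon = nn.functional.mse_loss(r, x)          # L(x, g(f(x)))
    sparsity = lam * h.abs().sum(dim=1).mean()    # lambda * ||h||_1, averaged over the batch
    return recon + sparsity
```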

Denoising autoencoder

• Traditional autoencoder: encouraged to learn $g(f(\cdot))$ to be the identity
• Denoising autoencoder: minimize a loss function
  $$L(x, r) = L(x, g(f(\tilde{x})))$$
  where $\tilde{x}$ is $x + \text{noise}$
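A sketch of the denoising objective; additive Gaussian corruption is just one common choice of noise, and `sigma` is an assumed noise level for this illustration:

```python
import torch
import torch.nn as nn

def denoising_ae_loss(model, x, sigma=0.1):
    """Corrupt the input, then reconstruct the clean target: L(x, g(f(x_tilde)))."""
    x_tilde = x + sigma * torch.randn_like(x)   # x_tilde = x + noise
    r, _ = model(x_tilde)
    return nn.functional.mse_loss(r, x)         # compare against the clean x
```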

Boltzmann machine

• Introduced by Ackley et al. (1985)
• A general “connectionist” approach to learning arbitrary distributions over binary vectors
• Special case of an energy model:
  $$p(x) = \frac{\exp(-E(x))}{Z}$$
• Boltzmann machine: energy model with
  $$E(x) = -x^T U x - b^T x$$
  where $U$ is the weight matrix and $b$ is the bias parameter
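A small NumPy illustration of this energy and the resulting unnormalized probability (toy size and random parameters; only the formula comes from the slide):

```python
import numpy as np

def energy(x, U, b):
    """Boltzmann machine energy E(x) = -x^T U x - b^T x for a binary vector x."""
    return -x @ U @ x - b @ x

# Unnormalized probability of one configuration (illustrative parameters)
rng = np.random.default_rng(0)
d = 4
U = rng.normal(size=(d, d)); U = (U + U.T) / 2   # symmetric weight matrix
b = rng.normal(size=d)
x = rng.integers(0, 2, size=d)                   # a binary vector
p_unnorm = np.exp(-energy(x, U, b))              # p(x) = p_unnorm / Z
```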

Boltzmann machine with latent variables

• Some variables are not observed
  $$x = (x_v, x_h), \quad x_v \text{ visible}, \; x_h \text{ hidden}$$
  $$E(x) = -x_v^T R x_v - x_v^T W x_h - x_h^T S x_h - b^T x_v - c^T x_h$$
• Universal approximator of probability mass functions

Maximum likelihood

• Suppose we are given data $X = \{x_v^{(1)}, x_v^{(2)}, \ldots, x_v^{(n)}\}$
• Maximum likelihood is to maximize
  $$\log p(X) = \sum_i \log p(x_v^{(i)})$$
  where
  $$p(x_v) = \sum_{x_h} p(x_v, x_h) = \frac{1}{Z} \sum_{x_h} \exp(-E(x_v, x_h))$$
• $Z = \sum_{x_v, x_h} \exp(-E(x_v, x_h))$: the partition function, difficult to compute
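A brute-force NumPy sketch of why this is difficult: both the marginal $p(x_v)$ and $Z$ require sums over exponentially many binary configurations, so the code below is only feasible at toy sizes. The energy follows the latent-variable form above; all names are illustrative.

```python
import itertools
import numpy as np

def energy(xv, xh, R, W, S, b, c):
    """E(x) = -xv^T R xv - xv^T W xh - xh^T S xh - b^T xv - c^T xh."""
    return -(xv @ R @ xv + xv @ W @ xh + xh @ S @ xh + b @ xv + c @ xh)

def log_marginal(xv, R, W, S, b, c):
    """log p(x_v): sums over all 2^{d_h} hidden states, and over 2^{d_v + d_h} states for Z."""
    dv, dh = len(b), len(c)
    states_h = [np.array(s) for s in itertools.product([0, 1], repeat=dh)]
    states_v = [np.array(s) for s in itertools.product([0, 1], repeat=dv)]
    unnorm = sum(np.exp(-energy(xv, xh, R, W, S, b, c)) for xh in states_h)
    Z = sum(np.exp(-energy(v, h, R, W, S, b, c)) for v in states_v for h in states_h)
    return np.log(unnorm) - np.log(Z)
```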

Restricted Boltzmann machine

• Invented under the name harmonium (Smolensky, 1986)
• Popularized by Hinton and collaborators as the Restricted Boltzmann machine

Restricted Boltzmann machine

• Special case of a Boltzmann machine with latent variables:
  $$p(v, h) = \frac{\exp(-E(v, h))}{Z}$$
  where the energy function is
  $$E(v, h) = -v^T W h - b^T v - c^T h$$
  with weight matrix $W$ and biases $b, c$
• Partition function
  $$Z = \sum_v \sum_h \exp(-E(v, h))$$

Restricted Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]

Restricted Boltzmann machine

• The conditional distribution is factorial:
  $$p(h \mid v) = \frac{p(v, h)}{p(v)} = \prod_j p(h_j \mid v)$$
  and
  $$p(h_j = 1 \mid v) = \sigma(c_j + v^T W_{:,j})$$
  where $\sigma$ is the logistic function

Restricted Boltzmann machine

• Similarly,
  $$p(v \mid h) = \frac{p(v, h)}{p(h)} = \prod_i p(v_i \mid h)$$
  and
  $$p(v_i = 1 \mid h) = \sigma(b_i + W_{i,:} h)$$
  where $\sigma$ is the logistic function
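One standard use of these factorial conditionals is block Gibbs sampling, where each layer is sampled in one shot given the other. A minimal NumPy sketch (shapes and names are assumptions of this illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, b, c, rng):
    """One block-Gibbs sweep: sample h ~ p(h|v), then v ~ p(v|h)."""
    p_h = sigmoid(c + v @ W)            # p(h_j = 1 | v) = sigma(c_j + v^T W[:, j])
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)            # p(v_i = 1 | h) = sigma(b_i + W[i, :] h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```

Iterating this step produces a Markov chain that approximately samples from $p(v, h)$.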

Deep Boltzmann machine

• Special case of an energy model. Take 3 hidden layers and ignore the biases:
  $$p(v, h^1, h^2, h^3) = \frac{\exp(-E(v, h^1, h^2, h^3))}{Z}$$
• Energy function
  $$E(v, h^1, h^2, h^3) = -v^T W^1 h^1 - (h^1)^T W^2 h^2 - (h^2)^T W^3 h^3$$
  with weight matrices $W^1, W^2, W^3$
• Partition function
  $$Z = \sum_{v, h^1, h^2, h^3} \exp(-E(v, h^1, h^2, h^3))$$
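A direct NumPy transcription of this energy function (bias-free, as on the slide; everything else is illustrative):

```python
import numpy as np

def dbm_energy(v, h1, h2, h3, W1, W2, W3):
    """E(v, h1, h2, h3) = -v^T W1 h1 - h1^T W2 h2 - h2^T W3 h3 (biases ignored, as on the slide)."""
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)
```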

Deep Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]