Deep Learning Basics Lecture 8: Autoencoder & DBM
Princeton University COS 495
Instructor: Yingyu Liang

Autoencoder

• Neural networks trained to attempt to copy their input to their output

• Contain two parts:
  • Encoder: maps the input to a hidden representation
  • Decoder: maps the hidden representation to the output

Autoencoder

[Figure: autoencoder diagram, with input $x$, hidden representation (the code) $h$, and reconstruction $r$]

Autoencoder

[Figure: encoder $f(\cdot)$ maps the input $x$ to the code $h$; decoder $g(\cdot)$ maps $h$ to the reconstruction $r$]

$$h = f(x), \quad r = g(h) = g(f(x))$$
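To make the encoder/decoder structure concrete, here is a minimal sketch in PyTorch (my own illustration, not from the slides; the layer sizes and the ReLU activation are arbitrary choices):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: r = g(f(x)). Sizes are illustrative."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: input x -> hidden representation (the code) h
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder g: code h -> reconstruction r
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)   # h = f(x)
        r = self.decoder(h)   # r = g(h) = g(f(x))
        return r, h
```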

Why copy the input to the output?

• We do not really care about the copying itself

• Interesting case: the autoencoder is NOT able to copy exactly but strives to do so
• It is thereby forced to select which aspects of the input to preserve, and thus hopefully learns useful properties of the data

• Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994)

Undercomplete autoencoder

• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function $L(x, r) = L(x, g(f(x)))$

[Figure: undercomplete autoencoder, where the code $h$ has smaller dimension than the input $x$ and the reconstruction $r$]

• Special case: $f, g$ linear and $L$ the mean squared error
• Reduces to Principal Component Analysis
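A hedged sketch of one training step for this objective, assuming the `Autoencoder` module sketched earlier and mean squared error as $L$:

```python
import torch
import torch.nn as nn

def train_step(model, x, optimizer):
    """One gradient step on L(x, g(f(x))) with MSE reconstruction loss (illustrative)."""
    r, h = model(x)
    loss = nn.functional.mse_loss(r, x)   # L(x, r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With linear $f, g$ and mean squared error, the optimal encoder/decoder span the same subspace as PCA, which is the special case noted above.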

Undercomplete autoencoder

• What about nonlinear encoders and decoders?

• Capacity should not be too large

• Suppose we are given data $x_1, x_2, \ldots, x_n$
  • Encoder maps $x_i$ to $i$
  • Decoder maps $i$ back to $x_i$
  • A one-dimensional code $h$ suffices for perfect reconstruction, yet nothing useful about the data is learned

Regularization

• Typically NOT done by
  • keeping the encoder/decoder shallow, or
  • using a small code size

• Regularized autoencoder: add a regularization term that encourages the model to have other properties
  • Sparsity of the representation (sparse autoencoder)
  • Robustness to noise or to missing inputs (denoising autoencoder)
  • Smallness of the derivative of the representation

Sparse autoencoder

• Constrain the code to be sparse
• Training: minimize a loss function $L_R = L(x, g(f(x))) + R(h)$

[Figure: sparse autoencoder, with input $x$, sparse code $h$, and reconstruction $r$]

Probabilistic view of regularizing $h$

• Suppose we have a probabilistic model $p(h, x)$
• MLE on $x$:
  $$\max \log p(x) = \max \log \sum_{h'} p(h', x)$$
• Problem: hard to sum over $h'$
• Approximation: suppose $h = f(x)$ gives the most likely hidden representation, and $\sum_{h'} p(h', x)$ can be approximated by $p(h, x)$

Probabilistic view of regularizing $h$

• Suppose we have a probabilistic model $p(h, x)$
• Approximate MLE on $x$, with $h = f(x)$:
  $$\max \log p(h, x) = \max \big[ \log p(x \mid h) + \log p(h) \big]$$
  where the $\log p(x \mid h)$ term corresponds to the (negative of the) loss and the $\log p(h)$ term corresponds to the (negative of the) regularization
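As a worked illustration (my own concrete choices, not on the slide): a Gaussian reconstruction model and a Laplacian prior turn the two terms into a squared-error loss and an $\ell_1$ regularizer, which is exactly the sparse autoencoder on the next slide.

```latex
% Illustrative choices (assumptions of this sketch):
%   p(x | h) = N(x; g(h), I)                            (Gaussian reconstruction model)
%   p(h)     = prod_i (lambda/2) exp(-lambda |h_i|)     (Laplacian prior)
\begin{align*}
  -\log p(x \mid h) &= \tfrac{1}{2}\, \| x - g(h) \|_2^2 + \mathrm{const}
    && \text{(reconstruction loss } L\text{)} \\
  -\log p(h) &= \lambda \, \| h \|_1 + \mathrm{const}
    && \text{(sparsity regularization } R(h)\text{)}
\end{align*}
```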

Sparse autoencoder

• Constrain the code to be sparse
• Laplacian prior: $p(h) = \frac{\lambda}{2} \exp(-\lambda \|h\|_1)$
• Training: minimize a loss function
  $$L_R = L(x, g(f(x))) + \lambda \|h\|_1$$
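A minimal PyTorch sketch of this objective, assuming the `Autoencoder` module from earlier; the weight `lam` corresponds to $\lambda$ and its value is arbitrary:

```python
import torch
import torch.nn as nn

def sparse_ae_loss(model, x, lam=1e-3):
    """Reconstruction loss plus L1 penalty on the code: L(x, g(f(x))) + lam * ||h||_1."""
    r, h = model(x)
    recon = nn.functional.mse_loss(r, x)          # L(x, g(f(x)))
    sparsity = lam * h.abs().sum(dim=1).mean()    # lambda * ||h||_1, averaged over the batch
    return recon + sparsity
```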

Denoising autoencoder

• Traditional autoencoder: encouraged to learn $g(f(\cdot))$ to be the identity
• Denoising autoencoder: minimize a loss function
  $$L(x, r) = L(x, g(f(\tilde{x})))$$
  where $\tilde{x}$ is $x + \text{noise}$
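A sketch of the denoising objective; additive Gaussian corruption is just one common choice of noise, and `sigma` is an assumed noise level for this illustration:

```python
import torch
import torch.nn as nn

def denoising_ae_loss(model, x, sigma=0.1):
    """Corrupt the input, then reconstruct the clean target: L(x, g(f(x_tilde)))."""
    x_tilde = x + sigma * torch.randn_like(x)   # x_tilde = x + noise
    r, _ = model(x_tilde)
    return nn.functional.mse_loss(r, x)         # compare against the clean x
```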

Boltzmann machine

• Introduced by Ackley et al. (1985)
• A general “connectionist” approach to learning arbitrary distributions over binary vectors
• Special case of an energy model:
  $$p(x) = \frac{\exp(-E(x))}{Z}$$
• Boltzmann machine: energy model with
  $$E(x) = -x^T U x - b^T x$$
  where $U$ is the weight matrix and $b$ is the bias parameter
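A small NumPy illustration of this energy and the resulting unnormalized probability (toy size and random parameters; only the formula comes from the slide):

```python
import numpy as np

def energy(x, U, b):
    """Boltzmann machine energy E(x) = -x^T U x - b^T x for a binary vector x."""
    return -x @ U @ x - b @ x

# Unnormalized probability of one configuration (illustrative parameters)
rng = np.random.default_rng(0)
d = 4
U = rng.normal(size=(d, d)); U = (U + U.T) / 2   # symmetric weight matrix
b = rng.normal(size=d)
x = rng.integers(0, 2, size=d)                   # a binary vector
p_unnorm = np.exp(-energy(x, U, b))              # p(x) = p_unnorm / Z
```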

Boltzmann machine with latent variables

• Some variables are not observed
  $$x = (x_v, x_h), \quad x_v \text{ visible}, \; x_h \text{ hidden}$$
  $$E(x) = -x_v^T R x_v - x_v^T W x_h - x_h^T S x_h - b^T x_v - c^T x_h$$
• Universal approximator of probability mass functions

Maximum likelihood

• Suppose we are given data $X = \{x_v^{(1)}, x_v^{(2)}, \ldots, x_v^{(n)}\}$
• Maximum likelihood is to maximize
  $$\log p(X) = \sum_i \log p(x_v^{(i)})$$
  where
  $$p(x_v) = \sum_{x_h} p(x_v, x_h) = \frac{1}{Z} \sum_{x_h} \exp(-E(x_v, x_h))$$
• $Z = \sum_{x_v, x_h} \exp(-E(x_v, x_h))$: the partition function, difficult to compute
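A brute-force NumPy sketch of why this is difficult: both the marginal $p(x_v)$ and $Z$ require sums over exponentially many binary configurations, so the code below is only feasible at toy sizes. The energy follows the latent-variable form above; all names are illustrative.

```python
import itertools
import numpy as np

def energy(xv, xh, R, W, S, b, c):
    """E(x) = -xv^T R xv - xv^T W xh - xh^T S xh - b^T xv - c^T xh."""
    return -(xv @ R @ xv + xv @ W @ xh + xh @ S @ xh + b @ xv + c @ xh)

def log_marginal(xv, R, W, S, b, c):
    """log p(x_v): sums over all 2^{d_h} hidden states, and over 2^{d_v + d_h} states for Z."""
    dv, dh = len(b), len(c)
    states_h = [np.array(s) for s in itertools.product([0, 1], repeat=dh)]
    states_v = [np.array(s) for s in itertools.product([0, 1], repeat=dv)]
    unnorm = sum(np.exp(-energy(xv, xh, R, W, S, b, c)) for xh in states_h)
    Z = sum(np.exp(-energy(v, h, R, W, S, b, c)) for v in states_v for h in states_h)
    return np.log(unnorm) - np.log(Z)
```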

Restricted Boltzmann machine

• Invented under the name harmonium (Smolensky, 1986)
• Popularized by Hinton and collaborators as the Restricted Boltzmann machine

Restricted Boltzmann machine

• Special case of a Boltzmann machine with latent variables:
  $$p(v, h) = \frac{\exp(-E(v, h))}{Z}$$
  where the energy function is
  $$E(v, h) = -v^T W h - b^T v - c^T h$$
  with weight matrix $W$ and biases $b, c$
• Partition function
  $$Z = \sum_v \sum_h \exp(-E(v, h))$$

Restricted Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]

Restricted Boltzmann machine

• The conditional distribution is factorial:
  $$p(h \mid v) = \frac{p(v, h)}{p(v)} = \prod_j p(h_j \mid v)$$
  and
  $$p(h_j = 1 \mid v) = \sigma(c_j + v^T W_{:,j})$$
  where $\sigma$ is the logistic function

Restricted Boltzmann machine

• Similarly,
  $$p(v \mid h) = \frac{p(v, h)}{p(h)} = \prod_i p(v_i \mid h)$$
  and
  $$p(v_i = 1 \mid h) = \sigma(b_i + W_{i,:} h)$$
  where $\sigma$ is the logistic function
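One standard use of these factorial conditionals is block Gibbs sampling, where each layer is sampled in one shot given the other. A minimal NumPy sketch (shapes and names are assumptions of this illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, b, c, rng):
    """One block-Gibbs sweep: sample h ~ p(h|v), then v ~ p(v|h)."""
    p_h = sigmoid(c + v @ W)            # p(h_j = 1 | v) = sigma(c_j + v^T W[:, j])
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)            # p(v_i = 1 | h) = sigma(b_i + W[i, :] h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```

Iterating this step produces a Markov chain that approximately samples from $p(v, h)$.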

Deep Boltzmann machine

• Special case of an energy model. Take 3 hidden layers and ignore the biases:
  $$p(v, h^1, h^2, h^3) = \frac{\exp(-E(v, h^1, h^2, h^3))}{Z}$$
• Energy function
  $$E(v, h^1, h^2, h^3) = -v^T W^1 h^1 - (h^1)^T W^2 h^2 - (h^2)^T W^3 h^3$$
  with weight matrices $W^1, W^2, W^3$
• Partition function
  $$Z = \sum_{v, h^1, h^2, h^3} \exp(-E(v, h^1, h^2, h^3))$$
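A direct NumPy transcription of this energy function (bias-free, as on the slide; everything else is illustrative):

```python
import numpy as np

def dbm_energy(v, h1, h2, h3, W1, W2, W3):
    """E(v, h1, h2, h3) = -v^T W1 h1 - h1^T W2 h2 - h2^T W3 h3 (biases ignored, as on the slide)."""
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)
```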

Deep Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]