Deep Learning Basics Lecture 8: Autoencoder & DBM
Princeton University COS 495
Instructor: Yingyu Liang

Autoencoder

• Neural networks trained to attempt to copy their input to their output
• Contain two parts:
  • Encoder: maps the input to a hidden representation
  • Decoder: maps the hidden representation to the output
• Notation: input $x$, hidden representation (the code) $h$, reconstruction $r$; encoder $f(\cdot)$ and decoder $g(\cdot)$, so $h = f(x)$ and $r = g(h) = g(f(x))$

Why want to copy input to output

• We do not really care about the copying itself
• Interesting case: the autoencoder is NOT able to copy exactly but strives to do so
• The autoencoder is then forced to select which aspects of the input to preserve, and thus hopefully learns useful properties of the data
• Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994)

Undercomplete autoencoder

• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function $L(x, r) = L(x, g(f(x)))$
• Special case: $f, g$ linear and $L$ the mean squared error; this reduces to Principal Component Analysis
• What about nonlinear encoders and decoders? The capacity should not be too large
• Suppose we are given data $x_1, x_2, \ldots, x_n$. If the encoder maps $x_i$ to the index $i$ and the decoder maps $i$ back to $x_i$, then a one-dimensional $h$ suffices for perfect reconstruction, yet nothing useful about the data has been learned

Regularization

• Typically NOT done by keeping the encoder/decoder shallow or by using a small code size
• Regularized autoencoders: add a regularization term that encourages the model to have other properties
  • Sparsity of the representation (sparse autoencoder)
  • Robustness to noise or to missing inputs (denoising autoencoder)
  • Smallness of the derivative of the representation

Sparse autoencoder

• Constrain the code to be sparse
• Training: minimize a loss function $L_R = L(x, g(f(x))) + R(h)$

Probabilistic view of regularizing $h$

• Suppose we have a probabilistic model $p(h, x)$
• MLE on $x$: $\max \log p(x) = \max \log \sum_{h'} p(h', x)$, which is hard because of the sum over $h'$
• Approximation: suppose $h = f(x)$ gives the most likely hidden representation, so that $\sum_{h'} p(h', x)$ can be approximated by $p(h, x)$
• Approximate MLE on $x$ with $h = f(x)$: $\max \log p(h, x) = \max \left[ \log p(x \mid h) + \log p(h) \right]$, where the first term acts as the loss and the second term as the regularization

Sparse autoencoder

• Constrain the code to be sparse
• Laplacian prior: $p(h) = \frac{\lambda}{2} \exp(-\lambda \|h\|_1)$
• Training: minimize a loss function $L_R = L(x, g(f(x))) + \lambda \|h\|_1$

Denoising autoencoder

• Traditional autoencoder: encouraged to learn $g(f(\cdot))$ to be the identity
• Denoising autoencoder: minimize a loss function $L(x, r) = L(x, g(f(\tilde{x})))$, where $\tilde{x}$ is $x$ plus noise
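The sparse and denoising objectives above combine naturally into one training loss. Below is a minimal PyTorch sketch, not the lecture's code: the layer sizes, noise level, sparsity weight, and the use of ReLU and mean squared error are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_code=32):      # code dimension < input dimension (undercomplete)
        super().__init__()
        self.encoder = nn.Linear(d_in, d_code)    # f: input -> code
        self.decoder = nn.Linear(d_code, d_in)    # g: code -> reconstruction

    def forward(self, x):
        h = torch.relu(self.encoder(x))           # h = f(x)
        r = self.decoder(h)                       # r = g(h) = g(f(x))
        return r, h

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam, sigma = 1e-3, 0.1                            # sparsity weight and noise level (assumed values)

x = torch.rand(64, 784)                           # toy batch standing in for real data
x_tilde = x + sigma * torch.randn_like(x)         # corrupted input: x~ = x + noise (denoising variant)

r, h = model(x_tilde)
# L(x, g(f(x~))) + lambda * ||h||_1
loss = F.mse_loss(r, x) + lam * h.abs().sum(dim=1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Setting the noise level and the sparsity weight to zero recovers the plain undercomplete objective $L(x, g(f(x)))$.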
Boltzmann machine

• Introduced by Ackley et al. (1985)
• A general "connectionist" approach to learning arbitrary probability distributions over binary vectors
• Special case of an energy model: $p(x) = \frac{\exp(-E(x))}{Z}$
• The Boltzmann machine is the special case with energy $E(x) = -x^T U x - b^T x$, where $U$ is the weight matrix and $b$ is the bias parameter

Boltzmann machine with latent variables

• Some variables are not observed: $x = (x_v, x_h)$, with $x_v$ visible and $x_h$ hidden
• $E(x) = -x_v^T R x_v - x_v^T W x_h - x_h^T S x_h - b^T x_v - c^T x_h$
• Universal approximator of probability mass functions

Maximum likelihood

• Suppose we are given data $X = \{x_v^1, x_v^2, \ldots, x_v^n\}$
• Maximum likelihood maximizes $\log p(X) = \sum_i \log p(x_v^i)$, where $p(x_v) = \sum_{x_h} p(x_v, x_h) = \frac{1}{Z} \sum_{x_h} \exp(-E(x_v, x_h))$
• $Z = \sum_{x_v, x_h} \exp(-E(x_v, x_h))$ is the partition function, which is difficult to compute (see the brute-force illustration at the end)

Restricted Boltzmann machine

• Invented under the name harmonium (Smolensky, 1986); popularized by Hinton and collaborators under the name Restricted Boltzmann machine
• Special case of the Boltzmann machine with latent variables: $p(v, h) = \frac{\exp(-E(v, h))}{Z}$, where the energy function is $E(v, h) = -v^T W h - b^T v - c^T h$, with weight matrix $W$ and biases $b, c$
• Partition function: $Z = \sum_v \sum_h \exp(-E(v, h))$
• [Figure from Deep Learning, Goodfellow, Bengio and Courville]
• The conditional distribution is factorial: $p(h \mid v) = \frac{p(v, h)}{p(v)} = \prod_j p(h_j \mid v)$, with $p(h_j = 1 \mid v) = \sigma(c_j + v^T W_{:,j})$, where $\sigma$ is the logistic function
• Similarly, $p(v \mid h) = \frac{p(v, h)}{p(h)} = \prod_i p(v_i \mid h)$, with $p(v_i = 1 \mid h) = \sigma(b_i + W_{i,:} h)$ (a sampling sketch based on these conditionals follows the DBM section below)

Deep Boltzmann machine

• Special case of the energy model; take 3 hidden layers and ignore the biases: $p(v, h^1, h^2, h^3) = \frac{\exp(-E(v, h^1, h^2, h^3))}{Z}$
• Energy function: $E(v, h^1, h^2, h^3) = -v^T W^1 h^1 - (h^1)^T W^2 h^2 - (h^2)^T W^3 h^3$, with weight matrices $W^1, W^2, W^3$
• Partition function: $Z = \sum_{v, h^1, h^2, h^3} \exp(-E(v, h^1, h^2, h^3))$
• [Figure from Deep Learning, Goodfellow, Bengio and Courville]
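Because the RBM conditionals factorize, block Gibbs sampling alternates between sampling all hidden units given the visible units and vice versa. The sketch below is a NumPy illustration under assumed toy sizes and random weights; it is not code from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_v, n_h = 6, 4                                  # toy sizes (assumed)
W = 0.1 * rng.standard_normal((n_v, n_h))        # weight matrix W
b = np.zeros(n_v)                                # visible bias b
c = np.zeros(n_h)                                # hidden bias c

def energy(v, h):
    # E(v, h) = -v^T W h - b^T v - c^T h
    return -v @ W @ h - b @ v - c @ h

def sample_h_given_v(v):
    # p(h_j = 1 | v) = sigma(c_j + v^T W[:, j]); the conditional factorizes over j
    p = sigmoid(c + v @ W)
    return (rng.random(n_h) < p).astype(float)

def sample_v_given_h(h):
    # p(v_i = 1 | h) = sigma(b_i + W[i, :] h); the conditional factorizes over i
    p = sigmoid(b + W @ h)
    return (rng.random(n_v) < p).astype(float)

# one block Gibbs step: v -> h -> v'
v = (rng.random(n_v) < 0.5).astype(float)
h = sample_h_given_v(v)
v_new = sample_v_given_h(h)
print("E(v, h) =", energy(v, h), " E(v', h) =", energy(v_new, h))
```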

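The partition function $Z$ sums over every joint binary configuration, so it can only be evaluated exactly for toy models. The following sketch (sizes and weights are again assumptions) enumerates all $2^{n_v + n_h}$ states of a tiny RBM to compute $Z$ and the marginal $p(x_v)$ from the maximum-likelihood section; for realistic models this enumeration is exponentially large, which is why $Z$ is difficult to compute.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_v, n_h = 4, 3                                  # small enough to enumerate 2^(n_v + n_h) states
W = 0.1 * rng.standard_normal((n_v, n_h))
b = np.zeros(n_v)
c = np.zeros(n_h)

def energy(v, h):
    return -v @ W @ h - b @ v - c @ h            # E(v, h) = -v^T W h - b^T v - c^T h

# Z = sum over all binary configurations (v, h) of exp(-E(v, h))
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in product([0.0, 1.0], repeat=n_v)
        for h in product([0.0, 1.0], repeat=n_h))

# marginal of one visible vector: p(v) = (1/Z) * sum_h exp(-E(v, h))
v0 = np.array([1.0, 0.0, 1.0, 0.0])
p_v0 = sum(np.exp(-energy(v0, np.array(h)))
           for h in product([0.0, 1.0], repeat=n_h)) / Z
print("Z =", Z, " p(v0) =", p_v0)
```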