Advanced Deep Learning

Approximate Inference-2

U Kang, Seoul National University

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference
- Variational Inference and Learning

MAP Inference

- If we wish to develop a learning process based on maximizing the ELBO L, then it is helpful to think of MAP inference as a procedure that provides a value of q.

- What is MAP inference?
  - It finds the single most likely value of the latent variables, $h^* = \arg\max_h p(h \mid v)$, rather than inferring the entire distribution over h.

MAP Inference

- Exact inference consists of maximizing the ELBO

$$L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$

with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm.
- To obtain MAP inference, we restrict the family of distributions q may be drawn from by requiring q to take on a Dirac distribution:

$$q(h \mid v) = \delta(h - \mu)$$

MAP Inference

- We can now control q entirely via μ. Dropping the terms of L that do not vary with μ, we are left with the optimization problem

$$\max_{\mu} \log p(h = \mu, v)$$

which is equivalent to the MAP inference problem

$$h^* = \arg\max_h p(h \mid v)$$

ELBO: $L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$

MAP Inference

- Thus, we can think of the following algorithm for maximizing the ELBO, which is similar to EM
  - Alternate the following two steps:
    - Perform MAP inference to infer h* while fixing θ
    - Update θ to increase log p(h*, v)

ELBO: $L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$
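The sketch below illustrates this alternating scheme. It assumes a hypothetical linear-Gaussian model, h ~ N(0, I) and v | h ~ N(Wh, I), so that both the MAP step and the parameter update have closed forms; the model, dimensions, and variable names are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Alternate MAP inference of h (E-like step) and updates of theta = W (M-like
# step) for an assumed model  h ~ N(0, I),  v | h ~ N(W h, I).
rng = np.random.default_rng(0)
n, d_v, d_h = 200, 10, 3
W_true = rng.normal(size=(d_v, d_h))
V = rng.normal(size=(n, d_h)) @ W_true.T + 0.1 * rng.normal(size=(n, d_v))

W = rng.normal(size=(d_v, d_h))                      # parameters theta
for it in range(50):
    # MAP step: h_n* = argmax_h log p(h, v_n) = (W^T W + I)^{-1} W^T v_n
    H = np.linalg.solve(W.T @ W + np.eye(d_h), W.T @ V.T).T
    # Learning step: maximize sum_n log p(h_n*, v_n) with respect to W
    W = np.linalg.solve(H.T @ H, H.T @ V).T
    # Joint log-probability at the MAP point (up to additive constants)
    log_joint = -0.5 * np.sum((V - H @ W.T) ** 2) - 0.5 * np.sum(H ** 2)
print("log p(h*, v) after training:", round(log_joint, 2))
```

Compared with EM proper, this learns from a point estimate h* rather than a full distribution over h, which is exactly the trade-off described above.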

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference
- Variational Inference and Learning
  - Discrete Latent Variables
  - Calculus of Variations
  - Continuous Latent Variables

Previously

- Lower bound L
- Inference = maximize L w.r.t. q
- Learning = maximize L w.r.t. θ
- EM algorithm -> allows us to make large learning steps with a fixed q
- MAP inference -> enables us to learn using a point estimate h* rather than inferring the entire distribution

Variational Methods

- We want to do the following:
  - Given this surveillance footage X, did the suspect show up in it?
  - Given this twitter feed X, is the author depressed?
- Problem: we cannot compute P(z|x)

Variational Methods

- Idea: re-write statistical inference problems as optimization problems
  - Statistical inference = infer the value of one random variable given the value of another
  - Optimization problem = find the parameter values that minimize a cost function
- Variational methods thus form a bridge between optimization and statistical techniques

Variational Learning

- Key idea: we can maximize L over a restricted family of distributions q
- L is a lower bound on log p(v; θ)
- Choose the family so that $\mathbb{E}_{h \sim q} \log p(h, v)$ is easy to compute
- Typically: impose that q is a factorial distribution

The Mean Field Approach

- q is a factorial distribution:

$$q(h \mid v) = \prod_i q(h_i \mid v)$$

  - where the h_i are independent (and thus q(h|v) cannot, in general, match the true distribution p(h|v))
- Advantage: no need to specify a parametric form for q
- The optimization problem determines the optimal probability distribution within the imposed independence constraints

Discrete Latent Variables

- The goal is to maximize the ELBO:

$$L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$

- In the mean field approach, we assume q is a factorial distribution
- We can parameterize q with a vector $\hat{h}$ whose entries are probabilities; then $q(h_i = 1 \mid v) = \hat{h}_i$
- Then, we simply optimize the parameters $\hat{h}$ with any standard optimization technique (e.g., gradient descent), as in the sketch below
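A minimal sketch of this idea, assuming a toy joint distribution over 3 binary latent units with log p(h, v) = hᵀWv + bᵀh up to a constant (the model, W, b, and v are made up for illustration). With so few units, the expectation in the ELBO can be computed by enumerating all configurations of h, and the mean field parameters ĥ are optimized through their logits with a standard optimizer.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# Toy joint (illustrative assumption): 3 binary latent units h, observed v,
# with  log p(h, v) = h^T W v + b^T h + const.  The additive constant does
# not change which mean field q maximizes the ELBO.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
v = np.array([1.0, 0.0, 1.0, 1.0])

def log_p_joint(h):                                    # up to a constant
    return h @ W @ v + b @ h

H_configs = np.array(list(itertools.product([0.0, 1.0], repeat=3)))

def neg_elbo(logits):
    h_hat = 1.0 / (1.0 + np.exp(-logits))              # q(h_i = 1 | v)
    # Mean field: q(h) = prod_i h_hat_i^{h_i} (1 - h_hat_i)^{1 - h_i}
    q = np.prod(np.where(H_configs == 1.0, h_hat, 1.0 - h_hat), axis=1)
    expected_log_p = np.sum(q * np.array([log_p_joint(h) for h in H_configs]))
    entropy = -np.sum(q * np.log(q + 1e-12))
    return -(expected_log_p + entropy)                  # minimize the negative ELBO

res = minimize(neg_elbo, x0=np.zeros(3))                # any standard optimizer
print("q(h_i = 1 | v):", np.round(1.0 / (1.0 + np.exp(-res.x)), 3))
```

In a real model one would use the analytic mean field fixed-point equations or stochastic gradients instead of brute-force enumeration; the point here is only that inference has become an optimization problem over ĥ.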

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning
  - Discrete Latent Variables
  - Continuous Latent Variables

Recap

- Inference is to compute

$$p(h \mid v) = \frac{p(v \mid h)\, p(h)}{p(v)}$$

- However, exact inference requires an exponential amount of time in these models
  - Computing p(v) is intractable
- Approximate inference is needed

Recap

- We compute the Evidence Lower Bound (ELBO) instead of p(v).
- ELBO:

$$L(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\big(q(h \mid v)\, \|\, p(h \mid v; \theta)\big)$$

- After rearranging the terms,

$$L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$

- For any choice of q, L provides a lower bound on log p(v).
- If we take q equal to p(h|v), the bound is tight and we get log p(v) exactly.

Calculus of Variations

- In machine learning, our goal is usually to minimize a function J(θ) by finding the input vector θ ∈ ℝⁿ that minimizes it.
- This can be accomplished by solving for the critical points, where $\nabla_\theta J(\theta) = 0$.

Calculus of Variations

- But in some cases, we actually want to solve for a function f(x).
- Calculus of variations is a method for finding critical points with respect to a function f(x).
- A function of a function f is called a functional, J[f].
- We can take functional derivatives (a.k.a. variational derivatives) of J[f] with respect to the individual values of the function f(x) at any specific value of x.
- The functional derivative is denoted $\frac{\delta}{\delta f(x)} J$.
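As a quick example (added here for concreteness; it is not on the original slide), consider the functional

$$J[f] = \int f(x)^2\, dx, \qquad \frac{\delta}{\delta f(x)} J = 2 f(x),$$

which is the infinite-dimensional analogue of $\frac{\partial}{\partial x_i} \sum_j x_j^2 = 2 x_i$: the functional derivative tells us how J changes when we perturb f at the single point x.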

Calculus of Variations

- Euler-Lagrange equation (simplified form)
  - Consider a functional $J[f] = \int g(f(x), x)\, dx$. An extreme point of J is given by the condition $\frac{\partial}{\partial f(x)} g(f(x), x) = 0$ for all x.
- (Proof) Suppose we change f by an amount εη(x), where η(x) is an arbitrary function. Then

$$J[f + \epsilon\eta] = \int g\big(f(x) + \epsilon\eta(x), x\big)\, dx = \int \Big[ g(f(x), x) + \epsilon\,\eta(x)\, \frac{\partial g}{\partial f} \Big] dx = J[f] + \epsilon \int \eta(x)\, \frac{\partial g}{\partial f}\, dx$$

  - Here we use the first-order expansion $y(x_1 + \epsilon_1, \ldots, x_D + \epsilon_D) = y(x_1, \ldots, x_D) + \sum_{i=1}^{D} \epsilon_i \frac{\partial y}{\partial x_i} + O(\epsilon^2)$
- Note that at an extreme point, $J[f + \epsilon\eta] = J[f]$ to first order in ε, which implies $\int \eta(x)\, \frac{\partial g}{\partial f}\, dx = 0$. Since this holds for any η(x), we must have $\frac{\partial g}{\partial f} = 0$.

Calculus of Variations

- The ELBO L involves the entropy of q, which is a functional of q:

$$H[q] = -\int q(x) \log q(x)\, dx$$

- We want to maximize L.
- So we have to find the q that is a critical point of L, i.e., solve

$$\frac{\delta}{\delta q(x)} L = 0$$

Example

- Consider the problem of finding the probability density function over x ∈ ℝ that has the maximal differential entropy

$$H[p] = -\int p(x) \log p(x)\, dx$$

among all distributions with E[x] = μ and Var(x) = σ².
- I.e., the problem is to find p(x) maximizing H[p], subject to:
  - p(x) integrates to 1
  - E[x] = μ
  - Var(x) = σ²

Lagrange Multipliers

- How do we maximize (or minimize) a function subject to an equality constraint? Lagrange multipliers.
- Problem: maximize f(x) subject to g(x) = 0
- Solution: maximize the Lagrangian

$$L(x, \lambda) = f(x) + \lambda\, g(x)$$

  - λ is called the Lagrange multiplier
  - We find x and λ such that $\nabla_x L = 0$ and $\frac{\partial L}{\partial \lambda} = 0$
  - At a constrained optimum, $\nabla_x f(x)$ and $\nabla_x g(x)$ are both orthogonal to the surface g(x) = 0; thus $\nabla_x f(x) = -\lambda\, \nabla_x g(x)$ for some λ, which is exactly the condition $\nabla_x L = 0$
  - $\frac{\partial L}{\partial \lambda} = 0$ recovers the constraint g(x) = 0 (see the worked example below)
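A small worked example (added for concreteness; it is not part of the original slides): maximize f(x, y) = x + y subject to g(x, y) = x² + y² − 1 = 0.

$$L(x, y, \lambda) = x + y + \lambda\,(x^2 + y^2 - 1)$$

$$\frac{\partial L}{\partial x} = 1 + 2\lambda x = 0, \qquad \frac{\partial L}{\partial y} = 1 + 2\lambda y = 0, \qquad \frac{\partial L}{\partial \lambda} = x^2 + y^2 - 1 = 0$$

The first two conditions give x = y = −1/(2λ); substituting into the constraint gives x = y = ±1/√2, and the maximum f = √2 is attained at x = y = 1/√2 (with λ = −1/√2).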

Calculus of Variations

- Goal: find p(x) which maximizes $H[p] = -\int p(x)\log p(x)\, dx$, subject to $\int p(x)\, dx = 1$, $\mathbb{E}[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$
- Using Lagrange multipliers, we maximize the Lagrangian functional

$$\mathcal{L}[p] = \lambda_1 \Big( \int p(x)\, dx - 1 \Big) + \lambda_2 \big( \mathbb{E}[x] - \mu \big) + \lambda_3 \big( \mathbb{E}[(x - \mu)^2] - \sigma^2 \big) + H[p]$$

Calculus of Variations

- We set the functional derivative equal to 0 for all x:

$$\frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0$$

- We obtain

$$p(x) = \exp\big( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \big)$$

- We are free to choose the Lagrange multipliers, as long as $\int p(x)\, dx = 1$, $\mathbb{E}[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$
- We may set the following:

$$\lambda_1 = 1 - \log \sigma\sqrt{2\pi}, \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2}$$

- Then, we obtain

$$p(x) = \mathcal{N}(x;\, \mu, \sigma^2)$$

- This is one reason for using the normal distribution when we do not know the true distribution: it is the maximum-entropy distribution for a given mean and variance.
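The short script below is a numerical sanity check of this result (an added illustration, not from the slides): it maximizes the discretized differential entropy on a grid subject to the same three constraints and compares the optimum against the standard normal density.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Maximize the discretized entropy of p on a grid subject to normalization,
# E[x] = 0, and Var(x) = 1; the optimum should approximate N(0, 1).
x = np.linspace(-5, 5, 101)
dx = x[1] - x[0]

def neg_entropy(p):
    return np.sum(p * np.log(p + 1e-12)) * dx

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) * dx - 1.0},         # integrates to 1
    {"type": "eq", "fun": lambda p: np.sum(x * p) * dx},           # mean 0
    {"type": "eq", "fun": lambda p: np.sum(x**2 * p) * dx - 1.0},  # variance 1
]
p0 = np.full_like(x, 1.0 / (x[-1] - x[0]))                         # uniform start
res = minimize(neg_entropy, p0, method="SLSQP", constraints=constraints,
               bounds=[(1e-12, None)] * len(x), options={"maxiter": 500})
print("max |p - N(0, 1)| on the grid:", np.abs(res.x - norm.pdf(x)).max())
```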

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning
  - Discrete Latent Variables
  - Calculus of Variations
  - Continuous Latent Variables

Continuous Latent Variables

- When our model contains continuous latent variables, we can perform variational inference by maximizing L using the calculus of variations.
- If we make the mean field approximation $q(h \mid v) = \prod_i q(h_i \mid v)$ and fix $q(h_i \mid v)$ for all $i \neq j$, then the optimal $q(h_j \mid v)$ can be obtained by normalizing the unnormalized distribution

$$\tilde{q}(h_j \mid v) = \exp\Big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}\big[ \log p(h, v) \big] \Big)$$

- Thus, we apply the above equation iteratively for each value of j until convergence

Proof of Mean Field Approximation

- (Claim) Assuming $q(h \mid v) = \prod_i q(h_i \mid v)$, the optimal $q(h_j \mid v)$ is given by normalizing the unnormalized distribution $\tilde{q}(h_j \mid v) = \exp\big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}[\log p(h, v)] \big)$
- (Proof) Note that $\text{ELBO} = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$ and $q(h \mid v) = \prod_i q_i(h_i \mid v)$.
- Thus,

$$\text{ELBO} = \int \prod_i q_i\, \log p(h, v)\, dh - \int \Big( \prod_i q_i \Big) \Big( \log \prod_i q_i \Big) dh = \int q_j \Big\{ \int \log p(h, v) \prod_{i \neq j} q_i\, dh_i \Big\} dh_j - \int \Big( \prod_i q_i \Big) \Big( \sum_i \log q_i \Big) dh$$

- If we keep only the terms involving $q_j$, the ELBO becomes

$$\int q_j\, \mathbb{E}_{h_{-j}}[\log p(h, v)]\, dh_j - \int q_j \log q_j\, dh_j + \text{const} = \int q_j \log p^*(h_j, v)\, dh_j - \int q_j \log q_j\, dh_j + \text{const} \quad (1)$$

  where $p^*(h_j, v)$ is a probability distribution with $\log p^*(h_j, v) = \mathbb{E}_{h_{-j}}[\log p(h, v)] + \text{const}$
- Note that (1) is the negative KL divergence $-D_{\mathrm{KL}}\big(q_j\, \|\, p^*(h_j, v)\big)$ up to a constant; thus the best $q_j$ maximizing the ELBO is $q_j = p^*(h_j, v)$.
- In that case, $\log q_j = \log p^*(h_j, v) = \mathbb{E}_{h_{-j}}[\log p(h, v)] + \text{const}$. Thus,

$$q_j \propto \exp\big( \mathbb{E}_{h_{-j}}[\log p(h, v)] \big) = \exp\Big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}\big[ \log p(h, v) \big] \Big)$$
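To make the fixed-point update concrete, here is a small numpy sketch for a hypothetical case in which the true posterior p(h | v) is a 2-D Gaussian N(m, Λ⁻¹) (the numbers m and Λ are made up). Plugging log p(h, v) into the update equation makes each optimal factor q(h_j | v) Gaussian with variance 1/Λ_jj, and its mean is obtained by the coordinate update below, iterated until convergence.

```python
import numpy as np

# Coordinate-ascent mean field for an assumed 2-D Gaussian posterior
# p(h | v) = N(m, inv(Lam)).  Each factor q(h_j | v) is Gaussian with
# variance 1 / Lam[j, j]; only its mean needs to be updated.
m = np.array([1.0, -2.0])                        # true posterior mean
Lam = np.array([[2.0, 1.2],                      # true posterior precision
                [1.2, 3.0]])

mu = np.zeros(2)                                 # means of the factors q(h_j | v)
for sweep in range(20):
    for j in range(2):
        others = [i for i in range(2) if i != j]
        # E_{h_{-j} ~ q}[log p(h, v)] is quadratic in h_j; completing the
        # square gives the mean of the optimal Gaussian factor:
        mu[j] = m[j] - (Lam[j, others] @ (mu[others] - m[others])) / Lam[j, j]

print("mean field means:", mu)                   # converges to m = [1, -2]
print("factor variances:", 1.0 / np.diag(Lam))   # underestimate the true marginals
```

The means converge to the true posterior mean, while the factorized variances 1/Λ_jj are smaller than the true marginal variances, which is the usual compactness bias of the mean field approximation.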

What you need to know

- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning

Questions?
