Advanced Deep Learning: Approximate Inference (Part 2)

U Kang, Seoul National University
Outline

• Inference as Optimization
• Expectation Maximization
• MAP Inference
• Variational Inference and Learning

MAP Inference

If we wish to develop a learning process based on maximizing $L(v, h, q)$, it is helpful to think of MAP inference as a procedure that provides a value of $q$. What is MAP inference? It finds the most likely value of a variable:
$$h^* = \arg\max_h \, p(h \mid v).$$

Exact inference consists of maximizing $L$ with respect to $q$ over an unrestricted family of probability distributions, using an exact optimization algorithm. To derive MAP inference instead, we restrict the family of distributions so that $q$ takes on a Dirac form:
$$q(h \mid v) = \delta(h - \mu).$$

We can now control $q$ entirely via $\mu$. Recall the ELBO:
$$L(v, \theta, q) = E_{h \sim q}[\log p(h, v)] + H(q).$$
Dropping the terms of $L$ that do not vary with $\mu$, we are left with the optimization problem
$$\mu^* = \arg\max_\mu \log p(h = \mu, v),$$
which is equivalent to the MAP inference problem.

Thus, we can think of the following algorithm for maximizing the ELBO, which is similar to EM. Alternate the following two steps:

• Perform MAP inference to infer $h^*$ while fixing $\theta$.
• Update $\theta$ to increase $\log p(h^*, v)$.

Outline

• Inference as Optimization
• Expectation Maximization
• MAP Inference
• Variational Inference and Learning
• Discrete Latent Variables
• Calculus of Variations
• Continuous Latent Variables

Previously

• Lower bound $L$: inference = maximizing $L$ w.r.t. $q$; learning = maximizing $L$ w.r.t. $\theta$.
• The EM algorithm allows us to make large learning steps with a fixed $q$.
• MAP inference enables us to learn using a point estimate rather than inferring the entire distribution.

Variational Methods

We want to answer questions such as: given this surveillance footage $X$, did the suspect show up in it? Given this Twitter feed $X$, is the author depressed? The problem: we cannot compute $p(z \mid x)$.

Idea: variational methods allow us to re-write statistical inference problems as optimization problems. Statistical inference infers the value of one random variable given another; an optimization problem finds the parameter values that minimize a cost function. Variational methods are thus a bridge between optimization and statistical techniques.

Variational Learning

Key idea: we can maximize $L$ over a restricted family of distributions $q$. $L$ is a lower bound of $\log p(v; \theta)$. We choose the family such that $E_{h \sim q} \log p(h, v)$ is easy to compute. Typically, we impose that $q$ is a factorial distribution.

The Mean Field Approach

$q$ is a factorial distribution in which the $h_i$ are independent:
$$q(h \mid v) = \prod_i q(h_i \mid v)$$
(and thus $q(h \mid v)$ cannot, in general, match the true distribution $p(h \mid v)$). Advantage: there is no need to specify a parametric form for $q$; the optimization problem determines the optimal distribution under the independence constraints.

Discrete Latent Variables

The goal is to maximize the ELBO
$$L(v, \theta, q) = E_{h \sim q}[\log p(h, v)] + H(q).$$
In the mean field approach, we assume $q$ is a factorial distribution. For binary latent variables, we can parameterize $q$ with a vector $\hat{h}$ whose entries are probabilities, so that $q(h_i = 1 \mid v) = \hat{h}_i$. Then we simply optimize the parameters $\hat{h}$ by any standard optimization technique (e.g., gradient descent), as in the sketches below.
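To make the last step concrete, here is a minimal sketch of evaluating the mean field ELBO for binary latent variables. The model interface `log_joint(h, v)`, returning $\log p(h, v)$, is a hypothetical placeholder rather than anything defined in the slides, and the expectation is computed by exact enumeration, which is feasible only for a handful of latent bits:

```python
import itertools
import numpy as np

def elbo(h_hat, v, log_joint):
    """ELBO = E_{h~q}[log p(h,v)] + H(q) for a factorial Bernoulli q
    with q(h_i = 1 | v) = h_hat[i].

    log_joint(h, v) -> log p(h, v) is a hypothetical model interface.
    """
    m = len(h_hat)
    expected_log_joint = 0.0
    for bits in itertools.product([0, 1], repeat=m):  # all 2^m configurations
        h = np.array(bits)
        # q(h | v) = prod_i q(h_i | v) under the mean field assumption
        q_h = np.prod(np.where(h == 1, h_hat, 1.0 - h_hat))
        if q_h > 0.0:
            expected_log_joint += q_h * log_joint(h, v)
    # entropy of a factorial Bernoulli distribution
    p = np.clip(h_hat, 1e-12, 1.0 - 1e-12)
    entropy = -np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return expected_log_joint + entropy
```

One would then run any off-the-shelf optimizer on `h_hat` (keeping its entries in $[0, 1]$), e.g., projected gradient ascent with numerical or automatic gradients.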
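Similarly, the EM-like alternation from the MAP inference slides above can be sketched as follows; `argmax_h` and `grad_log_joint` are hypothetical stand-ins for a concrete model's MAP solver and gradient:

```python
def map_em(v, theta, argmax_h, grad_log_joint, lr=0.01, n_iters=100):
    """Maximize the ELBO with a Dirac q: alternate MAP inference and learning.

    argmax_h(v, theta)          -> h* = argmax_h log p(h, v; theta)
    grad_log_joint(h, v, theta) -> gradient of log p(h, v; theta) w.r.t. theta
    (both are hypothetical interfaces, standing in for a real model)
    """
    for _ in range(n_iters):
        h_star = argmax_h(v, theta)  # MAP step: fix theta, infer h*
        theta = theta + lr * grad_log_joint(h_star, v, theta)  # fix h*, update theta
    return theta
```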
Recap

Inference is to compute
$$p(h \mid v) = \frac{p(v \mid h)\, p(h)}{p(v)}.$$
However, exact inference requires an exponential amount of time in these models: computing $p(v)$ is intractable, so approximate inference is needed.

We compute the Evidence Lower Bound (ELBO) instead of $p(v)$:
$$L(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\!\left( q(h \mid v) \,\|\, p(h \mid v; \theta) \right).$$
After rearranging the equation,
$$L(v, \theta, q) = E_{h \sim q}[\log p(h, v)] + H(q).$$
For any choice of $q$, $L$ provides a lower bound. If we take $q$ equal to $p(h \mid v)$, we recover $\log p(v)$ exactly.

Calculus of Variations

In machine learning, the usual goal is to minimize a function $J(\theta)$ by finding the input vector $\theta \in \mathbb{R}^n$ that minimizes it. This can be accomplished by solving for the critical points where $\nabla_\theta J(\theta) = 0$.

But in some cases, we actually want to solve for a function $f(x)$. Calculus of variations is a method of finding the critical points with respect to $f(x)$. A function of a function $f$ is a functional $J[f]$. We can take functional derivatives (a.k.a. variational derivatives) of $J[f]$ with respect to the individual values of the function $f(x)$ at any specific value of $x$; the functional derivative is denoted $\frac{\delta}{\delta f(x)} J$.

Euler-Lagrange equation (simplified form): consider a functional $J[f] = \int g(f(x), x)\, dx$. An extreme point of $J$ is given by the condition
$$\frac{\partial}{\partial f(x)} g(f(x), x) = 0.$$
Proof: assume we change $f$ by the amount $\epsilon\, \eta(x)$ for an arbitrary function $\eta(x)$. Then
$$J[f(x) + \epsilon \eta(x)] = \int g(f(x) + \epsilon \eta(x), x)\, dx = \int \left[ g(f(x), x) + \epsilon \eta(x) \tfrac{\partial g}{\partial f} \right] dx = J + \epsilon \int \eta(x) \tfrac{\partial g}{\partial f}\, dx,$$
using the first-order expansion $y(x_1 + \epsilon_1, \ldots, x_D + \epsilon_D) = y(x_1, \ldots, x_D) + \sum_{i=1}^{D} \epsilon_i \tfrac{\partial y}{\partial x_i} + O(\epsilon^2)$. At an extreme point, $J[f + \epsilon \eta(x)] = J$, which implies $\int \eta(x) \tfrac{\partial g}{\partial f}\, dx = 0$. Since this holds for any $\eta(x)$, we must have $\tfrac{\partial g}{\partial f} = 0$.

In variational inference, the entropy $H[q] = -\int q(x) \log q(x)\, dx$ makes $L$ a functional of $q$. We want to maximize $L$, so we have to find the $q$ that is a critical point of $L$: find $q$ such that $\frac{\delta}{\delta q(x)} L = 0$.

Example

Consider the problem of finding the probability density function over $x \in \mathbb{R}$ that has maximal differential entropy
$$H[p] = -\int p(x) \log p(x)\, dx$$
among the distributions with $E[x] = \mu$ and $\mathrm{Var}(x) = \sigma^2$. That is, the problem is to find $p(x)$ maximizing $H[p]$ such that:

• $p(x)$ integrates to 1,
• $E[x] = \mu$,
• $\mathrm{Var}(x) = \sigma^2$.

Lagrange Multipliers

How do we maximize (or minimize) a function under an equality constraint? With Lagrange multipliers. Problem: maximize $f(x)$ subject to $g(x) = 0$. Solution: maximize
$$L(x, \lambda) = f(x) + \lambda \cdot g(x),$$
where $\lambda$ is called the Lagrange multiplier. We find $x$ and $\lambda$ such that $\nabla_x L = 0$ and $\frac{\partial L}{\partial \lambda} = 0$. At a constrained optimum, $\nabla_x f(x)$ and $\nabla_x g(x)$ are both orthogonal to the surface $g(x) = 0$, so $\nabla_x f(x) = -\lambda \nabla_x g(x)$ for some $\lambda$; this is exactly $\nabla_x L = 0$. The condition $\frac{\partial L}{\partial \lambda} = 0$ recovers the constraint $g(x) = 0$.

Back to the example: the goal is to find $p(x)$ maximizing $H[p] = -\int p(x) \log p(x)\, dx$ subject to $\int p(x)\, dx = 1$, $E[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$. Using Lagrange multipliers, we maximize the functional
$$L[p] = \lambda_1 \left( \int p(x)\, dx - 1 \right) + \lambda_2 \left( E[x] - \mu \right) + \lambda_3 \left( \mathrm{Var}(x) - \sigma^2 \right) + H[p].$$
We set the functional derivative equal to 0 for all $x$:
$$\frac{\delta}{\delta p(x)} L = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0.$$
We obtain
$$p(x) = \exp\!\left( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \right).$$
We are free to choose the Lagrange multipliers, as long as $\int p(x)\, dx = 1$, $E[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$ hold. We may set
$$\lambda_1 = 1 - \log \sigma \sqrt{2\pi}, \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2}.$$
Then we obtain
$$p(x) = \mathcal{N}(x; \mu, \sigma^2).$$
This is one reason for using the normal distribution when we do not know the true distribution.
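As a sanity check on the Lagrange multiplier recipe above, here is a tiny numerical instance; the particular $f$ and $g$ are illustrative choices, not from the slides. We maximize $f(x, y) = x + y$ on the unit circle $g(x, y) = x^2 + y^2 - 1 = 0$; the stationarity condition $\nabla f = -\lambda \nabla g$ gives $x = y$, and the constraint then gives $x = y = 1/\sqrt{2}$ at the maximum:

```python
import numpy as np
from scipy.optimize import minimize

# maximize f(x, y) = x + y  subject to  g(x, y) = x^2 + y^2 - 1 = 0
res = minimize(
    lambda p: -(p[0] + p[1]),  # minimize -f, i.e., maximize f
    x0=np.array([0.5, 0.5]),
    constraints={"type": "eq", "fun": lambda p: p[0] ** 2 + p[1] ** 2 - 1.0},
)
print(res.x)           # approx [0.7071, 0.7071]
print(1 / np.sqrt(2))  # the value predicted by the Lagrange conditions
```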
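The maximum entropy result above can likewise be checked numerically, under stated assumptions: discretize $p(x)$ on a grid (the grid limits and resolution below are arbitrary choices), maximize the discretized entropy subject to the three constraints, and compare against the Gaussian density:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-6.0, 6.0, 61)  # grid covering effectively all of the mass
dx = x[1] - x[0]

def neg_entropy(p):
    """Discretized -H[p] = sum_i p_i log p_i * dx (we minimize this)."""
    p = np.clip(p, 1e-12, None)
    return np.sum(p * np.log(p)) * dx

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) * dx - 1.0},                       # normalization
    {"type": "eq", "fun": lambda p: np.sum(x * p) * dx - mu},                    # E[x] = mu
    {"type": "eq", "fun": lambda p: np.sum((x - mu) ** 2 * p) * dx - sigma**2},  # Var(x) = sigma^2
]
p0 = np.full_like(x, 1.0 / (x[-1] - x[0]))  # start from a uniform density
res = minimize(neg_entropy, p0, method="SLSQP", bounds=[(0.0, None)] * len(x),
               constraints=constraints, options={"maxiter": 500})
# small maximum deviation: the optimizer recovers the Gaussian
# (up to discretization and truncation error)
print(np.max(np.abs(res.x - norm.pdf(x, mu, sigma))))
```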
Continuous Latent Variables

When our model contains continuous latent variables, we can perform variational inference by maximizing $L$ using calculus of variations. If we make the mean field approximation
$$q(h \mid v) = \prod_i q(h_i \mid v)$$
and fix $q(h_i \mid v)$ for all $i \neq j$, then the optimal $q(h_j \mid v)$ can be obtained by normalizing the unnormalized distribution
$$\tilde{q}(h_j \mid v) = \exp\!\left( E_{h_{-j} \sim q(h_{-j} \mid v)}\left[ \log p(h, v) \right] \right).$$
Thus, we apply the above equation iteratively for each value of $j$ until convergence; a minimal worked sketch follows the proof below.

Proof of the Mean Field Update

Claim: assuming $q(h \mid v) = \prod_i q(h_i \mid v)$, the optimal $q(h_j \mid v)$ is given by normalizing $\tilde{q}(h_j \mid v) = \exp\!\left( E_{h_{-j} \sim q(h_{-j} \mid v)}[\log p(h, v)] \right)$.

Proof: note that $\mathrm{ELBO} = E_{h \sim q}[\log p(h, v)] + H(q)$ and $q(h \mid v) = \prod_i q_i(h_i \mid v)$. Thus,
$$\mathrm{ELBO} = \int \prod_i q_i \, \log p(h, v)\, dh - \int \Big( \prod_i q_i \Big) \Big( \log \prod_i q_i \Big)\, dh = \int q_j \left\{ \int \log p(h, v) \prod_{i \neq j} q_i\, dh_i \right\} dh_j - \int \Big( \prod_i q_i \Big) \Big( \sum_i \log q_i \Big)\, dh.$$
If we keep only the terms involving $q_j$, the ELBO becomes
$$\int q_j \, E_{h_{-j}}[\log p(h, v)]\, dh_j - \int q_j \log q_j\, dh_j + \mathrm{const} = \int q_j \log p^*(h_j, v)\, dh_j - \int q_j \log q_j\, dh_j + \mathrm{const}, \qquad (1)$$
where $p^*(h_j, v)$ is a probability distribution defined by $\log p^*(h_j, v) = E_{h_{-j}}[\log p(h, v)] + \mathrm{const}$. Note that (1) is a negative KL divergence, $-D_{\mathrm{KL}}(q_j \,\|\, p^*(h_j, v))$, up to an additive constant; thus the $q_j$ maximizing the ELBO is $q_j = p^*(h_j, v)$. In that case, $\log q_j = \log p^*(h_j, v) = E_{h_{-j}}[\log p(h, v)] + \mathrm{const}$, and therefore
$$q_j \propto \exp\!\left( E_{h_{-j} \sim q(h_{-j} \mid v)}[\log p(h, v)] \right).$$
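To illustrate the fixed-point iteration, here is a minimal worked sketch for a case where the update has a closed form; the specific model is an illustrative choice, not from the slides. Take the posterior $p(h \mid v)$ to be a bivariate Gaussian $\mathcal{N}(m, \Lambda^{-1})$ and let $q(h \mid v) = q(h_1 \mid v)\, q(h_2 \mid v)$. Working out $E_{h_{-j} \sim q}[\log p(h, v)]$ shows that each optimal factor $q(h_j \mid v)$ is Gaussian with precision $\Lambda_{jj}$ and mean $m_j - \Lambda_{jk}\,(E[h_k] - m_k)/\Lambda_{jj}$ for $k \neq j$, so the coordinate updates reduce to updating the two means:

```python
import numpy as np

m = np.array([1.0, -1.0])     # posterior mean of p(h | v)
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])  # posterior precision matrix

mu = np.zeros(2)              # variational means E_q[h_j], arbitrary init
for _ in range(50):           # iterate the mean field updates to convergence
    mu[0] = m[0] - Lam[0, 1] * (mu[1] - m[1]) / Lam[0, 0]
    mu[1] = m[1] - Lam[1, 0] * (mu[0] - m[0]) / Lam[1, 1]
print(mu)                     # converges to the exact mean [1., -1.]
```

The iteration recovers the exact posterior mean, while each factor's variance $1/\Lambda_{jj}$ underestimates the true marginal variance, a well-known property of mean field approximations.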

What you need to know

• Inference as Optimization
• Expectation Maximization
• MAP Inference and Sparse Coding
• Variational Inference and Learning

Questions?
