Advanced Deep Learning

Approximate Inference-2

U Kang, Seoul National University

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference
- Variational Inference and Learning

MAP Inference

- If we wish to develop a learning process based on maximizing the ELBO L, then it is helpful to think of MAP inference as a procedure that provides a value of q.

- What is MAP inference?
  - It finds the single most likely value of the latent variables, $h^* = \arg\max_h p(h \mid v)$, rather than inferring the entire distribution over h.

MAP Inference

- Exact inference consists of maximizing the ELBO

$$L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$

with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm.
- To obtain MAP inference, we restrict the family of distributions q may be drawn from by requiring q to take on a Dirac distribution:

$$q(h \mid v) = \delta(h - \mu)$$

MAP Inference

- We can now control q entirely via μ. Dropping the terms of L that do not vary with μ, we are left with the optimization problem

$$\max_{\mu} \log p(h = \mu, v)$$

which is equivalent to the MAP inference problem

$$h^* = \arg\max_h p(h \mid v)$$

ELBO: $L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$

MAP Inference

- Thus, we can think of the following algorithm for maximizing the ELBO, which is similar to EM
  - Alternate the following two steps:
    - Perform MAP inference to infer h* while fixing θ
    - Update θ to increase log p(h*, v)

ELBO: $L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$
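The sketch below illustrates this alternating scheme. It assumes a hypothetical linear-Gaussian model, h ~ N(0, I) and v | h ~ N(Wh, I), so that both the MAP step and the parameter update have closed forms; the model, dimensions, and variable names are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Alternate MAP inference of h (E-like step) and updates of theta = W (M-like
# step) for an assumed model  h ~ N(0, I),  v | h ~ N(W h, I).
rng = np.random.default_rng(0)
n, d_v, d_h = 200, 10, 3
W_true = rng.normal(size=(d_v, d_h))
V = rng.normal(size=(n, d_h)) @ W_true.T + 0.1 * rng.normal(size=(n, d_v))

W = rng.normal(size=(d_v, d_h))                      # parameters theta
for it in range(50):
    # MAP step: h_n* = argmax_h log p(h, v_n) = (W^T W + I)^{-1} W^T v_n
    H = np.linalg.solve(W.T @ W + np.eye(d_h), W.T @ V.T).T
    # Learning step: maximize sum_n log p(h_n*, v_n) with respect to W
    W = np.linalg.solve(H.T @ H, H.T @ V).T
    # Joint log-probability at the MAP point (up to additive constants)
    log_joint = -0.5 * np.sum((V - H @ W.T) ** 2) - 0.5 * np.sum(H ** 2)
print("log p(h*, v) after training:", round(log_joint, 2))
```

Compared with EM proper, this learns from a point estimate h* rather than a full distribution over h, which is exactly the trade-off described above.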

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference
- Variational Inference and Learning
  - Discrete Latent Variables
  - Calculus of Variations
  - Continuous Latent Variables

Previously

- Lower bound L
- Inference = maximize L w.r.t. q
- Learning = maximize L w.r.t. θ
- EM algorithm -> allows us to make large learning steps with a fixed q
- MAP inference -> enables us to learn using a point estimate h* rather than inferring the entire distribution

Variational Methods

- We want to do the following:
  - Given this surveillance footage X, did the suspect show up in it?
  - Given this twitter feed X, is the author depressed?
- Problem: we cannot compute P(z|x)

Variational Methods

- Idea: re-write statistical inference problems as optimization problems
  - Statistical inference = infer the value of one random variable given the value of another
  - Optimization problem = find the parameter values that minimize a cost function
- Variational methods thus form a bridge between optimization and statistical techniques

Variational Learning

- Key idea: we can maximize L over a restricted family of distributions q
- L is a lower bound on log p(v; θ)
- Choose the family so that $\mathbb{E}_{h \sim q} \log p(h, v)$ is easy to compute
- Typically: impose that q is a factorial distribution

The Mean Field Approach

- q is a factorial distribution:

$$q(h \mid v) = \prod_i q(h_i \mid v)$$

  - where the h_i are independent (and thus q(h|v) cannot, in general, match the true distribution p(h|v))
- Advantage: no need to specify a parametric form for q
- The optimization problem determines the optimal probability distribution within the imposed independence constraints

Discrete Latent Variables

- The goal is to maximize the ELBO:

$$L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$

- In the mean field approach, we assume q is a factorial distribution
- We can parameterize q with a vector $\hat{h}$ whose entries are probabilities; then $q(h_i = 1 \mid v) = \hat{h}_i$
- Then, we simply optimize the parameters $\hat{h}$ with any standard optimization technique (e.g., gradient descent), as in the sketch below
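A minimal sketch of this idea, assuming a toy joint distribution over 3 binary latent units with log p(h, v) = hᵀWv + bᵀh up to a constant (the model, W, b, and v are made up for illustration). With so few units, the expectation in the ELBO can be computed by enumerating all configurations of h, and the mean field parameters ĥ are optimized through their logits with a standard optimizer.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# Toy joint (illustrative assumption): 3 binary latent units h, observed v,
# with  log p(h, v) = h^T W v + b^T h + const.  The additive constant does
# not change which mean field q maximizes the ELBO.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
v = np.array([1.0, 0.0, 1.0, 1.0])

def log_p_joint(h):                                    # up to a constant
    return h @ W @ v + b @ h

H_configs = np.array(list(itertools.product([0.0, 1.0], repeat=3)))

def neg_elbo(logits):
    h_hat = 1.0 / (1.0 + np.exp(-logits))              # q(h_i = 1 | v)
    # Mean field: q(h) = prod_i h_hat_i^{h_i} (1 - h_hat_i)^{1 - h_i}
    q = np.prod(np.where(H_configs == 1.0, h_hat, 1.0 - h_hat), axis=1)
    expected_log_p = np.sum(q * np.array([log_p_joint(h) for h in H_configs]))
    entropy = -np.sum(q * np.log(q + 1e-12))
    return -(expected_log_p + entropy)                  # minimize the negative ELBO

res = minimize(neg_elbo, x0=np.zeros(3))                # any standard optimizer
print("q(h_i = 1 | v):", np.round(1.0 / (1.0 + np.exp(-res.x)), 3))
```

In a real model one would use the analytic mean field fixed-point equations or stochastic gradients instead of brute-force enumeration; the point here is only that inference has become an optimization problem over ĥ.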

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning
  - Discrete Latent Variables
  - Continuous Latent Variables

Recap

- Inference is to compute

$$p(h \mid v) = \frac{p(v \mid h)\, p(h)}{p(v)}$$

- However, exact inference requires an exponential amount of time in these models
  - Computing p(v) is intractable
- Approximate inference is needed

Recap

- We compute the Evidence Lower Bound (ELBO) instead of p(v).
- ELBO:

$$L(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\big(q(h \mid v)\, \|\, p(h \mid v; \theta)\big)$$

- After rearranging the terms,

$$L(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$

- For any choice of q, L provides a lower bound on log p(v).
- If we take q equal to p(h|v), the bound is tight and we get log p(v) exactly.

Calculus of Variations

- In machine learning, our goal is usually to minimize a function J(θ) by finding the input vector θ ∈ ℝⁿ that minimizes it.
- This can be accomplished by solving for the critical points, where $\nabla_\theta J(\theta) = 0$.

Calculus of Variations

- But in some cases, we actually want to solve for a function f(x).
- Calculus of variations is a method for finding critical points with respect to a function f(x).
- A function of a function f is called a functional, J[f].
- We can take functional derivatives (a.k.a. variational derivatives) of J[f] with respect to the individual values of the function f(x) at any specific value of x.
- The functional derivative is denoted $\frac{\delta}{\delta f(x)} J$.
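As a quick example (added here for concreteness; it is not on the original slide), consider the functional

$$J[f] = \int f(x)^2\, dx, \qquad \frac{\delta}{\delta f(x)} J = 2 f(x),$$

which is the infinite-dimensional analogue of $\frac{\partial}{\partial x_i} \sum_j x_j^2 = 2 x_i$: the functional derivative tells us how J changes when we perturb f at the single point x.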

Calculus of Variations

- Euler-Lagrange equation (simplified form)
  - Consider a functional $J[f] = \int g(f(x), x)\, dx$. An extreme point of J is given by the condition $\frac{\partial}{\partial f(x)} g(f(x), x) = 0$ for all x.
- (Proof) Suppose we change f by an amount εη(x), where η(x) is an arbitrary function. Then

$$J[f + \epsilon\eta] = \int g\big(f(x) + \epsilon\eta(x), x\big)\, dx = \int \Big[ g(f(x), x) + \epsilon\,\eta(x)\, \frac{\partial g}{\partial f} \Big] dx = J[f] + \epsilon \int \eta(x)\, \frac{\partial g}{\partial f}\, dx$$

  - Here we use the first-order expansion $y(x_1 + \epsilon_1, \ldots, x_D + \epsilon_D) = y(x_1, \ldots, x_D) + \sum_{i=1}^{D} \epsilon_i \frac{\partial y}{\partial x_i} + O(\epsilon^2)$
- Note that at an extreme point, $J[f + \epsilon\eta] = J[f]$ to first order in ε, which implies $\int \eta(x)\, \frac{\partial g}{\partial f}\, dx = 0$. Since this holds for any η(x), we must have $\frac{\partial g}{\partial f} = 0$.

Calculus of Variations

- The ELBO L involves the entropy of q, which is a functional of q:

$$H[q] = -\int q(x) \log q(x)\, dx$$

- We want to maximize L.
- So we have to find the q that is a critical point of L, i.e., solve

$$\frac{\delta}{\delta q(x)} L = 0$$

Example

- Consider the problem of finding the probability density function over x ∈ ℝ that has the maximal differential entropy

$$H[p] = -\int p(x) \log p(x)\, dx$$

among all distributions with E[x] = μ and Var(x) = σ².
- I.e., the problem is to find p(x) maximizing H[p], subject to:
  - p(x) integrates to 1
  - E[x] = μ
  - Var(x) = σ²

Lagrange Multipliers

- How do we maximize (or minimize) a function subject to an equality constraint? Lagrange multipliers.
- Problem: maximize f(x) subject to g(x) = 0
- Solution: maximize the Lagrangian

$$L(x, \lambda) = f(x) + \lambda\, g(x)$$

  - λ is called the Lagrange multiplier
  - We find x and λ such that $\nabla_x L = 0$ and $\frac{\partial L}{\partial \lambda} = 0$
  - At a constrained optimum, $\nabla_x f(x)$ and $\nabla_x g(x)$ are both orthogonal to the surface g(x) = 0; thus $\nabla_x f(x) = -\lambda\, \nabla_x g(x)$ for some λ, which is exactly the condition $\nabla_x L = 0$
  - $\frac{\partial L}{\partial \lambda} = 0$ recovers the constraint g(x) = 0 (see the worked example below)
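A small worked example (added for concreteness; it is not part of the original slides): maximize f(x, y) = x + y subject to g(x, y) = x² + y² − 1 = 0.

$$L(x, y, \lambda) = x + y + \lambda\,(x^2 + y^2 - 1)$$

$$\frac{\partial L}{\partial x} = 1 + 2\lambda x = 0, \qquad \frac{\partial L}{\partial y} = 1 + 2\lambda y = 0, \qquad \frac{\partial L}{\partial \lambda} = x^2 + y^2 - 1 = 0$$

The first two conditions give x = y = −1/(2λ); substituting into the constraint gives x = y = ±1/√2, and the maximum f = √2 is attained at x = y = 1/√2 (with λ = −1/√2).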

Calculus of Variations

- Goal: find p(x) which maximizes $H[p] = -\int p(x)\log p(x)\, dx$, subject to $\int p(x)\, dx = 1$, $\mathbb{E}[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$
- Using Lagrange multipliers, we maximize the Lagrangian functional

$$\mathcal{L}[p] = \lambda_1 \Big( \int p(x)\, dx - 1 \Big) + \lambda_2 \big( \mathbb{E}[x] - \mu \big) + \lambda_3 \big( \mathbb{E}[(x - \mu)^2] - \sigma^2 \big) + H[p]$$

Calculus of Variations

- We set the functional derivative equal to 0 for all x:

$$\frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0$$

- We obtain

$$p(x) = \exp\big( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \big)$$

- We are free to choose the Lagrange multipliers, as long as $\int p(x)\, dx = 1$, $\mathbb{E}[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$
- We may set the following:

$$\lambda_1 = 1 - \log \sigma\sqrt{2\pi}, \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2}$$

- Then, we obtain

$$p(x) = \mathcal{N}(x;\, \mu, \sigma^2)$$

- This is one reason for using the normal distribution when we do not know the true distribution: it is the maximum-entropy distribution for a given mean and variance.
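The short script below is a numerical sanity check of this result (an added illustration, not from the slides): it maximizes the discretized differential entropy on a grid subject to the same three constraints and compares the optimum against the standard normal density.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Maximize the discretized entropy of p on a grid subject to normalization,
# E[x] = 0, and Var(x) = 1; the optimum should approximate N(0, 1).
x = np.linspace(-5, 5, 101)
dx = x[1] - x[0]

def neg_entropy(p):
    return np.sum(p * np.log(p + 1e-12)) * dx

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) * dx - 1.0},         # integrates to 1
    {"type": "eq", "fun": lambda p: np.sum(x * p) * dx},           # mean 0
    {"type": "eq", "fun": lambda p: np.sum(x**2 * p) * dx - 1.0},  # variance 1
]
p0 = np.full_like(x, 1.0 / (x[-1] - x[0]))                         # uniform start
res = minimize(neg_entropy, p0, method="SLSQP", constraints=constraints,
               bounds=[(1e-12, None)] * len(x), options={"maxiter": 500})
print("max |p - N(0, 1)| on the grid:", np.abs(res.x - norm.pdf(x)).max())
```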

Outline

- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning
  - Discrete Latent Variables
  - Calculus of Variations
  - Continuous Latent Variables

Continuous Latent Variables

- When our model contains continuous latent variables, we can perform variational inference by maximizing L using the calculus of variations.
- If we make the mean field approximation $q(h \mid v) = \prod_i q(h_i \mid v)$ and fix $q(h_i \mid v)$ for all $i \neq j$, then the optimal $q(h_j \mid v)$ can be obtained by normalizing the unnormalized distribution

$$\tilde{q}(h_j \mid v) = \exp\Big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}\big[ \log p(h, v) \big] \Big)$$

- Thus, we apply the above equation iteratively for each value of j until convergence

Proof of Mean Field Approximation

- (Claim) Assuming $q(h \mid v) = \prod_i q(h_i \mid v)$, the optimal $q(h_j \mid v)$ is given by normalizing the unnormalized distribution $\tilde{q}(h_j \mid v) = \exp\big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}[\log p(h, v)] \big)$
- (Proof) Note that $\text{ELBO} = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$ and $q(h \mid v) = \prod_i q_i(h_i \mid v)$.
- Thus,

$$\text{ELBO} = \int \prod_i q_i\, \log p(h, v)\, dh - \int \Big( \prod_i q_i \Big) \Big( \log \prod_i q_i \Big) dh = \int q_j \Big\{ \int \log p(h, v) \prod_{i \neq j} q_i\, dh_i \Big\} dh_j - \int \Big( \prod_i q_i \Big) \Big( \sum_i \log q_i \Big) dh$$

- If we keep only the terms involving $q_j$, the ELBO becomes

$$\int q_j\, \mathbb{E}_{h_{-j}}[\log p(h, v)]\, dh_j - \int q_j \log q_j\, dh_j + \text{const} = \int q_j \log p^*(h_j, v)\, dh_j - \int q_j \log q_j\, dh_j + \text{const} \quad (1)$$

  where $p^*(h_j, v)$ is a probability distribution with $\log p^*(h_j, v) = \mathbb{E}_{h_{-j}}[\log p(h, v)] + \text{const}$
- Note that (1) is the negative KL divergence $-D_{\mathrm{KL}}\big(q_j\, \|\, p^*(h_j, v)\big)$ up to a constant; thus the best $q_j$ maximizing the ELBO is $q_j = p^*(h_j, v)$.
- In that case, $\log q_j = \log p^*(h_j, v) = \mathbb{E}_{h_{-j}}[\log p(h, v)] + \text{const}$. Thus,

$$q_j \propto \exp\big( \mathbb{E}_{h_{-j}}[\log p(h, v)] \big) = \exp\Big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}\big[ \log p(h, v) \big] \Big)$$
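To make the fixed-point update concrete, here is a small numpy sketch for a hypothetical case in which the true posterior p(h | v) is a 2-D Gaussian N(m, Λ⁻¹) (the numbers m and Λ are made up). Plugging log p(h, v) into the update equation makes each optimal factor q(h_j | v) Gaussian with variance 1/Λ_jj, and its mean is obtained by the coordinate update below, iterated until convergence.

```python
import numpy as np

# Coordinate-ascent mean field for an assumed 2-D Gaussian posterior
# p(h | v) = N(m, inv(Lam)).  Each factor q(h_j | v) is Gaussian with
# variance 1 / Lam[j, j]; only its mean needs to be updated.
m = np.array([1.0, -2.0])                        # true posterior mean
Lam = np.array([[2.0, 1.2],                      # true posterior precision
                [1.2, 3.0]])

mu = np.zeros(2)                                 # means of the factors q(h_j | v)
for sweep in range(20):
    for j in range(2):
        others = [i for i in range(2) if i != j]
        # E_{h_{-j} ~ q}[log p(h, v)] is quadratic in h_j; completing the
        # square gives the mean of the optimal Gaussian factor:
        mu[j] = m[j] - (Lam[j, others] @ (mu[others] - m[others])) / Lam[j, j]

print("mean field means:", mu)                   # converges to m = [1, -2]
print("factor variances:", 1.0 / np.diag(Lam))   # underestimate the true marginals
```

The means converge to the true posterior mean, while the factorized variances 1/Λ_jj are smaller than the true marginal variances, which is the usual compactness bias of the mean field approximation.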

What you need to know

- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning

Questions?
