Advanced Deep Learning
Approximate Inference-2
U Kang Seoul National University
Outline
- Inference as Optimization
- Expectation Maximization
- MAP Inference
- Variational Inference and Learning
MAP Inference
If we wish to develop a learning process based on maximizing $\mathcal{L}(v, \theta, q)$, then it is helpful to think of MAP inference as a procedure that provides a value of $q$.
What is MAP inference? It finds the single most likely value of the latent variable, $h^* = \arg\max_h p(h \mid v)$, rather than the full posterior distribution.
MAP Inference
Exact inference consists of maximizing
$$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$
with respect to $q$ over an unrestricted family of probability distributions, using an exact optimization algorithm. Instead, we restrict the family of distributions $q$ may be drawn from to the Dirac distributions
$$q(h \mid v) = \delta(h - \mu)$$
MAP Inference
We can now control $q$ entirely via $\mu$. Dropping terms of $\mathcal{L}$ that do not vary with $\mu$, we are left with the optimization problem
$$\mu^* = \arg\max_\mu \log p(h = \mu, v)$$
which is equivalent to the MAP inference problem
$$h^* = \arg\max_h p(h \mid v)$$
A tiny numeric illustration follows.
ELBO: $\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$
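To make the MAP problem concrete, here is a minimal sketch in Python. The prior and likelihood numbers are illustrative assumptions, not from the slides: for a latent $h$ with three possible values, MAP inference returns the single most likely $h$, and $p(v)$ cancels in the argmax.

```python
import numpy as np

# Toy MAP inference by enumeration (assumed illustrative numbers):
# p(h | v) is proportional to p(v | h) p(h); MAP inference returns the
# single most likely h rather than the full posterior distribution.
p_h = np.array([0.5, 0.3, 0.2])          # prior p(h)
p_v_given_h = np.array([0.1, 0.6, 0.4])  # likelihood p(v | h) for the observed v
joint = p_v_given_h * p_h                # proportional to p(h | v); p(v) cancels
h_map = np.argmax(joint)
print("h* =", h_map, " full posterior:", (joint / joint.sum()).round(3))
```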
MAP Inference
Thus, we can think of the following algorithm for maximizing the ELBO, which is similar to EM. Alternate the following two steps (a minimal code sketch follows):
- Perform MAP inference to infer $h^*$ while fixing $\theta$
- Update $\theta$ to increase $\log p(h^*, v)$
ELBO: $\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$
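Here is a minimal sketch of this alternation, assuming a toy linear-Gaussian model $p(h) = \mathcal{N}(0, I)$, $p(v \mid h) = \mathcal{N}(Wh, I)$; the model choice and all names are illustrative assumptions, chosen because both steps then have closed forms.

```python
import numpy as np

# Minimal sketch of the MAP/EM-style alternation, assuming a toy
# linear-Gaussian model p(h) = N(0, I), p(v | h) = N(W h, I).
rng = np.random.default_rng(0)
n, d, k = 200, 10, 3
W_true = rng.normal(size=(d, k))
V = rng.normal(size=(n, k)) @ W_true.T + 0.1 * rng.normal(size=(n, d))

W = rng.normal(size=(d, k))  # parameters theta
for _ in range(50):
    # Step 1: MAP inference of h* with theta fixed. For this model,
    # argmax_h log p(h, v) has the closed form (W^T W + I)^{-1} W^T v.
    H = V @ W @ np.linalg.inv(W.T @ W + np.eye(k))
    # Step 2: update theta to increase log p(h*, v). Given H, the
    # optimal W is an ordinary least-squares fit of V on H.
    W = np.linalg.solve(H.T @ H, H.T @ V).T

print("relative reconstruction error:",
      np.linalg.norm(V - H @ W.T) / np.linalg.norm(V))
```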
Outline
- Inference as Optimization
- Expectation Maximization
- MAP Inference
- Variational Inference and Learning
  - Discrete Latent Variables
  - Calculus of Variations
  - Continuous Latent Variables
Previously
- Lower bound $\mathcal{L}$
- Inference = maximize $\mathcal{L}$ w.r.t. $q$
- Learning = maximize $\mathcal{L}$ w.r.t. $\theta$
- EM algorithm: allows us to make large learning steps with a fixed $q$
- MAP inference: enables us to learn using a point estimate $h^*$ rather than inferring the entire distribution
Variational Methods
We want to answer questions such as:
- Given this surveillance footage $X$, did the suspect show up in it?
- Given this Twitter feed $X$, is the author depressed?
Problem: we cannot compute the posterior $p(z \mid x)$ exactly.
Variational Methods
Idea: re-write statistical inference problems as optimization problems.
- Statistical inference = infer the value of a random variable given the value of another
- Optimization problem = find the parameter values that minimize a cost function
- This provides a bridge between optimization and statistical techniques
Variational Learning
Key idea: we can maximize $\mathcal{L}$ over a restricted family of distributions $q$.
- $\mathcal{L}$ is a lower bound of $\log p(v; \theta)$
- Choose the family such that $\mathbb{E}_{h \sim q}[\log p(h, v)]$ is easy to compute
- Typically: impose that $q$ is a factorial distribution
The Mean Field Approach
$q$ is a factorial distribution:
$$q(h \mid v) = \prod_i q(h_i \mid v)$$
where the $h_i$ are independent (and thus $q(h \mid v)$ cannot, in general, match the true distribution $p(h \mid v)$).
Advantage: we do not need to specify a parametric form for $q$. The optimization problem determines the optimal probability distribution within the imposed constraints.
Discrete Latent Variables
The goal is to maximize the ELBO:
$$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$
In the mean field approach, we assume $q$ is a factorial distribution:
$$q(h \mid v) = \prod_i q(h_i \mid v)$$
For binary $h_i$, we can parameterize $q$ with a vector $\hat{h}$ whose entries are probabilities, so that $q(h_i = 1 \mid v) = \hat{h}_i$. Then we simply optimize the parameters $\hat{h}$ by any standard optimization technique (e.g., gradient descent), as in the sketch below.
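As an illustration, here is a minimal sketch (not from the slides) that optimizes $\hat{h}$ by gradient ascent on the ELBO for a model with three binary latent variables. The log-joint `log_p_joint` is an arbitrary stand-in for $\log p(h, v)$ with $v$ fixed, and the gradient is taken numerically for simplicity.

```python
import itertools
import numpy as np

def log_p_joint(h):
    # Stand-in for log p(h, v) with v fixed: an arbitrary pairwise
    # (Ising-style) model over 3 binary latents, chosen for illustration.
    J = np.array([[0.0, 1.0, -1.0], [1.0, 0.0, 2.0], [-1.0, 2.0, 0.0]])
    b = np.array([0.5, -0.5, 1.0])
    return h @ J @ h + b @ h

def elbo(hhat):
    # ELBO = E_{h~q}[log p(h, v)] + H(q), with the factorial
    # q(h_i = 1 | v) = hhat_i, computed by enumerating all 2^3 states
    # (feasible only for tiny toy models).
    val = 0.0
    for bits in itertools.product([0, 1], repeat=3):
        h = np.array(bits, dtype=float)
        q = np.prod(np.where(h == 1, hhat, 1 - hhat))
        val += q * (log_p_joint(h) - np.log(q))
    return val

# Gradient ascent on the parameters hhat, using a numerical gradient.
hhat = np.full(3, 0.5)
for _ in range(500):
    grad = np.array([(elbo(hhat + 1e-5 * e) - elbo(hhat - 1e-5 * e)) / 2e-5
                     for e in np.eye(3)])
    hhat = np.clip(hhat + 0.01 * grad, 1e-4, 1 - 1e-4)

print("mean-field marginals q(h_i = 1 | v):", hhat.round(3))
```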
Outline
- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning
  - Discrete Latent Variables
  - Calculus of Variations
  - Continuous Latent Variables
Recap
- Inference is to compute $p(h \mid v) = \dfrac{p(v \mid h)\, p(h)}{p(v)}$
- However, exact inference requires an exponential amount of time in these models: computing $p(v)$ is intractable
- Approximate inference is needed
Recap
We compute the Evidence Lower Bound (ELBO) instead of $p(v)$. ELBO:
$$\mathcal{L}(v, \theta, q) = \log p(v; \theta) - D_{KL}\big(q(h \mid v)\,\|\,p(h \mid v; \theta)\big)$$
After rearranging the equation,
$$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$$
For any choice of $q$, $\mathcal{L}$ provides a lower bound on $\log p(v)$. If we take $q$ equal to $p(h \mid v)$, the bound is tight and we get $\log p(v)$ exactly, as the numeric check below illustrates.
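A quick numeric check of the bound, with assumed toy numbers: for a discrete latent $h$, the ELBO of any $q$ stays below $\log p(v)$ and matches it exactly when $q$ is the true posterior.

```python
import numpy as np

# Numeric check (toy numbers assumed): for a discrete h,
# ELBO(q) = E_{h~q}[log p(h, v)] + H(q) <= log p(v), with equality
# when q(h) = p(h | v).
p_joint = np.array([0.10, 0.25, 0.05])   # p(h, v) for the observed v, h in {0,1,2}
log_p_v = np.log(p_joint.sum())

def elbo(q):
    return np.sum(q * (np.log(p_joint) - np.log(q)))

q_uniform = np.full(3, 1 / 3)
q_exact = p_joint / p_joint.sum()        # the true posterior p(h | v)
print("log p(v):            ", log_p_v)
print("ELBO(uniform q):     ", elbo(q_uniform))  # strictly below log p(v)
print("ELBO(exact posterior):", elbo(q_exact))   # equals log p(v)
```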
Calculus of Variations
In machine learning, the goal is usually to minimize a function $J(\theta)$ by finding the input vector $\theta \in \mathbb{R}^n$ that achieves the minimum. This can be accomplished by solving for the critical points, where $\nabla_\theta J(\theta) = 0$.
Calculus of Variations
But in some cases, we actually want to solve for a function $f(x)$. Calculus of variations is a method for finding critical points with respect to $f(x)$.
- A function of a function $f$ is called a functional, $J[f]$.
- We can take functional derivatives (a.k.a. variational derivatives) of $J[f]$ with respect to individual values of the function $f(x)$ at any specific value of $x$.
- The functional derivative is denoted $\frac{\delta}{\delta f(x)} J$.
Calculus of Variations
Euler-Lagrange equation (simplified form): consider a functional $J[f] = \int g(f(x), x)\, dx$. An extreme point of $J$ is given by the condition
$$\frac{\partial}{\partial f(x)} g(f(x), x) = 0$$
(Proof) Assume we change $f$ by the amount $\epsilon\, \eta(x)$ for an arbitrary function $\eta(x)$. Then
$$J[f + \epsilon \eta] = \int g\big(f(x) + \epsilon \eta(x), x\big)\, dx = \int \Big[ g(f(x), x) + \epsilon \eta(x) \frac{\partial g}{\partial f} \Big]\, dx = J[f] + \epsilon \int \eta(x) \frac{\partial g}{\partial f}\, dx$$
using the first-order expansion $y(x_1 + \epsilon_1, \ldots, x_D + \epsilon_D) = y(x_1, \ldots, x_D) + \sum_{i=1}^{D} \epsilon_i \frac{\partial y}{\partial x_i} + O(\epsilon^2)$.
Note that at an extreme point, $J[f + \epsilon \eta] = J[f]$, which implies $\int \eta(x) \frac{\partial g}{\partial f}\, dx = 0$. Since this is true for any $\eta(x)$, we must have $\frac{\partial g}{\partial f} = 0$. A symbolic illustration follows.
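As a quick symbolic illustration (an assumption of this writeup, not part of the slides), sympy's `euler_equations` implements the general Euler-Lagrange condition; when $g$ does not depend on $f'(x)$, it reduces to exactly the condition $\partial g / \partial f(x) = 0$ above.

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

# Illustrative check: for J[f] = integral of g(f(x), x) dx with
# g = (f(x) - x)**2, the condition dg/df(x) = 0 gives f(x) = x.
# euler_equations handles the general case with f'(x) terms; here g
# has no f', so the general equation reduces to dg/df = 0.
x = sp.symbols("x")
f = sp.Function("f")
g = (f(x) - x) ** 2
print(euler_equations(g, f(x), x))  # an equation equivalent to f(x) = x
```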
Calculus of Variations
Recall the entropy term of the ELBO:
$$H(q) = -\int q(x) \log q(x)\, dx$$
We want to maximize $\mathcal{L}$, so we have to find the $q$ that is a critical point of $\mathcal{L}$: find $q$ such that
$$\frac{\delta}{\delta q(x)} \mathcal{L} = 0$$
Example
Consider the problem of finding the probability distribution function over $x \in \mathbb{R}$ that has maximal differential entropy
$$H[p] = -\int p(x) \log p(x)\, dx$$
among the distributions with $\mathbb{E}[x] = \mu$ and $\mathrm{Var}(x) = \sigma^2$. I.e., the problem is to find $p(x)$ maximizing $H[p]$, such that
- $p(x)$ integrates to 1
- $\mathbb{E}[x] = \mu$
- $\mathrm{Var}(x) = \sigma^2$
Lagrange Multipliers
How do we maximize (or minimize) a function under an equality constraint? Lagrange multipliers.
Problem: maximize $f(x)$ subject to $g(x) = 0$.
Solution: maximize $L(x, \lambda) = f(x) + \lambda \cdot g(x)$, where $\lambda$ is called the Lagrange multiplier.
- We find $x$ and $\lambda$ s.t. $\nabla_x L = 0$ and $\frac{\partial L}{\partial \lambda} = 0$
- At a constrained optimum, $\nabla_x f(x)$ and $\nabla_x g(x)$ are both orthogonal to the surface $g(x) = 0$; thus $\nabla_x f(x) = -\lambda\, \nabla_x g(x)$ for some $\lambda$
- $\frac{\partial L}{\partial \lambda} = 0$ recovers the constraint $g(x) = 0$
A small worked example follows.
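A small worked example (illustrative, not from the slides): maximize the entropy of a three-outcome distribution subject to normalization, using a single Lagrange multiplier. The stationary point should be the uniform distribution.

```python
import sympy as sp

# Worked example of Lagrange multipliers: maximize the entropy
# f = -sum_i p_i log p_i subject to g = p1 + p2 + p3 - 1 = 0.
# The constrained maximum should be the uniform distribution.
p1, p2, p3 = sp.symbols("p1 p2 p3", positive=True)
lam = sp.symbols("lam")  # the multiplier may be negative here
f = -(p1 * sp.log(p1) + p2 * sp.log(p2) + p3 * sp.log(p3))
g = p1 + p2 + p3 - 1
L = f + lam * g  # the Lagrangian L(x, lambda) = f(x) + lambda * g(x)

# Stationarity: grad_x L = 0 and dL/dlam = 0 (the latter recovers g = 0).
eqs = [sp.diff(L, v) for v in (p1, p2, p3, lam)]
print(sp.solve(eqs, (p1, p2, p3, lam)))  # should give p1 = p2 = p3 = 1/3
```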
Calculus of Variations
Goal: find $p(x)$ which maximizes $H[p] = -\int p(x) \log p(x)\, dx$, s.t. $\int p(x)\, dx = 1$, $\mathbb{E}[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$. Using Lagrange multipliers, we maximize the Lagrangian functional
$$\mathcal{L}[p] = \lambda_1 \Big( \int p(x)\, dx - 1 \Big) + \lambda_2 \big( \mathbb{E}[x] - \mu \big) + \lambda_3 \big( \mathrm{Var}(x) - \sigma^2 \big) + H[p]$$
Calculus of Variations
We set the functional derivative equal to 0: for all $x$,
$$\frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0$$
We obtain
$$p(x) = \exp\big( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \big)$$
We are free to choose the Lagrange multipliers, as long as $\int p(x)\, dx = 1$, $\mathbb{E}[x] = \mu$, and $\mathrm{Var}(x) = \sigma^2$. We may set
$$\lambda_1 = 1 - \log \sigma \sqrt{2\pi}, \quad \lambda_2 = 0, \quad \lambda_3 = -\frac{1}{2\sigma^2}$$
Then, we obtain
$$p(x) = \mathcal{N}(x; \mu, \sigma^2)$$
This is one reason for using the normal distribution when we do not know the true distribution. A symbolic check follows.
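A short symbolic check of the last step (an illustration, not from the slides): substituting the chosen multipliers into $p(x) = \exp(\lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1)$ recovers the normal density.

```python
import sympy as sp

# Symbolic check: plugging the multipliers above into
# p(x) = exp(lam1 + lam2*x + lam3*(x - mu)**2 - 1) recovers N(x; mu, sigma^2).
x, mu = sp.symbols("x mu", real=True)
sigma = sp.symbols("sigma", positive=True)
lam1 = 1 - sp.log(sigma * sp.sqrt(2 * sp.pi))
lam2 = 0
lam3 = -1 / (2 * sigma**2)
p = sp.exp(lam1 + lam2 * x + lam3 * (x - mu) ** 2 - 1)
normal = sp.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))
print(sp.simplify(p - normal))  # -> 0
```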
Outline
- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning
  - Discrete Latent Variables
  - Calculus of Variations
  - Continuous Latent Variables
Continuous Latent Variables
When our model contains continuous latent variables, we can perform variational inference by maximizing $\mathcal{L}$ using calculus of variations. If we make the mean field approximation
$$q(h \mid v) = \prod_i q(h_i \mid v)$$
and fix $q(h_i \mid v)$ for all $i \neq j$, then the optimal $q(h_j \mid v)$ can be obtained by normalizing the unnormalized distribution
$$\tilde{q}(h_j \mid v) = \exp\big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}[\log p(h, v)] \big)$$
Thus, we apply the above equation iteratively for each value of $j$ until convergence. A minimal sketch follows.
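For concreteness, here is a minimal sketch of this coordinate update in a classic toy case (following Bishop's example; the specific numbers are assumptions): when the true $p(h \mid v)$ is a 2-D Gaussian $\mathcal{N}(m, \Lambda^{-1})$, each factor obtained from the update above is itself Gaussian, and the update reduces to a closed-form formula for its mean.

```python
import numpy as np

# Mean-field coordinate updates for a toy case where the true posterior
# p(h | v) is a 2-D Gaussian N(m, inv(Lam)) (cf. Bishop, ch. 10).
# Applying q(h_j) ~ exp(E_{h_{-j}}[log p(h, v)]) shows each factor is
# Gaussian with mean  m_j - Lam[j,i] * (E[h_i] - m_i) / Lam[j,j].
m = np.array([1.0, -1.0])                  # true posterior mean
Lam = np.array([[2.0, 0.9], [0.9, 1.0]])   # true posterior precision

mu = np.zeros(2)  # current means of the factors q(h_0), q(h_1)
for _ in range(20):
    for j in range(2):
        i = 1 - j
        mu[j] = m[j] - Lam[j, i] * (mu[i] - m[i]) / Lam[j, j]

print("mean-field means:", mu.round(4), "  true mean:", m)
```

In this toy case the iteration converges to the true posterior mean, while the mean-field variances underestimate the true ones, a well-known property of the factorial approximation.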
Proof of Mean Field Approximation
(Claim) Assuming $q(h \mid v) = \prod_i q(h_i \mid v)$, the optimal $q(h_j \mid v)$ is given by normalizing the unnormalized distribution
$$\tilde{q}(h_j \mid v) = \exp\big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}[\log p(h, v)] \big)$$
(Proof) Note that ELBO $= \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q)$, and $q(h \mid v) = \prod_i q_i(h_i \mid v)$. Thus
$$\text{ELBO} = \int \Big( \prod_i q_i \Big) \log p(h, v)\, dh - \int \Big( \prod_i q_i \Big) \Big( \log \prod_i q_i \Big) dh = \int q_j \Big\{ \int \log p(h, v) \prod_{i \neq j} q_i\, dh_i \Big\} dh_j - \int \Big( \prod_i q_i \Big) \Big( \sum_i \log q_i \Big) dh$$
If we keep only the terms involving $q_j$, the ELBO becomes
$$\int q_j\, \mathbb{E}_{h_{-j}}[\log p(h, v)]\, dh_j - \int q_j \log q_j\, dh_j + \text{const} = \int q_j \log p^*(h_j, v)\, dh_j - \int q_j \log q_j\, dh_j + \text{const} \quad (1)$$
where $p^*(h_j, v)$ is a probability distribution with $\log p^*(h_j, v) = \mathbb{E}_{h_{-j}}[\log p(h, v)] + \text{const}$.
Note that (1) is, up to a constant, the negative KL divergence $-D_{KL}\big(q_j \,\|\, p^*(h_j, v)\big)$; thus the best $q_j$ maximizing the ELBO is $q_j = p^*(h_j, v)$. In that case, $\log q_j = \log p^*(h_j, v) = \mathbb{E}_{h_{-j}}[\log p(h, v)] + \text{const}$. Thus
$$q_j \propto \exp\big( \mathbb{E}_{h_{-j}}[\log p(h, v)] \big) = \exp\big( \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}[\log p(h, v)] \big)$$
What you need to know
- Inference as Optimization
- Expectation Maximization
- MAP Inference and Sparse Coding
- Variational Inference and Learning
Questions?