CS 6347
Lecture 22-23
Exponential Families & Expectation Propagation

Discrete State Spaces
• We have been focusing on the case of MRFs over discrete state spaces
• Probability distributions over discrete spaces correspond to vectors of probabilities for each element in the space such that the vector sums to one
– The partition function is simply a sum over all of the possible values for each variable
– Entropy of the distribution is nonnegative and is also computed by summing over the state space
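As a concrete check, both quantities can be computed by brute-force enumeration for a small discrete MRF; the two-variable model and its potential values below are made up for illustration:

```python
import itertools
import math

# Toy pairwise MRF over two binary variables (hypothetical potential values).
psi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}

# Partition function: a sum over every joint configuration.
Z = sum(psi[x] for x in itertools.product([0, 1], repeat=2))

# Probabilities form a nonnegative vector that sums to one.
p = {x: psi[x] / Z for x in psi}
assert abs(sum(p.values()) - 1.0) < 1e-12

# Entropy is also a sum over the state space; it is always >= 0 here.
H = -sum(px * math.log(px) for px in p.values())
print(Z, H)  # Z = 9.0
```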
Continuous State Spaces

    p(x) = (1/Z) ∏_C ψ_C(x_C)

• For continuous state spaces, the partition function is now an integral

    Z = ∫ ∏_C ψ_C(x_C) dx

• The entropy becomes

    H(x) = -∫ p(x) log p(x) dx
Differential Entropy

    H(x) = -∫ p(x) log p(x) dx

• This is called the differential entropy
  – It is not always greater than or equal to zero
• Easy to construct such distributions:
  – Let q(x) be the uniform distribution over the interval [a, b]; what is the entropy of q(x)?
Differential Entropy

    H(x) = -∫ p(x) log p(x) dx

• This is called the differential entropy
  – It is not always greater than or equal to zero
• Easy to construct such distributions:
  – Let q(x) be the uniform distribution over the interval [a, b]; what is the entropy of q(x)?

    H(q) = -∫_a^b (1/(b - a)) log(1/(b - a)) dx = log(b - a)

  – This is negative whenever b - a < 1
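The worked answer above can be checked in a couple of lines (a sketch; the interval endpoints are arbitrary choices):

```python
import math

# Differential entropy of Uniform[a, b] is log(b - a),
# which is negative whenever the interval is shorter than 1.
def uniform_entropy(a, b):
    return math.log(b - a)

assert uniform_entropy(0.0, 2.0) > 0   # wide interval: positive entropy
assert uniform_entropy(0.0, 0.5) < 0   # narrow interval: negative entropy
print(uniform_entropy(0.0, 0.5))       # log(1/2) ≈ -0.693
```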
KL Divergence

    d(q‖p) = ∫ q(x) log (q(x)/p(x)) dx

• The KL-divergence is still nonnegative, even though it contains the differential entropy
  – This means that all of the observations that we made for finite state spaces carry over to the continuous case
  – The EM algorithm, mean-field methods, etc.
  – Most importantly,

    log Z ≥ H(q) + Σ_C ∫ q_C(x_C) log ψ_C(x_C) dx_C
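As an illustration (not from the slides), the nonnegativity of the continuous KL divergence can be checked numerically for two univariate Gaussians, comparing the closed-form divergence against a Monte Carlo estimate of the integral:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# d(q || p) for two 1-D Gaussians, in closed form.
def kl_gauss(mu0, s0, mu1, s1):
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

# Monte Carlo estimate of the same integral: E_q[log q(x) - log p(x)].
x = rng.normal(0.0, 1.0, size=200_000)   # samples from q = N(0, 1)
mc = np.mean(gauss_logpdf(x, 0.0, 1.0) - gauss_logpdf(x, 1.0, 2.0))

exact = kl_gauss(0.0, 1.0, 1.0, 2.0)
print(exact, mc)   # both ≈ 0.443: nonnegative, despite containing differential entropies
assert exact > 0 and abs(mc - exact) < 0.05
```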
Continuous State Spaces

• Examples of probability distributions over continuous state spaces
  – The uniform distribution over the interval [a, b]:

    q(x) = 1/(b - a),  x ∈ [a, b]

  – The multivariate normal distribution with mean μ and covariance matrix Σ:

    q(x) = (1/√((2π)^k det(Σ))) exp(-(1/2)(x - μ)^T Σ^{-1}(x - μ))
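A quick sanity check of the multivariate normal density as written above; the mean, covariance, and grid bounds are arbitrary choices:

```python
import numpy as np

# Density of the multivariate normal, written exactly as on the slide.
def mvn_pdf(x, mu, Sigma):
    k = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

# Sanity check: the density integrates to ~1 over a wide 2-D grid.
g = np.linspace(-6, 6, 201)
dx = g[1] - g[0]
total = sum(mvn_pdf(np.array([a, b]), mu, Sigma) for a in g for b in g) * dx**2
print(total)  # ≈ 1.0
assert abs(total - 1.0) < 1e-3
```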
Continuous State Spaces
• What makes continuous distributions so difficult to deal with?
– They may not be compactly representable
– Families of continuous distributions need not be closed under marginalization
• The marginal distributions of multivariate normal distributions are again (multivariate) normal distributions
– Integration problems of interest (e.g., the partition function or marginal distributions) may not have closed form solutions
• Integrals may also not exist!
The Exponential Family

    p(x|θ) = h(x) · exp(⟨θ, φ(x)⟩ - log Z(θ))

• A distribution is a member of the exponential family if its probability density function can be expressed as above for some choice of parameters θ and potential functions φ(x)
• We are only interested in models for which Z(θ) is finite
• The family of log-linear models that we have been focusing on in the discrete case belongs to the exponential family
The Exponential Family

    p(x|θ) = h(x) · exp(⟨θ, φ(x)⟩ - log Z(θ))

• As in the discrete case, there is not necessarily a unique way to express a distribution in this form
• We say that the representation is minimal if there does not exist a vector a ≠ 0 such that

    ⟨a, φ(x)⟩ = constant

  – In this case, there is a unique parameter vector associated with each member of the family
  – The φ are called the sufficient statistics of the distribution
The Multivariate Normal

    q(x|μ, Σ) = (1/Z) exp(-(1/2)(x - μ)^T Σ^{-1}(x - μ))

• The multivariate normal distribution is a member of the exponential family:

    q(x|θ) = (1/Z(θ)) exp(Σ_i θ_i x_i + Σ_{i≥j} θ_ij x_i x_j)

• The sufficient statistics are the x_i and the products x_i x_j; the corresponding mean parameters are the mean μ and the covariance matrix Σ (which must be positive semidefinite)
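A small check (with made-up values of μ and Σ) that the two parameterizations above agree: the natural parameters θ = Σ⁻¹μ and Θ = -½Σ⁻¹ reproduce the standard log-density up to an additive constant absorbed into log Z(θ):

```python
import numpy as np

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
P = np.linalg.inv(Sigma)          # precision matrix

# Natural (exponential-family) parameters of the multivariate normal:
# theta pairs with the statistics x, and Theta with the statistics x x^T.
theta = P @ mu
Theta = -0.5 * P

def log_unnorm_standard(x):
    return -0.5 * (x - mu) @ P @ (x - mu)

def log_unnorm_expfam(x):
    return theta @ x + x @ Theta @ x

# The two forms differ only by a constant (absorbed into log Z(theta)).
xs = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
diffs = [log_unnorm_standard(x) - log_unnorm_expfam(x) for x in xs]
print(diffs)  # all equal: the constant -1/2 mu^T Sigma^{-1} mu
assert max(diffs) - min(diffs) < 1e-9
```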
The Exponential Family
• Many of the distributions that you have seen before, both discrete and continuous, are members of the exponential family
  – Binomial, Poisson, Bernoulli, Gamma, Beta, Laplace, Categorical, etc.
• The exponential family, while not the most general parametric family, is one of the easiest to work with and captures a variety of different distributions
Continuous Bethe Approximation
• Recall that, from the nonnegativity of the KL-divergence
    log Z ≥ H(q) + Σ_C ∫ q_C(x_C) log ψ_C(x_C) dx_C

  for any probability distribution q
• We can make the same approximations that we did in the discrete case to approximate Z(θ) in the continuous case
Continuous Bethe Approximation

    max_{τ∈T_B} H_B(τ) + Σ_C ∫ τ_C(x_C) log ψ_C(x_C) dx_C

  where

    H_B(τ) = -Σ_{i∈V} ∫ τ_i(x_i) log τ_i(x_i) dx_i - Σ_C ∫ τ_C(x_C) log [τ_C(x_C) / ∏_{i∈C} τ_i(x_i)] dx_C

  and T_B is the set of vectors of locally consistent marginals
• This approximation is exact on trees
Continuous Belief Propagation

    p(x) = (1/Z) ∏_{i∈V} φ_i(x_i) ∏_{(i,j)∈E} ψ_ij(x_i, x_j)

• The messages passed by belief propagation are

    m_{i→j}(x_j) = ∫ φ_i(x_i) ψ_ij(x_i, x_j) ∏_{k∈N(i)∖j} m_{k→i}(x_i) dx_i

• Depending on the functional form of the potential functions, the message update may not have a closed form solution
– We can’t necessarily compute the correct marginal distributions/partition function even in the case of a tree!
Gaussian Belief Propagation

• When p(x) is a multivariate normal distribution, the message updates can be computed in closed form
  – In this case, max-product and sum-product are equivalent
  – Note that computing the mode of a multivariate normal is equivalent to solving a linear system of equations
  – Called Gaussian belief propagation or GaBP
  – Does not converge for all multivariate normal distributions
    • The messages can have an inverse covariance matrix that is not positive definite
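A minimal sketch of scalar GaBP on a tree-structured model p(x) ∝ exp(-½ xᵀJx + hᵀx); the chain model J, h below is invented for illustration. On a tree, the computed marginal means match the exact solution of the linear system Jx = h, mirroring the remark above:

```python
import numpy as np

# Scalar Gaussian BP on a tree-structured model p(x) ∝ exp(-1/2 x^T J x + h^T x).
J = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])    # chain: 0 - 1 - 2
h = np.array([1.0, 0.0, 1.0])
n = len(h)
edges = [(i, j) for i in range(n) for j in range(n) if i != j and J[i, j] != 0]

# Messages m_{i->j} are Gaussians in information form (precision mJ, potential mh).
mJ = {e: 0.0 for e in edges}
mh = {e: 0.0 for e in edges}

for _ in range(n):                  # enough sweeps to converge on a tree
    for (i, j) in edges:
        # Combine node i's local information with messages from neighbors except j.
        Jij = J[i, i] + sum(mJ[(k, t)] for (k, t) in edges if t == i and k != j)
        hij = h[i] + sum(mh[(k, t)] for (k, t) in edges if t == i and k != j)
        mJ[(i, j)] = -J[i, j] ** 2 / Jij
        mh[(i, j)] = -J[i, j] * hij / Jij

# Marginal means from the beliefs at each node.
means = np.array([
    (h[i] + sum(mh[(k, t)] for (k, t) in edges if t == i)) /
    (J[i, i] + sum(mJ[(k, t)] for (k, t) in edges if t == i))
    for i in range(n)
])
print(means, np.linalg.solve(J, h))   # identical on a tree
assert np.allclose(means, np.linalg.solve(J, h))
```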
Properties of Exponential Families
• Exponential families are
– Closed under multiplication
– Not closed under marginalization
• Easy to get mixtures of Gaussians when a model has both discrete and continuous variables
  – Let p(x, y) be such that x ∈ ℝⁿ and y ∈ {1, …, k}, with p(x|y) normally distributed and p(y) multinomially distributed
  – p(x) is then a Gaussian mixture (mixtures of exponential family distributions are not generally in the exponential family)

Properties of Exponential Families
• Derivatives of the log-partition function correspond to expectations of the sufficient statistics
    ∇_θ log Z(θ) = ∫ p(x|θ) φ(x) dx

• So do second derivatives:

    ∂² log Z(θ) / ∂θ_k ∂θ_l = ∫ p(x|θ) φ(x)_k φ(x)_l dx − ∫ p(x|θ) φ(x)_k dx · ∫ p(x|θ) φ(x)_l dx
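These identities can be verified numerically in the simplest case; the sketch below uses a Bernoulli variable, where φ(x) = x and Z(θ) = 1 + e^θ, with finite differences standing in for the derivatives:

```python
import math

# Bernoulli as an exponential family: p(x|theta) ∝ exp(theta * x), x ∈ {0, 1},
# so Z(theta) = 1 + e^theta.
def log_Z(theta):
    return math.log(1.0 + math.exp(theta))

theta = 0.7
p1 = math.exp(theta) / (1.0 + math.exp(theta))   # P(x = 1) under the model

# First derivative of log Z equals E[phi(x)] = E[x] = P(x = 1).
eps = 1e-6
d1 = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)
assert abs(d1 - p1) < 1e-8

# Second derivative equals Var[phi(x)] = p(1 - p).
eps = 1e-4
d2 = (log_Z(theta + eps) - 2 * log_Z(theta) + log_Z(theta - eps)) / eps**2
assert abs(d2 - p1 * (1.0 - p1)) < 1e-6
print(d1, d2)
```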
Mean Parameters
• Exponential family distributions can be equivalently characterized in terms of their mean parameters
• Consider the set of all vectors μ such that

    μ_k = ∫ q(x) φ_k(x) dx

  for some probability distribution q(x)
• If the representation is minimal, then every collection of mean parameters can be realized (perhaps as a limit) by some exponential family distribution
  – This characterization is unique
KL-Divergence and Exponential Families
• Minimizing KL divergence is equivalent to “moment matching”
• Let q(x|θ) = h(x) · exp(⟨θ, φ(x)⟩ - log Z(θ)) and let p(x) be an arbitrary distribution

    d(p‖q) = ∫ p(x) log (p(x)/q(x|θ)) dx

• This KL divergence is minimized when

    ∫ p(x) φ(x)_k dx = ∫ q(x|θ) φ(x)_k dx
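A small illustration of this moment matching (all mixture parameters below are arbitrary): minimizing d(p‖q) over Gaussians q amounts to giving q the mean and variance of p, here a two-component Gaussian mixture:

```python
import numpy as np

# M-projection of a two-component Gaussian mixture p onto a single Gaussian q:
# minimizing d(p || q) matches the sufficient statistics x and x^2, i.e.
# q gets the mixture's overall mean and variance.
w = np.array([0.3, 0.7])        # mixture weights (toy values)
mus = np.array([-2.0, 1.0])
sigmas = np.array([0.5, 1.0])

mean = np.sum(w * mus)
second_moment = np.sum(w * (sigmas**2 + mus**2))
var = second_moment - mean**2

# Monte Carlo check of the matched moments.
rng = np.random.default_rng(0)
comps = rng.choice(2, size=500_000, p=w)
samples = rng.normal(mus[comps], sigmas[comps])
print(mean, var)                # matched mean 0.1 and variance 2.665
assert abs(samples.mean() - mean) < 0.01
assert abs(samples.var() - var) < 0.05
```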
Expectation Propagation
• Key idea: given p(x) = (1/Z) ∏_C ψ_C(x_C), approximate it by a simpler distribution p̃(x) = (1/Z̃) ∏_C ψ̃_C(x_C)
• We could just replace each factor ψ_C with a member of some exponential family that best describes it, but this can result in a poor approximation unless each ψ_C is essentially a member of the exponential family already
• Instead, we construct the approximating distribution by performing a series of optimizations
Expectation Propagation
• Input: p(x) = (1/Z) ∏_C ψ_C(x_C)
• Initialize the approximate distribution p̃(x) = (1/Z̃) ∏_C ψ̃_C(x_C) so that each ψ̃_C(x_C) is a member of some exponential family
• Repeat until convergence
  – For each C
    • Let q(x) = (p̃(x) / ψ̃_C(x_C)) · ψ_C(x_C)
    • Set p̃(x) = argmin_{q′} d(q‖q′), where the minimization is over all distributions q′ in the chosen exponential family

Expectation Propagation
• EP over exponential family distributions maximizes the Bethe free energy subject to the following moment matching conditions (instead of the marginalization conditions)
    ∫ τ_i(x_i) φ_i(x_i) dx_i = ∫ τ_C(x_C) φ_i(x_i) dx_C

  where φ_i is a vector of sufficient statistics
Expectation Propagation
• Maximizing the Bethe free energy subject to these moment matching constraints is equivalent to a form of belief propagation where the beliefs are projected onto a set of allowable marginal distributions (e.g., those in a specific exponential family)
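One projection step of this kind can be sketched numerically: below, a Gaussian belief is multiplied by an exact non-Gaussian factor (a logistic function, a made-up choice), and the resulting tilted distribution is projected back onto a Gaussian by matching its first two moments on a grid:

```python
import numpy as np

# One EP-style projection step (a sketch): the tilted distribution combines the
# current Gaussian approximation with one exact non-Gaussian factor, and is then
# projected back onto a Gaussian by matching its mean and variance.
grid = np.linspace(-8, 8, 4001)
dx = grid[1] - grid[0]

prior = np.exp(-0.5 * grid**2)               # current approximation N(0, 1)
factor = 1.0 / (1.0 + np.exp(-3.0 * grid))   # exact non-Gaussian (logistic) factor

tilted = prior * factor
tilted /= tilted.sum() * dx                  # normalize on the grid

# Moment-matching projection: the KL-closest Gaussian shares these moments.
mean = np.sum(grid * tilted) * dx
var = np.sum((grid - mean) ** 2 * tilted) * dx
print(mean, var)                             # parameters of the projected Gaussian
assert mean > 0 and 0 < var < 1.0            # pulled right, and narrower than the prior
```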
• This is the approach that is often used to handle continuous distributions in practice
• Other methods include discretization/sampling methods that make use of BP in a discrete setting