
CS 6347

Lecture 22-23

Exponential Families & Expectation Propagation

Discrete State Spaces

• We have been focusing on the case of MRFs over discrete state spaces

– Distributions over discrete spaces correspond to vectors with one nonnegative entry for each element of the state space, such that the vector sums to one

– The partition function is simply a sum over all of the possible values for each variable

– The entropy of the distribution is nonnegative and is also computed by summing over the state space

Continuous State Spaces

p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)

• For continuous state spaces, the partition function is now an integral

Z = \int \prod_C \psi_C(x_C) \, dx

• The entropy becomes

H(x) = -\int p(x) \log p(x) \, dx


Differential Entropy

H(x) = -\int p(x) \log p(x) \, dx

• This is called the differential entropy

– It is not always greater than or equal to zero

• Easy to construct such distributions:

– Let q(x) be the uniform distribution over the interval [a, b]; what is the entropy of q(x)?

H(q) = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a} \, dx = \log(b-a)
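The sign of \log(b-a) flips as the interval shrinks below length one, which is easy to check numerically. A minimal sketch (not from the slides; assumes NumPy, and the function name is mine):

```python
import numpy as np

def uniform_differential_entropy(a, b):
    """Differential entropy of Uniform[a, b]: H = log(b - a)."""
    return np.log(b - a)

h_wide = uniform_differential_entropy(0.0, 2.0)    # log 2 > 0
h_narrow = uniform_differential_entropy(0.0, 0.5)  # log(1/2) < 0

# Cross-check the narrow case by midpoint-rule quadrature of -∫ p(x) log p(x) dx.
n = 100_000
dx = 0.5 / n
p = np.full(n, 1.0 / 0.5)          # constant density of Uniform[0, 0.5]
h_num = -np.sum(p * np.log(p)) * dx
```

Unlike discrete entropy, nothing prevents the quadrature result from being negative; it agrees with the closed form \log(b-a).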

KL-Divergence

d(q \| p) = \int q(x) \log \frac{q(x)}{p(x)} \, dx

• The KL-divergence is still nonnegative, even though it contains the differential entropy

– This means that all of the observations that we made for finite state spaces will carry over to the continuous case

– The EM algorithm, mean-field methods, etc.

– Most importantly,

\log Z \ge H(q) + \sum_C \int q_C(x_C) \log \psi_C(x_C) \, dx_C

Continuous State Spaces

• Examples of probability distributions over continuous state spaces

– The uniform distribution over the interval [a, b]

q(x) = \frac{1}{b-a}, \quad x \in [a, b]

– The multivariate normal distribution with mean \mu and covariance \Sigma

q(x) = \frac{1}{\sqrt{(2\pi)^k \det(\Sigma)}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)
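The multivariate normal density can be evaluated directly from the formula. A sketch (NumPy; the example mean and covariance are mine):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density, computed straight from the formula."""
    k = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])

# The density peaks at the mean, with value 1 / sqrt((2*pi)^k det(Sigma)).
p_peak = mvn_pdf(mu, mu, Sigma)
p_off = mvn_pdf(np.array([1.0, -1.0]), mu, Sigma)
```

Using `np.linalg.solve` rather than explicitly forming \Sigma^{-1} is the standard numerically stable choice.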

Continuous State Spaces

• What makes continuous distributions so difficult to deal with?

– They may not be compactly representable

– Families of continuous distributions need not be closed under marginalization

• An exception: the marginal distributions of multivariate normal distributions are again (multivariate) normal distributions

– Integration problems of interest (e.g., the partition function or marginal distributions) may not have closed form solutions

• The integrals may also not exist!

The Exponential Family

p(x | \theta) = h(x) \exp\left( \langle \theta, \phi(x) \rangle - \log Z(\theta) \right)

• A distribution is a member of the exponential family if its probability density function can be expressed as above for some choice of parameters \theta and potential functions \phi(x)

• We are only interested in models for which Z(\theta) is finite

• The family of log-linear models that we have been focusing on in the discrete case belongs to the exponential family

The Exponential Family

p(x | \theta) = h(x) \exp\left( \langle \theta, \phi(x) \rangle - \log Z(\theta) \right)

• As in the discrete case, there is not necessarily a unique way to express a distribution in this form

• We say that the representation is minimal if there does not exist a vector a \neq 0 such that

\langle a, \phi(x) \rangle = \text{constant}

– In this case, there is a unique vector \theta associated with each member of the family

– The functions \phi are called sufficient statistics of the distribution

The Multivariate Normal

q(x | \mu, \Sigma) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)

• The multivariate normal distribution is a member of the exponential family

q(x | \theta) = \frac{1}{Z(\theta)} \exp\left( \sum_i \theta_i x_i + \sum_{i \ge j} \theta_{ij} x_i x_j \right)

• The mean \mu and the covariance matrix \Sigma (which must be positive semidefinite) are sufficient statistics of the multivariate normal distribution
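The map between (\mu, \Sigma) and the natural parameters above can be sketched in matrix form, collecting the linear terms into a vector \theta and the quadratic terms into a matrix \Theta so that the density is proportional to \exp(\theta^T x + x^T \Theta x). The function names below are mine (assumes NumPy):

```python
import numpy as np

def to_natural(mu, Sigma):
    """(mu, Sigma) -> natural parameters: density ∝ exp(theta^T x + x^T Theta x)."""
    Lam = np.linalg.inv(Sigma)        # precision matrix
    return Lam @ mu, -0.5 * Lam

def to_mean(theta, Theta):
    """Invert the map: recover (mu, Sigma) from the natural parameters."""
    Sigma = np.linalg.inv(-2.0 * Theta)
    return Sigma @ theta, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
theta, Theta = to_natural(mu, Sigma)
mu_back, Sigma_back = to_mean(theta, Theta)  # round-trips to (mu, Sigma)
```

Expanding -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) confirms the choice \theta = \Sigma^{-1}\mu and \Theta = -\frac{1}{2}\Sigma^{-1}.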

The Exponential Family

• Many of the distributions that you have seen before are members of the exponential family

– Binomial, Poisson, Bernoulli, Gamma, Beta, Laplace, Categorical, etc.

• The exponential family, while not the most general family of distributions, is one of the easiest to work with and captures a variety of different distributions

Continuous Bethe Approximation

• Recall that, from the nonnegativity of the KL-divergence,

\log Z \ge H(q) + \sum_C \int q_C(x_C) \log \psi_C(x_C) \, dx_C

for any q

• We can make the same approximations that we did in the discrete case to approximate Z(\theta) in the continuous case

Continuous Bethe Approximation

\max_{\tau \in \mathbf{T}} \; H_B(\tau) + \sum_C \int \tau_C(x_C) \log \psi_C(x_C) \, dx_C

where

H_B(\tau) = -\sum_{i \in V} \int \tau_i(x_i) \log \tau_i(x_i) \, dx_i - \sum_C \int \tau_C(x_C) \log \frac{\tau_C(x_C)}{\prod_{i \in C} \tau_i(x_i)} \, dx_C

and \mathbf{T} is the set of vectors of locally consistent marginals

• This approximation is exact on trees

Continuous Belief Propagation

p(x) = \frac{1}{Z} \prod_{i \in V} \phi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)

• The messages passed by belief propagation are

m_{i \to j}(x_j) = \int \phi_i(x_i) \, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i) \, dx_i

• Depending on the functional form of the potential functions, the message update may not have a closed form solution

– We can't necessarily compute the correct marginal distributions/partition function even in the case of a tree!

Gaussian Belief Propagation

• When p(x) is a multivariate normal distribution, the message updates can be computed in closed form

– In this case, max-product and sum-product are equivalent

– Note that computing the mean of a multivariate normal is equivalent to solving a linear system of equations

– Called Gaussian belief propagation or GaBP

– Does not converge for all multivariate normal distributions

• The messages can have a non-positive definite inverse covariance matrix
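As an illustration, scalar GaBP on a 3-node chain can be sketched in information form, where p(x) ∝ exp(-\frac{1}{2} x^T J x + h^T x) and each message carries a (precision, potential) pair. This is a hedged sketch, not the lecture's notation; J, h, and the schedule are mine (assumes NumPy):

```python
import numpy as np

# p(x) ∝ exp(-0.5 x^T J x + h^T x) on the chain 0 - 1 - 2 (a tree),
# with J a tree-structured, positive definite precision matrix.
J = np.array([[2.0, 0.6, 0.0],
              [0.6, 2.5, -0.4],
              [0.0, -0.4, 1.5]])
h = np.array([1.0, 0.0, -1.0])
nbrs = {0: [1], 1: [0, 2], 2: [1]}
directed = [(0, 1), (1, 0), (1, 2), (2, 1)]

# Messages in information form, initialized to zero.
msg_J = {e: 0.0 for e in directed}
msg_h = {e: 0.0 for e in directed}

for _ in range(10):  # plenty of synchronous sweeps for a 3-node tree
    new_J, new_h = {}, {}
    for (i, j) in directed:
        Jhat = J[i, i] + sum(msg_J[(k, i)] for k in nbrs[i] if k != j)
        hhat = h[i] + sum(msg_h[(k, i)] for k in nbrs[i] if k != j)
        new_J[(i, j)] = -J[i, j] ** 2 / Jhat   # result of integrating out x_i
        new_h[(i, j)] = -J[i, j] * hhat / Jhat
    msg_J, msg_h = new_J, new_h

# Node beliefs: marginal precision and mean at each node.
means, variances = np.empty(3), np.empty(3)
for i in range(3):
    Ji = J[i, i] + sum(msg_J[(k, i)] for k in nbrs[i])
    hi = h[i] + sum(msg_h[(k, i)] for k in nbrs[i])
    means[i], variances[i] = hi / Ji, 1.0 / Ji

# GaBP is exact on trees: compare against the dense solution mu = J^{-1} h.
Sigma = np.linalg.inv(J)
mu_exact = Sigma @ h
```

Note how the belief means solve J \mu = h, the linear system mentioned above, and how a negative message precision (new_J is always \le 0 here) is what can break positive definiteness in the loopy case.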

Properties of Exponential Families

• Exponential families are

– Closed under multiplication

– Not closed under marginalization

• It is easy to get mixtures of Gaussians when a model has both discrete and continuous variables

– Let p(x, y) be such that x \in \mathbb{R}^n and y \in \{1, \dots, k\}, with p(x | y) normally distributed and p(y) multinomially distributed

– p(x) is then a Gaussian mixture (mixtures of exponential family distributions are not generally in the exponential family)

Properties of Exponential Families

• Derivatives of the log-partition function correspond to expectations of the sufficient statistics

\nabla_\theta \log Z(\theta) = \int p(x | \theta) \, \phi(x) \, dx

• So do second derivatives

\frac{\partial^2 \log Z(\theta)}{\partial \theta_k \partial \theta_l} = \int p(x | \theta) \, \phi(x)_k \phi(x)_l \, dx - \int p(x | \theta) \, \phi(x)_k \, dx \int p(x | \theta) \, \phi(x)_l \, dx
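These identities can be checked on a one-dimensional instance where \log Z is known in closed form. A sketch (the example family and step size are mine): take p(x | \theta) \propto \exp(\theta x - x^2/2), i.e. N(\theta, 1), with \phi(x) = x and \log Z(\theta) = \theta^2/2 + \frac{1}{2}\log(2\pi).

```python
import numpy as np

def log_Z(theta):
    """Log-partition function of p(x | theta) ∝ exp(theta*x - x^2/2) = N(theta, 1)."""
    return 0.5 * theta ** 2 + 0.5 * np.log(2.0 * np.pi)

theta, eps = 0.7, 1e-4
# First derivative by central differences: should equal E[phi(x)] = theta.
grad = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)
# Second derivative: should equal Var[phi(x)] = 1 for this family.
hess = (log_Z(theta + eps) - 2 * log_Z(theta) + log_Z(theta - eps)) / eps ** 2
```

The second derivative being a variance (hence nonnegative) is what makes \log Z convex in \theta.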

Mean Parameters

• Exponential family distributions can be equivalently characterized in terms of their mean parameters

• Consider the set of all vectors \mu such that

\mu_k = \int q(x) \, \phi(x)_k \, dx

for some probability distribution q(x)

• If the representation is minimal, then every collection of mean parameters can be realized (perhaps as a limit) by some member of the exponential family

– This characterization is unique

KL-Divergence and Exponential Families

• Minimizing KL divergence is equivalent to "moment matching"

• Let q(x | \theta) = h(x) \exp\left( \langle \theta, \phi(x) \rangle - \log Z(\theta) \right) and let p(x) be an arbitrary distribution

d(p \| q) = \int p(x) \log \frac{p(x)}{q(x | \theta)} \, dx

• This KL divergence is minimized when

\int p(x) \, \phi(x)_k \, dx = \int q(x | \theta) \, \phi(x)_k \, dx
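For a Gaussian family, moment matching just means copying the mean and variance of p. A small sketch (the mixture weights and component parameters are mine; assumes NumPy): match a Gaussian q to a two-component mixture p by estimating its first two moments from samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): a two-component Gaussian mixture (not itself in the family).
w, mu1, s1, mu2, s2 = 0.3, -2.0, 0.5, 1.0, 1.0
comp = rng.random(200_000) < w
x = np.where(comp, rng.normal(mu1, s1, comp.shape), rng.normal(mu2, s2, comp.shape))

# The KL-minimizing Gaussian q(x | theta) matches E_p[x] and Var_p[x].
q_mean, q_var = x.mean(), x.var()

# Exact moments of the mixture, for comparison.
mean_true = w * mu1 + (1 - w) * mu2
var_true = w * (s1**2 + mu1**2) + (1 - w) * (s2**2 + mu2**2) - mean_true**2
```

Note the direction matters: minimizing d(p\|q) over q matches moments; minimizing d(q\|p) over q would instead tend to lock onto a single mode.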

Expectation Propagation

• Key idea: given p(x) = \frac{1}{Z} \prod_C \psi_C(x_C), approximate it by a simpler distribution

p(x) \approx \tilde{p}(x) = \frac{1}{\tilde{Z}} \prod_C \tilde{\psi}_C(x_C)

• We could just replace each factor \psi_C with a member of some exponential family that best describes it, but this can result in a poor approximation unless each \psi_C is essentially a member of the exponential family already

• Instead, we construct the approximating distribution by performing a series of optimizations

Expectation Propagation

• Input: p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)

• Initialize the approximate distribution \tilde{p}(x) = \frac{1}{\tilde{Z}} \prod_C \tilde{\psi}_C(x_C) so that each \tilde{\psi}_C(x_C) is a member of some exponential family

• Repeat until convergence

– For each C

• Let q(x) = \frac{\tilde{p}(x)}{\tilde{\psi}_C(x_C)} \, \psi_C(x_C)

• Set q' = \arg\min_{q'} d(q \| q'), where the minimization is over all exponential families of the chosen form

• Update \tilde{\psi}_C (and hence \tilde{p}) so that \tilde{p}(x) = q'(x)
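Each inner-loop step is a moment-matching projection of q onto the chosen family. When only one factor is non-Gaussian, a single projection suffices; a hedged one-site sketch (the grid quadrature and all numbers are mine, not the lecture's): approximate q(x) \propto N(x; 0, 1) \cdot \mathbf{1}[x > 0], a Gaussian base distribution times one non-Gaussian factor, by a Gaussian.

```python
import numpy as np

# Project q(x) ∝ N(x; 0, 1) * 1[x > 0] (a truncated Gaussian) onto the Gaussian
# family by matching its mean and variance.  Moments computed by midpoint-rule
# quadrature on (0, 8); the tail mass beyond 8 is negligible (~1e-14).
n = 100_000
dx = 8.0 / n
xs = (np.arange(n) + 0.5) * dx
w = np.exp(-0.5 * xs ** 2)          # unnormalized density on x > 0

mean = np.sum(w * xs) / np.sum(w)
var = np.sum(w * xs ** 2) / np.sum(w) - mean ** 2
```

The projected Gaussian has mean \sqrt{2/\pi} \approx 0.798 and variance 1 - 2/\pi \approx 0.363, the known moments of the standard half-normal, even though the target density itself is zero on half the line.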

Expectation Propagation

• EP over exponential family distributions maximizes the Bethe free energy subject to the following moment matching conditions (instead of the marginalization conditions)

\int \tau_i(x_i) \, \phi_i(x_i) \, dx_i = \int \tau_C(x_C) \, \phi_i(x_i) \, dx_C

where \phi_i is a vector of sufficient statistics

Expectation Propagation

• Maximizing the Bethe free energy subject to these moment matching constraints is equivalent to a form of belief propagation in which the beliefs are projected onto a set of allowable marginal distributions (e.g., those in a specific exponential family)

• This is the approach that is often used to handle continuous distributions in practice

• Other methods include discretization-based methods that make use of BP in a discrete setting