
Probabilistic Graphical Models

Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 7: Exponential Families, Conjugate Priors, and Factor Graphs

Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models

Exponential Families of Distributions

An exponential family of distributions takes the form

p(x | θ) = (1/Z(θ)) h(x) exp{ θᵀφ(x) }

• φ(x) ∈ R^d: fixed vector of sufficient statistics (features), specifying the family of distributions
• θ ∈ Θ: unknown vector of natural parameters, determining the particular distribution in this family
• Z(θ) > 0: normalization constant or partition function, ensuring this is a valid density
• h(x) > 0: reference measure independent of the parameters (for many models, we simply have h(x) = 1)

To ensure this construction is valid, we take

Θ = { θ ∈ R^d | Z(θ) < ∞ }

Why the Exponential Family?

• Many standard distributions are in this family, and by studying exponential families, we study them all simultaneously
• Explains similarities among learning algorithms for different models, and makes it easier to derive new algorithms
• ML estimation takes a simple form for exponential families: moment matching of sufficient statistics
• Bayesian learning is simplest for exponential families: they are the only distributions with conjugate priors
• They have a maximum entropy interpretation: among all distributions with certain moments of interest, the exponential family is the most random (makes the fewest assumptions)

Examples of Exponential Families

• Bernoulli and binomial (2 classes): φ(x) = I(x = 1) = x
• Categorical and multinomial (K classes): φ(x) = [I(x = 1), ..., I(x = K − 1)]
• Scalar Gaussian: φ(x) = [x, x²]
• Multivariate Gaussian: φ(x) = [x, xxᵀ]
• Poisson: h(x) = 1/x!, φ(x) = x
• Dirichlet and beta
• Gamma and exponential
• ...

Non-Exponential Families

• Uniform distribution: Unif(x | a, b) = [1/(b − a)] I(a ≤ x ≤ b)
• Laplace and Student-t distributions: Lap(x | µ, λ) = (λ/2) exp(−λ|x − µ|)
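As a concrete instance of the first item (a standard rewriting, included here for illustration rather than taken from the slides), the Bernoulli distribution with mean µ can be put in exponential family form:

Ber(x | µ) = µˣ (1 − µ)^(1−x) = exp{ x log[µ/(1 − µ)] + log(1 − µ) },  x ∈ {0, 1}

so φ(x) = x, the natural parameter is θ = log[µ/(1 − µ)], h(x) = 1, and the log partition function is A(θ) = log(1 + e^θ); differentiating gives dA/dθ = e^θ/(1 + e^θ) = µ, recovering the mean.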

Convexity & Jensen's Inequality

For any convex function f, f(E[X]) ≤ E[f(X)].

[Figure: a convex function f(x) lies below the chord joining (a, f(a)) and (b, f(b)) for x between a and b.]

Concavity & Jensen's Inequality

For the concave logarithm, ln(E[X]) ≥ E[ln(X)].
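A short numerical illustration of the concave case (an added sketch in Python, not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # any positive random variable
# Jensen for the concave log: ln(E[X]) >= E[ln(X)]
print(np.log(x.mean()), np.log(x).mean())      # roughly 0.69 versus 0.12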

Log Partition Function

Derivatives of the log partition function A(θ) = log Z(θ) have an intuitive form:

∇_θ A(θ) = E_θ[φ(x)]

∇²_θ A(θ) = Cov_θ[φ(x)] = E_θ[φ(x) φ(x)ᵀ] − E_θ[φ(x)] E_θ[φ(x)]ᵀ

• Important consequences for learning with exponential families:
• Finding gradients is equivalent to finding expected sufficient statistics, or moments, of some current model
• The Hessian is a positive semi-definite covariance matrix, so A(θ) is convex
• This in turn implies that the parameter space Θ is convex
• Learning is a convex problem: no local optima! At least when we have complete observations...
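A quick numerical check of the first identity (an illustrative Python sketch, not from the slides), using the Bernoulli family where A(θ) = log(1 + e^θ) and the expected sufficient statistic E_θ[x] is the logistic function:

import numpy as np

def log_partition(theta):
    # Bernoulli in natural parameters: A(theta) = log(1 + exp(theta))
    return np.log1p(np.exp(theta))

def mean_sufficient_stat(theta):
    # E_theta[phi(x)] = E_theta[x] = sigmoid(theta)
    return 1.0 / (1.0 + np.exp(-theta))

theta, eps = 0.7, 1e-6
finite_diff = (log_partition(theta + eps) - log_partition(theta - eps)) / (2 * eps)
print(finite_diff, mean_sufficient_stat(theta))   # both approximately 0.668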

Proposition 2.1.1. The log partition function Φ(θ) of eq. (2.2) is convex (strictly so for minimal representations) and continuously differentiable over its domain Θ. Its derivatives are the cumulants of the sufficient statistics {φ_a | a ∈ A}, so that

∂Φ(θ)/∂θ_a = E_θ[φ_a(x)] ≜ ∫_X φ_a(x) p(x | θ) dx    (2.4)

∂²Φ(θ)/∂θ_a ∂θ_b = E_θ[φ_a(x) φ_b(x)] − E_θ[φ_a(x)] E_θ[φ_b(x)]    (2.5)

Proof. For a detailed proof of this classic result, see [15, 36, 311]. The cumulant generating properties follow from the chain rule and algebraic manipulation. From eq. (2.5), ∇²Φ(θ) is a positive semi-definite covariance matrix, implying convexity of Φ(θ). For minimal families, ∇²Φ(θ) must be positive definite, guaranteeing strict convexity.

Due to this result, the log partition function is also known as the cumulant generating function of the exponential family. The convexity of Φ(θ) has important implications for the geometry of exponential families [6, 15, 36, 74].

Entropy, Information, and Divergence

Concepts from information theory play a central role in the study of learning and inference in exponential families. Given a probability distribution p(x) defined on a discrete space X, Shannon's measure of entropy (in natural units, or nats) equals
H(p) = −∑_{x∈X} p(x) log p(x)    (2.6)

In such diverse fields as communications, signal processing, and statistical physics, entropy arises as a natural measure of the inherent uncertainty (or difficulty of compression) of a random variable [49]. The differential entropy extends this definition to continuous spaces:

H(p) = −∫_X p(x) log p(x) dx    (2.7)

In both discrete and continuous domains, the (differential) entropy H(p) is concave, continuous, and maximal for uniform densities. However, while the discrete entropy is guaranteed to be non-negative, differential entropy is sometimes less than zero.

For problems of model selection and approximation, we need a measure of the distance between probability distributions. The relative entropy or Kullback–Leibler (KL) divergence between two probability distributions p(x) and q(x) equals

D(p || q) = ∫_X p(x) log [ p(x) / q(x) ] dx    (2.8)

It provides a non-negative, but asymmetric, "distance" between a given pair of probability distributions. Important properties of the KL divergence follow from Jensen's inequality [49], which bounds the expectation of convex functions:

E[f(x)] ≥ f(E[x])  for any convex f : X → R    (2.9)

Applying Jensen's inequality to the logarithm of eq. (2.8), which is concave, it is easily shown that the KL divergence D(p || q) ≥ 0, with D(p || q) = 0 if and only if p(x) = q(x) almost everywhere. However, it is not a true distance metric because D(p || q) ≠ D(q || p). Given a target density p(x) and an approximation q(x), D(p || q) can be motivated as the information gain achievable by using p(x) in place of q(x) [49]. Interestingly, the alternate KL divergence D(q || p) also plays an important role in the development of variational methods for approximate inference (see Sec. 2.3).

An important special case arises when we consider the dependency between two random variables x and y. Let p_xy(x, y) denote their joint distribution, p_x(x) and p_y(y) their corresponding marginals, and X and Y their sample spaces. The mutual information between x and y then equals

I(p_xy) ≜ D(p_xy || p_x p_y) = ∫_X ∫_Y p_xy(x, y) log [ p_xy(x, y) / (p_x(x) p_y(y)) ] dy dx    (2.10)

= H(p_x) + H(p_y) − H(p_xy)    (2.11)

where eq. (2.11) follows from algebraic manipulation. The mutual information can be interpreted as the expected reduction in uncertainty about one random variable from observation of another [49].
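To make these definitions concrete, the following small Python example (illustrative only, not from the thesis) computes entropy, KL divergence, and mutual information for a toy joint distribution over two binary variables, checking that eqs. (2.10) and (2.11) agree:

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))            # in nats

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# toy joint p_xy over two binary variables
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# mutual information as KL(p_xy || p_x p_y), eq. (2.10)
I = kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())
# equivalently H(p_x) + H(p_y) - H(p_xy), eq. (2.11)
I_alt = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
print(I, I_alt)   # both approximately 0.19 nats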

Projections onto Exponential Families

In many cases, learning problems can be posed as a search for the best approximation of an empirically derived target density p̃(x). As discussed in the previous section, the KL divergence D(p̃ || q) is a natural measure of the accuracy of an approximation q(x). For exponential families, the optimal approximating density is elegantly characterized by the following moment-matching conditions:

Proposition 2.1.2. Let p̃ denote a target probability density, and p_θ an exponential family. The approximating density minimizing D(p̃ || p_θ) then has canonical parameters θ̂ chosen to match the expected values of that family's sufficient statistics:

E_θ̂[φ_a(x)] = ∫_X φ_a(x) p̃(x) dx    a ∈ A    (2.12)

For minimal families, these optimal parameters θ̂ are uniquely determined.

Proof. From the definition of KL divergence (eq. (2.8)), we have

D(p̃ || p_θ) = ∫_X p̃(x) log [ p̃(x) / p(x | θ) ] dx

= ∫_X p̃(x) log p̃(x) dx − ∫_X p̃(x) [ log ν(x) + ∑_{a∈A} θ_a φ_a(x) − Φ(θ) ] dx

= −H(p̃) − ∫_X p̃(x) log ν(x) dx − ∑_{a∈A} θ_a ∫_X φ_a(x) p̃(x) dx + Φ(θ)

Taking derivatives with respect to θ_a and setting ∂D(p̃ || p_θ)/∂θ_a = 0, we then have

∂Φ(θ)/∂θ_a = ∫_X φ_a(x) p̃(x) dx    a ∈ A

Equation (2.12) follows from the cumulant generating properties of Φ(θ) (eq. (2.4)). Because Φ(θ) is strictly convex for minimal families (Prop. 2.1.1), the canonical parameters θ̂ satisfying eq. (2.12) achieve the unique global minimum of D(p̃ || p_θ).

In information geometry, the density satisfying eq. (2.12) is known as the I-projection of p̃(x) onto the e-flat manifold defined by the exponential family's canonical parameters [6, 52]. Note that the optimal projection depends only on the potential functions' expected values under p̃(x), so that these statistics are sufficient to determine the closest approximation.

Learning in Exponential Families

• Given any target probability distribution p̃(x), the closest exponential family distribution matches moments: θ̂ = arg min_θ D(p̃ || p_θ)
• Given L samples, their empirical distribution equals p̃(x) = (1/L) ∑_{ℓ=1}^L δ_{x^(ℓ)}(x)
• For exponential families, maximum likelihood estimation always minimizes KL divergence from the empirical distribution.

In many applications, rather than an explicit target density p̃(x), we instead observe L independent samples {x^(ℓ)}_{ℓ=1}^L from that density. In this situation, we define the empirical density of the samples as follows:

p̃(x) = (1/L) ∑_{ℓ=1}^L δ(x, x^(ℓ))    (2.13)

Here, δ(x, x^(ℓ)) is the Dirac delta function for continuous X, and the Kronecker delta for discrete X. Specializing Prop. 2.1.2 to this case, we find a correspondence between information projection and maximum likelihood (ML) parameter estimation.

Proposition 2.1.3. Let p_θ denote an exponential family with canonical parameters θ. Given L independent, identically distributed samples {x^(ℓ)}_{ℓ=1}^L, with empirical density p̃(x) as in eq. (2.13), the maximum likelihood estimate θ̂ of the canonical parameters coincides with the empirical density's information projection:

θ̂ = arg max_θ ∑_{ℓ=1}^L log p(x^(ℓ) | θ) = arg min_θ D(p̃ || p_θ)    (2.14)

These optimal parameters are uniquely determined for minimal families, and characterized by the following moment matching conditions:

E_θ̂[φ_a(x)] = (1/L) ∑_{ℓ=1}^L φ_a(x^(ℓ))    a ∈ A    (2.15)
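As a concrete instance of eq. (2.15) (a small Python sketch, assuming a scalar Gaussian with sufficient statistics φ(x) = [x, x²] as in the examples slide), maximum likelihood reduces to matching the first two sample moments:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=10_000)

# empirical moments of the sufficient statistics phi(x) = [x, x^2], eq. (2.15)
m1 = samples.mean()
m2 = (samples ** 2).mean()

# for the Gaussian, moment matching gives the usual ML estimates
mu_hat = m1
var_hat = m2 - m1 ** 2
print(mu_hat, var_hat)   # close to the true mean 2.0 and variance 2.25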

Proof. Expanding the KL divergence from p̃(x) (eq. (2.13)), we have

D(p̃ || p_θ) = ∫_X p̃(x) log p̃(x) dx − ∫_X p̃(x) log p(x | θ) dx

= −H(p̃) − ∫_X [ (1/L) ∑_{ℓ=1}^L δ(x, x^(ℓ)) ] log p(x | θ) dx

= −H(p̃) − (1/L) ∑_{ℓ=1}^L log p(x^(ℓ) | θ)

Because H(p̃) does not depend on θ, the parameters minimizing D(p̃ || p_θ) and maximizing the expected log-likelihood coincide, establishing eq. (2.14). The unique characterization of θ̂ via moment matching (eq. (2.15)) then follows from Prop. 2.1.2.

In principle, Prop. 2.1.2 and 2.1.3 suggest a straightforward procedure for learning exponential families: estimate appropriate sufficient statistics, and then find corresponding canonical parameters via convex optimization [6, 15, 36, 52]. In practice, however, significant difficulties may arise. For example, practical applications often require semi-supervised learning from partially labeled training data, so that the needed statistics cannot be directly measured. Even when sufficient statistics are available, calculation of the corresponding parameters can be intractable in large, complex models.

These results also have important implications for the selection of appropriate exponential families. In particular, because the chosen statistics are sufficient for parameter estimation, the learned model cannot capture aspects of the target distribution neglected by these statistics. These concerns motivate our later development of nonparametric methods (see Sec. 2.5) which extend exponential families to learn richer, more flexible models.

Maximum Entropy Models

In the previous section, we argued that certain statistics are sufficient to characterize the best exponential family approximation of a given target density. The following theorem shows that if these statistics are the only available information about a target density, then the corresponding exponential family provides a natural model.

Theorem 2.1.1. Consider a collection of statistics {φ_a | a ∈ A}, whose expectations with respect to some target density p̃(x) are known:

∫_X φ_a(x) p̃(x) dx = µ_a    a ∈ A    (2.16)

The unique distribution p̂(x) maximizing the entropy H(p̂), subject to these moment constraints, is then a member of the exponential family of eq. (2.1), with ν(x) = 1 and canonical parameters θ̂ chosen so that E_θ̂[φ_a(x)] = µ_a.
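As an illustration of Theorem 2.1.1 (a hypothetical toy problem sketched in Python, not taken from the thesis), suppose we know only that a six-sided die has mean 4.5. The maximum entropy distribution consistent with this single constraint has the exponential family form p̂(x) ∝ exp(θx), and θ can be found by one-dimensional moment matching:

import numpy as np
from scipy.optimize import brentq

values = np.arange(1, 7)           # die faces 1..6
target_mean = 4.5                  # the only known moment, mu_a in eq. (2.16)

def model_mean(theta):
    # mean of p(x) proportional to exp(theta * x) on {1,...,6}
    w = np.exp(theta * values)
    return (values * w).sum() / w.sum()

theta_hat = brentq(lambda t: model_mean(t) - target_mean, -5.0, 5.0)
p_hat = np.exp(theta_hat * values)
p_hat /= p_hat.sum()
print(theta_hat, p_hat)            # maximum entropy distribution matching E[x] = 4.5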

Out of all distributions which reproduce the observed sufficient statistics, the exponential family distribution (roughly) makes the fewest additional assumptions.

• Posterior distributions and predictive likelihoods: given L independent, identically distributed observations, the posterior distribution of the canonical parameters can be written as follows:

p(θ | x^(1), ..., x^(L), λ) = [ p(x^(1), ..., x^(L) | θ, λ) p(θ | λ) ] / [ ∫_Θ p(x^(1), ..., x^(L) | θ, λ) p(θ | λ) dθ ]    (2.17)

∝ p(θ | λ) ∏_{ℓ=1}^L p(x^(ℓ) | θ)    (2.18)

The proportionality symbol of eq. (2.18) represents the constant needed to ensure integration to unity (in this case, the data likelihood of eq. (2.17)). Recall that, for minimal exponential families, the canonical parameters are uniquely associated with expectations of that family's sufficient statistics (Prop. 2.1.3). The posterior distribution of eq. (2.18) thus captures our knowledge about the statistics likely to be exhibited by future observations.

In many situations, statistical models are used primarily to predict future observations. Given L independent observations as before, the predictive likelihood of a new observation x̄ equals

p(x̄ | x^(1), ..., x^(L), λ) = ∫_Θ p(x̄ | θ) p(θ | x^(1), ..., x^(L), λ) dθ    (2.19)

where the posterior distribution over parameters is as in eq. (2.18). By averaging over our posterior uncertainty in the parameters θ, this approach leads to predictions which are typically more robust than those based on a single parameter estimate.

In principle, a fully Bayesian analysis should also place a prior distribution p(λ) on the hyperparameters. In practice, however, computational considerations frequently motivate an empirical Bayesian approach [21, 75, 107] in which λ is estimated by maximizing the training data's marginal likelihood:

λ̂ = arg max_λ p(x^(1), ..., x^(L) | λ)    (2.20)

= arg max_λ ∫_Θ p(θ | λ) ∏_{ℓ=1}^L p(x^(ℓ) | θ) dθ    (2.21)

In situations where this optimization is intractable, cross-validation approaches which optimize the predictive likelihood of a held-out data set are often useful [21].

More generally, the predictive likelihood computation of eq. (2.19) is itself intractable for many practical models. In these cases, the parameters' posterior distribution (eq. (2.18)) is often approximated by a single maximum a posteriori (MAP) estimate:

θ̂ = arg max_θ p(θ | x^(1), ..., x^(L), λ)    (2.22)

= arg max_θ p(θ | λ) ∏_{ℓ=1}^L p(x^(ℓ) | θ)    (2.23)

This approach is best justified when the training set size L is very large, so that the posterior distribution of eq. (2.22) is tightly concentrated [21, 107]. Sometimes, however, MAP estimates are used with smaller datasets because they are the only computationally viable option.

Parametric & Predictive Sufficiency

When computing the posterior distributions and predictive likelihoods motivated in the previous section, it is very helpful to have compact ways of characterizing large datasets. For exponential families, the notions of sufficiency introduced in Sec. 2.1.1 can be extended to simplify learning with prior knowledge.

Theorem 2.1.2. Let p(x | θ) denote an exponential family with canonical parameters θ, and p(θ | λ) a corresponding prior density. Given L independent, identically distributed samples {x^(ℓ)}_{ℓ=1}^L, consider the following statistics:

φ(x^(1), ..., x^(L)) ≜ { (1/L) ∑_{ℓ=1}^L φ_a(x^(ℓ)) | a ∈ A }    (2.24)

These empirical moments, along with the sample size L, are then said to be parametric sufficient for the posterior distribution over canonical parameters, so that

p(θ | x^(1), ..., x^(L), λ) = p(θ | φ(x^(1), ..., x^(L)), L, λ)    (2.25)

Equivalently, they are predictive sufficient for the likelihood of new data x̄:

p(x̄ | x^(1), ..., x^(L), λ) = p(x̄ | φ(x^(1), ..., x^(L)), L, λ)    (2.26)

Proof. Parametric sufficiency follows from the Neyman factorization criterion, which is satisfied by any exponential family. The correspondence between parametric and predictive sufficiency can then be argued from eqs. (2.18, 2.19). For details, see Sec. 4.5 of Bernardo and Smith [21].

This theorem makes exponential families particularly attractive when learning from large datasets, due to the often dramatic compression provided by the statistics of eq. (2.24). It also emphasizes the importance of selecting appropriate sufficient statistics, since other features of the data cannot affect subsequent model predictions.

Analysis with Conjugate Priors

Theorem 2.1.2 shows that statistical predictions in exponential families are functions solely of the chosen sufficient statistics. However, it does not provide an explicit characterization of the posterior distribution over model parameters, or guarantee that the predictive likelihood can be computed tractably. In this section, we describe an expressive family of prior distributions which are also analytically tractable.

Chapter 2

Nonparametric and Graphical Models

STATISTICAL methods play a central role in the design and analysis of machine vision systems. In this background chapter, we review several learning and inference techniques upon which our later contributions are based. We begin in Sec. 2.1 by describing exponential families of probability densities, emphasizing the roles of sufficiency and conjugacy in Bayesian learning. Sec. 2.2 then shows how graphs may be used to impose structure on exponential families. We contrast several types of graphical models, and provide results clarifying their underlying statistical assumptions.

To apply graphical models in practical applications, computationally efficient learning and inference algorithms are needed. Sec. 2.3 describes several variational methods which approximate intractable inference tasks via message-passing algorithms. In Sec. 2.4, we discuss a complementary class of Monte Carlo methods which use stochastic simulations to analyze complex models. In this thesis, we propose new inference algorithms which integrate variational and Monte Carlo methods in novel ways. Finally, we conclude in Sec. 2.5 with an introduction to nonparametric methods for Bayesian learning. These infinite-dimensional models achieve greater robustness by avoiding restrictive assumptions about the data generation process. Despite this flexibility, variational and Monte Carlo methods can be adapted to allow tractable analysis of large, high-dimensional datasets.

2.1 Exponential Families

An exponential family of probability distributions [15, 36, 311] is characterized by the values of certain sufficient statistics. Let x be a random variable taking values in some sample space X, which may be either continuous or discrete. Given a set of statistics or potentials {φ_a | a ∈ A}, the corresponding exponential family of densities is given by

p(x | θ) = ν(x) exp{ ∑_{a∈A} θ_a φ_a(x) − Φ(θ) }    (2.1)

where θ ∈ R^{|A|} are the family's natural or canonical parameters, and ν(x) is a non-negative reference measure. In some applications, the parameters θ are set to fixed constants, while in other cases they are interpreted as latent random variables. The log partition function Φ(θ) is defined to normalize p(x | θ) so that it integrates to one:

Φ(θ) = log ∫_X ν(x) exp{ ∑_{a∈A} θ_a φ_a(x) } dx    (2.2)

For discrete spaces, dx is taken to be counting measure, so that integrals become summations. This construction is valid when the canonical parameters θ belong to the set Θ ≜ { θ ∈ R^{|A|} | Φ(θ) < ∞ } for which the log partition function is finite.

Learning with Conjugate Priors

Let p(x | θ) denote a family of probability densities parameterized by θ. A family of prior densities p(θ | λ) is said to be conjugate to p(x | θ) if, for any observation x and hyperparameters λ, the posterior distribution p(θ | x, λ) remains in that family:

p(θ | x, λ) ∝ p(x | θ) p(θ | λ) ∝ p(θ | λ̄)    (2.27)

In this case, the posterior distribution is compactly described by an updated set of hyperparameters λ̄. For exponential families parameterized as in eq. (2.1), conjugate priors [21, 36] take the following general form:

p(θ | λ) = exp{ ∑_{a∈A} θ_a λ_0 λ_a − λ_0 Φ(θ) − Ω(λ) }    (2.28)

While this functional form duplicates the exponential family's, the interpretation is different: the density is over the space of parameters Θ, and determined by hyperparameters λ. The conjugate prior is proper, or normalizable, when the hyperparameters take values in the space Λ where the log normalization constant Ω(λ) is finite:

Ω(λ) = log ∫_Θ exp{ ∑_{a∈A} θ_a λ_0 λ_a − λ_0 Φ(θ) } dθ    (2.29)

Λ ≜ { λ ∈ R^{|A|+1} | Ω(λ) < ∞ }    (2.30)

Note that the dimension of the conjugate family's hyperparameters λ is one larger than that of the corresponding canonical parameters θ. The following result verifies that the conjugate family of eq. (2.28) satisfies the definition of eq. (2.27), and provides an intuitive interpretation for the hyperparameters.
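As a familiar special case of this conjugate construction (a standard beta-Bernoulli example, sketched in Python rather than reproduced from the thesis), the posterior simply adds observed counts to the prior hyperparameters; note that it depends on the data only through the sufficient statistic ∑_ℓ x^(ℓ) and the sample size L, as Theorem 2.1.2 predicts:

import numpy as np

# prior hyperparameters for Beta(a, b), playing the role of lambda
a0, b0 = 2.0, 2.0

# observed coin flips x^(1), ..., x^(L)
x = np.array([1, 1, 0, 1, 1, 0, 1])

# conjugacy: posterior stays in the beta family with updated hyperparameters
a_post = a0 + x.sum()             # add the number of ones
b_post = b0 + len(x) - x.sum()    # add the number of zeros

posterior_mean = a_post / (a_post + b_post)
map_estimate = (a_post - 1) / (a_post + b_post - 2)   # mode, valid when a_post, b_post > 1
print(a_post, b_post, posterior_mean, map_estimate)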


[Figure 2.4. Three graphical representations of a distribution over five random variables x1, ..., x5 (see [175]). (a) Directed graph G depicting a causal, generative process. (b) Factor graph expressing the factorization underlying G. (c) A "moralized" undirected graph capturing the Markov structure of G.]

Factor Graphs & Exponential Families

p(x) = (1/Z(θ)) ∏_{f∈F} ψ_f(x_f | θ_f)

• V: set of N nodes or vertices, {1, 2, ..., N}
• F: set of hyperedges linking subsets of nodes f ⊆ V
• Z(θ): normalization constant (partition function)
• A factor graph is created from non-negative potential functions.
• To guarantee non-negativity, we typically define potentials as local exponential families.

For example, in the factor graph of Fig. 2.5(c), there are 5 variable nodes, and the joint distribution has one potential for each of the 3 hyperedges:

p(x) ∝ ψ_{123}(x1, x2, x3) ψ_{234}(x2, x3, x4) ψ_{35}(x3, x5)

Often, these potentials can be interpreted as local dependencies or constraints. Note, however, that ψ_f(x_f) does not typically correspond to the marginal distribution p_f(x_f), due to interactions with the graph's other potentials.

In many applications, factor graphs are used to impose structure on an exponential family of densities. In particular, suppose that each potential function is described by the following unnormalized exponential form:

ψ_f(x_f | θ_f) = ν_f(x_f) exp{ ∑_{a∈A_f} θ_fa φ_fa(x_f) }    (2.67)

Here, θ_f ≜ {θ_fa | a ∈ A_f} are the canonical parameters of the local exponential family for hyperedge f. From eq. (2.66), the joint distribution can then be written as

p(x | θ) = [ ∏_{f∈F} ν_f(x_f) ] exp{ ∑_{f∈F} ∑_{a∈A_f} θ_fa φ_fa(x_f) − Φ(θ) }    (2.68)

where Φ(θ) = log Z(θ). Comparing to eq. (2.1), we see that factor graphs define regular exponential families [104, 311], with parameters θ = {θ_f | f ∈ F}, whenever local potentials are chosen from such families. The results of Sec. 2.1 then show that local statistics, computed over the support of each hyperedge, are sufficient for learning from training data.
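As a small illustration of this factorization (a toy Python sketch with made-up potential tables, not the figure's actual model), the following evaluates the unnormalized joint p(x) ∝ ψ_{123}(x1, x2, x3) ψ_{234}(x2, x3, x4) ψ_{35}(x3, x5) for binary variables and computes Z by brute-force enumeration:

import itertools
import numpy as np

rng = np.random.default_rng(1)

# random positive potential tables for the hyperedges {1,2,3}, {2,3,4}, {3,5}
psi_123 = rng.uniform(0.5, 2.0, size=(2, 2, 2))
psi_234 = rng.uniform(0.5, 2.0, size=(2, 2, 2))
psi_35 = rng.uniform(0.5, 2.0, size=(2, 2))

def unnormalized(x1, x2, x3, x4, x5):
    # product of local potentials, as in p(x) = (1/Z) prod_f psi_f(x_f)
    return psi_123[x1, x2, x3] * psi_234[x2, x3, x4] * psi_35[x3, x5]

# brute-force partition function over all 2^5 joint configurations
Z = sum(unnormalized(*cfg) for cfg in itertools.product([0, 1], repeat=5))
print(Z, unnormalized(1, 0, 1, 0, 1) / Z)   # normalized probability of one configuration

For larger graphs this enumeration is intractable, which is exactly what motivates the message-passing and Monte Carlo methods discussed elsewhere in the chapter.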