
Maximum Entropy Framework

Sridhar Mahadevan

[email protected]

University of Massachusetts


Topics

Maximum entropy is the dual of maximum likelihood. Maximum entropy (ME) involves finding the “maximally general” distribution that satisfies a set of prespecified constraints. The notion of “maximal generality” is made precise using the concept of entropy. The distribution that satisfies the ME constraints can be shown to be of an exponential form. The concept of maximum entropy comes from statistical physics, and this whole area of machine learning borrows heavily from physics. Undirected graphical models are called random field models, and also derive much of their formalism from statistical physics.

Statistical Physics

[Figure: molecules in a gas. Temperature ~ average kinetic energy; what is the distribution of velocities?]


Statistical Physics

The concept of entropy was investigated originally in statistical physics and thermodynamics. The temperature of a gas is proportional to the average kinetic energy of the molecules in the gas. The distribution of velocities at a given temperature is a maximum entropy distribution (also known as the Maxwell-Boltzmann distribution). Boltzmann machines are a type of neural network that get their inspiration from statistical physics. There is a whole subfield of energy-based models in machine learning based on borrowing ideas from statistical physics.

Maximum Entropy Example

Consider estimating a joint distribution over two variables u and v, given some constraints:

P(u,v)   v=0    v=1
u=0       ?      ?
u=1       ?      ?
sum      0.6          (total = 1.0)

The maximum entropy framework suggests picking values that make the least commitments, while being consistent with the constraints that are given. For this example, this suggests a unique assignment of values (0.3 in each cell of the v=0 column and 0.2 in each cell of the v=1 column), but in general, determining the set of possible distributions is not so straightforward.
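A minimal numeric sketch of this example (not from the slides; the column-sum and total-sum constraints are my reading of the table above, and scipy's SLSQP solver is just one convenient choice):

```python
# Hypothetical sketch: recover the maximum entropy fill-in for the 2x2 table
# under the constraints sum(P) = 1 and P(u=0,v=0) + P(u=1,v=0) = 0.6.
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)        # avoid log(0)
    return np.sum(p * np.log(p))       # minimizing this maximizes H(P)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # total mass is 1
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.6},  # first column (v=0) sums to 0.6
]
# p = [P(0,0), P(0,1), P(1,0), P(1,1)]
result = minimize(neg_entropy, x0=np.full(4, 0.25),
                  bounds=[(0, 1)] * 4, constraints=constraints)
print(result.x.reshape(2, 2))          # approximately [[0.3, 0.2], [0.3, 0.2]]
```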


Maximum entropy distributions

Consider maximizing the entropy $H(P)$ over all distributions $P$ satisfying the following constraints:

$$P(x) \ge 0 \quad (\text{where } x \in S, \text{ the support of } P)$$
$$\sum_{x \in S} P(x) = 1$$
$$\sum_{x \in S} P(x)\, r_i(x) = \alpha_i \quad \text{for } 1 \le i \le m$$

Maximum entropy distribution

Writing out the Lagrangian, we get

$$\Lambda(P) = -\sum_{x \in S} P(x) \ln P(x) + \lambda_0 \Big( \sum_{x \in S} P(x) - 1 \Big) + \sum_{i=1}^{m} \lambda_i \Big( \sum_{x \in S} P(x)\, r_i(x) - \alpha_i \Big)$$

Taking the gradient of this Lagrangian w.r.t. $P(x)$, we get

$$\frac{\partial}{\partial P(x)} \Lambda(P) = -\ln P(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)$$

Setting this to zero gives

$$P(x) = e^{\lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i r_i(x)}, \quad x \in S$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are chosen so that the constraints are satisfied.
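In practice the $\lambda_i$ usually have to be found numerically. One convenient route (a sketch under illustrative assumptions, not taken from the slides) is to minimize the convex function $\log Z(\lambda) - \sum_i \lambda_i \alpha_i$, whose gradient is $E_P[r_i] - \alpha_i$, so its minimizer enforces the constraints. The support $\{1,\ldots,6\}$ and the mean constraint $\alpha_1 = 4.5$ below are hypothetical choices made only for the example.

```python
# Minimal sketch: solve for lambda on a finite support (a die with mean 4.5).
import numpy as np
from scipy.optimize import minimize

xs = np.arange(1, 7)              # support S = {1,...,6}
R = np.stack([xs])                # one constraint function r_1(x) = x
alpha = np.array([4.5])           # target E[r_1(X)] = 4.5

def dual(lam):
    # log Z(lambda) - lambda . alpha  (convex; gradient = E_P[r] - alpha)
    logZ = np.log(np.sum(np.exp(lam @ R)))
    return logZ - lam @ alpha

lam_star = minimize(dual, x0=np.zeros(1)).x
P = np.exp(lam_star @ R)
P /= P.sum()                      # the e^{lambda_0 - 1} factor is absorbed by normalizing
print(P, P @ xs)                  # maximum entropy distribution with mean ~4.5
```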


Differential Entropy

If $X$ is a continuous variable, the entropy $H(X)$ is referred to as the differential entropy. A similar result can be proved for continuous variables:

$$P(x) \ge 0 \quad (\text{where } P(x) = 0 \text{ if } x \text{ is not in the support } S \text{ of } P)$$
$$\int_S P(x)\, dx = 1$$
$$\int_S P(x)\, r_i(x)\, dx = \alpha_i \quad \text{for } 1 \le i \le m$$

The Lagrangian in the continuous case is

$$\Lambda(P) = -\int_S P(x) \ln P(x)\, dx + \lambda_0 \Big( \int_S P(x)\, dx - 1 \Big) + \sum_{i=1}^{m} \lambda_i \Big( \int_S P(x)\, r_i(x)\, dx - \alpha_i \Big)$$

Example 1: Uniform Distribution

Find the maximum entropy distribution that satisfies the following constraints: $S = [a, b]$.

Since there are no other constraints, the form of the distribution must be

$$P(x) = e^{\lambda_0}$$

This is because all $\lambda_i = 0$ for $i > 0$. Solving this integral, we find

$$\int_a^b e^{\lambda_0}\, dx = 1 \;\Rightarrow\; e^{\lambda_0}(b - a) = 1$$

which immediately gives us the uniform distribution, because

$$P(x) = e^{\lambda_0} = \frac{1}{b - a}$$


Example 2: Gaussian Distribution

Find the maximum entropy distribution that satisfies the following constraints:

$$S = (-\infty, \infty), \qquad E[X] = \int_{-\infty}^{\infty} x\, P(x)\, dx = 0, \qquad E[X^2] = \sigma^2$$

The form of our distribution in this case is

$$P(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}$$

We have to find the constants $\lambda_i$ by solving

$$\int_{-\infty}^{\infty} e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}\, dx = 1$$
$$\int_{-\infty}^{\infty} x\, e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}\, dx = 0$$
$$\int_{-\infty}^{\infty} x^2\, e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}\, dx = \sigma^2$$

Show that by choosing $\lambda_1 = 0$, $\lambda_2 = -\frac{1}{2\sigma^2}$, and $\lambda_0 = -\ln \sqrt{2\pi}\,\sigma$, we get the Gaussian distribution.
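A quick numeric check of the stated solution (the value $\sigma = 1.7$ is arbitrary, chosen only for illustration; this sketch is not from the slides):

```python
# Plug in lambda_1 = 0, lambda_2 = -1/(2 sigma^2), lambda_0 = -ln(sqrt(2 pi) sigma)
# and verify the three constraints, and that P matches the Gaussian pdf.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.7
lam0 = -np.log(np.sqrt(2 * np.pi) * sigma)
lam1, lam2 = 0.0, -1.0 / (2 * sigma**2)
P = lambda x: np.exp(lam0 + lam1 * x + lam2 * x**2)

print(quad(P, -np.inf, np.inf)[0])                        # ~1.0 (normalization)
print(quad(lambda x: x * P(x), -np.inf, np.inf)[0])       # ~0.0 (mean)
print(quad(lambda x: x**2 * P(x), -np.inf, np.inf)[0])    # ~sigma^2 (second moment)
print(np.isclose(P(0.3), norm.pdf(0.3, scale=sigma)))     # True
```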

Example 3: Mystery Distribution

Find the maximum entropy distribution that satisfies the following constraints: $S = [0, \infty)$ and $E[X] = \mu$.

The solution to this mystery distribution gives us a well-known PDF (it is highly recommended that you solve this by finding the $\lambda_i$ constants!).

Cross Entropy and Log Likelihood

Let us define the empirical distribution $\tilde{P}(x)$ by placing a point mass at each data point:

$$\tilde{P}(x) = \frac{1}{N} \sum_{n=1}^{N} I(x, x_n)$$

Taking the expectation of $\log P(x \mid \theta)$ under the empirical distribution (the negative cross-entropy of $\tilde{P}$ with the model $P(x \mid \theta)$) gives

$$\sum_x \tilde{P}(x) \log P(x \mid \theta) = \sum_x \frac{1}{N} \sum_{n=1}^{N} I(x, x_n) \log P(x \mid \theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_x I(x, x_n) \log P(x \mid \theta) = \frac{1}{N} \sum_{n=1}^{N} \log P(x_n \mid \theta) = \frac{1}{N}\, l(\theta \mid X)$$
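A tiny numeric sanity check of this identity (the sample and the model $\theta$ below are made up for illustration, not from the slides):

```python
# The empirical expectation of log P(x|theta) equals the average log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.integers(0, 3, size=50)                 # N = 50 draws from {0, 1, 2}
theta = np.array([0.2, 0.5, 0.3])                # some model P(x|theta)

P_tilde = np.bincount(xs, minlength=3) / len(xs) # empirical distribution
lhs = np.sum(P_tilde * np.log(theta))            # sum_x P~(x) log P(x|theta)
rhs = np.mean(np.log(theta[xs]))                 # (1/N) sum_n log P(x_n|theta)
print(np.isclose(lhs, rhs))                      # True
```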

KL Divergence and Maximum Likelihood

Let us compute the KL divergence between the empirical distribution $\tilde{P}(x)$ and the model $P(x \mid \theta)$:

$$KL(\tilde{P} \,\|\, P(x \mid \theta)) = \sum_x \tilde{P}(x) \log \frac{\tilde{P}(x)}{P(x \mid \theta)} = \sum_x \tilde{P}(x) \log \tilde{P}(x) - \sum_x \tilde{P}(x) \log P(x \mid \theta) = \sum_x \tilde{P}(x) \log \tilde{P}(x) - \frac{1}{N}\, l(\theta \mid X)$$

The first term is independent of the model parameters. This means that optimizing the parameters $\theta$ to minimize the KL divergence between the model and the empirical distribution is the same as maximizing the (log) likelihood.


Conditional Maximum Entropy

In classification, we are interested in modeling the conditional distribution $P(y \mid x)$. Let us now turn to viewing classification in the maximum entropy framework.

To model $P(y \mid x)$, let us assume we are given a set of binary features $f_i(x, y)$, which represent the co-occurrence of the (“contextual”) input $x$ and (label) output $y$. For example, in a natural language application, $x$ may be the set of words surrounding a given word, and $y$ may represent the part of speech (or a category like “company name”). Let us generalize the idea from logistic regression and construct an exponential model $P(y \mid x, \lambda)$ using a weighted linear combination of the features $f_i(x, y)$, where the weights are scalars $\lambda_i$. We want to set up constraints based on the “error” between the empirical joint distribution $\tilde{P}(x, y)$ and the product model distribution $\tilde{P}(x)\, P(y \mid x, \lambda)$.

Constraints

We now define a set of constraints on the simplex of distributions. Each constraint is based on comparing the prediction of the model with the empirical distribution based on the training set $(x_1, y_1), \ldots, (x_N, y_N)$. Let us define the empirical joint distribution $\tilde{P}(u, v)$ as simply the frequency of co-occurrence of $(u, v)$ in the training set:

$$\tilde{P}(u, v) = \frac{1}{N} \sum_{n=1}^{N} I((u, v), (x_n, y_n))$$

For each binary-valued feature $f(x, y)$ we can impose an additional constraint, namely that the expected value of $f$ under the model should agree with its expectation under the empirical joint distribution:

$$E_P(f) \equiv \sum_{x,y} \tilde{P}(x)\, P(y \mid x, \lambda)\, f(x, y) \;=\; E_{\tilde{P}}(f) \equiv \sum_{x,y} \tilde{P}(x, y)\, f(x, y)$$
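The constraint is easy to compute for a toy problem. In the sketch below (not from the slides), the data, the single feature, and the weight $\lambda_1$ are all hypothetical; training would adjust $\lambda$ until the two expectations match.

```python
# Compare the empirical expectation of a feature with its expectation under
# the exponential model P(y|x, lambda) on a tiny made-up problem.
import numpy as np

data = [("rainy", 1), ("rainy", 1), ("sunny", 0), ("rainy", 0)]   # (x_n, y_n)
X, Y = ["rainy", "sunny"], [0, 1]

def f(x, y):                       # one binary feature f(x, y)
    return 1.0 if (x == "rainy" and y == 1) else 0.0

lam = np.array([0.7])              # a single (arbitrary) weight

def P_model(y, x):                 # P(y | x, lambda) = exp(lambda . f) / Z(x)
    scores = np.exp([lam[0] * f(x, yy) for yy in Y])
    return scores[Y.index(y)] / scores.sum()

N = len(data)
P_tilde_x = {x: sum(xn == x for xn, _ in data) / N for x in X}    # empirical P~(x)

E_empirical = sum(f(xn, yn) for xn, yn in data) / N               # E_P~[f]
E_model = sum(P_tilde_x[x] * P_model(y, x) * f(x, y) for x in X for y in Y)
print(E_empirical, E_model)        # the constraint asks these to be equal
```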


Maximum Conditional Entropy

Given $n$ feature functions $f_i(x, y)$, we would like our estimated model to be in accordance with the sample distribution:

$$\mathcal{C} = \Big\{ P \in \mathcal{P} \;\Big|\; E_P(f_i(x, y)) = E_{\tilde{P}}(f_i(x, y)) \text{ for } i \in \{1, \ldots, n\} \Big\}$$

But this set of constraints still defines a simplex of possible distributions, and we need to constrain it further. The maximum conditional entropy principle states that we should pick the model that maximizes the conditional entropy

$$H(P) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} \tilde{P}(x)\, P(y \mid x, \lambda) \log P(y \mid x, \lambda)$$

Putting both of these together, we get

$$P^* = \arg\max_{P \in \mathcal{C}} H(P)$$

Constrained Optimization

For each feature function $f_i(x, y)$, we introduce a Lagrange multiplier $\lambda_i$ and form the overall Lagrangian

$$\Lambda(P, \lambda) \equiv H(P) + \sum_i \lambda_i \big( E_P(f_i) - E_{\tilde{P}}(f_i) \big)$$

If we keep $\lambda$ fixed, we can find the unconstrained maximum of the Lagrangian $\Lambda(P, \lambda)$ over all $P \in \mathcal{P}$. Let $P_\lambda$ denote the argument where the maximum is reached, and $\Psi(\lambda)$ the maximizing value:

$$P_\lambda = \arg\max_{P \in \mathcal{P}} \Lambda(P, \lambda)$$

$$\Psi(\lambda) = \Lambda(P_\lambda, \lambda)$$


Maximum Entropy Distribution

Key theorem: By solving the Lagrangian above, one can show that the maximum entropy distribution is given by the following exponential form:

$$P(y \mid x, \lambda) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)$$

where $Z_\lambda(x) = \sum_y \exp\big( \sum_i \lambda_i f_i(x, y) \big)$. The maximizing value is

$$\Psi(\lambda) \equiv \Lambda(P_\lambda, \lambda) = -\sum_x \tilde{P}(x) \log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde{P}}(f_i)$$

Karush-Kuhn-Tucker Theorem

The original constrained optimization problem, called the primal, is

$$P^* = \arg\max_{P \in \mathcal{C}} H(P)$$

The dual (unconstrained) optimization problem is

$$\text{Find } \lambda^* = \arg\max_{\lambda} \Psi(\lambda)$$

The KKT theorem shows that, under certain conditions, the solution to the dual problem gives us the solution to the primal problem. More precisely, $P^* = P_{\lambda^*}$. We will see the KKT theorem show up again in support vector machines.


Maximum Likelihood Revisited

The log-likelihood of the empirical distribution $\tilde{P}$, as predicted by the discriminant model $P(Y \mid X, \lambda)$, is

$$L_{\tilde{P}}(P) = \log \prod_{x,y} P(y \mid x, \lambda)^{\tilde{P}(x, y)} = \sum_{x,y} \tilde{P}(x, y) \log P(y \mid x, \lambda)$$

We can show that the log-likelihood above is exactly the dual function $\Psi(\lambda)$ given above. In other words, the maximum entropy model $P^* \in \mathcal{C}$ is exactly the model in the parametric family $P(Y \mid X, \lambda)$ that maximizes the likelihood of the training sample $\tilde{P}$.
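The claimed identity between the conditional log-likelihood and $\Psi(\lambda)$ can be checked numerically. The data, features, and weights in this sketch are invented for illustration only.

```python
# Check: sum_{x,y} P~(x,y) log P(y|x,lam) == -sum_x P~(x) log Z_lam(x) + lam . E_P~[f]
import numpy as np

X, Y = [0, 1], [0, 1]
data = [(0, 0), (0, 1), (1, 1), (1, 1)]                  # training pairs (x_n, y_n)
feats = lambda x, y: np.array([x * y, 1.0 - x])          # two features f_1, f_2
lam = np.array([0.8, -0.3])

N = len(data)
P_xy = {(x, y): sum(d == (x, y) for d in data) / N for x in X for y in Y}
P_x = {x: sum(P_xy[(x, y)] for y in Y) for x in X}

logZ = {x: np.log(sum(np.exp(lam @ feats(x, y)) for y in Y)) for x in X}
logP = lambda y, x: lam @ feats(x, y) - logZ[x]          # log P(y | x, lambda)

loglik = sum(P_xy[(x, y)] * logP(y, x) for x in X for y in Y)
E_f = sum(P_xy[(x, y)] * feats(x, y) for x in X for y in Y)
psi = -sum(P_x[x] * logZ[x] for x in X) + lam @ E_f
print(np.isclose(loglik, psi))                           # True
```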

Maximum Likelihood and Entropy

Let us see what the maximum likelihood solution yields by taking the gradient:

$$\frac{\partial L_{\tilde{P}}(P)}{\partial \lambda_i} = \frac{\partial}{\partial \lambda_i} \log \prod_{x,y} P(y \mid x, \lambda)^{\tilde{P}(x, y)} = \frac{\partial}{\partial \lambda_i} \sum_{x,y} \tilde{P}(x, y) \log P(y \mid x, \lambda) = \sum_{x,y} \tilde{P}(x, y) f_i(x, y) - \sum_{x,y} \tilde{P}(x)\, P(y \mid x, \lambda) f_i(x, y)$$

Setting this gradient to zero, we now see that this is exactly the same set of constraints that the maximum entropy solution was based on.
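A small gradient-ascent sketch makes the connection concrete: the update is the difference between the empirical and model feature expectations, and at convergence the two match, which is exactly the maximum entropy constraint. The toy data and features below are hypothetical, not from the slides.

```python
# Gradient ascent on the conditional log-likelihood of a tiny maxent model.
import numpy as np

X, Y = [0, 1], [0, 1]
data = [(0, 0), (0, 1), (1, 1), (1, 1)]                   # (x_n, y_n) pairs
feats = lambda x, y: np.array([float(x == y), float(x)])  # two illustrative features
N = len(data)
P_xy = {(x, y): sum(d == (x, y) for d in data) / N for x in X for y in Y}
P_x = {x: sum(P_xy[(x, y)] for y in Y) for x in X}
emp = sum(P_xy[(x, y)] * feats(x, y) for x in X for y in Y)   # E_P~[f]

lam = np.zeros(2)
for step in range(2000):
    Z = {x: sum(np.exp(lam @ feats(x, y)) for y in Y) for x in X}
    Pyx = lambda y, x: np.exp(lam @ feats(x, y)) / Z[x]
    mod = sum(P_x[x] * Pyx(y, x) * feats(x, y) for x in X for y in Y)
    lam += 0.5 * (emp - mod)                   # gradient of the log-likelihood
print(emp, mod)                                # nearly equal after training
```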


Maximum Likelihood and Entropy

                Primal                                 Dual
Problem         $\arg\max_{P \in \mathcal{C}} H(P)$    $\arg\max_{\lambda} \Psi(\lambda)$
Optimization    Constrained                            Unconstrained
Approach        Maximum Entropy                        Maximum Likelihood
Search          $P \in \mathcal{C}$                    $\lambda \in \{\lambda_1, \ldots, \lambda_n\}$
Solution        $P^*$                                  $\lambda^*$

Maximum Entropy Parameter Estimation

Note that maximum likelihood estimation for the exponential model is not analytically solvable, since the normalizer couples the parameters together:

$$\Psi(\lambda) = \sum_i \lambda_i \sum_{x,y} \tilde{P}(x, y) f_i(x, y) - \sum_x \tilde{P}(x) \log \sum_y e^{\sum_i \lambda_i f_i(x, y)}$$

Methods based on Generalized Iterative Scaling [Darroch and Ratcliff, '72; Berger, Della Pietra, Della Pietra]:
- Use Jensen's inequality and the bound $-\log x \ge 1 - x$ to derive a lower bound on the log-likelihood improvement.
- Introduce $f^{\#}(x, y) = \sum_i f_i(x, y)$ to get a PDF-like term by averaging out $f(x, y)$.

Gradient methods (see the sketch below):
- Plug in your favorite first-order or second-order gradient method from numerical analysis, including Newton's method, conjugate gradient, or line search.
- These have recently been found to be much faster than GIS for logistic regression [Minka 2003].
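A minimal sketch of the second option (the toy data and features are hypothetical, and this is a generic quasi-Newton fit, not the GIS procedure described above): hand the negative conditional log-likelihood to an off-the-shelf L-BFGS optimizer.

```python
import numpy as np
from scipy.optimize import minimize

X, Y = [0, 1], [0, 1]
data = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
feats = lambda x, y: np.array([float(x == y), float(y)])
P_xy = {(x, y): sum(d == (x, y) for d in data) / len(data) for x in X for y in Y}

def neg_loglik(lam):
    logZ = {x: np.log(sum(np.exp(lam @ feats(x, y)) for y in Y)) for x in X}
    return -sum(P_xy[(x, y)] * (lam @ feats(x, y) - logZ[x]) for x in X for y in Y)

lam_star = minimize(neg_loglik, x0=np.zeros(2), method="L-BFGS-B").x
print(lam_star, neg_loglik(lam_star))          # fitted weights and final objective
```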

Generative vs Discriminative Models

[Figure: generative vs. discriminative model pairs (Naive Bayes / logistic regression, HMM / CRF).]

Distributions on Undirected Graphs

Earlier we saw that for directed graphical models, d-separation formalizes the notion of conditional independence. We also saw that the joint distribution factorizes as a product of terms, each of which involves a node and its parents. The factorization property of directed graphs ensures that every distribution defined in this way must satisfy all d-separation properties in the graph. In an undirected graph, a set of nodes $A$ is conditionally independent of a set of nodes $B$ given the separating set $C$ if every path from a node in $A$ to a node in $B$ goes through a node in $C$. We say that a distribution $P$ is globally Markov with respect to an undirected graph $G$ iff for every disjoint set of nodes $A$, $B$, and $C$, if $A \perp B \mid C$ holds in the graph, then the distribution also satisfies the same conditional independence.


Three-Node Chain Graph

The clique potentials are:

$\phi_1(A, B)$:        $\phi_2(B, C)$:
A0 B0   10             B0 C0    1
A0 B1    1             B0 C1   10
A1 B0    1             B1 C0    1
A1 B1   10             B1 C1   10

$$P(A, B, C) = \frac{1}{Z}\, \phi_1(A, B)\, \phi_2(B, C)$$

$$Z = \sum_{A,B,C} \phi_1(A, B)\, \phi_2(B, C)$$

Three-Node Chain Graph

A   B   C     $\phi_1 \cdot \phi_2$   $P(A, B, C)$
A0  B0  C0        10                  0.0413
A0  B0  C1       100                  0.413
A0  B1  C0         1                  0.00413
A0  B1  C1        10                  0.0413
A1  B0  C0         1                  0.00413
A1  B0  C1        10                  0.0413
A1  B1  C0        10                  0.0413
A1  B1  C1       100                  0.413

(Here $Z = 242$.)
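These numbers can be reproduced directly from the potentials on the previous slide (a short sketch; $Z = 242$, so for example $10/242 \approx 0.0413$):

```python
# Multiply the two clique potentials and normalize by Z to get the table above.
import itertools

phi1 = {(0, 0): 10, (0, 1): 1, (1, 0): 1, (1, 1): 10}     # phi_1(A, B)
phi2 = {(0, 0): 1, (0, 1): 10, (1, 0): 1, (1, 1): 10}     # phi_2(B, C)

unnorm = {(a, b, c): phi1[(a, b)] * phi2[(b, c)]
          for a, b, c in itertools.product([0, 1], repeat=3)}
Z = sum(unnorm.values())                                  # 242
for abc, val in unnorm.items():
    print(abc, val, round(val / Z, 5))                    # e.g. (0, 0, 0) 10 0.04132
```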


Factorization in Undirected Models

How do we define probability distributions on undirected models? Define a clique $C$ to be a maximal set of nodes such that each node is connected to every other node in the set. Define the distribution

$$P(S) = \frac{1}{Z} \prod_C \psi_C(S_C)$$

where each $\psi_C$ is an arbitrary potential function on clique $C$, and $Z$ is a normalizer.

Theorem: For any undirected graph $G$, any distribution which satisfies the factorization property will be globally Markov.

Hammersley-Clifford Theorem: For strictly positive distributions (i.e., every instantiation of the variables has non-zero probability), the global Markov property is equivalent to the factorization property.

Random Field Models

We showed above that the maximum entropy approach requires the resulting probability distribution to be in an exponential form. As it turns out, when we consider probability distributions on undirected graphical models, the distribution takes exactly the same exponential form. For conditional models, we get

$$P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)$$

where $Z_\lambda(x) = \sum_y \exp\big( \sum_i \lambda_i f_i(x, y) \big)$. The features $f_i(x, y)$ are now associated with cliques of the undirected model.


Conditional Random Field Models

Lafferty et al. [ICML '01] define an undirected graphical model called a CRF that is similar to an HMM (if we drop the edge directions). Recall that in an HMM, the joint distribution $P(q, y)$ was specified as

$$P(q, y) = P(q_0) \prod_{t=0}^{T-1} P(q_{t+1} \mid q_t) \prod_{t=0}^{T} P(y_t \mid q_t)$$

A CRF is a discriminant model, and so we focus mainly on modeling the conditional distribution $P(q \mid y)$ instead of the joint distribution. One way to understand this model is that we interpret the components of the joint distribution not as probabilities, but as arbitrary potentials:

$$P(q \mid y) = \frac{P(q, y)}{P(y)} = \frac{1}{Z} \prod_{t=0}^{T-1} \phi(q_t, q_{t+1}) \prod_{t=0}^{T} \phi(q_t, y_t)$$

where the normalizer makes this a valid distribution.
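For a short chain, the conditional can be computed by brute force, which makes the potential-based definition concrete. The potentials and the observed sequence below are made-up illustrations, and enumeration is only feasible for tiny chains (in practice one would use dynamic programming).

```python
# Brute-force P(q | y) for a length-3 chain of binary states.
import itertools
import numpy as np

states, T = [0, 1], 3
phi_trans = np.array([[2.0, 1.0], [1.0, 2.0]])    # phi(q_t, q_{t+1})
phi_emit = np.array([[3.0, 1.0], [1.0, 3.0]])     # phi(q_t, y_t)
y = [0, 1, 1]                                     # an observed sequence

def score(q):
    s = np.prod([phi_trans[q[t], q[t + 1]] for t in range(T - 1)])
    return s * np.prod([phi_emit[q[t], y[t]] for t in range(T)])

Z = sum(score(q) for q in itertools.product(states, repeat=T))
for q in itertools.product(states, repeat=T):
    print(q, score(q) / Z)                        # P(q | y); sums to 1 over q
```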

Generative vs. Discriminative

Generative approaches (e.g., HMMs) are based on modeling the joint distribution $P(x, y)$, e.g., via $P(y)$ and $P(x \mid y)$. Discriminant approaches directly model $P(y \mid x)$ and ignore $P(x)$.

Generative models lead to directed graphical models, and discriminative models lead to undirected graphical models.

The likelihood for generative models factors nicely for completely observed graphical models. For discriminative models, the likelihood is non-factorable due to the normalizing constant $Z$.

Generative models make it possible to handle missing data through EM; discriminative approaches cannot deal with missing data easily.

Although the likelihood term for completely observed discriminative models is non-factorable and requires numerical methods, it is unimodal and concave with a global optimum.

Generative models (HMMs) have been hugely successful in speech and many other domains. Discriminative models, like CRFs, offer significant advantages in domains like text, where multiple overlapping feature sets are useful (e.g., “is-capitalized”, “is-noun”, etc.). They have also been applied extensively in vision.
