Maximum Entropy Distribution (Also Known As the Maxwell-Boltzmann Distribution)

Maximum Entropy Framework
Sridhar Mahadevan, University of Massachusetts (CMPSCI 689)

Topics

Maximum entropy is the dual of maximum likelihood. Maximum entropy (ME) involves finding the "maximally general" distribution that satisfies a set of prespecified constraints. The notion of "maximal generality" is made precise using the concept of entropy. The distribution that satisfies the ME constraints can be shown to be in an exponential form. The concept of maximum entropy comes from statistical physics, and this whole area of machine learning borrows heavily from physics. Undirected graphical models are called random field models, and they also derive much of their formalism from statistical physics.

Statistical Physics

Molecules in a gas; temperature is proportional to kinetic energy. What is the distribution of velocities?

The concept of entropy was investigated originally in statistical physics and thermodynamics. The temperature of a gas is proportional to the average kinetic energy of the molecules in the gas. The distribution of velocities at a given temperature is a maximum entropy distribution (also known as the Maxwell-Boltzmann distribution). Boltzmann machines are a type of neural network that get their inspiration from statistical physics. There is a whole subfield of energy-based models in machine learning based on borrowing ideas from statistical physics.

Maximum Entropy Example

Consider estimating a joint distribution over two variables u and v, given some constraints. The four cell probabilities are unknown; only the column marginal P(v = 0) = 0.6 and the total probability mass 1.0 are given:

  P(u, v)    v = 0    v = 1
  u = 0        ?        ?
  u = 1        ?        ?
  sum         0.6            (total = 1.0)

The maximum entropy framework suggests picking values that make the least commitments while being consistent with the constraints that are given. For this example, this suggests a unique assignment of values, but in general, determining the set of possible distributions is not so straightforward.

Maximum Entropy Distributions

Consider maximizing the entropy H(P) over all distributions P satisfying the following constraints:

  • P(x) >= 0, where x ∈ S, the support of P
  • \sum_{x \in S} P(x) = 1
  • \sum_{x \in S} P(x) r_i(x) = \alpha_i for 1 <= i <= m

Maximum Entropy Distribution

Writing out the Lagrangian, we get

  \Lambda(P) = -\sum_{x \in S} P(x) \ln P(x) + \lambda_0 \Big( \sum_{x \in S} P(x) - 1 \Big) + \sum_{i=1}^{m} \lambda_i \Big( \sum_{x \in S} P(x) r_i(x) - \alpha_i \Big)

Taking the gradient of this Lagrangian with respect to P(x), we get

  \frac{\partial}{\partial P(x)} \Lambda(P) = -\ln P(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)

Setting this to zero gives

  P(x) = \exp\Big( \lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i r_i(x) \Big), \quad x \in S

where \lambda_0, \lambda_1, \ldots, \lambda_m are chosen so that the constraints are satisfied.

Differential Entropy

If X is a continuous variable, the entropy H(X) is referred to as the differential entropy. A similar result can be proved for continuous variables, with constraints

  • P(x) >= 0, where P(x) = 0 if x is not in the support S of P
  • \int_S P(x)\, dx = 1
  • \int_S P(x) r_i(x)\, dx = \alpha_i for 1 <= i <= m

The Lagrangian in the continuous case is

  \Lambda(P) = -\int_S P(x) \ln P(x)\, dx + \lambda_0 \Big( \int_S P(x)\, dx - 1 \Big) + \sum_{i=1}^{m} \lambda_i \Big( \int_S P(x) r_i(x)\, dx - \alpha_i \Big)

Example 1: Uniform Distribution

Find the maximum entropy distribution that satisfies the following constraint: S = [a, b]. Since there are no other constraints, the form of the distribution must be P(x) = e^{\lambda_0}, because all \lambda_i = 0 for i > 0. Solving the normalization integral, we find

  \int_a^b e^{\lambda_0}\, dx = 1 \;\Rightarrow\; e^{\lambda_0} (b - a) = 1

which immediately gives us the uniform distribution, because

  P(x) = e^{\lambda_0} = \frac{1}{b - a}
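The Lagrangian solution above can also be checked numerically on a small discrete support. The sketch below is a minimal illustration, not part of the original slides: it uses scipy to solve for the multipliers \lambda_0, \ldots, \lambda_m so that the exponential form P(x) = exp(\lambda_0 - 1 + \sum_i \lambda_i r_i(x)) satisfies the constraints. The support {1, ..., 6} and the single moment constraint E[X] = 4.5 are assumptions chosen only to make the example concrete.

```python
import numpy as np
from scipy.optimize import fsolve

# Discrete support S and moment constraints sum_x P(x) r_i(x) = alpha_i.
# The support {1,...,6} and the single constraint E[X] = 4.5 are illustrative choices.
S = np.arange(1, 7)
r = [lambda x: x]            # r_1(x) = x, so the constraint fixes the mean
alpha = np.array([4.5])      # alpha_1

def distribution(lmbda):
    # P(x) = exp(lambda_0 - 1 + sum_i lambda_i r_i(x)), the form derived above
    lam0, lam_rest = lmbda[0], lmbda[1:]
    exponent = lam0 - 1 + sum(l * ri(S) for l, ri in zip(lam_rest, r))
    return np.exp(exponent)

def residuals(lmbda):
    P = distribution(lmbda)
    norm = P.sum() - 1.0                                    # sum_x P(x) = 1
    moments = [P @ ri(S) - a for ri, a in zip(r, alpha)]    # moment constraints
    return [norm] + moments

lam = fsolve(residuals, x0=np.zeros(1 + len(r)))
P = distribution(lam)
print(P, P.sum(), P @ S)     # P tilts toward larger x so that E[X] = 4.5
```

With no moment constraints at all (m = 0), the same solve returns a constant P(x), mirroring the uniform distribution of Example 1.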
Example 2: Gaussian Distribution

Find the maximum entropy distribution that satisfies the following constraints: S = (-\infty, \infty), E[X] = \int x P(x)\, dx = 0, and E[X^2] = \sigma^2. The form of the distribution in this case is

  P(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}

We have to find the constants \lambda_i by solving

  \int_{-\infty}^{\infty} e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}\, dx = 1
  \int_{-\infty}^{\infty} x\, e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}\, dx = 0
  \int_{-\infty}^{\infty} x^2\, e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}\, dx = \sigma^2

Show that by choosing \lambda_1 = 0, \lambda_2 = -\frac{1}{2\sigma^2}, and \lambda_0 = -\ln(\sqrt{2\pi}\,\sigma), we get the Gaussian distribution.

Example 3: Mystery Distribution

Find the maximum entropy distribution that satisfies the following constraints: S = [0, \infty) and E[X] = \mu. The solution to this mystery distribution gives us a well-known PDF (it is highly recommended that you solve this by finding the \lambda_i constants!).

Cross Entropy and Log Likelihood

Let us define the empirical distribution P̃(x) by placing a point mass at each data point:

  \tilde{P}(x) = \frac{1}{N} \sum_{n=1}^{N} I(x, x_n)

The cross-entropy of the empirical distribution with the model P(x | θ) gives

  \sum_x \tilde{P}(x) \log P(x \mid \theta) = \sum_x \frac{1}{N} \sum_{n=1}^{N} I(x, x_n) \log P(x \mid \theta)
  = \frac{1}{N} \sum_{n=1}^{N} \sum_x I(x, x_n) \log P(x \mid \theta)
  = \frac{1}{N} \sum_{n=1}^{N} \log P(x_n \mid \theta) = \frac{1}{N}\, l(\theta \mid X)

KL Divergence and Maximum Likelihood

Let us compute the KL divergence between the empirical distribution P̃(x) and the model P(x | θ):

  KL(\tilde{P} \,\|\, P(x \mid \theta)) = \sum_x \tilde{P}(x) \log \frac{\tilde{P}(x)}{P(x \mid \theta)}
  = \sum_x \tilde{P}(x) \log \tilde{P}(x) - \sum_x \tilde{P}(x) \log P(x \mid \theta)
  = \sum_x \tilde{P}(x) \log \tilde{P}(x) - \frac{1}{N}\, l(\theta \mid X)

The first term is independent of the model parameters. This means that optimizing the parameters θ to minimize the KL divergence between the model and the empirical distribution is the same as maximizing the (log) likelihood.

Conditional Maximum Entropy

In classification, we are interested in modeling the conditional distribution P(y | x). Let us now turn to viewing classification in the maximum entropy framework. To model P(y | x), let us assume we are given a set of binary features f_i(x, y), which represent the co-occurrence of the ("contextual") input x and the (label) output y. For example, in a natural language application, x may be the set of words surrounding a given word, and y may represent the part of speech (or a category like "company name"). Let us generalize the idea from logistic regression and construct an exponential model P(y | x, λ) using a weighted linear combination of the features f_i(x, y), where the weights are scalars λ_i. We want to set up constraints based on the "error" between the empirical joint distribution P̃(x, y) and the product model distribution P̃(x) P(y | x, λ).

Constraints

We now define a set of constraints on the simplex of probability distributions. Each constraint is based on comparing the prediction of the model with the empirical distribution based on the training set (x_1, y_1), ..., (x_n, y_n). Let us define the empirical joint distribution P̃(u, v) as simply the frequency of co-occurrence of (u, v) in the training set:

  \tilde{P}(u, v) = \frac{1}{N} \sum_{x, y} I((u, v), (x, y))

For each binary-valued feature f(x, y) we can impose an additional constraint, namely that the expected value of f under the model should agree with its expectation under the empirical joint distribution:

  E_P(f) \equiv \sum_{x, y} \tilde{P}(x)\, P(y \mid x, \lambda)\, f(x, y) \;=\; E_{\tilde{P}}(f) \equiv \sum_{x, y} \tilde{P}(x, y)\, f(x, y)
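The two expectations in this constraint can be made concrete with a few lines of code. The sketch below is an illustration, not from the slides: it computes the empirical expectation E_{P̃}(f) and the model expectation E_P(f) for a single binary feature on a toy training set. The contexts, labels, the particular feature, and the uniform starting model P(y | x) are all assumptions made purely for illustration.

```python
from collections import Counter

# Toy training set of (x, y) pairs; the contexts and labels are illustrative only.
data = [("cold", "NOUN"), ("cold", "ADJ"), ("cold", "ADJ"), ("run", "VERB")]
X = sorted({x for x, _ in data})
Y = sorted({y for _, y in data})
N = len(data)

# A single binary feature f(x, y), as in the slides.
def f(x, y):
    return 1.0 if (x == "cold" and y == "ADJ") else 0.0

# Empirical joint distribution P~(x, y) and empirical marginal P~(x).
p_xy = {xy: c / N for xy, c in Counter(data).items()}
p_x = {x: c / N for x, c in Counter(x for x, _ in data).items()}

# Model distribution P(y | x, lambda); here a uniform conditional (lambda = 0) as a starting point.
def p_y_given_x(y, x):
    return 1.0 / len(Y)

# E_{P~}(f): expectation of f under the empirical joint distribution.
emp_expect = sum(p * f(x, y) for (x, y), p in p_xy.items())

# E_P(f): expectation under the product distribution P~(x) P(y | x, lambda).
model_expect = sum(p_x[x] * p_y_given_x(y, x) * f(x, y) for x in X for y in Y)

print(emp_expect, model_expect)   # the constraint asks these two numbers to agree
```

Training a maximum entropy model amounts to adjusting λ until these two numbers agree for every feature, which is what the dual optimization on the next slides accomplishes.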
Maximum Conditional Entropy

Given n feature functions f_i(x, y), we would like our estimated model to be in accordance with the sample distribution:

  \mathcal{C} = \{\, P \in \mathcal{P} \mid E_P(f_i(x, y)) = E_{\tilde{P}}(f_i(x, y)) \text{ for } i \in \{1, \ldots, n\} \,\}

But this set of constraints still defines a simplex of possible distributions, and we need to constrain it further. The maximum conditional entropy principle states that we should pick the model that maximizes the conditional entropy

  H(P) = -\sum_{x \in X,\, y \in Y} \tilde{P}(x)\, P(y \mid x, \lambda) \log P(y \mid x, \lambda)

Putting both of these together, we get

  P^* = \arg\max_{P \in \mathcal{C}} H(P)

Constrained Optimization

For each feature function f_i(x, y), we introduce a Lagrange multiplier λ_i and form the overall Lagrangian

  \Lambda(P, \lambda) \equiv H(P) + \sum_i \lambda_i \big( E_P(f_i) - E_{\tilde{P}}(f_i) \big)

If we keep λ fixed, we can find the unconstrained maximum of the Lagrangian Λ(P, λ) over all P ∈ 𝒫. Let P_λ denote the argument where the maximum is reached, and Ψ(λ) denote the maximizing value:

  P_\lambda = \arg\max_{P \in \mathcal{P}} \Lambda(P, \lambda)
  \Psi(\lambda) = \Lambda(P_\lambda, \lambda)

Maximum Entropy Distribution

Key theorem: by solving the Lagrangian above, one can show that the maximum entropy distribution is given by the following exponential form:

  P(y \mid x, \lambda) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)

where Z_\lambda(x) = \sum_y \exp\big( \sum_i \lambda_i f_i(x, y) \big). The maximizing value is

  \Psi(\lambda) \equiv \Lambda(P_\lambda, \lambda) = -\sum_x \tilde{P}(x) \log Z_\lambda(x) + \sum_i \lambda_i E_{\tilde{P}}(f_i)

Karush-Kuhn-Tucker Theorem

The original constrained optimization problem, called the primal, is

  P^* = \arg\max_{P \in \mathcal{C}} H(P)

The dual (unconstrained) optimization problem is to find

  \lambda^* = \arg\max_\lambda \Psi(\lambda)

The KKT theorem shows that under certain conditions, the solution to the dual problem gives us the solution to the primal problem. More precisely, P^* = P_{\lambda^*}. We will see the KKT theorem show up again in support vector machines.

Maximum Likelihood Revisited

The log-likelihood of the empirical distribution P̃, as predicted by the discriminant function P(Y | X, λ), is

  L_{\tilde{P}}(P) = \log \prod_{x, y} P(y \mid x, \lambda)^{\tilde{P}(x, y)} = \sum_{x, y} \tilde{P}(x, y) \log P(y \mid x, \lambda)

We can show that the log-likelihood above is exactly the dual function Ψ(λ) given above. In other words, the maximum entropy model P^* ∈ 𝒞 is exactly the model in the parametric family P(Y | X, λ) that maximizes the likelihood of the training instances P̃.
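The duality can be seen end to end in a short sketch, again under toy assumptions rather than the original course code: gradient ascent on the dual Ψ(λ), whose partial derivative with respect to each λ_i is E_{P̃}(f_i) − E_P(f_i), trains the exponential model P(y | x, λ) until the feature-expectation constraints are (approximately) met. The data, features, step size, and iteration count below are arbitrary choices for the example.

```python
import numpy as np
from collections import Counter

# Toy training set; contexts, labels, features, and step size are illustrative.
data = [("cold", "NOUN"), ("cold", "ADJ"), ("cold", "ADJ"), ("run", "VERB")]
Y = sorted({y for _, y in data})
N = len(data)

# Binary features f_i(x, y).
features = [
    lambda x, y: 1.0 if (x == "cold" and y == "ADJ") else 0.0,
    lambda x, y: 1.0 if (x == "run" and y == "VERB") else 0.0,
]
lam = np.zeros(len(features))

def p_y_given_x(x, lam):
    # P(y | x, lambda) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x)
    scores = np.array([sum(l * fi(x, y) for l, fi in zip(lam, features)) for y in Y])
    expd = np.exp(scores)
    return expd / expd.sum()

p_xy = {xy: c / N for xy, c in Counter(data).items()}
p_x = {x: c / N for x, c in Counter(x for x, _ in data).items()}

# Empirical feature expectations E_{P~}(f_i).
emp = np.array([sum(p * fi(x, y) for (x, y), p in p_xy.items()) for fi in features])

for _ in range(500):                       # gradient ascent on the dual Psi(lambda)
    model = np.zeros(len(features))        # model expectations E_P(f_i)
    for x, px in p_x.items():
        probs = p_y_given_x(x, lam)
        for j, y in enumerate(Y):
            for i, fi in enumerate(features):
                model[i] += px * probs[j] * fi(x, y)
    lam += 0.5 * (emp - model)             # gradient of Psi: E_{P~}(f_i) - E_P(f_i)

print(lam)
print({x: dict(zip(Y, np.round(p_y_given_x(x, lam), 3))) for x in p_x})
```

After training, P(ADJ | cold) approaches 2/3 and P(VERB | run) climbs toward 1, so the model expectations of both features match their empirical values, which is exactly what the constraints require.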
