Machine Learning Srihari

Learning Parameters of Undirected Models

Sargur Srihari

Topics

• Difficulties due to Global Normalization
• Maximum Likelihood Parameter Estimation
• Simple and Conjugate Gradient Ascent
• Conditionally Trained Models
• Parameter Learning with Missing Data
  – Gradient Ascent vs. EM
• Alternative Formulation of Max Likelihood
  – Maximum Entropy subject to Constraints
• Parameter Priors and Regularization

Local vs Global Normalization

• BN: Local normalization within each CPD
  $p(X) = \prod_{i=1}^{n} p(X_i \mid pa_i)$
• MN: Global normalization (partition function)
  $P(X_1,\dots,X_n) = \frac{1}{Z}\prod_{i=1}^{K}\phi_i(D_i)$, where $Z = \sum_{X_1,\dots,X_n}\prod_{i=1}^{K}\phi_i(D_i)$

• Global factor couples all parameters, preventing decomposition
• Significant computational ramifications
  – M.L. parameter estimation has no closed-form solution
  – Need iterative methods

Issues in Parameter Estimation

• Simple ML parameter estimation (even with complete data) cannot be solved in closed form
• Need iterative methods such as gradient ascent
• Good news:
  – Likelihood function is concave
    • Methods converge to the global optimum
• Bad news:
  – Each step in the iterative algorithm requires inference
    • Simple parameter estimation is expensive/intractable
  – Bayesian estimation is practically infeasible

Discriminative Training

• A common use of MNs is in settings such as image segmentation, where we have a particular inference task in mind
• Train the network discriminatively to get good performance on that particular inference task

Likelihood Function

• The likelihood function is the basis for all discussion of learning
• We look at how the likelihood function can be optimized to find maximum likelihood parameter estimates
• Begin with the form of the likelihood function for Markov networks, its properties and their computational implications

Example of Coupled Likelihood Function

• Simple Markov network A—B—C
  – Two potentials ϕ1(A,B), ϕ2(B,C)
  – Gibbs distribution: $P(A,B,C) = \frac{1}{Z}\phi_1(A,B)\,\phi_2(B,C)$, where $Z = \sum_{a,b,c}\phi_1(a,b)\,\phi_2(b,c)$
• Log-likelihood of instance (a,b,c) is
  $\ln P(a,b,c) = \ln\phi_1(a,b) + \ln\phi_2(b,c) - \ln Z$
• Log-likelihood of data set D with M instances, where the parameter θ consists of all values of the factors ϕ1 and ϕ2:
  $\ell(\theta : D) = \sum_{m=1}^{M}\big(\ln\phi_1(a[m],b[m]) + \ln\phi_2(b[m],c[m]) - \ln Z(\theta)\big)$ (summing over instances)
  $\ell(\theta : D) = \sum_{a,b} M[a,b]\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\ln\phi_2(b,c) - M\ln Z(\theta)$ (summing over values of the variables)
• The first term involves only ϕ1 and the second only ϕ2, but the third involves
  $\ln Z(\theta) = \ln\Big(\sum_{a,b,c}\phi_1(a,b)\,\phi_2(b,c)\Big)$
• which couples the two potentials in the likelihood function
  – When we change one potential ϕ1, Z(θ) changes, possibly changing the value of ϕ2 that maximizes –ln Z(θ)

Illustration of Coupled Likelihood

• Log-likelihood for A—B—C with respect to the two factors ϕ1(A,B), ϕ2(B,C)
  – With binary variables we would have 8 parameters
  $\ell(\theta : D) = \sum_{a,b} M[a,b]\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\ln\phi_2(b,c) - M\ln Z(\theta)$
• Log-likelihood surface
  – When ϕ1 changes, the ϕ2 that maximizes –ln Z(θ) also changes
  [Figure: log-likelihood surface over f1(a1,b1) and f2(b0,c1), with all other parameters set to 1. Data set has M=100, M[a1,b1]=40, M[b0,c1]=40]
• In this case the problem is avoidable
  – The network is equivalent to the BN A→B→C, so we can estimate the parameters of the BN and set
  – ϕ1(A,B)=P(A)P(B|A), ϕ2(B,C)=P(C|B)
• In general, we cannot convert learned BN parameters into an equivalent MN
  – Optimal likelihood achievable by the two representations is not the same

Form of Likelihood Function

• Instead of Gibbs, use log-linear framework

  – Joint distribution of n variables X1,..,Xn
  – k features F = { fi(Di) }, i=1,..,k (k depends on how many values the Di take)
    • where Di is a sub-graph and fi maps Di to R
  $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\Big\{\sum_{i=1}^{k}\theta_i f_i(D_i)\Big\}$
• Parameters θi are weights we put on features
  – If we have a sample ξ then its features are fi(ξ(Di)), which has the shorthand fi(ξ)
• Representation is general
  – It can capture Markov networks with global structure and local structure

Parameters θ, Factors ϕ, binary features f

• Joint distribution: $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\Big\{\sum_{i=1}^{k}\theta_i f_i(D_i)\Big\}$
• Variables: A—B—C—D—A
• Factor ϕ1(A,B) as a table:

  A    B    ϕ1(A,B)
  a0   b0   ϕ1(a0,b0)
  a0   b1   ϕ1(a0,b1)
  a1   b0   ϕ1(a1,b0)
  a1   b1   ϕ1(a1,b1)

  [Figure: Markov networks (a), (b), (c) over A, B, C, D]
• A feature for every entry in every table
  – fi(Di) are sixteen indicator functions defined over clusters AB, BC, CD, DA
  – Val(A)={a0,a1}, Val(B)={b0,b1}
  – $f_{a^0b^0}(A,B) = I\{A=a^0\}\,I\{B=b^0\}$, i.e., fa0,b0=1 if A=a0 and B=b0, and 0 otherwise, etc.
  – With this representation $\theta_{a^0b^0} = \ln\phi_1(a^0,b^0)$
• Parameters θ are log-potentials, which are weights put on features

Log Likelihood and Sufficient Statistics

• Joint distribution:
  $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\Big\{\sum_{i=1}^{k}\theta_i f_i(D_i)\Big\}$
  – θ={θ1,..,θk} are table entries; fi are features over instances of Di
• Let D be a data set of M samples ξ[m], m=1,..,M
• Log-likelihood
  – Log of the product of probabilities of M independent instances:
  $\ell(\theta : D) = \sum_i \theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$
• Sufficient statistics (likelihood depends only on this)
  – Dividing by the number of samples M:
  $\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i\,E_D[f_i(d_i)] - \ln Z(\theta)$
  – E_D[fi(di)] is the average of fi in the data set

Properties of Log-Likelihood

• Log-likelihood is a sum of two functions
  $\ell(\theta : D) = \underbrace{\sum_i \theta_i\Big(\sum_m f_i(\xi[m])\Big)}_{\text{first term}} - \underbrace{M\ln Z(\theta)}_{\text{second term}}$
  – First term is linear in the parameters θ
    • Increasing parameters increases this term
    • But likelihood has upper-bound of probability 1

  – Second term: $\ln Z(\theta) = \ln\sum_\xi \exp\Big\{\sum_i\theta_i f_i(\xi)\Big\}$
    • balances the first term
  – The log-partition function ln Z(θ) is convex, as seen next
    • Its negative is concave
  – Sum of a linear function and a concave function is concave, so
    • There are no local optima
    • Can use gradient ascent

Proof that ln Z(θ) is Convex

• A function f(x) is convex if for every 0 ≤ α ≤ 1
  $f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y)$
  – The function is bowl-like
  – Every interpolation between the images of two points is at least as large as the image of their interpolation
  – One way to show that a function is convex is to show that its Hessian (matrix of the function's second derivatives) is positive semi-definite
• The Hessian of ln Z(θ) is computed as follows

Hessian of ln Z(θ)

• Given a set of features F with $\ln Z(\theta) = \ln\sum_\xi \exp\Big\{\sum_i\theta_i f_i(\xi)\Big\}$
  $\frac{\partial}{\partial\theta_i}\ln Z(\theta) = E_\theta[f_i]$
  $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]$
  where E_θ[fi] is shorthand for E_{P(χ;θ)}[fi]
• Proof (partial derivatives wrt θi and θj):
  $\frac{\partial}{\partial\theta_i}\ln Z(\theta) = \frac{1}{Z(\theta)}\sum_\xi\frac{\partial}{\partial\theta_i}\exp\Big\{\sum_j\theta_j f_j(\xi)\Big\} = \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)\exp\Big\{\sum_j\theta_j f_j(\xi)\Big\} = E_\theta[f_i]$
  $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \frac{\partial}{\partial\theta_j}\Big[\frac{1}{Z(\theta)}\sum_\xi\frac{\partial}{\partial\theta_i}\exp\Big\{\sum_k\theta_k f_k(\xi)\Big\}\Big] = \mathrm{Cov}_\theta[f_i; f_j]$
  – Since the covariance matrix of the features is positive semi-definite, ln Z(θ) is convex and -ln Z(θ) is a concave function of θ
• Corollary: log-likelihood function is concave

Non-unique Solution

• Since ln Z(θ) is convex, -ln Z(θ) is concave
• Implies that log-likelihood is unimodal
  – Has no local optima
• However, this does not imply uniqueness of the global optimum
• Multiple parameterizations can result in the same distribution
  – A feature for every entry in the table is always redundant, e.g.,

    • fa0,b0 = 1 - fa0,b1 - fa1,b0 - fa1,b1
    • A continuum of parameterizations

Maximum (Conditional) Likelihood Parameter Estimation

• Task: Estimate parameters of a Markov network with a fixed structure given a fully observable data set D
• Simplest variant of the problem is maximum likelihood parameter estimation
• Log-likelihood given features F={ fi, i=1,..,k } is
  $\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$
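The log-likelihood formula above can be checked numerically on the A—B—C chain. The sketch below is only an illustration: the joint counts are hypothetical (chosen so that M=100, M[a1,b1]=40 and M[b0,c1]=40, as in the earlier illustration), the weights are arbitrary, and ln Z(θ) is obtained by brute-force enumeration, which is feasible only for tiny networks.

```python
import itertools
import math

# Hypothetical counts of each joint assignment (a, b, c) for the chain A-B-C.
# Illustrative numbers only: M=100, M[a1,b1]=40, M[b0,c1]=40.
COUNTS = {
    (0, 0, 0): 5, (0, 0, 1): 25, (0, 1, 0): 5, (0, 1, 1): 5,
    (1, 0, 0): 5, (1, 0, 1): 15, (1, 1, 0): 15, (1, 1, 1): 25,
}
STATES = list(itertools.product([0, 1], repeat=3))
PAIRS = [(u, v) for u in (0, 1) for v in (0, 1)]


def features(state):
    """Eight indicator features: four AB table entries, then four BC entries."""
    a, b, c = state
    return [float((a, b) == p) for p in PAIRS] + [float((b, c) == p) for p in PAIRS]


def log_partition(theta):
    """ln Z(theta) by brute-force enumeration of all joint assignments."""
    return math.log(sum(
        math.exp(sum(t * f for t, f in zip(theta, features(s)))) for s in STATES))


def loglik_by_instances(theta):
    """Sum of ln P(xi[m]; theta), one term per assignment weighted by its count."""
    ln_z = log_partition(theta)
    return sum(n * (sum(t * f for t, f in zip(theta, features(s))) - ln_z)
               for s, n in COUNTS.items())


def loglik_by_sufficient_stats(theta):
    """Same value from aggregate feature counts: sum_i theta_i M[f_i] - M ln Z."""
    m_total = sum(COUNTS.values())
    feature_counts = [sum(n * features(s)[i] for s, n in COUNTS.items())
                      for i in range(8)]
    return sum(t * c for t, c in zip(theta, feature_counts)) - m_total * log_partition(theta)


theta = [0.3, -0.2, 0.1, 0.5, -0.4, 0.6, 0.0, 0.2]  # arbitrary test weights
print(loglik_by_instances(theta), loglik_by_sufficient_stats(theta))
```

Both functions return the same number, which shows that the data enter the likelihood only through the feature counts.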

Gradient of Log-likelihood

• Log-likelihood:
  $\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$
• Average log-likelihood (dividing by the number of samples):
  $\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i\,E_D[f_i(d_i)] - \ln Z(\theta)$
• Gradient of the second term is
  $\frac{\partial}{\partial\theta_i}\ln Z(\theta) = \frac{1}{Z(\theta)}\sum_\xi\frac{\partial}{\partial\theta_i}\exp\Big\{\sum_j\theta_j f_j(\xi)\Big\} = \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)\exp\Big\{\sum_j\theta_j f_j(\xi)\Big\} = E_\theta[f_i]$
• Gradient of the average log-likelihood is
  $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
  – First term is the average value of fi in data D. Second term is its expected value under the distribution P_θ
• Provides a precise characterization of the m.l. parameters θ̂
• Theorem: Let F be a feature set. Then θ̂ is a m.l. assignment if and only if $E_D[f_i(\chi)] = E_{\hat\theta}[f_i]$ for all i
  – i.e., the expected value of each feature relative to P_θ̂ matches its empirical expectation in D

Need for Iterative Method

• Although the log-likelihood function
  $\ell(\theta : D) = \sum_i \theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$
  is concave, there is no analytical form for its maximum
• Since there is no closed-form solution
  – Can use iterative methods, e.g. gradient ascent as shown next
• Gradient is
  $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$

Computing the gradient

• Gradient wrt θi of the average log-likelihood (1/M)ℓ(θ:D) is
  $E_D[f_i(\chi)] - E_\theta[f_i]$
  – It is the difference between
    • the feature's empirical count in data D and
    • the expected count relative to our current parameterization
  [Figure: Markov networks (a), (b), (c) over A, B, C, D]
• Computing the empirical count E_D[fi(χ)] is easy
  – Ex: for feature $f_{a^0b^0}(a,b) = I\{a=a^0\}\,I\{b=b^0\}$ it is the empirical frequency in D of the event a0,b0

Computing the expected count

• Second term in the gradient is
  $E_\theta[f_i] = \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)\exp\Big\{\sum_j\theta_j f_j(\xi)\Big\}$
• We need to compute the different probabilities of the form $P_{\theta^t}(a,b)$
  – Since expectation is a probability-weighted average
• Computing probability requires running inference over the network

Iterative solution for MRF parameters

[Flowchart]
1. Initialize θ
2. Run inference (compute Z(θ)): $Z(\theta) = \sum_\xi\exp\Big\{\sum_i\theta_i f_i(\xi)\Big\}$
3. Compute gradient of ℓ: $E_D[f_i(\chi)] - E_\theta[f_i]$
4. Update θ: $\theta^{t+1} \leftarrow \theta^{t} + \eta\,\nabla f_{obj}(\theta^{t})$
5. Optimum reached? If no, return to step 2; if yes, stop

Difficulty with Iterative method

• Gradient ascent over parameter space
• Good news:
  – Likelihood function is concave
    • Guaranteed to converge to global optimum
• Bad news:
  – Each step needs inference
  – Simple parameter estimation is intractable
  – Bayesian parameter estimation is even harder
    • Integration done using MCMC
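The initialize, infer, compute-gradient, update loop can be sketched for the small A—B—C chain. This is a minimal illustration under stated assumptions, not a practical implementation: the counts are hypothetical, the step size `eta` and iteration count are arbitrary choices, and "inference" is brute-force enumeration of the eight joint assignments.

```python
import itertools
import math

# Hypothetical counts for the chain A-B-C (M=100, M[a1,b1]=40, M[b0,c1]=40).
COUNTS = {
    (0, 0, 0): 5, (0, 0, 1): 25, (0, 1, 0): 5, (0, 1, 1): 5,
    (1, 0, 0): 5, (1, 0, 1): 15, (1, 1, 0): 15, (1, 1, 1): 25,
}
STATES = list(itertools.product([0, 1], repeat=3))
PAIRS = [(u, v) for u in (0, 1) for v in (0, 1)]


def features(state):
    """Eight indicator features: four AB table entries, then four BC entries."""
    a, b, c = state
    return [float((a, b) == p) for p in PAIRS] + [float((b, c) == p) for p in PAIRS]


def model_distribution(theta):
    """Inference by brute force: P(state; theta) for every joint assignment."""
    scores = [math.exp(sum(t * f for t, f in zip(theta, features(s)))) for s in STATES]
    z = sum(scores)
    return {s: w / z for s, w in zip(STATES, scores)}


def expected_features(dist):
    """Expectation of each feature under a distribution over assignments."""
    return [sum(p * features(s)[i] for s, p in dist.items()) for i in range(8)]


def fit(num_steps=3000, eta=0.5):
    """Plain gradient ascent on the average log-likelihood."""
    m_total = sum(COUNTS.values())
    empirical = expected_features({s: n / m_total for s, n in COUNTS.items()})
    theta = [0.0] * 8                                       # initialize
    for _ in range(num_steps):
        expected = expected_features(model_distribution(theta))   # run inference
        grad = [e_d - e_t for e_d, e_t in zip(empirical, expected)]
        theta = [t + eta * g for t, g in zip(theta, grad)]  # update
    return theta, empirical


theta_hat, empirical = fit()
fitted = expected_features(model_distribution(theta_hat))
print(max(abs(e - f) for e, f in zip(empirical, fitted)))
```

At convergence the model's expected feature values match the empirical ones, which is the moment-matching condition of the theorem. Every iteration calls `model_distribution`, which is the inference cost the slide refers to.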

Use of Second Order Solution

• Computational cost of parameter estimation is very high
  – Gradient ascent is not efficient
• Much faster convergence using second order methods based on the Hessian
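As a rough illustration of the difference in convergence speed, the sketch below fits the simplest possible log-linear model: one binary variable with a single feature f(x) = [x = 1], so that Z(θ) = 1 + e^θ. The model, the empirical frequency 0.9, the step size and the tolerance are all assumptions made for the example. The Newton step divides the gradient by the second derivative, which here is minus the variance of the feature.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit(p_hat, use_newton, eta=1.0, tol=1e-10, max_iter=10000):
    """Maximize theta * p_hat - ln(1 + e^theta); return (theta, iterations used)."""
    theta = 0.0
    for step in range(1, max_iter + 1):
        p = sigmoid(theta)             # E_theta[f], the only inference needed here
        grad = p_hat - p               # E_D[f] - E_theta[f]
        if abs(grad) < tol:
            return theta, step
        if use_newton:
            hess = -p * (1.0 - p)      # minus the variance of f under the model
            theta -= grad / hess       # Newton step
        else:
            theta += eta * grad        # simple gradient ascent step
    return theta, max_iter


theta_newton, steps_newton = fit(0.9, use_newton=True)
theta_gradient, steps_gradient = fit(0.9, use_newton=False)
print(steps_newton, steps_gradient)
```

Both runs reach the same maximizer, ln(0.9/0.1), but the Newton iteration needs far fewer steps in this example.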

Intuition for Second-Order Solution

• Newton's method
  [Figure: Newton's method in one dimension]
  – Finds zeroes of a function using derivatives
  – More efficient than simple gradient descent
• Quasi-Newton methods
  – Use an approximation to the Hessian, built from gradients
• Since we are solving for a zero of the derivative of ℓ(θ:D)
  – Need second derivative (Hessian)

Derivation of Hessian of the Log-Likelihood

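$\frac{\partial}{\partial\theta_i}\ln Z(\theta) = \frac{1}{Z(\theta)}\sum_\xi\frac{\partial}{\partial\theta_i}\exp\Big\{\sum_j\theta_j f_j(\xi)\Big\} = \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)\exp\Big\{\sum_j\theta_j f_j(\xi)\Big\} = E_\theta[f_i]$

$\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \frac{\partial}{\partial\theta_j}\Big[\frac{1}{Z(\theta)}\sum_\xi\frac{\partial}{\partial\theta_i}\exp\Big\{\sum_k\theta_k f_k(\xi)\Big\}\Big]$

$= -\frac{1}{Z(\theta)^2}\Big(\frac{\partial}{\partial\theta_j}Z(\theta)\Big)\sum_\xi f_i(\xi)\exp\Big\{\sum_k\theta_k f_k(\xi)\Big\} + \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)f_j(\xi)\exp\Big\{\sum_k\theta_k f_k(\xi)\Big\}$

$= -\frac{1}{Z(\theta)^2}\,Z(\theta)\,E_\theta[f_j]\sum_\xi f_i(\xi)\tilde P(\xi:\theta) + \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)f_j(\xi)\tilde P(\xi:\theta)$

$= -E_\theta[f_j]\sum_\xi f_i(\xi)P(\xi:\theta) + \sum_\xi f_i(\xi)f_j(\xi)P(\xi:\theta)$

$= E_\theta[f_i f_j] - E_\theta[f_i]\,E_\theta[f_j] = \mathrm{Cov}_\theta[f_i; f_j]$

Hence $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta : D) = -M\,\mathrm{Cov}_\theta[f_i; f_j]$

Computation of Hessian

• Log-likelihood has the form
  $\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$
• Solution 1: Hessian
  $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta : D) = -M\,\mathrm{Cov}_\theta[f_i; f_j]$
  – Requires the joint expectation of two features, often computationally infeasible
• Solution 2: Commonly used
  – L-BFGS (a quasi-Newton algorithm)
  – Uses gradient ascent with line search to avoid computing the Hessian

L-BFGS Algorithm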

• Limited-memory BFGS
• Approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of memory
• Popular algorithm for ML parameter estimation
  – Algorithm of choice for log-linear models and CRFs
• Uses an estimate of the inverse Hessian to steer through variable space

Conditionally Trained Models

• Often we want to perform an inference task where we have a known set of variables, or features, X
• We want to query a pre-determined set of variables Y
• We prefer to use discriminative training
• Train the network as a Conditional Random Field (CRF) that encodes a conditional distribution P(Y|X)

CRF Training

• Train the network as a CRF that encodes a conditional distribution P(Y|X)

• Training set consists of M pairs D={(y[m], x[m])}, m=1,..,M
  – Example: y[m]=word category (B-PER), x[m]=word (Mrs.)
• Objective function: log-conditional likelihood $E_{(x,y)\sim P^*}\big[\log\tilde P(y \mid x)\big]$
  – We are not interested in the distribution of the x variables; only in predicting y given x
• Log-conditional likelihood is
  $\ell_{Y|X}(\theta : D) = \ln P(y[1,..,M] \mid x[1,..,M],\theta) = \sum_{m=1}^{M}\ln P(y[m] \mid x[m],\theta)$
  – Optimizing the likelihood of each observed assignment y[m] given the observed assignment x[m]
  – It is concave since each term is concave

Gradient of Conditional Likelihood

• A reduced MN is itself an MN
• We use log-linear representation with features fi and parameters θ
  – Analogous to the gradient for the full MN
    $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
    we can write the gradient for the reduced MN
    $\frac{\partial}{\partial\theta_i}\ell_{Y|X}(\theta : D) = \sum_{m=1}^{M}\Big(f_i(y[m],x[m]) - E_\theta\big[f_i \mid x[m]\big]\Big)$
  – First term is the empirical count conditioned on x[m]
  – Second is based on running inference on each data case

Conditional Training Complexity

• In the unconditional case, each gradient step required only a single execution of inference
• When training a CRF we must execute inference for every single data case, conditioning on x[m]
• On the other hand, inference is executed on a simpler model
  – Since conditioning on evidence can only reduce computational cost

Simplification due to Conditioning

[Figure: chain of labels Y1,..,Y5, each connected to the observations X1,..,X5]

• Full MN encodes
  $\tilde P(X,Y) = \prod_{i=1}^{4}\phi_i(Y_i,Y_{i+1})\prod_{i=1}^{5}\phi_i(Y_i,X_1,X_2,X_3,X_4,X_5)$
• Edges disappear in a reduced Markov network
  – After conditioning on X, remaining edges form a simple chain
  $\tilde P(Y \mid X) = \prod_{i=1}^{5}\phi_i(Y_i,Y_{i+1} \mid X_1,X_2,X_3,X_4,X_5)$
  – Next slide compares different sequential models, CRF, HMM and MEMM, with respect to parameter learning complexity

Generative and Discriminative Models for Sequence Labeling

• A main task of PGMs: taking a set of inter-related instances and jointly labeling them
• Also called collective classification
  [Figure: non-sequential example with image regions (a)–(d) labeled building, car, road, cow, grass; sequential example]
• We look at trade-offs in using different models for instances organized sequentially

Models for Sequence Labeling

• Given: sequence of observations X={X1,..,Xk}
• Need: a joint label Y={Y1,..,Yk}
• Model trade-offs: expressive power, learnability
  [Figure: (a) HMM, (b) MEMM, (c) CRF, each over X1,..,X5 and Y1,..,Y5]
• CRF is a discriminative model
  – Directly obtains P(Y|X)
    $P(Y \mid X) = \frac{1}{Z(X)}\tilde P(Y,X)$
    $\tilde P(Y,X) = \prod_{i=1}^{k-1}\phi_i(Y_i,Y_{i+1})\prod_{i=1}^{k}\phi_i(Y_i,X_i)$
    $Z(X) = \sum_{Y}\tilde P(Y,X)$
  – Note: Z(X) is the marginal of the un-normalized measure $\tilde P(Y,X)$
• MEMM is a discriminative model
  – Conditional directed model; does not model a distribution over X
    $P(Y \mid X) = \prod_{i=1}^{k}P(Y_i \mid X_i,Y_{i-1})$
  – Y1 ⊥ X2 if not given Y2, by d-separation
  – More generally, Yi ⊥ Xj | X-j for j > i
• HMM is a generative model
  – Needs the joint distribution P(X,Y)
    $P(X,Y) = \prod_{i=1}^{k}P(X_i \mid Y_i)\,P(Y_i \mid Y_{i-1})$
    $P(Y \mid X) = \frac{P(X,Y)}{P(X)}$

Comparison of Sequential Models

• Trade-offs: Training, Expression, Independence
• Training Effort
  – HMM, MEMM easily learned (they are BNs)
  – CRF: gradient-based training needs inference for every sequence, which is difficult with large data sets
• Expressibility
  – HMM is generative; training over features is hard
  – CRF can use a rich feature set, which is important
• Independence Assumptions

  – MEMM has the label bias problem: Yi ⊥ Xj | X-j for j > i
    • A later observation has no effect on the posterior probability of the current state
    • Example: in activity recognition in a video sequence, frames are labeled as running/walking; earlier frames may be blurry but later ones clearer
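To make the CRF training cost concrete, the sketch below computes Z(X) for a small chain CRF in two ways: by summing over every label sequence, and by a forward recursion along the chain. The binary labels, the two features ([y_i = y_{i+1}] and [y_i = x_i]) and their weights are hypothetical choices made only for this illustration.

```python
import itertools
import math

LABELS = (0, 1)
THETA_TRANS = 0.8   # hypothetical weight on the feature [y_i == y_{i+1}]
THETA_EMIT = 1.5    # hypothetical weight on the feature [y_i == x_i]


def score(y, x):
    """Log of the unnormalized measure for label sequence y given observations x."""
    s = sum(THETA_EMIT * (yi == xi) for yi, xi in zip(y, x))
    s += sum(THETA_TRANS * (y[i] == y[i + 1]) for i in range(len(y) - 1))
    return s


def z_brute_force(x):
    """Z(x) by summing over all 2^k label sequences."""
    return sum(math.exp(score(y, x))
               for y in itertools.product(LABELS, repeat=len(x)))


def z_forward(x):
    """Z(x) by the forward recursion, linear in the sequence length."""
    alpha = {y: math.exp(THETA_EMIT * (y == x[0])) for y in LABELS}
    for xi in x[1:]:
        alpha = {
            y: math.exp(THETA_EMIT * (y == xi))
            * sum(alpha[prev] * math.exp(THETA_TRANS * (prev == y)) for prev in LABELS)
            for y in LABELS
        }
    return sum(alpha.values())


def conditional_prob(y, x):
    """P(y | x) = P~(y, x) / Z(x); Z depends on x, so it is recomputed per instance."""
    return math.exp(score(y, x)) / z_forward(x)


x_obs = (0, 1, 1, 0, 1)
print(z_brute_force(x_obs), z_forward(x_obs))
```

Because Z(X) depends on the observed sequence, training has to repeat this computation for every sequence in the data set at every gradient step, even though each individual computation runs on a simple chain.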

Markov Network Parameter Estimation with Missing Data

Parameter Estimation with Missing Data

• We look at estimating parameters θ when some data fields are missing
• Missing data examples
  – Some data fields omitted or not collected
  – Some hidden variables

Methods for Parameter Estimation with Missing Data

• Learning problem has difficulties
  – Parameters θ may not be identifiable
  – Coupling between different parameters
  – Likelihood is not concave (has local maxima)
• Two alternative methods
  1. Gradient Ascent (assume data is missing at random)
  2. Expectation-Maximization

Parameter Method 1 for Missing Data: Gradient Ascent

• Assume data is missing at random in data D
• In the mth instance, let o[m] be the observed entries and h[m] the random variables that are missing
• Log-likelihood has the form
  $\frac{1}{M}\ln P(D \mid \theta) = \frac{1}{M}\sum_{m=1}^{M}\ln\Big(\sum_{h[m]}\tilde P(o[m],h[m] \mid \theta)\Big) - \ln Z$
• Gradient wrt θi for feature fi has the form
  $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = \frac{1}{M}\Big[\sum_{m=1}^{M}E_{h[m]\sim P(h[m] \mid o[m],\theta)}[f_i]\Big] - E_\theta[f_i]$
• which is a difference between two expectations of fi
  – Expectation over data and hidden variables
  – Expectation in current distribution P(χ|θ)

Gradient Ascent Complexity: Full vs Missing

1. With full data, gradient of log-likelihood is
   $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
   – For the second term we need inference over the current distribution P(χ|θ)
   – First term is an aggregate over data D
2. With missing data we have
   $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = \frac{1}{M}\Big[\sum_{m=1}^{M}E_{h[m]\sim P(h[m] \mid o[m],\theta)}[f_i]\Big] - E_\theta[f_i]$
   – We have to run inference separately for every instance m, conditioning on o[m]
   – Cost is much higher than with full data

Parameter Method 2 for Missing Data: Expectation Maximization

• As for any probabilistic model, an alternative method of estimating parameters θ with missing data is EM
  – E step uses the current parameters to estimate missing values
  – M step re-estimates the parameters
• For a BN it has significant advantages
  – M-step has closed-form solution
• For an MN, the answer is not clear-cut
  – M-step requires running inference multiple times, as seen next

EM for MN parameter learning

• E-step
  – Use the current parameters θ(t) to compute expected sufficient statistics, i.e., expected feature counts

• At iteration t the expected sufficient statistic for feature fi is
  $\bar M_{\theta^{(t)}}[f_i] = \frac{1}{M}\Big[\sum_{m=1}^{M}E_{h[m]\sim P(h[m] \mid o[m],\theta^{(t)})}[f_i]\Big]$
• M-step
  – Done using maximum likelihood parameter estimation (using gradient ascent!)
    • Requires running inference multiple times, once for each iteration of the gradient ascent procedure
    • At step k of this "inner loop" optimization, we have a gradient of the form
      $\bar M_{\theta^{(t)}}[f_i] - E_{\theta^{(t,k)}}[f_i]$
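The expected feature counts used by both the missing-data gradient and the E-step can be sketched on a two-variable model. Everything specific here is an assumption for illustration: the variables (A, B), the three features, and the convention that a missing value is written as `None`. Posterior inference over the hidden entries is done by enumerating the complete assignments consistent with the observation.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=2))


def features(state):
    """Three hypothetical features of (A, B): [A=1], [B=1], [A=B]."""
    a, b = state
    return [float(a == 1), float(b == 1), float(a == b)]


def weights(theta):
    """Unnormalized measure exp(theta . f) for every complete assignment."""
    return {s: math.exp(sum(t * f for t, f in zip(theta, features(s))))
            for s in STATES}


def expected_features(theta, observed=(None, None)):
    """E[f] under P(h | o, theta); with nothing observed this is E_theta[f]."""
    w = weights(theta)
    consistent = [s for s in STATES
                  if all(o is None or o == v for o, v in zip(observed, s))]
    z = sum(w[s] for s in consistent)
    return [sum(w[s] / z * features(s)[i] for s in consistent) for i in range(3)]


def expected_counts(theta, data):
    """E-step quantity: expected feature values averaged over the instances."""
    per_instance = [expected_features(theta, obs) for obs in data]  # one inference each
    return [sum(col) / len(data) for col in zip(*per_instance)]


def gradient(theta, data):
    """Gradient of the average log-likelihood with missing data."""
    return [m - e for m, e in
            zip(expected_counts(theta, data), expected_features(theta))]


data = [(1, 1), (0, None), (1, None), (None, 0)]   # None marks a missing value
print(gradient([0.0, 0.0, 0.0], data))
```

In an EM scheme, `expected_counts` would be evaluated once per outer iteration with θ fixed at θ(t), while the model expectation `expected_features(theta)` is recomputed at every inner gradient step.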

MN Parameter Learning: Maximum Entropy Formulation of Maximum Likelihood Objective

Maximum Entropy Formulation of Markov Network Parameter Estimation

• Basic formulation is maximum likelihood parameter estimation
• An alternative formulation provides insight
• The insight is that "the distribution that maximizes entropy subject to constraints (of expectation) is the Gibbs distribution"
• Maximizing entropy subject to constraints is the dual of maximizing likelihood given structural constraints

Entropy Definition

• Entropy is a measure of information
  – It is the expected value (average value) of information generated by the distribution
    • where information is the log of the reciprocal of probability
  – For a random variable X with distribution P(X)
    $H_P(X) = E_P\Big[\log\frac{1}{P(X)}\Big] = \sum_x P(x)\log\frac{1}{P(x)}$
    • where we treat 0 log 1/0 = 0
  – Entropy is the lower bound on the average number of bits needed to encode values of X
    • Central theorem of information theory, due to Shannon
    • i.e., expected code length cannot be less than H_P(X)

Motivation of Max Entropy with Constraints

• Suppose we have some summary statistics of an empirical distribution
  – Ex 1: published in a census report
    • Marginal distributions of single variables or certain pairs, or other events of interest
  – Ex 2: average final grade of students in class (70), correlation of final grades with homework scores
    • (correlation coefficient r=0.6, on a scale of -1 to +1)
• But no access to the full data set
• Find a typical distribution that satisfies these constraints and use it to answer other queries

Maximum Entropy Formulation

• Goal is to select a distribution that:
  1. Satisfies given constraints
  2. Has no additional structure or information
• We formulate the following problem:
  – Find Q(χ)
    • Maximizing H_Q(χ)
    • Subject to E_Q[fi]=E_D[fi], i=1,..,k
  – These are called expectation constraints

Solution to Max Entropy Estimation

• Somewhat surprisingly, the distribution Q*, the maximum entropy distribution satisfying the expectation constraints E_Q[fi]=E_D[fi], is a Gibbs distribution: Q* = P_θ̂, where
  $P_{\hat\theta}(\chi) = \frac{1}{Z(\hat\theta)}\exp\Big\{\sum_i\hat\theta_i f_i(\chi)\Big\}$
  and θ̂ is the m.l.e. parameterization relative to D
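A small numerical check of this result, under assumptions chosen purely for illustration: a single variable with three values, one feature f(x) = x, and a target expectation of 1.3. The sketch finds the Gibbs distribution that satisfies the constraint by bisection on θ, then compares its entropy with two other distributions that satisfy the same constraint.

```python
import math

VALUES = (0, 1, 2)     # hypothetical variable with three values; feature f(x) = x
TARGET_MEAN = 1.3      # hypothetical empirical expectation E_D[f]


def gibbs(theta):
    """P_theta(x) proportional to exp(theta * f(x))."""
    w = [math.exp(theta * x) for x in VALUES]
    z = sum(w)
    return [wi / z for wi in w]


def mean(p):
    return sum(pi * x for pi, x in zip(p, VALUES))


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def solve_theta(target, lo=-20.0, hi=20.0):
    """Bisection on theta; the mean of P_theta is increasing in theta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


theta_hat = solve_theta(TARGET_MEAN)
p_star = gibbs(theta_hat)

# Moving along (1, -2, 1) changes neither the total mass nor the mean,
# so each q below satisfies the same expectation constraint as p_star.
alternatives = [[p + eps * d for p, d in zip(p_star, (1, -2, 1))]
                for eps in (-0.02, 0.02)]
for q in alternatives:
    print(mean(q), entropy(q), entropy(p_star))
```

In this example both alternatives have strictly lower entropy than the Gibbs distribution, as the theorem predicts.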

Duality of Max Ent and Max Likelihood

• The two problems
  1. Maximizing entropy subject to expectation constraints
  2. Maximizing likelihood given structural constraints on the distribution
  – are convex duals of each other
• For the maximum likelihood parameters θ̂
  $H_{P_{\hat\theta}}(\chi) = -\frac{1}{M}\ell(\hat\theta : D)$

Convex Duality

• Optimization problems are viewed as
  – Primal Problem
    • E.g., Lagrangian form with specified constraints
  – Dual Problem
    • Provides lower bounds to the primal problem
• In general optimal values are not the same
• For convex optimization (under mild regularity conditions) the duality gap is zero
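The identity relating the entropy of the fitted model to the log-likelihood at θ̂ can be verified in the simplest case, where the m.l.e. is available in closed form. The sketch assumes a single binary variable with one feature f(x) = [x = 1] and hypothetical counts; it is a sanity check, not a proof.

```python
import math


def entropy_and_neg_avg_loglik(num_ones, num_total):
    """Return (entropy of the fitted model, -(1/M) * log-likelihood at the m.l.e.)."""
    p_hat = num_ones / num_total                 # closed-form m.l.e. for one binary variable
    theta_hat = math.log(p_hat / (1 - p_hat))    # weight on the feature f(x) = [x = 1]
    ln_z = math.log(1 + math.exp(theta_hat))
    loglik = num_ones * theta_hat - num_total * ln_z   # sum_i theta_i M[f_i] - M ln Z
    entropy = -(p_hat * math.log(p_hat) + (1 - p_hat) * math.log(1 - p_hat))
    return entropy, -loglik / num_total


h, neg_avg_ll = entropy_and_neg_avg_loglik(30, 100)   # hypothetical counts
print(h, neg_avg_ll)
```

The two printed values agree up to floating-point error.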


MN Parameter Estimation: Parameter Priors and Regularization


Parameter Priors and Regularization

• Maximum likelihood estimation of θ is prone to over-fitting
• This is true of maximum likelihood estimation of the parameters of Bayesian networks as well
• But it is not as apparent with MNs, due to the lack of direct correspondence between empirical counts and parameters
• Overfitting is as much of a problem with MNs

Parameter Priors

• Can introduce a prior distribution P(θ)
  – Defined over model parameters θ
• Due to the non-decomposable likelihood function, a fully Bayesian approach is infeasible
  $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\Big\{\sum_{i=1}^{k}\theta_i f_i(D_i)\Big\}$
  – In a fully Bayesian approach we integrate out the parameters to get the next prediction
• Instead, we perform MAP estimation
  – Find parameters that maximize P(θ)P(D | θ)

MAP estimation of θ

• Find parameters that maximize P(θ)P(D | θ)
• where ln P(D | θ) is the log-likelihood
  $\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$
• which in turn is derived from the joint distribution
  $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\Big\{\sum_{i=1}^{k}\theta_i f_i(D_i)\Big\}$

Choice of Prior P(θ)

• No constraints on conjugacy of prior and likelihood; only a few priors are used in practice
  1. Zero-mean diagonal Gaussian with equal variances for each of the weights
  2. Zero-mean Laplacian distribution
  [Figure: densities of the Gaussian distribution (σ2=1) and the Laplacian distribution (β=1), plotted for θ from –10 to 10]
• They are referred to as
  – L2 regularization
    • Weight penalty is quadratic in θ (Euclidean norm)
  – L1 regularization
    • Weight penalty is linear in |θ|
  – Both priors penalize parameters whose magnitude (positive or negative) is large

Gaussian Prior and L2-Regularization

• Most common is a Gaussian prior on the log-linear parameters θ
  $P(\theta \mid \sigma^2) = \prod_{i=1}^{k}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{\theta_i^2}{2\sigma^2}\Big\}$
  – For some choice of hyper-parameter (variance) σ2
    • Analogous to αi in the Dirichlet prior for the multinomial
• Converting the MAP objective P(θ)P(D | θ) to log-space gives ln P(θ) + ℓ(θ:D), whose first term (up to an additive constant)
  $\ln P(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{k}\theta_i^2$
  – is called an L2-regularization term
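A minimal sketch of MAP estimation with the L2 term, assuming one binary variable with the redundant pair of indicator features [x = 0] and [x = 1]. The counts, σ², step size, starting point and iteration count are hypothetical. The gradient of the objective is M(E_D[f_i] − E_θ[f_i]) − θ_i/σ².

```python
import math


def map_fit(num_ones, num_total, sigma2=1.0, eta=0.05, num_steps=2000):
    """Gradient ascent on ln P(theta) + l(theta : D) for one binary variable.

    Features are the two indicators [x = 0] and [x = 1], a redundant
    parameterization with weights theta[0] and theta[1].
    """
    empirical = [1 - num_ones / num_total, num_ones / num_total]
    theta = [1.0, 0.5]                                   # arbitrary starting point
    for _ in range(num_steps):
        z = math.exp(theta[0]) + math.exp(theta[1])
        expected = [math.exp(t) / z for t in theta]
        grad = [num_total * (e_d - e_t) - t / sigma2     # likelihood part + L2 part
                for e_d, e_t, t in zip(empirical, expected, theta)]
        theta = [t + eta * g for t, g in zip(theta, grad)]
    return theta


theta_map = map_fit(num_ones=7, num_total=10)
p_one = math.exp(theta_map[1]) / (math.exp(theta_map[0]) + math.exp(theta_map[1]))
print(theta_map, p_one)
```

Without the penalty, any pair of weights with the same difference θ1 − θ0 gives the same likelihood. With the penalty, the run settles on the single solution with θ0 = −θ1, and the fitted probability of x = 1 is pulled from the empirical 0.7 toward the uniform 0.5.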

Laplacian Prior and L1-Regularization

• Zero-mean Laplacian distribution
  $P_{Laplacian}(\theta) = \frac{1}{2\beta}\exp\Big\{-\frac{|\theta|}{\beta}\Big\}$
  [Figure: Laplacian distribution (β=1) and Gaussian distribution (σ2=1), plotted for θ from –10 to 10]
• Taking the log we obtain (up to an additive constant) a term
  $\ln P(\theta) = -\frac{1}{\beta}\sum_{i=1}^{k}|\theta_i|$
  – whose magnitude, the penalty (1/β) Σ|θi|, we wish to minimize
• Generally called L1-regularization
• Both forms of regularization penalize parameters whose magnitude is large

Why prefer low magnitude parameters?

• Properties of prior
  – To pull distribution towards an uninformed one
  – To smooth fluctuations in the data
• A distribution is smooth if
  – Probabilities assigned to different assignments are not radically different
• Consider two assignments ξ and ξ'
  – An assignment is an instance of variables X1,..,Xn
• We consider the ratio of their probabilities next

Smoothness resulting from small θ

• Given two assignments ξ and ξ'
  – Their relative probability is
    $\frac{P(\xi)}{P(\xi')} = \frac{\tilde P(\xi)/Z(\theta)}{\tilde P(\xi')/Z(\theta)} = \frac{\tilde P(\xi)}{\tilde P(\xi')}$
• where the un-normalized probabilities are
    $\tilde P(\xi) = \exp\Big\{\sum_{i=1}^{k}\theta_i f_i(\xi)\Big\}$
• In log-space, the log-probability ratio is
    $\ln\frac{P(\xi)}{P(\xi')} = \sum_{i=1}^{k}\theta_i f_i(\xi) - \sum_{i=1}^{k}\theta_i f_i(\xi') = \sum_{i=1}^{k}\theta_i\big(f_i(\xi) - f_i(\xi')\big)$
  – When the θi have small magnitude, this log-ratio is also small in magnitude, i.e., the probabilities are similar
  – This results in a smooth distribution
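The effect of weight magnitude on smoothness can be seen in a tiny example. The model below (two binary variables and three indicator features) and both weight vectors are hypothetical. For indicator features each difference f_i(ξ) − f_i(ξ') lies in {−1, 0, 1}, so the log-ratio is bounded in magnitude by the sum of the |θi|.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=2))


def features(state):
    """Hypothetical indicator features of (A, B): [A=1], [B=1], [A=B]."""
    a, b = state
    return [float(a == 1), float(b == 1), float(a == b)]


def distribution(theta):
    """Normalized P(state; theta) by enumeration."""
    w = [math.exp(sum(t * f for t, f in zip(theta, features(s)))) for s in STATES]
    z = sum(w)
    return {s: wi / z for s, wi in zip(STATES, w)}


def max_abs_log_ratio(theta):
    """Largest |ln P(xi) / P(xi')| over all pairs of assignments."""
    p = distribution(theta)
    return max(abs(math.log(p[s] / p[t])) for s in STATES for t in STATES)


def log_ratio_from_features(theta, s, t):
    """sum_i theta_i (f_i(xi) - f_i(xi')), which does not involve Z(theta)."""
    return sum(th * (fs - ft) for th, fs, ft in zip(theta, features(s), features(t)))


small = [0.1, -0.1, 0.05]
large = [3.0, -3.0, 2.0]
print(max_abs_log_ratio(small), max_abs_log_ratio(large))
```

The small weights give a nearly uniform distribution, while the large weights allow probabilities that differ by several orders of magnitude.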

Comparison of L1 and L2 Regularization

• Both L1 and L2 penalize parameter magnitude
• Encode belief that model weights should be small (close to zero)
• In Gaussian case (L2), penalty grows quadratically with parameters
  – An increase in θi from 3 to 3.1 is penalized more than an increase from 0 to 0.1: with σ2=1 the penalty θi²/2 rises by 0.305 in the first case and by 0.005 in the second
  – Leads to many small parameters
• In Laplacian case (L1), penalty grows linearly
  – Leads to sparse solutions with many parameters exactly zero, which results in fewer edges and a more tractable model

Efficiency of Optimization

• Both L1 and L2 regularization terms are concave
• Because the log-likelihood is also concave, the resulting log-posterior is also concave
• Can be optimized using gradient ascent methods
• Introduction of penalty terms eliminates multiple equivalent maxima

Choice of Hyper-parameters

• Regularization hyper-parameters σ2 (L2) and β (L1)
  – Larger hyper-parameters mean broader prior
  – Choice of prior has effect on learned model
• Standard method of selecting this parameter is via cross-validation
  – Repeatedly partition training set
    • Learn model over one part with some choice of parameters
    • Measure the performance on the held-out fragment