Learning Parameters of Undirected Models

Machine Learning Srihari Learning Parameters of Undirected Models Sargur Srihari [email protected] 1 Machine Learning Srihari Topics • Difficulties due to Global Normalization • Likelihood Function • Maximum Likelihood Parameter Estimation • Simple and Conjugate Gradient Ascent • Conditionally Trained Models • Parameter Learning with Missing Data – Gradient Ascent vs. EM • Alternative Formulation of Max Likelihood – Maximum Entropy subject to Constraints • Parameter Priors and Regularization 2 Machine Learning Srihari Local vs Global Normalization • BN: Local normalization within each CPD n p(X) = ∏ln p(Xi | pai ) i=1 • MN: Global normalization (partition function) 1 K Z = φi (Di ) P(X1,.., Xn ) = φi Di ∑ ∏ ( ) X ,..X Z i=1 1 n • Global factor couples all parameters preventing decomposition • Significant computational ramifications – M.L. parameter estimation has no closed-form soln. – Need iterative methods 3 Machine Learning Srihari Issues in Parameter Estimation • Simple ML parameter estimation (even with complete data) cannot be solved in closed form • Need iterative methods such as gradient ascent • Good news: – Likelihood function is concave • Methods converge with global optimum • Bad news: – Each step in iterative algorithm requires inference • Simple parameter estimation expensive/intractable 4 – Bayesian estimation is practically infeasible Machine Learning Srihari Discriminative Training • Common use of MNs is in settings such as image segmentation where we have a particular inference task in mind • Train the network discriminatively to get good performance for our particular inference task 5 Machine Learning Srihari Likelihood Function • Basis for all discussion of learning • How likelihood function can be optimized to find maximum likelihood parameter estimates • Begin with form of likelihood function for Markov networks, its properties and their computational implications 6 Machine Learning Srihari Example of Coupled Likelihood Function – Simple Markov network A—B—C – Two potentials ϕ1(A,B), ϕ2(B,C): Z = (a,b) (b,c) • Gibbs: P(A,B,C)=(1/Z) ϕ1(A,B)ϕ2(B,C) ∑φ1 φ2 a,b,c • Log-likelihood of instance (a,b,c) is ln P(a,b,c)=ln ϕ1(A,B)+ln ϕ2(B,C)-ln Z • Log-likelihood of data set D with M instances: Parameter θ consists of all values of factorsϕ1 and ϕ2 M Summing over different ℓ(θ : D) = ∑(lnφ1(a[m],b[m]) + lnφ2 (b[m],c[m]) − ln Z(θ)) m=1 Instances M = M[a,b]lnφ (a,b) + M[b,c]lnφ (b,c) − M ln Z(θ) Summing over different ∑ 1 ∑ 2 values of A and B a,b b,c • First term involves only ϕ1. Second only ϕ2. But third involves ⎛ ⎞ lnZ(θ) = ln⎜ φ (a,b)φ (b,c)⎟ ⎜∑ 1 2 ⎟ ⎝a,b,c ⎠ • Which couples the two potentials in the likelihood function – When we change one potential ϕ1, Z(θ) changes, possibly changing the value of ϕ2 that maximizes –ln Z(θ) Machine Learning Srihari Illustration of Coupled Likelihood • Log-likelihood A—B—C wrt two factors ϕ1(A,B), ϕ2(B,C) – With binary variables we would have 8 parameters ℓ(θ : D) = M[a,b]lnφ (a,b) + M[b,c]lnφ (b,c) − M ln Z(θ) • Log-likelihood surface ∑ 1 ∑ 2 a,b b,c – when ϕ1changes, ϕ2 that maximizes –ln Z(θ) also changes • In this case problem avoidable 0 1 f2(b , c ) 1 1 – Equivalent to BN f1(a , b ) All other parameters set to 1 – AàBàC & estimate parameters of Data Set has M=100 M[a1,b1]=40, M[b0,c1]=40 – ϕ1(A,B)=P(A)P(B|A), ϕ2(B,C)=P(C|B) • In general, cannot convert learned BN parameters into equivalent MN – Optimal likelihood achievable by the two representations is not the same 8 Machine Learning Srihari Form of Likelihood Function • Instead of Gibbs, use log-linear framework – Joint distribution of n variables X1,..Xn – k features F = { fi(Di) }i=1,.. k k depends on how many values Di take • where Di is a sub-graph and fi maps Di to R 1 ⎧ k ⎫ P(X1,..Xn;θ) = exp ⎨∑θi fi (Di )⎬ Z(θ) ⎩ i=1 ⎭ • Parameters θi are weights we put on features – If we have a sample ξ then its features are fi(ξ(Di)) which has the shorthand fi(ξ). • Representation is general – can capture Markov networks with global structure 9 and local structure MachineParameters Learning θ, Factors ϕ, binary features Sriharif A B ϕ1(A,B) k 1 0 0 0 0 A A ⎧ ⎫ a b ϕ1 (a ,b ) P(X1,..Xn;θ) = exp ⎨∑θi fi (Di )⎬ Z(θ) i=1 0 1 0 1 ⎩ ⎭ a b ϕ1(a ,b ) D B D B D B a1 b0 ϕ (a1,b0) • Variables: A—B—C—D— 1 C C C A 1 1 1, 1 a b ϕ1(a b ) (a) (b) (c) • A feature for every entry in every table – fi(Di) are sixteen indicator functions defined over clusters, AB, BC,CD,DA 0 0 Val(A)={a 0,a1} Val(B)={b 0,b1} f 0 0 (A, B) I A a I B b a b = { = } { = } 0 0 fa0,b0=1 if a=a ,b=b etc. 0 otherwise, etc. – With this representation Parameters θ are potentials 0 0 which are weights put on θ 0 0 = lnφ a ,b a b 1 ( ) features Machine Learning Srihari Log Likelihood and Sufficient Statistics • Joint probability distribution: 1 ⎧ k ⎫ θ={θ1,..θk} are table entries P(X ,..X ; ) exp f (D ) 1 n θ = ⎨∑θi i i ⎬ fi are features over instances of Di Z(θ) ⎩ i=1 ⎭ • Let D be a data set of M samples ξ [m] m =1,..M • Log-likelihood – Log of product of probs of M indep. instances: ⎛ ⎞ ℓ(θ : D) = θi fi (ξ[m]) − M ln Z(θ) ∑ ⎝⎜ ∑ ⎠⎟ i m • Sufficient statistics (likelihood depends only on this) Dividing by no. 1 ED[fi(di)] is the average ℓ(θ : D) = ∑θi (ED [ fi (di )]) − ln Z(θ) of samples M i in the data set Machine Learning Srihari Properties of Log-Likelihood • Log-likelihood is a sum of two functions ⎛ ⎞ ℓ(θ : D) = θi fi (ξ[m]) − M ln Z(θ) ∑ ⎝⎜ ∑ ⎠⎟ i m First Term Second Term – First term is linear in the parameters θ • Increasing parameters increases this term • But likelihood has upper-bound of probability 1 – Second term: ⎧ ⎫ ln Z (θ ) = ln∑exp⎨∑θi fi (ξ)⎬ • balances first term ξ ⎩ i ⎭ – Partition function Z(θ) is convex as seen next » Its negative is concave – Sum of linear fn. and concave is concave, So 12 • There are no local optima – Can use gradient ascent Machine Learning Srihari Proof that Z(θ) is Convex ! A function f (x) is convex if for every 0 ≤ α ≤ 1 ! ! ! ! f ( x (1 )y) f (x) (1 ) f (y) α + −α ≤ α + −α –The function is bowl-like –Every interpolation between the images of two points is larger than the image of their interpolation – One way to show that a Function is convex is to show that its Hessian (matrix of the function’s second derivatives) is positive-semi-definite • Hessian (2nd der) of ln Z(θ) computed as: Machine Learning Srihari Hessian of ln Z(θ) ⎧ ⎫ ln Z θ = ln exp θ f ξ Given set of features F with ( ) ∑ ⎨∑ i i ( )⎬ ξ ⎩ i ⎭ ∂ ln Z(θ) = Eθ [ fi ] ∂θi ∂2 ln Z(θ) = Cov [ f ; f ] θ i j where Eθ[fi] is shorthand for EP(χ,θ)[fi] ∂θi ∂θ j Proof: ∂ 1 ∂ ⎪⎧ ⎪⎫ 1 ⎪⎧ ⎪⎫ ln Z( ) exp f f ( )exp f E [ f ] θ = ∑ ⎨∑θ j j (ξ)⎬ = ∑ i ξ ⎨∑θ j j (ξ)⎬ = θ i ∂θi Z(θ) ξ ∂θi ⎩⎪ j ⎭⎪ Z(θ) ξ ⎩⎪ j ⎭⎪ Partial derivatives ∂2 ∂ ⎡ 1 ∂ ⎧ ⎫⎤ wrt θi and θj ln Z( ) exp f Cov [ f ; f ] θ = ⎢ ∑ ⎨∑θk k (ξ)⎬⎥ = θ i j ∂θi∂θ j ∂θ j ⎣ Z(θ) ξ ∂θi ⎩ k ⎭⎦ – Since covariance matrix of features is pos. semi- definite, we have -ln Z(θ) is a concave function of θ • Corollary: log-likelihood function is concave Machine Learning Srihari Non-unique Solution • Since ln Z(θ) is convex, -ln Z(θ) is concave • Implies that log-likelihood is unimodal – Has no local optima • However does not imply uniqueness of global optimum • Multiple parameterizations can result in same distribution – A feature for every entry in the table is always redundant, e.g., • fa0,b0 = 1- fa0,b1- fa1,b0- fa1,b1 15 • A continuum of parameterizations Machine Learning Srihari Maximum (Conditional) Likelihood Parameter Estimation • Task: Estimate parameters of a Markov network with a fixed structure given a fully observable data set D • Simplest variant of the problem is maximum likelihood parameter estimation • Log-likelihood given features F={ fi, i=1,.. k} is ⎛ ⎞ ℓ(θ : D) = ∑θi ⎜ ∑ fi (ξ[m])⎟ − M ln Z (θ ) i=1 ⎝ m ⎠ 16 Machine Learning Srihari Gradient of Log-likelihood ⎛ ⎞ ℓ(θ : D) = θi fi (ξ[m]) − M ln Z (θ ) ∑ ⎝⎜ ∑ ⎠⎟ • Log-likelihood: i=1 m 1 ( : D) E f (d ) ln Z( ) Dividing by no. • Average log-likelihood: ℓ θ = ∑θi ( D [ i i ]) − θ M i of samples • Gradient of second term is ∂ 1 ∂ ⎧ ⎫ 1 ⎧ ⎫ ln Z(θ) = ∑ exp ⎨∑θ j f j (ξ)⎬ = ∑ fi (ξ)exp ⎨∑θ j f j (ξ)⎬ = Eθ [ fi ] ∂θi Z(θ) ξ ∂θi ⎩⎪ j ⎭⎪ Z(θ) ξ ⎩⎪ j ⎭⎪ • Gradient of the average log-likelihood is First term is average value of f in ∂ 1 i ℓ(θ : D) = ED[ fi (χ )]− Eθ [ fi ] data D. Second term is ∂θ M i expected value from distribution • Provides a precise characterization of m.l. parameters θ • Theorem: Let F be a feature set. Then θ is a m.l. assignment if and only if E f E f for all i D ⎣⎡ i (χ)⎦⎤ = θˆ [ i ] • i.e., expected value of each feature relative to Pθ matches its empirical expectation in D Machine Learning Srihari Need for Iterative Method • Although log-likelihood function ⎛ ⎞ ℓ(θ : D) = θi fi (ξ[m]) − M ln Z(θ) ∑ ⎝⎜ ∑ ⎠⎟ i m • is concave, there is no analytical form for the maximum • Since no closed-form solution – Can use iterative methods, e.g.

Load more