
Learning Parameters of Undirected Models
Sargur Srihari
[email protected]

Topics
• Difficulties due to Global Normalization
• Likelihood Function
• Maximum Likelihood Parameter Estimation
• Simple and Conjugate Gradient Ascent
• Conditionally Trained Models
• Parameter Learning with Missing Data
  – Gradient Ascent vs. EM
• Alternative Formulation of Max Likelihood
  – Maximum Entropy subject to Constraints
• Parameter Priors and Regularization

Local vs Global Normalization
• BN: Local normalization within each CPD
  P(X) = ∏_{i=1}^{n} P(X_i | Pa_i)
• MN: Global normalization (partition function)
  P(X_1,..,X_n) = (1/Z) ∏_{i=1}^{K} ϕ_i(D_i),   Z = Σ_{X_1,..,X_n} ∏_{i=1}^{K} ϕ_i(D_i)
• The global factor couples all parameters, preventing decomposition
• Significant computational ramifications
  – M.L. parameter estimation has no closed-form solution
  – Need iterative methods

Issues in Parameter Estimation
• Simple ML parameter estimation (even with complete data) cannot be solved in closed form
• Need iterative methods such as gradient ascent
• Good news:
  – Likelihood function is concave
    • Methods converge to the global optimum
• Bad news:
  – Each step of the iterative algorithm requires inference
    • Simple parameter estimation is expensive/intractable
  – Bayesian estimation is practically infeasible

Discriminative Training
• A common use of MNs is in settings, such as image segmentation, where we have a particular inference task in mind
• Train the network discriminatively to get good performance for our particular inference task

Likelihood Function
• Basis for all discussion of learning
• How the likelihood function can be optimized to find maximum likelihood parameter estimates
• Begin with the form of the likelihood function for Markov networks, its properties, and their computational implications

Example of Coupled Likelihood Function
• Simple Markov network A—B—C with two potentials ϕ1(A,B), ϕ2(B,C)
  – Gibbs: P(A,B,C) = (1/Z) ϕ1(A,B) ϕ2(B,C),   Z = Σ_{a,b,c} ϕ1(a,b) ϕ2(b,c)
• Log-likelihood of an instance (a,b,c):
  ln P(a,b,c) = ln ϕ1(a,b) + ln ϕ2(b,c) − ln Z
• Log-likelihood of a data set D with M instances, where the parameter θ consists of all values of the factors ϕ1 and ϕ2:
  ℓ(θ : D) = Σ_{m=1}^{M} ( ln ϕ1(a[m],b[m]) + ln ϕ2(b[m],c[m]) − ln Z(θ) )   [summing over instances]
           = Σ_{a,b} M[a,b] ln ϕ1(a,b) + Σ_{b,c} M[b,c] ln ϕ2(b,c) − M ln Z(θ)   [summing over values of the variables]
• The first term involves only ϕ1 and the second only ϕ2, but the third,
  ln Z(θ) = ln ( Σ_{a,b,c} ϕ1(a,b) ϕ2(b,c) ),
  couples the two potentials in the likelihood function
  – When we change one potential ϕ1, Z(θ) changes, possibly changing the value of ϕ2 that maximizes −ln Z(θ)
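The coupling is easy to see numerically. Below is a small illustrative sketch in plain Python: it evaluates ℓ(θ : D) for the A—B—C network and grid-searches a single entry of ϕ2. Apart from M = 100, M[a¹,b¹] = 40 and M[b⁰,c¹] = 40 (used in the illustration on the next slide), the counts, the candidate grid, and the function names are hypothetical choices of mine, not from the slides.

```python
import itertools
import math

def log_likelihood(phi1, phi2, counts_ab, counts_bc, M):
    """ℓ(θ:D) = Σ M[a,b] ln ϕ1(a,b) + Σ M[b,c] ln ϕ2(b,c) − M ln Z(θ)."""
    # The global partition function involves both factors at once.
    Z = sum(phi1[a, b] * phi2[b, c]
            for a, b, c in itertools.product((0, 1), repeat=3))
    ll = sum(n * math.log(phi1[ab]) for ab, n in counts_ab.items())
    ll += sum(n * math.log(phi2[bc]) for bc, n in counts_bc.items())
    return ll - M * math.log(Z)

# M = 100 instances with M[a1,b1] = 40 and M[b0,c1] = 40 as in the slide's
# illustration; the remaining counts are hypothetical but marginally consistent.
counts_ab = {(0, 0): 20, (0, 1): 20, (1, 0): 20, (1, 1): 40}
counts_bc = {(0, 0): 0, (0, 1): 40, (1, 0): 30, (1, 1): 30}
M = 100

def best_phi2_entry(phi1, grid=(0.5, 1.0, 2.0, 4.0, 8.0)):
    """Grid-search the single entry ϕ2(b0,c1); all other entries stay at 1."""
    def ll(v):
        phi2 = {(0, 0): 1.0, (0, 1): v, (1, 0): 1.0, (1, 1): 1.0}
        return log_likelihood(phi1, phi2, counts_ab, counts_bc, M)
    return max(grid, key=ll)

phi1_flat = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0}
phi1_peaked = {**phi1_flat, (1, 1): 8.0}
# Changing ϕ1 alone moves Z(θ), so the best value of the untouched ϕ2 entry
# changes too: the two potentials are coupled through the normalizer.
print(best_phi2_entry(phi1_flat), best_phi2_entry(phi1_peaked))
```

The surface plot on the next slide illustrates the same effect, plotting the log-likelihood over one entry of each factor.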
Illustration of Coupled Likelihood
• Log-likelihood for A—B—C with respect to the two factors ϕ1(A,B), ϕ2(B,C)
  – With binary variables we would have 8 parameters
  ℓ(θ : D) = Σ_{a,b} M[a,b] ln ϕ1(a,b) + Σ_{b,c} M[b,c] ln ϕ2(b,c) − M ln Z(θ)
• Log-likelihood surface
  [Figure: surface plotted over f1(a¹,b¹) and f2(b⁰,c¹), with all other parameters set to 1; the data set has M = 100, M[a¹,b¹] = 40, M[b⁰,c¹] = 40]
  – When ϕ1 changes, the ϕ2 that maximizes −ln Z(θ) also changes
• In this case the problem is avoidable
  – The network is equivalent to the BN A→B→C; estimate its parameters and set ϕ1(A,B) = P(A)P(B|A), ϕ2(B,C) = P(C|B)
• In general, we cannot convert learned BN parameters into an equivalent MN
  – The optimal likelihood achievable by the two representations is not the same

Form of Likelihood Function
• Instead of Gibbs, use the log-linear framework
  – Joint distribution of n variables X1,..,Xn
  – k features F = { f_i(D_i) }, i = 1,..,k (k depends on how many values the D_i take)
    • where D_i is a sub-graph and f_i maps D_i to ℝ
  P(X1,..,Xn; θ) = (1/Z(θ)) exp{ Σ_{i=1}^{k} θ_i f_i(D_i) }
• The parameters θ_i are weights we put on the features
  – If we have a sample ξ then its features are f_i(ξ(D_i)), with shorthand f_i(ξ)
• The representation is general: it can capture Markov networks with global structure and local structure

Parameters θ, Factors ϕ, Binary Features f
  P(X1,..,Xn; θ) = (1/Z(θ)) exp{ Σ_{i=1}^{k} θ_i f_i(D_i) }
• Variables A, B, C, D in a loop A—B—C—D—A
  [Figure: network diagrams (a), (b), (c) over the variables A, B, C, D]
• A feature for every entry in every table
  – The f_i(D_i) are sixteen indicator functions defined over the clusters AB, BC, CD, DA
    With Val(A) = {a⁰, a¹} and Val(B) = {b⁰, b¹}:
    f_{a⁰b⁰}(A,B) = I{A = a⁰} I{B = b⁰}, i.e., f_{a⁰b⁰} = 1 if a = a⁰ and b = b⁰, and 0 otherwise, etc.
  – With this representation the parameters θ are log-potentials (the weights put on the features), e.g., for the table of ϕ1(A,B) with entries ϕ1(a⁰,b⁰), ϕ1(a⁰,b¹), ϕ1(a¹,b⁰), ϕ1(a¹,b¹):
    θ_{a⁰b⁰} = ln ϕ1(a⁰,b⁰)

Log Likelihood and Sufficient Statistics
• Joint probability distribution:
  P(X1,..,Xn; θ) = (1/Z(θ)) exp{ Σ_{i=1}^{k} θ_i f_i(D_i) }
  where θ = {θ1,..,θk} are the table entries and the f_i are features over instances of the D_i
• Let D be a data set of M samples ξ[m], m = 1,..,M
• Log-likelihood (log of the product of the probabilities of M independent instances):
  ℓ(θ : D) = Σ_i θ_i ( Σ_m f_i(ξ[m]) ) − M ln Z(θ)
• Sufficient statistics (the likelihood depends only on these); dividing by the number of samples M:
  (1/M) ℓ(θ : D) = Σ_i θ_i E_D[f_i(d_i)] − ln Z(θ)
  where E_D[f_i(d_i)] is the average value of the feature in the data set

Properties of Log-Likelihood
• The log-likelihood is a sum of two functions:
  ℓ(θ : D) = Σ_i θ_i ( Σ_m f_i(ξ[m]) ) − M ln Z(θ)   [first term, second term]
  – The first term is linear in the parameters θ
    • Increasing the parameters increases this term
    • But the likelihood has an upper bound of probability 1
  – The second term,
    ln Z(θ) = ln Σ_ξ exp{ Σ_i θ_i f_i(ξ) },
    balances the first term
    – The log-partition function ln Z(θ) is convex, as shown next, so its negative is concave
    – The sum of a linear function and a concave function is concave, so
      • There are no local optima
      • Can use gradient ascent
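To make the log-linear form and its sufficient statistics concrete, here is a minimal sketch, assuming binary variables A—B—C with one indicator feature per entry of the two pairwise tables (eight features); ln Z(θ) is computed by brute-force enumeration. The data set, the feature encoding, and all function names are my own illustrative choices, not from the slides.

```python
import itertools
import numpy as np

# A minimal log-linear model over binary variables A—B—C:
# one indicator feature per entry of each pairwise table (8 features total).
FEATURES = [("AB", a, b) for a in (0, 1) for b in (0, 1)] + \
           [("BC", b, c) for b in (0, 1) for c in (0, 1)]

def feature_vector(a, b, c):
    """f(ξ): indicator features of a single assignment ξ = (a, b, c)."""
    return np.array([(a, b) == (u, v) if scope == "AB" else (b, c) == (u, v)
                     for scope, u, v in FEATURES], dtype=float)

def log_Z(theta):
    """ln Z(θ) = ln Σ_ξ exp{ Σ_i θ_i f_i(ξ) }, by explicit enumeration."""
    scores = [theta @ feature_vector(a, b, c)
              for a, b, c in itertools.product((0, 1), repeat=3)]
    return np.log(np.sum(np.exp(scores)))

def avg_log_likelihood(theta, data):
    """(1/M) ℓ(θ:D) = Σ_i θ_i E_D[f_i] − ln Z(θ)."""
    suff_stats = np.mean([feature_vector(*x) for x in data], axis=0)  # E_D[f_i]
    return theta @ suff_stats - log_Z(theta)

# Hypothetical data set of M = 4 instances (a, b, c).
data = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1)]
theta = np.zeros(len(FEATURES))         # θ_i = ln ϕ_i = 0, i.e. all potentials 1
print(avg_log_likelihood(theta, data))  # = −ln 8 for the uniform model
```

With all θ_i = 0 (all potentials equal to 1) the model is uniform over the 8 joint assignments, so the average log-likelihood is −ln 8 regardless of the data; only the sufficient statistics E_D[f_i] enter the first term as the θ_i move away from 0.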
Proof that Z(θ) is Convex
• A function f(x) is convex if for every 0 ≤ α ≤ 1:
  f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)
  – The function is bowl-like
  – Every interpolation between the images of two points is larger than the image of their interpolation
• One way to show that a function is convex is to show that its Hessian (the matrix of the function's second derivatives) is positive semi-definite
• The Hessian of ln Z(θ) is computed next

Hessian of ln Z(θ)
• Given a set of features F, with ln Z(θ) = ln Σ_ξ exp{ Σ_i θ_i f_i(ξ) }:
  ∂ ln Z(θ) / ∂θ_i = E_θ[f_i]
  ∂² ln Z(θ) / ∂θ_i ∂θ_j = Cov_θ[f_i; f_j]
  where E_θ[f_i] is shorthand for E_{P(χ;θ)}[f_i]
• Proof (partial derivatives with respect to θ_i and θ_j):
  ∂/∂θ_i ln Z(θ) = (1/Z(θ)) Σ_ξ ∂/∂θ_i exp{ Σ_j θ_j f_j(ξ) } = (1/Z(θ)) Σ_ξ f_i(ξ) exp{ Σ_j θ_j f_j(ξ) } = E_θ[f_i]
  ∂²/∂θ_i ∂θ_j ln Z(θ) = ∂/∂θ_j [ (1/Z(θ)) Σ_ξ ∂/∂θ_i exp{ Σ_k θ_k f_k(ξ) } ] = Cov_θ[f_i; f_j]
• Since the covariance matrix of the features is positive semi-definite, ln Z(θ) is convex and −ln Z(θ) is a concave function of θ
  – Corollary: the log-likelihood function is concave

Non-unique Solution
• Since ln Z(θ) is convex, −ln Z(θ) is concave
• This implies that the log-likelihood is unimodal
  – It has no local optima
• However, it does not imply uniqueness of the global optimum
• Multiple parameterizations can result in the same distribution
  – A feature for every entry in the table is always redundant, e.g.,
    f_{a⁰b⁰} = 1 − f_{a⁰b¹} − f_{a¹b⁰} − f_{a¹b¹}
  – There is a continuum of parameterizations

Maximum (Conditional) Likelihood Parameter Estimation
• Task: estimate the parameters of a Markov network with a fixed structure given a fully observable data set D
• The simplest variant of the problem is maximum likelihood parameter estimation
• The log-likelihood given features F = { f_i, i = 1,..,k } is
  ℓ(θ : D) = Σ_{i=1}^{k} θ_i ( Σ_m f_i(ξ[m]) ) − M ln Z(θ)

Gradient of Log-likelihood
• Log-likelihood:
  ℓ(θ : D) = Σ_{i=1}^{k} θ_i ( Σ_m f_i(ξ[m]) ) − M ln Z(θ)
• Average log-likelihood (dividing by the number of samples):
  (1/M) ℓ(θ : D) = Σ_i θ_i E_D[f_i(d_i)] − ln Z(θ)
• Gradient of the second term:
  ∂/∂θ_i ln Z(θ) = (1/Z(θ)) Σ_ξ ∂/∂θ_i exp{ Σ_j θ_j f_j(ξ) } = (1/Z(θ)) Σ_ξ f_i(ξ) exp{ Σ_j θ_j f_j(ξ) } = E_θ[f_i]
• Gradient of the average log-likelihood:
  ∂/∂θ_i (1/M) ℓ(θ : D) = E_D[f_i(χ)] − E_θ[f_i]
  The first term is the average value of f_i in the data D; the second is its expected value under the model distribution
• This provides a precise characterization of the maximum likelihood parameters θ
• Theorem: Let F be a feature set. Then θ̂ is a maximum likelihood assignment if and only if E_D[f_i(χ)] = E_θ̂[f_i] for all i
  – i.e., the expected value of each feature relative to P_θ̂ matches its empirical expectation in D

Need for Iterative Method
• Although the log-likelihood function
  ℓ(θ : D) = Σ_i θ_i ( Σ_m f_i(ξ[m]) ) − M ln Z(θ)
  is concave, there is no analytical form for its maximum
• Since there is no closed-form solution
  – Can use iterative methods, e.g., gradient ascent
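As a concrete, purely illustrative instance of such an iterative method, the sketch below runs plain gradient ascent on the toy binary A—B—C log-linear model used earlier. The exact expectations E_θ[f_i] are obtained by enumerating all 8 assignments; the data set, step size, and iteration count are hypothetical choices of mine, not from the slides.

```python
import itertools
import numpy as np

# Toy log-linear model over binary A—B—C: one indicator feature per entry
# of the pairwise tables (A,B) and (B,C), eight features in total.
FEATURES = [("AB", a, b) for a in (0, 1) for b in (0, 1)] + \
           [("BC", b, c) for b in (0, 1) for c in (0, 1)]
ASSIGNMENTS = list(itertools.product((0, 1), repeat=3))

def feature_vector(a, b, c):
    """f(ξ) for a single assignment ξ = (a, b, c)."""
    return np.array([(a, b) == (u, v) if scope == "AB" else (b, c) == (u, v)
                     for scope, u, v in FEATURES], dtype=float)

def model_expectations(theta):
    """E_θ[f_i] = Σ_ξ P(ξ;θ) f_i(ξ), by exact enumeration (inference step)."""
    feats = np.array([feature_vector(*x) for x in ASSIGNMENTS])
    scores = feats @ theta
    probs = np.exp(scores - np.log(np.sum(np.exp(scores))))  # P(ξ; θ)
    return probs @ feats

def fit_by_gradient_ascent(data, step=0.2, iters=2000):
    """Ascend (1/M) ℓ(θ:D) using the gradient E_D[f_i] − E_θ[f_i]."""
    emp = np.mean([feature_vector(*x) for x in data], axis=0)  # E_D[f_i]
    theta = np.zeros(len(FEATURES))
    for _ in range(iters):
        theta += step * (emp - model_expectations(theta))
    return theta

# Hypothetical data set; every pairwise configuration occurs at least once.
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0), (0, 0, 1)]
theta_ml = fit_by_gradient_ascent(data)
# At the optimum the moment-matching condition E_θ[f_i] = E_D[f_i] holds.
print(np.round(model_expectations(theta_ml), 3))
print(np.round(np.mean([feature_vector(*x) for x in data], axis=0), 3))
```

Note that every update needs E_θ[f_i], i.e., inference in the current model. Here that is a trivial enumeration over 8 assignments, but in a large network this inner inference step is exactly what makes each gradient step expensive, as noted earlier in this deck.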