Graphical Models: Learning parameters and structure

Zoubin Ghahramani
[email protected]
http://learning.eng.cam.ac.uk/zoubin/
Department of Engineering, University of Cambridge, UK

Machine Learning Department, Carnegie Mellon University, USA

2010

Learning parameters

[Figure: a directed graph over x1, x2, x3, x4 (x1 → x2, x1 → x3, x2 → x4), with example CPT rows such as (0.2, 0.3, 0.5) and (0.1, 0.6, 0.3).]

    p(x) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2)

Assume each variable xi is discrete and can take on Ki values.

The parameters of this model can be represented as 4 tables: θ1 has K1 entries, θ2 has K1 × K2 entries, etc.

These are called conditional probability tables (CPTs) with the following semantics:

    p(x_1 = k) = θ_{1,k}        p(x_2 = k' | x_1 = k) = θ_{2,k,k'}

If node i has M parents, θ_i can be represented either as an (M + 1)-dimensional table, or as a 2-dimensional table with (∏_{j∈pa(i)} K_j) × K_i entries by collapsing all the states of the parents of node i. Note that ∑_{k'} θ_{i,k,k'} = 1.
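As a concrete illustration (not from the original slides), here is a minimal sketch of how such CPTs might be stored as arrays for the four-node example above; the uniform initial values and K = 3 are assumptions.

```python
import numpy as np

# Hypothetical sketch: CPTs for the four-node network above,
# assuming each variable takes K = 3 values.
K = 3
theta = {
    1: np.full(K, 1.0 / K),          # p(x1 = k)            -> K entries
    2: np.full((K, K), 1.0 / K),     # p(x2 = k' | x1 = k)  -> K x K entries
    3: np.full((K, K), 1.0 / K),     # p(x3 = k' | x1 = k)
    4: np.full((K, K), 1.0 / K),     # p(x4 = k' | x2 = k)
}

# Each row of a CPT is a distribution over the child's values,
# so rows must sum to one (sum over k' of theta_{i,k,k'} = 1).
assert np.allclose(theta[1].sum(), 1.0)
for i in (2, 3, 4):
    assert np.allclose(theta[i].sum(axis=-1), 1.0)
```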

Learning parameters

[Figure: the same four-node graph over x1, x2, x3, x4.]

Assume a data set D = {x^{(n)}}_{n=1}^N. How do we learn θ from D?

    p(x|θ) = p(x_1|θ_1) p(x_2|x_1, θ_2) p(x_3|x_1, θ_3) p(x_4|x_2, θ_4)

Likelihood:

    p(D|θ) = ∏_{n=1}^N p(x^{(n)}|θ)

Log likelihood:

    log p(D|θ) = ∑_{n=1}^N ∑_i log p(x_i^{(n)} | x_{pa(i)}^{(n)}, θ_i)

This decomposes into a sum of functions of θ_i. Each θ_i can be optimized separately:

    θ̂_{i,k,k'} = n_{i,k,k'} / ∑_{k''} n_{i,k,k''}

where n_{i,k,k'} is the number of times in D that x_i = k' and x_{pa(i)} = k, and k indexes a joint configuration of all the parents of i (i.e. it takes one of ∏_{j∈pa(i)} K_j values).

ML solution: Simply calculate frequencies! For example, with counts n_2 = [[2, 3, 0], [3, 1, 6]] (rows indexed by x_1, columns by x_2), the ML estimate is θ̂_2 = [[0.4, 0.6, 0], [0.3, 0.1, 0.6]].
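A minimal sketch (my own illustration, not the course demo) of this frequency-counting ML estimate, applied to the counts in the example above:

```python
import numpy as np

# Counts n_2[k, k'] = number of cases with x1 = k and x2 = k'
# (the example table from the slide).
n2 = np.array([[2.0, 3.0, 0.0],
               [3.0, 1.0, 6.0]])

# ML estimate: normalize each row so that sum over k' of theta_{2,k,k'} = 1.
theta2_hat = n2 / n2.sum(axis=1, keepdims=True)
print(theta2_hat)   # [[0.4 0.6 0. ] [0.3 0.1 0.6]]
```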

Deriving the Maximum Likelihood Estimate

[Figure: a two-node model x → y with parameter θ.]

    p(y|x, θ) = ∏_{k,ℓ} θ_{k,ℓ}^{δ(x,k) δ(y,ℓ)}

Dataset: D = {(x^{(n)}, y^{(n)}) : n = 1, . . . , N}

    L(θ) = log ∏_n p(y^{(n)}|x^{(n)}, θ)
         = log ∏_n ∏_{k,ℓ} θ_{k,ℓ}^{δ(x^{(n)},k) δ(y^{(n)},ℓ)}
         = ∑_{n,k,ℓ} δ(x^{(n)}, k) δ(y^{(n)}, ℓ) log θ_{k,ℓ}
         = ∑_{k,ℓ} ( ∑_n δ(x^{(n)}, k) δ(y^{(n)}, ℓ) ) log θ_{k,ℓ}
         = ∑_{k,ℓ} n_{k,ℓ} log θ_{k,ℓ}

Maximize L(θ) w.r.t. θ subject to ∑_ℓ θ_{k,ℓ} = 1 for all k.
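The slides state this constrained maximization without carrying it out; a standard way to finish it (added here for completeness) is with one Lagrange multiplier λ_k per constraint:

    ∑_{k,ℓ} n_{k,ℓ} log θ_{k,ℓ} + ∑_k λ_k (1 − ∑_ℓ θ_{k,ℓ})

Setting the derivative with respect to θ_{k,ℓ} to zero gives n_{k,ℓ}/θ_{k,ℓ} = λ_k, i.e. θ_{k,ℓ} ∝ n_{k,ℓ}, and enforcing the constraint yields the frequency estimate θ̂_{k,ℓ} = n_{k,ℓ} / ∑_{ℓ'} n_{k,ℓ'}.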

Maximum Likelihood Learning with Hidden Variables

[Figure: a small network with hidden variables X1, X2, X3 and observed variable Y, with parameters θ1, . . . , θ4.]

Assume a model parameterised by θ with observable variables Y and hidden variables X.

Goal: maximize parameter log likelihood given observed data.

    L(θ) = log p(Y|θ) = log ∑_X p(Y, X|θ)

Maximum Likelihood Learning with Hidden Variables: The EM Algorithm

Goal: maximise parameter log likelihood given observables.

    L(θ) = log p(Y|θ) = log ∑_X p(Y, X|θ)

The Expectation Maximization (EM) algorithm (intuition):

Iterate between applying the following two steps:

• The E step: fill in the hidden/missing variables

• The M step: apply complete-data learning to the filled-in data.

Maximum Likelihood Learning with Hidden Variables: The EM Algorithm

Goal: maximise parameter log likelihood given observables.

    L(θ) = log p(Y|θ) = log ∑_X p(Y, X|θ)

The EM algorithm (derivation):

    L(θ) = log ∑_X q(X) p(Y, X|θ)/q(X)  ≥  ∑_X q(X) log [ p(Y, X|θ)/q(X) ]  =  F(q(X), θ)

• The E step: maximize F(q(X), θ^{[t]}) w.r.t. q(X), holding θ^{[t]} fixed:

    q(X) = p(X|Y, θ^{[t]})

• The M step: maximize F(q(X), θ) w.r.t. θ, holding q(X) fixed:

    θ^{[t+1]} ← argmax_θ ∑_X q(X) log p(Y, X|θ)

The E step requires solving the inference problem: finding the distribution over the hidden variables, p(X|Y, θ^{[t]}), given the current model parameters. This can be done using belief propagation or the junction tree algorithm.
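A useful identity behind these two steps (standard, though not spelled out on the slide): the gap in the bound is exactly a Kullback-Leibler divergence,

    F(q(X), θ) = log p(Y|θ) − KL( q(X) ‖ p(X|Y, θ) ),

so the E step makes the bound tight by setting q(X) to the posterior, and each EM iteration therefore cannot decrease log p(Y|θ).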

Maximum Likelihood Learning without and with Hidden Variables

ML Learning with Complete Data (No Hidden Variables)

Log likelihood decomposes into sum of functions of θi. Each θi can be optimized separately:

    θ̂_{ijk} ← n_{ijk} / ∑_{k'} n_{ijk'}

where n_{ijk} is the number of times in D that x_i = k and x_{pa(i)} = j.

Maximum likelihood solution: Simply calculate frequencies!

ML Learning with Incomplete Data (i.e. with Hidden Variables)

Iterative EM algorithm:

E step: compute expected counts given the previous parameter settings, E[n_{ijk}|D, θ^{[t]}].
M step: re-estimate the parameters using these expected counts:

    θ^{[t+1]}_{ijk} ← E[n_{ijk}|D, θ^{[t]}] / ∑_{k'} E[n_{ijk'}|D, θ^{[t]}]
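A minimal runnable sketch of this expected-counts EM, assuming (purely for illustration, not from the slides) the simplest possible case: a two-node network Z → X with Z hidden and X observed, i.e. a plain mixture of discrete distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
Kz, Kx, N = 2, 3, 500

# Synthetic observed data for X only; Z is never observed.
x = rng.integers(Kx, size=N)

# Random initial CPTs: p(Z = j) and p(X = k | Z = j).
pz = rng.dirichlet(np.ones(Kz))
px_given_z = rng.dirichlet(np.ones(Kx), size=Kz)

for t in range(50):
    # E step: responsibilities p(Z = j | x_n), then expected counts E[n].
    resp = pz[:, None] * px_given_z[:, x]        # shape (Kz, N), unnormalized joint
    resp /= resp.sum(axis=0, keepdims=True)      # normalize over Z
    En_z = resp.sum(axis=1)                      # E[n_j] for p(Z)
    En_xz = np.stack([resp[:, x == k].sum(axis=1) for k in range(Kx)], axis=1)  # E[n_{jk}]

    # M step: re-estimate the CPTs from the expected counts.
    pz = En_z / En_z.sum()
    px_given_z = En_xz / En_xz.sum(axis=1, keepdims=True)
```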

Bayesian Learning

Apply the basic rules of probability to learning from data.

Data set: D = {x_1, . . . , x_n}.  Models: m, m', etc.  Model parameters: θ.

Prior probability of models: P(m), P(m'), etc.
Prior probabilities of model parameters: P(θ|m)
Model of data given parameters (likelihood model): P(x|θ, m)

If the data are independently and identically distributed, then:

    P(D|θ, m) = ∏_{i=1}^n P(x_i|θ, m)

Posterior probability of model parameters:

    P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)

Posterior probability of models:

    P(m|D) = P(m) P(D|m) / P(D)

Bayesian Occam’s Razor and Model Comparison

Compare model classes, e.g. m and m', using their posterior probabilities given D:

    p(m|D) = p(D|m) p(m) / p(D),        p(D|m) = ∫ p(D|θ, m) p(θ|m) dθ

Interpretations of the Marginal Likelihood (“model evidence”):

• The probability that randomly selected parameters from the prior would generate D.

• Probability of the data under the model, averaging over all possible parameter values.

• log_2 (1 / p(D|m)) is the number of bits of surprise at observing data D under model m.

Model classes that are too simple are unlikely to generate the data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: sketch of P(D|m) over all possible data sets of size n, with curves for models that are “too simple”, “just right”, and “too complex”.]

Model structure and overfitting: A simple example: polynomial regression

[Figure: polynomial fits of order M = 0 through M = 7 to the same data set, one panel per order.]

Bayesian Model Comparison: Occam’s Razor at Work

[Figure: the same polynomial fits for M = 0 through M = 7, shown alongside a bar plot of the model evidence P(Y|M) for M = 0, . . . , 7.]

For example, for quadratic polynomials (m = 2): y = a_0 + a_1 x + a_2 x^2 + ε, where ε ∼ N(0, σ^2) and the parameters are θ = (a_0, a_1, a_2, σ).

demo: polybayes
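A minimal sketch in the spirit of the polybayes demo (my own illustration, not the demo code; the data, noise level, and Gaussian coefficient prior are assumptions): with a Gaussian prior on the coefficients and known noise variance, the evidence p(y|x, M) of each polynomial order M is a Gaussian marginal likelihood that can be evaluated in closed form.

```python
import numpy as np

def log_evidence(x, y, M, sigma2=25.0, prior_var=100.0):
    """log p(y | x, M) for a polynomial of order M, under a Gaussian
    coefficient prior a ~ N(0, prior_var * I) and noise eps ~ N(0, sigma2).
    Marginally, y ~ N(0, prior_var * Phi Phi^T + sigma2 * I)."""
    xs = (x - x.mean()) / x.std()                 # standardize inputs for stability
    Phi = np.vander(xs, M + 1, increasing=True)   # design matrix [1, x, ..., x^M]
    K = prior_var * Phi @ Phi.T + sigma2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(x) * np.log(2 * np.pi))

# Toy data from a noisy quadratic (values assumed for illustration only).
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 5.0, size=x.shape)

for M in range(8):
    print(M, round(log_evidence(x, y, M), 2))  # the evidence typically peaks near M = 2
```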

Learning Model Structure

How many clusters in the data?

What is the intrinsic dimensionality of the data?

Is this input relevant to predicting that output?

What is the order of a dynamical system?

How many states in a hidden Markov model? [Figure: an example sequence, SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTY]

How many auditory sources in the input?

Which graph structure best models the data?

[Figure: an example graph over nodes A, B, C, D, E.]

demo: run simple

Bayesian parameter learning with no hidden variables

Let n_{ijk} be the number of times that (x_i^{(n)} = k and x_{pa(i)}^{(n)} = j) in D. For each i and j, θ_{ij·} is a probability vector of length K_i × 1.

Since x_i is a discrete variable with probabilities given by θ_{i,j,·}, the likelihood is:

    p(D|θ) = ∏_n ∏_i p(x_i^{(n)} | x_{pa(i)}^{(n)}, θ)  =  ∏_i ∏_j ∏_k θ_{ijk}^{n_{ijk}}

If we choose a prior on θ of the form:

    p(θ) = c ∏_i ∏_j ∏_k θ_{ijk}^{α_{ijk} − 1}

where c is a normalization constant and ∑_k θ_{ijk} = 1 ∀i, j, then the posterior distribution has the same form:

    p(θ|D) = c' ∏_i ∏_j ∏_k θ_{ijk}^{α̃_{ijk} − 1}

where α̃_{ijk} = α_{ijk} + n_{ijk}.

This distribution is called the Dirichlet distribution.

Dirichlet Distribution

The Dirichlet distribution is a distribution over the K-dim probability simplex.

Let θ be a K-dimensional vector such that ∀j: θ_j ≥ 0 and ∑_{j=1}^K θ_j = 1.

    p(θ|α) = Dir(α_1, . . . , α_K) := [ Γ(∑_j α_j) / ∏_j Γ(α_j) ] ∏_{j=1}^K θ_j^{α_j − 1}

where the first term is a normalization constant and E(θ_j) = α_j / (∑_k α_k).

The Dirichlet is conjugate to the multinomial distribution. Let x|θ ∼ Multinomial(·|θ)

That is, p(x = j|θ) = θ_j. Then the posterior is also Dirichlet:

    p(θ|x = j, α) = p(x = j|θ) p(θ|α) / p(x = j|α) = Dir(α̃)

where α̃_j = α_j + 1 and ∀ℓ ≠ j: α̃_ℓ = α_ℓ.

Note: Γ(x) = (x − 1) Γ(x − 1) = ∫_0^∞ t^{x−1} e^{−t} dt. For integer n, Γ(n) = (n − 1)!

Dirichlet Distributions

Examples of Dirichlet distributions over θ = (θ_1, θ_2, θ_3), which can be plotted in 2D since θ_3 = 1 − θ_1 − θ_2.

Example

Assume αijk = 1 ∀i, j, k.

This corresponds to a uniform prior distribution over parameters θ. This is not a very strong/dogmatic prior, since any parameter setting is assumed a priori possible.

After observing data D, what are the parameter posterior distributions?

    p(θ_{ij·}|D) = Dir(n_{ij·} + 1)

This distribution predicts, for future data:

    p(x_i = k | x_{pa(i)} = j, D) = (n_{ijk} + 1) / ∑_{k'} (n_{ijk'} + 1)

Adding 1 to each of the counts is a form of smoothing called “Laplace’s Rule”.
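A small sketch of this Dirichlet update and predictive rule (my own illustration), reusing the counts from the earlier n_2 example:

```python
import numpy as np

# Counts n_{2jk} from the earlier example (rows: parent state j, columns: child state k).
n2 = np.array([[2.0, 3.0, 0.0],
               [3.0, 1.0, 6.0]])

alpha = np.ones_like(n2)            # uniform prior: alpha_{ijk} = 1
alpha_post = alpha + n2             # Dirichlet posterior parameters: alpha~ = alpha + n

# Posterior-predictive probabilities ("Laplace's Rule"): (n + 1) / sum over k' of (n + 1).
pred = alpha_post / alpha_post.sum(axis=1, keepdims=True)
print(pred)     # first row: [3/8, 4/8, 1/8] instead of the ML estimate [0.4, 0.6, 0.0]
```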

Bayesian parameter learning with hidden variables

Notation: let D be the observed data set, X be hidden variables, and θ be model parameters. Assume discrete variables and Dirichlet priors on θ.

Goal: to infer p(θ|D) = ∑_X p(X, θ|D)

Problem: since (a) p(θ|D) = ∑_X p(θ|X, D) p(X|D), (b) for every way of filling in the missing data, p(θ|X, D) is a Dirichlet distribution, and (c) there are exponentially many ways of filling in X, it follows that p(θ|D) is a mixture of Dirichlets with exponentially many terms!

Solutions:

• Find a single best (“Viterbi”) completion of X (Stolcke and Omohundro, 1993)

• Markov chain Monte Carlo methods

• Variational Bayesian (VB) methods (Beal and Ghahramani, 2006)

Summary of parameter learning

                Complete (fully observed) data        Incomplete (hidden/missing) data
ML              calculate frequencies                 EM
Bayesian        update Dirichlet distributions        MCMC / Viterbi / VB

• For complete data, Bayesian learning is not more costly than ML.
• For incomplete data, VB ≈ EM in time complexity.
• Other parameter priors are possible, but the Dirichlet is pretty flexible and intuitive.
• For non-discrete data, similar ideas apply, but inference and learning are generally harder.

Structure learning

Given a data set of observations of (A, B, C, D, E), can we learn the structure of the graphical model?

[Figure: four candidate graph structures over the variables A, B, C, D, E.]

Let m denote the graph structure, i.e. the set of edges.

Structure learning

[Figure: the same four candidate graph structures over A, B, C, D, E.]

Constraint-Based Learning: Use statistical tests of marginal and conditional independence. Find the set of DAGs whose d-separation relations match the results of conditional independence tests.

Score-Based Learning: Use a global score such as the BIC score or the Bayesian marginal likelihood. Find the structures that maximize this score.

Score-based structure learning for complete data

Consider a graphical model with structure m, discrete observed data D, and parameters θ. Assume Dirichlet priors.

The Bayesian marginal likelihood score is easy to compute:

    score(m) = log p(D|m) = log ∫ p(D|θ, m) p(θ|m) dθ

    score(m) = ∑_i ∑_j [ log Γ(∑_k α_{ijk}) − ∑_k log Γ(α_{ijk}) − log Γ(∑_k α̃_{ijk}) + ∑_k log Γ(α̃_{ijk}) ]

where α̃_{ijk} = α_{ijk} + n_{ijk}. Note that the score decomposes over i.

One can incorporate structure prior information p(m) as well:

score(m) = log p(D|m) + log p(m)
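A minimal sketch of the marginal-likelihood score for a single family (node i and its parent configurations j); the counts and uniform prior are the ones from the earlier example, and the full score(m) would sum such terms over all nodes i under structure m:

```python
import numpy as np
from scipy.special import gammaln

def family_score(n, alpha):
    """Contribution of one node to log p(D|m): for each parent configuration j,
    log Gamma(sum_k alpha_jk) - sum_k log Gamma(alpha_jk)
    - log Gamma(sum_k alpha~_jk) + sum_k log Gamma(alpha~_jk),
    with alpha~ = alpha + n."""
    a_post = alpha + n
    return np.sum(gammaln(alpha.sum(axis=1)) - gammaln(alpha).sum(axis=1)
                  - gammaln(a_post.sum(axis=1)) + gammaln(a_post).sum(axis=1))

# Example: counts n_{2jk} from earlier, uniform prior alpha_{ijk} = 1.
n2 = np.array([[2.0, 3.0, 0.0],
               [3.0, 1.0, 6.0]])
print(family_score(n2, np.ones_like(n2)))
```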

Greedy search algorithm: Start with m. Consider modifications m → m' (edge deletions, additions, reversals). Accept m' if score(m') > score(m). Repeat.
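A schematic sketch of this greedy loop (my own illustration; the toy score used at the bottom is a stand-in, not the Bayesian score above):

```python
import itertools
import numpy as np

def is_acyclic(adj):
    """True if the directed graph given by adjacency matrix adj has no cycles."""
    remaining = set(range(len(adj)))
    while remaining:
        roots = {i for i in remaining if not any(adj[j, i] for j in remaining)}
        if not roots:
            return False
        remaining -= roots
    return True

def neighbours(adj):
    """All acyclic graphs one edge addition, deletion, or reversal away."""
    for i, j in itertools.permutations(range(len(adj)), 2):
        cand = adj.copy()
        if adj[i, j]:
            cand[i, j] = 0
            yield cand.copy()                 # deletion
            cand[j, i] = 1
            if is_acyclic(cand):
                yield cand                    # reversal
        elif not adj[j, i]:
            cand[i, j] = 1
            if is_acyclic(cand):
                yield cand                    # addition

def greedy_search(score, n_nodes):
    """Hill-climb: accept a modified structure whenever it improves the score."""
    m = np.zeros((n_nodes, n_nodes), dtype=int)   # start from the empty graph
    improved = True
    while improved:
        improved = False
        for cand in neighbours(m):
            if score(cand) > score(m):
                m, improved = cand, True
    return m

# Toy stand-in score: reward agreement with a fixed target DAG (for demonstration only).
target = np.array([[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]])
print(greedy_search(lambda adj: -np.abs(adj - target).sum(), 4))
```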

Bayesian inference of model structure: Run MCMC on m.

Bayesian Structural EM for incomplete data

Consider a graph with structure m, observed data D, hidden variables X and parameters θ

The Bayesian score is generally intractable to compute:

    score(m) = p(D|m) = ∑_X ∫ p(X, θ, D|m) dθ

Bayesian Structural EM (Friedman, 1998):

1. compute MAP parameters θ̂ for the current model m using EM

2. find the hidden-variable distribution p(X|D, θ̂)

3. for a small set of candidate structures, compute or approximate

    score(m') = ∑_X p(X|D, θ̂) log p(D, X|m')

4. m ← the m' with the highest score

Alternatively, we can use variational Bayesian learning (Beal and Ghahramani, 2006).

Directed Graphical Models and Causality

Causal relationships are a fundamental component of cognition and scientific discovery.

Even though the independence relations are identical, there is a causal difference between

• “smoking” → “yellow teeth”

• “yellow teeth” → “smoking”

Key idea: interventions and the do-calculus:

    p(S|Y = y) ≠ p(S|do(Y = y))

    p(Y|S = s) = p(Y|do(S = s))

Causal relationships are robust to interventions on the parents.
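A tiny numerical sketch of this difference for the two-node model “smoking” → “yellow teeth” (the CPT values below are made up purely for illustration):

```python
import numpy as np

# Assumed CPTs for S (smoking) and Y (yellow teeth) given S; the numbers are made up.
p_s = np.array([0.7, 0.3])                 # p(S = 0), p(S = 1)
p_y_given_s = np.array([[0.9, 0.1],        # p(Y | S = 0)
                        [0.4, 0.6]])       # p(Y | S = 1)

# Observational conditioning: p(S | Y = 1) by Bayes' rule.
joint = p_s[:, None] * p_y_given_s         # joint[s, y] = p(S = s, Y = y)
p_s_given_y1 = joint[:, 1] / joint[:, 1].sum()

# Intervention: do(Y = 1) cuts the edge S -> Y, so p(S | do(Y = 1)) = p(S).
p_s_do_y1 = p_s

print(p_s_given_y1)   # [0.28, 0.72]: observing yellow teeth changes belief about smoking
print(p_s_do_y1)      # [0.7, 0.3]: painting teeth yellow does not
```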

The key difficulty in learning causal relationships from observational data is the presence of hidden common causes:

[Figure: graphs over A and B showing that an observed dependence could arise from A → B, from B → A, or from a hidden common cause H.]

Learning parameters and structure in undirected graphs

[Figure: two example undirected graphs over A, B, C, D, E.]

    p(x|θ) = (1/Z(θ)) ∏_j g_j(x_{C_j}; θ_j)        where    Z(θ) = ∑_x ∏_j g_j(x_{C_j}; θ_j)

Problem: computing Z(θ) is computationally intractable for general (non-tree-structured) undirected models. Therefore, maximum-likelihood learning of parameters is generally intractable, Bayesian scoring of structures is intractable, etc.
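To make this concrete, here is a small assumed example (not from the slides): brute-force evaluation of Z(θ) for a binary pairwise model, which enumerates all 2^n configurations and is therefore hopeless beyond a handful of variables.

```python
import itertools
import numpy as np

def log_partition(W, b):
    """Brute-force log Z for a binary pairwise model with
    p(x) proportional to exp( sum_{i<j} W_ij x_i x_j + sum_i b_i x_i ), x_i in {0, 1}.
    Cost is O(2^n): fine for n = 10, hopeless for n = 100."""
    n = len(b)
    log_potentials = []
    for cfg in itertools.product([0, 1], repeat=n):
        x = np.array(cfg)
        log_potentials.append(0.5 * x @ W @ x + b @ x)   # W symmetric, zero diagonal
    return np.log(np.sum(np.exp(log_potentials)))

rng = np.random.default_rng(0)
n = 10
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
b = rng.normal(size=n)
print(log_partition(W, b))
```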

Solutions:

• directly approximate Z(θ) and/or its derivatives (cf. Boltzmann machine learning; contrastive divergence; pseudo-likelihood)

• use approximate inference methods (e.g. loopy belief propagation, bounding methods, EP).

See (Murray and Ghahramani, 2004; Murray et al., 2006) for Bayesian learning in undirected models.

Summary

• Parameter learning in directed models:
  – complete and incomplete data
  – ML and Bayesian methods

• Bayesian model comparison and Occam’s Razor

• Structure learning in directed models: complete and incomplete data

• Causality

• Parameter and structure learning in undirected models

Readings and References

• Beal, M.J. and Ghahramani, Z. (2006) Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis 1(4):793–832. http://learning.eng.cam.ac.uk/zoubin/papers/BeaGha06.pdf
• Friedman, N. (1998) The Bayesian structural EM algorithm. In Uncertainty in Artificial Intelligence (UAI-1998). http://robotics.stanford.edu/~nir/Papers/Fr2.pdf
• Friedman, N. and Koller, D. (2003) Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning 50(1):95–125. http://www.springerlink.com/index/NQ13817217667435.pdf
• Ghahramani, Z. (2004) Unsupervised learning. In Bousquet, O., von Luxburg, U. and Raetsch, G. (eds), Advanced Lectures in Machine Learning, 72–112. http://learning.eng.cam.ac.uk/zoubin/papers/ul.pdf
• Heckerman, D. (1995) A tutorial on learning with Bayesian networks. In Learning in Graphical Models. http://research.microsoft.com/pubs/69588/tr-95-06.pdf
• Murray, I.A., Ghahramani, Z. and MacKay, D.J.C. (2006) MCMC for doubly-intractable distributions. In Uncertainty in Artificial Intelligence (UAI-2006). http://learning.eng.cam.ac.uk/zoubin/papers/doubly_intractable.pdf
• Murray, I.A. and Ghahramani, Z. (2004) Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Uncertainty in Artificial Intelligence (UAI-2004). http://learning.eng.cam.ac.uk/zoubin/papers/uai04murray.pdf
• Stolcke, A. and Omohundro, S. (1993) Hidden Markov model induction by Bayesian model merging. In Advances in Neural Information Processing Systems (NIPS). http://omohundro.files.wordpress.com/2009/03/stolcke_omohundro93_hmm_induction_bayesian_model_merging.pdf