
School of Computer Science

Probabilistic Graphical Models

Learning generalized linear models and tabular CPTs of fully observed BNs

Eric Xing, Lecture 7, September 30, 2009


Exponential family

• For a numeric random variable X,

$$p(x \mid \eta) = h(x)\,\exp\{\eta^{\top}T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\,h(x)\,\exp\{\eta^{\top}T(x)\}$$

is an exponential family distribution with natural (canonical) parameter η.

• The function T(x) is a sufficient statistic.

• The function A(η) = log Z(η) is the log normalizer.

• Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...


Recall

• Let us assume that the target variable and the inputs are related by the equation:

$$y_i = \theta^{\top}x_i + \varepsilon_i$$

where ε_i is an error term capturing unmodeled effects or random noise.

• Now assume that ε follows a Gaussian distribution N(0, σ²); then we have:

$$p(y_i \mid x_i;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(y_i - \theta^{\top}x_i)^2}{2\sigma^2}\right)$$


Recall: logistic regression (sigmoid classifier)

• The conditional distribution is a Bernoulli:

$$p(y \mid x) = \mu(x)^{y}\,(1 - \mu(x))^{1-y}$$

where µ is a logistic function:

$$\mu(x) = \frac{1}{1 + e^{-\theta^{\top}x}}$$

• We can use the brute-force gradient method as in LR.

• But we can also apply generic laws by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model!


Example: Multivariate Gaussian Distribution

• For a continuous vector random variable X ∈ R^k:

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\!\left\{-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right\}$$
$$= \frac{1}{(2\pi)^{k/2}}\exp\!\left\{-\tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}xx^{\top}\right) + \mu^{\top}\Sigma^{-1}x - \tfrac{1}{2}\mu^{\top}\Sigma^{-1}\mu - \tfrac{1}{2}\log|\Sigma|\right\}$$

• Exponential family representation:

$$\eta = \left[\Sigma^{-1}\mu;\ -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1})\right] = \left[\eta_1;\ \mathrm{vec}(\eta_2)\right], \qquad \eta_1 = \Sigma^{-1}\mu,\quad \eta_2 = -\tfrac{1}{2}\Sigma^{-1}$$
$$T(x) = \left[x;\ \mathrm{vec}(xx^{\top})\right]$$
$$A(\eta) = \tfrac{1}{2}\mu^{\top}\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\mathrm{tr}\!\left(\eta_2^{-1}\eta_1\eta_1^{\top}\right) - \tfrac{1}{2}\log\!\left|-2\eta_2\right|$$
$$h(x) = (2\pi)^{-k/2}$$

• Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive-definiteness, the parameters are constrained and have lower degrees of freedom). The conversion between moment and natural parameters is sketched below.
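As a quick illustration, here is a minimal NumPy sketch (not from the original slides; the function names are mine) of the conversion between the moment parameters (µ, Σ) and the natural parameters (η₁, η₂) defined above:

```python
import numpy as np

def gaussian_to_natural(mu, Sigma):
    """Moment parameters (mu, Sigma) -> natural parameters (eta1, eta2)."""
    Lam = np.linalg.inv(Sigma)          # precision matrix Sigma^{-1}
    eta1 = Lam @ mu                     # eta1 = Sigma^{-1} mu
    eta2 = -0.5 * Lam                   # eta2 = -1/2 Sigma^{-1}
    return eta1, eta2

def natural_to_gaussian(eta1, eta2):
    """Invert the map: (eta1, eta2) -> (mu, Sigma)."""
    Sigma = np.linalg.inv(-2.0 * eta2)  # Sigma = (-2 eta2)^{-1}
    mu = Sigma @ eta1
    return mu, Sigma

# round-trip check on a random SPD covariance
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
mu = rng.standard_normal(3)
mu2, Sigma2 = natural_to_gaussian(*gaussian_to_natural(mu, Sigma))
assert np.allclose(mu, mu2) and np.allclose(Sigma, Sigma2)
```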


Example: Multinomial Distribution

• For a binary vector random variable x ~ multi(x | π):

$$p(x \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\!\Big\{\sum_k x_k \ln\pi_k\Big\}$$
$$= \exp\!\Big\{\sum_{k=1}^{K-1} x_k \ln\pi_k + \Big(1 - \sum_{k=1}^{K-1} x_k\Big)\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big)\Big\}$$
$$= \exp\!\Big\{\sum_{k=1}^{K-1} x_k \ln\!\Big(\frac{\pi_k}{1 - \sum_{k=1}^{K-1}\pi_k}\Big) + \ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big)\Big\}$$

• Exponential family representation:

$$\eta = \Big[\ln\Big(\frac{\pi_k}{\pi_K}\Big);\ 0\Big]$$
$$T(x) = [x]$$
$$A(\eta) = -\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big(\sum_{k=1}^{K} e^{\eta_k}\Big)$$
$$h(x) = 1$$

Why exponential family?

• Moment generating property:

$$\frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
= \frac{1}{Z(\eta)}\frac{d}{d\eta}\int h(x)\exp\{\eta^{\top}T(x)\}\,dx
= \int T(x)\,\frac{h(x)\exp\{\eta^{\top}T(x)\}}{Z(\eta)}\,dx
= E[T(x)]$$

$$\frac{d^2A}{d\eta^2} = \int T^2(x)\,\frac{h(x)\exp\{\eta^{\top}T(x)\}}{Z(\eta)}\,dx
- \int T(x)\,\frac{h(x)\exp\{\eta^{\top}T(x)\}}{Z(\eta)}\,dx \cdot \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
= E[T^2(x)] - E^2[T(x)] = \mathrm{Var}[T(x)]$$


Moment estimation

• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η).
• The qth derivative gives the qth centered moment:

$$\frac{dA(\eta)}{d\eta} = \text{mean}$$
$$\frac{d^2A(\eta)}{d\eta^2} = \text{variance}$$
$$\vdots$$

• When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
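A small numerical sanity check of this property, using the Bernoulli log normalizer A(η) = log(1 + e^η) (the K = 2 case of the multinomial example above); this is an illustrative sketch with my own variable names, not course code:

```python
import numpy as np

# Bernoulli in exponential-family form: A(eta) = log(1 + e^eta)
A = lambda eta: np.log1p(np.exp(eta))

eta, h = 0.7, 1e-4
dA  = (A(eta + h) - A(eta - h)) / (2 * h)              # central difference ~ dA/deta
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2    # ~ d^2A/deta^2

mu = 1.0 / (1.0 + np.exp(-eta))                        # E[T(x)] = sigmoid(eta)
print(dA,  mu)              # first derivative  ~ mean
print(d2A, mu * (1 - mu))   # second derivative ~ variance mu(1 - mu)
```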


Moment vs. canonical parameters

• The moment parameter µ can be derived from the natural (canonical) parameter:

$$\frac{dA(\eta)}{d\eta} = E[T(x)] \;\stackrel{\text{def}}{=}\; \mu$$

• A(η) is convex, since

$$\frac{d^2A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] > 0$$

• Hence we can invert the relationship and infer the canonical parameter from the moment parameter (the mapping is 1-to-1):

$$\eta \stackrel{\text{def}}{=} \psi(\mu)$$

• A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).


MLE for Exponential Family

• For iid data, the log-likelihood is

$$\ell(\eta; D) = \log\prod_n h(x_n)\exp\{\eta^{\top}T(x_n) - A(\eta)\}
= \sum_n \log h(x_n) + \eta^{\top}\Big(\sum_n T(x_n)\Big) - N\,A(\eta)$$

• Take derivatives and set to zero:

$$\frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial \eta} = 0
\;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N}\sum_n T(x_n)
\;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)$$

• This amounts to moment matching.

• We can infer the canonical parameters using $\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})$.
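A minimal sketch of this recipe for the Bernoulli case, assuming the standard logit map ψ(µ) = log(µ/(1−µ)); the function name and toy data are mine:

```python
import numpy as np

def bernoulli_mle(x):
    """MLE by moment matching: mu_hat = mean of T(x) = x; eta_hat = psi(mu_hat)."""
    mu_hat = x.mean()                          # (1/N) sum_n T(x_n)
    eta_hat = np.log(mu_hat / (1.0 - mu_hat))  # psi(mu) = logit(mu) for the Bernoulli
    return mu_hat, eta_hat

x = np.array([1, 0, 1, 1, 0, 1, 1, 0], dtype=float)
print(bernoulli_mle(x))   # mu_hat = 0.625, eta_hat = log(0.625 / 0.375)
```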


Sufficiency

• For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).
• We can throw away X for the purpose of inference w.r.t. θ.

• Bayesian view:  p(θ | T(x), x) = p(θ | T(x))

• Frequentist view:  p(x | T(x), θ) = p(x | T(x))

• The Neyman factorization theorem:

• T(x) is sufficient for θ if

$$p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x)) \;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\,h(x, T(x))$$

Examples

• Gaussian:

$$\eta = \left[\Sigma^{-1}\mu;\ -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1})\right],\quad
T(x) = \left[x;\ \mathrm{vec}(xx^{\top})\right],\quad
A(\eta) = \tfrac{1}{2}\mu^{\top}\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|,\quad
h(x) = (2\pi)^{-k/2}$$
$$\Rightarrow\ \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n$$

• Multinomial:

$$\eta = \Big[\ln\Big(\frac{\pi_k}{\pi_K}\Big);\ 0\Big],\quad
T(x) = [x],\quad
A(\eta) = -\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big(\sum_{k=1}^{K} e^{\eta_k}\Big),\quad
h(x) = 1$$
$$\Rightarrow\ \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$

• Poisson:

$$\eta = \log\lambda,\quad
T(x) = x,\quad
A(\eta) = \lambda = e^{\eta},\quad
h(x) = \frac{1}{x!}$$
$$\Rightarrow\ \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$
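For instance, a tiny sketch of the Poisson case of the same moment-matching recipe (illustrative only; the data are made up):

```python
import numpy as np

x = np.array([2, 0, 3, 1, 4, 2, 1], dtype=float)  # iid Poisson counts
mu_hat  = x.mean()            # MLE of the moment parameter: sample mean of T(x) = x
lam_hat = mu_hat              # for the Poisson, mu = E[x] = lambda
eta_hat = np.log(lam_hat)     # canonical parameter eta = log(lambda)
print(lam_hat, eta_hat)
```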

Generalized Linear Models (GLIMs)

• The graphical model:
  • Linear regression
  • Discriminative linear classification
  • Commonality: model E_p(Y) = µ = f(θᵀX)
  • What is p(·)? The conditional distribution of Y.
  • What is f(·)? The response function.

• GLIM:
  • The observed input x is assumed to enter into the model via a linear combination of its elements, ξ = θᵀx.
  • The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
  • The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

GLIM, cont.

[Diagram: x → (θ) → ξ → (f) → µ → (ψ) → η, which parameterizes p(y | η)]

$$p(y \mid \eta) = h(y)\exp\{\eta^{\top}(x)\,y - A(\eta)\}
\;\Rightarrow\; p(y \mid \eta) = h(y,\phi)\exp\Big\{\tfrac{1}{\phi}\big(\eta^{\top}(x)\,y - A(\eta)\big)\Big\}$$

• The choice of exponential family is constrained by the nature of the data Y.
  • Example: if y is a continuous vector → multivariate Gaussian; if y is a class label → Bernoulli or multinomial.

• The choice of the response function:
  • Following some mild constraints, e.g., boundedness in [0,1], positivity, ...
  • Canonical response function: f = ψ⁻¹(·)
  • In this case θᵀx directly corresponds to the canonical parameter η.

MLE for GLIMs with natural response

• Log-likelihood:

$$\ell = \sum_n \log h(y_n) + \sum_n \big(\theta^{\top}x_n\,y_n - A(\eta_n)\big)$$

• Derivative of the log-likelihood:

$$\frac{d\ell}{d\theta} = \sum_n\Big(x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta}\Big)
= \sum_n (y_n - \mu_n)\,x_n
= X^{\top}(y - \mu)$$

This is a fixed-point function, because µ is a function of θ.

• Online learning for canonical GLIMs:
  • Stochastic gradient ascent = least mean squares (LMS) algorithm:

$$\theta^{t+1} = \theta^{t} + \rho\,(y_n - \mu_n^{t})\,x_n, \qquad \text{where } \mu_n^{t} = (\theta^{t})^{\top}x_n \text{ and } \rho \text{ is a step size}$$
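A hedged sketch of this online update, instantiated here with the logistic response so it runs end-to-end (the helper names, step size, and toy data are my assumptions, not from the lecture; for LMS itself the response would be the identity):

```python
import numpy as np

def sgd_glim(X, y, response, rho=0.1, epochs=50, seed=0):
    """Online update theta <- theta + rho * (y_n - mu_n) * x_n for a canonical GLIM."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            mu_n = response(theta @ X[n])        # mu_n = f(theta^T x_n)
            theta += rho * (y[n] - mu_n) * X[n]  # stochastic gradient step
    return theta

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# toy data: y ~ Bernoulli(sigmoid(2*x1 - 1*x2))
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = (rng.random(200) < sigmoid(X @ np.array([2.0, -1.0]))).astype(float)
print(sgd_glim(X, y, sigmoid))
```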


Batch learning for canonical GLIMs

• The Hessian matrix:

$$H = \frac{d^2\ell}{d\theta\,d\theta^{\top}}
= \frac{d}{d\theta^{\top}}\sum_n (y_n - \mu_n)\,x_n
= -\sum_n x_n\,\frac{d\mu_n}{d\theta^{\top}}
= -\sum_n x_n\,\frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^{\top}}
= -\sum_n x_n x_n^{\top}\,\frac{d\mu_n}{d\eta_n} \quad (\text{since } \eta_n = \theta^{\top}x_n)
= -X^{\top}WX$$

where $X = [x_n^{\top}]$ is the design matrix (with rows $x_n^{\top}$), $\vec{y} = [y_1, \ldots, y_N]^{\top}$, and

$$W = \mathrm{diag}\Big(\frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N}\Big),$$

which can be computed by calculating the 2nd derivative of A(ηₙ).


Recall LMS

• Cost function in matrix form:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}(x_i^{\top}\theta - y_i)^2
= \frac{1}{2}(X\theta - \vec{y})^{\top}(X\theta - \vec{y})$$

where $X = [x_i^{\top}]$ is the design matrix and $\vec{y} = [y_1, \ldots, y_n]^{\top}$.

• To minimize J(θ), take the derivative and set it to zero:

$$\nabla_\theta J = \frac{1}{2}\nabla_\theta\,\mathrm{tr}\big(\theta^{\top}X^{\top}X\theta - \theta^{\top}X^{\top}\vec{y} - \vec{y}^{\top}X\theta + \vec{y}^{\top}\vec{y}\big)
= \frac{1}{2}\big(\nabla_\theta\,\mathrm{tr}\,\theta^{\top}X^{\top}X\theta - 2\nabla_\theta\,\mathrm{tr}\,\vec{y}^{\top}X\theta + \nabla_\theta\,\mathrm{tr}\,\vec{y}^{\top}\vec{y}\big)
= \frac{1}{2}\big(X^{\top}X\theta + X^{\top}X\theta - 2X^{\top}\vec{y}\big)
= X^{\top}X\theta - X^{\top}\vec{y} = 0$$

$$\Rightarrow\; X^{\top}X\theta = X^{\top}\vec{y}\ \ \text{(the normal equations)}
\;\Rightarrow\; \theta^{*} = (X^{\top}X)^{-1}X^{\top}\vec{y}$$
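A short NumPy check of the normal equations (an illustrative sketch with made-up data; in practice a least-squares solver is preferred over explicitly forming XᵀX):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
theta_true = np.array([1.5, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(100)

# Normal equations: theta* = (X^T X)^{-1} X^T y  (solve the linear system, don't invert)
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent and numerically more stable:
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_ne, theta_ls)
```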


Iteratively Reweighted Least Squares (IRLS)

• Recall the Newton-Raphson method with cost function J:

$$\theta^{t+1} = \theta^{t} - H^{-1}\nabla_\theta J$$

• We now have

$$\nabla_\theta \ell = X^{\top}(y - \mu), \qquad H = -X^{\top}WX$$

• Now:

$$\theta^{t+1} = \theta^{t} - H^{-1}\nabla_\theta \ell
= (X^{\top}W^{t}X)^{-1}\big[X^{\top}W^{t}X\theta^{t} + X^{\top}(y - \mu^{t})\big]
= (X^{\top}W^{t}X)^{-1}X^{\top}W^{t}z^{t}$$

where the adjusted response is

$$z^{t} = X\theta^{t} + (W^{t})^{-1}(y - \mu^{t})$$

• This can be understood as solving the following "iteratively reweighted least squares" problem:

$$\theta^{t+1} = \arg\min_{\theta}\,(z - X\theta)^{\top}W(z - X\theta)$$


Example 1: logistic regression (sigmoid classifier)

• The conditional distribution is a Bernoulli:

$$p(y \mid x) = \mu(x)^{y}(1 - \mu(x))^{1-y},$$

where µ is a logistic function:

$$\mu(x) = \frac{1}{1 + e^{-\eta(x)}}$$

• p(y|x) is an exponential family function, with
  • mean: $E[y \mid x] = \mu = \dfrac{1}{1 + e^{-\eta(x)}}$
  • and canonical response function $\eta = \xi = \theta^{\top}x$

• IRLS:

$$\frac{d\mu}{d\eta} = \mu(1-\mu), \qquad
W = \mathrm{diag}\big(\mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N)\big)$$
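A compact IRLS sketch for logistic regression following the update θ^{t+1} = (XᵀWX)⁻¹XᵀWz above; the weight clipping, function name, and toy data are my additions:

```python
import numpy as np

def irls_logistic(X, y, iters=20):
    """IRLS for logistic regression: theta <- (X^T W X)^{-1} X^T W z."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ theta
        mu = 1.0 / (1.0 + np.exp(-eta))               # mean response mu = sigmoid(eta)
        w = mu * (1.0 - mu)                           # diagonal of W
        z = eta + (y - mu) / np.clip(w, 1e-10, None)  # adjusted response z
        WX = X * w[:, None]
        theta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return theta

rng = np.random.default_rng(2)
X = np.c_[np.ones(300), rng.standard_normal((300, 2))]
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 2.0, -1.0]))))
y = (rng.random(300) < p).astype(float)
print(irls_logistic(X, y))
```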

Logistic regression: practical issues

• It is very common to use regularized maximum likelihood:

$$p(y = \pm 1 \mid x, \theta) = \frac{1}{1 + e^{-y\theta^{\top}x}} = \sigma(y\theta^{\top}x), \qquad
p(\theta) \sim \mathrm{Normal}(0, \lambda^{-1}I)$$

$$\ell(\theta) = \sum_n \log\sigma(y_n\theta^{\top}x_n) - \frac{\lambda}{2}\theta^{\top}\theta$$

• IRLS takes O(Nd³) per iteration, where N = number of training cases and d = dimension of input x.
• Quasi-Newton methods, which approximate the Hessian, work faster.
• Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
• Stochastic gradient descent can also be used if N is large; c.f. the perceptron rule:

$$\nabla_\theta \ell = \big(1 - \sigma(y_n\theta^{\top}x_n)\big)\,y_n x_n - \lambda\theta$$
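A sketch of the stochastic-gradient option using the per-example gradient above (the step size, epoch count, and toy data are assumptions; note that applying the λθ term per example, exactly as written in the rule, weights the regularizer more heavily than the batch objective does):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def sgd_logistic_l2(X, y_pm1, lam=0.1, rho=0.05, epochs=100, seed=0):
    """Stochastic gradient ascent on sum_n log sigma(y_n theta^T x_n) - lam/2 ||theta||^2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y_pm1)):
            margin = y_pm1[n] * (theta @ X[n])
            grad = (1.0 - sigmoid(margin)) * y_pm1[n] * X[n] - lam * theta
            theta += rho * grad
    return theta

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
y = np.where(X @ np.array([1.0, -2.0]) + 0.3 * rng.standard_normal(200) > 0, 1.0, -1.0)
print(sgd_logistic_l2(X, y))
```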


Example 2: linear regression

• The conditional distribution is a Gaussian:

$$p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\!\Big\{-\tfrac{1}{2}(y - \mu(x))^{\top}\Sigma^{-1}(y - \mu(x))\Big\}$$

Rescaling gives the exponential family form

$$\Rightarrow\ h(y)\exp\big\{\Sigma^{-1}\big(\eta^{\top}(x)\,y - A(\eta)\big)\big\},$$

where µ is a linear function: $\mu(x) = \theta^{\top}x = \eta(x)$.

• p(y|x) is an exponential family function, with
  • mean: $E[y \mid x] = \mu = \theta^{\top}x$
  • and canonical response function $\eta = \xi = \theta^{\top}x$

• IRLS:

$$\frac{d\mu}{d\eta} = 1, \qquad W = I$$

$$\theta^{t+1} = (X^{\top}W^{t}X)^{-1}X^{\top}W^{t}z^{t}
= (X^{\top}X)^{-1}\big(X^{\top}X\theta^{t} + X^{\top}(y - \mu^{t})\big)
= \theta^{t} + (X^{\top}X)^{-1}X^{\top}(y - \mu^{t}) \quad \text{(a steepest-descent-like step)}$$
$$\xrightarrow{\ t\to\infty\ }\ \theta = (X^{\top}X)^{-1}X^{\top}Y \quad \text{(the normal equation)}$$

Simple GMs are the building blocks of complex BNs

• Density estimation: parametric and nonparametric methods
• Regression: linear, conditional mixture, nonparametric
• Classification: generative and discriminative approaches


An (incomplete) genealogy of graphical models

The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans.

But such designs are useful, and indeed favored, because they put human knowledge to good use ...


MLE for general BNs

• If we assume that the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood decomposes into a sum of local terms, one per node:

$$\ell(\theta; D) = \log p(D \mid \theta)
= \log\prod_n\prod_i p(x_{n,i} \mid x_{n,\pi_i}, \theta_i)
= \sum_i\Big(\sum_n \log p(x_{n,i} \mid x_{n,\pi_i}, \theta_i)\Big)$$



Example: decomposable likelihood of a directed model

• Consider the distribution defined by the directed acyclic GM:

$$p(x \mid \theta) = p(x_1 \mid \theta_1)\,p(x_2 \mid x_1, \theta_2)\,p(x_3 \mid x_1, \theta_3)\,p(x_4 \mid x_2, x_3, \theta_4)$$

• This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the BN X1 → {X2, X3}, {X2, X3} → X4, shown decomposed into its four families: (X1), (X1 → X2), (X1 → X3), and (X2, X3 → X4)]


MLE for BNs with tabular CPDs

• Assume each CPD is represented as a table (multinomial), where

$$\theta_{ijk} \stackrel{\text{def}}{=} p(X_i = j \mid X_{\pi_i} = k)$$

• Note that in the case of multiple parents, X_{π_i} will have a composite state, and the CPD will be a high-dimensional table.

• The sufficient statistics are counts of family configurations:

$$n_{ijk} \stackrel{\text{def}}{=} \sum_n x_{n,i}^{j}\,x_{n,\pi_i}^{k}$$

• The log-likelihood is

$$\ell(\theta; D) = \log\prod_{i,j,k}\theta_{ijk}^{\,n_{ijk}} = \sum_{i,j,k} n_{ijk}\log\theta_{ijk}$$

• Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:

$$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$$
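A minimal counting implementation of this estimator for a single family (one child with one, possibly composite, parent); the function and variable names, and the toy data, are mine:

```python
import numpy as np

def mle_cpt(child_vals, parent_vals, n_child_states, n_parent_states):
    """theta[j, k] = n_ijk / sum_j' n_ij'k, the MLE of p(X_i = j | X_pa(i) = k)."""
    counts = np.zeros((n_child_states, n_parent_states))
    for j, k in zip(child_vals, parent_vals):
        counts[j, k] += 1                        # n_ijk: family-configuration counts
    return counts / counts.sum(axis=0, keepdims=True)

# toy fully observed data for one family (child with one binary parent)
child  = [0, 1, 1, 0, 1, 1, 0, 1]
parent = [0, 0, 1, 1, 1, 0, 0, 1]
print(mle_cpt(child, parent, n_child_states=2, n_parent_states=2))
```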

MLE and Kullback-Leibler divergence

• KL divergence:

$$D\big(q(x)\,\|\,p(x)\big) = \sum_x q(x)\log\frac{q(x)}{p(x)}$$

• Empirical distribution:

$$\tilde{p}(x) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N}\delta(x, x_n)$$

where δ(x, xₙ) is a Kronecker delta function.

• Max_θ (MLE) ≡ Min_θ (KL):

$$D\big(\tilde{p}(x)\,\|\,p(x \mid \theta)\big) = \sum_x \tilde{p}(x)\log\frac{\tilde{p}(x)}{p(x \mid \theta)}
= \sum_x \tilde{p}(x)\log\tilde{p}(x) - \sum_x \tilde{p}(x)\log p(x \mid \theta)$$
$$= \sum_x \tilde{p}(x)\log\tilde{p}(x) - \frac{1}{N}\sum_n \log p(x_n \mid \theta)
= C - \frac{1}{N}\,\ell(\theta; D)$$

so minimizing the KL divergence to the empirical distribution is equivalent to maximizing the likelihood.

Parameter sharing

[Figure: a Markov chain X1 → X2 → X3 → ... → XT, with initial-state parameter π and shared transition parameter A]

• Consider a time-invariant (stationary) 1st-order Markov model:
  • Initial state probability vector: $\pi_k \stackrel{\text{def}}{=} p(X_1^k = 1)$
  • State transition probability matrix: $A_{ij} \stackrel{\text{def}}{=} p(X_t^j = 1 \mid X_{t-1}^i = 1)$

• The joint: $p(X_{1:T} \mid \theta) = p(x_1 \mid \pi)\prod_{t=2}^{T} p(X_t \mid X_{t-1})$

• The log-likelihood:

$$\ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n\sum_{t=2}^{T}\log p(x_{n,t} \mid x_{n,t-1}, A)$$

• Again, we optimize each parameter separately:
  • π is a multinomial vector, and we've seen it before.
  • What about A?

Learning a Markov chain transition matrix

• A is a stochastic matrix: $\sum_j A_{ij} = 1$
• Each row of A is a multinomial distribution.

• So the MLE of A_ij is the fraction of transitions from i to j:

$$A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \bullet)} = \frac{\sum_n\sum_{t=2}^{T} x_{n,t-1}^{i}\,x_{n,t}^{j}}{\sum_n\sum_{t=2}^{T} x_{n,t-1}^{i}}$$

• Application:
  • If the states Xt represent words, this is called a bigram language model.

• Sparse data problem:
  • If i → j did not occur in the data, we will have A_ij = 0, and then any further sequence containing the word pair i → j will have zero probability.
  • A standard hack: backoff smoothing or deleted interpolation:

$$\tilde{A}_{i\to\bullet} = \lambda\,\eta_t + (1-\lambda)\,A_{i\to\bullet}^{ML}$$
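A sketch of the counting estimator for A, with an optional deleted-interpolation-style mix toward a background distribution, as in the formula above (the function name, the uniform background in the example, and the toy sequences are my assumptions):

```python
import numpy as np

def mle_transition(seqs, M, lam=0.0, background=None):
    """A_ij = #(i->j) / #(i->.), optionally interpolated with a background distribution."""
    counts = np.zeros((M, M))
    for s in seqs:
        for t in range(1, len(s)):
            counts[s[t - 1], s[t]] += 1
    # rows with no outgoing transitions stay all-zero in this sketch
    A_ml = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    if background is not None:                 # deleted-interpolation-style smoothing
        A_ml = lam * background + (1 - lam) * A_ml
    return A_ml

seqs = [[0, 1, 1, 2, 0, 1], [1, 2, 2, 0, 1]]
print(mle_transition(seqs, M=3))
print(mle_transition(seqs, M=3, lam=0.2, background=np.full((3, 3), 1 / 3)))
```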

Bayesian language model

• Global and local parameter independence

[Figure: the Markov chain X1 → X2 → ... → XT with a Dirichlet prior α on the initial distribution π and Dirichlet priors β on the transition rows A_{i→·}, A_{i'→·}, ...]

• The posterior over A_{i→·} and A_{i'→·} is factorized despite the v-structure at X_t, because X_{t-1} acts like a multiplexer.

• Assign a Dirichlet prior β_i to each row of the transition matrix:

$$A_{ij}^{Bayes} \stackrel{\text{def}}{=} p(j \mid i, D, \beta_i)
= \frac{\#(i \to j) + \beta_{i,j}}{\#(i \to \bullet) + |\beta_i|}
= \lambda_i\,\beta'_{i,j} + (1 - \lambda_i)\,A_{ij}^{ML},
\qquad \text{where } \lambda_i = \frac{|\beta_i|}{|\beta_i| + \#(i \to \bullet)}$$

with $|\beta_i| = \sum_j \beta_{i,j}$ and $\beta'_{i,j} = \beta_{i,j}/|\beta_i|$ the normalized prior counts.

• We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).


Example: HMM: two scenarios

• Supervised learning: estimation when the “right answer” is known
  • Examples:
    • GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands
    • GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

• Unsupervised learning: estimation when the “right answer” is unknown
  • Examples:
    • GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
    • GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice

• QUESTION: Update the parameters θ of the model to maximize P(x|θ): maximum likelihood (ML) estimation


Recall definition of HMM

[Figure: HMM with hidden state chain y1 → y2 → y3 → ... → yT and emissions x1, x2, x3, ..., xT]

• Transition probabilities between any two states:

$$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j},$$

or

$$p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in I.$$

• Start probabilities:

$$p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M).$$

• Emission probabilities associated with each state:

$$p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in I,$$

or in general:

$$p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in I.$$


Supervised ML estimation

• Given x = x1...xN for which the true state path y = y1...yN is known,
• Define:
  A_ij = # times state transition i→j occurs in y
  B_ik = # times state i in y emits k in x

• We can show that the maximum likelihood parameters θ are:

$$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \bullet)}
= \frac{\sum_n\sum_{t=2}^{T} y_{n,t-1}^{i}\,y_{n,t}^{j}}{\sum_n\sum_{t=2}^{T} y_{n,t-1}^{i}}
= \frac{A_{ij}}{\sum_{j'} A_{ij'}}$$

$$b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \bullet)}
= \frac{\sum_n\sum_{t=1}^{T} y_{n,t}^{i}\,x_{n,t}^{k}}{\sum_n\sum_{t=1}^{T} y_{n,t}^{i}}
= \frac{B_{ik}}{\sum_{k'} B_{ik'}}$$

• What if x is continuous? We can treat $\{(x_{n,t},\, y_{n,t}) : t = 1..T,\ n = 1..N\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
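A minimal sketch of this counting estimator, applied to casino-style data like the example on the next slide (states and die faces are 0-indexed; the function name and data layout are mine):

```python
import numpy as np

def supervised_hmm_mle(xs, ys, M, K):
    """a_ij = #(i->j)/#(i->.),  b_ik = #(state i emits k)/#(state i occurs)."""
    A = np.zeros((M, M))   # transition counts
    B = np.zeros((M, K))   # emission counts
    for x, y in zip(xs, ys):
        for t in range(len(y)):
            if t > 0:
                A[y[t - 1], y[t]] += 1
            B[y[t], x[t]] += 1
    a = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    b = B / np.maximum(B.sum(axis=1, keepdims=True), 1)
    return a, b

# casino-style toy data: states 0 = Fair, 1 = Loaded; emissions are die faces 0..5
xs = [[1, 0, 4, 5, 0, 1, 2, 5, 1, 2]]   # rolls 2,1,5,6,1,2,3,6,2,3 (0-indexed)
ys = [[0] * 10]                          # all Fair
print(supervised_hmm_mle(xs, ys, M=2, K=6))
```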


Supervised ML estimation, ctd.

• Intuition:
  • When we know the underlying states, the best estimate of θ is the average frequency of transitions and emissions that occur in the training data.

• Drawback:
  • Given little data, there may be overfitting:
  • P(x|θ) is maximized, but θ is unreasonable: zero probabilities, which is VERY BAD.

• Example:
  • Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
  • Then: a_FF = 1; a_FL = 0
    b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1


Pseudocounts

• Solution for small training sets:
  • Add pseudocounts:
    A_ij = # times state transition i→j occurs in y + R_ij
    B_ik = # times state i in y emits k in x + S_ik
  • R_ij, S_ik are pseudocounts representing our prior belief.
  • Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik,
    • the "strength" of the prior belief,
    • the total number of imaginary instances in the prior.

• Larger total pseudocounts ⇒ stronger prior belief.

• Small total pseudocounts: just to avoid zero probabilities (smoothing).

• This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.
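The same counting estimator with pseudocounts added to the counts before normalization, as described above (a sketch; the uniform choice R = S = 1 and the names are merely illustrative):

```python
import numpy as np

def smoothed_hmm_mle(xs, ys, M, K, R=1.0, S=1.0):
    """Counting estimator with pseudocounts: A_ij += R_ij, B_ik += S_ik before normalizing."""
    A = np.full((M, M), R, dtype=float)   # R_ij: prior transition pseudocounts
    B = np.full((M, K), S, dtype=float)   # S_ik: prior emission pseudocounts
    for x, y in zip(xs, ys):
        for t in range(len(y)):
            if t > 0:
                A[y[t - 1], y[t]] += 1
            B[y[t], x[t]] += 1
    return A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)

xs = [[1, 0, 4, 5, 0, 1, 2, 5, 1, 2]]   # same casino-style toy data as above
ys = [[0] * 10]
a, b = smoothed_hmm_mle(xs, ys, M=2, K=6)
print(b[0])   # no zero emission probabilities for the Fair state anymore
```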


How to define a parameter prior?

[Figure: the alarm network: Earthquake → Alarm, Burglary → Alarm, Earthquake → Radio, Alarm → Call]

• Factorization:

$$p(X = x) = \prod_{i=1}^{M} p(x_i \mid x_{\pi_i})$$

• Local distributions defined by, e.g., multinomial parameters:

$$p(x_i^k \mid x_{\pi_i}^j) = \theta_{x_i^k \mid x_{\pi_i}^j}$$

• How should we define the prior p(θ | G)?

• Assumptions (Geiger & Heckerman '97, '99):
  • Complete Model Equivalence
  • Global Parameter Independence
  • Local Parameter Independence
  • Likelihood and Prior Modularity


Global & Local Parameter Independence

• Global Parameter Independence: for every DAG model,

$$p(\theta_m \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G)$$

• Local Parameter Independence: for every node,

$$p(\theta_i \mid G) = \prod_{j=1}^{q_i} p(\theta_{x_i^k \mid x_{\pi_i}^j} \mid G)$$

e.g., $P(\theta_{Call \mid Alarm = YES})$ is independent of $P(\theta_{Call \mid Alarm = NO})$.


Parameter Independence, Graphical View

[Figure: graphical view of parameter independence: parameter nodes θ_1 and θ_{2|1} are shared across the plates for sample 1 (X1, X2), sample 2 (X1, X2), ..., illustrating global and local parameter independence]

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman '97, '99)

• Discrete DAG models: $x_i \mid x_{\pi_i}^j \sim \mathrm{Multi}(\theta_{x_i})$

  Dirichlet prior:

$$P(\theta) = \frac{\Gamma\!\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)}\prod_k \theta_k^{\alpha_k - 1} = C(\alpha)\prod_k \theta_k^{\alpha_k - 1}$$

• Gaussian DAG models: $x_i \mid x_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)$

  Normal prior:

$$p(\mu \mid \nu, \Psi) = \frac{1}{(2\pi)^{n/2}|\Psi|^{1/2}}\exp\!\Big\{-\tfrac{1}{2}(\mu - \nu)^{\top}\Psi^{-1}(\mu - \nu)\Big\}$$

  Normal-Wishart prior:

$$p(\mu \mid \nu, \alpha_\mu, W) = \mathrm{Normal}\big(\nu, (\alpha_\mu W)^{-1}\big),$$
$$p(W \mid \alpha_w, T) = c(n, \alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w - n - 1)/2}\exp\!\Big\{-\tfrac{1}{2}\mathrm{tr}(TW)\Big\},$$

where $W = \Sigma^{-1}$.

Summary: Learning GM

• For exponential family distributions, MLE amounts to moment matching.

• GLIM:
  • Natural response
  • Iteratively Reweighted Least Squares as a general algorithm

• For fully observed BNs, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.

