School of Computer Science

10-701 Introduction to Machine Learning

HMMs and CRFs

Matt Gormley, Lecture 19, November 14, 2016. Readings: Bishop 13.1-13.2; Bishop 8.3-8.4; Sutton & McCallum (2006); Lafferty et al. (2001)

1 Reminders • Homework 4 – deadline extended to Wed, Nov. 16th – 10 extra points for submitting by Mon, Nov. 14th • Poster Sessions – two sessions on Fri, Dec. 2nd – session 1: 8 - 11:30 am – session 2: 2 - 6 pm

2 HIDDEN MARKOV MODEL (HMM)

Dataset for Supervised Part-of-Speech (POS) Tagging

Data: D = {(x^(n), y^(n))}_{n=1}^N

Sample 1: x^(1) = "time flies like an arrow",   y^(1) = n v p d n
Sample 2: x^(2) = "time flies like an arrow",   y^(2) = n n v d n
Sample 3: x^(3) = "flies fly with their wings", y^(3) = n v p n n
Sample 4: x^(4) = "with time you will see",     y^(4) = p n n v v

4 Naïve Bayes for Time Series Data We could treat each word-tag pair (i.e. token) as independent. This corresponds to a Naïve Bayes model with a single feature (the word).

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .1 * .5 * …)

Tag prior p(y): v .1, n .8, p .2, d .2

Emission p(word | tag) (rows = tag; columns = time, flies, like, ...):
  v: .2  .5  .2
  n: .3  .4  .2
  p: .1  .1  .3
  d: .1  .2  .1

Hidden Markov Model: A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)

Transition p(y_k | y_{k-1}) (rows = previous tag; columns = v, n, p, d):
  v: .1  .4  .2  .3
  n: .8  .1  .1   0
  p: .2  .3  .2  .3
  d: .2  .8   0   0

Emission p(word | tag) (rows = tag; columns = time, flies, like, ...):
  v: .2  .5  .2
  n: .3  .4  .2
  p: .1  .1  .3
  d: .1  .2  .1

From NB to HMM

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

"Naïve Bayes":  P(x, y) = ∏_{k=1}^K p(x_k | y_k) p(y_k)

Y0 Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

HMM:  P(x, y) = ∏_{k=1}^K p(x_k | y_k) p(y_k | y_{k-1})

Hidden Markov Model. HMM Parameters:
Emission matrix A, where P(X_k = w | Y_k = t) = A_{t,w} for all k
Transition matrix B, where P(Y_k = t | Y_{k-1} = s) = B_{s,t} for all k

Transition scores B (rows = previous tag; columns = v, n, p, d):
  v: 1    6  3  4
  n: 8    4  2  0.1
  p: 1    3  1  3
  d: 0.1  8  0  0

Y0 Y1 Y2 Y3 Y4 Y5 over X1 X2 X3 X4 X5, with emission scores A (rows = tag; columns = time, flies, like, ...):
  v: 3    5    3
  n: 4    5    2
  p: 0.1  0.1  3
  d: 0.1  0.2  0.1

Hidden Markov Model. HMM Parameters:
Emission matrix A, where P(X_k = w | Y_k = t) = A_{t,w} for all k
Transition matrix B, where P(Y_k = t | Y_{k-1} = s) = B_{s,t} for all k
Assumption: y_0 = START

Generative Story:
Y_k ~ Multinomial(B_{Y_{k-1}})   for each k
X_k ~ Multinomial(A_{Y_k})       for each k
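A minimal sketch of this generative story in Python (not from the lecture: the tag set, vocabulary, start distribution, and toy parameter values below are illustrative):

```python
import numpy as np

# Hypothetical toy tag set and vocabulary (illustration only, not the lecture's tables).
tags = ["v", "n", "p", "d"]
vocab = ["time", "flies", "like", "an", "arrow"]

rng = np.random.default_rng(0)

# P(Y_1 | Y_0 = START) and transition matrix B[s, t] = P(Y_k = t | Y_{k-1} = s); rows sum to one.
B_start = np.array([0.1, 0.7, 0.1, 0.1])
B = np.array([[0.1, 0.4, 0.2, 0.3],     # from v
              [0.4, 0.2, 0.2, 0.2],     # from n
              [0.2, 0.3, 0.2, 0.3],     # from p
              [0.2, 0.8, 0.0, 0.0]])    # from d
# Emission matrix A[t, w] = P(X_k = w | Y_k = t); random rows that sum to one.
A = rng.dirichlet(np.ones(len(vocab)), size=len(tags))

def sample_hmm(K):
    """Follow the generative story: Y_k ~ Multinomial(B_{Y_{k-1}}), X_k ~ Multinomial(A_{Y_k})."""
    x, y = [], []
    prev = B_start                       # distribution for Y_1, given y_0 = START
    for _ in range(K):
        t = rng.choice(len(tags), p=prev)
        w = rng.choice(len(vocab), p=A[t])
        y.append(tags[t])
        x.append(vocab[w])
        prev = B[t]
    return x, y

print(sample_hmm(5))
```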

Y0 Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

Hidden Markov Model. Joint Distribution:
p(x, y) = ∏_{k=1}^K p(x_k | y_k) p(y_k | y_{k-1}) = ∏_{k=1}^K A_{y_k, x_k} B_{y_{k-1}, y_k}

Y0 Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5 10 From static to dynamic mixture models

Static mixture Dynamic mixture

(Figure: static mixture - Y1 → X_A1, repeated over N samples; dynamic mixture - Y1 → Y2 → ... → YT, each Yt emitting X_At. © Eric Xing @ CMU, 2006-2011)

HMMs: History
• Markov chains: Andrey Markov (1906) – random walks and Brownian motion
• Used in Shannon's work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's – used mainly for speech in the 60s-70s
• Late 80's and 90's: David Haussler (a major player in learning theory in the 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag / Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from the Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …

12 Slide from William Cohen Higher-order HMMs • 1st-order HMM (i.e. bigram HMM)

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5 • 2nd-order HMM (i.e. trigram HMM)

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5 • 3rd-order HMM

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

13 SUPERVISED LEARNING FOR BAYES NETS

14 Machine Learning

Machine Learning (the big picture):
• The data inspires the structures we want to predict (Domain Knowledge).
• Our model defines a score for each structure; it also tells us what to optimize (Mathematical Modeling).
• Inference finds {best structure, marginals, partition function} for a new observation (Combinatorial Optimization).
• Learning tunes the parameters of the model (Optimization).
(Inference is usually called as a subroutine in learning.)

(Figure: the same pipeline of Data, Model, Objective, Inference, and Learning, illustrated on two example structures: a parse of "Alice saw Bob on a hill with a telescope" and a tag sequence X1 ... X5 for "time flies like an arrow".)

Recall… Learning Fully Observed BNs

p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1)

How do we learn these conditional and marginal distributions for a Bayes Net?

Recall… Learning Fully Observed BNs: Learning this fully observed BN is equivalent to learning five (small / simple) independent networks from the same data, one for each factor of

p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1)

(Figure: the full network shown next to the five small networks, one per conditional/marginal.)

Learning Fully Observed BNs

How do we learn these conditional and marginal distributions for a Bayes Net?

θ* = argmax_θ log p(X1, X2, X3, X4, X5)
   = argmax_θ [ log p(X5 | X3, θ5) + log p(X4 | X2, X3, θ4) + log p(X3 | θ3) + log p(X2 | X1, θ2) + log p(X1 | θ1) ]

θ1* = argmax_{θ1} log p(X1 | θ1)
θ2* = argmax_{θ2} log p(X2 | X1, θ2)
θ3* = argmax_{θ3} log p(X3 | θ3)
θ4* = argmax_{θ4} log p(X4 | X2, X3, θ4)
θ5* = argmax_{θ5} log p(X5 | X3, θ5)

SUPERVISED LEARNING FOR HMMS

Hidden Markov Model. HMM Parameters:
Emission matrix A, where P(X_k = w | Y_k = t) = A_{t,w} for all k
Transition matrix B, where P(Y_k = t | Y_{k-1} = s) = B_{s,t} for all k
Assumption: y_0 = START

Generative Story:
Y_k ~ Multinomial(B_{Y_{k-1}})   for each k
X_k ~ Multinomial(A_{Y_k})       for each k

Joint Distribution:
p(x, y) = ∏_{k=1}^K p(x_k | y_k) p(y_k | y_{k-1}) = ∏_{k=1}^K A_{y_k, x_k} B_{y_{k-1}, y_k}

Whiteboard
• MLEs for HMM
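A minimal sketch of the whiteboard result: supervised MLE for an HMM is just counting and normalizing. The helper names, toy corpus, and optional add-k smoothing are illustrative choices, not the lecture's exact derivation:

```python
from collections import Counter, defaultdict

def hmm_mle(corpus, smoothing=0.0):
    """Supervised MLE for an HMM by counting (a sketch, not the lecture's code).

    corpus: list of (words, tags) pairs, e.g. [(["time", "flies"], ["n", "v"]), ...].
    Returns emission probs A[t][w] ~ count(t emits w) / count(t)
    and transition probs B[s][t] ~ count(s -> t) / count(s -> *), with optional add-k smoothing.
    """
    trans = defaultdict(Counter)   # trans[s][t]
    emit = defaultdict(Counter)    # emit[t][w]
    tags, words = set(), set()
    for xs, ys in corpus:
        prev = "START"
        for w, t in zip(xs, ys):
            trans[prev][t] += 1
            emit[t][w] += 1
            tags.add(t)
            words.add(w)
            prev = t

    def normalize(counts, support):
        total = sum(counts.values()) + smoothing * len(support)
        return {v: (counts[v] + smoothing) / total for v in support}

    B = {s: normalize(trans[s], tags) for s in list(tags) + ["START"]}
    A = {t: normalize(emit[t], words) for t in tags}
    return A, B

corpus = [("time flies like an arrow".split(), ["n", "v", "p", "d", "n"]),
          ("time flies like an arrow".split(), ["n", "n", "v", "d", "n"])]
A, B = hmm_mle(corpus, smoothing=1.0)
print(B["START"]["n"], A["n"]["time"])
```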

26 Representation of both directed and undirected graphical models FACTOR GRAPHS

27 Sampling from a Joint Distribution

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

Sample 1: n v p d n

Sample 2: n n v d n

Sample 3: n v p d n

Sample 4: v n p d n

Sample 5: v n v d n

Sample 6: n v p d n

(Factor graph: X0 –ψ0– X1 –ψ2– X2 –ψ4– X3 –ψ6– X4 –ψ8– X5, with unary factors ψ1, ψ3, ψ5, ψ7, ψ9 attached to X1 ... X5 for the words "time flies like an arrow".)

Sampling from a Joint Distribution

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

(Figure: four samples, Sample 1 … Sample 4, drawn from a joint distribution over a grid-structured factor graph with variables X1–X7 and factors ψ1–ψ12.)

Sampling from a Joint Distribution

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

n v p d n Sample 1: time flies like an arrow

n n v d n Sample 2: time flies like an arrow

n v p n n Sample 3: flies fly with their wings

p n n v v Sample 4: with time you will see

X0 ψ0 X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9

W1 W2 W3 W4 W5 30 Factors have local opinions (≥ 0)

Each black box looks at some of the tags Xi and words Wi

Note: We chose to reuse the same factors at different positions in the sentence.

Transition factor ψ(X_{k-1}, X_k) (rows = previous tag; columns = v, n, p, d):
  v: 1    6  3  4
  n: 8    4  2  0.1
  p: 1    3  1  3
  d: 0.1  8  0  0

X0 ψ0 X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9 attach the tags to the words W1 W2 W3 W4 W5.

Emission factor ψ(X_k, W_k) (rows = tag; columns = time, flies, like, ...):
  v: 3    5    3
  n: 4    5    2
  p: 0.1  0.1  3
  d: 0.1  0.2  0.1

Factors have local opinions (≥ 0)

Each black box looks at some of the tags Xi and words Wi

p(n, v, p, d, n, time, flies, like, an, arrow) = ?

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9 … … like like like flies flies time time flies time like an arrow v 3 5 3 v 3 5 3 n 4 5 2 n 4 5 2 p 0.1 0.1 3 p 0.1 0.1 3 d 0.1 0.2 0.1 d 0.1 0.2 0.1 32 Global probability = product of local opinions

Each black box looks at some of the tags Xi and words Wi

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Uh-oh! The probabilities of the various assignments sum up to Z > 1. So divide them all by Z.

Transition factor (rows = previous tag; columns = v, n, p, d):
  v: 1    6  3  4
  n: 8    4  2  0.1
  p: 1    3  1  3
  d: 0.1  8  0  0

ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9 … … like like like flies flies time time flies time like an arrow v 3 5 3 v 3 5 3 n 4 5 2 n 4 5 2 p 0.1 0.1 3 p 0.1 0.1 3 d 0.1 0.2 0.1 d 0.1 0.2 0.1 33 (MRF)

Joint distribution over tags Xi and words Wi The individual factors aren’t necessarily probabilities.

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9 with emission values over (time, flies, like, ...):
  v: .2  .5  .2
  n: .3  .4  .2
  p: .1  .1  .3
  d: .1  .2  .1

Bayesian Networks: But sometimes we choose to make the factors probabilities. Constrain each row of a factor to sum to one. Now Z = 1.
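To make the role of Z concrete, here is a brute-force sketch: multiply the local factor scores for every tag assignment, sum them to get Z, and divide. The tiny tag set and factor values are illustrative, not the slide's tables:

```python
import itertools

# Hypothetical tiny model: 3 positions, tag set {v, n}, words held fixed.
tags = ["v", "n"]
psi_tr = {"v": {"v": 1, "n": 6}, "n": {"v": 8, "n": 4}}   # transition factor scores
psi_em = {"v": [3, 5, 3], "n": [4, 5, 2]}                 # emission factor score at each position

def score(y):
    """Unnormalized product of local factor scores for tag sequence y."""
    s = psi_em[y[0]][0]
    for k in range(1, len(y)):
        s *= psi_tr[y[k - 1]][y[k]] * psi_em[y[k]][k]
    return s

assignments = list(itertools.product(tags, repeat=3))
Z = sum(score(y) for y in assignments)            # partition function; > 1 here because the
probs = {y: score(y) / Z for y in assignments}    # factors are arbitrary scores, not probabilities
print(Z, max(probs, key=probs.get))
```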

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)

v n p d v n p d v .1 .4 .2 .3 v .1 .4 .2 .3 n .8 .1 .1 0 n .8 .1 .1 0 p .2 .3 .2 .3 p .2 .3 .2 .3 d .2 .8 0 0 d .2 .8 0 0

n v p d n … … like like like flies flies time time flies time like an arrow v .2 .5 .2 v .2 .5 .2 n .3 .4 .2 n .3 .4 .2 p .1 .1 .3 p .1 .1 .3 d .1 .2 .1 d .1 .2 .1 35 Markov Random Field (MRF)

Joint distribution over tags Xi and words Wi

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9 … … like like like flies flies time time flies time like an arrow v 3 5 3 v 3 5 3 n 4 5 2 n 4 5 2 p 0.1 0.1 3 p 0.1 0.1 3 d 0.1 0.2 0.1 d 0.1 0.2 0.1 36 (CRF)

Conditional distribution over tags Xi given words wi. The factors and Z are now specific to the sentence w.

p(n, v, p, d, n | time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n

ψ ψ ψ ψ ψ v 3 1 3 v 5 5 7 9 n 4 n 5 p 0.1 p 0.1 d 0.1 d 0.2

time flies like an arrow 37 How General Are Factor Graphs? • Factor graphs can be used to describe – Markov Random Fields (undirected graphical models) • i.e., log-linear models over a tuple of variables – Conditional Random Fields – Bayesian Networks (directed graphical models)

• Inference treats all of these interchangeably. – Convert your model to a factor graph first. – Pearl (1988) gave key strategies for exact inference: • Belief propagation, for inference on acyclic graphs • Junction tree algorithm, for making any graph acyclic (by merging variables and factors: blows up the runtime) Factor Graph Notation

• Variables: X1, ..., X9
• Factors: unary factors ψ{1}, ψ{2}, ψ{3}, ..., pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4}, ..., and higher-order factors ψ{1,8,9}, ψ{2,7,8}, ψ{3,6,7}
• Joint Distribution: p(x) = (1/Z) ∏_α ψ_α(x_α)

(Figure: the factor graph for "time flies like an arrow" with chain variables X1–X5, extra variables X6–X9, and the factors above.)

Factors are Tensors

A unary factor is a vector (e.g. over tags: v 3, n 4, p 0.1, d 0.1). A binary factor is a matrix (e.g. the 4×4 table over tag pairs v, n, p, d). A higher-arity factor such as ψ{1,8,9} is a multi-dimensional array over constituent labels (s, vp, pp, ...), e.g. a slice
  s:   0   2   .3
  vp:  3   4   2
  pp:  .1  2   1

(Figure: the factor graph for "time flies like an arrow", annotated with the tensor attached to each factor.)

Converting to Factor Graphs: Each conditional and marginal distribution in a directed GM becomes a factor. Each clique in an undirected GM becomes a factor.

(Figure: example directed and undirected graphical models over variables X1, ..., each shown next to its factor-graph conversion.)

Equivalence of directed and undirected trees
• Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it

• A directed tree and the corresponding undirected tree make the same assertions • Parameterizations are essentially the same.

– Undirected tree:

– Directed tree:

– Equivalence:

© Eric Xing @ CMU, 2005-2015 42 Factor Graph Examples • Example 1

X X X X 1 5 fa 1 fd 5

X3 fc X3

fe X2 X4 fb X2 X4

P(X1) P(X2) P(X3|X1,X2) P(X5|X1,X3) P(X4|X2,X3)

fa(X1) fb(X2) fc(X3,X1,X2) fd(X5,X1,X3) fe(X4,X2,X3)

© Eric Xing @ CMU, 2005-2015 43 Factor Graph Examples • Example 2

X1 X1

fa fc

X2 X3 X2 X3 f ψ(x1,x2,x3) = fa(x1,x2)fb(x2,x3)fc(x3,x1) b • Example 3 X1 X1

fa

X2 X3 X2 X3

ψ(x1,x2,x3) = fa(x1,x2,x3) © Eric Xing @ CMU, 2005-2015 44 Tree-like Undirected GMs to Factor Trees

(Figure: a tree-like undirected GM over X1–X6 and the corresponding factor tree. © Eric Xing @ CMU, 2005-2015)

Poly-trees to Factor Trees

(Figure: a poly-tree over X1–X5 and the corresponding factor tree. © Eric Xing @ CMU, 2005-2015)

Why factor graphs?

• Because converting to a factor graph (FG) turns tree-like graphs into factor trees
• Trees are a data structure that guarantees correctness of BP!

(Figure: the tree-like undirected GM and the poly-tree from the previous slides, alongside their factor-tree conversions. © Eric Xing @ CMU, 2005-2015)

THE FORWARD-BACKWARD ALGORITHM

48 Learning and Inference Summary For discrete variables:

(Columns: Learning, Marginal Inference, MAP Inference)
• HMM: Marginal Inference = forward-backward; MAP Inference = Viterbi
• Linear-chain CRF: Marginal Inference = forward-backward; MAP Inference = Viterbi

49 Forward-Backward Algorithm

Y1 Y2 Y3

find preferred tags

Could be verb or noun Could be adjective or verb Could be noun or verb

50 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • Show the possible values for each variable

51 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • Let’s show the possible values for each variable • One possible assignment 52 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • Let’s show the possible values for each variable • One possible assignment • And what the 7 factors think of it … 53 Viterbi Algorithm: Most Probable Assignment

Y1 Y2 Y3

v v v

ψ{1}(v)

ψ{3,4}(a,END) START n n n END

ψ{3}(n)

a a a

ψ{2}(a)

find preferred tags • So p(v a n) = (1/Z) * product of 7 numbers • Numbers associated with edges and nodes of path • Most probable assignment = path with highest product 54 Viterbi Algorithm: Most Probable Assignment

Y1 Y2 Y3

v v v

ψ{1}(v)

ψ{3,4}(a,END) START n n n END

ψ{3}(n)

a a a

ψ{2}(a)

find preferred tags • So p(v a n) = (1/Z) * product weight of one path

55 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a 56 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n 57 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v 58 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n 59 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags

α2(n) = total weight of these path prefixes

60 (found by dynamic programming: matrix-vector products) Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags

β2(n) = total weight of these path suffixes

61 (found by dynamic programming: matrix-vector products) Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags

α2(n) = total weight of these path prefixes (a + b + c)
β2(n) = total weight of these path suffixes (x + y + z)

Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

Oops! The weight of a path through a state also includes a weight at that state, so α2(n)·β2(n) isn't enough. The extra weight is the opinion of the unigram factor ψ{2}(n) at this variable.

"belief that Y2 = n" = total weight of all paths through n = α2(n) ⋅ ψ{2}(n) ⋅ β2(n)

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v “belief that v Y2 = v”

START n n “belief that n Y2 = n” END

α2(v) β2(v)

a a a

ψ{2}(v)

find preferred tags total weight of all paths through v

= α2(v) ⋅ ψ{2}(v)⋅ β2(v) 64 Forward-Backward Algorithm: Finds Marginals

Beliefs at Y2: v 1.8, n 0, a 4.2. Their sum is Z (the total weight of all paths); divide by Z = 6 to get the marginal probabilities: v 0.3, n 0, a 0.7.

"belief that Y2 = a" = total weight of all paths through a = α2(a) ⋅ ψ{2}(a) ⋅ β2(a)

CRF Tagging Model

Y1 Y2 Y3

find preferred tags

Could be verb or noun Could be adjective or verb Could be noun or verb

66 Whiteboard • Forward-backward algorithm • Viterbi algorithm
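A minimal sketch of the two whiteboard dynamic programs on a trellis like the one above, assuming dense emission and transition score matrices (the toy scores are illustrative; real implementations usually work in log space to avoid underflow):

```python
import numpy as np

def forward_backward(psi_em, psi_tr):
    """psi_em: (K, T) local scores; psi_tr: (T, T) scores psi_tr[s, t] for tag s -> tag t.
    Returns per-position marginals p(Y_k = t | x) and the partition function Z."""
    K, T = psi_em.shape
    alpha = np.zeros((K, T))
    beta = np.zeros((K, T))
    alpha[0] = psi_em[0]
    for k in range(1, K):                        # forward pass: matrix-vector products
        alpha[k] = (alpha[k - 1] @ psi_tr) * psi_em[k]
    beta[K - 1] = 1.0
    for k in range(K - 2, -1, -1):               # backward pass
        beta[k] = psi_tr @ (psi_em[k + 1] * beta[k + 1])
    Z = alpha[K - 1].sum()
    marginals = alpha * beta / Z                 # the local factor is already inside alpha here
    return marginals, Z

def viterbi(psi_em, psi_tr):
    """Most probable tag sequence under the same scores (max-product instead of sum-product)."""
    K, T = psi_em.shape
    delta = np.zeros((K, T))
    back = np.zeros((K, T), dtype=int)
    delta[0] = psi_em[0]
    for k in range(1, K):
        scores = delta[k - 1][:, None] * psi_tr  # scores[s, t] = delta[k-1, s] * psi_tr[s, t]
        back[k] = scores.argmax(axis=0)
        delta[k] = scores.max(axis=0) * psi_em[k]
    path = [int(delta[K - 1].argmax())]
    for k in range(K - 1, 0, -1):                # follow backpointers
        path.append(int(back[k, path[-1]]))
    return path[::-1]

psi_em = np.array([[3.0, 4.0], [5.0, 5.0], [3.0, 2.0]])   # toy scores: 3 words x 2 tags
psi_tr = np.array([[1.0, 6.0], [8.0, 4.0]])
print(forward_backward(psi_em, psi_tr)[1], viterbi(psi_em, psi_tr))
```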

67 Conditional Random Fields (CRFs) for time series data LINEAR-CHAIN CRFS

68 Shortcomings of Hidden Markov Models

START Y1 Y2 … … … Yn

X1 X2 … … … Xn

• HMM models capture dependencies between each state and only its corresponding observation – NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc. • Mismatch between learning objective function and prediction objective function – HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X) © Eric Xing @ CMU, 2005-2015 69 Conditional Random Field (CRF)

Conditional distribution over tags Xi given words wi. The factors and Z are now specific to the sentence w.

p(n, v, p, d, n | time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 n ψ2 v ψ4 p ψ6 d ψ8 n

ψ ψ ψ ψ ψ v 3 1 3 v 5 5 7 9 n 4 n 5 p 0.1 p 0.1 d 0.1 d 0.2

time flies like an arrow 70 Conditional Random Field (CRF)

Recall: Shaded nodes in a graphical model are observed

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ ψ ψ ψ ψ v 3 1 3 v 5 5 7 9 n 4 n 5 p 0.1 p 0.1 d 0.1 d 0.2

X1 X2 X3 X4 X5 71 Conditional Random Field (CRF) This linear-chain CRF is just like an HMM, except that its factors are not necessarily probability distributions

p(y | x) = (1/Z(x)) ∏_{k=1}^K ψ_em(y_k, x_k) ψ_tr(y_k, y_{k-1})
         = (1/Z(x)) ∏_{k=1}^K exp(θ · f_em(y_k, x_k)) exp(θ · f_tr(y_k, y_{k-1}))

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

X1 X2 X3 X4 X5

Quiz:

p(y | x) = (1/Z(x)) ∏_{k=1}^K ψ_em(y_k, x_k) ψ_tr(y_k, y_{k-1})
         = (1/Z(x)) ∏_{k=1}^K exp(θ · f_em(y_k, x_k)) exp(θ · f_tr(y_k, y_{k-1}))

Multiple Choice: Which model does the above distribution share the most in common with?
A. Hidden Markov Model
B. Bernoulli Naïve Bayes
C. Gaussian Naïve Bayes
D. Logistic Regression

73 Conditional Random Field (CRF) This linear-chain CRF is just like an HMM, except that its factors are not necessarily probability distributions

p(y | x) = (1/Z(x)) ∏_{k=1}^K ψ_em(y_k, x_k) ψ_tr(y_k, y_{k-1})
         = (1/Z(x)) ∏_{k=1}^K exp(θ · f_em(y_k, x_k)) exp(θ · f_tr(y_k, y_{k-1}))

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

X1 X2 X3 X4 X5 74 Conditional Random Field (CRF)

• That is the vector X • Because it’s observed, we can condition on it for free • Conditioning is how we converted from the MRF to the CRF (i.e. when taking a slice of the emission factors)

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ ψ ψ ψ ψ v 3 1 3 v 5 5 7 9 n 4 n 5 p 0.1 p 0.1 d 0.1 d 0.2

X 75 Conditional Random Field (CRF)

• This is the standard linear-chain CRF definition • It permits rich, overlapping features of the vector X

p(y | x) = (1/Z(x)) ∏_{k=1}^K ψ_em(y_k, x) ψ_tr(y_k, y_{k-1}, x)
         = (1/Z(x)) ∏_{k=1}^K exp(θ · f_em(y_k, x)) exp(θ · f_tr(y_k, y_{k-1}, x))

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

X 76 Conditional Random Field (CRF)

• This is the standard linear-chain CRF definition • It permits rich, overlapping features of the vector X

p(y | x) = (1/Z(x)) ∏_{k=1}^K ψ_em(y_k, x) ψ_tr(y_k, y_{k-1}, x)
         = (1/Z(x)) ∏_{k=1}^K exp(θ · f_em(y_k, x)) exp(θ · f_tr(y_k, y_{k-1}, x))

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

Visual Notation: Usually we draw a CRF without showing the variable corresponding to X 77 Whiteboard • Forward-backward algorithm for linear-chain CRF
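Not the whiteboard derivation itself, but a minimal log-space sketch of the quantities it needs: the unnormalized score of one tagging and log Z(x) from the forward recursion, assuming the log-potentials θ·f have already been computed for the sentence (the toy values below are random placeholders):

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(y, log_em, log_tr):
    """log p(y | x) for a linear-chain CRF.

    log_em: (K, T) log emission potentials, theta . f_em(y_k, x), for this sentence x.
    log_tr: (T, T) log transition potentials, theta . f_tr(y_{k-1}, y_k, x).
    Both can depend on the whole sentence x, which is what makes this a CRF rather than an HMM.
    """
    K, T = log_em.shape
    # Unnormalized log score of the given tag sequence.
    score = log_em[0, y[0]] + sum(log_tr[y[k - 1], y[k]] + log_em[k, y[k]] for k in range(1, K))
    # log Z(x) via the forward recursion in log space (sum-product with logsumexp).
    log_alpha = log_em[0]
    for k in range(1, K):
        log_alpha = logsumexp(log_alpha[:, None] + log_tr, axis=0) + log_em[k]
    log_Z = logsumexp(log_alpha)
    return score - log_Z

rng = np.random.default_rng(0)
log_em, log_tr = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))   # toy theta . f values
print(crf_log_prob([0, 2, 1, 1, 0], log_em, log_tr))
```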

78 General CRF

General CRF: The topology of the graphical model for a CRF doesn't have to be a chain.

p(y | x) = (1/Z(x)) ∏_α ψ_α(y_α, x; θ)

(Figure: a factor graph over Y1 ... Y9 for "time flies like an arrow", with unary factors ψ{1}, ψ{2}, ψ{3}, ..., chain factors ψ{1,2}, ψ{2,3}, ψ{3,4}, and higher-order factors ψ{1,8,9}, ψ{2,7,8}, ψ{3,6,7}.)

Standard CRF Parameterization

p_θ(y | x) = (1/Z(x)) ∏_α ψ_α(y_α, x; θ)

Define each potential function in terms of a fixed set of feature functions:

ψ_α(y_α, x; θ) = exp(θ · f_α(y_α, x))

(y_α are the predicted variables; x are the observed variables.)

Standard CRF Parameterization: Define each potential function in terms of a fixed set of feature functions:

ψ_α(y_α, x; θ) = exp(θ · f_α(y_α, x))

(Figure: the linear-chain factor graph with tags n v p d n and factors ψ1 ... ψ9 over "time flies like an arrow".)

Standard CRF Parameterization: Define each potential function in terms of a fixed set of feature functions:

ψ_α(y_α, x; θ) = exp(θ · f_α(y_α, x))

(Figure: the same linear-chain factor graph augmented with higher-order factors ψ10–ψ13 for constituent labels np, pp, vp, s.)

Exact inference for tree-structured factor graphs: BELIEF PROPAGATION

83 Inference for HMMs • Sum-product BP on an HMM is called the forward-backward algorithm • Max-product BP on an HMM is called the Viterbi algorithm

84 Inference for CRFs • Sum-product BP on a CRF is called the forward-backward algorithm • Max-product BP on a CRF is called the Viterbi algorithm

85 CRF Tagging by Belief Propagation

Forward algorithm = message passing with α messages (matrix-vector products).
Backward algorithm = message passing with β messages (matrix-vector products).
At each variable, multiplying the incoming α message, the local factor, and the incoming β message gives its belief (e.g. v 1.8, n 0, a 4.2).

find preferred tags • Forward-backward is a message passing algorithm. • It’s the simplest case of belief propagation. 86 SUPERVISED LEARNING FOR CRFS

87 What is Training? That’s easy:

Training = picking good model parameters!

But how do we know if the model parameters are any “good”?

88 Machine Learning Log-likelihood Training

1. Choose model (such that the derivative in #3 is easy) 2. Choose objective: Assign high probability to the things we observe and low probability to everything else

3. Compute derivative by hand using the chain rule 4. Replace exact inference by approximate inference

89 Machine Learning Log-likelihood Training

1. Choose model Such that derivative in #3 is easy 2. Choose objective: Assign high probability to the things we observe and low probability to everything else

3. Compute derivative by hand using the chain rule 4. Compute the marginals by exact inference. Note that these are factor marginals, which are just the (normalized) factor beliefs from BP!

90 Recipe for Gradient-based Learning 1. Write down the objective function 2. Compute the partial derivatives of the objective (i.e. gradient, and maybe Hessian) 3. Feed objective function and derivatives into black box

Optimization

4. Retrieve optimal parameters from black box

91 Optimization Algorithms

What is the black box? Optimization • Newton’s method • Hessian-free / Quasi-Newton methods – Conjugate gradient – L-BFGS • Stochastic gradient methods – Stochastic gradient descent (SGD) – Stochastic meta-descent – AdaGrad

2.1.5 Newton-Raphson (Newton's method, a second-order method)
• From our introductory example, we know that we can find the solution to a quadratic function analytically. Yet gradient descent may take many steps to converge to that optimum. The motivation behind Newton's method is to use a quadratic approximation of our function to make a good guess where we should step next.
• Definition: the Hessian of an n-dimensional function is the matrix of partial second derivatives with respect to each pair of dimensions:

  ∇²f(x) = [ ∂²f(x)/∂x_1²       ∂²f(x)/∂x_1∂x_2   ...
             ∂²f(x)/∂x_2∂x_1    ∂²f(x)/∂x_2²      ...
             ...                ...               ... ]

• Consider the second-order Taylor series expansion of f at x:

  ĝ(v) = f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v

• We want to find the v that maximizes ĝ(v). This maximizer is called Newton's step: Δx_nt = argmax_v ĝ(v).
• Algorithm:
  1. Choose a starting point x.
  2. While not converged:
     Compute Newton's step Δx_nt = -(∇²f(x))^{-1} ∇f(x)
     Update x^(k+1) = x^(k) + Δx_nt
• Intuition:
  – If f(x) is quadratic, x + Δx_nt exactly maximizes f.
  – ĝ(v) is a good quadratic approximation to the function f near the point x. So if f(x) is locally quadratic, then f(x) is locally well approximated by ĝ(v). See Figure 9.17 in Boyd and Vandenberghe.
• In most presentations, Newton-Raphson would be presented as a minimization algorithm, for which we would negate the definition of Newton's step from above.

2.1.6 Quasi-Newton methods (L-BFGS)
• What if we have n = millions of features?
• The Hessian matrix H = ∇²f(x) is too large: n² entries.
• Quasi-Newton methods approximate the Hessian.
• Limited-memory BFGS stores only a history of the last k updates to x and ∇f(x), where k is usually small (e.g. k = 10). This history is used to approximate the Hessian-vector product.
• Optimization has nearly become a technology. Almost every language has many generic optimization routines built in that you can use out of the box.

2.1.7 Stochastic Gradient Descent
• Suppose we have N training examples s.t. f(x) = Σ_{i=1}^N f_i(x).
• This implies that ∇f(x) = Σ_{i=1}^N ∇f_i(x).
• Gradient Descent:

  x^(k+1) = x^(k) + t ∇f(x^(k)) = x^(k) + t Σ_{i=1}^N ∇f_i(x^(k))

• SGD Algorithm:
  1. Choose a starting point x.
  2. While not converged:
     Choose a step size t.
     Choose i so that it sweeps through the training set.
     Update x^(k+1) = x^(k) + t ∇f_i(x^(k))
• For CRF training, Stochastic Meta-Descent is even better (Vishwanathan, 2006).
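A minimal sketch of the SGD loop above, using the handout's ascent-style update x ← x + t ∇f_i(x) on a toy least-squares objective (the step-size schedule and data are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def sgd(grad_fi, x0, n_examples, step=0.01, epochs=20, seed=0):
    """SGD with the handout's update x <- x + t * grad f_i(x) (ascent on f = sum_i f_i).

    grad_fi(x, i) returns the gradient of the i-th training example's term.
    Each epoch sweeps through the training set in shuffled order; the decaying
    step size is one simple, illustrative choice.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        for i in rng.permutation(n_examples):
            x = x + step * grad_fi(x, i)
        step *= 0.9
    return x

# Toy problem: maximize f(x) = sum_i -(a_i . x - b_i)^2, i.e. least squares.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)
grad_fi = lambda x, i: -2.0 * (A[i] @ x - b[i]) * A[i]
print(sgd(grad_fi, np.zeros(3), len(b)))    # should approach x_true
```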

2.2 Additional readings
• PGM Appendix A.5: Continuous Optimization
• Boyd & Vandenberghe, "Convex Optimization", http://www.stanford.edu/~boyd/cvxbook/
  Chapter 9: Unconstrained Minimization
  – 9.3 Gradient Descent
  – 9.4 Steepest Descent
  – 9.5 Newton's Method
• Intuitive explanation of Lagrange Multipliers (without the assumption of differentiability): http://www.umiacs.umd.edu/~resnik/ling848_fa2004/lagrange.html

2.3 Advanced readings
• "Overview of Quasi-Newton optimization methods" http://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf
• Shewchuk (1994) "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
• Conjugate Gradient Method: http://www.cs.iastate.edu/~cs577/handouts/conjugate-gradient.pdf

3 Third Review Session

3.1 Continuous Optimization (constrained)
3.1.1 Running example: MLE of a Multinomial
• Recall the pmf of the Categorical distribution:
  support: X ∈ {0, ..., k}
  pmf: p(X = k) = θ_k
• Let X_i ~ Categorical(θ) for 1 ≤ i ≤ N.
• The likelihood of all of these is ∏_{i=1}^N θ_{X_i} = ∏_{l=1}^k θ_l^{N_l}, where N_l is the number of X_i = l.
• The log-likelihood is then LL(θ) = Σ_{l=1}^k N_l log(θ_l).
• Suppose we want to find the maximum likelihood parameters: θ_MLE = argmax_θ LL(θ) subject to the constraints Σ_{l=1}^k θ_l = 1 and 0 ≤ θ_l ∀l.

7 Whiteboard • CRF model • CRF data log-likelihood • CRF derivatives
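The whiteboard derivation is not reproduced here, but its well-known conclusion, that the gradient of the CRF log-likelihood is observed feature counts minus expected feature counts under the model, can be sketched as follows, assuming the factor marginals come from forward-backward (the feature functions and toy numbers are hypothetical):

```python
import numpy as np

def crf_gradient(f_gold, em_feat, tr_feat, node_marg, edge_marg):
    """Gradient of the CRF log-likelihood for one sentence (a sketch, not the whiteboard version):

        d log p(y | x) / d theta = f(x, y_gold) - E_{y ~ p(.|x)}[ f(x, y) ]

    f_gold: total feature vector of the gold tagging.
    em_feat(k, t), tr_feat(k, s, t): feature vectors of the emission / transition factors.
    node_marg[k, t], edge_marg[k, s, t]: factor marginals, e.g. computed by forward-backward.
    """
    expected = np.zeros_like(f_gold)
    K, T = node_marg.shape
    for k in range(K):
        for t in range(T):
            expected += node_marg[k, t] * em_feat(k, t)
            if k > 0:
                for s in range(T):
                    expected += edge_marg[k, s, t] * tr_feat(k, s, t)
    return f_gold - expected

# Tiny illustrative usage: 3 positions, 2 tags, 4 made-up indicator features.
D, K, T = 4, 3, 2
em_feat = lambda k, t: np.eye(D)[t]                    # hypothetical emission features
tr_feat = lambda k, s, t: np.eye(D)[2 + int(s == t)]   # hypothetical transition features
node_marg = np.full((K, T), 1.0 / T)                   # placeholder marginals (uniform)
edge_marg = np.full((K, T, T), 1.0 / (T * T))
f_gold = em_feat(0, 0) + em_feat(1, 1) + em_feat(2, 1) + tr_feat(1, 0, 1) + tr_feat(2, 1, 1)
print(crf_gradient(f_gold, em_feat, tr_feat, node_marg, edge_marg))
```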

94 Practical Considerations for Gradient-based Methods • Overfitting – L2 regularization – L1 regularization – Regularization by early stopping • For SGD: Sparse updates

95 “Empirical” Comparison of Parameter Estimation Methods

• Example NLP task: CRF dependency parsing • Suppose: Training time is dominated by inference • Dataset: One million tokens • Inference speed: 1,000 tokens / sec • è 0.27 hours per pass through dataset

          # passes through data to converge    # hours to converge
GIS       1000+                                270
L-BFGS    100+                                 27
SGD       10                                   ~3

96 FEATURE ENGINEERING FOR CRFS

97 Slide adapted from 600.465 - Intro to NLP - J. Eisner Features

General idea: • Make a list of interesting substructures. • The feature f_k(x,y) counts tokens of the k-th substructure in (x,y).

98 Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag P as the tag for “like” Weight of this feature is like log of an emission probability in an HMM Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag P as the tag for “like” • Count of tag P Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N

0 Time flies like an arrow 1 2 3 4 5

• Count of tag P as the tag for “like” • Count of tag P • Count of tag P in the middle third of the sentence Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag P as the tag for “like” • Count of tag P • Count of tag P in the middle third of the sentence • Count of tag bigram V P Weight of this feature is like log of a transition probability in an HMM Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag P as the tag for “like” • Count of tag P • Count of tag P in the middle third of the sentence • Count of tag bigram V P • Count of tag bigram V P followed by “an” Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag P as the tag for “like” • Count of tag P • Count of tag P in the middle third of the sentence • Count of tag bigram V P • Count of tag bigram V P followed by “an” • Count of tag bigram V P where P is the tag for “like”

Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag P as the tag for “like” • Count of tag P • Count of tag P in the middle third of the sentence • Count of tag bigram V P • Count of tag bigram V P followed by “an” • Count of tag bigram V P where P is the tag for “like” • Count of tag bigram V P where both words are lowercase Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag trigram N V P? – A bigram tagger can only consider within-bigram features: only look at 2 adjacent blue tags (plus arbitrary red context). – So here we need a trigram tagger, which is slower. – The forward-backward states would remember two previous tags.

P N V V P

We take this arc once per N V P triple, so its weight is the total weight of the features that fire on that triple. Slide courtesy of 600.465 - Intro to NLP - J. Eisner Features for tagging …

N V P D N Time flies like an arrow

• Count of tag trigram N V P?
  – A bigram tagger can only consider within-bigram features: only look at 2 adjacent blue tags (plus arbitrary red context).
  – So here we need a trigram tagger, which is slower.
• Count of "post-verbal" nouns? ("discontinuous bigram" V N)
  – An n-gram tagger can only look at a narrow window.
  – Here we need a fancier model (a finite state machine) whose states remember whether there was a verb in the left context.

(Figure: a finite-state machine over tags V, P, D, N whose states track whether a verb has been seen, so that the post-verbal P D and D N bigrams can be detected.)

Slide courtesy of 600.465 - Intro to NLP - J. Eisner

How might you come up with the features that you will use to score (x,y)?

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y).

For position i in a tagging, these might include: – Full name of tag i – First letter of tag i (will be “N” for both “NN” and “NNS”) – Full name of tag i-1 (possibly BOS); similarly tag i+1 (possibly EOS) – Full name of word i – Last 2 chars of word i (will be “ed” for most past-tense verbs) – First 4 chars of word i (why would this help?) – “Shape” of word i (lowercase/capitalized/all caps/numeric/…) – Whether word i is part of a known city name listed in a “gazetteer” – Whether word i appears in thesaurus entry e (one attribute per e) – Whether i is in the middle third of the sentence Slide courtesy of 600.465 - Intro to NLP - J. Eisner How might you come up with the features that you will use to score (x,y)?

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y). 2. Now conjoin them into various “feature templates.”

E.g., template 7 might be (tag(i-1), tag(i), suffix2(i+1)).

At each position of (x,y), exactly one of the many template7 features will fire:

N V P D N Time flies like an arrow At i=1, we see an instance of “template7=(BOS,N,-es)” so we add one copy of that feature’s weight to score(x,y) Slide courtesy of 600.465 - Intro to NLP - J. Eisner How might you come up with the features that you will use to score (x,y)?

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y). 2. Now conjoin them into various “feature templates.”

E.g., template 7 might be (tag(i-1), tag(i), suffix2(i+1)).

At each position of (x,y), exactly one of the many template7 features will fire:

N V P D N Time flies like an arrow At i=2, we see an instance of “template7=(N,V,-ke)” so we add one copy of that feature’s weight to score(x,y) Slide courtesy of 600.465 - Intro to NLP - J. Eisner How might you come up with the features that you will use to score (x,y)?

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y). 2. Now conjoin them into various “feature templates.”

E.g., template 7 might be (tag(i-1), tag(i), suffix2(i+1)).

At each position of (x,y), exactly one of the many template7 features will fire:

N V P D N Time flies like an arrow At i=3, we see an instance of "template7=(V,P,-an)" so we add one copy of that feature's weight to score(x,y) Slide courtesy of 600.465 - Intro to NLP - J. Eisner How might you come up with the features that you will use to score (x,y)?

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y). 2. Now conjoin them into various “feature templates.”

E.g., template 7 might be (tag(i-1), tag(i), suffix2(i+1)).

At each position of (x,y), exactly one of the many template7 features will fire:

N V P D N Time flies like an arrow At i=4, we see an instance of “template7=(P,D,-ow)” so we add one copy of that feature’s weight to score(x,y) Slide courtesy of 600.465 - Intro to NLP - J. Eisner How might you come up with the features that you will use to score (x,y)?

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y). 2. Now conjoin them into various “feature templates.”

E.g., template 7 might be (tag(i-1), tag(i), suffix2(i+1)).

At each position of (x,y), exactly one of the many template7 features will fire:

N V P D N Time flies like an arrow At i=5, we see an instance of “template7=(D,N,-)” so we add one copy of that feature’s weight to score(x,y) Slide courtesy of 600.465 - Intro to NLP - J. Eisner How might you come up with the features that you will use to score (x,y)?

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y). 2. Now conjoin them into various “feature templates.”

E.g., template 7 might be (tag(i-1), tag(i), suffix2(i+1)). This template gives rise to many features, e.g.: score(x,y) = … + θ[“template7=(P,D,-ow)”] * count(“template7=(P,D,- ow)”) + θ[“template7=(D,D,-xx)”] * count(“template7=(D,D,- xx)”) + … With a handful of feature templates and a large vocabulary, you can easily end up with millions of features. Slide courtesy of 600.465 - Intro to NLP - J. Eisner How might you come up with the features that you will use to score (x,y)?
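A minimal sketch of how one such template explodes into indicator features; the function mirrors the slide's template 7 = (tag(i-1), tag(i), suffix2(i+1)), while the 0-based indexing and helper names are illustrative:

```python
from collections import Counter

def template7(tags, words, i):
    """Template 7 = (tag(i-1), tag(i), suffix2(i+1)), with 0-based positions."""
    prev_tag = tags[i - 1] if i > 0 else "BOS"
    suffix2 = words[i + 1][-2:] if i + 1 < len(words) else ""   # empty suffix at end of sentence
    return f"template7=({prev_tag},{tags[i]},-{suffix2})"

def extract_features(tags, words, templates):
    """Count how often each instantiated feature fires in (x, y)."""
    counts = Counter()
    for i in range(len(words)):
        for template in templates:
            counts[template(tags, words, i)] += 1
    return counts

words = "Time flies like an arrow".split()
tags = ["N", "V", "P", "D", "N"]
print(extract_features(tags, words, [template7]))
# score(x, y) would then be sum over features of theta[f] * count[f], for these and other templates.
```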

1. Think of some attributes (“basic features”) that you can compute at each position in (x,y). 2. Now conjoin them into various “feature templates.”

E.g., template 7 might be (tag(i-1), tag(i), suffix2(i+1)).

Note: Every template should mention at least some blue. – Given an input x, a feature that only looks at red will contribute

the same weight to score(x,y1) and score(x,y2).

– So it can’t help you choose between outputs y1, y2. HMMS VS CRFS

116 Generative vs. Discriminative

Liang & Jordan (ICML 2008) compares HMM and CRF with identical features.
• Dataset 1: (Real) – WSJ Penn Treebank (38K train, 5.5K test) – 45 part-of-speech tags
• Dataset 2: (Artificial) – Synthetic data generated from an HMM learned on Dataset 1 (1K train, 1K test)
• Evaluation Metric: Accuracy
(Results from the bar chart: on Dataset 1 the CRF wins, roughly 95.6% vs. 93.5% for the HMM; on Dataset 2, which was generated from an HMM, the HMM wins, roughly 89.8% vs. 87.9% for the CRF.)

117 CRFs: some empirical results

• Parts of Speech tagging

– Using same set of features: HMM >=< CRF > MEMM – Using additional overlapping features: CRF+ > MEMM+ >> HMM

© Eric Xing @ CMU, 2005-2015 118 MBR DECODING

Minimum Bayes Risk Decoding
• Suppose we are given a loss function ℓ(ŷ, y) and are asked for a single tagging
• How should we choose just one from our p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution:

h_θ(x) = argmin_ŷ E_{y ~ p_θ(·|x)} [ℓ(ŷ, y)] = argmin_ŷ Σ_y p_θ(y | x) ℓ(ŷ, y)

Minimum Bayes Risk Decoding

h_θ(x) = argmin_ŷ E_{y ~ p_θ(·|x)} [ℓ(ŷ, y)] = argmin_ŷ Σ_y p_θ(y | x) ℓ(ŷ, y)

Consider some example loss functions:
The 0-1 loss function returns 1 only if the two assignments are identical and 0 otherwise: ℓ(ŷ, y) = 1 − I(ŷ, y)
The MBR decoder is:

h_θ(x) = argmin_ŷ Σ_y p_θ(y | x) (1 − I(ŷ, y)) = argmax_ŷ p_θ(ŷ | x)

which is exactly the MAP inference problem!

121 Minimum Bayes Risk Decoding

h_θ(x) = argmin_ŷ E_{y ~ p_θ(·|x)} [ℓ(ŷ, y)] = argmin_ŷ Σ_y p_θ(y | x) ℓ(ŷ, y)

Consider some example loss functions:
The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments: ℓ(ŷ, y) = Σ_{i=1}^V (1 − I(ŷ_i, y_i))
The MBR decoder is:

ŷ_i = h_θ(x)_i = argmax_{ŷ_i} p_θ(ŷ_i | x)

This decomposes across variables and requires the variable marginals.
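A minimal sketch of the Hamming-loss MBR decoder: given the variable marginals (e.g. from forward-backward), pick the most probable tag independently at each position (the toy marginals below are illustrative):

```python
import numpy as np

def mbr_decode_hamming(marginals, tag_names):
    """MBR decoding under Hamming loss: independently pick argmax_t p(Y_i = t | x) at each i.

    marginals: (K, T) array of variable marginals p(Y_i = t | x), e.g. from forward-backward.
    Under 0-1 loss the MBR decoder would instead be the single most probable full sequence (Viterbi/MAP).
    """
    return [tag_names[t] for t in marginals.argmax(axis=1)]

# Toy marginals for 3 positions over tags (v, n, a); each row sums to one.
marginals = np.array([[0.6, 0.3, 0.1],
                      [0.3, 0.0, 0.7],
                      [0.2, 0.7, 0.1]])
print(mbr_decode_hamming(marginals, ["v", "n", "a"]))   # -> ['v', 'a', 'n']
```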

122 SUMMARY

123 Summary: Learning and Inference For discrete variables:

• HMM: Learning = MLE by counting; Marginal Inference = forward-backward; MAP Inference = Viterbi
• Linear-chain CRF: Learning = gradient based (doesn't decompose because of Z(x), and requires marginal inference); Marginal Inference = forward-backward; MAP Inference = Viterbi

124 Summary: Models

                  Classification         Structured Prediction
Generative        Naïve Bayes            HMM
Discriminative    Logistic Regression    CRF

125