HMMs and CRFs

School of Computer Science
10-701 Introduction to Machine Learning
HMMs and CRFs
Matt Gormley, Lecture 19, November 14, 2016
Readings: Bishop 13.1-13.2; Bishop 8.3-8.4; Sutton & McCallum (2006); Lafferty et al. (2001)

Reminders
• Homework 4
  – deadline extended to Wed, Nov. 16th
  – 10 extra points for submitting by Mon, Nov. 14th
• Poster Sessions
  – two sessions on Fri, Dec. 2nd
  – session 1: 8 - 11:30 am
  – session 2: 2 - 6 pm

HIDDEN MARKOV MODEL (HMM)

Dataset for Supervised Part-of-Speech (POS) Tagging
Data: $\mathcal{D} = \{ \boldsymbol{x}^{(n)}, \boldsymbol{y}^{(n)} \}_{n=1}^{N}$
• Sample 1: $\boldsymbol{x}^{(1)}$ = "time flies like an arrow", $\boldsymbol{y}^{(1)}$ = n v p d n
• Sample 2: $\boldsymbol{x}^{(2)}$ = "time flies like an arrow", $\boldsymbol{y}^{(2)}$ = n n v d n
• Sample 3: $\boldsymbol{x}^{(3)}$ = "flies fly with their wings", $\boldsymbol{y}^{(3)}$ = n v p n n
• Sample 4: $\boldsymbol{x}^{(4)}$ = "with time you will see", $\boldsymbol{y}^{(4)}$ = p n n v v

Naïve Bayes for Time Series Data
We could treat each word-tag pair (i.e. token) as independent. This corresponds to a Naïve Bayes model with a single feature (the word).
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .1 * .5 * …)
[Figure: one independent tag-word pair per position, with a tag prior table (v .1, n .8, p .2, d .2) and a word emission table (rows = tag, columns = time, flies, like): v: .2 .5 .2; n: .3 .4 .2; p: .1 .1 .3; d: .1 .2 .1]

Hidden Markov Model
A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
[Figure: a tag chain starting from <START>, with a transition table (rows = previous tag, columns = next tag v, n, p, d): v: .1 .4 .2 .3; n: .8 .1 .1 0; p: .2 .3 .2 .3; d: .2 .8 0 0, plus the same word emission table as above]

From NB to HMM
“Naïve Bayes” (tags $Y_1, \ldots, Y_5$ each independently generating words $X_1, \ldots, X_5$):
$P(\boldsymbol{x}, \boldsymbol{y}) = \prod_{k=1}^{K} P(x_k \mid y_k)\, p(y_k)$
HMM (a chain $Y_0 \to Y_1 \to \cdots \to Y_5$, each $Y_k$ generating $X_k$):
$P(\boldsymbol{x}, \boldsymbol{y}) = \prod_{k=1}^{K} P(x_k \mid y_k)\, p(y_k \mid y_{k-1})$

Hidden Markov Model
HMM Parameters:
• Emission matrix, $A$, where $P(X_k = w \mid Y_k = t) = A_{t,w},\ \forall k$
• Transition matrix, $B$, where $P(Y_k = t \mid Y_{k-1} = s) = B_{s,t},\ \forall k$
[Figure: the chain $Y_0 \to Y_1 \to \cdots \to Y_5$ emitting $X_1, \ldots, X_5$, annotated with example transition scores (rows = previous tag, columns = v, n, p, d): v: 1 6 3 4; n: 8 4 2 0.1; p: 1 3 1 3; d: 0.1 8 0 0, and example emission scores (rows = tag, columns = time, flies, like): v: 3 5 3; n: 4 5 2; p: 0.1 0.1 3; d: 0.1 0.2 0.1]

Hidden Markov Model
HMM Parameters: emission matrix $A$ and transition matrix $B$, as above.
Assumption: $y_0 = \text{START}$
Generative Story:
$Y_k \sim \text{Multinomial}(B_{Y_{k-1}})\ \forall k$
$X_k \sim \text{Multinomial}(A_{Y_k})\ \forall k$

Hidden Markov Model
Joint Distribution:
$p(\boldsymbol{x}, \boldsymbol{y}) = \prod_{k=1}^{K} p(x_k \mid y_k)\, p(y_k \mid y_{k-1}) = \prod_{k=1}^{K} A_{y_k, x_k}\, B_{y_{k-1}, y_k}$
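To make the joint distribution concrete, the following is a minimal sketch in Python (not part of the lecture) that evaluates $p(\boldsymbol{x}, \boldsymbol{y}) = \prod_k B_{y_{k-1}, y_k} A_{y_k, x_k}$ for the running example sentence. The transition and emission tables are illustrative stand-ins, normalized made-up numbers rather than the exact values in the slide tables.

```python
# Sketch: evaluate the HMM joint probability
#   p(x, y) = prod_k  B[y_{k-1}][y_k] * A[y_k][x_k]
# The probability tables below are illustrative, not the lecture's numbers.

# Transition probabilities: B[s][t] = P(Y_k = t | Y_{k-1} = s)
B = {
    "<START>": {"n": 0.6, "v": 0.1, "p": 0.1, "d": 0.2},
    "n":       {"n": 0.1, "v": 0.5, "p": 0.2, "d": 0.2},
    "v":       {"n": 0.3, "v": 0.1, "p": 0.3, "d": 0.3},
    "p":       {"n": 0.3, "v": 0.1, "p": 0.1, "d": 0.5},
    "d":       {"n": 0.8, "v": 0.1, "p": 0.05, "d": 0.05},
}

# Emission probabilities: A[t][w] = P(X_k = w | Y_k = t)
# (only the words of the example sentence are listed)
A = {
    "n": {"time": 0.3,  "flies": 0.4,  "like": 0.01, "an": 0.01, "arrow": 0.2},
    "v": {"time": 0.1,  "flies": 0.3,  "like": 0.2,  "an": 0.01, "arrow": 0.01},
    "p": {"time": 0.01, "flies": 0.01, "like": 0.5,  "an": 0.01, "arrow": 0.01},
    "d": {"time": 0.01, "flies": 0.01, "like": 0.01, "an": 0.5,  "arrow": 0.01},
}

def hmm_joint(words, tags):
    """p(x, y) under the HMM, with y_0 = <START>."""
    prob, prev = 1.0, "<START>"
    for w, t in zip(words, tags):
        prob *= B[prev][t] * A[t][w]
        prev = t
    return prob

print(hmm_joint("time flies like an arrow".split(), ["n", "v", "p", "d", "n"]))
```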
From static to dynamic mixture models
[Figure: a static mixture, one latent $Y_1$ generating an observation inside a plate over $N$, versus a dynamic mixture, a chain $Y_1 \to Y_2 \to \cdots \to Y_T$ in which each state generates its own observation. © Eric Xing @ CMU, 2006-2011]

HMMs: History
• Markov chains: Andrey Markov (1906)
  – Random walks and Brownian motion
• Used in Shannon’s work on information theory (1948)
• Baum-Welch learning algorithm: late 60’s, early 70’s.
  – Used mainly for speech in 60s-70s.
• Late 80’s and 90’s: David Haussler (major player in learning theory in 80’s) began to use HMMs for modeling biological sequences
• Mid-late 1990’s: Dayne Freitag/Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
(Slide from William Cohen)

Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM): each $Y_k$ depends only on $Y_{k-1}$ (chain <START> $\to Y_1 \to \cdots \to Y_5$, emitting $X_1, \ldots, X_5$)
• 2nd-order HMM (i.e. trigram HMM): each $Y_k$ depends on $Y_{k-1}$ and $Y_{k-2}$
• 3rd-order HMM: each $Y_k$ depends on $Y_{k-1}$, $Y_{k-2}$, and $Y_{k-3}$

SUPERVISED LEARNING FOR BAYES NETS

Machine Learning
• The data inspires the structures we want to predict; it also tells us what to optimize (Domain Knowledge, Mathematical Modeling).
• Our model defines a score for each structure.
• Inference (combinatorial optimization) finds {best structure, marginals, partition function} for a new observation.
• Learning (optimization) tunes the parameters of the model. (Inference is usually called as a subroutine in learning.)
[Figure: the same Data → Model → Objective → Inference → Learning pipeline, illustrated with the sentences “Alice saw Bob on a hill with a telescope” and “time flies like an arrow”.]

Recall… Learning Fully Observed BNs
[Example Bayes net over $X_1, \ldots, X_5$:]
$p(X_1, X_2, X_3, X_4, X_5) = p(X_5 \mid X_3)\, p(X_4 \mid X_2, X_3)\, p(X_3)\, p(X_2 \mid X_1)\, p(X_1)$
How do we learn these conditional and marginal distributions for a Bayes Net?

Recall… Learning Fully Observed BNs
Learning this fully observed Bayesian Network is equivalent to learning five (small / simple) independent networks from the same data:
$p(X_1, X_2, X_3, X_4, X_5) = p(X_5 \mid X_3)\, p(X_4 \mid X_2, X_3)\, p(X_3)\, p(X_2 \mid X_1)\, p(X_1)$
[Figure: the full network split into five sub-networks, one per factor.]

Learning Fully Observed BNs
How do we learn these conditional and marginal distributions for a Bayes Net? The log-likelihood decomposes by factor, so each parameter set can be estimated separately, as sketched below:
$\theta^* = \arg\max_\theta \log p(X_1, X_2, X_3, X_4, X_5)$
$\quad\;\; = \arg\max_\theta \left[ \log p(X_5 \mid X_3, \theta_5) + \log p(X_4 \mid X_2, X_3, \theta_4) + \log p(X_3 \mid \theta_3) + \log p(X_2 \mid X_1, \theta_2) + \log p(X_1 \mid \theta_1) \right]$
$\theta_1^* = \arg\max_{\theta_1} \log p(X_1 \mid \theta_1)$
$\theta_2^* = \arg\max_{\theta_2} \log p(X_2 \mid X_1, \theta_2)$
$\theta_3^* = \arg\max_{\theta_3} \log p(X_3 \mid \theta_3)$
$\theta_4^* = \arg\max_{\theta_4} \log p(X_4 \mid X_2, X_3, \theta_4)$
$\theta_5^* = \arg\max_{\theta_5} \log p(X_5 \mid X_3, \theta_5)$
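As a concrete sketch of this decomposition (assuming discrete variables and a tiny fully observed toy dataset invented here for illustration), each conditional distribution has a closed-form MLE obtained by counting and normalizing, independently of the other factors. The snippet estimates only $p(X_1)$ and $p(X_2 \mid X_1)$; the remaining factors follow the same pattern.

```python
from collections import Counter, defaultdict

# Sketch: MLE for a fully observed discrete Bayes net decomposes into
# independent count-and-normalize problems, one per conditional distribution.
# Toy dataset (invented for illustration): each row is one joint assignment
# (x1, x2, x3, x4, x5).
data = [
    (0, 1, 0, 1, 0),
    (0, 1, 1, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0),
]

# theta_1: estimate p(X1) from the X1 column alone.
x1_counts = Counter(row[0] for row in data)
p_x1 = {x1: c / len(data) for x1, c in x1_counts.items()}

# theta_2: estimate p(X2 | X1) from the (X1, X2) pairs alone.
x2_counts = defaultdict(Counter)
for row in data:
    x2_counts[row[0]][row[1]] += 1
p_x2_given_x1 = {
    x1: {x2: c / sum(cnt.values()) for x2, c in cnt.items()}
    for x1, cnt in x2_counts.items()
}

# p(X3), p(X4 | X2, X3), and p(X5 | X3) are estimated the same way,
# each from only the variables appearing in its own factor.
print("p(X1) =", p_x1)
print("p(X2 | X1) =", p_x2_given_x1)
```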
SUPERVISED LEARNING FOR HMMS

Hidden Markov Model (recap)
HMM Parameters: emission matrix $A$, where $P(X_k = w \mid Y_k = t) = A_{t,w},\ \forall k$; transition matrix $B$, where $P(Y_k = t \mid Y_{k-1} = s) = B_{s,t},\ \forall k$.
Assumption: $y_0 = \text{START}$
Generative Story: $Y_k \sim \text{Multinomial}(B_{Y_{k-1}})\ \forall k$; $X_k \sim \text{Multinomial}(A_{Y_k})\ \forall k$
Joint Distribution: $p(\boldsymbol{x}, \boldsymbol{y}) = \prod_{k=1}^{K} p(x_k \mid y_k)\, p(y_k \mid y_{k-1}) = \prod_{k=1}^{K} A_{y_k, x_k}\, B_{y_{k-1}, y_k}$

Whiteboard
• MLEs for HMM

FACTOR GRAPHS
A representation of both directed and undirected graphical models.

Sampling from a Joint Distribution
A joint distribution defines a probability p(x) for each assignment of values x to variables X.
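To tie the HMM's generative story to the idea of sampling from a joint distribution, here is a minimal ancestral-sampling sketch: starting from $y_0 = \text{START}$, draw $y_k \sim \text{Multinomial}(B_{y_{k-1}})$ and then $x_k \sim \text{Multinomial}(A_{y_k})$. The tables and the helper functions (draw, sample_hmm) are assumptions made for this illustration, not code from the lecture.

```python
import random

# Sketch of ancestral sampling from the HMM joint distribution:
# follow the generative story, drawing y_k given y_{k-1}, then x_k given y_k.
# The probability tables are illustrative, not the lecture's numbers.
B = {  # B[s][t] = P(Y_k = t | Y_{k-1} = s)
    "<START>": {"n": 0.7, "d": 0.3},
    "n": {"v": 0.6, "n": 0.4},
    "v": {"d": 0.5, "n": 0.5},
    "d": {"n": 1.0},
}
A = {  # A[t][w] = P(X_k = w | Y_k = t)
    "n": {"time": 0.5, "arrow": 0.5},
    "v": {"flies": 0.7, "like": 0.3},
    "d": {"an": 1.0},
}

def draw(dist):
    """Sample one outcome from a {value: probability} dict."""
    r, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if r <= total:
            return value
    return value  # guard against floating-point round-off

def sample_hmm(length):
    """Ancestral sampling: y_0 = <START>, then alternate tag and word draws."""
    tags, words, prev = [], [], "<START>"
    for _ in range(length):
        t = draw(B[prev])   # y_k ~ Multinomial(B row for y_{k-1})
        w = draw(A[t])      # x_k ~ Multinomial(A row for y_k)
        tags.append(t)
        words.append(w)
        prev = t
    return words, tags

print(sample_hmm(5))
```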
