Hidden Markov Models + Conditional Random Fields
10-715 Advanced Intro. to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Hidden Markov Models + Conditional Random Fields
Matt Gormley, Guest Lecture 2, Oct. 31, 2018

1. Data: $\mathcal{D} = \{x^{(n)}\}_{n=1}^{N}$
   Sample 1: time flies like an arrow (n v p d n)
   Sample 2: time flies like an arrow (n n v d n)
   Sample 3: flies fly with their wings (n v p n n)
   Sample 4: with time you will see (p n n v v)
2. Model: $p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
   [Figure: factor graph over tags T_1, ..., T_5 and words W_1, ..., W_5]
3. Objective: $\ell(\theta; \mathcal{D}) = \sum_{n=1}^{N} \log p(x^{(n)} \mid \theta)$
4. Learning: $\theta^{*} = \operatorname{argmax}_{\theta} \ell(\theta; \mathcal{D})$
5. Inference:
   1. Marginal inference: $p(x_C) = \sum_{x' : x'_C = x_C} p(x' \mid \theta)$
   2. Partition function: $Z(\theta) = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
   3. MAP inference: $\hat{x} = \operatorname{argmax}_{x} p(x \mid \theta)$

HIDDEN MARKOV MODEL (HMM)

Dataset for Supervised Part-of-Speech (POS) Tagging
Data: $\mathcal{D} = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$
   Sample 1: x^(1) = time flies like an arrow,   y^(1) = n v p d n
   Sample 2: x^(2) = time flies like an arrow,   y^(2) = n n v d n
   Sample 3: x^(3) = flies fly with their wings, y^(3) = n v p n n
   Sample 4: x^(4) = with time you will see,     y^(4) = p n n v v

Naïve Bayes for Time Series Data
We could treat each word-tag pair (i.e. token) as independent. This corresponds to a Naïve Bayes model with a single feature (the word).
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .1 * .5 * …)
[Figure: one independent tag-word pair per token, each scored by the tables below.]
Tag probabilities p(y):
        v .1    n .8    p .2    d .2
Emission probabilities p(x | y) (rows = tag, columns = word):
           time   flies   like
   v        .2     .5      .2
   n        .3     .4      .2
   p        .1     .1      .3
   d        .1     .2      .1

Hidden Markov Model
A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
[Figure: chain-structured model <START> → n → v → p → d → n over "time flies like an arrow".]
Transition probabilities (rows = previous tag, columns = current tag):
           v      n      p      d
   v      .1     .4     .2     .3
   n      .8     .1     .1      0
   p      .2     .3     .2     .3
   d      .2     .8      0      0
Emission probabilities (rows = tag, columns = word):
           time   flies   like
   v        .2     .5      .2
   n        .3     .4      .2
   p        .1     .1      .3
   d        .1     .2      .1

From Mixture Model to HMM
[Figure: "Naïve Bayes" has independent pairs (Y_1, X_1), ..., (Y_5, X_5); the HMM adds a chain Y_0 → Y_1 → ... → Y_5 over the hidden tags, with each Y_t still emitting X_t.]

SUPERVISED LEARNING FOR HMMS

Hidden Markov Model
HMM Parameters (toy example with hidden states O, S, C and observations 1min, 2min, 3min):
Initial probabilities C:
        O .8    S .1    C .1
Transition probabilities B (rows = previous state, columns = next state):
           O      S      C
   O      .9    .08    .02
   S      .2     .7     .1
   C      .9      0     .1
Emission probabilities (rows = state, columns = observation):
          1min   2min   3min
   O       .1     .2     .3
   S      .01    .02    .03
   C        0      0      0

Generative Story. Assumption: y_0 = START. For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption. Then for t = 1, ..., T: draw the tag y_t from the transition distribution given y_{t-1}, and draw the word x_t from the emission distribution given y_t.

Joint Distribution (with y_0 = START):
$p(x, y) = \prod_{t=1}^{T} p(x_t \mid y_t)\, p(y_t \mid y_{t-1})$

Training HMMs
Whiteboard:
– (Supervised) Likelihood for an HMM
– Maximum Likelihood Estimation (MLE) for HMM

Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models: one over the transition pairs (Y_t, Y_{t+1}) and one over the emission pairs (Y_t, X_t). A count-and-normalize sketch of the resulting estimator follows.
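The MLE itself is worked out on the whiteboard; the result is count-and-normalize, done separately for the two factor types above. Below is a minimal sketch under that setup, folding the initial probabilities into the transition table via a <START> tag; the dictionary-of-dictionaries layout and the name supervised_mle are illustration choices, not notation from the lecture.

```python
from collections import defaultdict

def supervised_mle(tagged_sentences):
    """Count-and-normalize MLE for an HMM from fully observed data.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (B, E): transition probabilities B[prev_tag][tag] and emission
    probabilities E[tag][word]. The initial distribution is folded into B
    through the <START> tag, matching the y_0 = START assumption above.
    """
    trans_counts = defaultdict(lambda: defaultdict(float))  # counts of (y_{t-1}, y_t)
    emit_counts = defaultdict(lambda: defaultdict(float))   # counts of (y_t, x_t)

    for sentence in tagged_sentences:
        prev = "<START>"
        for word, tag in sentence:
            trans_counts[prev][tag] += 1.0
            emit_counts[tag][word] += 1.0
            prev = tag

    def normalize(table):
        return {ctx: {k: v / sum(row.values()) for k, v in row.items()}
                for ctx, row in table.items()}

    # Each normalize call solves one of the two independent mixture models.
    return normalize(trans_counts), normalize(emit_counts)

# Two of the sentences from the POS-tagging dataset above:
data = [
    [("time", "n"), ("flies", "v"), ("like", "p"), ("an", "d"), ("arrow", "n")],
    [("time", "n"), ("flies", "n"), ("like", "v"), ("an", "d"), ("arrow", "n")],
]
B, E = supervised_mle(data)
print(B["<START>"])  # {'n': 1.0} -- both sample sentences start with tag n
print(E["n"])        # relative frequencies of the words emitted by tag n
```

The two normalize calls are exactly the decomposition mentioned above: transition probabilities depend only on tag-pair counts, and emission probabilities only on tag-word counts.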
HMMs: History
• Markov chains: Andrey Markov (1906)
  – Random walks and Brownian motion
• Used in Shannon's work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's
  – Used mainly for speech in the 60's-70's
• Late 80's and 90's: David Haussler (major player in learning theory in the 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag / Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from the Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
Slide from William Cohen

Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM): each tag depends on the previous tag
• 2nd-order HMM (i.e. trigram HMM): each tag depends on the previous two tags
• 3rd-order HMM: each tag depends on the previous three tags
[Figure: the chain <START>, Y_1, ..., Y_5 over X_1, ..., X_5 with transition factors of increasing order.]

BACKGROUND: MESSAGE PASSING

Great Ideas in ML: Message Passing (adapted from MacKay (2003) textbook)
Count the soldiers: messages passed in both directions along a line of soldiers tell each soldier how many stand before them ("1 before you", ..., "5 before you") and how many behind them ("5 behind you", ..., "1 behind you"). A soldier who only sees its two incoming messages can still form a belief about the total: "there's 1 of me", so must be 2 + 1 + 3 = 6 of us; the next soldier in line computes 1 + 1 + 4 = 6 of us.
On a tree, each soldier receives reports from all branches: a soldier who hears "3 here" and "7 here" (itself computed as 3 + 3 + 1) from two branches, and counts "1 of me", reports "11 here (= 7 + 3 + 1)" to the third branch; combining the "3 here" from that third branch with the other reports gives the belief "must be 14 of us".

THE FORWARD-BACKWARD ALGORITHM

Inference for HMMs
Whiteboard:
– Three Inference Problems for an HMM
  1. Evaluation: Compute the probability of a given sequence of observations
  2. Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
  3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
A sketch of the sum-product recursions for problems 1 and 3 follows; after it, a worked example over the sentence "find preferred tags".
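Here is a compact sketch of the sum-product recursions that answer problems 1 (evaluation) and 3 (marginals). It folds <START> into the transition table, as above, and omits an explicit <END> factor for brevity; the array layout and the names trans, emit_probs, alpha, and beta are my own choices rather than the lecture's notation.

```python
import numpy as np

def forward_backward(trans, emit_probs):
    """Sum-product message passing along an HMM chain.

    trans:      (K+1) x K array; row K is the <START> row, so the initial
                distribution is folded into the transition table.
    emit_probs: T x K array with emit_probs[t, k] = p(x_t | y_t = k) for the
                observed sequence x_1, ..., x_T.
    Returns (marginals, log_evidence), where marginals[t, k] = p(y_t = k | x).
    """
    T, K = emit_probs.shape
    START = K

    alpha = np.zeros((T, K))             # alpha[t, k] = p(x_1..x_t, y_t = k)
    alpha[0] = trans[START] * emit_probs[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans[:K]) * emit_probs[t]

    beta = np.zeros((T, K))              # beta[t, k] = p(x_{t+1}..x_T | y_t = k)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans[:K] @ (emit_probs[t + 1] * beta[t + 1])

    evidence = alpha[T - 1].sum()        # p(x): the evaluation problem
    marginals = alpha * beta / evidence  # the marginals problem
    return marginals, np.log(evidence)
```

Problem 2 (decoding) uses the same lattice with max in place of sum; a sketch appears after the worked example below.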
Forward-Backward Algorithm
Running example: [Figure: a three-word chain Y_1, Y_2, Y_3 over X_1, X_2, X_3 = "find preferred tags".]
"find" could be a verb or a noun; "preferred" could be an adjective or a verb; "tags" could be a noun or a verb.
• Let's show the possible values for each variable: each Y_t ranges over {v, n, a}, with <START> and <END> bracketing the chain
• One possible assignment
• And what the 7 transition / emission factors think of it
Example potential table for a pair of adjacent tags (rows = earlier tag, columns = later tag):
          v      n      a
   v      1      6      4
   n      8      4     0.1
   a     0.1     8      0
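Decoding, problem 2 from the inference list, replaces the sums of the forward pass with maxes and keeps back-pointers; on the lattice above you would pass in the adjacent-tag potentials (plus a <START> row) as trans and per-position scores for "find", "preferred", and "tags" as emit_scores. This is a minimal sketch under the same array conventions as the forward-backward code, not the lecture's own pseudocode.

```python
import numpy as np

def viterbi(trans, emit_scores):
    """Max-product analogue of the forward pass: returns the highest-scoring
    hidden state sequence, argmax_y p(x, y). Conventions match the
    forward_backward sketch: trans is (K+1) x K with row K = <START>."""
    T, K = emit_scores.shape
    START = K

    delta = np.zeros((T, K))               # best score of any path ending in state k at step t
    backptr = np.zeros((T, K), dtype=int)  # which previous state achieved that score
    delta[0] = trans[START] * emit_scores[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans[:K]  # scores[i, j]: come from i, land on j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit_scores[t]

    # Follow the back-pointers from the best final state.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path))
```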