10-715 Advanced Intro. to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Hidden Markov Models + Conditional Random Fields
Matt Gormley Guest Lecture 2 Oct. 31, 2018
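1. Data
$\mathcal{D} = \{\mathbf{x}^{(n)}\}_{n=1}^{N}$
Sample 1: n v p d n / time flies like an arrow
Sample 2: n n v d n / time flies like an arrow
Sample 3: n v p n n / flies fly with their wings
Sample 4: p n n v v / with time you will see
2. Model
$p(\mathbf{x} \mid \theta) = \frac{1}{Z(\theta)} \prod_{C \in \mathcal{C}} \psi_C(\mathbf{x}_C)$
[Figure: factor graph over tags T1 … T5 and words W1 … W5]
3. Objective
$\ell(\theta; \mathcal{D}) = \sum_{n=1}^{N} \log p(\mathbf{x}^{(n)} \mid \theta)$
4. Learning
$\theta^{*} = \operatorname{argmax}_{\theta} \ell(\theta; \mathcal{D})$
5. Inference
1. Marginal Inference: $p(\mathbf{x}_C) = \sum_{\mathbf{x}' : \mathbf{x}'_C = \mathbf{x}_C} p(\mathbf{x}' \mid \theta)$
2. Partition Function: $Z(\theta) = \sum_{\mathbf{x}} \prod_{C \in \mathcal{C}} \psi_C(\mathbf{x}_C)$
3. MAP Inference: $\hat{\mathbf{x}} = \operatorname{argmax}_{\mathbf{x}} p(\mathbf{x} \mid \theta)$
HIDDEN MARKOV MODEL (HMM)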
Dataset for Supervised Part-of-Speech (POS) Tagging
Data: $\mathcal{D} = \{\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\}_{n=1}^{N}$
Sample 1: y(1) = n v p d n, x(1) = time flies like an arrow
Sample 2: y(2) = n n v d n, x(2) = time flies like an arrow
Sample 3: y(3) = n v p n n, x(3) = flies fly with their wings
Sample 4: y(4) = p n n v v, x(4) = with time you will see
Naïve Bayes for Time Series Data
We could treat each word-tag pair (i.e. token) as independent. This corresponds to a Naïve Bayes model with a single feature (the word).
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .1 * .5 * …)
[Figure: tags n v p d n above the words time flies like an arrow, with the two tables below attached at each position]
Tag table:
  v  .1
  n  .8
  p  .2
  d  .2
Tag-word table:
     time  flies  like  …
  v  .2    .5     .2
  n  .3    .4     .2
  p  .1    .1     .3
  d  .1    .2     .1
Hidden Markov Model
A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
Tag-tag table, shown between each pair of adjacent tags (rows: previous tag, columns: next tag):
     v    n    p    d
  v  .1   .4   .2   .3
  n  .8   .1   .1   0
  p  .2   .3   .2   .3
  d  .2   .8   0    0
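The product above can be sketched in a few lines of Python. This is a minimal illustration, assuming small hypothetical tables (`transition`, `emission`) with two tags and two words rather than the slide's numbers, and a `START` symbol before the first tag.

```python
# Hypothetical toy tables (not the slide's numbers): two tags, two words.
START = "START"
transition = {  # transition[prev][cur] = p(cur | prev)
    START: {"n": 0.7, "v": 0.3},
    "n": {"n": 0.4, "v": 0.6},
    "v": {"n": 0.8, "v": 0.2},
}
emission = {  # emission[tag][word] = p(word | tag)
    "n": {"time": 0.6, "flies": 0.4},
    "v": {"time": 0.1, "flies": 0.9},
}


def hmm_joint(tags, words):
    """Joint probability p(tags, words): one transition and one emission per token."""
    prob = 1.0
    prev = START
    for tag, word in zip(tags, words):
        prob *= transition[prev][tag] * emission[tag][word]
        prev = tag
    return prob


joint = hmm_joint(["n", "v"], ["time", "flies"])  # .7 * .6 * .6 * .9
```

Dropping the dependence on `prev` (a single tag distribution in place of `transition`) would give the independent-token model from the Naïve Bayes slide.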
[Figure: graphical models over tags Y1 … Y5 and words X1 … X5, "Naïve Bayes" (top) and HMM (bottom)]
From Mixture Model to HMM
[Figure: "Naïve Bayes" over Y1 … Y5 and X1 … X5 (top); HMM over Y0, Y1 … Y5 and X1 … X5 (bottom)]
SUPERVISED LEARNING FOR HMMS
Hidden Markov Model
HMM Parameters:
Initial probabilities:
  O  .8
  S  .1
  C  .1
Transition probabilities:
     O    S    C
  O  .9   .08  .02
  S  .2   .7   .1
  C  .9   0    .1
Emission probabilities:
     1min  2min  3min  …
  O  .1    .2    .3
  S  .01   .02   .03
  C  0     0     0
[Figure: HMM chain Y1 … Y5 over observations X1 … X5]
Hidden Markov Model
HMM Parameters:
Assumption: y0 = START
Generative Story:
For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption.
[Figure: HMM chain Y0, Y1 … Y5 over observations X1 … X5]
Hidden Markov Model
Joint Distribution: y0 = START
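The slide's formula did not survive extraction. A standard way to write the HMM joint under the stated assumption $y_0 = \text{START}$ is sketched below; the sequence length $T$ is an assumed symbol, and each transition term would be read from the matrix B and each emission term from the matrix A.

```latex
p(\mathbf{x}, \mathbf{y})
  = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t),
  \qquad y_0 = \text{START}
```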
[Figure: HMM chain Y0, Y1 … Y5 over observations X1 … X5]
Training HMMs
Whiteboard
– (Supervised) Likelihood for an HMM
– Maximum Likelihood Estimation (MLE) for HMM
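Supervised MLE for an HMM reduces to counting and normalizing. The sketch below is one way this could look, run on the four tagged sentences from the dataset slide; the function names `fit_hmm` and `normalize` are illustrative, there is no smoothing, and no END transition is modeled.

```python
from collections import Counter, defaultdict

# The four tagged sentences from the dataset slide.
data = [
    ("time flies like an arrow".split(), "n v p d n".split()),
    ("time flies like an arrow".split(), "n n v d n".split()),
    ("flies fly with their wings".split(), "n v p n n".split()),
    ("with time you will see".split(), "p n n v v".split()),
]


def normalize(counts):
    """Turn a Counter of raw counts into relative frequencies."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def fit_hmm(data):
    """Unsmoothed MLE: count tag-tag and tag-word events, then normalize each row."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for words, tags in data:
        prev = "START"  # y0 = START, so initial probabilities live in the transitions
        for word, tag in zip(words, tags):
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
    transition = {prev: normalize(c) for prev, c in trans_counts.items()}
    emission = {tag: normalize(c) for tag, c in emit_counts.items()}
    return transition, emission


transition, emission = fit_hmm(data)
```

The transition counts and the emission counts never interact, which is the sense in which learning splits into two independent problems.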
Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models
[Figure: the two sub-models, one over (Yt, Yt+1) and one over (Yt, Xt)]
HMMs: History
• Markov chains: Andrey Markov (1906)
  – Random walks and Brownian motion
• Used in Shannon's work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's.
  – Used mainly for speech in 60s-70s.
• Late 80's and 90's: David Haussler (major player in learning theory in 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag/Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
(Slide from William Cohen)
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM
[Figure: one chain over X1 … X5 for each of the three orders]
BACKGROUND: MESSAGE PASSING
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: a line of soldiers, each saying "there's 1 of me". Messages passed one way read "1 before you", "2 before you", …, "5 before you"; messages passed the other way read "5 behind you", "4 behind you", …, "1 behind you".]
(adapted from MacKay (2003) textbook)
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: one soldier receives "2 before you" and "3 behind you", and only sees these incoming messages. "there's 1 of me"]
Belief: Must be 2 + 1 + 3 = 6 of us
(adapted from MacKay (2003) textbook)
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: the neighboring soldier receives "1 before you" and "4 behind you", and only sees these incoming messages. "there's 1 of me"]
Belief: Must be 1 + 1 + 4 = 6 of us
Belief (previous soldier): Must be 2 + 1 + 3 = 6 of us
(adapted from MacKay (2003) textbook)
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
[Figure: incoming reports "3 here" and "7 here", plus "1 of me", give the outgoing report "11 here (= 7+3+1)"]
[Figure: incoming reports "3 here" and "3 here" give the outgoing report "7 here (= 3+3+1)"]
[Figure: incoming reports "7 here" and "3 here" give the outgoing report "11 here (= 7+3+1)"]
[Figure: incoming reports "3 here", "7 here", and "3 here"]
Belief: Must be 14 of us
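The soldier-counting idea can be sketched as a recursive message computation on a tree. The adjacency lists below (`line`, `tree`) are hypothetical examples, not the trees drawn on the slides, and the recursion recomputes messages rather than caching them, which is fine for a sketch.

```python
def count_via_messages(graph, node):
    """Belief at `node`: 1 for itself plus the incoming message from each neighbor."""

    def message(src, dst):
        # Number of soldiers on src's side of the edge src -> dst:
        # src itself plus everything reported to src from its other neighbors.
        return 1 + sum(message(nb, src) for nb in graph[src] if nb != dst)

    return 1 + sum(message(nb, node) for nb in graph[node])


# Hypothetical line of 6 soldiers, stored as adjacency lists.
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
line_beliefs = [count_via_messages(line, i) for i in range(6)]

# Hypothetical 5-soldier tree: soldier 2 has three neighbors.
tree = {0: [1, 2], 1: [0], 2: [0, 3, 4], 3: [2], 4: [2]}
tree_beliefs = [count_via_messages(tree, i) for i in tree]
```

Every soldier reaches the same total even though each one only sees its incoming messages, for example 2 + 1 + 3 = 6 for the third soldier in the line.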
(adapted from MacKay (2003) textbook)
THE FORWARD-BACKWARD ALGORITHM
Inference for HMMs
Whiteboard
– Three Inference Problems for an HMM
  1. Evaluation: Compute the probability of a given sequence of observations
  2. Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
  3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
[Figure: tags Y1 Y2 Y3 over the words X1 X2 X3 = "find preferred tags"]
• find: could be verb or noun
• preferred: could be adjective or verb
• tags: could be noun or verb
Forward-Backward Algorithm
[Figure: trellis for "find preferred tags": START, then the candidate values v, n, a for each of Y1, Y2, Y3, then END]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Forward-Backward Algorithm
[Figure: trellis for "find preferred tags" with one assignment highlighted]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
Forward-Backward Algorithm
[Figure: the same trellis annotated with the two factor tables below]
Transition table:
     v    n    a
  v  1    6    4
  n  8    4    0.1
  a  0.1  8    0
Emission table:
     find  pref.  tags
  v  3     5      3
  n  4     5      2
  a  0.1   0.2    0.1
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
Viterbi Algorithm: Most Probable Assignment
[Figure: trellis with the path v a n highlighted; labeled factors include B(a,END) on an edge and A(tags,n) on a node]
• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
Viterbi Algorithm: Most Probable Assignment
[Figure: the same trellis and highlighted path]
• So p(v a n) = (1/Z) * product weight of one path
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis for "find preferred tags"; on successive slides all paths through Y2 = a, Y2 = n, and Y2 = v are highlighted]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n
• Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis with the path prefixes ending at Y2 = n highlighted]
α2(n) = total weight of these path prefixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis with the path suffixes starting at Y2 = n highlighted]
β2(n) = total weight of these path suffixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis with both prefixes and suffixes at Y2 = n highlighted]
α2(n) = total weight of these path prefixes (a + b + c)
β2(n) = total weight of these path suffixes (x + y + z)
Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths
Forward-Backward Algorithm: Finds Marginals
Oops! The weight of a path through a state also includes a weight at that state. So α(n)·β(n) isn't enough. The extra weight is the opinion of the emission probability at this variable.
[Figure: trellis with "belief that Y2 = n" marked, α2(n) arriving from the left, β2(n) from the right, and A(pref., n) from below]
total weight of all paths through n = α2(n) × A(pref., n) × β2(n)
Forward-Backward Algorithm: Finds Marginals
[Figure: the same trellis with "belief that Y2 = v" and "belief that Y2 = n" marked, and α2(v), β2(v), A(pref., v) labeled]
total weight of all paths through v = α2(v) × A(pref., v) × β2(v)
Forward-Backward Algorithm: Finds Marginals
[Figure: the same trellis with the beliefs that Y2 = v, Y2 = n, and Y2 = a marked, and α2(a), β2(a), A(pref., a) labeled]
total weight of all paths through a = α2(a) × A(pref., a) × β2(a)
Beliefs at Y2: v 0.1, n 0, a 0.4
sum = Z (total weight of all paths)
Divide by Z = 0.5 to get marginal probs: v 0.2, n 0, a 0.8
Forward-Backward Algorithm
[Figure: tags Y1 Y2 Y3 over the words X1 X2 X3 = "find preferred tags"]
• find: could be verb or noun
• preferred: could be adjective or verb
• tags: could be noun or verb
Inference for HMMs
Whiteboard
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm
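A minimal forward-backward sketch follows the trellis slides' convention: α and β exclude the emission at their own position, so the belief is α × emission × β. The toy tables (`trans`, `emit`) are hypothetical, and no END factor is included.

```python
# Hypothetical toy HMM: two tags, two words, y0 = START.
TAGS = ["n", "v"]
trans = {
    "START": {"n": 0.7, "v": 0.3},
    "n": {"n": 0.4, "v": 0.6},
    "v": {"n": 0.8, "v": 0.2},
}
emit = {
    "n": {"time": 0.6, "flies": 0.4},
    "v": {"time": 0.1, "flies": 0.9},
}


def forward_backward(words, tags=TAGS, start="START"):
    T = len(words)
    # alpha[t][y]: total weight of path prefixes arriving at tag y at position t.
    alpha = [{y: trans[start][y] for y in tags}]
    for t in range(1, T):
        prev = alpha[-1]
        alpha.append({
            y: sum(prev[yp] * emit[yp][words[t - 1]] * trans[yp][y] for yp in tags)
            for y in tags
        })
    # beta[t][y]: total weight of path suffixes leaving tag y at position t.
    beta = [None] * T
    beta[T - 1] = {y: 1.0 for y in tags}
    for t in range(T - 2, -1, -1):
        beta[t] = {
            y: sum(trans[y][yn] * emit[yn][words[t + 1]] * beta[t + 1][yn] for yn in tags)
            for y in tags
        }
    # Belief = alpha * emission * beta; its sum over tags is Z at every position.
    beliefs = [
        {y: alpha[t][y] * emit[y][words[t]] * beta[t][y] for y in tags}
        for t in range(T)
    ]
    Z = sum(beliefs[0].values())
    marginals = [{y: b[y] / Z for y in tags} for b in beliefs]
    return Z, marginals


Z, marginals = forward_backward(["time", "flies"])
```

Here `Z` answers the Evaluation problem (the probability of the observed words) and `marginals[t]` answers the Marginals problem for position t.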
Inference in HMMs
What is the computational complexity of inference for HMMs?
• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, $O(K^T)$
• The forward-backward algorithm and Viterbi algorithm run in polynomial time, $O(T \cdot K^2)$
  – Thanks to dynamic programming!
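To make the contrast concrete, the sketch below decodes the same sentence twice: once with a Viterbi recursion and once by enumerating all K^T tag sequences. The toy tables are hypothetical, and the Viterbi version stores whole paths for brevity where an implementation aiming for the O(T·K²) bound would typically keep backpointers.

```python
from itertools import product

# Hypothetical toy HMM: two tags, two words, y0 = START.
TAGS = ["n", "v"]
trans = {
    "START": {"n": 0.7, "v": 0.3},
    "n": {"n": 0.4, "v": 0.6},
    "v": {"n": 0.8, "v": 0.2},
}
emit = {
    "n": {"time": 0.6, "flies": 0.4},
    "v": {"time": 0.1, "flies": 0.9},
}


def viterbi(words, tags=TAGS, start="START"):
    """Max-product dynamic program: best[y] = (weight, path) of the best prefix ending in y."""
    best = {y: (trans[start][y] * emit[y][words[0]], [y]) for y in tags}
    for word in words[1:]:
        new_best = {}
        for y in tags:
            weight, path = max(
                ((w * trans[yp][y], p) for yp, (w, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[y] = (weight * emit[y][word], path + [y])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


def path_weight(path, words, start="START"):
    weight, prev = 1.0, start
    for y, word in zip(path, words):
        weight *= trans[prev][y] * emit[y][word]
        prev = y
    return weight


def brute_force(words, tags=TAGS):
    """Enumerate all K^T tag sequences and keep the heaviest one."""
    return max(
        ((path_weight(p, words), list(p)) for p in product(tags, repeat=len(words))),
        key=lambda item: item[0],
    )


words = ["time", "flies"]
v_weight, v_path = viterbi(words)
b_weight, b_path = brute_force(words)
```

Both searches agree on the best path; only the amount of work differs as T grows.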
Conditional Random Fields (CRFs) for time series data
LINEAR-CHAIN CRFS
Shortcomings of Hidden Markov Models
[Figure: HMM chain START, Y1, Y2, …, Yn over observations X1, X2, …, Xn]
• HMM models capture dependences between each state and only its corresponding observation
  – NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
  – HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
(© Eric Xing @ CMU, 2005-2015)
Conditional Random Field (CRF)
Conditional distribution over tags Yi given words xi. The factors and Z are now specific to the sentence x.
p(n, v, p, d, n | time, flies, like, an, arrow) = (1/Z) * (4 * 8 * 5 * 3 * …)
Tag-tag factor table, shown between each pair of adjacent tags (rows: previous tag, columns: next tag):
     v    n    p    d
  v  1    6    3    4
  n  8    4    2    0.1
  p  1    3    1    3
  d  0.1  8    0    0
Per-word factor tables:
     time  flies
  v  3     5
  n  4     5
  p  0.1   0.1
  d  0.1   0.2
[Figure: chain of tag variables with factors ψ1, ψ3, ψ5, ψ7, ψ9 attached to the words time flies like an arrow]
Conditional Random Field (CRF)
Recall: Shaded nodes in a graphical model are observed
[Figure: the same chain and factor tables as on the previous slide, with the words drawn as shaded nodes X1 … X5 below the factors ψ1, ψ3, ψ5, ψ7, ψ9]
Conditional Random Field (CRF)
This linear-chain CRF is just like an HMM, except that its factors are not necessarily probability distributions
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \psi_{\text{em}}(y_k, x_k)\, \psi_{\text{tr}}(y_k, y_{k-1})$$
$$= \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp(\theta \cdot \mathbf{f}_{\text{em}}(y_k, x_k))\, \exp(\theta \cdot \mathbf{f}_{\text{tr}}(y_k, y_{k-1}))$$
[Figure: chain with factors ψ1, ψ3, ψ5, ψ7, ψ9 over observed X1 … X5]
Exercise
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp(\theta \cdot \mathbf{f}_{\text{em}}(y_k, x_k))\, \exp(\theta \cdot \mathbf{f}_{\text{tr}}(y_k, y_{k-1}))$$
Multiple Choice: Which model does the above distribution share the most in common with?
A. Hidden Markov Model
B. Bernoulli Naïve Bayes
C. Gaussian Naïve Bayes
D. Logistic Regression
Conditional Random Field (CRF)
• That is the vector X
• Because it's observed, we can condition on it for free
• Conditioning is how we converted from the MRF to the CRF (i.e. when taking a slice of the emission factors)
[Figure: the same chain and factor tables, with the observed words collapsed into a single shaded node X]
Conditional Random Field (CRF)
• This is the standard linear-chain CRF definition
• It permits rich, overlapping features of the vector X
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \psi_{\text{em}}(y_k, \mathbf{x})\, \psi_{\text{tr}}(y_k, y_{k-1}, \mathbf{x})$$
$$= \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp(\theta \cdot \mathbf{f}_{\text{em}}(y_k, \mathbf{x}))\, \exp(\theta \cdot \mathbf{f}_{\text{tr}}(y_k, y_{k-1}, \mathbf{x}))$$
[Figure: chain with factors ψ1, ψ3, ψ5, ψ7, ψ9, all connected to the single observed node X]
Visual Notation: Usually we draw a CRF without showing the variable corresponding to X
Whiteboard
• Forward-backward algorithm for linear-chain CRF
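For a linear-chain CRF the forward recursion is unchanged; only the factors stop being probabilities. The sketch below computes Z(x) for the two-word prefix "time flies" using the factor values shown on the earlier CRF slide, and checks it against brute-force enumeration. Reading table rows as the previous tag is an assumption (it agrees with the slide's product 4 * 8 * 5 * 3), no START or END factor is included, and this α folds in the per-word factor at its own position.

```python
from itertools import product

TAGS = ["v", "n", "p", "d"]
# Pairwise potentials psi_tr[prev][cur], values from the slide's tag-tag table.
rows = {"v": [1, 6, 3, 4], "n": [8, 4, 2, 0.1], "p": [1, 3, 1, 3], "d": [0.1, 8, 0, 0]}
psi_tr = {prev: dict(zip(TAGS, vals)) for prev, vals in rows.items()}
# Per-word potentials for "time" and "flies", values from the slide.
psi_em = [
    {"v": 3, "n": 4, "p": 0.1, "d": 0.1},
    {"v": 5, "n": 5, "p": 0.1, "d": 0.2},
]


def score(path):
    """Unnormalized weight of one tag sequence: the product of its potentials."""
    weight = psi_em[0][path[0]]
    for k in range(1, len(path)):
        weight *= psi_tr[path[k - 1]][path[k]] * psi_em[k][path[k]]
    return weight


def partition_forward():
    """Z(x) by the forward recursion, one matrix-vector product per position."""
    alpha = dict(psi_em[0])
    for k in range(1, len(psi_em)):
        alpha = {
            y: psi_em[k][y] * sum(alpha[yp] * psi_tr[yp][y] for yp in TAGS)
            for y in TAGS
        }
    return sum(alpha.values())


Z = partition_forward()
Z_brute = sum(score(path) for path in product(TAGS, repeat=len(psi_em)))
p_n_v = score(("n", "v")) / Z  # p(n, v | time, flies) under these factors
```

Because the factors are unnormalized, dividing by Z(x) is what turns a path weight into a conditional probability.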
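General CRF
The topology of the graphical model for a CRF doesn't have to be a chain
$$p_{\theta}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{\alpha} \psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \theta)$$
[Figure: non-chain factor graph over Y1 … Y9 above the words "time flies like an …", with unary factors ψ{1}, ψ{2}, ψ{3}, pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4}, and higher-order factors ψ{3,6,7}, ψ{2,7,8}, ψ{1,8,9}]
Standard CRF Parameterization
$$p_{\theta}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{\alpha} \psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \theta)$$
Define each potential function in terms of a fixed set of feature functions:
$$\psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \theta) = \exp(\theta \cdot \mathbf{f}_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}))$$
Here $\mathbf{y}_{\alpha}$ are the predicted variables and $\mathbf{x}$ the observed variables.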
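The parameterization can be sketched with sparse feature dictionaries. Everything named below (`emission_features`, the feature templates, and the weights in `theta`) is hypothetical and chosen only to illustrate overlapping features of the observed sentence.

```python
import math


def potential(theta, features):
    """psi = exp(theta . f), with f given as a dict of active feature values."""
    return math.exp(sum(theta.get(name, 0.0) * value for name, value in features.items()))


def emission_features(tag, words, k):
    """Hypothetical overlapping features of tag y_k and the whole sentence x."""
    word = words[k]
    return {
        f"word={word},tag={tag}": 1.0,
        f"suffix={word[-1:]},tag={tag}": 1.0,
        f"first_word,tag={tag}": 1.0 if k == 0 else 0.0,
    }


# Hypothetical weights; any feature missing from theta has weight 0.
theta = {"word=flies,tag=v": 1.2, "suffix=s,tag=v": 0.5, "suffix=s,tag=n": 0.8}
words = "time flies like an arrow".split()
psi_v = potential(theta, emission_features("v", words, 1))  # exp(1.2 + 0.5)
psi_n = potential(theta, emission_features("n", words, 1))  # exp(0.8)
```

Since each potential is an exponential, it is always positive, and the word-identity and suffix features may fire together without any independence assumption between them.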
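Standard CRF Parameterization
Define each potential function in terms of a fixed set of feature functions:
$$\psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \theta) = \exp(\theta \cdot \mathbf{f}_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}))$$
[Figure: linear chain of tags n v p d n joined by factors ψ2, ψ4, ψ6, ψ8, with factors ψ1, ψ3, ψ5, ψ7, ψ9 linking each tag to the words time flies like an arrow]
Standard CRF Parameterization
Define each potential function in terms of a fixed set of feature functions:
$$\psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \theta) = \exp(\theta \cdot \mathbf{f}_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}))$$
[Figure: the same chain with additional variables np, pp, vp, s stacked above the tags and joined by factors ψ10, ψ11, ψ12, ψ13]
SUPERVISED LEARNING FOR CRFS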
What is Training?
That’s easy:
Training = picking good model parameters!
But how do we know if the model parameters are any “good”?
Machine Learning: Log-likelihood Training
1. Choose model: Such that derivative in #3 is easy
2. Choose objective: Assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule
4. Replace exact inference by approximate inference
Machine Learning: Log-likelihood Training
1. Choose model: Such that derivative in #3 is easy
2. Choose objective: Assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule
4. Compute the marginals by exact inference. Note that these are factor marginals, which are just the (normalized) factor beliefs from BP!
Recipe for Gradient-based Learning
1. Write down the objective function
2. Compute the partial derivatives of the objective (i.e. gradient, and maybe Hessian)
3. Feed objective function and derivatives into black box (Optimization)
4. Retrieve optimal parameters from black box
Optimization Algorithms
What is the black box (Optimization)?
• Newton's method
• Hessian-free / Quasi-Newton methods
  – Conjugate gradient
  – L-BFGS
• Stochastic gradient methods
  – Stochastic gradient descent (SGD)
  – Stochastic meta-descent
  – AdaGrad
2.1.5 Newton-Raphson (Newton's method, a second-order method)
• From our introductory example, we know that we can find the solution to a quadratic function analytically. Yet gradient descent may take many steps to converge to that optimum. The motivation behind Newton's method is to use a quadratic approximation of our function to make a good guess where we should step next.
• Definition: the Hessian of an n-dimensional function is the matrix of partial second derivatives with respect to each pair of dimensions.
$$\nabla^2 f(\vec{x}) = \begin{bmatrix} \frac{d^2 f(\vec{x})}{dx_1^2} & \frac{d^2 f(\vec{x})}{dx_1 dx_2} & \cdots \\ \frac{d^2 f(\vec{x})}{dx_2 dx_1} & \frac{d^2 f(\vec{x})}{dx_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$
• Consider the second order Taylor series expansion of f at x.
$$\hat{g}(v) = \hat{f}(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x)\, v$$
• We want to find the v that maximizes $\hat{g}(v)$. This maximizer is called Newton's step: $\Delta x_{nt} = \arg\max_v \hat{g}(v)$.
• Algorithm:
  1. Choose a starting point x.
  2. While not converged:
     – Compute Newton's step $\Delta x_{nt} = -(\nabla^2 f(x))^{-1} \nabla f(x)$
     – Update $x^{(k+1)} = x^{(k)} + \Delta x_{nt}$
• Intuition:
  – If f(x) is quadratic, $x + \Delta x_{nt}$ exactly maximizes f.
  – $\hat{g}(v)$ is a good quadratic approximation to the function f near the point x. So if f(x) is locally quadratic, then f(x) is locally well approximated by $\hat{g}(v)$. See Figure 9.17 in Boyd and Vandenberghe.
• In most presentations, Newton-Raphson would be presented as a minimization algorithm, for which Newton's step is defined with arg min in place of arg max. The closed-form step is the same stationary point of $\hat{g}(v)$.
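A one-dimensional sketch of the algorithm, where the Hessian is just the second derivative. The function name `newton_maximize` and both test functions are illustrative choices, not taken from the notes.

```python
def newton_maximize(grad, hess, x, tol=1e-12, max_iter=50):
    """1-D Newton-Raphson: repeatedly jump to the stationary point of the
    local quadratic approximation g(v) = f(x) + f'(x) v + 0.5 f''(x) v^2."""
    for _ in range(max_iter):
        step = -grad(x) / hess(x)  # Newton's step in one dimension
        x += step
        if abs(step) < tol:
            break
    return x


# Quadratic f(x) = -2 (x - 3)^2: a single step lands on the maximizer x = 3.
x_quad = newton_maximize(lambda x: -4.0 * (x - 3.0), lambda x: -4.0, x=0.0, max_iter=1)

# Non-quadratic concave f(x) = log(x) - x, maximized at x = 1.
x_log = newton_maximize(lambda x: 1.0 / x - 1.0, lambda x: -1.0 / x**2, x=0.5)
```

The quadratic case illustrates the first intuition bullet (one step suffices); the logarithmic case needs a handful of steps because the quadratic approximation is only locally accurate.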
2.1.6 Quasi-Newton methods (L-BFGS)
• What if we have n = millions of features?
• The Hessian matrix $H = \nabla^2 f(x)$ is too large: $n^2$ entries.
• Quasi-Newton methods approximate the Hessian.
• Limited memory BFGS stores only a history of the last k updates to $\vec{x}$ and $\nabla f(\vec{x})$. k is usually small (e.g. k = 10).
• This history is used to approximate the Hessian-vector product.
• Optimization has nearly become a technology. Almost every language has many generic optimization routines built in that you can use out of the box.
2.1.7 Stochastic Gradient Descent
• Suppose we have N training examples s.t. $f(x) = \sum_{i=1}^{N} f_i(x)$.
• This implies that $\nabla f(x) = \sum_{i=1}^{N} \nabla f_i(x)$.
• Gradient Descent:
$$\vec{x}^{(k+1)} = \vec{x}^{(k)} + t \nabla f(\vec{x}) = \vec{x}^{(k)} + t \sum_{i=1}^{N} \nabla f_i(\vec{x})$$
• SGD Algorithm:
  1. Choose a starting point x.
  2. While not converged:
     – Choose a step size t.
     – Choose i so that it sweeps through the training set.
     – Update $\vec{x}^{(k+1)} = \vec{x}^{(k)} + t \nabla f_i(\vec{x})$
• For CRF training Stochastic Meta Descent is even better (Vishwanathan, 2006).
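The SGD loop can be sketched on a toy objective. Here each $f_i(x) = -\frac{1}{2}(x - a_i)^2$, so the full objective is maximized at the mean of the $a_i$; the targets, the 1/k step-size schedule, and the fixed number of sweeps are all assumptions made for illustration.

```python
def sgd_maximize(grad_i, n_examples, x, n_sweeps=5):
    """Stochastic gradient ascent: one example's gradient per update."""
    k = 0
    for _ in range(n_sweeps):
        for i in range(n_examples):  # choose i so that it sweeps through the training set
            k += 1
            t = 1.0 / k              # hypothetical decaying step size
            x = x + t * grad_i(i, x)
    return x


# f_i(x) = -0.5 * (x - a_i)^2, so grad f_i(x) = a_i - x and the maximizer is mean(a).
targets = [1.0, 2.0, 6.0]
x_hat = sgd_maximize(lambda i, x: targets[i] - x, len(targets), x=0.0)
```

The update uses `+` because these notes phrase optimization as maximization; for a loss that is minimized the sign of the step flips.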
2.2 Additional readings
• PGM Appendix A.5: Continuous Optimization
• Boyd & Vandenberghe "Convex Optimization" http://www.stanford.edu/~boyd/cvxbook/
  – Chapter 9: Unconstrained Minimization
    ▪ 9.3 Gradient Descent
    ▪ 9.4 Steepest Descent
    ▪ 9.5 Newton's Method
• Intuitive explanation of Lagrange Multipliers (without the assumption of differentiability): http://www.umiacs.umd.edu/~resnik/ling848_fa2004/lagrange.html
2.3 Advanced readings
• "Overview of Quasi-Newton optimization methods" http://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf
• Shewchuk (1994) "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
• Conjugate Gradient Method: http://www.cs.iastate.edu/~cs577/handouts/conjugate-gradient.pdf
3 Third Review Session
3.1 Continuous Optimization (constrained)
3.1.1 Running example: MLE of a Multinomial
• Recall the pmf of the Categorical distribution:
  – support: $X \in \{0, \ldots, k\}$
  – pmf: $p(X = k) = \theta_k$
• Let $X_i \sim \text{Categorical}(\vec{\theta})$ for $1 \le i \le N$.
• The likelihood of all these is: $\prod_{i=1}^{N} \theta_{X_i} = \prod_{l=1}^{k} \theta_l^{N_l}$ where $N_l$ is the number of $X_i = l$.
• The log-likelihood is then: $LL(\vec{\theta}) = \sum_{l=1}^{k} N_l \log(\theta_l)$
• Suppose we want to find the maximum likelihood parameters: $\vec{\theta}_{MLE} = \arg\min_{\vec{\theta}} -LL(\vec{\theta})$ subject to the constraints $\sum_{l=1}^{k} \theta_l = 1$ and $0 \le \theta_l \; \forall l$.
Whiteboard
• CRF model
• CRF data log-likelihood
• CRF derivatives
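The whiteboard derivation is not in the slides. A standard form of the result, written with the parameterization $\psi_{\alpha} = \exp(\theta \cdot \mathbf{f}_{\alpha})$ used above, is sketched here; the index $j$ over feature dimensions is an assumed symbol.

```latex
\ell(\theta)
  = \sum_{n=1}^{N} \log p_{\theta}(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)})
  = \sum_{n=1}^{N} \Big[ \sum_{\alpha} \theta \cdot \mathbf{f}_{\alpha}(\mathbf{y}^{(n)}_{\alpha}, \mathbf{x}^{(n)})
      - \log Z(\mathbf{x}^{(n)}) \Big]

\frac{\partial \ell(\theta)}{\partial \theta_j}
  = \sum_{n=1}^{N} \sum_{\alpha} \Big[ f_{\alpha,j}(\mathbf{y}^{(n)}_{\alpha}, \mathbf{x}^{(n)})
      - \sum_{\mathbf{y}'_{\alpha}} p_{\theta}(\mathbf{y}'_{\alpha} \mid \mathbf{x}^{(n)}) \,
        f_{\alpha,j}(\mathbf{y}'_{\alpha}, \mathbf{x}^{(n)}) \Big]
```

The gradient is observed feature counts minus expected feature counts, and the expectation is taken under the factor marginals $p_{\theta}(\mathbf{y}'_{\alpha} \mid \mathbf{x}^{(n)})$, which is why training requires marginal inference.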
Practical Considerations for Gradient-based Methods
• Overfitting
  – L2 regularization
  – L1 regularization
  – Regularization by early stopping
• For SGD: Sparse updates
"Empirical" Comparison of Parameter Estimation Methods
• Example NLP task: CRF dependency parsing
• Suppose: Training time is dominated by inference
• Dataset: One million tokens
• Inference speed: 1,000 tokens / sec
• → 0.27 hours per pass through dataset

  Method   # passes through data to converge   # hours to converge
  GIS      1000+                               270
  L-BFGS   100+                                27
  SGD      10                                  ~3
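The table's arithmetic can be reproduced directly. The pass counts below take the lower bounds from the table ("1000+" read as 1000, and so on), which is an assumption.

```python
tokens = 1_000_000
tokens_per_sec = 1_000

# Seconds per pass, converted to hours (about 0.28; the slide writes 0.27).
hours_per_pass = tokens / tokens_per_sec / 3600

# Lower-bound pass counts from the table.
passes = {"GIS": 1000, "L-BFGS": 100, "SGD": 10}
hours = {method: n * hours_per_pass for method, n in passes.items()}
```

With the unrounded rate the totals come out slightly above the table's figures, which are consistent with multiplying the pass counts by the slide's 0.27.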
Exact inference for tree-structured factor graphs
BELIEF PROPAGATION
Inference for HMMs
• Sum-product BP on an HMM is called the forward-backward algorithm
• Max-product BP on an HMM is called the Viterbi algorithm
Inference for CRFs
• Sum-product BP on a CRF is called the forward-backward algorithm
• Max-product BP on a CRF is called the Viterbi algorithm
THE FORWARD-BACKWARD ALGORITHM
Learning and Inference Summary
For discrete variables:

                     Learning   Marginal Inference   MAP Inference
  HMM                           Forward-backward     Viterbi
  Linear-chain CRF              Forward-backward     Viterbi

Forward-Backward Algorithm
[Figure: tags Y1 Y2 Y3 over the words "find preferred tags"]
• find: could be verb or noun
• preferred: could be adjective or verb
• tags: could be noun or verb
Forward-Backward Algorithm
[Figure: trellis for "find preferred tags": START, then the candidate values v, n, a for each of Y1, Y2, Y3, then END]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Viterbi Algorithm: Most Probable Assignment
[Figure: trellis with the path v a n highlighted; labeled factors include ψ{3,4}(a,END) on an edge and ψ{3}(n) on a node]
• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
Viterbi Algorithm: Most Probable Assignment
[Figure: the same trellis and highlighted path]
• So p(v a n) = (1/Z) * product weight of one path
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis for "find preferred tags"; on successive slides all paths through Y2 = a, Y2 = n, and Y2 = v are highlighted]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n
• Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis with the path prefixes ending at Y2 = n highlighted]
α2(n) = total weight of these path prefixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis with the path suffixes starting at Y2 = n highlighted]
β2(n) = total weight of these path suffixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
[Figure: trellis with both prefixes and suffixes at Y2 = n highlighted]
α2(n) = total weight of these path prefixes (a + b + c)
β2(n) = total weight of these path suffixes (x + y + z)
Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths
Forward-Backward Algorithm: Finds Marginals
Oops! The weight of a path through a state also includes a weight at that state. So α(n)·β(n) isn't enough. The extra weight is the opinion of the unigram factor at this variable.
[Figure: trellis with "belief that Y2 = n" marked, α2(n) arriving from the left, β2(n) from the right, and ψ{2}(n) from below]
total weight of all paths through n = α2(n) × ψ{2}(n) × β2(n)
Forward-Backward Algorithm: Finds Marginals
[Figure: the same trellis with "belief that Y2 = v" and "belief that Y2 = n" marked, and α2(v), β2(v), ψ{2}(v) labeled]
total weight of all paths through v = α2(v) × ψ{2}(v) × β2(v)
Forward-Backward Algorithm: Finds Marginals
[Figure: the same trellis with the beliefs that Y2 = v, Y2 = n, and Y2 = a marked, and α2(a), β2(a), ψ{2}(a) labeled]
total weight of all paths through a = α2(a) × ψ{2}(a) × β2(a)
Beliefs at Y2: v 1.8, n 0, a 4.2
sum = Z (total weight of all paths)
Divide by Z = 6 to get marginal probs: v 0.3, n 0, a 0.7
CRF Tagging Model
[Figure: tags Y1 Y2 Y3 over the words "find preferred tags"]
• find: could be verb or noun
• preferred: could be adjective or verb
• tags: could be noun or verb
Whiteboard
• Forward-backward algorithm
• Viterbi algorithm
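The belief computation on the marginals slides can be checked numerically. The message and factor values below (α2 = 3, 1, 6; ψ{2} = 0.3, 0, 0.1; β2 = 2, 1, 7) are an assumed reading of the lecture's belief-propagation figure, chosen because they reproduce the stated beliefs 1.8, 0, 4.2 and Z = 6.

```python
TAGS = ["v", "n", "a"]
# Assumed reading of the figure's values for position 2.
alpha2 = {"v": 3, "n": 1, "a": 6}      # incoming forward message
psi2 = {"v": 0.3, "n": 0, "a": 0.1}    # unigram factor at Y2
beta2 = {"v": 2, "n": 1, "a": 7}       # incoming backward message

# Belief = alpha * psi * beta: total weight of all paths through each value of Y2.
belief = {y: alpha2[y] * psi2[y] * beta2[y] for y in TAGS}
Z = sum(belief.values())                # total weight of all paths
marginal = {y: belief[y] / Z for y in TAGS}
```

Up to floating-point rounding this gives beliefs of 1.8, 0, and 4.2, a total of 6, and marginals of 0.3, 0, and 0.7.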
CRF Tagging by Belief Propagation
Forward algorithm = message passing (matrix-vector products)
Backward algorithm = message passing (matrix-vector products)
[Figure: trellis for "find preferred tags" showing α messages (v 7, n 2, a 1) and (v 3, n 1, a 6), β messages (v 2, n 1, a 7) and (v 3, n 6, a 1), the unary table (v 0.3, n 0, a 0.1), the resulting belief (v 1.8, n 0, a 4.2), and the pairwise table below]
     v  n  a
  v  0  2  1
  n  2  1  0
  a  0  3  1
• Forward-backward is a message passing algorithm.
• It's the simplest case of belief propagation.
HMMS VS CRFS
Generative vs. Discriminative
Liang & Jordan (ICML 2008) compares HMM and CRF with identical features
• Dataset 1: (Real)
  – WSJ Penn Treebank (38K train, 5.5K test)
  – 45 part-of-speech tags
• Dataset 2: (Artificial)
  – Synthetic data generated from HMM learned on Dataset 1 (1K train, 1K test)
• Evaluation Metric: Accuracy
[Figure: bar chart of accuracy (y-axis 84% to 98%) for HMM and CRF on Dataset 1 and Dataset 2; bar labels 95.60%, 93.50%, 89.80%, 87.90%]
CRFs: some empirical results
• Parts of Speech tagging
  – Using same set of features: HMM >=< CRF > MEMM
  – Using additional overlapping features: CRF+ > MEMM+ >> HMM
(© Eric Xing @ CMU, 2005-2015)
SUMMARY
Summary: Learning and Inference
For discrete variables:
• HMM
  – Learning: MLE by counting
  – Marginal Inference: Forward-backward
  – MAP Inference: Viterbi
• Linear-chain CRF
  – Learning: Gradient based; doesn't decompose because of Z(x) and requires marginal inference
  – Marginal Inference: Forward-backward
  – MAP Inference: Viterbi
Summary: Models

                   Classification        Structured Prediction
  Generative       Naïve Bayes           HMM
  Discriminative   Logistic Regression   CRF