10-715 Advanced Intro. to Machine Learning

Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models + Conditional Random Fields

Matt Gormley Guest Lecture 2 Oct. 31, 2018

1. Data: $\mathcal{D} = \{ x^{(n)} \}_{n=1}^{N}$
   (Sample sentences with tags, drawn as factor graphs over tag variables T1, ..., T5 and word variables W1, ..., W5; e.g. Sample 1: time flies like an arrow (n v p d n); Sample 2: time flies like an arrow (n n v d n); Sample 3: flies fly with their wings (n v p n n); Sample 4: with time you will see (p n n v v).)

2. Model: $p(x \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \prod_{C \in \mathcal{C}} \psi_C(x_C)$

3. Objective: $\ell(\boldsymbol{\theta}; \mathcal{D}) = \sum_{n=1}^{N} \log p(x^{(n)} \mid \boldsymbol{\theta})$

4. Learning: $\boldsymbol{\theta}^{*} = \operatorname*{argmax}_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}; \mathcal{D})$

5. Inference:
   1. Marginal Inference: $p(x_C) = \sum_{x' : x'_C = x_C} p(x' \mid \boldsymbol{\theta})$
   2. Partition Function: $Z(\boldsymbol{\theta}) = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
   3. MAP Inference: $\hat{x} = \operatorname*{argmax}_{x} p(x \mid \boldsymbol{\theta})$

HIDDEN MARKOV MODEL (HMM)

Dataset for Supervised Part-of-Speech (POS) Tagging

Data: $\mathcal{D} = \{ (x^{(n)}, y^{(n)}) \}_{n=1}^{N}$

Sample 1: x^(1) = time flies like an arrow;  y^(1) = n v p d n

Sample 2: x^(2) = time flies like an arrow;  y^(2) = n n v d n

Sample 3: x^(3) = flies fly with their wings;  y^(3) = n v p n n

Sample 4: x^(4) = with time you will see;  y^(4) = p n n v v

Naïve Bayes for Time Series Data

We could treat each word-tag pair (i.e. token) as independent. This corresponds to a Naïve Bayes model with a single feature (the word).

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .1 * .5 * …)

Tag prior p(y):  v .1   n .8   p .2   d .2

Emission probabilities p(word | tag) (columns: time, flies, like, ...):
  v: .2  .5  .2
  n: .3  .4  .2
  p: .1  .1  .3
  d: .1  .2  .1

(Example sentence: "time flies like an arrow" with tags n v p d n.)

A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
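To make the product concrete, here is a minimal Python sketch (not from the slides) that assembles the joint probability from transition and emission tables like the ones shown below; the table orientation (rows = previous tag, columns = next tag / word) and the START row are illustrative assumptions.

```python
# Minimal sketch: HMM joint probability
#   p(y, x) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t)
# Numbers mirror the tables on this slide; the START row and the fallback for
# words not shown in the tables ("an", "arrow") are assumptions for illustration.
trans = {  # p(current tag | previous tag); rows = previous tag (assumed layout)
    "START": {"n": 0.5, "v": 0.2, "p": 0.2, "d": 0.1},  # assumed values
    "v": {"v": 0.1, "n": 0.4, "p": 0.2, "d": 0.3},
    "n": {"v": 0.8, "n": 0.1, "p": 0.1, "d": 0.0},
    "p": {"v": 0.2, "n": 0.3, "p": 0.2, "d": 0.3},
    "d": {"v": 0.2, "n": 0.8, "p": 0.0, "d": 0.0},
}
emit = {  # p(word | tag), for the three words whose columns are visible
    "v": {"time": 0.2, "flies": 0.5, "like": 0.2},
    "n": {"time": 0.3, "flies": 0.4, "like": 0.2},
    "p": {"time": 0.1, "flies": 0.1, "like": 0.3},
    "d": {"time": 0.1, "flies": 0.2, "like": 0.1},
}

def hmm_joint(tags, words):
    """One transition factor and one emission factor per position."""
    prob, prev = 1.0, "START"
    for tag, word in zip(tags, words):
        prob *= trans[prev][tag] * emit[tag].get(word, 1e-3)  # tiny default for unseen words
        prev = tag
    return prob

print(hmm_joint(["n", "v", "p", "d", "n"], "time flies like an arrow".split()))
```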

Transition probabilities p(y_t | y_{t-1}) (rows: previous tag; columns: v, n, p, d):
  v: .1  .4  .2  .3
  n: .8  .1  .1  0
  p: .2  .3  .2  .3
  d: .2  .8  0   0

Emission probabilities p(x_t | y_t) (columns: time, flies, like, ...):
  v: .2  .5  .2
  n: .3  .4  .2
  p: .1  .1  .3
  d: .1  .2  .1

(Example sentence: "time flies like an arrow" with tags n v p d n.)

From Mixture Model to HMM

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

“Naïve Bayes”:

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

HMM:

From Mixture Model to HMM

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

“Naïve Bayes”:

Y0 Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

HMM:

SUPERVISED LEARNING FOR HMMS

10 Hidden Markov Model

HMM Parameters:

Initial probabilities: O .8   S .1   C .1

Transition probabilities (rows: previous state; columns: O, S, C):
  O: .9   .08  .02
  S: .2   .7   .1
  C: .9   0    .1

Y1 Y2 Y3 Y4 Y5
X1 X2 X3 X4 X5

Emission probabilities (columns: 1min, 2min, 3min):
  O: .1    .2    .3
  S: .01   .02   .03
  C: 0     0     0

Hidden Markov Model

HMM Parameters:

Assumption: y0 = START

Generative Story: draw each tag from the transition distribution given the previous tag, then draw each word from the emission distribution given its tag (a sketch follows below). For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption.
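A hedged sketch of that generative process in Python; the dictionary-based parameterization and the name A for the emission matrix are illustrative assumptions (the text above only names B and C).

```python
import random

def sample_hmm(B, A, T):
    """Sketch of the HMM generative story with y0 = START:
    for t = 1..T, draw y_t ~ p(. | y_{t-1}) from the transition matrix B
    (initial probabilities folded in), then x_t ~ p(. | y_t) from A."""
    def draw(dist):
        outcomes, probs = zip(*dist.items())
        return random.choices(outcomes, weights=probs, k=1)[0]

    y_prev, tags, words = "START", [], []
    for _ in range(T):
        y_t = draw(B[y_prev])  # transition step
        x_t = draw(A[y_t])     # emission step
        tags.append(y_t)
        words.append(x_t)
        y_prev = y_t
    return tags, words
```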

Y0 Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5 13 Hidden Markov Model

Joint Distribution: y0 = START
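Written out, this is the standard HMM factorization under the y0 = START convention:

$$p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t), \qquad y_0 = \text{START}$$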

Y0 Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5 14 Training HMMs

Whiteboard – (Supervised) Likelihood for an HMM – Maximum Likelihood Estimation (MLE) for HMM

15 Supervised Learning for HMMs Learning an HMM decomposes into solving two (independent) Mixture Models

Yt Yt+1

Yt

Xt

16 Supervised Learning for HMMs Learning an HMM decomposes into solving two (independent) Mixture Models

Yt Yt+1

Yt

Xt
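Concretely, the supervised MLE for each of the two "mixture models" above is obtained by counting and normalizing; a minimal sketch (function and variable names are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def mle_hmm(tag_seqs, word_seqs):
    """Supervised MLE for an HMM: transition and emission distributions are
    estimated independently, each by normalized counts (with y0 = START)."""
    trans_counts = defaultdict(Counter)  # previous tag -> Counter over next tags
    emit_counts = defaultdict(Counter)   # tag -> Counter over words
    for tags, words in zip(tag_seqs, word_seqs):
        prev = "START"
        for tag, word in zip(tags, words):
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(count_table):
        return {ctx: {k: v / sum(c.values()) for k, v in c.items()}
                for ctx, c in count_table.items()}

    return normalize(trans_counts), normalize(emit_counts)
```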

HMMs: History
• Markov chains: Andrey Markov (1906)
  – Random walks and Brownian motion
• Used in Shannon's work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's.
  – Used mainly for speech in 60s-70s.
• Late 80's and 90's: David Haussler (major player in learning theory in 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag/Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …

Slide from William Cohen

Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5 • 2nd-order HMM (i.e. trigram HMM)

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5 • 3rd-order HMM

Y1 Y2 Y3 Y4 Y5

X1 X2 X3 X4 X5

20 BACKGROUND: MESSAGE PASSING

21 Great Ideas in ML: Message Passing Count the soldiers there's 1 of me 1 2 3 4 5 before before before before before you you you you you

5 4 3 2 1 behind behind behind behind behind you you you you you

adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing (Count the soldiers)

There's 1 of me. I only see my incoming messages: "2 before you" and "3 behind you".
Belief: Must be 2 + 1 + 3 = 6 of us.

adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing (Count the soldiers)

There's 1 of me. I only see my incoming messages: "1 before you" and "4 behind you".
Belief: Must be 1 + 1 + 4 = 6 of us. (The neighboring soldier's belief: Must be 2 + 1 + 3 = 6 of us.)

24 adapted from MacKay (2003) textbook Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree

3 here

7 here 1 of me

11 here (= 7+3+1)

25 adapted from MacKay (2003) textbook Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree

3 here

7 here (= 3+3+1)

3 here

26 adapted from MacKay (2003) textbook Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree

11 here (= 7+3+1)

7 here

3 here

27 adapted from MacKay (2003) textbook Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree

3 here

7 here

3 here

Belief: Must be 14 of us

28 adapted from MacKay (2003) textbook Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree

3 here

7 here

3 here

Belief: Must be 14 of us

29 adapted from MacKay (2003) textbook THE FORWARD-BACKWARD ALGORITHM

30 Inference for HMMs

Whiteboard
– Three Inference Problems for an HMM
  1. Evaluation: Compute the probability of a given sequence of observations
  2. Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
  3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
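In symbols, for an observation sequence x and hidden tag sequence y:

1. Evaluation: $p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y})$
2. Decoding: $\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$
3. Marginals: $p(y_t \mid \mathbf{x})$ for each position $t$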

Dataset for Supervised Part-of-Speech (POS) Tagging

Data: $\mathcal{D} = \{ (x^{(n)}, y^{(n)}) \}_{n=1}^{N}$

Sample 1: x^(1) = time flies like an arrow;  y^(1) = n v p d n

Sample 2: x^(2) = time flies like an arrow;  y^(2) = n n v d n

Sample 3: x^(3) = flies fly with their wings;  y^(3) = n v p n n

Sample 4: x^(4) = with time you will see;  y^(4) = p n n v v

32 Hidden Markov Model A Hidden Markov Model (HMM) provides a joint distribution over the the sentence/tags with an assumption of dependence between adjacent tags.

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)

v n p d v n p d v .1 .4 .2 .3 v .1 .4 .2 .3 n .8 .1 .1 0 n .8 .1 .1 0 p .2 .3 .2 .3 p .2 .3 .2 .3 d .2 .8 0 0 d .2 .8 0 0

n v p d n … … like like flies flies time time flies time like an arrow v .2 .5 .2 v .2 .5 .2 n .3 .4 .2 n .3 .4 .2 p .1 .1 .3 p .1 .1 .3 d .1 .2 .1 d .1 .2 .1 34 Forward-Backward Algorithm

Y1 Y2 Y3

X1 X2 X3

find preferred tags

Could be verb or noun Could be adjective or verb Could be noun or verb

35 Forward-Backward Algorithm

Y1 Y2 Y3

X1 X2 X3 find preferred tags

36 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • Let’s show the possible values for each variable • One possible assignment • And what the 7 factors think of it … 37 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • Let’s show the possible values for each variable • One possible assignment • And what the 7 factors think of it … 38 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • Let’s show the possible values for each variable • One possible assignment • And what the 7 factors think of it … 39 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • Let’s show the possible values for each variable • One possible assignment • And what the 7 transition / emission factors think of it … 40 Forward-Backward Algorithm

v n a Y Y Y 1 v 1 6 4 2 3 n 8 4 0.1 v a 0.1 8 0 v v

START n n n END

… a a a find tags pref. v 3 5 3 n 4 5 2 a 0.1 0.2 0.1

X1 X2 X3 find preferred tags • Let’s show the possible values for each variable • One possible assignment • And what the 7 transition / emission factors think of it … 41 Viterbi Algorithm: Most Probable Assignment

Y1 Y2 Y3

v v v

B(a,END)

START n n n END A(tags,n)

a a a

X1 X2 X3 find preferred tags • So p(v a n) = (1/Z) * product of 7 numbers • Numbers associated with edges and nodes of path • Most probable assignment = path with highest product42 Viterbi Algorithm: Most Probable Assignment

Y1 Y2 Y3

v v v

B(a,END)

START n n n END A(tags,n)

a a a

X1 X2 X3 find preferred tags • So p(v a n) = (1/Z) * product weight of one path
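A minimal sketch of the Viterbi dynamic program that finds this highest-product path; the dictionary-based transition/emission weights and the absence of an explicit END factor are simplifications for illustration.

```python
def viterbi(words, tags, trans, emit):
    """Most probable tag sequence: max-product dynamic programming over the lattice.
    trans[prev][tag] and emit[tag][word] play the role of the edge/node weights."""
    # delta[t][tag] = weight of the best path ending in `tag` at position t
    delta = [{t: trans["START"][t] * emit[t][words[0]] for t in tags}]
    backptr = []
    for word in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[-1][p] * trans[p][t])
            scores[t] = delta[-1][best_prev] * trans[best_prev][t] * emit[t][word]
            ptrs[t] = best_prev
        delta.append(scores)
        backptr.append(ptrs)
    # trace back from the best final tag
    best = max(tags, key=lambda t: delta[-1][t])
    path = [best]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```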

43 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a 44 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n 45 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v 46 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n 47 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags

α2(n) = total weight of these path prefixes

48 (found by dynamic programming: matrix-vector products) Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags

b2(n) = total weight of these path suffixes

49 (found by dynamic programming: matrix-vector products) Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

X1 X2 X3 find preferred tags

α2(n) = total weight of these path prefixes (a + b + c)
β2(n) = total weight of these path suffixes (x + y + z)
Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

Oops! The weight of a path through a state also includes a weight at that state. So α2(n) · β2(n) isn't enough. The extra weight is the opinion of the emission probability at this variable, A(pref., n). Multiplying it in gives the "belief that Y2 = n".

X1 X2 X3 find preferred tags

total weight of all paths through n = α2(n) × A(pref., n) × β2(n)

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v   ("belief that Y2 = v")
START n n n END   ("belief that Y2 = n")   α2(v), β2(v)
a a a

A(pref., v)

X1 X2 X3 find preferred tags

total weight of all paths through v = α2(v) × A(pref., v) × β2(v)

Forward-Backward Algorithm: Finds Marginals

Beliefs at Y2 ("belief that Y2 = v / n / a"): v 0.1, n 0, a 0.4. Divide by Z = 0.5 to get marginal probs: v 0.2, n 0, a 0.8.

sum of beliefs = Z (total weight of all paths)   A(pref., a)

X1 X2 X3 find preferred tags

total weight of all paths through a = α2(a) × A(pref., a) × β2(a)

Forward-Backward Algorithm

Y1 Y2 Y3

X1 X2 X3

find preferred tags

Could be verb or noun Could be adjective or verb Could be noun or verb

54 Inference for HMMs

Whiteboard – Derivation of Forward algorithm – Forward-backward algorithm – Viterbi algorithm

55 Inference in HMMs What is the computational complexity of inference for HMMs?

• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T)

• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T·K²)
  – Thanks to dynamic programming!
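A minimal sketch of that O(T·K²) sum-product dynamic program; unlike the lattice slides, the emission weight at each position is folded into α here, so the marginal is simply α·β/Z. Names and the dictionary parameterization are illustrative.

```python
def forward_backward(words, tags, trans, emit):
    """alpha[t][y]: total weight of path prefixes ending in y (emission at t included).
    beta[t][y]: total weight of path suffixes after position t.
    Marginal p(Y_t = y | x) = alpha[t][y] * beta[t][y] / Z."""
    T = len(words)
    alpha = [{y: trans["START"][y] * emit[y][words[0]] for y in tags}]
    for t in range(1, T):
        alpha.append({y: emit[y][words[t]] *
                         sum(alpha[t - 1][yp] * trans[yp][y] for yp in tags)
                      for y in tags})
    beta = [dict.fromkeys(tags, 1.0) for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = {y: sum(trans[y][yn] * emit[yn][words[t + 1]] * beta[t + 1][yn]
                          for yn in tags)
                   for y in tags}
    Z = sum(alpha[T - 1][y] for y in tags)  # total weight of all paths
    return [{y: alpha[t][y] * beta[t][y] / Z for y in tags} for t in range(T)], Z
```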

59 Conditional Random Fields (CRFs) for time series data LINEAR-CHAIN CRFS

60 Shortcomings of Hidden Markov Models

START Y1 Y2 … … … Yn

X1 X2 … … … Xn

• HMM models capture dependencies between each state and only its corresponding observation
  – NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
  – HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)

© Eric Xing @ CMU, 2005-2015

Conditional Random Field (CRF)

Conditional distribution over the tags given the words. The factors and Z are now specific to the sentence w.

p(n, v, p, d, n | time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Transition potentials ψ(y_{t-1}, y_t) (rows: previous tag; columns: v, n, p, d):
  v: 1    6   3   4
  n: 8    4   2   0.1
  p: 1    3   1   3
  d: 0.1  8   0   0

Factor graph: ψ0  n  ψ2  v  ψ4  p  ψ6  d  ψ8  n, with emission factors ψ1, ψ3, ψ5, ψ7, ψ9 below each tag.

Emission potentials, e.g. ψ1 (for "time"): v 3, n 4, p 0.1, d 0.1;  ψ3 (for "flies"): v 5, n 5, p 0.1, d 0.2

time flies like an arrow 62 Conditional Random Field (CRF)

Recall: Shaded nodes in a are observed

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

v 3 ψ1 ψ3 v 5 ψ5 ψ7 ψ9 n 4 n 5 p 0.1 p 0.1 d 0.1 d 0.2

X1 X2 X3 X4 X5 63 Conditional Random Field (CRF) This linear-chain CRF is just like an HMM, except that its factors are not necessarily probability distributions

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \psi_{\mathrm{em}}(y_k, x_k)\,\psi_{\mathrm{tr}}(y_k, y_{k-1}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{em}}(y_k, x_k)\big)\,\exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{tr}}(y_k, y_{k-1})\big)$$

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

X1 X2 X3 X4 X5

Exercise

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \psi_{\mathrm{em}}(y_k, x_k)\,\psi_{\mathrm{tr}}(y_k, y_{k-1}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{em}}(y_k, x_k)\big)\,\exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{tr}}(y_k, y_{k-1})\big)$$

Multiple Choice: Which model does the above distribution share the most in common with?
A. Hidden Markov Model
B. Bernoulli Naïve Bayes
C. Gaussian Naïve Bayes
D. Logistic Regression

65 Conditional Random Field (CRF) This linear-chain CRF is just like an HMM, except that its factors are not necessarily probability distributions

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \psi_{\mathrm{em}}(y_k, x_k)\,\psi_{\mathrm{tr}}(y_k, y_{k-1}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{em}}(y_k, x_k)\big)\,\exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{tr}}(y_k, y_{k-1})\big)$$

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

X1 X2 X3 X4 X5 66 Conditional Random Field (CRF)

• That is the vector X • Because it’s observed, we can condition on it for free • Conditioning is how we converted from the MRF to the CRF (i.e. when taking a slice of the emission factors)

v n p d v n p d v 1 6 3 4 v 1 6 3 4 n 8 4 2 0.1 n 8 4 2 0.1 p 1 3 1 3 p 1 3 1 3 d 0.1 8 0 0 d 0.1 8 0 0

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

v 3 ψ1 ψ3 v 5 ψ5 ψ7 ψ9 n 4 n 5 p 0.1 p 0.1 d 0.1 d 0.2

X 67 Conditional Random Field (CRF)

• This is the standard linear-chain CRF definition • It permits rich, overlapping features of the vector X

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \psi_{\mathrm{em}}(y_k, \mathbf{x})\,\psi_{\mathrm{tr}}(y_k, y_{k-1}, \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{em}}(y_k, \mathbf{x})\big)\,\exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{tr}}(y_k, y_{k-1}, \mathbf{x})\big)$$

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

X 68 Conditional Random Field (CRF)

• This is the standard linear-chain CRF definition • It permits rich, overlapping features of the vector X

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \psi_{\mathrm{em}}(y_k, \mathbf{x})\,\psi_{\mathrm{tr}}(y_k, y_{k-1}, \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{k=1}^{K} \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{em}}(y_k, \mathbf{x})\big)\,\exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\mathrm{tr}}(y_k, y_{k-1}, \mathbf{x})\big)$$

ψ0 Y1 ψ2 Y2 ψ4 Y3 ψ6 Y4 ψ8 Y5

ψ1 ψ3 ψ5 ψ7 ψ9

Visual Notation: Usually we draw a CRF without showing the variable corresponding to X.

Whiteboard

• Forward-backward algorithm for linear-chain CRF

70 General CRF

The topology of the graphical model for a CRF doesn't have to be a chain.

$$p_{\boldsymbol{\theta}}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x}, \boldsymbol{\theta})} \prod_{\alpha} \psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \boldsymbol{\theta})$$

[Factor graph over Y1, ..., Y9 with factors such as ψ{1,2}, ψ{2,3}, ψ{3,4}, ψ{2,7,8}, ψ{3,6,7}, ψ{1,8,9} and unary factors ψ{1}, ψ{2}, ψ{3}, ...; observed words: time flies like an ...]

Standard CRF Parameterization

$$p_{\boldsymbol{\theta}}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x}, \boldsymbol{\theta})} \prod_{\alpha} \psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \boldsymbol{\theta})$$

Define each potential function in terms of a fixed set of feature functions:

$$\psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \boldsymbol{\theta}) = \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x})\big)$$

(Here $\mathbf{y}_{\alpha}$ are the predicted variables and $\mathbf{x}$ the observed variables.)

72 Standard CRF Parameterization

Define each potential function in terms of a fixed set of feature functions:

$$\psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \boldsymbol{\theta}) = \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x})\big)$$
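To illustrate what "a fixed set of feature functions" might look like, here is a hedged sketch of an emission-style feature function and the resulting potential; the particular features (lowercased word, capitalization, 3-character suffix) are common POS-tagging choices assumed for the example, not ones prescribed by the slide.

```python
import math

def emission_features(tag, x, t):
    """Example f(y_t, x, t): rich, overlapping features of the whole observed
    sentence x are allowed because x is conditioned on."""
    word = x[t]
    return {
        f"tag={tag},word={word.lower()}": 1.0,
        f"tag={tag},capitalized={word[0].isupper()}": 1.0,
        f"tag={tag},suffix3={word[-3:].lower()}": 1.0,
    }

def potential(theta, features):
    """psi(y_alpha, x; theta) = exp(theta . f(y_alpha, x))."""
    return math.exp(sum(theta.get(name, 0.0) * value for name, value in features.items()))

# e.g. potential({"tag=n,word=time": 1.2}, emission_features("n", ["time", "flies"], 0))
```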

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9

time flies like an arrow

73 Standard CRF Parameterization

Define each potential function in terms of a fixed set of feature functions:

$$\psi_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x}; \boldsymbol{\theta}) = \exp\!\big(\boldsymbol{\theta}^\top \mathbf{f}_{\alpha}(\mathbf{y}_{\alpha}, \mathbf{x})\big)$$

ψ13 vp

ψ12 pp

ψ11 np

ψ10

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9 time flies like an arrow 74 SUPERVISED LEARNING FOR CRFS

75 What is Training?

That’s easy:

Training = picking good model parameters!

But how do we know if the model parameters are any “good”?

Machine Learning: Log-likelihood Training

1. Choose model (such that the derivative in #3 is easy)
2. Choose objective: assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule
4. Replace exact inference by approximate inference

Machine Learning: Log-likelihood Training

1. Choose model (such that the derivative in #3 is easy)
2. Choose objective: assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule
4. Compute the marginals by exact inference (note that these are factor marginals, which are just the (normalized) factor beliefs from BP!)

78 Recipe for Gradient-based Learning

1. Write down the objective function
2. Compute the partial derivatives of the objective (i.e. gradient, and maybe Hessian)
3. Feed objective function and derivatives into black box (Optimization)
4. Retrieve optimal parameters from black box (a concrete sketch follows below)
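As a concrete illustration of treating the optimizer as a black box, here is a hedged sketch using SciPy's L-BFGS; the objective and gradient below are stand-in placeholders for whatever you derived in steps 1 and 2 (e.g. the regularized negative CRF log-likelihood and its gradient).

```python
import numpy as np
from scipy.optimize import minimize

def objective(theta):
    """Step 1 placeholder: e.g. negative log-likelihood plus an L2 regularizer."""
    return 0.5 * float(np.dot(theta, theta))

def gradient(theta):
    """Step 2 placeholder: partial derivatives of the objective."""
    return theta

theta0 = np.zeros(10)                                                   # initial parameters
result = minimize(objective, theta0, jac=gradient, method="L-BFGS-B")   # step 3: black box
theta_star = result.x                                                   # step 4: optimal parameters
```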

79 Optimization Algorithms

What is the black box?
• Newton's method
• Hessian-free / Quasi-Newton methods
  – Conjugate gradient
  – L-BFGS
• Stochastic gradient methods
  – Stochastic gradient descent (SGD)
  – Stochastic meta-descent
  – AdaGrad

2.1.5 Newton-Raphson (Newton's method, a second-order method)

• From our introductory example, we know that we can find the solution to a quadratic function analytically. Yet gradient descent may take many steps to converge to that optimum. The motivation behind Newton's method is to use a quadratic approximation of our function to make a good guess about where we should step next.
• Definition: the Hessian of an n-dimensional function is the matrix of partial second derivatives with respect to each pair of dimensions.

$$\nabla^2 f(\vec{x}) = \begin{bmatrix} \frac{\partial^2 f(\vec{x})}{\partial x_1^2} & \frac{\partial^2 f(\vec{x})}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f(\vec{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\vec{x})}{\partial x_2^2} & \\ \vdots & & \ddots \end{bmatrix}$$

• Consider the second-order Taylor series expansion of f at x:
  $$\hat{g}(v) = \hat{f}(x + v) = f(x) + \nabla f(x)^\top v + \tfrac{1}{2} v^\top \nabla^2 f(x)\, v$$
• We want to find the v that maximizes ĝ(v). This maximizer is called Newton's step: $\Delta x_{\mathrm{nt}} = \operatorname*{argmax}_{v} \hat{g}(v)$.
• Algorithm:
  1. Choose a starting point x.
  2. While not converged:
     – Compute Newton's step $\Delta x_{\mathrm{nt}} = -\big(\nabla^2 f(x)\big)^{-1} \nabla f(x)$
     – Update $x^{(k+1)} = x^{(k)} + \Delta x_{\mathrm{nt}}$
• Intuition:
  – If f(x) is quadratic, $x + \Delta x_{\mathrm{nt}}$ exactly maximizes f.
  – ĝ(v) is a good quadratic approximation to the function f near the point x. So if f(x) is locally quadratic, then f(x) is locally well approximated by ĝ(v). See Figure 9.17 in Boyd and Vandenberghe.
• In most presentations, Newton-Raphson is presented as a minimization algorithm; the Newton step there takes the same form, with the Hessian positive definite near a minimum.
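A minimal sketch of that update, written in the more common minimization convention; the gradient and Hessian functions are assumed to be supplied by the caller.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton-Raphson for minimization: the step solves H(x) @ step = -grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:            # converged
            break
        step = np.linalg.solve(hess(x), -g)    # avoids forming the explicit inverse
        x = x + step
    return x
```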

2.1.6 Quasi-Newton methods (L-BFGS)

• What if we have n = millions of features? The Hessian matrix H = ∇²f(x) is too large: n² entries.
• Quasi-Newton methods approximate the Hessian.
• Limited-memory BFGS stores only a history of the last k updates to $\vec{x}$ and $\nabla f(\vec{x})$; k is usually small (e.g. k = 10). This history is used to approximate the Hessian-vector product.
• Optimization has nearly become a technology. Almost every language has many generic optimization routines built in that you can use out of the box.

2.1.7 Stochastic Gradient Descent

• Suppose we have N training examples s.t. $f(x) = \sum_{i=1}^{N} f_i(x)$.
• This implies that $\nabla f(x) = \sum_{i=1}^{N} \nabla f_i(x)$.
• Gradient Descent: $\vec{x}^{(k+1)} = \vec{x}^{(k)} + t\, \nabla f(\vec{x}^{(k)}) = \vec{x}^{(k)} + t \sum_{i=1}^{N} \nabla f_i(\vec{x}^{(k)})$
• SGD Algorithm:
  1. Choose a starting point x.
  2. While not converged:
     – Choose a step size t.
     – Choose i so that it sweeps through the training set.
     – Update $\vec{x}^{(k+1)} = \vec{x}^{(k)} + t\, \nabla f_i(\vec{x}^{(k)})$
• For CRF training, Stochastic Meta-Descent is even better (Vishwanathan, 2006).
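And a corresponding sketch of the SGD loop, written here in the minimization convention with a constant step size chosen purely for illustration; grad_fi(x, i) is assumed to return the gradient of the single-example term f_i.

```python
import numpy as np

def sgd_minimize(grad_fi, n_examples, x0, step=0.01, n_epochs=10):
    """Stochastic gradient descent: sweep through the examples, updating with
    the gradient of one f_i at a time."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):
        for i in np.random.permutation(n_examples):  # one sweep through the data
            x = x - step * grad_fi(x, i)
    return x
```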

2.2 Additional readings

• PGM Appendix A.5: Continuous Optimization
• Boyd & Vandenberghe, "Convex Optimization" http://www.stanford.edu/~boyd/cvxbook/
  Chapter 9: Unconstrained Minimization
  – 9.3 Gradient Descent
  – 9.4 Steepest Descent
  – 9.5 Newton's Method
• Intuitive explanation of Lagrange Multipliers (without the assumption of differentiability): http://www.umiacs.umd.edu/~resnik/ling848_fa2004/lagrange.html

2.3 Advanced readings

• "Overview of Quasi-Newton optimization methods" http://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf
• Shewchuk (1994) "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
• Conjugate Gradient Method: http://www.cs.iastate.edu/~cs577/handouts/conjugate-gradient.pdf

3 Third Review Session

3.1 Continuous Optimization (constrained)

3.1.1 Running example: MLE of a Multinomial

• Recall the pmf of the Categorical distribution:
  – support: $X \in \{0, \ldots, k\}$
  – pmf: $p(X = k) = \theta_k$
• Let $X_i \sim \mathrm{Categorical}(\vec{\theta})$ for $1 \le i \le N$.
• The likelihood of all these is $\prod_{i=1}^{N} \theta_{X_i} = \prod_{l=1}^{k} \theta_l^{N_l}$, where $N_l$ is the number of $X_i = l$.
• The log-likelihood is then $LL(\vec{\theta}) = \sum_{l=1}^{k} N_l \log(\theta_l)$.
• Suppose we want to find the maximum likelihood parameters: $\vec{\theta}_{\mathrm{MLE}} = \operatorname*{argmax}_{\vec{\theta}} LL(\vec{\theta})$ subject to the constraints $\sum_{l=1}^{k} \theta_l = 1$ and $0 \le \theta_l \;\; \forall l$.
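Solving this constrained problem (e.g. with a Lagrange multiplier for the sum-to-one constraint) gives the familiar closed form

$$\theta_{l,\mathrm{MLE}} = \frac{N_l}{\sum_{l'=1}^{k} N_{l'}} = \frac{N_l}{N},$$

i.e. the empirical frequency of outcome $l$.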

7 Whiteboard

• CRF model • CRF data log-likelihood • CRF derivatives

82 Practical Considerations for Gradient-based Methods • Overfitting – L2 regularization – L1 regularization – Regularization by early stopping • For SGD: Sparse updates

83 “Empirical” Comparison of Parameter Estimation Methods

• Example NLP task: CRF dependency parsing
• Suppose: training time is dominated by inference
• Dataset: one million tokens
• Inference speed: 1,000 tokens / sec
• ⇒ 0.27 hours per pass through the dataset

Method    # passes through data to converge    # hours to converge
GIS       1000+                                270
L-BFGS    100+                                 27
SGD       10                                   ~3

84 Exact inference for tree-structured factor graphs

85 Inference for HMMs

• Sum-product BP on an HMM is called the forward-backward algorithm • Max-product BP on an HMM is called the Viterbi algorithm

86 Inference for CRFs

• Sum-product BP on a CRF is called the forward-backward algorithm • Max-product BP on a CRF is called the Viterbi algorithm

87 THE FORWARD-BACKWARD ALGORITHM

88 Learning and Inference Summary

For discrete variables:

HMM: Marginal Inference = Forward-backward; MAP Inference = Viterbi
Linear-chain CRF: Marginal Inference = Forward-backward; MAP Inference = Viterbi

89 Forward-Backward Algorithm

Y1 Y2 Y3

find preferred tags

Could be verb or noun Could be adjective or verb Could be noun or verb

90 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • Show the possible values for each variable

91 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • Let’s show the possible values for each variable • One possible assignment 92 Forward-Backward Algorithm

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • Let’s show the possible values for each variable • One possible assignment • And what the 7 factors think of it … 93 Viterbi Algorithm: Most Probable Assignment

Y1 Y2 Y3

v v v

ψ{3,4}(a,END) START n n n END

ψ{3}(n)

a a a

find preferred tags • So p(v a n) = (1/Z) * product of 7 numbers • Numbers associated with edges and nodes of path • Most probable assignment = path with highest product94 Viterbi Algorithm: Most Probable Assignment

Y1 Y2 Y3

v v v

ψ{3,4}(a,END) START n n n END

ψ{3}(n)

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path

95 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a 96 Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags • So p(v a n) = (1/Z) * product weight of one path • Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags

α2(n) = total weight of these path prefixes

100 (found by dynamic programming: matrix-vector products) Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags

b2(n) = total weight of these path suffixes

101 (found by dynamic programming: matrix-vector products) Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v v

START n n n END

a a a

find preferred tags

α2(n) = total weight of these path prefixes (a + b + c)
β2(n) = total weight of these path suffixes (x + y + z)
Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

Oops! The weight of a path through a state also includes a weight at that state. So α2(n) · β2(n) isn't enough. The extra weight is the opinion of the unigram factor at this variable, ψ{2}(n). Multiplying it in gives the "belief that Y2 = n".

find preferred tags

total weight of all paths through n = α2(n) × ψ{2}(n) × β2(n)

Forward-Backward Algorithm: Finds Marginals

Y1 Y2 Y3

v v “beliefv that Y2 = v”

START n n “beliefn that Y2 = n” END

α2(v) b2(v)

a a a

ψ{2}(v)

find preferred tags

total weight of all paths through v

= α2(v)× ψ{2}(v)× b2(v) 104 Forward-Backward Algorithm: Finds Marginals

Beliefs at Y2 ("belief that Y2 = v / n / a"): v 1.8, n 0, a 4.2. Divide by Z = 6 to get marginal probs: v 0.3, n 0, a 0.7.

sum of beliefs = Z (total probability of all paths)   ψ{2}(a)

find preferred tags

total weight of all paths through a = α2(a) × ψ{2}(a) × β2(a)

CRF Tagging Model

Y1 Y2 Y3

find preferred tags

Could be verb or noun Could be adjective or verb Could be noun or verb

106 Whiteboard

• Forward-backward algorithm • Viterbi algorithm

107 CRF Tagging by Belief Propagation

Forward algorithm = message passing (matrix-vector products). Backward algorithm = message passing (matrix-vector products).

[Lattice figure: the α message and β message arriving at Y2 combine with the transition factor to give the belief at Y2 (v 1.8, n 0, a 4.2).]

find preferred tags
• Forward-backward is a message passing algorithm.
• It's the simplest case of belief propagation.

HMMS VS CRFS

128 Generative vs. Discriminative

Liang & Jordan (ICML 2008) compares HMM and CRF with identical features.
• Dataset 1 (Real): WSJ Penn Treebank (38K train, 5.5K test); 45 part-of-speech tags
• Dataset 2 (Artificial): synthetic data generated from an HMM learned on Dataset 1 (1K train, 1K test)
• Evaluation Metric: Accuracy

[Bar chart of accuracies: Dataset 1: HMM 93.50%, CRF 95.60%; Dataset 2: HMM 89.80%, CRF 87.90%]

129 CRFs: some empirical results

• Parts of Speech tagging

– Using same set of features: HMM >=< CRF > MEMM – Using additional overlapping features: CRF+ > MEMM+ >> HMM

© Eric Xing @ CMU, 2005-2015 130 SUMMARY

131 Summary: Learning and Inference

For discrete variables:

HMM: Learning = MLE by counting; Marginal Inference = Forward-backward; MAP Inference = Viterbi
Linear-chain CRF: Learning = gradient based (doesn't decompose because of Z(x) and requires marginal inference); Marginal Inference = Forward-backward; MAP Inference = Viterbi

132 Summary: Models

Generative: Naïve Bayes (classification), HMM
Discriminative: Logistic Regression (classification), CRF

133