Predicting Sequences: Structured Perceptron

CS 6355: Structured Prediction

1 Conditional Random Fields summary

• An undirected graphical model
  – Decompose the score over the structure into a collection of factors
  – Each factor assigns a score to the assignment of the random variables it is connected to

• Training and prediction
  – Final prediction via argmax_y w^T φ(x, y)
  – Train by maximum (regularized) likelihood

• Connections to other models
  – Effectively a linear classifier
  – A generalization of logistic regression to structures
  – A conditional variant of a Markov Random Field
    • We will see this soon
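To make this summary concrete, here is a minimal sketch (not from the lecture) of scoring, predicting, and normalizing with a chain CRF over a tiny label set. The label set, the feature map, and the brute-force enumeration are illustrative assumptions; a real implementation would use Viterbi and forward–backward instead of enumerating all label sequences.

import itertools
import math
from collections import Counter

LABELS = ["Det", "Noun", "Verb"]          # toy label set (assumption)

def phi(x, y):
    """Decomposed features: one emission and one transition factor per position."""
    feats = Counter()
    prev = "<START>"
    for word, tag in zip(x, y):
        feats[("emit", word, tag)] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats

def score(w, x, y):
    """w^T phi(x, y): the global score is a sum of factor scores."""
    return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())

def predict(w, x):
    """Final prediction via argmax_y w^T phi(x, y) (brute force here, Viterbi in practice)."""
    return max(itertools.product(LABELS, repeat=len(x)), key=lambda y: score(w, x, y))

def conditional_probability(w, x, y):
    """P(y | x) = exp(w^T phi(x, y)) / Z(x), the quantity maximized during training."""
    Z = sum(math.exp(score(w, x, y2)) for y2 in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(w, x, tuple(y))) / Z

For instance, with a single positive weight on ("emit", "dog", "Noun"), predict(w, ["the", "dog"]) would favor any tagging that labels "dog" as Noun.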

2 Global features

The feature function decomposes over the sequence

[Figure: a chain of output variables y_0, y_1, y_2, y_3 connected to the input x]

φ(x, y) = φ(x, y_0, y_1) + φ(x, y_1, y_2) + φ(x, y_2, y_3)
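The decomposition above can be transcribed directly: a local feature function looks at the input and one adjacent label pair, and the global feature vector is the sum of the local ones over the chain. The particular feature names ("word-tag", "tag-pair") are made up for illustration; this is a sketch, not the course's implementation.

from collections import Counter

def local_phi(x, t, y_prev, y_curr):
    """One factor's features: may look at all of x plus the label pair (y_{t-1}, y_t)."""
    return Counter({("word-tag", x[t], y_curr): 1, ("tag-pair", y_prev, y_curr): 1})

def global_phi(x, y):
    """phi(x, y) = phi(x, y_0, y_1) + phi(x, y_1, y_2) + ... : a sum over the chain."""
    feats = Counter()
    for t in range(1, len(y)):
        feats += local_phi(x, t, y[t - 1], y[t])
    return feats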

3 Outline

• Sequence models
• Hidden Markov models
  – Inference with HMM
  – Learning
• Conditional Models and Local Classifiers
• Global models
  – Conditional Random Fields
  – Structured Perceptron for sequences

4 HMM is also a linear classifier

Consider the HMM:

P(x, y) = \prod_t P(y_t | y_{t-1}) \cdot P(x_t | y_t)

5 HMM is also a linear classifier

Consider the HMM:

P(x, y) = \prod_t P(y_t | y_{t-1}) \cdot P(x_t | y_t)

Transitions: P(y_t | y_{t-1})      Emissions: P(x_t | y_t)

6 HMM is also a linear classifier

Consider the HMM:

P(x, y) = \prod_t P(y_t | y_{t-1}) \cdot P(x_t | y_t)

Or equivalently

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores
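A direct transcription of this identity, assuming the HMM parameters live in hypothetical nested dictionaries initial[state], transition[prev][curr], and emission[state][word]:

import math

def hmm_log_joint(initial, transition, emission, x, y):
    """log P(x, y) = sum of log transition probabilities + sum of log emission probabilities."""
    logp = math.log(initial[y[0]]) + math.log(emission[y[0]][x[0]])   # start of the chain
    for t in range(1, len(x)):
        logp += math.log(transition[y[t - 1]][y[t]])   # transition score at position t
        logp += math.log(emission[y[t]][x[t]])         # emission score at position t
    return logp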

7 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions.

8 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions:

[z] = 1 if z is true, and [z] = 0 if z is false.

Indicators are functions that map Booleans to 0 or 1

9 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions. The transition term is equivalent to

\sum_t \log P(y_t | y_{t-1}) = \sum_t \sum_{y, y'} \log P(y | y') \cdot [y_t = y] \cdot [y_{t-1} = y']

The indicators ensure that only one of the elements of the double summation is non-zero

10 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions. The emission term is equivalent to

\sum_t \log P(x_t | y_t) = \sum_t \sum_{y} \log P(x_t | y) \cdot [y_t = y]

The indicators ensure that only one of the elements of the summation is non-zero

11 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_t \sum_{y, y'} \log P(y | y') \cdot [y_t = y] \cdot [y_{t-1} = y']
             + \sum_t \sum_{y} \log P(x_t | y) \cdot [y_t = y]

12 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_t \sum_{y, y'} \log P(y | y') \cdot [y_t = y] \cdot [y_{t-1} = y']
             + \sum_t \sum_{y} \log P(x_t | y) \cdot [y_t = y]

\log P(x, y) = \sum_{y, y'} \log P(y | y') \sum_t [y_t = y] \cdot [y_{t-1} = y']
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

13 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \sum_t [y_t = y] \cdot [y_{t-1} = y']
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

Number of times there is a transition in the sequence from state y' to state y: Count(y' → y)

14 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

15 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

Number of times state y occurs in the sequence: Count(y)

16 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \log P(x_t | y) \cdot Count(y)

17 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \log P(x_t | y) \cdot Count(y)

This is a linear function: the log P terms are the weights, and the counts (obtained via indicators) are the features. It can be written as w^T φ(x, y), and we can add more features. A sketch in code follows.
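As a sketch of the rewrite above: collect the transition and emission counts as the feature vector, pair each count with the corresponding log-probability as its weight, and the dot product recovers log P(x, y). Folding the initial state into a "<START>" transition row is an assumption about how the tables are stored, not something the slides specify.

import math
from collections import Counter

def log_joint_as_dot_product(transition, emission, x, y, start="<START>"):
    """Builds phi(x, y) as transition/emission counts and w as the matching
    log-probabilities, then returns w . phi(x, y) = log P(x, y)."""
    phi = Counter()
    prev = start
    for word, tag in zip(x, y):
        phi[("trans", prev, tag)] += 1    # Count(prev -> tag)
        phi[("emit", word, tag)] += 1     # count of word emitted from state tag
        prev = tag

    w = {("trans", a, b): math.log(transition[a][b]) for (k, a, b) in phi if k == "trans"}
    w.update({("emit", word, tag): math.log(emission[tag][word])
              for (k, word, tag) in phi if k == "emit"})

    return sum(w[f] * count for f, count in phi.items())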

18 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

19 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

20 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

Transition scores + Emission scores

21 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1
+ Emission scores

22 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1

+ log P(The | Det) × 1
+ log P(dog | Noun) × 1
+ log P(ate | Verb) × 1
+ log P(the | Det) × 1
+ log P(homework | Noun) × 1

23 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1

+ log P(The | Det) × 1
+ log P(dog | Noun) × 1
+ log P(ate | Verb) × 1
+ log P(the | Det) × 1
+ log P(homework | Noun) × 1

w: Parameters of the model

24 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1

+ log P(The | Det) × 1
+ log P(dog | Noun) × 1
+ log P(ate | Verb) × 1
+ log P(the | Det) × 1
+ log P(homework | Noun) × 1

φ(x, y): Properties of this output and the input

25 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

w       = [ log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
            log P(The | Det), log P(dog | Noun), log P(ate | Verb),
            log P(the | Det), log P(homework | Noun) ]

φ(x, y) = [ 2, 1, 1, 1, 1, 1, 1, 1 ]

w: Parameters of the model
φ(x, y): Properties of this output and the input

26 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

w       = [ log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
            log P(The | Det), log P(dog | Noun), log P(ate | Verb),
            log P(the | Det), log P(homework | Noun) ]

φ(x, y) = [ 2, 1, 1, 1, 1, 1, 1, 1 ]

log P(x, y) is a linear scoring function: log P(x, y) = w^T φ(x, y)

w: Parameters of the model
φ(x, y): Properties of this output and the input
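A quick numeric check of this dot product; the probability values below are made up purely for illustration, and the initial-state term P(Det) is omitted, as it is on the slide.

import math

# Made-up HMM parameters: only the entries that appear in this example.
log_p = {
    ("Det", "Noun"):      math.log(0.5),   # transition probabilities
    ("Noun", "Verb"):     math.log(0.3),
    ("Verb", "Det"):      math.log(0.4),
    ("The", "Det"):       math.log(0.6),   # emission probabilities
    ("dog", "Noun"):      math.log(0.1),
    ("ate", "Verb"):      math.log(0.2),
    ("the", "Det"):       math.log(0.3),
    ("homework", "Noun"): math.log(0.05),
}

# phi(x, y): how often each transition/emission occurs in
# "The dog ate the homework" tagged Det Noun Verb Det Noun.
counts = {
    ("Det", "Noun"): 2, ("Noun", "Verb"): 1, ("Verb", "Det"): 1,
    ("The", "Det"): 1, ("dog", "Noun"): 1, ("ate", "Verb"): 1,
    ("the", "Det"): 1, ("homework", "Noun"): 1,
}

# log P(x, y) = w^T phi(x, y): a plain dot product.
log_joint = sum(log_p[key] * count for key, count in counts.items())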

27 Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model!

28 Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model!

29 Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized); a sketch follows below

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z (the partition function)
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model instead of the generative one!
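Point 2 above treats Viterbi as a pure argmax over decomposed scores; here is a minimal sketch. The callback local_score(t, y_prev, y_curr) is assumed to return the position-t part of w^T φ(x, y) (for example a transition weight plus an emission weight); since only relative scores matter, nothing needs to be normalized.

def viterbi_argmax(n, labels, local_score, start="<START>"):
    """argmax over label sequences of sum_t local_score(t, y_{t-1}, y_t)."""
    best = [{y: local_score(0, start, y) for y in labels}]   # best prefix score ending in y
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            candidates = {yp: best[t - 1][yp] + local_score(t, yp, y) for yp in labels}
            yp_star = max(candidates, key=candidates.get)
            best[t][y] = candidates[yp_star]
            back[t][y] = yp_star

    # Follow back-pointers from the best final label to recover the sequence.
    y_last = max(best[-1], key=best[-1].get)
    seq = [y_last]
    for t in range(n - 1, 0, -1):
        seq.append(back[t][seq[-1]])
    return list(reversed(seq))

The same routine serves both the HMM and the structured perceptron, since each only needs the highest-scoring sequence.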

30 Structured Perceptron algorithm

Given a training set D = {(x, y)}
1. Initialize w = 0
2. For epoch = 1 … T:
   – For each training example (x, y) in D:
     • Predict ŷ = argmax_y' w^T φ(x, y')   (inference, e.g. Viterbi)
     • If ŷ ≠ y, update w ← w + φ(x, y) − φ(x, ŷ)
3. Return w

Prediction: argmax_y w^T φ(x, y)
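A compact Python rendering of the loop above. phi(x, y) and argmax_inference(w, x) (for example, the Viterbi sketch earlier wired to the current weights) are assumed to be supplied by the caller; this is a sketch under those assumptions, not the course's reference implementation.

from collections import defaultdict

def structured_perceptron(D, phi, argmax_inference, epochs=10, lr=1.0):
    """Mistake-driven training: move weights toward the gold structure's
    features and away from the current best-scoring structure's features."""
    w = defaultdict(float)                     # 1. initialize w = 0
    for _ in range(epochs):                    # 2. several passes over the data
        for x, y in D:
            y_hat = argmax_inference(w, x)     # predict: argmax_y w^T phi(x, y)
            if y_hat != y:                     # update only on mistakes
                for f, v in phi(x, y).items():
                    w[f] += lr * v
                for f, v in phi(x, y_hat).items():
                    w[f] -= lr * v
    return dict(w)                             # 3. return the learned weights

Using dictionaries as sparse vectors keeps the update cost proportional to the number of active features in the two structures, not to the total feature count.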

34 Structured Perceptron algorithm

In practice, it is good to shuffle D before the inner loop.

Given a training set D = {(x, y)}
1. Initialize w = 0
2. For epoch = 1 … T:
   – For each training example (x, y) in D, in shuffled order:
     • Predict ŷ = argmax_y' w^T φ(x, y')
     • If ŷ ≠ y, update w ← w + φ(x, y) − φ(x, ŷ)
3. Return w

Prediction: argmax_y w^T φ(x, y)

36 Notes on structured perceptron

• Mistake bound for separable data, just like perceptron

• In practice, use averaging for better generalization
  – Initialize a = 0
  – After each step, whether there is an update or not, a ← w + a
    • Note: we still check for a mistake using w, not a
  – Return a at the end instead of w
  Exercise: Optimize this for performance – modify a only on errors

• Global update
  – One weight vector for the entire sequence
    • Not one for each position
  – The same algorithm can be derived via constraint classification
    • Create a binary classification data set and run perceptron

37 Structured Perceptron with averaging

Given a training set D = {(x, y)}
1. Initialize w = 0, a = 0
2. For epoch = 1 … T:
   – For each training example (x, y) in D:
     • Predict ŷ = argmax_y' w^T φ(x, y')
     • If ŷ ≠ y, update w ← w + φ(x, y) − φ(x, ŷ)
     • a ← a + w   (whether or not there was an update)
3. Return a

Prediction: argmax_y a^T φ(x, y)

A sketch in code follows.
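The averaged variant, following the notes on the previous slide: a absorbs w after every example whether or not an update happened, mistakes are still checked against w, and a is returned at the end. The lazy-update optimization suggested in the exercise is not shown; phi and argmax_inference are assumed to be supplied as before.

from collections import defaultdict

def averaged_structured_perceptron(D, phi, argmax_inference, epochs=10, lr=1.0):
    """Structured perceptron with averaging for better generalization."""
    w = defaultdict(float)
    a = defaultdict(float)
    for _ in range(epochs):
        for x, y in D:
            y_hat = argmax_inference(w, x)        # mistakes are checked with w, not a
            if y_hat != y:
                for f, v in phi(x, y).items():
                    w[f] += lr * v
                for f, v in phi(x, y_hat).items():
                    w[f] -= lr * v
            for f, v in w.items():                # a <- w + a after every step
                a[f] += v
    return dict(a)                                # return a instead of w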

38 CRF vs. structured perceptron

Stochastic update for CRF
– For a training example (x_i, y_i):
  w ← w + learning rate × ( φ(x_i, y_i) − E_{y ~ P(y | x_i)}[ φ(x_i, y) ] )

Structured perceptron
– For a training example (x_i, y_i):
  w ← w + learning rate × ( φ(x_i, y_i) − φ(x_i, ŷ) ),  where ŷ = argmax_y w^T φ(x_i, y)

Expectation vs. max: the two updates differ only in whether the second term is the expected feature vector or the feature vector of the highest-scoring structure.

Caveat: Adding regularization will change the CRF update, and averaging changes the perceptron update.
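Side by side, the two updates have the same shape and differ only in that second term. In this sketch, expected_phi(w, x_i) is a hypothetical stand-in for E_{y ~ P(y | x_i)}[φ(x_i, y)], which for sequences would be computed with forward–backward; phi and argmax_inference are assumed as before.

def perceptron_update(w, phi, argmax_inference, x_i, y_i, lr=1.0):
    """Structured perceptron: subtract the features of the max-scoring structure."""
    y_hat = argmax_inference(w, x_i)
    for f, v in phi(x_i, y_i).items():
        w[f] = w.get(f, 0.0) + lr * v
    for f, v in phi(x_i, y_hat).items():
        w[f] = w.get(f, 0.0) - lr * v

def crf_stochastic_update(w, phi, expected_phi, x_i, y_i, lr=1.0):
    """CRF stochastic gradient step: subtract the *expected* features under P(y | x_i)."""
    for f, v in phi(x_i, y_i).items():
        w[f] = w.get(f, 0.0) + lr * v
    for f, v in expected_phi(w, x_i).items():
        w[f] = w.get(f, 0.0) - lr * v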

39 The lay of the land

HMM: A generative model, assigns probabilities to sequences

40 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

41 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

• Hidden Markov Models are actually just linear classifiers

• Don’t really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)

• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features

• Structured Perceptron
• Structured SVM

42 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features

43 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
• Conditional Random Field

Discriminative/Conditional models

44 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
• Conditional Random Field

Discriminative/Conditional models

Applicable beyond sequences
Eventually, similar objective minimized with different loss functions

45 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
• Conditional Random Field

Discriminative/Conditional models

Applicable beyond sequences
Eventually, similar objective minimized with different loss functions
Coming soon…

46 Sequence models: Summary

• Goal: Predict an output sequence given input sequence

• Inference
  – Predict via the Viterbi algorithm
  – Prediction is not always tractable for general structures

• Conditional models / discriminative models
  – Local approaches (no inference during training)
    • MEMM, conditional Markov model
  – Global approaches (inference during training)
    • CRF, structured perceptron
  – The same dichotomy holds for more general structures

• To think about
  – What are the parts in a sequence model?
  – How is each model scoring these parts?
