Predicting Sequences: Structured Perceptron

CS 6355: Structured Prediction

1 Conditional Random Fields summary

• An undirected graphical model
  – Decompose the score over the structure into a collection of factors
  – Each factor assigns a score to the assignment of the random variables it is connected to

• Training and prediction
  – Final prediction via argmax_y w^T φ(x, y)
  – Train by maximum (regularized) likelihood

• Connections to other models
  – Effectively a linear classifier
  – A generalization of logistic regression to structures
  – A conditional variant of a Markov Random Field
    • We will see this soon
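To make this summary concrete, here is a minimal sketch (not from the lecture) of scoring, predicting, and normalizing with a chain CRF over a tiny label set. The label set, the feature map, and the brute-force enumeration are illustrative assumptions; a real implementation would use Viterbi and forward–backward instead of enumerating all label sequences.

import itertools
import math
from collections import Counter

LABELS = ["Det", "Noun", "Verb"]          # toy label set (assumption)

def phi(x, y):
    """Decomposed features: one emission and one transition factor per position."""
    feats = Counter()
    prev = "<START>"
    for word, tag in zip(x, y):
        feats[("emit", word, tag)] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats

def score(w, x, y):
    """w^T phi(x, y): the global score is a sum of factor scores."""
    return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())

def predict(w, x):
    """Final prediction via argmax_y w^T phi(x, y) (brute force here, Viterbi in practice)."""
    return max(itertools.product(LABELS, repeat=len(x)), key=lambda y: score(w, x, y))

def conditional_probability(w, x, y):
    """P(y | x) = exp(w^T phi(x, y)) / Z(x), the quantity maximized during training."""
    Z = sum(math.exp(score(w, x, y2)) for y2 in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(w, x, tuple(y))) / Z

For instance, with a single positive weight on ("emit", "dog", "Noun"), predict(w, ["the", "dog"]) would favor any tagging that labels "dog" as Noun.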

2 Global features

The feature function decomposes over the sequence

[Figure: a chain of output variables y_0, y_1, y_2, y_3 connected to the input x]

φ(x, y) = φ(x, y_0, y_1) + φ(x, y_1, y_2) + φ(x, y_2, y_3)
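The decomposition above can be transcribed directly: a local feature function looks at the input and one adjacent label pair, and the global feature vector is the sum of the local ones over the chain. The particular feature names ("word-tag", "tag-pair") are made up for illustration; this is a sketch, not the course's implementation.

from collections import Counter

def local_phi(x, t, y_prev, y_curr):
    """One factor's features: may look at all of x plus the label pair (y_{t-1}, y_t)."""
    return Counter({("word-tag", x[t], y_curr): 1, ("tag-pair", y_prev, y_curr): 1})

def global_phi(x, y):
    """phi(x, y) = phi(x, y_0, y_1) + phi(x, y_1, y_2) + ... : a sum over the chain."""
    feats = Counter()
    for t in range(1, len(y)):
        feats += local_phi(x, t, y[t - 1], y[t])
    return feats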

3 Outline

• Sequence models
• Hidden Markov models
  – Inference with HMM
  – Learning
• Conditional Models and Local Classifiers
• Global models
  – Conditional Random Fields
  – Structured Perceptron for sequences

4 HMM is also a linear classifier

Consider the HMM:

P(x, y) = \prod_t P(y_t | y_{t-1}) \cdot P(x_t | y_t)

5 HMM is also a linear classifier

Consider the HMM:

P(x, y) = \prod_t P(y_t | y_{t-1}) \cdot P(x_t | y_t)

Transitions: P(y_t | y_{t-1})      Emissions: P(x_t | y_t)

6 HMM is also a linear classifier

Consider the HMM:

P(x, y) = \prod_t P(y_t | y_{t-1}) \cdot P(x_t | y_t)

Or equivalently

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores
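A direct transcription of this identity, assuming the HMM parameters live in hypothetical nested dictionaries initial[state], transition[prev][curr], and emission[state][word]:

import math

def hmm_log_joint(initial, transition, emission, x, y):
    """log P(x, y) = sum of log transition probabilities + sum of log emission probabilities."""
    logp = math.log(initial[y[0]]) + math.log(emission[y[0]][x[0]])   # start of the chain
    for t in range(1, len(x)):
        logp += math.log(transition[y[t - 1]][y[t]])   # transition score at position t
        logp += math.log(emission[y[t]][x[t]])         # emission score at position t
    return logp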

7 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions.

8 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions:

[z] = 1 if z is true, and [z] = 0 if z is false.

Indicators are functions that map Booleans to 0 or 1

9 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions. The transition term is equivalent to

\sum_t \log P(y_t | y_{t-1}) = \sum_t \sum_{y, y'} \log P(y | y') \cdot [y_t = y] \cdot [y_{t-1} = y']

The indicators ensure that only one of the elements of the double summation is non-zero

10 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Log joint probability = transition scores + emission scores

Let us examine this expression using a carefully defined set of indicator functions. The emission term is equivalent to

\sum_t \log P(x_t | y_t) = \sum_t \sum_{y} \log P(x_t | y) \cdot [y_t = y]

The indicators ensure that only one of the elements of the summation is non-zero

11 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_t \sum_{y, y'} \log P(y | y') \cdot [y_t = y] \cdot [y_{t-1} = y']
             + \sum_t \sum_{y} \log P(x_t | y) \cdot [y_t = y]

12 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_t \sum_{y, y'} \log P(y | y') \cdot [y_t = y] \cdot [y_{t-1} = y']
             + \sum_t \sum_{y} \log P(x_t | y) \cdot [y_t = y]

\log P(x, y) = \sum_{y, y'} \log P(y | y') \sum_t [y_t = y] \cdot [y_{t-1} = y']
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

13 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \sum_t [y_t = y] \cdot [y_{t-1} = y']
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

Number of times there is a transition in the sequence from state y' to state y: Count(y' → y)

14 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

15 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \sum_t \log P(x_t | y) \cdot [y_t = y]

Number of times state y occurs in the sequence: Count(y)

16 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \log P(x_t | y) \cdot Count(y)

17 HMM is also a linear classifier

\log P(x, y) = \sum_t \log P(y_t | y_{t-1}) + \sum_t \log P(x_t | y_t)

Let us examine this expression using a carefully defined set of indicator functions:

\log P(x, y) = \sum_{y, y'} \log P(y | y') \cdot Count(y' → y)
             + \sum_{y} \log P(x_t | y) \cdot Count(y)

This is a linear function: the log P terms are the weights, and the counts (obtained via indicators) are the features. It can be written as w^T φ(x, y), and we can add more features. A sketch in code follows.
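As a sketch of the rewrite above: collect the transition and emission counts as the feature vector, pair each count with the corresponding log-probability as its weight, and the dot product recovers log P(x, y). Folding the initial state into a "<START>" transition row is an assumption about how the tables are stored, not something the slides specify.

import math
from collections import Counter

def log_joint_as_dot_product(transition, emission, x, y, start="<START>"):
    """Builds phi(x, y) as transition/emission counts and w as the matching
    log-probabilities, then returns w . phi(x, y) = log P(x, y)."""
    phi = Counter()
    prev = start
    for word, tag in zip(x, y):
        phi[("trans", prev, tag)] += 1    # Count(prev -> tag)
        phi[("emit", word, tag)] += 1     # count of word emitted from state tag
        prev = tag

    w = {("trans", a, b): math.log(transition[a][b]) for (k, a, b) in phi if k == "trans"}
    w.update({("emit", word, tag): math.log(emission[tag][word])
              for (k, word, tag) in phi if k == "emit"})

    return sum(w[f] * count for f, count in phi.items())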

18 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

19 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

20 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

Transition scores + Emission scores

21 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1
+ Emission scores

22 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1

+ log P(The | Det) × 1
+ log P(dog | Noun) × 1
+ log P(ate | Verb) × 1
+ log P(the | Det) × 1
+ log P(homework | Noun) × 1

23 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1

+ log P(The | Det) × 1
+ log P(dog | Noun) × 1
+ log P(ate | Verb) × 1
+ log P(the | Det) × 1
+ log P(homework | Noun) × 1

w: Parameters of the model

24 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1

+ log P(The | Det) × 1
+ log P(dog | Noun) × 1
+ log P(ate | Verb) × 1
+ log P(the | Det) × 1
+ log P(homework | Noun) × 1

φ(x, y): Properties of this output and the input

25 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

w       = [ log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
            log P(The | Det), log P(dog | Noun), log P(ate | Verb),
            log P(the | Det), log P(homework | Noun) ]

φ(x, y) = [ 2, 1, 1, 1, 1, 1, 1, 1 ]

w: Parameters of the model
φ(x, y): Properties of this output and the input

26 HMM is a linear classifier: An example

Det Noun Verb Det Noun

The dog ate the homework

Consider log P(x, y) for this example:

w       = [ log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
            log P(The | Det), log P(dog | Noun), log P(ate | Verb),
            log P(the | Det), log P(homework | Noun) ]

φ(x, y) = [ 2, 1, 1, 1, 1, 1, 1, 1 ]

log P(x, y) is a linear scoring function: log P(x, y) = w^T φ(x, y)

w: Parameters of the model
φ(x, y): Properties of this output and the input
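A quick numeric check of this dot product; the probability values below are made up purely for illustration, and the initial-state term P(Det) is omitted, as it is on the slide.

import math

# Made-up HMM parameters: only the entries that appear in this example.
log_p = {
    ("Det", "Noun"):      math.log(0.5),   # transition probabilities
    ("Noun", "Verb"):     math.log(0.3),
    ("Verb", "Det"):      math.log(0.4),
    ("The", "Det"):       math.log(0.6),   # emission probabilities
    ("dog", "Noun"):      math.log(0.1),
    ("ate", "Verb"):      math.log(0.2),
    ("the", "Det"):       math.log(0.3),
    ("homework", "Noun"): math.log(0.05),
}

# phi(x, y): how often each transition/emission occurs in
# "The dog ate the homework" tagged Det Noun Verb Det Noun.
counts = {
    ("Det", "Noun"): 2, ("Noun", "Verb"): 1, ("Verb", "Det"): 1,
    ("The", "Det"): 1, ("dog", "Noun"): 1, ("ate", "Verb"): 1,
    ("the", "Det"): 1, ("homework", "Noun"): 1,
}

# log P(x, y) = w^T phi(x, y): a plain dot product.
log_joint = sum(log_p[key] * count for key, count in counts.items())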

27 Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model!

28 Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model!

29 Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized); a sketch follows below

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z (the partition function)
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model instead of the generative one!
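Point 2 above treats Viterbi as a pure argmax over decomposed scores; here is a minimal sketch. The callback local_score(t, y_prev, y_curr) is assumed to return the position-t part of w^T φ(x, y) (for example a transition weight plus an emission weight); since only relative scores matter, nothing needs to be normalized.

def viterbi_argmax(n, labels, local_score, start="<START>"):
    """argmax over label sequences of sum_t local_score(t, y_{t-1}, y_t)."""
    best = [{y: local_score(0, start, y) for y in labels}]   # best prefix score ending in y
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            candidates = {yp: best[t - 1][yp] + local_score(t, yp, y) for yp in labels}
            yp_star = max(candidates, key=candidates.get)
            best[t][y] = candidates[yp_star]
            back[t][y] = yp_star

    # Follow back-pointers from the best final label to recover the sequence.
    y_last = max(best[-1], key=best[-1].get)
    seq = [y_last]
    for t in range(n - 1, 0, -1):
        seq.append(back[t][seq[-1]])
    return list(reversed(seq))

The same routine serves both the HMM and the structured perceptron, since each only needs the highest-scoring sequence.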

30 Structured Perceptron algorithm

Given a training set D = {(x, y)}
1. Initialize w = 0
2. For epoch = 1 … T:
   – For each training example (x, y) in D:
     • Predict ŷ = argmax_y' w^T φ(x, y')   (inference, e.g. Viterbi)
     • If ŷ ≠ y, update w ← w + φ(x, y) − φ(x, ŷ)
3. Return w

Prediction: argmax_y w^T φ(x, y)
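A compact Python rendering of the loop above. phi(x, y) and argmax_inference(w, x) (for example, the Viterbi sketch earlier wired to the current weights) are assumed to be supplied by the caller; this is a sketch under those assumptions, not the course's reference implementation.

from collections import defaultdict

def structured_perceptron(D, phi, argmax_inference, epochs=10, lr=1.0):
    """Mistake-driven training: move weights toward the gold structure's
    features and away from the current best-scoring structure's features."""
    w = defaultdict(float)                     # 1. initialize w = 0
    for _ in range(epochs):                    # 2. several passes over the data
        for x, y in D:
            y_hat = argmax_inference(w, x)     # predict: argmax_y w^T phi(x, y)
            if y_hat != y:                     # update only on mistakes
                for f, v in phi(x, y).items():
                    w[f] += lr * v
                for f, v in phi(x, y_hat).items():
                    w[f] -= lr * v
    return dict(w)                             # 3. return the learned weights

Using dictionaries as sparse vectors keeps the update cost proportional to the number of active features in the two structures, not to the total feature count.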

34 Structured Perceptron algorithm

In practice, it is good to shuffle D before the inner loop.

Given a training set D = {(x, y)}
1. Initialize w = 0
2. For epoch = 1 … T:
   – For each training example (x, y) in D, in shuffled order:
     • Predict ŷ = argmax_y' w^T φ(x, y')
     • If ŷ ≠ y, update w ← w + φ(x, y) − φ(x, ŷ)
3. Return w

Prediction: argmax_y w^T φ(x, y)

36 Notes on structured perceptron

• Mistake bound for separable data, just like perceptron

• In practice, use averaging for better generalization
  – Initialize a = 0
  – After each step, whether there is an update or not, a ← w + a
    • Note: we still check for a mistake using w, not a
  – Return a at the end instead of w
  Exercise: Optimize this for performance – modify a only on errors

• Global update
  – One weight vector for the entire sequence
    • Not one for each position
  – The same algorithm can be derived via constraint classification
    • Create a binary classification data set and run perceptron

37 Structured Perceptron with averaging

Given a training set D = {(x, y)}
1. Initialize w = 0, a = 0
2. For epoch = 1 … T:
   – For each training example (x, y) in D:
     • Predict ŷ = argmax_y' w^T φ(x, y')
     • If ŷ ≠ y, update w ← w + φ(x, y) − φ(x, ŷ)
     • a ← a + w   (whether or not there was an update)
3. Return a

Prediction: argmax_y a^T φ(x, y)

A sketch in code follows.
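The averaged variant, following the notes on the previous slide: a absorbs w after every example whether or not an update happened, mistakes are still checked against w, and a is returned at the end. The lazy-update optimization suggested in the exercise is not shown; phi and argmax_inference are assumed to be supplied as before.

from collections import defaultdict

def averaged_structured_perceptron(D, phi, argmax_inference, epochs=10, lr=1.0):
    """Structured perceptron with averaging for better generalization."""
    w = defaultdict(float)
    a = defaultdict(float)
    for _ in range(epochs):
        for x, y in D:
            y_hat = argmax_inference(w, x)        # mistakes are checked with w, not a
            if y_hat != y:
                for f, v in phi(x, y).items():
                    w[f] += lr * v
                for f, v in phi(x, y_hat).items():
                    w[f] -= lr * v
            for f, v in w.items():                # a <- w + a after every step
                a[f] += v
    return dict(a)                                # return a instead of w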

38 CRF vs. structured perceptron

Stochastic update for CRF
– For a training example (x_i, y_i):
  w ← w + learning rate × ( φ(x_i, y_i) − E_{y ~ P(y | x_i)}[ φ(x_i, y) ] )

Structured perceptron
– For a training example (x_i, y_i):
  w ← w + learning rate × ( φ(x_i, y_i) − φ(x_i, ŷ) ),  where ŷ = argmax_y w^T φ(x_i, y)

Expectation vs. max: the two updates differ only in whether the second term is the expected feature vector or the feature vector of the highest-scoring structure.

Caveat: Adding regularization will change the CRF update, and averaging changes the perceptron update.
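Side by side, the two updates have the same shape and differ only in that second term. In this sketch, expected_phi(w, x_i) is a hypothetical stand-in for E_{y ~ P(y | x_i)}[φ(x_i, y)], which for sequences would be computed with forward–backward; phi and argmax_inference are assumed as before.

def perceptron_update(w, phi, argmax_inference, x_i, y_i, lr=1.0):
    """Structured perceptron: subtract the features of the max-scoring structure."""
    y_hat = argmax_inference(w, x_i)
    for f, v in phi(x_i, y_i).items():
        w[f] = w.get(f, 0.0) + lr * v
    for f, v in phi(x_i, y_hat).items():
        w[f] = w.get(f, 0.0) - lr * v

def crf_stochastic_update(w, phi, expected_phi, x_i, y_i, lr=1.0):
    """CRF stochastic gradient step: subtract the *expected* features under P(y | x_i)."""
    for f, v in phi(x_i, y_i).items():
        w[f] = w.get(f, 0.0) + lr * v
    for f, v in expected_phi(w, x_i).items():
        w[f] = w.get(f, 0.0) - lr * v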

39 The lay of the land

HMM: A generative model, assigns probabilities to sequences

40 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

41 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

• Hidden Markov Models are actually just linear classifiers

• Don’t really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)

• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features

• Structured Perceptron
• Structured SVM

42 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features

43 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
• Conditional Random Field

Discriminative/Conditional models

44 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
• Conditional Random Field

Discriminative/Conditional models

Applicable beyond sequences
Eventually, similar objective minimized with different loss functions

45 The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

One road:
• Hidden Markov Models are actually just linear classifiers
• Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
• Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
• Structured Perceptron
• Structured SVM

The other road:
• Model probabilities via exponential functions. Gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
• Conditional Random Field

Discriminative/Conditional models

Applicable beyond sequences
Eventually, similar objective minimized with different loss functions
Coming soon…

46 Sequence models: Summary

• Goal: Predict an output sequence given input sequence

• Inference
  – Predict via the Viterbi algorithm
  – Prediction is not always tractable for general structures

• Conditional models / discriminative models
  – Local approaches (no inference during training)
    • MEMM, conditional Markov model
  – Global approaches (inference during training)
    • CRF, structured perceptron
  – The same dichotomy holds for more general structures

• To think about
  – What are the parts in a sequence model?
  – How is each model scoring these parts?
