CRF and Structured Perceptron

CRF and Structured Perceptron CS 585, Fall 2015 -- Oct. 6 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp2015/ Brendan O’Connor Tuesday, October 6, 15 • Viterbi exercise solution • CRF & Structured Perceptrons • Thursday: project discussion + midterm review 2 Tuesday, October 6, 15 Log-linear models (NB, LogReg, HMM, CRF...) • x: Text Data • y: Proposed class or sequence • θ: Feature weights (model parameters) • f(x,y): Feature extractor, produces feature vector 1 p(y x)= exp ✓Tf(x, y) | Z ✓ f(x, y) G(y) | {z } Decision rule: arg max G(y⇤) y outputs(x) ⇤2 How to we evaluate for HMM/CRF? Viterbi! 3 Tuesday, October 6, 15 Things to do with a log-linear model 1 p(y x)= exp ✓Tf(x, y) | Z ✓ f(x, y) G(y) f(x,y) x y θ Feature extractor| {zText Input} Output Feature (feature vector) weights decoding/prediction given given obtain given (just one) (just one) arg max G(y⇤) y outputs(x) ⇤2 given given given obtain parameter learning (many pairs)(many pairs) fiddle with obtain feature engineering given during given in each (many pairs)(many pairs) (human-in-the-loop) experiments experiment 4 [This is new slide after lecture] Tuesday, October 6, 15 HMM as factor graph A1 A2 y1 y2 y3 p(y, w)= p(w y ) p(y y ) B1 B2 B3 y| t t+1| t t Y log p(y, w)= log p(wt yt) + log p(yt yt 1) | | − t X G(y) Bt(yt) A(yt,yt+1) goodness emission transition factor score factor score arg max G(y⇤) (Additive) Viterbi: y outputs(x) ⇤2 5 Tuesday, October 6, 15 10 An Introduction to Conditional Random Fields for Relational Learning We can write (1.13) more compactly by introducing the concept of feature functions, just as we did for logistic regression in (1.7). Each feature function has the form fk(yt,yt 1,xt). In order to duplicate (1.13), there needs to be one feature − fij(y, y0,x)=1 y=i 1 y =j for each transition (i, j) and one feature fio(y, y0,x)= { } { 0 } 1 y=i 1 x=o for each state-observation pair (i, o). Then we can write an HMM as: { } { } 1 K p(y, x)= exp λkfk(yt,yt 1,xt) . (1.14) Z ( − ) kX=1 Again, equation (1.14) defines exactly the same family of distributions as (1.13), • isand there therefore asa theterrible original HMM bug equation in (1.8). sutton&mccallum? The last step is to write the conditional distribution p(y x) that results from the | 10there’sHMM An Introduction (1.14). no This to issumConditional over Random Fieldst in for these Relational Learningequations! K p(y, x) exp k=1 λkfk(yt,yt 1,xt) We canp write(y x)= (1.13) more compactly= by introducing the concept− of feature. functions(1.15) , n K o | y p(y0, x) just as we did for0 logistic regressiony expP in (1.7).k=1 λk Eachfk(yt0,y featuret0 1,xt) function has the 0 − form fk(yt,yt 1,xPt). In order to duplicaten (1.13), there needs too be one feature This conditional− distribution (1.15)P is a linear-chainP CRF, in particular one that fij(y, y0,x)=1 y=i 1 y0=j for each transition (i, j) and one feature fio(y, y0,x)= includes features{ only} for{ the} current word’s identity. But many other linear-chain 1 y=i 1 x=o for each state-observation pair (i, o). Then we can write an HMM as: CRFs{ } use{ richer} features of the input, such as prefixes and suffixes of the current word, the identity of surrounding1 words,K and so on. Fortunately, this extension p(y, x)= exp λkfk(yt,yt 1,xt) . (1.14) requires little change to our existingZ notation. We simply− allow the feature functions (k=1 ) fk(yt,yt 1, xt) to be more general than indicatorX functions. This leads to the general − definitionAgain, equation of linear-chain (1.14) defines CRFs, which exactly we the present same now. family of distributions as (1.13), and therefore as the original HMM equation (1.8). DefinitionThe last step 1.1 is to write the conditional distribution p(y x) that results from the K | LetHMMY,X (1.14).be random This is vectors, ⇤ = λk be a parameter vector, and K { } 2< f (y, y0, x ) be a set of real-valued feature functions. Then a linear-chain { k t }k=1 conditional random field is a distributionexp p(yKx)λ thatf ( takesy ,y the,x form) p(y, x) k|=1 k k t t 1 t p(y x)= = − . (1.15) nK K o | y p(y0, x) 0 1 y expP k=1 λkfk(yt0,yt0 1,xt) p(y x)= exp 0 λkfk(yt,yt 1, xt) −, (1.16) P | Z(x) ( n − ) o This conditional distribution (1.15)P iskX a=1 linear-chainP CRF, in particular one that whereincludesZ(x features) is an instance-specific only for the current normalization word’s identity. function But many other linear-chain 6 CRFs use richer features of the input,K such as prefixes and suffixes of the current word, the identity of surrounding words, and so on. Fortunately, this extension Tuesday, October 6, 15 Z(x)= exp λkfk(yt,yt 1, xt) . (1.17) − requires little change to our existingy ( notation. We simply allow) the feature functions X kX=1 fk(yt,yt 1, xt) to be more general than indicator functions. This leads to the general − definition of linear-chain CRFs, which we present now. We have just seen that if the joint p(y, x) factorizes as an HMM, then the associated conditionalDefinition distribution 1.1 p(y x) is a linear-chain CRF. This HMM-like CRF is | K picturedLet Y,X inbe Figure random 1.3. Other vectors, types⇤ of= linear-chainλk CRFsbe are a parameter also useful, vector, however. and K { } 2< Forf example,(y, y0, x ) in anbe HMM, a set a of transition real-valued from feature state i functions.to state j Thenreceives a linear-chain the same { k t }k=1 score,conditional log p(y randomt = j yt field1 = isi), a regardless distribution of thep(y input.x) that In takes a CRF, the formwe can allow the | − score of the transition (i, j) to depend on the| current observation vector, simply 1 K p(y x)= exp λkfk(yt,yt 1, xt) , (1.16) | Z(x) ( − ) kX=1 where Z(x) is an instance-specific normalization function K Z(x)= exp λkfk(yt,yt 1, xt) . (1.17) − y ( ) X kX=1 We have just seen that if the joint p(y, x) factorizes as an HMM, then the associated conditional distribution p(y x) is a linear-chain CRF. This HMM-like CRF is | pictured in Figure 1.3. Other types of linear-chain CRFs are also useful, however. For example, in an HMM, a transition from state i to state j receives the same score, log p(yt = j yt 1 = i), regardless of the input. In a CRF, we can allow the | − score of the transition (i, j) to depend on the current observation vector, simply HMM as log-linear A1 A2 y1 y2 y3 p(y, w)= p(w y ) p(y y ) B1 B2 B3 y| t t+1| t t Y log p(y, w)= log p(wt yt) + log p(yt yt 1) | | − t X G(y) Bt(yt) A(yt,yt+1) goodness emission transition factor score factor score µ 1 y = k w = w + λ 1 y = j y = k G(y) = 2 w,k { t ^ t } j,k { t ^ t+1 }3 t k K w V k,j K X X2 X2 X2 4 5 = ✓ift,i(yt,yt+1,wt) t i allfeats X 2 X = ✓ifi(yt,yt+1,wt) i allfeats [~ SM eq 1.13, 1.14] 2 X 7 Tuesday, October 6, 15 CRF log p(y x)=C + ✓Tf(x, y) Prob. dist over whole sequence | Linear-chain CRF: whole- f(x, y)= f (x, y ,y ) t t t+1 sequence feature function t X decomposes into pairs • advantages • 1. why just word identity features? add many more! • 2. can train it to optimize accuracy of sequences (discriminative learning) • Viterbi can be used for efficient prediction 8 Tuesday, October 6, 15 finna get good gold y = V V A f(x,y) is... Two simple feature templates V,V: 1 “Transition features” V,A: 1 ftrans:A,B(x, y)= 1 yt 1 = A, yt = B { − } V, N : 0 t X .... “Observation features” V,finna: 1 V,get: 1 f (x, y)= 1 y = A, x = w emit:A,w { t t } A,good: 1 t X N,good: 0 9 ... Tuesday, October 6, 15 Mathematical convention is finna get good numeric indexing, though sometimes convenient to gold y = V V A implement as hash table. Transition features Observation features ✓ -0.6 -1.0 1.1 0.5 0.0 0.8 0.5 -1.3 -1.6 0.0 0.6 0.0 -0.2 -0.2 0.8 -1.0 0.1 -1.9 1.1 1.2 -0.1 -1.0 -0.1 -0.1 f(x, y) 1 0 2 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 3 0 0 0 0 N ftrans:V,A(x, y)= 1 yt 1 = V,yt = A { − } t=2 X N f (x, y)= 1 y = V,x =finna obs:V,finna { t t } t=1 X Goodness(y) = ✓Tf(x, y) Tuesday, October 6, 15 large overall feature extraction function for an entire structure at once.1 The following parts will build up these pieces.

Load more