Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
J. Lafferty, A. McCallum, F. Pereira (ICML'01)

Presented by Kevin Duh
March 4, 2005, UW Markovia Reading Group

Outline

• Motivation:
  • HMM and CMM limitations
  • Label bias problem
• Conditional Random Field (CRF) definition
• CRF parameter estimation
  • Iterative scaling
• Experiments
  • Synthetic data
  • Part-of-speech tagging


Hidden Markov Models (HMMs)

[Figure: directed chain Y_{i-3} → Y_{i-2} → Y_{i-1} → Y_i with emissions X_{i-3}, …, X_i]

• Generative model p(X, Y)
• Must enumerate all possible observation sequences → requires an atomic representation of observations
• Assumes independence of features (same assumption as Naïve Bayes)

Conditional Markov Models (CMMs)

Example: Maximum Entropy Markov Model (MEMM)

[Figure: directed chain over Y_{i-3}, …, Y_i, with each state conditioned on its observation X_{i-3}, …, X_i]

• Conditional model P(Y | X)
• No effort wasted on modeling the observations
• Transition probabilities can depend on both past and future observations
• Features can be dependent
• Suffers from the label bias problem due to per-state normalization
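To make the generative vs. conditional contrast concrete, the usual chain factorizations are sketched below (standard notation, not taken verbatim from the slides); the per-state normalizer Z(y_{i-1}, x_i) of the MEMM is what later gives rise to the label bias problem.

```latex
% HMM: generative joint distribution over states and observations
P(X, Y) = \prod_{i} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)

% MEMM: conditional model built from per-state (locally normalized) exponentials
P(Y \mid X) = \prod_{i} P(y_i \mid y_{i-1}, x_i)
            = \prod_{i} \frac{\exp\big(\sum_k \lambda_k f_k(y_{i-1}, y_i, x_i)\big)}
                             {Z(y_{i-1}, x_i)}
```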


Label Bias Example (after Wallach '02)

Tags: D = determiner, N = noun, V = verb, A = adjective, R = adverb

[Figure: FSA over "The robot wheels …". "The robot" is tagged D N; the path then branches: "wheels Fred round" as V N R (upper path) vs. "wheels are round" as N V A (lower path)]

Observation: "The robot wheels are round."
But if P(V | N, wheels) > P(N | N, wheels), then the upper path is chosen regardless of the rest of the observation.

Label Bias Problem

• The problem: states with low-entropy next-state distributions ignore their observations
• Fundamental cause: per-state normalization
  • "Conservation of score mass": transitions leaving a given state compete only against each other
• Solution:
  • Model accounts for the whole sequence at once
  • Probability mass is amplified/dampened at individual transitions
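A toy numerical version of the Wallach-style example above, assuming MEMM-style per-state normalization; the transition probabilities are made-up numbers chosen only to show the qualitative effect.

```python
# Toy illustration of label bias under per-state (MEMM-style) normalization.
# Locally normalized next-tag distributions P(next_tag | current_tag, word);
# all numbers are invented for illustration.
P = {
    # At state N reading "wheels": the model slightly prefers the verb reading.
    ("N", "wheels"): {"V": 0.55, "N": 0.45},
    # Upper path (V N R): low-entropy next-state distributions that ignore
    # the observations "are" and "round" entirely.
    ("V", "are"):    {"N": 1.0},
    ("N", "round"):  {"R": 1.0},
    # Lower path (N V A): the observations strongly favor these transitions.
    ("N", "are"):    {"V": 0.9, "N": 0.1},
    ("V", "round"):  {"A": 0.8, "R": 0.2},
}

def path_prob(start_tag, words, tags):
    """Product of locally normalized transition probabilities along one path."""
    prob, prev = 1.0, start_tag
    for word, tag in zip(words, tags):
        prob *= P[(prev, word)].get(tag, 0.0)
        prev = tag
    return prob

words = ["wheels", "are", "round"]               # after "The robot" (tagged D N)
upper = path_prob("N", words, ["V", "N", "R"])   # wrong reading: 0.55 * 1.0 * 1.0
lower = path_prob("N", words, ["N", "V", "A"])   # right reading: 0.45 * 0.9 * 0.8
print(f"upper (V N R): {upper:.3f}")   # 0.550
print(f"lower (N V A): {lower:.3f}")   # 0.324
# Per-state normalization conserves the mass given to the upper path at the
# first decision, so the later evidence ("are", "round") never overturns it.
```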


Conditional Random Fields (CRFs)

[Figure: undirected chain over Y_{i-3}, …, Y_i, globally conditioned on the observation sequence X]

• Single exponential model of the joint probability of the entire state sequence given the observations
• Alternative view: a finite state model with un-normalized transition probabilities

Definition of CRF

• Def: A CRF is an undirected graphical model globally conditioned on the observation sequence X
• Graph: G = (V, E); V represents all of Y
• (X, Y) is a CRF if, when conditioned on X, the variables Y_v obey the Markov property with respect to G:

  P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)

  where w \sim v means that w and v are neighbors in G.


What does the distribution of a Random Field look like?

• Hammersley–Clifford Theorem: the conditional independence statements made by a random field are equivalent to a factorization over the cliques C of the graph:

  p(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_{c \in C} \Psi_{V_c}(v_c)

• Potential functions \Psi_{V_c}:
  • strictly positive, real-valued functions
  • no direct probabilistic interpretation
  • represent "constraints" on configurations of the random variables
  • an overall configuration satisfying more constraints will have higher probability
• Here, the potential functions are chosen based on the Maximum Entropy principle

Maximum Entropy Principle

• MaxEnt says: "When estimating a distribution, pick the maximum-entropy distribution that respects all features f(x, y) seen in the training data"
• This is a constrained optimization problem: E_{\tilde p(x,y)}[f] = E_q[f], i.e.

  \sum_{x,y} \tilde p(x, y)\, f(x, y) = \sum_{x,y} \tilde p(x)\, q(y \mid x)\, f(x, y)

• Parametric form:

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x, y) \Big)
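As a minimal illustration of this parametric form (with invented feature functions and weights, not the paper's), a conditional MaxEnt model over a handful of labels can be evaluated like this:

```python
import math

def maxent_conditional(x, labels, features, weights):
    """p_lambda(y | x) = exp(sum_k lambda_k * f_k(x, y)) / Z(x), a toy version."""
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in labels}
    Z = sum(scores.values())                 # per-input normalizer Z(x)
    return {y: s / Z for y, s in scores.items()}

# Hypothetical binary features and weights for a tiny 3-label tagging problem.
labels = ["D", "N", "V"]
features = [
    lambda x, y: 1.0 if x.endswith("s") and y == "N" else 0.0,
    lambda x, y: 1.0 if x.istitle() and y == "D" else 0.0,
]
weights = [1.5, 0.8]
print(maxent_conditional("wheels", labels, features, weights))
```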


Parametric Form of the CRF Distribution

• Define each potential function as:

  \Psi_{Y_c}(y_c) = \exp\Big( \sum_k \lambda_k f_k(c, y_c, x) \Big)

• The CRF distribution then becomes:

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{c \in C} \sum_k \lambda_k f_k(c, y_c, x) \Big)

• Distinguishing two types of features (edge features f_k with weights \lambda_k, vertex features g_k with weights \mu_k):

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y_v, x) \Big)

• Special case of an HMM-like chain graph (a small code sketch of this case follows below):

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k f_k(y_{i-1}, y_i, x) + \sum_{i,k} \mu_k g_k(y_i, x) \Big)

Parameter Estimation

• Iterative scaling: maximize the log-likelihood

  O(\lambda) = \sum_{i=1}^{N} \log p_\lambda(y^{(i)} \mid x^{(i)}) \propto \sum_{x,y} \tilde p(x, y) \log p_\lambda(y \mid x)

  by iteratively updating \lambda_k \leftarrow \lambda_k + \delta\lambda_k and \mu_k \leftarrow \mu_k + \delta\mu_k
• Define an auxiliary function A(\lambda', \lambda) such that A(\lambda', \lambda) \le O(\lambda') - O(\lambda)
• Initialize each \lambda_k
• Do until convergence:
  • Solve dA(\lambda', \lambda) / d\delta\lambda_k = 0 for each \delta\lambda_k
  • Update the parameters: \lambda_k \leftarrow \lambda_k + \delta\lambda_k
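A minimal sketch of the chain case referenced above: per-position factors exp(Σ_k λ_k f_k(y_{i-1}, y_i, x) + Σ_k μ_k g_k(y_i, x)), with Z(x) computed by the forward algorithm. The feature functions, weights, and start symbol are all placeholders.

```python
import math

def factor(prev_y, y, x, i, f_feats, lambdas, g_feats, mus):
    """Unnormalized local factor exp(sum_k lambda_k f_k + sum_k mu_k g_k) at position i."""
    s = sum(l * f(prev_y, y, x, i) for f, l in zip(f_feats, lambdas))
    s += sum(m * g(y, x, i) for g, m in zip(g_feats, mus))
    return math.exp(s)

def chain_crf_prob(y_seq, x, labels, f_feats, lambdas, g_feats, mus, start="<s>"):
    """p_lambda(y | x) = prod_i factor_i / Z(x), with Z(x) from the forward recursion."""
    # Unnormalized score of the given label sequence.
    score, prev = 1.0, start
    for i, y in enumerate(y_seq):
        score *= factor(prev, y, x, i, f_feats, lambdas, g_feats, mus)
        prev = y
    # Forward algorithm: alpha[y] sums over all label prefixes ending in y.
    alpha = {y: factor(start, y, x, 0, f_feats, lambdas, g_feats, mus) for y in labels}
    for i in range(1, len(x)):
        alpha = {y: sum(alpha[py] * factor(py, y, x, i, f_feats, lambdas, g_feats, mus)
                        for py in labels)
                 for y in labels}
    Z = sum(alpha.values())
    return score / Z

# Toy usage with made-up edge features f_k and vertex features g_k.
labels = ["N", "V"]
f_feats = [lambda py, y, x, i: 1.0 if (py, y) == ("N", "V") else 0.0]
g_feats = [lambda y, x, i: 1.0 if x[i].endswith("s") and y == "N" else 0.0]
x = ["wheels", "turn"]
print(chain_crf_prob(["N", "V"], x, labels, f_feats, [0.7], g_feats, [1.2]))
```

In practice one works in log space and with transition matrices M_i(x), but the structure is the same.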


CRF Parameter Estimation

• For the chain CRF, setting dA(\lambda', \lambda) / d\delta\lambda_k = 0 gives

  \tilde E[f_k] = \sum_{x,y} \tilde p(x, y) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)
               = \sum_{x,y} \tilde p(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\!\big(\delta\lambda_k\, T(x, y)\big)

• T(x, y) = \sum_{i,k} f_k(y_{i-1}, y_i, x) + \sum_{i,k} g_k(y_i, x) is the total feature count
• Unfortunately, T(x, y) is a global property of (x, y): dynamic programming would have to sum over sequences with potentially varying T, making the computation of the exponential sum inefficient

Algorithm S (Generalized Iterative Scaling)

• Introduce a global slack feature so that T(x, y) becomes a constant S for all (x, y):

  s(x, y) = S - \sum_{i,k} f_k(y_{i-1}, y_i, x) - \sum_{i,k} g_k(y_i, x)

• Define forward and backward variables:

  \alpha_i(y \mid x) = \sum_{y'} \alpha_{i-1}(y' \mid x)\, \exp\!\Big( \sum_k \lambda_k f_k(y', y, x) + \sum_k \mu_k g_k(y, x) \Big)
  \beta_i(y \mid x) = \sum_{y'} \beta_{i+1}(y' \mid x)\, \exp\!\Big( \sum_k \lambda_k f_k(y, y', x) + \sum_k \mu_k g_k(y', x) \Big)
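The α/β recursions just defined are what make the model expectations E[f_k] tractable on a chain. Below is a sketch of that computation over abstract local factors M(i, y', y); the factor function, features, and labels are all placeholders, not the paper's.

```python
def forward_backward(x, labels, M, start="<s>"):
    """Alpha/beta recursions over abstract chain factors M(i, prev_y, y) >= 0."""
    n = len(x)
    alpha = [dict() for _ in range(n)]
    beta = [dict() for _ in range(n)]
    for y in labels:
        alpha[0][y] = M(0, start, y)
        beta[n - 1][y] = 1.0
    for i in range(1, n):
        for y in labels:
            alpha[i][y] = sum(alpha[i - 1][py] * M(i, py, y) for py in labels)
    for i in range(n - 2, -1, -1):
        for y in labels:
            beta[i][y] = sum(M(i + 1, y, ny) * beta[i + 1][ny] for ny in labels)
    Z = sum(alpha[n - 1].values())
    return alpha, beta, Z

def expected_feature_count(x, labels, M, f_k, start="<s>"):
    """E_{p(y|x)}[ sum_i f_k(y_{i-1}, y_i, x) ] from the edge marginals
    alpha_{i-1}(y') * M_i(y', y) * beta_i(y) / Z(x), as in HMM forward-backward."""
    alpha, beta, Z = forward_backward(x, labels, M, start)
    total = 0.0
    for i in range(len(x)):
        prev_labels = [start] if i == 0 else labels
        for py in prev_labels:
            a = 1.0 if i == 0 else alpha[i - 1][py]
            for y in labels:
                total += a * M(i, py, y) * beta[i][y] * f_k(py, y, x, i)
    return total / Z

# Toy usage: uniform factors, feature that counts positions tagged "N".
labels = ["N", "V"]
M = lambda i, py, y: 1.0
f_k = lambda py, y, x, i: 1.0 if y == "N" else 0.0
print(expected_feature_count(["wheels", "turn"], labels, M, f_k))   # 1.0
```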


Algorithm S (continued)

• The update equations become:

  \delta\lambda_k = \frac{1}{S} \log \frac{\tilde E[f_k]}{E[f_k]}, \qquad
  \delta\mu_k = \frac{1}{S} \log \frac{\tilde E[g_k]}{E[g_k]}

  where the model expectations E[f_k] and E[g_k] are computed from the forward and backward variables
• Rate of convergence is governed by S
• Note: \alpha, \beta act like the posterior quantities in HMM forward–backward

Algorithm T (Improved Iterative Scaling)

• The equation we want to solve,

  \tilde E[f_k] = \sum_{x,y} \tilde p(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\!\big(\delta\lambda_k\, T(x, y)\big),

  is polynomial in \exp(\delta\lambda_k), so it can be solved with Newton's method
• Define T(x) \triangleq \max_y T(x, y). Then:

  \tilde E[f_k] = \sum_{x,y} \tilde p(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\!\big(\delta\lambda_k\, T(x)\big)
               = \sum_{t=0}^{T_{\max}} a_{k,t}\, \exp(\delta\lambda_k)^{t},
  \quad\text{where}\quad
  a_{k,t} = \sum_{x,y} \tilde p(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \delta\big(t, T(x)\big)

• Analogously define b_{k,t} from the vertex features g_k; a_{k,t} and b_{k,t} play the role of E[f_k \mid T(x) = t] and E[g_k \mid T(x) = t]
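A minimal sketch of the Newton step described above: with β = exp(δλ_k), the constraint becomes Σ_t a_{k,t} β^t = Ẽ[f_k], a polynomial with nonnegative coefficients whose positive root can be found iteratively. The coefficients and target below are made up.

```python
import math

def solve_update(a_t, target, beta0=1.0, tol=1e-10, max_iter=100):
    """Solve sum_t a_t[t] * beta**t = target for beta > 0 via Newton's method;
    returns delta_lambda_k = log(beta). a_t[t] plays the role of a_{k,t}."""
    beta = beta0
    for _ in range(max_iter):
        poly = sum(a * beta ** t for t, a in enumerate(a_t))
        dpoly = sum(t * a * beta ** (t - 1) for t, a in enumerate(a_t) if t > 0)
        step = (poly - target) / dpoly
        beta -= step
        if abs(step) < tol:
            break
    return math.log(beta)

# Hypothetical coefficients a_{k,t} for t = 0..3 and an empirical target E~[f_k].
print(solve_update([0.0, 0.2, 0.5, 0.1], target=1.3))
```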


Experiments with Synthetic Data

1. Modeling label bias:
   • Data generated by the stochastic FSA of Fig. 1
   • Error rates: CRF 4.6%, MEMM 42%
2. Modeling mixed-order sources:
   • Data generated by

     \alpha\, p(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p(y_i \mid y_{i-1})
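A sketch of how such mixed-order label sequences could be generated; the transition tables and α below are placeholders, not the ones used in the paper.

```python
import random

def sample_mixed_order(length, labels, p2, p1, alpha, seed=0):
    """Sample y_i from alpha * p(y_i | y_{i-1}, y_{i-2}) + (1 - alpha) * p(y_i | y_{i-1}).

    p2[(y_{i-2}, y_{i-1})] and p1[y_{i-1}] map the next label to its probability.
    """
    rng = random.Random(seed)
    y = [rng.choice(labels), rng.choice(labels)]        # arbitrary first two labels
    while len(y) < length:
        mix = {lab: alpha * p2[(y[-2], y[-1])][lab] + (1 - alpha) * p1[y[-1]][lab]
               for lab in labels}
        r, cum = rng.random(), 0.0
        for lab in labels:
            cum += mix[lab]
            if r <= cum:
                y.append(lab)
                break
        else:
            y.append(labels[-1])        # guard against floating-point round-off
    return y

# Placeholder transition tables over two labels.
labels = ["a", "b"]
p1 = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
p2 = {(u, v): {"a": 0.5, "b": 0.5} for u in labels for v in labels}
print("".join(sample_mixed_order(20, labels, p2, p1, alpha=0.5)))
```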

POS Tagging Experiment

• Wall Street Journal dataset; 45 POS tags
• Training time: initial parameter values were the result of MEMM training (100 iterations); convergence for CRF+ took 1000 more iterations


Conclusion / Summary

• CRFs are undirected graphical models globally conditioned on the observations
• Advantages of CRFs:
  • Conditional model
  • Allows multiple interacting features
• Disadvantage of CRFs:
  • Slow convergence during training
• Potential future directions:
  • More complex graph structures
  • Faster (approximate) inference/learning algorithms
  • Feature selection/induction algorithms for CRFs…

Useful References

• Hanna Wallach. Efficient Training of Conditional Random Fields. M.Sc. thesis, Division of Informatics, University of Edinburgh, 2002.
• Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.
• Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22.

