Introduction to Conditional Random Fields

Probabilistic Models for Segmenting and Labeling Sequence Data
J. Lafferty, A. McCallum, F. Pereira (ICML '01)

Presented by Kevin Duh, March 4, 2005, UW Markovia Reading Group

Outline

• Motivation:
  • HMM and CMM limitations
  • The label bias problem
• Conditional Random Field (CRF) definition
• CRF parameter estimation
  • Iterative scaling
• Experiments
  • Synthetic data
  • Part-of-speech tagging

Hidden Markov Models (HMMs)

• Generative model of the joint p(X, Y): a chain Y_{i-3} -> Y_{i-2} -> Y_{i-1} -> Y_i, with each observation X_i emitted by its state Y_i
• Must enumerate all possible observation sequences, so it requires an atomic representation of observations
• Assumes independence of features given the state (the same assumption as Naive Bayes: Y generating X_A, X_B independently)

Conditional Markov Models (CMMs)

Example: the Maximum Entropy Markov Model (MEMM).

• Conditional model P(Y | X): no effort is wasted on modeling the observations
• Transition probabilities can depend on both past and future observations
• Features can be dependent
• Suffers from the label bias problem due to per-state normalization

Label Bias Example

Tag set: D = determiner, N = noun, V = verb, A = adjective, R = adverb.

[Figure: an FSA from start state s to end state e with two paths sharing a prefix:
  s -> D ("The") -> N ("robot") -> V ("wheels") -> N ("Fred") -> R ("round") -> e
  s -> D ("The") -> N ("robot") -> N ("wheels") -> V ("are") -> A ("round") -> e]

Observation: "The robot wheels are round." But if P(V | N, "wheels") > P(N | N, "wheels"), then the upper path is chosen regardless of the remaining observations.

Label Bias Problem

• The problem: states with low-entropy next-state distributions effectively ignore their observations
• Fundamental cause: per-state normalization
  • "Conservation of score mass": transitions leaving a given state compete only against each other
• Solution:
  • Let the model account for the whole sequence at once
  • Probability mass can then be amplified or dampened at individual transitions

(After Wallach '02.)

Conditional Random Fields (CRFs)

• An undirected chain over Y_{i-3}, Y_{i-2}, Y_{i-1}, Y_i, globally conditioned on the observation sequence X
• A single exponential model of the joint probability of the entire state sequence given the observations
• Alternative view: a finite-state model with un-normalized transition probabilities

Definition of CRF

Def: A CRF is an undirected graphical model globally conditioned on the observation sequence X. Let G = (V, E) be a graph whose vertices V index the label variables Y. Then (X, Y) is a CRF if, when conditioned on X, the variables Y_v obey the Markov property with respect to G:

  P(Y_v \mid X, Y_w, w \ne v) = P(Y_v \mid X, Y_w, w \sim v)

where w \sim v means that w and v are neighbors in G.

What does the distribution of a Random Field look like?

• Hammersley-Clifford theorem: the conditional independence statements made by the random field imply a factorization over the cliques c \in C of the graph:

  p(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_{c \in C} \Phi_{V_c}(v_c)

• Potential functions \Phi:
  • strictly positive, real-valued functions
  • no direct probabilistic interpretation
  • represent "constraints" on configurations of the random variables: an overall configuration satisfying more constraints has higher probability
• Here, the potential functions are chosen based on the Maximum Entropy principle.

Maximum Entropy Principle

• MaxEnt says: "When estimating a distribution, pick the maximum-entropy distribution that respects all features f(x, y) seen in the training data."
• This is a constrained optimization problem: E_{\tilde p(x,y)}[f] = E_q[f], i.e.

  \sum_{x,y} \tilde p(x, y) f(x, y) = \sum_{x,y} \tilde p(x)\, q(y \mid x) f(x, y)

• Parametric form of the solution:

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x, y) \Big)

Parametric Form of CRF

Define each potential function as

  \Phi_{Y_c}(y_c) = \exp\Big( \sum_k \lambda_k f_k(c, y_c, x) \Big)

so the CRF distribution becomes

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{c \in C} \sum_k \lambda_k f_k(c, y_c, x) \Big)

Distinguishing between two types of features, edge features f_k and vertex features g_k:

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y_v, x) \Big)

Special case of an HMM-like chain graph:

  p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k f_k(y_{i-1}, y_i, x) + \sum_{i,k} \mu_k g_k(y_i, x) \Big)

Note that Z(x) sums over whole label sequences, so normalization is global rather than per-state; a small numerical sketch of this follows below.
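
To make the global normalization concrete, here is a minimal brute-force sketch in Python. It is an illustration for this writeup, not code from the slides or the paper: the label set, feature functions, and weights are toy assumptions.

```python
import itertools
import math

# Toy label set, features, and weights: illustrative assumptions only.
LABELS = ["D", "N", "V"]

def edge_feature(y_prev, y_cur, x, i):
    """f(y_{i-1}, y_i, x): fires when a noun follows a determiner."""
    return 1.0 if (y_prev == "D" and y_cur == "N") else 0.0

def vertex_feature(y_cur, x, i):
    """g(y_i, x): fires when the word 'The' is tagged as a determiner."""
    return 1.0 if (x[i] == "The" and y_cur == "D") else 0.0

LAM, MU = 1.5, 2.0  # weights lambda_k, mu_k (arbitrary demo values)

def score(y, x):
    """Unnormalized log-score: sum_i LAM*f(y_{i-1},y_i,x) + MU*g(y_i,x)."""
    s = 0.0
    for i in range(len(x)):
        if i > 0:
            s += LAM * edge_feature(y[i - 1], y[i], x, i)
        s += MU * vertex_feature(y[i], x, i)
    return s

x = ["The", "robot", "wheels"]

# Global normalization: Z(x) sums exp(score) over ALL label sequences at
# once, so mass leaving a state is not forced to sum to one per state.
# This is exactly what avoids the label bias of per-state normalization.
Z = sum(math.exp(score(y, x)) for y in itertools.product(LABELS, repeat=len(x)))

def p(y, x):
    return math.exp(score(y, x)) / Z

best = max(itertools.product(LABELS, repeat=len(x)), key=lambda y: score(y, x))
print(best, p(best, x))
```

Enumerating all sequences is exponential in the sequence length; the forward-backward sketch after the references computes Z(x) and feature expectations in polynomial time.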

Parameter Estimation

• Iterative scaling: maximize the log-likelihood

  O(\lambda) = \sum_{i=1}^{N} \log p_\lambda(y^{(i)} \mid x^{(i)}) \propto \sum_{x,y} \tilde p(x, y) \log p_\lambda(y \mid x)

  by iteratively updating \lambda_k \leftarrow \lambda_k + \Delta\lambda_k and \mu_k \leftarrow \mu_k + \Delta\mu_k.
• Define an auxiliary function A(\lambda', \lambda) \le O(\lambda') - O(\lambda).
• Initialize each \lambda_k, then repeat until convergence:
  • Solve dA(\lambda', \lambda)/d\Delta\lambda_k = 0 for each \Delta\lambda_k
  • Update the parameters: \lambda_k \leftarrow \lambda_k + \Delta\lambda_k

CRF Parameter Estimation

• For a chain CRF, setting dA(\lambda', \lambda)/d\Delta\lambda_k = 0 gives

  \tilde E[f_k] \equiv \sum_{x,y} \tilde p(x, y) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)
              = \sum_{x,y} \tilde p(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) \exp\big(\Delta\lambda_k T(x, y)\big)

• Here T(x, y) = \sum_{i,k} f_k(y_{i-1}, y_i, x) + \sum_{i,k} g_k(y_i, x) is the total feature count.
• Unfortunately, T(x, y) is a global property of (x, y): dynamic programming would have to sum over sequences with potentially varying T, making the exponential sum inefficient to compute.

Algorithm S (Generalized Iterative Scaling)

• Introduce a global slack feature so that T(x, y) becomes a constant S for all (x, y):

  s(x, y) = S - \sum_{i,k} f_k(y_{i-1}, y_i, x) - \sum_{i,k} g_k(y_i, x)

• Define forward and backward variables:

  \alpha_i(y \mid x) = \alpha_{i-1}(y \mid x) \exp\Big( \sum_k f_k(y_{i-1}, y_i, x) + \sum_k g_k(y_i, x) \Big)
  \beta_i(y \mid x) = \beta_{i+1}(y \mid x) \exp\Big( \sum_k f_k(y_i, y_{i+1}, x) + \sum_k g_k(y_{i+1}, x) \Big)

  Note: these play the same role as the forward/backward (posterior) probabilities in an HMM.
• The update then has the standard GIS closed form,

  \Delta\lambda_k = \frac{1}{S} \log \frac{\tilde E[f_k]}{E[f_k]}

  where E[f_k] is the model expectation computed from \alpha and \beta. The rate of convergence is governed by S. A sketch of these recursions appears after the references below.

Algorithm T (Improved Iterative Scaling)

• Define T(x) \equiv \max_y T(x, y). Then

  \tilde E[f_k] = \sum_{x,y} \tilde p(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) \exp\big(\Delta\lambda_k T(x)\big)
              = \sum_{t=0}^{T_{\max}} a_{k,t} \exp(\Delta\lambda_k)^t

  where

  a_{k,t} = \sum_{\{x,y \,:\, T(x) = t\}} \tilde p(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)

  (and b_{k,t} is defined analogously for the vertex features g_k).
• The equation we want to solve is polynomial in \exp(\Delta\lambda_k), so it can be solved with Newton's method.

Experiments with Synthetic Data

1. Modeling label bias:
  • Data generated by the stochastic FSA of Fig. 1 (the label bias example)
  • Error rates: CRF 4.6%, MEMM 42%
2. Modeling mixed-order sources:
  • Data generated by \alpha\, p(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p(y_i \mid y_{i-1})

POS Tagging Experiment

• Wall Street Journal dataset; 45 POS tags
• Training time: the initial value is the result of MEMM training (100 iterations); convergence for the CRF took 1000 more iterations

Conclusion / Summary

• CRFs are undirected graphical models globally conditioned on the observations
• Advantages of CRFs:
  • Conditional model
  • Allows multiple interacting features
• Disadvantage of CRFs:
  • Slow convergence during training
• Potential future directions:
  • More complex graph structures
  • Faster (approximate) inference/learning algorithms
  • Feature selection/induction algorithms for CRFs

Useful References

• Hanna Wallach. Efficient Training of Conditional Random Fields. M.Sc. thesis, Division of Informatics, University of Edinburgh, 2002.
• Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380-393.
• Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22.
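
Appendix: Forward-Backward Sketch

The parameter updates above all hinge on computing Z(x) and the model expectations E[f_k] efficiently. Below is a minimal sketch of the chain-CRF forward-backward recursions; the matrix form M_i(y', y | x) follows Lafferty et al.'s presentation, but the code is my own illustration, not the slides' or the paper's implementation, and it reuses the same hypothetical toy features and weights as the earlier sketch.

```python
import numpy as np

# Chain-CRF forward-backward in matrix form. Labels, features, and
# weights are toy assumptions, matching the brute-force sketch above.
LABELS = ["D", "N", "V"]
L = len(LABELS)

def f(y_prev, y_cur, x, i):          # edge feature (toy)
    return 1.0 if (y_prev == "D" and y_cur == "N") else 0.0

def g(y_cur, x, i):                  # vertex feature (toy)
    return 1.0 if (x[i] == "The" and y_cur == "D") else 0.0

LAM, MU = 1.5, 2.0                   # weights lambda, mu (arbitrary)

def forward_backward(x):
    """Return Z(x) and the edge marginals p(y_{i-1}=a, y_i=b | x)."""
    n = len(x)
    # M[i][a, b] = exp(LAM*f(a, b, x, i) + MU*g(b, x, i)): the
    # unnormalized "transition" score into position i
    # (position 0 has no real predecessor, so its rows coincide).
    M = np.zeros((n, L, L))
    for i in range(n):
        for a in range(L):
            for b in range(L):
                prev = LABELS[a] if i > 0 else None
                M[i, a, b] = np.exp(LAM * f(prev, LABELS[b], x, i)
                                    + MU * g(LABELS[b], x, i))
    # alpha_i = alpha_{i-1} M_i ; beta_i = M_{i+1} beta_{i+1}
    alpha = np.zeros((n, L))
    alpha[0] = M[0, 0]
    for i in range(1, n):
        alpha[i] = alpha[i - 1] @ M[i]
    beta = np.ones((n, L))
    for i in range(n - 2, -1, -1):
        beta[i] = M[i + 1] @ beta[i + 1]
    Z = alpha[-1].sum()              # global normalizer Z(x)
    # p(y_{i-1}=a, y_i=b | x) = alpha_{i-1}(a) M_i(a,b) beta_i(b) / Z
    marg = [np.outer(alpha[i - 1], beta[i]) * M[i] / Z for i in range(1, n)]
    return Z, marg

def model_expectation_f(x):
    """E[f] = sum_i sum_{a,b} p(y_{i-1}=a, y_i=b | x) f(a, b, x, i)."""
    Z, marg = forward_backward(x)
    return sum(marg[i - 1][a, b] * f(LABELS[a], LABELS[b], x, i)
               for i in range(1, len(x)) for a in range(L) for b in range(L))

x = ["The", "robot", "wheels"]
Z, _ = forward_backward(x)
print("Z(x) =", Z, " E[f] =", model_expectation_f(x))
```

Given E[f_k] from this routine and the empirical count \tilde E[f_k] from the training data, the Algorithm S update \Delta\lambda_k = (1/S) \log(\tilde E[f_k] / E[f_k]) is a single line; Algorithm T instead solves the polynomial \sum_t a_{k,t} \beta_k^t = \tilde E[f_k] in \beta_k = \exp(\Delta\lambda_k), e.g. with Newton's method.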
