
Conditional Random Fields

Sargur N. Srihari [email protected] Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

Outline

1. Generative and Discriminative Models
2. Classifiers: Naïve Bayes and Logistic Regression
3. Sequential Models: HMM and CRF
4. CRF vs HMM performance comparison
   NLP: Table extraction, POS tagging, Shallow parsing, Document analysis
5. CRFs in Computer Vision
6. Summary
7. References

Methods for Classification

• Generative Models (Two-step)

• Infer class-conditional densities p(x|C_k) and priors p(C_k)
• Then use Bayes theorem to determine posterior probabilities:
  p(C_k|x) = p(x|C_k) p(C_k) / p(x)

• Discriminative Models (One-step)

• Directly infer posterior probabilities p(Ck|x)

• Decision Theory
  – In both cases use decision theory to assign each new x to a class

Generative Classifier

• Given M variables x = (x_1,..,x_M), a class variable y, and the joint distribution p(x,y) we can
  – Marginalize: p(y) = ∑_x p(x,y)
  – Condition: p(y|x) = p(x,y) / p(x)
• By conditioning the joint pdf we form a classifier
• Huge need for samples
  – If the x_i are binary, 2^M values are needed to specify p(x,y)
  – If M = 10 and there are two classes, that is 2 × 2^10 = 2048 probabilities; if 100 samples are needed to estimate each probability, 2048 × 100 = 204,800 samples are required

Classification of ML Methods

• Generative Methods
  – "Generative" since sampling can generate synthetic data points
  – Popular models: Naïve Bayes, mixtures of multinomials, mixtures of Gaussians, Hidden Markov Models, Bayesian networks, Markov random fields
• Discriminative Methods
  – Focus on the given task – better performance
  – Popular models: logistic regression, SVMs, traditional neural networks, nearest neighbor, Conditional Random Fields (CRF)
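To make the generative two-step recipe on the Generative Classifier slide concrete, here is a minimal sketch of marginalizing and conditioning a joint distribution; the joint table and its numbers are invented toy values, not from the slides:

```python
import numpy as np

# Toy joint distribution p(x, y) for one binary feature x and two classes y.
# Rows index x in {0, 1}, columns index y in {0, 1}; entries sum to 1.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Marginalize: p(y) = sum_x p(x, y)
p_y = p_xy.sum(axis=0)

# Condition: p(y | x) = p(x, y) / p(x)
p_x = p_xy.sum(axis=1, keepdims=True)
p_y_given_x = p_xy / p_x

print("p(y)       =", p_y)              # [0.5 0.5]
print("p(y | x=1) =", p_y_given_x[1])   # [0.333..., 0.666...]
# Decision theory: assign x = 1 to the class with the largest posterior.
print("predicted class for x=1:", p_y_given_x[1].argmax())
```

Note how p(x) enters only as a normalizer; the discriminative models below avoid estimating the joint (and hence p(x)) altogether.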

Generative-Discriminative Pairs

• Naïve Bayes and Logistic Regression form a generative-discriminative pair • Their relationship mirrors that between HMMs and linear-chain CRFs

Relationship

[Diagram (generative-discriminative pairs): the Naïve Bayes classifier models p(y,x); conditioning gives logistic regression, which models p(y|x). Extending to sequences, the HMM models p(Y,X); conditioning gives the linear-chain CRF, which models p(Y|X). Generative models sit in the top row, discriminative models in the bottom row.]

Naïve Bayes Classifier

• Goal is to predict a single class variable y given a vector of features x = (x_1,..,x_M)
• Assume that once the class label is known the features are independent
• Joint probability model has the form
  p(y,x) = p(y) ∏_{m=1}^{M} p(x_m | y)
  – Need to estimate only M probabilities
• Obtained by defining factors

  ψ(y) = p(y),  ψ_m(y, x_m) = p(x_m | y)
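A minimal sketch of the Naïve Bayes factorization above for binary features; the toy data, function names, and Laplace smoothing are illustrative assumptions rather than part of the slides:

```python
import numpy as np

def fit_naive_bayes(X, y, n_classes, alpha=1.0):
    """Estimate p(y) and p(x_m = 1 | y) for binary features, with Laplace smoothing."""
    prior = np.array([(y == c).mean() for c in range(n_classes)])
    cond = np.array([(X[y == c].sum(axis=0) + alpha) /
                     ((y == c).sum() + 2 * alpha) for c in range(n_classes)])
    return prior, cond

def log_joint(x, prior, cond):
    """log p(y, x) = log p(y) + sum_m log p(x_m | y) for one binary feature vector x."""
    return np.log(prior) + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1)

# Toy data: 6 examples, M = 3 binary features, 2 classes.
X = np.array([[1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
prior, cond = fit_naive_bayes(X, y, n_classes=2)
print("predicted class:", log_joint(np.array([1, 0, 1]), prior, cond).argmax())
```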

Logistic Regression Classifier

• Feature vector x
• Two-class classification: class variable y has values C1 and C2
• A posteriori probability p(C1|x) written as
  p(C1|x) = f(x) = σ(w^T x), where σ(a) = 1 / (1 + exp(-a))
• Known as logistic regression in statistics
  – Although a model for classification rather than for regression

Logistic Sigmoid σ(a) properties:
  A. Symmetry: σ(-a) = 1 - σ(a)
  B. Inverse: a = ln(σ / (1-σ)), known as the logit. Also known as the log odds since it is the ratio ln[p(C1|x)/p(C2|x)]
  C. Derivative: dσ/da = σ(1 - σ)

Relationship between Logistic Regression and Generative Classifier

• Posterior probability of class variable y is

  p(C1|x) = p(x|C1) p(C1) / [ p(x|C1) p(C1) + p(x|C2) p(C2) ]
          = 1 / (1 + exp(-a)) = σ(a),  where a = ln [ p(x|C1) p(C1) / (p(x|C2) p(C2)) ]
• In a generative model we estimate the class-conditionals (which are used to determine a)
• In the discriminative approach we directly estimate a as a linear function of x, i.e., a = w^T x
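The equivalence above can be checked numerically. The sketch below (with invented one-dimensional Gaussian class-conditionals sharing a variance) computes p(C1|x) by Bayes' theorem, by σ(a), and by the linear form a = w·x + w0; all three agree:

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Invented 1-D class-conditionals with a shared standard deviation.
mu1, mu2, sigma, prior1 = 1.0, -1.0, 1.0, 0.6
x = 0.3

# Generative route: Bayes' theorem over class-conditional densities.
num = gauss(x, mu1, sigma) * prior1
post_bayes = num / (num + gauss(x, mu2, sigma) * (1 - prior1))

# Same quantity via the sigmoid: a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))].
a = np.log(gauss(x, mu1, sigma) * prior1) - np.log(gauss(x, mu2, sigma) * (1 - prior1))
post_sigmoid = 1.0 / (1.0 + np.exp(-a))

# With a shared variance, a is linear in x: a = w*x + w0 (the discriminative form).
w = (mu1 - mu2) / sigma ** 2
w0 = (mu2 ** 2 - mu1 ** 2) / (2 * sigma ** 2) + np.log(prior1 / (1 - prior1))
post_linear = 1.0 / (1.0 + np.exp(-(w * x + w0)))

print(post_bayes, post_sigmoid, post_linear)  # all three agree
```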


Logistic Regression Parameters

• With M variables, logistic regression has M parameters w = (w_1,..,w_M)
• By contrast, the generative approach of fitting Gaussian class-conditional densities results in 2M parameters for the means, M(M+1)/2 parameters for the shared covariance matrix, and one for the class prior p(C1)
  – This can be reduced to O(M) parameters by assuming independence via Naïve Bayes

Learning Logistic Parameters

• For data set {x_n, t_n}, t_n ∈ {0,1}, the likelihood is
  p(t|w) = ∏_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
  where t = (t_1,..,t_N) and y_n = p(C1|x_n)
• Parameter w specifies the y_n as follows: y_n = σ(a_n) and a_n = w^T x_n
• Defining the cross-entropy error function
  E(w) = -ln p(t|w) = -∑_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
• Taking the gradient wrt w:
  ∇E(w) = ∑_{n=1}^{N} (y_n - t_n) x_n
• Same form as the sum-of-squares error for linear regression
  – Can use a sequential algorithm where samples are presented one at a time, using
    w^{(t+1)} = w^{(t)} - η ∇E_n
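A minimal batch gradient-descent sketch of the update above; the toy data, learning rate, and iteration count are invented, and a sequential version would apply the same update one sample at a time:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: N samples, M = 2 features; targets t_n in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w = np.zeros(2)
eta = 0.1
for _ in range(200):
    y = sigmoid(X @ w)           # y_n = sigma(w^T x_n)
    grad = X.T @ (y - t)         # grad E(w) = sum_n (y_n - t_n) x_n
    w -= eta * grad / len(t)     # batch gradient step (averaged for stability)

print("learned w:", w)
```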

Multi-class Logistic Regression

• Case of K > 2 classes:
  p(C_k|x) = p(x|C_k) p(C_k) / ∑_j p(x|C_j) p(C_j) = exp(a_k) / ∑_j exp(a_j)
• Known as the normalized exponential, where a_k = ln p(x|C_k) p(C_k)
• The normalized exponential is also known as softmax, since if a_k >> a_j for all j ≠ k then p(C_k|x) ≈ 1 and p(C_j|x) ≈ 0
• In logistic regression we assume the activations are given by a_k = w_k^T x

Learning Parameters for Multiclass

• Determining parameters {w_k} by m.l.e. requires derivatives of y_k wrt all activations a_j:
  ∂y_k/∂a_j = y_k (I_{kj} - y_j), where I_{kj} are elements of the identity matrix
• Written using 1-of-K coding
  – Target vector t_n for x_n belonging to C_k is a binary vector with all elements zero except for element k, which is one
  p(T | w_1,..,w_K) = ∏_{n=1}^{N} ∏_{k=1}^{K} p(C_k|x_n)^{t_nk} = ∏_{n=1}^{N} ∏_{k=1}^{K} y_nk^{t_nk}
  where T is an N × K matrix of target variables with elements t_nk
• Cross-entropy error function
  E(w_1,..,w_K) = -ln p(T | w_1,..,w_K) = -∑_{n=1}^{N} ∑_{k=1}^{K} t_nk ln y_nk
• Gradient of the error function is
  ∇_{w_j} E = ∑_{n=1}^{N} (y_nj - t_nj) x_n
  – which allows a sequential weight vector update algorithm

Graphical Model for Logistic Regression

• Multiclass logistic regression can be written as
  p(y|x) = (1/Z(x)) exp{ λ_y + ∑_{j=1}^{K} λ_{yj} x_j },
  where Z(x) = ∑_y exp{ λ_y + ∑_{j=1}^{K} λ_{yj} x_j }
• Rather than using one weight per class we can define feature functions that are nonzero only for a single class:
  p(y|x) = (1/Z(x)) exp{ ∑_{k=1}^{K} λ_k f_k(y, x) }
• This notation mirrors the usual notation for CRFs
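A small sketch of the softmax posterior above with one weight vector per class; the weights and feature values are invented. The feature-function form p(y|x) ∝ exp{∑_k λ_k f_k(y,x)}, with each f_k nonzero for a single class, is an equivalent reparameterization and is the notation that carries over to CRFs:

```python
import numpy as np

def softmax_posteriors(x, W):
    """p(C_k | x) = exp(a_k) / sum_j exp(a_j), with activations a_k = w_k^T x."""
    a = W @ x                 # one activation per class
    a -= a.max()              # subtract the max for numerical stability
    p = np.exp(a)
    return p / p.sum()

# Invented example: K = 3 classes, M = 4 features.
W = np.array([[ 0.5, -0.2, 0.1,  0.0],
              [-0.3,  0.4, 0.0,  0.2],
              [ 0.1,  0.1, 0.3, -0.5]])
x = np.array([1.0, 2.0, 0.5, -1.0])
p = softmax_posteriors(x, W)
print(p, p.sum())   # posteriors sum to 1; argmax gives the predicted class
```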

3. Sequence Models

• Classifiers predict only a single class variable
• Graphical models are best to model many variables that are interdependent
• Given a sequence of observations X = {x_n}, n = 1,..,N
• Underlying sequence of states Y = {y_n}, n = 1,..,N

Sequence Models

• Hidden Markov Model (HMM)
  [Figure: chain of hidden states Y with observations X]
• Independence assumptions:
  – each state depends only on its immediate predecessor
  – each observation variable depends only on the current state
• Limitations:
  – Strong independence assumptions among the observations X
  – Introduction of a large number of parameters by modeling the joint probability p(y,x), which requires modeling the distribution p(x)

Sequence Models

Hidden Markov Model (HMM)

[Figure: directed chain of hidden states Y generating observations X]

Conditional Random Fields (CRF)

[Figure: undirected chain over labels Y, globally conditioned on the observations x]

A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the observations.

Generative Model: HMM

• X is the observed data sequence to be labeled, Y is the random variable over the label sequences
• HMM is a distribution that models p(Y, X)
• Joint distribution is
  p(Y,X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
  [Figure: HMM with states y_1,..,y_N and observations x_1,..,x_N]
• Highly structured network indicates conditional independences:
  – past states are independent of future states
  – conditional independence of an observation given its state
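A minimal sketch of the HMM joint distribution above; the initial, transition, and emission probabilities are invented toy values:

```python
import numpy as np

# Invented HMM with 2 states and 3 observation symbols.
pi = np.array([0.6, 0.4])                  # p(y_1)
A = np.array([[0.7, 0.3],                  # A[i, j] = p(y_n = j | y_{n-1} = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],             # B[i, o] = p(x_n = o | y_n = i)
              [0.1, 0.3, 0.6]])

def hmm_joint(states, obs):
    """p(Y, X) = p(y_1) p(x_1|y_1) * prod_{n>1} p(y_n|y_{n-1}) p(x_n|y_n)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for n in range(1, len(states)):
        p *= A[states[n - 1], states[n]] * B[states[n], obs[n]]
    return p

print(hmm_joint(states=[0, 0, 1], obs=[0, 1, 2]))
```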

Discriminative Model for Sequential Data

• CRF models the conditional distribution p(Y|X)
• CRF is a random field globally conditioned on the observation X
• The conditional distribution p(Y|X) that follows from the joint distribution p(Y,X) can be rewritten as a Markov Random Field
  [Figure: chain of labels y_1,..,y_N globally conditioned on X]

Markov Random Field (MRF)

• Also called an undirected graphical model
• Joint distribution of a set of variables x is defined by an undirected graph as
  p(x) = (1/Z) ∏_C ψ_C(x_C)

  where C is a maximal clique (each node connected to every other node), x_C is the set of variables in that clique, ψ_C is a potential function (or local or compatibility function) such that ψ_C(x_C) > 0, typically ψ_C(x_C) = exp{-E(x_C)}, and Z is the partition function for normalization:
  Z = ∑_x ∏_C ψ_C(x_C)
• Model refers to a family of distributions, and Field refers to a specific one

MRF with Input-Output Variables

• X is a set of input variables that are observed
  – An element of X is denoted x
• Y is a set of output variables that we predict
  – An element of Y is denoted y
• A are subsets of X ∪ Y
  – Elements of A that are in A ∩ X are denoted x_A
  – Elements of A that are in A ∩ Y are denoted y_A
• Then the undirected graphical model has the form
  p(x,y) = (1/Z) ∏_A Ψ_A(x_A, y_A), where Z = ∑_{x,y} ∏_A Ψ_A(x_A, y_A)

MRF Local Function

• Assume each local function has the form

  Ψ_A(x_A, y_A) = exp{ ∑_m θ_{Am} f_{Am}(x_A, y_A) }

where θA is a parameter vector, fA are feature functions and m=1,..M are feature subscripts

From HMM to CRF

• In an HMM,
  p(Y,X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
• Can be rewritten as
  p(Y,X) = (1/Z) exp{ ∑_n ∑_{i,j∈S} λ_ij 1{y_n = i} 1{y_{n-1} = j} + ∑_n ∑_{i∈S} ∑_{o∈O} μ_oi 1{y_n = i} 1{x_n = o} }
  – Indicator function: 1{x = x'} takes value 1 when x = x' and 0 otherwise
  – Parameters of the distribution: θ = {λ_ij, μ_oi}
• Further rewritten as
  p(Y,X) = (1/Z) exp{ ∑_{n} ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
  – Feature functions have the form f_m(y_n, y_{n-1}, x_n): we need one feature for each state transition (i,j), f_ij(y, y', x) = 1{y = i} 1{y' = j}, and one for each state-observation pair, f_io(y, y', x) = 1{y = i} 1{x = o}
• Which gives us
  p(Y|X) = p(Y,X) / ∑_{Y'} p(Y',X)
         = exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) } / ∑_{Y'} exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y'_n, y'_{n-1}, x_n) }
• Note that Z cancels out

CRF definition

• A linear-chain CRF is a distribution p(Y|X) that takes the form
  p(Y|X) = (1/Z(X)) exp{ ∑_{n} ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
• Where Z(X) is an instance-specific normalization function
  Z(X) = ∑_Y exp{ ∑_{n} ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
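A brute-force sketch of the linear-chain CRF above (all names and weights are illustrative). It uses the indicator features f_ij and f_io from the HMM rewriting, with weights set to log HMM probabilities, scores every label sequence, and computes Z(X) by enumeration; real implementations compute Z(X) with the forward algorithm instead:

```python
import itertools
import numpy as np

S, O = 2, 3                                                        # label set and observation alphabet sizes
lam_trans = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))             # lambda_ij, e.g. ln p(y_n=j | y_{n-1}=i)
lam_obs = np.log(np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))     # mu_io,   e.g. ln p(x_n=o | y_n=i)

def score(Y, X):
    """sum_n sum_m lambda_m f_m(y_n, y_{n-1}, x_n) with indicator features f_io and f_ij."""
    s = sum(lam_obs[y, x] for y, x in zip(Y, X))                   # state-observation features
    s += sum(lam_trans[y0, y1] for y0, y1 in zip(Y[:-1], Y[1:]))   # state-transition features
    return s

def crf_posterior(Y, X):
    """p(Y | X) = exp(score(Y, X)) / Z(X), with Z(X) computed by enumerating label sequences."""
    Z = sum(np.exp(score(Yp, X)) for Yp in itertools.product(range(S), repeat=len(X)))
    return np.exp(score(Y, X)) / Z

X = [0, 1, 2]
print(crf_posterior([0, 0, 1], X))
print(sum(crf_posterior(list(Yp), X)                                # posteriors over all Y sum to 1
          for Yp in itertools.product(range(S), repeat=len(X))))
```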

Functional Models

• Naïve Bayes Classifier (generative):
  p(y,x) = p(y) ∏_{m=1}^{M} p(x_m | y)
• Hidden Markov Model (generative, sequence version):
  p(Y,X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
• Logistic Regression (discriminative):
  p(y|x) = exp{ ∑_{m=1}^{M} λ_m f_m(y, x) } / ∑_{y'} exp{ ∑_{m=1}^{M} λ_m f_m(y', x) }
• Conditional Random Field (discriminative, sequence version):
  p(Y|X) = exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) } / ∑_{Y'} exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y'_n, y'_{n-1}, x_n) }

4. CRF vs Other Models

• Compared with Generative Models
  – CRFs relax the assumption of conditional independence of the observed data given the labels
  – They can contain arbitrary feature functions
    • Each feature function can use the entire input data sequence. The probability of a label at an observed data segment may depend on any past or future data segments.
• Compared with Other Discriminative Models
  – CRFs avoid the limitation of other discriminative Markov models that are biased towards states with few successor states
  – A single exponential model for the joint probability of the entire sequence of labels given the observed sequence
  – Each factor depends only on the previous label, not on future labels: P(y|x) = product of factors, one for each label

NLP: Part Of Speech Tagging

For a sequence of words w = {w1,w2,..wn} find syntactic labels s for each word:

w = The quick brown fox jumped over the lazy dog
s = DET ADJ   ADJ   NOUN-S VERB-P PREP DET ADJ NOUN-S

Baseline accuracy is already 90%:
• Tag every word with its most frequent tag
• Tag unknown words as nouns

Per-word error rates for POS tagging on the Penn treebank:
  Model   Error
  HMM     5.69%
  CRF     5.55%

Table Extraction

To label lines of a text document: whether each line is part of a table, and its role in the table.

Finding tables and extracting information from them is a necessary component of question-answering and IR tasks.

  HMM: 89.7%    CRF: 99.9%

Shallow Parsing

• Precursor to full parsing or information extraction
  – Identifies non-recursive cores of various phrase types in text, e.g., NP chunks
• Input: words in a sentence annotated automatically with POS tags
• Task: label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I)

CRFs beat all reported single-model NP chunking results on the standard evaluation dataset

Handwritten Word Recognition

Given a word image and a lexicon, find the most probable lexical entry.

Algorithm Outline
• Oversegment the image; segment combinations are potential characters
• Given y = a word in the lexicon, s = a grouping of segments, x = input word image features
• Find the word in the lexicon and segment grouping that maximizes P(y, s | x)

CRF Model
  P(y | x, θ) = e^{ψ(y,x;θ)} / ∑_{y'} e^{ψ(y',x;θ)}
  ψ(y, x; θ) = ∑_{j=1}^{m} ( A(j, y_j, x; θ^s) + ∑_{(j,k)∈E} I(j, k, y_j, y_k, x; θ^t) )
  where y_i ∈ {a-z, A-Z, 0-9}, θ: model parameters

Association Potential (state term)
  A(j, y_j, x; θ^s) = ∑_i ( f_i^s(j, y_j, x) · θ_ij^s )

Interaction Potential
  I(j, k, y_j, y_k, x; θ^t) = ∑_i ( f_i^t(j, k, y_j, y_k, x) · θ_ijk^t )

[Figure: precision vs. word recognition rank, comparing the CRF with Segment-DP (SDP)]

Document Zoning (labeling regions) error rates:
                          CRF       Neural Network   Naive Bayes
  Machine-Printed Text    1.64%     2.35%            11.54%
  Handwritten Text        5.19%     20.90%           25.04%
  Noise                   10.20%    15.00%           12.23%
  Total                   4.25%     7.04%            12.58%

5. CRFs in Computer Vision

33 Machine Learning Srihari Applications in Computer Vision

• Image Modeling
  – Labeling regions in an image
  – Object detection and recognition (important areas)
  – Sign detection in natural images
  – Natural scene categorization
  – Gesture recognition
• Image retrieval
• Object/motion segmentation in video scenes

[Figure: the same small patch is water in one context and sky in another]


Various tasks in computer vision require explicit consideration of spatial dependencies:

(a) Segmentation and labeling of the input image into meaningful regions.

(b) Detection of structured textures such as buildings.

(c) Restoration of images corrupted by noise.

Image Modeling

Image Labeling: classifying every image pixel/patch into a finite set of classes
Object Recognition: requires semantic understanding of image content based on labeling
CRF models:
• Directly predict the segmentation/labeling given the observed image
• Incorporate arbitrary functions of the observed features into the training process

CRFs for Image Modeling

• Discriminative Random Fields (DRF)
  – S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. 2003
• Multiscale Conditional Random Fields (MCRF)
  – X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. 2004
• Hierarchical Conditional Random Field (HCRF)
  – S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. 2005
  – Jordan Reynolds and Kevin Murphy. Figure-ground segmentation using a hierarchical conditional random field. 2007
• Tree Structured Conditional Random Fields (TCRF)
  – P. Awasthi, A. Gagrani, and B. Ravindran. Image Modeling using Tree Structured Conditional Random Fields. 2007


Discriminative Random Fields (DRF)

CRF: a normalized product of potential functions.

In the DRF framework, the CRF's transition feature function and state feature function become the interaction potential and the association potential, which take into account the neighborhood interaction of the data and enforce adaptive, data-dependent smoothing over the label field.

Proposed DRF model was applied to the task of detecting man-made structures in natural scenes.


Structure detection by DRF: label each image site as structured or nonstructured. For similar detection rates, the DRF reduces false positives considerably.

Though the model outperforms traditional MRFs, it is not strong enough to capture long range correlations among the labels due to the rigid lattice based structure which allows for only pairwise interactions.

Multiscale Conditional Random Fields (MCRF)

Combination of three different classifiers operating at local, regional and global contexts respectively.

Two main drawbacks:

• Including additional classifiers operating at different scales into the mCRF framework introduces a large number of model parameters.

• The model assumes conditional independence of hidden variables given the label field.

Hierarchical CRF (HCRF)

Scene context is important in different domains to achieve good classification even though the local appearance is impoverished:
• region-region interaction
• object-region interaction
• object-object interaction

HCRF: incorporate the local as well as the global context of any of the three types in a single model

Hierarchical CRF (HCRF)

Layer 1 short-range: pixelwise label

Layer 2 long-range: contextual objects or regions

• Exploit different levels of contextual information in images for robust classification.
• Each layer is modeled as a conditional field that allows one to capture arbitrary observation-dependent label interactions.

Hierarchical CRF (HCRF)

[Figures: examples of region-region, object-region, and object-object interactions]

Tree-Structured CRF (TCRF)

Combine the advantages of hierarchical models and discriminative approaches in a single framework; model the association between the labels of a 2×2 neighborhood of pixels:

a. Introduce a weight vector for every edge, which represents the compatibility between the labels of the pairwise nodes.

b. Introduce a hidden variable H which is connected to all 4 nodes. For every value that H takes, it induces a probability distribution over the labels of the nodes connected to it.

Tree-Structured CRF (TCRF)

• Dividing the whole image into regions of size m × m and introducing a hidden variable for each of them gives a layer of hidden variables over the given label field
• Each label node is associated with a hidden variable in the layer above

[Figure: tree-structured CRF with m = 2]

Similar to other CRFs, the conditional probability p(y, H | x) of a TCRF factors into a product of potential functions.

Tree-Structured CRF (TCRF)

[Figures: object detection and image labeling results]

• Long-range correlations among non-neighboring pixels can be easily modeled as associations among the hidden variables in the layer above.
• The tree structure allows inference to be carried out in time linear in the number of pixels.

Sign Detection in Natural Images

Weinman, J., Hanson, A., McCallum, A. Sign detection in natural images with conditional random fields. 2004.

Calculates a joint labeling of image patches, rather than labeling patches independently, and uses a CRF to learn the characteristics of regions that contain text.

[Figure: detection results with MaxEnt alone vs. adding the CRF]

Human Gesture Recognition

Wang, S. B., Quattoni, A., Morency, L.-P., Demirdjian, D., Darrell, T. Hidden Conditional Random Fields for Gesture Recognition. Computer Science and Artificial Intelligence Laboratory, MIT. 2006.

These gesture classes are: FB - Flip Back, SV - Shrink Vertically, EV - Expand Vertically, DB - Double Back, PB - Point and Back, EH – Expand Horizontally

Human Gesture Recognition

Hidden Conditional Random Fields

s = {s_1, s_2, ..., s_m} is the set of hidden states in the model; it captures certain underlying structure of each class.

The potential function measures the compatibility between a label, a set of observations, and a configuration of the hidden states.


Hidden states distribution

Natural Scene Categorization

Wang, Y., Gong, S., Conditional Random Field for Natural Scene Categorization. 2007.

Classification Oriented Conditional Random Field (COCRF)

Natural Scene Categorization

COCRF discovers the spatial layout distribution of local patches and their pairwise interaction for a category

[Figure: pairwise interaction potentials between topics for four categories]

6. Summary

• Generative and Discriminative methods are two broad approaches: the former involve modeling (two-step), the latter directly solve classification (one-step)
• For classification: Naïve Bayes and Logistic Regression form a generative-discriminative pair
• For sequential data: HMM and CRF are the corresponding pair
• CRFs perform better in language-related tasks
• Generative models are more elegant and have explanatory power


7. References

1. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
2. C. Sutton and A. McCallum, An Introduction to Conditional Random Fields for Relational Learning
3. S. Shetty, H. Srinivasan and S. N. Srihari, Handwritten Word Recognition using CRFs, ICDAR 2007
4. S. Shetty, H. Srinivasan and S. N. Srihari, Segmentation and Labeling of Documents using CRFs, SPIE-DRR 2007
5. X. He, R. S. Zemel and M. Á. Carreira-Perpiñán, Multiscale Conditional Random Fields for Image Labeling

Other Topics in Sequential Data

• Sequential Data and Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
• Hidden Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.2-HiddenMarkovModels.pdf
• Extensions of HMMs: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
• Linear Dynamical Systems: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf