
Conditional Random Fields

Sargur N. Srihari [email protected] Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

Outline

1. Generative and Discriminative Models
2. Classifiers: Naïve Bayes and Logistic Regression
3. Sequential Models: HMM and CRF
4. CRF vs HMM performance comparison
   NLP: Table extraction, POS tagging, Shallow parsing, Document analysis
5. CRFs in Computer Vision
6. Summary
7. References

Methods for Classification

• Generative Models (Two-step)

• Infer class-conditional densities p(x|C_k) and priors p(C_k)
• Then use Bayes theorem to determine posterior probabilities:
  p(C_k|x) = p(x|C_k) p(C_k) / p(x)

• Discriminative Models (One-step)

• Directly infer posterior probabilities p(Ck|x)

• Decision Theory
  – In both cases use decision theory to assign each new x to a class

Generative Classifier

• Given M variables x = (x_1,..,x_M), a class variable y, and the joint distribution p(x,y) we can
  – Marginalize: p(y) = ∑_x p(x,y)
  – Condition: p(y|x) = p(x,y) / p(x)
• By conditioning the joint pdf we form a classifier
• Huge need for samples
  – If the x_i are binary, 2^M values are needed to specify p(x,y)
  – If M = 10 and there are two classes, that is 2 × 2^10 = 2048 probabilities; if 100 samples are needed to estimate each probability, 2048 × 100 = 204,800 samples are required

Classification of ML Methods

• Generative Methods
  – "Generative" since sampling can generate synthetic data points
  – Popular models: Naïve Bayes, mixtures of multinomials, mixtures of Gaussians, Hidden Markov Models, Bayesian networks, Markov random fields
• Discriminative Methods
  – Focus on the given task – better performance
  – Popular models: logistic regression, SVMs, traditional neural networks, nearest neighbor, Conditional Random Fields (CRF)
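To make the generative two-step recipe on the Generative Classifier slide concrete, here is a minimal sketch of marginalizing and conditioning a joint distribution; the joint table and its numbers are invented toy values, not from the slides:

```python
import numpy as np

# Toy joint distribution p(x, y) for one binary feature x and two classes y.
# Rows index x in {0, 1}, columns index y in {0, 1}; entries sum to 1.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Marginalize: p(y) = sum_x p(x, y)
p_y = p_xy.sum(axis=0)

# Condition: p(y | x) = p(x, y) / p(x)
p_x = p_xy.sum(axis=1, keepdims=True)
p_y_given_x = p_xy / p_x

print("p(y)       =", p_y)              # [0.5 0.5]
print("p(y | x=1) =", p_y_given_x[1])   # [0.333..., 0.666...]
# Decision theory: assign x = 1 to the class with the largest posterior.
print("predicted class for x=1:", p_y_given_x[1].argmax())
```

Note how p(x) enters only as a normalizer; the discriminative models below avoid estimating the joint (and hence p(x)) altogether.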

Generative-Discriminative Pairs

• Naïve Bayes and Logistic Regression form a generative-discriminative pair • Their relationship mirrors that between HMMs and linear-chain CRFs

Relationship

[Diagram (generative-discriminative pairs): the Naïve Bayes classifier models p(y,x); conditioning gives logistic regression, which models p(y|x). Extending to sequences, the HMM models p(Y,X); conditioning gives the linear-chain CRF, which models p(Y|X). Generative models sit in the top row, discriminative models in the bottom row.]

Naïve Bayes Classifier

• Goal is to predict a single class variable y given a vector of features x = (x_1,..,x_M)
• Assume that once the class label is known the features are independent
• Joint probability model has the form
  p(y,x) = p(y) ∏_{m=1}^{M} p(x_m | y)
  – Need to estimate only M probabilities
• Obtained by defining factors

  ψ(y) = p(y),  ψ_m(y, x_m) = p(x_m | y)
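A minimal sketch of the Naïve Bayes factorization above for binary features; the toy data, function names, and Laplace smoothing are illustrative assumptions rather than part of the slides:

```python
import numpy as np

def fit_naive_bayes(X, y, n_classes, alpha=1.0):
    """Estimate p(y) and p(x_m = 1 | y) for binary features, with Laplace smoothing."""
    prior = np.array([(y == c).mean() for c in range(n_classes)])
    cond = np.array([(X[y == c].sum(axis=0) + alpha) /
                     ((y == c).sum() + 2 * alpha) for c in range(n_classes)])
    return prior, cond

def log_joint(x, prior, cond):
    """log p(y, x) = log p(y) + sum_m log p(x_m | y) for one binary feature vector x."""
    return np.log(prior) + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1)

# Toy data: 6 examples, M = 3 binary features, 2 classes.
X = np.array([[1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
prior, cond = fit_naive_bayes(X, y, n_classes=2)
print("predicted class:", log_joint(np.array([1, 0, 1]), prior, cond).argmax())
```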

Logistic Regression Classifier

• Feature vector x
• Two-class classification: class variable y has values C1 and C2
• A posteriori probability p(C1|x) written as
  p(C1|x) = f(x) = σ(w^T x), where σ(a) = 1 / (1 + exp(-a))
• Known as logistic regression in statistics
  – Although a model for classification rather than for regression

Logistic Sigmoid σ(a) properties:
  A. Symmetry: σ(-a) = 1 - σ(a)
  B. Inverse: a = ln(σ / (1-σ)), known as the logit. Also known as the log odds since it is the ratio ln[p(C1|x)/p(C2|x)]
  C. Derivative: dσ/da = σ(1 - σ)

Relationship between Logistic Regression and Generative Classifier

• Posterior probability of class variable y is

  p(C1|x) = p(x|C1) p(C1) / [ p(x|C1) p(C1) + p(x|C2) p(C2) ]
          = 1 / (1 + exp(-a)) = σ(a),  where a = ln [ p(x|C1) p(C1) / (p(x|C2) p(C2)) ]
• In a generative model we estimate the class-conditionals (which are used to determine a)
• In the discriminative approach we directly estimate a as a linear function of x, i.e., a = w^T x
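The equivalence above can be checked numerically. The sketch below (with invented one-dimensional Gaussian class-conditionals sharing a variance) computes p(C1|x) by Bayes' theorem, by σ(a), and by the linear form a = w·x + w0; all three agree:

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Invented 1-D class-conditionals with a shared standard deviation.
mu1, mu2, sigma, prior1 = 1.0, -1.0, 1.0, 0.6
x = 0.3

# Generative route: Bayes' theorem over class-conditional densities.
num = gauss(x, mu1, sigma) * prior1
post_bayes = num / (num + gauss(x, mu2, sigma) * (1 - prior1))

# Same quantity via the sigmoid: a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))].
a = np.log(gauss(x, mu1, sigma) * prior1) - np.log(gauss(x, mu2, sigma) * (1 - prior1))
post_sigmoid = 1.0 / (1.0 + np.exp(-a))

# With a shared variance, a is linear in x: a = w*x + w0 (the discriminative form).
w = (mu1 - mu2) / sigma ** 2
w0 = (mu2 ** 2 - mu1 ** 2) / (2 * sigma ** 2) + np.log(prior1 / (1 - prior1))
post_linear = 1.0 / (1.0 + np.exp(-(w * x + w0)))

print(post_bayes, post_sigmoid, post_linear)  # all three agree
```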


Logistic Regression Parameters

• With M variables, logistic regression has M parameters w = (w_1,..,w_M)
• By contrast, the generative approach of fitting Gaussian class-conditional densities results in 2M parameters for the means, M(M+1)/2 parameters for the shared covariance matrix, and one for the class prior p(C1)
  – This can be reduced to O(M) parameters by assuming independence via Naïve Bayes

Learning Logistic Parameters

• For data set {x_n, t_n}, t_n ∈ {0,1}, the likelihood is
  p(t|w) = ∏_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
  where t = (t_1,..,t_N) and y_n = p(C1|x_n)
• Parameter w specifies the y_n as follows: y_n = σ(a_n) and a_n = w^T x_n
• Defining the cross-entropy error function
  E(w) = -ln p(t|w) = -∑_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
• Taking the gradient wrt w:
  ∇E(w) = ∑_{n=1}^{N} (y_n - t_n) x_n
• Same form as the sum-of-squares error for linear regression
  – Can use a sequential algorithm where samples are presented one at a time, using
    w^{(t+1)} = w^{(t)} - η ∇E_n
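A minimal batch gradient-descent sketch of the update above; the toy data, learning rate, and iteration count are invented, and a sequential version would apply the same update one sample at a time:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: N samples, M = 2 features; targets t_n in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w = np.zeros(2)
eta = 0.1
for _ in range(200):
    y = sigmoid(X @ w)           # y_n = sigma(w^T x_n)
    grad = X.T @ (y - t)         # grad E(w) = sum_n (y_n - t_n) x_n
    w -= eta * grad / len(t)     # batch gradient step (averaged for stability)

print("learned w:", w)
```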

Multi-class Logistic Regression

• Case of K > 2 classes:
  p(C_k|x) = p(x|C_k) p(C_k) / ∑_j p(x|C_j) p(C_j) = exp(a_k) / ∑_j exp(a_j)
• Known as the normalized exponential, where a_k = ln p(x|C_k) p(C_k)
• The normalized exponential is also known as softmax, since if a_k >> a_j for all j ≠ k then p(C_k|x) ≈ 1 and p(C_j|x) ≈ 0
• In logistic regression we assume the activations are given by a_k = w_k^T x

Learning Parameters for Multiclass

• Determining parameters {w_k} by m.l.e. requires derivatives of y_k wrt all activations a_j:
  ∂y_k/∂a_j = y_k (I_{kj} - y_j), where I_{kj} are elements of the identity matrix
• Written using 1-of-K coding
  – Target vector t_n for x_n belonging to C_k is a binary vector with all elements zero except for element k, which is one
  p(T | w_1,..,w_K) = ∏_{n=1}^{N} ∏_{k=1}^{K} p(C_k|x_n)^{t_nk} = ∏_{n=1}^{N} ∏_{k=1}^{K} y_nk^{t_nk}
  where T is an N × K matrix of target variables with elements t_nk
• Cross-entropy error function
  E(w_1,..,w_K) = -ln p(T | w_1,..,w_K) = -∑_{n=1}^{N} ∑_{k=1}^{K} t_nk ln y_nk
• Gradient of the error function is
  ∇_{w_j} E = ∑_{n=1}^{N} (y_nj - t_nj) x_n
  – which allows a sequential weight vector update algorithm

Graphical Model for Logistic Regression

• Multiclass logistic regression can be written as
  p(y|x) = (1/Z(x)) exp{ λ_y + ∑_{j=1}^{K} λ_{yj} x_j },
  where Z(x) = ∑_y exp{ λ_y + ∑_{j=1}^{K} λ_{yj} x_j }
• Rather than using one weight per class we can define feature functions that are nonzero only for a single class:
  p(y|x) = (1/Z(x)) exp{ ∑_{k=1}^{K} λ_k f_k(y, x) }
• This notation mirrors the usual notation for CRFs
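A small sketch of the softmax posterior above with one weight vector per class; the weights and feature values are invented. The feature-function form p(y|x) ∝ exp{∑_k λ_k f_k(y,x)}, with each f_k nonzero for a single class, is an equivalent reparameterization and is the notation that carries over to CRFs:

```python
import numpy as np

def softmax_posteriors(x, W):
    """p(C_k | x) = exp(a_k) / sum_j exp(a_j), with activations a_k = w_k^T x."""
    a = W @ x                 # one activation per class
    a -= a.max()              # subtract the max for numerical stability
    p = np.exp(a)
    return p / p.sum()

# Invented example: K = 3 classes, M = 4 features.
W = np.array([[ 0.5, -0.2, 0.1,  0.0],
              [-0.3,  0.4, 0.0,  0.2],
              [ 0.1,  0.1, 0.3, -0.5]])
x = np.array([1.0, 2.0, 0.5, -1.0])
p = softmax_posteriors(x, W)
print(p, p.sum())   # posteriors sum to 1; argmax gives the predicted class
```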

3. Sequence Models

• Classifiers predict only a single class variable
• Graphical models are best to model many variables that are interdependent
• Given a sequence of observations X = {x_n}, n = 1,..,N
• Underlying sequence of states Y = {y_n}, n = 1,..,N

Sequence Models

• Hidden Markov Model (HMM)
  [Figure: chain of hidden states Y with observations X]
• Independence assumptions:
  – each state depends only on its immediate predecessor
  – each observation variable depends only on the current state
• Limitations:
  – Strong independence assumptions among the observations X
  – Introduction of a large number of parameters by modeling the joint probability p(y,x), which requires modeling the distribution p(x)

Sequence Models

Hidden Markov Model (HMM)

[Figure: directed chain of hidden states Y generating observations X]

Conditional Random Fields (CRF)

[Figure: undirected chain over labels Y, globally conditioned on the observations x]

A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the observations.

Generative Model: HMM

• X is the observed data sequence to be labeled, Y is the random variable over the label sequences
• HMM is a distribution that models p(Y, X)
• Joint distribution is
  p(Y,X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
  [Figure: HMM with states y_1,..,y_N and observations x_1,..,x_N]
• Highly structured network indicates conditional independences:
  – past states are independent of future states
  – conditional independence of an observation given its state
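A minimal sketch of the HMM joint distribution above; the initial, transition, and emission probabilities are invented toy values:

```python
import numpy as np

# Invented HMM with 2 states and 3 observation symbols.
pi = np.array([0.6, 0.4])                  # p(y_1)
A = np.array([[0.7, 0.3],                  # A[i, j] = p(y_n = j | y_{n-1} = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],             # B[i, o] = p(x_n = o | y_n = i)
              [0.1, 0.3, 0.6]])

def hmm_joint(states, obs):
    """p(Y, X) = p(y_1) p(x_1|y_1) * prod_{n>1} p(y_n|y_{n-1}) p(x_n|y_n)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for n in range(1, len(states)):
        p *= A[states[n - 1], states[n]] * B[states[n], obs[n]]
    return p

print(hmm_joint(states=[0, 0, 1], obs=[0, 1, 2]))
```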

Discriminative Model for Sequential Data

• CRF models the conditional distribution p(Y|X)
• CRF is a random field globally conditioned on the observation X
• The conditional distribution p(Y|X) that follows from the joint distribution p(Y,X) can be rewritten as a Markov Random Field
  [Figure: chain of labels y_1,..,y_N globally conditioned on X]

Markov Random Field (MRF)

• Also called an undirected graphical model
• Joint distribution of a set of variables x is defined by an undirected graph as
  p(x) = (1/Z) ∏_C ψ_C(x_C)

  where C is a maximal clique (each node connected to every other node), x_C is the set of variables in that clique, ψ_C is a potential function (or local or compatibility function) such that ψ_C(x_C) > 0, typically ψ_C(x_C) = exp{-E(x_C)}, and Z is the partition function for normalization:
  Z = ∑_x ∏_C ψ_C(x_C)
• Model refers to a family of distributions, and Field refers to a specific one

MRF with Input-Output Variables

• X is a set of input variables that are observed
  – An element of X is denoted x
• Y is a set of output variables that we predict
  – An element of Y is denoted y
• A are subsets of X ∪ Y
  – Elements of A that are in A ∩ X are denoted x_A
  – Elements of A that are in A ∩ Y are denoted y_A
• Then the undirected graphical model has the form
  p(x,y) = (1/Z) ∏_A Ψ_A(x_A, y_A), where Z = ∑_{x,y} ∏_A Ψ_A(x_A, y_A)

MRF Local Function

• Assume each local function has the form

  Ψ_A(x_A, y_A) = exp{ ∑_m θ_{Am} f_{Am}(x_A, y_A) }

where θA is a parameter vector, fA are feature functions and m=1,..M are feature subscripts

From HMM to CRF

• In an HMM,
  p(Y,X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
• Can be rewritten as
  p(Y,X) = (1/Z) exp{ ∑_n ∑_{i,j∈S} λ_ij 1{y_n = i} 1{y_{n-1} = j} + ∑_n ∑_{i∈S} ∑_{o∈O} μ_oi 1{y_n = i} 1{x_n = o} }
  – Indicator function: 1{x = x'} takes value 1 when x = x' and 0 otherwise
  – Parameters of the distribution: θ = {λ_ij, μ_oi}
• Further rewritten as
  p(Y,X) = (1/Z) exp{ ∑_{n} ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
  – Feature functions have the form f_m(y_n, y_{n-1}, x_n): we need one feature for each state transition (i,j), f_ij(y, y', x) = 1{y = i} 1{y' = j}, and one for each state-observation pair, f_io(y, y', x) = 1{y = i} 1{x = o}
• Which gives us
  p(Y|X) = p(Y,X) / ∑_{Y'} p(Y',X)
         = exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) } / ∑_{Y'} exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y'_n, y'_{n-1}, x_n) }
• Note that Z cancels out

CRF definition

• A linear-chain CRF is a distribution p(Y|X) that takes the form
  p(Y|X) = (1/Z(X)) exp{ ∑_{n} ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
• Where Z(X) is an instance-specific normalization function
  Z(X) = ∑_Y exp{ ∑_{n} ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
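A brute-force sketch of the linear-chain CRF above (all names and weights are illustrative). It uses the indicator features f_ij and f_io from the HMM rewriting, with weights set to log HMM probabilities, scores every label sequence, and computes Z(X) by enumeration; real implementations compute Z(X) with the forward algorithm instead:

```python
import itertools
import numpy as np

S, O = 2, 3                                                        # label set and observation alphabet sizes
lam_trans = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))             # lambda_ij, e.g. ln p(y_n=j | y_{n-1}=i)
lam_obs = np.log(np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))     # mu_io,   e.g. ln p(x_n=o | y_n=i)

def score(Y, X):
    """sum_n sum_m lambda_m f_m(y_n, y_{n-1}, x_n) with indicator features f_io and f_ij."""
    s = sum(lam_obs[y, x] for y, x in zip(Y, X))                   # state-observation features
    s += sum(lam_trans[y0, y1] for y0, y1 in zip(Y[:-1], Y[1:]))   # state-transition features
    return s

def crf_posterior(Y, X):
    """p(Y | X) = exp(score(Y, X)) / Z(X), with Z(X) computed by enumerating label sequences."""
    Z = sum(np.exp(score(Yp, X)) for Yp in itertools.product(range(S), repeat=len(X)))
    return np.exp(score(Y, X)) / Z

X = [0, 1, 2]
print(crf_posterior([0, 0, 1], X))
print(sum(crf_posterior(list(Yp), X)                                # posteriors over all Y sum to 1
          for Yp in itertools.product(range(S), repeat=len(X))))
```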

Functional Models

• Naïve Bayes Classifier (generative):
  p(y,x) = p(y) ∏_{m=1}^{M} p(x_m | y)
• Hidden Markov Model (generative, sequence version):
  p(Y,X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
• Logistic Regression (discriminative):
  p(y|x) = exp{ ∑_{m=1}^{M} λ_m f_m(y, x) } / ∑_{y'} exp{ ∑_{m=1}^{M} λ_m f_m(y', x) }
• Conditional Random Field (discriminative, sequence version):
  p(Y|X) = exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) } / ∑_{Y'} exp{ ∑_n ∑_{m=1}^{M} λ_m f_m(y'_n, y'_{n-1}, x_n) }

4. CRF vs Other Models

• Compared with Generative Models
  – CRFs relax the assumption of conditional independence of the observed data given the labels
  – They can contain arbitrary feature functions
    • Each feature function can use the entire input data sequence. The probability of a label at an observed data segment may depend on any past or future data segments.
• Compared with Other Discriminative Models
  – CRFs avoid the limitation of other discriminative Markov models that are biased towards states with few successor states
  – A single exponential model for the joint probability of the entire sequence of labels given the observed sequence
  – Each factor depends only on the previous label, not on future labels: P(y|x) = product of factors, one for each label

NLP: Part Of Speech Tagging

For a sequence of words w = {w1,w2,..wn} find syntactic labels s for each word:

w = The quick brown fox jumped over the lazy dog
s = DET ADJ   ADJ   NOUN-S VERB-P PREP DET ADJ NOUN-S

Baseline accuracy is already 90%:
• Tag every word with its most frequent tag
• Tag unknown words as nouns

Per-word error rates for POS tagging on the Penn treebank:
  Model   Error
  HMM     5.69%
  CRF     5.55%

Table Extraction

To label lines of a text document: whether each line is part of a table, and its role in the table.

Finding tables and extracting information from them is a necessary component of question-answering and IR tasks.

  HMM: 89.7%    CRF: 99.9%

Shallow Parsing

• Precursor to full parsing or information extraction
  – Identifies non-recursive cores of various phrase types in text, e.g., NP chunks
• Input: words in a sentence annotated automatically with POS tags
• Task: label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I)

CRFs beat all reported single-model NP chunking results on the standard evaluation dataset

Handwritten Word Recognition

Given a word image and a lexicon, find the most probable lexical entry.

Algorithm Outline
• Oversegment the image; segment combinations are potential characters
• Given y = a word in the lexicon, s = a grouping of segments, x = input word image features
• Find the word in the lexicon and segment grouping that maximizes P(y, s | x)

CRF Model
  P(y | x, θ) = e^{ψ(y,x;θ)} / ∑_{y'} e^{ψ(y',x;θ)}
  ψ(y, x; θ) = ∑_{j=1}^{m} ( A(j, y_j, x; θ^s) + ∑_{(j,k)∈E} I(j, k, y_j, y_k, x; θ^t) )
  where y_i ∈ {a-z, A-Z, 0-9}, θ: model parameters

Association Potential (state term)
  A(j, y_j, x; θ^s) = ∑_i ( f_i^s(j, y_j, x) · θ_ij^s )

Interaction Potential
  I(j, k, y_j, y_k, x; θ^t) = ∑_i ( f_i^t(j, k, y_j, y_k, x) · θ_ijk^t )

[Figure: precision vs. word recognition rank, comparing the CRF with Segment-DP (SDP)]

Document Zoning (labeling regions) error rates:
                          CRF       Neural Network   Naive Bayes
  Machine-Printed Text    1.64%     2.35%            11.54%
  Handwritten Text        5.19%     20.90%           25.04%
  Noise                   10.20%    15.00%           12.23%
  Total                   4.25%     7.04%            12.58%

5. CRFs in Computer Vision

33 Machine Learning Srihari Applications in Computer Vision

• Image Modeling
  – Labeling regions in an image
  – Object detection and recognition (important areas)
  – Sign detection in natural images
  – Natural scene categorization
  – Gesture recognition
• Image retrieval
• Object/motion segmentation in video scenes

[Figure: the same small patch is water in one context and sky in another]


Various tasks in computer vision require explicit consideration of spatial dependencies:

(a) Segmentation and labeling of the input image into meaningful regions.

(b) Detection of structured textures such as buildings.

(c) Restoration of images corrupted by noise.

Image Modeling

Image Labeling: classifying every image pixel/patch into a finite set of classes
Object Recognition: requires semantic understanding of image content based on labeling
CRF models:
• Directly predict the segmentation/labeling given the observed image
• Incorporate arbitrary functions of the observed features into the training process

CRFs for Image Modeling

• Discriminative Random Fields (DRF)
  – S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. 2003
• Multiscale Conditional Random Fields (MCRF)
  – X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. 2004
• Hierarchical Conditional Random Field (HCRF)
  – S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. 2005
  – Jordan Reynolds and Kevin Murphy. Figure-ground segmentation using a hierarchical conditional random field. 2007
• Tree Structured Conditional Random Fields (TCRF)
  – P. Awasthi, A. Gagrani, and B. Ravindran. Image Modeling using Tree Structured Conditional Random Fields. 2007


Discriminative Random Fields (DRF)

CRF: a normalized product of potential functions.

In the DRF framework, the CRF's transition feature function and state feature function become the interaction potential and the association potential, which take into account the neighborhood interaction of the data and enforce adaptive, data-dependent smoothing over the label field.

Proposed DRF model was applied to the task of detecting man-made structures in natural scenes.


Structure detection by DRF: label each image site as structured or nonstructured. For similar detection rates, the DRF reduces false positives considerably.

Though the model outperforms traditional MRFs, it is not strong enough to capture long range correlations among the labels due to the rigid lattice based structure which allows for only pairwise interactions.

Multiscale Conditional Random Fields (MCRF)

Combination of three different classifiers operating at local, regional and global contexts respectively.

Two main drawbacks:

• Including additional classifiers operating at different scales into the mCRF framework introduces a large number of model parameters.

• The model assumes conditional independence of hidden variables given the label field.

Hierarchical CRF (HCRF)

Scene context is important in different domains to achieve good classification even though the local appearance is impoverished:
• region-region interaction
• object-region interaction
• object-object interaction

HCRF: incorporate the local as well as the global context of any of the three types in a single model

Hierarchical CRF (HCRF)

Layer 1 short-range: pixelwise label

Layer 2 long-range: contextual objects or regions

• Exploit different levels of contextual information in images for robust classification.
• Each layer is modeled as a conditional field that allows one to capture arbitrary observation-dependent label interactions.

Hierarchical CRF (HCRF)

[Figures: examples of region-region, object-region, and object-object interactions]

Tree-Structured CRF (TCRF)

Combine the advantages of hierarchical models and discriminative approaches in a single framework; model the association between the labels of a 2×2 neighborhood of pixels:

a. Introduce a weight vector for every edge, which represents the compatibility between the labels of the pairwise nodes.

b. Introduce a hidden variable H which is connected to all 4 nodes. For every value that H takes, it induces a probability distribution over the labels of the nodes connected to it.

Tree-Structured CRF (TCRF)

• Dividing the whole image into regions of size m × m and introducing a hidden variable for each of them gives a layer of hidden variables over the given label field
• Each label node is associated with a hidden variable in the layer above

[Figure: tree-structured CRF with m = 2]

Similar to other CRFs, the conditional probability p(y, H | x) of a TCRF factors into a product of potential functions.

Tree-Structured CRF (TCRF)

[Figures: object detection and image labeling results]

• Long-range correlations among non-neighboring pixels can be easily modeled as associations among the hidden variables in the layer above.
• The tree structure allows inference to be carried out in time linear in the number of pixels.

Sign Detection in Natural Images

Weinman, J., Hanson, A., McCallum, A. Sign detection in natural images with conditional random fields. 2004.

Calculates a joint labeling of image patches, rather than labeling patches independently, and uses a CRF to learn the characteristics of regions that contain text.

[Figure: detection results with MaxEnt alone vs. adding the CRF]

Human Gesture Recognition

Wang, S. B., Quattoni, A., Morency, L.-P., Demirdjian, D., Darrell, T. Hidden Conditional Random Fields for Gesture Recognition. Computer Science and Artificial Intelligence Laboratory, MIT. 2006.

These gesture classes are: FB - Flip Back, SV - Shrink Vertically, EV - Expand Vertically, DB - Double Back, PB - Point and Back, EH – Expand Horizontally

Human Gesture Recognition

Hidden Conditional Random Fields

s = {s_1, s_2, ..., s_m} is the set of hidden states in the model; it captures certain underlying structure of each class.

The potential function measures the compatibility between a label, a set of observations, and a configuration of the hidden states.


Hidden states distribution

Natural Scene Categorization

Wang, Y., Gong, S., Conditional Random Field for Natural Scene Categorization. 2007.

Classification Oriented Conditional Random Field (COCRF)

Natural Scene Categorization

COCRF discovers the spatial layout distribution of local patches and their pairwise interaction for a category

[Figure: pairwise interaction potentials between topics for four categories]

6. Summary

• Generative and Discriminative methods are two broad approaches: the former involve modeling (two-step), the latter directly solve classification (one-step)
• For classification: Naïve Bayes and Logistic Regression form a generative-discriminative pair
• For sequential data: HMM and CRF are the corresponding pair
• CRFs perform better in language-related tasks
• Generative models are more elegant and have explanatory power


7. References

1. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
2. C. Sutton and A. McCallum, An Introduction to Conditional Random Fields for Relational Learning
3. S. Shetty, H. Srinivasan and S. N. Srihari, Handwritten Word Recognition using CRFs, ICDAR 2007
4. S. Shetty, H. Srinivasan and S. N. Srihari, Segmentation and Labeling of Documents using CRFs, SPIE-DRR 2007
5. X. He, R. S. Zemel and M. Á. Carreira-Perpiñán, Multiscale Conditional Random Fields for Image Labeling

Other Topics in Sequential Data

• Sequential Data and Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
• Hidden Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.2-HiddenMarkovModels.pdf
• Extensions of HMMs: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
• Linear Dynamical Systems: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf