Graphical Models and Conditional Random Fields

Presenter: Shih-Hsiang Lin

Bishop, C. M., “Pattern Recognition and Machine Learning,” Springer, 2006
Sutton, C., McCallum, A., “An Introduction to Conditional Random Fields for Relational Learning,” Introduction to Statistical Relational Learning, MIT Press, 2007
Rahul Gupta, “Conditional Random Fields,” Dept. of Computer Science and Engg., IIT Bombay, India

Overview

• Introduction to graphical models
• Applications of graphical models
• More detail on conditional random fields
• Conclusions

Introduction to Graphical Models

Power of Probabilistic Graphical Models

• Why do we need graphical models?
 – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  • Used to design and motivate new models
 – A graph allows us to abstract out the conditional independence relationships between the variables from the details of their parametric forms
  • Provides new insights into existing models
 – Graphical models allow us to define general message-passing algorithms that implement probabilistic inference efficiently
  • Graph-based algorithms for calculation and computation

[Diagram: Probability Theory + Graph Theory → Probabilistic Graphical Models]

Probability Theory

• What do we need to know in advance?
 – Probability Theory
  • Sum Rule (Law of Total Probability / Marginalization): p(x) = \sum_y p(x, y)
  • Product Rule (Chain Rule): p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)

• From the above we can derive Bayes’ theorem

p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
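As a quick sanity check of the two rules and Bayes’ theorem, here is a small Python/NumPy sketch on a made-up 2×2 joint table (the numbers are purely illustrative):

```python
import numpy as np

# A toy joint distribution p(x, y) over two binary variables,
# stored as a table with rows indexed by x and columns by y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)            # sum rule: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)            # sum rule: p(y) = sum_x p(x, y)
p_x_given_y = p_xy / p_y          # product rule: p(x | y) = p(x, y) / p(y)

# Bayes' theorem: p(y | x) = p(x | y) p(y) / sum_y' p(x | y') p(y')
numer = p_x_given_y * p_y         # equals p(x, y) again
p_y_given_x = numer / numer.sum(axis=1, keepdims=True)

print(np.allclose(p_y_given_x, p_xy / p_x[:, None]))  # True
```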

Conditional Independence and Marginal Independence

• Conditional Independence

x \perp y \mid z \;\Leftrightarrow\; p(x \mid y, z) = p(x \mid z)

which is equivalent to saying

p(x, y \mid z) = p(x \mid y, z)\, p(y \mid z) = p(x \mid z)\, p(y \mid z)

– Conditional independence is crucial in practical applications, since we can rarely work with a general joint distribution
• Marginal Independence
  x \perp y \;\Leftrightarrow\; x \perp y \mid \emptyset \;\Leftrightarrow\; p(x, y) = p(x)\, p(y)
• Examples
 – Amount of Speeding Fine ⊥ Type of Car | Speed
 – Lung Cancer ⊥ Yellow Teeth | Smoking
 – Child’s Genes ⊥ Grandparents’ Genes | Parents’ Genes
 – Ability of Team A ⊥ Ability of Team B | ∅ (conditioned on the empty set, i.e. marginal independence)
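The following toy Python sketch illustrates the distinction: a made-up distribution p(x, y, z) is built so that x ⊥ y | z holds exactly, while x and y remain marginally dependent (all numbers are illustrative):

```python
import numpy as np

# Toy joint p(x, y, z) over three binary variables, indexed as p[x, y, z].
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.9, 0.2],   # p(x | z), columns indexed by z
                        [0.1, 0.8]])
p_y_given_z = np.array([[0.7, 0.3],   # p(y | z)
                        [0.3, 0.7]])
p = np.einsum('k,ik,jk->ijk', p_z, p_x_given_z, p_y_given_z)  # p(x, y, z)

# Conditional independence: p(x, y | z) == p(x | z) p(y | z) for every z
p_xy_given_z = p / p.sum(axis=(0, 1), keepdims=True)
print(np.allclose(p_xy_given_z,
                  np.einsum('ik,jk->ijk', p_x_given_z, p_y_given_z)))  # True

# Marginal independence fails: p(x, y) != p(x) p(y)
p_xy = p.sum(axis=2)
print(np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))))  # False
```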

Graphical Models

[Diagram: two graphs over nodes a, b, c — a directed graph and an undirected graph]

• A graph comprises nodes connected by links
 – Nodes (vertices) correspond to random variables
 – Links (edges or arcs) represent the relationships between the variables
• Directed graphs are useful for expressing causal relationships between random variables
• Undirected graphs are better suited to expressing soft constraints between random variables

Directed Graphs

• Consider an arbitrary distribution p(a, b, c); by successive application of the product rule we can write the joint distribution in the form

p(a, b, c) = p(c \mid a, b)\, p(a, b) = p(c \mid a, b)\, p(b \mid a)\, p(a)

* Note that this decomposition holds for any choice of joint distribution

• We can then represent the above equation in terms of a simple graphical model
 – First, we introduce a node for each of the random variables
 – Second, for each conditional distribution we add directed links to the graph

[Diagram: a fully connected directed graph over a, b, c]

Directed Graphs (cont.)

• Let us consider another case: p(x_1, x_2, \ldots, x_7)

p(x_1, x_2, \ldots, x_7) = p(x_7 \mid x_1, \ldots, x_6) \cdots p(x_2 \mid x_1)\, p(x_1)

[Diagram: a fully connected directed graph over x_1, \ldots, x_7 — again, it is a fully connected graph]

• What would happen if some links were dropped? (considering the relationships between nodes)

Directed Graphs (cont.)

• The joint distribution p(x_1, x_2, \ldots, x_7) is therefore given by

p(x_1, x_2, \ldots, x_7) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\, p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)

[Diagram: the corresponding directed graph over x_1, \ldots, x_7]

* The joint distribution is defined by the product of a conditional distribution for each node, conditioned on the variables corresponding to the parents of that node in the graph
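A minimal Python sketch of this factorization, using made-up conditional probability tables for a hypothetical three-node graph x1 → x3 ← x2 (not the seven-node example above), shows that the product of per-node conditionals is itself a valid joint distribution:

```python
import numpy as np

p_x1 = np.array([0.6, 0.4])
p_x2 = np.array([0.7, 0.3])
p_x3_given_x1x2 = np.array([[[0.9, 0.1],   # p(x3 | x1, x2), indexed [x1, x2, x3]
                             [0.5, 0.5]],
                            [[0.4, 0.6],
                             [0.2, 0.8]]])

def joint(x1, x2, x3):
    # product of one conditional per node, conditioned on its parents
    return p_x1[x1] * p_x2[x2] * p_x3_given_x1x2[x1, x2, x3]

# The factors define a proper joint distribution: it sums to one.
total = sum(joint(a, b, c) for a in range(2) for b in range(2) for c in range(2))
print(round(total, 10))  # 1.0
```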

• Thus, for a graph with K nodes, the joint distribution is given by

p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathrm{parent}(x_k))

 where parent(x_k) denotes the set of parents of x_k
• We always restrict the directed graph to have no directed cycles
 – Such graphs are also called directed acyclic graphs (DAGs) or Bayesian networks

Directed Graph: Conditional Independence

• Joint distribution over 3 variables specified by the graph

p(a, b, c) = p(a \mid c)\, p(b \mid c)\, p(c)

[Diagram: tail-to-tail graph, a ← c → b]

If node c is not observed:

p(a, b) = \sum_c p(a \mid c)\, p(b \mid c)\, p(c) \neq p(a)\, p(b) \;\Rightarrow\; a \not\perp b \mid \emptyset

If node c is observed:

p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = p(a \mid c)\, p(b \mid c) \;\Rightarrow\; a \perp b \mid c

The node c is said to be tail-to-tail w.r.t. this path from node a to node b ⇒ this observation ‘blocks’ the path from a to b and causes a and b to become conditionally independent

Directed Graph: Conditional Independence (cont.)

• The second example: p(a, b, c) = p(a)\, p(c \mid a)\, p(b \mid c)

[Diagram: head-to-tail graph, a → c → b]

If node c is not observed:

p(a, b) = p(a) \sum_c p(c \mid a)\, p(b \mid c) = p(a)\, p(b \mid a) \neq p(a)\, p(b) \;\Rightarrow\; a \not\perp b \mid \emptyset

If node c is observed:

p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(c \mid a)\, p(b \mid c)}{p(c)} = p(a \mid c)\, p(b \mid c) \;\Rightarrow\; a \perp b \mid c

The node c is said to be head-to-tail w.r.t. this path from node a to node b ⇒ this observation ‘blocks’ the path from a to b and causes a and b to become conditionally independent

Directed Graph: Conditional Independence (cont.)

• The third example: p(a, b, c) = p(a)\, p(b)\, p(c \mid a, b)

[Diagram: head-to-head graph, a → c ← b]

If node c is not observed:

p(a, b) = \sum_c p(a, b, c) = p(a)\, p(b) \sum_c p(c \mid a, b) = p(a)\, p(b) \;\Rightarrow\; a \perp b \mid \emptyset

If node c is observed:

p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(b)\, p(c \mid a, b)}{p(c)} \neq p(a \mid c)\, p(b \mid c) \;\Rightarrow\; a \not\perp b \mid c

The node c is said to be head-to-head w.r.t. this path from node a to node b ⇒ the conditioned node c ‘unblocks’ the path and renders a and b dependent

D-separation

• A ⊥ B | C if C d-separates A from B
 – We need to consider all possible paths from any node in A to any node in B
 – Any such path is said to be blocked if it includes a node such that either
  (a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
  (b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C
 – If all paths are blocked, then A is said to be d-separated from B by C

[Diagram: two copies of an example graph over nodes a, b, c, e, f illustrating d-separation]

Left graph (c observed): a ⊥̸ b | c    Right graph (f observed): a ⊥ b | f
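The d-separation test can be implemented directly from the two blocking rules above. The sketch below is a brute-force illustration (enumerate all paths in the skeleton and apply the rules); the edge set used for the example graphs is an assumption read off the figure (a → e, f → e, f → b, e → c):

```python
from itertools import chain

def d_separated(parents, a, b, observed):
    """True if a and b are d-separated given `observed` in the DAG
    described by `parents`, a dict mapping each node to its set of parents."""
    nodes = set(parents) | set(chain.from_iterable(parents.values()))
    children = {n: {m for m in nodes if n in parents.get(m, set())} for n in nodes}

    def descendants(n):
        out, stack = set(), [n]
        while stack:
            for c in children[stack.pop()]:
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def paths(node, visited):
        # all simple paths from `node` to `b` in the skeleton (ignoring direction)
        if node == b:
            yield [node]
            return
        for nxt in parents.get(node, set()) | children[node]:
            if nxt not in visited:
                for rest in paths(nxt, visited | {nxt}):
                    yield [node] + rest

    def blocked(path):
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            collider = prev in parents.get(node, set()) and nxt in parents.get(node, set())
            if collider:
                # head-to-head: blocks unless the node or one of its descendants is observed
                if node not in observed and not (descendants(node) & observed):
                    return True
            elif node in observed:
                # head-to-tail or tail-to-tail: blocks if the node is observed
                return True
        return False

    return all(blocked(p) for p in paths(a, {a}))

# Hypothetical edge set for the example graphs: a->e, f->e, f->b, e->c
parents = {'a': set(), 'f': set(), 'e': {'a', 'f'}, 'b': {'f'}, 'c': {'e'}}
print(d_separated(parents, 'a', 'b', {'c'}))   # False: observing c unblocks the collider e
print(d_separated(parents, 'a', 'b', {'f'}))   # True:  observing f blocks the path
```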

Markov Blankets

• The Markov blanket (or Markov boundary) of a node A is the minimal set of nodes that isolates A from the rest of the graph
 – Every set of nodes B in the network is conditionally independent of A when conditioned on the Markov blanket of A:
  p(A | MB(A), B) = p(A | MB(A))
 – MB(A) = {parents(A)} ∪ {children(A)} ∪ {parents-of-children(A)}

[Diagram: the Markov blanket of a node A — its parents, children, and co-parents]
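A minimal helper that reads the Markov blanket off a DAG represented as a parents dictionary (the example graph is hypothetical):

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {node: set_of_parents}:
    its parents, its children, and the other parents of its children."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = set().union(*(parents[c] for c in children)) - {node} if children else set()
    return parents.get(node, set()) | children | co_parents

# Hypothetical DAG: x1 -> x3 <- x2, x3 -> x4
parents = {'x1': set(), 'x2': set(), 'x3': {'x1', 'x2'}, 'x4': {'x3'}}
print(markov_blanket(parents, 'x1'))   # child x3 and co-parent x2
```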

Examples of Directed Graphs

• Hidden Markov models
• Kalman filters
• Probabilistic principal component analysis
• Independent component analysis
• Mixtures of Gaussians
• Transformed component analysis
• Probabilistic expert systems
• Sigmoid belief networks
• Hierarchical mixtures of experts
• etc.

Example: State Space Models (SSM)

• Hidden Markov models
• Kalman filters

[Diagram: state-space model with hidden states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]

P(X, Y) = \cdots\, p(y_t \mid y_{t-1})\, p(x_t \mid y_t)\, p(y_{t+1} \mid y_t) \cdots

Example: Factorial SSM

• Multiple hidden sequences

[Diagram: factorial state-space model with multiple hidden chains and shared observations]

Markov Random Fields

• Random field
 – Let F = \{F_1, F_2, \ldots, F_M\} be a family of random variables defined on the set S, in which each F_i takes a value f_i in a label set L. The family F is called a random field
• Markov random field
 – F is said to be a Markov random field on S with respect to a neighborhood system N if and only if the following two conditions are satisfied
  Positivity: P(f) > 0, \forall f \in \mathcal{F}

Markovianity: P(f_i \mid \text{all other } f) = P(f_i \mid \text{neighbors}(f_i))

[Diagram: undirected graph over nodes a, b, c, d, e]

P(b \mid \text{all other nodes}) = P(b \mid c, d)

Undirected Graphs

• An undirected graphical model is also called a Markov random field, or a Markov network
 – It has a set of nodes, each of which corresponds to a variable or group of variables, as well as a set of links, each of which connects a pair of nodes
• In an undirected graphical model, the joint distribution is a product of non-negative functions over the cliques of the graph

p(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)

where the \psi_C(x_C) are the clique potentials, and Z is a normalization constant (sometimes called the partition function)

[Diagram: undirected graph over a, b, c, d, e with cliques A = \{a, c\}, B = \{b, c, d\}, C = \{c, d, e\}]

p(x) = \frac{1}{Z}\, \psi_A(a, c)\, \psi_B(b, c, d)\, \psi_C(c, d, e)
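A brute-force Python sketch of this factorization for the example graph, with made-up (random) potential tables: the partition function Z is obtained by summing the product of clique potentials over all joint configurations, after which the normalized distribution sums to one:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative potentials for the cliques {a,c}, {b,c,d}, {c,d,e},
# with binary variables.
psi_A = rng.random((2, 2))          # psi_A(a, c)
psi_B = rng.random((2, 2, 2))       # psi_B(b, c, d)
psi_C = rng.random((2, 2, 2))       # psi_C(c, d, e)

def unnormalized(a, b, c, d, e):
    return psi_A[a, c] * psi_B[b, c, d] * psi_C[c, d, e]

# Partition function Z: brute-force sum over all joint configurations.
Z = sum(unnormalized(*x) for x in itertools.product(range(2), repeat=5))

def p(a, b, c, d, e):
    return unnormalized(a, b, c, d, e) / Z

print(round(sum(p(*x) for x in itertools.product(range(2), repeat=5)), 10))  # 1.0
```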

Clique Potentials

• A clique is a fully connected subgraph
 – By clique we usually mean maximal clique (i.e. not contained within another clique)
 – A clique potential \psi_C(x_C) measures the “compatibility” between settings of the variables

[Diagram: the same five-node undirected graph with its maximal cliques highlighted]

Undirected Graphs: Conditional Independence

• A ⊥ B | C — simple graph separation can tell us about conditional independencies: A ⊥ B | C holds whenever every path from a node in A to a node in B passes through a node in C

[Diagram: node sets A and B separated by the set C]

• The Markov blanket of a node A is defined as – MB(A)={Neighbors(A)}

Examples of Undirected Graphs

• Markov Random Fields
• Conditional Random Fields
• Maximum Entropy Markov Models
• Maximum Entropy
• Boltzmann Machines
• etc.

Example: Markov Random Field

[Diagram: Markov random field with hidden variables y_i and observed variables x_i]

P(X, Y) = \frac{1}{Z} \prod_i \psi_i(x_i, y_i) \prod_{j,k} \psi_{jk}(y_j, y_k)

Example: Conditional Random Field

[Diagram: conditional random field with hidden variables y_i, all conditioned on the observation X]

P(Y \mid X) = \frac{1}{Z} \prod_i \psi_i(y_i \mid X) \prod_{j,k} \psi_{jk}(y_j, y_k \mid X)

Summary of Factorization Properties

• Directed graphs
  p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathrm{parent}(x_k))
 – Conditional independence from the d-separation test
 – Directed graphs are better at expressing causal generative models
• Undirected graphs
  p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)
 – Conditional independence from graph separation
 – Undirected graphs are better at representing soft constraints between variables

Applications of Graphical Models

Classification

• Classification is predicting a single class variable y given a vector of features x = (x_1, x_2, \ldots, x_K)
• Naïve Bayes classifier
 – Assume that once the class label is known, all the features are independent
 – Based directly on the joint probability distribution p(y, x)
 – In generative models, the set of parameters must represent both the input distribution and the conditional well
  p(y, x) = p(y) \prod_{i=1}^{K} p(x_i \mid y)
• Logistic regression (maximum entropy classifier)
 – Based directly on the conditional probability p(y \mid x) ⇒ no model of p(x) is needed
 – Discriminative models are not as strongly tied to their input distribution
  p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \lambda_y + \sum_{j=1}^{K} \lambda_{y,j}\, x_j \Big\}
  where Z(x) = \sum_{y} \exp\Big\{ \lambda_y + \sum_{j=1}^{K} \lambda_{y,j}\, x_j \Big\} and \lambda_y is a per-class bias weight
* It can be shown that a Gaussian Naïve Bayes (GNB) classifier implies the parametric form of p(y \mid x) of its discriminative pair, logistic regression
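The contrast can be made concrete with a small Python sketch: a naïve Bayes classifier obtains p(y | x) by modelling the joint p(y, x) and applying Bayes’ rule, while the maximum entropy / logistic regression classifier parameterizes p(y | x) directly and never models p(x). All parameter values below are made up for illustration:

```python
import numpy as np

# Generative: naive Bayes, p(y, x) = p(y) * prod_i p(x_i | y), binary features.
p_y = np.array([0.6, 0.4])
p_xi_given_y = np.array([[0.2, 0.7],     # p(x_i = 1 | y), rows = y, cols = feature i
                         [0.8, 0.3]])

def nb_posterior(x):
    joint = p_y * np.prod(np.where(x, p_xi_given_y, 1 - p_xi_given_y), axis=1)
    return joint / joint.sum()           # p(y | x) via Bayes' rule

# Discriminative: logistic regression / maximum entropy,
# p(y | x) = exp(lam[y] + lam_w[y] . x) / Z(x); no model of p(x) is needed.
lam = np.array([0.0, -0.5])
lam_w = np.array([[-1.0, 1.5],
                  [ 1.0, -1.5]])

def maxent_posterior(x):
    scores = np.exp(lam + lam_w @ x)
    return scores / scores.sum()

x = np.array([1.0, 0.0])
print(nb_posterior(x), maxent_posterior(x))
```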

Classification (cont.)

• Consider a GNB classifier based on the following modeling assumptions

– P(x_i \mid y = y_k) is a Gaussian distribution of the form N(\mu_{ik}, \sigma_i)
– y is Boolean, governed by a Bernoulli distribution with parameter \theta = P(y = 1)

p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 1)\, p(x \mid y = 1) + p(y = 0)\, p(x \mid y = 0)}
 = \frac{1}{1 + \exp\Big( \log \frac{p(y = 0)\, p(x \mid y = 0)}{p(y = 1)\, p(x \mid y = 1)} \Big)}
 = \frac{1}{1 + \exp\Big( \log \frac{1 - \theta}{\theta} + \sum_i \log \frac{p(x_i \mid y = 0)}{p(x_i \mid y = 1)} \Big)}

With the Gaussian class-conditionals,

\sum_i \log \frac{p(x_i \mid y = 0)}{p(x_i \mid y = 1)} = \sum_i \frac{(x_i - \mu_{i1})^2 - (x_i - \mu_{i0})^2}{2\sigma_i^2} = \sum_i \Big( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, x_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \Big)

⇒ p(y = 1 \mid x) = \frac{1}{1 + \exp\Big( \log\frac{1-\theta}{\theta} + \sum_i \big( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, x_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \big) \Big)} = \frac{1}{1 + \exp\big( \lambda_0 + \sum_i \lambda_i x_i \big)}

Sequence Models

• Classifiers predict only a single class variable, but the true power of graphical models lies in their ability to model many variables that are interdependent
 – e.g. named-entity recognition (NER), part-of-speech (POS) tagging
• Hidden Markov models
 – Relax the independence assumption by arranging the output variables in a linear chain
 – To model the joint distribution p(Y, X), an HMM makes two assumptions
  • Each state depends only on its immediate predecessor (first-order assumption)
  • Each observation variable depends only on the current state (output-independence assumption)

p(Y, X) = p(y_0) \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)

[Diagram: HMM with states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]

Sequence Models (cont.)

• Maximum Entropy Markov Models (MEMMs)
 – A conditional model that represents the probability of reaching a state given an observation and the previous state

p(Y \mid X) = p(y_1 \mid x_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}, x_t)

p(y_t \mid y_{t-1}, x_t) = \frac{1}{Z(y_{t-1}, x_t)} \exp\Big( \sum_k \lambda_k f_k(y_{t-1}, y_t, x_t) \Big), \quad Z(y_{t-1}, x_t) = \sum_{y'_t} \exp\Big( \sum_k \lambda_k f_k(y_{t-1}, y'_t, x_t) \Big)

* per-state normalization

[Diagram: MEMM with states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]

– Per-state normalization means that all the mass that arrives at a state must be distributed among the possible successor states ⇒ the label bias problem! Potential victims: discriminative models with per-state normalization

Sequence Models (cont.)

• Label Bias Problem – Consider this MEMM

– P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
– P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

– Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
  • In the training data, label 2 is the only label value observed after label 1; therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x

– However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)

Sequence Models (cont.)

[Diagram: the relationship between generative and conditional models]

                     SEQUENCE               GENERAL
GENERATIVE:     Naïve Bayes  →  HMMs  →  Generative Directed Models
                 ↓ CONDITIONAL   ↓ CONDITIONAL   ↓ CONDITIONAL
CONDITIONAL:    Logistic Regression  →  Linear-chain CRFs  →  General CRFs

From HMM to CRFs

HMMs: p(y, x) = p(y_0) \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)

• We can rewrite the HMM joint distribution p(y, x) as follows

p(Y, X) = \frac{1}{Z} \exp\Big( \sum_t \sum_{i,j \in S} \lambda_{ij}\, 1_{\{y_t = i\}} 1_{\{y_{t-1} = j\}} + \sum_t \sum_{i \in S} \sum_{o \in O} \mu_{oi}\, 1_{\{y_t = i\}} 1_{\{x_t = o\}} \Big)

 – Because we do not require the parameters to be log probabilities, we are no longer guaranteed that the distribution sums to 1
  • So we explicitly enforce this by using a normalization constant Z
• We can write the above equation more compactly by introducing the concept of a feature function

Feature functions for HMMs:
 f_k(y_t, y_{t-1}, x_t) = 1_{\{y_t = i\}} 1_{\{y_{t-1} = j\}}   (state transition)
 f_k(y_t, y_{t-1}, x_t) = 1_{\{y_t = i\}} 1_{\{x_t = o\}}   (state observation)

p(Y, X) = \frac{1}{Z} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)

• The last step is to write the conditional distribution p(Y | X )

p(Y \mid X) = \frac{p(Y, X)}{\sum_{Y'} p(Y', X)} = \frac{\exp\big\{ \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \big\}}{\sum_{y'} \exp\big\{ \sum_t \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \big\}}

This is the linear-chain CRF
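For a short sequence the definition can be evaluated literally: the sketch below (with hypothetical labels, observations, feature functions and weights) scores every label sequence, normalizes by the brute-force Z(X), and picks the most probable labeling. Real implementations replace the brute-force sums with forward-backward and Viterbi, as described in the following slides:

```python
import itertools
import math

LABELS = ['N', 'V']          # hypothetical label set
x = ['they', 'can', 'fish']  # hypothetical observation sequence

# Hypothetical feature functions f_k(y_t, y_prev, x_t) and weights lambda_k.
features = [
    lambda y, yp, xt: 1.0 if y == 'V' and xt == 'can' else 0.0,
    lambda y, yp, xt: 1.0 if yp == 'N' and y == 'V' else 0.0,
    lambda y, yp, xt: 1.0 if y == 'N' and xt in ('they', 'fish') else 0.0,
]
lam = [1.2, 0.8, 2.0]

def score(y_seq):
    """Unnormalized log score: sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t)."""
    s, y_prev = 0.0, 'START'
    for t, y in enumerate(y_seq):
        s += sum(l * f(y, y_prev, x[t]) for l, f in zip(lam, features))
        y_prev = y
    return s

# Z(X): sum over all label sequences (exponential, but fine for a toy example).
Z = sum(math.exp(score(y)) for y in itertools.product(LABELS, repeat=len(x)))

def p(y_seq):
    return math.exp(score(y_seq)) / Z

print(max(itertools.product(LABELS, repeat=len(x)), key=p))  # ('N', 'V', 'N')
```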

More Detail on Conditional Random Fields

Conditional Random Fields

• CRFs have all the advantages of MEMMs without the label bias problem
 – An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
 – A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
• Let G = (V, E) be a graph such that Y = (Y_v)_{v \in V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when, conditioned on X, the random variables Y_v obey the Markov property

p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, \text{neighbors}(Y_v))

[Diagram: linear-chain CRF over Y_1, \ldots, Y_5 conditioned on X]

p(Y_3 \mid X, \text{all other } Y) = p(Y_3 \mid X, Y_2, Y_4)

Linear-Chain Conditional Random Fields

• Definition: Let Y, X be random vectors, \Lambda = \{\lambda_k\} \in \mathbb{R}^K be a parameter vector, and \{f_k(y, y', x_t)\}_{k=1}^{K} be a set of real-valued feature functions. Then a linear-chain conditional random field is a distribution p(Y \mid X) that takes the form

p(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big) \quad\text{or}\quad p(Y \mid X) = \frac{1}{Z(X)} \exp\big( \Lambda^T F(Y, X) \big)

where Z(X) is an instance-specific normalization function

Z(X) = \sum_{Y} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)

* a sum over all possible state sequences — an exponentially large number of terms

Fortunately, the forward-backward algorithm helps us calculate this term efficiently

Forward and Backward Algorithms

• Suppose that we are interested in tagging a sequence only partially, say up to position i
 – Denote the un-normalized probability of a partial labeling ending at position i with fixed label y by α(y, i)
 – Denote the un-normalized probability of a partial segmentation starting at position i+1, assuming a label y at position i, by β(y, i)
 – α and β can be computed via the following recurrences

\alpha(y, i) = \sum_{y'} \alpha(y', i-1) \times \exp\big( \Lambda^T f(y, y', x_i) \big)
\beta(y, i) = \sum_{y'} \beta(y', i+1) \times \exp\big( \Lambda^T f(y', y, x_{i+1}) \big)

– We can now write the marginals and the partition function in terms of these

P(Y_i = y \mid X) = \alpha(y, i)\, \beta(y, i) / Z(X)
P(Y_i = y, Y_{i+1} = y' \mid X) = \alpha(y, i)\, \exp\big( \Lambda^T f(y', y, x_{i+1}) \big)\, \beta(y', i+1) / Z(X)
Z(X) = \sum_y \alpha(y, n) = \sum_y \alpha(y, 1)\, \beta(y, 1)
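A compact NumPy sketch of these recurrences, assuming the per-position scores exp(Λᵀf(·)) have already been collected into a start vector and transition matrices (all values here are random placeholders):

```python
import numpy as np

# Hypothetical scores for an n-position, L-label chain:
#   start[y]        = exp(Lambda^T f(y, START, x_1))
#   M[i][y_prev, y] = exp(Lambda^T f(y, y_prev, x_{i+1}))   for i = 1..n-1
rng = np.random.default_rng(1)
n, L = 4, 3
start = np.exp(rng.normal(size=L))
M = np.exp(rng.normal(size=(n - 1, L, L)))

alpha = np.zeros((n, L))
alpha[0] = start
for i in range(1, n):
    alpha[i] = alpha[i - 1] @ M[i - 1]        # alpha(y,i) = sum_y' alpha(y',i-1) exp(...)

beta = np.ones((n, L))                        # beta(y, n) = 1
for i in range(n - 2, -1, -1):
    beta[i] = M[i] @ beta[i + 1]              # beta(y,i) = sum_y' exp(...) beta(y',i+1)

Z = alpha[-1].sum()                           # Z(X) = sum_y alpha(y, n)
marginals = alpha * beta / Z                  # P(Y_i = y | X)
print(np.allclose(marginals.sum(axis=1), 1.0))  # each position's marginal sums to 1
```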

Inference in Linear CRFs Using the Viterbi Algorithm

• Given the parameter vector Λ, the best labeling for a sequence can be found exactly using the Viterbi algorithm
 – For each tuple of the form (i, y), the Viterbi algorithm maintains the un-normalized probability of the best labeling ending at position i with the label y
 – The recurrence is

V(i, y) = \begin{cases} \max_{y'} \big( V(i-1, y') \times \exp( \Lambda^T f(y, y', x_i) ) \big) & (i > 0) \\ [[\, y = \text{start} \,]] & (i = 0) \end{cases}

• The normalized probability of the best labeling is given by

\max_y V(n, y) \,/\, Z(X)
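A matching NumPy sketch of the Viterbi recurrence and back-tracing, using the same hypothetical score representation as the forward-backward sketch:

```python
import numpy as np

# Hypothetical scores: start[y] for position 1, M[i][y_prev, y] for later positions.
rng = np.random.default_rng(1)
n, L = 4, 3
start = np.exp(rng.normal(size=L))
M = np.exp(rng.normal(size=(n - 1, L, L)))

V = np.zeros((n, L))
back = np.zeros((n, L), dtype=int)
V[0] = start
for i in range(1, n):
    scores = V[i - 1][:, None] * M[i - 1]      # V(i-1, y') * exp(Lambda^T f(y, y', x_i))
    back[i] = scores.argmax(axis=0)            # best predecessor for each label y
    V[i] = scores.max(axis=0)

# Trace back the best labeling; max_y V(n, y) is its unnormalized score
# (divide by Z(X) from forward-backward for the normalized probability).
y = [int(V[-1].argmax())]
for i in range(n - 1, 0, -1):
    y.append(int(back[i][y[-1]]))
print(list(reversed(y)), V[-1].max())
```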

Training (Parameter Estimation)

• The various methods used to train CRFs differ mainly in the objective function they try to optimize
 – Penalized log-likelihood criteria
 – Voted perceptron
 – Pseudo log-likelihood
 – Margin maximization
 – Gradient tree boosting
 – Logarithmic pooling
 – and so on …

Penalized Log-likelihood Criteria

• The conditional log-likelihood of a set of training instances (X^k, Y^k) using parameters Λ is given by

L_\Lambda = \sum_k \big( \Lambda^T F(Y^k, X^k) - \log Z_\Lambda(X^k) \big)

• The gradient of the log-likelihood is given by

\nabla L_\Lambda = \sum_k \Big( F(Y^k, X^k) - \frac{\sum_{Y'} F(Y', X^k)\, \exp( \Lambda^T F(Y', X^k) )}{Z_\Lambda(X^k)} \Big)
 = \sum_k \Big( F(Y^k, X^k) - \sum_{Y'} F(Y', X^k)\, P(Y' \mid X^k) \Big)
 = \sum_k \Big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \Big)

• In order to avoid overfitting, we impose a penalty (the squared Euclidean norm of Λ):

L_\Lambda = \sum_k \big( \Lambda^T F(Y^k, X^k) - \log Z_\Lambda(X^k) \big) - \frac{\|\Lambda\|^2}{2\sigma^2}

and the gradient is given by

\nabla L_\Lambda = \sum_k \big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \big) - \frac{\Lambda}{\sigma^2}
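For a tiny instance the gradient can be checked by brute force, replacing the forward-backward computation of the expectation with explicit enumeration of all label sequences (features, parameters and data below are made up):

```python
import itertools
import numpy as np

# Penalized log-likelihood gradient for one training pair (X, Y),
# with E_{P(Y'|X)}[F(Y', X)] computed by brute force.
LABELS = [0, 1]
X = [0, 1, 1]                     # hypothetical observation sequence
Y = [0, 1, 1]                     # its gold labeling
lam = np.array([0.5, -0.3])       # current parameters
sigma2 = 10.0                     # Gaussian prior variance

def F(y_seq):
    """Global feature vector F(Y, X) = sum_t f(y_t, y_{t-1}, x_t)."""
    f, y_prev = np.zeros(2), -1   # -1 plays the role of START
    for t, y in enumerate(y_seq):
        f[0] += 1.0 if y == X[t] else 0.0        # toy observation feature
        f[1] += 1.0 if y == y_prev else 0.0      # toy transition feature
        y_prev = y
    return f

seqs = list(itertools.product(LABELS, repeat=len(X)))
weights = np.array([np.exp(lam @ F(y)) for y in seqs])
probs = weights / weights.sum()                   # P(Y' | X)
expected_F = sum(p * F(y) for p, y in zip(probs, seqs))

grad = F(Y) - expected_F - lam / sigma2           # gradient for this instance
print(grad)
```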

Penalized Log-likelihood Criteria (cont.)

The tricky term in the gradient is the expectation E_{P(Y \mid X^k)}[F(Y, X^k)], whose computation requires the enumeration of all possible Y sequences

Let us look at the j-th entry of this vector, viz. F_j(Y, X^k), which is equal to \sum_i f_j(y_i, y_{i-1}, X^k_i). Therefore, we can rewrite E_{P(Y \mid X^k)}[F_j(Y, X^k)] as

E_{P(Y \mid X^k)}[F_j(Y, X^k)] = E_{P(Y \mid X^k)}\Big[ \sum_i f_j(y_i, y_{i-1}, X^k_i) \Big]
 = \sum_i E_{P(Y \mid X^k)}\big[ f_j(y_i, y_{i-1}, X^k_i) \big]
 = \sum_i \sum_{y', y} \alpha(y', i-1)\, f_j(y, y', X^k_i)\, \exp\big( \Lambda^T f(y, y', X^k_i) \big)\, \beta(y, i) \,/\, Z(X^k)
 = \sum_i \alpha_{i-1}^T Q_i\, \beta_i \,/\, Z(X^k)

• After obtaining the gradient, various iterative methods can be used to maximize the log-likelihood
 – Improved Iterative Scaling (IIS), Generalized Iterative Scaling (GIS), Limited-Memory Quasi-Newton methods (L-BFGS)

Voted Perceptron Method

• The perceptron uses an approximation of the gradient of the unregularized log-likelihood
  \nabla L_\Lambda = \sum_k \big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \big)
 – It considers one misclassified instance at a time, along with its contribution \big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \big) to the gradient
 – The feature expectation is further approximated by a point estimate of the feature vector at the best possible labeling

\nabla L_\Lambda \approx F(Y^k, X^k) - F(Y^*, X^k), \quad\text{where } Y^* = \arg\max_Y \Lambda^T F(Y, X^k) \quad\text{(a MAP-hypothesis based classifier)}

Using this approximate gradient, the following first-order update rule can be used for maximization

\Lambda_{t+1} = \Lambda_t + F(Y^k, X^k) - F(Y^*, X^k)

This update step is applied once for each misclassified instance in the training set. Alternatively, we can collect all the updates in each pass and take their unweighted average to update the parameters (the “voted”/averaged perceptron)
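A minimal sketch of this update rule on a single made-up training pair; for brevity the arg-max labeling Y* is found by enumeration rather than the Viterbi decoder that a real implementation would use:

```python
import itertools
import numpy as np

LABELS = [0, 1]
X = [0, 1, 1]           # hypothetical observation sequence
Y = [0, 1, 1]           # its gold labeling
lam = np.zeros(2)

def F(y_seq):
    """Global feature vector F(Y, X), same toy features as the gradient sketch."""
    f, y_prev = np.zeros(2), -1
    for t, y in enumerate(y_seq):
        f[0] += 1.0 if y == X[t] else 0.0
        f[1] += 1.0 if y == y_prev else 0.0
        y_prev = y
    return f

for _ in range(5):                                   # a few training passes
    y_star = max(itertools.product(LABELS, repeat=len(X)),
                 key=lambda y: lam @ F(y))           # Y* = argmax_Y Lambda^T F(Y, X)
    if list(y_star) != Y:                            # update only on mistakes
        lam = lam + F(Y) - F(y_star)
print(lam)
```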

Pseudo Log-likelihood

• In many scenarios, we are willing to assign different error values to different labelings
 – It makes sense to maximize the marginal distributions P(y^k_t \mid X^k) instead of P(Y^k \mid X^k)
 – This objective is called the pseudo-likelihood, and for the case of linear CRFs it is given by

L_\Lambda = \sum_k \sum_{t=1}^{T} \log P(y^k_t \mid X^k, \Lambda) = \sum_k \sum_{t=1}^{T} \log \frac{\sum_{y:\, y_t = y^k_t} \exp\big( \Lambda^T F(y, X^k) \big)}{Z_\Lambda(X^k)}

Other Types of CRFs

• Semi-Markov CRFs
 – Still in the realm of first-order Markovian dependence, but the difference is that a label depends only on segment features and the label of the previous segment
 – Instead of assigning labels to each position, labels are assigned to segments

Semi-Markov CRFs

• Skip-Chain CRFs – A conditional model that collectively segments a document into mentions and classifies the mentions by entity type

[Diagram: skip-chain CRF with entity labels A, B, C]

Other Types of CRFs (cont.)

• Factorial CRFs – Several synchronized inter-dependent tasks • Cascading propagates errors

[Diagram: factorial CRF with a named-entity (NE) layer and a part-of-speech (POS) layer]

• Tree CRFs – The dependencies are organized as a tree structure

Tree CRFs

Conclusions

• Conditional Random Fields offer a unique combination of properties
 – discriminatively trained models for sequence segmentation and labeling
 – combination of arbitrary and overlapping observation features from both the past and the future
 – efficient training and decoding based on dynamic programming for a simple chain graph
 – parameter estimation guaranteed to find the global optimum
• Possible future work?
 – More efficient training approaches?
 – Efficient feature induction?
 – Constrained inference?
 – Different topologies?

References

• Lafferty, J., McCallum, A., Pereira, F., “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” In: Proc. 18th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA (2001) 282–289.
• Rahul Gupta, “Conditional Random Fields,” Dept. of Computer Science and Engg., IIT Bombay, India. Document available from http://www.it.iitb.ac.in/~grahul/main.pdf
• Sutton, C., McCallum, A., “An Introduction to Conditional Random Fields for Relational Learning,” Introduction to Statistical Relational Learning, MIT Press, 2007. Document available from http://www.cs.berkeley.edu/~casutton/publications/crf-tutorial.pdf
• Mitchell, T. M., “Machine Learning,” McGraw Hill, 1997. Document available from http://www.cs.cmu.edu/~tom/mlbook.html
• Bishop, C. M., “Pattern Recognition and Machine Learning,” Springer, 2006.
• Bishop, C. M., “Graphical Models and Variational Methods,” video lecture, Machine Learning Summer School 2004. Available from http://videolectures.net/mlss04_bishop_gmvm/
• Ghahramani, Z., “Graphical Models,” video lecture, EPSRC Winter School in Mathematics for Data Modelling, 2008. Available from http://videolectures.net/epsrcws08_ghahramani_gm/
• Roweis, S., “Machine Learning, Probability and Graphical Models,” video lecture, Machine Learning Summer School 2006. Available from http://videolectures.net/mlss06tw_roweis_mlpgm/
