A TUTORIAL ON CONDITIONAL RANDOM FIELDS WITH APPLICATIONS TO MUSIC ANALYSIS

Slim Essid

Telecom ParisTech

November 2013

Slides available online

http://www.telecom-paristech.fr/~essid/resources.htm

Contents

I Introduction

I References to applications in MIR-related tasks

I The model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix

The supervised classification problem

Goal: predict labels y (aka classes or outputs) for some observations o (aka data points, inputs).

Examples:
• Predict genre, mood, user tag... for a music excerpt.
• Predict instrument, chord, notes played... for a music segment.

Supervised classification:
• Each observation o is supposed to pertain to a predefined class Ck: the k-th (discrete) class of a classification problem; k = 1, ··· , K.
• This is represented using a label y for each o; y ∈ Y, e.g. Y = {0, 1}, Y = {1, 2, 3, ..., K}.

Examples of classes:
Tags: Jazz, Bebop, Fast, Exciting, Sax, Drums, ...
Chords: C7, Gmaj7, Fmin7, ...

Features

• Classification relies on features x: descriptors of some qualities/attributes of the inputs o.

Two types of features:
− Continuous features: real-valued, e.g. MFCC, chroma, tempo...
− Discrete/categorical features: symbolic, e.g. note/chord played, key...

[Figure: a sequence of observations o_1, ..., o_4 with chord labels (C−7, F−7) and the corresponding feature values x_1, ..., x_4.]

• Usually assembled as feature vectors x.

Notations

• o: an input (observation) to be classified; e.g.: a music excerpt, a music symbol, an audio frame/segment...

• $x = (x_1, \cdots, x_D)^T$: a D-dimensional column vector (usually in $\mathbb{R}^D$); $x^T$ is a row vector.

• xn is a feature vector among a collection of N examples x1, ··· , xN .

• xjn is the j-th feature coefficient of xn ; 1 ≤ j ≤ D.

• D ={x1, ..., xN } : the set of all training feature-vector examples.

Feature functions (different from features!)

Definition A feature function is a real-valued function of both the input space O (observations) and the output space Y (target labels), fj : O × Y → R, that can be used to compute characteristics of the observations.

• An alternative way to express the characteristics of the observations, in a more flexible manner:

− using output-specific features; − describing the context.

Example: $f_j(o_i, y_i) = \begin{cases} 1 & \text{if } o_i = \mathrm{C},\ o_{i+1} = \mathrm{E} \text{ and } y_i = \mathrm{Cmaj} \\ 0 & \text{otherwise} \end{cases}$
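As an illustration, a feature function of this kind is just a small predicate over the observation sequence and the candidate label. The following minimal sketch (Python, hypothetical names, not from the original slides) mirrors the example above:

```python
# A minimal sketch of an indicator feature function (hypothetical chord-labeling example).
def f_j(observations, i, label):
    """1.0 if the current note is C, the next note is E and the label is Cmaj, else 0.0."""
    has_next = i + 1 < len(observations)
    return float(observations[i] == "C" and has_next
                 and observations[i + 1] == "E" and label == "Cmaj")

# Usage: f_j(["C", "E", "G"], 0, "Cmaj") -> 1.0
```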

Feature functions

I Remarks: − Different attributes may thus be considered for different classes.

− Feature functions are more general than features: one can define

I $f_j(o, y) \triangleq x_j$; or
I $f_j(o, y) \triangleq x$.

− In the following:

I Feature-function notations will be used only when needed.

I Otherwise, feature-vectors will be preferred.

Probabilistic classification

Take decisions based on the MAP rule: $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} p(y|x)$

in order to minimize the error rate (here the expected 0-1 loss). MAP: Maximum A Posteriori probability

→ this is the Bayes decision rule (for the 0-1 loss.)

How to get there?

Generative model based classification

• Objective: $\hat{y} = \operatorname{argmax}_y p(y|x)$.
• By Bayes' rule, $p(y|x) = \frac{p(y,x)}{p(x)} = \frac{p(y)\,p(x|y)}{p(x)}$, hence

$$\hat{y} = \operatorname{argmax}_y \frac{p(y)\,p(x|y)}{p(x)} = \operatorname{argmax}_y p(y)\,p(x|y).$$

• Assuming a fixed prior p(y) (possibly uninformative: p(y) = 1/K), one is left with: $\hat{y} = \operatorname{argmax}_y p(x|y)$.
→ Our decision criterion becomes a maximum-likelihood criterion.
→ This is a generative approach to classification: a probabilistic model of “how to generate x given a class y” is targeted.

Discriminative model based classification

Directly models p(y|x) without wasting efforts on modeling the observations, which is not needed for the goal yˆ = argmaxy p(y|x).

I Pros: − The class posteriors p(y = c|x) are potentially simpler than the class-conditional densities.

[Figure: class-conditional densities p(x|y = 1), p(x|y = 2) and the corresponding class posteriors p(y = 1|x), p(y = 2|x). Generated using pmtk3 (Dunham and Murphy, 2010).]

Discriminative model based classification

Directly models p(y|x) without wasting efforts on modeling the observations, which is not needed for the goal $\hat{y} = \operatorname{argmax}_y p(y|x)$.

I Pros:
− The class posteriors p(y = k|x) are potentially simpler than the class-conditional densities.
− Avoids making unwarranted assumptions about the features, which may be highly dependent (especially with structured data).
− Improved robustness to model imperfections, as independence assumptions will be made only among the labels, not the observations.

I Cons:
− Classes need to be learned jointly and data should be available for all classes.
− Models do not allow for generating observations.

Predicting structured-output data

• In many MIR tasks the outputs are structured, e.g.: Harmonic analysis tasks: chord progressions follow predefined patterns (cadences)

Linear-chain structure

Predicting structured-output data

• In many MIR tasks the outputs are structured, e.g.:

Musical instrument recognition

o1 o2 o3 o4

Dr BsDr BsDrPn BsDr

Linear-chain structure

Predicting structured-output data

• In many MIR tasks the outputs are structured, e.g.:

Autotagging tasks: target tags are correlated (e.g. bebop, Jazz, fast tempo)

Tree structure

→ Need for predictors able to take advantage of this structure.

Predicting sequential data

• In this tutorial, we focus on linear-chain data

o1 o2 o3 o4 o1 o2 o3 o4

Dr BsDr BsDrPn BsDr C−7 C−7 F−7 F−7

• Specialized inference algorithms can then be used (the forward-backward method), which are easier to apprehend.
• More general methods can be used for more general structures, see e.g. (Jensen and Nielsen, 2007).

More notations

• x is a sequence of observations: x = (x1, ··· , xn).

• y is the corresponding sequence of labels: y = (y1, ··· , yn). • We assume we have a training dataset D of N (i.i.d) such sequences: D = {(x(1), y (1)), ··· , (x(N), y (N))}.

• Remarks:

− Observations are no longer assumed to be i.i.d within each sequence.

− Sequences x^(q) do not necessarily have the same length; when needed, n_q will denote the length of x^(q).

The CRF model: a discriminative model for structured-output data

CRF model definition

$$p(y|x;\theta) = \frac{1}{Z(x,\theta)} \exp\left(\sum_{j=1}^{D} \theta_j F_j(x,y)\right) = \frac{1}{Z(x,\theta)}\,\Psi(x,y;\theta); \qquad \theta = \{\theta_1, \cdots, \theta_D\}.$$

• $Z(x,\theta) = \sum_y \exp\left(\sum_j \theta_j F_j(x,y)\right)$ is called the partition function.

• $\Psi(x,y;\theta) = \exp\left(\sum_{j=1}^{D} \theta_j F_j(x,y)\right)$ is called the potential function.

• Remark: feature functions Fj (x, y) depend on the whole sequence of observations x and labels y.

Applications of CRFs

CRF models have proven to be superior to competitors in a variety of application fields.

• They are the state-of-the-art techniques in many natural language processing (NLP) tasks (Taskar et al., 2002; Settles, 2004; Lavergne et al., 2010): part-of-speech tagging (POS), named-entity recognition (NER)...

• They have been successfully used for various computer vision tasks (He et al., 2004; Quattoni et al., 2007; Wang et al., 2006; Morency et al., 2007; Rudovic et al., 2012): image labeling, object and gesture recognition, facial expressions...

• Also for speech analysis tasks (Gunawardana et al., 2005; Reiter et al., 2007; Morris and Fosler-Lussier, 2008; Hong, 2010): speech recognition, speech segmentation, speaker identification...

• To date rarely used for music analysis, despite a great potential...

References to applications in MIR-related tasks

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix

Overview

• Chord recognition (Corey and Fujinaga, 2007). • Autotagging: exploiting correlations between tags for multi-label classification:

− considering pairs of tags (Duan et al., 2008); − attempting to account for correlations between multiple labels and inputs (Li, 2011). • Musical emotion analysis: modeling temporal dynamics of musical emotion (in the V-A space):

− targeting discretized V-A values and using discretized features (Schmidt and Kim, 2011); − targeting continuous V-A values based on CCRF for regression (Imbrasaite et al., 2013).

Audio-to-score alignment based on CRF

• A sophisticated alignment system

− exploiting context of local feature observations depending on (discrete) tempo variables (Joder et al., 2010).

− carefully modeling the temporal structure using a semi-Markov CRF and a tempo-variation model (Joder et al., 2011).

− learning optimal score-audio affinity features (Joder et al., 2013).

• Substantial increase in performance compared to HMM-based approach: on pop music, 97.8% (vs 65.4%) of note onsets correctly associated, using 100-ms tolerance-window.

Beat detection based on CRF (to appear...)

• System developed by Thomas Fillon and Cyril Joder.

• Evaluation by Simon Durand, on a dataset of Classical, Jazz, Blues and Electro. music (Durand, 2013).

• Tolerance on temporal precision: 10% of the tempo value.

System:    (Davies and Plumbley, 2007)   (Ellis, 2007)   (Klapuri et al., 2006)   CRF
F-measure: 68.4                          55.6            69.6                     72.7

The logistic regression model

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model – Model specification – Maximum Entropy Modeling – Parameter estimation – Improvements to the logistic regression model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix

The logistic regression model: model specification

Approach: model the posterior probabilities of the K classes using linear functions of the inputs x, according to:

$$\log\frac{P(C_1|x)}{P(C_K|x)} = w_{10} + w_1^T x$$
$$\log\frac{P(C_2|x)}{P(C_K|x)} = w_{20} + w_2^T x$$
$$\vdots$$
$$\log\frac{P(C_{K-1}|x)}{P(C_K|x)} = w_{(K-1)0} + w_{K-1}^T x$$

Defines a log-linear model specified in terms of $K-1$ log-odds $\log\frac{P(C_k|x)}{P(C_K|x)}$.

The logistic regression model

• From $\log\frac{P(C_k|x)}{P(C_K|x)} = w_{k0} + w_k^T x$; $k = 1, \cdots, K-1$; it is easy to deduce that:

Multiclass logistic regression model
$$P(C_k|x) = \frac{\exp(w_{k0} + w_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(w_{l0} + w_l^T x)}; \quad k = 1, \cdots, K-1,$$
$$P(C_K|x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(w_{l0} + w_l^T x)}$$

• Remarks

− The model is a classification model (not a regression model!) − It is a discriminative model as it targets P(Ck |x) (as opposed to modeling p(x|Ck ) in generative models.)
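To make the parametric form concrete, here is a minimal numerical sketch (Python/NumPy, hypothetical names, not from the original slides) of the class posteriors above, with class K taken as the reference whose score is fixed to zero:

```python
import numpy as np

def lr_posteriors(x, W, b):
    """Multiclass logistic regression posteriors P(C_k | x).

    W: (K-1, D) weight matrix, b: (K-1,) intercepts; class K is the reference class.
    """
    scores = W @ x + b               # w_k0 + w_k^T x, for k = 1..K-1
    scores = np.append(scores, 0.0)  # reference class C_K has score 0
    scores -= scores.max()           # shift for numerical stability
    expo = np.exp(scores)
    return expo / expo.sum()         # softmax over the K classes
```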

Binary classification case

• When K = 2:
$$P(C_1|x) = p = \frac{1}{1 + \exp\left(-(w_{10} + w_1^T x)\right)}, \qquad P(C_2|x) = 1 - p$$
• $p = \frac{1}{1 + \exp(-a)}$, with $a = w_{10} + w_1^T x$

Logistic sigmoid function: $\sigma(a) \triangleq \frac{1}{1 + \exp(-a)}$

The logistic sigmoid function

[Figure: plot of the logistic sigmoid $\sigma(a) \triangleq \frac{1}{1+\exp(-a)}$ for $a \in [-8, 8]$.]

• Properties:

Symmetry: $\sigma(-a) = 1 - \sigma(a)$

Inverse: $a = \log\frac{\sigma}{1-\sigma}$, the logit function

• The odds $\frac{p}{1-p} \in [0, +\infty]$, hence the log-odds $\log\frac{p}{1-p} \in [-\infty, +\infty]$

• Logistic regression models the log-odds as linear functions of the inputs... why is this a good idea? → Study the link to maximum entropy models.

Maximum Entropy: an introductory example (inspired by Berger et al., 1996)

Goal: perform chord transcription using notes played as input (in the symbolic domain). Method: Use a training dataset to estimate p(y|o): the probability to assign the chord y to the observed note o; to be used for MAP decision.

• The structure of both the ground-truth labels y and the observations o reflect a set of facts about the data: rules of harmony. → Our model should capture these facts to perform accurate predictions.

Using facts about the data

• Let’s assume we observe the note C. → The “matching” chord is among {Cmaj, Cmin, A♭maj, Amin, Fmaj, Fmin}.

In terms of statistics: P(Cmaj) + P(Cmin) + P(A♭maj) + P(Amin) + P(Fmaj) + P(Fmin) = 1.

• How to choose P(Cmaj), ··· , P(Fmin)?

• Safe choice:

In terms of statistics: P(Cmaj) = P(Cmin) = ··· = P(Fmin) = 1/6

Why “uniform”?

• Intuitively: the most uniform model according to our knowledge, the only unbiased assumption

• Ancient wisdom:

− Occam’s razor (William of Ockham, 1287-1347): principle of parsimony: “Nunquam ponenda est pluralitas sine necesitate.” [Plurality must never be posited without necessity.]

− Laplace: “when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely.” (Principle of Insufficient Reason)

More facts

• The matching chord is Cmaj or Fmaj 30% of the time:

P(Cmaj) + P(Fmaj) = 3/10
P(Cmaj) + P(Cmin) + ··· + P(Fmaj) + P(Fmin) = 1

• Again many solutions... and a reasonable choice is:

P(Cmaj) = P(Fmaj) = 3/20
P(Cmin) = P(A♭maj) = P(Amin) = P(Fmin) = 7/40

• How to generalize this? How to determine the “most uniform” model subject to the constraints at hand?

Using feature functions

• Need to express the facts about the observations in a flexible way, to make sure the model will match them:

− make use of statistics of the observations: e.g. if C is played, the matching chord is Cmaj or Fmaj with frequency 3/10. − allow for using the context: e.g. if C is followed by E then the chord is Cmaj with frequency 1/2.

→ define feature functions to capture these statistics and use them to impose constraints on the model.

Example: $f_j(o_i, y_i) = \begin{cases} 1 & \text{if } o_i = \mathrm{C},\ o_{i+1} = \mathrm{E} \text{ and } y_i = \mathrm{Cmaj} \\ 0 & \text{otherwise} \end{cases}$

Defining constraints through feature functions

• The training sample can be described in terms of its empirical probability distribution $\tilde{p}(o,y)$:
$$\tilde{p}(o,y) \triangleq \frac{1}{N} \times \text{number of times that } (o,y) \text{ occurs in the sample}$$

• $\tilde{E}(f_j) \triangleq \sum_{o,y} \tilde{p}(o,y)\,f_j(o,y)$: expected value of $f_j$ w.r.t. $\tilde{p}(o,y)$.

• $E(f_j) \triangleq \sum_{o,y} p(o)\,p(y|o)\,f_j(o,y)$: expected value of $f_j$ w.r.t. the model $p(o,y)$.

Defining constraints through feature functions

• The observed statistics (facts) are captured by enforcing:

Constraint equation: $E(f_j) = \tilde{E}(f_j)$, i.e.

$$\sum_{o,y} p(o)\,p(y|o)\,f_j(o,y) = \sum_{o,y} \tilde{p}(o,y)\,f_j(o,y)$$

Maximum entropy principle

• Now how to implement the idea of uniform modeling?

• Among the set $\mathcal{M}$ of probability distributions that satisfy the constraints $E(f_j) = \tilde{E}(f_j)$, choose:

Maximum entropy criterion

$$p^*(y|o) = \operatorname{argmax}_{p(y|o)\in\mathcal{M}} H(y|o);$$

$$H(y|o) \triangleq -\sum_{o,y} p(o)\,p(y|o)\log p(y|o): \text{ the conditional entropy}$$

• Hint from information theory: the discrete distribution with maximum entropy is the uniform distribution.

Solving the problem

Primal: $p^*(y|o) = \operatorname{argmax}_{p(y|o)\in\mathcal{M}} H(y|o)$
Constraints: $E(f_j) = \tilde{E}(f_j)$ and $\sum_y p(y|o) = 1$
Lagrangian: $\mathcal{L}(p,\lambda) \triangleq H(y|o) + \lambda_0\left(\sum_y p(y|o) - 1\right) + \sum_j \lambda_j\left(E(f_j) - \tilde{E}(f_j)\right)$

Setting the derivative of the Lagrangian to 0:
$$p_\lambda(y|o) = \frac{1}{Z_\lambda(o)}\exp\left(\sum_j \lambda_j f_j(o,y)\right); \qquad Z_\lambda(o) = \sum_y \exp\left(\sum_j \lambda_j f_j(o,y)\right)$$

The solution is given by the dual optimum: $\lambda^* = \operatorname{argmax}_\lambda \mathcal{L}(p,\lambda)$.

Compare to the LR model

I Maxent model:
$$p(y = k|o) = \frac{1}{Z_\lambda(o)}\exp\left(\sum_j \lambda_{kj} f_j(o,y)\right); \qquad Z_\lambda(o) = \sum_y \exp\left(\sum_j \lambda_{kj} f_j(o,y)\right).$$

I Logistic regression model:

$$p(y = k|x) = \frac{\exp(w_{k0} + w_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(w_{l0} + w_l^T x)} = \frac{\exp(w'_{k0} + w_k'^T x)}{\sum_{l=1}^{K}\exp(w'_{l0} + w_l'^T x)} = \frac{1}{Z_w(x)}\exp(w'_{k0} + w_k'^T x).$$

Compare to the LR model

I Maxent model:

$$p(y = k|o) = \frac{1}{Z_\lambda(o)}\exp\left(\sum_j \lambda_{kj} f_j(o,y)\right)$$

I Logistic regression model:

$$p(y = k|x) = \frac{1}{Z_w(x)}\exp(w'_{k0} + w_k'^T x)$$

• Using:
− feature functions $f_j(o,y) = x_j$; $f_0(o,y) = 1$ and $x = (x_1, \cdots, x_j, \cdots, x_D)^T$;
− $w'_{k0} + w_k'^T x = \sum_{j=0}^{D} w'_{kj} f_j(o,y)$;

Compare to the CRF model

I Maxent model:

$$p(y = k|o) = \frac{1}{Z_\lambda(o)}\exp\left(\sum_j \lambda_{kj} f_j(o,y)\right)$$

I Logistic regression model:

$$p(y = k|o) = \frac{1}{Z_w(o)}\exp\left(\sum_j w'_{kj} f_j(o,y)\right)$$

I CRF model:
$$p(y|x;\theta) = \frac{1}{Z_\theta(x)}\exp\left(\sum_j \theta_j F_j(x,y)\right)$$

Conclusion

The solution to the maximum entropy problem has the same parametric form as logistic regression and CRF models.

• It is easily shown that the optimal solution is the maximum-likelihood solution in the parametric family $p_\lambda(y|x) = \frac{1}{Z_\lambda(x)}\exp\left(\sum_j \lambda_j x_j\right)$.

• We’ve only considered discrete inputs, what about continuous inputs? − It is found that if the class-conditional densities p(x|y) are members of the exponential family of distributions, then the posterior probabilities are again given by logistic sigmoids of a linear function.

− In particular, the model is optimal with Gaussian densities (with a shared covariance matrix).

The logistic regression model is quite well justified in a variety of situations.

Parameter estimation

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model – Model specification – Maximum Entropy Modeling – Parameter estimation – Improvements to the logistic regression model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix

Fitting the LR models

• Done by maximum likelihood estimation; in practice minimizing the Negative Log-Likelihood (NLL).

• Let θ denote the set of all parameters: θ = {w10, w1, ··· , w(K−1)0, wK−1}.

• The negative log-likelihood for the N (i.i.d.) feature-vector observations is: $L(\mathcal{D};\theta) \triangleq -\sum_{i=1}^{N}\log p(y_i|x_i;\theta)$

• To simplify, we focus on the bi-class case...

NLL for bi-class LR

• Let $y_i = 1$ for $C_1$ observations and $y_i = 0$ for $C_2$ observations.
• Let $p(x_i;\theta) \triangleq p(y_i = 1|x_i;\theta)$; hence $p(y_i = 0|x_i;\theta) = 1 - p(x_i;\theta)$.

• We can write: $p(y_i|x_i;\theta) = p(x_i;\theta)^{y_i}\,(1 - p(x_i;\theta))^{1-y_i}$.

Negative Log-Likelihood

$$L(\mathcal{D};\theta) = L(\tilde{w}) = -\sum_{i=1}^{N}\left\{y_i\log p(x_i;\tilde{w}) + (1-y_i)\log\left(1 - p(x_i;\tilde{w})\right)\right\} = -\sum_{i=1}^{N}\left\{y_i\,\tilde{w}^T\tilde{x}_i - \log\left(1 + \exp(\tilde{w}^T\tilde{x}_i)\right)\right\}$$

where $\tilde{w} = (w_0, w^T)^T$ and $\tilde{x}_i = (1, x_i^T)^T$, so that $\tilde{w}^T\tilde{x}_i = w_0 + w^T x_i$.

Gradient and Hessian of the NLL

Gradient: $\nabla L(\mathcal{D};\tilde{w}) = -\sum_{i=1}^{N}\tilde{x}_i\left(y_i - p(x_i;\tilde{w})\right)$

Hessian: $\frac{\partial^2 L(\mathcal{D};\tilde{w})}{\partial\tilde{w}\,\partial\tilde{w}^T} = \sum_{i=1}^{N}\tilde{x}_i\tilde{x}_i^T\,p(x_i;\tilde{w})\left(1 - p(x_i;\tilde{w})\right)$

→ so the Hessian is positive semi-definite, → the NLL is convex and it has a global minimum.
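As a sanity check, these quantities can be evaluated directly; the sketch below (Python/NumPy, hypothetical names, with a design matrix whose first column is all ones) follows the formulas above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad_hess(w_tilde, X_tilde, y):
    """NLL, gradient and Hessian of binary logistic regression.

    X_tilde: (N, D+1) design matrix whose first column is all ones,
    y: (N,) vector of 0/1 labels, w_tilde: (D+1,) parameter vector.
    """
    p = sigmoid(X_tilde @ w_tilde)                          # p(x_i; w~)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = -X_tilde.T @ (y - p)                             # -sum_i x~_i (y_i - p_i)
    hess = X_tilde.T @ (X_tilde * (p * (1 - p))[:, None])   # sum_i x~_i x~_i^T p_i (1 - p_i)
    return nll, grad, hess
```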

Minimizing the NLL

• By setting the derivatives to zero: $\frac{\partial L(\mathcal{D};\tilde{w})}{\partial w_j} = -\sum_{i=1}^{N}\tilde{x}_{ji}\left(y_i - p(x_i;\tilde{w})\right) = 0; \quad 0 \le j \le D.$

Optimization problem: solve for $\tilde{w}$ the D+1 non-linear equations:

$$\sum_{i=1}^{N} y_i\,\tilde{x}_{ji} = \sum_{i=1}^{N}\tilde{x}_{ji}\,p(x_i;\tilde{w}); \quad 0 \le j \le D$$

• For $j = 0$, since the first coefficient of $\tilde{x}_i$ is 1, that is $\tilde{x}_{0i} = 1$, we get: $\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} p(x_i;\tilde{w}).$

Optimization methods

Objective: solve $\sum_{i=1}^{N} y_i\,\tilde{x}_{ji} = \sum_{i=1}^{N}\tilde{x}_{ji}\,p(x_i;\tilde{w})$.
Problem: no closed-form solution in general (system of D+1 non-linear equations).
Solution: use descent methods.

Among the many available descent algorithms, two are widely used:

− the Newton-Raphson method: fast... but complex (efficient variants exist);

− the stochastic gradient method: easy to implement, adapted to large-scale problems.

Optimization with the Newton-Raphson method

• To minimize $g(\theta)$, consider its second-order Taylor series approximation around $\theta_n$:
$$g(\theta) \approx g(\theta_n) + \nabla g(\theta_n)^T(\theta - \theta_n) + \frac{1}{2}(\theta - \theta_n)^T H(\theta_n)(\theta - \theta_n);$$

∇g(θn) and H(θn) are resp. the gradient and Hessian of g(θ) at θn.

• This approximation is a quadratic function which is minimized by solving:

∇g(θn) + H(θn)(θ − θn) = 0.

Hence the Newton-Raphson step

$$\theta_{n+1} = \theta_n - H(\theta_n)^{-1}\,\nabla g(\theta_n).$$
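Putting this step together with the gradient and Hessian above gives the classical fitting loop for logistic regression; a minimal sketch (Python/NumPy, reusing the hypothetical nll_grad_hess helper from the earlier example):

```python
import numpy as np

def newton_raphson_lr(X_tilde, y, n_iters=25, tol=1e-8):
    """Fit binary logistic regression by Newton-Raphson (hypothetical helper names)."""
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        _, grad, hess = nll_grad_hess(w, X_tilde, y)                  # see earlier sketch
        step = np.linalg.solve(hess + 1e-10 * np.eye(len(w)), grad)   # small jitter for stability
        w -= step                                                     # theta_{n+1} = theta_n - H^{-1} grad
        if np.linalg.norm(step) < tol:
            break
    return w
```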

Optimization with the Newton-Raphson method: illustration

[Figure: g(θ) and its quadratic approximation g_quad(θ) around θ_n; the quadratic is minimized at θ_n + s_n. Generated using pmtk3 (Dunham and Murphy, 2010).]

[Figure: the following iteration, moving from θ_{n+1} toward θ_{n+1} + s_{n+1}. Generated using pmtk3 (Dunham and Murphy, 2010).]

Optimization with the Newton-Raphson method: discussion

• Typically the algorithm converges (though overshooting may occur), and convergence speed is quadratic.

• D has to be small enough so that it is not too costly to recompute and store the inverse Hessian matrix at each iteration. • Otherwise use Quasi-Newton methods:

− BFGS (Broyden, Fletcher, Goldfarb and Shanno) method: approximates the inverse Hessian using successive gradient values. − L-BFGS (limited memory BFGS) method: stores only a few vectors used to approximate the inverse Hessian.

• Alternatively, use stochastic gradient learning (see Appendix):

− Makes gradient updates based on one training example at a time. − In practice: simple approach, slow convergence, less accurate than L-BFGS.

Improvements to the logistic regression model

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model – Model specification – Maximum Entropy Modeling – Parameter estimation – Improvements to the logistic regression model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix

ℓ2-regularization

• To avoid overfitting the complexity of the model should be penalized.

• Similarly to ridge regression (Hastie et al., 2009), a quadratic regularization term can be added to the NLL:

Regularized logistic regression problem

$$\hat{w} = \operatorname{argmin}_{\tilde{w}}\ L(\mathcal{D};\tilde{w}) + \frac{\gamma}{2}\|w\|^2 = \operatorname{argmin}_{\tilde{w}}\left\{-\sum_{i=1}^{N}\left[y_i\,\tilde{w}^T\tilde{x}_i - \log\left(1 + \exp(\tilde{w}^T\tilde{x}_i)\right)\right] + \frac{\gamma}{2}\sum_{j=1}^{D} w_j^2\right\}$$

γ ≥ 0 : complexity parameter controlling the amount of shrinkage; usually tuned by cross-validation.
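In practice the change to the fitting code is small: only the objective and gradient gain a penalty term (the intercept is typically left unpenalized). A minimal sketch, again reusing the hypothetical nll_grad_hess helper from the earlier example:

```python
import numpy as np

def regularized_nll_grad(w_tilde, X_tilde, y, gamma):
    """l2-regularized NLL and gradient; the intercept w_0 is not penalized."""
    nll, grad, _ = nll_grad_hess(w_tilde, X_tilde, y)   # see earlier sketch
    w_no_intercept = w_tilde.copy()
    w_no_intercept[0] = 0.0
    nll += 0.5 * gamma * np.sum(w_no_intercept ** 2)    # + (gamma/2) * ||w||^2
    grad += gamma * w_no_intercept                      # + gamma * w
    return nll, grad
```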

ℓ2-regularization: discussion

Recall that $\hat{w} = \operatorname{argmin}_{\tilde{w}}\ L(\mathcal{D};\tilde{w}) + \frac{\gamma}{2}\|w\|^2$ is equivalent to:
$$\hat{w} = \operatorname{argmin}_{\tilde{w}}\ L(\mathcal{D};\tilde{w}) \quad \text{subject to } \|w\|^2 \le t$$

for some t which has a correspondence to γ.

ℓ2-regularization: gradient and Hessian

Gradient: $\nabla L_{\ell_2}(\mathcal{D};\tilde{w}) = \nabla L(\mathcal{D};\tilde{w}) + \gamma w$

Hessian: $H_{\ell_2}(\tilde{w}) = H(\tilde{w}) + \gamma I_{D+1}$

→ So the Hessian becomes positive definite, the NLL is now strictly convex and it has a unique global minimum. → The previous optimization methods can be straightforwardly adapted by modifying the expressions of the gradient and Hessian.

ℓ1-regularization

• Proceed as in the LASSO (Hastie et al., 2009), using an ℓ1-regularization.

ℓ1-regularized logistic regression problem

$$\hat{w} = \operatorname{argmin}_{\tilde{w}}\ L(\mathcal{D};\tilde{w}) + \gamma\|w\|_1 = \operatorname{argmin}_{\tilde{w}}\ L(\mathcal{D};\tilde{w}) + \gamma\sum_{j=1}^{D}|w_j|; \quad \gamma \ge 0.$$

ℓ1-regularization: discussion

• ℓ1-regularization achieves feature selection.

Illustration by Alexandre Gramfort, Telecom ParisTech

ℓ1-regularization: discussion

• ℓ1-regularization achieves feature selection.

• Difficulties:

− The regularizer is not differentiable at zero, yielding a non-smooth optimization problem. → specific optimization techniques needed (Yuan et al., 2010).

− In configurations with groups of highly correlated features:

I ℓ1-regularization tends to select one feature in each group at random;
I ℓ2-regularization tends to yield better prediction performance.
→ Consider the elastic net model (Hastie et al., 2009):

$$L(\mathcal{D};\tilde{w}) + \gamma_2\|w\|_2^2 + \gamma_1\|w\|_1$$

Kernel logistic regression (KLR)

• Let K: positive definite kernel and HK: the RKHS generated by K.

• Let φ ∈ HK, a feature mapping to HK.

KLR model

$$p(y_i|x_i) = \frac{1}{1 + \exp(-g(x_i))}; \qquad g(x) = w_0 + w^T\phi(x)$$

KLR model estimation problem:
$$\min_{\tilde{w}}\ L(\tilde{w}) = -\sum_{i=1}^{N}\left[y_i\,g(x_i) - \log(1 + \exp g(x_i))\right]$$

Regularized KLR

Regularized KLR model estimation problem:
$$\min_{\tilde{w}}\ L(\tilde{w}) = -\sum_{i=1}^{N}\left[y_i\,g(x_i) - \log(1 + \exp g(x_i))\right] + \frac{\gamma}{2}\|g\|_{\mathcal{H}_K}^2$$

• By the representer theorem: $g(x) = w_0 + \sum_{i=1}^{N}\alpha_i\,K(x_i, x)$

• The problem is strictly convex and can be solved using classic solvers.
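As an illustration of the representer-theorem parameterization, a minimal sketch (Python/NumPy, hypothetical names; the RBF kernel is an arbitrary choice, not prescribed by the slides) of the regularized KLR objective in the dual coefficients alpha:

```python
import numpy as np

def rbf_kernel(X, Y, gamma_k=1.0):
    """Gram matrix K[i, j] = exp(-gamma_k * ||X_i - Y_j||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma_k * d2)

def klr_objective(params, K, y, gamma):
    """Regularized KLR objective in (w0, alpha); g = w0 + K @ alpha, ||g||_H^2 = alpha^T K alpha."""
    w0, alpha = params[0], params[1:]
    g = w0 + K @ alpha
    nll = -np.sum(y * g - np.logaddexp(0.0, g))   # -sum [ y_i g(x_i) - log(1 + exp g(x_i)) ]
    return nll + 0.5 * gamma * alpha @ K @ alpha
```

This objective can then be handed to any generic convex solver, consistent with the remark above.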

KLR vs SVM

• It can be shown that KLR and SVM are quite related (see Appendix). • Very similar prediction performance and optimal margin properties. • Same refinements are possible: SMO, MKL...

+ Provides well-calibrated class probabilities. + Naturally generalizes to multi-class problems.

- No support vectors! → Import Vector Machines (Zhu and Hastie, 2002).

Conditional Random Fields (for linear-chain data)

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model

I Conditional Random Fields (for linear-chain data) – Introduction – Inference – Parameter estimation

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix

Structured-output data

o1 o2 o3 o4 o1 o2 o3 o4

Dr BsDr BsDrPn BsDr C−7 C−7 F−7 F−7

Musical instrument classification Chord transcription

Recalling the notations

• x is a sequence of observations: x = (x1, ··· , xn).

• y is the corresponding sequence of labels: y = (y1, ··· , yn). • We assume we have a training dataset D of N (i.i.d) such sequences: D = {(x(1), y (1)), ··· , (x(N), y (N))}.

• Remarks:

− Observations are no longer assumed to be i.i.d within each sequence.

− Sequences x^(q) do not necessarily have the same length; when needed, n_q will denote the length of x^(q).

The CRF model

CRF model definition

$$p(y|x;\theta) = \frac{1}{Z(x,\theta)}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y)\right) = \frac{1}{Z(x,\theta)}\,\Psi(x,y;\theta); \qquad \theta = \{\theta_1, \cdots, \theta_D\}.$$

• $Z(x,\theta) = \sum_y \exp\left(\sum_j \theta_j F_j(x,y)\right)$ is called the partition function.
• $\Psi(x,y;\theta) = \exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y)\right)$ is called the potential function.

• Remarks: − CRFs appear to be an extension of logistic regression to structured data. − Feature functions Fj (x, y) depend on the whole sequence of observations x and labels y.

Defining label constraints

• Without any further assumptions on the structure of y the model is hardly usable:

one needs to enumerate all possible sequences y for:

− $Z(x,\theta) = \sum_y \exp\left(\sum_j \theta_j F_j(x,y)\right)$;

− $\hat{y} = \operatorname{argmax}_y p(y|x;\theta)$.

with $|\mathcal{Y}|^n$ possible assignments!

Defining label constraints: using feature functions

• Consider feature functions $F_j(x,y)$ such that:
$$F_j(x,y) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i); \quad \text{where } n \text{ is the length of } x.$$

→ defines linear-chain CRFs: at each position i, 1 ≤ i ≤ n,

− each fj depends on the whole observation sequence,

− but only on the current and previous labels.

Defining label constraints: valid feature functions

$$F_j(x,y) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)$$

Examples of such feature functions (for discrete observations):

− The current observation is G, the current label is Cmin7 and the previous is G7;

− The past 4 observations..., the current label is...

− The next observation is...

− The current label is...

Observation and transition feature functions

• For convenience, one can define two types of feature functions:

− Observation (aka state) feature functions: bj (yi , x, i);

− Transition feature functions: tj (yi−1, yi , x, i).

• Hence:
$$p(y|x;\theta) = \frac{1}{Z(x,\theta)}\exp\left\{\sum_{i=1}^{n}\sum_{j=1}^{D_o}\theta_j\,b_j(y_i, x, i) + \sum_{i=1}^{n}\sum_{j=1}^{D_t}\theta_j\,t_j(y_{i-1}, y_i, x, i)\right\}$$

Connection to HMM

The HMM model: $p_{\mathrm{hmm}}(y,x) \triangleq \prod_{i=1}^{n} p(y_i|y_{i-1})\,p(x_i|y_i)$; where $p(y_1|y_0) \triangleq p(y_1)$.

One can write:
$$p_{\mathrm{hmm}}(y,x) = \exp\left\{\sum_{i=1}^{n}\log p(y_i|y_{i-1}) + \sum_{i=1}^{n}\log p(x_i|y_i)\right\} = \exp\left\{\sum_{i=1}^{n}\sum_{l,q\in\mathcal{Y}}\lambda_{lq}\,\mathbb{I}(y_i = l)\,\mathbb{I}(y_{i-1} = q) + \sum_{i=1}^{n}\sum_{l\in\mathcal{Y},\,o\in\mathcal{X}}\mu_{ol}\,\mathbb{I}(y_i = l)\,\mathbb{I}(x_i = o)\right\};$$

where $\lambda_{lq} = \log p(y_i = l|y_{i-1} = q)$ and $\mu_{ol} = \log p(x_i = o|y_i = l)$.

Connection to HMM

• Using the feature functions:

− $b_j(y_i, x, i) = \mathbb{I}(y_i = l)\,\mathbb{I}(x_i = o)$, where each $j$ indexes a different “l, o configuration”;
− $t_j(y_{i-1}, y_i, x, i) = \mathbb{I}(y_i = l)\,\mathbb{I}(y_{i-1} = q)$, where $j$ indexes a different “l, q configuration”;

• also using $p(y|x) = \frac{p(y,x)}{\sum_{y'} p(y',x)}$ and letting $Z(x) = \sum_{y'} p(y',x)$, one gets:

$$p_{\mathrm{hmm}}(y|x) = \frac{1}{Z(x)}\exp\left\{\sum_{i=1}^{n}\sum_j\theta_j\,b_j(y_i, x, i) + \sum_{i=1}^{n}\sum_j\theta_j\,t_j(y_{i-1}, y_i, x, i)\right\};$$

→ HMMs are a particular type of linear-chain CRFs.
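To make the correspondence concrete, here is a minimal sketch (Python, hypothetical names, discrete observations; not from the original slides) of the indicator feature functions whose weights recover an HMM:

```python
# Indicator ("one-hot") feature functions of a linear-chain CRF that reproduce an HMM.
def b_lo(l, o):
    """Observation feature: 1 if y_i == l and x_i == o."""
    return lambda y_prev, y_cur, x, i: float(y_cur == l and x[i] == o)

def t_lq(l, q):
    """Transition feature: 1 if y_i == l and y_{i-1} == q."""
    return lambda y_prev, y_cur, x, i: float(y_cur == l and y_prev == q)

# With the corresponding weights set to the HMM log-probabilities, the CRF scores match the HMM:
#   weight of b_lo(l, o) = log p(x_i = o | y_i = l)
#   weight of t_lq(l, q) = log p(y_i = l | y_{i-1} = q)
```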

Connection to HMM: discussion

CRFs have a number of advantages over HMMs, as a consequence of two major differences:

• CRFs are discriminative models. • CRFs are undirected models.

[Figure: graphical models of an HMM (directed: each y_i points to y_{i+1} and to its observation x_i) and a linear-chain CRF (undirected chain over y_{i-1}, y_i, y_{i+1}, connected to the observations x_{i-1}, x_i, x_{i+1}).]

Advantage of the discriminative nature of CRF

HMM: observation xi is independent of all other variables given its parent state yi . CRF: no assumptions on the dependencies among the observations: only p(y|x) is modeled.

→ CRFs can safely:

− exploit overlapping features; − account for long-term dependencies, considering the whole sequence of observations x at each location i (i 7→ bj (yi , x, i));

− use transition feature-functions tj (yi−1,yi , x, i).

Using linear-chain CRFs

Problems to be solved:

• Inference: given a model θ, how to compute:

− yˆ = argmaxy p(y|x; θ) ?

− $Z(x,\theta) = \sum_y \exp\left(\sum_j \theta_j F_j(x,y)\right)$ to deduce $p(y|x;\theta) = \frac{1}{Z(x,\theta)}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y)\right)$?

• Parameter estimation: given a training dataset D = {(x^(1), y^(1)), ··· , (x^(N), y^(N))}, how to estimate the optimal θ?

Decoding the optimal sequence

• Problem: solve $\hat{y} = \operatorname{argmax}_{y\in\mathcal{Y}^n} p(y|x;\theta)$, with $|\mathcal{Y}|^n$ possible assignments!

• Solution: use the Viterbi algorithm.

Exploit the linear-chain structure:

$$\hat{y} = \operatorname{argmax}_y p(y|x;\theta) = \operatorname{argmax}_y \frac{1}{Z(x,\theta)}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y)\right) = \operatorname{argmax}_y \sum_{j=1}^{D}\theta_j F_j(x,y) = \operatorname{argmax}_y \sum_{i=1}^{n}\sum_{j=1}^{D}\theta_j f_j(y_{i-1}, y_i, x, i)$$

Decoding the optimal sequence

Let $g_i(y_{i-1}, y_i) \triangleq \sum_{j=1}^{D}\theta_j f_j(y_{i-1}, y_i, x, i)$; then:

$$\hat{y} = \operatorname{argmax}_y \sum_{i=1}^{n}\sum_{j=1}^{D}\theta_j f_j(y_{i-1}, y_i, x, i) = \operatorname{argmax}_y \sum_{i=1}^{n} g_i(y_{i-1}, y_i).$$

Let δm(s) be the optimal “intermediate score” such that at time step m the label value is s:

"m−1 # ∆ X δm(s) = max gi (yi−1, yi ) + gm(ym−1, s) {y ,··· ,y } 1 m−1 i=1

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 72 / 122 Conditional Random Fields (for linear-chain data) Inference Decoding the optimal sequence Trellis representation

"m−1 # ∆ X δm(s) = max gi (yi−1, yi ) + gm(ym−1, s) {y ,··· ,y } 1 m−1 i=1

[Figure: trellis representation; rows are the possible label values, columns the positions 1, ..., n; δ_m(s) scores the best partial path reaching value s at position m.]

Decoding the optimal sequence

• The intermediate scores δ_m(s) can be efficiently computed by dynamic programming¹:

Viterbi recursion

$$\delta_m(s) = \max_{y_{m-1}\in\mathcal{Y}}\left[\delta_{m-1}(y_{m-1}) + g_m(y_{m-1}, s)\right]; \quad 1 \le m \le n$$

• As we proceed we need to keep track of the selected predecessor of s, at each time step m.

• We use ψm(s) for this purpose.

¹See Appendix for more details.

Decoding the optimal sequence: Viterbi algorithm

Initialization: $\delta_1(s) = g_1(y_0, s);\ \forall s\in\mathcal{Y};\ y_0 = \text{start}$; $\quad \psi_1(s) = \text{start}$

Recursion: $\forall s\in\mathcal{Y};\ 1 \le m \le n$:
$$\delta_m(s) = \max_{y\in\mathcal{Y}}\left[\delta_{m-1}(y) + g_m(y, s)\right], \qquad \psi_m(s) = \operatorname{argmax}_{y\in\mathcal{Y}}\left[\delta_{m-1}(y) + g_m(y, s)\right]$$

Termination:
$$\delta_n(y_n^*) = \max_{y\in\mathcal{Y}}\delta_n(y) = \max_y \sum_{i=1}^{n} g_i(y_{i-1}, y_i), \qquad y_n^* = \operatorname{argmax}_{y\in\mathcal{Y}}\delta_n(y)$$

Path backtracking: $y_m^* = \psi_{m+1}(y_{m+1}^*);\ m = n-1, n-2, \cdots, 1.$

Decoding the optimal sequence: backtracking

$$y_m^* = \psi_{m+1}(y_{m+1}^*); \quad m = n-1, n-2, \cdots, 1.$$
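A compact sketch of this decoding procedure (Python/NumPy, hypothetical names; the local scores g_i(y_{i-1}, y_i) are assumed precomputed into an array):

```python
import numpy as np

def viterbi(g):
    """Viterbi decoding for a linear-chain CRF.

    g: array of shape (n, K, K) with g[i, q, s] = g_{i+1}(y_i = q, y_{i+1} = s);
    g[0, 0, :] is assumed to hold g_1(start, s). Returns the best label sequence.
    """
    n, K, _ = g.shape
    delta = np.full((n, K), -np.inf)
    psi = np.zeros((n, K), dtype=int)
    delta[0] = g[0, 0]                             # initialization from the 'start' state
    for m in range(1, n):
        scores = delta[m - 1][:, None] + g[m]      # delta_{m-1}(q) + g_m(q, s)
        psi[m] = scores.argmax(axis=0)             # best predecessor of each s
        delta[m] = scores.max(axis=0)
    # termination and path backtracking
    y = np.zeros(n, dtype=int)
    y[-1] = delta[-1].argmax()
    for m in range(n - 2, -1, -1):
        y[m] = psi[m + 1, y[m + 1]]
    return y
```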

[Figure: backtracking through the trellis along the stored pointers ψ_m.]

Complexity of Viterbi decoding

Remarks on the computational cost:

• $O(K^2 n)$ in the worst case; $K = |\mathcal{Y}|$.

• In practice: $O(TKn)$, where $T$ is the average number of possible “transitions” between labels $y$.

• Can be reduced using beam search: exploring a subset of possible labels at each time position (the “most promising” ones) (Ortmanns et al., 1996).

Using linear-chain CRFs

Problems to be solved:

• Inference: given a model θ, how to compute:

− $\hat{y} = \operatorname{argmax}_y p(y|x;\theta)$ ✓ (solved by Viterbi decoding)

− $Z(x,\theta) = \sum_y \exp\left(\sum_j \theta_j F_j(x,y)\right)$ to deduce $p(y|x;\theta) = \frac{1}{Z(x,\theta)}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y)\right)$?

Computing the partition function Z(x, θ): the sum-product problem

Recall the CRF model: $p(y|x;\theta) = \frac{1}{Z(x,\theta)}\prod_{i=1}^{n} M_i(y_{i-1}, y_i, x)$; with $M_i(y_{i-1}, y_i, x) = \exp\left(\sum_{j=1}^{D}\theta_j f_j(y_{i-1}, y_i, x, i)\right)$;

$Z(x,\theta) = \sum_{y\in\mathcal{Y}^n}\prod_{i=1}^{n} M_i(y_{i-1}, y_i, x)$: intractable as is...

→ use the forward-backward method: reduces complexity from $O(K^n)$ to $O(nK^2)$.

The forward-backward method

• Defining $\alpha_m(y_m) = \sum_{y_{m-1}} M_m(y_{m-1}, y_m)\,\alpha_{m-1}(y_{m-1});\ 2 \le m \le n$, it is easily shown² that:

At the end of the sequence: $Z(x,\theta) = \sum_{y_n\in\mathcal{Y}}\alpha_n(y_n)$.

• Alternatively, defining $\beta_m(y_m) = \sum_{y_{m+1}} M_{m+1}(y_m, y_{m+1})\,\beta_{m+1}(y_{m+1});\ 1 \le m \le n-1$ and $\beta_n(y_n) = 1$, one gets:

At the beginning of the sequence: $Z(x,\theta) = \sum_{y_1\in\mathcal{Y}} M_1(y_0, y_1)\,\beta_1(y_1)$.

²See Appendix for more details.

Marginal probability

$$p(y_{m-1}, y_m|x) = \sum_{y\setminus\{y_{m-1},y_m\}} p(y|x)$$

Marginal probability by forward-backward:
$$p(y_{m-1}, y_m|x) = \frac{1}{Z(x)}\,\alpha_{m-1}(y_{m-1})\,M_m(y_{m-1}, y_m, x)\,\beta_m(y_m).$$

More details in the appendix.
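A minimal sketch of these recursions (Python/NumPy, hypothetical names; for clarity it works with the potentials M_i directly rather than in the log domain, which a practical implementation would use to avoid underflow):

```python
import numpy as np

def forward_backward(M):
    """Forward-backward recursions for a linear-chain CRF.

    M: array of shape (n, K, K) with M[i, q, s] = M_{i+1}(y_i = q, y_{i+1} = s, x);
    M[0, 0, :] is assumed to hold M_1(y_0 = start, y_1 = s).
    Returns alpha, beta, the partition function Z and the pairwise marginals.
    """
    n, K, _ = M.shape
    alpha = np.zeros((n, K))
    beta = np.ones((n, K))
    alpha[0] = M[0, 0]                      # alpha_1(y_1) = M_1(start, y_1)
    for m in range(1, n):
        alpha[m] = alpha[m - 1] @ M[m]      # sum over y_{m-1}
    for m in range(n - 2, -1, -1):
        beta[m] = M[m + 1] @ beta[m + 1]    # sum over y_{m+1}
    Z = alpha[-1].sum()                     # Z = sum_{y_n} alpha_n(y_n)
    # pairwise marginals p(y_{m-1}, y_m | x), for m = 2..n
    marg = np.array([np.outer(alpha[m - 1], beta[m]) * M[m] for m in range(1, n)]) / Z
    return alpha, beta, Z, marg
```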

Parameter estimation

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model

I Conditional Random Fields (for linear-chain data) – Introduction – Inference – Parameter estimation

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix

Negative log-likelihood (NLL)

• Given training data D = {(x(1), y (1)), ··· , (x(N), y (N))}, the NLL is:

$$L(\mathcal{D};\theta) \triangleq -\sum_{q=1}^{N}\log p(y^{(q)}|x^{(q)};\theta) = \sum_{q=1}^{N}\left\{\log Z(x^{(q)};\theta) - \sum_{i=1}^{n_q}\sum_{j=1}^{D}\theta_j f_j(y_{i-1}^{(q)}, y_i^{(q)}, x^{(q)}, i)\right\} = \sum_{q=1}^{N}\left\{\log Z(x^{(q)};\theta) - \sum_{j=1}^{D}\theta_j F_j(x^{(q)}, y^{(q)})\right\}.$$

• L(D; θ) is convex → gradient-descent will converge to global minimum.

NLL gradient

Gradient: $\frac{\partial L(\mathcal{D};\theta)}{\partial\theta_k} = \sum_{q=1}^{N}\left\{\frac{\partial}{\partial\theta_k}\log Z(x^{(q)};\theta) - F_k(x^{(q)}, y^{(q)})\right\}.$

$$\frac{\partial}{\partial\theta_k}\log Z(x;\theta) = \frac{1}{Z(x;\theta)}\frac{\partial}{\partial\theta_k}\sum_{y\in\mathcal{Y}^n}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y)\right) = \frac{1}{Z(x;\theta)}\sum_{y\in\mathcal{Y}^n} F_k(x,y)\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y)\right) = \sum_{y\in\mathcal{Y}^n} F_k(x,y)\,\frac{\exp\left(\sum_j\theta_j F_j(x,y)\right)}{Z(x;\theta)} = \sum_{y\in\mathcal{Y}^n} F_k(x,y)\,p(y|x;\theta) = E_{p(y|x;\theta)}\left[F_k(x,y)\right].$$

NLL gradient

Gradient: $\frac{\partial L(\mathcal{D};\theta)}{\partial\theta_k} = \sum_{q=1}^{N}\left\{\frac{\partial}{\partial\theta_k}\log Z(x^{(q)};\theta) - F_k(x^{(q)}, y^{(q)})\right\}.$

$\frac{\partial}{\partial\theta_k}\log Z(x;\theta) = E_{p(y|x;\theta)}\left[F_k(x,y)\right]$: conditional expectation given $x$.

$$\frac{\partial L(\mathcal{D};\theta)}{\partial\theta_k} = \sum_{q=1}^{N}\left\{E_{p(y|x^{(q)};\theta)}\left[F_k(x^{(q)}, y)\right] - F_k(x^{(q)}, y^{(q)})\right\}.$$

Optimality condition

• Setting the derivatives to 0, i.e. $\frac{\partial L(\mathcal{D};\theta)}{\partial\theta_k} = 0$, yields:

$$\sum_{q=1}^{N} E_{p(y|x^{(q)};\theta)}\left[F_k(x^{(q)}, y)\right] = \sum_{q=1}^{N} F_k(x^{(q)}, y^{(q)}); \quad 1 \le k \le D$$

• No closed-form solution: numerical optimization is again needed.

• Need to compute $E_{p(y|x^{(q)};\theta)}\left[F_k(x^{(q)}, y)\right]$ efficiently.

Optimality condition

• Setting the derivatives to 0, i.e. $\frac{\partial L(\mathcal{D};\theta)}{\partial\theta_k} = 0$, yields:

$$\frac{1}{N}\sum_{q=1}^{N} E_{p(y|x^{(q)};\theta)}\left[F_k(x^{(q)}, y)\right] = \frac{1}{N}\sum_{q=1}^{N} F_k(x^{(q)}, y^{(q)}); \quad 1 \le k \le D$$

• Average expectation under the model = empirical mean.

• No closed-form solution: numerical optimization is again needed.

• Need to compute $E_{p(y|x^{(q)};\theta)}\left[F_k(x^{(q)}, y)\right]$ efficiently.

Efficient gradient computation

$$E_{p(y|x;\theta)}\left[F_k(x,y)\right] = \sum_{y\in\mathcal{Y}^n} F_k(x,y)\,p(y|x;\theta) = \sum_{i=1}^{n}\sum_{y\in\mathcal{Y}^n} f_k(y_{i-1}, y_i, x, i)\,p(y|x;\theta) = \sum_{i=1}^{n}\sum_{y_{i-1},y_i\in\mathcal{Y}^2} f_k(y_{i-1}, y_i, x, i)\,p(y_{i-1}, y_i|x;\theta)$$

$p(y_{i-1}, y_i|x;\theta)$ is the marginal probability which, thanks to the forward-backward algorithm, is obtained by:
$$p(y_{i-1}, y_i|x) = \frac{1}{Z(x)}\,\alpha_{i-1}(y_{i-1})\,M_i(y_{i-1}, y_i, x)\,\beta_i(y_i).$$
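Combining the pairwise marginals from the forward-backward sketch with the feature functions gives the expectation needed for the gradient; a minimal illustration (Python/NumPy, hypothetical names, reusing the forward_backward sketch above):

```python
import numpy as np

def expected_feature_counts(f, M):
    """E_{p(y|x;theta)}[F_k(x, y)] for one sequence, for every feature k.

    f: array of shape (n, K, K, D) with f[i-1, q, s, k] = f_k(y_{i-1} = q, y_i = s, x, i);
    M: matching potentials, as in the forward_backward sketch (index 0 uses the 'start' state).
    """
    alpha, beta, Z, marg = forward_backward(M)          # see earlier sketch
    # marginal of the first position, p(y_1 | x), with y_0 fixed to 'start' (index 0)
    p_start = np.zeros_like(marg[0])
    p_start[0] = alpha[0] * beta[0] / Z
    pairwise = np.concatenate([p_start[None], marg])    # shape (n, K, K)
    return np.einsum('iqsk,iqs->k', f, pairwise)        # sum_i sum_{q,s} f_k(q,s) p(q,s)
```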

Optimization

Now that we are able to compute the gradient, we can use a descent method to solve for θ.

Many algorithms are available (see Sokolovska, 2010; Lavergne et al., 2010):

• Generalized iterative scaling (Lafferty et al., 2001): original algorithm, slow convergence, suboptimal.

• Conjugate gradient (Wallach, 2002): faster convergence, better quality.

• L-BFGS (McCallum, 2002): fast convergence, scalable; a good option, most used.

• Stochastic gradient: suboptimal, simple, online, large-scale applications.

Improvements and extensions to original CRFs

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs – Regularization – Hidden-state CRF model – Other extensions

I Conclusion

I References

I Appendix

Regularization

Using the ℓ2-norm

• Redefine the objective function as: $L(\mathcal{D};\theta) + \frac{\|\theta\|_2^2}{2\sigma^2}$; $\sigma^2$: a free parameter penalizing large weights (as in ridge regression).

• The gradient coefficients become:
$$\frac{\partial L(\mathcal{D};\theta)}{\partial\theta_k} = \sum_{q=1}^{N}\left\{E_{p(y|x^{(q)};\theta)}\left[F_k(x^{(q)}, y)\right] - F_k(x^{(q)}, y^{(q)})\right\} + \frac{\theta_k}{\sigma^2}.$$
• Advantages:

− The objective becomes strictly convex. − Shrinkage of θ coefficients is achieved avoiding overfitting and numerical problems.

• σ2 needs to be tuned (usually by cross-validation).

Regularization

Using the ℓ1-norm to perform feature selection

• Redefine the objective function as: $L(\mathcal{D};\theta) + \rho\|\theta\|_1 = L(\mathcal{D};\theta) + \rho\sum_{j=1}^{D}|\theta_j|$ (as in the LASSO).

• Advantage: performs feature selection; in some NLP applications, up to 95% of the features can be discarded without affecting performance! (see Sokolovska, 2010).

• Difficulties:

− The regularizer is not differentiable at zero: specific optimization techniques needed (Sokolovska, 2010).

− In configurations with groups of highly correlated features, it tends to select one feature in each group at random.

Hidden-state CRF model

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs – Regularization – Hidden-state CRF model – Other extensions

I Conclusion

I References

I Appendix

Motivation

Problem: the CRF model does not support hidden states.

CRF: inputs x_1, x_2, x_3, x_4 and one output per input, y_1, y_2, y_3, y_4 (e.g. Am, Dm7, G7, CMaj7); training data D = {(x^(i), y^(i))}_i.

Hidden-state CRF: inputs x_1, x_2, x_3, x_4, hidden states h_1, h_2, h_3, h_4 and a single output y = 'TRUMPET'; training data D = {(x^(i), y^(i))}_i.

Motivation

Problem: the CRF model does not support hidden states.

[Figure: graphical models. CRF: a chain of labels y_{i-1}, y_i, y_{i+1} over the observations x_{i-1}, x_i, x_{i+1}; D = {(x^(i), y^(i))}_i. Hidden-state CRF: a single label y on top of a chain of hidden states h_{i-1}, h_i, h_{i+1} over the observations; D = {(x^(i), y^(i))}_i.]

The HCRF model (Quattoni et al., 2007)

• Each sequence of observations x = (x1, ··· , xn) is associated with: − a unique label y; − a sequence of latent variables h = (h1, ··· , hn), where hi ∈ H.

HCRF model definition
$$p(y,h|x;\theta) = \frac{1}{Z(x,\theta)}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y,h)\right); \qquad Z(x,\theta) = \sum_{y,h}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y,h)\right); \quad \theta = \{\theta_1, \cdots, \theta_D\}.$$

Inference in HCRF

• Using the HCRF model $p(y,h|x;\theta) = \frac{1}{Z(x,\theta)}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y,h)\right)$ entails being able to compute:

− $\hat{y} = \operatorname{argmax}_{y\in\mathcal{Y}} p(y|x;\theta^*)$, to classify new test cases;
− the partition function $Z(x,\theta)$, to evaluate posterior probabilities.

• Let $Z'(y,x,\theta) \triangleq \sum_{h\in\mathcal{H}^n}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y,h)\right)$: marginalization w.r.t. $h$.

• We have:
− $p(y|x;\theta) = \sum_{h\in\mathcal{H}^n} p(y,h|x;\theta) = \frac{Z'(y,x,\theta)}{\sum_y Z'(y,x,\theta)}$;
− $Z(x,\theta) = \sum_y Z'(y,x,\theta)$.
• $Z'(y,x,\theta)$ can be easily computed using forward/backward recursions (as done in CRF).
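Since for a fixed y the model is again a linear chain over h, Z'(y, x, θ) can be obtained with the same forward recursion as before, once per candidate label; a minimal sketch (Python, hypothetical names, reusing the forward_backward helper from the earlier example):

```python
def hcrf_posteriors(M_per_label):
    """Class posteriors p(y | x) of an HCRF.

    M_per_label: dict mapping each label y to its chain potentials over the hidden
    states (an array shaped like the M argument of the forward_backward sketch).
    """
    z_prime = {y: forward_backward(M)[2] for y, M in M_per_label.items()}  # Z'(y, x, theta)
    Z = sum(z_prime.values())                                              # Z(x, theta)
    return {y: z / Z for y, z in z_prime.items()}                          # p(y | x)
```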

Negative log-likelihood

$$L(\mathcal{D};\theta) \triangleq -\sum_{q=1}^{N}\log p(y^{(q)}|x^{(q)};\theta) = \sum_{q=1}^{N}\left\{\log\left(\sum_y Z'(y, x^{(q)}, \theta)\right) - \log Z'(y^{(q)}, x^{(q)}, \theta)\right\};$$

$$Z'(y,x,\theta) \triangleq \sum_{h\in\mathcal{H}^n}\exp\left(\sum_{j=1}^{D}\theta_j F_j(x,y,h)\right)$$

L(D; θ) is no longer convex → convergence to a local minimum.

NLL gradient

$$\frac{\partial L(\mathcal{D};\theta)}{\partial\theta_k} = \sum_{q=1}^{N}\left\{\sum_{y,h} F_k(x^{(q)}, y, h)\,p(y,h|x^{(q)};\theta) - \sum_h F_k(x^{(q)}, y^{(q)}, h)\,p(h|y^{(q)}, x^{(q)};\theta)\right\}$$

which can be again computed using the forward-backward method.

A gradient descent method (L-BFGS) can be again used to solve for θ.

Application to musical instrument classification

Input: x1 x2 x3 x4 ...

Hidden: h1 h2 h3 h4

Output: y = ’TRUMPET’

Feature functions used (following Quattoni et al., 2007)

$$\Psi(x,y,h,\theta) = \sum_{i=1}^{N}\langle\theta(h_i), x_i\rangle + \sum_{i=1}^{N}\theta(y, h_i) + \sum_{i=1}^{N}\theta(y, h_{i-1}, h_i)$$

• $\langle\theta(h_i), x_i\rangle$: compatibility between observation $x_i$ and hidden state $h_i \in \mathcal{H}$;

• θ(y, hi ) : compatibility between hidden state hi and label y;

• θ(y, hi−1, hi ) : compatibility between transition hi−1 ↔ hi and label y.

Evaluation

• Classifying 1-second long segments of solo excerpts of Cello, Guitar, Piano, Bassoon and Oboe.
• Data:
− training set: 2505 segments (i.e. 42’);
− testing set: 2505 segments.
• Classifiers:
− ℓ2-regularized HCRF with 3 hidden states;
− linear SVM.
• Features: 47 cepstral, perceptual and temporal features.

• Results:
Classifier:        SVM   HCRF
Average accuracy:  75%   76%

Other extensions

I Introduction

I References to applications in MIR-related tasks

I The logistic regression model

I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs – Regularization – Hidden-state CRF model – Other extensions

I Conclusion

I References

I Appendix

Other extensions

• LDCRF: Latent-Dynamic Conditional Random Field (Sung and Jurafsky, 2009), modeling both hidden states and structured outputs.

[Figure: graphical structures of the CRF (chain of labels over the observations), the HCRF (single label over a chain of hidden states), and the LDCRF (chain of labels, each attached to its own hidden state, over the observations).]

Other extensions

• LDCRF: Latent-Dynamic Conditional Random Field (Sung and Jurafsky, 2009) modeling both hidden-states and structured-outputs.

• Kernel-CRF (Lafferty et al., 2004; Altun et al., 2004), introducing implicit features to account for (non-linear) interactions between original features.

• Semi-Markov CRF (Sarawagi and Cohen, 2005) modeling segment-level labels.

• CCRF: Continuous CRF (Qin and Liu, 2008) modeling continuous labels in a regression setting.

Conclusion

Take-home messages (I)

• CRFs are powerful structured-data prediction models (more flexible than HMMs and other more general Bayesian networks) as they are:

− discriminative models: focus on modeling the target labels;

→ can handle a high number of feature functions, including transition features, and account for long-range dependencies.

− undirected models: no need to normalize potentials locally.

→ allow for incorporating prior knowledge about constraints and label dependencies in an intuitive way.

• Easily extendable with key mechanisms: regularization, sparsity, latent variables, kernels...

Take-home messages (II)

• Great potential for various MIR-related tasks (both symbolic and numerically-continuous data):

− transcription tasks: notes, chords, harmonic analysis...

− “classification” (autotagging) tasks:

I accounting for label-correlations in multi-label problems;

I accounting for temporal dynamics using HCRF...

CRF software packages

• CRF++ (C++): linear-chain CRF, NLP, L-BFGS optimization (Taku-ku, 2003).
• crfChain (Matlab, C mex): linear-chain CRF, categorical features, L-BFGS optimization (Schmidt, 2008).
• CRFsuite (C++, Python): linear-chain CRF, NLP, various regularization and optimization methods (L-BFGS), designed for fast training (Okazaki, 2007).
• HCRF library (C++, Matlab, Python): CRF, HCRF, LDCRF, continuous inputs, L-BFGS optimization (Morency, 2010).
• Mallet (Java): CRF, maxent, HMM, NLP, text feature extraction routines, various optimization methods (L-BFGS) (McCallum, 2002).
• Wapiti (C99): linear-chain CRF, NLP, large label and feature sets, various regularization and optimization methods (L-BFGS, SGD), multi-threaded (Lavergne et al., 2010).

CRF tutorials

• Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press, 2006.
• Charles Elkan. Log-linear Models and Conditional Random Fields. Notes for a tutorial at CIKM’08, October 2008.
• Roman Klinger and Katrin Tomanek. Classical Probabilistic Models and Conditional Random Fields. Algorithm Engineering Report TR07-2-2013, December 2007.
• Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21, Department of Computer and Information Science, University of Pennsylvania, 2004.
• Roland Memisevic. An Introduction to Structured Discriminative Learning. Technical Report, University of Toronto, 2006.
• Rahul Gupta. Conditional Random Fields. Unpublished report, IIT Bombay, 2006.

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 105 / 122 References Bibliography I

Y. Altun, A. Smola, and T. Hofmann. Exponential families for conditional random fields. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004. URL http://dl.acm.org/citation.cfm?id=1036844.
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 1996.
J. Corey and K. Fujinaga. A cross-validated study of modelling strategies for automatic chord recognition in audio. International Conference on Music Information Retrieval (ISMIR), 2007. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119.5760&rep=rep1&type=pdf.
M. E. P. Davies and M. D. Plumbley. Context-dependent beat tracking of musical audio. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1009–1020, 2007.
Z. Duan and L. Lu. Collective annotation of music from multiple semantic categories. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), pages 237–242, Philadelphia, Pennsylvania, USA, 2008.
M. Dunham and K. Murphy. pmtk3: probabilistic modeling toolkit for Matlab/Octave, version 3, 2010. URL https://code.google.com/p/pmtk3/.
S. Durand. Estimation de la position des temps et des premiers temps d'un morceau de musique. Master's thesis, Université Pierre et Marie Curie, 2013.
D. Ellis. Beat Tracking by Dynamic Programming. Journal of New Music Research, 36(1):51–60, 2007. ISSN 0929-8215. URL http://www.informaworld.com/openurl?genre=article&doi=10.1080/09298210701653344&magic=crossref.
A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. Hidden conditional random fields for phone classification. In Interspeech, 2005. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.8190&rep=rep1&type=pdf.
T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009. ISBN 0387848576. URL http://books.google.com/books?id=tVIjmNS3Ob8C&pgis=1.

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 106 / 122 References Bibliography II

X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 695–702. IEEE, 2004. ISBN 0-7695-2158-4. doi: 10.1109/CVPR.2004.1315232. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1315232.
W.-T. Hong. Speaker identification using Hidden Conditional Random Field-based speaker models. In 2010 International Conference on Machine Learning and Cybernetics, pages 2811–2816. IEEE, July 2010. ISBN 978-1-4244-6526-2. doi: 10.1109/ICMLC.2010.5580793. URL http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=5580793&contentType=Conference+Publications&searchField=Search_All&queryText=Hidden+Conditional+Random+Fields.
V. Imbrasaite, T. Baltrusaitis, and P. Robinson. Emotion tracking in music using continuous conditional random fields and relative feature representation. In 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). IEEE, July 2013. ISBN 978-1-4799-1604-7. doi: 10.1109/ICMEW.2013.6618357. URL http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=6618357.
F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision Graphs. Springer, 2007. ISBN 0387682813. URL http://books.google.fr/books/about/Bayesian_Networks_and_Decision_Graphs.html?id=goSLmQq4UBcC&pgis=1.
C. Joder, S. Essid, and G. Richard. A Conditional Random Field Viewpoint of Symbolic Audio-to-Score Matching. In ACM Multimedia 2010, Florence, Italy, 2010.
C. Joder, S. Essid, and G. Richard. A Conditional Random Field Framework for Robust and Scalable Audio-to-Score Matching. IEEE Transactions on Audio, Speech and Language Processing, 19(8):2385–2397, 2011.
C. Joder, S. Essid, and G. Richard. Learning Optimal Features for Polyphonic Audio-to-Score Alignment. IEEE Transactions on Audio, Speech and Language Processing, 2013. doi: 10.1109/TASL.2013.2266794.
A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342–355, 2006.

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 107 / 122 References Bibliography III

J. Lafferty, Y. Liu, and X. Zhu. Kernel Conditional Random Fields: Representation, Clique Selection, and Semi-Supervised Learning. In International Conference on Machine Learning, 2004. URL http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Kernel+Conditional+Random+Fields+:+Representation+,+Clique+Selection+,+and+Semi-Supervised+Learning#5.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 282–289, 2001. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.9849&rep=rep1&type=pdf.
T. Lavergne, O. Cappé, and F. Yvon. Practical very large scale CRFs. In 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 504–513, Uppsala, Sweden, 2010. URL http://www.aclweb.org/anthology/P10-1052.
C.-X. Li. Exploiting label correlations for multi-label classification. Master's thesis, UC San Diego, 2011.
A. K. McCallum. Mallet: A machine learning for language toolkit, 2002. URL http://mallet.cs.umass.edu.
L.-P. Morency. HCRF: Hidden-state Conditional Random Field Library, 2010. URL http://sourceforge.net/projects/hcrf/.
L.-P. Morency, A. Quattoni, and T. Darrell. Latent-Dynamic Discriminative Models for Continuous Gesture Recognition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, June 2007. ISBN 1-4244-1179-3. doi: 10.1109/CVPR.2007.383299. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4270324.
J. Morris and E. Fosler-Lussier. Conditional Random Fields for Integrating Local Discriminative Classifiers. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):617–628, 2008. ISSN 1558-7916. doi: 10.1109/TASL.2008.916057. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4451149.
N. Okazaki. CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007. URL http://www.chokkan.org/software/crfsuite/.

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 108 / 122 References Bibliography IV

S. Ortmanns, H. Ney, and A. Eiden. Language-model look-ahead for large vocabulary speech recognition. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), volume 4, pages 2095–2098. IEEE, 1996. ISBN 0-7803-3555-4. doi: 10.1109/ICSLP.1996.607215. URL http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=607215.
T. Qin and T. Liu. Global ranking using continuous conditional random fields. In Advances in Neural Information Processing Systems, 2008. URL http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2008_0518.pdf.
A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1848–1853, Oct. 2007. ISSN 0162-8828. doi: 10.1109/TPAMI.2007.1124. URL http://www.ncbi.nlm.nih.gov/pubmed/17699927.
S. Reiter, B. Schuller, and G. Rigoll. Hidden Conditional Random Fields for Meeting Segmentation. In 2007 IEEE International Conference on Multimedia and Expo, pages 639–642. IEEE, July 2007. ISBN 1-4244-1016-9. doi: 10.1109/ICME.2007.4284731. URL http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=4284731&contentType=Conference+Publications&searchField=Search_All&queryText=Hidden+Conditional+Random+Fields.
O. Rudovic, V. Pavlovic, and M. Pantic. Kernel conditional ordinal random fields for temporal segmentation of facial action units. In Computer Vision - ECCV, pages 1–10, 2012. URL http://link.springer.com/chapter/10.1007/978-3-642-33868-7_26.
S. Sarawagi and W. Cohen. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems, 2005. URL http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_427.pdf.
E. Schmidt and Y. Kim. Modeling musical emotion dynamics with conditional random fields. In International Conference on Music Information Retrieval (ISMIR), pages 777–782, Miami, FL, USA, 2011. URL http://music.ece.drexel.edu/files/Navigation/Publications/Schmidt2011c.pdf.
M. Schmidt. crfChain, 2008. URL http://www.di.ens.fr/~mschmidt/Software/crfChain.html.

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 109 / 122 References Bibliography V

B. Settles. Biomedical named entity recognition using conditional random fields and rich feature sets. In International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 2004. URL http://dl.acm.org/citation.cfm?id=1567618.
N. Sokolovska. Contributions to the estimation of probabilistic discriminative models: semi-supervised learning and feature selection. PhD thesis, Telecom ParisTech, 2010.
Y.-H. Sung and D. Jurafsky. Hidden Conditional Random Fields for phone recognition. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pages 107–112, Dec. 2009. doi: 10.1109/ASRU.2009.5373329. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5373329.
Taku-ku. CRF++, 2003. URL http://crfpp.googlecode.com/svn/trunk/doc/index.html.
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2002. URL http://dl.acm.org/citation.cfm?id=2073934.
H. M. Wallach. Efficient Training of Conditional Random Fields. PhD thesis, University of Edinburgh, 2002.
S. B. Wang, A. Quattoni, L.-P. Morency, and D. Demirdjian. Hidden Conditional Random Fields for Gesture Recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1521–1527. IEEE Computer Society, 2006. ISBN 0-7695-2597-0. URL http://portal.acm.org/citation.cfm?id=1153171.1153627.
G. Yuan, K. Chang, C. Hsieh, and C. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. The Journal of Machine Learning Research, 2010. URL http://dl.acm.org/citation.cfm?id=1953034.
J. Zhu and T. Hastie. Support vector machines, kernel logistic regression and boosting. Multiple Classifier Systems, 2364:16–26, 2002. URL http://link.springer.com/chapter/10.1007/3-540-45428-4_2.

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 110 / 122 I Introduction

I References to applications in MIR-related tasks

I The logistic regression model
I Conditional Random Fields (for linear-chain data)

I Improvements and extensions to original CRFs

I Conclusion

I References

I Appendix
– Optimization with stochastic gradient learning
– Comparing KLR and SVM
– Derivation of the Viterbi algorithm
– The forward-backward method

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 111 / 122 Appendix Optimization with stochastic gradient learning LR model learning with stochastic gradient descent (SGL)

• Idea: make gradient updates based on one training example at a time.
• Use: ∂L(D; w̃)/∂wj = (yi − p(xi; w̃)) xji  (contribution of a single example xi)

Algorithm
- Initialise w̃
- Repeat (until convergence):
    - Randomly permute the training examples xi (σ denotes the resulting ordering)
    - For i = 1 : N:
        wj ← wj + t (yσi − pσi) xjσi ;  j = 1, ···, D

• t: step size, to be tuned.
• Complexity of SGL: O(NFD) per epoch, with F the average number of non-zero feature coefficients per example; an epoch is a “complete” update using all training examples.
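The following NumPy sketch (not part of the original slides) illustrates the SGL update above on a toy binary problem; the function name, the toy data and the fixed step-size value are illustrative assumptions.

```python
import numpy as np

def sgd_logistic_regression(X, y, step=0.1, n_epochs=20, seed=0):
    """Sketch of SGL for binary logistic regression.
    X: (N, D) features, y: (N,) labels in {0, 1}; a bias column is prepended."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])           # constant feature for w0
    w = np.zeros(D + 1)                            # initialise w~
    for _ in range(n_epochs):                      # one epoch = one full pass over the data
        for i in rng.permutation(N):               # random permutation sigma
            p_i = 1.0 / (1.0 + np.exp(-Xb[i] @ w)) # p(y = 1 | x_i; w~)
            w += step * (y[i] - p_i) * Xb[i]       # w_j <- w_j + t (y_i - p_i) x_ji
    return w

# Toy usage: two Gaussian classes in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
print("learned parameters (w0, w1, w2):", sgd_logistic_regression(X, y))
```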

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 112 / 122 Appendix Comparing KLR and SVM Support Vector Machines Recalling SVM as a regularized function fitting problem

• The SVM solution, g(x) = w0 + w^T φ(x), can be found by solving:

min_{w̃}  Σ_{i=1}^{N} [1 − y'_i g(x_i)]_+ + (γ/2) ||g||²_{H_K} ;   y'_i ∈ {−1, 1}

[Figure: the hinge loss, [1 − y'_i g(x_i)]_+ = max(0, 1 − y'_i g(x_i)), plotted as a function of the margin y'g(x) over [−2, 2].]

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 113 / 122 Appendix Comparing KLR and SVM KLR vs SVM

• Let y'_i = +1 if yi = 1, and y'_i = −1 if yi = 0.

• The negative log-likelihood of the KLR model can then be written as
  L(D; w̃) = Σ_{i=1}^{N} log(1 + exp(−y'_i g(x_i))).

• Both KLR and SVM solve:

  min_{w̃}  Σ_{i=1}^{N} l(y'_i g(x_i)) + (λ/2) ||g||²_{H_K}

with:
  KLR:  l(y'_i g(x_i)) = log(1 + exp(−y'_i g(x_i)))
  SVM:  l(y'_i g(x_i)) = [1 − y'_i g(x_i)]_+
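As a quick numeric illustration (not from the original slides), the snippet below evaluates the two surrogate losses on a grid of margin values; it only assumes NumPy.

```python
import numpy as np

# Margin values y'g(x) on the range used in the plots.
margins = np.linspace(-2.0, 2.0, 9)

hinge = np.maximum(0.0, 1.0 - margins)   # SVM loss: [1 - y'g(x)]_+
nll = np.log1p(np.exp(-margins))         # KLR loss: log(1 + exp(-y'g(x)))

for m, h, l in zip(margins, hinge, nll):
    print(f"y'g(x) = {m:+.1f}   hinge = {h:.3f}   logistic NLL = {l:.3f}")
```

Both losses penalize small or negative margins; the logistic NLL is a smooth, strictly positive counterpart of the hinge, which is what the next plot illustrates.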

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 114 / 122 Appendix Comparing KLR and SVM KLR vs SVM Hinge vs negative binomial log-likelihood

[Figure: hinge loss and negative binomial log-likelihood (NLL) loss plotted together as functions of the margin y'g(x) over [−2, 2].]

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 115 / 122 Appendix Derivation of the Viterbi algorithm Decoding the optimal sequence

Let g_i(y_{i−1}, y_i) ≜ Σ_{j=1}^{D} θ_j f_j(y_{i−1}, y_i, x, i); then:

  ŷ = argmax_y Σ_{i=1}^{n} Σ_{j=1}^{D} θ_j f_j(y_{i−1}, y_i, x, i) = argmax_y Σ_{i=1}^{n} g_i(y_{i−1}, y_i).

Let δ_m(s) ≜ max_{y_1,···,y_{m−1}} [ Σ_{i=1}^{m−1} g_i(y_{i−1}, y_i) + g_m(y_{m−1}, s) ]
           = max_{y_{m−1}} [ max_{y_1,···,y_{m−2}} Σ_{i=1}^{m−1} g_i(y_{i−1}, y_i) + g_m(y_{m−1}, s) ].

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 116 / 122 Appendix Derivation of the Viterbi algorithm Viterbi decoding

δ_m(s) = max_{y_{m−1}} [ max_{y_1,···,y_{m−2}} Σ_{i=1}^{m−1} g_i(y_{i−1}, y_i) + g_m(y_{m−1}, s) ]

So: δ_{m−1}(y_{m−1}) = max_{y_1,···,y_{m−2}} [ Σ_{i=1}^{m−2} g_i(y_{i−1}, y_i) + g_{m−1}(y_{m−2}, y_{m−1}) ]
                     = max_{y_1,···,y_{m−2}} Σ_{i=1}^{m−1} g_i(y_{i−1}, y_i).

Hence the recursion:  δ_m(s) = max_{y_{m−1}∈Y} [ δ_{m−1}(y_{m−1}) + g_m(y_{m−1}, s) ].

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 117 / 122 Appendix Derivation of the Viterbi algorithm Viterbi decoding The algorithm
Initialization:  δ_1(s) = g_1(y_0, s), ∀s ∈ Y;  y_0 = start

ψ_1(s) = start

Recursion: ∀s ∈ Y, 2 ≤ m ≤ n
  δ_m(s) = max_{y∈Y} [ δ_{m−1}(y) + g_m(y, s) ]
  ψ_m(s) = argmax_{y∈Y} [ δ_{m−1}(y) + g_m(y, s) ]

Termination:
  δ_n(y_n^*) = max_{y∈Y} δ_n(y)
  y_n^* = argmax_{y∈Y} δ_n(y)

Path backtracking:
  y_m^* = ψ_{m+1}(y_{m+1}^*);  m = n − 1, n − 2, ···, 1.
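A minimal NumPy sketch of this recursion follows (it is not from the original slides). It assumes the local scores g_m(y_{m−1}, y_m) have been precomputed into an array, and uses label index 0 at the first position as the dummy 'start' state; both conventions are assumptions of the sketch.

```python
import numpy as np

def viterbi(g):
    """Viterbi decoding for a linear chain.
    g: array of shape (n, K, K); g[m, a, b] is the score g_{m+1}(y_m = a, y_{m+1} = b).
    For the first position only g[0, 0, :] is used (index 0 = dummy 'start' state).
    Returns the optimal label sequence and its score."""
    n, K, _ = g.shape
    delta = np.full((n, K), -np.inf)
    psi = np.zeros((n, K), dtype=int)
    delta[0] = g[0, 0]                              # delta_1(s) = g_1(start, s)
    for m in range(1, n):                           # recursion
        scores = delta[m - 1][:, None] + g[m]       # delta_{m-1}(y) + g_m(y, s)
        psi[m] = np.argmax(scores, axis=0)
        delta[m] = np.max(scores, axis=0)
    y = np.zeros(n, dtype=int)                      # termination + backtracking
    y[-1] = np.argmax(delta[-1])
    for m in range(n - 2, -1, -1):
        y[m] = psi[m + 1, y[m + 1]]
    return y, float(delta[-1].max())

# Toy usage: n = 5 positions, K = 3 labels, random scores
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 3, 3))))
```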

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 118 / 122 Appendix The forward-backward method The forward recursion

Define α scores as:

α_1(y_1) = M_1(y_0, y_1)
α_2(y_2) = Σ_{y_1∈Y} M_2(y_1, y_2) α_1(y_1)
α_3(y_3) = Σ_{y_2∈Y} M_3(y_2, y_3) α_2(y_2)
...
α_m(y_m) = Σ_{y_{m−1}} M_m(y_{m−1}, y_m) α_{m−1}(y_{m−1});  2 ≤ m ≤ n

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 119 / 122 Appendix The forward-backward method The forward recursion

Define α scores as:

α_1(y_1) = M_1(y_0, y_1)
α_2(y_2) = Σ_{y_1∈Y} M_2(y_1, y_2) α_1(y_1)
α_3(y_3) = Σ_{y_2∈Y} M_3(y_2, y_3) α_2(y_2) = Σ_{y_1,y_2} M_3(y_2, y_3) M_2(y_1, y_2) M_1(y_0, y_1)
...
α_m(y_m) = Σ_{y_{m−1}} M_m(y_{m−1}, y_m) α_{m−1}(y_{m−1});  2 ≤ m ≤ n

At the end of the sequence:
  Σ_{y_n∈Y} α_n(y_n) = Σ_{y∈Y^n} Π_{i=1}^{n} M_i(y_{i−1}, y_i, x) = Z(x, θ).

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 119 / 122 Appendix The forward-backward method The forward recursion

Define α scores as:
  α_m(y_m) = Σ_{y_{m−1}} M_m(y_{m−1}, y_m) α_{m−1}(y_{m−1});  2 ≤ m ≤ n.

At the end of the sequence:
  Σ_{y_n∈Y} α_n(y_n) = Σ_{y∈Y^n} Π_{i=1}^{n} M_i(y_{i−1}, y_i, x) = Z(x, θ).

Complexity: reduced from O(K^n) to O(nK²).
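Below is a minimal NumPy sketch of the forward recursion (not from the original slides). It assumes the transition potentials M_m have been precomputed into an array, with index 0 at the first position standing for the 'start' state y_0, and checks Z against brute-force enumeration on a tiny example.

```python
import itertools
import numpy as np

def forward(M):
    """Forward recursion. M[m, a, b] = M_{m+1}(y_m = a, y_{m+1} = b, x) are
    non-negative potentials; only M[0, 0, :] is used at the first position
    (index 0 = dummy 'start' state y_0). Returns all alpha scores and Z(x)."""
    n, K, _ = M.shape
    alpha = np.zeros((n, K))
    alpha[0] = M[0, 0]                       # alpha_1(y_1) = M_1(start, y_1)
    for m in range(1, n):
        alpha[m] = alpha[m - 1] @ M[m]       # sum over the previous label
    return alpha, alpha[-1].sum()            # Z(x, theta) = sum_{y_n} alpha_n(y_n)

# Toy usage: n = 4 positions, K = 3 labels, random positive potentials
rng = np.random.default_rng(0)
M = np.exp(rng.normal(size=(4, 3, 3)))
alpha, Z = forward(M)

# Brute-force check over all K^n label sequences (only feasible for tiny n, K)
Z_bf = sum(M[0, 0, y[0]] * np.prod([M[m, y[m - 1], y[m]] for m in range(1, 4)])
           for y in itertools.product(range(3), repeat=4))
print(np.isclose(Z, Z_bf))   # True: same Z in O(nK^2) instead of O(K^n)
```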

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 119 / 122 Appendix The forward-backward method The backward recursion

β_m(y_m) = Σ_{y_{m+1}∈Y} M_{m+1}(y_m, y_{m+1}) β_{m+1}(y_{m+1});  1 ≤ m ≤ n − 1

β_n(y_n) = 1

At the beginning of the sequence:
  Z(x, θ) = Σ_{y_1∈Y} M_1(y_0, y_1) β_1(y_1).
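The backward pass admits the same kind of sketch (again not from the slides, with the same array convention as the forward sketch above); on the same potentials it returns the same Z.

```python
import numpy as np

def backward(M):
    """Backward recursion, with M[m, a, b] = M_{m+1}(y_m = a, y_{m+1} = b, x)."""
    n, K, _ = M.shape
    beta = np.ones((n, K))                   # beta_n(y_n) = 1
    for m in range(n - 2, -1, -1):
        beta[m] = M[m + 1] @ beta[m + 1]     # sum over the next label
    Z = M[0, 0] @ beta[0]                    # Z(x, theta) = sum_{y_1} M_1(start, y_1) beta_1(y_1)
    return beta, Z

# Same toy potentials as in the forward sketch: both passes yield the same Z
rng = np.random.default_rng(0)
M = np.exp(rng.normal(size=(4, 3, 3)))
print(backward(M)[1])
```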

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 120 / 122 Appendix The forward-backward method Marginal probability

p(y_{m−1}, y_m | x) = Σ_{y\{y_{m−1}, y_m}} p(y | x);   where y\{y_{m−1}, y_m} ≜ {y_1, ···, y_{m−2}, y_{m+1}, ···, y_n}.

p(y_{m−1}, y_m | x) = (1/Z(x)) Σ_{y\{y_{m−1}, y_m}} Π_{i=1}^{n} M_i(y_{i−1}, y_i, x)
                    = (1/Z(x)) Σ_{y\{y_{m−1}, y_m}} [ Π_{i=1}^{m−1} M_i(y_{i−1}, y_i, x) ] × M_m(y_{m−1}, y_m, x) × [ Π_{i=m+1}^{n} M_i(y_{i−1}, y_i, x) ]

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 121 / 122 Appendix The forward-backward method Marginal probability

p(y_{m−1}, y_m | x) = (1/Z(x)) M_m(y_{m−1}, y_m, x) × [ Σ_{y_1,···,y_{m−2}} Π_{i=1}^{m−1} M_i(y_{i−1}, y_i, x) ] × [ Σ_{y_{m+1},···,y_n} Π_{i=m+1}^{n} M_i(y_{i−1}, y_i, x) ]

p(y_{m−1}, y_m | x) = (1/Z(x)) α_{m−1}(y_{m−1}) M_m(y_{m−1}, y_m, x) β_m(y_m).
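Putting the two recursions together, the pairwise marginals of this final formula can be computed and checked to sum to one; the sketch below (not from the original slides) uses the same array convention as the forward and backward sketches.

```python
import numpy as np

def pairwise_marginals(M):
    """p(y_{m-1}, y_m | x) = alpha_{m-1}(y_{m-1}) M_m(y_{m-1}, y_m, x) beta_m(y_m) / Z(x),
    with M[m, a, b] = M_{m+1}(y_m = a, y_{m+1} = b, x) and index 0 = 'start' at position 0."""
    n, K, _ = M.shape
    alpha = np.zeros((n, K))                 # forward pass
    alpha[0] = M[0, 0]
    for m in range(1, n):
        alpha[m] = alpha[m - 1] @ M[m]
    Z = alpha[-1].sum()
    beta = np.ones((n, K))                   # backward pass
    for m in range(n - 2, -1, -1):
        beta[m] = M[m + 1] @ beta[m + 1]
    # marginals p(y_{m-1}, y_m | x) for m = 2, ..., n (array positions 1, ..., n-1)
    marg = np.empty((n - 1, K, K))
    for m in range(1, n):
        marg[m - 1] = alpha[m - 1][:, None] * M[m] * beta[m][None, :] / Z
    return marg

rng = np.random.default_rng(0)
marg = pairwise_marginals(np.exp(rng.normal(size=(4, 3, 3))))
print(marg.sum(axis=(1, 2)))   # each pairwise marginal sums to 1
```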

Slim Essid (Telecom ParisTech) A tutorial on CRF - ISMIR 2013 November 2013 122 / 122