Neural conditional random fields

Trinh-Minh-Tri Do†‡ ([email protected])
Thierry Artières‡ ([email protected])
†Idiap Research Institute, Martigny, Switzerland
‡LIP6, Université Pierre et Marie Curie, Paris, France

Abstract

We propose a non-linear graphical model for structured prediction. It combines the power of deep neural networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable probabilistic model that we apply to signal labeling tasks.

1 INTRODUCTION

This paper considers the structured prediction task where one wants to build a system that predicts a structured output from a (structured) input. It is a common framework for many application fields such as part-of-speech tagging, information extraction, signal (e.g. speech) labeling and recognition, and so on. We focus here on signal and sequence labeling tasks for signals such as speech and handwriting.

For decades, Hidden Markov Models (HMMs) have been the most popular approach for dealing with sequential data (e.g. for segmentation and classification). They rely on strong independence assumptions and are learned using Maximum Likelihood Estimation, which is a non discriminant criterion. This latter point comes from the fact that HMMs are generative models: they define a joint probability distribution on the sequence of observations X and the associated label sequence Y.

Discriminant systems are usually more powerful than generative models, and focus more directly on minimizing the error rate. Many studies have focused on developing discriminant training for HMMs, for example Minimum Classification Error (MCE) (Juang & Katagiri, 1992), perceptron learning (Collins, 2002), Maximum Mutual Information (MMI) (Woodland & Povey, 2002), or more recently large margin approaches (Sha & Saul, 2007; Do & Artières, 2009).

A more direct approach is to design a discriminative graphical model that models the conditional distribution P(Y|X) instead of modeling the joint probability as in generative models (Mccallum et al., 2000; Lafferty, 2001). Conditional random fields (CRFs) are a typical example of this approach. Maximum Margin Markov networks (M3N) (Taskar et al., 2004) go further by focusing on the discriminant function (which is defined as the log of the potential functions in a Markov network) and extend the SVM learning algorithm to structured prediction. While using a completely different learning algorithm, M3N is based on the same graphical modeling as CRFs and can be viewed as an instance of a CRF. Based on log-linear potentials, CRFs have been widely used for sequential data such as natural language processing or biological sequences (Altun et al., 2003; Sato & Sakakibara, 2005). However, CRFs with log-linear potentials only reach modest performance with respect to non-linear models exploiting kernels (Taskar et al., 2004). Although it is possible to use kernels in CRFs (Lafferty et al., 2004), the resulting dense optimal solution makes them generally inefficient in practice. Moreover, kernel machines are well known to be less scalable.

Besides, in recent years, deep neural architectures have been proposed as a relevant solution for extracting high level features from data (Hinton et al., 2006; Bengio et al., 2006). Such models have been successfully applied first to images (Hinton et al., 2006), then to motion capture data (Taylor et al., 2007) and text data. In these fields, deep architectures have shown great capacity to discover and extract relevant features as input to linear discriminant systems.

This work introduces neural conditional random fields, which are a marriage between conditional random fields and (deep) neural networks (NNs). The idea is to rely on deep NNs for learning relevant high level features which may then be used as inputs to a linear CRF. Going further, we propose such a global architecture, that we call NeuroCRF, which can be globally trained with a discriminant criterion. Of course, using a deep NN as a feature extractor makes learning a non convex optimization problem. This prevents relying on efficient convex optimization algorithms. However, a number of researchers have recently pointed out that convexity at any price is not always a good idea; one has to look for an optimal trade-off between modeling flexibility and optimization ease (LeCun et al., 1998; Collobert et al., 2006; Bengio & LeCun, 2007).

Related works. Some previous works have successfully designed NN systems for structured prediction. For instance, graph transformer nets (Bottou et al., 1997) have been applied to a complex check reading system that uses a convolutional net at the character level. (Graves et al., 2006) used a recurrent NN for handwriting and speech recognition, where neural net outputs (sigmoid units) are used as conditional probabilities. Motivated by the success of deep belief nets for feature discovery, Collobert and his colleagues investigated the use of deep architectures for information extraction on text data (Qi et al., 2009). A common point between these works is that the authors proposed mechanisms to adapt NNs to the structured prediction task rather than a global probabilistic framework, which is what is investigated in this paper. Recently, (Peng et al., 2009) also investigated the combination of CRFs and NNs in a parallel work. Our approach differs in its use of a deep architecture and unsupervised pretraining, and it works for general loss functions.

2 NEURAL CONDITIONAL RANDOM FIELDS

In this section, we propose a non-linear graphical model for structured prediction. We start with a general framework valid for any graphical structure. Then we focus on linear chain models for sequence labeling.

2.1 Conditional random fields

Structured output prediction aims at building a model that accurately predicts a structured output y for any input x. The output Y = {Yi} is a set of predicted random variables whose components belong to a set of labels L and are linked by conditional dependencies encoded by an undirected graph G = (V, E) with cliques c ∈ C. Given x, inference stands for finding the output that maximizes the conditional probability p(y|x)¹. Relying on the Hammersley-Clifford theorem, a CRF defines a conditional probability according to:

    p(y|x) = (1/Z(x)) ∏_{c∈C} ψ_c(x, y_c)
    with Z(x) = Σ_{y∈Y} ∏_{c∈C} ψ_c(x, y_c)                                    (1)

where Z(x) is a global normalization factor. A common choice for potential functions is the exponential function of an energy E_c:

    ψ_c(x, y_c) = e^{-E_c(x, y_c, w)}                                          (2)

To ease learning, a standard setting is to use linear energy functions E_c(x, y_c, w) = -⟨w_c^{y_c}, Φ_c(x)⟩ of the parameter vector w_c^{y_c} and of a feature vector Φ_c(x). This leads to a log-linear model (Lafferty, 2001). A linear energy function is intrinsically limiting the CRF. We propose neural CRFs to replace this linear energy function by non-linear energy functions that are computed by a NN.

¹We use the notation p(y|x) = p(Y = y|X = x).

2.2 Neural conditional random fields

Neural conditional random fields are a combination of NNs and CRFs. They extend CRFs by placing a NN structure between the input and the energy functions. This NN, visualized in Figure 1, is described in detail next.

Figure 1: Example of a tree-structured NeuroCRF (output variables Y1, ..., Y4 on top of an output layer, hidden layers, and an input layer X).

The NN takes an observation as input and outputs a number of quantities which we call energy outputs² {E_c(x, y_c, w) | c, y_c}, parameterized by w. The NN is feed forward with multiple hidden layers, non-linear hidden units, and an output layer with linear output units (i.e. a linear activation function). With this setting, a NeuroCRF may be viewed as a standard log-linear CRF working on the high-level representation computed by a neural net. In the remainder of the paper we call the top part (output layer weights) of a NeuroCRF its CRF-part and we call the remaining part its deep-part (see Figure 2-right). Let w_nn and w_c^{y_c} be the neural net weights of the deep-part and the CRF-part respectively.

²We use the terminology energy output to stress the difference between NN outputs and model outputs y.
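To make Eqs. (1)-(2) concrete, here is a small illustrative sketch (ours, not the authors' code) that evaluates p(y|x) for a toy linear-chain CRF by brute-force enumeration; the energy tables E_loc and E_tra are hypothetical stand-ins for the clique energies -⟨w_c^{y_c}, Φ_c(x)⟩.

```python
import itertools
import numpy as np

# Toy chain CRF with 3 labels and 4 positions: cliques are (t) and (t-1, t).
# E_loc[t, l]   : energy of labeling position t with label l
# E_tra[l1, l2] : energy of the transition l1 -> l2 (shared over time)
# These tables are hypothetical; in a log-linear CRF they would be -<w, Phi_c(x)>,
# in a NeuroCRF they are neural-network outputs.
rng = np.random.default_rng(0)
T, L = 4, 3
E_loc = rng.normal(size=(T, L))
E_tra = rng.normal(size=(L, L))

def total_energy(y):
    """Sum of clique energies for a full labeling y (the exponent in Eq. 2)."""
    e = sum(E_loc[t, y[t]] for t in range(T))
    e += sum(E_tra[y[t - 1], y[t]] for t in range(1, T))
    return e

# Eq. (1): p(y|x) = prod_c psi_c(x, y_c) / Z(x), with psi_c = exp(-E_c)
scores = {y: np.exp(-total_energy(y)) for y in itertools.product(range(L), repeat=T)}
Z = sum(scores.values())
y_star = max(scores, key=scores.get)
print("p(y*|x) =", scores[y_star] / Z, "for y* =", y_star)
```

In practice Z(x) is of course never computed by enumeration; for chain structures it is obtained by dynamic programming, as discussed below.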

NeuroCRF implements the conditional probability as:

    p(y|x) ∝ ∏_{c∈C} e^{-E_c(x, y_c, w)} = ∏_{c∈C} e^{⟨w_c^{y_c}, Φ_c(x, w_nn)⟩}          (3)

where Φ_c(x, w_nn) stands for the high level representation of the input x at clique c computed by the deep part. This is illustrated in Figure 2-left, where the last hidden layer includes units that are grouped in a number of sets, e.g. one for every clique Φ_c(x, w_nn). Each output unit -E_c(x, y_c, w_c) is connected to Φ_c(x, w_nn) in the last hidden layer, with the weight vector w_c^{y_c}. Note that the number of energy outputs for each clique c equals |Y_c|, hence there are |Y_c| weight vectors w_c^{y_c} for each clique c.

Inference in NeuroCRFs consists of finding the output ŷ that best matches input x (i.e. with lowest energy):

    ŷ = argmax_y p(y|x, w) = argmin_y Σ_{c∈C} E_c(x, y_c, w)                              (4)

This can be done in two steps. First, one feeds the NN with input x and forwards the information to compute all energy outputs E_c(x, y_c, w). In a second step, one uses dynamic programming to find the output ŷ with the lowest energy.
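As an illustration of the second step, here is a Viterbi-style recursion over the energy outputs, assuming the chain-structured case introduced in Section 2.3 below; this is our sketch, with hypothetical E_loc/E_tra tables standing in for the NN outputs and transition energies assumed shared over time.

```python
import numpy as np

def viterbi_decode(E_loc, E_tra):
    """Minimum-energy labeling of a chain (Eq. 4).

    E_loc: (T, L) local energies E_loc(x, t, y_t) from the NN forward pass.
    E_tra: (L, L) transition energies, assumed shared over t for simplicity.
    Returns the label sequence minimizing the total energy.
    """
    T, L = E_loc.shape
    best = E_loc[0].copy()            # best[l] = lowest energy of a prefix ending in l
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # cand[l_prev, l] = best[l_prev] + E_tra[l_prev, l] + E_loc[t, l]
        cand = best[:, None] + E_tra + E_loc[t][None, :]
        backptr[t] = cand.argmin(axis=0)
        best = cand.min(axis=0)
    y = [int(best.argmin())]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]

# usage with random energies standing in for NN outputs
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(6, 4)), rng.normal(size=(4, 4))))
```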

Shared weights network architecture. Various architectures may be used for the NN. One can use a different NN for every energy function, which may result in overfitting and high computational cost. Instead, as we presented NeuroCRFs above, one may share weights to compute a high level representation per clique (and the corresponding energy outputs) (Figure 2-left). Or one may choose to compute a shared high level representation of the input, from which all energy outputs are computed (Figure 2-right). In this latter case a NeuroCRF implements the conditional probability as:

    p(y|x) ∝ ∏_{c∈C} e^{⟨w_c^{y_c}, Φ(x, w_nn)⟩}                                          (5)

Figure 2: NN architecture with non-shared weights (left) or shared weights (right); in both cases the output layer forms the CRF-part and the hidden layers above the input layer form the deep-part.

2.3 LINEAR CHAIN NEUROCRFS FOR SEQUENCE LABELING

While the NeuroCRF framework we propose is quite general, in our experiments we focused on linear chain NeuroCRFs based on a first-order Markov chain structure (Figure 3). This allows investigating the potential power of NeuroCRFs on standard sequence labeling tasks. In a chain-structured NeuroCRF there are two kinds of cliques:

• local cliques (x, y_t) at each position t, whose potential functions are noted ψ_t(x, y_t), and whose corresponding energy functions are noted E_loc;

• transition cliques (x, y_{t-1}, y_t) between two successive positions t-1 and t, whose potential functions are noted ψ_{t-1,t}(x, y_{t-1}, y_t), and whose corresponding energy functions are noted E_tra.

Figure 3: A chain-structured NeuroCRF (labels Y_{t-1}, Y_t, Y_{t+1} on top of high-level features computed from the input layer X_{t-1}, X_t, X_{t+1}).

In such models it is usual to consider that energy functions are shared between similar cliques at different times (i.e. positions in the graph) (Lafferty, 2001)³. Then energy functions take an additional argument to specify the position in the graph, which is time t:

    ψ_t(x, y_t) = e^{-E_loc(x, t, y_t, w)}
    ψ_{t-1,t}(x, y_{t-1}, y_t) = e^{-E_tra(x, t, y_{t-1}, y_t, w)}                        (6)

The additional parameter t allows the consideration of a part of input x, whose size may vary and could not be handled by a fixed size input NN. Time is used to build the input to the NN in order to compute E_loc(x, t, y_t). It may consist of x_t, the tth element of x only (see Figure 3), or it may include a richer temporal context such as (x_{t-1}, x_t, x_{t+1}). At the end, the conditional probability of output y given input x is defined as:

    p(y|x, w) = (1/Z(x)) exp[ -Σ_{t≥1} E_loc(x, t, y_t, w) - Σ_{t>1} E_tra(x, t, y_{t-1}, y_t, w) ]    (7)

with Z(x) being the normalization factor. With this modeling, one can derive a compact architecture of a NN with |L| + |L|² outputs to compute all energy outputs.

³These authors consider two sets of parameters, one for local cliques and one for transition cliques.
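As an illustration of this compact architecture, here is a minimal sketch under our own assumptions (not the authors' implementation): a single sigmoid hidden layer stands in for the deep part, the input is a window (x_{t-1}, x_t, x_{t+1}), and the linear output layer produces |L| local energies plus |L|² transition energies. All names and sizes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class ChainEnergyNet:
    """Sketch of the compact architecture: one shared NN maps the input window
    at time t to |L| local energies and |L|^2 transition energies (Eq. 6).
    A single sigmoid hidden layer stands in for the deep part."""

    def __init__(self, dim_in, dim_hidden, n_labels, seed=0):
        rng = np.random.default_rng(seed)
        self.L = n_labels
        self.W1 = 0.01 * rng.normal(size=(dim_hidden, dim_in))   # deep-part
        self.b1 = np.zeros(dim_hidden)
        n_out = n_labels + n_labels ** 2                          # |L| + |L|^2
        self.W2 = 0.01 * rng.normal(size=(n_out, dim_hidden))     # CRF-part (linear outputs)
        self.b2 = np.zeros(n_out)

    def energies(self, window):
        """window: concatenated (x_{t-1}, x_t, x_{t+1}); returns (E_loc_t, E_tra_t)."""
        h = sigmoid(self.W1 @ window + self.b1)       # high-level representation Phi
        out = self.W2 @ h + self.b2                   # linear output units = energies
        return out[:self.L], out[self.L:].reshape(self.L, self.L)

# usage: a 3 x 39-dimensional window, 200 hidden units, 48 labels (all illustrative)
net = ChainEnergyNet(dim_in=3 * 39, dim_hidden=200, n_labels=48)
E_loc_t, E_tra_t = net.energies(np.zeros(3 * 39))
print(E_loc_t.shape, E_tra_t.shape)   # (48,) (48, 48)
```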


3 PARAMETER ESTIMATION

Let (x^1, y^1), ..., (x^n, y^n) ∈ X × Y be a training set of n input-output pairs. We seek parameters w such that:

    y^i = argmax_{y∈Y} p(y|x^i, w)                                                         (8)

This translates into a general optimization problem:

    min_w  λΩ(w) + R(w)                                                                    (9)

where R(w) = (1/n) Σ_i R_i(w) is a data-fitting measurement (e.g. the empirical risk), and Ω(w) is a regularization term, with λ a regularization factor that is used to find a tradeoff between a good fit on the training data and good generalization. A common choice for Ω(w) is L2 regularization.

Now, we discuss different criteria for training NeuroCRFs. Then, we explain how we optimize these criteria to learn NeuroCRFs. Finally, we discuss regularization and evoke semi-supervised learning.

3.1 CRITERIA

There are many discriminative criteria for training CRFs (more generally log-linear models), which can all be used for learning NeuroCRFs as well.

Probabilistic criterion. In (Lafferty, 2001), estimation of the CRF parameters w was done by maximizing the conditional likelihood (CML), which results in:

    R_i^CML(w) = -log p(y^i|x^i, w)
               = Σ_c E_c(x^i, y^i_c, w) + log Σ_{y∈Y} exp[ -Σ_c E_c(x^i, y_c, w) ]         (10)
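For the linear-chain case of Section 2.3, R_i^CML can be evaluated with the forward algorithm in log-space. The sketch below is ours (not the authors' code); the energy tables stand in for the NN outputs and transitions are assumed shared over time.

```python
import numpy as np

def chain_cml_loss(E_loc, E_tra, y):
    """Conditional maximum likelihood loss (Eq. 10) for one chain:
    R = sum_c E_c(x, y_c) + log Z(x), with Z computed by the forward recursion.

    E_loc: (T, L) local energies, E_tra: (L, L) shared transition energies
    (assumed to be the NN outputs); y: gold label sequence of length T.
    """
    T, L = E_loc.shape
    # energy of the gold labeling
    gold = E_loc[np.arange(T), y].sum() + E_tra[y[:-1], y[1:]].sum()
    # log Z by dynamic programming in log-space (forward algorithm)
    alpha = -E_loc[0]
    for t in range(1, T):
        # alpha[l] = logsumexp_{l'}(alpha[l'] - E_tra[l', l]) - E_loc[t, l]
        alpha = np.logaddexp.reduce(alpha[:, None] - E_tra, axis=0) - E_loc[t]
    log_Z = np.logaddexp.reduce(alpha)
    return gold + log_Z

rng = np.random.default_rng(0)
E_loc, E_tra = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
print(chain_cml_loss(E_loc, E_tra, np.array([0, 2, 1, 1, 0])))
```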

Large margin criterion. The large margin method focuses more directly on giving the highest discriminant score to the correct output. In NeuroCRFs, the discriminant function is a sum of energy functions over cliques (see Eq. (4)):

    F(x, y, w) = -Σ_{c∈C} E_c(x, y_c, w)                                                   (11)

Large margin training for structured output (Taskar et al., 2004) aims at finding w so that:

    F(x^i, y^i, w) ≥ F(x^i, y, w) + ∆(y^i, y)    ∀y ∈ Y                                    (12)

where ∆(y^i, y) allows taking into account differences between labelings (e.g. the Hamming distance between y and y^i). We assume a decomposable loss (like the Hamming distance) such that ∆(y^i, y) = Σ_c δ(y^i_c, y_c), so that it can be factorized along the graph structure and integrated in the dynamic programming pass needed to compute argmax_{y∈Y} p(y|x). The elementary loss function of NeuroCRFs is then:

    R_i^LM(w) = max_{y∈Y} F(x^i, y, w) - F(x^i, y^i, w) + ∆(y^i, y)
              = max_{y∈Y} Σ_c [ ∆E_c(x^i, y_c, y^i_c, w) + δ(y^i_c, y_c) ]                 (13)

with ∆E_c(x^i, y_c, y^i_c, w) = E_c(x^i, y^i_c, w) - E_c(x^i, y_c, w).

Perceptron approach. Perceptron learning is a simple approach to discriminative training which was originally proposed for training linear classifiers (Rosenblatt, 1988) but can also be applied to graphical models (Collins, 2002). The idea can be extended to NeuroCRFs by considering the following loss term:

    R_i^Perc(w) = max_{y∈Y} [ Σ_c E_c(x^i, y^i_c, w) - Σ_c E_c(x^i, y_c, w) ]              (14)

which is very similar to the large margin criterion.
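Because the Hamming loss decomposes over local cliques, the max in Eq. (13) reduces to a single loss-augmented Viterbi pass in the chain case. Here is an illustrative sketch (our code, hypothetical energy tables, transitions assumed shared over time), not the authors' implementation.

```python
import numpy as np

def chain_min_energy(E_loc, E_tra):
    """Minimum total energy over all labelings of a chain (Viterbi, values only)."""
    best = E_loc[0].copy()
    for t in range(1, E_loc.shape[0]):
        best = (best[:, None] + E_tra + E_loc[t][None, :]).min(axis=0)
    return best.min()

def chain_margin_loss(E_loc, E_tra, y):
    """Large-margin loss (Eq. 13) with a Hamming loss Delta, folded into the
    local energies so that the max over y is one loss-augmented Viterbi pass."""
    T = E_loc.shape[0]
    gold = E_loc[np.arange(T), y].sum() + E_tra[y[:-1], y[1:]].sum()
    # delta(y^i_t, y_t) = 1 if the labels differ; maximizing -E + delta is the
    # same as minimizing E - delta, so subtract the per-position loss from E_loc
    E_aug = E_loc - (np.arange(E_loc.shape[1])[None, :] != y[:, None])
    # the result is >= 0 since y itself is among the candidates (with delta = 0)
    return gold - chain_min_energy(E_aug, E_tra)

rng = np.random.default_rng(0)
print(chain_margin_loss(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)),
                        np.array([0, 2, 1, 1, 0])))
```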

3.2 LEARNING

Due to non-convexity, initialization is a crucial step for NN learning, especially in the case of deep architectures (see (Erhan et al., 2009) for an analysis). Fortunately, an unsupervised greedy layer-wise pretraining algorithm for deep architectures has recently been proposed to tackle this problem with notable success (Hinton et al., 2006). (Bengio et al., 2006) provides a comprehensive analysis of greedy pretraining. We describe NN initialization in detail first, then we discuss fine tuning of the NeuroCRF.

3.2.1 INITIALIZATION

Initialization of hidden layers in the NeuroCRF is done incrementally, as has been popularized for learning deep architectures. In our implementation, the deep-part of the NeuroCRF is initialized layer by layer in an unsupervised manner using restricted Boltzmann machines⁴ (RBMs), as proposed by Hinton and colleagues (Hinton et al., 2006). Depending on the task, inputs may be real valued or binary valued; this may be handled by slightly different RBMs. We considered both cases in our experiments, while coding (hidden) layers always consist of binary units.

Once a cascade of successive RBMs has been trained one at a time, one obtains a deep belief net which is then transformed into a feed forward NN that implements the deep-part of the NeuroCRF (without output layer). Once the deep-part is initialized, the NN is used to compute a high-level representation (i.e. the vector of activations on the last hidden layer) of the input samples. The CRF-part may then be initialized by training (in a supervised way) a linear CRF on this high-level coding of the input samples. As we said, such a linear CRF is actually an output layer which is stacked over the deep part. The union of the weights of the deep-part and of the CRF-part constitutes an initialization solution w_0 which is fine tuned, as described below, using supervised learning.

⁴The NN weights are initialized with zero-mean Gaussian noise before RBM learning.
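As an illustration of the layer-wise pretraining, here is one contrastive divergence (CD-1) update for a binary-binary RBM; this is a generic sketch rather than the authors' exact training code, and the sizes, learning rate, and minibatch are purely illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, b_vis, b_hid, lr=0.05, rng=np.random.default_rng(0)):
    """One CD-1 update of a binary-binary RBM on a minibatch v0 of shape (n, d_vis).
    Returns the updated (W, b_vis, b_hid)."""
    # positive phase
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # one Gibbs step (negative phase), using probabilities for the statistics
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# illustrative usage: 128-dimensional binary inputs, 200 hidden units
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(128, 200))        # small zero-mean Gaussian initialization
b_vis, b_hid = np.zeros(128), np.zeros(200)
batch = (rng.random((32, 128)) < 0.5).astype(float)
W, b_vis, b_hid = cd1_step(batch, W, b_vis, b_hid)
```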

3.2.2 FINE TUNING

Fine tuning aims at learning the NeuroCRF parameters globally, starting from an initial and reasonable solution. None of the criteria we discussed earlier (Subsection 3.1) are convex, since we naturally consider NNs with non-linear (sigmoid) activation functions in the hidden layers. However, provided one can compute an initial and reasonable solution, and provided one can compute the gradient of the criterion with respect to the NN weights, one can use any gradient-based optimization method, such as stochastic gradient or a bundle method, to learn the model and reach a (possibly local) minimum. We now show how to compute the gradient with respect to the NN weights.

As long as R_i(w) is continuous and there is an efficient method for computing ∂R_i(w)/∂E_c(x, y_c, w) (this is true for all criteria discussed in the previous section), the (sub)gradient of R(w) with respect to w can be computed with a standard backpropagation procedure. Let E^i be the set of energy outputs corresponding to input x^i. Using the chain rule for every ∂R_i(w)/∂w:

    ∂R(w)/∂w = (1/n) Σ_i ∂R_i(w)/∂w = (1/n) Σ_i (∂R_i(w)/∂E^i) (∂E^i/∂w)                   (15)

where ∂E^i/∂w is the Jacobian matrix of the NN outputs (for input x^i) with respect to the weights w. Then, by setting ∂R_i(w)/∂E^i as the backpropagation errors of the NN output units, we can backpropagate and obtain ∂R_i(w)/∂w using the chain rule over the hidden layers.
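For the chain model and the CML criterion, these output errors take a familiar form: gold-label indicators minus model marginals, computed with a forward-backward pass. The sketch below is ours (local energies only, transitions assumed shared over time); the errors for the transition outputs are obtained analogously from pairwise marginals.

```python
import numpy as np

def lse(a, axis=None):
    return np.logaddexp.reduce(a, axis=axis)

def cml_output_errors(E_loc, E_tra, y):
    """Backpropagation errors dR/dE for the local energy outputs under the CML
    criterion (Eq. 10): indicator of the gold label minus the model marginal.
    E_loc: (T, L) local energies, E_tra: (L, L) shared transitions, y: gold labels."""
    T, L = E_loc.shape
    log_a = np.zeros((T, L)); log_b = np.zeros((T, L))
    log_a[0] = -E_loc[0]
    for t in range(1, T):
        log_a[t] = -E_loc[t] + lse(log_a[t - 1][:, None] - E_tra, axis=0)
    for t in range(T - 2, -1, -1):
        log_b[t] = lse(-E_tra - E_loc[t + 1][None, :] + log_b[t + 1][None, :], axis=1)
    log_Z = lse(log_a[-1])
    marg = np.exp(log_a + log_b - log_Z)           # p(y_t = l | x)
    err = -marg
    err[np.arange(T), y] += 1.0                    # plus the indicator of the gold label
    return err                                     # fed to backpropagation, cf. Eq. (15)

rng = np.random.default_rng(0)
err = cml_output_errors(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)),
                        np.array([0, 2, 1, 1, 0]))
print(err.sum(axis=1))   # each row sums to 0: indicator mass minus marginal mass
```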

3.2.3 REGULARIZATION AND SEMI-SUPERVISED LEARNING

In our implementation, we used the initial solution for building a quadratic regularization term of the form:

    Ω(w) = (1/2) ||w - w_0||²                                                              (16)

The idea is that, since the deep-part of a NeuroCRF is initialized in an unsupervised manner, we may expect that using this solution for regularization will avoid overfitting during fine tuning (we found that this gives better experimental results than using standard regularization towards 0). Since the CRF-part is not initialized with a generative model, we regularize this part towards 0, both for initialization and fine-tuning.

Note that NeuroCRFs permit semi-supervised learning in a very natural way. Indeed, one may easily use unlabeled data for initializing the deep-part while labeled data are used in fine tuning only. One can expect that a good initialization of the deep-part will improve the global performance of a NeuroCRF, since the deep-part plays the important role of finding a relevant high-level representation of the input. This is a perspective of our work that we have not explored yet.
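A minimal sketch of this regularizer and its gradient contribution, assuming the parameters are split into a deep-part (pulled toward its pretrained value w_0) and a CRF-part (pulled toward 0), as described above; all names are illustrative.

```python
import numpy as np

def regularizer_and_grad(w_deep, w_crf, w0_deep, lam):
    """Quadratic regularizer of Eq. (16), applied as described in the text:
    the deep-part is pulled toward its pretrained solution w0, while the
    CRF-part (not initialized generatively) is pulled toward 0."""
    omega = 0.5 * np.sum((w_deep - w0_deep) ** 2) + 0.5 * np.sum(w_crf ** 2)
    grad_deep = lam * (w_deep - w0_deep)   # added to the data-fitting gradient
    grad_crf = lam * w_crf
    return lam * omega, grad_deep, grad_crf

# illustrative usage with flattened parameter vectors
rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)
obj, g_deep, g_crf = regularizer_and_grad(w0 + 0.1, rng.normal(size=200), w0, lam=0.01)
```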

4 EXPERIMENTS

We performed experiments on two sequence labeling tasks with two well-known datasets. We first investigate the behaviour of NeuroCRFs in a first series of experiments on Optical Character Recognition with the OCR dataset (Kassel, 1995). Then we report comparative experimental results of NeuroCRFs and state of the art methods on the more complex task of automatic speech recognition using the TIMIT dataset (Lamel et al., 1986). In both cases we replicated experimental settings of previous works in order to get a fair comparison, building on the compilation of Ben Taskar⁵ for the OCR dataset, and using standard partitioning of the data and standard preprocessing for the TIMIT corpus. We use linear chain NeuroCRFs for both tasks.

In fine-tuning we use a variant of our batch optimizer, the non-convex regularized bundle method (Do & Artières, 2009), rather than a stochastic gradient procedure. As usual in bundle methods, our stopping condition is based on the gap (G) between the best observed objective function value (B) and the minimum of the approximation. We stop when the ratio G/B is below 1%. This is a good trade-off for reaching low error rates with a limited number of iterations (100 for the OCR experiments, 150 for the ASR experiments).

⁵http://ai.stanford.edu/~btaskar/ocr/

4.1 OPTICAL CHARACTER RECOGNITION

The OCR dataset consists of 6877 words which correspond to roughly 52K characters (Kassel, 1995; Taskar et al., 2004). OCR data are sequences of isolated characters (each represented as a binary vector of dimension 128) belonging to 26 classes. The dataset is divided into 10 folds for cross validation. We investigated two settings: using a large training set by training on 9 folds and testing on 1 fold (the large setting), and using a small one by training on 1 fold and testing on 9 folds (the small setting). Note that OCR results are cross-validation results; we do not use any extra validation set to set λ.

We learned NeuroCRFs with one or two hidden layers. Transition energy outputs have only one connection to a bias unit, meaning that we do not use any input information for building transition energies. We learned standard RBMs for initializing the deep part of NeuroCRFs, with 50 iterations through the training set. Learning is performed using one-step Contrastive Divergence.

Influence of network architecture. Figure 4 reports error rates obtained in the small setting with NeuroCRFs with one or two hidden layers of varying size. As can be seen, increasing the size of the hidden layers improves performance for both one hidden layer and two hidden layer NeuroCRFs. Also, two hidden layer architectures systematically outperform single hidden layer architectures. Note that whatever the number of hidden layers, performance reaches a plateau when increasing the hidden layers' size. However, the plateau is lower and reached faster for the two hidden layer architecture. These results suggest that increasing both the size of hidden layers and the number of hidden layers may significantly improve performance.

Figure 4: Influence of NN architecture on OCR dataset (small training set); error rate as a function of the number of hidden units per layer (50 to 500), for one and two hidden layers.

Accuracy. We compared the performance of two variants of NeuroCRFs, one trained with conditional maximum likelihood (CML) and the other trained with the large margin criterion (LM), with state of the art methods: linear and cubic M3Ns, linear CRFs, and conditional neural fields (CNFs) with one hidden layer. NeuroCRFs have 2 hidden layers of 200 units each. Table 1 reports cross validation error rates of these models for the small setting and the large setting. We also report the performance of initial solutions (i.e. before fine tuning) for NeuroCRFs (in brackets).

Table 1: Comparative error rates of NeuroCRF and state of the art methods on the OCR dataset with either a small or a large training set. Performance of NeuroCRF before fine tuning is indicated in brackets. Results of SVM cubic, M3N cubic and CNF come from (Taskar et al., 2004; Peng et al., 2009).

                        small              large
    CRF linear          0.2162             0.1420
    M3N linear          0.2113             0.1346
    SVM cubic           0.19               not available
    M3N cubic           0.13               not available
    CNF                 0.131              not available
    NeuroCRF (CML)      0.1080 (0.1224)    0.0444 (0.0697)
    NeuroCRF (LM)       0.1102 (0.1221)    0.0456 (0.0736)

NeuroCRFs significantly outperform all other methods, including M3N with a non-linear kernel (whose results are not reported for the large setting due to scalability). Also, looking at the performance of NeuroCRFs before fine tuning shows that initialization by RBMs and CRFs indeed produces a good starting point, but fine tuning is essential for obtaining optimal performance. Finally, one sees here that both NeuroCRF training criteria perform similarly, with a slight advantage of the conditional likelihood criterion over the large margin criterion. Surprisingly, we observed that the large margin criterion required more iterations than the conditional likelihood criterion⁶. In the following, we only consider NeuroCRFs trained with the CML criterion.

⁶Actually we did not succeed at optimizing the large margin criterion. We conjecture that the parameter search space might be more complex in this case.

Note that (Perez-Cruz & Pontil, 2007) address structured prediction in a different way and are able to reach an error rate of 0.125 in the small setting and 0.031 in the large setting (using an RBF kernel). Their approach considers an approximated problem that can be transformed into many multiclass SVM problems. Nevertheless, the non-linear SVM problems still scale quadratically with the problem size. Such a scaling prevents working with large data sets with millions of tokens (such as the ASR experiments in the next section).

Training time. We investigated the learning time of NeuroCRFs and their scalability. We performed experiments on the OCR dataset both for the small and large settings. Training time decomposes into RBM learning, linear CRF initialization, and fine tuning of the whole model. Roughly speaking, RBM learning and fine tuning take about 45% of the time each, while CRF initialization takes about 10%. More importantly, overall training time in the large setting took about 10 times the training time in the small setting while the training dataset is 9 times bigger, which suggests a quasi-linear scaling with the problem size.

4.2 AUTOMATIC SPEECH RECOGNITION

We performed ASR experiments on the TIMIT dataset (Lamel et al., 1986) with the standard train-test partitioning. The wave signal was preprocessed using the procedure described in (Sha & Saul, 2007), except that we do not use whitening by PCA. The 39-dimensional MFCCs are simply normalized to have zero mean and unit variance. There are roughly 1.1 million frames in the training set, and 120K frames and 57K frames respectively in the development and test sets. We used 2-layered NeuroCRFs trained with the CML criterion.

Handling continuous features. Note here that the inputs (real valued vectors of MFCC coefficients) are continuous, while RBMs originally use binary logistic units for both visible and hidden variables. We used an extension of RBMs for dealing with continuous variables that have Gaussian noise (in our implementation we consider a Gaussian noise with standard deviation 0.2) (Taylor et al., 2007). This Gaussian-binary RBM was trained for 100 passes through the training data of 1.1M frames, using one-step Contrastive Divergence. Once this first RBM is trained, we forward the input to the hidden layer and obtain a binary-logistic representation of the speech data, which is the input (i.e. visible data) for learning a second, binary RBM. Since binary RBMs converge much faster, we performed only 10 learning iterations through the training data for the second layer. The remaining initialization is performed as in Section 3.2.1.
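As an illustration of the first-layer pretraining on continuous inputs, here is a CD-1 style update for an RBM with Gaussian visible units and binary hidden units; this is a sketch in the spirit of (Taylor et al., 2007) rather than the authors' exact recipe, and the learning rate, sizes, and noise handling are our assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_binary_cd1(v0, W, b_vis, b_hid, lr=0.001, noise_std=0.2,
                        rng=np.random.default_rng(0)):
    """One CD-1 step for an RBM with real-valued (Gaussian) visible units and
    binary hidden units, sketched for normalized MFCC frames; the exact update
    rules may differ from the authors' implementation."""
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Gaussian visible reconstruction: mean plus noise of fixed standard deviation
    v1 = h0 @ W.T + b_vis + noise_std * rng.normal(size=v0.shape)
    p_h1 = sigmoid(v1 @ W + b_hid)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# illustrative: 39-dimensional normalized MFCC frames, 500 hidden units
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(39, 500))
b_vis, b_hid = np.zeros(39), np.zeros(500)
frames = rng.normal(size=(64, 39))
W, b_vis, b_hid = gaussian_binary_cd1(frames, W, b_vis, b_hid)
```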
Results. Table 2 reports the phone error rates for CDHMMs and NeuroCRFs with increasing complexity (number of Gaussians in CDHMMs or number of hidden units in NeuroCRFs). We compared NeuroCRFs with non discriminant CDHMMs (i.e. Maximum Likelihood) and with state of the art approaches for learning CDHMMs with a discriminant criterion: Maximum Conditional Likelihood (CML) (Woodland & Povey, 2002), Minimum Classification Error (MCE) (Juang & Katagiri, 1992), Large margin (LM) (Sha & Saul, 2007), and Perceptron learning (PT) (Cheng et al., 2009) (note that all results come from a compilation in (Sha & Saul, 2007) and from (Cheng et al., 2009)).

Table 2: Comparative phone recognition error rate on the TIMIT dataset for discriminant and non discriminant HMM systems and for two hidden layer NeuroCRFs (of 500 or 1000 hidden units each) trained with CML.

    CDHMM             ML     CML    MCE    PT     LM
    1 Gaussian        40.1   36.4   35.2   35.6   31.2
    2 Gaussians       36.5   34.6   33.2   34.5   30.8
    4 Gaussians       34.7   32.8   31.2   32.4   29.8
    8 Gaussians       32.7   31.5   31.9   30.9   28.2

    NeuroCRF (CML)
    500x500           29.6
    1000x1000         29.1

These results call for a few comments. First, increasing the hidden layers' size improves the NeuroCRF error rate. Unfortunately we do not know if it still improves when using larger hidden layers (due to lack of time), but one can reasonably expect that even better results may be reached by using larger hidden layers and/or adding hidden layers. Second, NeuroCRFs outperform all other discriminant and non discriminant methods except the large margin training of (Sha & Saul, 2007) when using up to 8 Gaussian distributions per state. While this may not look like an impressive result at first glance, we claim this result to be very promising. Indeed, all other systems in Table 2 rely on the learning of a preliminary CDHMM system, which is then used as initialization and/or for regularization. Hence all these systems integrate prior information from decades of research on how to learn and tune a non discriminant CDHMM for speech. In contrast, NeuroCRFs are trained from scratch with an unsupervised initialization and a supervised fine tuning; they require no prior information.

Note that we did not compare NeuroCRFs to multiple states per phone CDHMM systems as traditionally used in ASR, although such systems may reach better performance (especially for ML CDHMMs). The reason is that the comparison would not have been as fair. Indeed, one may imagine extending this work in order to have multiple nodes per phone in a NeuroCRF, and one may expect this to provide even better results, which would be more directly comparable with multiple states per phone HMM systems.

5 CONCLUSION

We presented a model combining CRFs and deep NNs, aiming at taking advantage of both the ability of deep networks to extract high level features and the discriminant power of CRFs for sequence labeling tasks. Results on OCR data show significant improvement over state of the art methods and demonstrate the relevance of the combination. On the larger scale speech recognition task our systems outperform most state of the art discriminant systems without relying on any prior, in contrast to all other systems, which rely on an initial solution obtained with a non discriminant criterion.


Acknowledgements

The authors acknowledge the support of the PASCAL 2 EU Network of Excellence.

References

Altun, Y., Johnson, M., & Hofmann, T. (2003). Investigating loss functions and optimization methods for discriminative learning of label sequences. EMNLP.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks (Technical Report 1282). Université de Montréal.

Bengio, Y., & LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large scale kernel machines. Cambridge, MA: MIT Press.

Bottou, L., Bengio, Y., & LeCun, Y. (1997). Global training of document processing systems using graph transformer networks. Proc. of Computer Vision and Pattern Recognition (pp. 490–494). Puerto Rico: IEEE.

Cheng, C.-C., Sha, F., & Saul, L. K. (2009). Matrix updates for perceptron training of continuous density hidden Markov models. ICML (pp. 153–160).

Collins, M. (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. EMNLP (pp. 1–8).

Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006). Trading convexity for scalability. ICML.

Do, T.-M.-T., & Artières, T. (2009). Large margin training for hidden Markov models with partially observed states. ICML (pp. 265–272). Omnipress.

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS.

Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. ICML (pp. 369–376).

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Juang, B., & Katagiri, S. (1992). Discriminative learning for minimum error classification. IEEE Trans. Signal Processing, 40(12).

Kassel, R. H. (1995). A comparison of approaches to on-line handwritten character recognition. Doctoral dissertation, Cambridge, MA, USA.

Lafferty, J. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML (pp. 282–289). Morgan Kaufmann.

Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: representation and clique selection. ICML.

Lamel, L., Kassel, R., & Seneff, S. (1986). Speech database development: Design and analysis of the acoustic-phonetic corpus. DARPA (pp. 100–110).

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86.

McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. ICML (pp. 591–598).

Peng, J., Bo, L., & Xu, J. (2009). Conditional neural fields. NIPS.

Perez-Cruz, F., Ghahramani, Z., & Pontil, M. (2007). Conditional graphical models. In G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, & S. V. N. Vishwanathan (Eds.), Predicting structured data. MIT Press.

Qi, Y., Kuksa, P. P., Collobert, R., Sadamasa, K., Kavukcuoglu, K., & Weston, J. (2009). Semi-supervised sequence labeling with self-learned features. ICDM '09. IEEE.

Rosenblatt, F. (1988). The perceptron: a probabilistic model for information storage and organization in the brain. In Neurocomputing: foundations of research, 89–114.

Sato, K., & Sakakibara, Y. (2005). RNA secondary structural alignment with conditional random fields. ECCB/JBI (p. 242).

Sha, F., & Saul, L. K. (2007). Large margin hidden Markov models for automatic speech recognition. NIPS 19 (pp. 1249–1256). MIT Press.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. NIPS 16. MIT Press.

Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2007). Modeling human motion using binary latent variables. NIPS (pp. 1345–1352). MIT Press.

Woodland, P., & Povey, D. (2002). Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16, 25–47.
