Neural conditional random fields

Trinh-Minh-Tri Do†‡ ([email protected])
Thierry Artières‡ ([email protected])
†Idiap Research Institute, Martigny, Switzerland
‡LIP6, Université Pierre et Marie Curie, Paris, France

Abstract

We propose a non-linear graphical model for structured prediction. It combines the power of deep neural networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable probabilistic model that we apply to signal labeling tasks.

1 INTRODUCTION

This paper considers the structured prediction task where one wants to build a system that predicts a structured output from a (structured) input. It is a common framework for many application fields such as part-of-speech tagging, information extraction, signal (e.g. speech) labeling and recognition, and so on. We focus here on signal and sequence labeling tasks for signals such as speech and handwriting.

For decades, Hidden Markov Models (HMMs) have been the most popular approach for dealing with sequential data (e.g. for segmentation and classification). They rely on strong independence assumptions and are learned using Maximum Likelihood Estimation, which is a non discriminant criterion. This latter point comes from the fact that HMMs are generative models: they define a joint probability distribution on the sequence of observations X and the associated label sequence Y.

Discriminant systems are usually more powerful than generative models, and focus more directly on minimizing the error rate. Many studies have focused on developing discriminant training for HMMs, for example Minimum Classification Error (MCE) (Juang & Katagiri, 1992), perceptron learning (Collins, 2002), Maximum Mutual Information (MMI) (Woodland & Povey, 2002), or more recently large margin approaches (Sha & Saul, 2007; Do & Artières, 2009).

A more direct approach is to design a discriminative graphical model that models the conditional distribution P(Y|X) instead of modeling the joint probability as in generative models (Mccallum et al., 2000; Lafferty, 2001). Conditional random fields (CRFs) are a typical example of this approach. Maximum Margin Markov networks (M3N) (Taskar et al., 2004) go further by focusing on the discriminant function (which is defined as the log of the potential functions in a Markov network) and extend the SVM learning algorithm to structured prediction. While using a completely different learning algorithm, M3N is based on the same graphical modeling as CRFs and can be viewed as an instance of a CRF. Based on log-linear potentials, CRFs have been widely used for sequential data such as natural language processing or biological sequences (Altun et al., 2003; Sato & Sakakibara, 2005). However, CRFs with log-linear potentials only reach modest performance with respect to non-linear models exploiting kernels (Taskar et al., 2004). Although it is possible to use kernels in CRFs (Lafferty et al., 2004), the resulting dense optimal solution makes them generally inefficient in practice. Moreover, kernel machines are well known to be less scalable.

Besides, in recent years, deep neural architectures have been proposed as a relevant solution for extracting high level features from data (Hinton et al., 2006; Bengio et al., 2006). Such models have been successfully applied first to images (Hinton et al., 2006), then to motion capture data (Taylor et al., 2007) and text data. In these fields, deep architectures have shown great capacity to discover and extract relevant features as input to linear discriminant systems.

This work introduces neural conditional random fields, which are a marriage between conditional random fields and (deep) neural networks (NNs). The idea is to rely on deep NNs for learning relevant high level features which may then be used as inputs to a linear CRF. Going further, we propose such a global architecture, that we call NeuroCRF, which can be globally trained with a discriminant criterion. Of course, using a deep NN as a feature extractor makes learning a non convex optimization problem. This prevents relying on efficient convex optimization algorithms. However, a number of researchers have recently pointed out that convexity at any price is not always a good idea; one has to look for an optimal trade-off between modeling flexibility and optimization ease (LeCun et al., 1998; Collobert et al., 2006; Bengio & LeCun, 2007).

Related works. Some previous works have successfully designed NN systems for structured prediction. For instance, graph transformer nets (Bottou et al., 1997) have been applied to a complex check reading system that uses a convolutional net at the character level. (Graves et al., 2006) used a recurrent NN for handwriting and speech recognition, where neural net outputs (sigmoid units) are used as conditional probabilities. Motivated by the success of deep belief nets for feature discovery, Collobert and his colleagues investigated the use of deep architectures for information extraction on text data (Qi et al., 2009). A common point between these works is that the authors proposed mechanisms to adapt NNs to the structured prediction task rather than a global probabilistic framework, which is what is investigated in this paper. Recently, (Peng et al., 2009) also investigated the combination of CRFs and NNs in a parallel work. Our approach differs in its use of a deep architecture and unsupervised pretraining, and it works for general loss functions.

2 NEURAL CONDITIONAL RANDOM FIELDS

In this section, we propose a non-linear graphical model for structured prediction. We start with a general framework valid for any graphical structure. Then we focus on linear chain models for sequence labeling.

2.1 Conditional random fields

Structured output prediction aims at building a model that accurately predicts a structured output y for any input x. The output Y = {Yi} is a set of predicted random variables whose components belong to a set of labels L and are linked by conditional dependencies encoded by an undirected graph G = (V, E) with cliques c ∈ C. Given x, inference stands for finding the output that maximizes the conditional probability p(y|x)¹. Relying on the Hammersley-Clifford theorem, a CRF defines a conditional probability according to:

    p(y|x) = (1/Z(x)) ∏_{c∈C} ψ_c(x, y_c)
    with Z(x) = Σ_{y∈Y} ∏_{c∈C} ψ_c(x, y_c)                                    (1)

where Z(x) is a global normalization factor. A common choice for potential functions is the exponential function of an energy E_c:

    ψ_c(x, y_c) = e^{-E_c(x, y_c, w)}                                          (2)

To ease learning, a standard setting is to use linear energy functions E_c(x, y_c, w) = -⟨w_c^{y_c}, Φ_c(x)⟩ of the parameter vector w_c^{y_c} and of a feature vector Φ_c(x). This leads to a log-linear model (Lafferty, 2001). A linear energy function is intrinsically limiting the CRF. We propose neural CRFs to replace this linear energy function by non-linear energy functions that are computed by a NN.

¹We use the notation p(y|x) = p(Y = y|X = x).

2.2 Neural conditional random fields

Neural conditional random fields are a combination of NNs and CRFs. They extend CRFs by placing a NN structure between the input and the energy functions. This NN, visualized in Figure 1, is described in detail next.

Figure 1: Example of a tree-structured NeuroCRF (output variables Y1, ..., Y4 on top of an output layer, hidden layers, and an input layer X).

The NN takes an observation as input and outputs a number of quantities which we call energy outputs² {E_c(x, y_c, w) | c, y_c}, parameterized by w. The NN is feed forward with multiple hidden layers, non-linear hidden units, and an output layer with linear output units (i.e. a linear activation function). With this setting, a NeuroCRF may be viewed as a standard log-linear CRF working on the high-level representation computed by a neural net. In the remainder of the paper we call the top part (output layer weights) of a NeuroCRF its CRF-part and we call the remaining part its deep-part (see Figure 2-right). Let w_nn and w_c^{y_c} be the neural net weights of the deep-part and the CRF-part respectively.

²We use the terminology energy output to stress the difference between NN outputs and model outputs y.
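To make Eqs. (1)-(2) concrete, here is a small illustrative sketch (ours, not the authors' code) that evaluates p(y|x) for a toy linear-chain CRF by brute-force enumeration; the energy tables E_loc and E_tra are hypothetical stand-ins for the clique energies -⟨w_c^{y_c}, Φ_c(x)⟩.

```python
import itertools
import numpy as np

# Toy chain CRF with 3 labels and 4 positions: cliques are (t) and (t-1, t).
# E_loc[t, l]   : energy of labeling position t with label l
# E_tra[l1, l2] : energy of the transition l1 -> l2 (shared over time)
# These tables are hypothetical; in a log-linear CRF they would be -<w, Phi_c(x)>,
# in a NeuroCRF they are neural-network outputs.
rng = np.random.default_rng(0)
T, L = 4, 3
E_loc = rng.normal(size=(T, L))
E_tra = rng.normal(size=(L, L))

def total_energy(y):
    """Sum of clique energies for a full labeling y (the exponent in Eq. 2)."""
    e = sum(E_loc[t, y[t]] for t in range(T))
    e += sum(E_tra[y[t - 1], y[t]] for t in range(1, T))
    return e

# Eq. (1): p(y|x) = prod_c psi_c(x, y_c) / Z(x), with psi_c = exp(-E_c)
scores = {y: np.exp(-total_energy(y)) for y in itertools.product(range(L), repeat=T)}
Z = sum(scores.values())
y_star = max(scores, key=scores.get)
print("p(y*|x) =", scores[y_star] / Z, "for y* =", y_star)
```

In practice Z(x) is of course never computed by enumeration; for chain structures it is obtained by dynamic programming, as discussed below.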

NeuroCRF implements the conditional probability as:

    p(y|x) ∝ ∏_{c∈C} e^{-E_c(x, y_c, w)} = ∏_{c∈C} e^{⟨w_c^{y_c}, Φ_c(x, w_nn)⟩}          (3)

where Φ_c(x, w_nn) stands for the high level representation of the input x at clique c computed by the deep part. This is illustrated in Figure 2-left, where the last hidden layer includes units that are grouped in a number of sets, e.g. one for every clique Φ_c(x, w_nn). Each output unit -E_c(x, y_c, w_c) is connected to Φ_c(x, w_nn) in the last hidden layer, with the weight vector w_c^{y_c}. Note that the number of energy outputs for each clique c equals |Y_c|, hence there are |Y_c| weight vectors w_c^{y_c} for each clique c.

Inference in NeuroCRFs consists of finding the output ŷ that best matches input x (i.e. with lowest energy):

    ŷ = argmax_y p(y|x, w) = argmin_y Σ_{c∈C} E_c(x, y_c, w)                              (4)

This can be done in two steps. First, one feeds the NN with input x and forwards the information to compute all energy outputs E_c(x, y_c, w). In a second step, one uses dynamic programming to find the output ŷ with the lowest energy.
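As an illustration of the second step, here is a Viterbi-style recursion over the energy outputs, assuming the chain-structured case introduced in Section 2.3 below; this is our sketch, with hypothetical E_loc/E_tra tables standing in for the NN outputs and transition energies assumed shared over time.

```python
import numpy as np

def viterbi_decode(E_loc, E_tra):
    """Minimum-energy labeling of a chain (Eq. 4).

    E_loc: (T, L) local energies E_loc(x, t, y_t) from the NN forward pass.
    E_tra: (L, L) transition energies, assumed shared over t for simplicity.
    Returns the label sequence minimizing the total energy.
    """
    T, L = E_loc.shape
    best = E_loc[0].copy()            # best[l] = lowest energy of a prefix ending in l
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # cand[l_prev, l] = best[l_prev] + E_tra[l_prev, l] + E_loc[t, l]
        cand = best[:, None] + E_tra + E_loc[t][None, :]
        backptr[t] = cand.argmin(axis=0)
        best = cand.min(axis=0)
    y = [int(best.argmin())]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]

# usage with random energies standing in for NN outputs
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(6, 4)), rng.normal(size=(4, 4))))
```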

Shared weights network architecture. Various architectures may be used for the NN. One can use a different NN for every energy function, which may result in overfitting and high computational cost. Instead, as we presented NeuroCRFs above, one may share weights to compute a high level representation per clique (and the corresponding energy outputs) (Figure 2-left). Or one may choose to compute a shared high level representation of the input, from which all energy outputs are computed (Figure 2-right). In this latter case a NeuroCRF implements the conditional probability as:

    p(y|x) ∝ ∏_{c∈C} e^{⟨w_c^{y_c}, Φ(x, w_nn)⟩}                                          (5)

Figure 2: NN architecture with non-shared weights (left) or shared weights (right); in both cases the output layer forms the CRF-part and the hidden layers above the input layer form the deep-part.

2.3 LINEAR CHAIN NEUROCRFS FOR SEQUENCE LABELING

While the NeuroCRF framework we propose is quite general, in our experiments we focused on linear chain NeuroCRFs based on a first-order Markov chain structure (Figure 3). This allows investigating the potential power of NeuroCRFs on standard sequence labeling tasks. In a chain-structured NeuroCRF there are two kinds of cliques:

• local cliques (x, y_t) at each position t, whose potential functions are noted ψ_t(x, y_t), and whose corresponding energy functions are noted E_loc;

• transition cliques (x, y_{t-1}, y_t) between two successive positions t-1 and t, whose potential functions are noted ψ_{t-1,t}(x, y_{t-1}, y_t), and whose corresponding energy functions are noted E_tra.

Figure 3: A chain-structured NeuroCRF (labels Y_{t-1}, Y_t, Y_{t+1} on top of high-level features computed from the input layer X_{t-1}, X_t, X_{t+1}).

In such models it is usual to consider that energy functions are shared between similar cliques at different times (i.e. positions in the graph) (Lafferty, 2001)³. Then energy functions take an additional argument to specify the position in the graph, which is time t:

    ψ_t(x, y_t) = e^{-E_loc(x, t, y_t, w)}
    ψ_{t-1,t}(x, y_{t-1}, y_t) = e^{-E_tra(x, t, y_{t-1}, y_t, w)}                        (6)

The additional parameter t allows the consideration of a part of input x, whose size may vary and could not be handled by a fixed size input NN. Time is used to build the input to the NN in order to compute E_loc(x, t, y_t). It may consist of x_t, the tth element of x only (see Figure 3), or it may include a richer temporal context such as (x_{t-1}, x_t, x_{t+1}). At the end, the conditional probability of output y given input x is defined as:

    p(y|x, w) = (1/Z(x)) exp[ -Σ_{t≥1} E_loc(x, t, y_t, w) - Σ_{t>1} E_tra(x, t, y_{t-1}, y_t, w) ]    (7)

with Z(x) being the normalization factor. With this modeling, one can derive a compact architecture of a NN with |L| + |L|² outputs to compute all energy outputs.

³These authors consider two sets of parameters, one for local cliques and one for transition cliques.
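As an illustration of this compact architecture, here is a minimal sketch under our own assumptions (not the authors' implementation): a single sigmoid hidden layer stands in for the deep part, the input is a window (x_{t-1}, x_t, x_{t+1}), and the linear output layer produces |L| local energies plus |L|² transition energies. All names and sizes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class ChainEnergyNet:
    """Sketch of the compact architecture: one shared NN maps the input window
    at time t to |L| local energies and |L|^2 transition energies (Eq. 6).
    A single sigmoid hidden layer stands in for the deep part."""

    def __init__(self, dim_in, dim_hidden, n_labels, seed=0):
        rng = np.random.default_rng(seed)
        self.L = n_labels
        self.W1 = 0.01 * rng.normal(size=(dim_hidden, dim_in))   # deep-part
        self.b1 = np.zeros(dim_hidden)
        n_out = n_labels + n_labels ** 2                          # |L| + |L|^2
        self.W2 = 0.01 * rng.normal(size=(n_out, dim_hidden))     # CRF-part (linear outputs)
        self.b2 = np.zeros(n_out)

    def energies(self, window):
        """window: concatenated (x_{t-1}, x_t, x_{t+1}); returns (E_loc_t, E_tra_t)."""
        h = sigmoid(self.W1 @ window + self.b1)       # high-level representation Phi
        out = self.W2 @ h + self.b2                   # linear output units = energies
        return out[:self.L], out[self.L:].reshape(self.L, self.L)

# usage: a 3 x 39-dimensional window, 200 hidden units, 48 labels (all illustrative)
net = ChainEnergyNet(dim_in=3 * 39, dim_hidden=200, n_labels=48)
E_loc_t, E_tra_t = net.energies(np.zeros(3 * 39))
print(E_loc_t.shape, E_tra_t.shape)   # (48,) (48, 48)
```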


3 PARAMETER ESTIMATION

Let (x^1, y^1), ..., (x^n, y^n) ∈ X × Y be a training set of n input-output pairs. We seek parameters w such that:

    y^i = argmax_{y∈Y} p(y|x^i, w)                                                         (8)

This translates into a general optimization problem:

    min_w  λΩ(w) + R(w)                                                                    (9)

where R(w) = (1/n) Σ_i R_i(w) is a data-fitting measurement (e.g. the empirical risk), and Ω(w) is a regularization term, with λ a regularization factor that is used to find a tradeoff between a good fit on the training data and good generalization. A common choice for Ω(w) is L2 regularization.

Now, we discuss different criteria for training NeuroCRFs. Then, we explain how we optimize these criteria to learn NeuroCRFs. Finally, we discuss regularization and evoke semi-supervised learning.

3.1 CRITERIA

There are many discriminative criteria for training CRFs (more generally log-linear models), which can all be used for learning NeuroCRFs as well.

Probabilistic criterion. In (Lafferty, 2001), estimation of the CRF parameters w was done by maximizing the conditional likelihood (CML), which results in:

    R_i^CML(w) = -log p(y^i|x^i, w)
               = Σ_c E_c(x^i, y^i_c, w) + log Σ_{y∈Y} exp[ -Σ_c E_c(x^i, y_c, w) ]         (10)
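For the linear-chain case of Section 2.3, R_i^CML can be evaluated with the forward algorithm in log-space. The sketch below is ours (not the authors' code); the energy tables stand in for the NN outputs and transitions are assumed shared over time.

```python
import numpy as np

def chain_cml_loss(E_loc, E_tra, y):
    """Conditional maximum likelihood loss (Eq. 10) for one chain:
    R = sum_c E_c(x, y_c) + log Z(x), with Z computed by the forward recursion.

    E_loc: (T, L) local energies, E_tra: (L, L) shared transition energies
    (assumed to be the NN outputs); y: gold label sequence of length T.
    """
    T, L = E_loc.shape
    # energy of the gold labeling
    gold = E_loc[np.arange(T), y].sum() + E_tra[y[:-1], y[1:]].sum()
    # log Z by dynamic programming in log-space (forward algorithm)
    alpha = -E_loc[0]
    for t in range(1, T):
        # alpha[l] = logsumexp_{l'}(alpha[l'] - E_tra[l', l]) - E_loc[t, l]
        alpha = np.logaddexp.reduce(alpha[:, None] - E_tra, axis=0) - E_loc[t]
    log_Z = np.logaddexp.reduce(alpha)
    return gold + log_Z

rng = np.random.default_rng(0)
E_loc, E_tra = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
print(chain_cml_loss(E_loc, E_tra, np.array([0, 2, 1, 1, 0])))
```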

Large margin criterion. The large margin method focuses more directly on giving the highest discriminant score to the correct output. In NeuroCRFs, the discriminant function is a sum of energy functions over cliques (see Eq. (4)):

    F(x, y, w) = -Σ_{c∈C} E_c(x, y_c, w)                                                   (11)

Large margin training for structured output (Taskar et al., 2004) aims at finding w so that:

    F(x^i, y^i, w) ≥ F(x^i, y, w) + ∆(y^i, y)    ∀y ∈ Y                                    (12)

where ∆(y^i, y) allows taking into account differences between labelings (e.g. the Hamming distance between y and y^i). We assume a decomposable loss (like the Hamming distance) such that ∆(y^i, y) = Σ_c δ(y^i_c, y_c), so that it can be factorized along the graph structure and integrated in the dynamic programming pass needed to compute argmax_{y∈Y} p(y|x). The elementary loss function of NeuroCRFs is then:

    R_i^LM(w) = max_{y∈Y} F(x^i, y, w) - F(x^i, y^i, w) + ∆(y^i, y)
              = max_{y∈Y} Σ_c [ ∆E_c(x^i, y_c, y^i_c, w) + δ(y^i_c, y_c) ]                 (13)

with ∆E_c(x^i, y_c, y^i_c, w) = E_c(x^i, y^i_c, w) - E_c(x^i, y_c, w).

Perceptron approach. Perceptron learning is a simple approach to discriminative training which was originally proposed for training linear classifiers (Rosenblatt, 1988) but can also be applied to graphical models (Collins, 2002). The idea can be extended to NeuroCRFs by considering the following loss term:

    R_i^Perc(w) = max_{y∈Y} [ Σ_c E_c(x^i, y^i_c, w) - Σ_c E_c(x^i, y_c, w) ]              (14)

which is very similar to the large margin criterion.
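Because the Hamming loss decomposes over local cliques, the max in Eq. (13) reduces to a single loss-augmented Viterbi pass in the chain case. Here is an illustrative sketch (our code, hypothetical energy tables, transitions assumed shared over time), not the authors' implementation.

```python
import numpy as np

def chain_min_energy(E_loc, E_tra):
    """Minimum total energy over all labelings of a chain (Viterbi, values only)."""
    best = E_loc[0].copy()
    for t in range(1, E_loc.shape[0]):
        best = (best[:, None] + E_tra + E_loc[t][None, :]).min(axis=0)
    return best.min()

def chain_margin_loss(E_loc, E_tra, y):
    """Large-margin loss (Eq. 13) with a Hamming loss Delta, folded into the
    local energies so that the max over y is one loss-augmented Viterbi pass."""
    T = E_loc.shape[0]
    gold = E_loc[np.arange(T), y].sum() + E_tra[y[:-1], y[1:]].sum()
    # delta(y^i_t, y_t) = 1 if the labels differ; maximizing -E + delta is the
    # same as minimizing E - delta, so subtract the per-position loss from E_loc
    E_aug = E_loc - (np.arange(E_loc.shape[1])[None, :] != y[:, None])
    # the result is >= 0 since y itself is among the candidates (with delta = 0)
    return gold - chain_min_energy(E_aug, E_tra)

rng = np.random.default_rng(0)
print(chain_margin_loss(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)),
                        np.array([0, 2, 1, 1, 0])))
```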

3.2 LEARNING

Due to non-convexity, initialization is a crucial step for NN learning, especially in the case of deep architectures (see (Erhan et al., 2009) for an analysis). Fortunately, an unsupervised greedy layer-wise pretraining algorithm for deep architectures has recently been proposed to tackle this problem with notable success (Hinton et al., 2006). (Bengio et al., 2006) provides a comprehensive analysis of greedy pretraining. We describe NN initialization in detail first, then we discuss fine tuning of the NeuroCRF.

3.2.1 INITIALIZATION

Initialization of hidden layers in the NeuroCRF is done incrementally, as has been popularized for learning deep architectures. In our implementation, the deep-part of the NeuroCRF is initialized layer by layer in an unsupervised manner using restricted Boltzmann machines⁴ (RBMs), as proposed by Hinton and colleagues (Hinton et al., 2006). Depending on the task, inputs may be real valued or binary valued; this may be handled by slightly different RBMs. We considered both cases in our experiments, while coding (hidden) layers always consist of binary units.

Once a cascade of successive RBMs has been trained one at a time, one obtains a deep belief net which is then transformed into a feed forward NN that implements the deep-part of the NeuroCRF (without output layer). Once the deep-part is initialized, the NN is used to compute a high-level representation (i.e. the vector of activations on the last hidden layer) of the input samples. The CRF-part may then be initialized by training (in a supervised way) a linear CRF on this high-level coding of the input samples. As we said, such a linear CRF is actually an output layer which is stacked over the deep part. The union of the weights of the deep-part and of the CRF-part constitutes an initialization solution w_0 which is fine tuned, as described below, using supervised learning.

⁴The NN weights are initialized with zero-mean Gaussian noise before RBM learning.
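As an illustration of the layer-wise pretraining, here is one contrastive divergence (CD-1) update for a binary-binary RBM; this is a generic sketch rather than the authors' exact training code, and the sizes, learning rate, and minibatch are purely illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, b_vis, b_hid, lr=0.05, rng=np.random.default_rng(0)):
    """One CD-1 update of a binary-binary RBM on a minibatch v0 of shape (n, d_vis).
    Returns the updated (W, b_vis, b_hid)."""
    # positive phase
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # one Gibbs step (negative phase), using probabilities for the statistics
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# illustrative usage: 128-dimensional binary inputs, 200 hidden units
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(128, 200))        # small zero-mean Gaussian initialization
b_vis, b_hid = np.zeros(128), np.zeros(200)
batch = (rng.random((32, 128)) < 0.5).astype(float)
W, b_vis, b_hid = cd1_step(batch, W, b_vis, b_hid)
```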

3.2.2 FINE TUNING

Fine tuning aims at learning the NeuroCRF parameters globally, starting from an initial and reasonable solution. None of the criteria we discussed earlier (Subsection 3.1) are convex, since we naturally consider NNs with non-linear (sigmoid) activation functions in the hidden layers. However, provided one can compute an initial and reasonable solution, and provided one can compute the gradient of the criterion with respect to the NN weights, one can use any gradient-based optimization method, such as stochastic gradient or a bundle method, to learn the model and reach a (possibly local) minimum. We now show how to compute the gradient with respect to the NN weights.

As long as R_i(w) is continuous and there is an efficient method for computing ∂R_i(w)/∂E_c(x, y_c, w) (this is true for all criteria discussed in the previous section), the (sub)gradient of R(w) with respect to w can be computed with a standard backpropagation procedure. Let E^i be the set of energy outputs corresponding to input x^i. Using the chain rule for every ∂R_i(w)/∂w:

    ∂R(w)/∂w = (1/n) Σ_i ∂R_i(w)/∂w = (1/n) Σ_i (∂R_i(w)/∂E^i) (∂E^i/∂w)                   (15)

where ∂E^i/∂w is the Jacobian matrix of the NN outputs (for input x^i) with respect to the weights w. Then, by setting ∂R_i(w)/∂E^i as the backpropagation errors of the NN output units, we can backpropagate and obtain ∂R_i(w)/∂w using the chain rule over the hidden layers.
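For the chain model and the CML criterion, these output errors take a familiar form: gold-label indicators minus model marginals, computed with a forward-backward pass. The sketch below is ours (local energies only, transitions assumed shared over time); the errors for the transition outputs are obtained analogously from pairwise marginals.

```python
import numpy as np

def lse(a, axis=None):
    return np.logaddexp.reduce(a, axis=axis)

def cml_output_errors(E_loc, E_tra, y):
    """Backpropagation errors dR/dE for the local energy outputs under the CML
    criterion (Eq. 10): indicator of the gold label minus the model marginal.
    E_loc: (T, L) local energies, E_tra: (L, L) shared transitions, y: gold labels."""
    T, L = E_loc.shape
    log_a = np.zeros((T, L)); log_b = np.zeros((T, L))
    log_a[0] = -E_loc[0]
    for t in range(1, T):
        log_a[t] = -E_loc[t] + lse(log_a[t - 1][:, None] - E_tra, axis=0)
    for t in range(T - 2, -1, -1):
        log_b[t] = lse(-E_tra - E_loc[t + 1][None, :] + log_b[t + 1][None, :], axis=1)
    log_Z = lse(log_a[-1])
    marg = np.exp(log_a + log_b - log_Z)           # p(y_t = l | x)
    err = -marg
    err[np.arange(T), y] += 1.0                    # plus the indicator of the gold label
    return err                                     # fed to backpropagation, cf. Eq. (15)

rng = np.random.default_rng(0)
err = cml_output_errors(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)),
                        np.array([0, 2, 1, 1, 0]))
print(err.sum(axis=1))   # each row sums to 0: indicator mass minus marginal mass
```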

3.2.3 REGULARIZATION AND SEMI-SUPERVISED LEARNING

In our implementation, we used the initial solution for building a quadratic regularization term of the form:

    Ω(w) = (1/2) ||w - w_0||²                                                              (16)

The idea is that, since the deep-part of a NeuroCRF is initialized in an unsupervised manner, we may expect that using this solution for regularization will avoid overfitting during fine tuning (we found that this gives better experimental results than using standard regularization towards 0). Since the CRF-part is not initialized with a generative model, we regularize this part towards 0, both for initialization and fine-tuning.

Note that NeuroCRFs permit semi-supervised learning in a very natural way. Indeed, one may easily use unlabeled data for initializing the deep-part while labeled data are used in fine tuning only. One can expect that a good initialization of the deep-part will improve the global performance of a NeuroCRF, since the deep-part plays the important role of finding a relevant high-level representation of the input. This is a perspective of our work that we have not explored yet.
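A minimal sketch of this regularizer and its gradient contribution, assuming the parameters are split into a deep-part (pulled toward its pretrained value w_0) and a CRF-part (pulled toward 0), as described above; all names are illustrative.

```python
import numpy as np

def regularizer_and_grad(w_deep, w_crf, w0_deep, lam):
    """Quadratic regularizer of Eq. (16), applied as described in the text:
    the deep-part is pulled toward its pretrained solution w0, while the
    CRF-part (not initialized generatively) is pulled toward 0."""
    omega = 0.5 * np.sum((w_deep - w0_deep) ** 2) + 0.5 * np.sum(w_crf ** 2)
    grad_deep = lam * (w_deep - w0_deep)   # added to the data-fitting gradient
    grad_crf = lam * w_crf
    return lam * omega, grad_deep, grad_crf

# illustrative usage with flattened parameter vectors
rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)
obj, g_deep, g_crf = regularizer_and_grad(w0 + 0.1, rng.normal(size=200), w0, lam=0.01)
```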

4 EXPERIMENTS

We performed experiments on two sequence labeling tasks with two well-known datasets. We first investigate the behaviour of NeuroCRFs in a first series of experiments on Optical Character Recognition with the OCR dataset (Kassel, 1995). Then we report comparative experimental results of NeuroCRFs and state of the art methods on the more complex task of automatic speech recognition using the TIMIT dataset (Lamel et al., 1986). In both cases we replicated experimental settings of previous works in order to get a fair comparison, building on the compilation of Ben Taskar⁵ for the OCR dataset, and using standard partitioning of the data and standard preprocessing for the TIMIT corpus. We use linear chain NeuroCRFs for both tasks.

In fine-tuning we use a variant of our batch optimizer, the non-convex regularized bundle method (Do & Artières, 2009), rather than a stochastic gradient procedure. As usual in bundle methods, our stopping condition is based on the gap (G) between the best observed objective function value (B) and the minimum of the approximation. We stop when the ratio G/B is below 1%. This is a good trade-off for reaching low error rates with a limited number of iterations (100 for the OCR experiments, 150 for the ASR experiments).

⁵http://ai.stanford.edu/~btaskar/ocr/

4.1 OPTICAL CHARACTER RECOGNITION

The OCR dataset consists of 6877 words which correspond to roughly 52K characters (Kassel, 1995; Taskar et al., 2004). OCR data are sequences of isolated characters (each represented as a binary vector of dimension 128) belonging to 26 classes. The dataset is divided into 10 folds for cross validation. We investigated two settings: using a large training set by training on 9 folds and testing on 1 fold (the large setting), and using a small one by training on 1 fold and testing on 9 folds (the small setting). Note that OCR results are cross-validation results; we do not use any extra validation set to set λ.

We learned NeuroCRFs with one or two hidden layers. Transition energy outputs have only one connection to a bias unit, meaning that we do not use any input information for building transition energies. We learned standard RBMs for initializing the deep part of NeuroCRFs, with 50 iterations through the training set. Learning is performed using one-step Contrastive Divergence.

Influence of network architecture. Figure 4 reports error rates obtained in the small setting with NeuroCRFs with one or two hidden layers of varying size. As can be seen, increasing the size of the hidden layers improves performance for both one hidden layer and two hidden layer NeuroCRFs. Also, two hidden layer architectures systematically outperform single hidden layer architectures. Note that whatever the number of hidden layers, performance reaches a plateau when increasing the hidden layers' size. However, the plateau is lower and reached faster for the two hidden layer architecture. These results suggest that increasing both the size of hidden layers and the number of hidden layers may significantly improve performance.

Figure 4: Influence of NN architecture on OCR dataset (small training set); error rate as a function of the number of hidden units per layer (50 to 500), for one and two hidden layers.

Accuracy. We compared the performance of two variants of NeuroCRFs, one trained with conditional maximum likelihood (CML) and the other trained with the large margin criterion (LM), with state of the art methods: linear and cubic M3Ns, linear CRFs, and conditional neural fields (CNFs) with one hidden layer. NeuroCRFs have 2 hidden layers of 200 units each. Table 1 reports cross validation error rates of these models for the small setting and the large setting. We also report the performance of initial solutions (i.e. before fine tuning) for NeuroCRFs (in brackets).

Table 1: Comparative error rates of NeuroCRF and state of the art methods on the OCR dataset with either a small or a large training set. Performance of NeuroCRF before fine tuning is indicated in brackets. Results of SVM cubic, M3N cubic and CNF come from (Taskar et al., 2004; Peng et al., 2009).

                        small              large
    CRF linear          0.2162             0.1420
    M3N linear          0.2113             0.1346
    SVM cubic           0.19               not available
    M3N cubic           0.13               not available
    CNF                 0.131              not available
    NeuroCRF (CML)      0.1080 (0.1224)    0.0444 (0.0697)
    NeuroCRF (LM)       0.1102 (0.1221)    0.0456 (0.0736)

NeuroCRFs significantly outperform all other methods, including M3N with a non-linear kernel (whose results are not reported for the large setting due to scalability). Also, looking at the performance of NeuroCRFs before fine tuning shows that initialization by RBMs and CRFs indeed produces a good starting point, but fine tuning is essential for obtaining optimal performance. Finally, one sees here that both NeuroCRF training criteria perform similarly, with a slight advantage of the conditional likelihood criterion over the large margin criterion. Surprisingly, we observed that the large margin criterion required more iterations than the conditional likelihood criterion⁶. In the following, we only consider NeuroCRFs trained with the CML criterion.

⁶Actually we did not succeed at optimizing the large margin criterion. We conjecture that the parameter search space might be more complex in this case.

Note that (Perez-Cruz & Pontil, 2007) address structured prediction in a different way and are able to reach an error rate of 0.125 in the small setting and 0.031 in the large setting (using an RBF kernel). Their approach considers an approximated problem that can be transformed into many multiclass SVM problems. Nevertheless, the non-linear SVM problems still scale quadratically with the problem size. Such a scaling prevents working with large data sets with millions of tokens (such as the ASR experiments in the next section).

Training time. We investigated the learning time of NeuroCRFs and their scalability. We performed experiments on the OCR dataset both for the small and large settings. Training time decomposes into RBM learning, linear CRF initialization, and fine tuning of the whole model. Roughly speaking, RBM learning and fine tuning take about 45% of the time each, while CRF initialization takes about 10%. More importantly, overall training time in the large setting took about 10 times the training time in the small setting while the training dataset is 9 times bigger, which suggests a quasi-linear scaling with the problem size.

4.2 AUTOMATIC SPEECH RECOGNITION

We performed ASR experiments on the TIMIT dataset (Lamel et al., 1986) with the standard train-test partitioning. The wave signal was preprocessed using the procedure described in (Sha & Saul, 2007), except that we do not use whitening by PCA. The 39-dimensional MFCCs are simply normalized to have zero mean and unit variance. There are roughly 1.1 million frames in the training set, and 120K frames and 57K frames respectively in the development and test sets. We used 2-layered NeuroCRFs trained with the CML criterion.

Handling continuous features. Note here that the inputs (real valued vectors of MFCC coefficients) are continuous, while RBMs originally use binary logistic units for both visible and hidden variables. We used an extension of RBMs for dealing with continuous variables that have Gaussian noise (in our implementation we consider a Gaussian noise with standard deviation 0.2) (Taylor et al., 2007). This Gaussian-binary RBM was trained for 100 passes through the training data of 1.1M frames, using one-step Contrastive Divergence. Once this first RBM is trained, we forward the input to the hidden layer and obtain a binary-logistic representation of the speech data, which is the input (i.e. visible data) for learning a second, binary RBM. Since binary RBMs converge much faster, we performed only 10 learning iterations through the training data for the second layer. The remaining initialization is performed as in Section 3.2.1.
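As an illustration of the first-layer pretraining on continuous inputs, here is a CD-1 style update for an RBM with Gaussian visible units and binary hidden units; this is a sketch in the spirit of (Taylor et al., 2007) rather than the authors' exact recipe, and the learning rate, sizes, and noise handling are our assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_binary_cd1(v0, W, b_vis, b_hid, lr=0.001, noise_std=0.2,
                        rng=np.random.default_rng(0)):
    """One CD-1 step for an RBM with real-valued (Gaussian) visible units and
    binary hidden units, sketched for normalized MFCC frames; the exact update
    rules may differ from the authors' implementation."""
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Gaussian visible reconstruction: mean plus noise of fixed standard deviation
    v1 = h0 @ W.T + b_vis + noise_std * rng.normal(size=v0.shape)
    p_h1 = sigmoid(v1 @ W + b_hid)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# illustrative: 39-dimensional normalized MFCC frames, 500 hidden units
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(39, 500))
b_vis, b_hid = np.zeros(39), np.zeros(500)
frames = rng.normal(size=(64, 39))
W, b_vis, b_hid = gaussian_binary_cd1(frames, W, b_vis, b_hid)
```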
Results. Table 2 reports the phone error rates for CDHMMs and NeuroCRFs with increasing complexity (number of Gaussians in CDHMMs or number of hidden units in NeuroCRFs). We compared NeuroCRFs with non discriminant CDHMMs (i.e. Maximum Likelihood) and with state of the art approaches for learning CDHMMs with a discriminant criterion: Maximum Conditional Likelihood (CML) (Woodland & Povey, 2002), Minimum Classification Error (MCE) (Juang & Katagiri, 1992), Large margin (LM) (Sha & Saul, 2007), and Perceptron learning (PT) (Cheng et al., 2009) (note that all results come from a compilation in (Sha & Saul, 2007) and from (Cheng et al., 2009)).

Table 2: Comparative phone recognition error rate on the TIMIT dataset for discriminant and non discriminant HMM systems and for two hidden layer NeuroCRFs (of 500 or 1000 hidden units each) trained with CML.

    CDHMM             ML     CML    MCE    PT     LM
    1 Gaussian        40.1   36.4   35.2   35.6   31.2
    2 Gaussians       36.5   34.6   33.2   34.5   30.8
    4 Gaussians       34.7   32.8   31.2   32.4   29.8
    8 Gaussians       32.7   31.5   31.9   30.9   28.2

    NeuroCRF (CML)
    500x500           29.6
    1000x1000         29.1

These results call for a few comments. First, increasing the hidden layers' size improves the NeuroCRF error rate. Unfortunately we do not know if it still improves when using larger hidden layers (due to lack of time), but one can reasonably expect that even better results may be reached by using larger hidden layers and/or adding hidden layers. Second, NeuroCRFs outperform all other discriminant and non discriminant methods except the large margin training of (Sha & Saul, 2007) when using up to 8 Gaussian distributions per state. While this may not look like an impressive result at first glance, we claim this result to be very promising. Indeed, all other systems in Table 2 rely on the learning of a preliminary CDHMM system, which is then used as initialization and/or for regularization. Hence all these systems integrate prior information from decades of research on how to learn and tune a non discriminant CDHMM for speech. In contrast, NeuroCRFs are trained from scratch with an unsupervised initialization and a supervised fine tuning; they require no prior information.

Note that we did not compare NeuroCRFs to multiple states per phone CDHMM systems as traditionally used in ASR, although such systems may reach better performance (especially for ML CDHMMs). The reason is that the comparison would not have been as fair. Indeed, one may imagine extending this work in order to have multiple nodes per phone in a NeuroCRF, and one may expect this to provide even better results, which would be more directly comparable with multiple states per phone HMM systems.

5 CONCLUSION

We presented a model combining CRFs and deep NNs, aiming at taking advantage of both the ability of deep networks to extract high level features and the discriminant power of CRFs for sequence labeling tasks. Results on OCR data show significant improvement over state of the art methods and demonstrate the relevance of the combination. On the larger scale speech recognition task our systems outperform most state of the art discriminant systems without relying on any prior, in contrast to all other systems, which rely on an initial solution obtained with a non discriminant criterion.


Acknowledgements

The authors acknowledge the support of the PASCAL 2 EU Network of Excellence.

References

Altun, Y., Johnson, M., & Hofmann, T. (2003). Investigating loss functions and optimization methods for discriminative learning of label sequences. EMNLP.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks (Technical Report 1282). Université de Montréal.

Bengio, Y., & LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large scale kernel machines. Cambridge, MA: MIT Press.

Bottou, L., Bengio, Y., & LeCun, Y. (1997). Global training of document processing systems using graph transformer networks. Proc. of Computer Vision and Pattern Recognition (pp. 490–494). Puerto Rico: IEEE.

Cheng, C.-C., Sha, F., & Saul, L. K. (2009). Matrix updates for perceptron training of continuous density hidden Markov models. ICML (pp. 153–160).

Collins, M. (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. EMNLP (pp. 1–8).

Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006). Trading convexity for scalability. ICML.

Do, T.-M.-T., & Artières, T. (2009). Large margin training for hidden Markov models with partially observed states. ICML (pp. 265–272). Omnipress.

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS.

Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. ICML (pp. 369–376).

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Juang, B., & Katagiri, S. (1992). Discriminative learning for minimum error classification. IEEE Trans. Signal Processing, 40(12).

Kassel, R. H. (1995). A comparison of approaches to on-line handwritten character recognition. Doctoral dissertation, Cambridge, MA, USA.

Lafferty, J. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML (pp. 282–289). Morgan Kaufmann.

Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: representation and clique selection. ICML.

Lamel, L., Kassel, R., & Seneff, S. (1986). Speech database development: Design and analysis of the acoustic-phonetic corpus. DARPA (pp. 100–110).

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86.

McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. ICML (pp. 591–598).

Peng, J., Bo, L., & Xu, J. (2009). Conditional neural fields. NIPS.

Perez-Cruz, F., Ghahramani, Z., & Pontil, M. (2007). Conditional graphical models. In G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, & S. V. N. Vishwanathan (Eds.), Predicting structured data. MIT Press.

Qi, Y., Kuksa, P. P., Collobert, R., Sadamasa, K., Kavukcuoglu, K., & Weston, J. (2009). Semi-supervised sequence labeling with self-learned features. ICDM '09. IEEE.

Rosenblatt, F. (1988). The perceptron: a probabilistic model for information storage and organization in the brain. In Neurocomputing: foundations of research, 89–114.

Sato, K., & Sakakibara, Y. (2005). RNA secondary structural alignment with conditional random fields. ECCB/JBI (p. 242).

Sha, F., & Saul, L. K. (2007). Large margin hidden Markov models for automatic speech recognition. NIPS 19 (pp. 1249–1256). MIT Press.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. NIPS 16. MIT Press.

Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2007). Modeling human motion using binary latent variables. NIPS (pp. 1345–1352). MIT Press.

Woodland, P., & Povey, D. (2002). Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16, 25–47.
