Efficient Hierarchical Entity Classifier Using Conditional Random Fields
Koen Deschacht and Marie-Francine Moens
Interdisciplinary Centre for Law & IT, Katholieke Universiteit Leuven
Tiensestraat 41, 3000 Leuven, Belgium
[email protected], [email protected]

Abstract

In this paper we develop an automatic classifier for a very large set of labels, the WordNet synsets. We employ Conditional Random Fields (CRFs) because of their flexibility to include a wide variety of non-independent features. Training CRFs on a large number of labels proved to be a problem because of the high training cost. By taking into account the hypernym/hyponym relation between synsets in WordNet, we reduced the complexity of training from O(TM^2NG) to O(T(log M)^2NG) with only a limited loss in accuracy.

1 Introduction

The work described in this paper was carried out during the CLASS project (http://class.inrialpes.fr/). The central objective of this project is to develop advanced learning methods that allow images, video and associated text to be analyzed and structured automatically. One of the goals of the project is the alignment of visual and textual information. We will, for example, learn the correspondence between faces in an image and persons described in the surrounding text. The role of the authors in the CLASS project mainly concerns information extraction from text.

In the first phase of the project we built a classifier for the automatic identification and categorization of entities in texts, which we report on here. This classifier extracts entities from text and assigns each entity a label chosen from an inventory of possible labels. This task is closely related to both named entity recognition (NER), which traditionally assigns nouns to a small number of categories, and word sense disambiguation (Agirre and Rigau, 1996; Yarowsky, 1995), where the sense for a word is chosen from a much larger inventory of word senses.

We will employ a probabilistic model that has been used successfully in NER (Conditional Random Fields) and combine it with an extensive inventory of word senses (the WordNet lexical database) to perform entity detection.

In section 2 we describe WordNet and its use for entity categorization. Section 3 gives an overview of Conditional Random Fields and section 4 explains how the parameters of this model are estimated during training. We drastically reduce the computational complexity of training in section 5. Section 6 describes the implementation of this method, section 7 the obtained results and finally section 8 future work.

2 WordNet

WordNet (Fellbaum et al., 1998) is a lexical database whose design is inspired by psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized in synsets. A synset is a collection of words that have a close meaning and that represent an underlying concept. An example of such a synset is "person, individual, someone, somebody, mortal, soul": all these words refer to a human being. WordNet (v2.1) contains 155,327 words, which are organized in 117,597 synsets. WordNet defines a number of relations between synsets. For nouns the most important relation is the hypernym/hyponym relation. A noun X is a hypernym of a noun Y if Y is a subtype or instance of X. For example, "bird" is a hypernym of "penguin" (and "penguin" is a hyponym of "bird"). This relation organizes the synsets in a hierarchical tree (Hayes, 1999), of which a fragment is pictured in fig. 1. This tree has a depth of 18 levels and a maximum width of 17,837 synsets (fig. 2).

[Figure 1: Fragment of the hypernym/hyponym tree]
[Figure 2: Number of synsets per level in WordNet]
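As a concrete illustration of the hypernym/hyponym hierarchy, the short Python sketch below queries WordNet through NLTK's interface. This is an illustrative assumption only: the system described in this paper works with WordNet 2.1 directly, while NLTK ships a later WordNet release, so exact synset names may differ.

    from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

    # First nominal sense of "penguin"
    penguin = wn.synsets('penguin', pos=wn.NOUN)[0]
    print(penguin.lemma_names())    # the words grouped in this synset
    print(penguin.hypernyms())      # direct hypernym synset(s)

    # Walk the hypernym relation upwards to the root of the noun hierarchy,
    # i.e. one root-to-leaf path of the tree sketched in fig. 1
    for path in penguin.hypernym_paths():
        print(' -> '.join(s.name() for s in path))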
We will build a classifier using CRFs that tags noun phrases in a text with their WordNet synset. This will enable us to recognize entities and to classify them into certain groups. Moreover, it allows learning the context pattern of a certain meaning of a word. Take for example the sentence "The ambulance took the remains of the bomber to the morgue." Having every noun phrase tagged with its WordNet synset reveals that in this sentence "bomber" is "a person who plants bombs" (and not "a military aircraft that drops bombs during flight"). Using the hypernym/hyponym relations from WordNet, we can also easily find out that "ambulance" is a kind of "car", which in turn is a kind of "conveyance, transport", which in turn is a "physical object".

3 Conditional Random Fields

Conditional Random Fields (CRFs) (Lafferty et al., 2001; Jordan, 1999; Wallach, 2004) are a statistical method based on undirected graphical models. Let X be a random variable over data sequences to be labeled and Y a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet K. In this paper X ranges over the sentences of a text, tagged with POS labels, and Y ranges over the synsets to be recognized in these sentences.

We define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G (e.g., in a first-order model the transition probability depends only on the neighboring state), then the model (Y, X) is a Conditional Random Field. Although the structure of the graph G may be arbitrary, we limit the discussion here to graph structures in which the nodes corresponding to the elements of Y form a simple first-order Markov chain.

A CRF defines a conditional probability distribution p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length and use x = (x_1, ..., x_T) and y = (y_1, ..., y_T) for an input sequence and a label sequence respectively. Instead of defining a joint distribution over both label and observation sequences, the model defines a conditional probability over labeled sequences. A novel observation sequence x is labeled with y so that the conditional probability p(y|x) is maximized.

We define a set of K binary-valued features or feature functions f_k(y_{t-1}, y_t, x) that each express some characteristic of the empirical distribution of the training data that should also hold in the model distribution. An example of such a feature is

    f_k(y_{t-1}, y_t, x) = \begin{cases} 1 & \text{if } x \text{ has POS `NN' and } y_t \text{ is concept `entity'} \\ 0 & \text{otherwise} \end{cases}    (1)
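Equation (1) can be read as a small predicate over the current label, the previous label and the observed sentence. The following sketch is one way to write such a feature in Python; the helper name, the explicit position argument t and the representation of x as a list of (word, POS) pairs are illustrative assumptions, not the paper's implementation.

    def make_pos_concept_feature(pos_tag, concept):
        """Build a binary feature in the spirit of eq. (1): it returns 1 when
        token t carries the given POS tag and the current label is the given
        concept, and 0 otherwise."""
        def feature(y_prev, y_cur, x, t):
            _, pos = x[t]
            return 1 if pos == pos_tag and y_cur == concept else 0
        return feature

    # The feature of equation (1): POS 'NN' combined with the concept 'entity'
    f_entity_nn = make_pos_concept_feature('NN', 'entity')

    x = [('The', 'DT'), ('ambulance', 'NN'), ('arrived', 'VBD')]
    print(f_entity_nn(None, 'entity', x, 1))   # -> 1
    print(f_entity_nn(None, 'entity', x, 2))   # -> 0 (POS is 'VBD')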
Feature functions can depend on the previous (y_{t-1}) and the current (y_t) state. Considering K feature functions, the conditional probability distribution defined by the CRF is

    p(y|x) = \frac{1}{Z(x)} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x) \right\}    (2)

where λ_k is a parameter to model the observed statistics and Z(x) is a normalizing constant computed as

    Z(x) = \sum_{y \in Y} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x) \right\}

This method can be thought of as a generalization of both the Maximum Entropy Markov Model (MEMM) and the Hidden Markov Model (HMM). It brings together the best of discriminative and generative models: (1) it can accommodate many statistically correlated features of the inputs, in contrast with generative models, which often require conditional independence assumptions in order to make the computations tractable, and (2) it has the possibility of context-dependent learning by trading off decisions at different sequence positions to obtain a globally optimal labeling. Because CRFs adhere to the maximum entropy principle, they offer a valid solution when learning from incomplete information. Given that in information extraction tasks we often lack an annotated training set that covers all possible extraction patterns, this is a valuable asset.

Lafferty et al. (2001) have shown that CRFs outperform both MEMMs and HMMs on synthetic data and on a part-of-speech tagging task. Furthermore, CRFs have been used successfully in information extraction (Peng and McCallum, 2004), named entity recognition (Li and McCallum, 2003; McCallum and Li, 2003) and sentence parsing (Sha and Pereira, 2003).

4 Parameter estimation

The parameters λ_k are estimated by maximizing the conditional log-likelihood l(θ) of the N training sequences, given in equation (3). If we substitute the model (2) in the log-likelihood (3), we get the following expression:

    l(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}^{(i)}, y_t^{(i)}, x^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)})

The function l(θ) cannot be maximized in closed form, so numerical optimization is used. The partial derivatives are:

    \frac{\partial l(\theta)}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_t^{(i)}, y_{t-1}^{(i)}, x^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y', y, x^{(i)}) \, p(y', y \mid x^{(i)})    (4)

Using these derivatives, we can iteratively adjust the parameters θ (with Limited-Memory BFGS (Byrd et al., 1994)) until l(θ) has reached an optimum. During each iteration we have to calculate p(y', y | x^{(i)}). This can be done, as for the Hidden Markov Model, using the forward-backward algorithm (Baum and Petrie, 1966; Forney, 1996). This algorithm has a computational complexity of O(TM^2), where T is the length of the sequence and M is the number of labels.
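To see where the O(TM^2) term comes from, the sketch below implements a plain forward-backward pass for a linear-chain model and returns both Z(x) and the pairwise marginals p(y_{t-1}, y_t | x) used in (4). It is an illustration under simplifying assumptions (a dummy start label at index 0, precomputed potential matrices, no log-space scaling), not the authors' implementation; a practical version would work in log space or rescale to avoid overflow.

    import numpy as np

    def forward_backward(psi):
        """psi has shape (T, M, M); psi[t, i, j] is the potential
        exp(sum_k lambda_k f_k(i, j, x)) of label i at position t-1 followed by
        label j at position t (row psi[0, 0, :] holds the dummy start label).
        Returns Z(x) and the pairwise marginals p(y_{t-1}=i, y_t=j | x)."""
        T, M, _ = psi.shape
        alpha = np.zeros((T, M))                 # forward scores
        beta = np.zeros((T, M))                  # backward scores
        alpha[0] = psi[0, 0]                     # transitions out of the start label
        for t in range(1, T):
            alpha[t] = alpha[t - 1] @ psi[t]     # vector-matrix product: O(M^2)
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = psi[t + 1] @ beta[t + 1]   # O(M^2) per position
        Z = alpha[T - 1].sum()                   # normalizing constant Z(x)
        marginals = np.zeros((T, M, M))
        for t in range(1, T):                    # pairwise marginals needed in eq. (4)
            marginals[t] = np.outer(alpha[t - 1], beta[t]) * psi[t] / Z
        return Z, marginals

Running the T forward and T backward steps therefore costs O(TM^2) per training sequence; this is the term that section 5 reduces to O(T(log M)^2) by exploiting the hypernym/hyponym hierarchy of WordNet.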