Efficient Hierarchical Entity Classifier Using Conditional Random Fields

Koen Deschacht, Marie-Francine Moens
Interdisciplinary Centre for Law & IT
Katholieke Universiteit Leuven
Tiensestraat 41, 3000 Leuven, Belgium
[email protected], [email protected]

Abstract

In this paper we develop an automatic classifier for a very large set of labels, the WordNet synsets. We employ Conditional Random Fields (CRFs) because of their flexibility to include a wide variety of non-independent features. Training CRFs on a large number of labels proved a problem because of the large training cost. By taking into account the hypernym/hyponym relation between synsets in WordNet, we reduced the complexity of training from O(TM²NG) to O(T(log M)²NG) with only a limited loss in accuracy.

1 Introduction

The work described in this paper was carried out during the CLASS project¹. The central objective of this project is to develop advanced learning methods that allow images, video and associated text to be analyzed and structured automatically. One of the goals of the project is the alignment of visual and textual information. We will, for example, learn the correspondence between faces in an image and persons described in the surrounding text. The role of the authors in the CLASS project is mainly on information extraction from text.

In the first phase of the project we build a classifier for automatic identification and categorization of entities in texts, which we report on here. This classifier extracts entities from text and assigns a label to these entities, chosen from an inventory of possible labels. This task is closely related to both named entity recognition (NER), which traditionally assigns nouns to a small number of categories, and word sense disambiguation (Agirre and Rigau, 1996; Yarowsky, 1995), where the sense for a word is chosen from a much larger inventory of word senses.

We will employ a probabilistic model that has been used successfully in NER (Conditional Random Fields) and use it with an extensive inventory of word senses (the WordNet lexical database) to perform entity detection.

In section 2 we describe WordNet and its use for entity categorization. Section 3 gives an overview of Conditional Random Fields and section 4 explains how the parameters of this model are estimated during training. We will drastically reduce the computational complexity of training in section 5. Section 6 describes the implementation of this method, section 7 the obtained results and finally section 8 future work.

¹ http://class.inrialpes.fr/

2 WordNet

WordNet (Fellbaum et al., 1998) is a lexical database whose design is inspired by psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized in synsets. A synset is a collection of words that have a close meaning and that represent an underlying concept. An example of such a synset is "person, individual, someone, somebody, mortal, soul". All these words refer to a human being.

WordNet (v2.1) contains 155,327 words, which are organized in 117,597 synsets. WordNet defines a number of relations between synsets. For nouns the most important relation is the hypernym/hyponym relation. A noun X is a hypernym of a noun Y if Y is a subtype or instance of X. For example, "bird" is a hypernym of "penguin" (and "penguin" is a hyponym of "bird"). This relation organizes the synsets in a hierarchical tree (Hayes, 1999), of which a fragment is pictured in fig. 1.
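To make the hypernym hierarchy concrete, here is a minimal Java sketch (Java being the language of the tools used later in section 6) of walking a hypernym chain. The `Synset` class and the hand-built fragment of the hierarchy are illustrative stand-ins, not the JWordNet API or the exact WordNet chain.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a WordNet synset: a set of words plus a single
// hypernym pointer (the noun hierarchy is treated as a tree here).
class Synset {
    final List<String> words;
    final Synset hypernym;              // null for the root concept

    Synset(Synset hypernym, String... words) {
        this.hypernym = hypernym;
        this.words = List.of(words);
    }

    // Collect this synset and all of its ancestors, up to the root.
    List<Synset> hypernymChain() {
        List<Synset> chain = new ArrayList<>();
        for (Synset s = this; s != null; s = s.hypernym) chain.add(s);
        return chain;
    }
}

public class HypernymDemo {
    public static void main(String[] args) {
        // A hand-built fragment of the hierarchy (illustrative only).
        Synset entity    = new Synset(null, "entity");
        Synset object    = new Synset(entity, "physical object");
        Synset transport = new Synset(object, "conveyance", "transport");
        Synset car       = new Synset(transport, "car", "auto");
        Synset ambulance = new Synset(car, "ambulance");

        // Prints the chain from "ambulance" up to the root concept.
        for (Synset s : ambulance.hypernymChain()) {
            System.out.println(String.join("/", s.words));
        }
    }
}
```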

[Figure 1: Fragment of the hypernym/hyponym tree]

[Figure 2: Number of synsets per level in WordNet]

This tree has a depth of 18 levels and a maximum width of 17837 synsets (fig. 2).

We will build a classifier using CRFs that tags noun phrases in a text with their WordNet synset. This will enable us to recognize entities and to classify the entities in certain groups. Moreover, it allows learning the context pattern of a certain meaning of a word. Take for example the sentence "The ambulance took the remains of the bomber to the morgue." Having every noun phrase tagged with its WordNet synset reveals that in this sentence "bomber" is "a person who plants bombs" (and not "a military aircraft that drops bombs during flight"). Using the hypernym/hyponym relations from WordNet, we can also easily find out that "ambulance" is a kind of "car", which in turn is a kind of "conveyance, transport", which in turn is a "physical object".

3 Conditional Random Fields

Conditional random fields (CRFs) (Lafferty et al., 2001; Jordan, 1999; Wallach, 2004) are a statistical method based on undirected graphical models. Let X be a random variable over data sequences to be labeled and Y a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet K. In this paper X ranges over the sentences of a text, tagged with POS labels, and Y ranges over the synsets to be recognized in these sentences.

We define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G (e.g., in a first-order model the transition probability depends only on the neighboring state), then the model (Y, X) is a Conditional Random Field. Although the structure of the graph G may be arbitrary, we limit the discussion here to graph structures in which the nodes corresponding to elements of Y form a simple first-order Markov chain.

A CRF defines a conditional probability distribution p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length and use x = (x_1, ..., x_T) and y = (y_1, ..., y_T) for an input sequence and label sequence respectively. Instead of defining a joint distribution over both label and observation sequences, the model defines a conditional probability over labeled sequences. A novel observation sequence x is labeled with y so that the conditional probability p(y|x) is maximized.

We define a set of K binary-valued features or feature functions f_k(y_{t-1}, y_t, x) that each express some characteristic of the empirical distribution of the training data that should also hold in the model distribution. An example of such a feature is

f_k(y_{t-1}, y_t, x) = \begin{cases} 1 & \text{if } x \text{ has POS `NN' and } y_t \text{ is concept `entity'} \\ 0 & \text{otherwise} \end{cases}    (1)

Feature functions can depend on the previous (y_{t-1}) and the current (y_t) state. Considering K feature functions, the conditional probability distribution defined by the CRF is

p(y|x) = \frac{1}{Z(x)} \exp\Big\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x) \Big\}    (2)

where λ_k is a parameter to model the observed statistics and Z(x) is a normalizing constant computed as

Z(x) = \sum_{y \in Y} \exp\Big\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x) \Big\}
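A minimal sketch of how the feature functions of eq. (1) and the per-position factor exp(Σ_k λ_k f_k(y_{t-1}, y_t, x)) of eq. (2) could be evaluated; the features and weights below are invented for illustration and are not the feature set used in the paper.

```java
import java.util.List;

// A binary feature function f_k(y_{t-1}, y_t, x) over a POS-tagged sentence.
interface Feature {
    double value(String prevLabel, String label, List<String> posTags, int t);
}

public class CrfFactorDemo {
    public static void main(String[] args) {
        // Example feature in the spirit of eq. (1): POS is NN and label is "entity".
        Feature f1 = (prev, y, pos, t) ->
                pos.get(t).equals("NN") && y.equals("entity") ? 1.0 : 0.0;
        // A transition feature: previous label NONE, current label "entity".
        Feature f2 = (prev, y, pos, t) ->
                "NONE".equals(prev) && y.equals("entity") ? 1.0 : 0.0;

        Feature[] features = {f1, f2};
        double[] lambda = {0.8, 0.3};            // invented weights

        List<String> posTags = List.of("DT", "NN", "VBD");

        // Unnormalised factor exp(sum_k lambda_k f_k(...)) for position t = 1.
        double score = 0.0;
        for (int k = 0; k < features.length; k++) {
            score += lambda[k] * features[k].value("NONE", "entity", posTags, 1);
        }
        System.out.println("factor = " + Math.exp(score));   // exp(0.8 + 0.3)
    }
}
```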

This method can be thought of as a generalization of both the Maximum Entropy Markov model (MEMM) and the Hidden Markov model (HMM). It brings together the best of discriminative models and generative models: (1) it can accommodate many statistically correlated features of the inputs, in contrast with generative models, which often require conditional independence assumptions in order to make the computations tractable, and (2) it has the possibility of context-dependent learning by trading off decisions at different sequence positions to obtain a globally optimal labeling. Because CRFs adhere to the maximum entropy principle, they offer a valid solution when learning from incomplete information. Given that in information extraction tasks we often lack an annotated training set that covers all possible extraction patterns, this is a valuable asset.

Lafferty et al. (2001) have shown that CRFs outperform both MEMM and HMM on synthetic data and on a part-of-speech tagging task. Furthermore, CRFs have been used successfully in information extraction (Peng and McCallum, 2004), named entity recognition (Li and McCallum, 2003; McCallum and Li, 2003) and sentence parsing (Sha and Pereira, 2003).

4 Parameter estimation

In this section we explain in some detail how to derive the parameters θ = {λ_k}, given the training data. The problem can be considered as a constrained optimization problem, where we have to find a set of parameters which maximizes the log likelihood of the conditional distribution (McCallum, 2003). We are confronted with the problem of efficiently calculating the expectation of each feature function with respect to the CRF model distribution for every observation sequence x in the training data. Formally, we are given a set of N training examples D = {(x^{(i)}, y^{(i)})}_{i=1}^{N}, where each x^{(i)} = (x_1^{(i)}, x_2^{(i)}, ..., x_T^{(i)}) is a sequence of inputs and y^{(i)} = (y_1^{(i)}, y_2^{(i)}, ..., y_T^{(i)}) is a sequence of the desired labels. We estimate the parameters by penalized maximum likelihood, optimizing the function

l(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} | x^{(i)})    (3)

After substituting the CRF model (2) into the likelihood (3), we get the following expression:

l(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}^{(i)}, y_t^{(i)}, x^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)})

The function l(θ) cannot be maximized in closed form, so numerical optimization is used. The partial derivatives are:

\frac{\partial l(\theta)}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_{t-1}^{(i)}, y_t^{(i)}, x^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y', y, x^{(i)}) \, p(y', y | x^{(i)})    (4)

Using these derivatives, we can iteratively adjust the parameters θ (with Limited-Memory BFGS (Byrd et al., 1994)) until l(θ) has reached an optimum. During each iteration we have to calculate p(y', y | x^{(i)}). This can be done, as for the Hidden Markov Model, with the forward-backward algorithm (Baum and Petrie, 1966; Forney, 1996). This algorithm has a computational complexity of O(TM²), where T is the length of the sequence and M the number of labels. We have to execute the forward-backward algorithm once for every training instance during every iteration. The total cost of training a linear-chain CRF is thus

O(TM²NG)

where N is the number of training examples and G the number of iterations. We have experienced that this complexity is an important limiting factor when learning a large collection of labels: employing CRFs to learn the 95076 WordNet synsets with 20133 training examples was not feasible on current hardware. In the next section we describe the method we have implemented to drastically reduce this complexity.

5 Reducing complexity

In this section we show how we create groups of features for every label that enable an important reduction in the complexity of both labeling and training. We first discuss how these groups of features are created (section 5.1) and then how both labeling (section 5.2) and training (section 5.3) are performed using these groups.
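Before turning to the hierarchical grouping, the forward-backward computation that dominates the O(TM²NG) training cost of section 4 can be sketched as follows. The potentials Ψ_t(i, j) = exp(Σ_k λ_k f_k(i, j, x)) are assumed to be precomputed and the toy numbers in `main` are invented; the returned pairwise marginals correspond to the p(y', y | x^{(i)}) needed in eq. (4).

```java
public class ForwardBackwardDemo {

    // psi[t][i][j] = exp(sum_k lambda_k f_k(y_t=i, y_{t+1}=j, x)) for the
    // transition into position t+1; psi0[j] is the factor for position 0.
    // Returns pairwise marginals p(y_t=i, y_{t+1}=j | x); cost is O(T M^2).
    static double[][][] pairwiseMarginals(double[] psi0, double[][][] psi) {
        int T = psi.length + 1, M = psi0.length;

        double[][] alpha = new double[T][M];
        double[][] beta  = new double[T][M];
        alpha[0] = psi0.clone();
        for (int t = 1; t < T; t++)                       // forward pass
            for (int j = 0; j < M; j++)
                for (int i = 0; i < M; i++)
                    alpha[t][j] += alpha[t - 1][i] * psi[t - 1][i][j];

        java.util.Arrays.fill(beta[T - 1], 1.0);
        for (int t = T - 2; t >= 0; t--)                  // backward pass
            for (int i = 0; i < M; i++)
                for (int j = 0; j < M; j++)
                    beta[t][i] += psi[t][i][j] * beta[t + 1][j];

        double z = 0.0;                                   // normalizer Z(x)
        for (int j = 0; j < M; j++) z += alpha[T - 1][j];

        double[][][] marg = new double[T - 1][M][M];
        for (int t = 1; t < T; t++)
            for (int i = 0; i < M; i++)
                for (int j = 0; j < M; j++)
                    marg[t - 1][i][j] =
                        alpha[t - 1][i] * psi[t - 1][i][j] * beta[t][j] / z;
        return marg;
    }

    public static void main(String[] args) {
        // Toy example: T = 3 positions, M = 2 labels, arbitrary potentials.
        double[] psi0 = {1.0, 0.5};
        double[][][] psi = {{{1.0, 0.2}, {0.3, 1.5}}, {{0.7, 1.1}, {0.4, 0.9}}};
        double[][][] m = pairwiseMarginals(psi0, psi);
        System.out.println("p(y_0=0, y_1=1 | x) = " + m[0][0][1]);
    }
}
```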

[Figure 3: Fragment of the tree used for labeling]

5.1 Hierarchical feature selection

To reduce the complexity of CRFs, we assign a selection of features to every node in the hierarchical tree. As discussed in section 2, WordNet defines a relation between synsets which organises the synsets in a tree. In its current form this tree does not meet our needs: we need a tree where every label used for labeling corresponds to exactly one leaf node, and no label corresponds to a non-leaf node. We therefore modify the existing tree. We create a new top node ("top") and add the original tree as defined by WordNet as a subtree to this top node. We add leaf nodes corresponding to the labels "NONE", "ADJ", "ADV" and "VERB" to the top node, and for the other labels (the noun synsets) we add a leaf node to the node representing the corresponding synset. For example, we add a node corresponding to the label "ENTITY" to the node "entity". Fig. 3 pictures a fraction of this tree. Nodes corresponding to a label have an uppercase name, nodes not corresponding to a label have a lowercase name.

We use v to denote nodes of the tree. We call the top concept v^top; the concept v^+ is the parent of v, which in turn is the parent of v^-. We call A_v the collection of ancestors of a concept v, including v itself.

We will now show how we transform a regular CRF into a CRF that uses hierarchical feature selection. We first notice that we can rewrite eq. 2 as

p(y|x) = \frac{1}{Z(x)} \prod_{t=1}^{T} G(y_{t-1}, y_t, x)

with

G(y_{t-1}, y_t, x) = \exp\Big( \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x) \Big)

We rewrite this equation because it will enable us to reduce the complexity of CRFs, and because it has the property that p(y_t | y_{t-1}, x) ≈ G(y_{t-1}, y_t, x), which we will use in section 5.3.

We now define a collection of features F_v for every node v. If v is a leaf node, we define F_v as the collection of features f_k(y_{t-1}, y_t, x) for which it is possible to find a node v_{t-1} and input x for which f_k(v_{t-1}, v, x) ≠ 0. If v is a non-leaf node, we define F_v as the collection of features f_k(y_{t-1}, y_t, x) (1) which are elements of F_{v^-} for every child node v^- of v, and (2) for which it holds that for every two children v_1^- and v_2^- of v, every previous label v_{t-1} and every input x, f_k(v_{t-1}, v_1^-, x) = f_k(v_{t-1}, v_2^-, x).

Informally, F_v is the collection of features which are useful to evaluate for a certain node. For the leaf nodes, this is the collection of features that can possibly return a non-zero value. For non-leaf nodes, it is useful to evaluate features belonging to F_v when they have the same value for all the descendants of that node (which we can put to good use, see further).

We define F'_v = F_v \ F_{v^+}, where v^+ is the parent of label v. For the top node v^top we define F'_{v^top} = F_{v^top}. We also set

G'(y_{t-1}, y_t, x) = \exp\Big( \sum_{f_k \in F'_{y_t}} \lambda_k f_k(y_{t-1}, y_t, x) \Big)

We have now organised the collection of features in such a way that we can use the hierarchical relations defined by WordNet when determining the probability of a certain labeling y. We first see that

G(y_{t-1}, y_t, x) = \exp\Big( \sum_{f_k \in F_{y_t}} \lambda_k f_k(y_{t-1}, y_t, x) \Big)
                  = G(y_{t-1}, y_t^+, x) \, G'(y_{t-1}, y_t, x)
                  = \dots
                  = \prod_{v \in A_{y_t}} G'(y_{t-1}, v, x)

We can now determine the probability of a labeling y, given input x:

p(y|x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \prod_{v \in A_{y_t}} G'(y_{t-1}, v, x)    (5)

This formula has exactly the same result as eq. 2. Because we assigned a collection of features to every node, we can discard parts of the search space when searching for possible labelings, obtaining an important reduction in complexity. We elaborate this idea in the following sections for both labeling and training.
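A minimal sketch of the factorisation behind eq. (5): each node of the modified label tree carries only the weight mass of its own feature group F'_v (collapsed here into a single number per node for a fixed context), and the full factor for a leaf label is recovered as the product of G' over the leaf's ancestors. Node names and scores are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// A node in the modified label tree of section 5.1. Each node stores only the
// contribution of its own feature group F'_v, collapsed into one number for a
// fixed (y_{t-1}, x) context.
class LabelNode {
    final String name;
    final LabelNode parent;           // null for the "top" node
    final double ownScore;            // sum over f_k in F'_v of lambda_k f_k(...)
    final List<LabelNode> children = new ArrayList<>();

    LabelNode(String name, LabelNode parent, double ownScore) {
        this.name = name;
        this.parent = parent;
        this.ownScore = ownScore;
        if (parent != null) parent.children.add(this);
    }

    // G'(y_{t-1}, v, x) for this node alone.
    double gPrime() { return Math.exp(ownScore); }

    // G(y_{t-1}, v, x) = product of G' over all ancestors A_v, as in eq. (5).
    double g() {
        double prod = 1.0;
        for (LabelNode v = this; v != null; v = v.parent) prod *= v.gPrime();
        return prod;
    }
}

public class HierarchicalFactorDemo {
    public static void main(String[] args) {
        LabelNode top    = new LabelNode("top", null, 0.0);
        LabelNode entity = new LabelNode("entity", top, 0.4);
        LabelNode person = new LabelNode("PERSON", entity, 0.9);

        // exp(0.0) * exp(0.4) * exp(0.9) == exp(1.3): summing the per-node
        // feature groups along the ancestor path recovers the full factor.
        System.out.println(person.g() + " == " + Math.exp(1.3));
    }
}
```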

5.2 Labeling

The standard method to label a sentence with CRFs is by using the Viterbi algorithm (Forney, 1973; Viterbi, 1967), which has a computational complexity of O(TM²). The basic idea to reduce this computational complexity is to select the best labeling in a number of iterations. In the first iteration, we label every word in a sentence with a label chosen from the top-level labels. After choosing the best labeling, we refine our choice (choose a child label of the previously chosen label) in subsequent iterations, until we arrive at a synset which has no children. In every iteration we only have to choose from a very small number of labels, thus breaking down the problem of selecting the correct label from a large number of labels into a number of smaller problems.

Formally, when labeling a sentence we find the label sequence y such that y has the maximum probability of all labelings. We estimate the best labeling in an iterative way: we start with the best labeling y^{top-1} = {y_1^{top-1}, ..., y_T^{top-1}}, choosing each y_t^{top-1} only from the children of the top node. The probability of this labeling is

p(y^{top-1} | x) = \frac{1}{Z'(x)} \prod_{t=1}^{T} G'(y_{t-1}^{top-1}, y_t^{top-1}, x)

where Z'(x) is an appropriate normalizing constant. We now select a labeling y^{top-2} so that at every position t the node y_t^{top-2} is a child of y_t^{top-1}. The probability of this labeling is (following eq. 5)

p(y^{top-2} | x) = \frac{1}{Z'(x)} \prod_{t=1}^{T} \prod_{v \in A_{y_t^{top-2}}} G'(y_{t-1}, v, x)

After selecting a labeling y^{top-2} with maximum probability, we proceed by selecting a labeling y^{top-3} with maximum probability, etc. We continue in this way until we reach a labeling in which every y_t is a node that has no children, and we return this labeling as the final labeling.

The assumption we make here is that if a node v is selected at position t of the most probable labeling y^{top-s}, its children v^- have a larger probability of being selected at position t in the most probable labeling y^{top-s-1}. We reduce the number of labels we take into consideration by stating that for every concept v for which v ≠ y_t^{top-s}, we set G'(y_{t-1}, v_t^-, x) = 0 for every child v^- of v. This reduces the space of possible labelings drastically, reducing the computational complexity of the Viterbi algorithm. If q is the average number of children of a concept, the depth of the tree is log_q(M). On every level we have to execute the Viterbi algorithm for q labels, resulting in a total complexity of

O(T log_q(M) q²)    (6)

[Figure 4: Nodes that need to be taken into account during the forward-backward algorithm]

5.3 Training

We now discuss how we reduce the computational complexity of training. As explained in section 4, we have to estimate the parameters λ_k that optimize the function l(θ). We show here how we can reduce the computational complexity of the calculation of the partial derivatives ∂l(θ)/∂λ_k (eq. 4). The predominant factor with regard to the computational complexity in the evaluation of this equation is the calculation of p(y_{t-1}, y | x^{(i)}). Recall that we do this with the forward-backward algorithm, which has a computational complexity of O(TM²). We reduce the number of labels to improve performance. We do this by making the same assumption as in the previous section: for every concept v at level s for which v ≠ y_t^{top-s}, we set G'(y_{t-1}, v_t^-, x) = 0 for every child v^- of v. Since (as noted in sect. 5.2) p(v_t | y_{t-1}, x) ≈ G(y_{t-1}, v_t, x), this has the consequence that p(v_t | y_{t-1}, x) = 0 and that p(v_t, y_{t-1} | x) = 0. Fig. 4 gives a graphical representation of this reduction of the search space. The correct label here is "LABEL1"; the grey nodes have a non-zero p(v_t, y_{t-1} | x) and the white nodes have a zero p(v_t, y_{t-1} | x).

In the forward-backward algorithm we only have to take into account every node v that has a non-zero p(v, y_{t-1} | x). As can easily be seen from fig. 4, the number of such nodes is q log_q M, where q is the average number of children of a concept. The total complexity of running the forward-backward algorithm is therefore O(T(q log_q M)²). Since we have to run this algorithm once for every gradient computation for every training instance, we find the total training cost

O(T(q log_q M)² N G)    (7)

[Figure 5: Time needed for one training cycle]

[Figure 6: Time needed for labeling]

6 Implementation

To implement the described method we need two components: an interface to the WordNet database and an implementation of CRFs using the hierarchical model. JWordNet is a Java interface to WordNet developed by Oliver Steele (which can be found at http://jwn.sourceforge.net/). We used this interface to extract the WordNet hierarchy.

An implementation of CRFs using the hierarchical model was obtained by adapting the Mallet² package. The Mallet package (McCallum, 2002) is an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, and information extraction. It also offers an efficient implementation of CRFs. We adapted this implementation so that it creates hierarchical selections of features, which are then used for training and labeling.

We used the Semcor corpus (Fellbaum et al., 1998; Landes et al., 1998) for training. This corpus, created by Princeton University, is a subset of the English Brown corpus containing almost 700,000 words. Every sentence in the corpus is noun phrase chunked. The chunks are tagged by POS, and both noun and verb phrases are tagged with their WordNet sense. Since we do not want to learn a classification for verb synsets, we replace the tags of the verbs with one tag, "VERB".

² http://mallet.cs.umass.edu/

7 Results

The major goal of this paper was to build a classifier that could learn all the WordNet synsets in a reasonable amount of time. We first discuss the improvement in time needed for training and labeling, and then discuss accuracy.

We want to test the influence of the number of labels on the time needed for training. Therefore, we created different training sets, all of which had the same input (246 sentences tagged with POS labels) but a different number of labels. The first training set only had 5 labels ("ADJ", "ADV", "VERB", "entity" and "NONE"). The second had the same labels, except that we replaced the label "entity" with either "physical entity", "abstract entity" or "thing". We continued this procedure, replacing parent noun labels with their children (i.e. hyponyms) for subsequent training sets. We then trained both a CRF using hierarchical feature selection and a standard CRF on these training sets.

Fig. 5 shows the time needed for one iteration of training with different numbers of labels. We can see how the time needed for training increases slowly for the CRF using hierarchical feature selection, but increases fast when using a standard CRF. This conforms to eq. 7.

Fig. 6 shows the average time needed for labeling a sentence. Here again the time increases slowly for a CRF using hierarchical feature selection, but fast for a standard CRF, conforming to eq. 6.

Finally, fig. 7 shows the error rate (on the training data) after each training cycle. We see that a standard CRF and a CRF using hierarchical feature selection perform comparably. Note that fig. 7 gives the error rate on the training data, which can differ considerably from the error rate on unseen data.
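A back-of-the-envelope comparison of the standard cost O(TM²NG) against the reduced cost of eq. 7, using the label and corpus sizes mentioned in section 4 (M = 95076 synset labels, N = 20133 training sequences) and the 23 iterations of section 7; the average sentence length T and branching factor q are assumed values, not measurements from the paper.

```java
public class ComplexityDemo {
    public static void main(String[] args) {
        double M = 95076;        // number of WordNet synset labels (section 4)
        double N = 20133;        // number of training sequences (section 4)
        double G = 23;           // iterations, as in the experiment of section 7
        double T = 25;           // assumed average sentence length
        double q = 5;            // assumed average number of children per node

        double flat = T * M * M * N * G;                   // O(T M^2 N G)
        double logM = Math.log(M) / Math.log(q);           // log_q M
        double hier = T * Math.pow(q * logM, 2) * N * G;   // O(T (q log_q M)^2 N G)

        System.out.printf("standard:     %.2e basic operations%n", flat);
        System.out.printf("hierarchical: %.2e basic operations%n", hier);
        System.out.printf("speed-up factor: %.0f%n", flat / hier);
    }
}
```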

After these tests on a small section of the Semcor corpus, we trained a CRF using hierarchical feature selection on 7/8 of the full corpus. We trained for 23 iterations, which took approximately 102 hours. Testing the model on the remaining 1/8 of the corpus resulted in an accuracy of 77.82%. As reported in (McCarthy et al., 2004), a baseline approach that ignores context but simply assigns the most likely sense to a given word obtains an accuracy of 67%. We did not have the possibility to compare the accuracy of this model with a standard CRF since, as already stated, training such a CRF takes impractically long, but we can compare our system with existing WSD systems. Mihalcea and Moldovan (1999) use the semantic density between words to determine the word sense. They achieve an accuracy of 86.5% (testing on the first two tagged files of the Semcor corpus). Wilks and Stevenson (1998) use a combination of knowledge sources and achieve an accuracy of 92%³. Note that both these methods use additional knowledge apart from the WordNet hierarchy.

Note also that the sentences in our training and testing sets were already (perfectly) POS-tagged and noun chunked, and that in a real-life situation additional preprocessing by a POS tagger (such as the LT-POS tagger⁴) and a noun chunker (such as described in (Ramshaw and Marcus, 1995)) would be needed, which will introduce additional errors.

[Figure 7: Error rate during training]

8 Future work

In this section we discuss some of the work we plan to do in the future. First of all, we wish to evaluate our algorithm on standard test sets, such as the data of the Senseval conference⁵, which tests performance on word sense disambiguation, and the data of the CoNLL 2003 shared task⁶, on named entity recognition.

An important weakness of our algorithm is the fact that, to label a sentence, we have to traverse the hierarchy tree and choose the correct synsets at every level. An error at a certain level cannot be recovered. Therefore, we would like to perform some form of beam search (Bisiani, 1992), keeping a number of best labelings at every level. We strongly suspect this will have a positive impact on the accuracy of our algorithm.

As already mentioned, this work is carried out during the CLASS project. In the second phase of this project we will discover classes and attributes of entities in texts. To accomplish this we will not only need to label nouns with their synset, but we will also need to label verbs, adjectives and adverbs. This can become problematic as WordNet has no hypernym/hyponym relation (or an equivalent) for the synsets of adjectives and adverbs. WordNet has an equivalent relation for verbs (hypernym/troponym), but this relation structures the verb synsets in a large number of loosely structured trees, which is less suitable for the described method. VerbNet (Kipper et al., 2000) seems a more promising resource to use when classifying verbs, and we will also investigate the use of other lexical databases, such as ThoughtTreasure (Mueller, 1998), Cyc (Lenat, 1995), Openmind Commonsense (Stork, 1999) and FrameNet (Baker et al., 1998).

Acknowledgments

The work reported in this paper was supported by the EU-IST project CLASS (Cognitive-Level Annotation using Latent Statistical Structure, IST-027978).

³ This method was tested on the Semcor corpus, but uses the word senses of the Longman Dictionary of Contemporary English.
⁴ http://www.ltg.ed.ac.uk/software/
⁵ http://www.senseval.org/
⁶ http://www.cnts.ua.ac.be/conll2003/

References

Eneko Agirre and German Rigau. 1996. Word sense disambiguation using conceptual density. In Proceedings of the 16th International Conference on Computational Linguistics (Coling'96), pages 16–22, Copenhagen, Denmark.

C. F. Baker, C. J. Fillmore, and J. B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the COLING-ACL.

L. E. Baum and T. Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554–1563.

R. Bisiani. 1992. Beam search. In S. C. Shapiro, editor, Encyclopedia of Artificial Intelligence. Wiley-Interscience, New York.

Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. 1994. Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program., 63(2):129–156.

C. Fellbaum, J. Grabowski, and S. Landes. 1998. Performance and confidence in a semantic annotation task. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press.

G. D. Forney. 1973. The Viterbi algorithm. In Proceedings of the IEEE, pages 268–278.

G. D. Forney. 1996. The forward-backward algorithm. In Proceedings of the 34th Allerton Conference on Communications, Control and Computing, pages 432–446.

Brian Hayes. 1999. The web of words. American Scientist, 87(2):108–112, March-April.

Michael I. Jordan, editor. 1999. Learning in Graphical Models. The MIT Press, Cambridge.

K. Kipper, H. T. Dang, and M. Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-2000).

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning.

S. Landes, C. Leacock, and R. I. Tengi. 1998. Building semantic concordances. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press.

D. B. Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):32–38.

Wei Li and Andrew McCallum. 2003. Rapid development of Hindi named entity recognition using conditional random fields and feature induction. ACM Transactions on Asian Language Information Processing (TALIP), 2(3):290–294.

Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 188–191, Edmonton, Canada.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

A. McCallum. 2003. Efficiently inducing features of conditional random fields. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence.

D. McCarthy, R. Koeling, J. Weeds, and J. Carroll. 2004. Using automatically acquired predominant senses for word sense disambiguation. In Proceedings of the ACL SENSEVAL-3 workshop, pages 151–154, Barcelona, Spain.

R. Mihalcea and D. I. Moldovan. 1999. A method for word sense disambiguation of unrestricted text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 152–158, Morristown, NJ, USA.

Erik T. Mueller. 1998. Natural language processing with ThoughtTreasure. Signiform, New York.

F. Peng and A. McCallum. 2004. Accurate information extraction from research papers using conditional random fields. In Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 329–336.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, pages 82–94, Cambridge, MA, USA.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology, HLT-NAACL.

D. Stork. 1999. The OpenMind initiative. IEEE Intelligent Systems & their applications, 14(3):19–20.

A. J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Informat. Theory, 13:260–269.

Hanna M. Wallach. 2004. Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania CIS.

Y. Wilks and M. Stevenson. 1998. Word sense disambiguation using optimised combinations of knowledge sources. In Proceedings of COLING/ACL '98.

D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.
