
Deriving a Large Scale Taxonomy from Wikipedia

Simone Paolo Ponzetto and Michael Strube
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg, Germany
http://www.eml-research.de/nlp

Abstract

We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico-syntactic matching. As a result we are able to derive a large scale taxonomy containing a large amount of subsumption, i.e. isa, relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as by computing semantic similarity between words in benchmarking datasets.

Introduction

The availability of large coverage, machine readable knowledge is a crucial theme for Artificial Intelligence. While advances towards robust statistical inference methods (cf. e.g. Domingos et al. (2006) and Punyakanok et al. (2006)) will certainly improve the computational modeling of intelligence, we believe that crucial advances will also come from rediscovering the deployment of large knowledge bases. Creating knowledge bases, however, is expensive, and they are time-consuming to maintain. In addition, most of the existing knowledge bases are domain dependent or have a limited and arbitrary coverage – Cyc (Lenat & Guha, 1990) and WordNet (Fellbaum, 1998) being notable exceptions. The field of ontology learning deals with these problems by taking textual input and transforming it into a taxonomy or a proper ontology. However, the learned ontologies are small and mostly domain dependent, and evaluations have revealed a rather poor performance (see Buitelaar et al. (2005) for an extensive overview).

We try to overcome such problems by relying on a wide coverage online encyclopedia developed by a large number of users, namely Wikipedia. We use semi-structured input by taking the category system in Wikipedia as a conceptual network. This provides us with pairs of related concepts whose semantic relation is unspecified. The task of creating a subsumption hierarchy then boils down to distinguishing between isa and notisa relations. We use methods based on connectivity in the network and lexico-syntactic patterns to label the relations between categories. As a result we are able to derive a large scale taxonomy.

Motivation

Arguments for the necessity of symbolically encoded knowledge for AI date back at least to McCarthy (1959). This need has become clearer throughout the last decades, as it became obvious that AI subfields such as information retrieval, knowledge management, and natural language processing (NLP) all profit from machine accessible knowledge (see Cimiano et al. (2005) for a broader motivation). E.g., from a computational linguistics perspective, knowledge bases for NLP applications should be:

• domain independent, i.e. have a large coverage, in particular at the instance level;
• up-to-date, in order to process current information;
• multilingual, in order to process information in a language independent fashion.

The Wikipedia categorization system satisfies all these points. Unfortunately, the Wikipedia categories do not form a taxonomy with a fully-fledged subsumption hierarchy, but only a thematically organized thesaurus. As an example, the category CAPITALS IN ASIA¹ is categorized in the upper category CAPITALS (isa), whereas a category such as PHILOSOPHY is categorized under ABSTRACTION and BELIEF (deals-with?) as well as HUMANITIES (isa) and SCIENCE (isa). Another example is a page such as EUROPEAN MICROSTATES, which belongs to the categories EUROPE (are-located-in) and MICROSTATES (isa).

¹ We use Sans Serif for words and queries, CAPITALS for Wikipedia pages and SMALL CAPS for Wikipedia categories.

Related Work

There is a large body of work concerned with acquiring knowledge for AI and NLP applications.
Many NLP components can get along with rather unstructured, associative knowledge as provided by the cooccurrence of words in large corpora, e.g., distributional similarity (Church & Hanks, 1990; Lee, 1999; Weeds & Weir, 2005, inter alia) and vector space models (Schütze, 1998).


Such unlabeled relations between words proved to be as useful for disambiguating syntactic and semantic analyses as the manually assembled knowledge provided by WordNet.

However, the availability of reliable preprocessing components like POS taggers and syntactic and semantic parsers allows the field to move towards higher level tasks, such as question answering, textual entailment, or complete dialogue systems, which require understanding language. This lets researchers focus (again) on taxonomic and ontological resources. The manually constructed Cyc ontology provides a large amount of domain independent knowledge. However, Cyc cannot (and is not intended to) cope with most specific domains and current events. The emerging field of ontology learning tries to overcome these problems by learning (mostly) domain dependent ontologies from scratch. However, the generated ontologies are relatively small and the results rather poor (e.g., Cimiano et al. (2005) report an F-measure of about 33% with regard to an existing ontology of less than 300 concepts). It seems more promising to extend existing resources such as Cyc (Matuszek et al., 2005) or WordNet (Snow et al., 2006). The examples shown in these works, however, seem to indicate that the extension takes place mainly with respect to named entities, a task which is arguably not as difficult as creating a complete (domain-) ontology from scratch.

Another approach for building large knowledge bases relies on input by volunteers, i.e., on collaboration among the users of an ontology (Richardson & Domingos, 2003). However, the current status of the Open Mind and MindPixel projects² does indicate that they are largely academic enterprises. Similar to the Semantic Web (Berners-Lee et al., 2001), where users are supposed to explicitly define the semantics of the contents of web pages, they may be hindered by too high an entrance barrier. In contrast, Wikipedia and its categorization system feature a low entrance barrier, achieving quality by collaboration. In Strube & Ponzetto (2006) we proposed to take the Wikipedia categorization system as a semantic network which served as a basis for computing the semantic relatedness of words. In the present work we develop this idea a step further by automatically assigning isa and notisa labels to relations between the categories. That way we are able to compute the semantic similarity between words instead of their relatedness.

² www.openmind.org and www.mindpixel.com

Methods

Since May 2004 Wikipedia allows for structured access by means of categories³. The categories form a graph which can be taken to represent a conceptual network with unspecified semantic relations (Strube & Ponzetto, 2006). We present here our methods to derive isa and notisa relations from these generic links.

³ Wikipedia can be downloaded at http://download.wikimedia.org. In our experiments we use the English Wikipedia dump from 25 September 2006. This includes 1,403,207 articles, 99% of which are categorized.

Category network cleanup (1)

We start with the full categorization network consisting of 165,744 category nodes with 349,263 direct links between them. We first clean the network of meta-categories used for encyclopedia management, e.g. the categories under WIKIPEDIA ADMINISTRATION. Since this category is connected to many content bearing categories, we cannot remove this portion of the graph entirely. Instead, we remove all those nodes whose labels contain any of the following strings: wikipedia, wikiprojects, lists, mediawiki, template, user, portal, categories, articles, pages. This leaves 127,325 categories and 267,707 links still to be processed.

Refinement link identification (2)

The next preprocessing step identifies so-called refinement links. Wikipedia users tend to organize many category pairs using patterns such as Y X and X BY Z (e.g. MILES DAVIS ALBUMS and ALBUMS BY ARTIST). We label these patterns as expressing is-refined-by semantic relations between categories. While these links could in principle be assigned a full isa semantics, they represent meta-categorization relations, i.e., their sole purpose is to better structure and simplify the categorization network. We take all categories containing by in the name and label all links with their subcategories with an is-refined-by relation. This labels 54,504 category links and leaves 213,203 relations to be analyzed.
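To make the two preprocessing steps concrete, here is a minimal sketch. The data structures `categories` and `links` are hypothetical stand-ins for the parsed category graph, not part of the original system; the stop-string list is the one given above.

```python
# Sketch of steps (1) cleanup and (2) refinement link labeling.
META_STRINGS = ("wikipedia", "wikiprojects", "lists", "mediawiki",
                "template", "user", "portal", "categories", "articles", "pages")

def is_meta(label):
    """A category is administrative if its label contains any stop string."""
    label = label.lower()
    return any(s in label for s in META_STRINGS)

def clean_network(categories, links):
    """Drop meta-categories and all links touching them."""
    kept = {c for c in categories if not is_meta(c)}
    kept_links = [(sub, sup) for (sub, sup) in links
                  if sub in kept and sup in kept]
    return kept, kept_links

def label_refinement(links, labels):
    """Mark links into '... by ...' categories as is-refined-by."""
    for sub, sup in links:
        if " by " in f" {sup.lower()} ":
            labels[(sub, sup)] = "is-refined-by"
    return labels

cats, links = clean_network({"Capitals", "Wikipedia administration"},
                            [("Capitals", "Wikipedia administration")])
print(cats, links)  # {'Capitals'} []
print(label_refinement([("Miles Davis albums", "Albums by artist")], {}))
```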
mantics of the contents of web pages, they may be hin- Head matching. The first method labels pairs of cate- dered by too high an entrance barrier. In contrast, Wikipedia gories sharing the same lexical head, e.g. BRITISH COM- and its categorization system feature a low entrance barrier PUTER SCIENTISTS isa COMPUTER SCIENTISTS. We parse achieving quality by collaboration. In Strube & Ponzetto the category labels using the Stanford parser (Klein & Man- (2006) we proposed to take the Wikipedia categorization ning, 2003). Since we parse mostly NP fragments, we con- system as a semantic network which served as basis for com- strain the output of the head finding algorithm (Collins, puting the semantic relatedness of words. In the present 1999) to return a lexical head labeled as either a noun or work we develop this idea a step further by automatically a 3rd person singular present verb (this is to tolerate errors assigning isa and notisa labels to relations between the cat- where plural noun heads have been wrongly identified as egories. That way we are able to compute the semantic sim- verbs). In addition, we modify the head finding rules to re- ilarity between words instead of their relatedness. turn both nouns for NP coordinations (e.g. both buildings and infrastructures for BUILDINGS AND INFRASTRUC- Methods TURES IN JAPAN). Finally, we label a category link as isa if Since May 2004 Wikipedia allows for structured access by the two categories share the same head lemma, as given by means of categories3. The categories form a graph which a finite-state morphological analyzer (Minnen et al., 2001). can be taken to represent a conceptual network with un- Modifier matching. We next label category pairs as not- specified semantic relations (Strube & Ponzetto, 2006). We isa in case the stem of the lexical head of one of the cate- present here our methods to derive isa and notisa relations gories, as given by the Porter stemmer (Porter, 1980), occurs from these generic links. in non-head position in the other category. This is to rule out thematic categorization links such as CRIME COMICS and 2 www.openmind.org and www.mindpixel.com CRIME or ISLAMIC MYSTICISM and ISLAM. 3Wikipedia can be downloaded at http://download. wikimedia.org. In our experiments we use the English Both methods achieve a good coverage by identifying re- Wikipedia dump from 25 September 2006. This includes spectively 72,663 isa relations by head matching and 37,999 1,403,207 articles, 99% of which are categorized. notisa relations by modifier matching.

Patterns for isa detection:
1. NP2 ,? (such as|like|, especially) NP* NP1 (e.g. "a stimulant such as caffeine")
2. such NP2 as NP* NP1 (e.g. "such stimulants as caffeine")
3. NP1 NP* (and|or) other NP2 (e.g. "caffeine and other stimulants")
4. NP1 , one of det_pl NP2 (e.g. "caffeine, one of the stimulants")
5. NP1 , det_sg NP2 rel_pron (e.g. "caffeine, a stimulant which")
6. NP2 like NP* NP1 (e.g. "stimulants like caffeine")

Patterns for notisa detection:
1. NP2's NP1 (e.g. "car's engine")
2. NP1 in NP2 (e.g. "engine in the car")
3. NP2 with NP1 (e.g. "a car with an engine")
4. NP2 contain(s|ed|ing) NP1 (e.g. "a car containing an engine")
5. NP1 of NP2 (e.g. "engine of the car")
6. NP1 are? used in NP2 (e.g. "engines used in cars")
7. NP2 ha(s|ve|d) NP1 (e.g. "a car has an engine")

Figure 1: Patterns for isa and notisa Detection
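The following sketch shows one way two of these patterns could be encoded as regular expressions over raw sentences. The real system matches over POS-tagged, NP-chunked text, so the toy `NP` expression here (plain word sequences) is only an illustrative stand-in.

```python
import re

# Toy noun-phrase matcher: one or more words. A real system would use
# the chunker output instead of this approximation.
NP = r"[A-Za-z]+(?: [A-Za-z]+)*"

# isa pattern 1 (the "such as" variant) and notisa pattern 3.
ISA_SUCH_AS = re.compile(rf"(?P<hyper>{NP}) ?,? such as (?P<hypo>{NP})")
NOTISA_WITH = re.compile(rf"(?P<whole>{NP}) with (?P<part>{NP})")

def vote(sentence, sub, sup):
    """+1 for an isa match, -1 for a notisa match, 0 otherwise."""
    m = ISA_SUCH_AS.search(sentence)
    if m and sub in m.group("hypo") and sup in m.group("hyper"):
        return 1
    m = NOTISA_WITH.search(sentence)
    if m and sub in m.group("part") and sup in m.group("whole"):
        return -1
    return 0

print(vote("a stimulant such as caffeine", "caffeine", "stimulant"))  # 1
print(vote("a car with an engine", "engine", "car"))                  # -1
```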

Connectivity-based methods (4)

The next set of methods relies on the structure and connectivity of the categorization network.

Instance categorization. Suchanek et al. (2007) show that instance-of relations in Wikipedia between entities (denoted by pages) and classes (denoted by categories) can be found heuristically with high accuracy by determining whether the head of the page category is plural, e.g. ALBERT EINSTEIN belongs to the NATURALIZED CITIZENS OF THE UNITED STATES category. We apply this idea to isa relation identification as follows. For each category c,

1. we find the page titled as the category or its lemma, for instance the page MICROSOFT for the category MICROSOFT;
2. we then collect all the page's categories whose lexical head is a plural noun, CP = {c1, c2, ..., cn};
3. for each of c's supercategories sc, we label the relation between c and sc as isa if the head lemma of sc matches the head lemma of at least one category cp ∈ CP.

For instance, from the page MICROSOFT being categorized into COMPANIES LISTED ON NASDAQ, we collect evidence that Microsoft is a company and accordingly categorize as isa the links between MICROSOFT and COMPUTER AND VIDEO GAME COMPANIES. The idea is to collect evidence from the instance describing the concept and propagate such evidence to the described concept itself (a code sketch of this heuristic follows below).

Redundant categorization. This method labels pairs of categories which have at least one page in common. If users redundantly categorize by assigning two directly connected categories to a page, they often mark by implicature the page as being an instance of two different category concepts with different granularities, e.g. ETHYL CARBAMATE is both an AMIDE(S) and an ORGANIC COMPOUND(S). Assuming that the page is an instance of both conceptual categories, we can conclude by transitivity that one category is subsumed by the other, i.e. AMIDES isa ORGANIC COMPOUNDS.

The connectivity-based methods provide positive isa links in cases where relations are unlikely to be found in free text. Using instance categorization and redundant categorization we find 9,890 and 11,087 isa relations, respectively.

Lexico-syntactic based methods (5)

After applying methods (1-4) we are left with 81,564 unclassified relations. We next apply lexico-syntactic patterns to sentences in large text corpora to identify isa relations (Hearst, 1992; Caraballo, 1999). In order to reduce the amount of unclassified relations and to increase the precision of the isa patterns, we also apply patterns to identify notisa relations. We assume that patterns used for identifying meronymic relations (Berland & Charniak, 1999; Girju et al., 2006) indicate that the relation is not an isa relation. The text corpora used for this step are the English Wikipedia (5×10⁸ words) and the Tipster corpus (2.5×10⁸ words; Harman & Liberman (1993)). In the patterns for detecting isa and notisa relations (Figure 1), NP1 represents the hyponym and NP2 the hypernym, i.e., we want to retrieve NP1 isa NP2; NP* represents zero or more coordinated NPs.

To improve the recall of applying these patterns, we use only the lexical head of the categories which were not identified as named entities. That is, if the lexical head of a category is identified by a Named Entity Recognizer (Finkel et al., 2005) as belonging to a named entity, e.g. Brands in YUM! BRANDS, we use the full category name; otherwise we simply use the head, e.g. albums in MILES DAVIS ALBUMS.

In order to ensure precision in applying the patterns, both the Wikipedia and Tipster corpora were preprocessed by a pipeline consisting of a trigram-based statistical POS tagger (Brants, 2000) and an SVM-based chunker (Kudoh & Matsumoto, 2000) to identify noun phrases (NPs).
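Picking up the forward reference above, here is a minimal sketch of the instance-categorization heuristic of method (4), following its three steps. The accessors `page_for`, `categories_of`, and `supercategories_of`, as well as `head_lemma` and `is_plural_head`, are hypothetical helpers over a parsed dump, not an actual API of the system.

```python
# Sketch of instance categorization: propagate plural-headed page
# categories as isa evidence for the category describing the same entity.
def instance_isa_links(category, page_for, categories_of,
                       supercategories_of, head_lemma, is_plural_head):
    links = []
    page = page_for(category)                 # step 1: page with the same title
    if page is None:
        return links
    plural_heads = {head_lemma(c) for c in categories_of(page)
                    if is_plural_head(c)}     # step 2: CP, plural-headed categories
    for sc in supercategories_of(category):   # step 3: propagate the evidence
        if head_lemma(sc) in plural_heads:
            links.append((category, sc, "isa"))
    return links
```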
The patterns are used to provide evidence for semantic relations employing a majority voting strategy. We positively label a category pair with isa in case the number of matches of positive patterns is greater than the number of matches of negative ones. In addition, we use the patterns to filter the isa relations created by the connectivity-based methods (4), since instance categorization and redundant categorization give results which are not always reliable; e.g. we incorrectly find that CONSONANTS isa PHONETICS. We use the same majority voting scheme, except that this time we mark as notisa those pairs with a number of negative matches greater than the number of positive ones. This ensures better precision while leaving the recall basically unchanged. These methods create 15,055 isa relations and filter out 3,277 previously identified positive links.
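The voting rule itself is straightforward. A minimal sketch, assuming `votes` collects the +1/-1 pattern matches for a category pair (e.g. from the toy matcher shown after Figure 1):

```python
def majority_label(votes):
    """Label a pair by majority of pattern matches; ties stay unlabeled."""
    pos, neg = votes.count(1), votes.count(-1)
    if pos > neg:
        return "isa"
    if neg > pos:
        return "notisa"   # also used to filter links from methods (4)
    return None

print(majority_label([1, 1, -1]))  # isa
```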

Inference-based methods (6)

The last set of methods propagates the previously found relations by means of multiple inheritance and transitivity. We first propagate all isa relations to those superclasses whose head lemma matches the head lemma of a previously identified isa superclass. E.g., once we have found that MICROSOFT isa COMPANIES LISTED ON NASDAQ, we can also infer that MICROSOFT isa MULTINATIONAL COMPANIES. We then propagate all isa links to those superclasses which are connected through a path found along the previously discovered subsumption hierarchy. E.g., given that FRUITS isa CROPS and CROPS isa EDIBLE PLANTS, we can infer that FRUITS isa EDIBLE PLANTS.
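A sketch of the two propagation steps, assuming `isa` is the set of (subcategory, supercategory) pairs found so far and `supercategories_of` and `head_lemma` are the hypothetical helpers used earlier. The naive quadratic transitive closure shown here is only illustrative; a real implementation would traverse the hierarchy directly.

```python
def propagate(isa, supercategories_of, head_lemma):
    closed = set(isa)
    # Multiple inheritance: if sub isa X, extend to sub's other
    # supercategories whose head lemma matches X's head lemma.
    for sub, sup in list(closed):
        for other in supercategories_of(sub):
            if head_lemma(other) == head_lemma(sup):
                closed.add((sub, other))
    # Transitivity: sub isa mid and mid isa sup entails sub isa sup.
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed
```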
Evaluation

Since the induced taxonomy is very large – up to 105,418 generated isa semantic links – and in order to avoid any bias in the evaluation method, we evaluate the coverage and quality of the automatically extracted semantic relations against existing resources.

Comparison with ResearchCyc

We first compute the amount of isa relations we correctly extracted by comparing with ResearchCyc⁴, the research version of the Cyc knowledge base (Lenat & Guha, 1990), which includes (as of version 1.0) more than 300,000 concepts and 3 million assertions. For each category pair, we first map each category to its Cyc concept using Cyc's internal lexeme-to-concept denotational mapper. Concepts are found by querying the full category label (e.g. Alan Turing). In case no matching concept is found, we fall back to querying its lexical head (hardware for IBM hardware).

⁴ http://research.cyc.com/

We evaluate only the 85% of the pairs which have corresponding concepts in Cyc. These pairs are evaluated by querying Cyc whether the concept denoted by the Wikipedia subcategory is either an instance of (#$isa) or is generalized by (#$genls) the concept denoted by its superclass⁵. We then take the result of the query as the actual (isa or notisa) semantic class for the category pair and use it to evaluate the system's response. This way we are able to compute standard measures of precision, recall and balanced F-measure. Table 1 shows the results obtained by taking the syntax-based methods (i.e. head matching) as baseline and incrementally augmenting them with different sets of methods, namely our connectivity and pattern based methods. All differences in performance are statistically significant at p < 0.001, as given by a McNemar test.

⁵ Note that our definition of isa is similar to the one found in WordNet prior to version 2.1. That is, we do not distinguish hyponyms that are classes from hyponyms that are instances (cf. Miller & Hristea (2006)).

                                        R      P     F1
baseline (methods 1-3)                73.7  100.0   84.9
+ connectivity (methods 1-4, 6)       80.6   91.8   85.8
+ pattern-based (methods 1-3, 5-6)    84.3   91.5   87.7
all (methods 1-6)                     89.1   86.6   87.9

Table 1: Comparison with Cyc

Discussion and Error Analysis. The simple methods employed for the baseline work surprisingly well, with perfect precision and somewhat satisfying recall. However, since only categories with identical heads are connected, we do not create a single interconnected taxonomy but many separate taxonomic islands. In practice we simply find that HISTORICAL BUILDINGS are BUILDINGS; the extracted information is trivial.

By applying the connectivity-based methods we are able to improve the recall considerably. The drawback is a decrease in precision. However, a closer look reveals that we now in fact created an interconnected taxonomy where concepts with quite different linguistic realizations are connected. We observe the same trend when applying the pattern-based methods in addition to the baseline: they improve the recall even more, but they also have a lower precision. The best results are obtained by combining all methods.

Because we did not expect such a big drop in precision – and only a moderate improvement over the baseline in F-measure – we closely inspected a random sample of 200 false positives, i.e., the cases which led to the low precision score. Three annotators labeled these cases as true if judged to be in fact correct isa relations, false otherwise. It turned out that about 50% of the false positives were indeed labeled correctly as isa relations by the system, but these relations could not be found in Cyc. This is due to (1) Cyc missing the required relations (e.g. BRIAN ENO isa MUSICIANS) or (2) Cyc missing the required concepts (e.g., we correctly find that BEE TRAIN isa ANIMATION STUDIOS, but since Cyc provides only the TRAIN-TRANSPORTATION-DEVICE and STUDIO concepts, we query "is train a studio?", which leads to a false positive).

Computing semantic similarity using Wikipedia

In Strube & Ponzetto (2006) we proposed to use the Wikipedia categorization as a conceptual network to compute the semantic relatedness of words. However, we could not compute semantic similarity, because approaches to measuring semantic similarity that rely on lexical resources use paths based on isa relations only, and these are only available in the present work.

We perform an extrinsic evaluation by computing semantic similarity on two commonly used datasets, namely Miller & Charles' (1991) list of 30 noun pairs (M&C) and the 65 word synonymity list from Rubenstein & Goodenough (1965, R&G).
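To make the evaluation setting concrete, here is a minimal sketch of a path-based score in the spirit of the pl measure reported in Table 2, together with the Pearson coefficient used for evaluation. The taxonomy and word pair below are toy values, not the M&C or R&G data.

```python
from collections import deque

def shortest_isa_path(taxonomy, a, b):
    """BFS over isa edges (treated as undirected) for the path length."""
    neighbours = {}
    for child, parents in taxonomy.items():
        for p in parents:
            neighbours.setdefault(child, set()).add(p)
            neighbours.setdefault(p, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for n in neighbours.get(node, ()):
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 1))
    return None

def pearson(xs, ys):
    """Pearson r between system scores and human judgements
    (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

tax = {"car": {"vehicle"}, "bicycle": {"vehicle"}}
d = shortest_isa_path(tax, "car", "bicycle")
print(1.0 / (1 + d))  # 0.333..., a toy path-based similarity score
```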

                                       M&C                      R&G
                                pl   wup   lch   res     pl   wup   lch   res
WordNet            all         0.72  0.77  0.82  0.78   0.78  0.82  0.86  0.81
Wikirelate!        all         0.60  0.53  0.58  0.30   0.62  0.63  0.64  0.34
                   non-missing 0.65  0.61  0.65  0.41   0.66  0.69  0.70  0.42
Wikirelate!        all         0.67  0.65  0.67  0.69   0.67  0.69  0.70  0.66
(isa only)         non-missing 0.71  0.70  0.72  0.74   0.70  0.73  0.73  0.70
Wikirelate!        all         0.68  0.74  0.73  0.62   0.67  0.74  0.73  0.58
(PageRank filter)  non-missing 0.72  0.79  0.78  0.68   0.70  0.79  0.77  0.63
Wikirelate!        all         0.73  0.79  0.78  0.81   0.69  0.75  0.74  0.76
(isa + PageRank)   non-missing 0.76  0.84  0.82  0.86   0.72  0.79  0.77  0.80

Table 2: Results on correlation with human judgements of similarity measures

We compare the results obtained using Wikipedia with those obtained using WordNet, which is the most widely used lexical taxonomy for this task. Following the literature on semantic similarity, we evaluate performance by taking the Pearson product-moment correlation coefficient r between the similarity scores and the corresponding human judgements. For each dataset we report the correlation computed on all pairs (all); in the case of word pairs where at least one of the words could not be found, the similarity score is set to 0. In addition, we report the correlation score obtained by disregarding such pairs containing missing words (non-missing).

Table 2 reports the scores obtained by computing semantic similarity in WordNet as well as in Wikipedia using different scenarios and measures (Rada et al. (1989, pl), Wu & Palmer (1994, wup), Leacock & Chodorow (1998, lch), Resnik (1995, res)). We first take as baseline the Wikirelate! method outlined in Strube & Ponzetto (2006) and extend it by computing only paths based on isa relations. Since experiments on development data⁶ revealed a performance improvement far lower than expected, we performed an error analysis. This revealed that many dissimilar pairs received a score higher than expected because of coarse-grained, overconnected categories containing a large amount of dissimilar pages; e.g. mound and shore were directly connected through LANDFORMS, though they are indeed quite different according to human judges.

⁶ In order to perform a blind test evaluation, we developed the system for computing semantic similarity using a different version of Wikipedia, namely the database dump from 19 February 2006.

A way to model the categories' connectivity is to compute their authoritativeness, i.e. we assume that overconnected, semantically coarse categories will be the most authoritative ones. This can be accomplished, for instance, by computing the centrality scores of the Wikipedia categories. Since the Wikipedia categorization network is a directed acyclic graph, link analysis algorithms such as PageRank (Brin & Page, 1998) can be easily applied to automatically detect and remove these coarse categories from the categorization network. We take the graph given by all the categories and the pages that point to them and apply the PageRank algorithm. PageRank scores are computed recursively for each category vertex v by the formula

PR(v) = (1 - d) + d \sum_{v' \in I(v)} \frac{PR(v')}{|O(v')|}

where d ∈ (0, 1) is a damping factor (we set it to the standard value of 0.85), I(v) is the set of nodes linking to v and |O(v')| is the number of outgoing links of node v'. This gives a ranking of the most authoritative categories, which in our case happen to be the categories in the highest regions of the categorization network – i.e., the top-ranked categories are FUNDAMENTAL, SOCIETY, KNOWLEDGE, PEOPLE, SCIENCE, ACADEMIC DISCIPLINES and so on.
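As a minimal sketch of this filtering step, the formula above can be implemented iteratively; `in_links` and `out_degree` below are hypothetical representations of the category/page graph, and the graph itself is a toy example.

```python
def pagerank(in_links, out_degree, d=0.85, iterations=50):
    """Iterative PageRank matching the (unnormalized) formula above."""
    pr = {v: 1.0 for v in in_links}
    for _ in range(iterations):
        pr = {v: (1 - d) + d * sum(pr[u] / out_degree[u]
                                   for u in in_links[v])
              for v in in_links}
    return pr

# Toy graph: two pages pointing to one category.
in_links = {"CATEGORY": ["page1", "page2"], "page1": [], "page2": []}
out_degree = {"page1": 1, "page2": 1, "CATEGORY": 1}
ranks = pagerank(in_links, out_degree)
# The highest-ranked categories would then be filtered out.
print(sorted(ranks, key=ranks.get, reverse=True)[:1])  # ['CATEGORY']
```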
The third experimental setting of Table 2 shows the results obtained by computing relatedness using the method from Strube & Ponzetto (2006) and removing the top 200 highest-ranked PageRank categories⁷. Finally, we present results of using both isa and PageRank filtering. The results indicate that using isa relations only and applying PageRank filtering both work better than the simple Wikirelate! baseline. This is because in both cases we are able to filter out categories and category relations which decrease the similarity scores, i.e. coarse-grained categories using PageRank, and notisa (e.g. meronymic, antonymic) semantic relations. The two methods are indeed complementary, as shown by the best results being obtained by applying them together. Using PageRank filtering together with paths including only isa relations yields results which are close to the ones obtained by using WordNet.

⁷ The optimal threshold value was established again by analyzing performance on the development data.

The results indicate that Wikipedia can be successfully used as a taxonomy to compute the semantic similarity of words. In addition, our application of PageRank for filtering out coarse-grained categories highlights that, similarly to the connectivity-based methods used to identify isa relations, the internal structure of Wikipedia can be used to generate semantic content, being based on a meaningful set of conventions the users tend to adhere to.

Conclusions

We described the automatic creation of a large scale, domain independent taxonomy. We took Wikipedia's categories as concepts in a semantic network and labeled the relations between these concepts as isa and notisa relations by using methods based on the connectivity of the network and on applying lexico-syntactic patterns to very large corpora. Both connectivity-based methods and lexico-syntactic patterns ensure a high recall while decreasing the precision. We compared the created taxonomy with ResearchCyc and, via semantic similarity measures, with WordNet. Our Wikipedia-based taxonomy proved to be competitive with the two arguably largest and best developed existing ontologies. We believe that these results are due to taking already structured and well-maintained knowledge as input.

Our work on deriving a taxonomy is the first step in creating a fully-fledged ontology based on Wikipedia. This will require labeling the generic notisa relations with more specific ones such as has-part, has-attribute, etc.

Acknowledgements. This work has been funded by the Klaus Tschira Foundation, Heidelberg, Germany. The first author has been supported by a KTF grant (09.003.2004). We thank our colleagues Katja Filippova and Christoph Müller for useful feedback.

References

Berland, M. & E. Charniak (1999). Finding parts in very large corpora. In Proc. of ACL-99, pp. 57–64.
Berners-Lee, T., J. Hendler & O. Lassila (2001). The semantic web. Scientific American, 284(5):34–43.
Brants, T. (2000). TnT – A statistical Part-of-Speech tagger. In Proc. of ANLP-00, pp. 224–231.
Brin, S. & L. Page (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117.
Buitelaar, P., P. Cimiano & B. Magnini (Eds.) (2005). Ontology Learning from Text: Methods, Evaluation and Applications. Amsterdam, The Netherlands: IOS Press.
Caraballo, S. A. (1999). Automatic construction of a hypernym-labeled noun hierarchy from text. In Proc. of ACL-99, pp. 120–126.
Church, K. W. & P. Hanks (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Cimiano, P., A. Pivk, L. Schmidt-Thieme & S. Staab (2005). Learning taxonomic relations from heterogenous sources of evidence. In P. Buitelaar, P. Cimiano & B. Magnini (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications, pp. 59–73. Amsterdam, The Netherlands: IOS Press.
Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
Domingos, P., S. Kok, H. Poon, M. Richardson & P. Singla (2006). Unifying logical and statistical AI. In Proc. of AAAI-06, pp. 2–7.
Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT Press.
Finkel, J. R., T. Grenager & C. Manning (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of ACL-05, pp. 363–370.
Girju, R., A. Badulescu & D. Moldovan (2006). Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135.
Harman, D. & M. Liberman (1993). TIPSTER Complete. LDC93T3A, Philadelphia, Penn.: Linguistic Data Consortium.
Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proc. of COLING-92, pp. 539–545.
Klein, D. & C. D. Manning (2003). Fast exact inference with a factored model for natural language parsing. In S. Becker, S. Thrun & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 3–10. Cambridge, Mass.: MIT Press.
Kudoh, T. & Y. Matsumoto (2000). Use of Support Vector Machines for chunk identification. In Proc. of CoNLL-00, pp. 142–144.
Leacock, C. & M. Chodorow (1998). Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Chp. 11, pp. 265–283. Cambridge, Mass.: MIT Press.
Lee, L. (1999). Measures of distributional similarity. In Proc. of ACL-99, pp. 25–31.
Lenat, D. B. & R. V. Guha (1990). Building Large Knowledge-Based Systems: Representation and Inference in the CYC Project. Reading, Mass.: Addison-Wesley.
Matuszek, C., M. Witbrock, R. C. Kahlert, J. Cabral, D. Schneider, P. Shah & D. Lenat (2005). Searching for common sense: Populating Cyc from the web. In Proc. of AAAI-05, pp. 1430–1435.
McCarthy, J. (1959). Programs with common sense. In Proceedings of the Teddington Conference on the Mechanization of Thought Processes, pp. 75–91.
Miller, G. A. & W. G. Charles (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.
Miller, G. A. & F. Hristea (2006). WordNet nouns: Classes and instances. Computational Linguistics, 32(1):1–3.
Minnen, G., J. Carroll & D. Pearce (2001). Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.
Punyakanok, V., D. Roth, W. Yih & D. Zimak (2006). Learning and inference over constrained output. In Proc. of IJCAI-05, pp. 1117–1123.
Rada, R., H. Mili, E. Bicknell & M. Blettner (1989). Development and application of a metric to semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17–30.
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proc. of IJCAI-95, Vol. 1, pp. 448–453.
Richardson, M. & P. Domingos (2003). Building large knowledge bases by mass collaboration. In Proceedings of the 2nd International Conference on Knowledge Capture (K-CAP 2003), Sanibel Island, Fl., 23–25 October 2003, pp. 129–137.
Rubenstein, H. & J. Goodenough (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.
Snow, R., D. Jurafsky & A. Y. Ng (2006). Semantic taxonomy induction from heterogeneous evidence. In Proc. of COLING-ACL-06, pp. 801–808.
Strube, M. & S. P. Ponzetto (2006). WikiRelate! Computing semantic relatedness using Wikipedia. In Proc. of AAAI-06, pp. 1419–1424.
Suchanek, F. M., G. Kasneci & G. Weikum (2007). Yago: A core of semantic knowledge. In Proc. of WWW-07.
Weeds, J. & D. Weir (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4):439–475.
Wu, Z. & M. Palmer (1994). Verb semantics and lexical selection. In Proc. of ACL-94, pp. 133–138.
