Inducing criteria for lexicalization parts of speech using the Cyc KB

Tom O'Hara
Computer Science Department
New Mexico State University
Las Cruces, NM 88001

Michael Witbrock, Bjorn Aldag, Stefano Bertolo, Nancy Salay, Jon Curtis and Kathy Panton
Cycorp, Inc.
Austin, TX 78731
[email protected]
{witbrock,bertolo,aldag,nancy,jone,panton}@cyc.com

Abstract

We present an approach for learning part-of-speech distinctions by induction over the lexicon of the Cyc knowledge base. This produces good results (74.6%) using a decision tree that incorporates both semantic features and syntactic features. Accurate results (90.5%) are achieved for the special case of deciding whether lexical mappings should use count noun or mass noun headwords. Comparable results are also obtained using OpenCyc, the publicly available version of Cyc.

1 Introduction

In semantic lexicons, a lexical mapping describes the relationship between a concept and phrases used to refer to it.

... denotational assertion maps a phrase, specified via a lexical concept with optional string modifiers, into a concept, usually a collection. The part of speech is specified by Cyc's SpeechPart constants. The simplest type of denotational mapping uses the denotation predicate. For example, (denotation Device-Word CountNoun 0 PhysicalDevice) indicates that sense 0 of the count noun 'device' refers to PhysicalDevice (via the wordforms "device" and "devices"). Three additional predicates account for phrasal mappings: compoundString, headMedialString, and multiWordString are used for phrases with the headword at the beginning, the middle, and the end, respectively.

These denotational assertions, excluding lexical mappings for technical, informal and slang terms, form the training data for a lexicalization speech part classifier.

2 Inference of lexicalization part of speech
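As an illustration of the denotational assertions that serve as training data, the following minimal Python sketch parses the example assertion quoted in the text into a small record. The predicate names and the example come from the text; the class DenotationalAssertion, the function parse_denotation, and the (concept, speech part) pairing are hypothetical names and assumptions introduced here, not part of Cyc or of the authors' system.

```python
from dataclasses import dataclass

# Phrasal-mapping predicates named in the text, keyed to the headword
# position each one covers. The dictionary itself is illustrative only.
HEADWORD_POSITION = {
    "compoundString": "beginning",
    "headMedialString": "middle",
    "multiWordString": "end",
}


@dataclass(frozen=True)
class DenotationalAssertion:
    """Hypothetical record for a single-word denotation assertion."""
    predicate: str
    word_unit: str     # lexical concept, e.g. Device-Word
    speech_part: str   # SpeechPart constant, e.g. CountNoun
    sense: int         # sense number of the word
    concept: str       # denoted concept, e.g. PhysicalDevice


def parse_denotation(text: str) -> DenotationalAssertion:
    """Parse '(denotation WORD SPEECHPART SENSE CONCEPT)' into a record."""
    tokens = text.strip().lstrip("(").rstrip(")").split()
    if len(tokens) != 5 or tokens[0] != "denotation":
        raise ValueError(f"not a simple denotation assertion: {text!r}")
    predicate, word_unit, speech_part, sense, concept = tokens
    return DenotationalAssertion(
        predicate, word_unit, speech_part, int(sense), concept
    )


# The example assertion quoted in the text.
example = parse_denotation(
    "(denotation Device-Word CountNoun 0 PhysicalDevice)"
)

# Assumption: a classifier instance pairs the denoted concept with the
# speech part to be predicted for its lexicalization.
training_pair = (example.concept, example.speech_part)
```

Under this assumption, each assertion contributes one labelled instance whose target is the speech part, while features would be derived from the denoted concept. The phrasal predicates are listed only by headword position because the text does not give their argument layout.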