Inferring Parts of Speech for Lexical Mappings Via the Cyc KB
Total Page:16
File Type:pdf, Size:1020Kb
to appear in Proceeding 20th International Conference on Computational Linguistics (COLING-04), August 23-27, 2004 Geneva Inferring parts of speech for lexical mappings via the Cyc KB Tom O’Hara‡, Stefano Bertolo, Michael Witbrock, Bjørn Aldag, Jon Curtis,withKathy Panton, Dave Schneider,andNancy Salay ‡Computer Science Department Cycorp, Inc. New Mexico State University Austin, TX 78731 Las Cruces, NM 88001 {bertolo,witbrock,aldag}@cyc.com [email protected] {jonc,panton,daves,nancy}@cyc.com Abstract in the phrase “painting for sale.” In cases like this, semantic or pragmatic criteria, as opposed We present an automatic approach to learn- to syntactic criteria only, are necessary for deter- ing criteria for classifying the parts-of-speech mining the proper part of speech. The headword used in lexical mappings. This will fur- part of speech is important for correctly identifying ther automate our knowledge acquisition sys- phrasal variations. For instance, the first term can tem for non-technical users. The criteria for also occur in the same sense in “paint for money.” the speech parts are based on the types of However, this does not hold for the second case, the denoted terms along with morphological since “paint for sale” has an entirely different sense and corpus-based clues. Associations among (i.e., a substance rather than an artifact). these and the parts-of-speech are learned us- When lexical mappings are produced by naive ing the lexical mappings contained in the Cyc users, such as in DARPA’s Rapid Knowledge For- knowledge base as training data. With over mation (RKF) project, it is desirable that tech- 30 speech parts to choose from, the classifier nical details such as the headword part of speech achieves good results (77.8% correct). Ac- be inferred for the user. Otherwise, often complex curate results (93.0%) are achieved in the and time-consuming clarification dialogs might be special case of the mass-count distinction for necessary in order to rule out various possibilities. nouns. Comparable results are also obtained For example, Cycorp’s Dictionary Assistant was using OpenCyc (73.1% general and 88.4% developed for RKF in order to allow non-technical mass-count). users to specify lexical mappings from terms into the Cyc knowledge base (KB). Currently, when a 1 Introduction new type of activity is described, the user is asked In semantic lexicons, the term lexical mapping de- a series of questions about the ways of referring to scribes the relation between a concept and a phrase the activity. If the user enters the phrase “painting used to refer to it (Onyshkevych and Nirenburg, for money,” the system asks whether the phrases 1995; Burns and Davis, 1999). Lexical mappings “paint for money” and “painted for money” are include associated syntactic information, in partic- suitable variations in order to determine whether ular, part of speech information for phrase head- ‘painting’ should be treated as a verb. Users find words. The term lexicalize will refer to the process such clarification dialogs distracting, since they are of producing these mappings, which are referred more interested in entering domain rather than lin- to as lexicalizations. Selecting the part of speech guistic knowledge. Regardless, it is often very diffi- for the lexical mapping is required so that proper cult to produce prompts that make the distinction inflectional variations can be recognized and gen- intelligible to a linguistically naive user. erated for the term. Although producing lexi- A special case of the lexicalization speech part calizations is often a straightforward task, there classification is the handling of the mass-count dis- are many cases that can pose problems, especially tinction. Having the ability to determine if a con- when fine-grained speech part categories are used. cept takes a mass or count noun is useful not only For example, the headword ‘painting’ is a verb for parsing, but also for generation of grammat- in the phrase “painting for money” but a noun ical English. For example, automatically gener- PhysicalDevice Collection: “Dubya” refers to President George W.Bush). The Microtheory: ArtifactGVocabularyMt KB also includes a broad-coverage English lexicon isa: ExistingObjectType mapping words and phrases to terms throughout genls: Artifact ComplexPhysicalObject the KB. A subset of the Cyc KB including parts of SolidTangibleProduct the English lexicon has been made freely available Microtheory: ProductGMt as part of OpenCyc (www.opencyc.org). isa: ProductType 2.1 Ontology Figure 1: Type definition for PhysicalDevice. Central to the Cyc ontology is the concept collec- tion, which corresponds to the familiar notion of ated web pages (e.g., based on search terms) oc- a set, but with membership intensionally defined casionally produce ungrammatical term variations (so distinct collections can have identical members, because this distinction is not addressed properly. which is impossible for sets). Every object in the Although learner dictionaries provide informa- Cyc ontology is a member (or instance,inCyc tion on the mass-count distinction, they are not parlance) of one or more collections. Collection suitable for this task because different senses of a membership is expressed using the predicate (i.e., word are often conflated in the definitions for the relation-type) isa, whereas collection subsumption sake of simplicity. In cases like this, the word or is expressed using the transitive predicate genls sense might be annotated as being both count and (i.e., generalization). These predicates correspond mass, perhaps with examples illustrating the dif- to the set-theoretic notions element of and subset ferent usages. This is the case for ‘chicken’ from of respectively and thus are used to form a partially the Cambridge International Dictionary of English ordered hierarchy of concepts. For the purposes of (Procter, 1995), defined as follows: this discussion, the isa and genls assertions on a a type of bird kept on a farm for its eggs or Cyctermconstituteitstype definition. its meat, or the meat of this bird which is Figure 1 shows the type definition for Physi- cooked and eaten calDevice, a prototypical denotatum term for count nouns. The type definition of PhysicalDevice indi- This work describes an approach for automati- cates that it is a collection that is a specialization cally inferring the parts of speech for lexical map- of Artifact, etc. As is typical for terms referred to pings, using the existing lexical assertions in the by count nouns, it is an instance of the collection CycKB.Wearespecificallyconcerned with select- ExistingObjectType. ing parts of speech for entries in a semantic lexi- Figure 2 shows the type definition for Water,a con, not about determining parts of speech in con- prototypical denotation for mass nouns. Although text. After an overview of the Cyc KB in the next the asserted type information for Water does not section, Section 3 discusses the approach taken to convey any properties that would suggest a mass inferring the part of speech for lexicalizations. Sec- noun lexicalization, the genls hierarchy of collec- tion 4 then covers the classification results. This is tions does. In particular, the collection Chemical- followed by a comparison to related work in Section CompoundTypeByChemicalSpecies is known to be 5. a specialization of the collection ExistingStuffType, via the transitive properties of genls. Thus, by 2 Cyc knowledge base virtue of being an instance of ChemicalCompound- In development since 1984, the Cyc knowledge base TypeByChemicalSpecies, Water is known to be an (Lenat, 1995) is the world’s largest formalized rep- instance of ExistingStuffType. This illustrates that resentation of commonsense knowledge, contain- the decision procedure for the lexical mapping ing over 120,000 concepts and more than a mil- speech parts needs to consider not only asserted, lion axioms.1 Cyc’s upper ontology describes but also inherited collection membership. the most general and fundamental of distinctions (e.g., tangibility versus intangibility). The lower 2.2 English lexicon ontology contains facts useful for particular appli- Natural language lexicons are integrated directly cations, such as web searching, but not necessar- into the Cyc KB (Burns and Davis, 1999). Though ily required for commonsense reasoning (e.g., that several lexicons are included in the KB, the English 1These figures and the results discussed later are lexicon is the only one with general coverage. The basedonCycKBversion576andOpenCycKBversion mapping from nouns to concepts is done using one 567. of two general strategies, depending on whether the 2 Collection: Water Usage Microtheory: UniversalVocabularyMt Predicate OpenCyc Cyc isa: ChemicalCompoundTypeByChemicalSpecies multiWordString 1123 24606 Microtheory: UniversalVocabularyMt denotation 2080 16725 genls: Individual compoundString 318 2226 Microtheory: NaivePhysicsVocabularyMt headMedialString 200 942 genls: Oxide total 3721 44499 Figure 2: Type definition for Water. Table 1: Denotational predicate usage in Cyc English lexicon. This excludes slang and jargon. mapping is from a name or a common noun phrase. Several different binary predicates indicate name- Usage to-term mappings, with the name represented as a SpeechPart OpenCyc Cyc string. For example, CountNoun 2041 21820 MassNoun 566 9993 (nameString HEBCompany “HEB”) Adjective 262 6460 Verb 659 2860 A denotational assertion maps a phrase into a AgentiveNoun 81 1389 concept, usually a collection. The phrase is spec- ProperCountNoun 16 906 ified via a lexical word unit (i.e., lexeme concept) Adverb 50 310 with optional string modifiers. The part of speech ProperMassNoun 1 286 is specified via one of Cyc’s SpeechPart constants. GerundiveNoun 7 275 Syntactic information, such as the wordform vari- other 39 185 ants and their speech parts, is stored with the Cyc total 3721 44499 constant for the word unit. For example, Device- TheWord, the Cyc constant for the word ‘device,’ Table 2: Most common speech parts in deno- has a single syntactic mapping since the plural tational assertions. The other entry covers 20 form is inferable: infrequently used cases.