A Taxonomy for English Nouns and Verbs
Robert A. Amsler
Computer Sciences Department
University of Texas, Austin, TX 78712

ABSTRACT: The definition texts of a machine-readable pocket dictionary were analyzed to determine the disambiguated word sense of the kernel terms of each word sense being defined. The resultant sets of word pairs of defined and defining words were then computationally connected into two taxonomic semi-lattices ("tangled hierarchies") representing some 24,000 noun nodes and 11,000 verb nodes. The study of the nature of the "topmost" nodes in these hierarchies, and the structure of the trees reveal information about the nature of the dictionary's organization of the language, the concept of semantic primitives and other aspects of lexical semantics. The data proves that the dictionary offers a fundamentally consistent description of word meaning and may provide the basis for future research and applications in computational linguistic systems.

1. INTRODUCTION

In the late 1960's, John Olney et al. at System Development Corporation produced machine-readable copies of the Merriam-Webster New Pocket Dictionary and the Seventh Collegiate Dictionary. These massive data files have been widely distributed within the computational linguistic community, yet research upon the basic structure of the dictionary has been exceedingly slow and difficult due to the significant computer resources required to process tens of thousands of definitions.

The dictionary is a fascinating computational resource. It contains spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propriety; etymological, syntactic and semantic information about the most basic units of the language. Accompanying definitions are example sentences which often use words in prototypical contexts. Thus the dictionary should be able to serve as a resource for a variety of computational linguistic needs. My primary concern within the dictionary has been the development of dictionary data for use in understanding systems. Thus I am concerned with what dictionary definitions tell us about the semantic and pragmatic structure of meaning. The hypothesis I am proposing is that definitions in the lexicon can be studied in the same manner as other large collections of objects such as plants, animals, and minerals are studied. Thus I am concerned with enumerating the classificational organization of the lexicon as it has been implicitly used by the dictionary's lexicographers.

Each textual definition in the dictionary is syntactically a noun or verb phrase with one or more kernel terms. If one identifies these kernel terms of definitions, and then proceeds to disambiguate them relative to the senses offered in the same dictionary under their respective definitions, then one can arrive at a large collection of pairs of disambiguated words which can be assembled into a taxonomic semi-lattice. This task has been accomplished for all the definition texts of nouns and verbs in a common pocket dictionary. This paper is an effort to reveal the results of a preliminary examination of the structure of these databases.

The applications of this data are still in the future. What might these applications be?

First, the data should provide information on the contents of semantic domains. One should be able to determine from a lexical taxonomy what domains one might be in given one has encountered the word "periscope", or "petiole", or "petroleum".

Second, dictionary data should be of use in resolving semantic ambiguity in text. Words in definitions appear in the company of their prototypical associates.

Third, dictionary data can provide the basis for creating case grammar descriptions of verbs, and noun argument descriptions of nouns. Semantic templates of meaning are far richer when one considers the taxonomic inheritance of elements of the lexicon.

Fourth, the dictionary should offer a classification which anthropological linguists and psycholinguists can use as an objective reference in comparison with other cultures or human memory observations. This isn't to say that the dictionary's classification is the same as the culture's or the human mind's, only that it is an objective datum from which comparisons can be made.

Fifth, knowledge of how the dictionary is structured can be used by lexicographers to build better dictionaries.

And finally, the dictionary if converted into a computer tool can become more readily accessible to all the disciplines seeking to use the current paper-based versions. Education, historical linguistics, sociology, English composition, etc. can all make steps forward given that they can assume access to a dictionary is immediately available via computer. I do not know what all these applications will be and the task at hand is simply an elucidation of the dictionary's structure as it currently exists.

2. "TANGLED" HIERARCHIES OF NOUNS AND VERBS

The grant, MCS77-01315, "Development of a Computational Methodology for Deriving Natural Language Semantic Structures via Analysis of Machine-Readable Dictionaries", created a taxonomy for the nouns and verbs of the Merriam-Webster Pocket Dictionary (MPD), based upon the hand-disambiguated kernel words in their definitions. This taxonomy confirmed the anticipated structure of the lexicon to be that of a "tangled hierarchy" [8,9] of unprecedented size (24,000 noun senses, 11,000 verb senses). This data base is believed to be the first to be assembled which is representative of the structure of the entire English lexicon. (A somewhat similar study of the Italian lexicon has been done [2,11].) The content categories agree substantially with the semantic structure of the lexicon proposed by Nida [15], and the verb taxonomy confirms the primitives proposed by the San Diego LNR group [16].

This "tangled hierarchy" may be described as a formal data structure whose bottom is a set of terminal disambiguated words that are not used as kernel defining terms; these are the most specific elements in the structure. The tops of the structure are senses of words such as "cause", "thing", "class", "being", etc. These are the most general elements in the tangled hierarchy. If all the top terms are considered to be members of the metaclass "<word-sense>", the tangled forest becomes a tangled tree.

The terminal nodes of such trees are in general each connected to the top in a lattice. An individual lattice can be resolved into a set of "traces", each of which describes an alternate path from terminal to top. In a trace, each element implies the terms above it, and further specifies the sense of the elements below it.

The collection of lattices forms a transitive acyclic digraph (or perhaps more clearly, a "semi-lattice", that is, a lattice with a greatest upper bound, <word-sense>, but no least lower bound). If we specify all the traces composing such a structure, spanning all paths from top to bottom, we have topologically specified the semi-lattice. Thus the list on the left in Figure 1 topologically specifies the tangled hierarchy on its right.

    (a b c e f)             a
    (a b c g k)             |
    (a b d g k)             b
    (a b c g l)            / \
    (a b d g l)           c   d
    (a b c g m)          / \ / \
    (a b d g m)         e   g   i
    (a b d i)           |  /|\
                        f k l m

    Figure 1. The Trace of a Tangled Hierarchy

2.1 TOPMOST SEMANTIC NODES OF THE TANGLED HIERARCHIES

Turning from the abstract description of the forest of tangled hierarchies to the actual data, the first question which was answered was, "What are the largest tangled hierarchies in the dictionary?".

... upon these measurements. For example, "g" has the most single-level descendants, 3, yet it is neither at the top of the tangled hierarchy, nor does it have the highest total number of descendants. The root node "a" is at the top of the hierarchy, yet it only has 1 single-level descendant. For nodes to be considered of major importance in a tangled hierarchy it is thus necessary to consider not only their total number of descendants, but whether these descendants are all actually immediately under some other node to which this higher node is attached. As we shall see, the nodes which have the most single-level descendants are actually more pivotal concepts in some cases.

Turning to the actual forest of tangled hierarchies, Table 2 gives the frequencies of the size and depth of the largest noun hierarchies and Table 3 gives the sizes alone for verb hierarchies (depths were not computed for these, unfortunately).

    Table 2. Frequencies and Maximum Depths of MPD Tangled Noun Hierarchies

    Size  Depth  Noun sense          Size  Depth  Noun sense
    3379     10  ONE-2.1A            1068     13  MEASUREMENT-1.2A
    2121     12  BULK-1.1A           1068     **  DIMENSION-.1A
    1907     10  PARTS-1.1A/!        1061     **  LENGTH-.1B
    1888     10  SECTIONS-.2A/!      1061     **  DISTANCE-1.1A
    1887      9  DIVISION-.2A        1061     14  DIMENSIONS-.1A
    1832      9  PORTION-1.4A        1060     11  SIZE-1.0A
    1832      8  PART-1.1A           1060     13  MEASURE-1.2A
    1486     14  SERIES-.0A          1060     10  EXTENT-.1A
    1482     18  SUM-1.1A            1060     14  CAPACITY-.2A
    1461     **  AMOUNT-2.2A          869      7  HOUSE-1.1A/+
    1459      8  ACT-1.1B             836      7  SUBSTANCE-.2B
    1414     **  TOTAL-2.0A           836      8  MATTER-1.4A
    1408     15  NUMBER-1.1A          741      8  NEWS-.2A/+
    1379     14  AMOUNT-2.1A          740      6  PIECE-1.2B
    1337      8  ONE-2.2A             740      7  ITEM-.2A
    1204      5  PERSON-.1A           686      7  ELEMENTS-.1A
    1201     14  OPERATIONS-.1A/+     684      6  MATERIAL-2.1A
    1190     **  PROCESS-1.4A         647      9  THING-.4A
    1190     14  ACTIONS-.2A/+        642      8  ACT-1.1A
    1123      6  GROUP-1.0A/!         535      6  THINGS-.5A/!
    1101     12  FORM-1.13A           533      6  MEMBER-.2A
    1089     12  VARIETY-.4A          503     10  PLANE-4.1A
    1083     11  MODE-.1A             495      6  STRUCTURE-.2A
    1076     10  STATE-1.1A           494     10  RANK-2.4A
    1076      9  CONDITION-1.3A       493      9  STEP-1.3A
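The trace representation of Figure 1, and the distinction drawn above between single-level descendants and total descendants, can be illustrated with a small sketch. It is not the program used in the study. The function names (build_children, descendants, all_traces) and the encoding of each trace as a string of one-letter node names are assumptions made purely for illustration. The sketch rebuilds the parent-to-child links from the eight traces of Figure 1, counts both kinds of descendants for every node, and re-enumerates the top-to-bottom paths to check that the traces do specify the structure.

```python
# Hypothetical illustration: the eight traces of Figure 1, each written as a
# string of one-letter node names running from the top node to a terminal.
TRACES = [
    "abcef", "abcgk", "abdgk", "abcgl",
    "abdgl", "abcgm", "abdgm", "abdi",
]


def build_children(traces):
    """Map every node to the set of nodes immediately below it."""
    children = {}
    for trace in traces:
        for node in trace:
            children.setdefault(node, set())
        # Adjacent elements of a trace are a parent and its immediate child.
        for parent, child in zip(trace, trace[1:]):
            children[parent].add(child)
    return children


def descendants(children, node):
    """All nodes below `node`, each counted once even if reachable by several paths."""
    seen = set()
    stack = list(children[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children[current])
    return seen


def all_traces(children, node):
    """Enumerate every path from `node` down to a terminal node."""
    if not children[node]:
        return [node]
    return [
        node + rest
        for child in sorted(children[node])
        for rest in all_traces(children, child)
    ]


children = build_children(TRACES)
single_level = {node: len(kids) for node, kids in children.items()}
total = {node: len(descendants(children, node)) for node in children}

for node in sorted(children):
    print(node, "single-level:", single_level[node], "total:", total[node])
```

In this sketch a node such as "g" is reached through both "c" and "d" but is counted only once among the descendants of "b", which is what makes the hierarchy "tangled" rather than a plain tree. The counts for "a" and "g" match the ones quoted in the text: "a" has one single-level descendant and "g" has three.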