Conceptual Association for Compound Noun Analysis

CONCEPTUAL ASSOCIATION FOR COMPOUND NOUN ANALYSIS Mark Lauer Microsoft Institute Department of Computing 65 Epping Road Macquarie University North Ryde NSW 2113 NSW 2109 (t-markl @ microsoft.corn) AUSTRALIA (mark @macadam, mpce. mq.edu .au) Abstract dictionaries (e.g., Vanderwende, 1993). Other approaches have used rather simpler, statistical This paper describes research toward the automatic analyses of large corpora, as is done in this work. interpretation of compound nouns using corpus Hindle and Rooth (1993) used a rough parser statistics. An initial study aimed at syntactic to extract lexical preferences for prepositional phrase disambiguation is presented. The approach presented (PP) attachment. The system counted occurrences of bases associations upon thesaurus categories. unambiguously attached PPs and used these to define Association data is gathered from unambiguous cases LEXICAL ASSOCIATION between prepositions and the extracted from a corpus and is then applied to the nouns and verbs they modified. This association data analysis of ambiguous compound nouns. While the was then used to choose an appropriate attachment for work presented is still in progress, a first attempt to ambiguous cases. The counting of unambiguous cases syntactically analyse a test set of 244 examples shows in order to make inferences about ambiguous ones is 75% correctness. Future work is aimed at improving adopted in the current work. An explicit assumption is this accuracy and extending the technique to assign made that lexical preferences are relatively semantic role information, thus producing a complete independent of the presence of syntactic ambiguity. interpretation. Subsequently, Hindle and Rooth's work has been extended by Resnik and Hearst (1993). Resnik INTRODUCTION and Hearst attempted to include information about typical prepositional objects in their association data. Compound Nouns: Compound nouns (CNs) are a They introduced the notion of CONCEPTUAL commonly occurring construction in language ASSOCIATION in which associations are measured consisting of a sequence of nouns, acting as a noun; between groups of words considered to represent pottery coffee mug, for example. For a detailed concepts, in contrast to single words. Such class-based linguistic theory of compound noun syntax and approaches are used because they allow each semantics, see Levi (1978). Compound nouns are observation to be generalized thus reducing the amount analysed syntactically by means of the rule N --¢ N N of data required. In the current work, a freely available applied recursively. Compounds of more than two version of Roget's thesaurus is used to provide the nouns are ambiguous in syntactic structure. A grouping of words into concepts, which then form the necessary part of producing an interpretation of a CN basis of conceptual association. The research is an analysis of the attachments within the compound. presented here can thus be seen as investigating the Syntactic parsers cannot choose an appropriate application of several key ideas in Hindle and Rooth analysis, because attachments are not syntactically (1993) and in Resnik and Hearst (1993) to the solution governed. The current work presents a system for of an analogous problem, that of compound noun automatically deriving a syntactic analysis of arbitrary analysis. However, both these works were aimed CNs in English using corpus statistics. solely at syntactic disambiguation. The goal of Task description: The initial task can be semantic interpretation remains to be investigated. formulated as choosing the most probable binary bracketing for a given noun sequence, known to form a compound noun, without knowledge of the context. METHOD E.G.: (pottery (coffee mug)); ((coffee mug) holder) Extraction Process: The corpus used to collect Corpus Statistics: The need for wide information about compound nouns consists of some ranging lexical-semantic knowledge to support NLP, 7.8 million words from Grolier's multimedia on-line commonly referred to as the ACQUISITIONPROBLEM, encyclopedia. The University of Pennsylvania has generated a great deal of research investigating morphological analyser provides a database of more automatic means of acquiring such knowledge. Much than 315,000 inflected forms and their parts of speech. work has employed carefully constructed parsing The Grolier's text was searched for consecutive words systems to extract knowledge from machine readable 337 listed in the database as always being nouns and analyse ambiguous CNs. Suppose the compound separated only by white space. This prevented consists of three nouns: wl w2w3. A left-branching comma-separated lists and other non-compound noun analysis, [[wl w2] w3] indicates that wl modifies w2, sequences from being included. However, it did while a right-branching analysis, [wl [w2 w3]] indicates eliminate many CNs from consideration because many that wl modifies something denoted primarily by w3. A nouns are occasionally used as verbs and are thus modifier should be associated with words it modifies. ambiguous for part of speech. This resulted in 35,974 So, when CA(pottery, mug) >> CA(pottery, coffee), we noun sequences of which all but 655 were pairs. The prefer (pottery (coffee mug)). First though, we must first 1000 of the sequences were examined manually to choose concepts for the words. For each wi (i = 2 or check that they were not incidentally adjacent nouns 3), choose categories Si (with wl in Si) and Ti (with wi (as in direct and indirect objects, say). Only 2% did not in Ti) so that CA(Si, Ti) is greatest. These categories form CNs, thus establishing a reasonable utility for the represent the most significant possible word meanings extraction method. The pairs were then used as a for each possible attachment. Then choose wi so that training set, on the assumption that a two word noun CA(Si, Ti) is maximum and bracket wl as a sibling of compound is unambiguously bracketed) wi. We have then chosen the attachment having the Thesaurus Categories: The 1911 version of most significant association in terms of mutual Roget's Thesaurus contains 1043 categories, with an information between thesaurus categories. average of 34 single word nouns in each. These In compounds longer than three nouns, this categories were used to define concepts in the sense of procedure can be generalised by selecting, from all Resnik and Hearst (1993). Each noun in the training possible bracketings, that for which the product of set was taagged with a list of the categories in which it greatest conceptual associations is maximized. appeared." All sequences containing nouns not listed in Roget's were discarded from the training set. RESULTS Gathering Associations: The remaining 24,285 pairs of category lists were then processed to Test Set and Evaluation: Of the noun sequences find a conceptual association (CA) between every extracted from Grolier's, 655 were more than two ordered pair of thesaurus categories (ti, t2) using the nouns in length and were thus ambiguous. Of these, formula below. CA(t1, t2) is the mutual information 308 consisted only of nouns in Roget's and these between the categories, weighted for ambiguity. It formed the test set. All of them were triples. Using measures the degree to which the modifying category the full context of each sequence in the test set, the predicts the modified category and vice versa. When author analysed each of these, assigning one of four categories predict one another, we expect them to be possible outcomes. Some sequences were not CNs (as attached in the syntactic analysis. observed above for the extraction process) and were Let AMBIG(w) = the number of thesaurus labeled Error. Other sequences exhibited what Hindle categories w appears in (the ambiguity of w). and Rooth (1993) call SEMANTIC INDETERMINACY, Let COUNT(wb w2) = the number of instances of where the meanings associated with two attachments Wl modifying w2 in the training set cannot be distinguished in the context. For example, Let FREQ(t~, t2) = college economics texts. These were labeled COUNT(w~, w~) Indeterminate. The remainder were labeled Left or Right depending on whether the actual analysis is left- ,t "~ a ~ "~m ,2 AMBIG(w,)" AMBIG(w2) or right-branching. Let CA (tb t2) = FREQ(tl, t 2) TABLE 1 - Test set analysis distribution: FREQ(t,,i)- ~FREQ(i, t 2) Vi Vi Labels L R I E Total where i ranges over all possible thesaurus categories. Count 163 81 35 29 308 Note that this measure is asymmetric. CA(tbt2) Percentage 53% 26% 11% 9% 100% measures the tendency for tl to modify t2 in a compound noun, which is distinct from CA(t2, tO. Proportion of different labels in the test set. Automatic Compound Noun Analysis: The following procedure can be used to syntactically Table 1 shows the distribution of labels in the test set. Hereafter only those triples that received a bracketing (Left or Right) will be considered. I This introduces some additional noise, since extraction can not guarantee to produce complete noun compounds The attachment procedure was then used to 2 Some simple morphological rules were used at this point to automatically assign an analysis to each sequence in reduce plural nouns to singular forms 338 the test set. The resulting correctness is shown in be given ambiguous noun compounds and asked to Table 2. The overall correctness is 75% on 244 choose attachments for them. examples. The results show more success with left Finally, syntactic bracketing is only the first branching attachments, so it may be possible to get step in interpreting compound nouns. Once an better overall accuracy by introducing a bias. attachment is established, a semantic role needs to be selected as is done in SENS. Given the promising TABLE 2 - Results of test: results achieved for syntactic preferences, it seems likely that semantic preferences can also be extracted x Output Left Output Right from corpora. This is the main area of ongoing research within the project. Actual Left 131 32 Actual Right 30 51 CONCLUSION The proportions of correct and incorrect analyses.

Conceptual Association for Compound Noun Analysis

Compound Word Formation.Pdf

Hunspell – the Free Spelling Checker

Building a Treebank for French

Compounds and Word Trees

A Poor Man's Approach to CLEF

Robust Prediction of Punctuation and Truecasing for Medical ASR

Lexical Analysis

Name Description

Part-Of-Speech Tagging Guidelines for the Penn Treebank Project (3Rd Revision)

Learning to SMILE

Building a Large Annotated Corpus of English: the Penn Treebank

Real-Time Statistical Speech Translation