<<

From: AAAI Technical Report SS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Collocations, and MT

Dirk Heylen [email protected] OTS,Utrecht University, The Netherlands Kerry G. Maxwell [email protected] CL/MTGroup, University of Essex, United Kingdom Susan Armstrong-Warwick susan~divsun.unige.ch ISSCO,University of Geneva, Switzerland

1 Introduction tic operations ’1 is concerned with investigat- ing these claims and their consequences for the automatic of collocations. Collocations pose specific problems in trans- lation (both human and machine translation). Adopting this approach, leads rather nat- For the native speaker of English it maybe ob- urally towards an interlingua system, using vious that you ’pay attention’, but for a native the lexical functions as some sort of semantic speaker of Dutch it would have been muchsim- primitives. Being rather sceptical about the pler if in English people ’donated attention.’ use of such concepts, we set ourselves the task Within an MTsystem, we can deal with these to find out whether the ’empirical basis’ (collo- mismatches in different ways. Simply adding cations) constitute sufficient ground to postu- the entry to our bilingual saying late this specific inventory of semantic primi- ’pay’ is the translation of ’schenken’, leaves us tives. This has lead us into the investigation of with the job of specifying in which contexts corpora and dictionaries to confront abstract we can use this equivalence. A more elaborate concepts with real data. Although not one of dictionary might list the complete collocation our tasks, these investigations have lead to a alongside with its translation. Another possi- number of suggestions on how one could pos- bility would be to adopt an interlingua and sibly extract relevant information from these have (harmonious) monolingual components sources for the construction of collocational take care of mapping ’pay’ and ’schenken’ to dictionaries. In this paper we will focus our the same meaning representation. The idea attention on the methods to extract informa- of investigating the latter possibility suggested tion from dictionaries. itself while looking at the analysis of colloca- tions in terms of lexical functions by Mel’5uk and his followers (as presented for instance in their Explanatory Combinatory Dictionar- ies [Mel’~uk et al., 1984]). The claim that these lexical functions are universal and only few in number, would make them interesting 1Theresearch project (ET-10/75)is sponsored for the interlingua option. The ET-10 project the EuropeanCommission, L’Association Suissetra ’Collocations or the lexicalisation of seman- and OxfordUniversity Press.

75 2 Collocations and tion represents some meaningful operation Semantics which yields the (collocate) that typ- ically combines with the argument (head of Previous work ([Church and Hanks, 1989], the collocation) of the function to express [Smadja, to appear] and earlier work) on the the meaning of the function. Simplifying, acquisition of collocational information has Magn -- meaning something like "intense" concentrated on the extraction of collocations -- is expressed by heavy when said of smoker. from corpora based on some statistical mea- More complex functions can be constructed by sures. Programs such as XTRACT([Smadja, (1) combining functions; (2) adding different to appear]) are able to select an interesting col- kinds of sub- and superscripts. The following lection of potential ’collocations’. The notion examples illustrate this. of collocation that lies behind these investiga- "IncepFuncl(vent) = se lever tions is that of ’recurrent combination’. The FinFunel(vent) = se calmer notion we are investigating however, is differ- Magn[,con,equence,,] ( maladie) = sdrieuse ent as it only concerns collocations in which we The last example, clearly illustrates that the can clearly distinguish a collocate which has a basic lexical functions are in need of some ’general’ or ’non-literal’ reading. The mode refinement. The semantic component of the of defining ’collocations’ in our approach is argument which the intensifier relates to is different from the performance or probability explicitly mentioned in the subscript. In approach. Instead, we are interested in a [Heylen, 1992] we have suggested some ways linguistic or lexicographic way of defining col- to make lexical functions sensitive to the se- locations and more particularly the semantics mantic structure of the arguments, by incorpo- of collocations. The first approach leads to rating for instance qualia information ([Puste- the notion of collocations we have termed p- jovsky, 1991]) in the semantic representation collocation elsewhere ([Sloksma el at., 1992]), of nouns. Notice also that lexical functions whereas the second modeleads to a set of l- 2. go with the ideology that the association be- collocations tween heads and collocates is to be specified But what are the semantic aspects of col- lexically, because it is unpredictable. They do locations we are interested in exactly? In the not consider any kind of (sub)regularities that analysis by Mel’~ukof the kind of collocations may exist. we are interested in (mainly adjective-noun One of the aims of the project is to evaluate combinations and light verb constructions), Mel’~uk’s system of lexical functions. Among construction such as heavy smoker is analyzed the questions we are interested in are the fol- by means of a lexical function Magn. In the z lowing. Explanatory Combinatory Dictionary the en- 5 try for smoker contains information of the fol- ¯ Cancollocations be associated with lexical functions as the theory assumes they can? lowing kind. ¯ In what wayscan the notion of lexical func- Magn(smoker) = heavy tion be refined? - Can we makethem sensitive to the se- Mel’~uk claims that we need about 50 of mantic structure of their arguments? these ’universal’ functions to describe the se- -Can we make precise the parame- mantics of most collocations 4. A lexical func- ters that differentiate the possible val- Sin fact, it turns out that a further distinction ues of a function? For instance, we should be made between the lexlcographlc and the lln- might want to differentiate between guistic approach to collocations. Wewill not go into sharp/wide turns, or for instance be- further details here. tween certain "degrees of intensity" 3The dictionary that goes with the Meaning Text when we are considering Magnetc. Model. 4 Understood as l-collocations. 5At least a definable subset.

76 ¯ Canwe find someregular patterns in the for- organisation we investigate the lexicographic mation of collocations? conventions adopted and the overall rationale behind the organization of the dictionary. From a methodological point of view, it is important to be able to go beyond possi- bly deceiving intuitions. We therefore have 3.1 Lexicographic Conventions to provide for more reliable ways to get at our data. Listings of collocations as provided Of course in producing the dictionary a set of by XTRACT-Iikeprograms, are useful as fil- rules and general practices have to be devel- ters on the data to consider. But as we are oped in order to ensure internal consistency. concerned predominantly with semantic issues In other , where there is perceived flexi- whenevaluating the lexical function approach, bility in description/methodology, the lexicog- we need programs that can be made sensitive rapher may adopt a certain path simply be- to the meaning of the expressions considered. cause this corresponds to a convention among To get at a more interesting collection of data, those established for the construction of the we have looked for material from which cer- particular dictionary. Some of these conven- tain aspects of meaning can be read off in tions will ultimately reflect linguistic facts, one way or another. Of course, dictionaries whilst others will be purely arbitrary. Such conventions often remain implicit in the na- are one source which deals directly with the representation of meaning. But it is not the ture of the resulting dictionary and are never explicit encoding which is most interesting to directly explained, whilst others mayoccasion- ally be alluded to in the frontmatter of the look at. Indeed, it turns out that most dictio- naries do not provide the kind of information dictionary. For instance, in the frontmatter of we are interested in directly 6. It is the less ob- OALD3we find an explanation of the listing of vious structure of the dictionary entry which definitions and meanings of words, where the gives rise to interesting pieces of information. user is told that:- In the following sections we list a number of "...definitions are listed in order of ways we can use dictionaries to collect data on meanings from the most common or the semantic issues we are concerned with. most simple to the most rare or most complicated" (p.xiv) 3 Monolingual Dictionar- Such a convention may seem arbitrary 7 in the ies sense that it seems to rest on the intuitions of individual lexicographers, but what might In this section we investigate how far we it imply in our quest for information on the can gather information about collocations and semantics of collocates? Of course we should collocational behaviour by scrutinizing the note that "rare" and "complicated" and "sim- formats of learner’s dictionaries. We fo- ple" and "common" are not always mutually cus our attention on the Oxford Advanced prevailing characteristics, ie: a commonmean- Learner’s Dictionary, third edition [Hornby, ing is not necesarily simple nor a rare meaning 1974] (henceforth OALD3),but we would ex- complicated. A useful interpretation for our pect these observations to have more general purposes is perhaps that "rare/complicated" validity. We examine the dictionary on both corresponds to "figurative/nonstandard" and a macro- and microlexicographical level, ie: accordingly "simple/common" corresponds to in addition to looking at the details of entry 7Howeverwe should point out that lexicographers 6First of all, weare interested in ~abstract’seman- workto a set of explicit compilerinstructions which tic concept.Secondly, lexicographic tradition treats a maydetermine the most arbitrary decisions. Further, ntunberof different constructionsin the sameway they lexicographersshould be highly skilled with rich intu- treat the kind of collocationswe are interested in. itions about the languageunder description.

77 "standard" or "regular", so that in the case forms in different sentence patterns s, and the of entries for adjectives which form collocates, indication of the "style" or "context" in which the definitions which constitute the secondary, the or phrase is usually used. The latter or later explanations of meaning (ie: those means that example fields often:- that occur "further down" the entry) will be the more significant ones for our interests. "...include words or sorts of words In the discussion of the representation of id- that the is usually used iom structures, we are told that:- with.., for example at sensa- tional (2) there is a sensational "..to find an , look for it in the writer/newspaper." (p. xviii) entry for the most important word in the phrase or sentence (usually Such examples illustrate the potential for noun, verb or adjective)." (p. xvi) stumbling across example collocations, as well as hints about lexical selection, which mayin This means for instance that the idiom pick turn reveal issues in the semantics of colloca- holes in can be found in the entry for hole and tional structures (cf: section 3.3.1). that the idiom get hold of the wrong end of the To summarize - an integral aspect of the stick can be found under the entry for stick. Of rationale behind the dictionary is its exem- course, most "important" merely corresponds plary function in the learner context. This to what is perceived to be most salient, though means that information of relevance to the it may be that observation of what are consid- study of collocations may be implicit in exem- ered to be the most salient elements of plary form, as well as in standard headword may lead us to intuitions about where the se- and definition formats. mantic load lies in such structures. In sum, even though the conventions under- lying the description of such lexical phenom- 3.3 Collocational Information ena mayappear arbitrary at first sight, it may within Entries be that somereveal general linguistic insights, explain omissions, or provide pointers to the In this section we will investigate to what information in an entry that we are most in- extent information about collocational struc- terested in. We should at least be aware of tures is available in OALD3entries. The ob- them as we embark upon extraction of infor- servations made in this section are based on mation, automatic or otherwise. a study of the electronic version of the dictio- nary (henceforth OALD3E)undertaken dur- ing our project. For further documentation of 3.2 User Needs the results see [Heylen et al., 1992]. The elec- A natural consequence of the learner orien- tronic version of the dictionary contains en- tation of the OALD3is that the dictionary tries whose information fields are tagged by weighs heavily on explanatory and exemplary STiffs information may be patchy where the facts data, generally residing in specific example about collocational structures axe concerned; for in- fields, and, as will be shownin later sections, stance in the entry for attention OALD3illustrates this field of information proves extremely fruit- that the related support verb construction can be pas- ful in the collection of collocational data. The sivised,eg: No attention was paid to my advice, but says nothing in the entry for heavy about restrictions frontmatter details several reasons for the ex- on predicative use where heavy has a collocational tensive inclusion of example fields (p. xvii), reading ie: this smoker ii heavy can only mean that the key one being of course their general role the smoker has weight problems! Such gaps in cover- in reinforcing the explanation of the meaning age often arise where dictionaries concentrate on pro- viding positive rather than negative information, ie: of (groups of) words. Also worthy of note providing well-formed examples but not giving explicit the aim of illustrating words and multiword guidance as to what is no._.~t possible.

78 SGML-style markers. A simple search tool de- violentwind/attack/blows/temver/abuse veloped in-house was used to extract entries 2. caused by violent attack bearing a headword or example matching the violent death specified9. search string 3. severe The focus for this study was an investigation violent toothache of the coverage of Adjective Noun(henceforth The example details three senses of the ad- AN) collocations in OALD3E.In particular we jective. We might designate collocational sta- investigate the information available in adjec- tus to the third, violent assumes an intensify- tive entries, these proving to be a more fruit- ing function not explicitly related to its regular ful source of interesting ANstructures than use indicating force (cf: sense 1). noun entries. Adjective entries tend to list typ- Glues about collocational readings are also ical (nominal) modificands, the reverse is not found in glosses to individual examples, eg: we necessarily true (presumably since the set of see the gloss extreme against the sense of vi- modifiers given a particular noun is potentially olent in the example combination violent con- much larger and heterogenous). Further, ad- trast, where we might assume the adjective has jective entries provide us with secondary sense a similar function to that in sense 3 above. definitions and corresponding examples which So, although there is no explicit discrim- mayrelate directly to semantic aspects of col- ination of "collocational readings" listed in locational behaviour, as we discuss below. the dictionary, in sifting through the various senses of adjectives we find a number of can- 3.3.1 Searches on Adjective didate collocates (and indeed example colloca- The structures/combinations we are primarily tions). interested in are those which can be deemedto But it may not be merely the semantics of be collocational on somekind of linguistic ba- the adjectival collocate which is alluded to in sis. For AN-collocations we refer to the noun glosses and secondary senses. We might as- ’base’, a semantically autonomous element sume that the semantics of a noun is crucially preserving its standard interpretation, and an related to such factors as form, function and adjectival ’collocate’, an adjective which has causation (cf: the notion of Qualia as coined some kind of figurative or non-standard read- by Pustejovsky in a discussion of lexical se- ing, its precise interpretation being only deriv- mantic decomposition of nouns [Pustejovsky, able in the context of the noun base. These 1991]). These meaning aspects may also be being the kind of phenomenawe are attempt- reflected in sense definitions pertaining to par- ing to investigate, it would seem that we need ticular examples where the noun appears. For to be particularly attentive to those aspects of instance, the sense definition against the ex- an adjective entry devoted to explanations of ample violent death says "caused by violent at- secondary (or further alternative) senses. Such tack", reflecting the fact that death essentially explanations give clues as to the potential col- involves a "cause" of some kind (cf: Puste- locational readings of an adjective, and related jovsky’s Agentive role). The third sense of example fields contain candidate collocational the adjective regular (cf: below) refers to ex- structures. An example would be the OALD3 plicit properties of the nouns provided in the description of primary and alternative senses example, eg: "qualified" and "trained" of sol- for the adjective violent, and related examples, diers/army, (cf: Pustejovsky’s formal role). as shown below:- 3. properly qualified; recognised; trained 1. using, showing, accompanied by, great regular soldiers/army force 9Brined on a Perl regular expression, cf: [Wall and Example fields often group potential mod- Schwartz, 1990], pp.103-106. ificands of a specified adjective according

79 to some unwritten semantic criteria (which expressions, this may give clues as to the sometimes may be linked back to a partic- degree of syntactic/morphological versatility ular sense definition). For instance the sec- displayed by the collocate. In other words, ondary definition of the adjective formidable there are certain ANcombinations which we is supplied as "requiring great effort to deal may classify as collocational according to our with or overcome", the modificands obsta- basic perception, but which resemble com- cle/opposition/enemies//debts are supplied on pounds/fixed expressions in terms of eg: the mass, each in some sense representing a situ- lack of morphological versatility (compara- ation or body which must be confronted and tive/superlative formation) of the non-head. ultimately surmounted or "overcome". So we Such examples typically reside in compound observe that examplefields frequently list sets group fields (eg: tall/*taller/*tallest story, tall of modificands which may be perceived as be- story being listed in the compoundgroup field longing to a homogenous semantic class, ie: of the OALD3entry for the adjective tall). they have a commonproperty, or, more for- By contrast, those collocations higher on the mally, they share a semantic feature. Wemay versatility 11 scale seem more likely to crop up speculate about what this feature is, but, as in example fields eg: heavy rain. we illustrated, the sense definition of the ad- jective may give clues. In some sense then we gain pointers as to potential selectional re- 3.4 Summary and Conclusions strictions between an adjectival modifier and In sum, the observations made in the previ- the set of exemplary modificands. Certain ex- ous sections show that information on collo- amples seem to show adjectives predicating of cational structures can be obtained from an nouns which share particular properties. on-line dictionary, but that this information is Beyond sense definitions and related exam- practically always implicit, embedded in as- ples we also find candidate collocations in in- pects of the formal structure of the dictio- formation fields which are intended to house nary. We have performed automatic searches complex expressions such as idioms or com- for particular strings and then hand-filtered in- pound forms. An example is the entry for the formation from the entries which formed the adjective blind, where we find the examples output. The result of such hand-filtering re- blind spot, blind turning, blind flying, blind veals where the "hot spots" of information lie. alley, blind date. These complex expressions Various factors interact to place informa- may undoubtedly warrant various labels ac- tion at a certain place and present informa- cording to a given classification, but it maybe tion in a certain way. The information gleaned that somefit our criteria for collocational sta- from such ’first-pass’ searches, coupled with tus. Eg: if we take blind flying and blind turn- a knowledge of underlying lexicographic con- ing, we observe that both have a base with its ventions and the rationale behind the dictio- regular compositional meaning, and a modifier nary (ie: user needs, look-up strategies), can which derives its precise interpretation in the promote the simulation of more "intelligent" context of the base1°. Such examples indicate searches at a subsequent stage. For instance, that we should also target compound group we now knowthat example fields will probably fields as a potential "quarry" of candidate col- form one of the most productive active search locations. spaces. If we want data on adjective-noun col- If a collocational form occurs within an in- locations, we should target adjective entries formation slot devoted to compoundsor fixed primarily. Examplecollocations are likely to be found at example slots for sense definitions 1°blind in blind turning means "not easily seen by drivers" and blind in blind flying means "unable to llBy ~versatile’ we may also refer to syntactic as- see due to cloud/fog, therefore flying with the aid of pects, eg: predicative use: the rain was heavy/~the instruments only". story was tall

8O which are "lower down"the entry. It is in these pothesis that "meaning is a internal later sense definitions that we may find the affair". Translation equivalence is seen as a description of senses of adjectival collocates, mapping between different conceptualizations. from which we may be able to unpack some Though the work does not focus on colloca- kind of semantic information. If example mod- tions, it does recognize that a major problem ificands are ’lumped’ together, (commonlyde- remains in howto select the best language ex- limited by /) then we may have turned up pression for a given conceptualization. It is potential selectional restriction between mod- precisely this problem that we are trying to ifier and head (collocate and base).., and account for with the lexical functions. on. Wetherefore illustrate howon-line lexical Other studies of interest in view of extract- information, though only partially formal in ing information from bilingual machine read- nature, can be the focus of automatic searches able dictionaries includes the work reported which are crucially based on some kind of "in- on in [Calzolari, 1983] on establishing seman- telligent" strategy derived from knowledge of tic links and [Byrd et al., 1987] on mapping the nature of the dictionary. between entries and problems of symmetry. The monolingual dictionary work on navigat- ing through entries reported on in [Chodorow 4 Bilingual Dictionaries and Byrd, 1985] provides the basis for our work on exploring the data in bilingual dic- Examination of bilingual dictionaries can be tionaries. telling in various ways, not only in the collec- tion of contrastive data but also in the affir- 4.1 Accessing the data mation of ideas about the nature of colloca- tional behaviours. Of course contrastive anal- The bilingual dictionaries we have available ysis of collocational structures reveals a degree on-line are the Collins German, French and of complexity in translation, but it can also English, pocket edition [Collins, 1990] in all re- bring to light regularities about the monolin- spective pairs. The dictionaries cover the basic gual system. In addition the data can help of the (ca. 15,000 words instantiate or repudiate definitions of interlin- per direction) and include the major sense dis- gua concepts in view of translation. tinctions (differentiated only in so far as nec- In this section we will discuss what types of essary for translation purposes). Thoughrel- information we can find in bilingual dictionar- atively small, the advantages are that they ies and how this might help us towards a more are easy to parse and contain only minimal systematic account of collocations (in view of but essential information. Thus, for a first automatic translation). attempt at organizing some of the data in Thoughthe investigation of collocations has view of our goals to systematically account not been a focal point for past work with ma- for the relations between the words in colloca- chine readable dictionaries, we can identify a tional phrases, they provide an ideal starting number of related studies which have focussed point. As suggested above in the discussion of on semantics and translation issues. Bilingual monolingual dictionaries, we assume that the studies carried out in the ACQUILEXproject, data contained in these dictionaries reflects an for example, have investigated the linking of initial synthesis and organization of essential semantic hierarchies which have been built on semantic and translational considerations re- language internal concepts [Sanfilippo el al., garding the use of the words. 1992]. In a comparaison of noun taxonomies The dictionaries under consideration can be cross-linguistically (using word senses and hi- accessed via a network based dictionary con- erarchies derived from mono- and bilingual sultation tool developed at ISSCO for the dictionaries) [Vossen, 1991] defends the hy- University of Geneva [Petitpierre and Robert,

81 1991].12 The underlying format is an SGML one of the two languages and cannot be in- mark-up of the fields, identifying , terpreted automatically. The types of indica- grammatical information, sense distinctions, tors employed are only partially formalizable contextual information, , example (examples given here are German to French phrases, etc. This structure allows access to translations): the entries not only by headword, but also by words found in the different fields (via sec- ¯ synonyms ondary indexes). eg. stark (mKchtig) puissante gloss: ’strong’ in the sense of powerful ¯ typical collocations 4.2 Collocations and translation eg. stark (Schmerzen) vi olente gloss: ’strong’ pains As was the case with the monolingual informa- ¯ meta-type sem/syn information tion found in different fields, glosses and sense eg. stark (bei Massangabe) definitions in bilingual entries play a crucial gloss: ’strong’ used with measurements role in providing more information about the example: 2 cm " ---* 2 cm d’epaisseur semantics of the collocate. Fromthe user per- ¯ multl-word expressions spective, they are of course there to assist the eg. " Rancher --* grand fumeur choice of the correct translation, ie: they are gloss: ’strong’ smoker (i.e. heavy smoker) linked to the specification of a particular trans- lation where several alternatives are possible. 4.2.2 Typical collocations Fromour perspective, they give pointers to the possible interpretation of the collocate. Eg: The information in the entries is rich in both in the Collins French-English/English-French monolingual and bilingual semantic informa- Dictionary [Collins, 1990] we see (gain, re- tion. The ’semantic indicators’ and typical quire, bring/carry, etc) listed as senses of the phrases are often indicative of a set colloca- verb take. It is precisely these sense definitions tions (hypernyms, for example, are often used which maygive clues as to the interlingual con- to stand for a class of collocations). cepts which underlie such support verbs, con- Followingis a list of collocations taken from cepts which are independent of their realisa- just a few adjective entries. tion in a given language. Eg. take in English translated as remporter in French, but mean- ¯ deep -~ ing GAI~in both. (water, sorrow, thoughts)profond(e) (voice) grave ¯ hard --* dur(e) 4.2.1 Bilingual lexicographic conven- (work) dur tions (think, try) serieusement ¯ high --+ haut(e) There are a number of conventions (often im- plicit) used in bilingual lexicographical work (speed,respect, number)grand(e) which come under the global heading of "se- (price)aev (e) (wind) fort(e), violent(e) mantic indicators". The intention is to iden- (voice) aigu(aigu~) tify or restrict the semantic range of the word to be translated. Aside from the closed class of These simple entries are clearly useful for subject field markers (eg. NAUT,GEO, MED, gathering a starting list of collocations, all of etc.), these semantic indicators are little more them linguistically relevant l-collocations (as than clues for human readers well-versed in opposedto an initial set of p-collocations found

12The explicit SGMLmark-up also provides the in corpora which would require hand filtering). means for displaying the results of a given query as The grouping of objects may reflect seman- if a printed dictionary had been consulted. tic patterning, though not always, and in fact,

82 not very often. Support verbs in particular are become apparent in these examples, is the often listed with coherent semantic classes, eg. cross-linguistic verification of our definition of under "take" we can find classes like step, walk collocations. It also substantiates the claim or effort, courage. But as the sample transla- that the nouns (bases) the adjective (collo- tions for deep and high illustrate, we are of- cate) selects have some kind of semantic ho- ten confronted with groups that could hardly mogeneity. In essentially all cases (the ex- be viewed as semantically homogeneous. Nev- ceptions seem to represent metaphorical and ertheless, the examples given in the entries idiomatic expressions) the nouns retain their usually cover the basic sense distinctions and regular meaning, the adjectives are selected thus provide a useful set to test initial hy- (and ultimately interpreted) in the contexts potheses. A larger collection of data and more of the nouns, which do have a straightforward extensive empirical studies may help to iden- translation. (cf: observations in section 3.3.1). tify more general and also more refined classes Given our relatively limited understanding of of regularities that are not apparent in these these phenomenaand lack of ability to formal- selected examples. ize this system, we can perhaps better appre- One immediate way to extend this list of ex- ciate the rather intuitive and incomplete in- amples is to look for phrase or candidate collo- formation provided by the lexicographers and cations within the translations, i.e. searching hence the assymetries across entries. on the word contained in the translation and example fields. Following is a list of exam- 4.2.3 Sense distinctions and transla- ples extracted from a search for adjectives in tion the translation field (the headwordis given in boldface). Another type of information we find in the en- tries (other than specific collocations) could ¯ deep voice ~-- (voix/son) grave interpreted as sense disambiguation informa- ¯ deep hate ~ (mgpris) profond ¯ deep/great ~ vif (regret/deception) tion, i.e. when more than one translational equivalent is given without further restric- ¯ hard choice/problem ~ (choiz/probl~me) tions, we are usually confronted with at least difflcile partial synonyms. In the translation of the ¯ hard work *-- (travail) dur/ferme Germanadjective stark (strong) from French, ¯ hard (insensible) person (c oeur/- for example we find the following translations: personae) sec ¯ high price ~ (prix/sommel) 616v6 ¯ high returns ~ rendement fort ¯ farouche---* (voiontg, haine, rgsistance) ¯ high/low tide ~ mar6e haute/basse stark, heftig (gloss: heavy or forceful) ¯ high vitamin/mineral/etc content ~- ¯ intense --* stark, intensiv (gloss: inten- richesse en vitamines sive) ¯ high blood pressure ~-- faire ou avoir de ¯ violent ~ heftig, stark la tension ¯ virulent ~ stark, thdlich (gloss: deadly) In comparing this list with the examples The synonyms supplied for each transla- cited above (searching on headwords) it is im- tion choice seem to indicate a possible sense mediately apparent that the information pro- distinction. The underlying basic concept is vided is not symmetric. Given the serious that of intensification whichis precisely the in- space restrictions on these dictionaries in par- terlingual concept mediated by Mel’~uk’s LF ticular, this is not surprising, though a quick MAGN.However, in each case the meaning look at larger dictionaries will show similar discrepancies. 13 More importantly, what does talnlng the words we are interested in have been listed here; fixed expressions and other idiosyncractic exam- 13Notealso that only a subset of the entries con- ples were deleted.

83 is refined, as suggested by the glosses sup- scribed in [Chodorow and Byrd, 1985] which plied above. Following Pustejovsky’s notion of established chains through entries in order to Qualia [Pustejovsky, 1991], part of the choice build up a semantic hierarchy and the subse- for a given collocate will depend on the se- quent work on "bilingual sprouting" discussed mantics of the noun, but this must perhaps be in [Byrd et al., 1987]. In contrast to the work judged in interaction with particular senses of cited above, we did not attempt to find explicit a given adjective rather than the word itself. and formalized semantic information (such as It is unfortunate that translation dictionaries +HUMAN),nor to automatically derive tax- in general do not represent sense distinctions onomies. The intuition is simply that chain- in the translation fields. As these examples ing through words and their translations (from suggest, if we are looking to use the lexical headwords to translations and vice versa) we functions for generation, we may need to re- will discover synonymclasses across the lan- fine them on the basis of the senses in order to guages which can help refine our insights into choose from the set of possible translations. the semantic or conceptual differences within A next step would be to look at whether the and across languages. sense distinctions found in monolingual dic- In a sample chain beginning with the word tionaries can be of service in helping to clas- heavy in English, we find three basic trans- sify translation equivalents (cf. the discus- lations in French, i.e. iourde, gros and grand. sion in the following section); another is to The entry for the adjective lourd lists only the look for the semantic structures and primitives one translation heavy but occurs in the entries provided in thesauri. A search on the Roget for ’close’, ’heavy’, ’hefty’, ’weighty’, among [Roget, 1990], again searching from others. The entry for gros gives as transla- within, i.e. all classes where strong is given tions big, large, fat, extensive, thick, heavy and as example, provided some initial categories. occurs in ’big’, ’broadly’, ’fat’, ’heavy’, ’hefty’, Three major categories which included strong ’roughly’, etc. as a possible adjective were "quantity by com- As the example above suggests, after only parison with a standard", "degree of power" one step through the chain, we can quickly col- and "physical energy". Though the first cat- lect a set of words that clearly share a subset egory does not seem to help in distinguishing of semantic properties. The differences will the translations given above, the latter two do have to be accounted for in a systematic way help in distinguishing stark as "intense" and if we are not going to list every possible com- stark as "violent". This very preliminary in- bination in view of translation and subsequent vestigation, though far from giving a system- generation. It is clear that a lexical function atic account of this problem, suggests the po- such a MAGNwill account for any number of tential of combining the information from a collocations containing the words given above, number of resources in view of working to- but not why one word is chosen over another. wards more general accounts of the data. In the entry for heavy, for example, we are also supplied with some contextual infor- 4.3 Semantic classes and chains mation necessary for choosing the appropriate equivalent. I.e. it translates as gros in the Given the numerous pairs of bilingual dictio- context of work, sea, rain, eater but as grand naries and the ability to search inside entries, in the case of drinker and smoker. It is also we explored the idea of chaining through en- interesting to note that two of the three main tries. Starting with a headword in one lan- sense distinctions given in OALDcorrespond guage we find a number of translations which approximately to the first two translations, i.e. in turn serve as the query both as headwordor 1. weight, and 2. size, force, amount. A pos- as occurring in a translation field. This work sible line of investigation would be to see how is similar to the "sprouting" mechanism de- well these distinctions overlap with the seman-

84 tic14. groupings found in thesuri For instance, simple chaining through the Another study starting with the adjective dictionaries provides us with synonymsets (or fort in French and chaining through German some kind of equivalence set) which capture and English, yielded the following set of se- facets of meaning which may point to a lex- mantically related adjectives. ical function. But one could try to go be- farouche, intense, puissant, violent, viru- yond the simple identification of lexical func- lent, fort, doug, capable, vigoureux, solide, vif, tions and look for further evidence for gener- haul, grand, dlevd, aigu, grand, formidable, alizations and refinements. Indeed, one of the gros, corpulente, dur, s~rieusement, agressive, drawbacks of the lexical function approach as bruyant, sonore, beaucoup used in the ECDis that each function has to Though every native speaker would know be enumerated completely. In this account, as precisely which of these adjectives could oc- in manyothers, collocations are considered to cur with any given noun, we do not have any be totally unpredictable combinations. How- even semi-formal account of why this is the ever, looking at collocations one is sometimes case let alone how they will behave in trans- struck by the semantic patterns suggested by lation. One question to be studied is whether the data. Again there is a need to find more we can identify subsets of this class that be- evidence to be able to tackle such questions as: have similarly with respect to collocations. It what kind of regularities are we talking about, is this type of data and more that we will need to what extent are they there, and how can we to analyze and test our hypotheses on. account for them?

A second drawback concerns the ’refine- Conclusion ment’ of the functions. Lexical functions of- ten pick out one semantic facet of their argu- As the title suggests, the main focus of the ment to operate on. Mel’~uk himself already project "Collocations and the Lexicalisation of makes some provisions, but does not provide a Semantic operations" is the semantics involved systematic theoretical account of these exten- in collocational constructions. In Mel’Suk’s sions. One could also turn to other theories approach, the meaning of each ’collocate’ can of semantic structure such as Pustejovsky’s be identified as an instance of one of the fifty Qualia theory. We are then lead to the ques- or so Lexical Functions. Whether or not each tion how these theories fit the phenomenonwe collocation can be assigned such a function, are describing, and again, what kind of evi- and whether or not these functions make any dence we can use to decide on particular anal- sense, is partly a matter of empirical investi- yses. gation. In this paper, we have discussed some pos- Collocations are an inspiring source of in- sibilities in which this empirical investigation teresting hypotheses concerning all kinds of can be carried out. Although most dictionaries lexical semantic issues. Working on them we do not immediately provide the information have realised however, that conjectures need we are after, they are a rich source of lexico- thorough testing. The previous sections have semantic knowledge which can be exploited in served to identify the different types of lexi- many ways for various purposes. The primary cal resources we can exploit in our study of use we have been talking about involves the re- collocations. Electronic access to various col- search on the semantics of collocations rather lections of data, i.e. mono-and bilingual dic- than the automatic acquisition of lexical infor- tionaries, and thesauri provide the basis for a mation. more in-depth exploration of the phenomena. A next step is to refine the methods for picking 14A simple program to extract phrases from Roger identified ca. 20 expressions containing heavy as an out relevant data and combining the informa- adjective. tion gained from different perspectives.

85 References [Hornby, 1974] A.S. Hornby. The Oxford Ad- vanced Learner’s Dictionary of English. Ox- [Bloksma et al., 1992] Laura Bloksma, Dirk ford University Press, third edition edition, Heylen, and R. Lee IIumphreys. Charac- 1974. terisation. Deliverable ET-10/75 [C.1], Col- locations and the Lexicalisation of Semantic [Mel’~ukel al., 1984] Igor Operations, Utrecht, July 1992. Mel’~uk, Nadia Arbatchewsky-Jumarie, L~o Elnitsky, Lidija Iordanskaja, and Addle [Byrd et al., 1987] R. Byrd, N. Calzolari, Lessard. Diclionnaire explicalif el combina- M. Chodorow, J. Klavans, and M. Neff. toire du fran~ais contemporain. Les Presses Tools and methods for computational lin- de l’Universitd de Montreal, Montreal, 1984. guistics. Computational Linguistics, 13(3- 4):219-240, 1987. [Petitpierre and Robert, 1991] D. Petitpierre and G. Robert. Dico- a network based dic- [Calzolari, 1983] N. Calzolari. Semantic links tionary consultation tool. In SGAICOpro- and the dictionary. In Proceedings of the Sixth International Conference on Comput- ceedings, Neuchatel, 1991. ers and the Humanities, pages 47-50, Mary- [Pustejovsky, 1991] James Pustejovsky. The land, 1983. generative . Computational Linguis- [Chodorow and Byrd, 1985] M. Chodorow tics, 17(4), October-December 1991. and R. Byrd. Extracting semantic hierar- [Roger, 1990] Roger. Roger’s thesaurus, 1911, chies from a large on-line dictionary. In 1990. Association for Computational Linguistics proceedings, pages 299-304, Chicago, 1985. [Sanfilippo et al., 1992] A. Sanfil- ippo, T. Briscoe, A. Copestake, Maria An- [Church and Hanks, 1989] Kenneth Ward tonia Martl, Mariona Taul~, and Antonietta Church and Patrick Hanks. Word associ- Along. Translation equivalence and lexical- ation norms, mutual information and lex- ization in the acquilex lk. In Proceedings of icography. In 27th Annual Meeting of the TMI 4, pages 1-11, Montreal, 1992. Association for Computational Linguistics, Vancouver, 1989. [Smadja, to appear] Frank Smadja. Retriev- [Collins, 1990] Collins, editor. Collins French- ing collocations from text: Xtract. Compu- English English-French Dictionary, Collins tational Linguistics, to appear. German-English English-German Dictio- [Vossen, 1991] P. Vossen. Comparing noun- nary, Collins French-German German- taxonomies cross-linguistically. ESPRIT French Dictionary. Collins, London, Glas- BRA-3030 ACQUILEX Working Paper gow, 1990. Number.014, English Department, Univer- [iieylen et al., 1992] Dirk Heylen, Kerry G. sity of Amsterdam, 1991. Maxwell, Tim Nicolas, and Susan Warwick- [Wall and Schwartz, 1990] L. Wall and R. L. Armstrong. Analysis of existing dictionar- Schwartz. Programming Perl. O’Reilly Lz ies. Deliverable, Collocations and the Lexi- Associates Inc., Sebastopol, CA., 1990. calisation of Semantic Operations, Utrecht, July 1992. [tIeylen, 1992] Dirk Heylen. Lexical func- tions and knowledge representation. In Pro- ceedings of the Second International Work- shop on Computational , Toulouse, 1992.

86