Collocation Extraction: Needs, Feeds and Results of an Extraction System for German

Collocation Extraction: Needs, Feeds and Results of an Extraction System for German Julia Ritz Institut für Maschinelle Sprachverarbeitung (IMS) Universität Stuttgart Azenbergstr. 12 70174 Stuttgart Germany [email protected] Abstract in a collocation) (adapted from (Hausmann, 1979; Hausmann, 1989; Hausmann, 2003)). This paper provides a specification of requirements for collocation extraction sys- b. A collocation is a word combination whose tems, taking as an example the extraction semantic and/or syntactic properties cannot of noun + verb collocations from German be fully predicted from those of its compo- texts. A hybrid approach to the extrac- nents, and which therefore has to be listed in tion of habitual collocations and idioms a lexicon (Evert, 2004). is presented, aiming at a detailed description of collocations and their morphosyn- We argue that linguistic knowledge could not tax for natural language generation sys- only improve results (Krenn, 2000b; Smadja, tems as well as to support learner lexicog- 1993) but is essential when extracting colloca- raphy. tions from certain languages: this knowledge provides other applications (or a lexicon user, respec- 1 Introduction tively) with a fine-grained description of how the Since Firth first described collocations as habit- extracted collocations are to be used in context. ual word combinations in the 1950ies (cf. Firth, Additional requirements resulting from the 1968), a number of papers focusing on collocation needs of dictionary users are described in (Haus- extraction have been published (see the overviews mann, 2003; Heid and Gouws, 2005) and are of in (Evert, 2004; Bartsch, 2004)). Most studies interest not only in lexicography but can also be concentrate on the extraction from English. How- transferred to the field of natural language gener- ever, the procedures proposed in these studies can- ation. These requirements influence the develop- not necessarily be applied to other languages as ment of collocation extraction systems, which mo- English stands out, e.g. with respect to configu- tivates this paper. rationality. They rely on the fact that the syntax The structure of the paper is as follows: In chap- of English (and of all configurational languages) ter 2, the requirements, depending on factors like provides positional clues to the grammatical func- the targeted language, are presented. We then dis- tion of noun phrases, and they exploit this con- cuss and suggest methods to meet the given needs. cept by means of window-based, adjacency-based A documentation of ongoing work on the extrac- or pattern-based extraction, combined with associ- tion of noun + verb collocations from German ation measures to identify co-occurrences that are texts is given in chapter 3. Chapter 4 gives a con- more frequent than statistically expectable. What clusion and an outlook on work still to be done. these procedures do not cover is semantic-oriented 2 Collocation Extraction Tools: definitions like (a) and (b). Requirements a. A collocation is a combination of a free ('au- The development of a collocation extraction tool tosematic') element (the base) and a lexically depends on the following conditions: determined ('synsemantic') element (the collocate, which may lose (some of) its meaning 1. properties of the targeted language 41 2. the targeted application of statistical analyses are little reliable. To allow a grouping of words sharing the same lemma, lem- 3. the kinds of collocations to be extracted matisation is crucial. 4. the degree of detail 2.2 Application factors Whereas issues 1 to 3 deal with the collocation Other important factors are the targeted applica- itself, issue 4 is focused at the collocation in con- tion (i.e. analysis vs. generation) and, to some ex- text, i.e. its behaviour (from a syntagmatic analy- tent resulting from it, factors (3.) and (4.), above. sis point of view) or, respectively, its use (from a Depending on the purpose of the tool (or lexicon, generation perspective). respectively), the collocation definition chosen as an outline may vary, e.g. including transparent and 2.1 Language factors regular collocations (cf. (Tutin, 2004)) for genera- One of the most important factors is, of course, the tion purposes, but excluding them for analysis pur- targeted language and its main characteristics with poses. In addition, a more detailed description of respect to word formation and word order. De- the use of collocations in context (e.g. information pending on word and constituent order, the pros about preferences with respect to the determiner, and cons of positional vs. relational extraction etc.) is needed for generation purposes than for patterns need to be considered. Positional pat- text analysis. terns (based on adjacency or a 'window') are ad- equate for configurational languages, but in lan- 2.3 Factors of collocation definition guages with rather free word order, words belong- Collocations can be distinguished on two levels: ing to a phrase or collocation do not necessarily the formal level and the content level. On the for- occur within a predefined span1. mal level, a collocation can be classified accord- Extracting word combinations using relational ing to the structural relation between its elements. patterns (represented by part of speech (PoS) tags Typical patterns are shown in table 12 (taken from or dependency rules) offers a higher level of ab- (Heid and Gouws, 2005)). straction and improves the results (cf. (Krenn, On the content level, there are regular, transpar- 2000b; Smadja, 1993)). However, this requires ent, and opaque collocations (according to (Tutin, part of speech tagging and possibly partial parsing. 2004)) and, taking definition (b) into account, id- A system extracting word combinations by apply- ioms as well. However, as a classification at the ing relational patterns, obviously profits from lan- content level needs detailed semantic description, guage specific knowledge about phrase and sen- we see no means of accomplishing this goal other tence structure and word formation. One example than manually at the moment. is the order of adjective + noun pairs: in English and German, the adjective occurs left of the noun, 2.4 Contextual factors whereas in French, the adjective can occur left or (Hausmann, 2003; Heid and Gouws, 2005; Evert right of the noun. Another example is compound- et al., 2004) argue that collocations have strong ing, handled differently in different languages: preferences with respect to their morphosyntax noun + noun in English, typically separated by (see examples (1) and (2)) and may be combined a white space (e.g. death penalty) vs. noun + (see example (3)). The collocation in example (1) prepositional phrase in French (e.g. peine de mort) ('to charge somebody') is restricted with respect vs. compound noun in German (e.g. Todesstrafe). to the determiner (null determiner) of the base, Consequently, language specific word formation whereas the same base shows a strong preference rules need to be considered when designing ex- for a (definite or indefinite) determiner when used traction patterns. For languages with a rich in- 2Abbreviations in table 1: flectional morphology where the individual word advl - adverbial forms are rather rare, frequency counts and results prd - predicative subj - subject 1In German, e.g., in usual verb second constructions with obj - object a full verb in the left sentence bracket (topological field the- pobj - prepositional object ory see (Wöllstein-Leisten et al., 1997)), particles of particle dat - dative case verbs appear in the right sentence bracket. The middle field gen - genitive case (containing arguments and possibly adjuncts of the verb) is quant - quantifying of undetermined length. 42 No. Type Example German newspaper texts. 1 N + Adj tiefer Schlaf The syntactic patterns for the extraction of these 2 Adj + Adv tief rot combinations concentrate on verb-final construc- 3 V + Adv tief schlafen tions as in example (4) and verb second con- 4 V + NP ¢¡¤£¦¥ Bauklötze staunen structions with a modal verb in the left sentence 5 V + N §©¨ Frage + sich stellen bracket according to the topological field theory 6 V + N ¡ Anforderungen + genü- (see (Wöllstein-Leisten et al., 1997)) as in exam- gen ple (5). The reason is that, in these constructions, 7 V + N Frage + aufwerfen the particle forms one word with the verb (see ex- 8 V + PP zu + Darstellung + gelan- ample (6)), as opposed to usual verb second con- gen structions (see example (7)). Thus, we need not ¡ 9 V + Adj verrückt spielen recombine verb + particle groups that appear sep- 10 N + N ¦ Einreichung des Antrags aratedly. ¨ 11 N + N ein Schwarm Heringe (4) ... wenn Wien einen Antrag auf Vollmitglied- the category containing the base is underlined. schaft stellt. Table 1: Collocational patterns ('if Vienna an application for full membership puts') (if Vienna applies for full membership) with a different collocate (example (2), 'to drop a lawsuit'). Example (3) shows two collocations (5) ... kann Wien einen Antrag auf Vollmitglied- sharing the base can form a collocational sequence schaft stellen. (example taken from (Heid and Gouws, 2005)). ('might Vienna an application for full membership put.') (1) ø Anklage erheben (Vienna might apply for full membership.) (2) die/eine Anklage fallenlassen (6) ..., daß er ein Schild auf stellt. (3) Kritik üben + scharfe Kritik ('that he a sign upputs') scharfe Kritik üben (that he puts up a sign) For both natural language generation systems (7) Er stellt ein Schild auf. and lexicography, such information is highly rel- ('He puts a sign up.') evant. Therefore, the extraction of contextual in- (He puts up a sign.) formation (called 'context parameters' in the fol- Preprocessing lowing) should be integrated into the collocation As data, we used a collection of 300 million extraction process. words from German newspaper texts dating from 3 Extracting noun + verb collocations 1987 to 1993. The corpus is tokenized and PoS- from German tagged by the Treetagger (Schmid, 1994), then chunk annotated by YAC (Kermes, 2003).

Collocation Extraction: Needs, Feeds and Results of an Extraction System for German

D1.3 Deliverable Title: Action Ontology and Object Categories Type

Collocation Classification with Unsupervised Relation Vectors

Comparative Evaluation of Collocation Extraction Metrics

Collocation Extraction Using Parallel Corpus

Collocation Classification with Unsupervised Relation Vectors

Dish2vec: a Comparison of Word Embedding Methods in an Unsupervised Setting

Evaluating Language Models for the Retrieval and Categorization of Lexical Collocations

Collocation Segmentation for Text Chunking

17058 FULLTEXT.Pdf (1.188Mb)

Exploratory Search on Mobile Devices

Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus

Comparison of Latent Dirichlet Modeling and Factor Analysis for Topic Extraction: a Lesson of History