<<

Separable in a Reusable Morphological Dictionary for German

Pius ten Hacken I & Stephan Bopp 2 l lnstitut far Informatik/ASW 2Lexicologie, Faculteit der Letteren Universit~it Basel, Petersgraben 51 Vrije Universiteit, De Boelelaan 1105 CH-4051 Basel (Switzerhmd) NL- 1081 HV Amsterdam (Netherlands) email: [email protected] emaih [email protected]

Abstract Separable verbs are verbs with prefixes which, depending on the syntactic context, can occur as one written together or discontinuously. They occur in languages such as German and Dutch and constitute a problem tbr NLP because they are lexemes whose lbrms cannot always be recognized by dictionary lookup on the basis of a text word. Conventional solutions take a mixed lexical and syntactic approach. In this paper, we propose the solution offered by Word Manager, consisting of string-based recognition by means of rules of types also required for periphrastic inflection and . In this way, separable verbs are dealt with as part of the dolnain of reusable lexical resources. We show how this solution compares favourably with conventional approaches.

1. The Problem (2) a. Claudia auihOrtjctzt. In German there exists a large class of verbs b. Daniel versucht zu aulh6ren. which behave like aufhOren ('stop'), The problem arises in a very similar fashion illustrated in (1). in Dutch, as the Dutch translations (3) of the (1) a. Anna glaubt, dass Bernard aufhSrt. sentences in (1) show. The only difference is ('Anna believes that Bernard stops') that the in (3c) is not written b. Claudia h6rt jetzt auf. together. ('Claudia stops now PRY') (3) a. Anna geloofl dat Bernard ophoudt. c. Daniel versucht aufzuh6ren. b. Claudia houdt nu op. ('Daniel tries to_stop') c. Daniel probeert op te houden. In subordinate clauses as in ( I a), the particle On the other hand, the problem of separable aufand the inflected part of the hOrt are verbs in German and Dutch differs flom the written together, in main clauses such as corresponding one in English, because (lb), the inflected form hOrt is moved by English verbs such as look up are multi- verb-second, leaving the particle stranded. In word units in all contexts. A treatment of infinitive clauses with the particle zu ('to'), these cases which is in lille: with the solution zu separates the two components of the verb proposed here is described by Tschichold and all three elements are written together. (forthcoming). In analysis, the problem of separable verbs As suggested by the English translation, is to combine the two parts of the verb in separable verbs in German and Dutch are contexts such as (lb) and (lc). Such a lexemes. Therefore, an important issue in combination is necessary because syntactic evaluating a mechanism for dealing with and semantic properties of aufhiJren are the them is how it fits in with the reusability of same, irrespective of whether the two parts lexical resources. are written together or not, but they cannot be deduced from the syntactic and semantic Given the importance of the orthographic properties of the parts. Therefore, a solution component in the problem, it is not to the problem of separable verbs will treat surprising that it is hardly if ever treated in (lb) as if it read (2a) and (lc) as (2b): the linguistic literature.

471 It should be pointed out that Celex and 2. Previous Approaches Rosetta were not chosen because their solution to the problem of separable verbs is In existing systems or resources for NLP, separable verbs are usually treated as a worse than others. They are representative lexicographic and syntactic problem. Two examples of currently used strategies, typical approaches can be illustrated on the chosen mainly because they are relatively basis of Celex and Rosetta. well-documented. Celex (http://www.kun.nl/celex) is a lexical 3. The Word Manager database project offering a German dictionary with 50'000 entries and a Dutch Approach dictionary with 120'000 entries. In these Word Manager TM (WM) is a system for dictionaries separable verbs are listed with a morphological dictionaries, it includes rules feature conveying the information that they for inflection and deriwltion (WM proper) belong to the class of separable verbs and a and for clitics and multi-word units (Phrase bracketing structure showing the Manager, PM). We will use WM here as a decomposition into a prefix and a base, e.g. name for the combination of the two (auf)(hOren). Celex dictionaries are reusable, components. A general description of the but the rule component for the interpretation design of WM, with references to various of the information on separable verbs, i.e. publications where the formalism is the mechanism for going from (lb-c) to (2), discussed in more detail, can be found in ten remains to be developed by each NLP- Hacken & Domenig (1996). system using the dictionaries. The German WM dictionary consists of a Rosetta is a machine translation system comprehensive set of inflectional and word which includes Dutch as one of the source formation rules describing the full range of and target languages. Rosetta (1994:78-79) morphological processes in German. In the describes how separable verbs are treated. last two years we have specified more than For the verb ophouden illustrated in (3), 100'000 database entries by classification of there are three lexical entries, ophouden for lexemes in terms of inflection rules (for the continuous forms as in (3a), and houden morphologically simple entries) and by the and op for the discontinuous forms as in application of word formation rules (for (3b-c). When a form of houden is found in a morphologically complex entries). In text, it is multiply ambiguous, because it can addition, the PM module contains a set of be a form of the simple verb houden ('hold') rules for clitics and multi-word units which or of one of the separable verbs ophouden covers German l~eriphrastic inflection ('stop'), aanhouden ('arrest'), aJhouden patterns and separable verbs. ('withhold'), etc. The entry for houden as part of ophouden contains the information The rule types invoked in the treatment of that it must be combined with a particle op. separable verbs in WM include Inflection At the same time, op is ambiguous between a Rules (IRules), Word Formation Rules reading as preposition or particle. In , (WFRules), Periphrastic Inflection there is a rule combining the two elements in (PIRules), and Rules (CRules). We a sentence such as (3b). It is clear that, while will describe each of them in turn. this approach may work, it is far from elegant. It creates ambiguity and 3.1. Inflection redundancies, because ophouden written together is treated in a different entry from In inflection, at(/h6ren is treated as a verb op + houden as a discontinuous unit. These with a detachable prefix at!fi The detachable properties make the resulting dictionaries prefix is defined as an undcrspecified less transparent and do not favour IFormative. This means that, in the same way as for stems, its specification is reusability. distributed ()vet" a class specification and a

472 RIRule V_Detachable-Prefix citation-forms (ICat Detachable-Prefix) (ICat V-Stem) (ICat V-Suffix) (Hod Inf) word-forms (ICat Detachable-Prefix) (ICat V-Stem) (ICat V-Suffix) (ICat Detachable-Prefix) (ICat V-Prefix.ge) (ICat V-Stem) ...... (ICat V-Suffix) (Mod PaPa) Fig. 1 : Inflection rule for separable verbs in WM. The dots in the last line mark tile absence of a line break in the actual code. Feature specifications separated by tabs refer to sets of formativcs in paradigmatic variation. Each line thus generates one or more word forms. target (RIRule V Detachable-Prefix) separable 1 (ICat Detachable-Prefix) 2 (ICat V-Stem)

Fig. 2: Target specification of the WFRule for separable verbs in WM. specification of the individual string. The lexicographer by linking a word to a class is defined by the linguist in the WFRule having a target specification as in specification of inflection processes. The Fig. 2. In the case of au]lffJren, this is a rule specification of the string is part of the for prefixing in which "1" in Fig. 2 matches lexicographic specification, i.e. the string a closed set of predefined prefixes. The specification is the result of the application of IRules and WFRules described so far cover the word formation rule the lexicographer the non-separated occurrences as in (la). chooses for the definition of an individual entry. In the IRules, detachable prefixes are The second special property of the referred to as formatives in the formulae specification in Fig. 2 is the system keyword generating the word forms. Fig. 1 gives the "separable" in the second line. It assigns relevant rule of the database for otherwise the result of the WFRule to the predefined regular separable verbs, such as au./hi~ren. class %separable. This class, whose name is defined in the WM-fornmlism, can 3.2. Word Formation be used to establish a link between the result of word formation and the input to the Word Formation Rules consist of a source periphrastic inflection mechanism used to definition and a target definition. The source recognize occurrences such as in (lb). definition determines what (kind of) formatives are taken to form a new word. 3.3. Periphrastic Inflection The target definition specifies how the source formatives are combined, and which The mechanism for periphrastic inflection in inflection rule the new word is assigned to. WM consists of two parts. PlClasses are used to identify tile components and PlRules Separable verbs are the result of WFRules to turn them into a single word form. The which are remarkable because of their target. PlRule for separable verbs ill German is The target specification is as in Fig. 2. This given in Fig. 3. The rule in Fig. 3 consists specification departs from the usual of a name and a body, which in turn consists specification of a target in a WFRule in two of input and output specifications separated respects. First, instead of concatenating the by "=". The input specifies a form source lormatives, the rule lists them, (infinitive and are excluded by leaving concatenation to the IRule. This is "^") and a detachable prefix. The output necessary to form the past combines them in the position of the verb, aufgehOrt, where the two formatives are with the form prefix + verb, and with the separated by the prefix ge- (cf. last line of features percolated flon~ the verb (person, Fig. 1). Separable verbs are specified by the

473 Separable (Cat V)^(Mod Inf)^(Hod Part) + %separable ...... (POS i) (FORM 2+i) (PERC i) (Cat V) Fig. 3:PeriphrasticInflection Rule ~rseparableverbsinWM.

%separable + (CElement zu) + (Cat V) (Hod Inf) (femp Pres) ...... (CElement zu), %separable + (Cat V) (Mod Inf) (Temp Pres) Fig. 4: CRule for the infinitive of separable verbs in WM. number, etc.). This yields (2a) as a step in The first component to act is the clitics the analysis of (lb). component. It leaves everything unchanged except (lc), which is replaced by (2b): The possibilities for specifying the relative aufzuharen => zu at([h/Jven, Then the rules position of the two elements to be combined of WM proper are activated. They replace are the same as the possibilities for multi- each word form by a set of analyses in terms word units in general. In the PIClass for of a string and feature set. In (1 a), au./hgrt is German it is specified that the finite verb analysed as third person singular or second always precedes the particle when the two person plural of the present tense of are separated. In Dutch this is not the case, auflzaren, in (lb) hart and au/' are analysed as illustrated by (3c), so that a different separately, and in (lc) au./haren, which was specification is required. given the feature infinitive by the CRule in Fig. 4, only as infinitive, not as any of the 3.4. Clitic Rules homonymous forms in the paradigm. The next step is periphrastic inflection. It applies The clitic rule mechanism is used to analyse to (la) and (lc) vacuously, but combines au~zuharen in (lc) and produce zu au.~gren and in (lb), producing the feature as in (2b). The CRule used is given in Fig. hart at(f description corresponding to (2b): 4. Again input and output are separated by hiJrt m,{f => aufhart. Finally, the idiom recognition "=". The input consists of the concatenation component (not treated here' applies of three elements: a detachable prefix, vacuously. infinitival zu, and an infinitive. Graphic concatenation is indicated by "+". The A general remark on recognition is in order CElement zu is defined elsewhere as a form here. The rule components of PM, i.e. of the infinitival z u, rather than the clitics, periphrastic inflection and idiom homonymous preposition, in order not to recognition add their results to the set of lose information. The output consists of two intermediate representations available at the , as indicated by the comma, the relevant point. Thus, after the clitic second of which concatenates the prefix and component, az(fz.uhiJre** continues to exist the verb. alongside z.u au[h&'en in the analysis of (lc). Since the former cannot be analysed by WM 3.5. Recognition and proper, it is discarded. Likewise, hiJrt will survive in (lb) after periphrastic inflection Generation and indeed as part of the final result. This is In recognition, the input is the largest necessary in examples such as (4): domain over which components of multi- (4) Der Hund hart auf den Namen Wurzel. word units (MWUs) can be spread. In ('The dog answers to the name [of] practice, this coincides with the sentence. Wurzel') Since WM does not contain a parser, larger chunks of input will result in spurious Since rules in WM are not inherently recognition of potential MWUs. Let us directional, it is also possible to generate all assume as an example that the sentences in forms of a lexeme such as aq/hgJren in the (1) are given as input. way they may occur in a text. The client

474 application required for this task can also include codes indicating places in the string (6) daniel (Cat ) versuchen (Cat Verb)(Tense Pres) where other material nmy intervene, because (Pers Third)(Num SG) this information is available in the relevant PIClass of the database. zu (Cat lnf-marker) auth6ren (Cat Verb)(Mode lnl) 4. Conclusion The task of the client application in the recognition of separable verbs in (1) is Separable verbs in German and Dutch reduced to the choicc of (5a) rather than constitute a problem in NLP because they are (5b). lexemes whose recognition is not simply a matter of dictionary lookup. Therefore, a Finally, two points deserve to be reusable lexical database such as Celex does emphasized. First, the entire WM-formalism not offer a comprehensive solution to the for separable verbs has been implemented as problem. On the other hand, treating them as described here. The rules for German have a problem of syntactic recognition, as been formulated and a large dictionary tbr implemented in, for instance, Rosetta, fails German (100'000 entries) including to account for the lexeme character of separable verbs is awtilable. Moreover, the separable verbs. As a consequence, spurious only provision in the WM-formalism ambiguities and redundancies are created. specifically geared towards the treatment of Ambiguities arise between a sin]pie verb separable verbs is the keyword separable in such as h6ren ('hear') and the same form WFRules (cf. Fig. 2) and the corresponding functioning as part of a separable verb such class name %separable. Otherwise the entire as aufiffSren. Redundancies emerge between formalism used fox separable verbs is the two different entries for au.[hiiren, one available as a consequence ol' general for the continuous and one for the requirements of morphology and multi-word discontinuous occurrences. units. In Word Manager, the recognition of separable verbs is entirely within the reusable lexical domain. A client application References can start from an input which resembles (2) rather than (lb-c). An indication of the type ten Hacken, Plus & Domenig, Marc (1996), of input is given in (5) and (6). For (lb), 'Reusable Dictionaries for NLP: The (5a) and (5b) are offered as alternatives. For Word Manager Approach', Lexicology (lc), (6) is offered as the only analysis 2: 232-255. (modulo syncretism of versucht). Rosetta, M.T. (1994), Coml)ositio,al (5) a. claudia (Cat Noun) Translation, Kluwer Academic, aufh6ren (Cat Verb)(Tense Pres) Dordrecht. (Pets Third)(Num SG) Tschichold, Cornelia (forthcoming), English jetzt (Cat Adv) Multi-Word Units in a Lexicon Jor b. claudia (Cat Noun) Natural Language Processing, Ph.D. hSren (Cat Verb)(Tense Pres) dissertation, Universitfit Basel (Dec. (Pets Third)(Num SG) 1996), to appear at Olms Verlag, jetzt (Cat Adv) Hildesheim. auf (Cat Prep)

I Word Manager: http ://www'unibas'ch/LIlab/pr°jects/w°rdmanager/w°rdmanager'html Fig. 5: URL for Word Manager.

475