Bi-Particle Adverbs, Pos-Tagging and the Recognition of German Separable Prefix Verbs
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016) Bi-particle Adverbs, PoS-Tagging and the Recognition of German Separable Prefix Verbs Martin Volk, Simon Clematide, Johannes Graen,¨ Phillip Strobel¨ University of Zurich Institute of Computational Linguistics [email protected] Abstract to compute the correct verb lemma. Unfortunately, Part-of-Speech taggers (like the TreeTagger) assign In this paper we propose an algorithm the lemma locally and do not consider the long- for computing the full lemma of German distance dependency between the verb and the pre- verbs that occur in sentences with a sepa- fix. Hence, we need to correct the verb lemma after rated prefix. The algorithm is meant for PoS tagging. In example 1, the PoS tagger will large-scale corpus annotation. It relies on assign the lemma weisen (EN: to point) to the past Part-of-Speech tags and works with 97% tense verb form wies. Only the re-attachment of the precision when the tags are correct. Un- prefix will lead to the correct lemma nach+weisen fortunately there are multi-word adverbs (EN: to prove) and thus to the correct meaning of with particles that are homographs with the verb. separated verb particles and prepositions. Some annotated corpora of German leave the Since the usage as separated particle and re-attachment of separated verb prefixes open. For preposition is much more frequent, these example, the German TIGER treebank marks only multi-word adverbs are often incorrectly the lemma of the finite verb as in figure 1. Since the tagged. We show that special treatment finite verb and the separated prefix are children of of these bi-particle adverbs improves the the same mother node S, the prefix can be assigned re-attachment of separated verb particles. unambiguously to the verb. Still, this makes query- 1 Introduction ing the treebank for verbs with separable prefixes a complex undertaking. However, recent versions Particle verbs in German often occur with verb of the TuBa-D/Z¨ treebank do contain verb lem- stem and particle split over long distances. This mas with re-attached prefixes (Versley et al., 2010). happens in matrix clauses when the verb is finite These lemmas are represented in the same way as and occurs in present or past tense, or when the the lemmas of the corresponding verbs in unsepa- verb is in imperative form. Examples: rated form (e.g. nach#weisen). We work on the annotation of large corpora (1) So eine bekannte Studie der Harvard wies for linguistic research and information extraction. University aus dem Jahr 2007 , dass ... nach Therefore we have developed an efficient and ro- (EN: A well-known study by Harvard bust algorithm to compute the lemmas of German University from 2007 that ...) proved verbs that occur with separated prefixes. In this (2) Nimm das und das mit. (EN: Take this and paper we will present the algorithm. We will then that along.) argue that multi-word adverbs cause some confu- sion to the PoS tagger and thus require special In all other tenses and forms the particle is pre- treatment. The correct handling of these adverbs, fixed to the verb (e.g. ... wie eine Studie nachwies). in return, improves the precision of the re-attached Therefore the particle is often called a separable lemmas. prefix. When analyzing German sentences we have to re-attach the separated prefix to the verb in order 297 Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016) Figure 1: German syntax tree with separated verb prefix (halt¨ ... fest) and multi-word adverb (nach wie vor) from the TIGER treebank. The multi-word adverb is annotated as coordinated adverbial phrase (CAVP). (English translation: The Interior Minister still maintains the proposal of a children citizenship.) 2 The Re-attachment Algorithm This is different from the treatment of these aux- iliary and modal verbs in the TuBa-D/Z¨ treebank. We re-attach the separated prefix to the verb with The treebank includes the re-attachment of the sep- the following algorithm. After Part-of-Speech tag- arated prefixes but leaves the PoS tags unchanged. ging we search for a separated verb prefix (tagged This means, that in the TuBa-D/Z¨ treebank the verb as PTKVZ) and the most recent preceding finite innehaben is a finite full verb, when the prefix is full verb (VVFIN) or imperative verb (VVIMP) in attached, but it is an auxiliary verb, when the pre- the same sentence. In order to increase the preci- fix is separated. We consider this a misleading sion we also check whether the re-combined prefix inconsistency. + verb lemma occurs in our corpora and is licensed Our re-attachment algorithm leads to high pre- by the morphology analyzer GerTwol. In this way cision re-combined verb lemmas. We first eval- we have compiled a list of 8500 separable German uated our method against our corpus of 1.7 mil- verbs. lion German tokens from banking news (Volk et German auxiliary verbs and modal verbs do not al., 2016). PoS tagging leads to a total of 9200 take separable prefixes. This means that the aux- tokens marked as separated verb prefixes. Our al- iliary verbs haben (EN: have) and werden (EN: gorithm re-combines 7630 prefix + verb stems (re- become) must be interpreted as full verbs when sulting in 976 types). The re-combined verbs with they take a separable prefix. Consider for instance the highest frequencies are: ausgehen (345 occur- the verb innehaben in er hat ein Amt inne (EN: he rences, EN: to go out, to die down), darstellen (226, holds an office). Other examples are vorhaben (EN: EN: to depict, to represent), aussehen (169, EN: to intend), fertigwerden (EN: to be done with), or to look like, to appear), stattfinden (149, EN: to loswerden (EN: to get rid off). take place), and beitragen (136, EN: to contribute). Similarly, the modal verb mussen¨ functions as These counts do not include the occurrences of full verb in combination with the prefix durch re- these verbs where the prefix is part of the verb sulting in durchmussen¨ (EN: to have to go through), form (i.e. non-separated forms): ausgehen (148 oc- and konnen¨ functions as full verb in wegkonnen¨ currences), darstellen (216), aussehen (106), stat- (EN: to be able to leave). Our re-attachment algo- tfinden (151), and beitragen (292). rithm needs to account for these cases even though As a side effect we disambiguate between mul- state-of-the-art PoS taggers for German label all tiple lemma options. For example, the 3rd person occurrences of haben and werden as auxiliary and singular verb form fallt¨ can have the lemmas fallen all occurrences of mussen¨ and konnen¨ as modal (EN: to fall) or fallen¨ (EN: to fell). The TreeTag- verbs. Therefore we include PoS correction in the ger assigns both lemmas to this verb form. If fallt¨ re-attachment of separated verb prefixes for these occurs with the separated prefix auf, then our re- cases. attachment algorithm finds that only the combi- 298 Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016) Figure 2: ParZu parser error due to incorrect recognition of the multi-word adverb durch und durch. (EN: The totally volcanic land has thousands of craters.) nation auffallen is possible (EN: to stand out, to This example sentence has a relative clause be- strike), and we eliminate the other lemma option. tween the verb busst¨ and its separated prefix ein. Obviously, this disambiguation method is depen- Since our algorithm assigns the prefix to the most dent on the separated prefix being only acceptable recent finite verb, it will erroneously assign it to with one lemma option. the verb lag which is the finite verb in the inter- We are aware of three limitations of our algo- mediate relative clause. This problem can only be rithm, all of which concern rare cases. First, the re- avoided by (at least) a shallow parser which detects attachment algorithm will fail for topicalized verb the clause boundaries. prefixes that precede the finite verb. The TIGER Thirdly, our algorithm has no provision for coor- treebank contains 33 examples of separated verb dinated prefixes. prefixes that precede the finite verb (in roughly 0.9 (4) In einer Deflation nimmt der Wert des million tokens of manually annotated newspaper Geldes zu statt ab, ... (EN: During a deflation text). The topicalized prefixes in these examples the value of the money increases instead of are semantically heavy prefixes (e.g. Zuruck¨ bleibt decreases, ...) auch die Erinnerung ...) and pronominal adverbs (e.g. Hinzu kommt die Konkurrenz ..., Zugrunde In such examples the verb has basically two dif- legten die Wiesbadener ...). We label them as ad- ferent lemmas. We could represent this in the same verbs and pronominal adverbs. way as ambiguous lemmas by assigning both lem- More serious, our re-attachment algorithm will mas zunehmen / abnehmen, but currently this is not also fail for rare cases of nested finite clauses that part of our implementation. occur between the verb and its separated prefix. For Other than that, if the PoS tagger recognized all example: verb forms and all separated prefixes correctly, then our re-attachment algorithm should work perfectly. (3) Das Konsumwachstum busst¨ im Vergleich zu We evaluated our algorithm against the TuBa-¨ den vergangenen beiden Jahren, in denen die D/Z treebank. Version 10 of the treebank contains Wachstumsrate deutlich uber¨ 2,0 Prozent lag, a total of 9181 verb forms that have lemmas with markant an Schwung ein. re-attached prefixes. In the standard configuration (EN: The growth in consumption our program correctly re-attaches 8341 separated considerably loses momentum in comparison prefixes (91%).