Morphological Analysis and Lemmatization for Swiss German Using Weighted Transducers
Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016)

Reto Baumgartner
University of Zurich
[email protected]

Abstract

With written Swiss German becoming more popular in everyday use, it has become a target for text processing. The absence of a standard orthography and the variety of dialects, however, lead to a vast variation in spellings, which makes this task difficult. We built a system based on weighted transducers that recognizes over 90% of the tokens in certain texts. Weights ensure that the best analysis is preferred for most words while still allowing for a very broad range of spelling variations. The morphological tagset that we defined for this purpose and the use of lemmas in Standard German open the possibility for further processing. Besides our morphological analyzer and lemmatizer, a morphologically annotated corpus offers new resources for Swiss German and helps spread our tagset.

1 Introduction

With an increased use of written text in Swiss German (SwG), there is a growing interest in tools to process these texts. SwG dialects are spoken in everyday life by more than 4 million people in Switzerland. At this stage our system covers the dialects around the centers Zurich, Basel and Bern. In writing, Standard German (StG) is usually preferred, but for private communication many people use their SwG dialect.

SwG differs from StG in phonology, vocabulary and grammar. Its vowel system still resembles that of Middle High German (MHG) with Ziit “time” and Huus “house” (MHG zît and hûs; StG Zeit and Haus), while the differences in the consonant system and the loss of endings are more modern traits (Christen et al., 2012, p. 27). Over time SwG has lost its genitive case and the preterite (Siebenhaar and Wyler, 1997, p. 37). Conversely, it possesses infinitive particles that are not known to StG.

SwG consists of different local dialects that mainly differ in phonology and to a lesser extent in vocabulary. There is no standard orthography, but there are proposals for sound-character assignment like Dieth-Schreibung (Dieth, 1986) or Bärndütschi Schrybwys (Marti, 1985a) that are, however, not known to everyone. This results in a high variability of spellings influenced by both dialects and personal writing preferences. For example, for the StG word Jahr “year” we found in our corpus Jahr and Jaar, Johr and Joor, and even Joh, reflecting different pronunciations and spelling preferences.

The lack of a standard orthography and the vastness of variants motivate the choices that have to be made to process these dialects. For lemmatization we use StG words. The variants can probably best be dealt with using finite-state technology, which relies not on huge corpora but on linguistic engineering. Weighted transducers can be used for a better trade-off between good coverage and overgeneration.

2 Related Work

The increase of SwG in writing led to a number of resources:

Corpora: By now two corpora consisting of everyday written language have been collected. The Swiss SMS Corpus (Stark et al., 2009–2015) counts 275 000 tokens in SwG from short messages. The corpus includes manually made glosses in StG. NOAH’s Corpus of Swiss German Dialects (Hollenstein and Aepli, 2014) counts 115 000 tokens in SwG from different sources like blogs, Wikipedia entries, literature, newspapers or a business report. The corpus has been manually annotated with parts of speech. With Archimob – A corpus of Spoken Swiss German (Samardžić et al., 2016), there is also a corpus of transcribed spoken SwG, unlike the others, whose material was originally written.
Taggers: Hollenstein and Aepli (2014) trained a Part-of-Speech tagger model on their collected data that reaches an accuracy of 90.62%.

Morphological generation: A closely related task to ours is morphological generation. An approach from Scherrer (2011) uses replacement rules and information about the dialects’ location to build SwG word forms. As this system follows specific spelling guidelines for consistency, it is not suited for analysis, where it is important to recognize a broad range of different spellings.

3 Annotation scheme

3.1 Parts of speech

As both the Swiss SMS Corpus and NOAH’s Corpus make use of the Stuttgart–Tübingen–TagSet (STTS) (Schiller et al., 1999), we chose the same tagset for our parts of speech. As it was developed for StG, we had to make some changes for use with SwG:

Changed use: Some tags had to be extended to additional words with the same function. The use of wo “where” as relative pronoun (PRELS) or as subordinating conjunction (KOUS) like StG als “when” demands the expanded use of these tags. Similarly für “for” and zum “to the” can now be conjunctions that govern an infinitive (KOUI).

In contrast to StG, prepositions can be combined with any article. In consequence APPRART is also applicable for plural forms as in id “into the” or indefinite articles as in ime “in a”.

Lacking a corresponding form, the tag PRELAT for attributive relative pronouns will not be used.

Additions: For infinitive particles like go or cho we decided to use the tag PTKINF like in the Swiss SMS Corpus and in NOAH’s Corpus.

For merged words like hets “there is” (literally “has it”) we adopted the treatment from Hollenstein and Aepli (2014), which joins the tags with a plus sign. hets is therefore tagged with VAFIN+PPER. Unlike in their Part-of-Speech tagging task, for our morphological analysis task all tags must be kept.

A completely new tag is PTKAM for the particle am (literally “at the”) in the progressive verb form. In StG examples like Ich bin am Schreiben, literally “I am at-the writing”, Schreiben is commonly analyzed as a nominalized verb forming a prepositional phrase together with am. In SwG this construction is expanded with verbal objects more often than in most areas outside Switzerland (Van Pottelberge, 2005). Such an example would be Ich bi en Brief am schriibe, literally “I am a letter at-the writing”. The fact that am here stands between the object and the infinitive makes an analysis as prepositional phrase impossible and speaks against the tag APPRART for am. The comparison with en Brief z schribe “to write a letter” with the particle z is a good argument for am to be analyzed as a particle too. Our tag would also make sense for other varieties of the German language where such constructions occur or where their interpretation as verbal forms is preferred over one as preposition–noun sequences.

3.2 Morphological features

Due to the absence of an established morphological tagset for SwG, we defined a character-based tagset that extends the STTS to STTS.gsw. The characters that make up the tags are listed in Table 1.

Category      Tags
Degree        p (positive), c (comparative), s (superlative)
Person        1 (first), 2 (second), 3 (third)
Case          n (nom.), a (acc.), d (dat.), r (nom./acc.)
Number        s (singular), p (plural)
Gender        m (masc.), f (fem.), n (neuter)
Mode          i (indicative), j (subjunctive I), k (subjunctive II)
Inflection    s (strong), w (weak)
Definiteness  i (indefinite), d (definite)

Table 1: Morphological tags.

We decided against a tag for the mixed adjective inflection that is used by many descriptions of the StG language. The reasons behind this are that this distinction is solely syntactic and that different SwG dialects use the strong and weak inflection differently.

As there is no preterite, the tense category could be omitted. In consequence the two subjunctive tenses are interpreted as different modes (as subjunctive I and II instead of subjunctive present and preterite).

We introduced a shared tag for nominative or accusative cases even though this would constitute a large intervention from a linguistic perspective. As only personal and some related pronouns make a distinction between these cases, different tags for these forms would lead to competing analyses that could only be distinguished through semantics. Therefore we exclude the task of disambiguating these cases but mark this with the tag r (from rectus).

In our example hets, the tag VAFIN is extended with 3si (3rd person, singular, indicative) and PPER is extended with 3snn (3rd person, singular, neuter, nominative).

3.3 Lemmas

For the choice of lemmas we decided to follow the rules from the Swiss SMS Corpus to ensure compatibility between different resources for SwG. Their main principles are that closely related words must be used, that no new StG words must be invented and that the meaning should not be changed (Ueberwasser, 2013).

For example hets is annotated as haben/VAFIN.3si+es/PPER.3snn after including the morphological tags and lemmas.

(1985b) and Suter (1992). In addition we added 11 for adjectives plus ordinal numbers (as ADJA), 127 adverb stems, about 50 noun stems and around 90 full verb stems (21 roots plus different prefixes) that cannot easily be derived from StG forms.

5 Implementation

Our system is intended to be run with the Helsinki Finite-State Transducer Technology (HFST) (Lindén et al., 2009). HFST allows building and applying weighted finite-state transducers over the tropical semiring. That means paths can be penalized with weights that are added up along the way, and the paths with the lowest total weight are preferred.

5.1 Forms

The implementation of the SwG word forms hap-