A Finite-State Approach to Shallow Parsing and Grammatical Functions Annotation of German

by Frank Henrik Müller

Doctoral dissertation accepted by the Neuphilologische Fakultät of the University of Tübingen on 16 February 2005

Berlin 2007
First examiner: Prof. Dr. Erhard W. Hinrichs
Second examiner: PD Dr. Anette Frank
Dean: Prof. Dr. Joachim Knape

Acknowledgements

I thank everyone without whom this work would not have been possible. First of all, there is my supervisor Erhard Hinrichs, who provided me with a comfortable scholarship and later employed me in the SFB 441. During my work, he gave me freedom at the right times and applied pressure at the right times. Above all, he took the time to go through my thesis with me chapter by chapter; otherwise it would not have such a good structure. And whatever still slipped through was meticulously noted by my second reviewer, Anette Frank. In fact, her suggestions would have given me enough topics for a habilitation, but that is another story. Also present at the doctoral colloquium were Frank Richter and Irene Rapp as additional examiners and Klaus-Peter Philippi as chair of the doctoral committee. They, too, deserve thanks for their efforts. And then there are my dear colleagues, who always stood by me and surely also put up with quite a lot at times. First and foremost, Tylman Ule must be mentioned; I had better not even begin to enumerate his help, as there would be no end to it. Only the at least 2,500 cappuccinos he made for me shall not go unmentioned here. Sitting in one office for ten to twelve hours a day for five years certainly leaves its mark. I was also always allowed to ask Sandra Kübler for advice, which I gladly did. Warm thanks also go to Heike Telljohann, wherever she may be right now, for the interesting discussions about German syntax. And I do not want to forget my two colleagues Holger Wunsch and Piklu Gupta, who joined later. But it was, of course, my dear family who set all this in motion, without whom I would not be who I am. Had I not grown up in such a stimulating environment, I would certainly not have gone to university and completed a doctorate.
And so I thank my parents for their years of moral and financial support and for their understanding that I could rarely make the long journey to Münster. I thank my sister Stefanie as a representative of all the taxpayers who financed my studies without ever having studied themselves. And last but not least, I thank my grandmothers Gertrud Müller, Anna Hakenes and Mia Robering for their moral and financial support of my studies. And, quite incidentally, I also felt very much at home in beautiful Tübingen and look back on it fondly. The many friendships I made there naturally contributed to this. Besides my dear colleagues, my English theatre group, the Provisional Players, deserves particular mention here. Thank you. Those were the best days of my life.

Contents

1 Introduction
1.1 Initial Problem
1.2 Parsing
1.3 Main Issues of this Thesis
1.4 Reader's Guide and Overview of the Parser

2 A Rule-Based Finite-State Parser for German
2.1 FSAs as a Parsing Model
2.2 Parsing German with Hand-crafted Rules
2.3 Basic Principles of the Parser
2.3.1 Robustness
2.3.2 Modularity of Annotation Components
2.3.3 Underspecification
2.4 The fsgmatch Annotation Tool

3 Related Work
3.1 General Remarks
3.2 Shallow Parsing
3.2.1 Finite-State Approaches and Shallow Parsing
3.2.2 Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002)
3.2.3 Wauschkuhn (1996)
3.2.4 Schiehlen (2002, 2003a,b)
3.2.5 Kermes and Evert (2002, 2003)
3.2.6 Trushkina (2004)
3.2.7 Schmid and Schulte im Walde (2000)
3.2.8 Becker and Frank (2002)
3.2.9 Klatt (2004a,b)

3.2.10 Brants (1999)
3.2.11 Summary
3.3 Grammatical Functions Annotation
3.3.1 Finite-State Approaches and GF Annotation
3.3.2 Schiehlen (2003a,b)
3.3.3 Trushkina (2004)
3.3.4 Schmid and Schulte im Walde (2000)
3.3.5 Kübler (2004a,b)
3.3.6 Summary

I Annotating Shallow Structures

4 Shallow Annotation Structures
4.1 The Nature and Right-of-Existence of Shallow Parsing
4.2 Definition of Shallow Structures
4.2.1 Chunks
4.2.2 Topological Fields
4.2.3 Clauses

5 Annotating Shallow Structures with FSAs
5.1 Outline of the Task Description
5.2 Easy-First-Parsing and a Divide-and-Conquer Strategy
5.3 Correcting Tags by Using Topological Field Information
5.4 Handling Recursion by Iterating FSAs
5.5 Annotating Chunks Top-down

6 Annotating Complex Chunks
6.1 Task Description and Motivation
6.2 Appositions and Named Entities
6.3 Annotation of Appositional Constructions
6.4 Annotation of Named Entities
6.5 Annotation of Other Complex Chunks

II Annotating Grammatical Functions

7 Annotating Deeper Structures – Grammatical Functions
7.1 Outline of the Task Description

7.2 Definition of Grammatical Functions
7.3 Detailed Task Description and Problems
7.4 Motivation for Annotation
7.5 Methods of Annotation – An Outlook

8 Reducing Morphological Ambiguity
8.1 Task Description
8.2 Reducing Ambiguity on Chunk Level
8.3 Reducing Morphology on Clause Level
8.4 Further Enriching the Corpus with Morphological Information
8.5 Conclusion, Outlook and Comparison with other Approaches

9 Annotating Grammatical Functions (GFs)
9.1 Task Description
9.2 Converting Lexical Entries into FSTs
9.3 Disambiguation by Ranking Rules
9.4 A Longest-Match Strategy for the Disambiguation of SC Frames
9.5 Using Markedness to Disambiguate GF Order
9.6 Extending Disambiguation to non-Lexical Patterns
9.7 Modeling Ranking by Using OT Constraints

III Evaluation and Comparison

10 Shallow Annotation
10.1 The Evaluation System
10.1.1 Task Description
10.1.2 The TüBa-D/Z Gold Corpus
10.1.3 Conversion into Common Representation
10.1.4 Methods of Evaluation
10.2 The Test-Setting for Evaluation
10.3 Results
10.3.1 Chunks
10.3.2 Topological Fields
10.3.3 Clauses

11 Grammatical Function Annotation
11.1 The Evaluation System
11.1.1 Task Description
11.1.2 GFs in the TüBa-D/Z Gold Corpus
11.1.3 Conversion into Common Representation
11.1.4 Methods of Evaluation
11.2 The Test-Setting for Evaluation
11.3 Discussion of Overall Evaluation Results
11.4 Impact of Different Components on Annotation
11.4.1 Evaluating Morphological Ambiguity Reduction
11.4.2 Impact of Morphological Ambiguity Class Reduction
11.4.3 Impact of non-lexical Component
11.4.4 Impact of Support Verb Component

12 Comparison with Related Work
12.1 General Remarks
12.2 Shallow Parsing
12.3 Grammatical Functions Annotation

13 Conclusion

14 Bibliography

A Abbreviations

B The Stuttgart-Tübingen Tagset (STTS)

C The TüBa-D/Z Inventory of Syntactic Categories and Grammatical Functions

D Lists of Verbs without SC Frame
D.1 112 Verb Types without SC Frame
D.2 44 Verb Types without Base Form


Chapter 1

Introduction

1.1 Initial Problem

During the last two decades, there has been a paradigm shift in linguistics and related sciences. Before this shift, linguists mainly relied on introspective language data for their grammaticality judgments and sometimes supported those judgments with examples taken subjectively from real language data. Since this shift, an increasing number of linguists have taken into consideration real language data taken from large electronic corpora of a language. This paradigm shift has not only influenced linguistics in the narrow sense but also areas such as lexicography or language teaching methodology. If appropriate search tools are provided, corpora are already a useful data source even if they do not contain any linguistic annotation, i.e. if they simply contain the words of the language without any further information. However, furnishing a corpus with explicit linguistic information (i.e. linguistic annotation) considerably increases its usefulness. In some cases, information as simple as the part of speech suffices for a linguistic study. If, for example, a linguist wants to investigate the conjunction 'da', there is the problem that 'da' is a homograph: it may be either a conjunction or an adverb, as illustrated in sentence 1. If, however, the part of speech (PoS) is tagged (i.e. annotated), the homography is resolved and the conjunction 'da' can be distinguished from the adverb 'da' by its PoS tag. Still, corpora which are annotated with PoS tags are not fully satisfactory because they simply show the linear order of tokens without any constituent structure or grammatical relations. Many linguistic phenomena cannot be

captured just using PoS tags.

(1) Er drehte sich um, aber da niemand da war, ging er weiter.
    He turned himself around, but since nobody there was, went he on.
    'He turned around but since nobody was there he moved on.'

In order to make a corpus suitable for the testing of linguistic theories, it is necessary to annotate it with more specific linguistic phenomena. This annotation may be done manually. However, this is a tedious task, consuming both time and money. For this reason, very large corpora virtually cannot be annotated manually at all. Consequently, automatic syntactic parsing, which is the automatic annotation of syntactic phenomena, is crucial for the annotation of very large corpora. We present an approach to automatic annotation in the thesis at hand.
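To make the point concrete (this snippet is not from the thesis: the toy corpus and the helper function are invented for illustration, while KOUS and ADV are genuine STTS tags), a PoS-tagged corpus lets a query separate the two readings of 'da':

```python
# Toy illustration: a PoS-tagged corpus disambiguates the homograph 'da'.
# STTS tags: KOUS = subordinating conjunction, ADV = adverb.
tagged_sentence = [
    ("Er", "PPER"), ("drehte", "VVFIN"), ("sich", "PRF"), ("um", "PTKVZ"),
    (",", "$,"), ("aber", "KON"), ("da", "KOUS"), ("niemand", "PIS"),
    ("da", "ADV"), ("war", "VAFIN"), (",", "$,"), ("ging", "VVFIN"),
    ("er", "PPER"), ("weiter", "ADV"), (".", "$."),
]

def find_token(sentence, word, tag):
    """Return the positions where `word` occurs with PoS tag `tag`."""
    return [i for i, (w, t) in enumerate(sentence) if w == word and t == tag]

print(find_token(tagged_sentence, "da", "KOUS"))  # → [6], the conjunction
print(find_token(tagged_sentence, "da", "ADV"))   # → [8], the adverb
```

Without the tags, a search for 'da' would return both occurrences indiscriminately.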

1.2 Parsing

A parser is a device which decides whether its input adheres to the grammar of a certain language. In order to do so, it checks the correctness of the syntactic structure of the input with respect to the grammar of the language. It may then either simply accept or reject the input, or, in the case of acceptance, it may assign the respective syntactic structure to the input. The syntactic structure of the input is the structure of the input with respect to the grammar. For natural languages, this syntactic structure does not necessarily subsume only the syntactic structure in the linguistic sense, because it may also subsume semantic information. In the thesis at hand, however, we mean syntactic parsing in the linguistic sense when we speak of parsing (unless explicitly stated otherwise).

The parser can be viewed as the implementation of an abstract automaton which adheres to a grammar. The parser as such is independent of the grammar, i.e. it can be fed with any grammar. However, the parser is not independent of the expressive power of its grammar, or, viewed from a different perspective, the grammar is limited to the expressive power of the abstract automaton which is the theoretical basis of the parser. Hence, a parser which is fed with a regular grammar can be seen as equivalent to a finite-state automaton. The parser cannot annotate any structures which cannot be described by a regular grammar. In the Chomsky Hierarchy (Chomsky, 1956), the regular grammar is the most restricted and, thus, least powerful type of grammar (type 3 grammar). Further types of grammars are context-free grammars (type 2), context-sensitive grammars (type 1) and unrestricted rewrite grammars (type 0). We have decided to use a regular grammar for our annotation task because its restrictions make it very efficient and, thus, capable of annotating large quantities of data in an acceptable time. In this thesis we will investigate the annotation of shallow syntactic structures and grammatical functions with finite-state automata.
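The restriction to a regular grammar can be made concrete with a regular expression over PoS tags. The chunk rule below is an invented illustration, not one of the thesis's rules:

```python
import re

# A regular (type 3) grammar over STTS tags, written as a regex:
# PC -> APPR (ART) ADJA* NN   (a simple prepositional chunk)
PC = re.compile(r"APPR (ART )?(ADJA )*NN")

def is_pc(tags):
    """Accept a tag sequence iff it is generated by the regular grammar."""
    return PC.fullmatch(" ".join(tags)) is not None

print(is_pc(["APPR", "ART", "ADJA", "NN"]))  # True: 'in der alten Stadt'
print(is_pc(["APPR", "NN"]))                 # True: 'mit Häusern'
```

Patterns of this kind can be matched in linear time, which is what makes the regular restriction attractive for annotating large corpora; the price is that genuinely recursive structures lie outside its reach.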

1.3 Main Issues of this Thesis

Syntactic parsing can roughly be classified into two paradigms. On the one hand, there are approaches which create grammars manually and, on the other hand, there are approaches which automatically learn grammars from manually annotated corpora. In the one case, the linguistic information is integrated into the parser via explicit encoding by the linguist; in the other case, the information encoded by the linguist in the corpus comes into the grammar indirectly via automatic extraction. In both cases, the grammar is to some extent domain and text type specific since, in the one case, the grammar is tuned to a certain corpus while, in the other case, it is trained on a certain corpus. In both cases, a mechanism has to be found to deal with ambiguous structures because one and the same structure may be assigned more than one annotation. This can be exemplified by sentence 2: both NPs in the sentence can be either the subject or the object. If the structure is not supposed to be left ambiguous in such cases, a decision has to be taken. Many learning approaches take this decision on the grounds of probabilities. A mechanism to cope with these cases with a manually constructed grammar is one central issue of the thesis at hand.

(2) Ihre Mutter hat sie lange nicht gesehen.
    Her mother has she/her long not seen.
    'She has not seen her mother for a long time.' / 'Her mother has not seen her for a long time.'

An important reason for the decision in favour of the manually constructed grammar is to show an alternative to the existing learning approaches, which crucially depend on the existence of manually annotated corpora. Since manually annotating a corpus is a time-consuming task and since parsers learned from corpora typically only work well on the domain and text type they were trained on, it is worthwhile to offer an alternative to this procedure. For a person who wants to annotate a corpus, the question would then be whether to adapt an existing grammar to the new domain and/or text type or to manually annotate a corpus of the new domain/text type. This decision might be guided by both theoretical and practical motives.

Non-recursive kernel phrases, so-called chunks (cf. Abney, 1991, 1996c), have already been successfully annotated using finite-state automata (FSAs). It is one of the central issues of this thesis to scrutinize whether it is possible to extend this shallow annotation to topological fields and clauses, which we posit to adhere to the same syntactic restrictions as chunks in German. In such a way, a very efficient annotation approach could be used to build the structure which we deem crucial for the further annotation of grammatical functions (GFs). After the annotation of chunks, topological fields and clauses with FSAs, these structures can be used as the basis for the annotation of GFs. Chunks can serve as the targets of GF annotation, clauses limit the scope of the attachment of GFs, and the preferences in topological fields can be used for disambiguation. Chunks, topological fields and clauses adhere to syntactic restrictions and can be captured by using PoS tag information alone. By contrast, GFs are subject to lexical selection, and their realisation is restricted by certain morphological features. Sentence 3 gives an example: the fact that the sentence contains a subject, a direct (accusative) object and an indirect (dative) object is caused by the lexical selection of the verb 'geben'. Not all verbs may select those GFs. The GFs are marked by their case and not by their position since the order of constituents is not fixed in German.
It will be one central issue of this thesis to investigate whether it is possible to integrate both the lexical selection information and the morphological features into a finite-state approach and, thus, apply the advantages of the finite-state approach to the annotation of GFs.

(3) Klaus gibt den Brief seiner Kollegin.
    Klaus gives the letter his workmate.
    'Klaus gives the letter to his workmate.'
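The case-marking point can be illustrated with a small sketch (invented for this purpose, not taken from the thesis; the case values are hand-assigned and the GF names are arbitrary labels): if each NP is annotated with its case, the GF labels come out the same under any constituent order.

```python
# Illustrative sketch: German GFs are marked by case, not by position,
# so a scrambled word order yields the same grammatical functions.
CASE_TO_GF = {"nom": "subject", "acc": "direct object", "dat": "indirect object"}

def gfs_by_case(np_chunks):
    """Label each (words, case) NP chunk with its GF, ignoring order."""
    return [(words, CASE_TO_GF[case]) for words, case in np_chunks]

# 'Klaus gibt den Brief seiner Kollegin.' and a scrambled variant:
s1 = [("Klaus", "nom"), ("den Brief", "acc"), ("seiner Kollegin", "dat")]
s2 = [("seiner Kollegin", "dat"), ("Klaus", "nom"), ("den Brief", "acc")]

print(set(gfs_by_case(s1)) == set(gfs_by_case(s2)))  # True: same GFs
```

In the real task the case values are, of course, not given but must be inferred, and they are often ambiguous; this is precisely why morphological disambiguation (chapter 8) matters for GF annotation.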

1.4 Reader's Guide and Overview of the Parser

This section is intended to give the reader an overview of the parser described in the thesis at hand. At the same time, it can be used as a reading guide since we will refer to all relevant chapters of the thesis when introducing the respective parts of the parser. The reader may then have a look at the abstracts at the beginning of most chapters. They give an overview of the content of each chapter and refer to the relevant sections for details.

Our parser is a rule-based parser for German which uses finite-state automata (FSAs) as a parsing formalism. All rules are created by hand. For the annotation of grammatical functions (GFs), the fact that certain patterns re-occur is used to semi-automatically generate rules. Nevertheless, the patterns themselves are created by hand. The shallow parser has a rather small grammar consisting of 288 rules: 103 of these are for verb chunks, since we place a great deal of emphasis on this structure; 80 rules are for all the other chunks; 78 rules are for topological fields; and 27 rules are for clauses. Since the annotation of GFs needs subcategorization information and the order of GFs in German is relatively free, the GF parser has many more rules: overall, there are 9,017 rules, but 31.2% of them are for support verbs (cf. section 9.2). We also have a backup component which does not use any subcategorization information. This component has just 41 rules and yields very good results (cf. section 11.4.3).

Chapter 2 introduces FSAs as a parsing model and discusses some general issues related to rule-based parsers. It then gives an overview of some general principles applied in the construction of the parser, and it introduces the tool used for the implementation of the parser. Chapter 3 discusses related work: work which is related because the same or similar structures are annotated, i.e. shallow structures or GFs. In some cases, the same or a similar approach is applied, i.e. FSAs; in other cases, however, completely different approaches are applied.

Our parser roughly works in two steps: the first step is the annotation of shallow structures; the second step is the annotation of GFs. The first step is outlined in part I, and the second step in part II. Part III, the last of the three parts of the thesis at hand, deals with the evaluation of the two steps of the parser.

The input to the shallow parser is text which is segmented into sentences and tokens. The tokens are part-of-speech tagged with tags corresponding to the STTS part-of-speech (PoS) tagset (cf. Schiller, Teufel, and Thielen, 1995,

and appendix B). Tokenization and sentence segmentation are achieved by tools developed for the KaRoPars parser (cf. Ule and Müller, 2004), not by the author of this thesis; they are not described here. Shallow structures are then annotated solely on the basis of tokens annotated with PoS tags.

Chapter 4 first defines those shallow structures by means of the general features they share and then gives the details of the constituents which we have defined as shallow structures, i.e. chunks, topological fields and clauses. Chapter 5 is the core chapter describing the method of annotating shallow structures with FSAs. It describes the cascaded FSAs as well as the Divide-and-Conquer strategy and the Easy-First-Parsing strategy which are used. Furthermore, it describes how we use topological field information to correct some notorious PoS tagging errors during parsing. The problem of handling recursion with FSAs is also treated in this chapter. A last point is the presentation of our top-down chunking approach, which is in contrast to most other approaches, which proceed bottom-up. Chapter 6 explains why the typical concept of a chunk is not sufficient for the annotation of GFs. It then shows how we extend our concept of chunks to cover complex chunks. These are annotated in a different way than typical chunks because lexical information is used to a certain extent.

The shallow annotation structure, i.e. chunks, topological fields and clauses, is the ideal basis for the annotation of GFs, which is described in part II. Apart from the shallow annotation structure, the annotation of GFs relies on morphological information, subcategorization information and linear ordering principles. Chapter 7 defines GFs and gives the motivation for their annotation. It also presents a detailed task description. Morphological information is crucial for the annotation of GFs since most GFs are associated with case. Chapter 8 describes how we annotate morphological ambiguity classes using the tool DMOR (Schiller, 1995). A complete disambiguation of morphology is not possible on the basis of local information, but we show in chapter 8 how we use the shallow structures which we have annotated to considerably reduce morphological ambiguity.

Chapter 9 is the core chapter describing GF annotation on the basis of shallow structures and morphological information and with the help of subcategorization information taken from the lexicon IMSLex (Eckle-Kohler, 1999). We describe how we integrate the subcategorization information from the lexicon into the FSAs. In total, we extract 105 different subcategorization frames for verbs and 7 for adjectives. 12,688 verbs receive on average 2.6 different subcategorization frames, and 2,224 adjectives are assigned on

average 1.5 different frames. Furthermore, we describe the mechanism which allows us to choose between patterns if more than one could be applied, i.e. the ranking of FSAs; and we explain the principle according to which the FSAs are ranked, i.e. the degree of markedness of a certain pattern. We also show how we extend our approach to cover patterns for which there is no subcategorization information in the lexicon.

Part III of the thesis is about the evaluation of the parser. We have devoted such a large part of the thesis to evaluation because the quality of a tool is important for prospective users. But there are also other reasons: we measured the impact of different factors like PoS tagging, morphological annotation and subcategorization information on the quality of annotation. These findings can help other researchers when constructing parsers because they show in a general way the contributions of certain kinds of information to the performance of the annotation generated by a parser. What is more, these findings also say something about the nature of the German language: the contribution of morphological and subcategorization information to the annotation performance can be interpreted in terms of the nature of GFs in German and of how decisive linear order, morphology and subcategorization information are for their definition.

Part III is divided into three chapters: chapter 10 presents the evaluation of shallow annotation and chapter 11 that of GFs. Chapter 12 compares the results of our parser with those of other approaches. The two chapters concerning the evaluation of our parser both have a section in which the system of evaluation is described in detail. This includes a description of the TüBa-D/Z, which we use as a gold standard, and a description of how we convert the TüBa-D/Z such that it can be compared to the output of our parser. Furthermore, the two chapters contain a section which describes the exact test setting. The results of the parser are then discussed with respect to the factors mentioned above.
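As a rough illustration of the two-step architecture described in this section (the functions, the chunk rule and the GF labels below are invented toys; the actual parser consists of hand-crafted fsgmatch rules):

```python
# Toy sketch of the two-step cascade: step 1 groups PoS-tagged tokens
# into chunks, step 2 labels chunks with grammatical functions using
# (here hand-supplied) morphological case information.

def chunk(tags):
    """Step 1 (shallow): mark maximal ART? ADJA* NN spans as noun
    chunks, using nothing but the PoS tags."""
    chunks, i = [], 0
    while i < len(tags):
        j = i
        if tags[j] == "ART":
            j += 1
        while j < len(tags) and tags[j] == "ADJA":
            j += 1
        if j < len(tags) and tags[j] == "NN":
            chunks.append(("NC", i, j + 1))   # (label, start, end)
            i = j + 1
        else:
            i += 1
    return chunks

def label_gfs(chunks, cases):
    """Step 2 (deep): map the case of each chunk to a GF label.
    The label names are illustrative, not the thesis's inventory."""
    gf = {"nom": "SUBJ", "acc": "OBJA", "dat": "OBJD"}
    return [(gf.get(case, "UNK"), span) for span, case in zip(chunks, cases)]

# 'Der Mann schreibt den Brief' as STTS tags:
tags = ["ART", "NN", "VVFIN", "ART", "NN"]
ncs = chunk(tags)
print(ncs)                              # two noun chunks
print(label_gfs(ncs, ["nom", "acc"]))   # subject and accusative object
```

The sketch mirrors the division of labour: step 1 needs only PoS tags, while step 2 additionally needs morphological (and, in the real parser, subcategorization) information.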


Chapter 2

A Rule-Based Finite-State Parser for German

Abstract: Chapter 2 gives an introduction to the nature and architecture of our parser. Section 2.1 presents finite-state automata (FSAs) as a parsing model and illustrates the motivation for using this very model for the parser at hand. Section 2.2 explains the motivation for and the main features of the parsing approach used in the dissertation at hand, namely the annotation of constituent structures and grammatical functions with hand-crafted rules. Section 2.3 introduces and illustrates principles spanning the annotation of all linguistic phenomena (i.e. robustness, modularity and underspecification) and, thus, all components of the parser. Section 2.4 shows how the FSAs are implemented with the help of the tool fsgmatch by giving illustrative examples of typical linguistic phenomena.

2.1 FSAs as a Parsing Model

A parser is the implementation of an abstract automaton. In the thesis at hand, we use finite-state automata (FSAs) as the abstract automaton which serves as the basis of our parser because FSAs are a very efficient formalism for parsing. The reason for this is that FSAs can be described by regular grammars, which are the most restricted and, thus, least powerful type of grammar in the Chomsky Hierarchy (Chomsky, 1956). Since the power of the formalism is restricted (it cannot, for example, deal with recursive structures), it is considerably faster than other approaches, as Abney (1996c) has shown.

In this thesis we will show that, although the power of FSAs is restricted, it is still possible to use them for the annotation of chunks, topological fields, clauses and grammatical functions (GFs), by building a cascade of FSAs in which succeeding automata can build on structures generated by preceding ones (cf. Abney, 1996b,c). Formally, an FSA is a 5-tuple defined by:

• Q: a finite set of states (q0, q1, q2, …)

• Σ: a finite set of input symbols, i.e. alphabet

• q0 ∈ Q: the initial state

• F ⊆ Q: a set of final states

• δ(q, i): the transition function between states
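The five components can be written down directly in code. The following acceptor is an illustrative sketch (the thesis's automata are implemented with the fsgmatch tool, not in Python), encoding a simple noun-chunk pattern ART ADJA* NN:

```python
# A minimal FSA as the 5-tuple (Q, Sigma, q0, F, delta), run as an
# acceptor over STTS PoS tags. Illustrative sketch only.
Q = {1, 2, 3}                      # finite set of states
SIGMA = {"ART", "ADJA", "NN"}      # input alphabet
Q0 = 1                             # initial state
F = {3}                            # set of final states
DELTA = {                          # transition function delta(q, i)
    (1, "ART"): 2,
    (2, "ADJA"): 2,                # loop: zero or more adjectives
    (2, "NN"): 3,
}

def accepts(tags):
    """Accept iff the tag sequence drives the FSA into a final state."""
    state = Q0
    for tag in tags:
        if (state, tag) not in DELTA:
            return False           # no transition defined: reject
        state = DELTA[(state, tag)]
    return state in F

print(accepts(["ART", "ADJA", "ADJA", "NN"]))  # True
print(accepts(["ART", "NN"]))                  # True
print(accepts(["ADJA", "NN"]))                 # False (no article)
```

Read DELTA as δ: each (state, input) pair maps to exactly one next state, which is precisely what makes this automaton deterministic.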

An FSA can only recognize regular languages. The FSA can be seen either as an acceptor or as a transducer. In the first case, it simply decides whether the input string adheres to the definitions given for the set of input strings defined by the regular expression grammar of the parser, i.e. the regular language. In the other case, the FSA acts as a finite-state transducer (FST). This means that it not only accepts a certain string but also generates an output. Formally, in this case, the FSA has to be extended by the output alphabet γ, and the transition function has to be defined as a transition function between the input alphabet and the output alphabet. For the FSTs of the shallow parser, the PoS tags of the standard German tagset, the Stuttgart-Tübingen Tagset (STTS) (cf. Schiller, Teufel, and Thielen, 1995, and appendix B), are part of the input alphabet. Since the shallow parser is a cascade of FSTs, in which following FSTs build on structures generated by preceding ones, the structures generated by preceding FSTs are also part of the input alphabet of the following FSTs. The initial state of the FST is the state in which it has not yet received any input. The transition function defines the sequence in which the input elements are recognized. The finite set of states keeps track of the position the FST is in, i.e. it keeps track of the elements already consumed. If the FST is in a final state, it recognizes the sequence of input elements consumed up to that point as part of the language described by the FSA, unless there are more elements to be consumed, i.e. unless one or more further transitions from that final state to another state (or to itself) are possible.

Figure 2.1: FSA which accepts one type of noun chunk (NC)

Figure 2.1 shows an FSA which accepts a string of PoS tags. It shows a typical, idealized example of one type of noun chunk (NC). The FSA starts in the initial state (1). It is then ready to consume an article (ART) and move to state 2. The nomenclature for the states is arbitrary; the states could just as well have any other distinct names. The main characteristic of the states is that they show which input is allowed at the FSA's current position in the input string. In the case of state 2, this is either the transition consuming an attributive adjective (ADJA) or the transition consuming a noun (NN). Since the transition consuming the adjective leads back to the state it starts from, it can be applied zero or more times. The transition consuming the noun can only be traversed once. After this, state 3 is reached, which is the final state. Since, in this case, no further transitions are possible, the FSA accepts the sequence at that point as part of the language defined by the FSA. In the case of an FST, the FST would simply emit the intended output string (i.e. NC). The FSA in figure 2.1 is deterministic. In a deterministic FSA, the actions of the automaton are always completely determined with respect to the input string and the initial state, i.e. given an input symbol, there is

one and only one transition possible for each state. If this is not the case, the automaton is a non-deterministic FSA. Partee, ter Meulen, and Wall (1993) list four potential cases in which an FSA is no longer deterministic:¹

• given a state-input element pair, there may be more than one next state

• given a state-input element pair, there may be no next state

• the FSA can read a string of input elements in one move

• the FSA can change state without consuming an input element
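The first of these cases can always be eliminated: every non-deterministic FSA has a deterministic equivalent, obtainable by the standard subset construction. The sketch below is illustrative code, not part of the thesis's implementation; the toy NFA has two ADJA transitions out of the start state, similar to the coordinated-adjective NC of figure 2.2:

```python
from collections import deque

def determinize(nfa_delta, start, finals):
    """Subset construction: convert an NFA, given as a dict mapping
    (state, symbol) -> set of states, into an equivalent DFA whose
    states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa_delta, queue, seen = {}, deque([start_set]), {start_set}
    while queue:
        S = queue.popleft()
        symbols = {sym for (q, sym) in nfa_delta if q in S}
        for sym in symbols:
            # Union of all NFA successors: the new DFA state.
            T = frozenset(t for q in S
                          for t in nfa_delta.get((q, sym), set()))
            dfa_delta[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return dfa_delta, start_set, dfa_finals

def accepts(delta, start, finals, string):
    """Run the deterministic automaton over a list of symbols."""
    state = start
    for sym in string:
        if (state, sym) not in delta:
            return False
        state = delta[(state, sym)]
    return state in finals

# Toy NFA: NC with one adjective (1-ADJA-2-NN-3) or with two
# coordinated adjectives (1-ADJA-4-KON-5-ADJA-6-NN-3).
NFA = {(1, "ADJA"): {2, 4}, (2, "NN"): {3}, (4, "KON"): {5},
       (5, "ADJA"): {6}, (6, "NN"): {3}}
dfa, s0, fin = determinize(NFA, 1, frozenset({3}))

print(accepts(dfa, s0, fin, ["ADJA", "NN"]))                  # True
print(accepts(dfa, s0, fin, ["ADJA", "KON", "ADJA", "NN"]))   # True
```

In the worst case the construction can blow up exponentially in the number of states, a point the text returns to below; for the small, flat patterns used in chunking this rarely matters in practice.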

Daciuk (1998) gives a further case of non-deterministic FSAs: If an FSA has more than one initial state, it is also not completely determined and, thus, a non-deterministic FSA. Additionally, we point out a further problem with respect to deterministic FSAs: Typically, the task of an FSA is described as recognizing whether an input string is part of the language of the FSA or not. The FSA reads in as many elements as there are in the string, and, if it is in a final state when the string is fully consumed, it accepts the string. However, the task in most parsing tasks is to assign structures to sub-strings, i.e. a whole sentence is read in and the parser has to recognize sub-strings of it as constituents. In this case, it has to be determined at which point the FSA should stop in cases in which the parser has more than one final state. For the parser described in this thesis, we use deterministic FSTs. This is crucial because only deterministic FSTs guarantee the full efficiency of the finite-state approach. If an FST is not deterministic, it may produce more than one output, i.e. the output of the parser is not predictable. This is, of course, not acceptable for a parser. Thus, in order to guarantee a definite output, a non-deterministic FST would have to be extended by mechanisms which do not comply with the finite-state approach and would, thus, not be as efficient as this approach. We will show in the following how three of the six potential violations of determinism can be dealt with, namely the problem that there may be more than one next state for a given input, the problem that the FSA can change the state without consuming an input element and

1Paraphrased from Partee, ter Meulen, and Wall (1993), sec. 17.1.3, who use a different notation and a different nomenclature.

12 Figure 2.2: Non-deterministic FSA

the problem of more than one final state. The other three potential violations are problems which are not applicable in the context of our parser. First of all, it is important to notice that every non-deterministic FSA can be converted into a deterministic FSA. This means that for every non- deterministic FSA there exists an equivalent deterministic FSA (cf. Partee, ter Meulen, and Wall, 1993, sec. 17.1.5). We can use the first of the three problems as a example to illustrate how non-deterministic FSAs can be trans- formed into deterministic ones. Consider figure 2.2: We want to build an FSA which recognizes both an NC with one adjective and an NC with two coor- dinated adjectives. Since this is an either/or-decision, we could make the decision between the first and the second case immediately after the intial state. However, in this case, the FSA would be non-deterministic since there are two transitions ADJA from the initial state 1 to the states 2 and 5 (with the problems mentioned above). Thus, in order to achieve a deterministic FSA, we have to transform our FSA as shown in figure 2.3. In the case of the transformation of a non-deterministic FSA into a de- terministic one in figure 2.2 and figure 2.3, the result also appears intuitively comprehensible since the number of states has fallen from six to five. How- ever, if a non-deterministic FSA has k states, the number of states for the deterministic one might be 2k. Still, our example shows that this does not have to be the case. But, since there is no alternative if we want to stay

Figure 2.3: Deterministic FSA equivalent to the FSA in figure 2.2

within the finite-state framework, we have to accept this possibility. The second and the third problem with deterministic FSAs are connected with the selection of the sub-string to be processed from the stream of data. Consider the example in figure 2.4, which exemplifies the second of our problems (i.e. state change without input consumption): the FSA accepts strings representing an NC with an article, an adjective and a noun; and it accepts strings representing an NC without an article but with an adjective and a noun. To put it simply, the article in the NC is optional, thus allowing NCs with and without an article. Consequently, both the NC ‘saure Milch’ and the NC ‘die saure Milch’ are accepted. Since the concept of optionality is crucial for parsing, it is a concept which we would like to be able to model with deterministic FSAs. The optional consumption of an element is represented by the ε transition, in which the FSA changes from one state into another without consuming any input. This is a way of expressing optionality since the FSA now accepts both the string ‘ART ADJA NN’ and the string ‘ADJA NN’. However, FSAs with ε transitions are not deterministic. This becomes obvious if one considers what happens at the point at which the FSA receives an ART as an input. It may either consume the input and proceed to state 2, or it may move to state 2 without consuming the input. In the latter case, the FSA would not accept the string of elements ‘ART ADJA NN’ as an NC. If we, furthermore, accept that the input tape is moved

Figure 2.4: FSA with ε transition to model optionality

Figure 2.5: Deterministic FSA modeling optionality

by one element if this element is rejected by the FSA (as the ART in our example), then the FSA in figure 2.4 might either accept the input string ‘ART ADJA NN’ as an NC or the string ‘ADJA NN’, which is a sub-string of it. Thus, the behaviour of the FSA in figure 2.4 is undetermined unless there is some kind of condition like ‘only take the ε transition if no element can be consumed’. However, such a condition would be outside the finite-state approach. Since any non-deterministic FSA can be transformed into a deterministic one, it is also possible to express optionality with a deterministic FSA. This is illustrated in figure 2.5. If the FSA in figure 2.5 receives the string ‘ART ADJA NN’ as its input, its behaviour is completely determined

Figure 2.6: FSA with two final states applying longest match

since the FSA will change from state 1 to state 2 and finally accept the input. Since the FSA can also change from state 1 to state 3 consuming an ADJA, the string ‘ADJA NN’ is also accepted. What is very important to point out is that there is no longer a conflict whether to match the longer or the shorter string if the longer string occurs. With the FSA in figure 2.5, the longer of the two strings is matched automatically: if the sequence ‘ART ADJA NN’ occurs, the element ART will always be consumed before the sub-string ‘ADJA NN’ is matched. The third problem with respect to deterministic FSAs, i.e. the problem that there may be more than one final state, is also connected with cases in which either a string or a sub-string thereof can be matched. Figure 2.6 shows an FSA which recognizes an idealized prepositional chunk (PC) which either just has a preposition (APPR) or a circumposition consisting of a left part (APPR) and a right part (APPO). Consequently, the APPO is optional just like the article was in the previous example. In the FSA, this is modelled such that the FSA can be in a final state either when the head noun NN is consumed or when a following right part of a circumposition (if any) occurs. Still, in cases in which an APPO occurs, the behaviour of the FSA is not determined. It may either stop because it is in a final state (state 4), or it may consume the APPO and move into the second final state (state 5). However, the FSA can be made deterministic by applying the longest-match strategy proposed in Abney (1996c). This means that, in cases in which the FSA can reach more than one final state, it proceeds to the final state which consumes the most elements. Abney (1996c) argues that this strategy is also in line with the English language and its application is, thus, well motivated. Since we use delimiters in most rules, i.e.
we define clear borders of constituents, which cannot be transgressed, the concept of longest

match is not as crucial in our approach as in the one of Abney (1996c). Still, in cases like the one in the example above, it provides a means to enforce deterministic and correct annotation.
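The longest-match strategy can be made concrete with a small sketch. The following deterministic recognizer uses a simplified tag sequence APPR (ART) NN (APPO), loosely modelled on the PC example of figure 2.6; the transition table and state numbers are illustrative assumptions, not the actual rules of our parser.

```python
# Illustrative sketch of a deterministic FSA with two final states that
# applies the longest-match strategy: it records the last position at
# which a final state was reached and, when no transition applies,
# falls back to that position.

# Simplified PC transitions: APPR (ART) NN (APPO)
TRANSITIONS = {
    (1, "APPR"): 2,
    (2, "ART"): 3,
    (2, "NN"): 4,
    (3, "NN"): 4,
    (4, "APPO"): 5,
}
FINAL_STATES = {4, 5}

def longest_match(tags, start=0):
    """Return the end index of the longest accepted prefix of tags[start:],
    or None if no final state is ever reached."""
    state = 1
    last_accept = None
    i = start
    while i < len(tags) and (state, tags[i]) in TRANSITIONS:
        state = TRANSITIONS[(state, tags[i])]
        i += 1
        if state in FINAL_STATES:
            last_accept = i  # remember the most recent final state
    return last_accept

# Both state 4 and state 5 are reachable here; longest match
# consumes the APPO as well.
print(longest_match(["APPR", "NN", "APPO"]))   # -> 3, not 2
print(longest_match(["APPR", "NN", "VVFIN"]))  # -> 2 (stops before the verb)
```

Because the recognizer only remembers the last final state it has reached, it stays fully deterministic: it never has to choose between stopping in state 4 and consuming the APPO.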

2.2 Parsing German with Hand-crafted Rules

We have motivated the use of FSAs for our parsing approach. There remains the issue of why we are using hand-crafted rules for those FSAs, since some approaches use grammars automatically acquired from corpora. This seems to be a more straightforward solution at first sight because, then, grammars can be learned quickly and automatically from language corpora. The case of the automatic annotation of word categories (i.e. PoS tagging) can serve as an example to illustrate the paradigm shift from approaches working with hand-crafted rules to approaches extracting information from linguistically annotated corpora. PoS information was one of the first phenomena to be annotated in linguistic corpora. When PoS tagging was first carried out in the 1960s on the Brown corpus (Kucera and Francis, 1967), it was dominated by approaches using hand-crafted rules in combination with linguistic data sources, like Klein and Simmons (1963) and Greene and Rubin (1971). Later approaches started using statistical information. Marshall (1983) is the first to use a matrix of collocational probabilities which indicates the relative likelihood of co-occurrence of all ordered pairs of tags. However, this approach still required very large computing time and space and, thus, the natural language processing (NLP) community was not completely convinced of the advantages of stochastic approaches. This changed with works like that of DeRose (1988), which describes an algorithm that is able to assign PoS tags on the basis of a transitional probability matrix – but in linear time. Works such as that of DeRose (1988) have led to a paradigm shift from approaches using hand-crafted rules to stochastic approaches or, more generally, to machine learning approaches.
Machine learning approaches offer a very ‘elegant’ solution to the problem of automatically annotating a corpus with linguistic information because they do not require any heuristics (or at least not as many as approaches working with hand-crafted rules) and can be defined formally as an algorithm. DeRose (1988) explicitly mentions this fact as one of the advantages of his approach when comparing it to previous approaches: ‘Those features that can be algorithmically defined have been used to the fullest extent. Other

add-ons have been minimized.’ (p. 35) It is especially what DeRose calls ‘add-ons’ which is the focus of criticism by most proponents of stochastic approaches. The main reason for this is that these ‘add-ons’ are typically not reproducible by other scientists and the approach, thus, remains quite opaque to everyone except the developer. For this reason, a hand-crafted grammar would typically be lost after the developer had left the institution. Furthermore, it is said that one big disadvantage of approaches using hand-crafted rules is that they are tuned to the test set, i.e. if they have to be applied to other types of corpora than the one they have been developed on, then the grammar would have to be reconstructed. This reconstruction could only be carried out by the respective developer since only he or she is capable of understanding the system. Hence, from the end of the 1980s onwards, almost exclusively machine learning approaches have been used for PoS tagging. Karlsson (1990) is one of the few approaches to use a hand-crafted grammar. The approach is far from being a collection of heuristics; in fact, it uses a constraint-based approach which is formulated concisely. Still, approaches like that of Karlsson (1990) never became as popular as machine learning approaches. One of the main reasons is that, in science, if there are two solutions to one and the same problem, then the simpler one is preferred to the more complicated one. For the problem of PoS tagging, the machine learning solution does, in fact, appear to be the simpler one. The general principle of the relevant machine learning approach can be phrased in just one scientific paper. The same is true of the general principle of, for example, an approach building on a hand-crafted grammar like that of Karlsson (1990). However, the grammar as such is far more complex and would have to be laid out in much more detail.
But even if there were a detailed description of the grammar (or even the full explicit grammar itself), the rules and, even more, their interaction would still not appear transparent. Proponents of machine learning methods also point to the fact that hand-crafted grammars are tuned to the test set, i.e. the developer of the grammar considers just those phenomena which occur in the test set. If the grammar is then to be applied to another corpus of a different text type and domain, it would be laborious to adapt it to that new target sub-language: the grammar would have to be expensively reconstructed, whereas, for a machine learning approach, the same algorithm would just have to be trained on a PoS-tagged corpus of the same text type and domain in order to deal with the new target sub-language. Thus,

not only the construction but also the adaptation of the PoS tagger to new tasks appears to be much simpler. There are, however, some shortcomings in the points of view presented above. These are closely connected to the fact that the efforts and problems in arriving at an annotated corpus are typically not taken into account when the advantages and disadvantages of the two approaches are discussed. Very often, the annotated corpus is treated as an unlimited natural resource which simply exists. Since all successful machine learning approaches known today need a corpus with linguistic annotation as a data source for learning grammar rules, a machine learning approach cannot be applied to a text type for which there does not exist an annotated corpus. However, the construction of linguistically annotated corpora is a time-consuming and costly task. Thus, the adaptation of a machine learning approach to a new text type and domain is by no means simple; and the same was true of the development of machine learning approaches in the first place because corpora first had to be annotated before machine learning approaches could be applied. However, the costs and problems of annotating a corpus are typically not considered in the calculation of the advantages and disadvantages of machine learning approaches. But while, for approaches which work with hand-crafted rules, the linguistic knowledge is coded in the grammar, for machine learning approaches, this knowledge is coded in the annotated corpus. Thus, it is justifiable to include the effort of annotating a corpus in the comparison of the costs of the two approaches. If this is done, one realizes that many of the problems of approaches using hand-crafted rules also apply to machine learning approaches. And there is in fact a lot of effort to be made if one wants to annotate a corpus with linguistic information.
First of all, manual annotators have to be trained such that they are capable of consistently annotating a corpus with linguistic information. To a certain extent, it is possible to speed up this process by giving the annotators a list of automatically generated choices. However, this approach also poses problems because humans tend to accept suggested solutions. For the process of manual annotation, one needs at least two annotators because the manually assigned annotation has to be cross-validated and, thus, has to be checked by at least one other annotator. This is done to guarantee high annotation quality, i.e. to avoid, for example, that one human annotator consistently misannotates certain phenomena or that the annotation is inconsistent. Only if the annotators agree to a high degree in their judgements (inter-annotator agreement) is high annotation consistency guaranteed. If the annotation is

inconsistent, the corpus is not an adequate data source for a machine learning approach to automatic syntactic parsing.

For the manual annotation of corpora, there are annotation guidelines which lay out the way in which the linguistic phenomena should be encoded in the corpus. However, these guidelines also leave room for interpretation. Thus, even if there are guidelines, manual annotators spend a lot of time discussing cases of doubt or cases which did not occur in the text types of the corpus for which the annotation manual was written. If the annotators of the new corpus are not at the same time the annotators of the original corpus, it is highly unlikely that the two corpora are completely consistent. Thus, the fact that a hand-crafted grammar has to be adapted to a new text type (even by a person other than the developer) has to be seen in the context that the same is true of the linguistic annotation of corpora which serve as the basis for machine learning approaches.

Another argument against hand-crafted grammars is that the interaction of the rules cannot be controlled and remains to a large extent opaque. However, as we have shown above, the annotation of linguistic phenomena in a corpus is also to a large extent only accessible to those who have annotated the corpus because the annotation guidelines alone do not allow a full interpretation of the linguistic annotation of the corpus in all cases. Furthermore, the grammar which is generated by a machine learning approach is itself not transparent because the rules coded in it are not linguistic rules but typically collocational probabilities. Consequently, the output of a parser using a machine learning approach allows only vague inferences as to which collocational probabilities actually triggered the respective annotation structure.

We have shown above that both approaches, those working with hand-crafted rules and those working with machine learning algorithms, have their advantages and disadvantages. The decision on which one to choose depends on the availability of a corpus or on the willingness to construct a hand-crafted grammar. At the present time, the main focus in the computational linguistics community is on machine learning approaches. Our motivation to use an approach working with hand-crafted rules is that we want to show an alternative to machine learning approaches. We thus present a method for automatically annotating a corpus without the prior manual annotation of a corpus, which is an arduous task.

2.3 Basic Principles of the Parser

2.3.1 Robustness

Typically, with respect to parsing, robustness means that the parser outputs partial structures in those cases in which it fails to assign a full syntactic analysis of a sentence. But robustness can also mean that the parser assigns the best possible annotation to those structures which it cannot annotate definitively. The latter strategy is successfully applied in PoS tag annotation which relies on stochastic information. Every token is assigned a PoS tag: in unambiguous cases, the definitive tag is assigned and, in ambiguous cases, the most likely one. This strategy is applied because it is assumed that subsequent annotation steps can better deal with uncertain input than with no input at all. However, this strategy also carries some risks because errors which occur at an early stage of annotation can lead to errors in subsequent stages and, thus, ultimately to a complete failure of the parser. A notorious error in the PoS tagging of German is the distinction between finite and infinitive verbs ending in ‘-en’. Since stochastic taggers typically only take into account the immediate context of a token and the distinction between finite and infinitive verbs in German only becomes apparent in a wider context, this phenomenon causes problems for stochastic taggers. If these problems are not dealt with robustly, the annotation of whole sentences can fail because of the PoS tag error of just one token, since structures containing, e.g., two finite verbs or no finite verb at all cannot be fit into a correct parse tree (cf. Müller and Ule, 2002). For this reason, it is useful to apply a strategy in which phenomena which cannot be unambiguously annotated are left partially disambiguated and are only completely disambiguated at later stages of annotation in which all relevant information is present.
This strategy of partial disambiguation is usually referred to as underspecification in Computational Linguistics and will be referred to by this name in the thesis at hand as well, although the term is also used in theoretical linguistics (where it comes from) with a similar but not identical meaning. For the distinction between finite and infinitive verbs, this means that we simply regard the decision as not final and allow for the correction of those PoS tags in later stages.2

2The process of correcting wrong tagging decisions in later stages of the parser is described in more detail in section 5.3.
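To make the idea of underspecified PoS tags concrete, the following sketch keeps the ambiguous ‘-en’ form as an ambiguity class and resolves it only once the whole clause is visible. The tag names follow the STTS tagset, but the one-rule resolution heuristic is a deliberately simplified assumption for illustration, not the actual procedure of section 5.3.

```python
# Sketch of underspecification: the tagger leaves '-en' verb forms as the
# ambiguity class "VVFIN|VVINF"; a later, clause-level module resolves it.
# The heuristic below (another finite verb in the clause forces VVINF) is
# a simplified assumption made for illustration.

AMBIGUOUS = "VVFIN|VVINF"
FINITE_TAGS = {"VVFIN", "VAFIN", "VMFIN"}

def resolve(clause_tags):
    """Resolve underspecified verb tags once the whole clause is visible."""
    has_finite = any(t in FINITE_TAGS for t in clause_tags)
    return [("VVINF" if has_finite else "VVFIN") if t == AMBIGUOUS else t
            for t in clause_tags]

# '... daß die Kinder singen müssen': the modal is unambiguously finite
# (VMFIN), so the underspecified 'singen' must be the infinitive.
print(resolve(["KOUS", "ART", "NN", AMBIGUOUS, "VMFIN"]))
# -> ['KOUS', 'ART', 'NN', 'VVINF', 'VMFIN']
```

The point of the sketch is only the division of labour: the early module commits to nothing it cannot know, and the late module decides with clause-level information in hand.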

Robustness is essential in language technology applications because not producing a parser output typically means that all subsequent modules do not produce an output either. In those cases, the application as a whole fails (cf., e.g., Butt, King, Niño, and Segond, 1999, Sec. 14.1). An output of partial structures, however, is also an advantage if the parser is not part of a language technology application but is used for the annotation of a corpus for theoretical linguistic purposes. The following two subsections illustrate the advantage of partial output over no output. Although this is not the main issue of the thesis at hand, we show the advantages of a partial analysis over no analysis for language technology applications by illustrating it with an example from machine translation (MT). This can be seen as an excursus. Furthermore, the advantage of the strategy for the annotation of corpora for linguistic research is illustrated by showing how a partially annotated corpus can already be used for linguistic queries.

Advantages for Applications

From the syntactic point of view, an MT program ideally has the full syntactic analysis of a sentence, including constituent structure and grammatical functions, as its input. However, if the full analysis fails, a partial analysis is preferable to the pure string of tokens with PoS tags. In order to translate a sentence, an MT program can recognize the full syntactic structure of the sentence in the source language and transfer it into the syntactic structure of the target language. Furthermore, transfers on the lexical and the semantic level are needed, which need not be discussed here. For sentence 4, this means that the constituent ‘die meisten Experten’ has to be recognized as the grammatical function (GF) subject, ‘haben . . . unterbewertet’ as the predicate, ‘die deutschen Aktien am neuen Markt’ as the object and ‘in letzter Zeit’ as an adverbial expression. If the sentence is to be translated, for example, into English, in addition to the words, the word order and the constituent order would have to be transferred into English. The glosses and the translation of sentence 4 into English show that there are considerable differences in the linear order of the constituents in the two languages.

(4) In letzter Zeit haben die meisten Experten die deutschen Aktien In latest time have the most experts the German shares

am neuen Markt unterbewertet. at the new market underestimated. ‘Recently, most experts have underestimated the German shares on the new market.’

If the parser does not work robustly and does not yield any syntactic analysis, the MT program can only translate the sentence from the source language into the target language word by word. This leads to unsatisfactory results, as the gloss of sentence 4 shows. If the parser is, however, able to provide partial analyses like those described in part I of the thesis at hand, i.e. chunks, topological fields and clauses, then a better translation is possible even without the explicit annotation of GFs like subject and object. To begin with, clause boundaries are a great help because they avoid mixing material from different clauses, which would result in completely unwanted output. The topological field structure can then be used to join the verbal elements in the English sentence which are separate in the German sentence. The fact that the subject is preferably the first GF in the clause can also be used for translation: since, obviously, the PC cannot be the subject, the first NC would be the candidate.

(5) [CL [VF [PC In letzter Zeit]] [LK haben] [MF [NC die meisten Experten] [NC die deutschen Aktien] [PC am neuen Markt]] [VC unterbewertet]].

Furthermore, the chunk structure facilitates the translation especially of multi-word expressions like ‘in letzter Zeit’ (recently). However, the automatic translation of a sentence becomes harder the more complex the sentence is and the more it deviates from the expected (i.e. unmarked) constituent order. In those cases, the MT program might yield a clearly unsatisfactory output if a full parse is not provided. However, the decision whether it is better to have an unsatisfactory output or none at all is up to the user of the MT program and, thus, not a linguistic one. Still, the example of sentence 4 has shown the advantages of a robust parser for applications like MT programs. These advantages also hold for other applications in which a partial output is to be preferred to no output at all.

Advantages for Linguistic Theory

By providing evidence for certain linguistic phenomena, even unannotated language corpora can serve as an important instrument for the construction of linguistic theories. However, with increasing complexity and decreasing frequency of linguistic phenomena, it becomes more and more useful to query an annotated corpus. In the ideal case, the linguist is provided with a treebank which is fully annotated with syntactic constituent structure and GFs (or even further information), and a query to such a corpus provides the linguist with all intended phenomena and just the intended phenomena. However, the size of fully hand-annotated corpora is restricted, and the automatic full annotation of corpora may not be successful in all cases. Instead of using plain text in those cases in which the full annotation fails, it is advantageous if a partial analysis can be used. In the following, we will show how a partial analysis can restrict the results of a query to a corpus such that the query neither over-generates nor under-generates, i.e. such that it neither yields too many hits which do not contain the intended phenomenon nor fails to yield all or at least most of the intended phenomena. In order to illustrate the advantages of (partial) annotation for corpus queries, we use an example which is also given in Meurers (2005): if we want to query an unannotated corpus for clauses containing more than one modal verb, then we are confronted with the problem in German that these modals can occur in both parts of the sentence bracket and, consequently, be divided by a lot of language material. If we simply generate a query which searches for two modals in one sentence, then the query will over-generate and yield hits like sentence 6, in which the two modals occur in one sentence but not in one clause. But, if we want to exclude such examples by not allowing potential clause boundaries like commas, conjunctions or subordinators in between the two modals, then we do not get examples like the one in sentence 7.
By just annotating the topological field structure, which includes the sentence bracket, we are already able to find examples for two modals in one clause not only if these are distant but also if these occur in complex structures like the one in sentence 8. Hence, for the example of two modal verbs in one clause the partial analysis already suffices, and it is, consequently, an advantage to produce a partial analysis, should the full analysis of a sentence fail.

(6) Hertchen Bullert kann nicht verantworten, daß die Kinder ohne Hertchen Bullert can not justify, that the children without Ziehharmonikabegleitung singen müssen. piano accordion accompaniment sing must. ‘Hertchen Bullert cannot take responsibility for the fact that the children have

to sing without the accompaniment of a piano accordion.’

(7) Für diesen Job muss man schon drei, vier Jahre schweißen können. For this job must one already three, four years weld can. ‘For this job, you have to have three or four years experience in welding.’

(8) Sollen die, die keine Lust haben, etwa mitkommen müssen? Should those, who no pleasure have, really with come must? ‘Should those who do not feel like it really have to join in?’
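With clause annotation available, the query for two modals in one clause can be sketched as follows. We assume, hypothetically, that every token carries an identifier of the clause it belongs to; the modal tags VMFIN and VMINF follow STTS, and the triple representation is an illustrative simplification.

```python
# Sketch of the query for two modal verbs in one clause: count modals per
# clause identifier instead of per sentence, so hits like sentence 6
# (two modals in two different clauses) are excluded.

from collections import Counter

MODAL_TAGS = {"VMFIN", "VMINF"}

def clauses_with_two_modals(sentence):
    """sentence: list of (token, pos_tag, clause_id) triples.
    Returns the clause ids that contain at least two modal verbs."""
    modals_per_clause = Counter(
        clause for _, tag, clause in sentence if tag in MODAL_TAGS
    )
    return [c for c, n in modals_per_clause.items() if n >= 2]

# Sentence 7: 'Für diesen Job muss man ... schweißen können.' (one clause)
sent7 = [("Für", "APPR", 0), ("diesen", "PDAT", 0), ("Job", "NN", 0),
         ("muss", "VMFIN", 0), ("man", "PIS", 0),
         ("schweißen", "VVINF", 0), ("können", "VMINF", 0)]
# Sentence 6 (abridged): the two modals sit in two different clauses.
sent6 = [("kann", "VMFIN", 0), ("daß", "KOUS", 1),
         ("singen", "VVINF", 1), ("müssen", "VMINF", 1)]

print(clauses_with_two_modals(sent7))  # -> [0]
print(clauses_with_two_modals(sent6))  # -> []
```

A query over plain text has no access to the clause identifier and must therefore approximate it, which is exactly where the over- and under-generation discussed above comes from.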

In other cases, the partial annotation helps to reduce the number of hits a query yields. If, for example, a linguist wants to scrutinize the modification structure within complex NPs with more than one modifying PP, then a chunked corpus can yield essentially better results than a corpus which is simply PoS-tagged. In order to illustrate the advantages of the chunk annotation, we have randomly extracted 2,000 sentences from a corpus of German annotated with our shallow parser and queried them for an NC followed by more than one PC. The query yielded twenty hits. In nine of these cases, the PCs were, in fact, modifiers in a complex NP. We found examples in which the second PC modifies the head noun of the NC in the preceding PC (cf. example 9) and examples in which the PC modifies the head noun of the whole NP (cf. example 10).

(9) [NC Achtung] [PC vor den Verpflichtungen] [PC aus Verträgen] respect for the obligations from treaties ‘respect for the obligations arising from treaties’

(10) [NC die Villa] [PC mit abfallendem Garten] [PC am Hang] the villa with dropping away garden at the slope ‘the villa on the slope which had a dropping-away garden’

The example shows that a query to a partially annotated corpus can lead to satisfying results despite the over-generation. Robust parsers may, thus, not only be used for language technology applications but also for the annotation of language corpora for linguistic purposes. However, it is important to be aware of the limitations of a language corpus which is only partially annotated: although it is possible to find a lot of examples of a certain linguistic phenomenon, it cannot be ruled out that some phenomena are not found; and it is very likely that the phenomena which are found are not

fully representative of the linguistic phenomenon in question. The reason for this is that a query to a partially annotated corpus necessarily has to be simplifying in order not to generate an enormous number of hits which could no longer be processed by the linguist. Instances which do not fit the simplified query are not found unless they are explicitly searched for. Sentence 11 gives an example for the query for a complex NP containing two PPs: since there is an adverbial chunk intervening between the head NC and the modifying PCs, this instance would not be found with a query for an NC followed by at least two PCs. One could extend the query such that AVCs are also allowed. However, this would considerably raise the number of false positives (i.e. unwanted hits). Although partial annotation has disadvantages like these, it is still useful because it can serve as a means of proving the existence of a certain phenomenon.

(11) Ich habe [NC ein Bild] [AVC hier] [PC von einem Künstler] [PC aus I have a painting here from an artist from Spanien]. Spain. ‘I have a painting by an artist from Spain, here.’
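The trade-off just described can be made concrete with a small sketch that matches an NC followed by at least two PCs over a sequence of chunk labels, optionally tolerating one intervening AVC. The flat chunk-label representation and the regular expression are illustrative assumptions, not our actual query language.

```python
# Sketch of the chunk query: over a flat sequence of chunk labels
# (a simplifying assumption), find an NC followed by at least two PCs.
# Allowing an intervening AVC recovers cases like sentence 11 but, as
# discussed above, also raises the number of false positives.

import re

def complex_np_candidates(chunk_labels, allow_avc=False):
    """Return match start positions (in the joined label string) where an
    NC is followed by at least two PCs."""
    s = " ".join(chunk_labels)
    if allow_avc:
        pattern = r"NC(?: AVC)?(?: PC){2,}"
    else:
        pattern = r"NC(?: PC){2,}"
    return [m.start() for m in re.finditer(pattern, s)]

# Examples 9/10: NC PC PC is matched directly.
print(complex_np_candidates(["NC", "PC", "PC"]))                       # -> [0]
# Example 11: NC AVC PC PC is missed unless the AVC is allowed.
print(complex_np_candidates(["NC", "AVC", "PC", "PC"]))                # -> []
print(complex_np_candidates(["NC", "AVC", "PC", "PC"], allow_avc=True))  # -> [0]
```

The two flags make the recall/precision trade-off explicit: the stricter pattern under-generates on sentence 11, the looser one over-generates elsewhere.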

2.3.2 Modularity of Annotation Components

Similar to the notion of robustness, the notion of modularity is important both from a practical point of view and from the point of view of linguistic theory. As regards the practical aspect, the modular arrangement of a parser enhances its maintainability and, consequently, makes it easier to evaluate the parser and to adapt it to different purposes. As regards linguistic theory, it is reasonable to split up the annotation process into different modules because different layers of annotation with different linguistic phenomena should be handled with different methods. These annotation modules are, for example, a tokenizer, a lemmatizer, a PoS-tagging component, a morphological analyzer or a parser. It is, however, also very useful to construct the parser itself in a modular way. From a practical point of view, there are various reasons for splitting up the parsing task into distinct modules. The most important ones are reasons of maintainability, development, evaluation and adaptability, and will be discussed in the following subsections.

Maintainability

A large and complex parser can barely be maintained by one person alone. In order to avoid unwanted interference between the different collaborators, it is highly advisable to split up the work into distinct tasks to allow for parallel work on those tasks. This strategy, which can best be subsumed under the heading ‘one Grammar, many cooks’ (cf. Butt, King, Niño, and Segond, 1999, p. 186), allows easy and reliable maintenance of a parser maintained by several linguists. It has, however, to be ensured that the interfaces of the different modules are properly defined because, otherwise, the different components of the grammar might not correspond as intended. This might occur, for example, if the linguist responsible for the tagger (or for the corpus the tagger is trained on) interprets the tagging guidelines in a way other than the linguist responsible for the parsing component.

Development and Evaluation

Evaluation is a prerequisite for the development of a grammar. The advantages of a modular strategy are, therefore, the same for the development and the evaluation process. In order to improve the performance of the parser, modularity makes it possible to exchange modules of the parser to check whether other approaches work better. This is very obvious in the case of, for example, different PoS-tagging approaches because the interface between the PoS layer and the parsing layer is a very clear-cut one. If the same tagset is used, the PoS tagger can simply be exchanged and the effect on overall parsing results can be investigated. However, it is also possible to construct the parser itself in a modular way. One very common way is to do a shallow parse before the full parse, as is proposed in the thesis at hand. Different approaches can then be applied on top of the shallow parsing component. In this way, components specific to a certain task can be constructed. Even given these components, the parser can be constructed in a modular fashion, as figure 2.7 shows. First, the text is annotated by the shallow parsing component (KaRoPars). Then morphological information is added and (partially) disambiguated. Subsequently, GFs are annotated, first by the support verb component, then by the lexical verb component and then by the non-lexical component. This process is described in detail in chapter 9. The modular construction of the parser makes it possible to evaluate the effect of every single component and, thus, to measure their contribution to the

Figure 2.7: Modular construction of a parser

overall success of the parser. This is crucial for development as well, since it also shows the weaknesses of the parser. Furthermore, a modular structure of the parser can be used to annotate phenomena which are only relevant for certain defined purposes. If, for instance, a certain purpose requires the annotation of Named Entities, a component for this can easily be integrated into the parser.
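The modular cascade sketched in figure 2.7 can be illustrated by a minimal pipeline sketch. The function names and the dictionary-based annotation object are purely illustrative assumptions and do not reflect the actual interfaces of KaRoPars or of the GF components:

```python
# Sketch of the modular cascade of figure 2.7 (illustrative names only,
# not the actual KaRoPars interfaces).  Each component maps an annotation
# object to an enriched annotation object; swapping one module for another
# leaves the rest of the cascade untouched.

def shallow_parse(ann):
    ann["chunks"] = ["NC", "VC"]       # placeholder shallow structures
    return ann

def add_morphology(ann):
    ann["morph"] = ["nom/acc", "fin"]  # partially disambiguated morphology
    return ann

def annotate_gfs(ann):
    ann["gfs"] = ["ON"]                # grammatical functions
    return ann

PIPELINE = [shallow_parse, add_morphology, annotate_gfs]

def run(tokens, pipeline=PIPELINE):
    ann = {"tokens": tokens}
    for component in pipeline:         # evaluate after each step if desired
        ann = component(ann)
    return ann

result = run(["Der", "Vater", "kocht"])
```

Because each component only reads and extends the shared annotation object, the effect of any single module can be measured by running the pipeline with and without it.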

Adaptability

A syntactic parser might be used for different purposes. It can, for instance, be used for annotating a corpus for linguistic research; it can be used for an MT system; or it can be used for an Information Retrieval (IR) application. All these different purposes require different kinds of annotation. A modular construction of the parser supports ‘easy customization’ (cf. Neumann and Piskorski, 2002). This means that the parser consists of domain-independent modules, which annotate up to a very general level of annotation, and of domain-dependent modules, which carry out the specific annotation. The parser consists of a ‘core’ and of additional components which can be used for various tasks. By adopting a modular approach, it is far easier to adapt the parser to different purposes. It also has to be kept in mind that various text types may differ a great deal. With the approach at hand, it is possible to adapt a parser to specific text types without it losing the capability of serving as a general-purpose tool.

Different Modules for Diverse Phenomena

Diverse linguistic phenomena require a variety of annotation methods. Consequently, it is useful to split up the annotation tasks into several steps along the lines of the different methods applied. There are various reasons for this. One of them is the efficiency of the parser. This can best be shown with the example of shallow structures vs. GFs. While the annotation of shallow structures is essentially a matter of syntactic restriction, the annotation of GFs is more a matter of lexical selection. Shallow structures, on the one hand, and GFs, on the other, are, therefore, at different levels of linguistic description. To annotate shallow structures, one just needs PoS tag information; to annotate GFs, one also needs subcategorization information about the respective verb (or other selector) and morphological information about the potential GF targets. As the annotation of the latter clearly needs more

linguistic information, which in turn will take more processing time, it is advisable not to use the same ‘expensive’ method for shallow structures as for GFs. This fact has been pointed out in Abney (1991) as regards the separation of the chunker and the attacher in his parser:

By having separate chunker and attacher, we can limit the use of expensive techniques for dealing with attachment ambiguities to the parts of grammar where they are really necessary – i.e., in the attacher. (Abney, 1991)

The difference between the two phenomena of shallow structures and GFs can be illustrated by the invented example in sentence 12. The order of the tokens in the NPs ‘den lieben Sohn’ and ‘der alte Vater’ is fixed. First comes the determiner, then the adjective and then the noun. This order depends purely on the parts of speech and cannot be violated. The same is true of the position of the finite verb in clauses of this type. It can only come second. In contrast, the order of the subject and the (direct or indirect) object may vary. Either the object or the subject may come first. Whether the GFs subject and object exist at all depends on the verb. Not all verbs subcategorize for both a subject and an object. Furthermore, only the first NP in sentence 12 can be the object and only the second the subject because of the morphological features of those NPs. This illustrates that a lot more information is needed to annotate GFs than shallow structures, which is why we separate these tasks in the thesis at hand.

(12) [Obj Den lieben Sohn] sieht [Subj der alte Vater].
     The dear son sees the old father.
     ‘The old father sees the dear son.’

There are also other reasons for the separation of annotation steps which deal with different linguistic phenomena and, therefore, with different annotation methods. Some correspond to the practical advantages of modularity already mentioned above: It is much easier to maintain, develop and evaluate a parser in which different methods are implemented in different modules, because it is much easier to exchange or improve one annotation method if it is clearly separated from the others; otherwise, the improvement of one method might lead to unwanted effects in other annotation methods. Chapter 11 is devoted to the evaluation of the GF annotation, and a central issue is the effect of different components

of the parser on the overall results. Thus, the effect of the morphological disambiguation component or the impact of the components working with or without subcategorization information can be investigated in detail. It has, however, to be pointed out that the modular treatment of different linguistic phenomena does not mean that the annotation of the linguistic phenomenon treated by that special module has been fully completed. In fact, subsequent modules may alter the annotation of preceding modules because they may have evidence which was not present at a preceding level. The important point, however, is that the different modules do not interact with each other directly; only such direct interaction could lead to unwanted interference.
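The revision of earlier annotation by a later module, without any direct interaction between the modules, can be sketched as follows. The dictionary keys and the clause-structure flag are invented for the example and are not actual parser code:

```python
# Sketch: a later module may revise an earlier module's annotation when it
# sees evidence that was unavailable before (illustrative, not KaRoPars code).

def pos_module(ann):
    # Locally, 'herstellen' is ambiguous; the tagger commits to one reading.
    ann["pos"]["herstellen"] = "VVFIN"
    return ann

def clause_module(ann):
    # With clause structure now available, the earlier decision is corrected.
    if ann.get("clause_final_with_finite_aux"):
        ann["pos"]["herstellen"] = "VVINF"
    return ann

ann = {"pos": {}, "clause_final_with_finite_aux": True}
for module in (pos_module, clause_module):  # modules never call each other
    ann = module(ann)
```

Each module only reads and writes the shared annotation; neither module calls the other, which is exactly the restriction that rules out unwanted interference.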

2.3.3 Underspecification

Underspecification in the context of automatic parsing means that structures which cannot unambiguously be annotated on one level of annotation are left partially ambiguous. They are disambiguated on later levels of annotation if enough information is present to decide unambiguously which kind of structure exists or which category a certain structure has. Thus, although the parser is constructed in a modular way along the lines of different linguistic phenomena, it is possible to alter or extend annotation which has been added on a previous level. Hence, there is a certain permeability of levels. The modules themselves are self-contained, but the different levels of annotation are not completely separate from one another. Information from subsequent levels of annotation can ‘ooze through’ to preceding levels. The reason for the application of the concept of underspecification is closely connected with a phenomenon which one may call the missing information dilemma. Every linguist annotating a corpus is faced with this dilemma: In order to annotate certain local linguistic phenomena, one needs information about the grammatical structure of the whole sentence but, in order to annotate the whole sentence, one needs information about local linguistic phenomena. One example of this dilemma is connected with the distinction between finite and infinitive verbs. For the annotation of clause structure, it is crucial to know whether a verb is finite or not. However, this cannot always be decided on the PoS level by state-of-the-art taggers because it already requires information about the whole clause structure. This is – quite obviously – unknown at the beginning of the annotation process. Thus, the ambiguity cannot reliably be resolved. In our examples 13 and 14, in which we focus on the ambiguity between finite and infinitive forms of the

verb ‘herstellen’, even other indicators of the correct PoS tag like the finite verb ‘lassen’ in sentence 13 and the relative pronoun ‘die’ in sentence 14 are ambiguous, because, from the point of view of the bare token, ‘lassen’ could be both finite and infinitive and ‘die’ could be both an article and a relative pronoun.

(13) Damit lassen sich große Mengen von Dünger herstellen.
     With this let themselves large quantities of fertilizer produce.
     ‘Large quantities of fertilizer can be produced with this.’

(14) Das sind Firmen, die große Mengen von Dünger herstellen.
     Those are companies, which large quantities of fertilizer produce.
     ‘Those are companies which produce large quantities of fertilizer.’
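That a local window cannot resolve this ambiguity can be made concrete with the two example sentences: the immediate left context of ‘herstellen’ is string-identical in both. The following sketch (plain Python with an invented helper name) simply demonstrates this:

```python
# Why a local-window tagger cannot resolve 'herstellen' in sentences 13/14:
# the immediate left context of the verb is string-identical in both.

s13 = "Damit lassen sich große Mengen von Dünger herstellen .".split()
s14 = "Das sind Firmen , die große Mengen von Dünger herstellen .".split()

def left_context(tokens, word, n=3):
    """Return the n tokens immediately preceding the first occurrence of word."""
    i = tokens.index(word)
    return tokens[max(0, i - n):i]

c13 = left_context(s13, "herstellen")
c14 = left_context(s14, "herstellen")
# Both windows are ['Mengen', 'von', 'Dünger']: the disambiguating tokens
# ('lassen' vs. 'die') lie outside the window and are themselves ambiguous.
```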

One of the most widely acknowledged solution strategies for the missing information dilemma is the easy-first parsing strategy: ‘We make the easy calls first, whittling away at the harder decisions in the process.’ (Abney, 1996c) This strategy means that structures which can reliably be annotated are annotated first, and ambiguity of other structures is thereby reduced (containment of ambiguity). A similar strategy implemented for Italian is described by Federici, Montemagni, and Pirrelli (1996):

The “minimalist” approach to shallow parsing described in these pages segments a text into units which can be identified with certainty on the basis of the comparatively little amount of linguistic knowledge available at this stage. [...] The chunking process stops at that level of granularity beyond which the analysis gets undecidable, i.e. whenever more than one parsing decision is admissible on the basis of the available knowledge. (Federici, Montemagni, and Pirrelli, 1996)

The important point here is that parsing stops when decisions become ambiguous – but only on the basis of the knowledge available at this stage. Structures are not ambiguous as such but could be resolved with more knowledge. But, for reasons of effectiveness, they are resolved at a later stage. This part of the strategy is still very much in line with the easy-first parsing strategy described in Abney (1996c). However, the strategy is also enriched by the concept of underspecified chunk categories:

It is not always the case that the category of a chunk can be identified unambiguously given the available knowledge: this problem can be got around through use of underspecified chunk categories, [...] The chunking process resorts to underspecified categories in case of systematic ambiguity. (Federici, Montemagni, and Pirrelli, 1996)

This concept differs from what is described in Abney (1996c). While the concept in Abney (1996c) is rather one of underspecification of annotation structure, the concept in Federici, Montemagni, and Pirrelli (1996) is one of underspecification of both structure and categories. If structures are recognized correctly, but the categories cannot definitely be assigned, an ambiguous chunk label is assigned. In the case of ambiguous structures like ‘un’immagine colorata’, where ‘colorata’ can be interpreted as either an adjective or the participle of a verb, a chunk is introduced which does not resolve the ambiguity but describes it by merging the labels into the portmanteau label ADJPART_C (ADJ for adjective, PART for participle and C for chunk). Our parser makes use of both kinds of underspecification, i.e. it leaves syntactic structures underspecified if they cannot reliably be annotated at that point, and it leaves categories of syntactic constituents underspecified if disambiguation is not possible at that stage of annotation. An example of the former kind of underspecification is that we first annotate shallow structures with our parser before we annotate grammatical functions; an example of the latter kind is that we do not distinguish between the adjectival and the adverbial use of the tokens tagged ADJD in the shallow parser. The examples of the latter kind are closely connected with the concept of ambiguity classes. Ambiguity classes allow us to assign underspecified labels. An ambiguity class is a class of labels which subsumes all the categories which describe a certain linguistic phenomenon at a given point in the annotation process with the knowledge available at that very point. In our parser, ambiguity classes are introduced to come to terms with the missing information dilemma by breaking up the strict separation of the annotation of different linguistic phenomena, while still using the advantages of a modular annotation architecture.
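The idea of an underspecified, merged category label can be sketched as follows; the helper name and the underscore in the label are our assumptions for the example, not the original implementation:

```python
# Sketch of category underspecification in the style of Federici et al.
# (1996): when several categories remain admissible given the available
# knowledge, merge them into one portmanteau label instead of guessing
# (illustrative helper, not their actual code).

def chunk_label(candidates):
    if len(candidates) == 1:
        return candidates[0] + "_C"
    # systematic ambiguity: keep all admissible categories in one label
    return "".join(sorted(candidates)) + "_C"

# 'colorata' in 'un'immagine colorata': adjective or participle reading
label = chunk_label(["ADJ", "PART"])
```

A later level that knows, for example, the governing verb can then replace the portmanteau label with the resolved category.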
Ambiguity classes have already been introduced for the CLAWS C5 tagset for the British National Corpus (BNC) (cf. Garside, 1987; Leech and Smith, 2000) and are recommended by the Expert Advisory Group on Language Engineering Standards (EAGLES) for PoS tagsets (Leech and Wilson, 1996).

There are different kinds of ambiguity classes, which can best be defined along the lines of the amount of information they provide. If, for instance, all the possible PoS tags for a token are listed without any preferences for their actual occurrence, this can be called an unordered ambiguity class. One example of this kind of ambiguity class is the tag VVN-VVD from the BNC Version 1. This tag shows that the PoS tagger was not able to clearly decide between the participle form of a verb (VVN) and the past tense form of a verb (VVD). Thus, instead of making too many incorrect decisions, the tagger leaves the decision partially ambiguous and leaves it to the user or to further annotation steps to fully disambiguate the token. If a preference for one PoS tag is given without the probability of the preference, the class can be called a ranked ambiguity class. An example of this type of ambiguity class is the tag VVN-VVD/VVD-VVN from the 2nd version of the BNC, where a preference for one of the two tags is given by placing the preferred tag first in the ambiguity class tag. VVN-VVD shows a preference for the participle and VVD-VVN for the past tense (cf. Leech and Smith, 2000). The highest amount of information is offered by a model in which the preferences for the different PoS tags are assigned probabilities. This class can be called a weighted ambiguity class. Theoretically, an ambiguity class can contain as many tags as the tagset holds minus one (since, otherwise, the phenomenon would be undefined). In practice, however, the number of possible tags is limited. In the BNC, it is limited to two tags. For weighted ambiguity classes it is, for instance, possible to limit the number of tags to all tags with a probability of more than 10% because less probable tags are most likely useless for further annotation. There is also one further case of ambiguity class, which can be called an implicit ambiguity class.
In this case, there is no class of different PoS tags, but one tag in itself contains different parts of speech. The reason why there is just one tag for two (or more) clearly distinguishable parts of speech is very often that they are morphologically completely ambiguous and that their ambiguity is not resolvable without a full parse. One example for this implicit ambiguity class is the treatment of ‘haben’ (have) in the standard German PoS tagset, the STTS (Schiller, Teufel, and Thielen, 1995). All instances of ‘haben’ are tagged VAFIN/VAINF/VAPP (auxiliary verb), although ‘haben’ may also function as a lexical verb. However, this distinction can only be grasped in a full parse. We adopt this implicit ambiguity class and do not distinguish between lexical and auxiliary use of ‘haben’ until we annotate GFs. Only at that stage do we check whether ‘haben’ selects complements and is, thus a

34 lexical verb, or whether it is the auxiliary verb of lexical verbs. Working with ambiguity classes has many advantages: Primarily, the easy-first parsing strategy, which has hitherto mainly been used to anno- tate the least ambiguous structure first, can now be applied to constituent categories. For instance, the decision between adjectival and adverbial use of chunks headed by tokens tagged ADJD does not have to be taken on a level on which it cannot be decided. Moreover, information which would be lost if just one category label was selected can be retained. Furthermore, although most annotation modules are not perfect – however low their error rate may be – annotation can progress to subsequent levels trying to correct annotation errors of preceding levels. Generally speaking, ambiguity classes provide a flexible means of dealing with information acquired during the an- notation process. They allow fine distinctions being made on a lower level of annotation, although those distinctions cannot reliably be resolved when this level is annotated. The ranked or weighted ambiguity classes make it, for instance, possible to assign PoS tags at a point at which some of them cannot unambiguously be assigned. Two practical examples further illustrate the advantages of this strategy: In German, there are tokens which – from a morphological point of view – appear to be either adverbs or predicative adjectives. A distinction between the two functions cannot be made without taking into account the wider context of the token. This might be one reason for them to be assigned one common PoS tag (namely ADJD) in the STTS. The tag ADJD is, therefore, – according to our definition – an implicit ambiguity class. Thus, there is only a partial disambiguation on the PoS level. However, this suffices already for the chunking task as the difference between adverbs and predicative adjectives is only tackled at later stages of the annotation process. 
But, as regards some of the ADJDs, partial disambiguation can already be achieved on the chunking level. If an ADJD clearly occurs within a noun chunk, it is treated like an adverb and becomes the head of an adverb chunk (AVC) because predicative adjectives cannot occur within noun chunks (cf. example 153). Thus, partial disambiguation proceeds, reducing ambiguity while iterating through the different levels of annotation. However, sentence 16 and 17 illustrate that, very often, in order to determine whether the ADJD in question is an adverb or a predicative adjective, one has to consider the governing verb. Consequently, disambiguation can only be achieved when

3Only the relevant annotation is given in these examples.

at least clause boundaries are detected and the partially ambiguous chunk in question can be assigned to the verb either as its predicate (in the case of the copula ‘sein’; see sentence 16) or as an adverbial (in the case of most lexical verbs; see sentence 17). Until that stage, the chunk headed by an ADJD receives the ambiguity label AJVC (for adverb or predicative adjective chunk).

(15) [NC eine [AJAC [AVC gut/ADJD] schmeckende/ADJA] Suppe]
     a good tasting soup
     ‘a soup which tastes good’

(16) Die Suppe war/VAFIN in der Tat [AJVC [AVC sehr] gut/ADJD].
     The soup was indeed very good.
     ‘The soup was very good, indeed.’

(17) Er kocht/VVFIN in der Tat [AJVC [AVC sehr] gut/ADJD].
     He cooks indeed very good.
     ‘He cooks very well, indeed.’
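The ADJD treatment described above can be summarized in a small decision sketch. The labels ‘PRED’ and ‘ADV’ for the fully disambiguated readings are invented for illustration; only ‘AVC’ and ‘AJVC’ are labels actually used by the parser:

```python
# Sketch of the ADJD treatment described above (illustrative, not the actual
# KaRoPars rules): inside a noun chunk an ADJD must be adverbial; otherwise
# the chunk stays underspecified (AJVC) until the governing verb is known.

def adjd_chunk_label(inside_nc, governing_verb_lemma=None):
    if inside_nc:
        return "AVC"         # example 15: adverb within a noun chunk
    if governing_verb_lemma is None:
        return "AJVC"        # clause structure not yet known: stay ambiguous
    if governing_verb_lemma == "sein":
        return "PRED"        # example 16: predicate of the copula (label invented)
    return "ADV"             # example 17: adverbial of a lexical verb (label invented)
```

The underspecified AJVC label thus survives exactly as long as the information needed for the decision is unavailable.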

Another example is the ambiguity between finite and infinitive verbs. But, unlike the tokens tagged ADJD, finite vs. infinitive verbs do not receive an implicit ambiguity class but a weighted ambiguity class. In our case, this means that the most likely tags are assigned with both ranks and probabilities. As some of the tokens in this class are fairly unambiguous, parsing can begin after PoS tagging. The parser simply takes the first-ranked PoS tag and starts to build structure. There is, however, an error rate in the tags concerning the finite/infinitive distinction which is approximately 1.75 times higher than the average error rate (Müller and Ule, 2002). The main reason for this is that – unlike with other tokens which are morphologically ambiguous – the ambiguity of morphologically ambiguous verbs cannot be resolved within the local context which is usually taken into account by standard PoS taggers. Thus, special attention has been paid to verbs: in those cases in which the modules dealing with the parsing of the verbs fail because the tag of the verb does not fit into any parsing scheme, the parser tries to fit one of the other tags in the ambiguity class into the parsing scheme. The concept of ranked and weighted ambiguity classes thereby helps to ‘repair’ tagging errors on later levels of annotation.4

4See section 5.3 for details.
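A weighted ambiguity class with a cut-off and the tag fall-back just described can be sketched as follows; the 10% cut-off is the example value mentioned earlier, while the data structure and the function names are assumptions for illustration:

```python
# Sketch of a weighted ambiguity class with tag fall-back (illustrative;
# the data structure and names are assumptions, not KaRoPars code).

def weighted_class(tag_probs, cutoff=0.10):
    """Keep all tags above the cut-off, ranked by descending probability."""
    kept = [(t, p) for t, p in tag_probs.items() if p > cutoff]
    return sorted(kept, key=lambda tp: -tp[1])

def parse_with_fallback(ambiguity_class, fits_parsing_scheme):
    """Try the first-ranked tag; on failure, fall back to the remaining tags."""
    for tag, _prob in ambiguity_class:
        if fits_parsing_scheme(tag):
            return tag
    return None

amb = weighted_class({"VVFIN": 0.55, "VVINF": 0.40, "VVPP": 0.05})
# Suppose the clause structure later shows that only an infinitive fits:
repaired = parse_with_fallback(amb, lambda t: t == "VVINF")
```

The first-ranked tag is used as long as it fits a parsing scheme; the lower-ranked tags are only consulted when it does not, which is how tagging errors can be ‘repaired’ at later levels.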

2.4 The fsgmatch Annotation Tool

We have discussed above the theoretical background of the parser presented in the thesis at hand. In this section, we show some examples of annotation rules to illustrate the actual implementation of our parser, which is presented in subsequent chapters. The annotation tool with which the parser is implemented is the transducer fsgmatch5, which is the main component of the text tokenization tool TTT (Grover, Matheson, and Mikheev, 1999; Grover, Matheson, Mikheev, and Moens, 2000). The transducer fsgmatch fulfills all the requirements which we have outlined as basic principles for a parser in section 2.3. Since the tool is a transducer, it can be constructed such that it works robustly on its input. If no structure can be assigned to the input, the parser simply returns the unchanged input as the output. Thus, subsequent components can still be applied to the data and annotation does not fail completely; and in the case that a backup component can assign structure, it does not fail at all. Furthermore, the transducer allows a modular parser architecture because it is arranged as a cascade of transducers in which each transducer receives the output of the preceding transducer as its input. In such a cascade, components can be exchanged without excessive effort during development, and the effects can be tested. Likewise, other tools can be called between the transducers. In addition, it is easy to leave structure underspecified with the transducer fsgmatch because it can add information and also alter or delete it. Decisions which cannot be resolved on an early level may thus be delayed until more information is available. The general principle of the transducer fsgmatch is that it reads in a stream of data, adds (or sometimes alters or deletes) information and then outputs the stream of data with the added information such that it can be read in by the next transducer.
This process is called piping and the complete architecture is called a pipeline of transducers or a cascade of transducers. In this cascade of transducers, information is added incrementally such that the amount of linguistic information constantly grows. This kind of parsing is also referred to as incremental parsing. Incremental parsing is one of the core concepts of our parser. The transducer fsgmatch uses the data structure of XML to store the added linguistic information in a feature-value structure. It can process an

5http://www.ltg.ed.ac.uk/software/ttt/tttdoc.html


input stream of raw text as well as existing XML structure.

Figure 2.8: Initial XML structure before parsing

As regards our parser, only the handling of XML data is relevant because the input to our parser already consists of text encoded in XML structure (and annotated, for example, with PoS tags). Typically, the parser would wrap some XML elements into a larger unit, thereby gradually creating a tree structure. The new element then contains a feature-value pair which indicates the constituent category. Sometimes, an additional feature-value pair would be added to an already existing XML element. Figures 2.8–2.11 give an example of the incremental building of markup in the XML structure. Figure 2.8 shows the initial XML structure.6 The structure consists of a string of tokens (‘t’). The token element has a feature ‘f’, which has the form of the token as its value. The token element contains an element ‘P’ for PoS tags, which has a feature ‘t’ (for tag) with the respective tag as its value. This structure is the initial input to the transducer fsgmatch. Only the value of the feature ‘t’ (i.e. the PoS tag) is used for the annotation of shallow constituent structure. The additional information relevant for grammatical function annotation (e.g., lemma and morphology) is not shown in the examples. The transducer fsgmatch uses a grammar file to assign structure. In the case at hand, the PC is recognized first (figure 2.9).7 It is marked by wrapping the elements of the PC into a chunk element (‘ch’). The category of the chunk is coded as the value of the feature ‘c’. As figures 2.10 and 2.11 show, linguistic annotation is added incrementally. Special transducers for each linguistic phenomenon add information, building on already existing linguistic information if necessary. In the case at hand, for instance, the search space of the NC and the AJACs is limited by the borders of the PC. The grammar files used by fsgmatch are also coded in XML. Fsgmatch

6Only the features which are relevant for this example are shown.
7We use a top-down approach for chunking, which is described in section 5.5.


Figure 2.9: Prepositional chunk in XML structure

Figure 2.10: NC added to XML structure

Figure 2.11: Adjectival chunks (AJAC) added to XML structure

has all the common operators of a regular expression (RE) grammar: Kleene star (zero or more elements), Kleene plus (one or more elements) and optionality (element is present or not). Furthermore, disjunction (alternation, i.e. either/or) and grouping (to define the scope of an expression) are possible. Figure 2.12 shows the rule for one type of NC. The element ‘RULE’ contains the elements to be matched by the RE pattern. The RULE element has a feature ‘name’ which contains the name of the rule. Rules are called at the end of the rule file (or even outside the rule file in a pipeline) such that the effect of certain rules can be checked quickly for testing purposes. An example of this is shown in figure 2.13, in which a disjunction of rules for NCs is called. It is thus easy to test the effect of a single rule by commenting it out; and in the case that the rules are applied in a sequence, it is possible to change the order of the sequence for testing purposes. In the rule in figure 2.15, all matched structures are represented by subrules, i.e. the actual structures are only referred to and are defined at another place in the rule file. The reason for this is that most structures re-occur frequently, and matching them directly would thus make the rule file a lot bigger. Furthermore, maintaining the rule file is much easier if changes only have to be made in one place. Figure 2.15 represents a rule in which the ON occurs in the VF (Vorfeld, initial field)8 and is followed by optional elements.9 The third structure is an auxiliary verb as the left part of the sentence bracket; then follow optional elements, the OPP with the preposition ‘mit’, optional elements which cannot be the OA, the OA and, finally, the lexical verb which subcategorizes for this frame as the right part of the sentence bracket. The RULE element further contains a feature ‘targ_sg’ which has the name of an XML element and its feature-value structure as its value.
This is the XML element into which the pattern which is matched by the RE is wrapped. In the case in figure 2.12, it is a chunk with the category ‘NC’. The REL elements in the RULE element contain the constituents to be matched. The RULE element has a feature which indicates whether these REL elements are to be matched as a sequence or as a disjunction. The default is ‘sequence’, as applied in the example in figure 2.12. In this case, the feature does not have to be explicitly stated. REL elements have a feature ‘type’ which indicates the way in which the pattern matching is applied. The default value is that

8See section 4 for details of the theory of topological fields.
9Typically, there is just one constituent in the VF. There are, however, a few exceptions.


Figure 2.12: Fsgmatch rule for an NC

Figure 2.13: Rules are called at the end of the rule file

the value of the feature ‘match’ is matched directly. This is, for instance, a token with the PoS tag ‘KOKOM’ (for ‘Vergleichspartikel’) in the first REL element in the rule. The other matching types are those for grouping (‘GROUP’) and for reference (‘REF’). In the rule in figure 2.12, grouping is used to define the scope of a disjunction. In the case at hand, there is the alternative between one of a list of determiners defined by their PoS tags and a (rarely occurring) complex determiner chunk (e.g., ‘gar keine’, none at all). Besides ‘DISJ’, RELs of the type ‘GROUP’ may have two other values for the feature ‘match’. These are ‘SEQ’, which allows us to define the scope of a sequence, and ‘DISJF’, which takes the first possible match from a list of alternatives. In the case that there is just one possible match, ‘DISJF’ has the same effect as ‘DISJ’. In fact, ‘DISJF’ is hardly ever used within rules, but it is used when rules are called, to allow for preference in a set of alternatives. The third value of the feature ‘type’ in RELs (i.e. ‘REF’) is used to indicate that another rule is called by the feature ‘match’. It is very useful to code a complex structure in such a subrule, as otherwise the main rule10 would become opaque. It is also useful to use subrules if one structure is used by more than one main rule. Both apply in the NC rule. There are two subrules which are referred to. One is the subrule for a headnoun (e.g., a noun, a proper noun, capitalized foreign material (FM), a number). The rule ‘headnoun’ lists a number of alternatives and is called by nearly all rules for NCs. This makes the rules more easily manageable and transparent. The other subrule (i.e. ‘inNC’; cf. figure 2.14) contains all the elements which may occur in an NC. Since this rule is also quite extensive, the same advantages as for the headnoun rule apply. Figure 2.14 shows that all the elements are listed explicitly.
It would have been possible to code the PoS tags in one regular expression. However, this would have compromised transparency and maintainability. Furthermore, explicit coding allows for more transparent commenting of elements, which is, again, crucial for maintenance. It is important to note that the possibility to use subrules does not mean that fsgmatch allows recursion and is, thus, more powerful than an RE language. In fact, a rule can never call itself, neither directly nor indirectly via a subrule. In other words: The structure of rules is strictly hierarchical and, thus, a one-way relation. Although, in principle, there can be infinitely many subrules, this does not mean that the language described is not regular

10If we talk of ‘main rules’, we mean rules which create annotation.


Figure 2.14: Subrule for all elements which may occur in an NC

because this simply means that the RE may, in principle, be infinitely long. Subrules could, in fact, be spelled out in one rule and are simply a means to make rules appear more transparent. Another feature of the REL element allows the application of quantifiers. The feature ‘m_mod’ can have the values ‘STAR’ for Kleene star, ‘PLUS’ for Kleene plus, ‘QUEST’ for optionality and ‘TEST’ to test context. The ‘TEST’ feature is not to be confused with what makes context-sensitive languages context-sensitive, i.e. the capability of a context-sensitive rule to be applied recursively to its output. If a context is tested and the site of the application of this context is in one direction only, context testing can, in fact, be modelled within the finite-state paradigm, as Karttunen and Beesley (2001) have emphasised. The value ‘TEST’ of the feature ‘m_mod’ fulfills this requirement. The same mechanisms of fsgmatch which allow the annotation of shallow structures also allow us to annotate grammatical functions (GFs). Figure 2.15 shows the rule which annotates a sentence with a verb subcategorizing for a subject (ON), an accusative object (OA) and a prepositional object (OPP) with the preposition ‘mit’. This would, for instance, be a sentence like sentence 18. As its name shows, the rule has the name ‘rule’ plus an


Figure 2.15: Fsgmatch rule for the annotation of GFs

automatically generated number. Since it is not the intention to wrap the whole sentence into one XML element, the feature ‘targ’ contains a set of Boolean values which describe whether the structure is just to be matched (‘VAL’ for value) or also to be wrapped into an XML element (‘REW’ for rewrite). In the case at hand, the grammatical functions are wrapped into XML elements showing the type of function; and the optional elements in the sentence and the verbs are simply matched.

(18) [ON Sein Chef] hat [OPP mit diesen Leuten] noch [OA nichts] abgeklärt.
     His boss has with these people still nothing clarified.
     ‘His boss has not clarified anything with these people yet.’

Figures 2.16 and 2.17 show as an example what the subrules in figure 2.15 look like. Figure 2.16 shows the subrule which matches the NC functioning as the OA and assigns the grammatical function category. The @ sign at the beginning of the value of the feature ‘targ sg’ shows that a feature-value pair is added to an existing XML element and no XML element itself is added. Thus, the NC receives a feature ‘func’ which shows its function, in the case at hand OA. The matched structure is a chunk (‘ch’) which has a category ‘c’ which begins with NC.11 Furthermore, the NC has to contain an

11 Regular expression patterns can be used as values of features in fsgmatch. This is signified by the use of a ‘∼’ instead of a ‘=’.


Figure 2.16: Subrule which matches accusative object

element ‘morph’ which has a category ‘d’ with a value beginning with ‘a’ (for accusative).12 Figure 2.17 shows the subrule which matches the right part of the sentence bracket containing the lexical verb. There is no feature ‘targ sg’ because the structure is simply matched and no annotation is assigned. The feature ‘type’ has the value ‘SEQ’ because the three elements in the rule are matched as a sequence. In front of the lexical verb, there are zero or more optional elements (which are, again, represented by a subrule). After the lexical verb comes an optional comma. Besides the information that it is the right part of the sentence bracket, the rule contains some more information about the chunk to be matched: The potential chunk has a feature ‘mode’ which, in the case at hand, has to contain the string ‘akt’. This checks whether the chunk has the potential to be part of an active-voice construction. Only in this construction are all GFs in the subcategorization frame realized. This would not be the case with the passive diathesis, in which the OA would become the ON, and there would, thus, be no OA. On the basis of chunk-internal information, the values ‘akt’, ‘pass’, ‘aktpass’ and ‘aktpp’ have previously been assigned. In the first case, the chunk is definitely part of an active construction; in the second case, it is definitely part of a passive one. In the third case, the chunk may be in either an active or a passive construction, and the ambiguity is resolved in combination with the left part of the sentence bracket, which contains similar information. In the fourth case, the chunk contains a participle which gives it the potential to be part of a passive construction. Again, the left part of the sentence bracket is used to resolve the ambiguity. This procedure is an example of the concept of underspecification.
In those cases in which decisions can be taken at an early level, they are taken; but in those cases in which this is not possible, the ambiguity is just contained (i.e. ambiguity classes are assigned) and full disambiguation takes place at a later stage.
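The interplay of ambiguity classes and later disambiguation can be sketched as follows (a Python simplification; the class names ‘akt’, ‘pass’, ‘aktpass’ and ‘aktpp’ are those of the parser, while the criteria and function names below are our own invention for illustration):

```python
# Chunk-internal assignment of a voice ambiguity class for the right part of
# the sentence bracket (deliberately simplistic criteria).
def voice_class(vcr_tokens):
    if "worden" in vcr_tokens:
        return "pass"                      # unambiguously passive
    if any(t.startswith("ge") and t.endswith(("t", "en")) for t in vcr_tokens):
        return "aktpp"                     # participle: perfect or passive
    return "akt"                           # e.g. a bare infinitive: active

# Full disambiguation at a later stage combines the ambiguity class with the
# left part of the sentence bracket (the finite verb).
def resolve(left_bracket, right_class):
    if right_class not in ("aktpass", "aktpp"):
        return right_class                 # already unambiguous
    if left_bracket in ("hat", "haben", "hatte"):
        return "akt"                       # perfect tense, hence active
    if left_bracket in ("wird", "werden", "wurde"):
        return "pass"                      # werden-passive
    return right_class                     # remains underspecified

print(resolve("hat", voice_class(["gemacht"])))   # -> akt
```

The participle ‘gemacht’ alone only licenses the ambiguity class ‘aktpp’; only the auxiliary ‘hat’ in the left bracket resolves it to an active reading.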

12 Morphology is assigned and partially disambiguated on pre-parsing levels described in chapter 8.

Figure 2.17: Subrule which matches right part of sentence bracket with lexical verb subcategorizing for ON, OA and OPP with ‘mit’

The chunk in figure 2.17 also contains an element which is central to the annotation of GFs. There is an element ‘hdverb’, which contains a feature ‘s’ (for subcategorization frame). The value of this feature is a concatenation of all the potential subcategorization frames (SC frames) which the lexical verb that is the head of the chunk may realize. In the case at hand, the rule checks whether the string ‘-ONOAOPP_mit-’ is a substring of this concatenation and whether, thus, the verb subcategorizes for an ON, an OA and an OPP with the preposition ‘mit’. The SC frames are assigned by a simple lexical lookup at an earlier stage of processing. The features of the element ‘hdverb’, thus, trigger the application of the rules for the respective SC frames. A rule is only applied to a sentence if the relevant verb subcategorizes for the correct SC frame. While the rules for the shallow parsing component are manually constructed, those for the GF annotation component are semi-automatically constructed. The reason for this is that, first, there are many more rules than in the shallow parsing component: While there are only 72 rules for chunks, there are 9,017 rules for GFs. Second, a lot of the structures in GF annotation are recurrent. A very obvious example is the fact that all the patterns which include an OPP have to occur with all possible prepositions. In a similar way, one set of subrules for the verbal parts of the sentence bracket has to be generated for every single SC frame. For this reason, the rules are coded in a more abstract way and then automatically converted into XML. This also gives the opportunity to quickly test alternatives because, during the conversion process, systematic changes can be made with little effort. The process of conversion from the abstract patterns to the XML rules takes no more than one minute. In this way, one can, for example, quickly test

ON_VF_W VCL_akt_fin_AM_T OPP_MF_mit_W OA_MF_W VCR_akt_nfin_ONOAOPP_mit_T

Figure 2.18: Abstract coding of the XML rule in figure 2.15

whether an optional element after the first (and typically only) element in the VF makes sense, or whether verbs which subcategorize for one certain frame can also subcategorize for another certain frame (e.g., whether all verbs which subcategorize for an OA also subcategorize for a reflexive OA). Figure 2.18 shows the abstract coding of the pattern which is matched by the rule in figure 2.15. From this pattern, the XML rule in figure 2.15 is generated by a Perl script. In the case at hand, the elements ON_VF_W, OA_MF_W and VCL_akt_fin_AM_T are transformed by patterns in the Perl script which only deal with those very elements. The elements OPP_MF_mit_W and VCR_akt_nfin_ONOAOPP_mit_T are transformed by patterns in the Perl script which treat parts of those elements as variables such that one pattern can be used for many elements, thus facilitating the construction of the parser. In the case of the prepositional object OPP, the variable is the preposition ‘mit’. In the case of the right part of the sentence bracket containing the lexical verb, the variable is the SC frame (i.e. ONOAOPP_mit). The last character in each element name simply shows whether the element is assigned a function label (‘W’)13 or whether the presence of the element is simply tested (‘T’). Optional elements are not explicitly coded in the abstract code. They are inserted automatically by the Perl script. In this way, the conditions under which a certain element is inserted and the exact composition of the optional elements can be quickly changed for testing purposes. The patterns as they are shown in figure 2.18 are semi-automatically constructed. If a verb subcategorizes for three GFs, then the combination of positions is to a certain extent the same for all SC frames with three GFs. The patterns for one SC frame can therefore be converted easily into the patterns for another SC frame.
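The expansion of the abstract coding can be sketched roughly as follows (the actual conversion is done by a Perl script generating XML; this Python sketch and the simplified rule templates in it are our own invention, only the element names are taken from figure 2.18):

```python
# Abstract pattern from figure 2.18; '_W' elements are wrapped (annotated),
# '_T' elements are only tested.
abstract = ("ON_VF_W VCL_akt_fin_AM_T OPP_MF_mit_W "
            "OA_MF_W VCR_akt_nfin_ONOAOPP_mit_T")

def expand(element):
    """Turn one abstract element into a (simplified) concrete constraint."""
    action = "wrap" if element.endswith("_W") else "test"
    core = element[:-2]
    if core.startswith("OPP_MF_"):            # the preposition is a variable
        prep = core.split("_")[2]
        return {"match": "PC", "prep": prep, "func": "OPP", "action": action}
    if core.startswith("VCR_"):               # the SC frame is a variable
        frame = core.split("_", 3)[3]         # e.g. 'ONOAOPP_mit'
        # the rule fires only if '-ONOAOPP_mit-' is a substring of the
        # concatenated SC frames looked up for the head verb in the lexicon
        return {"match": "VCR", "frame": "-%s-" % frame, "action": action}
    return {"match": core, "action": action}  # element-specific template

rules = [expand(e) for e in abstract.split()]
print(rules[2]["prep"], rules[4]["frame"])    # -> mit -ONOAOPP_mit-
```

Because the preposition and the SC frame are treated as variables, the same two templates expand to one rule per preposition and one set of bracket subrules per SC frame, which is what makes 9,017 GF rules manageable.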

13 The abbreviation of this feature was more obvious at the beginning of the construction of the parser, and it was not changed later on.


Chapter 3

Related Work

Abstract: Chapter 3 introduces approaches which are in some way related to the one in the thesis at hand – either because they use a similar method or because they annotate the same linguistic structures. Section 3.1 gives some general remarks about comparing different parsing approaches, especially for German. Section 3.2 first introduces the situation for annotating shallow structures, then presents the relevant related approaches and, finally, relates the approaches to each other. Section 3.3 gives a short introduction to approaches annotating GFs and discusses those relevant for the thesis at hand. A conclusion is given which compares the approaches.

3.1 General Remarks

There are basically two motivations for the construction of a new parser: First, the constructed parser uses an innovative method and, second, the parser yields considerably better results than existing ones. The ideal case is a combination of those two motivations, i.e. it is possible to show that a certain new method provides considerably better results. In order to compare different parsers, it is necessary to describe the methods they use and the linguistic information which they can fall back on. We do this in this thesis in the relevant sections. Furthermore, in order to compare the results of two parsers, those parsers would have to annotate exactly the same linguistic phenomena in exactly the same texts. Otherwise, results are not comparable. In order to guarantee comparability, shared tasks have been established in

which different approaches to one and the same linguistic annotation task are tested on one and the same data under the same conditions (except those in which the particular approaches differ). Examples of these shared-task scenarios are presented, e.g., in Chinchor and Robinson (1998) for Named Entity recognition, in Tjong Kim Sang and Buchholz (2000) for chunking and in Tjong Kim Sang and Déjean (2001) for clause identification. However, for several reasons, it is not always possible to establish a shared-task scenario. In the case of the parser described in this thesis, the main reasons for this are the target language (i.e. German) and the linguistic phenomena which are to be annotated. For the German language, there are far fewer parsers than for English. Large-scale shared-task scenarios like the ones for English quoted above do not exist. Consequently, the scope of comparison is already limited by the number of parsers which exist. Secondly, until now, no common standard of annotation has been established: The notion of chunks is interpreted in different ways by different parsing projects, mainly because of the different purposes of the annotated structures. Parsing of topological fields has, from a linguistic point of view, only just begun; and there are several definitions of GFs, and in some approaches not all types of GFs are annotated. We will, nonetheless, in the following try to compare our parser with as many approaches as possible in order to give the reader an idea of its position in the NLP field.

3.2 Shallow Parsing

3.2.1 Finite-State Approaches and Shallow Parsing

Cascaded FSTs were already used in the late 1950s to annotate shallow structures (Joshi, 1961; Joshi and Hopely, 1996; Joshi, 1999). However, knowledge about this seems to have been lost during the following decades. When Steve Abney introduced his concept of chunks as shallow parsing structures, which were to be annotated by cascades of FSTs, in Abney (1991) and Abney (1996c), his concept of chunk parsing appeared to be novel to the scientific community. Several reasons can be given for this late breakthrough of FSTs in combination with shallow parsing. One of the reasons might be seen in the shift in theoretical perspective after the ‘generative revolution’ in the late 1950s, which led to the replacement of structuralism by generativism as the

mainstream theory in linguistics. Another reason may be seen in the availability of considerably greater computing power at the beginning of the 1990s as compared to the 1950s. For a long time, the automatic parsing of language was to a large extent carried out to investigate whether it is possible at all to analyze human language with an automaton. Very often, the analyses of sentences would be very detailed but slow. Additionally, there would be a preference for no output rather than incomplete or even partially erroneous output. By the beginning of the 1990s, however, computing power was great enough to allow for computer applications of automatic parsing and, thus, for parsers which would be able to efficiently and robustly annotate linguistic structures. Cascaded FSTs together with the concept of shallow parsing provided the formal background for such parsers. Another important reason for the promotion of shallow parsing was that Steve Abney thoroughly defined chunks as shallow parsing structures using linguistic arguments. In Abney (1991), chunks are not simply given as structures which can be very easily annotated with a minimum amount of information, but both prosodic and psycho-linguistic evidence is given to support the existence of such chunks. With the help of what Abney (1991) calls major heads, a general definition of chunks is given which holds for all different types of chunks. Only after shallow parsing had become an accepted approach did Abney (1995) admit that “[w]e can define chunks as the parse tree fragments that are left intact after we have unattached problematic elements”, not without acknowledging that “[i]t is difficult to define precisely which elements are «problematic»”. In subsequent years, many projects have found their own definition of chunks, mainly using the guideline of “unattaching problematic elements”. This is precisely the reason why there are so many definitions of chunks.
Joshi (1961) had already used structures which can quite appropriately be termed chunks. These structures are called “clusters” and consist of phrases without explicit attachment of adjuncts. This definition basically corresponds to the “non-recursive core phrases” defined in Abney (1996c). After Abney (1991), Grefenstette (1996) used FSAs to annotate shallow parsing structures. These structures are called “groups” and the approach is referred to as “light parsing”. Essentially, what Grefenstette (1996) calls, for example, nominal groups and verbal groups are the same as chunks. A further approach to shallow parsing using FSAs is presented in Aït-Mokhtar and Chanod (1997a). In this approach, “segments” are annotated, which are

described as “partial construction[s] (e.g., a nominal or a verbal phrase, or a more primitive/partial syntactic chain if necessary)”. Aït-Mokhtar and Chanod (1997a) use a slightly different approach from Abney (1991) in that they allow the later revision of decisions made at earlier levels of the parser. Below, we will discuss those approaches to shallow parsing of German which come closest to our approach either with respect to the method used or with respect to the structures which are annotated, or both. We do this in order to present the scientific context in which our shallow parser has to be seen.1

3.2.2 Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002)

The shallow parsing approach presented in Braun (1999); Neumann, Braun, and Piskorski (2000) and Neumann and Piskorski (2002) is the one which comes closest to ours, since it also uses FSAs and annotates topological fields, clauses and chunks (in this order). Topological fields and clauses are annotated first to make use of topological field information in the annotation of phrases. This is possible because the scope of certain phenomena is restricted to clauses or topological fields. Neumann, Braun, and Piskorski (2000) call this strategy a “Divide-and-Conquer Strategy” after Peh and Ting (1996), who first applied this strategy to NLP. Before that, the strategy had mainly been used in programming. The approach takes tokenized text as its input. Tokens are then annotated with PoS tag ambiguity classes using the tool MORPHIX (Finkler and Neumann, 1988). MORPHIX analyzes the morphological structure of a token. Since the PoS tag is assigned on the basis of morphology only, the potential PoS tags are assigned as an ambiguity class, and PoS tag filtering is applied afterwards. This is achieved by using (obviously hand-constructed) contextual filtering rules which are compiled into a single FSA. A few tokens remain ambiguous, i.e. retain more than one reading after PoS tag filtering. In the first phase of parsing, the topological field structure of a sentence is identified, thereby also allowing the annotation of clauses. In a second phase, “simple, non-recursive [phrases]” are identified (Neumann and Piskorski, 2002, p. 25). These can be seen as equivalent to chunks. Furthermore, Named Entities are annotated and the structure of compound nouns is analyzed. The parser uses cascaded FSAs for annotation and, as a consequence, has a modular architecture which allows for the integration of new and/or specialized components. In contrast to our parser, morphological information (agreement check) is used for the annotation of chunks. For this component, “a type description language for constraint-based grammars” is used (Neumann, Backofen, Baur, Becker, and Braun, 1997). It does not become clear, however, to what extent this component is still within the finite-state formalism.

1 A detailed comparison of the evaluation results of our approach and of the other approaches presented here is given in chapter 12.

3.2.3 Wauschkuhn (1996)

The approach described in Wauschkuhn (1996) only comes close to our approach as regards the order of parsing and the fact that topological fields are annotated. This approach is the first one which first annotates topological fields and clauses and then goes on to annotate phrases. Wauschkuhn (1996) calls this approach “zweistufiges Analyseverfahren” (parsing method working in two steps)2. Wauschkuhn (1996) gives robustness as the main motivation for his approach. If it is not possible to assign the whole structure of a clause (including the phrase structure), it is at least possible to assign the topological field and clause structure. This structure is seen by Wauschkuhn (1996) as the one that is easier to recognize. That is the reason why it is annotated first. The advantages described by the concept of a Divide-and-Conquer Strategy are not mentioned in Wauschkuhn (1996). Although, according to what has been stated above, the approach in Wauschkuhn (1996) can in some ways be seen as similar to the one described in Neumann, Braun, and Piskorski (2000) and in the thesis at hand, it is very different in other respects. It is not a cascade of FSAs which is used by Wauschkuhn (1996) but a chart parser using context-free grammars. This chart parser is not deterministic, i.e. it does not produce one single result but, in cases of ambiguity, outputs all possible results. Unfortunately, there is no qualitative evaluation of these results because no appropriate corpus existed at the time of the construction of the parser; and we are not aware of any subsequent evaluation of it. Thus, the main merit of Wauschkuhn (1996) has to be seen in the fact that it first introduced the parsing of topological

2 Translation taken from the English abstract in Wauschkuhn (1996).

fields and that it showed that this was possible before the annotation of phrases.

3.2.4 Schiehlen (2002, 2003a,b)

The shallow parsing approach presented in Schiehlen (2002, 2003a,b) is also very close to our approach. Schiehlen (2002, 2003a,b) takes tokenized text as input which is STTS-tagged and morphologically enriched. The text is annotated with chunks and clauses by a cascade of FSTs like the one described in Abney (1996c). However, in contrast to Abney (1996c), recursive structures are also dealt with, as is the case in our approach. For instance, Abney (1996c) only recognizes simplex clauses, but Schiehlen (2002, 2003a,b) also recognizes embedded clauses. The parser is deterministic, but the basic component is followed by a non-deterministic component, which annotates further structures such as coordinated VPs. The annotation of what is called grammatical roles, which subsumes what we call grammatical functions, is also non-deterministic. There is no annotation of topological fields whatsoever. In cases of ambiguity, a longest-match strategy is applied. An important difference between our approach and the one in Schiehlen (2002, 2003a,b) is that we do not apply any agreement checking in the annotation of chunks, although we do have this information and use it in the annotation of GFs. However, the cases in which agreement checking would make any difference in chunking are so rare that we decided not to use this information because the effort is disproportionate to the benefit to be gained. Far more errors occur for less trivial reasons. Furthermore, it is not clear to what extent this component based on a constraint-based grammar is within the finite-state formalism (as was also the case in Neumann, Backofen, Baur, Becker, and Braun (1997) above). There are basically two approaches to checking agreement in chunking. One is to check agreement while chunking is in progress and the other is to apply the agreement check after chunking.
If one uses FSAs, the former approach might result in a huge grammar because all potential combinations have to be spelled out. The latter approach, on the other hand, involves a complex reconstruction procedure because, if a non-agreeing structure has already been annotated, it has to be deleted and re-structured. Schiehlen (2002, 2003a,b) tests both approaches.

3.2.5 Kermes and Evert (2002, 2003)

The approach in Kermes and Evert (2002, 2003) also comes close to our approach. Text is tokenized, STTS-tagged and morphologically enriched using IMSLex morphology (cf. Lezius, Dipper, and Fitschen, 2000) before it is annotated with chunks by an RE grammar. As in our approach, centre-embedded structures (e.g., PPs modifying the attributive adjective as in figure 3.1)3 are annotated and post-head PPs are not attached. Unlike in our approach, post-head genitive NPs are attached. We also have information about case in our parser but do not use it in shallow parsing. We also do not see genitive NP attachment as part of the shallow parser. In contrast to our approach, Kermes and Evert (2002, 2003) use both lexical (including lemma) and morphological information (case, gender, number) in their chunker. We explicitly exclude this information from the shallow parsing process to keep the parser as simple as possible for reasons of speed and efficiency. Morphological information is only used in the GF annotation component of our parser. Since the parser in Kermes and Evert (2002, 2003) uses an RE grammar, it is equivalent to an FS parser. However, in Evert and Kermes (2003) it is stated that “the CQP query language and especially the Perl code used in post-processing are more expressive, and quite often more convenient than standard finite-state technology”. Thus, in contrast to our approach, there are extensions to the FS formalism in Kermes and Evert (2002, 2003), although it is not exactly clear to what extent the FS formalism is extended. Parsing is performed incrementally and bottom-up in Kermes and Evert (2002, 2003). This is in contrast to the parser at hand, since our parser is a mixed bottom-up top-down parser which first annotates topological fields (which are not annotated in Kermes and Evert (2002, 2003)), then clauses and then chunks (these are annotated top-down).
Furthermore, candidate structures are generated which may be rejected by later levels of the parser. The parser is thus not deterministic. One example is the treatment of centre-embedded structures. In Kermes and Evert (2002, 2003), candidates for centre-embedded structures are generated and only accepted if, on the last level, there is a clear indicator of centre-embedding, e.g. a determiner before the centre-embedded structure. Since we use a top-down approach for chunking, this problem does not occur for us, although we effectively use the same information to deal with it. We

3 Only the relevant information is shown in this figure.

[PP mit dieser [AP [PP aus der Not] geborenen] Führungsspitze]
with this by the necessity born top management
‘with this management which was just installed because necessity required it’

Figure 3.1: Centre-embedded PP

also only annotate centre-embedding if there is a clear indication for it in the form of a determiner or preposition, as in figure 3.1. Kermes and Evert (2002, 2003) have a special component for annotating chunks with a ‘specific internal structure’. These chunks subsume certain named entities like dates, measures etc. Furthermore, chunks in brackets and quotation marks are annotated. While Kermes and Evert (2002, 2003) annotate such chunks before the main chunking component is applied, we annotate named entities before, and the other types as chunks after, the main chunking component.

3.2.6 Trushkina (2004)

The parser described in Trushkina (2004) annotates chunks and topological fields as constituents and GFs as dependencies. It uses the Xerox Incremental Deep Parsing System (XIP) as an annotation tool (cf. Aït-Mokhtar, Chanod, and Roux, 2002). Although this approach is no longer within the FS formalism (cf. Aït-Mokhtar, Chanod, and Roux, 2002, p. 131), in contrast to Aït-Mokhtar and Chanod (1997a,b), it would be possible to implement the grammar as a cascade of FSTs from what is stated in Trushkina (2004), sec. 4.3.2 and 7.3.2: Rules are organized in layers, annotation is deterministic (i.e. already annotated structure is never discarded) and recursion is covered by iteration of rules. Context testing may be applied. We have emphasized above, however, that context testing can be modelled within the FS formalism, as shown by Karttunen and Beesley (2001). The parser described in Trushkina (2004) is called GRIP (GeRman Incremental Parsing system). As the name suggests, parsing is done in an incremental fashion: The system uses tokenized text as its input. Tokens are then processed by the morphological analyzer developed at the Xerox Research Centre Europe (XRCE). This includes lemmatization and PoS tag assignment according to the XRCE tagset (a slightly modified version of the

STTS). A rule-based module subsequently reduces ambiguities in PoS tags, and another rule-based module reduces ambiguities in morphology. Since the following chunker needs unambiguous input, the ambiguity of both PoS tags and morphological analysis is finally resolved by a PCFG. Morpho-syntactic disambiguation is thus achieved in a hybrid way. The disambiguation of morphology is important for the parser in Trushkina (2004) because the annotation of dependencies representing GFs relies on morphological information. The chunker mainly relies on PoS tag information and annotates roughly the same structures as the shallow parser in the thesis at hand: chunks, including centre-embedding and excluding post-modification, topological fields and clauses. It is thus, according to our definition, a shallow parser. Since the shallow parser only relies on PoS tag information and on “limited lexical information” (p. 130), there remains the question of why morphological disambiguation is not tackled after the shallow parsing process, because the valuable information of the shallow parsing structure could then be used. The advantage would be that the checking of agreement information within chunks could make use of the chunk structure; the checking of agreement between a potential subject and the verb could make use of the clause structure; topological field information might also have been used, e.g., the restriction that there can only be one constituent in the VF. Disambiguating morphology before shallow parsing has the effect that, in the parser in Trushkina (2004), many rules in fact have to be formulated twice. In order to check the agreement in a potential NC, the linear order of the elements in it has to be formulated for the morphological disambiguation component. Afterwards, the same information has to be used in the shallow parser to construct the respective chunk. It would, thus, be more logical to apply agreement checking after shallow parsing.
Another desideratum for the shallow parser would be a more detailed evaluation since only the overall figures are given but, very often, figures differ considerably between various categories.

3.2.7 Schmid and Schulte im Walde (2000)

The following approach is very different from ours. Schmid and Schulte im Walde (2000) apply a hybrid technique for the annotation of chunks. First, text is tokenized and PoS-tagged with partially disambiguated tags using DMOR (Schiller, 1995). DMOR analyzes tokens morphologically and also assigns them PoS tags on the basis of this morphological analysis, but without any contextual information. Since, on the basis of its morphology, a token may have more than one analysis, a token may have more than one PoS tag. Schmid and Schulte im Walde (2000) therefore have a manually constructed context-free grammar (CFG) to hand, which contains 4,619 rules and which covers 92% of the 15 million token corpus they use. This CFG can be trained on an unlabelled corpus and thus becomes a probabilistic context-free grammar (PCFG). Since the heads of phrases are also taken into account, the grammar is a head-lexicalized PCFG (H-L PCFG). The advantage of such a grammar – once it is constructed – is that it can be easily tuned to the text type it is intended for, especially since it can be trained unsupervised, i.e. on unannotated text.

However, the disadvantage is that the grammar does not cover all structures in the corpus (in this case 8%) and is thus not robust. Still, robustness is crucial for a chunker, as we have stated in section 2.3.1. To achieve robustness, Schmid and Schulte im Walde (2000) extend the chunk part of the H-L PCFG with semi-automatically generated robustness rules which allow for the annotation of unrestricted text. These robustness rules are first automatically generated to cover all possible chunk structures. Then, some of the rules are generalized using linguistic knowledge. The number of rules is thus reduced to 3,332. Furthermore, for the strategy yielding the best results in the end, the base grammar is extended by some “simple verb-first and verb-second clause rules”. In order to annotate chunks, the grammar then generates all possible chunks for a clause and chooses the most probable chunk set, i.e. the most probable combination of chunks in one clause and not just the most probable single chunk.

The parser described in Schmid and Schulte im Walde (2000) solely annotates and evaluates noun chunks. The original H-L PCFG parser probably also annotates other structures. The NCs annotated include PCs in which the determiner and the preposition are morphologically combined, e.g., ‘zum Beispiel’ (‘for-the example’, for instance). They also include PCs centre-embedded in NCs like the one in figure 3.1. Furthermore, appositions are annotated. Neither post-modifying PPs nor post-modifying NPs are annotated. The parser thus annotates slightly fewer structures than our parser does.

3.2.8 Becker and Frank (2002)

Like Schmid and Schulte im Walde (2000), Becker and Frank (2002) use PCFGs for annotation. However, while Schmid and Schulte im Walde (2000) only annotate noun chunks, Becker and Frank (2002) only annotate topological fields. Furthermore, the PCFG they use is not head-lexicalized but a standard PCFG model. For the training of the PCFG, the NEGRA corpus (Brants, Skut, and Krenn, 1997) is used. In contrast to the corpus used by Schmid and Schulte im Walde (2000), the NEGRA corpus is a corpus which was annotated over a period of some years. It contains (among other information) syntactic constituent structures in the form of syntactic trees. Since this corpus does not contain any topological fields, topological fields have to be inferred “by defining linguistically informed conversion rules”. After conversion, only the test section of the corpus is manually corrected. The conversion procedure yields 93.0% labelled precision and 93.7% labelled recall. The topological field structure of the training corpus is, thus, not without errors. Becker and Frank (2002) do not use an H-L PCFG because lexical information is less important for the annotation of topological fields than for the annotation of, for example, PP attachment. This is the reason why we also do not use any lexical information for the annotation of topological fields. In fact, PoS tag information alone suffices for this task. Becker and Frank (2002) test the effect of four variants of the PCFG on performance. First, categories are parameterised. This means that categories are subdivided into a more fine-grained scheme such that they are virtually contextualized. One example is the category CL, which is subdivided into, for example, the subcategories CL-REL and CL-V2 because the former is a verb-last and the latter a verb-second clause. Both the context and the structure of those two subcategories are different.
Furthermore, Becker and Frank (2002) test how the PCFG can deal with the training material without punctuation. This is done because topological field parsing can typically rely on the strict punctuation rules of German. This is advantageous if the text type to be dealt with is quite formal. However, in more informal contexts, punctuation is not a reliable clue for parsing. This is even more the case in spoken communication, where it is missing entirely. Hence, Becker and Frank (2002) test how a PCFG would perform without this information. The third tested strategy is binarization. Topological fields would typically expand to very heterogeneous sequences of PoS tags since constituent order in German is quite free. Consequently, a stochastic parser would run into sparse data problems. By binarizing topological field structure, this problem could be overcome. The fourth strategy used by Becker and Frank (2002) is pruning. Since the training material used for the PCFG is converted automatically, it may contain rare structures which might be the result of conversion errors. Thus, structures occurring just once are deleted from the grammar. This also has the effect that the grammar becomes smaller.
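Right-binarization of this kind can be sketched as follows. This is only an illustrative reconstruction, not Becker and Frank's actual implementation; the auxiliary category names are invented:

```python
def binarize(lhs, rhs):
    """Right-binarize a CFG rule lhs -> rhs into rules with at most
    two right-hand-side symbols, introducing auxiliary categories.
    Returns a list of (lhs, rhs_tuple) rules."""
    base = lhs
    rules = []
    i = 1
    while len(rhs) > 2:
        aux = f"{base}_{i}"  # fresh (invented) auxiliary non-terminal
        i += 1
        rules.append((lhs, (rhs[0], aux)))
        lhs, rhs = aux, rhs[1:]
    rules.append((lhs, tuple(rhs)))
    return rules
```

A rule such as MF → NP PP ADV NP thus becomes MF → NP MF_1, MF_1 → PP MF_2, MF_2 → ADV NP, so that a long, heterogeneous field expansion is decomposed into frequent two-symbol rules.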

3.2.9 Klatt (2004a,b)

Klatt (2004a,b) annotates chunks (which he calls minimal phrases) and topological fields. As in our approach, the definition of chunks includes centre-embedding and excludes post-head modifiers. Klatt (2004a,b) distinguishes between NP chunks and DP chunks: Basically, NP chunks are non-recursive kernel phrases not headed by a determiner, and DP chunks are determiners followed by an NP chunk. Topological fields are also annotated, but recursion in topological fields is not covered. Before parsing, text is tokenized and PoS-tagged as described in Klatt (2002). Klatt (2002) uses a rule-based approach which makes use of morpho-syntactic constraints, followed by a stochastic approach (Schmid, 1999). The skeletal structure of the PoS tagset used is the STTS, but there are some more fine-grained distinctions and also some modifications to the tagset. Topological fields (without recursion) are annotated first and chunks afterwards. The annotation method used in Klatt (2004a,b) is a deterministic transition network. It is not clear, however, exactly how powerful this transition network is. In contrast to most approaches, input is not strictly processed linearly (i.e. from left to right or vice versa); instead, the easiest decisions are made first. The mechanism allows a bi-directional search from every token position in the input text. It is also possible to test context, i.e. a certain transition network is only triggered in a certain context. Furthermore, the parser has special transition networks for certain tokens. These are typically applied before the more general rules. Apart from the extended STTS tags and token information, the parser makes use of corpus-based tests in order to recognize multi-word units. This means that, in certain contexts, a corpus look-up is made to check whether a certain sequence might be a multi-word unit.

3.2.10 Brants (1999)

Brants (1999) presents an approach which uses Markov Models (MMs) to annotate chunks. The MMs are trained on treebanks. Thus, this approach is very different from ours. However, Brants (1999) also uses a cascaded architecture like we do; and he explicitly refers to cascaded FSAs as a source of inspiration for cascaded MMs. MMs are best suited to a tagging task, i.e. to labelling tokens, not to assigning structure. In order to assign structure, MMs have to be cascaded. Hence, cascaded MMs are an extension of the PoS tagging task. This extension makes MMs capable of assigning structure beyond the level of PoS tags. This is done by making the output of one MM level the input of the following MM level. Thus, each level is represented by its own MM. The crucial difference between MMs for tagging and MMs for parsing is that, in the latter, a sequence of terminals may be replaced by one non-terminal, i.e. a phrasal category. Complex phrasal structures are created by the cascaded architecture. Since it is possible that a structure generated on a lower level does not fit into a structure at a higher level, more than one potential structure is passed to the following level if it is above a defined threshold. The following level can then select the best solution with respect to the structures to be generated at this level. The number of levels in the cascade is fixed, just as it would be in an FSA cascade. However, it is important to mention a fact which we also exploit: Recursion in language is unrestricted in theory only. In the NEGRA corpus, which is used by Brants (1999), the average number of layers is about 5, and 99.9% of all sentences have 10 or fewer layers. In fact, after the application of 7 layers of annotation, both precision and recall do not change significantly any more in the experiments carried out in Brants (1999). The F-score has reached its maximum on this level.
In this context, it has, however, to be mentioned that the NEGRA treebank has a comparatively shallow annotation structure. A PP, for instance, consists of just one layer. Consequently, the parser which is trained on this treebank can work with comparatively few levels. It would be interesting to find out whether this type of parser would also work on a treebank with a more complex annotation like, e.g., the TüBa-D/Z. In Brants (1999), only chunks are annotated. Centre-embedded structures are included, and post-nominal modification like PP attachment and genitive modification is excluded. Clause structure is not annotated and, since topological fields are not annotated in the NEGRA training corpus, topological fields are not annotated, either.
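The cascade architecture itself, independently of whether the individual levels are Markov Models or FSAs, can be sketched as follows. The toy level shown is an invented deterministic noun-chunk recognizer over STTS tags, not Brants's stochastic model:

```python
def cascade(tokens, levels):
    """Run a cascade of per-level annotators: the output sequence of
    one level becomes the input sequence of the next level."""
    seq = list(tokens)
    for annotate in levels:
        seq = annotate(seq)
    return seq

def chunk_nc(seq):
    """Toy level: replace any ART (ADJA)* NN run by one NC symbol,
    i.e. a sequence of terminals is replaced by one non-terminal."""
    out, i = [], 0
    while i < len(seq):
        j = i
        if seq[j] == "ART":
            j += 1
            while j < len(seq) and seq[j] == "ADJA":
                j += 1
            if j < len(seq) and seq[j] == "NN":
                out.append("NC")
                i = j + 1
                continue
        out.append(seq[i])
        i += 1
    return out
```

Running cascade(["ART", "ADJA", "NN", "VVFIN", "ART", "NN"], [chunk_nc]) yields ["NC", "VVFIN", "NC"]; further levels would then operate on this shortened sequence, which is how each level builds structure on top of the previous one.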

3.2.11 Summary

With respect to the structures which are annotated, Braun (1999); Neumann, Braun, and Piskorski (2000) and Neumann and Piskorski (2002), and Wauschkuhn (1996) come closest to the approach at hand because both approaches annotate chunks, topological fields and clauses. However, since there is no qualitative evaluation in Wauschkuhn (1996), it is mainly the pioneering work of annotating topological fields first and other structures later which is the major achievement of Wauschkuhn (1996). It is, unfortunately, impossible to say how well this approach performs in real-life applications. With respect to the method being used, three approaches come close to ours: first, Braun (1999); Neumann, Braun, and Piskorski (2000) and Neumann and Piskorski (2002), second, Schiehlen (2002, 2003a,b), and, third, Kermes and Evert (2002, 2003). For the first two approaches, however, it does not become clear to what degree they are still within the finite-state formalism when they apply agreement checking. Schmid and Schulte im Walde (2000), Becker and Frank (2002), and Brants (1999) show methods outside the finite-state formalism for annotating shallow structures. While Schmid and Schulte im Walde (2000) need an existing grammar in order to train it on an unannotated corpus, both Becker and Frank (2002) and Brants (1999) have to rely on an annotated corpus. We have presented those approaches as alternatives to ours, and we will show how they compare qualitatively to our approach in chapter 12.

3.3 Grammatical Functions Annotation

3.3.1 Finite-State Approaches and GF Annotation

There are even fewer approaches for the annotation of GFs in German than there are for shallow parsing of German, and there are fewer still which use FSAs. In fact, with respect to the annotation of GFs using FSAs, there are only a few approaches for all languages in general, although this strategy was already suggested in Abney (1991). Aït-Mokhtar and Chanod (1997a,b) and Aït-Mokhtar, Chanod, and Roux (2002) are among the few exceptions. They have constructed an incremental parser. This parser first annotates shallow structures like chunks and clauses and then recognizes GFs on top of this structure. However, the situation is very different for French since, in French, the constituent order is typically subject, verb, object (SVO) whereas, in German, constituent order can vary significantly. Furthermore, Aït-Mokhtar and Chanod (1997b) only evaluate the annotation of subjects and objects. The reason for this can also be seen in the nature of GFs in the French language. In our approach for German, we evaluate the annotation of seven GFs. And, finally, although Aït-Mokhtar, Chanod, and Roux (2002) has to be seen as the successor of Aït-Mokhtar and Chanod (1997a,b), it is no longer within the FS formalism. It “can handle rich and fine-grained lexical and dependency descriptions via feature lists.” (Cf. Aït-Mokhtar, Chanod, and Roux, 2002, p. 131.) Another approach using FSAs to annotate GFs is introduced in Oflazer (1999) and Oflazer (2003). In this approach, GFs (among other relations) are annotated as dependencies. However, the task is very different from the one for German since Turkish is an agglutinative language and GFs are more clearly expressed by morphological features than is the case in German. Furthermore, except for subject and object, GFs in Turkish and German cannot be directly compared. Oflazer (1999, 2003) presents an “all-parses approach”, i.e., the approach is non-deterministic. If it is possible to assign more than one structure, all the respective structures are assigned. The evaluation in Oflazer (2003) is rather modest: only 200 sentences are contained in the test set, 30 of which are also part of the development set.

3.3.2 Schiehlen (2003a,b)

The approach in Schiehlen (2003a,b) comes quite close to our approach since it also uses cascades of FSTs (cf. section 3.2.4) and it also incrementally annotates shallow structures first and GFs on top of those structures. While in our approach GFs are encoded as labels of constituents, they are encoded as dependencies in Schiehlen (2003a,b). The essential distinction between Schiehlen (2003a,b) and our approach, however, can be seen in the treatment of ambiguity in GF annotation. Since there is often more than one possible GF structure which can be assigned, ambiguity arises. Schiehlen (2003a,b) tackles this problem by leaving GF structure underspecified. Thus, in the example sentence ‘Peter kennt Karl’, in which either ‘Peter’ is the subject and ‘Karl’ is the object or vice versa,4 both possibilities are assigned. To disambiguate these underspecified structures, Schiehlen (2003a,b) applies a constraint grammar-based technique which uses context variables. In this technique, each ambiguous sequence receives a specific name, and the respective ambiguous dependencies for this sequence receive context variables indicating specific linguistic context features. Then, constraints are applied to the context variables. This technique, which is applied in a module following the cascade of FSTs, is no longer within the FS formalism. This disambiguation mechanism is the most crucial difference between the parser in the thesis at hand and the parser in Schiehlen (2003a,b). It is one of our main goals to show that it is possible to deal with ambiguities in GF annotation within the FS formalism.
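The general idea of underspecifying GF structure and then pruning readings with constraints can be illustrated with a small sketch. The function names and the single toy constraint are invented for illustration; Schiehlen's context-variable mechanism is considerably richer:

```python
from itertools import product

# For 'Peter kennt Karl': each NC is underspecified between subject
# and direct object, since both are case-ambiguous here.
hypotheses = {"Peter": ["subj", "obj"], "Karl": ["subj", "obj"]}

def consistent(reading):
    # Toy constraint: a clause has exactly one subject and one object.
    return sorted(reading.values()) == ["obj", "subj"]

readings = [dict(zip(hypotheses, combo))
            for combo in product(*hypotheses.values())
            if consistent(dict(zip(hypotheses, combo)))]
```

Of the four combinatorially possible readings, only the two mentioned in the text survive: Peter as subject with Karl as object, and vice versa; further constraints (in Schiehlen's case, on context variables) would then narrow these down to one.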

3.3.3 Trushkina (2004)

The parser described in Trushkina (2004) has already been discussed in section 3.2.6 as far as the annotation of shallow parsing structures and morpho-syntactic disambiguation is concerned. After this, GFs are annotated as dependencies on top of shallow parsing structures. The annotation of GFs, then, mainly relies on the morphological information which is disambiguated before shallow parsing (as described in section 3.2.6), on the shallow parsing structure and on linear order. The reason for this is that Trushkina (2004) wants to show that it is possible to annotate GFs mainly without lexical information. A clear disadvantage of this strategy is that – provided that morphological disambiguation succeeds – only those GFs can be annotated which are connected with case. This means that, for example, the important group of prepositional objects cannot be annotated at all; and it is, in fact, not annotated in Trushkina (2004). Reflexives which have no overt case difference, like ‘sich’, which may be dative or accusative, cannot be distinguished, either. Furthermore, many predicatives (e.g. adverbs) cannot be annotated on the grounds of their morphological features because they simply do not have any. And, in fact, Trushkina (2004) uses a lexicon for the annotation of GFs. This lexicon contains some 150 high-frequency verbs and is compiled based on training data. This is somewhat in contradiction to the intention of showing that GF annotation can do without a lexicon, especially if one considers that

4The sentence may, hence, mean either ‘Peter knows Karl’ or ‘Karl knows Peter’.

the 150 high-frequency verb types may cover a considerable amount of verb tokens. In our test section, there are 4,951 lexical verb tokens, which amount to 1,475 verb types. No such figures are given for the test section in Trushkina (2004); but, for our test section, 150 verb types would, thus, be roughly 10% of all verb types. However, with Zipf’s Law in mind, one has to consider that highly-ranked words occur far more frequently than low-ranked ones. Since Trushkina (2004) does not give frequency figures and since Zipf’s Law is an empirical law and not a theoretical one, one cannot exactly estimate whether the high-frequency verb types in Trushkina (2004) comprise about 30%, 40% or even 50% of the verb tokens in the test section. The fact that not just the highest-ranked verb types have been chosen for the verb subcategorization lexicon but certain verb types which cause special trouble for the parser intensifies the impression that annotating GFs without a lexicon is a rather hard task. Dependencies representing GFs are annotated incrementally, i.e. dependencies which are annotated can make use of already annotated dependencies. Consequently, an easy-first parsing strategy is applied in which those dependencies which can be annotated most reliably are annotated first. In the case at hand, those dependencies are typically annotated first which occur more often and/or are more crucial for dependency structure, i.e., subjects are annotated before direct objects, and direct objects before indirect objects. The disambiguation mechanism is constraint-based. First, dependencies are established on the basis of morphological features and shallow parsing structures. Then, dependencies can be eliminated or renamed according to the specified constraints. “Constraints are stated on the context, on the internal structure of the nodes and on features of lexical nodes. Moreover, constraints on previously established relations are imposed.” (p. 147) This would mean, for example, that a dependency of the type ‘direct object’ is assigned to a noun in an NP in the VF if it has the feature ‘accusative’, if the NP is not introduced by ‘als’,5 and if no other such dependency has been assigned yet.
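To get a rough feeling for the orders of magnitude involved, one can compute the token coverage of the top-ranked types under an idealized pure-Zipf distribution. This is only an illustration: real verb-frequency distributions deviate from the ideal 1/r curve, which is exactly why no firm estimate is possible without Trushkina's actual frequency figures.

```python
def zipf_coverage(top_k, n_types):
    """Fraction of tokens covered by the top_k most frequent types,
    under the idealized assumption that the token frequency of the
    type of rank r is proportional to 1/r (pure Zipf)."""
    weights = [1.0 / r for r in range(1, n_types + 1)]
    return sum(weights[:top_k]) / sum(weights)

# 150 of 1,475 verb types -- the figures from our test section:
share = zipf_coverage(150, 1475)
```

Under this idealization the top 150 types would cover well over half of the verb tokens; the true figure depends entirely on how closely the actual distribution follows the ideal curve, and on the fact that Trushkina's lexicon does not contain exactly the top-ranked types.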

3.3.4 Schmid and Schulte im Walde (2000)

We have already introduced the approach described in Schmid and Schulte im Walde (2000) (cf. section 3.2.7) and stated in which way it differs from ours.

5In which case it would more likely be a modifier of the direct object.

Schmid and Schulte im Walde (2000) annotate shallow structures (chunks) as well as GFs. However, this is done simultaneously. In Schmid and Schulte im Walde (2000), text is first annotated with PoS tags which also include morphological (including case) information. Afterwards, structures are disambiguated using a H-L PCFG. It thus makes sense to simultaneously annotate shallow structures and GFs, at least those which involve case information. Schmid and Schulte im Walde (2000) mark the respective chunks with their case information using the H-L PCFG constructed and trained as described in section 3.2.7. However, there remains the problem that, once the case of an NC is assigned, it is still not clear to which clause it belongs and to which head (mostly a verb) it refers; but Schmid and Schulte im Walde (2000) do not report any details of the annotation of clauses in their parser or of the respective dependencies.

3.3.5 Kübler (2004a,b)

The parser described in Kübler (2004a,b) uses a method to annotate GFs which is very different from ours. Still, it is included here because it is one of the few which annotate GFs for German. Kübler (2004a,b) uses a memory-based learning (MBL) approach. Like other learning approaches, MBL requires training data in order to learn how to annotate unknown language data. However, MBL differs greatly from the learning approaches presented above (e.g. PCFGs) since it is part of the lazy learning paradigm: The learning approaches presented above (eager learning approaches) process training data and abstract them into rules and/or probabilities, i.e., the actual training data are only indirectly involved in annotation (via the abstraction process). This is different in MBL. In MBL, training instances are stored in an instance base without abstraction. The assumption in MBL is that instances which are similar belong to the same class. The parser, hence, compares the input to the instances in the instance base and assigns the input the category of the most similar instance. However, since it cannot be assumed that the instance base contains instances which are completely identical to the input, a similarity metric has to be defined according to which the input data can be compared to the data in the instance base. This similarity metric is, in fact, the only abstraction inherent to MBL; and the success of the parser crucially depends on the definition of this metric. The features of the input and of the respective instance are compared by the similarity metric and can be weighted.

The MBL parser uses PoS-tagged and chunked text as its input. PoS tags are assigned according to the STTS tagset using the TnT tagger described in Brants (2000). Chunking is achieved by the CASS parser, which is a cascaded deterministic finite-state parser (cf. Abney, 1996c). The MBL parser then first tries to find the most similar sentence in the instance base on the basis of word order and PoS tag order in the input sentence. If this step does not succeed, the sentence is handed over to a backing-off module which mainly uses chunk information. This backing-off module also deals with structures which could not be annotated on the basis of word and PoS tag order information. It is important to mention that the approach in Kübler (2004a,b) does not annotate GFs on top of chunks; rather, it uses the chunk information as additional information for the selection of the most similar sentence. In fact, chunk structure and phrase structure may differ considerably. However, the information about, for example, the boundaries of chunks is helpful for the determination of the most appropriate tree to be assigned. If, for instance, a tree structure clashes with a chunk boundary, i.e., the tree to be assigned includes only a part of the chunk, then the parser tries to assign a shorter tree (cf. Kübler, 2004b, p. 75).
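A minimal sketch of the memory-based core – 1-nearest-neighbour classification with a weighted overlap metric – might look as follows. The feature vectors and weights are invented for illustration; Kübler's actual parser operates on full sentences with a far richer metric:

```python
def similarity(a, b, weights):
    """Weighted overlap: sum the weights of the feature positions
    on which the two instances agree."""
    return sum(w for x, y, w in zip(a, b, weights) if x == y)

def classify(instance, instance_base, weights):
    """Assign the label of the most similar stored instance.
    instance_base: list of (feature_vector, label) training pairs."""
    return max(instance_base,
               key=lambda ex: similarity(instance, ex[0], weights))[1]
```

With a toy instance base of PoS-tag triples such as (('ART', 'NN', 'VVFIN'), 'V2'), an unseen triple is simply assigned the label of its nearest stored neighbour; the weights are the only place where linguistic knowledge enters the metric.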

3.3.6 Summary

There are just a few approaches for the annotation of GFs in German. Only the approach presented in Schiehlen (2003a,b) comes relatively close to ours since it also uses FSTs for the incremental annotation of shallow structures and annotates GFs (as dependencies) on top of those structures. However, in order to resolve ambiguous GF structures, Schiehlen (2003a,b) applies a constraint grammar-based technique which uses context variables and is, thus, outside the FS formalism. Trushkina (2004) also annotates GFs as dependencies on top of shallow structures. However, annotation is mainly restricted to GFs connected with case. A constraint-based formalism is used for disambiguation. While Schiehlen (2003a,b) and Trushkina (2004) largely rely on hand-written grammars, Schmid and Schulte im Walde (2000) and Kübler (2004a,b) mainly work with training algorithms. Schmid and Schulte im Walde (2000) use a H-L PCFG to annotate chunks including the morphological information which shows their GF. This type of learning may be called eager learning.

It uses language data to extract rules and/or probabilities from them (in this case mainly probabilities because the CFG already existed before). The other type of learning, which is applied in Kübler (2004a,b) in the form of MBL, is called lazy learning. Kübler (2004a,b) stores instances from the training corpus without abstraction in an instance base. These instances are compared to the input to the parser via a similarity metric such that the correct structure can be assigned. All four approaches presented here can be seen as hybrid approaches since none of them is fully restricted to either learning algorithms or hand-written grammars. Schiehlen (2003b) investigates a combination of the FST parser with an n-gram-based parser in order to fully disambiguate the non-deterministic output of his FST parser. Trushkina (2004) uses a PCFG-based morpho-syntactic tagger for the disambiguation of morphological features, which are the basis of GF annotation in her approach. Schmid and Schulte im Walde (2000) is also a hybrid approach since the H-L PCFG uses a hand-written grammar as its basis; and, finally, Kübler (2004a,b) uses a hand-written chunk grammar (CASS) as one source of information for her MBL parser. With respect to the GFs which are annotated, there are differences among the four approaches presented here. Schmid and Schulte im Walde (2000) is fully restricted to GFs connected with the four cases of German. Trushkina (2004) also annotates sentential objects as GFs apart from those GFs connected with case.6 Kübler (2004a,b) annotates all the GFs annotated in the TüBa-D/Z since she uses the TüBa-D/S as training material, which uses the same annotation formalism as the TüBa-D/Z. Thus, apart from case-related GFs, prepositional objects, sentential objects, modifiers of phrases, heads of phrases, and appositions are annotated.
The same range of GFs is annotated in Schiehlen (2003a,b) although dependencies may have slightly different names and definitions due to the different formalism used.

6There are also dependencies annotated within the verb complex, which we do define as GFs.


Part I

Annotating Shallow Structures


Chapter 4

Shallow Annotation Structures

Abstract: Chapter 4 introduces shallow parsing, which is the main issue of part I. Section 4.1 defines the concept of shallow parsing and explains why there is a need for shallow parsing at all. Shallow structures are defined as those structures which can be annotated reliably, efficiently and without the use of large knowledge bases. Furthermore, it is important that further annotation can be built on top of shallow structures. We show that chunks, topological fields and clauses fulfill the requirements for shallow structures and give detailed definitions of them in section 4.2. Furthermore, we explain the general concept of the label set for chunks, topological fields and clauses, and we discuss specific problems like coordination in noun chunks.

4.1 The Nature and Right-of-Existence of Shallow Parsing

An annotation structure can be called ‘shallow’ in the broadest sense if it only partially disambiguates the syntactic structure of a sentence, i.e. if it does not convey all the information which is contained in a full annotation structure. In contrast to what the word ‘shallow’ might suggest, a shallow structure does not merely contain constituents which are close to pre-terminal nodes, although this might have been the original meaning. A shallow structure simply contains less syntactic information than a full annotation structure. Thus, it lacks, e.g., the information about what kind of function a noun phrase or a prepositional phrase has. Because the annotation does not contain all information, this kind of parsing is sometimes referred to as partial parsing (cf. Abney, 1996c). This term is a lot more precise than shallow parsing since it is neutral about what kind of information is missing in the annotation. However, as the term shallow parsing is more established in the literature, we will stick to it in the dissertation at hand. Since, for linguistic reasons, a fully annotated structure would always be preferable to a partially annotated one, there remains the question of the right of existence of shallow parsing. To a large extent, this right of existence is justified by technical arguments and depends, thus, on the definition of the particular developer. Consequently, in general, there is no clear dividing line between shallow and deep annotation. If shallow annotation is contrasted with full annotation, then it could, in the extreme case, even be defined as reaching one step before full annotation. There are, however, some common features which are shared by most shallow parsing approaches: Since most approaches use shallow parsing as a basis for full parsing, the shallow structures are defined in a way which makes it possible to straightforwardly build full parsing structures on the shallow ones. Furthermore, most shallow parsing approaches subsume those structures which can be annotated reliably at a considerable speed without the use of large data bases (cf. the list in figure 4.1). Shallow parsing, thus, follows the concept of easy-first parsing described in Abney (1996c), in which indisputable structures are annotated first, thus achieving growing islands of certainty which allow the annotation of further structure. In other words, or defined negatively, a shallow annotation structure subsumes only those structures which can be annotated without problems; or, as defined by Abney (1995) for chunking in the broader sense (i.e. in the sense of shallow parsing):

We can define chunks as the parse tree fragments that are left intact after we have unattached problematic elements. (Abney, 1995)

→ useful basis for full annotation

→ high precision of annotation

→ no large knowledge bases required

→ computationally inexpensive algorithm

Figure 4.1: General Features of Shallow Annotation Structure

For our system, we use a definition of shallow parsing which includes both a technical and a theoretical perspective, which are interdependent. From a theoretical perspective, we define shallow structures as those structures which are subject to syntactic restrictions rather than to lexical selection; from a technical perspective, we define shallow structures as those structures which can be captured simply using the part-of-speech (PoS) information1 of the tokens in a regular expression (RE) grammar, which can be implemented using FSAs. Grammatical phenomena which need lexical information for their annotation are not included in the shallow structures. Thus, the dividing line between shallow and deep structures is whether, in theoretical terms, a phenomenon is subject to lexical selection, or, in technical terms, whether it can be annotated with or without lexical information. Consequently, according to this definition, there is a quite clear dividing line between shallow and deep structures since they belong to different levels of syntactic description. This has already been described in other words by Abney (1991) for chunks:
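As a toy illustration of what such an RE grammar over PoS tags looks like, consider the following sketch. The single pattern shown covers only one simple noun-chunk shape and is not the grammar developed in this thesis:

```python
import re

# One illustrative noun-chunk rule over STTS tags: an optional
# article, any number of attributive adjectives, then a noun
# (common noun NN or proper noun NE).
NC = re.compile(r"\b(?:ART )?(?:ADJA )*(?:NN|NE)\b")

def find_ncs(tag_sequence):
    """Return the tag spans matched as noun chunks."""
    s = " ".join(tag_sequence) + " "
    return [m.group(0).strip() for m in NC.finditer(s)]
```

Since such patterns are regular, an entire grammar of this kind can be compiled into finite-state automata; e.g. find_ncs(['ART', 'ADJA', 'NN', 'VVFIN', 'NE']) yields ['ART ADJA NN', 'NE'], and no lexical information beyond the PoS tags is consulted.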

Chunks also represent a grammatical watershed of sorts. The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template. A simple context-free grammar is quite adequate to describe the structure of chunks. By contrast, the relationships between chunks are mediated more by lexical selection than by rigid templates. (Abney, 1991, author’s emphasis)

One of the first and the most prominent of all shallow annotation approaches is called chunking. Abney (1996a) gives a definition in his chunk stylebook: ‘Roughly speaking, a chunk is the non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head, but not including post-head dependents.’ Furthermore, Abney (1991) also gives psychological evidence for the existence of chunks when quoting Gee and Grosjean (1983), who have made experiments with pause durations in

1We use the Stuttgart-Tübingen tagset (STTS) described in Schiller, Teufel, and Thielen (1995). Cf. appendix B.

reading and naive sentence diagramming. These experiments have revealed a structure which they call performance structure. Abney (1991) shows that the constituents in this structure are similar to chunks. In Abney (1991), it is also explicitly made clear that chunks do not include pre-head modifiers if this would lead to recursion. The chunk structure in example 19 would, thus, consist of two nominal chunks and a ‘pending’ preposition. Chunks can, consequently, be described as non-recursive, continuous core phrases. There is, however, also a wider definition of the term chunking, which has already been quoted above from Abney (1995) and which we regard as synonymous with the term shallow parsing, namely that chunks can be seen ‘as the parse tree fragments that are left intact after we have unattached problematic elements’ (Abney, 1995). Sometimes, when the term chunk is used, it is this wider definition which is referred to. We will, however, stick to the narrow definition of chunks as core phrases when we speak of chunks in this dissertation.

(19) in [NC Carl’s] [NC lovely summer cottage]

Shallow structures were first used for English. They suit the syntactic structure of the English language very well because, in English, syntactic dependency is, in most cases, expressed by distributional proximity. This is, however, not always the case in German. On the contrary, in German, syntactic dependency can sometimes even be expressed by distributional distance. The German verb complex is the most striking example: Unlike in English, where only in inverted clauses (e.g. questions) can any considerable structure intervene between the finite verb and the rest of the verb complex, in German, in any affirmative clause, the verb complex may be divided by a considerable number of constituents, since the finite verb comes second and the rest of the verb complex comes at the end of the sentence (cf. sentence 20). This difference in the syntactic structure of English and German has an effect on the definition of shallow parsing structures for German in contrast to those for English.

(20) Die Wirtschaftsförderausschüsse Bremens [verb-complex haben]
     the economy-promotion-committees Bremen's have
     gestern der Errichtung der International University Bremen in
     yesterday the foundation of-the International University Bremen in
     Grohn [verb-complex zugestimmt].
     Grohn affirmed.
     ‘(The state of) Bremen’s committees for the promotion of the economy have affirmed the foundation of the International University Bremen in Grohn, yesterday.’

On the level of chunks, one of the most obvious problems in German is center embedding in noun phrases: center embedding occurs if a constituent which is embedded in a constituent of the same type is surrounded by parts of the embedding constituent. The German examples 21 and 22 illustrate the deficiencies of the chunk definition for German as regards center embedding: If chunks are defined as non-recursive, as in English, then prepositional phrases modifying an adjective phrase which is part of a noun phrase cannot be included in the adjective chunk because this would violate the definition of chunks with respect to recursive structures. Hence, following the definition of chunks, a structure like the one in example 21 would be generated.

(21) mit den [PC von [NC den USA]] [NC [AJAC vorgeschlagenen] Strategien]
     with the of the USA suggested strategies
     ‘with the strategies suggested by the USA’

(22) [PP mit [NP den [AP [PP von [NP den USA]] vorgeschlagenen] Strategien]]
     Gloss: with the of the USA suggested strategies
     'with the strategies suggested by the USA'

This structure is, however, inadequate because we also define shallow structures as the prerequisite for further annotation. There is no straightforward way to generate the finally desired constituent structure, which should roughly correspond to the one given in example 22, from the one in example 21. Thus, the structure in example 21 is not an adequate shallow structure. The definition of shallow structures with respect to the annotation of phrases has, therefore, to be redefined such that center embedding is allowed for German. The definition of nominal chunks should thus include the whole phrase from the determiner to the head word, excluding only post-modification structures, because further annotation can be built on top of this structure.

clause type   topological fields
VL:           KOORD                  C    MF   VC   NF
V1:           KOORD            LV    LK   MF   VC   NF
V2:           KOORD/PARORD     LV VF LK   MF   VC   NF

(C and LK form the left part of the sentence bracket, VC the right part.)

V1: verb-first clause           VF: initial field
V2: verb-second clause          MF: middle field
VL: verb-last clause            NF: final field
KOORD: coordinator field        LK: left part of bracket
PARORD: non-coordinator field   C: complementizer field
LV: resumptive construction     VC: verb complex

Figure 4.2: The topological field model

A shallow structure for German can, thus, not be defined along the same lines as for English because it is not possible to use such a structure straightforwardly for further annotation.2 Consequently, we have to weaken the definition of chunks as non-recursive, continuous core phrases with respect to recursion and allow recursion in chunks if center embedding occurs. This is still in line with the part of the chunk definition which defines a chunk as 'extending from the beginning of the constituent to its head, but not including post-head dependents.' With respect to the process of annotation, shallow structures can be defined more generally as those structures which can be annotated with high precision without resorting to large knowledge bases and/or computationally expensive algorithms, as laid out in the characteristics listed in figure 4.1. These more general characteristics still fit the definition of chunks, but they also allow constituents other than chunks to be defined as shallow structures. Topological fields are such a linguistic phenomenon, which fulfills the requirements for shallow structures. The categorization of German clause

2This would be even more the case for languages in which syntactic dependency is even less reflected by distributional proximity or word order, e.g. the Slavic languages.

types and the respective word order phenomena in terms of topological fields has a long tradition in the empirical investigation of German syntax (cf. Herling, 1821; Erdmann, 1886; Drach, 1937; Reis, 1980; Höhle, 1986) and is by now widely accepted as a classification of German clauses and their internal structure. Although topological fields are located on a higher level in the parse tree, they still adhere to very strict syntactic restrictions. Shallow parsing for German does, thus, not only allow the annotation of chunks but also the annotation of the macro structure of the German sentence. Topological fields describe sections in the German sentence with regard to the distributional properties of the verb (and the complementizer in subordinate clauses). Figure 4.2 shows that there are three different types of clauses in German, which can be defined according to the position of the finite verb in them: verb-last clauses (VL), verb-first clauses (V1) and verb-second clauses (V2). VL clauses comprise all introduced subordinate clauses, V1 clauses mainly comprise imperatives and yes/no questions and V2 clauses mostly comprise affirmative clauses and wh-questions. The topological fields C-Feld (C; complementizer field) and Linke Satzklammer (LK; left part of sentence bracket), on the one hand, and Verbkomplex (VC; verb complex), on the other hand, constitute the sentence bracket, relative to which the other fields can be described.3 In V1 and V2 clauses the field LK contains the finite verb and the field VC all other verbal elements; in VL clauses, the field C contains the complementizer and the field VC all verbal elements. The section preceding the left part of the sentence bracket is called the Vorfeld (VF; initial field; only in V2 clauses), the section included in the sentence bracket is called the Mittelfeld (MF; middle field) and the section following the right part of the sentence bracket is called the Nachfeld (NF; final field).
There may also be a Koordinationsfeld (KOORD; coordinator field), which contains a coordinating particle, or, as an alternative to the KOORD, the field PARORD, which only occurs in V2 clauses and contains non-coordinating particles like 'denn'. Additionally, there may be a Linksversetzungsfeld (LV; resumptive construction). The terminology is analogous to the one in the TüBa-D/Z (cf. Telljohann, Hinrichs, and Kübler, 2003), which we use as a gold standard for the evaluation (cf. section 10.1.2 and appendix C). Figures 4.3–4.5 give some (constructed) examples to illustrate the structure of the topological fields. Figure 4.6 shows a sentence taken from the TüBa-D/Z.

3Obligatory fields are in bold type.

Figure 4.3: Examples of topological fields in VL clauses

(fields: KOORD C MF VC NF)

1. Ich glaube, daß der Plan verrückt ist.
   (I think that the plan mad is.)
2. Und wenn das einer merkt?
   (And if that anyone notices?)
3. Das ist der Mann, der den Wagen gestohlen hat.
   (This is the man who the car stolen has.)
4. Er fragt, ob das in Ordnung ist mit dem Geld.
   (He asks whether that in order is with the money.)

Translations:
1. I think that the plan is mad.
2. And if anyone notices that?
3. This is the man who has stolen the car.
4. He is asking whether it is okay with the money.

Figure 4.4: Examples of topological fields in V1 clauses

(fields: KOORD LV LK MF VC NF)

1. Und kann er das bezahlen?
   (And can he that pay for?)
2. Den Preis, kann man den behalten?
   (The prize, can one it keep?)
3. Wollt ihr Ärger haben mit mir?
   (Want you trouble have with me?)
4. Aber geh mir sofort aus der Sonne!
   (But go me immediately out of the sun!)

Translations:
1. And is he able to pay for it?
2. Can we actually keep the prize?
3. Do you want trouble with me?
4. But get out of the sun, immediately!

Figure 4.5: Examples of topological fields in V2 clauses

(fields: KOORD LV VF LK MF VC NF)

1. Der Sieger, der kann sich freuen heute abend.
   (The winner, he can himself rejoice tonight.)
2. Aber kein Mensch konnte sie daran hindern, dorthin zu gehen.
   (But no person could her from that keep there to go.)
3. Ich weiß, daß es falsch war.
   (I know that it was wrong.)
4. Jetzt muß das Regelwerk geändert werden.
   (Now must the rules changed be.)
5. Wer steht denn im Mittelpunkt?
   (Who stands actually in the centre?)

Translations:
1. The winner can be happy, tonight.
2. Nobody could keep her from going there.
3. I know that it was wrong.
4. Now, the rules have to be changed.
5. So who's actually in the focus of attention?

[Tree diagram: SIMPX consisting of LK (VXFIN 'Veruntreute') and MF; the MF contains the NX 'die AWO' (function ON) and the NX 'Spendengeld' (function OA).]

Veruntreute die AWO Spendengeld?
Gloss: embezzled the AWO donation-money
'Did the AWO embezzle donated money?'

Figure 4.6: Example of a V1 clause from TüBa-D/Z

It shows an example of a V1 clause. The finite verb 'veruntreute' is the first element in the clause; it is the left part of the sentence bracket. Since there is no right part of the sentence bracket, the MF extends to the end of the clause. Figure 4.7 shows an example of a V2 clause. The VF stretches from the beginning of the clause to the finite verb 'muß', which is the left part of the sentence bracket. The MF stretches from the finite verb to the right part of the sentence bracket, which consists of the infinitive verb 'prüfen'. Figure 4.8 gives an example of a VL clause. Since VL clauses are subordinate clauses, they are always part of a matrix clause (here a V2 clause). The VL clause is embedded into the NF of the V2 clause. The complementizer 'daß' constitutes the left part of the sentence bracket, which is realized as 'C' in VL clauses. The finite verb constitutes the right part of the sentence bracket. Since the topological field structure, as shown in figure 4.2 and illustrated in the sentences in figures 4.6–4.8, is an invariant pattern in German, it is an ideal linguistic phenomenon for shallow parsing. Figures 4.6–4.8 further illustrate that the constellation of the topological fields reveals the categories and the boundaries of the clause and, thus, also allows the annotation of clauses. Recursion can also be accounted for by the topological field model because subordinate clauses may be embedded in topological fields and, with this model, it is possible to leave their attachment open until disambiguation in

[Tree diagram: SIMPX with VF (NX 'Staatsanwaltschaft', function ON), LK (VXFIN 'muß'), MF (NX 'AWO-Konten', function OA) and VC (VXINF 'prüfen', function OV).]

Staatsanwaltschaft muß AWO-Konten prüfen
Gloss: Public Prosecutor's Department must AWO accounts inspect
'Public Prosecutor's Department has to view AWO accounts'

Figure 4.7: Example of a V2 clause from TüBa-D/Z

[Tree diagram: an outer SIMPX with VF (PX 'Gegen Tegeler'), LK (VXFIN 'sprach') and MF (ADVX 'allerdings'); its NF contains an embedded SIMPX (function ON) with C 'daß', an MF containing ADVX, ADJX, NX and PX constituents, and VC (VXFIN 'läuft').]

Gegen Tegeler sprach allerdings, daß noch ein staatsanwaltschaftliches Ermittlungsverfahren gegen ihn läuft.
Gloss: against Tegeler spoke admittedly that also a Public-Prosecutor preliminary-investigation against him runs
'Admittedly, it speaks against Tegeler, that there are also preliminary investigations of the Public Prosecutor made against him.'

Figure 4.8: Example of a VL clause from TüBa-D/Z

further annotation steps.

The model of topological fields is primarily a distributional model. It gives no account of the grammatical function structure, nor does it reveal the relations between the constituents within the topological fields. In fact, the very structure of constituents within topological fields is left open in this theory. The model still has clear advantages from both a theoretical and an annotational perspective: As regards the theoretical perspective, it is, for example, important to point out that many constituent-order phenomena can be described relative to topological fields. As regards the annotation of further grammatical phenomena, the constituent order in the different topological fields may be utilized for annotation. This is possible because, although German imposes very few syntactic restrictions on the constituent order within the fields, there are many syntactic preferences which may be utilized if combined with other information such as morphological features and valency structure.

As regards automatic annotation, one of the main advantages of topological fields is that they are the skeleton of the sentence: once topological fields are annotated, clause boundaries and potential points of attachment are known. The annotation of topological fields considerably reduces the scope of ambiguity because the verb is always part of the sentence bracket and the respective grammatical functions are always realized in the corresponding fields. Without the annotation of topological fields, the scope of the grammatical functions of the verbs is much wider, especially in complex sentences, in which, additionally, it is not clear which potential functions belong to which verb. After the annotation of topological fields, the syntactic restrictions and preferences which hold within them can be utilized for further annotation (together with other linguistic information).
(23) Man könne, wegen der Erfahrung, die Ulrich Eckhardt mitbringt, nicht auf
     das Know How des Leiters der Festspiele GmbH im Umgang mit dem
     Martin-Gropius-Bau verzichten, sagt Radunski.
     Gloss: one could, because of the experience which Ulrich Eckhardt brings
     along, not with the know-how of the head of the Festspiele Ltd in handling
     with the Martin-Gropius building dispense, said Radunski
     'Because of the experience which Ulrich Eckhardt brings along, one could
     not do without the know-how of the head of the Festspiele Ltd in handling
     with the Martin-Gropius building, said Radunski.'

The advantages of annotating topological fields before annotating grammatical functions can be illustrated by figure 4.9, which shows the shallow structure of sentence 23 as annotated by our shallow parser (the category labels of the constituents are described in section 4.2). The syntactic tree for the sentence in figure 4.9 shows that there is a V2 clause into which another V2 clause is embedded, into which, in turn, a VL clause (REL) is embedded. There are three lexical verbs. As the grammatical functions may be realized on either side of the verbs, it is by no means clear where the potential grammatical functions of the respective verbs are. After the annotation of the topological fields, however, the scope is reduced: First of all, the annotation structure in figure 4.9 shows that 'könne ... verzichten' is one verb complex. This is very useful for further processing because most modals and auxiliaries can also function as lexical verbs if they occur as the only verb in a clause. Furthermore, the topological field and clause structure reveals that the potential grammatical functions of the verb 'sagen' may only be 'Radunski' and the embedded V2 clause. The potential grammatical functions of the verb 'verzichten' may only be found within the embedded V2 clause, but not in the VL (REL) clause. The relative clause itself only qualifies as a noun-modifying grammatical function. The potential grammatical functions of the verb in the relative clause are restricted to the MF of that clause. This example shows that the topological field structure reduces the scope of ambiguity within the sentence by confining the search space for potential grammatical functions of verbs. The same effect can be seen in sentence 24, which is shown in figure 4.10. Basically, the sentence is divided into three parts, and each of these parts can be treated separately in further annotation.
It should be taken into account, however, that the annotation of field and clause structure is shallow, like that of the chunks. It is, for instance, left unspecified to which constituent in the matrix clause the relative clause relates. Still, the shallow structure reduces the potential attachment sites, and the pre-structuring achieved with the annotation of the fields is a solid basis for further annotation.

(24) Das gemeinsame liegt eher in der Rhythmik, die sich von dem, was an Funk
     in den Vereinigten Staaten gespielt wurde, durch größere Vielschichtigkeit
     unterscheidet.
     Gloss: the common lies rather in the rhythmicity which itself from that,
     what as funk in the United States played was, through higher complexity
     distinguishes
     'The common ground can rather be found in the rhythmicity which can be
     distinguished by its higher complexity from the funk which was played in
     the United States.'

The difference between shallow and deep structures can, furthermore, be illustrated by a quote from Höhle (1982), who writes the following about the constituent order in German:

In der Topologie (Wortstellungslehre) des Deutschen kann man grob zwei Bereiche unterscheiden: Gewisse Satzbestandteile sind strengen topologischen Regularitäten unterworfen; dies gilt besonders für verbale Elemente und 'Konjunktionen', aber in hohem Maße auch für die Bestandteile von Nominal-, Präpositional- und Adjektiv/Adverbialphrasen. Insofern sind die Fakten, die in diesem Bereich zu beschreiben sind, weitgehend unstrittig. Die Positionsmöglichkeiten verschiedener Elemente – etwa NPs und PPs – innerhalb des Mittelfeldes relativ zueinander scheinen dagegen einigermaßen undurchsichtig. (Höhle, 1982, p. 75)

In the topology (science of word order) of German, two main areas can be distinguished: Certain elements of the clause are subject to strict topological regularities; this is especially true of verbal elements and ‘conjunctions’ but it is also largely true of the constituents of nominal, prepositional and adjective/adverbial phrases. Insofar, the facts which are to be described in this area are, to a large extent, indisputable. On the other hand, the possi- ble positions of different elements – e.g. NPs and PPs – within the middle field in relation to each other appear somewhat opaque. (our translation; our emphasis)

The phenomenon described by Höhle (1982) exactly defines the dividing line between shallow and deep structures. Shallow structures are exactly those structures where the ordering principles are indisputable, while deep structures are those in which the constituent order appears opaque. Sentences 25–27 further illustrate the difference between those two

Figure 4.9: Complex sentence from the TüPP-D/Z

[Tree diagram of sentence 23 as annotated by the shallow parser: an S node dominating a V2 clause; its VF contains an embedded V2 clause, whose MF in turn contains a REL (verb-last) clause; the leaves are chunks of the NC, PC, AVC and VC types.]

Figure 4.10: Complex sentence from the TüPP-D/Z

[Tree diagram of sentence 24 as annotated by the shallow parser: an S node dominating a V2 clause; its NF contains a REL clause, which itself contains a further embedded REL clause; the leaves are chunks.]

phenomenon areas. In all three sentences, there is the order VF, LK, MF, VC. Furthermore, the chunk structure is the same in all three sentences, as there would be no other possibility for the order of constituents within the chunks. Sentence 25 is the original sentence from the TüBa-D/Z (cf. figure 4.11). It is, however, possible to re-arrange the chunks in the sentence in some ways. Sentences 26 and 27 show two possibilities. While in sentence 25 the nominative object is the first element in the clause, in sentence 26 the accusative object is the first element. In sentence 27, this position is occupied by the dative object. Thus, while the structure of topological fields and chunks is syntactically restricted, the order of grammatical functions can be varied. Furthermore, the occurrence of grammatical functions is subject to lexical selection because different verbs sub-categorize for different grammatical functions.

(25) [VF [NX Die Initiatoren]] [LK hatten] [MF [PX vor kurzem] [NX den
     kommunalen Spitzenverbänden] [NX eine massive Behinderung des
     Volksbegehrens]] [VC vorgeworfen].
     Gloss: the initiators had a-short-time-ago the municipal
     central-associations a massive obstruction of-the referendum accused-of
     'A short time ago, the initiators had accused the municipal central
     associations of a massive obstruction of the referendum.'

(26) [VF [NX Den kommunalen Spitzenverbänden]] [LK hatten] [MF [NX die
     Initiatoren] [PX vor kurzem] [NX eine massive Behinderung des
     Volksbegehrens]] [VC vorgeworfen].
     Gloss: the municipal central-associations had the initiators
     a-short-time-ago a massive obstruction of-the referendum accused-of
     'A short time ago, the initiators had accused the municipal central
     associations of a massive obstruction of the referendum.'

(27) [VF [NX Eine massive Behinderung des Volksbegehrens]] [LK hatten]
     [MF [NX die Initiatoren] [PX vor kurzem] [NX den kommunalen
     Spitzenverbänden]] [VC vorgeworfen].
     Gloss: a massive obstruction of-the referendum had the initiators
     a-short-time-ago the municipal central-associations accused-of
     'A short time ago, the initiators had accused the municipal central
     associations of a massive obstruction of the referendum.'

Figure 4.11: Sentence containing three GFs extracted from the TüBa-D/Z

[Tree diagram of sentence 25: SIMPX with VF (NX 'Die Initiatoren', function ON), LK (VXFIN 'hatten'), MF containing PX 'vor kurzem' (V-MOD), NX 'den kommunalen Spitzenverbänden' (OD) and NX 'eine massive Behinderung des Volksbegehrens' (OA), and VC (VXINF 'vorgeworfen', OV).]

In conclusion, we can state that the three linguistic phenomena chunks, topological fields and clauses can be defined as shallow structures because they comply with the criteria we have laid out for shallow annotation structures in figure 4.1. Consequently, they are ideal candidates for annotation by FSAs. The FSAs act as transducers which use grammars with the power of regular expressions (REs) as their transition function. Shallow annotation structures are defined as those structures which can be captured using an RE grammar which takes PoS tags as its initial input and yields annotated structures as its output. Since the annotation system is set up as a cascade of FSAs, the output of the preceding FSAs is input to the following ones and, consequently, some FSAs take already annotated syntactic structure as their input. Recursion in shallow structures, like the relative clauses in figure 4.10, can be covered by iterating the FSAs, as shall be seen in chapter 5. A simplified rule for one kind of German noun chunk can look like the one in figure 4.12, where '*' (Kleene star) means zero or more. The rule covers noun chunks beginning with an article, followed by zero or more adjectives and ended by the head noun. The rule shows that a noun chunk can be captured using just PoS tag information. This technique has already been used extensively for chunking. The important fact about topological fields is that they can be captured using the very same technique which has been used for chunking. Although topological fields already contain a lot of information about the structure of the sentence, it is possible to annotate them with FSAs using just PoS tags as input symbols. For this reason, they are annotated as shallow structures before the annotation of grammatical functions. Figure 4.12 shows how the skeleton of the sentence can be captured by a simplified rule for a verb-second (V2) clause, which is the prototypical affirmative clause.
The sentence begins with the initial field (VF), which consists of one or more chunks ('+', Kleene plus) or subordinate clauses.4 The second element in the clause is the LK, which contains the finite verb (hence 'verb-second' clause). Then follows an optional MF ('?' indicates zero or one) with at least one constituent. The right part of the sentence bracket is represented by the VC, which contains the non-finite verbal elements; there are not always non-finite elements. The NF follows the VC. It contains at

4The VF typically contains just one phrase. This may, however, consist of more than one chunk and/or subordinate clause.

noun chunk --> article adjective* noun

initial field (VF)  --> (chunk | subordinate clause)+
middle field (MF)   --> (chunk | subordinate clause)+
final field (NF)    --> (chunk | subordinate clause)+
left part of sentence bracket (LK)  --> finite verb
right part of sentence bracket (VC) --> non-finite verb

V2 clause --> VF LK MF? VC? NF?

Figure 4.12: Regular Expression (RE) rules used in the FSAs

least one chunk or subordinate clause (NFs very often contain subordinate clauses). All topological field structures of German sentences can be captured by rules like this because they are an invariant pattern of the German language.
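The noun chunk rule of figure 4.12 can be sketched as a single transducer step in Python (a toy illustration only, not the grammar formalism actually used in this work; the function name and the bracketed output format are my own):

```python
import re

# Simplified noun chunk rule from figure 4.12,
#   noun chunk --> article adjective* noun,
# applied as a transducer over a space-separated sequence of STTS PoS tags.
# The \b anchors prevent matching 'ART' inside a longer tag such as APPRART.
NC_RULE = re.compile(r"\bART (?:ADJA )*NN\b")

def chunk_noun_phrases(tags: str) -> str:
    """Wrap every match of the noun chunk rule in an NC bracket."""
    return NC_RULE.sub(lambda m: "[NC " + m.group(0) + "]", tags)

# 'die großen Gnus schlafen' tagged ART ADJA NN VVFIN:
print(chunk_noun_phrases("ART ADJA NN VVFIN"))  # [NC ART ADJA NN] VVFIN
```

Because the output of one step is the input to the next, a further rule (e.g. for the LK) would run over the already bracketed string, which is the cascade idea described above.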

4.2 Definition of Shallow Structures

The following sections contain the definitions of the shallow structures, i.e. of chunks, topological fields and clauses. The names of the labels of constituents differ from those of the TüBa-D/Z. We decided in favour of a more fine-grained set of labels because we want to make use of the constituents in further annotation, and the more fine-grained distinctions allow direct access to features of certain constituents. Thus, while, in TüBa-D/Z, there is just a distinction between simple clauses (SIMPX) and simple relative clauses (R-SIMPX), we distinguish between verb-first and verb-second clauses, on the one hand, and between infinitive clauses, relative clauses and general subordinate clauses within the verb-last clauses, on the other hand.

4.2.1 Chunks

There are five major types of chunks: verb chunks (VC_)5, noun chunks (NC), adjective chunks (AJ_C), adverb chunks (AVC) and prepositional chunks (PC). The label set for the chunks is structured according to the

5The ‘ ’ stands for one letter, if it occurs within a chunk name, and for one or more letters, if it occurs at the beginning or end of a chunk name.

Chunk Label   Definition
AJAC          attributive adjective chunk
AJACTRUNC     AJAC with truncated adjective
AJACC         at least two coordinated AJACs
AJVC          predicative adjective chunk/adverb chunk
AJVCTRUNC     AJVC with truncated adjective/adverb
AJVCC         at least two coordinated AJVCs
AVC           adverb chunk
AVCC          coordinated AVC
NC            noun chunk
NCTRUNC       NC with truncated noun
NCC           at least two coordinated NCs
NCell         elliptical noun chunk (i.e. without head noun)
PC            prepositional chunk
VCTRUNC       verb chunk with truncated verb
VCL           verb chunk as left part of sentence bracket
VCR           verb chunk as right part of sentence bracket
VCF           verb chunk in topicalized (fronted) position
VCE           verb chunk containing finite verb in Ersatzinfinitiv structure

Table 4.1: Overview of the chunk labels

principle of the logical tagsets described in Leech (1997), sec. 2.3.2, for PoS tagsets. The principle of logical tagsets, which is also applicable to other linguistic label sets, implies that the label set can be represented "as a hierarchical tree [...] with attributes being inherited from one level of the tree to another." This can be illustrated by the overview of the label sets for chunks in table 4.1: At the general level, there is the distinction between 'AJ' (adjective), 'AV' (adverb), 'N' (noun), 'P' (preposition) and 'V' (verb). Adjectives can further be divided into 'AJA' (attributive) and 'AJV' (adjectival or adverbial function).6 The following 'C' in all chunk labels stands for chunk. Then follows either nothing (for unmarked), 'C' for coordinated or 'TRUNC' for truncated. In the case of NCs, there is also the option of 'ell' for elliptical.
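This decomposition can be made concrete with a short sketch (my own illustration; the regular expression and function below are not part of the parser described in this work) that splits the non-verb chunk labels of table 4.1 into their logical parts. The verb chunk labels (VCL..., VCR..., etc.) follow the richer scheme of table 4.2 and are deliberately not covered here:

```python
import re

# Base category + 'C' (for chunk) + optional status suffix,
# as described for the non-verb chunk labels of table 4.1.
LABEL = re.compile(r"^(AJA|AJV|AV|N|P|V)C(C|TRUNC|ell)?$")

def decompose(label: str):
    """Split a chunk label into (base category, status), or None."""
    m = LABEL.match(label)
    if not m:
        return None
    base, suffix = m.groups()
    status = {None: "unmarked", "C": "coordinated",
              "TRUNC": "truncated", "ell": "elliptical"}[suffix]
    return (base, status)

print(decompose("NCell"))  # ('N', 'elliptical')
print(decompose("AJACC"))  # ('AJA', 'coordinated')
```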

6Morphologically unmarked adjectives may function as either predicative adjectives or as adverbs in German.

1. Conciseness: Brief labels are often more convenient to use than verbose, lengthy ones.

2. Perspicuity: Labels which can easily be interpreted are more user-friendly than labels which cannot. [...]

3. Analysability: Labels which are decomposable into their logical parts are better (particularly for machine processing, but also for human understanding) than those which are not.

Figure 4.13: Criteria for label sets from Leech (1997), p. 25

In the case of the C7 tagset, which Leech (1997) gives as an example of a hierarchically structured logical tagset, just one letter is assigned per hierarchical level. This is in line with the principle of conciseness in figure 4.13. However, it is not in line with the principle of perspicuity, which demands labels to be easily interpretable. In C7, tags for adverbs get the letter 'R', which is not intuitive. In the trade-off between conciseness and perspicuity, we have decided for perspicuity. The third principle of analysability in figure 4.13 makes it easier for humans to understand the category a label stands for, but it is especially useful for automatic parsing. The reason for this is that it is possible to match whole groups of labels by using REs like 'NC*', which would match all noun chunks. The advantage of this fact will become even more obvious in the following subsection dealing with verb chunks.
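In a standard RE engine such as Python's re module, this group-matching idea looks as follows (a sketch of the idea only; note that in Python's syntax the wildcard is spelled '.*'):

```python
import re

# One pattern covers every noun chunk label of table 4.1.
noun_chunk = re.compile(r"NC.*")
labels = ["NC", "NCTRUNC", "NCC", "NCell", "PC", "AVC"]
print([l for l in labels if noun_chunk.fullmatch(l)])
# ['NC', 'NCTRUNC', 'NCC', 'NCell']
```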

Verb Chunks

Verb chunks play a special role in the chunk structure because they are both chunks and part of the sentence bracket – and as such topological fields. We have refrained from adding a further layer of annotation on top of verb chunks to show that they are topological fields because that layer would not add any further linguistic information. Since the structure of the verb chunks already contains a lot of information about the type and the structure of the clause they occur in, their categorization is very fine-grained. This categorization is used in the annotation of the topological fields, which is the basis for the annotation of clauses. It is furthermore used in the annotation of grammatical functions. In order to maximize the usefulness of the verb

95 [NC .NN Dialoge ] .$, , [NC .PRELS die ] [AVC .PTKNEG nicht ] [PC .APPR ohne [NC .NN Weiteres ] ] [VCRAFVZ .PTKZU zu .VVINF verstehen .VAFIN sind ]

Figure 4.14: Verb chunk in relative clause

[NC .PRELS die ] [NC .PPER er ] [VCEAF .VAFIN h¨atte ] [VCRMIVI .VVINF sehen .VMINF sollen ]

Figure 4.15: VCE in Ersatzinfinitivkonstruktion in relative clause

chunks in further annotation, special emphasis has been given to the logical label set design of verb chunks. This can be seen in table 4.2. Verb chunks are categorized on the basis of their syntactic distribution and their internal structure. As regards syntactic distribution, there are four main types of verb chunks: VCL, VCR, VCF and VCE. VCL chunks are the left part of the sentence bracket and VCR chunks are the right part of the sentence bracket. VCF chunks are chunks which in the basic word order would be the right part of the sentence bracket but which are fronted and thus topicalized. VCE chunks occur in Ersatzinfinitiv constructions in subordinate clauses with verb-last position. They contain the finite verb, which – without the Ersatzinfinitiv occurring – would be part of the right part of the sentence bracket but which is moved to its left in such a construction (cf. figure 4.15). VCE chunks are only recognized if no constituents intervene between the VCE chunk and the VCR chunk. The following letters in the name of the verb chunk correspond to the second and third letters in the verb tag, which denote verb type (V=lexical, A=auxiliary, M=modal) and finiteness (F=finite, I=infinitive, P=perfect participle).7 The sequence of the verbs in the chunk type name corresponds to the syntactic dependency, which is the opposite of the sequence of occurrence of the verbs in the sentence. Thus, in the sequence Dialoge, die nicht ohne Weiteres zu verstehen sind (cf. figure 4.14), the verbal complex zu verstehen sind is assigned the chunk type VCR for right part of sentence bracket, AF for auxiliary finite, VZ for lexical verb/infinitive with 'zu'. As they are parts of verbs, verbal particles are also included in the class of verb chunks, with the chunk type name VCRPT. The structure of the verb chunks as shown in table 4.2 adheres to the principles of logical label set design.
This has great advantages in the following processing stages because it is a lot easier to match groups of verb chunks using an RE grammar. Thus, the analysability of labels – as postulated in figure 4.13 – can be used in the annotation of topological fields and, later on, in the annotation of grammatical functions. If we want to match, e.g., the left part of the sentence bracket only if it contains an auxiliary or a modal, we can use the RE 'VCL(A|M).*'. If we want to match the right

7An exception is being made in the treatment of the infinitive with ‘zu’, where ‘I’ is replaced with ‘Z’ in the chunk type name (cf. figure 4.14) and with the imperative where a ‘B’ is assigned.

97 verb chunk position in clause verb category verb finiteness ... VC L=left V=lexical F=finite ... R=right A=auxiliary I=infinitive ... F=fronted M=modal P=perfect participle ... E=Ersatzinfinitiv Z=infinitive with ‘zu’ ... B=imperative ...

Table 4.2: Logical label set design for verb chunks [PC .APPRART im [NC .NN Interesse ] ] [NC .PIAT aller .NN Mitgliedstaaten ]

Figure 4.16: Noun chunks part of the sentence bracket only if it is finite, we can use the RE ‘VCR.F.*’. Thus, although the logical design of the label set is not indispensable, it makes writing grammar rules more convenient and transparent and, thus, the grammar more reliable.
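The convenience of analysable labels can be illustrated with standard regular expressions. The following Python snippet (a toy illustration, not part of KaRoPars; the label list is invented for the example) applies the two REs from the text to a handful of verb chunk labels built according to table 4.2:

```python
import re

# A few verb chunk labels composed according to table 4.2
# (invented sample, for illustration only).
labels = ["VCLAF", "VCLMF", "VCLVF", "VCRAFVZ", "VCRMIVI", "VCRVF"]

# Left bracket containing an auxiliary or a modal: 'VCL(A|M).*'
left_aux_or_modal = [l for l in labels if re.fullmatch(r"VCL(A|M).*", l)]

# Right bracket whose topmost verb is finite: 'VCR.F.*'
right_finite = [l for l in labels if re.fullmatch(r"VCR.F.*", l)]

print(left_aux_or_modal)  # ['VCLAF', 'VCLMF']
print(right_finite)       # ['VCRAFVZ', 'VCRVF']
```

Because each letter of a label encodes one feature, a single character class or wildcard in the RE picks out exactly one dimension of the classification.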

Noun Chunks

Noun chunks are the most common chunks. They consist of at least a noun, pronoun or cardinal number as a head word and of optional determiners, adverb chunks or attributive adjective chunks (cf. figures 4.16, 4.22 and 4.23). As the definition of chunks does not allow recursive structures except in cases of center-embedding, post-modifying prepositional or nominal constituents may not be part of a noun chunk (cf. figure 4.16). In the case of preposition-article contraction (APPRART), the contraction is annotated in its function as a preposition (cf. figure 4.16). As pronouns are normally not modified, they are the only element of a noun chunk when they occur (cf. figure 4.28). Adverb chunks are only included in a noun chunk when they occur after clear syntactic indicators

[PC .APPR um [NC .CARD sechs ] ]

Figure 4.17: Cardinal as the head of a noun chunk

[PC .APPR aus [NC [AJAC [AVC .ADV bloß ] .ADJA wirtschaftlichen ] .NN Motiven ] ]

Figure 4.18: Modifying adverb chunk in attributive adjective chunk

[NCC [NC .ART die [AJAC .ADJA großen ] .NN Gnus ] .KON und [NC .NN Zebras ] ]

Figure 4.19: Two coordinated NCs

[NC .ART Der [AJAC .ADJA älteste ] .NN Künstler ] [VCLAF .VAFIN ist ] [NC .NN Jahrgang ] [NC .CARD 1929 ] .$, , [NCell .ART der [AJAC .ADJA jüngste ] ] [NC .CARD 1963 ]

Figure 4.20: Sentence containing elliptical NC

[NCC [NCell .ART das [AJAC .ADJA neunzehnte ] ] .KON und [NC .ART das [AJAC .ADJA zwanzigste ] .NN Jahrhundert ] ]

Figure 4.21: Elliptical NC coordinated with NC

for the beginning of a noun chunk and before the head word because, in the other cases, their attachment is ambiguous without taking into account semantic or lexical knowledge. In case adverbs modify an adjective or a cardinal number, they are attached to the AJAC chunk (cf. figures 4.18 and 4.23). Clear indicators of the beginning of a noun chunk are determiners or attributive adjectives, but also prepositions (which are themselves not part of a noun chunk) (cf. figure 4.18). If there is no clear syntactic indication that the adverb belongs to the NC, it is not included. In ambiguous cases, adverbs are, thus, attached to the higher of the potential nodes in the tree. If noun chunks are coordinated, they are chunked as a coordinated noun chunk NCC. If the coordinated noun chunks are modified by an attributive adjective chunk, the adjective chunk becomes part of the first noun chunk, as it is impossible to decide on the chunk level whether the adjective chunk modifies both nouns or the first noun only (cf. figure 4.19). Very often, world knowledge or an understanding of the wider textual context would be required to resolve the ambiguity. The same applies to articles which might refer to both noun chunks or to the first noun chunk only. The only other possibility to avoid this problem would be to apply the same strategy as with the adverbs and attach the adjective chunk and the determiner to a higher node, as in example 28. This would even allow us to add an extra constituent in the case that the adjective and the determiner do in fact modify both nouns (cf. example 29).
However, there would still be no way to straightforwardly generate the desired structure from this structure if the adjective and determiner just modify the first noun; and, what is even more important, there would also be a need to annotate coordinated NCs before simple NCs because in the inverse case the structure of the simple chunks would already exist at the time of the annotation of coordinated NCs. This would, however, result in a far more complex rule system as certain structures would have to be dealt with both in the rules for the coordinated and the simple NCs. If simple NCs are treated before coordinated NCs, coordinated NCs can easily be annotated by just grouping two or more coordinated simple chunks. Since the higher attachment of the determiner and the adjective would, thus, not be of advantage for further annotation and even complicate chunking, we annotate coordinated NCs modified by a determiner and/or adjective as it is shown in figure 4.19. This does not always yield perfect results but the results can be accurately defined and, thus, be dealt with in further stages of annotation. We, thus, accept this problem as a problem of the chunking

[AJVC .ADJD furchtbar ] [NC [AJAC .ADJA militaristische ] .NN Trick-Aufnahmen ]

Figure 4.22: AJAC chunk with modifying ADJD attached outside chunk

approach which cannot take into account semantic or world knowledge.

(28) [NCC die [AJAC großen] [NC Gnus] und [NC Zebras]]

(29) [NC die [AJAC großen] [NCC [NC Gnus] und [NC Zebras]]]

There is one kind of NC which does not have a noun, pronoun or cardinal number as its head word. These NCs are labelled NCell for elliptical noun chunk. An NCell must consist of at least an attributive adjective chunk. In cases in which an NCell is coordinated with the NC containing its head word, those noun chunks are also chunked as a coordinated noun chunk (cf. figure 4.21).

Attributive Adjective Chunks

There are two main types of adjective chunks: attributive adjective chunks and predicative/adverbial adjective chunks. The distinction between them is made on the basis of the PoS tag of the head word of the chunk (i.e. the adjective tags ADJA (attributive) or ADJD (predicative/adverbial)). AJAC chunks are chunks with an attributive adjective (or cardinal number) as their head. Since an attributive adjective by definition modifies a noun, the AJAC chunk is always part of a noun chunk. AJAC chunks may contain modifying PCs in center-embedded structures. A modifying adverb (ADV) or ADJD may be included in the AJAC if there is clear indication of its containment in the noun chunk. If, however, there is no such clear indication, the ADV or ADJD is left outside the NC and, consequently, outside the AJAC (cf. figure 4.22).

[NC .ART die [AJAC [AVC .ADJD ungefähr ] .CARD vierzig ] [AJAC .ADJA jungen ] .NN Männer ]

Figure 4.23: Cardinal number as head of an attributive adjective chunk

[NC .ART die [AJACC [AJAC .ADJA große ] .$, , [AJAC .ADJA weite ] ] .NN Welt ]

Figure 4.24: AJAC chunks coordinated by comma

[NC [AJACC [AJAC .ADJA ausgekochte ] .KON und [AJAC .ADJA geschäftstüchtige ] ] .NN Musiker ]

Figure 4.25: AJAC chunks coordinated by und

[NC [AJAC .ADJA traditionelle ] [AJAC .ADJA klassische ] .NN Musik ]

Figure 4.26: Two AJAC chunks

[AVC .ADV Auch ] [NC .ART die .NN Kelten ] [VCLAF .VAFIN waren ] [AJVC .ADJD eitel ]

Figure 4.27: AJVC as a predicative adjective chunk

AJAC chunks may be coordinated in two ways: with commas or with a coordinator (cf. figures 4.24 and 4.25). In this case, they are projected to an AJACC chunk. Two immediately successive attributive adjective chunks are not projected to a coordinated chunk, as they are not in a relation of coordination (cf. figure 4.26). In cases in which there is no head noun after the attributive adjective chunk, this chunk – together with other elements like determiners – forms the noun chunk. The head noun of this elliptical noun chunk (NCell) may be inherent, or it may precede (cf. figure 4.20) or follow (cf. figure 4.21) the elliptical noun chunk.

Predicative Adjective Chunks/Adverbial Adjective Chunks

The tag ADJD, on the basis of which the chunk type AJVC is recognized, is assigned on the grounds of morpho-syntactic features, but the word to which the tag ADJD is assigned may function either as an adjective (cf. figure 4.27) or as an adverb (cf. figure 4.28). Obviously, in the tagset, no distinction was

[AVC .ADV Da ] [VCLVF .VVFIN kommt ] [NC .PPER Dir ] [NC .ART das [AJAC .ADJA normale ] .NN Leben ] [AJVC .ADJD richtig .ADJD langweilig ] [VCRPT .PTKVZ vor ]

Figure 4.28: One AJVC as an adverbial and the other as a predicative adjective

made because it is impossible to draw it without making a full parse. As this is still true at the chunking level, the decision is not made on this level, either; especially as this does not lead to any negative effects on the chunking level. The decision whether the chunk serves as an adverb chunk or a predicative adjective chunk is, thus, delayed until further annotation and left underspecified. The decision is made by the GF annotation process, in which predicative adjectives are assigned the label PRED. The label AJVC has, thus, to be read as 'either predicative adjective or adverb chunk'. Since it cannot be detected by syntactic restrictions whether an ADV is a modifier of an ADJD, ADVs as modifiers are not chunked together with the AJVC. However, since not annotating an ADJD which modifies another ADJD together with it as one chunk would lead to more AJVCs as potential predicatives, we annotate them as one AJVC (cf. figure 4.28). Evaluation will show whether this strategy was correct. In the case of coordination of AJVC chunks, they are projected to an AJVCC chunk (cf. figure 4.29). Coordination may occur with or without a coordinator.

[AJVCC [AJVC .ADJD schnell ] .KON und [AJVC .ADJD frisch ] ]

Figure 4.29: Coordinated AJVC chunks

[AVCC [AVC .ADV nachts ] .KON oder [AVC .ADV sonntags ] ]

Figure 4.30: Coordinated AVC chunks

Adverb Chunks

In the cases in which the attachment of adverbs is ambiguous, the site of their attachment is not specified. The adverb chunk (AVC) is then not part of the modified chunk. In most cases, adverb chunks consist of a single adverb only. The particle nicht, which is tagged PTKNEG, is counted among the adverbs. Adverb chunks cannot contain any constituents other than adverbs. Coordinated adverb chunks are grouped into a chunk labelled AVCC, analogous to the adjective chunks and the noun chunks (cf. figure 4.30).

Prepositional Chunks

Prepositional chunks typically consist of a preposition and a noun chunk. In most cases, the preposition precedes the noun chunk. In some cases, it follows the noun chunk. It may then be called a post-position (cf. figure 4.31). In some cases, there is a preceding and a following preposition. This phenomenon may be called a circumposition (cf. figure 4.32). Contractions of pronouns and prepositions (e.g. darauf, deswegen or hiermit), which are tagged PROAV, may be the only constituents of a PC. Sometimes, the head

[PC [NC [AJAC .CARD drei ] .NN Wochen ] .APPO lang ]

Figure 4.31: PC with post-position

[PC .APPR von [NC .NN Anfang ] .APZR an ]

Figure 4.32: PC with circumposition

of a prepositional chunk is a token tagged as an adverb (cf. figure 4.33). In some cases, a prepositional chunk may contain a preposition followed by another one, e.g. bis followed by another preposition (cf. figure 4.34).
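The three adposition placements just described can again be captured with simple REs over a string of tags and brackets. The patterns below are a hypothetical Python sketch (the bracketed test strings are invented; the actual rules are written in fsgmatch syntax, not shown here):

```python
import re

# Placement patterns of adpositions in PCs (cf. figures 4.31, 4.32, 4.34);
# the bracketed strings below are invented examples, not corpus output.
prepos  = re.compile(r"APPR(ART)? \[NC [^\]]*\]")   # preposition before NC
postpos = re.compile(r"\[NC [^\]]*\] APPO")         # postposition after NC
circum  = re.compile(r"APPR \[NC [^\]]*\] APZR")    # circumposition frame

assert prepos.search("APPR [NC CARD]")              # 'um sechs'
assert postpos.search("[NC AJAC NN] APPO")          # 'drei Wochen lang'
assert circum.search("APPR [NC NN] APZR")           # 'von Anfang an'
```

Because noun chunks have already been bracketed by an earlier stage, each PC rule only needs to match one NC bracket plus the surrounding adposition tags.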

Truncated Chunks

There is another category of chunks which just contains truncated, i.e. fragmentary, words (CTRUNC). Those words receive the tag TRUNC in the STTS. They usually cannot be assigned to the appropriate word category, as they lack important morphological information. Using contextual information, it is, however, very often possible to detect the proper word category. Fragmentary words are, thus, assigned a chunk corresponding to the respective word category, which is then treated like any ordinary chunk of its category (cf. figure 4.35). Truncated chunks are, thus, not a chunk category of their own, but rather belong – according to their head words – as a subcategory to the respective chunk category. Truncated words are, thus, treated just like their non-fragmentary counterparts. In cases in which disambiguation of the word category is impossible, truncated words are not chunked at all.

[PC .APPR seit .ADV gestern ]

Figure 4.33: Prepositional chunk with adverb head

[PC .APPR bis .APPRART zum [NC [AJAC .ADJA letzten ] .NN Augenblick ] ]

Figure 4.34: PC with 'complex' preposition

[PC .APPR in [NCC [NC .ART einer [NCTRUNC .TRUNC Rundfunk- ] ] .KON und [NC .NN Fernsehansprache ] ] ]

Figure 4.35: Truncated noun chunk (NCTRUNC)

4.2.2 Topological Fields

A distinction can be made between two types of topological fields: On the one hand, there are fields which are part of the sentence bracket. They can typically only contain tokens of a restricted number of Parts-of-Speech, namely complementizers and verbs. The left part of the sentence bracket is realized by the topological fields CF (complementizer field) and VCL (finite verb in V1 and V2 clauses); the right part of the sentence bracket is realized

by the topological field VCR (non-finite verbs in V1 and V2 clauses and all verbs in VL clauses). VCL and VCR are chunks and topological fields at one and the same time. On the other hand, there are fields which can be described relative to the sentence bracket. They can contain tokens of all other Parts-of-Speech. The constituent order in the fields which are not part of the sentence bracket is more varied than in those fields which are part of it. The fields VF, MF, NF and LV are not part of the sentence bracket. Special cases are the KOORDF and the PARORDF. These fields just contain one constituent, i.e. the coordinating particle (KOORDF) or the non-coordinating particle (PARORDF). These two fields may occur at the beginning of a clause.

The Complementizer Field (CF)

The CF only occurs in subordinate clauses introduced by a complementizer. It is always the left part of the sentence bracket. CFs usually contain just one token (i.e. the complementizer). These complementizers may be relative pronouns (PRELS, PRELAT), interrogative pronouns and adverbial interrogative pronouns in indirect questions (PWAT, PWS, PWAV) or the subordinating conjunctions KOUI and KOUS. In the cases in which these tokens are attributive, the whole noun chunk belongs to the CF (cf. sentences 30 and 31).8 In some cases, a complementizer may consist of a complex token like so dass/daß or als ob (cf. sentence 32). The CF can also be occupied by an expression like the one in sentence 33, which also introduces a subordinate clause.

(30) Sie stammen aus Ländern, [CF deren/PRELAT Regierungen] keinerlei Respekt für die Menschenrechte haben.
They come from countries whose governments no (at all) respect for the human rights have.
'They come from countries whose governments have no respect for human rights, at all.'

(31) Niemand weiß, [CF welches/PWAT Datum] das Dokument trägt.
Nobody knows which date the document has.
'Nobody knows which date the document has.'

8 The example sentences in this section and the following sections contain just the linguistic markup relevant for the respective sections.

(32) Die meisten Besucher kennt man, [CF so daß] die Sicherheitsprozedur entfällt.
The most visitors knows one, so that the safety procedure not apply.
'You know most of the visitors so that the safety procedures can be omitted.'

(33) [CF Je mehr Dinge] man zu erledigen hat, desto mehr Zeit hat man.
The more things one to complete has, the more time has one.
'The more things you have to take care of, the more time you have.'

The Left Part of the Sentence Bracket (VCL)

While the CF is the left part of the sentence bracket in introduced subordinate clauses (VL clauses), the VCL is the left part of the sentence bracket in V1 and V2 clauses, which are typically main clauses (cf. sentences 34 and 35) and sometimes non-introduced subordinate clauses (cf. sentences 36 and 37). The VCL always contains just one finite verb of the categories lexical verb, auxiliary verb or modal verb.

(34) Ein Almbauer aus Bayrischzell [VCLVF verhindert] den Skisport am Wendelstein.
An alp farmer from Bayrischzell obstructs the ski sport at the Wendelstein.
'An alp farmer from Bayrischzell obstructs ski sports at the Wendelstein.'

(35) Jetzt [VCLMF wollen] die Sozis einen Antrag [VCRVF einbringen].
Now want the socialists a petition file.
'Now, the socialists want to file a petition.'

(36) Kowaljow hatte angekündigt, er [VCLAF werde] den Vorwürfen [VCRVF nachgehen].
Kowaljow had announced, he would the allegations pursue.
'Kowaljow had announced that he would pursue the allegations.'

(37) [VCLVF Stimmt] der Präsident auch zu, fehlt immer noch das Ja des Parlaments.
Agrees the president even to, misses still the Yes of the parliament.
'Even if the president agrees to it, there is still the 'Yes' of the parliament missing.'

The Right Part of the Sentence Bracket (VCR)

VCR is defined as the right part of the sentence bracket. VCR is obligatory in all introduced subordinate clauses (i.e. verb-last clauses) and may occur in all kinds of other clauses, provided that they contain a complex predicate (i.e. a predicate consisting of two verbs or a verb and a verbal particle). VCR may contain one or more tokens. In introduced subordinate clauses, VCR contains all the verbal elements, and the CF constitutes the left part of the sentence bracket (cf. sentences 38 and 39); in main clauses, VCR contains all the verbal elements except for the finite verb (which is contained in the VCL) (cf. sentences 40 and 41).

(38) Ein Antrag, [CF dem/PRELS] weder die rot-grüne Koalition noch die PDS [VCRMFVI zustimmen mochten].
A motion, which neither the red-green coalition nor the PDS accept like.
'A motion which neither the red-green coalition nor the PDS liked to accept.'

(39) Klar, [CF daß/KOUS] dieser Antrag keine Mehrheit [VCRVF fand].
Evident, that this motion no majority found.
'It is evident that this motion did not get the majority.'

(40) Für eine Feier auf öffentlichen Plätzen [VCLAF hätte] eine eindeutige Einladung [VCRMIVI vorliegen müssen].
For a ceremony in public places had an unequivocal invitation exist must.
'For a ceremony in public places, there would have had to be an unequivocal invitation.'

(41) Etwaige Sicherheitsbedenken [VCLVF wies] er entschieden [VCRPT zurück].
Possible safety considerations turned he roundly away.
'Any safety considerations were roundly rejected by him.'

The Vorfeld (VF)

The VF is defined as the topological field enclosed by the beginning of the sentence on the left-hand side and the VCL on the right-hand side. A VF may contain all kinds of constituents except the ones contained in the sentence bracket (i.e. verbal elements and complementizers). An exception is the fronted and thus topicalized right part of the sentence bracket, which is labelled VCF and enclosed in the VF (cf. sentence 42). As a VF may contain subordinate clauses, a clausal frame as part of such a subordinate clause may be contained in the VF (cf. sentence 43). Typically, a VF contains just one constituent (which may, however, be very complex; see sentence 44). However, some adverbs (e.g. freilich; see sentence 45) may occur alongside other constituents.

(42) [VF [VCFVI Abnehmen]] [VCLVF kann] ihnen das keiner.
Take from can them that nobody.
'Nobody can take (believe) that from them.'

(43) [VF [SUB Daß ihr Vorhaben auf Widerstand stoßen würde]], [VCLAF war] den Transplanteuren in Hannover bewußt.
That their plan with opposition meet would, was the transplanters in Hannover clear.
'The transplanters in Hannover were aware of the fact that their plan would meet with opposition.'

(44) [VF Die Lage der noch etwa 15.000 verbliebenen Einwohner Grosnys, die seit Wochen in den Kellern der belagerten Stadt ausharren], [VCLAF ist] katastrophal.
The situation of the still about 15,000 remaining inhabitants of Grosny, who for weeks in the cellars of the besieged city hold out, is disastrous.
'The situation is disastrous for the approximately 15,000 remaining inhabitants of Grosny who have been holding out in the cellars of the besieged city for weeks.'

(45) [VF [PC Unter dieser Bedingung] [AVC freilich]] [VCLVF wäre] man mit der Vereidigung auf öffentliche Plätze gegangen.
Under these conditions however would have one with the swearing-in into public places gone.
'Under these conditions, however, they would have gone into public places with the swearing-in ceremony.'

The Mittelfeld (MF)

The MF is defined as the topological field which is enclosed by the left part of the sentence bracket (i.e. VCL or CF) and the right part of the sentence bracket (i.e. VCR). In cases in which no constituent appears between the left part of the sentence bracket and the right part of the sentence bracket, no MF is annotated (cf. sentence 46). If there is no right part of the sentence bracket, the MF ends at the end of the sentence, at the beginning of a new main clause (cf. sentences 47 and 48), or at the beginning of the Nachfeld (NF) (cf. sentence 49). Sentence 49 also shows that an MF may begin after a comma in non-introduced non-finite clauses and that the MF of the matrix clause ends where this clause begins. In cases like this one, automatic annotation relies heavily on punctuation.
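The field boundaries described in this and the preceding subsections can also be stated procedurally. The following toy function is an illustration only, not KaRoPars code: 'VCL' and 'VCR' stand in for the whole families of VCL*/VCR* labels, and the special cases involving commas and missing brackets are ignored:

```python
# Toy derivation of VF/MF/NF from the sentence bracket of a V2 clause.
# 'chunks' is a list of chunk labels in sentence order; 'VCL'/'VCR' stand
# in for all VCL*/VCR* labels (simplified sketch, not the real system).
def topological_fields(chunks):
    left = chunks.index("VCL")                   # left bracket position
    has_right = "VCR" in chunks
    right = chunks.index("VCR") if has_right else len(chunks)
    return {
        "VF": chunks[:left],                     # before the left bracket
        "MF": chunks[left + 1:right],            # between the brackets
        "NF": chunks[right + 1:] if has_right else [],
    }

# A clause shaped like sentence 35 (AVC, finite verb, two NCs, VCR):
fields = topological_fields(["AVC", "VCL", "NC", "NC", "VCR"])
print(fields)  # {'VF': ['AVC'], 'MF': ['NC', 'NC'], 'NF': []}
```

An empty 'MF' list corresponds to the case where no MF is annotated because nothing intervenes between the two parts of the sentence bracket.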

(46) Mehrere weitere Menschen [VCLAF wurden] [VCRVF verletzt].
Several other people were injured.
'Several other people were injured.'

(47) Der Mann [VCLVF verletzte] [MF sich dabei zum Glück nur leicht].
The man injured himself in doing so luckily only slightly.
'In doing so, the man luckily only got slightly injured.'

(48) Wir [VCLVF verkaufen] [MF ihnen keinen Reis], und dann kriegen wir keine Bananen.
We sell them no rice, and then get we no bananas.
'We don't sell them any rice, and then we don't get any bananas.'

(49) Lenin-Räuber [VCLVF versuchten] [MF vergeblich], [NF [INF [MF einen im Wald vergrabenen Lenin] [VCRVZ zu klauen]]].
Lenin robbers tried in vain a in the woods buried Lenin to steal.
'Lenin robbers tried in vain to steal a (statue of) Lenin which was buried in the woods.'

The Nachfeld (NF)

The NF is defined as the topological field after the right part of the sentence bracket. In the case that there is no right part of the sentence bracket, the NF is defined as the field containing certain constituents, like subordinate clauses, occurring at the end of the sentence. This is in line with the definition of NFs in Telljohann, Hinrichs, and Kübler (2003). An NF may contain constituents of various categories. It may, however, not contain all kinds of grammatical functions. As this is of less importance in shallow annotation, it will not be discussed here. The most typical constituent of an NF is a subordinate clause (cf. sentences 49 and 50); another typical but less frequent constituent is a phrase introduced by a Vergleichspartikel (cf. sentence 51). Other constituents include prepositional phrases like the one in sentence 52 or even conjuncts like the one in sentence 53. These are not typical cases of NF constituents but rather cases in which the author wanted to evoke some dramatic effect. This is even more the case with constituents which can be seen as a kind of addendum or afterthought (cf. sentence 54).

(50) Plötzlich merkte ich, [NF was für ein ungeheurer Druck auf mir lastete].
Suddenly realized I, what an enormous pressure on me lay.
'Suddenly, I realized what an enormous pressure lay on me.'

(51) In Deutschland wurden 4,6 % mehr für Tabakwaren ausgegeben [NF als im Vorjahr].
In Germany were 4.6 % more for tobacco goods spent than in the previous year.
'In Germany, 4.6% more money was spent on tobacco than in the previous year.'

(52) Die Dresdner Semperoper ist vollbesetzt [NF bis in den vierten Rang].
The Dresden Semperoper is fully taken up to the fourth tier.
'The Dresden Semperoper is fully taken, up to the fourth tier.'

(53) Die Konfrontation solle in den Museen stattfinden – [NF oder auf der Straße].
The confrontation should in the museums take place – or in the streets.
'The confrontation should take place in the museums – or in the streets.'

(54) Er will beim Management umgerechnet 350 Mark rausholen, [NF das Doppelte des Monatslohns].
He wants of the management converted 350 Deutschmarks get out, twice the monthly wage.
'He wants to get, converted into Deutschmarks, 350 Deutschmarks from the management, which is twice the monthly wage.'

The Linksversetzung (LV)

The topological field LV is used to annotate resumptive constructions, in which a constituent is dislocated and moved to the left, in front of the VF. This constituent is then resumed in its original place, which is the VF (cf. sentences 55 and 56). However, in the special case of the nominativus pendens, the resuming pronoun may be situated in another place (cf. sentence 57). Due to the limitations of a shallow annotation, these cases cannot, however, be recognized. LVs are more likely to occur in spoken language. However, a construction as shown in sentence 55, in which it is a clause that is fronted, is not infrequent in written language. It occurs 130 times in the 15,260 sentences of the TüBa-D/Z.

(55) [LV Wenn die Leute schon Skifahren müssen], [VF dann] sollen sie es tun, wenn genug Schnee da ist.
If the people actually skiing must, then should they it do, when enough snow there is.
'If people actually have to go skiing, then they should do it when there is enough snow.'

(56) [LV "Infos"], [VF das] sind vor allen Dingen lokale Nachrichten.
"Infos", that are before all things local news.
' "Infos", that's most of all local news.'

(57) [LV Ein frühes Tor], [VF jeder Trainer] würde sich wohl darüber freuen.
An early goal, every coach would himself probably about that rejoice.
'An early goal, that's what every coach would probably be happy about.'

The Coordination Fields (KOORDF and PARORDF)

Höhle (1986) states that the KOORDF is not a field in between sentences but a field introducing a sentence. He argues that sentences containing a KOORDF can be uttered without a preceding sentence which can be interpreted as their first conjunct. We share this view because it is supported by empirical data. Sentence 58, for example, has no first conjunct. Furthermore, there are examples like sentence 59, which very often must be interpreted as referring to more than one preceding sentence. However, there are also sentences like 60 and 61, in which there are doubtlessly two conjuncts. Sentence 61 shows the alternative field to the KOORDF, the PARORDF. The difference between PARORDF and KOORDF is that, if two clauses occur, the conjunction in KOORDF coordinates the two clauses both syntactically and semantically, while the conjunction in PARORDF coordinates the clauses only syntactically. Semantically, there exists a relation of subordination, because the second clause in sentence 61 gives the reason for the proposition made in the first clause. This can be tested by reversing the order of the two conjuncts. With KOORDF, which contains conjunctions like 'und' or 'oder', the meaning of the sentence does not change. With PARORDF, however, which contains conjunctions like 'denn' or 'weil' (this subsumes just the 'weil' which is used in V2 clauses), the meaning changes.

(58) Welche Verleger sind mit welchen Konzepten in der Stadt ansässig? Was machen das Literaturkontor und die Literaturzeitschriften? [KOORDF Und] [VF auch die Bücher selbst] sollen gelobt oder verrissen werden.
Which publishers are with which concepts in the town resident? What do the literature branch office and the literature journals? And also the books themselves should praised or panned be.
'Which publishers are resident in town and with which concepts? What's the plan of the literature branch office and the literature journals? And also the books themselves should be praised or panned.'

(59) [KOORDF Doch] [VF daraus] wird nun nichts.
But from this becomes now nothing.
'But it will all come to nothing, now.'

(60) Sie liegen auf der anderen Seite des Erdballs [KOORDF und] [VF ihr Streit] erscheint weit entfernt.
They are on the other side of the earth and their conflict seems far away.
'They are on the other side of the earth and their conflict seems far away.'

(61) Das sei ihm verziehen, [PARORDF denn] [VF um ihn] brennt die Luft beim Sendestart.
That be him forgiven for around him burns the air at the beginning of the program.
'He should be forgiven for that because the air burns around him when he begins his program.'

4.2.3 Clauses

Clauses are annotated according to the scheme of V1, V2 and VL clauses. There is no further subdivision in the categories of V1 and V2 clauses since they are typically maximal clauses, i.e. they do not have any function within a matrix clause (cf. figures 4.36 and 4.37). In cases in which they do have a function in a superordinate clause, this function has to be resolved on a later annotation level, because the shallow annotation level works with syntactic restrictions alone. With these, it is not possible to determine the function of a V1 or V2 clause. VL clauses, which are typically subordinate clauses, are subdivided into three different categories: general subordinate clauses (SUB, cf. figure 4.37), infinitive clauses (INF, cf. figure 4.37) and relative clauses (REL, cf. figure 4.38). The category REL subsumes all relative clauses introduced by relative pronouns. Clauses introduced by adverbial relative pronouns (i.e. PWAV) are annotated as SUB because, from the perspective of

(V1 [VCLMF .VMFIN Kann ] {MF [NC .ART die .NN Talkshow ] [PC .APPRART im [NC .NN Privatleben ] ] [NC [AJAC .ADJA konstruktive ] .NN Gespräche ] } [VCRVI .VVINF initiieren ] .$. ?)

Figure 4.36: V1 clause

a shallow annotation, it is not clear in which cases a PWAV is in fact a relative pronoun. The category INF subsumes all non-finite clauses containing an infinitive (not the ones containing a participle). It includes clauses introduced by a complementizer and those which are non-introduced. The category SUB subsumes all other introduced subordinate clauses, which are mainly adverbial clauses. The annotation of the clauses is shallow in the sense that the attachment of subordinate clauses is left unresolved, just like the attachment of prepositional chunks is left unresolved in the chunk annotation. This means, for example, that the reference noun of a relative clause is not given. Thus, the treatment of recursion in clauses is analogous to the treatment of recursion in chunks. Clauses cannot contain clauses of the same type unless they are center-embedded. The reason for this treatment is the same as with the chunks. If recursion in center-embedded clauses is not treated, structures are generated which are not adequate shallow annotation structures.

(V2 {VF (SUB {CF [NC .PWS Wer]} {MF [AVC .ADV aber] [PC .APPR von [NC .NN Propaganda ] ] } [VCRAFVP .VVPP verdorben .VAFIN ist ] ) .$, ,} [VCLAF .VAFIN hat ] {MF [AVC .ADV wohl] [NC .NN Grund]} .$, , {NF (INF {MF [NC .PRF sich]} [VCRVZ .PTKZU zu .VVINF fürchten ] ) } .$. .)

Figure 4.37: V2 clause with VL clause SUB embedded into its VF and VL clause INF embedded into its NF

(V2 {VF [NC .PPER Es]} [VCLVF .VVFIN handelt ] {MF [NC .PRF sich] [AVC .ADV eben] [PC .APPR um [NC .ART einen [AJAC .ADJA doofen ] .NN Trick]]} .$, , {NF (REL {CF [NC .PRELS der ] } {MF [AVC .ADV dennoch ] } [VCRVF .VVFIN verfängt ] ) } .$. .)

Figure 4.38: VL clause REL embedded into NF of V2 clause


Chapter 5

Annotating Shallow Structures with FSAs

Abstract: After the motivation and definition of shallow structures in chapter 4, chapter 5 describes the methods and strategies for the annotation of shallow structures. Section 5.1 gives an outline of the task description, i.e. it describes the input to the shallow parsing component and the desired output. It also discusses some problems of annotation. Section 5.2 explains how we use topological fields to apply an Easy-First-Parsing and a Divide-and-Conquer strategy for shallow parsing. Section 5.3 shows how we use these strategies to correct PoS tags which have a particularly negative effect on parsing accuracy in German. Section 5.4 discusses how we deal with the problem that FSAs cannot handle recursion. Section 5.5 motivates why we prefer a top-down strategy for the annotation of chunks over a bottom-up strategy.

5.1 Outline of the Task Description

The shallow parser described in this chapter is part of the annotation system KaRoPars described in Ule and Müller (2004). The shallow parser takes as its input text which has already been segmented into sentences and tokens and which has already been Part-of-Speech (PoS) tagged according to the STTS tagset defined in Schiller, Teufel, and Thielen (1995). The parser is implemented as a cascade of FSAs using the tool fsgmatch, which is the main component of the Text Tokenisation Tool (Grover, Matheson, Mikheev, and Moens, 2000).
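The cascade architecture can be illustrated with a small sketch. The following is a hypothetical Python analogue, not the actual fsgmatch grammar: each stage rewrites a space-separated PoS-tag string, and the output of one stage feeds the next. The two rules and their labels are illustrative fragments only.

```python
import re

# Hypothetical sketch of a transducer cascade over STTS PoS tags.
# The real parser uses fsgmatch rule files; the two rules below are
# illustrative only. Each stage wraps matched spans in labelled
# brackets; later stages see the output of earlier ones.

STAGES = [
    # left part of the verbal bracket: a finite verb form
    ("VCL", re.compile(r"\b(VAFIN|VMFIN|VVFIN)\b")),
    # right part of the verbal bracket: a non-finite verb form
    ("VCR", re.compile(r"\b(VVINF|VVPP|VAINF)\b")),
]

def run_cascade(tags):
    """Apply each stage in order; output of one stage feeds the next."""
    s = " ".join(tags)
    for label, pattern in STAGES:
        s = pattern.sub(lambda m: f"[{label} {m.group(0)}]", s)
    return s

# tags of 'Beim nächsten Mal wird es schwieriger sein' (cf. figure 5.1)
print(run_cascade(["APPRART", "ADJA", "NN", "VAFIN", "PPER", "ADJD", "VAINF"]))
# APPRART ADJA NN [VCL VAFIN] PPER ADJD [VCR VAINF]
```

The essential property shown here is the ordering: because each stage consumes the annotated output of the previous one, the arrangement of transducers in the cascade fixes the order in which linguistic phenomena are annotated.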

Beim nächsten Mal wird es schwieriger sein , ein Verbot mit der Zahl von Polizisten zu begründen .
APPRART ADJA NN VAFIN PPER ADJD VAINF $, ART NN APPR ART NN APPR NN PTKZU VVINF $.

Figure 5.1: Input to shallow parser

[Parse tree omitted: S dominating a V2 clause with VF, VCLAF, MF and NF; the NF contains an INF clause; terminal chunks: AJAC, NC, AJVC, VCRAI, PC, VCRVZ]

Beim nächsten Mal wird es schwieriger sein , ein Verbot mit der Zahl von Polizisten zu begründen .
APPRART ADJA NN VAFIN PPER ADJD VAINF $, ART NN APPR ART NN APPR NN PTKZU VVINF $.

Figure 5.2: Output of shallow parser

The parser takes PoS tags as its input (cf. figure 5.1; words are shown for illustration) and produces chunks, topological fields and base clauses as its output (cf. figure 5.2). Only in the case of subordinators (KOUS and KOUI) and conjunctions (KON) did the STTS tagset have to be refined, because these PoS classes are relatively heterogeneous. The shallow parser has a rather small grammar consisting of 288 rules. 103 of these are for verb chunks, since we put a lot of emphasis on this structure; 80 rules are for all the other chunks, 78 rules are for topological fields, and 27 rules are for clauses. The main task of shallow parsing is thus to find the rules which allow the generation of the structure in figure 5.2 from the input string in figure 5.1. There is, however, also a related, more general question, namely how to deal with ambiguous structures. Since we take into account only PoS tag information at the point of shallow parsing, some structures are ambiguous. The question is, consequently, how to deal with those structures in order to achieve optimal annotation and minimize problems for further steps of annotation. One way of achieving this is the shallow parsing structure itself, since it leaves certain ambiguities open. Modifying adverbs and PPs

are not attached by the shallow parser. There are, however, other phenomena which cannot be left open since that would prevent any kind of meaningful annotation. One such phenomenon is the scope of coordination. The difference between the coordination of two phrases and of two sentences can sometimes not be resolved locally, as can be seen in sentence 62. The string ‘die Spieler und die Zuschauer’ could be interpreted as a coordinated chunk because it could, in fact, be one in a different context. In order to realize that the ‘und’ in sentence 62 does not connect the two chunks but the two clauses, one would have to look at the whole clause structure. For the annotation, a solution has to be found for problems like these because, if a structure like the one in sentence 62 is incorrectly annotated as coordinated chunks, the correct annotation of the two clauses is impossible.

(62) Der Trainer motiviert die Spieler und die Zuschauer sind begeistert.
The coach motivates the players and the spectators are delighted.
‘The coach motivates the players and the spectators are delighted.’
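One hypothetical way to picture why clause structure is needed here is a heuristic that looks beyond the conjunct: a KON joins clauses rather than chunks if a finite verb follows the second conjunct. This is a simplified illustration under an assumed STTS tagging of sentence 62, not the parser's actual rule.

```python
# Hypothetical heuristic illustrating why coordination scope cannot be
# resolved locally: 'NN KON ART NN' looks like chunk coordination, but
# if a finite verb follows the conjunct, the KON may join clauses.
# The tagging of sentence 62 below is an assumed illustration.

FINITE = {"VAFIN", "VVFIN", "VMFIN"}

def kon_joins_clauses(tags, kon_index):
    """True if a finite verb occurs somewhere after the KON."""
    return any(t in FINITE for t in tags[kon_index + 1:])

# 'Der Trainer motiviert die Spieler und die Zuschauer sind begeistert .'
tags = ["ART", "NN", "VVFIN", "ART", "NN", "KON",
        "ART", "NN", "VAFIN", "ADJD", "$."]
print(kon_joins_clauses(tags, tags.index("KON")))  # True: 'sind' is finite
```

Even this toy rule shows that the decision requires the material of the whole clause, which is why the parser annotates topological fields and clauses before resolving coordination at the chunk level.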

5.2 Easy-First-Parsing and a Divide-and-Conquer Strategy

The shallow parser consists of different specialized transducers, i.e. each transducer deals with one linguistic phenomenon or one limited group of linguistic phenomena. These transducers are arranged as a cascade: the output of preceding transducers is the input to the following ones. Consequently, apart from the rules themselves, the arrangement of the respective transducers in the cascade is crucial because this very order reflects the order of the annotation of the different linguistic phenomena. Two types of parsing are typically distinguished with respect to the order of the annotation of different linguistic phenomena: top-down parsing and bottom-up parsing. In top-down parsing, constituents which are closest to the root node are annotated first and then successively each layer of constituents immediately dominated by the previously annotated one. In bottom-up parsing, pre-terminal constituents are annotated first and then successively each layer of constituents immediately dominating the previously annotated one. Figure 5.3 shows that we do not follow a strict bottom-up or top-down annotation strategy but rather a mixed strategy. At first, topological fields

are annotated, which are neither pre-terminal nor immediately dominated by the root node. Then, clauses are annotated. They are only immediately dominated by the root node in the case of main clauses.1 The last group of linguistic phenomena to be annotated are chunks. They are pre-terminal in some cases. Chunks are annotated both bottom-up and top-down. The reason why we are using this mixed order of annotation is that we are following an easy-first parsing strategy and a divide-and-conquer strategy for the annotation of shallow structures (and also deeper structures later on). Easy-first parsing is a term coined by Abney (1996c) for a strategy in which "we make the easy calls first, whittling away at the harder decisions in the process". In Abney (1996c), this easy-first parsing strategy is applied bottom-up, i.e. first chunks and then simplex clauses are annotated. However, pre-terminal constituents like chunks are not necessarily the 'easiest' linguistic phenomena to parse, and they are not necessarily those structures which contribute most to the following parsing process. For English, this is indeed the case: the chunk structure is subject to syntactic restrictions and reveals, to a certain extent, the clause structure. In German, however, the topological field structure is also subject to syntactic restrictions, just like the chunk structure in English and German, as figure 5.4 shows.2 Since the topological field structure is an invariant pattern of the German language, and since the pattern is very simple and its main component, the sentence bracket, contains only tokens with a few PoS tags, it is certainly the linguistic phenomenon which can be parsed most easily. The sentence bracket, which divides the topological fields into VF, MF and NF, typically contains just the verbal PoS tags and the subordinator in CFs in VL clauses.3 In V1 and V2 clauses, the left part of the sentence bracket contains just one verb, i.e. the finite verb, and, if there are any other verbal elements, they are contained in the right part of the sentence bracket in a strictly defined linear order. In VL clauses, all verbs are contained in the right part of the sentence bracket in a strictly defined order, and the left part of the sentence bracket consists of the subordinator. Provided that PoS

1Since sentence segmentation has been done previously, there is always a node 'sentence' which dominates all other nodes including clause nodes (cf. figure 5.2). 2Figure 5.4 shows the same syntactic patterns of topological fields as figure 4.2. Our shallow annotation does, however, sometimes use other category labels. 3Only in a few cases is the CF complex. In most of those cases, attributive pronouns occur, as in the sentence in figure 5.5. However, only 121 of 4437 CFs in the TüBa-D/Z are complex (i.e. approx. 2.7%).

Figure 5.3: Pre-parsing and shallow parsing

clause type   topological fields
VL: KOORD LV CF MF VCR NF
V1: KOORD LV VCL MF VCR NF
V2: KOORD/PARORD LV VF VCL MF VCR NF
(CF and VCL constitute the left part of the sentence bracket, VCR the right part)

V1: verb-first clause
V2: verb-second clause
VL: verb-last clause
KOORDF: coordinator field
PARORDF: non-coordinator field
VCL: left part of verb complex
VCR: right part of verb complex
VF: initial field
MF: middle field
NF: final field
CF: complementizer field
LV: resumptive construction

Figure 5.4: Topological Fields

tag annotation is correct, only the annotation of the finite verb is ambiguous if it is the only element of the sentence bracket. In this case, it can be either the left part of the sentence bracket in V1 or V2 clauses or the right part of the sentence bracket in VL clauses. Here, the punctuation and other immediate context information give useful cues for the disambiguation. These facts show that the sentence bracket is the structure which can be annotated most reliably and which is, thus, annotated first, following the easy-first parsing strategy. The topological fields VF, MF and NF can then be annotated afterwards. Since the topological field structure reveals the clause structure and type, clauses are annotated after topological fields. Chunks are annotated last. The direction of annotation is shown in figure 5.3 and exemplified in figure 5.6, which shows the annotation of the topological fields and clauses in sentence 63.4
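The lone-finite-verb ambiguity just described can be pictured as a small decision rule. The sketch below is a simplification under assumed cues: the real grammar also consults punctuation, and the subordinator set is only an illustrative fragment.

```python
# Sketch of the lone-finite-verb ambiguity: if the only bracket element
# is a finite verb, it is the right bracket (VCR) of a VL clause when a
# subordinating element opens the clause, otherwise the left bracket
# (VCL) of a V1/V2 clause. The real rules also use punctuation cues;
# the subordinator set below is an illustrative fragment.

SUBORDINATORS = {"KOUS", "KOUI", "PRELS", "PWS"}

def bracket_label(clause_tags, verb_index):
    """Label a lone finite verb as left (VCL) or right (VCR) bracket."""
    if any(t in SUBORDINATORS for t in clause_tags[:verb_index]):
        return "VCR"   # verb-last clause: the verb closes the bracket
    return "VCL"       # verb-first/second clause: the verb opens it

assert bracket_label(["KOUS", "PPER", "VVFIN"], 2) == "VCR"  # 'daß er kommt'
assert bracket_label(["PPER", "VVFIN"], 1) == "VCL"          # 'er kommt'
```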

(63) Und dann hat er gesagt, daß er mit dem Rad schneller ankommt
And then has he said that he with the bike sooner arrives

4N.B.: Chunks are annotated after the annotation of clauses and, thus, not dealt with in figure 5.6.

[Parse tree omitted: S dominating a V2 clause whose NF contains a SUB clause with a complex CF; terminal chunks: AVC, VCLVF, NC, AJAC, VCRVF]

Erst dann stellt sich die Frage , welche rechtlichen Regelungen wir brauchen .
ADV ADV VVFIN PRF ART NN $, PWAT ADJA NN PPER VVFIN $.

Erst dann stellt sich die Frage, welche rechtlichen Regelungen wir brauchen.
Not until then poses itself the question, which legal regulations we need.
‘Not until then does the question come up which legal regulations we need.’

Figure 5.5: Complex CF

Figure 5.6: Illustration of the succession of the different annotation levels

als mit dem Auto.
than with the car.
‘And then he said that he will be faster by bike than by car.’

Figure 5.6 shows how the topological field and clause structure is annotated successively.5 At first, the sentence bracket is annotated. This can be achieved for subordinate clauses and main clauses in one step since no element of the sentence bracket can be contained in any other element of the sentence bracket. The next step is the annotation of the topological field structure in the subordinate clause. This must be kept separate from the recognition of topological fields in the main clause because topological fields in main clauses may contain subordinate clauses. In the following step, the topological field structure of the subordinate clause is recognized as the subordinate clause SUB. After this, coordinating fields are annotated. Then, the topological field structure of the main clause is annotated, which includes the subordinate clause SUB in its NF. After this, the main clause can be annotated as a V2 clause.

The annotation of topological fields as the first linguistic phenomenon in the annotation cascade not only follows an easy-first parsing strategy but also a divide-and-conquer strategy. In a divide-and-conquer strategy as described in Peh and Ting (1996), a complex clause is not annotated in one step; instead, subordinate clauses are annotated first. The annotation task is, thus, divided into smaller portions which can then be tackled individually. This strategy has the advantage that attachment and coordination ambiguities are reduced because many attachments and coordinations cannot reach beyond the clause level or, in German, beyond the level of one topological field. The advantage of this strategy is higher accuracy of the parser and also higher speed, since the scope of ambiguity the parser has to deal with has been reduced. The sentence in figures 5.7 (TüBa-D/Z) and 5.8 (shallow annotation) shows the advantage of the divide-and-conquer strategy. There are two coordinated sentences. However, without the knowledge of the clause boundaries, the string ‘der Entfremdung und ein Erlebnis’ could just as well be two coordinated chunks.6 Only after the annotation of topological fields and clauses,

5The structure added at each level is highlighted. 6N.B.: This reading could, in this special case, be ruled out by morphological constraints because, typically, only chunks with the same morphological features can be coordinated. We have, however, for reasons of efficiency decided not to use morphological features in shallow annotation.

[Parse tree omitted: a SIMPX coordinating two SIMPX conjuncts (KONJ), each consisting of VF, LK and MF, with grammatical functions PRED, ON, MOD and HD over NX, PX, ADJX and VXFIN phrases]

Beeindruckend ist die filmische Umsetzung der Entfremdung und ein Erlebnis ist vor allem die akustische Komponete des Films .
ADJD VAFIN ART ADJA NN ART NN KON ART NN VAFIN APPR PIS ART ADJA NN ART NN $.

Beeindruckend ist die filmische Umsetzung der Entfremdung und ein Erlebnis ist vor allem die akustische Komponente des Films.
Impressive is the cinematic realization of the alienation and an experience is before all the acoustic component of the film.
‘The cinematic realization of the alienation is impressive and, most of all, the acoustic component of the film is a great experience.’

Figure 5.7: Two coordinated V2 clauses from TüBa-D/Z

it is clear that there are two coordinated clauses and not two coordinated chunks. Thus, the annotation of topological fields as the skeleton of the clause before chunking prevents the wrong annotation of coordinated chunks, which would completely obstruct the correct annotation of the whole sentence. Figure 5.8, which shows the shallow annotation, furthermore illustrates the effect of the divide-and-conquer strategy: apart from the fact that the coordination ambiguity of some chunks is resolved, the attachment ambiguity of grammatical functions is now confined, since the grammatical functions of, e.g., one verb are now restricted to one clause. That way, the search space for the parser has been considerably reduced, which makes the parser quicker and the rules for the annotation less complicated, more easily manageable and, thus, more reliable.

[Parse tree omitted: S dominating a V2 node that coordinates two V2 clauses, each with VF, VCLAF and MF; terminal chunks: AJVC, AJAC, NC, PC]

Beeindruckend ist die filmische Umsetzung der Entfremdung und ein Erlebnis ist vor allem die akustische Komponete des Films .
ADJD VAFIN ART ADJA NN ART NN KON ART NN VAFIN APPR PIS ART ADJA NN ART NN $.

Figure 5.8: Two coordinated V2 clauses from TüPP-D/Z

5.3 Correcting Tags by Using Topological Field Information

There is another strategy (described in Müller and Ule (2002)) which uses topological field and clause structure to resolve local ambiguities which would otherwise have led to wrong parsing results in the whole clause. Since standard taggers typically take into account only the local context of a token, some tokens may be assigned the wrong PoS tag, especially if their syntactic dependency is expressed not by linear proximity but by linear distance, i.e. the tokens belonging together syntactically are not close together in the linear order of elements. In German, this is especially the case in the sentence bracket: the verb complex can be split up such that the finite verb is close to the beginning of the clause and the non-finite parts of the verb complex (i.e. the infinitive or participle) are at the end of the clause. This is in contrast to the noun phrase, in which the syntactic dependency of determiner, adjective and noun is expressed by linear proximity. Consequently, more errors occur in the PoS tagging of elements whose syntactic dependency is expressed by linear distance. Since the form of the infinitive and the 1st and 3rd person plural forms of the finite verb are homonyms, a tagger which has to rely on local context would have problems with the distinction between the infinitive verb in sentence 64 and the finite verb in sentence 65, because the token which reveals the function of the verb ‘benennen’ is in the left part of the sentence bracket, which is twelve tokens apart from the verb ‘benennen’ in the right

part of the sentence bracket in sentence 64 and fourteen tokens in sentence 65. We have shown in Müller and Ule (2002) that the tag VVFIN does in fact have a higher error rate: the tag VVFIN (finite lexical verb) has an error rate of 9.95%, whereas the tag ADJA (attributive adjective), which is also a high-frequency open-class PoS tag,7 has an error rate of 2.40%.

(64) Bis Mitte Juli [VCLMF will] [MF der Verwaltungsausschuß für das Oldenburgische Staatstheater dem Kulturministerium in Hannover seinen Wunschkandidaten] [VCRVI benennen].
Until middle July wants the administrative board for the Oldenburgian state theatre the ministry of culture in Hannover its favourite candidate denominate.
‘The administrative board of the Oldenburg State Theatre wants to name its favourite candidate to the ministry of culture in Hannover by the middle of July.’

(65) Der Ministerpräsident will, [CF daß] [MF die Mitglieder des Verwaltungsausschusses für das Oldenburgische Staatstheater dem Kulturministerium in Hannover ihren Wunschkandidaten] [VCRVF benennen].
The prime minister wants that the members of the administrative board for the Oldenburgian state theatre the ministry of culture in Hannover their favourite candidate denominate.
‘The prime minister wants the members of the administrative board for the Oldenburg State Theatre to disclose their favourite candidate to the ministry of culture in Hannover.’

Figure 5.9, which is taken from the TüBa-D/Z, shows the shallow structure of sentence 64. The sentence is a V2 clause with the finite verb in the left part of the sentence bracket (VCLMF) and the infinitive verb in the right part (VCRVI). Since the verb ‘benennen’ could, as regards its form, be either an infinitive or a finite verb, a structure like this is very vulnerable to tagging mistakes. We use the annotation of the topological field structure to deal with this problem. After the elements of the sentence bracket have been annotated, we try to fit the sequence of PoS tags and constituents of

7The tag ADJA accounts for 5.7% of all tokens and the tag VVFIN accounts for 4.2% of all tokens.

Figure 5.9: V2 clause with ambiguous verb in right sentence bracket

[Parse tree omitted: V2 clause with a PC in the VF, the finite verb in VCLMF, NC and PC chunks in the MF, and the infinitive in VCRVI]

Bis Mitte Juli will der Verwaltungsausschuß für das Oldenburgische Staatstheater dem Kulturministerium in Hannover seinen Wunschkandidaten benennen .
APPR NN NN VMFIN ART NN APPR ART ADJA NN ART NN APPR NE PPOSAT NN VVINF $.

Figure 5.10: Correction of an unparsable sequence

the sentence bracket into a parsable sequence to annotate topological fields like the MF. If no such sequence can be found, i.e. if there is no parsable sequence, the parser tries to match the sequence with a constituent of the sentence bracket corresponding to another tag from the PoS tag ambiguity class of the relevant token (i.e. the class of potential PoS tags for the token). If it is then possible to fit the sequence into a parsable structure, the PoS tag of the respective token is changed, and so is the constituent label of the sentence bracket. The process can be exemplified by the sentence in figure 5.9. If the tagger had assigned the token ‘benennen’ the wrong tag VVFIN, the component annotating topological fields could not have assigned the sentence any topological field structure because it would not have fit into any parsable sequence. The topological field MF can only be assigned in a V2 clause if the left part of the sentence bracket contains the finite verb and the right part of the sentence bracket contains the non-finite parts of the verb complex. If ‘benennen’ had been assigned the tag VVFIN, this would not be the case. The tag correction component would then change the tag from VVFIN into VVINF and the label of the right part of the sentence bracket from VCRVF to VCRVI (cf. figure 5.10). That way, the parts of the sentence bracket, which were annotated taking into account only local context and PoS tags, can be corrected by taking into account topological field structure. The problem of ambiguous tokens in the sentence bracket becomes even more obvious in the case of ambiguity between subordinators and other tags. In those cases, the wrong annotation of a token would completely obstruct further annotation of the clause in question and, in most cases, also of the embedding clause. An example of this phenomenon is the token ‘als’, which can be either a Vergleichspartikel (particle of comparison; KOKOM) or a

subordinator (KOUS), and the token ‘seit’, which can be either a preposition (APPR) or a subordinator (KOUS). Sentences 66 and 67 give examples for the token ‘seit’: in sentence 66 the token ‘seit’ functions as a preposition (cf. figure 5.11 for the complete shallow structure), in sentence 67 it functions as a subordinator (cf. figure 5.12). Again, there is no way to locally disambiguate between the two readings; but since we annotate topological fields before chunks, we can, again, apply the strategy of using the topological field structure to disambiguate structures which are locally ambiguous but which may be resolved by using overall clause structure.

(66) [VF [PC Seit/APPR dem Beginn] des Krieges am 24. März] [VCLAF hat] [MF die Nato nicht so viele Fehltreffer] [VCRVP gelandet].
Since the beginning of the war on 24th March has the NATO not so many target misses touched down.
‘Never since the beginning of the war on March 24th has the NATO missed their targets so often.’

(67) [SUB [CF Seit/KOUS] [MF Nato-Militärs am 24. März die internationalen Bemühungen zur Lösung des Kosovo-Konflikts] [VCRAFVP übernommen haben]], sind zivile Politikformen in den Hintergrund gerückt.
Since NATO soldiery on 24th March the international efforts for the solution of the Kosovo conflict taken over have, are civilian politics forms in the background moved.
‘Since NATO soldiery have taken over the efforts for the solution to the Kosovo conflict on March 24th, civilian forms of politics have been moved to the background.’

As regards the distinction between the preposition reading and the subordinator reading of ‘seit’, the principle of the mechanism is the same as with the distinction between the finite and infinitive readings of the verb. First, the sentence bracket is annotated. Then, we try to fit the structure into a topological field structure. For sentence 66 this would mean that, in the case that ‘seit’ had been annotated as a KOUS, there would be an unparsable sequence, i.e. ‘CF VCLAF VCRVP’. Since the sequence ‘VCLAF VCRVP’

[Parse tree omitted: V2 clause whose VF is a complex PC with embedded PC and NC chunks; the MF contains NC and AVC chunks between VCLAF and VCRVP]

Seit dem Beginn des Krieges am 24. März hat die Nato nicht so viele Fehltreffer gelandet .
APPR ART NN ART NN APPRART ADJA NN VAFIN ART NE PTKNEG ADV PIDAT NN VVPP $.

Figure 5.11: The token ‘seit’ as a preposition

would, however, be a parsable sequence, the tag of ‘seit’ would be changed into APPR because this tag fits into a parsable sequence. The same also applies the other way round, as can be shown with sentence 67 (cf. figure 5.12): if the token ‘seit’ is not correctly PoS-tagged as KOUS, then there is just the right part of the sentence bracket, which cannot occur on its own if it is finite. Thus, if there is a token which can potentially be a subordinator, like ‘als’, ‘bis’, ‘da’ or ‘seit’, its PoS tag is changed to KOUS and it is annotated as the CF, which is the left part of the sentence bracket. After this, the subordinate clause can be annotated, which would otherwise have been impossible. The strategy with which the tagging errors are detected and corrected in our parser is similar to the ones described in Hirakawa, Ono, and Yoshimura (2000) and Rösner (2000). Hirakawa et al. (2000) describe a system which uses already existing linguistic knowledge from an MT system in order to learn PoS disambiguation rules from an automatically PoS-annotated corpus. The tagger is seen as a PoS candidate generator, which produces a ranked ambiguity class. After tagging, the text is parsed. For those cases where the top-ranked PoS candidate does not fit into any of the parsable PoS sequences, another candidate is chosen which fits into the Most Preferable Parsable POS sequence. Learning from those instances where a PoS tag is changed, the system is able to learn POS Adjusting Rules, which are used to improve the tagger. Unlike Hirakawa et al. (2000), Rösner (2000) does not work with ranked

Figure 5.12: The token ‘seit’ as a subordinator

[Parse tree omitted: V2 clause whose VF contains a center-embedded SUB clause; the SUB clause has CF, MF and VCRAFVP, with NC, AJAC and PC chunks as terminals]

Seit Nato-Militärs am 24. März die internationalen Bemühungen zur Lösung des Kosovo-Konflikts übernommen haben , sind zivile Politikformen in den Hintergrund gerückt .
KOUS NN APPRART ADJA NN ART ADJA NN APPRART NN ART NN VVPP VAFIN $, VAFIN ADJA NN APPR ART NN VVPP $.

ambiguity classes. Tags which are morphologically ambiguous to the PoS tagger8 are assigned to an unordered ambiguity class; tags which are not recognized by the PoS tagger are simply marked as unknown. However, the parser can deal with both situations and either elects a tag from the ambiguity class which fits into a parsable sequence or inserts the proper tag in those cases where Morphix has not assigned any tag. Thus, both systems use the linguistic knowledge of a parser to improve tagging accuracy. The difference between them is that the system of Hirakawa et al. (2000) is not integrated into the parser but is merely a system to improve a tagger, although the resulting rules may well be integrated into a parser. The system described in Rösner (2000), on the other hand, does not use the concept of ranked ambiguity classes, which means that all PoS tags are marked as equally likely.
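The correction loop described in this section can be sketched in a few lines. The parsable-sequence table, the label mapping and the ambiguity class below are small illustrative fragments, not the parser's actual rule inventory:

```python
# Sketch of the tag-correction step: the bracket labels derived from
# the current PoS tags are checked against a set of parsable sequences;
# on failure, alternative tags from the token's ambiguity class are
# tried until one yields a parsable sequence. All tables below are
# illustrative fragments.

PARSABLE = {("VCLVF", "VCRVI"), ("VCLAF", "VCRVP")}

# bracket label per candidate tag (fragment)
LABEL_FOR_TAG = {"VVFIN": "VCRVF", "VVINF": "VCRVI", "VMFIN": "VCLVF"}

def correct_tags(tagged, ambiguity):
    """Swap tags from the ambiguity class until the bracket sequence parses."""
    labels = tuple(LABEL_FOR_TAG[t] for _, t in tagged)
    if labels in PARSABLE:
        return tagged
    for i, (tok, tag) in enumerate(tagged):
        for alt in ambiguity.get(tok, []):
            if alt == tag:
                continue
            trial = tagged[:i] + [(tok, alt)] + tagged[i + 1:]
            if tuple(LABEL_FOR_TAG[t] for _, t in trial) in PARSABLE:
                return trial
    return tagged

# mistagged: 'benennen' as VVFIN yields VCLVF VCRVF, which never parses
fixed = correct_tags([("will", "VMFIN"), ("benennen", "VVFIN")],
                     {"benennen": ["VVFIN", "VVINF"]})
print(fixed)  # [('will', 'VMFIN'), ('benennen', 'VVINF')]
```

The same loop covers the ‘seit’ cases: with APPR and KOUS in the ambiguity class, whichever tag produces a parsable field sequence is kept, and the bracket label is relabelled accordingly.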

5.4 Handling Recursion by Iterating FSAs

We define recursion as the capability of a rule to be applied to itself.9 One example is the sample definition of an NP as a string of a determiner, an adjective and a noun followed by a PP, which itself is defined as a preposition followed by an NP (cf. figure 5.13). In principle, this rule can be applied infinitely often because it can invoke itself again and again. This is one of the reasons why recursion is not handled in shallow parsing (excluding the few exceptions discussed in chapter 4). In shallow parsing, only the rule without the postmodifying PP is applied. Recursive structures are, thus, left partially ambiguous in shallow structures; their attachment is left unspecified. This can be seen in figure 5.14, which shows the deep structure, and in figure 5.15, which shows the shallow structure of a postmodifying PP: the phrase ‘die geplante Vertragsverlängerung mit Doll’ is annotated as one phrase (NX) in figure 5.14; the NX in it is marked as the head, which leaves the PX as a modifier. Modification is not marked in figure 5.15, where both the NC and the PC are directly attached to the VF. There are, however, cases in which we have decided to cover recursion because, otherwise, the shallow structures are no longer a useful basis for further annotation. Center-embedding in NCs is one of those cases. Without

8The PoS tagger Morphix (Finkler and Neumann, 1988) is used. 9This means that we do not define chunks which are contained in other chunks as recursive structures as long as they are not chunks of the same category.

NP --> article adjective noun PP
PP --> preposition NP

Figure 5.13: Recursion in the rule for NPs
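The contrast between the recursive rule and its shallow counterpart can be made concrete: the recursive NP rule of figure 5.13 cannot be encoded directly as a finite-state pattern, so a shallow chunker applies only the non-recursive rules and leaves PP attachment unresolved. The following regex rendering over STTS tag strings is a hypothetical illustration, not the parser's real grammar:

```python
import re

# Non-recursive chunk rules corresponding to figure 5.13 without the
# postmodifying PP (illustrative only). NCs are built first; a PC may
# contain an already-built NC (a different category, hence no
# recursion), but no rule re-invokes itself.

NC = re.compile(r"ART ADJA NN")
PC = re.compile(r"APPR (\[NC [^\]]*\])")    # preposition + built NC

s = "ART ADJA NN APPR ART ADJA NN"
s = NC.sub(lambda m: f"[NC {m.group(0)}]", s)     # build NCs first
s = PC.sub(lambda m: f"[PC APPR {m.group(1)}]", s)
print(s)  # [NC ART ADJA NN] [PC APPR [NC ART ADJA NN]]
```

Note that the first NC and the PC end up as sisters: whether the PP modifies the noun is exactly the attachment decision the shallow structure leaves open.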

[Parse tree omitted: SIMPX whose VF contains an NX (function ON) consisting of a head NX post-modified by a PX; LK and an MF with function PRED follow]

Die geplante Vertragsverlängerung mit Doll klingt nach einer Herzenssache
ART ADJA NN APPR NE VVFIN APPR ART NN

Die geplante Vertragsverlängerung mit Doll klingt nach einer Herzenssache
The planned contract extension with Doll sounds like a heart's desire
‘The planned contract extension with Doll sounds like a heart's desire’

Figure 5.14: Post-modifying PP in TüBa-D/Z, i.e. deep structure

the covering of center-embedding in NCs, there is no straightforward way to generate the deep structure from the shallow structure (cf. section 4.2.1), which was one of our requirements for shallow structures. However, the same applies to center-embedded subordinate clauses.10 Subordinate clauses following the constituent they refer to may be left unattached, just like PCs are left unattached to the NC they refer to (cf. sentence 68 shown in figure 5.16). However, not annotating center-embedded subordinate clauses would render the correct annotation of the whole clause impossible. This can be seen in sentence 69 shown in figure 5.17: if just the embedded subordinate clause (SUB) ‘als wir viel Geld hatten’ were annotated and the embedding clause were not annotated, only the first five tokens of the sentence could be fit into

10N.B.: All kinds of embedding solely apply to subordinate clauses because the very definition of subordinate clauses is that they are embedded into another clause.

[Parse tree omitted: V2 clause whose VF contains NC and PC as sister chunks, i.e. the post-modifying PC is left unattached; the MF contains a further PC]

Die geplante Vertragsverlängerung mit Doll klingt nach einer Herzenssache
ART ADJA NN APPR NE VVFIN APPR ART NN

Figure 5.15: Post-modifying PP in TüPP-D/Z, i.e. shallow structure

a meaningful structure and the rest of the constituents of the sentence would be left ‘stranded’.

(68) Hintergrund ist die Überbelegung der Berliner Justizvollzugsanstalten, die zum Teil auch daher rührt, daß Haftstrafen angetreten werden, um Geldstrafen abzusitzen.
Background is the overcrowding of the Berlinian correctional facilities, which to a certain extent also from this comes, that imprisonments accepted are, to money sentences serve.
‘The background of this is the overcrowding of the Berlin prisons which is caused, to a certain extent, by the fact that people accept prison sentences to do their fines.’

(69) Das Defizit liegt eher darin, daß der Kultursektor auch in Zeiten, als wir viel Geld hatten, im Bundesvergleich zu niedrig finanziert worden ist.
The shortcoming lies rather in that, that the culture sector also in times, when we a lot of money had, in federal comparison too low financed been has.
‘The shortcoming rather lies in the fact that – even in days when we did have a lot of money – the cultural sector has been financed too little compared to the national level.’

Figure 5.16: Subordinate Clauses occurring successively

[Parse tree omitted: V2 clause whose NF contains a REL clause; the NF of the REL clause in turn contains a SUB clause followed by an INF clause, i.e. the subordinate clauses occur successively rather than center-embedded]

Hintergrund ist die Überbelegung der Berliner Justizvollzugsanstalten , die zum Teil auch daher rührt , daß Haftstrafen angetreten werden , um Geldstrafen abzusitzen .
NN VAFIN ART NN ART ADJA NN $, PRELS APPRART NN ADV PAV VVFIN $, KOUS NN VVPP VAFIN $, KOUI NN VVIZU $.

Figure 5.17: Subordinate Clause center-embedded into subordinate clause

[Parse tree omitted: V2 clause whose NF contains a SUB clause; a second SUB clause (‘als wir viel Geld hatten’) is center-embedded into the MF of the first]

Das Defizit liegt eher darin , daß der Kultursektor auch in Zeiten , als wir viel Geld hatten , im Bundesvergleich zu niedrig finanziert worden ist .
ART NN VVFIN ADV PAV $, KOUS ART NN ADV APPR NN $, KOUS PPER PIAT NN VAFIN $, APPRART NN PTKA ADJD VVPP VAPP VAFIN $.

Having motivated the need for the annotation of recursive structures in center-embedded clauses, there remains the problem that FSAs cannot deal with recursion. It is, however, possible to iterate the FSAs: in the cascade of FSAs, the ones which annotate subordinate clauses and their respective topological fields can be called more than once to cover recursion. With respect to the question of how often the FSAs have to be iterated, we have decided on two passes because more than one center-embedding hardly ever occurs. The reason for this might be cognitive factors which limit the processing of such structures, as Schefe (1975) has pointed out. The TüBa-D/Z does not contain any subordinate clauses which are center-embedded into subordinate clauses which are themselves center-embedded into subordinate clauses (i.e. subordinate clauses in the MF of a subordinate clause which is in the MF of another subordinate clause). Such structures are, however, possible, as the constructed sentence 70 shows, although, in this case, the subordinate clauses are of different categories and the phrase ‘der, wenn er den Mund aufmacht, lügt’ seems solidified; this might make processing easier. Should the evaluation of the parser show considerable problems with double center-embeddings, the subordinate clause component could be applied three times.
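The iteration can be sketched with a stand-in for the subordinate-clause transducer group (the real system uses fsgmatch rules; the function below is illustrative): a single pass brackets only the innermost subordinator-to-finite-verb span, and a second pass of the very same function then covers one level of center-embedding.

```python
# One pass of the (illustrative) subordinate-clause recognizer brackets
# the innermost KOUS...finite-verb span; calling the pass twice covers
# one level of center-embedding, which a single FSA pass cannot.

FINITE = {"VAFIN", "VVFIN"}

def wrap_innermost(tokens):
    """Bracket the innermost subordinator-to-finite-verb span as SUB."""
    last_kous = None
    for i, t in enumerate(tokens):
        if not isinstance(t, str):
            continue                      # already-annotated clause: skip
        if t == "KOUS":
            last_kous = i                 # remember innermost open KOUS
        elif t in FINITE and last_kous is not None:
            return (tokens[:last_kous]
                    + [["SUB"] + tokens[last_kous:i + 1]]
                    + tokens[i + 1:])
    return tokens

# simplified tags of sentence 69: 'daß ... als wir viel Geld hatten ... ist'
tags = ["KOUS", "ART", "NN", "KOUS", "PPER", "PIAT", "NN", "VAFIN",
        "ADJD", "VVPP", "VAFIN"]
once = wrap_innermost(tags)      # inner 'als'-clause bracketed
twice = wrap_innermost(once)     # embedding 'daß'-clause bracketed
```

After the first call, only the inner ‘als’-clause is bracketed; after the second, the embedding ‘daß’-clause contains it, mirroring the two passes of the cascade.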

(70) Wenn Du Deinen Freund, der, wenn er den Mund aufmacht, lügt, mitbringst, geh ich.
if you your friend who, when he the mouth opens, lies, take-along, go I
'If you bring along your friend who lies when he opens his mouth, I'll leave.'

Figure 5.18 shows the way the annotation of center-embedded subordinate clauses works for a slightly simplified version of sentence 69: First, the elements of the sentence bracket are annotated. Since these are annotated mainly on the basis of their inner structure or their local context,11 they can be annotated for all clauses before further annotation takes place, even in cases of center-embedding. The following step assigns topological fields in subordinate clauses. Only the MF in the clause 'als wir viel Geld hatten' can be annotated in this step because the sentence bracket of that very clause blocks the annotation of the MF it is embedded in. In the following step, the topological field structure of the clause is matched as a subordinate clause

11With the exception of the cases concerning the tag correction component, in which global clause structure is taken into account.


Figure 5.18: Illustration of the annotation of center-embedded subordinate clauses

SUB. After this step, the FSAs for the annotation of subordinate clauses are applied a second time. The MF of the clause 'daß der Kultursektor zu niedrig finanziert worden ist' can now be annotated since a subordinate clause SUB is allowed in the MF of a clause and the annotation is, thus, no longer blocked. This makes the annotation of the embedding subordinate clause SUB possible, and the whole center-embedded structure is annotated.

5.5 Annotating Chunks Top-down

Chunks12 are annotated after the annotation of topological fields and clauses.13 The annotation is achieved top-down. The reason for this is that the structure of chunks adheres to a principle sometimes compared to that of the sentence bracket and, therefore, called the nominal bracket (Nominalklammer; cf. Bußmann, 2002). While, for instance, in the VL clause, the CF introduces the clause and the verb complex ends it, in the NC, the determiner introduces the NC and the noun ends it. This phenomenon can, in principle, also be found in English. It does, however, only apply to chunks since it does not include post-modifying structures like PP-attachment. We can, thus, define an NC as reaching from the determiner to the respective head noun. Provided that PoS tagging is correct, an NC can consequently be annotated by matching the string from a determiner to the nearest following noun in linear order. A PC can be defined accordingly as reaching from the preposition to the head noun of the dominated NC. There are, however, cases in which there is no determiner; but we can define an NC more generally as the string which reaches from the marker of the beginning of an NC to the head noun. A similar approach is used by Aït-Mokhtar and Chanod (1997a) for French. A marker could be a determiner but also an attributive adjective or cardinal. In the latter cases, the NC can be matched as the string reaching from the leftmost attributive adjective or cardinal (which can also serve as an attributive adjective) to the nearest following noun in linear order. In the case that none of these patterns applies because the noun is not modified, the pure noun can be matched as an NC. This strategy makes use of the fact that the linear order of constituents

12Cf. table 4.1, repeated here for convenience, for an overview of the chunks. 13Verb chunks are an exception: They are annotated as the first constituents in the shallow parsing component because they make up the sentence bracket and can, consequently, be seen as both chunks and topological fields.

Chunk Label   Definition
AJAC          attributive adjective chunk
AJACTRUNC     AJAC with truncated adjective
AJACC         at least two coordinated AJACs
AJVC          predicative adjective chunk/adverb chunk
AJVCTRUNC     AJVC with truncated adjective/adverb
AJVCC         at least two coordinated AJVCs
AVC           adverb chunk
AVCC          coordinated AVC
NC            noun chunk
NCTRUNC       NC with truncated noun
NCC           at least two coordinated NCs
NCell         elliptical noun chunk (i.e. without head noun)
PC            prepositional chunk
VCTRUNC       verb chunk with truncated verb
VCL           verb chunk as left part of sentence bracket
VCR           verb chunk as right part of sentence bracket
VCF           verb chunk in topicalized (fronted) position
VCE           verb chunk containing finite verb in Ersatzinfinitiv structure

Table 5.1: Overview of the chunk labels

[PC preposition [NC determiner adjective noun]]

Figure 5.19: Linear order in PC and NC

Figure 5.20: A longest-match strategy for the annotation of chunks

within NCs is strictly defined (in contrast to the linear order of chunks within the topological fields). The three constituents determiner, adjective/cardinal and noun always appear in this linear order. If we also take into account the preposition, which is not part of the NC but marks its beginning, we have three markers which can be used for the detection of the beginning of an NC. Since, if one of the markers occurs, it has to be matched, we apply a longest-match strategy for the disambiguation of ambiguous structures, i.e. the reading in which the NC stretches from the determiner to the head noun is preferred to the reading in which it stretches from the attributive adjective to the head noun (and so forth, as illustrated in figure 5.20). If two markers occur, the leftmost of the two is chosen as the beginning of the chunk. For the annotation of chunks, we have decided in favour of this top-down approach, which applies a longest-match strategy in conjunction with chunk boundary markers, because it has some advantages over a bottom-up approach: Our approach is capable of matching even complex structures like the one in figure 5.2114 in one step. The inner structure of the chunk can then

14The gloss shows a wider context for better understanding.

[PC mit [NC der [AJAC [AVC ebenfalls] hochwertigen] [AJAC Bremer] Fassung]]
APPR ART ADV ADJA ADJA NN

seit dem Vergleich mit der ebenfalls hochwertigen Bremer Fassung
since the comparison with the also high-class Bremian version
'since the comparison with the version in Bremen, which is also high-class'

Figure 5.21: Complex AJAC in NC

be annotated afterwards. In the case of figure 5.21, a bottom-up approach would not have been able to straightforwardly generate the desired output since, at the point of the annotation of the AJAC 'ebenfalls hochwertigen', the adverb 'ebenfalls' could not have been attached because the attachment of adverbs is ambiguous. Only at the point of the annotation of the NC would the attachment of the adverb be unambiguous. There would then be a need for a special step to attach the adverb inside the AJAC. Problems like these are avoided using a top-down approach. Another advantage is the straightforwardness of the approach. This becomes apparent in figure 5.20: For the annotation of PCs and NCs we basically need four simple rules: One to match PCs, one to match NCs introduced by a determiner, one for NCs introduced by an adjective/cardinal and one for unmodified NCs. The constituents between the chunk boundary marker and the head noun can be matched by an ANY expression which includes all constituents which may appear inside an NC. The sequence of the constituents within the NC is not defined in the rule. In a bottom-up approach, it would in principle be possible to match all NCs with one rule. However,

the rule would be very complex since it would have to work with a large number of optional constituents at various positions. In order to match, e.g., adverbs like 'ebenfalls' in figure 5.21, one would have to insert an optional AVC at every potential point of attachment, e.g. after the preposition, after the determiner or after the adjective. What is more, a sequence of zero or more AJACs could not be matched by a simple Kleene closure because an AVC could occur between two AJACs. If, in figure 5.21, the second AJAC were modified by an adverb (which cannot be attached in a bottom-up approach), the matching rule would explicitly have to state that, in an NC, there could be a sequence of two AJACs with an intervening AVC, or otherwise leave the sequence of AJACs and AVCs within the NC unspecified. In the latter case, however, there would be the need to introduce markers to delimit the NC; but, in this case, there would be the need for more than one rule. In fact, this strategy would then be the same as ours, and our strategy works best top-down. The inner structure of the chunks is annotated afterwards, as illustrated in figure 5.22, which shows the annotation steps for chunks for sentence 71: First, the PC is annotated. After this, the NCs are annotated, both inside PCs and outside. Then, the attributive adjective chunks are recognized inside the NCs. Adverbs which appear in front of attributive adjectives, like 'unheimlich', become part of the attributive adjective chunk and are annotated as adverb chunks in a last step. PCs and NCs are the only chunks in which considerable complexity can occur.

(71) Mit drei unheimlich klobigen Koffern erschienen die großen Männer.
with three scaringly bulky trunks appeared the tall men
'The tall men appeared with three scaringly bulky trunks.'
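The four rules and the longest-match strategy can be sketched as follows. This is an illustrative reconstruction, not the parser's actual rule set: the tag inventory is a small subset, and the ANY expression lists only the material assumed here to be allowed inside an NC. Left-to-right scanning makes the leftmost marker win, and the rule ordering prefers the longest reading at each position.

```python
import re

# A sketch of the four chunk rules, applied to a PoS-tag string.
ANY  = r'(?:ADV|ADJA|CARD|PTKA)'   # material assumed allowed inside an NC
NOUN = r'(?:NN|NE)'
RULES = [
    ('PC', rf'APPR (?:ART |PDAT )?(?:{ANY} )*{NOUN}\b'),  # preposition-introduced
    ('NC', rf'(?:ART|PDAT) (?:{ANY} )*{NOUN}\b'),         # determiner-introduced
    ('NC', rf'(?:ADJA|CARD) (?:{ANY} )*{NOUN}\b'),        # adjective/cardinal-introduced
    ('NC', rf'{NOUN}\b'),                                  # unmodified noun
]

def chunk(tags):
    """Greedy left-to-right chunking; the leftmost marker yields the longest match."""
    s, out, i = ' '.join(tags), [], 0
    while i < len(s):
        for label, pat in RULES:
            m = re.match(pat, s[i:])
            if m:
                out.append((label, m.group().split()))
                i += m.end() + 1
                break
        else:  # tag not covered by any chunk rule: pass it through
            tok = s[i:].split()[0]
            out.append((None, [tok]))
            i += len(tok) + 1
    return out

# Sentence 71: 'Mit drei unheimlich klobigen Koffern erschienen die großen Männer.'
tags = ['APPR', 'CARD', 'ADV', 'ADJA', 'NN', 'VVFIN', 'ART', 'ADJA', 'NN']
```

On sentence 71 the PC rule matches the whole span from 'Mit' to 'Koffern' in one step, which is precisely the advantage over a bottom-up approach described above.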

Since we simply use PoS tag information for the disambiguation, adverbs are only attached in cases in which their attachment is unambiguous, like 'unheimlich' in figure 5.22, in which the adverb appears between a marker of the beginning of a chunk (here: the preposition 'mit') and a marker of its end (here: the noun 'Koffern'). Otherwise, adverb chunks are attached to the topological field in which they appear. Adverb chunks typically consist of a single adverb unless they are modified by a particle like '(all)zu' (PTKA). If an adverb is modified by another adverb, attachment is high because there is


Figure 5.22: Illustration of the top-down annotation of chunks

typically no way to distinguish between adverb-adverb modification and two disjunct adverbials. All these facts hold for all adverbs regardless of whether they are tagged ADV or ADJD. The only difference is that ADVs receive the chunk label AVC and ADJDs receive the chunk label AJVC because they have the potential to be either adverbials or predicatives. Since AVCs and AJVCs are not complex, the distinction between top-down and bottom-up parsing does not apply. Center-embedded structures like the one in figure 5.23 are annotated mainly within the same steps as simple structures. Figure 5.24 illustrates the steps: At first, the center-embedded PC is annotated since we apply a top-down strategy. Because we do, in fact, allow PCs in NCs, the NC which includes the PC can be annotated after this. However, this works only if the PC appears after a marker of the beginning of an NC such as a preposition, determiner, or attributive adjective. The other NC, which is contained in the PC, is also annotated in this step although it is contained in the NC dominating (at that point) the PC. This is possible because the rules for NCs are applied in clauses, fields and PCs. The rule is, thus, applied twice to the whole string of PoS tags associated with the string in figure 5.24: Once for the string 'mit dieser geborenen Führungsspitze' and once for 'aus der Not'. In the next step, the included attributive adjective chunk can be annotated. Attributive adjective chunks are also defined as being able to include PCs. Thus, the adjective 'geborenen' and the preceding PC can be grouped as an AJAC. Since such structures are only annotated within the already existing NCs, wrong annotation is ruled out. Until this step, all the rules are the same as with ordinary top-down annotation.
Only then are two further rules applied which complete the structure: One rule annotates the attributive adjective as an AJAC, and another one renders possible the annotation of PCs which contain center-embedded PCs since, at the time of the annotation of PCs, this complex structure was not visible. It was, thus, possible to integrate center-embedded structures to a large extent into the top-down parsing approach of chunks without adding too many extra annotation rules. Further structures to be discussed are coordinated chunks: Since these are closer to the root node in the syntactic tree than simple chunks and since we follow a top-down parsing strategy for chunks, we annotate coordinated chunks before simple chunks. In principle, the same strategy can be applied as with simple chunks. The markers for chunks, on the one hand, and the head noun, on the other hand, delimit the simple chunks, and these are coordinated by, e.g., a conjunction. That way, the borders of the coordinated chunk can be annotated before the annotation of simple chunks takes place. Although this basically means that the same rule is applied twice to certain strings, this strategy is reasonable, as the coordinated PC in figure 5.25 shows: If coordination is treated after the annotation of simple chunks, then the full PC cannot be annotated since the PC beginning with the preposition 'zwischen' would end with the noun 'Idee', and the NC 'stumpfer Ausführung', which is coordinated with the NC dominated by the PC, would be left outside the PC. Coordination could, in fact, be treated better by a bottom-up strategy since, with a bottom-up strategy, this problem would be avoided completely. We have, however, decided to follow a top-down strategy because, on the whole, the advantages of the top-down strategy outweigh the disadvantages.

[PC mit [NC dieser [AJAC [PC aus [NC der Not]] [AJAC geborenen]] Führungsspitze]]
APPR PDAT APPR ART NN ADJA NN

mit dieser aus der Not geborenen Führungsspitze
with this by the necessity born top management
'with this top management which was just installed because necessity required it'

Figure 5.23: Center-embedded PC from TüPP-D/Z

(72) *[PC zwischen herrlicher Idee] und [NC stumpfer Ausführung]

Figure 5.24: Illustration of the annotation of center-embedded chunks

[REL [CF [NC die]] [MF [AJVC streng] [PC zwischen [NCC [NC [AJAC herrlicher] Idee] und [NC [AJAC stumpfer] [AJAC handwerklicher] Ausführung]]]] [VCRVF unterscheidet]]
PRELS ADJD APPR ADJA NN KON ADJA ADJA NN VVFIN

die streng zwischen herrlicher Idee und stumpfer Ausführung unterscheidet
which strictly between magnificent idea and dull realization differentiates
'which strictly differentiates between magnificent idea and dull realization'

Figure 5.25: Coordinated PC from TüPP-D/Z
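The effect of matching coordinated chunks before simple ones can be sketched as follows; the tag patterns are simplified assumptions for illustration, not the parser's full rule set. Because the coordinated reading (NCC) is tried before the simple NC, the PC from figure 5.25 extends over the whole coordination instead of ending at 'Idee'.

```python
import re

# Sketch: coordinated noun chunks (NCC) are matched before simple NCs
# inside a PC, so 'zwischen herrlicher Idee und stumpfer handwerklicher
# Ausführung' is covered as one chunk. Simplified, assumed patterns.
NC  = r'(?:ART |PDAT )?(?:ADJA )*(?:NN|NE)'
NCC = rf'{NC} KON {NC}'
PC  = re.compile(rf'APPR (?:{NCC}|{NC})\b')  # coordinated reading tried first

def match_pc(tags):
    """Return the PC span at the start of the tag sequence, or None."""
    m = PC.match(' '.join(tags))
    return m.group().split() if m else None
```

With the alternation ordered the other way round (simple NC first), the match would stop after the first noun, reproducing exactly the failure shown in example 72.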

Chapter 6

Annotating Complex Chunks

Abstract: Chapter 6 explains why we have to annotate structures which we will call complex chunks before we begin with the annotation of grammatical functions. Section 6.1 describes the problems which occur if complex chunks are not annotated and, thus, motivates their annotation. Section 6.2 defines appositional constructions and Named Entities, which are the main structures annotated as complex chunks. Section 6.3 describes the algorithm used to annotate appositional constructions, and section 6.4 does the same for Named Entities. Section 6.5 describes the annotation of other structures which can be subsumed under the label of complex chunks.

6.1 Task Description and Motivation

After the annotation of shallow structures, there exist clauses, topological fields and chunks which serve as the basis for the further annotation of grammatical functions. Chunks, which, in brief, can be described as non-recursive core phrases without post-modifying structures, are the target constituents for the annotation of grammatical functions described in part II. Sentences 73 and 74 give examples. In sentence 73, the verb sub-categorizes for a nominative object (ON), an accusative object (OA) and a prepositional object (OPP). Each of these is represented by a chunk. In sentence 74, the verb sub-categorizes for an ON and an OA. However, only the head chunk of the phrase 'den Kolben mit Wasser' is marked as the OA. The attachment of the chunk 'mit Wasser' is left unspecified. It may be resolved in further steps.

(73) [ON Der Chemiker] füllt [OA den Kolben] [OPP mit Wasser] an.
the chemist fills the flask with water in
'The chemist fills water into the flask.'

(74) [ON Der Chemiker] sieht [OA den Kolben] mit Wasser an.
the chemist looks the flask with water at
'The chemist looks at the flask with water.'

Since NCs are only modifiers in the case of the genitive, most NCs serve as grammatical functions and can be disambiguated by means of morphological case information and linear order. However, this leads to problems in the case of phrases with more than one NC, which mainly occur in appositional constructions and Named Entities (NEs), as sentences 75 and 76 show. In sentence 75, the verb 'bearbeiten' sub-categorizes for an ON and an OA. There are, however, four chunks, and they all have the morphological potential to be ON or OA. In this case, it would be highly desirable to know that 'Bankdirektor Schneider', on the one hand, and 'die Abteilung Kreditwesen', on the other hand, are each one chunk because then there would only be two potential targets for ON and OA.

(75) Er sagte, daß [NC Bankdirektor] [NC Schneider] [NC die Abteilung] [NC Kreditwesen] bearbeitet.
he said that bank manager Schneider the department credit systems handles
'He said that bank manager Schneider deals with the department of credit systems.'

(76) Ich glaube, daß [NC er] [NC Anfang] [NC April] [NC Urlaub] gemacht hat.
I think that he beginning April holiday made has
'I think that he has been on holiday at the beginning of April.'

In sentence 76, 'machen' also sub-categorizes for an ON and an OA, but there are also four chunks. The first is unambiguously ON, but the following three all have the morphological potential to be the OA. However, the last of the three is the OA, and the two chunks 'Anfang' and 'April' constitute a phrase which is an NE and very likely an adverbial because it is a temporal expression. It would, thus, also be highly desirable to be able to

deal with those NEs. The following section 6.2 defines appositional constructions and NEs, and sections 6.3 and 6.4 describe the methods used for their annotation.

6.2 Appositions and Named Entities

The appositional construction is generally seen as a (special) case of modification (cf. Quirk, Greenbaum, Leech, and Svartvik (1985), sec. 17.65; Engel (1996), sec. E 032; Eisenberg, Gelhaus, Wellmann, Henne, and Sitta (1998), sec. 1171; Biber, Johansson, Leech, Conrad, and Finegan (1999), p. 575; Eisenberg (1999), p. 249; Helbig and Buscha (1999), sec. 15.2.6). There are, however, also other sources which define appositional constructions without mentioning modification (cf. Bußmann, 2002; Crystal, 1997). This is due to the confusion noticed by Eisenberg (1999): "Es besteht keine Einigkeit darüber, was unter 'Apposition' zu verstehen ist, [...]"1. Linguists mostly agree that an appositional construction consists of two noun phrases which form one constituent. In this constituent, the two noun phrases correspond with each other syntactically and referentially (cf. Bußmann, 2002), i.e. they have the same grammatical function and denote the same object in extra-linguistic reality. This is reflected in the fact that one of the two constituents in an appositional construction can be left out without the sentence becoming ungrammatical. Linguists do not always agree, however, which noun phrase modifies the other one and which one is, consequently, the head of the constituent and which one the apposition (i.e. the modifier). This is due to the fact that modification is sometimes defined on the grounds of differing criteria, i.e. sometimes it is defined semantically and sometimes morphologically. Eisenberg (1999) goes as far as to say: "Das Verhältnis ist variabel bis hin zur Gleichberechtigung, die sich grammatisch als Übereinstimmung im Kasus geltend macht."2 This equality resembles the equality in coordination, in which no head is assigned in the TüBa-D/Z. Similarly, no heads are assigned in appositional constructions, and both parts of the appositional construction are equally assigned the label 'APP' in the TüBa-D/Z (cf. figure 6.1).

1Linguists do not agree on a common definition of ‘apposition’. (lit. There is no unity about what is to be understood by ‘apposition’.) 2The relation is variable up to an equality which is reflected grammatically by the agree- ment in case.

[NX [APP [NX Der [ADJX Berliner] Theaterregisseur]] [APP [EN-ADD Christoph Schlingensief]]]
ART ADJA NN NE NE

Figure 6.1: Example of a tightly constructed appositional construction

The nature of appositional constructions and their annotation in the TüBa-D/Z have two consequences for our parser. First, since it is not always generally agreed upon which of the two nouns is the head of the appositional construction, and since we propose to annotate that chunk as a grammatical function which contains the head, we need to annotate both parts of an appositional construction. Second, since the two NCs which are in the relation of an apposition may directly follow each other without any marker of apposition, just like two NCs which represent two disjunct grammatical functions, we need to find a way to annotate appositional constructions with the help of other features than the pure chunk information. It is not trivial to detect appositives and keep them apart from distinct phrases. There is yet another group of complex chunks which needs annotation: These are usually referred to as Named Entities (NEs). NEs comprise expressions denoting names (e.g. of persons, organizations or locations), temporal expressions (e.g. dates or the time) or numerical expressions (e.g. money or percentages) (cf. Chinchor and Robinson, 1998). For our purposes, those NEs are important which comprise more than one token and which deviate from the standard grammatical distribution because, if an NE fits into one of the standard grammatical distributions, it is annotated by our chunk component. Only if it does not is there a need for a special component which integrates tokens which form an NE constituent into one structure and, thus, prevents wrong annotation of grammatical functions. Another point is that temporal expressions cannot usually serve as those grammatical functions which we annotate, i.e. sub-categorizable grammatical functions (cf. section 7.2 for details). Thus, it is better to mark them such that they can be separated from other NCs.
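The grouping of a multi-token temporal NE such as 'Anfang April' can be sketched as follows. The two word lists are hypothetical stand-ins for the parser's actual NE patterns, which are richer than this sketch.

```python
# A minimal sketch of marking a multi-token temporal NE such as
# 'Anfang April' so it is kept apart from other NCs. The ANCHORS and
# MONTHS lists are assumed for illustration.
ANCHORS = {'Anfang', 'Mitte', 'Ende'}
MONTHS = {'Januar', 'Februar', 'März', 'April', 'Mai', 'Juni', 'Juli',
          'August', 'September', 'Oktober', 'November', 'Dezember'}

def mark_temporal(tokens):
    """Group anchor+month pairs into one NE-TEMP unit; pass the rest through."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in ANCHORS and i + 1 < len(tokens) and tokens[i + 1] in MONTHS:
            out.append(('NE-TEMP', tokens[i:i + 2]))
            i += 2
        else:
            out.append(('TOK', [tokens[i]]))
            i += 1
    return out
```

Applied to sentence 76, this leaves 'Urlaub' as the only remaining OA candidate, which is the disambiguation effect the text describes.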

6.3 Annotation of Appositional Constructions

There are different types of appositional constructions. Most of the grammars distinguish tightly constructed from loosely constructed appositional constructions. In tightly constructed appositional constructions, the noun phrases directly follow each other without a pause in speech and without a comma in print (cf. example 77). In loosely constructed appositional constructions, the two noun phrases are separated by a pause in speech and by a comma in print (cf. example 78). The second part of the appositional construction is even followed by a comma, which is very useful for annotation because it is a quite reliable sign of a loosely constructed appositional construction. Tightly constructed appositional constructions, however, strongly require a lexicon to render annotation possible. If two NCs directly follow each other, it is not clear whether they are an appositional construction without looking at the lexical heads of those NCs. One of the nouns, which is always a common noun, indicates whether the two chunks form an appositional construction. We call this noun the 'indicating noun'. In order to acquire those indicating nouns, we extract information about appositional constructions from the TüBa-D/Z, excluding that part of the TüBa-D/Z which we use for evaluation.

(77) Bundespräsident Köhler
Federal President Köhler
'Federal President Köhler'

(78) der Leiter, Erik Hohlmeier,
the head, Erik Hohlmeier,
'the head, Erik Hohlmeier,'

Since appositional constructions are annotated in the TüBa-D/Z (cf. figure 6.1), it is easy to extract them from the corpus. We divided them into different classes. The first and easiest class was the class in which the first part of the apposition was headed by a common noun (NN) and the second part was headed by a proper noun (NE). This class included all the title–name appositional constructions like the one in figure 6.1. The first NC contains the title (here: 'Theaterregisseur', director) and the second one the name (here: 'Christoph Schlingensief'). Then, we compiled a list of all the nouns in the first (i.e. indicating) part of the appositional construction.

While the extraction of the appositional constructions and the generation of the list are strictly formalized, the following steps rely on linguistic intuition. Since many of the nouns in the first part of the appositional construction are compounds like 'Theaterregisseur' in figure 6.1, it has to be decided whether the whole token should be taken for the lexical entry or just the determinatum (i.e. the last nominal element in the token). In the case of 'Theaterregisseur', the string which was extracted was '[Rr]egisseur'. During the annotation, our parser checks whether there is a chunk containing a noun ending in the string as specified in the lexicon. In this process, the inflection of the nouns is also taken into consideration. The extraction process yielded a lot of nouns which were expected to come up, like 'Abgeordnete' (delegate), 'Direktor', 'Minister' or 'Präsident', which are typical titles. But it also yielded a lot of unexpected nouns like 'Aufsteiger' (promoted team (in sports)), 'Fall' (case), 'Haltestelle' (stop) or 'Haus' (house, hall, company). This showed that it was worthwhile applying an automatic extraction algorithm to the corpus to acquire those nouns. Many of the nouns would not have been included in a list from, e.g., a grammar. The examples also show that there are many compounds. This was, in fact, one reason why we chose to apply a mixed formalized/intuitive approach.
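The suffix-based lookup described here can be sketched as follows. The four lexicon entries are a tiny illustrative sample of the 459-noun lexicon, and inflectional variants are ignored in this sketch (the actual parser takes inflection into consideration).

```python
# Sketch of the suffix-based lexicon lookup: the lexicon stores the
# determinatum (e.g. 'regisseur' for 'Theaterregisseur'), and nouns are
# matched by their ending, so unseen compounds like
# 'Ex-SPD-Bürgerschaftsabgeordnete' are still recognised.
INDICATING = {'regisseur', 'direktor', 'minister', 'abgeordnete'}  # assumed sample

def is_indicating(noun):
    n = noun.lower()
    return any(n.endswith(entry) for entry in INDICATING)

def tight_apposition(nc1, nc2):
    """(token, tag) pairs: common noun + proper noun, first noun indicating."""
    (w1, t1), (w2, t2) = nc1, nc2
    return t1 == 'NN' and t2 == 'NE' and is_indicating(w1)
```

The suffix match is what gives the generalization effect described below: a compound never seen in the training material still triggers the rule as long as its determinatum is in the lexicon.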

(79) mit seinem möglichen Zweitliga-Aufsteiger Alemannia
with his potential second-division promoted team Alemannia
'with his team Alemannia, a potential promoted team to the second division'

(80) der Fall Jorgic
the case Jorgic
'the Jorgic case'

(81) an der Endhaltestelle Sebaldsbrück
at the end stop Sebaldsbrück
'at the terminal stop Sebaldsbrück'

(82) das Concerthaus Brandenburg
the concert hall Brandenburg
'the Brandenburg concert hall'

The most important point about the mixed formalized/intuitive approach is that it allows us to abstract from single instances. If the first noun in an appositional construction in question is, e.g., 'Ex-SPD-Bürgerschaftsabgeordnete', then we can intellectually deduce that it is the noun 'Abgeordnete' which is the indicating noun for the appositional construction. If we had just extracted all the nouns, we would have yielded a much larger and more unreliable lexicon since, in some cases, the actual indicating noun for the appositional construction (here: 'Abgeordnete') might not even be included in the corpus. It would also not suffice just to use the morphological information about the compound structure because it is not always clear which part of the compound to take. In the noun 'Sport|rechte|makler' (sports|license|trader; i.e. trader of (TV) licenses for sport events), it is the last element 'Makler' which is included in the lexicon; but, in the noun 'End|halte|stelle' (terminal|stop|place; i.e. terminal stop), it is the last two elements 'Haltestelle' which are included in the lexicon since the last element 'Stelle' (place) is far too general. Its inclusion into the lexicon would lead to the annotation of false positives. The advantage of the strategy chosen in our approach is that, by looking at a section of the corpus, it allows us, to a certain extent, to generalize over the corpus because we can annotate structures which are indicated by nouns we have never encountered before. Thus, the inclusion of the noun 'Vorstandssprecherin' (executive board spokeswoman) in our parser allows the annotation of an appositional construction with the noun 'Menschenrechtssprecher' (human rights spokesman), which was not included in the training material. The algorithm for the extraction yielded a list of 459 nouns after intellectual processing. The intellectual processing took about two hours. In the cases above, the appositional construction consisted of a common noun followed by a proper noun. Since the indicating noun has to be a common noun, the case was clear.
This is different in appositional constructions in which two common nouns may occur. In these appositional constructions, the first noun may be either a common noun or a proper noun, and the second noun is always a common noun. The indicating noun is either the only common noun in the construction (if the other is a proper noun) or any of the two common nouns. Since, in the TüBa-D/Z, both constituents of an appositional construction are equal (i.e. neither is annotated as the head), both cases of indication are matched by the same corpus query. Even after the extraction it is, thus, not clear which noun is the indicating noun. Only if one noun occurs more than once in either the first or the second position is it a hint that this noun is an indicating noun for either the first or the

second position. Still, this finding has to be thoroughly checked by linguistic intuition and by considering the corpus samples.

(83) der Aktionsplan Alkohol
the action plan alcohol
'the action plan alcohol'

(84) die Abteilung Rechnungswesen
the department accounting
'the department of accounting'

(85) das immer gleiche Thema Liebe
the always alike topic love
'the never changing topic of love'

(86) der hochgefährliche, krebserregende Baustoff Asbest
the highly dangerous, carcinogenic building material asbestos
'the highly dangerous and carcinogenic building material asbestos'

The cases in which the second noun in an appositional construction is not a proper noun are not treated as central in grammars of German. Generally, the supposedly more typical combinations with common nouns as the first noun and proper nouns as the second are the focus of attention. For instance, aside from time and measure expressions like 'das Jahr 2000' (the year 2000) and 'ein Glas Milch' (a glass of milk), Engel (1996), sec. N 140, does not mention any appositional constructions with common nouns as second element.3 Only Eisenberg, Gelhaus, Wellmann, Henne, and Sitta (1998), p. 675, allow them as one of two cases of appositional constructions: "Auf ein beliebiges Substantiv kann ein zweites folgen, welches das erste näher bestimmt (determiniert)."4 We do not exactly follow this approach. In fact, our findings are that at least one of the nouns in tightly constructed appositional constructions has to be an indicating noun, although this does not change anything as regards the determination of the first noun by the second one as defined in Eisenberg et al. (1998), p. 675.5

3Engel calls the tightly constructed apposition ‘nomen invarians’; but the linguistic phenomenon and its subclassifications are identical with the ones other linguists list under the heading of appositional constructions. 4Any noun can be followed by a second one which further determines the first. 5Since determination is not annotated in T¨uBa-D/Z, we do not argue about this point.

Examples 83–86 show that the choice of the nouns which come first in the appositional construction is by no means free. In fact, it is the second noun which is free. Virtually anything can be thought of as an 'action plan' (83) or 'topic' (85). The other two examples are only limited to a certain semantic field. If we try, however, to put any of the second words first, then the second noun is restricted. This is precisely what we mean by indication of appositional constructions. There is always one noun which is restricted, and that is the one which has to be found. By taking into consideration which nouns occur more than once as indicators, and by linguistic intuition and corpus checking, we arrive at a list of about 80 indicating nouns which are the first noun in appositional constructions. Examples 83–86 show, again, that indicating nouns may be compounds, with the implications which were described for common noun–proper noun compounds.

(87) der Aufschwung Ost
the boom east
'the boom in the east'

(88) das Rathaus Mitte
     the town hall center
     'the central town hall'

(89) der Nomad Verlag
     the Nomad publishing house
     'the Nomad publishing house'

(90) die Werbepalette GmbH
     the Werbepalette PLC
     'Werbepalette PLC'

With respect to the determination as defined in Eisenberg et al. (1998), p. 675, appositional constructions in which the second noun is the indicating noun do not differ from those in which the first one is the indicating noun; it is simply the second noun which is restricted in them. We applied the same algorithm as with constructions with a restricted first noun. The construction occurs quite rarely and we only acquired a list of six words ('AG', 'Farm', 'GmbH', 'Mitte', 'Ost', 'Verlag'). The words roughly fall into two classes: words which describe the organizational form, mainly of companies, and words which describe the location, mainly of an institution or company. We added some other organizational forms of companies and the missing directions. In appositional constructions like this, the first noun may also be a proper noun: examples 87 and 88 show constructions with common nouns, and examples 89 and 90 show constructions with proper nouns.

Tightly constructed appositional constructions are recognized using lexical information acquired from the corpus as described above. Loosely constructed appositional constructions are recognized using their unique distributional information: since the second element of a loosely constructed appositional construction almost exclusively follows the first one enclosed in commas, it is possible to annotate it without using any lexical information. Proper names may be the first or the second element of the loosely constructed appositional construction, as can be seen in examples 91 and 92.

(91) Havemann, Elektriker und Autor,
     Havemann, electrician and writer,
     'Havemann, an electrician and writer'

(92) Radunskis Sprecher, Axel Wallrabenstein,
     Radunski's speaker, Axel Wallrabenstein,
     'Radunski's speaker, Axel Wallrabenstein,'
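The lexical test for tightly constructed appositions can be sketched in a few lines. The following is an illustrative sketch, not the thesis implementation: the list names are assumptions, and the first-position entries shown are only a sample (the actual list has about 80 entries), while the second-position entries are the six acquired words quoted above.

```python
# Hedged sketch of the lexical test for tightly constructed appositional
# constructions: at least one of the two nouns must be an indicating noun
# in its position. FIRST_POSITION holds sample entries only; SECOND_POSITION
# is the six-word list acquired from the corpus.

FIRST_POSITION = {"Aufschwung", "Rathaus", "Thema", "Aktionsplan"}
SECOND_POSITION = {"AG", "Farm", "GmbH", "Mitte", "Ost", "Verlag"}

def is_tight_apposition(first: str, second: str) -> bool:
    # Indicating nouns may be compounds, so a suffix match on the first
    # noun covers compound heads (cf. the compound discussion above).
    return (any(first.endswith(noun) for noun in FIRST_POSITION)
            or second in SECOND_POSITION)

print(is_tight_apposition("Aufschwung", "Ost"))     # True (example 87)
print(is_tight_apposition("Werbepalette", "GmbH"))  # True (example 90)
print(is_tight_apposition("Haus", "Garten"))        # False
```

Loosely constructed appositions need no such lexicon: as stated above, the comma-enclosed position of the second element is distributional evidence enough.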

6.4 Annotation of Named Entities

The annotation of NEs is important for two reasons. First, some NEs are typically not able to act as grammatical functions. Thus, if these NEs were treated just like other NCs, this would often lead to wrong annotation decisions. This can be illustrated by sentence 93, in which the verb 'auflösen' sub-categorizes for an ON and an OA. There are, however, two NCs which have the morphological potential to be the OA. Because the chunk '1982' is very likely a temporal expression, it is less likely an OA and a lot more likely an adverbial. For that reason, our NE recognition component marks this chunk as 'temporal expression' and, thus, our parser first tries to assign the grammatical function 'OA' to another chunk, which, in this case, is 'seine Band Japan', the correct chunk.

(93) seit [ON er] [NC−TEMP 1982] [OA seine Band Japan] auflöste
     since he 1982 his band Japan dissolved
     'since he dissolved his band Japan in 1982'

(94) [ON der] insgesamt [NC−QUANT 16 Jahre lang] [PRED Bremer Theaterchef] war
     who all in all 16 years long Bremian theater boss was
     'who has been the Bremen theater director for 16 years'

The second problem with NEs is that they might be complex and, consequently, one phrase can appear as more than one chunk in the pre-annotated corpus. This might also lead to the wrong annotation of grammatical functions. For instance, in sentence 94, the NE '16 Jahre lang' has a word order which deviates from the prevailing word order of a chunk, since chunks can normally not be post-modified by an adjective. The NE recognition component annotates cases like this in which post-modification is, in fact, possible. This excludes 'lang' from being annotated as a predicative, since the verb 'sein' (here realized as 'war') may take either an adjective or a noun as a predicative. Since the adjective 'lang' is incorporated into the NE, only the NC 'Bremer Theaterchef' remains as the predicative, which is the correct solution. The annotation of NEs is lexicon-based like the annotation of appositional constructions. Unlike with the appositional constructions, however, the lexicon was not acquired through a corpus search but through the analysis of annotation errors of the grammatical function component. From the NEs which were found, we abstracted a list of words which together compose NEs. This includes constructions like 'Anfang April', 'Dezember 1992' or 'fast 80 Meter lang'. In contrast to the appositional construction component, the NE component is applied before the general chunking takes place. This is due to the fact that NEs do not adhere to the prevailing word order in chunks, and annotating them with or after the chunks would, thus, not be possible.
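Marking such NEs in the token stream before chunking can be sketched as follows. This is an illustrative sketch under assumptions, not the actual component: the function name and the pattern classes are simplified stand-ins for the lexicon abstracted from annotation errors.

```python
# Illustrative sketch (assumed names, not the actual component) of marking
# simple temporal NEs in the token stream before chunking, covering pattern
# classes mentioned above: 'Anfang April', 'Dezember 1992' and bare year
# numbers as in example 93.
import re

MONTHS = {"Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
          "August", "September", "Oktober", "November", "Dezember"}
BOUNDARY = {"Anfang", "Mitte", "Ende"}
YEAR = re.compile(r"1\d{3}|20\d{2}")

def mark_temporal(tokens):
    """Return (label, (start, end)) spans for simple temporal NEs."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i] in BOUNDARY and i + 1 < len(tokens) and tokens[i + 1] in MONTHS:
            spans.append(("NC-TEMP", (i, i + 2))); i += 2   # 'Anfang April'
        elif tokens[i] in MONTHS and i + 1 < len(tokens) and YEAR.fullmatch(tokens[i + 1]):
            spans.append(("NC-TEMP", (i, i + 2))); i += 2   # 'Dezember 1992'
        elif YEAR.fullmatch(tokens[i]):
            spans.append(("NC-TEMP", (i, i + 1))); i += 1   # bare year ('1982')
        else:
            i += 1
    return spans

print(mark_temporal(["seit", "er", "1982", "seine", "Band", "Japan", "auflöste"]))
# [('NC-TEMP', (2, 3))]
```

Because these spans are marked before chunking, the later grammatical function component can skip them as candidates for functions like OA, as described for sentence 93.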

6.5 Annotation of Other Complex Chunks

There are still some other complex chunk structures which were annotated. As regards the choice of these complex chunks, there were two main criteria: First, the annotation had to be carried out very reliably. This would, e.g., not be the case with post-modifying prepositional phrases. Second, not annotating the complex chunks would lead to errors in the grammatical function annotation component. If not annotating a complex chunk would simply mean leaving a structure underspecified, we took that choice rather than taking the risk of wrong annotation. The annotated complex chunk structures subsume some quite frequent, generally describable distributions and also some less frequent and specific structures. Among the former are structures in which a noun chunk is followed by another NC in brackets (cf. examples 96 and 97). Among the latter are structures with 'darunter', a word which can act as an independent prepositional phrase but which can also form a prepositional phrase which pre-modifies a noun phrase (cf. example 95). What is most important for our parser in this case is that an NC with 'darunter' pre-modification is a modifier of another NC. Thus, this noun chunk can never itself be a grammatical function. Sentence 95 gives an example: The verb 'ermöglichen' has a sub-categorization frame which demands 'ON', 'OD' and 'OA'. Since the chunk parser has annotated '7000 Tierarten, darunter 5000 Insekten,' as one chunk, the grammatical function annotation component correctly annotates the respective functions as shown in sentence 95. Without the special treatment of 'darunter', '5000 Insekten' forms one independent chunk, which is then incorrectly annotated as the 'OA' of the sentence. Examples 96 and 97 show chunks with following chunks in brackets. The reason for annotating these is that they follow a fixed scheme which can be reliably annotated. Additionally, not annotating those structures would result in more chunks which could obstruct correct annotation of grammatical functions. The structure within these chunks is left underspecified. In some cases, the chunk in brackets is a modifier of the preceding chunk, as in sentence 96; in other cases, the following chunk forms an appositional construction with the preceding one, as in sentence 97, in which case TüBa-D/Z sees the chunks as equal.

(95) [ON die] [OD 7000 Tierarten, darunter 5000 Insekten,] [OA das Leben] ermöglicht
     which 7000 animal species, among them 5000 insects, the life allows
     'which allows living for 7000 species, among them 5000 (species of) insects'

(96) Außerdem lobte Ludewig Verkehrsminister Franz Müntefering (SPD).
     Apart from that complimented Ludewig traffic minister Franz Müntefering (SPD).
     'Apart from that, Ludewig complimented the minister of transport Franz Müntefering.'

(97) Der Fotograf Dick Avery (Fred Astaire) nimmt die intellektuelle Buchhändlerin Jo Stockton (Audrey Hepburn) für Modeaufnahmen als Modell mit nach Paris.
     The photographer Dick Avery (Fred Astaire) takes the intellectual bookseller Jo Stockton (Audrey Hepburn) for fashion photos as model with to Paris.
     'The photographer Dick Avery (Fred Astaire) takes the intellectual bookseller Jo Stockton (Audrey Hepburn) for fashion photos as a model with him to Paris.'
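The two merge patterns above, the bracketed NC and the comma-enclosed 'darunter' phrase, can be sketched over a sequence of chunks. This is a simplified sketch with assumed data shapes, not the actual KaRoPars rules.

```python
# A simplified sketch of the two complex-chunk merges described above: an NC
# followed by a bracketed NC, and an NC followed by a comma-enclosed
# 'darunter' phrase, are joined into one chunk whose internal structure
# stays underspecified.

def merge_complex_chunks(chunks):
    """chunks: list of (label, text) pairs in surface order."""
    out, i = [], 0
    while i < len(chunks):
        label, text = chunks[i]
        nxt = chunks[i + 1] if i + 1 < len(chunks) else None
        if label == "NC" and nxt and nxt[0] == "NC" and nxt[1].startswith("("):
            out.append(("NC", text + " " + nxt[1])); i += 2   # NC (NC), cf. 96/97
        elif label == "NC" and nxt and nxt[1].startswith(", darunter"):
            # The darunter phrase modifies the preceding NC and can never
            # itself be a grammatical function (cf. example 95).
            out.append(("NC", text + " " + nxt[1])); i += 2
        else:
            out.append(chunks[i]); i += 1
    return out

merged = merge_complex_chunks([("NC", "7000 Tierarten"),
                               ("NC", ", darunter 5000 Insekten,")])
print(merged)  # [('NC', '7000 Tierarten , darunter 5000 Insekten,')]
```

After the merge, the grammatical function component sees only one candidate chunk, so '5000 Insekten' can no longer be mistaken for the OA.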

Another case for the annotation of complex chunks are parentheses, for instance the parenthesis, very common in newspaper language, which gives the source of a reproduced statement, as in sentence 98. This parenthesis is always introduced by 'so'. In the TüBa-D/Z, it is annotated as material outside the actual sentence. For our parser, it is important that it is marked as a phrase which cannot serve as a grammatical function, for the same reason as with the other complex chunks.

(98) Kultursenator Peter Radunski (CDU) habe, so Zeitungsberichte, den Leipziger Intendanten als Nachfolger von Götz Friedrich festgelegt.
     Culture minister Peter Radunski (CDU) has, according to newspaper reports, the Leipzigian artistic director as successor of Götz Friedrich appointed.
     'According to newspaper reports, the minister of arts Peter Radunski (CDU) has appointed the Leipzig artistic director as a successor to Götz Friedrich.'


Part II

Annotating Grammatical Functions


Chapter 7

Annotating Deeper Structures – Grammatical Functions

Abstract: Chapter 7 introduces the linguistic phenomenon of grammatical functions (GFs) and gives the motivation for their annotation. Furthermore, it puts the annotation of GFs into the context of the whole parser and introduces the linguistic knowledge sources and methods which are used for the annotation. Section 7.1 gives an overview of the task description and explains why morphological information and sub-categorization frames are crucial for mastering this task. Section 7.2 gives a definition of GFs and puts them into context with shallow annotation structures. Section 7.3 gives an account of which structures are precisely annotated as GFs and in which cases there are problems in the definition of a GF. Section 7.4 explains why we chose GFs as a target for our parser. Section 7.5 points out how GFs are annotated on top of the shallow parsing structures and gives an outlook on the methods which are used for the annotation of GFs.

7.1 Outline of the Task Description

GFs are annotated on top of the shallow annotation structure of chunks, topological fields and clauses (cf. Müller, 2004). Since GFs and shallow annotation are on different levels of syntactic description, they require different sources of linguistic knowledge to be annotated. Thus, they are annotated by two different components. Figure 7.1 illustrates the input of the GF annotation component, which is the shallow annotation structure, and figure 7.2 shows a syntactic tree from the TüBa-D/Z which contains the GFs which our parser recognizes. Table 7.1 gives an overview of the GFs annotated by our parser.

ON    nominative object
OG    genitive object
OD    dative object
OA    accusative object
OPP   prepositional object
OS    sentential object
PRED  predicative

Table 7.1: Grammatical functions annotated by our parser

Shallow structures are subject to syntactic restrictions and, consequently, it is possible to annotate them using just PoS tag information. By contrast, the annotation of GFs also requires information about the morphological features of the potential targets and about the sub-categorization frame of the verb (or other constituent) to which the GFs relate. Hence, for the annotation of GFs, one needs a component which assigns the correct morphological features to the respective chunks and a resource which contains the sub-categorization frames of the respective verb. The reason why the annotation of grammatical functions needs morphological features and sub-categorization information can best be illustrated by figure 7.1, which is a tree from the TüPP-D/Z in which only the shallow structure is annotated. In the syntactic tree, the shallow syntactic structure consisting of chunks, topological fields and the clause is annotated. This is possible using syntactic restrictions like the one that, in the sequence 'VCLAF – any constituents (except those occurring in the sentence bracket) – VCRVP', the sequence of any tags is the MF. After the annotation of the fields, it is clear where the boundaries of the V2 clause are (in the case at hand, the whole sentence). Furthermore, chunks can be recognized using restrictions like the one that the sequence 'article, adverb, attributive adjective and noun' is an NC (cf. 'eine etwa 45minütige Debatte'). No morphological or lexical information is necessary to annotate this shallow structure.

By contrast, it is not possible to assign the GFs in the sentence in figure 7.1 without information about the morphological features of the three chunks and the sub-categorization frame of the verb 'vorausgehen'. The sub-categorization frame of a verb contains the information about how many and what kind of GFs the verb demands, i.e. sub-categorizes for. A verb may have more than one sub-categorization frame. In the case at hand, the verb 'vorausgehen' has two sub-categorization frames: one demands just a subject, the other a subject and a dative object. If the sub-categorization frame is not taken into account and just non-lexical syntactic restrictions are used, like 'assign the feature subject to the first NC, the feature dative object to the second NC and the feature accusative object to the third one', it is not possible to correctly assign GFs. In the sentence in figure 7.1, the NC 'der Vollversammlung' is a genitive attribute to the NC 'eine 45minütige Debatte'. Together, the two NCs form the subject, the noun 'Debatte' being its head. The NC 'der Erklärung' is the dative object. Thus, the non-lexical syntactic restrictions would have yielded a wrong result. With the information about the sub-categorization frame of the verb, the realization of potential GFs in figure 7.1 is already confined, but the sub-categorization information alone does not suffice, since there are still three NCs and it is not clear which one has which function. If one has a look at the morphological features, it becomes clear that the three NCs are locally ambiguous: regarded outside any context, the NCs 'der Erklärung' and 'der Vollversammlung' may be either genitive or dative, and the NC 'eine etwa 45minütige Debatte' may be either nominative or accusative. However, if one combines the shallow annotation structure with the sub-categorization information and the morphological information, it is possible to disambiguate the GFs in the sentence. Since there is just one potential subject, the decision is clear in this case.
As regards the dative object, the choice is the NC in the VF because this NC cannot be a genitive modifier, since there is no other NC to be modified. Thus, with the combination of the three information sources, this problem can be completely solved. In other cases, the number of possible solutions can be reduced considerably.

The example illustrates two important points: on the one hand, it affirms the value of shallow annotation structures for the annotation of deeper structures like GFs; on the other hand, it shows the importance of the morphological and sub-categorization information for the annotation of grammatical functions.

[tree diagram not reproduced]

Der Erklärung war eine etwa 45minütige Debatte der Vollversammlung vorausgegangen.
The declaration had an approximately 45-minute debate of the plenum preceded.
'A talk of approximately 45 minutes had preceded the declaration.'

Figure 7.1: Sentence from TüPP-D/Z with shallow annotation

[tree diagram not reproduced]

Figure 7.2: Sentence from TüBa-D/Z with full syntactic annotation

Figure 7.3 shows how these data are related to the GF annotation component, which is applied after the annotation of shallow structures by the KaRoPars parser. In it, the morphological information, as provided by the morphological annotation tool DMOR (Schiller, 1995), is added to each token. A component using this morphological information and the shallow annotation structure then disambiguates the morphology of each chunk as far as this is possible. In the output of this process, the relevant tokens are annotated with sub-categorization information from the electronic lexicon IMSLex (Eckle-Kohler, 1999), which we use as a data source for this purpose. After this, GFs can be resolved using the previously annotated information in a cascade of FSAs. The output of the GF annotation component may then be used for further annotation processes.
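The interplay of the three knowledge sources can be illustrated with the sentence of figure 7.1. The following is a hedged sketch, not the FSA cascade itself: the frame notation, names and case sets are simplified illustrations of the DMOR and IMSLex annotations discussed above.

```python
# Hedged sketch of combining locally ambiguous case information with
# sub-categorization frames (the actual component is a cascade of FSAs;
# data and names here are illustrative).
from itertools import permutations

# Locally ambiguous case sets of the three NCs in figure 7.1:
CHUNKS = {
    "der Erklärung":                {"gen", "dat"},
    "eine etwa 45minütige Debatte": {"nom", "acc"},
    "der Vollversammlung":          {"gen", "dat"},
}
# 'vorausgehen' has two frames: subject only, or subject plus dative object.
FRAMES = {"vorausgehen": [("ON",), ("ON", "OD")]}
CASE_OF = {"ON": "nom", "OG": "gen", "OD": "dat", "OA": "acc"}

def assignments(verb):
    """All mappings of a frame's GFs onto case-compatible, distinct chunks."""
    result = []
    for frame in FRAMES[verb]:
        for combo in permutations(CHUNKS, len(frame)):
            if all(CASE_OF[gf] in CHUNKS[nc] for gf, nc in zip(frame, combo)):
                result.append(dict(zip(frame, combo)))
    return result

for a in assignments("vorausgehen"):
    print(a)
```

Every surviving assignment makes 'eine etwa 45minütige Debatte' the ON, so the subject is already determined; the remaining choice between the two dative candidates is resolved by the genitive-attribute reasoning described above.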

7.2 Definition of Grammatical Functions

GFs can best be described and defined with respect to grammatical categories. While grammatical categories describe constituents independent of their use and in isolation, grammatical functions describe a constituent relative to other constituents in the sentence or relative to the sentence itself; and while grammatical categories mainly describe constituents on the basis of their inherent features, GFs describe constituents on the basis of their use (i.e. function). Thus, relation and function are the crucial concepts in the description of GFs. This is also reflected in the different terms for GFs.

Figure 7.3: Grammatical functions annotation component

In Bußmann (1983), 'syntactic functions', 'grammatical functions', 'syntactic relations' and 'grammatical relations' are listed as terms which are used for the same concept according to different theoretical approaches. We use 'grammatical functions' in this thesis because we explain this phenomenon on both syntactic and morphological grounds (hence 'grammatical') and because we focus on the function in the sentence rather than simply on the relation to the verb. Like Bußmann (1983), however, we do not postulate a fundamental difference between the four terms. The importance of the two concepts relation and function and their interdependence is also reflected in Eisenberg (1998):

Syntaktische Begriffe wie Subjekt und Prädikat sind relational. Sie kennzeichnen eine Konstituente nicht für sich und unabhängig von der Umgebung, sondern sie kennzeichnen, welche Funktion eine Konstituente in einer größeren Einheit bei einer bestimmten syntaktischen Struktur hat. Sie wird damit in Beziehung zu anderen Konstituenten gesetzt und diese Beziehungen oder Relationen werden als Subjekt-Beziehung, Prädikat-Beziehung usw. bezeichnet. (Eisenberg, 1998, p. 21)

Syntactic concepts like subject and predicate are relational. They characterize a constituent not for itself and independent of its environment, but they denote which function a constituent fulfills in a larger unit with a certain syntactic structure. The constituent is, thus, correlated with other constituents and those relations are referred to as subject relations, predicate relations and so forth. (our translation and our emphasis)

(99) [V2 [VF [NC Ein Fluch]] [VCLVF lastet] [MF [PC auf [NC dem Turm]]]].
     An evil spell has been cast on the tower.
     'An evil spell has been cast on the tower.'

(100) [V2 [VF [NC Ein Paar]] [VCLVF spaziert] [MF [PC auf [NC dem Turm]]]].
     A couple strolls on the tower.
     'A couple is strolling on the tower.'

Sentences 99 and 100 further illustrate the crucial difference between the constituent structure of a sentence and the GFs. Sentences 99 and 100 both have an NP in initial position in the VF, and they both have a PP which is headed by the preposition 'auf' in the MF.[1] Thus, the constituent structure of the two sentences, i.e. the distribution of grammatical categories, is identical. The GFs occurring in the sentences differ, however. In sentence 99, there is a nominative object in the VF and a prepositional object in the MF, while, in sentence 100, there is a nominative object in the VF and an adverbial PP in the MF. The reason for this difference is the different sub-categorization frames of the two verbs: while 'spazieren' just sub-categorizes for a nominative object, 'lasten' sub-categorizes for a nominative object and a prepositional object with the preposition 'auf' and an NP with dative case. The difference between the two sentences, which are identical in constituent structure, is caused by the fact that GFs are subject to lexical selection while the constituent order is subject to syntactic restrictions. Although grammatical categories and GFs are by no means identical, they are still not totally independent of each other. An NC can, e.g., never function as a prepositional object. However, it is important to point out that "lexical items sub-categorize for function, not constituent structure categories" (Bresnan, 1982, p. 288). GFs can be sub-divided into "sub-categorizable" and "non-sub-categorizable" GFs (Bresnan, 1982, p. 287). Sub-categorizable GFs subsume complements of verbs or adjectives, and non-sub-categorizable GFs subsume adjuncts like attributes or adverbials. Figures 7.4, 7.5 and 7.6 from the TüBa-D/Z illustrate this distinction with the aid of PPs and their different usages as complement (figure 7.4), as attribute (figure 7.5) and as adverbial (figure 7.6).
In the sentence in figure 7.4, the PP 'an einem Zehenbruch' serves as a complement to the verb 'laborieren'; in the sentence in figure 7.5, the PP 'in Grohn' serves as an attribute to the phrase 'Privatuni'; and in the sentence in figure 7.6, the PP 'in Bremen' serves as an adverbial to the verb 'veranstalten'. We will only annotate sub-categorizable GFs, and we will henceforth mean sub-categorizable GFs when we speak of GFs. We rank predicatives (cf. sentences 101–105) among the GFs (cf. Eisenberg, 1999, p. 87–90, for a discussion). The constituents which function as the GFs for which a lexical item sub-categorizes have to possess certain features. If an NP functions as a GF, case is the decisive feature as regards which lexical items the GF can be sub-categorized by. The connection between the morphological features of an NP and its GF is so close that the GF is termed according to its case as nominative object, genitive object etc. instead of 'subject' or '(in)direct object' (cf. Reis, 1982). Only NPs in the nominative case can have two functions: ON and PRED. If the GF is realized as a PP, the decisive features are the preposition and the case of the NP governed by the preposition. If it is realized by a clause, the decisive feature is by which nominal GF the clause could be replaced: if it can be replaced by an ON or a PRED, it is annotated as ON or PRED; if it can be replaced by any other nominal GF, it is annotated as OS. For adverb phrases, adjective phrases and PPs functioning as predicatives, occurrence with a copular verb and distribution are the decisive features (cf. especially the difference between sentences 101 and 103).[2] The precise definitions, in cases of doubt, are taken from the TüBa-D/Z stylebook (Telljohann, Hinrichs, and Kübler, 2003).

[1] The notation in this sentence is taken from the KaRoPars parser.

[tree diagram not reproduced]

Kapitän Eilts laboriert an einem Zehenbruch.
Captain Eilts labours at a toe fracture.
'Captain Eilts is plagued by a broken toe.'

Figure 7.4: PP as prepositional object

[tree diagram not reproduced]

Privatuni in Grohn wird finanziert
Private university in Grohn is funded
'Private university in Grohn is (going to be) funded'

Figure 7.5: PP in attributive function

[tree diagram not reproduced]

Behinderte Menschen veranstalteten Protesttag in Bremen
Disabled people organize protest day in Bremen
'Disabled people to organize day of protest in Bremen'

Figure 7.6: PP in adverbial function

[2] We do not mark the relevant case features in the glosses if they are marked in the original sentence by their GF label, e.g. 'ON', here.

(101) [ON Alfred] ist jetzt [PRED da].
      Alfred is now here.
      'Alfred is in, now.'

(102) [ON Niemand] ist heute [PRED krank].
      Nobody is today ill.
      'Nobody is ill, today.'

(103) [ON Alfred] ist jetzt da [PRED Lehrer].
      Alfred is now there teacher.
      'Alfred is a teacher there, now.'

(104) [ON Seine Mutter] ist [PRED in der Stadt].
      His mother is in the city.
      'His mother is in the city.'

(105) [ON Das Problem] ist, [PRED dass niemand zuhört].
      The problem is that nobody listens.
      'The problem is that nobody listens.'
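The realization conditions above can be summarized as a mapping from a constituent's category and (possibly ambiguous) case set to its candidate GFs. This is an illustrative summary sketch with assumed names, not a component of the parser.

```python
# Illustrative mapping (assumed names) from category and case features to
# candidate GFs, following the definitions above: case determines the
# nominal object label, nominative NPs may also be PRED, PPs may be OPP or
# PRED, clauses ON/OS/PRED, and adverb/adjective chunks only PRED.

NOMINAL = {"nom": ["ON", "PRED"], "gen": ["OG"], "dat": ["OD"], "acc": ["OA"]}

def candidate_gfs(category, cases=()):
    if category == "NC":
        # Union over a locally ambiguous case set, e.g. {'gen', 'dat'}.
        return sorted({gf for case in cases for gf in NOMINAL[case]})
    if category == "PC":
        return ["OPP", "PRED"]
    if category == "clause":
        return ["ON", "OS", "PRED"]
    if category in ("ADVX", "ADJX"):
        return ["PRED"]
    return []

print(candidate_gfs("NC", {"gen", "dat"}))  # ['OD', 'OG']
print(candidate_gfs("NC", {"nom"}))         # ['ON', 'PRED']
print(candidate_gfs("ADJX"))                # ['PRED']
```

Which of the candidates is actually assigned then depends on the sub-categorization frame, as described in section 7.1.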

7.3 Detailed Task Description and Problems

Figure 7.7 shows a simple sentence from TüBa-D/Z. The sentence contains three chunks (VXFIN and twice NCX), the two topological fields LK (Linke Klammer) and MF, and the two GFs ON and OA. While the annotation of the chunk and field structure was the task of KaRoPars, it is the task of the GF annotation component to annotate the functions as they are illustrated in figure 7.7.

[tree diagram not reproduced]

Veruntreute die AWO Spendengeld?
Embezzled the AWO donation money?
'Did the AWO embezzle donated money?'

Figure 7.7: Sentence from TüBa-D/Z

We annotate GFs (cf. table 7.1) according to the definitions of the TüBa-D/Z, which we also use as a gold standard for evaluation. Some GFs can be realized by more than one constituent structure category, and some constituent structure categories can function as more than one GF: NCs can function as ONs, OGs, ODs, OAs and PREDs; PCs can function as OPPs and PREDs; clauses can function as ONs, OSs and PREDs; adjective chunks and adverbial chunks can only function as PREDs. The fact that we took over the definitions of the TüBa-D/Z leads to problems in some cases, since we use the IMSLex for the sub-categorization information for annotation, and the definitions of TüBa-D/Z and IMSLex differ with respect to the precise definition of GFs. As regards the GFs which are defined on the basis of their case, most can be clearly defined, the exception being OD: while there is a clear-cut line between nominative, genitive and accusative phrases in their function as sub-categorizable GFs and in their function as non-sub-categorizable GFs, there is no such line in the case of dative phrases. Roughly two functions can be made out for dative phrases: clear dative objects and so-called free datives. Free datives can be further divided into three groups (cf. Hentschel and Weydt, 1994, p. 159–163), one of which has no clear-cut boundary with dative objects. This is the dativus (in)commodi, which specifies to whose (dis)advantage an action occurs. Sentence 106[3] contains a clear case in which the dative is a dative object. Sentence 107 shows a clear case of a dativus commodi. In sentence 108, however, the situation is not so clear: Eisenberg (1999), p. 288, classifies this dative as a dative object, while Helbig and Buscha (1999), sec. 2.4.3.4, classify it as a dativus commodi because it can be replaced by a PP with 'für' without changing the meaning of the sentence. Some grammars even count the dativus commodi as such among the dative objects (cf. Eisenberg, Gelhaus, Wellmann, Henne, and Sitta (1998), § 1156, and Engel (1996), sec. S 016). However, since the dativus (in)commodi can occur quite freely with verbs and its status as a complement is so uncertain, it is not listed in the sub-categorization frames of the verbs in IMSLex. But, since we are using the TüBa-D/Z as our gold standard, we have to annotate those cases as well because, in TüBa-D/Z, all dative phrases are assigned the status of OD (i.e. dative objects). There are, thus, structures which cannot be annotated with the help of the sub-categorization information of IMSLex alone.

(106) [ON Markus] schenkt [OD seiner Freundin] [OA einen Kuchen].
      Markus presents with his girlfriend a cake.
      'Markus gives his girlfriend a cake as a present.'

(107) [ON Markus] backt [OD seiner Freundin] [OA einen Kuchen].
      Markus bakes his girlfriend a cake.
      'Markus bakes a cake for his girlfriend.'

(108) [ON Markus] kauft [OD seiner Freundin] [OA einen Kuchen].
      Markus buys his girlfriend a cake.
      'Markus buys a cake for his girlfriend.'

[3] All example sentences for datives are constructed.

Another linguistic phenomenon which is problematic because its definition differs in IMSLex and in the TüBa-D/Z is the prepositional object OPP. The reason for this is that the definition of OPPs is both controversial and vague, because the distinction between PPs in their function as prepositional objects and as adverbials is not clear-cut. Generally, prepositional objects can be defined as those PPs in which the preposition and the case of the head noun of the NP are not free but demanded by the verb. However, the linguistic tests with which one can try to decide whether the verb governs the preposition and the case do not always yield an unambiguous result.[4] One of the tests proposed is that the prepositional object cannot be left out (i.e. it is obligatory). This is, however, not always true, as sentence 109 shows: the prepositional object 'an ihrem Mitschüler' can be left out. Additionally, some adverbial PPs are obligatory, as in sentence 110, in which 'in Frankfurt' cannot be left out.

(109) [ON Sabine] rächt [OA sich] [OPP an ihrem Mitschüler].
      Sabine avenges herself on her schoolmate.
      'Sabine takes revenge on her schoolmate.'

(110) [ON Mein Bruder] wohnt in Frankfurt.
      My brother lives in Frankfurt.
      'My brother lives in Frankfurt.'

(111) [ON Manfred] schreibt [OA einen Brief] [OPP an seine Mutter].
      Manfred writes a letter to his mother.
      'Manfred writes a letter to his mother.'

(112) [ON Manfred] schreibt [OD seiner Mutter] [OA einen Brief].
      Manfred writes his mother a letter.
      'Manfred writes his mother a letter.'

Another test is to try to replace the PP in question with an NP which clearly is an object. This is possible in sentence 111, which has the same meaning as sentence 112. There are, however, just a few verbs with which this is possible. There is, nevertheless, a common tendency in prepositional objects: the preposition in them has, to a certain extent, lost its usual meaning. Thus, in sentence 109 the preposition 'an' has lost most of its usual locative meaning (i.e. at), while in sentence 110 the preposition in the adverbial PP retains its usual meaning of in. This semantic criterion is, however, relatively vague as well, because the extent to which the preposition has lost its usual meaning varies. Thus, the problem concerning the definition of prepositional objects as stated by Eisenberg (1999) remains:

[4] Eisenberg (1999), p. 293–299, discusses those tests extensively.

Tatsächlich ist es noch niemandem gelungen, die Objekts- und Adverbialfunktion der PrGr [Präpositionalgruppe; our remark] durchgängig zu trennen. (Eisenberg, 1999, p. 295)

In fact, no one ever succeeded in consistently separating object and adverbial functions of the prepositional group.

The problem of the definition of prepositional objects is very important because it limits the performance of the annotation of OPPs. On the one hand, it is very challenging to manually annotate OPPs consistently for the gold standard in the first place; on the other hand, it is, consequently, very hard to annotate them automatically just as in the gold standard. This is all the more the case if, as for our parser, the gold standard treebank (i.e. TüBa-D/Z) was constructed by a different institution than the sub-categorization lexicon IMSLex which was used for the automatic annotation. It can, thus, not be expected that the annotation of OPPs reaches the performance of the annotation of ONs or OAs.

7.4 Motivation for Annotation

We have selected the linguistic phenomena of sub-categorizable GFs for annotation because they are vital for the structure of the sentence. While most non-sub-categorizable GFs are optional, GFs sub-categorized by the verb or the adjective are the central elements of the sentence, which are obligatory most of the time. An important point about sub-categorizable GFs is that they show relations in the sentence structure and can be seen as the interface between syntax and semantics. Alongside semantic information, information about GFs can be used, e.g., for

information extraction purposes because, very often, sub-categorizable GFs reveal the semantic structure of the sentence according to patterns like ‘who did what to whom’. In sentence 113, the triple ‘verb, ON, OA’ can be extracted, which could yield a structure like ‘visit: Pope, Bukarest’. Together with other information from and/or about the text, this structure could be stored in a database to be used in information extraction. The annotation of sub-categorizable GFs can also be used for other applications like machine translation, summary generation or dialogue systems.

(113) Von heute bis Sonntag besucht [ON Papst Johannes Paul II.] [OA From today until Sunday visits Pope John Paul II die rumänische Hauptstadt Bukarest]. the Romanian capital Bukarest. ‘Pope John Paul II is staying in the Romanian capital Bukarest from today until Sunday.’

An important point for the motivation to annotate GFs in German can be made by comparing the nature of GFs in German to their nature in English. While, in English, GFs are almost exclusively expressed by distribution, in German, GFs are expressed by a mixture of distributional and morphological information. This phenomenon reflects the fact that English is an analytic (or configurational) language while German is rather a synthetic (or non-configurational) language. Thus, while in English the GFs can be abstracted from the constituent structure, this is not possible for German, as can be seen in sentences 114 and 115. The subject in English can be defined as the NP directly dominated by the S, and the object as the NP directly dominated by the VP. This is, however, not possible for German, in which, due to a more varied constituent order, a constituent like a VP cannot even be introduced without running into problems like crossing edges. Thus, GFs have to be explicitly marked in German so that they can be searched for in an annotated corpus.

(114) [S [NP Nobody] [VP exceeds [NP this actor]]].

(115) [S [VF [NP Diesen Schauspieler]] [VC übertrifft] [MF [NP niemand]]]. This actor exceeds nobody. ‘Nobody exceeds this actor.’

7.5 Methods of Annotation – An Outlook

For the annotation of GFs, the parser makes use of the already existing shallow annotation structure, the morphological information from DMOR and the sub-categorization frames from IMSLex (cf. figure 7.3). Thus, the parser makes use of a lot more information than KaRoPars, which basically uses PoS tag information for the shallow annotation. This raises the question of whether it is possible to use the same annotation methods and tools for the GFs annotation as for the shallow annotation. For the shallow annotation in KaRoPars, an FSA approach was used which allowed a fast and robust annotation of these structures. Since the GFs annotation component is also intended to annotate large quantities of data, it would be reasonable to try to implement this task in an FSA approach as well.

A crucial step in the annotation of GFs is the disambiguation of morphological information. In the ideal case, each chunk would have just one morphological reading (full disambiguation); in many cases, however, it is only possible to reduce the number of morphological readings for one chunk (partial disambiguation). The disambiguation process has to make use of the morphological information which was assigned by DMOR to each token and the shallow annotation structure assigned by KaRoPars. In order to maximally reduce the ambiguity of chunks, the morphological features of the tokens in each chunk have to be compared. In the case of verb-subject number agreement, the features of different chunks have to be compared. For this process, an FSA approach is not an adequate solution since it involves ranking the morphological features and changing elements which occurred earlier in the parse tree. Another formalism will, thus, be used. After the partial or full disambiguation of the morphological features, the annotation of GFs proceeds using distributional information and sub-categorization frames in addition to the already existing structure.
In order to annotate the grammatical functions with FSAs, it has to be checked whether their annotation can be achieved by a regular expression grammar. If the sub-categorization information from IMSLex can be integrated into the corpus such that it is a feature of the verb, then it is possible to match the sequence of potential GFs and verb by a regular expression and thus apply FSAs for the annotation. Figure 7.8 shows a simplified example of how the linguistic data can be structured in such a way that it can be matched by a regular expression grammar. There are tuples of NC, field information and case information, on the one hand, and tuples of verb chunk and sub-categorization frame, on the other hand (n.b. that the lexeme of the verb is not used).

Figure 7.8: Sequence of noun chunk, verb chunk, noun chunk

The example shows a pattern of an NC with the features VF (for occurs in VF) and accusative, which is followed by the finite verb in the VCLVF, which sub-categorizes for the frame ‘ON OA’ and is again followed by an NC with the features MF and nominative. Figure 7.8 shows that the linguistic information is structured in such a way that it can be matched by a regular expression grammar and, hence, the annotation can be achieved by FSAs. Since the morphological information is only partial in some cases, there remains the question of how to decide in cases in which more than one annotation solution would be possible. We have decided in favour of a ranked order of rules, in which the most typical ordering of GFs is tried first. The rules are applied until one matches. This ranking is the crucial disambiguation mechanism in the GF component.
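To make the idea concrete, the following sketch linearizes the tuples of figure 7.8 as strings and matches them with a Python regular expression. The encoding ("TAG:FIELD:FEATURE" tokens) and all names are our own illustrative assumptions, not the parser's actual data format:

```python
import re

# Hypothetical linearization of figure 7.8: each chunk becomes one token
# "TAG:FIELD:FEATURES"; the verb chunk carries its sub-categorization
# frame as a feature (here 'ON_OA').
clause = "NC:VF:acc VCL:LK:ON_OA NC:MF:nom"

# Rule for the marked order OA < finite verb < ON (cf. sentence 115):
# an accusative NC in the VF, a finite verb sub-categorizing for the
# frame 'ON OA', and a nominative NC in the MF.
rule_oa_on = re.compile(
    r"(?P<oa>NC:VF:\S*acc\S*) "
    r"VCL:LK:ON_OA "
    r"(?P<on>NC:MF:\S*nom\S*)"
)

m = rule_oa_on.match(clause)
if m:
    print("OA =", m.group("oa"), "| ON =", m.group("on"))
```

Since regular expressions of this kind are equivalent to finite-state automata, a grammar written as an ordered list of such rules can be compiled into FSAs, which is the point of the argument above.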


Chapter 8

Reducing Morphological Ambiguity

Abstract: Chapter 8 deals with the component in the annotation system which assigns morphological information to each token and reduces the number of morphological readings in the morphological ambiguity class of the noun chunks in order to allow and improve the annotation of grammatical functions. Section 8.1 defines the task of the reduction of morphological ambiguity and introduces the approach which is used. Section 8.2 explains the details of the disambiguation component which disambiguates morphology using chunk-internal restrictions; section 8.3 explains the details of the disambiguation component working with chunk-external restrictions, i.e. restrictions on clause level. Section 8.4 deals with the enrichment of those tokens with morphological information which were not dealt with by DMOR. Section 8.5 draws a conclusion about our approach and gives an outlook on the remaining task of GF annotation by comparing our approach to Hinrichs and Trushkina (2002).

8.1 Task Description

Since five out of the seven grammatical functions we annotate are associated with case (i.e. ON, OG, OD, OA and PRED), the assignment of case information to the respective target noun chunks is an important step towards grammatical function annotation. In the case of OPPs, the preposition is an indicator of the functional potential of the chunk, but only together with

sub-categorization information of the verb. The same holds true of the subordinator and the clause type in the case of OSs. In the case of the grammatical functions ON, OD and OA, however, the full disambiguation of the morphological features of a chunk would, in most cases, result in the detection of its grammatical function. Since the full disambiguation of the morphological features does, however, require distributional information and, very often, sub-categorization information, we just use two straightforward procedures to reduce the number of elements in the ambiguity class of the target chunks for grammatical function annotation. In the case of grammatical functions associated with case, the final disambiguation is achieved by the grammatical function annotation component.

The integration and partial disambiguation of the morphological information can best be illustrated by figure 8.1, which is another representation of the tree in figure 7.1 with morphological information added. The form of each token links it with a morphological potential, which, in the case of determiners, attributive adjectives and nouns, consists of a feature combination of ‘case, number, gender’. We assign each token a set of such feature combinations using the tool DMOR (Schiller, 1995). The form ‘der’ of the definite article, for instance, is associated with a set of four feature combinations representing ‘nominative, singular, masculine’, ‘dative or genitive, singular, feminine’ and ‘genitive, plural, all genders’. This set of feature combinations is the morphological ambiguity class of the token. Since DMOR assigns the morphological information independently of any contextual information, purely on the basis of the form and PoS tag of a token, the information may be highly ambiguous in some cases, and it is not possible to resolve this ambiguity without contextual information.
The set of all distinct morphological feature combinations of all tokens in a chunk can be seen as its undisambiguated morphological ambiguity class. It is possible to use the shallow annotation structure annotated by KaRoPars to reduce the number of elements in this ambiguity class. This can be achieved on two levels (cf. figure 8.2): On the one hand, the morphological features are compared within one chunk, i.e. chunk-internally; on the other hand, the morphological features of chunks are compared with the number feature of the respective verb in order to check whether a potential ON and the verb agree in number. This can be illustrated by the structure in figure 8.1. It shows the morphological ambiguity class of each token and the shallow annotation structure. Using the chunk structure it is, e.g., possible to reduce the morphological ambiguity class of the chunk ‘der Erklärung’, which contains

Figure 8.1: Reducing morphological ambiguity using shallow structure

six distinct feature combinations to the two feature combinations shown in the row ‘Morphology’. The number of potential cases has been reduced from four to two. For the chunk-external ambiguity reduction, the clause and the field structure may be used, since the finite verbs are assigned the feature combination ‘person, number’: In the case at hand, the finite verb is singular and, thus, all plural nominatives could have been discarded from the ambiguity classes. It is, thus, the task of the morphological disambiguation component to reduce the morphological ambiguity of the potential grammatical functions using the morphological information assigned by the tool DMOR and the shallow annotation structure assigned by KaRoPars. Although full disambiguation is not possible using chunk-internal disambiguation and verb-ON number agreement, the achieved reduction considerably facilitates grammatical function annotation. This can be illustrated by figure 8.1, in which only one chunk retains the feature ‘nominative’ in its partially disambiguated ambiguity class although all chunks contain it in their undisambiguated ambiguity class. Since nearly all sentences contain ONs, this chunk can be seen as fully disambiguated because there is no way the following grammatical function annotation component would assign any other chunk than this one the grammatical function ON. Thus, no separate process is required which uses restrictions like this one to further reduce morphological ambiguity, since these restrictions are checked by the grammatical function annotation component.

Figure 8.2: Structure of morphological disambiguation component and position in the annotation cascade

8.2 Reducing Ambiguity on Chunk Level

The first component disambiguating morphology works solely chunk-internally, i.e. it just uses information available within the chunk for which it disambiguates the morphology. This makes it very fast and effective. On the basis of the morphological features of the individual tokens, the morphological features of the whole chunk are computed. The idea behind the process is that determiners, adjectives and nouns have to agree in number, gender and case within one noun chunk. Those feature combinations which occur with all relevant tokens in a chunk are, thus, the feature combinations which show the morphological potential of a chunk, i.e. the grammatical functions which a chunk might realize if regarded without contextual information. In order to compute the feature combinations which occur with all tokens, we use a ranking method, in which we count the number of times a feature combination occurs with different tokens. The feature combination which occurs most frequently is then selected as the feature combination which is assigned to the whole chunk. If more than one feature combination occurs most frequently, the chunk is assigned the set of all feature combinations occurring most frequently. Table 8.1 presents an example of morphological disambiguation in a noun chunk. The noun chunk contains three tokens: the article ‘der’, the adjective ‘erste’ and the noun ‘Besuch’. The article may derive either from the lemma ‘der’ or the lemma ‘die’. DMOR assigns a different ambiguity class for each of the lemmas: in the case at hand, the nominative reading for the lemma ‘der’ and the genitive and dative readings for the lemma ‘die’. For each of the assigned morphological feature combinations of both lemmas, the ranking value is increased by 1. However, if a feature combination occurred in the ambiguity class of different lemmas and were, thus, assigned more than once, it would still only be counted once per token.
Counting it more than once would bias the ranking because, although the token might potentially be derived from two lemmas, it is actually only the instantiation of one lemma. Consequently, the ranking process covers the full morphological potential of a token evenly. This procedure has also been tested with the grammatical function annotation component and has been shown to improve results. Each instance of a feature combination as represented in table 8.1 is

counted once. In the case at hand, the feature combination with the highest score is ‘nsm’, which is, in fact, the only feature combination which occurs in all three tokens (highlighted by a box in table 8.1). The tokens ‘der’ and ‘erste’ in table 8.1 contain a peculiarity: The feature combinations ‘gp0’, ‘np0t’ and ‘ap0t’ are unmarked for gender (as represented by the ‘0’ in the gender slot of the feature combination code). This means, however, that they may represent any of the three genders in German. In order to make possible a comparison between feature combinations which are unmarked for gender and those in which gender is explicitly marked, it is necessary to spell out the feature combinations which are unmarked for gender. (The added feature combinations are highlighted with italics in table 8.1.)

der {nsm, gp0, gsf, gpf, gpm, gpn, dsf}
erste {np0t, nsf, nsmw, nsnw, npft, npmt, npnt, ap0t, asf, asnw, apft, apmt, apnt}
Besuch {nsm, dsm, asm}

Table 8.1: Chunk-internal disambiguation by ranking feature combinations
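The ranking can be sketched as follows. This is a minimal re-implementation under our own assumptions, not the actual component: the gender-unmarked ‘0’ codes are taken as already spelled out, the strong forms as already removed, and only the first three positions (case, number, gender) are ranked:

```python
from collections import Counter

def rank(ambiguity_classes):
    """Keep the feature combination(s) that occur in the most tokens."""
    counts = Counter()
    for token_class in ambiguity_classes:
        # Only case, number and gender are ranked; the inflection-type
        # letter (4th position) is ignored. Each combination is counted
        # at most once per token, even if it occurs under several lemmas.
        for combo in {c[:3] for c in token_class}:
            counts[combo] += 1
    best = max(counts.values())
    return {c for c, n in counts.items() if n == best}

# 'der erste Besuch' (cf. table 8.1), strong forms already removed and
# gender-unmarked codes spelled out:
der = {"nsm", "gsf", "gpf", "gpm", "gpn", "dsf"}
erste = {"nsf", "nsmw", "nsnw", "asf", "asnw"}
besuch = {"nsm", "dsm", "asm"}

print(rank([der, erste, besuch]))  # {'nsm'}
```

As in table 8.1, only ‘nsm’ occurs in all three tokens, so it receives the maximal count and becomes the morphological analysis of the whole chunk.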

The principle of the ranking mechanism is very simple, but there are some more details to consider. In the actual ranking process, the component just considers case, number and gender. But there remains the feature ‘inflection type’ (which is represented by a fourth letter in some feature combinations). Adjectives and some (mostly deadjectival) nouns in German can be classified along the lines of the distinction between strong and weak forms (cf. Helbig and Buscha, 1999, section 3.1.1.). Sentence 116 and sentence 117 give examples: The first one shows a weak inflection type and the second one a strong one. In this case, the difference can just be seen in the adjective. There are five classes: In our morphological disambiguation component, they are represented by the features ‘strong’ (‘t’), ‘weak’ (‘w’), ‘mixed’ (‘m’), ‘strong/mixed’ (‘T’) and ‘weak/mixed’ (‘W’). Since the distinction between strong and weak inflection type applies only to some nouns, there is also the feature ‘unmarked for inflection type’ (‘ ’).

(116) Damals feierten sie die großen Erfolge. In those days celebrated they the great successes. ‘In those days, they celebrated their great successes.’

(117) Damals feierten sie große Erfolge. In those days celebrated they great successes. ‘In those days, they celebrated great successes.’

The distinction between those two inflection types is important since they interact in a defined way. The definite article only occurs with the weak inflection type and the zero article with the strong one (the indefinite article is mixed). For the example in table 8.1, this means that the features of the strong inflection type (those ending in ‘t’ and underlined) are irrelevant for the ranking since they cannot go together with the definite article. They are, thus, not included in the ranking process. Another detail of the ranking process for morphological feature combinations is that, in our chunk definition, phrases which are center-embedded in another phrase form a chunk together with the non-recursive phrase they are embedded in. But, although these chunks are part of the embedding chunk, they do not contribute to the morphological analysis because they are morphologically autonomous. Consequently, the morphology of center-embedded chunks is ignored. The morphology of those chunks is of no importance for grammatical function annotation because they cannot act as one of the grammatical functions we annotate. Sentence 118 gives an example. The prepositional chunk ‘auf dieser Seite’ is a modifier of the adjective ‘erscheinenden’ and, thus, forms an adjectival chunk with it. This chunk is embedded in the chunk beginning with ‘die’ and ending with ‘LeserInnenbriefe’. As regards the morphology, only the determiner ‘die’, the adjective ‘erscheinenden’ and the noun ‘LeserInnenbriefe’ are relevant. Consequently, only these are taken into account for the morphological ranking.

(118) [NC Die [AJAC [PC auf [NC dieser Seite]] [AJAC erscheinenden]] The on this page published LeserInnenbriefe] geben nicht notwendigerweise die Meinung der letters to the editor give not necessarily the opinion of the taz wieder. taz back. ‘The letters to the editor published on this page do not necessarily represent the opinion of the taz [i.e. the newspaper].’

Figure 8.3: Intersection method – no result

Since the result of our disambiguation approach is typically the set of feature combinations which occur in all relevant tokens in a chunk, the question remains why we did not choose an approach in which the intersection of the feature combinations of all relevant tokens is taken. The reason for this is that the ranking method is very robust because it also yields a result in the case that one of the relevant tokens does not contain the correct feature combination. A reason for a missing or a wrong feature combination may be a simple typing error in a word in the input sentence or an error in the DMOR analysis. Although we do assign default ambiguity classes in the case that DMOR has not assigned any morphology to a token, there is still the possibility that DMOR has assigned a wrong ambiguity class. In this case, the correct analysis would be missing, and this would lead to system failure with the intersection method, as outlined in the following example, which is illustrated in table 8.2 and figures 8.3 and 8.4.

word category    morphological feature
article          {a, b}
adjective        {b, c}
noun             {a, d}

Table 8.2: Advantages of ranking over intersection

We envisage a hypothetical chunk with a determiner, an adjective and a noun, for which the correct morphological analysis would be the feature combination ‘a’. However, the ambiguity class of the adjective has been incorrectly assigned as {b, c} instead of {a, c}. The intersection method would yield no result (cf. figure 8.3) or – as a possible backup – the ambiguity class of all feature combinations {a, b, c, d}. The ranking method, however, yields {a, b} as a result (cf. figure 8.4), since ‘a’ and ‘b’ both occur twice (highlighted by boxes in table 8.2), in contrast to the other feature combinations, which occur just once. The result of the ranking method is an ambiguity class which contains the correct morphological feature combination ‘a’ and is smaller than the one put out by the possible intersection backup method. For the process of grammatical function annotation, this considerable reduction of the ambiguity class may already be of great help although the result still contains two feature combinations, one of which is wrong. However, since the grammatical function annotation component also relies on other factors like contextual information in its annotation decision, this robust treatment of morphological disambiguation allows working even with partially erroneous input.

Figure 8.4: Ranking method – solution contained in result
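The advantage of ranking over intersection for the hypothetical chunk of table 8.2 can be verified directly (again a sketch, not the actual implementation):

```python
from collections import Counter

# The hypothetical chunk of table 8.2: the adjective's class was wrongly
# assigned {b, c} instead of {a, c}, so the correct feature 'a' is
# missing from one token.
article, adjective, noun = {"a", "b"}, {"b", "c"}, {"a", "d"}
tokens = [article, adjective, noun]

# Intersection: brittle – one wrong class and the result is empty.
intersection = set.intersection(*tokens)

# Ranking: count how many tokens each feature occurs in and keep the
# most frequent ones; the correct feature survives the error.
counts = Counter(f for t in tokens for f in t)
best = max(counts.values())
ranked = {f for f, n in counts.items() if n == best}

print(intersection)  # set()
print(ranked)        # {'a', 'b'}
```

The intersection collapses to the empty set as in figure 8.3, while the ranking retains {a, b}, which still contains the correct analysis ‘a’, as in figure 8.4.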

8.3 Reducing Morphology on Clause Level

While the first component for the disambiguation of morphological information deals with restrictions on chunk level, the second component deals with a phenomenon which has a wider scope, namely the agreement in number between finite verb and ON, which ranges over the level of the clause. In German, the position of the ON in the clause is relatively free. It may be either in the VF or the MF of a clause. The finite verb may be either in the left part of the sentence bracket (V1/V2 clauses) or in the right part of the sentence bracket (VL clauses). Since, at the point of disambiguation, it is not clear which of the noun chunks is the ON (as otherwise there would be no need for disambiguation), more than one noun chunk might have to be taken into account. Thus, the component has to consider the whole clause when disambiguating the morphology of the potential ON.

[Tree diagram: a SIMPX with the fields VF–LK–MF; ‘das’ is an NCX labelled OA in the VF, ‘machen’ the finite verb (VXFIN) in the LK, and ‘die hauptamtlichen Geschäftsführer’ an NCX labelled ON in the MF.]

Das machen die hauptamtlichen Geschäftsführer. That do the full-time managers. ‘The full-time managers are taking care of that.’

Figure 8.5: Potential ONs are ambiguous as regards case features but not as regards number features

The disambiguation component works with the morphological ambiguity class which has been computed by the previous morphological disambiguation component dealing with ambiguity on chunk level. It checks the morphological feature combinations which contain the feature ‘nominative’ with respect to the feature ‘number’ against the number feature of the verb. In case they do not agree, the respective feature combination in the noun chunk is deleted. If, for instance, the finite verb just carries the feature ‘plural’, all the singular readings in the potential ONs are ruled out. The sentence in figure 8.5 gives an example: Both noun chunks in the sentence have the morphological potential to be either the ON or the OA of the sentence. The verb, however, demands a plural ON and, thus, ‘das’ is ruled out as the ON because it only carries the feature combination ‘nsn’. Table 8.3 visualizes the process applied to the sentence in figure 8.5: The features ‘1p’ and ‘3p’ of the verb ‘machen’ show that the verb goes together with a nominative object with the feature combination ‘first person, plural’ or ‘third person, plural’. Thus, the singular reading of ‘das’ (i.e. nsn) can be excluded. It is deleted from the morphological ambiguity class of the chunk. Since ‘das’ contains no nominative plural feature in its ambiguity class,

das {nsn, asn}
machen {1p, 3p}
die hauptamtlichen Geschäftsführer {npm, apm}

Table 8.3: Using ON-finite verb number agreement for further morphological disambiguation

it no longer qualifies for ON. This process saves the grammatical function annotation component the task of checking ON-finite verb agreement. If the agreement had to be checked in the grammatical function annotation component, this would increase the size of this component and, thus, slow down the annotation process and make the system harder to manage.
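A minimal sketch of this filter, under our own assumptions about the feature codes (case + number + gender for chunks, person + number for the verb, as in table 8.3; not the actual component):

```python
def filter_by_verb_number(chunk_class, verb_features):
    """Delete nominative readings that disagree in number with the verb."""
    # The verb's number is the last letter of its features,
    # e.g. {'1p', '3p'} -> {'p'}.
    verb_numbers = {f[-1] for f in verb_features}
    return {
        combo for combo in chunk_class
        # keep a combination unless it is nominative ('n' in the case
        # slot) with a number the verb does not allow
        if not (combo.startswith("n") and combo[1] not in verb_numbers)
    }

das = {"nsn", "asn"}
geschaeftsfuehrer = {"npm", "apm"}
machen = {"1p", "3p"}

print(filter_by_verb_number(das, machen))                # {'asn'}
print(filter_by_verb_number(geschaeftsfuehrer, machen))  # both readings survive
```

As in table 8.3, the plural-demanding verb deletes the ‘nsn’ reading of ‘das’, leaving only the accusative reading, while the plural chunk keeps its nominative reading.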

8.4 Further Enriching the Corpus with Morphological Information

Some tokens are not assigned any morphological ambiguity class by the tool DMOR. This includes tokens with the PoS tag CARD (cardinal number) or FM (foreign material). For this reason, we assign these tokens a default morphological ambiguity class which comprises all the potential case, number and gender features (i.e. all four cases, singular or plural and all three genders). This is done because the grammatical function annotation component works with morphological features and, without any morphological information, it cannot assign any grammatical function. Even if the morphological feature combinations cannot be further reduced (e.g. by other tokens in the chunk), this information is still useful because the grammatical function annotation component also relies on distributional information and on the morphological ambiguity classes of the other potential grammatical functions.

Cardinal numbers (CARD) have not been assigned any morphology because they do not have any. We assume, however, that cardinal numbers (except ‘1’) in attributive adjective chunks carry the feature ‘plural’. Therefore, we assign all the feature combinations which include the feature ‘plural’ so that this information helps to disambiguate the morphology of noun chunks in the chunk-internal component. Sentence 119 and sentence 120 give an example. The noun ‘Verdächtige’ may be singular or plural. Using the information that cardinals are plural deletes the singular readings, which may

lead to the deletion of some non-applicable cases, as in sentence 121. The token ‘Teilnehmer’ may have the cases nominative, dative and accusative for its singular reading and nominative, genitive and accusative for its plural reading. Thus, if the singular reading is ruled out, OD can be ruled out as a potential grammatical function.

(119) 13 Verdächtige wurden von der Polizei festgenommen. 13 suspects were by the police arrested. ‘13 suspects were arrested by the police.’

(120) Verdächtige in diesem Fall war auch die Besitzerin des Ladens. A suspect in this case was also the owner of the shop. ‘The owner of the shop was also a suspect in this case.’

(121) 350 Teilnehmer der zuvor friedlichen 350 participants of the before peaceful ”Reclaim the Streets”-Party wurden wegen Ordnungswidrigkeit ”Reclaim-the-Streets” party were because of regulatory offence festgesetzt. arrested. ‘350 participants of the ”Reclaim-the-Streets” party which was peaceful until then were arrested for regulatory offences.’
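The construction of the default classes can be sketched as follows. `default_class` is a hypothetical helper under our own assumptions about the feature codes, not part of the system:

```python
from itertools import product

# Feature inventories: four cases, two numbers, three genders.
CASES, NUMBERS, GENDERS = "ngda", "sp", "mfn"

def default_class(pos_tag, form=None):
    """Full cross-product of case, number and gender for tokens DMOR
    did not analyse; restricted to plural for cardinals other than '1'."""
    combos = {"".join(c) for c in product(CASES, NUMBERS, GENDERS)}
    if pos_tag == "CARD" and form != "1":
        # cardinals (except '1') are assumed to be plural
        combos = {c for c in combos if c[1] == "p"}
    return combos

print(len(default_class("FM")))          # 24 combinations
print(len(default_class("CARD", "13")))  # 12 plural combinations
```

Foreign material and unanalysed abbreviations receive the full 24-element class, while a cardinal like ‘13’ keeps only the twelve plural combinations, which is what licenses the deletion of the singular readings in sentences 119 and 121.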

Furthermore, foreign material (FM) has not been assigned any morphology because it does not fit into the inflectional system of the German language. It does, however, receive the default ambiguity class because it may also act as (part of) a grammatical function (cf. sentence 122). In this case, the default ambiguity class is also of importance for the morphological disambiguation, which again is important for grammatical function recognition. The same holds true of nouns which have not been assigned an ambiguity class because they are, e.g., abbreviations like ‘USA’. These also receive the default ambiguity class.

(122) Die interne Credibility ist wichtig. The internal credibility is important. ‘The internal ‘credibility’ is important.’

8.5 Conclusion, Outlook and Comparison with Other Approaches

Our approach allows us to reduce the number of elements in the morphological ambiguity class of a target chunk for grammatical function annotation. This partial disambiguation facilitates and improves grammatical function assignment because the ambiguity of potential grammatical functions is reduced. However, with the exception of verb-ON number agreement, we only use chunk-internal information to reduce ambiguity, since chunk-external distributional information is integrated into the component following the morphological ambiguity reduction component, the grammatical function annotation component. Checking this information in the morphological ambiguity reduction component as well would lead to double processing of information and, thus, slow down the annotation process. The morphological ambiguity reduction component can, hence, be seen as a preprocessing step towards grammatical function annotation, while the final disambiguation of morphology with regard to grammatical functions is performed by the grammatical function annotation component. A comparison with another approach further illustrates why some morphological ambiguity is not resolved in our morphological disambiguation component. Hinrichs and Trushkina (2002) try to disambiguate most of the morphological information in their morphological disambiguation component. The reason for this is the different goal of this component in their system. Hinrichs and Trushkina (2002) use a dependency-based approach for syntactic annotation. In this approach, “morphological disambiguation is a crucial step in narrowing down the search space for the correct assignment of dependency structures”. By contrast, in our approach, we rely strongly on distributional information for the annotation of grammatical functions.
The partial morphological disambiguation is, in fact, a crucial step for the annotation of grammatical functions, but the final decision about the annotation of grammatical functions is taken by the component which makes use of distribution and sub-categorization frames of the verb. Since the annotation of grammatical functions is also a decision about morphology for the grammatical functions ON, OG, OD, OA and PRED, the final morphological disambiguation is done by that component and not in the morphological disambiguation component. An example illustrates the difference to the approach of Hinrichs and Trushkina (2002).

In sentence 123, the chunk containing the personal pronoun ‘wir’ is unambiguously nominative while the other chunk may be of any case (cf. table 8.4)1. This means for the other noun chunk(s) that they cannot be nominative, since a clause typically just contains one nominative chunk unless there is a copula verb in the clause. Hinrichs and Trushkina (2002) use this information to eliminate all nominative readings in all noun chunks other than the unambiguous one. Our morphological disambiguation component does not deal with cases like this because the ambiguity in the chunks is finally resolved in the grammatical function annotation component. The verb ‘finden’ sub-categorizes (among other frames) for the frame ‘ON OA PRED ADJ’ (i.e. it takes a nominative and an accusative object and a predicative adjective). By using the ambiguity class of the respective chunks in conjunction with distributional features, we assign grammatical functions. The assignment of the grammatical function ON makes it clear that the chunk is nominative, and the grammatical function OA shows that the chunk is accusative.

(123) Es ist wichtig, daß [ON wir] [OA Candan Ercettin] [PRED gut] It is important that we Candan Ercettin good finden. consider. ‘It is important that we like Candan Ercettin.’

(124) Es ist egal, daß [OA Candan Ercettin] [ON keiner] [PRED gut] findet.
It is all the same that Candan Ercettin nobody good consider.
‘It does not matter that nobody likes Candan Ercettin.’

wir             {np}
Candan Ercettin {gs, ds, as, np, gp, dp, ap}

Table 8.4: Unambiguous and ambiguous morphological information

¹In the feature combinations, the feature ‘gender’ is now left out in the representation since it is no longer of any importance in grammatical function annotation.

Candan Ercettin {ns, gs, ds, as, gp, dp, ap}
keiner          {ns}

Table 8.5: Unambiguous and ambiguous morphological information

Our system first applies the standard constituent order ON OA PRED if the respective chunks include suitable morphological features in their ambiguity class (possibly among some other morphological features). In the case at hand, this strategy would succeed and correctly assign the grammatical functions and, thus, fully disambiguate the morphology. It is, however, possible that there is a distribution other than the standard constituent order. However, this would not change anything in the case at hand. Sentence 124 shows another distribution with the same morphological ambiguity classes.² In this case, our system would not assign the standard constituent order. The first noun chunk ‘Candan Ercettin’ does in fact contain ‘nominative’ in its morphological ambiguity class, but the second one does not contain ‘accusative’ in its ambiguity class (cf. table 8.5). Consequently, our system would not try to assign the chunk ‘Candan Ercettin’ the label ‘ON’. Thus, the correct annotation is guaranteed without the previous disambiguation of the chunk ‘Candan Ercettin’. Since we use a different approach, many of the cases disambiguated in the morphological disambiguation component of Hinrichs and Trushkina (2002) are disambiguated by our grammatical function annotation component. While Hinrichs and Trushkina (2002) need the morphological annotation at an early stage for their dependency annotation, we can leave the morphology partially ambiguous (underspecified) until it is resolved in the grammatical function annotation component.
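How this interplay of ambiguity classes and SC frames works can be sketched procedurally. The following is a toy Python illustration with invented helper names; in the actual system the checks are compiled into FSTs, and the PRED slot of the frame is omitted here for brevity:

```python
# Sketch of the frame-based disambiguation described above: a chunk may
# receive a GF only if the required case is in its morphological
# ambiguity class. All names are illustrative, not the thesis code.

# Morphological ambiguity classes as in tables 8.4/8.5
# (n/g/d/a = case, s/p = number).
sentence_123 = [("wir", {"np"}),
                ("Candan Ercettin", {"gs", "ds", "as", "np", "gp", "dp", "ap"})]
sentence_124 = [("Candan Ercettin", {"ns", "gs", "ds", "as", "gp", "dp", "ap"}),
                ("keiner", {"ns"})]

CASE_OF_GF = {"ON": "n", "OA": "a"}  # GF -> required case

def may_bear(gf, ambiguity_class):
    """A chunk can bear a GF iff some reading has the required case."""
    case = CASE_OF_GF[gf]
    return any(reading.startswith(case) for reading in ambiguity_class)

def assign(frame, chunks):
    """Try to map the frame's GFs onto the chunks in their surface order;
    fail if any chunk's ambiguity class lacks the required case."""
    if len(frame) != len(chunks):
        return None
    result = []
    for gf, (text, ambig) in zip(frame, chunks):
        if not may_bear(gf, ambig):
            return None          # e.g. 'keiner' cannot be OA
        result.append((gf, text))
    return result

# The standard order ON OA succeeds for sentence 123 ...
print(assign(["ON", "OA"], sentence_123))
# ... but fails for sentence 124, so a non-standard order is tried.
print(assign(["ON", "OA"], sentence_124))
print(assign(["OA", "ON"], sentence_124))
```

Note how the failed match on sentence 124 falls through to the non-standard order without any prior chunk-internal disambiguation, exactly as described above.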

²The reason that sentence 124 is lexically different from sentence 123 is that sentence 123 with the constituent order of sentence 124 would only be marginally acceptable.


Chapter 9

Annotating Grammatical Functions (GFs)

Abstract: Section 9 deals with the annotation of GFs using the shallow annotation structure, the reduced morphological ambiguity classes and sub-categorization (SC) frames for the relevant verbs and adjectives. Section 9.1 introduces the basic issues relevant for the annotation of GFs (cf. figure 9.1). Section 9.2 explains how lexical entries are converted into finite-state automata (FSAs). In section 9.3, we present the mechanism which allows us to disambiguate between different annotation possibilities with respect to linear order; and section 9.4 shows the disambiguation mechanism for the case that more than one SC frame could be applied. Section 9.5 describes the linguistic principles which are fed into the disambiguation mechanism presented in section 9.3. In section 9.6, we illustrate how we extend our approach to cover cases in which there is no SC frame for the respective verb or adjective. Section 9.7 shows how our approach can be modelled with the help of OT constraints.

9.1 Task Description

It is our goal to annotate GFs using finite-state (FS) transducers. We make use of the shallow structure added by the shallow parsing component and of the morphological ambiguity class reduced by the morphological reduction component. Furthermore, we use the SC frames contained in IMSLex (Eckle-Kohler, 1999). The basic issues in the annotation of GFs using FSTs are

illustrated in figure 9.1. They will be discussed in this section in the order in which they emerge in the construction of the parser. The first problem is how to convert the lexical entries which contain the SC information into FS rules which cover the different instantiations of a SC frame. One of the main issues of any annotation system is the way ambiguous structures are dealt with. Thus, a further central issue is the problem of ambiguity in the occurrence and linear order of GFs. First, there is the question of the occurrence of different GFs for verbs and adjectives which have more than one SC frame. Second, there is the question of ambiguity in the linear order of GFs, which arises because the ambiguity in morphology is reduced but not resolved, and GFs in German can occur in various linear orders. Third, another difficulty is how to deal with verbs or adjectives which do not have any SC frame at all in the IMSLex. This problem concerns both the occurrence and the order of GFs since the problems of ambiguous morphology and missing SC frames are likely to co-occur at times. A last, very central issue is the question of which mechanism to use in order to disambiguate structures which are ambiguous because of the aforementioned points.

Transforming lexical entries into FSTs A lexical entry for a verb or an adjective consists of the lexical item itself and of one or more sets of GFs (i.e. the SC frames) it sub-categorizes for. For each set of GFs, there is more than one realization in the language. There are two reasons for this: First, the GFs may be realized in different contexts because they can occur in sentences of different types or functions like V1, V2 or VL clauses and, e.g., interrogative clauses, relative clauses or affirmative clauses. Second, there may be different linearizations of the GFs in the clauses. Both phenomena can be seen in sentences 125–127. As a consequence, more than one FST has to be generated from each SC frame. There are two crucial points as regards the transformation of the lexical entries into FSTs: First, a decision has to be made as to what extent the conversion should be done manually and to what extent it can be done automatically. Second, it has to be decided whether the automatic conversion should be done before the annotation process or on the fly during the process.

(125) [V1 Versteht [OA ihn] [ON keiner]?]
Understands him no-one?
‘Doesn’t anybody understand him?’

Figure 9.1: Basic issues in the annotation of grammatical functions in the order of their emergence in the annotation process

(126) [V2 [ON Klaus] versteht [OA seinen Freund].]
Klaus understands his friend.
‘Klaus understands his friend.’

(127) Ich glaube, [VL dass [ON er] [OA seinen Freund] versteht].
I think that he his friend understands.
‘I think that he understands his friend.’

Some additional SC frames can be derived in a strictly defined way from the SC frame(s) associated with a verb. This phenomenon is called diathesis or alternation (cf. Levin, 1993) and it is defined by lexical rules. The passive is an example of a diathesis: If a verb sub-categorizes for an ON and an OA, it can be deduced that it also sub-categorizes for an ON and an optional OPP in the passive. Sentences 128 and 129 give examples: The verb ‘abdrehen’ demands an ON and an OA in the active. In the passive, the OA of the active sentence has become the ON of the passive sentence and the ON of the active sentence has become an optional OPP (with the preposition ‘von’) of the passive sentence.

(128) [ON Klaus] drehte [OA den Verschluss] ab.
Klaus took the seal off.
‘Klaus took off the seal.’

(129) [ON Der Verschluss] wurde ([OPP von Klaus]) abgedreht.
The seal was (by Klaus) taken off.
‘The seal was taken off (by Klaus).’

Some but not all of those diatheses are contained in the IMSLex. As regards the treatment of diatheses, there are the same questions as with the different instantiations of a SC frame, i.e. whether they should be implemented manually or automatically and whether they should be generated before annotation or on the fly. Meurers and Minnen (1997) show that it is possible to model lexical rules and their interaction as FSAs. It would, thus, also be possible to integrate this process into the system just using the power of FSAs. Furthermore, there is the question of which diatheses can be safely integrated into the parser. The passive diathesis is a very unequivocal example of a diathesis. It is not contained in the IMSLex (nor typically contained in any other lexicon), and it can be reliably generated because there are few restrictions on it. The same is true of diatheses like object deletion, which can be applied to all verbs which sub-categorize for the SC frame ‘ON OA’. Object deletion describes the fact that the OA can be dropped without the sentence becoming ungrammatical (cf. sentences 130 and 131). In cases like these, the diathesis can simply be added to all verbs which sub-categorize for one distinct SC frame.

(130) [ON Steffi] schießt [OA den Ball].
Steffi kicks the ball.
‘Steffi kicks the ball.’

(131) [ON Steffi] schießt.
Steffi kicks.
‘Steffi kicks (it).’

This is different for other diatheses like the dative-shift (Dativierung). The dative-shift describes a rule according to which verbs which sub-categorize for ‘ON OD OA’ also sub-categorize for ‘ON OA OPP’ (cf. sentences 132 and 133). However, the application of this rule is far more restricted. While, as regards the passive, only the preposition ‘von’ is possible in the OPP, various prepositions are possible for the OPP of the dative-shift. In sentence 133 the OPP is realized with the preposition ‘an’ and in sentence 135 with the preposition ‘zu’. Furthermore, dative-shift can only be applied to some verbs. Dative-shift is, for instance, not possible for ‘geben’ (cf. sentences 136 and 137). While a diathesis like the passive is, thus, to a large extent solely restricted to the SC frame, a diathesis like dative-shift depends on the lexical item in question (here: the verb).

(132) [ON Stella] schickt [OD ihrer Freundin] [OA einen Brief].
Stella sends her friend a letter.
‘Stella sends her friend a letter.’

(133) [ON Stella] schickt [OA einen Brief] [OPP an ihre Freundin].
Stella sends a letter to her friend.
‘Stella sends a letter to her friend.’

(134) [ON Stella] bringt [OD ihrer Freundin] [OA einen Kuchen].
Stella brings her friend a cake.
‘Stella brings her friend a cake.’

(135) [ON Stella] bringt [OA einen Kuchen] [OPP zu ihrer Freundin].
Stella brings a cake to her friend.
‘Stella brings a cake to her friend.’

(136) [ON Stella] gibt [OD ihrer Freundin] [OA ein Buch].
Stella gives her friend a book.
‘Stella gives her friend a book.’

(137) * [ON Stella] gibt [OA ein Buch] [OPP an ihre Freundin].
Stella gives a book to her friend.
‘Stella gives a book to her friend.’

The example of the dative-shift has shown that it is not always possible to straightforwardly derive the diathesis of a SC frame. In contrast to the passive, in which a single lexical rule can be used for a large group of verbs, there has to be a distinct lexical rule for every single verb for the SC frame ‘ON OD OA’. In this case, the respective diathesis could just as well be stored in the lexicon. Treating it with a lexical rule would neither theoretically nor

technically be a simplification. For this reason, the diathesis of the dative-shift is contained in the lexical entry of some verbs in the IMSLex. Since the IMSLex is an automatically generated lexicon, there are, however, also some diatheses missing, and it would be possible to use the lexical rules to generate them. This is, however, only reasonable if the respective information is available electronically.
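The regular part of such a derivation can be sketched as a simple rewriting of SC frames. The following Python fragment is a hypothetical illustration only; the frame notation is invented, and in the actual system lexical rules of this kind would be encoded in FSAs (cf. Meurers and Minnen, 1997):

```python
# Sketch of two lexical rules discussed above. Passive: a frame
# containing ON and OA yields a passive frame in which the OA surfaces
# as ON and the ON as an optional OPP with 'von'. Object deletion:
# 'ON OA' verbs also occur with just 'ON'. Frame names are illustrative.

def passive(frame):
    """Derive the passive frame from an active frame, or None if the
    frame has no OA to promote."""
    if "OA" not in frame:
        return None
    derived = [gf for gf in frame if gf not in ("ON", "OA")]
    return ["ON"] + derived + ["(OPP_von)"]  # parentheses: optional GF

def object_deletion(frame):
    """Object deletion applies only to the frame 'ON OA'."""
    if frame == ["ON", "OA"]:
        return ["ON"]
    return None

print(passive(["ON", "OA"]))          # ['ON', '(OPP_von)']
print(passive(["ON", "OD", "OA"]))    # ['ON', 'OD', '(OPP_von)']
print(object_deletion(["ON", "OA"]))  # ['ON']
```

The dative-shift, by contrast, could not be written as such a frame-level function: as argued above, its applicability depends on the individual verb, which is why it belongs in the lexicon rather than in a rule.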

Finding a disambiguation mechanism A disambiguation mechanism is crucial because the ambiguity class of the morphological features of a chunk is just reduced and not resolved. Thus, in those cases in which more than one FST would match a structure, preference has to be given to one of them. There are two important points with regard to this problem: First, the criteria according to which the preference is given have to be specified and, second, a mechanism has to be devised which triggers the respective FST according to the specified criteria. Since we have chosen to use an FS approach for well-founded reasons, it is important that the disambiguation mechanism fully complies with the FS approach.

Treatment of lexical items with more than one SC frame The verb ‘abdrehen’ can also serve as an example for a lexical item with more than one SC frame. Sentences 138–143 give examples. Like nearly all lexical items, ‘abdrehen’ requires at least an ON in all its instantiations. Additionally, sentences 138 and 139 show it with a reflexive accusative object with and without an OPP. Sentences 140 and 141 show ‘abdrehen’ with a simple accusative object and, again, with and without an OPP. In sentence 142 the verb sub-categorizes for an OD and an OA. Only in sentence 143 does ‘abdrehen’ sub-categorize just for an ON. The crucial question as regards this phenomenon is how to deal with verbs which allow more than one instantiation of GFs in cases in which both instantiations would be possible as regards the potential GFs. One of the issues in dealing with such verbs is whether some of the SC frames can be collapsed if some GFs are simply regarded as optional, as in the sentence pairs 138, 139 and 140, 141, or whether these instantiations should be regarded as disjunct. The fact that some of the SC frames are connected with different meanings of the verb cannot be taken into account because semantic information is not taken into consideration in our parser.

(138) [ON Die Schraube] dreht [OA sich] ab.
The screw turned itself off.
‘The screw turned off.’

(139) [ON Die Schraube] dreht [OA sich] [OPP von dem Blech] ab.
The screw turned itself from the plate off.
‘The screw came off the plate.’

(140) [ON Kim] drehte [OA die Schraube] ab.
Kim turned the screw off.
‘Kim turned off the screw.’

(141) [ON Kim] drehte [OA die Schraube] [OPP von dem Blech] ab.
Kim turned the screw from the plate off.
‘Kim turned off the screw from the plate.’

(142) [ON Kim] drehte [OD dem Blech] [OA die Schraube] ab.
Kim turned the plate the screw off.
‘Kim turned off the screw from the plate.’

(143) [ON Das Flugzeug] drehte ab.
The plane peeled off.
‘The plane peeled off.’

A special case of verbs having more than one SC frame are support verbs (‘Funktionsverben’). In a support verb construction, the verb has to a large extent lost its original meaning and mostly serves as a medium of morpho-syntactic features while the lexical content lies mainly in the GF (cf. Helbig and Buscha, 1999). The GFs for which a support verb sub-categorizes are lexically restricted. What is important from the point of view of the annotation of GFs is that the SC frame of a verb in its non-support verb function and in its support verb function may differ. Thus, support verbs cannot reliably be annotated just using the SC frames of their non-support verb counterparts. Sentences 144 and 145 give an example of a support verb which has the same SC frame as its lexical verb counterpart. The lexical verb ‘aufnehmen’ sub-categorizes for an ON and an OA (sentence 144). The support verb ‘aufnehmen’ also sub-categorizes for an ON and an OA. However, the support verb ‘aufnehmen’ can only sub-categorize for OAs with lexical heads

like ‘Beziehung’ (relation), ‘Kontakt’ (contact), ‘Verbindung’ (contact) or ‘Verhandlungen’ (negotiations) (cf. sentence 145). These cases are unproblematic because the support verb readings are syntactically a subset of the lexical verb readings and can, thus, be annotated by the lexical verb component.

(144) [ON Das Schiff] nimmt [OA einen weiteren Passagier] auf.
The ship takes a further passenger in.
‘The ship takes in a further passenger.’

(145) [ON Die Regierung] nimmt jetzt [OA Verhandlungen] auf.
The government takes now negotiations up.
‘The government now takes up negotiations.’

Some support verbs do, however, differ in their SC frame from their lexical verb counterparts. Sentence 146 shows the support verb ‘ziehen’, which demands an ON, an OA and an OPP. The OPP is restricted to the preposition ‘in’ and a few nouns. But the lexical verb ‘ziehen’ just demands an ON and an OA. Thus, unlike in the case of the verb ‘aufnehmen’, the SC frame for the lexical verb ‘ziehen’ does not suffice for the support verb ‘ziehen’. The annotation component would consequently fail to annotate the grammatical functions correctly. Besides the question of how to acquire the support verbs’ SC frames, which are not contained in the IMSLex, there remains the question of integrating this information into the parser.

(146) [ON Mein Nachbar] zieht [OA eine Klage] [OPP in Betracht].
My neighbour takes a lawsuit into consideration.
‘My neighbour takes legal action into consideration.’

(147) [ON Die Kinder] ziehen [OA einen Wagen] die Straße hoch.
The children pull a cart the street up.
‘The children are pulling a cart up the street.’

Treatment of ambiguous order of GFs We have used the fact that the linear order of some constituents in German is syntactically restricted for the shallow annotation. The constituents which are syntactically restricted and were, thus, annotated by the shallow parser subsume noun chunks, prepositional chunks, adjective chunks and adverb chunks and also the constituents of the sentence bracket, i.e. the verbal elements and the subordinators. While the structure of the topological fields and the internal structure of chunks is syntactically restricted, the linear order of chunks (and their grammatical functions) within the topological fields is varied, i.e. it varies with respect to certain factors which are not purely syntactical (cf. Eisenberg, 1999, sec. 13.1.2). From the parsing point of view, this is precisely the reason why the possible linearizations of GFs in a sentence are ambiguous and why we have chosen to handle this task with a component other than the one for the shallow structures. Since the case features of a potential GF cannot be fully disambiguated but just reduced using local information, the linearization of GFs can neither be fully determined by syntactic features nor by case features. Sentences 148–150¹ illustrate the varied linearizations of GFs which have to be covered by the parser. In the case at hand, only the chunk ‘das Bild’ is ambiguous since, as regards the case features, it could be ON or OA.

(148) [ON Der Schüler] hat [OD dem Lehrer] [OA das Bild] beschrieben.
The pupil has the teacher the picture described.
‘The pupil has described the picture to the teacher.’

(149) [OD Dem Lehrer] hat [ON der Schüler] [OA das Bild] beschrieben.
The teacher has the pupil the picture described.
‘The pupil has described the picture to the teacher.’

(150) [OA Das Bild] hat [OD dem Lehrer] [ON der Schüler] beschrieben.
The picture has the teacher the pupil described.
‘The pupil has described the picture to the teacher.’

There may, however, also be sentences in which all potential chunks are ambiguous as regards their case features. Sentence 151 shows such a sentence. From the point of view of the case features, ‘eine Schauspielerin’ can be ON or OA, ‘Peter’ may be any nominal GF (except genitive), and ‘die Regisseurin’ may be either ON or OA. The reading shown here is the most natural one, but ON and OA might also be reversed. This can be seen in sentence 152, in which ‘der Regisseur’ is unambiguously ON. Sentence 153, if compared to sentence 151, shows again that a morphological change in one potential GF can affect more than one GF. The reason for this is that names are

¹The three sentences have the same truth value but different functions from a pragmatic point of view. The three translations do not differ because the pragmatic aspect of the utterances cannot be covered without contextual information.

typically morphologically unmarked (except for genitive). This may lead to high ambiguity if more than one name occurs in a sentence. This is not uncommon in newspaper text.

(151) [ON Eine Schauspielerin] hat [OD Peter] [OA die Regisseurin] beschrieben.
An actress has Peter the director-fem described.
‘An actress has described the director to Peter.’

(152) [OA Eine Schauspielerin] hat [OD Peter] [ON der Regisseur] beschrieben.
An actress has Peter the director-masc described.
‘The director has described an actress to Peter.’

(153) [OD Einer Schauspielerin] hat [ON Peter] [OA die Regisseurin] beschrieben.
An actress has Peter the director-fem described.
‘Peter has described the director to an actress.’

Since the linearization of GFs is not syntactically restricted and their morphology cannot be reduced any further, it is necessary to detect the laws which determine the linearization of GFs in German and to scrutinize whether they can be integrated into an FS parser. Grammars of German agree that there are rules which determine the linearization of the GFs. As regards the order of complements and adjuncts, Engel (1996), sec. S 4.0, states: “Ihre [die der Folgeelemente] Anordnung unterliegt im Deutschen präzisen Regeln”.² It is, thus, more precise to speak of a varied linearization of GFs than of a free one. However, grammars do not agree about the extent to which these rules and their interaction can be covered. Eisenberg (1999), sec. 13.1.2, states: “Welche Faktoren auf welche Weise die Abfolgeregularitäten im Mittelfeld bestimmen und wie sie zusammenwirken, ist nur teilweise geklärt”.³ It is precisely the interaction of the factors or

²“In German, their [i.e. the complements and adjuncts] order is subject to precise rules.”
³“It is only partially clarified which factors determine the order regularities in the middle field, in which way, and how they interact.”

principles for the order of GFs in German which makes formalizing rules for the linearization of GFs a problem. In addition, both Engel (1996) and Eisenberg (1999) often speak of tendencies rather than rules when they refer to order principles. Lenerz (1977), sec. 1.3, posits an unmarked GF order (“unmarkierte Abfolge”). However, the GF order is only subject to “weak rules” and may, thus, be violated under certain conditions. If this is the case, it has to be scrutinized what these conditions are and how they can be integrated into the FS parser. Furthermore, Lenerz (1977), sec. 4.4.2, notices that the unmarked GF order is different with some verbs. Thus, it has to be checked whether this phenomenon can also be integrated into the parser. Höhle (1982) rejects the findings of Lenerz (1977) about the unmarked order of GFs as being of “very doubtful relevance” (“äußerst fragliche Relevanz”) because the findings are not based on assumptions of stylistically normal constituent orders but on structural assumptions which do not take into account pragmatic aspects. It has to be analyzed to what extent the objections in Höhle (1982) challenge the findings of Lenerz (1977) or whether Lenerz (1977) can still be used as a basis for the disambiguation of GF order. Furthermore, Eisenberg (1999), sec. 13.1.2, gives eight different principles for the order of GFs. Among these are the tendency of ONs to precede ODs and OAs and the tendency for theme to precede rheme. Further principles are ‘focus follows non-focus’, ‘pronoun precedes non-pronoun’, ‘definite precedes indefinite’, ‘animate precedes inanimate’, ‘departure precedes destination’ (as regards directions) and ‘longer constituents follow shorter ones’. As regards our task, the question is which of these principles can be made accessible to an FS parser.
One obvious problem is that some of the concepts used for phrasing the principles are relatively vague, like the theme–rheme dichotomy or the longer–shorter distinction, and another problem is that the principles refer to different areas of linguistic description like syntax, semantics, morphology or information structure.
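How such principles might be operationalized for disambiguation can be sketched as a scoring function over candidate GF assignments. This is an illustration only, with invented weights and chunk features; which principles the parser actually uses, and how they interact, is the subject of section 9.5:

```python
# Sketch: score candidate linearizations of GFs by two of the
# FS-accessible ordering preferences mentioned above: 'ON precedes OD
# and OA' and 'pronoun precedes non-pronoun'. Equal weights and the
# chunk features are invented for the illustration.

GF_RANK = {"ON": 0, "OD": 1, "OA": 2}   # unmarked order ON < OD < OA

def violations(candidate):
    """Count violated preferences; a candidate is a list of
    (gf, is_pronoun) pairs in surface order."""
    count = 0
    for i in range(len(candidate)):
        for j in range(i + 1, len(candidate)):
            gf_i, pron_i = candidate[i]
            gf_j, pron_j = candidate[j]
            if GF_RANK[gf_i] > GF_RANK[gf_j]:
                count += 1               # unmarked GF order violated
            if not pron_i and pron_j:
                count += 1               # non-pronoun before pronoun
    return count

# Two readings of a morphologically ambiguous middle field:
# prefer the reading with fewer violations.
reading_a = [("ON", False), ("OD", True), ("OA", False)]
reading_b = [("OA", False), ("OD", True), ("ON", False)]
best = min([reading_a, reading_b], key=violations)
print(best is reading_a)   # the near-unmarked order wins
```

The sketch also makes the cited problems concrete: features like ‘theme’ or ‘focus’ have no such simple boolean test, and combining counts from syntax, morphology and information structure presupposes a weighting that the grammars themselves do not supply.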

Treatment of missing SC frames Since a SC lexicon only contains a finite number of lexical items, there will always be lexical items in the corpus for which there is no SC frame. The reason for this may be that the lexicon is incomplete or that the lexical item is a neologism. Thus, a strategy has to be found to deal with those cases. There may be different solutions to this problem: Lexical items without a SC frame could be assigned default

SC frames. This process could be enriched by semantic information about classes of verbs. On the other hand, the problem could be approached from the perspective of the potential GF, and one could try to assign GFs independently of the items which sub-categorize for them. Furthermore, the relation between the component using SC frames from the lexicon (henceforth lexical component) and the component working without SC frames (henceforth non-lexical component) has to be specified. The question is whether the non-lexical component should solely work on sentences not treated by the lexical component or whether it should also refine sentences already seen by the lexical component.

9.2 Converting Lexical Entries into FSTs

We extract the lexical entries which we convert into FSTs from the IMSLex electronic lexicon, which contains SC frames for verbs and adjectives (among other information) (cf. Eckle-Kohler, 1999). There are 32,597 SC frames for 12,688 verbs (2.6 on average) and 3,314 SC frames for 2,224 adjectives (1.5 on average) in the IMSLex version we use (cf. figure 9.2 for an example). In order to integrate the sub-categorization information from the IMSLex into the FSTs, it is necessary to transform the lexical entries into the formalism of the FSAs. Figure 9.2 shows a sample lexical entry for the verb ‘abdrehen’. The entry contains various pieces of information. The lemma contains information about the morphological structure of the verb. In the case at hand, it shows that ‘ab’ is a separable verb affix (abtrennbarer Verbzusatz). Thus, the verb may be distributed over both parts of the sentence bracket. The main information for our purposes is the set of GFs for which the verb sub-categorizes, which are encoded as a feature combination of function (e.g. subject), category (e.g. NP) and case (e.g. nominative). Optional GFs are not explicitly marked. In those cases, the verb is coded once with and once without the optional GF. Figure 9.3 shows the lexical entry in our format. Each string separated by a blank represents a SC frame as used by our parser. The string ‘ONOA’, e.g., shows that the verb sub-categorizes for an ON and an OA. Some of the concepts in IMSLex had to be transferred into the relevant concepts in our system, e.g. subject into nominative object. We left some information implicit, e.g. that an ON is realized by an NP but an OPP by a PP. We also discarded some information, e.g. the haben or sein variants of the auxiliary

ab#drehen haben RULE1 (subj(NP_nom),arg(PRON_refl-acc)) SADAW
ab#drehen haben RULE1 (subj(NP_nom),arg(PRON_refl-acc),p-obj(PP_von_dat)) SADAW
ab#drehen haben RULE1 (subj(NP_nom),obj(NP_acc)) HGC
ab#drehen haben RULE1 (subj(NP_nom),obj(NP_acc)) SADAW
ab#drehen haben RULE1 (subj(NP_nom),obj(NP_acc),iobj(NP_dat)) HGC
ab#drehen haben-variant SADAW (subj(NP_nom)) HGC
ab#drehen sein-variant SADAW (subj(NP_nom)) HGC

Figure 9.2: Lexical entry for the verb ‘abdrehen’ in IMSLex format

ab#drehen :: ON ONOA ONODOA ONOA refl ONOA reflOPP von

Figure 9.3: Lexical entry for ‘abdrehen’ in our format

and the origin of the lexical entry, e.g. HGC for “Huge German Corpus”. Some SC frames could thus be collapsed into one. In the case of OPPs, the preposition they are restricted to is specified, and in the case of OSs the subordinator. Some GFs are further specified if they, e.g., require the head word to be a reflexive pronoun or dummy ‘es’. A central question for the conversion of lexical entries into FSTs is how to link the verb with the respective SC frame. One possibility is to convert every verb–SC frame pair into FSTs which match the various linearizations of the GFs of this verb–SC frame pair. In the case of the verb ‘abdrehen’ in figure 9.3, there would be five sets of FSTs. However, this procedure would yield a far too large grammar and is, thus, not feasible. A straightforward solution to linking the verb with its respective SC frame(s) is to tag the verb with the SC frame(s), just like tokens are tagged with PoS tags. Just as the shallow syntactic structures can be annotated using only the PoS tags of the tokens without taking into account the lexical items, the FSTs for the annotation of GFs can be applied taking into account only the sub-categorization potential of a verb as it is included in its SC frame tag. The SC frame tagging is a simple lexical look-up which can be integrated into the parser as one FST. As a result, just one set of FSTs per SC frame has to be generated and the grammar is kept much smaller. Furthermore, this procedure keeps the parser modular and flexible since it allows the integration of more than one SC lexicon. Figure 9.4 shows a schematic representation of the transformation of SC frames into annotation patterns which serve as the grammar of the FSTs.

Figure 9.4: Transformation of SC frames into annotation patterns for FSTs
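The first step of this transformation, mapping IMSLex entries (figure 9.2) to the compact frame tags (figure 9.3), can be sketched as follows. This is a simplified illustration: the mapping table covers only the argument types occurring in the ‘abdrehen’ entry, the underscore spellings of the frame tokens are an assumed rendering, and the actual conversion is integrated into the parser as an FST rather than run as a script:

```python
# Sketch of the IMSLex-to-frame-tag conversion: each lexical entry line
# contributes one SC frame string; duplicate frames (e.g. from the HGC
# and SADAW sources) collapse into one. Mapping and token spellings
# are illustrative.
import re

ORDER = ["ON", "OD", "OA", "OPP"]   # canonical GF order within a frame tag

def convert_arg(arg):
    """Map one IMSLex argument like 'subj(NP_nom)' to a frame token."""
    if arg.startswith("subj(NP_nom"):
        return "ON"
    if arg.startswith("obj(NP_acc"):
        return "OA"
    if arg.startswith("iobj(NP_dat"):
        return "OD"
    if arg.startswith("arg(PRON_refl-acc"):
        return "OA_refl"                    # reflexive accusative object
    m = re.match(r"p-obj\(PP_(\w+?)_", arg)
    if m:
        return "OPP_" + m.group(1)          # OPP with its fixed preposition
    raise ValueError("unknown argument: " + arg)

def rank(token):
    return next(i for i, p in enumerate(ORDER) if token.startswith(p))

def convert_entry(lemma, lines):
    """Collapse the IMSLex lines of one verb into a list of frame tags."""
    frames = []
    for line in lines:
        args = re.findall(r"[\w-]+\([^()]*\)", line)
        frame = "".join(sorted((convert_arg(a) for a in args), key=rank))
        if frame not in frames:             # duplicate sources collapse
            frames.append(frame)
    return lemma + " :: " + " ".join(frames)

entry = [
    "(subj(NP_nom),arg(PRON_refl-acc))",
    "(subj(NP_nom),arg(PRON_refl-acc),p-obj(PP_von_dat))",
    "(subj(NP_nom),obj(NP_acc))",
    "(subj(NP_nom),obj(NP_acc))",           # HGC and SADAW duplicate
    "(subj(NP_nom),obj(NP_acc),iobj(NP_dat))",
    "(subj(NP_nom))",
]
print(convert_entry("ab#drehen", entry))
```

As described above, the resulting five distinct frames are then attached to the verb as tags by a lexical look-up rather than compiled into five separate sets of FSTs.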

The SC frames are extracted from the IMSLex and all variants of the linearization of GFs are computed. These variants are inserted into different patterns which match the already existing shallow annotation structure. This yields 6,164 patterns: 1,205 for VL clauses and 4,959 for V1/V2 clauses. (Another 2,812 patterns are yielded for the support verb component: 583 for VL clauses and 2,229 for V1/V2 clauses.) If these patterns are applied to a sentence, each of the recognized target chunks will be assigned its respective GF label.⁴ Thus, the shallow annotation structure serves as an ideal pre-requisite to keep the complexity of the GF annotation component as low as possible. Chunks serve as potential targets of GFs, and the topological fields and clause structure reduce and define the search space. The annotation pattern in figure 9.5 and the sentence in figure 9.6 may serve as an example to illustrate the connection between the SC frame, the annotation patterns and the structures to be matched. The pattern matches a sequence of elements in the shallow annotation structure which has the potential to be the realization of the GFs in the respective SC frame.

⁴NB: Optional elements are also treated by the parser but left out here for reasons of clarity.
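The step of computing the linearization variants of a frame and inserting them into clause-type templates (figure 9.4) can be sketched with permutations. This is a toy illustration: the real pattern inventory also covers optional elements, diatheses and further clause functions, which is why the counts above are much larger, and the template element names are invented:

```python
# Sketch: compute all linear orders of the GFs in a SC frame and insert
# them into two clause-type templates, as in figure 9.4. Element names
# ('CF', 'VCL:...', 'VCR:...') are illustrative.
from itertools import permutations

def variants(frame):
    """All linear orders of the GFs in a frame."""
    return [list(p) for p in permutations(frame)]

def patterns_for(frame):
    pats = []
    for order in variants(frame):
        # VL clause: all GFs in the MF, SC-tagged verb in the right bracket
        pats.append(["CF"] + order + ["VCR:" + "".join(frame)])
        # V2 clause: first GF in the VF, finite verb second, rest in the MF
        pats.append([order[0], "VCL:" + "".join(frame)] + order[1:])
    return pats

pats = patterns_for(["ON", "OD", "OA"])
print(len(pats))   # 3! orders x 2 clause types = 12 patterns
```

Even this toy version shows why the full inventory runs into the thousands once optional GFs, diatheses and further clause types multiply the 3! orders per frame.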

Figure 9.5: Pattern with integrated SC frame for V2 clause

If the pattern matches, the GFs are assigned. The first element in the pattern matches a noun chunk in the VF with the feature ‘nom’ in its morphological ambiguity class. This element matches the chunk ‘der Krieg’ in figure 9.6. If the whole pattern is matched, then this element assigns the GF ON to ‘der Krieg’. The following element matches a verb chunk as the left part of the sentence bracket which contains an auxiliary verb. This element just matches a structure but does not assign anything. The following two elements match the potential OD ‘ihm’ and the potential OA ‘die Jugendjahre’ in the MF. The last element matches a verb chunk as the right part of the sentence bracket which contains a verb which is SC-tagged ‘ONODOA’. This is true of ‘gestohlen’ in figure 9.6 (which is the past participle of ‘stehlen’) because the SC frame ‘ONODOA’ is one of the frames it sub-categorizes for. Patterns like the one in figure 9.5 have to be realized for all possible instantiations of a SC frame. The patterns are simple sequences of elements. The elements are realized as feature combinations. Some of those feature combinations differ depending on the actual pattern. The sub-categorizing verb can, for instance, be realized in the left part of the sentence bracket or in the right one. Some of the information, however, is constant. This is the information contained in the SC frame. The context of the GFs may be different, as the pattern in figure 9.7 and the sentence in figure 9.8 show, but the GFs themselves and the verb are an invariant set. Figure 9.7 shows the pattern which matches the sentence in figure 9.8. The first element matches the subordinator in the CF. The feature ‘fin’ means that the element just matches subordinators introducing finite VL clauses, i.e. not the subordinator ‘um’, which introduces non-finite VL clauses. The following three elements match the GFs in the MF of the clause.
The last element matches the right part of the sentence bracket which contains the verb, which has been assigned the SC frame for ON OD OA. Further patterns for annotation would include other functions of clauses like non-finite or relative VL clauses or other types of clauses like V1 clauses. The annotation of diatheses like the passive plays a special role in the transformation of lexical entries into FSTs. While in the pattern discussed

[Tree display for figure 9.6: a V2 clause with the fields VF and MF over the chunk sequence NC – VCLAF – NC – NC – VCRVP for ‘Der Krieg habe ihm die "Jugendjahre gestohlen".’ (STTS tags: ART NN VAFIN PPER ART $( NN VVPP $( $.)]

[ON Der Krieg] habe [OD ihm] [OA die Jugendjahre] gestohlen.
The war have him the days of his youth stolen.
‘The war has stolen him the days of his youth.’

Figure 9.6: Instantiation of the SC frame ‘ON OD OA’ in a V2 clause
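The element-by-element matching just described can be illustrated with a small Python sketch. This is only an illustration of the idea, not the actual FST encoding of the system: the chunk representation, the feature names (`field`, `cases`, `sc`) and the helper `elem` are hypothetical.

```python
def elem(field=None, cat=None, case=None, sc=None, assign=None):
    """One pattern element: a feature test plus an optional GF to assign."""
    def test(chunk):
        return ((field is None or chunk.get("field") == field) and
                (cat is None or chunk.get("cat") == cat) and
                (case is None or case in chunk.get("cases", set())) and
                (sc is None or sc in chunk.get("sc", set())))
    return test, assign

# Pattern with integrated SC frame 'ON OD OA' for a V2 clause (cf. figure 9.5)
PATTERN_V2_ONODOA = [
    elem(field="VF", cat="NC", case="nom", assign="ON"),
    elem(cat="VCL"),                      # left sentence bracket (auxiliary)
    elem(field="MF", cat="NC", case="dat", assign="OD"),
    elem(field="MF", cat="NC", case="acc", assign="OA"),
    elem(cat="VCR", sc="ONODOA"),         # right bracket, SC-tagged verb
]

def apply_pattern(pattern, chunks):
    """Assign GFs only if the whole pattern matches the chunk sequence."""
    if len(chunks) != len(pattern):
        return None
    for (test, _), chunk in zip(pattern, chunks):
        if not test(chunk):
            return None                   # one failing element blocks all GFs
    return [gf for (_, gf) in pattern]
```

Applied to the chunks of figure 9.6, the pattern would yield the assignments ON, OD and OA for the three noun chunks; ‘der Krieg’ receives ON only because every element, including the SC check on ‘gestohlen’, succeeds.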

Figure 9.7: Pattern with integrated SC frame for VL clause

[Tree display for figure 9.8: a SUB (VL) clause with the field MF over the chunk sequence CF – NC – NC – PC – NC – VCRVF for ‘wenn ich ihr dafür ein paar Postgeheimnisse verrate’ (STTS tags: KOUS PPER PPER PAV ART PIDAT NN VVFIN)]

wenn [ON ich] [OD ihr] dafür [OA ein paar Postgeheimnisse] verrate
if I her for that a few secrecies of the post give away
‘if I give away some secrecies of the post to them [i.e. the mafia]’

Figure 9.8: Instantiation of the SC frame ‘ON OD OA’ in a VL clause

wenn [OD ihr] dafür [ON ein paar Postgeheimnisse] [OPP von mir] verraten würden
if her for that a few secrecies of the post by me given away were
‘if some secrecies of the post were given away to them [i.e. the mafia] by me’

Figure 9.9: Pattern for passive with integrated SC frame for VL clause

above the number and type of GFs are constant⁵ and just the linearization changes, in the passive of the SC frame ‘ON OD OA’ there is an ON, an OD and an optional OPP with the preposition ‘von’, as illustrated in the pattern in figure 9.9. The types of GFs thus differ. The fact that the OA of the active clause has become the ON of the passive clause and the ON of the active clause the OPP of the passive clause is of no importance for the structure of the annotation pattern. Figure 9.9 shows one annotation pattern for the passive of the frame ‘ON OD OA’ in a finite VL clause. The SC frame, which is checked in the right part of the sentence bracket, is still ‘ON OD OA’. However, the feature ‘pass’ shows that the pattern just matches those verbal structures which express a passive. The GFs which are checked by the pattern are those corresponding to the passive diathesis of the frame ‘ON OD OA’ as figure 9.9 shows (which also gives the passive version of the sentence in figure 9.8 as an example). The element with the feature combination ‘PC, MF, pass’ matches an optional OPP with the preposition ‘von’. A central question as regards the construction of the grammar for the annotation of the GFs is whether the process which computes the variants of the linearization of the GFs and the process which inserts them into the patterns which match the different types and functions of clauses (cf. figure 9.4) can be automated. In principle, this is possible. It is, however, not clear whether an automatic conversion gives the system an advantage over a manual conversion. On the one hand, with an automatic conversion, it is possible to generalize over the linearization of GFs in different SC frames and in different types of clauses. On the other hand, this generalization is not always possible because the linearization of GFs differs depending e.g. on the type of clause and the diathesis.
In V2 clauses, which have a VF, the linearization is different from VL clauses, which do not have a VF. Furthermore, in passive clauses the linearization is different from active clauses. We have, thus, chosen to perform the processes which transform the lexical entries into FSTs manually. This gives us full control over the different instantiations of the lexical entries, which is a clear advantage for the maintainability of the system, since, in case of errors in the test corpus, it can be checked which rule has caused the error and, in the ideal case, the error can be mended. This would be more complicated if the rules were automatically generated because it would not be clear which mechanism had generated the rule in question. The individual

⁵ An exception is the non-finite clause in which the ON is not realized.

ziehen :: ONOAOPP in (Betracht|Erwägung|Mitleidenschaft|Vertrauen|Zweifel)

Figure 9.10: Lexical entry for support verb ‘ziehen’

[ON AEG] zieht [OA eine Klage] [OPP in Betracht].
AEG takes a lawsuit into consideration.
‘AEG takes a lawsuit into consideration.’

Figure 9.11: One annotation pattern for the support verb ‘ziehen’

transformation of the lexical entries allows this modular treatment of the problem, which includes advantages like easier maintainability and also easier evaluation. If rules are generated by a complex mechanism, then it is much harder to single out certain rules to evaluate their effect. It is, thus, a lot better to keep the interaction between different components low and to keep components as modular as possible. A special case of the conversion of lexical entries into FSTs is the support verbs. Since support verbs are not included in IMSLex, a special lexicon was compiled. We used Helbig and Buscha (1999), which provides an extensive list of support verbs and their SC frames. It is not possible to use the same strategy as with the lexical verbs for the support verbs because the SC frames for the support verbs are lexically restricted (cf. figure 9.10). In fact, the relation between verb and SC frame is a one-to-one relation. Thus, a set of annotation patterns had to be generated for every single support verb. This was done automatically using the lexical entries as in figure 9.10 and the annotation patterns for the respective lexical verbs, which were automatically enriched with the lexical information for the support verbs. Figure 9.11 gives an example: The pattern is structured like the one for the SC frame ‘ON OA OPP in’. However, instead of the SC frame of the verb, the verb itself is encoded in the element matching the verb chunk. Furthermore, the PC matching the OPP is restricted to the preposition ‘in’ and the few nouns which the support verb sub-categorizes for. Since there is a set of FSTs for every support verb, the support verb component is considerable in size. It comprises 27% of all annotation patterns in the system.
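The generation step can be illustrated with a small sketch that parses an entry of the form shown in figure 9.10 into the pieces needed to restrict a pattern. The entry syntax follows the figure; the function name and the dictionary layout are hypothetical.

```python
import re

def parse_support_entry(line):
    """Split a support-verb entry 'verb :: FRAME prep (noun|noun|...)'
    into verb, SC frame, preposition and the lexically licensed nouns."""
    verb, rest = [part.strip() for part in line.split("::")]
    frame, prep, nouns = re.match(r"(\S+)\s+(\S+)\s+\((.*)\)", rest).groups()
    return {"verb": verb, "frame": frame, "prep": prep,
            "nouns": nouns.split("|")}

entry = parse_support_entry(
    "ziehen :: ONOAOPP in (Betracht|Erwägung|Mitleidenschaft|Vertrauen|Zweifel)")
```

A pattern generated from such an entry would then match the verb ‘ziehen’ itself (instead of an SC tag) and a PC restricted to ‘in’ plus one of the five licensed nouns.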

[ON Die Chefin] hat [OD ihm] [OA Aufmerksamkeit] gewährt
The boss-fem has him attention paid
‘The boss has paid him attention’

Figure 9.12: One structure may be matched by more than one pattern

9.3 Disambiguation by Ranking Rules

Ambiguity is one of the basic problems in NLP. Every grammar has to be able to deal with it. For our task, ambiguity arises because a shallow annotation structure which is to be annotated with GFs can in many cases be matched by more than one of the patterns which have been described in section 9.2. This can be illustrated by the structure in figure 9.12, which can be matched by at least two different annotation patterns. If the patterns for the SC frame ‘ON OD OA’ are applied, the NCs ‘die Chefin’ and ‘Aufmerksamkeit’ can both be either the ON or the OA because they both include the case features nominative and accusative in their ambiguity class. There are, thus, at least two possible analyses for the sentence. Since we want to output just one analysis, we need to decide on one of them. There are two important points to consider here: the criteria for the decision, and the mechanism by which these criteria are implemented. There are various ways to decide between different analyses for one structure: One way is to apply a machine learning approach, learn from instances in a corpus and then decide on the most probable one. In this case, the instances are the decision criteria and the calculation of the most probable one is the decision mechanism. Another approach is to apply constraints to certain linguistic phenomena and to rank and/or count the constraints and, thus, decide for a certain analysis. Then the constraints are the criteria and the calculation of them is the mechanism. In our approach, the criteria are linguistic knowledge about e.g. the unmarked order of GFs, and

the disambiguation mechanism is a ranked order of FSTs in which e.g. the most unmarked annotation pattern is tried first and the following annotation patterns are applied in an order of increasing markedness until one of them matches. We have decided in favour of this mechanism because it can easily be integrated into an FS approach. The annotation patterns are arranged in a cascade of FSTs which are ranked according to the specified criteria and which are applied until one of them matches (cf. figure 9.13). After this, further annotation is blocked. In contrast to some approaches, the parser thus generates just one analysis. There is also no component which calculates the correct analysis since the decision is simply inherent in the cascade of FSTs.

Figure 9.13: Ranked cascade of FSTs
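The decision regime of the cascade — first match wins and blocks further annotation — can be sketched as follows. The pattern interface is hypothetical; the real patterns are FSTs, not Python functions.

```python
def annotate(sentence, ranked_patterns):
    """Try the annotation patterns in order of increasing markedness;
    the first matching pattern determines the single analysis."""
    for pattern in ranked_patterns:       # most unmarked pattern first
        analysis = pattern(sentence)
        if analysis is not None:
            return analysis               # further annotation is blocked
    return None                           # no pattern matched
```

There is no scoring step: the ranking itself encodes the decision, which is why no separate component has to compute the correct analysis.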

9.4 A Longest-Match Strategy for the Disambiguation of SC Frames

Since a verb can sub-categorize for more than one SC frame, a strategy has to be found with regard to which SC frame to apply in such cases. Some SC frames are disjunct, i.e. the GFs contained in them are not identical. In those cases, the annotation is not ambiguous. In other cases, however, the complements of one SC frame constitute a subset of the complements of another frame and, consequently, one structure could be matched by patterns which belong to more than one frame. The different analyses of the sentence in (154) and (155) give an example: The verb ‘halten’ can occur with the SC

frame ‘ON OA OPP’, and it can occur with the SC frame ‘ON OA’ (cf. figure 9.14). We, thus, rank the patterns for the SC frame ‘ON OA OPP’ higher in the hierarchy of the annotation patterns than the patterns of the SC frame ‘ON OA’ so that they are the preferred annotation patterns. In the annotation cascade these patterns will then be tried first (cf. figure 9.15) and, if one of them applies, further annotation is blocked. This strategy can be compared to the longest-match strategy in chunking. It avoids losing complements like the OPP ‘für eine tragbare Lösung’ in (154).
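In effect, the longest-match ranking amounts to sorting a verb's SC frames by decreasing number of complements before their patterns enter the cascade; a minimal sketch:

```python
def rank_frames(frames):
    """Order SC frames (tuples of GFs) so that frames with more complements
    are tried first, e.g. ('ON', 'OA', 'OPP') before its subset ('ON', 'OA')."""
    return sorted(frames, key=len, reverse=True)
```

With the entry for ‘halten’, this yields the ranking of figure 9.15: the subset frame ‘ON OA’ is only reached if no pattern of the larger frame matches, so the OPP is not lost when it is present.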

halten :: ONOAOPP für
          ONOA

Figure 9.14: The Lexical Entry for the Verb ‘halten’ (extract)

(154) Angesichts der unterschiedlichen Nutzungsinteressen halte [ON er] [OA das] [OPP für eine tragbare Lösung].
In the light of the various utilization interests regarded he that as an acceptable solution.
‘In the light of the various utilization interests, he considered this an acceptable solution.’

(155) Angesichts der unterschiedlichen Nutzungsinteressen halte [ON er] [OA das] für eine tragbare Lösung.
In the light of the various utilization interests regarded he that as an acceptable solution.
‘In the light of the various utilization interests, he considered this an acceptable solution.’

9.5 Using Markedness to Disambiguate GF Order

Since the potential GFs are not fully morphologically disambiguated, we need further criteria to disambiguate them. We have decided to use the concept of markedness as the crucial criterion for the disambiguation of GFs. Since our disambiguation mechanism is the ranking of the more preferred linearizations over the less preferred ones, the most unmarked linearizations are applied first and the remaining linearizations are applied in the order of increasing markedness. This is done for every single SC frame (cf. figure 9.15).
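As a rough illustration of this ranking, one can order the linearizations of a frame by how far they deviate from the unmarked order ‘ON OD OA OPP’. Counting pairwise inversions, as in the sketch below, is our own simplified proxy: the actual ranking in the system is established linguistically (section 9.5), not computed.

```python
from itertools import permutations

UNMARKED = ["ON", "OD", "OA", "OPP"]   # cf. Eisenberg (1999), sec. 13.1.2

def markedness(order):
    """Number of GF pairs that are inverted w.r.t. the unmarked order."""
    pos = {gf: i for i, gf in enumerate(UNMARKED)}
    return sum(1
               for i in range(len(order))
               for j in range(i + 1, len(order))
               if pos[order[i]] > pos[order[j]])

def ranked_linearizations(frame):
    """All linearizations of an SC frame, least marked first."""
    return sorted(permutations(frame), key=markedness)
```

For the frame ‘ON OD OA’, this places the unmarked linearization first and the fully inverted ‘OA OD ON’ last, mirroring the order in which the corresponding FSTs would be tried.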

Figure 9.15: Decreasing number of complements and increasing syntactic markedness in the ranking of FSTs

There are two questions with regard to the concept of markedness and the annotation of GFs. The first question is how one should find the unmarked order of GFs and the second question is why one should take the concept of markedness as a criterion for the disambiguation of GFs. As regards the first question, Lenerz (1977) gives a precise definition:

Wenn zwei Satzglieder A und B sowohl in der Abfolge AB wie in der Abfolge BA auftreten können, und wenn BA nur unter bestimmten, testbaren Bedingungen auftreten kann, denen AB nicht unterliegt, dann ist AB die “unmarkierte Abfolge” und BA die “markierte Abfolge”. (Lenerz, 1977, p. 27)

If two constituents A and B can occur in the sequence AB and in the sequence BA, and if BA can only occur under defined and testable conditions which AB is not subject to, then AB is the “unmarked order” and BA the “marked order”. (our translation)

The definition can be illustrated by the examples given in Lenerz (1977). The assumption is that the indirect object (OD) precedes the direct object (OA) in the unmarked GF order. Lenerz (1977) gives two sets of example sentences:⁶

(2) [OD Wem] hast [ON Du] [OA das Geld] gegeben?
Whom have you the money given?
‘To whom did you give the money?’

(2) a) [ON Ich] habe [OD dem Kassierer] [OA das Geld] gegeben.
I have the cashier the money given.
‘I gave the money to the cashier.’

(2) b) [ON Ich] habe [OA das Geld] [OD dem Kassierer] gegeben.
I have the money the cashier given.
‘I gave the money to the cashier.’

(3) [OA Was] hast [ON Du] [OD dem Kassierer] gegeben?
What have you the cashier given?
‘What did you give to the cashier?’

(3) a) [ON Ich] habe [OD dem Kassierer] [OA das Geld] gegeben.
I have the cashier the money given.
‘I gave the money to the cashier.’

(3) b) ?* [ON Ich] habe [OA das Geld] [OD dem Kassierer] gegeben.
I have the money the cashier given.
‘I gave the money to the cashier.’

⁶ The sentences are taken from Lenerz (1977), p. 43. The rheme is highlighted by italics. We have added our GF annotation.

Figure 9.16: Unmarked order of GFs

Sentence (2) is a question asking for the OD, and sentence (3) is a question asking for the OA. Consequently, in sentences (2) a) and b), the OD is the rheme, i.e. it is the constituent in the sentence which expresses the largest amount of extra meaning (cf. Crystal, 1997, → rheme), and in sentences (3) a) and b), the OA is the rheme (highlighted by italics). Sentences (2) a) and b) show that the OD and the OA can occur in either sequence. If, however, the OA is the rheme, the sequence ‘OA OD’ is only marginally grammatical. Thus, the linearization ‘OA OD’ can only occur under defined and testable conditions (‘Thema-Rhema-Bedingung’, theme–rheme condition) which the linearization ‘OD OA’ is not subject to, i.e. it can only occur if OA is not the rheme. Thus, the linearization ‘OD OA’ is the unmarked order. The ‘Definitheitsbedingung’ (condition of definiteness) is another test which confirms the order ‘OD OA’ as the unmarked order. This test shows that the order ‘OA OD’ is only possible under the condition that the OA is not indefinite and is, thus, the marked order (cf. sentence (19) b) from Lenerz (1977), p. 54). These tests can be extended such that finally the unmarked order of GFs is ‘ON OD OA OPP’ (cf. Eisenberg, 1999, sec. 13.1.2).

(19) b) * [ON Ich] habe [OA ein Buch] [OD dem Schüler] geschenkt.
I have a book the pupil given.
‘I gave the pupil a book as a present.’

Höhle (1982) criticizes Lenerz (1977) for not taking the pragmatic aspects of a sentence into consideration when determining the unmarked GF order. According to Höhle (1982), p. 134, in sentences like (2) a) and b), both GF orders are stylistically normal (i.e. unmarked). Sentence (2) a), however, has stylistically non-normal accent. Furthermore, Höhle (1982), p. 135, criticizes that sentence (3) b) is not tested in any other context than the one provided and, thus, not put to the pragmatic test. Still, we follow Lenerz (1977) to determine the unmarked order of GFs since we cannot take pragmatic aspects into consideration because they would take us far beyond the FS formalism. Furthermore, Höhle (1982), p. 137, admits that Lenerz (1977) shows the GF

order which is “subject to no structural constraints” (“keinen strukturellen Einschränkungen unterliegen”). Since the unmarked order of GFs can be precisely defined with Lenerz (1977), there remains the question of why one should take the concept of markedness as the criterion for the disambiguation of GFs. The reason is that this criterion is the only one which directly relates to the order of the GFs in the sentence. All other criteria may only be applied on the basis of the unmarked order. Sentence (2) a) from Lenerz (1977), for instance, is the unmarked order of the SC frame ‘ON OD OA’, although the OD is the rheme and there is a tendency for the rheme to come after the theme. In sentence (2) b), this is, in fact, the case and the unmarked order is changed. Thus, the theme–rheme condition applies on the basis of the unmarked word order and can be an indicator for the existence of a marked order. It is, however, not possible to make predictions about the order of GFs independent of the unmarked GF order. Furthermore, as sentence (2) a) has shown, the theme–rheme condition may also have no effect at all. The problem that these principles may only make predictions about the order of GFs on the basis of the unmarked order is inherent in all the principles (cf. Eisenberg, 1999, sec. 13.1.2) influencing the GF order – apart from the fact that, in order to apply them, one is in need of information exceeding the syntactic and morphological information we are using. The theme–rheme principle, the focus–non-focus principle, the animate–inanimate principle and the departure–destination principle require at least semantic information; but we do not want to make use of such information because it would considerably enlarge the size of the parser and would thus slow it down. However, even the principles which might be captured with the information available to us, i.e. the pronoun–non-pronoun principle and the definite–indefinite principle, do not enhance the disambiguation process, as sentences 156–161 show.

(156) Sieht [ON der Vater] [OA den Sohn]?
Sees the father the son?
‘Does the father see the son?’

(157) Sieht [OA ihn] [ON der Vater]?
Sees him the father?
‘Does the father see him?’

(158) Sieht [ON der Vater] [OA ihn]?
Sees the father him?
‘Does the father see him?’

(159) Sieht [ON|OA die Mutter] [OA|ON die Tochter]?
Sees the mother the daughter?
‘Does the mother see the daughter?’

(160) Sieht [ON|OA sie] [OA|ON die Mutter]?
Sees her the mother?
‘Does the mother see her?’

(161) Sieht [ON|OA die Mutter] [OA|ON sie]?
Sees the mother her?
‘Does the mother see her?’

In sentences 156–158 the situation is clear because the potential GFs are morphologically unambiguous. Sentence 156 shows the unmarked GF order with non-pronominal noun chunks. This GF order is only altered if any of the aforementioned principles can be applied. This is, for instance, the case in sentence 157, in which the OA precedes the ON because the OA is a pronoun. However, the unmarked word order is still possible, as sentence 158 shows. This might be the result of another principle conflicting with the pronoun–non-pronoun principle, e.g. the focus–non-focus principle. Thus, even if we apply the principles which are within the scope of our annotation system, it is not guaranteed that we succeed in predicting the correct GF order. Sentences 159–161 show a more fundamental problem with the principles. In contrast to sentences 156–158, in sentences 159–161 the GF order is ambiguous because both NCs may be either nominative or accusative. However, the unmarked GF order is ‘ON OA’. In sentence 160, the first NC is pronominal. If we regard sentence 160 in contrast to sentence 159, we might conclude that the GF order is ‘OA ON’ according to the pronoun–non-pronoun principle. However, in a corpus, sentences do not occur together with their permutations. If we regard sentence 160 independent of sentence 159, we might just as well assume that it is the ON which was pronominalized and that the sentence is still in its unmarked order. Thus, in case of ambiguity, the principles which only indirectly apply to the GF order do not add any

further information to the disambiguation process. Therefore, we only use the unmarked GF order as a criterion for GF disambiguation.

(162) [OD Hamburg] droht [ON Neuwahl]
Hamburg threatens re-election
‘Hamburg threatened with rerun of election’

Lenerz (1977) furthermore points out that there are some classes of verbs like ‘psychic verbs’ (cf. Lenerz, 1977, sec. 4.2.2) which do not adhere to the typical unmarked GF order. An example can be seen in sentence 162, in which the GF order is ‘OD ON’. This is, in fact, the unmarked GF order of the verb ‘drohen’. The reason for the untypical GF order is that the ‘Mitteilungszentrum’ (center of information) is in the OD. This difference is not coded in the IMSLex. A further problem is that some of those verbs have more than one reading, which cannot be distinguished without taking semantic information into consideration. The evaluation will show to what extent the typical unmarked GF order is still an adequate concept for the annotation of GFs.

9.6 Extending Disambiguation to Non-Lexical Patterns

In the previous sections, we have described how we annotate GFs using FSTs and SC frames. The SC frames for lexical verbs were taken from the IMSLex and the SC frames for the support verbs from Helbig and Buscha (1999). However, for some verbs in the corpus, there are no SC frames in the lexicon and, thus, the parser fails to assign GFs. Since we intend to construct a robust parsing system, we have to deal with those cases as well. While the lexical component uses the SC frames stored in the lexicon to trigger off the annotation of grammatical functions, the non-lexical component does not use any such kind of information. It solely relies on the morphological information disambiguated by previous components and on linearization features. Since the lexical component relies on more specific information because it uses SC frames in addition to morphological and linearization features, it can annotate GFs more reliably. Thus, it is applied first, following an easy-first parsing strategy. Since the support verb component works with

Figure 9.17: Annotation cascade

lexically restricted SC frames and is, thus, more specific than the lexical verb component, it is applied first in the lexical component. The non-lexical component is applied last and only if the lexical component fails and no annotation has been added to a structure (cf. figure 9.17). In contrast to the other two components, the non-lexical component has just very few rules, i.e. 41, which is about 0.5% of the rules of the whole GF parser. (6164 patterns are generated from the IMSLex and 2812 from the support verb lexicon.) The component deals with nominative, accusative and dative objects and with predicatives. It does not deal with genitive objects since they are too rare to be annotated reliably without SC information. It does not deal with prepositional and sentential objects either, since these objects occur in similar contexts as homonymous adjuncts. Generally speaking, the non-lexical component does not deal with grammatical functions which are ambiguous with constituents which occur more often in another function. This is not the case for nominal chunks (except for genitives) since nominal chunks do not frequently occur as adjuncts.
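The ordering of the three components can be sketched as an easy-first cascade; the component functions below are stand-ins that return an analysis or None, not the actual FST components.

```python
def gf_parse(sentence, support_verb, lexical_verb, non_lexical):
    """Most specific component first; the non-lexical fallback runs only
    if no lexical pattern has annotated the structure (cf. figure 9.17)."""
    for component in (support_verb, lexical_verb, non_lexical):
        analysis = component(sentence)
        if analysis is not None:
            return analysis
    return None
```

This mirrors the easy-first strategy: the lexically restricted support-verb patterns are the most specific and are tried first, and the 41 non-lexical rules are only reached when both lexical stages fail.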

Figure 9.18: Inner structure of the non-lexical component

In the non-lexical component, two ranking strategies are pursued: At first, rules are ordered according to the frequency of the grammatical functions which they annotate, and, within the rules for one grammatical function, there is an annotation hierarchy in which rules are ordered according to the degree of ambiguity of the candidate constituent to be annotated. The sequence of annotation of the grammatical functions is ON, OA, OD. Within the modules for each function, the rules are applied until one of them matches. In between the modules, the architecture is a cascade of annotation, in which the output of the preceding module is the input to the following. That way, e.g. ONs are annotated first and OAs, which are very often ambiguous with ONs, afterwards, because ONs are more frequent and also more crucial for a sentence than OAs. In each module for a grammatical function, there are, at the top of the annotation hierarchy, all those NCs which contain tokens which can unambiguously be assigned a grammatical function. For ON, these tokens include the personal pronouns ‘ich, du, er, wir’, the interrogative pronoun ‘wer’ and the indefinite pronouns ‘irgendwer, jemand, man’. For OA, there is the personal pronoun ‘ihn’ and the reflexive pronoun ‘mich’. For OD, there are the personal pronouns ‘ihm, ihnen’, the relative pronoun and the article ‘dem’, and the token ‘mir’, which may be either a reflexive or a personal pronoun. Only after the annotation of these unambiguous grammatical functions does the annotation proceed with the more general rules in the annotation hierarchy, which use the morphological information annotated in the corpus. Figure 9.19 gives an overview of the annotation hierarchy for the grammatical function ON. With every rule, the constraints put on the annotation are relaxed. At first, the NC which is tested as the potential ON is checked for the feature nominative and agreement with the verb (first in the VF and then in the MF). Then, the NC is just checked for nominative. In the penultimate step, an NC which is the only NC in the sentence is annotated, since we assume that every sentence has a subject (except for sentences like ‘Mir ist schlecht’, which we annotate before). In the very last step, if still no ON is found in the sentence, an NC in the VF is annotated without checking any other features. This is done because the VF is the most typical position of the ON.

Figure 9.19: The ranked sequence of rules for ON
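The relaxation hierarchy for ON can be sketched as follows. The chunk dictionaries, field names and the function `find_on` are illustrative stand-ins, not the system's actual data structures or rules.

```python
UNAMBIGUOUS_ON = {"ich", "du", "er", "wir",        # personal pronouns
                  "wer",                            # interrogative pronoun
                  "irgendwer", "jemand", "man"}     # indefinite pronouns

def find_on(chunks, verb_number):
    """Apply the ranked rules for ON until one of them matches."""
    ncs = [c for c in chunks if c["cat"] == "NC"]
    # 1. token that can only be an ON
    for c in ncs:
        if c["text"].lower() in UNAMBIGUOUS_ON:
            return c
    # 2. nominative NC agreeing with the verb, VF before MF
    for field in ("VF", "MF"):
        for c in ncs:
            if (c["field"] == field and "nom" in c["cases"]
                    and c["number"] == verb_number):
                return c
    # 3. any NC with 'nom' in its ambiguity class
    for c in ncs:
        if "nom" in c["cases"]:
            return c
    # 4. the only NC in the clause (every sentence has a subject)
    if len(ncs) == 1:
        return ncs[0]
    # 5. last resort: any NC in the VF, the most typical ON position
    for c in ncs:
        if c["field"] == "VF":
            return c
    return None
```

Each later rule fires only if all earlier, stricter rules have failed, so the most constrained analysis always wins when it is available.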

9.7 Modeling Ranking by Using OT Constraints

The annotation philosophy of preferring less marked GF orders to more marked GF orders can be compared to an OT approach. In an OT approach as it is presented in Frank, Holloway King, Kuhn, and Maxwell (1998), different candidate analyses are generated. For these analyses, a record is kept which includes the ‘optimality marks’. These optimality marks are ranked on a scale. According to a fixed scheme, the respective analyses are then

evaluated with respect to the optimality marks which have been assigned to them.

Ranking and OT The main difference between this approach and the one in our system is that, in the OT approach, more than one analysis is generated and these different analyses are then evaluated according to the kind of optimality marks they are assigned. By contrast, in our system, every sentence just gets one analysis, but the rules with which the sentence is analyzed are relaxed with respect to the constraints which they apply. A fitting example to describe the difference between the OT approach as in Frank et al. (1998) and our approach is the problem of checking verb-subject agreement: Obviously, checking verb-subject agreement helps to enhance the accuracy of annotation. There may, however, be cases in which subject and verb do not agree because of a grammatical mistake by (in our case) the writer. Still, it may be useful to annotate such structures. In the OT approach, such structures are marked with a ‘negative’ optimality mark. If, however, no other analysis exists, then even a structure with a negative optimality mark is accepted. In our approach, we first try to annotate the structure checking verb-subject agreement. In this step, we also use the SC information and we apply the ranking of constituent orders. If the annotation fails because no potential ON agrees with the verb in number, the following rules of the non-lexical component are applied. In this component, SC information can no longer be used and grammatical functions are not annotated together but one after the other. In this component, verb-subject agreement constraints are finally abandoned and the ON candidates are just checked for their case. The system first looks for a potential ON candidate in the VF (in V2 clauses) and, if it does not succeed, it tries to find a candidate in the MF. Since nearly all V2 clauses have an ON, a last solution also includes the possibility that the first noun chunk in the clause is the ON regardless of morphological features.
The constructed example 163 illustrates the procedure: The ON of the sentence is ‘die Schwester der beiden’ and it should agree with the verb ‘sehen’ (i.e. the verb should be ‘sieht’ not ‘sehen’); the OA of the clause is ‘eine Erscheinung’. Both noun chunks may be either the ON or the OA as regards their morphological ambiguity class, and both noun chunks are singular. The system first tries to annotate the sentence with the constituent order ‘ON verb OA’ including an agreement check of verb and potential ON

preceding the verb. This is the most optimal solution. The second-best solution is the constituent order ‘OA verb ON’ (i.e. ‘die Schwester der beiden’ as the accusative object and ‘eine Erscheinung’ as the nominative object). Since neither solution works, because the verb does not agree in number with any of the potential ONs, grammatical functions are not annotated in this component.

(163) [ON Die Schwester der beiden] sehen [OA eine Erscheinung].
The sister of the two see an apparition.
‘The sister of the two sees an apparition.’

Since the lexical component has not annotated sentence 163, it is handed over to the non-lexical component. In the non-lexical component, there are also ranked solutions for the annotation of the sentence. Since we do not keep a parsing record, the system treats all sentences alike, whether they were not annotated before because the SC information was missing or for any other reason (e.g. missing subject-verb agreement). Thus, at first, verb-subject agreement is checked as well and, again, an ON in the VF is preferred to an ON in the MF. These rules do, again, not apply to our sentence since they still involve agreement of subject (ON) and verb. In the following step, the verb-subject agreement constraint is abandoned and the system tries to annotate chunks with nominative in their morphological ambiguity class as ON. Both ‘die Schwester der beiden’ and ‘eine Erscheinung’ would now work; but, again, the first possible noun chunk in the clause is the more optimal solution and, in this case, this is the correct solution: ‘die Schwester der beiden’ is assigned the grammatical function ON. In a further step with a similar structure, the accusative object (OA) is recognized. There is still a further step in the annotation process, which need not be applied to sentence 163, because after the annotation of the grammatical functions, the annotation process for this sentence stops. In this step, the constraints are relaxed even more. Here, the parser would annotate just any first noun chunk in a V2 clause as ON. This rule would only be applied if the correct noun chunk had no ‘nominative’ among its features. This could be the case for various reasons: the DMOR analysis could be inaccurate, the morphological disambiguation component could have failed, the chunking could be imperfect such that the morphological disambiguation failed, etc. Since we want our system to be robust, we try to annotate every sentence if possible.
Thus, the relaxation of constraints is also an

instrument to achieve robustness and to remedy errors which were introduced by preceding levels of annotation. Sentence 164 gives a good, realistic example of the advantages of an approach which, to a certain extent, also allows ungrammatical constructions. The sentence is from the TüBa-D/Z. It is a letter to the editor. While newspaper articles mostly contain grammatical constructions, letters to the editor or newspaper genres with a less formal character (e.g. glosses) may contain a higher degree of ungrammatical material. But the reason for the ungrammatical construction in this sentence may also be another one: the singular ON ‘diese Diskussion’, which is supposed to agree in number with the verb, is very far apart from the respective plural verb ‘spielen’ (which should be ‘spielt’), with a lot of intervening phrases which are actually plural (but not the ON). This grammatical construction, in connection with the writer’s inexperience, which is also reflected in the semantic content of the sentence, seems to have led to this mistake. Since language like that does occur, it is advisable to have instruments to deal with it so that the parser can produce an output.

(164) In diesem Zusammenhang habe ich geantwortet, daß [ON diese Diskussion] um Ziele von NGOs wie soziale Gerechtigkeit, menschenwürdiges Leben, die ich als Menschenrechtsverletzungen zusammengefaßt habe, meines Erachtens [OA keine Rolle] bei der aktuellen Spendenbereitschaft spielen.
      in this context have I answered that this discussion about aims of NGOs like social justice, humane life, which I as human rights violations subsumed have, to my mind no role with the current donation willingness play
      'In this context I have answered that this discussion about aims of NGOs like social justice [and] humane living conditions, which I have subsumed under human rights violations, does, to my mind, not play any role as regards the current willingness to donate [money].'

OT constraints in the ranked FSTs

The constraints as they are applied in our system can be modeled using Optimality Theory (OT). This does not mean that the constraints as depicted in table 9.1 are directly encoded in the annotation system; rather, they are inherent in the ranking of the cascaded FSTs. Considering the OT constraints in table 9.1 helps to understand our annotation strategy. It shows the advantages of the system of ranked FSTs but also reveals the parts in which additional information could be used. Some examples illustrate how the ranking of FSTs can be expressed by using OT constraints.

C1  ON must be NC or CL
C2  ON must have feature 'nominative'
C3  ON must agree with verb in number
C4  ON must have feature 'nominative' as only feature
C5  ON must occur before all other potential grammatical functions
C6  ON must occur in VF

Table 9.1: Constraints for the assignment of ON

Table 9.1 shows the OT constraints from the point of view of the assignment of ONs. OT constraint C1 is the highest-ranked constraint and constraint C6 the lowest-ranked. The principles which are used in the evaluation of constraints are the ones outlined in Kager (1999), sec. 1.4. The most important one in our case is the principle of strict domination, which states that "violation of higher-ranked constraints cannot be compensated for by satisfaction of lower-ranked constraints" (sec. 1.4). In our case, this means that, e.g., if a chunk has not got 'nominative' in its morphological ambiguity class (C2), then it does not matter whether the lower-ranked constraints are violated or not, as long as there is a solution which does not violate C2. The solution not violating C2 would then be the one which is selected. Sentence 165 gives an example which is further illustrated in table 9.2. All noun chunks are tested as candidates for the ON; this is C1. The noun chunk 'ihn' is the one which already fails to satisfy C2 because it has not got the feature 'nominative' in its morphological ambiguity class. This is fatal, as indicated by the accompanying '!'. In this case, it does not matter that, in total, 'die Beamtin' satisfies fewer constraints than 'ihn'.

(165) [OA Ihn] sieht [ON die Beamtin] nicht.
      him sees the officer not
      'The officer does not see him.'

ON candidates    C1   C2   C3   C4   C5   C6
ihn                   *!        *
die Beamtin                     *    *    *

Table 9.2: Tableau for sentence 165

The mechanism which annotates the grammatical functions as reflected by the OT constraints in table 9.1 can best be illustrated by some simple constructed examples which range from the most optimal structure to the least optimal one. Sentence 166 shows one of the most optimal cases: If we assume that all material in topological fields, i.e. every chunk except verbal chunks and conjunctions7, can be a potential grammatical function, then there are two potential ONs in the clause: 'der Lehrer' and 'den Schülern'. However, while 'der Lehrer' satisfies all constraints for an ON, 'den Schülern' just satisfies the constraint that it is a noun chunk. Cases like these are the easiest to decide. They are also the most unmarked in terms of OT (cf. Kager, 1999, sec. 1.1.2). This becomes very clear in table 9.3, which shows that the ON 'der Lehrer' does not violate any of the constraints.

(166) [ON Der Lehrer] antwortet [OD den Schülern].
      the teacher answers the pupils
      'The teacher gives the pupils an answer.'

ON candidates    C1   C2   C3   C4   C5   C6
der Lehrer
den Schülern          *!   *    *    *    *

Table 9.3: Tableau for sentence 166

Sentence 167 is different from sentence 166 with respect to markedness. Again, C2 resolves the ambiguity, but unlike in sentence 166 the ON violates C4, C5 and C6. Sentence 167 is, thus, more marked than sentence 166. Kager (1999), sec. 1.1.2, attributes such markedness to the expression of contrast, and this is, in fact, the case in sentence 167: The noun chunk 'dem Schüler' is the one in focus. What is important for our system, if one compares the two sentences 166 and 167 as regards the violations of constraints as they are reflected in table 9.3 and table 9.4, is that the violation of C2 alone decides about the annotation of the ON. This shows the importance of morphology for the annotation of grammatical functions. In order for a potential noun chunk to be annotated as ON, it must have the feature 'nominative' in its ambiguity class. Yet, what the two tables reveal as well is that the potential noun chunk need not be fully disambiguated (C4). Sometimes, it suffices that all of the other potential noun chunks are unambiguously not ON.

7 Relative pronouns would not count among conjunctions, in this case.

(167) [OD Dem Schüler] antwortet [ON die Lehrerin].
      the pupil answers the teacher
      'It is the pupil who gives an answer to the teacher.'

ON candidates    C1   C2   C3   C4   C5   C6
die Lehrerin                    *    *    *
dem Schüler           *!        *

Table 9.4: Tableau for sentence 167

Sentence 168 shows what happens if it is impossible to decide on the grounds of morphology which of the potential chunks is the ON. This is the case if none of the potential ONs is either unambiguously ON or unambiguously not ON. In sentence 168 the chunks 'die Lehrerin' and 'die Schülerin' may each be either the ON or the OA. In this case, an ON in the VF would be the most unmarked solution, i.e. the one with the fewest violated OT constraints (just C4). The next best one is the one satisfying the more general constraint that the ON be the first of any potential ONs. This is the one in sentence 168. Sentence 168 in conjunction with table 9.5 shows that in German morphology is the decisive feature for the detection of grammatical functions. This is underlined by sentence 169, which shows that the ON can even come last if only the morphology of the ON is unambiguous. Thus, only if the morphology (including verb-subject agreement) is ambiguous is the constituent order the decisive feature for disambiguation.

(168) Gestern hat [ON die Lehrerin] [OA die Schülerin] unterrichtet.
      yesterday has the teacher the pupil taught
      'Yesterday, the teacher taught the pupil.'

ON candidates    C1   C2   C3   C4   C5   C6
die Lehrerin                    *         *
die Schülerin                   *    *!   *

Table 9.5: Tableau for sentence 168

(169) Gestern hat [OA die Lehrerin] [ON keiner] gesehen.
      yesterday has the teacher no one seen
      'Yesterday, no one saw the teacher.'

ON candidates    C1   C2   C3   C4   C5   C6
die Lehrerin                    *!        *
keiner                               *    *

Table 9.6: Tableau for sentence 169
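The strict-domination selection illustrated by these tableaux can be sketched in a few lines of Python. This is our illustration, not the dissertation's FST implementation: each ON candidate is mapped to the set of constraints it violates, and the candidate with the lexicographically smallest violation profile over the ranking C1–C6 wins.

```python
# Illustrative sketch of OT candidate selection under strict domination
# (not the actual FST encoding): the candidate whose violation profile is
# lexicographically smallest over the ranked constraints wins.

CONSTRAINTS = ["C1", "C2", "C3", "C4", "C5", "C6"]  # ranked, C1 strongest

def select_on(tableau):
    """tableau: dict mapping candidate chunk -> set of violated constraints."""
    def profile(chunk):
        # one mark per violated constraint, ordered by constraint ranking;
        # tuples of booleans compare lexicographically (False < True)
        return tuple(c in tableau[chunk] for c in CONSTRAINTS)
    return min(tableau, key=profile)

# Table 9.2 (sentence 165): 'ihn' fatally lacks 'nominative' (C2)
print(select_on({"ihn": {"C2", "C4"},
                 "die Beamtin": {"C4", "C5", "C6"}}))   # -> die Beamtin

# Table 9.6 (sentence 169): only 'keiner' satisfies C4, which decides
print(select_on({"die Lehrerin": {"C4", "C6"},
                 "keiner": {"C5", "C6"}}))              # -> keiner
```

Because the profiles are compared position by position from C1 downwards, a single violation of a higher-ranked constraint can never be outweighed by any number of satisfied lower-ranked constraints, which is exactly the strict-domination principle cited from Kager (1999).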

Although morphology is the decisive feature for grammatical function detection in German, the number of cases which are disambiguated by constituent order is relatively high, because the most frequent grammatical functions, ON and OA, are also the ones which are very often mutually ambiguous. This ambiguity is not resolvable without either knowledge about the full syntactic structure of the clause (a vicious circle, since the annotation of this structure is precisely the final goal of the morphological disambiguation) or semantic and/or world knowledge, which is not accessible in our system (see below). Thus, in those cases the system has to resort to linearization information.

If morphological features do not disambiguate grammatical functions, linearization information alone can, in some cases, fail to identify the correct structure, as sentence 170 and table 9.7 show. Although this sentence might also be regarded as genuinely ambiguous, since it can either mean that the sister has seen the town or, figuratively, that the town has seen the sister, it is very likely that, in the case at hand, the reading as annotated in sentence 170 is the correct one. Thus, our system obviously lacks information: C5 incorrectly rules out 'meine Schwester' as the ON, as is shown in table 9.7. With the help of the OT constraints table, it can also be made out at which point of the constraint hierarchy the information is missing: between C4 and C5. If the sentence were 'Die Altstadt hat meine Schwestern schon gesehen' (The old town has seen my sisters), 'meine Schwestern' would violate C3 and, thus, 'die Altstadt' would unambiguously be ON; and this would, in fact, be the only reading of the sentence.

(170) [OA Die Altstadt] hat [ON meine Schwester] gestern schon gesehen.
      the old town has my sister yesterday already seen
      'My sister has already seen the old town, yesterday.'

ON candidates     C1   C2   C3   C4   C5   C6
die Altstadt                     *
meine Schwester                  *    *!   *

Table 9.7: Tableau for sentence 170

The kind of information which is missing in the OT constraint hierarchy is semantic. The reason why humans do not misinterpret sentences in which ON and OA are not distinguishable by morphological features, and in which the constituent order is such that the OA occupies a typical ON position and vice versa, is that there are restrictions with respect to the semantic features of the grammatical functions sub-categorized by a verb. In the case of the verb 'sehen', this would be that the agent of the verb (i.e. the head of the ON in active clauses and the nominal head of the OPP in passive clauses) has the feature '+animate'. This information is situated after the morphological information and before the linearization information in the OT constraint hierarchy.

In order to avoid erroneous annotations like the one in sentence 170, we would have to add information about the semantic selectional restrictions to the verb, and information about the respective semantic features to each potential grammatical function, as is also proposed in Frank et al. (1998). One source of such information could be GermaNet (Kunze, 2001; Kunze and Wagner, 2002), a lexical-semantic wordnet. The main content of GermaNet are the semantic relations between lexical items. For the problem at hand, hyponymy could be used to identify the major semantic class of a potential grammatical function. The entries for the verbs do not contain information about selectional restrictions, but there is a semantic verb classification with classes like 'feeling', 'communication' or 'cognition'. These classes could be used to connect the verbs with the correct grammatical functions.

There are, however, some problems with this approach: One technical limitation is that many words may come up which are not contained in GermaNet. Another technical problem is anaphora resolution for pronouns, for which the referent has to be found in order to assign the correct semantic features. A genuine problem of the approach is that it runs into difficulties if just one of the elements in a verb-grammatical-function set is used figuratively. These phenomena are not rare and not as unusual as the term 'figurative' suggests, as sentences 171 and 172 show, in which the verb 'sehen' is used with a [-animate] subject. Furthermore, structures like the one in sentence 170, which cause mistakes, are not so frequent as to justify boosting the system with a component which would considerably increase its size and also substantially slow it down. To include full semantic information would, thus, be uneconomical. An alternative to be tested could be to include partial semantic information in the system. This is also done in our system when we identify temporal expressions to exclude them from the annotation of grammatical functions.

(171) Als vordringlichstes Problem sieht das Verkehrsbündnis den Nord-Süd-Verkehr durch den Raum Hamburg [...]
      as most urgent problem sees the traffic league the north-south traffic through the area Hamburg
      'The traffic league regards the north-south traffic through the Hamburg area as the most urgent problem.'

(172) Der Kreuzberger Antrag sieht hingegen keine ernstzunehmende Rechtfertigung für den Verstoß [gegen die UN-Charta]
      the Kreuzberg motion sees on the contrary no serious justification for the violation [of the UN charta]
      'On the contrary, the motion from Kreuzberg cannot find any serious justification for the violation of the UN charta.'

Finding OT constraints in ranking

As has already been mentioned, the OT constraints as they are listed in table 9.8 are not directly encoded in the annotation system.8 OT constraints are just one way of modeling the ranking approach in a generally accepted theory. In this section, the actual procedure of disambiguation is described. We choose the rules for a verb-second affirmative clause for the SC frame 'ON OA' as an example (cf. figure 9.20) because this type of sentence is one of the most frequent and most typical examples of a sentence (of the kind 'The boy kisses the girl' or 'John loves Mary'). The ranked rules are called successively. This process begins with the most optimal solution (i.e. the one with the fewest violations of constraints) and continues either until the least optimal solution is reached or until one solution matches. At the point of assignment of grammatical functions, the chunks in the corpus already contain information about their category, topological field and case. This information is used in the annotation process. The first rule in the ranked hierarchy is, consequently, the one which tries to annotate the element in the VF as ON, since the weakest constraint C6 demands the ON to be in the VF. All the other constraints have to be fulfilled as well because they are the stronger constraints, or else they have to be violated by all other potential ONs. This will be exemplified below with C4.

C1  ON must be NC or CL
C2  ON must have feature 'nominative'
C3  ON must agree with verb in number
C4  ON must have feature 'nominative' as only feature
C5  ON must occur before all other potential grammatical functions
C6  ON must occur in VF

Table 9.8: Constraints for the assignment of ON

The first choice is exemplified by the rules marked 'A' (cf. figure 9.20). There are two of them because the lexical verb which triggers the annotation may be either in the left part of the sentence bracket or in the right one. These two solutions are disjunct. For this reason, they do not have to follow each other directly and are separated into two blocks to enhance maintainability. The second choice is 'B'. In this case, ON still precedes OA, but the VF is occupied by optional elements, as in rules 2, 6, 9 and 10. An example for the lexical verb in the VF is sentence 173, covered by rule 9. In the second choice, C6 is violated but all other constraints are fulfilled. C6 and C5 are violated in 'C' and 'D'. In these rules, the OA comes before the ON, either in the VF ('D') or in the MF ('C'). Constraints C2 and C3 are never violated in the component of the system using SC information. C1 is never violated anywhere in the system.

8 Table 9.8 is identical to table 9.1 and just replicated here for convenience.

A 1. ON_VF_W VCL_akt_fin_AM_T OA_MF_W VCR_akt_nfin_ONOA_T B 2. FREE_VF_T VCL_akt_fin_AM_T ON_MF_W OA_MF_W VCR_akt_nfin_ONOA_T C 3. FREE_VF_T VCL_akt_fin_AM_T OA_MF_W ON_MF_W VCR_akt_nfin_ONOA_T D 4. OA_VF_W VCL_akt_fin_AM_T ON_MF_W VCR_akt_nfin_ONOA_T

A 5. ON_VF_W VCL_akt_fin_ONOA_T OA_MF_W B 6. FREE_VF_T VCL_akt_fin_ONOA_T ON_VF_W OA_MF_W C 7. FREE_VF_T VCL_akt_fin_ONOA_T OA_MF_W ON_VF_W D 8. OA_VF_W VCL_akt_fin_ONOA_T ON_MF_W

B 9. VCF_akt_nfin_ONOA_T VCL_akt_fin_AM_T ON_MF_W OA_MF_W C 10. VCF_akt_nfin_ONOA_T VCL_akt_fin_AM_T OA_MF_W ON_MF_W

Figure 9.20: Plain text rules for a V2 affirmative clause for the SC frame ‘ON OA’

(173) Gesehen habe ich den Postboten heute noch nicht.
      seen have I the postman today yet not
      'I have not yet seen the postman, today. [But the mail is already there.]'

C4 is the one constraint which makes it very clear how indirectly the OT constraints are encoded in the system. The abbreviations for the grammatical functions represent a feature bundle of chunk category, distribution and case feature. For instance, the abbreviation 'ON_VF_W' represents a noun chunk in the Vorfeld with the feature 'nominative' among its morphological features. It is not checked whether the feature 'nominative' is the only feature in the set of morphological features. This would yield at least twice as many rules: once with the feature 'nominative' among other features and once with the feature 'nominative' as the only feature (and a combination of the same dichotomy with the other grammatical functions). This is, however, not necessary: In the case that the ON occurs before any other grammatical function (as the most optimal solution), a wrong annotation is excluded by the distribution; and in the case that the ON occurs after the other grammatical functions, a wrong annotation is excluded by the fact that we do not allow chunks with the feature 'nominative' to occur among the optional chunks, and by the plain fact that the correct ON simply cannot be annotated as anything else since it has just got the feature 'nominative' for case.

This can be made clear by looking at the rules in figure 9.20 and at an example. If, for instance, the OA is in the VF and the ON in the MF, and the OA has the features 'nominative' and 'accusative' but the ON is unambiguously 'nominative', as in sentence 174, then the system cannot erroneously assign the ON an OA label. This has the effect that the OA, which is, in fact, ambiguous, cannot erroneously be assigned ON, since none of the other potential grammatical functions qualifies for an OA. Thus, C4 is checked indirectly. Although the system does not check whether 'nominative' is the only feature in a potential ON, the interplay of the annotation of all grammatical functions and the ranking of the rules guarantees the consideration of C4. This can be exemplified by sentence 174 in connection with figure 9.20: Rules 1–4 do not apply at all since the lexical verb is in the left part of the sentence bracket. In rule 5, the system matches 'seine Schwester' as ON since it has both 'accusative' and 'nominative' as case features. It fails, however, to assign 'der Mann' OA since this chunk has no case feature 'accusative'. Thus, rule 5 is blocked because the correct ON 'der Mann' has 'nominative' as its only feature. Since rules 6 and 7 do not match the structure either, rule 8 annotates it correctly, with OA in the VF and ON in the MF (cf. sentence 174).
Sentence 175 illustrates the case in which both potential ONs violate C4. In this case, the violation of either C5 or C6 disambiguates the structure. In sentence 175 both C4 and C6 are violated and, thus, it is C5 which finally decides about the correct structure. This example illustrates the fact that if a stronger constraint is violated by all potential structures (like C4 in the case at hand), then the violation of the next weaker constraint is the crucial one (cf. table 9.9).

(174) [OA Seine Schwester] erkennt [ON der Mann] nicht mehr.
      his sister recognizes the man no more
      'The man does not recognize his sister any more.'

(175) Meistens beachten [ON die Leute] [OA die Kinder] nicht.
      most of the times notice the people the children not
      'Most of the times, the people are not taking notice of the children.'

ON candidates    C1   C2   C3   C4   C5   C6
die Leute                       *         *
die Kinder                      *    *!   *

Table 9.9: Tableau for sentence 175
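The walkthrough for sentence 174 (rule 5 blocked, rule 8 succeeding) can be sketched as follows. This is a deliberately simplified illustration, not the plain-text FST rules of figure 9.20: chunks are reduced to their case ambiguity classes, and only rules 5 and 8 of the ten-rule inventory are modeled.

```python
# Simplified sketch of the ranked-rule cascade for a V2 clause with the SC
# frame 'ON OA'. Only rules 5 and 8 of figure 9.20 are modeled; the
# case-set representation is ours, not the system's FST encoding.

def annotate_v2_on_oa(vf_cases, mf_cases):
    """vf_cases/mf_cases: morphological case ambiguity classes of the noun
    chunks in the Vorfeld and Mittelfeld; rules are tried in ranked order."""
    # rule 5 (more optimal): ON in VF, OA in MF -- both matches must succeed
    if "nom" in vf_cases and "acc" in mf_cases:
        return {"VF": "ON", "MF": "OA"}
    # rule 8 (less optimal): OA in VF, ON in MF
    if "acc" in vf_cases and "nom" in mf_cases:
        return {"VF": "OA", "MF": "ON"}
    return None  # no rule of this frame matched; the clause is handed on

# Sentence 174: 'Seine Schwester' (nom/acc-ambiguous) in the VF,
# 'der Mann' (unambiguously nominative) in the MF. Rule 5 is blocked
# because 'der Mann' cannot be OA, so rule 8 applies.
print(annotate_v2_on_oa({"nom", "acc"}, {"nom"}))
# -> {'VF': 'OA', 'MF': 'ON'}
```

The sketch shows how C4 is respected only indirectly: the ambiguous chunk in the VF is not checked for having 'nominative' as its sole feature, yet the requirement that the partner chunk must qualify as OA blocks the higher-ranked rule all the same.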

Constraints C1–C3 are never violated in the component using SC information. They are tested in a further component, the non-lexical component. In this component, C4 is no longer taken into consideration. At first, all ONs, for example, are annotated using C1–C3, C5 and C6. The dropping of C5 and C6 is organized just like in the lexical component. Then, in a next step, C3 is abandoned and verb-subject agreement is no longer checked. In the last step, C2 is violated, and in every sentence with a noun chunk an ON is annotated. The last two steps annotate just a few structures since, most of the time, a verb agrees with the ON, and it occurs very rarely that the ON does not have the feature 'nominative' in its ambiguity class, since all tokens not covered by DMOR automatically receive 'nominative' in their ambiguity class. The only candidates for the last step are tokens which were falsely annotated as not 'nominative', e.g. names which are historically genitives and which also exist as common nouns, e.g. 'Kellers'.
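The progressive relaxation in the non-lexical component can be sketched as follows. This is our illustration; the step predicates and the data layout are assumptions, not the system's FST encoding.

```python
# Sketch of constraint relaxation in the non-lexical component (our
# illustration): the constraint set shrinks step by step until some noun
# chunk qualifies as ON; among qualifying chunks, the first one in the
# clause wins (C5).

def annotate_on(chunks, verb_number):
    """chunks: list of (text, case_set, number) in clause order."""
    steps = [
        # step 1: C2 and C3 (nominative in ambiguity class + agreement)
        lambda c: "nom" in c[1] and c[2] == verb_number,
        # step 2: C3 dropped, agreement no longer checked
        lambda c: "nom" in c[1],
        # step 3: C2 dropped too -- any noun chunk (robustness fallback)
        lambda c: True,
    ]
    for qualifies in steps:
        winners = [c for c in chunks if qualifies(c)]
        if winners:
            return winners[0][0]
    return None

# Sentence 163 (simplified): agreement fails for both chunks, so step 2
# decides and the first nominative-bearing chunk is annotated as ON.
chunks = [("die Schwester der beiden", {"nom", "acc"}, "sg"),
          ("eine Erscheinung", {"nom", "acc"}, "sg")]
print(annotate_on(chunks, "pl"))  # -> die Schwester der beiden
```

The fallback step mirrors the robustness goal stated above: even if DMOR, the morphological disambiguation or the chunking has gone wrong, some first noun chunk is still annotated as ON rather than leaving the sentence unannotated.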


Part III

Evaluation and Comparison


Chapter 10

Shallow Annotation

Abstract: Chapter 10 describes the evaluation of the shallow parser which builds the input for the parser for grammatical functions. Section 10.1 depicts the evaluation system by describing the annotation task and introducing the gold standard corpus with which the annotated structures are compared. Furthermore, it is shown how the annotated data and the gold data are converted into a common representation for evaluation. Finally, the methods of evaluation are presented. Section 10.2 then presents the actual test setting, i.e. the test section of the gold corpus and the way we deal with some evaluation problems. Section 10.3 presents and discusses the results of the evaluation for each of the three groups of shallow phenomena, i.e. chunks, topological fields and clauses.

10.1 The Evaluation System

10.1.1 Task Description

There are two main motives for the evaluation of a parser: First, the evaluation of a parser points out its weaknesses, and the results of the evaluation can be used to improve the performance of the parser. The evaluation component can, thus, serve as a diagnosis tool to improve the parser during the development phase. Second, the evaluation indicates the performance of the parser in relation to other parsers. It can, thus, be used for the comparison with other parsers. Furthermore, users can draw conclusions about the quality of the parser from the evaluation results. The evaluation component can, consequently, also serve as an assessment tool which gives information about the quality of the parser output that can be expected.

In order to automatically evaluate the output of a parser, one typically needs correctly annotated data (gold data) against which the output can be checked, and a method to compare the output of the parser with the gold data. The portion of gold data used for development should not be used for evaluation, as otherwise the parser would be tuned to the test set. However, the gold data should also reflect the target text type(s) and the domain(s) the parser is intended for. We have chosen the taz (die tageszeitung) newspaper corpus as the target of our annotation since it contains a variety of text types and is, thus, a relatively varied section of the German language. The TüBa-D/Z provides a syntactically annotated sub-section of the taz newspaper corpus. It can be split into two sections and used as a gold corpus both for the development and the evaluation of the parser.

The method of evaluation should be both meaningful for the user and adequate for the task. Thus, an evaluation method has to be found which points out the strong points and the weaknesses of the parser. For the users, it is important to know the overall error rate of the parser and the error rate with respect to the individual annotated categories. For the development of the parser, a detailed error analysis is important, which shows the most frequent error types and also shows in detail the effects of different annotation strategies. Otherwise, mending errors at one point may lead to introducing errors at others. Taking into consideration the results of the evaluation system will prevent this disadvantageous mechanism. We have decided to split the tasks of the evaluation of constituent annotation and GF annotation because they are annotated by different components of the parser.
In conclusion, an evaluation system needs three main components: An annotated corpus which serves as a gold standard, a method to map the gold standard data to the output of the parser, and evaluation methods which yield a quantitative error analysis and allow for a detailed qualitative error analysis:

• gold standard corpus

• conversion into common representation

• methods for error analysis

10.1.2 The TüBa-D/Z Gold Corpus

We use the TüBa-D/Z as a gold standard corpus for the evaluation of GFs. The TüBa-D/Z is a syntactically annotated corpus.1 The development of the TüBa-D/Z has been financed by the Kompetenzzentrum für Text- und Informationstechnologie (KIT) and has received additional support by the project A1 – Representation and Automatic Acquisition of Linguistic Data – of the Special Research Program (Sonderforschungsbereich) 441 – Linguistic Data Structures. At the point of evaluation, the TüBa-D/Z contained 15,000 sentences or 260,000 tokens. The annotation was carried out manually.

The text of the TüBa-D/Z is taken from the taz newspaper corpus and, at the time of evaluation, ranged from May 3rd to May 7th, 1999. The corpus contains various text types like press news reports, press editorials, reportages, letters to the editor or squibs. The domains of the texts are as varied as politics, culture, sports, pop music or everyday life. The texts in the TüBa-D/Z are, thus, a lot more varied than the texts in, e.g., the Wall Street Journal part of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993), which is restricted to the domain of financial news and which is often used for parser evaluation.

The annotation scheme of the TüBa-D/Z (Telljohann, Hinrichs, and Kübler, 2003) is based upon an annotation scheme used in the Verbmobil project (Stegmann, Telljohann, and Hinrichs, 2000), a long-term machine translation project funded by the German Federal Ministry for Education, Science, Research and Technology (BMBF). There are four levels of syntactic constituents in the annotation scheme: lexical items, phrases, topological fields and clauses. Special attention has been given to keeping the annotation scheme as compatible as possible with major syntactic theories.

The annotation scheme is surface-oriented in the sense that the primary ordering principle of the clause is the topological field structure and in the sense that crossing edges and traces do not occur. Besides constituents, GFs are marked by edge labels so that they are essentially attributes of the constituents from which the arcs originate. The GFs could not have been annotated as edges themselves because this would have resulted in crossing edges, since GFs can occur across topological field boundaries. Furthermore, in the TüBa-D/Z, the distinction between heads and non-heads is marked within phrases.

Figure 10.1 gives an example of a tree in the TüBa-D/Z. The tree consists of a V2 clause into which a VL clause is embedded. Both clauses are labeled 'SIMPX'. The embedding SIMPX is divided into topological fields by the sentence bracket consisting of the left part of the sentence bracket (LK) and the right part (VC). There is an initial field (Vorfeld; VF) in front of the LK, no middle field (Mittelfeld; MF) in between the LK and the VC (as 'empty MFs' are not annotated in the TüBa-D/Z) and a final field (Nachfeld; NF) after the VC. Punctuation is not attached. This topological field description is surface-oriented in that it does not show any of the grammatical relations in the sentence. These relations are shown by the edge labels. The SIMPX in the NF is, for instance, marked as 'MOD', which means that it is either an ambiguous modifier or a modifier of the whole sentence. In the case that, e.g., the SIMPX was a relative clause, the modified element would, in addition, be marked by the element which it modifies, e.g. OAMOD for modifier of the accusative object. The scope of the MODs is confined by the structures in which they occur. Thus, in the embedded SIMPX, the ADVX (adverbial chunk or phrase) 'nicht', which serves as a MOD, can only be a modifier of either the whole embedded SIMPX or any elements within it. In the same way, the relations of the two subjects (ON) which occur in the sentence are confined. The ON occurring in the VF of the embedding SIMPX can only be the subject referring to the finite verb in the LK, and the ON in the embedded SIMPX can only refer to the finite verb in the VC of this very SIMPX.

1 The inventory of the TüBa-D/Z is listed in appendix C.

10.1.3 Conversion into common representation

From the linguistic phenomena annotated in the TüBa-D/Z, only the constituent structures are relevant for the evaluation of the shallow annotation, i.e. phrases, topological fields and clauses. Figure 10.2 shows the sentence from the TüBa-D/Z in figure 10.1 as it is annotated by the shallow parser. For the evaluation of the shallow parser, we now have to map the relevant structures in figure 10.1 to the ones in figure 10.2. In order to do so, we use a representation type based upon the one proposed in Ramshaw and Marcus (1995). In it, constituent structure is represented by tagging each token with its constituent type. Since this might lead to ambiguous structures in cases in which two constituents of the same type immediately follow each other, the beginning of the constituent is specially marked in those cases. Tokens outside any constituent are marked as 'outside'. According to the criteria for their assignment, i.e. inside, outside or beginning of a constituent, these tags are often referred to as IOB tags.

[Tree diagram omitted: the V2 clause is divided into VF, LK, MF/VC and NF; the embedded VL clause in the NF carries the edge label MOD; phrase nodes (NX, ADJX, ADVX, VXFIN, VXINF) carry edge labels such as ON, OD, OV, PRED and HD.]

Körperbehinderte Kinder müssen leiden, weil die Bildungs- und Sozialbehörde sich nicht einig sind.
physically disabled children must suffer because the education and social agency themselves not at one are
'Physically disabled children have to suffer because the education agency and the social agency are incapable of coming to an understanding.'

Figure 10.1: Syntactically annotated tree from the TüBa-D/Z

[Tree diagram omitted: the same sentence with the shallow parser's annotation, i.e. chunks (NC, AJAC, AVC, AJVC), verb chunks (VCLMF, VCRVI, VCRAF), the complementizer field (CF), topological fields (VF, MF, NF), a coordinated noun chunk (NCC) and clause nodes (S, V2, SUB).]

Figure 10.2: The sentence from figure 10.1 as annotated by the shallow parser

We mark each token according to the conventions used for the shared task of CoNLL-2000, which was chunking (cf. Tjong Kim Sang and Buchholz, 2000), i.e. the first word of each constituent X receives the label B-X, and all non-initial tokens receive the tag I-X. Tokens outside any constituent receive the label O. It is straightforwardly possible to convert the constituents in the tree structure of the TüBa-D/Z and the output of the shallow parser into the IOB-tag representation. The different shallow phenomena, chunks, topological fields and clauses, can be treated in different layers of annotation, i.e. a different IOB-tag representation is yielded for each of those three phenomena. In the case of clauses, two layers are generated, one for subordinate clauses and one for matrix clauses. Figure 10.3 shows the representation of chunks with IOB tags as an example. Only a full (i.e. exact) match is counted as correct; everything else is counted as an error. Since maximal chunks, i.e. chunks which are not contained in any other chunk, are the targets of GF annotation, their annotation has been evaluated. Since maximal chunks are, by definition, non-recursive, they can be evaluated with one layer of IOB tags.

Körperbehinderte   B-NC   B-NC
Kinder             I-NC   I-NC
müssen             B-LK   B-LK
leiden             B-VC   B-VC
,                  O      O
weil               B-C    B-C
Bildungs-          B-NC   B-NC
und                I-NC   I-NC
Sozialbehörde      I-NC   I-NC
sich               B-NC   B-NC
nicht              O      O
einig              B-AC   B-AC
sind               B-VC   B-VC
.                  O      O

Figure 10.3: Chunks as represented with IOB-tags

10.1.4 Methods of Evaluation

In order to assess the performance of a parser, it is necessary to know in how many cases the parser has assigned the correct linguistic annotation and in how many cases it has failed. In order to draw more precise conclusions, it is necessary to give the figures for each linguistic phenomenon (in our case, for each chunk which serves as a target for GF annotation). A problem is that there is usually a trade-off between annotating as many of the occurring linguistic phenomena as possible and annotating them as accurately as possible. This dilemma is best captured by the concepts of precision and recall, which should be included in an evaluation system. Counting the errors is achieved using the data in the representation shown in figure 10.3. From the data, we calculate three measures: precision, recall and F-measure. Precision is defined as the percentage of detected tags which are correct, i.e. the ratio of true positives to all annotated positives. Recall is defined as the percentage of gold standard tags which are correctly detected, i.e. the ratio of true positives to all gold positives. The two measures are illustrated by figure 10.4. Using precision and recall for the evaluation of NLP systems has become very common (cf. Manning and Schütze, 1999, sec. 8.1) and there are good reasons for it: The measures

Figure 10.4: Precision and recall (for NCs)

of precision and recall illustrate very vividly the strengths and weaknesses of a parser. If, in our case, we just focused on the unambiguous cases of chunks, e.g. just the cases of a preposition directly followed by a noun for PCs, then the precision of the parser would be high because there would be few false positives. However, the recall would be low because the number of false negatives would be high. For our parser, we try to keep precision and recall as balanced as possible. The reason for this approach is the goal of our annotation: we are annotating a corpus for linguistic research. For the linguist querying a corpus, it would be demotivating if a large number of the results were incorrect (low precision). On the other hand, the results would be spoiled if they did not contain all intended structures (low recall). Typically, if recall is low, the structures not matched by the parser are not evenly distributed among the linguistic phenomena. Furthermore, they are different in nature from those matched by the parser and not just a random subset. If we just matched PCs with a preposition directly followed by a noun, then those PCs would not represent the full spectrum of PCs. Thus, it is usually regarded as optimal if precision and recall are balanced, unless favouring one of the two is advantageous for a particular application. The goal of the evaluation should, thus, be to keep precision as well as recall high and as balanced as possible. This is easier to manage with a single measure combining precision and recall. The F-measure is an ideal measure for this task (cf. Manning and Schütze, 1999, sec. 8.1). It is

F_β = ((β² + 1) · P · R) / (β² · P + R), which for β = 1 simplifies to F_{β=1} = 2PR / (P + R)

Figure 10.5: F-Measure with Fβ=1 (P=precision; R=recall)

          precision   recall   F β=1
overall     96.08%    96.10%   96.09
AC          94.87%    95.42%   95.14
C           99.73%    99.18%   99.46
LK          99.90%    99.14%   99.52
NC          94.10%    94.39%   94.24
PC          95.58%    94.86%   95.22
VC          98.29%    99.35%   98.82

Table 10.1: Quantitative evaluation results for chunks

calculated as shown in figure 10.5. With β = 1, precision and recall are equally weighted. Since this is what we intend, we can simplify the F-measure as shown in figure 10.5. The F-measure is preferable to the arithmetic mean of precision and recall because it tends towards the lower of the two and, thus, penalizes a larger difference between precision and recall. From the data as represented in figure 10.3, the three measures can now be calculated for overall performance and for the performance of every single category, as shown in table 10.1, which presents the results of the shallow annotation for chunks (discussed in section 10.3.1). This procedure is a useful diagnostic tool for the detection of weaknesses in the annotation algorithm. The grammar constructor can draw inferences from these results and work on the crucial problems in the algorithm.
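The exact-match evaluation described above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the evaluation code used for the thesis; function names and the span format are our own choices:

```python
def iob_to_spans(tags):
    """Extract (start, end, label) chunk spans from one layer of IOB tags."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes the last chunk
        if start is not None and tag != "I-" + label:
            spans.append((start, i, label))     # chunk ends; record its full span
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return set(spans)

def evaluate(gold_tags, parser_tags):
    """Exact-match precision, recall and F (beta = 1): a chunk counts as a
    true positive only if gold and parser agree on its complete span and label."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(parser_tags)
    tp = len(gold & pred)
    precision = tp / len(pred)
    recall = tp / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

gold = ["B-NC", "I-NC", "B-LK", "B-VC", "O"]
pred = ["B-NC", "I-NC", "B-VC", "B-VC", "O"]    # the LK mislabelled as VC
print(evaluate(gold, pred))                      # all three values ≈ 2/3
```

Note that the partially correct chunk contributes nothing: only two of the three gold chunks match exactly, so precision, recall and F all come out at two thirds, exactly the "full match or error" policy stated above.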

10.2 The Test-Setting for Evaluation

The TüBa-D/Z serves as the gold standard data against which the annotation of shallow structures is evaluated. The TüBa-D/Z provides a syntactically annotated sub-section of the taz newspaper corpus comprising (in the first release) 15,260 sentences (266,441 tokens) and ranging from May 3rd to May

7th 1999. We have split the corpus into a development set and a test set. The development set consists of the days of May 3rd, 4th, 5th and 7th, and the test set consists of May 6th. May 6th comprises 4,174 sentences (73,926 tokens), which is about 27% of the sentences in the whole TüBa-D/Z. The test set has not been used for the development of the parser but solely for its evaluation. For the evaluation of chunks, we use a subsection of the test set since chunks are not coded in the TüBa-D/Z, and the conversion of the phrases in the TüBa-D/Z into chunks as annotated by the shallow parser would have meant writing a grammar nearly as extensive as the chunk grammar itself. For the evaluation of chunks, the first 1,000 sentences of the test set were manually annotated with chunks, and this manually annotated test set was mapped to the chunk output of the shallow parser2 using IOB tags as described above. The test set for chunks contains 18,093 tokens and 7,185 chunks. The chunks which were evaluated subsume the chunks which make up the left part of the sentence bracket, i.e. LK (Linke Klammer) and C (C-Feld), and the right part of the sentence bracket (Verbkomplex; VC). Furthermore, the phrasal targets for the annotation of GFs, i.e. noun chunks (NCs), prepositional chunks (PCs) and adjectival chunks (ACs), are evaluated. Only maximal chunks are evaluated. This has to be kept in mind when evaluation results are interpreted and/or compared to the results of other approaches: If, e.g., two ACs are coordinated (cf. example 176) and the parser fails to recognize this coordination, we count 0% precision and 0% recall. If, however, one considers all the recognized chunks, then a parser would score 66.66% precision and 66.66% recall provided that it fails to recognize the coordinated chunk but succeeds in recognizing the basic chunks. This distinction can yield considerable differences in evaluation results if a text contains a number of complex chunks.

(176) [AC [AC recht klein] aber [AC sehr schön]]. rather small but very beautiful ‘rather small but very beautiful’

The first 1,000 sentences of the test set are a very heterogeneous part of the test section and contain text types as varied as newspaper reports, cinema reviews, letters to the editor and public announcements (about, e.g.,

2N.B.: The evaluation of the chunks includes the evaluation of all chunks annotated by the parser including complex chunks as defined in chapter 6.

elections or traffic obstructions). While newspaper reports pose problems like very complex chunks, the cinema reviews contain a lot of foreign-language material and creative writing; the latter is also true of letters to the editor, which are not written by professional writers and are sometimes in a colloquial style. Furthermore, the public announcements contain addresses as an unexpected type of linguistic phenomenon. Topological fields and clauses are evaluated on the full test section. The elements of the sentence bracket are non-recursive and can thus be evaluated without any problems. The topological fields LV, VF, MF and NF, however, may be recursive since, e.g., a clause may be embedded into an MF. We do, however, only evaluate maximal topological fields because it can be assumed that, if the maximal topological field is annotated correctly, the embedded field is annotated correctly, too, as otherwise the annotation of the embedding topological field would not have succeeded. Since subordinate clauses are, by definition, embedded into matrix clauses, the clause structure is recursive. In order to deal with this recursiveness, matrix clauses and subordinate clauses are evaluated separately. For the evaluation, we only distinguish between verb-first (V1) and verb-second (V2) clauses, which are mainly matrix clauses, on the one hand, and verb-last (VL) clauses, which are almost exclusively subordinate clauses, on the other hand. Since the relation between two V1/V2 clauses, which may be one of matrix clause ↔ subordinate clause, is not the subject of shallow annotation but of GF annotation, it is not evaluated. Thus, two V1/V2 clauses are always annotated and evaluated as two separate clauses, as can be seen in sentence 177. The same is true of subsequent VL clauses, for which the relation is not annotated in the shallow structure and, thus, not evaluated (cf. sentence 178).

(177) [V1/V2 Er sagte,] [V1/V2 er wolle kommen]. He said, he want come. ‘He said (that) he wanted to come.’

(178) Er möchte, [VL daß Du Dich meldest], [VL wenn Du wieder da bist]. He would like, that you yourself report, when you again there are. ‘He would like you to call on him when you are back again.’

The shallow parser is evaluated both on gold PoS tags, i.e. PoS tags which were manually assigned independently by two different annotators,

with differences resolved by a third, and on automatically annotated PoS tags. For the automatically assigned PoS tags, we used the TnT PoS tagger presented in Brants (2000). The tagger was trained on material of various text types from the Bild and taz newspapers (including the TüBa-D/Z), and from letters and novels. On the whole, the training and testing material consisted of about 629,000 tokens. The training of TnT on this material achieves an accuracy of 96.36% on the whole TüBa-D/Z with roughly 10% unknown tokens. For the evaluation, ten-fold cross validation was used. In our case, this means that the TüBa-D/Z was split into ten sections of about 26,600 tokens, and each of these sections was evaluated separately and the results averaged. Training and test set were disjoint, i.e. the training material for each of the ten sections consisted of the 629,000 tokens minus the respective test section of about 26,600 tokens (i.e. about 602,400 tokens). The accuracy of 96.36% compares well with the results obtained in Brants (2000), who uses the same method of evaluation and reports an accuracy of 96.7% for the TnT tagger trained on 320,000 tokens of the NEGRA corpus with about 12% unknown tokens. The accuracy of the TnT tagger on the test set for our shallow parser (i.e. May 6th 1999 of the taz) is 95.72%, which can be seen as an indication that this section of the corpus contains slightly different structures than the rest of the TüBa-D/Z corpus.
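The ten-fold cross-validation scheme just described can be sketched as follows. The Python below splits a corpus into ten contiguous sections and averages the per-fold scores; the tagger itself is replaced by a placeholder scoring function, and all names are our own illustrative choices:

```python
def ten_fold_sections(tokens, k=10):
    """Split a token list into k nearly equal contiguous sections."""
    n = len(tokens)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [tokens[bounds[i]:bounds[i + 1]] for i in range(k)]

def cross_validate(tokens, train_and_score, k=10):
    """Hold out each section once for testing, train on the remainder,
    and average the k test-section accuracies."""
    sections = ten_fold_sections(tokens, k)
    scores = []
    for i, test in enumerate(sections):
        train = [t for j, s in enumerate(sections) if j != i for t in s]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Dummy scorer in place of actually training and evaluating TnT:
corpus = list(range(100))
acc = cross_validate(corpus, lambda train, test: len(train) / (len(train) + len(test)))
print(acc)  # ≈ 0.9: each fold trains on 90 of the 100 tokens
```

Training and test material stay disjoint in every fold, mirroring the 629,000-token setup above, where each fold trains on roughly 602,400 tokens and tests on the held-out 26,600.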

10.3 Results

The following sections present the results of the shallow parser. They will be discussed according to the linguistic phenomena of chunks, topological fields and clauses. Examples of typical errors will be given.

10.3.1 Chunks

Table 10.2 and figure 10.6 show the results of the evaluation of the chunk annotation on gold PoS tags. The results are satisfactory since overall precision and recall are both over 96%. The best results are achieved for LK and VC. This is not astonishing since these are the least complex chunks. However, their annotation is not trivial since the obvious approach of just chunking all verbal elements together (including coordinators) does not work in all cases. Sentence 179 shows an example in which a truncated token is coordinated with a verb, and sentence 180 gives an example in which two adjacent verbs

          precision   recall   F β=1
overall     96.08%    96.10%   96.09
AC          94.87%    95.42%   95.14
C           99.73%    99.18%   99.46
LK          99.90%    99.14%   99.52
NC          94.10%    94.39%   94.24
PC          95.58%    94.86%   95.22
VC          98.29%    99.35%   98.82

Table 10.2: Quantitative evaluation results for chunks

do not belong to one verb chunk. This is all the more interesting since, in the case at hand, this sequence of tokens would represent a single VC in a different context. However, in this case, it is an inversion of an LK and a VC.

(179) Denn Bretschneider kennt noch mehr Kinder, deren Anspruch auf Assistenzen [VC hin-/TRUNC und/KON hergeschoben/VVPP wird/VAFIN]. For Bretschneider knows even more children, whose claim to assistance to and fro shuffled is. ‘For Bretschneider knows even more children whose claim to assistance is being shuffled from one authority to the other – and back.’

(180) [VC Zu/PTKZU sehen/VVINF] [LK sind/VAFIN] künstlerische und historische Dokumente [...] To see are artistic and historical documents [...] ‘Artistic and historical documents are on display.’

A very striking result as regards LKs and VCs is that only 0.66% of all LKs are mistakenly annotated as VCs although every verb chunk just containing a finite verb could be either LK or VC without taking contextual information into consideration. (There were no mistakes for the inverse case.) This is all the more striking as we only take the immediate context into consideration for this disambiguation. However, since this distinction is

Figure 10.6: Quantitative evaluation results for chunks

crucial for the further annotation of topological fields, which itself is crucial for the further annotation of the whole clause including GF annotation, we have put great effort into this distinction and have even allowed for the revision of this decision in further steps of annotation. Obviously, this mechanism was successful. Errors which still occurred include the one in sentence 181, which shows two LKs, the second of which occurs in brackets and before a question mark. In this case, our parser assigned the label VC to the second LK. On the whole, there was no consistent error pattern for LKs and VCs, except that, generally, a lot of errors were caused by unexpected or underspecified punctuation, the latter including, e.g., hyphens or slashes instead of full stops or commas.

(181) Bei Funny Bones – Regisseur Peter Chelsom [LK darf] (oder [LK muß]?) sich die ZuschauerIn wie ein Kleinkind fühlen. With Funny Bones – director Peter Chelsom may (or must?) her-/himself the viewer like an infant feel. ‘With Funny Bones director Peter Chelsom, the viewer is allowed to (or has to?) feel like a child.’

Like the chunks LK and VC, the chunk C is both a chunk and a topological field. It is part of the sentence bracket and, thus, crucial for the annotation of the other topological fields like VF, MF or NF. The field C

consists of a complementizer, which may be realized by a general subordinator, or of a chunk containing a relative pronoun, which may be an NC or a PC.3 Only the latter case may cause larger problems because the chunk may be complex. Furthermore, there may be chunks containing interrogative pronouns (PWAV) or determiners (PWAT) which may occur in C fields if these interrogatives are used in indirect questions or as relative pronouns (but which may also occur as NCs in VFs). These C fields may also be complex, as sentence 182 shows. Generally speaking, the few errors which occurred for C fields were caused by the problem of determining whether a PWAV or PWAT was used as an interrogative or as a complementizer.

(182) Es muß jetzt auch um die inhaltliche Frage gehen, [C unter welchen inhaltlichen Kriterien] die kulturelle Wirtschaftsförderung gemacht wird. It must now also with the content question dealt, under which content criteria the cultural business development made is. ‘Furthermore, the content question has to be dealt with under which content criteria cultural business development is made.’

The chunks which serve as the phrasal targets for GF annotation received somewhat lower results. Precision and recall were fairly balanced, which yielded an F β=1 of 95.14 for ACs, 94.24 for NCs and 95.22 for PCs. It has to be kept in mind that the chunk annotation also includes the complex chunks. Complex chunks were introduced to keep the number of phrasal targets for GF annotation as low as possible in order to contain ambiguity. Sentences 183 and 184 from the test set illustrate the advantage of this approach: Sentence 183 shows the relevant output of the shallow parser. There are four NCs in the sentence, i.e. there are four potential GFs. However, if complex chunks were not annotated, there would be six NCs and, thus, six potential GFs, as shown in sentence 184. This would make GF annotation more complex and, thus, more error-prone.

(183) Zunächst übernimmt [NC die DaimlerChrysler-Tochter debis] [NC 25,1 Prozent] [NC des ehemaligen Rechenzentrums] [NC der Hansestadt Bremen]. Initially takes over the DaimlerChrysler daughter debis 25.1 percent of the former data processing centre of the Hanse town Bremen. ‘Initially, the DaimlerChrysler daughter debis takes over 25.1 percent of the former data processing centre of the Hanse town Bremen.’

3In the case that an NC or a PC functions as a C field, it is only evaluated in its C function.

(184) Zunächst übernimmt [NC die DaimlerChrysler-Tochter] [NC debis] [NC 25,1 Prozent] [NC des ehemaligen Rechenzentrums] [NC der Hansestadt] [NC Bremen].

However, the annotation of complex chunks also influences the evaluation results for NCs and PCs. Sentence 185 gives an example: The chunk ‘Eintracht Frankfurt’ (a football club) is a complex chunk (i.e. an apposition). However, since the noun ‘Eintracht’ did not occur as a noun triggering an apposition in the development set of the shallow parser, the shallow parser fails to recognize ‘gegen Eintracht Frankfurt’ as a PC but annotates ‘gegen Eintracht’ as a PC and ‘Frankfurt’ as an NC. This counts as one recall error for PCs and one precision error each for PCs and NCs in the evaluation results. Other errors as regards appositions are caused by spelling errors like the one in example 186,4 in which the noun ‘Regisseur’ is misspelled.

(185) [NC Der Werder-Kapitän] droht morgen [PC im wichtigen Heimspiel] [PC gegen Eintracht Frankfurt] auszufallen. The Werder captain threatens tomorrow in the important home match against Eintracht Frankfurt to drop out. ‘The captain of Werder (Bremen) threatens to drop out in the important home match against Eintracht Frankfurt tomorrow.’

(186) [NC Regiseur Kirk Jones] director Kirk Jones ‘film director Kirk Jones’

Further frequent error types are caused by all types of NEs which are realized by sentences (e.g. names of organizations, titles of films and books). In sentence 187, the NP ‘des Vereins “Kinder haben Rechte” ’ contains a sentence. Since we do not mark these NEs and since we want to annotate the inner structure of such sentences, all chunks are annotated. However, in

4Precisely this error occurs twice in the test set.

cases like the one in sentence 187, this leads to errors since the shallow parser recognizes the sequence ‘des Vereins Kinder’ as an NC.5 Another problem is caused by the frequent occurrence of foreign words, mainly from English and Latin, sometimes even as whole sentences. If these words are capitalized and occur in typical noun environments, they can be integrated into NCs. In other cases, however, the shallow parser fails, as in the case of sentence 188, in which ‘killing fields’ was not integrated into a PC.

(187) Westerholt ist zugleich erster Vorsitzender des [NC Vereins] “[NC Kinder] [LK haben] [NC Rechte]”. Westerholt is at the same time first chairman of the association “children have rights”. ‘At the same time, Westerholt is the chairman of the association “children have rights”.’

(188) In Albanien ankommende Flüchtlinge hätten [PC von wahren killing fields] rund um Djakovica gesprochen. In Albania arriving refugees have of real killing fields around Djakovica spoken. ‘Refugees arriving in Albania were said to have spoken of real killing fields around Djakovica.’

Other phenomena which cause problems are wrong segmentations of sentences, which are mainly caused by headlines. In these cases, structures which belong to two different sentences can be included in one chunk and, thus, cause errors. The occurrence of data which has an idiosyncratic structure and precedes or follows the actual text is another problem for the shallow parser since these data are unexpected input. Example 189 shows a piece of data which occurs after film reviews and obviously names the cinemas in which the film is shown. As regards errors occurring in PCs, one can generally say that they are caused by problems in recognizing the NC part of the PC and that, thus, NC recognition is the main problem in annotating NCs and PCs.

5Quotation marks the way they are used in the TüBa-D/Z corpus did not prove to be useful as cues for annotation and were, thus, not used for it.

(189) (Der Spiegel) CinemaxX, Solitaire (Westerstede), Wall-Kino (Ol) (a newspaper) a cinema, a cinema (a town), a cinema (abbrev. of town) ‘no translation’

ACs also serve as targets for GF annotation. The main problems of the annotation of ACs are modification phenomena in conjunction with coordination. Although we do not annotate adverbs (ADVs) modifying ACs, we annotate predicative adjectives/adverbs (ADJDs) modifying ACs, since this also reduces the number of potential GF targets and, thus, contains ambiguity. This phenomenon causes few problems since, most of the time, an ADJD preceding another ADJD modifies it. Sentence 190 shows an exception which caused an error: the ADJD ‘geistig’ precedes the ADJD ‘kurz’ but does not modify it. Such errors occur rarely, as the evaluation results show. Cases of coordination cause problems because we sometimes have to include modifying structures in order to cover coordination. This sometimes fails in complex cases like the one in sentence 191, in which, in the second and third AC, the second coordinate is modified by an adverb which is itself modified.

(190) Dann tritt das Kind [AC geistig] [AC kurz] weg und bekommt Then steps the child mentally briefly aside and gets vom Unterricht nichts mehr mit. of the lesson nothing more with. ‘The child then briefly drops out mentally and is not able to follow the lesson any more.’

(191) Dafür wäre schon die Schrift [AC zu klein], der Satz [AC zu lang und zu wenig griffig], das Foto [AC zu ungünstig und zu wenig farbig] und überhaupt alles [AC zu klein]. For that were already the font too small, the sentence too long and too little handy, the photo too awkward and too little colourful and anyway everything too small. ‘The font would be too small for that, the sentence too long and not handy enough, the photo too awkward and not colourful enough, and anyway everything too small.’

Table 10.3 and figure 10.7 show the results of the chunk evaluation if TnT tags are used for shallow annotation. The results are satisfactory since

          precision   recall   F β=1
overall     90.17%    90.86%   90.51
AC          81.10%    79.94%   80.52
C           93.65%    92.12%   92.88
LK          97.39%    96.08%   96.73
NC          87.42%    88.47%   87.94
PC          92.02%    92.44%   92.23
VC          90.95%    94.49%   92.68

Table 10.3: Quantitative evaluation results for chunks with TnT PoS tags

Figure 10.7: Quantitative evaluation results for chunks with TnT PoS tags

      precision   recall   F β=1
C       97.82%    97.95%   97.89
LK      99.28%    99.08%   99.18
VC      97.09%    98.45%   97.77
VF      94.41%    91.47%   92.92
MF      93.14%    89.52%   91.29

Table 10.4: Quantitative evaluation results for topological fields

the overall F β=1 is still over 90 (i.e. 90.51). Compared to the annotation with gold PoS tags, the F β=1 has dropped by 5.58 points. This is quite moderate if one considers that the tagging accuracy is 95.72% on the test set, i.e. a decrease of 4.28 percentage points compared to gold tags, and that the wrong annotation of just one PoS tag can prevent the correct annotation of a whole chunk.

10.3.2 Topological Fields

Table 10.4 and figure 10.8 show the quantitative evaluation results for topological fields on the test set (4,174 sentences). The parts of the sentence bracket, i.e. C, LK and VC, have already been evaluated on the test section for chunks. The figures for this much larger test section confirm the good results for these linguistic phenomena. The lower figures for C are due to the fact that, in the gold standard, the field C also comprises modifying structures (e.g. PPs) in cases in which a C field consists of an NP. In the test section, there are 3,986 VFs. Our parser has annotated 3,862, 3,646 of which were correct. This yields a precision of 94.41%, a recall of 91.47% and an F β=1 of 92.92. There were 6,343 MFs in the test section. Our parser has annotated 6,096 MFs, 5,678 of which were correct. This yields a precision of 93.14%, a recall of 89.52% and an F β=1 of 91.29. LVs are not included in table 10.4 because there are just 33 LVs in the corpus, which is just a small fraction of all fields. LVs will be discussed later in this section. NFs are not included because their annotation involves the resolution of grammatical relations. This will become clear when NFs are discussed later on in this section. The figures for both VFs and MFs are satisfactory since, in both cases, F β=1 is over 90. As stated before, we only count full matches as correct; but

Figure 10.8: Quantitative evaluation results for topological fields

the failure of a full match does not necessarily mean that further annotation is blocked. Sometimes, it is not even influenced, and the wrong annotation of the field structure is the only error in the annotation. The sentence in figure 10.9 gives an example. The PP ‘mindestens bis Samstag’ is in the NF of the clause, which is manifested by its exposed position, which is underlined by the punctuation (here: the hyphen). However, since such exposed structures without a clear marker (e.g. the right part of the sentence bracket) cannot be detected without a full parse of the tree, the NF is not annotated by the shallow parser. Instead, the shallow parser annotates the MF until the end of the sentence. However, the annotation of GFs is not prevented by this error. The sentence in figure 10.10, which is the second conjunct of a sentence coordination, gives another example: The predicative adjective acts as the right part of the sentence bracket. The following material of the matrix clause is in the NF (here: ‘für vertrauliche Gespräche’). The relative clause at the end of the sentence is in the NF, in all cases. However, since at the point of shallow annotation it is not clear whether the ADJD ‘förderlich’ is a predicative adjective or maybe an adverb, it cannot be taken as a clear marker for the end of the MF and the beginning of the NF, and, thus, annotation fails. However, further annotation is still possible. Similar cases occur at the beginning of VFs when material which is not part of the sentence in the syntactic sense is segmented as part of the

[TüBa-D/Z tree: SIMPX with VF, LK and MF, and an NF containing the V-MOD PX ‘mindestens bis Samstag’; leaves: Wir haben Saison - mindestens bis Samstag . (PPER VAFIN NN $( ADV APPR NN $.)]

Wir haben Saison – mindestens bis Samstag. We have season – at least until Saturday. ‘It’s (vegetable) season – at least until Saturday.’

Figure 10.9: NF without clear marker

sentence as a unit of segmentation. These cases may, in fact, obstruct further annotation because the first chunk in a VF may typically act as a GF and, in those cases, the first chunk is a chunk which is actually outside the sentence. The sentence in figure 10.11 gives an example: The chunk ‘Bürgermeister Runde im Gespräch über die Nutzung seiner alten Hafenreviere’ occurs before the actual sentence. It appears to be a pre-headline, and the following part appears to be the actual headline; but, since it is at the beginning of the sentence, the shallow parser annotates it as part of the VF. For the GF annotation, this would be the ideal candidate for a subject. However, the chunk is not even part of the sentence. The hyphen, which separates it from the sentence, is not a reliable marker for separation. This is precisely the reason why the sentence segmenter has included this chunk in the sentence.

(192) Julianne Köhler aber, Theaterbesuchern ohnehin ein Begriff, ist als sture, treue Musterdeutsche eine Entdeckung. Julianne Köhler but, theatre goers anyway a term, is as stubborn, loyal model German a discovery. ‘But Julianne Köhler – well known to theatre goers anyway – who plays a stubborn and loyal model German woman, is a true discovery.’

[TüBa-D/Z tree: SIMPX with VF, LK and MF, and an NF containing the MOD PX ‘für vertrauliche Gespräche’ with an embedded relative clause (R-SIMPX); leaves: das ist nicht förderlich für vertrauliche Gespräche , die sich anbahnen sollen . (PDS VAFIN PTKNEG ADJD APPR ADJA NN $, PRELS PRF VVINF VMFIN $.)]

Das ist nicht förderlich für vertrauliche Gespräche, die sich anbahnen sollen. This is not beneficial for confidential talks, which themselves initiate should. ‘This is not beneficial for confidential talks which are to be initiated.’

Figure 10.10: NF after predicative adjective

[TüBa-D/Z tree: complex NX material preceding the actual SIMPX; leaves: Bürgermeister Runde im Gespräch über die Nutzung seiner alten Hafenreviere - Hamburg macht alles anders (NN NE APPRART NN APPR ART NN PPOSAT ADJA NN $( NE VVFIN PIS ADV)]

Bürgermeister Runde im Gespräch über die Nutzung seiner alten Hafenreviere – Hamburg macht alles anders. Mayor Runde in the talk about the usage of his old port area – Hamburg does everything differently. ‘Mayor Runde in an interview about the usage of his old port area – Hamburg does everything in a different way’

Figure 10.11: Material occurring before VF

Figure 10.12: Parenthesis after VF

[TüBa-D/Z tree for figure 10.12: SIMPX whose VF includes the parenthesis; leaves: Julianne Köhler aber , Theaterbesuchern ohnehin ein Begriff , ist als sture , treue Musterdeutsche eine Entdeckung . (NE NE ADV $, NN ADV ART NN $, VAFIN KOKOM ADJA $, ADJA NN ART NN $.)]

[TüBa-D/Z tree for figure 10.13: two coordinated SIMPX conjuncts, the second without a verb; leaves: der Mann wird älter , seine Filmpartnerinnen nicht . (ART NN VAFIN ADJD $, PPOSAT NN PTKNEG $.)]

Der Mann wird älter, seine Filmpartnerinnen nicht. The man gets older, his female film partners not. ‘This man gets older but his female film partners do not.’

Figure 10.13: Sentence coordination with verb ellipsis in the second conjunct

A further problem for topological field annotation are parentheses like the one in sentence 192, the TüBa-D/Z annotation of which is shown in figure 10.12. Since parentheses typically cannot be detected without a semantic analysis of the sentence, they cannot be annotated by the shallow parser. Thus, they are included in the topological field in which they occur. Cases like these may cause problems in further annotation if they occur in the MF because they may be annotated as a certain GF although they are not even part of the respective clause. While the problems discussed until now do not make further annotation impossible, the following cases certainly lead to problems in further annotation. The sentence in figure 10.13 shows the coordination of two sentences. In the second conjunct, the verb is left out. The shallow parser is, thus, not able to determine any field structure, and the whole second conjunct is annotated as part of the MF of the preceding clause. Coordination is, in general, a phenomenon which may lead to topological field annotation errors. This is especially true of asyndetic coordinations, i.e. coordinations without a coordinator.

Figure 10.14: Asyndetic coordination

[Tree diagram (figure 10.14): TüBa-D/Z annotation of sentence 193]

(193) Das Zuschütten des Baudocks kostet 8,1 Millionen, für
      The filling-up of the construction dock costs 8.1 millions, for
      Abbruch-Arbeiten werden 3,8 Millionen Mark bereitgestellt.
      demolition-workings are 3.8 millions marks mobilized.
      'The filling-up of the construction dock costs 8.1 millions, 3.8 millions marks are mobilized for demolition.'

Sentence 193 gives an example; the annotation is shown in figure 10.14. Since the first clause has no right sentence bracket and since there is no coordinator, there is no clear indication of the end of the first sentence and the beginning of the second. While problems like this may be handled in less complex sentences like this one, they certainly cause problems in more complex ones. In those cases, topological field annotation and, consequently, clause annotation fails.

In general, one can say that problems were caused when no clear markers of the borders of topological fields occurred, such as the clear beginning or end of the clause, or the sentence bracket. Further problems were caused by foreign material and by NEs which were realized by sentences. Since these problems are more relevant for clause annotation than for topological field annotation, they are discussed in section 10.3.3.

The results for the annotation of the 33 LVs in the test section are rather poor. Only 10 were annotated correctly. One error which appeared more than once is that, in a wenn–dann construction, the wenn-clause was annotated as the VF of the following clause, which really is the matrix clause of the whole sentence, a phenomenon similar to the one discussed in section 10.3.3 and illustrated by figure 10.18 there. The main reason for the poor results for LVs is that the LV is a rare but also quite diverse linguistic phenomenon, which cannot easily be captured.

(194) [LV Wenn man so hochwertig verkaufen will,] erklärte Runde, [VF
      If one so high-value sell want, explained Runde,
      dann] muß man "Unverträglichkeiten ausschalten".
      then must one "incompatibilities eliminate".
      'If we want to sell at a high price, explained Runde, we have to "eliminate incompatibilities".'

There are 1548 NFs in the test section. The shallow parser annotates 1142, but only 818 comply with the ones in the gold standard. This yields a precision of 71.63% and a recall of 52.84% (Fβ=1 = 60.82). This result can

Figure 10.15: V2 clause in NF of V2 clause

[Tree diagram (figure 10.15): TüBa-D/Z annotation of 'Die Grünen-Abgeordnete Brigitte Litfin meinte, es würden sogar Strafanzeigen wegen Hausfriedensbruchs und disziplinarische Maßnahmen angedroht.']

be explained by the nature of the shallow annotation, in which grammatical relations are not annotated. Sentence 195, which is illustrated in figure 10.15, gives an example: there is a sequence of two V2 clauses in the sentence. Since, at the point of shallow annotation, it cannot be resolved whether the second V2 clause is a conjunct in a coordination or whether it is the clausal object of the verb in the first V2 clause, this decision is left open. Clausal objects are annotated by the GF annotation component. However, if the second clause is the clausal object of the verb in the first one, then it is in its NF. This information cannot be detected by the shallow parser, which is the reason for many of the recall errors for NFs.

(195) Die Grünen-Abgeordnete meinte, es würden sogar
      The Greens-representative reckoned, it would even
      Strafanzeigen wegen Hausfriedensbruchs und disziplinarische
      legal proceedings because of trespassing and disciplinary
      Maßnahmen angedroht.
      measures threatened with.
      'The representative of the Green Party said that people were even threatened with legal proceedings because of trespassing and with disciplinary measures.'

A second problem concerns modification. This problem especially affects sequences of two or more subordinate clauses. Sentences 196 and 197 give examples: in sentence 196, the second subordinate clause gives the reason for the event in the first subordinate clause; in sentence 197, the second subordinate clause gives the reason for the event in the matrix clause. Consequently, in sentence 196, the second subordinate clause is in the NF of the first subordinate clause and, in sentence 197, the second subordinate clause is in the NF of the matrix clause, together with the first subordinate clause. Structures like these cannot be distinguished by the shallow parser because they can only be distinguished using semantic information or even world knowledge. Each of these structures may cause precision and recall errors. Since grammatical relations cannot be annotated by a shallow parser, the results for NFs are as poor as they are.

(196) Klaus behauptete, [NF daß er nicht kommen konnte, weil der
      Klaus claimed, that he not come could, because the
      Zug Verspätung hatte].
      train delay had.
      'Klaus claimed that he was not able to come because the train was late.'

     precision  recall  Fβ=1
C    93.13%     93.78%  93.45
LK   96.90%     96.44%  96.67
VC   90.52%     94.46%  92.45
VF   93.42%     88.28%  90.78
MF   87.46%     82.59%  84.96

Table 10.5: Evaluation results for topological fields with TnT PoS tags

Figure 10.16: Evaluation results for topological fields with TnT PoS tags
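Throughout these tables, Fβ=1 denotes the balanced F-measure, i.e. the harmonic mean of precision and recall. As a quick check, the sketch below recomputes the MF row of table 10.5:

```python
# F-measure as the weighted harmonic mean of precision and recall;
# beta=1 weighs both equally (the F_{beta=1} used throughout).
def f_measure(precision, recall, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# MF row of table 10.5: P = 87.46, R = 82.59
print(round(f_measure(87.46, 82.59), 2))  # 84.96, matching the table
```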

(197) Klaus behauptete, [NF daß er nicht kommen konnte, [NF weil
      Klaus claimed, that he not come could, because
      ihm nichts besseres einfiel]].
      him nothing better chimed in.
      'Klaus claimed that he was not able to come because he could not come up with anything better.'

Table 10.5 and figure 10.16 give the results for the annotation of topological fields with TnT PoS tags. As regards the elements of the sentence bracket, the decrease in Fβ=1 has been moderate for the LK if we keep in mind that the tagging accuracy of TnT is 4.28 percentage points below the gold standard (cf. figure 10.16). Since the VF is determined by the beginning of a clause and the LK, the drop in Fβ=1 for VFs is moderate, too. The other two elements of the sentence bracket, C and VC, show, however, a larger drop in Fβ=1. This results in a larger decline of the numbers for the MF with TnT tags as well.

10.3.3 Clauses

The 4,174 sentences in the test section contain 4,125 matrix clauses and 1,607 subordinate clauses. The number of matrix clauses is even lower than the overall number of sentences although some sentences contain more than one matrix clause. The reason for this is that the definition of a sentence for the sentence segmentation is not a purely linguistic one: there are also some strings of tokens which belong together but do not constitute a sentence in the syntactic sense because they do not contain a verb. Such strings are typically headers of or comments on a text, as sentence 198 shows.

(198) Aus Paris Dorothea Hahn
      From Paris Dorothea Hahn
      'From Paris Dorothea Hahn'

Our shallow parser annotates 3,961 V1/V2 clauses in the test section. 3,493 of these are correct. This yields a precision of 88.18% and a recall of 84.68%, as shown in table 10.6 and figure 10.17. 1,565 subordinate clauses are recognized and 1,461 of them are correct. This yields a precision of 93.35% and a recall of 90.91% (cf. table 10.6). While the figures for VL clauses are well over 90%, the figures for V1/V2 clauses are, at first sight, not satisfactory. However, since only exact matches are evaluated, some of the mismatches which caused errors are not complete failures. If the shallow parser has annotated a structure as a clause which contains just one chunk more or less than the gold corpus, then this yields one precision error and one recall error, but the structure may still be useful for further annotation. Sentence 199 shows an example in which a chunk comes at the end of the segmented sentence but is annotated as part of neither the matrix clause nor the subordinate clause in the gold standard TüBa-D/Z. In the case at hand, this yields one precision and one recall error each for V1/V2 clauses and for VL clauses because our parser has included the chunk 'ganz im Gegenteil' in the subordinate clause, which itself is part of the matrix

               precision  recall  Fβ=1
V1/V2 clauses  88.18%     84.68%  86.40
VL clauses     93.35%     90.91%  92.12

Table 10.6: Quantitative evaluation results for clauses

Figure 10.17: Quantitative evaluation results for clauses

clause. The same phenomenon occurs with structures preceding the actual sentence. Although structures like these produce errors in the evaluation, the annotated sentence can still be used for further annotation.

(199) [V2 Doch das bedeutet keinesfalls, [VL daß Winslets Wahl falsch
      But this means in no case, that Winslet's choice wrong
      war]] – ganz im Gegenteil!
      was – exactly on the contrary!
      'But this, in no case, means that Winslet's choice was wrong – quite the contrary!'

A further problem causing errors in the evaluation system is posed by sentences of the type shown in figure 10.18. In these sentences, the matrix clause is included in the coordination of the clauses which serve as its clausal object. In the TüBa-D/Z, this relation of the matrix clause to its clausal object is not annotated because this would lead to crossing brackets, which are not allowed in the TüBa-D/Z.6 Instead, the clausal object is simply left unattached. We do, however, annotate the first V1/V2 clause as part of the second one since it occurs in its VF (otherwise the VF would be empty). This is also done to allow the annotation of at least the first V1/V2 clause as the clausal object of the second. However, as regards the evaluation of shallow clause structures, this leads to one precision and two recall errors each time such a structure occurs.

In the cases in which the annotation failed completely, the main causes of complication were cases of asyndetic coordination in particular and cases in which NEs of all types were realized by sentences (e.g. names of organizations, titles of films etc.). Sentence 200 gives an example of the former case and sentence 201 an example of the latter. In sentence 200, the string 'besessen von Verschwörungstheorien' is in the relation of coordination with the preceding 'tief verstört', but there is no coordinator. In fact, even the right part of the sentence bracket intervenes. Structures like these are called Nachtrag ('addendum') (cf. Engel, 1996, sec. S 204). Furthermore, the string 'vielleicht sogar einen Bombenanschlag auf ein Justizgebäude in St. Louis geplant und ausgeführt hat' is coordinated with material in the preceding subordinate clause. Again, it is an asyndetic coordination, i.e. there is no coordinator. Since the determination of the scope of coordination is problematic in cases like these, the parser fails to annotate those two coordinations.

6This fact also leads to problems for our GF annotation component.

[Tree diagram (figure 10.18): TüBa-D/Z annotation of 'Trägt sich selbst, erklärte Runde, eine Umsiedlung sei mal debattiert worden.']

Trägt sich selbst, erklärte Runde, eine Umsiedlung sei mal
Pays itself, declared Runde, a resettlement be once
debattiert worden.
discussed was.
'It pays for itself, declared Runde, a resettlement was once discussed.'

Figure 10.18: Matrix clause included in coordination of subordinate clause

(200) Eine lange, unheilvoll wirkende Einstellung läßt ahnen, [VL
      A long, inauspicious seeming setting lets foreshadow,
      daß Michael, der Kurse in Terrorismus gibt, tief verstört ist,
      that Michael, who courses in terrorism gives, deeply deranged is,
      besessen von Verschwörungstheorien], und [VL daß Olvier eine
      obsessed with conspiracy theories, and that Olvier a
      zweifelhafte Vergangenheit hat, vielleicht sogar einen Bombenanschlag
      dubious past has, possibly even a bomb attack
      auf ein Justizgebäude in St. Louis geplant und ausgeführt hat].
      on a judiciary building in St. Louis planned and carried out has.
      'A setting which seems long and inauspicious foreshadows that Michael, who gives courses in terrorism, is deeply deranged, obsessed with conspiracy theories, and that Olvier has a dubious past, possibly has even planned and carried out a bomb attack on a judiciary building in St. Louis.'

(201) [V2 [VL Da man die Leiche des Fischmantel-tragenden
      Since one the body of the oilskin jacket-wearing
      Enterhaken-Killers in [V2 "Ich weiß, was Du letzten Sommer
      grappling hook-killer in "I know, what you last summer
      getan hast"] nie gefunden hat], sind die Alpträume von Julie, der
      done has" never found has, are the nightmares of Julie, the
      adretten Überlebenden aus Teil I, nicht unbegründet].
      preppy survivor from part I, not unsubstantiated.
      'Since the body of the grappling hook-killer wearing an oilskin jacket in "I know what you did last summer" was never found, the nightmares of Julie, the preppy survivor from part I, are not unsubstantiated.'

In the case of sentence 201, the annotation failed because the film title "Ich weiß, was Du letzten Sommer getan hast" is a sentence but acts as a noun. Consequently, this structure does not fit into the grammar of the parser and annotation fails. The only way to avoid problems like these is to annotate such NEs beforehand. We have, however, not integrated a fully developed NE recognizer into our parser. The seemingly easy solution to this problem, i.e. to annotate all strings in quotation marks as units, does not prove to be useful since quotation marks by no means just enclose strings which make up constituents but also all types of other material, as experience with the TüBa-D/Z has shown. Still, sentence 202, in which the whole clause structure was annotated correctly, shows that the parser was able to annotate even quite complex clauses.

(202) [V2 Auch fliegen die Leginonäre (sic!) nach Ohrfeigen und
      Also fly the legionnaires after slaps in the face and
      Kinnhaken ungefähr so durch die Luft, [VL wie man
      hooks to the chin approximately so through the air, as one
      sich das bei der Comic-Lektüre immer ausgemalt hatte]] ...
      oneself that during the comic-reading always envisioned has ...
      [V2 aber genau da, bei den Special effects, muß die Mäkelei
      but exactly there, with the special effects, must the fault-finding
      einsetzen], [V2 denn so manche Tricks – etwa der mit dem
      begin, for many a trick – for instance the one with the
      Elefanten in der Arena – sehen wirklich zu hausbacken aus], [V2 da
      elephant in the arena – look really too dowdy like, there
      erwartet der verwöhnte Kinogänger Ende der 90er Jahre von
      expects the spoilt filmgoer at the end of the 90's of
      einer internationalen Großproduktion deutlich bessere
      an international large-scale production considerably better
      Effekte, [VL zudem es am Geld offenbar nicht gefehlt
      effects, moreover it as regards money obviously not lacked
      hat]].
      has.
      'The legionnaires also fly through the air after slaps and hooks in a way in which we have always imagined it when reading the comics ... but exactly at that point of the special effects, we have to start carping because many an animation – e.g. the one with the elephant in the arena – really looks too dowdy; at points like this, the spoilt filmgoer at the end of the 90's expects considerably better special effects from an international large-scale production, especially because there was obviously no lack of money.'

               precision  recall  Fβ=1
V1/V2 clauses  84.63%     79.27%  81.86
VL clauses     89.25%     84.75%  86.95

Table 10.7: Quantitative evaluation results for clauses with TnT PoS tags

Figure 10.19: Quantitative evaluation results for clauses with TnT PoS tags

Table 10.7 and figure 10.19 show the results for the evaluation of V1/V2 clauses and VL clauses with PoS tags assigned automatically by the TnT tagger. Compared to the Fβ=1 of the annotation with gold PoS tags, Fβ=1 has dropped by 4.54 points for V1/V2 clauses and by 5.17 points for VL clauses. Again, as for chunks, this decline can be characterized as moderate since the tagging accuracy for TnT PoS tags is 95.72% on the test set, i.e. a decrease of 4.28 percentage points as compared to gold PoS tags. As with chunks and topological fields, one tagging error may result in the wrong annotation of a whole clause.


Chapter 11

Grammatical Function Annotation

Abstract: Chapter 11 describes the evaluation of the GF parser. Section 11.1 gives an overview of the evaluation system, i.e. the coding of GFs in the gold standard corpus, the conversion of the corpus into a common representation, and the methods of evaluation. Section 11.2 explains the concrete procedure of the evaluation, i.e. the test setting. The results of the GF parser evaluation are discussed in section 11.3. Section 11.4 shows how much each component of the GF parser has contributed to the overall results.

11.1 The Evaluation System

11.1.1 Task Description

The general requirements for an evaluation system for a GF parser do not differ from those for a shallow annotation evaluation system as described in section 10.1.1: the results of the evaluation system have to be meaningful, thus serving as a diagnostic tool for the detection of errors in the parser and for the comparison with other parsers. In order to achieve this goal, a GF evaluation system needs a gold standard corpus containing GFs, a method to convert the GFs in the gold standard corpus and the GFs annotated by the parser into a common representation, and a method for the quantitative and qualitative analysis of errors.

11.1.2 GFs in the TüBa-D/Z Gold Corpus

In the TüBa-D/Z, GFs are marked by edge labels so that they are essentially attributes of the constituents from which the arcs originate. The GFs could not have been annotated as edges themselves because this would have resulted in crossing edges, since GFs can occur across topological field boundaries and the TüBa-D/Z does not allow crossing edges. Figure 11.1 gives an example of a tree in the TüBa-D/Z containing GFs. The tree consists of a V2 clause into which a VL clause is embedded. Both clauses are labeled 'SIMPX'. The embedding SIMPX is introduced by a coordinator field 'KOORD'; and the left part of the sentence bracket 'LK' and the right part 'VC' divide the sentence into the VF, the MF and the NF. Punctuation is not attached. This topological field description is surface-oriented in that it does not show any of the grammatical relations in the sentence. These relations are shown by the edge labels. The SIMPX in the NF is, for instance, marked as 'MOD', which means that it is either an ambiguous modifier or a modifier of the whole sentence. In the case that, e.g., the SIMPX were a relative clause, the modified element would, in addition, be marked by the element which it modifies, e.g. OAMOD for modifier of the accusative object. The scope of the MODs is confined by the structures in which they occur. Thus, in the embedded SIMPX, the ADVX (adverbial chunk or phrase) 'dann', which serves as a MOD, can only be a modifier of either the whole embedded SIMPX or any element within it.

In the whole sentence in figure 11.1, the GFs ON and OA occur twice. They are also marked by edge labels. However, GFs are defined as relations between constituents, and these relations are not explicitly marked in the syntactic tree. Still, the occurrence of the related lexical items of the GFs is limited by the topological fields and the clause structure in the sentence in figure 11.1. The NX 'Uwe Beckmeyer' is assigned the functional label ON. It is, in fact, the ON sub-categorized by the verb 'aufweichen' in the VXINF. This is not explicitly marked but inherent in the constituent structure, since one can infer the relation between the ON and the lexical item by which it is sub-categorized from the constituent information: the ON is always sub-categorized by the lexical verb in the clause if there is any. In those cases in which there is a predicative adjective in the clause (e.g. Er ist des Wartens müde.), this adjective is the sub-categorizing lexical item. Thus, although the sentence in figure 11.1 contains two ONs, there is no ambiguity or underspecification in the syntactic structure.

Figure 11.1: Syntactically annotated tree from the TüBa-D/Z

[Tree diagram (figure 11.1): TüBa-D/Z annotation of 'Und Uwe Beckmeyer wollte auf keinen Fall in das BLG-Monopol aufweichen, weil er dann die anderen Auto-Importeure gleichbehandeln müßte.']

[Tree diagram (figure 11.2): the complex NX 'Der frühere SPD-Vorsitzende und Bundesfinanzminister Oskar Lafontaine' carrying the edge label ON, with two APP daughters, the first of which dominates a coordination of two NXs]

Figure 11.2: Complex phrase with grammatical function

11.1.3 Conversion into Common Representation

Just as they are annotated in the TüBa-D/Z, we evaluate GFs as attributes of constituents and not as relations between them. The sites of their attachment are sufficiently disambiguated by the existing shallow parsing structure, which shows the topological fields and clause boundaries. In order to compare the output of the parser and the gold data, we convert both such that they just contain the GF information. As the constituents carrying the GF information, we choose the heads of the phrases or chunks which represent a certain GF. For OPPs, we decided not to choose the preposition but the head noun of the NX dominated by it. The evaluation of the GFs is, thus, head-based.

There remain the questions of how exactly to convert the GF information in the TüBa-D/Z into a head-based representation such that it can be matched with the converted head-based representation of the output of the parser, and of which representation can be used for comparing these data. As regards the conversion of the data from the TüBa-D/Z, we first transform the TüBa-D/Z into a dependency format. Kübler and Telljohann (2002) show that this is possible. They use the head/non-head distinction which is coded in the TüBa-D/Z to convert the constituency structure into dependency relations; and they use the functional labels to convert the GFs into

dependency relations. In only a few cases in which heads are not present (e.g. coordinations and appositives), specific operations have to be used to determine the dependencies within a constituent. Using this information, it is possible to allocate the correct GF to the respective head of a phrase.

The process of finding the head of a phrase can be illustrated in more detail by figure 11.2, which shows that the GF constituents are sometimes relatively complex and conversion is non-trivial. The phrase 'der frühere SPD-Vorsitzende und Bundesfinanzminister Oskar Lafontaine' in figure 11.2 is, for instance, a multiply embedded structure. The whole phrase serves as an ON. On the whole, however, it consists of six NXs. In order to determine the head of the whole phrase, the information about the head/non-head distinction can be used, but it is not sufficient. For those cases in which there is no such distinction, precisely defined operations are applied to select one noun as the head. The top NX has the edge label ON. The two NXs it dominates do not have any HD label, but both have the label APP. In those cases, the first of the two is chosen and the process of finding the head proceeds with the first of the dominated NXs. This NX dominates just one further NX, which carries the label HD and is, thus, chosen for further processing. This NX dominates two coordinated NXs. In those cases, both NXs receive the label KONJ in the TüBa-D/Z. We, again, choose the first one to continue processing. This pre-terminal NX contains just one word, which is also the head of the NX and is, consequently, chosen as the head of the whole phrase and assigned the function label ON. In general, in those cases in which there is no clear head information, we choose the first feasible constituent for further processing.

In order to compare gold standard annotation and automatic annotation, the representation of both data sources has to be changed such that a comparison is possible.
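Before turning to the comparison format, the first-feasible head-finding walk just described can be sketched as follows. This is a schematic illustration only, not the actual conversion routine; the `Node` layout is a hypothetical stand-in for the corpus data structure:

```python
# Schematic head-finding over a TüBa-D/Z-style constituent, following the
# strategy described above: take the HD-labelled child if there is one;
# among headless APP or KONJ children, take the first; otherwise fall back
# to the first feasible child. The Node class is a hypothetical stand-in.
class Node:
    def __init__(self, label, edge=None, children=None, token=None):
        self.label = label            # category, e.g. 'NX'
        self.edge = edge              # edge label, e.g. 'HD', 'APP', 'KONJ', 'ON'
        self.children = children or []
        self.token = token            # set for terminal nodes only

def find_head(node):
    """Return the head token of a phrase by the first-feasible strategy."""
    if node.token is not None:                 # terminal: the word itself
        return node.token
    for child in node.children:                # clear head information
        if child.edge == 'HD':
            return find_head(child)
    for edge in ('APP', 'KONJ'):               # headless: first apposition/conjunct
        for child in node.children:
            if child.edge == edge:
                return find_head(child)
    return find_head(node.children[0])         # last resort: first child

# The ON phrase of figure 11.2 (simplified): two APP daughters, the first
# of which dominates an HD-marked NX over a coordination of two NXs.
phrase = Node('NX', edge='ON', children=[
    Node('NX', edge='APP', children=[
        Node('NX', edge='HD', children=[
            Node('NX', edge='KONJ',
                 children=[Node('NN', edge='HD', token='SPD-Vorsitzende')]),
            Node('NX', edge='KONJ',
                 children=[Node('NN', edge='HD', token='Bundesfinanzminister')]),
        ]),
    ]),
    Node('NX', edge='APP',
         children=[Node('NE', edge='HD', token='Lafontaine')]),
])
print(find_head(phrase))  # SPD-Vorsitzende
```

The walk mirrors the description in the text: first APP, then the HD daughter, then the first KONJ, whose single word becomes the head of the whole phrase.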
Since the heads are non-recursive (i.e. no head is contained in any other head), it is possible to represent the evaluation as a tagging problem, just as for the evaluation of the shallow structures. Thus, gold data and parser output are converted such that the heads of the (potential) GFs receive the respective GF labels. Then the two data sets are mapped onto each other and can be compared in order to yield various evaluation measures. Furthermore, errors in the corpus can now be located and made subject to a qualitative evaluation. Figure 11.3 shows a sample sentence in the evaluation format. The heads of the GFs, i.e. 'Papst-Besuch' as ON and 'Rolle' as OA, are tagged with their GF tags, and the other tokens are tagged with 'O' for 'outside'.

Der             O     O     the
Papst-Besuch    I-ON  I-ON  pope-visit
in              O     O     to
Bukarest        O     O     Bukarest
spielt          O     O     plays
sowohl          O     O     both
außenpolitisch  O     O     foreign-policy-wise
als             O     O     and
auch            O     O     also
für             O     O     for
Rumänien        O     O     Romania
selbst          O     O     itself
eine            O     O     a
große           O     O     big
Rolle           I-OA  I-OA  role
.               O     O

Figure 11.3: Sample of evaluation representation
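With gold and parser tags aligned token by token as in figure 11.3, the tagging-style evaluation reduces to counting per-category matches. A minimal sketch under that assumption (this is an illustration of the counting logic, not the evaluation system itself):

```python
from collections import Counter

def evaluate(gold, auto):
    """Per-category precision, recall and F1 from two aligned tag columns;
    'O' marks tokens outside any GF, as in the evaluation format above."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, a in zip(gold, auto):
        if g == a != 'O':
            tp[g] += 1                # exact match on a GF head
        else:
            if a != 'O':
                fp[a] += 1            # category selected where gold disagrees
            if g != 'O':
                fn[g] += 1            # gold category missed or mislabelled
    results = {}
    for cat in set(tp) | set(fp) | set(fn):
        p = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        r = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[cat] = (p, r, f)
    return results

# Toy run: one correct ON head, one OA head placed on the wrong token.
gold = ['O', 'I-ON', 'O', 'O', 'I-OA', 'O']
auto = ['O', 'I-ON', 'O', 'I-OA', 'O', 'O']
print(evaluate(gold, auto)['I-ON'])  # (1.0, 1.0, 1.0)
```

Note that a mislabelled head counts against both the gold category (as a false negative) and the selected category (as a false positive), exactly as discussed for the diagnosis tool below.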

11.1.4 Methods of Evaluation

Just as for the annotation of shallow structures, we want to know precision, recall and F-measure for the annotation of GFs. Counting the errors is achieved using the data in the representation shown in figure 11.3, which is the same data format as for the evaluation of the shallow structures. Thus, again, the problem can be viewed as a tagging problem. Table 11.1 shows the results of the evaluation. These results are an important diagnostic tool for the developer and user of the GF parser. In the development phase of the parser, these results enable the grammar constructor to make inferences and work on the crucial problems in the annotation algorithm.

The results of the quantitative evaluation as presented in table 11.1 contain a lot of useful information about the performance of the parser. There are, however, also areas which are not dealt with. The results in table 11.1 do, for instance, not present the absolute figures for errors. This is especially striking for the figures for OG, which show a precision of 100.00%. In fact, there is only one OG in the corresponding test section. There is also no way to tell which categories are mixed up with each other. In order to evaluate these

         precision  recall  Fβ=1
overall  85.52%     80.02%  82.68
OA       82.94%     82.77%  82.86
OD       83.66%     59.40%  69.47
OG       100.00%    7.69%   14.29
ON       90.93%     91.27%  91.10
OPP      72.06%     48.50%  57.98
OS       76.92%     60.67%  67.83
PRED     78.53%     72.87%  75.59

Table 11.1: Quantitative evaluation results for GFs

problems, there is a need to find out about the types of errors. In order to find out more about the types of errors which the parser produces, we apply two further evaluation methods, which are a first step towards a qualitative evaluation in the sense that they indicate the reasons for certain errors. This can be important for the user in order to know the specific problems in the parser output. It is even more important for the developer in order to improve the parser. If we know, for instance, that the precision of ON is 90%, we need to know, in order to improve the parser, what kinds of errors make up the remaining 10%. It is important to know whether these errors occurred because the parser annotated constituents as ONs which are not GFs at all or because the parser annotated constituents as ONs which are other GFs. In the latter case, it is essential to know with which GFs ON has been confused. Figure 11.4 shows the output of the diagnosis tool for the detection of such types of errors. Furthermore, it is important to ensure that errors are eliminated without introducing new ones. Thus, it is also necessary to have an evaluation which points to the relevant changes after the system has been worked on. That way, it can be controlled what kind of overall effect a change to the parser has. Figure 11.5 shows the output of the tool which points to errors in the parser output.

Figure 11.4 gives the number of correctly and incorrectly annotated GFs in decreasing order. The categories in the left column (marked with a '⋆') represent the target category and the categories in the right column (marked with a '+') represent the selected category. In the ideal case, the two categories match (i.e. the system has selected the correct target). These are

  1103 *ON +ON        <-- true positives
   463 *OA +OA
X  286 *ON            <-- false negatives
   218 *PRED +PRED
X  186 *OPP
X  158 *OA
   149 *OPP +OPP
X  103 *PRED
    82 *OS +OS
X   61 +OPP           <-- false positives
    45 *OD +OD
X   42 +ON
X   40 +OA
X   39 *OD
X   32 *OS
X   31 +PRED
X   25 *OA +ON        <-- false negatives/false positives
X   24 +OS
X   20 *ON +OA
X   18 *OD +OA
X   13 +OD
X    7 *ON +PRED
X    6 *ON +OS
X    5 *PRED +ON
X    5 *OD +ON
X    1 *OPP +OA
X    1 *OG

Figure 11.4: Output of the diagnosis tool for the types of errors – gold standard categories left, automatic categories right – errors are marked with an ‘X’ – one example for each category is marked with an arrow

the true positives. In some cases, the system does not return any category for the target category (false negatives). This has an unfavourable effect on recall. In other cases, the system selects a category where there is none (false positives). This has an unfavourable effect on precision. In still other cases, the system mixes up categories. In those cases, it depends on the perspective what type of error they are. For instance, if an OA is annotated as an ON, then this is a false negative from the perspective of the OA since it has not been annotated. Seen from the perspective of the ON, it is a false positive since the selected ON does not match the correct GF. Errors like these are disadvantageous for both precision and recall. True negatives (i.e. tokens without category which were correctly not annotated) are not taken into consideration in this tool since they are not relevant for the evaluation.

With the tool at hand, it is possible to take a closer look at the types of errors of the parser, especially at the absolute numbers of errors. The figures in the sample output show that, at this point of development, most errors were caused by false negatives and that false positives are only relatively frequent for OPPs as compared to the absolute number of OPPs. What is striking is that the confusion of GFs of different categories, which has a bad effect on both precision and recall, is not very frequent as compared to the other negative effects. For instance, 1103 ONs are annotated correctly but only 20 are confused with OAs. Findings like these show the grammar constructor where to put the emphasis of the work on the system.

In order to investigate the cause of different error types, one can query the development corpus in the vertical format as shown in figure 11.5. As can be seen, this representation is a simplification of the representation in figure 11.3 and additionally contains the sentence identification numbers.
This makes working with it more user-friendly. Figure 11.5 shows the vertical format for a sentence in which ON and OA have been mixed up. In this case, the cause of the error is clear: The morphology of both noun chunks is ambiguous and, in such a case, the annotation system decides in favour of the unmarked constituent order 'ON OA'. In other cases, the cause of the error is less clear and it is necessary to have a look at the vertical bracketed format as shown in figure 11.6. Figure 11.6 shows a case in which an ON was not correctly assigned because of a failure of the chunking system which is part of the shallow parsing system. This error was discovered using the format as in figure 11.5. If no obvious cause can be identified in the simple vertical format, the sentence identification number allows finding the respective sentence in the vertical bracketed format.

------ sentence 67 ------
So                            so
sieht                         sees
es                 *OA +ON    it
auch                          also
die                           the
Staatsanwaltschaft *ON +OA    public prosecutor
in                            in
Freiburg                      Freiburg
:

‘This is also the opinion of the public prosecutor in Freiburg’

Figure 11.5: Spotting errors in parser output using simple vertical format

In the case at hand, the chunking system did not correctly annotate 'anderen' as an elliptical noun chunk followed by a noun chunk just containing proper nouns (NE) (Vilma Nunez). This happened because adjectives (ADJA) can in fact modify proper names. Here, the correct head was integrated into a prepositional chunk and, thus, the annotation system failed to assign the correct ON. Instead, the apposition 'Präsidentin' was annotated as ON. This component of the diagnosis tool helps identify the causes of errors and, thus, eliminate them.
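The bookkeeping behind the error counts described above can be sketched as follows. This is an illustrative reconstruction, not the actual diagnosis tool; the function names and the use of `None` for 'no GF' are our own assumptions:

```python
from collections import Counter

def tally_errors(gold, pred):
    """Tally GF annotation outcomes per chunk, as the diagnosis tool does.

    gold, pred: parallel lists of GF labels; None means 'no GF'.
    True negatives (None/None) are ignored since they are not relevant
    for the evaluation.  The key ('OA', 'ON'), for instance, counts OAs
    wrongly annotated as ON -- at once a recall error for OA (a false
    negative) and a precision error for ON (a false positive).
    """
    counts = Counter()
    for g, p in zip(gold, pred):
        if g is None and p is None:  # true negative: skip
            continue
        counts[(g, p)] += 1
    return counts

def precision_recall(counts, label):
    """Precision and recall for one GF label, derived from the tally."""
    tp = counts[(label, label)]
    fp = sum(n for (g, p), n in counts.items() if p == label and g != label)
    fn = sum(n for (g, p), n in counts.items() if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

On a toy input such as gold = ['ON', 'OA', None] and pred = ['ON', 'ON', None], the pair ('OA', 'ON') is counted once and lowers OA recall and ON precision at the same time, exactly as discussed above.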

11.2 The Test-Setting for Evaluation

For the evaluation of the GF annotation, we have used the test section of the TüBa-D/Z (cf. section 10.2). The test section was not used for the development of the parser but exclusively for its evaluation, thus avoiding tuning the parser to the test set. In order to judge the quality of a parser, it is important to know how well it does on unseen data and not how well it does on the data which has been used for its development. This is important both for machine-learning approaches and manual approaches. During its development, the parser was evaluated repeatedly on the development set with the methods described in section 11.1.4, and occurring errors were mended.

------ sentence 41 ------
(VSEC {VF [PC -PRED-
             .APPR Zu                               as
             [NC .NN Gast ] ] }                     guest
      [VCLAF .VAFIN ist ]                           is
      {MF [PC .APPR unter                           amongst
             [NC [AJAC .ADJA anderen ]              others
                 .NE Vilma                          Vilma
                 .NE Nunez ] ]                      Nunez
          .$, ,
          [NC -ON- .NN Präsidentin ]                president
          [NC [NC .ART der                          of the
                  [AJAC .ADJA nicaraguanischen ]    Nicaraguan
                  .NN Menschenrechtsorganisation ]  human rights organization
              [NC .NE CENIDH ] ] }                  CENIDH
      .$. . )

‘Amongst others, Vilma Nunez, president of the Nicaraguan human rights organization CENIDH, is a guest.’

Figure 11.6: Spotting errors in parser output using vertical bracketed format

However, the parser was not altered after the beginning of the evaluation on the test set.

For the evaluation in the following sections, the data was parsed by our parser under various preconditions in order to test their influence on parsing performance. We tested, e.g., the importance of morphological information for GF assignment by applying the parser without any meaningful morphological annotation (i.e. just with default morphological ambiguity classes), by applying it with the morphological ambiguity class assigned by DMOR, and by applying it with the morphological ambiguity class assigned by DMOR and reduced by our own morphological reduction component (as described in chapter 8). Furthermore, we tested the influence of the PoS tags on the performance of the parser by applying the parser once with the gold (i.e. hand-annotated) PoS tags and once with the PoS tags as annotated by the statistical PoS tagger TnT (Trigrams'n'Tags; Brants, 2000). In addition, we tested the contribution of the component not using sub-categorization (SC) information to the parsing results. This component is applied on top of the component using SC information. It is needed because not all verbs can be assigned SC frames or because their entries do not contain the respective SC frame as realized in the sentence in question.
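The preconditions just described form a small experiment grid. A minimal sketch of its enumeration (the condition names are our own shorthand; the actual runs were, of course, produced by the parsing pipeline itself):

```python
from itertools import product

# Our shorthand for the evaluation conditions described above.
MORPHOLOGY = ["default-ambiguity", "DMOR", "DMOR+reduction"]
POS_TAGS = ["gold", "TnT"]
NON_LEXICAL = [True, False]  # also apply the component without SC information?

def experiment_grid():
    """Enumerate every combination of preconditions for an evaluation run."""
    return [
        {"morphology": m, "pos": p, "non_lexical": n}
        for m, p, n in product(MORPHOLOGY, POS_TAGS, NON_LEXICAL)
    ]
```

With three morphology settings, two PoS tag sources and the optional non-lexical component, the grid comprises twelve runs.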

11.3 Discussion of Overall Evaluation Results

Table 11.2 shows the final results of the evaluation of the GF parser. The results are illustrated in figure 11.7. Overall precision is 85.52% and overall recall 80.02%. This yields an Fβ=1 of 82.68. The results vary, however, across the different GFs, and this variation is one of the main issues of this section. The differences are caused by factors which can be located in the German language itself and by factors which are connected with the lexicon used for annotation and the corpus used for evaluation. Besides the quantitative evaluation of the annotation of the various GFs, we will present a qualitative evaluation of errors.

The results for ON are the best among the results for GFs: All values are over 90% (precision: 90.93%; recall: 91.27%; Fβ=1: 91.10). Figure 11.8 shows the breakdown of the 9.07% of cases in which the parser wrongly assigned the GF ON (i.e. the precision errors for ON). In nearly half of the cases (48.82%), the parser assigns ON where there is no GF at all (of those which we annotate with our parser).

          precision    recall     Fβ=1
overall     85.52%     80.02%    82.68
OA          82.94%     82.77%    82.86
OD          83.66%     59.40%    69.47
OG         100.00%      7.69%    14.29
ON          90.93%     91.27%    91.10
OPP         72.06%     48.50%    57.98
OS          76.92%     60.67%    67.83
PRED        78.53%     72.87%    75.59

Table 11.2: Final results of evaluation of GF annotation
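The Fβ=1 values in table 11.2 follow from precision and recall by the usual formula Fβ = (1 + β²)·P·R / (β²·P + R). A minimal check in code (our own helper, operating directly on the percentage values):

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta from precision and recall given in percent."""
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reproducing table 11.2: f_measure(90.93, 91.27) gives the ON row's 91.10,
# and f_measure(85.52, 80.02) the overall 82.68.
```

Since the scale factor cancels, the helper can be fed percentages directly and returns the F-measure on the same scale.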

Figure 11.7: Final results of evaluation of GF annotation

error type    number    percentage
–/ON             249        48.82%
OA/ON            196        38.43%
PRED/ON           39         7.65%
OD/ON             21         4.12%
OS/ON              3         0.59%
OPP/ON             2         0.39%
total            510

(a) ratio of errors    (b) number and percentage of errors

Figure 11.8: Ratio of errors for ON precision

The main reason for this is that modifying structures are assigned the label ON. Frequent examples of this case are dummy 'es' and genitive modifiers. In sentence 203, the dummy 'es' is the modifier of the ON 'ein vierter Anbieter' (hence ON-MOD). However, the parser assigned 'es' the label ON and 'ein vierter Anbieter' the label PRED. Sentence 204 shows a case in which the ON consists of a common noun modified by a genitive proper noun (i.e. 'Winslets Wahl'). Since the proper noun 'Winslets' was not unambiguously assigned the case feature genitive, we were not able to recognize it as a modifier to 'Wahl' and, thus, 'Winslets' was incorrectly assigned the label ON.

(203) Denn [ON−MOD es] ist [ON ein vierter Anbieter] [PRED am Markt]. For it is a fourth supplier at the market. ‘For there is a fourth supplier at the market.’

(204) [ON Das] bedeutet keinesfalls, [OS daß [ON Winslets Wahl] [PRED That means not at all that Winslets choice falsch] war]. wrong was. ‘That does not mean at all that Winslet’s choice was wrong.’

(205) Fünf Tage und Nächte trommeln [OD sich] [ON mehr als 300 Five days and nights drum themselves more than 300

Schlagzeuger] [OA die Finger] wund. percussionists the fingers sore. 'For five days and nights, more than 300 percussionists will play the drums until their fingers get sore.'

(206) Unter dem Titel queer-studies veranstaltet [ON die Under the title queer studies organized the

Arbeitsgemeinschaft LesBiSchwule Studien] [OA einen offenen Tag] study group queer studies an open day [...]

‘The study group queer studies organized an open day entitled ‘queer stud- ies’.’

Adverbial NCs are a further problem for the annotation of ONs, especially if they occur in positions in which ONs occur in the unmarked GF order. Sentence 205 gives an example: The adverbial NC 'fünf Tage und Nächte' occurs in the VF of the sentence. Since it also has the feature nominative in its morphological ambiguity class, it is annotated as ON. The correct ON is, however, the third NC in the sentence (i.e. 'mehr als 300 Schlagzeuger'). It is possible to mark NCs as temporal expressions using lexical information. However, since lexical items expressing temporal expressions can also be part of GFs, as sentence 206 shows, the annotation of temporal expressions is a non-trivial task.

With 38.43% of all precision errors, the annotation of OAs as ONs comes second. The reason for this is that 90% of GFs in the TüBa-D/Z are either ON or OA (cf. figure 11.26 in section 11.4.1) and that the majority of morphologically ambiguous chunks have the features nominative and accusative in their ambiguity class (cf. figure 11.25 in section 11.4.1). Sentence 207 shows a typical example: If two NCs are both ambiguous as regards the features nominative and accusative, the parser decides in favour of the unmarked GF order, which is 'ON precedes OA'. This decision proves to be right in most cases but wrong in sentence 207. The sentence can easily be understood by humans because of the semantic selectional restrictions of the verb (i.e. people (especially spokespersons) affirm facts and not the other way round). In order to correctly annotate sentences like these, semantic information would have to be integrated into the parser because there are no cues other than the verb as to which NC is the ON and which the OA. This can be illustrated

by sentence 208 in which the verb is replaced with another one which only allows the reading 'ON precedes OA' since events affect people and not the other way round.

(207) [OA Das] bestätigte [ON Firmensprecher Peter Schmidt] That affirmed company spokesman Peter Schmidt gestern gegenüber der taz. yesterday towards the taz. 'Company spokesman Peter Schmidt affirmed this fact in a talk with our newspaper, yesterday.'

(208) [ON Das] betraf [OA Firmensprecher Peter Schmidt]. That affected company spokesman Peter Schmidt. 'This event affected company spokesman Peter Schmidt.'
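The disambiguation step discussed above can be sketched as follows. This is a schematic reconstruction of the heuristic, not the parser's actual implementation; the case labels and the function name are our own:

```python
def assign_on_oa(nc1_cases, nc2_cases):
    """Assign ON/OA to two candidate NCs from their morphological
    ambiguity classes (sets of case features such as {'nom', 'acc'}).

    If exactly one NC can be nominative, morphology decides.  If both
    are ambiguous between nominative and accusative, fall back on the
    unmarked constituent order 'ON precedes OA'.
    """
    if "nom" in nc1_cases and "nom" not in nc2_cases:
        return ("ON", "OA")
    if "nom" in nc2_cases and "nom" not in nc1_cases:
        return ("OA", "ON")
    # Both ambiguous: the unmarked order wins -- right in most cases,
    # but wrong for sentence 207, where semantics would be needed.
    return ("ON", "OA")
```

For sentence 207, both 'Das' and 'Firmensprecher Peter Schmidt' carry {'nom', 'acc'}, so the sketch returns ('ON', 'OA') and reproduces the error discussed above.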

The third most frequent precision error for ONs is the annotation of PREDs as ONs (7.65%). The main reason for this is that there are some cases in which the difference between ON and PRED can only be determined by semantic information. Typically, the ON precedes the PRED. Only if the first nominative NC does not agree in number with the verb is the second annotated as the ON in the TüBa-D/Z. However, there are also cases like the one in sentence 209 in which the order is different in the TüBa-D/Z. In those cases, our parser confuses ON and PRED.

A less frequent error is the confusion of OD with ON (4.12%). Since ODs occur far more rarely than ONs, they are annotated quite conservatively in contrast to ONs. Thus, in some cases, if morphologically ambiguous and/or in typical ON position, an OD may be confused with an ON. Typical cases are chunks containing or consisting of the tokens 'der' and 'ihr'. Sentence 210 gives an example: Since 'der' may be either nominative or dative1 and since 'BBB' is an acronym which is morphologically unspecified, 'der BBB' may be either the OD or the ON of the sentence. However, we do not annotate ODs if they also contain nominative in their morphological ambiguity class since this has led to too many annotation errors. Thus, in the case of sentence 210, the parser failed to annotate the OD and annotated it as ON.

The two remaining groups of mistakes each account for less than 1%. OS/ON mistakes are the equivalent of OA/ON mistakes on the level of clauses, and the OPP/ON confusion is caused by the fact that it was not

1We ignore the case of genitive here since the verb does not sub-categorize for an OG.

taken into account that the pronoun 'worum' is a pro-form of a PP and not an NP.

(209) [...] bei einem Artikel, [PRED dessen Autor] [ON Rudolf Sharping] with an article, whose author ist [...] is ‘with an article the author of which is Rudolf Scharping’

(210) Erst nach der Verschrottung der alten Anlage sei [OD der Only after the scrapping of the old construction be the

BBB] aufgefallen, [ON daß die notwendigen Sanierungsarbeiten BBB struck, that the necessary reconstruction insgesamt rund 2 Millionen Mark kosten]. in total about 2 million Marks costs. ‘Only after the scrapping of the old construction, it struck the BBB that the necessary reconstruction would cost about 2 million Marks, in total.’

Figure 11.9 shows the recall errors for ON. In many ways, the errors in recall are the mirror image of the errors in precision because, if a chunk other than the ON is annotated as the ON (error in precision), then the correct ON is not annotated (error in recall). In 66.94% of the recall errors, the correct ON was not annotated with any GF. In a large number of cases, the reason for this is the same as for the most frequent error in precision, i.e. an adverbial chunk or clause received the label ON. Furthermore, another GF could wrongly have been assigned the label ON. Only in very few cases is no ON whatsoever annotated in the clause, since we try to assign an ON in every finite clause.

The second most frequent error for recall (28.23%) is that the ON is annotated as OA. This is the mirror image of the precision error in which an OA is annotated as ON. The reason why fewer ONs are lost to the label OA than the other way round (140 compared to 196) is that we assign an ON in every clause, and the number of annotated ONs is simply higher than the number of annotated OAs. The third most frequent error is the confusion of ONs with PREDs (3.23%). Again, fewer ONs are annotated as PREDs than the other way round (16 compared to 39), as with the ON/OA errors.

error type    number    percentage
ON/–             332        66.94%
ON/OA            140        28.23%
ON/PRED           16         3.23%
ON/OS              6         1.21%
ON/OD              2         0.40%
total            496

(a) ratio of errors    (b) number and percentage of errors

Figure 11.9: Ratio of errors for ON recall

The reason is the same: PREDs are annotated more conservatively than ONs. ON/OS mistakes are the equivalent of ON/OA mistakes on the level of clauses. They occur rarely (1.21% of all ON recall errors). Only two ON/OD errors occur in the test section as compared to 21 OD/ON errors. The reason is, again, that ODs are annotated conservatively as compared to ONs.

OGs are a very special case in the annotation of GFs since there are only 13 of them in the test section, which is 0.14% of all nominal GFs. Only one of them is annotated correctly. The others are not annotated at all. Consequently, OG has a precision of 100% and a recall of 7.69%. The reason for this result is that we only assign the label OG if the respective verb in the sentence sub-categorizes for an OG. While ODs can also be annotated purely on the grounds of their morphological features because every dative is assigned the label OD in the TüBa-D/Z, this is not possible for OGs since the vast majority of genitive chunks are modifying chunks and not GFs. In the cases in which the OGs were not annotated, the SC frames of the respective verbs were not contained in the lexicon. The only verb sub-categorizing for an OG which was both in the lexicon and in the test section was 'entledigen', which sub-categorizes for an ON, a reflexive OA and an OG.

For the GF OD, there is a precision of 83.66% and a recall of 59.40%. This yields an Fβ=1 of 69.47. The reason for the comparatively low recall can be found in two facts which are related to the SC frames of verbs: On the one hand, free datives (cf. section 7.3 for a definition) are annotated as ODs in the TüBa-D/Z. However, free datives are not sub-categorized for by the verb and are, thus, not contained in the SC frame.

error type    number    percentage
–/OD              42           84%
OPP/OD             3            6%
ON/OD              2            4%
OA/OD              2            4%
PRED/OD            1            2%
total             50

(a) ratio of errors    (b) number and percentage of errors

Figure 11.10: Ratio of errors for OD precision

On the other hand, an SC lexicon is never exhaustive, such that not all ODs, even if sub-categorized for by the verb, are contained in the respective SC frame. The figures presented and discussed in section 11.4.3, which deals with the effect of the non-lexical component (i.e. the component not using any SC frame information), show that, of the 59.40% of recall for OD, only 35.73 percentage points were annotated using SC frame information, i.e. only 60.15% of all ODs annotated by the parser were annotated using SC frames, by far the lowest ratio of all GFs. The comparatively low recall for ODs is, thus, caused by the fact that, for the majority of occurrences, OD is not listed in the SC frame of the respective verb.

However, even if an OD is listed in the SC frame, this does not necessarily mean that it is realized in the sentence in which the verb occurs. In fact, it seems to be optional in many cases, and, in those cases, the parser annotates chunks which are no GF at all as OD, as can be seen in figure 11.10. This group of errors is by far the largest (84%). As with ONs, the main reason for this is that a modifying chunk was annotated as OD. In the case of ODs, this modifying chunk is almost exclusively a genitive modifier. The reason for this is that ODs are only annotated if the target chunk does not contain the features nominative or accusative in its morphological ambiguity class, in order to avoid the wrong annotation of the far more frequent ONs or OAs. Thus, confusion of OD with ON and OA occurs rarely, as can be seen in figure 11.10. However, a strategy in which ODs were only annotated in the case they

were unambiguously dative did not prove to be effective. Thus, ODs were annotated even if they contained the feature genitive. In order to keep the confusion of ODs with modifying genitives as low as possible, they were annotated quite conservatively, i.e. either if they were in the SC frame of the verb (for the lexical component) or if they did not occur in positions in which genitive modifiers could occur (for the non-lexical component). The majority of cases in which genitive modifiers were annotated as ODs were, thus, annotated by the lexical component of the parser. The ambiguity in these sentences could only be resolved by semantic information, as can be seen in sentence 211, in which our parser incorrectly assigned the post-modifying genitive phrase 'der Niederlassung' (the branch) the label OD. The verb 'sagen' does, in fact, sub-categorize for an ON, an OS and an OD, and the chunk 'der Niederlassung' does contain the feature dative. Human readers realize that 'der Niederlassung' is not the OD because branches cannot be talked to.

(211) [OS Bei der einwöchigen Abgabe eines Probewagens der With the one-week disposal of a test car of the

A-Klasse an die Ehefrau von Broemme habe [ON es] [OA sich] [OPP A-Class to the wife of Broemme have it itself

um einen normalen Vorgang] gehandelt], sagte [ON der Leiter] about a normal procedure concerned said the manager der Niederlassung Berlin gestern. of the branch Berlin yesterday. ‘The one-week disposal of an A-Class test car to the wife of Mr Broemme has been a normal procedure said the manager of the Berlin branch, yesterday.’
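The decomposition of OD recall cited above can be verified with a one-line calculation (our own helper; the figures are those quoted from section 11.4.3):

```python
def lexical_share(total_recall, lexical_recall):
    """Share (in percent) of the recall contributed by the lexical
    (SC-frame) component, given both figures in percentage points."""
    return 100.0 * lexical_recall / total_recall

# Of the 59.40 points of OD recall, 35.73 come from SC frames:
# lexical_share(59.40, 35.73) is about 60.15 percent.
```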

Figure 11.11 shows the detailed results for the evaluation of OD recall: In contrast to ON and also to OA, the major error category for recall is not that the GF is not annotated at all, but that it is assigned the wrong label. 50.86% of all recall errors were caused by the annotation of OD as OA. The reason for this is that OA is far more frequent than OD (6.7 times; cf. figure 11.26). Thus, in cases of doubt, OA was assigned rather than OD. This especially concerned the tokens 'sich' and 'uns', which can be either dative or accusative. Sentence 212 gives an example: The token 'uns' serves as the OD in the sentence. It was, however, assigned the label OA by our parser. OD is far less often confused with ON (12%) because pronouns with identical forms are not as frequent. The second most frequent error category, i.e. that ODs are not annotated at all, is caused by the fact that ODs are not in the SC frame of the respective verb and that the non-lexical component annotates ODs quite conservatively, for the reason mentioned above.

error type    number    percentage
OD/OA             89        50.86%
OD/–              63        36.00%
OD/ON             21        12.00%
OD/PRED            2         1.14%
total            175

(a) ratio of errors    (b) number and percentage of errors

Figure 11.11: Ratio of errors for OD recall

Sentence 213 gives an example: The lexicon entry for the verb 'angleichen' does not contain the SC frame with an ON, a reflexive OA and an OD and, thus, the OD is not annotated. The non-lexical component fails to assign OD because 'den Privaten' also contains the feature accusative.

(212) [ON Sie] wollen [OD uns] weismachen, [OS daß [ON es] [OA keine They want us make believe, that it no Unterdrückung] im Kosovo gibt] [...] oppression in the Kosovo exists 'They want to make us believe that there is no oppression in Kosovo.'

(213) [ON Matschke] bedauerte, [OS daß [OA sich] [ON der Matschke regretted, that itself the öffentlich rechtliche Rundfunk] in seinen Sendeformen public-law broadcast in its broadcasting formats

immer mehr [OD den Privaten] angleiche]. more and more the privates adapt. 'Matschke regretted that the broadcast under public law more and more adapts to the private broadcast in its broadcasting formats.'

For the GF OA, precision and recall are nearly equal. Precision scores 82.94% and recall 82.77%. This yields an Fβ=1 of 82.86, which is closest to the overall Fβ=1 of 82.68.

error type    number    percentage
–/OA             240        50.21%
ON/OA            140        29.29%
OD/OA             89        18.62%
PRED/OA            6         1.26%
OPP/OA             3         0.63%
total            478

(a) ratio of errors    (b) number and percentage of errors

Figure 11.12: Ratio of errors for OA precision

As regards evaluation results, OA is, thus, the most average of all GFs. Figure 11.12 shows that there are three main error types for OA precision. The most frequent is that OAs were annotated where there was no GF at all, which accounts for half of the mistakes (50.21%). The second is that ONs were confused with OAs, which accounts for nearly a third of the errors in OA precision (29.29%). Third comes the annotation of ODs as OAs (18.62%). As regards recall, there are just two main error types: In over half of the cases (56.81%), the OA was not annotated at all; and in 41.09% of the cases, it was annotated as ON. Most of the errors have already been discussed above since they also concern the GFs ON and OD.

The two most frequent error types for OA precision have corresponding error types in the ON precision. For ONs and OAs, the number of cases in which no grammatical function is annotated is about half the number of errors. As with ONs, most of the errors are a result of modifying structures being annotated as OA. Sentence 214 gives an example: The first potential NC serving as the OA is 'ein einziges Mal', which is annotated as OA by our parser. However, in fact, the NC 'knackigen Jeanshintern' is the OA. Cases like this could only be avoided by using larger lexica for temporal expressions.

(214) Allerdings: [ON Sharon Stone] ist auch als zu Sorgenrunzeln However, Sharon Stone is even as to worry wrinkles

error type    number    percentage
OA/–             271        56.81%
OA/ON            196        41.09%
OA/PRED            6         1.26%
OA/OD              2         0.42%
OA/OS              1         0.21%
OA/OPP             1         0.21%
total            477

(a) ratio of errors    (b) number and percentage of errors

Figure 11.13: Ratio of errors for OA recall

genötigte Mutter, [ON die] nur ein einziges Mal [OA knackigen coerced mother, who only one single time firm

Jeanshintern] zeigen darf, [PRED sehr sexy]. jeans backside show allowed is, very sexy. 'However, Sharon Stone is very sexy even as a mother forced to make a worry-wrinkled face who is only once allowed to show her firm backside in jeans.'

The second most frequent error type is the confusion of ONs with OAs. This is the mirror image of the confusion of OAs with ONs. However, in absolute numbers, the pair ON/OA (140 instances) occurs less often than the pair OA/ON (196). The reason for this is that, if OA is confused with ON in a clause, ON is not necessarily confused with OA in the same clause. Since not all verbs sub-categorize for an OA but nearly all sub-categorize for an ON, ON is annotated more often (and it does, in fact, occur more often in the corpus) and, thus, it is more often confused with OA than vice versa. The third most frequent error for OA precision (18.62%) is the one that caused half the errors in OD recall. If one also considers that OA precision is over 80% and OD recall less than 60%, this illustrates that a smaller error for a more frequent GF can have a large effect on a less frequent GF.

Figure 11.13 shows the dominance of two main errors for OA recall. OA is either not annotated at all, or it is annotated as ON. We have already extensively discussed the confusion of OA and ON. The failure

to annotate OA is caused by the same fact as for ONs, i.e. that a modifying chunk was annotated as OA. However, since not all verbs sub-categorize for OA, it is also caused by the fact that the lexicon is not exhaustive. The non-lexical component then works as a back-up. However, it only annotates OAs if they do not contain the features genitive or dative in their morphological ambiguity class, in order to keep precision high. This leads to some recall errors like the one in sentence 215, which is caused by the fact that 'geben' has no lexical entry with an ON, an OA and an OPP with 'für' as realized in sentence 215.2 Thus, the non-lexical component would have to annotate the OA. However, 'Anlaß' also contains the feature dative and, thus, OA is not annotated. If we did allow the annotation of OAs in cases in which they also contained the feature dative, this would raise OA recall but also lower OA precision.

(215) [ON Ihr Schwachpunkt] gibt [OA Anlaß] [OPP für die gute Her weak point gives cause for the positive Botschaft] des Films. message of the film. 'Her weak point brings out the positive message of the film.'

Prepositional objects (OPPs) are a special case of GFs because the decisive feature in them is the preposition, not the morphological features as with nominal GFs. In the TüBa-D/Z, two different kinds of prepositional objects, which have to be distinguished from free adjuncts (V-MOD), are defined:

1. A PP is called OPP within a sentence if the sentence were ungrammatical without the OPP (or if there was at least a very noticeable change of meaning). For instance, Sie gehen [OPP gegen die Faschisten] vor. / Das Gesetz ist [OPP in Kraft] getreten.

2. A PP is called FOPP if it can be left out of this specific sentence without causing ungrammaticality (or a very noticeable change of meaning) and if its preposition is selected by this specific verb. For instance, Insgesamt berichtet die

2In fact, this is the support verb 'Anlaß geben für etwas' (to give cause for sth., i.e. to cause sth.). However, the entry is also missing in the support verb lexicon.

Polizei [FOPP von 19 Festnahmen und 98 Ingewahrsamnahmen]. / Später würden wir [FOPP über Auswandern] nachdenken. Here, the prepositions select these specific verbs and the PPs cannot be added to any arbitrary verb (which is possible for free adjuncts). In addition, in passive clauses, the subject of the original active clause, which has the form of a prepositional phrase, is marked as FOPP (Sie wurden [FOPP von Autonomen] umringt.).

3. A PP is called V-MOD if its preposition is not selected by this specific verb, i.e., it can be exchanged by any other modifying PP, and similarly, this PP can occur with arbitrary verbs (Nur [V-MOD im griechischen Lager] gab es Probleme). Typical V-MODs are temporal or local adjuncts specifying time and location of the action, event, or state expressed by the verb. (TüBa-D/Z; Telljohann, Hinrichs, and Kübler, 2003, p. 123)

We do not take over the distinction between OPPs and FOPPs and annotate both of them as OPPs because, in the IMSLex, there is no distinction between optional and obligatory prepositional objects. The definition of prepositional objects for the TüBa-D/Z as quoted above confirms what was said about OPPs in section 7.3, i.e. that the distinction between the adverbial and the object function of the PP is not a clear-cut one and that, thus, the definition necessarily remains vague to a certain extent. There are relatively precise alternation rules for obligatory prepositional objects, i.e. that the PP cannot be deleted without the sentence becoming ungrammatical or undergoing a 'noticeable change of meaning'. However, there remains the problem of the definition of 'noticeable' and the problem of obligatory adverbial PPs.

What is more, there are no clear-cut alternation rules for optional prepositional objects. It is just stated that the verb selects the preposition and that the PP in question 'cannot be added to any arbitrary verb (which is possible for free adjuncts)'. However, Eisenberg (1999), p. 294–295, gives an appropriate counterexample: In sentence 216, the PP 'am Gartentor' is a locative adverbial PP which can be left out. However, in sentence 217, it is an obligatory prepositional object, at least in the meaning of 'working on sth.'. The preposition is now semantically empty, just like the preposition 'on' in

the English verb, which has lost its locative meaning. It is, thus, the semantic change which plays an important role in the definition of prepositional objects.

(216) [ON Harald] unterhält [OA sich] am Gartentor. Harald converses himself at the garden door. 'Harald is chatting at the garden door.'

(217) [ON Harald] arbeitet [OPP am Gartentor]. Harald works at the garden door. ‘Harald is working on the garden door.’

(218) [ON Harald] arbeitet im Haus. Harald works in the house. ‘Harald is working in the house.’

Sentence 218, furthermore, shows that the PP 'am Gartentor' can, in fact, be replaced by another PP, but not in the sense of the verb 'working on'. The distinction between obligatory and optional prepositional objects, on the one hand, and adverbial PPs, on the other hand, thus relies on both syntactic and semantic tests. While the syntactic replacement or deletion test is relatively clear-cut, the decision whether a 'noticeable' semantic change has taken place is not, because the change is gradual. It may be obvious in sentences 217 and 218, but the discussion is far more difficult in real-world language data. Thus, the problem concerning the definition of prepositional objects as stated by Eisenberg (1999), which was already quoted in section 7.3, remains:

Tatsächlich ist es noch niemandem gelungen, die Objekts- und Adverbialfunktion der PrGr [Präpositionalgruppe; our remark] durchgängig zu trennen. (Eisenberg, 1999, p. 295)

In fact, no one ever succeeded in universally separating object and adverbial functions of the prepositional group.

The results for both precision and recall show that the task of annotating OPPs is harder than annotating ONs or OAs: OPP precision is at 72.06% and OPP recall at 48.50%. One of the main problems with annotating OPPs is the definition of OPPs discussed above.

error type    number    percentage
–/OPP            262        99.24%
PRED/OPP           1         0.38%
OA/OPP             1         0.38%
total            264

(a) ratio of errors    (b) number and percentage of errors

Figure 11.14: Ratio of errors for OPP precision

On the one hand, this definition is not a clear-cut one even within one approach; on the other hand, the definition of the IMSLex, which we use for annotation, and that of the TüBa-D/Z, which we use as a gold standard, differ. Secondly, PCs which serve as post-modifying chunks or as adverbial chunks may be confused with OPPs if they have the same preposition as an OPP occurring in the SC frame of the respective verb in the sentence. This is especially a problem for precision. Since OPPs are realized by PCs and (except for some cases of PREDs) no other GF which we evaluate is realized by PCs, nearly all errors (99.24%) are caused by the annotation of chunks which are no GFs at all, as figure 11.14 shows.

(219) Auf elf Jahre in Forschung und Lehre kann [ON die 55jährige] zurückblicken.
      On eleven years in research and teaching can the 55-year-old back look.
      ‘The 55-year-old can look back on eleven years in research and teaching.’

(220) Maria blickt zurück (auf das Meer / durch den Nebel / in den Wald).
      Maria looks back (on the sea / through the fog / in the woods).
      ‘Maria looks back (onto the sea / through the fog / into the woods).’

Sentences 219–221 give examples for OPP precision errors caused by definition problems of prepositional objects: In sentence 219, ‘die 55jährige’ is

annotated as the ON in the TüBa-D/Z. No other GFs are annotated. According to the definition used in the TüBa-D/Z, ‘auf elf Jahre in Forschung und Lehre’ is no OPP because the PP can be deleted without causing ungrammaticality (or noticeably changing the meaning) and because the preposition is not selected by the verb, i.e. ‘zurückblicken’ can occur freely with other prepositions. This is in fact the case, as can be seen in sentence 220. However, while ‘zurückblicken’ is clearly used figuratively in sentence 219, since eleven years are not a physical object to be regarded, it is clearly not used figuratively in sentence 220. As regards the TüBa-D/Z definition, there remains the question to what extent the figurative and the non-figurative reading of the verb differ noticeably in meaning.

Ein P-Objekt liegt vor, wenn die PP bzw. deren Proform übertragen gebraucht wird, d.h. wenn sich die Bedeutung der PP im übertragenen Gebrauch direkt von einer adverbialen Bedeutung ableitet. (IMSLex; Eckle-Kohler, 1999, p. 128)

A P-Object [equivalent to OPP; our remark] is existent if the PP or its pro form is used figuratively, i.e. if the meaning of the PP in its figurative use can be directly derived from an adverbial meaning.

As regards the definition of prepositional objects in Eckle-Kohler (1999), p. 128 (quoted above), at least this case is clear: since ‘zurückblicken’ is used figuratively, the PP is, consequently, seen as an OPP. Thus, the PP receives the label OPP, which yields one of the 264 precision errors in the test section. In this case, the error was caused by the differing definitions in the TüBa-D/Z and the IMSLex. Although the case is arguable, the error is incontestable because we use the TüBa-D/Z as our gold standard. Sentence 221 gives another example: ‘Bremen’ is annotated as the ON and ‘sich’ as the OA. However, in the IMSLex, ‘verschulden’ also sub-categorizes for an OPP with the preposition ‘mit’.
Thus, ‘mit 200 Millionen Mark’ is annotated as an OPP by our parser. In this case, it is arguable whether the PP is used figuratively. It cannot, however, be replaced by a PP with any other preposition.

(221) [ON Bremen] will [OA sich] mit 200 Millionen Mark neu verschulden, um [OA seinen Großmarkt] umzusiedeln.
      Bremen wants itself with 200 million Marks new indebt, in order to its superstore relocate.
      ‘Bremen wants to incur new debts of 200 million Marks in order to relocate its superstore.’

The examples above have shown the problems caused by the definition of OPP. Since this definition is not easy to apply, one can assume that the inter-annotator agreement would not be as high as with, e.g., ONs. This means that even a single annotator would not be able to reach the gold standard.3 Although this has not been quantified, the baseline for OPPs would, thus, be lower than for ONs or OAs. There are, however, also errors caused by the occurrence of PCs which have the same preposition as a potential OPP for which the verb in question sub-categorizes but which are post-modifying or adverbial chunks. Sentence 222 gives an example: In this sentence, ‘der Bund der Steuerzahler’ is annotated as the ON and ‘weitere Einsparungen’ as the OA. Our parser has, however, also annotated the PC ‘zur Konsolidierung’ as an OPP because the SC frame list of ‘fordern’ also contains a frame which demands an ON, an OA and an OPP with ‘zu’. But in sentence 222, ‘zur Konsolidierung’ does not serve as an OPP. In fact, it is ambiguous: It either modifies the preceding OA or the verb. Hence, it is annotated as MOD (ambiguous modifier) in the TüBa-D/Z. Sentence 223 gives an example of an instantiation of the SC frame of ‘fordern’ which demands an ON, an OA and an OPP with ‘zu’.4

(222) [ON Der Bund der Steuerzahler in Hamburg] forderte derweil [OA weitere Einsparungen] zur Konsolidierung des Haushalts in der Hansestadt.
      The Union of the tax payers in Hamburg demanded meanwhile further savings for the consolidation of the budget in the Hanse town.
      ‘Meanwhile, the tax payers union in Hamburg demanded further savings for the consolidation of the budget in the Hanse town.’

(223) [ON Er] fordert [OA seinen Gegner] [OPP zum Duell].
      He challenges his rival to a duel.
      ‘He challenges his rival to a duel.’

3The TüBa-D/Z has been annotated by at least two annotators and, in case of disagreement, the problems were discussed. 4N.B.: Although ‘fordern zu’ just occurs with few nouns, it is not a support verb since the verb ‘fordern’ has not lost its lexical content.
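The kind of precision error just described can be pictured as a simple preposition check: a PC becomes an OPP candidate whenever its preposition matches an OPP slot in one of the verb's SC frames, regardless of whether the PC is actually a modifier. The following toy sketch is our own illustration, not the actual parser; the frame notation and all names are hypothetical simplifications of the IMSLex information.

```python
# Toy fragment of an SC frame lexicon; "OPP:zu" marks an OPP slot with
# preposition 'zu'. The real resource is the IMSLex.
SC_FRAMES = {
    "fordern": [{"ON", "OA"}, {"ON", "OA", "OPP:zu"}],
    "zurückblicken": [{"ON"}],
}

def opp_candidates(verb, pcs):
    """pcs: list of (chunk_text, preposition) pairs.
    Return the PCs whose preposition matches an OPP slot of the verb."""
    frames = SC_FRAMES.get(verb, [])
    preps = {gf.split(":")[1] for f in frames for gf in f if gf.startswith("OPP:")}
    return [text for text, prep in pcs if prep in preps]

# Sentence 222: 'zur Konsolidierung' (zu + der) matches the OPP:zu slot of
# 'fordern' although it is in fact a modifier -- hence a precision error.
print(opp_candidates("fordern", [("zur Konsolidierung", "zu")]))
```

The check necessarily over-generates: it cannot distinguish a selected OPP from a modifier that happens to share the preposition, which is exactly the error pattern in figure 11.14.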

error type   number   percentage
OPP/–           708       97.93%
OPP/PRED          7        0.97%
OPP/OD            3        0.41%
OPP/OA            3        0.41%
OPP/ON            2        0.28%
total           723

Figure 11.15: Ratio of errors for OPP recall

Figure 11.15 shows the ratio of errors which occurred for OPP recall. The vast majority of errors (nearly 98%) occurred because the OPP was not annotated at all. Again, the problem of the definition of prepositional objects plays a crucial role in the error analysis, just as for OPP precision, except that recall shows the mirror image of precision, i.e. prepositional objects recognized by the TüBa-D/Z but not contained in the IMSLex. Sentence 224, however, gives an example in which the PC would most likely be regarded as an OPP by both approaches: it is annotated as a prepositional object in the TüBa-D/Z, but the relevant SC frame is not contained in the IMSLex and the PC is, thus, not annotated by our parser. The PC ‘um das Thema’ is selected by the verb ‘ergänzen’ and it cannot be replaced by a PC with any other preposition. Furthermore, an OPP can be assumed by the method of exclusion: ‘um das Thema’ does not modify the preceding pronoun, nor does it act as an adverbial expression.

(224) [ON Ich] würde [OA das] [OPP um das Thema] ergänzen, wie [ON man] in den Stadtteilen wirklich [OA Kristallisationspunkte] schaffen kann.
      I would that with the issue complete, how one in the town quarters really focal points create can.
      ‘I would complete that with the issue how you could really create (cultural) centres in the different quarters in town.’

error type   number   percentage
–/PRED          181       85.38%
ON/PRED          16        7.55%
OPP/PRED          7        3.30%
OA/PRED           6        2.83%
OD/PRED           2        0.94%
total           212

Figure 11.16: Ratio of errors for PRED precision

The GF PRED is annotated with a precision of 78.53% and a recall of 72.87%, which yields an F β=1 of 75.59. PRED has a special status among the GFs because it can be realized by an NC with the feature nominative or with the Vergleichspartikel (particle of comparison; KOKOM) ‘als’, by a PC, by an AJVC and by an AVC. This makes finding the correct PRED among these various candidates a challenging task. The main error for PRED precision is that PREDs are annotated where there are no GFs at all (85.38%). There are mainly two reasons for this. On the one hand, there are cases in which there is a PRED in the clause but the parser annotated the wrong chunk as the PRED. On the other hand, there are cases in which the parser annotated a PRED in a clause in which none occurred. The latter case mainly concerns chunks with ‘als’. For the first case, sentence 225 gives an example: In it, the adjective chunk ‘zu klein’ is the PRED. However, since predicative adjectives (PoS-tagged ADJD) can also serve as adverbs, we have chunk-labelled them as AJVC, for either adjective or adverb chunk. Hence, it is by no means clear whether ‘zu klein’ is the PRED. In the case at hand, our parser has annotated the PC ‘dafür’ at the beginning of the sentence as the PRED, and ‘dafür’ could, in fact, be the PRED with different lexical material in the sentence. This example illustrates the special status of PREDs among the GFs: there are various candidates for the annotation of the PRED.

(225) Dafür wäre schon [ON die Schrift] [PRED zu klein].
      For that were already the font too small.
      ‘The font would be too small for that, in the first place.’
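The multitude of PRED candidates just described can be sketched as a filter over chunk types. This is our own illustrative simplification; the chunk representation and all names are hypothetical, not the parser's internal format.

```python
# Collect PRED candidates: nominative NCs (or NCs with 'als'), PCs, and
# AJVC/AVC chunks all qualify, which is what makes PRED selection hard.
def pred_candidates(chunks):
    """chunks: list of dicts with 'type', 'text' and optional 'case'/'als'."""
    out = []
    for c in chunks:
        if c["type"] == "NC" and ("nom" in c.get("case", set()) or c.get("als")):
            out.append(c["text"])
        elif c["type"] in ("PC", "AJVC", "AVC"):
            out.append(c["text"])
    return out

# Sentence 225: the PC 'dafür', the nominative NC 'die Schrift' (the actual
# ON) and the AJVC 'zu klein' (the actual PRED) all qualify as candidates.
print(pred_candidates([
    {"type": "PC", "text": "dafür"},
    {"type": "NC", "text": "die Schrift", "case": {"nom"}},
    {"type": "AJVC", "text": "zu klein"},
]))
```

Since every clause with a copula verb typically contains several such candidates, a wrong pick hurts precision and a missed pick hurts recall, matching the error profile in figures 11.16 and 11.17.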

Another peculiarity of PREDs is that they are sub-categorized for only by copula verbs like ‘sein’, ‘werden’, ‘bleiben’, ‘scheinen’ and ‘heißen’. An exception to this is PRED with ‘als’. This type of PRED can be sub-categorized for by various verbs, as in ‘agieren als’ (to act as) or ‘gelten als’ (to be regarded as). Since, however, NCs with ‘als’ can also have other functions than PRED, their annotation is one of the main problems for PRED precision. Sentence 226 gives an example: The NC ‘eine 60köpfige Versammlung’ is the ON in the subordinate clause of the sentence, and the NC ‘als das schottische Parlament’ is a post-modifier to the preceding NC. Since, however, the verb ‘wählen’ sub-categorizes for a PRED with ‘als’, it is annotated as the PRED. This reading would, in fact, be possible with the same syntactic structure but with different lexical material.

(226) In Wales, wo heute [ON eine 60köpfige Versammlung] mit weit weniger Befugnissen als das schottische Parlament gewählt wird, hat [ON Labour] [OA weniger Sorgen].
      In Wales, where today a 60-member assembly with a lot fewer competences than the Scottish parliament elected is, has Labour less problems.
      ‘In Wales, where an assembly of 60 members with a lot fewer competences than the Scottish parliament is elected today, Labour has less problems.’

The second most frequent error for PRED precision, which is also relatively prevalent, is the confusion of ONs with PREDs (7.55%). The main reason for this error is that ON and PRED have the same case features. Although the ON typically precedes the PRED, errors are still possible, as sentence 227 shows: There are two NCs which contain the feature nominative in the sentence: ‘es’ and ‘ein vierter Anbieter’. Hence, our parser annotated the first as the ON and the second as the PRED. However, ‘es’ is a modifier to the ON (ON-MOD) and the PRED is realized by the PC ‘am Markt’. This error shows, again, the problems which occur because PREDs are realized by various constituents.

error type   number   percentage
PRED/–          243       83.51%
PRED/ON          39       13.40%
PRED/OA           6        2.06%
PRED/OS           1        0.34%
PRED/OPP          1        0.34%
PRED/OD           1        0.34%
total           291

Figure 11.17: Ratio of errors for PRED recall

(227) Denn [ON-MOD es] ist [ON ein vierter Anbieter] [PRED am Markt].
      For it is a fourth supplier at the market.
      ‘For there is a fourth supplier at the market.’

Figure 11.17 shows the ratio of the errors for PRED recall. The most frequent error is that PREDs are not annotated at all (83.51%). The main reason for this is that no adequate candidate for the annotation of PRED could be found. This is different from other GFs because, in their case, the GF might just have been missing in the SC frame of the relevant verb. For PREDs, however, there are just the few copula verbs, so this problem does not occur. Sentence 228 gives an example of a case where no adequate candidate was found. A PRED typically occurs towards the end of the sentence. It may be realized by various constituents. In the case at hand, the final NC in the clause does not have the feature nominative, and the PC ‘auf dem Stand’, which actually is the PRED, does not occur at the end of the sentence. Thus, our parser failed to assign this PRED. There are many combinations like this in which we rather do not annotate a PRED than risk spoiling its precision.

(228) [ON Er] ist [PRED auf dem Stand] eines Kleinkindes, sagt [ON seine Mutter].
      He is on the level of a toddler, says his mother.
      ‘He is on the developmental level of a toddler, says his mother.’

The second most frequent error is that PREDs are confused with ONs (13.40%). The reason for this is that both ON and PRED are marked by the feature nominative. Typically, the ON precedes the PRED. However, if the first potential ON with nominative case does not agree in number with the verb, the second potential ON is taken if it agrees in number with the verb. Still, for our parser, it is, in some cases, not possible to unambiguously determine the feature number of an NC. This is especially true of coordinated NCs. In the NC ‘der SPD-Vorsitzende und Bundeskanzler Schröder’, it is not clear whether there are one or two persons. Hence, for coordinated NCs, we assume for our parser that they are either singular or plural (except if all constituents are plural). The problems caused by this decision can be seen in sentence 229. Since the second NC in the second clause is plural and the verb is supposedly plural, too,5 the second NC is annotated as the ON and the first as the PRED in the TüBa-D/Z. Our parser, however, annotates the first as ON and the second as PRED.

(229) Doch [OA alle fünf Altrocker] plagt [ON ein existenzielles Zipperlein], sein [PRED es] [ON der Suff, Schulden, eine kaputte Ehe oder die (inzwischen historische) große Liebe].
      But all five senior rockers afflict an existential gout, be it the boozing, debts, a broken marriage or the (meanwhile historical) great love.
      ‘But all five senior rockers are afflicted by an existential gout, be it boozing, debts, a marriage in pieces or the (meanwhile historical) great love.’
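The number-agreement heuristic described above can be sketched as follows; this is our own simplified illustration, and all names are hypothetical. The first nominative candidate whose possible number values include the verb's number is taken as the ON, which is why a coordinated NC with ambiguous number can match where a clearly singular ‘es’ cannot.

```python
# Pick the ON among nominative candidates by number agreement with the verb.
def pick_on(verb_number, ncs):
    """verb_number: 'sg' or 'pl'.
    ncs: list of (text, set of possible number values), in linear order."""
    for text, numbers in ncs:
        if verb_number in numbers:
            return text  # first candidate agreeing in number wins
    return None

# Sentence 229: the verb is (per the annotators) plural; 'es' is singular,
# the coordinated NC is treated as ambiguous between singular and plural.
coordinated = "der Suff, Schulden, eine kaputte Ehe oder die große Liebe"
print(pick_on("pl", [("es", {"sg"}), (coordinated, {"sg", "pl"})]))
```

With fully disambiguated number features this heuristic yields the gold annotation for sentence 229; the parser's error arises precisely because the number of the coordinated NC (and of the misspelled verb form) cannot be determined unambiguously.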

The GF OS is annotated with a precision of 76.92% and a recall of 60.67%. This yields an F β=1 of 67.83. 91.40% of the precision errors are caused by annotating clauses which did not have any GF label at all, and 6.45% by annotating clauses which actually were labelled ON. For the first type of precision error, there were mainly two reasons: On the one hand, there are errors caused by an annotation problem in the TüBa-D/Z and, on the other hand, there are modifying clauses wrongly annotated as OS. Sentence 230 gives an example of the problem which occurs if parts of a subordinate clause both precede and follow the matrix clause. In sentence 230, ‘erklärte Runde’ is the matrix clause, and the sentential object of this clause surrounds the matrix clause. Thus, it is impossible to annotate the subordinate clause as the OS of the matrix clause without crossing edges (which are not allowed in the TüBa-D/Z), as can be seen in figure 11.19. In our annotation system, we cannot annotate surrounding subordinate clauses, either. However, since we also cannot determine whether there is a surrounding OS clause,6 we annotate one part of it, which yields an OS precision error.

5In fact, the verb is misspelled and the annotators in the TüBa-D/Z had to determine whether the author intended to use the singular (i.e. ‘sei’) or the plural reading (i.e. ‘seien’) of the verb. They decided for the latter.

error type   number   percentage
–/OS             85       91.40%
ON/OS             6        6.45%
PRED/OS           1        1.08%
OA/OS             1        1.08%
total            93

Figure 11.18: Ratio of errors for OS precision

(230) Wenn [ON man] so hochwertig verkaufen will, erklärte [ON Runde], dann muss [ON man] [OA Unverträglichkeiten] ausschalten.
      If one so high-value sell wants, declared Runde, then must one incompatibilities eliminate.
      ‘If you want to sell at a high price, declared Runde, you have to eliminate incompatibilities.’

(231) [ON [ON Wer] [OA sich] so rasch [OPP von seinen pazifistischen Positionen] verabschiedet und statt dessen [OD der scheinbaren Kriegslogik] [OA das Wort] redet,] sollte [OA sich] [OPP davor] hüten, [OPP-MOD [OD anderen] [OA nationalistische Motive und Gedanken] zu unterstellen].
      Who himself so quickly of his pacifistic positions say good-bye and instead of that the imaginary war logic the word talks, should himself of that beware, others nationalistic motives and thoughts to allege.
      ‘Whoever gets rid of his pacifistic positions so quickly and talks in favour of the imaginary logic of war should beware of alleging that others have nationalistic motives and thoughts.’

6In sentence 230, this would mean that we had to determine whether the wenn-clause is a modifier to the matrix clause or the OS clause, which we cannot.

[Tree diagram for ‘Wenn man so hochwertig verkaufen will, erklärte Runde, dann muß man "Unverträglichkeiten ausschalten".’, showing the parts of the subordinate clause preceding and following the matrix clause; topological field nodes (VF, LK, MF, VC) and edge labels such as ON, OA, MOD, OV under SIMPX nodes.]

Figure 11.19: Subordinate clause surrounding matrix clause in TüBa-D/Z

Another problem is to distinguish between OS clauses and modifying clauses. In sentence 231, the infinitive clause ‘anderen nationalistische Motive und Gedanken zu unterstellen’ is a modifier to the OPP ‘davor’. Thus, the verb is ‘sich davor hüten zu’. However, since this reading is missing in the SC frame list of the verb ‘hüten’, we annotated the other reading ‘sich hüten zu’. This yielded an OS precision error. The second most frequent OS precision error (6.45%) is annotating ON clauses as OS clauses. Sentence 232 gives an example: The daß-clause is the ON of the sentence, and the ‘es’ in the matrix clause is the modifier of the ON (ONMOD). Since this reading is, however, not coded in the SC frame of the verb in the lexicon, the annotation of the correct reading has failed and the parser has matched the structure to the incorrect reading of an existing SC frame with ON and OS.

(232) [ONMOD Es] ist nicht auszuschließen, [ON daß er seinen trüben Obskurantismus auch in seine Unterrichtseinheiten einfließen läßt].
      It is not to be ruled out that he his cheerless obscurantism also into his teaching units incorporate lets.
      ‘It cannot be ruled out that he incorporates his cheerless obscurantism into his lessons, as well.’

error type   number   percentage
OS/–            198       98.51%
OS/ON             3        1.49%
total           201

Figure 11.20: Ratio of errors for OS recall

As regards recall errors for OSs, most of the errors were caused by not annotating the respective sentence at all (98.51%). Only in 1.49% of the cases did a wrong annotation lead to an OS recall error (cf. figure 11.20). The main reason for these recall errors was missing SC frames. The main cause for missing SC frames lies in the newspaper style of the corpus, which allows many verbs to sub-categorize for an OS which would typically not sub-categorize for an OS. Sentence 233 gives an example: In the sentence, three V2 clauses are coordinated asyndetically, and this coordination acts as the OS of the matrix clause. While this structure does not cause major problems for the parser, the lexical verb of the matrix clause, ‘rechnen’, does not have a SC frame with an ON and an OS in its SC list; and, in fact, the use of ‘rechnen’ with these GFs is unlikely outside newspaper style. The only way to handle structures like these is to deal with them in the non-lexical component. This approach did, however, not succeed.

(233) [OS Die staatliche Bremer Investionsgesellschaft (BIG) soll das Projektmanagement übernehmen, dafür bekommt sie ein Prozent des “geschätzten Auftragsvolumens”, das seien 2,58 Millionen Mark,] rechnet [ON das Wirtschaftsressort].
      The state-run Bremian investment society (BIG) should the project management take over, for that receives it one percent of the “estimated size of the order”, this be 2.58 million Marks, calculates the economics department.
      ‘The calculations of the department of economics are as follows: The state-run investment society of Bremen (BIG) is supposed to take over the project, it will receive one percent of the “estimated size of the order” for this, this would be 2.58 million Marks.’

          precision    recall    F β=1
overall     83.49%     75.40%    79.24
OA          80.85%     77.09%    78.92
OD          81.73%     57.08%    67.21
OG         100.00%      7.69%    14.29
ON          88.62%     87.84%    88.23
OPP         71.66%     44.63%    55.00
OS          72.96%     55.97%    63.34
PRED        75.70%     62.72%    68.60

Table 11.3: Final results of evaluation of GF annotation with TnT PoS tags
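The F β=1 values in table 11.3 are the weighted harmonic mean of precision and recall, with β = 1 weighing both equally. As a quick check, a few lines of Python (our own illustration, not part of the evaluation tooling) reproduce the tabulated scores:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta = 1: equal weight)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Overall row of table 11.3: precision 83.49%, recall 75.40%
print(round(f_beta(83.49, 75.40), 2))  # 79.24
# ON row: precision 88.62%, recall 87.84%
print(round(f_beta(88.62, 87.84), 2))  # 88.23
```

Because the harmonic mean is dominated by the smaller of the two values, the comparatively low recall figures (e.g. for OPP and OG) pull the F scores well below the precision values.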

Table 11.3 shows the final results of the evaluation of the GF parser on TnT PoS tags.7 The results are illustrated in figure 11.21. Overall precision is 83.49% (gold PoS tags 85.52%; –2.03 percentage points) and overall recall 75.40% (gold PoS tags 80.02%; –4.62 percentage points). This yields an F β=1 of 79.24 (gold PoS tags 82.68; –3.44 points). These results show that the GF parser works robustly: although the PoS tagging accuracy dropped by 4.28 percentage points, the decrease in precision is very moderate. The decrease in recall shows that the tagging accuracy affects recall rather than precision (cf. figures 11.22 and 11.23), which means that, due to tagging errors, there was no longer a parsable sequence which the parser could match. The decrease in performance for TnT tags is distributed quite evenly both for precision and recall. The only exception is the drop of recall for PREDs (–10.15 percentage points), which is due to the 14.62-point drop of F β=1 on TnT tags for adjective chunks, which are in part the target chunks for PREDs.

7The PoS tag annotation with TnT for the test set is described in section 10.2.

Figure 11.21: Final results of evaluation of GF annotation with TnT tags

Figure 11.22: Impact of PoS tag quality on precision

Figure 11.23: Impact of PoS tag quality on recall

11.4 Impact of Different Components on Annotation

This section points out how much a certain technique or knowledge source contributes to the final evaluation results discussed in section 11.3. That way, the importance of a certain technique or knowledge source either for overall success or for a specific GF can be made clear. Section 11.4.1 presents the results of the component reducing the number of readings in the morpholog- ical ambiguity class of a chunk, and section 11.4.2 shows how this reduction affects the results of the annotation of GFs.

11.4.1 Evaluating Morphological Ambiguity Reduction

The first component which is applied after the shallow parsing component is the morphological ambiguity class reduction component described in chapter 8. On the basis of the shallow parsing structure and evidence about the

constraints of morphological features occurring, e.g., within one chunk, it reduces the number of morphological features in chunks. Thereby, it reduces the potential GFs which a chunk can have since nominal GFs (i.e. GFs realized by a noun chunk) are closely connected with case features. Figure 11.24 shows the ratio and number of case features occurring in the morphological ambiguity class of a chunk after the morphological annotation by the tool DMOR, and to what extent they are reduced after application of the morphological ambiguity class reduction component. Case features before reduction were calculated by taking into account the ambiguity classes of all tokens in a chunk. Case features after reduction were calculated by taking into account the reduced morphological ambiguity class assigned to the respective chunk. Only maximal chunks (i.e. chunks not contained in any other chunk) were taken into account because only they can serve as GFs. NCs contained in PCs are, thus, not included in the count since, for the annotation of OPPs, only the feature preposition is used. The maximal ambiguity of a chunk is that it has the case features of all four cases. In the ideal case, a chunk has just one case feature. In this case, the GF can almost unambiguously be assigned (except for genitives).8 Figure 11.24 shows that before the application of the reduction component, the majority of chunks, i.e. 57.63%, has four analyses, i.e. they are completely ambiguous. With the help of the reduction component it is possible to reduce this proportion by 35.39 percentage points to 22.24% of all chunks. On the other end of the scale, it is possible to raise the share of the chunks which are unambiguous from 11.70% to 24.95%. This is even more important if it is taken into account that one unambiguous potential GF can also disambiguate another one without any assumptions about the linear order of GFs.
If, e.g., there are two potential GFs in a clause with a verb which sub-categorizes for an ON and an OA, and one of them is unambiguously nominative and the other one has the potential to be either nominative or accusative, then the fact that one of the two is unambiguously nominative reveals that the other one is unambiguously accusative, as the simple sentence 234 shows.

8Only a few NCs do not have any morphological case features since we assign default ambiguity classes. However, reflexive pronouns (STTS tag: PRF) and cardinal numbers (STTS tag: CARD) are not assigned case features by DMOR. We do not assign any default morphological ambiguity class to PRFs since their respective GFs are assigned on the basis of their form. Cardinals only receive a default ambiguity class when they are not dates, telephone numbers, etc.

               without reduction      with reduction       difference (pp)
1 analysis      1446    11.70%         3084    24.95%         +13.25
2 analyses      1836    14.86%         4834    39.11%         +24.25
3 analyses      1954    15.81%         1692    13.69%          –2.12
4 analyses      7123    57.63%         2749    22.24%         –35.39

(total numbers and percentage of different analyses in the test section; a chart of these ratios, panel (a) of the figure, is not reproduced)

Figure 11.24: Effect of morphological ambiguity class reduction

(234) [NC Die Tochter]{nom,acc} liebt [NC der Vater]{nom}.
      The daughter loves the father.
      ‘The father loves the daughter.’
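The disambiguation step just described amounts to a small constraint propagation: if one of two candidate NCs is unambiguously nominative, the other must fill the accusative slot of an ON/OA frame. A minimal sketch, under our own simplified representation (all names are hypothetical):

```python
# For a verb sub-categorizing for ON and OA: if one of two candidate NCs is
# unambiguously nominative and the other can be accusative, assign ON and OA
# without any assumption about the linear order of the GFs.
def assign_on_oa(chunks):
    """chunks: list of (text, set of case features). Returns {text: GF} or {}."""
    if len(chunks) != 2:
        return {}
    pair = list(chunks)
    for (ta, ca), (tb, cb) in (pair, pair[::-1]):
        if ca == {"nom"} and "acc" in cb:
            return {ta: "ON", tb: "OA"}
    return {}

# Sentence 234: 'Die Tochter' is {nom, acc}, 'der Vater' is unambiguously {nom},
# so 'der Vater' must be the ON and 'Die Tochter' the OA.
print(assign_on_oa([("Die Tochter", {"nom", "acc"}), ("der Vater", {"nom"})]))
```

Note that the assignment succeeds even though the OA precedes the ON here, i.e. the morphological evidence overrides the unmarked linear order.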

Figure 11.24 furthermore shows that there has been a significant rise in the number of chunks which have only two analyses left (+24.25 percentage points) and that the number of chunks which still have three analyses has been slightly reduced (–2.12 percentage points). Especially the rise of the number of chunks which have just two analyses left could suggest, at first sight, that the annotation of GFs could now be considerably easier, since the annotation also relies on SC frame information of the verbs. However, if one takes a closer look at the distribution of the case features among the chunks with two analyses, as shown in figure 11.25, it becomes clear that there are still problems. As can be seen in figure 11.25, among the ambiguity classes which comprise two case features, the ambiguity class which covers the vast majority of chunks is the one with the case features accusative and nominative (82.11%). Since, however, these two case features are strongly connected with the three most frequent nominal GFs ON, OA and PRED, which together comprise about 95.26% of all nominal GFs (cf. figure 11.26⁹), this class does not contribute as much to the disambiguation of GFs as it appears at first sight. On the whole, 87.46% of the chunks with more than one analysis have both nominative and accusative in their ambiguity class. But, although the reduction might not lead to a large overall improvement, one can predict that it will be important for the lower-frequency GFs OG and OD. Furthermore, it has to be pointed out that the fact that a chunk is nominative or accusative also prevents it from being regarded as a genitive modifier, which is not counted among the GFs at all.
On the whole, one can say that, although the figures presented in this section give evidence about the effectiveness of the morphological ambiguity class reduction component itself, they do not straightforwardly show how these results will affect the annotation of GFs by the following GF annotation component. In order to check this, we show how the performance of the morphological ambiguity class reduction component affects the annotation of GFs in section 11.4.2.

9Only the nominal realizations of the GFs ON, OA and PRED were counted in fig- ure 11.26.

case features   total number   percentage
a/d                      220        4.55%
a/g                       31        0.64%
a/n                     3969       82.11%
d/g                      354        7.32%
d/n                      149        3.08%
g/n                      111        2.30%
total                   4834

Figure 11.25: Ratio and numbers of two-case-features combinations

GF      total number   percentage
ON              5507       59.74%
OA              2839       30.80%
PRED             435        4.72%
OD               425        4.61%
OG                13        0.14%
total           9219

Figure 11.26: Ratio and numbers of nominal GFs in the test set

          precision    recall    F β=1
overall     85.52%     80.02%    82.68
OA          82.94%     82.77%    82.86
OD          83.66%     59.40%    69.47
OG         100.00%      7.69%    14.29
ON          90.93%     91.27%    91.10
OPP         72.06%     48.50%    57.98
OS          76.92%     60.67%    67.83
PRED        78.53%     72.87%    75.59

Table 11.4: GF annotation with morphological ambiguity reduction

11.4.2 Impact of Morphological Ambiguity Class Reduction

In order to explore the influence of the morphological ambiguity class reduction component on the annotation of GFs, we annotated the test set once with this component and once without it. In the case in which we annotated the test set without the morphological ambiguity class reduction component, we took all the morphological ambiguity classes of all tokens in a chunk as the ambiguity class of this chunk. Since it is also of interest to what extent morphology contributes to the annotation of GFs at all, we also annotated the test set with just default morphological ambiguity classes, i.e. we assigned all four cases to all relevant chunks. Figure 11.27 illustrates the effect of the morphological ambiguity class reduction component on the precision of GF annotation and figure 11.28 shows the same effect for recall. The exact numbers are presented in tables 11.4–11.6. Figure 11.27 shows that, as expected, the precision of the annotation of nominal GFs falls when the morphological ambiguity class reduction component is not applied. This is even more the case when no morphological annotation is applied at all. For ONs, precision falls from 90.93% to 88.88% without the reduction component and to 84.72% without any morphological annotation. For OAs, the drop is larger: from 82.94% to 76.79% and further to 71.14%. Although this decrease is significant, it is not dramatic. This supports the assumption that there is an unmarked linear order of GFs because, if morphology is not present, the parser solely relies on the assumption of unmarked GF orders.

344 Figure 11.27: Effect on precision

Figure 11.28: Effect on recall

          precision    recall    F β=1
overall     82.86%     76.79%    79.71
OA          76.79%     75.17%    75.97
OD          87.73%     33.18%    48.15
OG           0.00%      0.00%     0.00
ON          88.88%     90.06%    89.47
OPP         71.72%     49.50%    58.58
OS          76.92%     60.67%    67.83
PRED        77.14%     72.59%    74.80

Table 11.5: GF annotation without morphological ambiguity reduction

          precision    recall    F β=1
overall     78.65%     73.13%    75.79
OA          71.14%     70.48%    70.81
OD          58.25%     13.92%    22.47
OG           0.00%      0.00%     0.00
ON          84.72%     86.38%    85.54
OPP         70.76%     49.29%    58.10
OS          76.85%     61.06%    68.05
PRED        74.90%     71.47%    73.15

Table 11.6: GF annotation with default ambiguity classes

token             token ambiguity class      reduced chunk class
diesen            {dpm, dpf, dpn, asm}       asm
Schluß            {nsm, dsm, asm}
ein               {nsm, nsn, asn}            nsm
Bundeswehr-Arzt   {nsm, dsm, asm}

Table 11.7: Example of morphological ambiguity class reduction

Figure 11.28 shows that the effect on recall is similar to that on precision. However, as regards the difference in decrease between the different GFs, a general tendency becomes obvious. For those GFs which are (mainly) selected by lexical verbs, i.e. ON, OD and OA, the decrease is generally larger the less frequently the GF occurs. The slight increase in precision for OD when annotated without the reduction component (+4.07 percentage points) is accompanied by a dramatic decrease in recall (–26.22 percentage points). Thus, one can say that, generally, morphological information is most important for those GFs which occur more rarely. This shows how important morphological information is for annotation, since the parser is intended to annotate corpora for linguistic research, and for this purpose it is important that all linguistic phenomena, including the less frequent ones, are annotated sufficiently. Sentence 235 shows an example from the test corpus in which the morphological ambiguity class reduction component disambiguates two NCs, which are targets for GF annotation, such that a wrong annotation is avoided. Both NCs have nominative, dative and accusative in their ambiguity classes (as can be seen in table 11.7). After the morphological ambiguity class reduction, both NCs are fully disambiguated, such that the annotation of the unmarked GF order, which would require the ON to precede the OA, is avoided and the correct annotation of ‘diesen Schluß’ as the OA and ‘ein Bundeswehr-Arzt’ as the ON is enforced.

(235) [OA Diesen Schluß] zog am Mittwoch [ON ein Bundeswehr-Arzt] im makedonischen Camp Cegrane.
      This conclusion drew on Wednesday a German Army doctor in the Macedonian Camp Cegrane.
      ‘A medical officer of the German Army drew this conclusion in the Macedonian Camp Cegrane on Wednesday.’
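The reduction illustrated in table 11.7 can be thought of as intersecting the morphological ambiguity classes of the tokens within a noun chunk, since determiner, adjectives and noun must agree in case, number and gender. The following is a minimal sketch of this idea; the function and data structures are illustrative and are not the actual finite-state implementation:

```python
def reduce_ambiguity(chunk):
    """Intersect the morphological ambiguity classes of all tokens in a
    noun chunk; agreement within the chunk licenses the reduction."""
    classes = [set(tags) for tags in chunk.values()]
    common = set.intersection(*classes)
    return {token: common for token in chunk}

# Ambiguity classes from table 11.7 (case+number+gender tags)
nc1 = {"diesen": {"dpm", "dpf", "dpn", "asm"},
       "Schluß": {"nsm", "dsm", "asm"}}
nc2 = {"ein": {"nsm", "nsn", "asn"},
       "Bundeswehr-Arzt": {"nsm", "dsm", "asm"}}

print(reduce_ambiguity(nc1))  # both tokens reduced to {'asm'}
print(reduce_ambiguity(nc2))  # both tokens reduced to {'nsm'}
```

With these classes, the intersection alone yields the full disambiguation described above: ‘diesen Schluß’ becomes unambiguously accusative (asm) and ‘ein Bundeswehr-Arzt’ unambiguously nominative (nsm).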

11.4.3 Impact of Non-lexical Component

The GF annotation component crucially relies on SC frame information. Each verb in the corpus is assigned its SC frame from the IMSLex. However, there are, inevitably, some verbs which are not assigned SC frames or which are not assigned the SC frame which they sub-categorize for in the corpus. There are various reasons for this: On the one hand, a SC frame lexicon is not exhaustive. Thus, it does not cover the whole vocabulary of a language. From the test corpus, verbs like ‘aufpeppen’ (to spice sth. up), ‘residieren’ (to reside) or ‘zerbröseln’ (to crumble) were not contained in the IMSLex. On the other hand, language is creative such that a writer might create or use a neologism like ‘e-melden’ (to get in touch via e-mail), which occurred in the test corpus. Furthermore, there are spelling errors like ‘zusamenstellen’ instead of ‘zusammenstellen’ (to compile) or ‘beachtc’ instead of ‘beachte’ (to consider), and unexpected input due to the electronic processing of the text. For instance, German ‘ck’ is hyphenated as ‘k-k’. This led to verbs like ‘pikken’ or ‘drükken’ instead of ‘picken’ (to pick) or ‘drücken’ (to push). We did not change this unexpected input because it is already contained in the original corpus. Rarer still are dialectal words in the text like the Swabian ‘koscht’ for ‘kostet’ (costs).

In our test section, there are 4,951 lexical verb tokens. On the whole, the verbs which were not assigned a SC frame amounted to just 3.9% of all tokens (193) and to 10.6% of all verb types (157 of 1,475). The verb ‘sitzen’ (to sit) was by far the most frequent verb without a SC frame; it occurred 17 times. The next most frequent verbs occurred 3 times. Of the 3.9% of tokens without a SC frame, 3% were lemmatized by the tool DMOR but could not be assigned a SC frame, i.e. they were not contained in the IMSLex (cf. the list in appendix D.1). 0.9% of the tokens could not be lemmatized by DMOR and could, thus, not be assigned a SC frame either (cf. the list in appendix D.2).

It is important to consider that the non-lexical component does not just have to deal with the 3.9% of verbs which are not assigned any SC frame but also with a certain amount of verbs which are not assigned the SC frame with which they are realized in the test section. Thus, the amount of verbs to be dealt with is higher than 3.9%.

The non-lexical component, which does not use any SC information, considerably enhances the performance of the system, as table 11.8 and table 11.9 show. While overall precision just decreases by 2.15%, overall recall rises by

          precision   recall    F β=1
overall    85.52%     80.02%    82.68
OA         82.94%     82.77%    82.86
OD         83.66%     59.40%    69.47
OG        100.00%      7.69%    14.29
ON         90.93%     91.27%    91.10
OPP        72.06%     48.50%    57.98
OS         76.92%     60.67%    67.83
PRED       78.53%     72.87%    75.59

Table 11.8: Results of evaluation with non-lexical component
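The F-scores reported throughout these tables follow the standard F β definition, F β = (1+β²)·P·R / (β²·P + R), with β = 1 weighting precision and recall equally. A small sketch (the function name is ours) reproducing the overall figure of table 11.8:

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-measure; beta = 1 weights precision and recall equally."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Overall precision/recall from table 11.8
print(round(f_beta(85.52, 80.02), 2))  # 82.68, as in the table
```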

10.06%. Consequently, F β=1 rises by 4.86. These figures show the usefulness of the non-lexical component. Since the non-lexical component uses less information than the lexical component, i.e. it works without SC frame information, the question remains why there should be a lexical component at all. Table 11.10, which shows the results of the application of just the non-lexical component, gives the answer to this question: First of all, the non-lexical component does not annotate all GFs. OSs are not dealt with and OPPs are virtually not dealt with.10 And, secondly, the figures in table 11.10 show that the non-lexical component has an acceptably high precision compared to the lexical component in table 11.9 (cf. figure 11.29 for an illustration) but a considerably lower recall except for ONs and ODs (cf. figure 11.30 for an illustration). Still, it is interesting to note that the non-lexical component achieves such a good performance with just 41 rules, which is about 0.5% of all rules.

The reason for the bias towards precision is that the non-lexical component annotates GFs quite conservatively, i.e. it is tuned to the preceding lexical component. Unlike the lexical component, it only annotates GFs if there is clear morphological indication for them, since there is no indication from the verb as to which GFs may occur. Only for ONs does this information suffice, as the evaluation results of 89.11%/89.70% (precision/recall) show. This is because our assumption that nearly every sentence needs an

10There is a heuristic in the non-lexical component which annotates all PCs containing ‘sich’ as OPPs. This heuristic has a precision of 69.23%, slightly lower than the 72.06% overall precision for OPPs, and raises F β=1 by 0.71.

          precision   recall    F β=1
overall    87.67%     69.96%    77.82
OA         85.65%     72.30%    78.41
OD         83.70%     35.73%    50.08
OG        100.00%      7.69%    14.29
ON         93.17%     79.02%    85.52
OPP        72.23%     47.44%    57.27
OS         76.92%     60.67%    67.83
PRED       84.61%     64.27%    73.05

Table 11.9: Results of evaluation without non-lexical component

          precision   recall    F β=1
overall    86.66%     62.75%    72.79
OA         86.11%     58.65%    69.78
OD         90.50%     50.81%    65.08
OG          0.00%      0.00%     0.00
ON         89.11%     89.70%    89.41
OPP        69.23%      1.28%     2.52
OS          0.00%      0.00%     0.00
PRED       68.18%     46.49%    55.28

Table 11.10: Results of evaluation of only non-lexical component

Figure 11.29: Effect on precision

Figure 11.30: Effect on recall

ON seems to have been correct. As regards the other GFs, however, this assumption was not made, and it would have been incorrect. While nearly every verb sub-categorizes for an ON, sub-categorization of other GFs is far more restricted, i.e. it depends far more on the verb in question. This is, however, precisely the information we are lacking in the non-lexical component. Thus, we have to rely on morphological information and can only annotate a GF if this morphological information is very reliable. This leads to figures with quite high precision but with lower recall (as compared to the lexical component). If we tried to increase the recall of the system by relaxing the constraints on morphology, we would lower the overall precision of the system. Our intention was, however, to increase recall without considerably reducing precision.

Figure 11.30 also illustrates the special case of ODs: Apart from ONs, for which we assumed that every verb sub-categorizes for them, only ODs have a higher recall with just the non-lexical component than with just the lexical component. The main reason for this is that, in the TüBa-D/Z, all datives are annotated as ODs (including free datives)11 while, in the IMSLex, only sub-categorizable datives are listed. Since free datives occur rarely, we had to apply a very restricted approach to annotating non-sub-categorizable ODs. Thus, although the non-lexical component considerably enhances recall for ODs (by 23.67%), the recall still remains comparatively low (59.40%).

The results of the non-lexical component as shown in table 11.10 are significant for the assessment of our system because they show the importance of the lexical component, which uses a large lexicon to annotate GFs.
On the one hand, quite high precision for some GFs is possible using just morphological information and information about linear order; on the other hand, the SC information from the lexicon is absolutely essential in order to achieve adequate precision and recall for all GFs. Although the recall of the non-lexical component alone is higher in some cases than the recall of the lexical component alone, the combination of both components exceeds the recall of each individual component for all GFs (cf. figure 11.30). Thus, both components are indispensable for the parser. On the one hand, some GFs cannot be annotated without using lexical information (OS and OPP, but also to a large extent PRED and OA) but, on the other hand, all

11Section 7.3 gives the details of the problem of sub-categorizable and non-sub-categorizable datives.

GFs profit from the addition of the non-lexical component to the parser.

Sentences 236–238 show some examples in which the non-lexical component annotated GFs which the lexical component was not able to annotate. In sentence 236, DMOR was able to analyze the verb ‘vorzubreiten’ since the verbal particle ‘vor’ (before) and the verb ‘breiten’ (to spread) exist. However, the combination of the two does not exist and, thus, no SC frame was assigned. In fact, the token is simply a spelling mistake because the author actually wanted to use the token ‘vorzubereiten’ (to prepare). The chunk ‘die neue Aufgabe’ is morphologically ambiguous: it can be either ON or OA. Since, however, ONs cannot occur in infinitive clauses, the chunk is correctly assigned the GF OA. The fact that the non-lexical component also works as a backup in cases of spelling mistakes illustrates its importance for the robustness of the whole parser.

(236) Worauf [ON Rangnick] zurücktrat, um in aller Ruhe [OA die neue Aufgabe] in Stuttgart vorzubreiten.
      Whereupon Rangnick resigned, in order to in all quietness the new challenge in Stuttgart prepare.
      ‘Whereupon Rangnick resigned in order to prepare his new challenge in Stuttgart without ruffle.12’

Sentence 237 shows a different problem: The author puts part of the verb in parentheses in order to evoke the same effect the verb has in the English translation, i.e. that there are actually two verbs. Since this structure could not be morphologically analyzed, it did not receive any SC frame either. Still, the parser was able to assign the correct GFs because the non-lexical component worked as a backup: the chunk ‘eine neotribale Struktur’, which could be either ON or OA, is assigned the GF ON since there is no other chunk in the sentence which has the feature nominative. The reflexive ‘sich’ may be either an OD or an OA. Since ‘sich’ is more frequently an OA than an OD, it is assigned the GF OA provided that there is no other OA in the sentence. The token ‘sich’ is only assigned the GF OD by the lexical component. Sentence 237 shows that the non-lexical component is also able to robustly deal with sentences which lack more than one GF.

12What the author seems to mean but does not write is ‘to prepare himself for his new challenge’.

(237) Denn dort hatte [OA sich] inzwischen mit Haider [ON eine “neotribale Struktur] des öffentlichen Raumes” (wieder-)hergestellt.
      For there has itself meanwhile with Haider a “neo-tribal structure of the Public Space” (re-)established.
      ‘For, with Haider, a “neo-tribal structure of the Public Space” has been (re-)established.’
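The conservative behaviour described above, annotating a GF only when the morphology is unambiguous and treating ‘sich’ as OA unless an OA is already present, can be sketched as follows. This is a deliberate simplification in Python, not the actual finite-state rules; the function name and data format are ours:

```python
CASE_TO_GF = {"n": "ON", "d": "OD", "a": "OA"}  # nominative/dative/accusative

def annotate_non_lexical(chunks):
    """Conservative GF assignment without SC frame information.

    `chunks` maps chunk strings to morphological ambiguity classes
    (sets of case+number+gender tags such as 'dpm'); a chunk is only
    labelled if its case is unambiguous.  'sich' defaults to OA unless
    an unambiguous OA is already present in the sentence.
    """
    gfs = {}
    for chunk, ambiguity_class in chunks.items():
        if chunk == "sich":
            continue  # handled after the unambiguous chunks
        cases = {tag[0] for tag in ambiguity_class}  # first letter = case
        if len(cases) == 1:                          # morphologically unambiguous
            gf = CASE_TO_GF.get(cases.pop())
            if gf:
                gfs[chunk] = gf
    if "sich" in chunks and "OA" not in gfs.values():
        gfs["sich"] = "OA"
    return gfs

# Two chunks from the subordinate clause of sentence 238 (simplified)
print(annotate_non_lexical({
    "den Stuttgartern": {"dpm"},       # unambiguous dative -> OD
    "das Konzept": {"nsn", "asn"},     # nom/acc ambiguous  -> no label
}))  # -> {'den Stuttgartern': 'OD'}
```

The sketch deliberately leaves ambiguous chunks unlabelled, mirroring the precision-over-recall tuning of the component; it does not model the uniqueness-based ON rule applied to ‘eine neotribale Struktur’ in sentence 237.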

In sentence 238 the case is different: Overall, the sentence contains eight GFs, seven of which are annotated by the lexical component of the parser. However, in the subordinate clause ‘das den Stuttgartern in den vergangenen Monaten soviel Hoffnung auf die Zukunft machte’, only the ON ‘das’ and the OA ‘soviel Hoffnung’ are annotated by the lexical component since the lexicon entry of the verb ‘machen’ does not contain the SC frame ‘ON OD OA’, which would be the correct one to be applied to the sentence. Thus, the rules for the SC frame ‘ON OA’ match since this SC frame is a subset of the correct one. However, the non-lexical component is able to correctly annotate the chunk ‘den Stuttgartern’ as the OD because it was identified as an unambiguous dative. This shows that the non-lexical component is able to robustly deal with cases in which there is, in fact, a SC frame for a verb but in which this SC frame is incomplete.

(238) Aber vor allem darum, ob [ON man] [OA das Konzept], [ON das] [OD den Stuttgartern] in den vergangenen Monaten [OA soviel Hoffnung] auf die Zukunft machte und [ON das] so eng [OPP mit dem Namen Rangnick] verknüpft ist, schon vor der neuen Saison wieder [OPP in die Tonne] hauen muß.
      But most of all about that, whether one the concept, which the Stuttgarters in the previous months so much hope for the future made and which so closely with the name Rangnick associated is, already before the new season again into the bin throw has to.
      ‘But, most of all, about the question whether we, already before the new season, have to chuck away the concept which has given so much hope for the future to the Stuttgarters during the previous months and which is so closely connected with Rangnick.’

          precision   recall    F β=1
overall    85.51%     79.80%    82.55
OA         83.00%     82.71%    82.86
OD         83.71%     59.63%    69.65
OG        100.00%      7.69%    14.29
ON         90.97%     91.16%    91.07
OPP        71.14%     47.54%    57.00
OS         76.60%     60.86%    67.83
PRED       78.82%     72.13%    75.33

Table 11.11: Results of evaluation without support verb component

11.4.4 Impact of Support Verb Component

Table 11.11 shows the results of the evaluation of the GF parser without the support verb component. The results show that overall F β=1 has dropped by 0.13 points. The contribution of the support verb component to overall annotation success is, thus, very modest, especially if one considers that the support verb component comprises 31.2% of all the rules in the GF parser. Only the F β=1 of OPPs is considerably better with the support verb component (+0.98 points).

This does not mean, however, that the support verb component is unfit as such. This becomes clear if one takes a look at the results of the support verb component alone, which are shown in table 11.12 and illustrated in figure 11.31. Precision for ONs, OAs and OPPs is outstanding, and the low recall is due to the fact that only few verbs are support verbs. Precision for PREDs is 12.15 percentage points lower than for the GF parser without support verb component but, due to the gain in recall, F β=1 rises by 0.26 points with the support verb component.

The reason why the support verb component still does not contribute significantly to the overall success of the parser is that the high-frequency GFs ON and OA can also be annotated straightforwardly and at an adequate accuracy rate by the non-lexical component. To a certain extent, this is also true of PREDs. Consequently, the support verb component can only contribute significantly to the annotation of OPPs because these cannot be treated by the non-lexical component. Sentence 239 gives an example of an instantiation of the support verb construction ‘ums Leben kommen’ (lose one’s life), which was annotated correctly by the support verb component.

          precision   recall    F β=1
overall    91.73%      1.02%     2.02
OA        100.00%      0.86%     1.70
OD          0.00%      0.00%     0.00
OG          0.00%      0.00%     0.00
ON         96.77%      1.05%     2.09
OPP       100.00%      1.42%     2.81
OS          0.00%      0.00%     0.00
PRED       66.67%      1.68%     3.27

Table 11.12: Results of evaluation of support verb component

(239) Innerhalb weniger Minuten kommen [ON die Männer] auf skurrile Weise [OPP ums Leben].
      Within few minutes come the men in whimsical manner around life.
      ‘Within few minutes, the men whimsically lose their lives.’

Figure 11.31: Results of evaluation of support verb component

Chapter 12

Comparison with Related Work

Abstract: Chapter 12 compares the results of the parser in the thesis at hand with the results of comparable parsers. Section 12.1 briefly discusses the problems connected with comparing different parsing approaches. Section 12.2 compares the evaluation results of the parser at hand with those of other shallow parsers, and section 12.3 does so with the evaluation results of other GF parsers.

12.1 General Remarks

The parsing approaches whose results are discussed here have already been presented in chapter 3 in order to position our parser in the scientific field. In this chapter, we compare the quantitative evaluation results of those parsers with the evaluation results of the parser in the thesis at hand.

In order to compare different parsing approaches, one would ideally test them using a shared-task scenario. In a shared-task scenario, different parsers are tested under the same conditions, i.e. they are tested on the same data annotating exactly the same categories. The only difference would concern the approaches themselves, i.e. the actual parsers. Examples of such shared-task scenarios are presented, e.g., in Tjong Kim Sang and Buchholz (2000) for chunking and in Tjong Kim Sang and Déjean (2001) for clause identification.

Unfortunately, there is no such shared-task scenario for shallow parsing or for GF annotation for German. The reason for this may be seen in the comparatively young field of syntactic annotation of German and in the different definitions of the syntactic categories which are annotated by the different

parsing approaches. These are in part due to the different corpora which are used for training and development, on the one hand, and for testing, on the other hand. As a consequence of this, parsers not only differ in the categories they annotate but also in the data on which they are tested; and even if, for example, only newspaper texts are used, these can vary significantly with respect to the type of language occurring in them. This has to be kept in mind when we compare the figures for the various approaches in the following.

It is important to note that it is not only the language, i.e. the text type or domain, which differs when different language data are used for testing. It is also the degree and quality of preprocessing which differs; and this preprocessing is typically not described in detail when parsers are presented. Thus, the “quality” of the input text varies significantly. Before a text is handed over even to a PoS tagger, it is segmented into sentences and tokenized. This task may be achieved manually or (semi-)automatically, and the standards of this segmentation may differ. In some data, even complex words may be recognized, which considerably eases the task of parsing. Some texts have headlines and footers removed.

12.2 Shallow Parsing

Tables 12.1 and 12.2, which show the quantitative evaluation results of the chunking component of our parser as described in section 10.3.1, are the same tables as tables 10.2 and 10.3. They are repeated here for convenience. They can be taken as a guideline for the comparison of our evaluation results with those of the other approaches. The results of the different approaches are presented here in the order in which those approaches are introduced in section 3.2.

Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002) give results for verb chunks (LK and VC in our approach) and for NPs and PPs, which are defined as phrases without attachment. Whether appositions are included in this definition is not stated. (We include appositions in our chunk definition.) Verb chunks are tested on a corpus of 43 news reports containing 400 sentences with a total of 6,306 tokens. NPs and PPs are tested on a different corpus of 20,000 tokens of business news from the newspaper Wirtschaftswoche.1 In both cases, tags

1The reason for the different corpora which are used can be seen in the fact that verb chunks belong to the topological field structure and are annotated and tested in Braun

are assigned automatically using MORPHIX (Finkler and Neumann, 1988). MORPHIX assigns PoS tag ambiguity classes on the basis of morphological information of the token. Then, constraints are applied to the tokens which reduce PoS tag ambiguity. In that way, 95.37% of all tokens receive a unique reading.

The results for verb chunks are 98.43% precision and 98.10% recall (F β=1 98.26). These results are in fact impressive if one considers that, typically, the PoS tagging error rate is about 2–4% depending on tagger and text type and that the correct assignment of verb chunks has to rely on PoS tags. Even if one builds a parser which takes into account second-best PoS tags, one would typically have problems achieving such good results because taking into account second-best tags would also introduce some mistakes. One explanation for these good results for the verb chunks and also for the clause annotation (see below) can be seen in the fact that the results of the PoS tagger are measured on another corpus (i.e. the Wirtschaftswoche corpus) and that they are significantly better on the corpus used for verb chunk and clause evaluation, which just contains 43 news reports. This is also supported by the fact that the NP/PP chunking evaluation, which is performed on the larger Wirtschaftswoche corpus, is of a different quality (see below).

The PoS tagger we use has an accuracy of 96.36% on the whole TüBa-D/Z and 95.72% on the test section (cf. section 10.2 for details). Our chunker achieves results of 97.39% precision, 96.08% recall and F β=1 96.73 for the left part of the sentence bracket (LK) and 90.95%/94.49%/92.682 for the right part of the sentence bracket (VC) on TnT PoS tags. The test section for our chunker contains 18,093 tokens (1,000 sentences/7,185 chunks), i.e. three times the tokens contained in the test section in Braun (1999). The average sentence length is 18.1 as compared to 15.7 in Braun (1999). Most of the errors of our chunker are due to tagging mistakes. This can also be seen in the results of our chunker on gold PoS tags. These are 99.90%/99.14%/99.52 for LK and 98.29%/99.35%/98.82 for VC. The PoS tagging errors may have been caused by the quite heterogeneous composition of the taz newspaper which we use. The texts used in Braun (1999) all belong to the text type news report even if they are from three

(1999) while other chunks are tested in Neumann and Piskorski (2002). Nonetheless, the parser in Braun (1999) is presented with the same evaluation results as part of the system described in Neumann and Piskorski (2002) by those authors.

2With respect to evaluation results of our parser, the three figures in this order and format will henceforth stand for precision/recall/F β=1.

          precision   recall    F β=1
overall    96.08%     96.10%    96.09
AC         94.87%     95.42%    95.14
C          99.73%     99.18%    99.46
LK         99.90%     99.14%    99.52
NC         94.10%     94.39%    94.24
PC         95.58%     94.86%    95.22
VC         98.29%     99.35%    98.82

Table 12.1: Quantitative evaluation results of our parser for chunks

different domains. This might also have had an effect on the chunker. This shows the problems for comparison which occur if no shared-task scenario is applied. Differences in performance may be caused by all kinds of effects outside the parsing algorithms, and they are, to a large extent, a matter of speculation.

As regards the results for NPs and PPs, the results of our parser are better than the ones in Neumann and Piskorski (2002). Neumann and Piskorski (2002) give 91.94% precision and 76.11% recall (F β=1 83.27) for NPs/PPs for their 20,000-token Wirtschaftswoche test section with automatic PoS tags. Our parser yields 87.42%/88.47%/87.94 for NCs and 92.02%/92.44%/92.23 for PCs. While precision for NCs in our approach is 4.5% lower than NP/PP precision in Neumann and Piskorski (2002), all other figures are higher – especially recall, which is considerably higher. Again, we can only speculate about the reasons for that.

Tables 12.3 and 12.4 show the quantitative evaluation results for topological fields as presented and discussed in section 10.3.2. They are repeated here for convenience. Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002) do not give any results for the annotation of topological fields. The reason for this can be seen in the status which topological fields have in their approach. While they are annotated for their own sake in our approach, they just serve the purpose of further annotation in Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002). The further purpose is the annotation of clause structure. For the annotation of clauses, Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002) have made the distinction

          precision   recall    F β=1
overall    90.17%     90.86%    90.51
AC         81.10%     79.94%    80.52
C          93.65%     92.12%    92.88
LK         97.39%     96.08%    96.73
NC         87.42%     88.47%    87.94
PC         92.02%     92.44%    92.23
VC         90.95%     94.49%    92.68

Table 12.2: Quantitative evaluation results of our parser for chunks with TnT PoS tags

     precision   recall    F β=1
C     97.82%     97.95%    97.89
VF    94.41%     91.47%    92.92
MF    93.14%     89.52%    91.29

Table 12.3: Quantitative evaluation results for topological fields

     precision   recall    F β=1
C     93.13%     93.78%    93.45
VF    93.42%     88.28%    90.78
MF    87.46%     82.59%    84.96

Table 12.4: Evaluation results for topological fields with TnT PoS tags

               precision   recall    F β=1
V1/V2 clauses   88.18%     84.68%    86.40
VL clauses      93.35%     90.91%    92.12

Table 12.5: Quantitative evaluation results for clauses

               precision   recall    F β=1
V1/V2 clauses   84.63%     79.27%    81.86
VL clauses      89.25%     84.75%    86.95

Table 12.6: Quantitative evaluation results for clauses with TnT PoS tags

between main clauses and base clauses. This distinction is basically made in the same way as our distinction between V1/V2 clauses, on the one hand, and VL clauses, on the other hand. The categories are, thus, relatively comparable. The results for main clauses are 93.80% precision, 93.08% recall and F β=1 93.43. Our approach yields 84.63%/79.27%/81.86. For base clauses, their results are 95.75% precision, 90.25% recall and F β=1 92.91. Our approach yields 89.25%/84.75%/86.95. Again, what was said for chunks holds: extremely good PoS tagging results make excellent parsing possible. The fact that the ratio of subordinate clauses per main clause is 39% in our test section (i.e. 1,607 subordinate clauses in 4,125 matrix clauses) and 32.5% in the one in Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002) (i.e. 130 subordinate clauses in 400 matrix clauses) can also be a hint that the texts used there are of a less complex nature. Still, a more satisfactory comparison cannot be achieved without a shared-task scenario.

Wauschkuhn (1996) gives no comparable evaluation figures. Only the number of sentences which receive at least one analysis is given (85.7%). There is no evaluation as to whether these sentences are annotated correctly. As can be seen in the tables above, the amount of structures which are annotated correctly, i.e. recall, is typically higher for all categories with our parser. Still, it is the merit of Wauschkuhn (1996) to have introduced the idea of topological field parsing.

Schiehlen (2002, 2003a,b) gives results for noun chunks, which include prepositional chunks in his definition. The gold standard he uses is extracted from the NEGRA treebank (Skut, Krenn, Brants, and Uszkoreit, 1997), which

contains 321,000 tokens, from which Schiehlen (2002) extracts 100,974 base noun chunks and 78,942 full noun chunks. Obviously, the whole corpus is taken as test material. The NCs were extracted using “a quite sophisticated Perl script”. Correctness was not checked by hand, however. It has to be mentioned here that it is problematic to extract categories like chunks from a corpus in which they are not explicitly coded. To do so, one has to apply certain heuristics which in some cases more or less resemble those heuristics which are used to annotate the structures (i.e. chunks) one wants to evaluate. This is even more the case if the writer of the extraction rules and the writer of the chunk grammar are the same person. Unfortunately, there is no easy way out of this dilemma as long as there is no chunked corpus, but the problem has to be kept in mind.

We do not have the distinction between base noun chunks and full noun chunks in our approach. What Schiehlen (2002) defines as base noun chunks does not come close to what we define as chunks. What he defines as full noun chunks comes quite close to what we define as chunks including complex chunks – only that Schiehlen (2002) also includes genitive modification in the full noun chunk definition. For automatically assigned PoS tags, Schiehlen (2002) gives a precision of 90.83% and a recall of 83.61% (F β=1 87.07) without agreement check and 91.11%/87.10%/89.05 with agreement check. We score 89.03%/89.90%/89.46 with a slightly different chunk definition on completely different data. These figures show that the results of the two chunkers are somewhat comparable. More importantly, they show that the approach in Schiehlen (2002) needs an agreement check, i.e. the use of morphological information for chunking, to achieve high recall, and that this is not the case in our approach.

Kermes and Evert (2002, 2003) annotate PCs and NCs as well as adverbial and adjectival chunks.
However, only NCs are evaluated. The NCs which are annotated largely correspond to the ones in the definition of Schiehlen (2002). Kermes and Evert (2002, 2003) also use the NEGRA corpus for evaluation. However, results cannot be compared directly because they use a different, later version of the NEGRA corpus (containing 355,096 tokens). Furthermore, they do not include prepositional chunks within the NC definition. Kermes and Evert (2002, 2003) also mention the problems related to the automatic extraction of chunks from a corpus in which they are not annotated. They state that it was not always possible to filter out material which “would usually not be subsumed under the category NP”. It is unclear whether this makes the evaluation results appear

worse or better than would otherwise be the case. For the evaluation on gold PoS tags, Kermes and Evert (2003) give 88.21% precision and 90.03% recall (F β=1 89.11) for NCs. Our chunker scores significantly better with 94.10%/94.39%/94.24. However, the reservations about comparisons of that kind have to be kept in mind. For the evaluation with automatically assigned PoS tags, Kermes and Evert (2003) give 82.48%/86.11%/84.26 as the result as compared to 87.42%/88.47%/87.94 for our approach. It is astonishing that, again, the results are not better although agreement information is used. However, a more detailed comparison would require a more detailed evaluation.

Trushkina (2004) evaluates the annotation of chunks and topological fields (clauses are not included). The results are 95.31% precision, 96.43% recall and F β=1 95.87 on gold PoS tags on a 12,020-token test corpus generated from the TüBa-D/Z. Our parser yields 96.08%/96.10%/96.09 overall results for chunks. However, no detailed analysis of the results for each single category is given in Trushkina (2004). Hence, it is completely unclear how each category contributes to the overall results. This is even more important since Trushkina (2004) does not use maximal chunks for evaluation. Since the TüBa-D/Z, which she uses as a gold standard, is annotated in detail, it may contain several chunks within one maximal chunk. Nonetheless, these chunks may be rather simple, which may make annotation an easier task. The chunk from the TüBa-D/Z in 240 gives an example: On the whole, there are three chunks, but there is only one maximal chunk (i.e. a chunk not contained in any other chunk). In Trushkina (2004), the correct annotation of the prepositional chunk (with the included chunks) would yield three correct chunks while it would only yield one correct chunk if only the maximal chunk is considered.
However, it is typically the maximal chunk which is the most complex one and the hardest to annotate. In bottom-up systems, the annotation of the maximal chunk presupposes the correct annotation of the included chunks, and, in top-down systems (like ours), the correct annotation of the included chunks is far easier once the maximal chunk has been annotated.

(240) [PX wie für [NX den [ADJX alkoholkranken] Mann]]
      like for the alcohol-ill man
      ‘like for a male alcoholic’

If one tries to compare the evaluation results of maximal chunks with the ones of all chunks, one has, thus, to keep in mind that the latter task is

less complex. If, for example, the parser fails to annotate the maximal chunk in (240) (i.e. the prepositional chunk) correctly, but succeeds in annotating the other two chunks, this yields 0% precision and 0% recall if only maximal chunks are considered, but 66% precision and 66% recall if all chunks are considered. Still, the main flaw in the evaluation procedure in Trushkina (2004) is that it does not give a detailed overview of the results for each category, although these results must exist, since the figures which are given refer to labelled precision and recall.

Schmid and Schulte im Walde (2000) solely annotate and evaluate noun chunks. The NCs annotated include PCs in which the determiner and the preposition are morphologically combined; they also include PCs centre-embedded in NCs, and appositions. The parser therefore annotates slightly fewer structures than ours. The input to the parser is ambiguously tagged tokens. The parser is evaluated on 378 sentences which subsume 2,140 chunks (7,185 chunks in our test set). The results are 93.06% precision and 92.19% recall, as compared to 87.42%/92.02% precision for NCs/PCs and 88.47%/92.44% recall for NCs/PCs in our parser (on automatic PoS tags). Again, it is hard to compare results because of the different test set (which, in this case, is also more than three times smaller). However, our results compare well with the ones in Schmid and Schulte im Walde (2000) if one considers that our parser also annotates complex chunks (e.g. appositions). This shows that our parser is highly competitive with a hybrid grammar as well, since Schmid and Schulte im Walde (2000) employ a handcrafted context-free grammar augmented with probabilities (i.e. a PCFG). This is even more important to mention if one considers that the NC/PC component of our chunker only contains 72 chunking rules (including those for complex chunks) and is very easy to maintain.
Becker and Frank (2002) use a PCFG like Schmid and Schulte im Walde (2000). However, while Schmid and Schulte im Walde (2000) solely annotate noun chunks, Becker and Frank (2002) solely annotate topological fields and clauses. They use gold PoS tags as the input to their parser. The test section of their corpus contains 1,058 sentences as compared to 4,174 in ours. Table 12.7 shows the results for the annotation of topological fields and clauses. For the four different types of topological fields which we compare, Becker and Frank (2002) score a higher Fβ=1 twice, about the same once, and a lower one once. Becker and Frank (2002) also have slightly better results for clauses. Again, we can only stress that this shows

         precision   recall   Fβ=1    our Fβ=1
C/LK     99.6%       99.4%    99.50   99.46/99.52
RK       96.3%       95.8%    96.05   98.82
VF       96.1%       91.8%    93.90   92.92
MF       93.2%       93.1%    93.15   91.29
Clauses  88.9%       92.2%    90.52   88.01

Table 12.7: Results of Becker and Frank (2002) as compared to our results

that our parser is competitive with respect to the learning algorithm, since it cannot be ruled out that the difference in evaluation results is caused by the difference in test sections (the one in Becker and Frank (2002) is also four times smaller than ours).

Klatt (2004a,b) gives evaluation results for chunks and topological fields. For the evaluation of the chunks, he takes the first 500 sentences from the REFD-corpus compiled by the IMS Stuttgart. Our chunk test section contains 1,000 sentences. The results are astonishingly good. Klatt (2004a) reports 98.09% precision and 96.74% recall for NPs; 98.39% precision and 99.00% recall for DPs; and 98.22% precision and 98.00% recall for PPs. These results are much better than those of all other approaches. Especially in the light of this, one can see the need for a shared-task scenario. It would be extremely helpful to find out how these good results came about, all the more if one considers that they seem to have been produced with automatically assigned PoS tags.

For the evaluation of topological fields, Klatt (2004b) uses the first 1,225 sentences from the REFD-corpus. We have 4,174 sentences in our test section. Again, the results are exceptionally good. For instance, the Fβ=1 for VFs is 98.99, while it is 93.90 in Becker and Frank (2002) and 92.92 in our approach. This is even more astonishing if one considers that Klatt (2004b) uses automatically assigned PoS tags, while Becker and Frank (2002) and we use gold tags for the figures above. One of the reasons for the difference might be that Klatt (2004b) only annotates non-recursive topological fields. In the case of Klatt (2004b), it is also hard to compare results because he uses non-standard categories. There are, for example, no results for MFs but only for a combination of LK, MF and VC.
Again, a shared task could help find the reasons for these differences.

Brants (1999) only annotates NCs and PCs. Since he uses ten-fold cross-validation, he is able to take the whole NEGRA corpus as a test section, i.e. 17,000 sentences (300,000 tokens). The input to the parser is text which is segmented into sentences and tokenized; PoS tagging is performed as the first step of the parser. PoS tagging accuracy is 96.5%. We achieve an accuracy of 96.36% on the whole TüBa-D/Z and 95.72% on the test section. Since Brants (1999) uses a cascaded Markov model, he gives the results for each layer. The maximum precision for NCs/PCs is 91.4%. However, this score is reached only for the first layer, which can only annotate less complex chunks. According to Brants (1999), even theoretically, only 72.6% of all chunks could be annotated by this layer. Consequently, recall is low on this level (54.0%). By the time the parser is able to annotate all chunks (fourth layer), precision has fallen below 90%. Recall is highest for the ninth layer (84.8%). Fβ=1, which combines both values, is highest on the seventh layer: 86.5. This shows that our parser is highly competitive compared to this parser as well, since we score 89.03% precision, 89.90% recall and Fβ=1 89.46 for NCs and PCs combined (albeit with a far smaller test section).

To summarize, we can say that our shallow parser is in a very good position in the field of shallow parsing of German. The evaluation results are highly competitive for all categories which are annotated. This is true of the comparison with approaches which are rule-based and primarily finite-state like ours; but it is also true of approaches which apply learning algorithms or hybrid approaches. What is even more notable is that our parser annotates all shallow categories, i.e. chunks, topological fields and clauses.
With respect to other shallow approaches, we can also stress that our approach stays completely within the finite-state paradigm and, thus, does not make use of any more powerful formalisms. This is crucial for effectiveness. In Braun (1999); Neumann, Braun, and Piskorski (2000); Neumann and Piskorski (2002) and Schiehlen (2002, 2003a,b), it does not become clear in how far they stay within the finite-state formalism. Only Kermes and Evert (2002, 2003) stay completely within the FS formalism, but they achieve poorer results.

Furthermore, it has to be stressed that our shallow parser also works with much less information than the other parsers. The other three parsers which use FS technology also apply an agreement check, for which they need morphological information. Our approach simply makes use of PoS tag information and is still highly competitive. This fact is not trivial, since morphological analyzers may work far less reliably on unknown text types and domains, and the performance of shallow parsers relying on such information may drop considerably if the performance of the respective morphological analyzers drops. It is thus preferable to use as little information as possible to reach the intended goal.

The same is true of the comparison of our approach with the approaches using learning algorithms: Our shallow parser uses a very straightforward grammar consisting of only 288 rules. 103 of these are for verb chunks, since we place a great deal of emphasis on this structure; 80 rules are for all the other chunks; 78 rules are for topological fields; and 27 rules are for clauses. By contrast, Schmid and Schulte im Walde (2000) use an existing grammar of 4,619 rules which they train on an unannotated corpus of 4–10 million words; and both Becker and Frank (2002) and Brants (1999) use the NEGRA corpus as their training material, a corpus which has been developed by several linguists over a number of years.

12.3 Grammatical Functions Annotation

Tables 12.8 and 12.9, which show the quantitative evaluation results of the GF annotation component of our parser as described in section 11.3, are the same tables as tables 11.2 and 11.3. They are repeated here for convenience and can be taken as a guideline for the comparison of our evaluation results with those of the other approaches. The results of the different approaches are presented here in the order in which those approaches are introduced in section 3.3.

Schiehlen (2003a,b) annotates various types of grammatical roles as dependencies. This differs from our approach, since we annotate grammatical functions as attributes of the phrases by which they are realized. Nevertheless, we contrast the results of the two approaches with each other, since Schiehlen (2003a,b) is the approach which comes closest to ours. He also uses FS technology as the basis of his annotation (although he also introduces some non-FS components for final disambiguation; see section 3.3.2 for details). Again, it has to be kept in mind that the test material also differs. In the case at hand, the NEGRA corpus (Skut, Krenn, Brants, and Uszkoreit, 1997) was used for evaluation, this time a version containing about 340,000 tokens in 19,546 sentences. When presenting the results for the annotation of grammatical roles,

         precision   recall    Fβ=1
overall  85.52%      80.02%    82.68
OA       82.94%      82.77%    82.86
OD       83.66%      59.40%    69.47
OG       100.00%     7.69%     14.29
ON       90.93%      91.27%    91.10
OPP      72.06%      48.50%    57.98
OS       76.92%      60.67%    67.83
PRED     78.53%      72.87%    75.59

Table 12.8: Final results of evaluation of GF annotation

         precision   recall    Fβ=1
overall  83.49%      75.40%    79.24
OA       80.85%      77.09%    78.92
OD       81.73%      57.08%    67.21
OG       100.00%     7.69%     14.29
ON       88.62%      87.84%    88.23
OPP      71.66%      44.63%    55.00
OS       72.96%      55.97%    63.34
PRED     75.70%      62.72%    68.60

Table 12.9: Final results of evaluation of GF annotation with TnT PoS tags

         precision   recall   Schiehlen recall   Schiehlen recall
                              (lower bound)      (upper bound)
overall  85.52%      80.02%
ON       90.93%      91.27%   82.6%              84.5%
OA       82.94%      82.77%   70.6%              83.5%
OD       83.66%      59.40%   71.9%              77.1%
PRED     78.53%      72.87%   55.1%              –
OPP      72.06%      48.50%
OS       76.92%      60.67%   91.2%              91.9%

Table 12.10: Our results contrasted with those in Schiehlen (2003b)

Schiehlen (2003b) gives two different figures. The reason for this is that grammatical roles are left underspecified by the FS part of the parser and are supposed to be disambiguated later by a non-FS component. As a lower bound (according to the definition given in Riezler, King, Kaplan, Crouch, Maxwell, and Johnson (2002)), Schiehlen (2003b) gives the performance (recall) as measured if one makes a random choice from the set of analyses yielded by the parser. As an upper bound, the results with optimal performance are given, i.e. a hypothetical performance in which final disambiguation gets everything right. As can be seen from the results in table 12.10, our results are highly competitive, even if one considers the reservations about comparisons which we already mentioned. The results show that Schiehlen (2003b) has a higher recall for dative and sentential objects but a considerably lower recall for the far more frequent subjects. Furthermore, even the (hypothetical) upper bound for accusative objects is only slightly above our effective recall. Still, it is not clear how the difference between the two parsers can be explained, and a more apt method to compare the two approaches (e.g. a shared-task scenario) would be helpful.

Trushkina (2004) also uses the TüBa-D/Z as evaluation data. Therefore, the categories which are used can be contrasted even if they are annotated as dependencies, unlike in our approach. Unfortunately, another section of the TüBa-D/Z is used for evaluation. Trushkina (2004) uses a 12,020 token subsection of the TüBa-D/Z as her test section while we use a 73,926 token subsection.

       Trushkina (2004)               thesis at hand
       precision  recall    Fβ=1      precision  recall   Fβ=1
OA     91.38%     93.08%    92.22     82.94%     82.77%   82.86
OD     83.15%     83.69%    83.42     83.66%     59.40%   69.47
OG     100.00%    100.00%   100.00    100.00%    7.69%    14.29
ON     95.50%     96.28%    95.89     90.93%     91.27%   91.10
OPP    —          —         —         72.06%     48.50%   57.98
OS     71.27%     70.74%    71.00     76.92%     60.67%   67.83
PRED   70.83%     69.58%    70.20     78.53%     72.87%   75.59

Table 12.11: Results of evaluation of GF annotation in Trushkina (2004) contrasted with ours

Table 12.11 shows the results of Trushkina (2004), p. 163, on gold PoS tags as contrasted with our results. It shows results which are in part considerably better than ours. However, Trushkina (2004) does not only use gold PoS tags but also includes gold morphological information. We have explained in detail in section 9.1 how important morphological disambiguation is for GF annotation. It is, in fact, interesting how well a GF annotation approach does with gold morphological information, but it is far more interesting how well it does with automatic disambiguation of morphology, since this is one of the main tasks in the annotation of GFs.

As a matter of fact, complete disambiguation of morphology is in some cases, even for humans, only possible if the grammatical function of a constituent is recognized. Seen from another perspective, the annotation of morphology already resolves many of the ambiguities in GF annotation. This is especially true of those GFs connected with case, but also of others, since these can sometimes be identified by way of exclusion once the GFs connected with case are identified. Unfortunately, Trushkina (2004) does not give any detailed figures on how her parser does with automatically assigned morphology, although automatic morpho-syntactic disambiguation is one of the two main tasks of her thesis. Only the overall figures for the dependency parser with automatically assigned morphology as disambiguated by her system are given.

Table 12.11 shows that those GFs connected with case show considerably better results in Trushkina (2004) than in our approach. This is strongly influenced by the use of gold morphology. If one takes a look at the results for GFs not (or not only) connected with morphology, the results

are different. There are no results for OPPs, since the annotation of prepositional objects needs lexical subcategorization information, which is not used in Trushkina (2004). Fβ=1 for PREDs is five points lower than ours. Predicatives may be realized by nominative NPs, but they may also be realized by PPs or adverbial phrases. Fβ=1 for OS is 3.17 points lower in our approach, but we have already mentioned that gold morphology may help in the annotation of GFs not connected with case as well.

We can furthermore contrast the results given in Trushkina (2004), p. 170, which show the overall results of her parser on automatic PoS tags and morphology, with our overall results on TnT tags. This shows a precision of 86.05%, a recall of 85.06% and an Fβ=1 of 85.55 in Trushkina (2004) as contrasted to 83.49%/75.40%/79.24 in our system (cf. table 12.9 for details). However, apart from the general problems connected with comparing different approaches on different data, there is the problem that only the overall results for automatic disambiguation are given in Trushkina (2004). These also include some categories which we do not include in our GF definition, which makes comparison hard. What is more, these categories do not need any morphological disambiguation for their recognition and are thus presumably not affected by potential errors in automatic morphological disambiguation. One of the categories is ROOT, which requires the detection of the finite verb in the clause; another is VPT, which requires the detection of the verbal particle in complex verbs (which is the right part of the sentence bracket); and another is OV, which signifies the relations in the verbal chain (often within the same chunk). We subsume these phenomena under shallow annotation. That these categories are the ones which can be annotated most reliably can be seen in the results given in Trushkina (2004), p. 163.
These are the three categories with the highest scores. Since they are very likely not affected by errors in automatic morphological disambiguation, these three categories may keep overall results high even if automatic morphological disambiguation is applied. The annotation of GFs connected with morphology (crucial ones like subject and direct object) could thus be lower than the overall figures suggest. It is totally unclear why detailed figures are not given for fully automatic annotation although they are given for annotation on gold data. Since "labelled precision and recall" are given, detailed results obviously must have existed. Without detailed results, however, a satisfactory comparison is not possible.

Schmid and Schulte im Walde (2000) only annotate GFs connected

with case, but without relating them to clauses. They use a test set of 378 sentences as compared to 4,174 sentences in our test set. The overall result for the annotation of all case-related GFs is a precision of 83.88% and a recall of 83.21% on automatic PoS tags. Again, as in Trushkina (2004), no detailed results are given, and results can therefore not be properly compared and contrasted. This would be interesting because, in our approach, results for ODs, for example, were considerably lower than for ONs and OAs. In tests, it was only possible to improve results for ODs by accepting a lower score for ONs and OAs; but this led to a decline in overall performance. In fact, it would have been possible to improve overall performance even more, but we wanted to avoid unbalanced results because we wanted to have acceptable results for all categories. In learning approaches, it may well be the case that high-frequency categories receive much better results than low-frequency categories. Without detailed results, it is not possible to check whether this is the case in Schmid and Schulte im Walde (2000). Furthermore, Schmid and Schulte im Walde (2000) do not make a distinction between ONs and PREDs realized as nominative NCs.

Kübler (2004a,b) evaluates syntactic categories and functional categories. The evaluation of syntactic categories belongs to a large extent to the domain of shallow parsing. However, as in the case of Trushkina (2004) and Schmid and Schulte im Walde (2000), no detailed results are given. Thus, it is not even possible to discriminate between shallow syntactic structures and deeper syntactic structures. The overall results for syntactic structures are 87.25% labelled precision and 82.45% labelled recall (Fβ=1 84.78). This includes structures which are as diverse as adverbial chunks, which mostly contain only one element (i.e. an adverb), and topological fields, which might contain numerous complex phrases or even clauses.
Consequently, there is no point in contrasting these results with any of ours. Furthermore, Kübler (2004a), p. 215, and Kübler (2004b) evaluate syntactic categories together with functional categories. The latter largely subsume those phenomena which we define as GFs, plus some modification phenomena. Since the TüBa-D/S, which is used by Kübler (2004a,b), has the same annotation scheme as the TüBa-D/Z, which we use, it would be possible to contrast the results (although, of course, the TüBa-D/S covers a completely different domain), but, again, no detailed results are given. The overall results for syntactic categories including their function are 75.79% labelled precision and 71.72% labelled recall (Fβ=1 73.70). Kübler (2004a,b) observes that, for both evaluation methods, precision

is higher than recall. This is attributed to the fact that more than 7% of the constituents are unattached. Consequently, Kübler (2004a,b) also evaluates the annotation of only those functional categories which are in fact attached. This evaluation yields an impressive result of 95.21% precision and 95.31% recall. Since the author attributes the high number of unattached constituents to sparse-data problems, she expects overall performance to improve if more data is available. Although one can follow this reasoning, it has to be kept in mind that Kübler (2004a,b) already uses more than 34,000 sentences for training and that the creation of treebanks is an expensive and time-consuming task.

To summarize, we can say that our parser is the only one which remains within the finite-state framework when annotating GFs. With respect to quantitative results, it is highly competitive with the one in Schiehlen (2003a,b), which uses non-finite-state techniques when disambiguating underspecified GFs. With respect to the other parsers introduced above, we can only say that there is no reason to believe that they are significantly better than our parser, but that no detailed results are given which would allow a fair comparison at all. Only the evaluation of our parser allows a detailed view of its performance, while this is not the case for the other approaches: Schmid and Schulte im Walde (2000) and Kübler (2004a,b) only give overall performance, and Trushkina (2004) only gives detailed figures for the evaluation on gold PoS tags and morphology. This is all the more disappointing since the detailed evaluation of the effects of the different components of a parser (e.g. PoS tagging, morphological disambiguation, etc.) gives valuable information to scientists dealing with similar problems. For this reason, we have extensively investigated the impact of the different components of our parser in chapter 11.


Chapter 13

Conclusion

We have presented a parser which uses FSAs to annotate both shallow syntactic structure and grammatical functions. While it was already known that chunks can be annotated effectively and accurately using FSAs, we have shown that it is also possible to extend this approach to topological fields and clauses. We have defined topological fields and clauses, together with chunks, as shallow annotation structures because all these structures adhere to the same ordering principle: they are subject to syntactic restrictions which can be covered by using just PoS tags. Consequently, they were annotated by one component. In contrast to most parsing approaches, we have not constructed a purely bottom-up or top-down parser but used a mixed bottom-up and top-down approach. First, topological fields were annotated, which were then combined into clauses (bottom-up). After this, chunks were annotated top-down. Recursive structures, which do occur in topological fields and clauses and which cannot be annotated within the expressive power of an FSA, were treated by iterating the FSAs up to a defined level. Evaluation for all shallow structures has shown satisfying results.

order preferences in topological fields could be used to deal with ambiguity.

We have integrated the information of a subcategorization lexicon into the parser to make the annotation of grammatical functions possible. Structures which were not covered by the information in the lexicon were annotated by a non-lexical component. In order to allow for the annotation of GFs, which are strongly connected with case features in German, we have assigned morphological ambiguity classes to the tokens in the corpus and used the shallow parsing structures to reduce the morphological ambiguity for all candidate GFs. The evaluation results for the non-lexical component have shown that our assumptions about the preferred GF order were largely correct and that our morphological ambiguity class reduction component was effective. The results have shown that most GFs can already be annotated with high precision without a subcategorization lexicon. However, in order to annotate all GFs and in order to also achieve high recall, one needs such a lexicon.
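The interaction of morphological ambiguity classes and linear-order preferences can be sketched schematically. The following is a deliberate simplification of my own, not the implemented component: the concrete preference ON before OD before OA and all helper names are illustrative assumptions. Each candidate NC carries a set of possible cases; among the case readings these sets license, the reading whose GF sequence best matches the preferred order is chosen.

```python
from itertools import product

# Hypothetical mapping and preference order (illustrative assumptions):
GF_FOR_CASE = {"nom": "ON", "dat": "OD", "acc": "OA"}
PREFERRED = ["ON", "OD", "OA"]   # assumed unmarked linear order of GFs

def assign_gfs(ambiguity_classes):
    """Choose, among all case readings licensed by the per-NC ambiguity
    classes, the GF sequence that best matches the preferred linear order."""
    best, best_score = None, -1
    for cases in product(*ambiguity_classes):
        gfs = [GF_FOR_CASE[c] for c in cases]
        if len(set(gfs)) != len(gfs):   # each GF realized at most once per clause
            continue
        # count GF pairs that appear in the preferred relative order
        score = sum(1 for i in range(len(gfs)) for j in range(i + 1, len(gfs))
                    if PREFERRED.index(gfs[i]) < PREFERRED.index(gfs[j]))
        if score > best_score:
            best, best_score = gfs, score
    return best

# Two case-ambiguous NCs, each nominative or accusative; the order
# preference selects the subject-first reading.
print(assign_gfs([["nom", "acc"], ["nom", "acc"]]))   # -> ['ON', 'OA']
```

Reducing the ambiguity classes beforehand (e.g. via agreement within the shallow structures) shrinks the search space this disambiguation step has to consider.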


Bibliography

Steven Abney. Parsing by chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, Boston, 1991.

Steven Abney. Chunks and dependencies: Bringing processing evidence to bear on syntax. In Jennifer Cole, Georgia Green, and Jerry Morgan, editors, Computational Linguistics and the Foundations of Linguistic Theory, pages 145–164. CSLI, 1995.

Steven Abney. Chunk stylebook. Technical report, Seminar für Sprachwissenschaft, Universität Tübingen, Tübingen, 1996a. http://www.sfs.uni-tuebingen.de/~abney/Papers.html#96i.

Steven Abney. Part-of-speech tagging and partial parsing. In Ken Church, Steve Young, and Gerrit Bloothooft, editors, Corpus-Based Methods in Language and Speech, An Elsnet Book. Kluwer Academic Publishers, 1996b.

Steven Abney. Partial Parsing via Finite-State Cascades. In Proceedings of the ESSLLI-96 Workshop on "Robust Parsing", Prague, 1996c.

Salah Aït-Mokhtar and Jean-Pierre Chanod. Incremental finite-state parsing. In Proceedings of the Conference on Applied Natural Language Processing (ANLP 1997), Washington, 1997a.

Salah Aït-Mokhtar and Jean-Pierre Chanod. Subject and object dependency extraction using finite-state transducers. In ACL 97 Workshop on Information Extraction and the Building of Lexical Semantic Resources for NLP Applications, Madrid, 1997b.

Salah Aït-Mokhtar, Jean-Pierre Chanod, and Claude Roux. Robustness beyond shallowness: incremental deep parsing. Natural Language Engineering, 8(2–3):121–144, 2002.

Markus Becker and Anette Frank. A stochastic topological parser for German. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. Longman Grammar of Spoken and Written English. Longman, Harlow, 1999.

Thorsten Brants. Cascaded Markov Models. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999), Bergen, Norway, 1999.

Thorsten Brants. TnT – A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP 2000), Seattle, WA, 2000.

Thorsten Brants, Wojciech Skut, and Brigitte Krenn. Tagging Grammatical Functions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 1997), Providence, RI, 1997.

Christian Braun. Flaches und robustes Parsen deutscher Satzgefüge. Diplomarbeit, Universität des Saarlandes, Saarbrücken, 1999.

Joan Bresnan. Control and complementation. In Joan Bresnan, editor, The mental representation of grammatical relations, Cognitive theory and mental representation, pages 282–390. MIT Press, Cambridge, MA, 1982.

Hadumod Bußmann. Lexikon der Sprachwissenschaft. Alfred Kröner Verlag, Stuttgart, first edition, 1983.

Hadumod Bußmann. Lexikon der Sprachwissenschaft. Alfred Kröner Verlag, Stuttgart, third edition, 2002.

Miriam Butt, Tracy Holloway King, María-Eugenia Niño, and Frédérique Segond. A Grammar Writer's Cookbook. Number 95 in CSLI Lecture Notes. Center for the Study of Language and Information Stanford Publications, Stanford, 1999.

Nancy Chinchor and P. Robinson. Named Entity Task Definition (version 3.5). In Proceedings of the 7th Message Understanding Conference (MUC-7), Washington, DC, 1998.

Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2:113–124, 1956.

David Crystal. A Dictionary of Linguistics and Phonetics. Blackwell Publishers, London, fourth edition, 1997.

Jan Daciuk. Incremental Construction of Finite-State Automata and Transducers, and their Use in the Natural Language Processing. PhD thesis, Politechnika Gdańska, Gdańsk, Poland, 1998.

Steve J. DeRose. Grammatical Category Disambiguation by Statistical Optimization. Computational Linguistics, 14(1):31–39, 1988.

Erich Drach. Grundgedanken der Deutschen Satzlehre. Frankfurt am Main, 1937.

Judith Eckle-Kohler. Linguistisches Wissen zur automatischen Lexikon-Akquisition aus deutschen Textcorpora. Logos, Berlin, 1999.

Peter Eisenberg. Grundriß der deutschen Grammatik, volume 1: Das Wort. Metzler, Stuttgart, 1998.

Peter Eisenberg. Grundriß der deutschen Grammatik, volume 2: Der Satz. Metzler, Stuttgart, 1999.

Peter Eisenberg, Herrmann Gelhaus, Hans Wellmann, Helmut Henne, and Horst Sitta. Duden, Grammatik der deutschen Gegenwartssprache, volume 4 of Der Duden: in 12 Bänden; das Standardwerk zur deutschen Sprache. Dudenverlag, Mannheim, sixth, revised edition, 1998.

Ulrich Engel. Deutsche Grammatik. Julius Groos, Heidelberg, third, corrected edition, 1996.

Oskar Erdmann. Grundzüge der deutschen Syntax nach ihrer geschichtlichen Entwicklung dargestellt. Stuttgart, 1886. Erste Abteilung.

Stefan Evert and Hannah Kermes. Annotation, storage, and retrieval of mildly recursive structures. In Proceedings of the Workshop on Shallow Processing of Large Corpora (SProLaC 2003), pages 23–33, Lancaster, UK, 2003.

Stefano Federici, Simonetta Montemagni, and Vito Pirrelli. Shallow parsing and text chunking: A view on underspecification in syntax. In Proceedings of the ESSLLI-96 Workshop on "Robust Parsing", pages 35–44, Prague, 1996.

Wolfgang Finkler and Günter Neumann. MORPHIX: A Fast Realization of a Classification-Based Approach to Morphology. In H. Trost, editor, Proceedings der 4. Österreichischen Artificial-Intelligence-Tagung, Wiener Workshop Wissensbasierte Sprachverarbeitung, pages 11–19, Berlin, 1988. Springer.

Anette Frank, Tracy Holloway King, Jonas Kuhn, and John Maxwell. Optimality Theory Style Constraint Ranking in Large-Scale LFG Grammars. In Miriam Butt and Tracy Holloway King, editors, Proceedings of the Third International Lexical Functional Grammar Conference (LFG 1998), Brisbane, 1998.

Roger Garside. The CLAWS Word-tagging System. In Roger Garside and Geoffrey Leech, editors, The Computational Analysis of English: A Corpus-based Approach. Longman, London, 1987.

James Paul Gee and Fran¸cois Grosjean. Performance structures: A psy- cholinguistic and linguistic appraisal. Cognitive Psychology, 15:411–458, 1983.

Barbara B. Greene and Gerald M. Rubin. Automated Grammatical Tagging of English. Technical report, Department of Linguistics, Brown University, Providence, RI, 1971.

Gregory Grefenstette. Light parsing as finite-state filtering. In Proceedings of the ECAI-96 workshop on Extended Finite State Models of Language, Budapest, 1996.

Claire Grover, Colin Matheson, and Andrei Mikheev. TTT: Text Tokenisa- tion Tool. Language Technology Group, University of Edinburgh, Edin-

386 burgh, 1999. URL http://www.ltg.ed.ac.uk/software/ttt/tttdoc. html.

Claire Grover, Colin Matheson, Andrei Mikheev, and Marc Moens. LT TTT - A Flexible Tokenisation Tool. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 2000.

Gerhard Helbig and Joachim Buscha. Deutsche Grammatik. Ein Handbuch für den Ausländerunterricht. Langenscheidt, Leipzig, nineteenth edition, 1999.

Elke Hentschel and Harald Weydt. Handbuch der deutschen Grammatik. de Gruyter, Berlin, second edition, 1994.

Simon Heinrich Adolf Herling. Über die Topik der deutschen Sprache. In Abhandlungen des frankfurterischen Gelehrtenvereins für deutsche Sprache, pages 296–362, 394. Frankfurt am Main, 1821. Drittes Stück.

Erhard W. Hinrichs and Julia Trushkina. Forging Agreement: Morphological Disambiguation of Noun Phrases. In Proceedings of the 1st Workshop on Treebanks and Linguistic Theories (TLT 2002), pages 78–95, Sozopol, Bulgaria, 2002.

Hideki Hirakawa, Kenji Ono, and Yumiko Yoshimura. Automatic Refinement of a POS Tagger Using a Reliable Parser and Plain Text Corpora. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Saarbrücken, 2000.

Tilman Höhle. Explikationen für “normale Betonung” und “normale Wortstellung”. In Werner Abraham, editor, Satzglieder im Deutschen: Vorschläge zur syntaktischen, semantischen und pragmatischen Fundierung, volume 15 of Studien zur deutschen Grammatik, pages 75–153. Narr, Tübingen, 1982.

Tilman Höhle. Der Begriff ‘Mittelfeld’, Anmerkungen über die Theorie der topologischen Felder. In Akten des Siebten Internationalen Germanistenkongresses, pages 329–340, Göttingen, 1986.

Aravind K. Joshi. Computation of syntactic structure. Advances in Documentation and Library Science, 3(2), 1961. Interscience Publishers.

Aravind K. Joshi. A parser from antiquity: an early application of finite state transducers to natural language processing. In Andras Kornai, editor, Extended Finite State Models of Language, Studies in Natural Language Processing, pages 6–15. Cambridge University Press, Cambridge, UK, 1999.

Aravind K. Joshi and Phil Hopely. A parser from antiquity. Natural Language Engineering, 2(4):291–294, 1996.

René Kager. Optimality Theory. Cambridge Textbooks in Linguistics. Cambridge University Press, Cambridge, UK, 1999.

Fred Karlsson. Constraint Grammar as a Framework for Parsing Unrestricted Text. In Proceedings of the 13th International Conference on Computational Linguistics (COLING 1990), pages 168–173, Helsinki, Finland, 1990.

Lauri Karttunen and Kenneth R. Beesley. A Short History of Two-Level Morphology. In Special Event “Twenty Years of Finite-State Morphology” at the European Summer School in Logic, Language and Information (ESSLLI 2001), Helsinki, Finland, 2001.

Hannah Kermes and Stefan Evert. YAC – A Recursive Chunker for Unrestricted German Text. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1805–1812, Las Palmas (Gran Canaria), Spain, 2002.

Hannah Kermes and Stefan Evert. Text Analysis Meets Corpus Linguistics. In Dawn Archer, Paul Rayson, Andrew Wilson, and Tony McEnery, editors, Proceedings of the Corpus Linguistics 2003 conference, volume 16 of UCREL technical paper, pages 402–411, Lancaster, 2003.

Stefan Klatt. Combining a rule-based tagger with a statistical tagger for annotating German texts. In Proceedings of the 6. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2002), Saarbrücken, 2002.

Stefan Klatt. A High Quality Partial Parser for Annotating German Text Corpora. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 2004a.

Stefan Klatt. Segmenting Real-Life Sentences into Topological Fields – For Better Parsing and Other NLP Tasks. In Proceedings of the 7. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2004), Vienna, Austria, 2004b.

Sheldon Klein and Robert F. Simmons. A Computational Approach to Grammatical Coding of English Words. Journal of the Association for Computing Machinery, 10:334–347, 1963.

Sandra Kübler. Memory-Based Parsing. John Benjamins, Amsterdam, 2004a.

Sandra Kübler. Parsing without grammar – using complete trees instead. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003), Borovets, Bulgaria, 2004b.

Sandra Kübler and Heike Telljohann. Towards a Dependency-Based Evaluation for Partial Parsing. In Beyond PARSEVAL – Towards Improved Evaluation Measures for Parsing Systems – Workshop at the 3rd International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas (Gran Canaria), Spain, 2002.

Henry Kucera and W. Nelson Francis. Computational analysis of present-day American English. Brown University Press, Providence, RI, 1967.

Claudia Kunze. Lexikalisch-semantische Wortnetze. In Kai-Uwe Carstensen, Christian Ebert, Cornelia Endriss, Susanne Jekat, Ralf Klabunde, and Hagen Langer, editors, Computerlinguistik und Sprachtechnologie, pages 386–393. Spektrum Akademischer Verlag, Heidelberg, 2001.

Claudia Kunze and Andreas Wagner. GermaNet - representation, visualiza- tion, application. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1485–1491, Las Palmas (Gran Canaria), Spain, 2002.

Geoffrey Leech. Grammatical tagging. In Roger Garside, Geoffrey Leech, and Tony McEnery, editors, Corpus Annotation. Linguistic Information from Computer Text Corpora, pages 19–33. Longman, London, 1997.

Geoffrey Leech and Nicholas Smith. Manual to accompany the British National Corpus (Version 2) with Improved Word-class Tagging. UCREL, Lancaster University, Lancaster, March 2000. URL http://www.comp.lancs.ac.uk/ucrel/bnc2/bnc2postag_manual.htm.

Geoffrey Leech and Andrew Wilson. Recommendations for the Morphosyntactic Annotation of Corpora. Recommendations, Expert Advisory Group on Language Engineering Standards (EAGLES), March 1996. URL http://www.ilc.pi.cnr.it/EAGLES96/annotate/annotate.html.

Jürgen Lenerz. Zur Abfolge nominaler Satzglieder im Deutschen, volume 5 of Studien zur deutschen Grammatik. TBL Verlag Gunter Narr, Tübingen, 1977.

Beth Levin. English verb classes and alternations: a preliminary investigation. University of Chicago Press, Chicago, 1993.

Wolfgang Lezius, Stefanie Dipper, and Arne Fitschen. IMSLex - Representing Morphological and Syntactical Information in a Relational Database. In Proceedings of The 9th Euralex International Congress, Stuttgart, 2000.

Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Ian Marshall. Choice of Grammatical Word-Class without Global Syntactic Analysis: Tagging Words in the LOB Corpus. Computers and the Humanities, 17:139–150, 1983.

Walt Detmar Meurers. On the use of electronic corpora for theoretical linguistics. Case studies from the syntax of German. Lingua, 115(11):1619–1639, 2005.

Walt Detmar Meurers and Guido Minnen. A Computational Treatment of Lexical Rules in HPSG as Covariation in Lexical Entries. Computational Linguistics, 23(4):543–568, 1997.

Frank Henrik Müller. Annotating Grammatical Functions in German Using Finite-State Cascades. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, 2004.

Frank Henrik Müller and Tylman Ule. Annotating topological fields and chunks – and revising POS tags at the same time. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.

Günter Neumann and Jakub Piskorski. A Shallow Text Processing Core Engine. Computational Intelligence, An International Journal, 18(3):451–476, 2002.

Günter Neumann, Rolf Backofen, Judith Baur, Markus Becker, and Christian Braun. An information extraction core system for real world German text processing. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP 1997), Washington, 1997.

Günter Neumann, Christian Braun, and Jakub Piskorski. A Divide-and-Conquer Strategy for Shallow Parsing of German Free Texts. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP 2000), Seattle, WA, 2000.

Kemal Oflazer. Dependency Parsing with an Extended Finite State Approach. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL 1999), College Park, Maryland, 1999.

Kemal Oflazer. Dependency Parsing with an Extended Finite-State Approach. Computational Linguistics, 29(4):515–544, 2003.

Barbara Hall Partee, Alice ter Meulen, and Robert Eugene Wall. Mathematical methods in linguistics. Kluwer, Dordrecht, corrected first edition, 1993.

Li-Shiuan Peh and Christopher H. Ting. A Divide-and-Conquer Strategy for Parsing. In Proceedings of the ACL/SIGPARSE 5th International Workshop on Parsing Technologies, pages 57–66, 1996.

Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. A comprehensive grammar of the English language. Longman, London, 1985.

Lance A. Ramshaw and Mitchell P. Marcus. Text chunking using transformation-based learning. In Proceedings of the 3rd ACL Workshop on Very Large Corpora, pages 82–94, Cambridge, MA, 1995.

Marga Reis. On Justifying Topological Frames: ‘Positional Field’ and the Order of Nonverbal Constituents in German. DRLAV: Revue de Linguistique, 22/23:59–85, 1980.

Marga Reis. Zum Subjektbegriff im Deutschen. In Werner Abraham, editor, Satzglieder im Deutschen: Vorschläge zur syntaktischen, semantischen und pragmatischen Fundierung, volume 15 of Studien zur deutschen Grammatik, pages 171–211. Narr, Tübingen, 1982.

Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell, and Mark Johnson. Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), Philadelphia, PA, 2002.

Dietmar Rösner. Combining Robust Parsing and Lexical Acquisition in the XDOC System. In Proceedings of the 5. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2000), Berlin, 2000.

Peter Schefe. Zur linguistischen und psychologischen Komplexität von Selbsteinbettungen. Linguistische Berichte, 35:38–44, 1975.

Michael Schiehlen. Experiments in German Noun Chunking. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.

Michael Schiehlen. A Cascaded Finite-State Parser for German. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), pages 163–166, Budapest, Hungary, 2003a.

Michael Schiehlen. Combining Deep and Shallow Approaches in Parsing German. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, 2003b.

Anne Schiller. DMOR: Benutzer-Handbuch. Draft, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung (IMS), Stuttgart, Germany, 1995.

Anne Schiller, Simone Teufel, and Christine Thielen. Guidelines für das Tagging deutscher Textcorpora mit STTS. IMS Stuttgart und SfS Tübingen, Stuttgart und Tübingen, 1995.

Helmut Schmid. Improvements in Part-of-Speech Tagging with an Application to German. In Susan Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann, and David Yarowsky, editors, Natural Language Processing Using Very Large Corpora, volume 11 of Text, Speech and Language Technology, pages 13–26. Kluwer Academic Publishers, Dordrecht, 1999.

Helmut Schmid and Sabine Schulte im Walde. Robust German Noun Chunking with a Probabilistic Context-Free Grammar. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Saarbrücken, Germany, 2000.

Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. An Annotation Scheme for Free Word Order Languages. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP 1997), Washington, D.C., 1997.

Rosmary Stegmann, Heike Telljohann, and Erhard W. Hinrichs. Stylebook for the German Treebank in VERBMOBIL. VERBMOBIL Report, Seminar für Sprachwissenschaft, Universität Tübingen, Tübingen, September 2000.

Heike Telljohann, Erhard W. Hinrichs, and Sandra Kübler. Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Technical report, Seminar für Sprachwissenschaft, Universität Tübingen, Tübingen, 2003. URL http://www.sfs.uni-tuebingen.de/resources/sty.pdf.

Erik F. Tjong Kim Sang and Sabine Buchholz. Introduction to the CoNLL 2000 Shared Task: Chunking. In Proceedings of the 4th Conference on Natural Language Learning (CoNLL 2000) and LLL 2000, Lisbon, Portugal, 2000.

Erik F. Tjong Kim Sang and Hervé Déjean. Introduction to the CoNLL 2001 Shared Task: Clause Identification. In Proceedings of the 5th Conference on Natural Language Learning (CoNLL 2001), Toulouse, France, 2001.

Julia Trushkina. Morpho-Syntactic Annotation and Dependency Parsing of German. PhD thesis, Universität Tübingen, Seminar für Sprachwissenschaft, Tübingen, 2004.

Tylman Ule and Frank Henrik Müller. KaRoPars: Ein System zur linguistischen Annotation großer Text-Korpora des Deutschen. In Alexander Mehler and Henning Lobin, editors, Automatische Textanalyse. Systeme und Methoden zur Annotation und Analyse natürlichsprachlicher Texte, pages 185–202, Opladen, 2004. VS Verlag für Sozialwissenschaften.

Oliver Wauschkuhn. Ein Werkzeug zur partiellen syntaktischen Analyse deutscher Textkorpora. In Dafydd Gibbon, editor, Natural language processing and speech technology: results of the 3rd KONVENS Conference, Bielefeld, October 1996, pages 356–368. Mouton de Gruyter, Berlin, 1996.

Appendix A

Abbreviations

C[1–n]     Constraint 1–n
CFG        Context-Free Grammar
FSA        Finite-State Automaton
FST        Finite-State Transducer
IR         Information Retrieval
MF         Mittelfeld (middle field)
MM         Markov Model
MT         Machine Translation
NE         Named Entity
NF         Nachfeld (final field)
NP         Noun Phrase or Nominalphrase
NC         Noun Chunk or Nominal-Chunk
NLP        Natural Language Processing
OG         Genitive Object
OD         Dative Object
OA         Accusative Object
ON         Nominative Object
OS         Sentential Object
OT         Optimality Theory
PC         Prepositional Chunk or Präpositional-Chunk
PCFG       Probabilistic Context-Free Grammar
PoS        Part-of-Speech
PP         Prepositional Phrase or Präpositionalphrase
PRED       Predicative
RE         Regular Expression
RM format  Ramshaw-Marcus format
STTS       Stuttgart-Tübingen Tagset
TüBa-D/Z   Tübinger Baumbank des Deutschen/Schriftsprache
VF         Vorfeld (initial field)
XML        Extensible Markup Language

Appendix B

The Stuttgart-Tübingen Tagset (STTS)

POS      Description                                    Examples
ADJA     attributive adjective                          [das] große [Haus]
ADJD     adverbial or predicative adjective             [er fährt] schnell; [er ist] schnell
ADV      adverb                                         schon, bald, doch
APPR     preposition; left circumposition               in [der Stadt], ohne [mich]
APPRART  preposition + article                          im [Haus], zur [Sache]
APPO     postposition                                   [ihm] zufolge, [der Sache] wegen
APZR     right circumposition                           [von jetzt] an
ART      definite or indefinite article                 der, die, das; ein, eine
CARD     cardinal number                                zwei [Männer], [im Jahre] 1994
FM       foreign language material                      [Er hat das mit “] A big fish [” übersetzt]
ITJ      interjection                                   mhm, ach, tja
KOUI     subordinating conjunction with “zu”            um [zu leben], anstatt [zu fragen]
         and infinitive
KOUS     subordinating conjunction with clause          weil, daß, damit, wenn, ob
KON      coordinating conjunction                       und, oder, aber
KOKOM    particle of comparison, no clause              als, wie
NN       noun                                           Tisch, Herr, [das] Reisen
NE       proper noun                                    Hans, Hamburg, HSV
PDS      substituting demonstrative pronoun             dieser, jener
PDAT     attributive demonstrative pronoun              jener [Mensch]
PIS      substituting indefinite pronoun                keiner, viele, man, niemand
PIAT     attributive indefinite pronoun                 kein [Mensch], irgendein [Glas]
         without determiner
PIDAT    attributive indefinite pronoun                 [ein] wenig [Wasser], [die] beiden [Brüder]
         with determiner
PPER     irreflexive personal pronoun                   ich, er, ihm, mich, dir
PPOSS    substituting possessive pronoun                meins, deiner
PPOSAT   attributive possessive pronoun                 mein [Buch], deine [Mutter]
PRELS    substituting relative pronoun                  [der Hund,] der
PRELAT   attributive relative pronoun                   [der Mann,] dessen [Hund]
PRF      reflexive personal pronoun                     sich, einander, dich, mir
PWS      substituting interrogative pronoun             wer, was
PWAT     attributive interrogative pronoun              welche [Farbe], wessen [Hut]
PWAV     adverbial interrogative or relative pronoun    warum, wo, wann, worüber, wobei
PAV      pronominally used preposition                  dafür, dabei, deswegen, trotzdem
PTKZU    “zu” with infinitive                           zu [gehen]
PTKNEG   negation                                       nicht
PTKVZ    separated verb particle                        [er kommt] an, [er fährt] rad
PTKANT   answer particle                                ja, nein, danke, bitte
PTKA     particle with adjective or adverb              am [schönsten], zu [schnell]
TRUNC    truncated word - first part                    An- [und Abreise]
VVFIN    finite main verb                               [du] gehst, [wir] kommen [an]
VVIMP    imperative, main verb                          komm [!]
VVINF    infinitive, main                               gehen, ankommen
VVIZU    infinitive with “zu”, main                     anzukommen, loszulassen
VVPP     perfect participle, main                       gegangen, angekommen
VAFIN    finite verb, aux                               [du] bist, [wir] werden
VAIMP    imperative, aux                                sei [ruhig !]
VAINF    infinitive, aux                                werden, sein
VAPP     perfect participle, aux                        gewesen
VMFIN    finite verb, modal                             dürfen
VMINF    infinitive, modal                              wollen
VMPP     perfect participle, modal                      [er hat] gekonnt
XY       non-word containing special characters         D2XW3
$,       comma                                          ,
$.       sentence-final punctuation                     . ? ! ; :
$(       sentence-internal punctuation                  – [ ] ( )

From Schiller et al. (1995) (p. 7–8).


Appendix C

The TüBa-D/Z Inventory of Syntactic Categories and Grammatical Functions

All tables are taken from Telljohann et al. (2003) (p. 20–22).

Phrase Node Labels
NX        noun phrase
PX        prepositional phrase
ADVX      adverbial phrase
ADJX      adjectival phrase
VXFIN     finite verb phrase
VXINF     infinite verb phrase
FX        foreign language phrase
DP        determiner phrase (e.g. gar keine)

Topological Field Node Labels
LV        resumptive construction (Linksversetzung)
VF        initial field (Vorfeld)
LK        left sentence bracket (Linke (Satz-)Klammer)
MF        middle field (Mittelfeld)
VC        verb complex (Verbkomplex)
NF        final field (Nachfeld)
C         complementizer field (C-Feld)
KOORD     field for coordinating particles
PARORD    field for non-coordinating particles
FKOORD    coordination consisting of conjuncts of fields
MFE       middle field between VCE and VC
VCE       verb complex with the split finite verb of Ersatzinfinitiv constructions
FKONJ     conjunct consisting of more than one field

Root Node Labels
SIMPX     simplex clause
R-SIMPX   relative clause
P-SIMPX   paratactic construction of simplex clauses
DM        discourse marker

Table C.1: Node labels in TüBa-D/Z

Edge Labels denoting Heads and Conjuncts
HD        head (vs. non-head)
KONJ      conjunct

Complement Edge Labels
ON        nominative object
OD        dative object
OA        accusative object
OG        genitive object
OS        sentential object
OPP       prepositional object
OADVP     adverbial object
OADJP     adjectival object
PRED      predicate
OV        verbal object
FOPP      optional prepositional object
VPT       separable verb prefix
APP       apposition

Modifier Edge Labels
MOD       ambiguous modifier
ON-MOD, OA-MOD, OD-MOD, OG-MOD, OPP-MOD, OS-MOD, PRED-MOD, FOPP-MOD, OADJP-MOD, V-MOD, MOD-MOD
          modifiers modifying complements or modifiers (e.g. V-MOD = modifier of the verb)

Edge Labels in Split-up Coordinations
ONK, ODK, OAK, FOPPK, OADVPK, PREDK, MODK, V-MODK
          second conjunct (K) in split-up coordinations (e.g. ONK = second conjunct of a nominative object (subject))

Secondary Edge Labels (dependency relation between:)
REFVC     two verbal objects in VC
REFMOD    two ambiguous modifiers
REFINT    a phrase internal part and its modifier
REFCONTR  control verb and its complement across clause boundaries

Table C.2: Edge labels in TüBa-D/Z

Phrase Node Labels
EN-ADD    proper noun or named entity (additional label)

Secondary Edge Label
EN        phrase internal relation between two parts of a proper noun

Table C.3: Labels for proper nouns and named entities in TüBa-D/Z

Appendix D

Lists of Verbs without SC Frame

D.1 112 Verb Types without SC Frame

ab#gehen: 1 ahnen: 1 an#sehen: 1 arbeiten: 1 auf#gehen: 1 auf#lockern: 1 auf#peppen: 1 auf#wachen: 3 auf#warten: 2 aus#helfen: 1 aus#realisieren: 1 aus#weichen: 1 begeben: 3 bejagen: 1 bekommen: 1 besitzen: 1 beteiligen: 1 bewerfen: 1 blöken: 1 changieren: 1

covern: 1 davor#sitzen: 1 dazu#gehören: 1 dazu#gewinnen: 1 dran#bleiben: 1 drauf#klatschen: 1 drauflos#reden: 1 durch#inszenieren: 1 ehren: 2 ein#hämmern: 1 ein#kaufen: 1 entgegen#flackern: 1 entgegen#hoppeln: 1 ernähren: 1 fackeln: 1 freuen: 1 gelangen: 1 glucksen: 1 haben: 3 her#wetzen: 1 herab#schreiten: 1 herbei#phantasieren: 1 herum#leiten: 1 herum#zeigen: 1 hin#bomben: 1 hin#schlendern: 1 hinab#blicken: 1 hinab#schlittern: 1 hinaus#ragen: 1 hinein#schliddern: 1 hinter#ziehen: 1 hoch#jubeln: 1 hoch#reißen: 2 hoch#würgen: 1 hypnotisieren: 1 irren: 1 kommen: 1 köcheln: 2

lassen: 1 liegen: 2 müssen: 1 nach#verhandeln: 1 outen: 2 plagen: 1 pluckern: 1 posaunen: 1 präsentieren: 1 pöbeln: 1 ran#machen: 1 raus#hauen: 1 rennen: 1 residieren: 1 runter#kochen: 1 sagen: 2 scheinen: 1 schnarren: 1 schreiben: 2 schwanen: 1 sichten: 1 sitzen: 17 sollen: 2 stehlen: 1 still#legen: 1 um#steuern: 1 unterlaufen: 1 unterstützen: 1 verhalten: 1 versichern: 1 verwirklichen: 1 vor#breiten: 1 vor#räubern: 1 vor#weinen: 1 weg#locken: 1 weiter#bomben: 1 weiter#schleppen: 1 wieder#erstehen: 1

wirken: 1 wohnen: 3 zerbröseln: 1 zu#gesellen: 1 zu#lassen: 1 zu#nehmen: 1 zurück#wollen: 1 zusammen#basteln: 1 äußern: 1 über#gehen: 2 über#wohnen: 1 übergehen: 1 #fallen¹: 1 #lernen: 2 #treten: 1 #wehen: 1

¹ These are verbs in which the verbal particle was not analyzed.

D.2 44 Verb Types without Base Form

abschmeckten: 1 abzusteigen: 1 beinflußt: 1 bemopsen: 1 beruhig’: 1 bewahre: 1 bezahltc: 1 bleiben: 1 charten: 1 drükken: 2 durchkämmt: 1 e-melden: 1 entschieden: 1 flösse: 1 gelinkt: 1 koscht: 1

lassen: 1 lebe: 1 offenbart: 1 pikken: 1 prüfe: 1 schwiemelt: 1 schwurbelte: 1 sein: 1 sozialversichert: 1 sterbien: 1 vernutten: 1 verortet: 1 verschrien: 1 vervierfachen: 1 vollstreckt: 1 vorliegen: 1 wegfallen: 1 werd: 1 zusamenstellte: 1 zwangsrekrutiert: 1 übersprüht: 1 FSC-zertifiziert: 1 Greifen: 1 Nutzen: 1 Sage: 1 Sehen: 1 Vergessen: 1 Weine: 1 (wieder-)hergestellt: 1
