Task-Based Parser Output Combination: Workflow And
Total Page:16
File Type:pdf, Size:1020Kb
Task-based Parser Output Combination: Workflow and Infrastructure Von der Fakultät Informatik, Elektrotechnik und Informationstechnik der Universität Stuttgart zur Erlangung der Würde eines Doktors der Philosophie (Dr. phil.) genehmigte Abhandlung Vorgelegt von Kerstin Eckart aus Stuttgart-Bad Cannstatt Hauptberichter: Prof. Dr. Jonas Kuhn 1. Mitberichter: Prof. Dr. Ulrich Heid 2. Mitberichter: Prof. Dr. Andreas Witt Tag der mündlichen Prüfung: 04.12.2017 Institut für Maschinelle Sprachverarbeitung der Universität Stuttgart 2018 Hiermit versichere ich, diese Arbeit selbständig verfasst und nur die angege- benen Quellen benutzt zu haben. (I hereby declare that I have created this work completely on my own and used no other sources than the ones listed.) Kerstin Eckart Für Anna, Hermann und Käthe Contents List of abbreviations 11 Abstract 13 Zusammenfassung 17 1 Introduction 21 1.1 Outline . 27 1.2 Concepts . 32 1.2.1 A broad definition of linguistic resources . 32 1.2.2 Distinction of data layers . 34 1.2.3 Metadata . 36 1.2.4 Process metadata for workflow tracking . 38 1.2.5 Analysis relations . 42 1.2.6 Reliability of annotated linguistic information . 44 1.2.7 Syntactic features . 47 1.2.8 Precision and recall . 47 1.3 Motivation of task-based parser output combination as a device to reach the objective . 48 2 Technical background: Combining syntactic information 53 2.1 Parsing: automatic syntactic analysis . 54 2.1.1 Preprocessing . 60 2.1.2 Output models . 61 2.1.3 Categories of parsing tools and techniques . 62 7 8 CONTENTS 2.1.4 The parsers utilized in this work . 69 2.2 Combination approaches . 98 2.2.1 Areas of tool combination in natural language processing 98 2.2.2 Methods for the provision of parsing systems to be used in combinations . 103 2.2.3 Basic structures for combination in parsing . 107 2.2.4 Combination methods in parsing . 109 2.3 Task-based approaches . 113 3 Interoperability of resources 117 3.1 Notions of interoperability . 119 3.2 Refined categories of interoperability . 120 3.3 Classification of combination types . 127 3.4 Existing approaches . 128 3.5 Discussion of handling interoperability for a combination approach140 4 Supporting infrastructure: the B3 database 143 4.1 Framework and objectives . 144 4.2 Requirements . 146 4.3 Technical decisions . 148 4.4 Design decisions . 150 4.5 Schema . 163 4.5.1 Describing objects on the macro layer . 164 4.5.2 Describing relations on the macro layer . 168 4.5.3 Describing nodes on the micro layer . 168 4.5.4 Describing edges on the micro layer . 169 4.5.5 Typing . 171 4.5.6 Temporal marks . 173 4.5.7 Describing annotations on the micro layer . 174 4.5.8 Describing annotation constraints on the micro layer . 178 4.5.9 Indexes for efficient processing . 181 5 Introducing a workflow for task-based parser output combination 183 5.1 Prerequisites . 184 CONTENTS 9 5.2 Workflow steps . 186 5.2.1 Parser output inspection . 187 5.2.2 Task-related manual annotation set . 194 5.2.3 Task-related parser evaluation . 196 5.2.4 Combination scheme . 199 5.2.5 Combination of parser output . 200 5.3 Discussion of the workflow and its applicability . 201 6 Case studies 205 6.1 Task I: Automatically reproducing classes of nach-particle verb readings on web corpus data . 207 6.1.1 Resources used in the task and the case studies . 221 6.1.2 Instantiated workflow and baselines . 239 6.1.3 Case study I.i: combination type 2, two parsers, combina- tion rules . 250 6.1.4 Case study I.ii: combination type 3, three parsers, majority vote................................ 259 6.1.5 Discussion of Task I . 260 6.2 Task II: Recognizing phrases for information status annotation . 263 6.2.1 Resources used in the task and the case studies . 265 6.2.2 Instantiated workflow and baselines . 273 6.2.3 Case study II.i: combination types 1 and 2, majority vote 276 6.2.4 Case study II.ii: combination type 2, three parsers, weighted voting . 279 6.2.5 Discussion of Task II . 283 6.3 Discussion of all case studies . 286 7 Conclusion 289 A Tagsets 303 A.1 The TIGER treebank . 303 A.1.1 Part-of-speech tags in TIGER treebank . 303 A.1.2 Categories of constituents in the TIGER treebank . 305 A.1.3 Function labels in the TIGER treebank . 306 10 CONTENTS A.2 FSPar . 308 A.3 The LFG parser . 311 B Resources 315 B.1 Chapter 1 . 315 B.2 Chapter 2 . 316 B.3 Chapter 3 . 318 B.4 Chapter 4 . 318 B.5 Chapter 5 . 318 B.6 Chapter 6 . 318 C Annotation guidelines 319 C.1 Recognition of sentences with nach-particle verbs . 319 D Graphics 323 List of abbreviations ASR Automatic Speech Recognition B3DB B3 database CLARIN Common Language Resources and Technology Infrastructure CMDI Component Metadata Initiative CoNLL Conference on Computational Natural Language Learning CYK Cocke-Younger-Kasami algorithm DCMI Dublin Core Metadata Initiative DIRNDL Discourse Information Radio News Database for Linguistic analysis DRT Discourse Representation Theory EAGLES Expert Advisory Group on Language Engineering Standards GATE General Architecture for Text Engineering GOLD General Ontology for Linguistic Description GrAF Graph Annotation Format GUI Graphical User Interface HPSG Head-driven Phrase Structure Grammar ICARUS Interactive platform for Corpus Analysis and Research tools, University of Stuttgart ID Identifier IEC International Electrotechnical Commission INESS Infrastructure for the Exploration of Syntax and Semantics ISO International Organization for Standardization LAF Linguistic Annotation Framework LAUDATIO Long-term Access and Usage of Deeply Annotated 11 12 CONTENTS Information LFG Lexical Functional Grammar OLAC Open Language Archives Community OLiA Ontologies of Linguistic Annotation OWL Web Ontology Language PASSAGE Produire des Annotations Syntaxiques à Grande Échelle (‘Large scale production of syntactic annotations’) PAULA Potsdamer Austauschformat Linguistischer Annotationen (‘Potsdam Exchange Format for Linguistic Annotations’) PID Persistent Identifier POS part-of-speech RDF Resource Description Framework ROVER Recognizer Output Voting Error Reduction SALTO The SALSA Tool (referring to the SALSA XML format from the The Saarbrücken Lexical Semantics Annotation and Analysis Project) SANCL Syntactic Analysis of Non-Canonical Language (Workshop) SFB Sonderforschungsbereich (collaborative research centre) SGML Standard Generalized Markup Language SPMRL Statistical Parsing of Morphologically Rich Languages (Workshop) SQL Structured Query Language STTS Stuttgart-Tübingen-TagSet TCF Text Corpus Format TEI Text Encoding Initiative UIMA Unstructured Information Management Architecture URL Uniform Resource Locator WaCky The Web-As-Corpus Kool Yinitiative WebLicht Web-based Linguistic Chaining Tool XLE Xerox Linguistic Environment XML Extensible Markup Language Abstract This work introduces the method of task-based parser output combination as a device to enhance the reliability of automatically generated syntactic informa- tion for further processing tasks. The approach is based on two assumptions: (i) Parsing is not an isolated task, but often one step in a processing workflow. From this step, the necessary syntactic information is extracted and then applied in e.g. semantic analyses or information extraction. (ii) Parsing, as an automatic processing step to retrieve syntactic information, cannot be perfect, especially when perfect means a unique, correct and fully specified analysis for each utterance, which in addition also reflects the intention of the author. This is due to the fact that the correctness of an analysis can only be defined with respect to a syntactic theory or a grammar and it is also due to the ambiguity of natural language, which can often only be resolved by world knowledge or with the help of points of reference between the communicating individuals (common ground). Additionally, factors like creation time, genre and the relationship of the communicating individuals highly influence the lexical and structural choices. Parsers, i.e. tools generating syntactic analyses, are usually based on refer- ence data. Typically these are modern news texts. However, the data relevant for applications or tasks beyond parsing often differs from this standard domain, or only specific phenomena from the syntactic analysis are actually relevant for further processing. In these cases, the reliability of the parsing output might deviate essentially from the expected outcome on standard news text. Studies for several levels of analysis in natural language processing have shown that combining systems from the same analysis level outperforms the best involved single system. This is due to different error distributions of the 13 14 CONTENTS involved systems which can be exploited, e.g. in a majority voting approach. In other words: for an effective combination, the involved systems have to be sufficiently different. In these combination studies, usually the complete analyses are combined and evaluated. However, to be able to combine the analyses completely, a full mapping of their structures and tagsets has to be found. The need for a full mapping either restricts the degree to which the participating systems are allowed to differ or it results in information loss. Moreover, the evaluation of the combined complete analyses does not reflect the reliability achieved in the analysis of the specific aspects needed to resolve a given task. This work introduces task-based parser output combination as a method and presents an abstract workflow which can be instantiated based on the respective task and the available parsers. The approach focusses on the task-relevant aspects and aims at increasing the reliability of their analysis. Moreover, this focus allows a combination of more diverging systems, since no full mapping of the structures and tagsets from the single systems is needed. The usability of this method is also increased by focussing on the output of the parsers: It is not necessary for the users to reengineer the tools. Instead, off-the-shelf parsers and parsers for which no configuration options or sources are available to the users can be included.