Complete Thesis
Total Page:16
File Type:pdf, Size:1020Kb
Discovering Syntactic Phenomena with and within Precision Grammars A thesis presented by Ned Letcher to The Department of Computing and Information Systems in total fulfillment of the requirements for the degree of Doctor of Philosophy The University of Melbourne Melbourne, Australia July 2018 Declaration This is to certify that: (i) the thesis comprises only my original work towards the PhD except where in- dicated in the Preface; (ii) due acknowledgement has been made in the text to all other material used; (iii) the thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices. Signed: Date: c 2018 - Ned Letcher All rights reserved. Thesis advisors Author Timothy Baldwin Ned Letcher Emily M. Bender Discovering Syntactic Phenomena with and within Precision Grammars S VP VP PP V NP PP/N NP V N N PP/N PP/N Discovering AP N P CONJ PP/N N N Syntactic N with and P Precision N Phenomena within Grammars Abstract Precision grammars are hand-crafted computational models of human languages that are capable of parsing text to yield syntactic and semantic analyses. They are valuable for applications requiring the accurate extraction of semantic relationships and they also enable hypothesis testing of holistic grammatical theories over quantities of text impossible to analyse manually. Their capacity to generate linguistically accurate analyses over corpus data also supports another application: augmenting linguistic descriptions with query facilities for retrieving examples of syntactic phenomena. In order to construct such queries, it is first necessary to identify the signature of target syntactic phenomena within the analyses produced by the precision grammar in use. This is often a difficult process, however, as analyses within the descriptive grammar can diverge from those in the precision grammar due to differing theoretical assumptions made by the two resources, the use of different sets of data to inform their respective analyses, and the exigencies of implementing a large-scale formalised analyses. Abstract iv In this thesis, I present my research into developing methods for improving the discoverability of syntactic phenomena within precision grammars. This includes the construction of a corpus annotated with syntactic phenomena which supports the development of syntactic phenomenon discovery methodologies. Included within this context is the investigation of strategies for measuring inter-annotator agreement over textual annotations for which annotators both segment and label text|a prop- erty that traditional kappa-like measures do not support. The second facet of my research involves the development of an interactive methodology|and accompany- ing implementation|for navigating the alignment between dynamic characterisations of syntactic phenomena and the internal components of HPSG precision grammars associated with these phenomena. In addition to supporting the enhancement of descriptive grammars with precision grammars, this methodology has the potential to improve the accessibility of precision grammars themselves, enabling people not involved in their development to explore their internals using familiar syntactic phe- nomena, as well as allowing grammar engineers to navigate their grammars through the lens of analyses that are different to those found in the grammar. Contents Title Page . .i Abstract . iii Table of Contents . .v Acknowledgments . ix Dedication . xi 1 Introduction1 1.1 Research Scope and Contributions . .8 1.1.1 Research Scope . .8 1.1.2 Research Contributions . .8 1.2 Thesis Outline . 10 2 Background 13 2.1 Introduction . 13 2.2 Setting the Scene . 15 2.2.1 Computational Linguistics . 15 2.2.2 Describing Language . 16 2.2.3 Corpus Linguistics . 21 2.2.4 Situating this Research . 22 2.3 Grammatical Phenomena . 23 2.3.1 Linguistic Forms . 24 2.3.2 The Function of Forms . 28 2.3.3 Characterising Grammatical Phenomena . 30 2.3.4 Syntactic Phenomena . 33 2.3.5 Complexity . 34 2.3.6 Frequency . 37 2.3.7 Summary . 39 2.4 Descriptive Grammars . 39 2.4.1 Audience . 40 2.4.2 Accessibility and Basic Linguistic Theory . 41 2.4.3 The Anatomy of a Descriptive Grammar . 43 2.4.4 Organisational Strategies . 49 2.4.5 Text Collections . 54 v Contents vi 2.4.6 Summary . 57 2.5 Treebanks . 58 2.5.1 Applications . 59 2.5.2 Querying Treebanks . 61 2.5.3 Treebanking Methodology . 61 2.5.4 Summary . 63 2.6 Grammar Engineering . 64 2.6.1 Applications . 66 2.6.2 Engineering Requirements . 71 2.6.3 Workflow . 74 2.6.4 Linguistic Test Suites . 76 2.6.5 Dynamic Treebanking . 79 2.6.6 Multilingual Grammar Engineering . 81 2.6.7 Summary . 87 2.7 Electronic Grammaticography . 87 2.7.1 Overview . 88 2.7.2 Enhancements . 89 2.7.3 Taalportaal . 95 2.7.4 Summary . 98 2.8 Projects Involving Grammatical Phenomena . 99 2.8.1 The TSNLP Project . 100 2.8.2 A Theory of Syntactic Phenomena . 105 2.8.3 The ERG Semantic Documentation Project . 108 2.8.4 DepBank . 112 2.8.5 CCGBank . 114 2.8.6 Construction-focused Parser Evaluation . 117 2.8.7 Summary . 120 2.9 Syntactic Phenomenon Discovery . 122 2.9.1 Querying Pre-annotated Databases . 123 2.9.2 Signature-based Querying . 129 2.9.3 Query-by-example . 136 2.9.4 Summary . 145 2.10 Chapter Summary . 146 3 Resources 150 3.1 Introduction . 150 3.2 The LinGO English Resource Grammar . 150 3.2.1 Grammatical Phenomena . 151 3.2.2 Robustness and Accuracy . 155 3.2.3 High Quality Treebanks . 157 3.3 DeepBank . 159 3.4 The Cambridge Grammar of the English Language . 160 Contents vii 3.5 Summary . 161 4 The Phenomenal Corpus 163 4.1 Introduction . 163 4.2 Data Desiderata . 164 4.2.1 Corpus Data . 164 4.2.2 Exhaustive Annotation . 166 4.2.3 Character-level Annotation . 167 4.2.4 Grammatical Framework-independent . 167 4.3 Existing Phenomenon-annotated Resources . 168 4.4 Selection of Phenomena . 171 4.4.1 Relative clauses . 173 4.4.2 Interrogative Clauses . 174 4.4.3 Passive Clauses . 176 4.4.4 Imperative Clauses . 177 4.4.5 Complement clauses . 177 4.4.6 Summary of Phenomena . 179 4.5 Selection of Corpora . 179 4.6 Data Collection . 181 4.6.1 Methodology . 181 4.6.2 Annotation Guidelines . 182 4.6.3 Annotations Collected from the WSJ . 186 4.6.4 Creating the Gold Standard Annotations . 186 4.6.5 Error Analysis . 187 4.6.6 Improving the Annotations . 188 4.6.7 Data Packaging . 191 4.7 Assessing Annotation Reliability . 193 4.7.1 Chance-corrected Inter-annotator Agreement . 193 4.7.2 Challenges Presented by this Annotation Task . 196 4.7.3 Inter-annotator Agreement Results . 203 4.7.4 IOBE Tagging Analysis . 208 4.8 Discussion and Summary . 212 4.8.1 Discussion of Quality Assurance . 212 4.8.2 Discussion of Methodology . 216 4.8.3 Summary . 218 5 Syntactic Phenomenon Discovery 220 5.1 Introduction . 220 5.2 Proposed Approach . 222 5.3 DELPH-IN Grammar Architecture . 224 5.4 Typediff Methodology . 231 5.4.1 Type Extraction . 232 Contents viii 5.4.2 Type Labelling . 234 5.4.3 Type Weighting . 236 5.4.4 Type Filtering . ..