Irish Dependency Treebanking and Parsing Teresa Lynn
Total Page:16
File Type:pdf, Size:1020Kb
Irish Dependency Treebanking and Parsing Teresa Lynn B.Sc. in Applied Computational Linguistics A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) under a Double Award Agreement at the School of Computing Department of Computing Dublin City University Macquarie University Supervised by Dr. Jennifer Foster Associate Prof. Mark Dras January 2016 I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D. is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work. Signed: (Candidate) ID No.: 98300890 Date: Contents Acknowledgements xvii Abstract xx 1 Introduction 1 1.1 Irish Language Technology . .4 1.2 Dependency Treebanking . .6 1.3 Dependency Parsing . .8 1.4 Expanding Irish NLP to Social Media . 10 1.5 Research Questions . 11 1.6 Thesis Structure . 12 1.7 Publications . 13 2 Background 15 2.1 The Irish Language . 16 2.2 An Overview of Irish linguistics . 19 2.2.1 Irish Syntax . 20 2.2.2 Irish Morphology . 24 2.3 Dependency Treebanks . 29 2.3.1 An Overview of Dependency Syntax . 30 2.3.2 Projectivity . 32 2.3.3 Dependency Tree Data Formats . 33 2.4 Summary and Conclusion . 35 i 3 Irish Dependency Treebank 36 3.1 Building the Irish Dependency Treebank | the starting point . 37 3.1.1 Choice of syntactic representation . 38 3.1.2 Choice of dependency labelling scheme . 42 3.1.3 Developing upon existing NLP tools . 47 3.2 Inter-annotator Agreement . 50 3.2.1 IAA and the Irish Dependency Treebank . 52 3.2.2 IAA Analysis and Treebank Update . 53 3.2.3 Sources of annotator disagreements . 55 3.3 Summary and Conclusion . 57 4 Defining the IDT Annotation scheme 59 4.1 Irish Dependency Labelling Scheme . 60 4.2 Language-specific choices . 60 4.2.1 Labelling of predicates . 62 4.2.2 Irish copula . 62 4.2.3 Copular subject complements . 68 4.2.4 Cleft constructions { cleftparticle . 68 4.2.5 Copula drop . 70 4.2.6 Copula { Complementiser form . 70 4.2.7 Copular subordinator . 71 4.2.8 Prepositional attachment . 71 4.2.9 Prepositions and complement phrases . 73 4.2.10 Gerund object attachment . 74 4.2.11 Gerund adverb attachment . 75 4.2.12 Complement phrase attachment to verbal nouns. 75 4.2.13 Objects of infinitive phrases . 76 4.2.14 Objects of autonomous verbs . 76 4.2.15 Oblique agents of stative passives . 77 ii 4.2.16 Relative particle . 77 4.2.17 English chunks . 79 4.2.18 Internal punctuation . 80 4.2.19 Multiple coordination . 80 4.2.20 Punctuation as a coordinator . 81 4.2.21 Wh-questions . 82 4.3 Summary and Conclusion . 83 5 Universal Dependencies 84 5.1 A Universal Dependency Scheme for the Irish Dependency Treebank . 87 5.1.1 Mapping the Irish POS tagset to the Universal POS tagset . 87 5.1.2 Mapping the Irish Dependency Scheme to the Universal De- pendency Scheme (UD13) . 90 5.2 A new Universal Dependency Scheme (UD15) . 97 5.2.1 New Universal POS tagset . 97 5.2.2 2015 Universal Dependency Scheme . 100 5.3 Summary and Conclusion . 112 6 Parsing with the Irish Dependency Treebank 114 6.1 An Overview of Dependency Parsing . 115 6.1.1 How do Transition-Based Dependency Parsers work? . 117 6.1.2 Evaluation Metrics . 122 6.2 Establishing a Baseline Parsing Score . 123 6.3 Improvements over Baseline following IAA Study . 125 6.4 Bootstrapping Treebank Development { Active Learning . 126 6.4.1 Related Work . 128 6.4.2 Active Learning Experiments . 129 6.4.3 Results . 131 6.4.4 Analysis . 132 iii 6.4.5 Making Use of Unlabelled Data . 135 6.4.6 Active Learning Conclusion . 136 6.5 Bootstrapping Parser Development { Semi-supervised Learning . 137 6.5.1 Self-Training . 139 6.5.2 Co-Training . 142 6.5.3 Sample-Selection-Based Co-Training . 144 6.5.4 Analysis . 148 6.5.5 Test Set Results . 149 6.5.6 Semi-supervised Learning Conclusion . 150 6.6 Parsing with Morphological Features . 151 6.6.1 Results and Conclusion . 152 6.7 Cross-lingual Transfer Parsing . 153 6.7.1 Data and Experimental Setup . 154 6.7.2 Results . 156 6.7.3 Discussion . 157 6.7.4 Cross-lingual Transfer Parsing Conclusion . 158 6.8 Summary . 159 7 Irish Twitter POS-tagging 163 7.1 Irish Tweets . 166 7.2 Building a corpus of annotated Irish tweets . 168 7.2.1 New Irish Twitter POS tagset . 168 7.2.2 Tweet pre-processing pipeline . 172 7.2.3 Annotation . 173 7.3 Inter-Annotator Agreement . 173 7.4 Experiments . 174 7.4.1 Data . 174 7.4.2 Part-of-Speech Taggers . 175 iv 7.4.3 Results . 176 7.5 Summary and Conclusion . 178 8 Conclusion 180 8.1 Summary and Contributions . 180 8.2 Research Questions Answered . 183 8.3 Proposed Future Work . 185 8.3.1 Further Treebank Development . 185 8.3.2 Exploiting Morphological Features . 187 8.3.3 Exploiting the Hierarchical Dependency Scheme . 187 8.3.4 Semi-supervised Parsing . 187 8.3.5 Cross Lingual Studies . 188 8.3.6 NLP for Social Media . 189 8.3.7 Universal Dependencies Project . 190 8.4 Concluding Remarks . 190 Bibliography 193 A Annotation Guidelines for the Irish Dependency Treebank 218 A.1 Verb Dependents . 219 A.1.1 subj : subject { (Verb) . 219 A.1.2 obj : object { (Verb) . 219 A.1.3 obl/ obl2 : oblique nouns { (Verb) . 221 A.1.4 particlehead { (Verb) . 226 A.1.5 padjunct : prepositional adjunct { (Verb) . 227 A.1.6 advadjunct : adverbial adjunct { (Verb) . 230 A.1.7 subadjunct : subordinate adjunct{ (Verb) . 232 A.1.8 adjunct { (Verb) . 237 A.1.9 xcomp : open complement { (Verb) . 239 A.1.10 comp : closed complement { (Verb) . 243 v A.1.11 pred : predicate { (Verb) . 244 A.1.12 relparticle : relative particle { (Verb) . 249 A.1.13 cleftparticle : cleft particle { (Verb) . 251 A.1.14 vparticle : verb particle { (Verb) . 252 A.1.15 particle : verb particle | (Verb) . 256 A.1.16 addr : addressee { (Verb) . 256 A.2 Noun Dependents . 257 A.2.1 det : determiner { (Noun) . 257 A.2.2 det2 : second determiner { (Noun) . 258 A.2.3 dem : demonstrative { (Noun) . 259 A.2.4 poss : possessive { (Noun) . 259 A.2.5 quant: quantifer { (Noun) . 260 A.2.6 adjadjunct { (Noun) . ..