Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch

Year: 2018

Exploiting alignment in multiparallel corpora for applications in linguistics and language learning

Graën, Johannes

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-153213
Dissertation
Published Version

Originally published at: Graën, Johannes. Exploiting alignment in multiparallel corpora for applications in linguistics and language learning. 2018, University of Zurich, Faculty of Arts.

Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning

Thesis presented to the Faculty of Arts and Social Sciences of the University of Zurich for the degree of Doctor of Philosophy

by Johannes Graën

Accepted in the spring semester 2018 on the recommendation of the doctoral committee:
Prof. Dr. Martin Volk (main supervisor)
Prof. Dr. Marianne Hundt
Prof. Dr. Stefan Evert

Zurich, 2018

Abstract

This thesis exploits the automatic identification of semantically corresponding units in parallel and multiparallel corpora, which is referred to as alignment. Multiparallel corpora are text collections of more than two languages that comprise reciprocal translations.

The contributions of this thesis are threefold:

• First, we prepare a large multiparallel corpus by adding several layers of annotation and alignment. Annotation is first performed on each language individually, while alignment is applied to two or more languages. For the latter case, we use the term multilingual alignment. We show that word alignment on parallel corpora can improve language-specific annotation by means of disambiguation.

• Our second contribution consists in the development and evaluation of prototypical algorithms for multilingual alignment on both sentence and word level. As languages vary considerably with regard to how content is realized in sentences and words, multilingual alignment needs to be represented by a hierarchical structure rather than by the bidirectional links that are the prevailing representation of bilingual alignment.

• Third, based on our corpus, we show how word alignment in combination with different types of annotation can be employed to benefit linguists and language learners, among others. All tools developed in the context of this thesis, in particular the publicly available web applications, are driven by efficient database queries on a complex data structure.

Zusammenfassung

The subject of this dissertation is the exploitation of semantic correspondence relations in parallel and multiparallel corpora, which are referred to as alignments. Multiparallel corpora are text collections of reciprocal translations between more than two languages.

This thesis comprises three contributions:

• First, the preparation of a large multiparallel corpus by adding several layers of annotation and alignment. While each language is first annotated separately, alignment extends over two or more languages. We refer to the latter case as multilingual alignment. We show that word alignment in parallel corpora can help to improve the language-specific annotations by means of disambiguation.

• Second, the development and evaluation of prototypical algorithms for multilingual alignment on both the sentence and the word level. Owing to the strong variation between languages with regard to how content is realized in sentences and words, multilingual alignment requires a hierarchical structure to represent the correspondences, instead of the bidirectional links that are customary in bilingual alignment.

• Furthermore, we show how word alignment in combination with different types of annotation can be put to use for, among others, linguists and language learners. All tools developed in the context of this dissertation, in particular the publicly available web applications, are based on efficient database queries over a complex data structure.

Acknowledgments

First of all, I wish to thank my supervisor Martin Volk, who guided me through the initial troubles, gave me room to realize my own ideas and had the necessary confidence in me to finish this big project of mine. I am likewise indebted to my colleagues in the Sparcling project, Marianne Hundt, Simon Clematide and Elena Callegaro, and our student collaborators Dolores Batinic, Christof Bless and Mathias Müller.1 Other, smaller contributions are indicated at the beginning of each chapter. I am thankful to Stefan Evert for accepting our invitation to become part of my doctoral committee.

During the last years, I had time to become acquainted with the other members of the Institute of Computational Linguistics. I really appreciate the supportive atmosphere that gave rise to interesting and fruitful discussions, often accompanied by chocolate and delicious pastry. I would like to give special thanks to Noah Bubenhofer, Tilia Ellendorff, Anne Göhring, Samuel Läubli, Laura Mascarell, Jeannette Roth, Gerold Schneider and Don Tuggener. Aside from the people in Zurich, I am particularly grateful to the Språkbanken group in Gothenburg for kindly receiving me in spring 2017.

The technical parts described in this thesis were quite demanding. My thanks therefore go to our institute for making it possible to acquire new servers and letting me implement my concept for a new computer cluster that allows for distributed computing. This would not have been possible without the comprehensive support of the technicians of the Department of Informatics: Hanspeter Kunz, Beat Rageth and Enrico Solcà.

The most important acknowledgments tend to come last. I would like to thank my friends from Barcelona, particularly Alicia Burga, Graham Coleman, Gabriela Ferraro and Simon Mille, without whom I probably would not have started a doctorate. I express my gratitude to my family and to my girlfriend Mónica for their unconditional support throughout all these years.

1The Sparcling project was kindly funded by the Swiss National Science Foundation under grant 105215_146781/1.


Contents

Abstract iii

Zusammenfassung iv

Acknowledgments v

1 Introduction 1
  1.1 The Sparcling Project 3
  1.2 Research Questions 4
  1.3 Outline 5

2 Parallel Text Corpora 7
  2.1 Monolingual Corpora 11
  2.2 Parallel Corpora 14
  2.3 Multiparallel Corpora 16
    2.3.1 Our CoStEP Corpus 17

3 Corpus Annotation 21
  3.1 Tokenization 26
    3.1.1 Cutter: Our Flexible Tokenizer for Many Languages 27
  3.2 Part-of-speech Tagging and Lemmatization 37
    3.2.1 Interlingual Lemma Disambiguation 44
    3.2.2 Particle Verbs in German 49
  3.3 Dependency Parsing 55
  3.4 Database Corpus 57

4 Alignment Methods 61
  4.1 Text Alignment 63
  4.2 Sentence Alignment 65
    4.2.1 Approaches 68
    4.2.2 Evaluating Sentence Alignment 74
  4.3 Multilingual Sentence Alignment 75
    4.3.1 Our Approach to Multilingual Sentence Alignment 79
    4.3.2 Evaluation 91
  4.4 Word Alignment 106
    4.4.1 Approaches 108
    4.4.2 Evaluating Word Alignment 118
  4.5 Multilingual Word Alignment 123
    4.5.1 Our Approach to Multilingual Word Alignment 124
    4.5.2 Evaluation and Outlook 136

5 Linguistic Applications of Word Alignment 151
  5.1 Overlap of Lemma Alignment Distributions as Measure for Semantic Relatedness 153
  5.2 Multilingual Spotting 160
  5.3 Phraseme Identification 165
  5.4 Backtranslating Prepositions for Prediction of Language Learners' Transfer Errors 179

6 Conclusions 187

Appendices 193

Appendix A Linguistic Annotation 195
  A.1 Universal Dependency Labels 195
  A.2 Our Hierarchical Alignment Tool 198

Appendix B Alignment Quality 201
  B.1 Relation of Alignment Error Rate (AER) and F1-Score 201

Appendix C Data Sets from Joint Measures 203
  C.1 Semantic Relatedness of German Particle Verbs 203
  C.2 Generated Recommendations for Learners of English of Different L1 Backgrounds 226
    C.2.1 Verb Preposition Combinations 226
    C.2.2 Adjective Preposition Combinations 237

Chapter 1

Introduction

Collections of texts, known as text corpora, have long been a subject of linguists' interest. They have served, for instance, lexicographers as a source for dictionary compilation, or linguists and historians for the investigation of language change over time. Parallel text corpora, parallel corpora for short, sometimes also referred to as bitexts, are text collections in two or more languages where textual units, such as articles, sentences or words, in one language correspond to textual units of the same kind in another language. If there is more than one other language, we refer to these collections as multiparallel corpora. The term parallel corpus covers only translated material and not collections of texts that merely relate to each other in terms of content. The latter are called comparable corpora, since they describe the same topic in a comparable way without the texts necessarily being translations of each other. Wikipedia articles, as an example of comparable corpora, deal with the same topic in several languages and can either be translations of one or more existing articles, or be written independently of corresponding articles in other languages (for an overview see Plamada and Volk 2013).

The size of typical corpora impedes manual examination and, hence, calls for automatic processing. Natural language processing (NLP) deals with the automated treatment of natural language, predominantly in written form. NLP methods subdivide into rule-based and statistical methods; approaches combining both paradigms are referred to as hybrid methods. Both types of methods are capable of processing large amounts of textual data in a tiny fraction of the time a human would need to accomplish the same task. Automatic processing typically entails that some results are incorrect. While the main shortcoming of rule-based methods is coverage,1 statistical methods bring about a task- and tool-specific error rate; that is, each partial result is expected to be incorrect with a known probability, but we do not know which parts are correct and which ones are not.

1When dealing with natural language, there will typically be cases that the authors of the rules have not considered or that do not conform with the authors’ intrinsic language model.

A principal motivation for developing new approaches is to achieve lower error rates. Some applications prefer to lower the error rate by excluding samples that are likely to include errors; other applications prefer large quantities of samples, provided that the correct ones prevail. This trade-off between quality and quantity is captured by the measures precision (how many of the results that we get are good?) and recall (how many of the good results do we actually find?) in binary classification tasks. A parametrization of the classifier that leads to an improvement of one measure usually implicates a decline of the other. The F-score is a commonly used measure to account for both quality and quantity.

Applications that require high precision and attach less importance to recall are typically concerned with individual examples and less so with statistics.2 In lexicography, in particular, these individual examples play a role when it comes to demonstrating the usage of particular word senses or expressions in context. Comprehensive dictionaries typically incorporate sample sentences for different word senses, which are oftentimes selected from corpora. A method for selecting those sentences, called good dictionary examples (GDEX) (Kilgarriff, Husák et al. 2008), ranks matching sentences according to features such as sentence length and rareness of the comprised words. Nonetheless, manual intervention is still needed to assess each dictionary candidate and sort out unsuitable ones.

Good sample sentences also play a role in computer-assisted language learning (CALL) applications. Some of them assist their users, who are language learners, by providing usage examples for a particular word or expression (just like dictionaries do), and such examples can be used in automatically generated exercises. Although the completely automatic selection of sample sentences (see, for instance, Volodina et al. 2012; Pilán et al. 2016) always carries the risk of error (i.e., choosing a bad sentence, which confuses the learner instead of helping her), manual intervention is not feasible in such applications unless they rely on a list of precompiled exercises, which contradicts the principle of automating this task.

We address the inherent problem of errors in linguistic data that is processed with statistical methods by combining several layers of statistically generated data in parallel corpora, one of which typically is word alignment between the languages in question. Word alignment refers to the technique of automatically identifying corresponding words (i.e., the actual tokens) in corresponding sentences of different languages. Corresponding sentences, in turn, are automatically identified by means of sentence alignment techniques, for which the superordinate structures of sentences (paragraphs, articles, documents or, in general, texts) also need to be aligned. An erroneous alignment on any of these alignment levels necessarily implicates that subsequent alignments will also be wrong; that is, the error propagates.

While alignment is typically understood as the identification and annotation of bilingual correspondences, we distinguish between bilingual and multilingual alignment, with the latter applying to cases where three or more languages are involved. As correspondence is a symmetric relation, multilingual alignments also require symmetry between all constituent elements. This symmetry is not guaranteed if we simply combine bilingual alignments of all language pairs.

The performance of alignment tools can be evaluated by means of comparison with a set of manually aligned data, a so-called gold standard. As for gold standards of any other kind of annotation, measures need to be taken to strive for consistency, which typically involves that the same annotation task is performed by multiple annotators and their results are compared. The inter-annotator agreement (for a comprehensive overview see Artstein 2017) indicates how well humans perform in a particular task.

2The contrary is the case for applications like statistical machine translation, which learn generalized principles from large amounts of data. Errors, as long as they are not systematic, are simply smoothed out.
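The precision/recall trade-off described above is easy to make concrete. The following minimal sketch (not from the thesis; the toy alignment links are invented) scores a set of predicted word-alignment links against a gold standard:

    # Minimal illustrative sketch: precision, recall and F-score for a
    # binary classification task such as judging candidate alignment links.

    def precision_recall_f1(predicted: set, gold: set):
        """Compare predicted items against a gold standard."""
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Toy word-alignment links as (source position, target position) pairs.
    gold = {(0, 0), (1, 2), (2, 1), (3, 3)}
    predicted = {(0, 0), (1, 2), (3, 4)}

    p, r, f = precision_recall_f1(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
    # precision=0.67 recall=0.50 F1=0.57

Excluding uncertain candidates from the predicted set tends to raise precision at the cost of recall, which is exactly the trade-off discussed above.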

1.1 The Sparcling Project

A large share of the corpus work in this thesis has been done as part of the Sparcling project.3 The unabbreviated project name, "Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation", does not only encompass the two levels of corpus preparation that we are dealing with in this thesis, but also pinpoints its objective: the investigation of linguistic variation.

One of the phenomena that was of particular interest to the linguists in our project is variable article use in English, that is, the usage or omission of articles. To investigate the factors that contribute to one or the other case, English is compared to several other languages. To make a comparison between article use in English and another language possible, we need to know the correspondence between words and multiword expressions in both languages, which is approximated by means of automatically calculated word alignments. Additionally, syntactic information helps to identify relevant contexts.

The choice of languages with which to compare English is driven by typological differences. Article use is common in Germanic and Romance languages, while most Slavic and all Finnic languages do not possess articles. In the Sparcling project, English was compared with German, French, Spanish, Italian, Polish and Finnish.4 The Europarl corpus (Koehn 2005) was chosen as resource since it comprises, for the most part, transcripts of speeches delivered at the European Parliament's sittings in all these languages, and is thus arguably more suitable to model language use than, for instance, law or patent corpora.5

Several investigations on variable article use have been carried out with our corpora as basis. In (Callegaro et al. 2018), for instance, the corpus6 is used to look up nominal phrases with two kinds of abbreviations: acronyms and initialisms. The meaning of acronyms in this study is restricted to abbreviating sequences of uppercase letters that are pronounceable; the unpronounceable variants are referred to as initialisms.7 Corpus examples (approximately 300 examples per language) for a list of approximately 50 handpicked acronyms and initialisms were retrieved and manually analyzed. The main finding of this study is that articles are significantly more often used with (unpronounceable) initialisms than with (pronounceable) acronyms. Other studies using extracts from our corpus are detailed in (Callegaro 2017). She also performed an analysis of how well the transcripts reflect what has actually been said at the sittings and finds that the transcripts are "considerably faithful to the speeches and can in turn be used for linguistic analysis."

3The Swiss National Science Foundation supported our project under grant 105215_146781/1.

1.2 Research Questions

This thesis deals with the preparation and exploitation of a large multiparallel corpus with annotations and alignments on different levels. It addresses the following questions:

1. What are the challenges in preparing a multiparallel corpus? Where does corpus preparation benefit from parallel data?

2. Which purposes can bilingual word alignment in multiparallel corpora be employed for?

3. How can we reliably determine multilingual alignments? Does concurrent alignment of more than two languages improve the quality of bilingual alignments?

4The project proposal envisaged Russian as representative of the Slavic language family, but it was later replaced by Polish owing to the preferences of the project participants.
5We give an overview of multilingual corpora in Chapter 2.
6This study uses version FEP6 of our corpus; earlier studies rely on FEP3. The properties of different corpus versions are explained in Chapter 3. The latest version of our corpus will be made available on the project website: http://pub.cl.uzh.ch/purl/sparcling_project
7In this thesis, we do not make this distinction and refer to both as acronyms.

A typical way for linguists to use corpora is to query them. Querying means retrieving matching examples for a given constellation (i.e., a combination of relations between tokens), which is described in terms of the underlying annotation layers. This aspect, though being an application of the final corpus, needs to be considered when building it. The first two points thus entail the question:

4. How can a large multiparallel corpus with several layers of annotations and alignments be queried efficiently?

1.3 Outline

In this chapter, we introduced the topics covered by this thesis and posed its principal questions. In the following chapters, we address them individually.

In Chapter 2, we deal with monolingual, parallel and multiparallel corpora. Existing corpora are described and classified in terms of token count and number of languages covered. We comment on the issues that we encountered in the Europarl corpus and considered detrimental to linguistic applications, and how we were able to resolve them for the most part in our corpus release.

Chapter 3 is concerned with corpus preparation and annotation. We detail the respective steps of tokenization, part-of-speech tagging and dependency parsing, and point out the advantages that relational database management systems provide for corpus storage and querying.

Chapter 4 deals with existing approaches to alignment on the level of texts (or documents), sentences and words. We explain and evaluate our approaches to multilingual sentence and word alignment.

Chapter 5 describes four applications exploiting the rich structure of our corpora. We show how word alignment in connection with different annotation layers facilitates the creation of useful tools for corpus linguists and language learners.

In Chapter 6, we draw conclusions with regard to the initially raised questions and outline how future work can connect to the insights we gained.

Chapter 2

Parallel Text Corpora

The purpose of this chapter is to give the reader an overview of parallel and multiparallel corpora. Although numerous speech corpora are available by now, this work only deals with text corpora; every mention of the term 'corpus' has to be understood accordingly.

Before speaking about parallel corpora, we need to introduce monolingual corpora and their use cases first. In Section 2.1, we describe some applications that text collections have been used for in the past. In Section 2.2, we show use cases for parallel corpora, including bilingual subsets of multiparallel corpora, apart from training statistical machine translation systems, which still is the predominant beneficiary of parallel corpora. The last section, 2.3, is concerned with properties of multiparallel corpora. We describe the Europarl corpus and our Corrected & Structured Europarl Corpus (CoStEP) based thereupon, which, in turn, forms the basis of this work.

Contributions

The cleaning and restructuring of the Europarl corpus (Section 2.3.1) would not have been possible without other people's contributions: Dolores Batinic performed the initial error analysis, Simon Clematide matched the respective speakers with the member list of the European Parliament and Mathias Müller wrote the XSLT rules for removing non-parallel data and adding additional speaker information for matching members. All other tasks regarding the corpus preparation have been accomplished by the author. These tasks include the implementation of error correction rules (with manual examination of their respective effects on the corpus data), the alignment of speaker contributions in all available languages and error analysis on subsequent releases.

Working with the Europarl data, we encountered several issues that we suspected would introduce subsequent errors when standard natural language processing tools are applied, and would thus pose problems for our envisaged applications.

Figure 2.1 depicts a number of monolingual, parallel and multiparallel corpora and their size with respect to the number of languages contained and the average number of tokens per language.1 In some parallel corpora, the number of tokens is approximately evenly distributed over the comprised languages; in other cases, the token numbers deviate considerably. The COPPA corpus (Junczys-Dowmunt et al. 2016), for instance, comprises approximately 160 million tokens in English and French but only 130 000 in Portuguese. Östling (2015) presents a similar log-log representation of corpus sizes but includes the New Testament corpus (Cysouw and Wälchli 2007), which comprises 1142 translations in 1001 languages (Östling 2015). This discrepancy in numbers is due to several languages having more than one translation.2 The New Testament, though large in terms of languages, is small with regard to token size, which makes it inappropriate for particular applications that need reliable frequency estimations (e.g., machine translation (ibid.)). This is why we disregard all those numerous corpora that hold less than a million tokens per language, compilations of translated books, for instance.

Having been released half a century ago, the Brown Corpus (Kučera and Francis 1967) comprises texts with approximately one million tokens. The British National Corpus (BNC) (Leech 1992), with a hundred times more tokens, and the Penn Treebank (Marcus et al. 1993) are notable English corpora from the '90s. 'Web as Corpus' (WaC) corpora (Baroni et al. 2009), obtained by crawling web pages with corresponding top-level domains (e.g., the top-level domain of Germany, 'de', for deWaC), were a new milestone in terms of sheer size. The similar-sized Annotated English Gigaword corpus (Napoles et al. 2012), which is based on the fifth version of the English Gigaword corpus (Parker et al. 2011), comes with several layers of annotation such as constituent parse trees and syntactic dependency relations. While manual work was included in the preparation of the Penn Treebank, the 1000 times bigger Annotated English Gigaword corpus relies solely on statistical annotation methods.

1For the overall number of tokens in these corpora, we rely on specifications given by the respective authors. If the number is given with the unit 'words', we assume them to be the same as tokens. For up-to-date numbers of more recent corpora, we consulted related web pages.
2The same is true for the OpenSubtitles corpus (Lison and Tiedemann 2016), which comprises two versions for Chinese (traditional and simplified) and Portuguese (the Portuguese and the Brazilian variant).

[Figure 2.1: log-log plot; x-axis: number of languages, y-axis: average number of tokens per language. Corpora located in the plot include the Brown Corpus (1967), Penn Treebank (1993), BNC (1992), deWaC, itWaC and ukWaC (2009), Annotated English Gigaword (2012), the TenTen corpora (2013), DeReKo (2017), Text+Berg, CS, Hansards, Belgisch Staatsblad, ACTRES, Europarl, CoStEP, FEP3, FEP6, FEP9, JRC-Acquis, DCEP, COPPA, UN, UN₂, MultiUN and OpenSubtitles.]

Figure 2.1 – Corpus size in terms of languages and average number of tokens. Monolingual corpora (green) have grown from one million tokens (Brown Corpus, Kučera and Francis 1967) to more than 32 billion tokens (2017 version of DeReKo, Kupietz, Belica et al. 2010; Kupietz and Lüngen 2014). Parallel and multiparallel corpora are considerably smaller token-wise than recent monolingual corpora.

A collection of large corpora (Jakubíček et al. 2013), which holds more than 30 monolingual corpora obtained by crawling web pages, targets a size of 10^10 tokens; that is why it has been named TenTen. The DeReKo corpus (Kupietz, Belica et al. 2010; Kupietz and Lüngen 2014) (the abbreviation translates to 'German reference corpus') aims at providing contemporary German texts from a wide range of text types. It currently comprises more than 30 billion tokens.

Kilgarriff (2001) proposes a strategy for objectively comparing corpora of the same language. Since web corpora comprise a huge number of different sources (i.e., the respective web pages crawled), we can imagine cases where an evaluation of a particular hypothesis that is confirmed by different web corpora actually relies on the same sources, as the corpora overlap.

A prototypical parallel corpus is the Canadian Hansards (described in Gale and Church 1991, 1993), which comprise the proceedings of the Canadian parliament in both English and French. Its token number approximates the number of tokens in Europarl (Koehn 2005) for those languages that were already official languages of the European Union when the digital publication of parliamentary debates started in 1996. This is best shown in Figure 2.1 in comparison with the FEP3 corpus (see below), which only comprises those languages. For comparison with other bilingual corpora, we added the Belgisch Staatsblad Corpus (Vanallemeersch 2010), a corpus comprising 10 years of Belgian governmental publications in Dutch and French, and the ACTRES corpus (Izquierdo et al. 2008), a compilation of English texts that have been translated into Spanish.

Large multiparallel corpora, that is, corpora with more than two languages and more than one million tokens per language, predominantly originate from multinational organizations, which require all official documents to be translated into all their official languages. The United Nations (UN) has six official languages: Arabic, Chinese, English, French, Russian and Spanish. Both the UN corpus (Rafalovitch, Dale et al. 2009) and the MultiUN corpus (Eisele and Y. Chen 2010) are corpora obtained from texts published by the United Nations; the UN corpus from resolutions of the UN's general assembly and the MultiUN corpus from an archive of official documents. The latter corpus also contains a small part with translations into German, which increments the number of languages comprised by one. A more recent version of the UN corpus (UN₂), released by an official UN body (Ziemski et al. 2016), comprises a considerably large number of translations of all kinds of UN documents in the six official languages.

From the European Union, which currently has 23 official working languages, several textual resources have been compiled into corpora. The arguably most prominent one, Europarl (Koehn 2005), comprises the debates of the European Parliament over a period of 15 years. Our own corpus, CoStEP, provides a smaller portion of the texts contained in Europarl, but, in return, corrects several errors that we identified in Europarl (see Section 2.3.1). We extracted several subsets of CoStEP for further processing of the respective languages. The processing of these internal corpora (FEP3, FEP6 and FEP9) is detailed in Chapter 3. Other corpora from the European Union include the JRC-Acquis (R. Steinberger, Pouliquen et al. 2006), a compilation of mostly legal texts, and DCEP (Digital Corpus of the European Parliament) (Hajlaoui et al. 2014), both compiled by the Joint Research Center (JRC) of the European Commission.
These and other multiparallel corpora from the European Union are explained in (R. Steinberger, Ebrahim et al. 2014).

The by far largest parallel corpus with respect to both dimensions of size is OpenSubtitles (Lison and Tiedemann 2016), a corpus comprising user-translated movie subtitles from OpenSubtitles.org.3 Due to the nature of movie subtitles (e.g., duration of visibility and space constraints), this text type arguably differs from all other multilingual corpora introduced above. Movie subtitles transcribe the speech in movie scenes, typically performing translation at the same time. The resulting sentences spread over several subtitle blocks that are shown sequentially to the viewer with a time distance of only seconds (see ibid., Section 3). This restriction presumably forces the subtitle creators to avoid overly complicated sentences. The debates of the European Parliament in Europarl have also been transcribed and translated into other languages. In contrast to the short sentences we expect in the subtitle corpus, the average sentence in our excerpt from Europarl consists of 24 tokens.

3http://www.opensubtitles.org/

2.1 Monolingual Corpora

Linguists and scholars of other disciplines have employed collections of texts to gain insights into various aspects of (written) language. In human history, cryptographers made use of such text collections to derive, over time, increasingly complex statistics that helped them decipher encrypted messages (S. Singh 1999). The principle behind these efforts remained the same: if we know the language that is behind an encrypted message, we try to identify the same properties in its nonsensical textual representation and use them to align both 'languages'. Assuming that the language of the unencrypted message is English, we can start by comparing single letter frequencies. If the message is long enough and encryption has been done by exchanging letters,4 we expect the most frequent encrypted letter to correspond to the letter 'e', the most frequent one in English, in the original text. This method extends to the distributional comparison of letter n-grams, word forms and word n-grams, all of which can be obtained from a text collection, a corpus.

While early cryptographers were mainly interested in statistical properties of languages with the objective of deciphering encrypted messages, linguists in the pre-computer era performed manual analyses on small text collections to find out how language works. Firth (1957a) uses the limericks of Edward Lear as corpus to illustrate his notion of collocation. The word 'man' in these limericks, for example, shows a tendency to be preceded by 'old', and the sequence "old man" frequently appears in the phrase "There was an Old Man of" followed by the name of a place.5 More recent works on collocations (e.g., Church and Hanks 1990; Michelbacher et al. 2007; Evert 2004, 2008; Gries 2013; Bartsch and Evert 2014) continue this kind of analysis by applying statistical measures to substantially larger corpora.

Technical advancements in the second half of the last century rendered possible the compilation of large electronic corpora. Church and Hanks (1990) report that the development of "facilities for the computational storage and analysis of large bodies of natural language" allowed them to perform concordance analysis on various (English) corpora.

The size of the textual material contained in a corpus matters in terms of its representativeness. Other parameters such as text quality, coverage of domains and genres, and consistency of annotation, however, may be more important for particular use cases. With the creation of the British National Corpus (BNC) (Leech 1992) in the early '90s, researchers intended to compile a resource representing British English (at that time) from a variety of perspectives (also including speech). Although targeting a broad number of applications and fields of research, Leech has the impression "that the uses to which a corpus can be put are far more numerous and varied than the corpus compilers could have imagined."

With the rise of the Internet, it seemed obvious to use the manifold texts on web pages around the world, the vast majority of them being in English, for compiling corpora that reflect a language in its breadth, which was actually the point of the BNC's 'corpus design' (ibid., Section 4.1). These web corpora, however, do not allow for controlling the exact amount of text for each category. They may also contain spelling errors or texts that only superficially look like the language they are supposed to comprise, but are outright wrong, in fact. Furthermore, automatic translations of texts, in particular concerning English as a frequent target language, can bias our impression of a language when using web corpora as reference.6

4This is arguably the most insecure encryption method, though it was used successfully in ancient times. Gnxr, sbe vafgnapr, gur EBG13 fhofgvghgvba.
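As an aside, the letter-frequency comparison described at the beginning of this section can be sketched in a few lines of Python. The example below is purely illustrative (the input text is invented) and uses ROT13, the substitution cipher alluded to in the footnote:

    # Illustrative sketch of single-letter frequency comparison for breaking
    # a letter-substitution cipher; ROT13 serves as the toy cipher.
    import codecs
    from collections import Counter

    def letters_by_frequency(text):
        """Return the letters of the text ordered by descending frequency."""
        counts = Counter(c for c in text.lower() if c.isalpha())
        return [letter for letter, _ in counts.most_common()]

    plaintext = "take for instance the rot thirteen substitution"
    ciphertext = codecs.encode(plaintext, "rot13")

    # With enough English text, the most frequent cipher letter should be
    # the image of 'e'; this toy input is too short for that, but the
    # mechanism is the same.
    print(letters_by_frequency(ciphertext)[:3])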

5Which becomes apparent when looking only at the first lines of Edward Lear's limericks: http://www.bencourtney.com/ebooks/lear/#indexfirstlines
6This issue has been addressed by, for example, (Antonova and Misyurev 2011).

The 'web as corpus' (WaC) initiative (Kilgarriff and Grefenstette 2003; Baroni et al. 2009) concerns itself with theoretical and practical questions regarding the use of the Internet for corpus compilation. Three of the early corpus releases, deWaC, itWaC and ukWaC, are shown in Figure 2.1. Successors such as the English Gigaword corpus (Napoles et al. 2012) or the TenTen corpus collection (Jakubíček et al. 2013), a series of web corpora compiled for many languages with a target size of 10^10 tokens, have pushed the limit to new dimensions.

While those are corpora compiled by crawling the web, the German DeReKo corpus (Kupietz, Belica et al. 2010; Kupietz and Lüngen 2014) is aimed at the "documentation of the German language in its current use". Unlike the aforementioned web corpora, it consists of licensed material, that is, copyrighted material that has been added to the corpus with permission of the respective rights-holders. The downside is that the corpus can only be made available through a dedicated interface that does not allow for reconstruction of the original texts.

Other corpora not detailed here include learner corpora, that is, collections of texts written by language learners including, on purpose, their mistakes; historical corpora, for instance, for investigating language change (Curzan 2009); and dialect corpora, which often comprise transcriptions of speech recordings (Anderwald and Szmrecsanyi 2009). Learner corpora are frequently annotated with the (or a) correct language use so that we can learn about the learners. In Section 5.4, we use learner corpora to verify whether we can predict learner errors on the basis of our annotated and aligned corpus.

Corpus Query Languages

The number of available corpus query languages for monolingual corpora is large. Clematide (2015) classifies a list of major query languages into a scheme derived from their most distinctive properties. There is no query language that outperforms the others: the choice of query language depends foremost on how the corpus has been processed and which annotation layers are available, and these questions potentially depend on the query language that one plans to use.

Since we are mostly concerned with querying parallel corpora, we refer the interested reader to (S. Brants et al. 2002) for a description of the Tiger Treebank and its query language, to (Meurers and Müller 2009), who show how to effectively employ this query language for linguistically motivated corpus searches, and to (Bański et al. 2016) for a corpus query standardization approach with the name 'corpus query lingua franca' (CQLF).

Efforts have been made to allow access to corpus query systems (i.e., the applications that interpret the actual query and perform corpus searches) via web applications, either for a restricted or an unrestricted number of users (depending on copyright, and for performance reasons). The flexible architecture of web applications facilitates a user interface design that supports non-expert users in performing corpus queries. An example of such a system is shown in Figure 5.1.

CQPweb (Hardie 2012) and ANNIS (Krause and Zeldes 2014) are other widely-used web applications for corpus search that make use of corpus query languages, but at the same time provide support for inexperienced users. The SketchEngine (Kilgarriff, Rychlý et al. 2004; Kilgarriff, Baisa et al. 2014), modestly described as "the ultimate tool to explore how language works", targets both groups: expert users (e.g., linguists or lexicographers) and non-expert users (e.g., teachers, students or historians).

The query syntax for attributes in CQP and ANNIS is very similar. For instance, the CQP query ⟨[pos="DET"] [pos="ADJ"] [pos="NOUN"]⟩ is used to retrieve sentences where a determiner (i.e., an article) is followed by an adjective, which, in turn, is followed by a noun. The same query is expressed in ANNIS syntax by means of the precedence operator "." (⟨pos="DET" . pos="ADJ" . pos="NOUN"⟩). However, ANNIS is capable of querying arbitrary relations between tokens, which has been envisaged for CQP (Evert and Hardie 2015) but still lacks implementation.

2.2 Parallel Corpora

The origin of linguistic research on parallel texts dates back to archaeological excavations comprising bilingual or trilingual inscriptions like the Rosetta Stone, which is a stele with approximately the same text engraved in three different ancient languages.7 The stone was discovered at a time when all three languages (Ancient Greek and two versions of the Egyptian language using different scripts, namely Demotic and hieroglyphic Egyptian) had been extinct for more than a millennium, but by means of bequeathed knowledge about Ancient Greek, scholars managed to gain comprehension of the other two as well.

More recently, international collaborations, whether cultural, political or economical, have generated an extensive amount of parallel resources, from bilingual ones up to several dozens of languages. These resources include legal texts, transcribed speeches, instruction manuals, package inserts for drugs and translated books, such as the Bible, textbooks or science fiction. Many of those resources have already been exploited by linguistic applications, such as word sense disambiguation (Kazakov and Shahid 2013), learning syntactic translation rules between languages (Lavie et al. 2008), computational lexicography (Tiedemann 2003b) or machine translation (ibid.).

7This is actually an example of a multiparallel text.

The Canadian Hansards, the proceedings of the Canadian parliament, are made available in English and French. From the mentions in (Gale and Church 1991, 1993), we know that they had a sample of 90 million words in both languages, English and French, together. This number resembles the size of monolingual corpora from that time. Another, more recent and marginally bigger parallel corpus from a governmental body is the Belgisch Staatsblad corpus (Vanallemeersch 2010), which comprises governmental publications in Dutch and French. In theory, we should be able to compile parallel corpora from the official translations of any officially bilingual country's publications (for a list of other, little-known parallel corpora see Xiao 2008; Aijmer 2008, Chapter 6).

Corpus Query Languages for Parallel Corpora

In the previous section, we touched on the topic of corpus query languages for annotated monolingual corpora. What all those query languages have in common is that they are represented as text with a formal grammar motivated by linguistic attributes and structures (Clematide 2015), with the consequence that they can only be used by expert users. Those expert users need to know the characteristics of each annotation layer (e.g., the tagset used for part-of-speech tagging) to be able to formulate a query.

In contrast, parallel concordancing systems facilitate access to corpus material for non-expert users (Volk, Graën and Callegaro 2014). Their users do not need to learn a complex query language, but can perform corpus searches by simply typing in sequences of words. This comes at the price of expressivity: it is typically not possible to further restrict searches and filter out unwanted cases. When we are, for instance, interested in translations of the expression 'to level something at somebody' and we search for 'level at' or 'levels at', the system will also find cases where 'level' or 'levels' is a noun.8

The Stockholm TreeAligner (Volk, Lundborg et al. 2007; Lundborg et al. 2007) is a tool to visualize parallel treebanks and to allow its user to add or modify alignments between constituents of two parallel syntax trees. Furthermore, it includes a module for performing corpus searches with partial search queries in TIGER-Search syntax (König and Lezius 2000) and a further restriction on the alignment of the independently matched constituents.

8One of the systems that we examine in (Volk, Graën and Callegaro 2014), Linguee, finds "noise levels at night", "utilisation levels at waste incineration plants", "regional and local levels at the same time", "can reach levels at which", etc., but only two cases with 'level' as a verb: "despite all of the criticism leveled at Credit Suisse regarding the promotion of women" and "the criticism which the applicant levels at the findings relating to the need to keep basic geometric forms available".

Table 2.1 – Sample parallel query in TreeAligner syntax from (Lundborg et al. 2007). The query searches for aligned noun phrases (NP) in parallel treebanks with the additional restrictions that the noun phrase in the first treebank dominates an adjective phrase (AP), that the noun phrase in the second treebank dominates a prepositional phrase (PP), and that these two dominated phrases are aligned.

Part             Query
First treebank   #np1:[cat="NP"] > #ap:[cat="AP"]
Second treebank  #np2:[cat="NP"] > #pp:[cat="PP"]
Alignment        #np1 * #np2 & #ap * #pp

A sample query for the TreeAligner is given in Table 2.1. Since its treebanks use a phrase structure syntax, the standard relational query operators besides precedence (in addition to attributive restrictions on the structural elements) are direct (">") and indirect dominance (">*"). The same operators are used by ANNIS for queries on phrase structure syntax. ANNIS also allows for dependency syntax queries. The query ⟨pos=/V.FIN/ ->dep[func="obja"] pos=/N.*/⟩, for instance, searches for finite verbs (auxiliary, modal or main verbs) that have a noun (proper or common noun) as direct object.
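To illustrate what evaluating such a dependency query involves, here is a minimal sketch in Python. The token data model (form, part-of-speech tag, head index, dependency label) and the example sentence are invented for illustration; this is not how ANNIS is implemented internally:

    # Sketch: evaluate a query like pos=/V.FIN/ ->dep[func="obja"] pos=/N.*/
    # over one dependency-parsed sentence (invented data model).
    import re

    # (form, part-of-speech tag, head index or None, dependency label)
    sentence = [
        ("She",   "PPER",  1,    "subj"),
        ("reads", "VVFIN", None, "root"),
        ("a",     "ART",   3,    "det"),
        ("book",  "NN",    1,    "obja"),
    ]

    def verb_object_pairs(tokens):
        """Yield (verb, object) pairs: a finite verb governing a noun via 'obja'."""
        for form, pos, head, label in tokens:
            if label == "obja" and re.fullmatch(r"N.*", pos) and head is not None:
                head_form, head_pos = tokens[head][0], tokens[head][1]
                if re.fullmatch(r"V.FIN", head_pos):
                    yield head_form, form

    print(list(verb_object_pairs(sentence)))  # [('reads', 'book')]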

2.3 Multiparallel Corpora

Most applications on parallel corpora work on one pair of languages at a time: a source and a target language. Some applications, however, employ a third language, called a pivot or bridge language, to support operation between source and target language. This technique is called triangulation (see for instance Borin 2000a; Cohn and Lapata 2007; Bouma et al. 2008; Y. Chen et al. 2008). Triangulation thus requires a parallel corpus of at least three languages. In Section 5.1, we show how massive triangulation, that is, the simultaneous triangulation over all available pivot languages, compensates for potentially erroneous or missing word alignments in any one of the languages involved. We use a similar approach in Section 3.2.1 to disambiguate ambiguous lemmas.

In the majority of use cases for multiparallel corpora found in the literature, bilingual data is extracted and processed, potentially for many language pairs, but predominantly independently of other languages. This applies, in particular, to the field of machine translation. The OPUS (open parallel corpus) collection9 (Tiedemann 2009, 2012) comprises a long list of freely available parallel corpora. In addition to the respective corpus as a whole, the collection also offers sentence-aligned subsets for language pairs.

9http://opus.nlpl.eu/

Our institute has edited and published two multilingual corpora which are based on written texts (in contrast to OpenSubtitles and, for the most part, Europarl), are not formally restricted (unlike the JRC-Acquis) and deal with a wide variety of topics. The Text+Berg corpus (Göhring and Volk 2011) comprises the yearbooks of the Swiss Alpine Club for more than 150 years. These yearbooks are collections of articles written by the Alpine Club's members and deal with topics such as "climatology, geology, fauna, flora, society, culture, tourism, leisure and sports" (ibid.). Most of the articles are available in French and German; more recent yearbooks also cover Italian. The second corpus only spans 120 years. The Credit Suisse (CS) Bulletin Corpus (Volk, Amrhein et al. 2016) comprises the Credit Suisse banking magazine in English, French, German and Italian, but issues older than 1970 are only available in French and German. Although issued by a Swiss bank, the magazine covers a wide range of non-money-related topics.
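Before turning to our own corpus, the triangulation idea introduced at the beginning of this section can be illustrated with a toy sketch. This is not the thesis' actual algorithm (Section 5.1 develops that); it merely shows the basic composition of two bilingual alignments through a shared pivot language:

    # Toy sketch: compose bilingual word alignments through a pivot language.
    # Alignments are sets of (source position, target position) pairs.
    en_de = {(0, 0), (1, 2), (2, 1)}   # English -> German (pivot)
    de_fr = {(0, 0), (1, 1), (2, 3)}   # German  -> French

    def triangulate(src_pivot, pivot_tgt):
        """Link source and target positions that share a pivot position."""
        return {(s, t) for s, p1 in src_pivot
                       for p2, t in pivot_tgt if p1 == p2}

    en_fr = triangulate(en_de, de_fr)
    print(sorted(en_fr))  # [(0, 0), (1, 3), (2, 1)]

With several pivot languages available, such hypothesized links can be intersected or voted over, which is the intuition behind the massive triangulation mentioned above.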

2.3.1 Our CoStEP Corpus

At the time the Europarl corpus (Koehn 2005) was released, few multilingual resources were available. The author starts by saying that "progress in natural language research is driven by the availability of data", to which we would only add the technical advancement.10 It is thus not surprising that the Europarl corpus became a reference for training and testing statistical machine translation systems (see, for instance, DeNero, Gillick et al. 2006; Søgaard and Kuhn 2009; Crego et al. 2010). Besides machine translation, Koehn also mentions "word sense disambiguation, anaphora resolution, information extraction" as use cases. Other than that, Europarl has been used, for instance, for grammar projection (Bouma et al. 2008), unsupervised part-of-speech tagging based on projection between languages (Das and Petrov 2011) or "learning multilingual semantic representations" (Hermann and Blunsom 2014).

When we worked with the Europarl data, which is raw text split into speaker contributions (turns) by meta information, we discovered several recurring errors that had a negative effect on natural language processing tools applied to it. Though not falling into the category of web corpora, the Europarl data (i.e., the minutes of the European Parliament's plenary debates) has been scraped from the

10Deep learning techniques (e.g., Collobert and Weston 2008) would certainly not have performed well on an "IBM Model 3090 mainframe computer with access to 16 megabytes of virtual memory" (Gale and Church 1993).

European Parliament's website. We assume that the minutes published on that website have been manually edited and that this editing accounts for some of the errors found.

The most frequent error, which we encountered in approximately every other turn, is that either parts of the actual text have been categorized and marked as meta information, or meta information has not been recognized as such and forms part of the running text. An example for the former case is “”, and for the latter "Miller (PSE). (EN) Herr Präsident, ich […]" or "(RO) Бих искала да поздравя г-н Stolojan за […]". Other errors include a partially performed tokenization on all scraped texts, which works well for English but is problematic for other languages. In total, we identified 11 types of errors and classified them in several dimensions (Graën, Batinic et al. 2014). On the one hand, we estimated the error frequency: for some errors, we simply count their occurrences; for others, we can estimate their frequency by extrapolation.11 In addition to frequency, we also assess whether an error originates in the published web pages or is due to text processing by Koehn, and judge the impact of each error type on further processing pipelines.

Despite Europarl's application to various linguistic tasks, we believe that its use for corpus linguistic investigations is limited, since the numerous errors we identified will have an impact on the quality of natural language processing tools applied to it. As errors tend to accumulate, a single error in the corpus input can lead to numerous errors in an application at the end of a processing pipeline. The partial tokenization described above, for instance, confuses a standard part-of-speech tagger with included tokenization as preprocessing step, as we show in (ibid.). In case we decide to skip the tokenization step, we will, however, miss all the cases where tokenization has not been performed yet. A greater risk with regard to the utilization of parallel data from Europarl is the suggested alignment of texts, which is frequently misleading (see below). If we were to take all texts that appear at the same position in Europarl's parallel documents as translations of each other, our sentence and word alignment algorithms (Chapter 4) would learn correspondences from unrelated (i.e., non-parallel) texts and thus deteriorate the alignment models as a whole.

Our goal, to annotate and align the Europarl corpus with the objective of linguistic research, required us to correct the identified errors first and assure that supposedly parallel texts actually are corresponding translations. This led to the Corrected & Structured Europarl Corpus (CoStEP).12

11A special error type is the omission of foreign words in other languages. Those words are formatted as italic text with HTML mark-up, but had not been marked as such in older versions. This is presumably why this case was not paid attention to, and foreign words were thus removed from the text like any other mark-up.

Knowing the different types of error, their frequency and impact, we aim at correcting them. To do so, we use regular expressions to identify errors, inspect the identified cases (manually or by aggregation) and modify the texts accordingly if we do not encounter false positives. We do this for all corrigible errors that we found.13 We cannot, however, correct errors in cases where information has been lost (e.g., reinsert omitted words).

Besides the cleaning part, we also align the respective speaker turns in all languages. In 58% of the documents, which typically comprise a single day of plenary debates, the structure given by the meta information is the same in all languages, and the alignment task is thus trivial. In the remaining cases, we are required to fuzzy-match speaker names and other meta attributes in order to extract corresponding turns. Alongside the consistent turn structure, we also distinguish between actual speaker contributions and comments in the minutes, and mark every text part as being either speech or a minutes' comment. Within the respective turns, we mark quotations, since quotation marks have been used inconsistently (e.g., a double comma instead of the lower opening quotation mark in German). This allows a corpus user to replace the quotation mark-up with language-specific typographic or typewriter quotation marks, depending on what further text processing tools are able to handle.

A more recent addition, which is not explained in (ibid.), is the matching of speakers to the member list of the European Parliament. That way, we are able to add more reliable information to the metadata of each turn. Instead of just a name given in the original data, which either holds the respective speaker's full name or last name only (both frequently with different spellings or spelling errors), we add forename and surname from a reliable source. Additionally, we add from that list the country and the political group that a member represents.
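A minimal sketch of the regex-based correction workflow described above is given below. The pattern shown (a stray language tag such as '(EN)' at the start of a turn) is a simplified, hypothetical stand-in for the actual CoStEP correction rules:

    # Hedged sketch of regex-based cleanup: identify a suspected error
    # pattern, inspect the matches, then correct. The pattern is illustrative.
    import re

    LANG_TAG = re.compile(r"^\([A-Z]{2}\)\s+")  # e.g. "(EN) " at turn start

    def inspect_and_fix(turns):
        """Report matches for (manual) inspection, then strip them."""
        for turn in turns:
            match = LANG_TAG.match(turn)
            if match:
                print("suspicious:", match.group().strip())
            yield LANG_TAG.sub("", turn)

    turns = ["(EN) Herr Präsident, ich ...",
             "Mr President, I always thought ..."]
    print(list(inspect_and_fix(turns)))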

12The corpus is available at http://pub.cl.uzh.ch/purl/costep.
13This includes undoing the partial tokenization that has been applied to the texts.

Chapter 3

Corpus Annotation

Our project research questions (see Section 1.1) demand a large multiparallel corpus with several layers of annotation and alignment. In this chapter, we shall describe how we built such a corpus, which challenges we encountered and where we identified opportunities for improvement, in particular using triangulation approaches based on word alignment. These insights may serve as a blueprint for future work on other multilingual corpora.

We decided to use the Europarl corpus (Koehn 2005) as basis since it comprises all languages we are interested in, and its linguistic content, the transcribed debates of the European Parliament, is close to natural language use (Callegaro 2017).

Contributions

Many people contributed to different stages of our corpus building pipeline. Chiara Baffelli investigated how to optimize dependency parsing in Italian and Simon Clematide adapted her pipeline for French and Spanish. He also trained the parsing model for German. Mathias Müller worked on many parts of the annotation pipelines. The initial set of tokenization rules is the result of many fruitful discussions with Martin Volk. We use the example sets collected by him for English, French and German as unit tests for our tokenizer. Other examples have been provided by Chantal Amrhein, Mara Bertamini, Anne Göhring, Natalia Korchagina, Phillip Ströbel, Daniel Wüest and many others. The design and implementation of our tokenizer and the definition of tokenization rule sets for all 16 languages that we deal with are the author's own work; similarly, the implementation of the processing pipelines and the design of the corpus database. Furthermore, the two algorithms described in Sections 3.2.1 and 3.2.2 have been implemented in SQL by the author.


In fact, we frequently find colloquial expressions in the transcripts that we do not expect to find in corpora of formal written texts (e.g., "Mr President, I always thought being an MEP was a waste of time, but this really takes the biscuit here this evening."). Many corpora come from specific domains (see Chapter 2) and are, hence, less suited to studying general linguistic usage. Some are designed to be as representative of linguistic usage as possible. Leech (1992) describes the goal of creating the British National Corpus (BNC), a comprehensive collection of British English text and speech material, as "to make it as far as possible representative of the full range of variation in the language", well aware that a corpus will never perfectly represent any language in its entirety.

In Europarl, we expect to find idiomaticity and colloquial expressions that, for instance, corpora from the legal domain (e.g., the JRC-Acquis corpus (R. Steinberger, Pouliquen et al. 2006)) typically do not possess. We also know the original language of a speaker contribution in many cases, which allows us to issue queries depending on the translation direction. Last but not least, sentences in Europarl are comparatively short. In our sentence-segmented corpus, we count on average 26 tokens per sentence in English, 23 in German, 18 in Finnish and 28 in French. In contrast, in the JRC-Acquis corpus (ibid.), we count on average 45 tokens per sentence in English (headlines excluded).1 Short sentences yield a better quality of statistical NLP tools that deal with whole sentences (e.g., parsers or word aligners).

The units that we get from our cleaned version of the Europarl corpus (named CoStEP; see Section 2.3.1) are aligned texts, each of which comprises the transcribed and in most cases translated speech of a member or a guest of the European Parliament. Written explanations of vote by one or more members, such as the one shown in Figure 3.1, are often appended to the oral comments given at the plenary sessions. These are found at the end of the comments and indicated as such in CoStEP. We refer to both oral and written contributions as (speaker) turns.

Below the level of turns, paragraphs are marked in Europarl and accordingly in CoStEP. The subdivision into paragraphs, however, is not consistent between languages in the original corpus data. We thus take paragraph boundaries as safe breaking points for sentence segmentation, identify other sentence boundaries within paragraphs during tokenization (Section 3.1) and re-join split sentences when performing sentence alignment in case no corresponding segmentation can be found in any other language.

On the basis of the resulting token sequences, we perform part-of-speech tagging and lemmatization in most languages (Section 3.2). We obtain universal part-of-speech tags (Petrov et al. 2012) by mapping the language-specific tagsets to the universal one.

1For comparison: The ‘Oxford Guide to Plain English’ (Cutts 2013) recommends “an average sentence length of 15–20 words.”

For us the report is a disappointment. It is of course good that we are studying how to create new jobs in the EU countries, but unfortunately there is no mention of the fact that we must also try to prevent the disappearance of jobs, particularly in the public sector. Unfortunately, the EU’s policy, through the goal of economic and monetary union, means that many jobs are being cut in the public sector where many women work. The report entirely lacks an analysis in this area. We cannot agree with paragraphs 3, 4 and 24, which concern the necessity to coordinate economic policy at the EU level. We do not believe this is a way to create more jobs, mainly because, among other things, the industrial structures in the EU countries look very different. Of course there are also parts of the report which are positive, including the request to study unemployment among the young and to put forward proposals which the Member States can use to do something about youth unemployment. It is also important to have a switch in taxes so that tax on work is reduced while tax on energy and raw materials is increased, which is a policy the Green Parties are pursuing in the Member States. A reduction in working hours is also a good way to reduce unemployment.

Figure 3.1 – Example of a joint explanation of vote by three authors.

to the universal one. Using the tokens and their part-of-speech tags, we perform syntactic dependency parsing in our primary languages (English, French, German, Italian and Spanish) plus Swedish (Section 3.3). We store tokens, structural information about sentences and texts, metadata from CoStEP, annotations and alignments (methods described in Chapter 4) in a relational database. Relational databases exhibit several features that prove beneficial both for modeling corpus data and for the efficient retrieval of complex corpus queries. Only a few structural elements are required to represent text corpora in a database schema (Graën and Clematide 2015). We refer to the final corpus stored in a database as database corpus and to the database itself as corpus database.

One of the major challenges in automated corpus annotation based on statistical models is that those models have an inherent error rate. Especially word alignment is error-prone: unlike sentence alignment or parsing, it lacks an explicit definition and has evolved driven by statistical models rather than by imitating an existing linguistic structure. In order to reduce the impact of errors, one can resort to using only the most frequent cases. This is what statistical machine translation does, but methods to select good corpus examples (Kilgarriff, Husák et al. 2008), for instance, also rely partially on frequency. In a similar vein, we present frequency-ranked lists of translation variants in Multilingwis (Section 5.2).

Another option to lower the risk of selecting an erroneous example by reason of such statistical errors is to combine information from several layers. The chance that all of them fail at the same point producing the same kind of error, thus yielding a false positive hit, is considerably smaller than the chance of an error made by a single source. We combine several layers of relations between tokens for our multilingual word alignment approach (Section 4.5) and for phraseme identification (Section 5.3).

The process of corpus preparation is not as straightforward as it reads in published works, which typically only reflect the last setup that was used to perform experiments on. We often encountered smaller or bigger issues several processing steps after they arose. A systematic tokenization error can, for instance, lead to wrong word alignment in particular cases that we only become aware of when analyzing word alignment statistics. We collected those errors and fixed them in the subsequent version of the corpus preparation pipelines. This backtracking approach led to several versions of our database corpus over several years.

We only keep those three versions that were used for data extraction at some point. All of them comprise the full list of speaker turns available in the respective CoStEP version they were built upon,2 as opposed to the corpora of much smaller size that we used for developing our processing pipelines. We will refer to these three ‘full size Europarl’ database corpora as FEP3, FEP6 and FEP9.3 Table 3.1 shows the number of tokens for all languages included in each version in comparison to the number of tokens reported on the Europarl website4 for the current Europarl release (v7) that we used for building CoStEP. We do not know how tokens have been counted in the Europarl corpus. When we process Europarl’s raw files and remove all XML tags, we count on average 16 % more tokens than specified (as ‘words’) on the Europarl web page (Koehn 2012); except for Polish, which apparently has been counted incorrectly in Europarl, as the number of tokens cannot go up when the source material is shrunk.

When we describe the corpus preparation steps, we usually refer to the last corpus version, FEP9. Differences to the previous version are described to contrast different approaches and to motivate our decisions for making modifications. We started with our five primary languages (English, French, German, Italian and Spanish; see also Section 1.1) for FEP3. For FEP6, we added Finnish and Polish to have two more language families represented in our corpus. The last version, FEP9, comprises these seven plus nine more languages. From the 21 languages available

2FEP3 is derived from CoStEP version 0.9.0, FEP6 from version 0.9.4 and FEP9 from the final version 1.0. Systematic errors in CoStEP that we encountered while working on the respective corpus were fixed in subsequent versions of CoStEP.
3We maintain the original numbering to keep this document in line with the envisaged release of the corpus data. Versions other than these have not been used for data extraction.
4http://statmt.org/europarl/

Table 3.1 – Token counts in release v7 of Europarl (Koehn 2012) and three versions of our final database corpus. Languages marked bold are guaranteed to have translations for all turns. That is why there are fewer turns, and accordingly fewer tokens, in FEP6, where we included Finnish translations as a requirement.

Language      Europarl      FEP3                FEP6                FEP9
Bulgarian                                                            7 509 902
Czech         13 195 311
Danish        47 761 381
German        47 236 849    41 119 084 (−13 %)  38 085 206 (−19 %)  41 107 021 (−13 %)
Greek                                                               32 263 532
English       53 974 751    43 176 169 (−20 %)  39 952 337 (−26 %)  43 151 584 (−20 %)
Spanish       54 806 927    45 235 010 (−17 %)  41 845 748 (−24 %)  45 232 847 (−17 %)
Estonian      11 358 009                                             8 136 702 (−28 %)
Finnish       33 708 706                        28 453 158 (−16 %)  28 363 987 (−16 %)
French        54 202 850    47 341 181 (−13 %)  43 720 145 (−19 %)  47 270 588 (−13 %)
Hungarian     12 606 986
Italian       50 259 169    42 652 193 (−15 %)  39 458 151 (−21 %)  42 648 100 (−15 %)
Lithuanian    11 512 131
Latvian       12 085 228
Dutch         53 487 257                                            42 954 617 (−20 %)
Polish         7 087 016                         9 123 170 (+29 %)   9 334 433 (+32 %)
Portuguese    52 300 149                                            44 029 641 (−16 %)
Romanian       9 663 544                                             7 963 967 (−18 %)
Slovak        13 116 301                                             9 406 142 (−28 %)
Slovene       12 665 974                                             9 208 808 (−27 %)
Swedish       45 665 947                                            36 135 818 (−21 %)
Sum          596 694 486   219 523 637         240 637 915         454 717 689

in Europarl, we only excluded five, mostly due to the unavailability of a TreeTagger model. We made two exceptions: we used a different tagger for Swedish, since we envisaged working with Swedish,5 and we left Greek untagged to see how well our methods (e.g., our multilingual word alignment approach described in Section 4.5) work on a language that is – apart from tokenization – unprocessed and uses a different script.

5In (Volk and Graën 2017), we use the data to explore the properties of multiword adverbs in English, German and Swedish.

All the described corpus annotation techniques and improvements, together with the word alignment information detailed in Chapter 4, form the basis for multilingual corpus queries. The implementation as database corpus enables us to formulate and run virtually arbitrary queries efficiently, which we demonstrate by means of the applications that we present in Chapter 5. All the corpus processing steps performed here on the basis of the European Parliament’s debates can likewise be applied to any other parallel or multiparallel corpus (see Chapter 2). We assume that the OpenSubtitles corpus would be particularly useful for investigating spoken natural language use.

Something that we did not address in the corpus annotation process is the issue of code-switching, that is, a speaker changing her language temporarily. We frequently see single terms, mostly English ones, used in other languages, but also phrases or whole cited sentences.6 If a foreign language is used by the original speaker, the translations of her speech predominantly also include that foreign expression (e.g., “General De Gaulle hat einmal von der paix des braves, vom Frieden der Tapferen, gesprochen.” (original), “General de Gaulle once spoke of the paix des braves - the peace of the brave.”, “el General de Gaulle habló una vez de la «paix des braves», la paz de los valientes.”, “kenraali de Gaulle puhui kerran rohkeiden rauhasta (paix des braves).”). These unidentified chunks of other languages will be treated by the respective language models as if they belonged to the language in question and thus lead to (small) propagating errors (e.g., ‘braves’ in ‘paix des braves’ is tagged as an adjective and lemmatized to German ‘brav’ ‘well-behaved’ like, for instance, ‘braves Mädchen’ ‘well-behaved girl’).

3.1 Tokenization

Initially, we did not dedicate much work to tokenization. The TreeTagger, which we decided to use for tagging and lemmatization (see Section 3.2), comes with a script that performs tokenization on the input stream before passing it to the proper tagging application. It was only later, when we had processed all five primary languages and built tools for visualization, that we encountered systematic problems that we could trace back to how we had tokenized the input texts.

We considered extending said tokenization script or enclosing it by pre- and post-tokenization in order to protect particular items that were erroneously split into two or more tokens and to split tokens that the script was systematically missing. In view of the future extension of our corpus to other languages and

6“Wenn er einen working-Tisch organisiert für democratisation and human rights, wunderschön!” “If the Stability Pact were to organise a ‘working’ table conference for democratisation and human rights, all well and good.”

owing to several requirements that could not be easily integrated into the existing tokenization script, we implemented our own tokenizer with three features in mind: modularity, adaptability and traceability.

He and Kayaalp (2006) compare various tokenizers for the biomedical domain. They point out the need for standard tokenizers in order to ensure the interoperability of processing tools. Cruz Díaz and Maña López (2015) follow up with an analysis of more recent tokenizers, also for the biomedical domain. They observe a large degree of disagreement between the tokenization decisions of those tools for the test cases they had identified beforehand. That observation is still in agreement with Habert et al. (1998), who concluded more than 15 years earlier: “At the moment, tokenizers represent black boxes, the behavior and rationale of which are not made clear.”

Apart from rule-based tokenization, there also exist statistical machine learning approaches to tokenization. For those approaches, a large amount of training material (i.e., original untokenized text with marked tokenization boundaries) is required. Jurish and Würzner (2013) argue that sufficient training material could be extracted from “treebanks or multi-lingual corpora”. The problem with treebanks, however, is that they do not always comprise information about the original text as character sequence. When reconstructing the untokenized text from a list of tokens (as Jurish and Würzner (ibid.) had to resort to in one case), problematic cases may be overlooked and de-tokenized incorrectly. Beyond that, treebanks are available predominantly for standard text types of well-resourced languages. When it comes to non-standard texts (e.g., technical literature or historical text sources) or low-resourced languages, chances are that there is no or not sufficient material to learn a tokenization model from.

3.1.1 Cutter: Our Flexible Tokenizer for Many Languages

The question of what constitutes a word is a rather philosophical one. If we take, for instance, the dictionary entry ‘time bomb’,7 do we deal with two words, ‘time’ and ‘bomb’, or do both parts together account for one word? If we opt for two words, what about the variants ‘time-bomb’ (24 % in FEP9) or ‘timebomb’ (6 % in FEP9)? However, if we define a word as everything that constitutes a common concept, we do not always have the ability to automatically decide where such a concept starts and where it ends, since the individual parts of a concept like ‘time bomb’ may appear on their own or in other multiword expressions.

We can approach this division into concepts in a more elementary way. We leave aside for now the notion of words and go over to the notion of tokens. Tokens

7Compare, for instance, https://www.merriam-webster.com/dictionary/time%20bomb and https://www.macmillandictionary.com/dictionary/british/time-bomb.

are elements that we define to be our smallest units. They can be united to form phrases (see Section 4.4.1), but they cannot be split into smaller units.8 In most cases, token boundaries are unambiguous; space characters and punctuation marks (less often the period) are typical characters that mark the beginning or end of a token.

A tokenizer that only splits a character sequence at space characters and before and after punctuation marks will already achieve a good accuracy, though it will miss all cases where the white space or punctuation marks belong to a token, along with those cases where we would want to split the character sequence at a different position. Tokens with white spaces are, for instance, numbers with five or more digits in some texts (e.g., 50 000) or compounds consisting of a quoted expression and another noun.9 Tokens containing periods are typically abbreviations (e.g., ‘approx.’, ‘etc.’, ‘Ltd.’) or ordinal numbers in some languages (e.g., German, Finnish, Polish).

To cover all these nontrivial cases, we systematically classified them and defined minimal test examples with their desired tokenization for each class, similar to units in unit testing. In so doing, we can guarantee that all cases are covered when none of these tests fails. Furthermore, extending the coverage of the tokenizer without invalidating previously correct tokenization decisions is possible by ensuring that all tests still pass after modification. We make the underlying generic tokenization guidelines, our unit tests for all languages covered, the tokenization rules and the tokenizer itself available at http://pub.cl.uzh.ch/purl/cutter.
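This test-driven setup can be illustrated with a few lines of code. The following is a minimal sketch in Python; the tokenize function and the module it is imported from are hypothetical stand-ins, not the actual Cutter test suite. Each test pairs a minimal input with its desired tokenization:

```python
# Minimal sketch of tokenization unit tests; `tokenize` is a hypothetical
# stand-in for the tokenizer under test, not the actual Cutter interface.
import unittest
from cutter_sketch import tokenize  # hypothetical module

class TokenizationTests(unittest.TestCase):
    def test_english_abbreviation(self):
        # 'etc.' keeps its period as part of the token
        self.assertEqual(tokenize("apples, pears etc.", lang="en"),
                         ["apples", ",", "pears", "etc."])

    def test_french_elision(self):
        # lexicalized apostrophe words stay intact, elision clitics are split off
        self.assertEqual(tokenize("aujourd'hui c'est", lang="fr"),
                         ["aujourd'hui", "c'", "est"])

    def test_number_with_space_separator(self):
        # numbers with a space as thousands separator form a single token
        self.assertEqual(tokenize("50 000 Menschen", lang="de"),
                         ["50 000", "Menschen"])

if __name__ == "__main__":
    unittest.main()
```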

Implementation

Unlike other tokenization approaches that process text as a stream of characters from left to right, we identify tokens in a given text by means of regular expressions and ‘cut them out’. That is why we named the tokenizer Cutter. Every time a token has been cut out, we are left with a text part to the left of the identified token (potentially empty) and a part to the right (also potentially empty). We then continue processing both parts the same way we did for the entire text. For that to work, the regular expressions that define our tokenization rules need to capture the whole input in various parts: identified tokens, surrounding

8“From a corpus-linguistic perspective, tokens also represent the minimal unit of investigation, the minimal character sequence that can be addressed in a corpus query.” (Chiarcos, Ritz et al. 2012).
9They are numerous in the German part of our corpus: „Muttersprache plus zwei Fremdsprachen“-Formel, „Erhöhung der Subventionen an ineffiziente Landwirte“-Bericht, „European Masters of Excellence“-Aufbaustudiengänge, „Ich kann es nicht mehr ertragen“-Generation, „in der Europäischen Union hergestellt“-Etikett, …

white space characters and unprocessed text. Once a token has been identified, no further processing is applied to it. Surrounding white spaces can be retrieved as a special token class. This is particularly helpful if the original sequence of characters needs to be restored at a later time. By default, we omit the output of white space tokens, thus only returning linguistically relevant units. We proceed with the unprocessed text parts recursively until only empty parts remain. The resulting structure is a tree with all the leaves being tokens and the nodes corresponding to the particular rule that has been applied (see Figure 3.2).


Figure 3.2 – The French sentence “On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.” as deconstructed by our tokenizer.

Tokenization rules can overlap, that is, the patterns of two different rules can both be applicable at a particular position of the input text. To determine the order in which rules are applied, we allocate each rule to one of several rule sets. There are two types of rule sets: language-specific and language-independent ones. The rule sets themselves are ordered such that more specific tokenization rules are applied before less specific ones, and language-specific and language-independent rule sets are interwoven. Inside each rule set, the same principle is used for ordering the respective rules. In the end, we get an ordered list of regular expressions. When applied to input text, the first matching expression determines which character sequence is cut out.

In the example in Figure 3.2, the token [aujourd’hui]10 is treated first because it contains an apostrophe, and apostrophes are typically used in French to indicate elision of a word-final vowel (as for [c’] and [l’] below). The elision rule would have led to the identification of a token [aujourd’] if it were applied first. The two remaining parts, left and right of [aujourd’hui], correspond to the left and right branches of the root node. On the left side, the next token to be identified by

10We use square brackets to denote token boundaries.

a (different) rule is [qu’]. The remaining left side of that rule is then cut by the least specific and, hence, last rules. They identify the first character sequence terminated by a white space character ([On] and [nous]) and the remaining characters if there is no white space left ([dit]). On the right side of the root node, the first token to be identified is [c’], leaving only a right part for further processing. The same rule identifies [l’] in the next step. From there, [-t-il] is identified first, followed by [,]. The remaining parts on that branch are then processed with the default rules. On the rightmost branch, the rule for the sentence-final period, which has precedence over the default rules, identifies the period and leaves only one token in one remaining branch.
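The recursive procedure can be sketched compactly. The following Python snippet is a toy reimplementation of the cut-out idea, not Cutter itself; the rule patterns are heavily simplified stand-ins for the real rule sets:

```python
import re

# Ordered rule list: more specific rules first, generic fallbacks last.
# Each rule is (tag, compiled pattern); the patterns are simplified
# illustrations, not Cutter's actual PCRE rules.
RULES = [
    ("frApos", re.compile(r"aujourd'hui|c'est-à-dire|presqu'île")),  # lexicalized apostrophe words
    ("frEncl", re.compile(r"-t-(?:il|elle|on)")),                    # euphonic 't' plus enclitic pronoun
    ("frElis", re.compile(r"\b[cdjlmnst]'|qu'")),                    # elision clitics
    ("punct",  re.compile(r"[,;:!?]|\.$")),                          # punctuation, sentence-final period
    ("word",   re.compile(r"\S+")),                                  # default: space-delimited sequence
]

def cut(text):
    """Recursively cut tokens out of `text`; returns (token, tag) pairs in text order."""
    if not text.strip():
        return []                       # only white space left: nothing to emit
    for tag, pattern in RULES:
        m = pattern.search(text)
        if m:
            # the first applicable rule wins; recurse on both remainders
            left, right = text[:m.start()], text[m.end():]
            return cut(left) + [(m.group(), tag)] + cut(right)
    return []

print(cut("On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer."))
```

Because each identified token is never re-examined, the recursion yields exactly the kind of tree shown in Figure 3.2, with the leaves being tokens and each inner node corresponding to the rule that was applied.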


Figure 3.3 – The Finnish sentence “Lissabonin sopimuksen voimaantulo 1. joulukuuta 2009 lisäsi niiden tapausten määrää.” comprises a date specification consisting of three tokens that have been identified by the first matching rule.

Common rules identify one token and leave a left and a right branch. A good example is the rule that identifies lexicalized words with an apostrophe in French (e.g., [aujourd’hui], [c’est-à-dire], [presqu’île]), which is applied first in Figure 3.2. It is, however, not necessary for a rule to follow this schema. In Figure 3.3, we see an example of a rule matching several tokens, namely the parts of a date expression ([1.], [joulukuuta], [2009]). Date expressions follow a well-known scheme and appear frequently in different types of text (e.g., parliamentary debates, newspaper articles, diaries). By identifying these expressions at an early stage, we prevent other rules from matching parts of them.11 This is the main idea behind our ordered rule sets: whenever we can pinpoint a particular type of token or tokens (e.g., date expressions, XML elements, compound nouns, number ranges), we cut it out and thus protect it against further examination. The more reliably the patterns of those tokens can be described and the more characters they extend over, the earlier we apply the corresponding rules.

11In a sentence-final position, for instance, the year could be mistaken for an ordinal number if it is followed by a period (and the respective language expresses ordinal numbers with a period).


Figure 3.4 – This part of a German sentence “da ich die Veranstaltung 'Kulturstadt Europas' für ein richtiges 'Ei des Kolumbus' halte” has typewriter apostrophes instead of the proper left and right single quotation marks. The tokenization tree features inside branches for quoted parts.

Besides matching the remaining character sequences to the left and to the right of an identified token, we also permit other matching parts of a pattern to be marked as unprocessed, so that they will be treated like the unprocessed parts to the left and to the right of what the pattern is matching. Figure 3.4 exemplifies a case that requires those ‘inside branches’. We often have to deal with text that uses the upright (typewriter) apostrophe instead of enclosing single (typographic) quotation marks (likewise the ambiguous double typewriter quotation mark).12 Here, two cases conflict: the single quotation of a phrase as shown in Figure 3.4 and the word-final possessive apostrophe (used in German for words phonetically ending in /s/ (i.e., ‘s’, ‘z’ or ‘x’) and in English predominantly for words ending in ‘s’). In the example sentence shown in Figure 3.4, two of the four apostrophes directly follow an ‘s’, which makes them candidates for forming a token together with the respective preceding continuous sequences of non-space characters, that is, [Europas’] ‘of the Europes’ and [Kolumbus’] ‘of Columbus’. For an algorithm to reject these options, grammar and world knowledge are required, in particular that there is typically no plural of ‘Europa’13 and that the determiner ‘des’ in ‘Ei des Kolumbus’ ‘egg of Columbus’ is already sufficient to mark the possession, and thus the apostrophe would be unnecessary.14

12The absence or difficult accessibility of typographic quotation marks on computer keyboards certainly contributes to the widespread use of typewriter quotation marks, and even if texts make use of typographic quotation marks, those may be replaced by the typewriter ones to facilitate the processing with NLP tools (see Section 2.3.1).
13When speaking about Europe as an idea, one could, admittedly, have different ‘Europes’: “Gibt es heute nicht tatsächlich ein, zwei oder sogar drei verschiedene Europas?” ‘Are there not actually one, two or even three different Europes today?’ (found in “Die ZEIT (1946–2016)” at DWDS (W. Klein and Geyken 2010); https://www.dwds.de/)
14The inverse configuration “Kolumbus’ Ei”, which requires the apostrophe as genitive marker, is grammatically correct but, in opposition to “Ei des Kolumbus”, not idiomatic.

This distinction cannot easily be handled within the scope of a tokenizer. But we assume that a single apostrophe at the end of a non-space character sequence, followed by a space or punctuation mark, by default constitutes a single token together with that character sequence. We also assume that a pair of two apostrophes is used to express some sort of quotation (main quotation in British English and subordinate quotation in several other languages; also used for highlighting a term or expression). For this pair to match, we require that no word character is located to the left of the left apostrophe or to the right of the right one (typically a space). Both conditions hold in Figure 3.4. In Finnish and Swedish, among other languages, the same quotation mark is regularly used for the start and end of a quotation, but the context of the quotation marks (word vs. non-word characters) can be used to disambiguate them.
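The pairing heuristic can be expressed with context assertions. The following is a simplified sketch, not Cutter's actual rules; applying the quotation rule before the possessive rule implements the precedence described above:

```python
import re

# Paired typewriter apostrophes count as quotation marks only if no word
# character immediately precedes the opening one or follows the closing one.
QUOTE_PAIR = re.compile(r"(?<!\w)'([^']+)'(?!\w)")

# Word-final possessive apostrophe (German: after 's', 'z', 'x'; simplified).
POSSESSIVE = re.compile(r"\w+[szx]'(?!\w)")

text = ("da ich die Veranstaltung 'Kulturstadt Europas' für ein richtiges "
        "'Ei des Kolumbus' halte")

# The quotation rule is applied first; [Europas'] and [Kolumbus'] are thus
# never considered as possessive tokens, even though POSSESSIVE would match.
print(QUOTE_PAIR.findall(text))  # ['Kulturstadt Europas', 'Ei des Kolumbus']
```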

Technical Details

We define our token patterns by means of Perl Compatible Regular Expressions (PCRE).15 PCRE offers Unicode16 character properties, named capturing subpatterns and subpattern assertions. Unicode character properties define classes of Unicode characters that share the same property, for instance, being an uppercase letter in any language (\p{Lu}), being a space character (\p{Zs})17 or belonging to the Cyrillic script (\p{Cyrillic}). Those classes can be negated, consequently matching anything that is not an uppercase letter (\P{Lu}), not a space character (\P{Zs}) or does not belong to the Cyrillic script (\P{Cyrillic}). Classes can be combined and intersected.

Using those character properties in subpattern assertions, which in contrast to regular assertions can have a non-zero width, we set conditions on the context of a potential token that we want to identify. A prototypical token is delimited by spaces to the left and to the right. Another character class we typically find immediately preceding or following tokens are punctuation marks. Immediately preceding punctuation marks are commonly parentheses, quotation marks, hyphens (e.g., “EU-owned and/or -flagged”, “Herkunftsländer und -regionen”) or dashes (e.g., “–αυτό δε μπορούμε να το αγνοούμε–”, “–meiner Meinung nach–”, “–tesi che condivido–”). While the hyphens in these examples denote omission of a shared part (‘EU’ and ‘Herkunfts’) and, hence, should be recognized as part of the following character sequence, we want to split the other punctuation marks from the following sequence. Apart from spaces and punctuation marks, the beginning or end of the entire character sequence is also a possibility that we want to permit as token boundary.

15(Hazel 1997); http://www.pcre.org/
16(The Unicode Consortium 2017); http://unicode.org/
17This class comprises 17 different space characters.

Since these are zero-width assertions, we cannot state our requirement as a positive subpattern assertion over negated character properties (i.e., any character of the non-uppercase class, which includes everything but uppercase letters). Instead, we require the absence of a positive character property (i.e., it must not be the case that there is an uppercase letter). This is done by means of negative lookbehind and negative lookahead assertions. They also cover the case that there actually is no letter at all.

Take the following Finnish sentence as an example: Raportissa ei mainita 12 000:tta Gazasta Israeliin ammuttua ohjusta, jotka uhkaavat vakavasti paikallista väestöä. ‘The report makes no mention of the twelve thousand rockets fired on Israel from Gaza, which posed a very serious threat to the local population.’ We are interested in identifying tokens like [12 000:tta], numbers in one of the numerous Finnish cases, potentially having more than three digits and a space as thousands separator,18 to prevent subsequent tokenization from identifying the space (or the colon) as token boundary. For this purpose, we define the pattern for those numbers (digits interspersed with space characters) followed by a colon and a sequence of lowercase letters.19 We prepend a negative lookbehind assertion and append a negative lookahead assertion to this pattern, both over a single letter-class character.

Finally, we need to capture all parts for further processing. The tokenizer expects the leftmost and the rightmost capture to be processed further. All other capturing subpatterns are named, and are, depending on the identifier, likewise processed further (inside branches), filtered out (white spaces) or returned with their name as tag. The token identification rule for the described Finnish numbers is shown in Figure 3.5. The tokens identified by this rule will be tagged as fiQnum, which later serves to indicate which rule is responsible for which tokens and facilitates the following processing steps, for instance, tagging and sentence segmentation.
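The rule shown in Figure 3.5 below can be replayed with Python's third-party regex module, which supports the PCRE constructs used here (Unicode properties, named groups, lookaround). This is a simplified sketch of the fiQnum rule for illustration, not the production rule set:

```python
import regex  # third-party module with PCRE-style Unicode property support

# Simplified sketch of the fiQnum rule (cf. Figure 3.5): digits with spaces
# as thousands separators, a colon and a lower-case case ending, delimited
# by negative lookbehind/lookahead assertions on letter characters.
FI_QNUM = regex.compile(
    r"^(.*?)"                                # left branch (processed further)
    r"(?<!\p{L})"                            # no letter immediately to the left
    r"(?<fiQnum>\d+(?:\p{Zs}\d+)+:\p{Ll}+)"  # e.g. [12 000:tta]
    r"(?!\p{L})"                             # no letter immediately to the right
    r"(?<_>\s*)"                             # trailing white space (filtered out)
    r"(.*)$",                                # right branch (processed further)
    regex.DOTALL,
)

sentence = "Raportissa ei mainita 12 000:tta Gazasta Israeliin ammuttua ohjusta."
m = FI_QNUM.match(sentence)
if m:
    print(m.group("fiQnum"))  # '12 000:tta', returned as one token
    print(m.group(1))         # left part, recursively processed by other rules
    print(m.group(4))         # right part, likewise recursively processed
```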

Advantages

Although token identification rules can become more complex than this one, for example, when they stretch across more than a single token or require a more specific context, complexity stays low due to the fact that every pattern (i.e., every token type) is defined on its own and the pattern scheme remains the same. Rule sets can be combined (e.g., to cope with code-switching) and both single rules

18This is the standard separator in Finnish. Other options are a comma (most English-speaking countries), a period (e.g., Austria, Germany, Italy, Spain) or apostrophes (Switzerland).
19For convenience, we do not make an effort to list all possible Finnish case endings for numbers.

/^(.*)(?<!\pL)(?<fiQnum>\d+(?:\p{Zs}\d+?)+?:\p{Ll}+)
(?!\pL)(?<_>\s*?)(.*)$/uxUs

Figure 3.5 – Token identification rule for Finnish numbers with case endings as regular expression. The first line captures the left branch, using a negative lookbehind assertion for the immediate left context; the last line captures the right branch, using a negative lookahead assertion for the immediate right context and additionally any potentially following white space character (named ‘_’). White space characters in the rule are ignored.

and rule sets can be deactivated separately. By only selecting basic rules, we can, for instance, limit tokenization to the detection of extra-linguistic elements such as XML tags, email addresses, URLs and file names. If a particular case needs to be treated differently for a particular downstream application, a rule can also be replaced. In the example in Figure 3.2, we tokenize the French sentence as “… [encore] [faudra][-t-il] [l’][évaluer][.]”. Alternatively, we could want the euphonic ‘t’ (‘t’ euphonique) in ‘faudra-t-il’ to be separated from the enclitic ‘il’, so that it is analyzed like in “[fait][-il]”, “[doit][-il]” or “[s’][agit][-il]”. In that case, we replace the rule that identifies French enclitics (including possible euphonic ‘t’s) with a variant that uses an extra capturing subpattern for ‘-t’, which will analyze the French sentence from Figure 3.2 as “… [encore] [faudra][-t][-il] [l’][évaluer][.]”. We could also decide to drop the euphonic ‘t’ altogether by making the subpattern non-capturing.20

A behavior we see in many tokenizers is that they, expectedly, treat unknown character sequences according to their rules (if rule-based) or statistical models (if trained on textual data with annotated tokenization boundaries). To make specific token types (e.g., XML tags, URLs or text-specific units such as ‘A5-0210/2001’, which identifies a report targeting the welfare of pigs) pass unscathed through such a tokenizer, we either need to replace them by some unambiguous character sequence beforehand and restore the original sequence afterwards, or we use existing patterns that the tokenizer in question handles well (e.g., XML tags) to encapsulate the unknown token type as payload and re-extract it after it has passed through tokenization. In any case, we need to identify those types before tokenization and restore them afterwards. With Cutter, we have two options to meet this challenge: we either write rules for each specific type and include them in the rule sets, or we mark them as

20We do not recommend dropping anything apart from white spaces. In our view, the role of a tokenizer is to identify token boundaries. It may split contractions such as “won’t” into ‘[wo]’ and ‘[n’t]’, but restoration of the original words ‘will’ and ‘not’ should be done in a post-processing step if required in this form by downstream applications.

protected beforehand. The latter is achieved in Cutter with the aid of two special Unicode code points, U+2402 and U+2403 from the ‘Control Pictures’ block, which are graphical representations of the control characters ‘start of text’ and ‘end of text’, respectively. Code points in the Control Pictures block should only appear in texts that describe control characters. Any character sequence enclosed by these two code points is identified by the very first rule as a token.

We take advantage of this feature to handle abbreviations. Language-specific abbreviation lists are matched against any character sequence ending with a period21 in the text to undergo tokenization, and every hit is marked with the aforementioned code points. Those abbreviations that consist of a character sequence that also exists as a lexical unit represent a peculiarity. We can, for instance, not allow the German abbreviations ‘Abt.’ for Abteilung ‘division, section’ or ‘Art.’ for Artikel ‘article, item’ in the abbreviation list, since Abt ‘abbot’ and Art ‘kind’/‘type’ are valid German words, and we would otherwise always identify them as abbreviations in a (declarative) sentence-final position. Instead, we have to handle those cases with rules that take more context into account.
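A possible sketch of this protection mechanism follows; the abbreviation list is illustrative only, and the functions are stand-ins rather than the actual Cutter code:

```python
# Sketch of the protection mechanism: abbreviations from a language-specific
# list are wrapped in the 'start of text'/'end of text' control pictures
# before tokenization; the very first rule then cuts any protected span out
# as a single token.
SOT, EOT = "\u2402", "\u2403"

ABBREVIATIONS_DE = ["z.B.", "bzw.", "Dr."]  # 'Abt.'/'Art.' must NOT be listed,
                                            # as they coincide with real words

def protect(text, abbreviations):
    # longest-first replacement avoids wrapping a prefix of a longer entry
    for abbr in sorted(abbreviations, key=len, reverse=True):
        text = text.replace(abbr, SOT + abbr + EOT)
    return text

def unprotect(token):
    # strip the markers once the token has passed through tokenization
    return token.strip(SOT + EOT)

print(protect("Wir danken z.B. Dr. Schmidt.", ABBREVIATIONS_DE))
```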

Limitations

In the case of ‘Art.’, we know that it is typically followed by a number (e.g., “dass … öffentliche Debatten nach [Art.] [8] der Geschäftsordnung des Rates stattfinden müssen” ‘public debates be held in accordance with Rule 8 of the Council’s Rules of Procedure’). We can, therefore, implement a rule that identifies ‘Art.’ as a token if followed by a number. Unfortunately, that rule would also be applied in cases like “Das ist nicht die erste Katastrophe dieser [Art][.] [1994] wurden in der Region Tindouf 30 000 Menschen obdachlos …” ‘It is not the first such calamity. In 1994, 30 000 people were made homeless in the Tindouf area …’. A further refinement of this rule to smaller numbers only (no year dates) would solve the issue for these two examples, but they indicate the general problem that, in some cases, the right token boundaries can hardly be concluded from the surface form alone.

Possible clues regarding the correct tokenization in such difficult cases come from morphology (‘Abt’ is a masculine and ‘Abteilung’ a feminine noun), phraseology (‘zweit.’ as abbreviation of ‘zweiteilig’ ‘bipartite’ is typically not preceded by ‘zu’, while the expression ‘zu zweit’ ‘in pairs’ typically is), syntax (we cannot have two uncoordinated finite verbs in the same clause) or semantics (the decision whether

21Abbreviations do not necessarily terminate with a period. Cases with no particular abbreviation symbol typically do not require any special treatment (e.g., acronyms, units of measure). When a symbol other than the period is used, it depends on whether that symbol is ambiguous and could, unrecognized, lead to wrong tokenization. The word-central colon to mark omission of a character sequence in Swedish (e.g., ‘ö:a’ for östra ‘eastern’) is an unproblematic case.

a character sequence in question is an abbreviation or a regular word plus a period cannot be pinned onto any other feature and only becomes clear taking into account the meaning of the surrounding text).22 In these undecidable cases, it is thus an option to make a decision for one or the other tokenization, continue processing the tokenized text with downstream applications and, in case of errors or improbable results, backtrack and revert to the alternative tokenization. Also, probabilistic tokenizers such as the one described in (Jurish and Würzner 2013) may be able to handle ambiguous cases like these by means of an abstract feature representation. However, a large amount of training material is required, as such cases occur infrequently. Furthermore, we have to provide those examples in German since, in our judgment, most other languages show fewer ambiguities regarding token boundaries. This is partially due to the German peculiarity of capitalizing all nouns, including nominalized words of virtually any part of speech, which, given that sentences typically start with an uppercase letter, brings forth more potential sentence boundaries.

Sentence Segmentation

New-line characters in the character stream provided to Cutter are regarded as definite sentence segment boundaries and processed separately. Rules that identify the end of a sentence return a zero-width token tagged as end-of-sentence marker. These markers belong to the same class of tokens that we use for white spaces, which is usually not returned unless explicitly requested.

When we have to decide whether a number followed by a period is an ordinal number (for those languages that use a period to express them) or a cardinal number at the end of a sentence, we look ahead at what follows the period. If we find a capitalized function word (e.g., a discourse marker), it presumably is capitalized on account of starting a new sentence, and we thus can issue an end-of-sentence marker between the period and the following character sequence.23

22Take, for instance, the proper name ‘Franz’ and the adjective ‘französisch’ ‘French’, which can be abbreviated to ‘franz.’ and needs to be capitalized in some combinations: “200 Jahre [Franz.] Revolution in Deutschland”, “Mitglied der [Franz.] Akademie”, “Personifikation der [Franz.] Republik” or “Die hiervon betroffenen Überseegebiete sind: [Franz.] Westafrika, [Franz.] Äquatorial-Afrika, St. Pierre et Miquelon, die Kommoren, Madagascar und abhängige Gebiete, [Franz.] Somaliland, Neukaledonien und abhängige Gebiete, [Franz.] Ozeanien, die südlichen und antarktischen Gebiete, die autonome Republik Togo, das franz. Treuhandschaftsgebiet Kamerun, Belgisch-Kongo und Ruanda-Urundi, Italienisch Somaliland, Niederländisch Neuguinea.”. In contrast, when ‘Franz’ is the first or last name of a person, we need to identify the period as a separate token: “Kein schlechtes Wort, meint strahlend ZEW-Chef Wolfgang [Franz][.] Minderwertigkeitskomplexe haben Franz und seine 130 Mitarbeiter schon lange nicht mehr.” (all examples found in DWDS corpora)

We also issue an end-of-sentence marker following any remaining punctuation mark unless it is itself followed by a closing quotation mark, in which case the marker is issued after that one. If a declarative sentence ends with an abbreviation that is denoted by a period (e.g., ‘etc.’), the sentence-final period is typically omitted in all languages we deal with to avoid reduplication. This makes it challenging to correctly identify sentence boundaries in such cases. Here, we can again resort to typical sentence-initial tokens (capitalized function words) and combine those with typical sentence-final abbreviations (‘etc.’ and ‘et al.’ vs. ‘i.e.’ and ‘Mr.’).
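A toy version of the look-ahead heuristic for periods after numbers is sketched below; the function-word list is a tiny illustrative sample, and the '<eos>' string merely stands in for Cutter's zero-width end-of-sentence token:

```python
# Toy sketch of the look-ahead heuristic: a period following a number is read
# as sentence-final (rather than as an ordinal marker) if the next token is a
# capitalized function word.
CAPITALIZED_FUNCTION_WORDS = {"Dass", "Dafür", "Hierbei", "Deshalb"}

def insert_eos_markers(tokens):
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (tok == "." and i > 0 and tokens[i - 1].isdigit()
                and nxt in CAPITALIZED_FUNCTION_WORDS):
            out.append("<eos>")  # stand-in for the zero-width end-of-sentence token
    return out

print(insert_eos_markers(["Jedenfalls", "bis", "50", ".", "Dass", "alte", "Männer"]))
# ['Jedenfalls', 'bis', '50', '.', '<eos>', 'Dass', 'alte', 'Männer']
```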

3.2 Part-of-speech Tagging and Lemmatization

We use the TreeTagger (Schmid 1994, 1995) for tagging and lemmatization (i.e., assigning the respective tokens their corresponding base forms) for all languages, with three exceptions: Greek, which we do not process further after tokenization; Italian, which we process with Trigrams’n’Tags (TnT) (T. Brants 2000), as Baffelli (2016) found it to perform better on Italian than the TreeTagger with its pre-trained language model; and Swedish, where we use Stagger (Östling 2012, 2013). For German, we apply both the TreeTagger and the Clevertagger (Sennrich, Volk and Schneider 2013), but continue processing with the tags generated by the Clevertagger, as we determined by manual inspection that the latter handles cases such as particle verb prefixes and multiword adverbs (see below) better than the former. Since the Clevertagger, unlike the other taggers, does not assign lemmas to tokens, we revert to the TreeTagger’s analysis for German lemmas.

Taggers and Resources

All four tools are statistical taggers, which need to be trained on correctly tagged (and lemmatized) data. In Table 3.2, we list the resources on which the respective statistical models have been built.

The TreeTagger learns decision trees from training data. It builds on trigrams, that is, in the resulting model, the tags of the two preceding tokens are used to determine the tag probabilities of the token in question. In addition, it uses the training data for learning tag probabilities attached to suffixes,24 which are likewise represented as tree structures, such that the longest informative suffixes, starting with the respective last characters below the root node, are represented as paths from the root to a leaf node.

23The position of the empty end-of-sentence token is marked in the following examples: “Jedenfalls bis [50][.][] Dass alte Männer dick sind, ist ja normal.”; “Ich glaube, das war Nr. 22 und [25][.][] Dafür war ich Euch sehr dankbar.”; “Arme kreisförmig hinten nach unten schwingen bis zur Haltung der Abb. [29][.][] Hierbei ausatmen.” (all examples found in DWDS corpora)
24Note that these are not grammatical suffixes.

Table 3.2 – Taggers, training corpora and tagsets. The last column reflects the number of tags we observe in our tagged corpus; the respective tagsets may comprise additional tags that have not been assigned to any token.

Language     Tagger        Resources (training corpus, tagset, …)                                         Tags
Bulgarian    TreeTagger    Bulgarian Treebank; BulTreeBank Morphosyntactic tagset (Simov et al. 2004)      515
Dutch        TreeTagger    training corpus unknown                                                          42
English      TreeTagger    Penn Treebank (Marcus et al. 1993); Penn Treebank tagset (Santorini 1990)        44
Estonian     TreeTagger    Corpus of morphologically disambiguated Estonian texts (Habicht et al. 2000)    365
Finnish      TreeTagger    FinnTreeBank (Voutilainen et al. 2012); morphological tagset reduced to the
                           first three features (personal communication)                                   591
French       TreeTagger    training corpus unknown                                                          33
German       Clevertagger  TüBa-D/Z (Telljohann, Hinrichs and Kübler 2004); SMOR (Schmid et al. 2004);
                           Stuttgart-Tübingen Tagset (STTS) (Schiller et al. 1995)                          54
German       TreeTagger    training corpus unknown; Stuttgart-Tübingen Tagset (STTS) (Schiller et al.
                           1995)                                                                            53
Italian      TnT           Italian Stanford Dependency Treebank (Bosco et al. 2014) converted to
                           Universal Dependencies (see also Baffelli 2016)                                  41
Polish       TreeTagger    National Corpus of Polish (Przepiórkowski et al. 2008); morphological tagset
                           (Patejuk and Przepiórkowski 2010)                                               803
Portuguese   TreeTagger    Bosque 8.0 (see Gamallo and Garcia 2013); simplified morphological tagset
                           (Garcia and Gamallo 2010)                                                        72
Romanian     TreeTagger    MULTEXT-East ”1984” annotated corpus 4.0 (Erjavec, Barbu et al. 2010);
                           morphological tagset                                                            432
Slovak       TreeTagger    Slovak National Corpus (Horák et al. 2004); simplified morphological tagset      69
Slovene      TreeTagger    ssj500k corpus (Krek et al. 2015); morphological tagset (Erjavec, Fišer et
                           al. 2010)                                                                       1223
Spanish      TreeTagger    Spanish CRATER corpus (McEnery et al. 1997); EAGLES-conformant tagset
                           (Leech and Wilson 1994)                                                          69
Swedish      Stagger       Stockholm-Umeå Corpus (SUC) (Gustafson-Capková and Hartmann 2006); SALDO
                           morphological lexicon (Borin and Forsberg 2009)                                  26

Suffix trees are particularly beneficial with regard to part-of-speech tagging for languages that avail themselves of derivational suffixes, which includes most European languages.25 The part of speech of a token is, hence, often evident from the suffix, the part that stemmers (e.g., Porter 1980) remove to obtain the part of a word that bears the meaning: its stem. The documentation of Snowball (Porter 2001), a programming language for defining stemming algorithms, states that Indo-European and Uralic languages are both “amenable to stemming”.

When tagging with the TreeTagger, we exploit two useful features: pre-tagging and external lexica. Pre-tagging means to externally provide tag probabilities per token to the tagger. We make use of this feature to predetermine the tags (i.e., only providing one tag with a probability of 100 %) of tokens that received particular tokenization tags such as numbers (including those with white spaces), URLs or quotation marks.26 We also pre-tag frequent multiword adverbs as they tend to get tagged incorrectly (see Volk, Clematide et al. 2016).

Similar to pre-tagging, the second feature, external lexica, allows us to equip the tagger with definitions of word forms and their tags. In contrast to pre-tagging, we may only list the options here, without specifying their probabilities. An external lexicon, however, allows for an optional lemma for each word form, which is particularly useful for providing lemmas for compound nouns in German (e.g., ‘Menschenrechtsklauseln’ ‘human rights clauses’), as the training algorithm will not have seen many of them due to their productivity.27 We manually added frequent compound nouns (e.g., ‘Schattenberichterstatter’, ‘Verhandlungsrichtlinien’) and other parts of speech that the part-of-speech tagger missed in the first run (e.g., ‘interinstitutionelle’, ‘nachgewiesenermaßen’, ‘gegenzusteuern’) together with their respective lemmas to the German external lexicon.

Since we know the members of the European Parliament (see Section 2.3.1), we also appended their names to the external lexicon to ensure that they are recognized as proper nouns. We also include capitalized word forms that appear unchanged in most other languages as proper nouns. In case a proper noun coincides with a common noun, both corresponding tags are included as options.

Trigrams’n’Tags (TnT) implements a second-order Markov model, that is, a Markov model that determines the tag for the token in question based on at most

25Clackson (2007) describes Proto-Indo-European morphology. The concept of part-of-speech tagging presupposes that each token bears one identifiable grammatical category, which is typically true for inflected or isolating languages, but “for agglutinative or even polysynthetic languages, it is beneficial to use smaller units than word forms as the basis for statistical processing, in order to avoid data sparseness” (Rios 2015).
26If the tagger model has been trained on a corpus with only typewriter quotation marks, it will not recognize typographic ones. The same is true the other way round.
27“A major problem in statistical POS tagging for German is the complex morphology of German, which results in many inflected or compounded forms which have never been observed during training.” (Sennrich, Volk and Schneider 2013)

the previous two tags assigned. It also estimates lexical and suffix probabilities from the training data. The latter are used to predict the part of speech for unknown words, that is, word forms that have not been observed during training.28

Stagger uses a model that employs a set of features to predict the tag. Its “feature set takes into account the previous tag and previous pairs of tags in the history, as well as the word being tagged, spelling features of the words being tagged, and various features of the words surrounding the word being tagged” (Collins 2002) and thus does not differ much from the previously described approaches with regard to the context considered when determining the tag for a given token. In addition, the tagger uses word embeddings (Collobert and Weston 2008) to constitute a feature that models how well the word form of the token in question harmonizes with its left and right context.29 The morphological lexicon SALDO (Borin and Forsberg 2009) is used to confine the search space for the open (i.e., productive) word classes to those tags that are found in the lexicon.

The Clevertagger builds upon conditional random fields (Lafferty et al. 2001), a method for learning models that label sequences. As features to predict the tag sequence from, it uses a window of five tokens centered at the token in question, the last two assigned tags and several features on the token level. One of them is a list of possible part-of-speech tags computed by morphological analysis. As resources for that analysis, two different systems are used: the finite-state morphological analyzers SMOR (Schmid et al. 2004) and Morphisto (Zielinski et al. 2009). In case a word form has not been observed in the training data, morphological analysis limits the set of possible tags for the token in question and thus increases overall accuracy, in particular when applied to text types different from the training material.

Schmid (1994), T. Brants (2000), Östling (2012, 2013) and Sennrich, Volk and Schneider (2013) report accuracies above 95 %, that is, an erroneous tag is assigned to less than every 20th token in the input token sequences.

Tagsets

Language-specific tagsets show a wide variation in Table 3.2. The tagset used in the Stockholm-Umeå Corpus (SUC) (Gustafson-Capková and Hartmann 2006) only differentiates between 26 basic parts of speech,30 while morphological tagsets encode all possible combinations of morphological features and thus easily account

28Note that these are, like the suffixes calculated by the TreeTagger, merely word-ending character sequences not related to grammatical suffixes.
29The effect of these word embeddings is reported to increase tagging accuracy only minimally.
30 22 grammatical categories, three delimiters (i.e., brackets and punctuation) and one (undocumented) tag assigned primarily to URLs.

for several hundred tags. The tagset used in the Slovene JOS31 corpora (Erjavec, Fišer et al. 2010) consists of 1902 such feature combinations, 679 of which never got assigned to any of the more than 9 million tokens in our corpus.32 Tags in a tagset can thus distinguish small morphological differences such as the aspect of Slavic verbs, mark more general differences such as number (singular or plural) or only denominate the principal grammatical category. Not all features are equally likely to be detected correctly. Džeroski et al. (2000), investigating the effect of different tagsets on tagging Slovene texts, found that “inflectional features are much harder to predict than lexeme ones”. An argument in favor of a smaller tagset is that, given a complex morphological tagset, the probability of an error by getting all but one feature right is higher, while it is harder to select the wrong tag given fewer tag categories. Correctly recognized fine-grained tags, on the other hand, support a statistical tagger in making the right decision for neighboring tokens (ibid.).

Déjean (2000) is concerned with the question of what kinds of tagsets are beneficial to syntactic parsing. His statement that “the quality of a tagset does not depend on the quantity of tags” is to be seen with regard to this purpose. However, if parsing is the objective of tagging, several tags may be treated uniformly in parsing, so that tagging errors only concerning a minor grammatical subcategory that is of little or no importance to syntax make no impact in the end.

We map every language-specific tagset to the so-called universal one (Petrov et al. 2012), more specifically, to the first proposed version consisting of 12 different tags. Mappings for common tagsets are available online;33 in case of morphological tagsets, the correspondence is typically determined by the main category of a tag. There are several advantages of having a reduced part-of-speech tagset on the one hand, and having one that is shared among all languages in a parallel corpus on the other hand. First of all, it facilitates our work with the corpus in that we can issue queries with part-of-speech restrictions without having to use a language-specific tag or an expression referring to a set of tags. Queries using universal tags can easily be transferred from one language to another by just changing the language identifier. Moreover, results can be compared quantitatively with regard to part of speech.

Applying the mapping to universal part-of-speech tags to all individual tagsets, we get the tag distribution shown in Figure 3.6. Two of the universal categories are not available in all languages: Bulgarian, Estonian, Finnish, Polish, Slovak and Slovene do not exhibit determiners; Estonian, Finnish, French, Italian, Polish and Portuguese do not exhibit particles.

31‘Jezikoslovno označevanje slovenščine’ ‘Linguistic Annotation of Slovene’
32The tagset is documented online at http://nl.ijs.si/jos/msd/html-en/.
33https://github.com/slavpetrov/universal-pos-tags
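To make the mapping step concrete, here is a minimal sketch using a small excerpt of the published STTS-to-universal mapping tables (see footnote 33); the function and the tag excerpt are illustrative, not our production code:

```python
# Sketch of mapping a language-specific tagset to the 12 universal tags
# (Petrov et al. 2012). The STTS excerpt follows the published mapping
# tables; only a few of the 54 STTS tags are shown.
STTS_TO_UNIVERSAL = {
    "NN": "NOUN", "NE": "NOUN",        # common and proper nouns
    "VVFIN": "VERB", "VAFIN": "VERB",  # finite full/auxiliary verbs
    "ADJA": "ADJ", "ADJD": "ADJ",
    "ART": "DET", "PPER": "PRON",
    "APPR": "ADP", "KON": "CONJ",
    "ADV": "ADV", "CARD": "NUM",
    "PTKVZ": "PRT",                    # separated verb particle
    "$.": ".", "$,": ".",              # punctuation
}

def to_universal(tagged, mapping):
    # unknown language-specific tags fall back to 'X' (other)
    return [(tok, mapping.get(tag, "X")) for tok, tag in tagged]

print(to_universal([("Die", "ART"), ("Katze", "NN"), ("schläft", "VVFIN"), (".", "$.")],
                   STTS_TO_UNIVERSAL))
```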


Figure 3.6 – Distribution of universal part-of-speech tags (Petrov et al. 2012) among the languages that we tagged. The absence of particular part-of-speech tags (e.g., determiners (DET) in Estonian or Finnish) is either because the language does not command that part of speech or because the language-specific tagset has no equivalent tag (e.g., particles (PRT) in French or Portuguese).

Slavic languages do not feature determiners, apart from Bulgarian. In Bulgarian, however, determiners are realized as enclitics (like most articles in Romanian and also definite articles in Swedish). The task of a tagger is to attach exactly one tag to each token. The noun – being a content word, that is, belonging to an open word class and having determiners depending on it – is considered the more important property of a token consisting of both noun and determiner; hence, the token is tagged as noun. Finno-Ugric languages do not express definiteness with individual tokens either.

As regards particles, their use is envisaged for confined cases in French, Italian and Portuguese in the definition of universal part-of-speech tags.34 For several reasons, they are not present in our corpus: In some cases, there is no corresponding tag in the language-specific tagset, so that we cannot map it accordingly. We also use a different tokenization in French for the euphonic ‘t’ (see previous section), which is why we cannot assign it a tag separately. Finally, in Italian, the particle tag is only used for English possessive markers, but we do not separate them from the noun that they are attached to, because proper names are often concerned (e.g., Lloyd’s, Fisherman’s, Sotheby’s), which we prefer to keep unsplit and tag as nouns.

Lemmatization

Some taggers optionally predict lemmas in addition to tags as a byproduct of tagging. During training, a tagger observes numerous word forms and tags. If the training corpus also comprises lemmas, the tagger can learn the applicable lemmas given a particular word form and the tag assigned to each respective token. The TreeTagger’s strategy is to output every lemma that has been observed during training with the chosen tag, independent of its frequency.35

In German, nominalized verbs and declined nouns (often dative plural) frequently coincide in their forms (e.g., ‘Pflanzen’ ‘plants’ (all cases plural)/‘planting’ (derived noun), ‘Mitteln’ ‘averages’ (dative plural)/‘averaging’ (derived noun)), but so do lexically unrelated words (e.g., ‘Westen’ ‘vests’ (all cases plural)/‘west’). In some cases, we find three alternative lemmas (e.g., ‘Schmieden’ ‘smiths’ (dative plural)/‘smithy’ (nominative singular)/‘forging’ (derived noun)). In these cases, we can look in the translations for evidence to help disambiguate the lemmas. For that purpose, we exploit lemmatization in other languages via word alignment (see Section 4.4). Our approach is detailed in the next section.

Trigrams’n’Tags and the Clevertagger do not provide for lemmatization. For German, we revert to the lemmas issued by the TreeTagger; for Italian, we use the morphological analyzer Morfette (Chrupała et al. 2008), which jointly assigns morphological tags and lemmas to a token sequence. The Italian Stanford Dependency Treebank (Bosco et al. 2014) was used to train Morfette’s statistical model for Italian (Baffelli 2016).

Stagger generates output in the CoNLL-X format, which includes, besides the word form from the input, the lemma, part-of-speech tags36 and sets of morphological

34http://universaldependencies.org/docsv1/
35The Bulgarian tagging model has been trained without lemmas.
36Two different kinds of part-of-speech tags are allotted in this format: coarse-grained and fine-grained ones. However, Stagger does not make a difference between them.

tags. The way Stagger deduces the lemma is undocumented; nevertheless, it is clear from the code that a single unambiguous lemma is always returned.

3.2.1 Interlingual Lemma Disambiguation

We exploit word alignment (see Section 4.4) to disambiguate the aforementioned ambiguous lemmas. Our approach is similar to the one described in (Volk, Amrhein et al. 2016, Section 5), but uses evidence from all aligned languages instead of just one. We perform the following steps for lemma disambiguation:

1. We first calculate the global distribution matrix of lemma correspondences based on optimal alignments ($D_a : \Lambda \times \Lambda$, with $\Lambda$ being the entire set of lemmas in all languages). Optimal alignments are those that are supported by all four word aligners (see Section 4.4.1), in both directions where applicable:

$D_a(\lambda_s, \lambda_t) = p_a(\lambda_t \mid \lambda_s)$ (3.1)

The lemma alignment distribution is calculated based on the frequencies $f_a$ such that the alignment probabilities of all target lemmas given a particular source lemma sum up to 1:

$p_a(\lambda_t \mid \lambda_s) = \frac{f_a(\lambda_s, \lambda_t)}{\sum_{\lambda_{t'}} f_a(\lambda_s, \lambda_{t'})}$ (3.2)

2. Each lemma alternative $\lambda_s^i$ of a given ambiguous lemma is looked up in the same language part of the corpus. If one of the lemmas has not been seen in any other context, it is disqualified for disambiguation, as we will not be able to calculate its probability. If only one lemma option is left by disqualification of the other options, it is selected as the correct lemma without further treatment.

3. Similar to the lemma distribution matrix, we only use optimal alignments to select corresponding tokens for disambiguation. For each of these tokens, we look up the lemma alignment probability $p_a(\lambda_s^i \mid \lambda_t)$ between its assigned lemma $\lambda_t$ (if appropriate) and each of the lemma alternatives $\lambda_s^i$.

4. The lemma alternative with the highest overall probability is chosen as replacement of the set of lemma alternatives:

$\lambda_s^{final} = \operatorname{argmax}_{\lambda_s^i} \sum_n p_a(\lambda_s^i \mid \lambda_t^n)$ (3.3)

If we divided the probability sum by the number of optimally aligned tokens with lemmas, we would get an average lemma correspondence probability for each of the lemma alternatives. Since we are only interested in the one that yields the highest overall probability, we pass on the normalization.
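To make these steps concrete, the following Python sketch implements them under simplifying assumptions: the distribution matrix is stored as nested dictionaries, only the direction needed for disambiguation is shown, and all names are illustrative rather than taken from our actual implementation.

    from collections import defaultdict

    def build_distribution(aligned_lemma_pairs):
        """Lemma correspondence matrix (Equations 3.1 and 3.2).

        aligned_lemma_pairs: (aligned_lemma, candidate_lemma) tuples
        harvested from optimal alignments only, i.e. those supported
        by all four word aligners in both directions where applicable.
        Returns dist[lemma_t][lemma_s] = p_a(lemma_s | lemma_t).
        """
        freq = defaultdict(lambda: defaultdict(int))
        for lemma_t, lemma_s in aligned_lemma_pairs:
            freq[lemma_t][lemma_s] += 1
        dist = {}
        for lemma_t, counts in freq.items():
            total = sum(counts.values())  # normalization (Equation 3.2)
            dist[lemma_t] = {ls: n / total for ls, n in counts.items()}
        return dist

    def disambiguate(alternatives, aligned_lemmas, dist, corpus_lemmas):
        """Choose one lemma from an ambiguous set (Equation 3.3).

        alternatives: e.g. {'gehören', 'hören'}; aligned_lemmas: lemmas
        of the optimally aligned tokens in the other languages;
        corpus_lemmas: lemmas attested elsewhere in the corpus.
        """
        # Step 2: disqualify alternatives never seen in another context.
        options = [a for a in alternatives if a in corpus_lemmas]
        if len(options) == 1:
            return options[0]
        if not options:
            return None  # leave the ambiguous set untouched
        # Steps 3 and 4: sum alignment probabilities over all aligned
        # tokens and select the highest-scoring alternative (no
        # normalization, since only the argmax matters).
        return max(options, key=lambda alt: sum(
            dist.get(lt, {}).get(alt, 0.0) for lt in aligned_lemmas))

With the probabilities from Table 3.4 below, disambiguate({'gehören', 'hören'}, ['horen', 'hear', 'kuulla', 'entendre', 'entender', 'höra'], dist, corpus_lemmas) returns ‘hören’, as in our running example.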

Table 3.3 – Parallel example sentences with ambiguous lemma in German (‘gehören’ ‘to belong to’ or ‘hören’ ‘to hear’). The underlined tokens in the other languages show optimal alignment with the ambiguous German lemma.

Language – Lemma: Sentence

Dutch – horen: Ik heb gehoord dat er boeren zijn die in plaats van te melden dat zij een dier hebben met BSE, dat dier liever doodschieten en begraven.
English – hear: I have heard that there are some farmers who, rather than report they have an animal with BSE, shoot that animal and bury it.
Finnish – kuulla: Olen kuullut, että jotkut maanviljelijät mieluummin ampuvat ja hautaavat BSE:tä sairastavan eläimen kuin tekevät siitä ilmoituksen.
French – entendre: J’ai entendu dire que certains éleveurs, plutôt que de rapporter un cas d’ESB lorsqu’il se présente, abattent l’animal et l’enterrent.
German – gehören ∨ hören: Ich habe gehört, dass manche Landwirte ein an BSE erkranktes Rind lieber erschießen und vergraben, als den Fall zu melden.
Spanish – entender: Tengo entendido que hay algunos ganaderos que, en lugar de informar de que tienen un animal con EEB, matan al animal y lo entierran.
Swedish – höra: Jag har hört att det är en del jordbrukare som i stället för att rapportera att de har ett djur med BSE, skjuter det djuret och begraver det.

A sample sentence with an ambiguous German lemma is shown in Table 3.3. Here, the token with the word form ‘gehört’ could not be assigned an unambiguous lemma, as ‘gehört’ is the past participle of both ‘gehören’ ‘to belong to’ and ‘hören’ ‘to hear’. The bilingual word alignment yields optimal alignments for tokens in six other languages.37 We show the lemma alignment probabilities for both lemma alternatives given the respective aligned lemma in all other languages in Table 3.4. The sum of alignment probabilities for the lemma ‘hören’ is higher than for ‘gehören’, which makes our algorithm select ‘hören’ as the correct lemma and discard ‘gehören’.

37Three other languages show suboptimal word alignment for that source token.

Table 3.4 – Lemma alignment probabilities for the sample sentences in Table 3.3. The sum of probabilities is higher for ‘hören’ than ‘gehören’.

Language   λ_t        p_a(λ_s = gehören | λ_t)   p_a(λ_s = hören | λ_t)
Dutch      horen      0.2339                     0.2909
English    hear       0.1515                     0.3782
Finnish    kuulla     0.1143                     0.3192
French     entendre   0.0694                     0.1482
Spanish    entender   0.0043                     0.0213
Swedish    höra       0.3314                     0.2779
∑ p_a                 0.9047                     1.4357

Our approach does not take into account that the alternatives may coincide in related languages. Dutch, for instance, shows a similar alignment probability for both alternatives, since ‘horen’ ‘to hear’ and ‘horen bij’ ‘to belong’ both comprise the lemma ‘horen’.38 The alternative ‘behoren’ ‘to belong’ has approximately the same probability as ‘horen’ given the German lemma. In Swedish, on the other hand, the compound verb ‘höra till’ ‘to belong’ is considerably more frequent (almost four times) than the single verb ‘tillhöra’, which results in the contrasting numbers in Table 3.4. Knowing this similarity between members of the Germanic language family, a future improvement of this algorithm could investigate the effect of introducing language-specific or language-family-specific weights into Equation 3.3.

In contrast to FEP6, where we used the same approach but did not disambiguate lemmas of function words, we performed disambiguation in FEP9 for all ambiguous lemmas. We leave the ambiguous set of lemmas untouched only in cases where none of the lemma alternatives can be found in the corpus, no optimal alignment exists or none of these alignments holds a lemma.

Applying our approach to the whole FEP9 corpus, we see most disambiguated lemmas in Finnish and German (approximately 200 000 each). Finnish lemmas include a symbol for morpheme boundaries and the ambiguity frequently only consists in a different position of this boundary, which renders German the language with most ambiguities in terms of TreeTagger models. In the Dutch part, we count approximately 190 000 cases, in Slovak 115 000, in Estonian 91 000, in Slovene 12 000 and in Romanian 4000. Ambiguous lemmas in Italian (69 cases) and English (33 cases) only consist of erroneous cases. In Italian, numbers (e.g., 42, 1.2, 43 000), number ranges (e.g., 1996–2004), acronyms (e.g., JEREMY) and some rare words (e.g., X-ray) are assigned their word form plus an alternative lemma, which is always ‘sconfitto’ ‘conquered’. In English, some lemmas are provided with an alternative in capitalization, which suggests inconsistent lemmatization in the training data.

38The preposition ‘bij’ does not prevent the alignment between the verbs from being optimal.

Evaluation

Since most real lemma ambiguities are found in German, disregarding the alternative morpheme boundaries in Finnish, we evaluate our approach on German. For this purpose, we randomly select 100 German sentences comprising at least one ambiguous lemma that has successfully been processed by our approach. In total, we count 114 disambiguated lemmas in the 100 sentences. The majority (61 cases) concerns the ambiguity between the personal pronouns ‘sie’ ‘she’/‘they’ and ‘Sie’ ‘you’ (polite form of address). The latter is typically written with initial capitalization, but the former also requires capitalization in sentence-initial position.39 All other cases with capitalized ‘Sie’, including other grammatical cases: ‘Ihr’ (genitive) and ‘Ihnen’ (dative), can be expected to be the polite form.

Unfortunately, the capitalized ‘Sie’ does not exist as a single lemma in our corpus. Owing to the TreeTagger’s training method of remembering tag and lemma combinations for a given word form, the sentence-initial pronoun ‘Sie’ with lemma ‘sie’ and occurrences of the pronoun ‘Sie’ with lemma ‘Sie’ lead to a collapsed entry that assigns each occurrence of the pronoun ‘Sie’ both lemmas. The same applies to other grammatical cases. The lowercased pronoun ‘sie’, in contrast, only occurs with the lowercased lemma ‘sie’ in the training data, and the latter is thus applied to any token with surface form ‘sie’ that is tagged as a pronoun when the model is applied for decoding. Since one of the two lemmas is not an eligible option for our algorithm, the other lemma is chosen by our algorithm in all ambiguous cases. If we insisted on using the capitalized lemma ‘Sie’, we would merely need to pre-tag any non-sentence-initial occurrence of the capitalized word forms corresponding to it. That approach would yield a sufficient number of cases to reliably disambiguate both cases. Although the ‘sie’/‘Sie’ case is the single most frequent lemma ambiguity with approximately 125 000 occurrences (out of which 25 000 originate from the word form ‘Ihnen’), we proceed with a lowercase ‘sie’ as lemma – as suggested by our algorithm – since we are primarily interested in the disambiguation of lemma options for content words.

Without the 61 cases of ambiguous pronouns, 53 cases with 25 different lemma ambiguities remain (see Table 3.5). Some lemma pairs show an explicit semantic relation (e.g., ‘Antwort’ ‘answer’ and ‘Antworten’ ‘(the) answering’), some leave a trace of their evolution from the same word (e.g., ‘Stunde’ ‘hour’ and ‘stunden’ ‘to defer’, the root word of the derived noun ‘Stunden’ ‘(the) deferral’) and some appear to be semantically unrelated (e.g., ‘Arm’ ‘arm’ and ‘arm’ ‘poor’, the root word of the derived noun ‘Arme’ ‘pauper’).

In two of the 53 cases evaluated, the algorithm selects the wrong lemma: ‘Armen’ is lemmatized as ‘Arm’ instead of ‘Arme’ and ‘Reisen’ as ‘Reise’ instead of ‘Reisen’. A closer inspection of the lemma distribution matrix yields that the German lemma ‘Arm’ is aligned to the English lemma ‘arm’ in 52 % of its occurrences, but also in 33 % of the cases with ‘poor’. Other aligned English lemmas with more than a single occurrence are ‘wing’ (7 %), ‘branch’ (3 %) and ‘Arms’ (3 %). For Spanish, we see a similar distribution: ‘brazo’ ‘arm’ (60 %), ‘pobre’ ‘poor’ (31 %), ‘pobreza’ ‘poverty’ (3 %), ‘rama’ ‘branch’ (3 %) and ‘arms’ (untranslated)40 (3 %). The reason behind the semantically mismatching translations ‘poor’ and ‘pobre’ (and ‘pauvre’, ‘povero’, etc.) is that ‘Arm’ is also used in expressions such as “Arm und Reich” ‘the rich and the poor’ (less frequently “Reich und Arm”), “Umverteilung von Reich nach Arm” ‘redistribution (of wealth) from the rich to the poor’ or “Reich gegen Arm” ‘the rich against the poor’. This formulaic use, always in conjunction with ‘Reich’ ‘the rich’, either did not appear in the training data or ‘Arm’ meaning ‘the poor’ was deliberately lemmatized with the same lemma as ‘Arm’ meaning ‘arm’.

In the case of ‘Reisen’ ‘travel’/‘traveling’, we are dealing with a noun derived from a verb (‘reisen’) that is used considerably less frequently than its counterpart ‘Reise’ ‘voyage’ (25 vs. 645 occurrences). Since both are semantically close, the same words are used in other languages to translate both of them. That is why the lemma distribution matrix exhibits virtually the same translation options but with a significantly lower probability for ‘Reisen’, except for Slovak, as ‘cestovanie’ is predominantly aligned to compounds with ‘Reise’ in German (e.g., ‘Reisefreiheit’, ‘Reiseverbot’, ‘Reiseverkehr’).

39As the formal ‘Sie’ coincides in person and number with the plural ‘sie’ ‘they’, we frequently find cases with ‘Sie’ in sentence-initial position that cannot be distinguished on the sentence level, that is, a translator (human translator or translation algorithm) would need to look at a broader context or any existing translation to decide whether ‘Sie’ refers to the third person plural or the second person polite pronoun.

40It turns out that ‘Arms’ in “die Initiative Everything But Arms” ‘the initiative Everything but Arms’ has been lemmatized as ‘Arm’ ‘arm’ due to it being a valid word form of ‘Arm’.

Table 3.5 – List of 25 lemma ambiguities from our evaluation and their frequency in the corpus. The translations are geared to corpus examples if the respective lemma does occur in the corpus. Non-occurring lemmas are underlined. In those cases, our algorithm selects the other lemma option by default.

Word form        First lemma                      Second lemma                   Frequency
Abkommen         Abkomme ‘descendant’             Abkommen ‘agreement’           11 705
Antworten        Antwort ‘answer’                 Antworten ‘answering’          2306
Arbeiten         Arbeit ‘work’                    Arbeiten ‘working’             2202
Armen            Arm ‘arm’                        Arme ‘pauper’                  761
durchführen      durchfahren ‘to drive through’   durchführen ‘to conduct’       1925
Fällen           Fall ‘case’                      Fällen ‘felling’               4789
gebraucht        brauchen ‘to need’               gebrauchen ‘to employ’         752
gehört           gehören ‘to belong to’           hören ‘to hear’                1950
getroffen        treffen ‘to meet’                triefen ‘to ooze with’         4883
gewährt          gewähren ‘to concede’            währen ‘to last’               1982
Gründen          Grund ‘reason’                   Gründen ‘founding’             6544
Listen           List ‘ruse’                      Liste ‘list’                   422
Mitteln          Mittel ‘means’/‘average’         Mitteln ‘averaging’            3632
Morden           Mord ‘murder’                    Morden ‘murdering’             124
Rechte/Rechten   Recht ‘right’/‘law’              Rechte ‘Right’ (political)     14 933
Regeln           Regel ‘rule’                     Regeln ‘regulating’            5437
Reisen           Reise ‘voyage’                   Reisen ‘travel’/‘traveling’    608
Stellen          Stelle ‘position’                Stellen ‘positioning’          2072
Streben          Strebe ‘strut’                   Streben ‘pursuit’              577
Studien          Studie ‘study’                   Studium ‘academic studies’     1354
Stunden          Stunde ‘hour’                    Stunden ‘deferring’            1514
Summen           Summe ‘sum’                      Summen ‘humming’               501
Tasten           Taste ‘key’                      Tasten ‘groping’               8
Teilen           Teil ‘part’                      Teilen ‘dividing’              2034
Zielen           Ziel ‘goal’                      Zielen ‘targeting’             2322

3.2.2 Particle Verbs in German

In the previous section, we assumed that a lemma can be assigned to each token and that, conversely, the unit a lemma should be assigned to is a single token. This assumption is adequate for a predominant number of cases. There are, in principle, two possible deviations from it: On the one hand, two or more lexical units constitute a token, as in the German compound nouns ‘Diplom-Studiengang’ ‘diploma degree course’ or ‘24-Stunden-Bereitschaft’ ‘24 hours standby’, English adverbial expressions such as ‘turn-of-the-century’ as in ‘turn-of-the-century magic’ or ‘better-than-expected’ as in ‘better-than-expected tax revenues’ and the Portuguese idiomatic expression ‘bicho-de-sete-cabeças’ ‘rocket science’, which is a spelling variant of ‘bicho de sete cabeças’ without hyphens. On the other hand, a lexical unit can be represented by two or more tokens. This applies, for instance, to German and Dutch particle verbs.41

Particle verbs, or separable verbs, as they are often referred to, have a long record of investigation with respect to diverse aspects (see, for instance, Booij 1990; Lüdeling 2001; Zeller 2001; Müller 2003; Roßdeutscher 2011; Bott and Schulte im Walde 2015; Dewell 2015). Those verbs are characterized by a prefix particle that is either attached to its base verb or detached from it, depending on syntactic conditions.42 In the latter case, it is realized as a single token in the same sentence, typically in a position following the verb.

The detached particle can exceptionally precede its corresponding verb when topicalized (see Zeller 2001; Volk, Clematide et al. 2016). These cases are considerably less frequent; in sentence-initial position, we only see the particles ‘hinzu’ (534 cases), ‘fest’ (100 cases) and ‘los’ (1 case) from the verbs ‘hinzukommen’ ‘to supervene’, ‘feststehen’ ‘to be certain’ and ‘losgehen’ ‘to start’. These occurrences, unlike Zeller’s examples, do not contrast two particle verbs that share the same base verb, but function rather as formulaic discourse elements (e.g., “Hinzu kommt der Menschenhandel mit burmesischen Mädchen nach Thailand zu Prostitutionszwecken.” ‘Then there is the traffic in young Burmese girls sold into prostitution in Thailand.’; “Fest steht, dass noch viele Fragen unbeantwortet sind.” ‘It is a fact that many questions have still not yet been answered.’; “Fest steht: Wenn diese Gesetze Anwendung finden, werden besonders europäische Unternehmen Schaden nehmen.” ‘What is certain is that if these laws are applied, European undertakings in particular will be damaged.’).

In what follows, we focus on the vast majority of prefix particles following their base verbs and describe the method we use for reestablishing the link between the two. Knowing this link enables us to reconstruct the correct lemmas of those particle verbs. The lemma assigned to the base verb token is – following the schema of one lemma per token – that of the base verb. On the one hand, this is owed to the tagging and lemmatization algorithm not being aware of the connection between the base verb and the separated prefix particle. On the other hand, it is questionable whether assigning the lemma of the actual verb to the base verb would be beneficial to all subsequent tasks, since that lemma effectively corresponds to two tokens and not only one.

41Since we lack expertise in Dutch and the Dutch tagset lacks a dedicated tag for prefix particles of particle verbs, we limit ourselves to particle verbs in German.
42“This happens in matrix clauses when the verb is finite and occurs in present or past tense, or when the verb is in imperative form” (Volk, Clematide et al. 2016)

A reason for substituting the lemma of the base verb with the lemma of the particle verb is that particle verbs in German frequently differ in meaning from their base verbs (e.g., ‘stellen’ ‘to put’/‘to place’ vs. ‘darstellen’ ‘to represent’/‘to illustrate’, ‘geben’ ‘to give’ vs. ‘preisgeben’ ‘to relinquish’/‘to divulge’, ‘fallen’ ‘to fall’ vs. ‘auseinanderfallen’ ‘to diverge’/‘to fall apart’, ‘schlagen’ ‘to beat’/‘to hit’ vs. ‘vorschlagen’ ‘to propose’/‘to suggest’). Our solution is to maintain the relation between the two tokens of a separated particle verb together with the lemma of the recombined parts. This enables us to decide how to proceed depending on the respective task.43

Identifying Detached Particles and their Base Verbs

We employ the reattachment algorithm described in (Volk, Clematide et al. 2016) for identifying combinations of base verbs and prefix particles. The approach consists of two steps:

First, for each separated verb prefix that has been identified in the part-of-speech tagging step, we look backwards in the sentence for a finite full verb or imperative verb.44 If there is more than one match, we choose the rightmost one, that is, the one with the smallest distance to the particle. That way, we are certain not to choose the wrong verb from a set of coordinated verbs. However, we consequently miss cases where another verb (e.g., in a subordinate clause) is located between the base verb and the prefix particle.45 The distance between base verb and prefix particle can be arbitrarily long, oftentimes ranging over the whole sentence46 – except for the prefield/initial field (German ‘Vorfeld’) position, pursuant to the theory of topological fields (Herling 1821).

43Though we have not implemented it, we expect that replacing the base verb with the actual verb’s lemma and omitting the prefix particle would improve the performance of bilingual word aligners (see Section 4.4.1), in particular since this manipulation would reduce the alignment to be identified to a one-to-one correspondence between two verbs in many cases (e.g., ‘darstellen’/‘represent’, ‘abzeichnen’/‘emerge’ or ‘durcheinanderbringen’/‘confuse’). In a subsequent reconstruction step, all alignment units including the verb would need to be extended to also include the prefix particle (e.g., to convert the one-to-one into a one-to-two alignment).
44Imperatives are used rather infrequently in the European Parliament’s debates. We only find 718 verbs in imperative mood in our corpus (e.g., “Gebt uns einen einzigen Arbeitsort, einen einzigen Sitz in Brüssel.” ‘Give us one workplace, a single seat in Brussels.’).
45“Herr Präsident, ich schließe mich den Bemerkungen der Berichterstatterin, soweit sie die sozialpolitische Komponente der Münzen anbelangt, voll inhaltlich an.” ‘Mr President, I fully endorse the substance of the rapporteur’s comments in so far as they referred to the social aspect of the coins.’
46“Wir verweisen diese Angelegenheit somit an den Haushaltsausschuss und die anderen, zur Abgabe einer Stellungnahme aufgeforderten Ausschüsse, d.h. den Ausschuss für auswärtige Angelegenheiten und den Ausschuss für Industrie, Außenhandel, Forschung und Energie zurück.” ‘We therefore refer the subject back to the Committee on Budgets and to the committees which are to issue opinions on this subject, that is to say the Committee on Foreign Affairs, Human Rights, Common Security and Defence Policy and the Committee on Industry, External Trade, Research and Energy.’

Having identified a candidate pair of base verb and prefix particle, we decide in a second step whether the two of them fit together as a particle verb. To this end, we look up the hypothetical lemma, consisting of the prefix attached to the base verb’s lemma, in the list of known lemmas from our corpus. If it is not found, we have no evidence for the existence of the hypothesized verb and therefore discard the candidate. That way, we prevent combinations of the prefix with a wrong verb, for instance, with the finite verb of a subordinate clause (see above). At the same time, we discard valid pairs that never occur in their infinitive form. This is, for instance, the case with ‘beistimmen’ ‘to agree’: “Ich stimme den durch diese Entschließung eingeführten Änderungen bei” ‘I agree with the amendments introduced by this resolution’. The method described so far corresponds to Volk, Clematide et al.’s approach, with the exception that we do not perform morphological analysis of the unknown hypothetical particle verbs.

A random sample of 200 identified particle verbs yields four errors, which corresponds to a precision of 0.98. Two of them are due to wrong part-of-speech tags, one is entailed by a relative clause between the prefix particle and its base verb, which leads to the assignment of the wrong verb, and, in the fourth case, the prefix particle occurs in a subordinate clause preceding its base verb, exceptionally without being attached to it: “Abschließend halte ich es in Anbetracht der Zielsetzung der Präferenzregelungen für wichtig, daß diese auch den am wenigsten entwickelten Ländern zugute kommen.” ‘Lastly, given the aim of the preferential arrangements, I think it is important that they should also benefit the least developed countries.’ Here, the part-of-speech tagger correctly identifies the prefix particle, but the reattachment algorithm consequently searches for the base verb in the wrong direction and proposes ‘zugutehalten’ ‘to make allowances for sb.’s sth.’, which, though rare, does exist in the non-separated form in our corpus.

In addition to these errors from our evaluation sample, we also find – infrequent – cases where an alleged prefix particle attached to the lemma of the preceding finite verb yields a prevalent particle verb, however with a different particle-verb boundary (e.g., the prefix particle ‘her’ attached to ‘ankommen’ ‘to arrive’ gives ‘herankommen’ ‘to reach’/‘to approach’, which is actually composed of the prefix particle ‘heran’ and ‘kommen’ ‘come’).47 This is also limited to sentences where the preceding verb is located in a subordinate clause.

47“(…) den Zeitpunkt zu vermerken, wann etwas abgegangen ist, sowie den Ort, an dem es ankommt, so daß man von der Zollkontrolle her genau eruieren kann, wann das Gut abgegangen ist und wo es eingegangen ist (…)” ‘(…) to ensure that a note is made of the departure time of consignments and their destination, so that customs checks can reveal precisely when the goods left and where they were delivered (…)’
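The two steps can be condensed into a short Python sketch. It assumes STTS part-of-speech tags as produced by our taggers ('PTKVZ' for separated verb prefixes, 'VVFIN' and 'VVIMP' for finite and imperative full verbs); the data structures are illustrative, and the morphological analysis that Volk, Clematide et al. additionally perform is deliberately omitted, as described above.

    def reattach_particles(sentence, known_lemmas):
        """Identify base verbs and their detached prefix particles.

        sentence: list of (word_form, pos_tag, lemma) triples;
        known_lemmas: set of all lemmas attested in the corpus.
        Returns (verb_index, particle_index, particle_verb_lemma).
        """
        pairs = []
        for i, (form, tag, _) in enumerate(sentence):
            if tag != 'PTKVZ':  # not a separated verb prefix
                continue
            # Step 1: look backwards for the rightmost finite or
            # imperative full verb, i.e. the one with the smallest
            # distance to the particle (avoids coordinated verbs).
            for j in range(i - 1, -1, -1):
                _, v_tag, v_lemma = sentence[j]
                if v_tag in ('VVFIN', 'VVIMP'):
                    # Step 2: accept the candidate pair only if the
                    # hypothetical lemma is attested in the corpus.
                    candidate = form.lower() + v_lemma
                    if candidate in known_lemmas:
                        pairs.append((j, i, candidate))
                    break
        return pairs

Note how both error types discussed above fall out of this sketch: a topicalized particle finds no verb to its left, and a valid pair like “stimme … bei” is discarded whenever the recombined lemma ‘beistimmen’ never occurs in the corpus.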

Efforts to Correct Erroneously Identified Particle Verbs

Starting from the verb and particle pairs found in the first two steps, we aim at identifying those that do not actually belong together (i.e., the false positives) by evaluating the word alignments of the base verb. Our underlying assumption is that the base verb alone and the composite particle verbs typically differ in meaning and can thus be expected to hold a discriminative alignment distribution. For this purpose, we revert to the lemma distribution matrix explained in Section 3.2.1, which we update with the particle verbs identified so far. That way, we reduce the proportion of base verb lemmas erroneously being counted as correspondences of particle verb translations.48

Due to constellations that the algorithm does not handle (see above), it is, however, in many cases not possible to identify all pairs of base verbs and prefix particles and, consequently, the lemma distribution matrix still holds some probability mass for translations of the base verb $\lambda_b$. Since base verbs are typically common verbs with high frequencies in our corpus while particle verbs $\lambda_{p+b}$ show considerably lower frequencies,49 this probability can still outclass the probability of proper translations.50 The ratio of these two probabilities, $p_a(\lambda_{b'} \mid \lambda_b)$ for the probability of a foreign lemma $\lambda_{b'}$ given the base verb and $p_a(\lambda_{b'} \mid \lambda_{p+b})$ given the particle verb, which is shown in Equation 3.4, is thus not a reliable indicator for distinguishing between the raw base verb and a particle verb built on it as long as we do not account for these erroneous cases in the lemma distribution matrix.

$r_a(\lambda_b, \lambda_{p+b} \mid \lambda_{b'}) = \frac{p_a(\lambda_{b'} \mid \lambda_b)}{p_a(\lambda_{b'} \mid \lambda_{p+b})}$ (3.4)

A characteristic of German particle verbs is that, when their composition is semantically transparent, we often find phrasal counterparts in the other languages, in particular in closely related languages (i.e., Germanic, but also Romance languages). The German particle verb ‘klarmachen’ (‘clear’ + ‘make’) is, for instance, frequently translated to English as ‘to make clear’ and to Italian as ‘mettere in chiaro’, ‘zurücklassen’ likewise as ‘to leave behind’ and ‘lasciare indietro’. In other cases such as ‘entgegenwirken’, the corresponding verbs have the prefix incorporated (English: ‘counteract’; Italian: ‘contrastare’).

48This update reduces, for instance, the number of (erroneous) lemma correspondences of ‘schlagen’ ‘to hit’/‘to beat’ and English ‘propose’ from 2884 to 66, which, in turn, raises the alignment probability of English ‘beat’ given ‘schlagen’ from 0.016 to 0.341.
49An exception is, for instance, the verb ‘hinweisen’ ‘to indicate’/‘to reference’, which is on average 40 times more frequent than ‘weisen’ ‘to point’/‘to show’, or *‘beuten’, which forms part of the particle verb ‘ausbeuten’ ‘to exploit’ but does not exist anymore in German as an independent verb.
50While the probability of ‘propose’ given ‘schlagen’ has been lowered considerably, it is still 3.4 % higher than the probability of ‘beat’ given ‘schlagen’. The predominance of the verb ‘propose’ over ‘beat’ can safely be attributed to the parliamentary origin of the corpus.

In a parallel sentence where the prefix particle is separated from the base verb in the German part, bilingual word aligners will generate two alignment units: one for both particles and one for both verbs. This is owed to the fact that the aligners’ language models have learned the correspondence of, for instance, ‘machen’/‘make’ and ‘klar’/‘clear’, and the probabilities for ‘machen’/‘clear’ and ‘klar’/‘make’ will be significantly lower.51 The preference of those word aligners for the most basic correspondence of one token in each language is depicted in Figure 4.21.

Another problematic case is when the base verb forms part of a multiword expression. The base verb of one of the false positive pairs, ‘suchen’ ‘to search’, belongs to the expression ‘die Fehler bei anderen suchen’, literally ‘to search for faults at others’, which translates to ‘to blame others’ in most other languages.52 The lemma alignment distribution shows no evidence for the correspondence of ‘to blame’ in any language and either ‘suchen’ or the (in this case) wrongly recomposed particle verb ‘absuchen’ ‘to scan’/‘to search’.

We try to derive a heuristic method from several statistical values associated with each case of identified pairs. To this end, we compile a small sample of true and false positives (10 each) of the identified pairs and perform linear regression analysis on it. The statistical values that we analyze originate from the lemma distribution matrix and the lemma alignment distribution overlap (see Section 5.1 and Appendix C.1) of base verb, particle verb and the respective aligned words (see Section 4.5.1). In addition, we include raw lemma frequencies for the verbs in question and the number of aligners supporting a particular alignment.

According to the analysis, none of the above-mentioned values alone and no linear combination of them can tell apart the true positives from the false positives in our small sample. Also, none of the resulting linear equations could explain at least half of the data points ($r^2 < 0.5$), which means that the chosen features are not helpful for telling apart the two possible cases. We assume that if we cannot find a well-fitting model for our small sample, we will not be able to find a heuristic method to reliably identify false positives based on those statistical values either.

51Phrasal correspondence cannot be learned either, as the parts of the phrasal/particle verbs in both languages do not frequently appear together.
52 German: “Ja, es stimmt doch, Sie suchen gerne die Fehler bei anderen”; English: “Yes, it is true, Martin, you like blaming others”; Italian: “Sì, è vero, Martin, a te piace incolpare gli altri”; Spanish: “Sí, es verdad, Martin, te gusta culpar a los demás”; Slovak: “Áno, je to pravda, Martin, vy radi obviňujete iných”; Slovene: “Da, res je, Martin, da radi krivite druge”; Swedish: “Jo, det stämmer, Martin, du tycker om att skylla på andra”
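For illustration, the core of the regression analysis described above can be reproduced with a few lines of NumPy. The feature matrix is assumed to hold one row per identified pair (e.g., alignment probabilities, raw lemma frequencies and the number of supporting aligners); feature extraction itself is omitted here.

    import numpy as np

    def r_squared(features, labels):
        """Fit labels (1 = true positive, 0 = false positive) on the
        features by ordinary least squares and return the coefficient
        of determination of the resulting linear model."""
        X = np.column_stack([features, np.ones(len(labels))])  # bias
        coef, *_ = np.linalg.lstsq(X, labels, rcond=None)
        residuals = labels - X @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((labels - labels.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    # labels = np.array([1] * 10 + [0] * 10)   # our small sample
    # r_squared(features, labels) stayed below 0.5 for every feature
    # combination we tried, i.e. no linear model separates the classes.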

Limitation

Apart from the limitations described in (Volk, Clematide et al. 2016), namely disregarding topicalized prefix particles, coordinated prefixes (not discussed here) and cases with interfering verbs of clauses in between the two elements of a particle verb, we are also not capable of detecting particle verbs with two separable prefixes,53 for instance, ‘wiederherstellen’ ‘to restore’/‘to recover’ or ‘wiederaufstehen’/‘wiederauferstehen’ ‘to rise from the dead’.54 In these cases, the resulting lemma excludes the second prefix ‘wieder’ ‘again’.

3.3 Dependency Parsing

We employ the MaltParser (Nivre, Hall et al. 2006) to derive syntactic dependency relations in English, French, German, Italian, Spanish and Swedish. The MaltParser website55 provides pre-trained language models for a couple of languages, of which we use the English and Swedish ones.

Although recent experiments have shown that parsers without explicit part-of-speech tags achieve good results (Vinyals et al. 2015), most parsers rely on part-of-speech tagged input. In order to generate reasonable output, the tagsets used by tagger and parser need to agree. This is not the case for the English models provided by the TreeTagger and the MaltParser. Even though both build on the Penn Treebank tagset (Santorini 1990), different versions have been used for training, so that we need to adapt the tags generated by the TreeTagger model to the ones that the MaltParser model expects.
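A hypothetical excerpt of such a tag mapping is shown below; the TreeTagger's English parameter file deviates from the Penn Treebank tagset in a handful of tags. The entries are for illustration only and do not claim to be our complete mapping.

    # TreeTagger-specific English tags mapped to the Penn Treebank
    # tags the MaltParser model expects (illustrative excerpt).
    TREETAGGER_TO_PENN = {
        'SENT': '.',     # sentence-final punctuation
        'NP':   'NNP',   # proper noun, singular
        'NPS':  'NNPS',  # proper noun, plural
        'PP':   'PRP',   # personal pronoun
        'PP$':  'PRP$',  # possessive pronoun
    }

    def adapt_tag(tag):
        # Tags shared by both tagset versions pass through unchanged.
        return TREETAGGER_TO_PENN.get(tag, tag)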

Our MaltParser model for German has been trained on the TüBa-D/Z treebank (Telljohann, Hinrichs, Kübler et al. 2003; Telljohann, Hinrichs and Kübler 2004), which utilizes the STTS tagset, as do both part-of-speech taggers we apply to German. It generates dependency relations which are labeled according to the Hamburg Dependency Treebank (Foth et al. 2014; Foth 2006). The Swedish parsing model has been trained on the Swedish Treebank (Nivre and Megyesi 2007), which is the union of two treebanks: Talbanken (Einarsson 1976) and SUC (Ejerhed and Källgren 1997; Gustafson-Capková and Hartmann 2006). The dependency labels used correspond to the MAMBA annotation scheme (Nilsson and Hall 2005; Teleman 1974). The English model generates Stanford typed dependencies (Marneffe and Manning 2008), which served as the model for the universal dependency relations (Marneffe, Dozat et al. 2014).

In FEP9, we use the annotation pipeline by Baffelli (2016), which has originally been developed for Italian, in an adapted version also for French and Spanish. Like in the original Italian pipeline, the French and Spanish parsing models generate relations labeled with version 1 universal dependency labels (Marneffe, Dozat et al. 2014; see also McDonald et al. 2013). Universal dependency relations comprise a fixed set of relation labels for syntactic dependencies that are seen by the authors as universally applicable to any language. They are supplemented by language-specific dependency labels, which only apply to particular languages or groups of languages. In Appendix A.1, we list both universal and language-specific dependency labels and indicate whether they appear in the relations generated by the respective parsing models for French, Italian and Spanish.

All three languages, French, Italian and Spanish, are morphologically analyzed with Morfette (Chrupała et al. 2008), and the results are fed to the respective MaltParser model, which is supposed to support the parser in making the right decisions by resolving ambiguities. In the Italian pipeline, clitics are separated from their corresponding verbs and treated as individual tokens. After parsing, we rejoin them and drop any dependency relations of the clitics. The verb form ‘occuparsene’ ‘to deal with it’, for instance, is split into ‘occupare’ ‘to occupy’, ‘se’ (non-standard form of the reflexive pronoun) and ‘ne’ (pronoun coreferring to something previously mentioned). After parsing, we treat the separated verb as if it had been parsed as the compound form with both clitics attached.

For an earlier version of our corpus (FEP6), we built our own MaltParser model for Italian.56 To this end, we obtained the Italian Stanford Dependency Treebank (ISDT)57 from the evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA)58 and replaced tags and lemmas with the respective fields returned by the Italian TreeTagger model. We added universal part-of-speech tags by mapping the tagset used by the TreeTagger model (see Section 3.2). We continued by training a parsing model on the modified treebank using the MaltOptimizer (M. Ballesteros and Nivre 2012). This method adds tagging errors to the gold data and thus leads to a performance loss compared with a model trained on the unmodified treebank, which can only be applied to input data with the same tagset.

53In contrast to verbs with two prefixes where only one is separable (e.g., ‘anvertrauen’ ‘to entrust’ as in “Warum vertrauen wir nicht die Verwaltung der eventuell von uns erzeugten Überschüsse einer Europäischen Agentur an […]?”).
54“Wir knüpfen Bande und stellen sie wieder her, wenn sie zerreißen.” ‘We have woven ties which we mend when they become frayed.’; “Manche arme Menschen stehen von den Toten wieder auf, sobald ihre Organe in lebende Menschen eingepflanzt sind” ‘Some poor people positively rise from their graves, in a manner of speaking, when their organs are transplanted into living people.’
55http://maltparser.org/
56The resulting dependency relations in FEP6 have been used in several works (i.a. Graën 2017; Graën and Bless 2017). Parsing in English and German has been performed with the same models as in FEP9; no other language has been parsed for FEP6.
57http://medialab.di.unipi.it/wiki/ISDT
58http://www.evalita.it/

Furthermore, the lack of morphological information in the TreeTagger output is also likely to have decreased parsing accuracy for Italian in this corpus version.

In summary, we have syntactically parsed 6 of the 16 languages present in the latest version of our corpus. Altogether, the parsing models use four different label sets to distinguish the kinds of syntactic relations between token pairs. With regard to performance on our corpus data, we can at most expect the numbers reported on the test sets of the respective training corpora (e.g., a label attachment score of 86 % reported by Baffelli (2016)), if those numbers are available at all. If they are not, we can safely assume them not to be significantly higher than other parsers’ performance, which is, after all, still considerably lower for syntactic parsing than for part-of-speech tagging, first of all due to the complexity of the problem.

Although it would be helpful for the comparison of syntactic structures in parallel sentences if all language models used the same set of dependency relations (i.e., the universal ones), we only use them to identify elementary relations such as verb complex and noun phrase parts (Section 4.5), direct objects (Section 5.3) and prepositions (Section 5.4). These relations are typically less ambiguous (see, e.g., ibid., Section 5.2.3.4).59 Wherever our work relies on syntactic relations, we either require parallel constellations, that is, that the partial parsing structures agree between two languages, use them for statistical analyses or use them as supporting evidence. In all cases, we reduce the error probability of our respective application by not letting it rely on single instances of dependency relations.

3.4 Database Corpus

The corpus as described in the previous sections makes use of a limited number of relational data types, that is, relations between entities and attributes. The central element of our corpus schema is the token entity. Unlike Chiarcos, Ritz et al. (2012), we do not allow for alternative layers of tokenization. For applying the pre-trained models we make use of (i.e., the ones used by the TreeTagger, TnT, Stagger, MaltParser and Morfette), it is important to provide input data that the respective model knows how to handle. For tokenization, that means that we should not tokenize “doesn’t” as [doesn][’][t] if the model expects [does][n’t]; otherwise, we have to expect degraded performance. This is why we transform the input (e.g., conversion of the typographic apostrophe to the typewriter one) and output (e.g., reattachment of Italian clitics) of the tools we employ where necessary.
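Two minimal sketches of such transformations, assuming nothing about our pipeline beyond what is described above:

    def normalize_apostrophes(text):
        # Input transformation: replace the typographic apostrophe
        # (U+2019) with the typewriter one (U+0027), so that word
        # forms match what the pre-trained models saw in training.
        return text.replace('\u2019', "'")

    def split_contraction(word):
        # Tokenize English contractions the way the models expect,
        # e.g. "doesn't" -> ["does", "n't"], not ["doesn", "'", "t"].
        if word.lower().endswith("n't"):
            return [word[:-3], word[-3:]]
        return [word]

    assert split_contraction("doesn't") == ["does", "n't"]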

59The conversion of language-specific dependency relations to universal ones is not as simple as mapping language-specific part-of-speech tags to universal part-of-speech tags (see Section 3.2); such a conversion would involve transformations in the dependency tree.

In (Graën and Clematide 2015, Section 3), we sketch the data structures needed to represent processed text corpora. In the latest version of our corpus, we only use three basic types alongside the core structure consisting of tokens, sentence segments and texts:

1. Attributes: Part-of-speech tagging and lemmatization add an attributive layer to the sequence of tokens. Multiple layers of the same kind are possible (e.g., different part-of-speech taggers applied to the same token sequences, as we did for German).

2. Directed binary relations: Syntactic dependency parsing adds a layer that relates token pairs with attributes (i.e., the dependency labels). Multiple layers are possible, too, though we do not use different parsers or parsing models for the same language. This type can also be used to model other intralingual relations such as coreferences.

3. Sets: Multilingual alignments, that is, the correspondence of elements in multiple languages on both sentence and token level, require a structure that relates an arbitrary number of elements. Hierarchy of alignments can be expressed by means of inclusion. In contrast to dependency relations, multilingual alignments are by definition symmetric. (A schema sketch covering the core structure and all three types follows below.)
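The following sketch shows how the core structure and the three basic types could be laid out in a relational schema; it is given as SQL DDL wrapped in a Python string (as one would pass to a PostgreSQL driver), and all relation and column names are invented for this illustration – they do not reflect our actual schema.

    SCHEMA_SKETCH = """
    CREATE TABLE token (                  -- core structure
        token_id    bigint PRIMARY KEY,
        sentence_id bigint NOT NULL,
        position    int    NOT NULL,
        form        text   NOT NULL
    );

    CREATE TABLE token_attribute (        -- 1. attributes
        token_id bigint REFERENCES token,
        layer    text,                    -- e.g. 'treetagger', 'tnt'
        pos      text,
        lemma    text
    );

    CREATE TABLE dependency (             -- 2. directed binary relations
        head_id      bigint REFERENCES token,
        dependent_id bigint REFERENCES token,
        label        text
    );

    CREATE TABLE alignment (              -- 3. sets; hierarchy via
        alignment_id bigint PRIMARY KEY,  --    inclusion (parent link)
        parent_id    bigint REFERENCES alignment
    );

    CREATE TABLE alignment_member (       -- n-ary, symmetric membership
        alignment_id bigint REFERENCES alignment,
        token_id     bigint REFERENCES token
    );
    """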

The interval configuration we list in (ibid.) is needed for any annotation that identifies continuous structures such as syntactic constituents. In our corpus annotation, we rely on dependency parsing and do not need support for continuous components. The components identified by our multilingual word alignment approach (Section 4.5) are frequently discontinuous (e.g., German “Dies zieht die Vereinbarung in Zweifel” ‘This calls into question the agreement’) and thus cannot be represented as (aligned) intervals.

Advantages of Relational Database Management Systems

Relational database management systems (RDBMS) are built on relations and set operations. They offer sophisticated indexing techniques (e.g., functional indices, concatenated indices and indices on letter trigrams for partial string matching) that allow for efficient retrieval of the data stored in a database (for an overview see Winand 2012). The database schema, that is, the entities represented in the database and their relations with each other in its entirety, is defined by the user. It is consequently the user’s duty to derive a model that comprises all relevant data and relations for the envisaged application or applications in a non-redundant way.60 Once set up and populated with data, the database can be queried for any data using arbitrary combinations of entities permitted by its schema. To this end, the required data is functionally described in relational algebra, which is typically done in variants of the SQL standard.61

Our decision to use a relational database for storing and querying our corpus is primarily motivated by these insights. The idea is not new, though: Davies (2005) explored the suitability of an RDBMS for large corpora with several layers of annotation more than a decade ago and found speed of retrieval, extensibility and flexibility of corpus retrieval by means of SQL queries the most compelling reasons in favor of corpus databases. Our RDBMS of choice, PostgreSQL (PostgreSQL Global Development Group 2017),62 is able to efficiently handle data loads that are several orders of magnitude larger than what the biggest text corpora available with hundreds of annotation layers would require, and to process far more complex queries than those we need for corpus retrieval. Other specialized corpus query systems, the most prominent of which presumably is the Corpus Workbench (CWB) (Christ 1994; Evert and Hardie 2011) with its Corpus Query Processor (CQP), typically rely on self-made indexing solutions using plain files.63

SQL is a powerful query language, capable of retrieving and statistically analyzing any data that is even remotely related. Since SQL queries express a functional description of the data to be retrieved, the database’s query planner (see Winand 2012, Chapter 2) evaluates the estimated costs (i.e., the expected times that each action will require) of different query execution plans that are equivalent with respect to the result. It subsequently chooses the fastest one. This implies generating a list of possible equivalent query execution plans for the query in question and estimating their costs by taking into account numerous parameters such as the selectivity of query parts (by means of distribution statistics for accessed attributes), available indices and the latency of the underlying storage devices.64 Non-selective aggregating queries can only be optimized up to the extent of the effective costs to traverse the whole dataset that is to be analyzed. In those cases (e.g., for calculating collocation scores), it is advantageous to precalculate the required values and index them as well. PostgreSQL provides an extension to the SQL standard that implements materialized views, that is, a physical representation (unlike views) of the query results for a particular point in time.65 Since we do not add primary data to our database corpus once populated, we do not run the risk of accessing outdated values.66
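As an example of such precalculation, a materialized view over the illustrative relations sketched in the previous section could store dependency-based lemma pair frequencies for collocation measures; the names are again invented for this sketch.

    COLLOCATION_VIEW = """
    CREATE MATERIALIZED VIEW lemma_pair_frequency AS
    SELECT h.lemma  AS head_lemma,
           d.lemma  AS dependent_lemma,
           dep.label,
           count(*) AS frequency
    FROM   dependency dep
    JOIN   token_attribute h ON h.token_id = dep.head_id
    JOIN   token_attribute d ON d.token_id = dep.dependent_id
    GROUP  BY h.lemma, d.lemma, dep.label;

    -- Index the precalculated values for fast lookup.
    CREATE INDEX ON lemma_pair_frequency (head_lemma, dependent_lemma);

    -- If corpus data were added later, the view would be refreshed:
    -- REFRESH MATERIALIZED VIEW lemma_pair_frequency;
    """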

60While formal rules for database normalization exist and a particular degree of normalization is typically required to prevent anomalies (i.e., inconsistencies) in the database, the deliberate introduction of redundancies may be licensed by performance requirements.
61ISO/IEC JTC 1 2016.
62http://postgresql.org/
63ANNIS (Chiarcos, Dipper et al. 2008; Krause and Zeldes 2014) also partly relies on PostgreSQL for data storage and querying.
64Smith (2010) deals with performance tuning of the PostgreSQL DBMS to make queries more efficient and allow for the query planner to estimate costs more accurately.

As versatile and expressive as SQL queries are, they are arguably not the query language of choice for general linguistic questions (Graën and Clematide 2015). On the one hand, they tend to be verbose and repetitive, as the schema-inherent relations have to be expressed explicitly each time. On the other hand, these relations have to be known by whoever composes a query, which renders composing queries overly difficult for most users. The obvious solution to these problems is to generate SQL as an interim query language, starting from a query language that requires the user to only describe linguistically motivated entities and their relations – a similar approach to the one ANNIS (Chiarcos, Dipper et al. 2008; Krause and Zeldes 2014) has taken (Rosenfeld 2010; see also Clematide 2015).

In contrast to corpus query tools like ANNIS, our work does not aim at supporting arbitrary corpus queries by users. For our applications (see Chapter 5), which combine word alignment with one or more annotation layers, we use corpus query templates that are combined to compose the final query at runtime (see, for instance, Graën, Sandoz et al. 2017). The conversion of queries expressed in an advanced query language that, in addition to multiple layers of annotation, also allows for referencing multilingual alignments (Section 4.5), into SQL queries is a major challenge, which cannot be addressed here.

65https://www.postgresql.org/docs/10/static/sql-creatematerializedview.
66In case we wanted to add corpus data to an existing database corpus, we would simply need to refresh the materialized view (i.e., to repopulate the relation with up-to-date data).

Chapter 4

Alignment Methods for Parallel Text Corpora

This chapter deals with different levels of alignment, that is, the identification of corresponding elements on various structural levels of parallel corpora. For a comprehensive overview, we refer the reader to (Tiedemann 2011, Chapter 2). The chapter descends thematically from the alignment of bigger textual units to the comprehensive field of word alignment, always taking into account the particular challenges of consistent multilingual alignment. Multilingual alignment is the identification of corresponding elements in more than two languages simultaneously. Simard (1999) comments on alignment as translation equivalence: “Translation equivalences can be viewed at different levels of resolution, from the level of documents to those of structural divisions (chapters, sections, etc.), paragraphs, sentences, words, morphemes and eventually, characters.”

Text alignment (Section 4.1) is a – potentially hierarchical – alignment of a parallel corpus that yields corresponding textual units, primarily based on metadata and document structure.

Contributions

Mathias Müller and Ventsislav Zhechev improved our pipelines for bilingual word alignment. Christof Bless built the graphical alignment tool that we used for preparing our multilingual sentence and word alignment gold standards. The word alignment gold standard in six languages is the work of Selena Calleri and Barbara Pejkovic. Both approaches to multilingual sentence and word alignment have been designed and implemented by the author. The same applies to their evaluations.

Whether a hierarchy is given or not depends on the structure of the respective corpus material. For further processing, only the minimal corresponding textual units identified at this level are relevant, as they define the boundaries for subsequent sentence alignment. Multiparallel corpora (see Section 2.3) often already specify correspondence on the level of documents (e.g., resolutions in the UN corpus (Rafalovitch, Dale et al. 2009)).

Sentence alignment identifies corresponding sets of sentences, which are typically sequential ranges, in the respective languages. The additional complexity that multilingual sentence alignment (Section 4.3) introduces to the alignment process stems from the requirement of coherence, that is, the need for alignment boundaries in all particular languages to agree. Alignment boundaries that have no counterpart in at least one of those languages require hierarchical sentence alignment. From that hierarchy, we can extract minimal alignments for any subset of the languages comprised.

The kind of alignment arguably most dealt with in the literature is word alignment (Section 4.4).1 Word alignment has not only driven machine learning approaches for statistical machine translation but also a series of more linguistically motivated tasks, such as the extraction of translation equivalents for multiword expressions. Sub-sentential alignment emanates from the aligned words, aiming at the alignment of higher-level, often syntax-related units.

As in multilingual sentence alignment, multilingual word alignment (Section 4.5) requires the alignments to be represented as a hierarchy, since the parts of a complex unit, for example, a multiword expression, may consist of smaller corresponding units in only a subset of the languages being aligned. Multilingual word alignment thus acts as a junction of bilingual word and sub-sentential alignment for the alignment of words in three or more languages.

We expect hierarchical multilingual word alignments to be useful for the comparison of linguistic phenomena between several languages. Language learners who have already acquired knowledge in another language benefit from multilingual corpus examples. The use of aligned single words and complex expressions in one or more additional languages can reveal structural and lexical similarities and differences.

Terminology

The term alignment has been used in the literature to denominate four different concepts: First, it can refer to a particular alignment method (e.g., word alignment). Second, it can refer to the process of aligning parallel data (e.g., “the alignment was carried out by …”). Third, the result of applying an alignment method to a particular unit (e.g., a sentence), a set of corresponding subunits (e.g., words), is typically called an alignment. Fourth, a single correspondence of those subunits (e.g., two words in two different languages) is also referred to as an alignment (e.g., a one-to-one alignment). These different meanings are often used side by side in publications (see, for instance, Varga et al. 2005; Braune and Fraser 2010; Abdul-Rauf et al. 2012). To avoid misreadings, we distinguish them by using the terms alignment method, alignment process, alignment set (AS) and alignment unit (AU),2 respectively, whenever they are not determined otherwise. A manually created set of correct alignment units is called a gold alignment, and an application performing alignment is referred to as an alignment tool, or aligner for short. The position between two adjacent alignment units is referred to as an alignment boundary.

Varga et al. (2005) base their evaluation on the list of consecutive alignment boundaries, which is admissible if the alignment set is monotonic, that is, its alignment units establish an order that corresponds to the order of sentences in both texts, and complete, that is, every original sentence forms part of one AU.

1Word alignment identifies corresponding tokens in a sentence and, from today’s perspective, should thus rather have been called token alignment. In this chapter, we speak of words instead of tokens to adhere to the well-established terminology.

4.1 Text Alignment

Text alignment is the task of identifying minimal corresponding textual units in two or more languages in a collection of translations.3 Unlike sentence and word alignment, text alignment depends for the most part on extra-linguistic properties, such as domain, origin and technical formatting of the textual material. If all language versions are close translations of each other, even paragraphs may be considered as the smallest text AUs.

For book translations given as raw text, the obvious structure typically comprises (numbered) chapters, and parallel texts will become correspondingly large. The correspondence of texts may, however, not be given from the outset. For Tiedemann (2011), “[t]he first alignment task when building parallel corpora is to link corresponding document [sic] with each other.” He refers to this task as document alignment, while Östling (2015) uses the term document linking. We typically expect a one-to-one correspondence of documents. Null alignments on the document level, that is, a missing translation for a particular textual unit in one language, are also possible. In fact, a considerable number of translated texts in the Europarl corpus (Koehn 2005) are missing (see Graën, Batinic et al. 2014).

2Brown, Lai et al. (1991) coined the term bead for an alignment unit in sentence alignment, but it did not prevail.
3The original text is not required to be part of the collection, though; all texts may well be translations of another source not comprised by the corpus. We will equally refer to them as translations.

In the two multilingual corpora compiled at our institute, Text+Berg (Göhring and Volk 2011) and the Credit Suisse Bulletin corpus (Volk, Amrhein et al. 2016), the units to be aligned at the document level are articles (see also Section 2.3). Null alignments pose a particular problem, since missing translations of articles are not indicated in the respective languages (ibid., Section 4).

Ribeiro et al. (2000) propose a method to further subdivide parallel texts by detecting corresponding homographs such as proper names or numbers. They assume that those corresponding tokens are only valid when found at approximately the same relative position in the respective texts and filter out unreliable correspondences using a so-called ‘confidence band’ around the calculated linear regression line. Unlike a static confidence interval, the confidence band’s width also depends on the relative position in the texts. Kay and Röscheisen (1993) are convinced that “long texts can almost always be expected to contain natural anchors, such as chapter and section headings, at which to make an a priori segmentation.”

In the case of the Europarl corpus (Koehn 2005), three structural levels inherent to the plenary debates of the European Parliament are represented in the corpus compilation:

1. the partition into plenary sittings,4

2. the division of each sitting into thematic chapters, and

3. the subdivision of each chapter into speaker contributions or turns.

While the first level is indicated by the respective filenames, the latter two are numbered consecutively for each sitting. Unfortunately, the numbering of turns is often broken, which will lead to errors if we base our text alignment on it. In the best case, we could find discrepancies using some measurement on the aligned turns and reject those AUs and all subsequent ones of the same sitting to get rid of subsequent errors. In the worst case, we would not detect the erroneous AUs and continue applying sentence and subsequently word alignment to them, which would all be wrong.

These considerations – alongside a number of other issues we discovered (see Graën, Batinic et al. 2014) – led to the creation of the Corrected & Structured Europarl Corpus (CoStEP), detailed in Section 2.3.1, on which we performed text alignment for speaker turns by exploiting meta information such as the speaker’s name or political party.

4Owing to the gradual enlargement of the European Union over time, there are generally fewer languages available for earlier dates. In addition, some translations are unavailable although the respective languages were official working languages of the European Union at the particular plenary dates.

In a subsequent step, we removed those turns from the corpus that were available in one language only, which means that either we were unable to align them to corresponding turns in other languages or that they had no translations in the first place. Altogether, 162 400 speaker turns are available in CoStEP 1.0.

Furthermore, we obtained a list of parliament members of the European Union, from which we added speaker attributes to the turns whenever we could identify the respective speaker in that list.5 For identification, we relied on a fuzzy match of the speakers’ names and a comparison of the respective sitting’s dates with potentially matching members’ mandates. In so doing, we were able to add metadata to 117 511 turns in CoStEP 1.0, out of which 102 622 made it into FEP9 (see below). Our incentive for adding the speaker attributes was being able to distinguish native from non-native speakers. While we attain no absolute certainty when basing our assumption on the speakers’ countries – especially not for multilingual ones –, we will presumably get a higher precision by, for example, excluding countries other than the United Kingdom and Ireland when looking for native English speakers.

For our final corpus, FEP9, we extracted all those turns from the latest CoStEP version (1.0) that are available in all our primary languages (English, French, German, Italian and Spanish; see also Section 1.1). We included translations in our secondary languages whenever they were available. The resulting corpus comprises 146 544 parallel texts in all primary languages. The remaining languages form two groups with regard to the quantity of parallel texts, as shown in Figure 4.1: one that covers more than three quarters of the primary languages’ texts (Dutch, Finnish, Greek, Portuguese and Swedish) and one that covers less than one third (Bulgarian, Estonian, Polish, Romanian, Slovak and Slovene).

Earlier versions of our corpus were extracted from earlier versions of CoStEP and contained fewer texts in fewer languages: For FEP3, we extracted only the parallel turns of the primary languages from CoStEP 0.9.0. The successor, FEP6, is based on CoStEP 0.9.4 and comprises fewer parallel texts since, for that version, we required the primary languages’ texts to be available in Finnish as well. In addition to Finnish, Polish translations were included whenever available.
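A minimal sketch of this identification step, using the fuzzy string matching from Python's standard library; the field names, the similarity measure and the threshold are assumptions made for illustration.

    import difflib

    def identify_speaker(name, sitting_date, members):
        """Match a turn's speaker name against the member list.

        members: dicts with keys 'name', 'mandate_start' and
        'mandate_end' (illustrative field names). A member only
        qualifies if the sitting date falls within her mandate.
        """
        best, best_score = None, 0.0
        for member in members:
            if not (member['mandate_start'] <= sitting_date
                    <= member['mandate_end']):
                continue
            score = difflib.SequenceMatcher(
                None, name.lower(), member['name'].lower()).ratio()
            if score > best_score:
                best, best_score = member, score
        # Require a close match to avoid attaching wrong metadata;
        # the threshold here is arbitrary.
        return best if best_score >= 0.9 else None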

4.2 Sentence Alignment

With aligned texts as a basis, the next step is to align corresponding sentences. If a translation closely resembles the original text, chances are that sentence boundaries coincide and thus the first sentence of one language corresponds to the first

5We added the member’s forename, surname, the country she was representing and her po- litical group in the parliament. In Europarl, and consequently in CoStEP, only one attribute ‘name’ was defined, which would refer to either a speaker’s full name or her surname. 66 4.2. SENTENCE ALIGNMENT

Figure 4.1 – Number of turns retrieved from CoStEP in the respective languages in FEP9 (de, en, es, fr, it, nl, pt, sv, fi, el, et, sk, pl, sl, ro, bg; y-axis in units of 10⁵ turns). All turns are, by our design, available in English, French, German, Italian and Spanish. For space reasons, we use ISO 639-1 language codes.

sentence of the other language, and so forth. In this case of a series of one-to-one correspondences, depicted in Table 4.1, sentence alignment becomes a straightforward task.

Low error rates on sentence alignment already reported by early approaches (Brown, Lai et al. 1991; Gale and Church 1991) are best explained by the assumption that original sentence boundaries have been predominantly maintained by translators. The less frequent cases of one-to-many and many-to-many alignments are thus the ones that account for most of the errors observed. Gale and Church (1991, p. 85) report an error rate four times higher for two-to-one (2:1) than for one-to-one (1:1) alignments. They presume that the "low error rate is due to the high frequency of 1-1 alignments."6 In their manually aligned test set of economic reports in English, French and German, one-to-one alignments account for 89% of the total AUs, which is exactly the same proportion Ma (2006) found for Chinese-English AUs in their test set of United Nations' documents.

Apart from one-to-two alignments, that is, one sentence in one language corresponds to two sentences in another, a variety of other correspondence configurations exists. Their proportion among all AUs of a particular text depends first of all on how literally the text has been translated in terms of segmentation, that is, how much the translation's segmentation sticks to the original one.

6 Kay and Röscheisen (1993) attribute "near literalness" to the Canadian Hansard, an English/French parallel corpus containing the Canadian parliamentary debates, that has been used for evaluation by both approaches.

Table 4.1 – Sequence of five sentences in English, German and Spanish with concordant sentence boundaries. All texts are translations from French.

1 en: Of course, I have said it often before, I am no lover of capitalism.
  de: Selbstredend bin ich, wie schon häufig gesagt, kein Freund des Kapitalismus.
  es: Aunque por supuesto, como ya he dicho en otras muchas ocasiones, no soy un seguidor del capitalismo.
2 en: Capitalism is not an object of my affection, it is simply a means to an end.
  de: Der Kapitalismus hat nicht meine Sympathie, er ist lediglich Mittel zum Zweck.
  es: No es una de mis predilecciones, es simplemente un medio para conseguir un fin.
3 en: In any case, I do often like to distinguish between capitalism and liberalism.
  de: Auf jeden Fall pflege ich oft zwischen Kapitalismus und Liberalismus zu unterscheiden.
  es: En cualquier caso, a menudo me gusta hacer una diferencia entre el capitalismo y el liberalismo.
4 en: Clearly, my socialist friends are keen to combine these, yet the two things are not the same.
  de: Meine sozialistischen Freunde werfen natürlich gerne beide zusammen, sie sind aber nicht das Gleiche.
  es: Está claro que mis amigos socialistas tienden a combinarlos, pero se trata de dos cosas distintas.
5 en: Even I have to say it.
  de: Das möchte ich doch einmal klarstellen.
  es: Aunque tenga que decirlo.

Schirra (2003) summarizes Baker’s (1996) translation hypothesis of simplification saying that “translators often break up long and complex sentences into two or more sentences in their translations in an effort to make the texts easier to read.” If this hypothesis holds for translations between any pair of languages, we can expect a certain number of cases where two or more translations agree in segmentation having been simplified the same way. Table 4.2 shows a correspondence configuration contrary to the one in Table 4.1. We can spot a one-to-three (1:3) alignment between English and German: Three individual English sentences correspond to two main clauses in German, joined by a dash, and a subordinate clause. The language pair Spanish-English shows a one-to- two (1:2) alignment: Two sentences in Spanish are connected by a conjunction (‘y’). Finally, when we look at German and Spanish, we see a complex correspondence between two German and three Spanish sentences, a two-to-three (2:3) alignment. In general, we may find untranslated sentences in parallel corpora. These so- called null alignments occur, among other reasons, when the translator decides to 68 4.2. SENTENCE ALIGNMENT

Table 4.2 – Sentences in English, German and Spanish that require one-to-many (English/German and English/Spanish) and many-to-many alignment (German/Spanish). All texts are direct translations from French.

1 en: I hear MEPs who, I think, still believe in the effectiveness, honour and values of Europe, as well as feeling a certain pride in being European.
  de (spans en 1–3): Europaabgeordnete, die meiner Meinung nach doch Grundsätze wie Effizienz und Ehre sowie die Wertvorstellungen Europas hochhalten und einen gewissen Stolz empfinden, Europäer zu sein – diese Abgeordneten höre ich ständig lamentieren und ein Sündenbekenntnis ablegen, dass an alledem im Grunde Europa schuld sei.
  es: He escuchado las intervenciones de diputados al PE que, desde mi punto de vista, aún creen en la eficacia, el honor y los valores de Europa y que además sienten cierto orgullo de ser europeos.
2 en: I hear them constantly complaining and apologising.
  es: Les he oído quejarse y pedir disculpas de un modo constante.
3 en: Basically this is all meant to be Europe's fault.
  es (spans en 3–4): Todo esto significa esencialmente que es culpa de Europa y no puedo aceptarlo.
4 en: I do not accept that.
  de: Dem stimme ich nicht zu.

skip or add a sentence during translation, under the assumption that the target language audience does not require a given piece of information or needs additional information to properly understand the text. Both phenomena may apply to, for example, newspaper articles; they typically do not appear in the debates of the European Parliament. However, translators frequently add additional information, as shown in Table 4.3.

4.2.1 Approaches

Both approaches mentioned above (Brown, Lai et al. 1991; Gale and Church 1991) take advantage of the observation that the lengths of original and translated sentences correlate, that is, shorter sentences are translated with shorter sentences and longer ones with longer ones. Provided that a translation's purpose is to transmit the same information as the original text in a different language, this observation is underpinned by information theory (Shannon 1948). A notable exception are expressions that the translator chooses not to translate, but to accompany by a literal translation or description. This is typically the case for acronyms, as depicted in Table 4.3. In those cases, the translation comprises more information than the original. In the opposite case that some (minor) information is not transferred from source to target language,7 the translation becomes relatively shorter than the original.8 Both deviations have a negative impact on solely length-based sentence alignment approaches.

Table 4.3 – A description has been added to the English acronym ‘PNR’ for the German and Spanish translations. The original language is English.

1 en: We are currently working on a PNR package.
  de: Wir arbeiten derzeit an einem Fluggastdatensatzpaket (Passenger Name Record, PNR).
  es: En estos momentos, estamos trabajando sobre el paquete de registro de nombres de los pasajeros (PNR).

Sentence alignment algorithms for parallel texts assume strict monotonicity,9 which means that the information conveyed follows the same order in the original and the translated text. This restriction, together with the principle of correlated lengths, suffices to attain a good overall alignment of parallel texts with automatic methods (more than 95% correct AUs).

While the aforementioned alignment methods entirely rely on length information,10 subsequent approaches also include lexical information. S. F. Chen's (1993) alignment model is based on a translation model that is learned from parallel data in a bootstrap approach. It starts by training a first model on 100 manually aligned sentence pairs and subsequently using it to align 20 000 new sentence pairs from the parallel corpus that is to be sentence-aligned. These alignments are, in turn, the basis for the training of a second model, which is used to align the whole corpus (including the previously aligned 20 000 pairs). He reports an improvement over previous, length-based methods.

Simard, Foster et al. (1993) introduce cognates (i.e., similar word forms in both languages) as a source for alignment decisions and combine them with Gale and Church's (1991) length-based approach. Cognates work best for related languages that share parts of their vocabulary (see also Tiedemann 2011, Section 4.2). In the Europarl corpus, proper names and technical terms (e.g., 'CO2') are frequent cognates.
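The interplay of strict monotonicity and correlated lengths is what makes dynamic programming applicable here. The following is a minimal sketch of a length-based aligner in the spirit of these early approaches; the cost function (relative length mismatch plus a flat surcharge for beads other than 1:1, including null beads) is a simplification of ours, not Gale and Church's (1991) actual probabilistic measure:

```python
def align_lengths(src_lens, tgt_lens):
    """Monotonic 1:1/1:2/2:1/0:1/1:0 sentence alignment by character
    lengths, found with dynamic programming over bead types."""
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1), (1, 0), (0, 1)):
                if i + di > n or j + dj > m:
                    continue
                ls, lt = sum(src_lens[i:i + di]), sum(tgt_lens[j:j + dj])
                # relative length mismatch plus surcharge for non-1:1 beads
                c = cost[i][j] + abs(ls - lt) / (ls + lt + 1)
                c += 0.0 if di == dj == 1 else 0.3
                if c < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = c
                    back[i + di][j + dj] = (di, dj)
    aus, i, j = [], n, m            # trace back the optimal path of AUs
    while i or j:
        di, dj = back[i][j]
        aus.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return list(reversed(aus))
```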

7 Krynicki (2012) gives as motivation "[the] translator's decision not to render source-text material judged to be redundant or untranslatable."
8 Relative length differences between languages are shown in Figure 4.6 in Section 4.3.1.
9 For alignment algorithms on comparable corpora, which need not comply with this restriction, see (Plamada and Volk 2013), for instance.
10 Brown, Lai et al. (1991) count tokens, whereas Gale and Church (1991) count characters.

Kay and Röscheisen (1993) contrast their approach with previously published length-based methods. As regards length, they only resort to the position of each sentence in both languages relative to the lengths of the respective texts. If the relative positions of a pair of sentences are close enough (the limit is the square root of the number of sentences available in the other language's text), this pair is considered "alignable" and consequently searched for words (they actually mean types, that is, surface forms of words) with a similar distribution. Those words are considered aligned if "the distributions of the words in their texts are sufficiently similar and if the total number of occurrences indicates that this pair is unlikely to be the result of a spurious match." Built on the obtained list of "word alignments", the sentence AS is inferred. The best AUs from that AS are then used as anchors and the process is repeated. In so doing, the coverage of the text that is aligned increases iteratively.

Later works (Moore 2002; Varga et al. 2005) employ a two-step approach: In a first step, sentence alignment is performed using the established length-based method. The AUs obtained in this way are then used to extract statistical lexical information which, in turn, serves as a basis for the second alignment process. Moore's solution is to first apply a length-based alignment method similar to (Brown, Lai et al. 1991) and to generate a simplified type 1 IBM word alignment model (see Koehn 2010, pp. 86–97) from the best resulting AUs. He then performs alignment with the generated model, limited for performance reasons to those AU candidates that got a "nonnegligible" alignment score in the first run.

The aligner hunalign, published by Varga et al. (2005), uses a similar strategy. In an optional preparatory step, a dictionary is learned using a bootstrap approach: First, hunalign performs an initial sentence alignment based on identical word forms in both languages. Second, it derives the dictionary from frequently cooccurring word forms in high-scoring one-to-one sentence AUs. The dictionary bootstrapping is skipped if an external dictionary is provided by the user.11

In the actual alignment step, this dictionary and length information are employed together to align the parallel texts provided. The dictionary is used to "translate" the source language text into the target language.12 Here, translation refers to the transformation of each token's word form into a word form of the target language by means of a dictionary lookup. If the word form in question is found in the dictionary multiple times, the target language word form with most occurrences in the target text is chosen. Word forms that are not found remain untranslated. The number of identical word forms in a translated source sentence and a target sentence divided by the number of tokens of the longer sentence is hunalign's

measure for lexical similarity.13 This similarity score and a character-based length ratio are calculated for all sentence pairs in a wide beam around the diagonal of the source and target language matrix,14 determined by relative positions. Dynamic programming is then used to determine the "optimal alignment trail" of one-to-one and one-to-two (including two-to-one) AUs. Remaining zero-to-one alignments are subsequently merged with existing AUs of adjacent sentences if that merge improves the length-based ratio in comparison to the original AU.

Braune and Fraser's (2010) aligner Gargantua also implements two alignment steps. Similar to (Moore 2002), it learns an IBM alignment model 1 in a first, length-based alignment step. In contrast to Moore's and all other approaches published so far, their second alignment step first performs a limited alignment that only allows each sentence in both languages to be aligned to one or zero sentences in the other language, followed by a clustering of the one-to-one and null alignments generated in this manner. They motivate their approach with targeting "asymmetric parallel corpora", that is, parallel corpora that show a large quantity of null alignments. While Gale and Church (1991) allow a maximum of two sentences per AU for each language, Braune and Fraser's approach is limited to identifying AUs with a single sentence in one language, that is, it can only identify one-to-one or one-to-many alignments.

bleualign (Sennrich and Volk 2010) is a sentence alignment approach that requires one of the two texts to be translated into the other text's language. Unlike hunalign's lexical lookup and replacement approach, bleualign makes use of a trained machine translation system. Once the translation has been generated, alignment is performed on two texts (the translation plus the other, untranslated text) of the same language using the BLEU metric (Papineni et al. 2002), which has been designed to measure the quality of machine translation systems. For a pair of unrelated sentences, the BLEU score is typically 0, which limits the number of potential alignments considerably compared to the number of possible combinations of source and target language sentences. The bleualign approach is thus particularly useful for texts with null alignments.

11 It is also possible to save the derived dictionary for later use when running hunalign.
12 Source and target language are determined by the user.
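As a concrete reading of hunalign's lexical similarity measure described above, consider the following sketch; the multiset intersection over whitespace tokens is our interpretation, and hunalign's actual implementation may differ in details:

```python
from collections import Counter

def lexical_similarity(translated_source: str, target: str) -> float:
    """Number of identical word forms in a (dictionary-)translated source
    sentence and a target sentence, divided by the token count of the
    longer sentence."""
    src, tgt = translated_source.split(), target.split()
    shared = sum((Counter(src) & Counter(tgt)).values())
    return shared / max(len(src), len(tgt))
```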

Multilingual Sentence Alignment

The aforementioned methods deal with the problem of aligning parallel texts in two languages. The availability of parallel corpora in more than two languages led Simard (1999) to ask: “Do they make new applications possible? Can methods

developed for handling bilingual texts be applied to multilingual texts? More generally: is there anything to gain in viewing multilingual documents as more than just multiple pairs of translations?"

13 Many matching numbers additionally increase the lexical similarity score.
14 Since "at least a 500-sentence neighborhood is calculated or all sentences closer than 10% of the longer text", this corresponds to an exhaustive comparison for texts with less than 500 sentences in one of the languages.

Figure 4.2 – Alignment units (AUs) depicted as hyperrectangles (here: cubes) in a discrete vector space (here: cuboids in a three-dimensional space); the axes correspond to the English, German and Spanish sentences. Each AU is described by a vector and the sequence of all AUs spans the entire space.

He formalizes multilingual alignment of $n$ languages as the search for segmentation points in an $n$-dimensional vector space, where the discrete values of the axes correspond to sentence boundaries (i.e., potential alignment boundaries) in the respective languages. The alignment algorithm's task is "finding an optimal path in a rectangular matrix" (Simard 1999) and the sum of all segments represented by the vectors spanning them thus equals the correspondence of all parallel texts. For the simple case of strictly parallel sentences in Table 4.1 (depicted in Figure 4.2), we formalize the one-to-one-to-one alignments as a list of vectors

$$A = \left[(1,1,1)^T, (1,1,1)^T, (1,1,1)^T, (1,1,1)^T, (1,1,1)^T\right]$$

which add up to the overall parallel texts with five sentences per language:

$$\sum_{\vec{a}\in A} \vec{a} = \begin{pmatrix} 5 \\ 5 \\ 5 \end{pmatrix}$$

The example in Table 4.2 can only be represented as one single alignment unit, namely $(4,3,2)^T$, due to missing coinciding intermediate alignment boundaries (depicted in Figure 4.3).
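This vector formalization is easy to operationalize; a small sketch using the language order (en, es, de) of the vectors above:

```python
# AUs as per-language sentence-count vectors (en, es, de).
parallel_aus = [(1, 1, 1)] * 5   # Table 4.1: five 1:1:1 units
complex_au = [(4, 3, 2)]         # Table 4.2: one all-embracing unit

def span(aus):
    """Component-wise sum of AU vectors; it must equal the per-language
    sentence counts of the parallel texts."""
    return tuple(map(sum, zip(*aus)))

assert span(parallel_aus) == (5, 5, 5)
assert span(complex_au) == (4, 3, 2)
```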

Figure 4.3 – Two-dimensional AUs for all three language pairs (axes: English, German, Spanish). The only possible AU for all three languages comprises the entire space and thus aligns all sequences of sentences.

Simard’s (1999) approach to multilingual sentence alignment is to use bilingual alignments to construct the resulting multilingual ones. To this end, he rates all language pairs by the global alignment probability that a bilingual alignment method yields for the respective pair. As he exemplifies his approach with three languages (named A, B and C), there are three pairs of languages (AB, AC and BC), on which the bilingual alignment is performed. The alignment set of the best scoring language pair (in his examples AB) is adopted and alignment is performed between this set and the sentences of the third language (C) using both remaining bilingual sets (the alignment sets of AC and BC). The problems arising with non-conforming texts like the one shown in Table 4.2 are avoided by merging one-to-many and many-to-many alignments into single units after the first alignment. In that case, assuming that English and Spanish is the most similar language pair, the first two English and Spanish sentences and the third and fourth English sentence together with the third Spanish one will generate three joint sentences in the English+Spanish dimension:

$$A_{en+es} = \left[\begin{pmatrix}1\\1\end{pmatrix}, \begin{pmatrix}1\\1\end{pmatrix}, \begin{pmatrix}2\\1\end{pmatrix}\right]$$

The only way to combine the German sentences with these 'hyper sentences' (i.e., AUs) is a three-to-two alignment, that is, the three English+Spanish AUs with two German sentences. This yields a single all-embracing AU for the three languages:

$$A_{en+es+de} = \left[\begin{pmatrix}4\\3\\2\end{pmatrix}\right]$$

In cases like this, when a resulting trilingual AU comprises at least two sentences in at least two languages, Simard (1999) performs a bilingual sub-alignment of the AU in question, for which he reports a small performance improvement in terms of F-score gain, measured on manually produced reference alignments. In comparison to the raw bilingual alignment, he reports significantly better alignment results for his trilingual alignment method and concludes that aligning three languages "yields better bilingual alignments than can be obtained with bilingual text-alignment methods." We considered implementing a similar approach, but aligning 16 languages in the same way entails that AUs potentially grow larger with each subsequent combination of bilingual alignments, which consequently renders the task of sub-alignment ("the more challenging problem of finer-grained alignments" (ibid.)) more prominent. In the worst case, the resulting unified multilingual alignment set (i.e., the transitive closure of all unified bilingual alignments) equals the entire sequences of sentences in a parallel text.

4.2.2 Evaluating Sentence Alignment

All published approaches (for an overview, see Tiedemann 2003b, Section 2.2; A. K. Singh and Husain 2005; Costa-jussà and Banchs 2011; Tiedemann 2011; Torres-Ramos and Garay-Quezada 2015) have one thing in common: They claim to be better than previous approaches, at least for particular use cases. Unfortunately, evaluation methods and data differ and, hence, it is inappropriate to compare raw values reported in those publications. For a meaningful comparison of different aligners, the same set of evaluation texts needs to be aligned by these aligners and compared with the manually created gold alignment of the texts, using the same evaluation metric. Given that the quantity of texts is sufficient to attain significantly accurate values, the results depend, inter alia, on the text type, the evaluation metric and, if applicable, external language-specific resources provided to the aligners.

Regarding text type, two properties play a role: the quantity of null alignments and the distribution of alignment types, that is, the numbers of corresponding sentences for each language in an AU.15 Alignment approaches designed to deal with large amounts of non-translated sentences, such as (Braune and Fraser 2010), will excel when it comes to identifying null alignments. Alignment units that contain more than two sentences in one language are difficult to identify for most of the aforementioned approaches, but some of them are limited to particular numbers

by design: Gale and Church (1991) allow at most two sentences in each language, and Braune and Fraser's (2010) approach requires AUs to comprise exactly one sentence in one of the languages.

Despite the difficulties described above in comparing sentence alignment methods, several comparisons have been conducted, each with one or more particular use cases in mind. Existing implementations of those alignment methods have typically been used when available; other methods have been reimplemented by the respective authors.

A. K. Singh and Husain (2005) evaluate four alignment methods for the language pair English/Hindi. They only evaluate one-to-one alignments for "practical constraints" and "because 1-to-1 alignments are the ones that can be most easily and directly used for linguistic analysis as well as machine learning." From our point of view, the latter argument remains disputable if not further qualified. Abdul-Rauf et al. (2012) evaluate five different alignment methods with regard to the results that the aligned sentences achieve when used for training statistical machine translation systems. Krynicki (2012) evaluates four alignment methods, which are also covered by the evaluation of Abdul-Rauf et al. (2012), on English/Polish parallel texts. Unlike the previous two evaluations, he discusses the pros and cons of different approaches and gives recommendations for the respective aligners in view of different applications, such as terminology extraction or lexicography.

Torres-Ramos and Garay-Quezada (2015) discuss the ideas behind and properties of several sentence alignment methods, some of them overlapping in concepts. They recapitulate the results reported in the respective publications and come to the conclusion that "[t]here is not one statistical-based approach that works for all kind of languages in the scope of parallel corpus alignment" and that "there is much to improve" with respect to external resource usage. In this regard they point to joint approaches using both "lexical and statistical information".

15 Krynicki (2012) uses the term structural fidelity to denominate the ratio of one-to-one alignments among all gold alignments. Yu et al. (2012) point out that literary texts are typically not translated sentence-wise, which leads to a higher amount of one-to-many and many-to-many alignments in literature.

4.3 Multilingual Sentence Alignment

The example of Table 4.2 demonstrates that the task of multilingual sentence alignment is not trivial.16 Even though we can achieve a low error rate by aligning most of the one-to-one correspondences correctly, the one-to-many and many-to-many alignments pose a challenge to any sentence alignment algorithm.

When it comes to identifying corresponding sentences in more than two languages, things get even more complex. A text with strictly parallel sentences like the one shown in Table 4.1 and Figure 4.2, that is, only one-to-one or rather one-to-one-to-one (1:1:1) alignments, represents the trivial case, where we only need

to include one more sentence in another language. This AS could be obtained by performing pairwise sentence alignment and then combining those bilingual ASs, which corresponds to the method suggested by Simard (1999). If we choose the language pair German/English for the initial AS in Table 4.1, the ASs of both German/Spanish and English/Spanish agree on the final trilingual AS.

Table 4.2 and Figure 4.3 show an example where this naïve method for multilingual alignment will fail. All three languages have only beginning and end in common but no boundary of AUs in between. Combining two pairwise alignments (no matter which ones) would result in a single AU containing all the sentences shown, as the respective language pairs do not agree on any inner alignment boundary. In general, when we add parallel texts in other languages, the chance of disagreement increases and the AUs become potentially bigger with each additional language.

In addition to the AUs becoming unhelpful when they grow too big, another problem of combining pairwise AUs to multilingual alignments is consistency. It is not guaranteed that combining the ASs of, for instance, German with English and subsequently English with Spanish will result in the same set obtained by aligning German with Spanish (as is the case in Table 4.1). Again, the more languages and thus language pairs involved, the higher the probability of disagreement between the respective ASs. An alignment approach based on pairwise alignment thus needs a way to handle disagreements in order to generate consistent multilingual alignments. Simard (ibid.) solves this problem for trilingual alignment by summing up bilingual alignment scores of all sentence pairs that belong to each potential trilingual AU.

To benefit from those presumably big but consistent multilingual alignments, we need to also indicate contained AUs of less than the total number of languages that the former includes.17 In the example in Table 4.2, we would want our multilingual AU, which comprises the whole list of sentences in all three languages, to contain five smaller AUs.18 The data structure required for holding those AUs requires partially overlapping AUs (take, for instance, the first English sentence), which would make those multi-level multilingual alignments difficult to handle. In order to limit complexity, we resort to hierarchical alignments, which require that for any two AUs that have at least one element (here: a sentence) in common, one is a proper subset of the other (see Graën and Clematide 2015).

Hierarchical alignments are, from a formal point of view, tree structures such that the leaf nodes are AUs, which comprise a set of sentences (in general: elements), and branch nodes include a set of AUs, thus (indirectly) also comprising their elements. The topmost node, the root node, which can also be a leaf node in case there is no superordinate AU, is the minimal AU necessary to cover all particular languages. It can correspond to the entire text alignment if the respective languages do not agree on an intermediate alignment boundary. If this is not the case, the AS of a text is a forest of hierarchical AUs. We show an example in Table 4.4.19

16 Simard and Plamondon (1998) show that bilingual sentence alignment is not trivial either.
17 Simard (1999) refers to it as "the more challenging problem of finer-grained alignments" where he expects to "encounter numerous complications."
18 Namely, for English/Spanish the sentence pairs [(1), (1)], [(2), (2)] and [(3,4), (3–4)] and for English/German the sentence pairs [(1,2,3), (1–3)] and [(4), (4)].

Table 4.4 – Excerpt from a gold-aligned text in 16 languages (six of the languages are shown: bg, de, el, en, es, ro).

AU 1:
bg: Тук бизнесът не играе изместваща роля, а допълваща, и решаващият момент е, че изследванията ще останат свободни, точно както преподаването.
de: Dabei hat die Wirtschaft keine übernehmende, sondern eine ergänzende Funktion, und entscheidend ist, dass die Forschung frei bleibt und die Lehre ebenso.
el: Ο ρόλος των επιχειρήσεων εν προκειμένω δεν είναι εκτοπιστικός αλλά συμπληρωματικός, και το σημαντικό στοιχείο είναι ότι η έρευνα θα παραμείνει ελεύθερη όπως ακριβώς και η διδασκαλία.
en: Business does not have a supplanting role here, but a complementary one, and the crucial point is that research will remain free just as teaching will.
es: Las empresas no actúan como sustitutas aquí, sino que adoptan un papel complementario, y el punto fundamental es que la investigación se mantenga libre, al igual que la enseñanza.
ro: Mediul de afaceri nu are un rol supleant, ci unul complementar, iar aspectul esenţial este faptul că cercetarea va rămâne independentă, ca şi predarea.

AU 2 (within the superordinate AU 6):
en: It will make its own decision in this regard;
es: Tomará su propia decisión a este respecto;
ro: Va lua propriile sale decizii în ceea ce o priveşte;

AU 3 (within the superordinate AU 6):
en: it will not be forced into this by politicians.
es: los políticos no la obligarán.
ro: politicienii nu o vor obliga nicicum în acest sens.

AU 4 (within the superordinate AU 6):
bg: В това отношение решенията ще се вземат от самите изследователи и преподаватели, а няма да бъдат налагани от политиците.
de: Sie entscheidet sich dafür, die Politik zwingt sie nicht.
el: Θα λαμβάνει τις δικές της αποφάσεις από την άποψη αυτή, χωρίς να της ασκούνται πιέσεις από τους πολιτικούς.

AU 5:
bg: Имаме нужда от печеливша за всички ситуация, което означава науката и образователните институции, от една страна, и изследователите и бизнесът, от друга, да кажат „да“ на това партньорство.
de: Wir brauchen eine Win-Win-Situation, in der die Organe der Wissenschaft und Lehre einerseits und der Forschung und Wirtschaft anderseits Ja zu dieser Partnerschaft sagen.
el: Χρειαζόμαστε ένα εξασφαλισμένο αποτέλεσμα, σύμφωνα με το οποίο δηλαδή τα επιστημονικά και εκπαιδευτικά ιδρύματα, αφενός, και η έρευνα και οι επιχειρήσεις, αφετέρου, θα πουν «ναι» σε αυτήν την εταιρική σχέση.
en: We need a win-win situation, that is, one in which science and teaching institutions, on the one hand, and research and business, on the other, will say 'yes' to this partnership.
es: Necesitamos una situación beneficiosa para todos, es decir, una en la que la ciencia y las instituciones de enseñanza, por una parte, y la investigación y la empresa, por la otra, digan «sí» a esta asociación.
ro: Avem nevoie de o situaţie în care toată lumea să câştige, adică o situaţie în care ştiinţa şi instituţiile de educaţie, pe de-o parte, şi cercetarea şi mediul de afaceri, de cealaltă parte, să spună „da” acestui parteneriat.

We model hierarchical AUs as sets such that a superordinate AU comprises all elements of a subordinate AU. The condition that two AUs, $A_1$ and $A_2$, from the same alignment set $\mathcal{A}$ need to be distinct with regard to their elements if one does not comprise the other is expressed by the following equation:

$$\forall (A_1, A_2) \in \mathcal{A} \times \mathcal{A} : A_1 \subset A_2 \lor A_1 \supset A_2 \lor A_1 \cap A_2 = \emptyset \quad (4.1)$$

While the AUs in a hierarchical AS are sets of sentences, we additionally require the sentences of each AU to be in monotonic order, thus not allowing for "crossing" AUs or intermediate sentences to belong to a different AU. Unlike sentence alignment for "asymmetric parallel corpora" (see Braune and Fraser 2010), the translators of the European Parliament's debates typically do not omit whole sentences (omitted or additional words, such as in Table 4.3, can be found, though). The following equation specifies the requirement that if one sentence ($s_1$) precedes (<) another one ($s_2$), the same must hold for all sentences in the respective AUs $A_1$ and $A_2$:

$$\forall (A_1, A_2) \in \mathcal{A} \times \mathcal{A} : \forall (s_1, s_2) \in A_1 \times A_2 : s_1 < s_2 \implies \forall (s_x, s_y) \in A_1 \times A_2 : s_x < s_y \quad (4.2)$$

Having a hierarchical alignment set (hierarchical AS), we can obtain the relevant AUs for any subset of the languages comprised by the AS – the minimal alignment set for those languages – by removing the elements of all other languages from the AUs and subsequently removing any resulting duplicate AUs. In cases like the one shown in Table 4.2, where sentences overlap without being contained by one another, we can represent either the smaller AUs of English/German or those of English/Spanish in a hierarchical AS. Such a constellation is rather infrequent, as we will see in Section 4.3.2. The more frequent case that we find in our corpus texts is that some languages use subordinate clauses while other languages prefer two single sentences. This is depicted in Table 4.5.

19 For our multilingual sentence alignment approach described in Section 4.3.1, we limit the number of hierarchical levels to two, that is, we only allow for one level of branch nodes (which are also root nodes) on top of the leaf nodes. This limitation is motivated by our experience with manual sentence alignment; a third level of AUs could have been utilized only in a very limited number of cases.
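Equations 4.1 and 4.2 translate directly into checks over AU sets. A minimal sketch, modeling sentences as (language, position) pairs; the helper names are ours. The example reproduces the constellation of Table 4.2: the English/Spanish and English/German sub-AUs overlap without containment, so a hierarchical AS can hold only one of the two sets:

```python
def is_hierarchical(alignment_set):
    """Equation 4.1: any two distinct AUs (sets of sentences) must be
    nested or disjoint."""
    return all(a1 < a2 or a1 > a2 or not a1 & a2
               for a1 in alignment_set for a2 in alignment_set
               if a1 is not a2)

def minimal_as(alignment_set, languages):
    """Minimal alignment set for a language subset: remove the other
    languages' sentences, then drop empty and duplicate AUs."""
    seen, result = set(), []
    for au in alignment_set:
        reduced = frozenset(s for s in au if s[0] in languages)
        if reduced and reduced not in seen:
            seen.add(reduced)
            result.append(reduced)
    return result

# Table 4.2: four English, three Spanish and two German sentences.
root = frozenset({("en", 1), ("en", 2), ("en", 3), ("en", 4),
                  ("es", 1), ("es", 2), ("es", 3),
                  ("de", 1), ("de", 2)})
en_es = [frozenset({("en", 1), ("es", 1)}),
         frozenset({("en", 2), ("es", 2)}),
         frozenset({("en", 3), ("en", 4), ("es", 3)})]
en_de = [frozenset({("en", 1), ("en", 2), ("en", 3), ("de", 1)}),
         frozenset({("en", 4), ("de", 2)})]

assert is_hierarchical([root] + en_es)               # one choice is fine
assert not is_hierarchical([root] + en_es + en_de)   # both together violate Eq. 4.1
```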

Table 4.5 – Two sentences versus one sentence with a subordinate clause. The original language is French in both examples.

1 en: You see before you a Parliament of elected representatives who, each time they meet their constituents, have to justify the collective impotence of the Member States and of the Union when it comes to unemployment,
  de: Sie sehen ein Parlament Volksvertreter vor sich, die sich bei jeder Begegnung mit ihren Wählern für die allgemeine Unfähigkeit unserer Staaten und der Union, die Arbeitslosigkeit zu bekämpfen, rechtfertigen müssen.
  es: Tiene usted ante sí un Parlamento de representantes que, siempre que se reúnen con sus electores, deben justificar la impotencia colectiva tanto de nuestros Estados como de la Unión en materia de desempleo.
2 en: which is becoming more and more of a scourge.
  de: Mit dieser Plage wird es immer schlimmer.
  es: La gravedad de ese flagelo aumenta constantemente.

1 en: If shipbuilding is necessary for Europe's economy, it must be brought under state control.
2 en: The answer is not to subsidise its private owners with money that we will never see again.
  de (spans en 1–2): Wenn der Schiffbau ein wichtiger Bestandteil der europäischen Wirtschaft ist, dann muss er verstaatlicht werden, anstatt die Privateigentümer mit nichtrückzahlbaren Subventionen zu überhäufen.
  es (spans en 1–2): Si la construcción naval es necesaria para la economía europea, hay que nacionalizarla, en lugar de subvencionar a fondo perdido a sus propietarios privados.

Apart from the objective of identifying multilingual AUs for the sake of investigating linguistic phenomena by comparison of more than two languages, we also expect that supplemental evidence in the form of additional languages will increase the accuracy of bilingual alignments, which correspond to the minimal AS for a given language pair. Triangulation approaches, that is, the use of a third language to deduce or improve relations for a given language pair, have successfully been applied to word alignment (see, for instance, Cohn and Lapata 2007).

4.3.1 Our Approach to Multilingual Sentence Alignment

Our alignment algorithm proceeds in four steps. First, an undirected graph is built with the respective sentences as nodes and alignment scores as edges. For every two nodes from distinct languages and each feature (see below), we calculate the alignment score – based on partial values for different alignment features – and, if appropriate (i.e., the alignment score is above a given threshold value), establish an edge between these two nodes. A pair of unconnected nodes is defined to have an alignment score of 0.

The second step performs single-linkage hierarchical agglomerative clustering on the derived graph such that every resulting cluster comprises at most one sentence of each language. In the following third step, incomplete clusters, that is, those that do not possess sentences in all languages, are clustered with neighboring complete clusters.20

The last step is the conversion of the resulting cluster structure into multilingual alignments, which – given the assumption that a sentence in one language needs to have translations in all other languages – entails the separation of those sentences from the complete cluster into a separate one that also forms part of any second-level cluster.
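A sketch of the first step; the Sentence type, the default threshold and the feature/weight interfaces below are our assumptions, not part of the implementation described here:

```python
from itertools import combinations
from typing import NamedTuple

class Sentence(NamedTuple):
    lang: str
    index: int
    text: str

def build_graph(sentences, features, weights, threshold=0.5):
    """Step one: an undirected graph over all sentences; an edge connects
    two sentences of distinct languages if their combined alignment score
    exceeds the threshold (unconnected pairs implicitly score 0)."""
    edges = {}
    for s1, s2 in combinations(sentences, 2):
        if s1.lang == s2.lang:
            continue  # alignment scores are only defined across languages
        score = sum(w * phi(s1, s2) for phi, w in zip(features, weights))
        if score > threshold:
            edges[frozenset((s1, s2))] = score
    return edges
```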

Features

Some bilingual features that we use for multilingual sentence alignment need parallel sentences to learn from. We obtain appropriate sentence pairs for each pair of languages using a bootstrapping approach: First, we find all parallel texts of a given language pair that have the same number of sentences according to our sentence segmentation. Second, we calculate the token ratio of all texts of that language pair and rank them according to their relative deviation from the overall token ratio for that pair.21 From this ranking, we only use parallel texts with a token ratio similar to the expected one, that is, the ratio we get from the absolute numbers of tokens available in a subcorpus of texts in these two languages. The intersection of both lists is our selection of superficially well-fitting parallel texts, which we expect to hold a straight one-to-one correspondence of sentences.

The number of – supposedly – parallel sentences in the well-fitting parallel texts ranges from 27 223 for Bulgarian/Polish (4664 texts) to 283 189 for English/Slovene (31 262 texts). These texts cover 14.1% and 71.4%, respectively, of the parallel texts available for both pairs in total. The percentage of well-fitting parallel texts per language pair depends on the number of texts available in each language – which is not necessarily equal to, but approximately, the lower number of texts available in both languages (see Figure 4.1) – and the fit calculated above.

We use a set of twelve features of the form $\phi_y(s^1, s^2)$ for constructing edges

(alignment indicators) between nodes (sentences) of different languages ($s^1$ and $s^2$).

20 These two steps are similar to what Braune and Fraser (2010) do, but with more than two languages.
21 We use the token ratio and not the character ratio, which Gale and Church (1991) found to give better results for sentence alignment, as we intend to retrieve lexical information from those sentence pairs.

Each feature is allowed to take numeric values from the interval $[0, 1]$, indicating a gradient degree of evidence for correspondence. If a feature's formula returns values outside this range, they are mapped to 0 (negative values) or 1 (positive values). The respective features target general lexical correspondence ($\phi_{pt}$ and $\phi_{pl}$), matching numbers and acronyms ($\phi_{no}$ and $\phi_{ac}$), discourse markers ($\phi_{ft}$ and $\phi_{fl}$), punctuation ($\phi_{ia}$ and $\phi_{iq}$), and positional and length information ($\phi_{l1}$, $\phi_{l2}$); additionally, we use alignment scores from bilingual sentence alignment of each language pair ($\phi_{et}$ and $\phi_{el}$). The following list elaborates on feature design and calculation:

• $\phi_{pt}$ and $\phi_{pl}$ are based on phrase table matches for word forms and lemmas, respectively. We extract those phrase tables for each language pair from the set of well-fitting parallel texts using anymalign (Lardilleux and Lepage 2009). Anymalign is designed to retrieve words and phrases from parallel sentences in multiple languages (we only use it for language pairs) by comparing distributions of word forms and sequences of word forms (see also page 115). Word forms without a corresponding lemma are excluded from the input sentences fed to anymalign. $\phi_{pt}$ (and likewise $\phi_{pl}$) for a pair of sentences $(s^1, s^2)$ is defined as

$$\phi_{pt}(s^1, s^2) = \sum_{(T^1, T^2)\, \in\, \mathcal{P}(s^1) \times \mathcal{P}(s^2)} \ln\left(1 + |T^1| \cdot |T^2| \cdot p_w(T^2 \mid T^1) \cdot p_w(T^1 \mid T^2)\right) \quad (4.3)$$

where $\mathcal{P}$ is the power set, that is, a set containing all possible subsets of tokens, and $p_w$ is the lexical weight (of sequences of either word forms or lemmas) as defined by (ibid.).22 The lexical weight of any combination not comprised by the respective phrase tables (including noncontinuous combinations) defaults to 0; hence, those combinations do not contribute to the sum.

For the common phrases "Разискването приключи" and "συζήτηση έληξε" 'the debate is over' in the Bulgarian/Greek word form phrase tables, anymalign reports $p_w$ to be 0.6836 and 0.9548, respectively. As both phrases consist of two tokens each, the corresponding summand evaluates to 1.2839. In theory, this formula permits values higher than 1.0; in practice, we do predominantly see low values of $\phi_{pt}$ and $\phi_{pl}$.

22 Their notion of the lexical weights differs from Koehn, Och et al.'s (2003) original definition insofar as they use the maximal lexical translation probability between a word in one language and a sequence of words in the other, instead of averaging the respective lexical translation probabilities.

• $\phi_{ft}$ and $\phi_{fl}$ are based on matching first tokens. These features are similar to the phrase-based features $\phi_{pt}$ and $\phi_{pl}$, but restricted to the first token of each sentence. Our motivation in building an additional feature just for the first

token of each sentence was to capture potential discourse markers. Given that all parallel texts, original or translations, convey the same argument structure, we expect to frequently encounter corresponding discourse markers in the first position of parallel sentences. These features are also calculated on our sample of well-fitting parallel texts. $\phi_{ft}$ (and likewise $\phi_{fl}$) is defined as

$$\phi_{ft} = \frac{\min\left(p(s^2_1 \mid s^1_1),\ p(s^1_1 \mid s^2_1)\right)}{\sqrt{f(s^1_1)} \cdot \sqrt{f(s^2_1)}} \quad (4.4)$$

with $s^1_1$ and $s^2_1$ being the first tokens of the respective sentences and $f(t)$ the frequency of a particular word form or lemma in the text that is to be sentence-aligned. The feature value of initial tokens whose attributes (word forms or lemmas) are observed with higher frequencies in the text is reduced by the denominator.

In the Finnish/Polish list of initial word forms, 'Vaikka' and 'Chociaż' 'although/though' appear together 33 times. While 'Chociaż' is translated in all 33 cases with 'Vaikka', these 33 cases only make up 49.3% of the Finnish sentences that start with 'Vaikka'. For a parallel text where both words appear once, $\phi_{ft}$ thus equals 0.493. If the pair appears twice in the text, $\phi_{ft}$ drops to 0.246.

• $\phi_{ia}$ and $\phi_{iq}$ are based on matching punctuation marks. While $\phi_{ia}$ applies to any punctuation,23 $\phi_{iq}$ is only used for matching question marks. The motivation for introducing a separate feature for question marks is that they are typically used to indicate a question and questions are typically translated as questions,24 while periods and exclamation marks, for instance, are interchangeable to some extent.

To map the appearance of equal punctuation marks to a numeric feature, we take the relative frequencies of that mark with regard to the number of sentences for each particular language, multiply both results and project them, again, into the numeric interval $[0, 1]$, so that a punctuation mark that appears only once in both languages (e.g., a quotation mark) receives

23 Including combined punctuation marks as in "Zij willen niet?!" "They won't?!"
24 Except for Greek, where the semicolon is used instead (e.g., "Λειτουργεί σήμερα αποτελεσματικά η Ευρωπαϊκή Ένωση;" "Is the European Union functioning efficiently today?"). Question marks are used in Greek texts in Europarl, though, but predominantly in quotations: "Επιτρέψτε μου να ρωτήσω: Quousque tandem – Noiz arte?" "May I ask: Quousque tandem – Noiz arte?" In plenary debates up to January 1998, accented letters are frequently replaced with a question mark in the original Europarl corpus (e.g., 'Ευρ?πη' instead of 'Ευρώπη') (see also Section 2.3.1).

a higher value compared to one that appears frequently (e.g., a period). We use the negative logistic growth function nlg for mapping, as it allows for better discrimination between frequent and infrequent marks:25

$$\mathrm{nlg}(x) = 1 - \frac{1}{1 + e^{-10(x-0.5)}} \quad (4.5)$$

Figure 4.4 – Functions to map low values to high ones and vice versa (within the interval from 0 to 1). Unlike the linear transformation $1-x$ (blue), the negative logistic growth function $\mathrm{nlg}(x)$ (red) does not equally distribute input values from $[0, 1]$ to $[0, 1]$, but concentrates them towards both ends.

Figure 4.4 contrasts the function with a linear negative projection. The feature $\phi_{ia}$ (and likewise $\phi_{iq}$ for question marks only) is defined as

$$\phi_{ia} = \mathrm{nlg}\left(p(s^1_{|s^1|}) \cdot p(s^2_{|s^2|})\right) \quad (4.6)$$

with $p$ being the relative frequency of the last token's surface form in all sentences of the language in question.

For a short parallel text with ten sentences in each language, two of them ending with an exclamation mark, one with a question mark and seven with periods, we obtain 0.993 as result for the two sentences with question marks, 0.990 for each combination of the four sentences with exclamation marks and 0.198 for any combination of sentences with a regular period, thus favoring

25 Any other sigmoidal function should give comparable results. The linear function, however, performs worse.

the question and exclamation marks over the more frequent periods. An additional edge is established by $\phi_{iq}$, which only applies to sentences ending with question marks.

It is typically the period that is predominantly used in a text to end sentences; these sentences consequently receive $\phi_{ia}$ values close to 0. However, in speaker contributions where rhetorical questions prevail, it is, for instance, the infrequent period that receives a high $\phi_{ia}$ value, while the sentences ending with a question mark score considerably lower. The $\phi_{iq}$ values will be accordingly low in such a case.

• $\phi_{l1}$ and $\phi_{l2}$ are length-based features. While $\phi_{l1}$ implements a linear model, $\phi_{l2}$ makes use of the negative logistic growth function to penalize higher deviation in lengths (a code sketch follows the feature list).

We measure the length of a sentence in terms of characters of its tokens ($\mathrm{len}(s) = \sum_{t \in s} \mathrm{len}(t)$), disregarding any potential space characters between them. The relative cumulative length for the $n$th sentence in a text $x$ (i.e., $\delta_n(x)$) is the length of all preceding sentences plus the length of the $n$th sentence itself, divided by the total length of all sentences in that text:

$$\delta_n(x) = \frac{\sum_{i=1}^{n} \mathrm{len}(x_i)}{\sum_{i=1}^{|x|} \mathrm{len}(x_i)} \quad (4.7)$$

It thus describes the position of the right sentence boundary of the sentence in question in relation to the whole text. The left sentence boundary corresponds to the right boundary of the preceding sentence (i.e., $\delta_{n-1}$). By definition, the relative cumulative length of the first sentence's left boundary (i.e., $\delta_0$) always equals 0, and the relative cumulative length of the last sentence's right boundary (i.e., $\delta_{|x|}$) always equals 1.

The relative cumulative length difference (rcld) of two alignment boundaries $i$ and $j$ in texts $x^1$ and $x^2$ is an indicator of how well those boundaries fit together length-wise:26

$$\mathrm{rcld}_{i,j}(x^1, x^2) = \left|\delta_i(x^1) - \delta_j(x^2)\right| \quad (4.8)$$

26 For "asymmetric parallel corpora" (Braune and Fraser 2010), this measure will be less useful than for more faithful translations such as parliament proceedings.

Slovak ($\delta_n$):
1 Cieľom správy o revízii smernice o odpade z elektrických a elektronických zariadení bolo podporiť separovaný zber, zhodnocovanie a recykláciu tohto druhu odpadu. (34%)
2 Teoreticky by som si želal podporiť tento prístup. (45%)
3 K správe sa však pridalo množstvo pozmeňujúcich a doplňujúcich návrhov, ktoré predstavujú záťaž pre drobných obchodníkov. (70%)
4 Tí musia znášať ďalšie administratívne náklady a plniť ďalšie požiadavky, čo im spôsobí problémy. (91%)
5 Preto som sa rozhodol hlasovať proti návrhu. (100%)

Swedish ($\delta_n$):
1 Syftet med betänkandet om översyn av direktivet om avfall som utgörs av eller innehåller elektriska eller elektroniska produkter är att stimulera till separat insamling, utvinning och återanvändning av den här typen av avfall. (44%)
2 I teorin hade jag velat stödja den här strategin, men det har tillkommit ett antal ändringsförslag som är betungande i synnerhet för små affärsinnehavare, som påtvingas administrativa kostnader och krav som är svåra att klara. (88%)
3 Av den orsaken bestämde jag mig för att rösta emot förslaget. (100%)

$\phi_{l2}$ matrix (rows: Swedish sentences, columns: Slovak sentences):
         1     2     3     4     5
1     0.99  0.99  0.92  0.59  0.36
2     0.64  0.98  0.99  0.99  0.98
3     0.17  0.40  0.88  0.98  0.99

Figure 4.5 – $\phi_{l2}$ values for a short parallel text in Slovak and Swedish.

The trivial cases of the first sentences' left and the last sentences' right boundaries, $\mathrm{rcld}_{0,0}(x^1, x^2)$ and $\mathrm{rcld}_{|x^1|,|x^2|}(x^1, x^2)$, which are aligned by definition, consequently both yield a relative cumulative length difference of 0. For the feature $\phi_{l2}$ applied to a sentence pair $(x^1_i, x^2_j)$, we use the lesser relative cumulative length difference of both corresponding sentence boundaries, which in turn gives a higher value when the nlg function is applied to it:

$$\phi_{l2} = \mathrm{nlg}\left(\min\left(\mathrm{rcld}_{i-1,j-1}, \mathrm{rcld}_{i,j}\right)\right) \quad (4.9)$$

An example for this feature is shown in Figure 4.5. For the first and the last sentences of parallel texts, $\phi_{l2}$ will accordingly be close to 1, as rcld yields 0 for these sentences.

Figure 4.6 – Relative sentence lengths ($\Delta$) in terms of characters per language (sl, sk, et, en, sv, bg, fi, pl, pt, ro, es, it, nl, fr, el, de; values range from about 0.90 to 1.10). The difference averages out at 1.0 over all 16 languages.

For $\phi_{l1}$, we first calculate the length ratio of both texts. To this end, we divide the shorter by the longer one, also taking into account the relative length differences inherent to the set of languages that we work with. Figure 4.6 shows the relative lengths in terms of characters, which we refer to as $\Delta$.27 We obtained these values by counting characters on the well-fitting parallel texts that are available in all languages and dividing the respective counts by the global average. Comparing the extremes, we expect a German text to use approximately 25% more characters than a parallel Slovene one.

The ratio of normalized lengths of two sentences $s^1$ and $s^2$ is a simple measure

of length-based fit:

$$r = \frac{\min\left(\mathrm{len}(s^1)/\Delta_1,\ \mathrm{len}(s^2)/\Delta_2\right)}{\max\left(\mathrm{len}(s^1)/\Delta_1,\ \mathrm{len}(s^2)/\Delta_2\right)} \quad (4.10)$$

A sentence pair with corresponding lengths (e.g., a Slovene sentence with 100 and a German sentence with 125 characters) yields values close to 1. The greater the difference of the normalized length values, the lower the value of $r$ will be.

27 Kay and Röscheisen (1993) report concordantly that "one character in English on average gives rise to somewhat more than 1.2 characters in German." Gale and Church (1991) found a ratio of 1.1 for English/German and 1.06 for English/French.

Based on $r$, the relative length of the right sentence boundary and another factor representing the length difference of both sentences, we define $\phi_{l1}$ as:

$$\phi_{l1} = r \cdot \left(1 - \mathrm{rcld}_{i,j}\right) + 0.5 \cdot \left(1 - \frac{\left|\mathrm{len}(s^1)/\Delta_1 - \mathrm{len}(s^2)/\Delta_2\right|}{\max\left(\mathrm{len}(s^1)/\Delta_1,\ \mathrm{len}(s^2)/\Delta_2\right)}\right) \quad (4.11)$$

Both features $\phi_{l1}$ and $\phi_{l2}$ make use of the relative cumulative length difference. In contrast to $\phi_{l2}$, $\phi_{l1}$ also takes into account the relative lengths of the sentence pair in question. Our assumption motivating two different features based on sentence lengths is that the better feature will prevail at feature weight optimization (see below); we expect the other one to receive a low weight accordingly.

• $\phi_{no}$ rewards matching numbers. Longer matching numbers in terms of digits receive a higher value. $\phi_{ac}$ does the same for acronyms. Numbers and acronyms are also extracted from tokens if their word form contains at least one numeral or two subsequent uppercase letters.28 That way, we are also able to match, for instance, tokens in languages that typically add case endings to both kinds of tokens (e.g., English 'the 20th century', German 'das 20. Jahrhundert', French 'le XXe siècle', Spanish 'el siglo XX'). The values for both features are calculated as

$$\phi_{ac} = \prod_{(t^1, t^2)\, \in\, s^1 \times s^2} \mathrm{nlg}\left(\frac{\sqrt{f(t^1)} \cdot \sqrt{f(t^2)}}{\mathrm{len}(t)}\right) \cdot \varrho(t^1, t^2) \quad (4.12)$$

with $\varrho$ being a function that yields 1 for matching acronyms and 0 in all other cases (acr returns all uppercase letters if a token's word form has at least two consecutive ones):

$$\varrho(t^1, t^2) = \begin{cases} 1 & \text{if } \mathrm{acr}(t^1) = \mathrm{acr}(t^2) \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

As the matching of numbers starts with a single digit (in contrast to at least two letters for acronyms), we increment the denominator of the fraction for $\phi_{no}$ by 1.

28 Although Simard, Foster et al.'s (1993) measures slightly differ from ours, $\phi_{no}$ and $\phi_{ac}$ (and also $\phi_{ia}$ and $\phi_{iq}$) can be subsumed as cognates.

• $\phi_{et}$ and $\phi_{el}$ are features for alignment externally generated by hunalign (Varga et al. 2005) on word forms and lemmas, respectively.29 Hunalign provides a confidence value ($c$) for each alignment unit (AU) and an alignment quality ($q$) value per alignment set (AS).30 All sentence pairs that form part of an AU identified by hunalign receive that AU's confidence value. The alignment quality value is invariant for the whole AS. Alignment on different language pairs, however, will in most cases result in different values and, since the respective links between all languages compete with each other, ASs with a higher alignment quality value will benefit from it. We combine both values, which are typically greater than 1, such that high values from hunalign are mapped to high values of $\phi_{et}$ (and likewise $\phi_{el}$):

$$\phi_{et} = 1 - \frac{1}{1 + \max(c(s^1, s^2), 0) \cdot \max(q(s^1, s^2), 0)} \quad (4.13)$$

Both of hunalign's ratings can take negative values as well. That is why we explicitly set 0 as lower limit for both values. If either $c$ or $q$ is negative, $\phi_{et}$ will yield 0, that is, no evidence from pairwise sentence alignment for the sentence pair in question is available.

Weights

The features listed above are calculated for all edges between sentence nodes of different languages. We use a linear combination of them to get a single alignment score per edge. To this end, we employ feature-specific weights $w$ reflecting the importance of each feature. The final alignment score (as) is calculated as

$$\mathrm{as}(s^1, s^2) = \sum_i w_i \cdot \phi_i(s^1, s^2) \quad (4.14)$$

We use a sampling algorithm to determine optimal weights for multilingual sentence alignment in our corpus (see Section 4.3.2). Other corpora with different characteristics (e.g., many null alignments) will most likely benefit from a different configuration of weights – or potentially require completely different features.

29 We use hunalign since it is easy to apply and shows a good performance, in particular when it is equipped with corpus-specific dictionaries (see Section 4.3.2). Other sentence alignment tools require more resources. For bleualign (Sennrich and Volk 2010), for instance, we would need to train or acquire machine translation models for 120 language pairs.
30 See documentation on https://github.com/danielvarga/hunalign.

An additional weight $w_{ls}$, which is used to reward coherent lengths of joined clusters in the secondary clustering step (see below), forms part of the optimization process (see Section 4.3.2), as we expect it to interact with the other features' weighting.
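The text only states that a sampling algorithm determines the weights; the following random-search sketch is one plausible instantiation under that description – the uniform sampling, the sample count and the evaluate callback are assumptions:

```python
import random

def sample_weights(evaluate, n_features, samples=1000, seed=0):
    """Draw random weight vectors (the extra component stands for w_ls)
    and keep the best-scoring one on a gold-aligned development set."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(samples):
        weights = [rng.random() for _ in range(n_features + 1)]
        score = evaluate(weights)  # e.g., F-score against gold alignments
        if score > best_score:
            best, best_score = weights, score
    return best
```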

Primary Clustering Step

Our algorithm consists of three steps: First, we perform single-linkage hierarchical agglomerative clustering on the graph that is composed of sentences as nodes and alignment scores as edges. In doing so, we only permit monotonic clusters (see Equation 4.2) with a single sentence in each language. The second step is concerned with joining neighboring clusters, which leads to new clusters with a theoretically arbitrary number of sentences in each language. In practice, we predominantly see clusters that result from a single join, thus having at most two sentences in each language. In the third step, we convert the resulting clusters into hierarchical alignment sets.

The primary clustering step yields clusters that contain at most one sentence per language. It traverses the ordered list of precalculated alignment scores for sentence pairs, starting with the highest one. Depending on the current state, one of the following three actions is taken (a code sketch follows below):

1. Create a new cluster with both sentences if none of them forms part of an existing cluster.

2. Join an existing cluster if one of the two sentences forms part of it, the other one has not been assigned yet and there is no sentence of the same language present in the cluster.

3. Do nothing if both sentences form part of either the same (skip) or two distinct clusters (reject).
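The following sketch illustrates the three actions under the simplifying assumptions that sentence ids are comparable and that the monotonicity check (Equation 4.2) happens elsewhere; the data structures and names are ours, not the thesis implementation.

```python
# Sketch of the primary clustering step: traverse sentence pairs by descending
# alignment score and build clusters with at most one sentence per language.
def primary_clustering(scored_pairs, language_of):
    """scored_pairs: [(score, sent_a, sent_b)] for cross-language sentence pairs;
    language_of: dict mapping every sentence id to its language."""
    cluster_of = {}                          # sentence id -> cluster (set of ids)
    for score, a, b in sorted(scored_pairs, reverse=True):
        ca, cb = cluster_of.get(a), cluster_of.get(b)
        if ca is None and cb is None:        # action 1: create a new cluster
            cluster = {a, b}
            cluster_of[a] = cluster_of[b] = cluster
        elif (ca is None) != (cb is None):   # action 2: join if language is free
            cluster, new = (cb, a) if ca is None else (ca, b)
            if language_of[new] not in {language_of[s] for s in cluster}:
                cluster.add(new)
                cluster_of[new] = cluster
        # action 3: both already clustered -> skip (same) or reject (distinct)
    for s in language_of:                    # leftovers get singleton clusters
        cluster_of.setdefault(s, {s})
    return cluster_of
```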

Any cluster constructed in this way either comprises one sentence in each lan- guage, that is, it is complete, or it is lacking sentences in at least one language, that is, it is incomplete. All sentences without association to a cluster generated in this step are assigned their own (necessarily incomplete) cluster so that, when this step is completed, every sentence is assigned to exactly one cluster.

Secondary Clustering Step

The secondary clustering step processes each incomplete cluster $C_i$ and identifies complete candidate clusters $C_c$ with which to join the incomplete cluster. The hypothetical joined clusters are required not to violate the requirement of monotonicity, that is, not to yield alignment units crossing any of the existing clusters.

Table 4.6 – Revisited example from Table 4.5. The colored part corresponds to the primary cluster ($C_1$), the non-colored part to the secondary one ($C_2$).

(1) en: You see before you a Parliament of elected representatives who, each time they meet their constituents, have to justify the collective impotence of the Member States and of the Union when it comes to unemployment, which is becoming more and more of a scourge.
    de: Sie sehen ein Parlament Volksvertreter vor sich, die sich bei jeder Begegnung mit ihren Wählern für die allgemeine Unfähigkeit unserer Staaten und der Union, die Arbeitslosigkeit zu bekämpfen, rechtfertigen müssen.
    es: Tiene usted ante sí un Parlamento de representantes que, siempre que se reúnen con sus electores, deben justificar la impotencia colectiva tanto de nuestros Estados como de la Unión en materia de desempleo.
(2) de: Mit dieser Plage wird es immer schlimmer.
    es: La gravedad de ese flagelo aumenta constantemente.

Typically, there are two alternatives: the complete cluster following or preceding the incomplete one. The latter is the case in the first example in Table 4.5, revisited in Table 4.6, where the first, colored sentences make up a complete cluster ($C_1 = [(1), (1), (1)]$) and the two non-colored sentences form an incomplete cluster, since an English sentence is missing ($C_2 = [\emptyset, (2), (2)]$). In case an incomplete cluster has two neighboring complete ones, both candidates are ranked according to the alignment scores between their components:

$$\mathrm{cas}(C_i, C_c) = \sum_{(s_i, s_c) \in C_i \times C_c} \mathrm{as}(s_i, s_c) \tag{4.15}$$

The feature weight $w_{ls}$ is then added to the cluster alignment score (cas) of that candidate that, joined with $C_i$, would be more coherent, that is, the one with the lowest root-mean-square deviation of normalized lengths. In so doing, we account for the strong sentence alignment feature of length correspondence (see Section 4.2.1), which we considered less helpful for the primary clustering step, whose objective is not to find complete sentence correspondences but the nuclei of potentially larger AUs. We join the candidate with the highest total score with $C_i$ and consider the resulting joined cluster as complete for subsequent actions.

As there may be clusters that cannot be joined to any complete cluster at first, the process of joining is repeated until no further joins can be performed. The clustering process allows for persisting incomplete clusters, although we expect every sentence in our corpus to have a counterpart in all other languages, be it a single sentence, a sequence of sentences or parts of a sentence. Fortunately, persisting incomplete clusters are not a frequent phenomenon according to our observations.
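A sketch of how the candidate ranking could be computed: the cluster alignment score of Equation 4.15 plus the length coherence criterion, here measured as the root-mean-square deviation of normalized lengths. The function names and data layout are assumptions for illustration.

```python
# Sketch of the secondary clustering step's candidate ranking.
from math import sqrt

def cluster_alignment_score(incomplete, candidate, as_score):
    """Equation 4.15: sum of alignment scores over all cross-cluster pairs."""
    return sum(as_score(si, sc) for si in incomplete for sc in candidate)

def rms_length_deviation(cluster, norm_len):
    """RMS deviation of normalized sentence lengths within a (joined) cluster."""
    lengths = [norm_len(s) for s in cluster]
    mean = sum(lengths) / len(lengths)
    return sqrt(sum((l - mean) ** 2 for l in lengths) / len(lengths))
```

The candidate whose hypothetical join yields the lowest deviation would receive the additional weight $w_{ls}$ on top of its cas value before the highest-scoring candidate is chosen.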

In the example from Table 4.6, the second clustering step will join the first, complete cluster with the second, incomplete cluster resulting in a single cluster:

$$C_{1+2} = C_1 \cup C_2 = [(1), (1, 2), (1, 2)]$$

Transformation into Hierarchical Alignment Sets

The two clustering steps above aim at identifying the best matching single sentences (primary clustering) and all other sentences that also belong to those best matching sentences (secondary clustering) in case the correspondence is not trivial with all alignment boundaries agreeing (as is the case in Table 4.1). To convert the obtained cluster structure into hierarchical ASs, we first separate the sentences of those languages that are present in any attached incomplete cluster from the complete one into a new cluster and, second, let the attached incomplete clusters, the new cluster and the remaining sentences from the primary cluster join on a higher level.

For the first example from Table 4.6, with the first sentences in all languages forming the complete cluster ($C_1 = [(1), (1), (1)]$) and the second sentences in German and Spanish forming the incomplete cluster ($C_2 = [\emptyset, (2), (2)]$), this means moving the German and Spanish sentences (i.e., the languages that are also present in the incomplete cluster) from the complete cluster into a new cluster ($C_3 = [\emptyset, (1), (1)]$) and removing them from it ($C_1' = [(1), \emptyset, \emptyset]$). The final step is to join all three resulting clusters to a superordinate one ($C_4 = C_1' \cup C_2 \cup C_3 = [(1), (1, 2), (1, 2)]$). The top-level cluster $C_4$ thus comprises all example sentences and contains two sub-clusters, namely the respective first ($C_3$) and second ($C_2$) sentences in German and Spanish. Cluster $C_1'$ is disregarded as it is contained in $C_4$. Furthermore, it only comprises one language and, hence, does not align anything. Table 4.7 depicts this structure.
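The transformation can be illustrated with a small sketch on the example above; the per-language tuple representation of clusters is our choice, made purely for illustration.

```python
# Sketch of the split-and-join transformation into a hierarchical AS.
def split_and_join(complete, incomplete):
    """Separate the languages present in the incomplete cluster from the
    complete cluster (C3), keep the rest (C1'), and join all on a higher level."""
    langs = {l for l, sents in incomplete.items() if sents}
    c3 = {l: (sents if l in langs else ()) for l, sents in complete.items()}
    c1_rest = {l: (() if l in langs else sents) for l, sents in complete.items()}
    top = {l: tuple(sorted(c1_rest[l] + c3[l] + incomplete.get(l, ())))
           for l in complete}
    return c1_rest, c3, top

c1 = {'en': (1,), 'de': (1,), 'es': (1,)}
c2 = {'en': (), 'de': (2,), 'es': (2,)}
print(split_and_join(c1, c2))
# ({'en': (1,), 'de': (), 'es': ()},        C1'
#  {'en': (), 'de': (1,), 'es': (1,)},      C3
#  {'en': (1,), 'de': (1, 2), 'es': (1, 2)}) C4
```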

4.3.2 Evaluation

To evaluate our multilingual alignment approach, we use two methods: one that evaluates the generated multilingual ASs with respect to the performance on their included minimal bilingual ASs for language pairs; and one that focuses on multilingual alignment consistency, accounting for all languages included in the alignment process. As a foundation with which to compare our automatically obtained ASs, we manually sentence-aligned 100 randomly chosen texts (i.e., documents), our gold alignments. Minimal ASs for language pairs can be extracted as described in Section 4.3 on page 78.

Table 4.7 – The resulting clusters that constitute the hierarchical alignment set (except for cluster $C_1'$, which only contains a single sentence in one language).

en 1 ($C_1'$): You see before you a Parliament of elected representatives who, each time they meet their constituents, have to justify the collective impotence of the Member States and of the Union when it comes to unemployment, which is becoming more and more of a scourge.
de 1 ($C_3$): Sie sehen ein Parlament Volksvertreter vor sich, die sich bei jeder Begegnung mit ihren Wählern für die allgemeine Unfähigkeit unserer Staaten und der Union, die Arbeitslosigkeit zu bekämpfen, rechtfertigen müssen.
es 1 ($C_3$): Tiene usted ante sí un Parlamento de representantes que, siempre que se reúnen con sus electores, deben justificar la impotencia colectiva tanto de nuestros Estados como de la Unión en materia de desempleo.
de 2 ($C_2$): Mit dieser Plage wird es immer schlimmer.
es 2 ($C_2$): La gravedad de ese flagelo aumenta constantemente.
(All five sentences together form the top-level cluster $C_4$.)

Gold Alignments

To evaluate our alignment approach, we manually aligned sentences in 100 texts from FEP9 in up to 16 languages using the Hierarchical Alignment Tool (HAT) shown in Appendix A.2. Six of the originally selected 100 texts had to be replaced because they were not parallel, that is, they did not comprise mutual translations for all languages, or at least one sentence boundary was not recognized, for instance, due to the use of a centered dot (i.e., an interpunct: '·') instead of a regular one. We found three cases where the period of an ordinal number in Polish was misinterpreted by our tokenizer as a cardinal number followed by a sentence-final period, which led to the recognition of incorrect sentence boundaries. Since we expect the alignment algorithm to later merge the resulting fragments, we decided to keep those incorrectly split sentences.

Not all texts are available in every language. We count 62 texts comprising all 16 languages, 5 with 15, 27 with 14 and 6 with 13 languages. Altogether, the 100 selected texts account for 14 892 sentences; the longest text consists of approximately 80 sentences in each language and there are only three texts with more than 40 sentences (see Figure 4.7). On average, each text comprises 9.8 sentences per language, with the most prominent deviations being Romanian with 8.4 sentences and Polish with 10.6 sentences. These numbers are also depicted in Figure 4.7.

We limit the alignment hierarchy to two levels to generate a structure equivalent to the one produced by our alignment algorithm and, first and foremost, to not further prolong the time-consuming process of manual alignment. Sentences aligned in leaf nodes can thus be joined to branch nodes once, but joining branch nodes into larger structures is not supported.

[Figure 4.7 – Number of sentences per text (bg, de, el, en, es, et, fi, fr, it, nl, pl, pt, ro, sk, sl, sv) in our set of gold alignments.]

The 100 sentence-aligned texts yield 1576 AUs: 1369 leaf AUs contain 10.9 (median 14) and 207 branch AUs contain 25.1 (median 24) sentences on average. Table 4.4 shows an excerpt of one of the texts in six instead of the 16 available lan- guages. It contains five leaf AUs (1 to 5) and one branch AU (6), which comprises the AUs 2 to 4. Sentences are typically longer than those shown in this example.

Methods

We perform two evaluations: First, we measure the coverage of generated multilingual AUs in comparison with our gold alignments and, second, we calculate pairwise F-Score measures over all language pairs.31 Given a gold AS ($\mathcal{G}$) and a test AS generated by our algorithm ($\mathcal{T}$), the multilingual evaluation compares each AU of the gold AS ($G \in \mathcal{G}$) with the respective best matching AU of the test AS ($T \in \mathcal{T}$). Our measure for the alignment accuracy is the ratio of shared and distinct sentences in both AUs:

$$q(G, T) = \frac{|G \cap T|}{|G \cup T|} \tag{4.16}$$

The best matching AU is the one that gives the highest alignment accuracy value. We calculate the average over all gold AUs, also taking into account the number of AUs generated:

31 Since we have no preference for either precision or recall in this case, F-Score hence refers to the balanced F1-Score.

$$\xi(\mathcal{G}, \mathcal{T}) = \frac{\sum_{G_i \in \mathcal{G}} q\left(G_i, \mathrm{argmax}_{T_j \in \mathcal{T}}\, q(G_i, T_j)\right)}{|\mathcal{G}|} \tag{4.17}$$

This alignment score does not per se differentiate between leaf and branch AUs. By using the subset of leaf or branch AUs of both $\mathcal{G}$ and $\mathcal{T}$, we obtain separate scores for leaves and branches. If an algorithm is particularly good at identifying the smaller leaf AUs, but fails at assembling them into bigger branch AUs, we thus expect $\xi(\mathcal{G}_{leaf}, \mathcal{T}_{leaf})$ to yield a better score than $\xi(\mathcal{G}_{branch}, \mathcal{T}_{branch})$.

With the alignment score $\xi$ as defined above, we are able to compare a generated AS and the gold AS of a particular text. To our knowledge, we are the first to present a hierarchical multilingual alignment approach and the corresponding gold data, which means that this evaluation metric will only become useful for comparison in connection with future approaches. This hindrance is what we aim to address with our second evaluation.

By calculating pairwise F-Scores, we compare multilingual AUs with bilingual ones (i.e., a set of n:m AUs) produced by common sentence alignment tools.32 These bilingual AUs can be extracted from the multilingual ones by using only those AUs that possess sentences in both languages of the respective language pair and by subsequently removing duplicates.33 We refer to them as the minimal AS $\mathcal{A}^L$ of a multilingual AS for a given set of languages $L$, which, in theory, can include any number of languages but, for comparison with bilingual sentence alignment, is limited to two languages ($|L| = 2$).

For a particular language pair $L$, we calculate the F-Score of a test AS ($\mathcal{T}^L$) given the gold AS ($\mathcal{G}^L$) for the same text as the ratio of AUs to be found in both sets and the average of both sets' cardinalities:34

$$F(\mathcal{G}^L, \mathcal{T}^L) = \frac{2 \cdot |\mathcal{G}^L \cap \mathcal{T}^L|}{|\mathcal{G}^L| + |\mathcal{T}^L|} \tag{4.18}$$

The relation between the definition of the F-Score measure by means of set cardinalities, which corresponds to the Dice similarity coefficient, and the arguably more widespread definition by means of events (true positives, false positives and false negatives) is depicted in Table 4.9 in Section 4.4.2; the correspondence of both definitions is also shown in Appendix B.1.

True positives are those AUs that are found in both the test and the gold AS ($\mathcal{G}^L \cap \mathcal{T}^L$), while false positives and false negatives are only found in the test and the gold AS, respectively.

There are two ways to calculate an F-Score measure on a set of texts instead of a single one: We can simply calculate the average of text-wise derived F-Scores, the so-called macro F-Score, or we calculate the F-Score on the sum of set cardinalities, the so-called micro F-Score. Equations 4.19 and 4.20 show both definitions for a set of gold and test AS pairs ($\mathfrak{S}^L$).

32 It also helps to identify differences in alignment performance between languages.
33 Duplicates may result from branch AUs that extend a pairwise leaf AU of the languages in question with leaf AUs only containing sentences in other languages.
34 Considering matching sets instead of single alignment links, this measure is similar to the translation unit error rate introduced by Søgaard and Kuhn (2009) with the objective of only rewarding the alignment algorithm if larger structures, translation units, have been identified correctly.

$$F_{macro}(\mathfrak{S}^L) = \frac{\sum_{(\mathcal{G}^L, \mathcal{T}^L) \in \mathfrak{S}^L} F(\mathcal{G}^L, \mathcal{T}^L)}{|\mathfrak{S}^L|} \tag{4.19}$$

$$F_{micro}(\mathfrak{S}^L) = \frac{\sum_{(\mathcal{G}^L, \mathcal{T}^L) \in \mathfrak{S}^L} 2 \cdot |\mathcal{G}^L \cap \mathcal{T}^L|}{\sum_{(\mathcal{G}^L, \mathcal{T}^L) \in \mathfrak{S}^L} \left(|\mathcal{G}^L| + |\mathcal{T}^L|\right)} \tag{4.20}$$

The macro average treats all texts the same, no matter how long they are in terms of sentences; the micro average, on the other hand, attaches substantially greater weight to longer texts. The case of a single erroneous alignment boundary, for instance, a 2:1 AU followed by a 1:2 AU that has been identified as a 1:2 AU followed by a 2:1 AU, results in 2 AUs missing in the test AS that are present in the gold AS (i.e., $|\mathcal{G}^L \cap \mathcal{T}^L| = |\mathcal{G}^L| - 2$). For a text with 10 sentences in both languages, this leads to an F-Score of 0.8, while the F-Score of a text with 20 sentences is 0.9. The macro average of both F-Scores, the average of both numbers, is 0.85, while the micro average on both pairs of sets yields 0.87 ($\frac{8+18}{10+20}$). This is due to the higher absolute numbers of the larger text, to which more importance is attached by the micro average F-Score measure.
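For concreteness, Equations 4.16 to 4.20 can be written down compactly in Python; representing an AU as a frozenset of (language, sentence index) pairs and an AS as a set of such AUs is our illustrative choice, not necessarily how the evaluation was implemented.

```python
# Sketch of the evaluation measures; AU = frozenset of (language, sentence)
# pairs, AS = set of AUs. All names are ours.
def q(g, t):
    """Equation 4.16: ratio of shared and distinct sentences in two AUs."""
    return len(g & t) / len(g | t)

def xi(gold_as, test_as):
    """Equation 4.17: average best-match accuracy over all gold AUs."""
    return sum(max(q(g, t) for t in test_as) for g in gold_as) / len(gold_as)

def f_score(gold_as, test_as):
    """Equation 4.18: Dice coefficient over the two sets of AUs."""
    return 2 * len(gold_as & test_as) / (len(gold_as) + len(test_as))

def f_macro(pairs):
    """Equation 4.19: average of text-wise F-Scores."""
    return sum(f_score(g, t) for g, t in pairs) / len(pairs)

def f_micro(pairs):
    """Equation 4.20: F-Score on summed set cardinalities."""
    shared = sum(2 * len(g & t) for g, t in pairs)
    sizes = sum(len(g) + len(t) for g, t in pairs)
    return shared / sizes
```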

To find out how to set the respective weights for the features of our multilingual sentence alignment algorithm in relation to each other, we use the random walk Metropolis algorithm (Metropolis et al. 1953; Sherlock et al. 2010) in the state space35 spanned by all 13 feature weights. We limit each dimension to the interval from 0 to 1 and normalize the values such that the length of the resulting vector equals 1, that is, we perform a linear projection of a point in the state space onto the hypersphere with radius 1. This is motivated by Equation 4.14, which describes the linear combination of weights to the alignment score that is used for clustering: Multiplying all weights with the same numeric value has no impact on the ranking of alignment scores.

35 Also referred to as configuration space by Metropolis et al. (1953).

Metropolis is a Markov chain Monte Carlo (MCMC) algorithm, which means it randomly draws samples from a probability distribution (Monte Carlo) and each sample only depends on its predecessor (Markov chain). This sample sequence generated by a stochastic process (i.e., the Markov chain) is guaranteed to converge to the probability density $\pi$ of the state space. Starting with a random position in the state space ($X_0$), the following states ($X_1, X_2, \ldots, X_t$) are generated as follows:

1. We draw a random variable $\epsilon$ from a multivariate distribution – in our case a Gaussian one –, which corresponds to a proposed movement in the state space: $Y := X_{t-1} + \epsilon$

2. This proposal is, by design of the algorithm, not always accepted.36 We calculate the acceptance probability $\alpha$ as the ratio of the proposed and the last state's probabilities, limiting it to 1:

$$\alpha(Y, X) = \min\left(1, \frac{\pi(Y)}{\pi(X)}\right)$$

If the proposed move represents an improvement or both states evaluate to the same value ($\alpha = 1$), we always accept it. Otherwise, we only accept it if a draw from a single random variable in the interval $[0, 1]$ yields a value lower than $\alpha$. Accepting means that the new state $X_t$ becomes the proposal $Y$. In case of rejection, we do not move in the state space and, hence, the new state is the same as the previous one: $X_t = X_{t-1}$

The Gaussian distribution around state $X_t$ proposes closer points in the state space with a higher probability, that is, shorter moves are generated with a higher frequency than longer ones. If a chosen target coordinate in one dimension is outside of the range [0, 1], we reflect it back inside this range to ensure that a move in one direction is as probable as a move in the reverse direction. This symmetric proposal distribution is a requirement for the Metropolis algorithm.37

The motivation behind not only accepting better (i.e., more probable) states and thus eventually reaching the best one is that there may be multiple maxima. Only accepting better states corresponds to the hill-climbing algorithm, which, depending on the random initial state, cannot reach the global maximum. By accepting worse (i.e., less probable) states with the probability $\alpha$, we allow the algorithm to explore more regions in the state space without being completely random at selecting states (like a pure Monte Carlo method). The states that we visit over time approximate the weights' joint probability distribution. In addition to that distribution, which we construe from a sufficiently large number of samples, we also keep track of the best-scoring position in the state space that we have visited over the course of the process.

Metropolis et al. (1953) use the energy difference in a system of molecules to model the acceptance probability $\alpha$. The probability of a state is higher for lower energy levels (up to a probability of 1 for zero energy) and lower for higher values. If a proposed state change yields a lower energy level, the ratio $\pi(Y)/\pi(X)$ is greater than 1 and, therefore, the change is accepted in any case. We use the macro average F-Score as the quality measure of the respective states. Unlike Metropolis et al.'s energy measure, which has no theoretical upper limit, the F-Score is limited by design to numeric values between 0 and 1. An average F-Score of 0 can only be obtained if every text's F-Score for all language pairs yields 0 and this, in turn, is only possible if the algorithmically generated AS shows not a single intersection with the gold AS.

36 Otherwise, the sequence of states would correspond to a random walk.
37 If we had a non-symmetric proposal distribution, that is, if a move between two points in one direction were more probable than the other way round, we could use the more general Metropolis-Hastings algorithm (Hastings 1970; D. D. L. Minh and D. L. Minh 2015) instead.
In practice, even randomly chosen states frequently yield F-Scores above 0.9.38 Due to the requirement of monotonicity, the alignment algorithm manages to take the right (clustering) decisions in many cases even if the weighting of the respective features is not optimal.39 If we were to use the raw F-Score ratio as acceptance probability, we would accept proposed states with an F-Score that is approximately 1 % lower than the F-Score of the previous state (e.g., a move from 0.95 to 0.94) in about 99 % of the cases ($\alpha \approx 0.99$). This high acceptance rate leads to a slow approximation of the real probability distribution, as states with lower values will be accepted frequently and better states thus need more time (i.e., iterations) to be explored. The optimal acceptance rate for the Metropolis algorithm has been shown to be approximately 0.234 (Roberts et al. 1997; Sherlock et al. 2010).

We have two options at our disposal to adjust the acceptance rate: On the one hand, we can vary the standard deviation of the multivariate distribution that we draw $\epsilon$ from. A larger standard deviation results in larger move proposals and therefore lowers the acceptance rate. On the other hand, a constant $k \gg 1$ exponentiating the F-Scores will move probability mass from the underused low F-Scores to higher ones, such that a small difference in high F-Scores between proposal and previous state entails a larger probability of rejection. We experimentally found that a standard deviation of 0.4 in combination with the exponent $k$ set to 400 leads to an acceptance rate close to the optimal one. In a scenario where the proposed state's F-Score is approximately 1 % worse than the previous one (e.g., $0.94/0.95 = 0.9895$), the probability that this proposal will be accepted is considerably lower in this setting ($0.9895^{400} \approx 1.5\,\%$); a reduction of 0.1 % will be accepted in 66 % of the cases, a reduction of 0.01 % in 96 %.

To evaluate our algorithm, we perform a 5-fold cross validation on our gold data. For that purpose, we randomly segment the 100 manually aligned texts into five test sets comprising 20 texts each. Sampling with the Metropolis algorithm as described above is performed on the training sets composed of the respective four remaining parts (80 texts per training set). A pair of training and test set constitutes a fold. We obtain optimal weights per training set by looking up the median value of each weight in a large number of samples observed during the process. Alignment on the test sets is subsequently performed with the optimal weights of the corresponding training set. For comparison, we also align the test sets with hunalign.

38 Note that the same number of correct AUs (true positives) and errors (false positives and false negatives) corresponds to an unbiased F-Score of 0.6667 ($\frac{2}{3}$).
39 Three texts only comprise a single sentence in all languages; they will form an AS with a single AU, independent of the weights provided.
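Putting the pieces together, a minimal sketch of this weight sampling could look as follows; `macro_f` is a stand-in for aligning the training texts with a given weight vector and returning the macro average F-Score, and all names are ours, not the thesis code.

```python
# Random walk Metropolis over the 13 feature weights, with reflection at the
# borders of [0, 1] and the exponent k sharpening the acceptance ratio.
import random

def reflect(x):
    """Reflect a coordinate back into [0, 1] so the proposal stays symmetric."""
    x = x % 2.0
    return 2.0 - x if x > 1.0 else x

def metropolis(macro_f, dims=13, sigma=0.4, k=400, iterations=10_000):
    state = [random.random() for _ in range(dims)]
    # Scaling the weight vector does not change the ranking of alignment scores
    # (Equation 4.14), so we omit the projection onto the unit hypersphere here.
    pi = macro_f(state) ** k          # state quality, sharpened by the exponent k
    best, samples = (pi, list(state)), []
    for _ in range(iterations):
        proposal = [reflect(w + random.gauss(0.0, sigma)) for w in state]
        pi_new = macro_f(proposal) ** k
        # accept improvements always, worse states with probability pi_new / pi
        if pi_new >= pi or (pi > 0 and random.random() < pi_new / pi):
            state, pi = proposal, pi_new
        samples.append(list(state))
        if pi > best[0]:
            best = (pi, list(state))
    return samples, best[1]
```

The optimal weights per training set would then be read off as the per-dimension medians of the retained samples.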

Results

We present figures for the aforementioned evaluation in Table 4.8. The first and the third column show F-Scores obtained by applying our algorithm to the training and test sets, respectively, using the combination of optimal weights that we gained from 10 000 samples of the generated Markov chains. In the fifth column, we list F-Scores obtained by hunalign on the respective test sets.40 The intermediate columns show the difference between adjacent F-Scores.

Comparing the performance of our algorithm on training and test sets for both micro and macro average F-Scores, we see that the test sets score slightly lower on average. The absolute difference between average scores on training and test sets, however, is considerably lower than the standard deviation of those differences, which allows for the conclusion that there is no significant performance drop when the weights optimized on the respective training sets are applied to their corresponding test sets.41 Comparing the performance of our algorithm and hunalign on the test sets, we re-encounter the same situation: Although, on average, our approach yields better micro and macro average F-Scores, the performance on the respective folds is mixed and does not allow us to conclude that our approach would yield better pairwise alignments.42

A peculiarity is the performance drop of Fold 2 from training to test set, which we observe for both types of F-Scores.

40We found that hunalign’s performance on most language pairs can be improved considerably by providing well-curated dictionaries. Since we do not possess such dictionaries for all language pairs, we let hunalign bootstrap them on the respective parallel texts (see page 70). 41Otherwise, we could suspect the optimization process to overfit the training data. 42Note that our approach does not produce pairwise alignments in the first place, but multi- lingual alignments that are projected to minimal pairwise alignments for each language pair. CHAPTER 4. ALIGNMENT METHODS 99

Table 4.8 – F-Scores attained on the training and test sets of each fold by our multilingual sentence alignment algorithm (mlsa). The performance on the test sets is also compared to hunalign's performance on the same data. The intermediate columns show the difference between adjacent F-Scores. Average ($\bar{F}$) and standard deviation ($\sigma$) of the values obtained for all folds are given below.

$F_{micro}$   mlsa training    ΔF         mlsa testing    ΔF         hunalign
Fold 1        0.9515        −0.0079       0.9436       −0.0033       0.9403
Fold 2        0.9551        −0.0251       0.9299       +0.0074       0.9373
Fold 3        0.9469        +0.0061       0.9530       −0.0187       0.9343
Fold 4        0.9453        +0.0038       0.9491       −0.0013       0.9478
Fold 5        0.9467        +0.0055       0.9521       −0.0131       0.9391
$\bar{F}$     0.9491        −0.0035       0.9456       −0.0058       0.9398
$\sigma$      0.0041         0.0134       0.0095        0.0102       0.0050

$F_{macro}$   mlsa training    ΔF         mlsa testing    ΔF         hunalign
Fold 1        0.9443        +0.0209       0.9651       −0.0260       0.9392
Fold 2        0.9544        −0.0368       0.9175       +0.0019       0.9194
Fold 3        0.9427        +0.0168       0.9595       −0.0255       0.9340
Fold 4        0.9478        −0.0178       0.9300       +0.0033       0.9333
Fold 5        0.9459        +0.0026       0.9486       −0.0170       0.9316
$\bar{F}$     0.9470        −0.0029       0.9441       −0.0126       0.9315
$\sigma$      0.0045         0.0243       0.0200        0.0144       0.0073

While further investigating the composition of the respective folds, we found that the test set of this fold comprises the longest text in terms of sentences (the one at x-coordinate 100 in Figure 4.7). That text, however, despite being considerably longer than all the others, has median F-Scores in relation to its test set (both scores are approximately 0.94).

Figure 4.8 shows the resulting F-Scores for each fold's test set. The distinguishing property of the text at position 20 in Fold 2 is not its length, but the deviation in lengths: In Portuguese, we only find 3 sentences with 56, 100 and 142 tokens; in Dutch, we find 13 sentences having between 9 and 49 tokens. Our algorithm fails to identify all three gold AUs, which, projected to the language pair Portuguese/Dutch, would be a one-to-two, a one-to-four and a one-to-seven alignment. Since no AU has been identified correctly, the F-Score is accordingly 0 (for this and nine other language pairs).

[Figure 4.8 – F-Scores per test set in decreasing order. Darker colors denote micro and pale colors macro F-Score averages (average over all language pairs).]

The worst-scoring text in Fold 5 at position 20 is similarly difficult to align. In that text, we have between one and three sentences per language, which means that a single erroneous clustering decision makes the difference between an F-Score of 1 and 0. We count 29 times an F-Score of 0, 10 times of 0.6667 and 52 times of 1.

By reason of the small number of texts per test set (20), a single outlier can account for a noticeable difference in the set's performance.43 In both cases, hunalign performs better, which is presumably due to the influence of erroneous clustering steps on subsequent steps involving other languages. These observations lead to the question whether – apart from being an inevitable task for obtaining multilingual alignments – bilingual alignment also benefits from the inclusion of other languages in the alignment process. To address this question, we compare the F-Scores obtained from applying our algorithm to a particular language pair with those obtained from applying it to all available languages and subsequently reducing the obtained AS to the minimal AS of that language pair. The results are visualized in Figure 4.9.

Except for two language pairs (Greek/Polish for micro average and Greek/English for macro average F-Score differences), direct alignment of two languages performs better than multilingual alignment with subsequent extraction of pairwise minimal alignments. The most prominent is the language pair Estonian/Swedish, where bilingual alignment yields F-Scores that are almost 0.07 higher than the F-Scores obtained from the minimal AS of multilingual alignment.

43 We see similar results for this text in the training sets of the other folds.

[Figure 4.9 – Differences between F-Scores (micro and macro) obtained by bilingual vs. multilingual alignment with our multilingual alignment algorithm. Numerical differences are projected to colors from green (bilingual alignment is better) to yellow (neutral) to red (multilingual alignment is better).]

There are many more clustering decisions involved in multilingual alignment with a higher number of languages, and errors can propagate if they happen early in the process, that is, if a link between two sentences is overvalued considerably.

In Figure 4.10, we compare the performance of bilingual alignment to trilingual alignment using the micro average F-Score.44 Unlike the comparison shown in Figure 4.9, we do not aggregate the results to a single average F-Score value, but examine which third languages lead to the biggest positive and negative differences with regard to the score obtained by mere bilingual alignment.

44 Results for the macro average F-Score look similar color-wise.

[Figure 4.10 – Differences between macro average F-Scores obtained by bilingual vs. trilingual alignment. Green stands for better scores obtained from bilingual alignment, red for better scores by trilingual alignment. The third language that accounts for the biggest positive or negative difference is indicated.]

We observe that sentence alignment on the language pair Greek/Polish, which even shows a better micro average F-Score in Figure 4.9 for multilingual alignment on all available languages, always improves when any of the other 14 languages is included (this is why there is no language favoring the bilingual alignment in Figure 4.10). Results for other language pairs are ambivalent: They show better scores for one method with one particular language, but also better scores for the other method with another language. The alignment of German/Greek texts, for instance, can be improved most by including Slovene; Swedish as third language, on the other hand, will impair the performance of our alignment algorithm most.

In general, we see that English is the single most frequent language that improves bilingual alignment of other language pairs. On the other hand, we can conclude that Dutch and Finnish are better left out in most cases if not needed, as they will probably lead to a worse alignment. Apart from these observations, the results do not show other noticeable patterns (e.g., regarding language families). To find out why and how, for instance, Slovak helps to align Dutch and Swedish, we would need to analyze the differences between clustering decisions in corresponding ASs.

The diverging results from Table 4.8 and the missing pattern in Figure 4.10 suggest that there might be insufficient data from which to learn optimal weights. Although the texts from our gold standard differ considerably in length, as shown in Figure 4.7, we have seen that this has little influence on the results; larger variation in sentence numbers per AU (e.g., a one-to-seven correspondence), however, renders the alignment process considerably more difficult. To address this question, we look at the data produced by the Metropolis algorithm: Figure 4.11 shows Gaussian kernel density estimates (i.e., an approximation of the probability density estimated from samples) of the respective weights $\tilde{w}_x$ for all folds and the entire data set. We used every 10th sample from a total of 10 000 samples per fold to counteract the autocorrelation of the Markov chain.

Based on the circumstance that most shapes for the same weights (i.e., the same rows) look similar and the obtained optimal weights (i.e., the solid lines) show approximately the same value, we conclude that the 80 texts we use for training on each fold are sufficient to obtain weights that generalize well.45 Moreover, the investigation of anomalous results in the test set showed that some (few) texts are difficult to align by reason of their inner structure. Even if we include the test set in the training data (i.e., performing the evaluation on weights optimized on all data), improvements on the measured F-Scores are marginal.

The probability distributions per feature weight in Figure 4.11 illustrate the joint probability distribution of all weights.
They do not allow for statements regarding the suitability of each respective feature for multilingual sentence alignment, though, as the values are not comparable, that is, a feature that slowly approximates the optimal value of 1 will require a higher weight than another feature that only differentiates between absence (0) and presence (1) if both are supposed to be equally important.

45 Even though most shapes of the same weight look similar, we see deviations, for instance, in $\tilde{w}_{l1}$ in Fold 5, whose shape, in contrast to other folds' shapes, indicates no clear preference for values below or above the neutral one.

[Figure 4.11 – Kernel density estimates of the respective features' weight distributions ($\tilde{w}_{et}$, $\tilde{w}_{el}$, $\tilde{w}_{ft}$, $\tilde{w}_{fl}$, $\tilde{w}_{ia}$, $\tilde{w}_{iq}$, $\tilde{w}_{pt}$, $\tilde{w}_{pl}$, $\tilde{w}_{l1}$, $\tilde{w}_{l2}$, $\tilde{w}_{ls}$, $\tilde{w}_{no}$, $\tilde{w}_{ac}$) for all folds and the whole gold alignment set. The solid lines represent median values; the dotted lines stand for an equal weight distribution.]

Nonetheless, if two features use the same underlying formula, we can infer from the probability distributions which one provides more evidence and is thus preferable. The externally generated pairwise alignments on word forms, for instance, are assigned more weight than the similar alignments on lemmas, while matching first tokens as an approach to capture discourse markers works better with lemmas than with the word forms of those tokens. We also see that question marks provide more evidence for correspondence than sentence-final punctuation in general.

[Figure 4.12 – Multilingual alignment scores $\xi$ for all 100 manually aligned texts (green circles). Red asterisks represent the corresponding micro F-Score, blue stars the macro F-Score.]

Using the proposed multilingual evaluation function $\xi$, we obtain scores between 0.0323 and 1, with approximately one third of the texts showing no error (i.e., yielding a score of 1). The median score (0.9159) is considerably higher than the average score (0.8322), which is due to the poor performance of the multilingual alignment algorithm on some texts. Figure 4.12 depicts the distribution over all 100 texts. In general, both F-Score measures and $\xi$ agree on assigning good scores to well-aligned texts. It becomes apparent, though, that we measure different aspects of alignment quality with F-Scores compared to $\xi$ when inspecting cases with diverging results. For the text at position 65, for instance, the gold AS consists of a single AU comprising all sentences in all languages, which has been correctly identified by our approach. Minimal AUs, however, are incorrect for many language pairs in this case. Conversely, the text at position 2 shows good overall performance on the identification of pairwise AUs, but disagrees with the gold standard regarding the composition of multilingual AUs.

4.4 Word Alignment

Aligned sentences provide the basis for word alignment. In theory, word alignment algorithms would also work on entire parallel texts. However, we expect corresponding words to be found in corresponding sentences, or, the other way round, sentence alignment is designed to identify sentence correspondences such that these alignments enclose corresponding words.46 Limiting word alignment to previously aligned sentences implies two essential advantages from a computational point of view: On the one hand, alignment errors due to the aligned words not being contained in corresponding sentences are rendered impossible; on the other hand, we reduce computational complexity, which, when assessing all possible alignments, increases quadratically with the input length in both languages.47

Free word order is the most important factor in making word alignment a more challenging task than sentence alignment. We have seen that sentence alignment algorithms assume monotonicity (see Section 4.2.1), that is, that the meaning in parallel texts is conveyed in the same order in all languages. At least in the case of the European Parliament debates, translators typically only apply small changes to texts with regard to the sub-division into sentences, for example by joining or splitting them (e.g., by means of a conjunction or a period). Larger deviation between texts in two languages (e.g., German and Spanish as shown in Table 4.2) typically stems from independent translations from a third source language.

Rhetorical Structure Theory (RST) (Mann and Thompson 1987, 1988) describes the structure of texts on the level of discourse. Its units are sentences and subclauses of sentences. Figure 4.13 shows a parallel sentence pair where the corresponding clauses are color-coded. An RST structure is a hierarchical tree, which typically comprehends the whole discourse of a text. The RST analyses of both sentences are similar in terms of relations, though they differ in the order of their clauses. The example reveals that we cannot assume monotonicity for the subclauses of a sentence.

Alongside the variable structure of sub-clauses within sentences, part-of-speech and syntactic choices vary considerably between languages (see the example in Figure 4.14). Moreover, lexical choices are not guaranteed to be consistent between translations. To demonstrate the lexical deviation of translators, we queried the FEP6 corpus in Multilingwis (see Section 5.2) for the English term 'key point' with English as source language.48

46 Or, as Kay and Röscheisen (1993) put it: "a pair of sentences containing an aligned pair of words must themselves be aligned."
47 For multilingual alignment, the complexity would grow polynomially with the number of languages as exponent.

English Indeed, it is by taking real action closer to citizens, by simply talking to them about Europe, that they can get a clearer picture of what the European Union does for them in their daily lives.

German Tatsächlich kann man den Bürgern nur ein klareres Bild davon vermitteln, was die Europäische Union für sie in ihrem jeweiligen Alltag tut, indem man ihnen mit echten Aktionen näherkommt und einfach mit ihnen über Europa spricht.

Figure 4.13 – Corresponding parts of a parallel sentence. The clauses highlighted in orange and yellow explain by means of what the citizens ('they') would 'get a clearer picture'. Following Rhetorical Structure Theory (RST), one would thus connect both clauses (satellites) to the green clause (nucleus) using the 'Means' relation. The green clause, in turn, connects to the previous sentence using an 'Elaboration' relation as it elaborates on what has been said before.

As a result, we get six frequent (i.e., more than three occurrences) translation variants in Finnish ('avainkohta', 'tärkeä kohta', 'keskeinen seikka', 'keskeinen kohta', 'keskeinen asia' and 'avainasia'), five in German ('wichtiger Punkt', 'Kernpunkt', 'zentraler Punkt', 'Hauptpunkt', 'wesentlicher Punkt'), four in French ('point clé', 'point essentiel', 'points-clés' and 'point important') and Italian ('punto chiave', 'punto principale', 'punto fondamentale', 'punto essenziale') and three in Spanish ('punto clave', 'punto fundamental', 'aspecto clave'). In total, we see 41 search hits.

Literal translations of 'key point' into the three Romance languages are 'point clé' (11 occurrences), 'punto chiave' (16 occurrences) and 'punto clave' (17 occurrences). These are also the most frequent translation variants per respective language.49 The number of search hits where all three translations appear together, however, is smaller than expected; we can expect at most 11 cases, but there is only one (9 %). For the language pair Italian/Spanish, we see 4 out of 16 (25 %), for French/Spanish 4 out of 11 (36 %) and for French/Italian 5 out of 11 (45 %). The distribution of translation variants indicates that lexical choice – provided that there are semantically close alternatives, such as in the case of 'key point' – depends considerably on the respective translator's preference.50

48 The translated sentences are direct translations since English is one of the European Parliament's relay languages.
49 When we do not restrict the source language to English, the most frequent translation variant for French is 'point essentiel' with twice as many occurrences as 'point clé'.
50 Brown, V. J. Della Pietra et al. (1993) comment on that topic: "A string of English words, e, can be translated into a string of French words in many different ways. Often, knowing the broader context in which e occurs may serve to winnow the field of acceptable French translations, but even so, many acceptable translations will remain; the choice among them is largely a matter of taste."

Figure 4.14 – An English/German parallel sentence pair showing syntactic varia- tion. Corresponding tokens are connected by straight lines.

Besides obtaining statistical models of word correspondences for machine translation, a variety of applications for word alignment has been mentioned in the literature. Closely related to machine translation are machine-aided translation and terminology extraction. Bilingual (and multilingual) lexicography is a corpus-based field that relies on corpus statistics and examples, especially 'good dictionary examples' (GDEX) (Kilgarriff, Husák et al. 2008). Good corpus examples also matter for language learners (Volodina et al. 2012). We have compared different parallel corpus query systems also in consideration of their usefulness for language learners (Volk, Graën and Callegaro 2014). Other applications include word sense disambiguation (Diab and Resnik 2002), syntactic transfer (Bouma et al. 2008) and typology studies (Mayer and Cysouw 2012).

4.4.1 Approaches

The word alignment task deals with the identification of corresponding tokens in, typically, two parallel sentences. The idea that led to the introduction of the so-called IBM translation models, or IBM models for short (Brown, V. J. Della Pietra et al. 1993), arose from the availability of parallel corpora and sentence alignment to be applied thereupon. These methods provided the source material, parallel sentences, from which to learn word translation probabilities. Word alignment algorithms are designed to search for the most probable configuration of word alignments by maximizing the conjoint probabilities. Later works took on the original concepts and replaced components with more sophisticated ones. Several publications suggest generally looking for larger units and aligning sequences of words, referred to as phrases, instead of defaulting to single word alignments. Some of these approaches resort to word alignment to base their models thereon.

With recent developments in computer hardware, traditional approaches in statistical machine translation were, to a large extent, abandoned in favor of deep neural network approaches, known as deep learning (see, for instance, Bahdanau et al. 2015). Like traditional machine learning methods, those networks learn from a large amount of human-translated sentences. The difference is that they learn in an independent and distributed way, that is, manually engineered features such as the ones learned by the IBM models are not required anymore. If a deep neural network succeeds in 'translating' sentences correctly, it has learned a myriad of miniature features, represented by the neurons, which are distributed over multiple layers and together reproduce the implicit knowledge of the correct translations seen during the training phase. As beneficial as this architecture is from a machine learning point of view, the drawback is that we cannot educe word alignments from neural machine translation models, as they are not explicitly represented. Recent development in seminal word alignment approaches (Gal and Blunsom 2013; Östling and Tiedemann 2016) indicates that there is a continuing interest in word alignment, although the most prominent 'client', statistical machine translation, had dropped out of the game even before deep learning methods became popular.51

The IBM Translation Models

The aforementioned IBM translation models consist of five definitions of probability distributions that intend to approximately capture the conditional probabilities of words in two parallel sentences. They have been developed with computational complexity in mind and thus make some concessions as to linguistic reality; for the statistical model, that means assuming independence where dependence is hard to disclaim.

The first model is inexpensive to calculate, but makes simplifying assumptions that are known not to hold in the real world, namely that the number of words in both sentences is independent, as is the position of corresponding words. To this end, model 1 employs uniform probability distributions for both variables.

51 According to Koehn (2010), "IBM models are no longer the state of the art in translation modeling."

Model 2 also takes into account the absolute positions of each pair of words. It uses the probability distribution calculated by its predecessor. While model 1 only performs translation on a lexical level, model 2 rewards the words that are in the correct positions. For both these models, the authors acknowledge that they "lead to unsatisfactory alignments". They describe them as "spiritually deficient".

The novelty that model 3 introduces is an additional distribution that models the nature of word-translation relationships, called fertility. The fertility of a word in the source language stands for the number of target language words it translates to. To produce the English expression 'I believe' from Spanish 'creo', we need to generate two target words given 'creo' as source. In cases with more lexical variation (e.g., German 'entsprechend' 'corresponding' to French 'conforme', 'correspondant' (one word), 'd'autant', 'en conséquence' (two words), 'en fonction de', 'à la hauteur' (three words), up to 'à la juste mesure de'), the fertility distribution reflects lexical preferences of the source word in matters of target word numbers.

Model 4 respects the observation that phrases typically translate to phrases. In the majority of phrases, their words occupy adjacent positions in a sentence. Exceptions are, for instance, the French negation 'ne … pas', where the negation parts enclose the finite verb, or fixed expressions with variable parts (e.g., cardinal numbers). Internal reordering of phrases as required, for instance, by noun phrases with adjectives when translating between Germanic and Romance languages is achieved with the help of word classes. Brown, V. J. Della Pietra et al. (1993) define 50 classes as described in (Brown, Desouza et al. 1992). Both adjacent and distant positions of phrase elements are encoded in model 4.52

The final IBM model, model 5, straightens out the deficiencies of the previous models. Brown, V. J. Della Pietra et al. (1993) define deficiency like this: "When a model has this property of not concentrating all of its probability on events of interest, we say that it is deficient." They assert that "In Model 4, not only can several words lie on top of one another, but words can be placed before the first position or beyond the last position in the French string." Previous models have not been designed to generate usable alignments, but "Models 1-4 serve as stepping stones to the training of Model 5." It is thus IBM model 5 that effectively generates word alignments for us. A more detailed discussion of the IBM translation models can be found in (Koehn 2010, Chapter 4) or (Tiedemann 2011, Section 5.1).
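To illustrate the kind of lexical model the IBM models start from, here is a textbook-style sketch of IBM model 1's EM training; it omits the NULL word and any position modeling and is not the thesis' code.

```python
# IBM model 1: learn lexical translation probabilities t(target | source) by
# expectation maximization, starting from a uniform distribution.
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """corpus: list of (source_tokens, target_tokens) sentence pairs."""
    src_vocab = {w for src, _ in corpus for w in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))    # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                   # expected pair counts (E-step)
        total = defaultdict(float)
        for src, tgt in corpus:
            for w_t in tgt:
                norm = sum(t[(w_t, w_s)] for w_s in src)
                for w_s in src:
                    delta = t[(w_t, w_s)] / norm
                    count[(w_t, w_s)] += delta
                    total[w_s] += delta
        for (w_t, w_s), c in count.items():          # re-estimate t (M-step)
            t[(w_t, w_s)] = c / total[w_s]
    return t

pairs = [(['la', 'casa'], ['the', 'house']), (['la', 'puerta'], ['the', 'door'])]
t = ibm_model1(pairs)
print(t[('the', 'la')])   # grows towards 1 over the iterations
```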

52 As far as our experience goes, this method does not handle long-distance cases well, such as particle verb prefixes in German (see Section 3.2.2).

Extensions to the IBM Models

Most subsequent works on word alignment build on the IBM models and try to improve them by modifying components or adding a layer. Vogel et al.'s (1996) contribution is the use of a Hidden Markov Model (HMM) to model word positions. They argue that "words are not distributed arbitrarily over the sentence positions, but tend to form clusters". The HMM is thus meant to capture "the strong dependence of [the alignment $a_j$ at position $j$] on the previous alignment". As their approach deals with positioning words in the target language, it can be used to replace IBM model 2. In (Och and Ney 2000), the standard IBM models and Vogel et al.'s (1996) HMM solution for the word position problem are compared and extended by a dictionary that increments the weight of its entries with regard to the alignment probabilities learned from the training corpus.

The GIZA++ software implementation of the IBM models plus the HMM extension (Och and Ney 2003) established a de facto standard for word alignment and is still widely used.53 Apart from implementing the existing models, they complement the application with a new model, a combination of HMM and IBM model 4, which they refer to as IBM model 6. By combining both models that target the position of words, in the source and the target language, they aim at obtaining better alignments. GIZA++ makes use of the expectation maximization (EM) algorithm (Dempster et al. 1977), which iteratively maximizes the likelihood of the training data, to obtain the optimal alignment, also referred to as Viterbi alignment (Brown, V. J. Della Pietra et al. 1993). The algorithm consists of two alternating steps, each of which handles one direction of the dependency between parameters and latent variables of the model.54 The EM algorithm may, in theory, require an infinite number of iterations to converge. In practice, few iterations are typically sufficient to obtain satisfactory values.

The aforementioned models have one characteristic in common: They generate alignments only in one direction, from source to target language, that is, it is possible that a word in the source language is aligned to multiple words in the target language but not the other way round; each target language word is aligned to exactly one source language word, including the nonexistent 'empty word' that is introduced into every source language sentence exactly to account for the target language words that do not have a counterpart in the source language.

53 GIZA++ builds on the application Giza by Al-Onaizan et al. (1999). Development on GIZA++ and its multi-processor variant MGIZA++ seems to have ceased in the meantime, though.
54 The EM algorithm is not guaranteed to find the global maximum, and from model 2 onward we cannot be sure that there is no more than one maximum (Vogel et al. 1996), which means that the result of the EM algorithm can be different for different starting points.

It is easy to show that one-to-many alignments cannot cope with real-world correspondences between any two languages. The Spanish expression 'cada vez más' 'increasingly' can be expressed by 'sempre più' in Italian. It is perfectly acceptable to align 'cada vez' 'each time' with 'sempre' 'always' and 'más' 'more' with 'più' 'more'. If we compare these expressions with 'de plus en plus' in French, we can align both 'plus' 'more' with 'más' or 'più', but 'de' 'from' + 'en' 'to' is not a match for 'each time' or 'always'. Consequently, we would prefer to align the whole expressions, which gives us a two-to-three alignment (Italian/Spanish), a two-to-four alignment (Italian/French) and a three-to-four alignment (Spanish/French).55

A common way to obtain many-to-many alignments from GIZA++ is to train the alignment model in both directions (i.e., to interchange source and target language) and symmetrize the results. Symmetrization means to derive symmetric alignments from both lists of asymmetric relations. Different symmetrization techniques have been found to serve different purposes (for an overview see Och and Ney 2003, pp. 32–33; Koehn, Axelrod et al. 2005; Koehn 2010, Section 4.5.3; Tiedemann 2011, pp. 75–77; Östling 2015, Section 2.3.8.4).

It is plausible that the symmetrization step after alignment adds another potential source of error, as Liang et al. (2006) say. This shortcoming of the prevalent IBM models prompted them to develop an alignment model that yields symmetric alignments directly. To this end, they additionally train two HMM models, one for each direction, jointly. That means that, during training, their parameter optimization algorithm takes into account the probability that the alignments in a sentence suggested by one of the directional models could have been produced by both models.

Unlike the downstream combination of two models by means of symmetrization, their single model generates alignments with an inherently high degree of agreement.56 According to their evaluation, "intersecting the predictions of two directional models outperforms each model alone." They compared the results of IBM models 1 and 2 and Vogel et al.'s (1996) HMM model, once in their hitherto existing variant and once with joint training of the respective model. That means they integrated their joint alignment model into each one of those models.

As the final alignment set (AS) of a sentence, previous approaches resort to its most probable AS, called the Viterbi alignment. Liang et al. (ibid.) introduce a variant that constructs the resulting AS from comparably good alignment units (AUs), called posterior decoding, which uses a threshold for the posterior probabilities of each AU.

55 Even if we approved the alignment of 'cada vez' and 'de' + 'en', this would be beyond the scope of the IBM models.
56 We noticed that their application, the Berkeley Aligner, shows a tendency to rather leave some words unaligned than to forcefully find an alignment. This is opposite to our observation with GIZA++, where infrequent types tend to act as 'garbage collectors' (Moore 2004).

This variant allows the exclusion of less probable AUs and thus generally leads to a better overall alignment (AS). They implemented both the joint HMM training and posterior decoding in an application called the Berkeley Aligner.
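The following sketch shows the idea of symmetrization: starting from the high-precision intersection of both directional link sets and growing it with adjacent links from the union, in the spirit of the grow-diag heuristics. The published heuristics differ in details such as additionally requiring one of the words to be unaligned; this simplification is ours.

```python
# Sketch of alignment symmetrization over two directional link sets.
def symmetrize(src2tgt, tgt2src):
    """Both arguments are sets of (i, j) links between token positions."""
    union = src2tgt | tgt2src                 # high recall
    aligned = set(src2tgt & tgt2src)          # intersection: high precision
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    grew = True
    while grew:                               # add union links next to accepted ones
        grew = False
        for i, j in sorted(union - aligned):
            if any((i + di, j + dj) in aligned for di, dj in neighbors):
                aligned.add((i, j))
                grew = True
    return aligned
```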

Alignment of Phrases

A different idea to overcome the limitation to one-to-many alignments of target language words is to align phrases instead.57 The main advantage is that the alignment problem is reduced to identifying one-to-one alignments of phrases, which means that fertility is not a problem anymore and, in addition, there are fewer items to align (given that words and phrases do not coincide). This poses a new challenge: how to partition sentences into phrases in the first place. Marcu and Wong (2002) present a phrase-based alignment model, which is learned from parallel sentences assuming 'concepts' that are represented in both languages. When applying the model after training, in the decoding step, they choose an initial setup of phrases and alignment and then repeatedly sample by applying small changes to the current configuration until they reach a (local) probability maximum.

DeNero, Gillick et al. (2006) foreshadow the idea of hierarchical alignments when they argue that word and phrase alignments differ from a probabilistic point of view. The former improve with subsequent iterations of the learning algorithm that estimates alignment probabilities while the latter worsen. This is due to the fact that phrase boundaries are also estimated and, over several iterations, converge to one particular optimal segmentation of sentences into phrases. They argue that "if one segmentation subsumes another, they are not necessarily incompatible: both may be equally valid." Approaches that learn segmentation jointly with alignment probabilities (such as Marcu and Wong 2002) thus tend to overfit the training data.58 Phrase alignment approaches that build on top of word alignments rather than learning phrases do not suffer from this degradation.

DeNero and D. Klein (2008) show that finding an optimal phrase alignment is a complex problem (NP-hard) when all possible ASs of a parallel sentence pair are considered, that is, the Cartesian product of all possible segmentations in source and target language that yield the same number of phrases. However, they show how to express the optimization as a well-understood constraint problem (integer linear programming, ILP) that can be solved efficiently.

57The term phrase is to be understood as any sequence of words as those statistical models have no notion of linguistic content. In fact, Koehn, Och et al. (2003) state: "Learning only syntactically motivated phrases degrades the performance of our systems." The definition of phrases as sequences excludes discontinuous expressions. 58Riley and Gildea (2012) characterize the problem of previous approaches to favor longer over shorter phrases as that the longer phrases "explain the training data well but are unlikely to generalize".

Bayesian Models

In (DeNero, Bouchard-Côté et al. 2008), the authors approach the search problem for phrase alignments by letting a Gibbs sampler, a Markov chain Monte Carlo (MCMC) method (for a comprehensive description, see Neal 1993, Section 4), approximate the joint distribution of multiple random variables for estimating phrase pair counts. They start with a random alignment of phrases (i.e., sequences of tokens) as initial state and continue sampling new states by applying small changes to the respective previous state (e.g., exchanging phrases between two AUs, joining or splitting AUs). The advantage of this sampling method is that it is computationally tractable and, unlike DeNero and D. Klein's (2008) approach, guaranteed to converge to the posterior distribution of the model.

Gibbs sampling starts, like the EM algorithm, with an initial configuration with no special requirements apart from being a valid one, that is, both sentences have been segmented such that each word forms part of exactly one phrase. In DeNero, Bouchard-Côté et al.'s (2008) case this means a segmentation of both sentences into phrases and an AS built thereon. The sampling then consists in "repeatedly replacing each component with a value picked from its distribution conditional on the current values of all other components" (ibid.). That is, at each iteration, the configuration is frozen apart from one variable. This variable receives a new value which is exclusively determined by the values of the other variables. After many iterations (an infinite number in theory), they obtain each variable's distribution from the samples collected, since "the variable assignments sampled during all iterations will approach the true distribution according to the model" (Östling 2015).

In (Riley and Gildea 2010, 2012), the authors describe their experiments with variational Bayes, which is, like Gibbs sampling, a technique to approximate Bayesian inference. For that purpose, they reimplement the maximization step of the EM algorithm in GIZA++ by modifying the formulae that are used to calculate translation probabilities (i.e., that two words 'translate' to each other) and alignment probabilities (i.e., that words at two particular positions are aligned with each other), which are used by all models.59 The main objective of adopting variational Bayes is to control the effect of overfitting, which becomes manifest in the case of low-frequency words being aligned to several words in the other language due to increased likelihood.60 In conformance with (Mermer and Saraçlar 2011), they infer from their evaluations that Bayesian inference methods outperform the classical EM algorithm.

59They follow the HMM implementation with variational Bayes by (Beal 2003, Section 3.4). 60Brown, S. A. Della Pietra et al. (1993) themselves state: "Rare words have a tendency to act as garbage collectors in our system."

Another variant of GIZA++ that rests upon Bayesian inference has been developed by Gal and Blunsom (2013).61 They introduce the hierarchical Pitman-Yor process, a generalization of the Dirichlet process (see Östling 2015, Sections 2.4.4 and 2.4.5), to word alignment as a replacement for categorical distributions, such as those over the positions of aligned words. The idea behind Pitman-Yor processes is that they can produce power-law distributions, which are a characteristic of natural language (Goldwater et al. 2006, Chapter 2), and that they allow modeling the probabilities of word sequences of different lengths jointly.

Östling (2015) gives a comprehensive overview of different Bayesian models for word alignment including all the aforementioned concepts. He focuses on the application of Gibbs sampling (see also Goldwater 2007), in particular collapsed Gibbs sampling, where some variables are integrated out, which renders the sampling process more efficient. He also experiments – also by means of Gibbs sampling – with multilingual alignment, a challenge that, to our knowledge, only Lardilleux and Lepage (2009) and Mayer and Cysouw (2012) have met before. The algorithm described in (Östling 2014) for "simultaneous word alignment in massively parallel corpora" introduces statistically generated 'concepts' with which words in the respective languages are aligned.

The word aligner efmaral (Östling and Tiedemann 2016) uses a collapsed Gibbs sampling algorithm based on Vogel et al.'s (1996) HMM model plus a model for fertility. Its authors not only report an alignment quality similar to GIZA++ (and, according to our evaluation, consequently better than fast_align; see below) but also a performance gain, as the sampling calculations are less complex than in previous approaches.62

Sub-sentential Alignment

The multilingual word aligner Anymalign (Lardilleux and Lepage 2009; Lardilleux, Lepage and Yvon 2011; Lardilleux, Yvon et al. 2012) takes an orthogonal approach to identifying corresponding words and phrases. Instead of primarily looking for high-frequency word correspondences as stable hubs, it focuses on hapax legomena, those words that only appear once in a text. The idea is presented in (Lardilleux and Lepage 2007).

The authors describe a vector space that is spanned by the parallel sentences as dimensions. Words are expressed by vectors with a positive value for each dimension (i.e., sentence) they appear in; the value represents the number of occurrences in that sentence. The angle between vectors of different languages connotes how well occurrences of the respective words correspond to each other; they call this measure translation distance.

61Unfortunately, the application they describe has never been released. 62We have also seen results comparable to GIZA++ in our experiments, and efmaral is at least an order of magnitude faster than GIZA++.

A translation distance of 0 identifies words that share the same distribution in the parallel sentences provided. Having found alignments in the whole corpus by that means, different subsets of it, called subcorpora, are treated the same way. This method repeats with other randomly sampled subcorpora until terminated by the user. The resulting alignments of words and phrases are accompanied by translation probabilities and lexical weights, both concepts introduced by Koehn, Och et al. (2003), to measure alignment quality. This approach successfully exploits the Zipfian distribution of words in a corpus, very much like the characteristics of a Pitman-Yor process (Goldwater et al. 2006; Teh 2006).

A completely different case of sub-sentential alignment are parallel treebanks. Parallel treebanks have been employed to study the syntactic correspondence of two languages. The linking of corresponding leaves (words) and nodes (syntactic constituents) has typically been done manually, for instance with the help of tools such as the Stockholm TreeAligner, TreeAligner for short (Volk, Lundborg et al. 2007; Lundborg et al. 2007). TreeAligner allows a human annotator to link words or syntactic constituents of two languages, classifying them into two categories (referred to as 'exact' or 'good' and 'approximate' or 'fuzzy'). It requires parallel sentences with syntactic constituency trees for both languages. The application can also be used to query its own treebanks. Another, more recent tool for queries on aligned parallel corpora is ANNIS3 (Krause and Zeldes 2014), which can also handle treebanks such as the ones produced by the TreeAligner.

The manual creation of treebanks is time-consuming work. Zhechev and Way (2008) and Zhechev (2009) present an approach, referred to as sub-tree aligner, to automatically generate treebanks from parallel corpora using constituency parsers for both languages and a word aligner.63 Another approach to build treebanks automatically is described in (Tiedemann and Kotzé 2009a,b). The authors employ a probabilistic model that is based on association features such as the lexical probabilities already used by the sub-tree aligner, combined with structural relations from the syntax tree. Unlike the sub-tree aligner, their model needs hand-crafted syntactic alignments to learn from. The application Lingua-Align (Tiedemann 2010) implements their approach.

Our definition of multilingual hierarchical word alignment and the hierarchical structure of treebanks have in common that both build on standard word alignment units that, in turn, form higher-level aligned units. In contrast to treebanks, hierarchical word alignment units are not targeted on representing syntactic structures.

63They also explain how their method could be adapted to cases where only one language is parsed, which would make their sub-tree aligner a syntactic transfer algorithm.

Other Approaches

Triangulation means using a third language to improve some technique originally applied to two languages. Alongside many other applications, triangulation has been successfully applied to information retrieval (L. A. Ballesteros 2002), annotation transfer of grammatical structures (Bouma et al. 2008), the creation of dictionaries for sentiment analysis (J. Steinberger et al. 2012) and various aspects of machine translation (Cohn and Lapata 2007; Y. Chen et al. 2008; Wu and Wang 2009; Crego et al. 2010). The first reported application of triangulation to word alignment is called pivot alignment (Borin 2000a,b) and consists in cascading two previously obtained word ASs.64 Pairwise word alignment is performed on all combinations of three parallel sentences (in the source, target and pivot language). The resulting AS between source and target language is then defined as the union of the AS from the source-to-target language alignment and the AS obtained by intersecting both ASs of source-to-pivot and pivot-to-target language. Since the underlying word aligner segments the parallel sentences into so-called multiword 'link units' (Tiedemann 2000), the pivot alignment method is capable of identifying phrases that have not been found by the direct alignment of source and target language.65

A new word alignment method based on the integration of so-called alignment clues (also referred to as association clues) is presented in (Tiedemann 2003a, 2004). Clues are indications from different sources of which pairs of words should be aligned with each other. Tiedemann distinguishes declarative clues, which are binary alignment indications coming from linguistic resources, and estimated clues, which, in contrast, are indications from relative measures such as word alignment models. Different clues are combined using source-specific weights. The resulting evidence for each combination of source and target language word can then be used to construct the optimal AS for a particular requirement (in terms of precision vs. recall and focus on single words or multiword units).

Another, more recent application derived from the IBM models is available under the name fast_align (Dyer et al. 2013). The authors argue that models 1 and 2, which rely on sequences and can thus be calculated easily in comparison to the higher-level models, are suboptimal by design. They propose a stand-alone replacement for model 2 with "[e]fficient inference, likelihood evaluation, and parameter estimation algorithms". Although consistently fast in our experiments (sometimes beaten by efmaral), the alignments calculated by fast_align always turn out to be the worst in our evaluation (see below).

64The initial word alignment was obtained by means of the Uppsala Word Aligner (Tiedemann 2000; Hein 2002, Section 6.1), which forms part of the Uplug tool (Tiedemann 2002). 65As phrases are determined by segmentation, that is, by modifying tokenization in hindsight, "only contiguous phrases are identified."

The problem of "translation units that are smaller than a word or whose endpoints are not marked by word boundaries" is treated in (Kay 2004). His work is motivated by the fact that word boundaries are to some extent language-specific. Different languages organize meaning into a different number of words. The noun phrase 'mine-clearing works' corresponds to one word in German ('Minenbeseitigungsarbeiten') and five words in Spanish ('trabajos de la limpieza de minas'), which inspired us to propose a hierarchy to represent multilingual alignments (Graën and Clematide 2015). Kay's (2004) idea is to disregard word boundaries and instead first identify substrings to be aligned, for which he avails himself of suffix trees.

4.4.2 Evaluating Word Alignment

Evaluation of word alignment can be divided into those methods that measure the indirect effect of an alignment approach with regard to a particular application or task and those that try to capture alignment quality directly by comparison with an alignment gold standard.66

Table 4.9 – Contingency table for alignment evaluation. A set of gold AUs ($\mathcal{G}$) is compared to a set of test AUs ($\mathcal{T}$). If an AU $A$ forms part of both sets, it counts as true positive (TP). True negatives (TN) cannot be expressed in terms of these two sets.

|                        | $A \in \mathcal{G}$ | $A \notin \mathcal{G}$ |
|------------------------|---------------------|------------------------|
| $A \in \mathcal{T}$    | TP                  | FP                     |
| $A \notin \mathcal{T}$ | FN                  | TN                     |

(a) Test results as events

|                        | $A \in \mathcal{G}$                 | $A \notin \mathcal{G}$              |
|------------------------|-------------------------------------|-------------------------------------|
| $A \in \mathcal{T}$    | $\mathcal{T} \cap \mathcal{G}$      | $\mathcal{T} \setminus \mathcal{G}$ |
| $A \notin \mathcal{T}$ | $\mathcal{G} \setminus \mathcal{T}$ | –                                   |

(b) … in set notation

For the latter, the most natural way to determine alignment quality is to count an alignment unit (AU) as correct if it appears in both the test alignment set (test AS: $\mathcal{T}$) and the gold alignment set (gold AS: $\mathcal{G}$). Besides those correctly identified AUs, the true positives (TP), we also find AUs that are only present in the gold or the test AS – assuming that the test AS is not entirely correct, which would be tantamount to both ASs being equal. These non-matching AUs are named false positives (FP) if they are found in the test AS but not in the gold AS (i.e., the alignment algorithm incorrectly identified an AU), and false negatives (FN) in the opposite case (i.e., the algorithm fails to identify a correct AU). These cases are depicted in Table 4.9a.

66Lardilleux, Gosme et al.'s (2010) proposal of an evaluation method based on comparison with automatically generated bilingual lexicons does not fit well into this scheme. They target improbable language pairs where the creation of a gold standard would be exceedingly expensive.

English: Can we afford to risk that kind of relationship?
German: Können wir es uns erlauben, diese Beziehung zu gefährden?
Swedish: Har vi råd att riskera den förbindelsen?

Gold alignment set $\mathcal{G}$:

| AU | English | German | Swedish |
|----|---------|--------|---------|
| $\mathcal{G}_1$ | Can | Können | – |
| $\mathcal{G}_2$ | we | wir | vi |
| $\mathcal{G}_3$ | afford | uns, erlauben | – |
| $\mathcal{G}_4$ | to | zu | att |
| $\mathcal{G}_5$ | risk | riskieren | riskera |
| $\mathcal{G}_6$ | that, kind, of | diese | den |
| $\mathcal{G}_7$ | relationship | Beziehung | förbindelsen |
| $\mathcal{G}_8$ | Can, we, afford | Können, wir, uns, erlauben | Har, vi, råd |
| $\mathcal{G}_9$ | to, risk | zu, riskieren | att, riskera |
| $\mathcal{G}_{10}$ | that, kind, of, relationship | diese, Beziehung | den, förbindelsen |

Test alignment set $\mathcal{T}$:

| AU | English | German | Swedish |
|----|---------|--------|---------|
| $\mathcal{T}_1$ | Can | Können | – |
| $\mathcal{T}_2$ | afford | uns, erlauben | – |
| $\mathcal{T}_3$ | to | zu | att |
| $\mathcal{T}_4$ | risk | riskieren | riskera |
| $\mathcal{T}_5$ | that | diese | den |
| $\mathcal{T}_6$ | relationship | Beziehung | förbindelsen |
| $\mathcal{T}_7$ | Can, we, afford | Können, wir, uns, erlauben | Har, vi, råd |
| $\mathcal{T}_8$ | to, risk | zu, riskieren | att, riskera |
| $\mathcal{T}_9$ | that, kind, of, relationship | diese, Beziehung | den, förbindelsen |

Figure 4.15 – Gold and test alignment sets for hierarchical word alignment of three parallel sentences. Eight alignment units occur in both sets.

Figure 4.15 exemplifies this evaluation paradigm by means of three parallel sentences together with a gold and a test word alignment set. We anticipate here the evaluation of multilingual alignment described in Section 4.5.2, but the evaluation method based on counting successes and failures is equally suitable for bilingual alignment. In total, eight AUs can be identified that are identical in both ASs. Two AUs from the gold AS ($\mathcal{G}_2$ and $\mathcal{G}_6$) cannot be found in the test AS and one AU from the test AS ($\mathcal{T}_5$) does not exist in the gold AS. That means we have eight true positives (TP), two false negatives (FN) and one false positive (FP).

The last case shown in Table 4.9a, true negatives (TN), refers to those AUs that are not present in either AS. In medical diagnostics, for instance, the TN value indicates the number of people tested who have correctly been diagnosed as not having a particular condition; every test is an event and the total number of tests corresponds to the sum of all four event classes. With respect to word alignment, we cannot speak about events and thus the notion of correctly not identified AUs is not clear. This also relates to the fact that we cannot define true negatives in terms of two ASs (see Table 4.9b). We can, however, calculate the number of possible one-to-one alignments as the product of the number of words for each pair of parallel sentences ($n \cdot m$) and subtract all the other countable events to obtain a substitute for the true negative events ($TN \mathrel{\hat{=}} n \cdot m - (TP + FP + FN)$).

There is no need to know the number of true negatives to calculate precision (P) and recall (R), two measures commonly employed to judge classification tasks.67 Here, precision is the ratio of how many of the AUs identified by an algorithm are correct (i.e., also part of the gold AS) and recall indicates how many of the correct AUs have been found:

$$P = \frac{|\mathcal{T} \cap \mathcal{G}|}{|\mathcal{T}|} \qquad R = \frac{|\mathcal{T} \cap \mathcal{G}|}{|\mathcal{G}|} \tag{4.21}$$

As shown in Table 4.9, $|\mathcal{T}|$ can be expressed as TP + FP, $|\mathcal{G}|$ as TP + FN, and the cardinality of both sets' intersection ($|\mathcal{T} \cap \mathcal{G}|$) corresponds to the number of true positives. The F-Score (or F-Measure) integrates both measures. An additional parameter $\beta$ controls whether more emphasis is put on precision or recall. If $\beta$ is not specified, F-Score typically refers to the balanced F1-Score, which is the harmonic mean of precision and recall:

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{(\beta^2 \cdot P) + R} \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R} \tag{4.22}$$

67In fact, Manning, Schütze et al. (1999), who explain these concepts in the context of information retrieval, argue that the number of true negatives "is huge, and dwarfs all the other numbers" and advocate the use of precision and recall.
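Both measures and the balanced F1-Score translate directly into set operations. A minimal Python sketch, with AUs represented as hashable sets; the example sets merely mimic the counts of Figure 4.15:

```python
def precision_recall_f1(test, gold):
    """test, gold: sets of alignment units, e.g. frozensets of tokens."""
    tp = len(test & gold)                       # |T ∩ G|
    p = tp / len(test)                          # Equation 4.21, precision
    r = tp / len(gold)                          # Equation 4.21, recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # Equation 4.22, balanced F1
    return p, r, f1

# Eight shared AUs plus the differing units, mimicking Figure 4.15.
shared = {frozenset({i}) for i in range(8)}
gold = shared | {frozenset({'we'}), frozenset({'that', 'kind', 'of'})}
test = shared | {frozenset({'that'})}
print(precision_recall_f1(test, gold))  # (0.888..., 0.8, 0.842...)
```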

In our example in Figure 4.15, precision is 0.889 (8/9), recall is 0.8 (8/10) and the F-Score yields 0.842 ($16/19 = 8/9.5$). For some manual word alignments, the annotators were given two options to classify alignment links between two words: They could be marked as 'sure' if the correspondence was clear and as 'possible' in not so obvious cases, for instance, for "words within idiomatic expressions and free translations and missing function words" (Och and Ney 2003). In other works (for instance, Mihalcea and Pedersen 2003), an AU was considered 'sure' only if the annotators agreed in aligning the AU in question and 'possible' if aligned by at least one annotator. Every AU in $\mathcal{G}$ is by definition possible, whereas the sure AUs form a subset thereof.68 On that condition, precision and recall from Equation 4.21 are redefined:

$$P = \frac{|\mathcal{T} \cap \mathcal{G}^{possible}|}{|\mathcal{T}|} \qquad R = \frac{|\mathcal{T} \cap \mathcal{G}^{sure}|}{|\mathcal{G}^{sure}|} \tag{4.23}$$

The definition of precision does not change since $\mathcal{G} \mathrel{\hat{=}} \mathcal{G}^{possible}$. Recall, in contrast, is now only calculated on the basis of the (smaller) set of sure AUs. An alignment algorithm that identifies all sure AUs but none of the possible ones will attain the maximum values for both precision and recall.

A common metric to evaluate alignment quality, the alignment error rate (AER), has been proposed by Och and Ney (2003) and since then, though criticized (Ayan and Dorr 2006; Vilar et al. 2006; Fraser and Marcu 2007; Ahrenberg 2012), used in numerous evaluations:

$$\mathrm{AER} = 1 - \frac{|\mathcal{T} \cap \mathcal{G}^{sure}| + |\mathcal{T} \cap \mathcal{G}^{possible}|}{|\mathcal{T}| + |\mathcal{G}^{sure}|} \tag{4.24}$$

When no distinction is made between sure and possible AUs, AER becomes the inverse of the F1, that is, AER and F1 add up to 1.69 For the sake of completeness, this relation is detailed in Appendix B.1. Null alignments are not covered by AER as the ASs are conceived to only contain one-to-one alignment links.

Most effort on the application-specific (i.e., extrinsic) evaluation of word alignment methods is directed at evaluating alignment quality in terms of the effect on statistical machine translation (SMT) systems. Common metrics for SMT are BLEU (bilingual evaluation understudy) (Papineni et al. 2002), METEOR (metric for evaluation of translation with explicit ordering) (Banerjee and Lavie 2005) and TER (translation edit rate) (Snover et al. 2006). These metrics – and others – have in common that they aim at measuring translation quality in comparison with a predetermined human translation of the same set of sentences.
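Equation 4.24 is equally compact in code; a sketch under the same set representation as above, assuming the sure AUs form a subset of the possible ones:

```python
def aer(test, gold_sure, gold_possible):
    """Alignment error rate (Och and Ney 2003); gold_sure ⊆ gold_possible."""
    return 1 - (len(test & gold_sure) + len(test & gold_possible)) \
               / (len(test) + len(gold_sure))

# Without the sure/possible distinction, i.e. gold_sure == gold_possible,
# AER and F1 add up to 1: for the counts of Figure 4.15 this yields
# 1 - 16/19 ≈ 0.16 (see footnote 69).
```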

68Tiedemann (2011) refers to the complement of sure AUs in $\mathcal{G}$ as 'fuzzy'. 69The AER score of our example in Figure 4.15 is 0.16.

Fraser and Marcu (2007) show that, when alignments are evaluated on a gold standard with sure and possible AUs, "AER is not a useful metric for predicting MT accuracy." They argue that in those cases "AER does not share a very important property of F-Measure, which is that unbalanced precision and recall are penalized." Ahrenberg (2012) criticizes that AER "is too coarse and does not reveal qualitative difference." Without the distinction between sure and possible AUs, AER is a balanced measure between precision and recall, which guarantees that algorithms that put significantly more emphasis on one of these will get considerably worse scores ($\mathrm{AER} = 1$ for an empty AS and $\mathrm{AER} \approx 1$ for an AS containing all potential AUs).

The consistent phrase error rate (CPER) (Ayan and Dorr 2006) and the translation unit error rate (TUER) (Søgaard and Kuhn 2009) use the same definition as the AER for unambiguous alignments, but the elements of their gold and test ASs are phrases and so-called translation units, which are syntactic subgraphs, respectively. By only taking into account larger units, the number of matches is expected to be lower than when counting single links between two words, since the former method disregards partial matches. Both measures together convey an image of alignment quality with regard to both the exactitude in identifying larger structures and partial matches.

To see how well an alignment algorithm identifies larger AUs in comparison with single alignment links, we calculate two ratios: First, we divide the expected number of gold alignment links from the larger gold AUs by the number of AUs in the gold AS. The function $\Lambda$ generates all alignment links as the Cartesian product of source and target language words; from a two-to-three alignment, we get six single alignment links, for instance. As a result, we obtain the average number of single alignment links per gold AU ($\Psi(\mathcal{G})$). In a second step, we calculate the same ratio for correct AUs, that is, the intersection of test and gold ASs ($\Psi(\mathcal{T} \cap \mathcal{G})$):

$$\Psi(\mathcal{G}) = \frac{|\Lambda(\mathcal{G})|}{|\mathcal{G}|} \qquad \Psi(\mathcal{T} \cap \mathcal{G}) = \frac{|\Lambda(\mathcal{T} \cap \mathcal{G})|}{|\mathcal{T} \cap \mathcal{G}|} \tag{4.25}$$

The function $\Psi$ calculates the average number of alignment links per AU, here applied to both the gold AUs and the AUs correctly identified by the alignment algorithm (i.e., the true positives). The relation of these two ratios shows how far the actual performance of the alignment algorithm deviates from what we expect based on the gold alignments. If an algorithm is as good at identifying complete AUs as it is with regard to single links, we expect both ratios to be approximately equal. If the algorithm, on the contrary, is better at identifying single links – which we expect any alignment algorithm to be –, $\Psi(\mathcal{T} \cap \mathcal{G})$ will be greater than $\Psi(\mathcal{G})$. We name this relation the alignment unit identification ratio (AUIR):

$$\mathrm{AUIR} = \frac{\Psi(\mathcal{G})}{\Psi(\mathcal{T} \cap \mathcal{G})} \tag{4.26}$$

The closer the AUIR gets to 100 %, the better the algorithm performs in identifying whole units; lower values indicate that more AUs have only been found partially: the single alignment links count as correct whereas one or more links are missing to complete some gold AUs. Since AUs define a convex hull wherein all words are aligned, these missing links must be due to unaligned words.

Table 4.10 – Frequencies and ratios for all language pairs from the example in Figure 4.15. The pair German/Swedish differs considerably from the other two pairs, which is due to the small size of the example.

| Language pair | $|\Lambda(\mathcal{G})|$ | $|\mathcal{G}|$ | $\Psi(\mathcal{G})$ | $|\Lambda(\mathcal{T} \cap \mathcal{G})|$ | $|\mathcal{T} \cap \mathcal{G}|$ | $\Psi(\mathcal{T} \cap \mathcal{G})$ | AUIR |
|---------------|------|----|-----|------|---|-----|------|
| English/German | 34 | 10 | 3.4 | 30 | 8 | 3.8 | 0.91 |
| English/Swedish | 28 | 8 | 3.5 | 24 | 6 | 4.0 | 0.88 |
| German/Swedish | 15 | 8 | 1.9 | 13 | 6 | 2.2 | 0.87 |

In Table 4.10, we calculate the aforementioned ratios for all three language pairs in the example in Figure 4.15. We get similar AUIR values for all pairs, but we would need larger examples to derive meaningful values. It is also possible to apply the same measure to the multilingual alignment set instead of language pairs; in that case, we also get an AUIR of 0.88 (7.0/8.0).
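The English/German row of Table 4.10 can be reproduced from the gold AUs of Figure 4.15; a minimal sketch with AUs given as pairs of token tuples:

```python
def links(au):
    """Λ: the Cartesian product of an AU's source and target tokens."""
    src, tgt = au
    return [(s, t) for s in src for t in tgt]

def psi(aus):
    """Ψ: average number of single alignment links per AU (Equation 4.25)."""
    return sum(len(links(au)) for au in aus) / len(aus)

# English/German gold AUs from Figure 4.15.
gold = [
    (('Can',), ('Können',)),
    (('we',), ('wir',)),
    (('afford',), ('uns', 'erlauben')),
    (('to',), ('zu',)),
    (('risk',), ('riskieren',)),
    (('that', 'kind', 'of'), ('diese',)),
    (('relationship',), ('Beziehung',)),
    (('Can', 'we', 'afford'), ('Können', 'wir', 'uns', 'erlauben')),
    (('to', 'risk'), ('zu', 'riskieren')),
    (('that', 'kind', 'of', 'relationship'), ('diese', 'Beziehung')),
]
correct = [au for au in gold if au not in (gold[1], gold[5])]  # without G2, G6
print(psi(gold), psi(correct))   # 3.4 3.75
print(psi(gold) / psi(correct))  # AUIR ≈ 0.91 (Equation 4.26)
```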

4.5 Multilingual Word Alignment

Multilingual sentence and word alignment share some properties; one of them is the curse of dimensionality. Even some bilingual word alignment algorithms cannot explore the whole search space of all possible alignments and need to resort to approximation or reduction of the search space (see Section 4.4.1). Multilingual word alignment differs from multilingual sentence alignment, though, primarily with regard to these characteristics:

1. Word alignments are non-monotonic, that is, AUs do not recreate the order of words in both languages. As a matter of fact, word order may vary considerably between languages.

2. Word alignments also vary in length, both in terms of characters and token count (e.g., 'colte di sorpresa'/'overtaken' in "sono state colte di sorpresa degli eventi"/"have been overtaken by events" (15/9 character ratio without blanks; 3/1 token ratio)). This variation is due to different levels of idiomaticity or to grammar (Borin 2000b).

3. While we expect to find at least partial correspondence for each sentence, the same is not true for words; there may be single words or phrases that are not expressed in all of our languages and thus need to remain unaligned.

4. Hierarchical multilingual word alignments require more layers to correctly represent partial correspondences.70

The same structure introduced for hierarchical multilingual sentence alignments on page 76 can be employed to represent hierarchical multilingual word alignments. Equation 4.1, which requires AUs to be in a subset/superset relationship if they have any element in common, is revisited here:

$$\forall A_1 \in \mathcal{A}, A_2 \in \mathcal{A} : A_1 \subset A_2 \lor A_1 \supset A_2 \lor A_1 \cap A_2 = \emptyset \tag{4.27}$$

Since word alignments need to be non-monotonic, Equation 4.2 does not apply.
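The constraint of Equation 4.27 can be checked mechanically; a small sketch with AUs as sets of (language, token) pairs:

```python
def is_hierarchical(aus):
    """Equation 4.27: any two AUs must be properly nested or disjoint."""
    for i, a1 in enumerate(aus):
        for a2 in aus[i + 1:]:
            if not (a1 < a2 or a1 > a2 or not (a1 & a2)):
                return False
    return True

aus = [{('en', 'to')}, {('de', 'zu')}, {('en', 'to'), ('de', 'zu')}]
print(is_hierarchical(aus))  # True: both singletons nest inside the third AU
```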

4.5.1 Our Approach to Multilingual Word Alignment

Triangulation is typically employed to transfer information from one language to another by means of a third one. This includes the transfer of evidence to strengthen good relations between source and target language and, in this manner, improve the results of the technique in question. Our prototypical approach to multilingual word alignment can be regarded as massive triangulation. We integrate evidence from multiple sources to construct binary trees of word correspondences using a hierarchical agglomerative clustering approach. Unlike our approach for multilingual sentence alignment (Section 4.3.1), where we use single-linkage clustering, we apply a variant of average-linkage clustering to multilingual word alignment to account for multiple evidence in each clustering step. It may thus be that a bilingual word alignment between Bulgarian and Slovak together with a syntactic dependency relation in Swedish and a strong collocation score in Greek jointly construct a multilingual AU.

70We already needed more layers for manual word alignment of six languages than for manual sentence alignment of 16. We were able to cover the majority of parallel sentences with two layers.

This example reflects the two types of relation that we use: bilingual word alignment as the paradigm for (direct) interlingual evidence and syntactic relations as (indirect) intralingual evidence. For every clustering decision (except for the last stage, as explained below), we require evidence from at least two different sources. That way, we aim at circumventing errors that are well known to occur in every probabilistic NLP application.71

Features

The main sources for our multilingual word alignment approach are bilingual ASs generated by the four word aligners introduced in Section 4.4.1:

1. Despite its age, GIZA++ (Och and Ney 2003, see also page 111) is probably the most widely used word aligner. It implements the IBM models 1 to 5 and an additional Hidden Markov model (HMM) (Vogel et al. 1996). Its main disadvantage is its comparably slow run-time. That is why we resort to MGIZA++ (Gao and Vogel 2008), a re-implementation of the original alignment algorithm that takes advantage of modern multi-processor systems, significantly reducing the required run-time.72 GIZA++ as well as its multi-threaded version MGIZA++ generate asymmetric alignments.73

2. The Berkeley Aligner (Liang et al. 2006, see also page 112) employs a similar strategy. By learning the parameters of IBM models 1 and 2 and HMM models jointly from the training data, they address the issue of 'garbage collector' words, i.e. low-frequency words showing a tendency to be aligned with several – non-corresponding – words in the other language, of which already the creators of the IBM models were aware (Brown, V. J. Della Pietra et al. 1993).74 Since the models are symmetric, the resulting alignments are too.75

71"two models make different types of errors that can be eliminated upon intersection" (Liang et al. 2006, on jointly training two HMMs for word alignment). 72Being identical with regard to the algorithms used, we shall refer to it as GIZA++, although the data has technically been processed with MGIZA++. 73We use 10 iterations for word class training (Och 1999) and let MGIZA++ parallelize to five threads. The alignment of 120 language pairs in both directions took more than 21 000 CPU hours (we did not measure the four additional threads separately). 74The authors report a significant drop of AER for jointly trained models. 75We use the Berkeley Aligner with the options competitiveThresholding and safeConcurrency, and let the aligner parallelize to five threads. Both options aim at increasing the alignment quality, in exchange for recall and computation time, respectively. The alignment of 120 language pairs in both directions simultaneously took more than 16 000 CPU hours (also in this case, we did not measure the four additional threads separately).

3. Dyer et al. (2013, see also page 117) propose a variant of IBM model 2 to replace the chain of IBM models used by other aligners such as GIZA++. They found that using the translation probabilities produced by model 1 as initialization of model 2 deteriorates the results and therefore use a uniform probability distribution instead. The implementation of their algorithm is called fast_align76 and generates asymmetric alignments.

4. The recent word aligner efmaral (Östling and Tiedemann 2016, see also page 115) employs Gibbs sampling, a variant of the Markov chain Monte Carlo method, for sampling the probability distribution of alignments (see also Östling 2015, Section 2.5.3). Although their alignment algorithm uses a more sophisticated statistical model, it is less complex computation-wise and outperforms GIZA++ and fast_align on average, both with regard to accuracy and run-time. As it extends the IBM models with variational Bayes (see Riley and Gildea 2012), the resulting alignments are also asymmetric.77

We run all aligners on all parallel sentences of each language pair in both directions, except for the Berkeley Aligner, which, by reason of generating symmetric alignments, is only run once per language pair.

We subsequently symmetrize the pairs of asymmetric word alignments. Several methods have been suggested for that.78 The simplest ones, union and intersection of the respective ASs, generally entail the propagation of errors (union), which leads to low precision, and the unattainability of valid n-to-m alignments (intersection), which leads to low recall. More elaborate symmetrization methods (see Och and Ney 2003, pp. 32–33; Koehn, Axelrod et al. 2005; Koehn 2010, Section 4.5.3; Tiedemann 2011, pp. 75–77; Östling 2015, Section 2.3.8.4), so-called growing heuristics, exploit the fact that multiword parts of alignments often occupy contiguous positions in the sentence.79 A method to attain symmetric word alignments directly from the translation models is described in (Matusov et al. 2004). We symmetrize alignments generated by GIZA++, fast_align and efmaral using the 'grow-diag-final-and' method. This symmetrization method is implemented by standard tools.80
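The two baseline symmetrization methods amount to plain set operations, as the following sketch shows (links as position pairs, with the target-to-source alignment already flipped into source-target order); the growing heuristics, among them 'grow-diag-final-and', start from the intersection and selectively add links from the union, which the sketch does not reproduce:

```python
def symmetrize(src2tgt, tgt2src, method='intersection'):
    """src2tgt, tgt2src: sets of (source_pos, target_pos) links;
    tgt2src is expected to be flipped into source-target order already."""
    if method == 'intersection':  # high precision, but one-to-one links only
        return src2tgt & tgt2src
    if method == 'union':         # high recall, but propagates errors
        return src2tgt | tgt2src
    raise ValueError(f'unknown method: {method}')

forward = {(0, 0), (1, 1), (2, 1)}
backward = {(0, 0), (1, 1), (1, 2)}
print(symmetrize(forward, backward))           # {(0, 0), (1, 1)}
print(symmetrize(forward, backward, 'union'))  # keeps the n-to-m candidates
```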

76We use fast_align with default options. The alignment of 120 language pairs in both directions took 7500 CPU hours. 77We use efmaral with default options. The alignment of 120 language pairs in both directions took 3100 CPU hours. 78See (Koehn, Och et al. 2003, Section 4.5) for general considerations. 79There are always exceptions to that rule, such as particle verb prefixes in German (see Section 3.2.2). For instance, 'stellt' and 'eine Notwendigkeit dar' in "In einem gemeinschaftlichen Raum ohne Binnengrenzen stellt eine Verbesserung der justitiellen Zusammenarbeit im Bereich des Strafrechts für eine effizientere Bekämpfung des organisierten Verbrechens und des Terrorismus eine Notwendigkeit dar" 'In a common borderless market, improving legal cooperation in penal law is essential to step up the fight against organized crime and terrorism.' belong to a single AU, which corresponds to 'is essential' in English.

We use the following features $\phi^y(t^1, t^2)$ for constructing edges (alignment indicators) between nodes (words, or rather tokens) $t^1$ and $t^2$ of different languages. Each feature can take numeric values from the interval $(0, 1]$, indicating a gradient degree of evidence for correspondence.

• We get most evidence on interlingual alignment edges from the bilingual word alignments that we obtain from the four aforementioned word aligners. From all four of them we get symmetric (bidirectional) alignments:

$\phi_b^m$ (MGIZA++), $\phi_b^b$ (Berkeley Aligner), $\phi_b^f$ (fast_align) and $\phi_b^e$ (efmaral). If an alignment link between two tokens exists, the value of the respective feature is 1, otherwise 0. Except for the Berkeley Aligner, which inherently only generates symmetric alignments, we also count the asymmetric (unidirectional) alignments: $\phi_u^m$, $\phi_u^f$ and $\phi_u^e$. Since cases where we have asymmetric alignments for the same two tokens in both directions are neither impossible nor improbable, we divide the link count by two so that edges built from the latter features can take three values: 0, 0.5 and 1 (in case both asymmetric alignments agree).

• In addition to bilingual alignments, we compare the surface forms of the respective tokens. If they are equal, we set the feature $\phi_e$ to a value that starts low for short letter sequences, which are likely to be found in two languages with different connotations, but rapidly converges to 1:

$$\phi_e = 1 - \frac{1}{\frac{len(t)^2}{2} + 1} \tag{4.28}$$

If, for instance, the word form 'casa' is found in both Italian and Spanish, $\phi_e$ yields 0.88889 for this token pair. If the word form is 'antifascista', the value rises to 0.9863. The second surface-form-based feature, $\phi_l$, employs the Levenshtein distance measure, which defines the number of basic edit operations needed to convert one word into another:

$$\phi_l = \frac{l_{min} - \mathrm{levenshtein}(t^1, t^2)}{l_{max}} - \frac{1}{2^{l_{min}/2}} \tag{4.29}$$

80For GIZA++/MGIZA++, the symmetrization tool symal is used; the output format of fast_align and efmaral can be symmetrized with atools, which is part of the fast_align software.

with $l_{min} = min(len(t^1), len(t^2))$ and $l_{max} = max(len(t^1), len(t^2))$. For equal surface forms, $\phi_l$ receives values close to 1. The word 'importadores' 'importers', for instance, is the same in Spanish and Portuguese, which results in a Levenshtein distance of 0. Since the word has 12 letters, $\phi_l$ thus equals $1 - 0.5^6 \approx 0.984$. The value decreases with increasing mismatches or fewer letters (e.g., German/Dutch 'Demokratie'/'demokratie' yields 0.869, Spanish/Polish 'momentos'/'momentach' (Levenshtein distance 3) yields 0.493 and, less similar, Spanish/Italian 'humanos'/'umani' (Levenshtein distance 3) yields 0.109). We only consider values of $\phi_l$ of at least 0.3 as significant for alignment purposes; cases like 'humanos'/'umani' with lower values are ignored. Both surface form features are sketched in code after this feature list.

• Although different languages show a wide variety with regard to how corresponding units are expressed grammatically, cases in which the parts of speech of the respective corresponding single tokens agree are frequent, given that both languages use the same tagset.81 Consider the following example in six languages:

– English: in/ADP1 times/NOUN2 of/ADP3 difficulty/NOUN4

– French: dans/ADP1 les/DET moments/NOUN2 difficiles/ADJ4

– German: in/ADP1 schweren/ADJ4 Zeiten/NOUN2

– Italian: nei/ADP1 momenti/NOUN2 di/ADP3 difficoltà/NOUN4

– Slovene: v/ADP1 težkih/ADJ4 časih/NOUN2

– Spanish: en/ADP1 épocas/NOUN2 de/ADP3 necesidad/NOUN4

Multilingual word-level correspondences (AUs) are denoted by numbers. For AUs 1, 2 and 3, we only see a single part-of-speech tag. AU 4 is expressed either as noun or adjective. In total, we have 48 pairwise correspondences, out of which nine do not agree in part of speech. Knowing that agreement is more likely than disagreement, we can differentiate between more probable (same tag) and less probable (different tags) edges to prefer the former over the latter in the first clustering steps.

Since function words are predominantly driven by grammar (see also Section 5.4), we define two distinct binary features: $\phi_t^c$ is set to 1 when two tokens of different languages possess the same part-of-speech tag and that part-of-speech tag denotes either an adjective, adverb, noun or verb (see also Section 3.2). For all other parts of speech, we set $\phi_t^f$ to 1.

81We use the first revision of the universal part-of-speech tags (Petrov et al. 2012), which consists of nine universal tags.

• For FEP9, we perform syntactic dependency parsing on six languages: English, French, German, Italian, Spanish and Swedish (see Section 3.3). The resulting dependency relations are valuable clues with regard to multiword AUs. They also provide negative evidence thereof, in case the dependency label indicates that the two related tokens are not likely to form part of the same AU. For instance, subjects and predicates will typically be in separate AUs, apart from expressions such as 'there is', which, on the other hand, can easily be captured by cooccurrence measures. We define a feature $\phi_d^y$ for each dependency label $y$ from the union of all respective dependency label sets that we use, which is 0 unless a dependency relation with the label in question exists. In that case, its value is 1.

• We encode the collocational strength of adjacent tokens as a single feature $\phi_c$. This is motivated by the fact that multiword AUs typically appear in adjacent positions and by our observation that preliminary versions of our algorithm were frequently missing single tokens in border or middle positions of larger AUs. In cases with very idiomatic phrasal correspondences, bilingual alignments often miss some involved tokens and syntactic relations do not help either, as phrasemes can be of variable length including various – if not all – types of syntactic relationships. We use the normalized simple log-likelihood measure (see Evert 2008, Section 4.2) for pairs of lemmas that occur more frequently than expected ($O > E$). Token pairs with high collocational strength are, for instance, 'Liu Xiaobo', 'Барак Обама' 'Barack Obama', 'Verbrechen gegen' 'crime against', 'conflictos armados' 'armed conflicts' and 'become part'.
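The two surface form features can be sketched as follows; the Levenshtein function here is a textbook dynamic program, not the implementation used for our corpus:

```python
def phi_e(t1, t2):
    """Identical-surface-form feature (Equation 4.28)."""
    if t1 != t2:
        return 0.0
    return 1 - 1 / (len(t1) ** 2 / 2 + 1)

def levenshtein(a, b):
    """Edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def phi_l(t1, t2):
    """Levenshtein-based feature (Equation 4.29)."""
    l_min, l_max = sorted((len(t1), len(t2)))
    return (l_min - levenshtein(t1, t2)) / l_max - 0.5 ** (l_min / 2)

print(phi_e('casa', 'casa'))      # 0.888...
print(phi_l('humanos', 'umani'))  # 0.109, below the 0.3 cut-off
```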

Weights

Pursuant to our multilingual sentence alignment approach in Section 4.3.1, we define weights $w^y$ for each feature $\phi^y$ such that we get an alignment score for each possible edge by summing up the products of features and corresponding weights. The only difference to Equation 4.14 is that, here, the alignment score is calculated on token pairs and not on sentence pairs:

$$as(t^1, t^2) = \sum_i w^i \cdot \phi^i(t^1, t^2) \tag{4.30}$$

The clustering process is subdivided into individual stages, detailed below. At every stage, some weights are added and/or filter heuristics are lowered. We sample the individual weights for all stages in line with the sampling for feature weights in Section 4.3.1 to determine one or more optimal configurations. Since all kinds of syntactic dependency relations are made available as features, since we use evidence from more than one bilingual word aligner, and since we divide the clustering process into multiple stages, the search space is considerably larger compared to what we did for multilingual sentence alignment. On these grounds, we keep the division into stages and the respectively used features fixed and iteratively sample the features for only one individual stage at a time.
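Equation 4.30 amounts to a dot product of feature values and weights; a sketch reusing the two surface form features from above, with invented weights:

```python
def alignment_score(t1, t2, features, weights):
    """Equation 4.30: weighted sum of all feature values for an edge."""
    return sum(weights[name] * phi(t1, t2) for name, phi in features.items())

# Invented configuration reusing phi_e and phi_l from the earlier sketch;
# the weights are illustrative, not the sampled ones.
features = {'phi_e': phi_e, 'phi_l': phi_l}
weights = {'phi_e': 0.1, 'phi_l': 0.1}
print(alignment_score('casa', 'casa', features, weights))  # ≈ 0.164
```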

Figure 4.16 – Sampled weights for interlingual alignment features (stages 1 and 2). The weights sum up to 1 by definition.

In addition to feature-specific weights, another weighting measure we have implemented is a per-language-pair factor $\theta_{l_1,l_2}$, which targets cases where different clustering options with the same alignment score are available. In these cases, we give a competitive edge to language pairs of the same language family by assigning them a $\theta$ of 1, whereas language pairs of distinct families receive a value of 0.99 and thus come second in those cases.

Stages

Separating the clustering into consecutively executed stages allows us to prioritize the different sources of evidence and to successively raise limits, with the objective of using the sources of evidence in order of decreasing reliability;82 as opposed to the feature weights, which rank the clustering options according to the respective set of weights.

82These limits are similar to the threshold for posterior decoding as explained in Section 4.4.1.

Figure 4.17 – Sampled weights for intralingual alignment features (stages 3 and 4). Different configurations lead to the same result, but variation of the respective weights is small (0.02 at most). Stars depict the most frequent configuration for this stage.

Each stage imposes limiting heuristics to keep the cluster growth controlled. A global 'degression factor' $d$ rules out possible clustering steps that have an alignment score of less than $d$ times the score obtained in the previous clustering step ($s \geq s_{-1} \cdot d$). We set $d$ initially to 0.75 and subsequently lower it at later stages. This value is a compromise between being too permissive (e.g., a $d$ of 0.5 would allow a drop to only half the previous score) and being too restrictive (e.g., a $d$ of 1.0 would only allow clustering steps with alignment scores equal to the initial one).

In general, we prefer evidence for clustering decisions from multiple sources. Therefore, we let the heuristics require values that cannot be yielded by a single source. In the course of the progressive clustering stages, we lower these requirements.

0. A preliminary stage is dedicated to those (intralingual) syntactic relations that the bilingual word aligners typically fail to align as many-to-one or many-to-many AUs. The only feature for this stage is $\phi_d^{AVZ}$, which applies to the relation between a particle verb prefix and its verb in German.83

83We are not aware of any other syntactic relation in one of the languages we have parsed that would benefit from this priority treatment. Dutch, which also comprises particle verb prefixes like German, would be a candidate if we had parsed it and the relation between prefix particle and base verb were labelled discriminably.

Figure 4.18 – Sampled weights for interlingual alignment features (stage 5). Some weights show a small variation. Stars depict the most frequent configuration for this stage.

Since word aligners – in particular alignment symmetrization algorithms – prefer multiword AUs of adjacent tokens (see Section 4.5.1) and particle verb prefixes typically appear at the end of a sentence, at potentially long distances from their verbs (see the example in footnote 79), the prefix is often missing from the output of bilingual word aligners when applied to German. We found that the dependency parser we use (see Section 3.3) reliably links verbs and their separated prefixes so that we do not have to revert to the particle verbs identified during corpus creation (see Section 3.2.2). This is an exception to the rule of evidence from multiple sources.

1. In the first regular stage, we use all interlingual features that we have described above, specifically the symmetric and asymmetric bilingual alignments $\phi_b^y$ and $\phi_u^y$, $\phi_e$ for matching surface forms, the Levenshtein-distance-based $\phi_l$ and matching part-of-speech tags via $\phi_t$. We restrict clustering steps at this stage to tokens with the same part-of-speech tag.84 We further limit clusters to at most one token of each language. Another restriction in this stage is that a minimal alignment score of $s \geq 0.5$ needs to be reached. This

84This restriction excludes Greek from the first stage as we did not perform part-of-speech tagging on Greek texts (see Section 3.2).

can, for instance, be the case if $\phi_b^m$, $\phi_b^e$ and $\phi_b^b$ agree (see Figure 4.16). The AUs 1, 3, 5, 9, 10, 12, 13 and 15 in Figure 4.19 are generated by this stage.

2. Having homogeneous clusters with regard to part-of-speech tags after the first stage, we now lift that requirement and allow the recently created clusters to be joined with other clusters and single tokens with different part-of-speech tags. This includes tokens without a tag (i.e., Greek ones). In order for a single token or cluster to be aggregated, three conditions must be met: First, there need to be at least two edges between the respective clusters, one of which can be a single token.85 The set of edges between elements of the clusters $C_1$ and $C_2$ is given by $\theta(C_1, C_2)$; the first requirement can thus be written as $|\theta(C_1, C_2)| \geq 2$. The second requirement also concerns the number of edges, but in relation to the number of tokens in the proposed cluster, which we refer to as $\tau$:

$$\tau = \frac{|\theta(C_1, C_2)|}{|C_1 \cup C_2|} \tag{4.31}$$

At this stage, we require $\tau \geq 0.05$, which means that there needs to be at least one edge for every 20 tokens in $C_1$ and $C_2$. This stage generates the AUs 2, 4, 11 and 14 in Figure 4.19. The last condition requires the average alignment score $\bar{s} = \sum s / |C_1 \cup C_2|$ to be at least 0.7, which prevents us from including tokens with some good alignments but little support on average over all the languages involved.

3. Clusters generated at the second stage can comprise more than one token, although their main objective is to unite first-stage clusters of different part-of-speech tags (for instance clusters 11 and 14 in Figure 4.19) and single tokens that do not fit into any first-stage cluster (for instance clusters 2 and 4 in Figure 4.19). The third stage, in turn, targets (intralinguistic) syntactic relations and will thus lead to clusters comprising two or more tokens of the same language. From all dependency relations available (see Section 3.3), we chose those that contribute to constituting noun phrases (determiner and modifier relations) and verb complexes (verb particle, auxiliary and copula relations), syntactic units that we frequently find in our gold word alignments (see Section 4.3.2). We also included the subject relation to account for cases like in Figure 4.26, but these relations received predominantly negative values as a result of the

85It is obvious that – by definition – two single tokens cannot possess more than one edge between them.

Dutch: Dit is een absolute vereiste.
English: This is an absolute necessity.
Estonian: See on absoluutselt vajalik.
Finnish: Tämä on ehdottomasti tarpeen.
French: Il s'agit d'une nécessité absolue.
German: Das ist eine absolute Notwendigkeit.
Greek: Αυτό αποτελεί απόλυτη ανάγκη.
Italian: Si tratta di una necessità assoluta.
Polish: Jest to absolutna konieczność.
Portuguese: Trata-se de uma absoluta necessidade.
Slovak: To je absolútne nevyhnutné.
Slovene: To je nujno potrebno.
Spanish: Esto es algo absolutamente necesario.
Swedish: Det är en absolut nödvändighet.

Figure 4.19 – Color-coded hierarchical AUs for a list of short parallel sentences. The three outermost boxes on the left (AUs 6 to 8) represent stages 3 to 5, the two outermost boxes on the right (AUs 16 and 17) stages 3 and 4.

feature weight optimization (see Figure 4.17), which marks them as counterproductive at this stage. The only exception is the subject relation SUBJ (only used for dependency relations of German), which received a small but positive value. That means that two German tokens being connected by the subject relation marginally increases their chance of being clustered.

In addition to syntactic relations, we provide the intralingual collocation feature $\phi_c$ at this stage. Its weight $w_c$ receives a comparably low value of 0.3, which suffices – together with the bilingual word alignments – to join the non-verbal components of expressions as shown in Figure 4.17 to the corresponding, already established verb clusters.

We experimentally found that a slightly higher value of $\tau$ ($\tau \geq 0.07$), i.e. a higher number of edges required in relation to the number of elements in the respective clusters, tends to exclude a higher number of unwanted tokens from clustering. We keep the requirement of at least two edges ($|\theta(C_1, C_2)| \geq 2$), which entails that a single syntactic relation is not sufficient to establish a new cluster; there must either be intralingual support by the collocation feature,86 a bilingual word alignment between the considered token and a member token of the existing cluster or, if an already established cluster and not a single token is concerned, at least two syntactic edges between the two clusters.

The accumulated alignment scores can take values $\geq 1$, since complete agreement of all bilingual alignments alone would already result in 1,87 and we allocate higher values to the syntactic relations as a whole in order that they dominate the remaining bilingual alignments. That is why we raised the limit of the overall score to $s \geq 1.5$. As shown in Figure 4.17, the determiner relation alone is sufficient for a token to be joined, while adjective modifiers, for instance, need additional support from other features to reach that limit.

4. In a subsequent stage, we slightly lower the requirements for $\tau$ ($\tau \geq 0.06$) and $d$ ($d \geq 0.25$) and allow single-edge joins to collect the remaining, less supported tokens and smaller clusters that have not been considered for clustering at the previous step due to a comparably low alignment score (they remained below the limit given by the value of $d$).

푑 86There can – by definition – only exist one syntactic relation between two tokens. 87Although, in that case, we certainly would have used the corresponding edge for clustering in one of the previous stages. 136 4.5. MULTILINGUAL WORD ALIGNMENT

5. The last stage also aims at integrating single tokens that did not make it into a cluster in one of the previous stages. To this end, we lower $\tau$ ($\tau \geq 0.05$) and $d$ ($d = 0.1$), allow up to five different part-of-speech tags in one cluster and raise all limits with regard to the alignment score ($s$ and $\bar{s}$). This stage frequently generates only few or no clusters.

Once all stages have been processed, we have one or more binary clustering trees with tokens as their leaves, as shown in Figure 4.20. Not all tokens need to form part of a cluster; in cases where there is no correspondence for particular tokens in a single language, those tokens may remain unclustered. This is the behavior we want to accomplish, but at the same time an extra challenge, as we not only need the clustering to go far enough to include all relevant tokens, but also not to include too many tokens. Agglomerative clustering without any limiting heuristics would simply result in a single cluster containing all tokens.
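The core of a single clustering stage can be pictured as the following schematic loop; it deliberately omits the per-stage feature sets, the $\tau$ and average-score requirements and the language-family factor $\theta$, and edge_score stands in for the accumulated alignment score between two clusters:

```python
def cluster_stage(clusters, edge_score, d=0.75, min_score=0.5):
    """One clustering stage (simplified): greedily merge the best-scoring
    cluster pair until the degression factor d or the stage's minimal
    alignment score stops the growth."""
    prev_score = None
    while len(clusters) > 1:
        pairs = [(edge_score(c1, c2), c1, c2)
                 for i, c1 in enumerate(clusters)
                 for c2 in clusters[i + 1:]]
        score, c1, c2 = max(pairs, key=lambda p: p[0])
        if score < min_score:
            break  # stage-specific minimal alignment score not reached
        if prev_score is not None and score < prev_score * d:
            break  # degression factor rules out the drop in score
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)  # merge into a binary parent cluster
        prev_score = score
    return clusters
```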

Transformation into Hierarchical Alignment Sets

To convert the binary cluster structure into hierarchical ASs, we solely need to identify the topmost cluster of each stage (the colored boxes in Figure 4.20) and collapse the comprised tree structure into a set of tokens. We do this for all stages except 0 and 1, which only target partial structures. The resulting structure corresponds to what is shown in Figure 4.19 without the respective innermost alignments.

For now, we focus on generating as many possibly relevant ASs as we can get from the clustering process to maximize recall. We expect that we can improve the approach by adding a heuristic to identify those clusters that correspond to the actual gold AUs and to filter out intermediate ones. In some cases, we would want to exclude the topmost cluster of a particular stage if it is comprised by a superior cluster that does not include other clusters (but possibly other tokens). This is, for instance, the case for cluster #37 in Figure 4.20, which is comprised by cluster #40. That way, we would increase precision while maintaining the recall level.

4.5.2 Evaluation and Outlook

To evaluate our multilingual word alignment approach, we compare the automatically obtained AUs with the AUs in a set of gold alignments that we manually created for six languages: English, French, German, Italian, Slovene and Spanish. We are primarily interested in the correct alignment of multiword AUs. Bilingual word aligners are best at aligning one-to-one correspondences,⁸⁸ followed by one-to-many units.

⁸⁸ This is, expectedly, the most frequent correspondence type to be found. Melamed (2001) states that "most words in a bitext translate to only one other word."

Figure 4.20 – Visualization of the resulting clusters (squares). The cluster size reflects the accumulated alignment scores. Colored clusters represent the topmost clusters generated by the respective stages: green (stage 5), yellow (stage 4), orange (stage 3), blue (stage 2), turquoise (stage 1). The algorithm's decision to join clusters #36 'this concern' and #40 'is nurtured' to cluster #41 is disputable. On the one hand, cluster #36 is merely the subject or object of #40. On the other hand, the expressions that correspond to 'nurture a concern' are to some extent idiomatic in some languages.

Many-to-many alignments tend to pose an obstacle for those tools (see, for instance, Liang et al. 2006). In Figure 4.21, we measured the recall of the four aligners that we use in our experiments for different types of alignments (i.e., numbers of corresponding alignment elements). Starting with considerably high values for one-to-one alignments, recall drops rapidly with increasing size of the AUs. The most frequent alignment type in our gold alignments is 1:1 (78.1%), followed by 1:2 (10.9%), 2:2 (3.1%) and 1:3 (2.9%). The remaining 5% are distributed over 40 other types, up to a single 9:9 correspondence.⁸⁹

⁸⁹ The idiomatic expression "throw the baby out with the bathwater" can be literally translated to French as "jeter l'enfant avec l'eau de bain" (9 tokens). In Slovene, the corresponding expression is "da se skupaj s slabimi stvarmi znebimo tudi dobrih" 'that, together with the bad things, we also do away with the good ones' (9 tokens), which is a conceptual description of the English and French metaphoric expression.

Figure 4.21 – Recall of the four word aligners (GIZA++, Berkeley Aligner, fast_align, efmaral) with regard to different types of alignments. Pairs like 1:2 and 2:1 have been combined to 1:2. fast_align is the only aligner that also correctly identifies some 3:$x$ alignments with $x \geq 1$, which we do not show here.

Table 4.11 – Evaluation of bilingual word aligners using our gold alignments. We compare pairwise AUs from the minimal AS of the respective language pair. Each partially correct AU counts as one false positive and false negative (a wrong AU was produced and, at the same time, the correct AU was not recognized).

                      TP       FP       FN        P        R        F
GIZA++            79 599   54 380   43 981   0.5941   0.6441   0.6181
Berkeley Aligner  86 297   62 450   37 283   0.5802   0.6983   0.6338
fast_align        60 355   71 254   63 225   0.4586   0.4884   0.4730
efmaral           85 681   70 434   37 899   0.5488   0.6933   0.6127

In Table 4.11, we show performance figures for the respective aligners in comparison with all bilingual minimal ASs (see Section 4.3 on page 78) from our multilingual gold alignments. Note that minimal ASs for language pairs necessarily cannot comprise any hierarchical AUs, since the minimal number of languages per AU is two. Here, we disregard any partially matching AUs, focusing only on complete matches.

We also present F-score figures based on single alignment links in Table 4.12, alongside the respective precision (P) and recall (R) measures, which, in comparison, give us a notion of the aligners' performance. The values calculated by counting correct links between two tokens need to be at least as high as the ones obtained by the former method, since any matching AU comprises at least one matching link, but only the right set of links makes two AUs match. The difference between the two values can be regarded as an indicator of partial match frequency.
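For reference, the precision, recall and F-score figures in Tables 4.11 and 4.12 follow the standard definitions and can be reproduced directly from the TP/FP/FN counts; the example values below are the GIZA++ rows of the two tables.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 score from raw evaluation counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# AU-level evaluation (Table 4.11): whole alignment units must match.
print(prf(79_599, 54_380, 43_981))    # ~ (0.5941, 0.6441, 0.6181)

# Link-level evaluation (Table 4.12): single token-to-token links,
# which also credits partially correct AUs.
print(prf(118_398, 25_866, 39_252))   # ~ (0.8207, 0.7510, 0.7843)
```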

Table 4.12 – Evaluation of bilingual word aligners using our gold alignments. The units we compare are single links between tokens of the two languages, so these figures also account for partially correct AUs.

                      TP       FP       FN        P        R        F
GIZA++           118 398   25 866   39 252   0.8207   0.7510   0.7843
Berkeley Aligner 125 991   33 650   31 659   0.7892   0.7992   0.7942
fast_align       112 369   38 790   45 281   0.7434   0.7128   0.7278
efmaral          128 207   36 956   29 443   0.7762   0.8132   0.7943

Comparing the figures for matching AUs and the ones for matching links, we see that in both cases GIZA++ shows the best precision, i.e. it has the highest ratio of matching alignments among those identified by the aligner. The Berkeley Aligner and efmaral, on the other hand, identify the highest proportion (recall) of AUs from our gold alignments (Table 4.11). When looking at alignment links only (Table 4.12), efmaral outperforms the Berkeley Aligner with regard to recall. While the other three aligners yield similar values, fast_align does not even come close.

We also calculate the alignment unit identification rate (AUIR) described in Section 4.4.2 for each of the four aligners ($\Psi(\mathcal{G}) = 1.2757$). The Berkeley Aligner attains the best value (0.8738), closely followed by GIZA++ (0.8576) and efmaral (0.8525); fast_align scores significantly worse (0.6852).

In general, our figures are noticeably worse than the ones reported by the respective authors (Och and Ney 2003, p. 43; Liang et al. 2006, p. 109; Dyer et al. 2013, p. 647; Östling and Tiedemann 2016, p. 140). This is presumably due to the properties of our gold alignments (see below) – and we assume that the respective authors used their aligners with a configuration optimized for the respective evaluation task, while we always resorted to the standard configuration.

Gold Word Alignments

We extended the gold sentence alignments described in Section 4.3.2 (hierarchical alignments of 14 892 sentences in 100 texts) by manual hierarchical word alignments. Recognizing and aligning corresponding sentences by means of identifying cognates, numbers, acronyms and matching sentence lengths, amongst others, is one thing; the alignment of words, phrases and other expressions is another. The decision whether two tokens directly correspond to each other or should only be aligned within a wider, more comprehensive AU is particularly challenging, as the annotator is required to understand the phrasing in all respective languages, which is not the case for sentence alignment.⁹⁰ As for the alignment algorithms, the cognitive requirements on annotators for word alignment are thus considerably higher than for sentence alignment; a good command of the respective languages is necessary for this task.

Considering the language skills available in our institute, we decided to perform manual word alignment on six languages, namely English and German as representatives of the Germanic language family, French, Italian and Spanish for the Romance languages, and Slovene as the single Slavic language. Including a member of the Finno-Ugric language family, in our case either Estonian or Finnish, would have been interesting from a linguistic point of view, as these languages are less related to the aforementioned languages than those are among themselves. Unfortunately, we could not win anybody over to the alignment task for a Finno-Ugric language. Manual alignment was performed by two student annotators working sequentially: The first annotator created complete hierarchical alignments for all languages but Slovene, and the second one subsequently added Slovene while challenging the existing AUs. Problematic cases were discussed between the two of them and, if no agreement could be reached, in a larger group.⁹¹

Owing to the complexity of multilingual alignment and limited resources, we restricted the number of sentences to be aligned to 500, excluding primarily overly long exemplars.⁹² Unlike previous work, we decided not to differentiate between 'sure' and 'possible' links⁹³ (see, for instance, Och and Ney 2003; Moore 2004), since the classification of each AU into one of these two categories needs, in turn, an accurate definition of which cases are covered by which category and how to treat borderline cases.⁹⁴ Fraser and Marcu (2007) also recommend "that the Sure-only annotation style […] be used."

We considered prealigning some tokens, primarily those that we can identify with high confidence, to speed up the alignment process, but we did not use it. Grimes et al. (2012) report some time saving with prealignments, but they also state: "A 20% increase in speed is indeed significant, but we continue to strive for better results. We recognize that searching for and eliminating incorrect proposed alignments is also time-consuming; overhead time is required to understand each sentence and assess prealigned tokens." The fact that processing the existing alignments adds to the time an annotator needs for processing the content of the respective sentences, the possibility that prealigning adds a bias to the manual alignment, and the observation that multilingual alignment with our hierarchical alignment tool (HAT) (see screenshot and description in Appendix A.2) works surprisingly fast after a short learning phase led to the decision not to perform any prealignment on the gold-aligned sentences.

The earliest documented approach to a gold standard for word alignments is, to our knowledge, the Blinker project (Melamed 1998a), named after the alignment tool Blinker ('bilingual linker'), which is a graphical user interface to perform pairwise word alignment on parallel sentences. Their 'annotation style guide' (Melamed 1998b) consists of general rules, addressing null alignments and paraphrasal translations, and an exemplified list of grammatical constellations (e.g. how to handle auxiliary verbs or repetition in conjunctions on one side). While the aim of Blinker is to create bilingual word alignments, like the ones produced by aligners such as GIZA++ later on, the TreeAligner (Volk, Lundborg et al. 2007; Lundborg et al. 2007) was developed to render possible the alignment of syntactic constituents, alongside the alignment of tokens.⁹⁵ Their 'alignment guidelines' for the SMULTRON treebank (Samuelsson, Volk et al. 2009) put emphasis on one criterion: that aligned parts (tokens or phrases) "can serve as translation units outside the current sentence context". For our gold alignments, we took over this requirement as our first rule.

⁹⁰ A straightforward example of this are parallel support verb constructions (SVC), where the respective verbs are typically not related. The SVC 'take a walk' and its Spanish counterpart 'dar un paseo' 'give a walk' ought to be aligned as multiword expressions with two sub-alignments, namely the determiners ('a' and 'un') and the nouns ('walk' and 'paseo'), while the verbs should remain unaligned.
⁹¹ Unlike Melamed (1998a) (see below), we cannot calculate inter-annotator scores since our annotators performed different tasks, working on the same sentences, but with different languages assigned.
⁹² In total, 816 of our gold-aligned sentences are available in these six languages.
⁹³ These categories are called 'exact' or 'good' and 'approximate' or 'fuzzy' translation correspondence in the Stockholm TreeAligner (Volk, Lundborg et al. 2007; Lundborg et al. 2007).
⁹⁴ Even though defined in the alignment guidelines of the trilingual parallel treebank SMULTRON (Samuelsson and Volk 2006, 2007), the differentiation between these two categories led to some confusion among the annotators and thus inconsistencies in the annotation (Samuelsson, Volk et al. 2009).
⁹⁵ The TreeAligner can also be used for querying treebanks, but we shall focus on the aspect of manual alignment here.
⁹⁶ To keep the annotators' work focused on linguistically relevant units, we excluded punctuation from alignment.

Figure 4.22 – Example representation of five AUs in four languages from our alignment tool (HAT). The rightmost set, which only contains the Italian token 'disponiamo', is not considered an AU since there is nothing at this level that 'disponiamo' aligns with. The filled circle at the bottom represents an AU that comprises the other four AUs plus the token 'disponiamo'.

We defined four main alignment principles⁹⁶ that we extended after a trial period with best-practice examples:⁹⁷

1. All single tokens that are considered standalone translations, i.e. translations that one would also expect to find in a dictionary without context definition, constitute a primary AU. In cases where one language uses more than one token to express the same meaning, the AU extends to all those tokens. Negation, for example, can be expressed in English with ‘not’, in German with ‘nicht’ and in French with the enclosing ‘ne … pas’.

2. AUs should be minimal, thus not containing subsets of tokens in two or more languages that could be aligned separately (see Figure 4.22).98 If a non-minimal AU is found, the respective corresponding subsets need to be separated into new AUs, which are then reattached as sub-AUs to the original AU.99

3. The decisive question is which sets of tokens over all available languages bear the same meaning. Grammatical categories should, consequently, play no role in word alignment. This principle also targets human annotators’ temptation to look for identical structures in the respective languages.

⁹⁷ This is similar to how the guidelines for Blinker were created.
⁹⁸ Here, we make an exception for combinations of function words, as their number is limited, resulting in a limited number of frequent combinations, and their informative value for linguistic tasks is lower than that of combinations including content words.
⁹⁹ It turned out that first creating a bigger AU with all corresponding tokens and then migrating corresponding subsets of tokens into their respective sub-AUs eases the alignment workflow.

4. Single tokens or expressions that do not correspond to any token in the other languages should remain unaligned. We find, for example, 'so etwas von' in the German sentence 'Davon war die Kommission so etwas von meilenweit entfernt!' 'The Commission was miles away from that.', a colloquial expression intensifying the following adjective, without corresponding expressions in any other language.

Figure 4.23 – A typical possessive AU as shown in HAT. Most languages use preposition plus article (contracted in French and Italian), German expresses definiteness and possessiveness with a definite article in genitive case and genitive suffixes for adjective and noun, and Slovene, possessing no articles, only features genitive suffixes.

These principles turned out to be sufficient to direct the alignment process. Typical language properties that require hierarchical AUs on multiple levels are:

• Compounds vs. lists of tokens: German ‘Staatsoberhaupt’ ‘head of state’ vs. Italian ‘capo di Stato’, French ‘chef d’État’ and Slovene ‘voditelj države’.

• Morphologically complex forms vs. multiple tokens: German 'unwürdigsten' 'most deplorable' vs. French 'les plus déplorables', Spanish 'más deplorables' and Slovene 'najbolj klavrnih'.

• Articles vs. no articles: While the other languages possess articles, Slovene does not. This frequently leads to an AU consisting of the other five languages' articles, another AU consisting of six nouns and a third AU integrating the former ones.

• Grammatical cases vs. prepositions: German 'der' vs. English 'of the', Spanish and French 'de la'. In Italian, and less frequently in French and German, prepositions and definite articles are often contracted. Here, the corresponding token 'della' consists of 'de' 'of' and 'la' 'the'.¹⁰⁰ The same applies to Slovene nouns; see Figure 4.23.

Figure 4.24 – Absolute frequencies for the number of subordinate AUs (blue) and the number of tokens per language over all AUs (red). The red line depicts the number of unique tokens.

The 500 sentences that we manually aligned in six languages yield 17 144 AUs, almost all of them comprising tokens in all six languages. All in all, 67 684 distinct tokens form part of one or more AUs.¹⁰¹ 69% of the AUs are leaves of the alignment tree, i.e. they have no subordinate AU; 11% have two, 10% three and another 10% four or more subordinate AUs.¹⁰² This distribution and the number of tokens per language contained by the AUs are shown in Figure 4.24. In two thirds of the cases, it is a single token; in 20%, two tokens. Almost half of the AUs comprise tokens in all six languages, 20% in five and 10% each in four, three and two languages.

¹⁰⁰ The contractions were reconverted into separate tokens in the Blinker alignment task.
¹⁰¹ That is about four times as many as aligned by Melamed (1998a). If we take into account that we aligned six languages while he only did two, we still aligned 40% more tokens.
¹⁰² AUs with a single subordinate AU (we only count 21 cases) are errors in the alignment structure. In these cases, both AUs contain the same tokens.


Figure 4.25 – Tokens per gold AU (violet dots). The violet line depicts the accumulated percentage of all AUs. The majority of them consist of six or fewer tokens (six tokens: 22.4%), which in most cases corresponds to one token per language.

Regarding the size of the gold AUs, we see in Figure 4.25 that AUs comprising six tokens, typically one token per language, are the biggest group, followed by those with five or fewer. The natural lower limit for an alignment is two tokens. Some expressions like "throw the baby out with the bathwater" comprise a much larger quantity but occur considerably less frequently.

To assess the quality of our manual alignments, we randomly selected 1000 AUs from two minimal ASs: one for the language pair English/Spanish (12 286 in total) and one for the triple French/German/Italian (13 931 in total).¹⁰³ We had two human reviewers different from the annotators judge the AUs without presenting the original sentences. Their task was to decide whether each of the 1000 AUs was plausible, questionable or outright wrong. In the English/Spanish list, one AU was judged wrong and another one questionable; in the French/German/Italian list, six AUs were judged wrong and four questionable. The most common reason for rejecting an AU were surplus tokens that should have been aligned to a higher level in the hierarchy, to a different AU, or not at all.

¹⁰³ Note that the minimal AS of three languages may well contain AUs in two languages only, on condition that there is an AU comprising these two languages but not the third one in the original multilingual AS.

Results

To see how close our multilingual alignment algorithm can get to the manual gold alignments, we optimized its feature weights for recall.

That way, we get 0.6662 as the maximal recall value for AUs in the six languages of our gold alignments,¹⁰⁴ which means that we correctly identify two out of three AUs in the gold alignments. This result can be seen as similar to the recall values presented in Table 4.11, with the substantial difference that our ASs comprise six languages, which renders the identification task more complex. In two parallel sentences, each comprising ten tokens, we have $10^2 = 100$ options for single alignment links. Assuming that these are very simple sentences and that we have ten one-to-one correspondences, then our alignment algorithm needs to identify seven of them correctly for a recall of 0.7, which amounts to a rate of $7/10^2 = 0.07$. The same rate is substantially smaller when we try to achieve a recall of 0.7 in a setup of ten parallel sentences, again only looking at AUs that comprise a single word per language, namely $7/10^{10} = 0.000\,000\,000\,7$. Allowing for AUs with more than one word per language, the number of options increases and, hence, the chance to identify the correct AUs by chance decreases.

This value is not reflected by precision and recall measures but hides in the false negatives (see Section 4.4.2). Many of the missing 33% can be credited to single alignment errors in one of the involved languages. Any missing or surplus token in one of the languages leads to rejection of an otherwise complete AU. An example of such a single deviation between automatic and manual alignment is depicted in Figures 4.26 and 4.27.

The recall value does not change considerably when we remove one or more languages from the ASs, thus converting the automatically calculated ASs on all available languages not to the minimal ASs for the six languages of our gold alignments, but to subsets thereof. We observe the worst recall for the language pair French/Slovene (0.6320) and the best one for English/German (0.6917). This small difference can presumably be explained by the close relatedness of English and German. Partitioning the language subsets by whether a particular language is included or not, we also see that subsets including Slovene yield the lowest recall value on average (0.6621) and subsets including English the highest (0.6720). This indicates that Slovene is slightly more difficult to align than English, a fact that we would expect by reason of language relatedness. However, we incorporate syntactic relations in all other languages into the alignment process, which, although targeting the clustering process as a whole, could also raise alignment quality for those languages in particular.

Assessing pairwise alignment links between each two tokens in all identified AUs, we get approximately the same number of correct links (a recall of 0.6659), which is significantly less than the recall obtained by any of the bilingual aligners on pairwise word alignment (see Table 4.12). However, we get a similar precision

¹⁰⁴ The precision is at the same time very low at 0.3574, as we have not yet implemented the envisaged filtering of the retrieved AUs.

Bulgarian Съществуват два различни вида отношение :

Dutch Er zijn twee benaderingen :

English There are two different approaches :

Estonian Olemas on kaks erinevat lähenemisviisi :

Finnish Erämaita voidaan lähestyä kahdesta näkökulmasta :

French Il existe deux approches différentes :

German Dabei gibt es zwei verschiedene Ansätze :

Greek Υπάρχουν δύο διαφορετικές προσεγγίσεις :

Italian Esistono due approcci diversi :

Polish Można tu wyróżnić dwa odmienne podejścia :

Portuguese Existem duas abordagens diferentes :

Romanian Există două abordări diferite :

Slovak Existujú dva rôzne prístupy .

Slovene Obstajata dva različna pristopa :

Spanish Existen dos planteamientos distintos :

Swedish Det finns två olika metoder :

Figure 4.26 – AU found by our algorithm. Its minimal AU for the languages used for evaluation differs in one point from the corresponding AU of our gold alignments: the German token 'Dabei' has not been included. The possible uses of 'dabei' are manifold; here it is an expression referring to the context established in the previous sentence.

(0.8020) from our multilingual AUs, which indicates that many of the false positives from the evaluation of whole AUs addressed above may be due to only small deviations from the gold alignment.

Future Options

With our multilingual word alignment algorithm identifying many gold AUs correctly, we now need to determine the characteristics of the surplus AUs returned by the conversion step that generates hierarchical ASs from the clustering trees, in order to increase precision while maintaining recall. These characteristics can be based on inherent features such as the words included and statistics used during the

Dutch U ziet dat de antwoorden niet zo simpel zijn als ze lijken .

English Therefore , the solutions are not as simple as that .

Estonian Seega ei ole lahendused nii lihtsad .

Finnish Siksi ratkaisut eivät ole niin yksinkertaisia .

French Donc , ce n’ est pas aussi sommaire que cela , les réponses .

German So einfach sind die Lösungen also nicht .

Greek Ως εκ τούτου , οι λύσεις δεν είναι τόσο απλές .

Italian Quindi , le soluzioni non sono così semplici .

Polish A zatem rozwiązania nie są takie proste .

Portuguese Por conseguinte , as soluções não são assim tão simples quanto isso .

Slovak Riešenie preto nie je také jednoduché .

Slovene Zato rešitve niso tako preproste .

Spanish Por lo tanto , las soluciones no son tan sencillas .

Swedish Lösningarna är därför inte så enkla .

Figure 4.27 – A larger AU as identified by our algorithm. It consists of two stage 4 sub-AUs: 'as simple' and 'are not', both of which are to be found in our gold alignments. However, in the gold alignments, the second English 'as' belongs to the 'as simple' AU, so that this AU counts as a false positive in our evaluation.

clustering process (e.g., $\tau$, $s$ or $s/\bar{s}$) or structural configurations (e.g., the difference between an AU and the containing AU). Since we have the list of matching AUs, we can learn those properties algorithmically. On the other hand, improving recall is also possible, although we cannot expect to find all manually created AUs. Previous experience in manual word alignment showed that inconsistencies are unavoidable (Melamed 1998a; Samuelsson, Volk et al. 2009).

First of all, we have manually selected a few syntactic dependency relations. Providing the optimization algorithm with the full set of dependency relations (161 in total in all six parsed languages) will increase the time that the sampling process requires for converging, but will eventually yield weights for all relations. We also manually defined the different stages along with their respective filtering heuristics. Automatically deriving the optimal distribution of features to stages and their filtering heuristics, let alone finding the best number of stages, seems a more challenging enterprise. There are possibly ways to train all these variables jointly in a hierarchical learning approach, though this would require us to provide large amounts of training data, which we do not currently possess. A bootstrapping approach that uses our gold word alignments to align the whole corpus and uses these alignments – or the best-rated parts of them – to train a model could be an option. An alternative or complement would be the manual correction of the aligned data by volunteers who could mark wrong alignments on a subset of languages they understand.

Hierarchical multilingual word alignments render possible numerous to date unrealized applications. They can be employed to answer typological questions. Structural variations may be statistically exposed and serve language learners as a reference, especially those with several L1s or L2s. Applications in computer-aided translation are also imaginable, in particular in the context where the alignments originate from.

Chapter 5

Linguistic Applications of Word Alignment

In this chapter, we describe applications based on bilingual word alignment in multiparallel corpora. They all have in common that they combine statistical evidence from multiple sources. These sources can be the same technique applied to more than one language pair or different layers of annotation and alignment.

Our first application (Section 5.1) presents a measure for semantic relatedness between two words represented by their lemmas. It intersects the lemma alignment distributions (introduced in Section 3.2.1) of two words, thus performing triangulation over many languages. The more these two distributions overlap, the higher we regard the probability that they represent similar, interchangeable concepts.

Since the overlap measure takes into account alignments with any other available language, we can apply it to two words of the same language or two words of different languages. As an application of lemma alignment distribution overlap within the same language, we apply the overlap measure to pairs of German particle verbs and their base verb, which can be interchangeable (e.g., 'steigen' and 'ansteigen' 'to rise') or bear a completely distinct meaning (e.g., 'lösen' 'to solve'/'to loosen' and 'auslösen' 'to trigger'/'to provoke'). An excerpt from our list of pairs and their respective overlap values is given in Appendix C.1.

Our example for semantic overlap between words of different languages is the identification of false friends, that is, word pairs from two different languages that, despite their apparent resemblance, differ in meaning. Words that look similar on the surface (e.g., 'pregnant' and German 'prägnant' 'concise') have the potential to confuse language learners. A higher degree of overlap than expected for known false friends (e.g., for 'human' and German 'human' 'humane') can provide them with novel insights (here, that English and German 'human' are valid translations in the medical domain).

The second application, which we present in Section 5.2, is a tool for spotting translations of multiword expressions in several languages simultaneously. It looks up a set (or a sequence) of either word forms or lemmas in a source language (which we detect automatically) and retrieves word alignment information for every source language match in all available target languages. We aggregate statistics over the lemma sequences of all found target language sentences per language and present them to the user as translation variants. Our target user group includes language learners, who benefit from the option to perform faceted searches, that is, to use one of the found translation variants for a subsequent search and, in so doing, continue exploring translations in the corpus. Though our search tool primarily targets non-expert users, the underlying search engine is powerful enough to perform sophisticated queries involving several layers of annotation. A future challenge will be to design a user interface that adapts to the requirements of different user groups.

In Section 5.3, we extend the notion of cooccurrence as described in Evert (2008) to bilingual word alignment (Section 4.4) and measure statistical association thereon, aiming at the identification of word combinations that cannot be derived from the meaning of their constituents and would therefore need to be listed in a dictionary. Following Mel'čuk (1995), we use the term phraseme to refer to those combinations. Using the example of support verb constructions, we detail how a low association score between the (functional) verbs of those constructions can be exploited as a measure of idiomaticity. To explore the properties of different statistical association measures visually, we built a web application that allows its users to define the ranking of potential support verb constructions on the basis of association measures applied to syntactic and interlingual cooccurrence dimensions.

The last application, which we present in Section 5.4, attempts to predict the difficulty of verb- and adjective-preposition combinations in English for language learners. Based on our prediction and general corpus frequencies, we automatically compile language-specific lists of those combinations for the purpose of supporting language learners. Combinations on these lists are presumably difficult for L2 learners with a particular L1 and at the same time frequent in English, which we estimate by frequencies obtained from our corpus. The proposed method makes use of the lemma distribution matrix introduced in Section 3.2.1, which we multiply with the distribution of observed foreign-language prepositions over the entire set of occurrences of a particular English verb- or adjective-preposition combination. In this way, we obtain preference values for each (correct or incorrect) preposition, which we refer to as backtranslation scores. The ratio of each preposition's backtranslation score to that of the correct preposition, named backtranslation ratio, is a measure of how probable that preposition is to be confused with the correct one. Higher scores imply a higher risk of error. Besides the compilation of lists with error-prone combinations, the backtranslation score can be used to suggest corrections in learner writing.

Contributions

Our interface for the exploration of statistical association measures was built by Christof Bless. The prediction of learner transfer errors is joint work with Gerold Schneider. Several people contributed to Multilingwis: Simon Clematide and Martin Volk contributed to the conceptual design, implementation and testing of the first version, Dominique Sandoz built the user interface of the current version and Chantal Amrhein contributed the Credit Suisse corpus. The author designed the database back end for Multilingwis (Section 5.2), including the query template that is used by the front end to perform corpus searches. In the same way, the author was responsible for the design and implementation of the backtranslation measures (Section 5.4); the evaluation was performed jointly. The other two sections (5.1 and 5.3) are solely the author's own work. Ongoing joint work with Gerold Schneider will rely on the overlap measure described in Section 5.1.
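To illustrate the backtranslation computation sketched above: multiplying the preposition part of the lemma distribution matrix with the observed foreign-language preposition distribution yields one preference value per English preposition. All numbers and the 'depend on' example below are invented for illustration and do not stem from the corpus.

```python
import numpy as np

# Toy matrix of alignment probabilities P(English preposition | German
# preposition); rows: German prepositions, columns: English prepositions.
english = ["of", "on", "about"]
german = ["von", "auf", "über"]
M = np.array([
    [0.70, 0.10, 0.20],   # 'von'
    [0.15, 0.75, 0.10],   # 'auf'
    [0.05, 0.15, 0.80],   # 'über'
])

# Invented distribution of German prepositions observed in alignments with
# an English combination such as 'depend on' (German: 'abhängen von').
observed = np.array([0.60, 0.30, 0.10])

scores = observed @ M                          # backtranslation score per preposition
ratios = scores / scores[english.index("on")]  # backtranslation ratio vs. correct 'on'
for prep, r in zip(english, ratios):
    print(f"{prep}: {r:.2f}")
# 'of' scores well above 1.0 here, flagging the typical transfer error
# 'depend of' for learners whose L1 preposition is 'von'.
```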

5.1 Overlap of Lemma Alignment Distributions as Measure for Semantic Relatedness

This section describes a method of calculating semantic relatedness between words defined by their lemmas. It makes use of alignment frequencies and is thus more reliable the higher those frequencies are. The underlying assumptions are that words with similar semantics share translations into other languages, that the size of this share correlates with their semantic overlap, and that bilingual word alignment is sufficiently reliable for calculating the required ratios.¹

¹ Medeiros Caseli et al. (2010) found that "word alignment is able to attach semantic information to word and multiword units, by means of their target language counterparts."

In Section 3.2, we describe how part-of-speech taggers additionally learn part-of-speech/lemma pairs from training material to not only predict part-of-speech tags but also lemmas when applied to token sequences. Furthermore, we show in Section 3.2.1 how we avail ourselves of a globally calculated lemma distribution matrix to disambiguate cases where more than one lemma got assigned to a single token. The matrix consists of the lemma alignment probabilities originally defined in Equation 3.2, revisited in Equation 5.1:

$$p_a(\lambda_t|\lambda_s) = \frac{f_a(\lambda_s, \lambda_t)}{\sum_{\lambda_t'} f_a(\lambda_s, \lambda_t')} \qquad (5.1)$$

We calculate the intersection of absolute and relative alignment frequencies² ($f^\cap$ and $p^\cap$) for a particular lemma $\lambda_x$ in a language different to that of the source and target lemmas ($\lambda_s$ and $\lambda_t$) by using the respective lower value:

$$f^\cap(\lambda_1, \lambda_2|\lambda_x) = \min\left(f_a(\lambda_1, \lambda_x), f_a(\lambda_2, \lambda_x)\right) \qquad (5.2)$$

$$p^\cap(\lambda_1, \lambda_2|\lambda_x) = \min\left(p_a(\lambda_x|\lambda_1), p_a(\lambda_x|\lambda_2)\right) \qquad (5.3)$$

The overlap of the lemma alignment distributions of two lemmas $\lambda_1$ and $\lambda_2$ is the weighted sum of intersecting lemma alignment probabilities for each lemma $\lambda_x$ in all languages:³

$$O_a(\lambda_1, \lambda_2) = \frac{\sum_{\lambda_x} \log\left(f^\cap(\lambda_1, \lambda_2|\lambda_x) + 1\right) \cdot p^\cap(\lambda_1, \lambda_2|\lambda_x)}{\sum_{\lambda_x} \log\left(f^\cap(\lambda_1, \lambda_2|\lambda_x) + 1\right) + \epsilon} \qquad (5.4)$$

We use the logarithm of absolute frequencies to account for more frequent translations without letting those dominate the overall overlap score. The overlap measure $O_a$ defines a weighted ratio of common translations over all other available languages. Applied to two lemmas of the same language, it corresponds to their degree of interchangeability.⁴
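Equations 5.2 to 5.4 translate into a few lines of code. The sketch below is our own simplification: alignment frequencies are assumed to be available as dictionaries keyed by (language, lemma) pairs, and the frequencies in the usage example are invented.

```python
import math
from collections import defaultdict

def overlap(f1, f2, eps=1e-9):
    """Overlap O_a of two lemma alignment distributions (Equation 5.4).
    f1, f2 map (language, lemma) pairs to absolute alignment frequencies
    f_a; relative frequencies p_a are normalized per target language."""
    def normalize(f):
        totals = defaultdict(int)
        for (lang, _), freq in f.items():
            totals[lang] += freq
        return {key: freq / totals[key[0]] for key, freq in f.items()}
    p1, p2 = normalize(f1), normalize(f2)
    num = den = 0.0
    for lx in set(f1) | set(f2):
        f_cap = min(f1.get(lx, 0), f2.get(lx, 0))   # Eq. 5.2
        p_cap = min(p1.get(lx, 0), p2.get(lx, 0))   # Eq. 5.3
        weight = math.log(f_cap + 1)
        num += weight * p_cap
        den += weight
    return num / (den + eps)

# Invented frequencies for 'steigen' and 'ansteigen' against one French
# and one Italian lemma each; a high value signals interchangeability.
steigen = {("fr", "augmenter"): 900, ("fr", "monter"): 100,
           ("it", "aumentare"): 850, ("it", "salire"): 80}
ansteigen = {("fr", "augmenter"): 120, ("it", "aumentare"): 100}
print(round(overlap(steigen, ansteigen), 2))  # -> 0.91 with these toy numbers
```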

2We refer to the latter as alignment probabilities as they reflect the chance of a source lemma to be aligned with a target lemma given the statistics we collected from our aligned corpus. 3Except for Bulgarian and Greek, where tokens in our corpus do not possess lemmas and Estonian and Finnish, where the lemmas generated by the respective part-of-speech tagging model include case endings, which renders them less useful for our approach. 4Ignoring particular contextual requirements that would only permit one word or the other. CHAPTER 5. LINGUISTIC APPLICATIONS OF WORD ALIGNMENT 155 translations (i.e., aligned tokens in other languages) in our corpus. On the other end of the spectrum, we find pairs that bear almost the same translations, for instance, ‘steigen’ and ‘ansteigen’ ‘to increase’/‘to rise’.5 In Appendix C.1, we list frequent particle verbs ordered by the overlap values with their respective base verbs. Comparing the beginning and the end of the list, we see that the latter pairs can often be used interchangeably (e.g., “Es reicht nicht.” “That is not enough.” vs. “Doch das reicht nicht aus.” “But these are not enough.” or “Er spart Zeit und Kosten.” “It will save time and money.” vs. “All diese Aktivitäten werden Geld einsparen.” “All this would save money.”), while the former can not. We are aware that the ranking of verb pairs cannot possibly represent a strict order of semantic relatedness, but only a rough tendency. Several factors, such as the corpus domain that our frequencies are based on, or the error rate of part-of-speech tagging and, in particular, the error rate of reattaching separated verb prefixes (see Section 3.2.2), have an influence on the final score.6 The same overlap measure can be applied to lemma pairs of different languages. It does – by design – not take into account the relative alignment frequencies of the second lemma given the first one and vice versa ( and ), but merely calculates the indirect overlap of distributions with respect to other lan- 7 푎 2 1 푎 1 2 guages. We expect the former and the latter to show푝 (휆 a correlation|휆 ) 푝 in(휆 most|휆 ) cases, though: Two lemmas that are frequently aligned with each other (as attributes of pairwise aligned tokens) will also be frequently aligned with the same particular lemmas in other languages. In Table 5.1, we show figures for pairs of English and German lemmas with similar surface forms, which makes them false friend candidates. To qualify as false friends, we need to show that their superficial resemblance does not come along with conforming semantics. To this end, we state, on the one hand, the conditional probabilities from the lemma distribution matrix (see Section 3.2.1), which correspond to the relative alignment frequencies between both lemmas given one or the other. In the first three examples, this probability is (almost) zero, therefore disqualifying them as mutual translations. On the other hand, we specify the alignment overlap of each lemma pair. Its value is equally low for the first three pairs, which indicates the absence of semantic relatedness.8

⁵ Compare: "Es kann dazu führen, daß die Überschüsse der europäischen Unternehmen anwachsen, wenn die Produktivität steigt […]" "It can make the profits of European businesses grow if productivity increases […]" vs. "[…] dass langfristig die Reallöhne und die Produktivität parallel zueinander ansteigen sollten." "[…] that, in the long term, real wages and productivity should grow simultaneously."
⁶ 'Abstimmen' 'to vote' and 'auffordern' 'to ask'/'to call (up)on' are the most frequent particle verbs in our corpus, but they are arguably less frequent in general linguistic usage.
⁷ Note that $O_a$ is a symmetric measure, while $p_a$ is asymmetric.
⁸ The resulting scores are low, but still show some overlap. We find, for instance, a single case where 'prägnante Berichte' 'concise reports' are translated as 'pregnant reports'.

Table 5.1 – Alignment probabilities, overlap measure and overlap frequency for a selection of English/German lemma pairs with resembling surface forms. Conditional alignment probabilities that are considerably higher than the corresponding overlap measure are highlighted.

English lemma λe   German lemma λg                      p_a(λg|λe)  p_a(λe|λg)  O_a(λe,λg)     Σf∩
pregnant           prägnant 'concise'                       0.0044      0           0.0044       2
also               also 'thus'                              0           0           0.0092     192
actually           aktuell 'current'                        0           0           0.0057     148
brave              brav 'well-behaved'                      0.0085      0.5000      0.0315      11
sensible           sensibel 'sensitive'                     0.0102      0.0102      0.0606     341
eventually         eventuell 'potentially'                  0.0198      0.0096      0.0683     204
sympathetic        sympathisch 'likable'                    0.1139      0.2308      0.2240     109
serious            seriös 'respectable'                     0.0250      0.7045      0.3393   1 981
irritate           irritieren 'confuse'/'irritate'          0.3158      0.4474      0.4591     119
pathetic           pathetisch 'lofty'                       0.1482      0.7273      0.5947      60
human              human 'humane'                           0.0035      0.1651      0.6131   1 076
automatically      automatisch 'automatically'              0.9705      0.5563      0.7138   4 293
accept             akzeptieren 'accept'                     0.5879      0.8863      0.8554  35 575
pension            Pension 'boarding house'/'pension'       0.0348      0.9184      0.7377     601
emotional          emotional 'emotional'                    0.7665      0.7626      0.8632   1 120

The following lemma pair, English 'brave' and German 'brav' 'well-behaved', shows a low probability for 'brav' given 'brave' in combination with a low overlap score, but a surprisingly high probability for 'brave' given 'brav'. This probability is caused by annotation errors due to undetected code-switching in the German text; the German lemma 'brav' has erroneously been identified in foreign expressions ('brave new world' and 'paix des braves'). Additionally, the low overlap frequency (11 cases) renders this example less reliable.

Another remarkable lemma pair is English 'serious' and German 'seriös' 'respectable'. The latter is translated in most cases to 'serious'; the more frequent 'serious', on the other hand, aligns with six other German lemmas with a higher probability than with 'seriös'. The overlap measure, scoring at an intermediate level, indicates that other languages support this relation (e.g., French 'sérieux'), which disqualifies this pair as a false friend in the strict sense. All following lemma pairs show a larger overlap and, hence, likewise fail to qualify as false friends. The case of 'pathetic'/'pathetisch' is similar to 'serious'/'seriös' insofar as one conditional probability is considerably higher than the other. Eight out of eleven occurrences of 'pathetisch' are aligned to 'pathetic', which shows a total of 54 occurrences. In the case of 'Pension' and 'pension', we certainly see an influence of the particular domain (plenary debates); we did not find a single case where 'Pension' would refer to a boarding house.

Table 5.2 – Alignment frequencies for the lemma ‘human’ in English and German with hapax legomena excluded.

no. ( English ‘human’) 1 Menschenrecht 10638 0.6247 휆푡 휆푠 = 푓푎 푝푎(휆푡|휆푠) 2 menschlich 3324 0.1952 3 Mensch 928 0.0545

10 Menschenrechtsverletzung 93 0.0055 11 human 59 0.0035 12 humanitär 32 0.0019

no. ( German ‘human’) 1 humane 241 0.7651 휆푡 휆푠 = 푓푎 푝푎(휆푡|휆푠) 2 human 52 0.1651 3 humanely 17 0.0540 4 Human 5 0.0159 158 5.1. SEMANTIC RELATEDNESS VIA ALIGNMENT DISTRIBUTIONS

A special case is the lemma 'human' in both English and German. The German lemma is aligned in most cases (77%) with English 'humane' and, second most frequently, with 'human' (17%). Conversely, the frequent English lemma 'human' is most frequently aligned with German 'Menschenrecht' 'human right' (62%), followed by nine other lemmas that are all more frequent alignments than German 'human' (0.35%). These probabilities by themselves suggest that the two lemmas 'human' are not suited as standard translations of each other, although they occur as direct alignments, for the most part in expressions such as 'humane Tuberkulose' 'human tuberculosis', 'humane Dimension' 'human dimension', 'humanes Wachstum' 'human development', 'humane oder soziale Erwägungen' 'human or social considerations' or 'humane Proteine' 'human proteins'.

The overlap of these two lemmas, however, is considerably higher with 61%. Romance languages contribute most: If we exclude non-Romance languages from the overlap measure, the value rises to 0.85. The lemma alignment probability to Portuguese 'humano', for instance, is high for both English and German 'human'. The probability for English ($p_a = 0.94$) is based on 21 517 alignments of English 'human' and Portuguese 'humano'; the probability for German ($p_a = 0.91$) relies on 199 alignments. Other Portuguese lemmas that contribute marginally to the overall $O_a$ value are 'humanidade' ($p^\cap = 0.0014$), 'humanitário' ($p^\cap = 0.0003$) and 'humanamente' ($p^\cap = 0.0001$).

The high overlap for both lemmas 'human' suggests that there is a considerably strong semantic relation between them and that we should not classify them as false friends. To address the question of what accounts for the high overlap in spite of the comparably low alignment probabilities, we look up the corpus examples where they appear together with the most supportive third-language lemmas (i.e., French 'humain', Italian 'umano' and Spanish 'humano'). We see that those languages do not distinguish between English 'human' and 'humane'; both lemmas share the same single most probable alignment.

We have seen in the discussed examples that conditional alignment probabilities frequently show diverging values and, thus, do not provide a reliable view on the relation between two lemmas. The presented overlap measure performs triangulation over all available languages and is, consequently, more robust. As in the case of English and German 'human', it may turn out that particular lemma pairs, even though they are only used in few, specific cases, are indirectly related on a semantic level (which does not imply that they share exactly the same meaning).

The alignment overlap measure as an indicator for semantic relatedness is restricted to single lemmas. The underlying alignment probabilities are calculated based on optimal alignments, that is, one-to-one alignments supported by all four aligners that we use (see Section 4.5.1). If a single word corresponds to a multiword expression consisting of, for instance, two words, and all four aligners correctly identify this one-to-two correspondence, the probability mass of the alignment distribution for the source lemma is spent partly on the first and partly on the second lemma of the target multiword expression.⁹ For the overlap measure to account for such a relationship, both alignments to the third language need to match.
Otherwise, if we miss, for instance, one of the two words in one of the two correspondences, the overlap score will only incorporate the partially overlapping probability of the other word. In case one of the two lemmas that we compare itself forms part of a multiword expression, assuming that the aligners correctly handle the one-to-many correspondences, we will see a considerably lower overlap score compared to single word correspondences. The overlap measure is, thus, only reliably applicable to single word pairs.¹⁰

Regarding the reliability of the resulting overlap score, raw frequencies also need to be considered to estimate the influence of annotation and alignment errors. A high overlap that is supported by most of the available languages¹¹ is evidently more reliable than a high score produced by only a few third languages.¹² A low number of actual triangulation points can also negatively influence the reliability of the overlap measure. However, a low absolute overlap frequency is typically accompanied by a low number of supporting languages. The two lemma pairs in Table 5.1 with distinctively low frequencies, 'pregnant'/'prägnant' and 'brave'/'brav', are supported by only two and five languages, respectively.

In future work, this method can be used to support language learners in three ways: First, we can identify false friends in learner texts by looking up potential false friends in every language the learner knows, and test whether the other word's standard translations would fit better collocation-wise, which indicates a potential transfer error. Second, applying the method to each two superficially similar looking words of a language pair, we can automatically obtain corpus-driven lists of false friends that serve as didactic material for language teaching, even for rare combinations of languages. Third, the entire set of overlapping and disjoint lemma alignment probabilities of two words in one language can support a learner's decision in writing tasks by helping her distinguish the semantics of two seeming synonyms, for instance, between a particle verb and its corresponding base verb in German (see above).¹³

⁹ This is due to the requirement that alignment probabilities for a given lemma sum up to 1.
¹⁰ The highest scoring percentile of frequent lemma pairs for English/German consists for the most part of month and country names, but we also see some other parts of speech (e.g., 'between', 'humanitarian', 'and', 'three', '%').
¹¹ We say that a language supports the overlap score if there is at least one lemma of this language that has a positive alignment probability with both lemmas in question.
¹² We found eight supporting languages out of ten a reasonable lower limit.
¹³ For the lemmas 'liberty' and 'freedom', we get an overlap score of 0.88.

So far, we have calculated the alignment overlap for some lemma pairs that we considered interesting. We plan to systematically calculate it for all lemma pairs with high surface similarity, which can be based on the Levenshtein distance or similar distance metrics. Another option is to map the respective lemmas of different languages to their phonetic transcriptions and subsequently compare these transcriptions between languages to find pairs that are pronounced similarly while potentially having dissimilar surface forms (e.g., French 'gâteaux' 'cake' pronounced /ɡɑ.to/ and Spanish 'gato' 'cat' pronounced /ˈɡa.to/). This conversion will also allow us to calculate phonetic similarities of false friend candidates with a high surface similarity (e.g., 'slut', which is pronounced /slʌt/ in English and /slʉːt/ in Swedish, meaning 'end' or 'over').
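A sketch of this planned candidate generation, pairing lemmas of two languages whose surface forms are close under Levenshtein distance; the vocabulary lists are placeholders, and the threshold of 2 is an arbitrary choice for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

english = ["pregnant", "brave", "sensible"]
german = ["prägnant", "brav", "sensibel"]

# Surface-similar pairs are only false friend *candidates*; the overlap
# score then decides whether their semantics actually diverge.
candidates = [(e, g) for e in english for g in german
              if levenshtein(e, g) <= 2]
print(candidates)
# [('pregnant', 'prägnant'), ('brave', 'brav'), ('sensible', 'sensibel')]
```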

5.2 Multilingual Translation Spotting

Single words typically bear a meaning.¹⁴ When two or more words together bear a meaning, we refer to them as multiword expressions (MWEs), irrespective of whether this meaning is compositional, that is, it can be derived from the meaning of its components, or idiomatic, that is, the meaning is non-compositional or the composition is not transparent to the typical language user. We consider the term 'multiword expression' the most neutral in comparison with 'phraseme', 'idiom' or 'idiomatic expression' and 'formulaic sequence'. It encompasses all the other ones and can be applied to ordinary nominal compounds as well as set phrases.¹⁵

If we want to know how a single word translates to another language, we can simply look it up in the lemma alignment distribution matrix (see Section 3.2.1). A lookup for English 'human' into French yields a Zipfian distribution (Zipf 1949): 'humain' 'human' (89%), 'homme' 'man'/'human being' (10%), 'humanité' 'humanity'/'humankind' (0.5%) and other, less frequent translations, including several alignment hapax legomena with a high probability of being false positives. For multiword expressions, however, there is no similarly simple statistic that one could consult to learn about the translation variants of a given expression.

This is where we put forth our tool for online searches of translation variants in multiparallel corpora, Multilingwis (multilingual word information system) (Clematide et al. 2016; Graën, Clematide and Volk 2016; Graën, Sandoz et al. 2017).¹⁶ We use the term translation variants to refer to distinct lemma tuples that are aligned with a list of search terms found in a corpus.

All translation variants are translation equivalents of the source language terms, but we want to stress their character as alternative translations here. Since translation variants are tuples, the order of their elements matters. The lemma sequences "ancien tradition" and "tradition ancien" are thus two different French translation variants of the English expression "old tradition".

¹⁴ Notable exceptions are words that bore a meaning in the past and nowadays only exist in fixed expressions (e.g., multiword adverbs (Volk and Graën 2017) such as German 'klipp und klar' 'in plain language', 'fix und fertig' 'completely exhausted' or English 'to and fro').
¹⁵ Perish the thought!
¹⁶ https://pub.cl.uzh.ch/purl/multilingwis

Figure 5.1 – Corpus search form for sequences of words in BwanaNet.

In contrast to well-known corpus query tools such as CQPweb (Hardie 2012) or ANNIS (Chiarcos, Dipper et al. 2008; Krause and Zeldes 2014), we want to offer a corpus search tool to non-expert users such as translators, terminologists and language learners. In line with existing online search tools for parallel corpora,¹⁷ but as opposed to, for instance, the CQP-driven BwanaNet corpus search interface¹⁸ in Figure 5.1, we provide the user with a simple input form (Figure 5.2).

Figure 5.2 – Minimalistic search form in Multilingwis (second edition).

Multilingwis performs a corpus search on the list of search terms entered by the user and shows frequencies of translation variants in all available languages as well as a list of corpus examples, which the user can inspect individually. If the list of search terms corresponds to a frequent multiword expression, we expect the frequency lists to show a Zipfian distribution of translation variants, just like the lemma alignment distribution for single words. Figure 5.3 shows the query form, frequency lists and example view in Multilingwis.¹⁹

¹⁷ We compare Glosbe, Linguee and Tradooit in (Volk, Graën and Callegaro 2014, Section 4.1).
¹⁸ http://bwananet.iula.upf.edu/

Figure 5.3 – The Multilingwis graphical user interface displays frequency lists of translation variants to the right and corpus examples to the left. The user can confine the corpus examples by choosing particular translation variants.

We automatically recognize the language of the search terms entered. To this end, we look up each term in the list of word forms of the respective corpus together with their frequencies. For multiword expressions with three or more words, there is typically only one language that features all the terms as word forms. If one of the terms does not exist in any language, we immediately abort the search process. Single terms or, considerably less frequently, pairs of search terms can be ambiguous. In this case, we construe the language with the higher frequencies as intended. This is why in our corpus (FEP6), for instance, 'niego' as a word form of the Polish personal pronoun 'on' is preferred over the Spanish verb 'negar' 'to negate', and 'kitten' is interpreted as the German verb 'kitten' 'to cement'/'to repair' and not as a young cat in English. The expression 'con calma' 'calmly' exists in both Italian and Spanish; in Italian, 'calma' can be a noun or adjective. Though 'calma' is slightly more frequent in Italian, 'con' is observed almost twice as often in Spanish. On these grounds, Multilingwis decides to perform a search in Spanish. Nonetheless, the user can always overrule Multilingwis' choice.
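The language guess can be pictured as follows. The sketch assumes per-language word form frequency lists and simply sums frequencies to break ties, which is our simplification of the heuristic described above; the numbers are invented.

```python
# Invented per-language word form frequencies.
word_freq = {
    "italian": {"con": 50_000, "calma": 900},
    "spanish": {"con": 95_000, "calma": 700},
}

def guess_language(terms, word_freq):
    """Pick a language in which all search terms exist as word forms,
    preferring the one with the higher accumulated frequency."""
    scores = {}
    for lang, freqs in word_freq.items():
        if all(term in freqs for term in terms):
            scores[lang] = sum(freqs[term] for term in terms)
    if not scores:
        return None  # abort: some term exists in no language
    return max(scores, key=scores.get)

print(guess_language(["con", "calma"], word_freq))  # -> 'spanish'
```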

¹⁹ We shall only describe the most recent version of Multilingwis (Graën, Sandoz et al. 2017), as it further develops the techniques implemented in the first version (Clematide et al. 2016; Graën, Clematide and Volk 2016).

Once the language is set, we perform a corpus search in two steps: The first step identifies sentences in the source language that match the search terms, together with the matching tokens, which we refer to as (search) hits. The second step uses word alignment information to retrieve the corresponding tokens in all other languages, constructs the translation variants as tuples of their lemmas and calculates frequency lists on the entire result set.

Besides the source language, the identification of search hits is controlled by the three parameters shown in Figure 5.2.²⁰ By default, we perform a search on lemmas to account for morphological variation. When disabled, the respective word forms of the tokens have to match the search terms. The second option, the limitation to content words,²¹ affects both the search term list and the query. When enabled, function words are skipped, which in some cases leads to the inclusion of alternative expressions (e.g., "human rights violations" when "violations of human rights" is searched for). The last option disables a limit on the number of hits. We consider approximately 1000 hits sufficient to derive meaningful statistics of translation variants. Furthermore, we do not expect the user to page through several hundred results. If a search exceeds this limit, Multilingwis advises the user, who, in turn, can rerun the search with a significantly higher limit.²²

The second step is to look up word alignments for all tokens of each hit. As long as at least one aligned token in a particular target language is found, we include the resulting token tuple (possibly a 1-tuple) as lemma variant in the frequency list and register the combination of hit tuple and translation tuple as an example for that variant. The respective languages are treated independently; we have one source and several target languages. However, example sentences are registered with their respective translation variants in all languages. That way, the user is able to find multilingual examples for a combination of particular translation variants in several languages.
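The second step essentially amounts to grouping the aligned target tokens of every hit by their (ordered) lemma tuple and counting. A condensed sketch for a single target language, with simplified placeholder data structures:

```python
from collections import Counter

def translation_variants(hits, aligned_lemmas):
    """For one target language: map every search hit to the ordered tuple
    of target lemmas its tokens align with, then count distinct tuples."""
    variants = Counter()
    examples = {}
    for hit in hits:                  # hit = tuple of source token ids
        lemmas = aligned_lemmas(hit)  # ordered lemma tuple or None
        if lemmas:                    # at least one aligned token required
            variants[lemmas] += 1
            examples.setdefault(lemmas, hit)  # keep one example per variant
    return variants, examples

hits = [(1, 2), (7, 8), (11, 12)]
align = {(1, 2): ("ancien", "tradition"),
         (7, 8): ("ancien", "tradition"),
         (11, 12): ("tradition", "ancien")}  # order matters: two variants
variants, _ = translation_variants(hits, align.get)
print(variants.most_common())
# [(('ancien', 'tradition'), 2), (('tradition', 'ancien'), 1)]
```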

20More filters can be configured on the basis of metadata provided for each sentence. 21We define them in terms of universal part-of-speech tags (Petrov et al. 2012) as nouns, verbs, adjectives and adverbs; everything else is regarded as a function word. This definition, though arguably imprecise, suffices for the purpose of non-expert corpus search. 22For technical reasons (memory, bandwidth, etc.), we always limit the number of hits to 100 000. 23This is how the first version of Multilingwis interprets a sequence of search terms. That version also always performed a lemma-based search and consistently ignored all function words.
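As a rough illustration of the two search steps, consider the following sketch. The data structures (an inverted positional index and an alignment mapping) are assumptions for exposition, not Multilingwis’ actual schema, and only a single target language is shown.

```python
from collections import Counter

def find_hits(terms, index, limit=1000):
    """Step 1: sentences containing all search terms (conjunctive query).

    `index` is an assumed inverted positional index:
    term -> {sentence_id: [token positions]}.
    """
    postings = [index[term] for term in terms]
    matching = set(postings[0]).intersection(*map(set, postings[1:]))
    # keep the first matching position per term as the hit tuple
    hits = [(sid, tuple(p[sid][0] for p in postings)) for sid in sorted(matching)]
    return hits[:limit]

def translation_variants(hits, alignment, lemmas):
    """Step 2: follow word alignments and count lemma tuples per variant.

    `alignment` maps (sentence_id, position) -> aligned target positions;
    `lemmas` maps sentence_id -> list of target-language lemmas.
    """
    variants = Counter()
    for sid, positions in hits:
        aligned = sorted({t for pos in positions
                          for t in alignment.get((sid, pos), ())})
        if aligned:  # at least one aligned token required
            variants[tuple(lemmas[sid][i] for i in aligned)] += 1
    return variants
```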

Technical Details

We provide Multilingwis as a query tool with few technical dependencies. Besides technological requirements (a PostgreSQL database and a web server with scripting support), Multilingwis needs a part-of-speech-tagged, lemmatized and word-aligned parallel corpus in a CoNLL-like format.24 Starting from this tabular input data, which is uploaded to a temporary database table as a first step, the hierarchical corpus structure is derived and the attributes (word form, lemma, part-of-speech tag, etc.) and relations (one-to-one word alignments) are normalized. Word alignments are regarded as symmetric in the database. If they are not symmetric in the input file, union symmetrization is performed automatically.

As a test case, we exported the relevant parts of our own corpus database FEP6, which comprises the European Parliament's debates (see Section 3), to import it into Multilingwis. This is a straightforward task as Multilingwis' database schema is modeled on structures that we successfully use in our corpus database. FEP6 covers the languages English, Finnish, French, German, Italian, Polish and Spanish, which is also depicted in Figure 5.3. The second corpus we successfully imported into Multilingwis is the Credit Suisse Bulletin corpus (Volk, Amrhein et al. 2016), a parallel corpus comprising articles from the Credit Suisse Bulletin over a period of more than a hundred years in up to four languages: English, French, German and Italian. The imported subset comprises more than two million tokens from recent issues in English, French and German and 1.3 million tokens in Italian.

24We explain the corpus format, the import and the configuration of the web interface in detail in Multilingwis' software repository: https://gitlab.cl.uzh.ch/sparcling/multilingwis2/
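Union symmetrization itself is a one-liner over the two directed link sets. A minimal sketch, assuming alignments are given as (source index, target index) pairs per direction:

```python
def symmetrize_union(src_to_tgt, tgt_to_src):
    """Union symmetrization: keep a link if either direction proposes it."""
    links = set(src_to_tgt)
    links.update((s, t) for t, s in tgt_to_src)  # flip the reverse direction
    return sorted(links)

# directed alignments as (source_index, target_index) pairs
print(symmetrize_union({(0, 0), (1, 2)}, {(2, 1), (3, 4)}))
# -> [(0, 0), (1, 2), (4, 3)]
```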

All attributes that are needed by the search are indexed with regular B-tree indices. For those relations that are traversed for accessing other attributes, we use composite indices so that the required values can be retrieved directly from the index. The word forms and lemmas of each sentence's tokens are indexed by means of an inverted positional index, which allows us to exclude all sentences that do not contain all the required search terms in the first place. In case phrasal restrictions are specified, we map those to the ‘FOLLOWED BY’ operator that has recently been added to PostgreSQL's full text search capabilities (Bartunov and Zakirov 2016; PostgreSQL Global Development Group 2017).
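A hedged example of how such a phrasal query could be expressed against PostgreSQL's full text search from Python. The table and column names are hypothetical; the '<->' (FOLLOWED BY) tsquery operator is the one referred to above, and '<2>' would allow one intervening token, mirroring a '*' placeholder.

```python
import psycopg2

# Hypothetical schema: a table sentences(id, lemma_tsv tsvector).
conn = psycopg2.connect(dbname="multilingwis")  # connection parameters assumed
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT id FROM sentences WHERE lemma_tsv @@ to_tsquery('simple', %s)",
        ("tradition <-> ancien",),  # FOLLOWED BY: adjacent positions
    )
    print(cur.fetchmany(10))
```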

Outlook

The CoNLL-like import format used by Multilingwis already comprises many features that are typically represented in a text corpus, except for word alignment, which is essential for Multilingwis to work. We currently do not make use of the language-specific fine-grained part-of-speech tags, morphological information and dependency relations. Storing these attributes, indexing them and making them available in the form of a more sophisticated search term notation is thus arguably feasible. The challenge, however, is to bear in mind the trade-off between an easy-to-use interface and a full-fledged corpus query language. It may be advisable to provide graded variants of user input options to take account of different user groups' requirements.

5.3 Phraseme Identification

The previous section (5.2) deals with finding translation variants for multiword expressions specified by a user. Multiword expressions are understood here as any set of word forms or lemmas that appear together in a source language sentence. The corresponding (i.e., word-aligned) tokens in the target languages are denoted translation variants. Their frequency distribution together with the total number of search hits often indicates whether the given list of search terms forms a linguistic unit or whether those terms are merely about the same topic.25 In this section, we explore association measures to identify phrasemes, lexically restricted combinations of words.

25Searching for sentences with the terms ‘patient’ and ‘doctor’, we find translation variants with the corresponding lemmas of both English words in all languages (e.g., (médico, paciente) and (paciente, médico) in Spanish). Both possible orders are frequent, which suggests that the terms merely belong to the same topic but do not form a linguistically relevant unit together.

For Mel’čuk (1995), phrasemes are those combinations of at least two words that need to be included and explained in a dictionary. That implies that a language user cannot derive a phraseme's meaning from its parts. A ‘blue background’ is arguably a background that is blue, but ‘blue card’ (concrete noun) or ‘blue murder’ (abstract noun) are concepts that need to be explained in a dictionary. The challenge now is to automatically distinguish between merely compositional and phrasemic combinations. A clear division between the two is inherently difficult due to expressions that can either denominate real-world concepts or be used figuratively (e.g., ‘black box’, ‘red tape’, ‘green light’, ‘white paper’).26

A momentous observation made by Firth (1957a; 1957b) is that particular words impose restrictions on others. Though he was arguably not the first person to observe this interrelation, he is known for describing it and coining the now widely used term collocation (Firth 1957b, p. 195) for those restrictions. By way of example, he lists typical adjectives that can be collocated with the noun ‘ass’. While the respective combinations bear particular meanings (e.g., ‘silly ass’), his concept of collocation is based on the “mutual expectancy” of the involved words, “the hearing, reading or saying of [the collocation]”, and not its meaning (ibid.). Evert's (2004) definition of collocations accords with Mel’čuk's (1995) definition of phrasemes, that is, the respective combinations need to be listed in a dictionary by reason of not being transparent to a language user. In his work, he focuses on the frequency of those combinations, which can be obtained from sufficiently large corpora27 and which provides evidence for their collocational status.

Statistical Association Measures

Several statistical association measures have been suggested to identify collocations. Evert (2008, Chapters 4 and 5) gives an overview and describes the respective measures' strengths and shortcomings. In particular, he highlights the tendency of some measures to overrate particular cases, for instance, when one word of a pair is observed only a few times in a corpus, and points out the commonly applied frequency threshold to countervail this issue. He also classifies association measures into simple and sophisticated association measures.

Simple association measures are those based on the observed cooccurrence frequency (O) of a word pair and the expected cooccurrence frequency (E) of that pair under the assumption that both words in question are independent of each other, that is, equally distributed in the corpus. The sophisticated association measures, in contrast, also take into account the frequencies of all other possible events. All four possible events can be represented as a contingency table (Table 5.3), which we borrow from (ibid.).28

26We have chosen to exemplify those combinations with color adjectives modifying concrete nouns because they stick out statistically in our corpus, where they are predominantly used in a figurative sense. Apart from those ambiguous combinations, we find, for instance, ‘bottomless pit’, ‘blind eye’ or ‘blank cheque’. In German, typical exemplars are ‘runder Tisch’ ‘round table’, ‘erster Schritt’ ‘first step’ and ‘schmaler Grat’ ‘thin line’ (literally ‘narrow ridge’). 27A means that Firth did not have at his disposal.

Table 5.3 – Contingency table for the observed frequencies of words w1 and w2. O11 corresponds to the value O of the simple association measures.

            w2     ¬w2
w1          O11    O12
¬w1         O21    O22

The expected frequencies are the result of multiplying the independent frequencies of both respective events (word w1 or ¬w1 and w2 or ¬w2) in relation to the corpus size N (Table 5.4, also borrowed from (ibid.)).

Table 5.4 – Contingency table for the expected frequencies of words w1 and w2. E11 corresponds to the value E of the simple association measures. In each cell, the absolute frequencies of both independent events (e.g., f(w1) = O11 + O12) are multiplied and divided by the sample size N (i.e., the number of sentences in the corpus).

            w2                                  ¬w2
w1          E11 = (O11 + O12) ⋅ (O11 + O21)/N   E12 = (O11 + O12) ⋅ (O12 + O22)/N
¬w1         E21 = (O21 + O22) ⋅ (O11 + O21)/N   E22 = (O21 + O22) ⋅ (O12 + O22)/N

We calculate observed and expected frequencies for all pairs of consecutive tokens in our corpus (FEP9) based on both word forms and lemmas. To demonstrate the respective simple association measures, we extract those pairs where the second lemma is ‘fish’, independent of its part of speech, and rank all pairs according to six different association scores (ibid., p. 1225). In Table 5.5, we list the overall best scoring potential collocations and the ranks for the respective measures.
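For reference, all six simple association measures can be computed directly from O and E. The sketch below uses their common textbook formulations (base-2 logarithms for the MI family; natural logarithm for simple log-likelihood); the toy values are the ‘juvenile fish’ row from Table 5.5.

```python
import math

def simple_ams(O, E):
    """Six simple association measures computed from the observed (O) and
    expected (E) cooccurrence frequencies of a word pair."""
    return {
        "MI": math.log2(O / E),
        "MI2": math.log2(O**2 / E),
        "local-MI": O * math.log2(O / E),
        "z-score": (O - E) / math.sqrt(E),
        "t-score": (O - E) / math.sqrt(O),
        "simple-ll": 2 * (O * math.log(O / E) - (O - E)),
    }

# 'juvenile fish' from Table 5.5: O = 30, E = 0.014
for name, score in simple_ams(30, 0.014).items():
    print(f"{name:10s} {score:12.2f}")
```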

28We shall stick to the simple association measures since they “often give close approximations to the more sophisticated association measures” and “[t]herefore […] are sufficient for many applications” (Evert 2008).

Table 5.5 – The 30 best scoring potential lemma collocations for w2 = ‘fish’ based on the n-best lists of six different association scores. The values r1 to r6 give the corresponding ranks of the following association measures: (1) mutual information (MI), (2) MI², (3) local mutual information (local-MI), (4) z-score, (5) t-score, (6) simple log-likelihood (simple-ll) (Evert 2008, p. 1225). f(w2) is constantly 3383 and therefore omitted.

w1            f(w1)       O11     E11      r1   r2   r3   r4   r5   r6
juvenile          176      30     0.014     9    1    2    1    4    2
immature           36      13     0.003     3    2    8    2   23    6
white             585      28     0.046    16    8    3    8    5    3
of          1 421 044     575   111.407   116   20    1   24    1    1
conserve          263      24     0.021    13    7    5    7    6    4
deep-sea          101      17     0.008    10    5    6    5   15    5
undersized         16       8     0.001     2    3   19    3   34   16
scorpion            5       4     0.000     1    4   36    4   65   30
edible             23       8     0.002     4    6   20    6   35   17
migratory         403      17     0.032    17   12    9   12   16    8
fresh           1 275      21     0.100    29   17    7   17   11    7
wild              389      16     0.030    18   14   12   14   18   11
vessel          2 328      22     0.183    37   22   10   21   10    9
diseased           21       5     0.002     6    9   30    9   53   26
freshwater         24       5     0.002     7   10   32   10   54   28
to          1 316 995     224   103.250   156   43    4   70    2   10
preserve        2 452      20     0.192    40   24   13   23   13   12
tuna              726      14     0.057    26   19   16   19   21   15
demersal           28       5     0.002     8   11   34   11   55   29
catch           2 201      18     0.173    39   25   15   25   14   13
dwindle           119       7     0.009    15   16   25   16   41   23
young           8 348      24     0.654    66   32   14   32    8   14
deplete           359       9     0.028    23   21   23   20   31   22
canned             50       5     0.004    12   15   38   15   56   32
landed              6       2     0.000     5   13   63   13  101   55
on            404 469      95    31.710   140   57   11   72    3   18
farm            2 176      14     0.171    47   28   21   28   22   21
few            16 705      25     1.310    83   45   17   46    9   19
protect        11 435      22     0.896    75   40   18   41   12   20
fleet           2 084      11     0.163    51   34   26   34   28   24

Note that some measures (local mutual information (local-MI), t-score and simple log-likelihood (simple-ll)) assign high scores to pairs with frequent prepositions (‘of’, ‘to’, ‘on’), while, in particular, the mutual information measure (MI) ranks those pairs considerably lower.29 From our point of view, few of the best scoring collocation candidates with ‘fish’ qualify for a dictionary entry; the best candidates are compounds (‘scorpion fish’, ‘deep-sea fish’ and ‘freshwater fish’). At least, most of the adjective + noun combinations can be understood if both adjective and noun are known to the language user. There are, nonetheless, combinations that are restricted by language. For instance, ‘juvenile’ bears a meaning similar to ‘young’, which is also represented, but is limited to animate beings, whose life follows a predetermined cycle.30

Further investigating the MI measure, we see that the 200-best list predominantly contains proper nouns with low frequencies (from two to five occurrences). Among them, we find organizations (e.g., ‘Shin Bet’, ‘Nueva Izquierda’, ‘Shining Path’, ‘Bayern München’), persons (e.g., ‘Ayn Rand’, ‘Hannah Arendt’, ‘Bertrand Russell’, ‘Vargas Llosa’), places (e.g., ‘Yad Vashem’, ‘Kloster Andech’) and companies (e.g., ‘Hewlett Packard’, ‘SmithKline Beecham’). Exceptions are technical terms (e.g., ‘staphylococcus aureus’, ‘mucous membrane’, ‘beta interferon’, ‘light-emitting diode’, ‘Simpler Legislation’) and common nouns such as ‘roller coaster’, ‘spiny dogfish’ or ‘oily slime’. Figure 5.4 shows the observed frequencies of the 200-best lists for six simple association measures. Their division into two groups by frequency is striking. The best scoring potential collocation of the three measures (local-MI, t-score and simple log-likelihood) that prefer higher observed frequencies is ‘of the’ with O = 441 459.

So far, we have only based our calculation of the cooccurrence frequency on consecutive token pairs. If we loosen this restriction and say that words cooccur if they are to be found within a particular distance,31 we will also identify expressions that typically embrace other words as potential collocations (e.g., French negations ‘ne … pas’, ‘ne … plus’, ‘ne … jamais’, ‘ne … rien’).
Since we base the frequencies in this case on token sequences and not on any deeper linguistic properties, Evert (2008) refers to it as surface cooccurrence. Measuring cooccurrence on consecutive token pairs is thus a special case of surface cooccurrence. Another option that yields even higher frequencies is to count pairs of words that cooccur in the same sentence. This textual cooccurrence (ibid.) yields words that have a semantic relation insofar as they are frequently used together in the same sentence.
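A window-based surface cooccurrence count is a straightforward generalization of counting consecutive pairs; a minimal sketch:

```python
from collections import Counter

def surface_cooccurrence(tokens, max_dist=3):
    """Count surface cooccurrences of token pairs within a window.

    With max_dist=1 this reduces to consecutive token pairs; larger windows
    also capture discontinuous patterns such as French 'ne ... pas'.
    """
    pairs = Counter()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + max_dist]:
            pairs[(w1, w2)] += 1
    return pairs

print(surface_cooccurrence("il ne le sait pas".split(), max_dist=3)[("ne", "pas")])
# -> 1
```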

29It has, on the other hand, “a tendency to assign inflated scores to low-frequency word pairs” (Evert 2008). 30This dependency resembles the concept of lexical functions (Wanner 1996; Mel’čuk 1998), which models collocations as pairs of a word (denominated ‘lexical unit’) and the return value of a function applied to that word. Such a function would, in this case, return words like ‘young’, ‘fresh’, ‘recent’, ‘juvenile’, ‘novel’, ‘nouveau’, etc., depending on the argument it is applied to. 31That distance is commonly limited to either three or five tokens (Bartsch and Evert 2014).

Figure 5.4 – Number of observed events (O) by rank for the 200-best lists of six different simple association measures (log-scaled y-axis: observed events; x-axis: rank, 0 to 200). The association measures can visually be divided into two groups: those with a preference for comparably low frequencies (MI, MI² and z-score) and those with a preference for high frequencies (local-MI, t-score, simple-ll).

Table 5.6 shows two examples for pairs of words that occur together more frequently than expected (O > E). Inspecting the single example where ‘rapaciously’ and ‘financing’ occur together,32 we see that they do not actually interact, which raises the question whether there can be a meaningful interpretation of textual cooccurrence on the sentence level. Cooccurrence-based word embedding techniques typically use token spans of a limited size and not whole sentences (Lebret and Collobert 2014; Li et al. 2015).

32“But if this Parliament is really concerned about the efficiency of the CFSP, and not only concerned with increasing its powers rapaciously, it should recommend a different attitude in order to overcome the contradiction between the institutional nature of the CFSP, and its method of financing.”

Table 5.6 – Sample cooccurrence frequencies calculated on entire sentences. The expected cooccurrence frequencies are E11 = 96.3 for ‘investment’ and ‘crisis’, and E11 = 0.002 for ‘rapaciously’ and ‘financing’. The former pair receives higher scores for all the six simple association measures previously used except for MI, which favors the latter.

                crisis    ¬crisis                      financing   ¬financing
investment         221     17 150      rapaciously           1            0
¬investment      9 070  1 649 366      ¬rapaciously      3 395    1 672 411
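The expected frequencies in Table 5.6 follow directly from the marginals of the contingency table (Table 5.4); a quick check with the numbers above:

```python
def expected_frequency(O11, O12, O21, O22):
    """E11 under the independence assumption (cf. Table 5.4)."""
    N = O11 + O12 + O21 + O22
    return (O11 + O12) * (O11 + O21) / N

# 'investment'/'crisis' from Table 5.6
print(round(expected_frequency(221, 17_150, 9_070, 1_649_366), 1))  # ~96.3
# 'rapaciously'/'financing'
print(round(expected_frequency(1, 0, 3_395, 1_672_411), 3))         # ~0.002
```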

The third way to measure cooccurrence detailed in Evert (2008), in addition to surface and textual cooccurrence, is to use syntactic relations, hence named syntactic cooccurrence. The difference from surface and textual cooccurrence is that each syntactic relationship is treated separately since they encode distinct relations. Applied to our corpus (FEP6), we find the same pattern as observed with surface cooccurrence: Some association measures show a preference for combinations of rare words and some prefer high frequencies. For adjectival modifiers, the highest MI score is attained by combinations of hapax legomena such as ‘46-year-old sergeant-major’, ‘virginal pedestal’, ‘rosy-cheeked milkmaid’ or ‘meditative ikebana’. Several of those examples are results of part-of-speech tagging and parsing errors (e.g., ‘harlequin ladybird’ and ‘straw bedding’). It is therefore beneficial to apply a threshold for filtering out at least those cases where O = 1 for syntactic cooccurrence.

Interlingual Association Measures

A large word-aligned parallel corpus lends itself to applying the same association measures to the company of strangers, to tie in with Firth's (1957) phrasing. This is basically a correlation that the word aligner learns (see Section 4.4), though typically not per se on single words but on phrases. However, the vast majority of the resulting alignment units are one-to-one alignments (depicted in Figure 4.21). We will refer to those frequencies of interlingual one-to-one alignments as interlingual cooccurrence to stress the point that they have been obtained from a particular corpus. Interlingual cooccurrence resembles syntactic cooccurrence insofar as it is based on relations between particular tokens. In contrast to syntactic relations, there are no distinguishable categories of alignment relations.33

Table 5.7 shows expected cooccurrence frequencies of the English word ‘opponent’ and two possible translations into German. The marginal frequencies, that is, the number of occurrences of a word that we count in the corpus, irrespective of whether it cooccurs with some other word or not, are calculated on the subset of tokens that are aligned between these two languages.

33Since cooccurrence describes a reciprocal property, we use symmetrized word alignments.

Table 5.7 – Contingency tables for two possible German translations of English ‘opponent’ based on interlingual cooccurrence. Expected frequencies are E11 = 0.026 for ‘Gegner’ and E11 = 0.00074 for ‘Widersacher’. The respective association measures all rank ‘Gegner’ better, though the MI scores are similar.

               opponent   ¬opponent                     opponent   ¬opponent
Gegner              423         360      Widersacher         10           12
¬Gegner             290  21 183 754      ¬Widersacher       703   21 183 276

If we compare the resulting association scores with those of the frequent pair ‘because’/‘weil’ (O = 20 830, E = 86), we see the same characteristics as with surface cooccurrence: local-MI, t-score and simple-ll clearly favor the latter.

To achieve our objective of phraseme identification, we combine syntactic and interlingual cooccurrence. We use support verb constructions (SVC)34 with direct objects to exemplify our approach (as illustrated first in Graën 2017). Support verbs in those constellations, which consist of a verb and its direct object, are verbs that make little or no contribution to the semantics of their sentences;35 they are merely required to syntactically connect the predicative noun. The noun thus acts through the support verb, and the verb syntactically realizes complements required by the noun. In the sentence “Profits take precedence over humanitarian concerns.”, for example, the SVC ‘to take precedence (over sth.)’, which realizes the direct object, is accompanied by the subject (‘profit’) and a prepositional object (‘over humanitarian concerns’). This object is a semantic argument of the noun (‘precedence over …’) and not of the verb (‘take … over …’).

We exploit this discrepancy between syntax and semantics by making use of the fact that two corresponding (i.e., aligned) constellations of verb and direct object in parallel sentences, where at least one of them is an SVC, are likely to show dissimilar verb semantics.36 Although we use a particular constellation of syntactic relations and word correspondences between two languages for demonstration purposes, this method is generally applicable to other parallel structures owing to the non-compositionality of phrasemes (Mel’čuk 1998).37

34Frequently also referred to as light verb constructions. 35“The semantics of the support verb is either void or reduced to a small set of semantic features that are relevant for very large subclasses of verbs […]” (Langer 2004). 36This is not necessarily the case for all support verb constructions and each language pair. The English SVC ‘to play a role’, for instance, can be literally translated to SVCs in German (‘eine Rolle spielen’), French (‘jouer un rôle’) or Polish (‘odgrywać rolę’). 37Melamed (1997a) concludes that “texts in two languages are not only preferable but necessary for discovery of non-compositional compounds for translation-related applications.”

Figure 5.5 – Corpus sample for the corresponding phrasemes ‘pay attention (to)’ and ‘Aufmerksamkeit schenken’. Direct object relationships in both languages and word alignments between both languages are indicated.

The SVC ‘to pay attention (to)’, for instance, is frequently translated as ‘Aufmerksamkeit schenken’ or ‘Aufmerksamkeit widmen’ (literally ‘to give attention as a present’ and ‘to dedicate attention’). In this example, the lemmas ‘attention’ and ‘Aufmerksamkeit’ are standard translations, while ‘pay’ and ‘schenken’ (or ‘widmen’) cannot be used as translations of each other outside of a limited list of phrasemes (‘Beachtung schenken’ is an alternative, less frequent translation of ‘to pay attention’). Figure 5.5 shows a corpus sample, which indicates how syntactic relations and word alignments together form a constellation. Here, we require a noun to be the direct object of a verb in both languages and both nouns and both verbs to be aligned.

There are several approaches to extracting multiword expressions (MWE) – to use the most general term – from parallel corpora. Melamed (1997a) employs IBM translation models (Brown, V. J. Della Pietra et al. 1993) to find sequences of words (‘non-compositional compounds’) that, treated as a single unit, increase the predictive power of a deduced translation model in terms of mutual information score. Villada Moirón and Tiedemann (2006) identify potential MWEs in the Dutch part of Europarl using statistical measures on relations obtained by syntactic dependency parsing. They use word alignment between Dutch and three other languages to calculate the translational entropy (Melamed 1997b) of each MWE as the average of the individual words' entropy.38 The calculated entropy is used to rank the previously identified MWE candidates.
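The translational entropy used for this ranking can be sketched as follows; the input format (a list of aligned target lemmas per source word) is assumed for illustration:

```python
import math
from collections import Counter

def translational_entropy(links):
    """Entropy of a word's translation distribution (Melamed 1997b);
    `links` lists the target lemmas the word is aligned to in the corpus."""
    counts = Counter(links)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mwe_entropy(component_links):
    """An MWE's entropy as the average over its component words."""
    return sum(map(translational_entropy, component_links)) / len(component_links)
```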

38“Translational entropy measures the predictability of the translation of an expression by looking at the links of its components to a target language.” (Villada Moirón and Tiedemann 2006)

As assumed, a high entropy is found to correlate with idiomaticity. Cap (2017) follows this approach to investigate the compound compositionality of English and German nouns. To this end, she splits German compound nouns into their respective parts by means of morphological analysis, performs word alignment on the modified corpus and calculates the translational entropy, with the German MWEs being the previously split compound nouns.

The objective of Zarrieß and Kuhn (2009) is to identify MWEs that translate to a single word in another language. Starting with a German verb, they calculate the alignment distribution into English and subsequently filter out unwanted constellations (based on, e.g., overly distant connections in the syntax tree). They found that their syntactic filters reliably support the identification of the MWE type in question. Medeiros Caseli et al. (2010) similarly use part-of-speech tag sequences to filter out MWE candidates. They deal, like Melamed (1997a), with MWEs that consist of token sequences. Their approach relies on word alignments and sequences identified by the employed part-of-speech tagger, which recognizes, for instance, compound nouns. The method proposed by Vargas et al. (2017) targets the identification of SVCs by means of part-of-speech patterns for the identification of potential MWEs in each language separately and a ranking of MWE candidate pairs based on cooccurrence and a multilingual word embedding model. This model “ensure[s] that words that are translations of each other end up being close in the resulting semantic space” (ibid.), which is counterproductive for the identification of SVCs from our point of view, since we expect the verbs of parallel SVCs to show non-standard translations, as we have detailed above. Accordingly, they report mixed, though “promising”, results.

We search our corpus for such constellations using three language pairs: English/German, English/Italian and German/Italian. The most frequent constellations are shown in Table 5.8. We see that in many cases the respective verb pairs are frequent translations, which we also expect to find in other contexts (e.g., ‘play’/‘spielen’, ‘take’/‘tenere’, ‘finden’/‘trovare’), but also examples of non-standard correspondences such as ‘treffen’ ‘to strike’/‘to hit’ with ‘prendere’ ‘to take’. These are the ones that we want to use to identify phrasemes. To obtain a dictionary entry from those search results, that is, to get the entire phraseme, we also have to take into account other parts of the respective verb + noun pair such as fixed determiners or prepositions (e.g., ‘play a role’, ‘take into account’, ‘draw attention to’, ‘avere il diritto di’).

In (Graën and Bless 2017), we present a web application (depicted in Figure 5.6) to explore association measures visually, based on both syntactic and interlingual cooccurrence frequencies.39 The user can choose between English, German and Italian as source language, restrict the search to particular verbs and nouns by means of regular expressions, and change the ranking of the results.

39Available at http://pub.cl.uzh.ch/purl/visual_association_measures.

Table 5.8 – Most frequent constellations in FEP6 that involve direct object relationships and alignments between the respective verbs and nouns for all combinations of English, German and Italian.

English                       German
verb      dir. object         dir. object   verb               freq.
play      role                Rolle         spielen             2095
have      right               Recht         haben               1181
support   report              Bericht       unterstützen        1084
find      solution            Lösung        finden               983
support   proposal            Vorschlag     unterstützen         799

English                       Italian
verb      dir. object         verb          dir. object        freq.
take      account             tenere        conto               2522
play      role                svolgere      ruolo               2432
thank     rapporteur          ringraziare   relatore            2142
have      right               avere         diritto             1890
draw      attention           richiamare    attenzione          1320

German                        Italian
dir. object    verb           verb          dir. object        freq.
Rolle          spielen        svolgere      ruolo               1726
Recht          haben          avere         diritto             1241
Problem        lösen          risolvere     problema            1088
Lösung         finden         trovare       soluzione            845
Entscheidung   treffen        prendere      decisione            716

The standard ranking is performed by descending frequency. The other options are to apply one of five association measures, all of which are simple ones based on O11 and E11, in descending (standard) or ascending order. A threshold is provided to filter out infrequent constellations. We set the default threshold to two occurrences because a single occurrence may be the result of a random error in one of the statistically generated layers, although the circular constellation of alignments and dependency relations is still required. Two occurrences already require that error to be systematic.

Once a verb + noun pair from the source language has been selected, the distribution of aligned lemmas is shown for all available target languages, similar to the list of translation equivalents in Multilingwis (see Section 5.2). In addition to the source language ranking, which depends on either the source language syntactic cooccurrence frequency or one of both interlingual frequencies, the target language results can be ranked according to the same association measures. For the web application data, we remove the requirement of syntactic dependency between the aligned tokens; we combine all aligned tokens into a list instead (again similar to what Multilingwis does).

Figure 5.6 – Our web application to explore the effect of different association measures on both syntactic and interlingual cooccurrence frequencies.

That way, we get noisier but also more data. We find, for instance, ‘(to) bear witness’ with the single word translations ‘zeugen’, ‘bezeugen’, ‘zeigen’ in German and ‘testimoniare’, ‘dimostrare’ in Italian, which we possibly would have missed with the syntactic restriction in place. At the same time, we also get lists with more than two words, for instance, when the German source language noun is a compound: ‘Todesstrafe abschaffen’ is aligned with ‘abolish death penalty’ and ‘abolish capital punishment’.

We designed the application to provide a means for exploring the preferences of different association measures and the difference between syntactic and interlingual cooccurrence. In particular, we are interested in finding the association measure that best disqualifies those verbs that serve as translations outside their phrasemic use. For the language pair English/German, we want, for instance, to disregard ‘geben’ as the predominant translation of ‘give’, so that ‘give work’ and ‘Arbeit geben’ will receive a low score, while ‘setzen’ ‘set’/‘put’ as a translation of ‘give’ will result in higher scores for combinations (e.g., ‘Zeichen setzen’ ‘to give a sign’, ‘Signal setzen’ ‘to give a signal’ or ‘Prioritäten setzen’ ‘to give priority’). What we are looking for should, hence, rather be named a dissociation measure.

For that purpose, we decided to combine different association measures. The challenge in doing so is that they calculate a score that is meaningful in comparison to other scores calculated on the same cooccurrence frequencies (i.e., the same corpus), but meaningless in comparison to scores of other association measures or corpora.40

To solve this problem, we use the respective association measures for ranking and convert them to values between zero and one using a cumulative percentile ranking (i.e., a linear mapping of the respective ranks). For our constellation, we found setting the association score of the nouns (N) in relation to the score of the verbs (V) a good starting point. To restrict the values that the ratio of these two scores can take, we add a constant value δ to both. We multiply this fraction with the weighted sum of normalized association scores for the source (S) and target language (T) pairs and get a combined score for the constellation:

$$\mathrm{score} = \bigl(w_S \cdot m_s(S) + w_T \cdot m_s(T)\bigr) \cdot \frac{\delta + m_i(N)}{\delta + m_i(V)} \tag{5.5}$$

For the combined score, we allow different association measures for syntactic (m_s) on the one hand and interlingual cooccurrence frequencies (m_i) on the other hand, but not for the respective pairs on the same dimension. In Table 5.9, we list the best scoring constellations with local-MI as m_s, O/E as m_i,41 and both weights and δ set to 1.

Table 5.9 – Highest ranked verb + noun pairs for all combinations of English, German and Italian using a joint score of syntactic and interlingual association measures.

English                       German
verb     dir. object          dir. object     verb                      freq.
take     shape                Gestalt         annehmen ‘adopt’             39
set      precedent            Präzedenzfall   darstellen ‘represent’       10
reduce   poverty              Armut           bekämpfen ‘combat’            4
set      precedent            Präzedenzfall   schaffen ‘create’            78
take     precedence           Vorrang         haben ‘have’                 47

English                       Italian
verb     dir. object          verb                     dir. object       freq.
take     look                 dare ‘give’              occhiata             21
take     precedence           dare ‘give’              precedenza            4
send     condolence           esprimere ‘express’      condoglianza          5
take     precedence           avere ‘have’             precedenza           92
have     illusion             fare ‘make’              illusione            20

German                                 Italian
dir. object     verb                   verb                      dir. object   freq.
Abhilfe         schaffen ‘create’      porre ‘put’               rimedio          36
Präzedenzfall   schaffen ‘create’      costituire ‘establish’    precedente       23
Oberhand        gewinnen ‘win’         prendere ‘take’           sopravvento       8
Mühe            machen ‘make’          prendere ‘take’           briga             9
Klarheit        schaffen ‘create’      fare ‘make’               chiarezza         6

Our impression from larger n-best lists than the ones presented in Table 5.9 is that we find candidates with good prospects for a dictionary entry, even those that show a low overall frequency such as ‘(to) reduce poverty’ and the (metaphorical) phraseme ‘Armut bekämpfen’ in German. The standard (i.e., literal) translations of ‘reduce’, ‘reduzieren’ and ‘verringern’, do also occur in our corpus, but they are less phrasemic and their meaning can be inferred from the meaning of their constituents. We also observe that, similar to the case of ‘to reduce poverty’, there are cases where the expression in one language is more phrasemic than in the other one. This is owed to the method, which only requires that the verbs do not correlate well. If we were to say which of the two expressions to propose as a dictionary entry, we would need to consult both the syntactic cooccurrence frequencies and the marginal frequencies of the respective verbs.

40An exception may be, to a certain extent, the association measures based on information theoretic considerations (Evert 2008). 41Which gives the same ranking as MI since the logarithm function is monotonic.
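A compact rendering of Equation 5.5, with a naive cumulative percentile ranking for score normalization; this is a sketch, and the exact treatment of ties is our assumption:

```python
def percentile_ranks(scores):
    """Map raw association scores to (0, 1] by cumulative percentile rank;
    ties share the lowest rank in this simple variant."""
    order = sorted(scores)
    n = len(order)
    return {s: (order.index(s) + 1) / n for s in scores}

def combined_score(mS, mT, mN, mV, wS=1.0, wT=1.0, delta=1.0):
    """Equation 5.5: mS and mT are normalized syntactic association scores
    for the source and target pair, mN and mV normalized interlingual
    scores for the aligned nouns and verbs, respectively."""
    return (wS * mS + wT * mT) * (delta + mN) / (delta + mV)
```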

Outlook

In (Graën 2017; Graën and Bless 2017), we have presented a novel approach to identify phrasemes by means of interlingual association scores, which are common association scores applied to word-alignment-based cooccurrence frequencies. Word alignment on parallel corpora has proven useful for this objective. By combining syntactic and interlingual association measures,42 we found a promising method to identify support verb constructions as a special case of phrasemic structures. We believe that the method will generalize to phrasemes of an arbitrary syntactic structure. In cases where two languages agree on the choice of words (e.g., by reason of language history), an examination of several language pairs will prove helpful.

A systematic evaluation of the resulting n-best lists is necessary to assess the method's success rate. To this end, different combinations of association measures and weights should be compared to gold standard phraseme lists in the respective languages (Bartsch and Evert 2014). Nonetheless, a clear definition of which expression should be included in a dictionary and which one is self-explanatory to a language user can arguably not be found, as semantic transparency needs to be thought of as a continuum. Idiomaticity is typically not as obvious as in the case of support verb constructions.

As our main interest is the identification of monolingual phrasemes, we will also need to test syntactic cooccurrence frequencies and the marginal frequencies of the respective verbs as indicators for which of the two aligned verb + noun combinations is phrasemic and which one is not. We also consider combining the results of retrieval on several language pairs to distinguish between good candidates and combinations that are only useful in contrast with them. To get actual dictionary entries from our verb + noun combinations (e.g., including determiners and prepositions), a frequency analysis on syntactic cooccurrences needs to be performed for each combination that has been identified.

5.4 Backtranslating Prepositions for Prediction of Language Learners’ Transfer Errors

Multiword expressions are among the lexical items that language teaching materials such as textbooks introduce to learners. They are typically chosen to enhance the learner's expressivity gradually. ‘False friends’, that is, words or multiword expressions that look similar to expressions in the learner's native language (L1) but convey a different meaning (see also Section 5.1), may be addressed if the teaching material has been created with that particular L1 in mind. While a limited number of false friends can be listed so that the learner memorizes them, the correct use of function words, notably prepositions, is comparably harder to accomplish as many prepositions are predetermined by other words, in particular verbs and adjectives.

42Using surface cooccurrence instead of syntactic cooccurrence will presumably generate results similar to the phrase tables that bilingual word aligners learn from sentence-aligned corpora (see Section 4.4). Bartsch and Evert (2014) also state that “assuming a syntactic dependency context is optimal for an identification of Firthian collocations.”

Preposition usage can be divided into two classes: those that are semantically transparent (e.g., ‘on’ in ‘sit on’) and those that are intransparent (e.g., ‘for’ in ‘wait for’).43 Phrasal verbs (e.g., ‘depend on’) cannot be decomposed semantically into verb meaning and preposition meaning either. As we see in these examples, both classes overlap; the preposition ‘on’ can be a spatial reference (e.g., ‘on a table’) or merely a linguistic necessity as in ‘depend on’.44 Prepositions show vast cross-lingual variation and are “notoriously difficult to master” for non-native speakers (Swanepoel 1998). Furthermore, “[they] are often considered […] impossible to teach and impossible to learn” (Gilquin and Granger 2011).

One frequent source of lexico-grammatical errors are combinations of verbs or adjectives with prepositions, henceforth referred to as VPC and APC, respectively. We understand these combinations in the broader sense of the term, including the notions of compound verbs and phrasal verbs. According to Gardner and Davies (2007), phrasal verbs represent “one of the most notoriously challenging aspects of instruction”. In what follows, we describe our approach to predicting transfer errors in VPC/APC committed by learners with different L1 backgrounds. The prediction is entirely based on our corpus (FEP6); we access learner corpora only for evaluation. For the sake of clarity, we will explain our approach (as presented in Graën and Schneider 2017) using the example of VPC. The treatment of APC is carried out accordingly.

Figure 5.7 – Sample constellation from our corpus: the English sentence “nonetheless they suffer from unfair treatment” (verb λv = ‘suffer’, preposition λp = ‘from’) aligned with the German sentence “gleichwohl leiden sie unter einer ungerechten Behandlung” (aligned preposition λp′ = ‘unter’). The dependency between verb and preposition and the alignment between both prepositions are marked.

We first identify VPCs by searching for prepositions, so-called prepositional modifiers in the Stanford typed dependencies schema (Marneffe and Manning 2008), of a verb in the English part of our corpus. Using word alignment information (see Section 4.4), we retrieve the corresponding prepositions in all other languages. For a constellation to qualify as a result, all these conditions must be met. Figure 5.7 depicts such a constellation. We use the symbols λv, λp and λp′ to refer to the English verb, the English preposition and the aligned foreign language preposition, respectively.

43Gilquin and Granger (2011) identify eight semantic classes for the preposition ‘into’; Swanepoel (1998) reports 35 different senses of ‘in’ in The Concise Oxford Dictionary, 41 senses in the Longman Dictionary of Contemporary English and 56 senses in the Collins Cobuild English Language Dictionary. 44Pinker (1996) reasons why children acquire semantically transparent prepositions earlier.

The verb + preposition pairs obtained this way provide us with corpus frequencies of different prepositions per verb. We also calculate corpus frequencies for the raw occurrences of the respective verbs. That way, we can calculate the ratio of verb usage with and without prepositions. The verb ‘to suffer’, for instance, occurs in 26% of its occurrences with the preposition ‘from’. Other less frequent prepositions are ‘in’ (9%) and ‘under’ (2%). The preposition ‘from’ shows the highest frequency, and we therefore assume that it is the correct one if ‘suffer’ is used in combination with a preposition, and continue working with that VPC.

Table 5.10 – Backtranslation score (BTS) and backtranslation ratio (BTR) for different backtranslated prepositions λp″ of ‘to suffer from’. The list on the left is derived from English/German and the one on the right from English/French alignments.

λv     λp    λp″       BTS     BTR      λv     λp    λp″       BTS     BTR
suffer from  under   102.51    2.51     suffer from  of      457.31   13.11
suffer from  of      100.04    2.46     suffer from  for      68.79    1.97
suffer from  in       78.56    1.93     suffer from  by       58.44    1.68
suffer from  by       51.19    1.25     suffer from  in       57.56    1.65
suffer from  on       46.53    1.14     suffer from  from     34.88    1.00
suffer from  from     40.97    1.00     suffer from  on       25.58    0.73
suffer from  with     36.32    0.89     suffer from  with     18.24    0.52
suffer from  among    27.93    0.68     suffer from  about     7.23    0.21

Having identified the correct verb preposition combination, we derive the distribution of aligned target language prepositions for each target language (here: French, German, Italian, Polish and Spanish). We subsequently multiply these distributions with the lemma alignment distribution matrix (see Section 3.2.1) and obtain lists of English prepositions with weights in each language. We call them backtranslation scores (BTS) since we started from an English preposition at the outset. A backtranslation score has no meaningful value (absolute frequency of syntactic occurrences multiplied with lemma alignment probability) but allows us to rank the resulting prepositions. We normalize the values by comparing each score with the score achieved by the correct preposition, that is, by calculating the ratio of both scores, and obtain the backtranslation ratio (BTR), which defaults to 1 for the correct preposition. We interpret values higher than 1 as the likelihood of the corresponding preposition being confused by a language learner whose native language is the language in question. The highest scoring preposition is the one we expect to be mistakenly used most frequently.

Table 5.10 lists backtranslation scores and ratios for the VPC ‘suffer from’ with German and French as intermediate languages. The most likely English preposition to be mistakenly used by German learners of English is ‘under’, which is presumably due to the frequent translation ‘leiden unter’ of ‘suffer from’ (see also Table 5.10). The German preposition ‘unter’ is predominantly used as a spatial reference like its English counterpart ‘under’ and rarely in phrasal verbs. While in some cases the most salient wrong preposition predicted by our algorithm can easily be traced back to a single verbal expression in the respective language, this is not true for the majority of VPCs. The predicted wrong English preposition is the result of all occurring translations together with the general translation preferences of the prepositions in other languages.

We generate lists of VPCs and APCs ranked by a combination of the respective verb's corpus frequency and the highest BTR for that VPC or APC. As these lists are meant as recommendations for language learners, we limit them to approximately 100 VPCs and 25 APCs per language. Sample recommendation lists are provided in Appendix C.2.
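The two steps, multiplying target-side preposition frequencies with backtranslation probabilities and normalizing by the correct preposition's score, can be sketched as follows. All numbers are toy values, not corpus statistics:

```python
def backtranslation_ratios(correct_prep, tgt_prep_freq, backtrans_prob):
    """Backtranslation scores (BTS) and ratios (BTR) for one VPC.

    `tgt_prep_freq`: frequency of each aligned target-language preposition
    for the VPC; `backtrans_prob[p_tgt][p_en]`: lemma alignment probability
    of backtranslating the target preposition to an English one (both
    assumed to be extracted from the corpus as described above).
    """
    bts = {}
    for p_tgt, freq in tgt_prep_freq.items():
        for p_en, prob in backtrans_prob.get(p_tgt, {}).items():
            bts[p_en] = bts.get(p_en, 0.0) + freq * prob
    correct_score = bts[correct_prep]
    btr = {p: score / correct_score for p, score in bts.items()}
    return bts, btr  # BTR of the correct preposition is 1 by construction

# toy values loosely inspired by 'suffer from' via German
freqs = {"unter": 1200, "an": 800, "von": 300}
probs = {"unter": {"under": 0.8, "from": 0.05},
         "an":    {"at": 0.4, "from": 0.2, "on": 0.3},
         "von":   {"from": 0.7, "of": 0.2}}
bts, btr = backtranslation_ratios("from", freqs, probs)
print(max(btr, key=btr.get))  # -> 'under', the predicted transfer error
```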

Evaluation

We evaluate our approach in two ways: On the one hand, we look up the predicted error-prone VPCs and APCs from our lists in learner corpora to verify that these predict errors that language learners actually commit. On the other hand, we propose corrections to a list of incorrect combinations (Schneider and Gilquin 2016) based on the respective preposition that our algorithm assumes to be the correct one and judge them manually.

The basis for our evaluation of actual learner errors are three learner corpora:45 the International Corpus of Learner English (ICLE) (Granger et al. 2002), the First Certificate in English (FCE) dataset (Yannakoudakis et al. 2011) and the NICT Japanese Learner English (JLE) Corpus (Japanese National Institute of Information and Communications Technology 2012). Since we do not have any resource available that would provide us with a sufficiently large number of learner errors subdivided by language, we derive a language-independent list of VPCs and APCs that appear in each of the language-specific lists. The number of VPCs in the intersected list amounts to 40 and the number of APCs to seven.

45Those are typically error-annotated collections of text or transcribed speech produced by language learners, which are studied for understanding the properties of language learning.

Table 5.11 – Language-independent verb preposition and adjective preposition combinations present in all five language-specific recommendation lists. 23 out of 31 relevant ones can be found in at least one of the learner corpora we searched (I: ICLE; N: NICT JLE; F: FCE). Entries that we consider valid are checked, questionable ones receive a question mark.

VPC/APC            OK?   found        VPC/APC            OK?   found
aim at             ✓     ✓            look at            ✓     ✓ ✓ ✓
arrive at          ✓     ✓ ✓ ✓        miss from          ✓
benefit from       ✓     ✓            plunge into        ?     n/a
breathe into       ?     n/a          preside over       ✓
channel into       ✓     n/a          profit from        ✓     ✓
complain about     ✓     ✓ ✓ ✓        protect from       ✓
compliment on      ✓                  recover from       ✓
convert into       ✓     n/a          suffer from        ✓     ✓
depend on          ✓     ✓ ✓          talk about         ✓     ✓ ✓ ✓
direct at          ✓     ✓            target at          ✓     ✓
divide into        ?     n/a          throw into         ?     n/a
emanate from       ✓                  transform into     ?     n/a
embark on          ✓                  translate into     ?     n/a
enter into         ?     n/a          transpose into     ?     n/a
estimate at        ✓     ✓            wait for           ✓     ✓ ✓ ✓
exclude from       ✓     ✓            worry about        ✓     ✓
exempt from        ✓     ✓            absent from        ✓     ✓
fall within        ✓                  conditional on     ✓     ✓
force into         ✓     n/a          dependent on       ✓     ✓ ✓ ✓
gain from          ✓     ✓            early as           ✗     n/a
hang over          ✗     n/a          exempt from        ✓     ✓
incorporate into   ?     n/a          sceptical about    ✓     ✓
integrate into     ?     n/a          serious about      ✓     ✓
level at           ✗     n/a

We judge every entry's validity. An entry is considered valid if it contains a non-semantic, non-compositional preposition or if the preposition is language-specific. We disregard, for instance, the VPC ‘level at’: the verbal expression ‘to level at somebody’ is exceptionally frequent in our corpus,46 but we do not believe it is in typical language use.47 In this regard, the parliamentary register may have an impact that renders particular results of our approach less suitable for language learners. We were particularly unsure about the status of the English preposition ‘into’, which does not exist as a preposition in most other languages, but which is at the same time typically semantically transparent. Hence, we do not use VPCs with ‘into’ in our evaluation.48

Subsequent to our judgment, we look up the remaining relevant entries, that is, the ones we consider valid except for those that use the preposition ‘into’, in all three learner corpora. If we find an occurrence where a learner commits an error regarding the preposition of the verb or adjective in question, we count that as evidence for the difficulty of that VPC or APC. Table 5.11 shows our intersection list of VPCs and APCs together with our judgment and the results from consulting the respective learner corpora. We observe that frequent combinations can be found in all three corpora, while the less frequent – but highly error-prone – ones are most frequently found in ICLE. In total, 23 out of 31 relevant VPCs and APCs were at least once incorrectly used in at least one of the three corpora, which results in a precision of 0.74.

Our second evaluation, the automatic correction of learner errors, uses a data set of incorrect VPCs and APCs compiled by Schneider and Gilquin (2016). The data set comprises 48 English verbs and adjectives with erroneous prepositions as produced by language learners. For each verb and adjective from that list, our correction strategy is to simply propose the preposition that we previously have identified as correct or, in case the verb or adjective is used predominantly without a preposition according to our corpus statistics, not to use any preposition. Out of the 48 entries with erroneous prepositions, our approach proposes the correct preposition in 38 cases, which is depicted in Table 5.12. We partly attribute the remaining errors to the parliamentary register. The manual correction for the verb ‘call’ is not to use any preposition. Our approach, however, suggests ‘call for’ as the correct VPC. In our corpus, more than 40% of the occurrences of ‘call’ are followed by the preposition ‘for’. In the case of ‘bad for’, our approach is misled by the frequent expression ‘worse than’, which is lemmatized to ‘bad’ and ‘than’ and tagged as adjective and preposition. The Penn Treebank tag assigned to ‘than’ is ‘IN’, which is used for both prepositions and subordinating conjunctions such as ‘than’. This is why the parsing model establishes a ‘prepositional modifier’ relation between ‘than’ and ‘bad’. In fact, all adjectives with ‘than’ tagged as preposition originate from comparatives. If we exclude ‘than’ as a valid preposition for APCs, our correction strategy chooses ‘for’ for the adjective ‘bad’, which is the correct preposition in this case.

46As in "This criticism can no longer be levelled at us." or "There is no room for self-righteousness in the criticisms we level at Ukraine today, for we, thank God, have been spared a fate such as theirs - or, at least, most of us have been." 47It might also be the case that our frequencies are distorted and the plural noun ‘levels’ is occasionally confounded by the part-of-speech tagger with the third person singular verb, which would have a strong impact on distributional statistics for the infrequent verb ‘level’. 48The only APC with ‘into’ we identified in our corpus is ‘deep into’.

Table 5.12 – Incorrect VPC and APC, originally from (Schneider and Gilquin 2016), together with their manual correction in the second column. OBJ means that the manual correction of learner errors suggests using a direct object instead. Cases marked with n/a do not fall into the preposition correction scheme as the manual correction affects the verb or adjective rather than the preposition. An entry is checked if our automatic correction matches the manual one, which is the case in 38 out of 48 VPC and APC (precision is consequently 0.79).

Incorrect VPC/APC    Correct        Incorrect VPC/APC    Correct
accuse for           of    ✓        interest for         in
addict on            to    ✓        involve into         in    ✓
alarm of             at    ✓        relate with          to    ✓
apply into           to    ✓        replace to           by
assist to            obj   ✓        resist to            obj   ✓
assure to            obj   ✓        select among         from
aspire for           to    ✓        separate between     n/a
attack against       obj   ✓        study about          obj   ✓
aware about          of    ✓        understand towards   obj   ✓
belong into          to    ✓        view upon            on
benefit out          from  ✓        bad to               for
call like            obj
capable in           of    ✓        characterize with    by    ✓
conscious about      of    ✓        charge of            with  ✓
critical against     of    ✓        confront to          with  ✓
critical towards     of    ✓        consist on           of    ✓
dependent from       on    ✓        deal about           with  ✓
dependent of         on    ✓        deprive from         of    ✓
diverse by           n/a            destructive for      to    ✓
guilty for           of    ✓        discuss about        obj   ✓
independent on       of    ✓        estimate to          at    ✓
responsible of       for   ✓        extend of            to
superior than        to    ✓        impose to            on    ✓
synonymous to        with  ✓        indulge into         in    ✓
                                    worth for            obj

Chapter 6

Conclusions

The main interest of this thesis was to investigate different aspects of alignment in multiparallel corpora. This chapter gives a summary of the previous chapters and provides answers to the research questions we defined in Chapter 1.

Corpus Preparation

We built a corpus with text, sentence and word alignment in 16 languages.1 The preparatory steps include the correction of errors in the original corpus (Europarl),2 tokenization, sentence segmentation, part-of-speech tagging, lemmatization, syntactic dependency parsing and alignment on different structural levels.

Tokenization is often regarded as a solved problem and, from our point of view, is not paid the attention it deserves, as we frequently find subsequent processing errors due to erroneous tokenization decisions. Token types (i.e., different identifiable kinds of tokens such as numbers, abbreviations or punctuation marks) can be divided into language-specific and language-independent ones. Language-independent token types also comprise the class of corpus-specific tokens, for instance, domain-specific identifiers, which we only find in a particular corpus or in corpora of a particular domain. In the debates of the European Parliament, these identifiers are references to documents or resolutions. For these reasons, we have built our own pattern-based (i.e., rule-based) tokenizer, Cutter.3 Cutter, unlike statistical tokenizers that have learned how to tokenize from a particular manually tokenized resource, can easily be adapted to new text types, including new languages.

1The latest corpus version will be made available together with our sentence and word alignment gold standards in different formats through the project website: http://pub.cl.uzh.ch/purl/sparcling_project 2The corrected Europarl corpus is available at http://pub.cl.uzh.ch/purl/costep. 3Cutter can be obtained from http://pub.cl.uzh.ch/purl/cutter.


The TreeTagger, as a joint part-of-speech tagging and lemmatization tool, assigns multiple lemmas to word forms that are ambiguous with regard to the determined tag. We adapt an existing approach to resolve these lemma ambiguities by means of bilingual word alignment in parallel corpora and extend it to multiparallel corpora. For this disambiguation approach, we sum up the conditional probabilities of all aligned words, disregarding the respective language. An interesting question for future work is how decisive a particular language is for disambiguation, for instance, whether Spanish is more decisive for the disambiguation of German lemma variants than Slovene, and whether an accordingly weighted sum would yield even better results.
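A minimal sketch of this disambiguation rule; the conditional probability table is assumed to be precomputed from the word alignments:

```python
def disambiguate_lemma(candidates, aligned_words, cond_prob):
    """Pick one of several TreeTagger lemma candidates for a word form by
    summing conditional probabilities over all aligned words, regardless
    of their language. `cond_prob[(aligned_word, lemma)]` is assumed to be
    estimated from bilingual word alignments."""
    def score(lemma):
        return sum(cond_prob.get((w, lemma), 0.0) for w in aligned_words)
    return max(candidates, key=score)
```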

Use Cases of Word Alignment in Multiparallel Corpora

The study of language use in corpora is nowadays no longer limited to corpus linguists. The SketchEngine, a commercial corpus exploration tool, for instance, lists a variety of potential user groups, among them expert and non-expert users. We present several applications in this thesis that target both types of users (as detailed below). In particular, we focus on linguists and language learners.

By intersecting the alignment distributions of two words, which we obtain from bilingual word alignment on all available language pairs, we measure the similarity of meaning in terms of the translation preferences that these words have in common. This method also works for words of the same language, which gives us a measure of synonymy or interchangeability. This so-called alignment distribution overlap yields promising results, though thorough testing still needs to be done. We were able to attest, for instance, that some expected false friends actually have little to no overlap. Other false friend candidates showed an overlap that was larger than expected. By inspecting the corpus examples where those pairs are aligned, we subsequently learned about contexts in which these false friend candidates are valid translations of each other. As future work, we propose to use this method to identify false friends in learner texts, for instance, by testing the standard translation of a false friend in place of the original word and applying collocation measures to both alternatives. We also suggest compiling lists of actual false friends with examples showing the different uses of each word, which, provided the evaluation yields good results, can be used as didactic material.
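The overlap measure itself can be pictured as the shared mass of two translation-probability distributions. The following Python sketch assumes the distributions are already available as dictionaries from aligned lemmas to relative frequencies; the function sums the smaller of the two probabilities for every shared lemma, yielding a value between 0 (no shared translations) and 1 (identical distributions).

```python
def alignment_distribution_overlap(dist_a, dist_b):
    """Shared probability mass of two alignment distributions.

    dist_a, dist_b -- dicts mapping aligned (foreign) lemmas to relative
                      frequencies; each is assumed to sum to at most 1.
    """
    shared = set(dist_a) & set(dist_b)
    return sum(min(dist_a[lemma], dist_b[lemma]) for lemma in shared)

# Toy example: two words sharing part of their English translations.
dist_x = {"answer": 0.6, "reply": 0.3, "respond": 0.1}
dist_y = {"answer": 0.2, "respond": 0.5, "react": 0.3}
print(alignment_distribution_overlap(dist_x, dist_y))  # 0.2 + 0.1 ≈ 0.3
```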
Our method for predicting learner errors with regard to preposition use in combination with verbs and adjectives likewise addresses transfer errors of language learners. We use the alignment distribution in combination with syntactic dependency parsing to rate distributions of prepositions in relation to the preposition we automatically identified as the correct one. We interpret the best-rated preposition as the most probable error and show, by comparison with examples from learner corpora, that the errors we predict are actually made by language learners and that our method is able to correct learner errors in most cases. In addition, we compiled lists of verb preposition and adjective preposition combinations for learners of English with five different native languages. These lists comprise those combinations that we expect to be similarly error-prone and important in terms of frequency, as estimated by our model.4

The method we proposed for the identification of phrasemes follows a similar approach insofar as it combines two dimensions of annotation: syntactic dependency parsing and word alignment. The target user group, however, is in this case corpus linguists rather than language learners, since our approach deals with the combination of statistical association measures applied to both dimensions. As a test case, we use support verb constructions, to benefit from the semantic nonconformance of their verbs between languages.5 We tried a particular combination of association measures that yields no false positives among the top-rated results. An in-depth evaluation, however, requires a standard that defines which combinations are phrasemic and which are not, a question that cannot be settled easily. Moreover, it needs to be investigated how well this approach performs on other syntactic constructions.

With our corpus exploration tool Multilingwis, we provide a means to search for and explore multiword expressions and their translations in several languages of a multiparallel corpus. It has been designed with several user groups in mind, but has proven to be particularly useful for language learners and linguists. It provides, for example, answers to questions such as how an expression is translated into other languages or how it is typically used in context. Translation variants and their frequencies are listed and allow for a faceted search. Future work needs to address the demands of different user groups with respect to a more powerful query language. Beyond that, parallel queries on multiple corpora will facilitate the comparison of translation variants across corpora.

In all our work, we have used relational databases to represent the corpus structure, hold corpus data and cached query results, and perform efficient queries over multiple layers of annotation and alignment. In particular, Multilingwis is driven by the sophisticated indexing techniques that our database management system of choice, PostgreSQL, offers. This is also the only essential technological requirement for deploying Multilingwis on other sites.6

4 The verb preposition and adjective preposition combinations together with their respective frequencies are listed at: http://pub.cl.uzh.ch/purl/reimporting_prepositions
5 Different association measures on all possible connections between each two tokens can be tested through a web interface: http://pub.cl.uzh.ch/purl/visual_association_measures
6 Multilingwis is made available at https://pub.cl.uzh.ch/purl/multilingwis.

Multilingual Alignment

In this thesis, we presented approaches to multilingual alignment on different levels. While the alignment of speaker contributions merely means correcting the implicit alignment information given in the original Europarl corpus, which we did primarily by means of fuzzy matching of the speaker names, multilingual alignment on the sentence and word level requires novel approaches; aligning sentences and words in multiple languages simultaneously poses an additional challenge.

We present an approach to multilingual sentence alignment, which performs hierarchical agglomerative clustering using the single-linkage method on a weighted sum of several features. To evaluate its performance, we manually aligned7 100 parallel texts in 16 languages as a gold standard and subsequently calculated average F-scores for the multilingual alignments found by our algorithm and their projection to language pairs (i.e., bilingual alignments). As a reference, we ran a bilingual sentence aligner, hunalign, on the same data. The results are similar (our approach performs better, but with marginal differences), though the task of aligning 16 languages consistently is arguably harder than aligning only two.

We also present an approach to multilingual word alignment that likewise uses hierarchical agglomerative clustering, but here we apply a variant of average-linkage clustering. This method is more elaborate than single-linkage clustering, as it uses a distance measure between clusters instead of between their respective components, which is why it requires the recalculation of distances after every merge. On the other hand, it allows us to define more fine-grained constraints on the clustering process. Taking the gold standard for multilingual sentence alignment as a basis, we manually aligned tokens in 500 of those parallel sentences in six languages to obtain a multilingual word alignment gold standard. Single words often correspond to single words in other languages. However, alignments different from one-to-one are also frequent and, additionally, alignments can be part of other alignments with a broader scope. We therefore allowed arbitrary nesting of word alignments in our gold standard.

For evaluation, we compared four different bilingual word aligners and our multilingual word alignment approach to bilingual projections of the gold standard. None of the bilingual word aligners performs well at the identification of our gold alignment units, but a second evaluation on single alignment links shows that they frequently match partially. Our multilingual word alignment approach performs worse on bilingual alignment than the bilingual aligners, but is able to identify two thirds of the multilingual gold alignment units correctly (with many false positives). Applying the envisaged but not yet implemented filtering of intermediate clustering results, which we expect to reduce the number of false positives considerably, remains for future work. The obtained results are promising insofar as the task of word alignment is less well-defined than other levels of alignment, and the number of ways to get it wrong is exceedingly high for multilingual word alignment. In addition, the analysis of single alignment links from our calculated multilingual word alignments suggests that often only single errors are responsible for the disqualification of whole alignment units.
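As a rough illustration of the single-linkage variant used for sentence alignment (the distance function and stopping criterion below are placeholders, not the actual features and weights from our algorithm): sentences start as singleton clusters, and the two clusters connected by the smallest pairwise distance are merged repeatedly until no remaining distance falls below a threshold.

```python
import itertools

def single_linkage(items, distance, threshold):
    """Hierarchical agglomerative clustering with single linkage.

    items     -- hashable objects to cluster (e.g. sentence IDs)
    distance  -- function giving a distance between two items, here standing
                 in for a weighted sum of alignment features
    threshold -- stop merging once the closest pair exceeds this value
    """
    clusters = [frozenset([item]) for item in items]
    while len(clusters) > 1:
        # Single linkage: cluster distance = minimum item-pair distance.
        d, a, b = min(
            ((min(distance(x, y) for x in c1 for y in c2), c1, c2)
             for c1, c2 in itertools.combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if d > threshold:
            break
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters

# Toy usage with a made-up distance between (language, position) pairs.
sents = [("de", 1), ("en", 1), ("fr", 1), ("de", 2), ("en", 2)]
dist = lambda s, t: abs(s[1] - t[1]) + (0.1 if s[0] != t[0] else 1.0)
print(single_linkage(sents, dist, threshold=0.5))
```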

7 We provide a portable version of our tool for multilingual alignment of arbitrary units at http://pub.cl.uzh.ch/purl/hierarchical_alignment_tool.

Appendices


Appendix A

Linguistic Annotation

A.1 Universal Dependency Labels Produced by our Parsers for French, Italian and Spanish

The following table lists the label sets of syntactic relations used for parsing French, Italian and Spanish in FEP9. Universal dependency (UD) relations defined by versions 1 and 2 of the standard (Marneffe, Dozat et al. 2014) that never appear in our parsed texts are shown as well. The parsing pipelines are described in Section 3.3 and, in more detail, in Baffelli (2016). Language-specific dependency relations are subtypes of universal dependency relations and are indicated by italic labels.1 Universal dependency relations are documented at http://universaldependencies.org.

1 The label ‘ROOT’ is presumably a relic of rare cases of wrong annotation in the training corpus. In total, we find only 74 relations with that label in our corpus, out of which 72 have a noun as their head.


Label            French / Italian / Spanish / UD v1 / UD v2     Description
acl              ✓ ✓ ✓ ✓ ✓     clausal modifier of noun
acl:relcl        ✓ ✓ ✓
advcl            ✓ ✓ ✓ ✓ ✓     adverbial clause modifier
advmod           ✓ ✓ ✓ ✓ ✓     adverbial modifier
amod             ✓ ✓ ✓ ✓ ✓     adjectival modifier
appos            ✓ ✓ ✓ ✓ ✓     appositional modifier
aux              ✓ ✓ ✓ ✓ ✓     auxiliary
auxpass          ✓ ✓ ✓ ✓       passive auxiliary
auxpass:reflex   ✓
case             ✓ ✓ ✓ ✓ ✓     case marking
cc               ✓ ✓ ✓ ✓ ✓     coordinating conjunction
ccomp            ✓ ✓ ✓ ✓ ✓     clausal complement
clf              ✓             classifier
compound         ✓ ✓ ✓ ✓ ✓     compound
conj             ✓ ✓ ✓ ✓ ✓     conjunct
cop              ✓ ✓ ✓ ✓ ✓     copula
csubj            ✓ ✓ ✓ ✓ ✓     clausal subject
csubjpass        ✓ ✓ ✓         clausal passive subject
dep              ✓ ✓ ✓ ✓ ✓     unspecified dependency
det              ✓ ✓ ✓ ✓ ✓     determiner
det:poss         ✓
det:predet       ✓
discourse        ✓ ✓ ✓ ✓       discourse element
dislocated       ✓ ✓           dislocated elements
dobj             ✓ ✓ ✓ ✓       direct object
expl             ✓ ✓ ✓ ✓       expletive
expl:impers      ✓
fixed            ✓             fixed multiword expression
flat             ✓             flat multiword expression
foreign          ✓ ✓           foreign word
goeswith         ✓ ✓ ✓         goes with
iobj             ✓ ✓ ✓ ✓ ✓     indirect object

Label            French / Italian / Spanish / UD v1 / UD v2     Description
list             ✓ ✓           list
mark             ✓ ✓ ✓ ✓ ✓     marker
mwe              ✓ ✓ ✓ ✓       multiword expression
name             ✓ ✓ ✓ ✓       name
neg              ✓ ✓ ✓ ✓       negation modifier
nmod             ✓ ✓ ✓ ✓ ✓     nominal modifier
nmod:poss        ✓
nsubj            ✓ ✓ ✓ ✓ ✓     nominal subject
nsubjpass        ✓ ✓ ✓ ✓       passive nominal subject
nummod           ✓ ✓ ✓ ✓ ✓     numeric modifier
obj              ✓             object
obl              ✓             oblique nominal
orphan           ✓             orphan
parataxis        ✓ ✓ ✓ ✓ ✓     parataxis
punct            ✓ ✓ ✓ ✓ ✓     punctuation
remnant          ✓             remnant in ellipsis
reparandum       ✓ ✓ ✓ ✓ ✓     overridden disfluency
root             ✓ ✓ ✓ ✓ ✓     root
ROOT             ✓
vocative         ✓ ✓ ✓ ✓       vocative
xcomp            ✓ ✓ ✓ ✓ ✓     open clausal complement

A.2 Our Hierarchical Alignment Tool

The screenshot shows the graphical user interface of our tool for manual multilingual alignment. It consists (from top to bottom) of a menu bar, the item list, where each token of the respective sentences is a selectable item, an information bar and the alignment area. The alignment inspector (not shown) opens in a separate window.

The Hierarchical Alignment Tool (HAT) is designed for large resolutions and supports a twin-screen workstation layout, so that the alignment inspector, which shows the selected tokens of both the item list and the selected AUs in the alignment area, can be opened on the second screen. In this way, it facilitates the annotator's decision whether the selected set of tokens bears the same context-independent meaning or not.

As shown, HAT supports multilingual word alignment in 16 languages (and possibly more), although we limited ourselves to the alignment of six languages for our set of gold alignments. The screenshot depicts a partial alignment of 16 parallel sentences. The annotator is in the middle of separating corresponding tokens (‘previous’) of a larger AU (‘to the previous speaker’) into a new, subordinate AU. Token-specific information (word form, lemma, part-of-speech and the numeric identifier of the token) of the respective token under the cursor is displayed in the middle bar.

Although this example demonstrates its ability to align words, the same tool can be used for multilingual sentence alignment. In that case, sentences are cropped after an adjustable length, and the lengths of all sentences can be indicated on request. Suggestions have been made that it could also serve for aligning errors in learner texts.

HAT uses a simple JSON-based interface for retrieving the initial list of items in all languages, for saving multilingual ASs and for fetching a possibly existing AS. Multiple serially numbered versions of a particular AS are supported, thus allowing for aligning previously unaligned languages at a later date. The source code is available online.2
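To give an impression of such a JSON exchange, here is a minimal sketch of what a saved alignment set might look like. All field names are invented for illustration and need not match HAT's actual schema (see the source code linked below).

```python
import json

# Hypothetical shape of a saved alignment set (AS); nested "sub_units"
# mirror the arbitrary nesting of alignment units described above.
alignment_set = {
    "version": 3,                # serially numbered AS versions
    "sentence_group": 4711,      # the group of parallel sentences being aligned
    "units": [                   # top-level alignment units (AUs)
        {
            "tokens": {"en": [5, 6, 7, 8], "de": [4, 5, 6], "fr": [6, 7, 8, 9]},
            "sub_units": [
                {"tokens": {"en": [7], "de": [5], "fr": [8]}, "sub_units": []},
            ],
        },
    ],
}

print(json.dumps(alignment_set, indent=2))  # what would be sent to the server
```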

2 http://pub.cl.uzh.ch/purl/hierarchical_alignment_tool

Appendix B

Alignment Quality

B.1 Relation of Alignment Error Rate (AER) and F1-Score

Both the alignment error rate (AER)1 and the F-score measure take values in the range from 0 (no error; no correct match) to 1 (only errors; no incorrect match). If no distinction is made between sure and possible alignments (i.e., $\mathcal{G}_{sure} = \mathcal{G}_{possible} = \mathcal{G}$), the inverse AER equals the balanced F1-score, which is also known as the Dice similarity coefficient. To illustrate this correspondence, we replace the sets $\mathcal{T}$ and $\mathcal{G}$ and the measures precision ($P$) and recall ($R$) with their respective definitions below; here $\mathcal{T}$ denotes the calculated alignment links, $\mathcal{G}$ the gold standard links, $\mathrm{TP} = |\mathcal{T} \cap \mathcal{G}|$, $\mathrm{FP} = |\mathcal{T} \setminus \mathcal{G}|$ and $\mathrm{FN} = |\mathcal{G} \setminus \mathcal{T}|$. Och and Ney (2003) also note when defining the AER that “[it] is derived from the well-known F-measure”.

$$
1 - \mathrm{AER}
  = \frac{|\mathcal{T} \cap \mathcal{G}_{sure}| + |\mathcal{T} \cap \mathcal{G}_{possible}|}{|\mathcal{T}| + |\mathcal{G}_{sure}|}
  = \frac{2\,|\mathcal{T} \cap \mathcal{G}|}{|\mathcal{T}| + |\mathcal{G}|}
  = \frac{2\,\mathrm{TP}}{(\mathrm{TP} + \mathrm{FP}) + (\mathrm{TP} + \mathrm{FN})}
$$

$$
F_1
  = \frac{2 \cdot P \cdot R}{P + R}
  = \frac{2 \cdot \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \cdot \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}}{\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} + \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}}
  = \frac{2\,\mathrm{TP}^2}{\mathrm{TP}\,(\mathrm{TP} + \mathrm{FN}) + \mathrm{TP}\,(\mathrm{TP} + \mathrm{FP})}
  = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
$$

Since $(\mathrm{TP} + \mathrm{FP}) + (\mathrm{TP} + \mathrm{FN}) = 2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}$, both expressions coincide, i.e., $1 - \mathrm{AER} = F_1$.

1 Originally defined by Och and Ney (2003) for two different types of alignments: ‘sure’ and ‘possible’ ones.
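A quick numeric check of this identity (the counts below are chosen arbitrarily): when the sure and possible sets coincide, computing AER and F1 from the same confusion counts gives 1 − AER = F1.

```python
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def aer(tp, fp, fn):
    # With sure and possible alignments identical, both intersection terms
    # equal TP; |T| = TP + FP and |G| = TP + FN.
    return 1 - (tp + tp) / ((tp + fp) + (tp + fn))

tp, fp, fn = 42, 7, 13
assert abs((1 - aer(tp, fp, fn)) - f1(tp, fp, fn)) < 1e-12
print(1 - aer(tp, fp, fn), f1(tp, fp, fn))  # both print 84/104 ≈ 0.8077
```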

Appendix C

Data Sets from Joint Measures

C.1 Semantic Relatedness of German Particle Verbs

We aim at calculating how different the meaning of a particle verb is from that of its base verb (see Section 3.2.2). For this purpose, we use the lemma alignment distribution overlap function $O_a$ described in Section 5.1, which takes into account the probabilities $\lambda_x$ of all lemmas that occur in the lemma distribution matrix explained in Section 3.2.1 for both the base verb and the particle verb consisting of base verb and prefix particle. $O_a$, which is the sum of the respective lower probabilities, yields a value between 0 and 1 that denotes the percentage of foreign-language (i.e., not German) lemmas that serve as a translation, by means of word alignments (see Section 4.4), of both verbs.

In total, we identify 1592 different particle verbs in the German part of our corpus. For 312 of them, we find no overlap, that is, no single common lemma that has been assigned to tokens aligned with both verbs. The following list comprises only those 745 particle verbs that show an absolute overlap frequency of at least 10 in at least three languages. We count these frequencies as the sum of constellations in which one of the verbs and the lemma of a foreign word occur together; in the extreme cases, this can be three lemma pairs that occur together 10 times or 10 different lemmas that occur once, each in a different language.

Base verb    Particle verb      O_a      Frequency
lösen        auslösen           0.1 %    12
halten       innehalten         0.1 %    10
schaffen     abschaffen         0.2 %    18
arbeiten     herausarbeiten     0.2 %    10
lassen       freilassen         0.3 %    28
führen       irreführen         0.7 %    31
stellen      klarstellen        0.7 %    89

Base verb Particle verb Frequency handeln aushandeln 0.7 44 푂푎 ziehen vorziehen 0.7 18 schließen einschließen 0.8 % 80 leiten weiterleiten 0.9 % 12 bringen durcheinanderbringen 1.0 % 10 hören aufhören 1.1 % 135 geben übergeben 1.1 % 40 sehen durchsehen 1.1 % 13 bringen anbringen 1.1 % 57 sehen absehen 1.1 % 133 sehen wegsehen 1.1 % 12 machen weitermachen 1.1 % 104 schließen zusammenschließen 1.1 % 53 sprechen absprechen 1.1 % 11 stellen sicherstellen 1.2 % 203 legen auslegen 1.3 % 24 greifen vorgreifen 1.3 % 10 setzen aussetzen 1.3 % 91 setzen absetzen 1.4 % 12 nehmen vorwegnehmen 1.4 % 11 weisen anweisen 1.4 % 15 wenden einwenden 1.4 % 12 nehmen hinnehmen 1.5 % 152 führen ausführen 1.5 % 165 führen weiterführen 1.5 % 72 führen aufführen 1.6 % 12 werfen vorwerfen 1.6 % 12 wirken entgegenwirken 1.7 % 14 nehmen ausnehmen 1.8 % 21 schreiben festschreiben 1.8 % 21 bringen überbringen 1.8 % 21 messen zumessen 1.8 % 40 greifen angreifen 1.9 % 16 setzen fortsetzen 1.9 % 180 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 205

Base verb Particle verb Frequency lassen nachlassen 1.9 36 푂푎 messen beimessen 2.0 60 führen zusammenführen 2.0 % 15 führen fortführen 2.0 % 90 stellen voranstellen 2.1 % 14 bringen entgegenbringen 2.2 % 15 gehen vorübergehen 2.2 % 18 wenden anwenden 2.2 % 118 nehmen hinzunehmen 2.2 % 41 zögern hinauszögern 2.2 % 24 beziehen einbeziehen 2.3 % 275 sehen vorsehen 2.3 % 1086 stellen gleichstellen 2.4 % 11 geben mitgeben 2.5 % 14 setzen auseinandersetzen 2.7 % 37 teilen mitteilen 2.7 % 231 stellen entgegenstellen 2.8 % 10 treten zurücktreten 2.8 % 14 sehen voraussehen 2.8 % 247 setzen voraussetzen 2.8 % 186 halten aufhalten 2.9 % 158 lenken ablenken 2.9 % 36 tragen vortragen 2.9 % 45 setzen zusammensetzen 3.0 % 90 teilen zuteilen 3.1 % 15 teilen einteilen 3.1 % 35 machen klarmachen 3.3 % 243 stellen ausstellen 3.3 % 38 lösen loslösen 3.5 % 21 sagen zusagen 3.5 % 114 kommen umkommen 3.5 % 20 nehmen zurücknehmen 3.5 % 137 stellen herstellen 3.7 % 322 treten zusammentreten 3.7 % 39 % % 206 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency stimmen übereinstimmen 3.7 2090 푂푎 setzen durchsetzen 3.8 346 arbeiten ausarbeiten 3.8 % 357 leiten einleiten 3.8 % 75 machen durchmachen 3.8 % 39 geben zugeben 3.8 % 415 stellen einstellen 3.9 % 147 geben ausgeben 3.9 % 243 nehmen vornehmen 3.9 % 257 treffen zutreffen 3.9 % 377 geben freigeben 4.0 % 21 geben zurückgeben 4.2 % 40 schließen ausschließen 4.2 % 388 stellen feststellen 4.2 % 1472 finden abfinden 4.3 % 11 kommen nachkommen 4.3 % 176 werten auswerten 4.4 % 11 machen vormachen 4.5 % 82 führen einführen 4.6 % 941 gehen nachgehen 4.7 % 66 halten mithalten 4.9 % 41 kommen vorankommen 4.9 % 559 gehen weggehen 4.9 % 48 stellen vorstellen 4.9 % 848 machen mitmachen 4.9 % 29 binden einbinden 5.1 % 49 nehmen annehmen 5.1 % 1863 reichen zurückreichen 5.2 % 15 weisen abweisen 5.2 % 49 setzen umsetzen 5.2 % 1216 geben preisgeben 5.2 % 11 greifen herausgreifen 5.2 % 26 fordern zurückfordern 5.4 % 41 denken überdenken 5.4 % 529 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 207

Base verb Particle verb Frequency halten anhalten 5.4 111 푂푎 schreiben vorschreiben 5.4 79 brechen zusammenbrechen 5.5 % 49 weisen zuweisen 5.5 % 32 stellen herausstellen 5.5 % 232 geben weitergeben 5.5 % 114 kommen vorwärtskommen 5.6 % 15 wirken einwirken 5.6 % 15 treffen eintreffen 5.6 % 46 fangen anfangen 5.6 % 50 stellen bereitstellen 5.6 % 564 nehmen zunehmen 5.7 % 443 wirken mitwirken 5.8 % 91 sagen voraussagen 5.8 % 46 fordern überfordern 5.8 % 22 fallen zusammenfallen 5.8 % 30 bringen voranbringen 5.8 % 218 gehen herangehen 5.9 % 89 knüpfen anknüpfen 5.9 % 16 legen einlegen 6.0 % 21 geben herausgeben 6.0 % 53 schließen anschließen 6.1 % 773 verteilen umverteilen 6.1 % 77 führen zurückführen 6.2 % 449 fassen zusammenfassen 6.3 % 65 halten standhalten 6.4 % 49 legen offenlegen 6.5 % 20 stellen zurückstellen 6.5 % 10 gehen umgehen 6.5 % 645 reichen weiterreichen 6.6 % 12 setzen einsetzen 6.6 % 1080 klagen anklagen 6.6 % 35 rufen hervorrufen 6.7 % 111 bringen zusammenbringen 6.7 % 84 % % 208 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency gehen vorausgehen 6.8 60 푂푎 halten zusammenhalten 6.8 41 kommen auskommen 6.8 % 33 fallen auffallen 6.9 % 29 streiten abstreiten 6.9 % 16 geben angeben 7.0 % 170 legen ablegen 7.0 % 17 legen anlegen 7.0 % 48 halten zurückhalten 7.1 % 75 liefern ausliefern 7.1 % 125 setzen hinsetzen 7.1 % 31 geben bekanntgeben 7.4 % 63 schieben abschieben 7.5 % 11 greifen eingreifen 7.5 % 85 kommen zurückkommen 7.6 % 659 sehen gegenübersehen 7.7 % 309 kommen überkommen 7.7 % 13 treten auftreten 7.7 % 239 streichen zusammenstreichen 7.8 % 13 tragen mittragen 7.8 % 149 finden zurückfinden 7.8 % 34 halten durchhalten 7.9 % 26 nehmen abnehmen 7.9 % 133 nehmen teilnehmen 8.0 % 810 treten beitreten 8.0 % 509 setzen entgegensetzen 8.1 % 19 weisen zurückweisen 8.1 % 128 nehmen herausnehmen 8.1 % 100 lassen einlassen 8.1 % 20 machen ausmachen 8.2 % 328 fordern herausfordern 8.2 % 82 stellen anstellen 8.4 % 78 halten einhalten 8.5 % 3412 halten abhalten 8.7 % 903 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 209

Base verb Particle verb Frequency kommen wiederkommen 8.7 29 푂푎 laufen anlaufen 8.8 123 halten heraushalten 8.9 % 12 laufen weiterlaufen 8.9 % 33 tragen beitragen 8.9 % 1434 greifen aufgreifen 8.9 % 109 wandeln umwandeln 8.9 % 39 fügen beifügen 8.9 % 14 gehen übergehen 9.0 % 527 kommen zusammenkommen 9.0 % 105 stellen zusammenstellen 9.0 % 25 legen darlegen 9.0 % 333 zeichnen auszeichnen 9.0 % 23 schlagen niederschlagen 9.0 % 23 lassen auslassen 9.1 % 43 fallen ausfallen 9.1 % 158 hören abhören 9.1 % 18 brechen einbrechen 9.2 % 15 fahren fortfahren 9.3 % 147 gehen untergehen 9.3 % 43 ziehen abziehen 9.3 % 76 gehen abgehen 9.5 % 17 weisen ausweisen 9.6 % 97 stellen abstellen 9.6 % 44 gehen zurückgehen 9.6 % 401 kommen dazukommen 9.6 % 15 räumen einräumen 9.7 % 30 führen anführen 9.8 % 227 kommen weiterkommen 9.9 % 174 fallen wegfallen 9.9 % 41 bringen einbringen 10.0 % 809 kommen zurechtkommen 10.0 % 24 sagen absagen 10.0 % 27 stimmen zustimmen 10.0 % 6221 % % 210 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency drücken ausdrücken 10.1 63 푂푎 führen herbeiführen 10.1 282 gehen sichergehen 10.1 % 16 nehmen aufnehmen 10.2 % 1669 bereiten vorbereiten 10.2 % 248 bringen vorbringen 10.5 % 322 leiten zuleiten 10.6 % 10 kommen entgegenkommen 10.7 % 69 gehen verlorengehen 10.8 % 74 geben aufgeben 10.9 % 408 drängen aufdrängen 10.9 % 39 fallen anfallen 10.9 % 16 richten einrichten 10.9 % 170 eignen aneignen 10.9 % 26 laufen hinauslaufen 10.9 % 141 legen nahelegen 10.9 % 67 wechseln abwechseln 11.0 % 12 bauen ausbauen 11.0 % 218 reichen hinausreichen 11.0 % 29 weisen nachweisen 11.1 % 140 geben vorgeben 11.1 % 186 behalten vorbehalten 11.1 % 95 bringen aufbringen 11.2 % 67 fallen einfallen 11.2 % 23 halten festhalten 11.2 % 1624 legen festlegen 11.2 % 578 stehen zusammenstehen 11.2 % 16 legen niederlegen 11.3 % 66 handeln abhandeln 11.4 % 34 kommen aufkommen 11.4 % 178 stellen aufstellen 11.4 % 305 sammeln ansammeln 11.4 % 38 führen heranführen 11.6 % 33 finden stattfinden 11.7 % 5101 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 211

Base verb Particle verb Frequency gehen vorgehen 11.7 775 푂푎 finden zusammenfinden 11.7 21 bringen hervorbringen 11.9 % 242 werfen aufwerfen 11.9 % 182 geben nachgeben 12.0 % 192 bereiten zubereiten 12.0 % 10 schlagen zuschlagen 12.1 % 23 kommen durchkommen 12.1 % 66 zwingen aufzwingen 12.2 % 511 nehmen wegnehmen 12.3 % 190 sprechen aussprechen 12.3 % 1482 setzen ansetzen 12.4 % 91 kommen übereinkommen 12.4 % 95 rüsten aufrüsten 12.5 % 22 laufen herumlaufen 12.6 % 14 bilden ausbilden 12.7 % 405 passen anpassen 12.8 % 238 fassen auffassen 12.8 % 13 rufen zurückrufen 12.8 % 13 formulieren umformulieren 12.9 % 77 stimmen einstimmen 12.9 % 21 halten aushalten 13.0 % 29 laufen auslaufen 13.0 % 292 nehmen wahrnehmen 13.1 % 494 räumen aufräumen 13.2 % 14 legen auflegen 13.4 % 19 wenden abwenden 13.4 % 63 sperren einsperren 13.5 % 12 merken anmerken 13.8 % 146 kommen gleichkommen 13.8 % 326 stellen gegenüberstellen 13.9 % 25 gehen hervorgehen 13.9 % 474 laufen zuwiderlaufen 14.0 % 150 gehen eingehen 14.0 % 2465 % % 212 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency sprechen ansprechen 14.0 5335 푂푎 gehen ausgehen 14.0 1562 kommen zukommen 14.1 % 1112 zeichnen abzeichnen 14.2 % 29 heben abheben 14.2 % 10 geben abgeben 14.2 % 1018 drehen umdrehen 14.2 % 15 weisen aufweisen 14.3 % 178 stürzen einstürzen 14.4 % 18 finden herausfinden 14.5 % 411 kehren zurückkehren 14.5 % 48 führen durchführen 14.6 % 2502 gehen aufgehen 14.7 % 43 gehen durchgehen 14.7 % 74 reißen niederreißen 14.7 % 10 bringen näherbringen 14.7 % 93 sehen aussehen 14.7 % 1732 prüfen nachprüfen 14.8 % 204 dringen eindringen 14.8 % 13 lösen auflösen 14.8 % 62 fließen zurückfließen 14.9 % 12 denken ausdenken 15.0 % 55 gestalten ausgestalten 15.1 % 66 erhalten aufrechterhalten 15.1 % 3221 sprechen zusprechen 15.1 % 17 bringen herausbringen 15.1 % 16 gehen angehen 15.2 % 4423 zeigen anzeigen 15.2 % 202 lassen hinterlassen 15.4 % 693 stellen darstellen 15.7 % 4413 dehnen ausdehnen 15.7 % 11 denken andenken 15.7 % 36 gehen weitergehen 15.7 % 1002 stecken feststecken 15.8 % 17 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 213

Base verb Particle verb Frequency kommen herauskommen 15.9 232 푂푎 drängen zurückdrängen 15.9 13 rufen ausrufen 15.9 % 34 stellen hinstellen 16.0 % 25 rechnen ausrechnen 16.2 % 50 streichen herausstreichen 16.2 % 11 stehen aufstehen 16.2 % 72 weichen abweichen 16.2 % 18 geben stattgeben 16.2 % 20 sprechen herumsprechen 16.3 % 10 legen vorlegen 16.4 % 1099 rechnen nachrechnen 16.6 % 17 führen vorbeiführen 16.6 % 16 sparen aussparen 16.7 % 16 kommen hinkommen 16.7 % 114 gehen zugehen 16.9 % 71 kommen rauskommen 17.0 % 13 gehen einhergehen 17.1 % 356 nehmen einnehmen 17.1 % 740 bringen weiterbringen 17.1 % 75 stehen einstehen 17.2 % 44 gehen vorangehen 17.3 % 282 lassen zurücklassen 17.4 % 96 heben herausheben 17.5 % 18 sehen einsehen 17.6 % 1626 teilen aufteilen 17.7 % 756 nehmen entgegennehmen 17.7 % 103 enthalten vorenthalten 17.8 % 57 scheuen zurückscheuen 17.9 % 17 kommen vorkommen 17.9 % 1030 ordnen einordnen 18.2 % 10 laufen ablaufen 18.2 % 395 gewinnen zurückgewinnen 18.2 % 75 rufen anrufen 18.2 % 90 % % 214 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency erhalten zurückerhalten 18.4 30 푂푎 bewahren aufbewahren 18.6 128 treten eintreten 19.0 % 588 ziehen zurückziehen 19.0 % 698 fließen einfließen 19.2 % 10 bekommen zurückbekommen 19.8 % 21 schätzen einschätzen 19.9 % 335 brechen auseinanderbrechen 19.9 % 18 lassen vorlassen 19.9 % 14 segnen absegnen 20.0 % 12 denken nachdenken 20.0 % 3862 holen einholen 20.0 % 13 sehen zusehen 20.2 % 341 zahlen zurückzahlen 20.3 % 80 stehen vorstehen 20.3 % 18 rufen zurufen 20.3 % 37 treten antreten 20.4 % 34 sterben aussterben 20.5 % 37 stehen zustehen 20.6 % 270 kehren umkehren 20.8 % 31 steuern zusteuern 21.0 % 24 holen abholen 21.1 % 12 gehören zugehören 21.2 % 11 fahren hinfahren 21.3 % 14 geben hingeben 21.3 % 32 laden aufladen 21.3 % 11 tun abtun 21.3 % 14 ankommen herankommen 21.4 % 15 deuten andeuten 21.5 % 62 rütteln aufrütteln 21.5 % 13 schreiten fortschreiten 21.6 % 18 melden anmelden 21.9 % 121 reisen einreisen 22.0 % 120 hängen aufhängen 22.1 % 38 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 215

Base verb Particle verb Frequency fügen hinzufügen 22.1 26 푂푎 bestehen fortbestehen 22.3 262 spielen abspielen 22.5 % 84 weisen hinweisen 22.8 % 898 sehen weitersehen 22.8 % 15 sehen ansehen 22.9 % 4810 halten auseinanderhalten 23.0 % 21 gehen hinausgehen 23.1 % 1745 kommen hereinkommen 23.2 % 44 verweisen zurückverweisen 23.2 % 152 bleiben ausbleiben 23.3 % 32 laufen hinterherlaufen 23.4 % 16 rücken näherrücken 23.4 % 30 kommen herankommen 23.4 % 17 liefern abliefern 23.5 % 22 stürzen abstürzen 23.5 % 26 gehen losgehen 23.5 % 24 stehen draufstehen 23.6 % 16 erstatten zurückerstatten 23.8 % 53 bieten überbieten 23.8 % 17 brechen abbrechen 23.9 % 86 laden einladen 24.0 % 40 nehmen übernehmen 24.0 % 3942 gestalten mitgestalten 24.1 % 67 geben hergeben 24.2 % 25 gehen vorbeigehen 24.2 % 77 lassen hereinlassen 24.3 % 19 halten vorhalten 24.3 % 11 gehen auseinandergehen 24.3 % 51 fügen anfügen 24.4 % 17 holen herausholen 24.4 % 24 schieben zuschieben 24.5 % 24 bringen zurückbringen 24.6 % 11 heben anheben 24.9 % 51 % % 216 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency gehen mitgehen 24.9 15 푂푎 regen anregen 24.9 11 lassen offenlassen 25.1 % 73 tun antun 25.1 % 24 gehören hingehören 25.1 % 64 rechnen anrechnen 25.1 % 10 gehen hingehen 25.1 % 92 kommen hineinkommen 25.6 % 12 treiben vorantreiben 25.7 % 155 heben aufheben 25.7 % 54 landen anlanden 25.9 % 61 sehen entgegensehen 26.0 % 706 kommen nahekommen 26.1 % 23 bleiben zusammenbleiben 26.1 % 17 herrschen vorherrschen 26.1 % 320 behandeln gleichbehandeln 26.2 % 86 halten bereithalten 26.2 % 39 wirken hinwirken 26.4 % 80 billigen zubilligen 26.5 % 10 blicken zurückblicken 26.5 % 60 lassen übriglassen 26.6 % 303 schieben aufschieben 26.9 % 50 finden zurechtfinden 27.3 % 10 finden bereitfinden 27.3 % 25 bestimmen mitbestimmen 27.5 % 18 schieben hinausschieben 27.6 % 40 setzen festsetzen 27.8 % 644 stehen freistehen 27.9 % 35 zeichnen vorzeichnen 28.1 % 23 richten ausrichten 28.1 % 922 entscheiden mitentscheiden 28.1 % 37 gehören zusammengehören 28.1 % 126 zählen aufzählen 28.1 % 101 sehen umsehen 28.2 % 47 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 217

Base verb Particle verb Frequency treffen zusammentreffen 28.4 932 푂푎 arbeiten zusammenarbeiten 28.5 7519 wünschen herbeiwünschen 28.6 % 13 kommen vorbeikommen 28.6 % 20 schließen abschließen 28.7 % 6451 werfen zurückwerfen 28.8 % 37 lassen zulassen 28.8 % 5310 fressen auffressen 28.9 % 13 zählen auszählen 28.9 % 13 sitzen festsitzen 28.9 % 20 fordern abfordern 28.9 % 20 erinnern zurückerinnern 29.0 % 12 fließen abfließen 29.0 % 28 hören hinhören 29.1 % 33 kündigen ankündigen 29.3 % 31 kommen ankommen 29.6 % 2386 kommen zugutekommen 29.7 % 249 stehen beistehen 29.7 % 61 gehen fortgehen 29.8 % 15 bauen einbauen 29.9 % 29 heben hervorheben 30.3 % 90 liegen vorliegen 30.3 % 3540 ordnen anordnen 30.3 % 28 schlagen zusammenschlagen 30.5 % 124 löschen auslöschen 30.5 % 56 fallen zurückfallen 30.5 % 40 legen drauflegen 30.6 % 13 schreiten voranschreiten 30.6 % 50 stehen drinstehen 30.6 % 18 dauern andauern 30.8 % 338 bestehen weiterbestehen 30.9 % 90 vertrauen anvertrauen 30.9 % 84 kommen näherkommen 30.9 % 40 gestehen zugestehen 30.9 % 139 % % 218 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency schlafen einschlafen 31.0 42 푂푎 raten abraten 31.1 51 greifen zurückgreifen 31.2 % 253 spielen mitspielen 31.3 % 41 nehmen mitnehmen 31.4 % 81 stehen nachstehen 31.4 % 22 gehen entgegengehen 31.5 % 27 wenden zuwenden 31.6 % 236 wirken auswirken 31.8 % 330 fahren weiterfahren 31.8 % 10 passen zusammenpassen 31.9 % 47 kommen herkommen 31.9 % 275 kommen zustandekommen 32.1 % 200 schweigen ausschweigen 33.1 % 64 mahnen anmahnen 33.2 % 58 sehen hinsehen 33.6 % 16 gehören angehören 33.8 % 3592 handeln zuwiderhandeln 33.9 % 15 liegen zurückliegen 33.9 % 153 leben aufleben 34.1 % 10 sitzen einsitzen 34.3 % 13 gehen herausgehen 34.4 % 11 kommen hinauskommen 34.7 % 11 schicken zurückschicken 34.8 % 164 bleiben zurückbleiben 34.8 % 262 klären aufklären 34.9 % 461 hören zuhören 35.0 % 4553 hören heraushören 35.0 % 13 kommen herumkommen 35.2 % 35 nutzen ausnutzen 35.2 % 1738 häufen anhäufen 35.2 % 54 mischen einmischen 35.5 % 31 bewegen zubewegen 35.5 % 109 fordern nachfordern 35.6 % 10 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 219

Base verb Particle verb Frequency passen hineinpassen 35.7 11 푂푎 arbeiten mitarbeiten 35.8 597 weichen zurückweichen 36.0 % 15 arbeiten abarbeiten 36.0 % 25 folgen nachfolgen 36.2 % 19 bringen mitbringen 36.4 % 147 schlachten abschlachten 36.5 % 27 sprechen dagegensprechen 36.5 % 12 stehen entgegenstehen 36.8 % 167 zahlen auszahlen 36.9 % 798 kündigen aufkündigen 36.9 % 19 kommen daherkommen 37.0 % 45 gehören hineingehören 37.0 % 56 kommen hinzukommen 37.2 % 462 bilden herausbilden 37.2 % 28 wandern abwandern 37.3 % 13 packen anpacken 37.5 % 33 halten hinhalten 37.6 % 16 liegen beiliegen 37.6 % 11 fließen zufließen 37.7 % 21 spielen ausspielen 37.8 % 108 bleiben dableiben 38.1 % 11 rufen aufrufen 38.3 % 623 weiten ausweiten 38.3 % 25 rüsten ausrüsten 38.5 % 214 geben widergeben 38.5 % 22 sichern zusichern 38.6 % 855 fordern anfordern 38.8 % 429 fügen einfügen 38.9 % 56 fangen auffangen 38.9 % 24 führen hinführen 39.3 % 25 hängen zusammenhängen 39.6 % 328 sprechen mitsprechen 39.9 % 10 sagen ansagen 40.3 % 51 % % 220 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency schauen zuschauen 40.3 124 푂푎 räumen ausräumen 40.4 179 wachsen nachwachsen 40.6 % 14 brennen niederbrennen 40.7 % 58 rechnen mitrechnen 40.7 % 15 stehen ausstehen 40.8 % 167 rühren herrühren 40.8 % 51 stehen offenstehen 41.2 % 179 hören anhören 41.2 % 2134 verkaufen weiterverkaufen 41.2 % 38 tun auftun 41.4 % 15 wachsen aufwachsen 41.5 % 432 denken zurückdenken 41.6 % 72 klären abklären 41.6 % 29 halten zugutehalten 41.6 % 24 schießen hinausschießen 42.4 % 21 erkannt anerkannt 42.5 % 34 sammeln einsammeln 42.6 % 48 stammen herstammen 42.7 % 11 wachsen weiterwachsen 42.7 % 27 kommen hierherkommen 42.8 % 74 bessern nachbessern 42.9 % 105 erkennen zuerkennen 43.0 % 123 verfolgen weiterverfolgen 43.1 % 529 stehen bevorstehen 43.2 % 385 erkennen anerkennen 43.3 % 7006 schauen zurückschauen 43.4 % 29 trocknen austrocknen 43.5 % 12 liegen ausliegen 43.5 % 16 blühen aufblühen 43.6 % 38 spielen zuspielen 43.8 % 12 überweisen zurücküberweisen 43.8 % 31 schauen ausschauen 43.9 % 25 schauen wegschauen 44.1 % 40 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 221

Base verb Particle verb Frequency schauen umschauen 44.5 68 푂푎 schießen abschießen 44.6 12 gehören dazugehören 45.1 % 159 zweifeln anzweifeln 45.1 % 131 brechen aufbrechen 45.3 % 40 erleben miterleben 45.6 % 192 werfen hinauswerfen 45.8 % 19 senken absenken 45.8 % 111 zielen hinzielen 45.8 % 19 treiben antreiben 45.9 % 22 brennen abbrennen 46.0 % 39 prangern anprangern 46.4 % 24 verlangen zurückverlangen 46.4 % 18 atmen aufatmen 46.5 % 22 denken weiterdenken 46.5 % 32 geben eingeben 46.7 % 13 schwächen abschwächen 46.7 % 490 zählen mitzählen 46.7 % 44 hängen abhängen 46.8 % 400 wünschen zurückwünschen 47.0 % 11 wählen auswählen 47.3 % 1548 stehen nahestehen 47.3 % 16 stehen gegenüberstehen 47.4 % 2148 liegen festliegen 47.5 % 14 finanzieren mitfinanzieren 47.8 % 21 deuten hindeuten 47.8 % 121 betreffen anbetreffen 47.9 % 297 denken vordenken 48.2 % 32 wachsen heranwachsen 48.5 % 21 stehen anstehen 48.5 % 319 wachsen zusammenwachsen 48.5 % 105 springen überspringen 48.6 % 22 breiten ausbreiten 48.7 % 14 brechen ausbrechen 48.8 % 180 % % 222 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency handeln einhandeln 49.0 11 푂푎 ändern abändern 49.0 2092 sinken absinken 49.3 % 52 stehen bereitstehen 49.4 % 235 planen einplanen 49.6 % 16 trauen zutrauen 49.7 % 74 scheinen aufscheinen 49.8 % 10 verfolgen mitverfolgen 49.8 % 44 mildern abmildern 49.9 % 167 stehen feststehen 50.0 % 936 tauschen austauschen 50.4 % 53 behalten beibehalten 50.5 % 2031 hinken hinterherhinken 50.6 % 26 schlagen vorschlagen 50.8 % 659 erklären bereiterklären 50.9 % 85 liegen bereitliegen 51.1 % 42 üben ausüben 51.2 % 98 bleiben gleichbleiben 51.4 % 24 kaufen einkaufen 51.5 % 464 fordern auffordern 51.6 % 28 604 wachsen anwachsen 51.8 % 219 fallen herausfallen 51.9 % 19 sichern absichern 52.0 % 354 bleiben übrigbleiben 52.1 % 317 gestehen eingestehen 52.2 % 596 leben zusammenleben 53.1 % 417 formulieren ausformulieren 53.1 % 32 entwickeln fortentwickeln 53.1 % 47 gewöhnen angewöhnen 53.2 % 38 schieben herschieben 53.4 % 27 sprechen vorsprechen 53.5 % 18 hungern aushungern 53.7 % 47 zeigen aufzeigen 54.0 % 3306 schauen hinschauen 54.4 % 30 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 223

Base verb Particle verb Frequency fallen zufallen 54.4 11 푂푎 leben weiterleben 54.5 23 liegen auseinanderliegen 54.6 % 37 locken anlocken 54.9 % 69 ruhen ausruhen 55.2 % 50 sitzen gegenübersitzen 55.8 % 12 decken abdecken 56.0 % 1419 gehen zugrundegehen 56.8 % 13 füllen ausfüllen 57.7 % 190 nähern annähern 57.9 % 486 fordern einfordern 57.9 % 1551 finden vorfinden 58.0 % 165 bremsen ausbremsen 58.4 % 16 denken hinausdenken 58.4 % 31 stehen dastehen 58.7 % 236 nicken abnicken 58.9 % 12 sterben absterben 58.9 % 29 bieten anbieten 59.0 % 6548 schränken einschränken 59.0 % 26 liegen naheliegen 59.1 % 51 lachen auslachen 59.2 % 35 verlangen abverlangen 59.4 % 240 schreiben abschreiben 59.5 % 26 pflichten beipflichten 59.5 % 52 bezahlen zurückbezahlen 59.9 % 16 reifen heranreifen 60.0 % 30 fragen nachfragen 60.6 % 183 sitzen dasitzen 60.6 % 22 schauen anschauen 60.7 % 1344 helfen mithelfen 61.2 % 669 probieren ausprobieren 61.3 % 53 drohen androhen 61.5 % 180 arbeiten weiterarbeiten 61.7 % 442 grenzen angrenzen 62.5 % 61 % % 224 C.1. SEMANTIC RELATEDNESS OF GERMAN PARTICLE VERBS

Base verb Particle verb Frequency sitzen herumsitzen 63.0 19 푂푎 gewinnen hinzugewinnen 63.5 13 bauen aufbauen 63.5 % 3744 lehnen ablehnen 63.7 % 733 bleiben offenbleiben 64.0 % 125 wissen weiterwissen 64.2 % 46 dringlich vordringlich 64.5 % 453 sitzen zusammensitzen 64.9 % 28 warten abwarten 64.9 % 3974 schreiben anschreiben 65.4 % 64 schreiben hineinschreiben 65.5 % 54 lernen auslernen 65.5 % 26 liegen zugrundeliegen 65.9 % 345 kämpfen ankämpfen 66.6 % 102 erlegen auferlegen 67.1 % 10 sagen aufsagen 67.4 % 33 frieren einfrieren 67.5 % 10 zahlen einzahlen 67.8 % 131 streben anstreben 67.8 % 1211 arbeiten hinarbeiten 67.8 % 1207 lesen weiterlesen 67.8 % 10 sprechen weitersprechen 67.9 % 22 sagen nachsagen 69.0 % 11 fliegen anfliegen 69.0 % 43 kämpfen weiterkämpfen 69.5 % 31 bessern aufbessern 69.9 % 10 stimmen mitstimmen 70.2 % 32 kaufen aufkaufen 70.3 % 205 wechseln auswechseln 70.4 % 13 stimmen abstimmen 70.5 % 22 203 schicken hinschicken 70.7 % 16 sagen vorsagen 70.8 % 32 reichen ausreichen 70.8 % 3140 sagen aussagen 71.0 % 573 % % APPENDIX C. DATA SETS FROM JOINT MEASURES 225

Base verb Particle verb Frequency reihen einreihen 71.2 15 푂푎 helfen weiterhelfen 71.4 214 lagern einlagern 71.5 % 59 stimmen dagegenstimmen 72.5 % 321 senden zusenden 73.2 % 119 lesen nachlesen 73.5 % 328 beuten ausbeuten 73.5 % 42 schicken losschicken 74.0 % 10 lesen durchlesen 74.0 % 207 fragen anfragen 74.1 % 128 schicken zuschicken 74.3 % 22 zielen abzielen 74.8 % 754 plündern ausplündern 75.5 % 146 sparen einsparen 75.7 % 1327 kennen auskennen 76.1 % 212 säen aussäen 77.5 % 27 entwickeln weiterentwickeln 77.6 % 4127 drucken abdrucken 78.0 % 16 spionieren ausspionieren 79.2 % 13 lesen vorlesen 80.1 % 1126 pflanzen anpflanzen 81.2 % 29 senden aussenden 83.1 % 953 spiegeln widerspiegeln 85.7 % 425 steigen ansteigen 89.1 % 2833 % % 226 C.2. RECOMMENDATION LISTS FOR LEARNERS OF ENGLISH

C.2 Generated Recommendations for Learners of English of Different L1 Backgrounds

Our approach to predicting transfer errors in English with regard to preposition use in combination with verbs (VPC) and adjectives (APC), which is detailed in Section 5.4, generates the following lists. The backtranslation ratio (BTR) is an indicator of how probable it is that the wrong (backtranslated) preposition $\lambda_p''$ is used instead of the correct one ($\lambda_p$). The ranking of the particular combinations has been done on both the BTR and the frequency of the respective VPC or APC in our corpus (FEP6, see Chapter 3). So as not to generate endless lists, we limit the number of recommendations to approximately 100 for VPCs and 25 for APCs.
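As a sketch of how such a list might be assembled (the combined score used here is a stand-in; the actual ranking criterion is defined in Section 5.4): combinations are ordered by a score involving both BTR and corpus frequency, and the list is cut off at a fixed size.

```python
def rank_recommendations(entries, limit=100):
    """Rank (word, correct_prep, wrong_prep, btr, freq) tuples for learners.

    btr * freq is a placeholder score; the thesis ranks on both BTR and
    frequency, but the exact combination is described in Section 5.4.
    """
    return sorted(entries, key=lambda e: e[3] * e[4], reverse=True)[:limit]

# Toy entries; the BTR values come from the German VPC list below, while
# the frequencies are invented for illustration.
vpcs = [
    ("think", "of", "on", 1.09, 5000),
    ("impose", "on", "for", 1.36, 1200),
    ("describe", "as", "in", 52.92, 300),
]
for word, correct, wrong, btr, freq in rank_recommendations(vpcs, limit=2):
    print(f"{word}: learners may use '{wrong}' instead of '{correct}' (BTR {btr})")
```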

C.2.1 Verb Preposition Combinations

German French No. BTR BTR 1 think of on′′ 1.09 deal with of′′ 2.07 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 2 impose on for 1.36 provide for of 1.24 3 hope for on 1.07 call for of 1.82 4 remind of on 1.22 decide on of 1.05 5 prevent from of 1.83 comply with of 1.60 6 consist of from 1.38 hope for of 1.00 7 postpone until by 1.06 ask for of 2.08 8 exclude from of 1.64 face with in 1.65 9 aim at on 2.74 push for of 1.07 10 talk about on 3.34 confront with in 1.19 11 look at in 3.40 cope with in 1.47 12 gain from of 1.42 reserve for in 1.19 13 deliver on in 1.37 inflict on in 1.11 14 receive from of 2.00 spend on for 1.75 15 emanate from of 1.19 apologise for of 1.26 16 compose of from 1.30 qualify for of 1.15 17 wait for on 2.25 strive for of 1.32 18 embark on in 1.69 associate with in 1.92 19 compliment on for 1.49 wait for of 1.99 20 benefit from of 2.72 aim at in 2.81 21 shed on in 1.62 last for of 1.23 APPENDIX C. DATA SETS FROM JOINT MEASURES 227

German French No. BTR BTR 22 suffer from′′ under 2.44 expire on in′′ 1.25 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 23 dispense with on 1.57 allow for of 2.28 24 stop from of 1.88 arrange for of 1.44 25 warn against before 1.82 cater for of 1.45 26 protect from before 2.42 confer on in 1.79 26 protect from before 2.42 confer on in 1.79 27 test on in 1.65 look at in 5.08 28 abstain from in 2.42 account for of 2.50 29 hear from of 2.65 arrive at in 2.77 30 refrain from of 2.44 embark on in 2.37 31 inform of on 2.92 blame for of 2.28 32 profit from of 2.14 direct at in 2.79 33 free from of 2.21 destine for in 2.27 34 direct at on 2.74 estimate at in 2.42 35 spend on for 3.76 resume at in 2.31 36 target at on 2.66 burden with of 2.28 37 worry about on 3.01 concern with of 4.26 38 estimate at on 2.49 align with on 2.59 39 recover from of 2.45 fill with of 2.69 40 delight with on 2.65 congratulate on for 6.68 41 depend on of 5.01 depend on of 5.77 42 arrive at in 4.07 search for of 2.98 43 exempt from of 3.13 level at in 3.04 44 differ from of 3.62 please with of 4.43 45 level at on 2.99 care for of 3.59 46 depart from of 3.24 dispense with of 3.34 47 expect from of 3.91 forgive for of 4.25 48 complain about on 3.55 target at on 4.97 49 distance from of 3.59 compliment on for 4.74 50 detract from of 3.43 know as in 8.17 51 separate from of 3.90 satisfy with of 7.76 52 range from of 4.07 delight with of 5.91 53 vary from of 4.23 border on of 5.56 228 C.2. RECOMMENDATION LISTS FOR LEARNERS OF ENGLISH

German French No. BTR BTR 54 deviate from of′′ 4.18 regard as in′′ 11.88 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 55 miss from in 6.07 prevent from of 13.60 56 hang over on 10.71 interpret as in 7.09 57 interpret as in 13.54 endow with of 6.26 58 enter into in 24.97 enter into in 14.40 59 fall within in 21.29 benefit from of 14.01 60 incorporate into in 25.10 treat as in 10.83 61 integrate into in 25.16 receive from of 13.06 62 translate into in 24.67 protect from of 11.79 63 transform into in 22.89 arise from of 12.37 64 serve as for 28.41 suffer from of 13.09 65 force into in 22.13 learn from of 13.85 66 preside over in 20.20 equip with of 10.79 67 convert into in 24.12 hear from of 12.91 68 divide into in 25.09 perceive as in 9.95 69 transpose into in 25.08 exclude from of 14.23 70 throw into in 24.73 remove from of 14.01 71 classify as in 25.70 charge with of 10.32 72 plunge into in 25.11 emerge from of 13.82 73 breathe into in 24.12 derive from of 13.70 74 channel into in 25.24 transform into in 12.84 75 perceive as in 31.11 gain from of 12.91 76 treat as in 44.25 abstain from of 14.24 77 regard as for 61.73 stem from of 12.87 78 know as in 51.43 distance from of 11.62 79 view as in 42.83 range from of 12.28 80 describe as in 52.92 refrain from of 13.57 81 import from of 12.57 82 expect from of 13.42 83 differ from of 13.76 84 divide into in 12.68 85 withdraw from of 14.30 86 convert into in 13.04 APPENDIX C. DATA SETS FROM JOINT MEASURES 229

German French No. BTR BTR 87 ′′ detract from of′′ 11.97 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 88 originate from of 13.35 89 stop from of 13.67 90 worry about of 15.34 91 vary from of 13.50 92 translate into in 16.77 93 separate from of 14.12 94 exempt from of 14.04 95 emanate from of 12.54 96 profit from of 13.43 97 depart from of 13.44 98 release from of 13.34 99 free from of 13.80 100 recover from of 13.22 101 quote from of 13.56 102 escape from of 13.29 103 date from of 13.06 104 disappear from of 14.20 105 talk about of 41.46 106 classify as in 15.18 107 incorporate into in 23.58 108 deviate from of 13.91 109 expel from of 14.58 110 integrate into in 23.64 111 channel into in 15.32 112 fall within in 23.69 113 force into in 18.59 114 throw into in 18.19 115 transpose into in 21.20 116 serve as of 29.08 117 view as in 24.84 118 miss from in 21.28 119 complain about of 24.72 230 C.2. RECOMMENDATION LISTS FOR LEARNERS OF ENGLISH

German French No. BTR BTR 120 ′′ describe as of′′ 35.08 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 121 plunge into in 23.77 122 breathe into in 23.12 123 hang over on 36.75 124 preside over of 53.81

Italian Polish No. BTR BTR 1 call for of′′ 1.17 talk about of′′ 1.40 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 2 ask for of 1.00 vote in for 2.16 3 comply with of 1.23 ask for of 1.61 4 allow for of 1.10 allow for on 1.17 5 wait for of 1.36 look at on 2.29 6 receive from by 1.47 deprive of by 1.00 7 learn from by 1.50 concern with of 1.37 8 confer on in 1.07 wait for on 1.47 9 arise from by 1.51 hope for on 1.24 10 exclude from by 1.55 learn from with 1.48 11 protect from by 1.49 remove from with 1.33 12 hear from by 1.48 pass on in 1.48 13 remove from by 1.57 press for on 1.14 14 aim at in 2.65 schedule for on 1.10 15 emerge from by 1.52 aim at on 2.37 16 inflict on in 1.11 confer on in 1.13 17 derive from by 1.49 fight for of 1.62 18 differ from by 1.41 regard as for 2.02 19 vary from by 1.28 compose of with 1.10 20 expect from by 1.43 discriminate against of 1.34 21 abstain from by 1.58 decide on of 1.93 22 refrain from by 1.51 depend on from 2.17 23 gain from by 1.47 avail of with 1.09 24 import from by 1.41 fill with by 1.16 25 stem from by 1.51 benefit from with 2.24 APPENDIX C. DATA SETS FROM JOINT MEASURES 231

Italian Polish No. BTR BTR 26 blame for of′′ 1.49 worry about of′′ 1.50 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 27 withdraw from by 1.55 label of in 1.22 28 originate from by 1.43 escape from with 1.20 29 guard against by 1.19 congratulate on in 3.03 30 separate from by 1.53 withdraw from with 1.52 31 range from by 1.54 burden with of 1.16 32 spend on for 2.16 dispose of in 1.35 33 delight with of 1.43 suffer from of 2.04 34 distance from by 1.50 exclude from with 1.98 35 arrange for of 1.32 emerge from with 2.03 36 exempt from by 1.55 derive from with 1.98 37 embark on in 1.73 originate from with 1.64 38 quote from by 1.47 gain from with 1.83 39 detract from by 1.50 arise from with 2.26 40 depart from by 1.51 exempt from with 1.70 41 free from by 1.53 report on of 2.12 42 escape from by 1.48 protect from before 2.22 43 resume at in 1.46 recover from with 1.64 44 release from by 1.53 release from with 1.69 45 recover from by 1.53 stem from with 2.11 46 emanate from by 1.51 touch on in 2.33 47 disappear from by 1.58 import from with 2.13 48 expel from by 1.55 quote from with 1.91 49 profit from of 1.65 embark on in 2.43 50 congratulate on for 3.94 legislate on in 1.86 51 deviate from by 1.51 emanate from with 1.92 52 look at in 5.26 inflict on in 2.01 53 depend on by 3.48 date from with 1.98 54 concern with of 3.13 disappear from with 2.14 55 arrive at in 3.06 profit from with 2.32 56 benefit from of 4.01 warn against before 2.44 57 miss from in 2.20 estimate at on 2.33 58 equip with of 2.73 expel from with 2.28 232 C.2. RECOMMENDATION LISTS FOR LEARNERS OF ENGLISH

Italian Polish No. BTR BTR 59 search for of′′ 2.31 last for by′′ 2.30 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 60 direct at in 2.87 rid of in 2.27 61 suffer from of 3.83 cast on in 2.50 62 fill with of 2.46 consult on in 2.93 63 target at in 2.80 direct at in 3.07 64 care for of 2.86 deliver on in 3.25 65 satisfy with of 3.94 care for on 3.28 66 date from of 3.42 border on of 2.90 67 estimate at in 3.63 arrive at in 4.64 68 border on with 3.41 target at for 3.69 69 dispense with of 3.61 charge with of 3.63 70 compliment on for 4.00 guard against before 3.30 71 charge with of 4.34 level at in 3.80 72 endow with of 4.08 transpose into for 4.43 73 postpone until in 4.77 force into for 4.62 74 level at in 4.60 amend on in 5.27 75 prevent from of 12.59 expire on in 4.91 76 stop from of 8.51 resume at in 5.76 77 enter into in 18.70 preside over by 6.05 78 talk about of 34.10 hang over on 6.12 79 translate into in 18.85 incorporate into in 11.31 80 transform into in 18.16 throw into in 10.57 81 incorporate into in 23.53 integrate into in 15.82 82 integrate into in 23.49 channel into in 9.94 83 divide into in 18.18 compliment on for 12.70 84 convert into in 18.22 adjourn on in 13.50 85 fall within in 24.83 equip with in 16.28 86 channel into in 17.24 postpone until for 16.25 87 throw into in 20.00 translate into on 21.23 88 force into in 21.30 enter into in 34.39 89 complain about of 21.52 complain about on 19.82 90 transpose into in 23.73 divide into on 22.38 91 plunge into in 21.51 miss from in 24.72 APPENDIX C. DATA SETS FROM JOINT MEASURES 233

Italian Polish No. BTR BTR 92 worry about for′′ 30.02 transform into in′′ 38.20 휆푣 휆푝 휆푝 휆푣 휆푝 휆푝 93 breathe into in 23.00 lag behind for 33.71 94 hang over on 39.44 hide behind for 35.06 95 preside over of 54.47 endow with in 32.02 96 lag behind of 1236.62 fall within in 53.05 97 convert into in 43.54 98 breathe into in 41.85 99 plunge into in 44.36

Spanish No. BTR 1 thank for by′′ 1.01 휆푣 휆푝 휆푝 2 deal with of 1.36 3 call for of 1.16 4 ask for of 1.15 5 impose on in 1.10 6 consist of in 1.02 7 pass on of 1.08 8 build on in 1.09 9 hope for of 1.17 10 allow for of 1.31 11 wait for of 1.50 12 equip with of 1.01 13 apologise for by 1.04 14 compensate for of 1.19 15 think of in 1.85 16 concern with of 1.66 17 argue for of 1.15 18 aim at in 2.45 19 base on in 3.66 20 congratulate on by 2.53 21 deliver on in 1.21 22 qualify for of 1.11 234 C.2. RECOMMENDATION LISTS FOR LEARNERS OF ENGLISH

Spanish No. BTR 23 pick on of′′ 1.12 휆푣 휆푝 휆푝 24 punish for by 1.02 25 touch on in 1.62 26 arrange for of 1.24 27 elaborate on in 1.22 28 acquaint with of 1.32 29 destine for of 1.48 30 inflict on in 1.55 31 focus on in 4.01 32 place on in 3.05 33 confer on in 1.89 34 cater for of 1.77 35 impact on in 1.94 36 direct at in 2.37 37 account for of 2.81 38 search for of 2.04 39 rest on in 2.17 40 arrive at in 3.18 41 resume at in 2.16 42 look at in 6.99 43 dwell on in 2.51 44 spend on in 3.78 45 concentrate on in 4.41 46 insist on in 4.31 47 rely on in 4.00 48 compliment on by 2.70 49 blame for of 3.15 50 fill with of 2.70 51 found on in 3.43 52 test on in 2.56 53 level at in 2.77 54 embark on in 3.61 55 dispense with of 3.00 APPENDIX C. DATA SETS FROM JOINT MEASURES 235

Spanish No. BTR 56 care for of′′ 3.45 휆푣 휆푝 휆푝 57 dream of with 3.15 58 adjourn on of 3.54 59 target at in 3.84 60 charge with of 4.13 61 centre on in 3.97 62 border on with 3.93 63 depend on of 8.26 64 endow with of 3.99 65 prevent from of 10.67 66 talk about of 16.73 67 import from of 7.91 68 receive from of 11.57 69 stop from of 8.03 70 hear from of 9.86 71 suffer from of 11.60 72 benefit from of 14.27 73 learn from of 13.39 74 count on with 11.84 75 vary from of 9.17 76 hide behind after 9.23 77 gain from of 10.61 78 protect from of 12.63 79 arise from of 13.49 80 exclude from of 13.92 81 remove from of 13.80 82 estimate at in 9.64 83 worry about by 12.45 84 emerge from of 13.89 85 enter into in 20.29 86 date from of 10.15 87 channel into in 9.86 88 stem from of 12.84 236 C.2. RECOMMENDATION LISTS FOR LEARNERS OF ENGLISH

Spanish No. BTR 89 expect from of′′ 13.19 휆푣 휆푝 휆푝 90 derive from of 14.74 91 emanate from of 10.72 92 refrain from of 13.94 93 withdraw from of 14.50 94 differ from of 14.46 95 depart from of 12.02 96 incorporate into in 19.28 97 force into in 13.26 98 exempt from of 14.05 99 expire on of 11.91 100 free from of 13.43 101 distance from of 14.43 102 separate from of 15.17 103 originate from of 15.00 104 integrate into in 20.53 105 disappear from of 13.79 106 detract from of 14.03 107 profit from of 14.38 108 release from of 13.98 109 expel from of 13.80 110 escape from of 14.00 111 translate into in 19.60 112 quote from of 14.62 113 transpose into in 16.38 114 regard as of 29.27 115 breathe into in 14.17 116 recover from of 15.10 117 know as of 24.22 118 deviate from of 15.12 119 transform into in 20.38 120 preside over of 16.28 121 throw into in 17.49 APPENDIX C. DATA SETS FROM JOINT MEASURES 237

Spanish No. BTR 122 treat as in′′ 25.53 휆푣 휆푝 휆푝 123 miss from in 17.67 124 convert into in 20.27 125 divide into in 20.52 126 abstain from in 24.74 127 complain about of 20.82 128 plunge into in 19.72 129 perceive as in 24.19 130 view as in 33.18 131 fall within in 38.93 132 hang over on 32.77 133 interpret as in 37.55 134 describe as of 71.28 135 serve as of 79.84 136 classify as of 57.02 137 lag behind of 3448.76

C.2.2 Adjective Preposition Combinations

German
No.  λ_v          λ_p    λ_p''   BTR
1    aware        of     on      1.01
2    supportive   of     for     1.22
3    incumbent    on     for     1.10
4    other        than   on      3.30
5    reminiscent  of     on      1.18
6    conscious    of     on      1.49
7    dependent    on     of      4.47
8    conditional  on     of      1.95
9    afraid       of     before  2.26
10   different    from   of      3.87
11   more         than   on      21.94
12   mindful      of     in      2.54

French
No.  λ_v           λ_p    λ_p''   BTR
1    responsible   for    of      4.70
2    grateful      for    of      1.01
3    commensurate  with   of      1.22
4    eligible      for    of      1.76
5    sceptical     about  on      1.81
6    keen          on     of      1.74
7    wrong         with   in      2.38
8    incumbent     on     in      2.20
9    happy         with   of      3.63
10   conditional   on     of      2.72
11   dependent     on     of      7.10
12   early         as     in      4.37

German (continued)
No.  λ_v         λ_p    λ_p''   BTR
13   sceptical   about  on      2.88
14   proud       of     on      5.01
15   exempt      from   of      3.76
16   true        of     for     5.09
17   separate    from   of      3.70
18   indicative  of     for     4.27
19   ashamed     of     for     5.50
20   absent      from   in      6.15
21   typical     of     for     7.46
22   bad         than   behind  9.70
23   less        than   under   19.14
24   high        than   on      23.82
25   low         than   under   20.92
26   further     than   on      23.93
27   serious     about  with    28.41
28   early       as     in      47.22
29   same        as     with    139.77

French (continued)
No.  λ_v         λ_p    λ_p''   BTR
13   different   from   of      13.64
14   serious     about  of      10.72
15   other       than   of      23.40
16   content     with   of      9.09
17   synonymous  with   of      10.33
18   exempt      from   of      13.85
19   absent      from   of      13.66
20   separate    from   of      14.04
21   more        than   of      140.38
22   same        as     in      21.43
23   less        than   of      133.55
24   low         than   of      140.22
25   further     than   of      143.77
26   high        than   by      169.72
27   bad         than   of      210.40

Italian
No.  λ_v          λ_p    λ_p''   BTR
1    responsible  for    of      2.43
2    different    from   by      1.38
3    eligible     for    of      1.18
4    independent  of     by      1.25
5    conditional  on     in      1.19
6    true         of     for     1.71
7    same         as     in      1.72
8    happy        with   of      1.91
9    exempt       from   by      1.48
10   absent       from   by      1.49
11   dependent    on     by      4.07
12   separate     from   by      1.56
13   rich         in     of      2.68

Polish
No.  λ_v        λ_p    λ_p''    BTR
1    first      of     in       1.07
2    critical   of     against  1.03
3    proud      of     with     1.13
4    dependent  on     from     2.18
5    third      of     with     1.02
6    worthy     of     on       1.81
7    capable    of     in       3.87
8    short      of     in       1.55
9    serious    about  of       1.56
10   devoid     of     from     1.30
11   valid      for    by       1.33
12   exempt     from   with     1.63
13   afraid     of     before   1.99

Italian (continued)
No.  λ_v         λ_p    λ_p''   BTR
14   keen        on     in      2.70
15   more        than   of      41.25
16   wrong       with   in      5.64
17   content     with   of      5.91
18   serious     about  on      11.39
19   sceptical   about  on      11.29
20   other       than   by      30.84
21   further     than   of      17.58
22   early       as     in      17.80
23   synonymous  with   of      17.48
24   less        than   of      57.92
25   high        than   of      55.43
26   low         than   of      61.10
27   bad         than   of      61.40

Polish (continued)
No.  λ_v           λ_p    λ_p''   BTR
14   free          of     from    2.58
15   sceptical     about  of      1.94
16   conditional   on     from    2.15
17   true          of     in      3.47
18   commensurate  with   for     3.91
19   guilty        of     for     2.29
20   supportive    of     for     3.08
21   wrong         with   in      3.11
22   incapable     of     in      4.42
23   independent   of     from    3.69
24   absent        from   in      5.18
25   ashamed       of     for     4.95
26   typical       of     for     13.11
27   early         as     in      64.57

Spanish
No.  λ_v          λ_p    λ_p''   BTR
1    first        of     in      3.45
2    responsible  for    of      6.26
3    grateful     for    by      1.00
4    eligible     for    of      1.46
5    incumbent    on     in      2.15
6    conditional  on     of      3.39
7    dependent    on     of      8.60
8    wrong        with   in      3.40
9    keen         on     in      3.64
10   sceptical    about  on      5.74
11   different    from   of      14.15
12   more         than   of      101.76
13   serious      about  of      13.79
14   separate     from   of      11.22
15   absent       from   of      11.38
16   same         as     with    19.13

Spanish No. BTR 17 synonymous with of′′ 12.99 휆푣 휆푝 휆푝 18 exempt from of 15.22 19 other than of 76.29 20 early as in 39.85 21 less than of 103.36 22 high than of 105.27 23 further than of 104.52 24 low than of 109.34 25 bad than of 107.89 Bibliography
