Bayesian Models for Multilingual Word Alignment Robert Östling Academic dissertation for the Degree of Doctor of Philosophy in Linguistics at Stockholm University to be publicly defended on Friday 22 May 2015 at 13.00 in hörsal 5, hus B, Universitetsvägen 10 B.
Abstract In this thesis I explore Bayesian models for word alignment, how they can be improved through joint annotation transfer, and how they can be extended to parallel texts in more than two languages. In addition to these general methodological developments, I apply the algorithms to problems from sign language research and linguistic typology. In the first part of the thesis, I show how Bayesian alignment models estimated with Gibbs sampling are more accurate than previous methods for a range of different languages, particularly for languages with few digital resources available—which is unfortunately the state of the vast majority of languages today. Furthermore, I explore how different variations to the models and learning algorithms affect alignment accuracy. Then, I show how part-of-speech annotation transfer can be performed jointly with word alignment to improve word alignment accuracy. I apply these models to help annotate the Swedish Sign Language Corpus (SSLC) with part-of-speech tags, and to investigate patterns of polysemy across the languages of the world. Finally, I present a model for multilingual word alignment which learns an intermediate representation of the text. This model is then used with a massively parallel corpus containing translations of the New Testament, to explore word order features in 1,001 languages.
Keywords: word alignment, parallel text, Bayesian models, MCMC, linguistic typology, sign language, annotation transfer, transfer learning.
Stockholm 2015 http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-115541
ISBN 978-91-7649-151-5
Department of Linguistics
Stockholm University, 106 91 Stockholm
Bayesian Models for Multilingual Word Alignment
Robert Östling
Bayesian Models for Multilingual Word Alignment
Robert Östling
© Robert Östling, Stockholm 2015
ISBN 978-91-7649-151-5
Printing: Holmbergs, Malmö 2015
Distributor: Publit
Cover: Robert Östling

Xiaohui, without you, what would I do?
Little rabbit, you and this thesis were conceived at the same time and born at the same time, but the thesis is no match for you!
Abstract
In this thesis I explore Bayesian models for word alignment, how they can be improved through joint annotation transfer, and how they can be extended to parallel texts in more than two languages. In addition to these general methodological developments, I apply the algorithms to problems from sign language research and linguistic typology. In the first part of the thesis, I show how Bayesian alignment models estimated with Gibbs sampling are more accurate than previous methods for a range of different languages, particularly for languages with few digital resources available—which is unfortunately the state of the vast majority of languages today. Furthermore, I explore how different modifications to the models and learning algorithms affect alignment accuracy. Then, I show how part-of-speech annotation transfer can be performed jointly with word alignment to improve word alignment accuracy. I apply these models to help annotate the Swedish Sign Language Corpus (SSLC) with part-of-speech tags and to investigate patterns of polysemy across the languages of the world. Finally, I present a model for multilingual word alignment that learns an intermediate representation of the text. This model is then used with a massively parallel corpus containing translations of the New Testament to explore word order features in 1,001 languages.
Contents
1. Introduction
   1.1. How to read this thesis
   1.2. Contributions
      1.2.1. Algorithms
      1.2.2. Applications
      1.2.3. Evaluations
   1.3. Relation to other publications
2. Background
   2.1. Parallel texts
   2.2. Alignment
   2.3. Word alignment
      2.3.1. Co-occurrence based models
      2.3.2. Multilingual word alignment
         2.3.2.1. Bridge languages
         2.3.2.2. Sampling-based alignment
         2.3.2.3. Word alignment in massively parallel texts
      2.3.3. The IBM models
         2.3.3.1. Fundamentals
         2.3.3.2. Model 1
         2.3.3.3. Model 2
         2.3.3.4. HMM model
      2.3.4. Fertility
      2.3.5. Structural constraints
         2.3.5.1. Part of Speech (PoS) tags
         2.3.5.2. ITGs and syntactic parses
      2.3.6. Stemming and lemmatization
      2.3.7. EM inference
      2.3.8. Symmetrization
         2.3.8.1. Union
         2.3.8.2. Intersection
         2.3.8.3. Growing
         2.3.8.4. Soft symmetrization methods
      2.3.9. Discriminative word alignment models
      2.3.10. Morpheme alignment
      2.3.11. Evaluation of bitext alignment
   2.4. Bayesian learning
      2.4.1. Bayes' theorem
         2.4.1.1. Bernoulli distributions
         2.4.1.2. Binomial and beta distributions
      2.4.2. The Dirichlet distribution
      2.4.3. The predictive Dirichlet-categorical distribution
      2.4.4. The Dirichlet Process
      2.4.5. The Pitman-Yor Process
      2.4.6. Hierarchical priors
   2.5. Inference in Bayesian models
      2.5.1. Variational Bayesian inference
      2.5.2. Markov Chain Monte Carlo
      2.5.3. Gibbs sampling
      2.5.4. Simulated annealing
      2.5.5. Estimation of marginals
         2.5.5.1. Using the last sample
         2.5.5.2. Maximum marginal decoding
         2.5.5.3. Rao-Blackwellization
      2.5.6. Hyperparameter sampling
   2.6. Annotation projection
      2.6.1. Direct projection
         2.6.1.1. Structural differences between languages
         2.6.1.2. Errors in word alignments
3. Alignment through Gibbs sampling
   3.1. Questions
   3.2. Basic evaluations
      3.2.1. Algorithms
         3.2.1.1. Modeled variables
         3.2.1.2. Sampling
      3.2.2. Measures
      3.2.3. Symmetrization
      3.2.4. Data
      3.2.5. Baselines
         3.2.5.1. Manually created word alignments
         3.2.5.2. Other resources
         3.2.5.3. Baseline systems
      3.2.6. Results
   3.3. Collapsed and explicit Gibbs sampling
      3.3.1. Experiments
   3.4. Alignment pipeline
   3.5. Choice of priors
      3.5.1. Algorithms
      3.5.2. Evaluation setup
      3.5.3. Results
4. Word alignment and annotation transfer
   4.1. Aligning with parts of speech
      4.1.1. Naive model
      4.1.2. Circular generation
      4.1.3. Alternating alignment-annotation
   4.2. Evaluation
      4.2.1. Alignment quality evaluation
      4.2.2. Annotation quality evaluation
   4.3. Tagging the Swedish Sign Language Corpus
      4.3.1. Data processing
      4.3.2. Evaluation data
      4.3.3. Tag set conversion
      4.3.4. Task-specific tag constraints
      4.3.5. Experimental results and analysis
   4.4. Lemmatization transfer
   4.5. Lexical typology through multi-source concept transfer
      4.5.1. Method
      4.5.2. Evaluation
      4.5.3. Limitations
5. Multilingual word alignment
   5.1. Interlingua alignment
      5.1.1. Evaluation: Strong's numbers
         5.1.1.1. Language-language pairwise alignments
         5.1.1.2. Interlingua-language pairwise alignments
         5.1.1.3. Clustering-based evaluation with Strong's numbers
         5.1.1.4. Bitext evaluation with Strong's numbers
   5.2. Experiments
   5.3. Word order typology
      5.3.1. Method
      5.3.2. Data
      5.3.3. Evaluation
      5.3.4. Results
      5.3.5. Conclusions
6. Conclusions
   6.1. Future directions
A. Languages represented in the New Testament corpus
Svensk sammanfattning (Swedish summary)
Bibliography
Glossary

bitext: a parallel text with two languages, see Section 2.1.

colexification: the same word is used for expressing different concepts in a language (François 2008), see Section 4.5.

conjugate prior: in Bayesian statistics, a distribution H is a conjugate prior for a distribution G if, when applying Bayes' theorem with H as prior and G as likelihood function, the posterior distribution is of the same family as H. See Section 2.4.1.

fertility: the number of words in the target language linked to a given word in the source language, see Section 2.3.4.

isolating: languages that use only one morpheme per word, as opposed to synthetic languages.

mixing: in Markov Chain Monte Carlo (MCMC) models, a measure of how independent each sample is from the preceding samples. Slow mixing can make a model useless in practice, because samples remain strongly correlated with the initial value even after a long time.

non-parametric: in non-parametric Bayesian modeling, one uses infinite-dimensional distributions where the number of parameters grows with the number of observations, such as the Dirichlet Process (Section 2.4.4) or the Pitman-Yor Process (PYP) (Section 2.4.5).

posterior: in Bayesian statistics, a probability distribution representing the belief in a hypothesis after taking some evidence into account, see Section 2.4.1.

precision: the ratio of identifications that are correct, with respect to some gold standard classification (dependent only on the number of guesses made by the algorithm, not the total number of instances, unlike recall).

prior: in Bayesian statistics, a probability distribution representing the belief in a hypothesis before taking some evidence into account, see Section 2.4.1.

recall: the ratio of correct identifications to the total number of instances, with respect to some gold standard classification (dependent on the total number of instances, unlike precision).
vii support the support of a distribution is the set of elements with non-zero probability. synthetic languages that use multiple morphemes per word, as opposed to isolating languages. token a word token is a particular instance of a word type. type a word type is an abstract entity that may have a number of instances (tokens).
Acronyms
AER Alignment Error Rate.
CRP Chinese Restaurant Process.
EM Expectation-Maximization.
HMM Hidden Markov Model.
ITG Inversion Transduction Grammar.
MCMC Markov Chain Monte Carlo.
MLE Maximum Likelihood Estimate.
MT Machine Translation.
NLP Natural Language Processing.
NMI Normalized Mutual Information.
NT New Testament.
PoS Part of Speech.
PYCRP Pitman-Yor Chinese Restaurant Process.
PYP Pitman-Yor Process.
SIC Stockholm Internet Corpus.
SMT Statistical Machine Translation.
SSL Swedish Sign Language.
SSLC Swedish Sign Language Corpus.
SUC Stockholm-Umeå Corpus.
WALS World Atlas of Language Structures.
Acknowledgments
This is the easiest part of the thesis to read, but the most difficult to write. Nobody in their right mind is going to be personally offended by, say, a missing word in the evaluation of initialization methods for Bayesian word alignment models (Section 3.4), but publishing an ordered list of everyone you have spent time with during the last five years is a perilous task. For this reason, I will keep the personal matters brief and simply say that I like everyone around me, continuing instead with the people who helped shape this thesis.

My supervisors Mats Wirén, Jörg Tiedemann and Ola Knutsson guided me through five years of uncertainty, bad ideas, distractions, and occasional progress. Towards the end, the thorough review of Joakim Nivre allowed me to smooth out some of the rough spots content-wise, and Lamont Antieau's proofreading certainly improved the style. Lars Ahrenberg patiently helped me to overcome some confusion on word alignment evaluation methodology. Sharon Goldwater deserves a special mention for providing both the start and end points of this thesis. Östen Dahl and Bernhard Wälchli helped me with many matters related to the New Testament corpus and its application to typology. Further advice on typology was given by Francesca Di Garbo, Calle Börstell and Maria Koptjevskaja-Tamm. Francesca Di Garbo also helped with interpreting data on the Romance languages, Yvonne Agbetsoamedo on Sɛlɛɛ, Benjamin Brosig on Mongolian, Calle Börstell, Lars Wallin, Johanna Mesch and Johan Sjons on Swedish Sign Language, Yoko Yamazaki on Japanese, and Britt Hartmann on Swedish.
Tack! (Thank you!)
1. Introduction
In this thesis, I approach a number of seemingly very different problems: finding parts of speech in Swedish Sign Language and 1,001 other languages around the world, investigating word order in all those languages, and determining whether or not they make a distinction between hands and arms. The common theme that unites these various topics is the method: word alignment of parallel texts. This is one of those tasks that seem trivial to the casual observer, and fiendishly difficult to those of us who have tried to implement it on a computer. The problem is this: given translated sentences in different languages, mark which words correspond to each other across these languages.

Apart from the applications mentioned above, much of this thesis will be spent developing and exploring the core methods for word alignment, in particular the recent field of word alignment with Bayesian models using MCMC-type methods for inference, particularly Gibbs sampling (DeNero et al. 2008; Mermer & Saraçlar 2011; Gal & Blunsom 2013). While previous work has shown MCMC to be an appealing alternative to the Expectation-Maximization (EM) algorithms most commonly used for word alignment, the few existing studies are relatively narrow, and I saw a need for a broader study of Bayesian word alignment models. In a nutshell, Bayesian models make it possible to easily bias the solutions towards what is linguistically plausible, for instance, by discouraging an excessive number of postulated translation equivalents for each word, or encouraging a realistic frequency distribution of words in the intermediate representation used in the multilingual alignment algorithm of Chapter 5. Given a suitable choice of prior distributions, Gibbs sampling can be easily applied even though the function describing the total probability under the model is very complex.

My main research questions can be expressed in the following way:
1. What are the characteristics of MCMC algorithms for word alignment, how should they be applied in practice, and how do they compare to other methods?
2. How can word alignment be performed in massively parallel corpora comprising hundreds or thousands of different languages?
3. How can word-aligned massively parallel corpora be used to perform investigations in linguistic typology?
4. Can word alignment and annotation transfer be performed jointly in order to improve the accuracy of these tasks?
1.1. How to read this thesis
Very few people will enjoy reading this thesis in its entirety. This is inevitable when several different aspects of linguistics and applied statistics are squeezed into one volume. On a positive note, such diversity means that there is at least something here for a wider range of people. What follows is a quick tour of the thesis for a number of different possible readers.
The computational linguist Chapter 2 contains the necessary background for understanding my own contributions, including word alignment models, Bayesian modeling with MCMC inference, and annotation projection. Depending on your specialization, you may want to read selected parts of this chapter. In Chapter 4 I show how word alignment and annotation transfer can be used to benefit each other, and in Chapter 5 I present new models for multilingual word alignment that are scalable to massively parallel texts with hundreds of languages.
The typologist Chapter 4 describes how linguistic annotations can be transferred automatically from one language (such as English) to another (such as the dying Wantoat language of Papua New Guinea, or any of a thousand others) through the use of parallel texts. This kind of transfer, along with the multilingual word alignment methods from Chapter 5, can help give answers to typological questions. Not perfect answers, necessarily, but answers that can be obtained more quickly and are based on larger samples than would be possible without it. I demonstrate this in two case studies, on lexical typology (Section 4.5) and word order typology (Section 5.3).
The sign language researcher Your interest is likely limited to Section 4.3, which describes the use of a transcribed corpus of Swedish Sign Language (SSL) with a translation into written Swedish to transfer part-of-speech tag information from Swedish to SSL, resulting (with manual corrections) in the first sign language corpus automatically annotated with part-of-speech tags.
The computer scientist/statistician If you are not already familiar with Bayesian models, MCMC methods and how they are used in Natural Language Processing (NLP), then Section 2.5.2 on the mathematical background of this study should be of interest, as well as the practical applications of these in Chapter 4 and Chapter 5. This is however not a thesis in mathematics or computer science, so my aim is not to develop new tools or theories in statistics or machine learning.
1.2. Contributions
My contributions in this thesis are of three different types: algorithms, applications and evaluations. These are summarized here, with references to the research questions listed above that they address.
1.2.1. Algorithms

The main algorithmic innovations are to be found in Chapter 5, where a method for multilingual word alignment through an interlingua is developed (Question 2), and in Chapter 4, where word alignment and PoS annotation transfer are performed jointly (Question 4).
1.2.2. Applications

Word alignment in itself is not very exciting to most people. With this in mind, I have tried to apply my word alignment algorithms to some problems selected from different areas of linguistics and NLP. Most of these applications would also be possible using methods for word alignment other than the ones I explore, and the purpose is not mainly to test my own algorithms (for that, see below), but to inspire others to use word-aligned parallel texts in their research. One application concerns annotating text (or transcriptions) with part of speech information (Question 4), particularly for languages with few computer-readable resources available. Chapter 4 uses a corpus of 1,142 New Testament translations to explore this, and Section 4.3 contains a more detailed study with Swedish Sign Language. There are also applications from linguistic typology (Question 3), one concerning word order typology (Section 5.3) and another lexical typology (Section 4.5).
1.2.3. Evaluations

Chapters 3 and 4 contain thorough evaluations for a number of different language pairs, and establish the competitiveness of Bayesian word alignment models for a broader set of languages than has previously been explored, as well as providing a solid baseline for the additional experiments performed in these chapters (Question 1). Furthermore, in my experiments I consistently publish the statistics required for computing most of the many word alignment evaluation measures in use. I hope this will set a precedent that makes results from different studies more easily comparable in the future.
1.3. Relation to other publications
Parts of the work presented in this thesis have been published previously, or are currently in the process of being published. This section gives a list of these publications and their relations to this thesis.
• I use my Stagger PoS tagger throughout Chapter 4 and Chapter 5, since it is the most accurate one available for Swedish and Icelandic. In hindsight, I somewhat regret that I did not use the Icelandic version (developed in cooperation with Hrafn Loftsson) to provide one more source language in the annotation transfer experiments of Chapter 4.
  – Östling, R. (2013). Stagger: An open-source part of speech tagger for Swedish. North European Journal of Language Technology, 3, 1–18
3 1. Introduction
  – Loftsson, H. & Östling, R. (2013). Tagging a morphologically complex language using an averaged perceptron tagger: The case of Icelandic. In Proceedings of the 19th Nordic Conference on Computational Linguistics (NODALIDA 2013), NEALT Proceedings Series (pp. 105–119). Oslo, Norway
• Chapter 4 partially overlaps with two articles and one book chapter, the first covering Section 4.3, which has been accepted at the time of writing and should be published by the time this thesis is defended, the next covering Section 4.5, which has been accepted for publication and is undergoing final revisions, and finally an article currently under review that covers Sections 4.1.3 and 4.2:
  – Östling, R., Börstell, C., & Wallin, L. (2015). Enriching the Swedish Sign Language Corpus with part of speech tags using joint Bayesian word alignment and annotation transfer. In Proceedings of the 20th Nordic Conference on Computational Linguistics (NODALIDA 2015). In press
  – Östling, R. (forthcoming). Studying colexification through parallel corpora. In P. Juvonen & M. Koptjevskaja-Tamm (Eds.), Lexico-Typological Approaches to Semantic Shifts and Motivation Patterns in the Lexicon. Berlin: De Gruyter Mouton
  – Östling, R. (submitted a). A Bayesian model for joint word alignment and part-of-speech transfer
• The multilingual alignment algorithm from Chapter 5 was briefly summarized in a 2014 article, and the application from Section 5.3 in another article that is presently under review.
  – Östling, R. (2014). Bayesian word alignment for massively parallel texts. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers (pp. 123–127). Gothenburg, Sweden: Association for Computational Linguistics
  – Östling, R. (submitted b). Word order typology through multilingual word alignment
2. Background
This chapter is aimed at giving the reader sufficient background knowledge to understand the later chapters of this thesis, where my own results are presented. Since my work is focused on Bayesian methods for word alignment, this chapter will mainly cover word alignment and relevant models from Bayesian statistics. Chapter 4 uses annotation transfer as part of the word alignment process, so at the end of this chapter there is also an introduction to previous work on annotation projection based on word alignments.
2.1. Parallel texts
A parallel text contains translation-equivalent texts in two or more languages. In the most frequently studied case of two languages, this is referred to as a bitext—originally a term from translation theory (Harris 1988). The translation process involved in producing a parallel text can be complicated, and the ideal scenario involving a single source text that has been translated consistently into a number of other languages is often far from reality. De Vries (2007) discusses the long and convoluted translation history of the Bible, which serves as a good introduction to just how complex the creation of a parallel text can be. For reasons of simplicity, most research in NLP takes a more abstract view and does not attempt to model the full complexity of the translation process.

In order to give the reader unfamiliar with the field a sense of the size and characteristics of parallel corpora available to researchers, I will now describe in passing some of the larger parallel corpora commonly used. Figure 2.1 shows the size in both words and number of translations for the following corpora:

Hansards: The proceedings (Hansards) of the Canadian parliament in English and French form an important early parallel corpus, although it contains only two languages.
Europarl: The Europarl corpus (Koehn 2005) has been used extensively, particularly in the Machine Translation (MT) field. It contains the proceedings of the European Parliament in 21 of the official languages of the European Union.
UN documents: The United Nations Parallel Text corpus (Franz et al. 2013) contains a large number of United Nations documents in six different languages, totaling roughly 250 million words per language.
[Figure 2.1 here: a log-log plot with the average number of words per translation (about 10^5 to 10^9) on the vertical axis and the number of translations (about 10^1 to 10^3) on the horizontal axis; Hansards, Europarl, the UN documents, OpenSubtitles and the New Testament corpus are marked.]
Figure 2.1.: Size in words and number of translations for some important parallel corpora. Note that for the New Testament corpus the number of translations is different from the number of languages, since some languages have multiple translations. For the other corpora, these figures are equal.
OpenSubtitles: The OpenSubtitles corpus is part of the OPUS corpus collection (Tiedemann 2012) and contains movie subtitles in 59 different languages, translated by volunteers from the OpenSubtitles project.[1]
New Testament: The New Testament of the Christian Bible has been widely translated, and while there is no standard Bible corpus used for NLP research, several authors have used the Bible as a parallel corpus (Cysouw & Wälchli 2007). The corpus used in the present work contains 1,142 translations in 1,001 languages.

The top left corner in Figure 2.1 is occupied by the type of corpora traditionally used in MT and other research, which contain a small number of languages but a fairly large amount of text for each language. In contrast, the lower right corner contains what Cysouw & Wälchli (2007) term massively parallel corpora, here represented by the New Testament. This type of corpus is characterized by a large number of languages, with a relatively small amount of text in each. Massively parallel corpora are less useful for training MT systems, but since they contain a large portion of the roughly 7,000 languages of the world (Lewis et al. 2014) they have been successfully used for studies in linguistic typology (Cysouw & Wälchli 2007).
[1] http://www.opensubtitles.org/
2.2. Alignment
In practice, for the type of applications mentioned, one most often needs access to an aligned parallel text, where corresponding parts of each translation are matched to each other. This can be done at multiple levels, which are often treated as separate problems and solved using different approaches:
Document linking: A necessary step before attempting to align a parallel text is to identify which documents in a collection are actually translations of each other. If the required metadata is not available, automatic methods may be applied, although this issue will not be addressed further here.
Sentence alignment: The first alignment step is typically sentence alignment, where the parallel text is divided into chunks with one or a few sentences per language. Efficient and accurate algorithms for this problem exist, but are beyond the scope of this work. The interested reader is referred to the survey of Tiedemann (2011, chapter 4). In some cases, such as the verses of the New Testament, units other than sentences might be used.
Word alignment: Given a sentence alignment, the next level of matching is normally done at the word level. Word alignment is a much more challenging problem than sentence alignment and will be discussed further in Section 2.3.
Morpheme alignment: For synthetic languages, where words may consist of multiple morphemes, word alignment becomes insufficient. Much current research focuses on how to identify and align morphemes rather than whole word forms, which will also be discussed in Section 2.3.
Document linking and sentence alignment are rather well-defined, since (for any reasonable translation) documents and sentences in one language normally have direct correspondences in the other, although sometimes this relationship is not one-to-one. At the word and morpheme levels, on the other hand, the situation is less clear. Several factors speak against even trying to align words or morphemes:
1. words in idiomatic expressions often have no direct counterparts
2. translators can choose different wordings to express roughly the same meaning
3. grammatical morphemes (free or bound) differ widely across languages.
Given this, why even bother trying to align words or morphemes? In spite of all the cases where there is no clear alignment, there are also many examples of the opposite. In most cases, there would be little doubt about which word is used to translate a concrete noun
like tiger. Empirically, a large number of applications have demonstrated the usefulness of word (and to some extent morpheme) alignment, but it is important to keep in mind that there are many cases where word alignment is not very well-defined. Och & Ney (2003) make this uncertainty an integral part of their evaluation metric, the Alignment Error Rate (AER), which assumes that the gold standard annotation data contains three types of relations between source and target language words: sure link, no link and possible link. Although this division, and the AER measure in particular, has been criticized on the grounds that it correlates poorly with machine translation quality (Fraser & Marcu 2007; Holmqvist & Ahrenberg 2011), the division has been widely used when evaluating word alignment systems. Vilar et al. (2006), on the other hand, argue that AER is an adequate measure of alignment quality as such, but that word alignments are not necessarily a very useful concept for phrase-based machine translation.

It should also be mentioned that given the existence of some very clear alignments, on the one hand, and questionable or even unalignable words, on the other, there is a tradeoff between precision and recall for word alignment. In some applications the recall does not have to be very high, which means that uncertain alignments can be sacrificed in order to improve precision. Liang et al. (2006, figure 2) and Cysouw et al. (2007, figs. 3–5) demonstrate how this tradeoff can look in practice.

The issues surrounding word alignment become even more severe when we move beyond two languages. One way to generalize is by considering each pair of languages, so that word $w_i^A$ in language A is aligned to $w_j^B$ in language B, to $w_k^C$ in language C, and so on. Beyond the impracticality of quadratic complexity in the number of translations, this also introduces the possibility of inconsistent solutions. For instance, given the A-B and A-C alignments above, what if in the B-C alignment $w_j^B$ was linked to $w_{k'}^C$ for some $k' \neq k$? An alternative to pairwise alignments is to use a common representation, to which every translation is aligned. I borrow the term interlingua from the field of MT to denote this common representation, since the goal is to use a language-independent semantic representation. Because each word is only aligned once there is no risk of inconsistency, and complexity is linear in the number of translations. The question is of course how to find a suitable interlingua, and this problem is dealt with in Chapter 5.
2.3. Word alignment
Most of the research on word alignment focuses on unsupervised algorithms. There are two important reasons for this: lack of annotated data, and the already-acceptable performance of unsupervised algorithms. Given that the world has roughly 7,000 languages, there are around 25 million language pairs. Manually word-aligned data that could be used for training supervised word alignment algorithms exists only for an exceedingly small subset of these pairs, whereas massively parallel corpora containing over a thousand languages exist (see Section 2.1) that could readily be used with unsupervised alignment algorithms.
2.3.1. Co-occurrence based models

The word alignment problem is commonly understood as this: for each word in a parallel text, find the word(s)—if any—in the other language(s) that correspond to this word. That is, we want to align words on a token basis. A somewhat simpler problem is to align word types, so that we can find that, e.g., English dog tends to correspond more strongly to German Hund 'dog' than to Katze 'cat.' This is essentially the problem of lexicon construction, which has received considerable attention on its own (Wu & Xia 1994; Fung & Church 1994). Given a way of obtaining translational similarity between word types in a parallel text, various heuristics have been tried to transform these into token-level word links (Gale & Church 1991; Melamed 2000). However, in the evaluation of Och & Ney (2003), such methods are much less accurate than the probabilistic IBM models. Part of this difference can be explained by the fact that co-occurrence based models, unlike the IBM models (except model 1), do not take the structure of sentences into account. Regularities in word order and the fact that words are normally aligned to one or a few other words provide strong constraints on which alignments are likely, so models that ignore this information tend to suffer.
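To make the co-occurrence idea concrete, the following sketch (my own minimal construction, not one of the systems cited above) scores word-type pairs with the Dice coefficient over sentence-level co-occurrence counts; turning such type-level scores into token-level links is precisely the heuristic step where these methods fall short of the IBM models.

from collections import Counter
from itertools import product

def dice_scores(bitext):
    # bitext: list of (source_tokens, target_tokens) sentence pairs.
    # Returns a dict mapping (source_type, target_type) to the Dice
    # coefficient 2*c(e,f) / (c(e) + c(f)) over sentence co-occurrence.
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in bitext:
        e_types, f_types = set(e_sent), set(f_sent)
        c_e.update(e_types)
        c_f.update(f_types)
        c_ef.update(product(e_types, f_types))
    return {(e, f): 2 * n / (c_e[e] + c_f[f])
            for (e, f), n in c_ef.items()}

bitext = [("the dog barks".split(), "der Hund bellt".split()),
          ("the cat sleeps".split(), "die Katze schläft".split()),
          ("the dog sleeps".split(), "der Hund schläft".split())]
scores = dice_scores(bitext)
print(scores[("dog", "Hund")])     # 1.0: they always co-occur
print(scores[("dog", "schläft")])  # 0.5: they co-occur only once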
2.3.2. Multilingual word alignment

Most work on word alignment has been carried out on bitext alignment, where there are exactly two translations. There is, however, a growing number of parallel corpora with more than two translations, sometimes reaching over a thousand translations (see Section 2.1). Given that bitext alignment is already a quite developed field, perhaps the most natural way of dealing with multilingual parallel texts is to perform pairwise alignment using established methods. Some authors have suggested ways of exploiting additional languages for bitext alignment, primarily through various uses of bridge languages. There has also been a separate line of research focusing on aligning word and morpheme types in massively parallel texts. We now turn to a discussion of some important examples of approaches to multilingual word alignment.
2.3.2.1. Bridge languages

Several authors have used bridge languages (sometimes called pivot languages) to improve bitext alignment. The basic idea is that by aligning language A to each of several languages $X_i$ ($i = 1, 2, \dots$), and aligning these languages to language B, we obtain information that is useful when aligning A and B. Borin (2000) uses the union of alignment links discovered through the A–B alignment, on the one hand, and links discovered from the chained A–$X_i$–B alignments, on the other. He concludes that little precision is lost using this method, while there is a substantial gain in recall.

Kumar et al. (2007) introduce a general probabilistic method for combining word alignments and test it by combining alignments through different bridge languages. Although
AER scores are worse, they show increased Statistical Machine Translation (SMT) performance using their method.[2]

Filali & Bilmes (2005) use a method where the languages of interest, A and B, are in the middle of an alignment chain X–A–B–Y. Given alignments X–A and B–Y, candidate alignments between A and B can be weighted by how reasonable an alignment between X and Y they produce. One way to interpret this is as a kind of inverted bridge language alignment (where the languages of interest are the bridge); another is to see the aligned words of X as "tags" on words in A, and words in Y as tags on their corresponding words in B. In light of this interpretation, the method is similar to the PoS-enhanced alignment method of Toutanova et al. (2002).
2.3.2.2. Sampling-based alignment

Lardilleux & Lepage (2009) presented a model that learns translation-equivalent words and phrases from a multilingual parallel text by sampling small subsets of the text and looking for phrases that occur in the same contexts. Later work (Lardilleux et al. 2011, 2012) improved and extended this model to produce word alignments. Unfortunately, neither of these works evaluated word alignment accuracy, but they did demonstrate that SMT performance increases if IBM model 4 alignments are combined with their method.
2.3.2.3. Word alignment in massively parallel texts

Massively parallel texts (see Section 2.1) have been used in linguistic typology for automatically comparing some features across a large sample of languages. This specialized type of data (hundreds of languages) and applications (typological investigations) has caused alignment of massively parallel texts to develop mostly independently of SMT-directed word alignment research. While the methods used have so far been rather primitive, the range of applications is impressive and a strong motivation in my own search for improved methods of multilingual word alignment. Examples of this can be found in Cysouw et al. (2007) and Dahl (2007), who use simple co-occurrence statistics for pairwise alignment. The other works in the collection introduced by Cysouw & Wälchli (2007) are also worth consulting for an overview of how massively parallel texts have been applied to crosslinguistic studies.

The only truly multilingual method I am aware of (Mayer & Cysouw 2012) only performs word type alignment, carried out by clustering word types from different languages based on co-occurrence statistics. Although both methods and applications are different, this makes it similar to the work of Lardilleux & Lepage (2009).
[2] Kumar et al. (2007) claim in their summary that "Despite its simplicity, the system combination gives improvements in alignment and translation performance," but the improvement in alignment they refer to is relative to the case where there is no bitext between the languages of interest, only indirectly through bridge languages (Shankar Kumar, p.c.).
2.3.3. The IBM models

The IBM models (Brown et al. 1993) have been perhaps the most successful family of bitext word alignment models, and in particular the GIZA++ implementation (Och & Ney 2003) has been frequently used in MT and other applications over the last decade. Brown et al. (1993) present five models (generally referred to as "model 1" through "model 5"), of which the first two will be discussed here, along with the similar Hidden Markov Model (HMM)-based model of Vogel et al. (1996). Models 3, 4 and 5 contain more complex models of sentence structure, but the advantages they offer over the HMM model are fairly small, particularly when the latter is extended with a fertility model. Therefore, the higher IBM models will not be described here.
2.3.3.1. Fundamentals

At their core, the IBM models are asymmetric translation models that assume the target language (usually denoted $f$, think French or foreign) text is generated by the source language (usually denoted $e$, think English) side of a bitext. Target language tokens $f_j$ are assumed to be generated either by one source language token $e_i$ or by the special null word, which represents target words that have no correspondence in the source sentence. Alignment variables $a_j$ are used to indicate that $f_j$ was generated by $e_{a_j}$. This is illustrated in Figure 2.2. Since the models are used for word alignment rather than actual translation, both $e$ and $f$ are observed quantities; only the alignment $a$ needs to be inferred.
[Figure 2.2: Variables in the IBM alignment models. The target sentence $f_1 \dots f_8$ 'da regnete es Feuer und Schwefel vom Himmel' is generated from the source sentence $e_1 \dots e_7$ 'it rained fire and brimstone from heaven' plus the NULL word, with alignment variables $a_1 \dots a_8$ linking each target token to a source token.]
All of the IBM models have in common that word correspondences are modeled using categorical distributions conditional on the source token: $p_t(f|e)$. The models differ mainly with respect to how word order is modeled, and whether or not the number of target tokens generated by a single source token is taken into account.
2.3.3.2. Model 1

According to IBM model 1, sentences are generated from the source language sentence $e$ to the target language sentence $f$ in the following way:
1. Choose a target sentence length $J$ from $p_l(J|I)$.
2. For each $j = 1 \dots J$, choose $a_j$ with probability $p_a(a_j) = 1/I$.
3. For each $j = 1 \dots J$, choose $f_j$ with probability $p_t(f_j|e_{a_j})$.

The probability of a sentence under this model is
$$P(f, a|e) = p_l(J|I) \prod_{j=1}^{J} p_a(a_j)\, p_t(f_j|e_{a_j}) \qquad (2.1)$$

$$= \frac{p_l(J|I)}{I^J} \prod_{j=1}^{J} p_t(f_j|e_{a_j}) \qquad (2.2)$$

which means that all permutations of the target sentence are assumed to be equally likely, and only the translation probabilities matter. If one wants to allow unaligned target words, a null token can simply be appended to each source language sentence. In practice, $p_l(J|I)$ is assumed to be constant and does not affect the learned alignments.

The assumption that all alignments are equally likely is problematic, since it only holds if word order is arbitrary. In practice, the word order in one language of a bitext tends to be a strong predictor of the word order in the other language. Empirical evaluations have demonstrated that the performance of model 1 is quite poor (Och & Ney 2003).
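As a concrete illustration (my own sketch, not code from this thesis): since $p_a$ is uniform and $p_l(J|I)$ constant, the alignment variables are independent given a translation table, so the most probable alignment under model 1 simply links each target token to its individually best source token (or the null word).

NULL = "<NULL>"

def model1_best_alignment(e_sent, f_sent, p_t):
    # For each target token f_j, pick a_j = argmax_i p_t(f_j | e_i).
    # p_t: dict mapping (f, e) to the translation probability p_t(f|e);
    # unseen pairs get a tiny floor probability. Position 0 is NULL.
    src = [NULL] + e_sent
    return [max(range(len(src)), key=lambda i: p_t.get((f, src[i]), 1e-10))
            for f in f_sent]

p_t = {("der", "the"): 0.8, ("Hund", "dog"): 0.9,
       ("bellt", "barks"): 0.7, ("ja", NULL): 0.5}
print(model1_best_alignment("the dog barks".split(),
                            "der Hund bellt ja".split(), p_t))
# [1, 2, 3, 0]: 'ja' has no counterpart and is linked to NULL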
2.3.3.3. Model 2

Model 2 is identical to model 1, except that it conditions alignment links on the position of the target word and the lengths of the source and target sentences, $p_a(i|j, I, J)$. This leads to the following generative story:
1. Choose a target sentence length $J$ from $p_l(J|I)$.
2. For each $j = 1 \dots J$, choose $a_j$ with probability $p_a(a_j|j, I, J)$.
3. For each $j = 1 \dots J$, choose $f_j$ with probability $p_t(f_j|e_{a_j})$.

The total probability of a sentence then becomes
$$P(f, a|e) = p_l(J|I) \prod_{j=1}^{J} p_a(a_j|j, I, J)\, p_t(f_j|e_{a_j}) \qquad (2.3)$$
While this at least provides a rudimentary model of word order, conditioning on the absolute position and the lengths of both sentences introduces problems due to data sparsity. For this reason, several authors have explored variants of model 2 with word order models that use fewer parameters (Och & Ney 2003; Dyer et al. 2013).
2.3.3.4. HMM model

While not one of the original IBM models, the HMM model of Vogel et al. (1996) is frequently used in combination with them as a natural extension of model 2. Instead of using a distribution over the absolute target position, the HMM model uses a distribution
over the relative position with respect to the previous word's alignment, $p_j(m|I)$, where $m = a_j - a_{j-1}$. The probability under this model is
$$P(f, a|e) = p_l(J|I) \prod_{j=1}^{J} p_j(a_j - a_{j-1}|I)\, p_t(f_j|e_{a_j}) \qquad (2.4)$$
This helps to model the fact that sentences do not tend to be reordered at the word level, but rather at e.g. the phrase level. At phrase boundaries, there may be long jumps, but within phrases the jumps tend to be short. Model 2, with its assumption of independence between aj for different j is unable to capture this. Evaluations have shown the HMM model to be much better than models 1 and 2, somewhat better than model 3, and somewhat worse than models 4 and 5 (Och & Ney 2003; Gal & Blunsom 2013). Extensions to the HMM model have been developed that further improve performance, to the point of rivaling the best of the IBM models (Toutanova et al. 2002).
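As an illustration of equation 2.4, the sketch below (my own; a real implementation also needs an initial-position distribution, simplified here to a fixed virtual predecessor position) evaluates the joint probability of one alignment under toy jump and translation tables.

def hmm_alignment_prob(e_sent, f_sent, alignment, p_jump, p_t, a0=0):
    # Probability of one alignment under the HMM model (eq. 2.4),
    # up to the constant length factor p_l(J|I).
    # alignment[j]: index into e_sent for target token j.
    # p_jump: maps the relative jump m = a_j - a_{j-1} to p_j(m|I).
    # p_t: maps (f, e) to the translation probability p_t(f|e).
    # a0: virtual position before the first word (a simplification).
    prob, prev = 1.0, a0
    for f, a in zip(f_sent, alignment):
        prob *= p_jump.get(a - prev, 0.0) * p_t.get((f, e_sent[a]), 0.0)
        prev = a
    return prob

e = "it rained fire".split()
f = "es regnete Feuer".split()
p_jump = {-1: 0.1, 0: 0.1, 1: 0.6, 2: 0.2}
p_t = {("es", "it"): 0.5, ("regnete", "rained"): 0.8, ("Feuer", "fire"): 0.9}
# The monotone alignment [0, 1, 2] consists of three short +1 jumps:
print(hmm_alignment_prob(e, f, [0, 1, 2], p_jump, p_t, a0=-1))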
2.3.4. Fertility

The IBM models and derived models all assume one-to-many alignment links, where a source word can generate zero or more target words. There is a great amount of regularity to be exploited in how many target words a given source word generates (its fertility), and different authors have devised a wide variety of methods to do this. For instance, if the German Unabhängigkeitserklärung 'declaration of independence' is aligned to the three English words declaration of independence in a particular instance, the fertility of the token Unabhängigkeitserklärung is 3. In another instance, it might be aligned to independence declaration, and the fertility is then 2. In a good German-English alignment, we can expect Unabhängigkeitserklärung to often have a fertility of 2 or 3, whereas other values are highly unlikely. Most models with fertility include some distribution $p_f(\phi|e)$ conditioned on the source word, so that the fertility of each word can be learned.

Previously, IBM models 3–5 (Brown et al. 1993) used fertility parameters from categorical distributions, Toutanova et al. (2002) extended the HMM model with a fertility-like parameter, and Zhao & Gildea (2010) used a Poisson-distributed fertility parameter. Gal & Blunsom (2013) mostly followed the original IBM models, but used hierarchical Pitman-Yor priors (see Section 2.4.6) for the fertility distributions. In all cases, a sizeable improvement in alignment quality over the corresponding non-fertility model was reported.
2.3.5. Structural constraints

The alignment models described so far only model very general and language-agnostic aspects of language, such as word correspondences and elementary representations of word order. In some cases, further analysis of the text may be available, which could provide valuable information for an alignment model. Some models have been proposed that use either PoS tags or syntactic parses to guide word alignment.
2.3.5.1. PoS tags

Toutanova et al. (2002) used PoS-annotated bitexts and simply introduced an additional PoS translation factor $p_p(t_f|t_e)$, conditioning the target tag ($t_f$) on the source tag ($t_e$), into the HMM model of Vogel et al. (1996). They obtained improved word alignments, although the gain over the baselines (IBM model 4 and the HMM model) diminished with increasing training data size.
2.3.5.2. ITGs and syntactic parses

Inversion Transduction Grammars (ITGs) (Wu 1997) assume that both versions of a parallel sentence are generated from the same abstract parse tree, where the order of constituents within a production may be reversed in one language. Yamada & Knight (2001) presented a model which assumes that the target language sentence is generated from the phrase structure parse tree of the source sentence through a series of transformations. They reported better performance than IBM model 5 on a small English-Japanese alignment task.

Cherry & Lin used a dependency parse tree for one of the languages in a bitext to define alignment constraints (Cherry & Lin 2003; Lin & Cherry 2003b) based on phrasal cohesion (Fox 2002). This constraint was later used as a feature in a supervised discriminative alignment model (Cherry & Lin 2006b), and by Wang & Zong (2013), who used it in an unsupervised generative model.

Cherry & Lin (2006a) compared the constraints provided by ITGs and phrasal cohesion, finding that ITG constraints were broken more rarely, although when they used the constraints for selecting among alignments in a co-occurrence based model, it turned out that the constraint based on phrasal cohesion made up for this by rejecting more incorrect hypotheses, leading to a higher F1 score. Further gains (though minor) can be obtained by combining the two constraints. This result is somewhat at odds with an earlier study by Zhang & Gildea (2004), who found that ITG-aided alignment was more accurate than the method of Yamada & Knight (2001), which was based on phrase structure grammar. Cherry & Lin (2006a) explained this divergence by the fact that their own method used the same alignment model, just with different constraints, whereas the models compared by Zhang & Gildea (2004) differ considerably in other ways.

Lopez & Resnik (2005), on the other hand, evaluated an extension to the HMM model (Vogel et al. 1996) that considers the dependency tree distance between aligned tokens, but this did not lead to any improvement in their evaluation using three different language pairs.
2.3.6. Stemming and lemmatization

Given that data sparsity is a particularly serious problem for word alignment (or indeed any NLP task) with synthetic languages, several researchers have used stemming, lemmatization or similar techniques to normalize the large number of word forms present in such languages. Ideally, a full lemmatization, where the canonical citation form of each token is identified, would be available, and Bojar & Prokopová (2006) have shown that
the error rate can be cut in half in Czech-English word alignment by using lemmas rather than full word forms.

Accurate lemmatization software is, however, only available for a small number of languages, and less accurate techniques have also been used. One possibility is to use annotation transfer, e.g. with the method of Yarowsky et al. (2001), so that a lemmatizer or stemmer is needed for only one of the bitext languages. However, Fraser & Marcu (2005) found that a very simple method works even better (for some suffixing languages): cutting off all but the first four letters of each word.
2.3.7. EM inference

Brown et al. (1993) and most of their successors used the EM algorithm (Dempster et al. 1977) to learn the parameters of the IBM alignment models. The EM algorithm finds locally optimal parameters of a latent variable model by iterating the following two steps:
• Expectation: Compute the expected values of the latent variables given the current parameters and the observed variables.
• Maximization: Update the parameters using the maximum-likelihood estimate given the observed variables and the expected values of the latent variables.
For IBM models 1 and 2, as well as the HMM model of Vogel et al. (1996), the expected values of the latent variables (that is, the alignment variables $a_j$) can be computed efficiently and exactly. Model 1 additionally has the appealing property that its likelihood function is convex (Brown et al. 1993), which means that the EM algorithm will always find a global optimum, although this global optimum is usually not unique and a bad initialization can result in finding a poor solution (Toutanova & Galley 2011). The more elaborate IBM models 3, 4 and 5 have no known efficient method for computing the expectations needed for EM; instead, approximate search algorithms are used that require good initial parameter values in order to converge to a reasonable solution. Typically, the IBM models are pipelined so that parameters learned by a simpler model are used to initialize successively more complex models (Brown et al. 1993; Och & Ney 2003).

In the Bayesian versions of the IBM models (described further in Section 2.4), other methods of inference must be used (Section 2.5), mainly variational Bayes (Section 2.5.1) and Gibbs sampling (Section 2.5.3).
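For concreteness, here is a minimal EM implementation for IBM model 1 (my own sketch, not the software evaluated later in this thesis); the E step is exact because the $a_j$ are independent given the current translation table.

from collections import defaultdict

NULL = "<NULL>"

def model1_em(bitext, iterations=10):
    # EM training of the model 1 translation table p_t(f|e).
    # bitext: list of (source_tokens, target_tokens) sentence pairs.
    # Returns t with t[e][f] = p_t(f|e).
    f_types = {f for _, fs in bitext for f in fs}
    uniform = 1.0 / len(f_types)
    t = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected c(e, f)
        for e_sent, f_sent in bitext:
            src = [NULL] + e_sent
            for f in f_sent:
                # E step: posterior p(a_j = i | e, f) that each source
                # position generated f, under the current table.
                z = sum(t[e][f] for e in src)
                for e in src:
                    count[e][f] += t[e][f] / z
        # M step: maximum-likelihood re-estimate of p_t(f|e).
        for e, c in count.items():
            total = sum(c.values())
            t[e] = defaultdict(float, {f: n / total for f, n in c.items()})
    return t

bitext = [("the dog".split(), "der Hund".split()),
          ("the cat".split(), "die Katze".split()),
          ("a dog".split(), "ein Hund".split())]
t = model1_em(bitext)
print(round(t["dog"]["Hund"], 2))  # 'Hund' gets most of the mass for 'dog'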
2.3.8. Symmetrization

The IBM alignment models and their relatives are asymmetric, and normally find very different alignments depending on the direction in which they are run. For instance, the links found in a German-English alignment would be expected to differ from the corresponding English-German alignment of the same text. The errors made in the two different directions are independent to some extent, so it is standard to perform a symmetrization
procedure to combine the information contained in both directions. Formally, we can describe this as a process of generating a set $L$ of links $(i, j)$, where $(i, j) \in L$ indicates that word $e_i$ in the source language and word $f_j$ in the target language are linked. The input to the symmetrization process consists of source-to-target alignment variables $a_j$
(where $f_j$ is aligned to $e_{a_j}$) and the corresponding target-to-source variables $b_i$ (where $e_i$ is linked to $f_{b_i}$). A number of possible solutions are summarized below, offering different tradeoffs between precision and recall.
2.3.8.1. Union

Include a link if it exists in at least one direction. This benefits recall, at the expense of precision. Formally: $L = \{(i, j) \mid a_j = i \lor b_i = j\}$
2.3.8.2. Intersection

Include a link if it exists in both directions. This benefits precision, at the expense of recall. Formally: $L = \{(i, j) \mid a_j = i \land b_i = j\}$
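Over explicit link sets, these two heuristics are one-liners; the sketch below (my own notation, ignoring null alignments) derives the link sets from the asymmetric alignment variables $a$ and $b$ defined above.

def links_src_to_tgt(a):
    # a[j] = i: target word j was generated by source word i.
    return {(i, j) for j, i in enumerate(a)}

def links_tgt_to_src(b):
    # b[i] = j: source word i is linked to target word j.
    return {(i, j) for i, j in enumerate(b)}

def union(a, b):
    return links_src_to_tgt(a) | links_tgt_to_src(b)

def intersection(a, b):
    return links_src_to_tgt(a) & links_tgt_to_src(b)

a = [0, 1, 1]  # target words 0, 1, 2 generated by source words 0, 1, 1
b = [0, 1, 2]  # source words 0, 1, 2 linked to target words 0, 1, 2
print(intersection(a, b))  # {(0, 0), (1, 1)}: high precision
print(union(a, b))         # adds (1, 2) and (2, 2): higher recall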
2.3.8.3. Growing

One family of symmetrization methods starts from the intersection and gradually grows the alignment into adjacent unaligned words (Och & Ney 2003, p. 33). Empirical evaluations of different symmetrization methods have been performed by Ayan & Dorr (2006) and, in more detail, by Wu & Wang (2007). Overall, these methods result in a lower precision than the intersection and a lower recall than the union, but typically a better AER and F-score than either, indicating a better balance between recall and precision. Algorithm 1 shows the grow-diag-final-and method, given that the neighbors function returns all eight coordinates surrounding $(i, j)$—that is, including the diagonals. Several similar versions have been explored, including grow-final-and (which excludes the diagonal neighbors) and grow-diag (which omits the invocations of final-and).
2.3.8.4. Soft symmetrization methods

The methods above all use discrete (or hard) alignments, assuming that we have fixed values of $a_j$ and $b_i$. In many cases, including the standard IBM models, we can obtain marginal distributions for these variables: $p(a_j = i)$ and $p(b_i = j)$. Matusov et al. (2004) and Liang et al. (2006) proposed different methods for exploiting these marginal distributions to obtain improved symmetric word alignments. An important advantage compared to using discrete alignments is that the tradeoff between precision and recall can be adjusted through some parameter, which could make probabilistic methods more generally useful, since different applications differ in preferring high precision or high recall. In Section 3.2.3, I describe how soft alignments can be used with growing symmetrization methods.
Algorithm 1 The grow-diag-final-and symmetrization heuristic.
function grow-diag-final-and(A1, A2)
    ▷ Start with the intersection of A1 and A2.
    L ← A1 ∩ A2
    ▷ Then proceed to add links from the union, in different steps.
    grow(L, A1, A2)
    final-and(L, A1)
    final-and(L, A2)
    return L
end function

▷ Add links to unaligned words neighboring current links.
function grow(L, A1, A2)
    while L changes between iterations do
        ▷ Consider all current alignments.
        for all (i, j) ∈ L do
            ▷ Check all neighboring points (i′, j′) that are also in the union.
            for all (i′, j′) ∈ neighbors(i, j) ∩ (A1 ∪ A2) do
                ▷ If either e_i′ or f_j′ is unaligned, add (i′, j′) to L.
                if (¬∃k. (i′, k) ∈ L) ∨ (¬∃k. (k, j′) ∈ L) then
                    L ← L ∪ {(i′, j′)}
                end if
            end for
        end for
    end while
end function

▷ Expand L with previously unlinked words from A.
function final-and(L, A)
    for all (i, j) ∈ A do
        ▷ If either e_i or f_j is unaligned, add (i, j) to L.
        if (¬∃k. (i, k) ∈ L) ∨ (¬∃k. (k, j) ∈ L) then
            L ← L ∪ {(i, j)}
        end if
    end for
end function
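For reference, here is a direct Python transcription of Algorithm 1 (my own code; alignments are sets of $(i, j)$ link pairs, such as those produced by the union and intersection functions sketched earlier).

def neighbors(i, j):
    # All eight points surrounding (i, j), including the diagonals.
    return {(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)}

def grow_diag_final_and(a1, a2):
    # Symmetrize two directed alignments given as sets of (i, j) links.
    links = a1 & a2
    union = a1 | a2

    def src_unaligned(i):
        return not any(li == i for li, _ in links)

    def tgt_unaligned(j):
        return not any(lj == j for _, lj in links)

    # grow: repeatedly extend current links to neighboring union links
    # where either the source or the target word is still unaligned.
    changed = True
    while changed:
        changed = False
        for (i, j) in list(links):
            for (ni, nj) in neighbors(i, j) & union:
                if src_unaligned(ni) or tgt_unaligned(nj):
                    links.add((ni, nj))
                    changed = True
    # final-and: one more sweep over each directed alignment.
    for a in (a1, a2):
        for (i, j) in a:
            if src_unaligned(i) or tgt_unaligned(j):
                links.add((i, j))
    return links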
2.3.9. Discriminative word alignment models

Instead of applying the various heuristics discussed in the previous section, it is possible to use manually created word alignments to train a discriminative model that predicts alignment links based on, e.g., asymmetric alignments from the IBM models, as well as other information. It turns out that only a small amount of annotated data is sufficient for attaining high accuracy in this way, and this has led to several studies on discriminative word alignment (Taskar et al. 2005; Ayan & Dorr 2006; Fraser & Marcu 2006; Moore et al. 2006; Liu et al. 2010).

Although in some situations it is reasonable to manually create the necessary word alignments to train a discriminative model, for instance when creating a machine translation system between two languages where proficient human annotators are available, this is clearly not a realistic requirement for, e.g., the New Testament corpus. For this reason, and because generative alignment models are normally used as an essential subcomponent of discriminative models anyway, the topic of discriminative word alignment models falls outside the scope of this thesis.
2.3.10. Morpheme alignment

For isolating languages, where there is a one-to-one correspondence between words and morphemes, standard algorithms for word alignment also perform morpheme alignment. However, few languages are perfectly isolating, and the more synthetic a language is, the further apart the two tasks become. Figure 2.3 illustrates the difference between word and morpheme alignment, using two languages with widely different levels of synthesis: English and West Greenlandic. While both mappings are equivalent on the word level, Figure 2.3b is clearly the more informative.
[Figure 2.3: English-West Greenlandic word vs. morpheme alignment; example from Fortescue (1984, p. 144). Panel (a), word alignment, links the whole word angajuqqaaminiissimavuq to the English sentence 'he has been at his parents' place'; panel (b), morpheme alignment, links its individual morphemes angajuqqaa-mi-niis-sima-vuq to the corresponding English words.]
Eyigöz et al. (2013) used the same HMM-based word order model as Vogel et al. (1996), but applied the lexical translation probabilities at the morpheme level instead of (or, although this decreases performance in most of their evaluation settings, in combination with) the word level. In their evaluation on Turkish-English alignment, their method produces better AER scores than IBM model 4 or Vogel et al.'s HMM model run on unsegmented text. Unfortunately, they only evaluate word alignment quality, not morpheme alignment quality. This makes it difficult to compare their results to the most obvious baseline for morpheme alignment: treating morphemes as words and using standard
18 2.3. Word alignment word alignment algorithms. The model of Eyig¨ozet al. (2013) assumed that both sides of the input bitext are morpheme-segmented. They used a supervised tool for their experiments, but in princi- ple unsupervised methods for morphology induction can be done. Unsupervised learning of morphology is a field of active research, which is beyond the scope of this thesis. A good and fairly recent review of the field was written by Hammarstr¨om& Borin (2011), to which the interested reader is referred. In a sentence, the problem of unsupervised morphology learning can be summarized as unsolved, but important subproblems can be solved accurately enough to be of practical use. Intuitively, it seems clear that parallel texts provide useful information for morpheme segmentation, for instance, since a bound morpheme in one language can be a free mor- pheme in another. There has been some work integrating morpheme alignment and morpheme segmentation, but results are not overwhelmingly positive. Naradowsky & Toutanova (2011), generalizing earlier work by Chung & Gildea (2009), found that their model actually performed better with monolingual than bilingual morpheme segmenta- tion, measured with segmentation F1. Snyder & Barzilay (2008), on the other hand, find that jointly learning morpheme segmentation for both languages in a bitext works better than monolingual segmentation (also in terms of segmentation F1), particularly for related languages. Snyder & Barzilay (2008) used a corpus of short, filtered phrases (also used by Naradowsky & Toutanova (2011) for most of their experiments), which makes it difficult to tell how generally applicable their methods are in practice. Since they simultaneously sampled all possible segmentations and alignments for each word, it is questionable whether this approach scales to whole sentences.
2.3.11. Evaluation of bitext alignment

There are two general approaches to evaluating bitext alignments: intrinsic evaluation methods, which try to evaluate the quality of the word alignments as such, and extrinsic evaluation methods, which evaluate some other task (typically SMT) that uses the word alignments as input. The main disadvantage of intrinsic evaluation is that we need to define some measure of alignment quality, and such measures do not necessarily capture the qualities we are interested in. Extrinsic evaluations, on the other hand, are by definition task-specific and may be complex and time-consuming to set up. In the following, the main focus will be on intrinsic evaluations.

Since the pioneering work of Och & Ney (2000, 2003) on the evaluation of bitext alignment methods, most bitext evaluations have assumed a gold standard consisting of two types of alignment links, sure links (S) and probable links (P), where S ⊆ P and both are defined over pairs (i, j) indicating that word i of the source language is aligned to word j of the target language. Given an alignment A to be evaluated, Mihalcea & Pedersen (2003) define precision separately for sure and probable alignments:

$$p_T(A, T) = \frac{|A \cap T|}{|A|} \qquad (2.5)$$
and similarly for the recall:

$$r_T(A, T) = \frac{|A \cap T|}{|T|} \qquad (2.6)$$

where T is either P or S. Och & Ney (2003) and others have used precision (without further qualification) to mean $p_P$ and recall to mean $r_S$.
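With alignments represented as sets of (i, j) pairs, as in the earlier sketch, equations 2.5 and 2.6 reduce to one-liners. This is a minimal illustration rather than any standard library's API:

    def precision(a, t):
        """Equation 2.5: p_T(A, T) = |A ∩ T| / |A|."""
        return len(a & t) / len(a)

    def recall(a, t):
        """Equation 2.6: r_T(A, T) = |A ∩ T| / |T|."""
        return len(a & t) / len(t)

Following the convention above, one would compute precision against the P links and recall against the S links to obtain the unqualified "precision" and "recall" of Och & Ney (2003).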
AER attempts to combine aspects of both, although it does not directly use either recall or precision in its definition:

$$\mathrm{AER}(A, S, P) = 1 - \frac{|P \cap A| + |S \cap A|}{|S| + |A|} \qquad (2.7)$$
Finally, a weighted F-score has been advocated by some as an alternative to AER, since it correlates better with the BLEU score (Papineni et al. 2002) in an SMT setup for some values of α:

$$F_\alpha(A, S, P) = \frac{1}{\frac{\alpha}{p_P} + \frac{1-\alpha}{r_S}} \qquad (2.8)$$

The precise value of α that best correlates with BLEU varies between evaluations: Fraser & Marcu (2007) found the range 0.2–0.4 for a number of language pairs and settings, while Holmqvist & Ahrenberg (2011) found 0.1–0.6 for different amounts of data in a Swedish-English evaluation. Note that in most cases the F-score used weighs the P-precision and S-recall together, although Mihalcea & Pedersen (2003) also defined the balanced (α = 0.5) F-scores separately for both S and P links, whose formulas simplify to:
$$F_S = \frac{2\,p_S r_S}{p_S + r_S} \qquad (2.9)$$

$$F_P = \frac{2\,p_P r_P}{p_P + r_P} \qquad (2.10)$$

While the (weighted) F-score provides a balance between precision and recall, the AER does not quite do this. It is even possible to construct an example where alignment A has higher (i.e. better) precision and recall than alignment B, but also higher (i.e. worse) AER:³

$$|S| = 26 \quad |P| = 64 \quad |A| = 32 \quad |B| = 62$$
$$|A \cap S| = 20 \quad |A \cap P| = 32 \quad |B \cap S| = 19 \quad |B \cap P| = 60$$

which gives

                A       B
    Precision   1.00    0.97
    Recall      0.77    0.73
    F_0.5       0.87    0.83
    AER         0.103   0.102
³Thanks to Lars Ahrenberg for pointing this out.
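To make the counterexample concrete, the following sketch recomputes the table directly from the link counts above (the helper names are mine, chosen for this illustration):

    def aer(a, s, a_and_s, a_and_p):
        """Equation 2.7, computed from link-set sizes."""
        return 1 - (a_and_s + a_and_p) / (a + s)

    def f_alpha(p, r, alpha):
        """Equation 2.8: weighted harmonic mean of P-precision and S-recall."""
        return 1 / (alpha / p + (1 - alpha) / r)

    S = 26
    for name, A, A_S, A_P in (("A", 32, 20, 32), ("B", 62, 19, 60)):
        prec, rec = A_P / A, A_S / S        # p_P and r_S
        print(f"{name}: precision={prec:.2f} recall={rec:.2f} "
              f"F_0.5={f_alpha(prec, rec, 0.5):.2f} AER={aer(A, S, A_S, A_P):.3f}")

Running this reproduces the table: A wins on precision, recall and F_0.5, yet its AER (0.103) is marginally worse than B's (0.102).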
Finally, it is important to keep in mind that when evaluating against a gold-standard alignment, the particular guidelines used for annotation (and, of course, the annotators' level of adherence to these guidelines) make a crucial difference. Holmqvist & Ahrenberg (2011) compared a number of such guidelines. Since multiple annotators are typically used, there is also a need to combine the links assigned by each. Here, Och & Ney (2003) used the intersection of the annotators' sure links and the union of their probable links to form the final sets of sure and probable links, respectively. Mihalcea & Pedersen (2003), on the other hand, used a final arbitration phase where a consensus was reached, which led them to label all links as sure. Another method is to align the smallest possible phrase pairs and generate sure links from linked single-word phrases and probable links from all word pairs in linked multi-word phrases (Martin et al. 2003, 2005).
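The combination scheme of Och & Ney (2003) amounts to simple set operations. A minimal sketch, assuming one set of sure and one set of probable links per annotator:

    from functools import reduce

    def combine_annotations(sure_sets, probable_sets):
        """Combine per-annotator link sets (each a set of (i, j) pairs):
        intersect the sure links, union the probable links.  If S ⊆ P
        holds for every annotator, it also holds for the combined sets."""
        s = reduce(set.intersection, sure_sets)
        p = reduce(set.union, probable_sets)
        return s, p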
2.4. Bayesian learning
A thorough review of the vast body of work within Bayesian statistics is far beyond the scope of the present work. Beyond basic statistics, which will not be covered here, the background essential for understanding the methods introduced includes Dirichlet priors, non-parametric priors (particularly the Pitman-Yor process), and the MCMC methods used to perform inference. A brief introduction to these topics is given next, with references to further literature for the interested reader.
2.4.1. Bayes' theorem

The starting point of Bayesian statistics is Bayes' theorem:
$$p(h|d) = \frac{p(d|h)\,p(h)}{p(d)} \propto p(d|h)\,p(h) \qquad (2.11)$$
The theorem states that given a prior probability p(h) of a hypothesis h, and a probability p(d|h) of observing some data d given h, the posterior probability p(h|d) taking d into account is proportional to the product p(d|h)p(h).
Running example. In the following discussion, I will use IBM model 1 as a running example. Initially, simplified subproblems will be considered, gradually building up to the Bayesian version of model 1 (Mermer & Saraçlar 2011; Mermer et al. 2013) and its generalization (Gal & Blunsom 2013).
2.4.1.1. Bernoulli distributions

To illustrate Bayes' theorem with a simple discrete distribution, consider the problem of estimating the probability that an English sentence contains the word buy, given that its German translation contains the word kaufen 'buy'.
In order to use Bayes' theorem, the prior distribution p(h) and the likelihood function p(d|h) need to be specified. In the present example, h denotes whether buy is present in the English version of the sentence, and d whether kaufen is present in the German version. Since there are only two outcomes (present or not present), both are modeled with Bernoulli distributions, which have a single parameter p such that the probability of the first outcome (present) is p, and that of the second outcome (not present) is 1 − p. The parameters can be estimated from a corpus; Table 2.1 gives the relevant statistics from the New Testament. From this we can compute the maximum likelihood estimates
            kaufen   ¬kaufen        Σ
    buy          8         5       13
    ¬buy         2     7,942    7,944
    Σ           10     7,947    7,957
Table 2.1.: Number of verses containing buy or kaufen in the English (King James) and German (Luther) versions of the New Testament.
(MLEs) of the prior distribution:

$$p(\text{buy}) = \frac{13}{7957}, \qquad p(\neg\text{buy}) = \frac{7944}{7957} = 1 - p(\text{buy})$$

and for the likelihood we have:

$$p(\text{kaufen}|\text{buy}) = \frac{8}{13}, \qquad p(\neg\text{kaufen}|\text{buy}) = \frac{5}{13} = 1 - p(\text{kaufen}|\text{buy})$$

$$p(\text{kaufen}|\neg\text{buy}) = \frac{2}{7944}, \qquad p(\neg\text{kaufen}|\neg\text{buy}) = \frac{7942}{7944} = 1 - p(\text{kaufen}|\neg\text{buy})$$

By using Bayes' theorem, it is now possible to compute the posterior probability distribution of the English sentence containing buy given that the German translation contains kaufen:
$$p(\text{buy}|\text{kaufen}) \propto p(\text{buy})\,p(\text{kaufen}|\text{buy}) = \frac{13}{7957} \cdot \frac{8}{13} = \frac{8}{7957}$$

$$p(\neg\text{buy}|\text{kaufen}) \propto p(\neg\text{buy})\,p(\text{kaufen}|\neg\text{buy}) = \frac{7944}{7957} \cdot \frac{2}{7944} = \frac{2}{7957}$$

which after normalization gives
$$p(\text{buy}|\text{kaufen}) = 0.8, \qquad p(\neg\text{buy}|\text{kaufen}) = 0.2$$
Because the probabilities were estimated directly from the counts in Table 2.1, we can immediately see that this result makes sense: 8 of the 10 verses containing kaufen also contain buy.
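The entire calculation can be checked with exact rational arithmetic. A small sketch using the counts from Table 2.1:

    from fractions import Fraction

    n_buy, n_not_buy = 13, 7944          # verse counts from Table 2.1
    total = n_buy + n_not_buy            # 7,957 verses in all

    # Maximum likelihood estimates of prior and likelihood.
    p_buy = Fraction(n_buy, total)
    p_kaufen_given_buy = Fraction(8, n_buy)
    p_kaufen_given_not_buy = Fraction(2, n_not_buy)

    # Unnormalized posterior via Bayes' theorem.
    post_buy = p_buy * p_kaufen_given_buy                 # = 8/7957
    post_not_buy = (1 - p_buy) * p_kaufen_given_not_buy   # = 2/7957

    # Normalize.
    z = post_buy + post_not_buy
    print(post_buy / z, post_not_buy / z)                 # 4/5 1/5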
2.4.1.2. Binomial and beta distributions

The probabilistic model of translating buy looks like this: every time an English sentence with buy is translated into German, kaufen is added to the translation with probability p(kaufen|buy), otherwise it is not. The value of p(kaufen|buy) is fixed but unknown; our only clue is that out of the 13 sentences with buy we have seen, 8 contain kaufen. In the previous example, we used the following MLE for the data likelihood:

$$p(\text{kaufen}|\text{buy}) = \frac{8}{13}, \qquad p(\neg\text{kaufen}|\text{buy}) = \frac{5}{13}$$

Intuitively, it should be clear that the underlying probability is unlikely to be exactly 8/13. While the data (8 kaufen) is most likely given a probability of 8/13, it is almost as likely if the probability is slightly lower or slightly higher. We can use Bayes' theorem to model this uncertainty. The hypothesis h in this case is a value x ∈ [0, 1] representing the underlying probability p(kaufen|buy). The data d contains the information that we have observed a = 8 cases of buy occurring with kaufen, and b = 5 cases without kaufen.
[Figure: three density plots over x ∈ [0, 1], labeled (a) Prior: Beta(1, 1), (b) 8 of 13: Beta(9, 6), (c) 12 of 19: Beta(13, 8).]
Figure 2.4.: Prior and posterior distributions p(h|d) of the probability p(kaufen|buy), given a uniform prior and the data that a translations of buy contain kaufen, while b do not.
The data likelihood is described by a binomial distribution with success probability x, a successes (translations with kaufen) and b failures (translations without kaufen):

$$p(d|h) = \binom{a+b}{a} x^a (1-x)^b$$

In order not to favor any particular hypothesis, we use a uniform prior such that p(h) = 1 for all x ∈ [0, 1], and zero otherwise. By Bayes' theorem, we have

$$p(h|d) \propto p(h)\,p(d|h) = 1 \times \binom{a+b}{a} x^a (1-x)^b \propto x^a (1-x)^b$$

which is proportional to a Beta(a + 1, b + 1) distribution:

$$p(h|d) = \frac{x^a (1-x)^b}{B(a+1,\, b+1)}$$

where B(x, y) is the beta function

$$B(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)}$$

and the gamma function Γ(x) is a generalization of the factorial function:

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$$
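Computationally, the conjugate update is trivial: with a uniform Beta(1, 1) prior and the observed counts a = 8 and b = 5, the posterior is Beta(9, 6). A short sketch, using scipy.stats purely as one convenient Beta implementation:

    from scipy.stats import beta

    a, b = 8, 5                      # kaufen / no kaufen among the 13 buy verses
    posterior = beta(a + 1, b + 1)   # Beta(9, 6)

    print(posterior.mean())          # 0.6, slightly below the MLE 8/13 ≈ 0.615
    print(posterior.interval(0.95))  # roughly (0.35, 0.82): much uncertainty remains

The wide credible interval makes the point of Figure 2.4 quantitative: after only 13 observations, values of p(kaufen|buy) well away from the MLE remain plausible.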