Automatic Understanding of Unwritten Languages
Total Page:16
File Type:pdf, Size:1020Kb
Automatic Understanding of Unwritten Languages A thesis presented by Oliver Adams to School of Computing and Information Systems in total fulfillment of the requirements for the degree of Doctor of Philosophy The University of Melbourne Melbourne, Australia December 2017 Declaration This is to certify that: (i) the thesis comprises only my original work towards the PhD except where indi- cated in the Citations to Previously Published Work; (ii) due acknowledgement has been made in the text to all other material used; (iii) the thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices. Signed: Date: ©2017 - Oliver Adams All rights reserved. Thesis advisor(s) Author Steven Bird Oliver Adams Trevor Cohn Graham Neubig Automatic Understanding of Unwritten Languages Abstract Many of the world’s languages are falling out of use without a written record and minimal linguistic documentation. Language documentation is a slow process and there are an insufficient number of linguists working to ensure the world’s languages are documented before they die out. This thesis addresses automatic understanding of unwritten languages in order to perform tasks such as phonemic transcription and bilingual lexicon induction. The automation of such tasks promises to improve the leverage of field linguists and ultimately speed up the language documentation process. Modelling endangered languages is challenging due to the nature of the available data, which is typically not written text but limited quantities of recorded speech. Manually annotated information in the form of lexicons and grammars is typically also limited. Since the languages are spoken, the most efficient way of sourcing data is to collect speech in the language. Most speakers of endangered languages are bilin- gual or multilingual, so acquiring spoken translations works to the strength of the speakers. Key approaches described in this thesis make use of bilingual data, in par- ticular translated speech, which consists of segments of endangered language speech paired with translations in a larger language. Such data is important for relating the source language speech with a larger language. Additionally, the application of monolingual phoneme transcription is also explored, since it has direct applicability in more traditional phonemic transcription workflows. The overarching question is this: what can be automatically learnt about the languages with the data we have available, and how can this help automate language documentation? Abstract iv We first consider translation modelling of accurate phoneme transcriptions. This assumption allows us to investigate the feasibility of phoneme–word translation and the effectiveness of inferring bilingual lexical items from such data in isolation from confounding acoustic factors. A second investigation explores how bilingual lexicons can be used to improve language models, which are crucial components of speech recognition and machine translation systems. In a third set of experiments we re- move the assumption of accurate transcriptions and investigate operating in the face of acoustic uncertainty. Experiments in this space demonstrate that translated speech can improve automatic phoneme transcription even without a prior translation model. Finally, we make a step towards further generalisability, exploring acoustic modelling in resource-scarce environments without a lexicon or language model. In particular, we assess the use of automatic phoneme and tone transcription on Yongning Na, a threatened tonal language spoken in south-west China. Beyond quantitative investi- gation, we report on the use of this method in linguistic documentation of Na. Its effectiveness has led to its incorporation into the language documentation workflow for Na. Citations to Previously Published Work The work in this thesis is entirely original except where explicitly stated otherwise. Large portions of Chapter 3 have appeared in the following paper: Oliver Adams, Graham Neubig, Trevor Cohn, Steven Bird (2015) Induc- ing bilingual lexicons from small quantities of sentence-aligned phonemic transcriptions, in Proceedings of the 12th International Workshop on Spo- ken Language Technologies (IWSLT), Da Nang, Vietnam. pp. 248–255. Large portions of Chapter 4 have appeared in the following paper: Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, Trevor Cohn (2017) Cross-lingual word embeddings for low-resource language modeling, in Proceedings of the 15th Conference of the European Chap- ter of the Association for Computational Linguistics (EACL), Valencia, Spain. Large portions of Chapter 5 have appeared in the following papers: Oliver Adams, Graham Neubig, Trevor Cohn, Steven Bird (2016) Learn- ing a translation model from word lattices, in 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), San Francisco, California, USA. Oliver Adams, Graham Neubig, Trevor Cohn, Steven Bird, Quoc Do Truong, Satoshi Nakamura (2016) Learning a lexicon and translation model from phoneme lattices, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, Texas, USA. pp. 2377–2382. Large portions of Chapter 6 have appeared in the following papers: Abstract vi Oliver Adams, Trevor Cohn, Graham Neubig, Alexis Michaud (2017) Phonemic transcription of low-resource tonal languages, in Proceedings of the Australasian Language Technology Association Workshop 2017 (ALTA), Brisbane, Australia. pp. 53–60. Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird, Alexis Michaud (2018) Evaluating phonemic transcription of low-resource tonal languages for language documentation in Proceedings of LREC 2018: 11th edition of the Language Resources and Evaluation Conference, Miyazaki, Japan. (To appear). Contents Title Page .................................... i Abstract ..................................... iii Citations to Previously Published Work ................... v Table of Contents ................................ vii List of Figures ................................. x List of Tables .................................. xii Acknowledgements ............................... xiii 1 Introduction 1 1.1 Motivation ................................. 2 1.1.1 Language Extinction and Documentation ........... 2 1.1.2 The Changing Nature of Language Documentation ...... 3 1.1.3 A Different Kind of Data .................... 5 1.1.4 Why Pursue Automatic Understanding of Unwritten Languages? 8 1.2 Aim and Scope .............................. 10 1.2.1 Research Questions ........................ 11 1.3 Overview of the Contributions ...................... 12 2 Background 15 2.1 Data Acquisition ............................. 16 2.2 Modelling Translation .......................... 17 2.2.1 Traditional Word Alignment ................... 18 2.2.2 What is the Right Granularity for Translation Units? ..... 19 2.2.3 Data Sparsity in Machine Translation ............. 23 2.2.4 Unsupervised Word Segmentation ................ 25 2.2.5 Bilingual Lexicon Induction ................... 26 2.2.6 Cross-Lingual Word Embeddings ................ 28 2.2.7 Language Modelling ....................... 30 2.3 Automatic Speech Recognition ...................... 33 2.3.1 Traditional Speech Recognition ................. 33 2.3.2 End-to-End Speech Recognition ................. 35 2.3.3 Unsupervised Speech Modelling ................. 36 vii Contents viii 2.3.4 Low-Resource and Multilingual Speech Recognition ...... 37 2.3.5 Tonal Speech Recognition .................... 40 2.4 Machine Translation Meets Speech Recognition ............ 41 2.4.1 Speech-to-Speech Translation .................. 41 2.4.2 Computer-Aided Translation .................. 42 2.4.3 Translation Modelling of Speech ................. 43 3 Translation Modelling of Phonemes 46 3.1 Introduction ................................ 46 3.2 Phoneme-Based Machine Translation .................. 47 3.2.1 Alignment Approaches ...................... 49 3.2.2 Experimental Setup ....................... 52 3.2.3 Results and Discussion ...................... 53 3.3 Phoneme-Based Bilingual Lexicon Induction .............. 55 3.3.1 Translation Models ........................ 57 3.3.2 Experimental Setup ....................... 59 3.3.3 Quantitative Evaluation ..................... 62 3.3.4 Qualitative Evaluation ...................... 67 3.4 Discussion ................................. 69 3.4.1 Evaluation Issues ......................... 70 3.4.2 Reconsidering the Value of Bilingual Lexicon Induction .... 71 4 Cross-Lingual Low-Resource Language Modelling 73 4.1 Introduction ................................ 73 4.2 Resilience of Cross-Lingual Word Embeddings ............. 76 4.2.1 Experimental Setup ....................... 77 4.2.2 Results .............................. 78 4.3 Pre-training Language Models ...................... 79 4.3.1 Experimental Setup ....................... 80 4.3.2 English Results .......................... 81 4.3.3 Other Target Languages ..................... 83 4.4 First Steps in an Under-Resourced Language .............. 84 4.4.1 Experimental Setup ....................... 85 4.4.2 Results and Discussion ...................... 87 4.5 Discussion ................................. 90 4.5.1 Future Work on Na Language Modelling ............ 91 4.5.2 Beyond Na ............................ 92 5 Harnessing Translations for Improved Phoneme Transcription 94 5.1 Introduction ................................ 94 5.2 Phrase Alignment