Automated Adaptation Between Kiranti Languages
Total Page:16
File Type:pdf, Size:1020Kb
University of Montana ScholarWorks at University of Montana Graduate Student Theses, Dissertations, & Professional Papers Graduate School 2006 Automated Adaptation Between Kiranti Languages Daniel Richard McCloy The University of Montana Follow this and additional works at: https://scholarworks.umt.edu/etd Let us know how access to this document benefits ou.y Recommended Citation McCloy, Daniel Richard, "Automated Adaptation Between Kiranti Languages" (2006). Graduate Student Theses, Dissertations, & Professional Papers. 84. https://scholarworks.umt.edu/etd/84 This Thesis is brought to you for free and open access by the Graduate School at ScholarWorks at University of Montana. It has been accepted for inclusion in Graduate Student Theses, Dissertations, & Professional Papers by an authorized administrator of ScholarWorks at University of Montana. For more information, please contact [email protected]. AUTOMATED ADAPTATION BETWEEN KIRANTI LANGUAGES By Daniel Richard McCloy B.S. Computer Science and Engineering, LeTourneau University, Longview, TX, 1991 Thesis presented in partial fulfillment of the requirements for the degree of Master of Arts in Linguistics The University of Montana Missoula, MT Autumn 2006 Approved by: Dr. David A. Strobel, Dean Graduate School Dr. Anthony Mattina, Chair Linguistics Program Dr. Mizuki Miyashita Linguistics Program Dr Nancy Mattina The Writing Center Dr. David E. Watters Research Centre for Linguistic Typology McCloy, Daniel, M.A., December 2006 Linguistics Automated Adaptation Between Kiranti Languages Chairperson: Dr. Anthony Mattina Minority language communities that are seeking to develop their language may be hampered by a lack of vernacular materials. Large volumes of such materials may be available in a related language. Automated adaptation holds potential to enable these large volumes of materials to be efficiently translated into the resource-scarce language. I describe a project to assess the feasibility of automatically adapting text between Limbu and Yamphu, two languages in Nepal’s Kiranti grouping. The approaches taken—essentially a transfer-based system partially hybridized with a Kiranti-specific interlingua—are placed in the context of machine translation efforts world-wide. A key principle embodied in this strategy is that adaptation can transcend the structural obstacles by taking advantage of functional commonalities. That is, what matters most for successful adaptation is that the languages “care about the same kinds of things.” I examine various typological phenomena of these languages to assess this degree of functional commonality. I look at the types of features marked on the finite verb, case-marking systems, the encoding of vertical deixis, object-incorporated verbs, and nominalization issues. As this Kiranti adaptation goal involves adaptation into multiple target languages, I also present a disambiguation strategy that ensures that the manual disambiguation performed for one target language is fed back into the system, such that the same disambiguation will not need to be performed again for other target languages. ii ACKNOWLEDGEMENTS It is a blessing to be able to work with people that you really enjoy, and in this regard I have been blessed at every turn. This project could have never happened without the months of work poured into it by Marius Doornenbal and Jeffrey Webster, who have tirelessly dug to find answers and solutions, and coaxed or prodded me along when I needed it. –Thank you! I cannot imagine how this thesis would have come to be but for Dr. Tony Mattina’s guidance and his time invested in me over these years. I was already indebted to him for setting my wife on the path where I consequently met her. I am now doubly indebted. –Thank you! It has a been a privilege, too, to receive much-needed help from such a master of Himalayan typology as Dr. David Watters. Not that I’ve grasped a fraction yet of what’s really going on in these languages, but what little of it I’ve begun to understand is owing in great measure to him. –Thank you! Nowhere else will it be truer to emphasize that all errors in this paper are mine alone. Dr. Mizuki Miyashita and Dr. Nancy Mattina have not only shouldered the task of serving on my committee, but have also invested time in my development and training. –Thank you! Space does not permit me to name them all, but there have been many, many others, too, who have given of themselves to bring me to this stage. I am so grateful. The greatest enabling and encouragement has come from my wife Adina, who has set aside her own desires time and time again to love me into reaching this milestone. –Thank you! How blessed I am indeed. Soli Deo Gloria. Dan McCloy Missoula, Montana, November 2006 iii CONTENTS ACKNOWLEDGEMENTS ............................................................................ III CONTENTS .....................................................................................................IV LIST OF FIGURES....................................................................................... VII 1. INTRODUCTION ...................................................................................... 1 2. OVERVIEW OF MACHINE TRANSLATION ....................................... 5 2.1. The Capabilities of Machine Translation ............................................ 5 2.2. Types of MT Strategies......................................................................... 8 2.2.1. Direct Translation....................................................................................................... 8 2.2.2. Translation via Interlingua...................................................................................... 10 2.2.3. Transfer System ........................................................................................................ 13 2.2.4. Corpus-based Methods .............................................................................................. 14 Example-based Translation...........................................................................................................14 Statistical Machine Translation (SMT) ........................................................................................15 2.3. Recent MT innovations in other resource-scarce situations............. 19 2.4. History of MT in South Asia............................................................... 23 3. OVERVIEW OF KIRANTI...................................................................... 28 3.1. Rationale for automated adaptation.................................................. 28 3.2. Factors Making Machine Translation More Feasible ....................... 29 3.3. The Multi-Target Strategy ................................................................. 30 3.4. Kiranti Languages.............................................................................. 30 3.5. Selection of Languages ....................................................................... 31 3.5.1. Source Language Selection ....................................................................................... 31 3.5.2. Target Language Selection ....................................................................................... 32 3.5.3. Nature of Collaboration ............................................................................................ 33 3.6. Language background information.................................................... 34 3.6.1. Limbu......................................................................................................................... 34 Language Name .............................................................................................................................34 Speakers .........................................................................................................................................34 Linguistic Research........................................................................................................................34 3.6.2. Yamphu...................................................................................................................... 35 Language Name .............................................................................................................................35 Speakers .........................................................................................................................................35 4. ISSUES OF WORD FREQUENCY........................................................ 36 4.1. Overview of the Corpora..................................................................... 37 4.2. High-Frequency Words....................................................................... 40 4.3. Low-Frequency Words........................................................................ 43 4.4. Conclusions Regarding Word Frequency........................................... 45 5. ISSUES OF STRUCTURAL NON-CORRESPONDENCE ................ 46 5.1. An Initial Structural Comparison...................................................... 46 5.2. An Overview of Verbal Affixes ........................................................... 47 iv 5.3. Observations on Morpheme Correspondence .................................... 51 5.3.1. Inverse/Direct Marking ............................................................................................ 51 5.3.2. Missing Correspondents............................................................................................ 55 5.3.3. Inconsistently-matched Correspondents .................................................................