City University of New York (CUNY) CUNY Academic Works

Dissertations, Theses, and Capstone Projects

10-2014

Systematic Comparison Of Cross-Lingual Projection Techniques For Low-Density NLP Under Strict Resource Constraints

Joshua Waxman
Graduate Center, City University of New York


More information about this work at: https://academicworks.cuny.edu/gc_etds/395

This work is made publicly available by the City University of New York (CUNY). Contact: [email protected]

SYSTEMATIC COMPARISON OF CROSS-LINGUAL PROJECTION TECHNIQUES FOR LOW-DENSITY NLP UNDER STRICT RESOURCE CONSTRAINTS

By

JOSHUA WAXMAN

A dissertation submitted to the Graduate Faculty in Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy

The City University of New York

2014


© 2014

JOSHUA WAXMAN

All Rights Reserved


This manuscript has been read and accepted for the Graduate Faculty in Computer Science in satisfaction of the dissertation requirement for the degree of Doctor of Philosophy.

Matt Huenerfauth

Date Chair of Examining Committee

Robert Haralick

Date Executive Officer

Rebecca Passonneau, Columbia University

Heng Ji, Queens College, CUNY

Virginia Teller, Hunter College, CUNY

Supervisory Committee

THE CITY UNIVERSITY OF NEW YORK


Abstract

SYSTEMATIC COMPARISON OF CROSS-LINGUAL PROJECTION TECHNIQUES FOR LOW-DENSITY NLP UNDER STRICT RESOURCE CONSTRAINTS

by

JOSHUA WAXMAN

Advisor: Professor Matt Huenerfauth

The field of low-density NLP is often approached from an engineering perspective, and evaluations are typically haphazard, considering different architectures, different languages, and different available resources, without a systematic comparison. The resulting architectures are then tested only on the particular corpus and language for which the approach was designed. This makes it difficult to evaluate which approach is truly the “best,” or which approaches are best for a given language.

In this dissertation, several state-of-the-art architectures and approaches to low-density language Part-Of-Speech Tagging are reimplemented; all of these techniques exploit a relationship between a high-density (HD) language and a low-density (LD) language. As a novel contribution, a testbed is created using a representative sample of seven (HD–LD) language pairs, all drawn from the same massively parallel corpus, Europarl, and selected for their particular linguistic features. With this testbed in place, never-before-possible comparisons are conducted, to evaluate which broad approach performs best for particular language pairs, and to investigate whether particular language features should suggest a particular NLP approach.

A survey of the field suggested some unexplored approaches with the potential to yield better performance, be quicker to implement, and require less intensive linguistic resources.

Under strict resource limitations, which are typical for low-density NLP environments, these characteristics are important. The approaches investigated in this dissertation are each a form of language-ifier, which modifies an LD-corpus to be more like an HD-corpus, or alternatively, modifies an HD-corpus to be more like an LD-corpus, prior to supervised training. Four variations of language-ifier designs, each relying on relatively few linguistic resources, have been implemented and evaluated in this dissertation: lexical replacement, affix replacement, cognate replacement, and exemplar replacement. Based on linguistic properties of the languages drawn from the Europarl corpus, predictions were made as to which prior and novel approaches would be most effective for languages with specific linguistic properties, and these predictions were evaluated through systematic evaluations with the testbed of languages. The results of this dissertation serve as guidance for future researchers who must select an appropriate cross-lingual projection approach (and a high-density language from which to project) for a given low-density language.

Finally, all the languages drawn from the Europarl corpus are actually HD, but for the sake of the evaluation testbed in this dissertation, certain languages are treated as if they were LD (ignoring any available HD resources). In order to evaluate how various approaches perform on an actual LD language, a case study was conducted in which part-of-speech taggers were implemented for Tajiki, harnessing linguistic resources from a related HD language, Farsi, using all of the prior and novel approaches investigated in this dissertation. Insights from this case study were documented so that future researchers can gain insight into what their experience might be in implementing NLP tools for an LD language given the strict resource limitations considered in this dissertation.


Acknowledgements

Completing a dissertation is hard, especially when balancing it with a full-time job and childcare for two wonderful but exhausting kids, but, thank God, I have finally done it!

There are a few people I would like to single out by name for the help and support they have provided through this long journey.

Foremost, I'd like to thank my advisor, Dr. Matt Huenerfauth, for his advice and direction through this work. I would also like to thank my committee members, Dr. Heng Ji, Dr. Rebecca Passonneau, and Dr. Virginia Teller, for their suggestions and feedback in the proposal stage, which helped clarify and focus the direction and form of this dissertation research. I would like to thank Farzona Zehni, an independent linguist, whose advice regarding Farsi and Tajik and whose practical work on assembling the necessary linguistic resources proved invaluable. Thanks also to my original thesis advisor, Dr. Noemie Elhadad, of Columbia University, from back when the primary focus was on building transliteration and cognation models. And thanks to Dr. William Sakas, of Hunter College, who was a committee member up until the literature survey stage, for his encouragement and suggestions.

Also, many thanks go to Michael Gordon and Jonathan Waxman for their assistance in manually aligning tagger output with the gold standard, in instances when language-ification or language-specific tokenization caused misalignments.


I would also like to express my profound gratitude to my close and extended family for their support, emotional and otherwise, through this long slog. Thanks go to my wife, Dr. Rachel Waxman, for encouraging me, putting up with me, and distracting the kids as I spent hours at the computer, even as she had her own dissertation to worry about. Thanks to Meir and Eitan, for many things, including letting me use the computer that otherwise could have been productively applied to such tasks as Bloons Tower Defense 4, Minecraft, Starfall, and Peep and the Big Wide World. And of course, thanks go out to my parents, Rabbi Dr. Jerry Waxman and Lorri Waxman, and to my in-laws, Rabbi Eduard and Sandra Mittelman, for all they have done for us. You are all partners in this accomplishment.

Next up, trying to resolve the cold-blooded / warm-blooded dinosaur debate!


Table of Contents

Abstract
Acknowledgements
Table of Contents
Table of Figures
Table of Tables
Chapter 1: Introduction
  Structure of this Dissertation
Chapter 2: Existing Approaches to Low-Density NLP
  2.1 An Overview of Low-Density NLP Approaches
  2.2 Recent Work
  2.3 Utilizing High-Density Language Resources on Low-Density Data
    2.3.1 POS Tagging of Dialectal Arabic: A Minimally Supervised Approach
    2.3.2 Applying Contextual Models from Related Languages
    2.3.3 Projecting Lexical Models Via Cognates
  2.4 A Quick Summary of Low-Density NLP Approaches
Chapter 3: Overview of Work
  3.1 Methodological Details
    3.1.1 Corpus and Corpus Construction
    3.1.2 Multiple POS sets
Chapter 4: Implementation
  4.1 Reimplementation of Previous Methods
  4.2 New Approaches
    4.2.1 Overview
    4.2.2 Language-ifier
    4.2.4.1 General Scheme of Work on Novel Approaches
    4.2.4.3 HD-ifying an LD input text
    4.2.4.2 LD-ifying an HD Corpus
Chapter 5: Language Discussion
  5.1 Language Families
    Prediction [P1a, b]
    Prediction [P2a, b, c, d]
  5.2 Constituent Word Order
    Prediction [P3a, b, c, d]
  5.3 Adjective premodifiers and postmodifiers
    Prediction [P4a, b]


  5.4 The Presence of Definite and Indefinite Articles
    Prediction [P5a, b, c, d]
  5.5 The Granularity of Definite and Indefinite Articles
    Prediction [P6a, b, c]
  5.6 PRO-Dropness
    Prediction [P7]
  5.7 Richness of Inflection
    Prediction [P8a, b, c]
    Prediction [P9a, b, c]
  5.8 Language Pair Selections
Chapter 6: Results Discussion
  6.1 Results Tables
  6.2 Revisiting the Predictions
    6.2.1 Language Families Predictions
      Prediction [P1a, b]
      Bob using Prediction 1
      Prediction [P2a, b, c, d]
      Bob using Prediction 2
    6.2.2 Constituent Word Order Predictions
      Prediction [P3a, b, c, d]
      Bob using Prediction 3
    6.2.3 Adjective premodifiers and postmodifiers Predictions
      Prediction [P4a, b]
      Bob using Prediction 4
    6.2.4 Article Presence Predictions
      Prediction [P5a, b, c, d]
      Bob using Prediction 5
    6.2.5 Article Granularity Predictions
      Prediction [P6a, b, c]
      Bob using Prediction 6
    6.2.6 PRO-Dropness Predictions
      Prediction [P7]
      Bob using Prediction 7
    6.2.7 Richness of Inflection
      Prediction [P8a, b, c]
      Bob using Prediction 8
      Prediction [P9a, b, c]
      Bob using Prediction 9
  6.3 HD-ification vs. LD-ification
  6.4 Frequent Word Replacement – How many words?
Chapter 7: Projecting from Farsi to Tajiki: A Case Study
  7.1 Construction of Tajiki linguistic resources
    Construction of Farsi TreeTagger


  7.2 Farsi and Tajiki cognation detection
  7.3 Farsi and Tajiki affix dictionaries
  7.4 Results of experiments on (Tajiki, Farsi)
  7.5 Predictions
    7.5.1 Prediction [P1a, b]
    7.5.2 Prediction [P2a, b, c]
    7.5.3 Prediction [P3a, b, c, d]
    7.5.4 Prediction [P4a, b]
    7.5.5 Prediction [P5a, b, c, d]
    7.5.6 Prediction [P6a, b, c]
    7.5.7 Prediction [P7]
    7.5.8 Prediction [P8, P9]
  7.6 Frequent word profile for Farsi and Tajiki
Chapter 8: Conclusion
  8.1 Practical advice
  8.2 Future work
  8.3 Contributions
  8.4 Summation
Appendix A
Appendix B
Bibliography


Table of Figures

Figure 1: An overview of approaches from the literature
Figure 2: In each case, the arrow points to the LD language and the arrow length indicates linguistic distance (roughly approximated by this author) between the languages
Figure 3: A comparison of HD-ification to Hana’s LD-ification
Figure 4: HD-ify vs. LD-ify approaches
Figure 5: The general process of HD-ification
Figure 6: Generating the HD-ify frequent word dictionary
Figure 7: L1 HD-ify process
Figure 8: Generating the HD-ify affix dictionaries
Figure 9: L2 HD-ify process
Figure 10: L3 HD-ify process
Figure 11: Generating the cognate dictionaries
Figure 12: L4 HD-ify process
Figure 13: The general process of LD-ification
Figure 14: Dutch is closer to German
Figure 15: Windows Interface for TreeTagger
Figure 16: Relationship between Europarl languages
Figure 17: German and English premodifier example
Figure 18: English premodifier and French postmodifier example
Figure 19: English and German premodifiers in the actual corpus
Figure 20: French postmodifiers in the actual corpus
Figure 21: Projecting articles from a high granularity language to a low granularity language


Figure 22: Projecting articles from a low granularity language to a high granularity language
Figure 23: Four cases of language-ification strategies and degrees of inflection
Figure 24: Projection of a poorly inflected language to a richly inflected language
Figure 25: Projection of a richly inflected language to a poorly inflected language
Figure 26: Once again, four cases of language-ification strategies and degrees of inflection
Figure 27: In each case, the arrow points to the LD language and the arrow length indicates linguistic distance (roughly approximated by this author) between the languages
Figure 28: Prediction 1 results
Figure 29: Prediction 1 results, using LD-ify baseline
Figure 30: Prediction 2 results
Figure 31: L4 HD-ify results
Figure 32: Prediction 3 results
Figure 33: English and German premodifiers in a sentence
Figure 34: English premodifiers, French postmodifiers in a sentence
Figure 35: Prediction 5 results
Figure 36: Prediction 5 results, with 50 most frequent word replacement
Figure 37: Projection of articles, with high and low lexical granularity
Figure 38: Prediction 7 results
Figure 39: Prediction 8 results
Figure 40: Prediction 9 results
Figure 41: HD-ify vs. LD-ify results
Figure 42: L4 HD-ify vs. LD-ify
Figure 43: Frequent word replacement, 50...250
Figure 44: Frequent word profile for Dutch
Figure 45: Frequent word profile for German
Figure 46: Frequent word distribution for Dutch
Figure 47: Language tree for Farsi and Tajiki


Figure 48: (Farsi, Tajiki) deltas from baseline
Figure 49: Tajiki results, prediction 1
Figure 50: Tajiki results, prediction 2 – H2
Figure 51: Tajiki results, prediction 2 – L4 HD-ify against (German, Finnish)
Figure 52: Tajiki results, prediction 2 – L4 HD-ify against (German, Dutch) and (Dutch, German)
Figure 53: Tajiki results, prediction 3
Figure 54: Tajiki results, prediction 4
Figure 55: Tajiki results, prediction 5
Figure 56: Tajiki results, prediction 7
Figure 57: Tajiki frequent word profile (left) and Farsi frequent word profile (right)
Figure 58: Tajiki frequent word replacement, 50...250
Figure 59: Frequent word profile for Dutch
Figure 60: Frequent word distribution for Dutch
Figure 61: Frequent word profile for German
Figure 62: Frequent word distribution for German
Figure 63: Frequent word profile for Finnish
Figure 64: Frequent word distribution for Finnish
Figure 65: Frequent word profile for Estonian
Figure 66: Frequent word distribution for Estonian
Figure 67: Frequent word profile for English
Figure 68: Frequent word distribution for English
Figure 69: Frequent word profile for French
Figure 70: Frequent word distribution for French


Table of Tables

Table 1: Constituent Word Order of Languages in Europarl
Table 2: Adjectives as premodifiers or postmodifiers
Table 3: Presence and Granularity of Articles in Europarl languages
Table 4: PRO-drop and Inflection in Europarl languages
Table 5: Results, traditional approaches
Table 6: HD-ification results
Table 7: LD-ification results
Table 8: Tajiki Results, Traditional Approaches
Table 9: (Farsi, Tajiki) results, HD-ification, compared with other close languages
Table 10: Tajiki Results, HD-ification
Table 11: Tajiki results, LD-ification


Chapter 1: Introduction

The state-of-the-art of NLP is quite good – for some languages. Due to the ability to harness linguistic resources, many NLP tasks, such as Machine Translation and Named Entity Recognition, are possible, and perform at high levels of accuracy – at least for specific domains. This success is due in large part to the wealth of linguistic resources available for these languages. For example, large Treebanks of parsed sentences exist for English and German, which can be used to statistically train a parsing model. There are large corpora tagged for part-of-speech and for named-entities. Similarly, for machine translation, there are large bilingual corpora aligned by sentence or word, which are used to train statistical translation models. Likewise, linguists have spent many hours assembling WordNets for languages such as English and German.

In contrast, in this dissertation, we consider approaches to natural language processing of low-density languages, with a specific focus on approaches of leveraging linguistic resources from other, higher-density languages. A high-density (HD) language is a natural language for which there exists a wealth of electronically available linguistic resources. A low-density (LD) language is one that lacks these resources. Since more popular languages are used by more people who are generating data, they are more likely than less popular languages to have large electronically available corpora. Popular languages and dialects are more likely to have been the focus of attention by more researchers and therefore to have manually created linguistic resources such as Treebanks and WordNets.

The common, state-of-the-art approach to a variety of natural language tasks is statistical, and statistical methods rely on the existence of large corpora. Thus, these methods are more suited for high-density languages; for low-density languages, we cannot readily use such methods. For low-density languages, a variety of paradigms are used to develop and apply linguistic resources, depending on the amount of data available. These paradigms include:

• hand-coded rule-based solutions
• supervised learning, unsupervised learning, and active learning
• extracting the maximal data from limited data
• using tools of high-density languages on close low-density ones
• projecting linguistic resources from high-density to low-density languages

This dissertation will specifically focus on the final approach listed above, in which NLP resources for an HD language are “projected” to an LD language.

Structure of this Dissertation

Chapter 2 will survey several state-of-the-art architectures and approaches to low-density NLP. A challenge identified in this survey is that the evaluation of systems typically lacks a systematic comparison; each published paper considers different architectures, different languages, and different available resources. This haphazard nature of evaluation makes it difficult to evaluate which approach is truly the “best,” or which approaches are best for a given language.


To overcome this limitation, Chapter 3 will discuss how, in this dissertation research, part-of-speech taggers for several languages are re-implemented; all of the techniques employ a cross-lingual projection from a high-density (HD) language to a low-density (LD) language. In this dissertation research, a testbed is created using a representative sample of seven (HD–LD) language pairs, drawn from Europarl; using this testbed, comparisons are conducted to evaluate which approach performs best for particular language pairs with particular linguistic features.

Chapter 4 will identify and describe a set of novel cross-lingual projection approaches that fall into the category of “language-ification” techniques, in which an LD-corpus is pre-processed to appear more like an HD-corpus, or vice versa, prior to supervised training. Under the set of strict resource limitations that are considered in this dissertation, these techniques have several desirable characteristics, which are discussed in that chapter.
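As a concrete illustration of the lexical-replacement flavor of language-ification, the following minimal Python sketch swaps LD tokens found in a small frequent-word dictionary for their HD equivalents, so that an HD-trained tagger sees familiar word forms. The Dutch-to-German entries and the pre-tokenized input are illustrative assumptions, not the dissertation's actual resources.

```python
# Minimal sketch of "HD-ification" by frequent-word replacement: LD
# tokens that appear in a small bilingual dictionary of frequent words
# are swapped for their HD equivalents; all other tokens pass through
# unchanged. The dictionary entries below are hypothetical.
FREQ_DICT = {  # hypothetical Dutch -> German frequent-word entries
    "de": "die",
    "het": "das",
    "en": "und",
    "van": "von",
}

def hd_ify(ld_tokens, freq_dict):
    """Replace known LD tokens with HD equivalents; leave the rest alone."""
    return [freq_dict.get(tok.lower(), tok) for tok in ld_tokens]

print(hd_ify(["de", "regering", "van", "Nederland"], FREQ_DICT))
```

An affix- or cognate-replacement language-ifier would follow the same shape, but with a rewriting rule applied to word endings or to cognate candidates instead of a whole-word lookup.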

To make this dissertation research useful for future researchers, it is not sufficient to merely test the prior and novel approaches on a haphazard selection of languages from Europarl. Instead, Chapter 5 identifies sets of linguistic properties that specific HD–LD language pairs may possess, motivates several predictions as to how the cross-lingual approaches would perform for pairs of languages with these properties, and identifies pairs of languages in the Europarl testbed that can be used to evaluate each prediction. The aim is that future researchers who must implement an NLP tool for a specific LD language could consider these linguistic properties when selecting a cross-lingual projection approach and an appropriate HD language from which to project.

Chapter 6 describes the implementation of the testbed, the re-implementation of prior approaches and the novel language-ification techniques, and the results of the systematic comparisons. This chapter also discusses whether each prediction from Chapter 5 was supported by the results of the evaluation.

Of course, in order to have a gold standard against which we could evaluate our results in Chapter 6, we needed all of the languages in the testbed to actually be HD languages (so that it would be possible to rapidly obtain accurate part-of-speech tagging results for each language for evaluation purposes). In order to investigate what the experience would be for a researcher who must implement an NLP tool for an actual low-density language, Chapter 7 presents a case study that was conducted, in which part-of-speech taggers were implemented for Tajik, making use of linguistic resources from a related HD language. The chapter documents this process so that future researchers can better anticipate the challenges that may arise in implementing NLP tools for an LD language, given the strict resource limitations considered in this dissertation.

Finally, Chapter 8 presents conclusions and future work.


Chapter 2: Existing Approaches to Low-Density NLP

2.1 An Overview of Low-Density NLP Approaches

A low-density language is one that lacks an abundance of linguistic resources, whether these be trained linguists or computational linguists, machine-accessible dictionaries, parallel aligned bilingual corpora, tagged corpora, etc. Such a lack is problematic, given that the statistical approach that is so popular and effective today relies on the existence of large corpora. Trained linguists could assemble such resources, but that would be expensive and time-consuming, and such manpower might not be readily available for low-density languages. The alternative approach is for computational linguists to manually craft NLP systems. This is a costly and time-consuming process, and the results are in many cases not as good as those of statistically trained systems, which would be possible if sufficiently large corpora were available.

There are several different approaches to overcoming this lack of linguistic resources. One approach is to develop ways to aid the creation of manually-created tools in shorter time. Thus, Loftsson (2007) develops a linguistic rule-based method for morphological tagging of Icelandic based on the Constraint Grammar framework, and develops it in a comparatively short time. He does this by disambiguating not only with local rules, which restrict tags based on properties of words in the surrounding context, but also with “global” heuristics, which disambiguate based on the syntax of non-local context. The accuracy of this tagger is comparable to a state-of-the-art statistically trained model. Thus, such an approach can speed development time of manually crafted tools for low-density languages.

Bender and Flickinger (2005) develop a framework to quickly and simply build up precise, broad-coverage grammars. They create a language-independent Core Grammar providing core functionality common to all language types. Then, they develop modules to handle specific varying behavior such as word order, types of sentential negation, and the lexicon. They develop a web-based interface for selecting the specifics of the language in question (e.g. to specify that a language is SOV). Computational linguists can then modify the generated module to further conform to the particulars of the language.
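The local, tag-restricting rules in Loftsson's Constraint Grammar approach can be caricatured in a few lines: ambiguous words start with a set of candidate tags, and context rules discard candidates rather than pick winners. The rule, words, and tag labels below are invented purely for illustration.

```python
# Sketch of a Constraint-Grammar-style local rule: prune candidate
# tags for a word based on a neighboring word, narrowing ambiguity
# without any statistics. The rule and tagset are hypothetical.
def apply_rule(candidates, prev_word):
    # Illustrative rule: after a determiner-like word, discard VERB.
    if prev_word in {"the", "a"}:
        return [t for t in candidates if t != "VERB"]
    return candidates

print(apply_rule(["NOUN", "VERB"], prev_word="the"))
```

A full Constraint Grammar system chains many such rules, plus the non-local “global” heuristics Loftsson describes, until each word is left with (ideally) a single tag.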

Another approach is to automatically extend the coverage of existing, though limited, manually-created deep-language resources, using minimally supervised learning. Thus, Baldwin (2005) takes a corpus of lexical items classified into lexical types, discovers morphological, syntactic, and/or ontological features of items already in the resource, and applies the same classification to lexical items which have similar features.

Yet another approach is to explore and enhance the use of alternative language models to better cope with scarcity of data. Within the field of machine translation, language models which are simple n-gram models can suffer from data scarcity, but an alternative language model proposed by Bengio et al. (2003) makes use of distributed word representations, in which words are represented as real-valued vectors in a high-dimensional feature space, and trains a neural probabilistic language model (NPLM) to learn the distributed representation for every word as well as the n-gram model over these distributed representations of words. In the past, these models had been difficult to apply to NLP tasks involving large vocabularies, but recently, Vaswani et al. (2013) applied more recently developed techniques to make the use of such a language model practical, and applied it successfully to parallel bitexts in Europarl. While this approach indeed looks promising, it is within the domain of machine translation rather than part-of-speech tagging. Further, while this approach might help with data scarcity, surely an issue for low-density languages, it is not clear that it would help with words which are entirely out of vocabulary. And finally, the approach seems orthogonal to the path we chose to explore: it is an improvement of the language model rather than of strategies for projection of HD resources, and perhaps one could even swap out one model for another and still apply all of the projection strategies we explore in this dissertation.
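The intuition behind distributed word representations can be shown with a toy sketch: each word becomes a dense vector, similar words receive similar vectors, and that nearness is what lets an NPLM generalize to n-grams it never saw. The hand-picked two-dimensional vectors below are an assumption for illustration; a real model learns them during training.

```python
# Toy illustration of distributed word representations. Each word is
# a dense vector; semantically similar words are placed near each
# other. The vectors here are invented by hand, not learned.
EMB = {
    "cat": (0.9, 0.1),
    "dog": (0.85, 0.15),  # deliberately close to "cat"
    "car": (0.1, 0.9),
}

def similarity(w1, w2):
    """Dot product of two word vectors: a crude nearness score."""
    a, b = EMB[w1], EMB[w2]
    return sum(x * y for x, y in zip(a, b))

print(similarity("cat", "dog") > similarity("cat", "car"))
```

Because "cat" and "dog" sit close together, a model that has seen "the cat sat" can assign a reasonable probability to "the dog sat" even if that trigram never occurred in training data.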

Still another approach is to engage in active learning, in which the model being trained participates in the learning by requesting the data that would teach it the most. Since manually tagging corpora is costly and time-consuming, such time and effort are best spent where they would do the most good. Thus, for example, Thompson et al. (1999) made use of certainty-based active learning for semantic parsing (mapping natural language questions to Prolog queries) and for information extraction. And Tang, Luo, and Roukos (2002) use certainty-based active learning to train a statistical syntactic parser.
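Certainty-based active learning amounts to a simple selection loop: score the unlabeled pool with the current model, and send the items the model is least sure about to the annotator. A minimal sketch, with made-up confidence scores standing in for a trained parser or tagger:

```python
# Certainty-based active learning, sketched: rank the unlabeled pool
# by model confidence and request human labels for the k least
# certain examples. The confidence scores below are illustrative
# stand-ins for a real probabilistic model.
def least_certain(unlabeled, confidence, k=2):
    """Return the k examples with the lowest model confidence."""
    return sorted(unlabeled, key=confidence)[:k]

# Hypothetical confidences, as if produced by the current model.
scores = {"sent1": 0.97, "sent2": 0.41, "sent3": 0.88, "sent4": 0.35}
to_annotate = least_certain(scores, confidence=scores.get)
print(to_annotate)
```

In a full loop, the newly annotated examples are added to the training set, the model is retrained, and the pool is rescored, so each round of annotation effort goes where it helps most.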

One other promising approach is projection of linguistic resources via a bridge formed by a parallel bilingual corpus. One side of this parallel word-aligned corpus is the HD language and the other is the LD language. Yarowsky, Ngai, and Wicentowski (2001) project linguistic resources by means of parallel corpora and then induce a noise-robust tagger from the projected data, for four different applications: a POS tagger, a Noun Phrase Bracketer, a Named Entity Tagger, and a Morphological Analyzer. To focus just on the POS tagger: the HD side of the corpus is tagged, and they project the tags to the LD side. There are different scenarios of word alignment, depending on whether a word or phrase maps to a word, phrase, or nothing, and they hand-crafted strategies to apply in each scenario. They developed similar clever approaches to project the other linguistic resources, showing that the approach can be applied broadly. This approach has been used successfully by Dien and Kiem (2003) to develop a POS tagger, Phrase-Chunker, Parser, and Word-Sense Disambiguator for Vietnamese, by directly projecting an automatically aligned Vietnamese-English corpus and manually correcting errors. Similarly, Trushkina (2006) used this method to induce linguistic tools, such as a POS-tagger, for Afrikaans based on a trilingual Biblical parallel corpus of Dutch, English and Afrikaans. Hwa et al. (2005) continue in this approach of transferring linguistic resources via projection. They take syntactic dependencies from a source language (such as English), and a word-aligned parallel corpus, and project those dependencies to a target language (such as Spanish or Chinese).
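The core projection step in this family of approaches, reduced to its simplest one-to-one alignment case, can be sketched as follows. The tags, the alignment pairs, and the UNK placeholder are illustrative simplifications; the cited work adds hand-crafted strategies for phrase alignments and noise-robust retraining on top of this.

```python
# Sketch of direct POS-tag projection across a word-aligned bitext:
# each LD token receives the tag of the HD token it is aligned to,
# and unaligned LD tokens get a placeholder for later handling.
def project_tags(hd_tags, alignment, ld_len):
    """alignment: list of (hd_index, ld_index) pairs."""
    projected = ["UNK"] * ld_len
    for hd_i, ld_i in alignment:
        projected[ld_i] = hd_tags[hd_i]
    return projected

hd_tags = ["DET", "NOUN", "VERB"]     # tags on the (tagged) HD side
alignment = [(0, 0), (1, 1), (2, 2)]  # 1-to-1 alignment, for simplicity
print(project_tags(hd_tags, alignment, ld_len=3))
```

The projected (and noisy) LD-side tags then serve as training data for a stand-alone LD tagger, rather than being used directly.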

While this indeed seems a promising approach, it is limited in that it requires a parallel bilingual text, which may well not exist for the LD language. It is true that a Bible is probably available in the LD language, which can then be automatically word-aligned with an HD Bible. However, the sort of vocabulary and syntax in a Bible may well not match that found in the target domain, such as newspaper articles or medical texts. Because of this limitation, we do not reimplement this approach, but instead focus on methods we can apply using the linguistic resources listed in section 4.2.1.

This requirement for an actual bilingual parallel aligned corpus can be somewhat relaxed. Barzilay et al. (2003) worked on aligning monolingual comparable corpora, using topic structure for paragraphs, then local alignment to find good sentence pairs. And Rambow et al. (2006) expanded on a core translation dictionary (closed-class words plus the 100 most frequent words) using unsupervised training on monolingual comparable corpora. They then expanded the lexical probabilities in their statistical model from high-density language words to all the low-density language words which appear in their translation dictionary.

Using techniques and resources (monolingual comparable corpora, “pseudo-bitexts”) similar to these two projects, one might be able to successfully apply projection techniques to make use of high-density NLP tools for a low-density language. However, for this dissertation research, we do not assume even this level of linguistic resource. Rather, we explore techniques that do not require the use of bitexts or pseudo-bitexts between a high-density and a low-density language.

In this survey chapter, we focus in particular on the approach of applying tools or resources developed for other, high-density languages to a related low-density language. Thus, Duh and Kirchhoff (2005) use a part-of-speech tagger that was developed for Standard Arabic (a high-density language) to tag low-density dialects of Arabic. Hana et al. (2004) train a Russian tagger using an annotated Czech corpus, and Hana et al. (2006) train a Portuguese tagger using an annotated Spanish corpus. Yarowsky et al. (2001) and Hwa et al. (2005) use parallel aligned corpora to project linguistic data (such as part-of-speech tags) from a high-density language to a low-density one, take steps to reduce the noise of the projected data, and then use the projected data to induce stand-alone linguistic models on the low-density language side. We believe this general approach of applying and projecting resources from high-density languages can mesh well with cognation and machine transliteration techniques (because both cognation and transliteration create bridges between lexical items across two languages).
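Since cognation creates bridges between lexical items, one very simple and purely illustrative way to propose cognate candidates between related languages is a normalized string-similarity threshold; real systems typically use learned string transducers or transliteration models instead. The Portuguese/Spanish word lists below are hypothetical inputs.

```python
# Toy cognate-candidate detection: treat word pairs whose normalized
# string similarity exceeds a threshold as likely cognates. This
# threshold heuristic is only a sketch of the general idea.
from difflib import SequenceMatcher

def cognate_candidates(ld_words, hd_words, threshold=0.75):
    pairs = []
    for ld in ld_words:
        for hd in hd_words:
            if SequenceMatcher(None, ld, hd).ratio() >= threshold:
                pairs.append((ld, hd))
    return pairs

# Hypothetical Portuguese (LD-side) and Spanish (HD-side) word lists.
print(cognate_candidates(["livro", "noite"], ["libro", "noche"]))
```

Pairs found this way can then act as lexical bridges, letting HD-side knowledge (such as tag distributions) transfer to LD words that lack any direct dictionary entry.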


2.2 Recent Work

Here we summarize some more recent work. We can organize recent low-density NLP research into the same broad approaches as discussed above: improving unsupervised approaches, improving the utility of limited data, and improving the efficacy of projection of resources from the high-density (HD) language to the low-density (LD) language.

First, we consider the realm of weakly supervised learning approaches, which are trained on lexicons rather than tagged corpora. Ravi and Knight (2009) and Ravi et al. (2010) introduced the idea of model minimization, approximating the minimal set of tag bigrams needed to explain the data, and employing the EM algorithm in conjunction with this.

Garrette and Baldridge (2012) build on the model minimization approach, augmenting it with several heuristics and with a better HMM emission initialization. Because this approach does not work well in a low-resource setting, where most tag types are not found in the initial dictionary, Garrette and Baldridge (2013) then use label propagation and weighted model minimization techniques to create a weighted dictionary that covers the entire corpus. Other researchers have worked on expanding the scope of the lexicon. Thus, Gasser (2010) employed a web crawler and morphological analyzer to extend a lexicon by adding new roots and inferring derivational constraints that apply to known roots. The novel approaches we consider in this dissertation, however, do not rely on the existence of such an LD lexicon.

Next, we consider the realm of HD → LD linguistic projection. Some recent work has dealt with the fact that tagsets for individual languages differ in granularity and reflect language-specific peculiarities. This can stand in the way of projecting resources from one language to another. Therefore, several researchers (Zeman and Resnik, 2008; Petrov et al., 2012; Naseem et al., 2010) develop manual mapping schemes of tags from one language to another, and Zhang et al. (2012) induce such a mapping from training data. In our work, we use a lowest common denominator tagset for all eight languages under consideration to enable fair comparisons across languages (details in section 3.1.2).

Some researchers project tags and other linguistic data for a lexicon from the HD to the LD, based on parallel dictionaries; see for example Täckström et al. (2013), Li et al. (2012), and Das and Petrov (2011). Others make use of word-aligned corpora, where the HD has POS tags and the LD does not, to train POS taggers which emit bilingual observations; see e.g. Thu et al. (2014) and Tamura et al. (2013). In the techniques which are the focus of this dissertation, we do not assume the availability of such parallel dictionaries or corpora.

Some researchers project linguistic knowledge from multiple languages. Thus, Kim and Snyder (2013) leverage knowledge from many known languages and alphabets in order to distinguish consonants from vowels in unknown languages and alphabets, in an unsupervised manner. And Kim et al. (2011) apply a similar approach of clustering and projecting from many known languages to the task of morphological analysis. Meanwhile, Snyder et al. (2009) increase accuracy of unsupervised part-of-speech tagging on multilingual aligned untagged text (where none of the corpora possess part-of-speech tags). The approaches we consider in this dissertation, however, do not rely on the existence of parallel aligned text.

Finally, there has been some work in language-ification as a projection technique. Thus, Hana et al. (2011) transformed the tags and lexical items of a part-of-speech tagged Modern Czech corpus to resemble an Old Czech corpus and transformed an untagged Old Czech corpus to resemble Modern Czech. This research is along the lines of our language-ification approaches, but differs in that their tag and lexical language-ifications are hand-crafted by linguistic experts who know both languages in great detail.


2.3 Utilizing High-Density Language Resources on Low-Density Data

Some research on low-density language NLP involves applying tools developed for a high-density language to a related low-density language (e.g. using a Spanish part-of-speech tagger to tag a Portuguese corpus) and then improving upon the results.

This section contains discussions of part-of-speech taggers, which share some common components, though different authors use different terminology to refer to these components. Therefore, before proceeding, we provide definitions of the terms describing three different probabilistic models, which work together:

• A contextual model, which is a tag sequence trigram model, p(ti | ti-1, ti-2). If the tag two words ago was JJ (adjective), and the tag on the previous word was NN (noun), what is the probability that the tag of the present word is VBD (perfect verb)? What is the probability that the tag of the present word is VBP (imperfect verb)?

• An emission probability, which is the likelihood of a word given a tag, p(wi | ti). As an example from English, if we decide that the present tag is VBD, what is the probability that the word is "walked"? What is the probability that the word is "ate"? To calculate this, we determine how many times each word was used with each tag, out of all the words used with this tag, throughout the entire corpus.

• A lexical model, which is the likelihood of a tag given a word, p(ti | wi). As an example from English, if the present word is "walk," what is the probability that it is a verb ("walk to the store") and what is the probability that it is a noun ("take a walk")?
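To make these three models concrete, the following sketch estimates each of them from a tagged corpus by simple counting. The two-sentence toy corpus is invented for illustration and is not data from any of the cited systems:

```python
from collections import Counter, defaultdict

# Toy tagged corpus; in the cited work, these counts come from a real treebank.
corpus = [[("the", "DT"), ("dog", "NN"), ("walked", "VBD")],
          [("a", "DT"), ("cat", "NN"), ("ate", "VBD")]]

tag_trigrams = Counter()          # contextual counts: (t-2, t-1, t)
emissions = Counter()             # emission counts: (tag, word)
tag_totals = Counter()            # occurrences of each tag
word_tags = defaultdict(Counter)  # lexical counts: word -> tag counts

for sent in corpus:
    tags = ["<s>", "<s>"] + [t for _, t in sent]
    for i in range(2, len(tags)):
        tag_trigrams[(tags[i-2], tags[i-1], tags[i])] += 1
    for word, tag in sent:
        emissions[(tag, word)] += 1
        tag_totals[tag] += 1
        word_tags[word][tag] += 1

def p_contextual(t, t1, t2):
    """p(t_i | t_{i-1}, t_{i-2}): trigram tag-sequence probability."""
    bigram_total = sum(c for (a, b, _), c in tag_trigrams.items()
                       if (a, b) == (t2, t1))
    return tag_trigrams[(t2, t1, t)] / bigram_total if bigram_total else 0.0

def p_emission(word, tag):
    """p(w_i | t_i): how often this tag emits this word."""
    return emissions[(tag, word)] / tag_totals[tag] if tag_totals[tag] else 0.0

def p_lexical(tag, word):
    """p(t_i | w_i): how often this word carries this tag."""
    total = sum(word_tags[word].values())
    return word_tags[word][tag] / total if total else 0.0

def score(sent):
    """Joint score of a tagged sentence: product of contextual * emission."""
    prob, prev = 1.0, ["<s>", "<s>"]
    for word, tag in sent:
        prob *= p_contextual(tag, prev[-1], prev[-2]) * p_emission(word, tag)
        prev.append(tag)
    return prob
```

The `score` function multiplies the contextual and emission probabilities word by word, anticipating the way the taggers discussed below combine these two models.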


2.3.1 POS Tagging of Dialectal Arabic: A Minimally Supervised Approach

As noted above, if a low-density language lacks some linguistic resource, a possible solution is to simply utilize the language tools or resources developed for a closely related high-density language. This is the approach Duh and Kirchhoff (2005) take. Modern Standard Arabic (MSA) is a high-density language, but various Arabic dialects are not. Duh and Kirchhoff (2005) apply MSA resources – an MSA part-of-speech tagger and an MSA Treebank corpus – to the task of tagging low-density dialectal Arabic data. They also investigate methods of training POS-tagging models using data from several different dialects, in order to increase the size of the training data.

First, they built a baseline part-of-speech tagger (this approach will be referred to by the codename "D1" throughout this dissertation) by running an existing tagger for Modern Standard Arabic upon data from a corpus of Egyptian Colloquial Arabic (ECA), a low-density dialect. This tagger generates a group of possible (and possibly inaccurate) tags for each lexical item. The ECA corpus actually carries with it a morphological analysis of each word, which they use to generate a gold standard for evaluating the performance of the MSA tagger on the ECA data. The accuracy of the baseline MSA tagger on ECA data is 62.76%.

In comparison, Brill taggers on English and MSA can achieve 97% accuracy (Khoja 2001). Still, the 62.76% accuracy is impressive, in that it is achieved using a tagger trained on data from a different Arabic dialect.

They also implement an upper baseline (referred to by the codename "D2" throughout this dissertation): supervised training on the fully tagged LD-corpus, using the gold tags that are otherwise withheld. They also attempt unsupervised training on the LD corpus, with just the LD-lexicon (codename "D3"). And then they attempt the same unsupervised training on the LD corpus, but beginning with an "HD" lexicon, which consists of the POS tags generated by running an HD morphological analyzer on the LD corpus – a rather noisy lexicon, with a uniform distribution of tags for unrecognized words (codename "D4").

Having made their baseline deliberately noisy, so that their task is that much harder, they explore methods of reducing this noise and improving accuracy. That is, if they can start with unsupervised learning on noisy data drawn from the linguistic resources of a different, related language, and even so achieve the accuracy they attain, then their results are all the more impressive. In one approach ("D5"), they cluster words based on distributional criteria, select tags for each cluster whose likelihood of occurring in that cluster exceeds a certain threshold, and only allow those tags as possible tags. In another improvement ("D6"), they perform simple morphological analysis (detecting affixes) on the lexical items, and then calculate conditional probabilities of those features given specific tags. This way, they can select the more likely tag from the tagset suggested by the MSA tagger. Combining these two approaches and applying them on top of the baseline tagger achieves 69.83% accuracy.
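The affix-based filtering idea can be sketched as follows. This is our own illustration of the general technique, not Duh and Kirchhoff's code: the suffix list, tags, and training pairs are invented, and a real implementation would estimate p(feature | tag) from the noisy tagged data:

```python
from collections import Counter, defaultdict

# A short, hand-picked suffix list (hypothetical; for English-like data).
SUFFIXES = ["ed", "ing", "s"]

def suffix_feature(word):
    """Map a word to its first matching suffix, or a null feature."""
    for suf in SUFFIXES:
        if word.endswith(suf):
            return suf
    return "<none>"

def train_affix_model(tagged_words):
    """Estimate p(suffix | tag) by counting over (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[tag][suffix_feature(word)] += 1
    return {tag: {suf: c / sum(sufs.values()) for suf, c in sufs.items()}
            for tag, sufs in counts.items()}

def pick_tag(word, candidate_tags, affix_model):
    """From the tags an HD tagger suggests, choose the one under which
    this word's suffix is most probable."""
    feat = suffix_feature(word)
    return max(candidate_tags,
               key=lambda t: affix_model.get(t, {}).get(feat, 0.0))

# Invented training data: suffix "-ed" co-occurs with VBD, "-ing" with VBG.
model = train_affix_model([("walked", "VBD"), ("talked", "VBD"),
                           ("walking", "VBG"), ("dogs", "NNS")])
# A hypothetical HD tagger suggests both VBD and VBG for "jumped";
# the "-ed" suffix favors VBD.
assert pick_tag("jumped", ["VBD", "VBG"], model) == "VBD"
```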

Since their corpora are so small, they suggest additional methods of using corpora from other low-density dialects of Arabic. They combine the ECA corpus with a Levantine Arabic corpus. Of course, these are two different dialects, so they need to be careful about what they learn from each corpus.

To calculate the probability of a word and tag co-occurring, they simply combine (multiply) the probability from the contextual model and the emission probability. The probability of a sequence of words and tags is the product of the probabilities of each individual word and tag occurring, for each word and tag in the sequence.


In one strategy, they use linear interpolation for the contextual model. That is, they combine the probability judgments of the ECA contextual model with the probability judgments of the contextual model of a dialect out of domain (LCA). They assign different weights to data from different dialects (λ and 1-λ), choosing the weight that maximizes the likelihood of a held-out data set. As an extension of this strategy, rather than weighting each dialect uniformly, they vary the weight to the respective dialects based on the tag under consideration.
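A minimal sketch of this interpolation strategy, with invented toy models standing in for the ECA (in-domain) and LCA (out-of-domain) contextual models, and a grid search standing in for their held-out likelihood maximization:

```python
import math

def interpolate(p_in, p_out, lam):
    """Interpolated contextual model: lam * p_in + (1 - lam) * p_out.
    p_in / p_out map tag trigrams to probabilities (toy stand-ins for
    the in-domain ECA and out-of-domain LCA models)."""
    def p(trigram):
        return lam * p_in.get(trigram, 0.0) + (1 - lam) * p_out.get(trigram, 0.0)
    return p

def best_lambda(p_in, p_out, heldout, grid=None):
    """Pick the weight maximizing held-out log-likelihood, mirroring how
    Duh and Kirchhoff choose the weight on a held-out data set."""
    grid = grid or [i / 10 for i in range(1, 10)]
    def loglik(lam):
        p = interpolate(p_in, p_out, lam)
        return sum(math.log(p(tri)) for tri in heldout)
    return max(grid, key=loglik)

# Toy example: the held-out trigrams look more like the in-domain model,
# so the search favors a high in-domain weight.
p_eca = {("DT", "NN", "VBD"): 0.6, ("DT", "NN", "NN"): 0.4}
p_lca = {("DT", "NN", "VBD"): 0.1, ("DT", "NN", "NN"): 0.9}
heldout = [("DT", "NN", "VBD")] * 8 + [("DT", "NN", "NN")] * 2
lam = best_lambda(p_eca, p_lca, heldout)
```

The per-tag extension they describe would simply maintain a separate lam for each conditioning tag rather than a single global weight.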

In a second strategy, they do not assign different weights to different dialects at all. Rather, they assume that the underlying tag sequence is the same across dialects, even though the emission probabilities will differ. Therefore, they train a shared contextual model but individual emission probabilities1. This strategy, when combined with the filtering of suggested tags based on affixes as described above, yields 70.88% accuracy overall.

Of course, 69.83% or 70.88% accuracy may appear low, considering that Brill taggers on English can achieve 97% accuracy or that Modern Standard Arabic taggers have also achieved 97% accuracy2. However, for low-density languages, the large corpora or the linguistic tools are not available to achieve such high performance. With lower (more reasonable) expectations, we can appreciate the results obtained for these low-density languages.

The picture is even better than it might seem at first. For words analyzable by the MSA tagger, their accuracy is somewhat better (about 75%), though still not approaching the ideal set by tagger performance for high-density languages. It is the unanalyzable words (for which the MSA tagger produces no suggestions) and out-of-vocabulary words which drag the accuracy down. Of course, we want to analyze all the words, even those unanalyzable by the MSA tagger. But it is encouraging that when the MSA tagger does work, it works fairly well.

1 (See Hana et al. (2006), discussed in section 4.2.2, about the possibility of making use of the lexical model of one language for another related language.)
2 See e.g. "APT: Arabic Part-of-speech Tagger," Shereen Khoja.

Furthermore, rather than overall performance, it is the improvement over the baseline that is key here, to demonstrate that each of these strategies for low-density languages can improve performance. Thus, even with this "disappointing" performance, they demonstrated a few important points. First, they demonstrated that using a language resource developed for one (high-density) language can sometimes yield useful results for related languages and dialects. Secondly, certain strategies to refine the results and filter out noise on the low-density language side can further improve the accuracy. Finally, where data is scarce but there are multiple low-density dialects available, one can make as much use as possible of the resources one has by creatively combining corpora.

Within the bounds that they set, of unsupervised learning, their results seem impressive and useful. Of course, the best way to increase accuracy and performance is to use more supervised methods, but that involves a considerable investment of time and effort on the part of linguists, something they would rather avoid for these low-density languages.

One minor critique we had when considering their incorporation of an affix-based model was that this was hand-designed. This requires some expert knowledge of the language in question, and leaves room for choosing particular levels of detail to tweak results. To their credit, their affix model was remarkably simple, with a short list of prefixes and suffixes. (One based on non-concatenative morphology might have done better for Arabic.) Also, this might work well for languages such as Arabic, but not necessarily as well for non-Semitic languages. It would be good to evaluate how well this approach works for different types of language, with differing morphology. Therefore, in this dissertation research, we re-implement this affix-model approach for all the languages in several selected language pairs, and we automatically generate the most common prefixes, without consulting any language expert2.
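As a simple illustration of expert-free affix discovery (the actual affix discovery in this dissertation uses John Goldsmith's Automorphology; this crude frequency heuristic, with an invented word list, merely illustrates the idea of extracting common prefixes without a language expert):

```python
from collections import Counter

def frequent_prefixes(words, max_len=3, top_n=5, min_stem=3):
    """Crude frequency heuristic: count candidate prefixes whose removal
    leaves a stem that is itself attested as a word, and return the most
    frequent ones. A real system (e.g. Automorphology) uses a far more
    principled unsupervised segmentation."""
    vocab = set(words)
    counts = Counter()
    for w in vocab:
        for k in range(1, max_len + 1):
            # Only accept a prefix if stripping it leaves a plausible,
            # attested stem of reasonable length.
            if len(w) - k >= min_stem and w[k:] in vocab:
                counts[w[:k]] += 1
    return [p for p, _ in counts.most_common(top_n)]

# Invented vocabulary: "un-" is discovered because stripping it yields
# attested stems ("spoken", "tie", "done").
words = ["spoken", "unspoken", "tie", "untie", "done", "undone", "redone"]
prefixes = frequent_prefixes(words)
```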

2.3.2 Applying Contextual Models from Related Languages

As discussed in the previous section, Duh and Kirchhoff (2005) applied a Modern Standard Arabic part-of-speech tagger to a corpus of dialectal Arabic, took steps to reduce the noise in the suggested tags, and trained on that data to separately induce a contextual model and the emission probabilities.

Hana et al. (2004) trained a Russian tagger using an annotated Czech corpus. They did not train on a Russian corpus at all. Rather, they assumed that the contextual models of Czech and Russian would be similar enough that the contextual model trained on Czech would suffice (we refer to this approach with codename "H1"). They also experimented with "Russifying" the Czech corpus before training. (This is a sort of language-ifier.) Thus, for example, while Czech adjectives and participles distinguish gender, Russian adjectives and participles do not, and so they stripped the gender distinction out of Czech. They hoped this would result in a marked improvement of the tagger, but it did not produce much of a difference – overall accuracy improved about 1%.

The authors think of a part-of-speech tag as being very detailed. For them, a tag specifies P (part-of-speech), S (subpart of speech), g (gender), n (number), p (person) and various other subtags. Because of this specificity, they have 1000 different tags, and they note that with 1000 tags, there are potentially 1000³ different trigrams in a trigram model. Thus, sparsity of data is a problem. Therefore, they trained separate subtaggers on combinations of subtags. When combining related slots (such as S+n+c) and training just on those subtags, the resulting subtagger outperformed the tagger trained on the full tags. They then combined the votes of several subtaggers for each slot, and the resulting model achieved 73.5% accuracy on the overall tags, though with better accuracy (high 80s to high 90s) for each of the slots.

2 To automatically generate these prefixes and suffixes, I will use Automorphology, a program by John Goldsmith which employs an unsupervised approach on an untagged corpus. It is available here: http://humanities.uchicago.edu/faculty/goldsmith/Automorphology/
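The per-slot voting scheme can be sketched as follows. The slot names (P, g, n) mimic the positional Czech tags, but the subtaggers and predictions here are invented stand-ins; Hana et al.'s actual subtaggers are full trigram taggers trained over subtag combinations:

```python
from collections import Counter, defaultdict

def combine_subtaggers(predictions):
    """Combine per-slot votes from several subtaggers into one full tag.
    `predictions` maps a subtagger name to {slot: predicted value}."""
    votes = defaultdict(Counter)
    for subtag_pred in predictions.values():
        for slot, value in subtag_pred.items():
            votes[slot][value] += 1
    # Majority vote per slot assembles the full positional tag.
    return {slot: slot_votes.most_common(1)[0][0]
            for slot, slot_votes in votes.items()}

# Three hypothetical subtaggers, each covering a different slot combination;
# disagreements (on P and n) are resolved by majority.
preds = {"S1": {"P": "A", "g": "F", "n": "S"},
         "S2": {"P": "A", "n": "S"},
         "S3": {"P": "N", "g": "F", "n": "P"}}
full_tag = combine_subtaggers(preds)
```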

(In our own reimplementation of H1, tags are defined with much less granularity. This is because we need to meet the lowest common denominator, in order to compare results across many languages.)

Since Russian and Czech have different vocabularies, they were not able to use the Czech lexical model3. Rather, for the lexical model of Russian, they built a Russian morphological analyzer and used it to analyze an untagged monolingual Russian corpus. Each word was lemmatized and then other forms were generated. (Thus, in English, "talked" would be analyzed as the verb "talk" and the lexical model would include "talk," "talked," "talking," and "talks.") Then, they use those extracted lemmas to generate all possible forms of the Russian word. For each word, they simply distributed the probability equally across all their generated entries which had that tag. This choice of equal distribution is because they are generating the words and their forms, rather than examining a large corpus, which contains the forms with certain frequencies.

3 The authors refer to this as emission probabilities, but this is at odds with the more common definition of emission probability, which is the probability of a word given a tag.


One limitation of Hana et al. (2004) is that their language-ifier, besides not working very well, was carefully constructed by linguists. While they took steps to restrict it, there is an aspect of "hacking" and an open-endedness to the rewrite rules one might construct. Also, there is a dependence upon linguistic knowledge. This makes it difficult to evaluate the system and compare it with others. In our own implementation of language-ifiers in this dissertation research, we take pains to avoid both of these limitations. Our language-ifiers are by definition restricted – for example, to a list of 250 words total for lexical replacement – and avoid dependence upon specific linguistic knowledge – for example, by choosing those 250 words not by knowledge of the most informative words, but by automatically selecting the 250 most frequent.

A second limitation of Hana et al. (2004) is that they implemented their approach on a specific language pair, and their language hack was particular to that pair. We would not truly know how well their results compare to other approaches, since e.g. Duh (2005) applied their approach to dialects of Arabic, not to Russian and Czech. In this dissertation research, we do compare the two approaches, re-implementing their work on the same language pair – or rather, on 20 different language pairs.

Towards the end of their 2004 work, they mention that there are some Czech-Russian cognates, such that they would like to make use of the individual lexical probabilities in such instances, but they leave that for future work.

2.3.3 Projecting Lexical Models Via Cognates

Two years later, Hana et al. (2006) attempt to do just that – make use of a lexical model (rather than just a contextual model) trained on a high-density language corpus. In this new work, the high-density language is Spanish and the low-density language is Portuguese4. The languages are related, and their grammars are close enough that they simply adopt the Spanish contextual model wholesale. While in their 2004 work they employed various strategies to improve the contextual model, in this case they want to focus only on projecting the lexical model via cognates, and so they chose two fairly close languages and directly use the tag sequence model from Spanish, with no further modifications.

There are several different lexical models one could use:

• The Spanish lexical model, though many Portuguese words aren't cognates.
• The Portuguese lexical model, but they do not have this because they do not have a gold-standard tagged corpus for Portuguese.
• A uniform distribution across all tags for each Portuguese word.
• The uniform distribution for Portuguese, but influenced by the Spanish lexical model specifically in the case of cognates.

For their baseline lexical model, they train entirely on Spanish and apply the model directly to Portuguese. The accuracy of this baseline part-of-speech tagger is 56.8%, which is to be expected, since many Portuguese words are not Spanish cognates. In the next experiment, they entirely discard the Spanish lexical model, and instead adopt a uniform distribution across all proposed tags for each Portuguese word. This approach makes sense if one has no idea of what the individual lexical probability of a specific Portuguese word should be. This methodology is identical to their approach in their 2004 paper, and indeed they obtain 77.2% accuracy on full part-of-speech tags, a performance similar to their results in 2004 for Russian. Finally – and this is the novelty of their 2006 work – they attempted to improve over this uniform distribution by leveraging the Spanish lexical model in cases where the Portuguese word was a Spanish cognate. In case of a cognate, they average the uniform probability with the tag probabilities for that word from the Spanish lexical model. Thus, they do not adopt the Spanish lexical model wholeheartedly, but do use it to bias the uniform Portuguese lexical model in the direction of the Spanish (we refer to this approach with codename "H2" in this dissertation). And with this new lexical model in place, they achieve 87.6% accuracy.

4 Though in (Feldman, Hana, Brew 2006), they apply their approach to three language pairs: Spanish-Portuguese, Spanish-Catalan, and Czech-Russian.
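The H2 averaging idea can be sketched as follows. The Spanish distribution and tag names are invented for illustration, and cognate detection itself is treated as given:

```python
def h2_lexical_model(word, candidate_tags, spanish_model, is_cognate):
    """Uniform LD distribution over candidate tags, averaged with the HD
    (Spanish) tag distribution when the word is a cognate - the H2 idea.
    `spanish_model` maps a Spanish word to its tag probabilities."""
    uniform = {t: 1.0 / len(candidate_tags) for t in candidate_tags}
    if not is_cognate or word not in spanish_model:
        # Non-cognates fall back to the plain uniform distribution.
        return uniform
    spanish = spanish_model[word]
    # Average the uniform LD estimate with the HD lexical model, biasing
    # the distribution toward the Spanish probabilities.
    return {t: 0.5 * uniform[t] + 0.5 * spanish.get(t, 0.0)
            for t in candidate_tags}

# Invented Spanish lexical entry: "general" is usually an adjective.
spanish_model = {"general": {"ADJ": 0.8, "NOUN": 0.2}}
dist = h2_lexical_model("general", ["ADJ", "NOUN"], spanish_model, True)
```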

In the Spanish-Portuguese work, they do not make use of subtaggers. This might be because, while they had 1000 possible tags for Czech, Portuguese has 259 tags, and Spanish has 282. Scarcity of data might still be a problem in training a trigram tag sequence model, but perhaps the reduced tagset combined with the closeness of Spanish and Portuguese made this less of an issue.

Another interesting point is that Hana et al. simply adopt the Spanish tag transition probabilities wholesale. This might work well for a language that is syntactically close to Portuguese. However, this approach may not work nearly so well when there are significant syntactic differences between the two languages being bridged. For example, consider applying the techniques of Hana et al. (2006) to Yiddish and German. Yiddish differs from German in that it is V2 even in embedded clauses, that it eliminates the simple past ("I walked") and replaces it with a periphrastic past built on the past participle ("I have walked"), and that nouns do not have case-marking. Furthermore, it is a different language, and so different constructions might be more or less frequently used than they are in German. The contextual model of Yiddish would therefore be somewhat different. Hana et al. (2004) tried addressing this problem somewhat by "Russifying" a Czech corpus and then training on it, but their attempt had little impact.

While the approach seems to perform well, we would note that it is difficult to compare their successes to the successes of other work. After all, this is not an apples-to-apples comparison. Different researchers choose different languages, where particular language features may make one approach perform much better, or where there are sufficient resources for a particular approach to excel. In this dissertation research, we will address this limitation of prior evaluations by reimplementing several techniques and evaluating them on an identical set of languages drawn from the Europarl corpus.

2.4 A Quick Summary of Low-Density NLP Approaches

Since the above description of prior work in this chapter went into great detail, this section presents a brief overview of the most important points. Duh and Kirchhoff (2005), among other approaches, use a part-of-speech tagger which was developed for Standard Arabic (a high-density language) to tag low-density dialects of Arabic. Hana et al. (2004) train a Russian tagger using an annotated Czech corpus, and Hana et al. (2006) train a Portuguese tagger using an annotated Spanish corpus.

Briefly and broadly, prior approaches to low-density NLP can be categorized as follows:

• Apply high-density language resources to a related low-density language (Duh, 2005)
• Train low-density language tools from a limited corpus, and take steps to reduce noise and thus improve results (Duh, 2005)
• Blend results (lexical/contextual model6) from a related HD-LD language pair (Duh 2005, Hana 2006)
• Change an HD-corpus into a pseudo-LD corpus, and perform supervised training on the pseudo-LD corpus (Hana 2004)

To give a slightly more detailed summary of some of these latter approaches: Duh and Kirchhoff (2005), in building a POS tagger for an LD Arabic dialect, focused on ways of making the most of limited data, as well as exploiting tools or corpora from related high-density languages or other low-density dialects. Thus, they performed:

• supervised training on an HD-corpus, and applied the resulting HD-tagger to an LD-corpus (D1)
• supervised training on an LD-corpus (D2)
• unsupervised training on an LD-corpus, but with an LD-lexicon (D3)
• unsupervised training on an LD-corpus, but with the lexicon generated by applying an HD morphological analyzer to the LD-corpus (D4)
• improvements upon the immediately previous by adding an affix model (D5)
• improvement upon the previous by constraining the lexicon for words unanalyzable by the HD-analyzer (D6)

In a separate experiment, which we do not reimplement in this dissertation research, the authors performed supervised training on LD-corpora from multiple dialects, under the assumption that while the lexical model will differ and needs to be trained separately, the contextual model is more or less equivalent.

Hana et al. (2004) trained an LD-tagger (for Russian) using an annotated HD corpus (Czech), assuming that the contextual models would be similar enough. Further, they LD-ified the HD-tags, for example stripping out attributes of the (complicated) tags where the LD-language lacked those attributes. They trained the lexical model by lemmatizing an untagged LD-corpus, generating all word forms from the lemma, and imposing a uniform distribution across all such words. Hana et al. (2006) took a related HD and LD-language pair, adopted the contextual model wholesale, and borrowed the lexical model of the HD particularly where there was an LD cognate.

In summary, to illustrate the relationship of some of these approaches to one another, consider the following diagram.

Figure 1: An overview of approaches from the literature

A more expansive discussion of these studies appears in the following sections.


Chapter 3: Overview of Work

Given these diverse approaches in the literature, what is the best approach to using cross-lingual resources in NLP on low-density languages? For such languages, which lack the large corpora that would enable straightforward statistical methods, there are a variety of approaches. Each of these approaches performs fairly well in the particular experiments designed to test it, on the particular language chosen to demonstrate it. But for someone starting from scratch on a new low-density language, how can one determine the best approach? It would be useful to compare apples to apples, and see how several different approaches perform on the same language.

Indeed, given that different approaches might exploit different aspects of languages or their relationships to other, high-density languages, it may well be that for one class of languages (e.g. Germanic languages), approach A might be better, while for another class of languages (e.g. Slavic languages; or languages with a close high-density language; or languages which exceed some threshold of existing linguistic resources), approach B might be better. If so, testing the approaches on a variety of language types will allow us to identify just what language features indicate that specific approaches are best. That is, there may not be one right or wrong answer – the question is what the correct approach is for the particular situation.

While there are many different NLP tasks, we select POS-tagging as the example task for this dissertation research. We have made this choice for several reasons: (1) it is a relatively limited task; (2) it is a task on which much low-density language research has been done; and (3) we can evaluate it readily and, hopefully, consistently.

Systematically implementing and comparing several approaches on the same set of languages may aid us in evaluating which is the "best" approach for that set of languages.

Furthermore, if written correctly, these implementations will be general enough that we can readily substitute one language for another. Such a systematic testbed would be a useful contribution to the field, such that we could discover what works best for many pairs of languages.

Further, systematically implementing and comparing several approaches on the same set of languages may provide insight into just what is being done and just what is not being done. This may then reveal new approaches that have not yet been exploited.

In the second phase of this dissertation research, we explore some approaches that we believe have not been explored sufficiently, and we see how they compare to existing approaches, for several different types of languages:

Specifically, we explore how to "language-ify" the lexical items in an LD corpus to more closely resemble the HD before applying existing, or trained, HD-tools to it. Using limited rewrite rules, perhaps better recognition of certain lexical items will assist in the identification of surrounding words, given a functioning contextual model. Conversely, we also explore how to language-ify the lexical items in an HD corpus to more closely resemble the LD before conducting supervised training on it. These approaches are discussed in greater detail in section 4.2.2.


3.1 Methodological Details

3.1.1 Corpus and Corpus Construction

Prior work had a limitation in that each group of researchers did not test on the same corpora, on the same languages, or using tagsets of similar granularity. Further, they did not attempt their approaches on multiple languages, to see whether an approach (e.g. incorporating an affix model) yields similar results on languages with different features.

The overall goal of this dissertation research is to test each approach on several pairs of languages. Perhaps certain approaches perform better on certain pairs of languages, because of the linguistic proximity of the languages to one another, or, e.g., because the morphology of the target language lends itself to easy construction of simple linguistic tools. But given that some languages have different, larger corpora than other languages, results may be skewed. Ideally, we would have a relatively identical corpus for each language.

The proceedings of the European Parliament (Europarl) constitute just such a massively parallel corpus. There is a large body of text (presently 55 million words), continuously updated, in 11 European languages. These languages are: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek, and Finnish. Unfortunately, this text is not POS-marked, and while they provide a sentence-alignment tool, the corpora are not word-aligned. But with 11 languages, there are 11 × 10 = 110 possible ordered language pairs (distinguishing which language serves as the HD and which as the LD) from which to select, and many of the languages are related to one another.

While a comprehensive approach – running all experiments on all 110 language pairs – would be ideal, the computation time to do so would be prohibitive. Instead, in chapter 5, we select seven of these pairs of languages, on the basis of specific linguistic features which will allow us to explore specific linguistically-grounded predictions.


Figure 2: In each case, the arrow points to the LD language and the arrow length indicates linguistic distance (roughly approximated by this author) between the languages

In order to provide POS tags to each corpus, we applied an existing, third-party, high-density POS tagger5 to each corpus. The resulting marked corpus is not entirely correct, and there is some noise. Still, for the purpose of the experiments conducted, we treat it as a gold standard. Later, we can rerun the experiments for individual languages on corpora hand-marked for POS, in order to assess how closely our results conform to results on this less-noisy data. Still, it is useful to have this massively parallel multilingual corpus, so as to be able to assess how different approaches are best applicable to different language pairs, and to theorize about what features of these languages or language pairs make different approaches the better choice.

3.1.2 Multiple POS sets

Prior work was limited in that it was inconsistent in the granularity of tagsets. The more fine-grained the tagset, the more difficult the problem is. Thus, Hana (2004) had a

5 TreeTagger, available here: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/


rather elaborate tagset, which involved many features such as person, gender, and number. Duh et al. (2005), by contrast, tagged only for noun, verb, determiner, and so on.

Aside from this limitation in prior work, the fact that different languages have different POS tagsets complicates performance assessment. For example, if we apply an HD-language tagger to an LD language, the tagset applied will be that of the HD language, yet the gold standard uses the LD tagset. Therefore, we implement a process for mapping certain tagsets to certain other tagsets.

This mapping is not as straightforward as it might seem. Different tagsets for different languages have varying numbers of tags and varying levels of granularity. One tagset might simply mark for NOUN and VERB, while another might distinguish between pronoun, common noun, and proper noun. It is much easier to achieve "correct" results when the tags to be assigned are rather broad. And to assess the correctness of the results automatically, by comparing to our gold standard rather than engaging in painstaking analysis by human linguists, we need to map several tags from one language's tagset to a single tag; or, in other instances, several tags to several other tags which are then collapsed into one. The result is a sort of lowest common denominator for tag assessment.

What we do is adopt the lowest-common-denominator tagset. While this makes for an easier task, it is still a useful task within NLP. We could do this only at the assessment phase (where we tag with the full tagset where it exists for a language, but in assessment broadly lump all NOUN tags together), or else we could simplify the tagset before applying the different approaches. In practice, for the HD-ify approaches, we utilize third-party taggers, which make use of tagsets with higher granularity, and only convert the assigned tags to the lower-granularity tagset at the assessment stage. For all other approaches, we make use of the common-denominator tagset throughout.
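To make the assessment procedure concrete, here is a minimal sketch of common-denominator tag mapping and scoring. The fine-grained tag names and the mapping table below are illustrative stand-ins (Penn-style), not the actual tagsets used in these experiments:

```python
# Illustrative sketch only: the fine-grained tags and the coarse mapping
# below are hypothetical stand-ins for the tagsets actually used.
COMMON_TAG = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",   # lump all noun tags together
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",   # lump all verb tags together
    "PRP": "PRON",
    "DT": "DET",
}

def to_common(tags):
    """Collapse a sequence of fine-grained tags onto the common-denominator tagset."""
    return [COMMON_TAG.get(t, "OTHER") for t in tags]

def accuracy(predicted, gold):
    """Score at the common-denominator level, as done at the assessment stage."""
    pred, g = to_common(predicted), to_common(gold)
    return sum(p == q for p, q in zip(pred, g)) / len(g)
```

Under this scheme, a predicted NNS against a gold NNP still counts as correct, since both collapse to NOUN.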


Chapter 4: Implementation

4.1 Reimplementation of Previous Methods

We have reimplemented a representative sample of the aforementioned prior approaches, and tested these approaches on a set of language pairs, systematically. This testbed can help reveal in which situations and for which languages particular approaches perform best.

The prior approaches we have reimplemented are described in great detail in sections 2.2 and 2.3, where they are referred to by the codenames D1 – D6, H1, and H2. Therefore, here we only briefly review these approaches and describe how we have implemented them.

First, from Duh and Kirchhoff (2005) there is the D1 approach: apply an HD tagger to an LD corpus. If the languages are similar enough, perhaps the HD tagger will recognize some LD words, and perhaps the tag sequence model will be similar. While those authors treated this approach as a baseline, in this dissertation research we achieved rather good results with this simple approach. For the third-party tagger, we used TreeTagger, because TreeTaggers were available for all of our selected HD languages. The one exception was Finnish, for which we trained a TreeTagger on the Finnish Treebank.

There is also the D2 approach, which is supervised training on a small LD corpus. Here, we assume we have a hand-tagged small LD corpus; we train a tagger on this LD corpus and test on another set. In our reimplementation, all of our selected LD languages were actually HD languages, but for the purpose of our testing, we acted as if they were LD languages without rich available resources. We did not hand-tag the LD corpus. Rather, we TreeTagged the untagged Europarl corpus for the LD language to create a training and a testing corpus. We then trained on the "LD" training corpus and tagged the "LD" testing corpus.

Next, there is the D3 approach, which is unsupervised training on the LD corpus, making use of an LD lexicon. The idea is that we know the lexical items and the tagset, but have no tagged corpus to train upon; unsupervised training may nonetheless give us correctly assigned tags. The authors' original article was ambiguous as to certain implementation details, and so we made some design decisions that we thought were sensible. We bootstrapped a "lousy" tagging model by performing supervised training on the lexicon, which was an LD dictionary of words with all of the possible tags for each word. We generated the dictionary by applying the LD TreeTagger to the full LD corpus. There is no contextual model to be gleaned from a lexicon, as the contents of the lexicon are not sentences, and the word order doesn't matter. Rather, we conducted supervised training on 1000 sentences from the HD corpus, and used that HD contextual model as the basis for subsequent unsupervised learning on the untagged LD corpus.

Next, there is the D4 approach, namely unsupervised training on the LD corpus, utilizing an HD lexicon. This is the same approach as D3 above, but with even less information, since the lexicon is noisier. The original authors generated this lexicon by running an HD analyzer on the LD corpus. We do this by running the HD tagger on the LD corpus and collecting all tags. Now, for dialects of Arabic, this makes sense, because the dialects are so close to one another. However, the language pairs we select are actually different languages, and so the noise introduced here is actually much greater.


Next, there is the D5 approach, which incorporates an affix model into the D4 approach above. That is, instead of just using the conditional probability of the word given its tag, i.e. p(word|tag), for the lexical model, we multiply it by p(prefix|tag) and p(suffix|tag). The original authors made use of deep linguistic knowledge about Arabic to generate a list of all the prefixes and suffixes. Because of our design decision for this dissertation research not to assume such deep linguistic knowledge, our list of prefixes and suffixes is instead generated automatically, through unsupervised learning on the LD corpus.
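As a rough sketch of what such an affix-augmented lexical model looks like in practice: the toy training data, fixed affix length, and crude add-one smoothing below are our own simplifying assumptions, not the original formulation.

```python
from collections import defaultdict

# Sketch of a D5-style lexical model: the emission score for a word given a
# tag is p(word|tag) * p(prefix|tag) * p(suffix|tag). The fixed affix length
# and the add-one smoothing are simplifying assumptions of this sketch.
class AffixLexicalModel:
    def __init__(self, tagged_words, affix_len=2):
        self.affix_len = affix_len
        self.word_c = defaultdict(lambda: defaultdict(int))
        self.pre_c = defaultdict(lambda: defaultdict(int))
        self.suf_c = defaultdict(lambda: defaultdict(int))
        self.tag_c = defaultdict(int)
        for word, tag in tagged_words:
            self.tag_c[tag] += 1
            self.word_c[tag][word] += 1
            self.pre_c[tag][word[:affix_len]] += 1
            self.suf_c[tag][word[-affix_len:]] += 1

    def _p(self, counts, tag, item):
        # crude add-one smoothing so unseen events do not zero out the product
        return (counts[tag][item] + 1) / (self.tag_c[tag] + 1)

    def score(self, word, tag):
        n = self.affix_len
        return (self._p(self.word_c, tag, word)
                * self._p(self.pre_c, tag, word[:n])
                * self._p(self.suf_c, tag, word[-n:]))
```

Even when a word itself is out of vocabulary, its prefix and suffix counts can still pull the score toward the right tag.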

Next, there is the D6 approach, which reduces the suggested tagset for unanalyzable words based on distributional criteria. Certain words were unanalyzable by the HD analyzer or by the TreeTagger. In such cases, under the D4 approach, the strategy was to assign all possible tags to such a word. In D6, we reduce the noise on the basis of distributional criteria. This involves clustering based on the surrounding tags, and then assigning only the common tags to the out-of-vocabulary words.

Next, there is the H1 approach. We trained the lexical model on an LD lexicon (mapping words to possible tags) and the contextual model on the HD corpus. Finally, there is the H2 approach, which adds to H1 the adoption of the HD lexical model for cognates. We use the minimum edit distance algorithm on HD and LD wordlists, together with a threshold to reject candidate cognates that are too dissimilar.

4.2 New Approaches

4.2.1 Overview

Alongside reimplementation of these existing approaches, in this dissertation research, we have implemented several novel approaches.


Various approaches are possible, depending on the availability of particular language resources. What follows is a list of such language resources, where each approach will assume that it can start with some subset of these resources. Some of these resources are already required for the reimplementation of earlier approaches.

Some of these resources are slightly unreasonable to expect for an LD language; others are more reasonable. It might be helpful to depict a scenario in which we have a specific goal and certain resources. For this purpose, let us assume that we have a computational linguist, "Bob," who needs to quickly build a part-of-speech tagger for an LD language. He does not have expertise in the LD language. With additional time and additional expertise, he could develop a rather elaborate system. For example, given enough time and access to LD-language experts, he could have the experts tag an LD corpus and perform supervised learning on the large tagged corpus. Or, he could have the experts hand-craft linguistic rules for tagging the language. But all of those approaches would involve a good amount of time, effort, and knowledge. Such knowledge, added later, might improve performance of existing systems, but how could he get up and running with limited resources?

The idea that Bob must build a system quickly and with limited resources is essential to the premise of this dissertation research, and this scenario has guided the set of techniques that have been explored.

Here are some resources which Bob might reasonably be expected to have, or could readily construct. In this listing below, the “LD” language is the one for which Bob must build a part-of-speech tagger. The “HD” language is another language (perhaps related somehow to the LD language) for which more high-density resources are available.


1. An LD corpus, untagged. Bob can expect such a corpus to exist; otherwise, there is little point in developing a POS tagger for the LD language.

2. An HD corpus, well-tagged. This is reasonable to expect, since it is for the HD language. And where such a corpus does not already exist, we can apply an existing HD tagger to an untagged corpus.

3. Related, a good HD POS tagger. Again, given that it is for the HD language, it is reasonable for it to exist.

4. A list of 250 very common LD words, and the corresponding words in the HD language. This is reasonable to expect, because it would be a list of only 250 words on each side. Bob could generate the LD list using a simple frequency count of the untagged LD corpus. Then, Bob could appeal to a language expert (or consult a bilingual dictionary) to generate the corresponding HD words; an expert could produce such a list in an hour or two. This list would include nouns, verbs, and adjectives, but probably most importantly, rather common and frequent words such as determiners and pronouns6.

5. An HD affix-based POS tagger/morphological analyzer. Again, this would be extremely simplistic. The typical HD POS tagger, above, operates on a contextual model and a lexical model, but it fails for out-of-vocabulary words, for which it can rely only on the contextual model. This morphological analyzer would simply analyze the affixes of a word and suggest possible POS tags.

6. A list of HD affixes. On the basis of this, Bob can use pattern matching to identify a likely part of speech, as in (5), or to map affixes between the HD and LD. Since it is for the HD language, it should be straightforward to come by. This would be a limited list, usually of under 50 prefixes and 50 suffixes. While one could use linguistic knowledge to identify these affixes, a better approach for our purposes does not assume such knowledge. Instead, Bob would automatically generate the most frequent affixes, using an unsupervised method on an untagged corpus7.

7. A list of LD affixes. Again, using a simple unsupervised approach on an untagged corpus, this resource is within Bob's reach.

8. A list of LD affixes matched with their regular HD affix replacements, and a list of HD affixes matched with their regular LD affix replacements. The affixes of both the HD and the LD are readily discoverable from untagged corpora, as described in (6) and (7). The correspondences between these are harder to acquire. Since we are dealing with a limited number of affixes, Bob could select a small number of LD words which contain a given affix, consult a dictionary or a language expert for the HD word equivalents, and examine those word mappings to determine whether any regular pattern of affix mapping occurs. The total number of affixes would be restricted, usually under 50.

9. One word for each POS in the HD: one HD noun, one HD verb, one HD adjective, etc. This list should be trivial to acquire, and should be as short as possible.

6 An alternative would be to generate a list of only the "important" words in the language (e.g., the determiners, pronouns, and prepositions, which are "important" in that they can provide key context in the contextual model), by consulting a simple written grammar of the language. We could then generate a mapping by consulting either a language expert or a simple grammar of the language. Or we could even develop a framework in which we would put these "important" words into appropriate slots based on their role (e.g., plural definite article), and then generate the mappings automatically. While this alternative is preferable for various reasons (such as that it targets the important words that will likely differ across languages and that inform not just about themselves but about their neighbors), it also entails greater linguistic knowledge than we would like to assume. Furthermore, it would be dependent on particular features of a language; for example, that it uses pronouns. For this reason, the "dumb" approach is preferable.

7 As mentioned previously, this would be done using the Automorphology program.
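The frequency count behind resource (4) is trivial to implement. A minimal sketch, assuming simple lowercasing and whitespace tokenization (real corpora would of course need proper tokenization):

```python
from collections import Counter

def most_frequent_words(corpus_text, n=250):
    """Return the n most frequent words of an untagged corpus, for a human
    expert to translate into their HD equivalents (resource 4 above).
    Lowercasing and whitespace tokenization are simplifying assumptions."""
    tokens = corpus_text.lower().split()
    return [w for w, _ in Counter(tokens).most_common(n)]
```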


It is worthwhile to stress that there are certain resources which we do not assume that Bob possesses. For example, we do not assume that Bob is in possession of an LD lexicon (an LD wordlist coupled with the possible parts of speech for each word). Similarly, we do not assume that Bob has access to a parallel word- or sentence-aligned corpus from which to project part-of-speech tags from the HD to the LD. It is true that, within our research, we make use of Europarl, which is an aligned corpus. However, none of our projection strategies exploit the parallel nature of this resource. We only use Europarl in order to be able to conduct and present an "apples-to-apples" comparison of projection strategies across several language pairs. Thus, Bob does not require any such parallel aligned corpus.

How we, in this dissertation research, or Bob, in his hypothetical scenario, would utilize each of these resources will be spelled out shortly.

First, though, we would like to broadly define two new approaches for making use of cross-lingual resources in NLP.

4.2.2 Language-ifier

A language-ifier is a system that transforms some aspect of one language into that of another language. A language-ifier is not a machine translator; its much more modest aim is to modify a corpus slightly, so as to make it more palatable for use with other-language resources. As such, language-ification is by definition extremely limited in scope, and extremely straightforward and easy to implement. For example, a language-ifier may replace the determiners and pronouns of Dutch with the determiners and pronouns of German, but leave every other word in place. Or, in a tagged corpus, it might replace the tagset of Dutch with the corresponding tagset of German. Or, it might use simple pattern matching to replace a short list of affixes in Dutch with corresponding affixes in German. Or, it might reorder the words in a Dutch corpus to correspond to the word order we would expect in German.

The idea is to make the resulting corpus more "palatable" for use with other language tools, or to train better tools for the other language.

Here is one practical example. As one baseline, Duh (2001) applied an HD tagger (Modern Standard Arabic) to an LD dialect (Egyptian Arabic). Obviously, accuracy suffered, because it was a tagger for a different dialect, with a different vocabulary and a (somewhat) different tag sequence, though genetic similarity between languages, or dialects, would decrease that gap somewhat.

But, if we language-ify the LD corpus to make it resemble an HD corpus, then perhaps the HD POS tagger could do a better job. If we replace the definite articles of the LD with definite articles of the HD, the HD lexical model would recognize those HD words. Furthermore, accurately recognizing the HD definite articles would aid in recognizing the words in immediate context, because of the contextual model. (For example, in English, the word after a definite or absolute determiner is probably a noun. The word after a nominative pronoun is probably a verb.)

This dissertation research project would not be the first language-ifier in the context of part-of-speech tagging. Hana (2004) makes use of what we would term a language-ifier, an LD-ifier, to transform an HD tagged corpus into something closer to an LD corpus, by changing the tag sequences to more closely resemble those one would expect for the LD. Then, when conducting supervised training on the pseudo-LD tagged corpus, they would learn the contextual model for the LD, instead of that for the HD. See the right side of the figure below.


Figure 3: A comparison of HD-ification to Hana’s LD-ification.

The logical, yet unexplored, approach would be to reverse the language-ifier of Hana (2004): to make an HD-ifier that transforms an untagged LD corpus into something more similar to the HD language. We could then modify Duh's approach to improve the accuracy of the tagging. That is, if one could construct a language-ifier in one direction (HD → LD), then one could construct a language-ifier in the other direction. While Hana (2004) focused on tag sequence, a language-ifier could also focus on morphology or lexicon. And an HD-ifier (LD → HD) could then be used as a preprocessing step prior to applying the HD tagger, as in Duh.

Both LD-ifiers and HD-ifiers can be useful in exploiting tools and resources that exist for another language. We will elaborate below:

The process of language-ification, as we have dubbed it, is not without good precedent. Pre-processing the data to make it more "palatable" in subsequent steps has been done before, in other research contexts. While our focus is making the LD text more palatable for an HD tool, or making the lexical level of tagged HD text more similar to an LD corpus prior to training, we see this approach used elsewhere. It is quite common in machine translation, where there is a language pair in play. By moving the source closer to the target, in terms of number of words in a sentence, granularity of lexical items, and word and phrase order, alignment improves such that better word/phrase pairings occur, and the word order more closely approximates that of the target language. There are several examples of this type of research:

• Brown et al. (1992) reordered French sentences and English sentences (into Intermediate French and Intermediate English) to simplify them, reduce the statistical variety, and make them more closely resemble one another. Niessen and Ney (2004) worked on German-to-English translation, and preprocessed the German source in various ways. For example, since German lexical items can be ambiguous, yet the ambiguity can often be resolved using analyzers on the German side, they expanded their vocabulary, replacing ambiguous lexical items with lexical items including markers provided by the morphosyntactic analysis. Further, since German inflects more than English, they created equivalence classes for words that would translate to the same English word, thus increasing coverage and smoothing out these irregularities. They also restructured their sentences by changing word order in questions; and for German prefix verbs, in which the prefix is detached from the verb and shifted to the end of the clause, they detect these prefixes and replace verb and prefix with a single word with the prefix attached. This required linguistic knowledge of the source language and how it differed from the target language, as well as rich source-side NLP tools.

• Collins et al. (2005) similarly reordered German clauses to more closely resemble the clause order of English, and subsequently Wang et al. (2007) reordered Chinese sentences to more closely resemble English word order, prior to applying traditional machine translation techniques. This reordering required knowledge of source and target POS tags, as well as of how the languages differed syntactically.

• Habash and Sadat (2006), for machine translation of Arabic to English, employed several preprocessing approaches to decrease ambiguity and word sparsity. They also used a preprocessing scheme to make the Arabic more closely resemble English: they split off conjunction clitics, the class of particles, the definite article, and all pronominal clitics, marking them with English-like POS tags and explicitly indicating the pro-dropped word as a separate token.

• Bisazza and Federico (2009) preprocessed Turkish, an agglutinative language, using morphological segmentation schemes, prior to machine translation to English. This made the lexical granularity of Turkish more closely resemble that of English, reduced the total size of the training dictionary, and caused fewer words in the test set to be out of vocabulary. They note that "Morphological segmentation is highly language dependent and requires a fair amount of linguistic knowledge in its development phase," which I think is an important point. They split off segments of words with the target language, English, in mind. Thus, for example, case endings with an English counterpart were split off, while those without an English counterpart were removed. In another interesting step, they modified the test data, replacing OOV words with the closest match in the training data.

• Hong et al. (2009) discovered that about 25% of Korean words and 21% of English sentences fail to align. These null alignments degrade the translation quality; therefore they transform the source-language sentences. If the source language is Korean, they simply remove the extra unaligned Korean words (which are, e.g., case particles and final endings). If the source language is English, they insert pseudo-words into the English sentence to match with the potential Korean function words. Furthermore, using structured parse trees, they reorder the words in accordance with rules such as moving negative markers to directly follow the verbal head.

All of the research listed above is within the realm of machine translation, where there is a source and a target language, and a motivation to make the one more closely resemble the other. These pre-processing strategies could also have applications for the projection of high-density-language NLP tools to low-density languages. For example, as was discussed in the literature survey at the beginning of this dissertation, Yarowsky et al. (2001) and Hwa et al. (2005) project linguistic resources across parallel texts, and the latter uses monolingual post-projection rules to fix up difficulties caused in part by the absence of such information, or items, on the source-language side. One could imagine instead that pre-projection rules, making the source language more closely resemble the target language, would yield better alignments.

Some low-density NLP researchers have used pre-processing techniques. As discussed in the literature survey at the beginning of this dissertation, Hana et al. (2004), in working on an LD POS tagger trained on HD data, indeed tried preprocessing rules to make the tags of the HD corpus more closely resemble the tags one would find in the LD. This required somewhat deep linguistic knowledge of how Czech (the HD) and Russian (the LD) differed. Yet, their results were not encouraging.

Within this dissertation research, we consider a number of such preprocessing rules for the purpose of POS tagging. While Hana et al. (2004) focused on the tag level, these approaches focus on the lexical level. And whereas all of the approaches above seem to require somewhat deep linguistic knowledge of the source language, or of both languages (such that one knows how they differ), our own focus, since we are concerned with low-density languages, is on approaches which do not require such expert linguistic knowledge.

Rather, L1 is simple (and possibly noisy) replacement of the most frequent 250 words, via a simple dictionary lookup or limited consultation with a language expert. L2 is affix replacement, where we automatically detect affixes and replace one with the other. L3 is exemplar replacement, in which we run simple affix-based pattern matching on OOV words and replace them with an HD exemplar of the selected POS. L4 is cognate replacement, automatically constructed from separate HD and LD wordlists.

These preprocessing rules apply in both directions (LD to HD and HD to LD), and to either the test data or the training data. Thus, we can move the LD data to more closely resemble the HD, such that the performance of the HD tagger will improve. We can do the same prior to tagging, and then train on the resulting tagged corpus. We can modify a tagged HD corpus to better resemble an LD corpus, on the lexical level, such that training on the pseudo-LD corpus will yield a better LD tagger.

This dissertation research on language-ification thus differs from many other approaches to low-density POS tagging. To summarize how this dissertation research is novel, we will identify the differences between this and prior work:

• Bender and Flickinger (2005) and Loftsson (2007) developed frameworks to speed up development time and aid in the process of building POS taggers designed by language experts. My approach does not require linguistic expertise in the LD language.

• Thompson et al. (1999), Tang et al. (2002), Baldwin (2005), and Rambow et al. (2006) manually create the resource and combine this with a statistical approach. This once again requires linguistic expertise, to manually POS-tag a corpus.

• Yarowsky et al. (2001) and Hwa et al. (2005) project linguistic resources over bilingual parallel corpora. We don't assume that such corpora exist for our LD language. While there is some work in utilizing monolingual comparable corpora (Barzilay (2003), Rambow et al. (2006)), we don't even assume that level of resources.

• Hana et al. (2004) attempted some language-ification. But this required a sophisticated level of linguistic knowledge of Czech (the HD) and Russian (the LD), which our approach does not assume. The approach of those authors operated on the tag level, while our approach operates on the lexical level. Finally, their approach did not work well, while our results (presented in a later chapter) indicate that our new approach does work.

As discussed above, this dissertation research's approach of language-ification also differs from the prior work in machine translation. That work is in a different domain (machine translation rather than POS tagging), and we are exploring the use of preprocessing techniques specifically for low-density languages. Further, much of the existing MT work involves changing word order or splitting off parts of words, while our approach involves making the lexemes more closely resemble the target language. Further, much of the existing MT work requires some deep linguistic knowledge, either to perform the preprocessing or to design the rules.

Thus, reordering rules that move about parts of phrases require knowledge of the POS tags, which is the very thing we are trying to find out, and possibly require built-up parse trees. Also, in much of the work on pre-processing for machine translation, linguistic experts in these languages, who know just how the source and target languages differ, are the ones who design the rules. For example, Bisazza and Federico (2009) needed to know Turkish quite well in order to design an approach for morphological segmentation, and they needed to know which segments corresponded to English words and which corresponded to nothing. Meanwhile, our approach assumes a very low level of linguistic knowledge of the LD language; recall the limited list of resources at Bob's disposal in our hypothetical scenario.

4.2.4.1 General Scheme of Work on Novel Approaches

We investigate two different general techniques, LD-ification and HD-ification, in a systematic manner. If the LD language is Mohammed and the HD language resource is the mountain, then either Mohammed can come to the mountain (HD-ification), or the mountain can come to Mohammed (LD-ification).

Figure 4: HD-ify vs. LD-ify approaches

That is, if we have an HD corpus, instead of training an HD tagger on it, we first transform the lexical items of that corpus into something more like the LD. We train on that LD-ified corpus, and the output is a trained "LD" tagger. We can then use that tagger to tag LD text.


4.2.4.3 HD-ifying an LD input text

This section describes HD-ification of the LD input text prior to application of the HD tagger. The following diagram illustrates the general process of HD-ification:

Figure 5: The general process of HD-ification

That is, we assume that English is the LD language and German is the HD language. We take our English LD text and use a German HD-ifier to transform it into text that looks more like German. We then pass this modified text as input to the third-party German HD tagger. That German HD tagger outputs the tags that it assigns to the modified corpus.

We apply four different HD-ification strategies in an attempt to transform the LD corpus, in order to make it more palatable to the HD tagger:

1. Gloss replacement of the 250 most frequently occurring LD words with their HD equivalents (L1)
2. Replacement of the longest recognized affixes with their HD equivalents (L2)
3. Gloss replacement of unrecognized words with an HD exemplar, after affix-based projection to an HD tagged corpus (L3)
4. Gloss replacement of LD words with their automatically selected cognates (L4)

In order to perform L1 HD-ify, we need a dictionary that maps the 250 most frequent LD words to their HD equivalents. To construct this dictionary, we use a computer program that takes an untagged LD corpus as input and outputs an LD wordlist, in descending order of frequency. Then, a human expert translates each LD word into its HD equivalent.

Figure 6: Generating the HD-ify frequent word dictionary

With this LD → HD dictionary resource in place, L1 HD-ification will take any LD input text, consult the dictionary, and perform gloss replacement of every LD word found in the dictionary. Finally, the HD tagger can be used to tag the HD-ified text.
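A minimal sketch of this L1 gloss-replacement step. The English-to-German entries below are invented examples, not the actual 250-word dictionaries used in the experiments:

```python
# Hypothetical English (LD) -> German (HD) frequent-word dictionary;
# the real resource would hold the 250 most frequent LD words.
GLOSS = {"the": "die", "and": "und", "is": "ist"}

def hdify_l1(ld_tokens, gloss=GLOSS):
    """Replace every LD token found in the frequent-word dictionary;
    leave all other words untouched."""
    return [gloss.get(tok, tok) for tok in ld_tokens]
```

The resulting mixed text is then handed to the HD tagger as if it were HD input.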

Figure 7: L1 HD-ify process

In order to perform L2 HD-ify, we need an affix dictionary consisting of common LD prefixes and suffixes mapped to their HD equivalents. To generate this dictionary, we start with an untagged LD corpus and use the work of Goldsmith (2001), namely the Automorphology program, which performs unsupervised learning of the morphology of a natural language. Then, for each affix so discovered, we use a computer program to extract from the LD corpus a list of ten LD words bearing that affix. A human translates those words into the HD, and determines whether there is any regular mapping from an LD affix to an HD affix. If so, that affix mapping is added to the affix dictionary.

Figure 8 Generating the HD-ify affix dictionaries

Another resource which L2 HD-ify uses is a large untagged HD corpus. The strategy is as follows: For each word in the LD corpus, utilize every possible prefix and suffix mapping to generate a series of candidate HD words. (This includes: (a) replacing neither the prefix nor the suffix, (b) replacing only the prefix, (c) replacing only the suffix, and (d) trying prefixes and suffixes of varying lengths.) Then, select the candidate word that appears most frequently in the untagged HD corpus. The goal of this strategy is to not accidentally move words out-of-vocabulary via overeager affix replacement.
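A minimal sketch of this candidate-generation-and-selection step for a single word, assuming flat prefix and suffix dictionaries and a word-frequency table built from the untagged HD corpus (the suffix mapping heid → heit is a hypothetical Dutch-to-German example):

```python
def hd_ify_l2(word, prefix_map, suffix_map, hd_freq):
    """L2 HD-ification for one word: generate candidates by replacing the
    prefix, the suffix, both, or neither, then keep the candidate seen most
    often in the untagged HD corpus.  The unmodified word is always a
    candidate, and wins ties, so replacement never forces a word out of
    vocabulary."""
    candidates = {word}
    for lp, hp in prefix_map.items():
        if word.startswith(lp):
            candidates.add(hp + word[len(lp):])
            for ls, hs in suffix_map.items():
                if word.endswith(ls) and len(word) > len(lp) + len(ls):
                    candidates.add(hp + word[len(lp):-len(ls)] + hs)
    for ls, hs in suffix_map.items():
        if word.endswith(ls):
            candidates.add(word[:-len(ls)] + hs)
    return max(candidates, key=lambda c: (hd_freq.get(c, 0), c == word))

# Hypothetical example: "snelheid" with suffix mapping heid -> heit,
# where "snelheit" was observed 3 times in the HD corpus.
print(hd_ify_l2("snelheid", {}, {"heid": "heit"}, {"snelheit": 3}))  # → snelheit
```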


Figure 9 L2 HD-ify process

In order to perform L3 HD-ify, we utilize the following resources:

1. the same affix dictionaries as were generated for L2 HD-ify,

2. a tagged HD corpus, and

3. a list of unambiguous HD exemplars (human-generated, one for each part of speech).

The strategy is as follows: Consult the HD tagger (or an HD corpus) to discover which words are out-of-vocabulary. For those words, take the longest prefix and suffix and use those to project to the HD corpus, via the affix dictionaries. Rather than looking for a single target HD word, in the hope that the projection hits an actual word, we consider every HD word which possesses that HD prefix and suffix, and find the majority part of speech. Then, using the exemplar list, we replace the original out-of-vocabulary LD word with the HD exemplar for that part of speech. Finally, we tag using the HD tagger.
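A minimal sketch of the L3 decision for a single out-of-vocabulary word. The affix dictionary, the tagged HD lexicon, and the exemplar list are toy stand-ins for the resources described above:

```python
from collections import Counter

def longest_mapped_affix(word, affix_map, suffix=False):
    """The longest LD affix in the dictionary matching the word ('' if none)."""
    hits = [a for a in affix_map
            if (word.endswith(a) if suffix else word.startswith(a))]
    return max(hits, key=len) if hits else ""

def hd_ify_l3(oov_word, prefix_map, suffix_map, hd_tagged_words, exemplars):
    """Project the word's longest recognized affixes into the HD language,
    take the majority POS among HD words bearing that prefix and suffix,
    and substitute the unambiguous HD exemplar for that tag."""
    hp = prefix_map.get(longest_mapped_affix(oov_word, prefix_map), "")
    hs = suffix_map.get(longest_mapped_affix(oov_word, suffix_map, suffix=True), "")
    tags = [tag for w, tag in hd_tagged_words
            if w.startswith(hp) and w.endswith(hs)]
    if not tags:
        return oov_word  # no evidence; leave the word for the HD tagger
    majority = Counter(tags).most_common(1)[0][0]
    return exemplars[majority]

hd_lexicon = [("laufend", "ADJ"), ("lachend", "ADJ"), ("Abend", "NOUN")]
print(hd_ify_l3("lopend", {}, {"end": "end"}, hd_lexicon, {"ADJ": "schnell"}))  # → schnell
```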


Figure 10 L3 HD-ify process

The effect of this strategy is to pre-tag those out-of-vocabulary words based on their affixes (selecting the tag from the corresponding combined affixes in the HD corpus), and then allow the normal HD tagger to operate upon the rest of the words.

In order to perform L4 HD-ify, we need a dictionary that maps LD words to their HD cognates. We do not develop a complicated, language-dependent cognation model in order to identify LD words and their cognates. (Recall that our goal in this dissertation is to investigate quick strategies that use minimal resources.) Instead, we use minimum edit distance to identify cognates. The cost function for this minimum edit distance is as follows: substitution of a letter for itself costs 0; of a vowel for a different vowel, 0.25; otherwise, the cost is 1. This is because vowel substitutions are common in cognates across several languages, and so we would like to favor vowel-for-vowel substitutions over the alternative. As for insertions and deletions, the cost of inserting or deleting a consonant is 1; for a vowel, if the previous letter was a vowel, the cost is 0.25, and 1 otherwise. Once again, this is because a word will often have two vowels where its cognate has one, and we would like to penalize such a vowel insertion or deletion less than inserting or deleting a vowel elsewhere.


Not every word has a true cognate, and so we imposed a threshold to attempt to separate out the real cognates. That is, we normalized the minimum edit distance by dividing it by the average length of the source and target words, and then took only those pairs whose normalized distance fell below the threshold of 0.5.
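This weighted edit distance and threshold can be sketched as follows. The vowel inventory and the reading of "previous letter" as the preceding letter within the same string are our assumptions for this sketch:

```python
VOWELS = set("aeiouy")

def sub_cost(a, b):
    """0 for identity, 0.25 for vowel-for-vowel, 1 otherwise."""
    if a == b:
        return 0.0
    return 0.25 if a in VOWELS and b in VOWELS else 1.0

def indel_cost(ch, prev):
    """Inserting/deleting a vowel right after another vowel is cheap."""
    return 0.25 if ch in VOWELS and prev in VOWELS else 1.0

def weighted_edit_distance(s, t):
    """Standard dynamic-programming edit distance with the costs above."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel_cost(s[i - 1], s[i - 2] if i > 1 else "")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel_cost(t[j - 1], t[j - 2] if j > 1 else "")
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(s[i - 1], s[i - 2] if i > 1 else ""),
                d[i][j - 1] + indel_cost(t[j - 1], t[j - 2] if j > 1 else ""),
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
            )
    return d[m][n]

def is_cognate(s, t, threshold=0.5):
    """Normalize by average word length and compare to the threshold."""
    return weighted_edit_distance(s, t) / ((len(s) + len(t)) / 2) < threshold

print(weighted_edit_distance("boot", "bot"))  # deleting the second 'o' costs 0.25
```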

To build this cognation dictionary, we used as input large untagged corpora for the LD and HD languages. Precomputing each of these cognates takes a long time, but by splitting the source corpus into pieces, we took advantage of parallel processing.

Figure 11 Generating the cognate dictionaries

With this LD → HD cognate dictionary resource in place, L4 HD-ification will take any LD input text, consult the dictionary, and perform gloss replacement of every LD word in the dictionary with its HD cognate. Finally, the HD tagger tags the HD-ified text.


Figure 12 L4 HD-ify process

4.2.4.2 LD-ifying an HD Corpus

The previous section discussed HD-ification, transforming the untagged LD input text so that it would resemble the HD language and tagger. Here, we discuss the reverse approach, LD-ification, which transforms a tagged HD corpus so that it resembles the LD language. Training on such an LD-ified corpus would produce a tagger more capable of tagging the LD language. The following diagram illustrates the general process of LD-ification:

Figure 13 The general process of LD-ification

The HD-ification approaches required the construction of limited linguistic resources, and the LD-ification approaches require this as well. In general, the resources to be constructed here are the reverse of the resources required for HD-ification.


Thus, while L1 HD-ify required a dictionary of the most frequent LD words and their corresponding translations in the HD language, L1 LD-ify required a dictionary of the most frequent HD words and their corresponding translations in the LD language. For L2 LD-ify, we ran an untagged HD corpus through Automorphology and extracted affixes, selected translations of HD words with those affixes, and looked for regular patterns of HD affix → LD affix. For L4 LD-ify, we produced a cognate dictionary mapping HD words to their LD cognates (which is not the same as LD words to their HD cognates).


Chapter 5: Language Discussion

As noted previously, Europarl is a parallel corpus, the proceedings of the European Parliament. It is untagged, but we can use third-party tools (TreeTagger) to introduce a somewhat noisy tagset, to create a tagged corpus for further experimentation. It is helpful that the texts are parallel, for it leads to better language-to-language comparison, with corpora of the same size in the same domain, with similar content. Europarl is available in 21 languages, though some languages cover more years of the Parliament proceedings. The approaches considered in this thesis involve creating and exploiting connections between related languages. In theory, we could consider every single possible ordered language pair, of which there are 21 × 20 = 420. Besides taking an inordinate amount of time and causing us to drown in data, this would not yield much more informative data than if we carefully selected our language pairs.

Therefore, our approach will be to consider linguistic features of the possible languages, and select those language pairs which enable us to analyze several predictions about how the properties of the pair of languages might relate to the relative success of various approaches. We would expect certain language-ification approaches to work better on specific language pairs than on others. A full discussion of language-ification approaches and how they might relate to language features will follow but, by way of illustration, Dutch and German are both West Germanic languages, a branch of the Germanic languages, which is a branch of the Indo-European languages. Meanwhile, Finnish is a Uralic, rather than Indo-European, language. We might therefore expect greater language similarities between Dutch and German, and thus a shorter language bridge to cross, either before or after language-ification strategies are employed.

Figure 14: Dutch is closer to German

We consider specifically those language features which we believe would have an impact on our approaches. In brief, these approaches include (a) applying a 3rd-party HD tool to an LD language, (b) modifying the LD language to make it more palatable to the HD tool, and (c) modifying an HD corpus to better approximate an LD corpus, and training a tagging model on that LD-ified corpus. A tagging model encompasses a lexical model (word given tag) and a contextual model (tag given previous tag(s)), and similarity or dissimilarity of the linguistic features of two languages can impact the bridging of one or both of these models.
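To make the two components concrete, here is a sketch of how a bigram tagging model scores a candidate tag sequence; the toy probability tables are illustrative, and smoothing of unseen events is omitted:

```python
def sequence_score(words, tags, lexical, contextual):
    """Score a tag sequence as the product of lexical probabilities
    p(word | tag) and contextual probabilities p(tag | previous tag)."""
    score, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        score *= lexical.get((w, t), 0.0) * contextual.get((t, prev), 0.0)
        prev = t
    return score

lexical = {("the", "DET"): 1.0, ("dog", "NOUN"): 1.0}
contextual = {("DET", "<s>"): 1.0, ("NOUN", "DET"): 0.5}
print(sequence_score(["the", "dog"], ["DET", "NOUN"], lexical, contextual))  # → 0.5
```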

Some linguistic similarities will affect the lexical model more. For example, a large overlap in the vocabulary of two languages would mean a greater number of in-vocabulary words, and thus a more effective application of the HD lexical model. A large number of cognates shared between two languages might mean that a cognate replacement strategy will bring more words into vocabulary than if the two languages did not share those cognates.

Some linguistic similarities will affect the contextual model more. For example, if within two languages, adjectives regularly precede nouns, then the contextual model of the HD language might be more effectively applied to the LD language than if the adjective-noun order differed. And if a replacement strategy would, for instance, aid an HD tagger in identifying LD adjectives, then this might mean that the nouns, which regularly follow, will be correctly tagged. However, if the two languages in a pair do not share this regular adjective and noun order, then the HD contextual model cannot be brought to bear to inform about the following noun.

The linguistic features we would like to consider are:

1. Language family – section 5.1
2. Constituent word order (SOV, SVO, etc.) – section 5.2
3. Whether adjectives appear before or after the nouns they modify (noun phrase word order) – section 5.3
4. The presence or absence of definite and indefinite articles – section 5.4
5. The lexical granularity of definite and indefinite articles – section 5.5
6. Is the language pro-drop? – section 5.6
7. How richly is it inflected? – section 5.7

There will naturally be some overlap between these linguistic features. For instance, pro-drop languages are usually richly inflected while non-pro-drop languages are not. And languages within the same language family are likely to share many linguistic features. Still, we would like to consider each of these language features separately, and discuss how they might impact our language-ification approaches.

These linguistic features we would like to consider will help us identify specific language pairs to study. Of course, there are two other practical considerations which will also restrict the languages we may consider, namely which languages appear in Europarl and which languages may be tagged by TreeTagger. The 21 Europarl languages are: French,

Italian, Spanish, Portuguese, Romanian, English, Dutch, German, Danish, Swedish, Bulgarian, Czech, Polish, Slovak, Slovene, Finnish, Hungarian, Estonian, Latvian, Lithuanian, and Greek.

Figure 15: Windows Interface for TreeTagger

Also, recall that Europarl is untagged, so we need third-party taggers to tag it. Thus, we are restricted to those relatively "HD" languages for which a 3rd-party tagger exists. To maintain consistency across languages, I chose TreeTagger as that 3rd-party tagger. Of those 21 Europarl languages, there were ten for which there also existed TreeTagger parameter files: French, Italian, Spanish, Portuguese, English, Dutch, German, Bulgarian, Slovak, and Estonian. To these, we added a parameter file for Finnish, by training it on the Finnish Treebank8.

5.1 Language Families

Languages are often grouped by linguists into language families. These groupings reflect the historical development of the languages, and languages within the same language family are likely to share a great number of linguistic features. For instance, two closely related Germanic languages are likely to share quite a number of cognates, due to shared vocabulary. They are likely to share syntactic features as well, such as constituent word order, morphology, and presence of articles. We would then expect that the closer the relationship between languages, the better the fit we would achieve in transferring both the lexical and contextual models, both with and without language-ifications.

8 Available here: http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/

The hierarchical tree in Figure 16 (below) demonstrates the relationship between all the languages which occur in the Europarl corpus. Those leaf nodes colored in green are the ones for which there is a TreeTagger.

[Figure 16 is a hierarchical tree grouping the Europarl languages into the Indo-European family (with Balto-Slavic, Germanic, Hellenic, and Italic/Romance branches) and the Uralic family (with Finnic and Ugric branches); leaf nodes in green mark the languages for which a TreeTagger exists.]

Figure 16 Relationship between Europarl languages


On the basis of this tree, we select the following language pairs, with German being the LD language in each case:

1. (Dutch, German) as very closely related Germanic languages
2. (English, German) as less closely related Germanic languages
3. (French, German) as quite unrelated languages, where the first of the pair is not a Germanic language at all

In each case, the second item in the language pair is German, for purposes of better comparison.

Prediction [P1a, b] We predict that simple application of HD taggers to LD text will perform best for pairs of languages with phylogenetic or historical cultural connections.

More concretely, we predict the following about the D1 approach, which is simply applying the HD tagger to the LD language:

A. An HD Dutch tagger tagging German will have greater accuracy than an HD English tagger tagging German.
B. An HD English tagger tagging German will have greater accuracy than an HD French tagger tagging German.

Prediction [P2a, b, c, d] We would also predict that low-density NLP projection approaches based on cognate-identification techniques will perform best for pairs of languages with phylogenetic or historical cultural connections.


One of our language-ification strategies, L4 HD-ify, identifies shared cognates within a language pair and, prior to employing the HD tagger to tag an LD sentence, replaces each instance of an LD word with its cognate, if a cognate exists.

Another of our language-ification strategies, L4 LD-ify, operates in a similar manner but in the opposite direction. It identifies shared cognates within a language pair and then modifies an HD corpus, replacing HD words with their cognates, if a cognate exists. A TreeTagger is trained on that modified, LD-ified corpus and used to tag the LD language.

These cognate-based language-ifications operate primarily on the lexical model, through reduction of noise by bringing out-of-vocabulary (OOV) words into the vocabulary. However, once words are brought into vocabulary, there is a greater opportunity for the contextual model to work effectively as well. (Once we know a word is an adjective, the contextual model can figure out that the next word is likely to be a noun.)

Additionally, one of the prior approaches we reimplemented was H2. Recall that the H2 strategy adopts wholesale the HD language's contextual model. For the lexical model, it takes the average of the uniform LD lexical model (all tags assigned to a word in a simple dictionary) and the HD lexical model for that word, if it is a cognate.

We expect that between closely related languages, there would be a greater number of shared cognates. Furthermore, we expect that the cognates, where automatically detected, would be of higher quality (less noisy) in more closely related languages, wherever a cognation strategy is used. Therefore, we predict that for the cognate language-ification approaches (L4 HD-ify and LD-ify), there would be a greater increase over the baseline for closely related languages than for distant languages. And for the same reason, we predict that H2, as well, would be more successful if the languages are closer.


We therefore select the following language pairs to test this prediction:

1. (Dutch, German) as very closely related Germanic languages
2. (German, Dutch) as very closely related Germanic languages
3. (Finnish, German) as unrelated languages

We predict that there would be a greater number of cognates between Dutch and German than between Finnish and German in their respective vocabularies; and further, that slight morphological differences between Dutch and German can be bridged by cognation in a way that Finnish and German might not.

(A note, for the sake of clarity: for the language-ification strategies, we consider the difference from the baseline approach, and refer to this as delta.)

Therefore, we would make the following concrete predictions:

A. L4 HD-ification on (Dutch, German) would be more effective at bridging the language gap – that is, it would have a greater increase over the baseline – than L4 HD-ification on (Finnish, German). That is, delta L4 HD-ify on (Dutch, German) > delta L4 HD-ify on (Finnish, German).
B. Delta L4 HD-ify on (German, Dutch) > delta L4 HD-ify on (Finnish, German).
C. H2 on (Dutch, German) would be more accurate than H2 on (Finnish, German).
D. H2 on (German, Dutch) would be more accurate than H2 on (Finnish, German).

5.2 Constituent Word Order

Languages differ in word order. By this, we mean whether the arrangement in a phrase is Subject Object Verb (SOV), Subject Verb Object (SVO), or some other permutation. This can have an impact on the contextual model, or p(t | previous tags). Languages with identical word order can be more successfully bridged without modification to the tag sequence. Depending on the granularity of the tag set, this could have greater or lesser impact. For instance, say a tag set distinguishes between nouns in the nominative case (subjects) and nouns in the accusative case (objects). Then, the contextual model will separately contain p(verb | nominative_noun) and p(verb | accusative_noun), which captures, to an extent, the constituent word order. If a pair of languages shares word order, then their contextual models will resemble one another in this regard. Even with lesser granularity, where we simply consider nouns and verbs, there can be greater similarity between the contextual models: for instance, p(PERIOD | verb) vs. p(PERIOD | noun), or p(noun | verb)9.
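A contextual model of this kind can be estimated from a tagged corpus by simple relative-frequency counting; the sketch below uses a bigram model and a toy corpus:

```python
from collections import Counter

def contextual_model(tagged_sentences):
    """Estimate p(tag | previous tag) by relative frequency, with '<s>'
    marking the start of each sentence."""
    bigrams, contexts = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        for _, tag in sent:
            bigrams[(tag, prev)] += 1
            contexts[prev] += 1
            prev = tag
    return {(t, prev): c / contexts[prev] for (t, prev), c in bigrams.items()}

corpus = [[("der", "DET"), ("Hund", "NOUN")],
          [("die", "DET"), ("Katze", "NOUN")]]
model = contextual_model(corpus)
print(model[("NOUN", "DET")])  # → 1.0
```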

It would, therefore, be instructive to consider the cases of language pairs which are similar to each other, and ones which differ from each other, in this particular language feature. We probably do not need to consider all permutations of (Subject, Verb, Object) order when selecting language pairs. Rather, it is a question of linguistic similarity in this feature.

There are two definitions of constituent word order. One is the surface word order, and one is a conceptual word order. For instance, a language might be underlyingly SOV, but movement can rearrange the surface word order. While it might be of interest to consider the base structure in creating a contextual model, in practice we consider, and statistically train on, the surface word order. Different base word orders can still have some sort of impact. Thus, German is fundamentally SOV, but has V2 in main clauses (due to movement) and SOV in subordinate clauses. This will differ from a language which is fundamentally SVO, which would have SVO in subordinate clauses as well.

9 Or, p(article | verb). These cases are just for the purpose of illustration.


The following chart considers all the languages in the Europarl corpus, and the particulars of this language feature.

Table 1 Constituent Word Order of Languages in Europarl

Bulgarian: Stylistically, 80.5% are SVO10. Yet, with the DOC (direct object clitic), all other word orders are allowed. So, free word order.
Czech: Free, because inflected forms indicate the syntactic relations. Basic word order: SVO. But objective and subjective word order can cause a reordering. VSO for questions.
Danish: SVO; has inversion in questions, with the verb placed before the subject.
Dutch: SVO in main clauses, SOV in subordinate clauses.
German: SVO in main clauses, SOV in subordinate clauses.
Greek: SVO, but more relaxed than in English, due to richer morphology.
English: SVO.
Spanish: SVO.
Estonian: V2; SVO in main clauses, for neutral word order. SOV was common in embedded clauses in the early 20th century, but is nowadays only a minority pattern, so the language might originally have been SOV11. This would not really impact us much in terms of tagging. Because it is highly inflecting, it also has fairly free word order.
Finnish: Modern Finnish is SVO. Proto-Uralic was SOV. But because it is highly inflected, it is fairly loose about placement of adverbials, for instance.
French: SVO for nouns, SOV for pronouns.
Hungarian: Neutral word order is SVO. But it has pragmatic word order, being a topic-prominent language, so it is sometimes roughly described as free word order.
Italian: SVO. But other orders, like OVS and VSO, occur for emphasis or stylistic concerns.
Lithuanian: Main word order: SVO, but relatively free, because highly declined.
Latvian: Main word order is SVO, but relatively free word order.
Polish: SVO.
Portuguese: SVO.
Romanian: SVO.
Slovak: SVO, relatively free.
Slovene: SVO, relatively free.
Swedish: SVO, more rigid word order.

Thus, there seem to be four options amongst the languages in Europarl.

10 So Leafgren (2002). According to Dyer (1992), SVO is statistically most common, and also stylistically neutral.

11 See Elhala (1995)


1. French looks interesting because of the difference of SOV for pronouns and SVO for nouns.
2. German and Dutch, as SVO in main clauses, SOV in subordinate clauses.
3. English as generally SVO, with fixed word order.
4. Finnish and Estonian, which are SVO, but with a lot of flexibility due to inflection.

We would want to compare languages which are similar to each other, and languages which diverge from each other.

Prediction [P3a, b, c, d] We predict that Low-density NLP projection approaches based on simple adoption of the HD contextual model will be most effective for pairs of languages that are closest in their fundamental word order (e.g., SVO, SOV, etc.).

That is, one of the approaches re-implemented in this dissertation is called H1. It builds its lexical model on a low-density language dictionary of words to tags, assigning equal probability to each tag. So, e.g., if the word "bank" appears in the dictionary as a noun and a verb, then the lexical model considers noun and verb as equally likely tags for "bank".
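The H1 lexical model for the "bank" example can be sketched directly:

```python
def h1_lexical(word, ld_dict):
    """H1 lexical model: every tag the LD dictionary lists for a word
    is considered equally likely."""
    tags = ld_dict.get(word, [])
    return {t: 1.0 / len(tags) for t in tags}

print(h1_lexical("bank", {"bank": ["NOUN", "VERB"]}))  # → {'NOUN': 0.5, 'VERB': 0.5}
```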

Since this lexical model is provided for each LD language, we can more or less cancel out the lexical model when comparing different languages. So drastic differences in vocabulary between the HD and LD languages simply do not matter. Meanwhile, the contextual model is adopted wholesale from the HD language. Therefore, similarities in contextual models would greatly affect H1.

Therefore we select the following language pairs when considering H1:

1. (German, Dutch) as languages with very close constituent word order: SVO in main clauses, SOV in subordinate clauses.
2. (German, Finnish) as languages with more distant constituent word order. While both are SVO, Finnish has relatively free word order.
3. (Estonian, Finnish) as languages with very close constituent word order: SVO as the neutral word order but generally relatively free word order due to high inflection.
4. (English, German) as languages which are close but different. Both are Germanic languages and have SVO in general, but German differs by having SOV in the subordinate clause.

With these languages chosen, our predictions about the relative accuracy of the H1 strategy are as follows:

A. We predict that H1 on (German, Dutch) would be more accurate than H1 on (German, Finnish). German and Dutch both have a fixed word order, have SVO in the main clause, and have SOV in the subordinate clause. Finnish, meanwhile, always has SVO as the neutral order, and has relatively free word order.
B. We also predict that H1 on (Estonian, Finnish) would be more accurate than H1 on (German, Finnish). Estonian and Finnish both have SVO as the neutral order, and have relatively free word order. German, meanwhile, differs as described above.
C. We predict that H1 on (German, Dutch) would be more accurate than H1 on (English, German). While German, Dutch, and English all have SVO in the main clause, only German and Dutch have SOV in the subordinate clause.
D. We predict that H1 on (English, German) would be more accurate than H1 on (German, Finnish). Even though English word order differs from German in the subordinate clause, it still more closely resembles German in the main clause and in its fixed word order.

While this lends insight into how H1 might perform, we expect that the impact of language relatedness on the contextual model would be felt even when applying various simple or more complex projections as well. These projections would provide a boost, in that the language pair starts with contextual models that are similar. And then the lexical projections would bring words into vocabulary, allowing the contextual model to operate more effectively with the new data it has.

5.3 Adjective premodifiers and postmodifiers

Another language feature that might impact contextual models, and efforts to apply them across languages, is whether the adjective typically precedes the noun or follows it.

This can impact whether a determiner will predict a noun or an adjective, and whether an adjective will predict a noun or vice versa. Where target and source language share this same order, the contextual model will transfer much more readily. Here is a survey of the languages in the Europarl corpus, in terms of whether adjectives are premodifiers (coming before the noun) or postmodifiers (coming after the noun).


Table 2: Adjectives as premodifiers or postmodifiers

Bulgarian: premodifier
Czech: premodifier
Danish: premodifier
Dutch: premodifier
German: premodifier
Greek: premodifier (in NT Greek, attributive before the noun, restrictive after the noun)
English: premodifier
Spanish: before or after, depending on purpose
Estonian: premodifier
Finnish: premodifier (Modern Finnish)
French: depends on type and meaning, but most are postmodifiers
Hungarian: attributively, before the noun; predicatively, after the noun
Italian: in general, postmodifier, but certain common adjectives come before
Lithuanian: premodifier
Latvian: premodifier
Polish: postmodifier
Portuguese: postmodifier
Romanian: postmodifier
Slovak: premodifier
Slovene: premodifier
Swedish: premodifier

From this table, we select two language pairs on the basis of their adjective-noun order. We select a language pair which is alike and one which is dissimilar in this regard.

Alas, those that are dissimilar are also from more distant language families, such that it is difficult to test just the effect of this one language feature.

1. (English, German) as (premodifier, premodifier)

2. (French, German) as (mostly postmodifier, premodifier)

We would expect that language pairs which are alike in positioning their adjectives and nouns would possess a greater similarity in their contextual models than languages which are dissimilar.


We therefore predict that Low-density NLP projection approaches based on simple adoption of the HD contextual model will be most effective for pairs of languages that are closest in their noun-adjective order (e.g., premodifier, postmodifier, mixed).

Consider the following LD English sentence and its HD German equivalent:

Figure 17: German and English premodifier example

The word order throughout the sentence is identical, from the word the until the period. And so, the two adjectives quick and brown precede the noun fox, just as the two adjectives schnelle and braune precede the word Fuchs. Likewise, lazy precedes dog, just as faulen precedes Hund.

Because of sentences like this, the following trigrams of tags might be expected to occur often in both English and German:

1. # ARTICLE ADJECTIVE – # the quick

2. ARTICLE ADJECTIVE ADJECTIVE – the quick brown

3. ADJECTIVE ADJECTIVE NOUN – quick brown fox

4. ADJECTIVE NOUN VERB – brown fox jumped

5. PREPOSITION ARTICLE ADJECTIVE – over the lazy

6. ARTICLE ADJECTIVE NOUN – the lazy dog

7. ADJECTIVE NOUN PERIOD – lazy dog .


(We did not list fox jumped over and jumped over the, which are the only trigrams which do not include an adjective.) Each listed trigram can be expected to occur, with fairly high probability, in both the German and English contextual models. Each listed trigram is only possible because adjectives are premodifiers of nouns. If a language had adjectives only as postmodifiers of nouns, then not a single one of these trigrams would be possible, for the following reasons:

1. Not possible since only nouns would follow articles, not adjectives.

2. Not possible for the same reason.

3. Not possible because the noun cannot follow the adjective.

4. Not possible because the noun cannot follow the adjective; we would expect the verb to come after the adjective instead.

5. Not possible because the adjective cannot immediately follow the article

6. Not possible for the same reason.

7. Not possible because the noun cannot follow the adjective.
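The tag-trigram extraction behind these comparisons can be sketched as follows; the single "#" start pad and the tag names follow the examples above:

```python
def tag_trigrams(tags):
    """All trigrams of tags in a sentence, padded with a '#' start symbol."""
    padded = ["#"] + tags
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

# Tag sequence of "the quick brown fox jumped over the lazy dog .":
english = ["ARTICLE", "ADJECTIVE", "ADJECTIVE", "NOUN", "VERB",
           "PREPOSITION", "ARTICLE", "ADJECTIVE", "NOUN", "PERIOD"]
print(tag_trigrams(english)[0])  # → ('#', 'ARTICLE', 'ADJECTIVE')
```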

Indeed, consider the same LD English sentence when translated to French, where the adjective-noun order depends on type and meaning:

Figure 18: English premodifier and French postmodifier example

In this example, none of the seven trigrams listed above occur. Instead, the corresponding French trigrams are:


1. # ARTICLE NOUN – # Le renard

2. ARTICLE NOUN ADJECTIVE – Le renard brun

3. NOUN ADJECTIVE ADJECTIVE – renard brun rapide

4. ADJECTIVE ADJECTIVE VERB – brun rapide saute

5. PREPOSITION ARTICLE NOUN – par-dessus le chien

6. ARTICLE NOUN ADJECTIVE – le chien paresseux

7. NOUN ADJECTIVE PERIOD – chien paresseux .

Of these French trigrams, many would not be possible in a language where the adjective is solely a premodifier:

1. Possible.

2. Not possible because the adjective cannot follow the noun

3. Not possible for the same reason

4. Not possible because a noun should intervene before the verb

5. Possible.

6. Not possible because the adjective cannot follow the noun

7. Not possible for the same reason

Note that French adjectives are not strictly postmodifiers. Consider these counterexamples:

- un grand homme – a great man
- un petit homme – a little man
- un bon appétit – a good appetite


Examples of this sort can bring many trigrams of tags into the contextual model, albeit at a reduced probability. Still, we see from this discussion that the simple difference of premodifier vs. postmodifier can have a profound effect on the contextual model.

We will note here that this example sentence of “The quick brown fox…” is merely illustrative, but it reflects actual adjective-noun order as it appears in the Europarl corpus.

For example, consider the following Europarl sentence. The German and English do not produce every trigram described above, but do produce some:

German: Hier spielt auch die Frage der humanitären Organisationen hinein, deren Tätigkeit allem Anschein nach häufig durch intolerantes Verhalten der Kriegführenden beeinträchtigt wird, die zynisch versuchen, Zeit zu gewinnen, um ihre Siege zu zementieren oder um Vergeltungsaktionen gegen die gefährdete Bevölkerung zu begünstigen.

English: The question of humanitarian organisations comes in here too, as their scope for action is frequently affected by the intolerable behaviour of certain parties to conflicts who are cynically trying to win time to seal their victories or to take retaliatory action affecting populations at risk.

French: La question des organisations humanitaires entre ici en jeu , leur action semble fréquemment souffrir de comportements intolérables de la part des belligérants , qui cherchent à gagner du temps de manière cynique pour asseoir leurs victoires ou promouvoir des actions de représailles contre les populations à risque.

The following figure demonstrates how adjectives act as premodifiers in the English and German corpora:


Figure 19 English and German premodifiers in the actual corpus

Meanwhile, the following figure demonstrates how French adjectives act as postmodifiers, and the mismatch in the contextual model that results:

Figure 20 French postmodifiers in the actual corpus

Prediction [P4a, b] Therefore, we predict that language pairs which share noun-adjective order will benefit more from wholesale adoption of the HD contextual model, and will thus have greater accuracy than language pairs which don’t.

More concretely, with the two language pairs selected above:

A. We predict that H1 on (English, German) would be more accurate than H1 on (French, German).

B. We predict that D1 on (English, German) would be more accurate than D1 on (French, German).


5.4 The Presence of Definite and Indefinite Articles

This is another difference between languages - whether a language uses definite and indefinite articles for nouns. This feature will have an impact primarily on the contextual model since, for instance, nouns and adjectives might be predicted by the presence of such an article. In a language pair (a, b), there are three configurations:

1. both a and b use articles

2. neither a nor b uses articles

3. one language in the pair uses articles

Considering case (1), there should be a closer match in the contextual models of such a language pair than case (3). Considering case (2), absence of the article might make for a close match, depending on other linguistic features (e.g., will a noun often follow an adjective?), and so it might make for a better match than in (3).

While the impact might be primarily on the contextual model, there will also be an impact within the lexical model. The definite and indefinite articles are, after all, lexical items, and articles are frequent and fairly unambiguous. Simple language-ifications (namely L1, for most frequent words, or L4, if the languages are related and therefore likely have articles which are cognates) should vastly improve the lexical model of the target language for those frequent articles.
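As a rough sketch of how such a simple language-ification might look, L1-style gloss replacement of the most frequent words in a tagged corpus can be implemented as follows. The gloss table here is a hypothetical Dutch-to-German mapping of a few articles, not the actual resource used in the experiments:

```python
from collections import Counter

def ld_ify_frequent(tagged_sents, gloss, n_most_frequent):
    """L1 LD-ify: replace the most frequent HD words in a tagged HD corpus
    with their LD glosses, leaving the tags untouched."""
    freq = Counter(w.lower() for sent in tagged_sents for w, _ in sent)
    frequent = {w for w, _ in freq.most_common(n_most_frequent)}
    return [[(gloss.get(w.lower(), w) if w.lower() in frequent else w, t)
             for w, t in sent]
            for sent in tagged_sents]

# Hypothetical Dutch -> German glosses for a few frequent words (articles).
gloss = {"de": "die", "het": "das", "een": "ein"}
corpus = [[("de", "DET"), ("hond", "NOUN")],
          [("de", "DET"), ("kat", "NOUN")],
          [("een", "DET"), ("hond", "NOUN")]]
ld_corpus = ld_ify_frequent(corpus, gloss, n_most_frequent=4)
# The frequent, fairly unambiguous articles are now in-vocabulary
# for the tagger trained on the LD-ified corpus.
```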

The table that follows considers each of the languages in the Europarl corpus, in terms of presence, absence, and granularity of articles.


Table 3: Presence and Granularity of Articles in Europarl languages

Bulgarian: no indefinite article; a definite article as the suffix –ът/–ят at the end of nouns12
Czech: no definite or indefinite articles
Danish: indefinite article prepositive as a separate word; definite article as a postpositive -en suffix, with three separate forms for common, neuter, and plural
Dutch: definite and indefinite articles; three in total: de, het, een
German: prepositive separate words for both definite and indefinite articles; many forms, based on case
Greek: definite and indefinite articles, marked for case, number, and gender
English: definite and indefinite articles; two in total
Spanish: definite and indefinite articles, marked for number and gender, for a total of eight
Estonian: none, though the demonstrative is making inroads into becoming a definite article in modern Estonian13
Finnish: none
French: definite and indefinite articles, marked for number and gender
Hungarian: definite and indefinite articles, not marked for anything
Italian: definite article, marked for gender and number; indefinite article, marked for gender
Lithuanian: none
Latvian: none
Polish: none
Portuguese: definite and indefinite articles, marked for gender and number
Romanian: definite and indefinite articles, marked for gender, number, and case; the definite article is attached to the end of the word as an enclitic
Slovak: none
Slovene: none
Swedish: an indefinite article for singular, as a separate word preceding the noun; a definite article as a suffix, marked for number and gender

12 This suffix would impact the identification of nouns. On the lexical level, this might interfere with identification of cognates; it also increases the sparsity of the data. On the contextual level, it is difficult to exploit, since the granularity of the tagset does not include definiteness as a feature of nouns. Perhaps a plausible language-ification would strip off these definite articles for parallel languages which lack the definite article, or would place a dummy definite article immediately before the noun for languages which possess it. (This is not one of the language-ifications proposed, though.)
13 See http://ee-translations.com/Documents/cafslFIN.pdf


We would like to select representative language pairs for each of the three configurations discussed above, based on the presence or absence of definite and indefinite articles. The three language pairs we choose are:

1. (Dutch, German), in which both languages use definite and indefinite articles14

2. (Estonian, Finnish), in which neither language uses articles.

3. (German, Finnish) in which only the first language uses articles.

Most of the languages considered in the prior section on constituent word order are of type (1).

Prediction [P5a, b, c, d] We predict that language pairs in which both languages have articles would benefit more from gloss replacement of frequent words than would language pairs in which one or both languages lack articles.

Given these language pair choices, we would therefore expect the following:

A. We predict that delta L1 LD-ify on (Dutch, German) > delta L1 LD-ify on (German, Finnish)15.

B. We predict that delta L1 LD-ify on (Dutch, German) > delta L1 LD-ify on (Estonian, Finnish), simply because there are more of these fairly frequent words which are now being accurately tagged for (Dutch, German).

14 There is a difference in granularity of these articles, but given the low granularity of our tagset, this should not be a concern.
15 Note that since Finnish is the target language and lacks regular articles, the problem discussed below in section 5.5, of many LD articles being out-of-vocabulary, is not a confounding issue here.


C. We predict that delta L1 HD-ify on (Dutch, German) > delta L1 HD-ify on (German, Finnish).

D. We predict that delta L1 HD-ify on (Dutch, German) > delta L1 HD-ify on (Estonian, Finnish).

5.5 The Granularity of Definite and Indefinite Articles

While the previous section considered the mere presence or absence of articles in a language, this section considers the granularity of those articles. In some languages, articles may be marked for case, gender, and number. Markings for case (in articles and nouns), in particular, would aid greatly within the contextual model. For instance, the accusative case would indicate an impending accusative noun, as well as a likely end of sentence. If both languages in a language pair make use of such case markings, then we can transfer a more precise contextual model. If one does while the other does not, then it depends on the directionality of transfer. If the target language does not possess differentiation for case, then we can strip out such information. If the target language does possess differentiation for case, but the source language does not, it is non-trivial to reconstruct this information.

We inspected the tags assigned by TreeTagger to languages with high lexical granularity of articles (e.g. German), and noted that unfortunately the tag granularity does not match the lexical granularity (articles are simply marked as DET or ARTICLE), so such an investigation does not seem possible.

We would, however, expect that this difference in lexical granularity could have a significant impact on the lexical model, particularly in the LDification approach. Consider the following instructive example of article mapping from a language with high lexical granularity for articles (German) to one with low lexical granularity for articles (English).


Figure 21: Projecting articles from a high granularity language to a low granularity language

If employing an HD-ification strategy, say we take a German input sentence, replace all instances of German dem, der, and den with English the, and then tag with an HD English tagger. This strategy will succeed, because the is in-vocabulary in the HD lexical model, whereas dem, der, and den were not.
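This many-to-one collapse is mechanically simple. A minimal sketch (the article mapping below is illustrative, not the full resource used in the experiments):

```python
def hd_ify(sentence, mapping):
    """L1 HD-ify: gloss-replace LD words with HD equivalents before handing
    the sentence to the 3rd-party HD tagger."""
    return [mapping.get(word.lower(), word) for word in sentence]

# Many-to-one: German case-marked articles all collapse onto English "the",
# which is guaranteed to be in-vocabulary for the HD English tagger.
de_to_en = {"dem": "the", "der": "the", "den": "the"}
print(hd_ify(["der", "Hund", "sieht", "den", "Mann"], de_to_en))
# -> ['the', 'Hund', 'sieht', 'the', 'Mann']
```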

However, if employing an LD-ification strategy, then a different mapping is in play.

Recall that the LD-ification approach is to take the HD English tagged corpus, conduct gloss-replacements of certain English words with their German equivalents, train a tagger on that modified corpus, and then use that tagger on German input sentences. But consider the ambiguity of the mapping:


Figure 22: Projecting articles from a low granularity language to a high granularity language

Here, the English word the could theoretically map to German dem, der, or den, and we can only choose a single German determiner to replace the English determiner. If we select der, then the word der will be in-vocabulary in the newly trained tagger, but dem and den will be out-of-vocabulary. (A more complex replacement model could anticipate this problem and solve it by alternating through the mappings, replacing the with dem, der, and den in turn, or by duplicating the sentence in the corpus in order to give each determiner a chance. However, in the simple one-to-one mapping assumed for our language-ifications, this difference in lexical granularity of determiners will present a problem.)
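The alternating strategy mentioned parenthetically above could be sketched as follows. This is a hypothetical alternative, not one of the language-ifications actually used, which assume a simple one-to-one mapping:

```python
from itertools import cycle

def ld_ify_round_robin(tagged_sents, one_to_many):
    """LD-ify a tagged HD corpus where one HD word has several LD forms,
    cycling through the candidates so that each lands in-vocabulary."""
    cyclers = {w: cycle(forms) for w, forms in one_to_many.items()}
    return [[(next(cyclers[w.lower()]) if w.lower() in cyclers else w, t)
             for w, t in sent]
            for sent in tagged_sents]

en_to_de = {"the": ["dem", "der", "den"]}  # one-to-many article mapping
corpus = [[("the", "DET"), ("dog", "NOUN")],
          [("the", "DET"), ("cat", "NOUN")],
          [("the", "DET"), ("man", "NOUN")]]
ld = ld_ify_round_robin(corpus, en_to_de)
# All three German articles are now attested in the training corpus, instead
# of only one (with the other two left out-of-vocabulary).
```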

Furthermore, consulting the same table of languages above and selecting languages for their lexical granularity of articles, we would choose the following two language pairs:

1. (English, German), in which the HD language (English) has a low lexical granularity of articles and the LD language (German) has a high lexical granularity of articles.

2. (German, Dutch), in which the HD language (German) has a high lexical granularity of articles and the LD language (Dutch) has a low lexical granularity of articles.


Prediction [P6a, b, c] Thus, we predict that gloss replacement of frequent words, in an LD-ification strategy, would have greater success if the HD language of the language pair has a high lexical granularity of articles and the LD language has a low lexical granularity of articles.

Furthermore, we predict that this difference in success of gloss-replacement of frequent words is limited to this specific scenario. The gloss replacement of frequent words, in an HD-ification strategy, generally performs quite well, and should perform quite well in this instance as well. Further, if the article granularities of the HD and LD languages were reversed, with the HD language having the higher lexical granularity of articles, then HD-ify and LD-ify strategies should be comparable.

Thus, we would make the following concrete predictions:

A. We predict that delta L1 LD-ify on (German, Dutch) > delta L1 LD-ify on (English, German).

B. We predict that delta L1 LD-ify on (German, Dutch) would be more or less comparable to delta L1 HD-ify on (German, Dutch) over its HD baseline.

C. We predict that L1 LD-ify on (English, German) would compare unfavorably to L1 HD-ify on (English, German).

5.6 PRO-Dropness

Pro-dropness (also known as the NULL subject phenomenon) is the tendency to omit pronouns. There are four criteria of a PRO-drop language, as discussed by Geeslin (1999):


1. Overt subjects (where they are pronouns) are optional

2. Expletives (such as “it” and “there”) do not exist

3. Wh-movement does not leave a that-trace

4. In many such languages, there is subject-verb inversion

This will have an obvious impact on the contextual model of a language, but also on the lexical model, as well as efforts to import a model from one language to another. In terms of the contextual model, to take one example, if overt pronouns are optional, then this could impact probabilities such as p(VERB | PRONOUN). And if expletives do not exist, then we will not see expletives represented in the contextual model. Indeed, all these criteria result in different surface words or word orders, which will affect the contextual model. In terms of the lexical model, certain lexical items will not exist. Where the lexical item does exist, the decreased incidence of certain pronouns will change p(word | tag). Therefore, transferring a contextual or lexical model from a PRO-drop language to a non-PRO-drop language, or vice versa, may not be as successful as transferring between two languages which are identical in their PRO-dropness or lack thereof.
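A toy illustration of how PRO-drop shifts contextual-model probabilities. The tag sequences below are invented for the example, with a `<s>` marker standing in for the sentence start:

```python
from collections import Counter

def p_tag_given_prev(tag_seqs, prev, tag):
    """Maximum-likelihood estimate of p(tag | prev) from tag sequences."""
    bigrams, contexts = Counter(), Counter()
    for seq in tag_seqs:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams[(prev, tag)] / contexts[prev]

# Invented tag sequences; "<s>" marks the sentence start.
non_pro_drop = [["<s>", "PRON", "VERB", "NOUN"],
                ["<s>", "PRON", "VERB", "NOUN"]]
pro_drop = [["<s>", "PRON", "VERB", "NOUN"],
            ["<s>", "VERB", "NOUN"]]  # subject pronoun omitted

# In the non-PRO-drop corpus a sentence never opens with a verb; in the
# PRO-drop corpus sentence-initial verbs appear, shifting p(tag | "<s>")
# and lowering the overall incidence of PRON.
```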

One could conceivably design a language-ifier (an LD-ifier) to specifically target this divergence. For instance, inserting or removing expletives, inverting subjects and verbs, or inserting or removing subject pronouns in some HD POS-tagged corpus, prior to training, based on syntactic analysis and grammatical rules determining where they belong. This would require too much linguistic knowledge, however. We are trying to keep our language- ifiers as simple as possible.

Given the simpler language-ification techniques, we would expect that language pairs which are both pro-drop or both not pro-drop would attain greater accuracy than a language pair which differed in its pro-dropness. Therefore, to test this hypothesis, we should select language pairs with the following features, where the pair is (HD source, LD destination):

1. (PRO-drop, PRO-drop); or (not PRO-drop, not PRO-drop)

2. (not PRO-drop, PRO-drop); or (PRO-drop, not PRO-drop)

The following table describes the PRO-dropness of languages in Europarl. The same table describes whether a language is inflected and to what degree. Inflection is actually related to PRO-dropness, as PRO-drop languages are typically highly inflected, while non-PRO-drop languages are not.

Table 4: PRO-drop and Inflection in Europarl languages

Language | Pro-drop? | Inflection
Bulgarian | yes | yes; verbs for person, number, gender, tense, mood, voice
Czech | yes | yes, heavily; verbs for number, gender, voice, tense
Danish | no | no
German | no | moderately
Greek | yes | fully inflected
English | no | just for number and tense
Spanish | yes | number and gender of noun
Estonian | explicit in written language, dropped in spoken | yes
Finnish | only explicit when they need to be inflected | yes
French | no | yes
Hungarian | yes | yes
Italian | yes | yes
Lithuanian | yes | yes (highly)
Latvian | yes | yes (moderately)
Dutch | no | poorly inflected
Polish | yes | yes
Portuguese | yes | yes (highly)
Romanian | yes | yes (definiteness, number, gender, case)
Slovak | yes | yes
Slovene | yes | yes (nouns for case and number)
Swedish | no | weak; has lost most of it

On the basis of this table, we select the following language pairs:


1. (Finnish, German) as (PRO-drop, non-PRO-drop)

2. (English, German) as (non-PRO-drop, non-PRO-drop)

Prediction [P7] We predict that approaches involving wholesale adoption of the HD contextual model (H1) will be most effective for pairs of languages in which both are PRO-drop or both are not PRO-drop, but that if there is a divergence in this language feature, then such an approach will not be as successful.

More concretely, with the language pairs selected, we predict that H1 on (English, German) would have greater accuracy than H1 on (Finnish, German).

5.7 Richness of Inflection

Inflection is the modification of a word to reflect different grammatical categories, and can include conjugation of verbs and declension of nouns, adjectives, and pronouns.

Words can be marked for gender, number, case, mood, voice, and so on. Whether a language is weakly or strongly inflected has repercussions for its contextual and lexical models. In terms of the contextual model, if the granularity of the POS tagset were rich, then these distinctions would be represented in the model: a noun or pronoun in the nominative case could indicate the beginning of a sentence (and might be followed by a verb), while a noun or pronoun in the accusative case could appear at the end of a sentence (and might be followed by a period). In practice, however, the granularity of our part-of-speech tags is not that rich. In terms of the lexical model, greater inflection means more lexical items, and therefore increased sparsity of the data.

In terms of transferring contextual models or lexical models from a source to a target language, divergences in levels of inflection can cause problems. This was previously discussed in section 5.5, in relation to the lexical granularity of articles, but it applies to other word classes, such as nouns and adjectives, as well.

In terms of the contextual model, our shared simplified tagset does not distinguish between different cases of nouns and adjectives. That is, German is a richly inflected language, but the tags assigned by TreeTagger did not account for these different cases. We trained the Finnish TreeTagger and did distinguish in the tagset between different cases. But because the German tagset and other tagsets do not make such a distinction, the shared simplified tagset does not make such a distinction. Since H1 makes use of only the shared simplified tagset, we cannot simply use H1 to assess the impact of more or less richly inflected tagsets. Therefore we will unfortunately have to leave such an investigation for future work.

However, we can still focus on the lexical granularity that such inflection entails and consider the impact this would have on the lexical model and language-ification strategies.

One can imagine at least four different cases involving language-ification strategies and degree of inflection:


Figure 23: Four cases of language-ification strategies and degrees of inflection

In case (A), the HD language (and corpus) is richly inflected, and LD-ification will have the effect of removing inflection information.

In case (B), the HD language (and corpus) is poorly inflected, and LD-ification will have the effect of adding (arbitrarily chosen) inflection information.

Cases (C) and (D) are the same configurations as (A) and (B) but with the arrows reversed: we are HD-ifying LD text prior to tagging it with the HD tagger.

In case (C), the HD language is richly inflected, and HD-ification of the LD text will have the effect of adding (arbitrarily chosen) inflection information.


In case (D), the HD language is poorly inflected, and so HD-ification of the LD text will have the effect of removing inflection information.

We shall focus first on LD-ification, that is, cases (A) and (B). We would expect that case (A) would be a more successful language-ification strategy than case (B).

For instance, if nouns in the target LD language are inflected for case, number, and gender, while nouns in the source language are not, then how will an LD-ification proceed?

If LD-ifying a tagged HD corpus, we cannot know which target case, etc., of the richly inflected LD language to use. Should we use the plural feminine accusative case, or the singular masculine nominative case? We cannot easily determine that from the poorly inflected HD corpus, and we are only making use of simple language-ification strategies.

Here we consider a practical example from the Europarl corpus. The English word "new" is the 76th most frequent word in our English corpus, and is mapped to German "neu". However, due to the rich inflection of German, and depending on whether this adjective modifies a masculine, feminine, neuter, or plural noun, as well as whether the adjective is in the nominative, genitive, dative, or accusative case, there are actually several German words meaning "new". Examples from the German corpus include:

vom Juni 1999 neu geregelt

zur Schaffung neuer Arbeitsplätze

des neuen Jahres das Angebot

als die neue Kommission im Amt war

eines Jahrhunderts zu neuem Leben

ein gutes neues Jahr und Millennium


Some of these lexical items are overloaded – for instance, "neuer" is the masculine nominative, the feminine genitive and dative, and the plural genitive. Still, when transforming the tagged English corpus into a tagged German corpus, only one of these forms will be selected, namely "neu". The remaining German forms will remain out-of-vocabulary.
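The out-of-vocabulary arithmetic here is simple enough to spell out, using the six forms of "neu" attested in the corpus excerpt above:

```python
# The six forms of German "neu" attested in the corpus excerpt above.
german_forms = {"neu", "neue", "neuem", "neuen", "neuer", "neues"}

# A one-to-one L1 gloss table maps English "new" to exactly ONE German form.
gloss = {"new": "neu"}

in_vocabulary = german_forms & set(gloss.values())
out_of_vocabulary = german_forms - set(gloss.values())
print(sorted(out_of_vocabulary))
# -> ['neue', 'neuem', 'neuen', 'neuer', 'neues']
```

Five of the six inflected forms stay out-of-vocabulary no matter which single form the one-to-one mapping selects.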

This multiple mapping may be illustrated by the following figure:

Figure 24: Projection of a poorly inflected language to a richly inflected language

As one may readily observe, the words replaced in an L1 LD-ification strategy would be an arbitrary choice among candidates, and all the other candidates would remain out-of-vocabulary.

Figure 25: Projection of a richly inflected language to a poorly inflected language


However, in the opposite direction, if the HD language is richly inflected while the LD language is not, then LD-ifying the HD corpus simply involves stripping off the case. There will be fewer LD words than HD words, and so all LD words would be in-vocabulary.

We therefore predict that L1 LD-ify on a language pair which is (richly inflected, poorly inflected) would have a greater increase over its baseline than on a language pair which is (poorly inflected, richly inflected).

Another LD-ification strategy is L2 LD-ify. Here, we take a tagged HD corpus, perform gloss replacement of the HD affixes with their LD equivalents, and train on the resulting LD-ified corpus. This replacement is done automatically, with no attention paid to whether the resulting word is actually an LD word. In some cases, we might arrive at an actual LD word, and bring the entire LD word into vocabulary. But aside from this, since TreeTagger builds affix trees, this replacement brings the affixes and their associated tags into vocabulary.
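A minimal sketch of such mechanical suffix replacement. The Dutch-to-German suffix pairs below are hypothetical examples, and, as noted, the output need not be a real LD word:

```python
def ld_ify_affixes(tagged_sents, suffix_map):
    """L2 LD-ify: mechanically swap HD suffixes for LD suffixes, longest
    match first, without checking that the result is a real LD word."""
    suffixes = sorted(suffix_map, key=len, reverse=True)

    def replace(word):
        for s in suffixes:
            if word.endswith(s):
                return word[:-len(s)] + suffix_map[s]
        return word

    return [[(replace(w), t) for w, t in sent] for sent in tagged_sents]

# Hypothetical Dutch -> German suffix substitutions.
suffix_map = {"heid": "heit", "lijk": "lich"}
corpus = [[("vrijheid", "NOUN"), ("is", "VERB"), ("mogelijk", "ADJ")]]
print(ld_ify_affixes(corpus, suffix_map))
# -> [[('vrijheit', 'NOUN'), ('is', 'VERB'), ('mogelich', 'ADJ')]]
# Note: "vrijheit" and "mogelich" are not real German words; the point is
# that the LD-flavored suffixes (with their tags) enter the affix trees.
```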

Once again, we predict that affix replacements which are many-to-one (many HD affixes become a single LD affix) would perform better over their baseline than affix replacements which are one-to-many (one HD affix which could ambiguously be assigned to several different LD affixes, of which we can only select one).

Another LD-ification strategy is L4 LD-ify. Here, we take a tagged HD corpus, perform gloss replacement of the recognized HD cognates with their LD equivalents, and train on the resulting LD-ified corpus. This cognate detection is done automatically, with no attention paid to whether the words are actually cognates of one another, such that we might be introducing noise. Once again, if the LD language is richly inflected while the HD language is poorly inflected, then there might be multiple candidate cognate words, of which only one can be selected. Indeed, these cognate candidates might well be the words with various case endings, where only one can be selected. Meanwhile, in the opposite direction, for any HD word, the possible valid LD replacements are more limited.

Once again, we predict that cognate replacements that are many-to-one would perform better over their baseline than cognate replacements that are one-to-many.

To test this out, we consult the preceding table, and we select the following two language pairs, which are reflections of one another:

1. (German, Dutch) as (richly inflected, poorly inflected)

2. (Dutch, German) as (poorly inflected, richly inflected)

Prediction [P8a, b, c] With these two language pairs selected, we can test out the prediction regarding the LD-ification cases: namely, that LD-ification of a richly inflected language towards a poorly inflected language will be more successful than LD-ification of a poorly inflected language towards a richly inflected language.

We make the following concrete predictions:

A. We predict that delta L1 LD-ify for (German, Dutch) > delta L1 LD-ify on (Dutch, German).

B. We predict that delta L2 LD-ify for (German, Dutch) > delta L2 LD-ify on (Dutch, German).

C. We predict that delta L4 LD-ify for (German, Dutch) > delta L4 LD-ify on (Dutch, German).


Figure 26: Once again, four cases of language-ification strategies and degrees of inflection

Turning now towards the HD-ification strategy, namely cases (C) and (D), in which we reverse the direction of the arrows from cases (A) and (B), we would expect a reversal of the relative success of language pairs as well.

That is, while for LD-ification, (A) > (B), for HD-ification, (C) < (D). Put yet another way, in general, language-ification in the direction of the poorly inflected language should be more successful than language-ification in the direction of the richly inflected language.

The issue of concern for LD-ification was words remaining out-of-vocabulary. That is not an issue for HD-ification, since any of the (arbitrarily chosen) gloss replacements can, with equal likelihood, be in vocabulary. Rather, the issue of concern for HD-ification is noise. That is, if the HD language is richly inflected while the LD language is poorly inflected, then there is more opportunity for noise. L2, L3, and L4 gloss replacements are performed automatically, and the noisier the target space, the less likely we are to arrive at the correct replacement. And since we can only select one word here as the replacement, which the 3rd-party HD tagger then simply treats using its lexical model, a wrong choice can be disastrous for that particular word, and then for the surrounding context. (This is not relevant for L1, in which the gloss replacements of frequent words were generated by a human being. Those words will likely be in-vocabulary, and will be in the correct part of speech. The out-of-vocabulary issue is relevant only to LD-ification.) Meanwhile, in the direction of the poorly inflected language, the same inflection (or no inflection) may be shared across different parts of speech, so a wrong choice is both less likely and less disastrous.

For HD-ification, we will select the same two language pairs as we did above for LD-ification:

1. (German, Dutch) as (richly inflected, poorly inflected)

2. (Dutch, German) as (poorly inflected, richly inflected)

Prediction [P9a, b, c] With these two language pairs selected, we can test out the prediction regarding the HD-ification cases: namely, that HD-ification of a richly inflected language towards a poorly inflected language will be more successful than HD-ification of a poorly inflected language towards a richly inflected language.


More concretely, we make the following three predictions about HD-ification between languages differing in their level of inflection:

A. We predict that delta L2 HD-ify for (Dutch, German) > delta L2 HD-ify on (German, Dutch).

B. We predict that delta L3 HD-ify for (Dutch, German) > delta L3 HD-ify on (German, Dutch).

C. We predict that delta L4 HD-ify for (Dutch, German) > delta L4 HD-ify on (German, Dutch).

5.8 Language Pair Selections

We have thus considered linguistic features of the candidate languages available in the Europarl corpus. We have considered, and predicted, how the absence, presence, or disparity of these features in a language pair might impact the success of a given strategy.

Finally, we have selected languages and language pairs in such manner that we can test our predictions.

The following are our language pair choices, together with the sections in which we discuss why we selected the particular language pair. Note that a language pair might well satisfy criteria listed in other sections, even if we did not present it as an exemplar there:

1. (Dutch, German) – 5.1 [P1a, P2a, c], 5.4 [P5a-d], 5.7 [P8a-c; P9a-c]

2. (German, Dutch) – 5.1 [P2b, d], 5.2 [P3a, c], 5.5 [P6a, b], 5.7 [P8a-c, P9a-c]

3. (Finnish, German) – 5.1 [P2a-d] , 5.6 [P7]

4. (German, Finnish) – 5.2 [P3a, b, d], 5.4 [P5a, c]

5. (Estonian, Finnish) – 5.2 [P3b], 5.4 [P5b, d]


6. (French, German) – 5.1 [P1b], 5.3 [P4a, b]

7. (English, German) – 5.1 [P1a, b], 5.2 [P3c, d], 5.3 [P4a, b], 5.5 [P6a, c], 5.6 [P7]


Chapter 6: Results Discussion

The consideration in the previous chapter was of the linguistic features of the languages in Europarl and how they might affect the success of various projection and tagging strategies. This consideration led us to select representative language pairs and, with specific language pairs chosen, to make concrete predictions about the relative success of language projection strategies. For example, since adjectives are premodifiers of nouns in English and German while adjectives are mostly postmodifiers of nouns in French, we predict that H1 on (English, German) > H1 on (French, German).

Figure 27: In each case, the arrow points to the LD language and the arrow length indicates linguistic distance (roughly approximated by this author) between the languages

To enable the testing of our various linguistically driven predictions, we selected the following seven language pairs. In each instance, the first language in the language pair is the HD and the second is the LD.


1. (Dutch, German)

2. (German, Dutch)

3. (Finnish, German)

4. (German, Finnish)

5. (Estonian, Finnish)

6. (French, German)

7. (English, German)

These particular language pair choices, when considered as a whole, have some useful features:

i) German is the LD target of several HD languages at differing distances.

ii) (German, Dutch) and (Dutch, German) form a case of closely related languages where each language can function as the HD and the LD in an LD-ification and HD-ification strategy.

iii) (German, Finnish) and (Finnish, German) form a case of unrelated languages where each language can function as the HD and the LD in an LD-ification and HD-ification strategy.

iv) Finnish can serve as an LD target for both a close language, Estonian, and for a distant language, German.

With those languages and language pairs selected, we then set out to the task of assembling the linguistic resources required to comprehensively test every possible combination of language pair and tagging strategy. Even if, e.g., we wouldn't expect any interesting result from applying L2 HD-ify to (German, Finnish), we still assembled the necessary resource, by taking an automatically generated list of Finnish affixes and analyzing our Finnish → German frequent word dictionary to see if we could spot regular and unambiguous affix substitutions.

6.1 Results Tables

We implemented each projection and tagging strategy in Python, building upon the Natural Language Toolkit, which is also implemented in Python. Then, we ran experiments for each combination of strategy and language pair. The results of these extensive experiments are in the table below.

Table 5 details the results of traditional approaches run on all of our selected language pairs:

APPROACHES TABLE: RESULTS (as % correct)

Approach | Dutch/German | French/German | English/German | Finnish/German | German/Dutch | German/Finnish | Estonian/Finnish

Traditional Approaches
D1 – Apply 3rd-party HD tagger to LD text | 43.5% | 35.1% | 36.6% | 40.6% | 35.5% | 51.8% | 59.5%
D2 – Upper baseline: apply 3rd-party LD tagger to LD text | 96.6% | 96.6% | 96.6% | 96.6% | 96.2% | 91.2% | 91.2%
D3 – Unsupervised training on LD corpus | 24.3% | 24.3% | 24.3% | 24.3% | 25.2% | 43.6% | 42.7%
D4 – Unsupervised training on LD, with noisy lexicon | 22.1% | 20.8% | 20.8% | 20.7% | 21.7% | 41.7% | 42.4%
D5 – Same as D4, but filter for noise | 21.9% | 21.7% | 21.7% | 21.2% | 21.7% | 42.0% | 42.0%
D6 – Same as D4, but incorporate affix model | 22.1% | 20.7% | 20.8% | 20.7% | 22.2% | 44.3% | 44.3%
H1 – Using LD lexicon, uniform lexical model | 82.1% | 77.6% | 79.3% | 77.9% | 84.9% | 62.2% | 83.8%
H2 – Same as H1, but for cognates, average with HD's lexical model | 80.5% | 77.3% | 74.2% | 57.3% | 82.1% | 44.2% | 70.9%

Table 5: Results, traditional approaches

To quickly recap, for each of these columns, the first listed language is the HD and the second listed language is the LD. Thus, for the Dutch/German column, the HD is Dutch and the LD is German. The rows D1 through D6 represent the techniques described in Duh and Kirchoff (2005).


1. D1, simple application of an HD tagger to an LD text, is also the baseline for our language-ification strategies.

2. D2 is supervised training on an LD tagged corpus, and forms an upper baseline for supervised approaches. (The cells in row D2 are merged because only the LD language matters, with the HD not entering the picture at all.)

3. D3 is unsupervised training on an LD corpus, and forms an upper baseline for unsupervised approaches. In order to attempt to bootstrap this unsupervised learning, we began with the contextual model of the HD language and an LD lexicon.

4. D4 is unsupervised training on an LD corpus, using a noisy lexicon matching lexical items from the LD with tags assigned by an HD analyzer.

5. D5 is an attempt to reduce the noise of D4 based on distributional criteria, prior to unsupervised training.

6. D6 is also unsupervised training, but incorporates an affix model.
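The noisy-lexicon construction behind D4 can be sketched as follows. The dictionary-backed "analyzer" below is a hypothetical stand-in for a real HD morphological analyzer, and the words are illustrative:

```python
def noisy_lexicon(ld_words, hd_analyzer):
    """D4-style noisy lexicon: each LD word receives whatever tags an HD
    analyzer assigns to it -- noisy, since the word may be out of the HD's
    vocabulary, or may be a false friend with different tags."""
    lexicon = {}
    for w in ld_words:
        tags = hd_analyzer(w)
        if tags:               # keep only words the analyzer recognizes
            lexicon[w] = tags
    return lexicon

# toy HD "analyzer": a dictionary lookup standing in for a real tool
HD = {"de": {"DET"}, "water": {"N"}, "lopen": {"V", "N"}}
lex = noisy_lexicon(["de", "water", "lopen", "xyzzy"], lambda w: HD.get(w))
print(sorted(lex))  # → ['de', 'lopen', 'water']
```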

In general, the unsupervised models did not converge, perhaps because of the noise of the initial tagging, an unsatisfactory initial starting configuration, or because the connection between the languages in a language pair is not as close as that between two dialects of Arabic. That is, unsupervised training via EM does not find good HMM POS-taggers; see Ali (2008) and Johnson (2007). Where Duh and Kirchoff (2005) did succeed, it was with two dialects of Arabic, such that there were far fewer unanalyzable or out-of-vocabulary words than when dealing with two distinct languages.

The rows for H1 and H2 represent the techniques described in Hana et al. (2004) and (2006), which are both supervised approaches. H1 is the wholesale adoption of the HD contextual model, paired with a lexical model built from an LD dictionary, with equal weight assigned to each possible tag of a word in the dictionary. H2 is the same, except that for recognized cognates, this weight is averaged with the weights assigned in the lexical model of the HD. Because our cognate detection was performed automatically rather than manually, instead of introducing clarity into the weights of the LD model (saying, e.g., that word X is almost always a determiner and only very occasionally a pronoun), it introduced noise due to false cognates (saying, e.g., that word X is often an adjective, even though in reality it is an adverb). This pulled down the overall tagging accuracy.
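A minimal sketch of the H1 and H2 lexical models just described; the words, tags, and weights are toy data for illustration:

```python
def h1_lexical_model(ld_dictionary):
    """H1: a lexical model from an LD dictionary, assigning equal weight
    to each possible tag listed for a word."""
    return {w: {t: 1.0 / len(tags) for t in tags}
            for w, tags in ld_dictionary.items()}

def h2_lexical_model(ld_dictionary, hd_lexical, cognates):
    """H2: as H1, but for recognized cognates the uniform LD weights are
    averaged with the HD tagger's lexical weights."""
    model = h1_lexical_model(ld_dictionary)
    for ld_w, hd_w in cognates.items():
        if ld_w in model and hd_w in hd_lexical:
            hd = hd_lexical[hd_w]
            model[ld_w] = {t: 0.5 * model[ld_w].get(t, 0.0) + 0.5 * hd.get(t, 0.0)
                           for t in set(model[ld_w]) | set(hd)}
    return model

# hypothetical toy data: an LD word ambiguous between DET and PRON,
# whose HD cognate is almost always a determiner
ld_dict = {"das": ["DET", "PRON"]}
hd_lex = {"dat": {"DET": 0.9, "PRON": 0.1}}
h1 = h1_lexical_model(ld_dict)
h2 = h2_lexical_model(ld_dict, hd_lex, {"das": "dat"})
print(h1["das"])  # uniform weights: 0.5 DET, 0.5 PRON
print(h2["das"])  # averaged weights: 0.7 DET, 0.3 PRON
```

A false cognate mapping would pull these averaged weights toward the wrong tags, which is exactly the noise described above.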

Table 6 details the results of HD-ification approaches run on all of our selected language pairs.

HD-ification Approaches: Modify the test corpus to resemble the HD language, and tag with the HD tagger (all rows except the baseline show deltas from the baseline)

| Approach | (Dutch, German) | (French, German) | (English, German) | (Finnish, German) | (German, Dutch) | (German, Finnish) | (Estonian, Finnish) |
|---|---|---|---|---|---|---|---|
| Baseline 3rd party tagger | 43.5% | 35.1% | 36.6% | 40.6% | 35.5% | 51.8% | 59.5% |
| L1 – HD-ify – frequent word replacement | 25.8% | 32.6% | 29.6% | 12.3% | 35.1% | 7.7% | 6.8% |
| L2 – HD-ify – affix replacement | 1.2% | 0.8% | 0.5% | 0.0% | 0.8% | -7.6% | 1.1% |
| L3 – HD-ify – exemplar replacement | 3.1% | 1.6% | -8.8% | 0.7% | 1.0% | -0.2% | 1.8% |
| L4 – HD-ify – cognate replacement | 26.4% | -1.9% | -2.0% | -1.9% | 6.8% | 0.1% | -4.2% |

Table 6: HD-ification results

The top row, labeled Baseline 3rd party tagger, is a repetition of the D1 row from Table 5, and represents simple application of the 3rd-party HD tagger to the LD language. The remaining rows show the delta – the change from that baseline – due to application of a gloss replacement technique.


1. In general, for the HD-ify case, L1 (replacement of the most frequent LD words based on a manually-constructed dictionary) worked best at bringing LD words into vocabulary for the HD tagger to tag.

2. L2 (specifically for words unrecognized by the HD tagger, replacement of the longest recognized LD prefix and suffix with the HD equivalent, specifically where the resulting word was a known HD word) comparatively had limited impact.

3. L3 (making use of the same prefix and suffix dictionaries, recognizing the longest prefix and suffix, projecting to the HD prefix and suffix, finding the most common tag for that combination, and selecting an exemplar – an HD word unambiguously tagged with this majority tag) also had somewhat limited impact, but often performed slightly better than L2.

4. L4 (replacement of automatically recognized cognates) generally performed poorly for distant languages, for which there were fewer true cognates and more noise in the replacement list. However, for Dutch and German, this strategy performed nicely.
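The L1 HD-ification step above amounts to a simple dictionary substitution over the LD tokens before handing them to the HD tagger. A sketch, with hypothetical German-to-Dutch gloss entries for frequent words:

```python
def hdify_l1(ld_tokens, gloss_dict):
    """L1 HD-ify: replace the most frequent LD words with their HD glosses
    so the 3rd-party HD tagger finds them in vocabulary; all other tokens
    pass through unchanged (matching is case-insensitive here)."""
    return [gloss_dict.get(tok.lower(), tok) for tok in ld_tokens]

# hypothetical German(LD) → Dutch(HD) gloss entries
gloss = {"der": "de", "und": "en", "nicht": "niet"}
print(hdify_l1(["der", "Hund", "und", "nicht"], gloss))
# → ['de', 'Hund', 'en', 'niet']
```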


Table 7 details the results of LD-ification approaches run on all of our selected language pairs.

LD-ification Approaches: Take a smallish HD corpus (tagged by the 3rd-party HD tagger), transform its words to resemble the LD language, and conduct supervised training (all rows except the baseline show deltas from the baseline)

| Approach | Dutch/German | French/German | English/German | Finnish/German | German/Dutch | German/Finnish | Estonian/Finnish |
|---|---|---|---|---|---|---|---|
| Baseline 3rd party tagger | 50.0% | 34.9% | 36.4% | 33.3% | 33.7% | 42.2% | 60.7% |
| L1 – LD-ify – frequent word replacement | 14.3% | 0.4% | -1.1% | -3.5% | 32.3% | 8.1% | 3.2% |
| L2 – LD-ify – affix replacement | -20.3% | -10.8% | -11.7% | -7.8% | 9.8% | -1.2% | 5.1% |
| L4 – LD-ify – cognate replacement | -7.5% | -4.4% | -4.9% | -0.1% | 4.1% | 3.7% | -2.2% |

Table 7: LD-ification results

The Baseline 3rd party tagger row in this LD-ify table differs from the baseline in the HD-ify table. Since in the LD-ify case we are training on a much smaller corpus (rather than the original, inaccessible corpus initially used to construct the HD tagger) and using the simplified shared tagset, the LD baseline is also trained on that smaller corpus. The remaining rows show the delta – the change from that baseline – due to application of a gloss replacement technique.

1. L1 (gloss replacement of frequent words), where it is successful, is here the clear winner, though in some places the technique fails.

2. L2 (replacement of the longest HD prefix and suffix with the LD equivalent, regardless of whether the resulting word is in vocabulary) often fails. However, unlike in the HD case, here it occasionally succeeds.

3. L4 (cognate replacement) is not as successful in the LD-ify case as in the HD-ify case.
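The LD-ification recipe behind Table 7 can be sketched as a three-step pipeline: machine-tag a smallish HD corpus, transform its words toward the LD, then train on the result. All the stand-in functions below are toys for illustration:

```python
def ldify_pipeline(hd_tokens, hd_tagger, ldify, train):
    """LD-ification: tag a smallish HD corpus with the 3rd-party HD
    tagger, transform each word toward the LD, then conduct supervised
    training on the transformed, machine-tagged corpus."""
    tagged = [(w, hd_tagger(w)) for w in hd_tokens]   # machine-tagged HD text
    ldified = [(ldify(w), t) for w, t in tagged]      # words now resemble the LD
    return train(ldified)                             # train a new tagger on it

# toy stand-ins (all hypothetical): a one-tag "tagger", an uppercasing
# "transformation", and an identity "trainer" that returns the corpus
model = ldify_pipeline(["hond"], lambda w: "N", str.upper, lambda c: c)
print(model)  # → [('HOND', 'N')]
```

Note that the tags come from machine tagging, so any tagger error propagates into the training data, which is one reason the LD-ify baseline sits below the HD-ify baseline.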


6.2 Revisiting the Predictions

That was a quick introduction to the rows and columns of the results tables. However, in the previous chapter we had made general and concrete predictions, driven by linguistic features of the language pair. With these results tables in place, we can now reexamine the predictions we made in the previous chapter and discover if and how they were borne out.

At the end of each reexamination of our predictions, we will consider how our results can serve as a guideline for researchers approaching a completely new language pair who wish to use the techniques in this dissertation.

A caveat is in order here. While we believe that our results were quite interesting and indicative of how various strategies might perform given linguistic features of a language pair, we realize that our study is by no means comprehensive. We have, after all, only considered seven language pairs. Application of these approaches to many more language pairs in concert with consideration of the linguistic features of those language pairs would add a greater level of confidence to our results.

Also, for the sake of clarity, we will repeat the subheadings of chapter 5 here, with identical numbering, so that the linguistic motivators for the concrete predictions may be made clearer, and so that the reader may flip back to the relevant subsection of the previous chapter to read more about the topic. Thus, for example, section 6.2.1, Language Families Predictions, corresponds to section 5.1, Language Families, and section 6.2.2, Constituent Word Order Predictions, corresponds to section 5.2, Constituent Word Order.


6.2.1 Language Families Predictions

Prediction [P1a, b]

You may recall that our first prediction was that simple application of an HD tagger to LD text (the D1 strategy) will perform best for pairs of languages with phylogenetic or historical cultural connections.

We had selected (Dutch, German), (English, German) and (French, German) as our language pairs, where in each successive language pair, the phylogenetic distance was greater.

Upon examining the relevant cells in the D1 row, we discover that our predictions were borne out. Specifically:

[Bar chart: "Apply HD Tagger to LD Language" – (Dutch, German) 43.5%, (English, German) 36.6%, (French, German) 35.1%]

Figure 28: Prediction 1 results

1. D1 (Dutch, German) [43.5%] > D1 (English, German) [36.6%].

2. D1 (English, German) [36.6%] > D1 (French, German) [35.1%].

Thus, it does indeed appear that a closer relationship between paired languages predicts greater accuracy in even simple application of the HD tagger to the LD language.

However, looking through the rest of the data, we discovered an interesting apparent counterpoint, namely using Finnish as the HD and once again German as the LD. Finnish is even more distant than French, so we would expect the accuracy to be lower. In this case, however, the accuracy was 40.6%, which, while lower than with Dutch as the HD, was higher than with either English or French as the HD!

Upon investigation, we discovered that this was due to noise. Namely, we had trained this TreeTagger on the Finnish TreeBank, and we discovered that the Finnish TreeBank incorporated a few foreign-language snippets. For example, there were a few instances of the following text: Nationale Maatschappij der Belgische Spoorwegen, which is the National Railway Company of Belgium:

Nationale N Nom Sg Cap

Maatschappij NON-TWOL Cap

der N ART Forgn

Belgische NON-TWOL Cap

Spoorwegen NON-TWOL Cap

While der was marked as a Foreign word, to cut down on the number of tags, I had taken only the concatenation of the first two tag elements, N_ART. And so, der was recognized as a determiner.
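The truncation just described can be sketched directly: the tag elements are split on whitespace and only the first two are kept, which is how the Forgn (foreign) marker is lost.

```python
def simplify_tag(full_tag):
    """Collapse a multi-element TreeBank tag to the concatenation of its
    first two elements, as described above. Note how 'der' loses its
    Forgn marker and surfaces as an ordinary determiner tag."""
    return "_".join(full_tag.split()[:2])

print(simplify_tag("N ART Forgn"))        # → N_ART
print(simplify_tag("N Prop Nom Sg Cap"))  # → N_Prop
```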

And there was, e.g., the German text Landesversicherungsanstalt für das Saarland (National Insurance Institute for Saarland), marked as follows:

Landesversicherungsanstalt NON-TWOL Cap

für NON-TWOL

das N ART Forgn

Saarland N Prop Nom Sg Cap


And so, das was also recognized as a determiner. Thus, a few important and frequent German words, such as the German determiners, were in-vocabulary, something which was not the case for the other languages' TreeTaggers. Looking at the actually assigned tags, these German determiners were indeed being tagged accurately, something which would not have occurred had there not been foreign-language text within the Finnish TreeBank.

For the LD-ification baseline, we had used the already-created 3rd-party HD taggers to tag a limited corpus within their own language, and then trained a new TreeTagger upon that limited corpus. A fortunate side effect of this was to exclude any German words from the LD-ification baseline Finnish tagger, because no German sentences appear in the Europarl Finnish corpus upon which we trained. Looking at that lower row in the table, we indeed see all the relative accuracies we would expect, in descending order based on proximity to German, namely Dutch [50.0%] > English [36.4%] > French [34.9%] > Finnish [33.3%].

[Bar chart: "Apply HD Tagger (trained on limited data) to German LD" – Dutch 50.0%, English 36.4%, French 34.9%, Finnish 33.3%]

Figure 29: Prediction 1 results, using LD-ify baseline

Bob using Prediction 1

Of what use is this to "Bob," who hopes to build a tagging solution for the LD language? Practically, it means that when seeking an HD language from which to project, it might pay to seek out the HD language with the closest phylogenetic or historical cultural connections to the LD language. This is, after all, the starting point for any projection strategy, whether simple or complex.

Prediction [P2a, b, c, d]

As you may recall, the second prediction was that low-density NLP projection approaches based on cognate-identification techniques (H2 and L4 HD-ify) would perform best for pairs of languages with phylogenetic or historical cultural connections. Such language pairs could be expected to have more true cognates, and so cognate-identification replacement would be more common and more accurate.

We had selected the following language pairs to test this prediction: (Dutch, German), (German, Dutch), and (Finnish, German), where Dutch and German are very closely related Germanic languages and Finnish and German are unrelated languages. Concretely, we predicted that H2 of (German, Dutch) and H2 of (Dutch, German) would each outperform H2 of (Finnish, German) as a tagging approach. Further, we predicted that the HD-ification strategy L4 on (German, Dutch) and on (Dutch, German) would each be more successful than L4 on (Finnish, German).

Recall that the way we measure the success of a language-ification strategy is as delta from the baseline of that particular language pair, so the delta of HD-ify L4 (Dutch, German) is measured as compared with the HD-ify baseline of (Dutch, German).

These concrete predictions were borne out. Specifically:

A. Delta L4 HD-ify on (Dutch, German) [+26.4%] > delta L4 HD-ify on (Finnish, German) [-1.9%]

B. Delta L4 HD-ify on (German, Dutch) [+6.8%] > delta L4 HD-ify on (Finnish, German) [-1.9%]

C. H2 on (Dutch, German) [80.5%] > H2 on (Finnish, German) [57.3%]

D. H2 on (German, Dutch) [82.1%] > H2 on (Finnish, German) [57.3%]

[Bar charts: "L4 HDify, change from baseline" and "H2" results for (Dutch, German), (German, Dutch), (Finnish, German)]

Figure 30: Prediction 2 results

In general, when considering L4 HD-ify across all language pairs, it was only for German and Dutch, as closely related languages, that cognate replacement made noticeable improvements. For the remaining language pairs, cognate replacement made negligible improvement or caused harm. Recall that cognate detection was performed automatically, and replacement of a word by a false cognate introduces noise, which can harm the tagging effort.

106

[Bar chart: L4 HD-ify, change from baseline, across all seven language pairs]

Figure 31: L4 HD-ify results

Bob using Prediction 2

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? Practically, it means that if Bob plans to make use of simple cognate identification and gloss replacement to bridge the linguistic gap between his HD and LD, then it pays to seek out the HD language with the closest phylogenetic or historical cultural connections to the LD language.

Notably, the languages had to be extremely close in order for a low-quality cognate identification tool to help. If Bob wishes to use a cognate replacement strategy for more distant languages, then he will need to invest more time in developing the cognate tool – perhaps something hand-crafted after studying the two languages to see in particular how cognates transform from one language to the other; or perhaps something statistically trained on true cognate lists. Otherwise, using this strategy for other language-pairs would not be a good idea, and Bob should seek out a different language-ification strategy.
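As one hypothetical starting point for such a tool, here is a normalized edit-distance matcher of the kind of low-quality automatic cognate identification discussed above; the vocabulary and threshold are illustrative:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def likely_cognates(ld_vocab, hd_vocab, max_ratio=0.5):
    """Pair each LD word with its nearest HD word, keeping the pair only
    when the normalized edit distance falls at or below the threshold.
    Cheap, and therefore noisy for distant pairs, as the results show."""
    pairs = {}
    for w in ld_vocab:
        best = min(hd_vocab, key=lambda h: edit_distance(w.lower(), h.lower()))
        d = edit_distance(w.lower(), best.lower())
        if d / max(len(w), len(best)) <= max_ratio:
            pairs[w] = best
    return pairs

# Dutch "huis" ~ German "Haus" (house): distance 2 over length 4 passes
print(likely_cognates({"huis"}, {"Haus", "Pferd"}))  # → {'huis': 'Haus'}
```

A hand-crafted or statistically trained transformation model, as suggested above, would replace the raw edit distance with language-pair-specific substitution costs.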


6.2.2 Constituent Word Order Predictions

Prediction [P3a, b, c, d]

As you may recall, the third prediction was that low-density NLP projection approaches based on simple adoption of the HD contextual model (H1) will be most effective for pairs of languages that are closest in their fundamental word order (e.g., SVO, SOV, etc.).

We selected four language pairs:

1. (German, Dutch), which are both SVO in the main clause and SOV in the subordinate clause

2. (German, Finnish), which greatly differ, since Finnish has relatively free word order

3. (Estonian, Finnish), which both have relatively free word order

4. (English, German), where English has SVO in both the main and subordinate clause

Concretely, we predicted that the closer the constituent word order of the language paired with German, the more successful the approach would be; therefore H1 on (German, Dutch) > H1 on (English, German) > H1 on (German, Finnish). Further, the closer the constituent word order of the language paired with Finnish, the more successful the approach would be; therefore, H1 on (Estonian, Finnish) > H1 on (German, Finnish).

These predictions were in fact borne out.


[Bar chart: "H1 (HD Contextual, LD Lexicon)" – (German, Dutch) 84.9%, (Estonian, Finnish) 83.8%, (English, German) 79.3%, (German, Finnish) 62.2%]

Figure 32: Prediction 3 results

A. H1 on (German, Dutch) [84.9%] > H1 on (German, Finnish) [62.2%]

B. H1 on (Estonian, Finnish) [83.8%] > H1 on (German, Finnish) [62.2%]

C. H1 on (German, Dutch) [84.9%] > H1 on (English, German) [79.3%]

D. H1 on (English, German) [79.3%] > H1 on (German, Finnish) [62.2%]

Also, while this comparison of H1 results is instructive for someone contemplating the H1 approach, with its particular demands on linguistic resources, it is instructive for someone contemplating the various language-ification approaches as well. That is, the H1 results reveal the similarity of the contextual models of the language pair (since the LD lexical model, based on an LD dictionary, serves as a sort of oracle and may be considered of more-or-less equal use in each case), and that similarity might contribute to greater overall success.

Bob using Prediction 3

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? Practically, it means that if Bob wishes to use the H1 approach (or the H2 approach, which is built upon it), then he would do well to seek out an HD language which shares constituent word order with his LD language.

Also, while this is by no means absolute, for the reasons discussed above, even when contemplating other projection approaches (such as L1), it may pay to seek out an HD language which resembles the LD language in constituent word order.

Bob might also extrapolate to other language features, other than constituent word order, which would impact the contextual model of a language, and seek an HD language which resembles the LD language in that regard.

6.2.3 Adjective premodifiers and postmodifiers Predictions

Prediction [P4a, b]

As you may recall, the fourth prediction was that language pairs which share noun-adjective order will benefit more from wholesale adoption of the HD contextual model (H1, D1), and will thus have greater accuracy than language pairs which don't.

To test this prediction, we selected (English, German) and (French, German).

Adjectives in both German and English are premodifiers (in that they come before the noun), while they are (mostly) postmodifiers in French. Consider, again, the following sample sentence, in English, German, and French:

Figure 33: English and German premodifiers in a sentence


Figure 34: English premodifiers, French postmodifiers in a sentence

Therefore, we would expect that, at least in this regard, the contextual models of German and English would be similar, while the contextual models of German and French would be different, and that therefore H1 and D1 on (English, German) would have greater accuracy than H1 and D1 on (French, German).

This was indeed the case:

A. H1 on (English, German) [79.3%] > H1 on (French, German) [77.6%].

B. D1 on (English, German) [36.6%] > D1 on (French, German) [35.1%].

Bob using Prediction 4

Indeed, this is another example of a shared language feature, like constituent word order, which would lead to a closer contextual model shared between the HD and the LD.

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? Practically, it means that if Bob wishes to use H1, or to start off on stronger contextual-model footing with a language-ification approach, he may wish to consider features such as adjective-noun order.


6.2.4 Article Presence Predictions

Prediction [P5a, b, c, d]

As you may recall, the fifth prediction was that language pairs in which both languages have articles would benefit more from gloss replacement of frequent words (L1 LD-ify, L1 HD-ify) than would language pairs in which only one language, or both languages, lack articles.

This expectation was based on the idea that language pairs in which both languages regularly make use of articles would have a closer shared contextual model than a language pair in which one of the languages lacked articles. Furthermore, since articles are rather frequently occurring lexical items, any language-ification which brought those articles into vocabulary would have greater success.

To test this, we selected three language pairs: (Dutch, German), where both languages have articles; (Estonian, Finnish), where both languages lack articles; and (German, Finnish), where the HD has articles and the LD does not.

Our concrete predictions were borne out:

[Bar chart: "L1 approaches and articles, changes from baseline" – L1 LD-ify (Dutch, German) +14.3%, L1 LD-ify (German, Finnish) +8.1%, L1 LD-ify (Estonian, Finnish) +3.2%, L1 HD-ify (Dutch, German) +25.8%, L1 HD-ify (German, Finnish) +7.7%, L1 HD-ify (Estonian, Finnish) +6.8%]

Figure 35: Prediction 5 results


A. Delta L1 LD-ify on (Dutch, German) [+14.3%] > delta L1 LD-ify on (German, Finnish) [+8.1%].

B. Delta L1 LD-ify on (Dutch, German) [+14.3%] > delta L1 LD-ify on (Estonian, Finnish) [+3.2%]. Part of this may be attributable to the lack of articles in Estonian and Finnish. I would also attribute this, though, to the fact that (Estonian, Finnish) started out as a much better match at its LD-ify baseline (at 60.7%, prior to any language-ification), while (Dutch, German) started out at only 50.0%.

C. Delta L1 HD-ify on (Dutch, German) [+25.8%] > delta L1 HD-ify on (German, Finnish) [+7.7%].

D. Delta L1 HD-ify on (Dutch, German) [+25.8%] > delta L1 HD-ify on (Estonian, Finnish) [+6.8%].

We note that, in languages which make use of them, we would expect articles to be among the most frequent words in a language. This is because there are so few of them, compared to nouns or adjectives, and they are required in so many contexts. Elsewhere in this study we consider just how many words need to be replaced in an L1 approach, as well as the distribution of different parts of speech among the most frequent X words of each language. But we might well expect that this boost would take hold with even a minimal number of replacements – for instance, of the 50 most frequent LD words.


[Bar chart: "delta L1 HD-ify (50 words)" – (Dutch, German) +20.8%, (German, Finnish) +4.0%, (Estonian, Finnish) +4.0%]

Figure 36: Prediction 5 results, with 50 most frequent word replacement

Indeed, this seems to be the case: the improvement over the baseline for (Dutch, German), for just 50 words, is [+20.8%]. Compare that with the improvement over the baseline for (Dutch, German) when replacing 250 words, at [+25.8%]. Just the first 50 words took us most of the way.

Bob using Prediction 5

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? If both his LD language and his available HD language employ articles, then L1 might well be a useful strategy to select. And, if he is employing L1 as a strategy and his LD language has articles, he would do well to select an HD language with articles as well.

Furthermore, since we see that gloss replacement of articles (and of other most frequently occurring words, such as conjunctions) can make a drastic improvement over the baseline, even if Bob intends to employ some other strategy, he might, in addition to that strategy, engage a language expert for the minimally intensive task of translating just the LD articles into the HD, and replace just those.


6.2.5 Article Granularity Predictions

Prediction [P6a, b, c]

While the previous prediction considered the mere presence or absence of articles in a language, this prediction considers the granularity of those articles.

As you may recall, the sixth prediction was that gloss replacement of frequent words (L1), in an LD-ification strategy, would have less success if the HD language of the language pair has a low lexical granularity of articles and the LD language has a high lexical granularity of articles. Furthermore, this difference in the success of gloss replacement of frequent words is limited to this specific scenario: gloss replacement of frequent words in an HD-ification strategy generally performs quite well, and should perform quite well in this instance as well. Further, if the article granularities of the HD and LD languages were reversed, with the HD language having the higher lexical granularity of articles, then the HD-ify and LD-ify strategies should be comparable.

We used these illustrations to demonstrate the difficulties involved in L1 LD-ify as opposed to L1 HD-ify.

Figure 37: Projection of articles, with high and low lexical granularity

If German has more articles than English, then German LD-ifying an English HD corpus would bring only one German article into vocabulary. But English LD-ifying a German HD corpus would bring all (one) of the English articles into vocabulary. And regardless, HD-ifying some LD text prior to tagging would work well whatever the lexical granularity of articles, because there would be gloss replacements for every LD article, and whatever HD article was chosen to replace it would indeed be an article.

We selected the following language pairs, based on lexical granularity of articles: (English, German) as an example of (low granularity, high granularity), and (German, Dutch) as an example of (high granularity, low granularity).

Our concrete predictions were borne out:

A. Delta L1 LD-ify on (German, Dutch) [+32.3%] > delta L1 LD-ify on (English, German) [-1.1%].

B. Delta L1 LD-ify on (German, Dutch) [+32.3%] is approximately the same as delta L1 HD-ify on (German, Dutch) [+35.1%].

C. However, as predicted, delta L1 LD-ify on (English, German) [-1.1%] < delta L1 HD-ify on (English, German) [+29.6%], and is in fact much smaller.

These results make sense because, as previously discussed, the greater lexical granularity of articles in German as opposed to English means that, in the LD-ify direction, many German articles remain out of vocabulary. This manifests itself in a disparity between the results of L1 LD-ification of the respective language pairs (that is, result A). The problem is only present in the LD-ify direction, because only there do the German articles remain out of vocabulary. Therefore, where the lexical granularity of articles does not increase in the direction of the LD, as in (German, Dutch), the LD-ify result will approximately match that of the HD-ify result (that is, result B). But where there is such an increase, as in (English, German), HD-ify will succeed to a much higher degree than LD-ify (that is, result C).


Bob using Prediction 6

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? If he is considering LD-ify approaches, then he should carefully consider the relative lexical granularity of articles in his languages. If his LD has high lexical granularity of articles, then his HD should as well; if his LD's articles are more lexically granular than his HD's, then he should consider using an HD-ify approach instead.

Alternatively, he should realize that this is going to present a problem for his LD-ification approach and take steps to address it. For example, he might build a dictionary of (the fewer) HD articles mapped to the (many) LD article equivalents, and then, when performing gloss replacement in the HD corpus, iterate through the alternatives in order to allow each of the LD articles a chance to appear in the LD-ified corpus. Or, alternatively, with a little language-specific knowledge, he can consider the placement of the article in the sentence to guess at whether to use the article with the nominative or accusative case.
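The article-cycling workaround just described can be sketched as follows, with a hypothetical English-to-German article map:

```python
from itertools import cycle

def ldify_articles(hd_tokens, article_map):
    """Each HD article maps to several LD alternatives; cycling through
    them lets every LD article appear somewhere in the LD-ified
    training corpus, rather than only one of them."""
    cyclers = {hd: cycle(lds) for hd, lds in article_map.items()}
    return [next(cyclers[t]) if t in cyclers else t for t in hd_tokens]

# hypothetical English(HD) → German(LD) article alternatives
amap = {"the": ["der", "die", "das"]}
print(ldify_articles(["the", "dog", "the", "cat", "the", "house"], amap))
# → ['der', 'dog', 'die', 'cat', 'das', 'house']
```

The case-guessing refinement mentioned above would replace the blind cycling with a choice conditioned on the article's position in the sentence.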

Regardless, he should realize that he might need to design his solution around this difficulty.

Furthermore, while this was studied for articles, the same might be said for other parts of speech with differing levels of lexical granularity between languages, and so Bob might keep that in mind as well.

6.2.6 PRO-Dropness Predictions

Prediction [P7]

As you may recall, the seventh prediction was that approaches involving wholesale adoption of the HD contextual model (H1) will be most effective for pairs of languages in which both are PRO-drop or both are not PRO-drop, but that if there is a divergence in this language feature, then such an approach will not be as successful.


In fact, this prediction was borne out:

H1 on (English, German) [79.3%] > H1 on (Finnish, German) [77.9%].

[Bar chart: "H1, considering PRO-drop" – (English, German) 79.3%, (Finnish, German) 77.9%]

Figure 38: Prediction 7 results

Bob using Prediction 7

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? If he is using an H1 (or H2) approach, then it would be better to select an HD language which is similar in PRO-dropness. Further, it might be a good idea to do this when using language-ification strategies in general, because what this reveals is that this difference does indeed have an impact on the contextual model.

6.2.7 Richness of Inflection

Prediction [P8a, b, c]

As you may recall, our eighth prediction was that, assuming a difference in the level of inflection between the HD and LD languages, LD-ification approaches would be more successful if the HD inflection level was greater than the LD inflection level.


We selected the following two language pairs, which are reflections of one another:

1. (German, Dutch) as (richly inflected, poorly inflected)

2. (Dutch, German) as (poorly inflected, richly inflected)

Our concrete predictions were borne out:

A. Delta L1 LD-ify for (German, Dutch) [+32.3%] > delta L1 LD-ify on (Dutch, German) [+14.3%].

B. Delta L2 LD-ify for (German, Dutch) [+9.8%] > delta L2 LD-ify on (Dutch, German) [-20.3%].

C. Delta L4 LD-ify for (German, Dutch) [+4.1%] > delta L4 LD-ify on (Dutch, German) [-7.5%].

[Bar chart: "Level of Inflection and Language-ification" – delta L1/L2/L4 LD-ify for (German, Dutch): +32.3%, +9.8%, +4.1%; for (Dutch, German): +14.3%, -20.3%, -7.5%]

Figure 39: Prediction 8 results

Bob using Prediction 8

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? If he is considering an LD-ification approach, whether of the simplistic variety explored in this dissertation or of a more complex variety, he would do well to consider the level of inflection of his LD and his HD, because gloss replacement which projects the corpus of a poorly inflected language into a richly inflected one does not perform well.

Prediction [P9a, b, c]

As you may recall, the ninth prediction was that, assuming a difference in the level of inflection between the HD and LD languages, HD-ification approaches would be more successful if the HD inflection level was lower than the LD inflection level. (That is, in general, for both HD-ify and LD-ify, it is better if the language-ification is in the direction of the lower level of inflection.)

We selected the same two language pairs as in the previous prediction. Our concrete predictions were borne out:

[Bar chart: "Level of Inflection and Language-ification" – delta L2/L3/L4 HD-ify for (Dutch, German): +1.2%, +3.1%, +26.4%; for (German, Dutch): +0.8%, +1.0%, +6.8%]

Figure 40: Prediction 9 results

A. Delta L2 HD-ify for (Dutch, German) [+1.2%] > delta L2 HD-ify on (German, Dutch) [+0.8%].

B. Delta L3 HD-ify for (Dutch, German) [+3.1%] > delta L3 HD-ify on (German, Dutch) [+1.0%].

C. Delta L4 HD-ify for (Dutch, German) [+26.4%] > delta L4 HD-ify on (German, Dutch) [+6.8%].

Bob using Prediction 9

Of what use is this to Bob, who hopes to build a tagging solution for the LD language? If he is considering an HD-ification approach, whether of the simplistic variety explored in this dissertation or of a more complex variety, he would do well to consider the level of inflection of his LD and his HD, because in the case of HD-ification, gloss replacement which projects from a poorly inflected language into a richly inflected one does not perform well.

6.3 HD-ification vs. LD-ification

Within this study we considered the effects of language-ification in two opposite directions: HD-ification of the LD text, followed by tagging with a 3rd party HD tagger; and LD-ification of a large tagged HD corpus, followed by supervised training on that LD-ified corpus and then application of the trained tagger to an LD text.

If Bob reads this study, which over-arching approach should he select, HD-ify or LD-ify? I believe that the answer depends in part on which language-ification approach (L1, L2, L3, or L4) he intends to use.

[Bar chart: L1 accuracy for HD-ify vs. LD-ify across the seven language pairs: Dutch/German, French/German, English/German, Finnish/German, German/Dutch, German/Finnish, Estonian/Finnish]

Figure 41: HD-ify vs LD-ify results

For L1, the overall scores for HD-ify were higher than for LD-ify, but this is not necessarily as meaningful as it seems. Recall that for the HD-ification approaches, the 3rd party HD tagger might have been trained on a very large, expert-tagged HD corpus. For the LD approaches, we did not have access to that original HD corpus, and so we used the HD tagger to tag HD text in the somewhat smaller Europarl corpus, used the smaller shared tagset, and trained on that. As a result, the LD-ification corpora are (a) somewhat noisier, being the result of machine tagging; (b) poorer in HD lexical coverage; and (c) less granular in their tagsets.

If Bob were to attempt LD-ification, he would start with a large manually-tagged HD corpus, and so might well be on more equal footing with the HD-ification approaches.

Still, L1 LD-ification failed in a number of cases, for reasons discussed above (e.g. prediction 6). While an analysis of the linguistic features of the language pair would serve to identify such cases, so that Bob can avoid LD-ification or engineer around it, it still does seem that L1 HD-ify is the surer bet.

[Bar chart: L4 deltas for HD-ify and LD-ify on (Dutch, German) and (German, Dutch)]

Figure 42: L4 HD-ify vs. LD-ify

In terms of L4 as well, it seems that the HD-ify approach was the better approach. In general, L4 failed, or only yielded a rather small improvement, due to the noise inherent in the process of automatic cognate identification. Yet, where the languages were close enough, such as (Dutch, German) and (German, Dutch), there was the possibility of significant improvement over the baseline.

6.4 Frequent Word Replacement – How many words?

One language-ification approach that seemed to consistently succeed in rather dramatic fashion was the L1 HD-ify approach, which was gloss replacement of the most frequent words in the LD with their HD equivalents. However, left unspecified was just how many of those most frequent words to replace. If one replaced 0 words prior to tagging with the HD tagger, that is the same as the HD-ify baseline. At the other extreme, if the LD language has a vocabulary of 10,000 words and the most frequent 10,000 LD words are replaced, then this is substitution of every single LD word. How many words should we replace in this strategy?

We would expect the law of diminishing returns to apply here. The most frequent words – determiners, pronouns, modal verbs, and so on – will probably be quite frequent, meaning that relatively few lexical items will be represented a large number of times in the LD text. Therefore, in terms of the impact on the lexical model, gloss replacement of just these words would have a great impact. Further, we might also expect many of these frequent words to have tags which are somewhat out of the ordinary, and thus have a greater impact on the contextual model as well. For instance, if an article were brought into vocabulary, then not only would the article be tagged correctly but, due to strongly encoded trigrams in the contextual model, the tagger might more accurately predict that the (unknown or ambiguous) word which followed was a noun or adjective, as opposed to a conjunction. Meanwhile, bringing a noun into vocabulary would not necessarily tell us as much about the word which followed, since many different types of words appear in the immediate context of nouns.

We therefore ran the L1 HD-ify experiment on each language pair multiple times, with a frequent wordlist in increments of fifty. That is, we replaced the 0, 50, 100, 150, 200, and 250 most frequent words in the LD language, and measured how accurate the tagging was.

The results are summarized in the following chart:

[Line chart: tagging accuracy (35%–75%) vs. number of frequent words replaced (0–250), one line per language pair: Dutch/German, French/German, English/German, Finnish/German, German/Dutch, German/Finnish, Estonian/Finnish]

Figure 43: Frequent word replacement, 50...250

Some language pairs seemed to have a rather large delta with only 50 replacements, and so we investigated further, staging experiments with 25, 10, 5, or even 2 replacements.

For example, (German, Dutch) increased from 35.5% to 64.5% (a delta of 29%) after replacing just the 50 most frequent words. Subsequent investigation revealed that by replacing just two Dutch words (de → die, meaning the, and van → von, meaning of), accuracy increased to 46.9% (a delta of 11.4%). By replacing just another three words (het → die, meaning the; en → und, meaning and; and een → ein, meaning a), accuracy increased to 53.6% (a delta of 18.1%).
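The L1 mechanism itself is simple token substitution. A minimal sketch in Python, where the tiny gloss table is just the five-word worked example above, not the full frequent-word dictionary:

```python
# Sketch of L1 HD-ification: replace the most frequent LD tokens with their
# HD glosses before handing the text to the 3rd-party HD tagger.
# This tiny gloss table is the five-word example above, not the real dictionary.
gloss = {"de": "die", "van": "von", "het": "die", "en": "und", "een": "ein"}

def hdify(tokens, gloss_table):
    """Replace each token that has an HD gloss; all other tokens pass through."""
    return [gloss_table.get(tok, tok) for tok in tokens]

dutch = "de man en de vrouw van het huis".split()
print(hdify(dutch, gloss))
# ['die', 'man', 'und', 'die', 'vrouw', 'von', 'die', 'huis']
```

In the real experiments, the gloss table is built from the top-N entries of the frequent word dictionary; everything else about the pipeline (tokenization, then tagging with the 3rd party HD tagger) is unchanged.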


You may recall that we had selected seven language pairs, drawn from a set of six languages: German, Dutch, English, French, Finnish, and Estonian. For each of these six languages, we analyzed just what parts of speech occur in the frequent wordlists of each size. For example, consider the following table, for Dutch:

Dutch          50   100   150   200   250
ADJECTIVE       4    17    27    40    46
ADVERB         13    30    43    57    69
ARTICLE         5     5     5     6     6
CONJUNCTION     8    15    20    20    23
INTERJECTION    0     3     5     8     9
NOUN           15    33    62    79   105
PREPOSITION    14    20    23    27    31
PRONOUN        15    29    32    36    39
VERB            9    15    23    33    42

Figure 44: Frequent word profile for Dutch

If you turn your head sideways, the bar graphs in each column (50 most frequent words, 100, etc.) correspond to the distribution of tags for those frequent words. Considering only the 50 most frequent words, only 5 of those words are articles, compared with 15 pronouns. This stays the case even when we consider the 150 most frequent words – 5 are articles, compared with 32 pronouns. For the most part, the general form of these bar graphs remains the same, though nouns do quickly outpace pronouns.

This frequent word profile might lend some quick visual insight into the lexical and contextual models of a language, as well as perhaps its similarity to another language. In turn, this might conceivably help someone predict in advance how well a language-ification attempt will work for a pair of languages. To illustrate this, consider the contours of the bar charts for Dutch, in the chart above, and compare them with the contours of the bar chart for German:

German         50   100   150   200   250
ADJECTIVE       3    11    17    24    31
ADVERB         10    24    35    46    57
ARTICLE         8     8     9     9     9
CONJUNCTION     4    14    19    22    25
INTERJECTION    0     2     4     8    10
NOUN            6    18    30    52    79
PREPOSITION    16    21    25    29    32
PRONOUN        18    31    40    48    54
VERB           10    16    27    34    39

Figure 45: Frequent word profile for German

The overall numbers for each of these parts of speech might also tell us something about their respective lexical granularity, or level of inflection, which as we saw earlier is useful in making predictions as to how various language projection strategies might perform.

It is also worth noting that these tables only count parts of speech within the gloss-replacement wordlist. That is, the five Dutch articles might be among the most frequent words in a Dutch corpus, yet they count in the above reckoning only as 5. Meanwhile, the fifteen nouns count as 15, even though, when looking at how often they actually appear in a corpus, all these nouns together occur less frequently than the five articles. Because of this, we decided to also capture the relative frequency of these replaced words, by part of speech.

For example, here is such a chart for Dutch:


Dutch          50   100   150   200   250
ADJECTIVE      0%    2%    3%    4%    4%
ADVERB         7%    9%    9%    9%   10%
ARTICLE       25%   21%   20%   19%   18%
CONJUNCTION   17%   15%   14%   14%   13%
INTERJECTION   0%    0%    0%    0%    0%
NOUN           5%    9%   12%   12%   14%
PREPOSITION   23%   22%   20%   20%   19%
PRONOUN       20%   20%   19%   19%   18%
SYMBOL         0%    0%    0%    0%    0%
VERB           9%    9%   10%   10%   11%

Figure 46: Frequent word distribution for Dutch

We see here that when replacing only the 50 most frequent words, the five articles comprise a good 25% of replaced words in the corpus, compared with just 5% for the fifteen nouns. Even as the dictionary grows to 250 words, the articles make up 18%, even though few to no articles were added. Meanwhile, a large number of nouns were added to the dictionary by the time it reaches 250 words, but each noun had low frequency, such that altogether they comprise only 14%.

If you recall, Dutch had a rather dramatic rise in accuracy even when replacing relatively few words. Perhaps these tables and charts can help account for this, in that so many of these articles and prepositions, comprising such a high percentage of the corpus words, were brought into vocabulary early on, even within the list of the 50 most frequent words. And since they comprise such a high percentage of the overall corpus, they also have the most dramatic impact on tagging accuracy.

We computed these metrics for all languages we studied, but our aim here is only to define the metrics we used and note how they might prove useful. All the charts and tables for all six languages may be found in Appendix B.
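Both metrics (the per-tag type counts of Figures 44 and 45, and the per-tag token shares of Figure 46) can be computed in a few lines from a tagged corpus. A sketch, assuming the corpus is available as a list of (word, tag) pairs; the function and variable names are our own:

```python
from collections import Counter

def frequent_word_profile(tagged_corpus, sizes=(50, 100, 150, 200, 250)):
    """For each wordlist size n: count how many of the top-n word types carry
    each tag (the type counts), and what share of all corpus tokens the
    replaced words of each tag account for (the token shares)."""
    word_freq = Counter(w for w, _ in tagged_corpus)
    tag_votes = {}                      # majority tag per word type
    for w, t in tagged_corpus:
        tag_votes.setdefault(w, Counter())[t] += 1
    total_tokens = sum(word_freq.values())
    profiles, shares = {}, {}
    for n in sizes:
        top = [w for w, _ in word_freq.most_common(n)]
        tags = [tag_votes[w].most_common(1)[0][0] for w in top]
        profiles[n] = Counter(tags)     # e.g. {'ARTICLE': 5, 'PRONOUN': 15, ...}
        by_tag = Counter()
        for w, t in zip(top, tags):
            by_tag[t] += word_freq[w]
        shares[n] = {t: c / total_tokens for t, c in by_tag.items()}
    return profiles, shares
```

Assigning each word type its majority tag is a simplification; a word tagged differently in different contexts is counted once here.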


Chapter 7: Projecting from Farsi to Tajiki: A Case Study

Until this point in our study, we have only considered pseudo-LD languages. By this, we mean that our concern was with language-ification approaches, and one language in a pair was designated as the low-density language (for which we withheld linguistic resources) while the other language in the pair was designated as the high-density language, for which a wealth of linguistic resources (e.g. a large tagged corpus or a third-party tagger) was available. However, neither of the languages in the pair was, strictly speaking, truly low-density. In order to be able to evaluate how well different strategies performed, and in order to compare equitably across different language pairs, we selected languages which were in the Europarl corpus and for which there existed a TreeTagger. In this way, we could tag the untagged Europarl text and create a somewhat noisy gold standard for our designated low-density language.

Now, in this chapter, we consider the case of tagging an actual low-density language, Tajik, using a related high-density language, Farsi. There are three main distinct varieties or dialects of Persian: Iranian Persian (Farsi), Afghani Persian (Dari), and Tajiki Persian (Tajik).


[Language tree: Indo-European > Indo-Iranian > Western, with a Northwestern branch (Kurdish, Judeo-Kurdish, Talysh, Baluchi, Western Baluchi) and a Southwestern branch (Persian: Farsi, Dari, Tajiki, Tati, Bukharian (Judeo-Tajik))]

Figure 47: Language tree for Farsi and Tajiki

Farsi is written in an extended Arabic script and has many linguistic resources associated with it. Tajiki, meanwhile, used to be written in the same extended Arabic script but is presently written in an extended Cyrillic script. It also does not have many linguistic resources.

Tajiki and Farsi are closely related. They are mutually intelligible, though due to pronunciation differences, written Tajiki is much more intelligible than spoken Tajiki to a Farsi speaker. Tajiki and Farsi have a large overlapping vocabulary, but there are still differences. For example, the word kalaan means "big" in both Farsi and Tajiki, but in Farsi it is not commonly used, and the word bozorg is preferred.16 So too as part of larger words: Farsi for grandfather is pedarbozorg (father-big), while in Tajiki it is pedarkalaam. Farsi for grandmother is maadarbozorg (mother-big), while in Tajiki it is maadarkalaam. To grow up in Farsi is bozorg shodan, while in Tajiki it is kalaam shodan. Tajiki maintains a number of archaic Persian words in its vocabulary. Tajiki also has many loanwords and loan translations from Uzbek (a Turkic language) and Russian.

16 See discussion here: http://polyglotclub.com/language/tajik/post/14897

Farsi and Tajiki grammar also differ. Tajiki retains certain archaic grammatical constructions which have been dropped in other Persian varieties. One prominent example of how Tajiki differs from classical Persian17 is the construction of the present progressive tense ("I am writing a letter"). In Tajiki, one expresses this with the present progressive participle ending in –a (thus, навишта) plus the cliticized verb 'to be' (thus, истода-ам). The full sentence would be Ман макуб навишта истода-ам, meaning "I letter write first-person-singular-be-participle." In contrast, Farsi would use the verb dar ('to have') followed by a conjugated verb in the simple present tense, the habitual past tense, or the habitual past perfect tense.

In terms of morphology, both Farsi and Tajiki use the direct object marker ro, but Tajiki has it as a suffix while Farsi has it as a stand-alone morpheme. Both Farsi and Tajiki have the izofat, a grammatical particle which links two words together, but in Farsi it is spoken but not written. For more on the many differences between these two Persian dialects in vocabulary, syntax, morphology, and phonology, see Beeman (2005) and Aliev and Okawa (2010).

We selected Tajiki as a low-density language for a few reasons. Firstly, it is quite dissimilar to many of the other languages discussed so far, so it would be interesting to know how the different projection strategies would fare. Secondly, Farsi and Tajiki are closer than any of our other language pairs, being two dialects of Persian, so we expected interesting results from certain language-ification strategies. For instance, there is a rather large overlap in vocabularies, so even baseline application of a Farsi tagger to a Tajiki corpus should have a measure of success; and since a large number of true cognates exist, cognate detection and replacement might succeed much more than it did for any of our other language pairs. Thirdly, Tajiki is written in a different script than Farsi, and the problem of cross-orthographic projection is one we would like to explore. Finally, this author lives and works in Flushing, where there is a large Tajiki and Bukharian (Judeo-Tajik) community, and so we thought to make use of a local resource.

17 From Wikipedia: http://en.wikipedia.org/wiki/Tajik_language

7.1 Construction of Tajiki linguistic resources

Our initial plan was to hire Tajiki speakers or Bukharian speakers who were not linguists (e.g. college students, neighbors), teach them how to identify parts of speech with examples taken from English and from a tagged Farsi corpus, and ask them to translate and tag a Tajiki corpus we had assembled. By having two or more people working on tagging the same text, and investigating and resolving instances of disagreement, we could assemble a small Tajiki corpus to use as a gold standard, for the sake of evaluating how well different projection strategies performed.

However, we soon revised that plan. We approached a Bukharian-speaking colleague for recommendations on who could assist in this task, and his referral brought us to someone who not only could speak Tajiki and Farsi, but was also a linguistics expert, who had developed an intensive Farsi-to-Tajiki bridge course and had helped develop a multi-volume college textbook on the Tajiki language.

We asked the expert to construct three linguistic resources:

1. A tagged Tajiki corpus, to serve as a gold standard

2. A Tajiki → Farsi frequent word dictionary

3. A Farsi → Tajiki frequent word dictionary

By far, the most time, attention, and labor-intensive resource to assemble was the tagged Tajiki corpus. However, we would stress that this represents an effort that Bob would not need to undertake. We only needed this hand-tagged Tajiki corpus as a gold standard, in order to evaluate how well the various strategies were performing.

We first gathered many news articles from the TojNews18 website, spanning from October 11th to October 17th, 2011. This gave us an untagged corpus. We tokenized the text, placed each lexical item on its own line in an Excel spreadsheet, and pretagged what we could ourselves (e.g. numbers and punctuation). This yielded a corpus of 14,132 words.

We decided to utilize the simplified joint tagset used for the gold standard for all the other languages, since this would aid in comparison to other language pairs. After discussion with the language expert, we added one additional part-of-speech tag, CompoundVerb, for verbs composed of multiple words, where each word on its own would be a different part of speech and would not convey the semantic meaning of the verb.

The effort to build this gold standard took several months, for a few reasons. The language expert was working on this part-time; our respective schedules and workloads did not always match up; and we had numerous discussions regarding the best way to handle ambiguous tagging cases. For all of these ambiguous cases, the language expert certainly could have given a part-of-speech tagging which was linguistically correct, but there was a secondary goal: to make the tags conform as closely as possible to the tags in our chosen Farsi corpus, which was to serve as the basis for our Farsi TreeTagger. In this way, we could try to minimize noise due to divergent decisions of the human taggers of Farsi and Tajiki.

18 http://www.tojnews.tj/

In contrast, the effort to develop the Tajiki → Farsi frequent word dictionary took considerably less time and effort. We used a computer program to generate the 250 most frequent words from a Tajiki corpus and asked the language expert to give the following for each word: the English equivalent (for our own knowledge, since it might be helpful for analysis and our own understanding of the tagged corpus), the part of speech (for the analysis of the part-of-speech composition of the most common words), and the Farsi equivalent (which was the only part actually needed for the L1 strategy). We followed the equivalent process for the Farsi → Tajiki frequent word dictionary. (We had considered building this resource ourselves using online and print dictionaries of Persian and Tajiki, using English as a pivot language, but we opted instead to ask the expert.)
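Generating the wordlist handed to the expert requires only a token count. A sketch; the TSV template layout (three empty columns for the expert to fill in) is illustrative:

```python
from collections import Counter

def most_frequent_words(tokens, n=250):
    """Return the n most frequent word types, most frequent first."""
    return [w for w, _ in Counter(tokens).most_common(n)]

def dictionary_template(tokens, n=250):
    """Emit one TSV row per frequent word, with empty columns for the expert
    to fill in: English equivalent, part of speech, Farsi equivalent."""
    return ["\t".join([w, "", "", ""]) for w in most_frequent_words(tokens, n)]
```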

One consideration which likely helped the generation of these frequent word dictionaries is that the words in Farsi and Tajiki were so similar, and in most cases were cognates. Still, there were a few complications and design decisions even here. For one example, which the language expert raised, both Farsi and Tajiki make use of the izofat, a grammatical particle which links two words together. In Farsi the izofat is pronounced but not written; in Tajiki, it is also written. Therefore, when translating from Tajiki to Farsi, the specific word is unambiguous and one can select the appropriate Farsi translation. However, when translating from Farsi to Tajiki, how does one know whether the word should be the one with the izofat or the one without? For instance, مورد can be read as "маврид", with a meaning of [case, instance, occasion], or as "мавриди" (with the final и as the izofat, an unstressed enclitic vowel pronounced /i/), where it is part of the phrase "дар мавриди", with a meaning of [about, in the case of]. The former is a noun, while the latter is a complex preposition or conjunction, and these are used more frequently. When looking at a word in context, within a Farsi corpus, the specific meaning is readily apparent. But here we are dealing with a word in isolation, so it is unclear which translation to select.

We decided that, since the point of this dictionary resource was replacement of the most frequent words, it made sense to select, in each instance, the Tajiki equivalent which would be the more common. Either choice would lead to some incorrect replacements and thus noise, but this approach would minimize the noise and maximize the words in the Persian corpus being brought correctly into vocabulary.

Construction of Farsi TreeTagger

On the Farsi side, we needed one linguistic resource, namely a TreeTagger. All of our other HD languages in our study had TreeTaggers, and so a Farsi TreeTagger was called for in order to perform a proper comparison.

We surveyed the various Farsi part-of-speech tagged corpora which were available and selected the Bijankhan corpus.19 There are two versions of this corpus: the original corpus, as created in Oroumchian et al. (2006), with a tagset granularity of 550 tags, and a processed version of the corpus, more suitable for NLP tasks, as created in Amiri et al. (2007), with a tagset granularity of 40 distinct tags. Based on conversations with the author of TreeTagger and past experience struggling to create a TreeTagger for Finnish with a high tagset granularity, we came to understand that one cannot create a TreeTagger with as many as 550 distinct tags. Further, many of the TreeTaggers for other languages discussed so far have a tagset granularity of approximately 40, and so the latter version of the corpus is the right choice. These 40 tags and their meanings are listed in Appendix A.

19 http://ece.ut.ac.ir/dbrg/bijankhan/

One obvious difficulty with creating a Farsi TreeTagger with the intent to use it on Tajiki is that Farsi and Tajiki are written in different scripts. The Bijankhan corpus is written in Persian Arabic script, while the Tajiki corpus is in an extended Cyrillic script. As a result, every single Tajiki word would be out of vocabulary. To bridge this orthographic gap, we could:

1. Convert the Tajiki corpus from Cyrillic → Arabic script and then tag

2. Convert the Bijankhan corpus from Arabic script → Cyrillic

3. Romanize both the Tajiki corpus (Cyrillic → Latin) and the Bijankhan corpus (Arabic → Latin) and have them meet in the middle

We decided upon the third approach for the sake of simplicity. Romanized text would work more straightforwardly with third-party tools like Automorphology (used to discover Farsi or Tajiki affixes). Each character in a string would take up a single byte (as an ASCII character) rather than multiple bytes (as a Unicode character), and this would work better with, e.g., the Python cognate identification code that was written. Furthermore, meeting in this common-denominator middle of Roman characters could help overcome some of the ambiguous mappings between the two scripts.

Tajiki, written in extended Cyrillic script, has both capital and lowercase letters, while Farsi, written in Persian Arabic script, does not. So, for example, there is both a capital Д and a lowercase д, both of which make the sound /d/ and map to ﺩ in the Persian Arabic script. This capitalization could be helpful on the Tajiki side, in that capital letters which do not appear at the beginning of a sentence would indicate a proper noun (even if the word were out of vocabulary), and could disambiguate between a capitalized word (a noun) and a lowercase word (say, an adjective). However, in a resource-light environment, with little to no language-specific code, there is no real way to know whether a particular ﺩ in the Bijankhan corpus should be mapped to a capital Д or a lowercase д. Indeed, the noun tags in the Bijankhan corpus are N_PL and N_SING, which distinguish between plural and singular, but no distinction is made between proper and common nouns, so we could not decide on that basis. In this case, Romanizing the Bijankhan corpus maps the ﺩ to a d, and at the same time Romanizing the Tajiki corpus maps both the capital Д and the lowercase д to a d. Therefore, whichever way (capital or lowercase) the word appeared in the original Tajiki corpus, it would now be in-vocabulary to the Farsi tagger.

Tajiki as written reflects pronunciation, while Farsi, for borrowings from Arabic, maintains the distinct Arabic letters even where they are not pronounced differently. Persian as pronounced lacks phonemes such as interdentals and emphatic alveolars; written Farsi words will maintain those distinct letters while written Tajiki will not. Thus, for example, Tajiki will have capital З and lowercase з, with a phonetic value of /z/, and the written Farsi equivalent might be any of the following case-insensitive letters: ﺽ (Ḍād), ﻅ (Ẓāʾ), ﺫ (Ḏāl), and ﺯ (Zayin). Transliteration of Farsi → Tajiki would then be straightforward, with all these Arabic letters mapping to the lowercase Cyrillic з. However, in the opposite direction, Tajiki → Farsi, it is not as straightforward. Should the з be mapped to ﺽ (Ḍād), ﻅ (Ẓāʾ), ﺫ (Ḏāl), or ﺯ (Zayin)?

Megerdoomian and Parvaz (2008) attempt to deal with ambiguous mappings such as this. With an eventual goal of making use of Farsi machine translation tools, they build a grammar which compiles to a finite-state transducer to transliterate Tajiki words to Farsi. This transducer overproduces, but they then scan through HD Farsi resources to find the produced word that is in-vocabulary.

Our Romanization approach deals with this problem in a different fashion (and can indeed do so because we are in control of the production of the Farsi tagger). The Cyrillic З and з are mapped to a z, and the Persian Arabic ﺽ (Ḍād), ﻅ (Ẓāʾ), ﺫ (Ḏāl), and ﺯ (Zayin) are likewise all mapped to a z. This admittedly introduces some noise, but it does not require extensive access to Farsi resources at the transliteration step.

Another difficulty in transliterating Farsi, in Persian Arabic script, to Tajiki, in Cyrillic script, is that the Persian Arabic often will not represent certain vowels. For example, the izofat discussed above is written in Tajiki as и and is pronounced as /e/. In Farsi, this is a diacritic and is most often not actually written, and so the word would appear without the izofat in the Bijankhan corpus. With some deep linguistic knowledge of Persian, one could theoretically examine the assigned tags (to see, e.g., if it is assigned N_SING for NOUN SINGULAR or PP for Prepositional Phrase) or examine the immediate lexical context (to see, e.g., if it appears immediately after the Farsi word در, Romanized dar and Cyrillic дар). This deep linguistic knowledge and analysis is, however, what Bob is at least initially trying to avoid. The Romanization approach would not help here, and there will be a mismatch both from the Farsi side and from the Tajiki side.

Similarly, there is not always a straightforward mapping of Tajiki vowels to Farsi vowels. For instance, some vowels (/a/, /e/, and /o/) appear as diacritics, but sometimes these diacritics are absent. The /o/ sound might be encoded in Farsi as an alef, an ayn, or as nothing at all. The /a/ sound might be encoded in Farsi as an alef (ا) in word-initial position, a heh (ه) in word-final position, and not written at all when appearing in the middle of the word. One could carefully catalogue all the ways vowels appear, depending on context, and engineer a transliteration solution to deal with many of these cases. Our hypothetical computational linguist "Bob" does not do this, and the Romanization approach will yield mismatches here as well.

After studying how Farsi and Tajiki mapped to one another in a somewhat ambiguous manner, we constructed Arabic → Roman and Cyrillic → Roman transliteration dictionaries in which the unambiguous letters would meet in the middle and the ambiguous letters would also meet in the middle. We then Romanized our Tajiki corpus and the Bijankhan corpus.

We then preprocessed the Romanized Bijankhan corpus to make it acceptable input to the train-treetagger program. These preprocessing transformations included such actions as replacing contiguous spaces with a single tab, replacing the corpus-particular sentence delimiter (#, tagged with DELM) with a period (tagged with SENT), replacing the DELM tag for punctuation with a PUNCT tag, dealing with multiple words assigned a single tag, and removing numbers and their tags from the lexicon file. All of these steps were implemented as regular expression replacements, which we carefully documented and saved, so that we could repeat these steps as needed (say, if we discovered at a late stage that the Romanization missed or mismatched some character). Earlier, we had also written a Python script to automatically build the additional lexicon and taglist files needed as input to train-treetagger (used for the LD-ify strategies and for building the Finnish TreeTagger as well), and this same Python script worked here as well.
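A few of these transformations can be sketched as regular expression replacements; the exact patterns below are illustrative reconstructions, not the documented originals:

```python
import re

def preprocess_bijankhan(line):
    """Illustrative subset of the cleanup applied to each line of the
    Romanized Bijankhan corpus before train-treetagger."""
    line = re.sub(r" {2,}", "\t", line)           # runs of spaces -> one tab
    line = re.sub(r"^#\tDELM$", ".\tSENT", line)  # sentence delimiter -> SENT
    line = re.sub(r"\tDELM$", "\tPUNCT", line)    # other DELM -> PUNCT
    return line
```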

In this manner, we built the (Romanized) Farsi TreeTagger.


7.2 Farsi and Tajiki cognation detection

One Python script needed particular tweaking: that for cognate detection. For other language pairs, the cognate detection model ignored capitalization. Here, however, we utilized certain Roman capital letters, as opposed to their lowercase equivalents, to capture specific Tajiki and Farsi letters. For instance, we encoded Tajiki г and Farsi گ (/g/) as a lowercase g, and we encoded Tajiki ғ and Farsi ﻍ (/ʁ/, that is, /gh/) as a capital G. The cognate detection should not conflate the two. Furthermore, we used certain consonantal characters (e.g. capital R) to represent certain Tajiki (e.g. я) and Farsi (e.g. یه and یَ) vowels, and the cognate detection assigns a lower cost to vowel substitution, deletion, and insertion than to consonantal substitution, deletion, and insertion. Also, the Farsi ه (h), when appearing in word-final position, often functions to indicate the presence of a vowel. Therefore these particular vowels needed to be added to the vowel list.

Finally, for other language pairs, in order to reduce noise and eliminate spurious cognate suggestions, insertion and deletion of a vowel incurred a cost of 0.25 when that vowel followed another vowel, and 1.0 when that vowel was word-initial or followed a consonant. In this way, if ae in the source language mapped to a in the target language, that substitution only incurred a cost of 0.25, but arbitrary insertion of word-initial vowels or vowels between consonants incurred a real insertion cost. In the Tajiki and Farsi case, however, the izofat often appears after a consonant, and this should not be penalized; and the Farsi diacritics which often do not appear, but which follow consonants, corresponding to Tajiki vowels, should not incur a cost of 1. Therefore, we modified the cost function so as to assign a cost of 0.25 in all instances.
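The resulting cost function amounts to a weighted Levenshtein distance in which every vowel operation costs a flat 0.25. A sketch; the handling of vowel-to-consonant substitutions (taking the more expensive of the two costs) is our own assumption, and in practice the vowel set is extended with the extra Romanized vowel symbols described above:

```python
VOWELS = set("aeiou")  # extended in practice with the extra Romanized vowel symbols

def cognate_cost(src, tgt):
    """Weighted Levenshtein distance: consonant edits cost 1.0, while vowel
    substitution, insertion, and deletion cost a flat 0.25 (the modification
    adopted for the Farsi/Tajiki pair)."""
    def op_cost(ch):
        return 0.25 if ch in VOWELS else 1.0
    n, m = len(src), len(tgt)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + op_cost(src[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + op_cost(tgt[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if src[i - 1] == tgt[j - 1] else max(
                op_cost(src[i - 1]), op_cost(tgt[j - 1]))
            d[i][j] = min(d[i - 1][j] + op_cost(src[i - 1]),  # deletion
                          d[i][j - 1] + op_cost(tgt[j - 1]),  # insertion
                          d[i - 1][j - 1] + sub)              # substitution
    return d[n][m]

print(cognate_cost("dar", "dr"))  # 0.25: deleting the vowel a is cheap
```

Word pairs whose cost falls under some threshold are then proposed as cognate candidates.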


7.3 Farsi and Tajiki affix dictionaries

The next linguistic resources we needed to build were the prefix and suffix dictionaries for Tajiki → Farsi. We ran the Romanized Tajiki corpus through Automorphology to generate a list of prefixes and suffixes. Then, we consulted the L1 frequent word dictionary for Tajiki → Farsi which our linguistic expert had created. Since the Farsi and Tajiki words in this dictionary were in most cases similar, we could identify the regular target prefixes and suffixes. There was indeed some ambiguity in this mapping, and so, for L2 approaches, we selected the most frequently occurring target affix, while for L3 we allowed it to function as a multimap. We also needed prefix and suffix dictionaries in the opposite direction, for Farsi → Tajiki. We used the same process as described above, but using the Romanized Farsi corpus and the frequent word dictionary for Farsi → Tajiki.
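The suffix side of this step can be sketched as follows, assuming the frequent word dictionary is available as (LD word, HD word) pairs and Automorphology's output as a list of LD suffixes; the stem-matching heuristic and all names here are our simplification, not the actual script:

```python
from collections import Counter, defaultdict

def suffix_map(word_pairs, ld_suffixes):
    """Given (LD word, HD word) pairs from the frequent word dictionary and
    the LD suffixes proposed by Automorphology, collect the HD endings that
    correspond to each LD suffix. L2 keeps the single most frequent target;
    L3 keeps the whole multimap."""
    candidates = defaultdict(Counter)
    for ld, hd in word_pairs:
        for suf in ld_suffixes:
            if ld.endswith(suf) and len(ld) > len(suf):
                stem = ld[: -len(suf)]
                # if the HD cognate shares the stem, the remainder is taken
                # as the corresponding HD suffix
                if hd.startswith(stem):
                    candidates[suf][hd[len(stem):]] += 1
    l2 = {s: c.most_common(1)[0][0] for s, c in candidates.items()}
    l3 = {s: sorted(c) for s, c in candidates.items()}
    return l2, l3
```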

7.4 Results of experiments on (Tajiki, Farsi)

We then ran all the experiments on the (Farsi, Tajiki) language pair. Here is a table of the results for our reimplementation of the traditional approaches:

APPROACHES TABLE: RESULTS (as % correct)

Approach                                                                    Farsi / Tajiki
D1 -- Apply 3rd party HD tagger to LD text                                       61.2%
D2 -- Upper baseline: apply 3rd party LD tagger to LD text                       98.2%
D3 -- Unsupervised training on LD corpus                                         42.0%
D4 -- Unsupervised training on LD corpus, using noisy tags from 3rd party        31.7%
D5 -- Same as D4, but filter for distributional criteria                         36.0%
D6 -- Same as D4, but incorporate affix probabilities                            32.8%
H1 -- Using LD lexicon, uniform lexical model                                    87.5%
H2 -- Same as H1, but for cognates, average with HD's lexical model              79.4%

Table 8: Tajiki results, traditional approaches

The D1 baseline, applying the HD Farsi tagger to the LD Tajiki text, had an accuracy of 61.2%, which was higher than for any other language pair, including close languages such as (Dutch, German) [43.5%] and (Estonian, Finnish) [59.9%]. Even so, only 30.9% of the Tajiki corpus was in-vocabulary for Farsi. The H1 approach for (Farsi, Tajiki) likewise had an accuracy of 87.5%, which was higher than for any other considered language pair, including close languages such as (Dutch, German) [82.1%] and (Estonian, Finnish) [83.8%]. We note that even though H1 is so high, at 87.5%, it is still not as high as D2, which is a Tajiki tagger tagging Tajiki, at 98.2%.

Here is the table with results for our novel HD-ification approaches, as they worked for (Farsi, Tajiki), juxtaposed with the results for two other close language pairs. The first row is the baseline, and the numbers in subsequent rows represent the delta from that baseline:

HD-ification Approaches: modify test corpus to resemble HD language, and tag with HD tagger

                                          (Dutch,     (Estonian,    (Farsi,
                                          German)     Finnish)      Tajiki)
Baseline: 3rd party tagger                 43.5%        59.5%        61.2%
L1 -- HD-ify, frequent word replacement   +25.8%        +6.8%       +20.3%
L2 -- HD-ify, affix replacement            +1.2%        +1.1%        +7.2%
L3 -- HD-ify, exemplar replacement         +3.1%        +1.8%        +7.6%
L4 -- HD-ify                              +26.4%        -4.2%       +17.4%

Table 9: (Farsi, Tajiki) results, HD-ification, compared with other close languages

In general, all four of the HD-ification approaches performed quite well, something that was not the case for other language pairs. It is true that for L1 and L4, the delta for (Dutch, German) is larger than the delta for (Farsi, Tajiki). However, that is in large part because the baseline for (Dutch, German) performed comparatively poorly. If we consider the overall accuracy for (Farsi, Tajiki), we see that it was 81.5% after L1 and 78.6% after L4, much higher than the overall accuracy for (Dutch, German). This makes sense because L1 is gloss replacement of frequent words and, because of the large shared vocabulary, many of these Tajiki words were already in-vocabulary. The same is true for L4, which brings words in-vocabulary via cognation detection. A good number of these cognates were actually identical, and thus already in-vocabulary, so while the deltas for L1 and L4 were indeed quite impressive, they still did not have as much of an impact as for a language pair with less shared vocabulary.

In the general case of language pairs, L2 and L3 HD-ify, the affix-based strategies, did not show much improvement. However, there was significant improvement for these affix-based approaches for (Farsi, Tajiki). This is partly because, besides the regular affix interchanges we might find between any close languages, these affix replacements also supplement deficiencies in the simple transliteration scheme, bringing many slight differences in spelling back into alignment. For instance, the Romanized Tajiki suffix -oti commonly corresponds to the Romanized Farsi suffix -at. That is, in many words, the Tajiki и, as the izofat, Romanized as i, did not appear in the Romanized Farsi. And the Romanized Tajiki letter o, from the Tajiki Cyrillic о, in such contexts was often written in Farsi as one of the letters corresponding to Romanized a. For another example, the Romanized Tajiki prefix mo- commonly corresponds to the Romanized Farsi prefix m-. That is, the o in Tajiki is written as a diacritic that does not regularly appear in this context in Farsi. By effectively removing the o in this context, we move a large number of Tajiki words into vocabulary.

In retrospect, the affix replacement could have performed even better. When discovering the Tajiki prefixes, we fed a relatively small corpus into the Automorphology program, and so only 4 Tajiki prefixes and 35 Tajiki suffixes were discovered. Analysis of a larger corpus may have yielded many additional affixes, and the regular Farsi replacements would have helped bridge the transliteration gap even more.

Furthermore, this only addresses missing or mismatched vowel letters when they appear near the beginning or end of a word. If the missing vowels appear in the middle of the word, then this affix-based approach would be of no help.

If we wanted to develop a general morphology-based approach for language-ification, which would also, as it happens, address these missing vowels in the middle of words, then perhaps the following would have been more successful: to determine, either via careful linguistic analysis or by unsupervised detection methods such as in Automorphology, the consonant-vowel pattern in use through the entire word. For example, the Romanized Tajiki word manzili could have the pattern CaCCiCi, or, if certain consonants (say, the initial m) serve morphological purposes and often repeat across words, those letters could be part of the pattern, such as maCCiCi. And then, when the corresponding Farsi word, mnzl, is found to be mCCC, that could serve as a language-ification rule.
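The pattern extraction just described can be sketched minimally. This is a hypothetical illustration of the idea, not an implemented component of the dissertation; the `keep_literal` parameter is an assumption standing in for "morphologically meaningful letters kept as-is."

```python
# Sketch of the consonant-vowel pattern idea: extract a template for a word,
# optionally keeping designated morphological letters (e.g. an initial m-)
# literal. Purely illustrative.

VOWELS = set("aeiou")

def cv_pattern(word, keep_literal=()):
    """Return the word's template, e.g. manzili -> CaCCiCi,
    or maCCiCi if ('m', 0) is in keep_literal."""
    out = []
    for i, ch in enumerate(word):
        if (ch, i) in keep_literal:
            out.append(ch)   # morphological letter, kept as-is
        elif ch in VOWELS:
            out.append(ch)   # vowels are kept, per the example
        else:
            out.append("C")
    return "".join(out)
```

A rule such as maCCiCi → mCCC could then be read off by pairing the Tajiki template with the template of the corresponding Farsi word.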

It might also be instructive to see just what percentage of the corpus is in-vocabulary. This measure is recognized words in corpus / total size of corpus, meaning that repeated words count in both the numerator and the denominator. A word being in-vocabulary does not necessarily mean that it is being recognized correctly, just that the word in its current form is recognized by the tagger.
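The measure as defined is token coverage, which might be computed as follows (a trivial sketch; the function name is illustrative):

```python
# Minimal sketch of the in-vocabulary measure: token (not type) coverage,
# so repeated words count in both numerator and denominator.

def in_vocabulary_rate(corpus_tokens, lexicon):
    """Fraction of corpus tokens recognized by the tagger's lexicon."""
    if not corpus_tokens:
        return 0.0
    known = sum(1 for tok in corpus_tokens if tok in lexicon)
    return known / len(corpus_tokens)
```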


HD-ification Approaches                   (Farsi, Tajiki)   In-vocabulary
Baseline: 3rd party tagger                     61.2%            30.9%
L1 -- HD-ify, frequent word replacement       +20.3%            55.3%
L2 -- HD-ify, affix replacement                +7.2%            31.1%
L3 -- HD-ify, exemplar replacement             +7.6%           100.0%
L4 -- HD-ify                                  +17.4%            87.5%

Table 10: Tajiki results, HD-ification

The 61.2% baseline is that high even with only 30.9% of words in-vocabulary. The L1 approach targets the most frequently occurring, and perhaps functionally important, words, so even though only 55.3% of the corpus is in-vocabulary, it has a greater delta than L4, for which 87.5% of words are in-vocabulary. Additionally, the words replaced via L1 are given by a human expert, so these are in most cases the correct in-vocabulary words, while the words replaced by L4 are given by a computer program; the replacement is noisy, and the new in-vocabulary words might in many cases be incorrect. L2 brings into vocabulary a small number of words via affix replacement (only 0.2%), but that is sufficient for a significant improvement in tagging accuracy. L3 has 100% of words in-vocabulary only in the trivial sense, in that an exemplar is suggested for (almost) every unknown word, from a best-guess based on affixes.

Here is the table with results for our novel LD-ification approaches as they worked for (Farsi, Tajiki). The first row is the baseline, and the numbers in subsequent rows represent the delta from that baseline. These results are coupled with the percentage of words in the corpus which were in-vocabulary, for each approach:


LD-ification Approaches               Farsi / Tajiki   In-vocabulary
Baseline: 3rd party tagger                 67.5%           30.9%
L1 -- LD-ify, frequent replacement         +8.9%           39.3%
L2 -- LD-ify, affix replacement            +4.8%           38.1%
L4 -- LD-ify, cognate replacement          +8.8%           78.8%

Table 11: Tajiki results, LD-ification

Here are some interesting points regarding the relative success of the approaches. Firstly, of the three approaches, the comparative success, as measured as delta from the baseline, was L1 > L4 > L2. This was the case for both HD-ify and LD-ify approaches. This can be understood as: frequent human-expert-crafted replacements > comprehensive but automatic whole-word cognate replacement > automatic partial-word replacement.

[Bar chart: (Farsi, Tajiki) deltas from baseline, HD-ify vs LD-ify. L1 frequent replacement: 20.3% / 8.9%; L2 affix replacement: 7.2% / 4.8%; L4 cognate replacement: 17.4% / 8.8%.]

Figure 48: (Farsi, Tajiki) deltas from baseline

In terms of which to prefer, LD-ify or HD-ify, as in the general language-pair case, it again seems that HD-ify had greater success. We should acknowledge, though, that the LD-ify baseline is somewhat higher than the HD-ify baseline, such that the overall accuracy of HD-ify and LD-ify is closer than it may appear.

7.5 Predictions

Given that Farsi and Tajiki are extremely close languages – indeed, both are dialects of Persian – Farsi is an obvious choice for the HD language from which to project. Despite the difficulties that arise from cross-orthographic projection, one would not expect Russian (also written in a Cyrillic script, and sharing a bit of vocabulary) to produce better results than Farsi. Similarly, if Dari (Afghan Persian, written in a Perso-Arabic script) were the LD, then Tajiki as an HD (if it were high-density) would likely still be a better match than Arabic. Therefore, Bob does not really need to think deeply about language features, or refer to many of the predictions, to make his choice of language from which to project. Still, it may be useful to consider if, and how, these predictions played out in the case of (Farsi, Tajiki). Further, it may suggest useful refinements to the basic language-ification approaches based on the particular features of Farsi and Tajiki. Also, if a prediction speaks to how one strategy (HD-ify, LD-ify, L1, L4, etc.) may perform, then Bob may indeed find it useful.

7.5.1 Prediction [P1a, b]

You may recall that our first prediction was that simple application of an HD tagger to LD text (the D1 strategy) would perform best for pairs of languages with phylogenetic or historical cultural connections. In this case, Farsi and Tajiki have a rather strong phylogenetic connection, more so than any other language pair, even the extremely close ones. As a result, we would expect that D1 on (Farsi, Tajiki) > D1 on (Dutch, German), and certainly that D1 on (Farsi, Tajiki) > D1 on (French, German). In fact, this is the case, as is illustrated by the following chart:

[Bar chart, D1 accuracy: Farsi / Tajiki 61.2%; Dutch / German 43.5%; French / German 35.1%.]

Figure 49: Tajiki results, prediction 1

7.5.2 Prediction [P2a, b, c]

As you may recall, the second prediction was that low-density NLP projection approaches based on cognate-identification techniques (H2 and L4 HD-ify) would perform best for pairs of languages with phylogenetic or historical cultural connections. Such language pairs could be expected to have more true cognates, and so cognate-identification replacement would be more common and more accurate.

As a result, we would expect that H2 on (Farsi, Tajiki) > H2 on (Finnish, German). In fact, this is the case, as is illustrated by the following chart:

[Bar chart, H2 accuracy: (Farsi, Tajiki) 79.4%; (Finnish, German) 57.3%.]

Figure 50: Tajiki results, prediction 2 – H2

We would further expect that the delta of HD-ify L4 on (Farsi, Tajiki) > the delta of HD-ify L4 on (Finnish, German). In fact, this is the case, as is illustrated by the following chart:

[Bar chart, delta of L4 HD-ify: (Farsi, Tajiki) 17.4%; (German, Finnish) 0.1%.]

Figure 51: Tajiki results, prediction 2 – L4 HD-ify against (German, Finnish)

In a previous chapter, we compared (Dutch, German) and (German, Dutch), as close languages, with (Finnish, German). Since Farsi and Tajiki are even closer, we might expect the delta for (Farsi, Tajiki) to be even greater. In fact, the situation is a bit more complicated than that, as the following chart illustrates:


[Bar chart, delta of L4 HD-ify: (Dutch, German) 26.4%; (German, Dutch) 6.8%; (Farsi, Tajiki) 17.4%.]

Figure 52: Tajiki results, prediction 2 – L4 HD-ify against (German, Dutch) and (Dutch, German)

As discussed above, this is because, even though cognation is working in each of these language pairs to bring words into vocabulary, for (Farsi, Tajiki) a good deal of the cognates are identical and thus already in-vocabulary. Still, we do say that cognation replacement is working, and indeed quite well, for this language pair.

7.5.3 Prediction [P3a, b, c, d]

As you may recall, the third prediction was that low-density NLP projection approaches based on simple adoption of the HD contextual model (H1) will be most effective for pairs of languages that are closest in their fundamental word order (e.g., SVO, SOV, etc.).

Tajiki and Farsi both have SOV word order. Meanwhile, while English and German are both SVO, German has SOV in subordinate clauses while English retains SVO. As a result, we would expect that H1 on (Farsi, Tajiki) > H1 on (English, German). In fact, this is the case, as is illustrated by the following chart:


[Bar chart, H1 accuracy: (Farsi, Tajiki) 87.5%; (English, German) 79.3%.]

Figure 53: Tajiki results, prediction 3

7.5.4 Prediction [P4a, b]

As you may recall, the fourth prediction was that language pairs which share noun-adjective order will benefit more from wholesale adoption of the HD contextual model (H1, D1), and will thus have greater accuracy than language pairs which don't.

In Farsi, adjectives are generally postmodifiers. Where they come before the noun, they are joined by an enclitic unstressed vowel, the izofat. Recall that this izofat is generally not written in Farsi. An examination of the gold-standard Tajiki corpus reveals the same pattern of nouns and adjectives, which is readily apparent since the izofat is written.

Meanwhile, French and German differ in their noun-adjective order. Therefore, we would expect that H1 on (Farsi, Tajiki) > H1 on (French, German) and that D1 on (Farsi, Tajiki) > D1 on (French, German). In fact, this is the case, as is illustrated by the following chart:


[Bar chart, D1 and H1 accuracy: Farsi / Tajiki D1 61.2%, H1 87.5%; French / German D1 35.1%, H1 77.6%.]

Figure 54: Tajiki results, prediction 4

7.5.5 Prediction [P5a, b, c, d]

As you may recall, the fifth prediction was that language pairs in which both languages have articles would benefit more from gloss replacement of frequent words (L1 LD-ify, L1 HD-ify) than would language pairs in which one language, or both languages, lack articles.

Our example of a language pair in which both languages make regular use of articles was (Dutch, German). Meanwhile, Farsi and Tajiki lack a definite article, and the indefinite article occurs as an enclitic attached to the noun or adjective, rather than as a separate word. As such, from the perspective of the contextual and lexical models, and for any whole-word-based language-ification effort, these articles don't exist.

As a result, we would expect that delta L1 LD-ify on (Dutch, German) > delta L1 LD-ify on (Farsi, Tajiki), and also that delta L1 HD-ify on (Dutch, German) > delta L1 HD-ify on (Farsi, Tajiki). In fact, this is the case, as is illustrated by the following chart:


[Bar chart, gloss replacement of frequent words, does the language pair have articles? Delta L1 HD-ify: (Dutch, German) 25.8%, (Farsi, Tajiki) 20.3%. Delta L1 LD-ify: (Dutch, German) 14.3%, (Farsi, Tajiki) 8.9%.]

Figure 55: Tajiki results, prediction 5

7.5.6 Prediction [P6a, b, c]

As you may recall, the sixth prediction had to do with the effects of divergent granularity of articles on LD-ification depending on whether the source or target language of a language-ification had a richer or poorer level of articles.

This is then not relevant to the (Farsi, Tajiki) language pair. As just discussed, Farsi and Tajiki have no definite article and the indefinite article is enclitic, attached to the noun or adjective, rather than standing as a separate word. The granularity of both is the same (or, non-existent), and so no meaningful predictions can be made here for LD-ification, with either of the languages as source or target.

7.5.7 Prediction [P7]

As you may recall, the seventh prediction was that approaches involving wholesale adoption of the HD contextual model (H1) will be most effective for pairs of languages in which both are PRO-drop or both are not PRO-drop, but that if there is a divergence in this language feature, then such an approach will not be as successful.

Both Farsi and Tajiki are PRO-drop languages. In contrast, Finnish is PRO-drop while German is not. Therefore, we might expect that H1 on (Farsi, Tajiki) > H1 on (Finnish, German). In fact, this is the case, as is illustrated by the following chart:

[Bar chart, H1 accuracy: (Farsi, Tajiki) 87.5%; (Finnish, German) 77.9%.]

Figure 56: Tajiki results, prediction 7

7.5.8 Prediction [P8, P9]

As you may recall, our eighth prediction was that, assuming a difference in level of inflection between the HD and the LD languages, LD-ification approaches would be more successful if the HD inflection level was greater than the LD inflection level. And the ninth prediction was that, assuming a difference in level of inflection between the HD and the LD languages, HD-ification approaches would be more successful if the HD inflection level was lower than the LD inflection level. (That is, in general, for both HD-ify and LD-ify, it is better if the language-ification is in the direction of the lower level of inflection.)

Since there is no difference in level of inflection between Farsi and Tajiki, neither of these predictions was relevant.

7.6 Frequent word profile for Farsi and Tajiki

Here are the frequent-word profiles of Tajiki and Farsi.

Tajiki         50   100   150   200   250     Farsi          50   100   150   200   250
ADJECTIVE       8    13    15    21    28     ADJECTIVE       4     7    14    21    25
ADVERB          3     5     7    11    12     ADVERB          2     2     7     8     9
ARTICLE         0     0     0     0     0     ARTICLE         0     0     0     0     0
CONJUNCTION     2     3     4     4     4     CONJUNCTION     3     6     6     7    10
NOUN           20    46    73    95   126     NOUN           12    33    57    89   123
PREPOSITION     7     9    13    17    18     PREPOSITION     9    10    10    10    10
PRONOUN         5     8     8     9    13     PRONOUN         7    11    12    14    15
VERB           10    21    35    43    45     VERB           10    22    31    35    37

Figure 57: Tajiki frequent word profile (left) and Farsi frequent word profile (right)

As we consider the 50, 100, etc., most frequent words of the language, how many of them are nouns, adjectives, adverbs, and so on? Similarity of such profiles between the languages in a language pair reveals something about the similarity between their contextual and lexical models, as well as helping us intuit just how many of the more important words are added at each stage, which in turn might indicate the amount of effort Bob should put in towards building a frequent-word dictionary.

If you turn your head sideways, the bar graphs in each column (50 most frequent words, 100, etc.) correspond to the distribution of tags for those frequent words. The pattern of relative tag frequency emerges early, and stays more or less constant. If we compare this frequent-word profile to that of Farsi, we can see that its profile is quite similar to that of Tajiki.

Finally, we consider the results of replacing that number of Tajiki words with their Farsi equivalents, and see that even at 50 words, the slope has more or less flattened out.

[Line chart, Farsi / Tajiki, accuracy after replacing the most frequent words: 0 words, 61.2%; 50 words, 78.5%; 100 words, 79.9%; 150 words, 80.6%; 200 words, 81.1%; 250 words, 81.5%.]

Figure 58: Tajiki frequent word replacement, 50...250

The main impact of the L1 HD-ification happened with just a 50-word replacement. Even if Bob does not wish to invest much in building an L1 dictionary, a small dictionary of just 50 words would have helped.


Chapter 8: Conclusion

In this chapter we take stock of what we have learned from this study and this series of experiments. Earlier in this dissertation, we introduced a hypothetical computational linguist “Bob” who needed to implement an NLP tool, as quickly as possible, under very tight resource constraints. If Bob were to read this dissertation, what practical advice would he come away with? If a researcher wanted to continue along the path set out by this study, what has already been established? What are the contributions of this work, both in terms of ideas and in terms of practical resources?

8.1 Practical advice

In terms of practical advice that Bob can take away from this, first and foremost, there is the fact that L1 HD-ify is surprisingly effective. In many cases this is true even when replacing only a very small number of words; this is especially notable in that Bob does not need to invest much time, effort, or resources in assembling this dictionary. Whatever other projection approaches Bob uses, he should consider incorporating L1 HD-ify alongside them.
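The L1 HD-ify strategy itself is mechanically simple, which is part of its appeal; a minimal sketch follows. The function name and the toy Romanized words in the usage comment are illustrative, not drawn from the dissertation's actual resources.

```python
# Sketch of L1 HD-ification: gloss-replace the most frequent LD words with
# their HD equivalents before running the HD tagger. The dictionary here is
# a made-up stand-in for the small expert-built frequent-word list.

def hdify_l1(ld_tokens, frequent_word_dict):
    """Replace each LD token found in the (small) expert-built frequent-word
    dictionary with its HD gloss; leave all other tokens unchanged."""
    return [frequent_word_dict.get(tok, tok) for tok in ld_tokens]
```

The entire investment is the dictionary itself; as shown above, even a 50-entry dictionary captured most of the benefit for (Farsi, Tajiki).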

Secondly, if Bob wants to make use of cognation to project linguistic resources from the HD to the LD, then a simple cognate-identification algorithm, such as the one we used in this dissertation, will likely not suffice. The identification algorithm, even with the threshold in place, introduced many false cognates. This noise actually led to a reduction in accuracy from H1 to H2. And L4 approaches sometimes worked quite well, but not consistently: the significant improvements were seen in L4 HD-ify for (Dutch, German), L4 LD-ify for (German, Dutch), and both L4 LD-ify and HD-ify for (Farsi, Tajiki).

Based on this pattern, it seems that for this simplistic cognation projection to work, the languages need to be closely related to one another in surface form. Also, it seems that inflection and overloaded or close morphology can easily derail the cognate detection algorithm. Finally, cognate projection in the direction of the language with less inflection is more successful.

Bob’s primary goal was to accomplish as much as possible while utilizing very limited resources: if Bob might need to implement NLP tools for multiple low-density languages, this would imply that he should not spend a lot of time implementing techniques that are language-pair specific. However, if Bob wishes to make use of cognate projection techniques, he will likely need to obtain a better list of cognates. He might hire a language expert to produce such a list, or to correct a computer-generated list. Alternatively, he might start with a cognate list and train a cognation model upon it.

Thirdly, Bob should note that simplistic affix-based projection techniques are not successful in the general case, for either LD-ification or HD-ification. While there were a few successes (see, e.g., L2 LD-ify of (Estonian, Finnish) and (German, Dutch)), in the general case there was little improvement. What seems necessary for the simplistic affix replacement to work is the existence of regular and unambiguous affix correspondences, which often did not exist between the languages in the language pair.

Our unlikely hope that L2 HD-ification would prove somewhat effective, using simple affix replacement to hit upon cognates, turned out to be overly optimistic. While it worked in the case of (Farsi, Tajiki), those are very close languages. Perhaps a less simplistic affix-replacement approach might perform better in the general case.

Fourthly, if Bob has to choose between HD-ify and LD-ify, he should prefer the HD-ify approach, which generally performed better.

In addition to those high-level “take away” lessons that Bob can learn from this dissertation, the results chapter also contained several linguistically-driven predictions which might prove useful to Bob, and so we will summarize and reflect upon them here.

When setting out to project from an HD language to particular LD language, Bob can theoretically select from a large number of HD languages, and he can select from a number of different projection approaches. Some basic linguistic analysis of the LD language and the candidate HD languages can help guide him in these selections.

Thus according to prediction [P1], pairs of languages with phylogenetic or historical cultural connections perform better for D1 (simple application of the HD tagger to the LD language). So, as one would intuit, if Bob wants to tag Yiddish, he would be better off using a German tagger than an English tagger, and he should prefer it, even if he is less familiar with German or will have greater difficulties obtaining a tagger.

It is likely, though, that he will want to improve over this baseline tagger with various language-ification approaches, and we should point out that, depending on the particular language features and the particular approach, the delta from baseline could conceivably compensate for starting at a lower accuracy level. (For instance, if the LD and HD languages are very close but, due to independent development, the HD differs by lacking articles, a more distant HD language could perform better after L1 HD-ify.)

159

Further, according to prediction [P2], cognate-based language-ification approaches work best on language pairs with phylogenetic or historical cultural connections; so, if Bob has such a language pair, it makes sense to consider such an approach. And if he is considering such an approach, he would do well to select an appropriate language pair.

According to [P3], linguistic features which impact word order or phrase order will impact the efficacy of projecting the HD contextual model. Thus, Bob might consider constituent word order (SOV, SVO, etc.) of the languages in his selected language pair, especially if he is using a strategy such as H1, which adopts wholesale the HD contextual model. So too [P4], he might consider whether adjectives are premodifiers or postmodifiers of nouns, as this similarity or dissimilarity would be reflected in the contextual models of the two languages. This would be noticed when employing such approaches as H1 or D1, and would be a contributing factor for other projection approaches as well (though other features such as similar morphological patterns or shared vocabulary might have greater impact in those cases). So too [P7], he might consider whether both of his selected languages are PRO-drop, as similarity in this feature will yield better results in strategies like H1.

According to [P5], gloss replacement of frequent words – L1 – is especially effective where articles are present in both languages of the pair. Thus, if a language lacks or rarely employs articles, Bob should know this beforehand, and it might steer him towards a different approach. This is especially true if his frequent-word wordlist is small, such as 50 words. If Bob uses the L1 approach for such a language pair, then he should invest in assembling a larger wordlist. Conversely, if both languages in the pair have articles, then Bob might expect some dramatic results even with a small wordlist, or a wordlist consisting of just the articles.


Following prediction [P6], languages can differ in the lexical granularity of their articles, and this can get in the way of LD-ification approaches (particularly L1, but conceivably others as well). If Bob has selected an HD with a lower lexical granularity than the LD, then he should either favor the HD-ify approach, or take steps to work around this issue – for example, by creating a dictionary which is a multimap and iterating through the various LD options when replacing articles in the HD corpus; or by exploiting language-specific linguistic knowledge to select the appropriate article based on context. He might consider the same for other parts of speech as well (e.g. pronouns), where there is a difference in lexical granularity between the two languages.

According to predictions [P8] and [P9], if two languages differ in their level of inflection, language-ification in the direction of the poorly inflected language is more successful than in the direction of the richly inflected language. Thus, if Bob has already selected his language pair, he might use this to choose whether HD-ify or LD-ify is the right approach.

Finally, as the Tajiki case study illustrated, even where there are orthographic distinctions between the two languages (e.g. Farsi and Tajiki, Tajiki and Bukhori, German and Yiddish, or Arabic and Malti), simple transliteration approaches can be effective, with the simple language-ification approaches helping to bridge the rest of the way.

8.2 Future work

If Margaret, a computational linguist, were to read this dissertation and wanted to continue along the same lines, how might she do so? We note a few areas of future research that are ripe for further study. Firstly, in this dissertation, we generated the frequent word profiles for each language, showing the distribution of parts of speech for the most frequent (50, 100, 150, 200, 250) words – these may be found in Appendix B – and we suspect that they might prove useful in predicting, ahead of time, how well L1 approaches could work. It is still unclear precisely how, but future research might reveal some useful patterns.

Secondly, it would be interesting to see how well cognate-based language-ification approaches work with a really good list of cognates, as opposed to the noisy list which we generated. As mentioned earlier, simplistic approaches such as minimum edit distance with a threshold do identify cognates, but also misidentify words as cognates, and are readily derailed by inflection and overloaded or close morphology. Thus, a better approach might be to use a human-generated list, or to use a human to double-check the computer-generated list.

Alternatively, Margaret could obtain a small list of cognates and use it to train a good cognation model. A simple one might be a transliteration model (similar to a translation model, and based on source letter, destination letter, and context). Based on the particular weaknesses we observed in our simplistic cognation detection across several language pairs, we might point out specific places where this cognation model might be supplemented. For instance, if an HD language uses an -s suffix for both noun plurals and present tense verbs, but the LD language uses different suffixes for each, the cognation model might confuse the two. However, on the HD-language side of the training cognate list, it is reasonable to have the part(s)-of-speech. And later, when performing cognate detection or generation, it is reasonable to know the HD part of speech. Training separate cognation models for each part-of-speech might then be more effective than training a single cognation model for the entire language pair.
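The per-POS training idea might be sketched as follows. This is a deliberately simplified illustration: each "model" here is just a counter of aligned letter substitutions (standing in for a real transliteration model), the function name is invented, and the English word pairs in the usage comment are toy examples.

```python
# Sketch: train separate, very simple per-POS cognation "models". Each model
# is a counter of letter substitutions observed in aligned cognate pairs,
# standing in for a real transliteration model.
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def train_per_pos_models(tagged_cognate_pairs):
    """tagged_cognate_pairs: iterable of (hd_word, ld_word, hd_pos).
    Returns, per POS, counts of same-length character substitutions seen
    in the aligned regions of each cognate pair."""
    models = defaultdict(Counter)
    for hd, ld, pos in tagged_cognate_pairs:
        sm = SequenceMatcher(None, hd, ld)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            # only count simple one-for-one substitutions
            if op == "replace" and (i2 - i1) == (j2 - j1):
                for a, b in zip(hd[i1:i2], ld[j1:j2]):
                    models[pos][(a, b)] += 1
    return models
```

Keeping the counts separate by POS would let a suffix correspondence learned from nouns avoid contaminating the model applied to verbs, addressing the overloaded-suffix confusion described above.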


Thirdly, cognation detection often worked well for the less frequent words – e.g., long adjectives and nouns, where the core of the word was basically the same but there were morphological or inflectional differences, or slight differences in vowels or consonants. Many of the important and frequent function words, even where they were cognates, were not identified correctly, because they were short words and there were other similar words to which they matched. Since cognation dictionaries and frequent word dictionaries target different sets of words, it would be interesting to explore using a frequent word dictionary first, and only using the cognation dictionary for the remainder of the words.
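The proposed cascade might look like the following sketch. The function name, lookup order, and toy dictionaries are illustrative assumptions, not an implemented component of this study.

```python
# Sketch of the proposed cascade: consult the trusted expert frequent-word
# dictionary first (it targets the short, frequent function words), and fall
# back to the noisier cognate dictionary only for remaining unknown words.

def cascade_replace(tokens, frequent_dict, cognate_dict, hd_lexicon):
    out = []
    for tok in tokens:
        if tok in frequent_dict:
            out.append(frequent_dict[tok])   # trusted expert gloss
        elif tok in hd_lexicon:
            out.append(tok)                  # already in-vocabulary
        elif tok in cognate_dict:
            out.append(cognate_dict[tok])    # noisy cognate fallback
        else:
            out.append(tok)                  # leave unknowns untouched
    return out
```

Ordering the lookups this way keeps the unreliable cognate replacements away from exactly the short function words where, as noted above, the cognate detector was most error-prone.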

Fourthly, there are a few ways we could improve the affix replacement approach without relying too much on LD linguistic knowledge. The Automorphology program was good at identifying, in an unsupervised manner, many of the prefixes and suffixes of a single language based on an untagged corpus. There was, of course, noise, such as parts of the stem being incorporated into the affix. For instance, rather than pre-, there might be pred- and prec-. However, we actually have (on the HD side) the parts-of-speech, and this classification might be harnessed to better identify the prefixes. Also, perhaps using a small corpus of the language pair (either as an aligned corpus, an aligned wordlist, or two unaligned corpora), we could better identify these affixes or even the appropriate affix replacements.
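A minimal sketch of unsupervised suffix discovery, in the spirit of Automorphology, might look as follows; a candidate suffix is counted whenever stripping it leaves a stem that itself occurs as a word, and the word list is invented for illustration:

```python
# Illustrative sketch of unsupervised suffix discovery from an untagged
# word list. A real run would use the full LD corpus vocabulary.
from collections import Counter

def candidate_suffixes(words, max_len=4, min_count=2):
    vocab = set(words)
    counts = Counter()
    for w in vocab:
        for k in range(1, max_len + 1):
            # A suffix candidate: stripping it leaves an attested stem.
            if len(w) > k and w[:-k] in vocab:
                counts[w[-k:]] += 1
    return {s for s, c in counts.items() if c >= min_count}

words = ["walk", "walks", "walked", "talk", "talks", "talked"]
# candidate_suffixes(words) → {"s", "ed"}
```

The POS information available on the HD side could then be used to filter candidates that attach to the wrong word class, reducing noise like pred- and prec-.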

Fifthly, there are other language-ification approaches which we have not explored.

For instance, we discussed whether adjectives were premodifiers or postmodifiers. One possible, though somewhat language-specific, LD-ification could be based on reordering of adjectives and nouns in a tagged corpus. If these sorts of differences in the contextual model could be detected via supervised learning (on a small tagged LD corpus and a small tagged HD corpus) and then automatically applied, this might be a useful approach of interest to Bob.

Another possible language-ification could be splitting off, or joining up, clitics (e.g., the direct object marker ro, which is a suffix in Tajiki and a stand-alone word in Farsi). In this study, we kept word boundaries sacrosanct. We would need to consider how language-specific this sort of language-ification is.
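For illustration, such a clitic-splitting step might be sketched as follows; the Latin-script form kitobro and the regex pattern are simplifications for illustration, not drawn from our actual pipeline:

```python
# Illustrative sketch of clitic splitting as a language-ification step:
# detaching a word-final direct-object marker -ro so it matches Farsi's
# stand-alone "ro". The pattern below is a simplification.
import re

def split_ro_clitic(token):
    """Split a word-final -ro into a separate token."""
    m = re.fullmatch(r"(\w{2,})ro", token)
    return [m.group(1), "ro"] if m else [token]
```

Of course, a naive pattern like this will also split words that merely happen to end in -ro, which is one reason this sort of language-ification is language-specific.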

Sixthly, while we believe that our results were quite interesting and indicative of how various strategies might perform given particular linguistic features, we realize that our study was by no means comprehensive. We have, after all, considered only seven language pairs. Future research applying these approaches to many more languages would add a greater level of confidence to our results.

Finally, it would be interesting to explore how these approaches carry over to other computational linguistic tasks. We considered part-of-speech tagging (and, in the case of Tajiki, transliteration and part-of-speech tagging), but this was intended as an example task. It would be of interest to explore how these approaches fare in machine translation, or in named entity recognition.

8.3 Contributions

Besides what is discussed in the previous sections, what are the contributions of this work, both in terms of ideas and in terms of practical resources?

First, there is the idea of investigating an approach across multiple language pairs, in order to see how the approach performs in the general case, rather than on the particular language pair selected by the researcher. Often a language pair is selected because of a unique feature it possesses, and so the approach might not carry over to another language pair of interest to Bob or Margaret.

Along similar lines, there is the actual side-by-side comparison of existing approaches, on the same set of languages and the same corpora, such that they can be more readily assessed in terms of performance and the effort required to implement them. This way, one can select an approach to implement or extend when faced with the same challenge for a new LD language.

Second, there is the finding that, in general, the L1 HD-ify approach has a dramatic impact on accuracy over the D1 approach, and that generating these frequent word dictionaries is a relatively non-intensive task. Accordingly, perhaps other researchers should use this as a baseline when considering some other novel approach.

Third, there is the idea of language-ifications, in the direction of the HD or in the direction of the LD, and the finding that deliberately simple language-ification approaches can aid in projecting resources or taggers from one language to another. Relatedly, there is the question of which sorts of language-ifications work across different language pairs.

Fourth, there is the actual code and testbed we have developed, by which researchers can quickly consider some new language pair or approach, and compare the results against other language pairs and approaches in a systematic manner. Much of this code is written in Python, with hooks into the Natural Language Toolkit (NLTK). Also, the output of much of our code consists of tagged corpora, which can be readily fed as input into third-party programs, such as TreeTagger. This code will be provided upon request.

Fifth, there are the actual taggers we developed. In the case of HD-ification, this entails language-ification of the test corpus and running it through an existing HD tagger; in the case of LD-ification, there are actual trained TreeTaggers for each of the LD languages.

Admittedly, most of these taggers are for so-called LD languages which are really HD, such that anyone wishing to actually tag the language would instead turn to an existing HD tagger.

However, these produced taggers might be useful for the sake of comparison with other projection approaches. They will be available for download at the LATLab website.20

In a few cases, though, the taggers we built would indeed be quite useful for other researchers. When looking to tag Finnish, we discovered that there was no TreeTagger available. It took a great deal of effort and aggravation to get it working properly, but in the end we successfully produced a TreeTagger for Finnish, based on an extremely large tagged Finnish corpus, namely the Finnish Treebank. This Finnish TreeTagger now resides on the TreeTagger website, so that others who wish to tag Finnish, or to tag the Europarl corpus to compare results with those of other languages in that parallel corpus, can simply download the parameter file and do so. Likewise, the Farsi TreeTagger that we trained on the Bijankhan corpus may be useful for someone who wishes to tag Farsi.

Sixth, because Tajiki is a genuine low-density language, we assembled a number of linguistic resources which could prove useful for someone performing computational linguistics tasks on this language. These resources include the Romanization dictionaries for Tajiki and Farsi, which allow the languages to meet in the middle, as well as the affix dictionaries and cognate dictionaries mapping between these two languages. These resources will also be available for download at the LATLab website.

Additionally, there are the three linguistic resources assembled by the Farsi and Tajiki language expert: (a) the tagged Tajiki gold-standard corpus, with 14,132 tokens; (b) the frequent word dictionary for Tajiki, which maps Tajiki words to their Farsi and English equivalents, and lists their part of speech; and (c) the frequent word dictionary for Farsi, which maps Farsi words to their Tajiki and English equivalents, and lists their part of speech.

20 http://latlab.cs.qc.cuny.edu/

Because so many of the words in (a) and (b) were cognates of one another, besides being useful for L1 projection, these resources might prove useful for the training of transliteration and cognation models. Further, because these frequent-word dictionaries contain the English equivalent, these resources open up the possibility of projection using English as either the HD language or, perhaps more usefully, as a pivot.

Finally, there are the actual Tajiki taggers we produced. There is a pure Tajiki tagger, trained on our somewhat limited Tajiki corpus. Because the corpus size was so small, there may be many genuine Tajiki words which would be out of vocabulary (though TreeTagger does consider the affixes as well). However, there are also the many projected Farsi → Tajiki taggers we produced, which have a broader Persian-based vocabulary and which perform fairly well in the general case.

8.4 Summation

In sum, in this dissertation research, we have considered and implemented approaches to low-density NLP that make use of cross-lingual projection techniques under strict resource constraints. Rather than selecting a particular language or language pair as a single exemplar for evaluation, we selected a number of (HD, LD) language pairs, in order to perform a systematic comparison. We selected those particular (HD, LD) language pairs on the basis of interesting linguistic features, and explored how those features might predict the relative success or failure of given linguistic-projection strategies. Finally, because our selected LD languages were really HD languages, we conducted a case study on an actual LD language, Tajiki, projecting from its close HD relative, Farsi. From our perspective, several of our results look promising, and we hope that they can advance the field of low-density NLP.

Appendix A

This appendix contains the part-of-speech tags for the Bijankhan corpus, specifically the version with a tagset granularity of 40. It is copied directly from the webpage,21 and is provided here to give the reader a sense of just what information is and is not available in this tagged Farsi corpus.

Tag        Description
ADJ        Adjective, General
ADJ_CMPR   Adjective, Comparative
ADJ_INO    Past Participle
ADJ_ORD    Adjective, Ordinal
ADJ_SIM    Adjective, Simple
ADJ_SUP    Adjective, Superlative
ADV        Adverb, General
ADV_EXM    Adverb, Exemplar
ADV_I      Adverb, Question
ADV_NEGG   Adverb, Negation
ADV_NI     Adverb, Not Question
ADV_TIME   Adverb, Time
AR         Arabic Word
CON        Conjunction
DEFAULT    Default
DELM       Delimiter
DET        Determiner
IF         Conditional
INT        Interjection
MORP       Morpheme
MQUA       Modifier of Quantifier
MS         Mathematic Symbol
N_PL       Noun, Plural
N_SING     Noun, Singular
NN         Number
NP         Noun Phrase
OH         Oh Interjection (حرف ندا)
OHH        Oh noun (منادی)
P          Preposition
PP         Prepositional Phrase
PRO        Pronoun
PS         Psedo-Sentence
QUA        Quantifier
SPEC       Specifier
V_AUX      Verb, Auxiliary
V_IMP      Verb, Imperative
V_PA       Verb, Past Tense
V_PRE      Verb, Predicative
V_PRS      Verb, Present Tense
V_SUB      Verb, Subjunctive

21 http://ece.ut.ac.ir/dbrg/bijankhan/Corpus/Bijankhan-tagset-description.txt

Appendix B

This appendix contains the frequent word profiles that were discussed at the tail end of chapter 6. For each of the seven languages we have studied, we have computed two different metrics, which describe the distribution of parts of speech within the most frequent 50 (or 100, 150, 200, 250) words in that language's corpus.

The first metric considers the 50 most frequent words and their tags, and counts how many of those tags are NOUN, ADJECTIVE, and so on. This count is unweighted, meaning that if the word "the" appears one million times and the word "and" appears only a quarter of a million times, each still counts once: as a single DETERMINER and a single CONJUNCTION. For the sake of illustration, we present again the chart for Dutch:

Dutch            50   100   150   200   250
ADJECTIVE         4    17    27    40    46
ADVERB           13    30    43    57    69
ARTICLE           5     5     5     6     6
CONJUNCTION       8    15    20    20    23
INTERJECTION      0     3     5     8     9
NOUN             15    33    62    79   105
PREPOSITION      14    20    23    27    31
PRONOUN          15    29    32    36    39
VERB              9    15    23    33    42

Figure 59 Frequent word profile for Dutch

If you turn your head sideways, the bar graphs in each column (50 most frequent words, 100, etc.) correspond to the distribution of tags for those frequent words. Considering only the 50 most frequent words, only 5 of those words are articles, compared with 15 pronouns. This stays the case even when we consider the 150 most frequent words: 5 are articles, compared with 32 pronouns. For the most part, the general form of these bar graphs for this particular language remains the same, though nouns do quickly outpace pronouns.

The second metric we compute is the weighted distribution. Looking at the actual corpus, what percentage of its tokens is covered by the 50 most frequent NOUNS, ADJECTIVES, and so on? Again, for the sake of illustration, we present the results for Dutch:

Dutch            50   100   150   200   250
ADJECTIVE        0%    2%    3%    4%    4%
ADVERB           7%    9%    9%    9%   10%
ARTICLE         25%   21%   20%   19%   18%
CONJUNCTION     17%   15%   14%   14%   13%
INTERJECTION     0%    0%    0%    0%    0%
NOUN             5%    9%   12%   12%   14%
PREPOSITION     23%   22%   20%   20%   19%
PRONOUN         20%   20%   19%   19%   18%
SYMBOL           0%    0%    0%    0%    0%
VERB             9%    9%   10%   10%   11%

Figure 60 Frequent word distribution for Dutch

Based on this chart, we can readily see that even though, when considering 50 frequent words, there were only 5 articles, they covered 25% of the corpus.
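The two metrics can be sketched as follows, over a toy list of (word, tag) tokens; assigning each word a single tag and the tiny corpus are simplifications for illustration:

```python
# Illustrative sketch of the two frequent-word-profile metrics.
from collections import Counter

def frequent_word_profile(tagged_tokens, n=50):
    word_freq = Counter(w for w, _ in tagged_tokens)
    top = [w for w, _ in word_freq.most_common(n)]
    tag_of = dict(tagged_tokens)              # one tag per word (simplified)
    # Metric 1: unweighted tag counts over the n most frequent word types.
    unweighted = Counter(tag_of[w] for w in top)
    # Metric 2: share of all corpus tokens covered by those words, per tag.
    covered = Counter()
    for w in top:
        covered[tag_of[w]] += word_freq[w]
    total = len(tagged_tokens)
    weighted = {t: c / total for t, c in covered.items()}
    return unweighted, weighted
```

On a toy corpus dominated by "the", a single ARTICLE type in the top words covers a large share of the tokens, mirroring how the 5 Dutch articles in the top 50 cover 25% of the corpus.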

The charts for the remaining six languages are presented here without further comment.

German:

German           50   100   150   200   250
ADJECTIVE         3    11    17    24    31
ADVERB           10    24    35    46    57
ARTICLE           8     8     9     9     9
CONJUNCTION       4    14    19    22    25
INTERJECTION      0     2     4     8    10
NOUN              6    18    30    52    79
PREPOSITION      16    21    25    29    32
PRONOUN          18    31    40    48    54
SYMBOL            0     0     0     1     1
VERB             10    16    27    34    39

Figure 61 Frequent word profile for German

German           50   100   150   200   250
ADJECTIVE        1%    1%    1%    2%    2%
ADVERB           4%    5%    6%    7%    7%
ARTICLE         34%   28%   26%   25%   24%
CONJUNCTION      9%   11%   11%   11%   10%
INTERJECTION     0%    0%    0%    0%    0%
NOUN             3%    5%    7%    9%   11%
PREPOSITION     26%   24%   23%   22%   21%
PRONOUN         12%   14%   14%   14%   14%
SYMBOL           0%    0%    0%    0%    0%
VERB            11%   11%   11%   11%   11%

Figure 62 Frequent word distribution for German

Finnish:

Finnish          50   100   150   200   250
ADJECTIVE         3     6    12    16    19
ADVERB           14    27    45    54    71
ARTICLE           2     2     3     5     6
CONJUNCTION      11    15    18    19    19
INTERJECTION      1     2     3     3     3
NOUN              7    16    26    39    53
PREPOSITION       3     7    15    22    29
PRONOUN          10    23    35    45    48
SYMBOL            0     0     0     0     0
VERB              5    17    24    32    40

Figure 63 Frequent word profile for Finnish

Finnish          50   100   150   200   250
ADJECTIVE        2%    3%    3%    3%    3%
ADVERB          16%   16%   16%   16%   17%
ARTICLE          2%    1%    1%    2%    2%
CONJUNCTION     34%   29%   26%   24%   23%
INTERJECTION     0%    0%    0%    0%    0%
NOUN            11%   12%   12%   13%   14%
PREPOSITION      3%    4%    6%    6%    7%
PRONOUN         14%   16%   17%   18%   17%
SYMBOL           0%    0%    0%    0%    0%
VERB            18%   19%   19%   19%   18%

Figure 64 Frequent word distribution for Finnish

Estonian:

Estonian         50   100   150   200   250
ADJECTIVE         7    14    18    21    29
ADVERB           14    23    38    54    64
ARTICLE           1     2     2     2     2
CONJUNCTION       7    10    15    16    16
INTERJECTION      1     1     4     5     6
NOUN              9    24    39    55    76
PREPOSITION       7    15    21    32    36
PRONOUN          15    19    24    30    36
SYMBOL            0     0     0     0     0
VERB              2    10    19    23    28

Figure 65 Frequent word profile for Estonian

Estonian         50   100   150   200   250
ADJECTIVE        0%    3%    3%    3%    4%
ADVERB          10%   10%   12%   13%   13%
ARTICLE          1%    1%    1%    1%    1%
CONJUNCTION     25%   21%   20%   18%   17%
INTERJECTION     0%    0%    0%    0%    0%
NOUN            11%   15%   16%   17%   19%
PREPOSITION      6%    8%    8%    9%    9%
PRONOUN         22%   20%   18%   18%   17%
SYMBOL           0%    0%    0%    0%    0%
VERB            13%   13%   14%   13%   13%

Figure 66 Frequent word distribution for Estonian

English:

English          50   100   150   200   250
ADJECTIVE         5    18    29    46    61
ADVERB            9    23    36    46    53
ARTICLE           4     4     4     4     4
CONJUNCTION       6    11    14    17    21
INTERJECTION      0     5     7     8     9
NOUN              5    21    42    69    99
PREPOSITION      10    15    18    18    21
PRONOUN          13    26    30    34    35
SYMBOL            0     0     0     0     0
VERB             11    21    37    54    66

Figure 67 Frequent word profile for English

English          50   100   150   200   250
ADJECTIVE        3%    4%    5%    6%    6%
ADVERB           3%    5%    6%    6%    6%
ARTICLE         22%   18%   17%   16%   15%
CONJUNCTION     12%   11%   11%   10%   10%
INTERJECTION     0%    0%    0%    0%    0%
NOUN             2%    6%    8%    9%   11%
PREPOSITION     29%   27%   25%   24%   23%
PRONOUN         13%   14%   14%   13%   13%
SYMBOL           0%    0%    0%    0%    0%
VERB            15%   15%   15%   16%   15%

Figure 68 Frequent word distribution for English

French:

French           50   100   150   200   250
ADJECTIVE        12    18    27    33    42
ADVERB            8    20    23    28    36
ARTICLE           6     8     8     8     8
CONJUNCTION       4     8    11    12    13
INTERJECTION      0     1     1     1     1
NOUN             13    27    55    86   115
PREPOSITION      15    18    20    24    25
PRONOUN          14    27    30    34    41
SYMBOL            0     0     0     0     1
VERB              4    11    19    22    26

Figure 69 Frequent word profile for French

French           50   100   150   200   250
ADJECTIVE        1%    2%    3%    3%    3%
ADVERB           4%    5%    5%    5%    5%
ARTICLE         23%   20%   19%   19%   19%
CONJUNCTION      7%    7%    7%    7%    7%
INTERJECTION     0%    0%    0%    0%    0%
NOUN             5%    8%   11%   11%   11%
PREPOSITION     39%   34%   32%   32%   32%
PRONOUN         17%   18%   17%   17%   17%
SYMBOL           0%    0%    0%    0%    0%
VERB             4%    5%    6%    6%    6%

Figure 70 Frequent word distribution for French

Bibliography

Azizeh Khanchobani Ahranjani, “Adjective Modification of the Head Noun in English and Persian Languages”, International Journal of Academic Research, 2011

Hammad Ali, “An Unsupervised Parts-of-Speech [sic] Tagger for the Bangla language”, University of British Columbia, 2008, http://www.cs.ubc.ca/~carenini/TEACHING/CPSC503-10/FINAL-REPORTS08/hammad-report1.1.pdf

Hadi Amiri, Hosein Hojjat, Farhad Oroumchian, “Investigation on a Feasible Corpus for Persian POS Tagging”, 12th international CSI computer conference, Iran, 2007

Timothy Baldwin, “Bootstrapping Deep Lexical Resources: Resources for Courses,” Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, pages 67–76, Ann Arbor, Association for Computational Linguistics, 2005

Regina Barzilay and Noemie Elhadad, “Sentence Alignment for Monolingual Comparable Corpora,” Empirical Methods on Natural Language Processing, 2003

Emily M. Bender, Dan Flickinger, “Rapid Prototyping of Scalable Grammars: Towards Modularity in Extensions to a Language-Independent Core,” Proceedings of IJCNLP-05, 2005.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin, “A Neural Probabilistic Language Model”, Journal of Machine Learning Research, 2003

Arianna Bisazza, Marcello Federico, “Morphological Pre-Processing for Turkish to English Statistical Machine Translation,” Proceedings of IWSLT 2009, 2009

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John D. Lafferty, and Robert L. Mercer, “Analysis, statistical transfer, and synthesis in machine translation,” In Proceedings of the Conference on Theoretical and Methodological Issues in Machine Translation, 1992.

Michael Collins, Philipp Koehn, Ivona Kučerová, “Clause Restructuring for Statistical Machine Translation,” Proceedings of ACL, 2005

Dipanjan Das and Slav Petrov, “Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections”, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011

Dinh Dien, Hoang Kiem, “POS-Tagger for English-Vietnamese Bilingual Corpus”, HLTNAACL 2003 Workshop: Building and Using Parallel Texts, 2003

K. Duh and K. Kirchhoff, “POS tagging of dialectal Arabic: a minimally-supervised approach,” Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, 2005.

Donald Dyer, “Word Order in the simple Bulgarian Sentence: A Study in Grammar, Semantics and Pragmatics”, 1992

Martin Ehala, “How a man changed a parameter value: the loss of SOV in Estonian subclauses”, Historical Linguistics 1995: Selected Papers from the 12th International Conference on Historical Linguistics, Manchester, 1995, edited by Richard Hogg and Linda van Bergen

Anna Feldman, Jirka Hana, Chris Brew, A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources, Proceedings of the fifth international conference on Language Resources and Evaluation, 2006

John Goldsmith, “Unsupervised Learning of the Morphology of a Natural Language”, Association for Computational Linguistics, 2001

Dan Garrette and Jason Baldridge, “Learning a Part-of-Speech Tagger from Two Hours of Annotation”, Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013

Dan Garrette and Jason Baldridge, “Type-Supervised Hidden Markov Models for Part-of- Speech Tagging with Incomplete Tag Dictionaries”, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012

Michael Gasser, “Expanding the lexicon for a resource-poor language using a morphological analyzer and a web crawler”, Proceedings of the Seventh conference on International Language Resources and Evaluation, 2010

K. Geeslin, “A Typological Investigation of the Pro-Drop-Parameter”, Proceedings of Electronic Conference May 15-25, 1999, Web Journal of FCCL, 1999

Nizar Habash, Fatiha Sadat, “Arabic Preprocessing Schemes for Statistical Machine Translation”, Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, 2006

Jirka Hana, Anna Feldman, Katsiaryna Aharodnik, “A Low-budget Tagger for Old Czech”, Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2011

Jiri Hana, Anna Feldman, Chris Brew, “A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources,” Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004

Jirka Hana, Anna Feldman, Chris Brew, Luiz Amaral, “Tagging Portuguese with a Spanish Tagger Using Cognates,” Proceedings of the Workshop on Cross-language Knowledge Induction, 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006

Gumwon Hong, Seung-Wook Lee and Hae-Chang Rim, “Bridging Morpho-Syntactic Gap between Source and Target Sentences for English-Korean Statistical Machine Translation”, 2009

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, Okan Kolak, “Bootstrapping Parsers via Syntactic Projection across Parallel Texts,” Natural Language Engineering, Issue 3, pages 311 – 325, 2005

Mark Johnson, “Why Doesn’t EM find good HMM POS-taggers?”, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 296-305, 2007

Shereen Khoja, “APT: Arabic part-of-speech tagger,” In Proceedings of the Student Workshop at NAACL-2001, pp. 20–25, 2001

Young-Bum Kim and Benjamin Snyder, “Unsupervised Consonant-Vowel Prediction over Hundreds of Languages”, Proceedings of the Association for Computational Linguistics, 2013

Young-Bum Kim, João V. Graça and Benjamin Snyder, “Universal Morphological Analysis using Structured Nearest Neighbor Prediction”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011

John R. Leafgren, “Register, Mode, and Bulgarian Object Placement”, presented at the Balkan Conference, 2002

Shen Li, João V. Graça and Ben Taskar, “Wiki-ly Supervised Part-of-Speech Tagging”, Proceedings of the Empirical Methods on Natural Language Processing, 2012

Hrafn Loftsson, “Tagging Icelandic text using a linguistic and a statistical tagger,” Proceedings of NAACL HLT, 2007

G. Mann and D. Yarowsky, “Multipath translation lexicon induction via bridge languages,” Proceedings of NAACL 2001

Mike Maxwell, Baden Hughes, “Frontiers in Linguistic Annotation for Lower-Density Languages,” Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora, 2006

Karine Megerdoomian and Dan Parvaz, “Low-Density Language Bootstrapping: The Case of Tajiki Persian”, Proceedings of the Sixth International Language Resources and Evaluation, 2008

Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson, “Using universal linguistic knowledge to guide grammar induction”, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010

Sonja Nießen, Hermann Ney, “Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information,” Computational Linguistics, 30(2), 2004

Farhad Oroumchian, Samira Tasharofi, Hadi Amiri, Hossein Hojjat, Fahime Raja, “Creating a Feasible Corpus for Persian POS Tagging”, Technical Report, no. TR3/06, University of Wollongong in Dubai, 2006.

Slav Petrov, Dipanjan Das, and Ryan McDonald, “A Universal Part-of-Speech Tagset”, Proceedings of the Eighth International Conference on Language Resources and Evaluation, 2012

Owen Rambow, David Chiang, Mona Diab, Nizar Habash, Rebecca Hwa, Khalil Sima'an, Vincent Lacey, Roger Vincent, Carol Nichols, and Safiullah Shareef, "Parsing Arabic Dialects: Final Report", JHU Summer Workshop, 2005

Sujith Ravi, Ashish Vaswani, Kevin Knight and David Chiang, “Fast, greedy model minimization for unsupervised tagging”, International Conference on Computational Linguistics, 2010

Sujith Ravi and Kevin Knight, “Minimized models for unsupervised part-of-speech tagging”, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay, “Adding more languages improves unsupervised multilingual part-of-speech tagging: A Bayesian non- parametric approach”, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009

Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald and Joakim Nivre, “Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging”, Transactions of the Association for Computational Linguistics, 2013

Akihiro Tamura, Taro Watanabe, Eiichiro Sumita, Hiroya Takamura and Manabu Okumura, “Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation”, Proceedings of the Association for Computational Linguistics, 2013

Min Tang, Xiaoqiang Luo, Salim Roukos, “Active Learning for Statistical Natural Language Parsing,” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002

Cynthia A. Thompson, Mary Elaine Califf, Raymond J. Mooney, “Active Learning for Natural Language Parsing and Information Extraction,” Proceedings of the 16th International Conference on Machine Learning, 1999

Ye Kyaw Thu, Akihiro Tamura, Andrew Finch, Eiichiro Sumita, and Yoshinori Sagisaka, “Unsupervised POS Tagging of Low Resource Language for Machine Translation”, The Association for Natural Language Processing, 2014

Julia S. Trushkina, “Development of a Multilingual Parallel Corpus and a Part-of-Speech Tagger for Afrikaans”, Intelligent Information Processing, 2006

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang, “Decoding with large-scale neural language models improves translation”, Proceedings of the Empirical Methods on Natural Language Processing, 2013

Chao Wang, Michael Collins, Philipp Koehn, "Chinese Syntactic Reordering for Statistical Machine Translation," Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007

David Yarowsky, Grace Ngai, Richard Wicentowski, “Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora,” Proceedings of the first international conference on Human language technology research, 2001

Daniel Zeman and Philip Resnik, “Cross-Language Parser Adaptation between Related Languages”, Proceedings of IJCNLP 2008 Workshop on NLP for Less Privileged Languages, 2008

Yuan Zhang, Roi Reichart, Regina Barzilay, and Amir Globerson, “Learning to Map into a Universal POS Tagset”, Conference on Empirical Methods in Natural Language Processing, 2012