Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing

Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing by Shumin Wu B.S., Duke University, 1998 M.S., Duke University, 2003 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Computer Science 2015 This thesis entitled: Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing written by Shumin Wu has been approved for the Department of Computer Science Martha Palmer Prof. James Martin Prof. Wayne Ward Prof. Daniel Gildea Prof. Nianwen Xue Date The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline. iii Wu, Shumin (Ph.D., Computer Science) Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing Thesis directed by Prof. Martha Palmer This thesis introduces a word alignment based approach for mapping PropBank predicate- argument structures across languages. We used Chinese and English as the language pair for the study and found our approach was able to reliably predict alignments between predicates and their arguments for the predicates in parallel sentence pairs, even when faced with incorrect word alignment input. Further more, by modeling the predicate-to-predicate and argument-to-argument probabilities over a large unannotated parallel corpora and using an expectation maximization (EM) based approach, we were able to further improve the alignment performance of the system. As part of the semantic mapping system, we also developed both a Chinese and an English constituent-based semantic role labeler (SRL) for arguments of verbal and non-verbal predicates. By building topic model based selectional preferences for each language, as well as using techniques such as support predicate identification and multi-stage classification, we were able to achieve what we believe is the state-of-the-art Chinese SRL performance and one of the best English SRL performances using a single constituent tree input. More over, by improving the performance of Chinese nominal SRL by over 2 F points, we demonstrated that selectional preferences can significantly improve SRL when the argument candidates are not well constrained by syntax. As a case study, we successfully applied our semantic mapping system to aligning Chinese dropped pronouns to English text as a way of understanding when replacement words would be required in English in place of the dropped pronouns during Chinese-English machine translation. In the process, we also demonstrated that semantic knowledge is essential for recovering Chinese dropped pronouns by producing a state-of-the-art SRL-enhanced Chinese empty category recovery system. Dedication This dissertation is dedicated to my parents, Wei Xiong Wu and Guoqin Jia, who have loved, supported, and encouraged me throughout my life. It is also dedicated to my wife, Simone Liu, who has loved and patiently supported me while I struggled to finish this work. v Acknowledgements We gratefully acknowledge the support of the following grants. Any contents expressed in this material are those of the authors and do not necessarily reflect the views of the grant agency. • National Science Foundation CISE-IISRI-0910992, Richer Representations for Machine Translation • DARPA FA8750-09-C-0179 (via BBN) Machine Reading: Ontology Induction:Semlink+ and AMR • DARPA HR0011-11-C-0145 (via LDC) BOLT This work utilized the Janus supercomputer, which is supported by the National Science Foundation (award number CNS-0821794) and the University of Colorado Boulder. The Janus supercomputer is a joint effort of the University of Colorado Boulder, the University of Colorado Denver and the National Center for Atmospheric Research. vi Contents Chapter 1 Introduction 1 1.1 Motivation . .1 1.2 Semantic representation . .3 1.3 Semantic mapping . .3 1.4 Automatic semantic role labeling . .4 1.5 Chinese dropped pronouns . .5 2 Background and Related Work 8 2.1 Semantic mapping . .8 2.2 Leveraging parallel corpora . .9 2.3 PropBank Semantic role labeling . 10 2.3.1 Chinese SRL . 11 2.3.2 Selectional Preference for SRL . 12 2.3.3 Topic Model based Selectional Preference . 13 2.4 Empty category recovery . 14 3 Semantic mapping 16 3.1 General approach . 16 3.2 Mapping predicate-arguments . 17 3.2.1 Argument mapping . 17 vii 3.2.2 Word alignment based argument mapping . 19 3.2.3 One-to-one predicate-argument mapping . 22 3.3 Building a mapping probability model . 23 3.3.1 Predicate-to-predicate mapping probability . 23 3.3.2 Argument-to-argument mapping probability . 23 3.4 Probabilistic mapping . 24 3.5 Experiment . 25 3.5.1 Reference predicate-argument mapping . 25 3.5.2 Parser, SRL, word aligner . 26 3.5.3 Predicate-argument mapping results . 26 3.5.4 Mapping coverage . 30 3.6 Chinese VerbNet Mapping . 30 3.6.1 Setup . 31 3.6.2 Results . 32 3.7 Discussion . 33 4 Semantic Role Labeling 35 4.1 Baseline Approach . 35 4.1.1 Features . 36 4.1.2 Classifiers . 39 4.2 Improvements to the Baseline SRL system . 39 4.2.1 Support Identification . 40 4.2.2 2-stage Argument Label Classification . 41 4.2.3 Selectional Preference . 42 4.3 Topic Model based Selectional Preference . 42 4.3.1 Representation . 42 4.3.2 SRL Filtering . 43 viii 4.3.3 Multi-lingual SRL filtering . 43 4.3.4 SP with LDA-based Topic Model . 44 4.3.5 SP Extraction Steps . 45 4.3.6 Topic Model feature for SRL . 46 4.4 Distributional Similarity based Selectional Preference . 46 4.4.1 Approach . 46 4.4.2 Distributional Similarity Corpus . 47 4.4.3 Selectional Preference Measures . 48 4.4.4 SRL Integration . 49 4.5 Experiment . 49 4.5.1 Setup . 49 4.5.2 Performance . 50 4.5.3 Automatic SRL Error Analysis . 55 4.6 Discussion . 55 5 Chinese Empty Category Recovery 60 5.1 Chinese dropped subject mapping to English . 60 5.1.1 Motivation . 60 5.1.2 Framework . 61 5.1.3 Mapping semantic roles between Chinese and English . 61 5.1.4 Heuristics . 62 5.1.5 Experiment . 67 5.1.6 Heuristics distribution . 68 5.1.7 Heuristics performance . 69 5.1.8 Analysis . 72 5.2 Chinese empty category recovery with parallel corpora . 73 5.2.1 Monolingual implementation . 73 ix 5.2.2 Parallel feature enhancement . 75 5.2.3 Experiment . 75 5.3 Dependency-based empty category recovery . 76 5.3.1 Implementation . 77 5.3.2 Experiment . 78 5.4 Discussion . 80 6 Summary 83 Bibliography 87 Appendix A SRL Comparison 95 x Tables Table 3.1 Chinese argument type (column) to English argument type (row) mapping on triple- gold Xinhua corpus . 20 3.2 Predicate-argument mapping results . 27 3.3 Predicate-argument mapping coverage. Predicate coverage denotes the number of mapped predicates over all predicates in the corpus, word coverage denotes the number of words in the mapped predicate-arguments over all words in the corpus . 30 3.4 results of English verbs with VerbNet annotation projected back to themselves through Chinese verbs. The restricted rows indicate restricting the verb pair out- put of the predicate-argument mapping methods to only those English/Chinese verb predicates that are also word aligned (rather than just induced through the argument mappings) . 32 4.1 Topics in Chinese Gigaword . 50 4.2 Chinese PropBank 1.0 verb results . 51 4.3 Chinese PropBank 1.0 results . 51 4.4 Chinese PropBank 1.0 argument performance . 52 4.5 Chinese PropBank 3.0 out-of-genre results . 53 4.6 Chinese SRL comparison . 53 4.7 Topics in English Gigaword . ..

Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing

Champollion: a Robust Parallel Text Sentence Aligner

The Web As a Parallel Corpus

Resourcing Machine Translation with Parallel Treebanks John Tinsley

Towards Controlled Counterfactual Generation for Text

Natural Language Processing Security- and Defense-Related Lessons Learned

A User Interface-Level Integration Method for Multiple Automatic Speech Translation Systems

Lines: an English-Swedish Parallel Treebank

Cross-Lingual Bootstrapping of Semantic Lexicons: the Case of Framenet

Parallel Texts

Open Architecture for Multilingual Parallel Texts M.T

Unsupervised Learning of Cross-Modal Mappings Between

BITS: a Method for Bilingual Text Search Over the Web