PhD Dissertation

International Doctorate School in Information and Communication Technologies
DISI - University of Trento

Cross-Lingual Textual Entailment and Applications

Yashar Mehdad

Advisor:

Prof. Marcello Federico

Fondazione Bruno Kessler, Human Language Technology Research Unit.

March 2012

Abstract

Textual Entailment (TE) has been proposed as a generic framework for modeling language variability. The great potential of integrating (monolingual) TE recognition components into NLP architectures has been reported in several areas, such as information retrieval, information extraction and document summarization. Mainly due to the absence of cross-lingual TE (CLTE) recognition components, similar improvements have not yet been achieved in any corresponding cross-lingual application. In this thesis, we propose and investigate Cross-Lingual Textual Entailment (CLTE) as a semantic relation between two text portions in different languages. We present different practical solutions to approach this problem by i) bringing CLTE back to the monolingual scenario, translating the two texts into the same language; and ii) integrating MT and TE algorithms and techniques. We argue that CLTE can be a core technology for several cross-lingual NLP applications and tasks. Experiments on different datasets and two interesting cross-lingual NLP applications, namely content synchronization and machine translation evaluation, confirm the effectiveness of our approaches, leading to successful results. As a complement to the research on the algorithmic side, we successfully explored the creation of cross-lingual textual entailment corpora by means of crowdsourcing, as a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators.

Keywords: Natural Language Processing, Textual Entailment and Paraphrasing, Cross-Lingual Textual Inference, Content Synchronization, Machine Translation, Crowd-Sourcing

Acknowledgement

A few years back, while I was passionately tasting the sweet joy of teaching young students in a college, I had to bear the bitter taste of marking projects and written exams. Tired of such boring end-of-semester moments, while driving back home in a heavy traffic jam, the idea of automatically scoring the students' papers and projects flashed into my mind. Further science-fictional thoughts and daydreaming around this fancy technology led me to the Natural Language Processing field, and motivated me to drive along this road for 4-5 years, which resulted in this PhD thesis. Though only my name appears on the cover of this dissertation, many people have contributed to its growth and production. I owe my sincere gratitude to all those people who have made this thesis possible and because of whom this experience has been one that I will cherish forever. I have been fortunate to have an advisor who gave me the freedom, courage and confidence to explore this path on my own, and at the same time the helpful guidance to recover when my steps faltered. Marcello Federico taught me how to be precise, organized and focused. His support even during the last stages of my thesis, and his management and coordination skills, have been valuable lessons that I am sincerely grateful for. My deepest gratitude goes to my true friend, colleague and co-advisor, Matteo Negri, who has always been there to listen, share ideas and give advice. Through long discussions, critical corrections, and creative ideas on technical writing and presentations, I have learned priceless lessons during my PhD. I also owe him a debt of gratitude for carefully reading, commenting on, and countlessly revising my writings and slides at all stages of my work. Moreover, by always being there, he made my stay in Trento and FBK one of the most pleasant periods of my life. I am grateful to Bernardo Magnini, who first brought my attention to the topic of Textual Entailment, and who facilitated my entry into this community and into FBK. I am also thankful to him for giving me the opportunity to teach, adding to the joy of academic life and bringing me closer to my goals. Dr. Alessandro Moschitti is one of the most influential teachers I had during my stay in Trento. He introduced me to Machine Learning and SVMs, specifically kernel methods. The work I did in one of his courses resulted in a publication in a major conference. I am indebted to him for his continuous encouragement and guidance and for what I have learned from him during our collaborations. I am also grateful to Dr. Fabio Massimo Zanzotto for our fruitful collaboration and the discussions we had, sometimes for hours. I would also like to express my deep gratitude to Dr. Farid Melgani, whose great help and guidance resulted in my first ACL publication at a very early stage of my PhD. I am also indebted to the members of the HLT group at FBK, who have provided a pleasant and stimulating environment in which to pursue my studies. In particular, I would like to acknowledge my colleagues Milen Kouylekov, Luisa Bentivogli, Claudio Giuliano, Christian Girardi, Mauro Cettolo, Elena Cabrio, Nicola Bertoldi, Alessandro Marchetti and Danilo Giampiccolo for their precious collaboration. I would like to acknowledge my PhD committee members, Dr. Ido Dagan, Dr. Miles Osborne, Dr. Lluis Marquez and Dr. Christof Monz, for taking their valuable time to travel to Trento and giving me the pleasure of having them at my final exam.
I am also thankful to them for reading my dissertation, commenting on my views, and helping me to improve and enrich my thesis. Many friends have helped me stay happy through these difficult years. I greatly value their friendship and I deeply appreciate their belief in me. I would also like to thank Sofhia Rahim, who helped me in proofreading this manuscript. Most importantly, none of this would have been possible without the love and patience of my family. My family, to whom this dissertation is dedicated, has been a constant source of love, concern, support and strength all these years. I would like to express my heartfelt gratitude to my parents, Iran Mehdizadegan and Ali Mehdad, who have aided and encouraged me throughout my life. I also offer my deepest thanks and love to my brother and sister, Araz and Maral, whose love and support have sustained me during these years.

Contents

1 Introduction
  1.1 Context
  1.2 Problem
  1.3 Solution
  1.4 Contributions
  1.5 Structure of the Thesis

2 Recognizing Textual Entailment
  2.1 Textual Entailment
  2.2 Datasets
  2.3 Knowledge Resources
    2.3.1 Lexical databases
    2.3.2 Entailment and inference rules
    2.3.3 Context sensitive lexical rules from Wikipedia
    2.3.4 Paraphrase Tables as a Source of Knowledge
  2.4 Approaches
    2.4.1 Logic-based approaches to RTE
    2.4.2 Transformation and Similarity based Approaches
    2.4.3 Supervised Learning Methods for RTE
  2.5 Optimization
    2.5.1 Optimizing TE Recognition Using PSO
    2.5.2 Optimizing TE System Using Genetic Algorithm
  2.6 Syntactic Semantic Learning
    2.6.1 Motivating Example
    2.6.2 Lexical similarities
    2.6.3 Integrating Semantic in Syntactic Tree Kernels
    2.6.4 Semantic Syntactic Tree Kernels for RTE
    2.6.5 Experiments
    2.6.6 Results
  2.7 Summary

3 Cross-Lingual Textual Entailment
  3.1 Introduction
  3.2 CLTE
    3.2.1 Definition
    3.2.2 Approaches
  3.3 Pivoting
    3.3.1 Experiment 1: Feasibility Study
    3.3.2 Experiment 2: Verification
  3.4 Advanced
    3.4.1 Exploiting Parallel Corpora for CLTE
    3.4.2 Beyond Lexical Features
  3.5 Summary

4 Content Synchronization
  4.1 Introduction
  4.2 CLTE
  4.3 Experiments
    4.3.1 Dataset
    4.3.2 Features
    4.3.3 Evaluation settings
    4.3.4 Results
  4.4 Open Issues
  4.5 Summary

5 MT adequacy evaluation
  5.1 Introduction
  5.2 MT evaluation
  5.3 Predicting adequacy
  5.4 CLTE
    5.4.1 Features
    5.4.2 Dataset
    5.4.3 Algorithms and Approaches
  5.5 Results
    5.5.1 Adequacy and quality prediction
    5.5.2 Multi-class classification
    5.5.3 Recognizing "good" vs "bad" translations
  5.6 Summary

6 CLTE dataset creation
  6.1 Introduction
  6.2 Crowdsourcing
  6.3 RTE3-derived CLTE dataset
    6.3.1 Methodology
    6.3.2 Experiments and lessons learned
    6.3.3 Results
  6.4 Content Synchronization
    6.4.1 Methodology
    6.4.2 Further Analysis
  6.5 Summary

7 Conclusion
  7.1 Recapitulation
  7.2 Future direction

Bibliography

A List of Published Papers

List of Tables

2.1 RTE datasets
2.2 Comparing accuracy results over the RTE-5 dataset
2.3 Coverage of rule repositories over the RTE-5 dataset
2.4 Accuracy results on RTE using different lexical resources
2.5 Optimized and unoptimized cost schemes comparison
2.6 RTE results (acc. for RTE1-RTE5, F-meas. for RTE6)
2.7 Accuracy comparison using different semantic rules
2.8 Lexico-syntactic kernels comparison
2.9 Coverage of the different resources
2.10 Efficiency comparison
2.11 Comparison with other approaches to RTE
3.1 CLTE feasibility study
3.2 Pivoting approach using different lexical resources
3.3 CLTE using different bilingual lexical resources
3.4 Dependency Relation (DR) matching
3.5 Semantic Phrase Table (SPT) matching
3.6 CLTE accuracy results over the RTE3 derived dataset
4.1 CLTE for content synchronization accuracy results
5.1 Adequacy-oriented features vs. the quality features
5.2 Pearson's correlation over the 16K dataset
5.3 Pearson's correlation over the WMT07 dataset
5.4 Multi-class classification over 16K dataset
5.5 Multi-class classification over WMT07 dataset
5.6 Binary classification over 16K dataset
5.7 Binary classification over WMT07 dataset
6.1 Creating a RTE3-derived CLTE dataset
6.2 The monolingual dataset creation pipeline

List of Figures

1.1 Contributions' timeline
2.1 EDITS-GA framework
2.2 A syntactic parse tree along with some of its fragments
3.1 CLTE approaches
3.2 N-best Moses translations accuracy
3.3 Using a phrase table for CLTE
3.4 Combining phrase and paraphrase tables for CLTE
3.5 Phrase tables with different pruning thresholds
4.1 Automatic content synchronization framework
6.1 Content Synchronization corpus creation process
6.2 Sentence modification and TE annotation pipeline

Chapter 1

Introduction

1.1 Context: Textual Entailment and Inference

Natural languages allow the same meaning to be expressed in many possible ways, making automatic understanding particularly challenging. Almost all computational linguistics tasks, such as information retrieval (IR), question answering (QA), information extraction (IE), text summarization and machine translation (MT), have to cope with this phenomenon. Textual entailment recognition was proposed by Dagan & Glickman [2004] as a generic NLP task in order to overcome the problem of lexical, syntactic and semantic variability in natural languages. In 2005, the Recognizing Textual Entailment (RTE) Challenge was launched by Dagan et al. [2005], defining textual entailment (TE) as a task for automatic systems. Given a text T and a hypothesis H, the task consists of deciding if the meaning of H can be inferred from the meaning of T. The following examples show T-H pairs for which the entailment relation holds (Example 1) or does not hold (Example 2):

Example 1.
T: Euro-Scandinavian media cheer Denmark vs Sweden draw.
H: Denmark and Sweden tie.
Entailment: YES


Example 2.
T: Oracle had fought to keep the forms from being released.
H: Oracle released a confidential document.
Entailment: NO

In the many evaluation campaigns that have addressed the TE recognition problem in recent years, more complex definitions of the task have been proposed. The released datasets reflect the long-term objective of creating more natural evaluation settings. These include the formulation of TE as a search task1 (i.e. finding all the sentences in a set of documents that entail a given hypothesis), the use of TE to approach the Answer Validation Exercise2 (emulating human assessment of QA responses and deciding whether an answer to a question is correct or not according to a given text), and the very recent effort to explore multi-directional TE recognition3 (moving from YES/NO to directional entailment judgements such as Forward, Backward and Bidirectional; a minimal sketch of how such labels can be derived follows below). Consequently, a large number of methods and resources for TE have been published or released.
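The following minimal sketch shows how two directional entailment judgements can be combined into the multi-directional labels mentioned above. The function entails(a, b) is a hypothetical placeholder for any TE engine that returns True when the meaning of b can be inferred from a.

    def multidirectional_label(t: str, h: str, entails) -> str:
        forward = entails(t, h)   # T => H
        backward = entails(h, t)  # H => T
        if forward and backward:
            return "Bidirectional"
        if forward:
            return "Forward"
        if backward:
            return "Backward"
        return "No entailment"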

Even though the research community is currently considering a number of NLP applications under multilingual or cross-lingual perspectives (including QA, IR, IE, and text summarization), not much is being done in the area of TE recognition. The first concrete attempt to move beyond the traditional English evaluation datasets is represented by the 2009 edition of the EVALITA Challenge,4 which hosted a TE recognition task for Italian. However, cross-language TE recognition capabilities had been completely disregarded until the seminal work presented in this thesis.

1 RTE: http://www.nist.gov/tac/2010/RTE/
2 AVE: http://nlp.uned.es/clef-qa/ave/
3 NTCIR-9 RITE: http://artigas.lti.cs.cmu.edu/rite/
4 http://evalita.fbk.eu/index.html


1.2 The Problem: Cross-Lingual Textual Entailment

The explosion of multilingual content on the web provides users with the opportunity to access and publish information about a given topic in their own language. The dramatic growth of content published in languages other than English demonstrates the high demand for multilingual and cross-lingual NLP applications. The growth rate of Chinese, Spanish and Portuguese as languages used on the web (1,478.7%, 807.4% and 990.1% respectively, between 2000 and 2011)5 confirms the need for cross-lingual technology to help users bridge the language barrier to access information and communicate with each other over the internet. The great potential of integrating monolingual TE recognition components into NLP applications has been reported in several research works (e.g. [Roth et al., 2009; Mirkin et al., 2009b; Zhang & Chai, 2010]). However, mainly due to the absence of cross-lingual TE recognition components, similar improvements have not yet been achieved in any cross-lingual application. As a matter of fact, despite the great deal of attention that TE has received in recent years and the emerging research in multilingual scenarios, interest in cross-lingual extensions has not been in the mainstream of TE research.

Building on these considerations, this thesis aims at proposing and exploring for the first time cross-lingual textual entailment (CLTE) as a way to perform semantic inference across languages. The CLTE task is inherently difficult, as it adds multilinguality issues to the complexity of semantic inference at the textual level. For instance, the reliance of current monolingual TE systems on lexical resources (e.g. WordNet, VerbOcean, FrameNet) and deep processing components (e.g. syntactic and semantic parsers, co-reference resolution tools, temporal expression recognizers and normalizers) has to contend, at the cross-lingual level, with: i) the limited availability of lexical/semantic resources covering multiple languages, ii) the limited coverage of the existing ones, and iii) the burden of integrating language-specific components into the same cross-lingual architecture. Despite the multilingual challenges posed by this task, research can now benefit from recent advances in other fields, especially machine translation, and from the availability of large amounts of parallel and comparable corpora in many languages. All these resources can potentially help in developing inference mechanisms for multilingual data. From the theoretical point of view, this thesis aims at building on the integration of semantics and MT resources and technology to tackle the difficulties of the CLTE task. From the application point of view, this thesis aims at exploring the potential of CLTE in two different scenarios: i) the automatic synchronization of topically related text portions in multilingual environments (such as wikis); and ii) the automatic estimation of the adequacy of MT systems' output without using reference translations.

5 Reported from http://www.internetworldstats.com/stats7.htm

1.3 The Proposed Solutions

This thesis describes two main methodologies to approach CLTE:

1. A "basic approach", which brings CLTE back to a monolingual task by translating H into the language of T, or vice-versa.

2. An "advanced approach", which embeds cross-lingual processing techniques inside the CLTE recognition process.

Building on our experience with monolingual English RTE approaches, we initially explored the simplest approach to CLTE. Such an approach consists in adding an MT component to the front-end of an existing TE engine.


For instance, the hypothesis H can be translated into the language of T, and the TE engine can then be run on T and the translation of H. There are several good reasons to follow this divide-and-conquer approach, apart from some drawbacks. Decoupling the cross-lingual and the entailment components results in a simple and modular architecture that, according to well-known software engineering principles, is easier to develop, debug and maintain. Moreover, a decoupled CLTE architecture allows for easy extensions to other languages, as it just requires extra MT systems. Following the same idea of pivoting through English, in fact, the same TE system can be employed to perform CLTE between any language pair, once MT is available from each language into English. Despite the advantages in terms of modularity and portability of the architecture and the promising experimental results achieved in our early works, the "basic approach" suffers from being dependent on the availability of MT components and on the quality of the translations. As a consequence, on one side, translation errors propagate into the TE engine, thus hampering the entailment decision process. On the other side, such unpredictable errors reduce the possibility to control the behaviour of the engine and to devise ad-hoc solutions to specific entailment problems.
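The following is a minimal sketch of the "basic" (pivoting) CLTE architecture just described. Both mt_translate and monolingual_te are hypothetical stand-ins for an off-the-shelf MT system and an existing monolingual TE engine (e.g. EDITS); they are not actual APIs of those tools.

    def pivoting_clte(t_lang1: str, h_lang2: str, mt_translate, monolingual_te) -> bool:
        # Step 1: bring the pair back to a monolingual setting by translating
        # the hypothesis into the language of the text.
        h_translated = mt_translate(h_lang2, target_lang="lang1")
        # Step 2: run the unchanged monolingual TE engine on the resulting pair.
        # Any translation error made in step 1 propagates into this decision,
        # which is the main weakness of the approach.
        return monolingual_te(t_lang1, h_translated)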

The idea behind the "advanced approach" to CLTE is to move towards a cross-lingual TE method that takes advantage of a tighter integration of MT and TE algorithms and techniques. This could result in methods for recognizing TE across languages that avoid dependencies on external MT components, thus eventually gaining full control of the system's behaviour. Along this direction, we started from the acquisition and use of lexical knowledge, which represents the basic building block of any TE system. As the next step, we integrated linguistically motivated (syntactic and semantic) features to improve on the state of the art in the lexical CLTE approach.


The adoption of our CLTE approaches in different application scenarios (content synchronization and MT evaluation) aims at proving their effectiveness on real-world problems.

1.4 Innovative Aspects and Contributions

The work described in this thesis covers different topics related to TE research, ranging from contributions to monolingual TE in terms of algorithms and resources, to the proposal and exploitation of CLTE as a new task, and the design of novel data acquisition methods. Figure 1.1 shows a Gantt chart illustrating the problems addressed and the main contributions over the completion time of this thesis. These contributions can be summarized as follows:

Monolingual TE: methods to optimize distance-based TE approaches and systems. In [Mehdad, 2009; Mehdad & Magnini, 2009a] we proposed a stochastic method based on Particle Swarm Optimization (PSO) to estimate the cost of edit distance operations for the textual entailment problem. By means of PSO, we learn the optimal cost for each edit distance operation in order to improve prior textual entailment models. Besides the automatic learning of operational costs, another advantage of this method is the ability to inspect the learned cost values, to better understand how to approach TE with edit distance algorithms. Along the same direction, in [Kouylekov et al., 2011] we used Genetic Algorithms to automatically obtain the most promising configuration of the EDITS RTE system [Kouylekov & Negri, 2010], avoiding the exhaustive exploration and testing of all possible configurations.
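The following is a minimal, generic PSO sketch (not the thesis implementation, which is detailed in Section 2.5) showing how a swarm can tune the three edit-operation costs. Here fitness is a hypothetical function that runs the distance-based entailment system with the candidate [insertion, deletion, substitution] costs over the training set and returns its accuracy.

    import random

    def pso_tune_edit_costs(fitness, n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5):
        dim = 3  # costs for insertion, deletion, substitution
        pos = [[random.uniform(0.0, 2.0) for _ in range(dim)] for _ in range(n_particles)]
        vel = [[0.0] * dim for _ in range(n_particles)]
        pbest = [p[:] for p in pos]                    # each particle's best position
        pbest_fit = [fitness(p) for p in pos]
        best = max(range(n_particles), key=lambda i: pbest_fit[i])
        gbest, gbest_fit = pbest[best][:], pbest_fit[best]  # swarm's best position
        for _ in range(n_iter):
            for i in range(n_particles):
                for d in range(dim):
                    r1, r2 = random.random(), random.random()
                    # standard velocity update: inertia + cognitive + social terms
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * r1 * (pbest[i][d] - pos[i][d])
                                 + c2 * r2 * (gbest[d] - pos[i][d]))
                    pos[i][d] = max(0.0, pos[i][d] + vel[i][d])  # costs stay non-negative
                fit = fitness(pos[i])
                if fit > pbest_fit[i]:
                    pbest[i], pbest_fit[i] = pos[i][:], fit
                    if fit > gbest_fit:
                        gbest, gbest_fit = pos[i][:], fit
        return gbest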


Figure 1.1: Achievements of this thesis in terms of major publications over the completion time. This chart shows the start and completion dates and dependencies of the chapters that are detailed in this thesis.

Monolingual TE: extraction of context-sensitive entailment rules from Wikipedia. In [Kouylekov et al., 2010a; Mehdad et al., 2010a] we proposed a method to embed context sensitivity into lexical entailment rules. The method is based on computing similarity scores between words/phrases over Wikipedia by means of Latent Semantic Analysis (LSA). Our results demonstrate that, by applying entailment rules extracted from Wikipedia, we gain a higher coverage as well as a better performance in our entailment framework.

Monolingual TE: syntactic/semantic learning for textual entailment recognition. In [Mehdad et al., 2010a] we proposed models to effectively use syntactic and semantic information in RTE, without requiring either large-scale automatic rule acquisition or hand-coding. These models exploit lexical similarities to generalize lexical-syntactic rules automatically derived by supervised learning methods. In more detail, syntax is encoded in the form of dependency parse trees, whereas similarities are defined by means of WordNet similarity measures or Latent Semantic Analysis (LSA) applied to Wikipedia or to the British National Corpus (BNC). The joint syntactic/semantic model is realized by means of novel tree kernels, which can match subtrees whose leaves are lexically similar or related.

CLTE: proposal of a new research problem. In [Mehdad et al., 2010b] we proposed and investigated for the first time a cross-lingual extension of TE in which T and H are assumed to be written in different languages, to allow for semantic inference across languages. Besides a feasibility study, we also presented two possible approaches to this task, evaluating the advantages and disadvantages of each solution.

CLTE: use of parallel corpora. In [Mehdad et al., 2011] we explored the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. Our hypothesis is that, in spite of the inherent difficulties of the task, phrase and paraphrase tables extracted from parallel data make it possible to capture both lexical relations between single words and contextual information useful for inference. Our experiments prove the potential of parallel corpora to approach cross-lingual textual entailment.

Applications: automatic content synchronization. As the first application framework, we addressed the task of synchronizing the content of two documents about the same topic written in different languages [Mehdad et al., 2012a; Mehdad & Negri, 2012], by adopting a solution based on CLTE. As a subtask of this problem, the identification of semantically equivalent text portions, and of more informative fragments in one of the two pages, has been cast as an application-oriented variant of the CLTE task, where entailment relations have to be checked in all possible directions (i.e. from text to hypothesis and vice-versa). Experimental results (under peer review at the time of writing) demonstrate the benefits of adopting CLTE in such an application scenario.

Applications: automatic adequacy evaluation of MT output. As the second application framework, in [Mehdad et al., 2012b] we proposed a methodology based on CLTE to estimate the adequacy of MT output without using reference translations. By casting the problem as a cross-lingual textual entailment recognition task, we could: i) avoid using costly hand-crafted reference translations, and ii) integrate semantics into MT evaluation in order to complement the shallow methods currently used and overcome their limitations. The positive results of our work (under peer review at the time of writing) show the effectiveness of CLTE components in dealing with such an application scenario.

Data acquisition methods: crowd-sourcing the creation of CLTE corpora. In [Negri et al., 2011; Negri & Mehdad, 2010] we devised cost-effective methodologies, based on crowd-sourcing, to create cross-lingual textual entailment corpora. Our results and the released CLTE datasets confirmed that, by adopting these methodologies, we can address the shortage of data and the high cost of data creation in the CLTE scenario.


1.5 Structure of the Thesis

The thesis is structured as follows.
Chapter 2 gives an overview of the textual entailment problem, presenting: i) the state of the art in TE research, ii) the Recognizing Textual Entailment (RTE) challenge, iii) possible TE applications, iv) the lexical and semantic resources used for RTE, and v) our novel contributions to monolingual TE.
Chapter 3, the core of our work, introduces the Cross-Lingual TE (CLTE) problem, followed by a feasibility study, and presents the solutions we advocate in terms of theoretical insights and algorithms. The experimental setups, datasets and results are reported afterwards.
Chapter 4 presents content synchronization as a possible interesting application of CLTE. This chapter describes the framework we designed to tackle the problem, and the experiments and results achieved over the different datasets we created.
Chapter 5 presents another interesting application of CLTE: the automatic evaluation of machine translation adequacy without reference translations. We report extensive experiments on two different datasets and the promising results achieved in several experimental settings.
Chapter 6 describes cost-effective and replicable methodologies to create and annotate CLTE and content synchronization datasets, taking advantage of crowdsourcing. Such methodologies have been successfully exploited to build the CLTE and content synchronization datasets used in our experiments.
Chapter 7 draws the conclusions and suggests possible future work.

Chapter 2

Recognizing Textual Entailment

2.1 Textual Entailment

Dagan & Glickman [2004] proposed the notion of Textual Entailment (TE) as a generic framework for modeling language variability and capturing major semantic inference needs across NLP applications. TE is defined as a relationship between a coherent textual fragment T and a language expression, or hypothesis, H. Entailment holds, i.e. T ⇒ H, if the meaning of H can be inferred from the meaning of T. This relationship is directional and asymmetric, since the meaning of one expression may usually entail the other, but not vice versa, unlike the semantic equivalence relation, which is symmetric. In 2005, the Recognizing Textual Entailment (RTE) Challenge was launched by Dagan et al. [2005], defining Textual Entailment as a task for automatic systems. Given two texts T and H, the task consists in deciding if the meaning of H can be inferred from the meaning of T. Example 1 shows a T-H pair for which the entailment relation holds:1

Example 1.
T: In the end, defeated, Antony committed suicide and so did Cleopatra, according to legend, by putting an asp to her breast.
H: Cleopatra committed suicide.

1 This example is extracted from the official RTE dataset.

At present, textual entailment is considered an interesting and challenging topic within the NLP community, due to its many potential applications. The PASCAL Network of Excellence2 promoted a generic evaluation framework covering semantic-oriented inferences for several NLP applications, which led to the launch of the Recognizing Textual Entailment (RTE) Challenge.3 Many research areas, such as information retrieval, question answering, information extraction, text summarization and machine translation, have to cope with different kinds of inference mechanisms closely related to the entailment notion. In this direction, some works tried to address different NLP tasks with textual entailment, in order to benefit from a semantic inference framework and to potentially improve their performance [Glickman, 2006]. In Question Answering (QA), some reasoning is needed to identify which texts are potentially informative answers for a given question. For example, given the question type "What is the height of X?", textual entailment can be used to infer that texts such as "X is N meters tall" are informative for this question, while texts such as "X is N kilograms" are not. On the other hand, given the question "Where is the Eiffel tower?", it would be very helpful to recognize that the answers should carry information related to Paris or France, and not to the hotel of the same name located in Las Vegas. The following examples better clarify the role of TE in QA applications. Harabagiu & Hickl [2006] applied textual entailment to either

2 http://www.pascal-network.org/
3 http://pascallin.ecs.soton.ac.uk/Challenges/RTE/


filter or rank answers returned by a QA system, and reported about a 20% improvement in performance. Bogdan et al. [2008] implemented a TE-based approach to QA and showed that the system can be used in both monolingual and multilingual settings. Finally, Wang & Neumann [2008b] took advantage of TE in their answer validation framework and reported promising results. Information Extraction (IE) is another NLP task which has recently benefited from the application of TE methods. Among others, Wang & Neumann [2008c] reported on the feasibility of using TE for the relation validation task. Moreover, Roth et al. [2009] argued that TE is necessary to approach the relation recognition task, proposing a scalable TE architecture. In Information Retrieval (IR), a typical problem is the lexical gap between a query and a document. Van Rijsbergen [1979] proposed a method to fill this gap by taking advantage of TE, inferring the query from the document. Additionally, Kotb [2006] proposed an approach to retrieve not only textual documents that contain the queried keywords, but also to discover documents semantically equivalent to, or entailed by, the given keywords. Concerning text summarization, Dragomir [2000] reported that entailment is among the cross-document relations that can hold between segments in a document. Once detected, this would provide a means to reduce redundancy in summarization. Over and above that, Doina et al. [2008] proved that utilizing TE for segmentation before summarization can improve the quality of final summaries. Moreover, one of the goals of the search task proposed in RTE is to analyze the potential impact of textual entailment on the summarization task, as proposed by the summarization community at the 2008 Text Analysis Conference (TAC).4 On top of that, TE has been proposed as an effective method for the automatic evaluation of Machine Translation (MT).

4 http://www.nist.gov/tac/2009/Summarization/index.html

Padó et al. [2008] used this technique and proved that entailment-based MT evaluation metrics can keep up with the constantly improving quality of MT output, which is difficult to measure with surface-oriented methods. Furthermore, Mirkin et al. [2009b] proposed a new entailment-based approach for addressing the translation of unknown terms in MT. By applying this approach with lexical entailment rules extracted from WordNet, they improved the quality of the translations produced by their MT system. Among recent novel applications of TE, Agerri [2008] proposed to broaden the coverage of TE systems for the benefit of research on metaphors. Zhang & Chai [2010] investigated the problem of conversation entailment, i.e. automatically inferring hypotheses from conversation scripts by examining two levels of representation of conversation utterances: syntactic and semantically augmented structures. In addition, as an interesting application, Bos & Oka [2007] focused on linguistic and inferential aspects of human-robot communication via TE.

2.2 Datasets

In the previous section we defined the notion of TE [Dagan & Glickman, 2004] and gave an overview of the natural language processing and understanding problems that can benefit from this framework. In this section we focus on the datasets and the evaluation framework developed for the RTE challenge. Recognizing the strong need for a benchmark for the development and evaluation of methods that address the TE problem, the PASCAL Network of Excellence started to organize an evaluation framework, launching the Recognizing Textual Entailment (RTE) Challenge in 2005 [Dagan et al., 2005]. The goal of this evaluation campaign is to promote the development of entailment recognition systems providing generic modules

across applications. Since 2005, this initiative has been repeated every year: RTE-1 [Dagan et al., 2005], RTE-2 [Roy Bar-Haim et al., 2006], RTE-3 [Giampiccolo et al., 2007], RTE-4 [Giampiccolo et al., 2008], RTE-5 [Bentivogli et al., 2009], RTE-6 [Bentivogli et al., 2010b] and RTE-7 [Bentivogli et al., 2011]. Since 2008, RTE has been proposed as a track at the Text Analysis Conference (TAC),5 jointly organized by the National Institute of Standards and Technology6 and CELCT.7 Each year, the organizers create development and test datasets containing pairs of text fragments (the text T and the hypothesis H), with their relative entailment judgments. In this framework, systems should determine whether the meaning of H is entailed by, i.e. can be inferred from, T (e.g. Example 2 from RTE-4). From RTE-3 to RTE-5, systems could optionally make a further distinction between no-entailment pairs: i) the entailment does not hold because the content of H is contradicted by the content of T (CONTRADICTION, e.g. Example 3 from RTE-3), and ii) the entailment cannot be determined because the truth of H cannot be verified on the basis of the content of T (UNKNOWN, e.g. Example 4 from RTE-3). The distribution according to the three-way annotation has been fixed to 50% entailment, 35% unknown, and 15% contradiction pairs.

Example 2
T: A judge in Texas has signed an order allowing parents to take home more than 400 children who had been removed from a polygamist sect. Parents were set to begin collecting their children, who were seized from the sect's ranch by state authorities in April.
H: US sect children are sent home.
Entailment: YES

5 http://www.nist.gov/tac/about/index.html
6 http://www.nist.gov/index.html
7 http://www.celct.it/

Example 3
T: Stolen Warhol works recovered: Amsterdam police said Wednesday that they have recovered stolen lithographs by the late U.S. pop artist Andy Warhol worth more than $1 million. Dali's paintings are still missing.
H: Millions of dollars of art were recovered, including works by Dali.
Entailment: CONTRADICTION

Example 4
T: Alex Dyer, spokesman for the group, stated that Santarchy in Auckland is part of a worldwide phenomenon.
H: Alex Dyer represents Santarchy.
Entailment: UNKNOWN

Table 2.1 shows the exact number of pairs in each RTE edition dataset. It is worth mentioning that in RTE-6 and RTE-7 the traditional main task was replaced by the task of recognizing textual entailment within a corpus, situated in the text summarization setting, in order to challenge the systems with a dataset which reflects the natural distribution of entailment in a corpus [Bentivogli et al., 2010b]. In such a task, given a corpus, a hypothesis H, and a set of "candidate" sentences retrieved by Lucene8 from that corpus, RTE systems are required to identify all the sentences (Ts) that entail H [Bentivogli et al., 2011] (e.g. Example 5 from RTE-7).


8 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java: http://lucene.apache.org/core/


        RTE-1   RTE-2   RTE-3   RTE-4   RTE-5    RTE-6    RTE-7
Dev       567     800     800       -     600   15,955   21,420
Test      800     800     800   1,000     600   19,972   22,426

Table 2.1: Number of main task pairs in each edition of RTE evaluation campaign.

Example 5
T: French sports daily L'Equipe reported Tuesday that Lance Armstrong used the performance-enhancing drug EPO to help win his first Tour de France in 1999, a report the seven-time Tour winner vehemently denied.
H: Lance Armstrong is a Tour de France winner.

Moreover, starting from the intuition that detecting entailment relations by relying on linguistic foundations should make systems stronger, Bentivogli et al. [2010a] proposed a methodology for the creation of specialized TE datasets. Such datasets are made of monothematic T-H pairs, in which a certain phenomenon underlying the entailment relation is highlighted and isolated. They also provided an annotation of the RTE-5 data with these linguistic phenomena.

In addition, in the same context, Sammons et al. [2010] proposed a linguistically-motivated analysis of entailment data based on a step-wise procedure to resolve entailment decisions. The authors carried out a feasibility study applying the procedure to 210 examples from the RTE-5 collection, marking for each example the entailment phenomena that are required for the inference.

Last but not least, as one of the novel contributions of this PhD thesis, in Chapter 6 we address the creation of textual entailment corpora by means of crowd-sourcing, aiming at defining a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators.


2.3 Knowledge Resources

An important aspect of dealing with the Textual Entailment problem is the amount of knowledge required to correctly handle the input T-H pairs. To address this issue, the main sources of knowledge that have been used for RTE can be categorized into: i) lexical databases, and ii) entailment and inference rules. The reported usage of entailment and inference rules is considerably lower than the wide usage of lexical resources. One of the main reasons is the limitation of entailment rules both in terms of availability and coverage. In addition, exploiting such resources efficiently requires further investigation and more advanced algorithms and approaches. In order to evaluate the contribution of different resources to the systems' performance, ablation tests were introduced for the RTE-5, RTE-6 and RTE-7 main tasks. Ablation tests consist in removing one module at a time from a system and re-running the system on the test set. Unfortunately, as emerges from the ablation tests reported in [Bentivogli et al., 2009, 2010b, 2011], even the most common resources proved to have a positive impact on some systems and a negative impact on others. Nevertheless, WordNet is the most commonly used resource for TE.

2.3.1 Lexical databases

Four categories of lexical knowledge resources have been used in RTE:

• WordNet [Fellbaum, 1998] is employed in most RTE systems in different ways, mainly to compute a similarity score between two words using semantic links such as synonymy, hyponymy and hypernymy (e.g. [Galanis & Malakasiotis, 2009; Clark & Harrison, 2009]).


Extensions of WordNet, such as EuroWordNet9 and eXtended WordNet10, have also been exploited by Bos [2005]; Tatu & Moldovan [2007].

• VerbNet [Schuler, 2005] and VerbOcean [Chklovski & Pantel, 2004] are mostly used to obtain relations between verbs (e.g. in [Balahur et al., 2008; Mehdad et al., 2009b; Ferrández et al., 2009]). In particular, two verbs are related if they belong to the same VerbNet class or subclass, or if they satisfy one of the VerbOcean relations: similarity, strength, or happens-before.

• FrameNet [Baker et al., 1998] was integrated into some systems (e.g. [Delmonte et al., 2007]) in different ways. Ferrández et al. [2009] defined a similarity score based on FrameNet, while Tatu & Moldovan [2005] derived new semantic information by using FrameNet's frame elements identified in the text. However, most works were limited in their use of FrameNet, probably because of its restricted coverage or the difficulties in modeling its information (see [Burchardt et al., 2009]).

• Wikipedia, as a large web corpus, has been used by many systems to extract lexical knowledge or entailment rules. Shnarch [2008] used Wikipedia to create an extensive resource of 8 million lexical entailment rules, which has been exploited in [Bar-Haim et al., 2008]. Moreover, Wikipedia has been used by Li et al. [2009a] mainly for named-entity resolution, since it contains different references to the same entity with a high coverage. Finally, Mehdad et al. [2010a]; Kouylekov et al. [2010a] used lexical rules extracted from Wikipedia to measure lexical similarity. This work is discussed in more detail in Section 2.3.3 as one of the contributions of this PhD thesis.

9 http://www.illc.uva.nl/EuroWordNet/
10 http://xwn.hlt.utdallas.edu/


2.3.2 Entailment and inference rules

Textual entailment and inference rules are usually automatically acquired rewriting rules. In fact, due to the coverage problem, hand-crafted rules are in general not sufficient for a TE system. The most widely used repository of entailment rules is DIRT [Lin & Pantel, 2001], containing a set of inference rules represented as pairs of directional relations between two text patterns with variables (e.g. "X put emphasis on Y" ⇒ "X pay attention to Y"); a sketch of how such a rule can be applied follows below.
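The following minimal sketch applies a DIRT-style inference rule with X/Y slots. The rule is the example from the text; real DIRT rules are defined over dependency paths rather than surface strings, so this surface-level matcher is only an illustration.

    import re

    RULE = ("X put emphasis on Y", "X pay attention to Y")

    def apply_rule(sentence: str, rule=RULE):
        lhs, rhs = rule
        # Turn the left-hand side into a regex that captures the X and Y fillers.
        pattern = re.escape(lhs).replace("X", r"(?P<X>.+)", 1).replace("Y", r"(?P<Y>.+)", 1)
        m = re.search(pattern, sentence)
        if not m:
            return None
        # Instantiate the right-hand side with the captured fillers.
        return rhs.replace("X", m.group("X")).replace("Y", m.group("Y"))

    print(apply_rule("the government put emphasis on education"))
    # -> "the government pay attention to education"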

Beyond DIRT, Szpektor & Dagan[2008] investigated two approaches for unsupervised learning of unary rule (i.e. one-directional entailment rules) extraction and compared these methods with a learning method to extract binary rules (i.e. bidirectional entailment rules). Their results show that the learned unary rules outperform the binary rules. Aharon et al.[2010] proposed an algorithm that generates inference rules between predicates from FrameNet and proved to be more efficient than WordNet. Furthermore, [Berant et al., 2011] implemented an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates by modeling the task as a graph learning problem.

Although some systems in the RTE challenges used DIRT (e.g. [Mirkin et al., 2009a; Bos & Markert, 2005]) or the other aforementioned entailment rules as a source of knowledge, the experimental results did not show any significant contribution. This might be due to the noise introduced in automatically-acquired rules or to the challenge of rule application.

In addition, as one of the contributions of this PhD thesis, in [Mehdad et al., 2011] we show that using parallel corpora to extract paraphrase rules can help improve the results achieved with other sources of lexical knowledge in the RTE task. This work is explained in Section 2.3.4.


2.3.3 Context sensitive lexical rules from Wikipedia

Wikipedia, as a source of lexical entailment rules, offers at least two advantages over other resources. The first is coverage: with more than 3,000,000 articles, Wikipedia covers the vast majority of concepts potentially appearing in any RTE dataset. This is particularly evident with named entities (e.g. instances of the categories PERSON or LOCATION), whose coverage in Wikipedia is much larger than in any other available source of lexical knowledge. The second advantage is context sensitivity: Wikipedia makes it possible to consider the context (i.e. the actual content of the articles) in which rule elements tend to appear. To embed context in our rules, we train a Latent Semantic Analysis (LSA) model over Wikipedia and use it to score all possible word pairs that appear in the T-H pairs of an RTE dataset. To this aim, we use the jLSI (java Latent Semantic Indexing) tool11 to measure the relatedness between all the terms in a T-H pair. We created the model from the 200,000 most visited Wikipedia articles, after cleaning unnecessary markup tags. The cleaned articles are used as documents for creating the term-by-document matrix. Then, over the training data, we empirically estimate a relatedness threshold in order to filter out all the pairs of terms featuring low similarity, thus obtaining a set of pairs where the first term entails the second one with high probability. The threshold was estimated by running a set of experiments to select the subset of rules that performs best on the training data, resulting in a good trade-off between precision and coverage of the extracted rules. Though higher thresholds could increase precision, leading to more accurate rules, the reduced number of extracted rules would directly affect coverage, causing an overall performance decrease.

11 Available at http://tcc.itc.it/research/textec/tools-resources/jLSI.html
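The following is a minimal sketch of the relatedness scoring just described, using scikit-learn's LSA components instead of the jLSI tool actually used in the thesis. The wiki_articles input and the threshold value are placeholders; terms whose cosine relatedness exceeds the empirically estimated threshold yield candidate lexical rules.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    def lsa_term_relatedness(wiki_articles, num_dims=300):
        # Build the term-by-document matrix from the cleaned article texts.
        vectorizer = TfidfVectorizer(min_df=5)
        doc_term = vectorizer.fit_transform(wiki_articles)   # docs x terms
        term_doc = doc_term.T                                # terms x docs
        # Project every term into the latent semantic space.
        term_vectors = TruncatedSVD(n_components=num_dims).fit_transform(term_doc)
        vocab = vectorizer.vocabulary_                       # term -> row index

        def relatedness(w1, w2):
            from sklearn.metrics.pairwise import cosine_similarity
            if w1 not in vocab or w2 not in vocab:
                return 0.0
            return cosine_similarity(term_vectors[vocab[w1]].reshape(1, -1),
                                     term_vectors[vocab[w2]].reshape(1, -1))[0, 0]
        return relatedness

    # A rule [w1 => w2] is kept only when the score passes the threshold tuned
    # on the training data, e.g.:
    # if relatedness("apple", "macintosh") > threshold: keep_rule("apple", "macintosh")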


In order to compare rule repositories obtained from different resources on the RTE task, and to validate the usefulness of the rules extracted from Wikipedia, we used EDITS (Edit Distance Textual Entailment Suite [Kouylekov & Negri, 2010]), a freely available open source tool for recognizing textual entailment developed by FBK-irst, together with our novel syntactic/semantic tree kernel system developed by Mehdad et al. [2010a]. These systems are described in Section 2.4. Since our objective is to compare the utility of the lexical knowledge extracted from Wikipedia with that of other resources, each experiment has been carried out with the best configuration of EDITS in RTE-5 (the one used for the RTE-5 submission, thoroughly described in [Mehdad et al., 2009b]) and with different configurations of the tree-kernel based system used in RTE-5, thoroughly described in [Mehdad et al., 2009a]. In this section, we only describe the results of the experiments with EDITS, while the interesting findings obtained with the syntactic/semantic tree kernel system are presented in Section 2.6. In our experiments, we compared the performance achieved over the RTE-5 dataset by exploiting the following lexical rule repositories:

WIKI: Out of the original 199,217 rules extracted from Wikipedia, we estimated a threshold over training data to filter out rules with lower reliability. As a result, 58,278 rules have been retained.

WN: 1,106 rules have been extracted from WordNet for each pair of terms (w1 in T and w2 in H) that are connected by the synonym or hypernym relations. More specifically, given a word w1 in T, a new rule [w1 ⇒ w2] is created for each word w2 in H that is a synonym or a hypernym of w1 (a sketch of this extraction is given after the list below).

22 CHAPTER 2. RTE 2.3. KNOWLEDGE RESOURCES

VO: 192 rules have been extracted from VerbOcean. Rules are collected for each pair of verbs (v1 in T and v2 in H) that are connected by the [stronger-than] relation. More specifically, given a verb v1 in T, a new rule [v1 ⇒ v2] is created for each verb v2 in H that is connected to v1 by the [stronger-than] relation (i.e. when [v1 stronger-than v2]). Though potentially useful, the transitive closure is not considered, due to the high level of noise introduced by verb ambiguities.

DEP: rules are collected from the thesauri of dependency-based similarities described in [Lin, 1998], available at Dekang Lin's website12. More specifically, given a word w1 in T, a new rule [w1 ⇒ w2] is created for each word w2 in H that is related to w1 in the thesauri. Out of the 5,432 rules extracted from Lin's dependency thesaurus, we estimated a threshold to filter out those with lower reliability.

PROX: in the same way, out of the 8,029 original rules extracted from Lin's proximity thesaurus, only 236 have been retained after filtering.
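The following is a minimal sketch of the WN extraction referenced above, using NLTK's WordNet interface (requires the WordNet data via nltk.download("wordnet")). The thesis used its own extraction pipeline, so details such as part-of-speech handling may differ.

    from nltk.corpus import wordnet as wn

    def wordnet_rules(t_words, h_words):
        rules = set()
        for w1 in t_words:
            related = set()
            for synset in wn.synsets(w1):
                # Synonyms: all lemmas of the synsets of w1.
                related.update(l.name() for l in synset.lemmas())
                # Hypernyms: lemmas of the direct hypernym synsets.
                for hyper in synset.hypernyms():
                    related.update(l.name() for l in hyper.lemmas())
            for w2 in h_words:
                if w2 in related and w2 != w1:
                    rules.add((w1, w2))  # rule [w1 => w2]
        return rules

    # Example: ("car", "automobile") via synonymy, ("dog", "canine") via hypernymy.
    # wordnet_rules({"car"}, {"automobile"}); wordnet_rules({"dog"}, {"canine"})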

Table 2.2 reports the accuracy results we achieved over the RTE-5 data (both on the development and test sets), showing that the Wikipedia rules outperform all the other rule sets, with accuracy improvements over the test set ranging from 2.5% to 5.2%. These results demonstrate that, by applying entailment rules extracted from Wikipedia, we gain a higher coverage as well as a better performance in our entailment framework. As an example, the entailment relations between "Apple" and "Macintosh", or between "Iranian" and "IRIB", can be represented by lexical rules which could not be extracted using WordNet or any other resource. This confirms our hypothesis that increasing the coverage and using a context-sensitive approach in rule extraction may result in better performance in the RTE task.

12 http://www.cs.ualberta.ca/~lindek/downloads.htm


             VO           WN          PROX          DEP          WIKI
          DEV  TEST    DEV  TEST    DEV  TEST    DEV  TEST    DEV  TEST
Accuracy 61.8  58.8   61.8  58.6   61.8  58.8   62.0  57.3   62.6  60.3

Table 2.2: Comparing accuracy results over the RTE-5 dataset.

Though encouraging and substantially confirming our working hypothesis, the observed performance increase is lower than expected. This might be due to the difficulty of exploiting lexical information when the tree edit distance algorithm, the basic matching algorithm employed by the EDITS package, is used. Often, valid and reliable rules that could potentially be applied to reduce the distance between T and H are ignored because of the syntactic constraints imposed by the algorithm. To verify this hypothesis we performed another experiment, comparing the different resources in terms of potential coverage, independently of any RTE algorithm.

We performed an analysis of the coverage of the rules extracted, and retained after filtering, from each resource. To this aim, we counted the number of pairs in the RTE-5 data which contain lexical rules present in the WordNet, VerbOcean, Lin Dependency/Proximity, and Wikipedia repositories. For all the T-H pairs of the RTE-5 dataset, we computed the total number of rules [w1 ⇒ w2] that match a word w1 in T and a word w2 in H. Then, we estimated the number of rules that were extracted from each resource and the number of rules that were retained in our experiments. Table 2.3 shows the coverage of the content words of the rules extracted for RTE-5 from the different resources. As can be seen, the coverage of Wikipedia is the highest amongst the available resources.


              VO            WN           PROX          DEP           WIKI
          Extr.  Ret.   Extr.  Ret.   Extr.  Ret.   Extr.  Ret.   Extr.  Ret.
Coverage  0.08%  0.08%   0.4%  0.4%     3%  0.09%     2%    1%     83%   24%

Table 2.3: Coverage of rule repositories over the RTE-5 dataset.

2.3.4 Paraphrase Tables as a Source of Knowledge

In addition to extracting lexical rules from Wikipedia, we explored the use of parallel corpora to extract paraphrase tables. Based on our definition, paraphrase tables (PPHT) contain pairs of corresponding phrases13 in the same language, possibly associated with probabilities. They have proved useful in a number of NLP applications, such as natural language generation [Iordanskaja et al., 1991], multidocument summarization [McKeown et al., 2002], automatic evaluation of MT [Denkowski & Lavie, 2010], and TE [Dinu & Wang, 2009]. One of the proposed methods to extract paraphrases relies on a pivot-based approach using phrase alignments in a bilingual parallel corpus [Bannard & Callison-Burch, 2005]. With this method, all the different phrases in one language that are aligned with the same phrase in the other language are extracted as paraphrases; a minimal sketch of this extraction is given below. After the extraction, pruning techniques [Snover et al., 2009] can be applied to increase the precision of the extracted paraphrases. In our work, we used available paraphrase databases for English,14 which have been extracted using the method previously outlined. Additionally, in order to experiment with different paraphrase sets providing different degrees of coverage and precision, we pruned the paraphrase table based on the probabilities associated with its entries, using thresholds of 0.1, 0.2 and 0.3. The number of phrase pairs extracted varies from 6 million to about 80,000, with an average of 3.2 words per phrase.

13 A phrase in our approach is an n-gram composed of up to 5 consecutive words.
14 http://www.cs.cmu.edu/alavie/METEOR
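The following is a minimal sketch of the pivot-based extraction just described: two phrases e1 and e2 in one language are paraphrases if they are aligned to a common foreign phrase f, with the probability obtained by marginalizing over the pivots [Bannard & Callison-Burch, 2005]. The two probability tables are hypothetical dicts standing in for a real phrase table.

    from collections import defaultdict

    def extract_paraphrases(p_f_given_e, p_e_given_f, prune_threshold=0.2):
        # p_f_given_e: {e1: {f: p(f|e1)}}   p_e_given_f: {f: {e2: p(e2|f)}}
        paraphrases = defaultdict(float)
        for e1, pivots in p_f_given_e.items():
            for f, p_fe1 in pivots.items():
                for e2, p_e2f in p_e_given_f.get(f, {}).items():
                    if e2 != e1:
                        # p(e2|e1) = sum over pivots f of p(e2|f) * p(f|e1)
                        paraphrases[(e1, e2)] += p_e2f * p_fe1
        # Pruning: keep only entries above the probability threshold, trading
        # coverage for precision as in the 0.1/0.2/0.3 experiments above.
        return {pair: p for pair, p in paraphrases.items() if p >= prune_threshold}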

One of the main limitations of the distance algorithms (e.g. tree edit distance) employed in the EDITS package is that they limit the use of lexical knowledge to unigrams only, not allowing matches of longer lexical units (e.g. phrases) carrying more contextual information. In order to maximize the usage of lexical knowledge, our entailment decision criterion is based on similarity scores calculated with a novel phrase-to-phrase matching process. A phrase in our approach is an n-gram composed of up to 5 consecutive words, excluding punctuation. Entailment decisions are estimated by combining the phrasal matching scores calculated for each n-gram level, i.e. the number of 1-grams, 2-grams, ..., 5-grams extracted from H that match n-grams in T. This algorithm is detailed in Chapter 3. To combine the phrasal matching scores obtained at each n-gram level, we used a Support Vector Machine classifier, SVMlight [Joachims, 1999a], using each score as a feature; a minimal sketch of this feature computation is given at the end of this section.

We experimented with the original RTE-3 and RTE-5 datasets, annotated with token, lemma, and stem information using the TreeTagger [Schmid, 1995] and the Snowball stemmer [Porter, 2001]. We compared the results achieved with paraphrase tables (extracted with the different pruning thresholds of 0.1, 0.2, and 0.3) with those obtained using the three most widely used English resources for Textual Entailment mentioned before. Table 2.4 shows the accuracy results calculated over the original RTE-3 and RTE-5 test sets, training our classifier over the corresponding development sets.

Dataset     WN     VO    WIKI    PPHT   PPHT 0.1   PPHT 0.2   PPHT 0.3
RTE3      61.88  62.00  61.75   62.88     63.38      63.50      63.00
RTE5      62.17  61.67  60.00   61.33     62.50      62.67      62.33

Table 2.4: Accuracy results on RTE using different lexical resources.

The results show that the pruned paraphrase tables always outperform the other lexical resources used for comparison, with an accuracy increase of up to 3%. In particular, we observe that using 0.2 as a pruning threshold provides a good trade-off between coverage and precision, leading to our best results on both datasets (63.50% for RTE-3, and 62.67% for RTE-5). Compared with the scores reported by participants in the two corresponding editions of the RTE Challenge (i.e. RTE-3 and RTE-5), these results are about 1% above the average [Mehdad et al., 2011]. Overall, these results confirm our claim that increasing the coverage using context-sensitive phrase pairs obtained from large parallel corpora results in better performance in RTE. We also demonstrated the effectiveness of paraphrase tables as a means to overcome the bias towards single words featured by the existing resources mostly used by RTE systems.
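The following is a minimal sketch of the phrase-to-phrase matching features described in this section: for each n-gram level (1..5), the fraction of H's n-grams found in T is one feature, and an SVM combines the five scores into the final entailment decision. scikit-learn's SVC stands in for the SVMlight classifier used in the thesis, and the paraphrase-table lookups that expand the matches are omitted for brevity.

    from sklearn.svm import SVC

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def phrase_match_features(t_tokens, h_tokens, max_n=5):
        features = []
        for n in range(1, max_n + 1):
            h_ngrams = ngrams(h_tokens, n)
            t_ngrams = set(ngrams(t_tokens, n))
            matched = sum(1 for g in h_ngrams if g in t_ngrams)
            features.append(matched / len(h_ngrams) if h_ngrams else 0.0)
        return features

    # Training: one feature vector per T-H pair, gold YES/NO labels.
    # X = [phrase_match_features(t, h) for t, h in train_pairs]
    # clf = SVC(kernel="linear").fit(X, train_labels)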

2.4 Approaches

A number of approaches to semantics, inference and textual entailment have been experimented with over the years, since the launch of the RTE Challenge in 2005. In this section, we focus on several aspects of these approaches, excluding the preprocessing part, which is not in the scope of this thesis.

2.4.1 Logic-based approaches to RTE

Logic inference can be considered one of the most direct approaches to the entailment problem. Tatu & Moldovan [2007] and Tatu et al. [2006] transformed two text snippets into three-layered semantically-rich logic form representations, generated an abundant set of lexical, syntactic, semantic, and world knowledge axioms and, iteratively, searched for a proof of the

entailment between the text T and a possibly relaxed version of the hypothesis H. They improved the performance of their system by using a lexical inference system in combination with their logical approach.
As another successful attempt at approximate entailment using logical inference, Bos & Markert [2005] incorporated an automated reasoning technique using a deep semantic analysis of T and H. In addition, they used simple shallow word overlap in combination with their logic engine to achieve high accuracy in the RTE task.
MacCartney & Manning [2007] presented the first use of a computational model of natural logic for textual inference. They tried to overcome some limitations of lexical approaches and build a more flexible and robust system than first-order logic based approaches. Their system finds a low-cost edit sequence which transforms T into H, learns to classify entailment relations across atomic edits, and composes atomic entailments into a top-level entailment judgement.
Furthermore, Bar-Haim et al. [2008] created a logic-based representation of T and then performed simple inference (using WordNet and the DIRT inference rule database) over H. However, they could not show an effective method for using DIRT as an inference rule repository.

2.4.2 Transformation and similarity based approaches to RTE

The transformation-based entailment method makes use of various types of entailment knowledge to gradually transform T so that it becomes more similar to H, or vice versa. Bar-Haim et al. [2008] investigated this approach thoroughly by exploiting different knowledge sources, uniformly represented in the form of entailment rules. This allows the consistent application of the same kinds of transformations on the text, regardless of the source of the knowledge. The applied transformations generate multiple consequences (new texts entailed from the original one), whose parse trees

are efficiently stored in a packed representation, termed compact forest. An approximate matching phase makes the final entailment decision by assessing the degree of syntactic match between the hypothesis and the generated consequents, compensating for knowledge gaps in the available rules.
Kouylekov & Magnini [2005] assumed a distance-based framework, where the distance between T and H is inversely proportional to the strength of the entailment relation in the pair, estimated as the sum of the costs of the edit operations (i.e. insertion, deletion, substitution) on the parse tree which are necessary to transform T into H. They used different resources to estimate the edit operation costs and to ensure the non-symmetric directionality of the entailment relation. Moreover, they developed the first open source system for RTE, which implements a collection of algorithms, providing a configurable framework to quickly set up a working environment to experiment with the RTE task [Kouylekov & Negri, 2010]. As a novel part of this thesis, we proposed a method to estimate and optimize the operation costs in distance-based approaches, applying the Particle Swarm Optimization algorithm [Mehdad, 2009; Mehdad & Magnini, 2009a]. Moreover, we implemented an innovative method to assess the results achieved by EDITS on a given training corpus in [Kouylekov et al., 2011]. Note that these methods are described in Section 2.5 as contributions of this PhD thesis.
Furthermore, Harmeling [2009] introduced a system for textual entailment that is based on a probabilistic model of entailment. This model is defined using a calculus of transformations on dependency trees, where derivations in that calculus preserve the truth only with a certain probability.
In the RTE scenario, since T is often much longer than H, if the surface string of H is very similar to a part of T, this is an indication that H might

be entailed by T. Malakasiotis [2009] compared H to a sliding window over T's string of the same size, using the largest similarity score to estimate whether T entails H or not. They also exploited WordNet to detect synonyms, and a dependency parser to measure similarity in the grammatical structure of T and H, in order to enrich their feature space.
In order to boost the similarity scores and extend them to a different level, Iftene & Balahur-Dobrescu [2007] and Iftene & Moruz [2009] compared H's parse tree against subtrees of T's parse tree. They transformed the hypothesis making use of extensive semantic knowledge from sources like DIRT, WordNet, Wikipedia, and an acronym database. Additionally, they took advantage of hand-coded complex grammar rules for rephrasing in English. Besides, some systems exploited different alignment and matching algorithms at the lexical (e.g. [Glickman et al., 2006]), phrase (e.g. [Padó et al., 2008]), syntactic (e.g. [Yatbaz, 2008]), semantic (e.g. [Li et al., 2009b]), or ontology (e.g. [Siblini & Kosseim, 2008]) level. In addition, Sammons et al. [2009] proposed an architecture designed to integrate different and unscaled natural language processing resources, combining them by taking advantage of an alignment-based method.
In measuring the similarity at different levels, syntactic or semantic representations of the input expressions cannot always be estimated accurately (e.g., due to parser errors). For this reason, the methods that operate at the syntactic or semantic level do not necessarily outperform the methods that operate on surface strings [Wang, 2011]. Another problem in approaching TE with similarity metrics is that the entailment relation is asymmetric, while most similarity relations are symmetric.

2.4.3 Supervised Learning Methods for RTE

Among the different approaches to TE, the use of Machine Learning (ML) is dominant. This is mainly because both logic- and rule-based

methods suffer from either the limited coverage of hand-crafted rules or low performance. In ML approaches, a variety of features, including lexical, syntactic and semantic ones, can be extracted from training examples and employed to train a classifier.
For instance, Agichtein et al. [2008] used a supervised machine learning approach to train a classifier over a variety of lexical, syntactic, and semantic metrics. They treated the output of each metric as a feature, and trained a classifier on the data provided in the available RTE datasets. In the same direction, Rodrigo et al. [2008] extracted syntactic and semantic features after applying dependency parsing and named entity recognition, while Nielsen et al. [2009] and Bensley & Hickl [2008] focused on collecting deeper semantic features.
The approach proposed in [Wang & Neumann, 2008a] is based on constructing structural features from abstract tree descriptions, which are automatically extracted from the syntactic dependency trees of T and H. These features are then used by a subsequence-kernel-based classifier that learns to decide whether the entailment relation holds between two texts. A divide-and-conquer architecture is then in charge of providing a set of specific RTE methods (namely: temporal anchors, named entities and noun phrase anchors), and of combining them through a voting scheme in order to maximize the accuracy.
The system described in [Zanzotto & Moschitti, 2006a] defines a cross-pair similarity measure based on the syntactic trees of T and H, and combines such similarity with traditional intra-pair similarities to define a novel semantic kernel function. The intuition behind this approach is that not only the intra-pair similarity between T and H, but also the cross-pair similarity between two pairs can be useful to address the problem. The latter similarity measure, along with a set of annotated examples, is used by a learning algorithm to automatically derive syntactic and lexical rules to

solve complex entailment cases.
In this dissertation, in line with moving beyond the state of the art in RTE systems, we also describe our novel approach applying tree kernels [Collins & Duffy, 2002]. We proposed models for effectively using syntactic and semantic information in RTE, without requiring either large-scale automatic rule acquisition or hand-coding. These models exploit lexical similarities to generalize lexical-syntactic rules automatically derived by supervised learning methods. The joint syntactic/semantic model is realized by means of novel tree kernels, which can match sub-trees whose leaves are lexically similar (not just identical). This approach and the related results are described in Section 2.6.

2.5 Optimizing Edit Distance based Entailment

In this section, we introduce the need for optimization in edit distance based TE approaches (e.g. [Kouylekov & Magnini, 2005] and [Kouylekov & Negri, 2010]). We first discuss the nature of the problem, as well as the motivation of our approach to optimizing the cost of edit operations in edit distance based techniques. Then, we propose and describe our solution, followed by experimental results. We also show the need for an automatic way to explore the large search space of possible configurations, in order to select the most promising one for a given RTE dataset. Finally, we explain our proposed solution using optimization techniques and comment on the results we achieved on all previous RTE datasets.

2.5.1 Optimizing Edit Distance Using Particle Swarm Opti- mization

Among the approaches to the problem of textual entailment discussed in Section 2.4, some methods use the notion of distance between T

and H as the main feature separating the entailment classes (positive and negative). Some systems calculate the distance by implementing Tree Edit Distance (TED), based on the syntactic features represented in the structured parse tree of each string [Kouylekov & Magnini, 2005]. In this method the distance is computed as the cost of the edit operations (insertion, deletion and substitution) that transform the text T into the hypothesis H. Each edit operation has an associated cost, and the entailment score is calculated such that the set of operations leads to the minimum cost.
Generally, the initial cost is assigned to each edit operation empirically, or based on expert knowledge and experience. These methods raise a critical problem when the domain, field or application is new and the level of expertise and empirical knowledge is very limited. In dealing with textual entailment, Kouylekov & Magnini [2006] experimented with different cost values based on various linguistic knowledge and probabilistic estimations. For instance, they defined the substitution cost as a function of the similarity between two nodes, or, for the insertion cost, they employed the Inverse Document Frequency (IDF) of the inserted node. However, the results were not optimal.
Other approaches to estimating the cost of operations in TED tried to learn a generic or discriminative probabilistic model from the data [Bernard et al., 2008; Neuhaus & Bunke, 2004], without concern for the optimal value of each operation. One of the drawbacks of those approaches is that the cost values of the edit operations are hidden behind the probabilistic model. Additionally, the costs cannot be weighted or varied according to the tree context and node location [Bernard et al., 2008].
In order to overcome these drawbacks, we propose a stochastic method based on Particle Swarm Optimization (PSO) to estimate the cost of each edit operation for the TE problem. By integrating PSO, we try to learn the

optimal cost for each operation in order to improve the prior textual entailment model. Our innovative contribution is to automatically estimate the best possible operation costs on the development set. A further advantage of such a method, besides the automatic learning of the operation costs, is the possibility of inspecting the learned cost values to better understand how TED approaches the textual entailment datasets.

Particle Swarm Optimization (PSO)

PSO is a stochastic optimization technique which takes inspiration from the social behavior of bird flocking and fish schooling [Eberhart et al., 2001]. PSO is a population-based search method which takes advantage of the concept of social sharing of information. In this algorithm each particle can learn from the experience of the other particles in the same population (called a swarm). In other words, each particle in the iterative search process adjusts its flying velocity as well as its position based not only on its own experience but also on the flying experience of the other particles in the swarm. This algorithm has proven efficient in solving a number of engineering problems. PSO is mainly built on the following equations:

X_i = X_i + V_i    (2.1)

V_i = ω V_i + c_1 r_1 (X_{b_i} − X_i) + c_2 r_2 (X_{g_i} − X_i)    (2.2)

To be concise, for each particle at each iteration, the position X_i (Equation 2.1) and the velocity V_i (Equation 2.2) are updated. X_{b_i} is the best position of the particle during its past routes and X_{g_i} is the best global position over all routes travelled by the particles of the swarm. r_1 and r_2 are random variables drawn from a uniform distribution in the range [0,1], while c_1 and c_2 are two acceleration constants regulating the relative velocities with respect to the best local and global positions. The weight ω is used

as a trade-off between the global and local best positions. It is usually set slightly below 1 for better global exploration [Melgani & Bazi, 2008]. The best position is computed based on the fitness function defined for the problem at hand. Both position and velocity are updated during the iterations until convergence is reached or the iterations attain the maximum number defined by the user.
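A minimal sketch of this update loop, in Python, is given below; it assumes a caller-supplied fitness function to be maximized and mirrors Equations 2.1 and 2.2 (the actual experiments use the JSwarm-PSO package, so the parameter defaults here are illustrative):

import random

def pso(fitness, dim, n_particles=20, iters=10,
        omega=0.95, c1=2.0, c2=2.0, lo=0.0, hi=1.0):
    # Particle Swarm Optimization following Equations 2.1 and 2.2.
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    Xb = [x[:] for x in X]                     # best position of each particle
    fb = [fitness(x) for x in X]
    g = max(range(n_particles), key=lambda i: fb[i])
    Xg, fg = Xb[g][:], fb[g]                   # best global position

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Equation 2.2: velocity update.
                V[i][d] = (omega * V[i][d]
                           + c1 * r1 * (Xb[i][d] - X[i][d])
                           + c2 * r2 * (Xg[d] - X[i][d]))
                # Equation 2.1: position update, clipped to the search space.
                X[i][d] = min(hi, max(lo, X[i][d] + V[i][d]))
            f = fitness(X[i])
            if f > fb[i]:
                fb[i], Xb[i] = f, X[i][:]
                if f > fg:
                    fg, Xg = f, X[i][:]
    return Xg, fg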

Integrating TED with PSO

This section aims at finding the optimal set of operation costs to: i) improve the performance of TED in different applications, and ii) provide some information on how the different TED operations approach an application or dataset.
One of the most important steps in applying PSO is to define a fitness function which can lead the swarm to the optimized particles in different applications and over different datasets. The choice of this function is crucial, since PSO evaluates the quality of each candidate particle on the basis of the fitness function, in order to drive the solution space towards the optimum. Moreover, this function should possibly improve the textual entailment recognition model. In order to attain these goals, we defined the accuracy obtained by a TED-based system as the fitness function for optimizing the cost values. Since maximizing the accuracy directly increases the performance of the system, this measure is a natural choice for our aim. In this method, maximizing the fitness function yields the best model based on the optimal cost values in the particle space of the PSO algorithm. The procedure for optimizing and estimating the cost of the edit operations in TED by applying the PSO algorithm is as follows.
a) Initialization


Step 1) Generate a random swarm of particles (in the simple case each particle is defined by the costs of the three operations).

Step 2) For each position of a particle in the swarm, obtain the fitness function value (accuracy) over the training data.

Step 3) Set the best position of each particle (X_{b_i}) to its initial position.

b) Search

Step 4) Detect the best global position (X_{g_i}) in the swarm, based on the maximum value of the fitness function over all explored routes.

Step 5) Update the velocity of each particle (V_i).

Step 6) Update the position of each particle (X_i). In this step, by defining boundaries, we prevent particles from leaving the allowed search space.

Step 7) For each candidate particle, calculate the fitness function (accuracy).

Step 8) Update the best position of each particle if the current position has a larger fitness value.

c) Convergence

Step 9) Repeat from Step 4 until the maximum number of iterations (in our case set to 10) is reached.

d) Results

Step 10) Return the best fitness function value and the best particle. In this step the optimum costs are returned.

Following the steps above, instead of determining the entailment relation with manually assigned tree edit distance costs, the operation costs can be automatically

estimated and optimized. In this process, both fitness functions can easily be compared, and the cost values leading to the better model are selected.
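To illustrate the whole loop, the following toy sketch plugs an accuracy-based fitness function into the pso() routine sketched above. For brevity, word-level edit distance stands in for tree edit distance, and the development set is reduced to two pairs; in the real setting each fitness call corresponds to a full training/evaluation run of the entailment system:

# Toy development set: (T, H, entailment) pairs.
DEV = [
    ("in 1980 chapman killed lennon", "lennon died in 1980", True),
    ("in 1956 jfk met marilyn monroe", "marilyn monroe died in 1956", False),
]

def word_edit_distance(t, h, c_ins, c_del, c_sub):
    # Word-level edit distance with configurable operation costs
    # (a simplified stand-in for tree edit distance).
    t, h = t.split(), h.split()
    d = [[0.0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        d[i][0] = d[i - 1][0] + c_del
    for j in range(1, len(h) + 1):
        d[0][j] = d[0][j - 1] + c_ins
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            sub = 0.0 if t[i - 1] == h[j - 1] else c_sub
            d[i][j] = min(d[i - 1][j] + c_del,
                          d[i][j - 1] + c_ins,
                          d[i - 1][j - 1] + sub)
    return d[-1][-1]

def fitness(costs, threshold=1.5):
    # Accuracy of the thresholded distance on the development set.
    c_ins, c_del, c_sub = costs
    hits = sum((word_edit_distance(t, h, c_ins, c_del, c_sub) <= threshold) == y
               for t, h, y in DEV)
    return hits / len(DEV)

# Each particle encodes the three operation costs (Step 1); the best
# particle returned by pso() holds the optimized costs (Step 10).
best_costs, best_acc = pso(fitness, dim=3, iters=10)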

Experiments

In our experiments, in order to experiment with the TED approach to textual entailment, we used the EDITS package (Edit Distance Textual Entailment Suite) [Kouylekov & Negri, 2010; Negri et al., 2009], an open source software based on edit distance algorithms, which computes the T-H distance as the cost of the edit operations (i.e. insertion, deletion and substitution) that are necessary to transform T into H. Given an edit distance algorithm and a cost scheme (assigning a cost to the edit operations), this package is able to learn a TED threshold, over a set of string pairs, to decide whether the entailment holds in a pair. In addition, we exploited the JSwarm-PSO package [Cingolani, 2005], with some adaptations, as an implementation of the PSO algorithm.
Our experiments were conducted on the RTE datasets.15 Each pair in the datasets was enriched with two syntactic dependency parse trees using the Stanford statistical parser [Klein & Manning, 2003]. The accuracy, by default, is computed by EDITS over the training set based on 10-fold cross-validation. We conducted six different experiments on each of the RTE-1 to RTE-4 datasets.16 The costs were estimated on the training set, and the results were obtained by applying the estimated costs on the test set.
In the first set of experiments, we set a simple cost scheme based on three operations. Implementing this cost scheme, we expect to optimize the cost of each edit operation without considering that the operation costs

15 http://www.pascal-network.org/Challenges/RTE1-4
16 At the time of the experiments, the only available datasets were RTE-1 to RTE-4.

may vary based on different characteristics of a node, such as its size, location or content. The results were obtained considering three different settings: i) a random cost assignment, ii) costs assigned based on human expertise and intuition (called Intuitive), and iii) an automatically estimated and optimized cost for each operation. In the second setting, we used the same scheme that was defined by EDITS expert users and developers.
In the second set of experiments, we composed an advanced cost scheme with more fine-grained operations, assigning a weight to the edit operations based on the characteristics of the nodes. For example, if a node is in the list of stop-words, the deletion cost is set to zero; otherwise, the cost of deletion is equal to the number of words in H multiplied by the word's length (number of characters). Similarly, the cost of inserting a word w in H is set to 0 if w is a stop-word, and to the number of words in T multiplied by the word's length otherwise. The cost of substituting two words is the Levenshtein distance (i.e. the edit distance calculated at the level of characters) between their lemmas, multiplied by the number of words in T, plus the number of words in H. With this scheme, we optimize nine specialized costs for the edit operations (i.e. each particle is defined by 9 parameters to be optimized). We conducted the experiments using all three settings mentioned for the simple cost scheme.
In each experiment, we applied both fitness functions in the optimization; at the final phase, the costs which led to the maximum results were chosen as the estimated operation costs. In order to save time, we set the number of iterations to 10; in addition, the weight ω was set to 0.95 for better global exploration [Melgani & Bazi, 2008].
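The fine-grained scheme can be written down as in the following sketch (Python, with an illustrative stop-word subset; the helper names are ours and do not correspond to EDITS interfaces):

def levenshtein(a, b):
    # Character-level edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

STOPWORDS = {"the", "a", "an", "in", "of", "and"}  # illustrative subset

def deletion_cost(word, t_len, h_len):
    # Deleting a stop-word from T is free; otherwise the cost is the
    # number of words in H times the word length (in characters).
    return 0.0 if word in STOPWORDS else h_len * len(word)

def insertion_cost(word, t_len, h_len):
    # Symmetric scheme for inserting a word of H.
    return 0.0 if word in STOPWORDS else t_len * len(word)

def substitution_cost(lemma_t, lemma_h, t_len, h_len):
    # Character-level edit distance between lemmas, scaled by pair size.
    return levenshtein(lemma_t, lemma_h) * t_len + h_len

The nine parameters optimized by PSO would then act as multiplicative weights on such base costs.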

Results

Our results are summarized in Table 2.5. We show the accuracy obtained by a distance-based (word-overlap) baseline for textual entailment [Mehdad & Magnini, 2009b], to be compared with the results achieved by the random, intuitive and optimized cost schemes using the EDITS system. For a better comparison, we also present the results of the EDITS system in the RTE-4 challenge, using a combination of different distances as features for classification [Cabrio et al., 2008].
In the first experiment, we estimated the cost of each operation using the simple cost scheme. Table 2.5 shows that, on all datasets, accuracy improved by up to 9% by optimizing the cost of each edit operation. The results show that the optimized cost scheme enhances the quality of the system performance, even more than the cost scheme defined by experts (the Intuitive cost scheme).
Furthermore, in the second set of experiments, using the fine-grained and weighted cost scheme for the edit operations, we achieved the highest accuracy. The results illustrate that all optimized results outperform the word-overlap baseline for textual entailment, as well as the accuracy obtained in the RTE-4 challenge using a combination of different distances as features for classification.
By exploring the estimated optimal cost of each operation, another interesting point was discovered. The estimated cost of deletion in the first set of experiments was 0, which means that deleting a node from the dependency tree of T does not affect the quality of the results. This shows that, by setting different cost schemes, we can even explore some linguistic phenomena which exist in the entailment dataset. Studying the dataset from this point of view might be interesting for finding hidden information which cannot be explored easily.
In addition, the optimized model shows more consistency and stability (from 59% to 62% accuracy) than the other models, whose results vary more across datasets (from 50% in RTE-1 to 59% in RTE-3).


Cost scheme   Model             RTE-4   RTE-3   RTE-2   RTE-1
Simple        Random            49.6    53.6    50.4    50.5
Simple        Intuitive         51.3    59.6    56.5    49.8
Simple        Optimized         56.5    61.6    58.0    58.1
Advanced      Random            53.60   52.0    54.6    53.5
Advanced      Intuitive         57.6    59.4    57.7    55.5
Advanced      Optimized         59.5    62.4    59.9    58.6
              Baseline          55.2    60.9    54.8    51.4
              RTE-4 Challenge   57.0    -       -       -

Table 2.5: Comparison of accuracy on RTE datasets based on optimized and unoptimized cost schemes.

2.5.2 Optimizing Textual Entailment Recognition System Using Genetic Algorithm

Generally, it would be useful for RTE system developers to have: i) automatic ways to support system tuning at the training stage, and ii) reliable terms of comparison to validate their hypotheses and position the results of their work before submitting runs for evaluation. In this section we address these needs by extending an open-source RTE package with a mechanism that automates the selection of the most promising configuration over a training dataset.
EDITS is an open source package for recognizing textual entailment, which offers a modular, flexible, and adaptable working environment to experiment with the RTE task over different datasets. The package allows users to: i) create an entailment engine by defining its basic components (i.e. algorithms, cost schemes, rules, and optimizers); ii) train such an entailment engine over an annotated RTE corpus to learn a model; and iii) use the entailment engine and the model to assign an entailment judgment and

a confidence score to each pair of the test corpus. A key feature of EDITS is its high configurability, allowed by the availability of different algorithms, the possibility to integrate different sets of lexical entailment and contradiction rules, and the variety of parameters for performance optimization (as discussed in Section 2.5.1).

Although configurability is per se an important aspect (especially for an open-source and general purpose system), there is another side of the coin. In principle, in order to select the most promising configuration over a given development set, one should exhaustively run a huge number of training/evaluation routines. This number corresponds to the total number of configurations allowed by the system, which result from the possible combinations of parameter settings. With growing dataset sizes, and the tight time constraints usually posed by the evaluation campaigns, this problem becomes particularly challenging, as developers are hardly able to run exhaustive training/evaluation routines. This situation results in running a limited number of experiments with the most "reasonable" configurations, which consequently might not lead to the optimal solution.

The need for a mechanism to automatically obtain the most promising solution on one side, and the need for efficiency on the other, raise the necessity of optimizing this procedure. Along this direction, the objective is a good trade-off between exhaustive experimentation with all possible configurations (infeasible) and educated guessing (unreliable). The remainder of this section tackles this issue by introducing an optimization strategy based on genetic algorithms, another optimization algorithm that matches our criteria, and by describing its adaptation to extend EDITS with the new functionality.


Genetic Algorithm (GA)

Genetic algorithms (GA) are well suited to efficiently deal with large search spaces, and have recently been applied with success to a variety of optimization problems and specific NLP tasks [Figueroa & Neumann, 2008; Rodríguez et al., 2008]. GA are a direct stochastic method for global search and optimization, which mimics natural evolution. To this aim, they work with a population of individuals representing possible solutions to the given task. Traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings (e.g. sequences of real values) are possible. The evolution usually starts from a population of randomly generated individuals; at each generation, the best suited individuals are selected based on a fitness function (which measures the optimality of the solution represented by the individual). Such selection is then followed by modifications of the selected individuals, obtained by recombining them (crossover) and performing random changes (mutation), to form a new population, which is used in the next iteration. Finally, the algorithm terminates when the maximum number of generations, or a satisfactory fitness level, has been reached for the population.

Integrating EDITS with Genetic Algorithm

Our extension of the EDITS package, integrating it with GA (EDITS-GA), consists of an iterative process that starts with an initial population of randomly generated configurations. After a training phase with the generated configurations, the process is evaluated by means of the fitness function, which is manually defined by the user.17 This measure is used by the genetic algorithm to iteratively build new populations of configurations, which are trained and evaluated.

17 For instance, working on the RTE Challenge "Main" task data, the fitness function would be the accuracy for RTE1 to RTE5.


Figure 2.1: EDITS-GA framework.

This process can be seen as the combination of: i) a micro training/evaluation routine for each generated configuration of the entailment engine; and ii) a macro evolutionary cycle, as illustrated in Figure 2.1. The fitness function is an important factor for the evaluation and the evolution of the generated configurations, as it drives the evolutionary process by determining the best-suited individuals used to generate new populations. The procedure to estimate and optimize the best configuration applying the GA can be summarized as follows.

(1) Initialization: generate a random initial population (i.e. a set of configurations).

(2) Selection:

2a. The fitness function (e.g. accuracy, or F-measure) is evaluated for each individual in the population.


2b. The individuals are selected according to their fitness function value.

(3) Reproduction: generate a new population of configurations from the selected one, through genetic operators (crossover and mutation).

(4) Iteration: repeat Selection and Reproduction until Termination.

(5) Termination: end if the maximum number of iterations has been reached, or the population has converged towards a particular solution.

It is worth mentioning that, due to the nature of GAs, the iterative evolutionary process does not explore the entire search space, and is not guaranteed to converge to the best individual solution.
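The steps above can be condensed into a minimal sketch (in Python; the actual implementation relies on JGAP, so the operators, rates and population size here are illustrative choices). Individuals are boolean vectors, anticipating the EDITS-GA encoding described in the next subsection:

import random

def genetic_search(fitness, n_bits=18, pop_size=20, generations=20,
                   mutation_rate=0.05):
    # (1) Initialization: a random population of boolean configurations.
    pop = [[random.random() < 0.5 for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):                      # (4) Iteration
        # (2) Selection: keep the best-suited half of the population.
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        # (3) Reproduction: single-point crossover plus random mutation.
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)
            child = a[:cut] + b[cut:]
            children.append([bit ^ (random.random() < mutation_rate)
                             for bit in child])
        pop = parents + children
    # (5) Termination: return the best configuration found.
    return max(pop, key=fitness)

# Example with a dummy fitness; in EDITS-GA each evaluation is a
# training/evaluation routine returning accuracy or F-measure.
best = genetic_search(lambda cfg: sum(cfg))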

Experiments

Our experiments were carried out over the datasets used in the six editions of the RTE Challenge ("Main" task data from RTE1 to RTE6). For each dataset we obtained the best model by training EDITS-GA over the development set, and evaluating the resulting model on the test pairs. To this aim, the optimization process is iterated over all the available algorithms in order to select the best combination of parameters. As termination criterion, we set the maximum number of iterations to 20.
In order to extend EDITS with genetic algorithms, we used the GA implementation available in the JGAP tool.18 In our settings, each individual contains a sequence of boolean parameters corresponding to the activation/deactivation of the system's basic components (algorithms, cost schemes, rules, and optimizers). The configurations corresponding to such individuals constitute the populations iteratively evaluated by EDITS-GA on a given dataset.
To increase efficiency, we extended EDITS to pre-process each dataset using the tokenizer and stemmer available in Lucene.19 This pre-processing

18 http://jgap.sourceforge.net/
19 http://lucene.apache.org/

phase is automatically activated when EDITS-GA has to process non-annotated datasets. However, we also annotated the RTE corpora with the Stanford parser plug-in (downloadable from the EDITS website20) in order to run the available syntax-based algorithms (e.g. tree edit distance).
The number of boolean parameters used to generate the configurations is 18. In light of this figure, it becomes evident that the number of possible configurations is too large (2^18 = 262,144) for an exhaustive training/evaluation routine over each dataset.21 However, with an average of 5 reproductions at each iteration, EDITS-GA generates an average of 100 configurations for each algorithm. Thanks to EDITS-GA, the average number of evaluated configurations for a single dataset is reduced to around 400.22

Results

Our results are summarized in Table 2.6, which shows the highest, lowest, and average scores achieved by participants in the RTE challenges. Moreover, the official results obtained by EDITS are compared with the performance achieved with EDITS-GA on the same data.23 We can observe that, for all datasets, the results achieved by EDITS-GA significantly improve on the official EDITS results (by up to 4.51%). It is also worth mentioning that such scores are always higher than the averages obtained by participants. This confirms that EDITS-GA can potentially be used by RTE system developers as a strong term of comparison to assess the capabilities of their own system. Since time is a crucial factor for RTE systems, it is important to remark that EDITS-GA converges on a promising configuration

20 http://edits.sf.net/
21 In an exploratory experiment, we measured at around 4 days the time required to train EDITS, with all possible configurations, over the small datasets (RTE1 to RTE5). All time figures are calculated on an Intel(R) Xeon(R) CPU X3440 @ 2.53GHz, 8 cores, with 8 GB RAM.
22 With these settings, training EDITS-GA over the small datasets (RTE1 to RTE5) takes about 9 minutes each, on the same machine.
23 As regards RTE-3, EDITS was not among the participating systems.


        Best     Lowest   Average   EDITS (rank)   EDITS-GA (rank)   % Impr.   Comp. Time
RTE1    0.586    0.495    0.544     0.559 (8)      0.5787 (3)        +3.52%    8m 24s
RTE2    0.7538   0.5288   0.5977    0.605 (6)      0.6225 (5)        +2.89%    9m 8s
RTE3    0.8      0.4963   0.6237    -              0.6875 (4)        -         9m
RTE4    0.746    0.516    0.5935    0.57 (17)      0.595 (10)        +4.38%    30m 54s
RTE5    0.735    0.5      0.6141    0.6017 (14)    0.6233 (9)        +3.58%    8m 23s
RTE6    0.4801   0.116    0.323     0.4471 (4)     0.4673 (3)        +4.51%    1h 54m 20s

Table 2.6: RTE results (acc. for RTE1-RTE5, F-meas. for RTE6).

quite efficiently.

As can be seen in Table 2.6, the whole process takes around 9 minutes for the smaller datasets (RTE1 to RTE5), and less than 2 hours for a very large dataset (RTE6). This time analysis further proves the effectiveness of the extended EDITS-GA framework.

For the sake of completeness, we studied the differences between the "educated guessing" done by the EDITS developers for the official RTE submissions and the "optimal" configuration automatically selected by EDITS-GA. Surprisingly, in some cases even a minor difference in the selected parameters leads to significant gaps in the results. For instance, on the RTE-6 dataset, the "guessed" configuration [Kouylekov et al., 2010b] was based on the lexical overlap algorithm, setting the cost of replacing H terms without an equivalent in T to the minimal Levenshtein distance between such words and any word in T. EDITS-GA estimated, as a more promising solution, a combination of lexical overlap with a different cost scheme (based on the IDF of the terms in T). In addition, in contrast with the "guessed" configuration, stop-word filtering was selected as an option, eventually leading to a 4.51% improvement over the official RTE6 result.


2.6 Syntactic Semantic Learning for Textual Entailment Recognition

For all the methods discussed in Section 2.4, the effective use of syntactic and semantic information depends on the coverage and the quality of the specific rules. Lexical and syntactic rules can be automatically extracted from plain corpora, but their quality (also in terms of noise) and coverage are low. In contrast, rules written at the semantic level are more accurate, but their automatic design is difficult, and so they are typically hand-coded for specific phenomena.
In this section, we propose models for effectively using syntactic and semantic information in RTE, without requiring either large-scale automatic rule acquisition or hand-coding. These models exploit lexical similarities to generalize lexical-syntactic rules automatically derived by supervised learning methods. In more detail, syntax is encoded in the form of parse trees, whereas similarities are defined by means of WordNet similarity measures or Latent Semantic Analysis (LSA) applied to Wikipedia or to the British National Corpus (BNC). The joint syntactic/semantic model is realized by means of novel tree kernels, which can match subtrees whose leaves are lexically similar or related (not just identical).

2.6.1 Motivating Example

Lexical and syntactic rules are largely used in textual entailment recognition systems (reported in Section 2.3), as they conveniently encode world knowledge into linguistic structures. For example, in:

T_2 ⇒? H_2
T_2: "In 1980 Chapman killed Lennon."
H_2: "John Lennon died in 1980."

to decide whether these simple sentences are in the entailment relation, we need a lexical-syntactic rule such as:

ρ_1 = X killed Y → Y died

Along with such rules, the temporal information should also be taken into consideration. Supervised approaches were experimented with in [Zanzotto & Moschitti, 2006b; Zanzotto et al., 2009], where lexical-syntactic rules were derived from examples in terms of complex relational features. This approach can easily miss some useful information and rules. Given the pair ⟨T_2, H_2⟩, to derive the entailment value of the following case:

T_3 ⇒? H_3
T_3: "In 1963 Lee Harvey Oswald murdered JFK"

H3 “JFK died in 1963 ” we can only rely on this relatively interesting lexical-syntactic rule (i.e. which is in common between the two examples):

ρ_2 = (VP (VBZ) (NP X)) → (S (NP X) (VP (VBZ died)))

Unfortunately, this can be extremely misleading, since it also derives similar decisions for the following example:

T_4 ⇒? H_4
T_4: "In 1956 JFK met Marilyn Monroe"

H4 “Marilyn Monroe died in 1956 ”

The problem is that the pairs ⟨T_2, H_2⟩ and ⟨T_3, H_3⟩ share more meaningful features than rule ρ_2, features which should make the difference with respect to the relation between the pairs ⟨T_2, H_2⟩ and ⟨T_4, H_4⟩. Indeed, the word kill is semantically more related to murder than to meet. Using this information, it is possible to derive more effective rules from training examples.
There are several solutions for taking this information into account; e.g., by using FrameNet semantics (as in [Burchardt et al., 2007]), it is possible to encode a lexical-syntactic rule using the KILLING and the DEATH frames, i.e.:

ρ_3 = KILLING(Killer:X, Victim:Y) → DEATH(Protagonist:Y)

However, to use this model, specific rules and a semantic role labeler for the specific corpora are needed. In the following sections we describe lexical similarity approaches, which can serve the generalization purpose, and we explain how to integrate lexical similarity into syntactic structures using syntactic/semantic tree kernels for RTE.

2.6.2 Lexical similarities

As discussed in Sections 2.3 and 2.4, many lexical similarity measures based on different resources or corpora have been used in RTE. For example, WordNet similarities [Pedersen et al., 2004], or Latent Semantic Analysis over a large corpus, are widely used in many systems and approaches (e.g. [Kouylekov et al., 2010a]). In this section we present the main component of our new kernel, i.e. a lexical similarity derived from different resources. This is used inside the syntactic/semantic tree kernel to enhance the basic tree kernel functions.

WordNet Similarities have been heavily used in previous NLP work. All WordNet similarities apply to pairs of synonymy sets (synsets) and return a value indicating their semantic relatedness. For example, the

following measures, which we use in our study, are based on path lengths between concepts in the WordNet hierarchy:

Path: this measure is equal to the inverse of the shortest path length (path_length) between two synsets c_1 and c_2 in WordNet:

Sim_Path(c_1, c_2) = 1 / path_length(c_1, c_2)    (2.3)

WUP: the Wu and Palmer [Wu & Palmer, 1994] similarity metric is based on the depth of two given synsets c_1 and c_2 in the WordNet taxonomy, and on the depth of their least common subsumer (lcs). These are combined into a similarity score:

Sim_WUP(c_1, c_2) = 2 × depth(lcs) / (depth(c_1) + depth(c_2))    (2.4)

WordNet similarity measures on synsets can be extended to similarity measures between words as follows:

κ_S(w_1, w_2) = max_{(c_1, c_2) ∈ C_1 × C_2} Sim_S(c_1, c_2)    (2.5)

where S is Path or WUP, and C_i is the set of synsets associated with the word w_i.
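For illustration, Equations 2.3-2.5 can be realized with NLTK's WordNet interface as in the sketch below (an assumption of ours for exposition; the experiments in this thesis rely on the wn::similarity package instead):

from nltk.corpus import wordnet as wn  # requires the WordNet data

def word_similarity(w1, w2, measure="path"):
    # Equation 2.5: maximize the synset-level similarity
    # over all pairs in C1 x C2.
    best = 0.0
    for c1 in wn.synsets(w1):
        for c2 in wn.synsets(w2):
            if measure == "path":
                sim = c1.path_similarity(c2)   # Equation 2.3
            else:
                sim = c1.wup_similarity(c2)    # Equation 2.4
            if sim is not None and sim > best:
                best = sim
    return best

print(word_similarity("kill", "murder", measure="wup"))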

Distributional similarity, based on Latent Semantic Analysis (LSA), is one of the corpus-based measures of distributional semantic similarity [Landauer et al., 1998]. In this method, words are represented in a document space as feature vectors (i.e. w⃗_i). Each feature is a document, and its value is the frequency of the word in the document. The similarity is generally computed as a cosine similarity:

κ_LSI(w_1, w_2) = (w⃗_1 · w⃗_2) / (|w⃗_1| |w⃗_2|)    (2.6)

In our approach we define a proximity matrix P where p_{i,j} represents κ_LSI(w_i, w_j). The core of our approach lies in LSI (Latent Semantic Indexing) over a large corpus. We used singular value decomposition (SVD) to build the proximity matrix P = DD^T from a large corpus, represented by its word-by-document matrix D. SVD decomposes D (the weighted matrix of term frequencies in a collection of texts) into three matrices UΣV^T, where U (the matrix of term vectors) and V (the matrix of document vectors) are orthogonal matrices whose columns are the eigenvectors of DD^T and D^T D respectively, and Σ is the diagonal matrix containing the singular values of D.
Given such a decomposition, P can be obtained as U_k Σ_k^2 U_k^T, where U_k is the matrix containing the first k columns of U and k is the dimensionality of the latent semantic space. This is used to efficiently reduce the memory requirements while retaining the information. Finally, we computed the term similarity using the cosine measure in the vector space model.
Generally, LSA can be seen as a way to overcome some of the drawbacks of the standard vector space model, such as sparseness and high dimensionality. Put differently, the LSA similarity is computed in a lower dimensional space, in which second-order relations among words and documents are exploited [Mihalcea et al., 2006]. It is worth mentioning that the LSA similarity measure depends on the selected corpus, but it benefits from a higher computation speed in comparison to the construction of the similarity matrix based on the WordNet Similarity package [Pedersen et al., 2004].
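The construction of the rank-k proximity matrix can be sketched in a few lines of numpy, using a random stand-in for the real word-by-document matrix (the thesis uses the jLSI tool for this step):

import numpy as np

rng = np.random.default_rng(0)
D = rng.random((1000, 200))   # word-by-document matrix (toy stand-in)

k = 50                         # dimensionality of the latent space
U, s, Vt = np.linalg.svd(D, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

# Rank-k proximity matrix P = Uk Σk^2 Uk^T, a low-rank
# approximation of D D^T.
P = (Uk * sk**2) @ Uk.T

def lsa_similarity(i, j):
    # Cosine similarity (Equation 2.6) computed in the latent space.
    return P[i, j] / np.sqrt(P[i, i] * P[j, j])

print(lsa_similarity(0, 1))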

2.6.3 Integrating Semantic in Syntactic Tree Kernels

In Section 2.4 we have shown that the role of syntax in RTE is important but not sufficient by itself. Therefore, the lexical similarity described in the previous section should be taken into account in the model definition.

51 2.6. SYNTACTIC SEMANTIC LEARNING CHAPTER 2. RTE


Figure 2.2: A syntactic parse tree (on the left) along with some of its fragments. After the bar there is an important fragment from a semantically similar sentence, which cannot be matched by STK but is matched by SSTK.

Since tree kernels have been shown to be very effective for exploiting syntactic information in natural language tasks, a promising idea is to merge together the two different approaches, i.e. tree kernels and semantic similarities.

The Syntactic Tree Kernel (STK) computes the number of common substructures between two trees T_1 and T_2 without explicitly considering the whole fragment space. The standard definition of the STK, given in [Collins & Duffy, 2002], allows any set of nodes linked by one or more entire production rules to be a valid substructure. The formal characterization is given in [Collins & Duffy, 2002], so we omit it here.
Figure 2.2 shows some fragments (out of the overall 472) of the syntactic parse tree on the left, which is derived from the text T_4. These fragments satisfy the constraint that grammatical rules cannot be broken. For example, (VP (VBN (murdered)) (NNP (JFK))) is a valid fragment, whereas (VP (VBN (murdered))) is not. One drawback of such a kernel is that two sentences expressing similar semantics but with different lexical items produce structures

which will not be matched. For example, after the vertical bar there is a fragment extracted from the parse tree of a semantically identical sentence: "In 1963 Oswald killed Kennedy". In this case, far fewer matches will be counted by the kernel function applied to such parse trees and the one of T_4. In particular, the VP subtrees will not be matched. To tackle this problem, the Syntactic Semantic Tree Kernel (SSTK) was defined in [Bloehdorn & Moschitti, 2007].

The Syntactic Semantic Tree Kernel (SSTK) produces the same matches as STK. Moreover, fragments which are identical except for their lexical nodes produce a match proportional to the product of the similarities between their corresponding words. Indeed, since the structures are the same, each word in position i of the first fragment is associated with the word in the same position i in the second fragment. More formally, we provide a fast evaluation of the semantic Δ function, which is identical to the one of STK plus the following step:

0. if n_1 and n_2 are pre-terminals and label(n_1) = label(n_2) then
   Δ(n_1, n_2) = λ κ_S(ch¹_{n_1}, ch¹_{n_2})

where label(n_i) is the label of node n_i and κ_S is a term similarity kernel, e.g. based on Wikipedia, WordNet or the BNC, as defined in Section 2.6.2. Note that since n_1 and n_2 are pre-terminals of a parse tree, they can have only one child each (i.e. ch¹_{n_1} and ch¹_{n_2}), and such children are words. For example, the fragments (VP (VBN (murdered)) (NNP (JFK))) and (VP (VBN (killed)) (NNP (Kennedy))) will give the contribution κ_S(murdered, killed) × κ_S(JFK, Kennedy) to SSTK, where κ_S is a lexical similarity.
Besides the novelty of taking into account tree fragments that are not identical, it should be noted that the lexical semantic similarity is constrained within syntactic structures, which limits errors/noise due to incorrect


(or, as in our case, not provided) word sense disambiguation.
Finally, it should be noted that when a valid kernel is used in place of κ_S, SSTK is a valid kernel by the definition of convolution kernels [Haussler, 1999]. Since the matrix P derived by applying LSA is positive semi-definite (see [Cristianini & Holloway, 2001]), we can always use the similarity matrix derived by LSA in SSTK. In the case of WordNet, the validity of the kernel depends on the kind of similarity used. In our experiments, we carried out singular value decomposition and verified that our WordNet matrices, Path and WUP, are indeed positive semi-definite.
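A compact sketch of the resulting Δ computation on a toy tuple-based tree representation (our own simplification, not the SVM-light-TK implementation; leaf matching follows step 0 above) is the following:

LAMBDA = 0.4  # decay factor of the tree kernel

def delta(n1, n2, kappa):
    # A node is (label, children); a pre-terminal has a single child
    # that is a word (a plain string).
    pre = (len(n1[1]) == 1 and isinstance(n1[1][0], str)
           and len(n2[1]) == 1 and isinstance(n2[1][0], str))
    if pre and n1[0] == n2[0]:
        # Step 0: lexical leaves contribute their similarity,
        # not just an identity check.
        return LAMBDA * kappa(n1[1][0], n2[1][0])
    if n1[0] != n2[0] or [c[0] for c in n1[1]] != [c[0] for c in n2[1]]:
        return 0.0  # different production rules: no common fragments
    prod = LAMBDA
    for c1, c2 in zip(n1[1], n2[1]):
        prod *= 1.0 + delta(c1, c2, kappa)
    return prod

vp1 = ("VP", [("VBN", ["murdered"]), ("NNP", ["JFK"])])
vp2 = ("VP", [("VBN", ["killed"]),   ("NNP", ["Kennedy"])])
print(delta(vp1, vp2, lambda a, b: 0.6))  # toy constant lexical kernel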

2.6.4 Semantic Syntactic Tree Kernels for RTE

In this section, we describe how we use the syntactic tree kernel (STK) and the semantic/syntactic tree kernel (SSTK) to model lexical-syntactic kernels for textual entailment recognition. We build on the kernel described in [Zanzotto & Moschitti, 2006b; Zanzotto et al., 2009], which can model lexical-syntactic rules with variables (i.e. first-order rules).

Anchoring and pruning: Kernels for modeling lexical-syntactic rules with variables presuppose that words in the texts T are explicitly related to words in the hypotheses H. This correlation is generally called anchoring, and it is implemented with placeholders that co-index the syntactic trees derived from T and H. Words and intermediate nodes are co-indexed when equal or similar. For example, in the pair:

T_5 ⇒? H_5
T_5: "Lee Harvey Oswald was born in New Orleans, Louisiana, and was of English, German, French and Irish ancestry. In 1963_1 Oswald murdered JFK_2"
H_5: "JFK_2 died in 1963_1"


Moreover, the set of anchors also allows us to prune fragments of the text T that are irrelevant for the final decision: we can discard sentences or phrases uncovered by placeholders. For example, in the pair ⟨T_5, H_5⟩, we can infer that "Lee H... ancestry" is not a relevant fragment and remove it. This allows us to focus on the critical part for determining the entailment value.
Kernels for capturing lexical-syntactic rules: Once placeholders are available in the entailment pairs, we can apply the model. This derives the maximal similarity between pairs of T and H based on the lexico-syntactic information encoded by the syntactic parse trees of T and H enriched with placeholders. More formally, the original kernel is based on the following equation:

maxSTK(⟨T, H⟩, ⟨T′, H′⟩) = max_{c ∈ C} ( STK(t(T, c), t(T′, i)) + STK(t(H, c), t(H′, i)) )    (2.7)

where: (i) C is the set of all bijective mappings between the placeholders (i.e., the possible variables) of ⟨T, H⟩ and ⟨T′, H′⟩; (ii) c ∈ C is a substitution function which implements such a mapping; (iii) t(·, c) returns the syntactic tree enriched with placeholders replaced by means of the substitution c; and (iv) STK(τ_1, τ_2) is a tree kernel function. The new semantic-syntactic kernel for lexical-syntactic rules, maxSSTK, increases the coverage of the matching between the pairs of texts and the pairs of hypotheses:

maxSSTK(⟨T, H⟩, ⟨T′, H′⟩) = max_{c ∈ C} ( SSTK(t(T, c), t(T′, i)) + SSTK(t(H, c), t(H′, i)) )    (2.8)
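Computationally, Equations 2.7 and 2.8 amount to a maximization over all bijective placeholder mappings, as the following sketch shows (Python; relabel is an assumed helper that rewrites the placeholders of a tree according to a mapping, tree_kernel is any STK or SSTK implementation, and the two pairs are assumed to carry the same number of placeholders):

from itertools import permutations

def max_kernel(pair1, pair2, ph1, ph2, tree_kernel, relabel):
    # pair1 = (t1, h1), pair2 = (t2, h2): placeholder-annotated trees.
    (t1, h1), (t2, h2) = pair1, pair2
    best = 0.0
    for perm in permutations(ph2):
        c = dict(zip(ph1, perm))       # one bijective substitution c in C
        score = (tree_kernel(relabel(t1, c), t2)
                 + tree_kernel(relabel(h1, c), h2))
        best = max(best, score)
    return best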


2.6.5 Experiments

The aim of the experiments is to investigate whether our RTE system exploiting syntactic semantic kernels (SSTK) can effectively derive generalized lexico-syntactic rules. In more detail, we first determine the best lexical similarity for the task, i.e. distributional vs. WordNet-based approaches. Second, we derive qualitative and quantitative properties which justify the selection of one with respect to the other. For this purpose, we tested four different versions of SSTK, i.e. using the Path, WUP, BNC and Wiki lexical similarities, on three different RTE datasets. These correspond to the three challenges in which a development set was provided.

Experimental Setup: We used the data from three Recognizing Textual Entailment challenges, RTE-2, RTE-3, and RTE-5, along with the standard split between training and test sets. For this set of experiments, we did not use RTE-1, as it was built differently from the others, or RTE-4, as it does not contain a development set.
We used the following publicly available tools: the Charniak Parser [Charniak, 2000] for parsing sentences, and SVM-light-TK [Moschitti, 2006; Joachims, 1999b], in which we coded our new kernels for RTE. Additionally, we used the Jiang&Conrath (J&C) distance [Jiang & Conrath, 1997], computed with the wn::similarity package [Pedersen et al., 2004], to measure the similarity between T and H. This similarity is also used to define the text-hypothesis word overlap kernel (WOK).
The distributional similarity is captured by means of LSA: we used the java Latent Semantic Indexing (jLSI) tool [Giuliano, 2007]. In particular, we pre-computed the word-pair matrices for RTE-2, RTE-3, and


RTE-5. We built different LSA matrices from the British National Corpus (BNC) and Wikipedia (Wiki). The British National Corpus is a balanced synchronic corpus containing 100 million words with morpho-syntactic annotation. For Wikipedia, we created a model from the 200,000 most visited Wikipedia articles, after cleaning the unnecessary markup tags. Articles are our documents for creating the term-by-document matrix. Wikipedia provides the largest-coverage knowledge resource developed by a community, besides its noticeable coverage of named entities. This further motivates the design of a Wikipedia-based similarity measure. We also consider two typical WordNet similarities (i.e., Path and WUP), as described previously. The main RTE model that we consider is constituted by the following kernels:

• WOK, i.e. the kernel based only on the text-hypothesis lexical overlapping features (this is an intra-pair similarity);

• STK, i.e. the sum of the standard tree kernel applied to the two text parse trees and the two hypothesis parse trees;

• SSTK, i.e. the same as STK with the use of lexical similarities, as explained previously;

• maxSTK and maxSSTK, i.e. the kernels for RTE, where the latter exploits similarity since it uses SSTK in Eq. 2.8.

Note that as our baseline we considered the model presented in [Zanzotto et al., 2009], which corresponds to the combination kernel WOK+maxSTK. In addition to the role of lexical similarities, we also study several combinations (we simply sum the separate kernels), i.e. WOK+STK+maxSTK, SSTK+maxSSTK, WOK+SSTK+maxSSTK and WOK+maxSSTK.


                  No Semantics   Wiki    BNC     Path    WUP
RTE-2   j = 1     63.12          63.5    62.75   62.88   63.88
RTE-2   j = 0.9   63.38          64.75   62.26   63.88   64.25
RTE-3   j = 1     66.88          67.25   67.25   66.88   66.5
RTE-3   j = 0.9   67.25          67.75   67.5    67.12   67.38
RTE-5   j = 1     65.5           66.5    65.83   66      66
RTE-5   j = 0.9   65.5           66.83   65.67   66      66.33

Table 2.7: Accuracies of Plain (WOK+STK+maxSTK) Kernels and Semantic Lexico-Syntactic Rule (WOK+SSTK+maxSSTK) Kernels.

2.6.6 Results

Distributional vs. WordNet-based Semantics:
The first experiment compares the basic kernel, i.e. WOK+STK+maxSTK, with the new semantic kernel, i.e. WOK+SSTK+maxSSTK, where SSTK and maxSSTK encode four different kinds of similarities: BNC, Wiki, WUP and Path. The aim is twofold: understanding whether semantic similarities can be effectively used to derive generalized lexico-syntactic rules, and determining the best similarity model.
Table 2.7 shows the results for No Semantics, Wiki, BNC, Path and WUP. The three pairs of rows represent the results over the three different datasets, i.e., RTE-2, RTE-3, and RTE-5. For each pair, we have two rows representing different values of the j parameter of SVMs.24 An increase of j augments the weight of positive with respect to negative examples, tuning the Recall/Precision rate during learning. We used two values, j = 1 (the default value) and j = 0.9 (selected during a preliminary experiment on a validation set on RTE-2). j = 0.9 was used to minimally increase

24 j is a cost factor by which training errors on positive examples outweigh errors on negative examples (see [Morik et al., 1999]).

the Precision, considering that the semantic model tends to improve the Recall.
The results show that:

• Wiki semantics constantly improves the basic kernel (No Semantics), for any dataset or parameter.

• The distributional semantics is almost always better than the WordNet-based one.

• In one case WUP improves over Wiki, i.e. 63.88 vs 63.5, and in another case BNC reaches Wiki, i.e. 67.25; but this happens for the default value of the j parameter, i.e. j = 1, which was not selected by our limited parameter validation.

Finally, the difference between the accuracies of the best Wiki kernels and the No Semantics kernels is statistically significant (p < 0.05).

Kernel Comparisons:
The previous experiments show that Wikipedia-based distributional semantics provides an effective similarity to generalize lexico-syntactic rules (features). As our RTE kernel is a composition of other basic kernels, we experimented with different combinations to understand the role of each component. Moreover, to obtain results independent of parametrization, we used the default parameter j.
Table 2.8 reports the accuracy of different kernels and their combinations on the different RTE datasets. Each row block describes the results for one dataset, and it is split in two according to whether WOK is used in the RTE model. The columns report the different kernels. For example, the entry for SSTK in the +WOK row of RTE-2 refers to the accuracy of SSTK in combination with WOK, i.e. WOK+SSTK, on RTE-2. From the table we draw the following observations.


                WOK     STK     SSTK    maxSTK   maxSSTK   STK+maxSTK   SSTK+maxSSTK
RTE2   +WOK     60.62   61.5    61.12   63.88    64.12     63.12        63.50
RTE2            -       52.62   52.75   61.25    59.38     61.25        58.75
RTE3   +WOK     66.75   66.38   66.5    66.5     67.0      66.88        67.25
RTE3            -       53.25   54.5    62.25    64.38     63.12        63.62
RTE5   +WOK     60.67   62.0    62.0    64.83    64.83     65.5         66.5
RTE5            -       54.33   57.33   63.33    62.67     61.83        62.67

Table 2.8: Comparing different lexico-syntactic kernels with Wiki-based semantic kernels. Entries report accuracy percentages.

First, WOK produces a very high accuracy, i.e. 60.62, 66.75 and 60.67, and it is an essential component of RTE systems (as it was also observed by Kouylekov et al. [2011]), since its ablation always causes a large accuracy decrease. This is reasonable, as the major source of information to establish entailment between sentences is their word overlap.
Second, STK and SSTK, when added to WOK, improve accuracy on RTE-2 and RTE-5 but not on RTE-3. This suggests the difficulty of exploiting syntactic information for RTE-3.
Third, maxSTK+WOK considerably improves over WOK on RTE-2 and RTE-5 but fails on RTE-3. Again, the syntactic rules (with variables) which this kernel can provide are not general enough for RTE-3. In contrast, maxSSTK+WOK improves over WOK on all datasets, thanks to its generalization ability.
Finally, STK and SSTK added to maxSTK+WOK or to maxSSTK+WOK tend to produce an accuracy increase, although not in every condition.

Coverage and efficiency: As already mentioned, the practical use of Wikipedia to design lexical similarities is motivated by its large coverage. Moreover, deriving similarities from other resources such as WordNet is more time-consuming. To support these claims, we analyzed the coverage of the resources and the efficiency of computing pairwise term similarity.

         BNC    WN     Wiki
RTE-2    0.55   0.42   0.83
RTE-3    0.54   0.41   0.83
RTE-5    0.45   0.34   0.82

Table 2.9: Coverage of the different resources for words of the three datasets.

                   Speed (milliseconds)
LSA                0.54
WN with POS        5.3
WN without POS     15.2

Table 2.10: Comparison in terms of speed, computed over 10,000 pairs after loading the model.

Table 2.9 shows the coverage of the content words of the three datasets. The coverage of Wikipedia is about twice as large as that of the other resources on all the datasets we experimented with. Moreover, Table 2.10 shows that computing the similarity with the LSA matrix derived from Wikipedia is faster than using the WordNet similarity software [Pedersen et al., 2004]. Even if the accuracy of some WordNet models can reach that of the Wikipedia-based one, the latter is preferable for its smaller computational cost.

Comparison with previous works: The results of our models show that using lexical semantics to build more effective lexico-syntactic rules is promising. Here, we compare our approaches with other RTE systems to show that our results are indeed state-of-the-art. Unfortunately, deriving a single accuracy value to represent the state of the art is extremely difficult, as many factors can determine the final score. For example, the best systems in RTE-2 and RTE-3 [Giampiccolo et al., 2007] reported an accuracy 10% higher than other systems, but also used resources that are not publicly available.

        Average Acc.   Our rank   # participants
RTE2    59.8           3rd        23
RTE3    64.5           4th        26
RTE5    61.5           4th        20

Table 2.11: Comparison with other approaches to RTE

Table 2.11 shows the average accuracy, the number of participants, and the rank of the system we propose in this work. Our model's accuracy is well above the average, and it ranks close to the top. With respect to RTE-2 [Bar-Haim et al., 2006], our system performs better than systems using semantic models based on FrameNet; indeed, the best-ranked system in this class scored only 62.5% [Burchardt et al., 2007]. Among systems using logical inference, our model ranks 3rd out of 8 systems, and 2nd among systems using supervised machine learning models.

2.7 Summary

In this chapter, we reviewed the work related to the RTE problem. We started with the notion of textual entailment and introduced the data resources available for this task. We then described different knowledge resources that have been used in the RTE scenario, including lexical databases and textual inference rules. In the same context, we introduced our contribution in providing more knowledge for RTE by using Wikipedia and parallel corpora, and we showed that this lexical and phrase-based knowledge can help in improving performance [Mehdad et al., 2011; Kouylekov et al., 2010a].

Furthermore, we described different approaches to RTE and compared them along different directions. We explained two novel methods to optimize edit distance based systems and algorithms using particle swarm optimization and genetic algorithms [Mehdad, 2009; Kouylekov & Negri, 2010], and reported experiments showing their significant improvements in performance. Finally, we proposed a novel syntactic-semantic tree kernel model for RTE [Mehdad et al., 2010a]. The comparative experiments across different RTE challenges and traditional systems show that our approach consistently achieves high accuracy, without requiring any adaptation or tuning.


Chapter 3

Cross-Lingual Textual Entailment

3.1 Introduction

Textual Entailment (TE) [Dagan & Glickman, 2004] has been proposed as a generic framework for modeling language variability. Given two texts T and H, the task is to decide if the meaning of H can be inferred from the meaning of T. So far, TE has only been applied in a monolingual setting, where both texts are assumed to be written in the same language. In this work, we propose and investigate a cross-lingual extension of TE, where we assume that T and H are written in different languages. The great potential of integrating (monolingual) TE recognition components into NLP architectures has been reported in several works, such as question answering [Harabagiu & Hickl, 2006], information retrieval [Clinchant et al., 2006], information extraction [Romano et al., 2006], and document summarization [Lloret et al., 2008], discussed in Chapter 2. To the best of our knowledge, mainly due to the absence of cross-lingual TE (CLTE) recognition components, similar integrations have not yet been achieved in any cross-lingual application. As a matter of fact, despite the great deal of attention that TE has received in recent years (also witnessed by five editions of the Recognizing Textual Entailment Challenge1),

1http://pascallin.ecs.soton.ac.uk/Challenges/RTE/

interest in cross-lingual extensions has not been mainstream in TE research, whose main focus to date has been the English language.

Nevertheless, the strong interest towards cross-lingual NLP applications (both from the market and research perspectives, as demonstrated by successful evaluation campaigns such as CLEF2) is, in our view, a good reason to start investigating CLTE. Along this direction, research can now benefit from recent advances in other fields, especially machine translation (MT), and from the availability of: i) large amounts of parallel and comparable corpora in many languages, ii) open source software to compute word alignments from parallel corpora, and iii) open source software to set up strong MT baseline systems. We strongly believe that all these resources can potentially help in developing inference mechanisms over multilingual data.

Building on these considerations, this chapter establishes the cross-lingual Textual Entailment task as the main research problem of this thesis, in order to allow for semantic inference across languages in different NLP applications. Aware that MT approaches can play an important role in moving in this direction, we also devote particular attention to exploiting MT techniques in approaching the problem of recognizing textual entailment across languages. We further adopt CLTE to support real-world NLP applications and tasks, such as: i) automatic alignment of text portions that express the same meaning in different languages (Chapter 4), and ii) automatic evaluation of the adequacy of machine translation output without using reference translations (Chapter 5).

2www.clef-campaign.org/


3.2 Cross Lingual Textual Entailment

This section defines CLTE, highlighting some issues and proposing possible approaches to the problem. We also mention the lexical and knowledge resources which are potentially useful in our approach.

3.2.1 Definition

Adapting the definition of TE, we define CLTE as a relation between two natural language text portions in different languages, namely a text T (e.g. in English) and a hypothesis H (e.g. in Spanish), that holds if a human, after reading T, would infer that H is most likely true; otherwise stated, the meaning of H can be entailed (inferred) from T. In other words, in developing the idea of CLTE, we should be able to predict whether an entailment relation holds at the multilingual level over portions of text in different languages. Example 1 shows two portions of text in English and Spanish, where the entailment relation holds.

Example 1.
T: Wolfgang Amadeus Mozart was born in Salzburg, capital of the sovereign Archbishopric of Salzburg, in what is now Austria.
H: Mozart nació en Austria.
Entailment: YES

The task of CLTE is inherently difficult, as it adds the issues of the multilingual dimension to the complexity of semantic inference at the textual level. For instance, the reliance of current monolingual TE systems on lexical resources (e.g. WordNet, VerbOcean, FrameNet) and deep processing components (e.g. syntactic and semantic parsers, co-reference resolution tools, temporal expression recognizers and normalizers) has to contend, at the cross-lingual level, with the limited availability of lexical/semantic resources covering multiple languages, the limited coverage of the existing ones, and the burden of integrating language-specific components into the same cross-lingual architecture.

3.2.2 Approaches

In order to approach the CLTE problem, we see two main orthogonal directions: i) simply bring CLTE back to the monolingual case by translating H into the language of T, or vice-versa; ii) develop and integrate cross-lingual techniques inside the TE recognition process. In the following, we briefly overview and motivate each approach.

Basic Approaches

The ever-growing amount of parallel data, as well as the incremental efforts in MT research, motivates importing currently available MT technology into CLTE, as an initial approach to recognizing textual entailment and semantic inference across languages. The simplest approach is to add an MT component to the front-end of an existing TE engine. In this method, assuming that T is in English and H is in another language (or both are in languages other than English), we only need to translate the hypothesis, or both texts, into English with an MT system, and then approach the problem in a monolingual fashion. For instance, let the Spanish hypothesis H (e.g. in Example 1) be translated into English, and then run the TE engine on T and the translation of H. In this way, regardless of the entailment engine, we only need a translation system. Figure 3.1 shows a sketch of the basic approach framework (left figure).
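The following minimal sketch illustrates this pipeline. The translate and entails functions are hypothetical stand-ins (a toy word-for-word dictionary and a crude word-overlap criterion) for a real MT system and a real TE engine; they are not the components actually used in our experiments.

    def translate(text, src, tgt):
        """Stand-in for an MT front-end; here a toy word-for-word dictionary,
        purely for illustration."""
        toy_dict = {"nació": "was born", "en": "in"}
        return " ".join(toy_dict.get(w, w) for w in text.split())

    def entails(t, h):
        """Stand-in for a monolingual TE engine; here a crude word-overlap
        criterion (the real engine used later is an edit-distance model)."""
        t_words, h_words = set(t.lower().split()), set(h.lower().split())
        return len(h_words & t_words) / len(h_words) > 0.7

    def cl_entails_pivot(t_en, h_es):
        # Basic (pivoting) approach: translate H into the language of T,
        # then run an ordinary monolingual entailment check.
        return entails(t_en, translate(h_es, src="es", tgt="en"))

    print(cl_entails_pivot("Mozart was born in Salzburg in what is now Austria",
                           "Mozart nació en Austria"))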


Figure 3.1: Left: basic approach by adding a MT component to the front-end of an existing TE engine. Right: advanced approach by tighter integration of MT and TE algorithms and techniques.

Advanced Approaches

The purpose of the advanced methods is to move towards a cross-lingual TE approach that takes advantage of a tighter integration of MT and TE algorithms and techniques. This could result in methods for recognizing TE across languages without translating the texts and, in principle, with a lower complexity. When dealing with phrase-based statistical MT, a possible approach is to extract information from translation phrase tables, as a source of knowledge, to enrich the inference and entailment rules which could be used in any entailment system. As an example, the entailment relations between the French phrase “ordinateur portable” and the English phrase “laptop”, or between the German phrase “Europaeischen Union” and the English word “Europe”, could be captured from parallel corpora through statistical phrase-based MT approaches. In other words, focusing on Example 1, applying these methods

we can detect the entailment between:

• “Wolfgang Amadeus Mozart” → “Mozart”
• “Salzburg, in what is now Austria” → “Autriche”

In this way, the entailment could be recognized without translating the hypothesis to English. There are several implications that make this approach interesting. First of all, the acquired rules could also enrich the available multilingual resources and dictionaries, such as MultiWordNet,3 which will be discussed further in the next section. In addition, such approaches can employ inference mechanisms and semantic knowledge sources to augment existing MT methods, leading to improvements in translation quality. Figure 3.1 shows a sketch of the advanced approach framework (right figure).

Cross-lingual Knowledge Resources

Despite the consensus on the usefulness of lexical knowledge for textual inference, determining the actual impact of these resources is not straightforward, as they always represent one factor in complex architectures that use them in different ways. As emerges from the ablation tests reported in Bentivogli et al. [2010b], even the most common resources have a positive impact on some systems and a negative impact on others (as discussed in Chapter 2). Some previous works [Bannard & Callison-Burch, 2005; Zhao et al., 2009; Kouylekov et al., 2010a] indicate, as main limitations of the mentioned resources, their limited coverage, their low precision, and the fact that they are mostly suitable to capture relations between single words. Addressing CLTE, we have to face additional and more problematic issues related to: i) the stronger need for lexical knowledge, and ii) the limited

3http://multiwordnet.fbk.eu/

availability of multilingual lexical resources. As regards the first issue, it is worth noting that in the monolingual scenario simple “bag of words” (or “bag of n-grams”) approaches are per se sufficient to achieve results above the baseline.4 In contrast, their application in the cross-lingual setting is not a viable solution, due to the impossibility of performing direct lexical matches between texts and hypotheses in different languages. This situation makes the availability of multilingual lexical knowledge a necessary condition to bridge the language gap.

However, with the exceptions of WordNet and Wikipedia, most of the aforementioned resources are available only for English. Multilingual lexical databases aligned with the English WordNet (e.g. MultiWordNet [Bentivogli et al., 2002]) have been created for several languages, with different degrees of coverage. As an example, the 57,424 synsets of the Spanish section of MultiWordNet aligned to English cover just around 50% of WordNet's synsets, making the coverage issue even more problematic than for TE. As regards Wikipedia, the cross-lingual links between pages in different languages offer a possibility to extract lexical knowledge useful for CLTE. However, due to their relatively small number (especially for some languages), bilingual lexicons extracted from Wikipedia are still inadequate to provide acceptable coverage. In addition, featuring a bias towards named entities, the information acquired through cross-lingual links can at most complement the lexical knowledge extracted from other resources (e.g. bilingual dictionaries).

4Within the framework of the RTE challenge, a naive baseline of 50% can be obtained by simply labeling all entailments as true (or all as false). We also proposed another baseline measuring similarity as the degree of word overlap between T and H [Mehdad & Magnini, 2009c].


3.3 Basic Solution (Pivoting)

As a first step to approach CLTE, we propose a “basic solution”, which brings CLTE back to the monolingual scenario by translating H into the language of T. Despite the advantages in terms of modularity and portability of the architecture, and the benefit of exploiting monolingual knowledge resources, this approach suffers from one main limitation which motivates the investigation of alternative solutions. Decoupling machine translation and TE, in fact, ties CLTE performance to the availability of MT components, and to the quality of the translations. As a consequence, on one side, translation errors propagate to the TE engine, hampering the entailment decision process. On the other side, such unpredictable errors reduce the possibility to control the behaviour of the engine and to devise ad-hoc solutions to specific entailment problems. The purpose of our experiments with the basic solution is two-fold: first, to verify the feasibility of CLTE, proving that this task can, to some extent, be approached even with a basic solution, in the absence of cross-lingual components; second, to estimate the effect of the noise introduced by automatic translation, as well as to set baseline results to be further improved by the advanced solution.

3.3.1 Experiment 1: Feasibility Study

In order to create a realistic and standard setting, we took advantage of the available RTE data, selecting the RTE-3 development set and manually translating the hypotheses into French. Since manual translation requires trained translators, and due to time and logistics constraints, we obtained 520 translated hypotheses (randomly selected from the entire RTE-3 development set), which formed our bilingual entailment corpus for evaluation.


Our decisions build on several motivations. First of all, the reason for setting English and French as the first language pair for experiments is to rely on higher quality translation models, and on larger amounts of parallel data for future improvements. Second, the reason for translating the hypotheses is that, according to the notion of TE, they are usually shorter, less detailed, and less complex in terms of syntax and concepts than the texts. This makes them easier to translate while preserving the original meaning. Finally, from an application-oriented perspective, working with English Ts seems more promising due to the richness of the English data available (e.g. in terms of language variability, and more detailed elaboration of concepts). This increases the probability of discovering entailment relations with Hs in other languages. In the initial step, following our basic approach, we translated the French hypotheses to English using Google5 and Moses.6 We trained a phrase-based translation model in Moses using the Europarl7 and News Commentary parallel corpora, applying a 6-gram language model trained on the New York Times portion of the English Gigaword corpus.8 More details will be provided in the next sections. As a TE engine, we used the EDITS package (Edit Distance Textual Entailment Suite),9 an open source software package based on edit distance algorithms, which computes the T-H distance as the cost of the edit operations (i.e. insertion, deletion and substitution) that are necessary to transform T into H. Given an edit distance algorithm and a cost scheme (which defines the cost of each edit operation), this package is able to learn a distance model over a set of training pairs, which is then used to decide whether an entailment relation holds over each test pair.10

5http://translate.google.com
6Moses is a statistical machine translation system that allows one to automatically train translation models for any language pair. The package is available at http://www.statmt.org/moses/
7http://www.statmt.org/europarl/
8http://www.ldc.upenn.edu
9http://edits.fbk.eu/


            Orig.   Google   Moses 1st best   Moses 30 best   Moses > 0.4
Accuracy    63.48   63.48    61.37            62.90           62.90

Table 3.1: Feasibility study: comparison of accuracy results over the 520 test pairs of the English-French dataset.

In order to obtain a monolingual TE model, we trained and optimized our model [Mehdad & Magnini, 2009a] on the RTE-3 test set, to reduce the over-fitting bias, since our original data was created over the RTE-3 development set. Moreover, we used a set of lexical entailment rules extracted from Wikipedia and WordNet, as described in Mehdad et al. [2009b]. To begin with, we used this model to classify the created cross-lingual entailment corpus in three different settings: i) hypotheses translated by Google, ii) hypotheses translated by Moses (1st best), and iii) the original RTE-3 monolingual English pairs. Results reported in Table 3.1 show that using Google as a translator, in comparison with the original manually-created data, does not cause any drop in performance. This confirms that merely translating the hypothesis using a good translation model (Google) is a feasible and promising direction for CLTE. Knowing that Google has one of the best French-English translation models, the drop in results using the Moses translator, in contrast with Google, is not unexpected. This result also establishes Google Translate as the strong MT system used in the rest of our experiments in this chapter. Trying to bridge this gap brings us to the next round of experiments, where we extracted the n-best translations produced by Moses, to obtain a richer lexical variability, beneficial for improving TE recognition. The graph in Figure 3.2 shows an incremental improvement when the n-best translated hypotheses are used.

10More details about the models and the system are given in Chapter 2.


Figure 3.2: Accuracy gained by n-best Moses translations.

Besides that, trying to reach a more monotonic distribution of the results, we normalized the ranking scores (from 0 to 1) given by Moses, and in each step we chose the first n results above a normalized score threshold. In this way, keeping the hypotheses with a score above 0.4, we achieved the highest accuracy of 62.9%, which is exactly equal to adopting the 30-best hypotheses translated by Moses. Using this method, we could improve performance up to 1.5% above the 1st-best results, achieving almost the same level of performance obtained with Google. These results also suggest that TE can be used to estimate the quality of translations, which motivates another interesting application of CLTE that will be discussed in Chapter 5. Overall, the feasibility study presented a preliminary investigation of cross-lingual Textual Entailment, proving the viability of moving in this direction.

3.3.2 Experiment 2: Verification

Using the basic solution (pivoting) and a different dataset, in this section we conduct various experiments taking advantage of different knowledge resources, to verify whether, in the cross-lingual scenario, we can achieve results comparable to those obtained in monolingual TE. Moreover, we try to measure the effectiveness of various monolingual knowledge sources for the basic solution.

Dataset

In order to compare our results with the monolingual RTE results, we experiment with the original RTE-3 dataset and the RTE3-derived CLTE dataset. The CLTE dataset used for our experiments is an English-Spanish entailment corpus obtained from the original RTE-3 dataset by translating the English hypotheses into Spanish. It consists of 1600 pairs derived from the RTE-3 development and test sets (800+800). Translations have been crowdsourced, using the CrowdFlower11 channel to Amazon Mechanical Turk12 (MTurk), adopting the methodology elaborated in Chapter 6. The method relies on translation-validation cycles, defined as separate jobs routed to MTurk's workforce. Translation jobs return one Spanish version for each hypothesis. Validation jobs ask multiple workers to check the correctness of each translation using the original English sentence as reference. At each cycle, the translated hypotheses accepted by the majority of trusted validators13 are stored in the CLTE corpus, while wrong translations are sent back to workers in a new translation job. Although the quality of the results is enhanced by the possibility to automatically weed out untrusted workers using gold units, we performed a manual quality check on a subset of the acquired CLTE corpus. The validation, carried out by a Spanish native speaker on 100 randomly selected pairs after two translation-validation cycles, showed the good quality of the collected material, with only 3 minor “errors” consisting of controversial but substantially acceptable translations reflecting regional Spanish variations.

11http://crowdflower.com/
12https://www.mturk.com/mturk/
13Workers' trustworthiness can be automatically determined by means of hidden gold units randomly inserted into jobs.

To conduct our experiments based on the basic solution, we then translated the Spanish hypotheses of the dataset into English using Google Translate. The T-H pairs in both datasets were annotated with token, lemma, and stem information using the TreeTagger [Schmid, 1995] and the Snowball stemmer [Porter, 2001].

Algorithm

In order to maximize the usage of lexical knowledge, our entailment decision criterion is based on similarity scores calculated with a phrase-to-phrase matching process. A phrase in our approach is an n-gram composed of one or more (up to 5) consecutive words, excluding punctuation. Entailment decisions are assigned by combining the phrasal matching scores (Score_n) calculated for each n-gram level (i.e. considering the number of 1-grams, 2-grams, ..., 5-grams extracted from H that match with n-grams in T). Phrasal matches, performed at the level of tokens, lemmas, or stems, can be of two types:

1. Exact: in the case that two phrases are identical at one of the three levels (token, lemma, stem).

2. Lexical: in the case that two different phrases can be mapped through entries of the resources used to bridge T and H (i.e. phrase tables, paraphrase tables, dictionaries or any other source of lexical knowledge).

For each phrase in H, we first search for exact matches at the token level with phrases in T. If no match is found at the token level, the other levels (lemma and stem) are attempted. Then, in case of failure with exact matching, lexical matching is performed at the same three levels. To reduce redundant matches, lexical matches between pairs of phrases which have already been identified as exact matches are not considered.


Input: T and H pair represented at the level of token, lemma and stem
Output: Matching score at each n-gram level n
T = ngrams(T); H = ngrams(H);
foreach n = 1 to 5 do
    Match_n = 0;
    foreach type = exact, lexical do
        foreach h ∈ H(n) do
            foreach form = token, stem, lemma do
                if PhraseMatch(h_form, T, type) then
                    Match_n = Match_n + 1;
                    next h;
                end
            end
        end
    end
    Match_n = Match_n / |H(n)|;
end
Algorithm 1: Phrase matching algorithm
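For illustration, the following Python sketch mirrors Algorithm 1 under simplifying assumptions: the token, lemma and stem sequences of each text are position-aligned, and lexical_match is a hypothetical lookup into the lexical resources (phrase tables, paraphrase tables, dictionaries) described above.

    def ngram_at(seq, i, n):
        """The n-gram of seq starting at position i, as a tuple."""
        return tuple(seq[i:i + n])

    def match_scores(t_forms, h_forms, lexical_match):
        """Sketch of Algorithm 1. t_forms / h_forms map each form ('token',
        'lemma', 'stem') to the word sequence of T / H at that level (the
        three sequences are assumed position-aligned); lexical_match(phrase,
        t_seq) is a hypothetical lookup into lexical resources."""
        scores = {}
        for n in range(1, 6):
            positions = range(len(h_forms["token"]) - n + 1)
            if not positions:
                continue
            matched = 0
            for i in positions:  # one candidate phrase h per position in H
                found = False
                # exact matches first, then lexical; stop at the first hit
                # (the "next h" in Algorithm 1)
                for kind in ("exact", "lexical"):
                    for form in ("token", "stem", "lemma"):
                        h_phrase = ngram_at(h_forms[form], i, n)
                        t_seq = t_forms[form]
                        if kind == "exact":
                            found = any(h_phrase == ngram_at(t_seq, j, n)
                                        for j in range(len(t_seq) - n + 1))
                        else:
                            found = lexical_match(h_phrase, t_seq)
                        if found:
                            break
                    if found:
                        break
                if found:
                    matched += 1
            scores[n] = matched / len(positions)
        return scores

    # toy usage (identical token/lemma/stem sequences, no lexical resource)
    forms = lambda words: {"token": words, "lemma": words, "stem": words}
    t = "Mozart was born in Salzburg".split()
    h = "Mozart was born".split()
    print(match_scores(forms(t), forms(h), lambda p, s: False))
    # -> {1: 1.0, 2: 1.0, 3: 1.0}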

Once the matching phase for each n-gram level has been concluded, the number of matches Match_n and the number of phrases in the hypothesis |H(n)| are used to estimate the proportion of phrases in H that are matched at each level n. The phrasal matching score for each n-gram level is computed as described in Algorithm 1. Since languages can express the same meaning with different numbers of words, a phrase of length n in H (i.e. h in Algorithm 1) can match a phrase of any length in T (i.e. T in Algorithm 1). To combine the phrasal matching scores obtained at each n-gram level, and to optimize their relative weights, we trained a Support Vector Machine classifier, SVMlight [Joachims, 1999a], using each score as a feature (a minimal sketch of this feature combination is given after the following list). Our main motivations for using SVM are summarized as follows.

• SVMs have been successfully exploited in a number of NLP tasks and have achieved state-of-the-art performance compared to other algorithms.

• The generalization capability of SVM does not depend on the dimension of the feature vector.

• Feature combination in SVM is efficient in terms of computational complexity, so adding more features does not increase the computational cost dramatically.

• Increasing the input dimension does not increase the number of parameters to optimize.
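A minimal sketch of this feature-combination step is given below; scikit-learn's SVC is used here in place of SVMlight (an assumption for illustration), and the training pairs are invented.

    from sklearn.svm import SVC

    # One feature per n-gram level (the 1- to 5-gram matching scores);
    # the values below are invented for illustration.
    X_train = [[0.9, 0.7, 0.5, 0.3, 0.1],   # positive (entailing) pair
               [0.4, 0.1, 0.0, 0.0, 0.0]]   # negative pair
    y_train = [1, 0]

    clf = SVC(kernel="linear")  # learns the relative weights of the levels
    clf.fit(X_train, y_train)

    x_test = [[0.8, 0.6, 0.4, 0.2, 0.0]]
    print("YES" if clf.predict(x_test)[0] == 1 else "NO")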

Knowledge sources

In order to compare the results between the monolingual and cross-lingual datasets, we used different monolingual knowledge sources, as explained in Chapter 2, namely:

1. Paraphrase table (PPT): we used a publicly available14 paraphrase database for English. Moreover, in order to experiment with different paraphrase sets providing different degrees of coverage and precision, we pruned the main paraphrase table based on the probabilities associated with its entries, using thresholds of 0.1, 0.2 and 0.3. The number of phrase pairs extracted varies from 6 million down to about 80,000, with an average of 3.2 words per phrase.

2. WordNet (WN): WordNet 3.0 has been used to extract a set of 5,396 pairs of words connected by the hyponymy and synonymy relations.

3. VerbOcean (VO): VerbOcean has been used to extract 18,232 pairs of verbs, in the same way discussed in Chapter 2, Section 2.3.3.

14http://www.cs.cmu.edu/alavie/METEOR


Dataset        WN      VO      WIKI    PPT     PPT 0.1   PPT 0.2   PPT 0.3   AVG
RTE3           61.88   62.00   61.75   62.88   63.38     63.50     63.00     62.37
RTE3-derived   62.62   61.5    60.5    62.88   63.50     62.00     61.5      -

Table 3.2: Accuracy results in the monolingual setting (pivoting) using different lexical resources.

4. Wikipedia (WP): we performed Latent Semantic Analysis (LSA) over Wikipedia using the jLSI tool [Giuliano, 2007] to measure the relatedness between the words in the dataset. Then, we filtered out all the pairs with similarity lower than 0.7, as proposed by Kouylekov et al. [2010a]. In this way we obtained 13,760 word pairs (a rough sketch of this LSA-based filtering is given below).
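As a rough illustration of this step (not the actual jLSI implementation), the sketch below computes LSA-based word relatedness with scikit-learn over a toy corpus; word pairs scoring above a threshold such as 0.7 would then be kept as lexical knowledge.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    def lsa_word_similarity(corpus, dims=2):
        """Word relatedness via LSA (a rough stand-in for the jLSI tool):
        factor the term-document matrix, then compare words by cosine
        similarity in the latent space."""
        vec = CountVectorizer()
        X = vec.fit_transform(corpus)           # documents x terms
        svd = TruncatedSVD(n_components=dims, random_state=0)
        term_vectors = svd.fit_transform(X.T)   # terms x latent dimensions
        vocab = vec.vocabulary_

        def sim(w1, w2):
            a, b = term_vectors[vocab[w1]], term_vectors[vocab[w2]]
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        return sim

    # toy corpus in place of Wikipedia, purely for illustration
    sim = lsa_word_similarity(["the laptop computer", "a portable computer",
                               "mozart composed music", "music by mozart"])
    print(round(sim("laptop", "portable"), 2))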

Results

The comparison between the results achieved on the original monolingual data (RTE-3) and those obtained by automatically translating the Spanish hypotheses (RTE3-derived row in Table 3.2) leads to three main observations.

1. We notice that, dealing with MT-derived inputs, the optimal pruning threshold changes from 0.2 to 0.1, leading to the highest result of 63.50% accuracy. This suggests that the noise introduced by incorrect translations can be partially tackled by increasing the coverage of the paraphrase table.

2. In line with the purpose of our experiments, the results obtained over the MT-derived corpus are equal to those we achieve over the original RTE-3 dataset (i.e. 63.50%). This further proves that, using a suitable algorithm with a high-coverage source of knowledge, we can achieve results comparable to those obtained in monolingual TE.

3. As regards the other resources used for comparison, the results achieved with PPT always outperform the results obtained using VO, WP and WN. This can be explained by the high coverage of PPT, and by the possibility of matching longer phrases in H, preserving more contextual information.

In light of this, we suggest that the lexical knowledge extracted from parallel data (PPT) can be successfully used to approach the CLTE task with the basic solution. To answer the main question of this section: besides measuring the effectiveness of different knowledge sources for the CLTE pivoting approach, we obtained results comparable to the RTE-3 monolingual scenario, and we outperformed the average results obtained by the participants in the RTE-3 campaign.

3.4 Advanced Solution (cross-lingual)

In the last section, as a first step to approach CLTE, we proposed a “basic solution” that brings CLTE back to the monolingual scenario by translating H into the language of T. Despite the advantages in terms of modularity and portability of the architecture, and the promising experimental results, this approach suffers from one main limitation which motivates the investigation of alternative solutions. Decoupling Machine Translation (MT) and TE, in fact, ties CLTE performance to the availability of MT components, and to the quality of the translations. As a consequence, on one side, translation errors propagate to the TE engine, hampering the entailment decision process. On the other side, such unpredictable errors reduce the possibility to control the behaviour of the engine and to devise ad-hoc solutions to specific entailment problems. This section investigates the idea of a tighter integration and joint optimization of MT and TE algorithms and techniques. Our aim is to embed and integrate cross-lingual techniques inside the TE recognition process, in order to avoid any dependency on external MT components, and eventually gain full control of the system's behaviour. Along this direction, we start from the acquisition and use of lexical knowledge, which represents the basic building block of any TE system. Our experiments with different sources of multilingual lexical knowledge aim at addressing the following questions:

1. What is the potential of the existing multilingual lexical resources to approach CLTE? To answer this question, we experiment with lexical knowledge extracted from bilingual dictionaries and from a multilingual lexical database. These experiments show two main limitations of such resources, namely: i) their limited coverage, and ii) the difficulty of capturing contextual information when only associations between single words (or, at most, named entities and multiword expressions) are used to support inference.

2. Does MT provide useful resources or techniques to overcome the limitations of the existing resources? We envisage several directions in which inputs from MT research may enable or improve CLTE. As regards resources, phrase and paraphrase tables extracted from bilingual parallel corpora can be exploited as an effective way to capture both lexical relations between single words and contextual information useful for inference. As regards algorithms, statistical models based on co-occurrence observations, similar to those used in MT to estimate translation probabilities, may contribute to estimating entailment probabilities in CLTE.

3. Can we take advantage of relevant semantic and syntactic information in the cross-lingual scenario? By integrating linguistically motivated syntactic and semantic features, we propose another novel approach that uses a rich set of features to improve over the lexical-based CLTE results.

The remainder of this section addresses the questions above, showing the results of our experiments and demonstrating the effectiveness of our cross-lingual approach to CLTE.

3.4.1 Exploiting Parallel Corpora for CLTE

The limitations of bilingual lexical resources, in terms of coverage and availability, have always been an issue for cross-lingual applications. Bilingual parallel corpora represent a possible solution to overcome the inadequacy of the existing resources, and to implement a portable approach to CLTE. To this aim, we exploit parallel data to: i) learn alignment criteria between phrasal elements in different languages, ii) use them to automatically extract lexical knowledge in the form of phrase tables, and iii) use the obtained phrase tables to create monolingual paraphrase tables (as explained in Chapter 2 and in the previous section).

Given a cross-lingual T/H pair (with the text in l1 and the hypothesis in l2), our approach leverages the vast amount of lexical knowledge provided by phrase and paraphrase tables to map H into T. We perform such mapping with two different methods. The first method uses a single phrase table to directly map phrases extracted from the hypothesis to phrases in the text. In order to improve our system's generalization capabilities and increase the coverage, the second method combines the phrase table with two monolingual paraphrase tables (one in l1, and one in l2). This allows us to:

1. use the paraphrase table in l2 to find paraphrases of the phrases extracted from H;

2. map them to entries in the phrase table, and extract their equivalents in l1;

3. use the paraphrase table in l1 to find paraphrases of the extracted fragments in l1;

4. map such paraphrases to phrases in T.

Figure 3.3: Using a phrase table for CLTE.

With the second method, phrasal matches between the text and the hypothesis are performed indirectly, through paraphrases of the phrase table entries. Figures 3.3 and 3.4 give a sketch of the two methods of using phrase and paraphrase tables. The final entailment decision for a T/H pair is assigned considering a model learned from the similarity scores based on the identified phrasal matches. In particular, “YES” and “NO” judgements are assigned considering the proportion of words in the hypothesis that are also found in the text. This way of approximating entailment reflects the intuition that, as a directional relation between the text and the hypothesis, the full content of H has to be found in T.


Figure 3.4: Combining phrase and paraphrase tables for CLTE.

Extracting English-Spanish Phrase and Paraphrase Tables

Phrase tables (PT) contain pairs of corresponding phrases in two languages, together with association probabilities. They are widely used in statistical machine translation as a way to figure out how to translate input in one language into output in another language [Koehn et al., 2003]. There are several methods to build phrase tables. The one adopted in this work consists of learning phrase alignments from a word-aligned bilingual corpus. In order to build English-Spanish phrase tables for our experiments, we used the freely available Europarl V.4, News Commentary and United Nations Spanish-English parallel corpora released for the WMT10 Shared Translation Task.15 We ran the TreeTagger for tokenization, and used the Giza++ toolkit [Och & Ney, 2000] to align the tokenized corpora at the word level. Subsequently, we extracted the bilingual phrase table

15http://www.statmt.org/wmt10/

from the aligned corpora using the Moses toolkit [Koehn et al., 2007]. Since the resulting phrase table was very large, we pruned all the entries with identical content in the two languages, and those containing phrases longer than 5 words on either side. In addition, in order to experiment with different phrase tables providing different degrees of coverage and precision, we extracted 7 phrase tables from the pruned one, based on direct phrase translation probability thresholds of 0.01, 0.05, 0.1, 0.2, 0.3, 0.4 and 0.5. The resulting phrase tables range from 76 million down to 48 million entries, with an average of 3.9 words per phrase.
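For illustration, the pruning step could be implemented as in the sketch below. It assumes the standard Moses phrase table layout ("source ||| target ||| scores ..."), with the direct phrase translation probability commonly being the third score; file paths and thresholds are illustrative, not the exact setup used here.

    def prune_phrase_table(in_path, out_path, min_prob=0.05, max_len=5):
        """Prune a Moses-style phrase table: drop entries whose direct phrase
        translation probability falls below min_prob, whose two sides are
        identical, or which contain a phrase longer than max_len words."""
        with open(in_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                fields = line.rstrip("\n").split(" ||| ")
                src, tgt, scores = fields[0], fields[1], fields[2].split()
                # direct phrase translation probability (commonly the 3rd score
                # in Moses phrase tables; adjust if your table differs)
                direct_prob = float(scores[2])
                if src == tgt:
                    continue
                if max(len(src.split()), len(tgt.split())) > max_len:
                    continue
                if direct_prob >= min_prob:
                    fout.write(line)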

Paraphrase tables (PPT) contain pairs of corresponding phrases in the same language, possibly associated with probabilities. They have proved useful in a number of NLP applications, such as natural language generation [Iordanskaja et al., 1991], multidocument summarization [McKeown et al., 2002], automatic evaluation of machine translation [Denkowski & Lavie, 2010], and textual entailment [Dinu & Wang, 2009]. One of the proposed methods to extract paraphrases relies on a pivot-based approach using phrase alignments in a bilingual parallel corpus [Bannard & Callison-Burch, 2005]. With this method, all the different phrases in one language that are aligned with the same phrase in the other language are extracted as paraphrases. After the extraction, pruning techniques [Snover et al., 2009] can be applied to increase the precision of the extracted paraphrases. In our work we used available paraphrase databases for English and Spanish16 which have been extracted using the method just outlined. We used the same method discussed in Section 3.3.2 to extract different sets of paraphrases.

16http://www.cs.cmu.edu/alavie/METEOR
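The pivot-based extraction can be sketched as follows: every pair of same-language phrases aligned to a common foreign phrase becomes a paraphrase candidate, with probability marginalized over the pivot phrases. The tiny phrase table below is invented, and the reverse probabilities are roughly approximated by renormalization, so this is a sketch of the idea rather than the exact procedure of Bannard & Callison-Burch.

    from collections import defaultdict
    from itertools import permutations

    def pivot_paraphrases(phrase_table):
        """phrase_table: dict mapping foreign phrase f -> {english phrase e: p(e|f)}.
        Returns paraphrase probabilities p(e2|e1) = sum over f of p(e2|f) * p(f|e1),
        where p(f|e1) is approximated by renormalizing p(e1|f) over the pivots."""
        by_english = defaultdict(dict)
        for f, translations in phrase_table.items():
            for e, p in translations.items():
                by_english[e][f] = p
        for e, pivots in by_english.items():
            z = sum(pivots.values())
            for f in pivots:
                pivots[f] /= z  # rough p(f|e) by renormalization

        paraphrases = defaultdict(float)
        for f, translations in phrase_table.items():
            for e1, e2 in permutations(translations, 2):
                paraphrases[(e1, e2)] += by_english[e1][f] * translations[e2]
        return paraphrases

    # toy example: "ordinateur portable" pivots "laptop" and "notebook" together
    table = {"ordinateur portable": {"laptop": 0.7, "notebook": 0.3}}
    print(pivot_paraphrases(table)[("laptop", "notebook")])  # -> 0.3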


Experiments

The dataset used for our experiments is the RTE3-derived English-Spanish entailment corpus used in our previous experiments. The T-H pairs in the collected English-Spanish entailment corpus were annotated with token, lemma, and stem information using the TreeTagger and the Snowball stemmer.17 We use the PT and PPT as lexical knowledge to calculate a matching score (see Algorithm 1), as the number of n-grams in H that match with phrases in T, divided by the number of n-grams in H. Using each score as a feature, we used SVMlight [Joachims, 1999a] to combine and weight the features at the different n-gram levels. For comparison with the extracted phrase and paraphrase tables, we use a large bilingual dictionary and MultiWordNet as alternative sources of lexical knowledge.

1. Bilingual dictionaries (DIC) allow for precise mappings between words in H and T. To create a large bilingual English-Spanish dictionary, we processed and combined the following dictionaries and bilingual resources:

• Universal dictionary database18: 9,944 entries.
• Wiktionary database19: 5,866 entries.
• Omegawiki database20: 8,237 entries.
• Wikipedia interlanguage links21: 7,425 entries.

The resulting dictionary features 53,958 unique entries, with an average length of 1.2 words.

17http://snowball.tartarus.org/
18http://www.dicts.info/
19http://en.wiktionary.org/
20http://www.omegawiki.org/
21http://www.wikipedia.org/


Figure 3.5: Accuracy on CLTE using phrase tables with different pruning thresholds.

2. MultiWordNet (MWN) allows extracting mappings between English and Spanish words connected by entailment-preserving semantic relations. The extraction process is dataset-dependent, as it checks for synonymy and hyponymy relations only between terms found in the dataset. The resulting collection of cross-lingual word associations contains 36,794 pairs of lemmas.

Results

This section reports the percentage of correct entailment assignments (accuracy), contrasting the different sources of lexical knowledge. Initially, in order to find a reasonable trade-off between precision and coverage, we used the 7 phrase tables extracted with different pruning thresholds. Figure 3.5 shows that with the pruning threshold set to 0.05, we obtain the highest result, 62.62%, on the test set. The curve demonstrates that, although higher pruning thresholds retain more precise phrase pairs, their smaller number provides limited coverage, leading to lower results. In contrast, the large coverage obtained with the


MWN   DIC   PT    PPT        Acc.    δ
 x                           55.00   0.00
       x                     59.88   +4.88
             x               62.62   +7.62
             x     x         62.88   +7.88

Table 3.3: Accuracy results on CLTE using different lexical resources.

pruning threshold set to 0.01 leads to a slight performance decrease, due to less precise phrase pairs. Once the threshold was set, in order to prove the effectiveness of the information extracted from bilingual corpora, we conducted a series of experiments using the different resources. As can be observed in Table 3.3, the highest results are achieved using the phrase table, both alone and in combination with paraphrase tables (62.62% and 62.88% respectively). These results suggest that, with appropriate pruning thresholds, the large number of entries and the longer entries contained in the phrase and paraphrase tables represent an effective way to:

1. Obtain high coverage.

2. Capture cross-lingual associations between multiple lexical elements.

This allows us to overcome the bias towards single words featured by dictionaries and lexical databases. As regards the other resources used for comparison, the results show that dictionaries substantially outperform MWN. This can be explained by the low coverage of MWN, whose entries also represent weaker semantic relations (preserving entailment, but with a lower probability of being applicable) than the direct translations between terms contained in the dictionary. Overall, our results suggest that the lexical knowledge extracted from parallel data can be successfully used to approach the CLTE task.


3.4.2 Beyond Lexical Features

The advanced solution to the CLTE problem described above is based on the assumption that parallel data represent an ideal source of lexical knowledge to cross the language barrier between texts and hypotheses. Building on this assumption, CLTE has been modeled as a phrase matching problem that takes advantage of dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences (at the level of tokens, lemmas, or stems) in the hypothesis that can be mapped to word sequences in the text. According to this solution, a semantic judgement about entailment is made exclusively on the basis of lexical evidence. Although quite effective on cross-lingual datasets derived from the RTE-like setting, such an approximation falls short of providing a reliable method for more complex scenarios, like the one addressed here. On the one side, in the traditional RTE-derived datasets only unidirectional entailment relations from T to H have to be determined, and the full mapping of the hypothesis into the text usually provides enough evidence for a positive entailment judgement. On the other side, textual entailment by nature deals with multi-directional entailment checking, where the correlation between the proportion of matching terms and the correct entailment decisions is less strong. In such a framework, for instance, the full mapping of the hypothesis into the text is per se not sufficient to discriminate between forward entailment and semantic equivalence. To cope with these issues, we explore the potential contribution of syntactic and semantic features as a complement to lexical ones in a supervised learning framework. In order to enrich the feature space beyond pure lexical match through phrase table entries, our model builds on two additional feature sets, respectively derived from: i) dependency relations, and ii) semantic phrase tables.


T: Mozart was born in Salzburg
1. born/VERB — Subj — Mozart/NOUN
2. born/VERB — Spec — was/VERB
3. born/VERB — Prep — in/PRP
4. in/PRP — Pobj — Salzburg/NOUN

H: Mozart nació en 1756.
1. nació/VERB — Subj — Mozart/NOUN
2. nació/VERB — Prep — en/PRP
3. en/PRP — Pobj — 1756/CARD

DR matching (DR_match): Subj = 1/1, Prep = 1/1, Pobj = 0/1
1. born/VERB — Subj — Mozart/NOUN # nació/VERB — Subj — Mozart/NOUN
2. born/VERB — Prep — in/PRP # nació/VERB — Prep — en/PRP

Table 3.4: Dependency Relation (DR) matching between an English text and a Spanish hypothesis.

Dependency Relation Matching

Dependency Relation (DR) matching aims to increase CLTE precision. By adding syntactic constraints to the matching process, DR features reduce the wrong matches that often occur at the lexical level. For instance, the contradiction between “Yahoo acquired Overture” and “Overture compró Yahoo” is evident when syntax (in this case subject-object inversion) is taken into account, but cannot be caught by bag-of-words methods. We define a dependency relation as a triple that connects a pair of words through a grammatical relation. For example, “nsubj (loves, John)” is a dependency relation with head loves and dependent John connected by the relation nsubj, which means that “John” is the subject of “loves”. DR matching captures similarities between dependency relations by combining the syntactic and lexical levels. In a valid match, while the relation has to be the same (“exact” match), the connected words must be either identical or semantically equivalent in the two languages. For example, “nsubj (loves, John)” can match “nsubj (ama, John)” and “nsubj (quiere, John)” but not “dobj (quiere, John)”. As Algorithm 2 shows, given the dependency tree representations of T and H, for each grammatical relation r we calculate a DR matching score (Match_r, see Equation 3.1) as the number of matching occurrences of r in T and H (respectively DR_r(T) and DR_r(H)), divided by the number of occurrences of r in H.


Input: T and H pair represented at the level of syntactic dependency relations
Output: Matching score for each relation r
R = common relations between English and Spanish;
foreach r in R do
    Match_r = 0;
    foreach DR_r(H) do
        foreach DR_r(T) do
            if LexMatch(Word1(H), Word1(T)) & LexMatch(Word2(H), Word2(T)) then
                Match_r = Match_r + 1;
            end
        end
    end
    Match_r = Match_r / |DR_r(H)|;
end
Algorithm 2: Dependency relation matching algorithm

Table 3.4 shows an example of DR matching between two text portions in English and Spanish.

Match_r = |match(DR_r(T), DR_r(H))| / |DR_r(H)|     (3.1)

In our learning framework, the DR_match_r values are first calculated for each relation r appearing both in T and H. Then, each value is used as a separate feature, giving the classifier the possibility to learn optimal feature weights from training data. Overall, this approach resembles the way syntactic information has been used in monolingual textual entailment recognition [Androutsopoulos & Malakasiotis, 2010]. The differences in the proposed adaptation to the cross-lingual scenario concern the use of different dependency parsers for the languages of T and H, and the need to map (manually, in our case) the sets of dependency relation labels they output.


T: Wolfgang Amadeus Mozart was born in Salzburg
1-gr.: PER, was, born, in, LOC
2-gr.: PER was, was born, born in, in LOC
3-gr.: PER was born, was born in, born in LOC
4-gr.: PER was born in, was born in LOC
5-gr.: PER was born in LOC

H: Mozart nació en 1756
1-gr.: PER, nació, en, DATE
2-gr.: PER nació, nació en, en DATE
3-gr.: PER nació en, nació en DATE
4-gr.: PER nació en DATE
5-gr.: -

SPT matching: 1-gr. = 3/4, 2-gr. = 2/3, 3-gr. = 1/2
1-gr.: PER#PER, nació#born, en#in
2-gr.: PER nació#PER was born, nació en#born in
3-gr.: PER nació en#PER was born in
4-gr.: -
5-gr.: -

Table 3.5: Semantic Phrase Table (SPT) matching between an English text and a Spanish hypothesis.

In our experiments, in order to extract the dependency relation (DR) matching features, the dependency tree representations of the English texts and Spanish hypotheses have been produced with DepPattern [Gamallo Otero & Gonzalez Lopez, 2011]. We then mapped the sets of dependency relation labels of the English and Spanish parser outputs into: Adjunct, Determiner, Object, Subject and Preposition. The dictionary, containing about 9M bilingual word pairs, created during the alignment of the English-Spanish parallel corpora, provided the lexical knowledge to perform matches when the connected words are different.
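The following Python sketch mirrors Algorithm 2 under the assumptions stated above: dependency trees are given as (head, relation, dependent) triples with already-normalized relation labels, and lex_match is a hypothetical stand-in for the bilingual-dictionary lookup.

    from collections import defaultdict

    def dr_match_scores(t_rels, h_rels, lex_match):
        """Dependency Relation matching (sketch of Algorithm 2).

        t_rels / h_rels: lists of (head, relation, dependent) triples;
        lex_match(w_h, w_t): hypothetical check that two words are identical
        or translations of each other (e.g. via a bilingual dictionary).
        Returns one score per relation label occurring in H."""
        by_rel_t = defaultdict(list)
        for head, rel, dep in t_rels:
            by_rel_t[rel].append((head, dep))

        by_rel_h = defaultdict(list)
        for head, rel, dep in h_rels:
            by_rel_h[rel].append((head, dep))

        scores = {}
        for rel, h_pairs in by_rel_h.items():
            matches = sum(
                any(lex_match(h_head, t_head) and lex_match(h_dep, t_dep)
                    for t_head, t_dep in by_rel_t[rel])
                for h_head, h_dep in h_pairs)
            scores[rel] = matches / len(h_pairs)
        return scores

    # toy run on the example of Table 3.4 (a tiny dictionary replaces the real one)
    t = [("born", "Subj", "Mozart"), ("born", "Prep", "in"), ("in", "Pobj", "Salzburg")]
    h = [("nacio", "Subj", "Mozart"), ("nacio", "Prep", "en"), ("en", "Pobj", "1756")]
    same = {("nacio", "born"), ("en", "in")}
    print(dr_match_scores(t, h, lambda a, b: a == b or (a, b) in same))
    # -> {'Subj': 1.0, 'Prep': 1.0, 'Pobj': 0.0}, matching Table 3.4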

Semantic Phrase Table Matching

Semantic Phrase Table (SPT) matching represents a novel way to leverage the integration of semantics and MT-derived techniques. To this aim, SPT improves CLTE methods relying on pure lexical match by means of “generalized” phrase tables annotated with shallow semantic labels.

Semantically enhanced phrase tables, with entries in the form “[LABEL] word1...wordn [LABEL]” (e.g. “[ORG] acquired [ORG]”), are used as a recall-oriented complement to the lexical phrase tables used in machine translation (token-based entries like “Yahoo acquired Overture”). The main motivation for this augmentation is that replacing words with semantic tags makes it possible to match T-H tokens that do not occur in the original bilingual parallel corpora used for phrase table extraction. Our hypothesis is that the increase in recall obtained from relaxed matches through semantic tags in place of “out of vocabulary” terms (e.g. unseen person, location, or organization names) is an effective way to improve CLTE performance, even at the cost of some loss in precision. Semantic phrase tables, however, have two additional advantages. The first is related to their smaller size and, in turn, their positive impact on the system's efficiency, due to the considerable reduction of the search space. Semantic tags allow merging different sequences of tokens into a single tag; consequently, different phrase entries can be unified into one semantic phrase entry. As a result, for instance, the SPT used in our experiments is more than 30% smaller than the original token-based one. The second advantage relates to their potential impact on the confidence of CLTE judgements. Since a semantic tag might cover more than one token of the original entry phrase (e.g. “Wolfgang Amadeus Mozart” in Table 3.5, which is covered by the single label “[PER]”), SPT entries are often short generalizations of longer original phrases. Consequently, the matching process can benefit from the increased probability of mapping higher-order n-grams (i.e. those providing more contextual information) from H into T and vice-versa.

Like lexical phrase tables, SPTs are extracted from parallel corpora. As a first step, we annotate the corpora with named-entity taggers for the source and target languages, replacing named entities with general semantic labels chosen from a coarse-grained taxonomy including the categories:

person, location, organization, date and numeric expression. Then, we collapse each sequence of identical labels into a single token of the same label, and we run Giza++ [Och & Ney, 2000] to align the resulting semantically augmented corpora. Finally, we extract the semantic phrase table from the augmented aligned corpora using the Moses toolkit [Koehn et al., 2007]. For the matching phase, we first annotate T and H in the same way we labeled our parallel corpora. Then, for each n-gram order (n=1 to 5, excluding punctuation), we use the SPT to calculate a matching score

(SPT_match_n, see Equation 3.2) as the number of n-grams in H that match with phrases in T, divided by the number of n-grams in H. The matching algorithm is the same as Algorithm 1.

SPT_match_n = |SPT_n(H) ∩ SPT(T)| / |SPT_n(H)|     (3.2)

Table 3.5 illustrates SPT matching between two text portions in English and Spanish.

In our learning framework, the computed SPT_match_n scores are used as separate features, giving the classifier the possibility to learn optimal feature weights from training data. For our experiments, we extracted the semantic phrase table from the augmented corpora in the same way described above, exploiting the same parallel corpora used in the phrase table extraction phase. The extracted SPT contains about 135M phrase pair entries, which is about 30% smaller than the lexical PT.
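The core of the SPT idea can be sketched as follows: named entities are replaced with coarse semantic tags before n-gram matching, so unseen names still match through their tags. The rule-based semantify function below stands in for a real named-entity tagger, and spt_lookup for the lookup into the semantic phrase table; both are hypothetical.

    def semantify(tokens, entity_map):
        """Replace named entities with semantic tags and collapse runs of
        identical tags into a single token (entity_map is a toy stand-in for
        a real named-entity tagger)."""
        tagged = [entity_map.get(tok, tok) for tok in tokens]
        collapsed = []
        for tok in tagged:
            if not (collapsed and tok.startswith("[") and tok == collapsed[-1]):
                collapsed.append(tok)
        return collapsed

    def spt_match_scores(t_tokens, h_tokens, entity_map, spt_lookup):
        """SPT matching scores per n-gram level (Equation 3.2): the fraction
        of semantically tagged n-grams of H mapped to T, here via a
        hypothetical spt_lookup into a semantic phrase table."""
        t_sem = semantify(t_tokens, entity_map)
        h_sem = semantify(h_tokens, entity_map)
        scores = {}
        for n in range(1, 6):
            h_ngrams = [tuple(h_sem[i:i + n]) for i in range(len(h_sem) - n + 1)]
            if not h_ngrams:
                continue
            matched = sum(1 for g in h_ngrams if spt_lookup(g, t_sem))
            scores[n] = matched / len(h_ngrams)
        return scores

    # toy example in the spirit of Table 3.5
    entities = {"Wolfgang": "[PER]", "Amadeus": "[PER]", "Mozart": "[PER]",
                "Salzburg": "[LOC]", "1756": "[DATE]"}
    print(semantify("Wolfgang Amadeus Mozart was born in Salzburg".split(), entities))
    # -> ['[PER]', 'was', 'born', 'in', '[LOC]']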

Experiments and Results

Accuracy results have been calculated over the 800 test pairs of the RTE3-derived CLTE corpus, after training the SVM binary classifier over the 800 development pairs, using different feature sets. We compared our new


Dataset        RTE-3 AVG   Pivot PPT   PT      PT+DR   PT+SPT   PT+SPT+DR
RTE3-derived   62.37%      63.5%       62.6%   63.6%   63.5%    64.5%

Table 3.6: CLTE accuracy results over the RTE3-derived dataset.

features with: i) the previous CLTE lexical model (PT), ii) the best monolingual model (Pivot PPT) presented in the last section, and iii) the average result achieved by participants in the monolingual English RTE-3 evaluation campaign (RTE-3 AVG).

As shown in Table 3.6, also in this case the best results are achieved using all features (64.5%), while the SPT and DR features separately added to PT (PT+SPT and PT+DR) lead to marginal improvements (about 1%) over the results achieved by the lexical PT. This confirms that precision-oriented and recall-oriented features lead to a larger improvement when they are used in combination.

Although extracting and integrating multilingual features in a CLTE learning framework is not always straightforward, the results prove the effectiveness of our combination of lexical evidence with deeper linguistic knowledge. It is worth noting that by using such features, we also outperform the RTE-3 average score (62.37%) and the best results achieved by exploiting paraphrase tables over the automatic translation of the same dataset into English (63.5%). This further proves the robustness of our proposed cross-lingual feature set in overcoming the noise introduced by the MT component.

In the next chapters, we take advantage of the proposed feature sets in dealing with two interesting CLTE applications. We show that such features can contribute significantly, not only in the theoretical CLTE framework, but also in cross-lingual application scenarios.


3.5 Summary

This chapter presented our investigations of cross-lingual Textual Entailment, focusing on possible research directions and alternative methodologies. Feasibility study results have been provided to demonstrate the potential of a simple approach that integrates MT and monolingual TE components. As an advanced solution, we approached the cross-lingual Textual Entailment task focusing on the role of lexical knowledge extracted from bilingual parallel corpora. Our approach builds on the intuition that the vast amount of knowledge that can be extracted from parallel data (in the form of phrase and paraphrase tables) offers a possible solution to the problem. To check the validity of our assumptions, we carried out several experiments on an English-Spanish corpus derived from the RTE-3 dataset, using phrasal matches as a criterion to approximate entailment. Our results show that phrase and paraphrase tables allow us to:

1. Outperform the results achieved with the multilingual lexical resources available.

2. Outperform the average scores obtained by participants in the mono- lingual RTE-3 challenge.

These improvements can be explained by the fact that the lexical knowledge extracted from parallel data provides good coverage both at the level of single words and at the level of phrases. We also demonstrated the effectiveness of paraphrase tables as a means to overcome the bias towards single words featured by the existing resources. Finally, we extended the lexical-based CLTE methods with a variety of bilingual syntactic and semantic features, achieving considerable improvements.

Overall, our work sets a novel framework for further studies and experiments to improve cross-lingual NLP tasks. In particular, CLTE can be scaled to more complex problems, such as cross-lingual content merging and synchronization, and at the same time contribute to a variety of MT-related tasks, ranging from re-scoring MT outputs to adequacy evaluation.

Chapter 4

Application 1: Entailment-based Multilingual Content Synchronization

4.1 Introduction

The explosion of multilingual user-generated content in websites like Wikipedia provides users with the opportunity to access information about a given topic in their own language. However, to take full advantage of this opportunity, it would be important to present the user with the same content, independently of the language version of the article. Currently, to address this issue, multilingual Wikis rely on contributors to manually translate different pages on the same subject. When contributors update the different language versions independently, translators should separately compare and synchronize each update. This is a demanding task which involves a lot of effort, and may create many content dissimilarities and deviations. These problems, which cannot be tackled by asking contributors to adhere to restrictive content creation guidelines, represent an interesting direction for research on automated solutions. Given two documents about the same topic written in different languages


(e.g. Wikipedia articles), we define the content synchronization task as the problem of automatically detecting and resolving differences in the information they provide, in order to produce aligned, mutually enriched versions. A roadmap towards the solution of this problem has to take into account a number of challenging subtasks, including:

1. The detection of topically-related portions of the input documents.

2. The identification of information in one page that is novel/more-informative with respect to the content of the other page.

3. The management of contradictions.

4. The translation of novel/more-informative content that has to migrate across documents.

5. The detection of appropriate entry points for integrating the translated material.

6. The generation of readable outputs.

This chapter focuses on the core subtask 2, setting it as an application-oriented, cross-lingual variant of the Textual Entailment (TE) recognition task [Dagan & Glickman, 2004]. Along this direction, we define and conduct experiments with cross-lingual textual entailment in a real application scenario. So far, cross-lingual textual entailment (CLTE) has only been applied to available (monolingual English) TE datasets (in the previous chapter), transformed into their cross-lingual counterpart by translating the hypotheses into other languages (e.g. from English into Spanish). In the previous chapter, no experiments had been conducted on datasets reflecting a different notion of entailment, or in an application-oriented framework. In this framework, our experiments are carried out over the only dataset acquired

to represent the multilingual content synchronization scenario (discussed in detail in Chapter 6), which presents a richer inventory of phenomena [Negri et al., 2012].

4.2 CLTE-based Content Synchronization

Currently, multilingual Wikis rely on users to manually translate different Wiki pages on the same subject. This is not only a time-consuming procedure but also the source of many inconsistencies, as users update the different language versions separately, and every update would require translators to compare the different language versions and synchronize the updates. The goal of an automatic content synchronization system is to identify content discrepancies across different language versions of Wiki pages, and merge them to produce synchronized versions.

The content synchronization system integrates the Structural Analysis (SA), Machine Translation (MT) and Cross-Lingual Textual Entailment (CLTE) technologies in a three-step process (sketched in code after the list) where:

1. SA analyzes the structure of the input Wiki pages, automatically identifying segments that represent semantically coherent portions (paragraphs, sentences or chunks).

2. CLTE identifies text portions that should “migrate” from one page to the other.

3. MT translates these portions into the appropriate target language.
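The toy sketch below puts the three steps together. The components segment(), closest(), clte_classify() and translate() are hypothetical stand-ins, passed in as callables; none of these names come from an actual toolkit:

    def synchronize(page_a, page_b, src_lang, tgt_lang,
                    segment, closest, clte_classify, translate):
        """Return the segments of page_a that must be translated and
        migrated into page_b (the symmetric direction works the same way)."""
        segs_a = segment(page_a)   # 1. Structural Analysis
        segs_b = segment(page_b)
        to_migrate = []
        for seg_a in segs_a:
            # 2. CLTE: entailment check against the topically closest segment
            relation = clte_classify(seg_a, closest(seg_a, segs_b))
            if relation in ("forward", "unknown"):  # novel or more informative
                # 3. MT: translate only the content that has to migrate
                to_migrate.append(translate(seg_a, src_lang, tgt_lang))
        return to_migrate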

Figure 4.1 shows a schematic representation of such a system. The entailment-based content merging component is in charge of annotating the input pages in terms of: i) overlapping information that does not need to be translated for synchronization, and ii) information that has to be translated and has to migrate from one page to the other (i.e. more specific information, or factual information that is present only in one page). The output of this component will allow the MT component to focus on translating content that is novel with respect to the Wiki page into which translated content is to be inserted.


Figure 4.1: A framework for automatic content synchronization of multilingual Wiki content.

In terms of entailment checking, Figure 4.1 depicts all the possible relations between two topically related text fragments (A and B). The first two cases (marked as A ← B and A → B) respectively indicate situations where a text portion is entailed by (i.e. is more general than) or entails (i.e. is more specific than) the other. In this case, the fragment providing more specific information will be translated and inserted in the other page in the appropriate place. The third case (A ↔ B) indicates semantically equivalent text portions, which will be left untouched. The fourth case (A ? B) indicates situations where the two text fragments neither entail nor contradict each other. Handled as fragments providing novel information, both of them will be translated and inserted in the other page in the appropriate place. The fifth case (A ! B) represents the situation where the two topically related texts contradict each other. In principle, the contradiction should be solved, the correct information kept and shared by the two pages.

In line with the focus of this thesis, we believe that the adoption of entailment-based techniques to address content synchronization looks promising, as one of the main components of this task can be formalized as an entailment-related problem. Explaining the entailment-based approach of Figure 4.1 with real-world examples, given two pages (P1 and P2), issues include identifying, and properly managing1:

• Text portions in P1 and P2 that express exactly the same meaning (bi-directional entailment, or semantic equivalence, as in: “Mozart was born in Salzburg, Austria” ↔ “Mozart was born in the Austrian city of Salzburg”). In such cases, since there is no information that has to migrate across P1 and P2, the two text portions will remain the same.

• Text portions in P1 that are more informative than portions in P2 (forward entailment from P1 to P2, as in: “Mozart was born in Salzburg” → “Mozart was born in Austria”). In such cases, the entailing (more informative) portion from P1 has to be translated and migrated to P2 in order to replace the entailed (less informative) fragment;

• Text portions in P2 that are more informative than portions in P1 (backward entailment from P2 to P1), and should be translated to replace them;

1For the sake of clarity, the examples provided in this section involve simple English sentences. Although the entailment-based approach is also suitable for the monolingual scenario, the experiments reported in the remainder of this chapter are carried out on the English/German dataset.

• Text portions in P1 describing facts that are not present in P2, and vice-versa (the “unknown” cases in RTE parlance, as in: “Mozart was born in Salzburg” ? “Mozart was born in 1756”). In such cases, the novel information from both sides has to be translated and migrated in order to mutually enrich the two pages.

• Meaning discrepancies between text portions in P1 and text portions in P2 (“contradictions” in RTE parlance, as in: “Mozart was born in Salzburg” ! “Mozart was born in Wien”).

Our framework presents two main differences with respect to the standard formulation of the entailment recognition task (as adopted, for instance, in the previous chapter). First, in the RTE scenario only unidirectional entailment relations between texts and hypotheses are considered. In contrast, content synchronization requires capturing entailment relations in all possible directions. Second, by targeting the synchronization of documents in different languages, our scenario adds multilinguality issues to the complexity of semantic inference at the textual level.

So far, despite its many potential applications, multi-directional TE recognition has only been addressed at the monolingual level (in the very recent NTCIR-9 RITE Multi-class subtask2). However, we proposed the cross-lingual content synchronization task in the most recent Semantic Evaluation (SemEval) series of workshops, which focuses on the evaluation of semantic analysis systems,3 as one of the tasks that can bring the MT and Semantics communities closer. We believe that this task can raise the challenge of dealing with a real-world entailment task in a multilingual scenario.

2http://artigas.lti.cs.cmu.edu/rite/
3http://www.cs.york.ac.uk/semeval-2012/task8/

4.3 Experiments

To demonstrate the effectiveness of our approach using the different feature sets proposed in Chapter 3, in this section we present our experimental setup, the dataset we used, and the different settings adopted for this scenario. Our experiments aim at: i) proving that CLTE represents a viable solution to detect semantic equivalence and information disparity for multilingual content synchronization, and ii) verifying whether lexical, semantic and syntactic features can jointly contribute to improve the CLTE results obtained by using lexical phrase tables.

4.3.1 Dataset

In order to cope with the necessity of having a multilingual content synchronization dataset, we developed a “divide and conquer” methodology based on crowdsourcing [Negri et al., 2011]. This aimed at creating a CLTE corpus from scratch by decomposing a complex content generation task into a pipeline of simpler subtasks accessible to a large crowd of non-experts. Quality control mechanisms were also integrated at each stage of this process. In this way, a complex multilingual task is reduced to a sequence of simpler subtasks where the most difficult one, the generation of entailment pairs, is entirely monolingual. Besides ensuring cost-effectiveness, our solution allowed us to overcome the problem of finding workers that are proficient in multiple languages. The result of this work is the first and only available dataset containing both monolingual and cross-lingual corpora for several combinations of

texts-hypotheses in English, Italian, and German. Among the advantages of this method it is worth mentioning: i) the full alignment between the created corpora, ii) the possibility to extend the dataset to new languages by simply crowdsourcing the translation of English sentences, and iii) the possibility to create a corpus for the content synchronization task, featuring more complex entailment relations than the traditional ones. The last chapter of this thesis (Chapter 6) explains our strategies in creating this dataset in detail.

In our experiments, we used this corpus, which contains both monolingual and cross-lingual aligned pairs in several combinations of English, Italian and German, annotated with multi-directional entailment relations. This dataset contains 500 pairs for each language combination, which we equally divided into training and test sets. Each pair in the dataset is annotated with “Bidirectional”, “Forward”, or “Backward” entailment judgements. Although highly relevant for the overall content synchronization task, “Contradiction” and “Unknown” cases (i.e. “NO” entailment in both directions) are not present in the annotation.

We chose the English-German (ENG-GER) portion of the dataset since, compared to the others, MT system performance for this language pair is often lower. This makes the adoption of simpler solutions based on the pivoting approach, proposed earlier in [Mehdad et al., 2010b] (Chapter 3), more vulnerable. Besides the intrinsic difficulty of the task, the obstacle represented by the noise introduced by translations in the resulting (monolingual) entailment pairs further motivates the use of an integrated approach to CLTE, as proposed earlier in [Mehdad et al., 2011] (Chapter 3).


4.3.2 Features

Aiming at entailment-based content synchronization, we explore the potential contribution of lexical, syntactic and semantic features, as proposed in Chapter 3, in a supervised learning framework. Our model builds on three main feature sets, respectively derived from: i) phrase tables, ii) dependency relations, and iii) semantic phrase tables.

1. Phrase Table (PT) matching: through these features, a semantic judgement about entailment is made exclusively on the basis of lexical evidence. To build the English-German phrase tables for matching, we combined the Europarl, News Commentary and de-news4 parallel corpora. After tokenization,5 Giza++ [Och & Ney, 2000] and Moses [Koehn et al., 2007] were respectively used to align the corpora and extract a lexical phrase table (PT). This resulted in a phrase table with about 45M phrase pair entries.

2. Dependency Relation (DR) matching targets the increase of CLTE precision. By adding syntactic constraints to the matching process, DR features aim to reduce the amount of wrong matches often occurring at the lexical level. Dependency relations (DR) were extracted by running the Stanford parser [Rafferty & Manning, 2008; De Marneffe et al., 2006]. We then mapped the sets of dependency relation labels of the English-German parser output into: adjective, adverb (modifier), verb (root), subject, object, numeral, conjunction, and modal verbs. The dictionary created during the alignment of the parallel corpora provided the lexical knowledge to perform matches when the connected words are different, but semantically equivalent in the two languages. The method for extracting these features was explained in detail in Chapter 3.

4Available at http://homepages.inf.ed.ac.uk/pkoehn/publications/de-news/
5With the standard tokenizer released with Moses.



3. Semantic Phrase Table (SPT) matching: aims at improving CLTE methods relying on pure lexical match, by means of generalized phrase tables annotated with shallow semantic labels (as discussed in Chapter 3). To create the semantic phrase table (SPT), we used the Stanford named entity tagger [Faruqui & Padó, 2010; Finkel et al., 2005] to annotate the parallel corpora for English and German with semantic tags. Then, we combined each sequence of identical labels into one single token of the same label (a toy sketch of this step follows the list), and we ran Giza++ [Och & Ney, 2000] to align the resulting semantically augmented corpora. Finally, we extracted the semantic phrase table from the augmented aligned corpora using the Moses toolkit [Koehn et al., 2007]. This process created an SPT containing about 35M phrase pair entries, which is about 20% smaller in size than the lexical PT.
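The label-collapsing step can be sketched as follows, assuming a token-level tagging where "O" marks non-entity tokens; the input format and function name are illustrative, not those of the actual preprocessing scripts:

    def collapse_entities(tagged_tokens):
        """Collapse runs of tokens sharing one entity label into a single
        token of that label, e.g.
        [('Wolfgang','PERSON'), ('Mozart','PERSON'), ('was','O')]
        becomes ['PERSON', 'was']."""
        out, prev_label = [], None
        for token, label in tagged_tokens:
            if label == "O":              # not an entity: keep the word itself
                out.append(token)
                prev_label = None
            elif label != prev_label:     # first token of a new entity run
                out.append(label)
                prev_label = label
            # else: continuation of the same run, already collapsed
        return out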

To combine and weight features at different levels (PT, SPT and DR), we used a Support Vector Machine (SVM) classifier, SVMlight [Joachims, 1999a].

4.3.3 Evaluation settings

In order to experiment under testing conditions of increasing complexity, we set the CLTE problem both as a two-way and as a three-way classification task. Two-way classification casts multi-directional entailment as a unidirectional problem, where each pair is analyzed checking for entailment both from left to right and from right to left (with “Yes” and “No” as possible entailment judgements). To this aim, the pairs representing the different types of entailment relations have been duplicated as follows (see the sketch after the list):

1. Bi-directional entailment examples (T ↔ H) have been duplicated into: i) a positive forward entailment pair where T entails H (T → H), and ii) a positive backward entailment pair where H entails T (T ← H).

2. Forward entailment examples (T → H) have been duplicated into: i) a positive forward entailment pair where T entails H (T → H), and ii) a negative backward entailment pair where H does not entail T (T ↚ H).

3. Backward entailment examples (T ← H) have been duplicated into: i) a negative forward entailment pair where T does not entail H (T ↛ H), and ii) a positive backward entailment pair where H entails T (T ← H).

Two-way classification represents an intuitive solution to capture multi-directional entailment relations but, at the same time, a suboptimal approach in terms of efficiency, since two checks are performed for each pair. Three-way classification is more efficient, since it does not require the combination of independent unidirectional judgements, but at the same time it is more challenging due to the higher difficulty of multiclass learning (with “Forward”, “Backward”, and “Bidirectional” as possible judgements), especially with small datasets.
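A small sketch of the duplication scheme described above, assuming the dataset's three gold labels; names are illustrative:

    def to_two_way(t, h, label):
        """Expand one multi-directional example into two unidirectional
        (text, hypothesis, YES/NO) pairs, one per entailment direction."""
        forward = "YES" if label in ("Bidirectional", "Forward") else "NO"
        backward = "YES" if label in ("Bidirectional", "Backward") else "NO"
        return [(t, h, forward), (h, t, backward)]

    # e.g. to_two_way("T", "H", "Forward") -> [("T","H","YES"), ("H","T","NO")]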

4.3.4 Results

Accuracy results for different feature sets have been calculated over 250 test pairs (duplicated into 500 in the 2-way classification experiments) of the ENG-GER content synchronization corpus. We carried out three types of evaluation, whose results are reported in Table 4.1. The first one is a lenient 2-way accuracy score (“Lenient” column) that considers the percentage of correctly classified pairs (“YES” or “NO”) out of the total duplicated test pairs. The second evaluation method is a stricter synchronization-oriented accuracy (“CS” column), where each original test example is correctly classified only if both pairs originating from it are correctly judged (i.e. “YES-YES” for bidirectional, “YES-NO” for forward, and “NO-YES” for backward entailment). Strict evaluation scores are also split into “bidir” (the performance on bidirectional entailment pairs) and “unidir” (the performance on forward and backward entailment pairs). This aims at checking: i) the performance of each feature set in detecting the direction of entailment, and ii) the possibility to tune the SVM classifier, optimizing results for one of the two classes while keeping the overall CS performance under control. Finally, the last column (“3-way”) presents the most challenging scenario, where the SVM model learns to classify the test pairs into the exact class (“Bidirectional”, “Forward”, or “Backward”) using a multi-class classifier [Crammer & Singer, 2002].
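As a sketch, the strict CS accuracy can be computed from the paired unidirectional judgements as follows; the (forward, backward) tuple encoding is an assumption for illustration:

    def cs_accuracy(gold, pred):
        """Strict synchronization-oriented accuracy: an original example is
        correct only if both of its duplicated pairs are classified correctly.
        gold/pred: lists of (forward, backward) judgements, e.g. ("YES", "NO")."""
        correct = sum(1 for g, p in zip(gold, pred) if g == p)
        return correct / len(gold)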

We also compare our results with two pivoting approaches, checking for entailment between the original English texts and the translated German hypotheses.6 The first (Pivot-EDITS) uses an optimized distance-based model implemented in the open-source RTE system EDITS [Kouylekov et al., 2011]. To obtain the optimal model, we ran EDITS-GA over the training set to automatically find the best settings and algorithm for this dataset, using the genetic algorithm discussed in Chapter 2. The resulting model was based on the token edit distance algorithm, with stop-words removed from each pair. The second (Pivot-PPT) exploits paraphrase tables for phrase matching, and represents the best monolingual model presented in Mehdad et al. [2011], discussed in Chapter 2.

Table 4.1 demonstrates the success of our results in proving the two main claims of this chapter.

1. On both 2-way and 3-way classification, all the feature sets outperform the approaches taken as terms of comparison. The 61.6% accuracy achieved in the most challenging setting (3-way) demonstrates the effectiveness of our approach in capturing meaning equivalence and information disparity in cross-lingual texts.

6Using Google Translate.


                          2-way
Feature sets   Lenient   tuned for unidir        tuned for bidir         3-way
                         CS     unidir   bidir   CS     unidir   bidir
PT             77.4      57.8   61.3     50.0    57.0   52.8     66.2    57.4
PT+DR          77.6      58.6   60.7     54.1    58.2   57.7     59.5    57.8
PT+SPT         79.5      62.4   64.4     58.1    59.5   57.7     63.5    58.7
PT+SPT+DR      79.8      63.3   66.9     55.4    60.3   59.5     63.2    61.6
Pivot-EDITS    72.8      27.4   14.7     55.4    27.4   14.7     55.4    25.3
Pivot-PPT      80.7      57.0   46.0     81.1    57.0   46.0     81.1    56.1

Table 4.1: CLTE accuracy results for content synchronization. Three types of evaluation are reported: lenient 2-way classification (“YES” and “NO” judgements), combined 2-way classification (“YES-YES” for bidirectional, “YES-NO” for forward, and “NO-YES” for backward), and 3-way classification (“Bidirectional”, “Forward” and “Backward”). Different CLTE models are compared with two pivoting approaches (Pivot-EDITS and Pivot-PPT).


2. Syntactic and semantic features, combined with lexical features (PT+SPT+DR), significantly improve7 the CLTE state-of-the-art lexical model (PT) for all experimental settings (2-way and 3-way).

A further analysis of the reported results leads to other interesting observations.

• Semantic phrase table matching (PT+SPT) consistently improves lexical phrase table matching (PT) for all settings (Lenient, CS and 3-way) and parameters (unidir/bidir tuning). Such improvement can be motivated by the increased coverage of SPTs (matching more and longer n-grams), and the consequent recall improvement over PTs.

7p < 0.05, calculated using the approximate randomization test implemented in Padó [2006].



• DR always helps in boosting PT matching results (PT+DR and PT+SPT+DR). Such improvement is likely due to the gain in precision brought by syntactic constraints. However, the increase over the lexical phrase table matching (PT+DR) is minimal. This might be due to the fact that both PT and DR features are precision-oriented, and their effectiveness becomes evident only in combination with recall-oriented features (e.g. SPT).

• The cross-lingual models can be tuned to obtain better results for bidirectional (bidir) or unidirectional (unidir) entailment with minimal or zero loss in the overall accuracy (CS). This is potentially helpful in scenarios where: i) a dataset is unbalanced and biased towards one class, or ii) there is a need to boost bidirectional or unidirectional entailment recognition (semantic equivalence vs. RTE-like entailment).

• The high results in the RTE-like setting (Lenient 2-way classification), ranging from 77% to 80%, are above the state of the art in monolingual RTE. This is not surprising considering that duplicating the original pairs into “YES” and “NO” creates an unbalanced dataset with a higher number of “YES” pairs (around 65%). However, the fact that lenient judgements represent a relatively easier task does not reduce the difficulty of the multi-directional CLTE task proposed here.

Further interesting observations emerge from the comparison with the results achieved on monolingual data by the two pivoting approaches.

• When dealing with MT-derived inputs, there is a drop in the overall results. In other words, cross-lingual models significantly outperform the pivoting models. This suggests that the noise introduced by incorrect translations makes the pivoting approach less attractive in comparison with the more robust cross-lingual models.

• The accuracy obtained in the lenient evaluation using paraphrase tables (Pivot-PPT, 80.7%) is minimally better than that of the cross-lingual model (79.8%); however, this does not hold in the other cases (e.g. CS or 3-way). This demonstrates that monolingual models can somehow cope with the traditional entailment judgements, while they fall significantly short in judging the direction of the entailment in more challenging scenarios. This negative impact, especially in EDITS, might be due to the fact that available algorithms often rely on similarity-based methods, which work reasonably well only with bidirectional cases. This also emphasizes the need for more RTE datasets addressing real-world application scenarios.

• The monolingual models are not easily tunable, and the attempts to tune such models lead to a drop in bidirectional cases with no improvement in unidirectional pairs. This further proves the effectiveness of our cross-lingual models in approaching this task.

4.4 Open Issues and Future Directions

Although relevant for the content synchronization task, “contradictions” and “unknown” cases (i.e. “NO” entailment in both directions) are not considered at this stage of our work, and are left as a future research direction. On one side, contradictions would require deciding which of the two elements of a pair provides true, or more reliable, information. Such an additional level of complexity is currently out of the scope of our research, which builds on the assumption that both statements provide true information. On the other side, unknown cases are not represented

yet in the CLTE datasets, since the corpus creation methodology adopted did not target the collection of such kinds of entailment pairs.

As a step towards this, we proposed the “Cross-lingual Textual Entailment for Content Synchronization” task at SemEval 2012, adding “No Entailment” (T1 ↛ T2 and T1 ↚ T2)8 pairs to the evaluation scenario. This larger dataset consists of 1,000 CLTE pairs (500 for training and 500 for test), balanced with respect to the four entailment judgements (bidirectional, forward, backward, and no entailment). The dataset was created following the same crowdsourcing-based methodology, which will be discussed in Chapter 6 [Negri et al., 2011], and consisted of the following steps:

1. English sentences were selected from copyright-free sources, i.e. Wikipedia and Wikinews, and represent T1 in the entailment pair.

2. Each T1 was modified through crowdsourcing in various ways (e.g. introducing lexical and syntactic changes, adding and removing portions of text, etc.) in order to obtain a corresponding T2.

3. Each T1 was paired to the corresponding T2, and the resulting pairs were annotated with the entailment judgment. The final result was a monolingual English dataset.

4. In order to create the cross-lingual datasets, each English T1 was translated into different languages (i.e. Spanish, German, Italian and French).

5. By pairing the translated T1 with the corresponding T2 in English, four cross-lingual datasets were obtained.

8T1 and T2 represent the first and second elements of the pair (i.e. T and H in the traditional RTE scenario).


6. The overall final result is a multilingual parallel entailment corpus, where T1’s are in 5 different languages (i.e. English, Spanish, German, Italian, and French), and T2’s are in English.

7. To ensure the quality of the dataset, all the pairs were manually checked by two expert annotators and modified where necessary.

Our future work will address both content synchronization and cross-lingual textual entailment. On one side, we plan to explore different features along various dimensions to improve our CLTE model, and consequently the content synchronization results. One possible direction is to consider topic modelling to measure the relatedness of the texts. It is worth mentioning that we tried a few approaches to exploit information from a bilingual LSA model, yet with no significant improvement in that direction. Another interesting direction is to investigate the potential of Wikipedia entity-linking-based features as a semantic similarity measure to boost the performance. On the other side, we will explore the possibility of adopting our feature sets to deal with the “unknown” cases present in the new dataset, moving towards a more realistic evaluation scenario [Mehdad & Negri, 2012].

4.5 Summary

In this chapter we addressed multilingual content synchronization, which represents at the same time a challenging application scenario for a variety of NLP technologies and a shared research framework for the joint contribution of semantics and MT technology. Our first step towards the success of this endeavour consists in formalizing the core aspects of the task as a cross-lingual textual entailment (CLTE) problem. Building on our previous works targeting (i) the investigation of possible approaches to CLTE, and (ii) the collection of parallel CLTE datasets, this chapter took a step further, applying an improved CLTE model and providing successful results under different settings over the only content synchronization dataset available. Along with our proposed approaches and the collected corpora, our results represent a strong element to build a solid framework for this new research direction [Mehdad et al., 2012a].

Chapter 5

Application 2: Evaluating the Adequacy of Machine Translation Output without References

5.1 Introduction

While syntactically informed modelling for statistical MT is an active field of research that has recently gained major attention from the MT community, work on integrating semantic models of adequacy into MT is still at a preliminary stage. This situation holds not only for system development (most current methods disregard semantic information, in favour of statistical models of word distribution), but also for system evaluation. To realize its full potential, however, MT is now in need of semantics-aware techniques, capable of complementing frequency counts with meaning representations. In the effort of pushing semantics into MT technology, in this chapter we focus on the evaluation dimension. Restricting our investigation to some of the more pressing issues emerging from this area of research, we focus on: i) an automatic evaluation method that avoids the use of reference translations, and ii) a method for evaluating translation adequacy.


Our approach builds on the core advancements in cross-lingual textual entailment (CLTE) recognition, which provides a natural framework to address MT adequacy [Mehdad et al., 2012b]. In particular, we cast the problem as a CLTE task where bi-directional entailment between source and target is considered as evidence of translation adequacy. Besides avoiding the use of references, the proposed solution differs from most previous methods, which typically rely on surface-level features, often extracted from the source or the target sentence taken in isolation (e.g. “average length of source sentence words”). Although some of these features might correlate well with adequacy, they capture semantic equivalence only indirectly, and at the level of a probabilistic prediction.

Focusing on a combination of surface, syntactic and semantic features, extracted from both source and target (e.g. “source-target length ratio”, “dependency relations in common”), our approach leads to informed adequacy judgements derived from the actual observation of a translation given the source sentence. Our method shows high correlation with human judgements and good results on different datasets and evaluation settings, without relying on reference translations.

5.2 MT evaluation

Machine translation (MT) can be defined as the task of automatically transforming texts in one language into texts in another language, producing fluent output texts that preserve the meaning of the source input texts. Statistical MT (SMT) has recently achieved significant progress in modelling the fluency and adequacy of translations as measured by commonly used automated evaluation metrics.

MT evaluation, especially by means of automatic metrics, serves different purposes:


• Detecting and analyzing possible errors and possibly determining the sources that cause them. This could improve the system significantly and provide better translation at the end of each development cycle.

• Ranking alternative MT systems or different versions of the same system to systematically evaluate their cumulative development.

• Optimizing and tuning MT systems by fixing their configurations and parameters in such a way as to achieve the best performance.

While manual (human) evaluations are informative and usually of higher quality, they demand a costly procedure. Moreover, they are subjective, not replicable and not reusable. Automatic evaluation methods, by contrast, are often efficient, objective and reusable.

Several automatic metrics, based on different similarity criteria and levels, have been proposed and used in the past. These metrics are mainly based on comparisons between automatic and human reference translations. Most of these metrics score the MT output against human translation references using different lexical similarities based on: i) edit distance (e.g. TER [Snover et al., 2006], WER [Nießen et al., 2000] and PER [Tillmann et al., 1997]), ii) precision (e.g. BLEU [Papineni et al., 2002] and NIST [Doddington, 2002]), iii) recall (e.g. ROUGE [Lin, 2003]), and iv) precision and recall (e.g. GTM [Melamed et al., 2003] and METEOR [Banerjee & Lavie, 2005]).

Such measures, especially BLEU, have been widely adopted by the MT community. However, due to the variability of natural languages in terms of possible ways to express the same meaning, reliable lexical similarity metrics depend on the availability of costly, hand-crafted alternative realizations of the same source sentence in the target language. Moreover, such metrics do not consistently reward translation adequacy. In order to overcome such shortcomings, some recent works proposed metrics that

are able to approximately assess meaning equivalence between candidate and reference translations. Among these, Giménez & Màrquez [2007] proposed a heterogeneous set comprising overlapping and matching metrics, compiled from a rich set of variants at five different linguistic levels: lexical, shallow-syntactic, syntactic, shallow-semantic and semantic. More similar to our approach, Padó et al. [2009] proposed semantic adequacy metrics that exploit feature representations motivated by textual entailment. Both metrics, however, highly depend on the availability of multiple reference translations.

Despite the vast growth of automatic metrics for MT evaluation, which cope with some imperfections of this technology, there are still several problems in the current methodology for MT evaluation:

1. There is a large drop in automatic metrics’ performance when reference translations are not available. As mentioned earlier, the quality of such metrics is highly dependent on the number of reference translations, which are prepared by humans. This makes such methods impractical when time or budget is short.

2. It is often not easy to map such measures onto a meaningful scale from which to gain some insight. For example, it is difficult to answer “what does it mean when the BLEU score is 0.04?”.

3. The lack of information about the quality of an MT system’s output (even in the case of post-editing), which could be relevant and interesting for human translators, has always been an issue for automatic metrics. Moreover, providing information that can reveal the capability of an MT system to produce an acceptable translation has not been fully explored.

4. The lack of semantic information in MT evaluation and MT systems, specifically at the multilingual level, has left MT systems illiterate in terms of semantics and meaning. Since translation is very much influenced by language variability, semantics-aware models are needed for MT evaluation. Because of the complexity of SMT algorithms, it is not straightforward to embed semantic knowledge. However, the automatic evaluation process can be a good framework to integrate such features into MT technology.

This raises the need to overcome the mentioned problems through the development of systems and algorithms which can judge the adequacy of MT output (i.e. problems 2 and 3) without the need of reference translations (i.e. problem 1), and which are enriched by semantic information (i.e. problem 4). Without more suitable approaches to address these difficulties, the introduction of semantics into MT technology is far from being achieved. Fortunately, CLTE technology can benefit such an application by integrating lexical-semantic knowledge into MT evaluation. We believe that CLTE could improve MT evaluation directly, besides indirectly favouring the improvement of MT approaches.

5.3 Predicting MT Adequacy

Evaluating the MT output exposes different dimensions for further exploration. Fluency, adequacy and quality are among the most relevant aspects to be investigated in order to evaluate the MT system output. Each dimension explores different characteristics of the translated sentences, which ideally reflect the weaknesses and strengths of MT systems.

Fluency mainly captures the naturalness of the output sentence. In other words, it reflects how far the MT output reads like a sentence written by a native speaker. This criterion can be evaluated separately, disregarding other characteristics of the output such as meaning or understandability.

However, an output can be totally fluent but very different from the source sentence in terms of meaning and information. The features used for evaluating fluency are mainly source-independent, focusing on the grammatical and readability characteristics of the target.

Adequacy is explained as a translation characteristic that preserves the meaning of the source text without adding/removing any information to/from it. Intuitively, this criterion is related to the semantics and content of the output rather than its grammaticality or readability. However, it is often very challenging to draw a precise borderline between these two criteria, considering the nature of natural languages. This difficulty has been observed in several MT evaluation campaigns [Callison-Burch et al., 2010]. The features used for evaluating adequacy are mainly source-target dependent, focusing on the meaning and information present in both source and target.

Quality estimation (QE) focuses mainly on assessing the quality of the output without distinguishing between fluency and adequacy. The focus of such measures is mainly on the acceptability of the output sentences/segments, or the amount of post-editing needed to achieve a good translation. The features used for evaluating quality are a combination of source, target, and both source and target, focusing on various characteristics of the output (e.g. “source complexity” and “target fluency”).

Moving toward an automatic measure addressing the adequacy of MT output and overcoming the problems mentioned in the previous section (e.g. the need for many reference translations) led us to focus on adequacy evaluation without the use of reference translations. Early attempts to avoid reference translations addressed quality estimation (QE) by means of large numbers of source, target, and system-dependent features to discriminate between “good” and “bad” translations (e.g. Blatz et al. [2004];


Quirk [2004]). More recently, Specia et al. [2009], Specia & Farzindar [2010] and Specia [2011] conducted a series of experiments using features designed to estimate translation post-editing effort (in terms of volume and time) as an indicator of MT output quality. Good results in QE have been achieved by adding linguistic information such as POS tags [Xiong et al., 2010] or dependency relations [Bach et al., 2011; Avramidis et al., 2011] as features. However, in general these approaches do not distinguish between fluency (i.e. syntactic correctness of the output translation) and adequacy, and mostly rely on fluency-oriented features (e.g. “number of punctuation marks”). As a result, a simple surface form variation is given the same importance as a content word variation that changes the meaning of the sentence. To the best of our knowledge, only Specia et al. [2011] proposed an approach to frame MT evaluation as an adequacy estimation problem. However, their method still includes many features which are not adequacy-focused, and which often look either at the source or at the target in isolation (see for instance the “source complexity” and “target fluency” features). Moreover, the actual contribution of the adequacy features used is not always evident and, for some testing conditions, marginal.

Our approach to adequacy evaluation builds on and extends the mentioned works, taking advantage of the CLTE framework. Similarly to Padó et al. [2009], we rely on the notion of textual entailment, but declined in its cross-lingual sense in order to bypass the need for reference translations. Similarly to Blatz et al. [2004] and Quirk [2004], we try to discriminate between “good” and “bad” translations, focusing exclusively on adequacy. To this aim, similarly to Xiong et al. [2010], Bach et al. [2011], Avramidis et al. [2011], and Specia et al. [2009, 2011], we investigate a large set of features, but limited to source-target dependent ones (see Table 5.1).

123 5.4. CLTE CHAPTER 5. MT ADEQUCY EVALUATION

QE-oriented features (Specia et al. 2010b)                      adequacy  fluency  src  tgt
avg. word length & avg. # of translations                          ?        ?      x
n-gram frequencies                                                 ?        x      x
bracket & quotation mismatching & POS LM probabilities                      x           x
LM probabilities                                                            x      x    x
alignment score                                                    x        x      x    x
length & ratio                                                     x               x    x
brackets, punctuations, numbers and content/non-content words      x               x    x

Adequacy-oriented features                                      adequacy  fluency  src  tgt
words, punctuations and OOV match and ratio (F)                    x               x    x
POS tags match and ratio (SSyn)                                    x               x    x
syntactic roles match and ratio (Syn)                              x               x    x
dependency relations match (DR)                                    x               x    x
phrase table match (PT)                                            x               x    x
semantic-aware phrase table match (SPT)                            x               x    x

Table 5.1: Comparison between our adequacy-oriented features and the quality features (QE) used by Specia et al. [2009], in terms of source/target derivation and adequacy/fluency nature.

5.4 CLTE for adequacy evaluation

We address adequacy evaluation by relying on cross-lingual textual entailment recognition as a way to measure to what extent a source sentence and its automatic translation are semantically similar. CLTE, as proposed and discussed in Chapter 3, is an extension of textual entailment [Dagan & Glickman, 2004] that consists in deciding, given a text T and a hypothesis H in different languages, whether the meaning of H can be inferred from the meaning of T.

The main motivation for approaching adequacy evaluation using CLTE is that an adequate translation and the source text should convey the same meaning. In terms of entailment, this means that an adequate MT output and the source sentence should entail each other (bi-directional entailment). Losing or altering part of the meaning conveyed by the source sentence (i.e. having more, or different, information in one of the two sides) will change the entailment direction and, consequently, the adequacy judgement. Framed

in this way, CLTE-based adequacy evaluation methods can be designed to distinguish meaning-preserving variations from true divergence, regardless of reference translations. Moreover, considering only the semantics of the source (T) and the target (H), CLTE-based adequacy judgements are by definition fully independent from fluency and grammaticality issues.

Similarly to many monolingual TE approaches, the CLTE solutions proposed so far adopt supervised learning methods, with features that measure to what extent the hypotheses can be mapped into the texts. The underlying assumption is that the probability of entailment is proportional to the number of words in H that can be mapped to words in T (as explained in Chapter 3). Such mapping can be carried out at different word representation levels (e.g. tokens, lemmas, stems), possibly with the support of lexical knowledge in order to cross the language barrier between T and H (e.g. dictionaries, phrase tables). Under the same assumption, since in the adequacy evaluation framework the entailment relation should hold in both directions, the mapping is performed both from the source to the target and vice-versa, building on features extracted from both sentences. Moreover, to improve over previous CLTE methods and boost MT adequacy evaluation performance, we explore the joint contribution of a number of linguistically motivated features.
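Conceptually, the mapping from entailment directions to adequacy can be sketched as follows; clte_entails() is a hypothetical stand-in for the feature extraction and SVM pipeline described in the next sections, and the three-level output scale is ours for illustration only:

    def adequacy_judgement(source, translation, clte_entails):
        """Combine the two entailment directions into a coarse adequacy label."""
        src_to_tgt = clte_entails(source, translation)  # meaning preserved?
        tgt_to_src = clte_entails(translation, source)  # nothing added?
        if src_to_tgt and tgt_to_src:
            return "adequate"            # bi-directional entailment
        if src_to_tgt or tgt_to_src:
            return "partially adequate"  # information lost or added
        return "inadequate"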

5.4.1 Features

Aiming at objective adequacy evaluation, our method limits the recourse to MT system-dependent features, to reduce the bias of evaluating MT technology with its own core methods. The experiments described in the following sections are carried out on publicly available English-Spanish datasets, exploring the potential of a combination of surface, syntactic and semantic features. Language-dependent features are extracted by exploiting a number of tools for the two languages (part-of-speech taggers,

dependency parsers and named entity recognizers). Our feature set can be described as follows:

• Surface Form (F) features consider the number of words, punctuation marks and non-word markers (e.g. quotations and brackets) in source and target, as well as their ratios (source/target and target/source), and the number of out-of-vocabulary terms encountered.

• Shallow Syntactic (SSyn) features consider the number and ratios of common part-of-speech (POS) tags in source and target. Since the list of valid POS tags varies for different languages, we mapped English and Spanish tags into a common list using the FreeLing tagger [Carreras et al., 2004]. Our common POS list for the English-Spanish language pair is: Noun, Verb, Adjective, Adverb, Number, Pronoun, Conjunction, Punctuation, Preposition and Symbol.

• Syntactic (Syn) features consider the number and ratios of dependency roles common to source and target. To create a unique list of roles, we used the DepPattern [Gamallo Otero & Gonzalez Lopez, 2011] package, which provides English and Spanish dependency parsers. Our common dependency roles are: Adjunct, Determiner, Object, Subject, Preposition and Root.

• Dependency Relation (DR) matching features capture similarities between dependency relations, combining the syntactic and lexical levels (a small sketch follows this feature list). DR features were extracted in the same way as discussed in Chapter 3. Term matching is carried out by means of a bilingual dictionary extracted from parallel corpora, as described in the next paragraph. Given the dependency tree representations of source and target produced with DepPattern, for each grammatical relation r we calculate two DR matching scores as the number of matching occurrences of r


in both source and target, respectively normalized by: i) the number of occurrences of r in the source, and ii) the number of occurrences of r in the target.

Overall, this approach resembles other syntax-based methods previously proposed for MT evaluation at the monolingual level [Giménez & Màrquez, 2007]. Its adaptation to the cross-lingual scenario, however, is less straightforward for at least two reasons. First, obtaining dependency parsers for different language combinations is not always trivial, and parsing the noisy output of MT systems is challenging. Second, the alignment of different syntactic representations of T and H requires some additional effort (manual in our case) to define mapping rules for the relations of interest.

• Phrase Table (PT) matching features are calculated as described in Chapter 3, with a phrasal matching algorithm that takes advantage of a lexical phrase table extracted from a bilingual parallel corpus. The algorithm determines the number of source phrases (1 to 5-grams, at the level of tokens, lemmas and stems) that can be mapped into target word sequences, and vice-versa. To build our English-Spanish phrase table, we used the Europarl, News Commentary and United Nations Spanish-English parallel corpora. After tokenization, Giza++ [Och & Ney, 2000] and the Moses toolkit [Koehn et al., 2007] were respectively used to align the corpora and extract the phrase table. The resulting PT contained 200M phrase pair entries. Although the phrase table was generated using MT technology, its use to compute our features is still compatible with a system-independent approach, since the extraction is carried out without tuning the process towards any particular task. Moreover, our phrase matching algorithm integrates matches from overlapping n-grams of different size and nature (tokens,


lemmas and stems) which current MT decoding algorithms cannot explore for complexity reasons.

• Semantic Phrase Table (SPT) matching features are calculated using phrase tables annotated with shallow semantic labels. SPTs have been extracted from the same parallel corpora used to build the lexical PTs. To this aim, we first annotated the corpora with the FreeLing named-entity tagger, replacing named entities with general semantic labels chosen from a coarse-grained taxonomy including the categories: person, location, organization, date and numeric expression. Then, we combined the sequences of unique labels into one single token of the same label. Finally, we extracted the semantic phrase table from the augmented corpora in the same way mentioned above. This produced an SPT containing about 135M phrase pair entries, which is about 30% smaller than the lexical PT. The resulting SPT is used to map phrases between NE-annotated source-target pairs, similarly to PT matching. In addition to the advantages explained in Chapter 3, SPTs offer two more benefits in the MT evaluation scenario:

1. Their smaller size has a positive impact on the system’s efficiency, due to the considerable search space reduction.

2. The use of SPTs represents a promising direction to bring semantic knowledge into MT technology, starting from the evaluation scenario.
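As a small sketch of the DR matching scores defined above, assuming dependency relations reduced to (relation, head, dependent) triples and with translates() standing in for the bilingual dictionary lookup:

    def dr_match_scores(src_deps, tgt_deps, rel, translates):
        """Occurrences of relation rel matching across source and target,
        normalized by its frequency in the source and in the target."""
        src_r = [d for d in src_deps if d[0] == rel]
        tgt_r = [d for d in tgt_deps if d[0] == rel]
        matches = sum(
            1 for (_, s_head, s_dep) in src_r
            if any(translates(s_head, t_head) and translates(s_dep, t_dep)
                   for (_, t_head, t_dep) in tgt_r)
        )
        return (matches / len(src_r) if src_r else 0.0,
                matches / len(tgt_r) if tgt_r else 0.0)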

A categorization of our features in terms of source/target derivation and adequacy/fluency nature is reported in Table 5.1.

5.4.2 Dataset

Datasets with manual evaluation of MT output have been made available through a number of shared evaluation tasks. However, most of these

datasets are not specifically annotated for adequacy measurement purposes, and the available adequacy judgements are limited to a few hundred sentences for some language pairs. Moreover, most datasets are created by comparing reference translations with MT systems’ output, disregarding the input sentences. Such judgements are hence biased towards the reference. Furthermore, the inter-annotator agreement is often low [Callison-Burch et al., 2007]. In light of these limitations, most of the available datasets are per se neither fully suitable for adequacy evaluation methods based on supervised learning, nor able to provide stable and meaningful results. To partially cope with these problems, our experiments have been carried out over two different datasets:

1. 16K: 16,000 English-Spanish pairs, containing the output of four MT systems, created in a controlled environment to guarantee the quality of the annotations, and annotated by professional translators trained on the task and following clearly defined guidelines about the interpretation of the quality scores [Specia et al., 2010]. Translators were given the source sentence in English and its translation into Spanish, as produced by each of the four MT systems, and the quality judgements were assigned on the following 4-point scale:

1: requires complete re-translation.
2: a lot of post-editing needed (but quicker than re-translation).
3: a little post-editing needed.
4: fit for purpose.

2. WMT07: 703 English-Spanish pairs, containing MT system output with an adequacy judgement for each pair, annotated by volunteers given the reference translation. The five-point scale for adequacy in this dataset indicates how much of the meaning expressed in the reference translation is also expressed in a hypothesis translation (MT system output): 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None.

The two datasets present complementary advantages and disadvantages. On the one hand, although it is not annotated to explicitly capture meaning-related aspects of MT output, the quality-oriented dataset has the main advantage of being large enough for supervised approaches. Moreover, it should allow us to check the effectiveness of our feature set in estimating adequacy as a latent aspect of the more general notion of MT output quality. On the other hand, the smaller dataset is less suitable for supervised learning, but represents an appropriate benchmark for MT adequacy evaluation.

5.4.3 Algorithms and Approaches

In order to learn models for classification and regression, we used Support Vector Machine (SVM) algorithms, which have proved to be effective for a variety of NLP applications. To combine different features at various levels, different implementations of SVM were used in our experiments, namely:

1. LIBSVM [Chang & Lin, 2011] for classification.

2. SVM-Light [Joachims, 1999c] for regression.

5.5 Results

5.5.1 Adequacy and quality prediction

To experiment with our CLTE-based evaluation method while minimizing overfitting, we randomized each dataset 5 times (D1 to D5), and split each of them into 80% for training and 20% for testing. Using different feature sets, we then trained and tested various regression models over each of the five splits, and computed correlation coefficients between the CLTE model predictions and the human gold-standard annotations ([1-4] for quality, and [1-5] for adequacy).
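A sketch of this evaluation loop, using scipy and scikit-learn as convenient stand-ins (the thesis experiments used SVM-Light rather than these libraries):

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR

    def correlation_over_splits(X, y, n_splits=5):
        """Pearson's r between SVM regression predictions and gold scores
        on each of n_splits randomized 80/20 partitions of the data."""
        scores = []
        for seed in range(n_splits):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.2, random_state=seed)
            model = SVR().fit(X_tr, y_tr)
            r, _ = pearsonr(model.predict(X_te), y_te)
            scores.append(r)
        return scores, float(np.mean(scores))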


Features                 D1      D2      D3      D4      D5      AVG
F                        0.2506  0.2578  0.2436  0.2527  0.2443  0.25
SSyn+Syn                 0.4387  0.4114  0.3994  0.4114  0.3793  0.41
F+SSyn+Syn               0.4215  0.4398  0.4059  0.4464  0.4255  0.428
F+SSyn+Syn+DR            0.4668  0.4602  0.4386  0.4437  0.4454  0.451
F+SSyn+Syn+DR+PT         0.4724  0.4715  0.4852  0.5028  0.4653  0.48
F+SSyn+Syn+DR+PT+SPT     0.4967  0.4802  0.4688  0.4894  0.4887  0.485
BLEU                                                             0.2268
NIST                                                             0.1953
TER                                                              0.1938
METEOR                                                           0.2713
QE                                                               0.4792

Table 5.2: Pearson’s correlation coefficient between our SVM regression model and human quality annotation over the 16K dataset. The correlation achieved by standard MT automatic metrics is also reported.

16K quality-based dataset

In Table 5.2 we compare the Pearson’s correlation coefficients of our SVM regression models against the results reported in Specia et al. [2009], calculated with four common MT evaluation metrics using a single reference: BLEU, NIST, TER and Meteor. For the sake of comparison, we also report the average quality correlation (QE) obtained by Specia et al. [2009] over the same dataset.1

The results show that the integration of syntactic and semantic information in our adequacy-oriented model allows us to achieve a correlation with

1 We only show the average results reported in Specia et al. [2009], since the distribution of the 16K dataset is different from our randomized distribution.

As expected, a considerable improvement over surface features is achieved by the integration of syntactic information. A further increase, however, is brought by the complementary contribution of SPT (recall-oriented, due to the higher coverage of semantics-aware phrase tables with respect to lexical PTs) and DR matching features (precision-oriented, due to the syntactic constraints posed on matching text portions). Although they are meant to capture meaning-related aspects of MT output, our features allow us to outperform the results obtained by the generic quality-oriented features used by Specia et al. [2009], which do not discriminate between adequacy and fluency.2 When dependency relations and phrase tables (both lexical and semantics-aware) are used in combination, our scores also outperform the average QE score. Finally, looking at the different random splits of the same dataset (D1 to D5), our correlation scores remain substantially stable, proving the robustness of our approach not only for adequacy, but also for quality estimation.

WMT07 adequacy-based dataset

In Table 5.3 we compare our regression model, obtained in the same way described above, against three commonly used MT evaluation metrics [Callison-Burch et al., 2007]. Due to the smaller size of the WMT07 dataset, the reported results do not show the same consistency over the 5 randomized datasets (D1 to D5). However, they still prove the effectiveness of our method in predicting MT output adequacy. Overall, the correlation achieved with features that only look at the source and the target is not far from that of automatic evaluation metrics that rely on the use of reference translations. Compared with Meteor, the correlation with human judgements is even higher.

2 As reported in Specia et al. [2009], more than 50% (39 out of 74) of the features used are translation-independent (i.e. only source-derived features).

Features                  D1      D2      D3      D4      D5      AVG
F                         0.10    0.03    0.04    0.10    0.14    0.083
SSyn+Syn                  0.299   0.351   0.1834  0.2962  0.2417  0.274
F+SSyn+Syn                0.2648  0.2870  0.4061  0.3601  0.1327  0.29
F+SSyn+Syn+DR             0.3196  0.4568  0.2860  0.5057  0.4066  0.395
F+SSyn+Syn+DR+PT          0.3254  0.4710  0.3921  0.4599  0.3501  0.40
F+SSyn+Syn+DR+PT+SPT      0.3487  0.4032  0.4803  0.4380  0.3929  0.413
BLEU                      -       -       -       -       -       0.466
TER                       -       -       -       -       -       0.437
METEOR                    -       -       -       -       -       0.357

Table 5.3: Pearson's correlation coefficient between our SVM regression model and human adequacy annotation over the WMT07 set. As a term of comparison, correlation achieved by standard MT automatic metrics is also reported.

5.5.2 Multi-class classification

To further explore the potential of our CLTE-based MT evaluation method, we trained an SVM multi-class classifier to predict the exact adequacy and quality scores assigned by human judges. The evaluation was carried out by measuring the accuracy of our models with 10-fold cross-validation to minimize overfitting. As a baseline, we calculated the performance of the Majority Class (MjC) classifier proposed in Specia et al. [2011], which labels all examples with the most frequent class. The performance improvement over the result obtained by the MjC baseline (∆) has been calculated to assess the contribution of different feature sets.
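As a schematic rendering of this evaluation, the sketch below computes 10-fold accuracy for an SVM multi-class model and for a most-frequent-class baseline playing the role of MjC; scikit-learn's DummyClassifier is an assumption of ours, not the implementation used by Specia et al. [2011].

    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def accuracy_and_delta(X, y):
        """10-fold accuracy of an SVM multi-class model and its gain over MjC."""
        svm_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=10).mean()
        mjc_acc = cross_val_score(
            DummyClassifier(strategy="most_frequent"), X, y, cv=10).mean()
        return svm_acc, svm_acc - mjc_acc  # (accuracy, improvement over MjC)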


Features                    10-fold acc.  ∆
F                           42.16%        5.16
Syn+SSyn                    46.61%        9.61
F+Syn+SSyn                  47.10%        10.10
F+Syn+SSyn+DR               47.26%        10.26
F+Syn+SSyn+DR+PT            48.15%        11.15
F+Syn+SSyn+DR+PT+SPT        48.74%        11.74
MjC                         37%           -

Table 5.4: Multi-class classification accuracy of the quality/adequacy scores over the 16K quality-based dataset.

16K quality-based dataset

The accuracy results reported in Table 5.4 show that in this testing condition, too, syntactic and semantic features improve over surface form ones. Besides that, we observe a steady improvement over the MjC baseline (from 5% to 12%). This demonstrates the effectiveness of our adequacy-based features in predicting exact quality scores on a 4-point scale, although this is a more challenging task than regression and binary classification. Such improvement is even more interesting considering that Specia et al. [2009] reported discouraging results with multi-class classification to predict quality scores. Moreover, while they claimed that removing target-independent features (i.e. those only looking at the source text) significantly degrades their QE performance, we achieved good results without using any of these features.

WMT07 adequacy-based dataset

As we can observe in Table 5.5, all variations of our adequacy estimation models significantly outperform the MjC baseline, with improvements ranging from 14% to 20%. Interestingly, although the dataset is small and the number of classes is higher (5-point scale), the improvement and the overall results are better than those obtained on the 16K dataset. This result confirms our hypothesis that adequacy-based features extracted from both source and target perform better on a dataset explicitly annotated with adequacy judgements. In addition, the improvement over the MjC baseline (∆) of our best model is much higher (20%) than the one reported in Specia et al. [2011] on adequacy estimation (6%). We are aware that their results are calculated over a dataset for a different language pair (i.e. English-Arabic), which brings up more challenges. However, our smaller dataset (700 vs. 2,580 pairs) and the higher number of classes (5 vs. 4) compensate to some extent for the difficulty of dealing with English-Arabic pairs.

Features                    10-fold acc.  ∆
F                           50.07%        14.07
Syn+SSyn                    54.19%        18.19
F+Syn+SSyn                  54.34%        18.34
F+Syn+SSyn+DR               56.47%        20.47
F+Syn+SSyn+DR+PT            56.61%        20.61
F+Syn+SSyn+DR+PT+SPT        56.75%        20.75
MjC                         36%           -

Table 5.5: Multi-class classification accuracy of the quality/adequacy scores over the WMT07 adequacy-based dataset.

5.5.3 Recognizing “good” vs “bad” translations

Last but not least, we considered the traditional scenario for quality and confidence estimation, which is a binary classification of translations into "good" and "bad" or, from the meaning point of view, "adequate" and "inadequate". Adequacy-oriented binary classification has many potential applications in the translation industry, ranging from the design of confidence estimation methods that reward meaning-preserving translations, to the optimization of the translation workflow. For instance, an "adequate" translation can be post-edited just in terms of fluency by a target language native speaker, without any knowledge of the source language. On the other hand, an "inadequate" translation should be sent to a human translator or to another MT system, in order to reach acceptable adequacy. Effective automatic binary classification has an evident positive impact on such a workflow.

Features                    10-fold acc.  ∆
F                           65.85%        11.85
Syn+SSyn                    69.59%        15.59
F+Syn+SSyn                  70.89%        16.89
F+Syn+SSyn+DR               71.39%        17.39
F+Syn+SSyn+DR+PT            71.92%        17.92
F+Syn+SSyn+DR+PT+SPT        72.21%        18.21
MjC                         54%           -

Table 5.6: Accuracy of the binary classification into "good" and "bad" over the 16K quality-based dataset.

16K quality-based dataset

We grouped the quality scores in the 4-point scale into two classes, where scores {1,2} are considered as "bad" or "inadequate", while {3,4} are taken as "good" or "adequate". We carried out learning and classification using different sets of features with 10-fold cross-validation. We also compared our accuracy with the MjC baseline, and calculated the improvement of each model (∆) against it. The results reported in Table 5.6 demonstrate that the accuracy of our models is always significantly superior to the MjC baseline. Moreover, also in this case there is a steady improvement using syntactic and semantic features over the results obtained by surface form features. Additionally, it is worth mentioning that the best model's improvement over the baseline (∆) is much higher (about 18%) than the improvement reported in Specia et al. [2009] over the same dataset (about 8%), considering the average score obtained with their data distribution. This confirms the effectiveness of our CLTE approach also in classifying "good" and "bad" translations.

Features                    10-fold acc.  ∆
F                           83.24%        12.84
Syn+SSyn                    83.67%        13.27
F+Syn+SSyn                  84.31%        13.91
F+Syn+SSyn+DR               84.86%        14.46
F+Syn+SSyn+DR+PT            84.96%        14.56
F+Syn+SSyn+DR+PT+SPT        85.20%        14.80
MjC                         70.4%         -

Table 5.7: Accuracy of the binary classification into "good" and "bad" over the WMT07 adequacy-based dataset.

WMT07 adequacy-based dataset

We mapped the 5-point scale adequacy scores into two classes, with {1,2,3} judgements assigned to the "inadequate" class, and {4,5} judgements assigned to the "adequate" class. The main motivation for this distribution was to separate the examples in such a way that adequate translations are substantially acceptable, while inadequate translations present evident meaning discrepancies with the source. The results reported in Table 5.7 show that the accuracy of the binary classifiers in distinguishing between the "adequate" and "inadequate" classes was significantly superior (up to about 15%) to the MjC baseline. We also notice that surface form features make a significant contribution on the adequacy-oriented dataset, while the gain obtained using syntactic and semantic features (2%) is lower than the improvement observed on the 16K dataset. This might be due to the more unbalanced distribution of the classes, which: i) leads to a high baseline, and ii) together with the small size of the WMT07 dataset, makes supervised learning more challenging. Finally, the improvement of all models (∆) over the MjC baseline is much higher than the gain reported in Specia et al. [2011] over their adequacy-oriented dataset (around 2%).
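Both groupings amount to simple thresholding of the human scores. The following minimal sketch records the two label mappings applied before binary classification; the function names are ours, introduced purely for illustration.

    def binarize_quality(score):
        """Map a 1-4 quality score to 'bad' ({1,2}) or 'good' ({3,4})."""
        return "good" if score >= 3 else "bad"

    def binarize_adequacy(score):
        """Map a 1-5 adequacy score to 'inadequate' ({1,2,3}) or 'adequate' ({4,5})."""
        return "adequate" if score >= 4 else "inadequate"

    assert binarize_quality(2) == "bad" and binarize_quality(3) == "good"
    assert binarize_adequacy(3) == "inadequate" and binarize_adequacy(4) == "adequate"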

5.6 Summary

In the effort to integrate semantics into MT technology, we focused on automatic MT evaluation, investigating the potential of cross-lingual textual entailment for adequacy assessment. The underlying assumption is that MT output adequacy can be determined by verifying that an entailment relation holds from the source to the target, and vice-versa. Within this framework, our work makes two main contributions.

First, in contrast with most current metrics based on the comparison between automatic translations and multiple references, we avoid the bottleneck represented by the manual creation of such references. CLTE, in fact, allows us to evaluate the quality of MT output by looking only at source/target pairs.

Second, moving beyond current approaches biased towards fluency or general quality judgements, we isolate the adequacy dimension of the problem, exploring the potential of adequacy-oriented features extracted from the observation of source and target. To achieve our objectives, we successfully extended previous CLTE methods with a variety of linguistically motivated features. Altogether, such features led to reliable judgements that show high correlation with human evaluation. Coherent results on different datasets and classification schemes demonstrate the effectiveness of the approach and its potential for different applications [Mehdad et al., 2012b]. We plan to explore the integration of our model as an error criterion in SMT system training. Although efficiency issues were out of the scope of this thesis, a necessary condition for integrating our method into SMT technology in the future is to optimize it in terms of efficiency.


Chapter 6

Crowdsourcing for CLTE Dataset Creation

6.1 Introduction

As for other NLP applications, both in monolingual and cross-lingual TE, the availability of large quantities of annotated data is an enabling factor for system development and evaluation. However, until now, the scarcity of such data on one hand, and the costs of creating new datasets of reasonable size on the other, have represented a bottleneck for a steady advancement towards state-of-the-art performance. In the last few years, monolingual TE corpora for English and other European languages have been created and distributed in the framework of several evaluation campaigns, including the RTE Challenge,1 the Answer Validation Exercise at CLEF,2 and the Textual Entailment task at EVALITA.3 Despite the differences in the design of these tasks, all the released datasets were collected through similar procedures, always involving expensive manual work by expert annotators.

1 http://www.nist.gov/tac/2011/RTE/
2 http://nlp.uned.es/clef-qa/ave/
3 http://www.evalita.it/2009/tasks/te


Additionally, in the data creation process, large amounts of hand-crafted T-H pairs often have to be discarded in order to retain only those featuring full agreement, in terms of the assigned entailment judgements, among multiple annotators. The amount of discarded pairs is usually high, thus contributing to the incremental costs of creating textual entailment datasets.4

The issues related to the shortage of datasets and the high costs of their creation are even more evident in the CLTE scenario, since: i) the task is relatively new, and there are no available datasets for the development/evaluation cycle of CLTE algorithms; moreover, there are no terms of comparison between cross-lingual methods and monolingual ones; and ii) the application of the standard methods adopted to build RTE pairs requires proficiency in multiple languages, which significantly increases the costs of the data creation process. To address these issues, in this chapter we devise cost-effective methodologies to create cross-lingual textual entailment corpora. In particular, we focus on two different strategies:

1. Taking advantage of an already available monolingual corpus, by casting the problem as a translation one. The challenge consists in taking a publicly available RTE dataset of English T-H pairs (i.e. the PASCAL-RTE3 dataset5) and creating its English-Spanish CLTE equivalent by translating the hypotheses into Spanish.

2. Generating aligned CLTE corpora for different language combinations from scratch, without considering the available monolingual datasets.

The following sections overview our methodologies and the experiments carried out for both scenarios.

4 For instance, in the first five RTE Challenges, the average effort needed to create 1,000 pairs featuring full agreement among 3 annotators was around 2.5 person-months. Typically, around 25% of the original pairs had to be discarded during the process, due to low inter-annotator agreement [Bentivogli et al., 2009].
5 Available at: http://www.nist.gov/tac/data/RTE/index.html


6.2 Crowdsourcing

The availability and the increasing popularity of crowdsourcing services have been considered as an interesting opportunity to meet the aforementioned needs and design criteria. One of the most popular crowdsourcing services is Amazon Mechanical Turk (MTurk),6 "a crowdsourcing Internet marketplace that enables computer programmers (known as Requesters) to co-ordinate the use of human intelligence to perform tasks which computers are unable to do [...] The Requesters are able to pose tasks known as HITs (Human Intelligence Tasks) [...] Workers [also known as "turkers"] can then browse among existing tasks and complete them for a monetary payment set by the Requester. To place HITs, the requesting programs use an open Application Programming Interface [...] Requesters can ask that Workers fulfill Qualifications before engaging a task, and they can set up a test in order to verify the Qualification. They can also accept or reject the result sent by the Worker, which reflects on the Worker's reputation. Currently, workers can have an address anywhere in the world [...] Requesters, which are typically corporations, pay 10 percent over the price of successfully completed HITs to Amazon".7

Crowdsourcing services have recently been used with success for a variety of NLP applications [Callison-Burch & Dredze, 2010]. Although MTurk is directly accessible only to US citizens, the CrowdFlower service8 provides a crowdsourcing interface to MTurk for non-US citizens.

6 https://www.mturk.com/mturk/
7 Taken from http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
8 http://crowdflower.com/

The main idea in crowdsourcing the creation of NLP resources is that the acquisition and annotation of large datasets, needed to train and evaluate NLP tools and applications, can be carried out in a cost-effective manner by defining simple Human Intelligence Tasks (HITs) routed to a crowd of non-expert workers (aka "Turkers") hired through on-line marketplaces. The design of data acquisition HITs has to take into account several factors, each having a considerable impact on the difficulty of instructing the workers, the quality and quantity of the collected data, and the time and overall costs of the acquisition. In addition, a major distinction has to be made between jobs requiring data annotation, and those involving content generation. In the former case, Turkers are presented with the task of labelling input data referring to a fixed set of possible values (e.g. making a choice between multiple alternatives, or assigning numerical scores to rank the given data). In the latter case, Turkers are faced with creative tasks consisting in the production of textual material (e.g. writing a correct translation, or a summary of a given text). Overall, the ease of controlling the quality of the acquired data depends on the nature of the job. For annotation jobs, quality control mechanisms can be easily set up by calculating Turkers' agreement, by applying voting schemes, or by adding hidden gold units to the data to be annotated.9 In contrast, the quality of the results of content generation jobs is harder to assess, due to the fact that multiple valid results are acceptable (e.g. the same content can be expressed, translated, or summarized in different ways). In such situations the standard quality control mechanisms are not directly applicable, and the detection of errors requires either costly manual verification at the end of the acquisition process, or more complex and creative solutions integrating HITs for quality checks.

9 Both MTurk and CrowdFlower provide means to check workers' reliability, and to weed out untrusted ones without wasting money. These include different types of qualification mechanisms, the possibility of giving work only to known trusted Turkers (only with MTurk), and the possibility of adding hidden gold standard units to the data to be annotated (offered as a built-in mechanism only by CrowdFlower).

As regards textual entailment, the first work exploring the use of crowdsourcing services for data annotation is described in Snow et al. [2008], which shows high agreement between non-expert annotations of the RTE-1 dataset and existing gold standard labels assigned by expert labellers.

Their approach involves qualitative analysis of the collected data only a posteriori, after manual removal of invalid and trivial generated hypotheses. In contrast, our approaches integrate quality control mechanisms at all stages of the data collection/annotation process, thus minimizing the recourse to experts to check the quality of the collected material.

6.3 RTE3-derived CLTE dataset

In Chapter 3, we proposed CLTE as a generic framework for modelling language variability at the cross-lingual level. Obviously, any effort in this direction becomes ineffective in the absence of CLTE datasets, since without them it would be impossible to develop, evaluate and improve our solutions. As a first step in this direction, taking advantage of the available RTE-3 dataset, we cast the problem as one of translating the hypotheses into Spanish, hiring non-expert workers through the CrowdFlower channel to MTurk. Having a CLTE dataset originated from the available RTE data can also provide a term of comparison between cross-lingual models and monolingual RTE approaches.

The following subsections overview our data acquisition process, the successive approximations that led to the definition of our methodology, and the lessons learned at each step. In order to verify the feasibility of our methodology for fast and cheap data creation, all experiments were carried out under strict time (10 days) and cost ($100) limitations.


6.3.1 Methodology

Starting from the RTE3 Development set (800 English T-H pairs), our corpus creation process has been organized in sentence translation-validation cycles, defined as separate "jobs" routed to CrowdFlower's workforce. At the first stage of each cycle, the original English hypotheses are used to create a translation job for collecting their Spanish equivalents. At the second stage, the collected translations are used to create a validation job, where multiple judges are asked to check the correctness of each translation, given the English source. Translated hypotheses that are positively evaluated by the majority of trustful validators (i.e. those judged correct with a confidence above 0.8) are retained, and directly stored in our CLTE corpus together with the corresponding English texts. The remaining ones are used to create a new translation job. The procedure is iterated until substantial agreement for each translated hypothesis is reached. As regards the first phase of the cycle, we defined our translation HIT as follows:

In this task you are asked to:

• First, judge if the Spanish sentence is a correct translation of the English sentence. If the English sentence and its Spanish translation are blank (marked as -), you can skip this step.

• Then, translate the English sentence above the text box into Spanish. Please make sure that your translation is:

1. Faithful to the original phrase in both meaning and style.

2. Grammatically correct.

3. Free of spelling errors and typos.

Don't use any automatic (machine) translation tool! You can have a look at any on-line dictionary or reference for the meaning of a word.


This HIT asks workers to first check the quality of an English-Spanish translation (used as a gold unit), and then write the Spanish translation of a new English sentence. The quality check allows us to collect accurate translations, by filtering out judgments made by workers missing more than 20% of the gold units. As regards the second phase of the cycle, our validation HIT has been defined as follows:

Su tarea es verificar si la traducción dada de una frase del inglés al español es correcta o no. La traducción es correcta si:

1. El estilo y sentido de la frase son fieles a los de la original.

2. Es gramaticalmente correcta.

3. Carece de errores ortográficos y tipográficos.

Nota: el uso de herramientas de traducción automática (máquina) no está permitido!

This HIT asks workers to take binary decisions (Yes/No) on a set of English-Spanish translations including gold units. The title and the description are written in Spanish in order to weed out untrusted workers (i.e. those speaking only English), and to attract the attention of Spanish speakers. In our experiments, both the translation and validation jobs have been defined in several ways, trying to explore different strategies to quickly collect reliable data in a cost-effective way. Such cost reduction effort led to the following differences between our work and similar approaches documented in the literature [Callison-Burch, 2009; Snow et al., 2008]:


• Previous works built on redundancy of the collected translations (up to 5 for each source sentence), thus resulting in more costly jobs. For instance, adopting a redundancy-based approach to collect 5 translations per sentence at the cost of $0.01 each, and 5 validations per translation at the cost of $0.002 each, would result in $80 for 800 sentences. Assuming that the translation process is complex and expensive, our cycle-based technique builds on simple and cheap validation mechanisms that drastically reduce the number of translations required. In our case, 1 translation per sentence at the cost of $0.01, and 5 validations per translation at the cost of $0.002 each, would result in $32 for 800 sentences, making a conservative assumption of up to 8 iterations with 50% wrong translations at each cycle (i.e. 800 sentences in the first cycle, 400 in the second, 200 in the third, etc.); a worked computation of both estimates is sketched after this list.

• Previous works involving validation of the collected data are based on ranking/voting mechanisms, where workers are asked to order a number of translations, or to select the best one given the source. Our approach to validation is based on asking workers to take binary decisions over source-target pairs. This results in an easier, faster, and eventually cheaper task.

• Previous works did not use any specific method to qualify the workers' knowledge, apart from post-hoc agreement computation. Our approach systematically includes gold units to filter out untrusted workers during the process. As a result, we pay only for qualified judgments.
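The two cost estimates above follow from simple arithmetic, which the sketch below makes explicit; the per-item prices are those stated above, and the 8-iteration halving schedule is the conservative assumption just mentioned.

    SENTENCES = 800

    # Redundancy-based approach: 5 translations per sentence at $0.01 each,
    # each translation checked by 5 validators at $0.002 each.
    redundancy_cost = SENTENCES * (5 * 0.01 + 5 * 5 * 0.002)   # = $80.00

    # Cycle-based approach: 1 translation plus 5 validations per sentence,
    # re-translating the ~50% rejected at each of up to 8 iterations
    # (800 + 400 + 200 + ... sentences overall).
    to_translate = [SENTENCES // 2 ** i for i in range(8)]     # 800, 400, ..., 6
    cycle_cost = sum(n * (0.01 + 5 * 0.002) for n in to_translate)  # ~= $32

    print(f"redundancy: ${redundancy_cost:.2f}, cycles: ${cycle_cost:.2f}")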

6.3.2 Experiments and lessons learned

The overall methodology, and the definition of the HITs described in Section 6.3.1, are the result of successive approximations that took into account two correlated aspects: the quality of the collected translations, and the current limitations of the CrowdFlower service. On one side, the simpler, cheaper, and faster jobs launched at the beginning of our experiments had to be refined to improve the quality of the retained translations. On the other side, ad-hoc solutions had to be found to cope with the limited quality control functionalities provided by CrowdFlower. In particular, the lack of regional qualifications for the workers,10 and of any qualification test mechanism (useful features of MTurk), raised the need of defining more controlled, but also more expensive jobs. Table 6.1 and the rest of this section summarize the progress of our work in defining the adopted methodology, the main improvements experimented with at each step, the overall costs, and the lessons learned.

Step 1: a naïve approach. Initially, translation/validation jobs were defined without using qualification mechanisms, giving permission to any worker to complete our HITs. In this phase, our goal was to estimate the trade-off between the required development time, the overall costs, and the quality of translations collected in the most naïve conditions. As expected, the job accomplishment time was negligible, and the overall cost very low. More specifically, it took about 1 hour to translate the 800 hypotheses at the cost of $12, and less than 6 hours to obtain 5 validations per translation at the same cost of $12. Nevertheless, as revealed by further experiments with the introduction of gold units, the quality of the collected translations was poor. In particular, 61% of them should have been rejected, often due to gross mistakes. As an example, among the collected material several translations in languages other than Spanish revealed a massive and defective use of on-line translation tools by untrusted workers, as also observed by Callison-Burch [2009].

10 This service was added to CrowdFlower after our preliminary experiments, at our request.

Step 2: reducing validation errors. A first improvement addressed the validation phase, where we introduced gold units as a mechanism to qualify the workers, and consequently prune the untrusted ones. To this aim, we launched the validation HIT described in Section 6.3.1, adding around 50 English-Spanish control pairs. The pairs (equally distributed into positive and negative samples) were extracted from the collected data, and manually checked by a Spanish native speaker. The positive effect of using gold units has been verified in two ways. First, we checked the quality of the translations collected in the first naïve translation job, by counting the number of rejections (61%) after running the improved validation job. Then, we manually checked the quality of the translations retained with the new job. A manual check on 20% of the retained translations was carried out by a Spanish native speaker, resulting in 97% accuracy. The 3% errors encountered are equally divided into minor translation errors, and controversial (but substantially acceptable) cases due to regional Spanish variations. The considerable quality improvement observed has been obtained with a 25% increase in cost (less than $3). However, as regards accomplishment time, adding the gold units to qualify workers led to a considerable increase in duration (about 4 days for the first iteration). This is mainly due to the high number of automatically rejected judgments obtained from untrusted workers missing the gold units. Because of the discrepancy between trusted and untrusted judgments, we faced another limitation of the CrowdFlower service, which further delayed our experiments. Often, in fact, the rapid growth of untrusted judgments activates automatic pausing mechanisms, based on the assumption that the gold units are not accurate. This, however, is a strong assumption which does not take into account the huge amount of non-qualified workers accepting (or even just playing with) the HITs.11 For instance, in our case the vast majority of errors came from workers located in specific regions where the native language is neither Spanish nor English.

Elapsed time  Cost    Focus                              Lessons learned
1 day         $24     Approaching CrowdFlower,           Need of qualification mechanism
                      defining a naïve methodology
7 days        $58     Improving validation               Qualification mechanisms (gold units
                                                         and regional) are effective; need of
                                                         payment increase to boost speed
9 days        $99.75  Improving translation              Combined HIT for qualification;
                                                         payment increase worked!
10 days       $99.75  Obtaining bi-lingual RTE corpus    Fast, cheap, and reliable method

Table 6.1: Creating an RTE3-derived CLTE dataset with $100 in a 10-day rush (summary and lessons learned). The reported costs are cumulative.

Step 3: reducing translation errors. The improvement obtained by introducing gold units in the validation phase led us to the definition of a new translation task, also involving a similar qualification mechanism. To this aim, due to language variability, it was clearly impossible to use reference translations as gold units. Taking into account the limitations of the CrowdFlower interface, which does not allow setting qualification tests or splitting the jobs into sequential subtasks (other effective and widely used features of MTurk), we solved the problem by defining the translation HITs as described in Section 6.3.1. This solution combines a validity check and a translation task, and proved to be effective, with a decrease in the translations eventually rejected (45%).

11 The auto-pausing system was modified by CrowdFlower after we reported the problems encountered.

Step 4: reducing time. Considering the extra time required by the use of gold units, we decided to spend more money on each HIT to boost the speed of our jobs. In addition, to overcome the delays caused by the automatic pausing mechanism, we obtained from CrowdFlower the possibility to impose regional qualifications, as commonly used in MTurk. As expected, both solutions proved to be effective, and contributed to the final definition of our methodology. On one side, by doubling the payment for each task (from $0.01 to $0.02 for each translation, and from $0.002 to $0.005 for each validation), we halved the time required to finish each job. On the other side, by imposing the regional qualification, we eventually avoided unexpected automatic pauses.

6.3.3 Results

The limited costs, together with the short time required to acquire reliable results, demonstrate the effectiveness of crowdsourcing services for simple sentence translation tasks. Overall, less than $100 were spent in 10 days to define our methodology, with 426 collected pairs as a by-product. It is worth remarking that applying the final methodology to create the full corpus from scratch would cost about $30. Following this successful methodology, we then launched the same HITs to translate into Spanish all the hypotheses of the RTE-3 test set and the remaining hypotheses of the RTE-3 development set. This resulted in the RTE3-derived CLTE dataset containing 1,600 pairs (800 for training and 800 for test), which was released and used for our experiments in Chapter 3.


6.4 Content Synchronization CLTE dataset

In this section we devise another cost-effective methodology to create a cross-lingual textual entailment corpus from scratch. In particular, we concentrate our efforts on the following problems:

1. Is it possible to collect T-H pairs minimizing the intervention of expert annotators? To address this question, we explore the feasibility of crowdsourcing the corpus creation process. As a contribution beyond the few works on TE/CLTE data acquisition [Wang & Callison-Burch, 2010; Negri & Mehdad, 2010], we define an effective methodology that: i) does not involve experts in the most complex (and costly) stages of the process, ii) does not require pre-processing tools, and iii) does not rely on the availability of already annotated RTE corpora.

2. How can we guarantee good quality of the collected data at a low cost? We address the quality control issue through the decomposition of a complex task (i.e. creating and annotating entailment pairs) into smaller sub-tasks. Complex tasks are usually hard to explain in a simple way understandable to non-experts, difficult to accomplish, and not suitable for the application of the quality-check mechanisms provided by current crowdsourcing services. Our "divide and conquer" solution represents the first attempt to address a complex task involving content generation and labelling through the definition of a cheap and reliable pipeline of simple tasks which are easy to define, accomplish, and control.

3. Can we adapt such a methodology to collect cross-lingual T-H pairs? We tackle this question by separating the problem of creating and annotating TE pairs from the issues related to the multilingual dimension. Our solution builds on the assumption that entailment annotations can be projected across aligned T-H pairs in different languages. In this way, a complex multilingual task is reduced to a sequence of simpler subtasks where the most difficult one, the generation of entailment pairs, is entirely monolingual. Besides ensuring cost-effectiveness, our solution allows us to overcome the problem of finding workers that are proficient in multiple languages. Moreover, since the core monolingual tasks of the process are carried out by manipulating English texts, we are able to address the very large community of English-speaking workers, with a considerable reduction of costs and execution time.

Finally, as a by-product of our method, the acquired pairs are fully aligned for all language combinations, thus enabling meaningful comparisons between scenarios of different complexity (monolingual TE, and CLTE between close or distant languages).

Positioning our methodology among previous works, most of the approaches to content generation proposed so far rely on post hoc verification to filter out undesired low-quality data [Mrozinski et al., 2008; Mihalcea & Strapparava, 2009; Wang & Callison-Burch, 2010]. The few solutions integrating validation HITs address the translation of single sentences [Bloodgood & Callison-Burch, 2010]. Compared to sentence translation, the task of creating CLTE pairs is both harder to explain without resorting to notions that are difficult for non-experts to understand (e.g. "semantic equivalence", "unidirectional entailment"), and harder to execute without mastering these notions. To tackle these issues, the "divide and conquer" approach described in the next section consists in the decomposition of a difficult content generation job into easier subtasks that are: i) self-contained and easy to explain, ii) easy to execute without any NLP expertise, and iii) suitable for the integration of a variety of runtime control mechanisms (regional qualifications, gold units, "validation HITs") able to ensure a good quality of the collected material.

6.4.1 Methodology

Our approach builds on a pipeline of HITs routed to MTurk's workforce through the CrowdFlower interface. The objective is to collect aligned T-H pairs for different language combinations, reproducing an RTE-like annotation style. However, our annotation is not limited to the standard RTE framework, where only unidirectional entailment from T to H is considered. As a useful extension, we annotate any possible entailment relation between the two text fragments, including: i) bidirectional entailment (i.e. semantic equivalence between T and H), ii) unidirectional entailment from T to H, and iii) unidirectional entailment from H to T. The resulting pairs can be easily used to generate not only standard RTE datasets,12 but also general-purpose collections featuring multi-directional entailment relations.

Data Acquisition and Annotation

We collect large amounts of CLTE pairs by carrying out the most difficult part of the process (the creation of entailment-annotated pairs) at a monolingual level. Starting from a set of parallel sentences in n languages (e.g. L1, L2, L3), n entailment corpora are created: one monolingual (L1/L1), and n-1 cross-lingual (L1/L2 and L1/L3). The monolingual corpus is obtained by modifying the sentences in one language only (L1). Original and modified sentences are then paired and annotated to form an entailment dataset for L1. The CLTE corpora are obtained by combining the modified sentences in L1 with the original sentences in L2 and L3, and projecting to the multilingual pairs the annotations assigned to the monolingual pairs.

12 With the positive examples drawn from bidirectional and unidirectional entailments from T to H, and the negative ones drawn from unidirectional entailments from H to T.

In principle, only two stages of the process require crowdsourcing multilingual tasks, and neither concerns entailment annotation. The first one, at the beginning of the process, aims to obtain a set of parallel sentences to start with, and can be done in different ways (e.g. crowdsourcing the translation of a set of sentences). The second one, at the end of the process, consists of translating the modified L1 sentences into other languages (e.g. L2) in order to extend the corpus to cover new language combinations (e.g. L2/L2, L2/L3). The execution of the two "multilingual" stages is not strictly necessary, but depends on: i) the availability of parallel sentences to start the process, and ii) the actual objectives in terms of language combinations to be covered.13 As regards the first stage, in this work we started from a set of 467 English/Italian/German aligned sentences extracted from parallel documents downloaded from the Cafebabel European Magazine.14 Concerning the second multilingual stage, we performed only one round of translations from English to Italian, to extend the 3 combinations obtained without translations (ENG/ENG, ENG/ITA, and ENG/GER) with the new language combinations ITA/ITA, ITA/ENG, and ITA/GER. The main steps of our corpus creation process, depicted in Figure 6.1, can be summarized as follows:

Step 1: Sentence modification. The original English sentences (ENG) are modified through (monolingual) generation HITs asking Turkers to: i) preserve the meaning of the original sentences using different surface forms, or ii) slightly change their meaning by adding or removing content. Our assumption, in line with Bos et al. [2009], is that another way to think about entailment is to consider whether one text T1 adds new information to the content of another text T: if so, then T is entailed by T1. The result of this phase is a set of texts (ENG1) that can be of three types:

1. Paraphrases of the original ENG texts, that will be used to create bidirectional entailment pairs (ENG ↔ ENG1);

2. More specific sentences (the outcome of content addition operations), used to create ENG ← ENG1 unidirectional entailment pairs;

3. More general sentences (the outcome of content removal operations), used to create ENG → ENG1 unidirectional entailment pairs.

13 Starting from parallel sentences in n languages, the n corpora obtained without resorting to translations can be augmented, by means of translation HITs, to create the full set of language combinations. Each round of translation adds 1 monolingual corpus and n-1 CLTE corpora.
14 http://www.cafebabel.com/

[Figure 6.1: Content Synchronization corpus creation process. The diagram shows the GER/ENG/ITA parallel sentences feeding Step 1 (monolingual sentence modification, producing ENG1), Step 2 (monolingual TE annotation, producing the monolingual TE corpus) and Step 3 (multilingual translation, producing ITA1), with TE annotation projection yielding the cross-lingual TE corpora.]

Step 2: TE annotation. Entailment pairs composed of the original sentences (ENG) and the modified ones (ENG1) are used as input to (monolingual) annotation HITs asking Turkers to decide which of the two texts contains more information. As a result, each ENG/ENG1 pair is annotated as an example of unidirectional/bidirectional entailment, and stored in the monolingual English corpus. Since the original ENG texts are aligned with the ITA and GER texts, the entailment annotations of the ENG/ENG1 pairs can be projected to the other language pairs, and the ITA/ENG1 and GER/ENG1 pairs are stored in the CLTE corpus. The possibility of projecting TE annotations is based on the assumption that semantic information is mostly preserved during the translation process. This particularly holds at the denotative level (i.e. regarding the truth values of the sentence), which is crucial to semantic inference. At other levels (e.g. lexical) there might be slight semantic variations which, however, are very unlikely to play a crucial role in determining entailment relations.

Step 3: Translation. The modified sentences (ENG1) are translated into Italian (ITA1) through (multilingual) generation HITs reproducing the approach described in the previous section. As a result, three new datasets are produced by automatically projecting annotations: the monolingual ITA/ITA1, and the cross-lingual ENG/ITA1 and GER/ITA1.
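The projection of entailment annotations in Steps 2 and 3 is essentially an index-based copy. The following sketch illustrates it on invented toy data: since sentence i is aligned across languages, an ENG1 variant inherits, with each aligned original, the relation it holds with the ENG original.

    # Toy aligned originals (index i aligns the same sentence across languages).
    eng = ["The cat sleeps on the sofa."]
    ita = ["Il gatto dorme sul divano."]
    ger = ["Die Katze schläft auf dem Sofa."]

    # Monolingual pairs: (index of original, modified ENG1 text, relation).
    eng_pairs = [(0, "The cat sleeps.", "ENG -> ENG1")]

    # Project annotations: each ENG1 text keeps, with the aligned ITA/GER
    # originals, the same entailment relation it has with the ENG original.
    clte_ita = [(ita[i], eng1, rel) for i, eng1, rel in eng_pairs]
    clte_ger = [(ger[i], eng1, rel) for i, eng1, rel in eng_pairs]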

Since the solution adopted for sentence translation presents no novel aspects, the remainder of this section omits further details on it. Instead, the following sections focus on the more challenging tasks of sentence modification and TE annotation.


Figure 6.2: Sentence modification and TE annotation pipeline.

Crowdsourcing Sentence Modification and TE Annotation

Sentence modification and TE annotation have been decomposed into a pipeline of simpler monolingual English sub-tasks. This pipeline, depicted in Figure 6.2, involves several types of generation/annotation HITs designed to be easily understandable to non-experts. Each HIT consists of: i) a set of instructions for a specific task (e.g. paraphrasing a text), ii) the data to be manipulated (e.g. an English sentence), and iii) a test to check workers' reliability. To cope with the quality control issues discussed in Section 6.2, such tests are realized using gold standard units, either hidden in the data to be annotated (annotation HITs) or defined as test questions that workers must correctly answer (generation HITs). Moreover, regional qualifications are applied to all HITs. As a further quality check, all the annotation HITs consider Turkers' agreement as a way to filter out low quality results (only annotations featuring agreement among 4 out of 5 workers are retained). The six HITs defined for each subtask can be described as follows:

1. Paraphrase (generation). Modify an English text (ENG) in order to produce a semantically equivalent variant (ENG1). As a reliability test, before creating the paraphrase workers are asked to judge if two English sentences contain the same information.

2. Grammaticality (annotation). Decide if an English sentence is grammatically correct. This validation HIT represents a quality check on the output of each generation task (i.e. the paraphrasing and add/remove information HITs).

3. Bidirectional Entailment (annotation). Decide whether two English sentences, the original ENG and the modified ENG1, contain the same information (i.e. are semantically equivalent).

4a. Add Information (generation). Modify an English text to create a more specific one by adding content. As a reliability test, before generating the new sentence workers are asked to judge which of two given English sentences is more detailed.

4b. Remove Information (generation). Modify an English text to create a more general one by removing part of its content. As a reliability test, before generating the new sentence workers are asked to judge which of two given English sentences is less detailed.

5. Unidirectional Entailment (annotation). Decide which of two English sentences (the original ENG, and a modified ENG1) provides more information.

These HITs are combined in an iterative process that alternates text generation, grammaticality check, and entailment annotation steps. As a result, for each original ENG text we obtain multiple ENG1 variants of the three types (paraphrases, more general texts, and more specific texts) and, in turn, a set of annotated monolingual (ENG/ENG1) TE pairs.
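To make the control flow of this pipeline concrete, here is a schematic sketch in which every method of the hypothetical `crowd` object stands for one of the HITs above and is assumed to return only results that passed the gold-unit checks and the 4-out-of-5 agreement threshold; the 25%/75% routing of equivalent paraphrases described later is omitted for brevity.

    def annotate(original, candidate, crowd, pairs):
        """HIT 2, HIT 3 and, when needed, HIT 5 for one candidate sentence.
        Returns True if the candidate is a valid, semantically equivalent
        paraphrase (bidirectional entailment)."""
        if not crowd.grammatical(candidate):                   # HIT 2
            return False
        if crowd.bidirectional(original, candidate):           # HIT 3
            pairs.append((original, candidate, "<->"))
            return True
        # HIT 5: '->' (ENG entails ENG1) or '<-' (ENG1 entails ENG)
        pairs.append((original, candidate,
                      crowd.unidirectional(original, candidate)))
        return False

    def run_pipeline(original, crowd):
        """Schematic flow of HITs 1-5 for one original ENG sentence."""
        pairs = []
        for eng1 in crowd.paraphrase(original):                # HIT 1
            if annotate(original, eng1, crowd, pairs):
                # equivalent paraphrases feed the add/remove steps
                for variant in (crowd.add_info(eng1)           # HIT 4a
                                + crowd.remove_info(eng1)):    # HIT 4b
                    annotate(original, variant, crowd, pairs)
        return pairs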


As described in Section 6.4.1 (data acquisition and annotation), the resulting monolingual English TE corpus (ENG/ENG1) is used to create the following mono/cross-lingual TE corpora:

• ITA/ENG1 and GER/ENG1 (by projecting TE annotations);

• ITA/ITA1, GER/ITA1, and ENG/ITA1 (by translating the ENG1 texts into Italian, and projecting TE annotations).

6.4.2 Further Analysis

This section provides a quantitative and qualitative analysis of the results of our corpus creation methodology, focusing on the collected ENG-ENG1 monolingual dataset. It has to be remarked that, as an effect of the adopted methodology, all the observations and the conclusions drawn hold for the collected CLTE corpora as well.

Quantitative Analysis

Table 6.2 provides some details about each step of the pipeline. For each HIT the table presents: i) the number of items (sentences, or pairs of sentences) given as input, ii) the number of items (sentences or annotations) produced as output, iii) the number of items discarded when the agreement threshold was not reached, iv) the number of entailment pairs added to the corpus, v) the time (days and hours) required by the MTurk workforce to complete the job, and vi) the cost of the job.

In HIT-1 (Paraphrase), 1,414 paraphrases were collected by asking for three different meaning-preserving modifications of each of the 467 original sentences.15 From a practical point of view, such redundancy aims to ensure a sufficient number of grammatically correct and semantically equivalent modified sentences. From a theoretical point of view, collecting many variants of a small pool of original sentences aims to create pairs featuring different entailment relations with similar superficial forms. This, in principle, should allow us to obtain a dataset which requires TE systems to focus more on deeper semantic phenomena than on the surface realization of the pairs.

15 Often, crowdsourced jobs return a number of output items that is slightly larger than required, due to the labour distribution mechanism internal to MTurk.

HIT                      # Input  # Output                    # Discarded    # Pairs  MTurk time  Cost ($)
1. Paraphrase            467      1,414                       -              -        5d+10.5h    45.48
2. Grammaticality        1,414    1,326                       88 (6.22%)     -        1d+15h      56.88
3. Bidirectional Ent.    1,326    1,213 (yes=1,205, no=8)     113 (8.52%)    301      3d+2h       53.47
4a. Add Info             452      916                         -              -        3d          37.02
4b. Remove Info          452      923                         -              -        2d+22h      29.73
2. Grammaticality        1,839    1,749                       90 (4.89%)     -        2d+5h       64.37
3. Bidirectional Ent.    1,749    1,438 (yes=148, no=1,290)   311 (17.78%)   148      3d+20.5h    70.52
5. Unidirectional Ent.   1,298    1,171 (491+680)             127 (9.78%)    1,171    8.5h        78.24
TOTAL                    -        -                           721            1,620    22d+11h     435.71

Table 6.2: The monolingual dataset creation pipeline.

The collected paraphrases were sent as input to HIT-2 (Grammaticality). After this validation HIT, the number of acceptable paraphrases was reduced to 1,326 (with 88 discarded sentences, corresponding to 6.22% of the total).

The retained paraphrases were paired with their corresponding original sentences, and sent to HIT-3 (Bidirectional Entailment) to be judged for semantic equivalence. The pairs marked as bidirectional entailments (1,205) were divided into three groups: 25% of the pairs (301) were directly stored in the final corpus, while the ENG1 paraphrases of the remaining 75% (904) were equally distributed between the two next modification steps.

In both HIT-4a (Add Information) and HIT-4b (Remove Information), two new modified sentences were requested for each of the 452 paraphrases received as input. The sentences collected in these generation tasks were respectively 916 and 923.


The new modified sentences were sent back to HIT-2 (Grammaticality) and HIT-3 (Bidirectional Entailment). As a result, 1,438 new pairs were created; out of these, 148 turned out to be bidirectional entailments and were stored in the corpus. Finally, the 1,298 entailment pairs judged as non-bidirectional in the two previously completed HIT-3 rounds (8+1,290) were given as input to HIT-5 (Unidirectional Entailment). The pairs which passed the agreement threshold were classified according to the judgement received, and stored in the corpus as unidirectional entailment pairs.

The analysis of Table 6.2 allows us to formulate some considerations. First, the percentage of discarded items confirms the effectiveness of decomposing complex generation tasks into simpler subtasks that integrate validation HITs and quality checks based on non-experts' agreement. In fact, on average, around 9.5% of the generated items were discarded without experts' intervention.16 Second, the amount of discarded items gives evidence about the relative difficulty of each HIT. As expected, we observe lower rejection rates, corresponding to higher inter-annotator agreement, for grammaticality HITs (5.55% on average) than for more complex entailment-related tasks (12.02% on average).

Looking at costs and execution time, it is hard to draw definite conclusions due to several factors that influence the progress of the crowdsourced jobs (e.g. the fluctuations of Turkers' performance, the time of day at which jobs are posted, the difficulty of setting the optimal cost for a given HIT).17 On the one hand, as expected, the more creative "Add Info" task proved to be more demanding than "Remove Info": even though it was paid more, it still took a little more time to be completed. On the other hand, although the "Unidirectional Entailment" task was expected to be more difficult, and was thus rewarded more than the "Bidirectional Entailment" one, in the end it took notably less time to be completed. Nevertheless, the overall figures ($435, and about 22.5 days of MTurk work to complete the process)18 clearly demonstrate the effectiveness of the approach. Even considering the time needed for an expert to manage the pipeline (i.e. one week to prepare gold units and to handle the I/O of each HIT), these figures show that our methodology provides a cheaper and faster way to collect entailment data in comparison with the RTE average costs reported in Section 6.1.

16 Moreover, it is worth noticing that around 20% of the collected items were automatically rejected (and not paid) due to failures on the gold standard controls created both for generation and annotation tasks.
17 The payment for each HIT was set on the basis of a previous feasibility study aimed at determining the best trade-off between cost and execution time. However, replicating our approach would not necessarily result in the same costs.

As regards the amount of data collected, the resulting corpus contains 1,620 pairs with the following distribution of entailment relations: i) 449 bidirectional entailments, ii) 491 ENG → ENG1 unidirectional entailments, and iii) 680 ENG ← ENG1 unidirectional entailments.

It must be noted that our methodology does not lead to the creation of pairs where some information is provided in one text and not in the other, and vice-versa, as Example 1 shows:

Example 1.
ENG: New theories were emerging in the field of psychology.
ENG1: New theories were rising, which announced a kind of veiled racism.

These negative examples in both directions represent a natural extension of the dataset, relevant also for specific application-oriented scenarios, and their creation will be addressed in future work.

Besides the achievement of our primary objectives, the adopted approach led to some interesting by-products. First, the generated corpora are perfectly suitable to produce entailment datasets similar to those used in the traditional RTE evaluation framework. In particular, considering any possible entailment relation between two text fragments, our annotation subsumes the one proposed in the RTE campaigns. This allows for the cost-effective generation of RTE-like annotations from the acquired corpora, by combining the ENG ↔ ENG1 and ENG → ENG1 pairs to form 940 positive examples (449+491), and keeping the 680 ENG ← ENG1 pairs as negative examples. Moreover, by swapping ENG and ENG1 in the unidirectional entailment pairs, 491 additional negative examples and 680 additional positive examples can be easily obtained.

18 Although by projecting annotations the ENG1/ITA and ENG1/GER CLTE corpora came for free, the ITA1/ITA, ITA1/ENG, and ITA1/GER combinations created by crowdsourcing translations added $45 and approximately 5 days to these figures.

Finally, the output of HITs 1-2-3 in Table 6.2 represents per se a valuable collection of 1,205 paraphrases. This suggests the great potential of crowdsourcing for paraphrase acquisition.
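Deriving such RTE-like data from the multi-directional annotations reduces to a small relabelling routine, sketched below with the relation tags '<->' (bidirectional), '->' (T entails H) and '<-' (H entails T); this is an illustration of the combination scheme, not a released conversion script.

    def to_rte_examples(pairs, swap=False):
        """Turn (T, H, relation) triples into RTE-style YES/NO examples.

        Positives: '<->' and '->' pairs; negatives: '<-' pairs. With
        swap=True, each unidirectional pair is also added with T and H
        swapped, turning '->' pairs into extra negatives and '<-' pairs
        into extra positives (491 and 680 respectively on our corpus).
        """
        examples = []
        for t, h, rel in pairs:
            examples.append((t, h, "YES" if rel in ("<->", "->") else "NO"))
            if swap and rel != "<->":
                examples.append((h, t, "YES" if rel == "<-" else "NO"))
        return examples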

Qualitative Analysis

Through manual verification of more than 50% of the corpus (900 pairs), a total of 53 pairs (5.9%) were found to be incorrect. The different errors were classified as follows:

Type 1: Sentence modification errors. Generation HITs are a minor source of errors, being responsible for 10 problematic pairs. These errors are introduced either by generating a false statement (Example 2), or by forming a not fully understandable, awkward, or unnatural sentence (Example 3).

Example 2.

ENG: Kosovo was the subject of major riots in 1989.
ENG1: The Russian city of Kosovo was the subject of ...

Example 3.


ENG: Balat is the Kurdish-Armenian district of Istanbul.
ENG1: Balat is a place, which is the Kurdish-Armenian ...

Type 2: TE annotation errors. The notion of containing more/less information, used in the "Unidirectional Entailment" HIT, can mostly be applied straightforwardly to the entailment definition. However, the concept of "more/less detailed", which generally works for factual statements, is in some cases not applicable. In fact, the MTurk workers have regularly interpreted the instructions about the amount of information as concerning the quantity of concepts contained in a sentence, which does not always correspond to the actual entailment relation between the sentences. As a consequence, 43 pairs featuring wrong entailment annotations were encountered. These errors can be classified as follows:

a) 13 pairs where the added/removed information changes the meaning of the sentence. In these cases, the modified sentence was judged more/less specific than the original one, leading to a unidirectional entailment annotation. On the contrary, in terms of the standard entailment definition, the correct annotation is "no entailment" (as in Example 4, which was annotated as ENG → ENG1):

Example 4.

ENG: If you decide to live in Bulgaria, you have to like difficulties because they are not difficulties, they are challenges.
ENG1: You have to like difficulties as they are not difficulties, they are challenges.

b) 10 pairs where the incorrect annotation is due to a coreference problem, as in:

Example 5.

ENG: John Smith is the new CEO of the company.
ENG1: He is the new CEO of the company.


These pairs were labelled as unidirectional entailments (in the example above, ENG → ENG1), under the assumption that a proper name is more specific and informative than a pronoun. However, adhering to the TE definition, co-referring expressions are equivalent, and their realization does not play any role in the entailment decision. This implies that the correct entailment annotation is "bidirectional".

c) 9 pairs where the sentences are semantically equivalent, but contain a piece of information which is explicit in one sentence and implicit in the other. In these cases, Turkers judged the sentence containing the explicit mention as more specific, and the pair was thus annotated as a unidirectional entailment.

Example 6.

ENG: I hear the click of the trigger and the burst of bullets reach me immediately.
ENG1: I hear the trigger and the burst of bullets reach me instantly.

In Example 6, the expression "the trigger" in ENG1 implicitly means "the click of the trigger", making the two sentences equivalent, and the entailment bidirectional (instead of ENG → ENG1).

d) 7 pairs where the information removed from or added to the sentence is not relevant to the entailment relation. In these cases, the modified sentence was judged less/more specific than the original one (and thus considered as a unidirectional entailment), even though the correct judgement is "bidirectional", as in:

Example 7.

ENG: At the same time, AKP is struggling with its approach to the EU.
ENG1: AKP is struggling with its approach to the European Union.

e) 4 pairs where the added/removed information concerns universally quantified general statements, for which the interpretation of "more/less specific" given by the Turkers resulted in the wrong annotation.


Example 8.

ENG: I think the success of multicultural couples depends on the size of the cultural gap between the two partners.
ENG1: I believe the success of the couples depends on the size of the cultural gap between the 2 partners.

In Example 8, the additional information ("multicultural") restricts the set to which it refers ("couples"), making ENG entailed by ENG1, and not vice-versa as resulted from the Turkers' annotation.

In light of this analysis, we conclude that the sentence modification methodology proved to be successful, as the low number of Type 1 errors shows. Considering that the most expensive phase in the creation of a TE dataset is the generation of the pairs, this is a significant achievement. By contrast, the entailment assessment phase appears to be more problematic, accounting for the majority of errors. As shown by the Type 2 errors, this is due to a partial misalignment between the instructions given in our HITs and the formal definition of textual entailment. For this reason, further experimentation will explore different ways to instruct workers (e.g. asking them to consider proper names and pronouns as equivalent) in order to reduce the amount of errors produced. As a final remark, considering that in the creation of a TE dataset the manual check of the annotated pairs represents a minor cost, even the involvement of experts to filter out wrong annotations would not decrease the cost-effectiveness of the proposed methodology.

Results

The result of our work is the first large-scale dataset containing more than 1,600 aligned pairs for several combinations of texts-hypotheses in English, Italian, and German. Among the advantages of our method, it is worth mentioning: i) the full alignment between the created corpora, ii)

the possibility to easily extend the dataset to new languages, and iii) the feasibility of creating general-purpose corpora, featuring multi-directional entailment relations, that subsume the traditional RTE-like annotation.

In order to take advantage of such a dataset, 800 pairs were manually checked. Then, to provide a more challenging scenario, balancing the entailment judgements and decreasing the correlation between the judgements and the length of the sentences, 500 balanced pairs for each language pair were extracted (a sketch of such a balanced extraction is given below). This dataset has been used for the content synchronization application experiments in Chapter 4.

It is worth mentioning that the resulting dataset is made freely available for research purposes through the website of the funding EU Project CoSyne,19 to contribute to meeting the strong need for resources to develop and evaluate novel applications for textual entailment.
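A minimal sketch of such a balanced extraction step follows; the concrete selection criterion (equal label counts, preference for pairs whose two sentences have similar lengths, so that length alone is not predictive of the judgement) is an assumption for illustration, not the exact procedure used.

```python
# Hedged sketch: extract a label-balanced subset of annotated pairs,
# preferring pairs with a small sentence-length difference.
import random
from collections import defaultdict

def balanced_sample(pairs, size=500, seed=0):
    """pairs: (text, hypothesis, judgement) triples; returns roughly
    `size` pairs with the entailment judgements equally represented."""
    random.seed(seed)
    by_label = defaultdict(list)
    for t, h, label in pairs:
        # rank candidates by length difference between the two sentences
        delta = abs(len(t.split()) - len(h.split()))
        by_label[label].append((delta, t, h, label))
    per_label = size // max(len(by_label), 1)
    sample = []
    for items in by_label.values():
        items.sort(key=lambda item: item[0])  # most length-neutral first
        sample.extend((t, h, l) for _, t, h, l in items[:per_label])
    random.shuffle(sample)
    return sample
```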

6.5 Summary

There is an increasing need for annotated data to develop new solutions to the TE problem, explore new entailment-related tasks, and set up experimental frameworks targeting real-world applications. As a first step in this direction, we took advantage of an already existing monolingual English RTE corpus, casting the problem as a translation task in which Spanish translations of the hypotheses are collected and validated by the workers. As a result, we collected the first CLTE dataset, containing 1,600 entailment pairs for English-Spanish, aligned with the original monolingual RTE-3 dataset (a minimal sketch of this construction is given below). In light of this positive experience, as the next step we explored crowdsourcing data acquisition methods to address the complementary problem of collecting new cross-lingual entailment pairs from scratch.

19http://www.cosyne.eu/
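The translation-based construction can be pictured as follows; this is a minimal sketch assuming RTE pairs are available as (text, hypothesis, label) tuples and that validated Spanish translations of the hypotheses have been collected, with all names purely illustrative.

```python
# Hedged sketch: build English-Spanish CLTE pairs aligned with an RTE
# corpus by keeping the English text and original label, and replacing
# the hypothesis with its crowdsourced, validated Spanish translation.

def to_clte(rte_pairs, spanish_hypotheses):
    """Return cross-lingual (text, hypothesis, label) triples."""
    return [
        (text, spanish_hypotheses[hyp], label)
        for text, hyp, label in rte_pairs
    ]

rte_pairs = [("The cat sat on the mat.", "A cat is on a mat.", "YES")]
translations = {"A cat is on a mat.": "Un gato está sobre una alfombra."}
print(to_clte(rte_pairs, translations))
```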

Besides that, we considered cost effectiveness and replicability as additional requirements. To achieve our objectives, we developed a "divide and conquer" methodology based on crowdsourcing. Our approach presents several key innovations with respect to related work on TE data acquisition. These include the decomposition of a complex content generation task into a pipeline of simpler subtasks accessible to a large crowd of non-experts, and the integration of quality control mechanisms at each stage of the process.

Our work resulted in the first large-scale dataset containing both monolingual and cross-lingual corpora for several combinations of texts-hypotheses in English, Italian, and German. Among the advantages of our method, it is worth mentioning: i) the full alignment between the created corpora, ii) the possibility to easily extend the dataset to new languages, and iii) the feasibility of creating general-purpose corpora, featuring multi-directional entailment relations, that subsume the traditional RTE-like annotation. We used the created datasets to develop and improve CLTE algorithms (details are available in Chapter 3) and their application in the content synchronization scenario (details are available in Chapter 4).

Chapter 7

Conclusion

7.1 Recapitulation

This thesis proposed and discussed Cross-Lingual Textual Entailment [Mehdad et al., 2010b] as a framework to cross the semantic and inference barriers across languages. Our work aims at providing models and insights, not only to bring together machine translation and textual entailment research, but also to deploy effective components for different application scenarios ranging from content synchronization to MT evaluation.

In this direction, taking advantage of our research in monolingual textual entailment [Mehdad et al., 2010a; Mehdad, 2009; Mehdad & Magnini, 2009a; Kouylekov et al., 2010a, 2011], in Mehdad et al. [2010b] we proposed a pivoting approach to CLTE (sketched below). We took advantage of available MT components, using them at the front-end of existing TE engines. The motivation for this solution is that a modular pivoting architecture is easier to develop, debug, and maintain. Moreover, it allows for easy extensions to other languages by just adding extra MT systems (in terms of language pairs). Through different experiments over the two datasets, we achieved promising results, which are even more encouraging considering that, at the cross-lingual level, we obtained results comparable to those calculated over the original monolingual datasets.
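A minimal sketch of the pivoting architecture follows; the `translate` and `entails` components are hypothetical stand-ins for an MT system and a monolingual TE engine, not the actual components used in the thesis.

```python
# Hedged sketch of the pivoting approach: bring a cross-lingual T/H pair
# back to the monolingual case by translating the text, then delegate
# the decision to an unchanged monolingual TE engine.

def pivoting_clte(text_l1, hypothesis_l2, translate, entails):
    """Judge a cross-lingual T/H pair via translation and monolingual TE."""
    text_l2 = translate(text_l1)            # MT system at the front-end
    return entails(text_l2, hypothesis_l2)  # standard monolingual RTE decision

# toy usage with stand-in components
print(pivoting_clte("Il gatto dorme.", "The cat sleeps.",
                    translate=lambda s: "The cat sleeps.",
                    entails=lambda t, h: h in t))
```

The modularity claimed above is visible in the sketch: supporting a new language pair only requires plugging in another `translate` component, while the TE engine stays untouched.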


While the pivoting approach has been promising, the availability of MT components and the noise introduced by translation errors are among the limitations of such a method. To cope with these limitations, we took advantage of a tighter integration of MT and TE algorithms and techniques by proposing an integrated solution [Mehdad et al., 2011]. By extracting lexical and semantic knowledge from parallel corpora, and by using and extending TE techniques, we could avoid dependencies on external MT components. We also successfully extended our integrated method with syntactic and semantic features, which led to results that outperform those obtained with the pivoting solution (a simplified sketch of this feature-based approach is given below).

To further support our claims about the usefulness of CLTE, and the effectiveness of the proposed solutions, we successfully applied them in two interesting application scenarios. Firstly, we addressed the task of synchronizing the content of two documents about the same topic written in different languages. Using a combination of lexical, syntactic, and semantic features to create a CLTE system, we reported several experiments over different datasets, proving the feasibility of detecting semantic equivalence and information disparity by means of CLTE [Mehdad & Negri, 2012; Mehdad et al., 2012a].

Secondly, we focused on automatic MT evaluation, investigating the potential of CLTE for adequacy assessment without the use of reference translations. In this direction, we could isolate the adequacy dimension of the problem, exploring the potential of adequacy-oriented features extracted from the observation of source and target words. Our use of various sets of linguistically motivated features led to reliable judgements that show high correlation with human evaluation in different experimental setups. Moreover, promising results on different classification schemes demonstrated the effectiveness of our approach for integrating semantics into MT technology [Mehdad et al., 2012b].
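The integrated, feature-based approach can be pictured with a deliberately simplified sketch: a phrase table extracted from parallel corpora acts as a cross-lingual dictionary, and the resulting coverage score feeds a supervised classifier. The single feature, the toy phrase table, and the use of scikit-learn's SVC are illustrative assumptions; the actual system combines many lexical, syntactic, and semantic features.

```python
# Hedged sketch of the integrated approach: cross-lingual coverage
# features from a (word-level) phrase table, fed to an SVM classifier.
from sklearn.svm import SVC

def coverage_feature(text_tokens, hyp_tokens, phrase_table):
    """Fraction of hypothesis tokens reachable from the text through
    the cross-lingual phrase table (reduced here to word level)."""
    matched = sum(
        1 for h in hyp_tokens
        if any(h in phrase_table.get(t, ()) for t in text_tokens)
    )
    return [matched / max(len(hyp_tokens), 1)]

# toy Italian -> English lexical phrase table
table = {"nuove": {"new"}, "teorie": {"theories"}}

X = [coverage_feature("nuove teorie".split(), "new theories".split(), table),
     coverage_feature("nuove teorie".split(), "veiled racism".split(), table)]
y = ["entailment", "no_entailment"]
clf = SVC(kernel="linear").fit(X, y)  # trained on labelled CLTE pairs
print(clf.predict([[1.0]]))
```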


Since CLTE was proposed for the first time as the core problem of this thesis, proving the success, effectiveness and potential of the approaches mentioned above would not have been possible without suitable CLTE datasets. To provide large quantities of annotated data to enable the system development phase (e.g. to tune cross-lingual models), we presented cheap, fast and effective automatic procedures to create CLTE datasets by crowdsourcing [Negri et al., 2011; Negri & Mehdad, 2010]. As a result, we collected the first dataset containing both monolingual and cross-lingual corpora for several combinations of texts-hypotheses in English, Italian, and German.

In parallel with this work, a task called Cross-lingual Textual Entailment for Content Synchronization (CLTE@SemEval-2012, task #8)1 has been organized within SemEval 2012 [Negri et al., 2012]. This initiative aims at promoting the research topics proposed in this thesis among the NLP community, and at bringing the semantics and MT communities closer together. We believe that research in this direction can greatly benefit from MT-derived techniques and, at the same time, contribute to a variety of MT-related tasks, ranging from re-scoring MT outputs to adequacy evaluation. We also believe that content synchronization represents a challenging application scenario to test the capabilities of advanced NLP systems.

7.2 Future Directions

Some of the issues addressed in this thesis raise interesting questions, problems and future research directions. In Chapter 2, we investigated possible solutions for monolingual TE, including kernel-based semantic/syntactic learning, phrasal matching algorithms, and the extraction of new lexical resources. On one side, the phrasal matching method allows us to use a large collection

1http://www.cs.york.ac.uk/semeval-2012/task8/

of paraphrases, but limits the algorithm to the use of only lexical resources. On the other side, semantic/syntactic kernels have proved to be efficient and more accurate, but do not allow the use of paraphrases. Integrating those two solutions, by moving from token-based to phrase-based semantic/syntactic kernels, could open new interesting research directions.

In the context of Chapter 3, the overall results using different models suggest that adding relevant linguistic features can improve CLTE performance. These findings suggest that cross-lingual topic modeling and Wikipedia entity linking could also contribute to the advancement of such models and approaches.

Next, in the machine translation research community, there is high interest in integrating MT technology within applications such as computer-aided translation tools. Recent European projects like "Machine Translation Enhanced Computer Assisted Translation (MateCat)", "Cognitive Analysis and Statistical Methods for Advanced Computer Aided Translation (CASMACAT)" and "Moses Open Source Evaluation and Support Coordination for OutReach and Exploitation (MOSESCORE)" confirm such interest. Further work on improving our adequacy evaluation method and integrating it into SMT for optimization purposes could be beneficial for these projects. On one hand, new features capturing other semantic dimensions can be further investigated. On the other hand, the integration of our method as an error criterion in SMT system training can be further studied. A prerequisite for integrating our method into SMT technology in the future is efficiency optimization.

In the TE research community, the use of machine learning methods has always been dominant. Obviously, training data are essential if the core method for approaching TE is supervised learning. Returning to our automatic content synchronization experiments in Chapter 5, one interesting direction that can be investigated further is to tackle issues related to "unknown"

cases that are not covered by the available datasets. Moreover, we are interested in exploring the impact of having more training data for such application scenarios.

Last but not least, we believe that the applications of CLTE are not limited to the ones discussed in this thesis. Indeed, it would be interesting to take advantage of the cross-lingual semantic and inference framework to deal with other multilingual application scenarios, ranging from MT output re-ranking to multilingual semantic search and knowledge representation.


Bibliography

Agerri, R. (2008) Metaphor in Textual Entailment. In: Proceedings of COLING 2008, 22nd International Conference on Computational Linguistics, Posters Proceedings, 18-22 August 2008, Manchester, UK.

Agichtein, E., Askew, W. & Liu, Y. (2008) Combining Lexical, Syntactic, and Semantic Evidence for Textual Entailment Classification. In: Proceedings of TAC.

Aharon, R., Szpektor, I. & Dagan, I. (2010) Generating Entailment Rules from FrameNet. In: Proceedings of the ACL 2010 Conference Short Papers. Association for Computational Linguistics, pp. 241–246.

Androutsopoulos, I. & Malakasiotis, P. (2010) A Survey of Paraphrasing and Textual Entailment Methods. Journal of Artificial Intelligence Research 38(1):135–187.

Avramidis, E., Popovic, M., Torres, D. V. & Burchardt, A. (2011) Evaluate with Confidence Estimation: Machine Ranking of Translation Outputs using Grammatical Features. In: Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT-11), located at EMNLP, July 30-31, Edinburgh, United Kingdom. Association for Computational Linguistics.

Bach, N., Huang, F. & Al-Onaizan, Y. (2011) Goodness: A Method for Measuring Machine Translation Confidence. In: 49th Annual Meeting of the Association for Computational Linguistics. pp. 211–219.

Baker, C. F., Fillmore, C. J. & Lowe, J. B. (1998) The Berkeley FrameNet Project. In: Proceedings of the 17th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, COLING '98.

Balahur, A., Lloret, E., Ferrández, Ó., Montoyo, A., Palomar, M. & Muñoz, R. (2008) The DLSIUAES Team's Participation in the TAC 2008 Tracks. In: Proceedings of the Text Analysis Conference (TAC).

Banerjee, S. & Lavie, A. (2005) METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72.

Bannard, C. & Callison-Burch, C. (2005) Paraphrasing with Bilingual Parallel Corpora. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp. 597–604.

Bar-Haim, R., Berant, J., Dagan, I., Greental, I., Mirkin, S., Shnarch, E. & Szpektor, I. (2008) Efficient Semantic Deduction and Approximate Matching over Compact Parse Forests. In: Proceedings of the Text Analysis Conference (TAC). Gaithersburg, Maryland, USA.

Bensley, J. & Hickl, A. (2008) Application of LCC's GROUNDHOG System for RTE-4. In: Proceedings of the Fourth PASCAL Challenges Workshop on Recognizing Textual Entailment. Gaithersburg, Maryland, USA.


Bentivogli, L., Pianta, E. & Girardi, C. (2002) MultiWordNet: Developing an Aligned Multilingual Database. In: First International Conference on Global WordNet, Mysore, India.

Bentivogli, L., Dagan, I., Dang, H., Giampiccolo, D. & Magnini, B. (2009) The Fifth PASCAL Recognizing Textual Entailment Challenge. In: Proceedings of the TAC 2009 Workshop.

Bentivogli, L., Cabrio, E., Dagan, I., Giampiccolo, D., Leggio, M. & Magnini, B. (2010a) Building Textual Entailment Specialized Data Sets: a Methodology for Isolating Linguistic Phenomena Relevant to Inference. In: Proceedings of LREC 2010.

Bentivogli, L., Clark, P., Dagan, I., Dang, H. T. & Giampiccolo, D. (2010b) The Sixth PASCAL Recognizing Textual Entailment Challenge. In: Proceedings of TAC 2010.

Bentivogli, L., Clark, P., Dagan, I., Dang, H. T. & Giampiccolo, D. (2011) The Seventh PASCAL Recognizing Textual Entailment Challenge. In: Proceedings of TAC 2011.

Berant, J., Dagan, I. & Goldberger, J. (2011) Global Learning of Typed Entailment Rules. In: Proceedings of ACL, Portland, OR.

Bernard, M., Boyer, L., Habrard, A. & Sebban, M. (2008) Learning Probabilistic Models of Tree Edit Distance. Pattern Recognition 41(8):2611–2629.

Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A. & Ueffing, N. (2004) Confidence Estimation for Machine Translation. In: Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, pp. 315–321.


Bloehdorn, S. & Moschitti, A. (2007) Combined Syntactic and Semantic Kernels for Text Classification. In: ECIR.

Bloodgood, M. & Callison-Burch, C. (2010) Using Mechanical Turk to Build Machine Translation Evaluation Sets. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, pp. 208–211.

Bogdan, S., Constantin, O., Christian, S., Shiyan, O., Oscar, F., Milen, K. & Matteo, N. (2008) Entailment-based Question Answering for Structured Data. In: Coling 2008: Companion volume: Posters and Demonstrations. Manchester, UK, pp. 29–32.

Bos, J. (2005) Combining Shallow and Deep NLP Methods for Recognizing Textual Entailment. In: Proceedings of the PASCAL RTE Challenge. pp. 65–68.

Bos, J. & Markert, K. (2005) Recognising Textual Entailment with Logical Inference. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 628–635.

Bos, J. & Oka, T. (2007) A Spoken Language Interface with a Mobile Robot. Artificial Life and Robotics 11(1):42–47.

Bos, J., Zanzotto, F. & Pennacchiotti, M. (2009) Textual Entailment at EVALITA 2009. In: Proceedings of EVALITA.

Burchardt, A., Reiter, N., Thater, S. & Frank, A. (2007) Semantic Approach to Textual Entailment: System Evaluation and Task Analysis. In: Proceedings of the 3rd PASCAL Workshop on Textual Entailment. Prague.


Burchardt, A., Pennacchiotti, M., Thater, S. & Pinkal, M. (2009) Assessing the Impact of Frame Semantics on Textual Entailment. Natural Language Engineering 15:527–550.

Cabrio, E., Kouylekov, M. & Magnini, B. (2008) Combining Specialized Entailment Engines for RTE-4. In: Proceedings of TAC08, 4th PASCAL Challenges Workshop on Recognising Textual Entailment.

Callison-Burch, C. (2009) Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, pp. 286–295.

Callison-Burch, C. & Dredze, M. (2010) Creating Speech and Language Data with Amazon's Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, pp. 1–12.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C. & Schroeder, J. (2007) (Meta-)Evaluation of Machine Translation. In: Proceedings of the Second Workshop on Statistical Machine Translation (WMT07). Association for Computational Linguistics.

Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M. & Zaidan, O. (2010) Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. Association for Computational Linguistics, pp. 17–53.


Carreras, X., Chao, I., Padró, L. & Padró, M. (2004) FreeLing: An Open-Source Suite of Language Analyzers. In: Proceedings of the 4th LREC. vol. 4.

Chang, C. & Lin, C. (2011) LIBSVM: a Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3):27.

Charniak, E. (2000) A Maximum-Entropy-Inspired Parser. In: Proceedings of the 1st NAACL Conference.

Chklovski, T. & Pantel, P. (2004) VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In: Proceedings of EMNLP 2004. Association for Computational Linguistics, Barcelona, Spain.

Cingolani, P. (2005) JSwarm-PSO: Particle Swarm Optimization Package. Available at http://jswarm-pso.sourceforge.net/.

Clark, P. & Harrison, P. (2009) An Inference-based Approach to Recognizing Entailment. In: Proceedings of TAC 2008.

Clinchant, S., Goutte, C. & Gaussier, E. (2006) Lexical Entailment for Information Retrieval. In: Proceedings of ECIR'06.

Collins, M. & Duffy, N. (2002) New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In: Proceedings of ACL '02.

Crammer, K. & Singer, Y. (2002) On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. The Journal of Machine Learning Research 2:265–292.

Cristianini, N. & Holloway, R. (2001) Latent Semantic Kernels.


Dagan, I. & Glickman, O. (2004) Probabilistic Textual Entailment: Generic Applied Modeling of Language Variability. In: Proceedings of the PASCAL Workshop on Learning Methods for Text Understanding and Mining.

Dagan, I., Glickman, O. & Magnini, B. (2005) The PASCAL Recognising Textual Entailment Challenge. In: Proceedings of the First PASCAL Challenges Workshop on Recognising Textual Entailment.

De Marneffe, M., MacCartney, B. & Manning, C. (2006) Generating Typed Dependency Parses from Phrase Structure Parses. In: Proceedings of LREC. vol. 6, pp. 449–454.

Delmonte, R., Bristot, A., Piccolino Boniforti, M. A. & Tonelli, S. (2007) Entailment and Anaphora Resolution in RTE-3. In: Proceedings of the ACL-07 Workshop on Textual Entailment and Paraphrasing.

Denkowski, M. & Lavie, A. (2010) Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 250–253.

Dinu, G. & Wang, R. (2009) Inference Rules and their Application to Recognizing Textual Entailment. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 211–219.

Doddington, G. (2002) Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In: Proceedings of the Second International Conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc., pp. 138–145.


Doina, T., Diana, M. A. & Dana, L. (2008) Text Entailment for Logical Segmentation and Summarization. In: NLDB '08: Proceedings of the 13th International Conference on Natural Language and Information Systems. Springer-Verlag, Berlin, Heidelberg, pp. 233–244.

Dragomir, R. (2000) A Common Theory of Information Fusion From Multiple Text Sources Step One: Cross-Document Structure. In: Proceedings of the 1st SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics, Morristown, NJ, USA, pp. 74–83.

Eberhart, R. C., Shi, Y. & Kennedy, J. (2001) Swarm Intelligence. The Morgan Kaufmann Series in Artificial Intelligence. Morgan Kaufmann.

Faruqui, M. & Padó, S. (2010) Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. In: Proceedings of KONVENS 2010. Saarbrücken, Germany.

Fellbaum, C. (1998) WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA; London.

Ferrández, Ó., Muñoz, R. & Palomar, M. (2009) Alicante University at TAC 2009: Experiments in RTE. In: Proceedings of TAC 2009.

Figueroa, A. & Neumann, G. (2008) Genetic Algorithms for Data-Driven Web Question Answering. Evolutionary Computation 16(1):89–125.

Finkel, J., Grenager, T. & Manning, C. (2005) Incorporating Non-Local Information into Information Extraction Systems by Gibbs Sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp. 363–370.


Galanis, D. & Malakasiotis, P. (2009) AUEB at TAC 2008. In: Proceedings of TAC 2008.

Gamallo Otero, P. & Gonzalez Lopez, I. (2011) A Grammatical Formalism based on Patterns of Part of Speech Tags. International Journal of Corpus Linguistics 16(1):45–71.

Giampiccolo, D., Magnini, B., Dagan, I. & Dolan, B. (2007) The Third PASCAL Recognizing Textual Entailment Challenge. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment.

Giampiccolo, D., Dang, H. T., Magnini, B., Dagan, I. & Cabrio, E. (2008) The Fourth PASCAL Recognizing Textual Entailment Challenge. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment.

Giménez, J. & Màrquez, L. (2007) Linguistic Features for Automatic Evaluation of Heterogenous MT Systems. In: Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, pp. 256–264.

Giuliano, C. (2007) jLSI: a Tool for Latent Semantic Indexing. Software available at http://tcc.itc.it/research/textec/toolsresources/jLSI.html.

Glickman, O. (2006) Applied Textual Entailment. Ph.D. thesis, Department of Computer Science, Bar Ilan University.

Glickman, O., Dagan, I. & Koppel, M. (2006) A Lexical Alignment Model for Probabilistic Textual Entailment. In: Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pp. 287–298.

Harabagiu, S. & Hickl, A. (2006) Methods for Using Textual Entailment in Open-Domain Question Answering. In: Proceedings of ACL'06. Association for Computational Linguistics, Morristown, NJ, USA, pp. 905–912.

Harmeling, S. (2009) Inferring Textual Entailment with a Probabilistically Sound Calculus. Natural Language Engineering 15(04):459–477.

Haussler, D. (1999) Convolution Kernels on Discrete Structures. Tech. rep.

Iftene, A. & Balahur-Dobrescu, A. (2007) Hypothesis Transformation and Semantic Variability Rules Used in Recognizing Textual Entailment. ACL 2007, p. 125.

Iftene, A. & Moruz, M. (2009) UAIC Participation at RTE-5. In: Proceedings of TAC, Gaithersburg, Maryland.

Iordanskaja, L., Kittredge, R. & Polguere, A. (1991) Lexical Selection and Paraphrase in a Meaning-Text Generation Model. Natural Language Generation in Artificial Intelligence and Computational Linguistics 312.

Jiang, J. J. & Conrath, D. W. (1997) Semantic Similarity based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of the 10th ROCLING. Taipei, Taiwan, pp. 132–139.

Joachims, T. (1999a) Making Large-scale Support Vector Machine Learning Practical. MIT Press, Cambridge, MA, USA, pp. 169–184.

Joachims, T. (1999b) Making Large-Scale Support Vector Machine Learning Practical.

Joachims, T. (1999c) SVMlight: Support Vector Machine. http://svmlight.joachims.org/, University of Dortmund.


Klein, D. & Manning, C. D. (2003) Fast Exact Inference with a Factored Model for Natural Language Parsing. In: Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA, pp. 3–10.

Koehn, P., Och, F. J. & Marcu, D. (2003) Statistical Phrase-based Translation. In: NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Association for Computational Linguistics, Morristown, NJ, USA, pp. 48–54.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. & Herbst, E. (2007) Moses: Open Source Toolkit for Statistical Machine Translation. In: Proceedings of ACL07 Demo and Poster Sessions.

Kotb, Y. (2006) Toward Efficient Peer-to-Peer Information Retrieval Based on Textual Entailment. In: Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology, pp. 455–458.

Kouylekov, M. & Magnini, B. (2005) Recognizing Textual Entailment with Tree Edit Distance Algorithms. In: Proceedings of the First Challenge Workshop on Recognising Textual Entailment. pp. 17–20.

Kouylekov, M. & Magnini, B. (2006) Tree Edit Distance for Recognizing Textual Entailment: Estimating the Cost of Insertion. In: Proceedings of the PASCAL RTE-2 Challenge. pp. 68–73.

Kouylekov, M. & Negri, M. (2010) An Open-Source Package for Recognizing Textual Entailment. In: Proceedings of the ACL 2010 System Demonstrations. Association for Computational Linguistics, Stroudsburg, PA, USA, ACLDemos '10.


Kouylekov, M., Mehdad, Y. & Negri, M. (2010a) Mining Wikipedia for Large-Scale Repositories of Context-Sensitive Entailment Rules. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2010).

Kouylekov, M., Mehdad, Y., Negri, M. & Cabrio, E. (2010b) FBK Participation in RTE-6: Main and KBP Validation Task. In: Proceedings of the Sixth Recognizing Textual Entailment Challenge.

Kouylekov, M., Mehdad, Y. & Negri, M. (2011) Is it Worth Submitting this Run? Assess your RTE System with a Good Sparring Partner. In: TextInfer 2011 EMNLP Workshop on Textual Entailment, p. 30.

Landauer, Foltz & Laham (1998) Introduction to Latent Semantic Analysis. In: Discourse Processes 25.

Li, F., Zheng, Z., Bu, F., Tang, Y., Zhu, X. & Huang, M. (2009a) Quanta at TAC 2009 KBP and RTE Track. In: Proceedings of the Text Analysis Conference (TAC 2009) Workshop. Gaithersburg, Maryland, USA.

Li, F., Zheng, Z., Bu, F., Tang, Y., Zhu, X. & Huang, M. (2009b) THU QUANTA at TAC 2009 KBP and RTE Track. In: Text Analysis Conference (TAC).

Lin, C. (2003) ROUGE: Recall-oriented Understudy for Gisting Evaluation.

Lin, D. (1998) Automatic Retrieval and Clustering of Similar Words. In: Proceedings of the 17th International Conference on Computational Linguistics - Volume 2. Association for Computational Linguistics, pp. 768–774.

Lin, D. & Pantel, P. (2001) DIRT - Discovery of Inference Rules from Text. In: Knowledge Discovery and Data Mining, pp. 323–328.


Lloret, E., Ferrández, Ó., Muñoz, R. & Palomar, M. (2008) A Text Summarization Approach under the Influence of Textual Entailment. In: Proceedings of NLPCS 2008.

MacCartney, B. & Manning, C. (2007) Natural Logic for Textual Inference. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Association for Computational Linguistics, pp. 193–200.

Malakasiotis, P. (2009) AUEB at TAC 2009. In: Proceedings of the Text Analysis Conference, Gaithersburg, MD.

McKeown, K., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J., Nenkova, A., Sable, C., Schiffman, B. & Sigelman, S. (2002) Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster. In: Proceedings of the Second International Conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc., pp. 280–285.

Mehdad, Y. (2009) Automatic Cost Estimation for Tree Edit Distance Using Particle Swarm Optimization. In: Proceedings of the ACL-IJCNLP 2009 Conference.

Mehdad, Y. & Magnini, B. (2009a) Optimizing Textual Entailment Recognition Using Particle Swarm Optimization. In: Proceedings of the ACL09 Workshop on Applied Textual Inference.

Mehdad, Y. & Magnini, B. (2009b) A Word Overlap Baseline for the Recognizing Textual Entailment Task.

Mehdad, Y. & Magnini, B. (2009c) A Word Overlap Baseline for the Recognizing Textual Entailment Task.


Mehdad, Y. & Negri, M. (2012) FBK: Cross-Lingual Textual Entailment without Translation. In: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012).

Mehdad, Y., Moschitti, A. & Zanzotto, F. (2009a) SemKer: Syntactic/Semantic Kernels for Recognizing Textual Entailment. In: TAC 2009 Workshop.

Mehdad, Y., Negri, M., Cabrio, E., Kouylekov, M. & Magnini, B. (2009b) EDITS: An Open Source Framework for Recognizing Textual Entailment. In: Proceedings of TAC 2009.

Mehdad, Y., Moschitti, A. & Zanzotto, F. M. (2010a) Syntactic/Semantic Structures for Textual Entailment Recognition. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, HLT '10.

Mehdad, Y., Negri, M. & Federico, M. (2010b) Towards Cross-Lingual Textual Entailment. In: Proceedings of Short Papers, NAACL 2010.

Mehdad, Y., Negri, M. & Federico, M. (2011) Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, pp. 1336–1345.

Mehdad, Y., Negri, M. & Federico, M. (2012a) Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents. In: Proceedings of ACL'12.


Mehdad, Y., Negri, M. & Federico, M. (2012b) Match without a Referee: Evaluating MT Adequacy without Reference Translations. In: Proceedings of the Machine Translation Workshop (WMT2012).

Melamed, I., Green, R. & Turian, J. (2003) Precision and Recall of Machine Translation. In: Proceedings of HLT-NAACL 2003, Short Papers - Volume 2. Association for Computational Linguistics, pp. 61–63.

Melgani, F. & Bazi, Y. (2008) Classification of Electrocardiogram Signals With Support Vector Machines and Particle Swarm Optimization. IEEE Transactions on Information Technology in Biomedicine 12(5):667–677.

Mihalcea, R. & Strapparava, C. (2009) The Lie Detector: Explorations in the Automatic Recognition of Deceptive Language. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, pp. 309–312.

Mihalcea, R., Corley, C. & Strapparava, C. (2006) Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In: Proceedings of AAAI06.

Mirkin, S., Bar-Haim, R., Berant, J., Dagan, I., Shnarch, E., Stern, A. & Szpektor, I. (2009a) Addressing Discourse and Document Structure in the RTE Search Task. In: Proceedings of TAC.

Mirkin, S., Specia, L., Cancedda, N., Dagan, I., Dymetman, M. & Szpektor, I. (2009b) Source-Language Entailment Modeling for Translating Unknown Terms. In: Proceedings of ACL '09.

Morik, K., Brockhausen, P. & Joachims, T. (1999) Combining Statistical Learning with a Knowledge-based Approach - a Case Study in Intensive Care Monitoring. In: Machine Learning International Workshop. Morgan Kaufmann Publishers, Inc., pp. 268–277.

Moschitti, A. (2006) Making Tree Kernels Practical for Natural Language Learning. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics.

Mrozinski, J., Whittaker, E. & Furui, S. (2008) Collecting a Why-Question Corpus for Development and Evaluation of an Automatic QA-System. In: Proceedings of ACL-08: HLT, pp. 443–451.

Negri, M. & Mehdad, Y. (2010) Creating a Bi-Lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, pp. 212–216.

Negri, M., Kouylekov, M., Magnini, B., Mehdad, Y. & Cabrio, E. (2009) Towards Extensible Textual Entailment Engines: the EDITS Package. In: AI*IA 2009: Emergent Perspectives in Artificial Intelligence. Springer, pp. 314–323.

Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. (2011) Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora. In: Proceedings of EMNLP 2011.

Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L. & Giampiccolo, D. (2012) SemEval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization. In: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012).


Neuhaus, M. & Bunke, H. (2004) A Probabilistic Approach to Learning Costs for Graph Edit Distance. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. IEEE, vol. 3, pp. 389–393.

Nielsen, R., Becker, L. & Ward, W. (2009) TAC 2008 CLEAR RTE System Report: Facet-based Entailment. In: Proceedings of the Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track, Gaithersburg, Maryland, USA.

Nießen, S., Och, F., Leusch, G. & Ney, H. (2000) An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. In: Proceedings of the 2nd International Conference on Language Resources and Evaluation. pp. 39–45.

Och, F. & Ney, H. (2000) Improved Statistical Alignment Models. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp. 440–447.

Padó, S. (2006) User's Guide to sigf: Significance Testing by Approximate Randomisation.

Padó, S., de Marneffe, M., MacCartney, B., Rafferty, A., Yeh, E. & Manning, C. (2008) Deciding Entailment and Contradiction with Stochastic and Edit Distance-Based Alignment. In: Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track. National Institute of Standards and Technology, pp. 17–19.

Padó, S., Galley, M., Jurafsky, D. & Manning, C. (2008) Evaluation of MT Output with Entailment Technology. In: NIST Metrics MATR'08 System Description.


Padó, S., Galley, M., Jurafsky, D. & Manning, C. D. (2009) Textual Entailment Features for Machine Translation Evaluation. In: Proceedings of StatMT '09.

Papineni, K., Roukos, S., Ward, T. & Zhu, W. (2002) BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp. 311–318.

Pedersen, T., Patwardhan, S. & Michelizzi, J. (2004) WordNet::Similarity - Measuring the Relatedness of Concepts. In: Proceedings of the 5th NAACL.

Porter, M. (2001) Snowball: A Language for Stemming Algorithms.

Quirk, C. B. (2004) Training a Sentence-Level Machine Translation Confidence Measure. In: Proceedings of LREC.

Rafferty, A. & Manning, C. (2008) Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines. In: Proceedings of the Workshop on Parsing German. Association for Computational Linguistics, pp. 40–46.

Rodrigo, A., Penas, A. & Verdejo, F. (2008) Towards an Entity-based Recognition of Textual Entailment. In: Proceedings of the Fourth PASCAL Challenges Workshop on Recognizing Textual Entailment. Gaithersburg, Maryland, USA.

Rodríguez, L., García-Varea, I. & Gámez, J. (2008) On The Application of Different Evolutionary Algorithms to the Alignment Problem in Statistical Machine Translation. Neurocomputing 71(4):755–765.

Romano, L., Kouylekov, M., Szpektor, I., Dagan, I. & Lavelli, A. (2006) Investigating a Generic Paraphrase-based Approach for Relation Extraction. In: Proceedings of EACL 2006.

Roth, D., Sammons, M. & Vydiswaran, V. (2009) A Framework for Entailed Relation Recognition. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers.

Bar-Haim, R., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B. & Szpektor, I. (2006) The Second PASCAL Recognising Textual Entailment Challenge. In: Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Sammons, M., Vydiswaran, V., Vieira, T., Johri, N., Chang, M., Goldwasser, D., Srikumar, V., Kundu, G., Tu, Y., Small, K. et al. (2009) Relation Alignment for Textual Entailment Recognition. In: Proceedings of Recognizing Textual Entailment 2009.

Sammons, M., Vydiswaran, V. & Roth, D. (2010) Ask Not What Textual Entailment Can Do For You... In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 1199–1208.

Schmid, H. (1995) TreeTagger: a Language Independent Part-Of-Speech Tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, p. 43.

Schuler, K. K. (2005) VerbNet: a Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA, USA.

Shnarch, E. (2008) Lexical Entailment and its Extraction from Wikipedia. Master's thesis, Bar-Ilan University, Israel.


Siblini, R. & Kosseim, L. (2008) Using Ontology Alignment for the TAC RTE Challenge. In: Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track. National Institute of Standards and Technology, pp. 17–19.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Makhoul, J. (2006) A Study of Translation Edit Rate with Targeted Human Annotation. In: Proceedings of Association for Machine Translation in the Americas. pp. 223–231.

Snover, M., Madnani, N., Dorr, B. & Schwartz, R. (2009) Fluency, Adequacy, or HTER? Exploring Different Human Judgements with a Tunable MT Metric. In: Proceedings of the Fourth Workshop on Statistical Machine Translation. Association for Computational Linguistics, vol. 30, pp. 259–268.

Snow, R., O'Connor, B., Jurafsky, D. & Ng, A. (2008) Cheap and Fast - but is it Good?: Evaluating Non-Expert Annotations for Natural Language Tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 254–263.

Specia, L. (2011) Exploiting Objective Annotations for Measuring Translation Post-Editing Effort. In: 15th Conference of the European Association for Machine Translation. pp. 73–80.

Specia, L. & Farzindar, A. (2010) Estimating Machine Translation Post-Editing Effort with HTER. In: AMTA-2010 Workshop "Bringing MT to the User: MT Research and the Translation Industry". pp. 33–41.

Specia, L., Cancedda, N., Dymetman, M., Turchi, M. & Cristianini, N. (2009) Estimating the Sentence-Level Quality of Machine Translation Systems. In: The 13th Annual Conference of the European Association for Machine Translation (EAMT-2009). pp. 28–35.

Specia, L., Cancedda, N. & Dymetman, M. (2010) A Dataset for Assessing Machine Translation Evaluation Metrics. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC10), Valletta, Malta. European Language Resources Association (ELRA).

Specia, L., Hajlaoui, N., Hallett, C. & Aziz, W. (2011) Predicting Machine Translation Adequacy. In: Proceedings of the Thirteenth Machine Translation Summit (MTSummit-2011), China.

Szpektor, I. & Dagan, I. (2008) Learning Entailment Rules for Unary Templates. In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, pp. 849–856.

Tatu, M. & Moldovan, D. (2005) A Semantic Approach to Recognizing Textual Entailment. In: Proceedings of HLT/EMNLP 2005. pp. 371–378.

Tatu, M. & Moldovan, D. (2007) COGEX at RTE-3. In: Proceedings of ACL-07.

Tatu, M., Iles, B., Slavick, J., Novischi, A. & Moldovan, D. (2006) COGEX at the Second Recognizing Textual Entailment Challenge. In: Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment. pp. 104–109.

Tillmann, C., Vogel, S., Ney, H., Zubiaga, A. & Sawaf, H. (1997) Accelerated DP Based Search for Statistical Translation. In: Fifth European Conference on Speech Communication and Technology.

Van Rijsbergen, C. J. (1979) Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow.

Wang, R. (2011) Intrinsic and Extrinsic Approaches to Recognizing Textual Entailment. Saarbrücken Dissertations in Computational Linguistics and Language Technology. German Research Center for Artificial Intelligence.

Wang, R. & Callison-Burch, C. (2010) Cheap Facts and Counter-Facts. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, pp. 163–167.

Wang, R. & Neumann, G. (2008a) An Accuracy-Oriented Divide-and-Conquer Strategy for Recognizing Textual Entailment. In: Proceedings of the Text Analysis Conference. Gaithersburg, Maryland, USA.

Wang, R. & Neumann, G. (2008b) Information Synthesis for Answer Validation. In: CLEF 2008 Working Notes. Springer Verlag.

Wang, R. & Neumann, G. (2008c) Relation Validation via Textual Entailment. In: 1st International and KI-08 Workshop on Ontology-based Information Extraction Systems. pp. 26–37.

Wu, Z. & Palmer, M. (1994) Verb Semantics And Lexical Selection. In: Proceedings of the 32nd Annual Meeting of the ACL.

Xiong, D., Zhang, M. & Li, H. (2010) Error Detection for Statistical Machine Translation using Linguistic Features. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 604–611.


Yatbaz, M. (2008) RTE-4: Normalized Dependency Tree Alignment Using Unsupervised N-gram Word Similarity Score. In: Proceedings of the Fourth PASCAL Challenges Workshop on Recognizing Textual Entailment. Gaithersburg, Maryland, USA.

Zanzotto, F. M. & Moschitti, A. (2006a) Automatic Learning of Textual Entailments with Cross-Pair Similarities. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.

Zanzotto, F. M. & Moschitti, A. (2006b) Automatic Learning of Textual Entailments with Cross-Pair Similarities. In: Proceedings of ACL '06.

Zanzotto, F. M., Pennacchiotti, M. & Moschitti, A. (2009) A Machine Learning Approach to Textual Entailment Recognition. Natural Language Engineering.

Zhang, C. & Chai, J. Y. (2010) Towards Conversation Entailment: An Empirical Investigation. In: EMNLP.

Zhao, S., Wang, H., Liu, T. & Li, S. (2009) Extracting Paraphrase Patterns from Bilingual Parallel Corpora. Natural Language Engineering 15(04):503–526.


Appendix A

List of Published Papers

2012

1. Yashar Mehdad, Matteo Negri and Marcello Federico. Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents. In ACL 2012 - The 50th Annual Meeting of the Association for Computational Linguistics.

2. Yashar Mehdad, Matteo Negri and Marcello Federico. Match without a Referee: Evaluating MT Adequacy without Reference Translations. In the 7th Workshop on Statistical Machine Translation (WMT 2012). Montreal, Canada.

3. Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli and Danilo Giampiccolo. SemEval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization. Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), in conjunction with the First Joint Conference on Lexical and Computational Semantics (*SEM 2012). Montreal, Canada. 2012.

4. Yashar Mehdad, Matteo Negri and José Guilherme Camargo de Souza. FBK: Cross-Lingual Textual Entailment without Translation. Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), in conjunction with the First Joint Conference on Lexical and Computational Semantics (*SEM 2012). Montreal, Canada.

5. José Guilherme Camargo de Souza, Matteo Negri and Yashar Mehdad. FBK: Combining Machine Translation Evaluation and Word Similarity Metrics for Semantic Textual Similarity. Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), in conjunction with the First Joint Conference on Lexical and Computational Semantics (*SEM 2012). Montreal, Canada.

2011

1. Y. Mehdad, M. Negri, and M. Federico. Using bilingual parallel corpora for cross-lingual textual entailment. In Proceedings of the ACL 2011 Conference. The Association for Computational Linguistics, 2011.

2. M. Negri, L. Bentivogli, Y. Mehdad, D. Giampiccolo, and A. Marchetti. Divide and conquer: Crowdsourcing the creation of cross-lingual textual entailment corpora. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.

3. M. Kouylekov, Y. Mehdad, and M. Negri. Is it worth submitting this run? Assess your RTE system with a good sparring partner. In Proceedings of the TextInfer 2011 Workshop on Textual Entailment, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.

4. C. Monz, V. Nastase, M. Negri, A. Fahrni, Y. Mehdad, and M. Strube. CoSyne: A framework for multilingual content synchronization of Wikis. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (WikiSym), Mountain View, Calif., 3-5 October 2011.

2010

1. Y. Mehdad, A. Moschitti, and F.M. Zanzotto. Syntactic/semantic structures for textual entailment recognition. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2010.

2. M. Negri and Y. Mehdad. Creating a bi-lingual entailment corpus through translations with mechanical turk: $100 for a 10-day rush. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 212-216, 2010.

3. Y. Mehdad, M. Negri, and M. Federico. Towards cross-lingual textual entailment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 321-324. Association for Computational Linguistics, 2010.

4. M. Kouylekov, Y. Mehdad, and M. Negri. Mining Wikipedia for large-scale repositories of context-sensitive entailment rules. In Proceedings of the Language Resources and Evaluation Conference (LREC 2010), 2010.

5. M. Kouylekov, Y. Mehdad, M. Negri, and E. Cabrio. FBK participation in RTE-6: Main and KBP validation task. In Proceedings of the Sixth Recognizing Textual Entailment Challenge, 2010.


2009

1. Y. Mehdad. Automatic Cost Estimation for Tree Edit Distance Using Particle Swarm Optimization. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 289-292. Association for Computational Linguistics, 2009.

2. Y. Mehdad and B. Magnini. Optimizing textual entailment recognition using particle swarm optimization. In Proceedings of the 2009 Workshop on Applied Textual Inference, TextInfer '09. Association for Computational Linguistics, 2009.

3. Y. Mehdad, A. Moschitti, and F.M. Zanzotto. SemKer: Syntactic/semantic kernels for recognizing textual entailment. In Proc. of the Text Analysis Conference, Gaithersburg, MD, 2009.

4. M. Negri, M. Kouylekov, B. Magnini, Y. Mehdad, and E. Cabrio. Towards extensible textual entailment engines: the EDITS package. In AI*IA 2009: Emergent Perspectives in Artificial Intelligence, pages 314-323. Springer, 2009.

5. Y. Mehdad, V. Scurtu, and E. Stepanov. Italian named entity recognizer participation in the NER task @ Evalita 09. In Proceedings of EVALITA'09, 2009.

6. E. Cabrio, Y. Mehdad, M. Negri, M. Kouylekov, and B. Magnini. Recognizing textual entailment for Italian: EDITS @ Evalita'09. In Proceedings of EVALITA, 2009.

7. Y. Mehdad and B. Magnini. A word overlap baseline for the recognizing textual entailment task, 2009.


8. Y. Mehdad, M. Negri, E. Cabrio, M. Kouylekov, and B. Magnini. Using lexical resources in a distance-based approach to RTE. In Proceedings of the TAC 2009 Workshop on Textual Entailment, 2009.
