ABSTRACT
Title of dissertation: SEARCHING TO TRANSLATE AND TRANSLATING TO SEARCH: WHEN INFORMATION RETRIEVAL MEETS MACHINE TRANSLATION
Ferhan Ture, Doctor of Philosophy, 2013
Dissertation directed by: Professor Jimmy Lin Department of Computer Science and College of Information Studies
With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is the special case of IR in which the search takes place in a multilingual collection.
Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluently, retaining the original meaning), which helps people understand what is being written, regardless of the source language.
Putting these together, we observe that search and translation technologies are part of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output.
In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search process. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs.
Finally, we propose a general architecture, in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.

SEARCHING TO TRANSLATE AND TRANSLATING TO SEARCH: WHEN INFORMATION RETRIEVAL MEETS MACHINE TRANSLATION
by
Ferhan Ture
Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2013
Advisory Committee:
Professor Jimmy Lin, Chair/Advisor
Professor Hal Daumé III
Professor Bonnie Dorr
Professor Douglas W. Oard
Professor Philip Resnik
Professor Sennur Ulukus

© Copyright by Ferhan Ture 2013

Dedication
to my parents, my family, and my dearest Elif...
Acknowledgments
I am indebted to all the people who have made this dissertation possible and because of whom my graduate experience has been one that I will cherish forever.
I am very lucky to have had great advisors, friends, and family, and would like to express my gratitude with a few words.
First and foremost I'd like to thank my advisor, Jimmy Lin, from whom I could not have asked for more. Above all, he has always made sure that I have everything to make me comfortable throughout my Ph.D. life. He has played a crucial role in preparing me as a researcher, by introducing me to interesting topics, setting a high bar for standards, and always encouraging me to stay active in the field.
He has also taught me well as a software developer, by passing along his industry experience and involving me in the development of high-quality software. Additionally, I have benefited tremendously from Jimmy as an excellent role model in academia, and he has been a great inspiration during challenging times.
I would also like to thank Doug Oard, for guiding me throughout the process with his great vision and breadth of knowledge, often helping me to see the bigger picture. I feel very fortunate to have such a great expert of Information Retrieval on my committee.
Philip Resnik is one of the reasons I chose this program, and I have learned a lot from him both as a teacher and a researcher. His enthusiasm has encouraged me to stay excited about my work and not give up easily. As a pioneer in the field of Machine Translation, Philip's vision has influenced my research in a profound way.
I would also like to thank Mihai Pop for his valuable feedback as one of my proposal committee members, and also Hal Daumé, Bonnie Dorr, and Sennur Ulukus for agreeing to serve on my dissertation committee and for sparing their invaluable time to review the manuscript.
Without the endless support of my previous advisors in Turkey – Deniz Yuret,
Kemal Oflazer, and Esra Erdem – this dissertation would have been a distant dream.
I would like to thank them for believing in me during the early years of my career.
These five years wouldn’t have been the same if it wasn’t for my colleagues in the CLIP Lab. I would like to thank all of them for being part of this awesome research group. I have learned a lot from the weekly seminars, as well as from daily interactions within the lab. I would like to thank Hua, Ke Zhai, Hassan, Tan,
Viet-An, Asad, Junhui, Yuening, Amit, and Greg, and especially Vlad, Ke Wu,
Chris Dyer, Kristy, Hendra, Zhongqiang, Jags, Tamer, Mossaab, Lidan, and Nima for being such cool friends, colleagues, lab mates, and more.
I would also like to acknowledge help and support from some of the staff members, especially Joe Webster for his extraordinary efforts to make sure that computing resources were ready when needed.
My friends outside the lab deserve a special thank you. Many thanks to Tuğrul, Şimal, Ali Fuad, Bengü, Bedrettin, Enes, Yasin, Salih, Orhan, Gökçe, Edvin, Melis, David, Jose, Pilar, Aslı, Mustafa Dayıoğlu, Mustafa Hatipoğlu, Ömür and Bahadır for their sincere friendship. Hanging out with them has been the best cure for the daily stress of graduate life. I am also thankful to all of my friends in Turkey, who have provided moral support despite the physical distance.
Last, but not least, I am grateful to be part of a wonderful family. My parents took great responsibility and went through many difficulties, without any hesitation, just to make sure that I receive a good education. My entire family has always stood by me and guided me through my career. Their love and support bring me peace and comfort.
My dear wife, Elif, is the one who has made this journey enjoyable. She has been a constant motivation for me, and I cannot even imagine how it would’ve been possible without her endless love and care.
It is impossible to remember everyone, and I apologize to those I've inadvertently left out. Thank you all!
Table of Contents
List of Figures
1 Introduction
   1.1 Motivation
   1.2 Outline
      1.2.1 Searching to Translate
      1.2.2 Translating to Search
   1.3 Contributions
List of Abbreviations
2 Related Work
   2.1 Machine Translation
      2.1.1 Word Alignment
      2.1.2 Translation Model
      2.1.3 Decoding
      2.1.4 Evaluation
   2.2 Information Retrieval
      2.2.1 Vector-Space Model
      2.2.2 Other Models
      2.2.3 Retrieval
      2.2.4 Evaluation
      2.2.5 IR Across Languages
   2.3 Context-Sensitive CLIR
   2.4 Pairwise Similarity
   2.5 Cross-Lingual Pairwise Similarity
   2.6 Extracting Parallel Text for MT
   2.7 Scalability / MapReduce
3 Searching to Translate: Efficiently Extracting Bilingual Data from the Web
   3.1 Overview
   3.2 Phase 1: Large-Scale Cross-Lingual Pairwise Similarity
      3.2.1 Pairwise Similarity
      3.2.2 Cross-Lingual Pairwise Similarity
         3.2.2.1 MT vs. CLIR for Cross-Lingual Pairwise Similarity
      3.2.3 Approach
         3.2.3.1 Preprocessing Documents
         3.2.3.2 Generating Document Signatures
         3.2.3.3 Sliding Window Algorithm
         3.2.3.4 Basic MapReduce Implementation
         3.2.3.5 Improved MapReduce Implementation
      3.2.4 Analytical Model
      3.2.5 Experimental Evaluation
         3.2.5.1 Effectiveness vs. Efficiency
         3.2.5.2 Parameter Exploration
         3.2.5.3 Error Analysis
   3.3 Phase 2: Extracting Bilingual Sentence Pairs
      3.3.1 Classifier Model
      3.3.2 Bitext Extraction Algorithm
      3.3.3 Experimental Evaluation
   3.4 Comprehensive MT Experiments
      3.4.1 Results
      3.4.2 Efficiency
      3.4.3 Error Analysis
   3.5 Conclusions and Future Work
4 Translating to Search: Context-Sensitive Query Translation for Cross-Language Information Retrieval
   4.1 Overview
   4.2 Context-Independent Query Translation
   4.3 Context-Sensitive Query Translation
      4.3.1 Learning Probabilities from Translation Model
      4.3.2 Learning Probabilities from n-best Derivations
      4.3.3 Combining Sources of Evidence
   4.4 Evaluation
      4.4.1 A Comparison of System Variants
      4.4.2 A Comparison of CLIR Approaches
      4.4.3 Efficiency
   4.5 Conclusions and Future Work
5 Searching to Translating to Search Again: An Architecture for Connecting IR and MT
   5.1 Introduction
   5.2 Overview of “Searching to Translate” and “Translating to Search”
   5.3 Connecting “Searching to Translate” and “Translating to Search” via a Circular Architecture
   5.4 A Bootstrapping Approach for Bitext Extraction
   5.5 Conclusions and Future Work
6 Conclusions and Future Work
   6.1 Summary
   6.2 Limitations
   6.3 Future Work
      6.3.1 Extracting Parallel Text from the Web
      6.3.2 Context in Natural Language Processing
      6.3.3 Domain Adaptation in MT
      6.3.4 Representing Phrases in CLIR
      6.3.5 Combining Document and Query Translations for CLIR
Bibliography
List of Figures
2.1 Illustration of the noisy channel process, describing how a target-language sentence t is first generated, then gets garbled into the source language. Given the output of the noisy channel process, the MT decoder tries to recover the most probable explanation: the translation that is most likely to have been distorted into the input sentence.
2.2 Illustration of how words are aligned under IBM Model 1.
2.3 A SCFG with four rules.
2.4 Illustration of how maternal leave in europe is translated with a synchronous context-free grammar.
2.5 Illustration of how paternal leave in europe is translated with a synchronous context-free grammar.
2.6 A graph representation of the translation space for query maternal leave in europe.
2.7 Overview of the two-stage MapReduce processing structure.
3.1 Overview of the end-to-end pipeline for parallel text extraction, starting from a comparable corpus, ending with aligned bilingual text (i.e., a parallel corpus). F and E denote the two corpus languages.
3.2 Two approaches to match vocabulary spaces in cross-lingual pairwise similarity: text translation (MT) and vector translation (CLIR).
3.3 Comparing text (using the cdec MT system, labeled cdec) and vector (using CLIR, labeled clir) translation approaches for cross-lingual pairwise similarity, using data from German-English Europarl. This histogram illustrates that (1) similarity values for parallel documents are surprisingly low, (2) positive and negative instances are clearly separated with either translation method, and (3) the MT approach generates slightly better results (although we argue that it is not sufficient to justify the 600 times slower running time).
3.4 Comparing two text translation approaches (hierarchical PBMT system cdec and commercial MT system Google Translate) in cross-lingual pairwise similarity, using data from German-English Europarl. This histogram illustrates that there is no noticeable difference between using either MT system for this task.
3.5 An example demonstration of the basic MapReduce implementation of the sliding window algorithm, assuming a toy collection with three signatures and parameters as follows: D = 11 (i.e., number of bits), Q = 2 (i.e., number of permutations), B = 2 (i.e., window size), T = 6 (i.e., Hamming distance threshold).
3.6 Pseudo-code of the initial version of the sliding window algorithm.
3.7 Pseudo-code of the chunk generation phase of the sliding window algorithm.
3.8 Pseudo-code of the detection phase of the sliding window algorithm.
3.9 Running times of the detection phase of the sliding window LSH algorithm for τ = 0.3, using 1000-bit signatures. Data points are annotated with recall and relative cost.
3.10 Effectiveness-efficiency tradeoffs of the sliding window LSH algorithm for τ = 0.3, using 1000-bit signatures.
3.11 Number of distance calculations and corresponding recall values.
3.12 Relative effectiveness vs. relative cost of the sliding window algorithm.
3.13 An illustration of the second phase of the bitext extraction pipeline (Chapter 3). This phase takes the entire collection as input, loads the cross-lingual pairs from Phase 1 (Section 3.2), and outputs a parallel corpus.
3.14 Pseudo-code for the first stage of the bitext extraction algorithm: mappers implement the candidate generation step and reducers implement the simple classification step.
3.15 Pseudo-code for the second stage of the bitext extraction algorithm, in which the complex classifier is applied to all sentence pairs output by the first stage.
3.16 A comparison of bitext classification approaches on the WMT-10 German-English test set.
3.17 Evaluation results on translating high-resource language pairs into English: BLEU scores are reported relative to the baseline MT scores.
3.18 Evaluation results on translating low-resource language pairs into English: BLEU scores are reported relative to the baseline MT scores.
4.1 A PSQ, representing the translation of Maternal leave in Europe using Pr_token.
4.2 An overview of a modern statistical MT pipeline.
4.3 A subset of the SCFG as a toy example.
4.4 A PSQ, representing the translation of query “Maternal leave in Europe” using Pr_SCFG.
4.5 A PSQ, representing the translation of query “Maternal leave in Europe” using Pr_PBMT.
4.6 A PSQ, representing the translation of query “Maternal leave in Europe” using Pr_nbest.
4.7 A comparison between the CLIR models and their correspondence within a hierarchical MT pipeline.
4.8 Results of a grid search on the parameters of our interpolated CLIR model, evaluated on the TREC 2002 English-Arabic CLIR task.
4.9 Results of a grid search on the parameters of our interpolated CLIR model, evaluated on the NTCIR-8 English-Chinese CLIR task.
4.10 Results of a grid search on the parameters of our interpolated CLIR model, evaluated on the CLEF 2006 English-French CLIR task.
4.11 Per-topic AP improvement over token-based baseline (condition A) for Arabic: Interpolated and 1-best models.
4.12 Per-topic AP improvement over token-based baseline (condition A) for Arabic: Grammar- and translation-based (i.e., n-best) models.
4.13 Per-topic AP improvement over token-based baseline (condition A) for Chinese: Interpolated and 1-best models.
4.14 Per-topic AP improvement over token-based baseline (condition A) for Chinese: Grammar- and translation-based (i.e., n-best) models.
4.15 Per-topic AP improvement over token-based baseline (condition A) for French: Interpolated and 1-best models.
4.16 Per-topic AP improvement over token-based baseline (condition A) for French: Grammar- and translation-based (i.e., n-best) models.
4.17 A PSQ, representing the translation of query “NBA labor conflict” using Pr_nbest.
4.18 An empirical analysis of the efficiency-effectiveness tradeoff for proposed CLIR models.
5.1 Illustration of our approach in Chapter 3.
5.2 Illustration of our approach in Chapter 4.
5.3 Proposed approach to unify the approaches in the two chapters, which improves both IR and MT models in an iterative process.
5.4 Simplified version of the proposed approach, where we bootstrap the improvements from extracted bitext D1 into the CLIR pipeline and obtain a better translation model (shown by dashed line), producing bitext D2, which is used for MT training (shown by dotted line).
6.1 Probabilistic representation of the translation of query “Maternal leave in Europe” using 80% weight from Pr_c and 20% weight from Pr_multi.
Chapter 1: Introduction
1.1 Motivation
With the ever-increasing adoption of web services in daily life, people have access to tremendous amounts of information, ranging from product reviews on commercial websites to descriptions of concepts in online encyclopedias. However, the rate at which data is generated exceeds any human's reading and comprehension capabilities. In addition, much of the generated data may be irrelevant, redundant, or even incomprehensible. As a result, search technologies have become a fundamental tool for accessing information, used by millions every day. Information Retrieval (IR) is the study of search techniques, in which the task is to find material (e.g., documents) relevant to a given information need (e.g., a query).
It is essential that IR approaches can handle large, noisy, and complex data collections. One challenge is to search in collections containing text in multiple languages. This is becoming a very common characteristic of web collections: according to recent statistics, the proportion of Arabic content on the web is increasing nine times faster than English, and only 27% of web users are native English speakers [56]. Therefore, search technologies need to handle content written in multiple languages, which requires some technique to account for the linguistic differences. Cross-Language Information Retrieval (CLIR) is the special case of IR in which the search takes place in a multilingual collection.
Of course, it is not helpful to retrieve content in languages the user cannot understand. The goal of Machine Translation (MT) is to translate text from one language into another. With the ability to translate text efficiently (within a reasonable amount of time) and effectively (fluently, retaining the original meaning), this technology makes it possible to understand what is being said, regardless of the source language.
Putting all of this together, we observe that search and translation technologies are part of an important user application: a user should be able to search multilingual web collections, expressing the information need in any language, and the relevant content should be presented in any desired language. This application calls for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce optimal output. Furthermore, improving solutions to either problem will bring better access to information beyond the limits of a particular language and population.
Motivated by the opportunity to improve our everyday life, the goal of this dissertation is to build better connections between IR and MT. We first demonstrate that IR techniques can be used to improve MT, and then show the same applies in reverse: MT techniques can be used to improve the state of the art of IR. In order to complete the cycle, we then introduce an architecture that connects the pieces together, and provides a path to a better integration of IR and MT.
1.2 Outline
As discussed above, this dissertation explores two problems as part of the same overall architecture. The first problem focuses on “searching to translate,” or how one can use search techniques to improve translation modeling. The second problem focuses on “translating to search,” or how one can use advanced translation modeling approaches to improve the search process. Finally, we introduce the overall architecture connecting these two problems, and describe how it can potentially bring iterative improvements to the existing models in IR and MT.
In Chapter 2, we review related work on the various relevant subjects, such as pairwise similarity, parallel text extraction, query translation in CLIR, and previous attempts to integrate IR and MT. Chapters 3 and 4 describe the two problems introduced above, namely “searching to translate” and “translating to search,” respectively. A more thorough description of these two chapters is given below.
In Chapter 5, we propose a bootstrapping approach to connect the components of
Chapters 3 and 4. Finally, conclusions and future work are presented in Chapter 6.
1.2.1 Searching to Translate
The core of a state-of-the-art MT system is the underlying statistical translation model, which contains parameters that need to be learned via machine learning (ML) algorithms. In the MT and ML literature, it has been repeatedly shown that more training data usually means better models [8], since the learning process involves more examples and can better capture patterns in the model. For MT, the training data needs to be in the form of a parallel corpus, or parallel text, which is a list of aligned sentence pairs that are mutual translations. Having humans generate this type of data is possible, but time-consuming and expensive. In some cases, we can generate a parallel corpus with little effort from text generated for other purposes. European Union proceedings, which are produced in many European languages, and Al-Jazeera news stories, which are printed in both Arabic and English, are two good examples. However, this is only possible for selected language pairs and in certain domains (typically in formal language); therefore, we cannot solely rely on these data resources for building robust, general-purpose translation systems.
On the other hand, IR provides many techniques to efficiently and effectively search large web collections. These collections potentially contain a lot of parallel text, although it has not been aligned. For example, the same news story is covered by the media in many countries; therefore, it is possible that some of the sentences can be aligned across all of these sources. Popular books are also translated into many languages, but might not be directly usable for MT since they are not generated sentence by sentence. These types of multilingual data resources are often called comparable corpora, as opposed to parallel corpora, since there is no explicit alignment at the sentence level, but there are many text portions on the same subject.
If we could extract the parallel text within these comparable corpora, we could use it as MT training data. This is the problem explored in Chapter 3: we present an approach to search for parallel text (i.e., cross-lingual sentence pairs that are translations of each other) in a comparable corpus (i.e., two collections in different languages, which contain documents of similar content). Our two-phase approach consists of two algorithms, described below.
The first algorithm finds similar cross-lingual document pairs in a collection containing documents written in two languages. In this search task, which is called cross-lingual pairwise similarity, documents in one language are treated as queries, and the goal is to retrieve similar documents in the other language. For very large collections, translating all queries using an MT system is computationally infeasible within our resources.1 Hence, our algorithm is based on word-based CLIR translation techniques for efficient translation and locality-sensitive hashing for efficient search. This is a parallelized implementation of a sliding window-based algorithm described earlier [116], which we have modified and optimized for this specific application.
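The sliding window idea can be sketched as follows: documents become bit signatures via random projections, and for each of Q random permutations the signatures are sorted so that each one is compared only to its B neighbors, with a Hamming distance threshold T. Below is a toy, single-machine sketch; the parameter values and helper names are illustrative, not the actual parallelized MapReduce implementation described in Chapter 3.

```python
import random

# Toy sketch of the sliding-window LSH idea (illustrative parameters).
D = 32   # signature length in bits
Q = 4    # number of random permutations
B = 4    # window size (neighbors compared after sorting)
T = 8    # Hamming distance threshold

random.seed(0)

def signature(vector, hyperplanes):
    # Random-projection bit signature: one bit per hyperplane,
    # set when the dot product with the document vector is positive.
    return tuple(int(sum(v * h for v, h in zip(vector, plane)) > 0)
                 for plane in hyperplanes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def similar_pairs(signatures):
    # For each of Q random bit permutations, sort the permuted
    # signatures and compare each one only to its B successors.
    pairs = set()
    bits = list(range(D))
    for _ in range(Q):
        random.shuffle(bits)
        order = sorted(signatures,
                       key=lambda doc: [signatures[doc][i] for i in bits])
        for i, doc in enumerate(order):
            for other in order[i + 1:i + 1 + B]:
                if hamming(signatures[doc], signatures[other]) <= T:
                    pairs.add(tuple(sorted((doc, other))))
    return pairs
```

With N documents, each permutation costs a sort plus O(NB) Hamming checks instead of the O(N²) comparisons of a brute-force scan; recall is traded off through the choices of Q, B, and T.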
The second algorithm operates on the output of the first, which is a list of similar cross-lingual document pairs, (De, Df). The goal is to generate candidate sentence pairs (se, sf) such that se ∈ De and sf ∈ Df, and decide whether each is “parallel” (i.e., a mutual translation) or not. This involves a very large number of decisions, so we introduce a two-step classification approach to balance efficiency and effectiveness. Both algorithms are implemented in the MapReduce programming model for increased scalability and parallelization [33].
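As a heavily simplified illustration of the two-step idea: a cheap classifier with inexpensive features prunes the vast majority of candidates, and only the survivors reach a more expensive classifier. All feature choices, thresholds, and the toy lexicon below are hypothetical; the classifiers of Chapter 3 are trained models with richer features.

```python
# Hypothetical cascade sketch of two-step bitext classification.

def length_ratio(se, sf):
    # Translations tend to have comparable lengths.
    return min(len(se), len(sf)) / max(len(se), len(sf), 1)

def translation_coverage(se, sf, lexicon):
    # Fraction of source tokens with a dictionary translation in sf.
    target = set(sf)
    hits = sum(1 for w in se if lexicon.get(w, set()) & target)
    return hits / max(len(se), 1)

def simple_score(se, sf, lexicon):
    # Step 1: cheap features only, meant to discard most candidates.
    return 0.5 * length_ratio(se, sf) + 0.5 * translation_coverage(se, sf, lexicon)

def complex_score(se, sf, lexicon):
    # Step 2: stand-in for the expensive classifier; a real system
    # would add alignment-based features here.
    return simple_score(se, sf, lexicon) * translation_coverage(se, sf, lexicon)

def extract_bitext(candidates, lexicon, t1=0.3, t2=0.2):
    # Cascade: only pairs surviving the cheap filter reach step 2.
    survivors = [(se, sf) for se, sf in candidates
                 if simple_score(se, sf, lexicon) >= t1]
    return [(se, sf) for se, sf in survivors
            if complex_score(se, sf, lexicon) >= t2]
```

The efficiency win comes from the first step being far cheaper per pair, so the expensive model runs on only a small fraction of the candidates.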
As a result, our two-phase approach outputs a parallel corpus, which we evaluate on the task of MT. By adding the extracted data to a baseline MT system, we show that substantial improvements can be obtained for the six language pairs we experimented with. Furthermore, this approach is essentially free, as opposed to the alternative: time-consuming and expensive human labor.

1 With more resources, commercial research groups can afford this type of approach [141].
1.2.2 Translating to Search
In the task of cross-language IR, we are interested in searching for information expressed in a foreign language,2 which requires translation of the user query into the document language. It is also possible to translate the document text into the query language, but we will assume the former case for simplicity. State-of-the-art CLIR approaches perform translation either by incorporating token-to-token translations in a probabilistic structure, or by treating an MT system as a black box to translate the query text. The former approach preserves the ambiguity of the possible translations of query words, but ignores any local context (i.e., neighboring words in the query) that might be useful when making translation choices. The latter approach produces a translation that is sensitive to query context, thanks to MT. However, it provides only one way of translating the query, which might cause lower recall in retrieval tasks.
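The token-based end of this spectrum is commonly realized as a probabilistic structured query (PSQ): each query term expands into a set of weighted translations, and the term frequency of the query term in a target-language document is taken to be the probability-weighted sum of its translations' frequencies. A minimal sketch, with made-up probabilities:

```python
# Minimal PSQ-style scoring sketch (toy translation probabilities).

def psq_tf(query_term, doc_tf, translation_probs):
    # Probability-weighted term frequency of a query term e in a
    # target-language document D: tf(e, D) = sum_f p(f|e) * tf(f, D).
    return sum(p * doc_tf.get(f, 0)
               for f, p in translation_probs[query_term].items())
```

Because every plausible translation contributes, translation ambiguity is preserved; the cost is that the probabilities are fixed per word, independent of the rest of the query.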
Despite its use as a black box for CLIR, an MT system contains much richer representations of the translation hypotheses. It is designed to search this exponential hypothesis space effectively and generate the top-scoring translations very efficiently. Therefore, it is natural to exploit the search capabilities of MT systems for CLIR. Based on this motivation, in Chapter 4, we explore the problem of constructing context-sensitive and ambiguity-preserving translation probabilities from the rich internal representation of a statistical MT system. We present a novel approach to incorporate various representations of the translation model within a modern statistical MT system, for the purpose of query translation in CLIR. In particular, we present a set of methods to construct a word translation probability distribution from (a) the translation grammar (from either hierarchical or flat MT approaches), and (b) the n-best translations output by the MT system. Each of these approaches corresponds to a compromise between the two extremes in the CLIR literature: (c) context-independent word-to-word translations and (d) using MT as a black box to get the one-best translation.

2 The language is foreign to the user providing the query.
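For instance, the n-best route (b) can be sketched as follows: each derivation in the n-best list carries a model weight and word alignments, and every aligned (source, target) pair votes with that weight; normalizing the votes per source word yields a distribution that is context-sensitive yet preserves ambiguity. The input format below is a simplifying assumption; real decoders expose richer structures.

```python
from collections import defaultdict

# Sketch of deriving token translation probabilities from an n-best list.

def nbest_probs(nbest):
    # nbest: list of (weight, [(source_word, target_word), ...])
    votes = defaultdict(lambda: defaultdict(float))
    for weight, aligned_pairs in nbest:
        for src, tgt in aligned_pairs:
            votes[src][tgt] += weight
    # Normalize the votes per source word into a distribution.
    return {src: {t: w / sum(tgts.values()) for t, w in tgts.items()}
            for src, tgts in votes.items()}
```

Words translated the same way in most high-scoring hypotheses end up with sharp distributions, while genuinely ambiguous words keep several weighted alternatives.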
Furthermore, since each of these approaches has complementary strengths, we introduce a combination-of-evidence technique based on a linear interpolation between (a), (b), and (c). We also present a learning scheme to optimize these interpolation weights in a supervised manner. Experimental results on three cross-language retrieval tasks (where each task involves searching a collection in a different language) support our hypothesis that context-sensitive and ambiguity-preserving translation techniques are superior. In addition, the interpolated model yields statistically significant improvements over baseline (c), consistently for all three collections, regardless of the underlying MT model. Our conclusion is even stronger if a hierarchical MT system is used: in this case, the improvement over both (c) and (d) is statistically significant for all three collections.
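The interpolation itself is straightforward once the per-source-word distributions are in hand; a sketch with illustrative weights (the actual weights are learned in a supervised manner, as noted above):

```python
# Sketch of the linear combination of evidence for one source word.

def interpolate(dists, weights):
    # dists: per-source-word {target: probability} dicts, e.g. from
    # (a), (b), and (c); weights should sum to 1 so the result
    # remains a probability distribution.
    combined = {}
    for dist, lam in zip(dists, weights):
        for tgt, p in dist.items():
            combined[tgt] = combined.get(tgt, 0.0) + lam * p
    return combined
```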
1.3 Contributions
There are several contributions of this dissertation, which can be grouped by the problem each corresponds to, as listed below.
Pairwise Similarity
1. We adapt Locality-Sensitive Hashing (LSH) techniques to solve the cross-lingual pairwise similarity problem. We introduce a MapReduce implementation of an existing LSH-based sliding window algorithm and empirically demonstrate its linear scalability characteristics. By using LSH methods, our approach can significantly reduce the amount of work required by a naive brute-force implementation, just by trading off a small amount of effectiveness.
2. We derive an analytical model of the sliding window algorithm, by introducing a deterministic version of the randomized process. This allows us to find a theoretical estimate of effectiveness (i.e., recall) as a function of the collection characteristics. This is very helpful for two reasons: with LSH-based approaches, a common issue is the existence of many parameters, with little guidance on how to set or tune them for a given task. On the other hand, a challenge with data-intensive approaches is the long running time of each experiment, which makes parameter tuning very time-consuming. With our analytical formulation, the user can estimate effectiveness and efficiency without running any experiments. As a result, by easily determining the effect of each parameter, it is much easier to find a parameter set tuned for a particular application.
3. We present a detailed empirical exploration of the parameter space when solving pairwise similarity on German-English Wikipedia. We also compare these results to our analytical model, and show that our model reasonably estimates the true recall values for a range of parameter settings. We point out various tradeoff decisions within the parameter space, and provide insight about the cases in which LSH-based methods are beneficial for this problem.
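For intuition only, the textbook random-projection estimate that such analyses build on (not the dissertation's model, which additionally accounts for the window and permutation structure) relates bit agreement to the angle between document vectors, with recall growing in the number of permutations Q:

```latex
% Agreement probability of one random-projection bit for vectors u, v:
P\big[h(u) = h(v)\big] = 1 - \frac{\theta(u,v)}{\pi},
\qquad \cos\theta(u,v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}

% If a single permutation surfaces a true pair within the window with
% probability p, then Q independent permutations miss it with
% probability (1 - p)^Q, giving
\mathrm{recall} \approx 1 - (1 - p)^{Q}
```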
Parallel Text Extraction
1. We introduce a MapReduce algorithm to effectively parallelize computational work when generating candidate sentence pairs from the output of the pairwise similarity algorithm (i.e., pairs of document identifiers that correspond to similar document pairs).
2. We introduce a novel two-step classification approach for efficiently yet effectively deciding which candidate sentence pairs should be added to the final parallel corpus.
3. We demonstrate the potential of our approach by evaluating it both intrinsically (i.e., assessing classification accuracy by comparing against ground truth) and extrinsically (i.e., using the extracted parallel text for training an MT system). Experiments indicate that improvements are possible over strong baselines for a diverse set of language pairs.
4. We explore the effect of extracted parallel text size and translation quality on the evaluation measure (i.e., BLEU score). This analysis demonstrates the trade-off between a smaller amount of higher-quality data and a larger amount of lower-quality data. In order to show the robustness of our approach and strengthen our conclusions, we experiment with a diverse set of six language pairs: German-English, Spanish-English, Chinese-English, Arabic-English, Czech-English, and Turkish-English.
Cross-Language Information Retrieval
1. To the best of our knowledge, our CLIR approach is the first to incorporate
the internal representation of modern statistical MT systems into the query
translation process. By exploiting different components of an MT system, we
show that it is possible to combine the advantages of token-based and 1-best
MT-based CLIR approaches.
2. We empirically demonstrate that a combination-of-evidence technique might
improve CLIR performance beyond any of the individual approaches.
Searching to Translate vs. Translating to Search
1. This dissertation provides a thorough exploration of one way to connect the
fields of IR and MT, by showing that search can be used to improve translation
and vice versa. As described above, we present solutions to two problems,
corresponding to the two sides of the integration of IR and MT. Additionally,
we introduce a more general framework, providing a vision on how a better
integration of IR and MT might be used to improve the state of the art of
both fields.
2. Within this general framework, we explore a bootstrapping approach to show
that improvements from parallel text extraction can be bootstrapped into the
CLIR model, resulting in an improved extraction approach. Based on our
evaluation, which is described in Chapter 5, we see additional improvements
to the BLEU score after this bootstrapping procedure.
Finally, in support of open science, all of the produced data are released openly.
This includes a set of similar Wikipedia article pairs, as well as a parallel corpus for six language pairs: German-English, Spanish-English, Czech-English, Turkish-English, Chinese-English, and Arabic-English. Additionally, all of our code is freely available to the entire research community, as part of the Ivory project at the University of Maryland.3 We believe these novel resources are helpful for research in many cross-language problems, including but not limited to MT and IR.
3ivory.cc
Chapter 2: Related Work
This chapter contains an overview of previous work related to various parts of this dissertation. Since they are at the center of this dissertation's focus, we first introduce the unfamiliar reader to Machine Translation and Information Retrieval, in Sections 2.1 and 2.2 respectively. Following these two sections, we summarize previous attempts to solve the problems we explore in this dissertation, mainly context-sensitive query translation in cross-language information retrieval, cross-lingual pairwise similarity, and parallel text extraction.
2.1 Machine Translation
Machine translation (MT) is the task of translating text written in one natural language (the source language) into corresponding text in another language (the target language). Until the early 1990s, MT research focused on rule-based systems, where linguistic experts would manually create a set of rules that described how text in the source language transforms into text in the target language, by representing both structural and lexical transformations [69]. After the 1990s, with the increasing availability of large datasets and computational power, MT research started to move towards statistical methods using machine learning techniques. This led to the rise of statistical machine translation (SMT), in which the text to be translated is assumed to have gone through a noisy channel [127]. In this model, we assume some sentence t was generated in the target language, but then got garbled into the source language, so that the task is to recover the most probable sentence t* that could explain the given source sentence s. The process therefore consists of two probabilistic sub-processes: (i) generation of the target-language sentence (i.e., P(t)), and (ii) translation from target to source language (i.e., P(s|t)).

Figure 2.1: Illustration of the noisy channel process, describing how a target-language sentence t is first generated, then gets garbled into the source language. Given the output of the noisy channel process, the MT decoder tries to recover the most probable explanation: the translation that is most likely to have been distorted into the input sentence.

Figure 2.1 illustrates the noisy channel model, and the following formulates this process:
$$P(t|s) = \frac{P(s|t)\,P(t)}{P(s)} \qquad (2.1)$$
Here, s represents a sentence in the source language, whereas t represents a sentence in the target language. Given s, we want to compare possible translation hypotheses t, so the denominator P(s) can be ignored. The remaining two processes P(s|t) and P(t) are modeled separately, commonly referred to as the translation model and language model, respectively. These are combined into a log-linear
model to describe the likelihood of a given hypothesis t:

$$\log P(t|s) = \log\left[P_{TM}(s|t) \times P_{LM}(t)\right] \qquad (2.2)$$
$$= \log P_{TM}(s|t) + \log P_{LM}(t) \qquad (2.3)$$
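As a toy illustration of Equation 2.3, the sketch below scores competing hypotheses by summing log translation-model and log language-model probabilities. The probability tables and the candidate translations are made-up values for illustration only, not quantities learned from data.

```python
import math

# Hypothetical model probabilities (illustration only; real systems learn
# these from parallel and monolingual text).
P_TM = {("congé de maternité", "maternity leave"): 0.6,
        ("congé de maternité", "maternal leave"): 0.3}
P_LM = {"maternity leave": 0.010, "maternal leave": 0.002}

def loglinear_score(s, t):
    """Score hypothesis t for source s as log P_TM(s|t) + log P_LM(t) (Eq. 2.3)."""
    return math.log(P_TM[(s, t)]) + math.log(P_LM[t])

def best_hypothesis(s, hypotheses):
    """Pick the hypothesis maximizing the log-linear score."""
    return max(hypotheses, key=lambda t: loglinear_score(s, t))

s = "congé de maternité"
best = best_hypothesis(s, ["maternity leave", "maternal leave"])
```

Working in log space turns the product of Equation 2.2 into the sum of Equation 2.3, which avoids floating-point underflow when many probabilities are multiplied.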
There are interesting problems on how to train and use language models (see
[64] for a review of language modeling techniques used for speech recognition, as well as other related work [14, 30, 152, 68]), but we leave aside the language modeling problem in this dissertation, as we are mainly interested in translation modeling. As a complementary resource, Lopez’s survey [86] provides a detailed literature review of statistical MT approaches.
The next two sections (Sections 2.1.1–2.1.2) describe the modeling part of MT.
In Section 2.1.3, the decoding problem is described. Finally, in Section 2.1.4, we present some background on MT evaluation.
2.1.1 Word Alignment
Statistical models of translation mostly descended from the IBM Models [16,
17], which encode the translation process at the word level. In these models, we assume there is an alignment that explains how words were transformed from the source language to the target language. Given that s = s1...sm and t = t1...tl, an alignment a : {1, ..., m} → {0, ..., l} is a function determining which target word each source word is aligned to. t0 is considered as a special NULL value, so that a value of 0 means the source word is not aligned to any target word. Under this
model, the probability of s being the translation of t is a sum over all possible ways to align words in s and t:

$$P_{TM}(s|t) = \sum_{\text{alignment } a} P(s, a|t) \qquad (2.4)$$
$$= \sum_{\text{alignment } a} P(a|t)\,P(s|t, a) \qquad (2.5)$$

where P(a|t) is the probability of the alignment, and P(s|t, a) is the probability that s is generated, given the alignment structure. We call the resulting translation model
PTM to distinguish it from other distributions. Once Equation 2.5 is computed, the most likely alignment is given by the following equations:
$$P(a|s, t) = \frac{P(s, a|t)}{\sum_{\text{alignment } a'} P(s, a'|t)} \qquad (2.6)$$
$$a^* = \operatorname*{arg\,max}_{\text{alignment } a} P(a|s, t) \qquad (2.7)$$
IBM Model 1 describes a simple process for word alignment, by naively assuming that (i) each possible alignment is equally probable for a given target sentence, and (ii) each source word is determined only by the target word it is aligned to.
Given a target sentence t, the process of word alignment takes three steps in this basic model:
1. Pick the number of source words to generate, m, with constant probability C.
2. Pick some alignment between s = s_1...s_m and t = t_1...t_l from a uniform distribution. Each alignment has equal probability: $\frac{1}{(l+1)^m}$.
NULL maternity leave
congé de maternité

Figure 2.2: Illustration of how words are aligned under IBM Model 1.
3. Generate each source word sj with probability P (sj|ta(j)).
As a result, under IBM Model 1, the alignment probability is the following:
$$P_{\text{Model 1}}(a|s, t) = C\,\frac{1}{(l+1)^m} \prod_{i=1}^{m} P(s_i|t_{a(i)}) \qquad (2.8)$$
In order to illustrate the alignment process in IBM Model 1, consider the sentence pair in Figure 2.2. In this example, the variables are instantiated as follows:
t = “maternal leave”
s = “congé de maternité”
l = 2
m = 3
As stated above, there are (l + 1)^m possible alignments, which is 27 for this example, one of which is illustrated in Figure 2.2. The probability of this alignment can be computed via Equation 2.8:
$$C\,\frac{1}{27}\,P(\text{de}|\text{NULL}) \times P(\text{maternité}|\text{maternal}) \times P(\text{congé}|\text{leave}) \qquad (2.9)$$
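The computation of Equations 2.8 and 2.9 can be sketched as follows; the lexical probabilities in `P_lex` are hypothetical placeholders rather than learned values.

```python
# Made-up lexical translation probabilities for the running example.
P_lex = {("de", "NULL"): 0.1,
         ("maternité", "maternal"): 0.7,
         ("congé", "leave"): 0.4}

def model1_alignment_prob(s_words, t_words, alignment, C=1.0):
    """P(a|s,t) = C * 1/(l+1)^m * prod_i P(s_i | t_{a(i)}), with t_0 = NULL (Eq. 2.8)."""
    l, m = len(t_words), len(s_words)
    t_with_null = ["NULL"] + t_words  # position 0 is the special NULL word
    prob = C / (l + 1) ** m
    for i, s_word in enumerate(s_words):
        prob *= P_lex[(s_word, t_with_null[alignment[i]])]
    return prob

s = ["congé", "de", "maternité"]
t = ["maternal", "leave"]
a = [2, 0, 1]  # congé -> leave, de -> NULL, maternité -> maternal
p = model1_alignment_prob(s, t, a)
```

With l = 2 and m = 3 this reproduces the 1/27 factor of Equation 2.9, multiplied by the three lexical probabilities.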
There are other word alignment models with fewer assumptions, resulting in more complex yet accurate models. For instance, IBM Model 2 contains additional parameters for distortion (i.e., the position of a target word t depends on the position of the source word s aligned to t), and Model 3 also includes the notion of fertility
(i.e., the number of source words each target word is aligned to depends on the target word) [17].
The remaining question is: how do we learn the parameters of a given alignment model (e.g., P(s_i|t_j))? Parallel text is available for many language pairs; however, alignments at the word level are usually not marked. If word alignments were known, we could estimate the parameters of IBM Model 1 as follows:
$$P(s_i|t_j) = \frac{c(s_i, t_j)}{\sum_k c(s_k, t_j)} \qquad (2.10)$$
where c(si, tj) is the number of times source word si is aligned to target word tj within the parallel corpus.
Since word alignments are not given, they need to be treated as unobserved (or hidden) variables, requiring an unsupervised process to learn these parameters. The
Expectation-Maximization algorithm [34] can be used to learn the set of parameters
that maximize the likelihood of the training data:

$$\theta^{(t+1)} = \operatorname*{arg\,max}_{\theta}\, \log P(D; \theta^{(t)}) \qquad (2.11)$$
$$= \operatorname*{arg\,max}_{\theta}\, \sum_{(s,t)} \log P_{TM}(s|t; \theta^{(t)}) \qquad \text{(independence assumption)}$$
$$= \operatorname*{arg\,max}_{\theta}\, \sum_{(s,t)} \log \sum_{\text{alignment } a} P(a|t; \theta^{(t)})\,P(s|t, a; \theta^{(t)}) \qquad \text{(by Equation 2.5)}$$
This is an iterative learning procedure that updates the parameters at the current iteration (i.e., θ^(t+1)) using model parameters from the previous iteration (i.e., θ^(t)).
Since IBM Model 1 describes a convex learning problem, the learning procedure is guaranteed to reach the global maximum (i.e., parameters that maximize data likelihood). More complex models may run into local maxima problems because their learning problems are non-convex. Therefore, in practice, it is recommended to train IBM Model 1 parameters from a uniform prior distribution, and then use the resulting parameters as initial parameters for more complex models. Additionally, running the word alignment in each direction (switching the source and target languages) and then combining the two mappings has been found to increase overall accuracy [106].
After doing this bidirectional alignment, we end up learning a statistical model of how text is translated, in both directions: PTM(s|t) and PTM(t|s).
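The EM training loop described above can be sketched for IBM Model 1 as follows. This is a minimal illustration that omits the NULL word and any convergence test; the two-sentence bitext is a made-up example.

```python
from collections import defaultdict

def em_model1(bitext, iterations=10):
    """Toy EM trainer for IBM Model 1 lexical probabilities P(s|t).

    bitext: list of (source_words, target_words) pairs. Starts from a
    uniform prior, as recommended in the text above.
    """
    s_vocab = {w for s, _ in bitext for w in s}
    uniform = 1.0 / len(s_vocab)
    prob = defaultdict(lambda: uniform)  # prob[(s_word, t_word)] = P(s|t)
    for _ in range(iterations):
        counts = defaultdict(float)   # expected counts c(s, t)
        totals = defaultdict(float)   # normalizers per target word
        for s_words, t_words in bitext:
            for sw in s_words:
                # E-step: posterior over which target word generated sw
                z = sum(prob[(sw, tw)] for tw in t_words)
                for tw in t_words:
                    frac = prob[(sw, tw)] / z
                    counts[(sw, tw)] += frac
                    totals[tw] += frac
        # M-step: re-estimate P(s|t) = c(s,t) / sum_k c(s_k,t)  (Eq. 2.10)
        for (sw, tw), c in counts.items():
            prob[(sw, tw)] = c / totals[tw]
    return prob

bitext = [(["congé", "de", "maternité"], ["maternal", "leave"]),
          (["congé"], ["leave"])]
prob = em_model1(bitext)
```

Because the second sentence pair forces congé to co-occur with leave alone, the expected counts gradually concentrate P(congé|leave), illustrating how EM sharpens an initially uniform distribution.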
2.1.2 Translation Model
These word alignments describe a statistical model for translation; however, using them directly for MT yields poor results. It has been shown that representing
translation units at the phrase level provides a much better model for translation, usually referred to as Phrase-Based Machine Translation (PBMT) [71, 107, 92].1 In phrase-based MT models, a phrase translation table can be generated either directly using phrase alignments [92], or more commonly by inducing all bilingual phrase pairs that are consistent with word alignments [71, 107].
Given a sentence pair, s = s_1...s_m, t = t_1...t_l, and a corresponding word alignment mapping, a, a bilingual phrase pair consists of a source phrase p_s = s_i s_{i+1}..., and a target phrase p_t = t_j t_{j+1}..., and is said to be consistent with a if there are no source (target) words aligned to a target (source) word outside of the phrase boundaries (ignoring alignments with the special NULL character). For instance, from the word alignment in Figure 2.2, the following phrase pairs can be generated:
(“congé”, “leave”)
(“congé de”, “leave”)
(“maternité”, “maternal”)
(“de maternité”, “maternal”)
(“congé de maternité”, “maternal leave”)
The parameters of this PBMT system are phrase translation probabilities, associated with each generated phrase pair. These parameters can be learned by counting each phrase pair (that is consistent with the word alignment) as an observation, and normalizing as in Equation 2.10.
1A phrase is simply a contiguous sequence of words, and does not have any linguistic motivation.
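The consistency condition above can be turned into a small extraction sketch. Assuming word alignments are given as a set of (source position, target position) links, the following illustrative code (not the dissertation's actual implementation) enumerates consistent phrase pairs and recovers exactly the five pairs listed above.

```python
def extract_phrase_pairs(s_words, t_words, alignment, max_len=3):
    """Enumerate phrase pairs consistent with a word alignment.

    alignment: set of (i, j) links between source position i and target
    position j; words aligned to NULL simply have no link. A pair of spans
    is consistent if no link crosses the span boundary in either direction.
    """
    pairs = []
    n = len(s_words)
    for s1 in range(n):
        for s2 in range(s1, min(n, s1 + max_len)):
            # target positions linked to this source span
            ts = [j for (i, j) in alignment if s1 <= i <= s2]
            if not ts:
                continue
            t1, t2 = min(ts), max(ts)
            # consistency: no link from inside the target span to a source
            # word outside [s1, s2]
            if any(t1 <= j <= t2 and not (s1 <= i <= s2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(s_words[s1:s2 + 1]),
                          " ".join(t_words[t1:t2 + 1])))
    return pairs

# Running example of Figure 2.2: congé -> leave, maternité -> maternal,
# and "de" aligned to NULL (no link).
s = ["congé", "de", "maternité"]
t = ["maternal", "leave"]
links = {(0, 1), (2, 0)}
pairs = extract_phrase_pairs(s, t, links)
```

Note how the unaligned word de is free to attach to either neighbor, which is why both (“congé de”, “leave”) and (“de maternité”, “maternal”) are produced.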
R1: [S] || [S,1] || [S,1]
R2: [S] || [X,1] || [X,1]
R3: [X] || [X,1] leave in europe || congé de [X,1] en europe || 1-0 2-3 3-4 || 1.0
R4: [X] || maternal || maternité || 0-0 || 0.69

Figure 2.3: A SCFG with four rules.
The major drawback of this PBMT formalism is that we cannot represent
“gaps” between words in a phrase, which limits the generalizability. With this in mind, the phrase-based MT approach was extended to include hierarchical representations, using synchronous context-free grammars (SCFGs) instead of flat phrase tables [24, 25]. In this approach, the translation model is an SCFG, which consists of rules of the format [X] || α || β || A || P(α → β), indicating that the context-free expansion X → α in the source language occurs synchronously with
X → β in the target language, with a probability of P(α → β). Rule probabilities are computed similarly to phrase translation rules in PBMT systems, based on lexical translation probabilities learned from word alignments (i.e., parameters of IBM
Model 1). In this case, we call α the Left-Hand Side (LHS) of the rule, and β the
Right-Hand Side (RHS) of the rule. We use indexed nonterminals (e.g., [X,1]) since in principle more than one nonterminal can appear on the right side. A sequence of word position pairs A represents the word alignment function (i.e., which word in
α is aligned to which target word in β).
Consider the four rules in the SCFG described in Figure 2.3. S refers to the sentence; therefore, the first two rules are special rules without any lexical items, describing that there is one sentential form, consisting of a single variable. In the third and fourth rules, we see the structure of the English phrase and how it is
translated into French.

Figure 2.4: Illustration of how maternal leave in europe is translated with a synchronous context-free grammar.
Figure 2.4 shows the derivation tree that synchronously parses and translates the source text (tokenized as maternal leave in europe) into congé de maternité en europe using the above grammar rules. The left and right trees in the figure correspond to the bottom-up parse of the source text and its translation, respectively.
Each line between symbols in the figure indicates a rule application from the above grammar, in which case we annotate it with the corresponding rule id.
Notice the representational advantage of having variables in addition to phrase pairs: this approach can model translations with “gaps,” allowing better generalization. As an example, rule R3 can be used for translating paternal leave in Europe as well, assuming we have an additional rule R5 for the lexical translation:
R5: [X] || paternal || paternité || 0-0 || 0.72
Rule R4 is not applicable in this case, so R5 replaces it, producing congé de paternité en europe (Eng. paternal leave in Europe) as the translation. This derivation is illustrated in Figure 2.5.

Figure 2.5: Illustration of how paternal leave in europe is translated with a synchronous context-free grammar.
To make comparison easier, we hereafter refer to phrase translation tables and SCFGs as “flat” and “hierarchical” translation grammars, respectively.
2.1.3 Decoding
The above derivation is only one example among many possible ways to translate the source sentence, maternal leave in Europe. Since there might be multiple ways to generate the same translation, we need to marginalize over the entire set of derivations to get the true probability of a translation:
$$P_{TM}(s|t) = \sum_{D \in \mathcal{D}(s,t)} P_{TM}(s, D|t) \qquad (2.12)$$
where s and t are source and target sentences, and D(s, t) is the set of derivations that synchronously generate s and t.
The size of D grows exponentially; therefore, Equation 2.12 is typically approximated by replacing the sum operator with a maximum operator (called the Viterbi approximation) [25, 71]. With the ability to score a given translation candidate, we can search through all possible translations and find the highest scoring one:
$$t^* = \operatorname*{arg\,max}_{t} P(t|s)$$
This process is called decoding, and modern MT systems provide efficient algorithms to decode a given sentence, using heuristics to prune the search space meaningfully (e.g., cdec [36]). The actual decoding speed depends on how much pruning is performed and the CPU specifications, but a handful of sentences can be decoded within a second even on a single desktop computer.
2.1.4 Evaluation
When evaluating MT output, two aspects are most important: fluency and adequacy. The former corresponds to how well-written the output is, in terms of being grammatical and having correct word choices. The latter corresponds to how well the meaning of the source text is preserved in the output. Since it is time-consuming and costly to have humans evaluate MT output, researchers have turned to automatic methods, which compare the MT output to a human reference translation. Currently, the most popular metric is BLEU, which calculates n-gram precision for n = 1, ..., k and produces a weighted average of these.2 This measure is intolerant of variation in word choice, and will assign lower scores to synonyms of reference translations. To alleviate this issue, using multiple references has been shown to produce a more appropriate assessment. There are many points of criticism towards BLEU, but this debate is beyond the scope of this dissertation.
For more information on MT evaluation and alternative metrics, see Chapter 8 of
Koehn's SMT book [69].
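A simplified, single-sentence, single-reference version of this kind of n-gram comparison can be sketched as follows. Real BLEU implementations aggregate counts over a whole corpus and support multiple references, so this is only an illustration; the example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, k=4):
    """Geometric mean of clipped n-gram precisions for n = 1..k,
    times a brevity penalty (a simplified, sentence-level sketch)."""
    log_precisions = []
    for n in range(1, k + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    # brevity penalty: penalize candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / k)

ref = "congé de maternité en europe".split()
hyp_good = "congé de maternité en europe".split()
hyp_bad = "paternité congé europe de en".split()
```

The scrambled hypothesis shares every unigram with the reference but no bigram, showing why higher-order n-grams reward fluency, not just word choice.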
2.2 Information Retrieval
Below is a broad definition of the academic field of Information Retrieval, by
Manning, Raghavan, and Schütze [91]:
“Information retrieval (IR) is finding material (usually documents) of an un- structured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).”
In our work, we assume that the information need is provided in the form of a query (ranging from a few words to an entire document), and the relevant material is a ranked list of documents written in natural language. Therefore, the task is to compute a relevance score, Score(D|q), for each document D in a large collection of documents, given a query q.
2A typical value for k is 4 [69].
24 2.2.1 Vector-Space Model
A typical approach in IR is to represent each document as a vector of weighted terms (called document vector), where a term is a class representing some unit of text in the document (usually words or stems). A pre-determined list of stop words
(e.g., “the,” “an,” “my”) may be removed from the set of terms, since they have been found to create noise in the search process. In this so-called vector space model, the weight of each term represents its importance in the document or query.3 The relevance of document D, relative to query q is usually computed by treating each query term independently, and aggregating the term-document weights:
$$\text{Score}(D|q) = \sum_{\text{term } q_j \in q} \omega(q_j, D) \qquad (2.13)$$
The ω function is an IR weighting scheme, and can be implemented in many different ways, such as BM25 [118]. The two most important components of an IR weighting scheme are term frequency (tf) and document frequency (df). The former is the ratio of the frequency of the query term in the document, and the latter is the ratio of documents in which the query term appears, within a large collection:
$$tf(q_j, D) = \frac{\text{count of } q_j \text{ in } D}{\text{number of tokens in } D} \qquad (2.14)$$
$$df(q_j) = \frac{\text{number of documents } q_j \text{ appears in}}{\text{total number of documents in collection}} \qquad (2.15)$$
Term frequency corresponds to relevance or “aboutness” of the term within
3The first known use of this model is in the SMART retrieval system [91, 119].
the document context, whereas document frequency refers to the specificity or
“rareness” of the term [118]. Models with a weighting scheme based on the term and document frequency values are referred to as tf-idf models.
Additionally, it is useful to perform document length normalization in a weight- ing scheme. This might involve normalizing all document vectors to unit length, or learning a pivot document length and normalizing all document lengths with respect to that [128].
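A minimal sketch of Equations 2.13 through 2.15, using tf × log(1/df) as one simple, illustrative choice of the ω weighting function; the tiny document collection is made up.

```python
import math

def tf(term, doc):
    """Term frequency: the term's count divided by document length (Eq. 2.14)."""
    return doc.count(term) / len(doc)

def df(term, collection):
    """Document frequency: ratio of documents containing the term (Eq. 2.15)."""
    return sum(1 for d in collection if term in d) / len(collection)

def score(doc, query, collection):
    """Sum of per-term tf-idf weights, one simple instance of Eq. 2.13."""
    total = 0.0
    for term in query:
        d = df(term, collection)
        if d > 0 and term in doc:
            total += tf(term, doc) * math.log(1 / d)  # idf = log(1/df)
    return total

docs = [["congé", "de", "maternité"],
        ["congé", "de", "paternité"],
        ["vacances", "en", "europe"]]
query = ["maternité"]
scores = [score(d, query, docs) for d in docs]
```

Because log(1/df) shrinks to zero for terms that appear in every document, common words contribute little even before stop-word removal.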
2.2.2 Other Models
In addition to the vector space model in Equation 2.13, alternative approaches have been successful at modeling relevance in IR. A popular approach is probabilistic
IR, in which the probability of query-document relevance is modeled directly, by computing the odds of a given query-document pair being relevant:
$$\text{odds}(D, q) = \frac{P(\text{relevant}|D, q)}{P(\text{non-relevant}|D, q)} \qquad (2.16)$$
$$= \frac{P(R{=}1, D, q)}{P(R{=}0, D, q)} \qquad (2.17)$$
$$= \frac{P(D|R{=}1, q)\,P(R{=}1, q)}{P(D|R{=}0, q)\,P(R{=}0, q)} \qquad (2.18)$$
$$= \frac{P(D|R{=}1, q)}{P(D|R{=}0, q)} \qquad (2.19)$$
$$(P(R, q) \text{ is constant for a given query})$$
Assuming that words are independent and that words not in the query are equally likely to appear in relevant and non-relevant documents, we end up with the
following formulation:

$$\text{odds}(D, q) = \prod_{\{\text{term } j:\; q_j \in D\}} \frac{p_j(1 - u_j)}{u_j(1 - p_j)} \qquad (2.20)$$
$$p_j = P(q_j \in D \,|\, D \text{ is relevant}, q) \qquad (2.21)$$
$$u_j = P(q_j \in D \,|\, D \text{ is non-relevant}, q) \qquad (2.22)$$
For easier computation and interpretation, we can use the logarithm of the odds function, resulting in a value known as the Retrieval Status Value (RSV) [44].
There are various alternatives for estimating the parameters uj and pj. The former
(probability of a query term appearing in a non-relevant document) corresponds to a df-like “rareness” feature, since almost all of the documents in a collection are non-relevant. The latter (i.e., probability of a query term appearing in a relevant document) is commonly assumed to be 0.5 [91], but can be updated after an initial retrieval run by pseudo-relevance feedback: assuming the retrieved documents are relevant, set the estimate based on maximum likelihood.
Another successful retrieval model is based on language modeling, introduced by Ponte and Croft [113]. In this formalism, each document D is represented by a language model, which is a probability distribution of terms (i.e., sum of probabilities for all terms in D is 1). A term might simply refer to word or stem classes, but it is also possible to use more complex linguistic structures as terms, such as multi-word expressions, or grammar-based syntactic forms.
Given a query q, the relevance of each document is defined probabilistically
(i.e., the probability of the document, conditioned on the query), which is expanded as follows:

$$P(D \text{ is relevant}|q) = \frac{P(q|D \text{ is relevant})\,P(D \text{ is relevant})}{P(q)} \qquad \text{(by Bayes' Rule)} \qquad (2.23)$$
$$\propto P(q|D \text{ is relevant})\,P(D \text{ is relevant}) \qquad (P(q) \text{ is constant for a given query}) \qquad (2.24)$$
$$\propto P(q|D \text{ is relevant}) \qquad (P(D) \text{ is assumed to be uniform}) \qquad (2.25)$$
$$\propto \prod_{\text{term } j} P(q_j|D \text{ is relevant}) \qquad \text{(word independence assumption)} \qquad (2.26)$$
Language model parameters can be estimated by the term frequency values
(Equation 2.14), but they will suffer from data sparsity: a document has a limited amount of text to appropriately estimate the underlying model of language use.
This issue is addressed by using smoothing techniques, such as the Jelinek-Mercer method, which interpolates between estimates from the document and a more general collection, C [151]:
$$P(q_j|D) = \alpha\,\frac{tf(q_j, D)}{\sum_w tf(w, D)} + (1 - \alpha)\,P(q_j|C) \qquad (2.27)$$
Another discounting approach is Bayesian smoothing with Dirichlet priors:

$$P(q_j|D) = \frac{tf(q_j, D) + \alpha\,P(q_j|C)}{\sum_w tf(w, D) + \alpha} \qquad (2.28)$$
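Both smoothing methods of Equations 2.27 and 2.28 can be sketched directly; the α values and the toy two-document collection below are arbitrary illustrations.

```python
def tf_ratio(term, doc):
    return doc.count(term) / len(doc)

def p_collection(term, collection):
    """Background model: term frequency over the concatenated collection."""
    total = sum(len(d) for d in collection)
    return sum(d.count(term) for d in collection) / total

def jelinek_mercer(term, doc, collection, alpha=0.8):
    """Eq. 2.27: interpolate the document and collection estimates."""
    return alpha * tf_ratio(term, doc) + (1 - alpha) * p_collection(term, collection)

def dirichlet(term, doc, collection, alpha=100.0):
    """Eq. 2.28: Bayesian smoothing with a Dirichlet prior of mass alpha."""
    return (doc.count(term) + alpha * p_collection(term, collection)) / (len(doc) + alpha)

docs = [["congé", "de", "maternité"], ["vacances", "en", "europe"]]
doc = docs[0]
p_seen = jelinek_mercer("congé", doc, docs)
p_unseen = jelinek_mercer("europe", doc, docs)
```

The key effect of both schemes is visible here: a term absent from the document still receives a small nonzero probability from the collection, so a query containing one unseen word no longer zeroes out the whole product of Equation 2.26.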
In IR approaches, it is quite common to make the independence assumption between query words, although it is obviously a very strong assumption that almost never holds. On the other hand, there have been previous approaches that model the local context in the retrieval model. Explicitly representing some of the term dependencies in a structured query form has yielded success empirically [49, 97], and there are many approaches that consider multi-word expressions/phrases [5, 155] in a rather ad-hoc way. Despite various attempts to overcome the independence assumption, the word-based vector space model described in Equation 2.13 has remained a standard baseline approach in IR, mainly due to its simplicity, efficiency, and flexibility.
2.2.3 Retrieval
In IR, we are generally interested in the top k scoring documents of a collection of size N, given a query q. Therefore, a naive implementation of the retrieval process may use a heap data structure of size k, using the query-document score
(i.e., Score(D|q)) as the key and the document id as the value. Once all N documents are processed, we are guaranteed to have the top k documents in the heap. This procedure requires O(N log k) time and O(k) space.
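This naive heap-based procedure can be sketched as follows, with made-up document scores standing in for Score(D|q).

```python
import heapq

def top_k(scores, k):
    """Naive retrieval: keep the k best (score, doc_id) pairs in a size-k
    min-heap, for O(N log k) time and O(k) space over N documents."""
    heap = []  # min-heap keyed on score; the worst retained score is at heap[0]
    for doc_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best first

scores = {"d1": 0.3, "d2": 0.9, "d3": 0.1, "d4": 0.7}
best = top_k(scores, 2)
```

Keeping a min-heap of the k best scores means each new document is compared only against the worst retained score, which is what yields the log k (rather than log N) factor.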
However, notice that we do not need to process all documents in order to
retrieve the ranked list for q. One way to optimize the process is to ignore documents that do not contain any of the query terms. To achieve this, an inverted index of the collection is built, which is a mapping from each term to its posting list: a list of pointers to the documents the term appears in. Each document in the collection is assigned a document id, which is a unique integer identifier of the document. The posting list of each term, say t, uses the document id as a pointer to the actual document D, along with other information, such as the term frequency (i.e., tf(t, D)), and sometimes the positions of t within D. There are many approaches to building and maintaining the inverted index, which are beyond the scope of this dissertation, but a few points are worth mentioning.
For retrieval, there are two main approaches to computing the query-document scores (i.e., Score(D, q)): term-at-a-time and document-at-a-time [140]. The latter technique processes documents one by one, computing the query-document score before moving to the next document. By having access to an inverted index, we only need to process documents that appear in the union of the query terms’ postings lists.4 This reduces the amount of processing substantially, when compared to the
O(N log k) complexity shown above.
For even better optimization, we can take advantage of the term independence assumption and compute partial query-document scores for each query term. For example, for the query “maternal leave,” the partial score of the term maternal is computed for each document in the index: ω(“maternal”, D0), ω(“maternal”, D1), ...,
and so on. Each document has a score accumulator, into which these partial scores are accumulated. After all query terms are processed, each accumulator contains the corresponding query-document score. Due to the extra storage requirements of this approach, many researchers have studied methods to limit the number of accumulators based on a variety of optimization strategies [78, 111, 134].
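The inverted index and term-at-a-time accumulators can be sketched together as follows. The partial score here is just the raw term frequency, a stand-in for a full ω weighting scheme; the document collection is made up.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> posting list of (doc_id, term frequency)."""
    index = defaultdict(list)
    for doc_id, words in docs.items():
        counts = defaultdict(int)
        for w in words:
            counts[w] += 1
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index

def term_at_a_time(index, query):
    """Score one query term at a time, accumulating partial scores per document.
    Only documents in the union of the query terms' posting lists are touched."""
    accumulators = defaultdict(float)
    for term in query:
        for doc_id, freq in index.get(term, []):
            accumulators[doc_id] += freq  # stand-in for omega(term, doc)
    return dict(accumulators)

docs = {"d1": ["congé", "de", "maternité"],
        "d2": ["congé", "en", "europe"],
        "d3": ["vacances", "en", "europe"]}
index = build_index(docs)
scores = term_at_a_time(index, ["congé", "maternité"])
```

Document d3 contains neither query term, so it never receives an accumulator, illustrating why only the union of the posting lists is processed.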
The major advantage of term-at-a-time scoring is the ability to apply certain heuristics. Documents in a posting list are usually sorted either by their document id, or by the term frequency. The latter order allows early termination heuristics with term-at-a-time scoring: after the tf falls below a certain threshold, we can terminate the processing of the postings list. Another useful heuristic is to process terms in order of their df value, assuming that terms with lower df should contribute higher partial scores, and it might be possible to terminate before processing some of the high-df terms.
2.2.4 Evaluation
There have been many evaluation measures proposed for the retrieval task [91].
Precision is the ratio of correctly identified relevant documents among those retrieved, whereas recall is the ratio of truly relevant documents that were identified. Therefore, the general focus of evaluation strategies is to find a metric balancing precision and recall, where the perfect balance depends on the application: web search requires high precision, whereas paralegals expect very high recall. F-score (or F1 score) is a popular way to combine precision and recall without any prior preference. It is the
harmonic mean of these two values:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.29)$$
Another way to think of combining precision and recall is through the so-called precision-recall curve, which plots the precision of the system as a function of recall.
The retrieval system is then measured by the area under this curve. This requires an integral of the function over recall values in the range [0, 1], which is not feasible to compute exactly. As a solution, 11-point interpolated average precision approximates this value by computing precision at 11 recall points within [0, 1], at equal intervals of 0.1. An alternative is to (starting from the highest ranked document) compute precision after every relevant document in the ranked list, until the list is exhausted, and divide the sum of these precisions by the total number of relevant documents. This is called the average precision (AP), and the mean of AP values across a set of queries corresponds to MAP. MAP has been shown to be stable and has been widely adopted in the IR community.
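A sketch of AP and MAP as just described, on a made-up ranking with two relevant documents:

```python
def average_precision(ranked, relevant):
    """Sum the precision after each relevant document in the ranking, then
    divide by the number of relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over a set of (ranked list, relevant set) queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["d2", "d1", "d4", "d3"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)
```

Here the relevant documents appear at ranks 2 and 4, giving precisions of 1/2 and 2/4, so AP = (0.5 + 0.5) / 2 = 0.5; placing them at the top of the ranking would give a perfect AP of 1.0.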
Other measures may incorporate graded relevance scores, since binary values
(i.e., relevant vs. non-relevant) are often incapable of capturing accurate assessments. Discounted cumulative gain (DCG) computes the total gain of the retrieved documents, in which the document ranked ith in the ranked list (i.e., d_i) contributes a gain of g_i = relevance(d_i)/max{log(i), 1}. In order to allow a perfect 1.0 value, the sum of gains is normalized by dividing by the upper bound, yielding the normalized discounted cumulative gain, or nDCG:

$$DCG_m = \sum_{i}^{m} g_i \qquad (2.30)$$
$$nDCG_m = \frac{DCG_m}{\sum_{i}^{m} 1.0/\max\{\log(i), 1\}} \qquad (2.31)$$
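Equations 2.30 and 2.31 can be sketched as follows; note that the normalizer assumes a maximum graded relevance of 1.0 at every rank, as in the text, and the relevance values below are made up.

```python
import math

def gain(relevance, rank):
    """g_i = relevance(d_i) / max(log(i), 1), following Eq. 2.30."""
    return relevance / max(math.log(rank), 1.0)

def ndcg(relevances):
    """nDCG as in Eq. 2.31: DCG divided by the DCG of a ranking in which
    every position has the maximum graded relevance of 1.0."""
    m = len(relevances)
    dcg = sum(gain(rel, i) for i, rel in enumerate(relevances, start=1))
    ideal = sum(gain(1.0, i) for i in range(1, m + 1))
    return dcg / ideal

perfect = ndcg([1.0, 1.0, 1.0])
degraded = ndcg([1.0, 0.0, 0.5])
```

The log-based discount means a relevance error at rank 1 costs far more than the same error deep in the ranking, which matches the intuition that users mostly inspect the top of the list.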
2.2.5 IR Across Languages
Cross-language information retrieval (CLIR) is a special case of IR, in which the query is written in a language different from that of the documents. This has become a popular problem due to the increase of multi-lingual collections.
The mismatch between query and document languages needs to be addressed.
Although there are approaches that try to find a language-independent “concept space” for CLIR, the commonly adopted, more efficient solution is to perform “translation”. This is different from Machine Translation, since producing translated text is not the goal of CLIR; it is merely used as a tool to match the vocabularies of queries and documents.
In the CLIR literature, there are two main approaches to addressing the vocabulary mismatch problem: translating the query into the document language, or translating the documents into the query language. One advantage of query translation is the opportunity for interacting with the user to enhance the query translation, as well as easier integration with expansion techniques. Another benefit is the lower computational work associated with translating queries, when compared to all documents in the collection. On the other hand, from the user's point of view, retrieved
documents need to be translated anyway, so we might as well do that in advance.
A major drawback of query translation is relatively low translation quality, due to limited context in short queries (typically a few words), as opposed to well-formed long sentences that commonly comprise document text. In this dissertation, we focus on query translation due to its popularity in the CLIR literature, making it easier to compare to related work. However, all proposed methods can be adapted for document translation just as well.
Regardless of whether queries or documents are being processed, there are two main ways to achieve translation in CLIR: directly applying an MT system to translate the text, or projecting the vector representation through a term-to-term bilingual mapping. Let us illustrate the difference between the two translation approaches with an example query, q = “maternal leave”. The MT-based CLIR approach translates the query into the target language (e.g., French), which turns out as q′ = “congé de maternité”. The resulting translated query, q′, can be directly used for retrieval, treating it as the original query:
ScoreCLIR-MT(D|q) (2.32)
= Score(D|q′) (2.33)
= ω("congé", D) + ω("de", D) + ω("maternité", D) (2.34)
The computation of this sum follows naturally, since all documents are already in the target language (i.e., French), and therefore it is straightforward to compute the tf and df of the translated query’s terms.
As an alternative approach, computing Score(D|q) directly would require defining the weights of the non-translated query terms. Assuming a standard weighting scheme with tf and df components, the query-document score is computed as follows:
ScoreCLIR-Token(D|q) = Σ_{qi∈q} ω(tf(qi, D), df(qi)) (2.35)

which corresponds to the following equation for our running example:
ScoreCLIR-Token(D|“maternal leave”) =
ω(tf(“maternal”,D), df(“maternal”)) + ω(tf(“leave”,D), df(“leave”))
(2.36)
In order to complete the missing parts in this equation, Darwish and Oard proposed a translation approach based on a token-to-token bilingual mapping [31], which we call "token-based" CLIR. This mapping consists of probability distributions for each word in the query-language vocabulary, where each distribution describes the possible translations in the document language, with associated probabilities. For example, the following is the distribution for the translation of "maternal":
Pmaternal(“maternel”|“maternal”) = 0.8 (2.37)
Pmaternal(“maternelle”|“maternal”) = 0.15 (2.38)
Pmaternal("maternité"|"maternal") = 0.05 (2.39)
These values can be learned from the word alignments in a parallel corpus, as shown in Section 2.1.1, in Equation 2.10. Basically, after running the iterative learning procedure, the parameters of IBM Model 1 define the translation probabilities required for Equations 2.37–2.39.
Once the translation probabilities are known, the tf and df values can be translated as follows:
tf(qj, D) = Σ_{t∈D} tf(t, D) Pqj(t|qj) (2.40)

df(qj) = Σ_{t∈D} df(t) Pqj(t|qj) (2.41)
Following our previous example, the term frequency becomes:
tf(“maternal”,D) = tf(“maternel”,D) × 0.80
+ tf(“maternelle”,D) × 0.15
+ tf("maternité", D) × 0.05 (2.42)
and similarly for the document frequency computation.
After the tf and df of the query terms are computed, we can plug these values into Equation 2.35 to score documents in a cross-language setting.
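To make the computation concrete, here is a minimal Python sketch of the token-based scoring scheme. The translation probabilities, document statistics, and the simple tf-idf weighting function are toy values for the running "maternal leave" example, not taken from [31]:

```python
import math

# P(t|q_j): toy translation probabilities for each query-language term
translation_probs = {
    "maternal": {"maternel": 0.80, "maternelle": 0.15, "maternité": 0.05},
    "leave": {"congé": 0.70, "laisser": 0.20, "quitter": 0.10},
}

def translated_tf(query_term, doc_tf):
    """Project term frequencies into the query language (Eq. 2.40)."""
    return sum(doc_tf.get(t, 0) * p
               for t, p in translation_probs[query_term].items())

def translated_df(query_term, coll_df):
    """Project document frequencies the same way (Eq. 2.41)."""
    return sum(coll_df.get(t, 0) * p
               for t, p in translation_probs[query_term].items())

def score(query_terms, doc_tf, coll_df, num_docs):
    """Sum a simple tf-idf weight over query terms (Eq. 2.35)."""
    total = 0.0
    for q in query_terms:
        tf = translated_tf(q, doc_tf)
        df = translated_df(q, coll_df)
        if tf > 0 and df > 0:
            total += tf * math.log(num_docs / df)
    return total

# Toy French document and collection statistics
doc_tf = {"congé": 2, "maternité": 1, "de": 5}
coll_df = {"congé": 120, "maternité": 45, "maternel": 30,
           "maternelle": 25, "laisser": 400, "quitter": 350, "de": 900}
print(score(["maternal", "leave"], doc_tf, coll_df, 1000))
```

The key point is that the document remains in the target language; only its tf and df statistics are projected through the translation table before applying the usual weighting.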
The token-based CLIR approach described above achieves strong empirical results [31], mostly because it preserves the ambiguity of language. The underlying bilingual mapping is learned from a large number of sentence pairs, capturing the diverse language use existent in the training examples. However, this also means
that success is highly dependent on how similar the training data is to the given query. Furthermore, highly ambiguous words have very noisy translation choices.
For example, there are many different senses of the word “leave,” such as take time off, forget, and exit, and each of these senses corresponds to a different translation: in
French, congé, laisser, and quitter refer to these three senses. The most appropriate translation choice depends on the sense of the word; therefore, it is essential to consider which sense of the word is used in the query. For instance, knowing that the query is "maternal leave," the probability that "leave" is used in the take time off sense increases substantially.
By incorporating the query context into the translation process, we potentially
find more appropriate translation choices. This is exactly why the MT-based approach (Equation 2.32) sometimes outperforms token-based CLIR (Equation 2.35).
The former method takes context into account, implicitly through the use of MT techniques, which model the query translation without making any word independence assumptions.
Due to its critical role in the success of CLIR, this polysemy problem has received much interest from researchers: given all possible ways to translate each query term, the task is to select, rank, or assign probabilities to translation choices,
conditioning on the query context (i.e., computing updated probabilities Pqj(t|qj, q)) for each query term qj. We refer to this problem as context-sensitive CLIR and list related work below.
2.3 Context-Sensitive CLIR
In this dissertation, we introduce novel solutions for the CLIR problem, focusing on the notions of context-sensitivity and preserving linguistic ambiguity. As mentioned above, there are two main translation approaches in CLIR, which we refer to as MT-based (Equation 2.32) and token-based (Equation 2.35). These approaches have complementary strengths: MT makes good use of context, but at the cost of producing a single translation, thus discarding other possibilities that might be useful. Therefore, MT-based approaches are context-sensitive, but not ambiguity-preserving.
On the other hand, token-based approaches preserve all of the ambiguity that exists in the training examples, from which the translation probabilities are trained, yet do not model the local query context. As a result, these approaches produce ambiguity-preserving, yet context-independent translation probabilities.
While both of these approaches have been successful in cross-language retrieval tasks, there is room for improvement. As reviewed in Section 2.1, statistical MT systems consist of context-sensitive and ambiguity-preserving models of translation, as well as approximate-search algorithms to efficiently search and rank the most appropriate translations. In spite of such appealing properties, CLIR research has borrowed very little from the field of MT. Related work is listed below, followed by a discussion of what is missing, and the novelty of our work.
One of the first attempts to incorporate MT ideas into IR research dates back to Ponte and Croft’s use of a language model to score query-document pairs in monolingual IR [113]. Details of this approach were presented in Section 2.2.2.
Drawing further inspiration from MT, Berger and Lafferty extended the idea, by modeling retrieval in the same generative process as the noisy channel model in MT.
It is assumed that this noisy channel corrupted some document(s) into the query, and the task is to recover these document(s) [10]. Their probabilistic model contains a translation component that represents synonymy or other word relationships.
Applying these ideas to the cross-language case was a natural extension, by combining a language model and a translation model to perform CLIR, similarly to MT. Given a query q, and document D, Kraaij et al. [73] showed two ways to incorporate the translation model:
1. Modeling translation on the query side:

∀ti ∈ T : P(ti|q) = Σ_{query term sj∈S} P(ti|sj) P(sj|q)

Score(D|q) = Σ_{document term ti∈T} P(ti|q) log P(ti|D)

2. Modeling translation on the document side:

∀sj ∈ S : P(sj|D) = Σ_{document term ti∈T} P(sj|ti) P(ti|D)

Score(D|q) = Σ_{query term sj∈S} P(sj|q) log P(sj|D)
Both of these approaches exhibit substantial improvements over earlier dictionary-based baselines, reporting Mean Average Precision (MAP) scores in the range of 90% of monolingual comparison conditions [73]. Others have presented similar approaches with slightly different smoothing techniques, yielding similarly strong empirical results [148, 147].
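The query-side model can be sketched as follows, with made-up translation probabilities, a uniform P(sj|q), and an unsmoothed maximum-likelihood document language model (a real implementation would smooth P(ti|D)):

```python
import math

p_t_given_s = {  # P(t_i | s_j): toy translation model
    "maternal": {"maternel": 0.8, "maternité": 0.2},
    "leave": {"congé": 0.7, "laisser": 0.3},
}

def p_t_given_q(t, query):
    # P(t_i|q) = sum_j P(t_i|s_j) P(s_j|q), with P(s_j|q) uniform here
    return sum(p_t_given_s[s].get(t, 0.0) for s in query) / len(query)

def score(query, doc_tokens):
    # Score(D|q) = sum_i P(t_i|q) log P(t_i|D), skipping zero-probability
    # terms since this sketch does no smoothing
    doc_len = len(doc_tokens)
    total = 0.0
    target_terms = {t for s in query for t in p_t_given_s[s]}
    for t in target_terms:
        p_lm = doc_tokens.count(t) / doc_len
        if p_lm > 0:
            total += p_t_given_q(t, query) * math.log(p_lm)
    return total

doc = ["congé", "de", "maternité", "en", "europe"]
print(score(["maternal", "leave"], doc))
```

Scores are log-probabilities, so they are negative; documents whose language model assigns higher probability to likely translations of the query score closer to zero.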
Federico and Bertoldi presented a similar approach, in which they use Hidden Markov Models (HMMs) to represent the translation as a sequential process [41], using a bigram language model (with discounting) for transition probabilities, P(ti+1|ti), in addition to the standard bilingual translation probabilities. The inclusion of transition probabilities allows more appropriate translation choices. For example, in addition to considering possible translations of maternal and leave separately, say maternel and congé, the HMM considers the probability of congé following maternel as well (i.e., P(congé | maternel)), which is not part of the other approaches.
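A minimal sketch of this idea, assuming toy emission and bigram probabilities and a standard Viterbi search; the probability values and the default transition probability are invented for illustration:

```python
import math

emission = {  # P(t | s): toy bilingual translation probabilities
    "maternal": {"maternel": 0.6, "maternité": 0.4},
    "leave": {"laisser": 0.5, "congé": 0.3, "quitter": 0.2},
}
bigram = {  # P(next | prev): toy bigram transition probabilities
    ("maternité", "congé"): 0.5,
    ("maternel", "congé"): 0.1,
}

def viterbi(query, default_trans=0.01):
    # best[t] = (log-prob of best path ending in t, that path)
    best = {t: (math.log(p), [t]) for t, p in emission[query[0]].items()}
    for s in query[1:]:
        new_best = {}
        for t, p_emit in emission[s].items():
            # extend the best previous path with transition + emission
            new_best[t] = max(
                (lp + math.log(bigram.get((prev, t), default_trans))
                 + math.log(p_emit), path + [t])
                for prev, (lp, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]

print(viterbi(["maternal", "leave"]))
```

Note how the strong transition P(congé | maternité) lets the bigram evidence override the higher unigram emission probabilities, exactly the effect described above.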
In addition to the CLIR models explicitly modeling the translation process,
Lavrenko et al. present a relevance model, which they apply to both monolingual [77] and cross-lingual cases [76]. In the latter paper, they describe how to model the joint probability P (ti, q1...qn) for a given document term ti and query q = q1...qn, and then use that to approximate relevance P (D|q). Source and target language models are built from each sentence pair of a given parallel corpus, so that the joint probability can be estimated as follows:
P(ti, q1, ..., qn) = Σ_{(s′,t′)} P(ti | LM(t′)) Π_{i=1}^{n} P(qi | LM(s′)) (2.43)
The translation models used in these approaches have also been successfully used for query expansion [51]. The authors model relevance based on a query expansion process, in which each query word is assumed to expand into many document words based on the query context. They provide empirical evidence that the probabilities of this approach can be learned from the click logs of web search engines.
As an alternative to approaches that try to model relevance as some generative process, many researchers have instead focused on post-processing the translation space to reduce ambiguity, based on query context. For a given query s1...sn, we can generate many ways to translate each query term, using techniques described in Section 2.2.5. This creates a translation space (assuming si has ki different translations) of size K = Π_{i=1}^{n} ki.⁵ Many of these K choices for the query translation are plausible, but some are more appropriate, given the query context.
Consider our running example, maternal leave in europe, and assume that the terms have 3, 4, 3, and 1 possible translations, respectively. Figure 2.6 shows a graph representing the translation space, where each path corresponds to a valid translation (there are 3 × 4 × 3 × 1 = 36 for this example). Knowing the original query, a French speaker would notice that the most appropriate choice is (maternité → congé → de → europe). Furthermore, even without knowing the original query, just looking at pairwise choices is helpful. For instance, maternel (Eng. maternal) is much more similar in context to congé (Eng. take time off) than laisser (Eng. forget). Similarly, en and Europe appear much more frequently together than dans and Europe.
⁵The space becomes much larger if word order is considered.

Figure 2.6: A graph representation of the translation space for query maternal leave in europe. The candidate translations are t11 = maternel, t12 = maternelle, t13 = maternité; t21 = quitter, t22 = laisser, t23 = partir, t24 = congé; t31 = en, t32 = dans, t33 = de; and t41 = europe. Each path through the graph corresponds to a valid translation.

Based on this motivation, many researchers have tackled the problem of scoring pairs of translation terms, in order to select the translation set with highest cohesion:
Cohesion(t1k1 → t2k2 → t3k3 → t4k4) = Π_{i=1}^{4} Π_{j=1, j≠i}^{4} sim(tiki, tjkj) (2.44)

where sim(t, t′) is based on P(t, t′), the probability of t and t′ appearing in the same context, which is typically estimated from corpus statistics. Notice that Equation 2.44 includes this measure between all pairs of translation terms, not just consecutive ones, as exemplified above. There is a rich literature on implementing this idea, with many proposed methods to compute the pairwise similarity: Gao et al. [50] use point-wise mutual information, whereas others propose using Dice similarity [2], and mutual information [84]. These approaches select terms greedily (i.e., pick the translation of the first word that maximizes cohesion, then move to the second word, etc.), whereas Seo et al. [125] show further improvements when all possibilities are considered.
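The cohesion computation can be sketched as follows, with invented co-occurrence scores standing in for corpus statistics; unlike the greedy methods, this toy version exhaustively enumerates all paths:

```python
from itertools import product, combinations

candidates = [  # candidate translations per query term (Figure 2.6)
    ["maternel", "maternelle", "maternité"],
    ["quitter", "laisser", "partir", "congé"],
    ["en", "dans", "de"],
    ["europe"],
]

# Invented co-occurrence scores; a small default for unseen pairs
cooc = {("maternité", "congé"): 0.9, ("congé", "de"): 0.6,
        ("de", "europe"): 0.8, ("en", "europe"): 0.7}

def sim(t1, t2):
    return cooc.get((t1, t2), cooc.get((t2, t1), 0.1))

def best_translation(cands):
    def cohesion(path):
        # product of sim() over all unordered pairs (Eq. 2.44)
        score = 1.0
        for a, b in combinations(path, 2):
            score *= sim(a, b)
        return score
    return max(product(*cands), key=cohesion)

print(best_translation(candidates))
```

Enumerating all K paths is only feasible for short queries; the greedy methods cited above trade this exhaustive search for speed.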
Expressing term dependency relations explicitly has been shown to produce good results in monolingual retrieval [49, 97], but extending their ideas to CLIR has proven not to be as straightforward as one might expect. Cao et al. [20] attempt this by modeling the translation space explicitly as a graph structure, and show
that a random walk algorithm can be applied to compute probabilities of different paths. Their graph-based approach considers synonyms for query expansion as well as translations. Another recent paper by Wu and He [146] also integrates expansion and translation, by utilizing an MT-based translation model.
Another way to incorporate dependencies between query terms is to consider multi-word expressions (so-called "phrase translation," although the "phrases" are often statistical rather than linguistic phenomena) in order to limit polysemy effects [2, 5, 7, 23, 95, 155]. Despite the popularity and effectiveness of including phrase translations, they are typically based on existing phrase translation tables, which are limited to certain domains and languages. Also, due to computational difficulties, these approaches incorporate only a small subset of the possible phrase translations.
In contrast with approaches that try to translate the query into the document language, or vice versa, there have also been approaches to map both components into a language-independent semantic space, which is modeled as a latent variable [83].
The major drawback of these approaches is the computationally intensive nature of the underlying techniques, such as latent semantic indexing (LSI) [46]. It is also more difficult to interpret the transformation, compared to translation-based CLIR approaches.
Related to LSI-based CLIR is the notion of bidirectional translation introduced by Wang and Oard [145]. In this approach, the goal is to match the meaning of the query and document by using translation probabilities in both directions. This is a direct extension of the token-based CLIR approach, with the modification that
translation direction is not explicitly included in the model. The probability of source and target terms sj and ti having a matched meaning is defined by the following:
P(sj ↔ ti) = Σ_x p(x|ti) × p(x|sj) (2.45)

where x represents a meaning.
In their paper, Wang and Oard show that synsets (i.e., sets of synonymous terms) in WordNet can be used as a representation of meaning, and present several implementations of this model based on this idea. They report significant improvements over the strong unidirectional CLIR baseline, with MAP scores comparable to a monolingual baseline [145].
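A small sketch of Equation 2.45, with invented synset distributions standing in for the WordNet-derived p(x|term) estimates:

```python
# p(x | term): toy distributions over meanings (synset labels invented)
p_meaning_given_term = {
    "leave": {"TAKE_TIME_OFF": 0.5, "EXIT": 0.3, "FORGET": 0.2},
    "congé": {"TAKE_TIME_OFF": 0.9, "EXIT": 0.1},
    "quitter": {"EXIT": 0.8, "FORGET": 0.2},
}

def match_prob(s, t):
    """P(s <-> t) = sum_x p(x|t) * p(x|s)  (Eq. 2.45)."""
    px_s = p_meaning_given_term[s]
    px_t = p_meaning_given_term[t]
    return sum(p * px_t.get(x, 0.0) for x, p in px_s.items())

print(match_prob("leave", "congé"))
print(match_prob("leave", "quitter"))
```

Under these toy numbers, congé matches leave more strongly than quitter does, because both put most of their mass on the take time off meaning; no translation direction is implied by the formula.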
For a thorough summary of research trends in cross-language IR, we refer interested readers to Jian-Yun Nie's recent book [104]. In his book, Nie mentions the need for better integration of MT and IR, and specifically describes how this can be achieved.
Below, we describe the few previous attempts to achieve this integration, and relate them to the contributions of this dissertation.
One early CLIR system did try augmenting MT output using a bilingual dictionary [75] in order to include alternative translations. With the development of statistical translation models, modern MT systems contain rich representations of alternative translations of the source text (which is a query in our case). Despite this rich representation, MT-based CLIR approaches typically use one-best results, due to the convenience of treating MT systems as black boxes. Magdy et al. [89]
recently showed that it is beneficial to "look inside" the MT black box, and treat the process specially for query translation. The authors reported improvements in retrieval effectiveness just by modifying the MT preprocessing steps, in accordance with the IR pipeline. More recently, Nikoulina et al. [105] built an MT model tailored to query translation by (a) tuning the MT model weights on a set of queries and reference translations, and (b) reranking the top n translations of the MT system to maximize performance on a held-out query set. While improvements were more substantial using the latter method, another interesting finding was the low correlation between translation and retrieval quality. This indicates that doing better translation as measured by MT standards may not help the retrieval process.
Combining the n-best derivations is also routinely used in speech retrieval [109].
Our approach focuses on combining different sources of evidence within an
MT system, by building translation models from three different intermediate representations: word alignments, translation grammar (hierarchical or flat), and n-best translation list. Each of these translation models has different characteristics in terms of context sensitivity and preserving ambiguity; therefore, they have complementary strengths. We present an interpolation method to combine the advantages of all three approaches. Hence, our contributions can be considered complementary to the few recent attempts to integrate IR and MT.
2.4 Pairwise Similarity
The pairwise similarity problem involves finding all similar pairs of objects
(typically, above some similarity threshold) in a large collection, which has a quadratic time complexity as a function of the collection size. There are two challenging aspects of this task: (1) measuring similarity between two objects efficiently and accurately, and (2) finding efficient and scalable approaches to search the quadratic problem space.
In general, two approaches to the problem exist: index-based approaches focus on building an inverted index and employing pruning methods to achieve efficient similarity search [9, 57, 135, 143, 22], whereas signature-based approaches transform the data into a more compact representation for performing an approximate similarity comparison with reduced time and space complexity [40, 15, 21, 4, 26, 53, 122].
Index-based solutions are useful for exactly finding which pairs of objects (e.g., documents) share features (e.g., terms). These approaches suffer from the fact that each comparison is computed twice (i.e., it is not straightforward to avoid considering both (x, y) and (y, x)). Another drawback is that index creation is an extra step before the pairwise similarity computations, and most approaches assume that the index can be loaded into memory.
Dynamic index creation partially addresses this issue [121], in which case postings are added to the inverted index as objects are processed. Bayardo et al. [9] present a series of optimizations on top of the standard index-based pairwise similarity approach, and report that their final algorithm can achieve state-of-the-art efficiency without making any approximations (as in signature-based methods).
A signature is a hash, or footprint, of the vector representation of an object
(e.g., document). Therefore, signature-based solutions transform the problem into grouping objects based on their signatures. This allows a much faster search and lower space requirements. Even though there are theoretical guarantees on the upper bound of error due to the approximations in signature-based approaches, there is still a non-trivial amount of error in practice.
Rabin was the first to suggest using hash functions to find similar vectors, where the challenge is to limit the number of hash collisions [115]. One way to achieve this is the highly successful locality-sensitive hashing (LSH) approach, introduced by Indyk and Motwani [63]. In their paper, they show how LSH can be used to solve the nearest neighbor search problem in various domains. Broder introduced the notion of min-wise independent functions to generate signatures (called shingles in the paper), which has been shown to satisfy the LSH requirements for Jaccard similarity [15]. Charikar later introduced a method based on random projections to compute signatures, and proved theoretical guarantees for the cosine similarity measure [21].
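Charikar's random-projection scheme can be sketched as follows; the dimensionality, number of bits, and test vectors are arbitrary choices for illustration:

```python
import random, math

def random_projection_signature(vec, planes):
    # one bit per random hyperplane: the sign of the dot product
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]

def estimated_cosine(sig1, sig2):
    # P(bits agree) = 1 - theta/pi, so invert to estimate cos(theta)
    agree = sum(b1 == b2 for b1, b2 in zip(sig1, sig2)) / len(sig1)
    return math.cos(math.pi * (1 - agree))

random.seed(0)
dim, num_bits = 10, 256
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

u = [1.0] * dim
v = [1.0] * 9 + [-1.0]  # differs from u in one coordinate; cosine = 0.8
sig_u = random_projection_signature(u, planes)
sig_v = random_projection_signature(v, planes)
print(estimated_cosine(sig_u, sig_v))  # an estimate of the true 0.8
```

Each signature compresses a 10-dimensional real vector into 256 bits, yet the bit-agreement rate still recovers an approximation of the cosine between the original vectors.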
LSH was first presented as a solution to the nearest neighbor search (NNS) problem: Given a collection of objects and a query q, find the object p* closest to the query (i.e., p* = arg min_{p′} δ(p′, q)). The approximate version, called ε-NNS, finds some object that is at most a factor of ε farther from the query than its true nearest neighbor (i.e., find p s.t. δ(p, q) < (1 + ε) δ(p*, q)). This problem can be reduced to a more general version, called Point Location in Equal Balls (PLEB).
In PLEB, we are given a radius r and a notion of an open ball around some object p: B(p, r) = {p′ : δ(p, p′) ≤ r}. For some query q, the task is to decide if there is any object p such that q is within r distance of it (i.e., find p s.t. q ∈ B(p, r)).
The approximate version, called ε-PLEB, requires the following:

• if ∃p s.t. q ∈ B(p, r), return YES and some object p′ s.t. q ∈ B(p′, r + ε)

• if ∀p : q ∉ B(p, r + ε), return NO

• if the closest point to q is p and r ≤ δ(p, q) ≤ r + ε, return either YES or NO
Indyk and Motwani prove that LSH can be used to solve the ε-PLEB problem [63]. Given a collection of vectors in domain S, let H : S → U be a hash function family, where U is the target domain. H is said to be an (r, ε, p1, p2)-LSH family if the following holds: ∀ objects u, v ∈ S, and any hash function h ∈ H:

• if δ(u, v) ≤ r, then P(h(u) = h(v)) ≥ p1

• if δ(u, v) > r + ε, then P(h(u) = h(v)) ≤ p2

where δ is a distance metric satisfying the two constraints for H. We use distance instead of similarity to stay consistent with earlier work, but notice that obtaining a corresponding definition for a similarity measure is straightforward.
In order to solve ε-PLEB, we can define an LSH signature as a concatenation of k hash values. For the ith signature, we sample k hash functions from H, and define the signature function gi : S → U^k s.t. ∀p ∈ S, gi(p) = (h1(p), ..., hk(p)). This procedure is repeated to get l signature functions, based on which the collection is bucketed. This can be visualized as having l hash tables, where keys are computed
using the corresponding signature function. Once the collection is split into these l buckets, we can devise an algorithm to answer ε-PLEB for a given query q, by searching all buckets corresponding to the query's l signatures, g1(q), ..., gl(q). The algorithm stops after finding 2l objects within these buckets. Then, we compute the distance between the query and each of the 2l candidate objects, and return YES if we find one closer than r + ε.
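The bucketing scheme can be sketched as follows; here the hash family is instantiated with minhash (which satisfies the LSH property for Jaccard similarity), and the parameters k, l, and the toy document sets are arbitrary. Note that Python's built-in hash() is stable only within a single run; a real implementation would use an explicit string hash:

```python
import random

random.seed(42)
k, l = 3, 4  # hashes per signature, number of signature functions
MAX_HASH = 2**32

def make_minhash():
    # a random linear hash; min over a set gives one minhash value
    a, b = random.randrange(1, MAX_HASH), random.randrange(MAX_HASH)
    return lambda x: (a * hash(x) + b) % MAX_HASH

# l signature functions, each built from k sampled hash functions
signature_fns = [[make_minhash() for _ in range(k)] for _ in range(l)]

def signature(item_set, fns):
    # g_i(p) = (h_1(p), ..., h_k(p))
    return tuple(min(h(x) for x in item_set) for h in fns)

def build_tables(collection):
    tables = [{} for _ in range(l)]
    for name, s in collection.items():
        for i, fns in enumerate(signature_fns):
            tables[i].setdefault(signature(s, fns), []).append(name)
    return tables

def candidates(query_set, tables):
    # union of the buckets matching the query's l signatures
    found = set()
    for i, fns in enumerate(signature_fns):
        found.update(tables[i].get(signature(query_set, fns), []))
    return found

docs = {"a": {"w1", "w2", "w3", "w4"},
        "b": {"w1", "w2", "w3", "w5"},    # similar to "a"
        "c": {"w8", "w9", "w10", "w11"}}  # dissimilar
tables = build_tables(docs)
print(candidates({"w1", "w2", "w3", "w4"}, tables))
```

Only objects sharing a full k-value signature in at least one of the l tables become candidates, which is what makes the subsequent exact distance check over at most a handful of objects affordable.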
Correctness of this algorithm is guaranteed if the number of collisions is less than 2l (i.e., |{p : ∃j = 1...l s.t. (δ(q, p) > r + ε) ∧ (gj(p) = gj(q))}| < 2l), and similar objects share at least one signature (i.e., δ(p, q) < r ⇒ gj(p) = gj(q) for some j = 1...l). It has also been proven that these two conditions hold if k is chosen to be log_{1/p2} n and l = n^ρ, where ρ = log(1/p1) / log(1/p2), so the time complexity is O(n^ρ) [63, 21].
The drawback of this approach is that it solves ε-PLEB; we need to run it for a varying set of r values in order to solve the ε-NNS problem. Charikar introduced a different, sort-based approximate search algorithm that directly solves ε-NNS [21].
In this algorithm, we generate l random permutation functions, and each signature is permuted with respect to each, resulting in l lists of signatures. Each of these lists is sorted by lexicographic order of the bits, so for a given query q, we perform a binary search on each list to identify the objects with longest prefix. Once we perform 2l comparisons, it is guaranteed that we have hit an approximate nearest neighbor. The proof of the correctness and time complexity is very similar to above.
Ravichandran et al. [116] applied Charikar’s sort-based algorithm to noun clustering, after modifying the approach to find the nearest neighbors of all objects, instead of a single query. In their approach, instead of a binary search in each sorted
list, a fixed-size window slides from top to bottom, and any pair that appears in the same window is compared for similarity. The running time to find all similar pairs is O(nBQ), where B is the window size and Q is the number of permutations.
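The sliding-window variant can be sketched as follows, with short toy bit signatures and arbitrary window and permutation parameters:

```python
import random

def window_pairs(signatures, num_permutations=5, window=2, seed=1):
    rng = random.Random(seed)
    bits = len(next(iter(signatures.values())))
    candidate_pairs = set()
    for _ in range(num_permutations):
        perm = list(range(bits))
        rng.shuffle(perm)
        # apply the bit permutation and sort lexicographically
        order = sorted(signatures,
                       key=lambda n: [signatures[n][i] for i in perm])
        # compare only items that fall within the sliding window
        for i, name in enumerate(order):
            for other in order[i + 1:i + window]:
                candidate_pairs.add(tuple(sorted((name, other))))
    return candidate_pairs

sigs = {"a": [0, 1, 1, 0, 1, 0, 0, 1],
        "b": [0, 1, 1, 0, 1, 0, 1, 1],  # differs from "a" in one bit
        "c": [1, 0, 0, 1, 0, 1, 1, 0]}  # complement of "a"
print(window_pairs(sigs))
```

Near-identical signatures such as a and b share long prefixes under almost every permutation, so they land adjacently in the sorted lists and are compared, while the cost stays linear in the collection size per permutation.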
Following the introduction of LSH techniques through these few papers, many researchers have applied the methods to various problems, ranging from entity resolution [61], to audio identification [55, 98], and image search [74, 65]. A variant of the pairwise similarity problem is near-duplicate detection, which aims to detect highly-similar pairs, sometimes without a predefined similarity threshold or even a similarity measure. Both index-based [135, 149] and signature-based [27, 90, 72, 59, 62] approaches have been proposed to address this problem, including Broder's approach that pre-dates LSH [15].
Pairwise similarity has also been extensively studied by the text processing and data mining research community, where it is commonly referred to as the set similarity join problem [57, 143]. A more general variant that has received much attention is computing pairwise functional computations between elements in a large dataset [99, 66].
There is a line of research using specialized data structures to partition the metric space, such as K-D trees and R-trees, surveyed by Gaede and Günther [47].
The major drawback of these approaches is the increased computational costs in high-dimensional spaces. More recently, Low and Zheng presented an approach based on matrix compression, which finds exact results and is scalable to high-dimensional spaces [87].
2.5 Cross-Lingual Pairwise Similarity
All of the above focus on mono-lingual or homogeneous similarity, either similarity within one language or similarity within a homogeneous collection of objects.
In the cross-lingual variant of the problem, the main challenge is to handle the extra noise caused by the translation process. Methods to translate documents in
CLIR are directly applicable to objects in any feature space (see Section 2.2.5 for translation approaches in CLIR).
Due to translation ambiguities, the similarity values are lower than in the monolingual case, requiring lower thresholds to detect pairs. Our preliminary experiments suggest a threshold of about 0.3 for cross-lingual pairwise similarity, as opposed to about 0.6 for the monolingual case, and 0.9 for near-duplicate detection. As a result, it becomes much more difficult to distinguish between similar and dissimilar pairs.
In order to alleviate this issue, it is essential to devise approaches where the error introduced by LSH techniques is minimal. We provide more details on this topic in
Chapter 3.
Although they are not interested in the cross-lingual case, Zhai et al. [153] tackle the problem of similarity search with the same focus: finding similar pairs when the similarity threshold is low. Their approach is a combination of index-based and signature-based methods, and assumes a sparse binary feature representation of objects in the collection. Although they only show results for the monolingual case, their approach seems to be applicable to the cross-lingual pairwise similarity problem.
We are aware of some work that attempts to solve the cross-lingual version of the problem, but in a slightly different setting. Anderka et al. [3] conclude that both signature-based and index-based approaches require at least a linear scan of the collection due to the low similarity thresholds in cross-lingual pairwise similarity. However, the authors do not describe an algorithm or show any experimental results. In another recent paper, Platt et al. [112] describe techniques for projecting multi-lingual documents into a single vector space: the proposed training
"encourages" similar document pairs to have similar vector representations. Mehdad et al. [94] view this problem as a textual entailment task, and introduce an approach to solve cross-lingual textual entailment, focusing on the translation process more than the search algorithm. Snover et al. [132] solve a variant of this problem for domain adaptation in MT, where the goal is to find text comparable to a given sentence.
The authors present a probabilistic relevance model to achieve this, and use the retrieved text to adapt the language and translation models of the MT system.
2.6 Extracting Parallel Text for MT
Section 2.1 provided detailed information on the importance of parallel text for statistical MT systems. As modern translation systems are heavily driven by training on large amounts of parallel text, it has become an essential task to exploit all available resources where parallel sentences may be found. The web is a rich source of parallel text, and has therefore attracted many researchers for mining purposes [1, 13, 35, 45, 93, 101, 102, 129, 117, 100, 157].
Before any discussion on how such data can be automatically obtained, we need to clearly define different levels of "parallel text", and describe the resources that might contain it. The following is a list of five rough categories of multi-lingual text, in decreasing order of comparability.
1. Parallel: Standard sentence-aligned parallel text, in which each sentence pair
is assumed to be mutual translations. For example, European parliament proceedings (Europarl) is a widely used parallel corpus for European languages.
2. Noisy parallel: Multiple translations of the same document, which is less parallel than above because the translator has a lot of freedom in making slight
changes to the way content is expressed. However, we would expect sentences
in one document to correspond to those roughly at the same position in the
other language. Book translations are common examples of this category.
3. Fully comparable: Documents from different languages that describe the same
event or subject. In this category, a document pair might differ slightly in
length due to certain details being omitted in either document. As an example,
we expect to see missing sentences within paragraphs. News stories generated
in multiple languages are a good example of fully comparable documents.
4. Semi-comparable, also named quasi-comparable by Fung and Cheung [45]: Documents from different languages that contain portions describing the same
event or subject, although the documents might have different subjects at a
higher level. In this category, certain sections/chapters in one document might
not appear at all in the other, so it is expected that they might differ greatly in
terms of length. Wikipedia contains many semi-comparable document pairs,
in addition to fully comparable ones.
5. Arbitrary multilingual text: In this category, we cannot make any assumptions
about the existence of parallel text. There is no constraint on the content of
either document (e.g., web).
Gale and Church's seminal paper introduced a sentence alignment program based on a simple yet effective probabilistic model of character count ratios, supported by experiments on German-English and French-English data [48]. Their approach assumes a noisy parallel corpus as input, in which there are pairs of documents that are translations of each other but do not have sentence alignments. Some sentences might need to be aligned to multiple sentences in the other language, but there is a strong correspondence between the two documents. This is different from a fully comparable corpus, in which we only assume there are cross-lingual document pairs on the same subject, which might differ to a great extent.
Recent parallel text extraction approaches usually adopt a two-step process:
1. Identify similar documents and generate candidate sentence pairs.
2. Filter candidate pairs to retain parallel sentences.
The general solution to the first step involves computing pairwise similarities across multi-lingual corpora, which corresponds to the cross-lingual pairwise similarity problem described in Section 2.5. Since this is computationally intensive, most previous studies fall back to heuristics. Munteanu and Marcu [101] use temporal information, treating all news articles published at most five days apart as similar.
Resnik and Smith [117] present the first paper with a working system that mines parallel text from the web, in which they use the AltaVista search engine to find the same page in different languages within a specified web domain. As a result, these two approaches are designed to operate on collections with fully comparable documents. Smith et al. [129] exploit Wikipedia's "interwiki" links, which provide a linkage between similar articles written in different languages. This collection can be better categorized as between fully and semi-comparable, since these linked articles can still differ substantially in content. Fung and Cheung's approach [45] uses the TDT3 corpus, which they label as semi-comparable due to the off-topic content among the news transcriptions.
The second step (filtering candidate sentence pairs) usually involves a classification task, in which each candidate sentence pair is classified as "parallel" or not. Due to the cost of classification, previous papers have introduced various heuristics to remove many candidate pairs without running a classifier. For example, Munteanu and Marcu [101] automatically discard pairs of sentences that have a length ratio above 2 (or below 0.5), and Smith et al. [129] follow this approach.
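The length-ratio heuristic can be sketched as follows; the function name and its handling of empty inputs are our own illustration, not taken from the cited papers:

```python
def passes_length_filter(src_tokens, tgt_tokens, max_ratio=2.0):
    """Keep a candidate sentence pair only if neither side is more than
    max_ratio times longer than the other (the length-ratio heuristic)."""
    if not src_tokens or not tgt_tokens:
        return False
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio

# A 3-token sentence paired with a 7-token one fails the filter,
# since 3/7 is below 0.5.
```

A filter this cheap can discard a large fraction of candidate pairs before the (much more expensive) classifier is ever invoked.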
The usefulness of the extracted parallel text depends on the scale of the input as much as on the accuracy of the algorithms. Resnik and Smith [117] mention that large web collections are not available, so they are able to find only 8,294 candidate Arabic-English page pairs in the Internet Archive, a number further reduced after duplicate filters. Nevertheless, adding the extracted parallel text to a CLIR system brings a 12% increase in MAP. Fung and Cheung [45] focus on precision, as opposed to scale, so their evaluation considers only 2,500 sentence pairs. According to a human evaluation, 67% of these are correct, which is 24% better than the baseline approach.
Munteanu and Marcu [101] build an end-to-end pipeline for parallel text extraction, finding 7.5 million pairs of news articles, corresponding to 17 million candidate sentence pairs, from which they generate a parallel corpus of 430 thousand sentence pairs. When evaluated intrinsically, their classifier yields very high precision (>95%) but low recall (67% when trained on the same domain as the test instances, 45% when trained on a different domain). They also run an extrinsic MT evaluation, and the results show that improvements are more substantial when there is less training data to start with; the approach is therefore better suited to low-resource languages.
More recently, Smith et al. [129] introduced a similar approach, aligning sentences in German, Spanish, and Bulgarian Wikipedia to English Wikipedia. They only consider existing "interwiki" links across Wikipedia collections, which miss many links and contain some erroneous ones, but this provides 0.5 million German-English article pairs to start with. As a result, their approach extracts 1.7 million German-English (de-en), 1.9 million Spanish-English (es-en), and 146 thousand Bulgarian-English (bg-en) sentence pairs. They report recall at 80% and 90% precision, which is 69% and 59% (respectively) for de-en, 90% and 94% for bg-en, and 72% and 82% for es-en.
In summary, previous work either (i) lacks the ability to scale to very large collections [101, 117], (ii) presents approaches that assume a certain structure and format of the input (e.g., news articles [101], Wikipedia pages with metadata [129], web pages from the same domain [117]), or (iii) assumes that the input corpus exhibits a higher level of comparability (i.e., noisy parallel or fully comparable) than we would desire [117]. In addition, the approaches that actually run on fully or semi-comparable corpora do not show that the extracted parallel text significantly improves translation quality for general-purpose MT, although results are encouraging when applied to low-resource [101] or domain adaptation [129] settings.
As part of Chapter 3, we describe an end-to-end pipeline that partially addresses all three of these issues. Our approach is efficient and scalable, and does not assume any inherent structure in the collection. As opposed to Smith et al. [129], our approach does not rely on manually linked Wikipedia articles. Although our approach may not apply to arbitrary multilingual text, experiments show that it can successfully extract parallel text from semi-comparable document pairs.
A recent study from Google also describes a solution to this problem without the three shortcomings mentioned above [141]. In their approach, all non-English web pages and books6 are translated into English, thus transforming the problem into identifying similar monolingual document pairs. They use a hashing-based approach to efficiently identify similar document pairs, and then filter non-parallel sentence pairs based on a word alignment score. Despite the similarities, our approach makes several additional contributions. First, our work is the first to explore the effect of extracted data size on evaluation results. Our conclusions are more nuanced than simply "more data is better," since there is a trade-off between quality and quantity. Even with very small amounts of text added, we can gain
6 They run experiments on all books digitized as part of the Google Books project.
solid improvements in the task of MT (up to a 3.5 BLEU point increase). Overall, our approach requires far fewer computational resources and thus is within the reach of academic research groups: we do not require running an MT system on one side of the entire collection, which is orders of magnitude slower than our CLIR-based approach with little benefit in quality (see Chapter 3 for a detailed comparison between the two approaches). Furthermore, our novel two-stage classification approach yields an increase in the overall efficiency (speed) of the system with minimal loss in effectiveness (accuracy).
Another idea explored in bilingual text extraction is bootstrapping. After extracting sentence pairs from the input corpus, it is possible to improve the underlying translation model and "bootstrap" from it: re-run the extraction to find more and better sentence pairs. With more data, the translation model of the cross-lingual pairwise similarity component has a larger vocabulary, allowing it to identify similarities that were missed before.
We note that a similar bootstrapping idea was described by Fung and Cheung [45]. However, their approach only adds named entities, so the effect is limited. Furthermore, their manual evaluation only focused on precision, so it is unclear how useful the extracted bitext is for the task of MT. Munteanu and Marcu also showed the usefulness of bootstrapping from the extracted bitext; however, their focus is on low-resource languages and domain adaptation, so improvements are measured over a small initial training dataset [101]. Callison-Burch and Osborne similarly show that training data can be generated from scratch, using a co-training procedure [19].
In contrast with previous work, we evaluate our bootstrapping approach on complete end-to-end MT, using a diverse set of language pairs, including both low- and high-resource pairs.
Wikipedia has become especially popular for mining parallel sentences because of its easy accessibility, availability in many languages, and detailed link structure.
One study shows how to exploit the "interwiki" links in Wikipedia to decide if two foreign-language sentences are parallel, based on how many of the relevant articles are linked to each other [13]. Smith et al. [129] also use linked Wikipedia articles as input to their classification algorithm. Other related work includes using Wikipedia to improve CLIR [103], and finding cross-lingual article pairs without explicit translation [114]. There has also been work on fixing links in Wikipedia, since these have been found to contain many errors [32].
As a summary of papers related to parallel text extraction and sentence alignment, we suggest Tiedemann's recent book [137].
2.7 Scalability / MapReduce
With the large amounts of data available today, distributing computations has become an essential task. Traditional parallel programming approaches, such as the Message Passing Interface (MPI) [42], require the developer to manage the many issues that arise during concurrent execution (e.g., deadlock, race conditions, machine failures). As a result, a significant portion of development time is spent on handling system-level details. MapReduce [33] was introduced as an alternative, easy-to-understand model for distributed algorithms. It has become very popular in both industry and research, thanks to its simplicity and accessibility (through open-source implementations, such as Hadoop, Hive, etc.).
Often we observe that information processing algorithms share a certain structure: some operation is applied to each input record (e.g., one line of a file, or a web page) to generate intermediate results, which are then aggregated into a final output.
Motivated by the high-level abstractions of functional programming, a MapReduce program consists of a Map function, describing the operation over input records, and a Reduce function, which describes the aggregation process. Multiple workers may run the Map and Reduce functions concurrently; each is called a mapper or a reducer, respectively. The responsibility of a MapReduce developer is to describe these two functions as part of the algorithm design. In MapReduce, processing is over key-value pairs, so another aspect of algorithm design is to decide what key-value types will be used as input and output of mappers and reducers.
The execution of a MapReduce program is as follows: each mapper is applied to every input key-value pair to generate an arbitrary number of intermediate key-value pairs. The reducer is then applied to all values associated with the same key, and generates output key-value pairs. Every call to the Map (Reduce) function processes a single key-value pair, but a mapper (reducer) calls Map (Reduce) as many times as it receives input records. This two-stage processing structure is illustrated in Figure 2.7.
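As a concrete single-machine illustration of this programming model, the canonical word-count example can be sketched in Python. The in-memory "shuffle and sort" below is our stand-in for what a framework such as Hadoop does in a distributed fashion:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit one intermediate (key, value) pair per token.
    for token in text.split():
        yield token, 1

def reduce_fn(key, values):
    # Aggregate all values that share the same key.
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle and sort: group intermediate values by key, sorted by key.
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    output = {}
    for key in sorted(groups):
        for okey, ovalue in reduce_fn(key, groups[key]):
            output[okey] = ovalue
    return output

counts = run_mapreduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn)
# counts == {"a": 2, "b": 2, "c": 1}
```

In a real cluster, the mappers and reducers would run on different machines, and the framework would perform the grouping and sorting as an external, distributed operation.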
Figure 2.7: Overview of the two-stage MapReduce processing structure.

Running MapReduce requires a special distributed file system [52] in order to read input files and write the output. Once the algorithm is implemented as required and the input data is uploaded to the file system, the underlying MapReduce framework takes care of all execution details, such as synchronization, scheduling, and fault tolerance. One strength of MapReduce is its flexibility: the same algorithm can run on clusters ranging from a few CPU cores to thousands of them.
Also, unlike traditional programming models like MPI, MapReduce does not have special hardware requirements. It is designed to run on many commodity computers, as opposed to a few high-end servers. Another important feature of MapReduce is its focus on data locality, which aims to minimize the movement of data from one cluster node to another. Instead, the scheduling framework tries to move code to data as much as possible. As seen in Figure 2.7, MapReduce sorts records by their key values when moving data from mappers to reducers. Sorting is at the heart of many algorithms across Computer Science, so this external, distributed sort operation is very helpful. We refer interested readers to Lin and Dyer's book on text processing in MapReduce for more details [81].
In this dissertation, we use Hadoop, an open-source implementation of the MapReduce programming model, and its accompanying open-source distributed file system, HDFS. The problem of pairwise similarity computation has been studied in MapReduce previously [80, 39, 143, 66]. However, these algorithms adopt an index-based approach, which stands in contrast to our signature-based approach.
More recently, Bahmani et al. [6] introduced an LSH-based approach to pairwise similarity, in which they present an improvement to standard LSH and report a two-orders-of-magnitude reduction in the network traffic of their MapReduce implementation.
Chapter 3: Searching to Translate:
Efficiently Extracting Bilingual Data from the Web
3.1 Overview
As described in Section 2.6, there are different levels of parallel text. A semi-comparable corpus is a pair of collections written in different languages, in which it is assumed that there are many semi-comparable cross-lingual document pairs. Documents that are semi-comparable share portions of text describing the same event or subject, although the documents might have different subjects at a higher level. Certain parts of one document might not appear at all in the other, so the two documents are expected to differ greatly in length. On the other hand, a parallel corpus (also called parallel text, bilingual text, or bitext throughout this chapter) is a list of sentence pairs that are translations of each other. Unlike comparable corpora, this type of corpus can be used directly for training machine translation systems.
In this chapter, we present an end-to-end pipeline that takes a semi-comparable corpus as input and produces a parallel corpus as output. All experiments are conducted on Wikipedia: this multi-lingual encyclopedia, with content in hundreds of languages, is a great example of a semi-comparable corpus. The entire pipeline is implemented in Hadoop, the open-source implementation of MapReduce [33], which has recently emerged as a popular model for large-scale data processing. This allows the entire computation to be distributed across an arbitrary number of servers and to scale up to web-scale collections.
Our processing pipeline runs in two phases. In the first phase, similar cross-lingual document pairs are found within the input comparable corpus (Section 3.2). The second phase extracts parallel sentence pairs from these cross-lingual document pairs (Section 3.3). We evaluate our approach on the task of machine translation (MT) in Section 3.4. Figure 3.1 illustrates an overview of the system components.
Figure 3.1: Overview of the end-to-end pipeline for parallel text extraction, starting from a comparable corpus, ending with aligned bilingual text (i.e., parallel corpus). F and E denote the two corpus languages.
3.2 Phase 1: Large-Scale Cross-Lingual Pairwise Similarity
3.2.1 Pairwise Similarity
In information retrieval (IR) applications, terms within a document are usually assumed to be independent, which allows us to represent documents in a vector-space model. In this model, a document d is represented as a vector of term weights wt,d (i.e., a document vector), one for each term t ∈ V that appears in d, where V is the collection vocabulary. There are, of course, many possible weighting schemes for wt,d; we adopt BM25 [118], which has been used successfully in many retrieval tasks. In this scheme, each weight is a function of document frequency (df) and term frequency (tf) values, normalized by document length.
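The BM25 weighting just described can be sketched as follows. This is one common variant (with a +1 inside the idf log to keep all weights positive), and the parameter values k1 and b are illustrative defaults, not necessarily those used in this dissertation:

```python
import math

def bm25_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 term weight: idf scaled by a length-normalized tf component.

    The +1 inside the log keeps weights positive even for very common terms,
    which matters later when cosine similarities are assumed non-negative.
    """
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```

Rarer terms (small df) and more frequent terms within a document (large tf) both receive larger weights, with tf saturating as it grows.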
Given this representation, similarity between two document vectors u and v is computed via cosine similarity, a measure of the cosine of the angle between the vectors, which can be computed by the inner product of normalized vectors:
Sim(u, v) = cos θ(u, v) = (u · v) / (|u||v|)    (3.1)
Note that the L2 normalization in Equation 3.1 is applied on top of an L1 normalization performed during the BM25 weight computation. The latter normalizes term frequency based on the document length, whereas the former directly normalizes the document vector by its magnitude. Since BM25 term weights are positive and vectors are normalized, the cosine similarity ranges from 0 (no similarity) to 1 (exact similarity).
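Equation 3.1 amounts to an inner product of L2-normalized vectors; a minimal sketch over sparse term-weight dictionaries (our own representation, not the dissertation's implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse term-weight vectors (dict: term -> weight)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Identical vectors score 1.0; vectors over disjoint vocabularies score 0.0.
```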
The pairwise similarity problem is to find all similar pairs of documents (u, v) in a collection, where we define a pair to be similar if its cosine similarity is above a certain threshold (i.e., Sim(u, v) > τ). In cross-lingual pairwise similarity, the same problem needs to be solved when the vectors correspond to documents in different languages, thus introducing additional complexity.
There are two major challenges to this problem: matching vocabulary spaces, and searching for similar pairs. In the following two sections, we discuss our approach to overcome these two challenges.
3.2.2 Cross-Lingual Pairwise Similarity
Due to the difference in vocabulary spaces, document vectors in different languages are not directly comparable. As a result, we need a way to match the vocabulary spaces in this setting. We experimented with two different solutions to overcome the vocabulary mismatch: text translation and vector translation.
The first approach is to translate document text from one language into the other using an off-the-shelf machine translation system. For instance, if we have
German and English documents, we could translate all German documents into
English, and then perform pairwise similarity in the English vocabulary space. The advantage of this approach is that modern machine translation systems can produce high-quality results, especially for European languages. However, the disadvantage is that machine translation is typically time-consuming and resource-intensive. This approach has been implemented by Uszkoreit et al. [141].
The second approach is to translate document vectors from one language into another using cross-language information retrieval (CLIR) techniques. Adopting the approach proposed by Darwish and Oard [31], a document vector vF in some source-language vocabulary space F can be projected into a document vector vE in the target-language vocabulary space E by "translating" the df and tf values, which are then combined in the scoring function. In this approach, for every target-language term e ∈ E, we compute dfE and tfE values as follows:
dfE(e) = Σ_{f∈F} P(f|e) × dfF(f)    (3.2)

tfE(e, d) = Σ_{f∈F} P(f|e) × tfF(f, d)    (3.3)
where d is some source-language document, and the dfF and tfF values are those represented in the source-language vector. Since the term weight is a function of the dfE and tfE values, we can compute the BM25 term weights for each e ∈ E and construct a "translated" vector vE. In Equations 3.2 and 3.3, the quantity P(f|e) is the conditional probability of a source-language word f being the translation of a target-language word e. This distribution captures lexical translation ambiguities, i.e., the fact that words in one language can translate to multiple words in another. The transformation in the equations can be interpreted as an averaging, weighted by conditional probability: each e ∈ E generates f with probability P(f|e). Following this probabilistic generative process, we end up with the formulas above.
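Equations 3.2 and 3.3 can be sketched as follows; the dictionary-based layout and the function name are our own illustration, not the dissertation's implementation:

```python
def translate_vector(tf_f, df_f, trans_probs):
    """Project source-language tf/df statistics into the target vocabulary.

    tf_f: dict f -> tf_F(f, d) for one source-language document d
    df_f: dict f -> df_F(f) collection statistics
    trans_probs: dict e -> {f: P(f|e)} translation distributions
    Returns (tf_e, df_e) dicts over target-language terms e.
    """
    tf_e, df_e = {}, {}
    for e, dist in trans_probs.items():
        tf_e[e] = sum(p * tf_f.get(f, 0) for f, p in dist.items())  # Eq. 3.3
        df_e[e] = sum(p * df_f.get(f, 0) for f, p in dist.items())  # Eq. 3.2
    return tf_e, df_e
```

The resulting tf_e and df_e values would then feed the BM25 scoring function to produce the translated vector vE.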
The word translation probabilities (i.e., P(f|e)) can be estimated from a parallel corpus using unsupervised word alignment techniques [144, 106], which also serve as the basis for statistical MT systems. This is a well-understood problem, and there exist many off-the-shelf implementations of these approaches (e.g., GIZA++). Due to incorrect word alignments and noise, words might have many translations with very low probabilities, so it is useful to apply the following filters to the probability distribution P(f|e):

1. Discard a low-probability translation candidate f if P(f|e) < L.

2. Include translation candidates (starting from the most probable) until the cumulative probability reaches C, or the number of candidates reaches H. Discard the remaining candidates, then re-normalize.
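The two filters can be sketched as follows; the function and the default values for L, C, and H are illustrative (the text does not specify the values used):

```python
def filter_distribution(dist, L=0.01, C=0.95, H=15):
    """Prune a translation distribution P(f|e), given as dict f -> prob.

    1. Drop candidates with probability below L.
    2. Keep the most probable candidates until cumulative mass C or
       H candidates, whichever comes first; then re-normalize.
    """
    kept, cumulative = {}, 0.0
    for f, p in sorted(dist.items(), key=lambda kv: -kv[1]):
        if p < L or cumulative >= C or len(kept) >= H:
            break  # everything after this point is discarded
        kept[f] = p
        cumulative += p
    total = sum(kept.values())
    return {f: p / total for f, p in kept.items()} if total > 0 else {}
```

Because the candidates are visited in decreasing order of probability, the first candidate below L also terminates the loop, which realizes both filters in a single pass.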
3.2.2.1 MT vs. CLIR for Cross-Lingual Pairwise Similarity
Typical instances of the monolingual pairwise similarity problem involve extracting pairs with high cosine similarities (e.g., 0.9 or higher for near-duplicate detection). For the cross-lingual case, we expect similar documents in different languages to have lower similarities, since the translation process is noisy.
In order to empirically determine appropriate similarity thresholds for the cross-lingual case, and to present an empirical comparison between the two translation approaches, we performed a preliminary analysis. We experimented with the German and English portions of the Europarl corpus (version 6), which contains proceedings from the European Parliament (1996–2009). We constructed artificial
"documents" by concatenating every 10 consecutive sentences into a single document. In this manner, we sampled 100 document pairs that are mutual translations of each other (and therefore semantically similar by construction). For each sample document, we created 99 non-parallel cross-lingual pairings, since each document is aligned to exactly one document in the other language. This process provided ground truth to evaluate the effectiveness of the two translation approaches discussed above: text translation (using MT) and direct vector translation (using CLIR). In the first case, we translated each German document into English and then processed the resulting translated English document into vector form. In the second case, we first processed the German document into vector form, and then projected the vector into English. See Figure 3.2 for an illustration of the two approaches. Finally, we computed the cosine similarities of each document pair, under both conditions.

Figure 3.2: Two approaches to match vocabulary spaces in cross-lingual pairwise similarity: text translation (MT) and vector translation (CLIR).
For the MT approach, we experimented with two different systems: cdec, an open-source hierarchical PBMT system trained on Europarl,1 and Google Translate, a popular commercial MT system.2 We assess the effect of the translation method in Figure 3.3, and the effect of the underlying MT model in Figure 3.4.
In Figure 3.3, we plot four distributions of cosine similarities, binned into intervals of 0.05. Striped columns correspond to negative (i.e., non-parallel) instances, using either CLIR translation (shown in red) or MT translation (with cdec, shown in blue), whereas solid columns correspond to positive (i.e., parallel) instances. We make two important observations. First, the cosine similarities are surprisingly low, especially since these are artificially created perfect document translations, written in formal language. We expect even lower values in practice, because documents will almost always be comparable, not parallel. This observation points to a challenge in the cross-lingual pairwise similarity problem: if we want to mine comparable document pairs in different languages, we need to extract documents that have low cosine similarities. As we lower the threshold for cosine similarity, the task of distinguishing positive and negative instances becomes more difficult.

The second conclusion from Figure 3.3 is that the positive and negative instances are clearly separated with either translation approach. This means that cosine similarity is a good indicator of parallel text. Also, we observe that cosine similarity scores are higher on average for the MT approach, but vector translation seems reasonably competitive. Moreover, the difference in translation quality becomes
1 Portions used for testing were not used in training.
2 A free API service is available at https://developers.google.com/translate
Figure 3.3: Comparing text (using the cdec MT system, labeled cdec) and vector (using CLIR, labeled clir) translation approaches for cross-lingual pairwise simi- larity, using data from German-English Europarl. This histogram illustrates that (1) similarity values for parallel documents are surprisingly low, (2) positive and negative instances are clearly separated with either translation method, and that (3) the MT approach generates slightly better results (although we argue that it is not sufficient to justify the 600 times slower running time).
Figure 3.4: Comparing two text translation approaches (hierarchical PBMT system cdec and commercial MT system Google Translate) in cross-lingual pairwise sim- ilarity, using data from German-English Europarl. This histogram illustrates that there is no noticeable difference between using either MT system for this task.
even less noticeable if we consider the computational tradeoff: running an MT system on millions of documents is orders of magnitude slower than vector translation. For example, running MT on German Wikipedia (35.4 million sentences) would require about 3147 CPU hours (roughly estimated from the per-sentence running time of cdec), compared to the 5-hour running time of vector translation.
Based on these arguments, we used the CLIR vector translation technique in our approaches.
As an additional analysis, we plotted the same four distributions in Figure 3.4, but this time comparing the two MT systems. In this case, we observe that the differences between the two are negligible. This might be surprising, since the Google Translate system has a much larger underlying vocabulary, as a result of having more training data. However, we do not see a difference here because we are evaluating on document pairs in the same domain (i.e., news) as the training data of the cdec MT system. For example, we would expect more significant differences if we were evaluating on blogs. Therefore, this result emphasizes that different MT approaches yield similar results as long as they have good vocabulary coverage.
3.2.3 Approach
Now that we have defined the problem and identified its challenges, we are ready to explain the first phase of the end-to-end bitext extraction pipeline: finding cross-lingual document pairs in a multi-lingual collection of comparable documents.
Our approach is based on LSH, and consists of three steps. First, all documents are
preprocessed into document vectors (Section 3.2.3.1). Second, document signatures are generated from document vectors (Section 3.2.3.2). Finally, a sliding window algorithm applied to the signatures finds all cross-lingual document pairs above a similarity threshold (Section 3.2.3.3).
3.2.3.1 Preprocessing Documents
In the preprocessing step, raw documents in both languages are parsed, tokenized (and stemmed for applicable languages), and represented as weighted document vectors. All terms that occur only once in the collection are removed from the vocabulary, and each remaining term is converted into an integer for efficiency. Document vectors in the source language are projected into the target language by the CLIR approach explained in Section 3.2.2. Thus, the collections in the two languages are converted into a single collection of document vectors in the target language. Since we are only interested in cross-lingual pairs, we use document identifiers (docids) to determine the original language of a document vector.
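The vocabulary pruning and integer mapping described above can be sketched as follows (our own helper, not the dissertation's implementation):

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each term occurring more than once in the collection to an int id."""
    counts = Counter(t for doc in tokenized_docs for t in doc)
    kept = sorted(t for t, c in counts.items() if c > 1)
    return {t: i for i, t in enumerate(kept)}

docs = [["big", "data", "search"], ["data", "search", "engine"]]
vocab = build_vocab(docs)
# "big" and "engine" occur only once and are dropped; "data" and "search" get ids.
```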
3.2.3.2 Generating Document Signatures
At the core of any pairwise similarity algorithm is the similarity calculation between pairs of documents. The major drawback of cosine similarity is that the calculation requires a number of floating-point operations linear in the number of terms in u and v, which is computationally expensive. Although it is possible to accurately approximate floating-point values with integers using quantization methods, the similarity calculation still remains a bottleneck when dealing with many documents and large vocabularies.
Signatures are an efficient and effective way of representing large feature spaces. We explore three well-known methods for generating signatures from document vectors: random projection [21], Simhash [90], and Minhash [28]. Of course, a signature-based similarity is only an approximation of the true cosine similarity given by Equation 3.1, and we discuss the accuracy of all three approximation methods at the end of this section.
Random projection (RP) signatures use a series of random hyperplanes as hash functions to encode document vectors as fixed-size bit vectors [21]. Let us assume a collection with vocabulary V, so that each document vector is from the domain R^|V|. To obtain a signature of D bits using this approach, D randomly generated real-valued vectors of length |V| are used to map each document vector u onto a D-bit RP signature su ∈ {0, 1}^D. The i-th bit of su is determined by the inner product of u and the i-th random vector.
Given D random vectors r1, ..., rD, the signature su is computed as follows:
su[i] = h_ri(u) = { 1 if ri · u ≥ 0
                    0 if ri · u < 0 }    (3.4)
Geometrically, the intuition behind this formula is that the hyperplane orthogonal to each random vector ri splits the space into two: document vectors on the same side as ri generate a positive inner product, whereas the inner product with vectors from the other side is negative.
The cosine similarity between two documents can be computed via hamming distance (i.e., number of bits that differ) between their signatures, according to the following relationship [21]:
Sim(u, v) = cos[(π/D) · hamming(su, sv)]    (3.5)
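A sketch of RP signature generation (Equation 3.4) and the similarity estimate of Equation 3.5, using dense dictionaries of Gaussian components for clarity; a real implementation would generate the random vector components on the fly from a seeded hash:

```python
import math
import random

def rp_signature(vec, random_vectors):
    """Equation 3.4: one bit per random hyperplane (sign of the inner product)."""
    return [1 if sum(r[t] * w for t, w in vec.items()) >= 0 else 0
            for r in random_vectors]

def estimated_cosine(sig_u, sig_v):
    """Equation 3.5: recover cosine similarity from the hamming distance."""
    hamming = sum(bu != bv for bu, bv in zip(sig_u, sig_v))
    return math.cos(math.pi * hamming / len(sig_u))

# Toy setup: D random Gaussian vectors over a tiny vocabulary.
rng = random.Random(42)
vocab = ["information", "retrieval", "translation"]
D = 64
random_vectors = [{t: rng.gauss(0, 1) for t in vocab} for _ in range(D)]

sig = rp_signature({"information": 0.5, "retrieval": 0.8}, random_vectors)
# A signature compared with itself has hamming distance 0, so cos(0) = 1.0.
```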
Simhash signatures are essentially a "hash" of the document vector. This approach relies on a hash function that generates hash values for every term in the document vector. Given a document vector u and a hash function h that maps terms to D-bit hash values, we generate a D-bit signature su as follows:
su[i] = { 1 if Σ_{t : h(t)[i]=1} ut + Σ_{t : h(t)[i]=0} (−ut) > 0
          0 otherwise }    (3.6)
where ut corresponds to the weight of term t in document vector u. If the i-th bit of t's hash value is 1, the term contributes its positive weight (i.e., ut) to the i-th bit of the "hash" value of the vector (i.e., su[i]); otherwise, it contributes −ut. If the sum of the term contributions to su[i] is greater than 0, then su[i] is set to 1; otherwise, it is set to 0. As with RP signatures, Equation 3.5 gives the cosine similarity as a function of the hamming distance between two Simhash signatures.
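Equation 3.6 can be sketched as follows; Python's built-in `hash` stands in for the 64-bit hash family of Manku et al. [90], so this illustrates the mechanics only:

```python
def simhash(vec, num_bits=64):
    """Equation 3.6: weighted Simhash of a sparse term-weight vector.

    A stand-in hash (Python's built-in) replaces the fixed 64-bit hash
    function a real implementation would use.
    """
    totals = [0.0] * num_bits
    for term, weight in vec.items():
        h = hash(term) & ((1 << num_bits) - 1)  # truncate to num_bits bits
        for i in range(num_bits):
            # Bit i of the term's hash decides the sign of its contribution.
            totals[i] += weight if (h >> i) & 1 else -weight
    return [1 if t > 0 else 0 for t in totals]
```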
For this approach, signature quality depends on the quality of the hash function, which also dictates the signature length (therefore it cannot be set arbitrarily). In this work, we use the hash function of Manku et al. [90], which was applied to detect near-duplicate web pages. There are more recent methods along similar lines that "learn" the hash of a vector, which are worth exploring in the future [154].
The Minhash signature of a document vector u requires K random orderings of the terms in u. Given a hash function, terms can be ordered by their hash values; alternatively, terms can be ordered by a random permutation of the vocabulary. For each of the K orderings, the term in a document that has the lowest order is picked as the "minimum hash" (minhash) of that document. The probability that two documents u and v have the same minhash term for a given ordering is |u ∩ v|/|u ∪ v| (i.e., the Jaccard similarity). This procedure is repeated for K different randomly selected orderings to reduce the risk of false positives. Consequently, the Minhash signature of a document vector consists of all K minhash terms. The similarity between u and v is computed as the proportion of minhash terms they share:
Sim(u, v) = |su ∩ sv| / K    (3.7)
Chum et al. [28] showed the effectiveness of Minhash signatures on near-duplicate image detection.
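A Minhash sketch together with the similarity estimate of Equation 3.7; the K random orderings are realized here via seeded MD5 hashes (our own construction, standing in for random permutations):

```python
import hashlib

def minhash_signature(terms, K=32):
    """One 'minimum hash' term per seeded ordering (K orderings in total)."""
    signature = []
    for seed in range(K):
        def order(t, seed=seed):
            # Seeded hash induces a pseudo-random ordering of the terms.
            return hashlib.md5(f"{seed}:{t}".encode()).hexdigest()
        signature.append(min(terms, key=order))
    return signature

def minhash_similarity(sig_u, sig_v):
    """Equation 3.7: fraction of shared minhash terms, which estimates the
    Jaccard similarity |u ∩ v| / |u ∪ v|."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)
```

Identical term sets agree on every ordering (estimate 1.0), while disjoint term sets can never share a minhash term (estimate 0.0).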
In order to compare the precision of these three signature generation techniques, we selected a sample of 1064 document vectors from preprocessed German Wikipedia. For every pair of document vectors, we first computed the true cosine similarity (Equation 3.1), and then estimated it using each of the three signature methods with varying signature lengths. In each case, we computed the average absolute difference between the true and estimated similarity values, shown in Table 3.1. For comparison, the last column shows the average time (in milliseconds) taken to generate one signature on a laptop with an Intel Core 2 Duo 2.26 GHz processor.

    Method    Bits    Avg. Absolute Error       Time (ms)
                      > 0.0   > 0.1   > 0.2
    Minhash     64    0.061   0.168   0.215      0.28
    Minhash    992    0.050   0.056   0.075      1.82
    Simhash     64    0.236   0.299   0.273      0.25
    RP          64    0.154   0.152   0.123      1.17
    RP         100    0.124   0.122   0.101      2.03
    RP         200    0.088   0.086   0.069      4.52
    RP         500    0.056   0.055   0.045     10.93
    RP        1000    0.040   0.039   0.033     20.91

Table 3.1: Absolute error in cosine similarity, averaged over 1064 sample signatures, compared among three different methods and a range of sizes.
The number of bits can be set arbitrarily for the RP method, but Simhash signatures generated by the approach in Manku et al. [90] are 64 bits. On the other hand, since Minhash signatures are represented as a list of integers (corresponding to term ids), they take up 32 times more space (assuming 4-byte integers) than a Simhash or RP signature of the same length. For a fair comparison, we included 2-term (64 bits) and 32-term (992 bits) Minhash signatures in the evaluation.
We performed three sets of experiments: first, we computed the error over all possible pairs, then over only pairs with similarity at least 0.1, and finally over only pairs with similarity at least 0.2. Precision at very low similarities is not very important because relevant documents usually have values higher than 0.2 (based on Figure 3.3). We noticed that Minhash signatures perform worse as we filter out low-similarity pairs because the method simply predicts 0.0 for many pairs. In contrast, RP signatures are more accurate when the cosine similarity is higher, an indicator of the robustness of the method.

In all cases, 64-bit RP signatures are more precise than Simhash signatures, and this becomes more apparent once the low-similarity pairs are filtered out. The only drawback of RP signatures is the amount of time necessary to generate them: an inner product is needed to determine every bit. Creating a 64-bit RP signature is, on average, about 5 times slower than the other methods with the same number of bits, and generating 1000-bit RP signatures is more than 10 times slower than generating 32-term (992-bit) Minhash signatures. However, the running time of the signature construction step is small compared to that of the actual pairwise similarity algorithm. Based on this analysis, we selected 1000-bit RP signatures for our experiments.
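The RP signatures selected here can be sketched as follows. This is a simplified illustration: dense Gaussian random vectors are drawn from a fixed seed (the same seed must be used for all documents so they share the same random vectors), whereas our actual implementation handles the vocabulary and random vectors more efficiently.

```python
import math
import random

def rp_signature(doc_vector, vocab, D=1000, seed=7):
    """One bit per random projection: the sign of the inner product
    between the document vector and a random Gaussian vector."""
    rng = random.Random(seed)  # same seed for every document
    sig = []
    for _ in range(D):
        r = {t: rng.gauss(0.0, 1.0) for t in vocab}
        sig.append(1 if sum(w * r[t] for t, w in doc_vector.items()) >= 0 else 0)
    return sig

def cosine_from_hamming(sig_u, sig_v):
    """Equation 3.5: estimate cos(theta) from the hamming distance."""
    D = len(sig_u)
    h = sum(a != b for a, b in zip(sig_u, sig_v))
    return math.cos(h * math.pi / D)
```

For example, two unit-norm vectors with true cosine similarity 0.5 yield an estimate close to 0.5 with D = 1000 projections, with the estimation error shrinking as D grows.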
3.2.3.3 Sliding Window Algorithm
A naive solution to the pairwise similarity problem requires a quadratic search: the number of pairs examined is on the order of the square of the number of objects. An approach with this scaling behavior is not feasible in practice, as available computational resources cannot keep up with the growth of the data. In fact, for web collections, linear scaling properties are essential for algorithms to be practical; we therefore turn to approximate search approaches.
The sliding window algorithm is based on LSH [15, 21, 63], which assumes that documents are represented as bit signatures, such that two similar documents are likely to have similar signatures. The algorithm was first presented by Ravichandran et al. [116], although it is heavily based on ideas in Charikar’s paper [21]. Let us first briefly describe the original algorithm before introducing our parallelized version using MapReduce.
First, the list of all document signatures in the collection is sorted according to their bits (starting from the most significant bit). In this sorted list, each signature is compared to its neighbors within a window of B signatures (i.e., the sliding window). Based on the hamming distance, the cosine similarity is computed for each compared pair (by Equation 3.5), and all pairs above a similarity threshold τ are output by the algorithm. In order to increase recall, the unsorted collection is replicated and permuted Q times, each time with respect to a different permutation function.3 In each permuted version, the bits of each signature are permuted according to the corresponding (randomly generated) permutation function. Each permuted list is sorted before performing the sliding window comparison. A more detailed description is given in the following sections.
Based on LSH theory (see Section 2.4), higher similarity between documents leads to higher probabilities of being closer in a sorted list. Given a large enough
3 A permutation function is basically a permuted list of the integers [1...D], where D is the signature size.
Q and B, it can be shown that signatures of similar documents will be at most B positions apart in at least one table with very high probability. Since the sorted order depends on the positions of bits, having multiple permutations of the same signature increases the chance of retrieving a similar pair.
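Before turning to the MapReduce versions, the single-machine procedure just described can be sketched as follows. This is an illustrative rendition with made-up parameter defaults, not our actual implementation; it also folds in the conversion of the cosine threshold τ into a hamming distance threshold T via Equation 3.5.

```python
import math
import random

def sliding_window_pairs(signatures, Q=10, B=4, tau=0.3, seed=0):
    """signatures: {docid: list of D bits}. Returns unordered docid
    pairs whose estimated cosine similarity is at least tau."""
    rng = random.Random(seed)
    D = len(next(iter(signatures.values())))
    T = int(D * math.acos(tau) / math.pi)  # hamming threshold from tau
    pairs = set()
    for _ in range(Q):
        perm = list(range(D))
        rng.shuffle(perm)
        # Permute each signature's bits and sort lexicographically.
        table = sorted((tuple(sig[p] for p in perm), n)
                       for n, sig in signatures.items())
        for i, (sig_i, n_i) in enumerate(table):
            # Compare against the previous B entries (the window).
            for sig_j, n_j in table[max(0, i - B):i]:
                hamming = sum(a != b for a, b in zip(sig_i, sig_j))
                if hamming <= T:
                    pairs.add(tuple(sorted((n_i, n_j))))
    return pairs
```

Identical signatures always sort adjacently and are therefore found in every table, while very dissimilar signatures are pruned by the hamming threshold even when they happen to fall inside a window.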
We first describe a straightforward parallel implementation of the algorithm in Section 3.2.3.4. Then, we discuss several modifications to exploit the MapReduce framework and take advantage of greater parallelism in Section 3.2.3.5.
3.2.3.4 Basic MapReduce Implementation
A demonstration of the sliding window algorithm on a toy example is given in
Figure 3.5, based on the basic MapReduce implementation described in this section.
After describing our implementation, we will go over the example in Figure 3.5 for clarification.
[Figure 3.5 appears here: three 11-bit signatures are permuted under p1 and p2, sorted into two tables, and compared within windows of size B = 2 against a hamming distance threshold of T = 6.]
Figure 3.5: An example demonstration of the basic MapReduce implementation of the sliding window algorithm, assuming a toy collection with three signatures and parameters as follows: D=11 (i.e., number of bits), Q=2 (i.e., number of permutations), B=2 (i.e., window size), T=6 (i.e., hamming distance threshold).

In our MapReduce design, Q random permutation functions p1, ..., pQ are created as a preprocessing step. Each document in the collection is encoded as a (docid, signature) pair, and input to the MapReduce program. Each mapper takes a docid n and signature s as input, and emits an intermediate key-value pair (⟨i, s^i⟩, n) for each permutation function pi, i = 1...Q.4
Each key sent to a reducer is a pair of the permutation group number i and the signature s (denoted ⟨i, s⟩), and the values are the docids sharing that pair (i.e., the docids of all documents whose ith signature permutation is s).5 The reduce step is designed so that all keys with the same permutation group number are sent to the same reducer (by appropriately partitioning the key space), and they are sorted according to the signature bits (comparison of two signatures is from the most significant to the least significant bit). Therefore, each of the Q reducers receives a sorted list of permuted signatures, paired with their respective docids, which we call a table. At this point of the algorithm, we essentially have Q different versions of the original list of signatures (i.e., tables), each corresponding to a different permutation.
A queue of size B is used to implement the sliding window idea (Reduce method in Figure 3.6): in each table, we compare all signatures that are at most
B positions away from each other. The reducer iterates over the list of signatures and keeps the last B in the queue (lines 13–17). At each iteration, the current signature is compared to all signatures in the queue, and docid pairs are emitted for all pairs in which the cosine similarity is above τ (lines 8–12). Since we are
4 We denote s^i as the permutation of s with respect to the permutation function pi, and refer to it as the ith permutation of s.
5 In practice, it is very rare that two documents have the exact same signature when the number of bits is high.
Mapper
1: method Map(docid n, signature s)
2:   for all permute func pi ∈ [p1, p2, ..., pQ] do
3:     Emit(⟨i, s.Permute(pi)⟩, n)

Reducer
4: method Initialize
5:   docids ← new Queue(B)
6:   sigs ← new Queue(B)
7: method Reduce(⟨permno i, signature s⟩, docids [n1, n2, ...])
8:   for all docid n ∈ [n1, n2, ...] do
9:     for j = 1 to sigs.Size() do
10:      distance ← s.Distance(sigs.Get(j))
11:      if distance ≤ T then
12:        Emit(⟨docids.Get(j), n⟩, distance)
13:    sigs.Enqueue(s)
14:    docids.Enqueue(n)
15:    if sigs.Size() > B then
16:      sigs.Dequeue()
17:      docids.Dequeue()
Figure 3.6: Pseudo-code of the initial version of the sliding window algorithm.

comparing signatures, we output pairs based on the hamming distance, in order to avoid computing the cosine similarity. The cosine similarity threshold τ is converted into a distance threshold by solving Equation 3.5 for the hamming distance: T = D · cos⁻¹(τ)/π.
In the example shown in Figure 3.5, three 11-bit signatures are input to the algorithm, and two permuted versions of each signature are generated, with respect to permutation functions p1 and p2. The corresponding key-value pairs are sent to the reducers. The reducers receive two tables: one contains the list of the three signatures permuted with p1, in sorted order of their bits; the other contains the signatures permuted with p2. When processing the first table in Figure 3.5, a window of size 2 allows the comparison of two pairs: the signatures of documents 3 and 2 are compared, as well as those of documents 2 and 1. The first comparison yields a hamming distance of 7 bits, which is above the threshold, so (3,2) is not emitted. The second pair has a distance of 5 and is therefore output as similar. The second table also yields two comparisons, (2,3) and (3,1), and outputs the latter. As a result, the algorithm makes 4 comparisons and outputs the only two pairs with a hamming distance of at most 6.
3.2.3.5 Improved MapReduce Implementation
A drawback of the basic algorithm is the inability to increase the number of reducers, because each reduce group processes a single permuted list of all the signatures in the collection. In other words, in the reduce phase, only Q reducers can operate in parallel, which may underutilize resources in very large clusters that are commonly used for MapReduce. In this case, being able to increase the number of reducers arbitrarily can help the algorithm scale better.
We achieved this property by modifying the previous algorithm in the following way. The map function is exactly the same, but the reduce step divides each table into an arbitrary number of consecutive blocks (Figure 3.7). Each of the Q reducers iterates over sorted signatures, stores them in a list (lines 9–11), and outputs the list when its size reaches a pre-specified number, M (lines 12–14). We denote each of these M-sized lists as a chunk, and refer to this as the chunk generation phase.
The actual comparison and similarity computation is performed in an additional map-only MapReduce task (Figure 3.8), called the detection phase, where the input
Mapper
1: method Map(docid n, signature s)
2:   for all permute func pi ∈ [p1, p2, ..., pQ] do
3:     Emit(⟨i, s.Permute(pi)⟩, n)

Reducer
4: method Initialize
5:   docids ← new List
6:   sigs ← new List
7:   chunk ← new SignatureChunk
8: method Reduce(⟨permno p, sig s⟩, docids [n1, n2, ...])
9:   for all docno n ∈ [n1, n2, ...] do
10:    sigs.Add(s)
11:    docids.Add(n)
12:    if sigs.Size() = M then
13:      chunk.Set(sigs, docids)
14:      Emit(p, chunk)
15:      sigs ← sigs.SubList(M − B + 1, M)
16:      docids ← docids.SubList(M − B + 1, M)
17: method Close
18:   chunk.Set(sigs, docids)
19:   Emit(p, chunk)

Figure 3.7: Pseudo-code of the chunk generation phase of the sliding window algorithm.
Mapper
1: method Map(permno p, signatureChunk chunk)
2:   for i = 1 to (chunk.Size() − B) do          ▷ B is the size of the sliding window
3:     for j = (i + 1) to (i + B) do             ▷ only if i and j are from different languages
4:       distance ← chunk.Signature(i).Distance(chunk.Signature(j))
5:       if distance ≤ T then
6:         Emit(⟨chunk.Docid(j), chunk.Docid(i)⟩, distance)
Figure 3.8: Pseudo-code of the detection phase of the sliding window algorithm.
is the set of chunks; each mapper finds all similar pairs within one chunk. Only pairs with a distance of at most the threshold T are emitted by the algorithm. We use the docids to ensure that only cross-lingual pairs are included in the output. In this phase, there can be as many mappers as there are chunks, which allows full utilization of computational resources.
Note that some comparisons will be missed at the boundary of chunks (e.g., the last element of a chunk is not compared to the first element of the next chunk).
We address this issue by appending the last B signatures of the previous chunk to the beginning of the current chunk (Figure 3.7, lines 15–16). This results in redundancy in the chunks emitted to disk, because the last B signatures of a chunk are the same as the first B signatures of the next one. The total number of redundantly stored key-value pairs is (number of chunks × B), or NB/M, which is negligible assuming that B ≪ M.
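The boundary handling in lines 12–16 of Figure 3.7 amounts to the following chunking logic (a minimal sketch; `table` here stands for an already-sorted list of (signature, docid) entries):

```python
def make_chunks(table, M, B):
    """Mirror of Figure 3.7's reducer: emit a chunk whenever the buffer
    reaches M entries, then retain the last B entries as the start of
    the next chunk; emit whatever remains at the end."""
    chunks, buf = [], []
    for entry in table:
        buf.append(entry)
        if len(buf) == M:
            chunks.append(list(buf))
            buf = buf[-B:]  # keep last B entries for the overlap
    if buf:
        chunks.append(buf)
    return chunks
```

With the B-entry overlap, every group of B consecutive table entries falls entirely inside at least one chunk, so no window comparison is lost at a chunk boundary.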
3.2.4 Analytical Model
In this section, we provide a theoretical analysis of our LSH algorithm, which provides a model for estimating effectiveness analytically. The starting point is
Equation 3.4, which shows how random vectors are used to generate bit signatures.
For any two vectors u and v, the probability that a single random projection hr collides is given by the following:
Pr[hr(u) = hr(v)] = 1 − θ(u, v)/π    (3.8)
The proof can be found in [54] and is omitted here, but the intuition is that the hyperplane defined by the random vector divides the collection of vectors into two disjoint sets, and the probability that any two vectors land in the same set is determined by the angle θ between them, according to the relation above.
Let S0 = {u|hr(u) = 0} and S1 = {u|hr(u) = 1}, corresponding to the subsets of our collection of vectors that share the same hash value. Given a vector u, we can compute hr(u) and perform a linear scan in the corresponding set to find similar vectors. Given that vectors u and v have cosine similarity t, the probability that they are in the same subset is given by 1 − cos−1(t)/π.
We can extend this basic approach by selecting n random vectors {r1, r2, ..., rn} and constructing the corresponding hash functions {hr1(u), hr2(u), ..., hrn(u)}. Using these hash functions, we can divide the collection into 2^n disjoint sets:
S000...0 = {u | hr1(u) = 0 ∧ hr2(u) = 0 ∧ ... ∧ hrn(u) = 0}
S000...1 = {u | hr1(u) = 0 ∧ hr2(u) = 0 ∧ ... ∧ hrn(u) = 1}
...
S111...1 = {u | hr1(u) = 1 ∧ hr2(u) = 1 ∧ ... ∧ hrn(u) = 1}
Suppose we wish to find document vectors that are similar to u: we can apply the hash functions to determine the correct candidate set of vectors, and then perform a linear scan through all those candidates to compute the actual inner products. With n random projections, the probability that we find a similar vector becomes (1 − cos⁻¹(t)/π)^n. More random projections reduce the number of comparisons we must make, but at the cost of compounding the error introduced by each random projection.
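The bucketing described above can be sketched as follows (an illustrative sketch; the function and variable names are ours):

```python
import random

def hyperplane_buckets(vectors, n=8, seed=1):
    """Partition {docid: {term: weight}} vectors into up to 2^n buckets
    keyed by the signs of n random projections; candidate pairs are
    then only those sharing a bucket."""
    rng = random.Random(seed)
    vocab = sorted({t for v in vectors.values() for t in v})
    planes = [{t: rng.gauss(0.0, 1.0) for t in vocab} for _ in range(n)]
    buckets = {}
    for docid, vec in vectors.items():
        # Concatenate the n hyperplane hashes into one bucket key.
        key = tuple(1 if sum(w * r[t] for t, w in vec.items()) >= 0 else 0
                    for r in planes)
        buckets.setdefault(key, []).append(docid)
    return buckets
```

Identical vectors always share a bucket, and each additional hyperplane roughly halves the expected bucket size, at the cost of splitting apart some genuinely similar pairs.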
To alleviate this issue, we can repeat the entire process m times using m sets of n different random projections. In this case, the probability that we identify a valid similar pair becomes the following:
Pr[u, v in same set in at least one trial | cos(u, v) = t]
  = 1 − (1 − (1 − cos⁻¹(t)/π)^n)^m    (3.9)
The above equation quantifies the error when searching for similar vectors using LSH. The second source of error is the hamming distance computation. As shown in Section 3.2.3.2, cosine similarity can be estimated from the hamming distance with Equation 3.5.
We use this similarity estimate to efficiently decide which documents to return during the linear scan of the candidate set. For a given vector u and similarity threshold τ, we calculate the corresponding hamming distance threshold T using
Equation 3.5 and return all candidates with hamming distance less than or equal to
T . The precision of this decision is given by the following:
Pr[hamming(u, v) ≤ T | cos(u, v) = t ∧ u, v share n-bit prefix]
  = Σ_{i=0}^{T} Pr[hamming(u, v) = i | cos(u, v) = t ∧ u, v share n-bit prefix]
  = Σ_{i=0}^{T} C(D − n, i) ρ^i (1 − ρ)^(D−n−i)    (3.10)

where ρ = cos⁻¹(t)/π is the probability that any single bit of u and v differs.
How does this relate to our sliding window algorithm? Let us first consider a variation of our algorithm. Suppose that instead of applying a sliding window B over the sorted bit signatures, we performed a pairwise N² comparison of all bit signatures that share n prefix bits. Another way to think about this is to dynamically adjust B so that it encompasses only those signatures that share the prefix. In this case, the number of tables (permutations) in our algorithm corresponds to the number of trials m, and the prefix length corresponds to the number of random projections n in the above analysis.
Putting all of this together, the expected probability of extracting a pair of vectors with similarity t (hamming distance ≤ T ) is quantified by the product of
Equations 3.9 and 3.10 above:
Pr[hamming(u, v) ≤ T in at least one table | cos(u, v) = t]
  = [1 − (1 − (1 − cos⁻¹(t)/π)^n)^m] · Σ_{i=0}^{T} C(D − n, i) ρ^i (1 − ρ)^(D−n−i)    (3.11)
The probability of successfully extracting a similar pair has contributions from two sources: LSH (the two documents sharing the same signature prefix) and accu- racy of hamming distance.
In actuality, we scan a fixed window B, so we need to approximate the common prefix length for the purposes of this analysis. Assuming that there are C vectors in total, C/B is the number of windows into which we split the signatures. As these signatures are sorted from left to right, we can approximate the prefix length to be between ⌊log2(C/B)⌋ and ⌈log2(C/B)⌉. We use both estimates to obtain a range of estimated recall values. Although this is merely a rough model, it provides a basis for analytically estimating effectiveness, which we demonstrate later.
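Equations 3.9–3.11 and the prefix-length approximation translate directly into an estimator (a sketch under the same modeling assumptions; the function names are ours):

```python
import math
from math import comb

def estimated_recall(t, D, n, m, T):
    """Equation 3.11: probability of extracting a pair with true cosine
    similarity t, given D-bit signatures, n prefix bits, m tables, and
    hamming distance threshold T."""
    rho = math.acos(t) / math.pi             # per-bit disagreement prob.
    p_lsh = 1 - (1 - (1 - rho) ** n) ** m    # Equation 3.9
    p_ham = sum(comb(D - n, i) * rho ** i * (1 - rho) ** (D - n - i)
                for i in range(T + 1))       # Equation 3.10
    return p_lsh * p_ham

def recall_bounds(t, D, C, B, Q, T):
    """Approximate the prefix length n by floor/ceil of log2(C/B)."""
    n_lo = math.floor(math.log2(C / B))
    n_hi = math.ceil(math.log2(C / B))
    # Larger n means a lower LSH success probability, hence lower recall.
    return estimated_recall(t, D, n_hi, Q, T), estimated_recall(t, D, n_lo, Q, T)
```

With the number of tables standing in for m and the window-derived prefix length standing in for n, this yields the floor/ceiling recall estimates plotted later as dotted lines.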
3.2.5 Experimental Evaluation
We ran our cross-lingual pairwise similarity algorithm on six language pairs, where the target language is fixed to English: German-English (de-en), Spanish-English (es-en), Chinese-English (zh-en), Arabic-English (ar-en), Czech-English (cs-en), and Turkish-English (tr-en). Since our end goal is to improve MT quality, we selected a diverse set of languages: (a) from high-resource languages for which we expect better performance in MT evaluation tasks (e.g., German, Spanish, Arabic) to under-explored languages with limited resources (e.g., Czech, Turkish), and (b) from European languages closely related to English (e.g., German and Spanish) to languages that are linguistically challenging and may require special processing (e.g., segmentation in Chinese and Arabic, morphological analysis in Turkish).
In this section, an intrinsic evaluation of the cross-lingual pairwise similarity algorithm is presented using English and German Wikipedia only, selected because they are the largest Wikipedia collections available and because significant amounts of parallel corpora exist for the language pair. Experimental results for all six language pairs are presented in Section 3.3, as part of the end-to-end extrinsic MT evaluation.
We used the German Wikipedia dump from 1/31/2011, which contains 2.41 million articles totaling 8.5 GB, and the English Wikipedia dump from 1/15/2011, which contains 10.86 million articles totaling 30.6 GB. For both collections we discarded redirect pages and stub articles, after which we were left with 1.68 million German and 3.57 million English articles.
Our system is implemented on top of Ivory, an open-source Hadoop toolkit for web-scale information retrieval [82]. German articles were projected into English document vectors using Equation 3.2, with the filtering parameters L (probability lower bound), C (cumulative probability threshold), and H (number of candidates) fixed to 0.05, 0.95, and 15, respectively. For the translation probabilities P(f|e) and P(e|f), we trained word alignments using GIZA++ [106] on the Europarl German-English corpus, containing 2.08 million sentence pairs from European parliament speeches. For tokenization, we used the Java-based OpenNLP toolkit.6 After projection, document vectors containing fewer than five terms were discarded, resulting in 1.47 million German articles and 3.44 million English articles. In the next step, RP signatures were generated from the document vectors. Finally, we ran the improved sliding window algorithm (Section 3.2.3.5) to extract all document pairs with cosine similarity above a specified threshold. We only output cross-lingual pairs (i.e., one document from German Wikipedia and the other from English Wikipedia).
All experiments described in this section were conducted on a Hadoop cluster
(running Cloudera’s distribution) with 16 nodes, each having 2 quad-core 2.2 GHz
Intel Nehalem Processors, 24 GB RAM, and three 2 TB drives.
6http://opennlp.sourceforge.net
3.2.5.1 Effectiveness vs. Efficiency
The parameters in our algorithm are D (length in bits of each signature), Q (number of tables), M (chunk size), B (window size), and T (hamming distance threshold). We fixed D to 1000 based on the discussion in Section 3.2.3.2, and fixed M to 0.49 million so that each table is split into 10 chunks. For the remaining parameters, we explored a number of choices and observed the changes in efficiency and effectiveness. Note that the preprocessing and signature generation steps need to run only once for these experiments (taking 1.67 hours).
The chunk generation phase of the sliding window algorithm must be executed once for every value of Q (this takes 0.28, 0.53, and 0.75 hours, respectively, for
Q = 100, 200, 300). The detection phase must be run for every different set of the remaining parameters, for which running times on the entire German-English
Wikipedia collection are plotted in Figure 3.9 (varying window sizes for each value of
Q). The regression lines clearly show that the algorithm scales linearly as Q and B increase (R2 > 0.999 in all three cases). Note that the hamming distance threshold
T does not affect the running time, but only determines which pairs are extracted, so it is not included in this figure.
To assess effectiveness, we selected a sample of 1064 German Wikipedia articles. For every article, we calculated its true cosine similarity with all other documents in English Wikipedia. All pairs above threshold τ = 0.3 are treated as the ground truth.7 This corresponds to a hamming distance threshold of T = 400 for
7 The threshold was selected based on the similarity distribution in Figure 3.3.
[Figure 3.9 appears here: running time (hours) of the detection phase vs. window size B, with one curve for each of Q = 100, 200, 300; data points are annotated with (recall, relative cost).]
Figure 3.9: Running times of the detection phase of the sliding window LSH algo- rithm for τ = 0.3, using 1000-bit signatures. Data points are annotated with recall and relative cost.
1000-bit signatures, since cos(400π/1000) ≈ 0.3. Note that a hamming distance of 500 corresponds to no similarity at all, so we are looking for documents that may be quite dissimilar. From this, we are able to measure the quality of the pairs extracted by our algorithm. We argue that recall is the most important metric for evaluation (i.e., the fraction of pairs in the ground truth set that are actually extracted), since the task is recall-oriented by definition. Precision can be increased with post-processing: we can always filter the extracted pairs with more sophisticated techniques and discard results that fall below the similarity threshold. The time necessary for this is negligible compared to the time for extracting the pairs in the first place. For instance, with parameters Q = 300, B = 2000, the number of extracted pairs is 64 million, or 0.0013% of all possible cross-lingual pairs in the collection. This gives an idea of how little work remains for a possible post-processing step after running our algorithm. Moreover, from an application point of view, it may not even be necessary to filter, since our τ threshold of 0.3 is somewhat arbitrary, and pairs with lower similarity scores may nevertheless be useful (see Figure 3.3).
Data points in Figure 3.9 are annotated with recall and a measure we call relative cost, defined as follows: for each condition, we can analytically compute the total number of similarity comparisons (i.e., hamming distance computations) necessary over the entire run. The cost of the chunk generation process is also taken into account in these calculations (by translating that time into an equivalent number of comparisons). We can then express this as a percentage of the total number of comparisons that a brute-force approach would require, which is the product of the sizes of the two collections: with 3.44 million English articles and 1.47 million German articles, this equals 5.05 trillion comparisons. A relative cost lower than 100% means we are “saving work” by not considering pairs unlikely to be similar, which is the point of using LSH. Each point in the graph represents a point in the tradeoff space: the level of recall and efficiency that can be expected for a particular parameter setting.
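As a rough back-of-the-envelope version of this calculation (ignoring the chunk generation overhead, which the reported figures do include; the variable names are ours):

```python
def relative_cost(num_signatures, coll_size_f, coll_size_e, Q, B):
    """Approximate LSH work as Q tables x num_signatures x B window
    comparisons each, relative to the brute-force cross product."""
    lsh_comparisons = Q * num_signatures * B
    brute_force = coll_size_f * coll_size_e
    return lsh_comparisons / brute_force
```

For example, relative_cost(4.91e6, 1.47e6, 3.44e6, 100, 2000) gives roughly 0.19, in the same ballpark as the annotations in Figure 3.9, and the estimate grows linearly in both Q and B, matching the observed scaling.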
To better understand these results, it is necessary to quantify the upper bound on effectiveness. Note that our algorithm has two sources of error: errors introduced by the bit signatures (cf. Table 3.1) and errors introduced by the sliding window algorithm (i.e., failure to consider a pair as a candidate). The former error is due to representation, whereas the latter is due to search. In order to isolate the search error, we can compute an effectiveness upper bound by repeating the procedure that generated the ground truth (i.e., the brute-force approach), but using signatures instead of document vectors. For every sample document’s signature, we computed the hamming distance with all other signatures in English Wikipedia and extracted all pairs with a distance of at most T = 400. We then compared these pairs to the ground truth set (i.e., brute force on document vectors) and obtained 0.76 recall.8

This means that the 0.04 absolute error in cosine similarity (see Table 3.1) introduced by 1000-bit random projections, which is negligible for tasks that adopt high similarity thresholds (e.g., [116]), has a large impact on our task. Another way to see this is that absolute error has an increasing impact as the similarity threshold is lowered (i.e., a 0.04 absolute error equals 13% relative error when dealing with cosine similarity values around 0.3). From this analysis, it is important to understand that the upper bound on recall for our sliding window algorithm is not a perfect 1.0, but a more modest 0.76.
The expression “no free lunch” best characterizes our conclusion from the experimental results. There does not appear to be a single optimal solution to the cross-lingual pairwise similarity problem using LSH. Gains in efficiency inevitably come at a cost in effectiveness, and the extent to which it is worthwhile to trade off one for the other depends on application- and resource-specific constraints (i.e., how much recall is necessary, how much computational resources are available, etc.).
8Based on micro-averaging, where the denominator is the sum of ground truth pairs across all sample articles.
Representation   Time (ns)   Precision   Recall
1000-bit             28        0.59       0.76
2000-bit             52        0.74       0.81
3000-bit             84        0.86       0.78
Doc Vector          450          -          -

Table 3.2: Comparing different representations: the average time for a similarity comparison is shown alongside the effectiveness achieved on Wikipedia.
We provide a guide that helps the application developer locate a desirable operating point in the solution space.
We additionally explored the impact of different representations on the upper bound of effectiveness. Increasing the number of random projections produces longer, and therefore more precise, signatures, at the cost of linearly increasing running times. This is shown in Table 3.2, where we repeated the same set of experiments with 2000- and 3000-bit signatures; the table reports the average time (in nanoseconds) to process a single pair. Increasing signature length, however, does not appear to have much of an impact on recall. As a reference, we also show results for the original document vectors. Even with 3000 bits, hamming distance calculations are still more than 5 times faster than computing similarity directly on document vectors (which requires floating point operations). Once again, the motto is “there is no free lunch”: representations that support faster similarity comparisons sacrifice precision.
3.2.5.2 Parameter Exploration
Since there are many tradeoffs within the parameter space of this algorithm, it would be desirable to explore the space more exhaustively, beyond the experiments in Figure 3.9. However, the long running times of large-scale experiments are a major obstacle. Nevertheless, it is possible to separately evaluate effectiveness and efficiency in a meaningful way. To measure efficiency, we quantify the amount of work required by our sliding window algorithm as the number of comparisons. This measure is convenient since it can be computed analytically without actually running the experiments, and it is arguably even better than (wall clock) running time since it abstracts over hardware differences and other natural variations. Since the results in Figure 3.9 confirm that our algorithm scales linearly (with respect to the parameters we tested), we can derive a straightforward mapping from the number of comparisons to actual running time. To measure effectiveness, we can greatly speed up the experiments by considering only documents in the test set.
The results of separately evaluating effectiveness and efficiency are shown in Figure 3.10, where we vary the number of tables Q for B = 200, 500, 1000. Once again, this characterizes the tradeoff space by showing the recall and relative cost (labeled on the points) realized by a particular parameter setting.
Figure 3.10 also shows recall estimates based on our analytical model from Section 3.2.4. As previously discussed, we bound n between ⌊log2(N/B)⌋ and ⌈log2(N/B)⌉, where N is the total number of document vectors in the collection; these bounds are shown in the figure as dotted lines (note the overlaps). Despite the simplifications made in the analytical model, we see that the estimates are fairly good. These rough estimates allow an application developer to tackle arbitrary collections and, without performing any actual experiments, estimate the effectiveness that can be achieved with different parameter settings. Combined with the ability to compute efficiency analytically, as shown above, the developer can characterize the tradeoff space with minimal effort. Such an analysis addresses a general weakness of LSH approaches: they require setting a large number of seemingly arbitrary parameters, with little guidance for a developer approaching a new problem. Therefore, the ability to quantify efficiency and analytically estimate effectiveness provides a powerful tool for developing scalable and distributed approaches to computationally intensive problems.
To illustrate the importance of such analyses, we ran more experiments to analyze the tradeoff space of our cross-lingual pairwise similarity solution in Figure 3.11. By separating the effectiveness and efficiency measures as above, we extend our analysis even further with a wider range of parameter settings. In this figure, the thick horizontal line indicates the number of comparisons required by the brute-force approach, and each data point is annotated with the actual recall. We observe that the same recall level can be achieved with different settings, presenting a possible tradeoff between the two main parameters of our algorithm, the number of tables and the window size. This particular analysis depends on the type of hardware used to run the algorithm. Since the number of tables (Q) determines the degree of parallelism, larger clusters with more processors may benefit from larger values of Q to increase cluster utilization. On the other hand, a smaller
cluster with faster processors might benefit from larger values of B, since processor speed determines how fast individual signature chunks can be processed. Of course, we should emphasize once again that there is no "one size fits all" solution, since the optimal operating point in the tradeoff space depends on the characteristics of the application and the amount of computational resources at hand.

Figure 3.10: Effectiveness-efficiency tradeoffs of the sliding window LSH algorithm for τ = 0.3, using 1000-bit signatures. [Plot: recall vs. number of tables (Q), showing estimated (floor and ceiling) and actual recall curves for B = 200, 500, and 1000.]

Figure 3.11: Number of distance calculations and corresponding recall values. [Plot: number of distance calculations vs. window size (B) for Q = 100 through 1500, with each data point annotated with its recall.]
Another implication of Figure 3.11 is that the brute-force approach cannot be dismissed too easily. In fact, we see that certain levels of recall (beyond 0.761) can only be achieved by doing more work than the brute-force algorithm. This phenomenon is better viewed in Figure 3.12, in which we plot relative effectiveness as a function of relative cost. Each line in the figure corresponds to a different setting of Q, and each point corresponds to a single run of the algorithm. Reaching 100% on the y-axis means that the effectiveness has reached the upper bound, whereas reaching 100% on the x-axis means that the cost has reached that of the brute-force approach (i.e., we are comparing each German article to every English article). Therefore, moving beyond the thick vertical line in terms of relative cost is not meaningful: it is more efficient to simply use the brute-force algorithm. From the figure, we observe that a maximum of 99.5% relative effectiveness can be meaningfully achieved with our algorithm; for effectiveness above that, we are better off with a straightforward brute-force approach.
On the other hand, the sliding window algorithm is designed to save work, at the cost of slightly lower recall. For instance, looking at the starting point of the Q = 300 line in Figure 3.12, we see that only 35% of the cost is required to achieve close to 93% relative effectiveness. Similarly, the algorithm obtains 95% and 98% relative effectiveness at a cost about 60% and 50% less than the brute-force approach, respectively. From this limited number of experiments, we might conclude that Q = 1000 is a good setting in most cases. However, this analysis emphasizes once again that there is no free lunch: efficiency can only be gained at the cost of effectiveness.
Figure 3.12: Relative effectiveness vs. relative cost of the sliding window algorithm. [Plot: relative effectiveness (%) vs. relative cost (%), with one line per setting of Q = 300, 400, 500, 1000, and 1500.]
3.2.5.3 Error Analysis
Finally, we analyze the benefits and drawbacks of our end-to-end system on the task of matching relevant cross-lingual documents in Wikipedia. This can be done quantitatively by comparing the algorithm output to "interwiki" language links. These links are created manually by users and editors, and are supposed to connect pages about the same entity across languages. However, prior work has shown that these links can be inaccurate [32]. Therefore, we find these links insufficient for use as ground truth, and present the following results with this strong caveat.
To establish a reference on output quality, for each sample German article, we extracted the English article with the highest cosine similarity (using document vectors). The actual "interwiki" links are available from the source of each Wikipedia page, but only 401 of the sample articles had one listed in our downloaded version of Wikipedia. Among those links, our algorithm identified 130 (33%) as the most similar article. On the other hand, there were no interwiki links for 61% of the article pairs found by our algorithm. Many of these may be relevant links that were missed by Wikipedia users and editors.
A deeper, qualitative analysis shows that our proposed links are generally helpful, and in many cases they are more comprehensive than the existing Wikipedia links. Table 3.3 shows a few examples of article pairs that were assigned a cosine score above 0.3. For the German article titled "Metadaten," the algorithm correctly puts "Metadata" at the top of the list and includes two other related articles: "Semantic Web" and "File format". Although "Pierre Curie," "Marie Curie," and "Hélène Langevin-Joliot" are all physicists from the same family, and therefore similar, the algorithm places Marie above Pierre in terms of similarity for the German version. This may be due to the richer content of the article about Marie Curie. For the German article "Kirgisistan," 10 articles about recent events in Kyrgyzstan, its leaders, and its neighbors are in the output (we show the top five matching articles in Table 3.3). In short, linking cross-lingual articles via similarity has the potential to discover related articles, not just articles on the exact same entity. From the content of these document pairs, we can categorize them as semi-comparable pairs, based on our definition in Section 2.6.
It is interesting that our system was able to return a similar article for only 493 of the 1064 sample German articles. For the remainder, it may simply be the case that a similar article does not exist. This is especially true for articles about specific locations in Germany (e.g., "Dortmund-Aplerbeck"). After analyzing some of these cases by hand, we discovered that there are two common situations where the algorithm struggles. The first is when there is a large gap between the document lengths and contents of the articles in different languages. For instance, the German version of the article about the music show on German TV channel ZDF ("ZDF-Hitparade") is far more comprehensive than the English version. Since cosine similarity measures how similar documents are, it falls below 0.3 in cases like this. Secondly, our system has difficulty matching highly technical articles (e.g., "Farbeindringprüfung," English "Dye penetrant testing"), due to out-of-domain vocabulary: the vector projection has difficulty matching technical terms across languages. This could be solved with more data or special processing of named entities.
3.3 Phase 2: Extracting Bilingual Sentence Pairs
German article | Top similar articles in English
Metadaten | 1. Metadata  2. Semantic Web  3. File format
Pierre Curie | 1. Marie Curie  2. Pierre Curie  3. Hélène Langevin-Joliot
Kirgisistan | 1. Kyrgyzstan  2. Tulip Revolution  3. 2010 Kyrgyzstani uprising  4. 2010 South Kyrgyzstan riots  5. Uzbekistan

Table 3.3: Examples of relevant German-English article pairs found by our algorithm.

Once similar document pairs are identified within a comparable corpus, the next step is to process text at the sentence level and extract sentence pairs that are possible translations of each other. The purpose of the first phase is to efficiently reduce the search space as much as possible while keeping a high level of recall, which is achieved by the sliding window algorithm described earlier. At the sentence level, there is less context and more noise; therefore, we need to extract relevant pairs more carefully. Since the output will be used directly for training a statistical MT system, it is essential to minimize the number of non-parallel sentences to improve performance.
In this section, we first describe our classification approach (Section 3.3.1), then provide a detailed description of our algorithm that extracts a parallel corpus from the cross-lingual document pairs (Section 3.3.2), followed by an intrinsic evaluation of our classification approach (Section 3.3.3). Finally, details of the evaluation of our end-to-end pipeline on machine translation can be found in Section 3.4.

3.3.1 Classifier Model
Classification of bilingual sentence pairs is typically performed by a supervised approach, in which a classifier is trained from a small number of sentence pairs, each labeled as "parallel" or "not parallel" [101, 129, 141]. Previous work has shown that various features help in this classification problem. We have experimented with the following features to determine whether a bilingual sentence pair is parallel:

• Cosine similarity of the two sentences. Each sentence is treated as a document and passed through the IR pipeline for preprocessing and tf-idf computation. After translating the tf and idf values of the source-language sentence into the target language (Equations 3.2–3.3), cosine similarity is computed by an inner product between BM25-weighted vectors (Equation 3.1).

• Sentence length ratio: Ratio of the lengths of the two sentences, where sentence length is defined as the number of tokens without any preprocessing (i.e., the number of space-delimited character sequences). Sentence length has been shown to be a useful indicator of parallel sentences from the very beginning of the relevant research literature: Gale and Church's alignment algorithm is based on sentence length ratios [48].

• Word translation ratio: Ratio of words in the source (target) sentence that have translations in the target (source) sentence. We use the bilingual translation probabilities learned from the initial parallel corpus to determine the translations of each source/target word. When computing the feature value, a source word s is considered a translation of the target word t if P(s|t) > p, for some threshold p. Based on an empirical analysis, we fixed p = 0.1. To complement the cosine similarity feature, this ratio indicates content match in two directions, and treating these two signals separately might be useful in certain cases.

• Word alignment ratio: Ratio of words in the source (target) sentence that are aligned to some word in the target (source) sentence. As an improvement over the word translation ratio, finding alignments reduces noise by only considering aligned pairs as translations. We run IBM Model 1 to word-align sentence pairs and compute the desired ratio. Higher IBM models also allow other related features based on fertility and distortion that have been found useful in previous papers. Given trained Model 1 parameters, we can use Equation 2.6 to find the most probable alignment for a sentence pair.

• Upper-cased entity ratio: Ratio of upper-cased entities in the source (target) sentence that also appear in the target (source) sentence. We call a consecutive sequence of tokens ti, ..., tj an upper-cased entity if all tokens start with an upper-cased letter, and the tokens to the left and right (i.e., ti−1 and tj+1) start with a lower-cased letter. This feature has the potential to increase recall by identifying incorrectly missed pairs (i.e., false negatives) when there is an upper-cased entity match. There are two difficulties associated with this feature. First, since different languages might write names slightly differently, we should not require an exact match, and it is challenging to find a good way to represent partially matched entities. Second, this feature does not directly apply to languages that do not use the Latin alphabet (e.g., Arabic, Chinese); some form of transliteration is needed in these cases.

• Number ratio: Ratio of multiple-digit numbers in the source (target) sentence that also appear in the target (source) sentence. It is possible to limit the feature to only 4-digit numbers if the focus is on years. This feature has the potential to prevent some false positives by signaling a mismatch between years. This is especially useful in Wikipedia text, where the exact same sentence form is used to describe events that occur annually, such as the NBA (National Basketball Association) finals. Sentence pairs with matching years should be preferred over others.
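Several of the features above reduce to short token-level computations. The sketch below gives simplified versions of the sentence length ratio, word translation ratio, number ratio, and upper-cased entity detection, operating on whitespace-tokenized sentences. The translation table `p_s_given_t` is a hypothetical stand-in for the probabilities learned from the baseline parallel corpus, and the entity detector keeps sentence-initial runs rather than strictly requiring lower-cased neighbors on both sides.

```python
# Simplified feature computations for bilingual sentence-pair classification.
# `p_s_given_t` is a hypothetical dict mapping (source_word, target_word) -> P(s|t).

def sentence_length_ratio(src_tokens, tgt_tokens):
    """Ratio of token counts (no preprocessing), as in Gale and Church."""
    return len(src_tokens) / max(len(tgt_tokens), 1)

def word_translation_ratio(src_tokens, tgt_tokens, p_s_given_t, p=0.1):
    """Fraction of source words with a target-side translation above threshold p."""
    covered = sum(
        1 for s in src_tokens
        if any(p_s_given_t.get((s, t), 0.0) > p for t in tgt_tokens)
    )
    return covered / max(len(src_tokens), 1)

def number_ratio(src_tokens, tgt_tokens):
    """Fraction of multi-digit numbers in the source that also appear in the target."""
    src_nums = [w for w in src_tokens if w.isdigit() and len(w) > 1]
    if not src_nums:
        return 1.0  # vacuously matching when there are no numbers
    tgt_nums = {w for w in tgt_tokens if w.isdigit() and len(w) > 1}
    return sum(1 for n in src_nums if n in tgt_nums) / len(src_nums)

def upper_cased_entities(tokens):
    """Maximal runs of tokens that all start with an upper-case letter
    (simplification: sentence-initial runs are also kept)."""
    entities, run = [], []
    for w in tokens:
        if w[:1].isupper():
            run.append(w)
        elif run:
            entities.append(tuple(run))
            run = []
    if run:
        entities.append(tuple(run))
    return entities
```

For example, `upper_cased_entities(["the", "New", "York", "case"])` yields the single entity `("New", "York")`, which could then be checked for a (possibly partial) match on the other side of the pair.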
The bottleneck of the bitext extraction algorithm is the time required to classify a given candidate pair of sentences, because that is the core of the computational work. Since adding more features increases the computational complexity of the classifier, we propose a novel two-step classification scheme to increase overall efficiency. In this approach, we use two classifiers: a simple and fast classifier that uses fewer features, and a more complex and more effective classifier that uses more features. In the first step, the simple classifier is applied to all candidate sentence pairs, efficiently removing many of the low-scoring pairs and filtering out a large portion of the candidate set. In the second step, the complex classifier is applied only to the remaining sentence pairs, providing a more accurate classification of parallel text. More details on our actual implementation are given in Section 3.3.3.
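The two-step scheme amounts to a classifier cascade, which can be sketched in a few lines. The scoring functions below are hypothetical stand-ins for the two maximum entropy classifiers, and the thresholds are illustrative.

```python
# A sketch of the two-step cascade: a cheap classifier prunes the candidate set,
# and the expensive classifier is applied only to the survivors.
# `simple_score` and `complex_score` are hypothetical callables returning a
# confidence in [0, 1] for a candidate sentence pair.

def cascade_classify(candidates, simple_score, complex_score,
                     simple_threshold=0.5, complex_threshold=0.5):
    """Return candidate pairs accepted by both stages of the cascade."""
    # Stage 1: fast, recall-oriented filter over all candidates.
    survivors = [pair for pair in candidates if simple_score(pair) > simple_threshold]
    # Stage 2: slower, precision-oriented decision over the survivors only.
    return [pair for pair in survivors if complex_score(pair) > complex_threshold]
```

If the simple stage discards a fraction f of the candidates, the expensive stage runs on only (1 − f) of them, which is where the overall speedup comes from.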
3.3.2 Bitext Extraction Algorithm
[Diagram: source and target documents from the cross-lingual document pairs (de, df) flow through sentence detection and tf-idf computation to produce sentence vectors ({se}, {sf}), filtered by len(s) ≥ Mins and #terms(s) ≥ Minv; a Cartesian product with the length-ratio filter 1/2 < len(se)/len(sf) < 2 feeds the simple classification step (bitext S1), whose output feeds the complex classification step (bitext S2).]
Figure 3.13: An illustration of the second phase of the bitext extraction pipeline (Chapter 3). This phase takes the entire collection as input, loads the cross-lingual pairs from Phase 1 (Section 3.2), and outputs a parallel corpus.
In Section 3.2, we described phase 1 of our end-to-end pipeline (see Figure 3.1), which produces a number of docid pairs, corresponding to cross-language document pairs found to be similar. For each pair of docids output by the first phase, the next step involves generating the Cartesian product of sentences in both documents as candidate sentence pairs. Processing so many candidate pairs poses many computational challenges; moreover, even retrieving the document text corresponding to a given docid pair is a non-trivial problem, since the data resides on a distributed file system. Although it may be possible to load both document collections in memory for smaller collections, we envision scaling up to collections for which this is not possible. To address these challenges, we devised a scalable, distributed, out-of-memory solution using MapReduce.
Mapper
1: method Initialize
2:   pwsimMap ← Load(phase1.out)
3:   sentdetector ← Load(sentence.model)
4: method Map(docid n, document d)
5:   similardocids ← pwsimMap.Get(n)
6:   if similardocids.IsEmpty() then return
7:   sentences ← sentdetector.DetectSentences(d, Mins)
8:   vectors ← sentences.ComputeBM25(Minv)
9:   if vectors.IsEmpty() then return
10:  for all docid n′ ∈ similardocids do
11:    if d.IsEnglish() then   ▷ a language id exists in our document format
12:      Emit(⟨n′, n⟩, ⟨E, sentences, vectors⟩)
13:    else
14:      Emit(⟨n, n′⟩, ⟨F, sentences, vectors⟩)

Reducer
15: method Initialize
16:   classifier_simple ← Load(maxent.model)
17: method Reduce(⟨nF, nE⟩, [⟨F, sentencesF, vectorsF⟩, ⟨E, sentencesE, vectorsE⟩])
18:   for i = 1 to sentencesF.Size() do
19:     sentenceF = sentencesF.Get(i)
20:     for j = 1 to sentencesE.Size() do
21:       sentenceE = sentencesE.Get(j)
22:       sentlenratio = sentenceE.Length() / sentenceF.Length()
23:       if sentlenratio > 2 || sentlenratio < 1/2 then continue
24:       instance ← ComputeFeatures(vE, vF, sentlenratio)
25:       label ← classifier_simple.Classify(instance)
26:       if label = 'parallel' then
27:         Emit(sentenceF, sentenceE)
Figure 3.14: Pseudo-code for the first stage of the bitext extraction algorithm: mappers implement the candidate generation step and reducers implement the simple classification step.
The algorithm consists of two MapReduce programs, corresponding to the classification scheme we described earlier. The first stage is implemented as a single map-and-reduce sequence, in which the mappers generate candidate sentence pairs that may contain partial translations of each other (candidate generation step), and the reducers classify the most promising pairs based on the simple classifier's confidence level (simple classification step). The second stage simply applies the complex classifier to the output pairs of the first stage (complex classification step).
An illustration of the two-stage bitext extraction pipeline is presented in Figure 3.13.
The pseudocode for the first and second stages of the bitext extraction algorithm is shown in Figures 3.14 and 3.15.
Input to the mappers consists of (docid n, document d) pairs from both collections. In each mapper, all docid pairs ⟨nF, nE⟩ (corresponding to a similar cross-lingual document pair ⟨dF, dE⟩) are loaded into a hash map (line 2). If the input docid n is not found in any of these pairs, no work is performed (lines 4–6). Otherwise, we extract all sentences of d and retain only those that have at least Mins terms (line 7). Sentences are converted into BM25-weighted vectors in the English vocabulary space; for source-language sentences, translation into English is accomplished exactly as in phase 1, using the CLIR technique proposed by Darwish and Oard [31]. Any sentence whose weighted vector contains fewer than Minv terms is discarded at this point (line 8). For every ⟨nF, nE⟩ pair that the input docid n is found in, the mapper emits the list of weighted sentence vectors and original sentences, with ⟨nF, nE⟩ as the key. Since the input contains documents from both languages, we also emit a language identifier (lines 8–14).
As all intermediate key-value pairs in MapReduce are grouped by their keys for reduce-side processing, the reducer receives the key (nF, nE), associated with two tuples, corresponding to the sentences and BM25-weighted sentence vectors of dF and dE. From there, we generate the Cartesian product of sentences in both languages. As an initial filtering step, we discard all pairs where the ratio of sentence lengths is more than two, a heuristic proposed in [101]. Each of the remaining candidate sentence pairs is then processed by the simple and fast classifier.
This algorithm is a variant of what is commonly known as a reduce-side join in MapReduce, where (nF, nE) serves as the join key. Note that in this algorithm, the same sentences and vectors are emitted multiple times, once for each document pair (i.e., (nF, nE)) in which they appear: this results in increased network traffic during the sort-and-shuffle phase. We experimented with an alternative algorithm that processes all target documents similar to the same source document together, e.g., processing (nF, [nE1, nE2, ...]) together. In this case, all sentences in dF are passed through the network only once, thereby reducing the overall network traffic. However, counter-intuitively, this turned out to be much slower. The explanation is skew in the distribution of similar document pairs. In our experiments, half of the source collection was not linked to any target document, whereas 4% had more than 100 links each. This results in reduce-side load imbalance: while most of the reducers finish quickly, a few reducers end up performing substantially more computation, and these "stragglers" increase the end-to-end running time.
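The reduce-side join pattern can be illustrated with a small in-memory simulation. The helpers below are simplified stand-ins for the MapReduce stages in Figure 3.14: plain strings replace BM25-weighted vectors, and only the sentence length-ratio filter is applied before the (here omitted) simple classification step.

```python
# In-memory sketch of the reduce-side join from the first stage: each
# document's sentences are emitted once per similar pair it participates in,
# grouped by the (n_f, n_e) join key; the "reducer" then forms the Cartesian
# product of sentences and applies the length-ratio filter.
from collections import defaultdict

def map_stage(docs, similar_pairs):
    """docs: docid -> (lang, sentences); similar_pairs: set of (n_f, n_e) keys.
    Emits each document's sentences under every join key it belongs to."""
    emitted = defaultdict(list)
    for (n_f, n_e) in similar_pairs:
        for docid in (n_f, n_e):
            lang, sentences = docs[docid]
            emitted[(n_f, n_e)].append((lang, sentences))
    return emitted

def reduce_stage(emitted, max_ratio=2.0):
    """Cartesian product per join key, keeping pairs within the length-ratio bound."""
    candidates = []
    for key, values in emitted.items():
        by_lang = dict(values)  # {'F': [...], 'E': [...]}
        for sf in by_lang["F"]:
            for se in by_lang["E"]:
                ratio = len(se.split()) / max(len(sf.split()), 1)
                if 1.0 / max_ratio <= ratio <= max_ratio:
                    candidates.append((sf, se))
    return candidates
```

In the real distributed setting the grouping between the two stages is performed by MapReduce's sort-and-shuffle phase, which is exactly where the duplicated emissions show up as extra network traffic.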
Mapper
1: method Map(sentenceF, sentenceE)
2:   vF ← sentenceF.ComputeBM25(Minv)
3:   vE ← sentenceE.ComputeBM25(Minv)
4:   sentlenratio = sentenceE.Length() / sentenceF.Length()
5:   instance ← ComputeFeatures(vE, vF, sentlenratio)
6:   label ← classifier_complex.Classify(instance)
7:   if label = 'parallel' then
8:     Emit(sentenceF, sentenceE)
Figure 3.15: Pseudo-code for the second stage of the bitext extraction algorithm, in which the complex classifier is applied to all sentence pairs output by the first stage.
The output of the described MapReduce program is piped into the second stage, where the complex, slower, yet more accurate classifier is applied (Figure 3.15). Out of the hundreds of millions of sentence pairs labeled as 'parallel' by the simpler classifier, only millions are selected for a second, fine-grained round of classification. By applying the slower classifier to a smaller number of instances, we are able to save a significant amount of time. This filtering stage trivially parallelizes the work among mappers and only emits sentence pairs that score above the confidence threshold.
3.3.3 Experimental Evaluation
In order to illustrate the robustness of our approach, we explored parallel text classification for six language pairs: German-English, Spanish-English, Chinese-English, Arabic-English, Czech-English, and Turkish-English. Each language requires separate preprocessing, for which we have adapted off-the-shelf preprocessing and tokenization toolkits.9 English and German text was tokenized using the OpenNLP toolkit,10 whereas Turkish, Czech, Spanish, and Arabic were preprocessed using Lucene's tokenization modules.11 Chinese was segmented using the Stanford segmenter [138]. For English, German, Spanish, and Turkish, we also lowercased text, stemmed using the Snowball stemmer, and removed stop-words. For Czech, we lowercased text and performed light stemming using code from the University of Neuchâtel,12 and we applied Lucene's light stemmer to Arabic text.
9All code was implemented as part of the Ivory project: ivory.cc
10http://opennlp.sourceforge.net
11http://lucene.apache.org
12http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt
Language Pair | Data source | Size (sentence pairs)
de-en | WMT-12 | 2,079,049
es-en | WMT-12 | 2,123,036
zh-en | BOLT | 4,962,955
ar-en | GALE | 3,368,632
cs-en | WMT-12 | 782,756
tr-en | Yeniterzi and Oflazer [150] | 53,113

Table 3.4: Details of the baseline parallel corpus available for each language pair included in our experimental evaluation.
Table 3.4 summarizes the initial parallel corpus available for each language pair. For the European languages (German-English, Spanish-English, and Czech-English), we used the training resources provided as part of the 2012 NAACL Workshop on Statistical Machine Translation (WMT-12), consisting of the Europarl corpus (version 7) and the News Commentary corpus.13 For Chinese-English, we used parallel text gathered for the DARPA BOLT evaluation. For Arabic-English, we used the dataset from the DARPA GALE evaluation [108], which consists of NIST and LDC releases. Finally, we obtained the Turkish-English bitext from Yeniterzi and Oflazer [150], which is the best MT corpus we are aware of for this language pair.
These corpora were used for training word translation probabilities and constructing vocabularies, which are required for translating document vectors (Equations 3.2 and 3.3) in phase 1, and for computing classifier features in phase 2 of our pipeline. In this section, we first demonstrate the accuracy of our bitext classification approach, followed by an extrinsic evaluation, in which we compare different classification approaches on the task of MT.

13http://www.statmt.org/wmt12/translation-task.html
Intrinsic Classifier Evaluation For each language pair, we trained two classifiers based on the maximum entropy principle (using the OpenNLP MaxEnt package14), which has become a popular approach for natural language processing applications due to its empirical success and its ability to represent class predictions probabilistically [11]. We used the baseline parallel corpus of each language pair (Table 3.4) as training data: K parallel (i.e., aligned) sentences were sampled from the parallel corpus as positive training instances. For negative examples, any non-aligned sentence pair from the bitext can be considered. However, we did not use all of the K(K − 1) negative instances from the sample, in order to prevent the classifier from being affected by the imbalance between positive and negative instances. Following Munteanu and Marcu's suggestion [101], we randomly sampled 5K non-parallel sentence pairs from the training data, resulting in a fixed 1-to-5 positive-to-negative (i.e., parallel-to-non-parallel) instance ratio.
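The construction of such a training set can be sketched as follows. Here `bitext` is a hypothetical list of aligned (source, target) pairs; negatives are drawn by pairing the source side of one sampled row with the target side of a different row, preserving the 1-to-5 ratio described above.

```python
# Sketch of training-set construction: K aligned pairs as positives plus
# 5K randomly drawn non-aligned pairs as negatives (label 1 = parallel).
import random

def build_training_set(bitext, k, neg_per_pos=5, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(bitext, k)
    positives = [(s, t, 1) for (s, t) in sample]
    negatives = []
    while len(negatives) < neg_per_pos * k:
        # Two distinct rows -> a non-aligned (source, target) combination.
        (s, _), (_, t) = rng.sample(sample, 2)
        negatives.append((s, t, 0))
    return positives + negatives
```

Sampling negatives rather than enumerating all K(K − 1) non-aligned combinations is what keeps the class imbalance at the fixed 1-to-5 ratio.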
For feature selection, we performed a preliminary analysis on a portion of the German-English Europarl corpus. This analysis revealed that about 40% of a sample of 1000 parallel sentence pairs have a cosine similarity of 0.3 or more, whereas this value is below 0.2 for almost all of the 1 million non-parallel sentence pairs. This indicates that cosine similarity is a good discriminator between positive and negative instances in the training data. Based on this reasoning, we decided to use cosine similarity as the single feature for the simple classifier. We do not predetermine a cosine similarity threshold; instead, we try different thresholds on the test set to determine a sweet spot between precision and recall.

14http://maxent.sourceforge.net/about.html
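The threshold sweep amounts to a standard recall-at-precision computation over scored test instances, which can be sketched as follows; `scored` is a hypothetical list of (classifier confidence, gold label) pairs.

```python
# Sketch of the threshold sweep: find the best recall achievable while meeting
# a target precision (e.g., R@P95), by scanning decision thresholds from the
# highest-confidence instance downward.

def recall_at_precision(scored, target_precision):
    """Best recall at or above the target precision over all thresholds."""
    scored = sorted(scored, key=lambda x: -x[0])  # highest confidence first
    total_pos = sum(1 for _, y in scored if y)
    tp = fp = 0
    best_recall = 0.0
    for score, y in scored:
        if y:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        if precision >= target_precision:
            best_recall = max(best_recall, tp / total_pos)
    return best_recall
```

This is exactly the quantity reported as R@P95 and R@P80 in the tables that follow, computed here for a single classifier's scores.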
Although cosine similarity can discriminate between many positive and negative instances, being able to correctly identify the borderline cases is essential for high effectiveness. Therefore, our complex classifier contains additional features from the list above, namely sentence length ratio and word translation ratio in both directions. We manually selected this feature set based on accuracy: including word alignment ratio, upper-cased entity ratio, and number ratio did not show any improvements on a held-out set of instances.
For the test set, we sampled a set of 1000 parallel pairs from the same initial parallel corpus, for each language pair.15 From these samples, we generated all 999,000 possible non-parallel pairs by a Cartesian product. The unbalanced nature of the test set provides a better estimate of the task we are interested in, since most of the candidate sentence pairs in a comparable corpus will be non-parallel.
Experimental results for all six language pairs are shown in Table 3.5. We report precision, recall, and the F1 score under all test conditions, using different classifier confidence scores as the decision threshold. Recall is shown at two precision levels: high precision (i.e., 95%) and relatively lower precision (i.e., 80%). For the size parameter K, we tried different values empirically and concluded that increasing beyond K = 1000 does not improve performance on a held-out development set. Therefore, all results were obtained using training data consisting of 1000 parallel sentence pairs and 5000 randomly sampled non-parallel sentence pairs.

15The sets of sentence pairs used for training and testing are disjoint.
In Table 3.5, we observe that the language pairs that perform better in the classification task (German-English, Spanish-English, Arabic-English, and Czech-English) have more training resources, with the only exception being Czech-English. We explain the better performance on Czech-English by linguistic similarity: more similar languages should require fewer examples to learn how to accurately classify unseen instances. The two language pairs with lower accuracy are Turkish-English and Chinese-English, with F-scores in the low 80s, as opposed to the high 80s and low 90s. Recall is especially low at very high precision (R@P95) for these two cases, which is expected due to the small amount of seed training data and major linguistic differences. The difference between simple and complex classification is also most apparent at high precision, which is consistent with the motivation that the complex classifier is for high-precision decisions, as opposed to the recall-oriented simple classifier. Our German-English and Spanish-English precision-recall values are similar to the results of Smith et al.'s CRF approach [129], but we are not aware of any work with a comparable evaluation for the other language pairs.
Due to the large amounts of data involved in our experiments, we were interested in the speed-accuracy tradeoffs between the two classifiers. Micro-benchmarks were performed on a commodity laptop running Mac OS X on a 2.26GHz Intel Core Duo CPU, measuring per-instance classification time, which we define as the time excluding preprocessing of sentences into BM25-weighted vectors, and including feature computation and classification. Based on our experiments on 1 million instances, the complex classifier took 105 µs per instance (96 µs for feature computation, 9 µs for classification), about 4 times slower than the simple one, which took 27 µs (18 µs for feature computation, 9 µs for classification).

Language Pair | Simple: R@P95 R@P80 F1 | Complex: R@P95 R@P80 F1
de-en | 59% 95% 88% | 77% 97% 91%
es-en | 80% 93% 89% | 91% 97% 93%
zh-en | 37% 82% 81% | 40% 87% 84%
ar-en | 73% 94% 88% | 87% 97% 92%
cs-en | 87% 93% 91% | 94% 87% 92%
tr-en | 45% 80% 81% | 58% 79% 81%
Speed (in µs, per instance) | 27 | 96

Table 3.5: Comparison of the two bitext classifiers, tested on sentence pairs held out from the training data. R@P95 and R@P80 refer to recall percentage values, reported at 95% and 80% precision, respectively.
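A per-instance micro-benchmark of this kind can be sketched in a few lines; `featurize` and `classify` below are hypothetical stand-ins for the real feature computation and MaxEnt classification steps, and wall-clock timing on a laptop will of course vary.

```python
# Sketch of the micro-benchmark methodology: time feature computation plus
# classification over many instances and report the mean cost in microseconds.
import time

def per_instance_micros(instances, featurize, classify):
    """Mean per-instance time (µs) for featurize + classify, excluding any
    preprocessing done before `instances` was built."""
    start = time.perf_counter()
    for inst in instances:
        classify(featurize(inst))
    elapsed = time.perf_counter() - start
    return 1e6 * elapsed / len(instances)
```

Measuring feature computation and classification together, but excluding vector preprocessing, matches the definition of per-instance classification time used above.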
In order to assess the influence of genre/domain on accuracy, we performed additional experiments for German-English only. For this evaluation, we used the set of labeled German-English Wikipedia sentence pairs created by Smith et al. [129]. This is especially useful since we are also interested in extracting bilingual text from Wikipedia. In the test set, human annotators identified 312 parallel sentence pairs in 20 Wikipedia article pairs. Similarly, negative instances were artificially generated by considering all non-aligned sentence pairs in the same document, resulting in a test set of 97,032 sentence pairs. The authors also published a set of 225 sentence pairs for training, collected from Wikipedia titles and Wiktionary16 entries.
We tested all four possible conditions: training the classifier on in-domain and out-of-domain data, and testing it on in-domain and out-of-domain data, for each case. All results are reported in Table 3.6, from which we make several observations.

16Wiktionary is a dictionary tool similar to Wikipedia: www.wiktionary.org

Training domain | Testing domain | Simple: R@P95 R@P80 F1 | Complex: R@P95 R@P80 F1
Europarl | Europarl | 59% 95% 88% | 77% 97% 91%
Wikipedia | Europarl | 54% 97% 90% | 59% 95% 89%
Europarl | Wikipedia | 7% 74% 79% | 12% 81% 81%
Wikipedia | Wikipedia | 2% 57% 75% | 3% 61% 75%

Table 3.6: Comparison of the German-English bitext classifiers trained and tested on in-domain (Europarl) or out-of-domain (Wikipedia) sentence pairs. R@P95 and R@P80 refer to recall percentage values, reported at 95% and 80% precision, respectively.
First of all, results are relatively lower on the Wikipedia test set (10-15% absolute difference in F-score values and very low recall at 95% precision), which can be attributed to more informal language and noisy alignments. Another factor is out-of-vocabulary terms, which affect the Wikipedia test set more substantially because the bilingual translation probabilities are learned from Europarl. Another interesting observation is that training the complex classifier on Wikipedia data does not bring as much improvement (over the simple one) as it does when trained on Europarl. One explanation for this behavior might be over-fitting: since the Wikipedia training set (K = 225) is much smaller than Europarl (K = 1000), it is more vulnerable to over-fitting, causing lower test set accuracy. Finally, we see that training on Europarl yields better performance (especially at high precision) in both test cases, despite the domain difference. This points to a quality difference between the Europarl and Wikipedia training examples.
Extrinsic Classifier Evaluation

In Sections 3.2 and 3.3, we described the two phases of our end-to-end parallel text extraction pipeline, as illustrated in Figure 3.1.
Before describing our extrinsic evaluation, here is a summary of how we ran the end-to-end MT pipeline, referring back to the sections in which each part was presented.
For each of the six language pairs, we started from the two raw Wikipedia collections.17 Wikipedia articles were first preprocessed into document vectors, using
Okapi BM25 term weights (Section 3.2.3.1). Source-language18 Wikipedia articles were then projected into the English vocabulary space using CLIR techniques (Section 3.2.2). All vectors were converted into LSH signatures using the random projections method (Section 3.2.3.2). The sliding window algorithm accepts signatures of both collections as input, and returns pairs of document ids: each corresponds to a pair of Wikipedia articles found to be similar, according to the cosine similarity estimate of their signatures (Section 3.2.3.3).
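The signature and similarity-estimation step can be sketched as follows (a toy, single-machine illustration in plain Python, not the actual MapReduce implementation; the function names, vector dimensionality, and toy vectors are ours):

```python
import math
import random

def lsh_signature(vec, planes):
    # One bit per random hyperplane: 1 if the vector lies on its positive side.
    return [sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes]

def cosine_estimate(sig_a, sig_b):
    # The fraction of disagreeing bits estimates the angle between the vectors:
    # cos(pi * hamming_fraction) approximates their true cosine similarity.
    hamming = sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)
    return math.cos(math.pi * hamming)

random.seed(0)
dim, bits = 100, 1000                      # toy vocabulary; 1000-bit signatures
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

v1 = [random.gauss(0, 1) for _ in range(dim)]
v2 = [x + 0.2 * random.gauss(0, 1) for x in v1]   # a near-duplicate vector

sig1, sig2 = lsh_signature(v1, planes), lsh_signature(v2, planes)
dot = sum(a * b for a, b in zip(v1, v2))
true_cos = dot / (math.hypot(*v1) * math.hypot(*v2))
print(round(true_cos, 2), round(cosine_estimate(sig1, sig2), 2))
```

With 1000 bits the estimate tracks the true cosine closely, which is why signatures of similar articles can be compared without touching the original vectors.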
The parallel extraction algorithm is run on the raw Wikipedia articles, but needs to load the output of phase 1 into memory before processing any article. As shown in Figure 3.13, each input article is first split into sentences by applying a trained sentence detector model,19 using the OpenNLP toolkit.20 Any sentence containing fewer than five tokens or fewer than three unique terms is discarded at this point (i.e., Mins = 5, Minv = 3 in Figure 3.14). For each similar article pair, every possible pair of the remaining sentences is considered for classification, except pairs in which the sentence length ratio is above 2 or below 0.5. Finally, the
17 Wikipedia articles in XML format are freely available from http://wikipedia.c3sl.ufpr.br
18 The target language is always English in our experiments.
19 The baseline parallel corpus was used to train the sentence detector models.
20 http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Sentence_Detector
remaining sentence pairs are processed through the two-step classification to obtain a parallel corpus.
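The filtering heuristics above can be sketched as two small predicates (the helper names are ours; the thresholds are the ones stated in the text):

```python
def keep_sentence(tokens, min_tokens=5, min_vocab=3):
    # Discard sentences with fewer than five tokens or fewer than three
    # unique terms (Mins = 5, Minv = 3 in Figure 3.14).
    return len(tokens) >= min_tokens and len(set(tokens)) >= min_vocab

def candidate_pair(src_tokens, trg_tokens, max_ratio=2.0):
    # Skip pairs whose sentence-length ratio is above 2 or below 0.5.
    ratio = len(src_tokens) / len(trg_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio

print(keep_sentence("too short".split()))       # False: only 2 tokens
print(candidate_pair(["w"] * 10, ["w"] * 12))   # True: ratio ~0.83
print(candidate_pair(["w"] * 10, ["w"] * 30))   # False: ratio ~0.33
```

These cheap checks prune the quadratic space of sentence pairs before the expensive classifiers run.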
As the first step of our classification approach, the simple classifier is applied to the candidate pairs, each time with a different confidence threshold. We adjusted the threshold to obtain different amounts of bilingual text, and observed its effect on translation quality (this experimental condition is called S1 hereafter). Optionally, the complex classifier is then applied to the output of the simple classifier for additional filtering (this two-step approach is called S2 hereafter).
We compared the two experimental conditions (one-step classification, S1, and two-step classification, S2) by evaluating the output on German-English MT. For this evaluation only, we used data from a slightly older WMT-10 task,21 which uses a previous version of the Europarl corpus (version 5, with 3.1 million sentence pairs) for training, a set of 2,525 sentences for tuning and 2,489 sentences for testing.22 In all of our MT experiments, we used the state-of-the-art hierarchical phrase-based translation system (Hiero) as the translation model, which is based on a synchronous context-free grammar (SCFG) [24]. GIZA++ [106] was used to learn word alignments, and a 5-gram English language model was built using SRILM [133]. We used the C-based decoder cdec [36] for decoding, and the system parameters were tuned using MIRA [38]. All systems were evaluated by BLEU [110].
Here is an example run: as a result of applying the simple classifier with a threshold of 0.98,23 13.8 million German-English sentence pairs were extracted from
21 http://www.statmt.org/wmt10/translation-task.html
22 The remaining MT experiments in Section 3.4 use the newer WMT-12 corpus, as listed in Table 3.4.
23 There is no meaningful interpretation for the actual value of the confidence threshold.
phase 1 output. By running the second classification step, using a threshold of 0.60 for the complex classifier, the size of this bitext was reduced to 5.8 million. This corresponds to condition S2. Alternatively, we could raise the initial threshold to
0.988 and not apply the second classifier, which would result in a final output of 6.1 million sentence pairs. This corresponds to condition S1. For reference, Figure 3.13 illustrates the pipeline for running phase 2 experiments.
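The S1/S2 cascade can be sketched as follows (the pairs and their classifier scores are toy stand-ins, and `extract_bitext` is a hypothetical name, not our actual implementation):

```python
def extract_bitext(pairs, simple_score, complex_score, t_simple, t_complex=None):
    """Two-step cascade: keep pairs the simple classifier accepts (S1), then
    optionally re-filter the survivors with the complex classifier (S2)."""
    kept = [p for p in pairs if simple_score(p) >= t_simple]
    if t_complex is not None:            # S2 adds the second filter
        kept = [p for p in kept if complex_score(p) >= t_complex]
    return kept

# Toy pairs represented only by their (simple, complex) confidence scores.
pairs = [(0.990, 0.70), (0.985, 0.40), (0.950, 0.90)]
score1, score2 = (lambda p: p[0]), (lambda p: p[1])
s1 = extract_bitext(pairs, score1, score2, t_simple=0.98)
s2 = extract_bitext(pairs, score1, score2, t_simple=0.98, t_complex=0.60)
print(len(s1), len(s2))  # 2 1
```

Raising `t_simple` alone shrinks the S1 output, while adding `t_complex` trades a cheaper, noisier filter for a more selective one.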
[Figure: BLEU improvement (y-axis, roughly 1.5 to 2.4) plotted against training data size in millions (x-axis, 3.5 to 8.5), with three curves: S1 (1-step classification), S2 (2-step classification), and sampling data from training set P.]

Figure 3.16: A comparison of bitext classification approaches on WMT-10 German-English test set.
Under each condition, we varied the confidence thresholds of both classifiers to generate new bitext collections of varying sizes. In each case, the final sentence pairs were added to the baseline training data, to train an entirely new translation
model.24 Therefore, each data point shown in Figure 3.16 represents the evaluation result of a particular choice of classifier thresholds, where the x-value indicates the total training data size.
Our conclusion is that the S2 condition increases translation quality by reducing noise: for each data size that we experimented with, the BLEU score exceeded that of S1. We also observe that BLEU increases with more data, but most of the improvements can be attributed to the small set of initially added sentence pairs.
The best-performing dataset (which we denote as P) brings only an additional 1-point BLEU improvement over the initially added 500 thousand sentence pairs.
Another interesting observation is the oscillating trend of BLEU improvements under the S1 condition. The 2-point BLEU improvement from the first 3 million sentence pairs decreases by about half a point after the second batch of three million, and then returns to a full 2-point improvement after adding the final 2 million sentence pairs. While the less stable results might be attributed to a less accurate classification approach (1-step vs. 2-step), this might also be an artifact of noisy parameter optimization. Clark et al. [29] have shown that repeating the parameter tuning procedure a few times allows higher confidence when comparing MT systems.
In order to better examine the effect of data size alone, we created partial datasets from P by randomly sampling an increasing proportion of the entire set, and repeating experiments for each of these samples. Results are also shown in
Figure 3.16. As expected, we see a generally increasing trend of BLEU scores with
24GIZA++ additionally filters out some of the pairs based on length.
respect to data size. By comparing the three plots, we see that S2 or random sampling from P works better than S1. Also, random sampling is not always worse than S2, since some pairs that receive low classifier confidence turn out to be helpful in terms of increasing BLEU score (e.g., by fixing out-of-vocabulary issues).
3.4 Comprehensive MT Experiments
Intrinsic and extrinsic evaluations of each phase were described in subsections
3.2.5 and 3.3.3. In this section, we present an extensive evaluation of the end-to-end pipeline, assessing the quality of the output (i.e., a parallel corpus) on the task of
MT, for a diverse set of language pairs. We first present details on the baseline MT system, and then discuss how we included the parallel text for training an improved
MT model.
A baseline MT system was trained for each language pair, on the corresponding parallel corpus in Table 3.4. For tuning parameters and testing the final output, we used the newswire datasets provided for WMT-12 for language pairs German-English, Spanish-English, and Czech-English. Chinese-English and Arabic-English systems were tuned and tested using NIST MT datasets, which contain four references for each sentence. Finally, for the Turkish-English system, we sampled about
1000 sentences from the original parallel corpus as tune and test sets, and left the rest for training purposes. Table 3.7 lists the data used for tuning MT parameters and testing our approach, as well as the BLEU score our baseline MT approach received.
Language   Tune set                       Test set                       BLEU
pair       Source               Size      Source               Size      (test)
de-en      WMT (newstest-11)    3003      WMT (newstest-12)    3003      24.50
es-en      WMT (newstest-11)    3003      WMT (newstest-12)    3003      33.44
zh-en      NIST (MT-06)         1664      NIST (MT-08)         1357      27.52
ar-en      NIST (MT-06)         1797      NIST (MT-08)          813      63.15
cs-en      WMT (newstest-11)    3003      WMT (newstest-12)    3003      23.11
tr-en      held out from [150]  1071      held out from [150]  1095      27.22

Table 3.7: Datasets used for tuning parameters and evaluating translation quality, listed separately for each language pair. The last column shows the BLEU score achieved by the baseline MT system on the testing dataset.
For all runs, we included all standard MT features, including phrase and lexical translation probabilities in both directions, word and arity penalties, and language model scores. We compared the BLEU score of our baseline systems to comparable results reported in the literature.
The Chinese-English baseline MT system achieved 27.52 BLEU on the NIST
MT-08 test set. Based on results reported directly by NIST,25 this score would have ranked 6th out of the 20 participants in the official evaluation. For Arabic-English, the baseline BLEU score of 63.15 is much higher than reported results.26 We are aware of only one paper that uses the same Turkish-English training data, and the authors report BLEU values between 17.08 and 21.96. These numbers are not directly comparable to our 27.22 baseline BLEU since we randomly sampled 1095 sentence pairs for testing [150]. For German-English, Spanish-English, and Czech-
25 http://www.itl.nist.gov/iad/mig//tests/mt/2008/doc/mt08_official_results_v0.html
26 http://www.itl.nist.gov/iad/mig//tests/mt/2008/doc/mt08_official_results_v0.html
English, our baseline results are comparable to reported results from the WMT-12 evaluation. For example, the baseline BLEU score of 24.55 for German-English would have shared first place with two of the sixteen participants, with a range of scores between 11 and 24 [18]. Our Spanish-English baseline system achieved 33.39
BLEU, which would place it second among the twelve participants of WMT-12, where reported scores range from 22 to 38 [18]. Finally, the Czech-English baseline
MT system would have ranked first among the six WMT-12 participants, based on its BLEU score of 23.11 [18].
                 English   German   Spanish   Chinese   Arabic   Czech    Turkish
Dump date(a)     (12/1)    (12/15)  (11/30)   (12/10)   (12/18)  (12/15)  (12/17)
Pages            10.16     3.00     2.61      2.07      0.53     0.50     0.59
Articles(b)       4.04     1.57     1.19      0.60      0.32     0.30     0.31
Signatures(c)     4.01     1.42     0.99      0.59      0.25     0.26     0.23
Similar pairs     -        35.9     51.5      14.8      5.4      9.1      17.1

(a) All dump dates are from year 2012. (b) After discarding stub, disambiguation and redirect pages. (c) After discarding short articles.

Table 3.8: Phase 1 statistics for all Wikipedia collections (in millions). Raw Wikipedia XML dumps were downloaded from http://dumps.wikimedia.org.
3.4.1 Results
Tables 3.8 and 3.9 display experimental results from the first and second phases of our approach, for all six language pairs that we evaluated. In these experiments, we set the simple classifier threshold based on its accuracy on held-out examples
Source language   German    Spanish    Chinese     Arabic    Czech      Turkish
Sentences         42.3m     19.9m      5.5m        2.6m      5.1m       3.5m
Language pair     de-en     es-en      zh-en       ar-en     cs-en      tr-en
Candidate pairs   530b      356b       62b         48b       101b       142b
Processed(a)      322b      213b       36b         23b       62b        85b
Parallel (S1)     292m      178m       63m         7m        203m       69m
Parallel (S2)     0.2-3.3m  0.96-3.3m  0.05-0.29m  130-320k  0.5-1.63m  8-250k

(a) After applying filtering heuristics.

Table 3.9: Phase 2 statistics for all language pairs, using the respective Wikipedia collections specified in Table 3.8. The number of sentences in the target language (English Wikipedia) is omitted from the table; it differs slightly for each language pair (due to vocabulary differences from the baseline training corpus), ranging from 75 to 91 million.
(as presented in Table 3.5): for each language pair, the threshold was set to the value at which 90% recall was achieved (since the goal of simple classification is high recall). For example, for Arabic-English, the classifier achieves 90% recall and 86% precision when the classifier threshold is set to 0.84. The row in Table 3.9 labeled
“Parallel (S1)” indicates the number of sentence pairs classified as “parallel” when the threshold was decided as above.
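The threshold-selection rule can be sketched as follows (a toy illustration with hypothetical held-out scores; the function name is ours):

```python
import math

def threshold_for_recall(scores, labels, target_recall=0.90):
    # Pick the highest confidence threshold that still recovers target_recall
    # of the positive held-out examples when accepting scores >= threshold.
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    k = math.ceil(target_recall * len(positives))
    return positives[k - 1]

# Toy held-out set: ten positive examples with these classifier confidences.
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]
labels = [1] * 10
print(threshold_for_recall(scores, labels))  # 0.55: nine of ten positives pass
```

Precision at the chosen threshold is then read off the same held-out data, as in the Arabic-English example above.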
For condition S2, the complex classifier was applied to the S1 output with a range of threshold values, in order to see the relationship between data size and
BLEU score. The last row (labeled “Parallel (S2)”) shows the range of sentence pairs extracted with this thresholding strategy. For each threshold setting, we add the parallel text output by S2 to the baseline corpus (see Table 3.4), and run it through the MT pipeline, to get an updated translation model. The parameters of this translation model are tuned and tested on corresponding datasets in Table 3.7.
For each such MT experiment, the BLEU score (with respect to human translation references) is computed and compared to the baseline BLEU scores in Table 3.7.
Figures 3.17 and 3.18 illustrate the improvement in BLEU score when varying amounts of extracted parallel text are added. For clarity, we included the four language pairs with the largest amount of resources (i.e., German-English, Spanish-English, Chinese-English, and Arabic-English) in the former figure, and the remaining two (i.e., Czech-English and Turkish-English) in the latter.
Several conclusions can be drawn from these results. First, we notice that not all extracted data bring BLEU improvements: some data points show a lower BLEU than the baseline for Czech-English and Turkish-English, although we see improvements in at least one experiment for both language pairs. These two language pairs have the least parallel text available in the baseline corpus (see Table 3.4), indicating a correlation between the amount of initially available resources and the improvements gained from bitext extraction. This is expected, since the core of the parallel text extraction algorithms is the translation methods, which are highly dependent on the quality of the underlying vocabulary and word translation probabilities. With less data to generate these resources, our approach is less robust to the noise in Wikipedia, resulting in lower quality output. In Chapter 5, we present ideas to overcome this weakness, especially for low-resource language pairs.
On the other hand, we see the largest improvements for Arabic-English, suggesting that high-resource but linguistically distant language pairs are good candidates for improvements from parallel text extraction approaches.
[Figure: BLEU improvement (y-axis, 0 to 3.5) plotted against extracted data size (x-axis, 0 to 3.5 million) for de-en, es-en, ar-en, and zh-en.]

Figure 3.17: Evaluation results on translating high-resource language pairs into English: BLEU scores are reported relative to the baseline MT scores.
[Figure: BLEU improvement (y-axis, -1.5 to 0.5) plotted against extracted data size (x-axis, 0 to 750,000) for cs-en and tr-en.]

Figure 3.18: Evaluation results on translating low-resource language pairs into English: BLEU scores are reported relative to the baseline MT scores.
For Chinese-English, improvements are more modest, possibly due to the challenges associated with segmentation. Languages such as German and Spanish have a relatively higher baseline performance, making it more difficult to improve by adding more training data.
Another interesting observation is that there is a limit on the number of useful sentence pairs we can extract from Wikipedia. It is not possible to simply conclude that “more data is better” from any of our experimental results, since we see a decrease in BLEU score beyond a certain amount of data (of course, the amount varies heavily across different languages). We can generalize this observation to any comparable corpus: the parallel text our approach can extract is limited to the number of truly parallel sentence pairs within the candidate set. As a result, we need to be careful in applying these algorithms to other collections. A similar analysis should be useful to determine the upper bound of any given comparable corpus.
3.4.2 Efficiency
We demonstrate the efficiency of our approach by reporting running times for the German-English experiments, which involve the two largest Wikipedia collections at the time of this evaluation. Experimental runs were repeated three times to compute a 95% confidence interval for the running times of the various components. Average candidate generation time was 2.4 hours (± 0.42 hours). These candidates went through the MapReduce shuffle-and-sort process in 1.25 hours (± 0.4 hours), and were then classified in another 4.13 hours (± 0.17 hours). Processing time by the more
complex classifier in S2 depends on the threshold of S1, because it determines the number of instances to be classified in S2. Even when using the threshold that produced the largest bitext in German-English experiments, complex classification took only 0.52 hours. When we compare these numbers to the other collections, we notice that running time is linear in the number of sentence pairs within the similar document pairs. Since the English collection is the same in all of our experiments, running time depends on the number of sentences in the non-English collection. As an example, Chinese Wikipedia has about an eighth of the number of sentences in
German Wikipedia, and the entire bitext extraction pipeline took about 1.54 hours for Chinese-English.
3.4.3 Error Analysis
For an error analysis, we manually inspected some of the parallel sentence pairs output by the pipeline. In Table 3.10, we show seven sentence pairs as a representative sample of some of the strengths and weaknesses of our approach.
In the first two examples, the sentence pairs can be rated as excellent translations. These examples come from the French-English and Turkish-English corpus, indicating that such perfect pairs can be extracted for both high-resource and low-resource language pairs. A related observation that applies to the output in general is that long sentences extracted by our approach are much more likely to be good choices.
In items 3–7 of Table 3.10, we see examples of certain errors that exist in other parts of the extracted bitext as well. One such error is matching sentence pairs about different occurrences of a recurring event, such as sports events like the Super Bowl
(e.g., Example 3 in the table). Since certain parts of the text match perfectly
(e.g., team names, name of the tournament), it is not trivial to detect these false positive cases. We have tried several ideas to fix this, including number ratio features that punish instances in which a date in one sentence does not appear in the other.
Another common issue is the handling of names: in Example 4, two different American actresses who share the same first name are matched by our algorithm. This happens due to high similarity in the rest of the text, and can be mitigated by performing named entity recognition, as a measure to ensure that named entities get translated.
We implemented upper-cased entity ratio features in order to address this issue, but did not see improvements in classification accuracy over a held-out development set.
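A number-ratio feature of the kind described above might look like this (an illustrative sketch, not the feature implementation actually used in our classifier):

```python
import re

def number_mismatch_ratio(src, trg):
    # Fraction of numbers appearing in only one of the two sentences; a high
    # value flags pairs describing different occurrences of a recurring event.
    nums_src = set(re.findall(r"\d+", src))
    nums_trg = set(re.findall(r"\d+", trg))
    union = nums_src | nums_trg
    if not union:
        return 0.0
    return 1.0 - len(nums_src & nums_trg) / len(union)

print(number_mismatch_ratio(
    "Am 5. Februar 2012 ... Super Bowl",
    "February 3, 2008 ... Super Bowl"))  # 1.0: the dates share no numbers
```

A feature like this catches the Super Bowl example, but, as noted, such features did not improve held-out classification accuracy in our experiments.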
There are also challenges specific to the Wikipedia collections we are using, namely (a) XML parsing errors and (b) discrepancy in the content of Wikipedia articles across languages. Example 5 is a perfect example of the former issue, in which the text is parsed from a table, but we see that the word in the first table column (i.e., Bronze) is not properly parsed in the English article. On the Czech side, we see that “21px” is incorrectly parsed into the text, adding noise to the task.
Sentence pairs 6 and 7 exemplify the discrepancy issue in Wikipedia: in example 6, the time of the tornado is entered as 1912 in the Spanish article and 1913 in the English one. Without further context, one might think that there is a 1-year difference between the two reports; however, these numbers refer to the time (19:12 vs. 19:13) at which the tornado happened. This 1-minute time difference is possibly
due to citing from different sources. This is yet another piece of evidence that language is ambiguous and hard. As a result of this sentence pair, noise is introduced to the MT training process. Another example of discrepancy is in the last example, where the German sentence lists the location as “Novi Sad, Serbia and Montenegro” whereas the city is written as part of “Federal Republic of Yugoslavia” in the English version. Both are true, since those are two valid ways to refer to the country. In this case, the example might actually be useful for MT, by learning a new mapping
(“Serbia and Montenegro”↔“Federal Republic of Yugoslavia”).
Of course, there are also many cases in which the discrepancy is due to human error, since Wikipedia is a crowd-sourced encyclopedia. Also, differences in both content (i.e., one sentence has additional content) and structure (i.e., the same content is split into a few sentences in one language) create difficulties for our current approach. Some of these weaknesses are caused by focusing only at the sentence level.
As we see in Table 3.10 and the rest of the extracted data, many of the extracted pairs are not exact translations, as one would expect from a manually generated corpus. However, MT approaches are usually robust to such cases, where a portion of each sentence is a valid mutual translation. In addition to type I errors
(i.e., false positives) discussed above, our approach also exhibits type II errors (i.e., missed pairs), and this is mostly due to vocabulary or translation coverage. If a sentence contains words that are not in our vocabulary, or words that we do not know the proper translation of, it is likely that our classifier will not label it as
“parallel”.
In an attempt to analyze where the bitext extraction approach helps MT the most, we found that vocabulary expansion and the reduction of out-of-vocabulary (OOV) cases account for the largest share. In most of the experiments, the reduction in OOV words was over 50% for German-English, which is very substantial. These improvements could have been even more impressive if we had evaluated on test sentences from other domains, such as blogs or chat. Even though this dissertation does not focus on MT experiments in domains other than news, this would be a very promising direction to take in the future.
1. fr: “Afrique Centrale” désigne l'espace géographique couvrant l'ensemble des 11 États membres du Comité consultatif permanent des Nations Unies chargé des questions de sécurité en Afrique centrale : la République d'Angola, la République du Burundi, la République du Cameroun, la République gabonaise, la République de Guinée équatoriale, la République centrafricaine, la République démocratique du Congo, la République du Congo, la République du Rwanda, la République de Sao Tomé-et-Principe et la République du Tchad
   en: “Central Africa” refers to the geographical area covering the 11 States that are members of the United Nations Standing Advisory Committee on Security Questions in Central Africa, namely, the Republic of Angola, the Republic of Burundi, the Republic of Cameroon, the Central African Republic, the Republic of Chad, the Republic of the Congo, the Democratic Republic of the Congo, the Republic of Equatorial Guinea, the Gabonese Republic, the Republic of Rwanda and the Democratic Republic of Sao Tomé and Principe.

2. tr: 10 Mart 2010 tarihli Avrupa Parlamentosu kararı, ”sızdırılan belgelere göre,,FMH uygulanması (ilgili AB mevzuatı bekleyen diğer şeylerin yanı sıra ACTA müzakerelerin dokunma,, COD/2005/0127 - Ceza önlemleri uygulama (fikri mülkiyet haklarının güvence altına amalayan -II IPRED)) ve sözde” Telekom Paketi ”ve” e-ticaret ve veri koruma ile ilgili mevcut AB mevzuatı.
   en: The European Parliament resolution of 10 March 2010 stated that ”according to documents leaked, the ACTA negotiations touch on, among other things, pending EU legislation regarding the enforcement of IPRs (COD/2005/0127 Criminal measures aimed at assuring the enforcement of intellectual property rights (IPRED-II)) and the so-called ”Telecoms Package” and on existing EU legislation regarding e-commerce and data protection.”

3. de: Am 5. Februar 2012 konnten die New York Giants unter der Führung von Coughlin die New England Patriots im Super Bowl XLVI.
   en: The Helmet Catch (February 3, 2008, New York Giants vs. New England Patriots, Super Bowl XLII)

4. de: Patricia Marand (1934–2008), US-amerikanische Schauspielerin
   en: Patricia Clarkson, American actress

5. cs: Bronz Karol von Rommel Józef Trenkwalda Michał Antoniewicz 21px jezdectví Jezdecká všestrannost - družstva
   en: Józef Trenkwald, Michał Antoniewicz and Karol Rómmel Equestrian, Team eventing

6. es: EF1 E de Stringer Smith 1912 Tres casas tuvieron daños menores y varias casas de pollo fueron fuertemente dañadas.
   en: EF1 E of Stringer Smith 1913 Three houses sustained minor damage and several chicken houses were heavily damaged.

7. de: Gruppe B: 28. Dezember 2002 bis 3. Januar 2003 in Novi Sad, Serbien und Montenegro
   en: Group B played in Novi Sad, Federal Republic of Yugoslavia between December 28, 2002 and January 3, 2003.

Table 3.10: Example sentence pairs extracted from Wikipedia.

3.5 Conclusions and Future Work
In this chapter, we introduced an end-to-end pipeline that generates a parallel corpus, given a corpus with semi-comparable or fully comparable document pairs and a seed parallel text to train translation models. We came to the conclusion that our approach can be successful at this task even for semi-comparable corpora due to several properties: (1) The approach benefits from the theoretical guarantees of
LSH techniques, (2) the implementation allows flexibility in adjusting parameters based on collection characteristics, and (3) the evaluation provides evidence that the algorithm can identify parallel portions within semi-comparable text (e.g., an article about “Kyrgyzstan” is linked to an article about “Tulip Revolution”).
Our implementation runs in two phases, corresponding to the problems of cross-lingual pairwise similarity and bitext classification. The first phase requires
finding cross-lingual document pairs that have cosine similarity values above some pre-defined threshold. We adapted an LSH-based approach to the problem, as a parallelized MapReduce algorithm. We showed that 1000-bit LSH signatures are not sufficient to achieve perfect recall but this is the cost of a representation that is significantly faster to process. We experimentally and analytically quantified the effectiveness-efficiency tradeoff of the sliding window approach with respect to a multitude of parameters. This characterization provides a guide to the application developer in selecting the best operating point for a specific situation. A somewhat surprising finding is that a brute-force approach should not be readily dismissed as a viable solution, especially when high recall is desired. Our evaluation shows that
we can find many useful links between Wikipedia articles in German and English.
The second phase requires generating a very large amount of candidate sentence pairs, followed by a computationally intensive classification task. We introduced a MapReduce algorithm to perform the necessary computations in a scalable and efficient manner, and demonstrated its running time on the two largest
Wikipedia collections: German and English. We showed, for multiple language pairs, that an impoverished, data-driven approach is potentially more effective than task-specific engineering. With the distributed bitext mining machinery described in this chapter, improvements come basically “for free” (the only cost is a modest amount of cluster resources). Given the availability of data and computing power, there is simply no reason why MT researchers should not do the same and enjoy the benefits.
Part of our future work is to improve the representation and translation of text by including named entity recognition and using broader and larger vocabularies, as well as techniques from natural language processing. For example, German and
Turkish word translations may be very noisy due to compounding and other morphological phenomena, so we plan on experimenting with more fine-tuned preprocessing.
Using learning methods to find better random projections has been shown to work well [112] and we are interested in adapting that approach to our system, as well as explore other extensions to the LSH-based techniques to compute signatures. In terms of evaluation, it would be very interesting to use crowd-sourcing methods
(e.g., Mechanical Turk) to provide a more realistic assessment of the quality of both document pairs (i.e., output of the first phase) and the parallel text (i.e., output of the second phase). Based on this evaluation, we can explore if there is a correlation
between high-quality translations in the output and improvements to BLEU score on a given test set, which might provide insights on how to improve parallel text extraction approaches in general. Finally, we hope to show the scalability of our algorithm on even more language pairs and even larger datasets.
Chapter 4: Translating to Search:
Context-Sensitive Query Translation
for Cross-Language Information Retrieval
4.1 Overview
In this chapter, we describe a novel approach to cross-language information retrieval (CLIR) using components of a modern statistical machine translation (MT) system. In CLIR, the query and documents are presented in different languages
(referred to as source and target languages, respectively), therefore either the query needs to be translated into the document language, or vice versa.1 Our approach implements query translation, although it could be adapted to translate documents instead.
In order to perform this translation, it is common to make an independence assumption between the tokens of the query, so that each can be translated independently. However, if the entire query is considered together, it is possible to produce more appropriate translations, resulting in improved CLIR effectiveness. Based on this motivation, we introduce a framework to learn term translation probabilities
1It is also possible to perform a combination of both translation directions [145].
that are sensitive to the query context, where term is defined as a class of tokens, following the definition of Manning et al. [91]. We achieve this by taking advantage of the internal representation of a statistical MT system, exploiting the benefits of both translation and language models in various ways. This approach is referred to as context-sensitive query translation.
This chapter is organized as follows: First, a strong baseline approach for query translation in CLIR is described in Section 4.2. We then introduce our proposed methods in Section 4.3. Finally, an evaluation of our approach on three different cross-language retrieval tasks is presented in Section 4.4.
4.2 Context-Independent Query Translation
As a baseline, we consider the technique presented by Darwish and Oard, which is a state-of-the-art method for translating a vector representation from one language space into another [31]. We represent a given source-language query s = s1, s2, ... in the target language (i.e., the document language) as a Probabilistic Structured
Query (PSQ), where each token sj is represented by its translations in the target language, weighted by the bilingual translation probability. These token-to-token translation probabilities are learned independently from a separate parallel bilingual text using automatic word alignment techniques, as described in Section 2.1.1.
In order to build a term translation probability distribution suitable for CLIR, we perform some cleaning on the probabilities output by the word aligner. For each source-language term sj, we sort its possible translations by decreasing probability into a list [t_{i_1}, t_{i_2}, ...]. In sorted order, these translation terms are included into a new probability distribution, called Pr_token, until (1) the probability falls below a threshold L, or (2) the cumulative sum of probabilities reaches C, or (3) the number of translations in the distribution exceeds H. Finally, in order to generate a proper probability distribution, we normalize probabilities.2 Below is the mathematical formulation of the term translation probability distribution:
\[
Pr_{\text{token}}(t_{i_k} \mid s_j) =
\begin{cases}
0 & \text{if } (k > H) \lor p(t_{i_k} \mid s_j) \le L \lor \sum_{l=1}^{k-1} p(t_{i_l} \mid s_j) > C \\[4pt]
\dfrac{1}{\xi_j}\, p(t_{i_k} \mid s_j) & \text{otherwise}
\end{cases}
\tag{4.1}
\]
where ξj is the normalization factor, given by the sum of all probabilities added to the distribution from the target-language vocabulary Vt (the sum runs only over the translations retained by the pruning criteria above):

\[
\xi_j = \sum_{x \in V_t} p(x \mid s_j)
\tag{4.2}
\]
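The pruning-and-normalization procedure above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation; the function name and the default threshold values for L, C, and H are our own:

```python
def build_pr_token(p, L=0.005, C=0.95, H=15):
    """Prune and renormalize raw aligner probabilities p(t | s_j).

    Candidates are visited in decreasing probability; a candidate is
    dropped once its rank exceeds H, its probability falls to L or
    below, or the cumulative mass already kept exceeds C.  Surviving
    probabilities are divided by their sum (the factor xi_j of Eq. 4.2)
    so that the result is a proper distribution.
    """
    kept, cumulative = {}, 0.0
    ranked = sorted(p.items(), key=lambda kv: -kv[1])
    for rank, (t, prob) in enumerate(ranked, start=1):
        if rank > H or prob <= L or cumulative > C:
            break
        kept[t] = prob
        cumulative += prob
    xi = sum(kept.values())  # normalization factor (Eq. 4.2)
    return {t: prob / xi for t, prob in kept.items()}
```

Applied to the raw probabilities for leave shown later in Figure 4.1 (plus a hypothetical low-probability candidate), the low-probability tail is cut and the remainder renormalized.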
In IR, we collect statistics of terms within documents, such as term frequency
(tf) and document frequency (df), and use those to score query-document relevance.
In CLIR, these statistics are available for target-language terms only, since all documents are written in the target language. Therefore, Darwish and Oard proposed a mechanism to translate these values into the source-language vocabulary space, using term translation probabilities (e.g., Pr_token) [31].
In this approach, the score of document d, given source-language query s, is
2In a proper probability distribution, all probabilities are positive, and the sum is 1.
computed by the following equations:

\[
\mathit{Score}(d \mid s) = \sum_{j=1}^{\#\text{terms}} \mathit{BM25}\big(tf(s_j, d),\, df(s_j)\big)
\tag{4.3}
\]
\[
tf(s_j, d) = \sum_{t_i} tf(t_i, d)\, Pr_{\text{token}}(t_i \mid s_j)
\tag{4.4}
\]
\[
df(s_j) = \sum_{t_i} df(t_i)\, Pr_{\text{token}}(t_i \mid s_j)
\tag{4.5}
\]
As shown above, we use the Okapi BM25 term weighting function, due to its superior effectiveness in empirical evaluations. Although we have decided to use Okapi BM25 in our approach, any other weighting function can be substituted into Equation 4.3 in principle. For reference, we provide the formula for the BM25 function, given some term w, and a document d from a collection of documents C:
\[
\mathit{BM25}\big(tf(w,d),\, df(w)\big) =
\frac{(k_1 + 1)\, tf(w,d)}{k_1\Big((1-b) + b\,\frac{|d|}{\operatorname{avg}_{d' \in C}|d'|}\Big) + tf(w,d)}
\cdot \log \frac{N - df(w) + 0.5}{df(w) + 0.5}
\tag{4.6}
\]
where N is the number of documents in the collection C, and parameters k1 and b define how much we penalize longer documents. In our work, we fixed the parameters to k1 = 1.2 and b = 0.75, as this combination has been previously shown to work well.
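The scoring procedure of Equations 4.3-4.6 can be sketched as follows. This is a minimal illustration with hypothetical statistics, not a retrieval engine; the function names are our own:

```python
import math

def bm25(tf, df, N, doclen, avg_doclen, k1=1.2, b=0.75):
    # Okapi BM25 term weight (Eq. 4.6); N is the collection size.
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return (k1 + 1) * tf / (k1 * ((1 - b) + b * doclen / avg_doclen) + tf) * idf

def psq_score(query_dists, doc_tf, doc_df, N, doclen, avg_doclen):
    """Score one document against a PSQ (Eqs. 4.3-4.5).

    query_dists: one translation distribution per source token s_j,
    mapping target term t_i -> Pr(t_i | s_j).  doc_tf holds target-term
    frequencies in this document, doc_df the collection-wide df values.
    """
    score = 0.0
    for dist in query_dists:
        # Project target-language tf/df onto the source term (Eqs. 4.4-4.5).
        tf_j = sum(doc_tf.get(t, 0) * p for t, p in dist.items())
        df_j = sum(doc_df.get(t, 0) * p for t, p in dist.items())
        score += bm25(tf_j, df_j, N, doclen, avg_doclen)
    return score
```

When a source token has a single translation with probability 1.0, the PSQ score reduces to an ordinary monolingual BM25 score for that translation, as expected.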
Example Let us demonstrate the underlying representation of this query translation model with an example. Following an Indri-like [96] notation, Figure 4.1 shows how the translation of the English query Maternal leave in Europe is represented as a PSQ under this model, for the target language French.3
#comb(#weight(0.74 matern, 0.26 maternel)
      #weight(0.49 laiss, 0.17 quitt, 0.09 cong, 0.08 part, 0.04 abandon, 0.04 voyag, ...)
      #weight(0.91 europ, 0.09 européen))

Figure 4.1: A PSQ, representing the translation of Maternal leave in Europe using Pr_token.
The #comb operator corresponds to the sum operation in Equation 4.3, whereas the #weight operator represents the weighted sums in Equations 4.4 and 4.5. Each of the three #weight structures represents the translation of one non-stopword query token: maternal, leave, and Europe. Within each #weight structure, each term follows its probability, corresponding to the Pr_token values in these equations.
Since the translation distribution for the source term leave is unaware of the context maternity leave, candidates that occur most frequently in general text, such as laisser (Eng. let go, allow) and quitter (Eng. quit), have higher probabilities than more appropriate candidates, such as congé (Eng. vacation, day off). Also, even though we do not present all of them above, there are 10 candidates for the translation of leave, due to the many senses associated with this word.
Discussion A drawback of this baseline query translation approach (hereafter referred to as “token-based”) is its context-independent translation model, due to its assumption that the translation of a token is independent of the rest of the query.
With less contextual information, this model generates many translation candidates,
3English and French tokens are stemmed in preprocessing.
causing an increased amount of ambiguity in the representation (e.g., Figure 4.1).
The strength of context-sensitive CLIR approaches is to cleverly down-weight or discard translation candidates that are not appropriate within the given context.
At the same time, it is not possible to entirely disambiguate language, even with context-sensitive models, since there will be more than one correct way to translate many words. Hence, the ability to represent all plausible translations in a probabilistic manner is actually helpful in retrieval, since it provides guidance based on statistics obtained from training data. We call such models ambiguity-preserving, since the probabilistic representation preserves the ambiguities associated with translating the query. The strength of probabilistic token-based CLIR approaches is to preserve these ambiguities within the final representation. It has been shown that such ambiguity-preserving representations are beneficial for translation [37].
In this chapter, based on the intuition that it is essential to (i) be sensitive to query context, and (ii) preserve useful linguistic ambiguities probabilistically, we describe methods to construct translation probabilities conditioned on the query context (i.e., Pr(ti|sj, s), where s represents the query context). We exploit existing statistical MT techniques to achieve this, and incorporate them into the token-based PSQ representation illustrated above. As a result, we mix the context-sensitive translation choices of MT with the ambiguity-preserving PSQ representations of traditional CLIR, combining the strengths of both fields for this task.
4.3 Context-Sensitive Query Translation
In this section, we explore several ways to improve the token-based query translation model discussed above, by exploiting the internal representations of an
MT system. We view the MT pipeline as three successive processes, as shown in
Figure 4.2: word alignment, translation modeling, and decoding. Let us first briefly describe these three components, and discuss how they relate to the task of CLIR.
[Figure 4.2 shows the pipeline: a sentence-aligned parallel corpus feeds a word aligner, whose word alignments and token translation probabilities feed a grammar extractor; the resulting translation model, together with a language model, drives the decoder, which maps the query "maternal leave in Europe" to the 1-best translation "congé de maternité en Europe" and a list of n-best derivations.]
Figure 4.2: An overview of a modern statistical MT pipeline.
As discussed in Section 2.1.1, the word alignment process takes a parallel corpus, and learns word alignments that maximize data likelihood. As a by-product of the learning process, token translation probabilities are generated (i.e., parameters of the IBM models). These probabilities have been successfully applied to CLIR, as we described in the baseline approach (Section 4.2).
However, MT systems use the word alignments to build more sophisticated translation models, such as a phrase translation table or Synchronous Context-Free
Grammar (SCFG). These rich representations have not been explored by CLIR researchers for query or document translation. In Section 4.3.1, we describe how one can directly use these representations to perform query translation.
The third major component of the MT pipeline is the decoder, which performs a search through the hypothesis space, to find top-scoring translations efficiently. By combining the language and translation models, the decoder can take both adequacy and fluency into account when scoring hypotheses, allowing better assessment of quality. In Section 4.3.2, we explain how to use the n top hypotheses (output by the decoding process) to perform query translation.
Finally, since each of these approaches has complementary advantages, we present an interpolation model in Section 4.3.3.
4.3.1 Learning Probabilities from Translation Model
Motivation Although the word alignment and decoding components are mostly similar across different MT approaches and implementations, the representation of the translation model varies substantially. In this section, we focus on the translation models of two state-of-the-art MT models, a flat phrase-based MT (PBMT) system and a hierarchical PBMT system, in the context of query translation. For a detailed review of statistical MT approaches, see Lopez's survey [86].
Regardless of whether an MT system is flat or hierarchical, the translation
model consists of a set of translation rules in the following format:
rule r: α || β || A || ℓ(r)

stating that source-language text α can be translated into target-language text β, with an associated likelihood value ℓ(r),4 an unnormalized estimate of the event probability. We call α the Left-Hand Side (LHS) of the rule, and β the Right-Hand Side (RHS) of the rule. A represents the word alignments, a many-to-many mapping between tokens on the LHS and RHS of the rule.
In flat MT systems, the LHS and RHS of a rule contain only text, one or more tokens on each side. These multi-token expressions are usually called phrases, although they are not linguistically motivated entities (they are selected based entirely on statistical importance). As a result, rules are typically referred to as phrase pairs under these models, and the set of rules in the translation model forms the phrase translation table. In order to avoid confusion and easily contrast with hierarchical
MT, we will refer to the set of rules within a flat MT system as a flat grammar.
In hierarchical MT systems, rules take a slightly different form:
hierarchical rule r: [X] || α || β || A || `(r) which indicates that the context free expansion X → α in the source language occurs synchronously with X → β in the target language. These rules form a formal model of translation, called a SCFG or hierarchical grammar, which differs from a flat grammar in terms of rule expressivity: the LHS and RHS are allowed to contain
4In practice, there are usually additional features that represent various aspects of the rule.
one or more nonterminals, each acting as a variable that can be expanded into other expressions using the SCFG. In other words, each rule describes a context-free expansion on both source and target sides, carried out in a recursive manner to generate translation hypotheses.
Consider the following two rules in order to illustrate the differences between
flat and hierarchical translation models:5
R1.[S] || [X] leav in europ || cong de [X] en europ || 1-0 2-3 3-4 || 1
R2.[X] || matern || matern || 0-0 || 0.69
Applying these two rules consecutively (i.e., by expanding the non-terminal variable [X] in R1 by the rule R2), we can translate matern leav in europ into cong de matern en europ. More generally, [X] in R1 allows an arbitrarily long part of the sentence to be moved from the left of the sentence in English to the middle of the sentence in French,6 even though it generates a single token (i.e., matern) using R2 in this particular example. Using such rules, a hierarchical grammar (i.e., SCFG) can capture some distant dependencies in a sentence that may not be realized in flat grammars.
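The derivation above can be mimicked with a toy substitution function. This is only a sketch of one derivation step on strings; real decoders operate over packed representations of many derivations at once:

```python
def expand(target_rhs, substitution):
    """Replace the nonterminal [X] in a rule's target side with the
    (already expanded) target side of another rule -- one step of a
    synchronous context-free derivation."""
    out = []
    for tok in target_rhs:
        out.extend(substitution if tok == "[X]" else [tok])
    return out

# Target side of R1, with R2's target side substituted for [X]:
r1_rhs = ["cong", "de", "[X]", "en", "europ"]
r2_rhs = ["matern"]
translation = expand(r1_rhs, r2_rhs)
```

The result is the token sequence cong de matern en europ, matching the derivation described in the text.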
Given some input text, a suffix array can be used to efficiently extract all applicable rules from a SCFG [85]. Similarly, a binarized representation of the
5Tokens in the example have been stemmed as part of the preprocessing.
6This may not be true in practice, where it is common to put length restrictions on phrases. Besides, moving around very long phrases might not be desirable when translating between most language pairs. Still, we are pointing out that such transformations can be represented in the SCFG formalism.
flat grammar is used to filter rules with respect to a provided input text [70]. By keeping only the small portion of the grammar that applies to the input text, this procedure significantly reduces the memory footprint in the decoding phase. We can exploit this feature of modern MT systems to learn term translation probabilities.
Method We propose the following method to construct a probability distribution from a set of rules, either from a flat or hierarchical grammar: For each rule, we use the word alignments to determine which source token translates to which target token(s). Iterating over all grammar rules that apply to a given query, we construct a probability distribution for each token that appears on the LHS of any rule.
More specifically, given a grammar G and query s, we first obtain the subset of rules G(s) for which the source side pattern matches s (by either using suffix array extraction or filtering techniques described above). Once G(s) is obtained, the process is identical when using flat or hierarchical MT systems: For each rule r in
G(s), we identify each source token sj on the LHS of r, ignoring any non-terminal symbols. From the word alignment information included in the rule structure, we can find all target tokens that sj is aligned to.
Multiple Alignment Heuristics When sj is aligned to multiple target tokens in a rule, it is not obvious how to distribute the probability mass. One approach is to treat each alignment as an independent event with the same probability (equal to the likelihood of rule r). We call this the one-to-one heuristic, and introduce two alternatives due to the following drawback: the target tokens aligned to sj are not usually independent. For example, the token brand may be aligned to the three tokens marque, de, fabrique (Eng. brand, of, factory), which form an appropriate translation when put together. Even if de is discarded as a stop word, the one-to-one heuristic will learn the token pair (brand, fabrique) incorrectly. An alternative heuristic is to ignore these cases altogether, assuming that good translation pairs will appear in other rules, so that discarding these cases would not cause any harm: we call this the one-to-none technique. A third approach is to combine the target tokens into a multi-token expression. So, in the above example, we would learn the translation of brand as marque de fabrique, which is a useful mapping that we might not learn otherwise. We combine the target tokens if they are either consecutive, or separated only by stop words (e.g., de in French). The latter case applies in the above example: marque and fabrique are not consecutive, but the only token in between them is a stop word. We call the third technique one-to-many, and compare these three heuristics (i.e., one-to-none, one-to-one, one-to-many) in our evaluation.
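The three heuristics can be sketched as follows on the brand example. This is our own illustration; in particular, the fallback to per-token alignments when non-stopword tokens intervene is an assumption, since the text does not specify that case:

```python
def rule_targets(rhs, aligned_idx, stopwords, heuristic):
    """Target terms credited with a rule's likelihood for one source
    token, under the three multiple-alignment heuristics.

    rhs: target-side tokens of the rule; aligned_idx: indices of the
    tokens aligned to the source token.
    """
    if len(aligned_idx) <= 1 or heuristic == "one-to-one":
        return [rhs[i] for i in aligned_idx]   # one event per alignment link
    if heuristic == "one-to-none":
        return []                              # skip multiply-aligned cases
    if heuristic == "one-to-many":
        lo, hi = min(aligned_idx), max(aligned_idx)
        gap = [i for i in range(lo, hi + 1) if i not in aligned_idx]
        if all(rhs[i] in stopwords for i in gap):
            # Merge into one multi-token expression (stop words included).
            return [" ".join(rhs[lo:hi + 1])]
        return [rhs[i] for i in aligned_idx]   # assumed fallback to one-to-one
    raise ValueError(heuristic)
```

For rhs = [marque, de, fabrique] with the source token aligned to positions 0 and 2, one-to-one yields marque and fabrique as separate events, one-to-none yields nothing, and one-to-many yields the single expression marque de fabrique.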
After processing all rules in a similar fashion (using any of the multiple alignment heuristics), we have accumulated likelihood values for all token pairs we have observed. From these values, we can gather a list of possible translations (and associated likelihood values) for each source term that has appeared in any of the rules.
We can then convert each such list into a probability distribution by normalizing the likelihood scores. We call this distribution Pr_SCFG if using a hierarchical MT system, or Pr_PBMT if the underlying MT model is flat.
Below is the formulation of the construction process described above. The subscript SCFG/PBMT is used in expressions that apply to both underlying MT models.
\[
Pr_{\text{SCFG/PBMT}}(t_i \mid s_j) = \frac{1}{\psi} \sum_{\substack{r \in G(s) \\ s_j \leftrightarrow t_i \text{ in } r}} \ell(r)
\tag{4.7}
\]
\[
tf(s_j, d) = \sum_{\{t_i \mid s_j \leftrightarrow t_i \in G(s)\}} tf(t_i, d)\, Pr_{\text{SCFG/PBMT}}(t_i \mid s_j)
\tag{4.8}
\]
\[
df(s_j) = \sum_{\{t_i \mid s_j \leftrightarrow t_i \in G(s)\}} df(t_i)\, Pr_{\text{SCFG/PBMT}}(t_i \mid s_j)
\tag{4.9}
\]
where ψ is the normalization factor and sj ↔ ti represents an alignment between tokens sj and ti. Mapping tf and df statistics from the target to the source vocabulary is achieved by replacing Pr_token with Pr_SCFG/PBMT in Equations 4.4 and 4.5.
Example As an example, let us compute Pr_SCFG for the second token in our running example query, Maternal leave in Europe (which is preprocessed into matern leav europ). Assume that Figure 4.3 shows the set of SCFG rules that contain the token leav, extracted from the translation model for this specific query.7
[X] || leav || cong || 0-0 || 0.38
[X] || leav || quitt || 0-0 || 0.08
[X] || leav || laiss || 0-0 || 0.27
[X] || [X] leav || [X] laiss || 1-1 || 0.22
[X] || [X] leav || cong [X] || 1-0 || 0.07
[X] || [X] leav || [X] en laiss || 1-2 || 0.02
[X] || [X] leav || [X] elle quitt || 1-2 || 0.01
[X] || [X] leav || [X] en train de quitt || 1-4 || 0.01
[X] || leav || quitt cet assembl || 0-0 0-2 || 0.01
Figure 4.3: A subset of the SCFG as a toy example.
7In reality, this is only a small portion of the rules, but we omit the rest for demonstration purposes.
In order to construct a translation distribution for the term leav, we iterate over the rules and accumulate values for each translation candidate. The candidate in the first rule is cong (Eng. vacation, day off), thus we accumulate a value of 0.38 for the distribution Pr_SCFG(cong|leav). Similarly, we process the remaining rules and add values for two other translation candidates: laiss (Eng. let go, allow) and quitt (Eng. quit). In the final rule, the approach depends on the heuristic choice: if we apply the one-to-one strategy, we add 0.01 for each candidate, quitt and assembl.8 In this case, the final distribution is computed as follows (see Equation 4.7):
ψ = (0.38 + 0.07) + (0.08 + 0.01 + 0.01 + 0.01) + (0.27 + 0.02) + 0.01 = 0.86
Pr_SCFG(cong|leav) = (0.38 + 0.07)/0.86 ≈ 0.52
Pr_SCFG(quitt|leav) = (0.08 + 0.01 + 0.01 + 0.01)/0.86 ≈ 0.13
Pr_SCFG(laiss|leav) = (0.27 + 0.02)/0.86 ≈ 0.34
Pr_SCFG(assembl|leav) = 0.01/0.86 ≈ 0.01
If one-to-many is being applied, we add 0.01 to a single multi-token term, quitt
8cet is a stop word in French, therefore it is ignored.
cet assembl, yielding a different computation:
ψ = (0.38 + 0.07) + (0.08 + 0.01 + 0.01) + (0.27 + 0.02) + 0.01 = 0.85
Pr_SCFG(cong|leav) = (0.38 + 0.07)/0.85 ≈ 0.53
Pr_SCFG(quitt|leav) = (0.08 + 0.01 + 0.01)/0.85 ≈ 0.12
Pr_SCFG(laiss|leav) = (0.27 + 0.02)/0.85 ≈ 0.34
Pr_SCFG(quitt cet assembl|leav) = 0.01/0.85 ≈ 0.01
Finally, when the one-to-none heuristic is applied, the last rule contributes nothing, producing this distribution:
ψ = (0.38 + 0.07) + (0.08 + 0.01 + 0.01) + (0.27 + 0.02) = 0.84
Pr_SCFG(cong|leav) = (0.38 + 0.07)/0.84 ≈ 0.54
Pr_SCFG(quitt|leav) = (0.08 + 0.01 + 0.01)/0.84 ≈ 0.12
Pr_SCFG(laiss|leav) = (0.27 + 0.02)/0.84 ≈ 0.35
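The computations above can be reproduced mechanically with the accumulate-and-normalize step of Equation 4.7. This is a sketch with names of our own; each rule is reduced to the likelihood it carries and the target terms it yields for leav, mirroring the one-to-one arithmetic worked out above:

```python
def pr_scfg(contributions):
    """Accumulate and normalize rule likelihoods (Eq. 4.7).

    contributions: list of (target_terms, likelihood) pairs, one per
    applicable rule, where target_terms already reflects the chosen
    multiple-alignment heuristic.
    """
    acc = {}
    for targets, lik in contributions:
        for t in targets:
            acc[t] = acc.get(t, 0.0) + lik
    psi = sum(acc.values())  # normalization factor
    return {t: v / psi for t, v in acc.items()}

# Rule contributions for leav under one-to-one, following the worked
# arithmetic above (the last rule credits both quitt and assembl):
one_to_one = [(["cong"], 0.38), (["quitt"], 0.08), (["laiss"], 0.27),
              (["cong"], 0.07), (["laiss"], 0.02), (["quitt"], 0.01),
              (["quitt"], 0.01), (["quitt", "assembl"], 0.01)]
dist = pr_scfg(one_to_one)
```

Running this yields the same distribution as the one-to-one computation above (cong ≈ 0.52, laiss ≈ 0.34, quitt ≈ 0.13, assembl ≈ 0.01); changing the last entry reproduces the one-to-many and one-to-none variants.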
In addition to these toy examples, Figures 4.4 and 4.5 show the translation probabilities as learned from the entire set of rules, using hierarchical and flat MT systems, respectively. In both cases, we used the one-to-one alignment heuristic.
Discussion Pr_token and Pr_SCFG/PBMT both describe the probability of a target-language token given a source-language token, but differ in how the probability values are computed. For both approaches, we start from a large, sentence-aligned
#comb(#weight(0.68 matern, 0.06 maternel, ...)
      #weight(0.36 cong, 0.25 laiss, 0.11 quitt, ...)
      #weight(0.90 europ, 0.07 européen, ...))

Figure 4.4: A PSQ, representing the translation of query "Maternal leave in Europe" using Pr_SCFG.
#comb(#weight(0.68 matern, 0.06 maternel, ...)
      #weight(0.33 cong, 0.22 laiss, 0.11 quitt, 0.04 part, 0.02 abandon, 0.02 voyag, ...)
      #weight(0.90 europ, 0.05 européen, ...))

Figure 4.5: A PSQ, representing the translation of query "Maternal leave in Europe" using Pr_PBMT.

bilingual corpus, which is first word-aligned. From these word alignments, one can directly generate token translation probabilities, which correspond to Pr_token. In order to create Pr_SCFG/PBMT, we take advantage of the MT system's ability to induce a more sophisticated translation model (i.e., a flat or hierarchical translation grammar), and filter the space with respect to the given input text (i.e., the query).
As a result, Pr_SCFG/PBMT is different from Pr_token because it takes query context into account. Basically, we only look at the part of the grammar that applies to the source query text, and therefore create a bias in the probability distribution based on this context. This results in context-sensitive translations, and it also reduces the ambiguity associated with how to translate each token. For example, we might learn 10 different ways of translating leave from the parallel corpus, but only a few might be left after filtering out the choices that do not apply to the query context.
The context-aware nature of this so-called "grammar-based" approach is why the translation distribution for the example query is different from Pr_token. For example, the distribution of leave (Figure 4.4) shifts toward the more appropriate translation congé as a result of this approach, as opposed to the distribution in Figure 4.1. Moreover, although not shown in the figure, the number of translation candidates decreases from 10 to 5 with the hierarchical grammar-based approach.
There is also a tradeoff between using either of the two MT models for CLIR.
While hierarchical models allow more flexibility in representing linguistic phenomena, this usually makes decoding slower in practice [86].9 On the other hand, simpler flat PBMT models do not possess the expressiveness of hierarchical models. Also, due to the lack of variables in the rule representation, the translation model contains a larger number of rules, resulting in a verbose representation. Figures 4.4 and 4.5 clearly illustrate this effect, with extra noise in the translation of leave in the latter case.
4.3.2 Learning Probabilities from n-best Derivations
Motivation In MT, decoding is the process of searching and ranking translations.
In statistical MT systems, each sequence of rules that covers the entire input is called a derivation, D, and produces a translation candidate, t, which is scored by a log-linear combination of features. One can add many features to score a given candidate, but two features are essential: the translation model score (TM(t, D|s)) is the product of rule likelihood values and indicates how well the candidate preserves
9We did not experience a significant speed difference in our experiments. However, a fair comparison is not possible when speed strongly depends on various implementation details and parameters. We are not aware of a direct comparison between the two approaches in the MT literature.
the original meaning, whereas the language model score LM(t) indicates how fluent the translation is. To control computational complexity, most decoders search for the most probable derivation (as opposed to the most probable string):
\[
\begin{aligned}
t^{(1)} &= \arg\max_t \max_{D \in \mathcal{D}(s,t)} \ell(t, D \mid s) && (4.10)\\
&= \arg\max_t \max_{D \in \mathcal{D}(s,t)} \log \ell(t, D \mid s) && (4.11)\\
&= \arg\max_t \max_{D \in \mathcal{D}(s,t)} \big(\log TM(t, D \mid s) + \log LM(t)\big) && (4.12)\\
&= \arg\max_t \Big(\log LM(t) + \max_{D \in \mathcal{D}(s,t)} \sum_{r \in D} \log \ell(r)\Big) && (4.13)
\end{aligned}
\]

where D(s, t) is the set of possible derivations that generate the pair of sentences (s, t) (e.g., the sequence of four rules that translate the example query in Section 2.1.2 forms one such derivation).
One way to use the decoder for CLIR is to replace the source query with its most probable translation. In this "one-best" query translation approach, Equations 4.3-4.5 simplify to:
\[
\mathit{Score}(d \mid s) = \sum_{i=1}^{m} \mathit{BM25}\big(tf(t_i^{(1)}, d),\, df(t_i^{(1)})\big)
\tag{4.14}
\]

where t_1^{(1)}, ..., t_m^{(1)} are the tokens of the most probable translation t^{(1)}.
Although this one-best strategy has been shown to work well in many cases, it discards potentially useful information generated by the decoder. Decoders produce a set of candidate sentence translations in the process of computing Equation 4.10, so it is possible to generalize our translation model to consider the n most probable translation candidates, instead of the single best one.
Method In order to learn token translation probabilities from the n-best translations, we start by preprocessing the source query s and each candidate translation t^{(k)}, k = 1, ..., n.10 For each source token sj, we use the derivation information to determine which grammar rules were used to produce t^{(k)}, and the word alignments within these rules to determine which target tokens are associated with sj in the derivation. By doing this for each translation candidate t^{(k)}, we construct a probability distribution of possible translations of sj based on the n query translations. Specifically, if source token sj is aligned to (i.e., translated as) ti in the k-th best translation, the value ℓ(t^{(k)}|s) is added to its probability mass. Similar to Pr_SCFG/PBMT, we apply one of the three possible heuristics when a source token is aligned to multiple target tokens in a rule. The following formulates how this new probability distribution (called Pr_nbest) is constructed:
\[
Pr_{\text{nbest}}(t_i \mid s_j) = \frac{1}{\varphi} \sum_{\substack{k = 1 \\ s_j \leftrightarrow t_i \text{ in } t^{(k)}}}^{n} \ell(t^{(k)} \mid s)
\tag{4.15}
\]
\[
tf(s_j, d) = \sum_{t_i} tf(t_i, d)\, Pr_{\text{nbest}}(t_i \mid s_j)
\tag{4.16}
\]
\[
df(s_j) = \sum_{t_i} df(t_i)\, Pr_{\text{nbest}}(t_i \mid s_j)
\tag{4.17}
\]
where ϕ is the normalization factor. We should emphasize that Pr_nbest is a well-defined probability distribution for each sj, so if a source token is translated consistently into the same target token in all n translations, then it will have a single translation with a probability of 1.0.
10Here, t^{(k)} denotes the kth most likely translation of s, according to the log-linear MT model.
The process to construct Pr_nbest is almost identical to that of Pr_SCFG/PBMT, with the only difference being that we iterate over the rules used in the derivations of the top n translations. For comparison, Figure 4.6 shows translation probabilities for the same example query, using Pr_nbest.
#comb(#weight(0.91 matern, 0.09 maternel, ...)
      #weight(1.0 cong)
      #weight(1.0 europ))

Figure 4.6: A PSQ, representing the translation of query "Maternal leave in Europe" using Pr_nbest.
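Construction of Pr_nbest follows the same accumulate-and-normalize pattern as the grammar-based distribution, now over the n-best list. This is a sketch with hypothetical alignments and scores; the data layout and names are our own:

```python
def pr_nbest_dist(source_token, nbest):
    """Pr_nbest(. | source_token) from an n-best list (Eq. 4.15).

    nbest: list of (alignments, likelihood) pairs, one per candidate
    t^(k); alignments maps each source token to the target terms it
    produced in that candidate's derivation (after applying one of the
    multiple-alignment heuristics).
    """
    acc = {}
    for alignments, lik in nbest:
        for t in alignments.get(source_token, []):
            acc[t] = acc.get(t, 0.0) + lik
    phi = sum(acc.values())  # normalization factor
    return {t: v / phi for t, v in acc.items()}

# Hypothetical 3-best list: every candidate translates "leav" as "cong",
# so the distribution collapses to a single translation, as in the text.
nbest = [({"leav": ["cong"], "matern": ["matern"]}, 0.6),
         ({"leav": ["cong"], "matern": ["maternel"]}, 0.3),
         ({"leav": ["cong"], "matern": ["matern"]}, 0.1)]
dist = pr_nbest_dist("leav", nbest)
```

For "matern", by contrast, the same list splits the mass between matern and maternel, weighted by the candidates' likelihoods.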
Discussion Since entire query translations are used as context in Pr_nbest, we refer to this CLIR approach as "translation-based". In terms of the MT pipeline (Figure 4.2), probabilities in Pr_nbest are learned after the final processing stage, which is decoding. As a result, in contrast with Pr_SCFG/PBMT, the language model is incorporated into the query translation process, potentially guiding it toward more fluent output.
Of course, language models are trained from well-formed text, typically from formal sources (e.g., news). Therefore, these models are not tuned towards queries, which consist of a few keywords that are not guaranteed to form a fluent sentence or phrase. There are large language models that have been successful in a wide range of applications, yet they increase the memory footprint substantially. Recently, there has been some success on reducing the memory requirements of language models, although this is not a focus of this dissertation [58].
Another implication of learning probabilities after decoding is the need for
tuning: this is a non-trivial procedure with a running time typically much longer than decoding itself. Also, there are very few parallel corpora specifically for query text, and tuning MT systems specifically for CLIR has received almost no attention from researchers, except for a recent paper by Nikoulina et al. [105].
As discussed before, the advantage of the token-based approach is the ability to model all of the translational varieties (or ambiguities) present in the bilingual corpus, although these may be too noisy to properly represent the query translation space. We showed that Pr_SCFG reduces some of this ambiguity by incorporating the query context. For Pr_nbest, we would expect even further reductions, since we are now only considering the top-scoring n derivations (as opposed to all applicable grammar rules). This is easily observed from the example in Figure 4.6, in which the term leav has only one translation, as opposed to 10 and 5 in Pr_token and Pr_SCFG, respectively. Due to this phenomenon, using Pr_nbest might perform worse due to overfitting to the query context (e.g., potentially good translations might be discarded due to low scores by the LM), or better due to irrelevant translation choices being removed.
The overfitting issue is partially mitigated by using the n-best translation derivations, as opposed to the 1-best translation, which treats the MT system as a black box. However, the lack of textual variety in the n most probable derivations is a known issue, caused by the fact that statistical MT systems identify the most probable derivations (not the most probable strings), many of which can correspond to the same surface form. This phenomenon is called “spurious ambiguity” in the
MT literature, and it occurs in both flat [71] and hierarchical phrase-based MT
systems [25]. For instance, according to Li et al. [79], a string has an average of 115 distinct derivations in Chiang's Hiero system. Researchers have proposed several ways to cope with this situation, and integrating some of these ideas into our CLIR approach might be worthwhile to explore in the future.
4.3.3 Combining Sources of Evidence
All three approaches for query translation (i.e., token-based, grammar-based, and translation-based) have complementary strengths, providing a tradeoff between context-sensitive and ambiguity-preserving translation approaches. Figure 4.7 illustrates this tradeoff between the different CLIR models introduced in this section.
On one side of the spectrum, we have the token-based CLIR model, which nicely represents the varieties of language through a probabilistic representation, but assumes independence between query terms. This results in an ambiguity-preserving, yet context-independent approach. On the other end, we have the translation-based models (1-best and n-best), which rely heavily on query context during translation; this may result in better or worse performance, depending on how well that query is handled by the MT models. In the middle, grammar-based CLIR provides a compromise between the two extremes, preserving some of the ambiguity represented in the translation model but also conditioning the translation process on query context.
[Figure 4.7 arranges the CLIR models along the hierarchical MT pipeline (sentence-aligned bitext → word alignments → translation model → n-best translations → 1-best translation), placing the token-based, grammar-based, and MT-based (1-best/n-best) models on a spectrum from ambiguity-preserving to context-sensitive.]

Figure 4.7: A comparison between the CLIR models and their correspondence within a hierarchical MT pipeline.

In order to combine the strengths of each model, we introduce a unified CLIR model by performing a linear interpolation of the three probability distributions:
\[
Pr_c(t_i \mid s_j; \lambda_1, \lambda_2) = \lambda_1\, Pr_{\text{nbest}}(t_i \mid s_j) + \lambda_2\, Pr_{\text{SCFG/PBMT}}(t_i \mid s_j) + (1 - \lambda_1 - \lambda_2)\, Pr_{\text{token}}(t_i \mid s_j)
\tag{4.18}
\]
where λ1 and λ2 define how much weight is assigned to the translation-based and grammar-based models, respectively. Replacing Pr_token with Pr_c in Equations 4.4 and 4.5 gives us the document scoring formula for the interpolated model.
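The interpolation in Equation 4.18 is a straightforward convex combination over the union of the three supports. This sketch uses hypothetical distributions and arbitrary λ values for illustration:

```python
def pr_combined(pr_nbest, pr_grammar, pr_token, lam1, lam2):
    """Linear interpolation of the three distributions (Eq. 4.18)."""
    support = set(pr_nbest) | set(pr_grammar) | set(pr_token)
    return {t: lam1 * pr_nbest.get(t, 0.0)
               + lam2 * pr_grammar.get(t, 0.0)
               + (1 - lam1 - lam2) * pr_token.get(t, 0.0)
            for t in support}

# Hypothetical component distributions for one source term:
combined = pr_combined({"cong": 1.0},
                       {"cong": 0.6, "laiss": 0.4},
                       {"laiss": 0.5, "quitt": 0.3, "cong": 0.2},
                       lam1=0.3, lam2=0.4)
```

Since each component sums to 1 and the mixture weights sum to 1, the combined distribution is again a proper probability distribution.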
4.4 Evaluation
We evaluated our system on the latest available CLIR test collections for three languages: TREC 2002 English-Arabic CLIR, NTCIR-8 English-Chinese Advanced
Cross-Lingual Information Access (ACLIA), and CLEF 2006 English-French CLIR.
For the Arabic and French collections, we used title queries because they are most representative of the short queries that searchers frequently pose to web search engines. Chinese queries in the NTCIR-8 ACLIA test collection are in the form of complete, syntactically correct questions, but for consistency we treated them as bag-of-words queries in our experiments, with no special processing. The collections contain 383,872, 388,589, and 177,452 documents, and 50, 73, and 50 topics, respectively. Details of each collection are summarized in Table 4.1.
Language | Collection | Size    | # topics | Training source | Training size
Arabic   | TREC 2002  | 383,872 | 50       | GALE            | 3.4m
Chinese  | NTCIR-8    | 388,589 | 73       | FBIS            | 0.3m
French   | CLEF 2006  | 177,452 | 50       | Europarl        | 2.2m
Table 4.1: A summary of the collections used in our evaluations. Query language is English in all three cases.
An English-to-Arabic translation model was learned using 3.4 million aligned sentence pairs from the DARPA GALE evaluation [108], which consists of NIST and LDC releases. An English-to-Chinese translation model was trained on 302,996 aligned sentence pairs from the widely used Foreign Broadcast Information Service
(FBIS) corpus, which is a collection of radio newscasts, provided by LDC (catalog number LDC2003E14).11 We trained an English-to-French translation model using
2.2 million aligned sentence pairs from the latest Europarl corpus (version 7) that was built from the European parliament proceedings.12
For the flat PBMT system, we used the Moses MT system [70], a state-of-the-art open-source toolkit. For the hierarchical MT approach, we used the cdec decoder, due to its support for SCFG-based models and its efficient C-based implementation, making it faster than most of the other state-of-the-art systems [36].
Word alignments were learned with GIZA++ [106], using 5 Model 1 and 5 HMM iterations. A Hiero-style SCFG serves as the basis for the hierarchical translation model [25], which was extracted from these word alignments using a suffix array [85].
A 3-gram language model was trained from the English side of the training bitext for Chinese and Arabic, using the SRILM toolkit [133]. For French, we trained a 5-gram LM from the monolingual dataset provided for WMT-12. The Chinese collection was segmented using the Stanford segmenter [138], English topics and the French collection were tokenized using the OpenNLP tokenizer,13 and Arabic was tokenized and stemmed using the Lucene package.14 For English and French, we also lowercased text, stemmed using the Snowball stemmer, and removed stop words.
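The final normalization steps for English and French text can be sketched as below. This is a simplified stand-in with a tiny sample stop word list, regex tokenization, and no stemming; the actual pipeline used the OpenNLP tokenizer, the Snowball stemmer, and a standard stop word list:

```python
import re

# Illustrative sample only; the experiments used a full stop word list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "for"}

def preprocess(text):
    """Lowercase, tokenize, and remove stop words (stemming omitted here)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("The translation of a query")` keeps only the content words `translation` and `query`.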
As discussed in Section 4.3, we implemented three techniques to construct a term translation probability distribution: token-based (using Pr_token, as described
11http://projects.ldc.upenn.edu/TIDES/mt2003.html 12http://www.statmt.org/europarl 13http://opennlp.apache.org 14http://lucene.apache.org
by Equation 4.3), grammar-based (using Pr_SCFG or Pr_PBMT, as described by Equation 4.7), and translation-based (using Pr_nbest, as described by Equation 4.15).15
We assessed these three approaches (as well as the three heuristics for one-to-many token alignments) by (i) comparing them against each other, and (ii) measuring the benefit of a linear combination, which we call the interpolated approach (using Pr_c, as described by Equation 4.18).
We used Mean Average Precision (MAP) as the evaluation metric. The baseline token-based model achieves a MAP of 0.271 for Arabic, 0.150 for Chinese, and
0.262 for French. Direct comparisons to results reported at TREC, NTCIR, and
CLEF (respectively) are hard to make because of differences in experimental conditions, but the comparisons we are able to make suggest that these baseline MAP values are reasonable. The best results at those evaluation campaigns often employ blind relevance feedback, multiple lexical resources, and/or very long queries. While these techniques can be useful in deployed applications, we have chosen not to run such conditions in order to avoid masking the effects that we wish to study. For Arabic, the best reported results from TREC 2002 were close to 0.400 MAP [43], but those results were achieved by performing query expansion and learning stem-to-stem mappings; our experiment design requires token-to-token mappings (which result in sparser alignments). For Chinese, the NTCIR-8 topics are in the form of questions, and systems that applied question rewriting performed better than those that did not. Also, 15 of the questions are about people, for which our vocabulary coverage was not tuned. If we disregard these 15 topics, our baseline system achieves 0.178, close to the best reported result with comparable settings, a MAP of 0.181 [158]. For French, our baseline achieves a score close to the single reported result at CLEF 2006 that did not incorporate blind relevance feedback (0.261 MAP) [123].
15We fixed C = 0.95, L = 0.005, H = 15, n = 10 for all models after manually trying a range of values.
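MAP, the evaluation metric used throughout this section, can be computed from per-topic rankings as follows; this is the standard textbook definition, not code from our system:

```python
def average_precision(ranked_docs, relevant):
    """AP for one topic: mean of precision@k over ranks of relevant docs."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    # Unretrieved relevant documents contribute zero precision.
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: average AP over topics; runs is a list of
    (ranked_docs, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For instance, a ranking [d1, d2, d3] with relevant set {d1, d3} yields AP = (1/1 + 2/3) / 2 ≈ 0.833.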
In the remainder of this section, we present several aspects of our detailed evaluation. In Section 4.4.1, we describe a comparison between alternative variants of our implementation, and provide an argument for why we focused on a specific setting. In Section 4.4.2, we present a comparison between the different CLIR models, including effectiveness results and a thorough topic-by-topic analysis. Finally, in Section 4.4.3, we compare the various models in terms of efficiency, using the total running time as the metric for comparison.
4.4.1 A Comparison of System Variants
We performed an assessment of the following variants of our approach: (a) using either a flat or hierarchical MT model, and (b) using either one-to-many, one-to-one, or one-to-none as the method for accumulating probabilities when there are multiple alignments for a token. Experimental results are summarized in Table 4.2, with the Arabic (ar), Chinese (zh), and French (fr) runs presented in subsequent parts.16 For each case, one row represents results with hierarchical PBMT (using cdec), and the other row represents results with flat PBMT (using Moses).
16For the one-best approach, one-to-one and one-to-many results are very close, so we discarded the latter for space reasons.
In order to obtain the MAP values labeled as “interpolated,” we performed a grid search on the interpolation weights λ1 and λ2 (in increments of 0.1, ranging from 0 to 1) using the interpolated model Pr_c. The setting with the best MAP value is reported in the table. Notice that these MAP values correspond to the best our model can possibly achieve, since the interpolation weights are optimized on the same set of topics used for testing. We also present results with weights learned from cross-validation experiments in Section 4.4.2.
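The grid search over the interpolation weights can be sketched as below; `evaluate_map` is a hypothetical callback that runs retrieval with a given (λ1, λ2) pair and returns MAP on the topic set:

```python
def grid_search(evaluate_map, step=0.1):
    """Sweep lam1, lam2 over [0, 1] in the given increments, subject to
    lam1 + lam2 <= 1, and return the weight pair with the highest MAP.
    """
    best = (None, None, -1.0)
    steps = int(round(1.0 / step))
    for i in range(steps + 1):
        for j in range(steps + 1 - i):          # keep lam1 + lam2 <= 1
            lam1, lam2 = i * step, j * step
            score = evaluate_map(lam1, lam2)
            if score > best[2]:
                best = (lam1, lam2, score)
    return best
```

With a 0.1 step this evaluates 66 weight settings per collection, which is cheap relative to running retrieval itself.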
From Table 4.2, we notice that the two MT models exhibit large differences in the grammar-based approach. We performed a randomized significance test that has been shown to work well for CLIR [131]. According to this test, the cdec-based hierarchical MT approach statistically significantly outperforms the Moses-based flat MT approach (with 95% confidence, p < 0.05) for all nine settings from {ar, zh, fr} × {one-to-many, one-to-one, one-to-none}, using the grammar-based CLIR model. Furthermore, when we compare the best of the three settings for each MT approach (i.e., cdec and Moses) separately, the p-value is still under 0.1. This supports the argument that the SCFG-based translation model is better at representing query translation alternatives for CLIR, possibly due to the more expressive representation
(through use of non-terminal variables).
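A randomized significance test of this kind can be sketched as a paired sign-flip test over per-topic average precision scores; the details below (number of trials, two-sided counting) are illustrative assumptions rather than the exact procedure of [131]:

```python
import random

def randomization_test(ap_a, ap_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic AP scores.

    Under the null hypothesis the system labels are exchangeable, so we
    randomly swap each topic's pair of scores and count how often the
    absolute difference in MAP is at least as large as observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(ap_a) - sum(ap_b)) / len(ap_a)
    extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(ap_a, ap_b):
            if rng.random() < 0.5:
                a, b = b, a                     # flip this topic's pair
            diff += a - b
        if abs(diff) / len(ap_a) >= observed:
            extreme += 1
    return extreme / trials                     # approximate p-value
```

Identical score lists give a p-value of 1.0, while large consistent per-topic gaps drive the p-value toward zero.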
Another implication of using different MT models is grammar size: the translation grammar of a flat MT system is much larger than the SCFG of a hierarchical MT system. This is because certain linguistic transformations can be expressed with a single hierarchical rule, but need to be instantiated under many lexicalizations with a flat model. As a result, the query processing time for Pr_PBMT is an order of magnitude higher than for Pr_SCFG. This might be counter-intuitive, since decoding with flat PBMT is usually faster than decoding with hierarchical MT systems,17 due to constraints imposed by language modeling. However, we should emphasize that the grammar-based CLIR approach (see Section 4.3.1) uses the internal translation model representation without help from the decoder’s optimized search capabilities.
In our experiments, the difference between flat and hierarchical MT models becomes most apparent for the Arabic collection, where the grammar-based model is the most effective. Due to the benefits of a hierarchical grammar, the best result with cdec is higher than the best result with Moses, with statistically significant differences for heuristics one-to-none and one-to-one (marked as ‡ in Table 4.2).
For the other two collections, we observe a similarly superior performance with the interpolated approach using cdec, yet the differences are not statistically significant
(i.e., p > 0.05).
It is also interesting that the differences between the two MT models are almost non-existent for the 10-best approach, because the final translation output is very similar. Therefore, it might be better to use flat representations for the 10-best approach for efficiency, since the end-to-end translation process might be faster than with hierarchical models. We discuss efficiency in more detail in Section 4.4.3.
The second system variant we evaluated is the heuristic for handling source tokens that are aligned to multiple target tokens, namely one-to-none, one-to-one, and one-to-many. Experimental results reveal that the one-to-many method seems to dominate the competition, with the best MAP score in 18 out of 24 cases. Four of the six cases in which one-to-many is not the best are from Chinese runs. This might be due to the word segmentation step when preprocessing Chinese, as we see that the one-to-one heuristic performs relatively better in Chinese runs.
17Actual running time depends on system parameters that control how much pruning takes place during search.
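To make the contrast between the heuristics concrete, they can be sketched as below. The exact probability bookkeeping of our implementation (Section 4.3) differs in details, e.g., which single target token one-to-one keeps, so treat this as illustrative:

```python
def accumulate(alignments, heuristic):
    """Build a term-translation distribution for one source token.

    alignments: list of (target_tokens, prob) pairs, where target_tokens
    is a tuple of the target-side tokens aligned to the source token.
    """
    counts = {}
    for targets, prob in alignments:
        if len(targets) == 1:
            keys = targets                      # unambiguous alignment
        elif heuristic == "one-to-many":
            keys = targets                      # credit every aligned token
        elif heuristic == "one-to-one":
            keys = targets[:1]                  # keep one token (first, here)
        else:                                   # "one-to-none"
            keys = ()                           # discard multi-alignments
        for t in keys:
            counts[t] = counts.get(t, 0.0) + prob
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}
```

For an alignment set {("a", "b"): 0.5, ("c",): 0.5}, one-to-none keeps only "c", one-to-one keeps "a" and "c" equally, and one-to-many spreads mass over all three tokens.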
Based on this analysis, we decided to focus on using an SCFG-based MT approach (cdec) and to apply the one-to-many heuristic for the rest of our evaluation.
                  grammar               1-best          10-best               interpolated
Lang  token  MT     many   one    none   many/one none   many   one    none   many     one     none
ar    0.271  cdec   0.293  0.282  0.302  0.249    0.249  0.255  0.249  0.248  0.293∗†  0.282‡  0.302∗‡
             Moses  0.274  0.266  0.273  0.249    0.232  0.264  0.254  0.249  0.280†   0.274   0.276
zh    0.150  cdec   0.182  0.188  0.170  0.155    0.155  0.159  0.159  0.159  0.192∗†  0.193∗  0.182∗
             Moses  0.156  0.167  0.151  0.155    0.146  0.169  0.163  0.163  0.183∗†  0.177∗  0.188∗
fr    0.262  cdec   0.297  0.288  0.292  0.276    0.235  0.307  0.304  0.295  0.318∗†  0.314∗  0.315∗
             Moses  0.264  0.257  0.262  0.297    0.242  0.289  0.300  0.282  0.307∗   0.301   0.300
Table 4.2: A summary of experimental results under different conditions, for all three CLIR tasks. Superscripts ∗ and † indicate the result is statistically significantly better than the token-based and one-best approaches, respectively. Superscript ‡ indicates that using cdec is statistically significantly better than Moses (each cell in the “grammar” column should be marked by ‡, though we omit these marks for space reasons). For the 1-best model, heuristics one-to-one and one-to-many yield very close results, so we discarded the latter for space reasons.
4.4.2 A Comparison of CLIR Approaches
In this section, we present an evaluation of the three query translation methods: token-based, grammar-based, and translation-based CLIR, as well as the method based on a linear interpolation of the three individual models.
In order to see the effectiveness of the interpolated model with respect to parameters λ1 and λ2, we performed a grid search by applying values in increments of 0.1 (ranging from 0 to 1) to the interpolated model Pr_c. Experimental results are summarized in Table 4.3 and illustrated in Figures 4.8, 4.9 and 4.10. In each
figure, we provide a scatterplot of MAP scores within a range of values for λ1 and
λ2. For readability, figures only include a representative subset of λ2 settings, where different lines represent different values for λ2. To distinguish the extreme settings of λ2 = 0 and λ2 = 1, we use a filled circle and square, respectively.
The left edge represents λ1 = 0, meaning that we do not use probabilities learned from the n-best derivations (i.e., Pr_nbest) in our interpolation. Along the y-axis at the left edge, we see results for various settings of λ2, which controls how much weight is put on Pr_SCFG versus Pr_token. Within these settings, a particularly interesting one is when λ2 is set to 0. In this case, the approach is solely based on context-independent translation probabilities (i.e., Pr_token), which is the baseline model (call this condition A). When λ2 is set to 1, we rely on grammar-based term translation probabilities (i.e., Pr_SCFG; call this condition B). By contrast, at the right edge, λ1 = 1, so we rely only on Pr_nbest when translating query terms (call this condition C). For reference, the dotted horizontal line represents simply taking the one-best translation from the MT system (i.e., as described by Equation 4.14; call this condition D).
[Figure 4.8: MAP (y-axis, 0.24 to 0.30) versus λ1 (x-axis, 0 to 1) for the interpolated model, with separate lines for λ2 ∈ {0, 0.3, 0.6, 1.0} and the 1-best reference line. Conditions A, B, C, and D (MAP = 0.2422) are marked, along with the best setting E (MAP = 0.2931).]