When Information Retrieval Meets Machine Translation

Total Page:16

File Type:pdf, Size:1020Kb

When Information Retrieval Meets Machine Translation ABSTRACT Title of dissertation: SEARCHING TO TRANSLATE AND TRANSLATING TO SEARCH: WHEN INFORMATION RETRIEVAL MEETS MACHINE TRANSLATION Ferhan Ture, Doctor of Philosophy, 2013 Dissertation directed by: Professor Jimmy Lin Department of Computer Science and College of Information Studies With the adoption of web services in daily life, people have access to tremen- dous amounts of information, beyond any human's reading and comprehension ca- pabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple lan- guages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR when the search takes place in a multi-lingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluent and retaining the original meaning), which helps people understand what is being written, regardless of the source language. Putting these together, we observe that search and translation technologies are part of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output. In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilin- gual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search processes. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. Finally, we propose a general architecture, in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas pre- sented in this dissertation will spur more interest in the integration of search and translation technologies. SEARCHING TO TRANSLATE AND TRANSLATING TO SEARCH: WHEN INFORMATION RETRIEVAL MEETS MACHINE TRANSLATION by Ferhan Ture Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2013 Advisory Committee: Professor Jimmy Lin, Chair/Advisor Professor Hal Daum´eIII Professor Bonnie Dorr Professor Douglas W. Oard Professor Philip Resnik Professor Sennur Ulukus c Copyright by Ferhan Ture 2013 Dedication to my parents, my family, and my dearest Elif... ii Acknowledgments I am indebted to all the people who have made this dissertation possible and because of whom my graduate experience has been one that I will cherish forever. I am very lucky to have had great advisors, friends, and family, and would like to express my gratitude with a few words. First and foremost I'd like to thank my advisor, Jimmy Lin, from whom I could not have asked more. Above all, he has always made sure that I have everything to make me comfortable throughout my Ph.D life. He has played a crucial role in preparing me as a researcher, by introducing me to interesting topics, setting a high bar for standards, and always encouraging me to keep active in the field. He has also taught me well as a software developer, by passing along his industry experiences, involving me in the development of high-quality software. Additionally, I have benefited tremendously from Jimmy as an excellent role model in academia, and he has been a great inspiration during challenging times. I would also like to thank Doug Oard, for guiding me throughout the process with his great vision and breadth of knowledge, often helping me to see the bigger picture. I feel very special to have such a great expert of Information Retrieval in my committee. Philip Resnik is one of the reasons I chose this program, and I have learned a lot from him both as a teacher and researcher. His enthusiasm has encouraged me to keep excited about my work and not give up easily. As a pioneer in the field of Machine Translation, Philip's vision has influenced my research in a profound way. iii I would also like to thank Mihai Pop for his valuable feedback as one of my proposal committee members and also Hal Daum´e,Bonnie Dorr, and Sennur Ulukus for agreeing to serve on my dissertation committee and for sparing their invaluable time reviewing the manuscript. Without the endless support of my previous advisors in Turkey { Deniz Yuret, Kemal Oflazer, and Esra Erdem { this dissertation would have been a distant dream. I would like to thank them for believing in me during the early years of my career. These five years wouldn't have been the same if it wasn't for my colleagues in the CLIP Lab. I would like to thank all of them for being part of this awesome research group. I have learned a lot from the weekly seminars, as well as from daily interactions within the lab. I would like to thank Hua, Ke Zhai, Hassan, Tan, Viet-An, Asad, Junhui, Yuening, Amit, and Greg, and especially Vlad, Ke Wu, Chris Dyer, Kristy, Hendra, Zhongqiang, Jags, Tamer, Mossaab, Lidan, and Nima for being such cool friends, colleagues, lab mates, and more. I would also like to acknowledge help and support from some of the staff members, especially Joe Webster for his extraordinary efforts to make sure that computing resources were ready when needed. My friends outside the lab deserve a special thank you. Many thanks to Tu~grul, S¸imal, Ali Fuad, Beng¨u,Bedrettin, Enes, Yasin, Salih, Orhan, G¨ok¸ce,Edvin, Melis, David, Jose, Pilar, Aslı, Mustafa Dayıo˜glu,Mustafa Hatipo~glu, Om¨urand¨ Bahadır for their sincere friendship. Hanging out with them has been the best cure to the daily stress of graduate life. I am also thankful to all of my friends in Turkey, who have provided moral support despite the physical distance. iv Last, but not least, I am grateful to be part of a wonderful family. My parents took great responsibility and went through many difficulties, without any hesitation, just to make sure that I receive good education. My entire family has always stood by me and guided me through my career. Their love and support brings me peace and comfort. My dear wife, Elif, is the one who has made this journey enjoyable. She has been a constant motivation for me, and I cannot even imagine how it would've been possible without her endless love and care. It is impossible to remember all, and I apologize to those I've inadvertently left out. Thank you all! v Table of Contents List of Figures ix 1 Introduction 1 1.1 Motivation . .1 1.2 Outline . .3 1.2.1 Searching to Translate . .3 1.2.2 Translating to Search . .6 1.3 Contributions . .8 List of Abbreviations 1 2 Related Work 12 2.1 Machine Translation . 12 2.1.1 Word Alignment . 14 2.1.2 Translation Model . 18 2.1.3 Decoding . 22 2.1.4 Evaluation . 23 2.2 Information Retrieval . 24 2.2.1 Vector-Space Model . 25 2.2.2 Other Models . 26 2.2.3 Retrieval . 29 2.2.4 Evaluation . 31 2.2.5 IR Across Languages . 33 2.3 Context-Sensitive CLIR . 38 2.4 Pairwise Similarity . 46 2.5 Cross-Lingual Pairwise Similarity . 51 2.6 Extracting Parallel Text for MT . 52 2.7 Scalability / MapReduce . 59 vi 3 Searching to Translate: Efficiently Extracting Bilingual Data from the Web 63 3.1 Overview . 63 3.2 Phase 1: Large-Scale Cross-Lingual Pairwise Similarity . 65 3.2.1 Pairwise Similarity . 65 3.2.2 Cross-Lingual Pairwise Similarity . 66 3.2.2.1 MT vs. CLIR for Cross-Lingual Pairwise Similarity . 68 3.2.3 Approach . 72 3.2.3.1 Preprocessing Documents . 73 3.2.3.2 Generating Document Signatures . 73 3.2.3.3 Sliding Window Algorithm . 78 3.2.3.4 Basic MapReduce Implementation . 80 3.2.3.5 Improved MapReduce Implementation . 84 3.2.4 Analytical Model . 86 3.2.5 Experimental Evaluation . 90 3.2.5.1 Effectiveness vs. Efficiency . 92 3.2.5.2 Parameter Exploration . 97 3.2.5.3 Error Analysis . 101 3.3 Phase 2: Extracting Bilingual Sentence Pairs . 103 3.3.1 Classifier Model . 105 3.3.2 Bitext Extraction Algorithm . 108 3.3.3 Experimental Evaluation . 112 3.4 Comprehensive MT Experiments . 123 3.4.1 Results . 125 3.4.2 Efficiency . 129 3.4.3 Error Analysis . 130 3.5 Conclusions and Future Work . 135 4 Translating to Search: Context-Sensitive Query Translation for Cross-Language Information Retrieval 138 4.1 Overview . 138 4.2 Context-Independent Query Translation . 139 4.3 Context-Sensitive Query Translation . 144 4.3.1 Learning Probabilities from Translation Model . 145 4.3.2 Learning Probabilities from n-best Derivations . 154 4.3.3 Combining Sources of Evidence . 159 4.4 Evaluation . 161 4.4.1 A Comparison of System Variants . 164 4.4.2 A Comparison of CLIR Approaches . 169 4.4.3 Efficiency . 180 4.5 Conclusions and Future Work . 185 vii 5 Searching to Translating to Search Again: An Architecture for Con- necting IR and MT 188 5.1 Introduction .
Recommended publications
  • Modeling Communities in Information Systems: Informal Learning Communities in Social Media"
    "Modeling Communities in Information Systems: Informal Learning Communities in Social Media" Von der Fakultät für Mathematik, Informatik und Naturwissenschaften der RWTH Aachen University zur Erlangung des akademischen Grades einer Doktorin der Naturwissenschaften genehmigte Dissertation vorgelegt von Zinayida Kensche, geb. Petrushyna, MSc. aus Odessa, Ukraine Berichter: Universitätsprofessor Professor Dr. Matthias Jarke Universitätsprofessor Professor Dr. Marcus Specht Privatdozent Dr. Ralf Klamma Tag der mündlichen Prüfung: 17.11.2015 Diese Dissertation ist auf den Internetseiten der Universitätsbibliothek online verfügbar. i c Copyright by 2016 All Rights Reserved ii Abstract Information modeling is required for creating a successful information system while modeling of communities is pivotal for maintaining community information systems (CIS). Online social media, a special case of CIS, have been intensively used but not usually adopted for learning community needs. Thus community stakeholders meet problems by supporting learning communities in social media. Under the prism of Community of Practice theory, such communities have three dimensions that are responsible for community sustainability: mutual engagement, joint enterprises and shared repertoire. Existing modeling solutions use either perspectives of learning theories, or anal- ysis of learner or community data captured in social media but rarely combine both approaches. Therefore, current solutions produce community models that supply only a part of community stakeholders with information that can hardly describe community success and failure. We also claim that community models must be created based on community data analysis integrated with our learning community dimensions. More- over, the models need to be adapted according to environmental changes. This work provides a solution to continuous modeling of informal learning com- munities in social media.
    [Show full text]
  • Ranlpstud 2019
    RANLPStud 2019 Proceedings of the Student Research Workshop associated with The 12th International Conference on Recent Advances in Natural Language Processing (RANLP 2019) 2–4 September, 2019 Varna, Bulgaria STUDENT RESEARCH WORKSHOP ASSOCIATED WITH THE INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING’2019 PROCEEDINGS Varna, Bulgaria 2–4 September 2019 ISSN 1314-9156 Designed and Printed by INCOMA Ltd. Shoumen, BULGARIA ii Preface The RANLP 2019 Student Research Workshop (RANLPStud 2019) is a special track of the established international conference Recent Advances in Natural Language Processing (RANLP 2019), now in its twelfth edition. The RANLP Student Research Workshop is being organised for the sixth time and this year is running in parallel with the other tracks of the main RANLP 2019 conference. The target of RANLPStud 2019 is to be a discussion forum and provide an outstanding opportunity for students at all levels (Bachelor, Masters, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. The RANLP 2019 Student Research Workshop received a large number of submissions (23), a fact which was reflecting the record number of events, sponsors, submissions, and participants at the main RANLP 2019 conference. We have accepted 2 excellent student papers as oral presentations and 12 submissions will be presented as posters. The final acceptance rate of the workshop was 60%. We made our best to make the reviewing process in the best interest of our authors, by asking our reviewers to give as most exhaustive comments and suggestions as possible as well as to maintain an encouraging attitude.
    [Show full text]
  • The Case of 13 Wikipedia Instances
    Interaction Design and Architecture(s) Journal - IxD&A, N.22, 2014, pp. 34-47 The Impact of Culture On Smart Community Technology: The Case of 13 Wikipedia Instances Zinayida Petrushyna1, Ralf Klamma1, Matthias Jarke1,2 1 Advanced Community Information Systems Group, Information Systems and Databases Chair, RWTH Aachen University, Ahornstrasse 55, 52056 Aachen, Germany 2 Fraunhofer Institute for Applied Information Technology FIT, 53754 St. Augustin, Germany {petrushyna, klamma}@dbis.rwth-aachen.de [email protected] Abstract Smart communities provide technologies for monitoring social behaviors inside communities. The technologies that support knowledge building should consider the cultural background of community members. The studies of the influence of the culture on knowledge building is limited. Just a few works consider digital traces of individuals that they explain using cultural values and beliefs. In this work, we analyze 13 Wikipedia instances where users with different cultural background build knowledge in different ways. We compare edits of users. Using social network analysis we build and analyze co- authorship networks and watch the networks evolution. We explain the differences we have found using Hofstede dimensions and Schwartz cultural values and discuss implications for the design of smart community technologies. Our findings provide insights in requirements for technologies used for smart communities in different cultures. Keywords: Social network analysis, Wikipedia communities, Hofstede dimensions, Schwartz cultural values 1 Introduction People prefer to leave in smart cities where their needs are satisfied [1]. The development of smart cities depends on the collaboration of individuals. The investigation of the flow [1] of knowledge created by the individuals allows the monitoring of city smartness.
    [Show full text]
  • Mining Cross-Cultural Relations from Wikipedia - a Study of 31 European Food Cultures
    Mining cross-cultural relations from Wikipedia - A study of 31 European food cultures Paul Laufer Claudia Wagner Graz University of Technology GESIS & U. of Koblenz Graz, Austria Cologne, Germany [email protected] [email protected] Fabian Flöck Markus Strohmaier GESIS GESIS & U. of Koblenz Cologne, Germany Cologne, Germany fabian.fl[email protected] [email protected] ABSTRACT the editor community of the Romanian-language Wikipedia For many people, Wikipedia represents one of the primary could either have a deviant mental picture of the French sources of knowledge about foreign cultures. Yet, differ- cuisine { or it might estimate the priorities of Romanian- ent Wikipedia language editions offer different descriptions speaking readers to rather be on meat-based French deli- of cultural practices. Unveiling diverging representations of catessen than on wine and baking goods. Further, the gen- cultures provides an important insight, since they may foster eral interest of the Romanian-speaking readers in the French the formation of cross-cultural stereotypes, misunderstand- cuisine (for example measured by the number of views of the ings and potentially even conflict. In this work, we explore article about French cuisine in the Romanian language edi- to what extent the descriptions of cultural practices in var- tion) might serve to potentially displease any Francophile, ious European language editions of Wikipedia differ on the since the Romanian speaking community might show no- example of culinary practices and propose an approach to tably less interest in the French kitchen than in the Russian mine cultural relations between different language commu- or Hungarian one. This hypothetical scenario serves as an nities trough their description of and interest in their own example for numerous similar real-world cases (which can- and other communities' food culture.
    [Show full text]
  • Digital Spaces in “Popular” Culture Ontologized
    Digital Spaces in “Popular” Culture Ontologized Mícheál Mac an Airchinnigh School of Computer Science and Statistics, University of Dublin, Trinity College Dublin, Ireland [email protected] Abstract. Electronic publication seems to have the biased connotation of a one-way delivery system from the source to the masses. It originates within the “Push culture.” Specifically, it triggers the notion of (a small number of) producers and the (masses of) consumers. The World Wide Web (but not the Internet) tends to reinforce this intrinsic paradigm, one that is not very different from schooling. Wikipedia, in all its linguistic varieties is a classical publication of the collective. Much enthusiasm is expressed for the English version. The rest of the other world languages do not fare so well. It is of interest to consider the nature and status of Wikipedia with respect to two specific fields, Mathematics (which is not so popular) and Soap Operas (popular for the masses, largely sponsored by the Corporations). Keywords: Digital Space, Intelligent Artificial Life, Ontology, Soap Opera 1. The Talking Drum “Before long, there were people for whom the path of communications technology had leapt directly from the talking drum to the mobile phone, skipping over the intermediate stages” [1][2] It comes as a shock for most, to know that the first successful transmission of message data (i.e., significant amount of meaningful information, in the technical scientific sense) at a distance, was done on African drums? From the book cited above one may, with clear conscience, jump to Wikipedia to see if there is information on the “Talking drum” [1, 3].
    [Show full text]
  • Gourmet Project
    GoURMET H2020–825299 D1.1 Survey of relevant low-resource languages Global Under-Resourced MEdia Translation (GoURMET) H2020 Research and Innovation Action Number: 825299 D1.1 – Survey of relevant low-resource languages Nature Report Work Package WP1 Due Date 30/04/2019 Submission Date 30/04/2019 Main authors Mikel L. Forcada Co-authors Miquel Espla-Gomis,` Juan Antonio Perez-Ortiz,´ V´ıctor M. Sanchez-´ Cartagena, Felipe Sanchez-Mart´ ´ınez Reviewers Alexandra Birch Keywords survey, languages, resources, machine translation Version Control v0.8 Status 1st Draft 21/04/2019 v1.0 Status Final Version 30/04/2019 page 1 of 86 GoURMET H2020–825299 D1.1 Survey of relevant low-resource languages Contents 1 Introduction6 1.1 Language pairs of interest for GoURMET......................7 2 Languages8 2.1 Afaan Oromoo (om, orm)...............................8 2.1.1 Factsheet...................................8 2.1.2 Contrasts with English............................8 2.1.3 Corpora.................................... 11 2.1.4 Resources................................... 12 2.1.5 Challenges for corpus-based MT from English............... 12 2.2 Bosnian (bs, bos)................................... 14 2.2.1 Factsheet................................... 14 2.2.2 Contrasts with English............................ 14 2.2.3 Corpora.................................... 14 2.2.4 Resources................................... 15 2.2.5 Challenges for corpus-based MT from English............... 15 2.3 Bulgarian (bg, bul).................................. 16 2.3.1 Factsheet................................... 16 2.3.2 Contrasts with English............................ 16 2.3.3 Corpora................................... 17 2.3.4 Resources................................... 20 2.3.5 Challenges for corpus-based MT from English............... 21 2.4 Croatian (hr, hrv).................................. 22 2.4.1 Factsheet................................... 22 2.4.2 Contrasts with English...........................
    [Show full text]
  • Proceedings of the Student Research Workshop (Ranlpstud 2019)
    RANLPStud 2019 Proceedings of the Student Research Workshop associated with The 12th International Conference on Recent Advances in Natural Language Processing (RANLP 2019) 2–4 September, 2019 Varna, Bulgaria STUDENT RESEARCH WORKSHOP ASSOCIATED WITH THE INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING’2019 PROCEEDINGS Varna, Bulgaria 2–4 September 2019 Series Online ISSN 2603-2821 Designed and Printed by INCOMA Ltd. Shoumen, BULGARIA ii Preface The RANLP 2019 Student Research Workshop (RANLPStud 2019) is a special track of the established international conference Recent Advances in Natural Language Processing (RANLP 2019), now in its twelfth edition. The RANLP Student Research Workshop is being organised for the sixth time and this year is running in parallel with the other tracks of the main RANLP 2019 conference. The target of RANLPStud 2019 is to be a discussion forum and provide an outstanding opportunity for students at all levels (Bachelor, Masters, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. The RANLP 2019 Student Research Workshop received a large number of submissions (23), a fact which was reflecting the record number of events, sponsors, submissions, and participants at the main RANLP 2019 conference. We have accepted 2 excellent student papers as oral presentations and 12 submissions will be presented as posters. The final acceptance rate of the workshop was 60%. We made our best to make the reviewing process in the best interest of our authors, by asking our reviewers to give as most exhaustive comments and suggestions as possible as well as to maintain an encouraging attitude.
    [Show full text]