Translation Events in Cross-Language Information Retrieval
Syracuse University
SURFACE

The School of Information Studies Faculty Scholarship
School of Information Studies (iSchool)

2003

Translation Events in Cross-Language Information Retrieval: Lexical Ambiguity, Lexical Holes, Vocabulary Mismatch, and Correct Translations

Anne Roel Diekema
Syracuse University

Follow this and additional works at: http://surface.syr.edu/istpub
Part of the Library and Information Science Commons

Recommended Citation
Translation Events in Cross-Language Information Retrieval: Lexical Ambiguity, Lexical Holes, Vocabulary Mismatch, and Correct Translations (2003)

This Article is brought to you for free and open access by the School of Information Studies (iSchool) at SURFACE. It has been accepted for inclusion in The School of Information Studies Faculty Scholarship by an authorized administrator of SURFACE. For more information, please contact [email protected].

TRANSLATION EVENTS IN CROSS-LANGUAGE INFORMATION RETRIEVAL: LEXICAL AMBIGUITY, LEXICAL HOLES, VOCABULARY MISMATCH, AND CORRECT TRANSLATIONS

by
ANNE R. DIEKEMA
Bac., Haagse Hogeschool, 1993
M.L.S., Syracuse University, 1995

DISSERTATION
School of Information Studies, Syracuse University
May 2003

Anne Diekema: Dissertation (May 22, 2003)

Copyright 2003 Anne Roel Diekema
All rights reserved

ABSTRACT

Cross-Language Information Retrieval (CLIR) systems enable users to formulate queries in their native language to retrieve documents in foreign languages. Because queries and documents in CLIR do not necessarily share the same language, translation is needed before matching can take place. This translation step tends to cause a reduction in the retrieval performance of CLIR as compared to monolingual information retrieval. The prevailing CLIR approach and the focus of this study is query translation.
The translation of queries is inherently difficult due to the lack of a one-to-one mapping of a lexical item and its meaning, which creates lexical ambiguity. This, and other translation problems, result in translation errors which impact CLIR performance. To understand the events occurring in cross-language retrieval query translation and the relation of these events to retrieval performance, the study explored the following research questions: 1) What kinds of translation events affect cross-language retrieval? 2) In what way does the presence of certain translation events in query translation affect retrieval performance? The study followed a two-phase multi-method approach. In phase one, a taxonomy of translation events was created through content analysis of queries and their translations in combination with an examination of the literature. In the second and final phase, a subset of the test queries was coded using the taxonomy resulting from phase one. These queries were then used in information retrieval experimentation to assess the impact of the translation events on retrieval performance.

ACKNOWLEDGEMENTS

I would like to thank my committee members Jaklin Kornfilt, Barbara Kwasnik, Liz Liddy, Bob Oddy, and Jeff Stanton for spending time in the realms of cross-language information retrieval and statistics, and providing helpful comments and insights. It was a pleasure to work with them. I would also like to thank Jeffrey Katzer, one of my original committee members. Although Jeffrey passed away shortly before my proposal defense, his teachings kept inspiring me.

Information Retrieval (IR) is a big field and there are a number of people who have increased my understanding and influenced my thinking in this area.
Liz Liddy, Arie Noordzij, and Bob Oddy were my original teachers in the field, while others like Jiangping Chen, Ted Diamond, Wen Hsiao, Wessel Kraaij, Farhad Oroumchian, Miguel Ruiz, and Arjen de Vries provided insight through discussions and work on research projects. The IR system used in this dissertation was programmed in Perl. Many thanks go to Farhad Oroumchian and Arvind Srinivasan for helping me get my Perl legs in the early years when I did not know an array from a hash. Stéphane Dubon did great work setting up Linux boxes and networking my apartment. The readability of this dissertation was greatly enhanced by the work of Eileen Allen, Sarah Harwell, and Andrew Roginski who were all excellent editors. I am especially grateful to Liz for providing me with a wonderful work environment for the last 7 years, where I gained experience as a researcher while working with a group of dedicated and inspiring colleagues on a wide variety of projects.

Naturally, there is more to life than working on a doctorate and I am blessed with a great set of friends who provided continuing support and welcome diversions during these many years. Thanks to Keith Berger and Bianca Flikweert for many wonderful meals, hikes, and (Dutch!) conversation. Thanks also to Marcus van Bers, Blake Rodgers, Arvind Srinivasan, and Kate Stewart for numerous social hours on the ice and the bike. Additional thanks go to Blake for simply being a joy to be around. I also enjoyed the letters and emails of Noor Evertsen and Françoise le Griep, friends who did not exactly live nearby but kept in touch nonetheless.

I could not have completed my studies without support from the home front. I am especially grateful for the company of Arie and Angus Diekema who always provided a listening ear and good company on many fabulous walks.
And lastly, thanks to my (extended) family Jan Diekema, Marian Diekema-Hensums, Maurits Diekema, Myrthe Diekema, Simone Lcker, Aafke Stalman, and Lenie de Vries for taking an interest in my work and believing I could actually do this. I dedicate this dissertation to my parents Jan and Marian, who have always stressed that getting educated is never a waste of time.

Financial support for this dissertation was provided through Beta Phi Mu in the form of a Eugene Garfield Doctoral Dissertation Fellowship and by the Institute of Scientific Information (ISI) in the form of a Doctoral Dissertation Proposal Award.

Anne Diekema
Syracuse, New York
June 30, 2003

TABLE OF CONTENTS

1 INTRODUCTION TO THE STUDY
  1.1 INTRODUCTION
  1.2 INFORMATION RETRIEVAL
  1.3 CROSS-LANGUAGE INFORMATION RETRIEVAL
  1.4 STATEMENT OF THE PROBLEM
  1.5 RESEARCH QUESTIONS
    1.5.1 What kinds of translation events affect cross-language retrieval?
    1.5.2 In what way does the presence of certain translation events in query translation affect retrieval performance?
  1.6 SCOPE OF THE STUDY
  1.7 LIMITATIONS OF THE STUDY
  1.8 SIGNIFICANCE OF THE STUDY
  1.9 SUMMARY
2 BACKGROUND
  2.1 INTRODUCTION
  2.2 MATCHING AND TRANSLATION IN CROSS-LANGUAGE INFORMATION RETRIEVAL
    2.2.1 Matching approaches in CLIR
    2.2.2 Translation knowledge for query translation
  2.3 TRANSLATION AND ITS DIFFICULTIES
    2.3.1 Translation
    2.3.2 Specific problems in translation
      2.3.2.1 Lexical ambiguity
      2.3.2.2 Lexical mismatches
      2.3.2.3 Lexical holes
      2.3.2.4 Figures of speech
      2.3.2.5 Multiword lexemes
      2.3.2.6 Specialized terminology and proper nouns
      2.3.2.7 False cognates
    2.3.3 Query translation problems in CLIR
      2.3.3.1 Lexical ambiguity
      2.3.3.2 Lack of translation coverage