Assessing Relevance Using Automatically Translated Documents for Cross-Language Information Retrieval
Total Page:16
File Type:pdf, Size:1020Kb
Middlesex University Research Repository An open access repository of Middlesex University research http://eprints.mdx.ac.uk Orengo, Viviane Moreira (2004) Assessing relevance using automatically translated documents for cross-language information retrieval. PhD thesis, Middlesex University. [Thesis] Final accepted version (with author’s formatting) This version is available at: https://eprints.mdx.ac.uk/13606/ Copyright: Middlesex University Research Repository makes the University’s research available electronically. Copyright and moral rights to this work are retained by the author and/or other copyright owners unless otherwise stated. The work is supplied on the understanding that any use for commercial gain is strictly forbidden. A copy may be downloaded for personal, non-commercial, research or study without prior permission and without charge. Works, including theses and research projects, may not be reproduced in any format or medium, or extensive quotations taken from them, or their content changed in any way, without first obtaining permission in writing from the copyright holder(s). They may not be sold or exploited commercially in any format or medium without the prior written permission of the copyright holder(s). Full bibliographic details must be given when referring to, or quoting from full items including the author’s name, the title of the work, publication details where relevant (place, publisher, date), pag- ination, and for theses or dissertations the awarding institution, the degree type awarded, and the date of the award. If you believe that any material held in the repository infringes copyright law, please contact the Repository Team at Middlesex University via the following email address: [email protected] The item will be removed from the repository while any claim is being investigated. See also repository copyright: re-use policy: http://eprints.mdx.ac.uk/policies.html#copy Middlesex University Lond on " Middlesex University Research Repository: an open access repository of Middlesex University research http://eprints.mdx.ac.uk Orengo, Viviane Moreira, 2004. Assessing relevance using automatically translated documents for cross language information retrieval. Available from Middlesex University's Research Repository. Copyright: Middlesex University Research Repository makes the University's research available electronically. Copyright and moral rights to this thesis/research project are retained by the author and/or other copyright owners. The work is supplied on the understanding that any use for commercial gain is strictly forbidden. A copy may be downloaded for personal, non-commercial, research or study without prior permission and without charge. Any use of the thesis/research project for private study or research must be properly acknowledged with reference to the work's full bibliographic details. This thesis/research project may not be reproduced in any format or medium, or extensive quotations taken from it, or its content changed in any way, without first obtaining permission in writing from the copyright holder(s). If you believe that any material held in the repository infringes copyright law, please contact the Repository Team at Middlesex University via the following email address: [email protected] The item will be removed from the repository while any claim is being investigated. Assessing Relevance Using Automatically Translated Documents for Cross-Language Information Retrieval A thesis submitted to Middlesex University in partial fulfilment of the requirements for the degree of Ph.D. March 2004. ~ Middlesex University Viviane Moreira Orengo Acknowledgements I would like to thank the following people for their help with this thesis. • Dr. Christian Huyck, for the excellent supervision and constant motivation provided throughout this work. • Dr. Michael Wing, for the advice and for enabling me to start off this PhD. • Dr. Mounia Lalmas, for the helpful comments. • Ana LUIsa Barros for the invaluable help with the user experiment. • My parents, Paulo and Tania, and my brother Henrique for the support and encouragement. • My husband, Giuliano, for his love and support. • My friends, Joana, Silvana, Victor, Rosane and Andrea for their friendship and company. • The staff from Computing Science Research Office for their constant support. ABSTRACT This thesis focuses on the Relevance Feedback (RF) process, and the scenario considered is that of a Portuguese-English Cross-Language Information Retrieval (CUR) system. CUR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect queries, especially if given just one attempt. The process aims at improving the query specification, which will lead to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance. In that context, two main questions were posed. The first one relates to the user's ability in assessing the relevance of texts in a foreign language, texts hand translated into their language and texts automatically translated into their language. The second question concerns the relationship between the accuracy of the participant's judgements and the improvement achieved through the RF process. In order to answer those questions, this work performed an experiment in which Portuguese speakers were asked to judge the relevance of English documents, documents hand-translated to Portuguese, and documents automatically translated to Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged documents on the performance of RF is overall just moderate, and varies greatly for different query topics. This work advances the existing research on RF by considering a CUR scenario and carrying out user experiments, which analyse aspects of RF and CUR that remained unexplored until now. The contributions of this work also include: the investigation of CUR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and the carrying out of several experiments using Latent Semantic Indexing which contribute data points to the CUR theory. TABLE OF CONTENTS 1 INTRODUCTION ......................................................................................................... 1 1.1 Scope and Objectives of this Research ......................................................................... 1 1.2 Thesis Structure .............................................................................................................. 3 2 LITERATURE REVIEW ................................................................................................. 4 2.1 Information Retrieval. .................................................................................................... 4 2.1.1 Classic Models .............................................................. " ..................................................... 6 2.1.1.1 Boolean Model ........................................................................................................................... 7 2.1.1.2 Vector Space Model. .................................................................................................................. 7 2.1.1.3 Probabilistic Model ................................................................................................................... 8 2.1.2 Evaluation in Information Retrieval ............................................................................... 10 2.2 Cross-Language Information RetrievaL ................................................................... 14 2.2.1 Related Terms .................................................................................................................... 15 2.2.2 The Problem of CLIR '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' .............. 16 2.2.3 Brief History ............................................................................................ """""'"'''''''''''''' 17 2.2.4 Areas of Application ......................................................................................... " .............. 17 2.2.5 The User of CLIR Systems ................................................................................................ 18 2.2.6 Evaluation of CLIR Systems '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 19 2.2.7 Approaches to CLIR .......................................................................................................... 20 2.2.7.1 Machine Translation ............................................................................................................... 20 2.2.7.2 Thesauri .................................................................................................................................... 21 2.2.7.3 Dictionaries .............................................................................................................................. 22 2.2.7.4 Corpus-based Techniques ...................................................................................................... 23 2.2.7.5 Current State of the Art .......................................................................................................... 24