Cross-Language Information Retrieval

Cross-Language Information Retrieval MC_Labosky_FM.indd i Achorn International 03/11/2010 10:17AM ii SynthesisOne liner Lectures Chapter in TitleHuman Language Technologies Editor Graeme Hirst, University of Toronto Synthesis Lectures on Human Language Technologies publishes monographs on topics relat- ing to natural language processing, computational linguistics, information retrieval, and spoken language understanding. Emphasis is placed on important new techniques, on new applica- tions, and on topics that combine two or more HLT subfields. Cross-Language Information Retrieval Jian-Yun Nie 2010 Data-Intensive Text Processing with MapReduce Jimmy Lin, Chris Dyer 2010 Semantic Role Labeling Martha Palmer, Daniel Gildea, Nianwen Xue 2010 Spoken Dialogue Systems Kristiina Jokinen, Michael McTear 2010 Introduction to Chinese Natural Language Processing Kam-Fai Wong, Wenji Li, Ruifeng Xu, Zheng-sheng Zhang 2009 Introduction to Linguistic Annotation and Text Analytics Graham Wilcock 2009 MC_Labosky_FM.indd ii Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd iii Achorn International 03/11/2010 10:17AM SYNTHESIS LESCTURES IN HUMAN LANGUAGE TECHNOLOGIES iii Dependency Parsing Sandra Kübler, Ryan McDonald, Joakim Nivre 2009 Statistical Language Models for Information Retrieval ChengXiang Zhai 2008 MC_Labosky_FM.indd ii Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd iii Achorn International 03/11/2010 10:17AM Copyright © 2010 by Morgan & Claypool All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher. Cross-Language Information Retrieval Jian-Yun Nie www.morganclaypool.com ISBN: 9781598298635 paperback ISBN: 9781598298642 ebook DOI: 10.2200/S00266ED1V01Y201005HLT008 A Publication in the Morgan & Claypool Publishers series SYNTHESIS LECTURES IN HUMAN LANGUAGE TECHNOLOGIES Lecture #8 Series Editor: Graeme Hirst, University of Toronto Series ISSN ISSN 1947-4040 print ISSN 1947-4059 electronic MC_Labosky_FM.indd iv Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd v Achorn International 03/11/2010 10:17AM Cross-Language Information Retrieval Jian-Yun Nie University of Montreal SYNTHESIS LECTURES IN HUMAN LANGUAGE TECHNOLOGIES #8 MC_Labosky_FM.indd iv Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd v Achorn International 03/11/2010 10:17AM vi ABSTRACT Search for information is no longer exclusively limited within the native language of the user, but is more and more extended to other languages. This gives rise to the problem of cross-language information retrieval (CLIR), whose goal is to find relevant information written in a different language to a query. In addition to the problems of monolingual information retrieval (IR), translation is the key problem in CLIR: one should translate either the query or the documents from a language to another. However, this translation problem is not identical to full-text machine translation (MT): the goal is not to produce a human-readable translation, but a translation suitable for finding relevant documents. Specific translation methods are thus required. The goal of this book is to provide a comprehensive description of the specific problems arising in CLIR, the solutions proposed in this area, as well as the remaining problems. The book starts with a general description of the monolingual IR and CLIR problems. Different classes of approaches to translation are then presented: approaches using an MT system, dictionary-based translation and approaches based on parallel and comparable corpora. In addition, the typical retrieval effectiveness using different approaches is compared. It will be shown that translation approaches specifically designed for CLIR can rival and outperform high-quality MT systems. Finally, the book offers a look into the future that draws a strong parallel between query expansion in monolingual IR and query translation in CLIR, suggesting that many approaches developed in monolingual IR can be adapted to CLIR. The book can be used as an introduction to CLIR. Advanced readers can also find more technical details and discussions about the remaining research challenges in the future. It is suitable to new researchers who intend to carry out research on CLIR. KEYWORDS cross-language information retrieval; multilingual information retrieval; query translation; document translation; translation model; machine translation / statistical machine translation; dictionary-based translation; parallel corpus; comparable corpus; query expansion; transliteration; mining of translation relations / resources MC_Labosky_FM.indd vi Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd vii Achorn International 03/11/2010 10:17AM vii Dedication To my dear son Guillaume (子吟). MC_Labosky_FM.indd vi Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd vii Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd viii Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd ix Achorn International 03/11/2010 10:17AM ix Contents Preface .....................................................................................................................xiii 1. Introduction .......................................................................................................1 1.1 General IR Problems ........................................................................................... 1 1.2 General IR Approaches ....................................................................................... 2 1.2.1 IR Models................................................................................................ 3 1.2.1.1 Boolean Models ........................................................................ 3 1.2.1.2 Vector Space Model .................................................................. 4 1.2.1.3 Probabilistic Models.................................................................. 5 1.2.1.4 Statistical Language Models ..................................................... 6 1.2.2 Query Expansion ..................................................................................... 8 1.2.3 System Evaluation ................................................................................. 10 1.3 Language Problems in IR .................................................................................. 12 1.3.1 European Languages ............................................................................. 12 1.3.1.1 Word Stemming ...................................................................... 12 1.3.1.2 Decompounding ..................................................................... 12 1.3.2 East Asian Languages ........................................................................... 14 1.3.2.1 Chinese and Word Segmentation ........................................... 14 1.3.2.2 Japanese and Korean ............................................................... 17 1.3.3 Other Languages ................................................................................... 17 1.4 The Problems of Cross-Language Information Retrieval ................................. 18 1.4.1 Query Translation vs. Document Translation ........................................ 19 1.4.2 Using Pivot Language and Interlingua .................................................. 20 1.5 Approaches to Translation in CLIR .................................................................. 21 1.6 The Need for Cross-Language and Multilingual IR ......................................... 23 1.7 The History of CLIR ........................................................................................ 24 MC_Labosky_FM.indd viii Achorn International 03/11/2010 10:17AM MC_Labosky_FM.indd ix Achorn International 03/11/2010 10:17AM x CROSS-LANGUAGE INFORMATION RETRIEVAL 2. Using Manually Constructed Translation Systems and Resources for CLIR ......... 29 2.1 Machine Translation .......................................................................................... 29 2.1.1 Rule-Based MT ..................................................................................... 30 2.1.2 Statistical MT ........................................................................................ 32 2.2 Basic utilization of MT in CLIR ....................................................................... 37 2.2.1 Rule-Based MT ..................................................................................... 39 2.2.2 Statistical MT ........................................................................................ 41 2.2.3 Unknown Word ..................................................................................... 41 2.3 Open the Box of MT ......................................................................................... 44 2.4 Dictionary-Based Translation for CLIR ............................................................ 45 2.4.1 Basic Approaches ................................................................................... 46 2.4.2 The Term Weighting Problem .............................................................. 47 2.4.3 Coverage of the Dictionary ................................................................... 49 2.4.4 Translation Ambiguity ........................................................................... 50 2.4.5 Selection of Translation Words ............................................................

Cross-Language Information Retrieval

The Longest Words Using the Fewest Letters

Set Reading Benchmarks in South Africa

A Supplementary Topical Index

A Reference Grammar of Wanano

UNIVERSITY of CALIFORNIA Los Angeles Word Prosody And

Unsupervised Morphological Word Clustering Sergei A. Lushtak a Thesis Submitted in Partial Fulfillment of the Requirements for T

The Translation of German Nominal Compounds in Dutch and English: a Syntactic and Semantic Analysis

BRUSH up YOUR WEBSTER's She Cried

KUOT a Non-Austronesian Language of New Ireland, Papua New Guinea

Words with Letters Humane

DOCUMENT RESUME ED 049 265 TE 002 393 JOURNAL CIT Laird, Charlton

Language Learning Research Quick Language Learning 2010 Great for Teachers and Students of Languages Also for ESL Teachers and Students