BRIDJE over a Language Barrier: Cross-Language Information Access by Integrating Translation and Retrieval
Tetsuya Sakai Makoto Koyama Masaru Suzuki Akira Kumano Toshihiko Manabe Toshiba Corporate R&D Center Knowledge Media Laboratory, 1 Komukai-Toshiba-cho, Saiwai-ku, Kawasaki 212-8582, JAPAN [email protected]
Abstract outputs a ranked list of target language documents, and this list is evaluated using metrics such as This paper describes two new features of Average Precision. However, CLIR solves only part the BRIDJE system for cross-language of the Language Barrier Problem: if the user cannot information access. The first feature is express his information need in target language, the partial disambiguation function of then he probably cannot make much use of the the Bi-directional Retriever, which can retrieved documents written in the same language. be used for search request translation in (If the source and target languages are reasonably cross-language IR. Its advantage over a similar, then the user may find such plain CLIR “black-box” machine translation approach useful. However, this is certainly not the case for is consistent across five test collections pairs of disparate languages such as English and and across two language permutations: Japanese.) Thus, what deserves more attention is English-Japanese and Japanese-English. Cross-Language Information Access (CLIA), which The second new feature is the Information subsumes CLIR and provides useful information to Distiller, which performs interactive the user in source language (e.g. (Frederking et al., summarisation of retrieved documents 1997)). based on Semantic Role Analysis. Our This paper describes two new features of the examples illustrate the usefulness of BRIDJE (Bi-directional Retriever/Information Dis- this feature, and our evaluation results tiller for Japanese and English) system (Sakai et al., show that the precision of Semantic Role 2002a; Sakai et al., 2002b; Sakai et al., 2003) which Analysis is very high. integrates machine translation (MT) with informa- tion retrieval (IR) to support both English-Japanese (E-J) and Japanese-English (J-E) CLIA. The first fea- 1 Introduction ture is the partial disambiguation function of the Bi- directional Retriever part of BRIDJE for enhancing Cross-Language Information Retrieval retrieval performance in the traditional sense. While (CLIR) (Grefenstette, 1998) has received a lot most of the traditional MT-based CLIR systems use of attention recently. TREC currently studies MT as a “black box”, partial disambiguation accesses English-Arabic IR, CLEF studies CLIR across the internal data structures of a commercial MT sys- European languages, and NTCIR studies CLIR tem for search request translation so that multiple across Asian languages (Chen et al., 2003; Kando, translation candidates can be used as search terms. 2001). As with monolingual IR, CLIR evaluations We present positive results that are consistent across usually rely on the use of static test collections: The five test collections (or six topic sets), and across system accepts a source language search request and two language permutations: English-Japanese and Japanese-English. To our knowledge, BRIDJE is al., 1999a; Sakai, 2001). The recent CLIR results the first system that truly integrates MT with IR and at NTCIR-3 also showed that this approach, which performs well in terms of standard measures. The BRIDJE also employs, is very promising (Chen et second new feature is the entire Information Dis- al., 2003; Sakai et al., 2003). tiller part of BRIDJE, which can provide generic In our partial disambiguation experiments, we or query-specific summaries of the retrieved docu- go beyond the use of MT as a black box by ac- ments, as well as their translations in source lan- cessing the internal data structures of a commer- guage. Based on Semantic Role Analysis (SRA) cial MT system in order to use multiple transla- originally designed for enhancing retrieval perfor- tion candidates as search terms. Jones et al. (Jones mance (Sakai et al., 2002a; Suzuki et al., 2001), et al., 1999) have explored this approach to some the Information Distiller extracts important text frag- extent, but they observed performance degradation ments from a retrieved document on the fly. Prelim- when compared to full disambiguation (i.e. black- inary evaluations suggest that SRA can classify text box MT), as they treated the multiple candidates as fragments with very high precision, and that it is distinct search terms. In contrast, BRIDJE treats useful for efficient information access. We regard these candidates as a group of synonyms, which is our Information Distiller feature as one step towards now known to be effective in CLIR (Pirkola, 1998). Cross-Language Question Answering. Thus, while dictionary-based approaches start from BRIDJE is an enhanced, fully bilingual version the maximum ambiguity state and performs vigorous of the KIDS Japanese retrieval system that recently disambiguation, we start from the minimum ambi- achieved the highest performances in the English- guity state reached through full MT and take a step Japanese and Japanese monolingual IR tasks at backward for obtaining alternative translations. The NTCIR-3 (Chen et al., 2003; Sakai et al., 2003). SYSTRAN NLP browser (Gachot et al., 1998) also The remainder of this paper is organised as fol- integrates MT with IR at a deep level, but this was lows: Section 2 compares some of the previous work designed for retrieving sentences that match specific on CLIR/CLIA with our present study. Section 3 grammatical features, and its effectiveness as a doc- provides an overview of the BRIDJE system. Sec- ument retrieval system is not clear. tion 4 describes an extensive set of retrieval experi- Regarding CLIA (as opposed to CLIR), Frederk- ments that compares partial disambiguation with the ing et al. (Frederking et al., 1997) proposed a frame- black-box MT approach. Section 5 provides exam- work in which retrieved documents could be sum- ples to illustrate the advantages of the Information marised and translated by multiple MT engines in Distiller for efficient information access, as well as parallel. MULINEX (Capstick et al., 2000), which some evaluation results of SRA. Finally, Section 6 is a dictionary-based CLIR system for French, En- provides conclusions. glish and German, uses the LOGOS MT system for document/summary presentation. Oard and Ren- 2 Previous Work sik (Oard and Resnik, 1999) have used word-by-word J-E translations in their study on interactive docu- This section reviews some previous work on ment selection. PRIME (Higuchi et al., 2001), yet CLIR/CLIA, focussing primarily on those that deal another J-E/E-J CLIR system based on dictionaries with English and Japanese or those that use MT in and corpora, translates retrieved patent documents some way or other. phrase-by-phrase. Compared to these CLIA sys- The early J-E CLIR systems (Susaki et al., 1996; tems, the Information Distiller part of BRIDJE is Yamabana et al., 1998) employed dictionary-based unique in that it performs SRA for interactive docu- search request translation, with corpus-based dis- ment summarisation/presentation. ambiguation. However, it is not clear how effective these systems are in terms of retrieval performance. 3 The BRIDJE System In contrast, for both J-E and E-J CLIR, MT-based search request translation has been combined suc- This section describes the general features of the cessfully with Pseudo-Relevance Feedback (Sakai et BRIDJE system. Sections 3.1 and 3.2 describe the request translation and indexing/retrieval compo- disambiguation is effective. nents of the Bi-directional Retriever. Section 3.3 in- troduces the Information Distiller that provides sum- 3.2 Indexing and Retrieval maries and translations of retrieved documents. Our default retrieval strategy is Okapi/BM25 with Pseudo-Relevance Feedback (PRF) (Robertson and 3.1 Request Translation Sparck Jones, 1997; Sakai, 2001). The term selec- The default search request translation strategy of tion criterion used in all of our experiments involving
BRIDJE is full disambiguation: a source language PRF is ow , which incorporates the initial document request is fed to a commercial MT system (Amano et scores into the traditional offer weight (Sakai et al., al., 1989), and the output is treated as a monolingual 2003). request written in target language. BRIDJE is also BRIDJE employs word-based indexing, prior to capable of performing transliteration for treating which synonyms and phrases can be defined. Syn- words that are outside of the MT dictionary (Sakai et onym groups can also be defined at query time, which al., 2002b), but this feature is not used in the present is useful for partial disambiguation and translitera-
study. tion (Sakai et al., 2002b): the term frequency (tf )
In order to describe partial disambiguation,we and the document frequency (df ) are counted as if all first illustrate the disambiguation process of MT with the members of a synonym group are one identical an example. Suppose that a search request contains term. the word “play” in the context of E-J CLIR. Firstly, Optionally, BRIDJE can perform indexing and re- by dictionary lookup, we recognise that “play” can trieval based on SRA (Sakai et al., 2002a; Suzuki either be a noun or a verb, and obtain all possi- et al., 2001), which was originally designed for go- ble Japanese translations accordingly. Secondly, we ing beyond the “bag-of-words” approach to IR. Al- determine its part-of-speech through syntactic anal- though the present study uses SRA for interactive ysis. Here, suppose that “play” was used as a verb, document summarisation/presentation and not for re- and that all Japanese translations for the noun “play” trieval, it works as follows: During document index- have been filtered out. Thirdly, we perform semantic ing and search request processing, BRIDJE can ex- analysis using transfer rules. These rules contain amine the output of the parser (morphological anal- knowledge such as “IF the object of the verb “play” yser for Japanese; part-of-speech tagger/stemmer for is a sport, THEN it should be translated as “suru”. IF English), based on a set of hand-written SRA rules the object is a musical instrument, THEN it should which specify the following: be translated as “hiku”or“enso-suru¯ ”. Here, sup- pose that the object was “violin”, and therefore that How to break up the text into fragments, based “hiku” and “enso-suru¯ ” have been selected as transla- on regular expression pattern matching. A frag- tion candidates. Then, at the final output stage, only ment can be a sentence, a paragraph, a prepo- one translation is selected for each source language sitional phrase, and so on. For example, an word. For the word “play” in the above example, English sentence of the form “A for B with C” “hiku” is selected as the final translation since this can be broken into “A”, “for B” and “with C” if is the first entry in the aforementioned transfer rule. prepositions are used as fragment boundaries. (The above explanation is a simplified version of