CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Shuo Sun, Johns Hopkins University
Kevin Duh, Johns Hopkins University

Abstract

We present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139 × 138 = 19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date. This collection is intended to support research in end-to-end neural information retrieval and is publicly available at https://github.com/ssun32/CLIRMatrix. We provide baseline neural model results on BI-139, and evaluate MULTI-8 in both single-language retrieval and mixed-language retrieval settings.

1 Introduction

Cross-Lingual Information Retrieval (CLIR) is a retrieval task in which search queries and candidate documents are written in different languages. CLIR can be very useful in some scenarios. For example, a reporter may want to search foreign-language news to obtain different perspectives for her story; an inventor may explore the patents in another country to understand prior art. Traditionally, translation-based approaches are commonly used to tackle the CLIR task (Zhou et al., 2012; Oard, 1998; McCarley, 1999): the query translation approach translates the query into the same language as the documents, whereas the document translation approach translates the document into the same language as the query. Both approaches rely on a machine translation (MT) system or bilingual dictionary to map queries and documents to the same language, then employ a monolingual information retrieval (IR) engine to find relevant documents.
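To make the translation-based pipeline concrete, here is a minimal sketch of the query translation approach, not taken from any of the systems cited above: a toy bilingual dictionary stands in for the MT system, and the rank_bm25 package plays the role of the monolingual IR engine. All names and data are illustrative.

# Minimal sketch of query-translation CLIR (illustrative only): translate
# the query into the document language, then rank with monolingual BM25.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy English-to-Spanish dictionary standing in for a real MT system.
EN_ES = {"president": "presidente", "united": "unidos", "states": "estados"}

def translate_query(tokens):
    """Word-by-word dictionary lookup; a real system would use full MT."""
    return [EN_ES.get(t, t) for t in tokens]

# Target-language (Spanish) collection, whitespace-tokenized for brevity.
docs = ["el presidente de los estados unidos",
        "una patente registrada en otro pais"]
bm25 = BM25Okapi([d.split() for d in docs])

query = translate_query("president united states".split())
scores = bm25.get_scores(query)                  # one BM25 score per document
ranked = sorted(range(len(docs)), key=lambda j: -scores[j])

The document translation approach is symmetric: translate every document into the query language once at indexing time, then index and search monolingually.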
Recently, the research community has been actively looking at end-to-end solutions that tackle the CLIR task without the need to build MT systems. This line of work builds upon recent advances in Neural Information Retrieval in the monolingual setting, cf. (Mitra and Craswell, 2018; Craswell et al., 2020). There are proposals to directly train end-to-end neural retrieval models on CLIR datasets (Sasaki et al., 2018; Zhang et al., 2019) or MT bitext (Zbib et al., 2019; Jiang et al., 2020). One can also exploit cross-lingual word embeddings to train a CLIR model on disjoint monolingual corpora (Litschko et al., 2018).

Despite the growing interest in end-to-end CLIR, the lack of a large-scale, easily accessible CLIR dataset covering many language directions in high-, mid- and low-resource settings has detrimentally affected the CLIR community's capability to replicate and compare with previously published work. For example, among the widely used datasets, the CLEF collection (Ferro and Silvello, 2015) covers many languages but is not large enough for training neural models. The more recent IARPA MATERIAL/OpenCLIR collection (Zavorin et al., 2020) is not yet publicly accessible. This motivates us to design and build CLIRMatrix, a massively large collection of bilingual and multilingual datasets for CLIR.

We construct CLIRMatrix from Wikipedia in an automated manner, exploiting its large variety of languages and massive number of documents. The core idea is to synthesize relevance labels via an existing monolingual IR system, then propagate the labels via Wikidata links that connect documents in different languages. In total, we were able to mine 49 million unique queries in 139 languages and 34 billion (query, document, label) triplets, creating a CLIR collection across a matrix of 139 × 138 = 19,182 language pairs. From this raw collection, we introduce two datasets:

• BI-139 is a massively large bilingual CLIR dataset that covers 139 × 138 = 19,182 language pairs. To encourage reproducibility, we present standard train, validation, and test subsets for every language direction.

• MULTI-8 is a multilingual CLIR dataset comprising queries and documents jointly aligned in 8 languages: Arabic (ar), German (de), English (en), Spanish (es), French (fr), Japanese (ja), Russian (ru), Chinese (zh). Each query will have relevant documents in the other 7 languages.

Figure 1: Illustration of our CLIRMatrix collection. The BI-139 portion of CLIRMatrix supports research in bilingual retrieval and covers a matrix of 139 × 138 language pairs. The MULTI-8 portion of CLIRMatrix supports research in multilingual modeling and mixed-language (ML) retrieval, where queries and documents are jointly aligned over 8 languages.

See Figure 1 for a comparison of BI-139 and MULTI-8. The former facilitates the evaluation of bilingual retrieval over a wide variety of languages, while the latter supports research in mixed-language retrieval (a.k.a. multilingual retrieval (Savoy and Braschler, 2019)), which is an interesting yet relatively under-explored problem. For both, the train sets are large enough to enable the training of neural IR models.

We hope CLIRMatrix is useful and can empower further developments in this field of research. To summarize, our contributions are:

1. A massive CLIR collection supporting both training and evaluation of bilingual/multilingual models.

2. A set of baseline neural results on BI-139 and MULTI-8. On MULTI-8, we show that a single multilingual model can significantly outperform an ensemble of bilingual models.

CLIRMatrix is publicly available at https://github.com/ssun32/CLIRMatrix.

2 Methodology

Let $q^X$ be a query in language X, and $d^Y$ be a document in language Y. A bilingual CLIR dataset consists of triples

$\{(q_i^X, d_{ij}^Y, r_{ij})\}_{i=1,2,\ldots,I}$  (1)

where $d_{ij}^Y$ is the j-th document associated with query $q_i^X$, and $r_{ij}$ is a label saying how relevant the document $d_{ij}^Y$ is to the query $q_i^X$. Conventionally, $r_{ij}$ is an integer, with 0 representing "not relevant" and higher values indicating more relevant.

Suppose there are J documents in total. In the full-collection search setup, the index j ranges from $1,\ldots,J$, meaning that each query $q_i^X$ searches over the full set of documents $\{d_{ij}^Y\}_{j=1,\ldots,J}$. In the re-ranking setup, each query $q_i^X$ searches over a subset of documents obtained by an initial full-collection retrieval engine: $\{d_{ij}^Y\}_{j=1,\ldots,K_i}$, where $K_i \ll J$. For practical reasons, machine learning approaches to IR focus on the re-ranking setup with $K_i$ set to 10∼1000 (Liu, 2009; Chapelle and Chang, 2011). We follow the re-ranking setup here.
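As a concrete picture of the re-ranking setup, the sketch below shows the shape of one (query, candidates, labels) record and how a scoring model reorders the $K_i$ candidates. The class and function names are ours, not part of any released dataset tooling.

# Illustrative shape of one re-ranking instance: a query q_i^X, its K_i
# candidate documents d_ij^Y from an initial retrieval engine (K_i << J),
# and graded labels r_ij (0 = not relevant, higher = more relevant).
from dataclasses import dataclass

@dataclass
class RerankInstance:
    query: str        # q_i^X, e.g. a Wikipedia title in language X
    candidates: list  # the K_i documents d_ij^Y in language Y
    labels: list      # integer labels r_ij, aligned with candidates

def rerank(inst, score_fn):
    """Reorder candidate indices with a (possibly neural) scoring function."""
    scores = [score_fn(inst.query, d) for d in inst.candidates]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Toy usage with a character-overlap scorer standing in for a neural model.
inst = RerankInstance(query="Barack Obama",
                      candidates=["Obama biography", "History of Chicago"],
                      labels=[6, 0])
print(rerank(inst, lambda q, d: len(set(q) & set(d))))  # -> [0, 1]

During training the labels $r_{ij}$ supervise the scoring function; at test time they feed ranking metrics such as nDCG.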
We now describe the main intuition of our construction method and detail various components and design choices in our pipeline.

2.1 Intuition and Assumptions

To create a CLIR dataset, one needs to decide how to obtain $q_i^X$, $d_{ij}^Y$, and $r_{ij}$. We set $q_i^X$ to be Wikipedia titles, $d_{ij}^Y$ to be Wikipedia articles, and synthesize $r_{ij}$ automatically using a simple yet reliable method. We argue that Wikipedia is the best available resource for building CLIR datasets for two reasons: First, it is freely available and contains articles in more than 300 languages, covering a large variety of topics. Second, Wikipedia articles are mapped to entities in Wikidata, which is a relatively reliable way to find the same articles written in other languages.

To synthesize relevance labels $r_{ij}$, we propose to first generate labels using an existing monolingual IR system in language X, then propagate the labels via Wikidata links to language Y. In other words, we assume:

1. the availability of documents $d^X$ in the same language as the query, and

2. the feasibility of an existing monolingual IR system in language X to provide labels $\hat{r}_{ij}$ on $(q_i^X, d_{ij}^X)$ pairs.

Then for any $d_{ij}^Y$ that links to $d_{ij}^X$, we assign the relevance label $\hat{r}_{ij}$.

Figure 2: Intuition of CLIR relevance label synthesis. For the English query "Barack Obama", first a monolingual IR engine (Elasticsearch) labels documents in English; then Wikidata links are exploited to propagate the label to the corresponding Chinese documents, which are assumed to be topically similar.

This intuition is illustrated in Figure 2. Suppose we wish to find Chinese documents that are relevant for the English query "Barack Obama". We first run monolingual IR to find English documents that answer the query. In this figure, 4 documents are returned, and we attempt to link to the corresponding Chinese versions using Wikidata.

We download the Wikipedia dump of language X and then extract the titles and document bodies of every article. We index the documents into an Elasticsearch search engine, which serves as our monolingual IR system. Using the extracted titles as search queries, we retrieve the top 100 relevant documents and their corresponding BM25 scores from Elasticsearch for every query. We then convert the BM25 scores into discrete relevance judgment labels using Jenks natural breaks optimization. Finally, we propagate these labels to documents in language Y that are linked via Wikidata (a condensed sketch of this step follows at the end of this section).

We downloaded Wikidata and Wikipedia dumps released on January 1, 2020. Since Wikipedia dumps contain tremendous amounts of meta-information such as URLs and scripts, it can be expensive to extract actual text directly from those dumps. Inspired by Schwenk et al. (2019), we extracted document ids, titles, and bodies from Wikipedia's search indices instead, which contain raw text data without meta-information.

Wikipedia dumps: We discarded dumps with fewer than ten thousand documents, which are usually the dumps of Wikipedia in certain dialects and less commonly used languages.
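The label-synthesis step referenced above can be condensed into a short sketch. This is our illustration of the paper's description, not the released code: the index name, the number of grades, and the jenkspy package (one available implementation of Jenks natural breaks) are assumptions, and building the Wikidata sitelink table is omitted.

# Condensed sketch of relevance-label synthesis (illustrative choices
# throughout): query a monolingual Elasticsearch index with a Wikipedia
# title, bucket the top-100 BM25 scores with Jenks natural breaks, then
# propagate each label to the Wikidata-linked document in language Y.
import bisect
from elasticsearch import Elasticsearch  # pip install elasticsearch
import jenkspy                           # pip install jenkspy

es = Elasticsearch("http://localhost:9200")

def synthesize_labels(title, index="wiki_en", n_grades=6):
    """Return {doc_id: grade} for the top-100 BM25 hits of one title query."""
    resp = es.search(index=index, query={"match": {"body": title}}, size=100)
    hits = resp["hits"]["hits"]
    scores = [h["_score"] for h in hits]  # raw BM25 scores
    # Jenks natural breaks: class boundaries minimizing within-class variance.
    breaks = jenkspy.jenks_breaks(scores, n_classes=n_grades)
    # Grade = index of the Jenks class a score falls into (0 = least relevant).
    return {h["_id"]: bisect.bisect_left(breaks[1:-1], h["_score"])
            for h in hits}

def propagate_labels(labels_x, sitelinks_x_to_y):
    """Copy each label from a language-X doc id to its language-Y counterpart."""
    return {sitelinks_x_to_y[i]: r
            for i, r in labels_x.items() if i in sitelinks_x_to_y}

Run once per title in each of the 139 languages, these two functions yield the (query, document, label) triplets that populate the matrix of language pairs; automation of this kind is what makes labeling at this scale practical at all.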