CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Shuo Sun, Kevin Duh
Johns Hopkins University
Abstract

We present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139×138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date. This collection is intended to support research in end-to-end neural information re-

the same language, then employ a monolingual information retrieval (IR) engine to find relevant documents.

Recently, the research community has been actively looking at end-to-end solutions that tackle the CLIR task without the need to build MT systems. This line of work builds upon recent advances in Neural Information Retrieval in the monolingual setting, cf. (Mitra and Craswell, 2018; Craswell et al., 2020). There are proposals to directly train end-to-end neural retrieval models on CLIR datasets (Sasaki et al., 2018; Zhang et al., 2019) or MT bitext (Zbib et al., 2019; Jiang et al., 2020). One can also exploit cross-lingual word embeddings to train a CLIR model on disjoint monolingual corpora (Litschko et al., 2018).
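
The language-pair count quoted in the abstract (139×138=19,182) follows if each (query language, document language) combination is treated as ordered and the two languages must differ. The short Python sketch below checks that arithmetic. The function name `count_directed_pairs` and the three-language toy list are illustrative assumptions and are not part of the CLIRMatrix release.

```python
from itertools import permutations


def count_directed_pairs(num_languages: int) -> int:
    """Count ordered (query language, document language) pairs
    in which the two languages are different."""
    return num_languages * (num_languages - 1)


# Toy check on a made-up three-language list (hypothetical codes):
# enumerating the ordered pairs explicitly should match the formula.
toy_languages = ["en", "de", "ja"]
toy_pairs = list(permutations(toy_languages, 2))
assert len(toy_pairs) == count_directed_pairs(len(toy_languages))

# The same formula applied to the 139 languages of BI-139.
bi139_pairs = count_directed_pairs(139)
print(bi139_pairs)  # 19182
```

Counting ordered rather than unordered pairs matters here: queries in English matched with documents in German form a different retrieval task from queries in German matched with documents in English, so both directions are counted.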