Multilingual Information Retrieval
Doug Oard College of Information Studies and UMIACS University of Maryland, College Park USA
January 14, 2019 AFIRM

Global Trade
[Figure: scatter plot of exports versus imports, both in trillions of USD (0.0 to 2.5), with points labeled USA, EU, China, Hong Kong, Japan, and South Korea]
Source: Wikipedia (mostly 2017 estimates)

Most Widely-Spoken Languages
[Figure: bar chart of L1 and L2 speakers, in millions (0 to 1,200), for English, Mandarin Chinese, Hindi, Spanish, French, Modern Standard Arabic, Russian, Bengali, Portuguese, Indonesian, Urdu, German, Japanese, Swahili, Western Punjabi, Javanese, Wu Chinese, Telugu, Turkish, Korean, Marathi, Tamil, Yue Chinese, Vietnamese, Italian, Hausa, Thai, Persian, and Southern Min]
Source: Ethnologue (SIL), 2018

Global Internet Users
[Figure: pie charts of internet users by language; the labeled languages are English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, and Korean]

What Does “Multilingual” Mean?
• Mixed-language document
  – Document containing more than one language
• Mixed-language collection
  – Collection of documents in different languages
• Multi-monolingual systems
  – Can retrieve from a mixed-language collection
• Cross-language system
  – Query in one language finds document in another
• (Truly) multilingual system
  – Queries can find documents in any language

A Story in Two Parts
• IR from the ground up in any language – Focusing on document representation
• Cross-Language IR – To the extent time allows

[Figure: retrieval system diagram. The query and the documents each pass through a representation function; a comparison function matches the query representation against the indexed document representations to produce hits.]

ASCII
• American Standard Code for Information Interchange
• ANSI X3.4-1968

| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |

The Latin-1 Character Set
• ISO 8859-1 8-bit characters for Western Europe – French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English
[Figure: code chart distinguishing the printable 7-bit ASCII characters from the additional characters defined by ISO 8859-1]

Other ISO-8859 Character Sets
[Figure: code charts for ISO 8859-2 through ISO 8859-9]

East Asian Character Sets
• More than 256 characters are needed
  – Two-byte encoding schemes (e.g., EUC) are used
• Several countries have unique character sets
  – GB in People's Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam
• Many characters appear in several languages
  – Research Libraries Group developed EACC
    • Unified “CJK” character set for USMARC records

Unicode
• Single code for all the world’s characters
  – ISO Standard 10646
• Separates “code space” from “encoding”
  – Code space extends Latin-1
    • The first 256 positions are identical
  – UTF-7 encoding will pass through email
    • Uses only the 64 printable ASCII characters
  – UTF-8 encoding is designed for disk file systems

Limitations of Unicode
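The split between code space and encoding on the Unicode slide can be illustrated with a minimal Python sketch. It assumes Python 3's built-in `latin-1`, `utf-8`, and `utf-7` codecs; the helper name `encodings` is made up for this illustration and is not from the slides.

```python
def encodings(ch):
    """Return the code point of one character and its bytes under three encodings."""
    result = {
        "code_point": ord(ch),          # position in the Unicode code space
        "utf-8": ch.encode("utf-8"),    # variable-length byte encoding
        "utf-7": ch.encode("utf-7"),    # 7-bit-safe encoding for email
    }
    try:
        result["latin-1"] = ch.encode("latin-1")  # only code points 0..255
    except UnicodeEncodeError:
        result["latin-1"] = None                  # outside Latin-1
    return result


e_acute = encodings("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE
han = encodings("\u4e2d")      # a CJK ideograph

print(e_acute)
print(han)
```

The accented letter keeps its Latin-1 position as its Unicode code point, but takes two bytes in UTF-8 instead of one, which is one reason Unicode files can be larger than Latin-1 files.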
• Produces larger files than Latin-1
• Fonts may be hard to obtain for some characters
• Some characters have multiple representations
  – e.g., accents can be part of a character or separate
• Some characters look identical when printed
  – But they come from unrelated languages
• Encoding does not define the “sort order”

Strings and Segments
• Retrieval is (often) a search for concepts
  – But what we actually search are character strings
• What strings best represent concepts?
  – In English, words are often a good choice
    • Well-chosen phrases might also be helpful
  – In German, compounds may need to be split
    • Otherwise queries using constituent words would fail
  – In Chinese, word boundaries are not marked
    • Thissegmentationproblemissimilartothatofspeech

Tokenization
• Words (from linguistics):
  – Morphemes are the units of meaning
  – Combined to make words
    • Anti (disestablishmentarian) ism
• Tokens (from computer science)
  – Doug ’s running late !

Morphological Segmentation
Swahili Example
a + li + ni + andik + ish + a
he + past-tense + me + write + causer-effect + declarative-mode

Credit: Ramy Eskander

Morphological Segmentation
Somali Example
cun + t + aa
eat + she + present-tense
Credit: Ramy Eskander

Stemming
• Conflates words, usually preserving meaning
  – Rule-based suffix-stripping helps for English
    • {destroy, destroyed, destruction}: destr
  – Prefix-stripping is needed in some languages
    • Arabic: {alselam}: selam [Root: SLM (peace)]
• Imperfect: goal is to usually be helpful
  – Overstemming
    • {centennial, century, center}: cent
  – Understemming
    • {acquire, acquiring, acquired}: acquir
    • {acquisition}: acquis
• Snowball: rule-based system for making stemmers

Longest Substring Segmentation
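The understemming example from the Stemming slide can be reproduced with a toy suffix stripper. This is only a sketch: the four suffix rules and the minimum stem length are hypothetical choices made for illustration, not the Porter or Snowball rule sets.

```python
# Hypothetical toy rules, tried in order; NOT the Porter or Snowball rules.
SUFFIXES = ("ition", "ing", "ed", "e")
MIN_STEM_LENGTH = 3  # assumed guard so that short words are left alone


def toy_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


stems = {w: toy_stem(w) for w in ["acquire", "acquiring", "acquired", "acquisition"]}
print(stems)
```

The three verb forms conflate to `acquir`, while `acquisition` ends up as `acquis`, so a query on one form would miss documents that use the other.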
• Greedy algorithm based on a lexicon
• Start with a list of every possible term
• For each unsegmented string
  – Remove the longest single substring in the list
  – Repeat until no substrings are found in the list

Longest Substring Example
• Possible German compound term (!): – washington
• List of German words: – ach, hin, hing, sei, ton, was, wasch
• Longest substring segmentation
  – was-hing-ton
  – Roughly translates as “What tone is attached?”
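The greedy procedure and the “washington” example above can be sketched in a few lines of Python. The function name is made up for illustration, and keeping any leftover piece that matches nothing in the lexicon is an assumption; the slides do not say what happens to such pieces.

```python
def longest_substring_segment(text, lexicon):
    """Greedy segmentation: pull out the longest lexicon entry, then recurse."""
    candidates = [word for word in lexicon if word and word in text]
    if not candidates:
        # Assumption: a piece with no lexicon match is kept unchanged.
        return [text] if text else []
    word = max(candidates, key=len)   # longest entry found anywhere in text
    start = text.find(word)           # leftmost occurrence of that entry
    left = longest_substring_segment(text[:start], lexicon)
    right = longest_substring_segment(text[start + len(word):], lexicon)
    return left + [word] + right


german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
segments = longest_substring_segment("washington", german_words)
print("-".join(segments))
```

With this word list the longest match is `hing`, which leaves `was` and `ton` on either side, giving the segmentation shown on the slide.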
[Figure: segmentation example with English glosses: oil, petroleum, probe, survey, take samples, cymbidium goeringii, restrain]

Probabilistic Segmentation
• For an input string c1 c2 c3 … cn
• Try all possible partitions into w1 w2 w3 …
  – c1 | c2 c3 … cn
  – c1 c2 | c3 … cn
  – c1 | c2 | c3 … cn
  – etc.
• Choose the highest probability partition
  – Compute Pr(w1 w2 w3 …) using a language model
• Challenges: search, probability estimation

Non-Segmentation: N-gram Indexing
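For the Probabilistic Segmentation slide, here is a minimal sketch of choosing the highest-probability partition. It assumes a unigram language model, so Pr(w1 w2 w3 …) is the product of per-word probabilities, and it uses dynamic programming rather than enumerating every partition, which is one way to handle the search challenge. The probability values and the function name are invented for illustration.

```python
import math


def segment(text, unigram_prob, max_len=10):
    """Return the most probable split of text into known words, or None."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in unigram_prob and best[start][0] > -math.inf:
                score = best[start][0] + math.log(unigram_prob[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if n > 0 and best[n][1] is None:
        return None  # no partition uses only known words
    words, end = [], n
    while end > 0:  # follow the back-pointers from the end of the string
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical unigram probabilities, for illustration only.
probs = {"the": 0.05, "table": 0.01, "tab": 0.002, "le": 0.0001}
print(segment("thetable", probs))
```

Here `the | table` (probability 0.0005) beats `the | tab | le` (probability 0.00000001). Estimating good probabilities, the other challenge named on the slide, is not addressed by this sketch.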
• Consider a Chinese document c1 c2 c3 … cn
• Don’t segment (you could be wrong!)
• Instead, treat every character bigram as a term
c1 c2 , c2 c3 , c3 c4 , … , cn-1 cn
• Break up queries the same way

A “Term” is Whatever You Index
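Character-bigram indexing, as described on the N-gram Indexing slide, takes only a few lines of Python. The sample strings and the function name are illustrative, and dropping whitespace before forming bigrams is an assumption.

```python
def char_bigrams(text):
    """Return every overlapping character bigram; whitespace is dropped first."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


doc_terms = char_bigrams("中文信息检索")   # document: c1 c2, c2 c3, ...
query_terms = char_bigrams("信息检索")     # the query is broken up the same way
shared = [term for term in query_terms if term in doc_terms]

print(doc_terms)
print(shared)
```

Because the query and the document are split identically, they share bigram terms without any segmentation decision being made.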
• Word sense
• Token
• Word
• Stem
• Character n-gram
• Phrase

Summary
• A term is whatever you index
  – So the key is to index the right kind of terms!
• Start by finding fundamental features – We have focused on character coded text – Same ideas apply to handwriting, OCR, and speech
• Combine characters into easily recognized units – Words where possible, character n-grams otherwise
• Apply further processing to optimize results
  – Stemming, phrases, …

A Story in Two Parts
• IR from the ground up in any language – Focusing on document representation

• Cross-Language IR – To the extent time allows

Query-Language CLIR
[Figure: a translation system turns the Somali document collection into an English document collection; a retrieval engine matches English queries against it, and the user selects and examines the results]

Document-Language CLIR
[Figure: a translation system turns English queries into Somali queries; a retrieval engine matches them against the Somali document collection and returns Somali documents, which the user selects and examines]

Query vs. Document Translation
• Query translation – Efficient for short queries (not relevance feedback) – Limited context for ambiguous query terms
• Document translation – Rapid support for interactive selection – Need only be done once (if query language is same)

Indexing Time: Statistical Document Translation
[Figure: indexing time in seconds (0 to 500) versus thousands of documents (0 to 45), for monolingual and cross-language indexing]

Language-Neutral Retrieval
[Figure: Somali query terms and English document terms each pass through a “translation” step into a shared representation for “interlingual” retrieval, which returns a ranked list with scores (1: 0.91, 2: 0.57, 3: 0.36)]

Translation Evidence
• Lexical Resources
  – Phrase books, bilingual dictionaries, …
• Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)
• Similarity
  – Similar writing (if the character set is the same)
  – Similar pronunciation
• People
  – May be able to guess topic from lousy translations

Types of Lexical Resources
• Ontology
  – Organization of knowledge
• Thesaurus
  – Ontology specialized to support search
• Dictionary
  – Rich word list, designed for use by people
• Lexicon
  – Rich word list, designed for use by a machine
• Bilingual term list
  – Pairs of translation-equivalent terms

[Figure: comparison of four query conditions: full query, named entities added, named entities from term list, and named entities removed]

Backoff Translation
• Lexicon might contain stems, surface forms, or some combination of the two.
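A minimal sketch of backoff translation, assuming four matching stages tried in order: surface form to surface form, stem to surface form, surface form to stem, and stem to stem. The `STEMS` table is a hypothetical stand-in for a real French stemmer, and the function names are made up for illustration.

```python
# Hypothetical stem table standing in for a real French stemmer.
STEMS = {"mangez": "mange", "mangent": "mange", "manges": "mange", "mange": "mange"}


def stem(word):
    """Look up a stem; unknown words are returned unchanged."""
    return STEMS.get(word, word)


def backoff_translate(term, lexicon):
    """Return (stage, translations); stage is 1 to 4, or 0 if nothing matched."""
    if term in lexicon:                        # 1: surface form vs. surface form
        return 1, lexicon[term]
    if stem(term) in lexicon:                  # 2: document stem vs. surface form
        return 2, lexicon[stem(term)]
    for entry, translations in lexicon.items():
        if stem(entry) == term:                # 3: surface form vs. lexicon stem
            return 3, translations
    for entry, translations in lexicon.items():
        if stem(entry) == stem(term):          # 4: stem vs. stem
            return 4, translations
    return 0, []


print(backoff_translate("mangez", {"mangez": ["eat"]}))
print(backoff_translate("mangez", {"mange": ["eats"]}))
print(backoff_translate("mange", {"mangez": ["eat"]}))
print(backoff_translate("mangez", {"mangent": ["eat"]}))
```

Trying the exact match first keeps the most reliable translations, and each later stage trades some precision for coverage.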
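| Document | Translation Lexicon | Example |
| surface form | surface form | mangez matches the entry “mangez - eat” |
| stem | surface form | mangez, stemmed to mange, matches the entry “mange - eats” (eat) |
| surface form | stem | mange matches the entry “mangez - eat”, stemmed to mange |
| stem | stem | mangez and the entry “mangent - eat” both stem to mange |

[Figure: the Rosetta Stone, with labels Hieroglyphic, Egyptian Demotic, and Greek]

Types of Bilingual Corpora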
• Parallel corpora: translation-equivalent pairs – Document pairs – Sentence pairs – Term pairs
• Comparable corpora: topically related – Collection pairs – Document pairs

Some Modern Rosetta Stones
• News:
  – DE-News (German-English)
  – Hong-Kong News, Xinhua News (Chinese-English)
• Government:
  – Canadian Hansards (French-English)
  – Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  – UN Treaties (Russian, English, Arabic, …)
• Religion
  – Bible, Koran, Book of Mormon

Word-Level Alignment
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: Madam President, I had asked the administration …
Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

A Translation Model
• From word-aligned bilingual text, we induce a translation model $p(f_i \mid e)$, where $\sum_{f_i} p(f_i \mid e) = 1$
• Example:
  p(探测|survey) = 0.4
  p(试探|survey) = 0.3
  p(测量|survey) = 0.25
  p(样品|survey) = 0.05

Using Multiple Translations
• Weighted Structured Query Translation
  – Takes advantage of multiple translations and translation probabilities
• TF and DF of query term e are computed using TF and DF of its translations:
$$TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k)$$

$$DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i)$$

BM-25
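The two weighted sums for TF(e, D_k) and DF(e) can be computed with one small helper. The translation probabilities below are the ones given for “survey” in the translation model example; the term counts, document frequencies, and function name are hypothetical values chosen for illustration.

```python
def weighted_count(translation_probs, counts):
    """Sum p(f|e) * count(f) over every translation f of a query term e."""
    return sum(p * counts.get(f, 0) for f, p in translation_probs.items())


# p(f | survey), as in the translation model example.
p_survey = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}

# Hypothetical statistics for one document D_k and for the whole collection.
tf_in_doc = {"探测": 2, "测量": 4}                            # TF(f, D_k)
df_in_collection = {"探测": 100, "试探": 50, "测量": 200, "样品": 400}  # DF(f)

tf_survey = weighted_count(p_survey, tf_in_doc)         # TF(survey, D_k)
df_survey = weighted_count(p_survey, df_in_collection)  # DF(survey)
print(tf_survey, df_survey)
```

The resulting values are fractional, so whatever ranking function consumes them must accept non-integer term and document frequencies.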
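$$\sum_{e \in Q} \left[\log \frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \left[\frac{2.2 \times tf(e, d_k)}{0.3 + 0.9 \times \frac{dl(d_k)}{avdl} + tf(e, d_k)}\right] \left[\frac{8 \times qtf(e)}{7 + qtf(e)}\right]$$

where df is the document frequency, tf is the term frequency, and dl is the document length.

Retrieval Effectiveness
Series: DAMM, IMM, PSQ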
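The BM-25 formula can be written directly as a scoring function. This sketch takes its constants from the formula as shown (2.2, 0.3, 0.9, 8, and 7); the function names, argument names, and sample statistics are assumptions made for illustration.

```python
import math


def bm25_term(tf, df, qtf, dl, avdl, n_docs):
    """Score contribution of one query term e for one document d_k."""
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))  # document frequency
    tf_part = (2.2 * tf) / (0.3 + 0.9 * dl / avdl + tf)    # term frequency, length-normalized
    qtf_part = (8 * qtf) / (7 + qtf)                       # query term frequency
    return idf_part * tf_part * qtf_part


def bm25(query_terms, dl, avdl, n_docs):
    """Sum over query terms; each item is a (tf, df, qtf) triple."""
    return sum(bm25_term(tf, df, qtf, dl, avdl, n_docs)
               for tf, df, qtf in query_terms)


# Hypothetical statistics: one query term, document of average length.
score = bm25([(1.0, 10.0, 1)], dl=100.0, avdl=100.0, n_docs=1000)
print(score)
```

Because tf and df enter only as numbers, the function accepts fractional values such as probability-weighted term and document frequencies.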
[Figure: CLIR MAP as a percentage of monolingual MAP (40% to 110%) versus cumulative probability threshold (0.0 to 1.0), on CLEF French]

Bilingual Query Expansion
[Figure: the source language query is expanded by source language IR over a source language collection (pre-translation expansion), passed through query translation, and expanded again by target language IR over a target language collection (post-translation expansion) to produce results]

Query Expansion Effect
[Figure: mean average precision (0.00 to 0.35) versus unique Dutch terms (0 to 15,000), for four conditions: Both, Post, Pre, and None]
Paul McNamee and James Mayfield, SIGIR-2002

Cognate Matching
• Dictionary coverage is inherently limited – Translation of proper names – Translation of newly coined terms – Translation of unfamiliar technical terms
• Strategy: model derivational translation – Orthography-based – Pronunciation-based

Matching Orthographic Cognates
• Retain untranslatable words unchanged – Often works well between European languages
• Rule-based systems – Even off-the-shelf spelling correction can help!
• Subword (e.g., character-level) MT – Trained using a set of representative cognates

Matching Phonetic Cognates
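For the orthographic cognate matching described above, the remark that off-the-shelf spelling correction can help can be tried with Python's standard `difflib` module. Here `difflib.get_close_matches` is only a stand-in for a spelling corrector; the vocabulary, the query term, and the cutoff of 0.6 are hypothetical choices.

```python
import difflib

# Hypothetical target-language (English) vocabulary taken from an index.
english_vocabulary = ["gorbachev", "government", "garbage", "chess"]


def match_cognate(untranslated_term, vocabulary, cutoff=0.6):
    """Return the closest vocabulary entries for a term the dictionary lacks."""
    return difflib.get_close_matches(
        untranslated_term.lower(), vocabulary, n=1, cutoff=cutoff
    )


# A German spelling of a proper name, kept unchanged by query translation.
matches = match_cognate("Gorbatschow", english_vocabulary)
print(matches)
```

String similarity of this kind only helps when both languages share a writing system.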
• Forward transliteration – Generate all potential transliterations
• Reverse transliteration – Guess source string(s) that produced a transliteration
• Match in phonetic space

Cross-Language “Retrieval”
[Figure: Query → Query Translation → Translated Query → Search → Ranked List]

Uses of “MT” in CLIR
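[Figure: interactive search process with stages Query Formulation, Query Translation, Search, Selection, Examination, Use, and Query Reformulation, passing a Query, a Translated Query, a Ranked List, and a Document between them; translation appears as Term Translation and Term Matching, as Snippet Translation (indicative) for selection, and as Document Translation (informative) for examination]

Interactive Cross-Language Question Answering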
[Figure: number of users with correct answers (0 to 8) for each question, with questions ordered 8, 11, 13, 4, 16, 6, 14, 7, 2, 10, 15, 12, 1, 3, 9, 5]

iCLEF 2004 Questions, Grouped by Difficulty
8 Who is the managing director of the International Monetary Fund?
11 Who is the president of Burundi?
13 Of what team is Bobby Robson coach?
4 Who committed the terrorist attack in the Tokyo underground?
16 Who won the Nobel Prize for Literature in 1994?
6 When did Latvia gain independence?
14 When did the attack at the Saint-Michel underground station in Paris occur?
7 How many people were declared missing in the Philippines after the typhoon “Angela”?
2 How many human genes are there?
10 How many people died of asphyxia in the Baku underground?
15 How many people live in Bombay?
12 What is Charles Millon's political party?
1 What year was Thomas Mann awarded the Nobel Prize?
3 Who is the German Minister for Economic Affairs?
9 When did Lenin die?
5 How much did the Channel Tunnel cost?

For Further Reading
• Multilingual IR
  – Paul McNamee et al., Addressing Morphological Variation in Alphabetic Languages, SIGIR, 2009
• African-Language IR
  – Open CLIR Challenge (Swahili), IARPA, 2018
  – Nkosana Malumba et al., AfriWeb: A Search Engine for a Marginalized Language, ICADL, 2015
• Cross-Language IR
  – Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures in HLT, Morgan & Claypool, 2010
  – Jianqiang Wang and Douglas W. Oard, Matching Meaning for Cross-Language Information Retrieval, Information Processing and Management, 2012