Machine Learning for Machine Translation
Pushpak Bhattacharyya, CSE Dept., IIT Bombay. ISI Kolkata, Jan 6, 2014
NLP Trinity

[Figure: the NLP Trinity, three axes. Problem: Morph Analysis, Part-of-Speech Tagging, Parsing, Semantics. Language: Marathi, French, Hindi, English. Algorithm: HMM, CRF, MEMM.]
NLP Layers

From bottom to top, with increasing semantics and complexity of processing:
Morphology
POS tagging
Chunking
Parsing
Extraction
Discourse and Coreference
Motivation for MT

MT is NLP-complete; NLP is AI-complete; AI is CS-complete. How will the world be different when the language barrier disappears? The volume of text that needs to be translated currently exceeds translators' capacity (demand > supply). Solution: automation.
Taxonomy of MT Systems

MT approaches fall into two broad families:
Knowledge-based, rule-based MT: Interlingua-based MT; Transfer-based MT
Data-driven, machine-learning-based MT: Example-Based MT (EBMT); Statistical MT
Why is MT difficult?
Language divergence
Why is MT difficult: Language Divergence
One of the main complexities of MT: Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence

Our work on English-IL language divergence, with illustrations from Hindi: Dave, Parikh, Bhattacharyya, Journal of MT, 2002.
Languages differ in expressing thoughts: Agglutination

Finnish: "istahtaisinkohan"; English: "I wonder if I should sit down for a while". Analysis:
ist + verb stem, "sit"
ahta + verb-derivation morpheme, "to do something for a while"
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han + a particle for things like reminder (with declaratives) or "softening" (with questions and imperatives)
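A minimal sketch of the agglutinative analysis above: representing the word as a morpheme sequence with glosses, and checking that the segmentation reconstructs the surface form. The glosses are taken from the Finnish example in the text.

```python
# The Finnish example "istahtaisinkohan", segmented as in the text.
SEGMENTATION = [
    ("ist",  'verb stem, "sit"'),
    ("ahta", 'derivation, "do something for a while"'),
    ("isi",  "conditional affix"),
    ("n",    "1st person singular suffix"),
    ("ko",   "question particle"),
    ("han",  "reminder/softening particle"),
]

def reconstruct(segmentation):
    """Concatenate the morphemes back into the surface word form."""
    return "".join(m for m, _ in segmentation)

def gloss(segmentation):
    """Render a morpheme-by-morpheme gloss line."""
    return " + ".join(f"{m} ({g})" for m, g in segmentation)

print(reconstruct(SEGMENTATION))  # istahtaisinkohan
print(gloss(SEGMENTATION))
```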
Language Divergence Theory: Lexico-Semantic Divergences (a few examples)

Conflational divergence
F: vomir; E: to be sick
E: stab; H: chure se maaranaa (knife-with hit)
S: utrymningsplan; E: escape plan

Categorial divergence (the change is in POS category)
E: The play is on_PREP (vs. The play is Sunday)
H: khel chal_rahaa_haai_VM (vs. khel ravivaar ko haai)
Language Divergence Theory: Structural Divergences

SVO → SOV
E: Peter plays basketball
H: piitar basketball kheltaa haai

Head-swapping divergence
E: Prime Minister of India
H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
Language Divergence Theory: Syntactic Divergences (a few examples)

Constituent-order divergence
E: Singh, the PM of India, will address the nation today
H: bhaarat ke pradhaan mantrii, singh, … (India-of PM, Singh, …)

Adjunction divergence
E: She will visit here in the summer
H: vah yahaa garmii meM aayegii (she here summer-in will-come)

Preposition-stranding divergence
E: Who do you want to go with?
H: kisake saath aap jaanaa chaahate ho? (who with …)
Vauquois Triangle
Kinds of MT Systems (point of entry from the source to the target text)
Illustration of transfer: SVO → SOV

[Figure: two parse trees linked by a transfer (SVO → SOV) arrow. Source: S → NP (John), VP → V (eats), NP (bread). Target: S → NP (John), VP → NP (bread), V (eats).]
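The tree-to-tree transfer step can be sketched as a recursive rewrite of VP nodes. The nested-tuple tree format below is an assumption made for illustration; it is not the representation used in any particular system.

```python
# A minimal sketch of SVO -> SOV structural transfer on a toy parse:
# the rule VP -> V NP is rewritten as VP -> NP V.

def transfer_svo_to_sov(tree):
    """Recursively reorder (VP, V, NP) nodes to (VP, NP, V)."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [transfer_svo_to_sov(c) for c in children]
    if (label == "VP" and len(children) == 2
            and isinstance(children[0], tuple) and children[0][0] == "V"):
        children = [children[1], children[0]]  # object before verb
    return (label, *children)

def leaves(tree):
    """Read off the surface string of a tree."""
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1:] for w in leaves(c)]

svo = ("S", ("NP", "John"), ("VP", ("V", "eats"), ("NP", "bread")))
sov = transfer_svo_to_sov(svo)
print(" ".join(leaves(sov)))  # John bread eats
```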
Universality Hypothesis
Universality hypothesis: At the level of “deep meaning”, all texts are the “same”, whatever the language.
Understanding the Analysis-Transfer-Generation over the Vauquois triangle (1/4)
H1.1: सरकार_ने चुनावो_के_बाद मुंबई_में करों_के_माध्यम_से अपने राजस्व_को बढ़ाया ।
T1.1: Sarkaar ne chunaawo ke baad Mumbai me karoM ke maadhyam se apne raajaswa ko badhaayaa
G1.1: Government_(ergative) elections_after Mumbai_in taxes_through its revenue_(accusative) increased
E1.1: The Government increased its revenue after the elections through taxes in Mumbai
Interlingual representation: complete disambiguation

[Figure: interlingua graph for "Washington voted Washington to power": a vote node connects the two Washington nodes (the second marked @emphasis) and a power node via a goal relation.]
Kinds of disambiguation needed for a complete and correct interlingua graph
N: Name
P: POS
A: Attachment
S: Sense
C: Co-reference
R: Semantic Role
Issues to handle

Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed.

(The slides build up the issues one by one over this same sentence.)
Part of Speech: noun or verb?
NER: "John" is the name of a PERSON
WSD: financial bank or river bank?
Co-reference: "it" refers to "bank"
Pro drop (subject drop): the subject "I" is dropped
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
System Architecture

[Figure: analysis pipeline. A Simplifier splits the input into simple sentences; a Simple Sentence Analyser (NER, Stanford Dependency Parser, XLE Parser, Clause Marker, Feature Generation, WSD) feeds per-sentence Simple Enconverters, followed by Attribute Generation, Relation Generation, and a Merger.]
Target Sentence Generation from Interlingua

Target sentence generation comprises:
Lexical Transfer (word/phrase translation)
Syntax Planning (sequencing)
Morphological Synthesis (word-form generation)
Generation Architecture
Deconversion = Transfer + Generation
Transfer-Based MT: Marathi-Hindi
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach:
Transfer based
Modules developed using both rule-based and statistical approaches
Architecture of ILILMT System

[Figure: pipeline from Source Text to Target Text. Analysis: Morphological Analyzer, POS Tagger, Chunker, Vibhakti Computation, Named Entity Recognizer, Word Sense Disambiguation. Transfer: Lexical Transfer, Feature Transfer. Generation: Agreement, Intrachunk and Interchunk generation, Word Generator.]

M-H MT System: Evaluation
Subjective evaluation based on machine-translation quality; accuracy is calculated from scores given by linguists.

Score 5: correct translation
Score 4: understandable with minor errors
Score 3: understandable with major errors
Score 2: not understandable
Score 1: nonsense translation

S5, S4, S3: number of sentences with score 5, 4, 3 respectively; N: total number of sentences. Accuracy = [formula on slide, a function of S5, S4, S3 and N].
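The exact accuracy formula did not survive extraction; the sketch below is one plausible reading, with the score weights (1.0, 0.8, 0.6 for scores 5, 4, 3) an explicit assumption rather than the slide's definition.

```python
# Hedged sketch of the subjective-accuracy computation from the counts
# S5, S4, S3 and N defined above. The weights are an ASSUMPTION chosen
# so that a score-5 sentence counts fully and lower scores count less.
def subjective_accuracy(s5, s4, s3, n, weights=(1.0, 0.8, 0.6)):
    w5, w4, w3 = weights
    return 100.0 * (w5 * s5 + w4 * s4 + w3 * s3) / n

print(round(subjective_accuracy(200, 150, 100, 500), 2))  # 76.0
```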
Evaluation of Marathi to Hindi MT System

[Figure: module-wise precision and recall (bar chart, y-axis 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer, Word Generator.]
Evaluation of Marathi to Hindi MT System (contd.)

Subjective evaluation of translation quality, on 500 web sentences; accuracy calculated from scores given according to the translation quality. Accuracy: 65.32%.

Result analysis: the morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality. Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy.
Statistical Machine Translation
Czech-English data
[nesu] “I carry”
[ponese] “He will carry”
[nese] “He carries”
[nesou] “They carry”
[yedu] “I drive”
[plavou] “They swim”
To translate …
I will carry.
They drive.
He swims.
They will drive.
Hindi-English data
[DhotA huM] “I carry”
[DhoegA] “He will carry”
[DhotA hAi] “He carries”
[Dhote hAi] “They carry”
[chalAtA huM] “I drive”
[tErte hEM] “They swim”
Bangla-English data
[bai] “I carry”
[baibe] “He will carry”
[bay] “He carries”
[bay] “They carry”
[chAlAi] “I drive”
[sAMtrAy] “They swim”
To translate … (repeated)
I will carry.
They drive.
He swims.
They will drive.
Foundation

Data-driven approach: the goal is to find the English sentence e, given a foreign-language sentence f, for which p(e|f) is maximum. Translations are generated on the basis of a statistical model whose parameters are estimated from bilingual parallel corpora.
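The argmax over p(e|f) is conventionally decomposed by Bayes' rule into a translation model and a language model, p(e|f) ∝ p(f|e)·p(e). A toy noisy-channel decoding sketch; all probability values and candidate sentences below are made-up illustrative assumptions, not output of any real model.

```python
import math

# Toy noisy-channel decoder: choose e maximizing P(f|e) * P(e).
LM = {"he carries": 0.02, "carries he": 0.0001}   # language model P(e)
TM = {("nese", "he carries"): 0.3,                # translation model P(f|e)
      ("nese", "carries he"): 0.3}

def decode(f, candidates):
    """Return the candidate e with the highest log P(f|e) + log P(e)."""
    return max(candidates,
               key=lambda e: math.log(TM.get((f, e), 1e-12))
                           + math.log(LM.get(e, 1e-12)))

print(decode("nese", ["he carries", "carries he"]))  # he carries
```

Here the translation model scores both word orders equally, and the language model breaks the tie in favour of fluent English, which is exactly the division of labour described above.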
SMT: Language Model

Used to detect good English sentences. The probability of an English sentence w1 w2 … wn can be written as

Pr(w1 w2 … wn) = Pr(w1) · Pr(w2 | w1) · … · Pr(wn | w1 w2 … wn-1)

Here Pr(wn | w1 w2 … wn-1) is the probability that word wn follows the word string w1 w2 … wn-1. An n-gram model truncates this history to the previous n-1 words; a trigram model, for instance, uses Pr(wi | wi-2 wi-1).
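The trigram probability calculation can be sketched with maximum-likelihood counts, Pr(wi | wi-2 wi-1) = count(trigram) / count(bigram history). The tiny corpus is an illustrative assumption; real systems smooth these estimates.

```python
from collections import defaultdict

# Minimal trigram language model with maximum-likelihood estimates.
def train_trigram(sentences):
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            bi[(toks[i-2], toks[i-1])] += 1          # history count
            tri[(toks[i-2], toks[i-1], toks[i])] += 1  # trigram count
    def p(w, u, v):  # Pr(w | u v)
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return p

p = train_trigram(["he carries bread", "he carries water"])
print(p("carries", "<s>", "he"))   # 1.0
print(p("bread", "he", "carries"))  # 0.5
```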
SMT: Translation Model

P(f|e): the probability of f given the hypothesized English translation e. How do we estimate it? Not at the sentence level: sentences are unbounded in variety, so it is not possible to observe the pair (e, f) for all sentences. Instead, introduce a hidden variable a that represents the alignments between the individual words in the sentence pair, and model translation at the word level.
Alignment

If the string e = e1 e2 … el has l words, and the string f = f1 f2 … fm has m words, then the alignment a can be represented by a series a1 a2 … am of m values, each between 0 and l, such that:
if the word in position j of the f-string is connected to the word in position i of the e-string, then aj = i, and
if it is not connected to any English word, then aj = 0.
Example of alignment

English: Ram went to school
Hindi: raama paathashaalaa gayaa

[Figure: alignment links between the two sentences.]
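The alignment vector a = a1 … am defined above can be made concrete on this sentence pair. The specific links below (and "to" being unaligned, i.e. aj = 0 for no Hindi word) are assumptions consistent with the example.

```python
# Alignment vector for the example: for each Hindi (f) position j,
# a[j] is the aligned English (e) position (1-based), or 0 if unaligned.
e = ["Ram", "went", "to", "school"]       # l = 4
f = ["raama", "paathashaalaa", "gayaa"]   # m = 3
a = [1, 4, 2]  # raama->Ram, paathashaalaa->school, gayaa->went

links = [(f[j], e[a[j] - 1]) for j in range(len(f)) if a[j] != 0]
print(links)
```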
Translation Model: exact expression

[Equation on slide; its three factors:] choose the length of the foreign-language string given e; choose the alignment given e and m; choose the identity of each foreign word given e, m and a.

Five models for estimating the parameters in the expression [2]: Model 1, Model 2, Model 3, Model 4, Model 5.
Proof of Translation Model: exact expression

Pr(f | e) = Σ_a Pr(f, a | e)            ; marginalization over a
Pr(f, a | e) = Σ_m Pr(f, a, m | e)      ; marginalization over m

Pr(f, a, m | e) = Pr(m | e) · Pr(f, a | m, e)
               = Pr(m | e) · ∏_{j=1}^{m} Pr(f_j, a_j | a_1^{j-1}, f_1^{j-1}, m, e)
               = Pr(m | e) · ∏_{j=1}^{m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) · Pr(f_j | a_1^{j}, f_1^{j-1}, m, e)

Since m is fixed for a particular f:

Pr(f, a, m | e) = Pr(m | e) ∏_{j=1}^{m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) Pr(f_j | a_1^{j}, f_1^{j-1}, m, e)
Alignment
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speech
EM for word alignment from sentence alignment: example

English-French parallel corpus, with words labeled a b c d (English) and w x y z (French):
(1) three rabbits (a b) ↔ trois lapins (w x)
(2) rabbits of Grenoble (b c d) ↔ lapins de Grenoble (x y z)
Initial probabilities: each cell denotes t(a → w), t(a → x), etc.

      a     b     c     d
w    1/4   1/4   1/4   1/4
x    1/4   1/4   1/4   1/4
y    1/4   1/4   1/4   1/4
z    1/4   1/4   1/4   1/4

The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus. For IBM Model 1, we get the following relationship:

c(w_f | w_e; f, e) = [ t(w_f | w_e) / ( t(w_f | w_0) + … + t(w_f | w_l) ) ] × #(w_f in f) × #(w_e in e)

where c(w_f | w_e; f, e) is the fractional count of the alignment of w_f with w_e in the sentence pair (f, e); t(w_f | w_e) is the probability of w_f being the translation of w_e; #(w_f in f) is the count of w_f in f; and #(w_e in e) is the count of w_e in e.

Example of expected count
c(a → w; (a b) ↔ (w x))
  = t(a → w) / (t(a → w) + t(a → x)) × #(a in 'a b') × #(w in 'w x')
  = (1/4) / (1/4 + 1/4) × 1 × 1 = 1/2
"Counts"

From pair (1), (a b) ↔ (w x):
      a     b     c     d
w    1/2   1/2    0     0
x    1/2   1/2    0     0
y     0     0     0     0
z     0     0     0     0

From pair (2), (b c d) ↔ (x y z):
      a     b     c     d
w     0     0     0     0
x     0    1/3   1/3   1/3
y     0    1/3   1/3   1/3
z     0    1/3   1/3   1/3
Revised probability: example

t_revised(a → w) = (1/2) / [ (1/2 + 1/2 + 0 + 0) from (a b) ↔ (w x) + (0 + 0 + 0 + 0) from (b c d) ↔ (x y z) ] = 1/2
Revised probabilities table

      a     b      c     d
w    1/2   1/4     0     0
x    1/2   5/12   1/3   1/3
y     0    1/6    1/3   1/3
z     0    1/6    1/3   1/3

"Revised counts"

From pair (1), (a b) ↔ (w x):
      a     b     c     d
w    1/2   3/8    0     0
x    1/2   5/8    0     0
y     0     0     0     0
z     0     0     0     0

From pair (2), (b c d) ↔ (x y z):
      a     b     c     d
w     0     0     0     0
x     0    5/9   1/3   1/3
y     0    2/9   1/3   1/3
z     0    2/9   1/3   1/3
Re-revised probabilities table

      a     b       c     d
w    1/2   3/16     0     0
x    1/2   85/144  1/3   1/3
y     0    1/9     1/3   1/3
z     0    1/9     1/3   1/3

Continue until convergence; notice that the (b, x) binding gets progressively stronger: b = rabbits, x = lapins.

Derivation of EM-based Alignment Expressions
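The iterations in the worked example can be reproduced exactly. The sketch below follows the normalization convention used in the example above (fractional counts normalized over the f-words of each sentence pair, then renormalized per e-word), using exact fractions so the iterates can be compared with the tables.

```python
from collections import defaultdict
from fractions import Fraction

# Toy corpus: (a b) <-> (w x) and (b c d) <-> (x y z), i.e.
# "three rabbits / trois lapins" and "rabbits of Grenoble / lapins de Grenoble".
corpus = [(["a", "b"], ["w", "x"]), (["b", "c", "d"], ["x", "y", "z"])]
E, F = ["a", "b", "c", "d"], ["w", "x", "y", "z"]

t = {(e, f): Fraction(1, 4) for e in E for f in F}  # uniform initialization

def em_step(t):
    count = defaultdict(Fraction)
    for es, fs in corpus:
        for e in es:
            z = sum(t[(e, f)] for f in fs)       # normalize over f-words
            for f in fs:
                count[(e, f)] += t[(e, f)] / z   # expected (fractional) count
    row = defaultdict(Fraction)
    for (e, f), c in count.items():
        row[e] += c
    return {(e, f): (count[(e, f)] / row[e] if row[e] else Fraction(0))
            for e in E for f in F}

t = em_step(t)
print(t[("b", "x")])  # 5/12, as in the revised probabilities table
t = em_step(t)
print(t[("b", "x")])  # 85/144, as in the re-revised table
```

After a few more iterations t(b → x) approaches 1, confirming that the (b, x) = (rabbits, lapins) binding strengthens with each step.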
V_E: vocabulary of language L1 (say English)
V_F: vocabulary of language L2 (say Hindi)

E1: what is in a name?
F1: नाम में क्या है (naam meM kya hai); gloss: name in what is?

E2: That which we call rose, by any other name will smell as sweet.
F2: जिसे हम गुलाब कहते हैं, और भी किसी नाम से उसकी खुशबू समान मीठा होगी (jise hum gulab kahte hai, aur bhi kisi naam se uski khushbu samaan mitha hogii); gloss: that which we rose say, any other name by its smell as sweet
Vocabulary mapping

V_E: what, is, in, a, name, that, which, we, call, rose, by, any, other, will, smell, as, sweet
V_F: naam, meM, kya, hai, jise, hum, gulab, kahte, hai, aur, bhi, kisi, uski, khushbu, saman, mitha, hogii
Key Notations

English vocabulary: V_E; French vocabulary: V_F; number of observations (sentence pairs): S.
The data D, consisting of S observations, looks like:
  e^s_1, e^s_2, …, e^s_{l_s} ↔ f^s_1, f^s_2, …, f^s_{m_s},  for s = 1, …, S
Number of words on the English side of the s-th sentence: l_s; on the French side: m_s.
index_E(e^s_p): index of the English word e^s_p in the English vocabulary/dictionary.
index_F(f^s_q): index of the French word f^s_q in the French vocabulary/dictionary.

(Thanks to Sachin Pawar for helping with the maths formulae processing.)
Hidden variables and parameters

Hidden variables (Z): total number = Σ_{s=1}^{S} l_s m_s, where each hidden variable is as follows:
  z^s_{pq} = 1 if, in the s-th sentence, the p-th English word is mapped to the q-th French word; z^s_{pq} = 0 otherwise.

Parameters (Θ): total number = |V_E| × |V_F|, where each parameter is as follows:
  P_{i,j} = probability that the i-th word in the English vocabulary is mapped to the j-th word in the French vocabulary.
Likelihoods

Data likelihood L(D; Θ), data log-likelihood LL(D; Θ), and the expected data log-likelihood E(LL(D; Θ)). [Equations on the slides.]
Constraint and Lagrangian

Σ_{j=1}^{|V_F|} P_{i,j} = 1,  ∀i
Differentiating with respect to P_{i,j} [derivation on the slides]

Final E and M steps

E-step; M-step. [Equations on the slides.]
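The update equations themselves appear as images on the slides and were not recovered; the following is a hedged reconstruction in the notation defined above, the standard result for this model rather than a transcription of the slides:

```latex
% E-step: expected value of the alignment variable z^s_{pq},
% normalizing over the English words of the s-th sentence pair
E\bigl[z^s_{pq}\bigr] =
  \frac{P_{\,\mathrm{index}_E(e^s_p),\,\mathrm{index}_F(f^s_q)}}
       {\sum_{p'=1}^{l_s} P_{\,\mathrm{index}_E(e^s_{p'}),\,\mathrm{index}_F(f^s_q)}}

% M-step: re-estimate P_{i,j} from the expected counts,
% renormalizing so that \sum_j P_{i,j} = 1 (the constraint above)
P_{i,j} =
  \frac{\sum_{s=1}^{S}\sum_{p=1}^{l_s}\sum_{q=1}^{m_s}
          E\bigl[z^s_{pq}\bigr]\,
          \delta\bigl(\mathrm{index}_E(e^s_p)=i\bigr)\,
          \delta\bigl(\mathrm{index}_F(f^s_q)=j\bigr)}
       {\sum_{j'=1}^{|V_F|}\sum_{s=1}^{S}\sum_{p=1}^{l_s}\sum_{q=1}^{m_s}
          E\bigl[z^s_{pq}\bigr]\,
          \delta\bigl(\mathrm{index}_E(e^s_p)=i\bigr)\,
          \delta\bigl(\mathrm{index}_F(f^s_q)=j'\bigr)}
```

The M-step denominator enforces the constraint Σ_j P_{i,j} = 1 introduced in the Lagrangian above.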
Combinatorial considerations / Example / All possible alignments

[Three slides of figures enumerating the possible alignments for an example sentence pair.]
First fundamental requirement of SMT

Alignment requires evidence of:
firstly, a translation pair to introduce the POSSIBILITY of a mapping;
then, another pair to establish that mapping with CERTAINTY.
For the "certainty"

Either we have a translation pair containing the alignment candidates and none of the other words in the pair, OR we have a translation pair containing all the words in the pair except the alignment candidates.
Therefore…

If M valid bilingual mappings exist in a translation pair, then an additional M-1 translation pairs will decide these mappings with certainty.
Rough estimate of data requirement

Consider an SMT system between two languages L1 and L2. Assume no a-priori linguistic or world knowledge, i.e., no meanings or grammatical properties of any words, phrases or sentences. Each language has a vocabulary of 100,000 words, which can give rise to about 500,000 word forms through various morphological processes, assuming each word appears in 5 different forms on average; for example, the word "go" appears as "go", "going", "went" and "gone".
Reasons for mapping to multiple words

Synonymy on the target side (e.g., "to go" in English translating to "jaanaa", "gaman karnaa", "chalnaa", etc. in Hindi), a phenomenon called lexical choice or register.
Polysemy on the source side (e.g., "to go" translating to "ho jaanaa", as in "her face went red in anger" → "usakaa cheharaa gusse se laal ho gayaa").
Syncretism ("went" translating to "gayaa", "gayii", or "gaye": masculine gender, 1st or 3rd person, singular number, past tense, non-progressive aspect, declarative mood).
Estimate of corpora requirement

Assume that, on average, a sentence is 10 words long; then an additional 9 translation pairs are needed for pinning down one of the 5 mappings, i.e., about 10 sentences per mapping per word. A first approximation therefore puts the data requirement at 5 × 10 × 500,000 = 25 million parallel sentences.
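The back-of-the-envelope estimate above, spelled out:

```python
# Data-requirement estimate from the text: 5 mappings per word form,
# ~10 sentence pairs to pin down each mapping, 500,000 word forms.
word_forms    = 100_000 * 5   # vocabulary x avg. forms per word
mappings      = 5             # target-side alternatives per word form
sents_per_map = 10            # sentence pairs needed per mapping
parallel_sentences = mappings * sents_per_map * word_forms
print(parallel_sentences)     # 25000000, i.e. 25 million
```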
This estimate is not wide of the mark: successful SMT systems like Google and Bing reportedly use hundreds of millions of translation pairs.
Our work on factor-based SMT
Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh and Pushpak Bhattacharyya, Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT, ACL-IJCNLP 2009, Singapore, August, 2009.
Case Markers and Morphology are crucial in E-H MT

They give an order-of-magnitude facelift in fluency and fidelity. The Hindi case markers and inflections are determined by the combination of suffixes and semantic relations on the English side. Approach: augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations to Hindi suffixes and case markers.

Semantic relations + suffixes → case markers + inflections
Example:
E: I ate mangoes
H: mei_ne aam khaa_yaa

Our Approach

Factored model (Koehn and Hoang, 2007) with the following translation factor: suffix + semantic relation → case marker/suffix. Experiments with the following relations:
Dependency relations from the Stanford parser
Deeper semantic roles from Universal Networking Language (UNL)

Our Factorization / Experiments / Corpus Statistics / Results

[Slides of tables and example outputs: the factorization scheme, corpus statistics, the impact of suffix and semantic factors, the impact of reordering and semantic relations, subjective evaluation, the impact of sentence length (F: fluency; A: adequacy; E: number of errors), and a feel for the improvement from the baseline, reordered, and semantic-relation systems.]

A recent study: Pan-Indian SMT

Pan-Indian Language SMT (http://www.cfilt.iitb.ac.in/indic-translator)
SMT systems between 11 languages: 7 Indo-Aryan (Hindi, Gujarati, Bengali, Oriya, Punjabi, Marathi, Konkani), 3 Dravidian (Malayalam, Tamil, Telugu), and English
Corpus: Indian Language Corpora Initiative (ILCI) corpus, tourism and health domains, 50,000 parallel sentences
Evaluation with BLEU; METEOR scores also show high correlation with BLEU

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with source-side reordering rules (Ramanathan et al., 2008) (S2)
E-IL PBSMT with source-side reordering rules (Patel et al., 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Natural Partitioning of SMT Systems

Baseline PBSMT %BLEU scores (S1) show a clear partitioning of translation pairs by language-family pairs, based on translation accuracy: shared characteristics within language families make translation simpler, while divergences between language families make translation difficult.

The Challenge of Morphology

[Figure: morphological complexity vs. training corpus size vs. BLEU; vocabulary size is a proxy for morphological complexity. Note: for Tamil, a smaller corpus was used for computing vocabulary size.]
Translation accuracy decreases with increasing morphological complexity.
Even if the training corpus is increased, a commensurate improvement in translation accuracy is not seen for morphologically rich languages.
Handling morphology in SMT is critical.

Common Divergences, Shared Solutions

Comparison of source-reordering methods for E-IL SMT, %BLEU scores (S1, S2, S3). All Indian languages have similar word order, and the same structural divergence from English (SOV ↔ SVO, etc.). Common source-side reordering rules improve E-IL translation by 11.4% (generic) and 18.6% (Hindi-adapted). Common divergences can be handled in a common framework in SMT systems (this idea has been used in knowledge-based MT systems, e.g. Anglabharati).

Harnessing Shared Characteristics

PBSMT + transliteration post-editing for E-IL SMT, %BLEU scores (S4). Out-of-vocabulary words are transliterated in a post-editing step, using a simple transliteration scheme which harnesses the common phonetic organization of Indic scripts. This simple approach yields accuracy improvements of 0.5 BLEU points. Harnessing common characteristics can improve SMT output.

Cognition and Translation: Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya, Automatically Predicting Sentence Translation Difficulty, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013.

Scenario

A subjective notion of difficulty:
John ate jam: easy
John ate jam made from apples: moderate
John is in a jam: difficult?

Use behavioural data

Use behavioural data to decipher strong AI algorithms. Specifically: for WSD by humans, see where the eye rests for clues; for the innate translation difficulty of sentences, see how the eye moves back and forth over the sentences.

[Figure: saccades and fixations. Image courtesy: http://www.smashingmagazine.com/2007/10/09/30-usability-issues-to-be-aware-of/]

Eye-tracking data

Gaze points: position of eye-gaze on the screen.
Fixations: a long stay of the gaze on a particular object on the screen; fixations have both spatial (coordinates) and temporal (duration) properties.
Saccade: a very rapid movement of the eye between positions of rest.
Scanpath: a path connecting a series of fixations.
Regression: revisiting a previously read segment.

Controlling the experimental setup for eye-tracking

Eye-movement patterns are influenced by factors like age, working proficiency, environmental distractions, etc.
Guidelines for eye tracking:
Record participants' metadata (age, expertise, occupation), etc.
Perform a fresh calibration before each new experiment.
Minimize head movement.
Introduce adequate line spacing in the text and avoid scrolling.
Carry out the experiments in a relatively low-light environment.

Use of eye tracking

Used extensively in psychology, mainly to study reading processes. Seminal work: Just, M.A. and Carpenter, P.A. (1980). A theory of reading: from eye fixations to comprehension. Psychological Review 87(4):329-354. Also used in flight simulators for pilot training.

NLP and eye-tracking research

Kliegl (2011): predict word frequency and pattern from eye movements.
Doherty et al. (2010): eye-tracking as an automatic machine translation evaluation technique.
Stymne et al. (2012): eye-tracking as a tool for machine translation (MT) error analysis.
Dragsted (2010): coordination of the reading and writing processes during translation.
A relatively new and open research direction.

Translation Difficulty Index (TDI)

Motivation: route sentences to translators with the right competence, as per the difficulty of translating, e.g. on a crowdsourcing platform. TDI is a function of sentence length (l), degree of polysemy of constituent words (p) and structural complexity (s).

Contributor to TDI: length

What is more difficult to translate?
John eats jam
vs. John eats jam made from apples
vs. John eats jam made from apples grown in orchards
vs. John eats bread made from apples grown in orchards on black soil

Contributor to TDI: polysemy

What is more difficult to translate? "John is in a jam" vs. "John is in difficulty". "Jam" has 4 diverse senses; "difficulty" has 4 related senses.

Contributor to TDI: structural complexity

What is more difficult to translate?
John is in a jam. His debt is huge. The lenders cause him to shy from them, every moment he sees them.
vs.
John is in a jam, caused by his huge debt, which forces him to shy from his lenders every moment he sees them.

Measuring translation difficulty through gaze data

Translation difficulty is indicated by the staying of the eye on segments, and by jumping back and forth between segments.

Example: The horse raced past the garden fell.
H: बगीचा के पास से दौड़ाया गया घोड़ा गिर गया (bagiichaa ke pas se doudaayaa gayaa ghodaa gir gayaa)
The translation process will complete the task till "garden", and then backtrack, revise, restart and translate in a different way.

Scanpaths: an indicator of translation difficulty (Malsburg et al., 2007)

[Figure: scanpaths over the two example sentences.] Sentence 2 is a clear case of "garden pathing", which imposes cognitive load on participants, and they prefer syntactic re-analysis.
Translog: a tool for recording translation process data

Translog (Carl, 2012) is a Windows-based program built with the purpose of recording gaze and keystroke data during translation; it can also be used for other reading- and writing-related studies. Using Translog, one can:
Create and customize translation/reading/writing experiments involving eye-tracking and keystroke logging.
Calibrate the eye-tracker.
Replay and analyze the recorded log files.
Manually correct errors in gaze recording.

TPR Database

The Translation Process Research (TPR) database (Carl, 2012) contains behavioural data for translation activities:
Gaze and keystroke information for more than 450 experiments.
40 different paragraphs translated from English into 7 different languages by multiple translators, with at least 5 translators per language.
Source and target paragraphs annotated with POS tags, lemmas, dependency relations, etc.
Easy-to-use XML data format.

Experimental setup (1/2)

Translators translate sentence by sentence, typing into a text box. The display screen is attached to a remote eye-tracker which constantly records the eye movement of the translator.

Experimental setup (2/2)

20 different text categories were extracted from the data; each piece of text contains 5-10 sentences. For each category, at least 10 participants translated the text into different target languages.

A predictive framework for TDI

[Figure: gaze analysis labels the training data; features plus labels train a regressor, which predicts TDI on test data.]
Direct annotation of TDI is fraught with subjectivity and ad-hocism, so we use the translator's gaze data as annotation to prepare training data.

Annotation of TDI (1/4)

First approximation: TDI is equivalent to the "time taken to translate".
However, the time taken to translate may not be strongly related to translation difficulty: it is difficult to know what fraction of the total time is spent on translation-related thinking, and the measure is sensitive to distractions from the environment.

Annotation of TDI (2/4)

Instead of the "time taken to translate", consider the "time for which translation-related processing is carried out by the brain". This is the translation processing time:

T_p = T_comp + T_gen

where T_comp and T_gen are the times for source-text comprehension and target-text generation respectively.

Annotation of TDI (3/4)

Humans spend time on what they see, and this "time" is correlated with the complexity of the information being processed. With f a fixation, s a saccade, and F_s, S_s (source) and F_t, S_t (target) the sets of fixations and saccades:

T_p = Σ_{f ∈ F_s} dur(f) + Σ_{s ∈ S_s} dur(s) + Σ_{f ∈ F_t} dur(f) + Σ_{s ∈ S_t} dur(s)

Annotation of TDI (4/4)

The measured TDI score is T_p normalized over sentence length:

TDI_measured = T_p / sentence_length

Features

Length: word count of the sentence.
Degree of polysemy: sum of the number of senses of each word in WordNet, normalized by length.
Structural complexity: if the attachment units lie far from each other, the sentence has higher structural complexity; Lin (1996) defines it as the total length of the dependency links in the dependency structure of the sentence.
TDI was measured for 80 sentences from the TPR database.

Experiment and results

Training data of 80 examples; 10-fold cross-validation.
Features computed using Princeton WordNet and the Stanford Dependency Parser.
Support Vector Regression (Joachims et al., 1999) with different kernels.
Error analysis was done with the mean-squared-error estimate; we also computed the correlation of the predicted TDI with the measured TDI.
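The measured-TDI computation defined above is a simple aggregation over gaze events. The event durations below (in milliseconds) are made-up illustrative data, not from the TPR database.

```python
# TDI_measured = T_p / sentence_length, where T_p sums the durations of
# all fixations and saccades over source and target text.
def measured_tdi(fixation_durs, saccade_durs, sentence_length):
    t_p = sum(fixation_durs) + sum(saccade_durs)
    return t_p / sentence_length

src_fix, src_sac = [220, 180, 250], [30, 40]   # source-side events (ms)
tgt_fix, tgt_sac = [300, 210], [35]            # target-side events (ms)
tdi = measured_tdi(src_fix + tgt_fix, src_sac + tgt_sac, 10)
print(tdi)  # 126.5
```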
Examples from the dataset

[Table on slide.]

Summary

Covered interlingua-based MT: the oldest approach to MT.
Covered SMT: the newest approach to MT.
Presented some recent studies in the context of Indian languages.
Covered eye tracking: bringing cognitive aspects to MT.

Summary (contd.)

SMT is the ruling paradigm, but linguistic features can enhance performance, especially factor-based SMT with factors coming from interlingua.
A large-scale effort sponsored by the Ministry of IT's TDIL program is creating MT systems.
Parallel-corpora creation is also going on in a consortium mode.

Conclusions

NLP has assumed great importance because of the large amount of text in electronic form.
Machine learning techniques are increasingly applied.
Highly relevant for India, where multilinguality is a way of life.
Machine translation is more fundamental and ubiquitous than just mapping between two languages: utterance → thought; speech-to-speech online translation.

Publications: http://www.cse.iitb.ac.in/~pb
Resources and tools: http://www.cfilt.iitb.ac.in