Machine Learning for MT

Pushpak Bhattacharyya, CSE Dept., IIT Bombay
ISI Kolkata, Jan 6, 2014

NLP Trinity

[Figure: the NLP Trinity cube with three axes — Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language (Hindi, Marathi, English, French), and Algorithm (HMM, MEMM, CRF)]

NLP Layer

[Figure: the NLP layer stack; complexity of processing increases as one moves up the layers]
- Morphology
- POS tagging
- Chunking
- Parsing
- Semantics Extraction
- Discourse and Coreference

Motivation for MT

- MT: NLP complete; NLP: AI complete; AI: CS complete
- How will the world be different when the language barrier disappears?
- The volume of text required to be translated currently exceeds translators' capacity (demand > supply)
- Solution: automation

Taxonomy of MT systems

MT Approaches
- Knowledge Based: Rule Based MT
  - Interlingua Based
  - Transfer Based
- Data Driven: Machine Learning Based
  - Example Based MT (EBMT)
  - Statistical MT

Why is MT difficult?

Language divergence

Why is MT difficult: Language Divergence

- One of the main complexities of MT: language divergence
- Languages have different ways of expressing meaning
  - Lexico-Semantic Divergence
  - Structural Divergence

Our work on English-IL language divergence, with illustrations from Hindi: Dave, Parikh, Bhattacharyya, Journal of MT, 2002.

Languages differ in expressing thoughts: Agglutination

Finnish: "istahtaisinkohan"
English: "I wonder if I should sit down for a while"

Analysis:
- ist : "sit", verb stem
- ahta : verb derivation morpheme, "to do something for a while"
- isi : conditional affix
- n : 1st person singular suffix
- ko : question particle
- han : a particle for things like reminder (with declaratives) or "softening" (with questions and imperatives)

Language Divergence Theory: Lexico-Semantic Divergences (few examples)

- Conflational divergence
  - F: vomir; E: to be sick
  - E: stab; H: chure se maaranaa (knife-with hit)
  - S: Utrymningsplan; E: escape plan
- Categorial divergence: the change is in POS category
  - E: The play is on_PREP (vs. The play is Sunday)
  - H: khel chal_rahaa_hai_VM (vs. khel ravivaar ko hai)

Language Divergence Theory: Structural Divergences

 SVOSOV  E: Peter plays basketball  H: piitar basketball kheltaa haai

 Head swapping divergence  E: Prime Minister of India  H: bhaarat ke pradhaan mantrii (India-of Prime Minister)

Language Divergence Theory: Syntactic Divergences (few examples)

- Constituent order divergence
  - E: Singh, the PM of India, will address the nation today
  - H: bhaarat ke pradhaan mantrii, singh, … (India-of PM, Singh, …)
- Adjunction divergence
  - E: She will visit here in the summer
  - H: vah yahaa garmii meM aayegii (she here summer-in will come)
- Preposition-stranding divergence
  - E: Who do you want to go with?
  - H: kisake saath aap jaanaa chaahate ho? (who with …)

Vauquois Triangle

Kinds of MT Systems (point of entry from source to the target text)

Illustration of transfer (SVO → SOV)

[Figure: parse trees for "John eats bread" — source tree S → NP(N John) VP(V eats, NP(N bread)); the transfer step (svo → sov) reorders the VP, giving the target tree S → NP(N John) VP(NP(N bread), V eats)]
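The transfer step can be sketched with a toy tree rewriter. The nested-tuple tree encoding below is illustrative only, not the lecture's actual formalism:

```python
# Minimal sketch of syntactic transfer: reorder an SVO parse tree into SOV.
# Trees are (label, children) tuples; leaves are plain strings.

def transfer_svo_to_sov(tree):
    """Recursively swap (V, NP) order inside VPs: SVO -> SOV."""
    label, children = tree
    children = [transfer_svo_to_sov(c) if isinstance(c, tuple) else c
                for c in children]
    if label == "VP" and len(children) == 2:
        v, np = children
        if v[0] == "V" and np[0] == "NP":
            children = [np, v]   # object before verb
    return (label, children)

def leaves(tree):
    """Read the surface string off the tree, left to right."""
    label, children = tree
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out

src = ("S", [("NP", [("N", ["John"])]),
             ("VP", [("V", ["eats"]), ("NP", [("N", ["bread"])])])])
tgt = transfer_svo_to_sov(src)
print(" ".join(leaves(src)))  # John eats bread
print(" ".join(leaves(tgt)))  # John bread eats
```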

Universality hypothesis

Universality hypothesis: At the level of “deep meaning”, all texts are the “same”, whatever the language.

Understanding the Analysis-Transfer-Generation over Vauquois triangle (1/4)

H1.1: सरकार_ने चुनावों_के_बाद मुंबई में करों_के_माध्यम_से अपने राजस्व_को बढ़ाया।
T1.1: sarkaar ne chunaawo ke baad Mumbai me karoM ke maadhyam se apne raajaswa ko badhaayaa
G1.1: Government_(ergative) elections_after Mumbai_in taxes_through its revenue_(accusative) increased
E1.1: The Government increased its revenue after the elections through taxes in Mumbai

Interlingual representation: complete disambiguation

Sentence: Washington voted Washington to power

[Figure: interlingua graph for the sentence — node "vote @past" (action), linked by a goal relation to "power @emphasis" (capability), and linked to the two occurrences of "Washington", disambiguated as place and person respectively]

Kinds of disambiguation needed for a complete and correct interlingua graph

- N: Name
- P: POS
- A: Attachment
- S: Sense
- C: Co-reference
- R: Semantic Role

Issues to handle

Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed.

- Part of Speech: noun or verb
- NER: "John" is the name of a PERSON
- WSD: financial bank or river bank?
- Co-reference: "it" → "bank"
- Pro-drop / subject drop (subject "I")

Typical NLP tools used

- POS tagger
- Stanford Named Entity Recognizer
- Stanford Dependency Parser
- XLE Dependency Parser
- Lexical resources
  - WordNet
  - Universal Word Dictionary (UW++)

System Architecture

[Figure: system architecture — a Simplifier splits the input; a Simple Sentence Analyser (NER, Stanford Dependency Parser, XLE Parser, Clause Marker, Feature Generation, WSD) processes each simple sentence; Attribute Generation and Relation Generation produce a simple encoding per sentence; a Merger combines the encodings]

Target Sentence Generation from interlingua

Target sentence generation stages:
- Lexical Transfer (word/phrase translation)
- Syntax Planning (sequence)
- Morphological Synthesis (word form generation)

Generation Architecture

Deconversion = Transfer + Generation

Transfer Based MT

Marathi-Hindi

Indian Language to Indian Language Machine Translation (ILILMT)

- Bidirectional machine translation system
- Developed for nine Indian language pairs
- Approach:
  - Transfer based
  - Modules developed using both rule-based and statistical approaches

Architecture of ILILMT System

[Figure: pipeline from source text to target text — Analysis: Morphological Analyzer, POS Tagger, Chunker, Vibhakti Computation, Named Entity Recognizer, Word Sense Disambiguation; Transfer: Lexical Transfer, Feature Transfer; Generation: Agreement, Intrachunk, Interchunk, Word Generator]

M-H MT System: Evaluation

- Subjective evaluation based on machine translation quality
- Accuracy calculated based on scores given by linguists

Score 5: correct translation
Score 4: understandable with minor errors
Score 3: understandable with major errors
Score 2: not understandable
Score 1: nonsense translation

S5: number of score-5 sentences, S4: number of score-4 sentences, S3: number of score-3 sentences, N: total number of sentences

Accuracy = [formula shown as an image on the slide, computed from S5, S4, S3 and N]

Evaluation of Marathi to Hindi MT System

[Figure: module-wise precision and recall (bar chart, 0 to 1) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer, Word Generator]

Evaluation of Marathi to Hindi MT System (contd.)

- Subjective evaluation of translation quality
- Evaluated on 500 web sentences
- Accuracy calculated from scores given according to translation quality
- Accuracy: 65.32%

Result analysis:
- The morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality.
- Morph disambiguation, parsing, transfer grammar and FW disambiguation modules are also required to improve accuracy.


Statistical Machine Translation

Czech-English data

- [nesu] "I carry"
- [ponese] "He will carry"
- [nese] "He carries"
- [nesou] "They carry"
- [yedu] "I drive"
- [plavou] "They swim"

To translate …

- I will carry.
- They drive.
- He swims.
- They will drive.

Hindi-English data

- [DhotA huM] "I carry"
- [DhoegA] "He will carry"
- [DhotA hAi] "He carries"
- [Dhote hAi] "They carry"
- [chalAtA huM] "I drive"
- [tErte hEM] "They swim"

Bangla-English data

- [bai] "I carry"
- [baibe] "He will carry"
- [bay] "He carries"
- [bay] "They carry"
- [chAlAi] "I drive"
- [sAMtrAy] "They swim"

To translate … (repeated)

- I will carry.
- They drive.
- He swims.
- They will drive.

Foundation

- Data driven approach
- Goal: find the English sentence e, given the foreign language sentence f, for which p(e|f) is maximum
- Translations are generated on the basis of a statistical model
- Parameters are estimated using bilingual parallel corpora
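The maximization of p(e|f) above is conventionally rewritten with Bayes' rule, which is what motivates the two components introduced next (language model and translation model):

```latex
\hat{e} = \arg\max_{e} \, p(e \mid f)
        = \arg\max_{e} \, \frac{p(e)\, p(f \mid e)}{p(f)}
        = \arg\max_{e} \, \underbrace{p(e)}_{\text{language model}} \; \underbrace{p(f \mid e)}_{\text{translation model}}
```

The denominator p(f) is constant over candidate sentences e, so it drops out of the argmax.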

SMT: Language Model

- To detect good English sentences
- The probability of an English sentence w1 w2 … wn can be written as

Pr(w1 w2 … wn) = Pr(w1) * Pr(w2|w1) * … * Pr(wn|w1 w2 … wn-1)

- Here Pr(wn|w1 w2 … wn-1) is the probability that word wn follows the word string w1 w2 … wn-1
- N-gram model: approximate the history by the previous N-1 words
- Trigram model: Pr(wn|w1 … wn-1) ≈ Pr(wn|wn-2 wn-1)
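As a sketch, the trigram probabilities can be estimated by maximum likelihood from counts; the corpus below is invented for illustration:

```python
# Toy trigram language model with MLE estimates (no smoothing), making the
# chain-rule/Markov approximation concrete. Corpus and tokens are invented.
from collections import defaultdict

def train_trigram(sentences):
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i-2], words[i-1], words[i])] += 1  # trigram counts
            bi[(words[i-2], words[i-1])] += 1             # history counts
    # Pr(w | u v) = count(u v w) / count(u v)
    return lambda w, u, v: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

corpus = ["he carries water", "he carries wood", "they carry water"]
p = train_trigram(corpus)
print(p("carries", "<s>", "he"))    # Pr(carries | <s> he) = 1.0
print(p("water", "he", "carries"))  # Pr(water | he carries) = 0.5
```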

SMT: Translation Model

- P(f|e): probability of some f given the hypothesized English translation e
- How to assign the values of P(f|e)?
- Sentence level: sentences are infinite, so it is not possible to estimate P(f|e) for every pair (e, f) directly
- Word level: introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Jan 6, 2014 ISI Kolkata, ML for MT 42 Alignment

l  If the string, e= e1 = e1 e2 …el, has l words, and m the string, f= f1 =f1f2...fm, has m words,  then the alignment, a, can be represented by a m series, a1 = a1a2...am , of m values, each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string, then

 aj= i, and

 if it is not connected to any English word, then

aj= O

Example of alignment

English: Ram went to school
Hindi: raama paathashaalaa gayaa

[Figure: alignment links between the two sentences — Ram ↔ raama, went ↔ gayaa, school ↔ paathashaalaa]
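The alignment definition can be encoded directly as a vector; the particular link choices below are an illustrative reading of the example, not given explicitly on the slide:

```python
# Representing the alignment a = a_1 ... a_m from the definition above:
# a_j = i means the j-th Hindi word is connected to the i-th English word
# (1-based), and a_j = 0 means it is connected to no English word.
e = ["Ram", "went", "to", "school"]        # l = 4
f = ["raama", "paathashaalaa", "gayaa"]    # m = 3
a = [1, 4, 2]  # raama->Ram, paathashaalaa->school, gayaa->went

links = [(f[j], e[a[j] - 1]) for j in range(len(f)) if a[j] > 0]
print(links)
# [('raama', 'Ram'), ('paathashaalaa', 'school'), ('gayaa', 'went')]
```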

Translation Model: Exact expression

Pr(f|e) is decomposed into three choices:
- choose the length m of the foreign language string, given e
- choose the alignment a, given e and m
- choose the identity of each foreign word, given e, m and a

Five models for estimating the parameters in the expression [2]: Model 1, Model 2, Model 3, Model 4, Model 5

Proof of Translation Model: Exact expression

Pr(f|e) = Σ_a Pr(f, a|e)                              ; marginalization
Pr(f, a|e) = Σ_m Pr(f, a, m|e)                        ; marginalization
Pr(f, a, m|e) = Pr(m|e) Pr(f, a|m, e)
              = Pr(m|e) Π_{j=1..m} Pr(f_j, a_j | a_1^{j-1}, f_1^{j-1}, m, e)
              = Pr(m|e) Π_{j=1..m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) Pr(f_j | a_1^{j}, f_1^{j-1}, m, e)

m is fixed for a particular f, so the sum over m collapses, and hence

Pr(f, a, m|e) = Pr(m|e) Π_{j=1..m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) Pr(f_j | a_1^{j}, f_1^{j-1}, m, e)

Alignment

Fundamental and ubiquitous

- Spell checking
- Translation
- Transliteration
- Speech to text
- Text to speech

EM for word alignment from sentence alignment: example

English                      French
(1) three rabbits            (1) trois lapins
    a     b                      w     x
(2) rabbits of Grenoble      (2) lapins de Grenoble
    b       c  d                 x      y  z

Initial probabilities: each cell denotes t(a → w), t(a → x), etc.

     a    b    c    d
w   1/4  1/4  1/4  1/4
x   1/4  1/4  1/4  1/4
y   1/4  1/4  1/4  1/4
z   1/4  1/4  1/4  1/4

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus.

For IBM Model 1, we get the following relationship:

c(w_f | w_e ; f, e) = [ t(w_f | w_e) / ( t(w_f | w_e0) + … + t(w_f | w_el) ) ] × #(w_f in f) × #(w_e in e)

where
- c(w_f | w_e ; f, e) is the fractional count of the alignment of w_f with w_e in f and e
- t(w_f | w_e) is the probability of w_f being the translation of w_e
- #(w_f in f) is the count of w_f in f
- #(w_e in e) is the count of w_e in e

Example of expected count

C[aw; (a b)(w x)]

t(aw) = ------X #(a in ‘a b’) X #(w in ‘w x’) t(aw)+t(ax) 1/4 = ------X 1 X 1= 1/2 1/4+1/4

"Counts"

From (a b) ↔ (w x):              From (b c d) ↔ (x y z):

     a    b    c    d                 a    b    c    d
w   1/2  1/2   0    0            w    0    0    0    0
x   1/2  1/2   0    0            x    0   1/3  1/3  1/3
y    0    0    0    0            y    0   1/3  1/3  1/3
z    0    0    0    0            z    0   1/3  1/3  1/3

Revised probability: example

t_revised(a → w) = (1/2) / [ (1/2 + 1/2 + 0 + 0) from (a b)(w x) + (0 + 0 + 0 + 0) from (b c d)(x y z) ] = 1/2

Revised probabilities table

     a    b     c    d
w   1/2  1/4    0    0
x   1/2  5/12  1/3  1/3
y    0   1/6   1/3  1/3
z    0   1/6   1/3  1/3

"Revised counts"

From (a b) ↔ (w x):              From (b c d) ↔ (x y z):

     a    b    c    d                 a    b    c    d
w   1/2  3/8   0    0            w    0    0    0    0
x   1/2  5/8   0    0            x    0   5/9  1/3  1/3
y    0    0    0    0            y    0   2/9  1/3  1/3
z    0    0    0    0            z    0   2/9  1/3  1/3

Re-revised probabilities table

     a    b       c    d
w   1/2  3/16     0    0
x   1/2  85/144  1/3  1/3
y    0   1/9     1/3  1/3
z    0   1/9     1/3  1/3

Continue until convergence; notice that the (b, x) binding gets progressively stronger; b = rabbits, x = lapins.

Derivation of EM based Alignment Expressions

V_E = vocabulary of language L1 (say English)
V_F = vocabulary of language L2 (say Hindi)

E1: what is in a name ?
F1: नाम में क्या है ?
    naam meM kya hai ?
    name in what is ?

E2: That which we call rose, by any other name will smell as sweet.
F2: जिसे हम गुलाब कहते हैं, और भी किसी नाम से उसकी खुशबू समान मीठा होगी
    jise hum gulab kahte hai, aur bhi kisi naam se uski khushbu samaan mitha hogii
    that which we rose say, any other name by its smell as sweet

Vocabulary mapping

V_E: what, is, in, a, name, that, which, we, call, rose, by, any, other, will, smell, as, sweet
V_F: naam, meM, kya, hai, jise, hum, gulab, kahte, hai, aur, bhi, kisi, uski, khushbu, samaan, mitha, hogii

Key Notations

English vocabulary: V_E
French vocabulary: V_F
Number of observations / sentence pairs: S

Data D, which consists of S observations, looks like:

e^1_1, e^1_2, …, e^1_{l^1} ↔ f^1_1, f^1_2, …, f^1_{m^1}
e^2_1, e^2_2, …, e^2_{l^2} ↔ f^2_1, f^2_2, …, f^2_{m^2}
…
e^s_1, e^s_2, …, e^s_{l^s} ↔ f^s_1, f^s_2, …, f^s_{m^s}
…
e^S_1, e^S_2, …, e^S_{l^S} ↔ f^S_1, f^S_2, …, f^S_{m^S}

Number of words on the English side of the s-th sentence: l^s
Number of words on the French side of the s-th sentence: m^s
index_E(e^s_p) = index of English word e^s_p in the English vocabulary/dictionary
index_F(f^s_q) = index of French word f^s_q in the French vocabulary/dictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Hidden variables and parameters

Hidden variables (Z):
Total number of hidden variables = Σ_{s=1..S} l^s m^s, where each hidden variable is as follows:
z^s_{pq} = 1, if in the s-th sentence the p-th English word is mapped to the q-th French word
z^s_{pq} = 0, otherwise

Parameters (Θ):
Total number of parameters = |V_E| × |V_F|, where each parameter is as follows:
P_{i,j} = probability that the i-th word in the English vocabulary is mapped to the j-th word in the French vocabulary

Likelihoods

Data Likelihood L(D; Θ):

Data Log-Likelihood LL(D; Θ) :

Expected value of Data Log-Likelihood E(LL(D; Θ)) :

Constraint and Lagrangian

Σ_{j=1..|V_F|} P_{i,j} = 1 , ∀i
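In standard form (a reconstruction; the slide's own expression was an image), the Lagrangian combining the expected log-likelihood with this normalization constraint is:

```latex
\Lambda(\Theta, \lambda) \;=\; E\big(LL(D;\Theta)\big)
\;-\; \sum_{i=1}^{|V_E|} \lambda_i \Big( \sum_{j=1}^{|V_F|} P_{i,j} - 1 \Big)
```

Setting the partial derivatives with respect to each P_{i,j} to zero, and solving for the multipliers λ_i, yields the re-estimation formula.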

Differentiating wrt Pij

Final E and M steps

M-step

E-step
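A standard reconstruction of the two updates in the notation above (the original formulas were shown as images, so this is the textbook IBM Model 1 form, not a verbatim copy):

```latex
\text{E-step:}\quad
E\big[z^{s}_{pq}\big] \;=\;
\frac{P_{\,\mathrm{index}_E(e^{s}_{p}),\,\mathrm{index}_F(f^{s}_{q})}}
     {\sum_{p'=1}^{l^{s}} P_{\,\mathrm{index}_E(e^{s}_{p'}),\,\mathrm{index}_F(f^{s}_{q})}}

\text{M-step:}\quad
P_{i,j} \;=\;
\frac{\sum_{s=1}^{S} \sum_{p=1}^{l^{s}} \sum_{q=1}^{m^{s}}
      E\big[z^{s}_{pq}\big]\,
      \mathbb{1}\big[\mathrm{index}_E(e^{s}_{p})=i\big]\,
      \mathbb{1}\big[\mathrm{index}_F(f^{s}_{q})=j\big]}
     {\sum_{j'=1}^{|V_F|} \sum_{s,p,q}
      E\big[z^{s}_{pq}\big]\,
      \mathbb{1}\big[\mathrm{index}_E(e^{s}_{p})=i\big]\,
      \mathbb{1}\big[\mathrm{index}_F(f^{s}_{q})=j'\big]}
```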

Combinatorial considerations

Example

All possible alignments

First fundamental requirement of SMT

Alignment requires evidence of:
- firstly, a translation pair to introduce the POSSIBILITY of a mapping
- then, another pair to establish that mapping with CERTAINTY

For the "certainty"

- We have a translation pair containing the alignment candidates and none of the other words in the translation pair, OR
- We have a translation pair containing all words in the translation pair except the alignment candidates

Therefore…

- If M valid bilingual mappings exist in a translation pair, then an additional M-1 pairs of translations will decide these mappings with certainty.

Rough estimate of data requirement

- SMT system between two languages L1 and L2
- Assume no a priori linguistic or world knowledge, i.e., no meanings or grammatical properties of any words, phrases or sentences
- Each language has a vocabulary of 100,000 words, which can give rise to about 500,000 word forms through various morphological processes, assuming each word appears in 5 different forms on average
- For example, the word 'go' appears as 'go', 'going', 'went' and 'gone'

Reasons for mapping to multiple words

- Synonymy on the target side (e.g., "to go" in English translating to "jaanaa", "gaman karnaa", "chalnaa", etc. in Hindi), a phenomenon called lexical choice or register
- Polysemy on the source side (e.g., "to go" translating to "ho jaanaa", as in "her face went red in anger" → "usakaa cheharaa gusse se laal ho gayaa")
- Syncretism ("went" translating to "gayaa", "gayii", or "gaye": masculine gender, 1st or 3rd person, singular number, past tense, non-progressive aspect, declarative mood)

Estimate of corpora requirement

- Assume that on average a sentence is 10 words long
- ⇒ an additional 9 translation pairs for getting at one of the 5 mappings
- ⇒ 10 sentences per mapping per word
- ⇒ a first approximation puts the data requirement at 5 × 10 × 500,000 = 25 million parallel sentences
- This estimate is not wide off the mark
- Successful SMT systems like Google and Bing reportedly use hundreds of millions of translation pairs

Our work on factor based SMT

Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh and Pushpak Bhattacharyya, Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT, ACL-IJCNLP 2009, Singapore, August, 2009.

Case markers and morphology crucial in E-H MT

- Order-of-magnitude facelift in fluency and fidelity
- Determined by the combination of suffixes and semantic relations on the English side
- Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations to Hindi suffixes and case markers

Semantic relations + suffixes → case markers + inflections

E: I ate mangoes
H: mei_ne aam khaa_yaa

[Figure: mapping of semantic relations and suffixes on the English side to case markers and inflections on the Hindi side, e.g., "I" → "mei_ne", "ate" → "khaa_yaa"]

Our Approach

- Factored model (Koehn and Hoang, 2007) with the following translation factor:
  - suffix + semantic relation → case marker/suffix
- Experiments with the following relations:
  - dependency relations from the Stanford parser
  - deeper semantic roles from Universal Networking Language (UNL)

Our Factorization

Experiments

Corpus Statistics

Results: the impact of suffix and semantic factors

Results: the impact of reordering and semantic relations

Subjective evaluation: the impact of reordering and semantic relations

Impact of sentence length (F: fluency; A: adequacy; E: # errors)

A feel for the improvement: baseline

A feel for the improvement: reorder

A feel for the improvement: semantic relation

A recent study

PAN Indian SMT

Pan-Indian Language SMT
http://www.cfilt.iitb.ac.in/indic-translator

- SMT systems between 11 languages
  - 7 Indo-Aryan: Hindi, Gujarati, Bengali, Oriya, Punjabi, Marathi, Konkani
  - 3 Dravidian: Malayalam, Tamil, Telugu
  - English
- Corpus: Indian Language Corpora Initiative (ILCI) corpus
  - Tourism and health domains
  - 50,000 parallel sentences
- Evaluation with BLEU
  - METEOR scores also show high correlation with BLEU

SMT Systems Trained

- Phrase-based (PBSMT) baseline system (S1)
- E-IL PBSMT with source-side reordering rules (Ramanathan et al., 2008) (S2)
- E-IL PBSMT with source-side reordering rules (Patel et al., 2013) (S3)
- IL-IL PBSMT with transliteration post-editing (S4)

Natural Partitioning of SMT systems

Baseline PBSMT: % BLEU scores (S1)

- Clear partitioning of translation pairs by language-family pairs, based on translation accuracy
- Shared characteristics within language families make translation simpler
- Divergences among language families make translation difficult

The Challenge of Morphology

[Figure: morphological complexity vs. training corpus size vs. BLEU; vocabulary size is used as a proxy for morphological complexity. Note: for Tamil, a smaller corpus was used for computing vocabulary size]

- Translation accuracy decreases with increasing morphological complexity
- Even if the training corpus is increased, a commensurate improvement in translation accuracy is not seen for morphologically rich languages
- Handling morphology in SMT is critical

Common Divergences, Shared Solutions

Comparison of source reordering methods for E-IL SMT: % BLEU scores (S1, S2, S3)

- All Indian languages have similar word order
- The same structural divergences hold between English and Indian languages (SOV ↔ SVO, etc.)
- Common source-side reordering rules improve E-IL translation by 11.4% (generic) and 18.6% (Hindi-adapted)
- Common divergences can be handled in a common framework in SMT systems (this idea has been used for knowledge based MT systems, e.g., Anglabharati)

Harnessing Shared Characteristics

PBSMT+ transliteration post-editing for E-IL SMT - % BLEU scores (S4)

- Out-of-vocabulary words are transliterated in a post-editing step
- Done using a simple transliteration scheme which harnesses the common phonetic organization of Indic scripts
- Accuracy improvements of 0.5 BLEU points with this simple approach
- Harnessing common characteristics can improve SMT output
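The "common phonetic organization" shows up in Unicode itself: the major Indic script blocks are laid out in parallel, so a rough Devanagari-to-Bengali transliteration is just a constant code-point shift. This is a sketch of the core idea only; a real scheme needs script-specific exception handling:

```python
# Rough Devanagari -> Bengali transliteration using the parallel layout of
# Unicode Indic blocks (Devanagari U+0900-097F, Bengali U+0980-09FF).

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
OFFSET = 0x0980 - 0x0900  # shift between the two 128-code-point blocks

def deva_to_beng(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            out.append(chr(cp + OFFSET))   # same letter, Bengali block
        else:
            out.append(ch)                 # leave other characters untouched
    return "".join(out)

print(deva_to_beng("कमल"))  # -> কমল ("kamala", lotus)
```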

Cognition and Translation: Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya, Automatically Predicting Sentence Translation Difficulty, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013

Scenario

Sentences and the subjective notion of difficulty:
- John ate jam → easy
- John ate jam made from apples → moderate
- John is in a jam → difficult?

 John ate jam made  Moderate from apples  John is in a jam  Difficult?

Jan 6, 2014 ISI Kolkata, ML for MT 97 Use behavioural data

 Use behavioural data to decipher strong AI algorithms

 Specifically,

 For WSD by humans, see where the eye rests for clues

 For the innate translation difficulty of sentences, see how the eye moves back and forth over the sentences

Jan 6, 2014 ISI Kolkata, ML for MT 98 Saccades

Fixations

Jan 6, 2014 ISI Kolkata, ML for MT 99 Image Courtesy: http://www.smashingmagazine.com/2007/10/09/30-usability-issues-to-be-aware-of/ Eye Tracking data

- Gaze points: positions of the eye-gaze on the screen
- Fixation: a long stay of the gaze on a particular object on the screen; fixations have both spatial (coordinates) and temporal (duration) properties
- Saccade: a very rapid movement of the eye between positions of rest
- Scanpath: a path connecting a series of fixations
- Regression: revisiting a previously read segment

Controlling the experimental setup for eye-tracking

- Eye movement patterns are influenced by factors like age, working proficiency, environmental distractions, etc.
- Guidelines for eye tracking:
  - record participants' metadata (age, expertise, occupation, etc.)
  - perform a fresh calibration before each new experiment
  - minimize head movement
  - introduce adequate line spacing in the text and avoid scrolling
  - carry out the experiments in a relatively low-light environment

Use of eye tracking

- Used extensively in psychology, mainly to study reading processes
- Seminal work: Just, M.A. and Carpenter, P.A. (1980). A theory of reading: from eye fixations to comprehension. Psychological Review 87(4):329–354
- Used in flight simulators for pilot training

NLP and Eye Tracking research

- Kliegl (2011): predicting word frequency and pattern from eye movements
- Doherty et al. (2010): eye-tracking as an automatic machine translation evaluation technique
- Stymne et al. (2012): eye-tracking as a tool for machine translation (MT) error analysis
- Dragsted (2010): coordination of the reading and writing processes during translation

A relatively new and open research direction.

Translation Difficulty Index (TDI)

- Motivation: route sentences to translators with the right competence, as per the difficulty of translating, e.g., on a crowdsourcing platform
- TDI is a function of:
  - sentence length (l)
  - degree of polysemy of constituent words (p)
  - structural complexity (s)

Contributor to TDI: length

What is more difficult to translate?

- John eats jam
- John eats jam made from apples
- John eats jam made from apples grown in orchards
- John eats bread made from apples grown in orchards on black soil

Contributor to TDI: polysemy

What is more difficult to translate?

- John is in a jam
- John is in difficulty

"Jam" has 4 diverse senses; "difficulty" has 4 related senses.

Contributor to TDI: structural complexity

What is more difficult to translate?

- John is in a jam. His debt is huge. The lenders cause him to shy from them, every moment he sees them.
- John is in a jam, caused by his huge debt, which forces him to shy from his lenders every moment he sees them.


Measuring translation difficulty through gaze data

- Translation difficulty is indicated by:
  - the staying of the eye on segments
  - jumping back and forth between segments

Example:
E: The horse raced past the garden fell
H: बगीचा के पास से दौड़ाया गया घोड़ा गिर गया
   bagiichaa ke pas se doudaayaa gayaa ghodaa gir gayaa

The translation process will complete the task till "garden", and then backtrack, revise, restart and translate in a different way.

Scanpaths: indicator of translation difficulty

- (Malsburg et al., 2007)
- Sentence 2 is a clear case of garden-pathing, which imposes cognitive load on participants, and they prefer syntactic re-analysis.

Translog: a tool for recording translation process data

- Translog (Carl, 2012): a Windows based program
- Built with the purpose of recording gaze and keystroke data during translation
- Can be used for other reading- and writing-related studies
- Using Translog, one can:
  - create and customize translation/reading/writing experiments involving eye-tracking and keystroke logging
  - calibrate the eye-tracker
  - replay and analyze the recorded log files
  - manually correct errors in gaze recording

TPR Database

- The Translation Process Research (TPR) database (Carl, 2012) contains behavioural data for translation activities
- Contains gaze and keystroke information for more than 450 experiments
- 40 different paragraphs are translated from English into 7 different languages by multiple translators
  - at least 5 translators per language
- Source and target paragraphs are annotated with POS tags, lemmas, dependency relations, etc.
- Easy-to-use XML data format

Experimental setup (1/2)

- Translators translate sentence by sentence, typing into a text box
- The display screen is attached to a remote eye-tracker which constantly records the eye movements of the translator

Experimental setup (2/2)

- Extracted 20 different text categories from the data
- Each piece of text contains 5-10 sentences
- For each category we had at least 10 participants who translated the text into different target languages

A predictive framework for TDI

[Figure: pipeline — training data labeled through gaze analysis → features → regressor → TDI predicted on test data]

- Direct annotation of TDI is fraught with subjectivity and ad-hocism
- We use the translator's gaze data as annotation to prepare training data

Annotation of TDI (1/4)

- First approximation: TDI is equivalent to the "time taken to translate"
- However, the time taken to translate may not be strongly related to translation difficulty:
  - it is difficult to know what fraction of the total time is spent on translation-related thinking
  - it is sensitive to distractions from the environment

Annotation of TDI (2/4)

- Instead of the "time taken to translate", consider the "time for which translation-related processing is carried out by the brain"
- This is called the translation processing time, given by:

T_p = T_comp + T_gen

- T_comp and T_gen are the times for source text comprehension and target text generation, respectively.

Annotation of TDI (3/4)

Humans spend time on what they see, and this "time" is correlated with the complexity of the information being processed.

With f a fixation, s a saccade, F_s and S_s the fixations and saccades on the source, and F_t and S_t those on the target:

T_p = Σ_{f ∈ F_s} dur(f) + Σ_{s ∈ S_s} dur(s) + Σ_{f ∈ F_t} dur(f) + Σ_{s ∈ S_t} dur(s)

Annotation of TDI (4/4)

- The measured TDI score is T_p normalized over sentence length:

TDI_measured = T_p / sentence_length
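A small sketch of this computation, with invented durations (in milliseconds):

```python
# Computing the measured TDI from the formulas above:
# Tp = sum of fixation and saccade durations on source and target text,
# TDI = Tp normalized by sentence length. All durations are invented.

def measured_tdi(src_fix, src_sac, tgt_fix, tgt_sac, sentence_length):
    tp = sum(src_fix) + sum(src_sac) + sum(tgt_fix) + sum(tgt_sac)
    return tp / sentence_length

tdi = measured_tdi(
    src_fix=[220, 180, 250],  # fixation durations on the source text (ms)
    src_sac=[30, 25],         # saccade durations on the source text (ms)
    tgt_fix=[300, 270],       # fixation durations on the target text (ms)
    tgt_sac=[40],             # saccade durations on the target text (ms)
    sentence_length=5,        # words in the source sentence
)
print(tdi)  # 263.0
```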

Features

- Length: word count of the sentence
- Degree of polysemy: sum of the number of senses of each word in WordNet, normalized by sentence length
- Structural complexity: if the attachment units lie far from each other, the sentence has higher structural complexity. Lin (1996) defines it as the total length of the dependency links in the dependency structure of the sentence.

TDI was measured on 80 sentences from the TPR database.
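Lin's structural complexity measure can be sketched as follows; the head indices in the example are a hand-made illustration, not Stanford parser output:

```python
# Structural complexity as in Lin (1996): the total length of dependency
# links in the sentence's dependency structure.

def structural_complexity(heads):
    """heads[i] = 1-based index of word i's head (0 for the root).
    Link length = distance between a word and its head."""
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)

# "John is in a jam": is = root; John -> is, in -> is, a -> jam, jam -> in
heads = [2, 0, 2, 5, 3]
print(structural_complexity(heads))  # 1 + 1 + 1 + 2 = 5
```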

Experiment and results

- Training data of 80 examples; 10-fold cross-validation
- Features computed using the Princeton WordNet and the Stanford Dependency Parser
- Support Vector Regression (Joachims et al., 1999) with different kernels
- Error analysis was done with the mean squared error estimate
- We also computed the correlation of the predicted TDI with the measured TDI

Examples from the dataset

Summary

- Covered interlingua based MT: the oldest approach to MT
- Covered SMT: the newest approach to MT
- Covered eye tracking: bringing cognitive aspects to MT
- Presented some recent studies in the context of Indian languages

Summary (contd.)

- SMT is the ruling paradigm
- But linguistic features can enhance performance, especially factor based SMT with factors coming from interlingua
- A large scale effort sponsored by the Ministry of IT's TDIL program is creating MT systems
- Parallel corpora creation is also going on in a consortium mode

Conclusions

- NLP has assumed great importance because of the large amount of text in electronic form
- Machine learning techniques are increasingly applied
- Highly relevant for India, where multilinguality is a way of life
- Machine translation is more fundamental and ubiquitous than just mapping between two languages
  - utterance → thought
  - speech-to-speech online translation

Pubs: http://www.cse.iitb.ac.in/~pb
Resources and tools: http://www.cfilt.iitb.ac.in