A LANGUAGE LEARNING FRAMEWORK BASED ON
MACARONIC TEXTS
by
Adithya Renduchintala
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.
Baltimore, Maryland
August, 2020
© 2020 Adithya Renduchintala
All rights reserved

Abstract
This thesis explores a new framework for foreign language (L2) education. Our framework introduces new L2 words and phrases interspersed within text written in the student's native language (L1), resulting in a macaronic document. We focus on utilizing the inherent ability of students to comprehend macaronic sentences incidentally and, in doing so, learn new attributes of a foreign language (vocabulary and phrases). Our goal is to build an AI-driven foreign language teacher that converts any document written in a student's L1 (news articles, stories, novels, etc.) into a pedagogically useful macaronic document. A valuable property of macaronic instruction is that language learning is "disguised" as a simple reading activity.
In this pursuit, we first analyze how users guess the meaning of a single novel L2 word (a noun) placed within an L1 sentence. We study the features users tend to use as they guess the meaning of the L2 word. We then extend our model to handle multiple novel words and phrases in a single sentence, resulting in a graphical model that performs a modified cloze task. To do so, we also define a data structure that supports realizing the exponentially many macaronic configurations possible for a given sentence. We also explore ways to use a neural
cloze language model trained only on L1 text as a "drop-in" replacement for a real human student. Finally, we report findings on modeling students navigating through a foreign language inflection learning task. We hope that these form a foundation for future research into the construction of AI-driven foreign language teachers using macaronic language.
Primary Reader and Advisor: Philipp Koehn
Committee Member and Advisor: Jason Eisner
Committee Member: Kevin Duh
Acknowledgements
Portions of this dissertation have been previously published in the following jointly authored papers: Renduchintala et al. (2016b), Renduchintala et al. (2016a), Knowles et al. (2016), Renduchintala, Koehn, and Eisner (2017), Renduchintala, Koehn, and Eisner (2019a), and Renduchintala, Koehn, and Eisner (2019b). This work would not be possible without the direct efforts of my co-authors Rebecca Knowles, Jason Eisner, and Philipp Koehn.
Philipp Koehn has provided me with calm, collected, and steady mentorship throughout my Ph.D. life. Sometimes, I found myself interested in topics far removed from my thesis, but Philipp still supported me and gave me the freedom to pursue them. Jason Eisner raised the bar for what I thought being a researcher meant, opened my eyes to thinking deeply about a problem, and taught me to communicate my findings effectively. I like to believe Jason has made me a better researcher by teaching me (apart from all the technical knowledge) simple yet profound things: details matter, and be mindful of the bigger picture even if you are solving a specific problem.
I am thankful to my defense committee and oral exam committee for valuable feedback and guidance: Kevin Duh, Tal Linzen, Chadia Abras, and Mark Dredze (alternate). I am
incredibly grateful to Kevin Duh, who not only served on my committee but also mentored me on two projects leading to two publications. Kevin not only cared about my work but also about me and my well-being in stressful times. Thank you, Kevin.
I want to thank Mark Dredze and Suchi Saria, my advisors for a brief period when I first started my Ph.D. I am incredibly grateful to Adam Lopez, who first inspired me to research Machine Translation. I will never forget my first CLSP-PIRE workshop experience in Prague, thanks to Adam. Matt Post, along with Adam Lopez, was an instructor in my first-year MT course, and Matt has been fantastic to discuss project ideas and collaborate with.
I have been lucky to work with excellent researchers in speech recognition as well. Shinji Watanabe took a chance on me and helped me work through my first-ever speech paper in 2018, which eventually won the best paper award. Shinji encouraged me and gave me the freedom to explore an idea even though I was not an experienced speech researcher. I would also like to thank Sanjeev Khudanpur and Najim Dehak for several discussions during my stint in the 2018 JSALT workshop.
I want to thank my fellow CLSP students and CLSP alumni who shared some time with me. Thank you, Tongfei Chen, Jaejin Cho, Shuoyang Ding, Katie Henry, Jonathan Jones, Huda Khayrallah, Gaurav Kumar, Ke Li, Ruizhi Li, Chu-Cheng Lin, Xutai Ma, Matthew Maciejewski, Matthew Weisner, Kelly Marchisio, Chandler May, Arya McCarthy, Hongyuan Mei, Sabrina Mielke, Phani Nidadavolu, Raghavendra Pappagari, Adam Poliak, Samik Sadhu, Elizabeth Salesky, Peter Schulam, Pamela Shapiro, Suzanna Sia, David Snyder, Brian Thompson, Tim Vieira, Yiming Wang, Zachary Wood-Doughty, Winston Wu,
Patrick Xia, Hainan Xu, Jinyi Yang, Xuan Zhang, Ryan Cotterell, Naomi Saphra, Pushpendre Rastogi, Adrian Benton, Rachel Rudinger, Keisuke Sakaguchi, Yuan Cao, Matt Gormley, Ann Irvine, Keith Levin, Harish Mallidi, Vimal Manohar, Chunxi Liu, Courtney Napoles, Vijayaditya Peddinti, Nanyun Peng, Adam Teichert, and Xuchen Yao. You are all the best part of my time at CLSP.
Ruth Scally, Yamese Diggs, and Carl Pupa, thank you for your incredible support. CLSP runs smoothly mainly due to your efforts.
My Mom, Dad, and brother provided encouragement during the most trying times in my Ph.D. Thank you, Amma, Nana, and Chait.
Finally, Nichole, I cannot state what it means to have had your support through my Ph.D. You've been by my side from the first exciting phone call I received about my acceptance to the very end and everything in between. This thesis is more yours than mine. Thank you, bebbe.
Contents
Abstract ii
List of Tables xiii
List of Figures xvii
1 Introduction 1
1.1 Macaronic Language ...... 4
1.2 Zone of proximal development ...... 5
1.3 Our Goal: A Macaronic Machine Foreign Language Teacher ...... 6
1.4 Macaronic Data Structures ...... 9
1.5 User Modeling ...... 15
1.5.1 Modeling Incidental Comprehension ...... 15
1.5.2 Proxy Models for Incidental Comprehension ...... 16
1.5.3 Knowledge Tracing ...... 17
1.6 Searching in Macaronic Configurations ...... 19
1.7 Interaction Design ...... 20
1.8 Publications ...... 21
2 Related Work 23
3 Modeling Incidental Learning 28
3.1 Foreign Words in Isolation ...... 29
3.1.1 Data Collection and Preparation ...... 30
3.1.2 Modeling Subject Guesses ...... 34
3.1.2.1 Features Used ...... 34
3.1.3 Model ...... 41
3.1.3.1 Evaluating the Models ...... 43
3.1.4 Results and Analysis ...... 44
3.2 Macaronic Setting ...... 48
3.2.1 Data Collection Setup ...... 49
3.2.1.1 HITs and Submissions ...... 50
3.2.1.2 Clues ...... 51
3.2.1.3 Feedback ...... 53
3.2.1.4 Points ...... 53
3.2.1.5 Normalization ...... 54
3.2.2 User Model ...... 55
3.2.3 Factor Graph ...... 56
3.2.3.1 Cognate Features ...... 58
3.2.3.2 History Features ...... 59
3.2.3.3 Context Features ...... 59
3.2.3.4 User-Specific Features ...... 60
3.2.4 Inference ...... 61
3.2.5 Parameter Estimation ...... 62
3.2.6 Experimental Results ...... 62
3.2.6.1 Feature Ablation ...... 65
3.2.6.2 Analysis of User Adaptation ...... 66
3.2.6.3 Example of Learner Guesses vs. Model Predictions ...... 68
3.2.7 Future Improvements to the Model ...... 69
3.2.8 Conclusion ...... 72
4 Creating Interactive Macaronic Interfaces for Language Learning 74
4.1 Macaronic Interface ...... 75
4.1.1 Translation ...... 76
4.1.2 Reordering ...... 78
4.1.3 "Pop Quiz" Feature ...... 79
4.1.4 Interaction Consistency ...... 81
4.2 Constructing Macaronic Translations ...... 82
4.2.1 Translation Mechanism ...... 83
4.2.2 Reordering Mechanism ...... 84
4.2.3 Special Handling of Discontiguous Units ...... 86
4.3 Discussion ...... 87
4.3.1 Machine Translation Challenges ...... 87
4.3.2 User Adaptation and Evaluation ...... 88
4.4 Conclusion ...... 89
5 Construction of Macaronic Texts for Vocabulary Learning 90
5.1 Introduction ...... 90
5.1.1 Limitation ...... 92
5.2 Method ...... 93
5.2.1 Generic Student Model ...... 93
5.2.2 Incremental L2 Vocabulary Learning ...... 95
5.2.3 Scoring L2 embeddings ...... 97
5.2.4 Macaronic Configuration Search ...... 98
5.2.5 Macaronic-Language document creation ...... 99
5.3 Variations in Generic Student Models ...... 102
5.3.1 Unidirectional Language Model ...... 102
5.3.2 Direct Prediction Model ...... 103
5.4 Experiments with Synthetic L2 ...... 106
5.4.1 MTurk Setup ...... 109
5.4.2 Experiment Conditions ...... 110
5.4.3 Random Baseline ...... 114
5.4.4 Learning Evaluation ...... 116
5.5 Spelling-Aware Extension ...... 117
5.5.1 Scoring L2 embeddings ...... 118
5.5.2 Macaronic Configuration Search ...... 119
5.6 Experiments with real L2 ...... 120
5.6.1 Comprehension Experiments ...... 123
5.6.2 Retention Experiments ...... 124
5.7 Hyperparameter Search ...... 125
5.8 Results Varying τ ...... 126
5.9 Macaronic Examples ...... 130
5.10 Conclusion ...... 142
6 Knowledge Tracing in Sequential Learning of Inflected Vocabulary 145
6.1 Related Work ...... 148
6.2 Verb Conjugation Task ...... 152
6.2.1 Task Setup ...... 152
6.2.2 Task Content ...... 153
6.3 Notation ...... 154
6.4 Student Models ...... 154
6.4.1 Observable Student Behavior ...... 154
6.4.2 Feature Design ...... 155
6.4.3 Learning Models ...... 156
6.4.3.1 Schemes for the Update Vector u_t ...... 157
6.4.3.2 Schemes for the Gates α_t, β_t, γ_t ...... 160
6.4.4 Parameter Estimation ...... 161
6.5 Data Collection ...... 162
6.5.1 Language Obfuscation ...... 162
6.5.2 Card Ordering Policy ...... 163
6.6 Results & Experiments ...... 164
6.6.1 Comparison with Less Restrictive Model ...... 170
6.7 Conclusion ...... 173
7 Conclusion & Future Direction 175
Vita 199
List of Tables
1.1 Examples of possible macaronic configurations from the macaronic data structure depicted in Figure 1.2. This data structure supports actions that reorder phrases within the macaronic sentence, thus generating substrings like telle une and a such, which are both not in the word-ordering of the language they are in. ...... 12

1.2 Examples of possible macaronic configurations from the simplified macaronic data structure in Figure 1.3. Note that the words (even French words) are always in the English word-order. Thus, using this data structure we cannot obtain configurations that include substrings like a such or une telle. We envision this data structure to be useful for a native speaker of English learning French vocabulary, but it is also possible that a student seeing English words in French word order can learn about French word ordering. Then, gradually, we can replace the English words (in French word order) with French words (in French word order), forming fluent French sentences. ...... 13
3.1 Three tasks derived from the same German sentence. ...... 31

3.2 Correlations between selected feature values and answer guessability, computed on training data (starred correlations significant at p < 0.01). Unavailable features are represented by "n/a" (for example, since the German word is not observed in the cloze task, its edit distance to the correct solution is unavailable). Due to the format of our triples, it is still possible to test whether these unavailable features influence the subject's guess: in almost all cases they indeed do not appear to, since the correlation with guessability is low (absolute value < 0.15) and not statistically significant even at the p < 0.05 level. ...... 38

3.3 Feature ablation. The single highest-correlating feature (on dev set) from each feature group is shown, followed by the entire feature group. All versions with more than one feature include a feature for the OOV guess. In the correlation column, p-values < 0.01 are marked with an asterisk. ...... 45

3.4 Examples of incorrect guesses and potential sources of confusion. ...... 46
3.5 Percentage of foreign words for which the user's actual guess appears in our top-k list of predictions, for models with and without user-specific features (k ∈ {1, 25, 50}). ...... 64

3.6 Quality correlations: basic and user-adapted models. ...... 67

3.7 Impact on quality correlation (QC) of removing features from the model. Ablated QC values marked with asterisk ∗ differ significantly from the full-model QC values in the first row (p < 0.05, using the test of Preacher (2002)). ...... 68
4.1 Summary of learner-triggered interactions in the Macaronic Interface. ...... 81

4.2 Generating reordered strings using units. ...... 85
5.1 An example English (L1) sentence with German (L2) glosses. Using the glosses, many possible macaronic configurations are possible. Note that the gloss sequence is not a fluent L2 sentence. ...... 92

5.2 Results from MTurk data. The first section shows the percentage of tokens that were replaced with L2 glosses under each condition. The Accuracy section shows the percentage token accuracy of MTurk participants' guesses along with 95% confidence interval calculated via bootstrap resampling. ...... 107

5.3 Results of MTurk results split up by word-class. The y-axis is percentage of tokens belonging to a word-class. The pink bar (right) shows the percentage of tokens (of a particular word-class) that were replaced with an L2 gloss. The blue bar (left) indicates the percentage of tokens (of a particular word-class) that were guessed correctly by MTurk participants. Error bars represent 95% confidence intervals computed with bootstrap resampling. For example, we see that only 5.0% (pink) of open-class tokens were replaced into L2 by the DP model at r_max = 1 and 4.3% of all open-class tokens were guessed correctly. Thus, even though the guess accuracy for DP at r_max = 1 for open-class is high (86%), we can see that participants were not exposed to many open-class word tokens. ...... 108

5.4 Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the GSM model. ...... 110

5.5 Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the uGSM model. ...... 111

5.6 Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the DP generic student model. Only common function words seem to be replaced with their L2 translations. ...... 112
5.7 Results comparing our generic student based approach to a random baseline. The first part shows the number of L2 word types exposed by each model for each word class. The second part shows the average guess accuracy percentage for each model and word class. 95% confidence intervals (in brackets) were computed using bootstrap resampling. ...... 114

5.8 Results of our L2 learning experiments where MTurk subjects simply read a macaronic document and answered a vocabulary quiz at the end of the passage. The table shows the average guess accuracy percentage along with 95% confidence intervals computed from bootstrap resampling. ...... 115

5.9 Average token guess quality (τ = 0.6) in the comprehension experiments. The ± denotes a 95% confidence interval computed via bootstrap resampling of the set of human subjects. The % of L1 tokens replaced with L2 glosses is in parentheses. §5.8 evaluates with other choices of τ. ...... 122

5.10 Average type guess quality (τ = 0.6) in the retention experiment. The % of L2 gloss types that were shown in the macaronic document is in parentheses. §5.8 evaluates with other choices of τ. ...... 124

5.14 An expanded version of Table 5.9 (human comprehension experiments), reporting results with various values of τ. ...... 128

5.11 MRR scores obtained with different hyperparameter settings. ...... 129

5.12 Number of L1 tokens replaced by L2 glosses under different hyperparameter settings. ...... 130

5.13 Number of distinct L2 word types present in the macaronic document under different hyperparameter settings. ...... 131

5.15 An expanded version of Table 5.10 (human retention experiments), reporting results with various values of τ. ...... 144
6.1 Content used in training sequences. Phrase pairs with * were used for the quiz at the end of the training sequence. This Spanish content was then transformed using the method in section 6.5.1. ...... 153

6.2 Summary of update schemes (other than RNG). ...... 159

6.3 Table summarizing prediction accuracy and cross-entropy (in nats per prediction) for different models. Larger accuracies and smaller cross-entropies are better. Within an update scheme, the † indicates significant improvement (McNemar's test, p < 0.05) over the next-best gating mechanism. Within a gating mechanism, the ∗ indicates significant improvement over the next-best update scheme. For example, NG+CM is significantly better than NG+VM, so it receives a †; it is also significantly better than RG+CM, and receives a ∗ as well. These comparisons are conducted only among the pure update schemes (above the double line). All other models are significantly better than RG+SM (p < 0.01). ...... 167
6.4 Comparison of our best-performing PKT model (RNG+CM) to our LSTM model. On our dataset, PKT outperforms the LSTM both in terms of accuracy and cross-entropy. ...... 172
7.1 Examples of inputs and predicted outputs by our experimental NMT model trained to generate macaronic language sentences using annotations on the input sequence. We see that the macaronic language translations are able to correctly order German portions of the sentences, especially at the sentence ending. The source-features have also been learned by the NMT model, and translations are faithful to the markup. The case, tokenization, and italics were added in post. ...... 179
List of Figures
1.1 A schematic overview of our goal. ...... 7

1.2 Macaronic data structure extracted from word-alignments. The black lines represent edges that replace a unit, usually from one language (black for English) to another (blue for French). For example, edge (i) replaces first with premier and vice-versa. In some cases the edges connect two English tokens (such as introduce and submit) as an intermediate step between introduce and presenter. In other cases the black substitution edge defines a single substitution even when there is more than one token in the unit. For example, such a is connected to such une via a substitution edge (ii). The orange edges perform a reordering action; for example, such a can be transformed into a such by traversing an orange edge. Only two edges are marked with roman numerals for clarity. ...... 11

1.3 A simplified macaronic data structure that only considers word replacements without any reordering of words. ...... 11
3.1 Average guessability by context type, computed on 112 triples (from the training data). Error bars show 95%-confidence intervals for the mean, under bootstrap resampling of the 112 triples (we use BCa intervals). Mean accuracy increases significantly from each task to the next (same test on difference of means, p < 0.01). ...... 39

3.2 Average Normalized Character Trigram Overlap between guesses and the German word. ...... 41

3.3 Correlation between empirically observed probability of the correct answer (i.e. the proportion of human subject guesses that were correct) and model probability assigned to the correct answer across all tasks in the test set. Spearman's correlation of 0.725. ...... 44

3.4 Percent of examples labeled with each label by a majority of annotators (may sum to more than 100%, as multiple labels were allowed). ...... 46
3.5 After a user submits a set of guesses (top), the interface marks the correct guesses in green and also reveals a set of translation clues (bottom). The user now has the opportunity to guess again for the remaining German words. ...... 51

3.6 In this case, after the user submits a set of guesses (top), two clues are revealed (bottom): ausgestellt is moved into English order and then translated. ...... 52

3.7 Model for user understanding of L2 words in sentential context. This figure shows an inference problem in which all the observed words in the sentence are in German (that is, Obs = ∅). As the user observes translations via clues or correctly-marked guesses, some of the E_i become shaded. ...... 57

3.8 Actual quality sim(ê, e*) of the learner's guess ê on development data, versus predicted quality sim(e, e*) where e is the basic model's 1-best prediction. ...... 65

3.9 Actual quality sim(ê, e*) of the learner's guess ê on development data, versus the expectation of the predicted quality sim(e, e*) where e is distributed according to the basic model's posterior. ...... 66

3.10 The user-specific weight vectors, clustered into groups. Average points per HIT for the HITs completed by each group: (a) 45, (b) 48, (c) 50 and (d) 42. ...... 69

3.11 Two examples of the system's predictions of what the user will guess on a single submission, contrasted with the user's actual guess. (The user's previous submissions on the same task instance are not shown.) In 3.11a, the model correctly expects that the substantial context will inform the user's guess. In 3.11b, the model predicts that the user will fall back on string similarity, although we can see that the user's actual guess of and day was likely informed by their guess of night, an influence that our CRF did consider. The numbers shown are log-probabilities. Both examples show the sentences in a macaronic state (after some reordering or translation has occurred). For example, the original text of the German sentence in 3.11b reads Deshalb durften die Paare nur noch ein Kind bekommen. The macaronic version has undergone some reordering, and has also erroneously dropped the verb due to an incorrect alignment. ...... 73
4.1 Actions that translate words. ...... 77

4.2 Actions that reorder phrases. ...... 79

4.3 State diagram of learner interaction (edges) and system's response (vertices). Edges can be traversed by clicking (c), hovering above (a), hovering below (b) or the enter (e) key. Unmarked edges indicate an automatic transition. ...... 80

4.4 The dotted lines show word-to-word alignments between the German sentence f_0, f_1, ..., f_7 and its English translation e_0, e_1, ..., e_6. The figure highlights 3 of the 7 units: u_2, u_3, u_4. ...... 83
4.5 A possible state of the sentence, which renders a subset of the tokens (shown in black). The rendering order (section 4.2.2) is not shown but is also part of the state. The string displayed in this case is "Und danach they run noch einen Marathon." (assuming no reordering). ...... 84

4.6 Figure 4.6a shows a simple discontiguous unit. Figure 4.6b shows a long-distance discontiguity which is supported. In figure 4.6c the interruptions align to both sides of e_3, which is not supported. In situations like 4.6c, all associated units are merged as one phrasal unit (shaded) as shown in figure 4.6d. ...... 86
5.1 A screenshot of a macaronic sentence presented on Mechanical Turk. ...... 106
6.1 Screen grabs of card modalities during training. These examples show cards for a native English speaker learning Spanish verb conjugation. Fig 6.1a is an EX card, Fig 6.1b shows an MC card before the student has made a selection, and Fig 6.1c and 6.1d show MC cards after the student has made an incorrect or correct selection respectively. Fig 6.1e shows an MC card that is giving the student another attempt (the system randomly decides to give the student up to three additional attempts), Fig 6.1f shows a TP card where a student is completing an answer, Fig 6.1g shows a TP card that has marked a student answer wrong and then revealed the right answer (the reveal is decided randomly), and finally Fig 6.1h shows a card that is giving a student feedback for their answer. ...... 151

6.2 Quiz performance distribution (after removing users who scored 0). ...... 165

6.3 Plot comparing the models on test data under different conditions. Conditions MC and TP indicate Multiple-choice and Typing questions respectively. These are broken down to the cases where the student answers them correctly (C) and incorrectly (IC). SM, VM, and CM represent scalar, vector, and context retention and acquisition gates (shown with different colors), respectively, while RG, NG and FG are redistribution, negative and feature vector update schemes (shown with different hatching patterns). ...... 166

6.4 Predicting a specific student's responses. For each response, the plot shows our model's improvement in log-probability over the uniform baseline model. TP cards are the square markers connected by solid lines (the final 7 squares are the quiz), while MC cards, which have a much higher baseline, are the circle markers connected by dashed lines. Hollow and solid markers indicate correct and incorrect answers respectively. The RNG+CM model is shown in blue and the FG+SM model in red. ...... 169
Chapter 1
Introduction
Growing interest in self-directed language learning methods like Duolingo (Ahn, 2013), along with recent advances in machine translation and the widespread ease of access to a variety of texts in a large number of languages, has given rise to a number of web-based tools related to language learning, ranging from dictionary apps to more interactive tools like Alpheios (Nelson, 2007) or Lingua.ly (2013). Most of these require hand-curated lesson plans and learning activities, often with explicit instructions.
Proponents of language acquisition through extensive reading, such as Krashen (1989), argue that much of language acquisition takes place through incidental learning: when a learner is exposed to novel vocabulary or structures and must find a way to understand them in order to comprehend the text. Huckin and Coady (1999) and Elley and Mangubhai (1983) observe that incidental learning is not limited to reading in one's native language (L1) and extends to reading in a second language (L2) as well. Free reading also offers the benefit of
being a "low-anxiety" (or even pleasurable) source of comprehensible input for many L2 learners (Krashen, 2003).
There is considerable evidence showing that free voluntary reading can play a role in foreign language learning. Lee, Krashen, and Gribbons (1996) studied international students in the United States and found the amount of free reading of English to be a significant predictor of the ability to judge the grammaticality of complex sentences. The amount of formal study and length of residence in the United States were not strong predictors. In Stokes, Krashen, and Kartchner (1998), students learning Spanish were tested on their understanding of the subjunctive. Students were not informed of the specific focus of the test (i.e. that it was on the subjunctive). The study found that attributes such as formal study, length of residence in a foreign-speaking country, and the student's subjective rating of the quality of the formal study they received failed to predict performance on the subjunctive test. The amount of free reading, however, was a strong predictor. Constantino et al. (1997) showed that free reading was a strong predictor of performance on the Test of English as a Foreign Language (TOEFL). However, they did find that other attributes such as time of residence in the United States and amount of formal study were also significant predictors of performance. Kim and Krashen (1998) go beyond self-reported reading amounts and were able to find a correlation between students' performance on an English-as-a-foreign-language test and their performance on the "author recognition" test. In the author recognition test, subjects indicate whether they recognize a name as an author among a list of names provided to them. The author recognition test has also been used in other first-language studies as well,
such as Chinese (Lee and Krashen, 1996), Korean (Kim and Krashen, 1998), and Spanish (Rodrigo, McQuillan, and Krashen, 1996). In the case of second language acquisition, however, learning by reading already requires considerable L2 fluency, which may prove a barrier for beginners. Thus, in order to allow students to engage with the L2 language early on, educators may use texts written in simplified forms, texts specifically designed for L2 learners (e.g. texts with limited/focused vocabularies), or texts intended for young L1 learners of the given L2. "Handcrafted Books" have been proposed by Dupuy and McQuillan (1997) as a way to generate L2 reading material that is both accessible and enjoyable to foreign language students. Handcrafted Books are essentially articles, novels, or essays written by foreign language students at an intermediate level and subsequently corrected by a teacher. The student writers are instructed not to look up words while writing, which helps keep the resulting material within the ability of beginner students. While this approach gives educators control over the learning material, it lacks scalability and suffers from similar issues as hand-curated lesson plans. As a result, a learner interested in reading in a second language might have few choices in the type of texts made available to them.
Our proposal is to make use of "macaronic language," which offers a mixture of the learner's L1 and their target L2. The amount of mixing can vary and might depend on the learner's proficiency and desired content. Additionally, we propose automatically creating such "macaronic language," allowing our learning paradigm to scale across a wide variety of content. Our hope is that this paradigm can potentially convert any reading material into one of pedagogical value and could easily become a part of the learner's daily routine.
1.1 Macaronic Language
Why do the French only eat one egg for breakfast? Because one egg is un œuf.
The term "macaronic" traditionally refers to a mash-up of languages, often intended to be humorous. Similar to text that contains code-switching, typical macaronic texts are intended for a bilingual audience; however, they differ from code-switching as they are not governed by syntactic and pragmatic considerations. Macaronic texts are also "synthetic" in the sense that they are traditionally constructed for the purpose of humor (bilingual puns) and do not arise naturally in conversation. In this thesis, however, we use the term macaronic for bilingual texts that have been synthetically constructed for the purpose of second language learning. Thus, our macaronic texts do not require fluency in both languages, and their usage only assumes that the student is fluent in their native language. This thesis investigates the applicability of macaronic texts as a medium for life-long second language learning. Flavors of this idea have appeared before, which we cover in Chapter 2, but it is especially worth noting that the earliest published macaronic memoir we could find is "On Foreign Soil" (Zolf and Green, 2003), a translation of an earlier Yiddish novel (Zolf, 1945). Green's translation begins in English with a few Yiddish words transliterated; as the novel progresses, entire transliterated Yiddish phrases are presented to the reader.
1.2 Zone of proximal development
Second language (L2) learning requires the acquisition of vocabulary as well as knowledge of the language's constructions. One of the ways in which learners become familiar with novel vocabulary and linguistic constructions is through reading. According to Krashen's Input Hypothesis (Krashen, 1989), learners acquire language through incidental learning, which occurs when learners are exposed to comprehensible input. What constitutes "comprehensible input" for a learner varies as their knowledge of the L2 increases. For example, a student in their first month of German lessons would be hard-pressed to read German novels or even front-page news, but they might understand brief descriptions of daily routines. Comprehensible input need not be completely familiar to the learner; it could include novel vocabulary items or structures, whose meanings they can glean from context. Such input falls in the "zone of proximal development" (Vygotsky, 2012), just outside of the learner's comfort zone. The related concept of "scaffolding" (Wood, Bruner, and Ross, 1976) consists of providing assistance to the learner at a level that is just sufficient for them to complete their task, which in our case is understanding a sentence. In this context, macaronic text can offer a flexible modality for L2 learning. The L1 portion of the text can provide scaffolding while the L2 portion, if not previously seen by the student, can provide novel vocabulary and linguistic constructions of pedagogical value. So, if we re-purpose the pun from above into a macaronic text for L2 French learners (whose L1 is English), we might construct the following with the hope that the readers can infer the meanings of un and
5 CHAPTER1. INTRODUCTION
œ u f .
Why dothe French have only un eggfor breakfast? Because un œuf isenough.
In addition to vocabulary, macaronic scaffolding can extend to syntactic structures as well. For example, consider the following text: "Der Student turned in die Hausaufgaben, that the teacher assigned had." Here, German vocabulary is indicated in bold and German syntactic structures are indicated in italics. Even a reader with no knowledge of German is likely to be able to understand this sentence by using context and cognate clues. One can imagine increasing the amount of German in such sentences as the learner's vocabulary increases, thus carefully removing scaffolding (English) and keeping the learner in their zone of proximal development.
1.3 Our Goal: A Macaronic Machine Foreign Language Teacher
Our vision is to build an AI foreign-language teacher that gradually converts documents (stories, articles, etc.) written in a student's native language into the L2 the student wants to learn, by automatically replacing the L1 vocabulary, morphology, and grammar with the L2 forms (see Figure 1.1). This "gradual conversion" involves replacing more L1 words (or phrases) with their L2 counterparts, and occurs as the learner slowly gains L2 proficiency (say, over a period of days or weeks). We do not expect significant language proficiency gains during the course of a single novel; rather, we expect more conversions mainly due to the repeated appearance of key vocabulary items during the novel. Thus, L2 conversions can accumulate over the course of the novel. This AI teacher will leverage a student's inherent ability to guess the meaning of foreign words and constructions based on the context in which they appear and similarities to previously known words. We envision our technology being used alongside traditional classroom L2 instruction: the same instructional mix that leads parents to accept inventive spelling (Gentry, 2000),¹ in which early writers are encouraged to write in their native language without concern for correct spelling, in part so they can more fully and happily engage with the writing challenge of composing longer and more authentic texts without undue distraction (Clarke, 1988). Traditional grammar-based instruction and assessment, which uses "toy" sentences in pure L2, should provide further scaffolding for our users to acquire language by reading more advanced (but macaronic) text.

Figure 1.1: A schematic overview of our goal.

¹ Learning how to spell, like learning an L2, is a type of linguistic knowledge that is acquired after L1 fluency and largely through incidental learning (Krashen, 1993).
Automatic construction of comprehensible macaronic texts as reading material, perhaps online and personalized, would be a useful educational technology. Broadly speaking, this requires:

1. A data structure for manipulating word- (or phrase-) aligned bitexts so they can be rendered as macaronic sentences,

2. Modeling students' comprehension when they read these macaronic sentences (i.e., what can an L2 learner understand in a given context?), and

3. Searching over many possible candidate comprehensible inputs to find one that is most suited to the student, balancing the amount of new L2 they encounter as well as ease of reading.
In the schematic of Figure 1.1, the AI teacher performs points (2) and (3) from above, while (1) defines the input (along with metadata) required by the AI teacher to perform (2) and (3). There are several ways of realizing each of the three points above. In this thesis, each chapter explores a subset of these three points. Chapter 3 and Chapter 4 cover points (1) and (2), using human data to construct student models. Chapter 5 describes another AI teacher that can accomplish (2) and (3) but makes some simplifying assumptions about the input data (1). Chapter 6 details another kind of student modeling (point (2)) for a verb-inflection learning task and focuses on short-term longitudinal modeling of student knowledge as students progress through this task.
The points made above can also be recast from a reinforcement learning perspective.
The data structure we design defines the set of all possible actions an agent (the AI teacher) can take. The agent acts upon the student, observes their responses, and tries to infer their comprehension of a sentence (and, more generally, their level of L2 proficiency). Thus, the student is the environment that the agent is acting upon. Finally, the search algorithm that the AI teacher employs is the policy the agent follows. This perspective also suggests that the policy could involve planning for long-term rewards (which in our framework is L2 proficiency), using policies that look ahead into the future to make optimal decisions in the present. However, we leave planning in the macaronic space to future work. In this thesis, we limit ourselves to greedy search techniques that do not consider long-term rewards.
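The reinforcement-learning framing above can be sketched in a few lines. The code below is purely illustrative, not code from the thesis: the action names and reward values are invented, and the `comprehension_score` callback stands in for a student model.

```python
# Minimal sketch of the RL framing: the data structure supplies the action
# set, a student model plays the environment, and greedy search is the policy.
# All names and scores below are hypothetical.

def greedy_policy(state, actions, comprehension_score):
    """Pick the action (macaronic edit) with the highest immediate reward."""
    return max(actions, key=lambda a: comprehension_score(state, a))

# Toy example: two candidate L2 replacements with fixed reward estimates.
scores = {"replace:first->premier": 0.9, "replace:such_a->une_telle": 0.4}
best = greedy_policy("L1 sentence", list(scores), lambda s, a: scores[a])
```

A look-ahead policy would instead roll out sequences of edits and accumulate discounted rewards; as noted above, that planning problem is left to future work.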
A reasonable concern is whether exposure to the mixture of languages a macaronic text offers might actually harm acquisition of the "correct" version of a text written solely in the L2. To address this, our proposed interface uses color and font to mark the L1 "intrusions" into the L2 sentence, or the L2 intrusions into the L1 sentence. We again draw a parallel to inventive spelling and highlight that the focus of a learner should be on continuous engagement with L2 content even if it appears "incorrectly" within L2 texts.
1.4 Macaronic Data Structures
Several strategies can be followed to create the required data structures so that an AI teacher can manipulate and render macaronic texts. One such structure for an English-French bitext is shown in Figure 1.2. We refer to a single translation pair of a source sentence and target sentence as a bitext. We use word alignments, which are edges connecting words from the source sentence to those in the target sentence of the bitext, and convert the bitext into a set of connected units. While the majority of units contain single words, note that some of the units contain phrases with internal reorderings, such as une telle or such une. Each unit is a bipartite graph with French words on one end and English words on the other. The AI teacher can select different cuts in each unit to render different macaronic configurations (see Table 1.1). Figure 1.2 shows the graph data structure and Table 1.1 lists some macaronic renderings, or configurations, that can be obtained from different cuts in the graph data structure. We construct these data structures automatically, but in our experiments we correct them manually, as word alignments are noisy and sometimes link source and target tokens that are not translations of each other. Furthermore, intermediate tokens in the macaronic text, such as The Arizona and submit in Figure 1.2, cannot be obtained merely from the bitext and word alignments; these intermediate forms were added manually as well. We refer to the items (rows of text) in Table 1.1 as macaronic configurations. Note that macaronic configurations are macaronic sentences; we use the term "configuration" to denote the set of macaronic sentences derived from a single piece of content (i.e., from a single bitext with its supporting data structures).
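The unit-and-cut idea can be sketched as follows. Each unit pairs an aligned L1 span with an L2 span, and a configuration chooses one side per unit. The unit segmentation below is a hypothetical simplification of Figure 1.2: it ignores the manually added intermediate forms and the reordering edges.

```python
# Hypothetical unit segmentation for the bitext of Figure 1.2 (intermediate
# forms and reordering edges are omitted for brevity).
units = [
    ("The Arizona", "L' Arizona"),
    ("was", "fut"),
    ("the first", "le premier"),
    ("to introduce", "a presenter"),
    ("such a", "une telle"),
    ("requirement", "exigence"),
]

def render(units, choose_l2):
    """choose_l2[i] == True cuts unit i to its French side, else English."""
    return " ".join(l2 if pick else l1
                    for (l1, l2), pick in zip(units, choose_l2))

full_l1 = render(units, [False] * len(units))
mixed = render(units, [False, False, True, False, False, True])
```

Since each of the 6 units can be cut two ways, even this reduced structure already yields 2^6 = 64 configurations, which illustrates why the space of configurations grows exponentially in the sentence length.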
An alternative strategy is to use only lexical replacements for L1 tokens. This data structure strategy assumes the availability of L2 glosses for each token in the L1 sentence. The units formed in this case are simple one-to-one mappings (see Figure 1.3). This data structure is less expressive in the sense that it can only render macaronic configurations
[Figure 1.2 graphic: the unit graph aligning "L' Arizona fut le premier a presenter une telle exigence" with "Arizona was the first to introduce such a requirement", including intermediate forms such as The Arizona and submit, and reorderings of such a / une telle.]

Figure 1.2: Macaronic data structure extracted from word alignments. The black lines represent edges that replace a unit, usually from one language (black for English) to another (blue for French). For example, edge (i) replaces first with premier and vice versa. In some cases the edges connect two English tokens (such as introduce and submit) as an intermediate step between introduce and presenter. In other cases a black substitution edge defines a single substitution even when there is more than one token in the unit. For example, such a is connected to such une via a substitution edge (ii). The orange edges perform a reordering action; for example, such a can be transformed into a such by traversing an orange edge. Only two edges are marked with roman numerals, for clarity.
Arizona fut le premier a introduire telle une exigence
Arizona was the first to introduce such a requirement

Figure 1.3: A simplified macaronic data structure that only considers word replacements, without any reordering of words.
L' Arizona fut le premier a presenter une telle exigence
L' Arizona fut le first a presenter une telle exigence
...
L' Arizona was the first a presenter une telle exigence
L' Arizona was the first a presenter telle une requirement
L' Arizona was the first a presenter telle a requirement
L' Arizona was the first a presenter such a requirement
...
L' Arizona was the first to introduce such a requirement
Arizona was the first to introduce such a requirement

Table 1.1: Examples of possible macaronic configurations from the macaronic data structure depicted in Figure 1.2. This data structure supports actions that reorder phrases within the macaronic sentence, thus generating substrings like telle une and a such, which are both not in the word order of the language they are in.
Arizona fut le premier a introduire telle une exigence
Arizona was le premier a introduire telle une exigence
...
Arizona was le premier a introduire such a requirement
...
Arizona was the first to introduire such a requirement
Arizona was the first to introduce such a requirement

Table 1.2: Examples of possible macaronic configurations from the simplified macaronic data structure of Figure 1.3. Note that the words (even the French words) are always in the English word order. Thus, using this data structure we cannot obtain configurations that include substrings like a such or une telle. We envision this data structure being useful for a native speaker of English learning French vocabulary, but it is also possible that a student seeing English words in French word order can learn about French word ordering. Then, gradually, we can replace the English words (in French word order) with French words (in French word order), forming fluent French sentences.
in the L1 word order. That is, even though it can display L2 words, these words are only displayed in their L1 orderings. This limitation simplifies the AI teacher's choice of actions but also limits the resulting content to only have value for learning foreign vocabulary. Note that even this simple structure allows for exponentially many possible configurations that the AI teacher must be able to search over.
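The lexical-replacement structure is even simpler to sketch: one optional L2 gloss per L1 token, rendered strictly in L1 word order. The glosses below are illustrative, not drawn from the thesis's data.

```python
# Illustrative glosses; only tokens with a gloss can be replaced.
glosses = {"was": "fut", "first": "premier", "requirement": "exigence"}

def render_lexical(l1_tokens, replace):
    """Swap in the L2 gloss for each token in `replace`, keeping L1 order."""
    return [glosses.get(t, t) if t in replace else t for t in l1_tokens]

sent = "Arizona was the first to introduce such a requirement".split()
out = render_lexical(sent, {"was", "requirement"})

# Even this simple scheme yields 2**k configurations for k glossable tokens.
n_configs = 2 ** len(glosses)
```

With only three glossable tokens there are already 8 configurations, and the count doubles with every additional gloss, which is the search problem noted above.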
The data structures described above are a subset of more complex structures. It is possible to construct data structures that connect subword units on the English side to equivalent subword units on the French side. Such a data structure could be used to render sentences that are macaronic at the subword level. Furthermore, the data structure could also support non-contiguous reorderings. We leave such data structures for future work and focus on the more straightforward local reorderings and simple word replacement methods, mainly to allow for fast search procedures.

The graph structure in Figure 1.2 could also be made more complex with hyper-edges connecting a set of tokens on the English side (such a) to a set of tokens on the French side (une telle) of the text.
1.5 User Modeling
1.5.1 Modeling Incidental Comprehension
In order to deliver macaronic content that can be understood by a learner, we must first build a model of a learner's comprehension abilities. Would a native English speaker learning German be able to comprehend a sentence like "The Nil is a Fluss in Africa"? Would they correctly map the German words Nil and Fluss to Nile and river? Is there a chance they incorrectly map the German words to other plausible English words, for example, Namib and desert? We approach this comprehension modeling problem by building probabilistic models that can predict whether an L2 student might comprehend a macaronic sentence when they read it. One way to estimate comprehension is to ask L2 students to guess the meaning of L2 words or phrases within the macaronic sentence. A correct guess implies that there are sufficient clues, either from the L1 words in the sentence, from the L2 words, or from both, to comprehend the sentence.
We begin with a simplified version of macaronic sentences wherein only a single word (a noun) is replaced with its L2 translation. For example: "The next important Klima conference is in December." We build predictive models that take the L1 context and the L2 word as input and predict what a novice L2 learner (studying German in this example) might say is the meaning of the novel word Klima.
Next, we address the case in which multiple words in the macaronic sentence are converted into their L2 form. Consider the earlier example: "The Nil is a Fluss in Africa." In such cases, we need to jointly predict the meaning a student would assign to all the L2 words. This is necessary because a student's interpretation of one word will influence how they interpret the remaining L2 words in the sentence. For example, if the student assigns the meaning Nile to Nil, this might influence them to guess that Fluss should be interpreted as river. Alternatively, if they interpret Fluss as forest, they might then guess that Nil is the name of a forest in Africa. In other words, there is a cyclical dependency between the guesses that a student makes. Details of our proposed models for capturing incidental learning are discussed in Chapter 3. Code used for our experiments is available at https://github.com/arendu/MacaronicUserModeling.
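A toy example shows why the guesses must be predicted jointly rather than independently. This is not the thesis's graphical model; the candidate meanings and pairwise compatibility scores below are invented for illustration.

```python
import itertools

# Toy joint-guess example for "The Nil is a Fluss in Africa": candidate
# meanings and compatibility scores are invented for illustration.
cands = {"Nil": ["Nile", "Namib"], "Fluss": ["river", "desert"]}
compat = {("Nile", "river"): 1.0, ("Namib", "desert"): 0.9,
          ("Nile", "desert"): 0.1, ("Namib", "river"): 0.1}

# Independently, "Namib"/"desert" each look plausible; jointly, the
# best-scoring assignment pairs "Nile" with "river".
best = max(itertools.product(cands["Nil"], cands["Fluss"]),
           key=lambda pair: compat[pair])
```

The compatibility table encodes the cyclical dependency described above: the score of a meaning for Fluss depends on which meaning was chosen for Nil, and vice versa.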
1.5.2 Proxy Models for Incidental Comprehension
One way to build models of human incidental comprehension is to collect data from humans. In the previous section (with details in Chapter 3), we required humans to read a candidate macaronic text and provide feedback as to whether the text was comprehensible and whether the L2 words in the text were understood. Using this feedback (i.e., labeled data), we build a model of a "generic" student. Collecting this labeled data from student annotators is expensive, not only from a data collection point of view but also for the students, as they would have to give feedback on candidate macaronic texts generated by an untrained machine teacher.
As an alternative to collecting labeled data in this way, we investigate using cloze language models as a proxy for models of incidental comprehension. A cloze language model can be trained with easily available L1 corpora from any domain (that is potentially of interest to a student). We can refine the cloze language model with real student supervision in an online fashion as a student interacts with macaronic text generated by the AI teacher. In other words, this cloze language model can be personalized to individual students by looking at what they are able to understand and making updates to the model accordingly. Essentially, the cloze language model allows us to bootstrap the macaronic learning setup without expensive data collection overhead. Details of our use of proxy user models are described in Chapter 5.
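As a stand-in for the neural cloze model, a toy count-based cloze model illustrates the idea: trained only on L1 text, it predicts which L1 word "belongs" in the slot occupied by an L2 word. The corpus and contexts below are invented.

```python
from collections import Counter, defaultdict

# Toy count-based cloze model trained only on L1 text (a simplified stand-in
# for the neural cloze model; the corpus is invented).
def train_cloze(sentences):
    table = defaultdict(Counter)  # (left word, right word) -> filler counts
    for s in sentences:
        toks = s.split()
        for i in range(1, len(toks) - 1):
            table[(toks[i - 1], toks[i + 1])][toks[i]] += 1
    return table

def guess(model, left, right):
    """The proxy student's best guess for the word between left and right."""
    return model[(left, right)].most_common(1)[0][0]

corpus = ["the nile is a river in africa",
          "the amazon is a river in america"]
model = train_cloze(corpus)

# Reading "the Nil is a Fluss in africa", the proxy student guesses the L1
# word that fits where "Fluss" appears:
g = guess(model, "a", "in")
```

A real neural cloze model conditions on much wider context, but the role is the same: it answers "what would a fluent L1 reader expect here?", which approximates what an L1-fluent student can infer incidentally.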
1.5.3 Knowledge Tracing
Apart from modeling incidental comprehension, an AI teacher would benefit from modeling how a student might update their knowledge based on different pedagogical stimuli, which in our case take the form of different macaronic sentences. Furthermore, in the case of "pop quizzes" (see Chapter 4), the student may receive explicit feedback for their guesses. Ideally, such explicit feedback would also cause the student to update their knowledge. Here an "update" could entail learning (adding to their knowledge) or forgetting (removing from their knowledge). The longitudinal tracking of the knowledge a learner has as they proceed through a sequence of learning activities is referred to as "knowledge tracing" (Corbett and Anderson, 1994). We study a feature-rich knowledge tracing method that captures a student's acquisition and retention of knowledge during a foreign language phrase-learning
task. Note that we deviate from our macaronic paradigm for this task and focus on short-term longitudinal modeling of a student's knowledge in a phrase-learning task that teaches L2 verb inflections. In this study, we use flash cards instead of macaronic texts, mainly because of the ease of obtaining longitudinal participation in user studies. We model a student's behavior as making predictions under a log-linear model, and adopt a neural gating mechanism to model how the student updates their log-linear parameters in response to feedback. The gating mechanism allows the model to learn complex patterns of retention and acquisition for each feature, while the log-linear parameterization results in an interpretable knowledge state.
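The shape of this student model can be sketched as follows. This is a simplified illustration, not the trained model: the feature names are invented, and the gate is a fixed scalar rather than the output of a learned neural gating mechanism.

```python
import math

# Sketch of the student model's shape: predictions under a log-linear model,
# with a gate controlling how much each feedback episode updates the weights.
# The features and the gate value are illustrative, not learned.

def predict(weights, feats_per_option):
    """Log-linear distribution over answer options, each a feature list."""
    scores = [math.exp(sum(weights.get(f, 0.0) for f in fs))
              for fs in feats_per_option]
    z = sum(scores)
    return [s / z for s in scores]

def update(weights, grad, gate):
    """gate in [0, 1]: 1.0 fully acquires the feedback, 0.0 forgets it."""
    return {f: gate * (weights.get(f, 0.0) + grad.get(f, 0.0))
            for f in set(weights) | set(grad)}

w = {"suffix:-te": 0.0}
probs = predict(w, [["suffix:-te"], ["suffix:-st"]])   # uniform before study
w = update(w, {"suffix:-te": 1.0}, gate=0.8)           # feedback favors option 1
probs2 = predict(w, [["suffix:-te"], ["suffix:-st"]])  # now prefers option 1
```

In the actual model the gate is predicted per feature and per episode, which is what lets it capture feature-specific patterns of acquisition and forgetting while the weights themselves stay interpretable.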
We collect human data and evaluate several versions of the model. We hypothesize that human data collection for verb inflection is not as problematic as in the full macaronic setting, as there are only a handful of inflectional forms to master (a few dozen, even for morphologically rich languages). Secondly, we do not have to subject student annotators to stimuli generated by untrained AI teachers, making the data collection process beneficial for the student annotators as well. Details of our proposal for knowledge tracing are presented in Chapter 6. The code and data for our experiments are available at https://github.com/arendu/vocab-trainer-experiments.
1.6 Searching in Macaronic Configurations
In §1.5.1 and §1.5.2 we introduce the notion of modeling human comprehension in macaronic settings (either using actual human data or by proxy models). However, our AI teacher still has the difficult task of deciding which specific macaronic sentence to generate for a student reading some new piece of text. Even in the case of the simplified lexical macaronic data structure, where each L1 word maps to a single L2 word and there are no phrase reorderings, there are exponentially many possible macaronic configurations that can be generated. The AI teacher must decide on a particular configuration that will be rendered or displayed to a reader by searching over (some subset of) the possible configurations and picking a good candidate. We propose greedy and best-first heuristics to tackle this search problem and also design a scoring function that guides the search process to find good candidates to display. While simple, the greedy and best-first heuristics offer an effective strategy for finding a macaronic configuration with a low computational footprint.
In order to continuously update our language model in response to a student's real-time interactions, the speed of our search is a critical factor. The greedy best-first approach offers a solution that prioritizes speed. To measure the goodness of each search state, our scoring function compares the initially trained L1 word embeddings with the incrementally trained L2 word embeddings and assigns a score reflecting the proximity of the L2 word embeddings to their L1 counterparts. We conduct intrinsic experiments using this proposed scoring function to determine viable search heuristics. Further details on our scoring function and
heuristic search are presented in Chapter 5. The code for our experiments is available at https://github.com/arendu/Mixed-Lang-Models.
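The embedding-proximity scoring idea can be sketched with toy vectors. The embeddings below are hand-made, not the incrementally trained embeddings the thesis uses, and the words are illustrative.

```python
# Toy embeddings illustrating the scoring function: an L2 word scores well
# when its (incrementally trained) embedding lies near the embedding of its
# L1 counterpart. Vectors here are hand-made for illustration.

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

l1_emb = {"river": [1.0, 0.0], "desert": [0.0, 1.0]}
l2_emb = {"Fluss": [0.9, 0.1]}  # drifted toward "river" after some exposure

def score(l2_word, l1_word):
    return cos(l2_emb[l2_word], l1_emb[l1_word])

# A search heuristic would favor configurations whose L2 words score high
# against their intended L1 counterparts:
best_meaning = max(l1_emb, key=lambda w: score("Fluss", w))
```

Because this score is a handful of dot products per candidate word, a greedy best-first search over configurations stays cheap enough for the real-time setting described above.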
1.7 Interaction Design
While macaronic texts can be "consumed" as static documents, the prevalence of e-readers and web-based reading interfaces allows us to take advantage of a more interactive reading and learning experience. We propose a user interface that can be helpful to a human learner, while also enabling the AI teacher to adapt itself towards a specific learner. Specifically, when a macaronic document is presented to the student, we provide functionality that allows the student to click on L2 words or phrases in order to reveal their L1 translation. This helps the reader to progress if they are struggling with how to interpret a given word or phrase. Furthermore, we can log a student's clicking interactions and use them as feedback for our machine teacher. Many clicks within a sentence might indicate that the macaronic configuration of the sentence was beyond the learner's reading abilities, and the teacher can update its models accordingly. Apart from this, there is also the option of the teacher not revealing the L1 translation, but rather prompting the student to first type in a guess for the meaning of the word. This process helps to disambiguate between those students who click just for confirmation of their knowledge and those who genuinely do not know the word at all. By looking at what a student has typed, we can determine how close the student's knowledge is to an actual understanding of the word or phrase. Details of our interaction design proposal
are described in Chapter 4, but we leave the construction of an interactive system that iteratively refines its model for future work.
1.8 Publications
This thesis is mainly the culmination of the following six publications (including one demonstration-track publication and one workshop publication):
1. Analyzing Learner Understanding of Novel L2 Vocabulary.
Rebecca Knowles, Adithya Renduchintala, Philipp Koehn and Jason Eisner. Conference on Computational Natural Language Learning (CoNLL), 2016.

2. Creating Interactive Macaronic Interfaces for Language Learning.
Adithya Renduchintala, Rebecca Knowles, Philipp Koehn and Jason Eisner. System Description, Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

3. User Modeling in Language Learning with Macaronic Texts.
Adithya Renduchintala, Rebecca Knowles, Philipp Koehn and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

4. Knowledge Tracing in Sequential Learning of Inflected Vocabulary.
Adithya Renduchintala, Philipp Koehn and Jason Eisner. Conference on Computational Natural Language Learning (CoNLL), 2017.
5. Simple Construction of Mixed-Language Texts for Vocabulary Learning.
Adithya Renduchintala, Philipp Koehn and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) Workshop on Innovative Use of NLP for Building Educational Applications, 2019.

6. Spelling-Aware Construction of Macaronic Texts for Teaching Foreign-Language Vocabulary.
Adithya Renduchintala, Philipp Koehn and Jason Eisner. Empirical Methods in Natural Language Processing (EMNLP), 2019.
Chapter 2

Related Work
Early adoption of Natural Language Processing (NLP) and speech technology in education was mainly focused on Summative Assessment, where a student's writing, speaking, or reading is analyzed by an NLP system. Such systems essentially assign a score to the input. Prominent examples include Heilman and Madnani (2012), Burstein, Tetreault, and Madnani (2013) and Madnani et al. (2012). More recently, NLP systems have also been used to provide Formative Assessment. Here, the system provides feedback in a form that a student can act upon to improve their abilities. Formative Assessment has also been studied in other areas of education, such as math and science. In language education, Formative Assessment may take the form of giving a student qualitative feedback on a particular part of the student's essay, for example, suggesting a different phrasing. Such systems fall along the lines of intelligent and adaptive tutoring solutions designed to improve learning outcomes. Recent research such as Zhang et al. (2019) and commercial solutions such as
Grammarly (2009) are expanding the role of NLP in formative feedback and assessment. An overview of NLP-based work in the education sphere can be found in Litman (2016).
There are also lines of work that do not fall within the definitions of Summative or Formative Assessment. For example, practice question generation is the task of creating pedagogically useful questions for a student, allowing them to practice without the need for a human teacher. Du, Shao, and Cardie (2017) and Heilman and Smith (2010) are among the newer works focused on question generation for reading comprehension. Prior to that, Mitkov and Ha (2003) used rule-based methods to generate questions.
There has also been NLP work specific to second language acquisition, such as Özbal, Pighin, and Strapparava (2014), where the focus has been to build a system to help learners retain new vocabulary. As previously mentioned, mobile and web-based apps for second language learning such as Duolingo (Ahn, 2013) are popular among learners, as they allow self-paced study and hand-crafted curricula. While most of these apps have "gamified" the learner's experience, they still demand dedicated time from the learner.
The process of generating training data for machine translation systems also has the potential to yield language learning tools. Hermjakob et al. (2018) present a tool that allows human annotators to generate translations (target references) from source sentences in a language they do not read. The tool presents a source sentence (for which a reference target is required) to a human annotator in romanized form, along with phrasal glosses of the romanized source text using a lookup table. Hermjakob et al. (2018) observed that by simply allowing the annotators to translate source sentences (with their supporting interface), the annotators learned vocabulary and syntax over time. A similar observation was made in Hu et al. (2011) and Hu, Bederson, and Resnik (2010), who also built tools to obtain reference translations from human annotators who do not read the source language.
The work in this thesis, however, seeks to build a framework based on incidental learning when reading macaronic passages from documents such as news articles, stories, or books. Our approach does not rely on hand-made curricula, does not present explicit instructions, and (hopefully) can be used by foreign language students in the daily course of their lives.¹ Our goal is that this would encourage continued engagement, leading to "life-long learning." This notion of incidental learning has been explored in previous work as well. Chen et al. (2015) create a web-based plugin that can expose learners to new foreign language vocabulary while reading news articles. They use a dictionary to show Chinese translations of English words when the learner clicks on an English word in the document (their prototype targets native English speakers learning Chinese). When a particular English word is clicked, the learner is shown that word's Chinese translation. Once the application records the click, it then determines whether the user has reached a certain threshold for that word and automatically replaces future occurrences of the English word with its Chinese translation. The learner can also click on the Chinese translation, at which point they receive a multiple-choice question asking them to identify the correct English translation. While they do not use surrounding context when determining which words to teach, the learner has access to context in the form of the rest of the document when making multiple-choice guesses.

¹ Variations of our framework could (if the learner chooses) provide explicit instructions and make the task more learning-focused, at the expense of casual reading.

Our work is also related to Labutov and Lipson (2014), which also tries to leverage incidental learning using mixed L1 and L2 languages. Whereas their work uses surprisal to choose contexts in which to insert L2 vocabulary, we consider both context features and other factors such as cognate features. Further, we collect data that gives direct evidence of the user's understanding of words by asking them to provide English guesses, rather than indirect evidence via questions about sentence validity. The latter indirect approach runs the risk of overestimating the student's knowledge of a word; for instance, a student may have only learned other linguistic information about a word, such as whether it is animate or inanimate, rather than learning its exact meaning. In addition, we are not only interested in whether a mixed L1 and L2 sentence is comprehensible; we are also interested in determining a distribution over the learner's belief state for each word in the sentence. We do this in an engaging, game-like setting, which provides the user with hints when the task is too difficult for them to complete.
Incidental learning can be viewed as a kind of "fast mapping," a process by which children are able to map novel words to their meaning with relatively few exposures (Carey and Bartlett, 1978). Fast mapping is usually studied as a mapping between a novel word and some concept in the immediate scene. Carey and Bartlett (1978) studied whether 3-year-old children can map a novel word, for example "chromium," to an unfamiliar color (olive-green) using a "referent-selection" task, which required a subject to retrieve the correct unfamiliar object from a set of objects. Children were given instructions such as "bring the chromium tray, not the blue one." It was observed that children were able to perform such mappings
quickly. Subsequently, Alishahi, Fazly, and Stevenson (2008) constructed a probabilistic model and were able to tune this model to fit the empirical observations of previous fast mapping experiments. The model experiences a sequence of utterances U_t in scenes S_t. Each utterance contains words w ∈ U_t, and the "scene" contains a set of concepts m ∈ S_t. With each (U_t, S_t) pair, the model parameters p(m | w) were updated using an online EM update. At the end of a sequence of (U_t, S_t) pairs, t ∈ {1, ..., T}, the final model parameters were used to simulate "referent-selection" and retention tasks. We can view the student's task (in our macaronic setting) as an instance of cross-lingual structured fast-mapping, where an utterance is a macaronic sentence and the student is trying to map novel foreign words to words in their native language.
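The spirit of such an EM re-estimation of p(m | w) can be sketched with a tiny cross-situational learner. This is a simplified batch illustration in the style of IBM Model 1, not Alishahi, Fazly, and Stevenson's exact online update; the utterance/scene pairs mimic the "chromium tray" experiment and are invented.

```python
from collections import defaultdict

def learn(pairs, iters=10):
    """Re-estimate p(meaning | word) from utterance/scene co-occurrences."""
    prob = {}  # prob[w][m] approximates p(m | w)

    def p(w, m):
        return prob.get(w, {}).get(m, 1.0)  # uniform before any update

    for _ in range(iters):
        counts = defaultdict(lambda: defaultdict(float))
        # E-step: for each scene concept, words compete for its alignment mass.
        for words, meanings in pairs:
            for m in meanings:
                z = sum(p(w, m) for w in words)
                for w in words:
                    counts[w][m] += p(w, m) / z
        # M-step: renormalize the soft counts into p(m | w).
        for w in counts:
            z = sum(counts[w].values())
            prob[w] = {m: c / z for m, c in counts[w].items()}
    return prob

# Two utterance/scene pairs in the style of "bring the chromium tray, not
# the blue one"; concepts are uppercase, words lowercase (all invented).
pairs = [(["chromium", "tray"], ["OLIVE", "TRAY"]),
         (["blue", "tray"], ["BLUE", "TRAY"])]
p_final = learn(pairs)
```

Because "chromium" and "tray" compete to explain OLIVE and TRAY across scenes, the learner concentrates p(OLIVE | chromium), mirroring the referent-selection behavior described above.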
We are also encouraged by recent commercial applications that use a mixed-language framework for language education. Swych (2015) claims to automatically generate mixed-language documents, while OneThirdStories (2018) creates human-generated mixed-language stories that begin in one language and gradually incorporate more and more vocabulary and syntax from a second language. Such new developments indicate that there is both space and demand in the language learning community for further exploration of language learning via mixed-language modalities.
In the following chapters, we detail our experiments with modeling user comprehension in mixed-language situations, as well as proposing a simplified process for generating macaronic text without initial human student intervention. Finally, we also model ways to track student knowledge as they progress through a language learning activity.
Chapter 3

Modeling Incidental Learning
This chapter details our work on constructing predictive models of human incidental learning. Concretely, we cast the modeling task as a prediction task in which the model predicts whether a human student can guess the meaning of a novel L2 word when it appears within the surrounding L1 context. Apart from giving the model access to the context, we also provide the model with features from the novel word itself, such as spelling and pronunciation features, as these would all aid a human student in their guess of the novel word's meaning. Recall that this model is essentially a model of the environment, taking the reinforcement-learning perspective of our macaronic learning framework (from § 1.5.1), that the agent (the AI teacher) acts upon.
3.1 Foreign Words in Isolation
We first study a constrained setting where we present novice learners with new L2 words inserted in sentences otherwise written in their L1. In this setting only a single L2 word is present in each sentence. While this is not the only possible setting for incidental acquisition (§ 3.2 discusses the same task for the "full" macaronic setting), this experimental design allows us to assume that all subjects understand the full context, without needing to assess how much L2 they previously understood. We also present novice learners with the same novel words out of context. This allows us to study how "cognateness" and context interact, in a well-controlled setting. We hypothesize that cognates or very common words may be easy to translate without context, while contextual clues may be needed to make other words guessable.
In the initial experiments we present here, we focus on the language pair of English L1 and German L2, selecting Mechanical Turk users who self-identify as fluent English speakers with minimal exposure to German. We confine ourselves to novel nouns, as we expect that the relative lack of morphological inflection in nouns in both languages¹ will produce less noisy results than verbs, for example, which naive users might incorrectly inflect in their (English) responses.
Even more experienced L2 readers will encounter novel words when reading L2 text. Their ability to decipher a novel word is known to depend on both their understanding
¹ Specifically, German nouns are marked for number but only rarely for case.
of the surrounding context words (Huckin and Coady, 1999) and the cognateness of the novel word. We seek to evaluate this quantitatively and qualitatively in three "extreme" cases (no context; no cognate information; full context with cognate information). In doing so, we are able to see how learners might react differently to novel words based on their understanding of the context. This can serve as a well-controlled proxy for other incidental learning settings, including reading a language that the learner knows well and encountering novel words, encountering novel vocabulary items in isolation (for example on a vocabulary list), or learner-driven learning tools such as ones involving the reading of macaronic text.
3.1.1 Data Collection and Preparation
We use data from NachrichtenLeicht.de, a source of news articles in simple German (Leichte Sprache, "easy language") (Deutschlandfunk, 2016). Simple German is intended for readers with cognitive impairments and/or whose first language is not German. It follows several guidelines, such as short sentences, simple sentence structure, active voice, hyphenation of compound nouns (which are common in German), and use of prepositions instead of the genitive case (Wikipedia, 2016).
We chose 188 German sentences and manually translated them into English. In each sentence, we selected a single German noun whose translation is a single English noun. This yields a triple of (German noun, English noun, English translation of the context). Each German noun/English noun pair appears only once² and each English sentence is unique, for
² The English word may appear in other sentences, but never in the sentence in which its German counterpart appears.
Task       Text Presented to Learner                              Correct Answer
word       Klima                                                  climate
cloze      The next important ___ conference is in December.      climate
combined   The next important Klima conference is in December.    climate

Table 3.1: Three tasks derived from the same German sentence.
a total of 188 triples. Sentences ranged in length from 5 tokens to 28 tokens, with a mean of 11.47 tokens (median 11). Due to the short length of the sentences, in many cases there was only one possible pair of aligned German and English nouns (both of which were single words rather than noun phrases). In the cases where there were multiple, the translator chose one that had not yet been chosen, and attempted to ensure a wide range from clear cognates to non-cognates, as well as a range in how clearly the context indicated the word.
As an outside resource for training language models and other resources, we chose to use Simple English Wikipedia (Wikimedia Foundation, 2016). It contains 767,826 sentences, covers a similar set of topics to the NachrichtenLeicht.de data, and uses simple sentence structure. The sentence lengths are also comparable, with a mean of 17.6 tokens and a median of 16 tokens. This makes it well-matched for our task.
Our main goal is to examine students' ability to understand novel L2 words. To better separate the effects of context, cognate status, and general familiarity with the nouns, we assess subjects on the three tasks illustrated in Table 3.1:

1. word: Subjects are presented with a single German word out of context, and are asked
to provide their best guess for the translation.

2. cloze: A single noun is deleted from a sentence and subjects are asked to fill in the blank.

3. combined: Subjects are asked to provide their best-guess translation for a single German noun in the context of an English sentence. This is identical to the cloze task, except that the German noun replaces the blank.
We used Amazon Mechanical Turk (henceforth MTurk), a crowdsourcing platform, to recruit subjects and collect their responses to our tasks. Tasks on MTurk are referred to as HITs (Human Intelligence Tasks). In order to qualify for our tasks, subjects completed short surveys on their language skills. They were asked to rate their language proficiency in four languages (English, Spanish, German, and French) on a scale from "None" to "Fluent." The intermediate options were "Up to 1 year of study (or equivalent)" and "More than 1 year of study (or equivalent)."³ Only subjects who indicated that they were fluent in English but indicated "None" for German experience were permitted to complete the tasks.
Additional stratification of users into groups is described in the subsection below. The HITs were presented to subjects in a somewhat randomized order (as per the MTurk standard setup).
Data Collection Protocol: In this setup, each subject may be asked to complete instances of all three tasks. However, the subject is shown at most one task instance that was
³ Subjects were instructed to list themselves as having experience equivalent to language instruction, even if they had not studied in a classroom, if they had been exposed to the language by living in a place where it was spoken, playing online language-learning games, or other such experiences.
derived from a given data triple (for example, at most one line from Table 3.1).⁴ Subjects were paid between $0.05 and $0.08 per HIT, where a HIT consists of 5 instances of the same task. Each HIT was completed by 9 unique subjects. Subjects voluntarily completed from 5 to 90 task instances (1-18 HITs), with a median of 25 instances (5 HITs). HITs took subjects a median of 80.5 seconds according to the MTurk output timing. Each triple gives rise to one cloze, one word, and one combined task. For each of those tasks, 9 users make guesses, for a total of 27 guesses per triple. Data was preprocessed to lowercase all guesses and to correct obvious typos.⁵ Users made 1863 unique guesses (types across all tasks). Of these, 142 were determined to be errors of some sort: 79 were correctable spelling errors, 54 were multiple-word phrases rather than single words, 8 were German words, and 1 was an ambiguous spelling error. In our experiments, we treat all of the uncorrectable errors as out-of-vocabulary tokens, for which we cannot compute features (such as edit distance, etc.).
Data Splits: After collecting data on all triples from our subjects, we split the dataset for purposes of predictive modeling. We randomly partitioned the triples into a training set (112 triples), a development set (38 triples), and a test set (38 triples). Note that the same partition by triples was used across all tasks. As a result, a German noun/English noun pair that appears in test data is genuinely unseen: it did not appear in the training data for any task.
⁴ Each triple gives rise to an instance of each task. Subjects who completed one of these instances were prevented from completing the other two by being assigned an additional MTurk "qualification" (in this case, a disqualification).
⁵ All guesses that were flagged by spell-check were manually checked to see if they constituted typos (e.g., "langauges" for "languages") or spelling errors (e.g., "speach" for "speech") with clear corrections.
3.1.2 Modeling Subject Guesses
In an educational technology context, such as a tool for learning vocabulary, we would like a way to compute the difficulty of examples automatically, in order to present learners with examples at an appropriate level of difficulty. For such an application, it would be useful to know not only whether the learner is likely to correctly guess the vocabulary item, but also whether their incorrect guesses are "close enough" to allow the user to understand the sentence and proceed with reading. We seek to build models that can predict a subject's likely guesses and their probabilities, given the context with which they have been presented.
We use a small set of features (described below) to characterize subjects' guesses and build predictive models of what a subject is likely to guess. Feature functions can jointly examine the input presented to the subject and candidate guesses.
We evaluate the models both in terms of how well they predict subject guesses and how well they perform on the simpler subtask of modeling guessability. We define the guessability of a word in context to be how easy that word is for a subject to guess, given the context. In practice, we estimated guessability as the proportion of subjects that exactly guessed the word (i.e., the reference English translation).
3.1.2.1 Features Used
When our formula for computing a feature draws on parameters that are estimated from corpora, we used the Simple English Wikipedia data and GloVe vectors (Pennington, Socher, and Manning, 2014). Our features are functions whose arguments are the candidate guess and the triple (of German noun, English noun, and English sentence). They are divided into three categories based on which portions of the triple they consider:
Generic Features: These features are independent of subject input, and could be useful regardless of whether the subject made their guess in the word, cloze, or combined task.
1. Candidate==Correct Answer: This feature fires when the candidate is equal to the correct answer.

2. Candidate==OOV: This is used when the candidate guess is not a valid English word (for example, multiple words or an incomprehensible typo), in which case no other features about the candidate are extracted.
3. Length: We compute the number of characters in the correct answer.

4. Embedding: Cosine distance between the embedding of the candidate and the embedding of the correct answer. For the embeddings, we use the 300-dimensional GloVe vectors from the 6B-token dataset.

5. Log Unigram Frequency of the candidate in the Simple English Wikipedia corpus.

6. Levenshtein Distance between the candidate and the correct answer.

7. Sound Edit Distance: Levenshtein distance between phonetic representations of the candidate and the correct answer, as given by Metaphone (Philips, 1990).⁶
⁶ When several variations were available for a particular feature, such as which phonetic representation to use or whether or not to normalize, the version we selected for our studies (and described here) is the version that correlated most strongly with guessability on training data.
8. LCS: Length of the longest common substring between the candidate and the correct answer, normalized.

9. Normalized Trigram Overlap: Count of character trigrams (types) that match between the candidate and the correct answer, normalized by the maximum possible number of matches.
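Two of these string-similarity features are simple enough to compute directly. The sketch below is our own illustrative implementation, not the exact feature code used in the experiments; in particular, we take "maximum possible matches" for the trigram overlap to be the smaller of the two trigram-type counts:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def trigram_overlap(a, b):
    """Character-trigram type overlap, normalized by the smaller
    trigram-type count (an assumed normalization)."""
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

print(levenshtein("Klima", "climate"))      # → 3 (one substitution, two insertions)
print(trigram_overlap("klima", "climate"))  # "lim" and "ima" shared: 2/3
```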
Word Features: These features are dependent on the German word, and should thus only be useful in the word and combined tasks. The second half of the generic features (from Levenshtein Distance through Normalized Trigram Overlap) are also computed between the candidate and the German word and are used as measures of cognateness. The use of Metaphone (which is intended to predict the pronunciation of English words) is appropriate for German words in this case, as it corresponds to the assumption that our learners do not yet have accurate representations of German pronunciation and may be applying English pronunciation rules to German.

Cloze Features: These features are dependent on the surrounding English context, and should thus only be useful in the cloze and combined tasks.
1. LM Score of the candidate in context, using a 5-gram language model built with KenLM (Heafield et al., 2013) and a neural RNNLM (Mikolov et al., 2011).⁷ We compute three different features for each language model: a raw LM score, a sentence-length-normalized LM score, and the difference between the LM score with the correct answer in the sentence and the LM score with the candidate in its place.
⁷ We use the Faster-RNNLM toolkit available at https://github.com/yandex/faster-rnnlm.
2. PMI: Maximum pointwise mutual information between any word in the context and the candidate.

3. Left Bigram Collocations: These are bigram association measures between the candidate's neighbor(s) to the left and the candidate (Church and Hanks, 1990). We include a version that just examines the neighbor directly to the left (which we'd expect to do well in collocations like "San Francisco") as well as a version that returns the maximum score over a window of five, which behaves like an asymmetric version of PMI.

4. Embeddings: Minimum cosine distance between the embeddings of any word in the context and the candidate.
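As an illustration of the PMI feature, the sketch below estimates PMI from sentence-level co-occurrence counts over a toy corpus. This is an assumption-laden simplification: the actual feature draws counts from Simple English Wikipedia, and the choice of base-2 logarithm and sentence-level co-occurrence here is ours.

```python
import math
from collections import Counter
from itertools import combinations

def max_pmi(candidate, context, sentences):
    """Maximum PMI(candidate, w) over context words w, with
    PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) estimated from
    sentence-level type counts. Returns -inf if nothing co-occurs."""
    n = len(sentences)
    unigram, pair = Counter(), Counter()
    for sent in sentences:
        types = sorted(set(sent))
        unigram.update(types)
        pair.update(combinations(types, 2))  # sorted pairs of distinct types
    best = float("-inf")
    for w in set(context):
        if w == candidate:
            continue
        key = tuple(sorted((candidate, w)))
        if pair[key] and unigram[w]:
            best = max(best,
                       math.log2(pair[key] * n / (unigram[candidate] * unigram[w])))
    return best

# Toy corpus: "plane" co-occurs only with "airport", "the" with everything.
corpus = [
    ["his", "plane", "landed", "at", "the", "airport"],
    ["the", "airport", "was", "busy"],
    ["the", "cat", "slept"],
]
print(max_pmi("airport", ["his", "plane", "landed", "at", "the"], corpus))
# log2(1.5) ≈ 0.585, driven by words that co-occur only with "airport"
```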
Intuitively, we expect it to be easiest to guess the correct word in the combined task, followed by the cloze task, followed by the L2 word with no context.⁸ As shown in Figure 3.1, this is borne out in our data.

In Table 3.2 we show Spearman correlations between several features and the guessability of the word (given a word, cloze, or combined context). The first two features (log unigram probability and length of the correct solution) in Table 3.2 belong to the generic category of features. We expect that learners may have an easier time guessing short or common words (for instance, it may be easier to guess "cat" than "trilobite"), and we do observe such correlations.
⁸ All plots/values in the remainder of this subsection are computed only over the training data unless otherwise noted.
Feature                                                Spearman's Correlation w/ Guessability
                                                       Word     Cloze    Combined   All
Log Unigram Frequency                                   0.310*   0.262*   0.279*     0.255*
English Length                                         -0.397*  -0.392*  -0.357*    -0.344*
Sound Edit Distance (German + Answer)                  -0.633*   n/a     -0.575*    -0.409*
Levenshtein Distance (German + Answer)                 -0.606*   n/a     -0.560*    -0.395*
Max PMI (Answer + Context)                              n/a      0.480*   0.376*     0.306*
Max Left Bigram Collocations (Answer + Window=5)        n/a      0.474*   0.186      0.238*
Max Right Bigram Collocations (Answer + Window=5)       n/a      0.119    0.064      0.038

Table 3.2: Correlations between selected feature values and answer guessability, computed on training data (starred correlations significant at p < 0.01). Unavailable features are represented by "n/a" (for example, since the German word is not observed in the cloze task, its edit distance to the correct solution is unavailable). Due to the format of our triples, it is still possible to test whether these unavailable features influence the subject's guess: in almost all cases they indeed do not appear to, since the correlation with guessability is low (absolute value < 0.15) and not statistically significant even at the p < 0.05 level.
[Figure 3.1: bar chart of average accuracy of user guesses (in %) for the word, cloze, and combined tasks.]

Figure 3.1: Average guessability by context type, computed on 112 triples (from the training data). Error bars show 95%-confidence intervals for the mean, under bootstrap resampling of the 112 triples (we use BCa intervals). Mean accuracy increases significantly from each task to the next (same test on difference of means, p < 0.01).
In the middle of the table, we can see how, despite the word task being most difficult on average, there are cases such as Gitarrist (guitarist) where cognateness allows all or nearly all learners to guess the meaning of the word with no context. The correlations between guessability and Sound Edit Distance as well as Levenshtein Distance demonstrate their usefulness as proxies for cognateness. The other word features described earlier also show strong correlation with guessability in the word and combined tasks.

Similarly, in some cloze tasks, strong collocations or context clues, as in the case of "His plane landed at the ___," make it easy to guess the correct solution (airport). Arguably, even the number of blank words to fill is a clue to completing this sentence, but we do not model this in our study. We would expect, for instance, a high PMI between "plane" and "airport", and we see this reflected in the correlation between high PMI and guessability.
The final two lines of the table examine an interesting quirk of bigram association measures. We see that Left Bigram Collocations with a window of 5 (that is, where the feature returns the maximum collocation score between a word in the window to the left of the word to be guessed and that word) shows reasonable correlation with guessability. Right bigram collocations, on the other hand, do not appear to correlate. This suggests that the subjects focus more on the words preceding the blank when formulating their guess (which makes sense as they read left-to-right). Due to its poor performance, we do not include Right Bigram Collocations in our later experiments.
We expect that learners who see only the word will make guesses that lean heavily on cognateness (for example, incorrectly guessing "Austria" for "Ausland"), while learners who see the cloze task will choose words that make sense semantically (e.g., incorrectly guessing "tornado" in the sentence "The ___ destroyed many houses and uprooted many trees"). Guesses for the combined task may fall somewhere between these two, as the learner takes advantage of both sources of information. Here we focus on incorrect guesses (to control for the differences in task difficulty).

For example, in Figure 3.2, we see that guesses (made by our human subjects) in the word task have higher average Normalized Character Trigram Overlap than guesses in the cloze task, with the combined task in between.

This pattern of the combined task falling between the word and cloze tasks is consistent across most features examined.
[Figure 3.2: bar chart of normalized trigram overlap for the word, cloze, and combined tasks.]

Figure 3.2: Average Normalized Character Trigram Overlap between guesses and the German word.
3.1.3 Model
The correlations in the previous subsection support our intuitions about how to model subject behavior in terms of cognateness and context. Of course, we expect that rather than basing their guesses on a single feature, subjects are performing cue combination, balancing multiple cognate and context clues (whenever they are available). Following that to its natural conclusion, we choose a model that also allows for cue combination in order to model subject guesses.

We use log-linear models to model subject guesses as probability distributions over the vocabulary V, as seen in Equations 3.1 and 3.2.
p(y | x) = exp(w · f(x, y)) / Σ_{y′ ∈ V} exp(w · f(x, y′))        (3.1)

w · f(x, y) = Σ_k w_k f_k(x, y)        (3.2)
We use a 5000-word vocabulary, containing all complete English vocabulary from the triples and user guesses, padded with frequent words from the Simple English Wikipedia dataset.

Given the context x that the subject was shown (word, cloze, or combined), p(y | x) represents the probability that a subject would guess the vocabulary item y ∈ V. The model learns weights w_k for each feature f_k(x, y). We train the model using MegaM (Daumé III, 2004) via the NLTK interface.
An example feature function is shown in Equation 3.3.

f_k(x, y) = 1 if Correct Answer == y, and 0 otherwise        (3.3)
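The distribution of Equations 3.1-3.2 can be sketched as follows. The tiny vocabulary, feature functions, and weights below are hypothetical stand-ins for the real 5000-word model trained with MegaM:

```python
import math

def guess_distribution(x, vocab, features, weights):
    """p(y | x) ∝ exp(w · f(x, y)), normalized over the vocabulary
    (Equations 3.1-3.2). `features(x, y)` returns a dict of feature
    values; `weights` maps feature names to learned weights."""
    def score(y):
        return sum(weights.get(k, 0.0) * v for k, v in features(x, y).items())
    scores = {y: score(y) for y in vocab}
    m = max(scores.values())                       # for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Hypothetical features: indicator for the reference answer, plus a
# crude character-overlap cue against the German word (not the actual
# feature set described in the text).
def toy_features(x, y):
    german, answer = x
    overlap = len(set(y) & set(german.lower())) / len(set(y))
    return {"is_answer": float(y == answer), "char_overlap": overlap}

vocab = ["climate", "climb", "weather", "cat"]
weights = {"is_answer": 2.0, "char_overlap": 1.0}
p = guess_distribution(("Klima", "climate"), vocab, toy_features, weights)
print(max(p, key=p.get))  # → climate
```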
In order to best leverage the cloze features (shared across the cloze and combined tasks), the word features (shared across the word and combined tasks), and the generic features (shared across all tasks), we take the domain adaptation approach of Daumé III (2007). In this approach, instead of a single feature for the Levenshtein distance between a German word and a candidate guess, we instead have three copies of this feature: one that fires only when the subject is presented with the word task, one that fires when the subject is presented with the combined task, and one that fires in either of those situations (note that since a subject who sees the cloze task does not see the German word, we omit such a version of the feature). This allows us to learn different weights for different versions of the same features. For example, this allows the model to learn a high weight for cognateness features
in the word and combined settings, without being influenced to learn a low weight on them by the cloze setting.
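The feature-copying scheme can be sketched as below. For simplicity this version makes one task-specific copy plus one fully shared copy per feature (the actual setup shares the third copy only across the tasks where the feature is defined), and the naming convention is ours:

```python
def augment(base_features, task):
    """Frustratingly-easy domain adaptation (Daumé III, 2007), simplified:
    copy each base feature into a shared version and a task-specific
    version, so the learner can weight them differently per task."""
    out = {}
    for name, value in base_features.items():
        out["shared:" + name] = value   # fires in every task
        out[task + ":" + name] = value  # fires only in this task
    return out

print(augment({"levenshtein": 3.0}, "word"))
# {'shared:levenshtein': 3.0, 'word:levenshtein': 3.0}
```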
3.1.3.1 Evaluating the Models
We evaluate the models in several ways: using conditional cross-entropy, by computing mean reciprocal rank, and by computing correlation with guessability.

The empirical distribution for a given context x is calculated from all count(· | x) learner guesses for that context, with p(g | x) = count(g | x) / count(· | x), where count(g | x) is the number of learners who guessed g in the context x.
The conditional cross-entropy is defined to be the mean negative log probability over all N test task instances (pairs of subject guesses g_i and contexts x_i): (1/N) Σ_{i=1}^{N} −log₂ p(g_i | x_i).
The mean reciprocal rank is computed after ranking all vocabulary words (in each context) by the probability assigned to them by the model, calculating the reciprocal rank of each subject guess g_i, and then averaging this across all contexts x_i in the set X of all contexts, as shown in Equation 3.4.

MRR = (1/N) Σ_{i=1}^{N} 1 / rank(g_i | x_i)        (3.4)

In order to compute correlation with guessability, we use Spearman's rho to check the correlation between guessability and the probability assigned by the model to the correct answer.
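The first two metrics are straightforward to compute. A minimal sketch, with hypothetical model outputs standing in for a trained model:

```python
import math

def conditional_cross_entropy(pairs, model_prob):
    """Mean negative log2 probability of each observed guess g_i in its
    context x_i (the conditional cross-entropy defined in the text)."""
    return sum(-math.log2(model_prob(g, x)) for g, x in pairs) / len(pairs)

def mean_reciprocal_rank(pairs, rank):
    """Average of 1 / rank(g_i | x_i) over all guess/context pairs
    (Equation 3.4); rank(g, x) is the model's rank of guess g in x."""
    return sum(1.0 / rank(g, x) for g, x in pairs) / len(pairs)

# Hypothetical guess/context pairs and model behavior.
data = [("climate", "ctx1"), ("airport", "ctx2")]
print(conditional_cross_entropy(data, lambda g, x: 0.25))  # -log2(0.25) = 2.0
print(mean_reciprocal_rank(data, lambda g, x: 1 if g == "climate" else 4))
# (1/1 + 1/4) / 2 = 0.625
```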
3.1.4 Results and Analysis
[Figure 3.3: scatter plot of model probability of the correct answer (y-axis, 0.0-1.0) against empirical probability of the correct answer (x-axis, 0.0-1.0).]

Figure 3.3: Correlation between the empirically observed probability of the correct answer (i.e., the proportion of human subject guesses that were correct) and the model probability assigned to the correct answer, across all tasks in the test set. Spearman's correlation of 0.725.
In Table 3.3 we show model results over several feature sets. None of the feature sets (generic features, word features, or cloze features) can perform well individually on the full test set, which contains word, cloze, and combined tasks. Once combined, they perform better on all metrics.

Additionally, using domain adaptation improves performance. Manually examining the best model's most informative features, we see, for example, that edit distance features are ranked highly in their word-only or word-combined versions, while the combined-only version of those features is less informative. This reflects our earlier observation that edit distance features are highly correlated with guessability in the word task, and slightly less so
Features                                         Cross-Entropy   MRR     Correlation
LCS (Candidate + Answer)                         10.72           0.067   0.346*
All Generic Features                              8.643          0.309   0.168
Sound Edit Distance (Candidate + German Word)    10.847          0.081   0.494*
All Word Features                                10.018          0.187   0.570*
LM Difference                                    11.214          0.051   0.398*
All Cloze Features                               10.008          0.105   0.351*
Generic + Word                                    7.651          0.369   0.585*
Generic + Cloze                                   8.075          0.320   0.264*
Word + Cloze                                      8.369          0.227   0.706*
All Features (No Domain Adapt.)                   7.344          0.338   0.702*
All Features + Domain Adapt.                      7.134          0.382   0.725*

Table 3.3: Feature ablation. The single highest-correlating feature (on the dev set) from each feature group is shown, followed by the entire feature group. All versions with more than one feature include a feature for the OOV guess. In the correlation column, p-values < 0.01 are marked with an asterisk.
Context                       Observed Guess   Truth         Hypothesized Explanation
Helfer                        cow              helpers       False friend: Helfer → Heifer → Cow
Journalisten                  reporter         journalists   Synonym and incorrect number.
The Lage is too dangerous.    lake             location      Influenced by spelling and context.

Table 3.4: Examples of incorrect guesses and potential sources of confusion.
(though still relevant) in the combined task. We show in Figure 3.3 that the model probability assigned to the correct guess correlates strongly with the probability of learners correctly guessing it.

Annotated Guesses: To take a fine-grained look at guesses, we broke down subject guesses into several categories.
[Figure 3.4: bar chart of label frequencies for the word, cloze, and combined tasks.]

Figure 3.4: Percent of examples labeled with each label by a majority of annotators (may sum to more than 100%, as multiple labels were allowed).
We had 4 annotators (fluent English speakers, but non-experts) label 50 incorrect subject
guesses from each task, sampled randomly from the spell-corrected incorrect guesses in the training data, with the following labels indicating why the annotator thought the subject made the (incorrect) guess they did, given the context that the subject saw: false friend/cognate/spelling bias (the learner appears to have been influenced by the spelling of the German word), synonym (the learner's guess is a synonym or near-synonym of the correct answer), incorrect number/POS (correct noun with incorrect number, or incorrect POS), and context influence (a word that makes sense in the cloze/combined context but is not correct). Examples of the range of ways in which errors can manifest are shown in Table 3.4. Annotators made a binary judgment for each of these labels. Inter-annotator agreement was substantial, with Fleiss's kappa of 0.654. Guesses were given a label only if the majority of annotators agreed.
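For reference, Fleiss's kappa can be computed from per-item category counts. A minimal sketch (the two-item tables below are hypothetical, not our annotation data):

```python
def fleiss_kappa(tables):
    """Fleiss's kappa for N items, each rated by n annotators into k
    categories. `tables` is a list of per-item category-count rows
    (each row sums to n)."""
    n = sum(tables[0])   # raters per item
    N = len(tables)      # number of items
    k = len(tables[0])   # number of categories
    # Mean observed per-item agreement.
    p_bar = sum(
        sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in tables
    ) / N
    # Chance agreement from marginal category proportions.
    p_e = sum((sum(row[j] for row in tables) / (N * n)) ** 2 for j in range(k))
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical: 2 guesses, 4 annotators, binary label (applies / does not).
print(fleiss_kappa([[4, 0], [2, 2]]))  # → 1/9 ≈ 0.111
```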
In Figure 3.4, we can make several observations about subject behavior. First, the labels for the combined and cloze tasks tend to be more similar to one another, and quite different from the word-task labels. In particular, in the majority of cases, subjects completing cloze and combined tasks choose words that fit the context they have observed. Even when subjects in the cloze and combined tasks make errors, they choose words that still make sense in context more than 50% of the time, while spelling does not exert an equally strong influence in the word task.

Our model also makes predictions that look plausibly like those made by the human subjects. For example, given the context "In ___, the AKP now has the most representatives." the model ranks the correct answer ("parliament") first, followed by "undersecretary," "elections," and "congress," all of which are thematically appropriate, and
most of which fit contextually into the sentence. For the German word "Spieler", the top-ranking predictions made by the model are "spider," "smaller," and "spill," while one of the actual user guesses ("speaker") is ranked as the 10th most likely (out of a vocabulary of 5000 items).
3.2 Macaronic Setting
In § 3.1 we restricted the sentences to contain only a single noun token in the foreign language, while the rest of the tokens were in L1. In this section we model incidental comprehension for the "full macaronic" setting, where multiple tokens can be in the foreign language, and the word ordering may also follow the foreign language's word order. The stimuli of interest are now like "Der Polizist arrested the Bankräuber." ("The police arrested the bank robber"). Even in this scenario, where multiple tokens have been replaced with their L2 equivalents, a reader with no knowledge of German is likely to be able to understand this sentence reasonably well by using cognate and context clues. These clues provide enough scaffolding for the reader to infer the meaning of novel words and, hopefully, the meaning of the entire sentence. In these stimuli the novel foreign words jointly influence each other, along with the other words that are in the student's native language. Of course, there are several possible configurations of this sentence under which a reader might not be able to understand it. Our goal is to model configurations that are understandable while exposing the reader to as much novel L2 vocabulary and sentence structure as possible.
Our experimental subjects are required to guess what "Polizist" and "Bankräuber" mean in this sentence. We train a featurized model to predict these guesses jointly within each sentence and thereby predict incidental comprehension on any macaronic sentence. Indeed, we hope our model design will generalize from predicting incidental comprehension on macaronic sentences (for our beginner subjects, who need some context words to be in English) to predicting incidental comprehension on full German sentences (for more advanced students, who understand some of the context words as if they were in English). In addition, we developed a user interface that uses macaronic sentences directly as a medium of language instruction. Chapter 4 details the user interface.
3.2.1 Data Collection Setup
Our method of scaffolding is to replace certain foreign words and phrases with their English translations, yielding a macaronic sentence.⁹ Simply presenting these to a learner would not give us feedback on the learner's belief state for each foreign word. Even assessing the learner's reading comprehension would give only weak, indirect information about what was understood. Thus, we collect data where a learner explicitly guesses a foreign word's translation when seen in the macaronic context. These guesses are then treated as supervised labels to train our user model.

We used Amazon Mechanical Turk (MTurk) to collect data. Users qualified for tasks by completing a short quiz and survey about their language knowledge. Only users whose
⁹ Although the language distinction is indicated by italics and color, users were left to figure this out on their own.
results indicated no knowledge of German and who self-identified as native speakers of English were allowed to complete tasks. With German as the foreign language, we generated content by crawling a simplified-German news website, nachrichtenleicht.de. We chose simplified German in order to minimize translation errors and to make the task more suitable for novice learners. We translated each German sentence using the Moses Statistical Machine Translation (SMT) toolkit (Koehn et al., 2007). The SMT system was trained on the German-English Commoncrawl parallel text used in WMT 2015 (Bojar et al., 2015).

We used 200 German sentences, presenting each to 10 different users. In MTurk jargon, this yielded 2000 Human Intelligence Tasks (HITs). Each HIT required its user to participate in several rounds of guessing as the English translation was incrementally revealed. A user was paid US$0.12 per HIT, with a bonus of US$6 to any user who accumulated more than 2000 total points.
3.2.1.1 HITs and Submissions
For each HIT, the user first sees a German sentence10 (Figure 3.5). A text box is presented below each German word in the sentence, for the user to type in their "best guess" of what each German word means. The user must fill in at least half of the text boxes before submitting this set of guesses. The resulting submission (i.e., the macaronic sentence together with the set of guesses) is logged in a database as a single training example, and

10 Except that we first "translate" any German words that have identical spelling in English (case-insensitive). This includes most proper names, numerals, and punctuation marks. Such translated words are displayed in English style (blue italics), and the user is not asked to guess their meanings.
Figure 3.5: After a user submits a set of guesses (top), the interface marks the correct guesses in green and also reveals a set of translation clues (bottom). The user now has the opportunity to guess again for the remaining German words.

the system displays feedback to the user about which guesses were correct.
After each submission, new clues are revealed (providing increased scaffolding) and the user is asked to guess again. The process continues, yielding multiple submissions, until all German words in the sentence have been translated. At this point, the entire HIT is considered completed and the user moves to a new HIT (i.e., a new sentence).

From our 2000 HITs, we obtained 9392 submissions (4.7 per HIT) from 79 distinct MTurk users.
3.2.1.2 Clues
Each update provides new clues to help the user make further guesses. There are 2 kinds of clues:

Translation Clue (Figure 3.5): A set of words that were originally in German are replaced with their English translations. The text boxes below these words disappear, since it is no longer necessary to guess them.
Figure 3.6: In this case, after the user submits a set of guesses (top), two clues are revealed (bottom): ausgestellt is moved into English order and then translated.

Reordering Clue (Figure 3.6): A German substring is moved into a more English-like position. The reordering positions are calculated using the word and phrase alignments obtained from Moses.
Each time the user submits a set of guesses, we reveal a sequence of n = max(1, round(N/3)) clues, where N is the number of German words remaining in the sentence. For each clue, we sample a token that is currently in German. If the token is part of a movable phrase, we move that phrase; otherwise we translate the minimal phrase containing that token. These moves correspond exactly to clues that a user could request by clicking on the token in the macaronic reading interface; see Chapter 4 for details of how moves are constructed and animated. In our present experiments, the system is in control instead, and grants clues by "randomly clicking" on n tokens.

The system's probability of sampling a given token is proportional to its unigram type probability in the WMT corpus. Thus, rarer words tend to remain in German for longer, allowing the Turker to attempt more guesses for these "difficult" words. However, close cognates would be an exception to this rule, as they are not frequent but probably easy for Turkers to guess.
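This clue-scheduling policy can be sketched in a few lines. The snippet below is a minimal reconstruction, not the system's actual code: `unigram_prob` is a hypothetical dictionary of WMT-estimated unigram type probabilities, and the move-vs-translate decision is omitted.

```python
import random

def sample_clue_tokens(german_tokens, unigram_prob):
    """Choose which German tokens receive clues after one submission.

    Reveals n = max(1, round(N / 3)) clues, where N is the number of
    tokens still in German.  Tokens are drawn without replacement with
    probability proportional to their unigram type probability, so
    rare words tend to remain in German longer.
    """
    n = max(1, round(len(german_tokens) / 3))
    remaining = list(german_tokens)
    chosen = []
    for _ in range(min(n, len(remaining))):
        weights = [unigram_prob.get(tok, 1e-9) for tok in remaining]
        tok = random.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(tok)
        chosen.append(tok)
    return chosen
```

Each chosen token would then trigger either a reordering clue (if it sits in a movable phrase) or a translation clue for the minimal phrase containing it.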
3.2.1.3 Feedback
When a user submits a set of guesses, the system responds with feedback. Each guess is visibly "marked" in left-to-right order, momentarily shaded with green (for correct), yellow (for close) or red (for incorrect). Depending on whether a guess is correct, close, or wrong, users are awarded points as discussed below. Yellow and red shading then fades, to signal to the user that they may try entering a new guess. Correct guesses remain on the screen for the entire task.
3.2.1.4 Points
Adding points to the process (Figures 3.5–3.6) adds a game-like quality and lets us incentivize users by paying them for good performance. We award 10 points for each exactly correct guess (case-insensitive). We give additional "effort points" for a guess that is close to the correct translation, as measured by cosine similarity in vector space. (We used pretrained GloVe word vectors (Pennington, Socher, and Manning, 2014); when the guess or correct translation has multiple words, we take the average of the word vectors.) We deduct effort points for guesses that are careless or very poor. Our rubric for effort points is as follows:
$$ep = \begin{cases} -1, & \text{if } \hat{e} \text{ is repeated or nonsense (red)} \\ -1, & \text{if } \mathrm{sim}(\hat{e}, e^*) < 0 \text{ (red)} \\ 0, & \text{if } 0 \le \mathrm{sim}(\hat{e}, e^*) < 0.4 \text{ (red)} \\ 0, & \text{if } \hat{e} \text{ is blank} \\ 10 \times \mathrm{sim}(\hat{e}, e^*), & \text{otherwise (yellow)} \end{cases}$$
Here sim(ê, e*) is the cosine similarity between the vector embeddings of the user's guess ê and our reference translation e*. A "nonsense" guess contains a word that does not appear in the sentence bitext nor in the 20,000 most frequent word types in the GloVe training corpus. A "repeated" guess is an incorrect guess that appears more than once in the set of guesses being submitted.

In some cases, ê or e* may itself consist of multiple words. In this case, our points and feedback are based on the best match between any word of ê and any word of e*. In alignments where multiple German words translate as a single phrase,11 we take the phrasal translation to be the correct answer e* for each of the German words.
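The rubric above translates directly into code. The sketch below is our reconstruction, not the original implementation: embeddings are plain Python lists standing in for GloVe vectors, and the repeated/nonsense checks are assumed to be computed elsewhere as described.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def avg_vec(phrase, vecs):
    """Average the word vectors of a (possibly multi-word) phrase."""
    words = phrase.split()
    dim = len(next(iter(vecs.values())))
    return [sum(vecs[w][d] for w in words) / len(words) for d in range(dim)]

def effort_points(guess, reference, vecs, repeated=False, nonsense=False):
    """Effort points ep for one guess, following the rubric above."""
    if repeated or nonsense:
        return -1.0
    if not guess.strip():          # blank guess
        return 0.0
    sim = cosine(avg_vec(guess, vecs), avg_vec(reference, vecs))
    if sim < 0:
        return -1.0
    if sim < 0.4:
        return 0.0
    return 10.0 * sim
```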
3.2.1.5 Normalization

After collecting the data, we normalized the user guesses for further analysis. All guesses were lowercased. Multi-word guesses were crudely replaced by the longest word in the guess (breaking ties in favor of the earliest word).

11 Our German-English alignments are constructed as in Renduchintala et al. (2016a).
The guesses included many spelling errors as well as some nonsense strings and direct copies of the input. We defined the dictionary to be the 100,000 most frequent word types (lowercased) from the WMT English data. If a user's guess ê does not match e* and is not in the dictionary, we replace it with:

• the special symbol <COPY>, if ê appears to be a copy of the German source word f (meaning that its Levenshtein distance from f is < 0.2 · max(|ê|, |f|));

• else, the closest word in the dictionary12 as measured by Levenshtein distance (breaking ties alphabetically), provided the dictionary has a word at distance ≤ 2;

• else
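A sketch of this normalization pipeline (our reconstruction, not the original code): the 100,000-word dictionary is abstracted as a Python set, and since the final fallback is not fully specified in the text, the sketch simply leaves such guesses unchanged; that last branch is an assumption.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize_guess(guess, reference, source, dictionary):
    guess = guess.lower()
    if " " in guess:
        # Keep the longest word; max() keeps the earliest on ties.
        guess = max(guess.split(), key=len)
    if guess == reference or guess in dictionary:
        return guess
    if edit_distance(guess, source) < 0.2 * max(len(guess), len(source)):
        return "<COPY>"
    # Closest dictionary word within distance 2, ties broken alphabetically.
    dist, word = min((edit_distance(guess, w), w) for w in dictionary)
    if dist <= 2:
        return word
    return guess  # assumed fallback (unspecified in the text)
```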
3.2.2 User Model
In each submission, the user jointly guesses several English words, given spelling and context clues. One way that a machine could perform this task is via probabilistic inference in a factor graph, and we take this as our model of how the human user solves the problem.

The user observes a German sentence f = [f_1, f_2, ..., f_i, ..., f_n]. The translation of each word token f_i is E_i, which is from the user's point of view a random variable. Let Obs denote the set of indices i for which the user also observes that E_i = e_i*, the aligned reference translation, because e_i* has already been guessed correctly (green feedback) or shown as a clue. Thus, the user's posterior distribution over E is P_θ(E = e | E_Obs = e*_Obs, f, history), where "history" denotes the user's history of past interactions.

12 Considering only words returned by the PyEnchant 'suggest' function (http://pythonhosted.org/pyenchant/).
We assume that a user's submission ê is derived from this posterior distribution simply as a random sample. We try to fit the parameter vector θ to maximize the log-probability of the submission. Note that our model is trained on the user guesses ê, not the reference translations e*. That is, we seek parameters θ that would explain why all users made their guesses.

Although we fit a single θ, this does not mean that we treat users as interchangeable (since θ can include user-specific parameters) or unvarying (since our model conditions users' behavior on their history, which can capture some learning).
3.2.3 Factor Graph
We model the posterior distribution as a conditional random field (Figure 3.7) in which the value of E_i depends on the form of f_i as well as on the meanings e_j (which may be either observed or jointly guessed) of the context words at j ≠ i:

$$P_\theta(E = e \mid E_{\mathrm{Obs}} = e^*_{\mathrm{Obs}}, f, \mathrm{history}) \;\propto\; \prod_{i \notin \mathrm{Obs}} \Big( \psi^{ef}(e_i, f_i) \cdot \prod_{j \neq i} \psi^{ee}(e_i, e_j, i - j) \Big) \qquad (3.5)$$
We will define the factors ψ (the potential functions) in such a way that they do not "know German" but only have access to information that is available to a naive English speaker. In brief, the factor ψ^ef(e_i, f_i) considers whether the hypothesized English word e_i

[Figure 3.7 diagram: variables E_1, ..., E_i, ..., E_n, each connected to its German word f_i by a factor ψ^ef(e_i, f_i), and connected pairwise by factors ψ^ee(e_i, e_j).]

Figure 3.7: Model for user understanding of L2 words in sentential context. This figure shows an inference problem in which all the observed words in the sentence are in German (that is, Obs = ∅). As the user observes translations via clues or correctly-marked guesses, some of the E_i become shaded.
"looks like" the observed German word f_i, and whether the user has previously observed during data collection that e_i is a correct or incorrect translation of f_i. Meanwhile, the factor ψ^ee(e_i, e_j) considers whether e_i is commonly seen in the context of e_j in English text. For example, the user will elevate the probability that E_i = cake if they are fairly certain that E_j is a related word like eat or chocolate.
The potential functions ψ are parameterized by θ, a vector of feature weights. For convenience, we define the features in such a way that we expect their weights to be positive. We rely on just 6 features at present (see Section 3.2.7 for future work), although each is complex and real-valued. Thus, the weights θ control the relative influence of these 6 different types of information on a user's guess. Our features broadly fall under the following categories: Cognate, History, and Context. We precomputed cognate and context features, while history features are computed on-the-fly for each training instance. All features are case-insensitive.
3.2.3.1 Cognate Features
For each German token f_i, the ψ^ef factor can score each possible guess e_i of its translation:

$$\psi^{ef}(e_i, f_i) = \exp\big(\theta^{ef} \cdot \phi^{ef}(e_i, f_i)\big) \qquad (3.6)$$

The feature function φ^ef returns a vector of 4 real numbers:

• Orthographic Similarity: One minus the normalized Levenshtein distance between the two strings.

$$\phi^{ef}_{\mathrm{orth}}(e_i, f_i) = 1 - \frac{\mathrm{lev}(e_i, f_i)}{\max(|e_i|, |f_i|)} \qquad (3.7)$$

The weight on this feature encodes how much users pay attention to spelling.
• Pronunciation Similarity: This feature is similar to the previous one, except that it calculates the normalized distance between the pronunciations of the two words:

$$\phi^{ef}_{\mathrm{pron}}(e_i, f_i) = \phi^{ef}_{\mathrm{orth}}\big(\mathrm{prn}(e_i), \mathrm{prn}(f_i)\big) \qquad (3.8)$$

where the function prn(x) maps a string x to its pronunciation. We obtained pronunciations for all words in the English and German vocabularies using the CMU pronunciation dictionary tool (Weide, 1998). Note that we use English pronunciation rules even for German words. This is because we are modeling a naive learner who may, in the absence of intuition about German pronunciation rules, apply English pronunciation rules to German.
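Both cognate features reduce to the same normalized-similarity computation. The sketch below illustrates equations (3.7) and (3.8) under our assumptions; `prn` stands in for a CMU-dictionary lookup that returns a phone sequence.

```python
def levenshtein(a, b):
    """Levenshtein distance; works on strings or phone sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def phi_orth(e, f):
    """Equation (3.7): one minus the normalized Levenshtein distance."""
    return 1.0 - levenshtein(e, f) / max(len(e), len(f))

def phi_pron(e, f, prn):
    """Equation (3.8): orthographic similarity of the pronunciations."""
    return phi_orth(prn(e), prn(f))
```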
3.2.3.2 History Features
• Positive History Feature: If a user has been rewarded in a previous HIT for guessing e_i as a translation of f_i, then they should be more likely to guess it again. We define φ^ef_{hist+}(e_i, f_i) to be 1 in this case and 0 otherwise. The weight on this feature encodes whether users learn from positive feedback.

• Negative History Feature: If a user has already incorrectly guessed e_i as a translation of f_i in a previous submission during this HIT, then they should be less likely to guess it again. We define φ^ef_{hist-}(e_i, f_i) to be −1 in this case and 0 otherwise. The weight on this feature encodes whether users remember negative feedback.13
3.2.3.3 Context Features
In the same way, the ψ^ee factor can score the compatibility of a guess e_i with a context word e_j, which may itself be a guess, or may be observed:

$$\psi^{ee}_{ij}(e_i, e_j) = \exp\big(\theta^{ee} \cdot \phi^{ee}(e_i, e_j, i - j)\big) \qquad (3.9)$$

13 At least in short-term memory; this feature currently omits to consider any negative feedback from previous HITs.
φ^ee returns a vector of 2 real numbers:

$$\phi^{ee}_{\mathrm{pmi}}(e_i, e_j) = \begin{cases} \mathrm{PMI}(e_i, e_j) & \text{if } |i - j| > 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.10)$$

$$\phi^{ee}_{\mathrm{pmi1}}(e_i, e_j) = \begin{cases} \mathrm{PMI}_1(e_i, e_j) & \text{if } |i - j| = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.11)$$

where the pointwise mutual information PMI(x, y) measures the degree to which the English words x, y tend to occur in the same English sentence, and PMI_1(x, y) measures how often they tend to occur in adjacent positions. These measurements are estimated from the English side of the WMT corpus, with smoothing performed as in Knowles et al. (2016).
For example, if f_i = Suppe, the user's guess of E_i should be influenced by f_j = Brot appearing in the same sentence, if the user suspects or observes that its translation is E_j = bread. The PMI feature knows that soup and bread tend to appear in the same English sentences, whereas PMI_1 knows that they tend not to appear in the bigram soup bread or bread soup.
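The sentence-level co-occurrence statistic can be estimated with simple counting. The sketch below is an unsmoothed illustration only: the actual estimates are smoothed as in Knowles et al. (2016), and PMI_1 would count adjacent bigrams analogously.

```python
import math
from collections import Counter

def make_pmi(sentences):
    """Estimate PMI(x, y) over same-sentence co-occurrence.

    `sentences` is a list of tokenized English sentences.  Returns a
    function pmi(x, y); unseen pairs get -inf (no smoothing here).
    """
    n = len(sentences)
    word_df = Counter()   # sentences containing each word type
    pair_df = Counter()   # sentences containing each word-type pair
    for sent in sentences:
        types = sorted(set(sent))
        word_df.update(types)
        for a in range(len(types)):
            for b in range(a + 1, len(types)):
                pair_df[(types[a], types[b])] += 1

    def pmi(x, y):
        key = (min(x, y), max(x, y))
        if pair_df[key] == 0:
            return float("-inf")
        return math.log((pair_df[key] / n) /
                        ((word_df[x] / n) * (word_df[y] / n)))

    return pmi
```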
3.2.3.4 User-Specific Features
Apart from the basic 6-feature model, we also trained a version that includes user-specific copies of each feature (similar to the domain adaptation technique of Daumé III (2007)). For example, φ^ef_{orth,32}(e_i, f_i) is defined to equal φ^ef_{orth}(e_i, f_i) for submissions by user 32, and defined to be 0 for submissions by other users.
Thus, with 79 users in our dataset, we learned 6 × 80 feature weights: a local weight vector for each user and a global vector of "backoff" weights. The global weight θ^ef_{orth} is large if users in general reward orthographic similarity, while θ^ef_{orth,32} (which may be positive or negative) captures the degree to which user 32 rewards it more or less than is typical. The user-specific features are intended to capture individual differences in incidental comprehension.
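The augmentation itself is a one-liner over feature dictionaries; the sketch below illustrates the idea with hypothetical feature names.

```python
def augment(features, user_id):
    """Daume-III-style feature augmentation.

    Each global feature gets a user-specific copy that is non-zero
    only on this user's submissions, so the model learns a shared
    "backoff" weight plus a per-user offset for every feature.
    """
    out = dict(features)  # shared "backoff" copies
    for name, value in features.items():
        out[f"{name}:user={user_id}"] = value
    return out
```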
3.2.4 Inference
According to our model, the probability that the user guesses E_i = ê_i is given by a marginal probability from the CRF. Computing these marginals is a combinatorial optimization problem that involves reasoning jointly about the possible values of each E_i (i ∉ Obs), which range over the English vocabulary V^e.

We employ loopy belief propagation (Murphy, Weiss, and Jordan, 1999) to obtain approximate marginals over the variables E. A tree-based schedule for message passing was used (Dreyer and Eisner, 2009). We run 3 iterations with a new random root for each iteration.14
We define the vocabulary V^e to consist of all reference translations e_i* and normalized user guesses ê_i from our entire dataset (see Section 3.2.1.5), about 5K types altogether including <BLANK> and <COPY>. We define the cognate features to treat <BLANK> as the empty string and to treat <COPY> as f_i. We define the PMI of these special symbols with any e to be the mean PMI with e of all dictionary words, so that they are essentially uninformative.

14 Remark: In the non-loopy case (which arises for us in cases with ≤ 2 unobserved variables), this schedule boils down to the forward-backward algorithm. In this case, a single iteration is sufficient for exact inference.
3.2.5 Parameter Estimation
We learn our parameter vector θ to approximately maximize the regularized log-likelihood of the users' guesses:

$$\sum \log P_\theta(E = \hat{e} \mid E_{\mathrm{Obs}} = e^*_{\mathrm{Obs}}, f, \mathrm{history}) - \lambda \|\theta\|^2 \qquad (3.12)$$

where the summation is over all submissions in our dataset. The gradient of each summand reduces to a difference between observed and expected values of the feature vector φ = (φ^ef, φ^ee), summed over all factors in (3.5). The observed features are computed directly by setting E = ê. The expected features (which arise from the log of the normalization constant of (3.5)) are computed approximately by loopy belief propagation.
We trained θ using stochastic gradient descent (SGD),15 with a learning rate of 0.1 and regularization parameter of 0.2. The regularization parameter was tuned on our development set.
3.2.6 Experimental Results
We divided our data randomly into 5550 training instances, 1903 development instances, and 1939 test instances. Each instance was a single submission from one user, consisting of a batch of "simultaneous" guesses on a macaronic sentence.

15 To speed up training, SGD was parallelized using Recht et al.'s (2011) Hogwild! algorithm. We trained for 8 epochs.
We noted qualitatively that when a large number of English words have been revealed, particularly content words, the users tend to make better guesses. Conversely, when most context is German, we unsurprisingly see the user leave many guesses blank and make other guesses based on string similarity triggers. Such submissions are difficult to predict as different users will come up with a wide variety of guesses; our model therefore resorts to predicting similar-sounding words. For detailed examples of this, see Section 3.2.6.3.
For each foreign word f_i in a submission with i ∉ Obs, our inference method (Section 3.2.4) predicts a marginal probability distribution over a user's guesses ê_i. Table 3.5 shows our ability to predict user guesses.16 Recall that this task is essentially a structured prediction task that does joint 4919-way classification of each German word. Roughly 1/3 of the time, our model's top 25 words include the user's exact guess.

However, the recall reported in Table 3.5 is too stringent for our educational application. We could give the model partial credit for predicting a synonym of the learner's guess ê. More precisely, we would like to give the model partial credit for predicting when the learner will make a poor guess of the truth e*, even if the model does not predict the user's specific incorrect guess ê.

To get at this question, we use English word embeddings (as in Section 3.2.1.4) as a proxy for the semantics and morphology of the words. We measure the actual quality of the

16 Throughout this section, we ignore the 5.2% of tokens on which the user did not guess (i.e., the guess was <BLANK> after the normalization of Section 3.2.1.5). Our present model simply treats <BLANK> as an ordinary and very bland word (Section 3.2.4), rather than truly attempting to predict when the user will not guess. Indeed, the model's posterior probability of <BLANK> in these cases is a paltry 0.0000267 on average (versus 0.0000106 when the user does guess). See Section 3.2.7.
                Recall at k (dev)        Recall at k (test)
Model           1      25     50         1      25     50
Basic           15.24  34.26  38.08      16.14  35.56  40.30
User-Adapted    15.33  34.40  38.67      16.45  35.71  40.57

Table 3.5: Percentage of foreign words for which the user's actual guess appears in our top-k list of predictions, for models with and without user-specific features (k ∈ {1, 25, 50}).
learner's guess ê as its cosine similarity to the truth, sim(ê, e*). While quality of 1 is an exact match, and quality scores > 0.75 are consistently good matches, we found quality of ≈ 0.6 also reasonable. Pairs such as (mosque, islamic) and (politics, government) are examples from the collected data with quality ≈ 0.6. As quality becomes < 0.4, however, the relationship becomes tenuous, e.g., (refugee, soil).
Similarly, we measure the predicted quality as sim(e, e*), where e is the model's 1-best prediction of the user's guess. Figure 3.8 plots predicted vs. actual quality (each point represents one of the learner's guesses on development data), obtaining a correlation of 0.38, which we call the "quality correlation" or QC. A clear diagonal band can be seen, corresponding to the instances where the model exactly predicts the user's guess. The cloud around the diagonal is formed by instances where the model's prediction was not identical to the user's guess but had similar quality.
Figure 3.8: Actual quality sim(ê, e*) of the learner's guess ê on development data, versus predicted quality sim(e, e*) where e is the basic model's 1-best prediction.

We also consider the expected predicted quality, averaging over the model's predictions e of ê (for all e ∈ V^e) in proportion to the probabilities that it assigns them. This allows the model to more smoothly assess whether the learner is likely to make a high-quality guess. Figure 3.9 shows this version, where the points tend to shift upward and the quality correlation (QC) rises to 0.53.
All QC values are given in Table 3.6. We used expected QC on the development set as the criterion for selecting the regularization coefficient λ and as the early stopping criterion during training.
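Both QC variants come down to a Pearson correlation between actual and predicted qualities. The sketch below is our reconstruction of the expected-quality version, with toy dictionaries standing in for the GloVe similarities and the CRF posterior.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def expected_quality(posterior, sim_to_ref):
    """E[sim(e, e*)] under the model's posterior over guesses e.

    `posterior` maps candidate words to probabilities; `sim_to_ref`
    maps a candidate to its similarity with the reference e*.
    """
    return sum(p * sim_to_ref[e] for e, p in posterior.items())
```

The expected QC is then `pearson` computed over all development tokens, pairing each token's `expected_quality` with the actual quality sim(ê, e*).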
3.2.6.1 Feature Ablation
To test the usefulness of different features, we trained our model with various feature categories disabled. To speed up experimentation, we sampled 1000 instances from the training set and trained our model on those. The resulting QC values on dev data are shown in Table 3.7. We see that removing history-based features has the most significant impact on model performance: both QC measures drop relative to the full model. For cognate and context features, we see no significant impact on the expected QC, but a significant drop in the 1-best QC, especially for context features.

Figure 3.9: Actual quality sim(ê, e*) of the learner's guess ê on development data, versus the expectation of the predicted quality sim(e, e*) where e is distributed according to the basic model's posterior.
3.2.6.2 Analysis of User Adaptation
Table 3.6 shows that the user-specific features significantly improve the 1-best QC of our model, although the much smaller improvement in expected QC is insignificant.

User adaptation allows us to discern different styles of incidental comprehension. A user-adapted model makes fine-grained predictions that could help to construct better macaronic sentences for a given user. Each user who completed at least 10 HITs has their user-specific
                Dev               Test
Model           Exp     1-Best    Exp     1-Best
Basic           0.525   0.379     0.543   0.411
User-Adapted    0.527   0.427     0.544   0.439

Table 3.6: Quality correlations: basic and user-adapted models.

weight vector shown as a row in Figure 3.10. Recall that the user-specific weights are not used in isolation, but are added to backoff weights shared by all users.
These user-specific weight vectors cluster into four groups. Furthermore, the average points per HIT differ by cluster (significantly between each cluster pair), reflecting the success of different strategies.17 Users in group (a) employ a generalist strategy for incidental comprehension. They pay typical or greater-than-typical attention to all features of the current HIT, but many of them have diminished memory for vocabulary learned during past HITs (the hist+ feature). Users in group (b) seem to use the opposite strategy, deriving their success from retaining common vocabulary across HITs (hist+) and falling back on orthography for new words. Group (c) users, who earned the most points per HIT, appear to make heavy use of context and pronunciation features together with hist+. We also see that pronunciation similarity seems to be a stronger feature for group (c) users, in contrast to the more superficial orthographic similarity. Group (d), which earned the fewest points per HIT,

17 Recall that in our data collection process, we award points for each HIT (Section 3.2.1.4). While the points were designed more as a reward than as an evaluation of learner success, a higher score does reflect more guesses that were correct or close, while a lower score indicates that some words were never guessed before the system revealed them as clues.
                   QC
Feature Removed    Expected   1-Best
None               0.522      0.425
Cognate            0.516      0.366*
Context            0.510      0.366*
History            0.499*     0.259*

Table 3.7: Impact on quality correlation (QC) of removing features from the model. Ablated QC values marked with an asterisk * differ significantly from the full-model QC values in the first row (p < 0.05, using the test of Preacher (2002)).
appears to be an "extreme" version of group (b): these users pay unusually little attention to any model features other than orthographic similarity and hist+. (More precisely, the model finds group (d)'s guesses harder to predict on the basis of the available features, and so gives a more uniform distribution over V^e.)
3.2.6.3 Example of Learner Guesses vs. Model Predictions
To give a sense of the problem difficulty, we have hand-picked and presented two training examples (submissions) along with the predictions of our basic model and their log-probabilities. In Figure 3.11a a large portion of the sentence has been revealed to the user in English (blue text); only 2 words are in German. The text in bold font is the user's guess. Our model expected both words to be guessed; the predictions are listed below the German words Verschiedene and Regierungen. The reference translations for the 2 words are Various and governments. In Figure 3.11b we see a much harder context where only one word is shown in English and this word is not particularly helpful as a contextual anchor.

Figure 3.10: The user-specific weight vectors, clustered into groups. Average points per HIT for the HITs completed by each group: (a) 45, (b) 48, (c) 50 and (d) 42.
3.2.7 Future Improvements to the Model
Our model's feature set (Section 3.2.3) could clearly be refined and extended. Indeed, in the previous section, we used a more tightly controlled experimental design to explore some simple feature variants. A cheap way to vet features would be to test whether they help on the task of modeling reference translations, which are more plentiful and less noisy than the user guesses.
For Cognate features, there exist many other good string similarity metrics (including trainable ones). We could also include φ^ef features that consider whether e_i's part of speech, frequency, and length are plausible given f_i's burstiness, observed frequency, and length. (E.g., only short common words are plausibly translated as determiners.)
For Context features, we could design versions that are more sensitive to the position and status of the context word j. We speculate that the actual influence of e_j on a user's guess e_i is stronger when e_j is observed rather than itself guessed; when there are fewer intervening tokens (and particularly fewer observed ones); and when j < i. Orthogonally, φ^ee(e_i, e_j) could go beyond PMI and windowed PMI to also consider cosine similarity, as well as variants of these metrics that are thresholded or nonlinearly transformed. Finally, we do not have to treat the context positions j as independent multiplicative influences as in equation (3.5) (cf. Naive Bayes): we could instead use a topic model or some form of language model to determine a conditional probability distribution over E_i given all other words in the context.
An obvious gap in our current feature set is that we have no φ^e features to capture that some words e_i ∈ V^e are more likely guesses a priori. By defining several versions of this feature, based on frequencies in corpora of different reading levels, we could learn user-specific weights modeling which users are unlikely to think of an obscure word. We should also include features that fire specifically on the reference translation e_i* and the special symbols <BLANK> and <COPY>, as each is much more likely than the other features would suggest.
For History features, we could consider negative feedback from other HITs (not just the current HIT), as well as positive information provided by revealed clues (not just confirmed guesses). We could also devise non-binary versions in which more recent or more frequent feedback on a word has a stronger effect. More ambitiously, we could model generalization: after being shown that Kind means child, a learner might increase the probability that the similar word Kinder means child or something related (children, childish, ...), whether because of superficial orthographic similarity or a deeper understanding of the morphology. Similarly, a learner might gradually acquire a model of typical spelling changes in English-German cognate pairs.
A more significant extension would be to model a user's learning process. Instead of representing each user by a small vector of user-specific weights, we could recognize that the user's guessing strategy and knowledge can change over time.
A serious deficiency in our current model is that we treat <BLANK> like any other word. A more attractive approach would be to learn a stochastic link from the posterior distribution to the user's guess or non-guess, instead of assuming that the user simply samples the guess from the posterior. As a simple example, we might say the user guesses e ∈ V^e with probability p(e)^β, where p(e) is the posterior probability and β > 1 is a learned parameter, with the remaining probability assigned to <BLANK>. This says that the user tends to avoid guessing except when there are relatively high-posterior-probability words to guess.
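That stochastic link is simple enough to sketch directly (an illustration only; `beta` would be fit alongside the other parameters):

```python
def guess_distribution(posterior, beta=2.0):
    """Map a posterior p(e) to a guess distribution p(e)**beta,
    assigning the leftover mass to <BLANK>.  With beta > 1 the sum of
    p(e)**beta is at most 1, and a flat posterior pushes most of the
    mass onto <BLANK>, i.e., the user declines to guess.
    """
    dist = {e: p ** beta for e, p in posterior.items()}
    dist["<BLANK>"] = 1.0 - sum(dist.values())
    return dist
```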
Finally, newer representation learning models such as BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) could also be used in our model. XLNet in particular, with its ability to account for interdependencies between output tokens, would be a good candidate to provide rich contextual features to either replace or be used along with the weaker pairwise PMI-based features of our current model.
3.2.8 Conclusion
We have presented a methodology for collecting data and training a model to estimate a foreign language learner's understanding of L2 vocabulary in partially understood contexts. Both are novel contributions to the study of L2 acquisition.
Our current model is arguably crude, with only 6 features, yet it can already often do a reasonable job of predicting what a user might guess and whether the user's guess will be roughly correct. This opens the door to a number of future directions in language acquisition, such as personalizing content to a learner's knowledge.
We leave as future work the integration of this model into an adaptive system that tracks learner understanding and creates scaffolded content that falls in the learner's zone of proximal development, keeping them engaged while stretching their understanding.
Figure 3.11: Two examples of the system's predictions of what the user will guess on a single submission, contrasted with the user's actual guess. (The user's previous submissions on the same task instance are not shown.) In 3.11a, the model correctly expects that the substantial context will inform the user's guess. In 3.11b, the model predicts that the user will fall back on string similarity, although we can see that the user's actual guess of "and day" was likely informed by their guess of "night", an influence that our CRF did consider. The numbers shown are log-probabilities. Both examples show the sentences in a macaronic state (after some reordering or translation has occurred). For example, the original text of the German sentence in 3.11b reads Deshalb durften die Paare nur noch ein Kind bekommen. The macaronic version has undergone some reordering, and has also erroneously dropped the verb due to an incorrect alignment.
Chapter 4

Creating Interactive Macaronic Interfaces for Language Learning
In the previous chapter, we presented models for incidental learning. We hope to generate macaronic text by consulting such models. Recall that the AI teacher's goal is to generate comprehensible macaronic texts for a student to read. Given a macaronic data structure associated with a piece of text, the AI teacher must render a macaronic configuration that it believes the student will understand (and learn from). But what if the AI teacher makes a sentence that is too difficult for the student to read? Or too easy, with very few L2 words?
For such cases, we would like to give "control" back to the student and let them interactively modify the macaronic sentence. If the sentence is too difficult, we would like the student not to get completely stuck, so we give them the chance to ask for hints. On the flip side, if a student feels there is not enough L2 content in a macaronic
sentence, we want to let them interact with the data structure and explore the macaronic spectrum.
To provide these features to the student, we design a user interface in which such modifications are possible. We present the details of our user interface along with interaction modalities in this chapter.
We provide details of the current user interface and discuss how content for our system can be automatically generated using existing statistical machine translation (SMT) methods, enabling learners or teachers to choose their own texts to read. Our interface lets the user navigate through the spectrum from L2 to L1, going beyond the single-word or single-phrase translations offered by other online tools such as Swych (2015), or dictionary-like browser plugins.
Finally, we note that the interaction design could include logging all of the actions a student makes while reading a text. We can then use the logged actions to refine our incidental learning model to hopefully produce macaronic text that is more personalized to the student's L2 level. We leave this for future work.
4.1 Macaronic Interface
To illustrate the workings of our interface, we assume a native English speaker (L1 = English) who is learning German (L2 = German). However, our existing interface can accommodate any pair of languages whose writing systems share directionality.1 The primary goal of the interface is to empower a learner to translate and reorder parts of a confusing foreign
1 We also assume that the text is segmented into words.
language sentence. These translations and reorderings serve to make the German sentence more English-like. The interface also permits reverse transformations, letting the curious learner "peek ahead" at how specific English words and constructions would surface in German.
Using these fundamental interactions as building blocks, we create an interactive framework for a language learner to explore this continuum of "English-like" to "foreign-like" sentences. By repeated interaction with new content and exposure to recurring vocabulary items and linguistic patterns, we believe a learner can pick up vocabulary and other linguistic rules of the foreign language.
4.1.1 Translation
The basic interface idea is that a line of macaronic text is equipped with hidden interlinear annotations. Notionally, English translations lurk below the macaronic text, and German ones above.
The Translation interaction allows the learner to change the text in the macaronic sentence from one language to another. Consider a macaronic sentence that is completely in the foreign state (i.e., entirely in German), as shown in Fig. 4.1a. Hovering on or under a German word shows a preview of a translation (Fig. 4.1b). Clicking on the preview will cause the translation to "rise up" and replace the German word (Fig. 4.1c).
To translate in the reverse direction, the user can hover and click above an English word (Fig. 4.1d).
Since the same mechanism applies to all the words in the sentence, a learner can manipulate translations for each word independently. For example, Fig. 4.1e shows two words in English.
(a) Initial sentence state.
(b) Mouse hovered under Preis.
(c) Preis translated to prize.
(d) Mouse hovered above prize. Clicking above will revert the sentence back to the initial state 4.1a.
(e) Sentence with 2 different words translated into English.
Figure 4.1: Actions that translate words.
The version of our prototype displayed in Figure 4.1 blurs the preview tokens when a learner is hovering above or below a word. This blurred preview acts as a visual indication of a potential change to the sentence state (if clicked), but it also gives the learner a chance
to think about what the translation might be, based on visual clues such as the length and shape of the blurred text.
4.1.2 Reordering
When the learner hovers slightly below the words nach Georg Büchner, a Reordering arrow is displayed (as shown in Figure 4.2). The arrow is an indicator of reordering. In this example, the German past participle benannt appears at the end of the sentence (the conjugated form of the verb is ist benannt, or is named); this is the grammatically correct location for the participle in German, while the English form should appear earlier in the equivalent English sentence.
Similar to the translation actions, reordering actions also have a directional attribute. Figure 4.2b shows a German-to-English direction arrow. When the learner clicks the arrow, the interface rearranges all the words involved in the reordering. The new word positions are shown in 4.2c. Once again, the user can undo: hovering just above nach Georg Büchner now shows a gray arrow, which if clicked returns the phrase to its German word order (shown in 4.2d).
German phrases that are not in original German order are highlighted as a warning (Figure 4.2c).
Figure 4.2: Actions that reorder phrases.
4.1.3 “Pop Quiz” Feature
So far, we have described the system's standard responses to a learner's actions. We now add occasional "pop quizzes." When a learner hovers below a German word (s0 in Figure 4.3) and clicks the blurry English text, the system can either reveal the translation of the German word (state s2) as described in section 4.1.1 or quiz the learner (state s3). We implement the quiz by presenting a text input box to the learner: here the learner is expected to type what they believe the German word means. Once a guess is typed, the system indicates if the
Figure 4.3: State diagram of learner interaction (edges) and system's response (vertices). Edges can be traversed by clicking (c), hovering above (a), hovering below (b), or the enter (e) key. Unmarked edges indicate an automatic transition.
guess is correct (s4) or incorrect (s5) by flashing green or red highlights in the text box. The box then disappears (after 700 ms) and the system automatically proceeds to the reveal state s2. As this imposes a high cognitive load and increases the interaction complexity (typing vs. clicking), we intend to use the pop quiz infrequently.
The pop quiz serves two vital functions. First, it further incentivizes the user to retain learned vocabulary. Second, it allows the system to update its model of the user's current L2 lexicon, macaronic comprehension, and learning style; this is work in progress (see section 4.3.2).
4.1.4 Interaction Consistency
Again, we regard the macaronic sentence as a kind of interlinear text, written between two mostly invisible sentences: German above and English below. In general, hovering above the macaronic sentence will reveal German words or word orders, which fall down into the macaronic sentence upon clicking. Hovering below will reveal English translations, which rise up upon clicking.
The words in the macaronic sentence are colored according to their language. We want the user to become accustomed to reading German, so the German words are in plain black text by default, while the English words use a marked color and font (italic blue). Reordering arrows also follow the same color scheme: arrows that will make the macaronic sentence more "German-like" are gray, while arrows that make the sentence more "English-like" are blue. A summary of interactions is shown in Table 4.1.
Action | Direction | Trigger | Preview | Preview Color | Confirm | Result
Translation | E-to-G | Hover above English token | Blurry German translation above | Gray blur | Click on blurry text | Translation replaces English word(s)
Translation | G-to-E | Hover under German token | Blurry English translation below | Blue blur | Click on blurry text | Translation replaces German word(s)
Reordering | E-to-G | Hover above token | Arrow above reordering tokens | Gray arrow | Click on arrow | Tokens reorder
Reordering | G-to-E | Hover under token | Arrow below reordering tokens | Blue arrow | Click on arrow | Tokens reorder
Table 4.1: Summary of learner-triggered interactions in the Macaronic Interface.
4.2 Constructing Macaronic Translations
In this section, we describe the details of the underlying data structures needed to allow all the interactions mentioned in the previous section. A key requirement in the design of the data structure was to support orthogonal actions in each sentence. Making all translation and reordering actions independent of one another creates a large space of macaronic states for a learner to explore.
At present, the input to our macaronic interface is bitext with word-to-word alignments provided by a phrase-based SMT system (or, if desired, by hand). We employ Moses (Koehn et al., 2007) to translate German sentences and generate phrase alignments. News articles written in simple German from nachrichtenleicht.de (Deutschlandfunk, 2016) were translated after training the SMT system on the WMT15 German-English corpus (Bojar et al., 2015).
We convert the word alignments into "minimal alignments" that are either one-to-one, one-to-many or many-to-one. For each many-to-many alignment returned by the SMT system, we remove alignment edges (lowest probability first) until the alignment is no longer many-to-many. Then we greedily add edges from unaligned tokens (highest probability first), subject to not creating many-to-many alignments and subject to minimizing the number of crossing edges, until all tokens are aligned. This step ensures consistent reversibility of actions and prevents large phrases from being translated with a single click.2 The resulting
2 Preliminary experiments showed that allowing large phrases to translate with one click resulted in abrupt jumps in the visualization, which users found hard to follow.
bipartite graph can be regarded as a collection of connected components, or units (Fig. 4.4).3
Figure 4.4: The dotted lines show word-to-word alignments between the German sentence f0, f1, ..., f7 and its English translation e0, e1, ..., e6. The figure highlights 3 of the 7 units: u2, u3, u4.
4.2.1 Translation Mechanism
In a given state of the macaronic sentence, each unit is displayed in either English or German. A translation action toggles the display language of the unit, leaving it in place. For example, in Figure 4.5, where the macaronic sentence is currently displaying f4 f5 = noch einen, a translation action will replace this with e4 = a.
3 In the sections below, we gloss over cases where a unit is discontiguous (in one language). Such units are handled specially (we omit details for reasons of space). If a unit would fall outside the bounds of what our special handling can handle, we fuse it with another unit.
Figure 4.5: A possible state of the sentence, which renders a subset of the tokens (shown in black). The rendering order (section 4.2.2) is not shown but is also part of the state. The string displayed in this case is "Und danach they run noch einen Marathon." (assuming no reordering).
4.2.2 Reordering Mechanism
A reordering action changes the unit order of the current macaronic sentence. The output string "Und danach they run noch einen Marathon." is obtained from Figure 4.5 only if unit u2 (as labeled in Figure 4.4) is rendered (in its current language) to the left of unit u3, which we write as u2 < u3. In this case, it is possible for the user to change the order of these units, because u3 < u2 in German. Table 4.2 shows the 8 possible combinations of ordering and translation choices for this pair of units.
String | Rendered Unit Ordering
...they run... / ...they laufen... / ...sie run... / ...sie laufen... | {u2} < {u3}
...run they... / ...run sie... / ...laufen they... / ...laufen sie... | {u2} > {u3}
Table 4.2: Generating reordered strings using units.
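The eight renderings in Table 4.2 arise mechanically from two binary translation choices and one binary ordering choice. A minimal sketch, with the unit contents hard-coded to match the table:

```python
from itertools import product

# Two adjacent units from Figure 4.4, each with an English and a German form.
u2 = {"en": "they", "de": "sie"}
u3 = {"en": "run", "de": "laufen"}

def renderings(a, b):
    """Enumerate all (language of a) x (language of b) x (unit order) strings.
    Order '<' keeps a before b; '>' swaps them."""
    out = []
    for lang_a, lang_b, order in product(("en", "de"), ("en", "de"), ("<", ">")):
        pair = (a[lang_a], b[lang_b]) if order == "<" else (b[lang_b], a[lang_a])
        out.append(" ".join(pair))
    return out

strings = renderings(u2, u3)
```

All eight strings of Table 4.2 come out distinct, illustrating why keeping the actions orthogonal multiplies the number of reachable macaronic states.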
The space of possible orderings for a sentence pair is defined by a bracketing ITG tree (Wu, 1997), which transforms the German ordering of the units into the English ordering by a collection of nested binary swaps of subsequences.4 The ordering state of the macaronic sentence is given by the subset of these swaps that have been performed. A reordering action toggles one of the swaps in this collection.
Since we have a parser for German (Rafferty and Manning, 2008), we take care to select an ITG tree that is "compatible" with the German sentence's dependency structure, in the following sense: if the ITG tree combines two spans A and B, then there are no dependencies from words in A to words in B and vice versa.
4 Occasionally no such ITG tree exists, in which case we fuse units as needed until one does.
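A bracketing ITG tree with toggleable swaps can be sketched as follows; the toy tree below is invented for illustration and is not induced from a real sentence pair or dependency parse.

```python
# A bracketing ITG tree over units: leaves are unit ids, internal nodes are
# (swap_id, left, right). The rendered unit order is an in-order traversal,
# visiting children right-to-left wherever the node's swap has been toggled.

def render_order(node, swaps):
    """Return the unit order implied by the subset `swaps` of performed swaps."""
    if isinstance(node, int):
        return [node]
    swap_id, left, right = node
    l, r = render_order(left, swaps), render_order(right, swaps)
    return r + l if swap_id in swaps else l + r

tree = ("A", ("B", 0, 1), 2)              # German-order units 0, 1, 2
german = render_order(tree, set())        # no swaps performed
english = render_order(tree, {"A", "B"})  # all swaps performed
partial = render_order(tree, {"B"})       # one swap toggled independently
```

Each reordering action toggles one swap id in the set, so intermediate orders between the German and English extremes are reachable one click at a time.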
Figure 4.6: Figure 4.6a shows a simple discontiguous unit. Figure 4.6b shows a long-distance discontiguity which is supported. In Figure 4.6c the interruptions align to both sides of e3, which is not supported. In situations like 4.6c, all associated units are merged as one phrasal unit (shaded) as shown in Figure 4.6d.
4.2.3 Special Handling of Discontiguous Units
We provide limited support for alignments which form discontiguous units. Figure 4.6a shows a simple discontiguous unit. In this example, a reordering action (G-to-E direction) performed on either f2 or f4 will move f2 to the immediate left of f4, eliminating the interrupting alignment. After reordering, the translation action becomes available to the learner, just as in a multi-word contiguous unit. The system currently supports one or more interrupting units as long as these units are contiguous and are from only one side of the single token (see Figures 4.6a and 4.6b). If the conditions for special handling are not
satisfied (see Figure 4.6c), the system forces all the tokens into a single unit, which results in a phrasal alignment and is treated as a single unit (Figure 4.6d). Such units have no reordering actions and result in a phrasal translation. We also employ this "back off" phrasal alignment in cases where alignments do not satisfy the ITG constraint.
4.3 Discussion
4.3.1 Machine Translation Challenges
When the English version of the sentence is produced by an MT system, it may suffer from MT errors and/or poor alignments.
Even with correct MT, a given syntactic construction may be handled inconsistently on different occasions, depending on the particular words involved (as these affect what phrasal alignment is found and how we convert it to a minimal alignment). Syntax-based MT could be used to design a more consistent interface that is also more closely tied to classroom L2 lessons.
Cross-linguistic divergences in the expression of information (Dorr, 1994) could be confusing. For example, when moving through macaronic space from Kaffee gefällt Menschen (coffee pleases humans) to its translation humans like coffee, it may not be clear to the learner that the reordering is triggered by the fact that like is not a literal translation of gefällt. One way to improve this might be to have the system pass smoothly
through a range of intermediate translations from word-by-word glosses to idiomatic phrasal translations, rather than always directly translating idioms. Concretely, we can first transform Kaffee gefällt Menschen into Kaffee gefällt humans, then into Kaffee pleases humans, and finally into coffee pleases humans. These transitions could be done via manual rules. Once all tokens of the German phrase are in English, the final transition would render the phrase in "correct" English: humans like coffee. We might also see benefit in guiding our gradual translations with cognates (for example, rather than translate directly from the German Möhre to the English carrot, we might offer the cognate Karotte as an intermediate step).
Another avenue of research is to transition through words that are macaronic at the sub-word level. For example, hovering over the unfamiliar German word gesprochen might decompose it into ge-sprochen; then clicking on one of those morphemes might yield ge-talk or sprech-ed before reaching talked. This could guide learners towards an understanding of German tense marking and stem changes. Generation of these sub-word macaronic forms could be done using multilingually trained morphological reinflection systems such as Kann, Cotterell, and Schütze (2017).
4.3.2 User Adaptation and Evaluation
We would prefer to show the learner a macaronic sentence that provides just enough clues for the learner to be able to comprehend it, while still pushing them to figure out new vocabulary or new structures. Thus, we plan to situate this interface in a framework that
continuously adapts as the user progresses. As the user learns new vocabulary, the system will automatically present them with more challenging sentences (containing less L1). In Chapter 3 we showed that we can predict a novice learner's guesses of L2 word meanings in macaronic sentences using a few simple features. We will subsequently track the user's learning by observing their mouse actions and "pop quiz" responses (section 4.1).
While we have had users interact with our system in order to collect data about novice learners' guesses, we are working toward an evaluation where our system is used to supplement classroom instruction for real foreign-language students.
4.4 Conclusion
In this work we present a prototype of an interactive interface for learning to read in a foreign language. We expose the learner to L2 vocabulary and constructions in contexts that are comprehensible because they have been partially translated into the learner's native language, using statistical MT. Using MT affords flexibility: learners or instructors can choose which texts to read, and learners or the system can control which parts of a sentence are translated.
In the long term, we would like to extend the approach to allow users also to produce macaronic language, drawing on techniques from grammatical error correction or computer-aided translation to help them gradually remove L1 features from their writing (or speech) and make it more L2-like. We leave this for future work.
Chapter 5

Construction of Macaronic Texts for Vocabulary Learning
5.1 Introduction
In the previous chapters, we presented an interactive interface to read macaronic sentences and a model that predicts a student's guessing abilities, which used information from the L1 and L2 context as well as cognate information as input features. To train this model we require supervised data, meaning data on student behaviors and capabilities (Renduchintala et al., 2016b; Labutov and Lipson, 2014). The data collection for supervised data involves prompting students (in our experiments we used MTurk users) with macaronic sentences created randomly (or with some heuristic) and then asking the MTurk "students" to guess the meanings of L2 words in these sentences. The random macaronic sentences paired
with student guesses form the training data. This step is expensive, not only from a data
collection point of view, but also from the point of view of students, as they would have to give feedback (i.e. generate labeled data) on the actions of an initially untrained machine teacher.
In this chapter, we show that it is possible to design a machine teacher without any supervised data from (human) students. We use a neural cloze language model instead of the weaker conditional random field used earlier. We also propose a method to allow our cloze language model to incrementally learn new vocabulary items, and use this language model as a proxy for the word guessing and learning ability of real students. A machine foreign-language teacher decides which subset of words to replace by consulting this cloze language model. The cloze language model is initially trained on a corpus of L1 texts and is therefore not personalized to a (human) student. Despite this, we show that a machine foreign-language teacher can generate pedagogically useful macaronic texts after consulting with the cloze language model. Since we are essentially using a cloze language model as a "drop-in" replacement for a true user model, we refer to the cloze language model as a generic student model.
We evaluate three variants of our generic student language models through a study on Amazon Mechanical Turk (MTurk). We find that MTurk "students" were able to guess the meanings of L2 words (in context) introduced by the machine teacher with high accuracy for both function words as well as content words in two out of the three models. Furthermore, we select the best performing variant and evaluate whether participants can actually learn the L2 words by letting participants read a macaronic passage and giving an L2 vocabulary quiz at
Sentence        The Nile is a river in Africa
Gloss           Der Nil ist ein Fluss in Afrika
Macaronic       Der Nile ist a river in Africa
Configurations  The Nile is a Fluss in Africa
                Der Nil ist ein river in Africa
Table 5.1: An example English (L1) sentence with German (L2) glosses. Using the glosses, many possible macaronic configurations are possible. Note that the gloss sequence is not a fluent L2 sentence.
the end of the passage, where the L2 words are presented without their sentential context.
5.1.1 Limitation
While we gain the ability to construct macaronic texts for students without any prior data collection, we limit ourselves to lexical replacements only. This limitation arises because our proposed method to evaluate the knowledge of the generic student model compares lexical word embeddings and is therefore unable to measure other linguistic knowledge such as word order. This is a key limitation of the work proposed in this chapter.
5.2 Method
Our machine teacher can be viewed as a search algorithm that tries to find the (approximately) best macaronic configuration for the next sentence in a given L1 document. We assume the availability of a "gold" L2 gloss for each L1 word: in our experiments, we obtained these from bilingual speakers using Mechanical Turk. Table 5.1 shows an example English sentence with German glosses and three possible macaronic configurations (there are exponentially many configurations). The machine teacher must assess, for example, how accurately a student would understand the meanings of Der, ist, ein, and Fluss when presented with the following candidate macaronic configuration: Der Nile ist ein Fluss in Africa.1 Understanding may arise from inference on this sentence as well as whatever the student has learned about these words from previous sentences.
The teacher makes this assessment by presenting this sentence to a generic student model (§§5.2.1–5.2.2). It uses an L2 embedding scoring scheme (§5.2.3) to guide a greedy search for the best macaronic configuration (§5.2.4).
5.2.1 Generic Student Model
Our model of a "generic student" (GSM) is equipped with a cloze language model that uses a bidirectional LSTM to predict L1 words in L1 context (Mousa and Schuller, 2017; Hochreiter and Schmidhuber, 1997). Given a sentence x = [x_1, ..., x_t, ..., x_T], the cloze model defines p(x_t | h_t^f, h_t^b) for all t ∈ {1, ..., T}, where

    h_t^f = LSTM([x_1, ..., x_{t-1}]; θ^f) ∈ R^D    (5.1)
    h_t^b = LSTM([x_T, ..., x_{t+1}]; θ^b) ∈ R^D    (5.2)

are hidden states of forward and backward LSTM encoders parameterized by θ^f and θ^b respectively. The model assumes a fixed L1 vocabulary of size V, and the vectors x_t above are embeddings of these word types, which correspond to the rows of an embedding matrix E ∈ R^{V×D}. The cloze distribution at each position t in the sentence is obtained using

    p(· | h^f, h^b) = softmax(E h([h^f; h^b]; θ^h))    (5.3)

where h(·; θ^h) is a projection function that reduces the dimension of the concatenated hidden states from 2D to D. We "tie" the input embeddings and output embeddings as in Press and Wolf (2017).
We train the parameters θ = [θ^f; θ^b; θ^h; E] using Adam (Kingma and Ba, 2014) to maximize Σ_x L(x), where the summation is over sentences x in a large L1 training corpus, and

    L(x) = Σ_t log p(x_t | h_t^f, h_t^b)    (5.4)

We set the dimensionality of word embeddings and LSTM hidden units to 300. We use the WikiText-103 corpus (Merity et al., 2016) as the L1 training corpus. We apply dropout (p = 0.2) between the word embeddings and LSTM layers, and between the LSTM and
1 By "meaning" we mean the L1 token that was originally in the sentence before it was replaced by an L2 gloss.
projection layers (Srivastava et al., 2014). We assume that the resulting model represents the entirety of the student's L1 knowledge.
5.2.2 Incremental L2 Vocabulary Learning
The model so far can assign probabilityto an L1 sentence such as T h e N i l e i s a
river in Africa , using equation(5.4), but what about a macaronicsentencesuch as
Der Nile ist ein Fluss in Africa ? Toacco m modatethe new L2 words, we
use another word-e mbedding matrix, F ∈ R V ′ × D and modify Eq 5.3to consider boththe L1
and L2 e mbeddings:
p (· | [h f : h b ]) =soft max([E ;F ]·h([h f : h b ]; θ h ))
We also restrictthe soft max function aboveto produce a distribution not overthe full
bilingual vocabulary of size |V | + |V ′|, but only overthe bilingual vocabulary consisting
of t h e V L1typestogether with onlythe v ′ ⊂ V ′ L2typesthat actually appearinthe
macaronic sentence x .Inthe above exa mple macaronicsentence, |v ′| is 4 . We i niti ali z e F
by dra wingits ele mentsII Dfro m Unifor m [− 0.01,0.01] . Thus, all L2 wordsinitially have
rando m e mbeddings[− 0.01,0.01] 1 × D .
These modificationslets us co mpute L (x ) for a macaronic sentence x . Weassu methat when a hu manstudentreads a macaronicsentence x ,they updatetheir L2 para meters F ( b ut
nottheir L1 para meters θ )toincrease L (x ). Specifically, we assu methat F will b e u p d at e d
to maximize

    L(x; θ^f, θ^b, θ^h, E, F) − λ‖F − F_prev‖²    (5.5)

Maximizing equation (5.5) adjusts the embeddings of each L2 word in the sentence so that it is more easily predicted from the other L1/L2 words, and also so that it is more helpful at predicting the other L1/L2 words. Since the rest of the model's parameters do not change, we expect to find an embedding for Fluss that is similar to the embedding for river. However, the regularization term with coefficient λ > 0 prevents F from straying too far from F_prev, which represents the value of F before this sentence was read. This limits the degree to which our simulated student will change their embedding of an L2 word such as Fluss based on a single example. As a result, the embedding of Fluss reflects all of the past sentences that contained Fluss, although (realistically) with some bias toward the most recent such sentences. We do not currently model spacing effects, i.e., forgetting due to the passage of time.
In principle, λ should be set based on human-subjects experiments, and might differ from human to human. In practice, in this paper, we simply took λ = 1. We (approximately) maximized the objective above using 5 steps of gradient ascent, which gave good convergence in practice.
5.2.3 Scoring L2 embeddings
The incremental vocabulary learning procedure (§5.2.2) takes a macaronic configuration and generates a new L2 word-embedding matrix by applying gradient updates to a previous version of the L2 word-embedding matrix. The new matrix represents the proxy student's L2 knowledge after observing the macaronic configuration.
Thus, if we can score the new L2 embeddings, we can, in essence, score the macaronic configuration that generated it. The ability to score configurations affords search (§§5.2.4 and 5.2.5) for high-scoring configurations. With this motivation, we design a scoring function to measure the "goodness" of L2 word embeddings, F.
The machine teacher evaluates F with reference to all correct word-gloss pairs from the entire document. For our example sentence, the word pairs are {(The, Der), (is, ist), (a, ein), (river, Fluss)}. But the machine teacher also has access to, for example, {(water, Wasser), (stream, Fluss), ...}, which come from elsewhere in the document.
Thus,if P istheset of word pairs,{(x 1 , f1 ), ...(x | P|, f| P|)}, we co mpute:
̃r p = R (x p , cs (F f p ,E)) ( 5. 6) ⎧ ⎪ ⎪ ⎨ ̃r p if ̃r p < r m a x r = p ⎪ ⎪ ⎩⎪ ∞ ot h er wis e
1 ∑ 1 M R R (F , E , r ) = ( 5. 7) m a x | P| r p p w h er e cs (F f ,E) denotesthe vector of cosine si milarities bet weenthe e mbedding of an L2
9 7 CHAPTER5. MACARONIC TEXT CONSTRUCTION
w or d f andthe entire L1 vocabulary. R (x, cs (E , F f )) queriestherank ofthe correct L1 w or d x t h at p airs wit h f . r cantake valuesfro m 1 t o |V |, but we use arankthreshold r m a x
andforce pairs with arank worsethan r m a x t o ∞ . Thus, given a word-gloss pairing P , t h e
current state ofthe L2 e mbedding matrix F , andthe L1 e mbedding matrix E , we obtainthe
Mean Reciprocal Rank( M R R)scorein(5.21).
We canthink ofthe scoringfunction as a “vocabularytest”in whichthe proxy student
gives (its best) r m a x guessesfor each L2 wordtype andreceives a nu merical grade.
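A direct sketch of the MRR computation in Eqs. 5.6–5.7, using strict rank comparison and treating over-threshold ranks as contributing zero (the reciprocal of infinity):

```python
import numpy as np

def mrr_score(E, F, pairs, r_max=10):
    """For each (L1 index x, L2 index f) pair, rank the correct L1 word among
    all rows of E by cosine similarity to F[f]; ranks at or beyond r_max
    count as infinity and contribute nothing to the mean."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    total = 0.0
    for x, f in pairs:
        sims = En @ (F[f] / np.linalg.norm(F[f]))   # cs(F_f, E)
        rank = 1 + int((sims > sims[x]).sum())      # rank of the correct L1 word
        if rank < r_max:
            total += 1.0 / rank
    return total / len(pairs)
```

With orthonormal toy embeddings, an L2 embedding sitting exactly on its paired L1 row earns rank 1, and one sitting on a wrong row earns a lower reciprocal rank, which is the "vocabulary test" intuition above.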
5.2.4 Macaronic Configuration Search
So far we have detailed our simulated student that would learn from a macaronic sentence, and a metric to measure how good the learned L2 embeddings would be. Now the machine teacher only has to search for the best macaronic configuration of a sentence. As there are exponentially many possible configurations to consider, exhaustive search is infeasible. We use a simple left-to-right greedy search to approximately find the highest scoring configuration for a given sentence. Algorithm 1 shows the pseudo-code for the search process. The inputs to the search algorithm are the initial L2 word-embeddings matrix F_prev, the scoring function MRR(), and the generic student model SPM(). The algorithm proceeds left to right, making a binary decision at each token: Should the token be replaced with its L2 gloss or left as is? For the first token, these two decisions result in the two configurations: (i) Der Nile ... and (ii) The Nile ... These configurations are given to the generic student model, which updates the L2 word embeddings. The scoring function
(section 5.2.3) computes a score for each L2 word-embedding matrix and caches the best configuration (i.e. the configuration associated with the highest scoring L2 word-embedding matrix). If two configurations result in the same MRR score, the number of L2 word types exposed is used to break ties. In Algorithm 1, ρ(c) is the function that counts the number of L2 word types exposed in a configuration c.
5.2.5 Macaronic-Language document creation

Our idea is that a sequence of macaronic configurations is good if it drives the generic student model's L2 embeddings toward an MRR score close to 1 (the maximum possible). Note that we do not change the sentence order (we still want a coherent document), just the macaronic configuration of each sentence. For each sentence in turn, we greedily search over macaronic configurations using Algorithm 1, then choose the configuration that learns the best F, and proceed to the next sentence with F_prev now set to this learned F.² This process is repeated until the end of the document. The pseudo-code for generating an entire document of macaronic content is shown in Algorithm 2.

In summary, our machine teacher is composed of (i) a generic student model, which is a contextual L2 word learning model (§5.2.1 and §5.2.2), and (ii) a configuration sequence search algorithm (§5.2.4 and §5.2.5), which is guided by (iii) an L2 vocabulary scoring function (§5.5.1). In the next section, we describe two variations for the generic student models.

² For the first sentence, we initialize F_prev to have values randomly between [−0.01, 0.01].
Algorithm 1 Mixed-Lang. Config. Search

Require: x = [x_1, x_2, ..., x_T] ▷ L1 tokens
Require: g = [g_1, g_2, ..., g_T] ▷ L2 glosses
Require: E ▷ L1 emb. matrix
Require: F_prev ▷ initial L2 emb. matrix
Require: SPM ▷ Student Proxy Model
Require: MRR, r_max ▷ Scoring Func., threshold

1:  function SEARCH(x, g, E, F_prev)
2:      c ← x ▷ initial configuration is the L1 sentence
3:      F ← F_prev
4:      s ← MRR(E, F, r_max)
5:      for i = 1; i ≤ T; i++ do
6:          c′ ← c_1 ··· c_{i−1} g_i x_{i+1} ··· x_T
7:          F′ ← SPM(F_prev, c′)
8:          s′ ← MRR(E, F′, r_max)
9:          if (s′, −ρ(c′)) ≥ (s, −ρ(c)) then
10:             c ← c′, F ← F′, s ← s′
11:         end if
12:     end for
13:     return c, F ▷ Mixed-Lang. Config.
14: end function
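Algorithm 1 can be sketched in Python as follows. The stub student model and the set-based score below stand in for SPM() and MRR(); the toy learner (which acquires a gloss only when its immediate neighbors stay in L1) is an invented illustration, not the actual model.

```python
def greedy_search(x, g, F_prev, spm, score):
    """Left-to-right greedy search over macaronic configurations (Algorithm 1 sketch).

    x: L1 tokens; g: aligned L2 glosses; spm(F_prev, c) returns the student's
    updated L2 knowledge after reading configuration c; score(F) grades that
    knowledge. Ties are broken toward fewer exposed L2 tokens."""
    rho = lambda cfg: sum(t in set(g) for t in cfg)   # exposed L2 tokens
    c, F = list(x), F_prev
    s = score(F)
    for i in range(len(x)):
        c_new = c[:i] + [g[i]] + x[i + 1:]   # try replacing token i with its gloss
        F_new = spm(F_prev, c_new)           # simulate the student reading it
        s_new = score(F_new)
        if (s_new, -rho(c_new)) >= (s, -rho(c)):
            c, F, s = c_new, F_new, s_new
    return c, F

# Toy student: learns a gloss only when its immediate neighbors are still L1.
x = ["the", "river", "is", "long"]
g = ["hu", "Fluss", "ev", "lang"]

def spm(F_prev, c):
    learned = set(F_prev)
    for i, t in enumerate(c):
        ctx = [c[j] for j in (i - 1, i + 1) if 0 <= j < len(c)]
        if t in g and all(w in x for w in ctx):
            learned.add(t)
    return learned

cfg, F = greedy_search(x, g, set(), spm, score=len)
```

With this toy student, the search replaces "the" and "is" but backs off from gloss pairs that would leave each other without L1 context.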
Algorithm 2 Mixed-Lang. Document Gen.

Require: D = [(x¹, g¹), ..., (xᴺ, gᴺ)] ▷ Document
Require: E ▷ L1 emb. matrix
Require: F⁰ ▷ initial L2 emb. matrix

1: function DOCGEN(D, F⁰)
2:     C ← [] ▷ Configuration List
3:     for i = 1; i ≤ N; i++ do
4:         xⁱ, gⁱ ← D_i
5:         cⁱ, Fⁱ ← SEARCH(xⁱ, gⁱ, E, Fⁱ⁻¹)
6:         C ← C + [cⁱ]
7:     end for
8:     return C ▷ Mixed-Lang. Document
9: end function
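Algorithm 2 reduces to a short loop once a per-sentence search routine is given. In this sketch, `toy_search` is a placeholder for Algorithm 1 (it just replaces the first token and records the gloss as learned); only the carry-over of learned knowledge between sentences is the point.

```python
def doc_gen(doc, F0, search):
    """Generate a macaronic document sentence by sentence (Algorithm 2 sketch).

    doc: list of (L1 tokens, L2 glosses) pairs; F0: the student's initial L2
    knowledge; search(x, g, F_prev) -> (best configuration, updated knowledge)."""
    configs, F = [], F0
    for x, g in doc:
        c, F = search(x, g, F)   # knowledge learned so far carries over
        configs.append(c)
    return configs

# Placeholder per-sentence search: replace the first token, learn its gloss.
def toy_search(x, g, F_prev):
    return [g[0]] + x[1:], F_prev | {g[0]}

doc = [(["the", "cat"], ["hu", "taz"]), (["a", "dog"], ["u", "ront"])]
configs = doc_gen(doc, set(), toy_search)
```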
5.3 Variations in Generic Student Models

We developed two variations for the generic student model to compare and contrast the macaronic documents that can be generated.
5.3.1 Unidirectional Language Model

This variation restricts the bidirectional model (from Section 5.2.1) to be unidirectional (uGSM) and follows a standard recurrent neural network (RNN) language model (Mikolov et al., 2010).

$$\log p(x) = \sum_t \log p(x_t \mid h_t^f) \qquad (5.8)$$

$$h_t^f = \mathrm{LSTM}(x_0, \ldots, x_{t-1}; \theta^f) \qquad (5.9)$$

$$p(\cdot \mid h_t^f) = \mathrm{softmax}(E \cdot h_t^f) \qquad (5.10)$$

Once again, h_t^f ∈ R^{D×1} is the hidden state of the LSTM recurrent network, which is parameterized by θ^f, but unlike the model in Section 5.2.1, no backward LSTM and no projection function is used.

The same procedure from the bidirectional model is used to update L2 word embeddings (Section 5.2.2). While this model does not explicitly encode context from "future" tokens (i.e., words to the right of x_t), there is still pressure from right-side tokens x_{t+1:T} because the new embeddings will be adjusted to explain the tokens to the right as well. Fixing all the L1 parameters further strengthens this pressure on L2 embeddings from words to their right.
5.3.2 Direct Prediction Model

The previous two model variants adjust L2 embeddings using gradient steps to improve the pseudo-likelihood of the presented macaronic sentences. One drawback of such an approach is computation speed, caused by the bottleneck introduced by the softmax operation.

We designed an alternate student prediction model that can "directly" predict the embeddings for words in a sentence using contextual information. We refer to this variation as the Direct Prediction (DP) model. Like our previous generic student models, the DP model also uses bidirectional LSTMs to encode context and an L1 word embedding matrix E. However, the DP model does not attempt to produce a distribution over the output vocabulary; instead it tries to predict a real-valued vector using a feed-forward highway network (Srivastava, Greff, and Schmidhuber, 2015). The DP model's objective is to minimize the mean squared error (MSE) between a predicted word embedding and the true embedding. For a time-step t, the predicted word embedding x̂_t is generated by:

$$h_t^f = \mathrm{LSTM}([x_1, \ldots, x_{t-1}]; \theta^f) \qquad (5.11)$$

$$h_t^b = \mathrm{LSTM}([x_{t+1}, \ldots, x_T]; \theta^b) \qquad (5.12)$$

$$\hat{x}_t = \mathrm{FF}([x_t : h_t^f : h_t^b]; \theta^w) \qquad (5.13)$$

$$L(\theta^f, \theta^b, \theta^w) = \sum_t (\hat{x}_t - x_t)^2 \qquad (5.14)$$

where FF(·; θ^w) denotes a feed-forward highway network with parameters θ^w. Thus, the DP model training requires that we already have the "true embeddings" for all the L1 words in our corpus. We use pretrained L1 word embeddings from FastText as "true embeddings" (Bojanowski et al., 2017). This leaves the LSTM parameters θ^f, θ^b and the highway feed-forward network parameters θ^w to be learned. Equation 5.14 can be
minimized by simply copying the input x_t as the prediction (ignoring all context). We use masked training to prevent the model from trivially copying (Devlin et al., 2018). We randomly "mask" 30% of the input embeddings during training. This masking operation replaces the original embedding with either (i) zero vectors, (ii) the vector of a random word in the vocabulary, or (iii) the vector of a "neighboring" word from the vocabulary.³ The loss, however, is always computed with respect to the correct token embedding.
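The masking scheme can be sketched as follows. This is a toy sketch: the vectors, vocabulary, and neighbor lists are invented for illustration, and the actual masking operates on the model's embedding layer during training.

```python
import random

def mask_inputs(seq, vocab_vecs, neighbors, p_mask=0.3, seed=0):
    """Corrupt ~p_mask of the input embeddings for masked training (a sketch).

    seq: list of (word, vector) pairs. A selected position is replaced by
    (i) a zero vector, (ii) a random word's vector, or (iii) the vector of a
    precomputed nearest neighbor. The MSE loss is still computed against the
    original, uncorrupted embedding of the token."""
    rng = random.Random(seed)
    dim = len(seq[0][1])
    out = []
    for word, vec in seq:
        if rng.random() < p_mask:
            mode = rng.choice(["zero", "random", "neighbor"])
            if mode == "zero":
                vec = [0.0] * dim
            elif mode == "random":
                vec = vocab_vecs[rng.choice(sorted(vocab_vecs))]
            else:
                vec = vocab_vecs[rng.choice(neighbors[word])]
        out.append(vec)
    return out

vocab_vecs = {"the": [1.0, 0.0], "river": [0.0, 1.0]}
neighbors = {"the": ["river"], "river": ["the"]}
seq = [("the", [1.0, 0.0]), ("river", [0.0, 1.0])]
corrupted = mask_inputs(seq, vocab_vecs, neighbors)
```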
With the L1 parameters of the DP model trained, we turn to L2 learning. Once again the L2 vocabulary is encoded in F, which is initialized to 0 (i.e., before any sentence is observed). Consider the configuration: The Nile is a Fluss in Africa. The tokens are converted into a sequence of embeddings: [x_0 = E_{x_0}, ..., x_t = F_{f_t}, ..., x_T = E_{x_T}]. Note that at time-step t the L2 word-embedding matrix is used (t = 4, f_t = Fluss for the example above). A prediction x̂_t is generated by the model using Equations 5.11-5.13. Our hope is that the prediction is a "refined" version of the embedding for the L2 word. The refinement arises from considering the context of the L2 word. If Fluss was not seen before, x_t = F_{f_t} = 0, forcing the DP model to only use contextual information.³ We apply a simple update rule that modifies the L2 embeddings based on the direct predictions:

$$F_{f_t} \leftarrow (1 - \eta) F_{f_t} + \eta \hat{x}_t \qquad (5.15)$$

where η controls the interpolation between the old values of a word embedding and the new values, which have been predicted based on the current mixed sentence. If there are multiple L2 words in a configuration, say at positions i and j (where i < j), we can still follow Eqs. 5.11-5.13. However, to allow the predictions x̂_i and x̂_j to jointly influence each other, we need to execute multiple prediction iterations.

³ We precompute 20 neighboring words (based on cosine similarity) for each word in the vocabulary using FastText embeddings before training.
Concretely, let X⁰ = [x_1, ..., F_{f_i}, ..., F_{f_j}, ..., x_T] be the sequence of word embeddings for a macaronic sentence. The DP model generates predictions X̂ = [x̂_1, ..., x̂_i, ..., x̂_j, ..., x̂_T]. We only use its predictions at time-steps corresponding to L2 tokens, since the L2 words are those we want to update (Eq. 5.16):

$$X^1 = \mathrm{DP}(X^0)$$

$$\text{where } X^0 = [x_1, \ldots, F_{f_i}, \ldots, F_{f_j}, \ldots, x_T]$$

$$X^1 = [x_1, \ldots, \hat{x}_i^1, \ldots, \hat{x}_j^1, \ldots, x_T] \qquad (5.16)$$

$$X^k = \mathrm{DP}(X^{k-1}) \quad \forall\, 1 \le k \le K \qquad (5.17)$$

where X¹ contains predictions at i and j and the original L1 word embeddings in other positions. We then pass X¹ as input again to the DP model. This is executed for K iterations (Eq. 5.17). With each iteration, our hope is that the DP model's predictions x̂_i and x̂_j get refined by influencing each other, and result in embeddings that are well suited to the sentence context. A similar style of imputation has been studied for one-dimensional time-series data by Zhou and Huang (2018). Finally, after K iterations, we use the predictions of x̂_i and x̂_j from X^K to update the L2 word embeddings in F corresponding to the L2 tokens f_i and f_j. η was set to 0.3 and the number of iterations K = 5.
$$F_{f_i} \leftarrow (1 - \eta) F_{f_i} + \eta \hat{x}_i^K$$

$$F_{f_j} \leftarrow (1 - \eta) F_{f_j} + \eta \hat{x}_j^K \qquad (5.18)$$
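The iterative refinement and interpolation update can be sketched as follows. The `dp_model` below is a stand-in that simply averages its inputs (the real predictor is the trained highway network), so only the feedback loop and the η-interpolation are faithful to the description above.

```python
def refine_l2(embs, l2_positions, dp_model, F, words, eta=0.3, K=5):
    """Iterative refinement and update of L2 embeddings (a sketch of the scheme above).

    embs: input vectors (L1 rows from E, L2 rows from F); dp_model(X) returns
    a predicted vector per position. Predictions at L2 positions are fed back
    in for K rounds, then interpolated into F with rate eta."""
    X = [list(v) for v in embs]
    for _ in range(K):
        X_hat = dp_model(X)
        for i in l2_positions:           # only L2 positions take the predictions
            X[i] = list(X_hat[i])
    for i in l2_positions:               # F_{f_i} <- (1-eta) F_{f_i} + eta x_hat_i
        w = words[i]
        F[w] = [(1 - eta) * a + eta * b for a, b in zip(F[w], X[i])]
    return F

# Stand-in DP model: predicts, for every position, the mean of all input vectors.
def dp_model(X):
    n, d = len(X), len(X[0])
    mean = [sum(v[k] for v in X) / n for k in range(d)]
    return [mean] * n

words = ["the", "Fluss", "is"]
embs = [[1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]   # unseen L2 word starts at zero
F = refine_l2(embs, [1], dp_model, {"Fluss": [0.0, 0.0]}, words)
```

With each round the prediction at the L2 position drifts toward its context; the final interpolation moves F only part of the way, as in the update rule.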
Figure 5.1: A screenshot of a macaronic sentence presented on Mechanical Turk.
5.4 Experiments with Synthetic L2

We first investigate the patterns of word replacement produced by the machine teacher under the influence of the different generic student models, and how these replacements affect the guessability of L2 words. To this end, we used the machine teacher to generate macaronic documents and asked MTurk participants to guess the foreign words. Figure 5.1 shows an example screenshot of our guessing interface. The words in blue are L2 words whose meaning (in English) is guessed by MTurk participants. For our study, we created a synthetic L2 language by randomly replacing characters from English word types. This step lets us safely assume that all MTurk participants are "absolute beginners." We tried to ensure that the resulting synthetic words are pronounceable by replacing vowels with vowels, stop consonants with other stop consonants, etc. We also inserted or deleted one character from some of the words to prevent the reader from using the length of the synthetic word as a clue.
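The synthetic-word construction can be sketched like this. The exact character classes and jitter probability used in the study are not specified here, so the values below are illustrative assumptions.

```python
import random

VOWELS = "aeiou"
STOPS = "pbtdkg"

def synth_l2(word, seed=0):
    """Generate a pronounceable synthetic L2 form of an English word (a sketch).

    Vowels are replaced with vowels and stop consonants with stop consonants;
    with some probability one character is inserted or deleted so that word
    length is not a reliable clue for the reader."""
    rng = random.Random(seed)
    out = []
    for ch in word.lower():
        if ch in VOWELS:
            out.append(rng.choice(VOWELS))
        elif ch in STOPS:
            out.append(rng.choice(STOPS))
        else:
            out.append(ch)          # other characters kept as-is in this sketch
    if rng.random() < 0.3:          # occasional length jitter
        if rng.random() < 0.5 and len(out) > 2:
            del out[rng.randrange(len(out))]
        else:
            out.insert(rng.randrange(len(out) + 1), rng.choice(VOWELS))
    return "".join(out)

fake = synth_l2("river")
```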
Metric           Model   r_max = 1        r_max = 4        r_max = 8
Replaced         GSM     0.25             0.31             0.35
                 uGSM    0.20             0.25             0.25
                 DP      0.19             0.22             0.21
Guess Accuracy   GSM     86.00 (± 0.87)   74.00 (± 1.10)   55.13 (± 2.54)
                 uGSM    84.57 (± 0.56)   73.89 (± 1.72)   72.83 (± 1.58)
                 DP      88.44 (± 0.73)   81.07 (± 1.03)   70.85 (± 1.49)

Table 5.2: Results from MTurk data. The first section shows the fraction of tokens that were replaced with L2 glosses under each condition. The Guess Accuracy section shows the percentage token accuracy of MTurk participants' guesses, along with 95% confidence intervals calculated via bootstrap resampling.
We studied the three generic student models (GSM, uGSM, and DP) while keeping the rest of the machine teacher's components fixed (i.e., same scoring function and search
[Table 5.3 is a 3×3 grid of bar charts: rows r_max ∈ {1, 4, 8}; columns Open-Class, Closed-Class, and All; with paired bars for the DP, GSM, and uGSM models.]

Table 5.3: Results of MTurk guesses split up by word class. The y-axis is the percentage of tokens belonging to a word class. The pink bar (right) shows the percentage of tokens (of a particular word class) that were replaced with an L2 gloss. The blue bar (left) indicates the percentage of tokens (of a particular word class) that were guessed correctly by MTurk participants. Error bars represent 95% confidence intervals computed with bootstrap resampling. For example, we see that only 5.0% (pink) of open-class tokens were replaced into L2 by the DP model at r_max = 1, and 4.3% of all open-class tokens were guessed correctly. Thus, even though the guess accuracy for DP at r_max = 1 for open-class words is high (86%), we can see that participants were not exposed to many open-class word tokens.
algorithms). All three models were constructed to have roughly the same number of L1 parameters (≈ 20M). The uGSM model used 2 unidirectional LSTM layers instead of a single bidirectional layer. The L1 and L2 word embedding size and the number of recurrent units D were set to 300 for all three models (to match the size of FastText's pretrained embeddings). We trained the three models on the Wikipedia-103 corpus (Merity et al., 2016).⁴ All models were trained for 8 epochs using the Adam optimizer (Kingma and Ba, 2014). We limit the L1 vocabulary to the 60k most frequent English types.
5.4.1 MTurk Setup

We selected 6 documents from Simple Wikipedia to serve as the input for macaronic content.⁵ To keep our study short enough for MTurk, we selected documents that contained 20-25 sentences. A participant could complete up to 6 HITs (Human Intelligence Tasks) corresponding to the 6 documents. Participants were given 25 minutes to complete each HIT (on average, the participants took 12 minutes to complete the HITs). To prevent typos, we used a 20k-word English dictionary, which includes all the word types from the 6 Simple Wikipedia documents. We provided no feedback regarding the correctness of guesses. We recruited 128 English-speaking MTurk participants and obtained 162 responses, with each response encompassing a participant's guesses over a full document.⁶ Participants were compensated $4 per HIT.

⁴ FastText pretrained embeddings were trained on more data.
⁵ https://dumps.wikimedia.org/simplewiki/20190120/
⁶ Participants self-reported their English proficiency; only native or fluent speakers were allowed to participate. Our HITs were only available to participants from the US.
5.4.2 Experiment Conditions

We generated 9 macaronic versions (3 models {GSM, uGSM, DP} in combination with 3 rank thresholds r_max ∈ {1, 4, 8}) for each of the 6 Simple Wikipedia documents. For each HIT, an MTurk participant was randomly assigned one of the 9 macaronic versions.
Model   r_max = 1   r_max = 8
GSM Hu Nile (‘‘an-nī l’’) ev a river um Hu Nile (‘‘an-nī l’’) ev u river um
Africa. Up is hu longest river Africa. Up ev the longest river
i ñ Earth (about 6,650 km or 4,132 on Earth (about 6,650 km or 4,132
miles), though other rivers carry miles), though other rivers carry
more water... more water...
Many ozvolomb types iv emoner live Emu ozvolomb types of emoner live
in or near hu waters iv hu Nile, um or iul the waters of hu Uro,
including crocodiles, birds, fish including crocodiles, ultf, yvh
ñ b many others. Not only do animals and emu others. Ip only do animals
d e p e n d i ñ hu Nile for survival, but d e p e n d i ñ the Nile zi survival, but
also people who live there need up also daudr who live there need up
zi everyday use like washing, as u zi everyday use like washing, ez a
jopi supply, keeping crops watered jopi supply, keeping crops watered
ñ b other jobs... ñ b other jobs...
Table 5.4: Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the GSM model.
Tables 5.4 to 5.6 show the output for the GSM, uGSM and DP generic student models at two settings of r_max for one of the documents. In these experiments we use a synthetic L2 language.
Model   r_max = 1   r_max = 8
uGSM The Nile (‘‘an-nī l’’) ev a river Hu Nile (‘‘an-nī l’’) ev u river um
um Africa. It ev hu longest river Africa. Up ev the longest river
on Earth (about 6,650 km or 4,132 i ñ Earth (about 6,650 km or 4,132
miles), though other rivers carry miles), though other rivers carry
more jopi... more jopi...
Many different pita of emoner live Many different pita of emoner live
in or near hu waters iv hu Nile, um or near hu waters iv hu Nile,
including crocodiles, ultf, fish including crocodiles, ultf, fish
and many others. Not mru do emoner and many others. Not mru do emoner
d e p e n d i ñ hu Nile for survival, but depend on the Nile for survival, id
also people who live there need it also people who live there need it
for everyday use like washing, as a zi everyday use like washing, as u
jopi supply, keeping crops watered water supply, keeping crops watered
ñ b other jobs... ñ b other jobs...
Table 5.5: Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the uGSM model.
Model   r_max = 1   r_max = 8
DP Hu Nile (‘‘an-nī l’’) ev a river um Hu Nile (‘‘an-nī l’’) ev a river um
Africa. Up ev hu longest river Africa. Up ev hu longest river
on Earth (about 6,650 km or 4,132 on Earth (about 6,650 km or 4,132
miles), though other rivers carry miles), though udho rivers carry
more water... more water...
Many different types iv animals Many different pita of animals live
live in or near hu waters iv hu i n o r n e a r hu waters of hu Nile,
Nile, including crocodiles, birds, including crocodiles, birds, fish
fish and many others. Not only and many others. Not mru do animals
do animals depend iñ h u N i l e f o r d e p e n d i ñ hu Nile zi survival, id
survival, but also people who live also people who live there need it
there need it for everyday use like zi everyday use like washing, ez a
washing, as u water supply, keeping water supply, keeping crops watered
crops watered and other jobs... and udho jobs...
Table 5.6: Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the DP generic student model. Only common function words seem to be replaced with their L2 translations.

The two columns show the effect of the rank threshold r_max. Note that this macaronic document is 25 sentences long; here, we only show the first 2 sentences and another 2 middle sentences to save space. We see that r_max controls the number of L2 words the machine teacher deems guessable, which affects text readability. The increase in L2 words is most noticeable with the GSM model. We also see that the DP model differs from the others by favoring high-frequency words almost exclusively. While the GSM and uGSM models similarly replace a number of high-frequency words, they also occasionally replace lower-frequency word classes like nouns and adjectives (emoner, Emu, etc.). Table 5.2 summarizes our findings. The first section of Table 5.2 shows the percentage of tokens that were deemed guessable by our machine teacher. The GSM model replaces more words as r_max is increased to 8, but we see that MTurkers had a hard time guessing the meaning of the replaced tokens: their guessing accuracy drops to 55% at r_max = 8 with the GSM model. The uGSM model, however, displays a reluctance to replace too many tokens, even as r_max was increased to 8.

We further analyzed the replacements and MTurk guesses based on word class. We tagged the L1 tokens with their part of speech and categorized tokens into open or closed class following Universal Dependencies guidelines ("Universal Dependencies v1: A Multilingual Treebank Collection").⁷ Table 5.3 summarizes our analysis of model and human behavior when the data is separated by word class. The pink bars indicate the percentage of tokens replaced per word class. The blue bars represent the percentage of tokens from a particular word class that MTurk users guessed correctly. Thus, an ideal machine teacher should strive for the highest possible pink bar while ensuring that the blue bar is as close as possible to the pink. Our findings suggest that the uGSM model at r_max = 8 and the GSM model at r_max = 4 show the desirable properties: high guessing accuracy and more representation of L2 words (particularly open-class words).

⁷ https://universaldependencies.org/u/pos/
Metric           Model    Closed           Open
Types Replaced   random   59               524
                 GSM      33               149
Guess Accuracy   random   62.06 (± 1.54)   39.36 (± 1.75)
                 GSM      74.91 (± 0.94)   61.96 (± 1.24)

Table 5.7: Results comparing our generic-student-based approach to a random baseline. The first part shows the number of L2 word types exposed by each model for each word class. The second part shows the average guess accuracy percentage for each model and word class. 95% confidence intervals (in parentheses) were computed using bootstrap resampling.
5.4.3 Random Baseline

So far we've compared different generic student models against each other, but is our generic-student-based approach required at all? How much better (or worse) is this approach compared to a random baseline? To answer these questions, we compare the GSM model with r_max = 4 against a randomly generated macaronic document. As the name suggests, word replacements are decided randomly for the random condition, but we ensure that the number of tokens replaced in each sentence equals that from the GSM condition.

We used the 6 Simple Wikipedia documents from §5.4.1 and recruited 64 new MTurk participants who completed a total of 66 HITs (compensation was $4 per HIT). For each
Model    Closed           Open
random   9.86 (± 0.94)    4.28 (± 0.69)
GSM      35.53 (± 1.03)   27.77 (± 1.03)

Table 5.8: Results of our L2 learning experiments, where MTurk subjects simply read a macaronic document and answered a vocabulary quiz at the end of the passage. The table shows the average guess accuracy percentage along with 95% confidence intervals computed from bootstrap resampling.
HIT, the participant was given either the randomly generated or the GSM-based macaronic document. Once again, participants were made to enter their guess for each L2 word that appears in a sentence. The results are summarized in Table 5.7.

We find that randomly replacing words with glosses exposes more L2 word types (59 and 524 closed-class and open-class words, respectively) while the GSM model is more conservative with replacements (33 and 149). However, the random macaronic document is much harder to comprehend, indicated by significantly lower average guess accuracies than those with the GSM model. This is especially true for open-class words. Note that Table 5.7 shows the number of word types replaced across all 6 documents.
5.4.4 Learning Evaluation

Our macaronic-based approach relies on incidental learning: if a novel word is repeatedly presented to a student with sufficient context, the student will eventually be able to learn the novel word. So far our experiments test MTurk participants on the "guessability" of novel words in context, but not learning. To study if students can actually learn the L2 words, we conduct an MTurk experiment where participants are simply required to read a macaronic document (one sentence at a time). At the end of the document, an L2 vocabulary quiz is given. Participants must enter the meaning of every L2 word type they have seen during the reading phase.

Once again, we compare our GSM (r_max = 4) model against a random baseline using the 6 Simple Wikipedia documents. 47 HITs were obtained from 45 MTurk participants for this experiment. Participants were made aware that there would be a vocabulary quiz at the end of the document. Our findings are summarized in Table 5.8. We find the accuracy of guesses for the vocabulary quiz at the end of the document is considerably lower than guesses with context. However, subjects still managed to retain 35.53% and 27.77% of closed-class and open-class L2 word types, respectively. On the other hand, when a random macaronic document was presented to participants, their guess accuracy dropped to 9.86% and 4.28% for closed- and open-class words, respectively. Thus, even though more word types were exposed by the random baseline, fewer words were retained.
Additionally, we would like to investigate how our approach could be extended to enable phrasal learning (which should consider word-ordering differences between the L1 and L2). As the GSM and uGSM models showed the most promising results in our experiments, we believe these models could serve as the baseline for future work.
5.5 Spelling-Aware Extension

So far, our generic student model ignores the fact that a novel word like Afrika is guessable simply by its spelling similarity to Africa. Thus, we augment the generic student model to use character n-grams. We choose the bidirectional generic student model for our spelling-aware extension based on the pilot experiments detailed in §5.4.2. In addition to an embedding per word type, we learn embeddings for character n-gram types that appear in our L1 corpus. The row in E for a word w is now parameterized as:

$$\tilde{E} \cdot \tilde{w} + \sum_n \frac{1}{\mathbf{1} \cdot \tilde{w}^n} \tilde{E}^n \cdot \tilde{w}^n \qquad (5.19)$$

where Ẽ is the full-word embedding matrix and w̃ is a one-hot vector associated with the word type w, Ẽⁿ is a character n-gram embedding matrix, and w̃ⁿ is a multi-hot vector associated with all the character n-grams for the word type w. For each n, the summand gives the average embedding of all n-grams in w (where 1 · w̃ⁿ counts these n-grams). We set n to range from 3 to 4 (see §5.7). This formulation is similar to previous subword-based embedding models (Wieting et al., 2016; Bojanowski et al., 2017).
Similarly, the embedding of an L2 word w is parameterized as

$$\tilde{F} \cdot \tilde{w} + \sum_n \frac{1}{\mathbf{1} \cdot \tilde{w}^n} \tilde{F}^n \cdot \tilde{w}^n \qquad (5.20)$$

Crucially, we initialize F̃ⁿ to µẼⁿ (where µ > 0) so that L2 words can inherit part of their initial embedding from similarly spelled L1 words: F̃⁴_{Afri} := µẼ⁴_{Afri}.⁸ But we allow F̃ⁿ to diverge over time in case an n-gram functions differently in the two languages. In the same way, we initialize each row of F̃ to the corresponding row of µ · Ẽ, if any, and otherwise to 0. Our experiments set µ = 0.2 (see §5.7). We refer to this spelling-aware extension to GSM as sGSM.
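The n-gram decomposition of Eq. 5.19 can be sketched as follows. The one-dimensional embeddings here are toy values, and letting missing n-grams default to zero is purely for illustration; only the per-n averaging mirrors the formulation above.

```python
def char_ngrams(word, n):
    """All character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def subword_embedding(word, full_emb, ngram_emb, dim, n_range=(3, 4)):
    """Word vector = full-word row plus, per n, the averaged n-gram rows (a sketch).

    Missing rows fall back to zero vectors, mirroring how unseen pieces start at 0."""
    vec = list(full_emb.get(word, [0.0] * dim))
    for n in n_range:
        grams = char_ngrams(word, n)
        if not grams:
            continue
        rows = [ngram_emb.get(g, [0.0] * dim) for g in grams]
        for d in range(dim):
            vec[d] += sum(r[d] for r in rows) / len(rows)   # the 1/(1·w̃ⁿ) averaging
    return vec

# "Afrika" shares the 3-grams "Afr" and "fri" with "Africa".
vec = subword_embedding("Afrika", {}, {"Afr": [1.0], "fri": [1.0]}, dim=1)
```

Even with no full-word row, the shared n-grams give Afrika a nonzero starting embedding, which is exactly the inheritance effect the µ-initialization exploits.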
5.5.1 Scoring L2 embeddings

Did the simulated student learn correctly and usefully? Let P be the "reference set" of all (L1 word, L2 gloss) pairs from all tokens in the entire document. We assess the machine teacher's success by how many of these pairs the simulated student has learned. (The student may even succeed on some pairs that it has never been shown, thanks to n-gram clues.)

Specifically, we measure the "goodness" of the updated L2 word embedding matrix F. For each pair p = (e, f) ∈ P, sort all the words in the entire L1 vocabulary according to their cosine similarity to the L2 word f, and let r_p denote the rank of e. For example, if the student had managed to learn a matrix F whose embedding of f exactly equalled E's embedding of e, then r_p would be 1. We then compute a mean reciprocal rank (MRR) score

⁸ We set µ = 0.2 based on findings from our hyperparameter search (see §5.7).
of F:

$$\mathrm{MRR}(F) = \frac{1}{|P|} \sum_{p \in P} \left( r_p^{-1} \text{ if } r_p \le r_{\max} \text{ else } 0 \right) \qquad (5.21)$$

We set r_max = 4 based on our pilot study. This threshold has the effect of only giving credit to an embedding of f such that the correct e is in the simulated student's top 4 guesses. As a result, §5.5.2's machine teacher focuses on introducing L2 tokens whose meaning can be deduced rather accurately from their single context (together with any prior exposure to that L2 type). This makes the macaronic text comprehensible for a human student, rather than frustrating to read. In our pilot study we found that this r_max threshold substantially improved human learning.
5.5.2 Macaronic Configuration Search

Our current machine teacher produces the macaronic document greedily, one sentence at a time. Actual documents produced are shown in ??.

Let F_prev be the student model's embedding matrix after reading the first n − 1 macaronic sentences. We evaluate a candidate next sentence x by the score MRR(F), where F maximizes (5.5) and is thus the embedding matrix that the student would arrive at after reading x as the n-th macaronic sentence.
We use best-first search to seek a high-scoring x. A search state is a pair (i, x) where x is a macaronic configuration (Table 5.1) whose first i tokens may be either L1 or L2, but whose remaining tokens are still L1. The state's score is obtained by evaluating x as described above. In the initial state, i = 0 and x is the n-th sentence of the original L1 document. The state (i, x) is a final state if i = |x|. Otherwise its two successors are (i + 1, x) and (i + 1, x′), where x′ is identical to x except that the (i + 1)-th token has been replaced by its L2 gloss. The search algorithm maintains a priority queue of states sorted by score. Initially, this contains only the initial state. A step of the algorithm consists of popping the highest-scoring state and, if it is not final, replacing it by its two successors. The queue is then pruned back to the top 8 states. When the queue becomes empty, the algorithm returns the configuration x from the highest-scoring final state that was ever popped.
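This best-first procedure can be sketched with a pruned priority queue. The scoring function below is a toy stand-in for the simulated-student MRR evaluation; only the state space, successor rule, pruning, and stopping condition follow the description above.

```python
import heapq

def best_first_search(sentence, glosses, evaluate, beam=8):
    """Best-first search over macaronic configurations with a pruned queue (a sketch).

    A state is (i, config): the first i tokens are decided, the rest remain L1.
    evaluate(config) -> score (standing in for MRR after a simulated read).
    The queue is pruned to the top `beam` states; the best popped final state wins."""
    start = (0, tuple(sentence))
    queue = [(-evaluate(list(sentence)), start)]   # max-heap via negated scores
    best_cfg, best_score = list(sentence), float("-inf")
    while queue:
        neg_s, (i, cfg) = heapq.heappop(queue)
        if i == len(cfg):                          # final state: all tokens decided
            if -neg_s > best_score:
                best_cfg, best_score = list(cfg), -neg_s
            continue
        keep = cfg                                 # successor 1: keep the L1 token
        swap = cfg[:i] + (glosses[i],) + cfg[i + 1:]   # successor 2: use the gloss
        for nxt in (keep, swap):
            heapq.heappush(queue, (-evaluate(list(nxt)), (i + 1, nxt)))
        queue = heapq.nsmallest(beam, queue)       # prune back to the top states
        heapq.heapify(queue)
    return best_cfg

# Toy scorer: one gloss is deemed guessable (+1), the other confusing (-1).
EASY, HARD = {"hu"}, {"ront"}
def evaluate(cfg):
    return sum((t in EASY) - (t in HARD) for t in cfg)

best = best_first_search(["the", "dog"], ["hu", "ront"], evaluate)
```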
5.6 Experiments with Real L2

Does our machine teacher generate useful macaronic text? To answer this, we measure whether human students (i) comprehend the L2 words in context, and (ii) retain knowledge of those L2 words when they are later seen without context.

We assess (i) by displaying each successive sentence of a macaronic document to a human student and asking them to guess the L1 meaning for each L2 token f in the sentence. For a given machine teacher, all human subjects saw the same macaronic document, and each subject's comprehension score is the average quality of their guesses on all the L2 tokens presented by that teacher. A guess's quality q ∈ [0, 1] is a thresholded cosine similarity between the embeddings⁹ of the guessed word ê and the original L1 word e:

⁹ Here we used pretrained word embeddings from Mikolov et al. (2018), in order to measure actual semantic similarity.
$$q = \mathrm{cs}(e, \hat{e}) \ \text{ if } \ \mathrm{cs}(e, \hat{e}) \ge \tau \ \text{ else } \ 0$$

Thus, ê = e obtains q = 1 (full credit), while q = 0 if the guess is "too far" from the truth (as determined by τ).
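The guess-quality metric is a one-liner over pretrained embeddings, sketched here with toy two-dimensional vectors in place of the actual pretrained ones:

```python
import math

def guess_quality(e_vec, guess_vec, tau=0.6):
    """Thresholded cosine similarity q in [0, 1]: credit only at or above tau."""
    dot = sum(a * b for a, b in zip(e_vec, guess_vec))
    norm = math.sqrt(sum(a * a for a in e_vec)) * math.sqrt(sum(b * b for b in guess_vec))
    cs = dot / norm if norm else 0.0
    return cs if cs >= tau else 0.0

exact = guess_quality([1.0, 0.0], [1.0, 0.0])   # identical guess: full credit
far = guess_quality([1.0, 0.0], [0.0, 1.0])     # orthogonal guess: below tau
```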
To assess (ii), we administer an L2 vocabulary quiz after having human subjects simply read a macaronic passage (without any guessing as they are reading). They are then asked to guess the L1 translation of each L2 word type that appeared at least once in the passage. We used the same guess-quality metric as in (i).¹⁰ This tests if human subjects naturally learn the meanings of L2 words, in informative contexts, well enough to later translate them out of context. The test requires only short-term retention, since we give the vocabulary quiz immediately after a passage is read.
We compared results on macaronic documents constructed with the generic student model (GSM), its spelling-aware variant (sGSM), and a random baseline. In the baseline, tokens to replace are randomly chosen while ensuring that each sentence replaces the same number of tokens as in the GSM document. This ignores context, spelling, and prior exposures as reasons to replace a token.

Our evaluation was aimed at native English (L1) speakers learning Spanish or German (L2). We recruited L2 "students" on Amazon Mechanical Turk (MTurk). They were absolute beginners, selected using a placement test and self-reported L2 ability.

¹⁰ If multiple L1 types e were glossed in the document with this L2 type, we generously use the e that maximizes cs(e, ê).
L2   Model    Closed-class           Open-class
Es   random   0.74 ± 0.0126 (54)     0.61 ± 0.0134 (17)
     GSM      0.72 ± 0.0061 (54)     0.70 ± 0.0084 (17)
     sGSM     0.82 ± 0.0038 (41)     0.80 ± 0.0044 (21)
De   random   0.59 ± 0.0054 (34)     0.38 ± 0.0065 (13)
     GSM      0.80 ± 0.0033 (34)     0.78 ± 0.0056 (13)
     sGSM     0.82 ± 0.0063 (33)     0.79 ± 0.0062 (14)

Table 5.9: Average token guess quality (τ = 0.6) in the comprehension experiments. The ± denotes a 95% confidence interval computed via bootstrap resampling of the set of human subjects. The % of L1 tokens replaced with L2 glosses is in parentheses. §5.8 evaluates with other choices of τ.
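The bootstrap confidence intervals reported in Table 5.9 can be reproduced schematically as follows; the per-subject scores below are invented for illustration:

```python
import random
import statistics

def bootstrap_ci(subject_scores, n_boot=10000, alpha=0.05, seed=0):
    """95% confidence interval for the mean comprehension score, obtained
    by resampling the set of human subjects with replacement and taking
    the empirical alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    n = len(subject_scores)
    means = sorted(
        statistics.mean(rng.choices(subject_scores, k=n))
        for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented per-subject average guess qualities:
scores = [0.71, 0.80, 0.76, 0.84, 0.78, 0.69, 0.82]
print(bootstrap_ci(scores))
```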
5.6.1 Comprehension Experiments

We used the first chapter of Jane Austen's "Sense and Sensibility" for Spanish, and the first 60 sentences of Franz Kafka's "Metamorphosis" for German. Bilingual speakers provided the L2 glosses (see §5.9 for examples).

For English-Spanish, 11, 8, and 7 subjects were assigned macaronic documents generated with sGSM, GSM, and the random baseline, respectively. The corresponding numbers for English-German were 12, 7, and 7. A total of 39 subjects were used in these experiments (some subjects did both languages). They were given 3 hours to complete the entire document (average completion time was ≈ 1.5 hours) and were compensated $10.
Table 5.9 reports the mean comprehension score over all subjects, broken down into comprehension of function words (closed-class POS) and content words (open-class POS).¹¹ For Spanish, the sGSM-based teacher replaces more content words (but fewer function words), and furthermore the replaced words in both cases are better understood on average, which we hope leads to more engagement and more learning. For German, by contrast, the number of words replaced does not increase under sGSM, and comprehension only improves marginally. Both GSM and sGSM do strongly outperform the random baseline. But the sGSM-based teacher only replaces a few additional cognates (hundert but not Mutter), apparently because English-German cognates do not exhibit large exact character n-gram overlap. We hypothesize that character skip n-grams might be more appropriate for English-German.

¹¹https://universaldependencies.org/u/pos/

L2   Model    Closed-class           Open-class
Es   random   0.47 ± 0.0058 (60)     0.40 ± 0.0041 (46)
     GSM      0.48 ± 0.0084 (60)     0.42 ± 0.0105 (15)
     sGSM     0.52 ± 0.0054 (47)     0.50 ± 0.0037 (24)

Table 5.10: Average type guess quality (τ = 0.6) in the retention experiment. The % of L2 gloss types that were shown in the macaronic document is in parentheses. §5.8 evaluates with other choices of τ.
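The skip n-gram hypothesis can be illustrated concretely: under exact character 3-grams, Mutter/mother share almost nothing, whereas wildcarding one interior character lets patterns like m_t match. This sketch is only an illustration of the idea, not the feature extractor actually used by sGSM:

```python
def ngrams(word, n):
    w = f"#{word}#"  # boundary markers
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def skip_ngrams(word, n):
    """Character n-grams plus variants with one interior position
    wildcarded, so near-cognates can match on skipped patterns."""
    out = set()
    for g in ngrams(word, n):
        out.add(g)
        for i in range(1, n - 1):
            out.add(g[:i] + "_" + g[i + 1:])
    return out

def jaccard(a, b):
    return len(a & b) / len(a | b)

for de, en in [("hundert", "hundred"), ("mutter", "mother")]:
    exact = jaccard(ngrams(de, 3), ngrams(en, 3))
    skip = jaccard(skip_ngrams(de, 3), skip_ngrams(en, 3))
    print(de, en, round(exact, 2), round(skip, 2))
```

On this toy measure, mutter/mother gains far more from skipping than it has under exact 3-grams, matching the intuition above.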
5.6.2 Retention Experiments

For retention experiments we used the first 25 sentences of our English-Spanish dataset. New participants were recruited and compensated $5. Each participant was assigned a macaronic document generated with the sGSM, GSM, or random model (20, 18, and 22 participants respectively). As Table 5.10 shows, sGSM's advantage over GSM on comprehension holds up on retention. On the vocabulary quiz, students correctly translated > 30 of the 71 word types they had seen (Table 5.15), and more than half when near-synonyms earned partial credit (Table 5.10).
5.7 Hyperparameter Search

We tuned the model hyperparameters by hand on separate English-Spanish data, namely the second chapter of "Sense and Sensibility," equipped with glosses. Hyperparameter tuning results are reported in this section. All other English-Spanish results in the paper are on the first chapter of "Sense and Sensibility," which was held out for testing. We might have improved the results on English-German by tuning separate hyperparameters for that setting.

The tables below show the effect of different hyperparameter choices on the quality MRR(F) of the embeddings learned by the simulated student. Recall from §5.5.1 that the MRR score evaluates F using all glosses, not just those used in a particular macaronic document. Thus, it is comparable across the different macaronic documents produced by different machine teachers.
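Schematically, MRR(F) ranks every candidate L1 gloss for each L2 word and averages the reciprocal rank of the reference gloss. The sketch below ranks by cosine similarity between the simulated student's embeddings; that ranking detail and all names here are our illustration:

```python
import numpy as np

def mrr(f_embed, e_embed, reference):
    """Mean reciprocal rank of the reference L1 gloss for each L2 word,
    ranking all L1 candidates by cosine similarity to the L2 embedding."""
    l1_words = list(e_embed)
    E = np.stack([e_embed[w] for w in l1_words])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    total = 0.0
    for l2, gold in reference.items():
        v = f_embed[l2] / np.linalg.norm(f_embed[l2])
        order = np.argsort(-(E @ v))          # best candidate first
        rank = 1 + [l1_words[i] for i in order].index(gold)
        total += 1.0 / rank
    return total / len(reference)

# A toy simulated student: two glosses ranked first, one ranked second.
e_embed = {"house": np.array([1.0, 0.0]),
           "dog":   np.array([0.0, 1.0]),
           "tree":  np.array([0.7, 0.7])}
f_embed = {"casa":  np.array([0.9, 0.1]),
           "perro": np.array([0.1, 0.9]),
           "arbol": np.array([0.8, 0.3])}
print(mrr(f_embed, e_embed, {"casa": "house", "perro": "dog", "arbol": "tree"}))
```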
QueueSize (§5.5.2) affects only how hard the machine teacher searches for macaronic sentences that will help the simulated student. We find that larger QueueSize is in fact valuable.

The other choices (Model, n-grams, µ) affect how the simulated student actually learns. The machine teacher then searches for a document that will help that particular simulated student learn as many of the words in the reference set as possible. Thus, the MRR score is high to the extent that the simulated student "can be successfully taught." By choosing hyperparameters that achieve a high MRR score, we are assuming that human students are adapted (or can adapt online) to be teachable.
The scale factor µ (used only for sGSM) noticeably affects the macaronic document generated by the machine teacher. Setting it high (µ = 1.0) has an adverse effect on the MRR score. Table 5.11 shows how the MRR score of the simulated student (§5.5.1) varies according to the student model's µ value. Tables 5.12 and 5.13 show the result of the same hyperparameter sweep on the number of L1 word tokens and types replaced with L2 glosses.

Note that µ only affects initialization of the F parameters. Thus, with µ = 0, the L2 word and subword embeddings are initialized to 0, but the simulated sGSM student still has the ability to learn subword embeddings for both L1 and L2. This allows it to beat the simulated GSM student.

We see that for sGSM, µ = 0.2 results in replacing the most words (both types and tokens), and also has very nearly the highest MRR score. Thus, for sGSM, we decided to use µ = 0.2 and allow both 3-gram and 4-gram embeddings.
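One way to picture the role of µ, consistent with (but not necessarily identical to) the description above: subword units shared between L1 and L2 start from a µ-scaled copy of the corresponding L1 embedding, and µ = 0 reduces to an all-zero initialization that training can still move away from zero. This initializer is our illustration only:

```python
import numpy as np

def init_l2_subword_embeddings(l2_subwords, l1_subword_emb, dim, mu):
    """Hypothetical sketch of mu-scaled initialization of the F parameters:
    subwords shared with L1 start at mu times the L1 embedding; everything
    else starts at zero. All of F remains trainable afterwards."""
    return {s: mu * l1_subword_emb[s] if s in l1_subword_emb
            else np.zeros(dim)
            for s in l2_subwords}

l1_emb = {"ent": np.array([0.5, -0.2, 0.1])}  # e.g. a shared 3-gram "ent"
F0 = init_l2_subword_embeddings(["ent", "rar"], l1_emb, 3, mu=0.2)
print(F0["ent"])  # scaled copy of the L1 embedding
print(F0["rar"])  # zeros: no L1 counterpart
```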
5.8 Results Varying τ
L2   τ     Model    Closed-class           Open-class
Es   0.0   random   0.81 ± 0.0084 (54)     0.72 ± 0.0088 (17)
           GSM      0.80 ± 0.0045 (54)     0.79 ± 0.0057 (17)
           sGSM     0.86 ± 0.0027 (41)     0.84 ± 0.0032 (21)
     0.2   random   0.81 ± 0.0085 (54)     0.72 ± 0.0089 (17)
           GSM      0.80 ± 0.0045 (54)     0.79 ± 0.0057 (17)
           sGSM     0.86 ± 0.0027 (41)     0.84 ± 0.0033 (21)
     0.4   random   0.79 ± 0.0101 (54)     0.66 ± 0.0117 (17)
           GSM      0.76 ± 0.0057 (54)     0.75 ± 0.0071 (17)
           sGSM     0.84 ± 0.0033 (41)     0.82 ± 0.0039 (21)
     0.6   random   0.74 ± 0.0126 (54)     0.61 ± 0.0134 (17)
           GSM      0.72 ± 0.0061 (54)     0.70 ± 0.0084 (17)
           sGSM     0.82 ± 0.0038 (41)     0.80 ± 0.0044 (21)
     0.8   random   0.62 ± 0.0143 (54)     0.46 ± 0.0124 (17)
           GSM      0.59 ± 0.0081 (54)     0.58 ± 0.0106 (17)
           sGSM     0.71 ± 0.0052 (41)     0.67 ± 0.0062 (21)
     1.0   random   0.62 ± 0.0143 (54)     0.45 ± 0.0124 (17)
           GSM      0.59 ± 0.0081 (54)     0.55 ± 0.0097 (17)
           sGSM     0.70 ± 0.0052 (41)     0.64 ± 0.0063 (21)
De   0.0   random   0.70 ± 0.0039 (34)     0.56 ± 0.0046 (13)
           GSM      0.85 ± 0.0023 (34)     0.84 ± 0.0039 (13)
           sGSM     0.87 ± 0.0045 (33)     0.84 ± 0.0044 (14)
     0.2   random   0.69 ± 0.0042 (34)     0.56 ± 0.0047 (13)
           GSM      0.85 ± 0.0024 (34)     0.84 ± 0.0039 (13)
           sGSM     0.87 ± 0.0046 (33)     0.84 ± 0.0044 (14)
     0.4   random   0.64 ± 0.0052 (34)     0.45 ± 0.0064 (13)
           GSM      0.83 ± 0.0029 (34)     0.81 ± 0.0045 (13)
           sGSM     0.84 ± 0.0055 (33)     0.81 ± 0.0054 (14)
     0.6   random   0.59 ± 0.0054 (34)     0.38 ± 0.0065 (13)
           GSM      0.80 ± 0.0033 (34)     0.78 ± 0.0056 (13)
           sGSM     0.82 ± 0.0063 (33)     0.79 ± 0.0062 (14)
     0.8   random   0.45 ± 0.0058 (34)     0.25 ± 0.0061 (13)
           GSM      0.72 ± 0.0037 (34)     0.66 ± 0.0081 (13)
           sGSM     0.75 ± 0.0079 (33)     0.65 ± 0.0077 (14)
     1.0   random   0.45 ± 0.0058 (34)     0.24 ± 0.0061 (13)
           GSM      0.71 ± 0.0040 (34)     0.63 ± 0.0082 (13)
           sGSM     0.75 ± 0.0079 (33)     0.63 ± 0.0081 (14)

Table 5.14: An expanded version of Table 5.9 (human comprehension experiments), reporting results with various values of τ.
A more comprehensive variant of Table 5.9 is given in Table 5.14. This table reports the same human-subjects experiments as before; it only varies the measure used to assess the quality of the humans' guesses, by varying the threshold τ. Note that τ = 1 assesses exact-match accuracy, τ = 0.6 as in Table 5.9 corresponds roughly to synonymy (at least for content words), and τ = 0 assesses average unthresholded cosine similarity. We find that sGSM consistently outperforms both GSM and the random baseline over the entire range of τ. As we get closer to exact match, the random baseline suffers the largest drop in performance.

Model   n-grams   QueueSize   µ=1.0   µ=0.4   µ=0.2   µ=0.1   µ=0.05   µ=0.0
sGSM    2,3,4     1           0.108   0.207   0.264   0.263   0.238    0.175
sGSM    3,4       1           0.113   0.199   0.258   0.274   0.277    0.189
sGSM    3,4       4           -       -       0.267   0.286   -        -
sGSM    3,4       8           -       -       0.288   0.292   -        -
GSM     ∅         1           0.159
GSM     ∅         4           0.171
GSM     ∅         8           0.172

Table 5.11: MRR scores obtained with different hyperparameter settings.
Similarly, Table 5.15 shows an expanded version of the retention results in Table 5.10. The gap between the models is smaller on retention than it was on comprehension. However, again sGSM > GSM > random across the range of τ. We find that for function words, the random baseline performs as well as GSM as τ is increased. For content words, however, the random baseline falls faster than GSM.

We warn that the numbers are not genuinely comparable across the 3 models, because each model resulted in a different document and thus a different vocabulary quiz. Our human subjects were asked to translate just the L2 words in the document they read. In particular, sGSM taught fewer total types (71) than GSM (75) or the random baseline (106). All that Table 5.15 shows is that it taught its chosen types better (on average) than the other methods taught their chosen types.

Model   n-grams   QueueSize   µ=1.0   µ=0.4   µ=0.2   µ=0.1   µ=0.05   µ=0.0
sGSM    2,3,4     1           149     301     327     275     201      247
sGSM    3,4       1           190     340     439     399     341      341
sGSM    3,4       4           -       -       462     440     -        -
sGSM    3,4       8           -       -       478     450     -        -
GSM     ∅         1           549
GSM     ∅         4           557
GSM     ∅         8           530

Table 5.12: Number of L1 tokens replaced by L2 glosses under different hyperparameter settings.
5.9 Macaronic Examples

Below, we display the actual macaronic documents generated by our methods. The first few paragraphs of "Sense and Sensibility" with the sGSM model, using µ = 0.2, 3- and 4-grams, priority queue size of 8, and r_max = 4, are shown below:
Model   n-grams   QueueSize   µ=1.0   µ=0.4   µ=0.2   µ=0.1   µ=0.05   µ=0.0
sGSM    2,3,4     1           39      97      121     106     75       88
sGSM    3,4       1           44      97      125     124     112      99
sGSM    3,4       4           -       -       124     127     -        -
sGSM    3,4       8           -       -       145     129     -        -
GSM     ∅         1           106
GSM     ∅         4           111
GSM     ∅         8           114

Table 5.13: Number of distinct L2 word types present in the macaronic document under different hyperparameter settings.
Sense y Sensibility

La family de Dashwood llevaba long been settled en Sussex. Their estate era large, and their residencia was en Norland Park, in el centre de their property, where, for muchas generations, they habían lived en so respectable a manner as to engage el general good opinion of los surrounding acquaintance. El late owner de this propiedad was un single man, que lived to una very advanced age, y que durante many years of his life, had a constante companion and housekeeper in su sister. But ella death, que happened ten años antes his own, produced a great alteration in su home; for to supply her loss, he invited and received into his house la family of su sobrino señor Henry Dashwood, the legal inheritor of the Norland estate, and the person to whom he intended to bequeath it. En la society de su nephew y niece, y their children, el old Gentleman's days fueron comfortably spent. Su attachment a them all increased. The constant attention de Mr. y Mrs. Henry Dashwood to sus wishes, que proceeded no merely from interest, but from goodness of heart, dio him every degree de solid comfort which su age podía receive; and la cheerfulness of the children added a relish to his existencia.

By un former marriage, Mr. Henry Dashwood tenía one son: by su present lady, three hijas. El son, un steady respectable young man, was amply provided for por the fortuna de his mother, which había been large, y half of which devolved on him on his coming of edad. Por su own matrimonio, likewise, which happened soon después, he added a his wealth. To him therefore la succession a la Norland estate era no so really importante as to his sisters; para their fortuna, independent de what pudiera arise a ellas from su father's inheriting that propiedad, could ser but small. Su mother had nothing, y their father only seven mil pounds en his own disposición; for la remaining moiety of his first esposa's fortune was also secured to her child, and él tenía sólo a life-interés in it.

el anciano gentleman died: his will was read, and like almost todo other will, dio as tanto disappointment as pleasure. He fue neither so unjust, ni so ungrateful, as para leave his estate de his nephew; --but he left it a him en such terms as destroyed half the valor de el bequest. Mr. Dashwood había wished for it more por el sake of his esposa and hijas than for himself or su son; --but a his son, y su son's son, un child de four años old, it estaba secured, in tal a way, as a leave a himself no power de providing por those que were most dear para him, and who most necesitaban a provisión by any charge on la estate, or por any sale de its valuable woods. El whole fue tied arriba para the beneficio de this child, quien, in occasional visits with his padre and mother at Norland, had tan far gained on el affections de his uncle, by such attractions as are by no means unusual in children of two o three years old; una imperfect articulación, an earnest desire of having his own way, many cunning tricks, and a great deal of noise, as to outweigh all the value de all the attention which, for years, él había received from his niece and sus daughters. He meant no a ser unkind, however, y, como a mark de his affection for las three girls, he left ellas un mil libras a-piece.
Next, the first few paragraphs of "Sense and Sensibility" with the GSM model, using priority queue size of 8 and r_max = 4:

Sense y Sensibility

La family de Dashwood llevaba long been settled en Sussex. Su estate era large, and su residence estaba en Norland Park, in el centre de their property, where, por many generations, they had lived in so respectable una manner as a engage el general good opinion de los surrounding acquaintance. El late owner de esta estate was un single man, que lived to una very advanced age, y who durante many years de su existencia, had una constant companion y housekeeper in his sister. But ella death, que happened ten years antes su own, produced a great alteration in su home; for para supply her loss, él invited and received into his house la family de su nephew Mr. Henry Dashwood, the legal inheritor de the Norland estate, and the person to whom se intended to bequeath it. In the society de su nephew and niece, and sus children, el old Gentleman's days fueron comfortably spent. Su attachment a them all increased. La constant attention de Mr. y Mrs. Henry Dashwood to sus wishes, which proceeded not merely from interest, but de goodness de heart, dio him every degree de solid comfort que his age could receive; y la cheerfulness of the children added un relish a su existence.

By un former marriage, Mr. Henry Dashwood tenía one son: by su present lady, three hijas. El son, un steady respectable joven man, was amply provided for por la fortune de su madre, que había been large, y half de cuya devolved on him on su coming de edad. By su own marriage, likewise, que happened soon después, he added a su wealth. Para him therefore la succession a la Norland estate was no so really importante as to his sisters; para their fortune, independent de what pudiera arise a them from su father's inheriting that property, could ser but small. Su madre had nothing, y su padre only siete thousand pounds in su own disposal; for la remaining moiety of his first wife's fortune era also secured a su child, y él had only una life-interest in ello.

el old gentleman died: su will was read, y like almost every otro will, gave as tanto disappointment as pleasure. He fue neither so unjust, nor so ungrateful, as to leave su estate from his nephew; --but he left it to him en such terms como destroyed half the valor of the bequest. Mr. Dashwood había wished for it más for el sake de su wife and daughters than para himself or su hijo; --but a su hijo, y his son's hijo, un child de four años old, it estaba secured, en tal un way, as a leave a himself no power of providing for aquellos who were most dear para him, y who most needed un provision by any charge sobre la estate, or por any sale de its valuable woods. El whole was tied arriba for el benefit de this child, quien, en ocasionales visits with his father and mother at Norland, had tan far gained on the affections of his uncle, by such attractions as are por no means unusual in children of two or three years old; an imperfect articulation, an earnest desire of having his own way, many cunning tricks, and a gran deal of noise, as to outweigh todo the value of all the attention which, for years, he had received from his niece and her daughters. He meant no a ser unkind, however, y, como una mark de su affection por las three girls, he left them un mil pounds a-pieza.
Next, the first few paragraphs of "The Metamorphosis" with the sGSM model, using µ = 0.2, 3- and 4-grams, priority queue size of 8, and r_max = 4:

Metamorphosis

One morning, als Gregor Samsa woke from troubled dreams, he fand himself transformed in seinem bed into einem horrible vermin. He lay on seinem armour-like back, und if er lifted seinen head a little he konnte see his brown belly, slightly domed und divided by arches into stiff sections. The bedding war hardly able zu cover it und seemed ready zu slide off any moment. His many legs, pitifully thin compared mit the size von dem rest von him, waved about helplessly as er looked.

"What's happened to mir?" he thought. His room, ein proper human room although ein little too small, lay peacefully between its four familiar walls. Eine collection of textile samples lay spread out on the table - Samsa was ein travelling salesman - and above it there hung a picture das he had recently cut out von einer illustrated magazine und housed in einem nice, gilded frame. It showed eine lady fitted out mit a fur hat und fur boa who sat upright, raising einen heavy fur muff der covered the whole von her lower arm towards dem viewer.

Gregor then turned zu look out the window at the dull weather. Drops von rain could sein heard hitting the pane, welche made him fühlen quite sad. "How about if ich sleep ein little bit longer and forget all diesen nonsense," he thought, aber that was something er was unable zu do because he war used zu sleeping auf his right, und in seinem present state couldn't bringen into that position. However hard he threw sich onto seine right, he always rolled zurück to where he was. He must haben tried it a hundert times, shut seine eyes so dass er wouldn't haben zu look at die floundering legs, and only stopped when er began zu fühlen einen mild, dull pain there das he hatte never felt before.

"Ach, God," he thought, "what a strenuous career it is das I've chosen! Travelling day in und day out. Doing business like diese takes viel more effort than doing your own business at home, und auf top of that there's the curse des travelling, worries um making train connections, bad und irregular food, contact mit different people all the time so that du can nie get to know anyone or become friendly mit ihnen. It can alles go zum Hell!" He felt a slight itch up auf his belly; pushed himself slowly up auf his back towards dem headboard so dass he konnte lift his head better; fand where das itch was, und saw that es was covered mit vielen of little weißen spots which he didn't know what to make of; und als he versuchte to fühlen the place with one von seinen legs he drew it quickly back because as soon as he touched it he was overcome von a cold shudder.
Finally, the first few paragraphs of "The Metamorphosis" with the GSM model, using priority queue size of 8 and r_max = 4:

Metamorphosis

One morning, als Gregor Samsa woke from troubled dreams, he fand himself transformed in his bed into einem horrible vermin. Er lay on seinem armour-like back, und if er lifted his head a little er could see seinen brown belly, slightly domed und divided by arches into stiff teile. das bedding was hardly fähig to cover es und seemed ready zu slide off any moment. His many legs, pitifully thin compared mit the size von dem rest von him, waved about helplessly als er looked.

"What's happened to mir?" er thought. His room, ein proper human room although ein little too klein, lay peacefully between seinen four familiar walls. Eine collection of textile samples lay spread out on the table - Samsa was ein travelling salesman - und above it there hung a picture that er had recently cut aus of einer illustrated magazine und housed in einem nice, gilded frame. Es showed a lady fitted out with a fur hat and fur boa who saß upright, raising a heavy fur muff der covered the whole of her lower arm towards dem viewer.

Gregor then turned zu look out the window at the dull weather. Drops von rain could sein heard hitting the pane, which machte him feel ganz sad. "How about if ich sleep ein little bit longer and forget all diesen nonsense," he thought, but that war something he was unable to tun because er was used to sleeping auf his right, and in his present state couldn't get into that position. However hard he warf himself onto seine right, he always rolled zurück to wo he was. Er must haben tried it ein hundred times, shut seine eyes so dass he wouldn't haben to sehen at die floundering legs, und only stopped when he begann to feel einen mild, dull pain there that he hatte nie felt before.

"Ach, God," he thought, "what a strenuous career it ist that I've chosen! Travelling day in und day aus. Doing business like diese takes much mehr effort than doing your own business at home, und on oben of that there's der curse of travelling, worries um making train connections, bad and irregular food, contact with different people all the time so that you kannst nie get to know anyone or become friendly with ihnen. It kann all go to Teufel!" He felt ein slight itch up auf seinem belly; pushed himself slowly up auf his back towards dem headboard so dass he could lift his head better; fand where das itch was, and saw that it was besetzt with lots of little weißen spots which he didn't know what to make of; and als he tried to feel the place with one of his legs he drew it quickly back because as soon as he touched it he was overcome by a cold shudder.
5.10 Conclusion

We presented a method to generate macaronic (mixed-language) documents to aid foreign language learners with vocabulary acquisition. Our key idea is to derive a model of student learning from only a cloze language model, which uses both context and spelling features. We find that our model-based teacher generates comprehensible macaronic text that promotes vocabulary learning. We find noticeable differences between the word replacement choices of the GSM (which only uses context) and sGSM (which uses spelling and context) models, especially in the English-Spanish case shown in §5.9. We find more L2 replacements for words whose spelling overlaps strongly with their English counterparts. For example, existencia, fortuna, matrimonio, propiedad, necesitaban, beneficio, articulación, interés, importante, constante and residencia were all replaced using the sGSM model. As further confirmation, we find exact replacements were also selected by the sGSM model, such as Dashwood, Park and general. The GSM model replaced fewer high-overlap tokens; among its L2 replacements, only ocasionales, importante and existencia can be seen. We leave the task of extending the method to phrasal translation and incorporating word reordering as future work. We also leave the exploration of alternate character-based compositions such as Kim et al. (2016) for future work. Beyond that, we envision machine teaching interfaces in which the student reader interacts with the macaronic text, advancing through the document, clicking on words for hints, and facing occasional quizzes (Renduchintala et al., 2016b), and with other educational stimuli.
As we began to explore in Renduchintala et al. (2016a) and Renduchintala, Koehn, and Eisner (2017), interactions provide feedback that the machine teacher could use to adjust its model of the student's lexicons (here E, F), inference (here θf, θb, θh, µ), and learning (here λ). In this context, we are interested in using models that are student-specific (to reflect individual learning styles), stochastic (since the student's observed behavior may be inconsistent owing to distraction or fatigue), and able to model forgetting as well as learning (Settles and Meeder, 2016).
L2   τ     Model    Closed-class           Open-class
Es   0.0   random   0.67 ± 0.0037 (60)     0.60 ± 0.0027 (46)
           GSM      0.67 ± 0.0060 (60)     0.62 ± 0.0076 (15)
           sGSM     0.71 ± 0.0035 (47)     0.68 ± 0.0028 (24)
     0.2   random   0.67 ± 0.0037 (60)     0.60 ± 0.0027 (46)
           GSM      0.67 ± 0.0061 (60)     0.61 ± 0.0080 (15)
           sGSM     0.71 ± 0.0036 (47)     0.67 ± 0.0029 (24)
     0.4   random   0.60 ± 0.0051 (60)     0.50 ± 0.0037 (46)
           GSM      0.60 ± 0.0086 (60)     0.51 ± 0.0106 (15)
           sGSM     0.66 ± 0.0044 (47)     0.61 ± 0.0037 (24)
     0.6   random   0.47 ± 0.0058 (60)     0.40 ± 0.0041 (46)
           GSM      0.48 ± 0.0084 (60)     0.42 ± 0.0105 (15)
           sGSM     0.52 ± 0.0054 (47)     0.50 ± 0.0037 (24)
     0.8   random   0.40 ± 0.0053 (60)     0.30 ± 0.0032 (46)
           GSM      0.41 ± 0.0078 (60)     0.37 ± 0.0097 (15)
           sGSM     0.46 ± 0.0055 (47)     0.41 ± 0.0041 (24)
     1.0   random   0.40 ± 0.0053 (60)     0.29 ± 0.0031 (46)
           GSM      0.40 ± 0.0077 (60)     0.36 ± 0.0092 (15)
           sGSM     0.45 ± 0.0053 (47)     0.39 ± 0.0042 (24)

Table 5.15: An expanded version of Table 5.10 (human retention experiments), reporting results with various values of τ.
Chapter 6

Knowledge Tracing in Sequential Learning of Inflected Vocabulary
Our macaronic framework facilitates learning novel vocabulary and linguistic structures while a student is progressing through a document sequentially. In doing so, the student should (hopefully) acquire new knowledge but may also forget what they have previously learned. Furthermore, new evidence, in the form of a new macaronic sentence for example, might force the student to adjust their understanding of previously seen L2 words and structures.

In other words, the previous chapters were concerned with what a student can learn when presented with a macaronic sentence. In Chapter 3 and Chapter 5 we make simplistic assumptions about what the student already knows and model what they gain from a new macaronic stimulus. In this chapter, we study knowledge tracing in the context of an inflection learning task. We view the learning process as a sequence of smaller learning events, and model the interaction between new knowledge (arriving via some new evidence, perhaps a macaronic sentence or, in this chapter, a flash card) and existing knowledge, which could be corrupted by forgetting, by confusion with similar vocabulary items, and so on.
Knowledge tracing attempts to reconstruct when a student acquired (or forgot) each of several facts. Yet we often hear that "learning is not just memorizing facts." Facts are not atomic objects to be discretely and independently manipulated. Rather, we suppose, a student who recalls a fact in a given setting is demonstrating a skill, by solving a structured prediction problem that is akin to reconstructive memory (Schacter, 1989; Posner, 1989) or pattern completion (Hopfield, 1982; Smolensky, 1986). The attempt at structured prediction may draw on many cooperating feature weights, some of which may be shared with other facts or skills.
In this chapter, we study models for knowledge tracing on the task of foreign-language vocabulary inflection learning; we adopt a specific structured prediction model and learning algorithm. Different knowledge states correspond to model parameter settings (feature weights). Different learning styles correspond to different hyperparameters that govern the learning algorithm.¹ As we interact with each student through a simple online tutoring system, we would like to track their evolving knowledge state and identify their learning style. That is, we would like to discover parameters and hyperparameters that can explain the evidence so far and predict how the student will react in future. This could help us make good future choices about how to instruct this student, although we leave this reinforcement learning problem to future work. We show that we can predict the student's next answer.

¹Currently, we assume that all students share the same hyperparameters (same learning style), although each student will have their own parameters, which change as they learn.
In short, we expand the notion of a knowledge tracing model to include representations for a student's (i) current knowledge, (ii) retention of knowledge, and (iii) acquisition of new knowledge. Our reconstruction of the student's knowledge state remains interpretable, since it corresponds to the weights of hand-designed features (sub-skills). Interpretability may help a future teaching system provide useful feedback to students and to human teachers, and help it construct educational stimuli that are targeted at improving particular sub-skills, such as features that select correct verb suffixes.
As mentioned, we consider a verb conjugation task instead of a macaronic learning task, where a foreign language learner learns the verb conjugation paradigm by reviewing and interacting with a series of flash cards. This task is a good testbed, as it needs the learner to deploy sub-word features and to generalize to new examples. For example, a student learning Spanish verb conjugation might encounter pairs such as (tú entras, you enter) and (yo miro, I watch). Using these examples, the student needs to recognize suffix patterns and apply them to new pairs such as (yo entro, I enter). While we considered sub-word features even in our macaronic experiments, the verb inflection task is more focused on sub-word generalizations that the student must understand in order to perform the task.
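The kind of sub-word generalization involved can be sketched with simple feature templates. These templates are illustrative only; the actual hand-designed features used in this chapter may differ:

```python
def card_features(l2_form):
    """Illustrative sub-skill features for a pronoun+verb flash card:
    the pronoun, short verb suffixes, and their conjunctions."""
    pron, verb = l2_form.split()
    feats = {f"pronoun={pron}"}
    for k in (1, 2):
        suf = verb[-k:]
        feats.add(f"suffix{k}={suf}")
        feats.add(f"pronoun={pron}^suffix{k}={suf}")
    return feats

# Sub-skills shared by a seen card and a card with a novel verb: learning
# weights on these shared features is what lets the student generalize.
shared = card_features("tú entras") & card_features("tú miras")
print(sorted(shared))
```

A student who has learned positive weights for the shared pronoun and suffix features from (tú entras) can then prefer the correct ending for a new verb with the same pronoun.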
Vocabulary learning presents a challenging learning environment due to the large number of skills (words) that need to be traced. Learning vocabulary in conjunction with inflection further complicates the challenge due to the number of new sub-skills that are introduced. Huang, Guerra, and Brusilovsky (2016) suggest that modeling sub-skill interaction is crucial in several knowledge tracing domains. For our domain, a log-linear formulation elegantly allows for arbitrary sub-skills via feature functions.
6.1 Related Work

Bayesian knowledge tracing (Corbett and Anderson, 1994) (BKT) has long been the standard method to infer a student's knowledge from his or her performance on a sequence of task items. In BKT, each skill is modeled by an HMM with two hidden states ("known" or "not-known"), and the probability of success on an item depends on the state of the skill it exercises. Transition and emission probabilities are learned from the performance data using Expectation Maximization (EM). Many extensions of BKT have been investigated, including personalization (e.g., Lee and Brunskill, 2012; Khajah et al., 2014a) and modeling item difficulty (Khajah et al., 2014b).
Our approach could be called Parametric Knowledge Tracing (PKT) because we take a student's knowledge to be a vector of prediction parameters (feature weights) rather than a vector of skill bits. Although several BKT variants (Koedinger et al., 2011; Xu and Mostow, 2012; González-Brenes, Huang, and Brusilovsky, 2014) have modeled the fact that related skills share sub-skills or features, that work does not associate a real-valued weight with each feature at each time. Either skills are still represented with separate HMMs, whose transition and/or emission probabilities are parameterized in terms of shared features with time-invariant weights; or else HMMs are associated with the individual sub-skills, and the performance of a skill depends on which of its sub-skills are in the "known" state.
Our current version is not Bayesian, since it assumes deterministic updates (but see footnote 4). A closely related line of work with deterministic updates is deep knowledge tracing (DKT) (Piech et al., 2015), which applied a classical LSTM model (Hochreiter and Schmidhuber, 1997) to knowledge tracing and showed strong improvements over BKT. Our PKT model differs from DKT in that the student's state at each time step is a more interpretable feature vector, and the state update rule is also interpretable: it is a type of error-correcting learning rule. In addition, the student's state is able to predict the student's actual response and not merely whether the response was correct. We expect that having an interpretable feature vector gives better inductive bias (see the experiment in section 6.6.1), and that it may be useful for planning future actions by smart flash card systems. Moreover, in this work we test different plausible state update rules and see how they fit actual student responses, in order to gain insight about learning.
Most recently, Settles and Meeder (2016)'s half-life regression assumes that a student's retention of a particular skill decays exponentially with time, and learns a parameter that models the rate of decay (the "half-life"). Like González-Brenes, Huang, and Brusilovsky (2014) and Settles and Meeder (2016), our model leverages a feature-rich formulation to predict the probability of a learner correctly remembering a skill, but it can also capture complex spacing/retention patterns using a neural gating mechanism. Another distinction between our work and half-life regression is that we focus on knowledge tracing within a single session, while half-life regression collapses a session into a single data point and operates on many such data points over longer time spans.
Figure 6.1: Screen grabs of card modalities during training, panels (a)–(h). These examples show cards for a native English speaker learning Spanish verb conjugation. Fig. 6.1a is an EX card; Fig. 6.1b shows an MC card before the student has made a selection; Fig. 6.1c and 6.1d show MC cards after the student has made an incorrect or correct selection, respectively; Fig. 6.1e shows an MC card that is giving the student another attempt (the system randomly decides to give the student up to three additional attempts); Fig. 6.1f shows a TP card where a student is completing an answer; Fig. 6.1g shows a TP card that has marked a student answer wrong and then revealed the right answer (the reveal is decided randomly); and finally Fig. 6.1h shows a card that is giving a student feedback for their answer.
6.2 Verb Conjugation Task
We devised a flash card training system to teach verb conjugations in a foreign language. In this study, we only asked the student to translate from the foreign language to English, not vice versa.²
6.2.1 Task Setup
We consider a setting where students go through a series of interactive flash cards during a training session. Figure 6.1 shows the three types of cards: (i) Example (EX) cards simply display a foreign phrase and its English translation (for 7 seconds). (ii) Multiple-Choice (MC) cards show a single foreign phrase and require the student to select one of five possible English phrases shown as options. (iii) Typing (TP) cards show a foreign phrase and a text input box, requiring the student to type out what they think is the English translation. Our system can provide feedback for each student response. (i) Indicative Feedback: this refers to marking a student's answer as correct or incorrect (Fig. 6.1c, 6.1d and 6.1h). Indicative feedback is always shown for both MC and TP cards. (ii) Explicit Feedback: if the student makes an error on a TP card, the system has a 50% chance of showing them the true answer (Fig. 6.1g). (iii) Retry: if the student makes an error on an MC card, the system has a 50% chance of allowing them to try again, up to a maximum of 3 attempts.
² We would regard these as two separate skills that share parameters to some degree, an interesting subject for future study.
aceptar (to accept)
  SPre: yo acepto (I accept); tú aceptas (you accept); él acepta (he accepts); ella acepta (she accepts)
  SF:   yo aceptaré (I will accept); tú aceptarás (you will accept*); él aceptará (he will accept); ella aceptará (she will accept)
  SP:   yo acepté (I accepted*); tú aceptaste (you accepted); él aceptó (he accepted); ella aceptó (she accepted)
entrar (to enter)
  SPre: yo entro (I enter); tú entras (you enter); él entra (he enters); ella entra (she enters)
  SF:   yo entraré (I will enter); tú entrarás (you will enter); él entrará (he will enter); ella entrará (she will enter)
  SP:   yo entré (I entered); tú entraste (you entered); él entró (he entered); ella entró (she entered)
mirar (to watch)
  SPre: yo miro (I watch*); tú miras (you watch*); él mira (he watches*); ella mira (she watches)
  SF:   yo miraré (I will watch); tú mirarás (you will watch*); él mirará (he will watch); ella mirará (she will watch)
  SP:   yo miré (I watched); tú miraste (you watched); él miró (he watched*); ella miró (she watched)

Table 6.1: Content used in training sequences. Each lemma appears in the infinitive and in three tenses (simple present SPre, simple future SF, simple past SP), each in four persons (1st, 2nd, 3rd masculine, 3rd feminine). Phrase pairs marked with * were used for the quiz at the end of the training sequence. This Spanish content was then transformed using the method in section 6.5.1.
6.2.2 Task Content
In this particular task we used three verb lemmas, each inflected in 13 different ways (Table 6.1). The inflections included three tenses (simple past, present, and future) in each of four persons (first, second, third masculine, third feminine), as well as the infinitive form. We ensured that each surface realization was unique and regular, resulting in 39 possible phrases.³ Seven phrases from this set were randomly selected for a quiz, which is shown at the end of the training session, leaving 32 phrases that a student may see in the training session. The student's responses on the quiz do not receive any feedback from the system. We also limited the training session to 35 cards (some of which may require multiple rounds of interaction, owing to retries). All of the methods presented in this paper could be applied to larger content sets as well.
³ The inflected surface forms included explicit pronouns.
6.3 Notation
We will use the following conventions in this paper. System actions a_t, student responses y_t, and feedback items a′_t are subscripted by a time 1 ≤ t ≤ T. Other subscripts pick out elements of vectors or matrices. Ordinary lowercase letters indicate scalars (α, β, etc.), boldfaced lowercase letters indicate vectors (θ, y, w^zx), and boldfaced uppercase letters indicate matrices (Φ, W^hh, etc.). The roman-font superscripts are part of the vector or matrix name.
6.4 Student Models
6.4.1 Observable Student Behavior
A flash card is a structured object a = (x, O), where x ∈ X is the foreign phrase and O is a set of allowed responses. For an MC card, O is the set of 5 multiple-choice options on that card (or fewer on a retry attempt). For an EX or TP card, O is the set of all 39 English phrases (the TP user interface prevents the student from submitting a guess outside this set). For non-EX cards, we assume the student samples their response y ∈ O from a log-linear distribution parameterized by their knowledge state θ ∈ R^d:
\[ p(y \mid a; \theta) = p(y \mid x, O; \theta) = \frac{\exp(\theta \cdot \phi(x, y))}{\sum_{y' \in O} \exp(\theta \cdot \phi(x, y'))} \tag{6.1} \]

where ϕ(x, y) ∈ {0, 1}^d is a feature vector extracted from the (x, y) pair.
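The response model in equation (6.1) can be sketched in a few lines of code; the dictionary-based feature function and weights below are illustrative placeholders, not the study's actual feature set.

```python
import math

def response_distribution(theta, options, phi, x):
    """Log-linear distribution over allowed responses, as in eq. (6.1).

    theta   : dict mapping feature name -> weight (the knowledge state)
    options : list of candidate English phrases O
    phi     : feature function, phi(x, y) -> set of active binary features
    x       : the foreign phrase on the card
    """
    scores = [sum(theta.get(f, 0.0) for f in phi(x, y)) for y in options]
    z = sum(math.exp(s) for s in scores)
    return {y: math.exp(s) / z for y, s in zip(options, scores)}

# Toy example with a single hypothetical phrasal indicator feature.
phi = lambda x, y: {f"phrase:{x}|{y}"}
theta = {"phrase:yo miro|I watch": 2.0}  # the student has learned this pairing
p = response_distribution(theta, ["I watch", "I enter"], phi, "yo miro")
```

A higher weight on the learned (x, y) pair pushes probability mass toward that response, exactly as the normalized exponential in (6.1) prescribes.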
6.4.2 Feature Design
The student's knowledge state is described by the weights θ placed on the features ϕ(x, y) in equation (6.1). We assume the following binary features will suffice to describe the student's behavior.
• Phrasal features: We include a unique indicator feature for each possible (x, y) pair, yielding 39² features. For example, there exists a feature that fires iff x = yo miro ∧ y = I enter.
• Word features: We include indicator features for all (source word, target word) pairs: e.g., yo ∈ x ∧ enter ∈ y. (These words need not be aligned.)
• Morpheme features: We include indicator features for all (w, m) pairs, where w is a word of the source phrase x, and m is a possible tense, person, or number for the target phrase y (drawn from Table 6.1). For example, m might be 1st (first person) or SPre (simple present).
• Prefix and suffix features: For each word or morpheme feature that fires, 8 backoff features also fire, where the source word and (if present) the target word are replaced by their first or last i characters, for i ∈ {1, 2, 3, 4}.
These templates yield about 4600 features in all, so the knowledge state has d ≈ 4600 dimensions.
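A simplified sketch of these feature templates follows; it backs off only the source word, and the feature-name conventions are hypothetical.

```python
def extract_features(x, y, morphs):
    """Simplified sketch of the feature templates in section 6.4.2.

    x, y   : foreign and English phrases (whitespace-tokenized)
    morphs : hypothetical map from an English phrase to its tense/person tags
    """
    feats = {f"phrase:{x}|{y}"}                    # phrasal indicator
    for w in x.split():
        for v in y.split():                        # word-pair indicators (unaligned)
            feats.add(f"word:{w}|{v}")
        for m in morphs.get(y, ()):                # morpheme indicators
            feats.add(f"morph:{w}|{m}")
    # Prefix/suffix backoff: replace the source word by its first/last i chars.
    for f in list(feats):
        if f.startswith(("word:", "morph:")):
            kind, rest = f.split(":", 1)
            w, tail = rest.split("|", 1)
            for i in (1, 2, 3, 4):
                feats.add(f"{kind}-pre{i}:{w[:i]}|{tail}")
                feats.add(f"{kind}-suf{i}:{w[-i:]}|{tail}")
    return feats

feats = extract_features("yo miro", "I watch", {"I watch": ("SPre", "1st")})
```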
6.4.3 Learning Models
We now turn to the question of modeling how the student's knowledge state changes during their session. θ_t denotes the state at the start of round t. We take θ_1 = 0 and assume that the student uses a deterministic update rule of the following form:⁴
\[ \theta_{t+1} = \beta_t \odot \theta_t + \alpha_t \odot u_t \tag{6.2} \]
where u_t is an update vector that depends on the student's experience (a_t, y_t, a′_t) at round t. In general, we can regard α_t ∈ (0, 1)^d as modeling the rates at which the learner updates the various parameters according to u_t, and β_t ∈ (0, 1)^d as modeling the rates at which those parameters are forgotten. These vectors correspond respectively to the input gates and forget gates in recurrent neural network architectures such as the LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014). As in those architectures, we will use neural networks to choose α_t, β_t at each time step t, so that they may be sensitive in nonlinear ways to the context at round t.
Why this form? First imagine that the student is learning by stochastic gradient descent on some L2-regularized loss function C · ∥θ∥² + Σ_t L_t(θ). This algorithm's update rule has the simplified form

\[ \theta_{t+1} = \beta_t \cdot \theta_t + \alpha_t \cdot u_t \tag{6.3} \]
⁴ Since learning is not perfectly predictable, it would be more realistic to compute θ_t by a stochastic update, or equivalently by a deterministic update that also depends on a random noise vector ϵ_t (which is drawn from, say, a Gaussian). These noise vectors are "nuisance parameters," but rather than integrating over their possible values, a straightforward approximation is to optimize them by gradient descent, along with the other update parameters, so as to locally maximize likelihood.
where u_t = −∇L_t(θ) is the steepest-descent direction on example t, α_t > 0 is the learning rate at time t, and β_t = 1 − α_t·C handles the weight decay due to following the gradient of the regularizer.
Adaptive versions of stochastic gradient descent, such as AdaGrad (Duchi, Hazan, and Singer, 2011) and AdaDelta (Zeiler, 2012), are more like our full rule (6.2) in that they allow different learning rates for different parameters.
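The gated update (6.2), and its reduction to SGD with weight decay (6.3) when β_t = 1 − α_t·C, can be sketched as follows; the gate and update values are illustrative.

```python
def gated_update(theta, u, alpha, beta):
    """Elementwise gated knowledge-state update, as in eq. (6.2):
    theta_{t+1} = beta ⊙ theta_t + alpha ⊙ u_t."""
    return [b * th + a * ui for th, ui, a, b in zip(theta, u, alpha, beta)]

# Illustrative values. With beta = 1 - alpha*C, the rule reduces to
# SGD with weight decay, as in eq. (6.3).
theta = [0.0, 1.0]      # current knowledge state
u = [1.0, -1.0]         # update vector (e.g. a gradient direction)
alpha = [0.5, 0.5]      # per-parameter learning rates
C = 0.1                 # regularization constant
beta = [1 - a * C for a in alpha]
theta_next = gated_update(theta, u, alpha, beta)
```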
6.4.3.1 Schemes for the Update Vector u_t
We assume that u_t is the gradient of some log-probability, so that the student learns by trying to increase the log-probability of the correct answer. However, the student does not always observe the correct answer y. For example, there is no output label provided when the student only receives feedback that their answer is incorrect. Even in such cases, the student can change their knowledge state.

In this section, we define schemes for deriving u_t from the experience (a_t, y_t, a′_t) at round t. Recall that a_t = (x_t, O_t). We omit the t subscripts below.
Suppose the student is told that a particular phrase y ∈ O is the correct translation of x (via an EX card or via feedback on an answer to an MC or TP card). Then an apt strategy for the student would be to use the following gradient:⁵
\[ \Delta^{\checkmark} = \nabla_{\theta} \log p(y \mid x, O; \theta) = \phi(x, y) - \sum_{y' \in O} p(y' \mid x)\, \phi(x, y') \tag{6.4} \]

If the student is told that y is incorrect, an apt strategy is to move probability mass collectively to the other available options, increasing their total probability, since one of those options must be correct. We call this the redistribution gradient (RG):

\[ \Delta^{\times} = \nabla_{\theta} \log p(O - \{y\} \mid x, O; \theta) \tag{6.5} \]
\[ \phantom{\Delta^{\times}} = \sum_{y' \in O - \{y\}} p(y' \mid x, y' \neq y)\, \phi(x, y') - \sum_{y' \in O} p(y' \mid x)\, \phi(x, y') \tag{6.6} \]

where p(y′ | x, y′ ≠ y) is a renormalized distribution over just the options y′ ∈ O − {y}.
Note that if the student selects two wrong answers y_1, y_2 in a row on an MC card, the first update will subtract the average features of O and add those of O − {y_1}; the second update will subtract the average features of O − {y_1} and add those of O − {y_1, y_2}. The intermediate addition and subtraction cancel out if the same α vector is used at both rounds, so the net effect is to shift probability mass from the 5 initial options to the 3 remaining ones.⁶
An alternate scheme for incorrect y is to use −Δ✓. We call this the negative gradient (NG).
⁵ An objection is that for an EX or TP card, the student may not actually know the exact set of options O in the denominator. We attempted setting O to be the set of English phrases the student has seen prior to the current question. Though intuitive, this setting performed worse on all the update and gating schemes.
⁶ Arguably, a zeroth update should be allowed as well: upon first viewing the MC card, the student should have the chance to subtract the average features of the full set of possibilities and add those of the 5 options in O, since again, the system is implying that one of those 5 options must be correct.
Update Scheme         Correct          Incorrect
redistribution (RG)   u_t = Δ✓         u_t = Δ✗
negative grad. (NG)   u_t = Δ✓         u_t = −Δ✓
feature vector (FG)   u_t = ϕ(x, y)    u_t = −ϕ(x, y)

Table 6.2: Summary of update schemes (other than RNG).
Since the RG and NG update vectors both worked well for handling incorrect y, we also tried linearly interpolating them (RNG), with u_t = γ_t ⊙ Δ✗ + (1 − γ_t) ⊙ (−Δ✓). The interpolation vector γ_t has elements in (0, 1), and may depend on the context (possibly different for MC and EX cards, for example).
Finally, the feature vector (FG) scheme simply adds the features ϕ(x, y) when y is correct or subtracts them when y is incorrect. This is appropriate for a student who pays attention only to y, without bothering to note that the alternative options in O are (respectively) incorrect or correct.
Recall from section 6.2.1 that the system sometimes gives both indicative and explicit feedback, telling the student that one phrase is incorrect and a different phrase is correct. We treat these as two successive updates with update vectors u_t and u_{t+1}. Notice that in the FG scheme, adding this pair of update vectors resembles a perceptron update. Table 6.2 summarizes our update schemes.
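The gradients in Table 6.2 can be sketched with sparse dictionaries; the phrasal feature function below is a hypothetical stand-in for the full feature set of section 6.4.2.

```python
import math

def _probs(theta, options, phi, x):
    # Log-linear distribution over the given options, as in eq. (6.1).
    s = [math.exp(sum(theta.get(f, 0.0) for f in phi(x, y))) for y in options]
    z = sum(s)
    return [si / z for si in s]

def update_vector(theta, options, phi, x, y, correct, scheme="RG"):
    """Update vectors from section 6.4.3.1, returned as sparse dicts.
    A correct answer uses the gradient of eq. (6.4); an incorrect answer
    uses the redistribution gradient (RG, eq. 6.6) or its NG alternative."""
    p = _probs(theta, options, phi, x)
    u = {}
    def add(feats, w):
        for f in feats:
            u[f] = u.get(f, 0.0) + w
    if correct or scheme == "RG":
        for yp, pp in zip(options, p):     # subtract expected features under p
            add(phi(x, yp), -pp)
        if correct:
            add(phi(x, y), 1.0)            # Δ✓ (eq. 6.4)
        else:
            rest = [yp for yp in options if yp != y]
            pr = _probs(theta, rest, phi, x)   # renormalized over O - {y}
            for yp, pp in zip(rest, pr):
                add(phi(x, yp), pp)        # Δ✗ (eq. 6.6)
    else:                                  # NG: negate the Δ✓ gradient
        for yp, pp in zip(options, p):
            add(phi(x, yp), pp)
        add(phi(x, y), -1.0)
    return u

phi = lambda x, y: {f"p:{x}|{y}"}          # hypothetical phrasal features
u_ok = update_vector({}, ["a", "b", "c"], phi, "x", "a", correct=True)
u_rg = update_vector({}, ["a", "b", "c"], phi, "x", "a", correct=False)
```

With a uniform initial state, the correct-answer update boosts the chosen pair and penalizes the alternatives, while the RG update shifts mass away from the refuted option onto the remaining ones.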
6.4.3.2 Schemes for the Gates α_t, β_t, γ_t
We characterize each update t by a 7-dimensional context vector c_t, which summarizes what the student has experienced. The first three elements in c_t are binary indicators of the type of flash card (EX, MC or TP). The next three elements are binary indicators of the type of information that caused the update: correct student answer, incorrect student answer, or revealed answer (via an EX card or explicit feedback). As a reminder, the system can respond with an indication that the answer is correct or incorrect, or it can reveal the answer. Finally, the last element of c_t is 1/|O|, the chance probability of success on this card. From c_t, we define
\[ \alpha_t = \sigma(\mathbf{W}^{\alpha} c_t + \mathbf{b}^{\alpha}) \in (0, 1)^d \tag{6.7} \]
\[ \beta_t = \sigma(\mathbf{W}^{\beta} c_{t-1} + \mathbf{b}^{\beta}) \in (0, 1)^d \tag{6.8} \]
\[ \gamma_t = \sigma(\mathbf{W}^{\gamma} c_t + \mathbf{b}^{\gamma}) \in (0, 1)^d \tag{6.9} \]
where c_0 = 0. Each gate vector is now parameterized by a weight matrix W ∈ R^{d×7}, where d is the dimensionality of the gradient and knowledge state.
We also tried simpler versions of this model. In the vector model (VM), we define α_t = σ(b^α), and β_t, γ_t similarly. These vectors do not vary with time and simply reflect that some parameters are more labile than others. Finally, the scalar model (SM) defines α_t = σ(b^α · 1), so that all parameters are equally labile. One could also imagine tying the gates for features derived from the same template, meaning that some kinds of features (in some contexts) are more labile than others, or reducing the number of parameters by learning low-rank W matrices.
While we also tried augmenting the context vector c_t with the knowledge state θ_t, this resulted in far too many parameters to train well, and did not help performance in pilot tests.
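A sketch of the context vector and of one sigmoid gate as in equations (6.7)–(6.9), with made-up weights for a toy dimensionality d = 2:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def context_vector(card_type, outcome, n_options):
    """7-dim context c_t: card-type indicators (EX, MC, TP), outcome
    indicators (correct, incorrect, revealed), and chance success 1/|O|."""
    c = [0.0] * 7
    c[["EX", "MC", "TP"].index(card_type)] = 1.0
    c[3 + ["correct", "incorrect", "revealed"].index(outcome)] = 1.0
    c[6] = 1.0 / n_options
    return c

def gate(W, b, c):
    """One gate vector, e.g. alpha_t = sigmoid(W^a c_t + b^a) (eq. 6.7).
    W is a d x 7 matrix (list of rows), b a length-d bias."""
    return [sigmoid(sum(wij * cj for wij, cj in zip(row, c)) + bi)
            for row, bi in zip(W, b)]

# Illustrative d = 2 gate with made-up weights.
c = context_vector("MC", "incorrect", 5)
W = [[0.0] * 7, [1.0] * 7]
b = [0.0, 0.0]
alpha = gate(W, b, c)
```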
6.4.4 Parameter Estimation
We tune the W and b parameters of the model by maximum likelihood, so as to better predict the students' responses y_t. The likelihood function is

\[ p(y_1, \ldots, y_T \mid a_1, \ldots, a_T) = \prod_{t=1}^{T} p(y_t \mid a_{1:t}, y_{1:t-1}, a'_{1:t-1}) = \prod_{t=1}^{T} p(y_t \mid a_t; \theta_t) \tag{6.10} \]

where we take p(y_t | ···) = 1 at steps where the student makes no response (EX cards and explicit feedback). Note that the model assumes that θ_t is a sufficient statistic of the student's past experiences.
For each (update scheme, gating scheme) combination, we trained the parameters using SGD with RMSProp updates (Tieleman and Hinton, 2012) to maximize the regularized log-likelihood

\[ \Big( \sum_{t:\, \tau_t = 0} \log p(y_t \mid x_t; \theta_t) \Big) - C \cdot \lVert W \rVert^2 \tag{6.11} \]

summed over all students. Note that θ_t depends on the parameters through the gated update rule (6.2). The development set was used for early stopping and to tune the regularization parameter C.⁷
⁷ We searched C ∈ {0.00025, 0.0005, 0.001, ..., 0.01, 0.025, 0.05, 0.1} for each gating model and update scheme combination. C = 0.0025 gave best results for the CM models, 0.01 for VM and 0.0005 for SM.
6.5 Data Collection
We recruited 153 unique "students" via Amazon Mechanical Turk (MTurk). MTurk participants were compensated $1 for completing the training and test sessions, and a bonus of $10 was given to the three top-scoring students. In our dataset, we retained only the 121 students who answered all questions.
6.5.1 Language Obfuscation
Fig. 6.1 shows a few example flash cards for a native English speaker learning Spanish. Table 6.1 shows all our Spanish-English phrase pairs. In our actual task, however, we invented an artificial language for the MTurk students to learn, which allowed us to ignore the problem of students with different initial knowledge levels. We generated our artificial language by enciphering the Spanish orthographic representations. We created a mapping from the true source string alphabet to an alternative, manually defined alphabet, while attempting to preserve pronounceability (by mapping vowels to vowels, etc.). For example, mirar was transformed into melil and tú aceptas became pi icedpiz.
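A cipher of this kind can be sketched as a character substitution; the mapping below is invented for illustration and is not the one used in the study (so it does not reproduce melil), but like the study's mapping it sends vowels to vowels to preserve pronounceability.

```python
def make_cipher(mapping):
    """Deterministic character-substitution cipher, in the spirit of
    section 6.5.1. Characters absent from the mapping pass through."""
    def encipher(s):
        return "".join(mapping.get(ch, ch) for ch in s)
    return encipher

# Hypothetical vowel-to-vowel, consonant-to-consonant mapping.
mapping = {"a": "e", "e": "i", "i": "o", "o": "u", "u": "a",
           "m": "n", "r": "l", "t": "p", "n": "m", "l": "r"}
encipher = make_cipher(mapping)
obfuscated = encipher("mirar")
```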
6.5.2 Card Ordering Policy
In the future, we expect to use planning or reinforcement learning to choose the sequence of stimuli for the student. For the present study of student behavior, however, we hand-designed a simple stochastic policy for choosing the stimuli.

The policy must decide what foreign phrase and card modality to use at each training step. Our policy likes to repeat phrases with which participants had trouble, in hopes that these already-taught phrases are on the verge of being learned. It also likes to pick out new phrases. This was inspired by the popular Leitner (1972) approach, which devised a system of buckets that control how frequently an item is reviewed by a student. Leitner proposed buckets with review frequency rates of every day, every 2 days, every 4 days and so on.
For each foreign phrase x ∈ X, we maintain a novelty score v_x, which is a function of the number of times the phrase is exposed to a student, and an error score e_x, which is a function of the number of times the student incorrectly responded to the phrase. These scores are initialized to 1 and updated as follows:⁸

\[ v_x \leftarrow v_x - 1 \quad \text{when } x \text{ is viewed} \]
\[ e_x \leftarrow \begin{cases} 2\, e_x & \text{when the student gets } x \text{ wrong} \\ 0.5\, e_x & \text{when the student gets } x \text{ right} \end{cases} \]
\[ x \sim \frac{g(\mathbf{v}) + g(\mathbf{e})}{2} \tag{6.12} \]
⁸ Arguably we should have updated e_x instead by adding/subtracting 1, since it will be exponentiated later.
On each round, we sample a phrase x from either P_v or P_e (equal probability); these distributions are computed by applying a softmax g(·) over the vectors v and e respectively (see Eq. 6.12). Once the phrase x is decided, the modality (EX, MC, TP) is chosen stochastically using probabilities (0.2, 0.4, 0.4), except that probabilities (1, 0, 0) are used for the first example of the session, and (0.4, 0.6, 0) if x is not "TP-qualified." A phrase is TP-qualified if the student has seen both x's pronoun and x's verb lemma on previous cards (even if their correct translation was not revealed). For an MC card, the distractor phrases are sampled uniformly without replacement from the 38 other phrases.
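The phrase-sampling step of this policy can be sketched as follows; the scores are toy values for a three-phrase inventory.

```python
import math, random

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def choose_phrase(v, e, rng):
    """Sample a phrase index: flip a fair coin between the novelty
    distribution P_v = softmax(v) and the error distribution P_e = softmax(e),
    then draw from the chosen distribution (the mixture in eq. 6.12)."""
    p = softmax(v) if rng.random() < 0.5 else softmax(e)
    r, acc = rng.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r < acc:
            return i
    return len(p) - 1

# Toy scores: phrase 2 is unseen (highest novelty) and phrase 0 has been
# answered wrong twice (error score doubled twice).
v = [1 - 3, 1 - 1, 1]        # novelty after 3, 1, and 0 exposures
e = [1 * 2 * 2, 1 * 0.5, 1]  # error scores after wrong/right answers
rng = random.Random(0)
idx = choose_phrase(v, e, rng)
```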
6.6 Results & Experiments
We partitioned the students into three groups: 80 students for training, 20 for development, and 21 for testing. Most students found the task difficult; the average score on the 7-question quiz was 2.81 correct, with a maximum score of 6. (Recall from section 6.2.2 that the quiz questions were typing questions, not multiple choice questions.) The histogram of user performance is shown in Fig. 6.2.
After constructing each model, we evaluated it on the held-out data: the 728 responses from the 21 testing students. We measure the log-probability under the model of each actual response ("cross-entropy"), and also the fraction of responses that were correctly predicted if our prediction was the model's max-probability response ("accuracy").
Table 6.3 shows the results of our experiment. All of our models were predictive, doing far better than a uniform baseline that assigned equal probability 1/|O| to all options. Our best models are shown in the final two lines, RNG+VM and RNG+CM.

[Histogram: count of users by quiz performance, scores 0–7.]
Figure 6.2: Quiz performance distribution (after removing users who scored 0).
Which update scheme was best? Interestingly, although the RG update vector is principled from a machine learning viewpoint, the NG update vector sometimes achieved better accuracy (though worse perplexity) when predicting the responses of human learners.⁹ We got our best results on both metrics by interpolating between RG and NG (the RNG scheme). Recall that the NG scheme was motivated by the notion that students who guessed wrong may not study the alternative answers (even though one is correct), either because it is too much trouble to study them or because (for a TP card) those alternatives are not actually shown.
Which gating mechanism was best? In almost all cases, we found that more parameters helped, with CM > VM > SM on accuracy, and a similar pattern on cross-entropy (with VM sometimes winning but only slightly). In short, it helps to use different learning rates for different features, and it probably helps to make them sensitive to the learning context.

⁹ Even the FG vector sometimes won (on both metrics!), but this happened only with the worst gating mechanism, SM.
[Bar chart: prediction accuracy (0.0–0.8) under conditions MC-C, MC-IC, TP-C, TP-IC.]
Figure 6.3: Plot comparing the models on test data under different conditions. Conditions MC and TP indicate multiple-choice and typing questions, respectively. These are broken down into the cases where the student answers them correctly (C) and incorrectly (IC). SM, VM, and CM represent scalar, vector, and context retention and acquisition gates (shown with different colors), respectively, while RG, NG and FG are the redistribution, negative, and feature vector update schemes (shown with different hatching patterns).
Surprisingly, the simple FG scheme outperformed both RG and NG when used in conjunction with a scalar retention and acquisition gate. This, however, did not extend to more complex gates.
Update Scheme        Gating Mechanism   accuracy   cross-ent.
(Uniform baseline)                      0.133      2.459
FG                   SM                 0.239*     2.362
FG                   VM                 0.357†     2.130
FG                   CM                 0.401      2.025
RG                   SM                 0.135      3.194
RG                   VM                 0.397†     1.909
RG                   CM                 0.405      1.938
NG                   SM                 0.185*     4.674
NG                   VM                 0.394†     2.320
NG                   CM                 0.449†*    2.244
RNG (mixed)          SM                 0.183      3.502
RNG (mixed)          VM                 0.427      1.855
RNG (mixed)          CM                 0.449      1.888
Table 6.3: Table summarizing prediction accuracy and cross-entropy (in nats per prediction) for different models. Larger accuracies and smaller cross-entropies are better. Within an update scheme, the † indicates significant improvement (McNemar's test, p < 0.05) over the next-best gating mechanism. Within a gating mechanism, the * indicates significant improvement over the next-best update scheme. For example, NG+CM is significantly better than NG+VM, so it receives a †; it is also significantly better than RG+CM, and receives a * as well. These comparisons are conducted only among the pure update schemes (above the double line). All other models are significantly better than RG+SM (p < 0.01).
Fig. 6.3 shows a breakdown of the prediction accuracy measures according to whether the card was MC or TP, and according to whether the student's answer was correct (C) or incorrect (IC). Unsurprisingly, all the models have an easier time predicting the student's guess when the student is correct, since the predicted parameters θ_t will often pick the correct answer. However, this is where the vector and context gates far outperform the scalar gates. All the models find predicting the incorrect answers of the students difficult. Moreover, when predicting these incorrect answers, the RG models do slightly better than the NG models.

The models obviously have higher accuracy when predicting student answers for MC cards than for TP cards, as MC cards have fewer options. Again, within both of these modalities, the vector and context gates outperform the scalar gate.
[Figure 6.4: two panels plotting surprisal reduction (bits, −8 to 4) against training steps (0–45).]
(a) a student with quiz score 6/7    (b) a student with quiz score 2/7
Figure 6.4: Predicting a specific student's responses. For each response, the plot shows our model's improvement in log-probability over the uniform baseline model. TP cards are the square markers connected by solid lines (the final 7 squares are the quiz), while MC cards, which have a much higher baseline, are the circle markers connected by dashed lines. Hollow and solid markers indicate correct and incorrect answers respectively. The RNG+CM model is shown in blue and the FG+SM model in red.
Finally, Fig. 6.4 examines how these models behave when making specific predictions over a training sequence for a single student. At each step we plot the difference in log-probability between our model and a uniform baseline model. Thus, a marker above 0 means that our model assigned the student's answer a probability higher than chance.¹⁰ To contrast the performance difference, we show both the highest-accuracy model (RNG+CM) and the lowest-accuracy model (RG+SM). For a high-scoring student (Fig. 6.4a), we see RNG+CM has a large margin over RG+SM and a slight upward trend. A probability higher than chance is noticeable even when the student makes mistakes. In contrast, for an average student (Fig. 6.4b), the margin between the two models is less perceptible. While the RNG+CM model is still above the RG+SM line, there are some answers where RNG+CM does very poorly. This is especially true for some of the wrong answers, for example at training steps 25, 29 and 33. Upon closer inspection of the model's error at step 33, we found the prompt received at this training step was ekki melü as an MC card, which had been shown to the student on three prior occasions, and the student had even answered it correctly on one of these occasions. This explains why the model was surprised to see the student make this error.

¹⁰ For MC cards, the chance probability is in {1/5, 1/4, 1/3}, depending on how many options remain, while for TP cards it is 1/39.
6.6.1 Comparison with a Less Restrictive Model
Our parametric knowledge tracing architecture models the student as a typical structured prediction system, which maintains weights for hand-designed features and updates them roughly as an online learning algorithm would. A natural question is whether this restricted architecture sacrifices performance for interpretability, or improves performance via useful inductive bias.
To consider the other end of the spectrum, we implemented a flexible LSTM model in the style of recent deep learning research. This alternative model predicts each response by a student (i.e., on an MC or TP card) given the entire history of previous interactions with that student as summarized by an LSTM. The LSTM architecture is formally capable of capturing update rules exactly like those of PKT, but it is far from limited to such rules.
Much like equation (6.1), at each time t we predict

\[ p(y_t = y \mid a_t) = \frac{\exp(h_t \cdot \psi(y))}{\sum_{y' \in O_t} \exp(h_t \cdot \psi(y'))} \tag{6.13} \]

for each possible response y in the set of options O_t, where ψ(y) ∈ R^d is a learned embedding of response y. Here h_t ∈ R^d denotes the hidden state of the LSTM, which evolves as the student interacts with the system and learns. h_t depends on the LSTM inputs for all times < t, just like the knowledge state θ_t in equations (6.1)–(6.2). It also depends on the LSTM input for time t, since that specifies the flash card a_t to which we are predicting the response y_t.
Each flash card a = (x, O) is encoded by a concatenation ā of three vectors: a one-hot 39-dimensional vector specifying the foreign phrase x, a 39-dimensional binary vector indicating the possible English options in O, and a one-hot vector indicating whether the card is EX, MC, or TP.
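This encoding can be sketched directly; the phrase inventory below is a placeholder, and for simplicity the same inventory indexes both the foreign phrases and the English options.

```python
def encode_card(x, options, card_type, phrases):
    """Sketch of the flash-card encoding for the LSTM baseline (section 6.6.1):
    concatenation of a one-hot phrase vector, a binary options vector, and a
    one-hot card-type vector. `phrases` is the full 39-phrase inventory."""
    x_vec = [1.0 if p == x else 0.0 for p in phrases]
    o_vec = [1.0 if p in options else 0.0 for p in phrases]
    t_vec = [1.0 if t == card_type else 0.0 for t in ("EX", "MC", "TP")]
    return x_vec + o_vec + t_vec

phrases = [f"phrase{i}" for i in range(39)]  # placeholder phrase inventory
a = encode_card("phrase3", {"phrase3", "phrase7"}, "MC", phrases)
```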
When reading the history of past interactions, the LSTM input at each time step t concatenates the vector representation ā_t of the current flash card with vectors ā_{t−1}, y_{t−1}, f_{t−1} that describe the student's experience in round t−1: these respectively encode the previous flash card, the student's response to it (a one-hot 39-dimensional vector), and the resulting feedback (a 39-dimensional binary vector that indicates the remaining options after feedback). Thus, if the student receives no feedback, then f_{t−1} = O_{t−1}. Indicative feedback sets f_{t−1} = y_{t−1} or f_{t−1} = O_{t−1} − y_{t−1}, according to whether the student was correct or incorrect. Explicit feedback (including for an EX card) sets f_{t−1} to a one-hot representation of the correct answer.
1 7 1 CHAPTER6. KNO WLEDGETRACING
Model Para meters Accuracy(test) Cross-Entropy
RNG+CM ≈97K 0.449 1.888
LSTM ≈25K 0.429 1.992
Table 6.4: Co mparison of our best-perfor ming P KT model(R N G+C M)to our LST M model.
On our dataset, P KT outperfor msthe LST M bothinter ms of accuracy and cross-entropy.
correct ans wer. Thus, f t− 1 givestheset of “positive” optionsthat we usedinthe R G update ve ct or, w hil e O t− 1 givesthe set of “negative” options, allo wingthe LST Mto si milarly
1 1 updateits hidden statefro m h t− 1 t o h t toreflectlearning.
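The encoding above can be sketched as follows; the helper names are illustrative, and the 39-way response vocabulary is an assumption taken from this chapter's setup:

```python
import numpy as np

V = 39  # size of the response vocabulary, per the chapter's setup

def one_hot(i, n=V):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def encode_card(phrase_id, option_ids, card_type):
    """Encode a flash card a = (x, O): one-hot phrase vector, binary
    options vector, and a one-hot EX / MC / TP card-type vector."""
    types = {"EX": 0, "MC": 1, "TP": 2}
    opts = np.zeros(V)
    opts[list(option_ids)] = 1.0
    return np.concatenate([one_hot(phrase_id), opts, one_hot(types[card_type], 3)])

def lstm_input(card_t, card_prev, response_prev, feedback_prev):
    """Input at step t: the current card plus the previous round's experience
    (previous card, one-hot response, binary remaining-options feedback)."""
    return np.concatenate([card_t, card_prev, one_hot(response_prev), feedback_prev])
```

Each card vector has 39 + 39 + 3 = 81 dimensions, and each LSTM input concatenates two card vectors with the 39-dimensional response and feedback vectors.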
As in section 6.4.4, we train the parameters by L2-regularized maximum likelihood, with early stopping on development data. The weights for the LSTM were initialized uniformly at random ∼ U(−δ, +δ), where δ = 0.01, and RMSProp was used for gradient descent. We settled on a regularization coefficient of 0.002 after a line search. The number of hidden units d was also tuned using line search. Interestingly, a dimensionality of just d = 10 performed best on dev data:^12 at this size, the LSTM has fewer parameters than our best model.
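A minimal sketch of this training setup (illustrative only, not our actual training code): uniform initialization in (−δ, +δ) and one RMSProp step on the L2-regularized objective, with the tuned values of δ and the regularization coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
delta, lam = 0.01, 0.002        # init range and L2 coefficient from the text
lr, rho, eps = 1e-3, 0.9, 1e-8  # hypothetical RMSProp hyperparameters

# Uniform initialization in (-delta, +delta), as described above.
w = rng.uniform(-delta, delta, size=50)
ms = np.zeros_like(w)  # running average of squared gradients

def rmsprop_step(w, grad, ms):
    """One RMSProp update on the L2-regularized objective: the effective
    gradient is the loss gradient plus lam * w from the penalty term."""
    g = grad + lam * w
    ms = rho * ms + (1 - rho) * g ** 2
    w = w - lr * g / (np.sqrt(ms) + eps)
    return w, ms
```

The learning rate and decay constants here are assumptions; the text only fixes δ and the regularization coefficient.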
The result is shown in Table 6.4. These results favor our restricted PKT architecture. We acknowledge that the LSTM might perform better if a larger training set were available (which would allow a larger hidden layer), or with a different form of regularization (Srivastava et al., 2014).

^11 This architecture is formally able to mimic PKT. We would store θ in the LSTM's vector of cell activations, and configure the LSTM's "input" and "forget" gates to update this according to (6.2), where u_t is computed from the input. Observe that each feature in section 6.4.2 has the form ϕ_ij(x, y) = ξ_i(x) · ψ_j(y). Consider the hidden unit in h corresponding to this feature, with activation θ_ij. By configuring this unit's "output" gate to be ξ_i(x) (where x is the current foreign phrase given in the input), we would arrange for this hidden unit to have output ξ_i(x) · θ_ij, which will be multiplied by ψ_j(y) in (6.13) to recover θ_ij · ϕ_ij(x, y) just as in (6.1). (More precisely, the output would be sigmoid(ξ_i(x) · θ_ij), but we can evade this nonlinearity if we take the cell activations to be a scaled-down version of θ and scale up the embeddings ψ(y) to compensate.)

^12 We searched 0.001, 0.002, 0.005, 0.01, 0.02, 0.05 for the regularization coefficient, and 5, 10, 15, 20, 50, 100, 200 for the number of hidden units.
Intermediate or hybrid models would of course also be possible. For example, we could predict p(y | a_t) via (6.1), defining θ_t as h_t^⊤ M, a learned linear function of h_t. This variant would again have access to our hand-designed features ϕ(x, y), so that it would know which flash cards were similar. In fact θ_t · ϕ(x, y) in (6.1) equals h_t · (M ϕ(x, y)), so M can be regarded as projecting ϕ(x, y) down to the LSTM's hidden dimension d, learning how to weight and use these features. In this variant, the LSTM would no longer need to take a_t as part of its input at time t: rather, h_t (just like θ_t in PKT) would be a pure representation of the student's knowledge state at time t, capable of predicting y_t for any a_t. This setup more closely resembles PKT, or the DKT LSTM of Piech et al. (2015). Unlike the DKT paper, however, it would still predict the student's specific response, not merely whether they were right or wrong.
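The identity behind this hybrid variant, θ_t · ϕ(x, y) = h_t · (M ϕ(x, y)), is easy to check numerically; the sketch below (hypothetical code, not an implemented component) scores a candidate response under the hybrid parameterization:

```python
import numpy as np

def hybrid_score(h_t, M, phi_xy):
    """Hybrid PKT/LSTM scoring with theta_t = h_t^T M, so that
    theta_t . phi(x, y) equals h_t . (M phi(x, y)).

    h_t:    (d,) LSTM hidden state
    M:      (d, F) learned projection of the F hand-designed features
    phi_xy: (F,) feature vector phi(x, y) for a candidate card/response pair
    """
    # Project the features down to the hidden dimension, then score.
    return h_t @ (M @ phi_xy)
```

Computing h_t @ (M @ phi_xy) costs O(dF), the same as forming θ_t = h_t^⊤ M first, but makes M's role as a feature projection explicit.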
6.7 Conclusion
We have presented a cognitively plausible model that traces a human student's knowledge as he or she interacts with a simple online tutoring system. The student must learn to translate very short inflected phrases from an unfamiliar language into English. Our model assumes that when a student recalls or guesses the translation, he or she is attempting to solve a structured prediction problem of choosing the best translation, based on salient features of the input-output pair. Specifically, we characterize the student's knowledge as a vector of feature weights, which is updated as the student interacts with the system. While the phrasal features memorize the translations of entire input phrases, the other features can pick up on the translations of individual words and sub-words, which are reusable across phrases.
We collected and modeled human-subjects data. We experimented with models using several different update mechanisms, focusing on the student's treatment of negative feedback and the degree to which the student tends to update or forget specific weights in particular contexts. We also found that, in comparison to a less constrained LSTM model, we can better fit the human behavior by using weight update schemes that are broadly consistent with schemes used in machine learning.

In the future, we plan to experiment with more variants of the model, including variants that allow noise and personalization. Most important, we mean to use the model for planning which flash cards, feedback, or other stimuli to show next to a given student.
Chapter 7
Conclusion & Future Direction
This thesis introduces the problem of generating macaronic language texts as a foreign language learning paradigm. Adult foreign language learning is a challenging task that requires dedicated time and effort in following a curriculum. We believe the macaronic framework introduced in this thesis allows a student to engage in language learning while simply reading any document. We hope that such instruction will be a valuable addition to the traditional foreign language learning process.
We have made progress towards identifying appropriate data structures to represent all possible macaronic configurations for a given sentence, devised a method to model the readability and guessability of foreign language words and phrases in macaronic configurations, and shown how a simple search heuristic can find pedagogically useful macaronic configurations. We have also presented an interaction mechanism for macaronic documents, hopefully leading to improved student engagement while gaining the ability to update the student model based on feedback. Finally, we also studied sequential modeling of students' knowledge as they navigate through a restricted foreign language inflection learning activity.
There are several possible research directions moving forward. We are most interested in improving methods that follow the generic student model based approach, as it allows us to create macaronic documents from a wide variety of domains without data collection involving human students. We identify the following limitations currently in the generic student model based approach:
Capturing uncertainty of L2 word embeddings: Word embeddings are points in a subspace. Assigning each L2 word a single point in that subspace ignores the uncertainty associated with that word's meaning. This issue might not be very crucial when learning L1 embeddings, as we can assume (at least for frequent words) that we can learn their embeddings from different instances in the training data. However, our incremental L2 learning approach assigns/learns an embedding from (initially) just one exposure. Even subsequent exposures are not batched. Thus, it should be useful to maintain a range (or distribution) of reasonable embeddings for a new L2 word after each exposure, instead of a single point in the embedding space.

A possible approach could be to represent each L2 embedding by a multidimensional Gaussian with a mean vector µ ∈ R^D and variance σ² ∈ R^D. Similar ideas have been shown to help word embedding learning (Vilnis and McCallum, 2014; Athiwaratkun and Wilson, 2017) using "word2vec"-style objectives (CBOW or skip-gram) (Mikolov et al., 2013). We could also employ the recent reparameterization method (Kingma and Welling, 2013).
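One way to realize this, sketched below under the stated diagonal-Gaussian assumption (all names hypothetical), keeps a mean and log-variance per L2 word and samples embeddings with the reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimensionality (illustrative)

# A diagonal Gaussian belief over one L2 word's embedding.
mu = np.zeros(D)
log_sigma2 = np.zeros(D)  # store log-variance so sigma^2 stays positive

def sample_embedding(mu, log_sigma2):
    """Reparameterization trick (Kingma and Welling, 2013): draw
    e = mu + sigma * eps with eps ~ N(0, I), so a gradient-based
    learner could update mu and log_sigma2 through the sample."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_sigma2) * eps
```

After each exposure, the mean could move toward the context-predicted embedding while the variance shrinks, reflecting growing confidence in the word's meaning.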
Search Heuristics and Planning: We currently employ a simple "left-to-right" best-first search heuristic to search for the "best" macaronic configuration for a given sentence. We could explore several alternative search heuristics. One simple alternative is to change the search ordering from "left to right" to "low-frequency to high-frequency". That is, we would try to replace low-frequency words in the sentence with their L2 translations before trying high-frequency words. This heuristic should provide more opportunities to replace low-frequency content words at the expense of high-frequency stop words; however, since high-frequency words are more likely to show up in the rest of the document, there will be other opportunities for the model to replace these with L2 translations. In pilot experiments, low-frequency-to-high-frequency ordering (in conjunction with best-first search) outperforms the left-to-right heuristic in terms of the cosine similarity score defined in § 5.5.2.
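The alternative ordering can be sketched as follows (a toy illustration; the function name and the corpus counts are hypothetical):

```python
from collections import Counter

def replacement_order(tokens, corpus_counts):
    """Order candidate positions from low- to high-frequency: the
    alternative to left-to-right ordering discussed above. Rare content
    words are tried for L2 replacement before frequent stop words."""
    return sorted(range(len(tokens)), key=lambda i: corpus_counts[tokens[i]])

# Hypothetical corpus frequencies; rarer words come first in the ordering.
counts = Counter({"the": 1000, "cat": 40, "chased": 12, "mouse": 25})
order = replacement_order(["the", "cat", "chased", "the", "mouse"], counts)
# "chased" (12) first, then "mouse" (25), "cat" (40), then the two "the"s.
```

Since Python's sort is stable, ties among equally frequent words fall back to left-to-right order.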
Our current scheme does not consider the entire document when searching for the best macaronic configuration for a sentence. If the machine teacher knows, for example, that a certain L2 vocabulary item is more guessable in some future part of the document, then it could use the current sentence to teach a different L2 vocabulary item to the student. Thus, looking into the future of the document is a possible future direction of research. Our pilot experiments using Monte-Carlo tree search to find the best macaronic configuration also suggest the same. However, with longer look-ahead horizons, search takes more time to complete, which might hinder "online" search.
Contextual Representations from BERT and ELMo: The cloze language model used in our generic student model is closely related to sentence representation models such as BERT and ELMo (Devlin et al., 2019; Peters et al., 2018). We could use these pretrained models instead of our cloze language model; however, we would be restricted by the L1 vocabulary used by these large masked language models.
Modified softmax layer in the cloze model: Our incremental approach to learning the embeddings of novel L2 words is scored using cosine similarity (§ 5.5.1). However, the initial cloze model (§§ 5.2.1 and 5.2.2) does not take this particular scoring into account. When training the cloze model on L1 data and during incremental L2 word learning, the norms of the embeddings are not constrained, leading to common words having larger norms. This creates a mismatch between how we learn L1 embeddings and L2 embeddings (incrementally) and how we score them. While it is unclear if this dramatically changes the resulting macaronic configurations, a possible solution could be to use cosine similarity to obtain logits during L1 learning and during incremental L2 learning. This would encourage the initial cloze model to restrict the norms of word embeddings to be close to one.
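A possible cosine-similarity output layer, sketched below (a minimal sketch of the proposed fix, not an implemented component; the temperature parameter is an assumption), replaces dot-product logits so that embedding norms no longer inflate scores for common words:

```python
import numpy as np

def cosine_logits(hidden, embeddings, temperature=1.0):
    """Logits from cosine similarity rather than raw dot products.

    hidden:     (d,) context vector predicted at the blank
    embeddings: (V, d) output word embeddings
    """
    h = hidden / np.linalg.norm(hidden)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return (E @ h) / temperature  # each logit lies in [-1/T, +1/T]
```

Because both vectors are normalized, rescaling any word's embedding leaves its logit unchanged, matching the cosine scoring used for incremental L2 learning.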
Phrasal Translations: Finally, the one-to-one lexical translation setup is a limitation, as it only affords teaching single L2 words and not phrases. Nor does it expose the student to word order differences between the L1 and L2. There are two main challenges when moving to non-lexical teaching. First, we would have to consider different scoring functions to guide the macaronic configuration search. Currently, the scoring function (§ 5.5.1) is straightforward for the lexical translation case, but not for scoring L1-L2 phrase pairs, especially when also considering word-order differences between an L1 and L2 phrase. Second, we need to address how we represent L2 "knowledge" in our generic student model. Merely using L2 word embeddings would be insufficient. One possible research endeavor is to not only learn new L2 word embeddings but also learn L2 recurrent parameters in an incremental fashion. That is, we can learn an entirely new L2 cloze language model. To score this cloze model we could use held-out fully-L2 sentences, perhaps from the remainder of the current document. Apart from redesigning the generic student model and the incremental learning paradigm to enable phrasal L2 learning, we also have to alter how we generate the macaronic data structure so that it can support phrasal macaronic configurations. Creating the back-end macaronic data structure using statistical machine translation (§ 1.4) may result in translations that are not accurate. Neural machine translation may provide better partial translations, which could be used to generate the required back-end data structure. We find that we are able to generate fluent macaronic translations by tagging tokens in the input sequence (which is fully in L1) with either a Translate or No-Translate tag. Table 7.1 shows examples of generated En-De (L1-L2) macaronic sentences.

Markup:     He gave a talk about how education [de and school kills creativity] .
Prediction: He gave a talk about how education und schulen kreativität tötet .

Markup:     [de It was somebody who was trying] to ask a question about Javascript .
Prediction: Es war jemand , der versuchte , to ask a question about Javascript .

Markup:     [de We were standing on the edge] of thousands of acres of [de cotton] .
Prediction: Wir standen am rande of thousands of acres of baumwolle .

Markup:     And we're building upon innovations of [de generations who went before us] .
Prediction: And we're building upon innovations of generationen , die vor uns gingen .

Table 7.1: Examples of inputs and predicted outputs by our experimental NMT model trained to generate macaronic language sentences using annotations on the input sequence; [de …] marks the spans tagged for translation into German. We see that the macaronic language translations are able to correctly order German portions of the sentences, especially at the sentence endings. The source features have also been learned by the NMT model, and the translations are faithful to the markup. Case, tokenization, and italics were added in post.
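The tagging scheme can be sketched as follows (function and tag names are illustrative; the actual preprocessing may differ):

```python
def tag_for_translation(tokens, translate_spans):
    """Attach a Translate / No-Translate source feature to each L1 token,
    the kind of annotation used to steer the experimental NMT model.

    translate_spans: list of (start, end) index pairs, end exclusive
    """
    tags = ["No-Translate"] * len(tokens)
    for start, end in translate_spans:
        for i in range(start, end):
            tags[i] = "Translate"
    return list(zip(tokens, tags))

# Toy example: tag two spans of an L1 sentence for L2 translation.
tagged = tag_for_translation(
    ["We", "were", "standing", "on", "the", "edge", "of", "cotton"],
    [(0, 6), (7, 8)])
```

The NMT model then learns to render Translate-tagged spans in the L2 (with correct L2 word order) while copying No-Translate tokens through unchanged.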
Longitudinal User Modelling: In this thesis, experiments involving human students were conducted over relatively short time-frames. Modelling long-term learning and forgetting patterns in a macaronic learning setup would lead to better configurations, as the machine teacher could account for a student's forgetting patterns. Such experiments, however, would exhibit high variation and would require a larger number of participants. Generally, longer studies also exhibit poor participant retention.
Bibliography

Ahn, Luis von (2013). “Duolingo: Learn a Language for Free While Helping to Translate the Web”. In: Proceedings of the 2013 International Conference on Intelligent User Interfaces, pp. 1–2.

Alishahi, Afra, Afsaneh Fazly, and Suzanne Stevenson (2008). “Fast mapping in word learning: What probabilities tell us”. In: Proceedings of the Twelfth Conference on Computational Natural Language Learning. Association for Computational Linguistics, pp. 57–64.

Athiwaratkun, Ben and Andrew Wilson (2017). “Multimodal Word Distributions”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1645–1656.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov (2017). “Enriching Word Vectors with Subword Information”. In: Transactions of the Association for Computational Linguistics 5, pp. 135–146. URL: https://www.aclweb.org/anthology/Q17-1010.
Bojar, Ondřej, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi (2015). “Findings of the 2015 Workshop on Statistical Machine Translation”. In: Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 1–46. URL: http://aclweb.org/anthology/W15-3001.

Burstein, Jill, Joel Tetreault, and Nitin Madnani (2013). “The E-Rater Automated Essay Scoring System”. In: Handbook of Automated Essay Evaluation: Current Applications and New Directions. Ed. by Mark D. Shermis. Routledge, pp. 55–67.

Carey, Susan and Elsa Bartlett (1978). “Acquiring a single new word.” In:

Chen, Tao, Naijia Zheng, Yue Zhao, Muthu Kumar Chandrasekaran, and Min-Yen Kan (July 2015). “Interactive Second Language Learning from News Websites”. In: Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications. Beijing, China: Association for Computational Linguistics, pp. 34–42. URL: https://www.aclweb.org/anthology/W15-4406.

Cho, Kyunghyun, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio (Oct. 2014). “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, pp. 1724–1734. URL: http://www.aclweb.org/anthology/D14-1179.
Church, Kenneth Ward and Patrick Hanks (Mar. 1990). “Word Association Norms, Mutual Information, and Lexicography”. In: Computational Linguistics 16.1, pp. 22–29. URL: http://dl.acm.org/citation.cfm?id=89086.89095.

Clarke, Linda K. (Oct. 1988). “Invented versus Traditional Spelling in First Graders’ Writings: Effects on Learning to Spell and Read”. In: Research in the Teaching of English, pp. 281–309. URL: http://www.jstor.org.proxy1.library.jhu.edu/stable/40171140.

Constantino, Rebecca, Sy-Ying Lee, Kyung-Sook Cho, and Stephen Krashen (1997). “Free Voluntary Reading as a Predictor of TOEFL Scores.” In: Applied Language Learning 8.1, pp. 111–18.

Corbett, Albert T and John R Anderson (1994). “Knowledge tracing: Modeling the acquisition of procedural knowledge”. In: User Modeling and User-Adapted Interaction 4.4, pp. 253–278.

Daumé III, Hal (Aug. 2004). “Notes on CG and LM-BFGS Optimization of Logistic Regression”. URL: http://hal3.name/megam/.

Daumé III, Hal (June 2007). “Frustratingly Easy Domain Adaptation”. In: Proceedings of ACL. Prague, Czech Republic, pp. 256–263. URL: http://www.aclweb.org/anthology/P07-1033.

Deutschlandfunk (2016). nachrichtenleicht. http://www.nachrichtenleicht.de/. Accessed: 2015-09-30. URL: www.nachrichtenleicht.de.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018). “Bert: Pre-training of deep bidirectional transformers for language understanding”. In: arXiv preprint arXiv:1810.04805.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (June 2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423.

Dorr, Bonnie J. (Dec. 1994). “Machine Translation Divergences: A Formal Description and Proposed Solution”. In: Computational Linguistics 20.4, pp. 597–633. URL: http://aclweb.org/anthology/J/J94/J94-4004.pdf.

Dreyer, Markus and Jason Eisner (Aug. 2009). “Graphical Models Over Multiple Strings”. In: Proceedings of EMNLP. Singapore, pp. 101–110. URL: http://cs.jhu.edu/~jason/papers/#dreyer-eisner-2009.

Du, Xinya, Junru Shao, and Claire Cardie (July 2017). “Learning to Ask: Neural Question Generation for Reading Comprehension”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, pp. 1342–1352. URL: https://www.aclweb.org/anthology/P17-1123.
Duchi, John, Elad Hazan, and Yoram Singer (2011). “Adaptive subgradient methods for online learning and stochastic optimization”. In: Journal of Machine Learning Research 12.Jul, pp. 2121–2159.

Dupuy, B and J McQuillan (1997). “Handcrafted books: Two for the price of one”. In: Successful strategies for extensive reading, pp. 171–180.

Elley, Warwick B and Francis Mangubhai (1983). “The impact of reading on second language learning”. In: Reading Research Quarterly, pp. 53–67.

Gentry, J. Richard (Nov. 2000). “A Retrospective on Invented Spelling and a Look Forward”. In: The Reading Teacher 54.3, pp. 318–332. URL: http://www.jstor.org.proxy1.library.jhu.edu/stable/20204910.

González-Brenes, José, Yun Huang, and Peter Brusilovsky (2014). “General features in knowledge tracing to model multiple subskills, temporal item response theory, and expert knowledge”. In: Proceedings of the 7th International Conference on Educational Data Mining. University of Pittsburgh, pp. 84–91.

Grammarly (2009). Grammarly. https://app.grammarly.com. Accessed: 2019-02-20.

Heafield, Kenneth, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn (Aug. 2013). “Scalable Modified Kneser-Ney Language Model Estimation”. In: Proceedings of ACL. Sofia, Bulgaria, pp. 690–696. URL: http://kheafield.com/professional/edinburgh/estimate_paper.pdf.
Heilman, Michael and Nitin Madnani (2012). “ETS: Discriminative Edit Models for Paraphrase Scoring”. In: *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). Montréal, Canada: Association for Computational Linguistics, pp. 529–535. URL: https://www.aclweb.org/anthology/S12-1076.

Heilman, Michael and Noah A Smith (2010). “Good question! Statistical ranking for question generation”. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 609–617.

Hermjakob, Ulf, Jonathan May, Michael Pust, and Kevin Knight (July 2018). “Translating a Language You Don’t Know in the Chinese Room”. In: Proceedings of ACL 2018, System Demonstrations. Melbourne, Australia: Association for Computational Linguistics, pp. 62–67. URL: https://www.aclweb.org/anthology/P18-4011.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural Computation 9.8, pp. 1735–1780.

Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities”. In: Proceedings of the National Academy of Sciences of the USA. Vol. 79. 8, pp. 2554–2558.
Hu, Chang, Benjamin B Bederson, and Philip Resnik (2010). “Translation by iterative collaboration between monolingual users”. In: Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 54–55.

Hu, Chang, Benjamin B Bederson, Philip Resnik, and Yakov Kronrod (2011). “Monotrans2: A new human computation system to support monolingual translation”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1133–1136.

Huang, Yun, J Guerra, and Peter Brusilovsky (2016). “Modeling Skill Combination Patterns for Deeper Knowledge Tracing”. In: Proceedings of the 6th Workshop on Personalization Approaches in Learning Environments (PALE 2016). 24th Conference on User Modeling, Adaptation and Personalization. Halifax, Canada.

Huckin, Thomas and James Coady (1999). “Incidental vocabulary acquisition in a second language”. In: Studies in Second Language Acquisition 21.2, pp. 181–193.

Kann, Katharina, Ryan Cotterell, and Hinrich Schütze (Apr. 2017). “Neural Multi-Source Morphological Reinflection”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain: Association for Computational Linguistics, pp. 514–524. URL: https://www.aclweb.org/anthology/E17-1049.

Khajah, Mohammad, Rowan Wing, Robert Lindsey, and Michael Mozer (2014a). “Integrating latent-factor and knowledge-tracing models to predict individual differences in learning”. In: Proceedings of the 7th International Conference on Educational Data Mining.
Khajah, Mohammad M, Yun Huang, José P González-Brenes, Michael C Mozer, and Peter Brusilovsky (2014b). “Integrating knowledge tracing and item response theory: A tale of two frameworks”. In: Proceedings of Workshop on Personalization Approaches in Learning Environments (PALE 2014) at the 22nd International Conference on User Modeling, Adaptation, and Personalization. University of Pittsburgh, pp. 7–12.

Kim, Haeyoung and Stephen Krashen (1998). “The Author Recognition and Magazine Recognition Tests, and Free Voluntary Rereading as Predictors of Vocabulary Development in English as a Foreign Language for Korean High School Students.” In: System 26.4, pp. 515–23.

Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush (2016). “Character-aware neural language models”. In: Thirtieth AAAI Conference on Artificial Intelligence.

Kingma, Diederik P. and Jimmy Ba (2014). “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980.

Kingma, Diederik P and Max Welling (2013). “Auto-encoding variational bayes”. In: arXiv preprint arXiv:1312.6114.

Knowles, Rebecca, Adithya Renduchintala, Philipp Koehn, and Jason Eisner (Aug. 2016). “Analyzing Learner Understanding of Novel L2 Vocabulary”. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, pp. 126–135. URL: http://www.aclweb.org/anthology/K16-1013.
Koedinger, K. R., P. I. Pavlick Jr., J. Stamper, T. Nixon, and S. Ritter (2011). “Avoiding Problem Selection Thrashing with Conjunctive Knowledge Tracing”. In: Proceedings of the 4th International Conference on Educational Data Mining. Eindhoven, NL, pp. 91–100.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst (2007). “Moses: Open Source Toolkit for Statistical Machine Translation”. In: Proceedings of ACL: Short Papers, pp. 177–180. URL: http://www.aclweb.org/anthology/P/P07/P07-2045.

Krashen, S. (1993). “How Well do People Spell?” In: Reading Improvement 30.1. URL: http://www.sdkrashen.com/content/articles/1993_how_well_do_people_spell.pdf.

Krashen, Stephen (1989). “We acquire vocabulary and spelling by reading: Additional evidence for the input hypothesis”. In: The Modern Language Journal 73.4, pp. 440–464.

Krashen, Stephen D (2003). Explorations in language acquisition and use. Heinemann Portsmouth, NH.

Labutov, Igor and Hod Lipson (June 2014). “Generating Code-switched Text for Lexical Learning”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 562–571. URL: https://www.aclweb.org/anthology/P14-1053.
Lee, Jung In and Emma Brunskill (2012). “The Impact on Individualizing Student Models on Necessary Practice Opportunities.” In: International Educational Data Mining Society.

Lee, Sy-Ying and Stephen Krashen (1996). “Free Voluntary Reading and Writing Competence in Taiwanese High School Students”. In: Perceptual and Motor Skills 83.2, pp. 687–690. eprint: https://doi.org/10.2466/pms.1996.83.2.687. URL: https://doi.org/10.2466/pms.1996.83.2.687.

Lee, Yon Ok, Stephen D Krashen, and Barry Gribbons (1996). “The effect of reading on the acquisition of English relative clauses”. In: ITL-International Journal of Applied Linguistics 113.1, pp. 263–273.

Leitner, Sebastian (1972). So lernt man lernen: der Weg zum Erfolg. Herder, Freiburg.

Lingua.ly (2013). Lingua.ly. https://lingua.ly/. Accessed: 2016-04-04.

Litman, Diane (2016). “Natural Language Processing for Enhancing Teaching and Learning”. In: Proceedings of AAAI.

Madnani, Nitin, Michael Heilman, Joel Tetreault, and Martin Chodorow (2012). “Identifying High-Level Organizational Elements in Argumentative Discourse”. In: Proceedings of NAACL-HLT. Association for Computational Linguistics, pp. 20–28.

Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher (2016). “Pointer sentinel mixture models”. In: arXiv preprint arXiv:1609.07843.
Mikolov, Tomas, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur (2010). “Recurrent neural network based language model.” In: INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048.

Mikolov, Tomas, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky (2011). “RNNLM — Recurrent Neural Network Language Modeling Toolkit”. In: Proc. of the 2011 ASRU Workshop, pp. 196–201.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean (2013). “Distributed representations of words and phrases and their compositionality”. In: Advances in Neural Information Processing Systems, pp. 3111–3119.

Mikolov, Tomas, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin (May 2018). “Advances in Pre-Training Distributed Word Representations”. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). Miyazaki, Japan: European Languages Resources Association (ELRA). URL: https://www.aclweb.org/anthology/L18-1008.

Mitkov, Ruslan and Le An Ha (2003). “Computer-aided generation of multiple-choice tests”. In: Proceedings of the HLT-NAACL 03 workshop on Building educational applications using natural language processing - Volume 2. Association for Computational Linguistics, pp. 17–22.

Mousa, Amr and Björn Schuller (Apr. 2017). “Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A Generative Approach to Sentiment Analysis”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain, pp. 1023–1032. URL: https://www.aclweb.org/anthology/E17-1096.
Murphy, Kevin P., Yair Weiss, and Michael I. Jordan (1999). “Loopy belief propagation for approximate inference: An empirical study”. In: Proceedings of UAI. Morgan Kaufmann Publishers Inc., pp. 467–475.

Nelson, Mark (2007). The Alpheios Project. http://alpheios.net/. Accessed: 2016-04-05.

Nivre, Joakim, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, et al. “Universal Dependencies v1: A Multilingual Treebank Collection.” In:

OneThirdStories (2018). OneThirdStories. https://onethirdstories.com/. Accessed: 2019-02-20.

Özbal, Gözde, Daniele Pighin, and Carlo Strapparava (2014). “Automation and Evaluation of the Keyword Method for Second Language Learning”. In: Proceedings of ACL (Volume 2: Short Papers), pp. 352–357.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning (2014). “GloVe: Global Vectors for Word Representation”. In: Proceedings of EMNLP. Vol. 14, pp. 1532–1543. URL: http://www.aclweb.org/anthology/D14-1162.

Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (June 2018). “Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 2227–2237. URL: https://www.aclweb.org/anthology/N18-1202.
Philips, La wrence( Dec. 1990). “ Hanging onthe Metaphone”.In: Co mputer Language 7.12.
Piech, Chris, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein (2015). “Deep Knowledge Tracing”. In: Advances in Neural Information Processing Systems, pp. 505–513.
Posner, Michael I. (1989). Foundations of cognitive science. MIT Press, Cambridge, MA.
Preacher, K. J. (May 2002). Calculation for the test of the difference between two independent correlation coefficients [Computer software]. URL: http://www.quantpsy.org/corrtest/corrtest.htm.
Press, Ofir and Lior Wolf (Apr. 2017). “Using the Output Embedding to Improve Language Models”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain, pp. 157–163. URL: https://www.aclweb.org/anthology/E17-2025.
Rafferty, Anna N. and Christopher D. Manning (2008). “Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines”. In: Proceedings of the Workshop on Parsing German. Association for Computational Linguistics, pp. 40–46.
Recht, Benjamin, Christopher Re, Stephen Wright, and Feng Niu (2011). “Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”. In: Advances in Neural Information Processing Systems, pp. 693–701.
Renduchintala, Adithya, Philipp Koehn, and Jason Eisner (Aug. 2017). “Knowledge Tracing in Sequential Learning of Inflected Vocabulary”. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada, pp. 238–247. URL: https://www.aclweb.org/anthology/K17-1025.
Renduchintala, Adithya, Philipp Koehn, and Jason Eisner (Aug. 2019a). “Simple Construction of Mixed-Language Texts for Vocabulary Learning”. In: Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA). Florence. URL: http://cs.jhu.edu/~jason/papers/#renduchintala-et-al-2019.
Renduchintala, Adithya, Philipp Koehn, and Jason Eisner (Nov. 2019b). “Spelling-Aware Construction of Macaronic Texts for Teaching Foreign-Language Vocabulary”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 6438–6443. URL: https://www.aclweb.org/anthology/D19-1679.
Renduchintala, Adithya, Rebecca Knowles, Philipp Koehn, and Jason Eisner (Aug. 2016a). “Creating Interactive Macaronic Interfaces for Language Learning”. In: Proceedings of ACL-2016 System Demonstrations. Berlin, Germany, pp. 133–138. URL: https://www.aclweb.org/anthology/P16-4023.
Renduchintala, Adithya, Rebecca Knowles, Philipp Koehn, and Jason Eisner (Aug. 2016b). “User Modeling in Language Learning with Macaronic Texts”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pp. 1859–1869. URL: https://www.aclweb.org/anthology/P16-1175.
Rodrigo, Victoria, Jeff McQuillan, and Stephen Krashen (1996). “Free Voluntary Reading and Vocabulary Knowledge in Native Speakers of Spanish”. In: Perceptual and Motor Skills 83.2, pp. 648–650. URL: https://doi.org/10.2466/pms.1996.83.2.648.
Schacter, D. L. (1989). “Memory”. In: Foundations of Cognitive Science. Ed. by M. I. Posner. MIT Press, pp. 683–725.
Settles, Burr and Brendan Meeder (Aug. 2016). “A Trainable Spaced Repetition Model for Language Learning”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. Berlin, Germany: Association for Computational Linguistics, pp. 1848–1858. URL: http://www.aclweb.org/anthology/P16-1174.
Smolensky, Paul (1986). “Information Processing in Dynamical Systems: Foundations of Harmony Theory”. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Ed. by D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. Vol. 1: Foundations. Cambridge, MA: MIT Press/Bradford Books, pp. 194–281.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2014). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: Journal of Machine Learning Research 15, pp. 1929–1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.
Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber (2015). “Highway networks”. In: arXiv preprint arXiv:1505.00387.
Stokes, Jeffery, Stephen D. Krashen, and John Kartchner (1998). “Factors in the acquisition of the present subjunctive in Spanish: The role of reading and study”. In: ITL-International Journal of Applied Linguistics 121.1, pp. 19–25.
Swych (2015). Swych. http://swych.it/. Accessed: 2019-02-20.
Tieleman, Tijmen and Geoffrey Hinton (2012). “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”. In: COURSERA: Neural networks for machine learning 4.2.
Vilnis, Luke and Andrew McCallum (2014). “Word Representations via Gaussian Embedding”. In: CoRR abs/1412.6623.
Vygotsky, Lev (2012). Thought and Language (Revised and Expanded Edition). MIT Press.
Weide, R. (1998). The CMU pronunciation dictionary, release 0.6.
Wieting, John, Mohit Bansal, Kevin Gimpel, and Karen Livescu (Nov. 2016). “Charagram: Embedding Words and Sentences via Character n-grams”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pp. 1504–1515. URL: https://www.aclweb.org/anthology/D16-1157.
Wikimedia Foundation (2016). Simple English Wikipedia. Retrieved from https://dumps.wikimedia.org/simplewiki/20160407/, 8-April-2016.
Wikipedia (2016). Leichte Sprache — Wikipedia, Die freie Enzyklopädie. [Online; accessed 16-March-2016]. URL: https://de.wikipedia.org/wiki/Leichte_Sprache.
Wood, David, Jerome S. Bruner, and Gail Ross (1976). “The role of tutoring in problem solving”. In: Journal of Child Psychology and Psychiatry 17.2, pp. 89–100.
Wu, Dekai (1997). “Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora”. In: Computational Linguistics 23.3, pp. 377–404.
Xu, Yanbo and Jack Mostow (2012). “Comparison of Methods to Trace Multiple Subskills: Is LR-DBN Best?” In: Proceedings of the 5th International Conference on Educational Data Mining, pp. 41–48.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le (2019). “XLNet: Generalized autoregressive pretraining for language understanding”. In: Advances in Neural Information Processing Systems, pp. 5753–5763.
Zeiler, Matthew D. (2012). “ADADELTA: an adaptive learning rate method”. In: arXiv preprint arXiv:1212.5701.
Zhang, Haoran, Ahmed Magooda, Diane Litman, Richard Correnti, Elaine Wang, L. C. Matsumura, Emily Howe, and Rafael Quintana (2019). “eRevise: Using Natural Language Processing to Provide Formative Feedback on Text Evidence Usage in Student Writing”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33, pp. 9619–9625.
Zhou, Jingguang and Zili Huang (2018). “Recover missing sensor data with iterative imputing network”. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
Zolf, Falk (1945). On Stranger Land: Pages of a Life.
Zolf, Falk and Martin Green (2003). On Foreign Soil: Tales of a Wandering Jew. Benchmark Publishing.
Vita

Adithya Renduchintala
arendu.github.io
adi.r@jhu.edu
INTERESTS

I am broadly interested in problems at the intersection of (Deep) Machine Learning, Neural Machine Translation, Natural Language Processing, & User Modeling.
EDUCATION

PhD, Computer Science, 2013-2020
Johns Hopkins University, Baltimore, MD

MS, Computer Science, 2010-2012
University of Colorado, Boulder, CO

MS, Electrical Engineering, Arts Media and Engineering, 2005-2008
Arizona State University, Tempe, AZ

BE, Electrical Engineering, 2001-2005
Anna University, SRM Engineering College, Chennai, India
PUBLICATIONS

1. Spelling-Aware Construction of Macaronic Texts for Teaching Foreign-Language Vocabulary, Adithya Renduchintala, Philipp Koehn and Jason Eisner. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.

2. Simple Construction of Mixed-Language Texts for Vocabulary Learning, Adithya Renduchintala, Philipp Koehn and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) Workshop on Innovative Use of NLP for Building Educational Applications, 2019.

3. Pretraining by Backtranslation for End-to-End ASR in Low-Resource Settings, Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, Chunxi Li, Najim Dehak and Sanjeev Khudanpur, Interspeech 2019.

4. A Call for Prudent Choice of Subword Merge Operations, Shuoyang Ding, Adithya Renduchintala, and Kevin Duh. Machine Translation Summit 2019.

5. Character-Aware Decoder for Translation into Morphologically Rich Languages, Adithya Renduchintala*, Pamela Shapiro*, Kevin Duh and Philipp Koehn. Machine Translation Summit 2019. (* Equal contribution)

6. The JHU/KyotoU Speech Translation System for IWSLT 2018, Hirofumi Inaguma, Xuan Zhang, Zhiqi Wang, Adithya Renduchintala, Shinji Watanabe and Kevin Duh. The International Workshop on Spoken Language Translation 2018 (IWSLT).

7. Multi-Modal Data Augmentation for End-to-End ASR, Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner and Shinji Watanabe, Interspeech 2018. Best Student Paper Award (3/700+).

8. ESPnet: End-to-End Speech Processing Toolkit, Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala and Tsubasa Ochiai, Interspeech 2018.

9. Knowledge Tracing in Sequential Learning of Inflected Vocabulary, Adithya Renduchintala, Philipp Koehn and Jason Eisner, Conference on Computational Natural Language Learning (CoNLL), 2017.

10. User Modeling in Language Learning with Macaronic Texts, Adithya Renduchintala, Rebecca Knowles, Philipp Koehn, and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) 2016.

11. Creating Interactive Macaronic Interfaces for Language Learning, Adithya Renduchintala, Rebecca Knowles, Philipp Koehn, and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) Demo Session 2016.

12. Analyzing Learner Understanding of Novel L2 Vocabulary, Rebecca Knowles, Adithya Renduchintala, Philipp Koehn, and Jason Eisner, Conference on Computational Natural Language Learning (CoNLL), 2016.

13. Algerian Arabic-French Code-Switched Corpus, Ryan Cotterell, Adithya Renduchintala, Naomi P. Saphra and Chris Callison-Burch. An LREC-2014 Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools. 2014.

14. Using Machine Learning and HL7 LOINC DO for Classification of Clinical Documents, Adithya Renduchintala, Amy Zhang, Thomas Polzin, G. Saadawi. American Medical Informatics Association (AMIA) 2013.

15. Collaborative Tagging and Persistent Audio Conversations, Ajita John, Shreeharsh Kelkar, Ed Peebles, Adithya Renduchintala, Doree Seligmann. Web 2.0 and Social Software Workshop in Conjunction with ECSCW. 2007.

16. Designing for Persistent Audio Conversations in the Enterprise, Adithya Renduchintala, Ajita John, Shreeharsh Kelkar, and Doree Duncan-Seligmann. Design for User Experience. 2007.
EXPERIENCE

Facebook AI, Menlo Park, CA, 2020-Present
Research Scientist
Working on problems related to Neural Machine Translation.

Johns Hopkins University, Baltimore, MD, 2013-2020
Research Assistant
Designed and evaluated AI foreign language teaching systems. Also worked on Machine Translation and End-to-End Speech Recognition.

Duolingo, Pittsburgh, PA, Summer 2017
Research Intern
Prototyped a chatbot system that detects and corrects word-ordering errors made by language learners. Explored spelling-error robustness of compositional word embeddings.

M*Modal, Pittsburgh, PA, 2012-2013
NLP Engineer
Developed an SVM-based clinical document classification system. Worked on feature engineering for statistical models (Document Classification, Entity Detection, Tokenization, Chunking).

Rosetta Stone, Boulder, CO, 2008-2012
Software Developer
Designed, prototyped and evaluated speech recognition based games and applications for language learning. Prototyped an image-to-concept relation visualization tool for second language vocabulary learning.

Avaya, Lincroft, NJ, Summer 2007
Research Scientist Intern
Developed an interactive graph-based visualization tool to explore and annotate conference calls in enterprises.

Arizona State University, Tempe, AZ, 2006-2008
Research Assistant, Arts Media & Engineering
Designed and prototyped systems for serendipitous interactions in distributed workplaces.
TEACHING

1. Intro. to Human Language Technology, Teaching Assistant, Fall 2019
2. Neural Machine Translation Lab Session, JSALT Summer School, Summer 2018
3. Machine Translation, Teaching Assistant, Spring 2016
4. Intro. to Programming for Scientists & Engineers, Teaching Assistant, Fall 2013
PROGRAMMING

Advanced: Python (numpy, scipy, scikit-learn)
Proficient: Java, C/C++, JavaScript, jQuery, NodeJS
Deep Learning Frameworks: PyTorch (Advanced), MxNet, Tensorflow
Deep Learning Toolkits: OpenNMT, Fairseq, ESPnet, Sockeye
NATIONALITY

Indian, Permanent US Resident

Updated 08/28/2020