
A LANGUAGE LEARNING FRAMEWORK BASED ON MACARONIC TEXTS

by

Adithya Renduchintala

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland

August, 2020

© 2020 Adithya Renduchintala

All rights reserved

Abstract

This thesis explores a new framework for foreign language (L2) education. Our framework introduces new L2 words and phrases interspersed within text written in the student’s native language (L1), resulting in a macaronic document. We focus on utilizing the inherent ability of students to comprehend macaronic sentences incidentally and, in doing so, learn new attributes of a foreign language (vocabulary and phrases). Our goal is to build an AI-driven foreign language teacher that converts any document written in a student’s L1 (news articles, stories, novels, etc.) into a pedagogically useful macaronic document. A valuable property of macaronic instruction is that language learning is “disguised” as a simple reading activity.

In this pursuit, we first analyze how users guess the meaning of a single novel L2 word (a noun) placed within an L1 sentence. We study the features users tend to use as they guess the meaning of the L2 word. We then extend our model to handle multiple novel words and phrases in a single sentence, resulting in a graphical model that performs a modified cloze task. To do so, we also define a data structure that supports realizing the exponentially many macaronic configurations possible for a given sentence. We also explore ways to use a neural cloze language model trained only on L1 text as a “drop-in” replacement for a real human student. Finally, we report findings on modeling students navigating through a foreign language inflection learning task. We hope that these form a foundation for future research into the construction of AI-driven foreign language teachers using macaronic language.

Primary Reader and Advisor: Philipp Koehn

Committee Member and Advisor: Jason Eisner

Committee Member: Kevin Duh

Acknowledgements

Portions of this dissertation have been previously published in the following jointly authored papers: Renduchintala et al. (2016b), Renduchintala et al. (2016a), Knowles et al. (2016), Renduchintala, Koehn, and Eisner (2017), Renduchintala, Koehn, and Eisner (2019a) and Renduchintala, Koehn, and Eisner (2019b). This work would not be possible without the direct efforts of my co-authors Rebecca Knowles, Jason Eisner and Philipp Koehn.

Philipp Koehn has provided me with calm, collected, and steady mentorship throughout my Ph.D. life. Sometimes, I found myself interested in topics far removed from my thesis, but Philipp still supported me and gave me the freedom to pursue them. Jason Eisner raised the bar for what I thought being a researcher meant, and opened my eyes to thinking deeply about a problem and effectively communicating my findings. I like to believe Jason has made me a better researcher by teaching me (apart from all the technical knowledge) simple yet profound things: details matter, and be mindful of the bigger picture even if you are solving a specific problem.

I am thankful to my defense committee and oral exam committee for valuable feedback and guidance: Kevin Duh, Tal Linzen, Chadia Abras, and Mark Dredze (alternate). I am incredibly grateful to Kevin Duh, who not only served on my committee but also mentored me on two projects leading to two publications. Kevin not only cared about my work but also about me and my well-being in stressful times. Thank you, Kevin.

I want to thank Mark Dredze and Suchi Saria, my advisors for a brief period when I first started my Ph.D. I am incredibly grateful to Adam Lopez, who first inspired me to research Machine Translation. I will never forget my first CLSP-PIRE workshop experience in Prague, thanks to Adam. Matt Post, along with Adam Lopez, were the instructors in my first-year MT course, and Matt has been fantastic to discuss project ideas and collaborate with.

I have been lucky to work with excellent researchers in speech recognition as well. Shinji Watanabe took a chance on me and helped me work through my first ever speech paper in 2018, which eventually won the best paper award. Shinji encouraged me and gave me the freedom to explore an idea even though I was not an experienced speech researcher. I would also like to thank Sanjeev Khudanpur and Najim Dehak for several discussions during my stint in the 2018 JSALT workshop.

I want to thank my fellow CLSP students and CLSP alumni who shared some time with me. Thank you, Tongfei Chen, Jaejin Cho, Shuoyang Ding, Katie Henry, Jonathan Jones, Huda Khayrallah, Gaurav Kumar, Ke Li, Ruizhi Li, Chu-Cheng Lin, Xutai Ma, Matthew Maciejewski, Matthew Weisner, Kelly Marchisio, Chandler May, Arya McCarthy, Hongyuan Mei, Sabrina Mielke, Phani Nidadavolu, Raghavendra Pappagari, Adam Poliak, Sa Sadhu, Elizabeth Salesky, Peter Schulam, Pamela Shapiro, Suzanna Sia, David Snyder, Brian Thompson, Tim Vieira, ming Wang, Zachary Wood-Doughty, Winston Wu, Patrick Xia, Hainan Xu, Jinyi Yang, Xuan Zhang, Ryan Cotterell, Naomi Saphra, Pushpendre Rastogi, Adrian Benton, Rachel Rudinger and Keisuke Sakaguchi, Yuan Cao, Matt Gormley, Ann Irvine, Keith Levin, Harish Mallidi, Vimal Manohar, Chunxi Liu, Courtney Napoles, Vijayaditya Peddinti, Nanyun Peng, Adam Teichert, and Xuchen Yao. You are all the best part of my time at CLSP.

Ruth Scally, Yamese Diggs and Carl Pupa, thank you for your incredible support. In my view, CLSP runs smoothly mainly due to your efforts.

My Mom, Dad, and brother provided encouragement during the most trying times in my Ph.D. Thank you, Amma, Nana, and Chait.

Finally, Nichole, I cannot state what it means to have had your support through my Ph.D. You’ve been by my side from the first exciting phone call I received about my acceptance to the very end and everything in between. This thesis is more yours than mine. Thank you, bebbe.

Contents

Abstract
List of Tables
List of Figures

1 Introduction
   1.1 Macaronic Language
   1.2 Zone of proximal development
   1.3 Our Goal: A Macaronic Machine Foreign Language Teacher
   1.4 Macaronic Data Structures
   1.5 User Modeling
      1.5.1 Modeling Incidental Comprehension
      1.5.2 Proxy Models for Incidental Comprehension
      1.5.3 Knowledge Tracing
   1.6 Searching in Macaronic Configurations
   1.7 Interaction Design
   1.8 Publications

2 Related Work

3 Modeling Incidental Learning
   3.1 Foreign Words in Isolation
      3.1.1 Data Collection and Preparation
      3.1.2 Modeling Subject Guesses
         3.1.2.1 Features Used
      3.1.3 Model
         3.1.3.1 Evaluating the Models
      3.1.4 Results and Analysis
   3.2 Macaronic Setting
      3.2.1 Data Collection Setup
         3.2.1.1 HITs and Submissions
         3.2.1.2 Clues
         3.2.1.3 Feedback
         3.2.1.4 Points
         3.2.1.5 Normalization
      3.2.2 User Model
      3.2.3 Factor Graph
         3.2.3.1 Cognate Features
         3.2.3.2 History Features
         3.2.3.3 Context Features
         3.2.3.4 User-Specific Features
      3.2.4 Inference
      3.2.5 Parameter Estimation
      3.2.6 Experimental Results
         3.2.6.1 Feature Ablation
         3.2.6.2 Analysis of User Adaptation
         3.2.6.3 Example of Learner Guesses vs. Model Predictions
      3.2.7 Future Improvements to the Model
      3.2.8 Conclusion

4 Creating Interactive Macaronic Interfaces for Language Learning
   4.1 Macaronic Interface
      4.1.1 Translation
      4.1.2 Reordering
      4.1.3 “Pop Quiz” Feature
      4.1.4 Interaction Consistency
   4.2 Constructing Macaronic Translations
      4.2.1 Translation Mechanism
      4.2.2 Reordering Mechanism
      4.2.3 Special Handling of Discontiguous Units
   4.3 Discussion
      4.3.1 Machine Translation Challenges
      4.3.2 User Adaptation and Evaluation
   4.4 Conclusion

5 Construction of Macaronic Texts for Vocabulary Learning
   5.1 Introduction
      5.1.1 Limitation
   5.2 Method
      5.2.1 Generic Student Model
      5.2.2 Incremental L2 Vocabulary Learning
      5.2.3 Scoring L2 embeddings
      5.2.4 Macaronic Configuration Search
      5.2.5 Macaronic-Language document creation
   5.3 Variations in Generic Student Models
      5.3.1 Unidirectional Language Model
      5.3.2 Direct Prediction Model
   5.4 Experiments with Synthetic L2
      5.4.1 MTurk Setup
      5.4.2 Experiment Conditions
      5.4.3 Random Baseline
      5.4.4 Learning Evaluation
   5.5 Spelling-Aware Extension
      5.5.1 Scoring L2 embeddings
      5.5.2 Macaronic Configuration Search
   5.6 Experiments with real L2
      5.6.1 Comprehension Experiments
      5.6.2 Retention Experiments
   5.7 Hyperparameter Search
   5.8 Results Varying τ
   5.9 Macaronic Examples
   5.10 Conclusion

6 Knowledge Tracing in Sequential Learning of Inflected Vocabulary
   6.1 Related Work
   6.2 Verb Conjugation Task
      6.2.1 Task Setup
      6.2.2 Task Content
   6.3 Notation
   6.4 Student Models
      6.4.1 Observable Student Behavior
      6.4.2 Feature Design
      6.4.3 Learning Models
         6.4.3.1 Schemes for the Update Vector t
         6.4.3.2 Schemes for the Gates α_t, β_t, γ_t
      6.4.4 Parameter Estimation
   6.5 Data Collection
      6.5.1 Language Obfuscation
      6.5.2 Card Ordering Policy
   6.6 Results & Experiments
      6.6.1 Comparison with Less Restrictive Model
   6.7 Conclusion

7 Conclusion & Future Direction

Vita

List of Tables

1.1 Examples of possible macaronic configurations from the macaronic data structure depicted in Figure 1.2. This data structure supports actions that reorder phrases within the macaronic sentence, thus generating substrings like “telle une” and “a such” which are both not in the word-ordering of the language they are in.

1.2 Examples of possible macaronic configurations from the simplified macaronic data structure in Figure 1.3. Note that the words (even the French words) are always in the English word-order. Thus, using this data structure we cannot obtain configurations that include substrings like “a such” or “une telle”. We envision this data structure to be useful for a native speaker of English learning French vocabulary, but it is also possible that a student seeing English words in French word order can learn about French word ordering. Then, gradually, we can replace the English words (in French word order) with French words (in French word order), forming fluent French sentences.

3.1 Three tasks derived from the same German sentence.

3.2 Correlations between selected feature values and answer guessability, computed on training data (starred correlations significant at p < 0.01). Unavailable features are represented by “n/a” (for example, since the German word is not observed in the cloze task, its edit distance to the correct solution is unavailable). Due to the format of our triples, it is still possible to test whether these unavailable features influence the subject’s guess: in almost all cases they indeed do not appear to, since the correlation with guessability is low (absolute value < 0.15) and not statistically significant even at the p < 0.05 level.

3.3 Feature ablation. The single highest-correlating feature (on dev set) from each feature group is shown, followed by the entire feature group. All versions with more than one feature include a feature for the OOV guess. In the correlation column, p-values < 0.01 are marked with an asterisk.

3.4 Examples of incorrect guesses and potential sources of confusion.

3.5 Percentage of foreign words for which the user’s actual guess appears in our top-k list of predictions, for models with and without user-specific features (k ∈ {1, 25, 50}).

3.6 Quality correlations: basic and user-adapted models.

3.7 Impact on quality correlation (QC) of removing features from the model. Ablated QC values marked with an asterisk (∗) differ significantly from the full-model QC values in the first row (p < 0.05, using the test of Preacher (2002)).

4.1 Summary of learner-triggered interactions in the Macaronic Interface.

4.2 Generating reordered strings using units.

5.1 An example English (L1) sentence with German (L2) glosses. Using the glosses, many possible macaronic configurations are possible. Note that the gloss sequence is not a fluent L2 sentence.

5.2 Results from MTurk data. The first section shows the percentage of tokens that were replaced with L2 glosses under each condition. The Accuracy section shows the percentage token accuracy of MTurk participants’ guesses along with a 95% confidence interval calculated via bootstrap resampling.

5.3 Results of MTurk results split up by word-class. The y-axis is the percentage of tokens belonging to a word-class. The pink bar (right) shows the percentage of tokens (of a particular word-class) that were replaced with an L2 gloss. The blue bar (left) indicates the percentage of tokens (of a particular word-class) that were guessed correctly by MTurk participants. Error bars represent 95% confidence intervals computed with bootstrap resampling. For example, we see that only 5.0% (pink) of open-class tokens were replaced into L2 by the DP model at r_max = 1 and 4.3% of all open-class tokens were guessed correctly. Thus, even though the guess accuracy for DP at r_max = 1 for open-class is high (86%), we can see that participants were not exposed to many open-class word tokens.

5.4 Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the GSM model.

5.5 Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the uGSM model.

5.6 Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the DP generic student model. Only common function words seem to be replaced with their L2 translations.

5.7 Results comparing our generic-student-based approach to a random baseline. The first part shows the number of L2 word types exposed by each model for each word class. The second part shows the average guess accuracy percentage for each model and word class. 95% confidence intervals (in brackets) were computed using bootstrap resampling.

5.8 Results of our L2 learning experiments where MTurk subjects simply read a macaronic document and answered a vocabulary quiz at the end of the passage. The table shows the average guess accuracy percentage along with 95% confidence intervals computed from bootstrap resampling.

5.9 Average token guess quality (τ = 0.6) in the comprehension experiments. The ± denotes a 95% confidence interval computed via bootstrap resampling of the set of human subjects. The % of L1 tokens replaced with L2 glosses is in parentheses. §5.8 evaluates with other choices of τ.

5.10 Average type guess quality (τ = 0.6) in the retention experiment. The % of L2 gloss types that were shown in the macaronic document is in parentheses. §5.8 evaluates with other choices of τ.

5.11 MRR scores obtained with different hyperparameter settings.

5.12 Number of L1 tokens replaced by L2 glosses under different hyperparameter settings.

5.13 Number of distinct L2 word types present in the macaronic document under different hyperparameter settings.

5.14 An expanded version of Table 5.9 (human comprehension experiments), reporting results with various values of τ.

5.15 An expanded version of Table 5.10 (human retention experiments), reporting results with various values of τ.

6.1 Content used in training sequences. Phrase pairs with * were used for the quiz at the end of the training sequence. This Spanish content was then transformed using the method in Section 6.5.1.

6.2 Summary of update schemes (other than RNG).

6.3 Table summarizing prediction accuracy and cross-entropy (in nats per prediction) for different models. Larger accuracies and smaller cross-entropies are better. Within an update scheme, the † indicates significant improvement (McNemar’s test, p < 0.05) over the next-best gating mechanism. Within a gating mechanism, the ∗ indicates significant improvement over the next-best update scheme. For example, NG+CM is significantly better than NG+VM, so it receives a †; it is also significantly better than RG+CM, and receives a ∗ as well. These comparisons are conducted only among the pure update schemes (above the double line). All other models are significantly better than RG+SM (p < 0.01).

6.4 Comparison of our best-performing PKT model (RNG+CM) to our LSTM model. On our dataset, PKT outperforms the LSTM both in terms of accuracy and cross-entropy.

7.1 Examples of inputs and predicted outputs by our experimental NMT model trained to generate macaronic language sentences using annotations on the input sequence. We see that the macaronic language translations are able to correctly order German portions of the sentences, especially at the sentence ending. The source-features have also been learned by the NMT model and translations are faithful to the markup. The case, tokenization and italics were added in post.

List of Figures

1.1 A schematic overview of our goal.

1.2 Macaronic data structure extracted from word alignments. The black lines represent edges that replace a unit, usually from one language (black for English) to another (blue for French). For example, edge (i) replaces “first” with “premier” and vice-versa. In some cases the edges connect two English tokens (such as “introduce” and “submit”) as an intermediate step between “introduce” and “presenter”. In other cases the black substitution edge defines a single substitution even when there is more than one token in the unit. For example, “such a” is connected to “such une” via a substitution edge (ii). The orange edges perform a reordering action; for example, “such a” can be transformed into “a such” by traversing an orange edge. Only two edges are marked with roman numerals for clarity.

1.3 A simplified macaronic data structure that only considers word replacements without any reordering of words.

3.1 Average guessability by context type, computed on 112 triples (from the training data). Error bars show 95%-confidence intervals for the mean, under bootstrap resampling of the 112 triples (we use BCa intervals). Mean accuracy increases significantly from each task to the next (same test on difference of means, p < 0.01).

3.2 Average Normalized Character Trigram Overlap between guesses and the German word.

3.3 Correlation between empirically observed probability of the correct answer (i.e. the proportion of human subject guesses that were correct) and model probability assigned to the correct answer across all tasks in the test set. Spearman’s correlation of 0.725.

3.4 Percent of examples labeled with each label by a majority of annotators (may sum to more than 100%, as multiple labels were allowed).

3.5 After a user submits a set of guesses (top), the interface marks the correct guesses in green and also reveals a set of translation clues (bottom). The user now has the opportunity to guess again for the remaining German words.

3.6 In this case, after the user submits a set of guesses (top), two clues are revealed (bottom): “ausgestellt” is moved into English order and then translated.

3.7 Model for user understanding of L2 words in sentential context. This figure shows an inference problem in which all the observed words in the sentence are in German (that is, Obs = ∅). As the user observes translations via clues or correctly-marked guesses, some of the E_i become shaded.

3.8 Actual quality sim(ê, e*) of the learner’s guess ê on development data, versus predicted quality sim(e, e*) where e is the basic model’s 1-best prediction.

3.9 Actual quality sim(ê, e*) of the learner’s guess ê on development data, versus the expectation of the predicted quality sim(e, e*) where e is distributed according to the basic model’s posterior.

3.10 The user-specific weight vectors, clustered into groups. Average points per HIT for the HITs completed by each group: (a) 45, (b) 48, (c) 50 and (d) 42.

3.11 Two examples of the system’s predictions of what the user will guess on a single submission, contrasted with the user’s actual guess. (The user’s previous submissions on the same task instance are not shown.) In 3.11a, the model correctly expects that the substantial context will inform the user’s guess. In 3.11b, the model predicts that the user will fall back on string similarity, although we can see that the user’s actual guess of “and day” was likely informed by their guess of “night”, an influence that our CRF did consider. The numbers shown are log-probabilities. Both examples show the sentences in a macaronic state (after some reordering or translation has occurred). For example, the original text of the German sentence in 3.11b reads “Deshalb durften die Paare nur noch ein Kind bekommen”. The macaronic version has undergone some reordering, and has also erroneously dropped the verb due to an incorrect alignment.

4.1 Actions that translate words.

4.2 Actions that reorder phrases.

4.3 State diagram of learner interaction (edges) and system’s response (vertices). Edges can be traversed by clicking (c), hovering above (a), hovering below (b) or the enter (e) key. Unmarked edges indicate an automatic transition.

4.4 The dotted lines show word-to-word alignments between the German sentence f_0, f_1, ..., f_7 and its English translation e_0, e_1, ..., e_6. The figure highlights 3 of the 7 units: u_2, u_3, u_4.

4.5 A possible state of the sentence, which renders a subset of the tokens (shown in black). The rendering order (Section 4.2.2) is not shown but is also part of the state. The string displayed in this case is “Und danach they run noch einen Marathon.” (assuming no reordering).

4.6 Figure 4.6a shows a simple discontiguous unit. Figure 4.6b shows a long-distance discontiguity which is supported. In Figure 4.6c the interruptions align to both sides of e_3, which is not supported. In situations like 4.6c, all associated units are merged as one phrasal unit (shaded) as shown in Figure 4.6d.

5.1 A screenshot of a macaronic sentence presented on Mechanical Turk.

6.1 Screen grabs of card modalities during training. These examples show cards for a native English speaker learning Spanish verb conjugation. Fig 6.1a is an EX card, Fig 6.1b shows a MC card before the student has made a selection, and Fig 6.1c and 6.1d show MC cards after the student has made an incorrect or correct selection respectively, Fig 6.1e shows a MC card that is giving the student another attempt (the system randomly decides to give the student up to three additional attempts), Fig 6.1f shows a TP card where a student is completing an answer, Fig 6.1g shows a TP card that has marked a student answer wrong and then revealed the right answer (the reveal is decided randomly), and finally Fig 6.1h shows a card that is giving a student feedback for their answer.

6.2 Quiz performance distribution (after removing users who scored 0).

6.3 Plot comparing the models on test data under different conditions. Conditions MC and TP indicate Multiple-choice and Typing questions respectively. These are broken down to the cases where the student answers them correctly (C) and incorrectly (IC). SM, VM, and CM represent scalar, vector, and context retention and acquisition gates (shown with different colors), respectively, while RG, NG and FG are redistribution, negative and feature vector update schemes (shown with different hatching patterns).

6.4 Predicting a specific student’s responses. For each response, the plot shows our model’s improvement in log-probability over the uniform baseline model. TP cards are the square markers connected by solid lines (the final 7 squares are the quiz), while MC cards, which have a much higher baseline, are the circle markers connected by dashed lines. Hollow and solid markers indicate correct and incorrect answers respectively. The RNG+CM model is shown in blue and the FG+SM model in red.

Chapter 1

Introduction

Growing interest in self-directed language learning methods like Duolingo (Ahn, 2013), along with recent advances in machine translation and the widespread ease of access to a variety of texts in a large number of languages, has given rise to a number of web-based tools related to language learning, ranging from dictionary apps to more interactive tools like Alpheios (Nelson, 2007) or Lingua.ly (2013). Most of these require hand-curated lesson plans and learning activities, often with explicit instructions.

Proponents of language acquisition through extensive reading, such as Krashen (1989), argue that much of language acquisition takes place through incidental learning: when a learner is exposed to novel vocabulary or structures and must find a way to understand them in order to comprehend the text. Huckin and Coady (1999) and Elley and Mangubhai (1983) observe that incidental learning is not limited to reading in one’s native language (L1) and extends to reading in a second language (L2) as well. Free reading also offers the benefit of being a “low-anxiety” (or even pleasurable) source of comprehensible input for many L2 learners (Krashen, 2003).

There is considerable evidence showing that free voluntary reading can play a role in foreign language learning. Lee, Krashen, and Gribbons (1996) studied international students in the United States and found the amount of free reading of English to be a significant predictor of the ability to judge the grammaticality of complex sentences. The amount of formal study and length of residence in the United States were not strong predictors. In Stokes, Krashen, and Kartchner (1998), students learning Spanish were tested on their understanding of the subjunctive. Students were not informed of the specific focus of the test (i.e. that it was on the subjunctive). The study found that attributes such as formal study, length of residence in a foreign-speaking country, and the student’s subjective rating of the quality of the formal study they receive failed to predict performance on the subjunctive test. The amount of free reading, however, was a strong predictor. Constantino et al. (1997) showed that free reading was a strong predictor of performance in the Test of English as a Foreign Language (TOEFL). However, they did find that other attributes such as time of residence in the United States and amount of formal study were also significant predictors of performance. Kim and Krashen (1998) go beyond self-reported reading amounts and were able to find a correlation between the performance of students in the English as a foreign language test and performance on the “author recognition” test. In the author recognition test, subjects indicate whether they recognize a name as an author among a list of names provided to them. The author recognition test has also been used in other first-language studies as well, such as Chinese (Lee and Krashen, 1996), Korean (Kim and Krashen, 1998) and Spanish (Rodrigo, McQuillan, and Krashen, 1996). In the case of second language acquisition, however, learning by reading already requires considerable L2 fluency, which may prove a barrier for beginners. Thus, in order to allow students to engage with the L2 language early on, educators may use texts written in simplified forms, texts specifically designed for L2 learners (e.g. texts with limited/focused vocabularies), or texts intended for young L1 learners of the given L2. “Handcrafted Books” have been proposed by Dupuy and McQuillan (1997) as a way to generate L2 reading material that is both accessible and enjoyable to foreign language students. Handcrafted Books are essentially articles, novels or essays written by foreign language students at an intermediate level and subsequently corrected by a teacher. The student writers are instructed not to look up words while writing, which helps keep the resulting material within the ability of beginner students. While this approach gives educators control over the learning material, it lacks scalability and suffers from similar issues as hand-curated lesson plans. As a result, a learner interested in reading in a second language might have few choices in the type of texts made available to them.

Our proposal is to make use of “macaronic language,” which offers a mixture of the learner’s L1 and their target L2. The amount of mixing can vary and might depend on the learner’s proficiency and desired content. Additionally, we propose automatically creating such “macaronic language,” allowing our learning paradigm to scale across a wide variety of content. Our hope is that this paradigm can potentially convert any reading material into one of pedagogical value and could easily become a part of the learner’s daily routine.

1.1 Macaronic Language

Why do the French only eat one egg for breakfast? Because one egg is un œuf.

The term “macaronic” traditionally refers to a mash-up of languages, often intended to be humorous. Similar to text that contains code-switching, typical macaronic texts are intended for a bilingual audience; however, they differ from code-switching as they are not governed by syntactic and pragmatic considerations. Macaronic texts are also “synthetic” in the sense that they are traditionally constructed for the purpose of humor (bilingual puns) and do not arise naturally in conversation. In this thesis, however, we use the term macaronic for bilingual texts that have been synthetically constructed for the purpose of second language learning. Thus, our macaronic texts do not require fluency in both languages, and their usage only assumes that the student is fluent in their native language. This thesis investigates the applicability of macaronic texts as a medium for life-long second language learning. Flavors of this idea have been explored before, which we cover in Chapter 2, but it is especially worth noting that the earliest published macaronic memoir we could find is “On Foreign Soil” (Zolf and Green, 2003), which is a translation of an earlier Yiddish novel (Zolf, 1945). Green’s translation begins in English with a few Yiddish words transliterated; as the novel progresses, entire transliterated Yiddish phrases are presented to the reader.

1.2 Zone of proximal development

Second language (L2) learning requires the acquisition of vocabulary as well as knowledge of the language’s constructions. One of the ways in which learners become familiar with novel vocabulary and linguistic constructions is through reading. According to Krashen’s Input Hypothesis (Krashen, 1989), learners acquire language through incidental learning, which occurs when learners are exposed to comprehensible input. What constitutes “comprehensible input” for a learner varies as their knowledge of the L2 increases. For example, a student in their first month of German lessons would be hard-pressed to read German novels or even front-page news, but they might understand brief descriptions of daily routines. Comprehensible input need not be completely familiar to the learner; it could include novel vocabulary items or structures, whose meanings they can glean from context. Such input falls in the “zone of proximal development” (Vygotsky, 2012), just outside of the learner’s comfort zone. The related concept of “scaffolding” (Wood, Bruner, and Ross, 1976) consists of providing assistance to the learner at a level that is just sufficient for them to complete their task, which in our case is understanding a sentence. In this context, macaronic text can offer a flexible modality for L2 learning. The L1 portion of the text can provide scaffolding while the L2 portion, if not previously seen by the student, can provide novel vocabulary and linguistic constructions of pedagogical value. So, if we re-purpose the pun from above into a macaronic text for L2 French learners (whose L1 is English), we might construct the following, with the hope that readers can infer the meanings of un and œuf:

Why do the French have only un egg for breakfast? Because un œuf is enough.

In addition to vocabulary, macaronic scaffolding can extend to syntactic structures as well. For example, consider the following text: “Der Student turned in die Hausaufgaben, that the teacher assigned had.” Here, German vocabulary is indicated in bold and German syntactic structures are indicated in italics. Even a reader with no knowledge of German is likely to be able to understand this sentence by using context and cognate clues. One can imagine increasing the amount of German in such sentences as the learner’s vocabulary increases, thus carefully removing scaffolding (English) and keeping the learner in their zone of proximal development.

1.3 Our Goal: A Macaronic Machine Foreign Language Teacher

Our vision is to build an AI foreign-language teacher that gradually converts documents (stories, articles, etc.) written in a student’s native language into the L2 the student wants to learn, by automatically replacing the L1 vocabulary, morphology, and grammar with the L2 forms (see Figure 1.1). This “gradual conversion” involves replacing more L1 words (or phrases) with their L2 counterparts, and occurs as the learner slowly gains L2 proficiency (say, over a period of days or weeks). We don’t expect a significant increase in language proficiency during the course of a single novel; however, we expect more conversions mainly due to the repeated appearance of key vocabulary items during the novel. Thus, L2 conversions can accumulate over the course of the novel. This AI teacher will leverage a student’s inherent ability to guess the meaning of foreign words and constructions based on the context in which they appear and similarities to previously known words. We envision our technology being used alongside traditional classroom L2 instruction, the same instructional mix that leads parents to accept inventive spelling (Gentry, 2000),[1] in which early writers are encouraged to write in their native language without concern for correct spelling, in part so they can more fully and happily engage with the writing challenge of composing longer and more authentic texts without undue distraction (Clarke, 1988). Traditional grammar-based instruction and assessment, which uses “toy” sentences in pure L2, should provide further scaffolding for our users to acquire language by reading more advanced (but macaronic) text.

Figure 1.1: A schematic overview of our goal.

[1] Learning how to spell, like learning an L2, is a type of linguistic knowledge that is acquired after L1 fluency and largely through incidental learning (Krashen, 1993).

Automatic construction of comprehensible macaronic texts as reading material, perhaps online and personalized, would be a useful educational technology. Broadly speaking, this requires:

1. A data structure for manipulating word- (or phrase-) aligned bitexts so they can be rendered as macaronic sentences,

2. Modeling a student’s comprehension when they read these macaronic sentences (i.e. what can an L2 learner understand in a given context?), and

3. Searching over many possible candidate comprehensible inputs to find one that is most suited to the student, balancing the amount of new L2 they encounter as well as ease of reading.

In the schematic of Figure 1.1, the AI teacher performs points (2) and (3) from above, while (1) defines the input (along with meta-data) required by the AI teacher to perform (2) and (3). There are several ways of realizing each of the three points above. In this thesis, each chapter explores a subset of these three points. Chapter 3 and Chapter 4 cover points (1) and (2), using human data to construct student models. Chapter 5 describes another AI teacher that can accomplish (2) and (3) but makes some simplifying assumptions about the input data (1). Chapter 6 details another kind of student modeling (point (2)) for a verb-inflection learning task and focuses on short-term longitudinal modeling of student knowledge as they progress through this task.

The points made above can also be recast from a Reinforcement Learning perspective. The data structure we design defines the set of all possible actions an agent (the AI teacher) can take. The agent acts upon the student’s observations and tries to infer their comprehension of a sentence (and, more generally, their level of proficiency in the L2). Thus, the student is the environment that the agent is acting upon. Finally, the search algorithm that the AI teacher employs is the policy the agent follows. This perspective also suggests that the policy could involve planning for long-term rewards (which in our framework is L2 proficiency) using policies that look ahead into the future to make optimal decisions in the present. However, we leave planning in the macaronic space to future work. In this thesis, we limit ourselves to greedy-search techniques that do not consider long-term rewards.

A reasonable concern is whether exposure to the mixture of languages a macaronic text offers might actually harm acquisition of the “correct” version of a text written solely in the L2. To address this, our proposed interface uses color and font to mark the L1 “intrusions” into the L2 sentence, or the L2 intrusions into the L1 sentence. We again draw a parallel to inventive spelling and highlight that the focus of a learner should be more on continuous engagement with L2 content even if it appears “incorrectly” within L2 texts.

1.4 Macaronic Data Structures

Several strategies can be followed to create the required data structures so that an AI teacher can manipulate and render macaronic texts. One such structure for an English-French bitext is shown in Figure 1.2. We refer to a single translation pair of a source sentence and a target sentence as a bitext. We use word alignments, which are edges connecting words from the source sentence to those in the target sentence in the bitext, and convert the bitext into a set of connected units. While the majority of units contain single words, note that some of the units contain phrases with internal reorderings, such as “une telle” or “such une”. Each unit is a bipartite graph with French words on one end and English words on the other. The AI teacher can select different cuts in each unit to render different macaronic configurations (see Table 1.1). Figure 1.2 shows the graph data structure and Table 1.1 lists some macaronic renderings or configurations that can be obtained from different cuts in the graph data structure. We construct these data structures automatically, but in our experiments we correct them manually, as word alignments are noisy and sometimes link source and target tokens that are not translations of each other. Furthermore, intermediate tokens in the macaronic data structure, such as “The Arizona” and “submit” in Figure 1.2, cannot be obtained by merely using bitext and word alignments. These intermediate forms were added manually as well. We refer to the items (rows of text) in Table 1.1 as macaronic configurations. Note that macaronic configurations are macaronic sentences; we use the term “configuration” to denote the set of macaronic sentences from a single piece of content (i.e. from a single bitext with its supporting data structures).

Figure 1.2: Macaronic data structure extracted from word alignments. The black lines represent edges that replace a unit, usually from one language (black for English) to another (blue for French). For example, edge (i) replaces “first” with “premier” and vice-versa. In some cases the edges connect two English tokens (such as “introduce” and “submit”) as an intermediate step between “introduce” and “presenter”. In other cases the black substitution edge defines a single substitution even when there is more than one token in the unit. For example, “such a” is connected to “such une” via a substitution edge (ii). The orange edges perform a reordering action; for example, “such a” can be transformed into “a such” by traversing an orange edge. Only two edges are marked with roman numerals for clarity.

Figure 1.3: A simplified macaronic data structure that only considers word replacements without any reordering of words.

L' Arizona fut le premier a presenter une telle exigence
L' Arizona fut le first a presenter une telle exigence
...
L' Arizona was the first a presenter une telle exigence
L' Arizona was the first a presenter telle une requirement
L' Arizona was the first a presenter telle a requirement
L' Arizona was the first a presenter such a requirement
...
L' Arizona was the first to introduce such a requirement
Arizona was the first to introduce such a requirement

Table 1.1: Examples of possible macaronic configurations from the macaronic data structure depicted in Figure 1.2. This data structure supports actions that reorder phrases within the macaronic sentence, thus generating substrings like “telle une” and “a such” which are both not in the word-ordering of the language they are in.

Arizona fut le premier a introduire telle une exigence
Arizona was le premier a introduire telle une exigence
...
Arizona was le premier a introduire such a requirement
...
Arizona was the first to introduire such a requirement
Arizona was the first to introduce such a requirement

Table 1.2: Examples of possible macaronic configurations from the simplified macaronic data structure in Figure 1.3. Note that the words (even the French words) are always in the English word-order. Thus, using this data structure we cannot obtain configurations that include substrings like “a such” or “une telle”. We envision this data structure to be useful for a native speaker of English learning French vocabulary, but it is also possible that a student seeing English words in French word order can learn about French word ordering. Then, gradually, we can replace the English words (in French word order) with French words (in French word order), forming fluent French sentences.

An alternative strategy is to use only lexical replacements for L1 tokens. This data structure strategy assumes the availability of L2 glosses for each token in the L1 sentence. The units formed in this case are simple “one-to-one” mappings (see Figure 1.3). This data structure is less expressive in the sense that it can only render macaronic configurations in the L1 word order. That is, even though it can display L2 words, these words are only displayed in their L1 orderings. This limitation simplifies the AI teacher’s choice of actions but also limits the resulting content to only have value for learning foreign vocabulary. Note that even this simple structure allows for exponentially many possible configurations that the AI teacher must be able to search over.
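To illustrate the replacement-only data structure, the following minimal sketch represents a sentence as a list of (L1 token, optional L2 gloss) units and enumerates the configurations that different replacement decisions produce. The class and function names are our own, chosen for exposition; they are not the thesis code.

    from dataclasses import dataclass
    from itertools import product
    from typing import List, Optional, Tuple

    @dataclass
    class Unit:
        """A one-to-one unit in the simplified macaronic data structure."""
        l1: str                    # native-language token, e.g. "requirement"
        l2: Optional[str] = None   # foreign gloss, e.g. "exigence"; None if unavailable

    def render(units: List[Unit], choices: Tuple[bool, ...]) -> str:
        """Render one configuration: choices[i] == True shows the L2 gloss at position i."""
        return " ".join(u.l2 if (c and u.l2) else u.l1 for u, c in zip(units, choices))

    def all_configurations(units: List[Unit]):
        """Enumerate all 2^k configurations, where k = number of units with an L2 gloss."""
        flags = [(False, True) if u.l2 else (False,) for u in units]
        for choice in product(*flags):
            yield render(units, choice)

    sentence = [Unit("Arizona"), Unit("was", "fut"), Unit("the", "le"),
                Unit("first", "premier"), Unit("to", "a"),
                Unit("introduce", "introduire"), Unit("such", "telle"),
                Unit("a", "une"), Unit("requirement", "exigence")]

    for config in all_configurations(sentence):
        print(config)   # 2^8 = 256 configurations for this one sentence

Even this nine-token sentence yields 2^8 = 256 configurations; a full document multiplies such counts, which is why the search procedures discussed in Section 1.6 are needed.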

The data structures described above are a subset of more complex structures. It is possible to construct data structures that connect subword units from the English side to an equivalent subword unit on the French side. Such a data structure could be used to render macaronic sentences that are macaronic at the word level. Furthermore, the data structure could also support non-contiguous reorderings. We leave such data structures for future work and focus on the more straightforward local reorderings and simple word replacement methods, mainly to allow for fast search procedures.

The graph structure in Figure 1.2 could be made more complex with hyper-edges connecting a set of tokens on the English side (“such a”) to a set of tokens on the French side (“une telle”) of the text.

1.5 User Modeling

1.5.1 Modeling Incidental Comprehension

In order to deliver macaronic content that can be understood by a learner, we must first build a model of a learner’s comprehension abilities. Would a native English speaker learning German be able to comprehend a sentence like “The Nil is a Fluss in Africa”? Would they correctly map the German words “Nil” and “Fluss” to “Nile” and “river”? Is there a chance they incorrectly map the German words to other plausible English words, for example, “Namib” and “desert”? We approach this comprehension modeling problem by building probabilistic models that can predict whether an L2 student might comprehend a macaronic sentence when they read it. One way to estimate comprehension is to ask L2 students to guess the meaning of L2 words or phrases within the macaronic sentence. A correct guess implies that there are sufficient clues, either from the L1 words in the sentence, from the L2 words, or from both, to comprehend the sentence.

We begin with a simplified version of macaronic sentences wherein only a single word (a noun) is replaced with its L2 translation. For example: “The next important Klima conference is in December”. We build predictive models that take the L1 context and the L2 word as input and predict what a novice L2 learner (studying German in this example) might say is the meaning of the novel word “Klima”.

Next, we address the case in which multiple words in the macaronic sentence are converted into their L2 form. Consider the earlier example: “The Nil is a Fluss in Africa”. In such cases, we need to jointly predict the meaning a student would assign to all the L2 words. This is necessary because a student’s interpretation of one word will influence how they interpret the remaining L2 words in the sentence. For example, if the student assigns the meaning “Nile” to “Nil”, this might influence them to guess that “Fluss” should be interpreted as “river”. Alternatively, if they interpret “Fluss” as “forest”, they might then guess that “Nil” is the name of a forest in Africa. In other words, there is a cyclical dependency between the guesses that a student makes. Details of our proposed models for capturing incidental learning are discussed in Chapter 3. Code used for our experiments is available at https://github.com/arendu/MacaronicUserModeling.
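The cyclical dependency described above is why guesses must be predicted jointly. The toy sketch below (not the factor-graph model of Chapter 3; the candidate lists, similarity measure and compatibility scores are invented for illustration) shows how updating guesses in rounds lets the guess for one L2 word pull the guess for another toward a compatible meaning.

    # Toy illustration of joint guessing; not the CRF/factor-graph model of Chapter 3.
    from difflib import SequenceMatcher

    def string_sim(a: str, b: str) -> float:
        """Crude cognate/string-similarity feature."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Hypothetical candidate meanings and a hand-specified compatibility table.
    candidates = {"Nil": ["Namib", "Nile"], "Fluss": ["flux", "river", "forest"]}
    compatible = {("Nile", "river"): 1.0, ("Namib", "forest"): 0.4}

    def joint_guess(slots, rounds=3):
        guesses = {w: cands[0] for w, cands in slots.items()}   # start from first candidates
        for _ in range(rounds):                                 # coordinate-wise updates
            for w, cands in slots.items():
                others = [g for v, g in guesses.items() if v != w]
                def score(c):
                    cognate = string_sim(w, c)                  # looks like the L2 word?
                    context = sum(compatible.get((g, c), 0.0) +
                                  compatible.get((c, g), 0.0) for g in others)
                    return cognate + context                    # fits the other guesses?
                guesses[w] = max(cands, key=score)
        return guesses

    # "Fluss" starts out as the string-similar "flux", but once "Nil" is guessed as
    # "Nile" the context term pulls "Fluss" toward "river".
    print(joint_guess(candidates))   # {'Nil': 'Nile', 'Fluss': 'river'}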

1.5.2 Proxy Models for Incidental Comprehension

One way to build models of human incidental comprehension is to collect data from humans. In the previous section (with details in Chapter 3), we require humans to read a candidate macaronic text and provide feedback as to whether the text was comprehensible and whether the L2 words in the text were understood. Using this feedback (i.e. labeled data), we build a model of a “generic” student. Collecting this labeled data from student annotators is expensive, not only from a data collection point of view but also for the students, as they would have to give feedback on candidate macaronic texts generated by an untrained machine teacher.

As an alternative to collecting labeled data in this way, we investigate using cloze language models as a proxy for models of incidental comprehension. A cloze language model can be trained with easily available L1 corpora from any domain (that potentially is of interest to a student). We can refine the cloze language model with real student supervision in an online fashion as a student interacts with macaronic text generated by the AI teacher. In other words, this cloze language model can be personalized to individual students by looking at what they are able to understand and making updates to the model accordingly. Essentially, the cloze language model allows us to bootstrap the macaronic learning setup without expensive data collection overhead. Details of our use of proxy user models are described in Chapter 5.
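As a concrete (but simplified) picture of what such a proxy could look like, the sketch below defines a bag-of-context cloze scorer in PyTorch: it assigns a probability to the L1 word hidden at one position, and that probability can stand in for whether a student would still understand the sentence if that word were swapped for its L2 gloss. The architecture and names here are assumptions for illustration, not the specific cloze model of Chapter 5.

    # Minimal cloze "proxy student" sketch; the architecture is an assumption for
    # illustration, not the exact model used in Chapter 5.
    import torch
    import torch.nn as nn

    class ClozeProxyStudent(nn.Module):
        def __init__(self, vocab_size: int, dim: int = 64):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
            # context_ids: (batch, context_len) ids of the words around the blank.
            h = self.emb(context_ids).mean(dim=1)          # bag-of-context encoding
            return torch.log_softmax(self.out(h), dim=-1)  # distribution over the blank

    def comprehension_score(model, context_ids, target_id) -> float:
        """Log-probability the proxy student assigns to the hidden L1 word."""
        with torch.no_grad():
            return model(context_ids)[0, target_id].item()

    # Toy usage with made-up ids; in practice the model would first be trained on L1
    # text with a standard cross-entropy cloze objective, then refined from feedback.
    model = ClozeProxyStudent(vocab_size=1000)
    ctx = torch.tensor([[5, 17, 42, 7]])   # context word ids around the blank
    print(comprehension_score(model, ctx, target_id=42))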

1.5.3 Knowledge Tracing

Apart from modeling incidental comprehension, an AI teacher would benefit from modeling how a student might update their knowledge based on different pedagogical stimuli, which in our case take the form of different macaronic sentences. Furthermore, in the case of “pop quizzes” (see Chapter 4), the student may receive explicit feedback for their guesses. Ideally, such explicit feedback would also cause the student to update their knowledge. Here, an “update” could entail learning (adding to their knowledge) or forgetting (removing from their knowledge). The longitudinal tracking of the knowledge a learner has as they proceed through a sequence of learning activities is referred to as “knowledge tracing” (Corbett and Anderson, 1994). We study a feature-rich knowledge tracing method that captures a student’s acquisition and retention of knowledge during a foreign language phrase-learning task. Note that we deviate from our macaronic paradigm for this task and focus on short-term longitudinal modeling of a student’s knowledge in a phrase-learning task that teaches L2 verb inflections. In this study, we use flash cards instead of macaronic texts, mainly because of the ease in obtaining longitudinal participation in user studies. We model a student’s behavior as making predictions under a log-linear model, and adopt a neural gating mechanism to model how the student updates their log-linear parameters in response to feedback. The gating mechanism allows the model to learn complex patterns of retention and acquisition for each feature, while the log-linear parameterization results in an interpretable knowledge state.
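A bare-bones sketch of this idea follows: the knowledge state is a weight vector over features, a response is a log-linear (softmax) prediction, and feedback moves the weights through retention and acquisition gates. The feature vectors, gate values and update rule below are simplified assumptions for illustration, not the model of Chapter 6.

    # Simplified knowledge-tracing sketch; gates and updates here are illustrative
    # assumptions, not the exact parameterization of Chapter 6.
    import numpy as np

    def softmax(x):
        z = x - x.max()
        e = np.exp(z)
        return e / e.sum()

    def predict(weights, candidate_features):
        """Log-linear student: P(answer) over the candidate answers of a flash card."""
        scores = np.array([feats @ weights for feats in candidate_features])
        return softmax(scores)

    def gated_update(weights, update_vec, retain_gate, acquire_gate):
        """Retain part of the old knowledge state, acquire part of the feedback signal."""
        return retain_gate * weights + acquire_gate * update_vec

    # Toy usage: 3 features, 2 candidate answers on one card.
    w = np.zeros(3)
    cards = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])]
    print(predict(w, cards))                 # uniform before any learning
    u = cards[0]                             # update vector after seeing the right answer
    w = gated_update(w, u, retain_gate=np.full(3, 0.9), acquire_gate=np.full(3, 0.5))
    print(predict(w, cards))                 # probability shifts toward answer 0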

We collect human data and evaluate several versions of the model. We hypothesize that human data collection for verb inflection is not as problematic as in the full macaronic setting, as there are only a handful of inflectional forms to master (a few dozen, even for morphologically rich languages). Secondly, we do not have to subject student annotators to stimuli generated by untrained AI teachers, making the data collection process beneficial for the student annotators as well. Details of our proposal for knowledge tracing are presented in Chapter 6. The code and data for our experiments are available at https://github.com/arendu/vocab-trainer-experiments.


1.6 Searching in Macaronic Configurations

In §1.5.1 and §1.5.2 we introduce the notion of modeling human comprehension in macaronic settings (either using actual human data or by proxy models). However, our AI teacher still has the difficult task of deciding which specific macaronic sentence to generate for a student reaching some new piece of text. Even in the case of the simplified lexical macaronic data structure, where each L1 word maps to a single L2 word and there are no phrase re-orderings, there are exponentially many possible macaronic configurations that can be generated. The AI teacher must decide on a particular configuration that will be rendered or displayed to a reader by searching over (some subset of) the possible configurations and picking a good candidate. We propose greedy and best-first heuristics to tackle this search problem and also design a scoring function that guides the search process to find good candidates to display. While simple, we find that the greedy and best-first heuristics offer an effective strategy for finding a macaronic configuration with a low computational footprint. In order to continuously update our language model in response to a student’s real-time interactions, the speed of our search is a critical factor. The greedy best-first approach offers a solution that prioritizes speed. To measure the goodness of each search state, our scoring function compares the initially trained L1 word embeddings with the incrementally trained L2 word embeddings and assigns a score reflecting the proximity of the L2 word embeddings to their L1 counterparts. We conduct intrinsic experiments using this proposed scoring function to determine viable search heuristics. Further details on our scoring function and heuristic search are presented in Chapter 5. The code for our experiments is available at https://github.com/arendu/Mixed-Lang-Models.
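The sketch below gives a flavor of this greedy, score-guided procedure: each candidate replacement is scored by the cosine proximity of its (incrementally trained) L2 embedding to the corresponding L1 embedding, and replacements are accepted in best-first order while they clear a threshold. The embeddings, threshold and acceptance rule are illustrative assumptions, not the exact procedure of Chapter 5.

    # Greedy, embedding-proximity-guided configuration search (illustrative sketch).
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def greedy_macaronic(tokens, glosses, l1_emb, l2_emb, threshold=0.5):
        """Replace L1 tokens whose L2 gloss embedding is already close to the L1
        token's embedding, i.e. glosses the student model treats as 'known'."""
        config = list(tokens)
        scored = [(cosine(l2_emb[g], l1_emb[t]), i, g)
                  for i, (t, g) in enumerate(zip(tokens, glosses)) if g is not None]
        for score, i, g in sorted(scored, reverse=True):   # best-first ordering
            if score < threshold:
                break                                      # remaining scores are lower
            config[i] = g                                  # greedily accept replacement
        return config

    # Toy usage with random vectors; real scores come from the generic student model.
    rng = np.random.default_rng(0)
    l1_emb = {w: rng.normal(size=16) for w in ["the", "first", "requirement"]}
    l2_emb = {"le": l1_emb["the"] + 0.1 * rng.normal(size=16),
              "premier": rng.normal(size=16), "exigence": rng.normal(size=16)}
    print(greedy_macaronic(["the", "first", "requirement"],
                           ["le", "premier", "exigence"], l1_emb, l2_emb))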

1.7 Interaction Design

While macaronic texts can be “consumed” as static documents, the prevalence of e-readers and web-based reading interfaces allows us to take advantage of a more interactive reading and learning experience. We propose a user interface that can be helpful to a human learner, while also enabling the AI teacher to adapt itself towards a specific learner. Specifically, when a macaronic document is presented to the student, we provide functionality that allows a student to click on L2 words or phrases in order to reveal their L1 translation. This helps the reader to progress if they are struggling with how to interpret a given word or phrase. Furthermore, we can log a student’s clicking interactions and use them as feedback for our machine teacher. Many clicks within a sentence might indicate that the macaronic configuration of the sentence was beyond the learner’s reading abilities, and the teacher can update its models accordingly. Apart from this, there is also the option of the teacher not revealing the L1 translation, but rather prompting the student to first type in a guess for the meaning of the word. This process helps to disambiguate between those students who click just for confirmation of their knowledge and those who genuinely don’t know the word at all. By looking at what a student has typed, we can determine how close the student’s knowledge is to an actual understanding of the word or phrase. Details of our Interaction Design proposal are described in Chapter 4, but we leave the construction of an interactive system which iteratively refines its model for future work.

1.8 Publications

This thesis is mainly the culmination of the following six publications (including one demonstration track publication and one workshop publication):

1. Analyzing Learner Understanding of Novel L2 Vocabulary.
Rebecca Knowles, Adithya Renduchintala, Philipp Koehn and Jason Eisner. Conference on Computational Natural Language Learning (CoNLL), 2016.

2. Creating Interactive Macaronic Interfaces for Language Learning.
Adithya Renduchintala, Rebecca Knowles, Philipp Koehn and Jason Eisner. System Description, Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

3. User Modeling in Language Learning with Macaronic Texts.
Adithya Renduchintala, Rebecca Knowles, Philipp Koehn and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

4. Knowledge Tracing in Sequential Learning of Inflected Vocabulary.
Adithya Renduchintala, Philipp Koehn and Jason Eisner. Conference on Computational Natural Language Learning (CoNLL), 2017.

5. Simple Construction of Mixed-Language Texts for Vocabulary Learning.
Adithya Renduchintala, Philipp Koehn and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) Workshop on Innovative Use of NLP for Building Educational Applications, 2019.

6. Spelling-Aware Construction of Macaronic Texts for Teaching Foreign-Language Vocabulary.
Adithya Renduchintala, Philipp Koehn and Jason Eisner. Empirical Methods in Natural Language Processing (EMNLP), 2019.

Chapter 2

Related Work

Early adoption of Natural Language Processing (NLP) and speech technology in education was mainly focused on Summative Assessment, where a student’s writing, speaking, or reading is analyzed by an NLP system. Such systems essentially assign a score to the input. Prominent examples include Heilman and Madnani (2012), Burstein, Tetreault, and Madnani (2013) and Madnani et al. (2012). More recently, NLP systems have also been used to provide Formative Assessment. Here, the system provides feedback in a form that a student can act upon to improve their abilities. Formative Assessment has also been studied in other areas of education such as Math and Science. In language education, Formative Assessment may take the form of giving a student qualitative feedback on a particular part of the student’s essay, for example, suggesting a different phrasing. Such systems fall along the lines of intelligent and adaptive tutoring solutions designed to improve learning outcomes. Recent research such as Zhang et al. (2019) and commercial solutions such as Grammarly (2009) are expanding the role of NLP in formative feedback and assessment. An overview of NLP-based work in the education sphere can be found in Litman (2016).

There are also lines of work that do not fall within the definitions of Summative or Formative Assessment. For example, practice question generation is the task of creating pedagogically useful questions for a student, allowing them to practice without the need of a human teacher. Du, Shao, and Cardie (2017) and Heilman and Smith (2010) are among the newer research efforts focused on question generation for reading comprehension. Prior to that, Mitkov and Ha (2003) used rule-based methods to generate questions.

There has also been NLP workspecifictosecondlanguage acquisition,such as Ö z b al,

Pighin, and Strapparava(2014), wherethefocus has beento build a syste mto helplearners

retain ne w vocabulary. As previously mentioned, mobile and web-based appsforsecond

languagelearningsuch aslike Duolingo( Ahn, 2013) are popular a monglearners asthey

allo wself-pacedstudy and hand-crafted curricula. While most ofthese apps have “ga mified”

thelearner’s experience,they still de mand dedicatedti mefro mthelearner.

The process of generating training data for Machine Translation systems also has the potential to yield language learning tools. Hermjakob et al. (2018) describe a tool that allows human annotators to generate translations (target references) from source sentences in a language they do not read. The tool presents a source sentence (for which a reference target is required) to a human annotator in romanized form, along with phrasal glosses of the romanized source text obtained from a lookup table. Hermjakob et al. (2018) observed that by simply allowing the annotators to translate source sentences (with their supporting interface), the annotators learned vocabulary and syntax over time. A similar observation was made in Hu et al. (2011) and Hu, Bederson, and Resnik (2010), who also built tools to obtain reference translations from human annotators who do not read the source language.

The work in this thesis, however, seeks to build a framework based on incidental learning when reading macaronic passages from documents such as news articles, stories, or books. Our approach does not rely on hand-made curricula, does not present explicit instructions, and (hopefully) can be used by foreign language students in the daily course of their lives.1 Our goal is that this would encourage continued engagement, leading to "life-long learning." This notion of incidental learning has been explored in previous work as well. Chen et al. (2015) create a web-based plugin that can expose learners to new foreign language vocabulary while reading news articles. They use a dictionary to show Chinese translations of English words when the learner clicks on an English word in the document (their prototype targets native English speakers learning Chinese). When a particular English word is clicked, the learner is shown that word's Chinese translation. Once the application records the click, it then determines whether the user has reached a certain threshold for that word and automatically replaces future occurrences of the English word with its Chinese translation. The learner can also click on the Chinese translation, at which point they receive a multiple choice question asking them to identify the correct English translation. While they don't use surrounding context when determining which words to teach, the learner has access to context in the form of the rest of the document when making multiple-choice

1 Variations of our framework could (if the learner chooses) provide explicit instructions and make the task more learning-focused at the expense of casual reading.

guesses. Our work is also related to Labutov and Lipson (2014), which also tries to leverage incidental learning using mixed L1 and L2 languages. Whereas their work uses surprisal to choose contexts in which to insert L2 vocabulary, we consider both context features and other factors such as cognate features. Further, we collect data that gives direct evidence of the user's understanding of words by asking them to provide English guesses, rather than indirectly, via questions about sentence validity. The latter indirect approach runs the risk of overestimating the student's knowledge of a word; for instance, a student may have only learned other linguistic information about a word, such as whether it is animate or inanimate, rather than learning its exact meaning. In addition, we are not only interested in whether a mixed L1 and L2 sentence is comprehensible; we are also interested in determining a distribution over the learner's belief state for each word in the sentence. We do this in an engaging, game-like setting, which provides the user with hints when the task is too difficult for them to complete.

Incidental learning can be viewed as a kind of "fast mapping," a process by which children are able to map novel words to their meaning with relatively few exposures (Carey and Bartlett, 1978). Fast mapping is usually studied as a mapping between a novel word and some concept in the immediate scene. Carey and Bartlett (1978) studied whether 3-year-old children can map a novel word, for example "chromium," to an unfamiliar color (olive-green) using a "referent-selection" task, which required a subject to retrieve the correct unfamiliar object from a set of objects. Children were given instructions such as bring the chromium tray, not the blue one. It was observed that children were able to perform such mappings quickly. Subsequently, Alishahi, Fazly, and Stevenson (2008) constructed a probabilistic model and were able to tune this model to fit the empirical observations of previous fast mapping experiments. The model experiences a sequence of utterances U_t in scenes S_t. Each utterance contains words w ∈ U_t, and the "scene" contains a set of concepts m ∈ S_t. With each (U_t, S_t) pair, the model parameters p(m | w) were updated using an online EM update. At the end of a sequence of (U_t, S_t) pairs, t ∈ 1, ..., T, the final model parameters were used to simulate "referent-selection" and retention tasks. We can view the student's task (in our macaronic setting) as an instance of cross-lingual structured fast-mapping, where an utterance is a macaronic sentence and the student is trying to map novel foreign words to words in their native language.

We are also encouraged by recent commercial applications that use a mixed language framework for language education. Swych (2015) claims to automatically generate mixed-language documents, while OneThirdStories (2018) creates human-generated mixed-language stories that begin in one language and gradually incorporate more and more vocabulary and syntax from a second language. Such new developments indicate that there is both space and demand in the language learning community for further exploration of language learning via mixed-language modalities.

In the following chapters, we detail our experiments with modeling user comprehension in mixed language situations, as well as proposing a simplified process for generating macaronic text without initial human student intervention. Finally, we also model ways to track student knowledge as they progress through a language learning activity.

Chapter 3

Modeling Incidental Learning

This chapter details our work on constructing predictive models of human incidental learning. Concretely, we cast the modeling task as a prediction task in which the model predicts if a human student can guess the meaning of a novel L2 word when it appears with the surrounding L1 context. Apart from giving the model access to the context, we also provide the model with features from the novel word itself, such as spelling and pronunciation features, as these would all aid a human student in their guess of the novel word's meaning. Recall that this model is essentially a model of the environment, taking the reinforcement learning perspective of our macaronic learning framework (from § 1.5.1), that the agent (the AI teacher) acts upon.


3.1 Foreign Words in Isolation

We first study a constrained setting where we present novice learners with new L2 words inserted in sentences otherwise written in their L1. In this setting only a single L2 word is present in each sentence. While this is not the only possible setting for incidental acquisition (§ 3.2 discusses the same task for the "full" macaronic setting), this experimental design allows us to assume that all subjects understand the full context, without needing to assess how much L2 they previously understood. We also present novice learners with the same novel words out of context. This allows us to study how "cognateness" and context interact, in a well-controlled setting. We hypothesize that cognates or very common words may be easy to translate without context, while contextual clues may be needed to make other words guessable.

In the initial experiments we present here, we focus on the language pair of English L1 and German L2, selecting Mechanical Turk users who self-identify as fluent English speakers with minimal exposure to German. We confine ourselves to novel nouns, as we expect that the relative lack of morphological inflection in nouns in both languages1 will produce less noisy results than verbs, for example, which naive users might incorrectly inflect in their (English) responses.

Even more experienced L2 readers will encounter novel words when reading L2 text. Their ability to decipher a novel word is known to depend on both their understanding of the surrounding context words (Huckin and Coady, 1999) and the cognateness of the novel word. We seek to evaluate this quantitatively and qualitatively in three "extreme" cases (no context, no cognate information, full context with cognate information). In doing so, we are able to see how learners might react differently to novel words based on their understanding of the context. This can serve as a well-controlled proxy for other incidental learning settings, including reading a language that the learner knows well and encountering novel words, encountering novel vocabulary items in isolation (for example on a vocabulary list), or learner-driven learning tools such as ones involving the reading of macaronic text.

1 Specifically, German nouns are marked for number but only rarely for case.

3.1.1 Data Collection and Preparation

We use data from NachrichtenLeicht.de, a source of news articles in simple German (Leichte Sprache, "easy language") (Deutschlandfunk, 2016). Simple German is intended for readers with cognitive impairments and/or whose first language is not German. It follows several guidelines, such as short sentences, simple sentence structure, active voice, hyphenation of compound nouns (which are common in German), and use of prepositions instead of the genitive case (Wikipedia, 2016).

We chose 188 German sentences and manually translated them into English. In each sentence, we selected a single German noun whose translation is a single English noun. This yields a triple of (German noun, English noun, English translation of the context). Each German noun/English noun pair appears only once2 and each English sentence is unique, for a total of 188 triples.

2 The English word may appear in other sentences, but never in the sentence in which its German counterpart appears.


Task        Text Presented to Learner                             Correct Answer
word        Klima                                                 climate
cloze       The next important ____ conference is in December.    climate
combined    The next important Klima conference is in December.   climate

Table 3.1: Three tasks derived from the same German sentence.

Sentences ranged in length from 5 tokens to 28 tokens, with a mean of 11.47 tokens (median 11). Due to the short length of the sentences, in many cases there was only one possible pair of aligned German and English nouns (both of which were single words rather than noun phrases). In the cases where there were multiple, the translator chose one that had not yet been chosen, and attempted to ensure a wide range from clear cognates to non-cognates, as well as a range in how clear the context made the word.

As an outside resource for training language models and other resources, we chose to use Simple English Wikipedia (Wikimedia Foundation, 2016). It contains 767,826 sentences, covers a similar set of topics to the NachrichtenLeicht.de data, and uses simple sentence structure. The sentence lengths are also comparable, with a mean of 17.6 tokens and a median of 16 tokens. This makes it well-matched for our task.

Our main goal is to examine students' ability to understand novel L2 words. To better separate the effects of context, cognate status, and general familiarity with the nouns, we assess subjects on the three tasks illustrated in Table 3.1:

1. word: Subjects are presented with a single German word out of context, and are asked to provide their best guess for the translation.

2. cloze: A single noun is deleted from a sentence and subjects are asked to fill in the blank.

3. combined: Subjects are asked to provide their best-guess translation for a single German noun in the context of an English sentence. This is identical to the cloze task, except that the German noun replaces the blank.

We used Amazon Mechanical Turk (henceforth MTurk), a crowdsourcing platform, to recruit subjects and collect their responses to our tasks. Tasks on MTurk are referred to as HITs (Human Intelligence Tasks). In order to qualify for our tasks, subjects completed short surveys on their language skills. They were asked to rate their language proficiency in four languages (English, Spanish, German, and French) on a scale from "None" to "Fluent." The intermediate options were "Up to 1 year of study (or equivalent)" and "More than 1 year of study (or equivalent)."3 Only subjects who indicated that they were fluent in English but indicated "None" for German experience were permitted to complete the tasks.

Additional stratification of users into groups is described in the subsection below. The HITs were presented to subjects in a somewhat randomized order (as per the MTurk standard setup).

3 Subjects were instructed to list themselves as having experience equivalent to language instruction, even if they had not studied in a classroom, if they had been exposed to the language by living in a place where it was spoken, playing online language-learning games, or other such experiences.

Data Collection Protocol: In this setup, each subject may be asked to complete instances of all three tasks. However, the subject is shown at most one task instance that was derived from a given data triple (for example, at most one line from Table 3.1).4 Subjects were paid between $0.05 and $0.08 per HIT, where a HIT consists of 5 instances of the same task. Each HIT was completed by 9 unique subjects. Subjects voluntarily completed from 5 to 90 task instances (1–18 HITs), with a median of 25 instances (5 HITs). HITs took subjects a median of 80.5 seconds according to the MTurk output timing. Each triple gives rise to one cloze, one word, and one combined task. For each of those tasks, 9 users make guesses, for a total of 27 guesses per triple. Data was preprocessed to lowercase all guesses and to correct obvious typos.5 Users made 1863 unique guesses (types across all tasks). Of these, 142 were determined to be errors of some sort; 79 were correctable spelling errors, 54 were multiple-word phrases rather than single words, 8 were German words, and 1 was an ambiguous spelling error. In our experiments, we treat all of the uncorrectable errors as out-of-vocabulary tokens, for which we cannot compute features (such as edit distance, etc.).

Data Splits: After collecting data on all triples from our subjects, we split the dataset for purposes of predictive modeling. We randomly partitioned the triples into a training set (112 triples), a development set (38 triples), and a test set (38 triples). Note that the same partition by triples was used across all tasks. As a result, a German noun/English noun pair that appears in test data is genuinely unseen: it did not appear in the training data for any task.

4 Each triple gives rise to an instance of each task. Subjects who completed one of these instances were prevented from completing the other two by being assigned an additional MTurk "qualification" (in this case, a disqualification).
5 All guesses that were flagged by spell-check were manually checked to see if they constituted typos (e.g., "langauges" for "languages") or spelling errors (e.g., "speach" for "speech") with clear corrections.


3.1.2 Modeling Subject Guesses

In an educational technology context, such as a tool for learning vocabulary, we would like a way to compute the difficulty of examples automatically, in order to present learners with examples at an appropriate level of difficulty. For such an application, it would be useful to know not only whether the learner is likely to correctly guess the vocabulary item, but also whether their incorrect guesses are "close enough" to allow the user to understand the sentence and proceed with reading. We seek to build models that can predict a subject's likely guesses and their probabilities, given the context with which they have been presented.

We use a small set of features (described below) to characterize subjects' guesses and build predictive models of what a subject is likely to guess. Feature functions can jointly examine the input presented to the subject and candidate guesses.

We evaluate the models both in terms of how well they predict subject guesses, as well as how well they perform on the simpler subtask of modeling guessability. We define guessability for a word in context to be how easy that word is for a subject to guess, given the context. In practice, we estimated guessability as the proportion of subjects that exactly guessed the word (i.e., the reference English translation).
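A minimal sketch of this estimate, assuming the guesses have already been lowercased and spell-corrected as described earlier:

```python
# Guessability: the fraction of subjects whose (normalized) guess exactly matches
# the reference translation for one stimulus.
def guessability(guesses, reference):
    guesses = [g.strip().lower() for g in guesses]
    return sum(g == reference.lower() for g in guesses) / len(guesses)

print(guessability(["climate", "weather", "climate"], "climate"))  # ~0.67
```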

3.1.2.1 Features Used

When our formula for computing a feature draws on parameters that are estimated from corpora, we used the Simple English Wikipedia data and GloVe vectors (Pennington, Socher, and Manning, 2014). Our features are functions whose arguments are the candidate guess and the triple (of German noun, English noun, and English sentence). They are divided into three categories based on which portions of the triple they consider:

Generic Features: These features are independent of subject input, and could be useful regardless of whether the subject made their guess in the word, cloze, or combined task.

1. Candidate==Correct Answer: This feature fires when the candidate is equal to the correct answer.

2. Candidate==OOV: This is used when the candidate guess is not a valid English word (for example, multiple words or an incomprehensible typo), in which case no other features about the candidate are extracted.

3. Length: We compute the number of characters in the correct answer.

4. Embedding: Cosine distance between the embedding of the candidate and the embedding of the correct answer. For the embeddings, we use the 300-dimensional GloVe vectors from the 6B-token dataset.

5. Log Unigram Frequency of the candidate in the Simple English Wikipedia corpus.

6. Levenshtein Distance between the candidate and the correct answer.

7. Sound Edit Distance: Levenshtein distance between phonetic representations of the candidate and the correct answer, as given by Metaphone (Philips, 1990).6

8. LCS: Length of the longest common substring between the candidate and the correct answer, normalized.

9. Normalized Trigram Overlap: Count of character trigrams (types) that match between the candidate and the correct answer, normalized by the maximum possible matches. (A small sketch of a few of these string-similarity features follows this list.)

6 When several variations were available for a particular feature, such as which phonetic representation to use or whether or not to normalize, the version we selected for our studies (and described here) is the version that correlated most strongly with guessability on training data.
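The sketch below computes two of the string-similarity quantities used above. The exact normalization variants used in our experiments were the ones that correlated most strongly with guessability on training data (see the footnote); the normalization choices shown here are illustrative assumptions.

```python
# Illustrative sketches of two string-similarity features (Levenshtein distance and
# normalized character-trigram overlap); normalization here is an assumed variant.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def char_trigrams(s: str) -> set:
    return {s[i:i + 3] for i in range(len(s) - 2)}

def normalized_trigram_overlap(cand: str, answer: str) -> float:
    # Matching character-trigram types, normalized by the maximum possible matches.
    t1, t2 = char_trigrams(cand.lower()), char_trigrams(answer.lower())
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / min(len(t1), len(t2))

print(levenshtein("klima", "climate"))                 # 3
print(normalized_trigram_overlap("klima", "climate"))  # ~0.67
```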

Word Features: These features are dependent on the German word, and should thus only be useful in the word and combined tasks. The second half of the generic features (from Levenshtein Distance through Normalized Trigram Overlap) are also computed between the candidate and the German word and are used as measures of cognateness. The use of Metaphone (which is intended to predict the pronunciation of English words) is appropriate for German words in this case, as it corresponds to the assumption that our learners do not yet have accurate representations of German pronunciation and may be applying English pronunciation rules to German.

Cloze Features: These features are dependent on the surrounding English context, and should thus only be useful in the cloze and combined tasks.

1. LM Score of the candidate in context, using a 5-gram language model built with KenLM (Heafield et al., 2013) and a neural RNNLM (Mikolov et al., 2011).7 We compute three different features for each language model: a raw LM score, a sentence-length-normalized LM score, and the difference between the LM score with the correct answer in the sentence and the LM score with the candidate in its place.

7 We use the Faster-RNNLM toolkit available at https://github.com/yandex/faster-rnnlm.


2. PMI: Maximum pointwise mutual information between any word in the context and the candidate.

3. Left Bigram Collocations: These are bigram association measures between the candidate's neighbor(s) to the left and the candidate (Church and Hanks, 1990). We include a version that just examines the neighbor directly to the left (which we'd expect to do well in collocations like "San Francisco") as well as a version that returns the maximum score over a window of five, which behaves like an asymmetric version of PMI.

4. Embeddings: Minimum cosine distance between the embeddings of any word in the context and the candidate. (A sketch of a PMI-style context feature follows this list.)
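The sketch below illustrates the Max-PMI style of context feature. The counts are placeholder numbers; in our experiments these statistics were estimated from Simple English Wikipedia.

```python
# Max-PMI context feature: the strongest pointwise mutual information between the
# candidate guess and any context word.  Counts below are placeholders.
import math

def pmi(x, y, joint, unigram, total_sentences):
    p_xy = joint.get((x, y), 0) / total_sentences
    p_x, p_y = unigram[x] / total_sentences, unigram[y] / total_sentences
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

def max_pmi(candidate, context, joint, unigram, total_sentences):
    scores = [pmi(candidate, w, joint, unigram, total_sentences)
              for w in context if w in unigram and candidate in unigram]
    return max(scores, default=float("-inf"))

unigram = {"plane": 120, "airport": 80, "cake": 95}            # placeholder counts
joint = {("airport", "plane"): 40, ("cake", "plane"): 1}       # sentence co-occurrence
ctx = ["his", "plane", "landed", "at", "the"]
print(max_pmi("airport", ctx, joint, unigram, 100_000))        # high PMI with "plane"
print(max_pmi("cake", ctx, joint, unigram, 100_000))           # much lower
```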

Intuitively, we expect it to be easiest to guess the correct word in the combined task, followed by the cloze task, followed by the L2 word with no context.8 As shown in Figure 3.1, this is borne out in our data.

In Table 3.2 we show Spearman correlations between several features and the guessability of the word (given a word, cloze, or combined context). The first two features (log unigram probability and length of the correct solution) in Table 3.2 belong to the generic category of features. We expect that learners may have an easier time guessing short or common words (for instance, it may be easier to guess "cat" than "trilobite") and we do observe such correlations.

8 All plots/values in the remainder of this subsection are computed only over the training data unless otherwise noted.


Feature                                             Spearman's Correlation w/ Guessability
                                                    Word      Cloze     Combined   All
Log Unigram Frequency                                0.310*    0.262*    0.279*     0.255*
English Length                                      -0.397*   -0.392*   -0.357*    -0.344*
Sound Edit Distance (German + Answer)               -0.633*    n/a      -0.575*    -0.409*
Levenshtein Distance (German + Answer)              -0.606*    n/a      -0.560*    -0.395*
Max PMI (Answer + Context)                           n/a       0.480*    0.376*     0.306*
Max Left Bigram Collocations (Answer + Window=5)     n/a       0.474*    0.186      0.238*
Max Right Bigram Collocations (Answer + Window=5)    n/a       0.119     0.064      0.038

Table 3.2: Correlations between selected feature values and answer guessability, computed on training data (starred correlations significant at p < 0.01). Unavailable features are represented by "n/a" (for example, since the German word is not observed in the cloze task, its edit distance to the correct solution is unavailable). Due to the format of our triples, it is still possible to test whether these unavailable features influence the subject's guess: in almost all cases they indeed do not appear to, since the correlation with guessability is low (absolute value < 0.15) and not statistically significant even at the p < 0.05 level.



Figure 3.1: Average guessability by context type, computed on 112 triples (from the training data). Error bars show 95%-confidence intervals for the mean, under bootstrap resampling of the 112 triples (we use BCa intervals). Mean accuracy increases significantly from each task to the next (same test on difference of means, p < 0.01).

In the middle of the table, we can see how, despite the word task being most difficult on average, there are cases such as Gitarrist (guitarist), where cognateness allows all or nearly all learners to guess the meaning of the word with no context. The correlations between guessability and Sound Edit Distance as well as Levenshtein Distance demonstrate their usefulness as proxies for cognateness. The other word features described earlier also show strong correlation with guessability in the word and combined tasks.

Similarly, in some cloze tasks, strong collocations or context clues, as in the case of "His plane landed at the ____," make it easy to guess the correct solution (airport). Arguably, even the number of blank words to fill is a clue to completing this sentence, but we do not model this in our study. We would expect, for instance, a high PMI between "plane" and "airport", and we see this reflected in the correlation between high PMI and guessability.


The final two lines of the table examine an interesting quirk of bigram association measures. We see that Left Bigram Collocations with a window of 5 (that is, where the feature returns the maximum collocation score between a word in the window to the left of the word to be guessed) show reasonable correlation with guessability. Right bigram collocations, on the other hand, do not appear to correlate. This suggests that the subjects focus more on the words preceding the blank when formulating their guess (which makes sense as they read left-to-right). Due to its poor performance, we do not include Right Bigram Collocations in our later experiments.

We expect that learners who see only the word will make guesses that lean heavily on cognateness (for example, incorrectly guessing "Austria" for "Ausland"), while learners who see the cloze task will choose words that make sense semantically (e.g., incorrectly guessing "tornado" in the sentence "The ____ destroyed many houses and uprooted many trees"). Guesses for the combined task may fall somewhere between these two, as the learner takes advantage of both sources of information. Here we focus on incorrect guesses (to control for the differences in task difficulty).

For example, in Figure 3.2, we see that guesses (made by our human subjects) in the word task have a higher average Normalized Character Trigram Overlap than guesses in the cloze task, with the combined task in between.

This pattern of the combined task falling between the word and cloze tasks is consistent across most features examined.



Figure 3.2: Average Normalized Character Trigram Overlap between guesses and the German word.

3.1.3 Model

The correlations in the previous subsection support our intuitions about how to model subject behavior in terms of cognateness and context. Of course, we expect that rather than basing their guesses on a single feature, subjects are performing cue combination, balancing multiple cognate and context clues (whenever they are available). Following that to its natural conclusion, we choose a model that also allows for cue combination in order to model subject guesses.

We use log-linear models to model subject guesses as probability distributions over the vocabulary V, as seen in Equations 3.1 and 3.2.

    p(y | x) = exp(w · f(x, y)) / Σ_{y′ ∈ V} exp(w · f(x, y′))        (3.1)

    w · f(x, y) = Σ_k w_k f_k(x, y)        (3.2)


We use a 5000-word vocabulary, containing all complete English vocabulary from the triples and user guesses, padded with frequent words from the Simple English Wikipedia dataset.

Given the context x that the subject was shown (word, cloze, or combined), p(y | x) represents the probability that a subject would guess the vocabulary item y ∈ V. The model learns weights w_k for each feature f_k(x, y). We train the model using MegaM (Daumé III, 2004) via the NLTK interface.

An example feature function is shown in Equation 3.3.

    f_k(x, y) = 1 if Correct Answer == y, and 0 otherwise        (3.3)
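The sketch below shows the scoring and normalization of Equations 3.1 to 3.3. In our experiments the features of Section 3.1.2.1 were used with weights learned by MegaM (via NLTK); here a crude character-overlap feature and hand-set weights are stand-ins purely to make the computation concrete.

```python
# A minimal log-linear (softmax) guess model in the style of Equations 3.1-3.3.
# The toy feature and weights below are illustrative stand-ins, not the trained model.
import math

def features(x, y):
    """Feature vector f(x, y); x bundles the stimulus shown to the subject."""
    german = x["german"].lower()
    overlap = len(set(y) & set(german)) / len(set(german))   # crude cognateness proxy
    return {
        "candidate==correct": 1.0 if y == x["correct"] else 0.0,
        "char_overlap_with_german": overlap,
    }

def p_guess(y, x, w, vocab):
    """p(y | x): a softmax over the candidate vocabulary, as in Equation 3.1."""
    def score(cand):
        return sum(w.get(name, 0.0) * val for name, val in features(x, cand).items())
    z = sum(math.exp(score(c)) for c in vocab)
    return math.exp(score(y)) / z

vocab = ["climate", "cartoon", "clay"]
x = {"german": "Klima", "correct": "climate"}                       # the stimulus
w = {"candidate==correct": 1.0, "char_overlap_with_german": 2.0}    # toy weights
print({y: round(p_guess(y, x, w, vocab), 3) for y in vocab})
```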

In order to best leverage the cloze features (shared across the cloze and combined tasks), the word features (shared across the word and combined tasks) and the generic features (shared across all tasks), we take the domain adaptation approach used in Daumé III (2007). In this approach, instead of a single feature for Levenshtein distance between a German word and a candidate guess, we instead have three copies of this feature: one that fires only when the subject is presented with the word task, one that fires when the subject is presented with the combined task, and one which fires in either of those situations (note that since a subject who sees the cloze task does not see the German word, we omit such a version of the feature). This allows us to learn different weights for different versions of the same features. For example, this allows the model to learn a high weight for cognateness features in the word and combined settings, without being influenced to learn a low weight on them by the cloze setting.
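A minimal sketch of this feature-copying scheme (in the style of Daumé III, 2007) is shown below. It produces the generic shared-plus-task-specific augmentation; in our actual setup the German-word features get word-only, combined-only, and shared-over-both copies as described above.

```python
# Feature augmentation for domain adaptation: every base feature is duplicated into
# a shared copy and a task-specific copy, so each can receive its own weight.
def augment(base_features: dict, task: str) -> dict:
    out = {}
    for name, value in base_features.items():
        out[f"shared::{name}"] = value   # fires for every task
        out[f"{task}::{name}"] = value   # fires only for this task
    return out

base = {"levenshtein_to_german": 2.0, "log_unigram_freq": -4.1}
print(augment(base, task="combined"))
# {'shared::levenshtein_to_german': 2.0, 'combined::levenshtein_to_german': 2.0, ...}
```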

3.1.3.1 Evaluating the Models

We evaluate the models in several ways: using conditional cross-entropy, by computing mean reciprocal rank, and by computing correlation with guessability.

The empirical distribution for a given context x is calculated from all count(· | x) learner guesses for that context, with p(g | x) = count(g | x) / count(· | x) (where count(g | x) is the number of learners who guessed g in the context x).

The conditional cross-entropy is defined to be the mean negative log probability over all N test task instances (pairs of subject guesses g_i and contexts x_i): (1/N) Σ_{i=1}^{N} −log2 p(g_i | x_i).

The mean reciprocal rank is computed after ranking all vocabulary words (in each context) by the probability assigned to them by the model, calculating the reciprocal rank of each subject guess g_i, and then averaging this across all contexts x_i in the set X of all contexts, as shown in Equation 3.4:

    MRR = (1/N) Σ_{i=1}^{N} 1 / rank(g_i | x_i)        (3.4)

In order to compute correlation with guessability, we use Spearman's rho to check the correlation between guessability and the probability assigned by the model to the correct answer.
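A minimal sketch of the two aggregate metrics, assuming the model's distribution over candidates is available as a dictionary per test instance (the small probability floor for unseen guesses is an assumption for safety; the actual model covers the full vocabulary):

```python
# Conditional cross-entropy and mean reciprocal rank over (guess, distribution) pairs.
import math

def conditional_cross_entropy(instances):
    # Mean negative log2 probability of the subjects' guesses.
    return sum(-math.log2(dist.get(guess, 1e-12)) for guess, dist in instances) / len(instances)

def mean_reciprocal_rank(instances):
    rr = []
    for guess, dist in instances:
        ranking = sorted(dist, key=dist.get, reverse=True)
        rr.append(1.0 / (ranking.index(guess) + 1) if guess in dist else 0.0)
    return sum(rr) / len(rr)

instances = [
    ("climate", {"climate": 0.6, "weather": 0.3, "clay": 0.1}),
    ("weather", {"climate": 0.6, "weather": 0.3, "clay": 0.1}),
]
print(conditional_cross_entropy(instances))  # ~1.24 bits
print(mean_reciprocal_rank(instances))       # (1 + 1/2) / 2 = 0.75
```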


3.1.4 Results and Analysis


Figure 3.3: Correlation between the empirically observed probability of the correct answer (i.e., the proportion of human subject guesses that were correct) and the model probability assigned to the correct answer, across all tasks in the test set. Spearman's correlation of 0.725.

In Table 3.3 we show model results over several feature sets. None of the feature sets (generic features, word features, or cloze features) can perform well individually on the full test set, which contains word, cloze, and combined tasks. Once combined, they perform better on all metrics.

Additionally, using domain adaptation improves performance. Manually examining the best model's most informative features, we see, for example, that edit distance features are ranked highly in their word-only or word-combined versions, while the combined-only version of those features is less informative. This reflects our earlier observation that edit distance features are highly correlated with guessability in the word task, and slightly less so


Features                                          Cross-Entropy   MRR     Correlation
LCS (Candidate + Answer)                          10.72           0.067   0.346*
All Generic Features                               8.643          0.309   0.168
Sound Edit Distance (Candidate + German Word)     10.847          0.081   0.494*
All Word Features                                 10.018          0.187   0.570*
LM Difference                                     11.214          0.051   0.398*
All Cloze Features                                10.008          0.105   0.351*
Generic + Word                                     7.651          0.369   0.585*
Generic + Cloze                                    8.075          0.320   0.264*
Word + Cloze                                       8.369          0.227   0.706*
All Features (No Domain Adapt.)                    7.344          0.338   0.702*
All Features + Domain Adapt.                       7.134          0.382   0.725*

Table 3.3: Feature ablation. The single highest-correlating feature (on the dev set) from each feature group is shown, followed by the entire feature group. All versions with more than one feature include a feature for the OOV guess. In the correlation column, p-values < 0.01 are marked with an asterisk.


Context                       Observed Guess   Truth         Hypothesized Explanation
Helfer                        cow              helpers       False friend: Helfer → Heifer → Cow
Journalisten                  reporter         journalists   Synonym and incorrect number.
The Lage is too dangerous.    lake             location      Influenced by spelling and context.

Table 3.4: Examples of incorrect guesses and potential sources of confusion.

(though still relevant) in the combined task. We show in Figure 3.3 that the model probability assigned to the correct guess correlates strongly with the probability of learners correctly guessing it.

Annotated Guesses: To take a fine-grained look at guesses, we broke down subject guesses into several categories.

Figure 3.4: Percent of examples labeled with each label by a majority of annotators (may sum to more than 100%, as multiple labels were allowed).

We had 4 annotators (fluent English speakers, but non-experts) label 50 incorrect subject guesses from each task, sampled randomly from the spell-corrected incorrect guesses in the training data, with the following labels indicating why the annotator thought the subject made the (incorrect) guess they did, given the context that the subject saw: false friend/cognate/spelling bias (the learner appears to have been influenced by the spelling of the German word), synonym (the learner's guess is a synonym or near-synonym of the correct answer), incorrect number/POS (the correct noun with incorrect number, or an incorrect POS), and context influence (a word that makes sense in the cloze/combo context but is not correct). Examples of the range of ways in which errors can manifest are shown in Table 3.4. Annotators made a binary judgment for each of these labels. Inter-annotator agreement was substantial, with Fleiss's kappa of 0.654. Guesses were given a label only if the majority of annotators agreed.

In Figure 3.4, we can make several observations about subject behavior. First, the labels for the combined and cloze tasks tend to be more similar to one another, and quite different from the word task labels. In particular, in the majority of cases, subjects completing cloze and combo tasks choose words that fit the context they've observed, while spelling influence in the word task doesn't appear to be quite as strong. Even if the subjects in the cloze and combined tasks make errors, they choose words that still make sense in context more than 50% of the time, while spelling doesn't exert an equally strong influence in the word task.

Our model also makes predictions that look plausibly like those made by the human subjects. For example, given the context "In ____, the AKP now has the most representatives." the model ranks the correct answer ("parliament") first, followed by "undersecretary," "elections," and "congress," all of which are thematically appropriate, and most of which fit contextually into the sentence. For the German word "Spieler", the top ranking predictions made by the model are "spider," "smaller," and "spill," while one of the actual user guesses ("speaker") is ranked as 10th most likely (out of a vocabulary of 5000 items).

3.2 Macaronic Setting

In § 3.1 we restricted the sentences to contain only a single noun token in the foreign language, while the rest of the tokens were in L1. In this section we model incidental comprehension for the "full macaronic" setting, where multiple tokens can be in the foreign language, and the word ordering may also follow the foreign-language word order. The stimuli of interest are now like "Der Polizist arrested the Bankräuber." ("The police arrested the bank robber"). Even in this scenario, where multiple tokens have been replaced with their L2 equivalents, a reader with no knowledge of German is likely to be able to understand this sentence reasonably well by using cognate and context clues. These clues provide enough scaffolding for the reader to infer the meaning of novel words and hopefully infer the meaning of the entire sentence. In these stimuli the novel foreign words jointly influence each other along with other words that are in the student's native language. Of course, there are several possible configurations for this sentence where a reader might not be able to understand it. Our goal is to model configurations that are understandable while exposing the reader to as much novel L2 vocabulary and sentence structure as possible.


Our experimental subjects are required to guess what "Polizist" and "Bankräuber" mean in this sentence. We train a featurized model to predict these guesses jointly within each sentence and thereby predict incidental comprehension on any macaronic sentence. Indeed, we hope our model design will generalize from predicting incidental comprehension on macaronic sentences (for our beginner subjects, who need some context words to be in English) to predicting incidental comprehension on full German sentences (for more advanced students, who understand some of the context words as if they were in English). In addition, we developed a user interface that uses macaronic sentences directly as a medium of language instruction. Chapter 4 details the user interface.

3.2.1 Data Collection Setup

Our method of scaffolding is to replace certain foreign words and phrases with their English translations, yielding a macaronic sentence.9 Simply presenting these to a learner would not give us feedback on the learner's belief state for each foreign word. Even assessing the learner's reading comprehension would give only weak indirect information about what was understood. Thus, we collect data where a learner explicitly guesses a foreign word's translation when seen in the macaronic context. These guesses are then treated as supervised labels to train our user model.

9 Although the language distinction is indicated by italics and color, users were left to figure this out on their own.

We used Amazon Mechanical Turk (MTurk) to collect data. Users qualified for tasks by completing a short quiz and survey about their language knowledge. Only users whose results indicated no knowledge of German and who self-identified as native speakers of English were allowed to complete tasks. With German as the foreign language, we generated content by crawling a simplified-German news website, nachrichtenleicht.de. We chose simplified German in order to minimize translation errors and to make the task more suitable for novice learners. We translated each German sentence using the Moses Statistical Machine Translation (SMT) toolkit (Koehn et al., 2007). The SMT system was trained on the German-English Commoncrawl parallel text used in WMT 2015 (Bojar et al., 2015).

We used 200 German sentences, presenting each to 10 different users. In MTurk jargon, this yielded 2000 Human Intelligence Tasks (HITs). Each HIT required its user to participate in several rounds of guessing as the English translation was incrementally revealed. A user was paid US$0.12 per HIT, with a bonus of US$6 to any user who accumulated more than 2000 total points.

3.2.1.1 HITs and Submissions

For each HIT, the user first sees a German sentence10 (Figure 3.5). A text box is presented below each German word in the sentence, for the user to type in their "best guess" of what each German word means. The user must fill in at least half of the text boxes before submitting this set of guesses. The resulting submission, i.e., the macaronic sentence together with the set of guesses, is logged in a database as a single training example, and

10 Except that we first "translate" any German words that have identical spelling in English (case-insensitive). This includes most proper names, numerals, and punctuation marks. Such translated words are displayed in English style (blue italics), and the user is not asked to guess their meanings.


Figure 3.5: After a user submits a set of guesses (top), the interface marks the correct guesses in green and also reveals a set of translation clues (bottom). The user now has the opportunity to guess again for the remaining German words.

the system displays feedback to the user about which guesses were correct.

After each submission, new clues are revealed (providing increased scaffolding) and the user is asked to guess again. The process continues, yielding multiple submissions, until all German words in the sentence have been translated. At this point, the entire HIT is considered completed and the user moves to a new HIT (i.e., a new sentence).

From our 2000 HITs, we obtained 9392 submissions (4.7 per HIT) from 79 distinct MTurk users.

3.2.1.2 Clues

Each update provides new clues to help the user make further guesses. There are 2 kinds of clues:

Translation Clue (Figure 3.5): A set of words that were originally in German are replaced with their English translations. The text boxes below these words disappear, since it is no longer necessary to guess them.


Figure 3.6: In this case, after the user submits a set of guesses (top), two clues are revealed (bottom): ausgestellt is moved into English order and then translated.

Reordering Clue (Figure 3.6): A German substring is moved into a more English-like position. The reordering positions are calculated using the word and phrase alignments obtained from Moses.

Each time the user submits a set of guesses, we reveal a sequence of n = max(1, round(N/3)) clues, where N is the number of German words remaining in the sentence. For each clue, we sample a token that is currently in German. If the token is part of a movable phrase, we move that phrase; otherwise we translate the minimal phrase containing that token. These moves correspond exactly to clues that a user could request by clicking on the token in the macaronic reading interface; see Chapter 4 for details of how moves are constructed and animated. In our present experiments, the system is in control instead, and grants clues by "randomly clicking" on n tokens.

The system's probability of sampling a given token is proportional to its unigram type probability in the WMT corpus. Thus, rarer words tend to remain in German for longer, allowing the Turker to attempt more guesses for these "difficult" words. However, close cognates would be an exception to this rule, as they are not frequent but are probably easy to guess by Turkers.
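A minimal sketch of this clue-granting step, with placeholder unigram counts standing in for the WMT statistics:

```python
# Sample n = max(1, round(N/3)) of the remaining German tokens, with probability
# proportional to unigram type frequency; frequent tokens are translated first,
# so rare tokens tend to stay in German longer.  Counts below are placeholders.
import random

def sample_clue_tokens(remaining_tokens, unigram_counts, rng=random):
    n = max(1, round(len(remaining_tokens) / 3))
    pool = list(remaining_tokens)
    weights = [unigram_counts.get(tok.lower(), 1) for tok in pool]
    chosen = []
    while pool and len(chosen) < n:
        tok = rng.choices(pool, weights=weights, k=1)[0]   # sample without replacement
        i = pool.index(tok)
        chosen.append(pool.pop(i))
        weights.pop(i)
    return chosen

remaining = ["Polizist", "hat", "den", "Bankräuber", "verhaftet"]
counts = {"polizist": 50, "hat": 90000, "den": 120000, "bankräuber": 3, "verhaftet": 800}
print(sample_clue_tokens(remaining, counts))   # usually the frequent function words
```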

3.2.1.3 Feedback

When a user submits a set of guesses, the system responds with feedback. Each guess is visibly "marked" in left-to-right order, momentarily shaded with green (for correct), yellow (for close) or red (for incorrect). Depending on whether a guess is correct, close, or wrong, users are awarded points as discussed below. Yellow and red shading then fades, to signal to the user that they may try entering a new guess. Correct guesses remain on the screen for the entire task.

3.2.1.4 Points

Adding points to the process (Figures 3.5–3.6) adds a game-like quality and lets us incentivize users by paying them for good performance. We award 10 points for each exactly correct guess (case-insensitive). We give additional "effort points" for a guess that is close to the correct translation, as measured by cosine similarity in vector space. (We used pre-trained GloVe word vectors (Pennington, Socher, and Manning, 2014); when the guess or correct translation has multiple words, we take the average of the word vectors.) We deduct effort points for guesses that are careless or very poor. Our rubric for effort points is as follows:


    ep = −1,                  if ê is repeated or nonsense (red)
         −1,                  if sim(ê, e*) < 0 (red)
          0,                  if 0 ≤ sim(ê, e*) < 0.4 (red)
          0,                  if ê is blank
          10 × sim(ê, e*),    otherwise (yellow)

Here sim(ê, e*) is the cosine similarity between the vector embeddings of the user's guess ê and our reference translation e*. A "nonsense" guess contains a word that does not appear in the sentence bitext nor in the 20,000 most frequent word types in the GloVe training corpus. A "repeated" guess is an incorrect guess that appears more than once in the set of guesses being submitted.

In some cases, ê or e* may itself consist of multiple words. In this case, our points and feedback are based on the best match between any word of ê and any word of e*. In alignments where multiple German words translate as a single phrase,11 we take the phrasal translation to be the correct answer e* for each of the German words.

11 Our German-English alignments are constructed as in Renduchintala et al. (2016a).
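A minimal sketch of this rubric is shown below. A toy embedding table and cosine similarity stand in for the pre-trained GloVe vectors, and the multi-word case follows the best-match rule just described; the thresholds are those of the rubric above.

```python
# Points rubric sketch: exact match earns 10 points, otherwise effort points are
# assigned from cosine similarity.  `embed` is assumed to return a word vector or None.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def points(guess, reference, embed, repeated=False, nonsense=False):
    if not guess:
        return 0.0                         # blank guess
    if guess.lower() == reference.lower():
        return 10.0                        # exactly correct
    if repeated or nonsense:
        return -1.0
    # Multi-word guesses/references: best match between any pair of words.
    sims = [cosine(embed(g), embed(r))
            for g in guess.split() for r in reference.split()
            if embed(g) is not None and embed(r) is not None]
    sim = max(sims, default=0.0)
    if sim < 0:
        return -1.0
    if sim < 0.4:
        return 0.0
    return 10.0 * sim                      # "close" guess earns partial credit

toy = {"bread": [1.0, 0.2], "loaf": [0.9, 0.3], "stone": [-0.2, 1.0]}   # toy vectors
print(points("loaf", "bread", embed=toy.get))    # partial credit
print(points("stone", "bread", embed=toy.get))   # no credit
```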

3.2.1.5 Normalization

After collecting the data, we normalized the user guesses for further analysis. All guesses were lowercased. Multi-word guesses were crudely replaced by the longest word in the guess (breaking ties in favor of the earliest word).

The guesses included many spelling errors as well as some nonsense strings and direct copies of the input. We defined the dictionary to be the 100,000 most frequent word types (lowercased) from the WMT English data. If a user's guess ê does not match e* and is not in the dictionary, we replace it with the following (a sketch of these rules appears after the list):

• the special symbol <COPY>, if ê appears to be a copy of the German source word f (meaning that its Levenshtein distance from f is < 0.2 · max(|ê|, |f|));

• else, the closest word in the dictionary12 as measured by Levenshtein distance (breaking ties alphabetically), provided the dictionary has a word at distance ≤ 2;

• else, as if the user had not guessed.

12 Considering only words returned by the PyEnchant 'suggest' function (http://pythonhosted.org/pyenchant/).
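A minimal sketch of these normalization rules, with a placeholder dictionary standing in for the 100,000 most frequent WMT types and a brute-force scan standing in for PyEnchant's suggestions:

```python
# Guess normalization: lowercase, collapse multi-word guesses, map copies of the
# German source to <COPY>, and spell-correct to the nearest dictionary word.
def lev(a, b):
    # Compact dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

DICTIONARY = {"police", "policeman", "bank", "robber", "arrested"}   # placeholder

def normalize_guess(guess, reference, source):
    guess = guess.strip().lower()
    if not guess:
        return "<BLANK>"
    if " " in guess:                       # multi-word: keep the longest (earliest) word
        guess = max(guess.split(), key=len)
    if guess == reference.lower() or guess in DICTIONARY:
        return guess
    if lev(guess, source.lower()) < 0.2 * max(len(guess), len(source)):
        return "<COPY>"                    # looks like a copy of the German source word
    dist, word = min((lev(guess, w), w) for w in DICTIONARY)   # alphabetical tie-break
    return word if dist <= 2 else "<BLANK>"

print(normalize_guess("polise", "police", "Polizist"))    # -> police
print(normalize_guess("polizist", "police", "Polizist"))  # -> <COPY>
```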

3.2.2 User Model

In each submission, the user jointly guesses several English words, given spelling and context clues. One way that a machine could perform this task is via probabilistic inference in a factor graph, and we take this as our model of how the human user solves the problem.

The user observes a German sentence f = [f_1, f_2, ..., f_i, ..., f_n]. The translation of each word token f_i is E_i, which is, from the user's point of view, a random variable. Let Obs denote the set of indices i for which the user also observes that E_i = e*_i, the aligned reference translation, because e*_i has already been guessed correctly (green feedback) or shown as a clue. Thus, the user's posterior distribution over E is P_θ(E = e | E_Obs = e*_Obs, f, history), where "history" denotes the user's history of past interactions.

We assume that a user's submission ê is derived from this posterior distribution simply as a random sample. We try to fit the parameter vector θ to maximize the log-probability of the submission. Note that our model is trained on the user guesses ê, not the reference translations e*. That is, we seek parameters θ that would explain why all users made their guesses.

Although we fit a single θ, this does not mean that we treat users as interchangeable (since θ can include user-specific parameters) or unvarying (since our model conditions users' behavior on their history, which can capture some learning).

3.2.3 Factor Graph

We model the posterior distribution as a conditional random field (Figure 3.7) in which the value of E_i depends on the form of f_i as well as on the meanings e_j (which may be either observed or jointly guessed) of the context words at j ≠ i:

    P_θ(E = e | E_Obs = e*_Obs, f, history) ∝ Π_{i ∉ Obs} [ ψ^{ef}(e_i, f_i) · Π_{j ≠ i} ψ^{ee}(e_i, e_j, i − j) ]        (3.5)

We will define the factors ψ (the potential functions) in such a way that they do not "know German" but only have access to information that is available to a naive English speaker. In brief, the factor ψ^{ef}(e_i, f_i) considers whether the hypothesized English word e_i



Figure 3.7: Model for user understanding of L2 words in sentential context. This figure shows an inference problem in which all the observed words in the sentence are in German (that is, Obs = ∅). As the user observes translations via clues or correctly-marked guesses, some of the E_i become shaded.

"looks like" the observed German word f_i, and whether the user has previously observed during data collection that e_i is a correct or incorrect translation of f_i. Meanwhile, the factor ψ^{ee}(e_i, e_j) considers whether e_i is commonly seen in the context of e_j in English text. For example, the user will elevate the probability that E_i = cake if they are fairly certain that E_j is a related word like eat or chocolate.

The potential functions ψ are parameterized by θ, a vector of feature weights. For convenience, we define the features in such a way that we expect their weights to be positive. We rely on just 6 features at present (see Section 3.2.7 for future work), although each is complex and real-valued. Thus, the weights θ control the relative influence of these 6 different types of information on a user's guess. Our features broadly fall under the following categories: Cognate, History, and Context. We precomputed cognate and context features, while history features are computed on-the-fly for each training instance. All features are case-insensitive.

3.2.3.1 Cognate Features

For each German token f_i, the ψ^{ef} factor can score each possible guess e_i of its translation:

    ψ^{ef}(e_i, f_i) = exp(θ^{ef} · ϕ^{ef}(e_i, f_i))        (3.6)

The feature function ϕ^{ef} returns a vector of 4 real numbers:

• Orthographic Similarity: The normalized Levenshtein distance between the two strings.

    ϕ^{ef}_orth(e_i, f_i) = 1 − lev(e_i, f_i) / max(|e_i|, |f_i|)        (3.7)

The weight on this feature encodes how much users pay attention to spelling.

• Pronunciation Similarity: This feature is similar to the previous one, except that it calculates the normalized distance between the pronunciations of the two words:

    ϕ^{ef}_pron(e_i, f_i) = ϕ^{ef}_orth(prn(e_i), prn(f_i))        (3.8)

where the function prn(x) maps a string x to its pronunciation. We obtained pronunciations for all words in the English and German vocabularies using the CMU pronunciation dictionary tool (Weide, 1998). Note that we use English pronunciation rules even for German words. This is because we are modeling a naive learner who may, in the absence of intuition about German pronunciation rules, apply English pronunciation rules to German.


3.2.3.2 History Features

• Positive History Feature: If a user has been rewarded in a previous HIT for guessing e_i as a translation of f_i, then they should be more likely to guess it again. We define ϕ^{ef}_hist+(e_i, f_i) to be 1 in this case and 0 otherwise. The weight on this feature encodes whether users learn from positive feedback.

• Negative History Feature: If a user has already incorrectly guessed e_i as a translation of f_i in a previous submission during this HIT, then they should be less likely to guess it again. We define ϕ^{ef}_hist−(e_i, f_i) to be −1 in this case and 0 otherwise. The weight on this feature encodes whether users remember negative feedback.13

3.2.3.3 Context Features

In the same way, the ψ^{ee} factor can score the compatibility of a guess e_i with a context word e_j, which may itself be a guess, or may be observed:

    ψ^{ee}_{ij}(e_i, e_j) = exp(θ^{ee} · ϕ^{ee}(e_i, e_j, i − j))        (3.9)

13 At least in short-term memory; this feature currently omits to consider any negative feedback from previous HITs.

ϕ^{ee} returns a vector of 2 real numbers:

    ϕ^{ee}_pmi(e_i, e_j) = PMI(e_i, e_j) if |i − j| > 1, and 0 otherwise        (3.10)

    ϕ^{ee}_pmi1(e_i, e_j) = PMI_1(e_i, e_j) if |i − j| = 1, and 0 otherwise        (3.11)

where the pointwise mutual information PMI(x, y) measures the degree to which the English words x, y tend to occur in the same English sentence, and PMI_1(x, y) measures how often they tend to occur in adjacent positions. These measurements are estimated from the English side of the WMT corpus, with smoothing performed as in Knowles et al. (2016).

For example, if f_i = Suppe, the user's guess of E_i should be influenced by f_j = Brot appearing in the same sentence, if the user suspects or observes that its translation is E_j = bread. The PMI feature knows that soup and bread tend to appear in the same English sentences, whereas PMI_1 knows that they tend not to appear in the bigram soup bread or bread soup.
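To make the factorization of Equation 3.5 concrete, the sketch below multiplies toy versions of the ψ^{ef} and ψ^{ee} potentials into an unnormalized score for one joint assignment. The feature functions and weights are illustrative stand-ins for the six features of this section, not the trained model.

```python
# Unnormalized score of one joint assignment under a toy version of Equation 3.5.
import math

def psi_ef(e, f, theta):
    # Crude cognate cue (character overlap) standing in for the orthographic and
    # pronunciation features of Section 3.2.3.1.
    f_chars = set(f.lower())
    overlap = len(set(e.lower()) & f_chars) / max(len(f_chars), 1)
    return math.exp(theta["orth"] * overlap)

def psi_ee(e_i, e_j, theta, pmi_table):
    # The real model distinguishes adjacent (PMI_1) from non-adjacent (PMI) pairs;
    # this toy version uses one symmetric PMI table.
    return math.exp(theta["pmi"] * pmi_table.get(frozenset((e_i, e_j)), 0.0))

def unnormalized_score(assignment, f, observed, theta, pmi_table):
    score = 1.0
    for i, e_i in assignment.items():
        if i in observed:
            continue                        # only unobserved positions contribute factors
        score *= psi_ef(e_i, f[i], theta)
        for j, e_j in assignment.items():
            if j != i:
                score *= psi_ee(e_i, e_j, theta, pmi_table)
    return score

f = ["Der", "Polizist", "arrested", "the", "Bankräuber"]
observed = {2, 3}                                          # tokens already in English
assignment = {0: "the", 1: "policeman", 2: "arrested", 3: "the", 4: "bankrobber"}
theta = {"orth": 2.0, "pmi": 1.0}                          # toy weights
pmi_table = {frozenset(("policeman", "arrested")): 2.5,
             frozenset(("bankrobber", "arrested")): 2.0}
print(unnormalized_score(assignment, f, observed, theta, pmi_table))
```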

3.2.3.4 User-Specific Features

Apart from the basic 6-feature model, we also trained a version that includes user-specific copies of each feature (similar to the domain adaptation technique of Daumé III (2007)). For example, ϕ^{ef}_orth,32(e_i, f_i) is defined to equal ϕ^{ef}_orth(e_i, f_i) for submissions by user 32, and is defined to be 0 for submissions by other users.

Thus, with 79 users in our dataset, we learned 6 × 80 feature weights: a local weight vector for each user and a global vector of "backoff" weights. The global weight θ^{ef}_orth is large if users in general reward orthographic similarity, while θ^{ef}_orth,32 (which may be positive or negative) captures the degree to which user 32 rewards it more or less than is typical. The user-specific features are intended to capture individual differences in incidental comprehension.

3.2.4 Inference

According to our model, the probability that the user guesses E_i = ê_i is given by a marginal probability from the CRF. Computing these marginals is a combinatorial optimization problem that involves reasoning jointly about the possible values of each E_i (i ∉ Obs), which range over the English vocabulary V^e.

We employ loopy belief propagation (Murphy, Weiss, and Jordan, 1999) to obtain approximate marginals over the variables E. A tree-based schedule for message passing was used (Dreyer and Eisner, 2009). We run 3 iterations with a new random root for each iteration.14

We define the vocabulary V^e to consist of all reference translations e*_i and normalized user guesses ê_i from our entire dataset (see Section 3.2.1.5), about 5K types altogether including <BLANK> and <COPY>. We define the cognate features to treat <BLANK> as the empty string and to treat <COPY> as f_i. We define the PMI of these special symbols with any e to be the mean PMI with e of all dictionary words, so that they are essentially uninformative.

14 Remark: In the non-loopy case (which arises for us in cases with ≤ 2 unobserved variables), this schedule boils down to the forward-backward algorithm. In this case, a single iteration is sufficient for exact inference.
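The sketch below runs loopy belief propagation on a small fully connected pairwise model to produce approximate per-variable marginals. It uses a simple synchronous (flooding) schedule rather than the tree-based schedule of Dreyer and Eisner (2009), and toy random potentials rather than the actual ψ factors.

```python
# Loopy belief propagation (sum-product) for a fully connected pairwise model.
import numpy as np

def loopy_bp(unary, pairwise, iterations=3):
    """unary: (n, K) nonnegative potentials; pairwise[(j, i)]: (K, K) indexed [x_j, x_i]."""
    n, K = unary.shape
    msgs = {(j, i): np.ones(K) for i in range(n) for j in range(n) if i != j}
    for _ in range(iterations):
        new = {}
        for j in range(n):
            for i in range(n):
                if i == j:
                    continue
                # m_{j->i}(x_i) = sum_{x_j} unary[j,x_j] * prod_{k!=i,j} m_{k->j}(x_j)
                #                            * pairwise[(j,i)][x_j, x_i]
                incoming = np.ones(K)
                for k in range(n):
                    if k != i and k != j:
                        incoming *= msgs[(k, j)]
                m = (unary[j] * incoming) @ pairwise[(j, i)]
                new[(j, i)] = m / m.sum()            # normalize for stability
        msgs = new
    beliefs = []
    for i in range(n):
        b = unary[i].copy()
        for j in range(n):
            if j != i:
                b *= msgs[(j, i)]
        beliefs.append(b / b.sum())
    return np.array(beliefs)

# Toy example: 3 German tokens, K = 4 candidate English words each.
rng = np.random.default_rng(0)
n, K = 3, 4
unary = rng.random((n, K)) + 0.1
pairwise = {}
for i in range(n):
    for j in range(i + 1, n):
        F = rng.random((K, K)) + 0.1                 # one factor per unordered pair
        pairwise[(i, j)] = F
        pairwise[(j, i)] = F.T
print(loopy_bp(unary, pairwise).round(3))            # rows are per-token guess marginals
```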

3.2.5 Parameter Estimation

We learn our parameter vector θ to approximately maximize the regularized log-likelihood of the users' guesses:

\begin{equation}
\sum \log P_\theta(E = \hat{e} \mid E_{Obs} = e^*_{Obs}, f, \text{history}) \;-\; \lambda\,\lVert\theta\rVert^2
\tag{3.12}
\end{equation}

where the summation is over all submissions in our dataset. The gradient of each summand reduces to a difference between observed and expected values of the feature vector φ = (φ^{ef}, φ^{ee}), summed over all factors in (3.5). The observed features are computed directly by setting E = ê. The expected features (which arise from the log of the normalization constant of (3.5)) are computed approximately by loopy belief propagation.

We trained θ using stochastic gradient descent (SGD),^15 with a learning rate of 0.1 and a regularization parameter of 0.2. The regularization parameter was tuned on our development set.
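A minimal sketch of one SGD step on objective (3.12), under stated assumptions: `observed_features` and `expected_features_bp` are placeholders for the feature extraction and the loopy-BP expectations described above, and the regularizer is applied per example here for simplicity.

    import numpy as np

    def sgd_step(theta, submission, observed_features, expected_features_bp,
                 lr=0.1, lam=0.2):
        """One stochastic-gradient ascent step on the regularized log-likelihood."""
        obs = observed_features(submission)            # features with E set to the user's guesses
        exp = expected_features_bp(theta, submission)  # approximate expectations via loopy BP
        grad = obs - exp - 2.0 * lam * theta           # gradient of one summand of (3.12)
        return theta + lr * grad

    # theta = np.zeros(num_features)
    # for epoch in range(8):
    #     for submission in training_data:
    #         theta = sgd_step(theta, submission, observed_features, expected_features_bp)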

3.2.6 Experimental Results

We divided our data randomly into 5550 training instances, 1903 development instances, and 1939 test instances. Each instance was a single submission from one user, consisting of a batch of "simultaneous" guesses on a macaronic sentence.

^15 To speed up training, SGD was parallelized using Recht et al.'s (2011) Hogwild! algorithm. We trained for 8 epochs.

We noted qualitatively that when a large number of English words have been revealed, particularly content words, the users tend to make better guesses. Conversely, when most context is German, we unsurprisingly see the user leave many guesses blank and make other guesses based on string similarity triggers. Such submissions are difficult to predict, as different users will come up with a wide variety of guesses; our model therefore resorts to predicting similar-sounding words. For detailed examples of this, see Section 3.2.6.3.

For each foreign word f_i in a submission with i ∉ Obs, our inference method (Section 3.2.4) predicts a marginal probability distribution over a user's guesses ê_i. Table 3.5 shows our ability to predict user guesses.^16 Recall that this task is essentially a structured prediction task that does joint 4919-way classification of each German word. Roughly 1/3 of the time, our model's top 25 words include the user's exact guess.

However, the recall reported in Table 3.5 is too stringent for our educational application. We could give the model partial credit for predicting a synonym of the learner's guess ê. More precisely, we would like to give the model partial credit for predicting when the learner will make a poor guess of the truth e^*, even if the model does not predict the user's specific incorrect guess ê.

To get at this question, we use English word embeddings (as in Section 3.2.1.4) as a proxy for the semantics and morphology of the words.

^16 Throughout this section, we ignore the 5.2% of tokens on which the user did not guess (i.e., the guess was <BLANK> after the normalization of Section 3.2.1.5). Our present model simply treats <BLANK> as an ordinary and very bland word (Section 3.2.4), rather than truly attempting to predict when the user will not guess. Indeed, the model's posterior probability of <BLANK> in these cases is a paltry 0.0000267 on average (versus 0.0000106 when the user does guess). See Section 3.2.7.


                     Recall at k (dev)        Recall at k (test)
    Model            k=1     k=25    k=50     k=1     k=25    k=50
    Basic            15.24   34.26   38.08    16.14   35.56   40.30
    User-Adapted     15.33   34.40   38.67    16.45   35.71   40.57

Table 3.5: Percentage of foreign words for which the user's actual guess appears in our top-k list of predictions, for models with and without user-specific features (k ∈ {1, 25, 50}).

learner’s guess ̂e asits cosine si milaritytothetruth, si m( ̂e, e ∗ ). W hil e q u alit y of 1 is a n e x a ct

match, and quality scores > 0 .7 5 are consistently good matches, wefound quality of ≈ 0 .6

alsoreasonable. Pairs such as( m o s q u e , i s l a m i c ) a n d (p o l i t i c s , g o v e r n m e n t ) ar e

exa mplesfro mthe collected data with quality ≈ 0 .6 . As quality beco mes < 0 .4 , h o w e ver,

therelationship beco mestenuous, e.g.,(refugee,soil).

Similarly, we measure the predicted quality as sim(e, e^*), where e is the model's 1-best prediction of the user's guess. Figure 3.8 plots predicted vs. actual quality (each point represents one of the learner's guesses on development data), obtaining a correlation of 0.38, which we call the "quality correlation" or QC. A clear diagonal band can be seen, corresponding to the instances where the model exactly predicts the user's guess. The cloud around the diagonal is formed by instances where the model's prediction was not identical to the user's guess but had similar quality.

Figure 3.8: Actual quality sim(ê, e^*) of the learner's guess ê on development data, versus predicted quality sim(e, e^*) where e is the basic model's 1-best prediction.

We also consider the expected predicted quality, averaging over the model's predictions e of ê (for all e ∈ V^e) in proportion to the probabilities that it assigns them. This allows the model to more smoothly assess whether the learner is likely to make a high-quality guess. Figure 3.9 shows this version, where the points tend to shift upward and the quality correlation (QC) rises to 0.53.
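A minimal sketch of how the two QC variants could be computed, assuming an embedding lookup `emb` and per-token posteriors from the model (names are illustrative):

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def quality_correlation(guesses, emb, use_expectation=False):
        """guesses: list of (user_guess, reference, posterior) where `posterior`
        is a dict mapping candidate words to the model's probabilities."""
        actual, predicted = [], []
        for user_guess, ref, posterior in guesses:
            actual.append(cosine(emb[user_guess], emb[ref]))          # sim(e_hat, e*)
            if use_expectation:
                # expected predicted quality: average of sim(e, e*) under the posterior
                predicted.append(sum(p * cosine(emb[e], emb[ref])
                                     for e, p in posterior.items()))
            else:
                best = max(posterior, key=posterior.get)              # 1-best prediction
                predicted.append(cosine(emb[best], emb[ref]))
        return float(np.corrcoef(actual, predicted)[0, 1])            # Pearson correlation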

All QC values are given in Table 3.6. We used expected QC on the development set as the criterion for selecting the regularization coefficient λ and as the early stopping criterion during training.

3.2.6.1 Feature Ablation

To test the usefulness of different features, we trained our model with various feature categories disabled. To speed up experimentation, we sampled 1000 instances from the training set and trained our model on those. The resulting QC values on dev data are shown in Table 3.7. We see that removing history-based features has the most significant impact on model performance: both QC measures drop relative to the full model. For cognate and context features, we see no significant impact on the expected QC, but a significant drop in the 1-best QC, especially for context features.

Figure 3.9: Actual quality sim(ê, e^*) of the learner's guess ê on development data, versus the expectation of the predicted quality sim(e, e^*) where e is distributed according to the basic model's posterior.

3.2.6.2 Analysis of User Adaptation

Table 3.6 shows that the user-specific features significantly improve the 1-best QC of our model, although the much smaller improvement in expected QC is insignificant.

                     Dev                  Test
    Model            Exp     1-Best       Exp     1-Best
    Basic            0.525   0.379        0.543   0.411
    User-Adapted     0.527   0.427        0.544   0.439

Table 3.6: Quality correlations: basic and user-adapted models.

User adaptation allows us to discern different styles of incidental comprehension. A user-adapted model makes fine-grained predictions that could help to construct better macaronic sentences for a given user. Each user who completed at least 10 HITs has their user-specific weight vector shown as a row in Figure 3.10. Recall that the user-specific weights are not used in isolation, but are added to backoff weights shared by all users.

These user-specific weight vectors cluster into four groups. Furthermore, the average points per HIT differ by cluster (significantly between each cluster pair), reflecting the success of different strategies.^17 Users in group (a) employ a generalist strategy for incidental comprehension. They pay typical or greater-than-typical attention to all features of the current HIT, but many of them have diminished memory for vocabulary learned during past HITs (the hist+ feature). Users in group (b) seem to use the opposite strategy, deriving their success from retaining common vocabulary across HITs (hist+) and falling back on orthography for new words. Group (c) users, who earned the most points per HIT, appear to make heavy use of context and pronunciation features together with hist+. We also see that pronunciation similarity seems to be a stronger feature for group (c) users, in contrast to the more superficial orthographic similarity.

^17 Recall that in our data collection process, we award points for each HIT (Section 3.2.1.4). While the points were designed more as a reward than as an evaluation of learner success, a higher score does reflect more guesses that were correct or close, while a lower score indicates that some words were never guessed before the system revealed them as clues.


                         QC
    Feature Removed      Expected    1-Best
    None                 0.522       0.425
    Cognate              0.516       0.366*
    Context              0.510       0.366*
    History              0.499*      0.259*

Table 3.7: Impact on quality correlation (QC) of removing features from the model. Ablated QC values marked with an asterisk (*) differ significantly from the full-model QC values in the first row (p < 0.05, using the test of Preacher (2002)).

Group (d), which earned the fewest points per HIT, appears to be an "extreme" version of group (b): these users pay unusually little attention to any model features other than orthographic similarity and hist+. (More precisely, the model finds group (d)'s guesses harder to predict on the basis of the available features, and so gives a more uniform distribution over V^e.)

3.2.6.3 Example of Learner Guesses vs. Model Predictions

To give a sense of the problem difficulty, we have hand-picked and presented two training examples (submissions) along with the predictions of our basic model and their log-probabilities. In Figure 3.11a a large portion of the sentence has been revealed to the user in English (blue text); only 2 words are in German. The text in bold font is the user's guess. Our model expected both words to be guessed; the predictions are listed below the German words Verschiedene and Regierungen. The reference translations for the 2 words are Various and governments. In Figure 3.11b we see a much harder context where only one word is shown in English and this word is not particularly helpful as a contextual anchor.

Figure 3.10: The user-specific weight vectors, clustered into groups. Average points per HIT for the HITs completed by each group: (a) 45, (b) 48, (c) 50 and (d) 42.

3.2.7 Future Improvements to the Model

Our model’sfeature set(section 3.2.3) could clearly berefined and extended. Indeed,in

the previous section, we use a moretightly controlled experi mental designto explore so me

si mplefeature variants. A cheap wayto vetfeatures would betotest whetherthey help on

thetask of modelingreferencetranslations, which are more plentiful andless noisythanthe

user guesses.


For Cognate features, there exist many other good string similarity metrics (including trainable ones). We could also include φ^{ef} features that consider whether e_i's part of speech, frequency, and length are plausible given f_i's burstiness, observed frequency, and length. (E.g., only short common words are plausibly translated as determiners.)

For Context features, we could design versions that are more sensitive to the position and status of the context word j. We speculate that the actual influence of e_j on a user's guess e_i is stronger when e_j is observed rather than itself guessed; when there are fewer intervening tokens (and particularly fewer observed ones); and when j < i. Orthogonally, φ^{ee}(e_i, e_j) could go beyond PMI and windowed PMI to also consider cosine similarity, as well as variants of these metrics that are thresholded or nonlinearly transformed. Finally, we do not have to treat the context positions j as independent multiplicative influences as in equation (3.5) (cf. Naive Bayes): we could instead use a topic model or some form of language model to determine a conditional probability distribution over E_i given all other words in the context.

An obvious gap in our current feature set is that we have no φ^e features to capture that some words e_i ∈ V^e are more likely guesses a priori. By defining several versions of this feature, based on frequencies in corpora of different reading levels, we could learn user-specific weights modeling which users are unlikely to think of an obscure word. We should also include features that fire specifically on the reference translation e_i^* and the special symbols <BLANK> and <COPY>, as each is much more likely than the other features would suggest.


For History features, we could consider negative feedback from other HITs (not just the current HIT), as well as positive information provided by revealed clues (not just confirmed guesses). We could also devise non-binary versions in which more recent or more frequent feedback on a word has a stronger effect. More ambitiously, we could model generalization: after being shown that Kind means child, a learner might increase the probability that the similar word Kinder means child or something related (children, childish, ...), whether because of superficial orthographic similarity or a deeper understanding of the morphology. Similarly, a learner might gradually acquire a model of typical spelling changes in English-German cognate pairs.

A more significant extension would be to model a user's learning process. Instead of representing each user by a small vector of user-specific weights, we could recognize that the user's guessing strategy and knowledge can change over time.

A serious deficiency in our current model is that we treat <BLANK> like any other word. A more attractive approach would be to learn a stochastic link from the posterior distribution to the user's guess or non-guess, instead of assuming that the user simply samples the guess from the posterior. As a simple example, we might say the user guesses e ∈ V^e with probability p(e)^β, where p(e) is the posterior probability and β > 1 is a learned parameter, with the remaining probability assigned to <BLANK>. This says that the user tends to avoid guessing except when there are relatively high-posterior-probability words to guess.
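A minimal sketch of this hypothetical stochastic link (β = 1.5 is an arbitrary illustrative value; since p(e) ≤ 1 and β > 1, the word probabilities shrink and the leftover mass goes to the non-guess):

    def guess_distribution(posterior, beta=1.5):
        """Map a posterior over words to a guess distribution with a <BLANK> option.
        Each word e keeps probability p(e)**beta; the remaining mass goes to <BLANK>."""
        dist = {e: p ** beta for e, p in posterior.items()}
        dist["<BLANK>"] = max(0.0, 1.0 - sum(dist.values()))
        return dist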

Finally, newer representation learning models such as BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) could also be used in our model. XLNet in particular, with its ability to account for interdependencies between output tokens, would be a good candidate to provide rich contextual features to either replace or be used along with the weaker pairwise PMI-based features of our current model.

3.2.8 Conclusion

We have presented a methodology for collecting data and training a model to estimate a foreign language learner's understanding of L2 vocabulary in partially understood contexts. Both are novel contributions to the study of L2 acquisition.

Our current model is arguably crude, with only 6 features, yet it can already often do a reasonable job of predicting what a user might guess and whether the user's guess will be roughly correct. This opens the door to a number of future directions in language acquisition that make use of personalized content and models of learners' knowledge.

We leave as future work the integration of this model into an adaptive system that tracks learner understanding and creates scaffolded content that falls in their zone of proximal development, keeping them engaged while stretching their understanding.


Figure 3.11: Two examples (panels a and b) of the system's predictions of what the user will guess on a single submission, contrasted with the user's actual guess. (The user's previous submissions on the same task instance are not shown.) In 3.11a, the model correctly expects that the substantial context will inform the user's guess. In 3.11b, the model predicts that the user will fall back on string similarity, although we can see that the user's actual guess of and day was likely informed by their guess of night, an influence that our CRF did consider. The numbers shown are log-probabilities. Both examples show the sentences in a macaronic state (after some reordering or translation has occurred). For example, the original text of the German sentence in 3.11b reads Deshalb durften die Paare nur noch ein Kind bekommen. The macaronic version has undergone some reordering, and has also erroneously dropped the verb due to an incorrect alignment.

Chapter 4

Creating Interactive Macaronic Interfaces for Language Learning

In the previous chapter, we presented models for incidental learning. We hope to generate macaronic text by consulting such models. Recall that the AI teacher's goal is to generate comprehensible macaronic texts for a student to read. Given a macaronic data structure associated with a piece of text, the AI teacher must render a macaronic configuration that it believes the student will understand (and learn from). But what if the AI teacher makes a sentence that is too difficult for the student to read? Or too easy, with very few L2 words? For such cases, we would like to give "control" back to the student and let them interactively modify the macaronic sentence. Suppose the sentence is too difficult; we would like the student not to get completely stuck, so we give them the chance to ask for hints. On the flip side, if a student feels there is not enough L2 content in a macaronic sentence, we want to let them interact with the data structure and explore the macaronic spectrum.

To provide these features to the student, we design a user interface in which such modifications are possible. We present the details of our user interface along with its interaction modalities in this chapter.

We provide details of the current user interface and discuss how content for our system can be automatically generated using existing statistical machine translation (SMT) methods, enabling learners or teachers to choose their own texts to read. Our interface lets the user navigate through the spectrum from L2 to L1, going beyond the single-word or single-phrase translations offered by other online tools such as Swych (2015), or dictionary-like browser plugins.

Finally, we note that the interaction design could include logging all of the actions a student makes while reading a text. We can then use the logged actions to refine our incidental learning model, hopefully producing macaronic text that is more personalized to the student's L2 level. We leave this for future work.

4.1 Macaronic Interface

To illustrate the workings of our interface, we assume a native English speaker (L1 = English) who is learning German (L2 = German). However, our existing interface can accommodate any pair of languages whose writing systems share directionality.^1

^1 We also assume that the text is segmented into words.


The primary goal of the interface is to empower a learner to translate and reorder parts of a confusing foreign-language sentence. These translations and reorderings serve to make the German sentence more English-like. The interface also permits reverse transformations, letting the curious learner "peek ahead" at how specific English words and constructions would surface in German.

Using these fundamental interactions as building blocks, we create an interactive framework for a language learner to explore this continuum of "English-like" to "foreign-like" sentences. By repeated interaction with new content and exposure to recurring vocabulary items and linguistic patterns, we believe a learner can pick up vocabulary and other linguistic rules of the foreign language.

4.1.1 Translation

The basic interface idea is that a line of macaronic text is equipped with hidden interlinear annotations. Notionally, English translations lurk below the macaronic text, and German ones above.

The Translation interaction allows the learner to change the text in the macaronic sentence from one language to another. Consider a macaronic sentence that is completely in the foreign state (i.e., entirely in German), as shown in Fig. 4.1a. Hovering on or under a German word shows a preview of a translation (Fig. 4.1b). Clicking on the preview will cause the translation to "rise up" and replace the German word (Fig. 4.1c).

To translate in the reverse direction, the user can hover and click above an English word (Fig. 4.1d).


Since the same mechanism applies to all the words in the sentence, a learner can manipulate translations for each word independently. For example, Fig. 4.1e shows two words in English.

(a) Initial sentence state.
(b) Mouse hovered under Preis.
(c) Preis translated to prize.
(d) Mouse hovered above prize. Clicking above will revert the sentence back to the initial state 4.1a.
(e) Sentence with 2 different words translated into English.

Figure 4.1: Actions that translate words.

The version of our prototype displayed in Figure 4.1 blurs the preview tokens when a learner is hovering above or below a word. This blurred preview acts as a visual indication of a potential change to the sentence state (if clicked), but it also gives the learner a chance to think about what the translation might be, based on visual clues such as the length and shape of the blurred text.

4.1.2 Reordering

When the learner hovers slightly below the words nach Georg Büchner, a Reordering arrow is displayed (as shown in Figure 4.2). The arrow is an indicator of reordering. In this example, the German past participle benannt appears at the end of the sentence (the conjugated form of the verb is ist benannt, or is named); this is the grammatically correct location for the participle in German, while the English form should appear earlier in the equivalent English sentence.

Similar to the translation actions, reordering actions also have a directional attribute. Figure 4.2b shows a German-to-English direction arrow. When the learner clicks the arrow, the interface rearranges all the words involved in the reordering. The new word positions are shown in 4.2c. Once again, the user can undo: hovering just above nach Georg Büchner now shows a gray arrow, which if clicked returns the phrase to its German word order (shown in 4.2d).

German phrases that are not in the original German order are highlighted as a warning (Figure 4.2c).


Figure 4.2: Actions that reorder phrases (four panels, a-d).

4.1.3 “Pop Quiz” Feature

So far, we have described the system's standard responses to a learner's actions. We now add occasional "pop quizzes." When a learner hovers below a German word (s_0 in Figure 4.3) and clicks the blurry English text, the system can either reveal the translation of the German word (state s_2), as described in Section 4.1.1, or quiz the learner (state s_3). We implement the quiz by presenting a text input box to the learner: here the learner is expected to type what they believe the German word means.


Figure 4.3: State diagram of learner interaction (edges) and the system's response (vertices). Edges can be traversed by clicking (c), hovering above (a), hovering below (b), or the enter (e) key. Unmarked edges indicate an automatic transition.

Once a guess is typed, the system indicates whether the guess is correct (s_4) or incorrect (s_5) by flashing green or red highlights in the text box. The box then disappears (after 700 ms) and the system automatically proceeds to the reveal state s_2. As this imposes a high cognitive load and increases the interaction complexity (typing vs. clicking), we intend to use the pop quiz infrequently.

The pop quiz serves two vital functions. First, it further incentivizes the user to retain learned vocabulary. Second, it allows the system to update its model of the user's current L2 lexicon, macaronic comprehension, and learning style; this is work in progress (see Section 4.3.2).


4.1.4 Interaction Consistency

Again, we regard the macaronic sentence as a kind of interlinear text, written between two mostly invisible sentences: German above and English below. In general, hovering above the macaronic sentence will reveal German words or word orders, which fall down into the macaronic sentence upon clicking. Hovering below will reveal English translations, which rise up upon clicking.

The words in the macaronic sentence are colored according to their language. We want the user to become accustomed to reading German, so the German words are in plain black text by default, while the English words use a marked color and font (italic blue). Reordering arrows also follow the same color scheme: arrows that will make the macaronic sentence more "German-like" are gray, while arrows that make the sentence more "English-like" are blue. A summary of the interactions is shown in Table 4.1.

    Translation, E-to-G: trigger = hover above an English token; preview = blurry German translation above (gray blur); confirm = click on the blurry text; result = the translation replaces the English word(s).
    Translation, G-to-E: trigger = hover under a German token; preview = blurry English translation below (blue blur); confirm = click on the blurry text; result = the translation replaces the German word(s).
    Reordering, E-to-G: trigger = hover above a token; preview = gray arrow above the reordering tokens; confirm = click on the arrow; result = the tokens reorder.
    Reordering, G-to-E: trigger = hover under a token; preview = blue arrow below the reordering tokens; confirm = click on the arrow; result = the tokens reorder.

Table 4.1: Summary of learner-triggered interactions in the Macaronic Interface.


4.2 Constructing Macaronic Translations

In this section, we describe the details of the underlying data structures needed to allow all the interactions mentioned in the previous section. A key requirement in the design of the data structure was to support orthogonal actions in each sentence. Making all translation and reordering actions independent of one another creates a large space of macaronic states for a learner to explore.

At present, the input to our macaronic interface is bitext with word-to-word alignments provided by a phrase-based SMT system (or, if desired, by hand). We employ Moses (Koehn et al., 2007) to translate German sentences and generate phrase alignments. News articles written in simple German from nachrichtenleicht.de (Deutschlandfunk, 2016) were translated after training the SMT system on the WMT15 German-English corpus (Bojar et al., 2015).

We convert the word alignments into "minimal alignments" that are either one-to-one, one-to-many, or many-to-one. For each many-to-many alignment returned by the SMT system, we remove alignment edges (lowest probability first) until the alignment is no longer many-to-many. Then we greedily add edges from unaligned tokens (highest probability first), subject to not creating many-to-many alignments and subject to minimizing the number of crossing edges, until all tokens are aligned. This step ensures consistent reversibility of actions and prevents large phrases from being translated with a single click.^2 The resulting bipartite graph can be regarded as a collection of connected components, or units (Fig. 4.4).^3

^2 Preliminary experiments showed that allowing large phrases to translate with one click resulted in abrupt jumps in the visualization, which users found hard to follow.
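A minimal sketch of this alignment-simplification step under stated assumptions: the alignment is a dictionary of (source index, target index) edges with probabilities, and the crossing-edge minimization mentioned above is omitted here for brevity.

    def degree(edge_set, side, idx):
        """Number of kept edges touching token `idx` on `side` ('f' for German, 'e' for English)."""
        pos = 0 if side == "f" else 1
        return sum(1 for edge in edge_set if edge[pos] == idx)

    def has_many_to_many(edge_set):
        """True if some edge joins a source token and a target token that each have degree > 1."""
        return any(degree(edge_set, "f", f) > 1 and degree(edge_set, "e", e) > 1
                   for (f, e) in edge_set)

    def minimal_alignment(edges):
        """edges: dict mapping (f_index, e_index) -> alignment probability.
        Returns a set of edges that is one-to-one, one-to-many, or many-to-one."""
        kept = set(edges)
        # 1) Drop lowest-probability edges that sit inside a many-to-many block.
        for f, e in sorted(edges, key=edges.get):
            if degree(kept, "f", f) > 1 and degree(kept, "e", e) > 1:
                kept.discard((f, e))
        # 2) Greedily re-attach unaligned tokens, highest probability first, as long
        #    as doing so does not recreate a many-to-many alignment.
        for f, e in sorted(edges, key=edges.get, reverse=True):
            unaligned = degree(kept, "f", f) == 0 or degree(kept, "e", e) == 0
            if unaligned and not has_many_to_many(kept | {(f, e)}):
                kept.add((f, e))
        return kept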

Figure 4.4: The dotted lines show word-to-word alignments between the German sentence f_0, f_1, ..., f_7 and its English translation e_0, e_1, ..., e_6. The figure highlights 3 of the 7 units: u_2, u_3, u_4.

4.2.1 Translation Mechanism

In a given state of the macaronic sentence, each unit is displayed in either English or German. A translation action toggles the display language of the unit, leaving it in place. For example, in Figure 4.5, where the macaronic sentence is currently displaying f_4 f_5 = noch einen, a translation action will replace this with e_4 = a.

^3 In the sections below, we gloss over cases where a unit is discontiguous (in one language). Such units are handled specially (we omit details for reasons of space). If a unit would fall outside the bounds of what our special handling can handle, we fuse it with another unit.


Figure 4.5: A possible state of the sentence, which renders a subset of the tokens (shown in black). The rendering order (Section 4.2.2) is not shown but is also part of the state. The string displayed in this case is "Und danach they run noch einen Marathon." (assuming no reordering).

4.2.2 Reordering Mechanism

A reordering action changes the unit order of the current macaronic sentence. The output string "Und danach they run noch einen Marathon." is obtained from Figure 4.5 only if unit u_2 (as labeled in Figure 4.4) is rendered (in its current language) to the left of unit u_3, which we write as u_2 < u_3. In this case, it is possible for the user to change the order of these units, because u_3 < u_2 in German. Table 4.2 shows the 8 possible combinations of ordering and translation choices for this pair of units.


    String Rendered        Unit Ordering
    ...they run...         {u_2} < {u_3}
    ...they laufen...      {u_2} < {u_3}
    ...sie run...          {u_2} < {u_3}
    ...sie laufen...       {u_2} < {u_3}
    ...run they...         {u_2} > {u_3}
    ...run sie...          {u_2} > {u_3}
    ...laufen they...      {u_2} > {u_3}
    ...laufen sie...       {u_2} > {u_3}

Table 4.2: Generating reordered strings using units.

The space of possible orderings for a sentence pair is defined by a bracketing ITG tree (Wu, 1997), which transforms the German ordering of the units into the English ordering by a collection of nested binary swaps of subsequences.^4 The ordering state of the macaronic sentence is given by the subset of these swaps that have been performed. A reordering action toggles one of the swaps in this collection.

Since we have a parser for German (Rafferty and Manning, 2008), we take care to select an ITG tree that is "compatible" with the German sentence's dependency structure, in the following sense: if the ITG tree combines two spans A and B, then there are no dependencies from words in A to words in B and vice versa.

^4 Occasionally no such ITG tree exists, in which case we fuse units as needed until one does.
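A minimal sketch of how the ordering state could be represented, assuming a binary ITG tree whose internal nodes carry a "swapped" flag (class and attribute names are illustrative, not the system's actual data structure):

    # Hypothetical sketch: leaves hold unit ids in German order; toggling a
    # swappable node's flag is one reordering action.

    class ITGNode:
        def __init__(self, unit=None, left=None, right=None, swappable=False):
            self.unit = unit          # set only for leaves
            self.left = left
            self.right = right
            self.swappable = swappable  # only swappable nodes expose a reordering arrow
            self.swapped = False        # False = German order, True = English order

        def order(self):
            """Return the current left-to-right sequence of unit ids."""
            if self.unit is not None:
                return [self.unit]
            children = [self.left.order(), self.right.order()]
            if self.swapped:
                children.reverse()
            return children[0] + children[1]

    # Example: a node over units u3 and u2 (German order u3 < u2).
    # node = ITGNode(left=ITGNode(unit="u3"), right=ITGNode(unit="u2"), swappable=True)
    # node.order() -> ["u3", "u2"]; after node.swapped = True, node.order() -> ["u2", "u3"]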


Figure 4.6: Figure 4.6a shows a simple discontiguous unit. Figure 4.6b shows a long-distance discontiguity, which is supported. In Figure 4.6c the interruptions align to both sides of e_3, which is not supported. In situations like 4.6c, all associated units are merged as one phrasal unit (shaded), as shown in Figure 4.6d.

4.2.3 Special Handling of Discontiguous Units

We provide limited support for alignments which form discontiguous units. Figure 4.6a shows a simple discontiguous unit. In this example, a reordering action (G-to-E direction) performed on either f_2 or f_4 will move f_2 to the immediate left of f_4, eliminating the interrupting alignment. After reordering, the translation action becomes available to the learner, just as in a multi-word contiguous unit. The system currently supports one or more interrupting units as long as these units are contiguous and are from only one side of the single token (see Figures 4.6a and 4.6b). If the conditions for special handling are not satisfied (see Figure 4.6c), the system forces all the tokens into a single unit, which results in a phrasal alignment and is treated as a single unit (Figure 4.6d). Such units have no reordering actions and result in a phrasal translation. We also employ this "back off" phrasal alignment in cases where alignments do not satisfy the ITG constraint.

4.3 Discussion

4.3.1 Machine Translation Challenges

When the English version of the sentence is produced by an MT system, it may suffer from MT errors and/or poor alignments.

Even with correct MT, a given syntactic construction may be handled inconsistently on different occasions, depending on the particular words involved (as these affect what phrasal alignment is found and how we convert it to a minimal alignment). Syntax-based MT could be used to design a more consistent interface that is also more closely tied to classroom L2 lessons.

Cross-linguistic divergences in the expression of information (Dorr, 1994) could be confusing. For example, when moving through macaronic space from Kaffee gefällt Menschen (coffee pleases humans) to its translation humans like coffee, it may not be clear to the learner that the reordering is triggered by the fact that like is not a literal translation of gefällt. One way to improve this might be to have the system pass smoothly through a range of intermediate translations, from word-by-word glosses to idiomatic phrasal translations, rather than always directly translating idioms. Concretely, we can first transform Kaffee gefällt Menschen into Kaffee gefällt humans, then into Kaffee pleases humans, and finally into coffee pleases humans. These transitions could be done via manual rules. Once all tokens of the German phrase are in English, the final transition would render the phrase in "correct" English: humans like coffee. We might also see benefit in guiding our gradual translations with cognates (for example, rather than translate directly from the German Möhre to the English carrot, we might offer the cognate Karotte as an intermediate step).

Another avenue of research is to transition through words that are macaronic at the sub-word level. For example, hovering over the unfamiliar German word gesprochen might decompose it into ge- and -sprochen; then clicking on one of those morphemes might yield ge-talk or sprech-ed before reaching talked. This could guide learners towards an understanding of German tense marking and stem changes. Generation of these sub-word macaronic forms could be done using multilingually trained morphological reinflection systems such as Kann, Cotterell, and Schütze (2017).

4.3.2 User Adaptation and Evaluation

We would prefer to show the learner a macaronic sentence that provides just enough clues for the learner to be able to comprehend it, while still pushing them to figure out new vocabulary or new structures. Thus, we plan to situate this interface in a framework that continuously adapts as the user progresses. As the user learns new vocabulary, the system will automatically present them with more challenging sentences (containing less L1). In the previous chapter we showed that we can predict a novice learner's guesses of L2 word meanings in macaronic sentences using a few simple features. We will subsequently track the user's learning by observing their mouse actions and "pop quiz" responses (Section 4.1).

While we have had users interact with our system in order to collect data about novice learners' guesses, we are working toward an evaluation where our system is used to supplement classroom instruction for real foreign-language students.

4.4 Conclusion

In this work we present a prototype of an interactive interface for learning to read in a foreign language. We expose the learner to L2 vocabulary and constructions in contexts that are comprehensible because they have been partially translated into the learner's native language, using statistical MT. Using MT affords flexibility: learners or instructors can choose which texts to read, and learners or the system can control which parts of a sentence are translated.

In the long term, we would like to extend the approach to allow users also to produce macaronic language, drawing on techniques from grammatical error correction or computer-aided translation to help them gradually remove L1 features from their writing (or speech) and make it more L2-like. We leave this for future work.

Chapter 5

Construction of Macaronic Texts for Vocabulary Learning

5.1 Introduction

In the previous chapters, we presented an interactive interface to read macaronic sentences and a model that predicts a student's guessing abilities, which used information from the L1 and L2 context as well as cognate information as input features. To train this model we require supervised data, meaning data on student behaviors and capabilities (Renduchintala et al., 2016b; Labutov and Lipson, 2014). Collecting such supervised data involves prompting students (in our experiments we used MTurk users) with macaronic sentences created randomly (or with some heuristic) and then asking the MTurk "students" to guess the meanings of L2 words in these sentences. The random macaronic sentences paired with student guesses form the training data. This step is expensive, not only from a data collection point of view, but also from the point of view of students, as they would have to give feedback (i.e., generate labeled data) on the actions of an initially untrained machine teacher.

In this chapter, we show that it is possible to design a machine teacher without any supervised data from (human) students. We use a neural cloze language model instead of the weaker conditional random field used earlier. We also propose a method to allow our cloze language model to incrementally learn new vocabulary items, and use this language model as a proxy for the word guessing and learning ability of real students. A machine foreign-language teacher decides which subset of words to replace by consulting this cloze language model. The cloze language model is initially trained on a corpus of L1 texts and is therefore not personalized to a (human) student. Despite this, we show that a machine foreign-language teacher can generate pedagogically useful macaronic texts after consulting the cloze language model. Since we are essentially using the cloze language model as a "drop-in" replacement for a true user model, we refer to it as a generic student model.

We evaluate three variants of our generic student language model through a study on Amazon Mechanical Turk (MTurk). We find that MTurk "students" were able to guess the meanings of L2 words (in context) introduced by the machine teacher with high accuracy, for both function words and content words, in two out of the three models. Furthermore, we select the best performing variant and evaluate whether participants can actually learn the L2 words by letting participants read a macaronic passage and giving an L2 vocabulary quiz at the end of the passage, where the L2 words are presented without their sentential context.

    Sentence                     The Nile is a river in Africa
    Gloss                        Der Nil ist ein Fluss in Afrika
    Macaronic Configurations     Der Nile ist a river in Africa
                                 The Nile is a Fluss in Africa
                                 Der Nil ist ein river in Africa

Table 5.1: An example English (L1) sentence with German (L2) glosses. Using the glosses, many possible macaronic configurations are possible. Note that the gloss sequence is not a fluent L2 sentence.

5.1.1 Limitation

While we gain the ability to construct macaronic texts for students without any prior data collection, we limit ourselves to lexical replacements only. This limitation arises because our proposed method for evaluating the knowledge of the generic student model compares lexical word embeddings and is therefore unable to measure other linguistic knowledge such as word order. This is a key limitation of the work proposed in this chapter.


5.2 Method

Our machine teacher can be viewed as a search algorithm that tries to find the (approximately) best macaronic configuration for the next sentence in a given L1 document. We assume the availability of a "gold" L2 gloss for each L1 word: in our experiments, we obtained these from bilingual speakers using Mechanical Turk. Table 5.1 shows an example English sentence with German glosses and three possible macaronic configurations (there are exponentially many configurations). The machine teacher must assess, for example, how accurately a student would understand the meanings of Der, ist, ein, and Fluss when presented with the following candidate macaronic configuration: Der Nile ist ein Fluss in Africa.^1 Understanding may arise from inference on this sentence as well as whatever the student has learned about these words from previous sentences.

The teacher makes this assessment by presenting the sentence to a generic student model (§§5.2.1-5.2.2). It uses an L2 embedding scoring scheme (§5.2.3) to guide a greedy search for the best macaronic configuration (§5.2.4).

5.2.1 Generic Student Model

Our model of a "generic student" (GSM) is equipped with a cloze language model that uses a bidirectional LSTM to predict L1 words in L1 context (Mousa and Schuller, 2017; Hochreiter and Schmidhuber, 1997).

^1 By "meaning" we mean the L1 token that was originally in the sentence before it was replaced by an L2 gloss.


Given a sentence x = [x_1, ..., x_t, ..., x_T], the cloze model defines p(x_t | h_t^f, h_t^b) for all t ∈ {1, ..., T}, where:

\begin{align}
h_t^f &= \mathrm{LSTM}([x_1, \ldots, x_{t-1}]; \theta^f) \in \mathbb{R}^D \tag{5.1} \\
h_t^b &= \mathrm{LSTM}([x_T, \ldots, x_{t+1}]; \theta^b) \in \mathbb{R}^D \tag{5.2}
\end{align}

are hidden states of forward and backward LSTM encoders parameterized by θ^f and θ^b respectively. The model assumes a fixed L1 vocabulary of size V, and the vectors x_t above are embeddings of these word types, which correspond to the rows of an embedding matrix E ∈ R^{V × D}. The cloze distribution at each position t in the sentence is obtained using

\begin{equation}
p(\cdot \mid h_t^f, h_t^b) = \mathrm{softmax}\big(E\, h([h_t^f; h_t^b]; \theta^h)\big)
\tag{5.3}
\end{equation}

where h(·; θ^h) is a projection function that reduces the dimension of the concatenated hidden states from 2D to D. We "tie" the input embeddings and output embeddings as in Press and Wolf (2017).

We train the parameters θ = [θ^f; θ^b; θ^h; E] using Adam (Kingma and Ba, 2014) to maximize \sum_x L(x), where the summation is over sentences x in a large L1 training corpus, and

\begin{equation}
L(x) = \sum_t \log p(x_t \mid h_t^f, h_t^b)
\tag{5.4}
\end{equation}

We set the dimensionality of word embeddings and LSTM hidden units to 300. We use the WikiText-103 corpus (Merity et al., 2016) as the L1 training corpus. We apply dropout (p = 0.2) between the word embeddings and LSTM layers, and between the LSTM and projection layers (Srivastava et al., 2014). We assume that the resulting model represents the entirety of the student's L1 knowledge.
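A minimal sketch of this cloze architecture in PyTorch (layer names, the shifting scheme, and the training snippet are illustrative; this is not the exact implementation):

    import torch
    import torch.nn as nn

    class ClozeLM(nn.Module):
        """Bidirectional cloze language model with tied input/output embeddings."""
        def __init__(self, vocab_size, dim=300, dropout=0.2):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)        # rows of E
            self.fwd = nn.LSTM(dim, dim, batch_first=True)  # theta^f
            self.bwd = nn.LSTM(dim, dim, batch_first=True)  # theta^b
            self.proj = nn.Linear(2 * dim, dim)             # h(.; theta^h): 2D -> D
            self.drop = nn.Dropout(dropout)

        def forward(self, tokens):                          # tokens: (batch, T)
            x = self.drop(self.emb(tokens))
            h_fwd, _ = self.fwd(x)                          # state at t encodes x_1..x_t
            h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))
            h_bwd = torch.flip(h_bwd, dims=[1])             # state at t encodes x_T..x_t
            # For position t, use the forward state at t-1 and the backward state
            # at t+1, so the model never conditions on x_t itself.
            pad = x.new_zeros(x.size(0), 1, x.size(2))
            context = torch.cat([torch.cat([pad, h_fwd[:, :-1]], dim=1),
                                 torch.cat([h_bwd[:, 1:], pad], dim=1)], dim=-1)
            hidden = self.drop(self.proj(context))          # (batch, T, D)
            return hidden @ self.emb.weight.t()             # tied output logits

    # Training objective (5.4), per batch:
    # loss = nn.CrossEntropyLoss()(model(tokens).flatten(0, 1), tokens.flatten())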

5.2.2 Incremental L2 Vocabulary Learning

The model so far can assign probability to an L1 sentence such as The Nile is a river in Africa, using equation (5.4), but what about a macaronic sentence such as Der Nile ist ein Fluss in Africa? To accommodate the new L2 words, we use another word-embedding matrix, F ∈ R^{V' × D}, and modify Eq. 5.3 to consider both the L1 and L2 embeddings:

\begin{equation}
p(\cdot \mid [h^f; h^b]) = \mathrm{softmax}\big([E; F]\cdot h([h^f; h^b]; \theta^h)\big)
\end{equation}

We also restrict the softmax function above to produce a distribution not over the full bilingual vocabulary of size |V| + |V'|, but only over the bilingual vocabulary consisting of the V L1 types together with only the v' ⊂ V' L2 types that actually appear in the macaronic sentence x. In the example macaronic sentence above, |v'| is 4. We initialize F by drawing its elements IID from Uniform[−0.01, 0.01]. Thus, all L2 words initially have random embeddings in [−0.01, 0.01]^{1×D}.

These modifications let us compute L(x) for a macaronic sentence x. We assume that when a human student reads a macaronic sentence x, they update their L2 parameters F (but not their L1 parameters θ) to increase L(x). Specifically, we assume that F will be updated to maximize

\begin{equation}
L(x; \theta^f, \theta^b, \theta^h, E, F) - \lambda \lVert F - F_{\mathrm{prev}} \rVert^2
\tag{5.5}
\end{equation}

Maximizing equation (5.5) adjusts the embeddings of each L2 word in the sentence so that it is more easily predicted from the other L1/L2 words, and also so that it is more helpful at predicting the other L1/L2 words. Since the rest of the model's parameters do not change, we expect to find an embedding for Fluss that is similar to the embedding for river. However, the regularization term with coefficient λ > 0 prevents F from straying too far from F_prev, which represents the value of F before this sentence was read. This limits the degree to which our simulated student will change their embedding of an L2 word such as Fluss based on a single example. As a result, the embedding of Fluss reflects all of the past sentences that contained Fluss, although (realistically) with some bias toward the most recent such sentences. We do not currently model spacing effects, i.e., forgetting due to the passage of time.

In principle, λ should be set based on human-subjects experiments, and might differ from human to human. In practice, in this chapter, we simply took λ = 1. We (approximately) maximized the objective above using 5 steps of gradient ascent, which gave good convergence in practice.
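A minimal sketch of this incremental update, under stated assumptions: `macaronic_logp(F)` is a placeholder that returns the cloze log-likelihood L(x) of the macaronic sentence with the L1 parameters frozen and the restricted softmax over [E; F] described above, and the optimizer and learning rate are illustrative.

    import torch

    def update_l2_embeddings(F_prev, macaronic_logp, lam=1.0, lr=0.1, steps=5):
        """Gradient ascent on L(x) - lam * ||F - F_prev||^2 with respect to F only."""
        F = F_prev.clone().requires_grad_(True)
        opt = torch.optim.SGD([F], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            objective = macaronic_logp(F) - lam * torch.sum((F - F_prev) ** 2)
            (-objective).backward()   # minimize the negative objective
            opt.step()
        return F.detach()             # the simulated student's new L2 embedding matrix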


5.2.3 Scoring L2 Embeddings

The incremental vocabulary learning procedure (§5.2.2) takes a macaronic configuration and generates a new L2 word-embedding matrix by applying gradient updates to a previous version of the L2 word-embedding matrix. The new matrix represents the proxy student's L2 knowledge after observing the macaronic configuration.

Thus, if we can score the new L2 embeddings, we can, in essence, score the macaronic configuration that generated it. The ability to score configurations affords search (§§5.2.4 and 5.2.5) for high-scoring configurations. With this motivation, we design a scoring function to measure the "goodness" of the L2 word embeddings, F.

The machine teacher evaluates F with reference to all correct word-gloss pairs from the entire document. For our example sentence, the word pairs are {(The, Der), (is, ist), (a, ein), (river, Fluss)}. But the machine teacher also has access to, for example, {(water, Wasser), (stream, Fluss), ...}, which come from elsewhere in the document. Thus, if P is the set of word pairs {(x_1, f_1), ..., (x_{|P|}, f_{|P|})}, we compute:

\begin{align}
\tilde{r}_p &= R\big(x_p, \mathrm{cs}(F_{f_p}, E)\big) \tag{5.6} \\
r_p &= \begin{cases} \tilde{r}_p & \text{if } \tilde{r}_p < r_{\max} \\ \infty & \text{otherwise} \end{cases} \nonumber \\
\mathrm{MRR}(F, E, r_{\max}) &= \frac{1}{|P|} \sum_p \frac{1}{r_p} \tag{5.7}
\end{align}

where cs(F_{f}, E) denotes the vector of cosine similarities between the embedding of an L2 word f and the entire L1 vocabulary. R(x, cs(F_f, E)) queries the rank of the correct L1 word x that pairs with f. r can take values from 1 to |V|, but we use a rank threshold r_max and force pairs with a rank worse than r_max to ∞. Thus, given a word-gloss pairing P, the current state of the L2 embedding matrix F, and the L1 embedding matrix E, we obtain the Mean Reciprocal Rank (MRR) score in (5.7).

We can think of the scoring function as a "vocabulary test" in which the proxy student gives (its best) r_max guesses for each L2 word type and receives a numerical grade.
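A minimal sketch of this scoring function, assuming `E` and `F` are NumPy arrays of unit-normalized embeddings and `pairs` lists the word-gloss index pairs from the document (names are illustrative):

    import numpy as np

    def mrr_score(E, F, pairs, r_max=8):
        """Mean reciprocal rank of the correct L1 word for each L2 gloss.
        E: (V, D) L1 embeddings, F: (V', D) L2 embeddings, both L2-normalized.
        pairs: list of (l1_index, l2_index) word-gloss pairs."""
        reciprocal_ranks = []
        for l1_idx, l2_idx in pairs:
            sims = E @ F[l2_idx]                      # cosine similarity to every L1 word
            rank = 1 + int(np.sum(sims > sims[l1_idx]))   # rank of the correct L1 word
            reciprocal_ranks.append(1.0 / rank if rank < r_max else 0.0)
        return float(np.mean(reciprocal_ranks))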

5.2.4 Macaronic Configuration Search

So far we have detailed our simulated student that would learn from a macaronic sentence, and a metric to measure how good the learned L2 embeddings would be. Now the machine teacher only has to search for the best macaronic configuration of a sentence. As there are exponentially many possible configurations to consider, exhaustive search is infeasible. We use a simple left-to-right greedy search to approximately find the highest scoring configuration for a given sentence. Algorithm 1 shows the pseudo-code for the search process. The inputs to the search algorithm are the initial L2 word-embedding matrix F_prev, the scoring function MRR(), and the generic student model SPM(). The algorithm proceeds left to right, making a binary decision at each token: should the token be replaced with its L2 gloss or left as is? For the first token, these two decisions result in the two configurations (i) Der Nile ... and (ii) The Nile ... These configurations are given to the generic student model, which updates the L2 word embeddings. The scoring function (§5.2.3) computes a score for each L2 word-embedding matrix and caches the best configuration (i.e., the configuration associated with the highest scoring L2 word-embedding matrix). If two configurations result in the same MRR score, the number of L2 word types exposed is used to break ties. In Algorithm 1, ρ(c) is the function that counts the number of L2 word types exposed in a configuration c.

5.2.5 Macaronic-Language Document Creation

Our idea is that a sequence of macaronic configurations is good if it drives the generic student model's L2 embeddings toward an MRR score close to 1 (the maximum possible). Note that we do not change the sentence order (we still want a coherent document), just the macaronic configuration of each sentence. For each sentence in turn, we greedily search over macaronic configurations using Algorithm 1, then choose the configuration that learns the best F, and proceed to the next sentence with F_prev now set to this learned F.^2 This process is repeated until the end of the document. The pseudo-code for generating an entire document of macaronic content is shown in Algorithm 2.

In summary, our machine teacher is composed of (i) a generic student model, which is a contextual L2 word-learning model (§5.2.1 and §5.2.2), and (ii) a configuration sequence search algorithm (§5.2.4 and §5.2.5), which is guided by (iii) an L2 vocabulary scoring function (§5.2.3). In the next section, we describe two variations of the generic student model.

^2 For the first sentence, we initialize F_prev to have values drawn randomly from [−0.01, 0.01].


Algorithm 1 Mixed-Lang. Config. Search

Require: x = [x_1, x_2, ..., x_T]        ▷ L1 tokens
Require: g = [g_1, g_2, ..., g_T]        ▷ L2 glosses
Require: E                               ▷ L1 embedding matrix
Require: F_prev                          ▷ initial L2 embedding matrix
Require: SPM                             ▷ Student Proxy Model
Require: MRR, r_max                      ▷ Scoring function, rank threshold

 1: function SEARCH(x, g, E, F_prev)
 2:     c ← x                            ▷ initial configuration is the L1 sentence
 3:     F ← F_prev
 4:     s ← MRR(E, F, r_max)
 5:     for i = 1; i ≤ T; i++ do
 6:         c′ ← c_1 ··· c_{i−1} g_i x_{i+1} ··· x_T
 7:         F′ ← SPM(F_prev, c′)
 8:         s′ ← MRR(E, F′, r_max)
 9:         if (s′, −ρ(c′)) ≥ (s, −ρ(c)) then
10:             c ← c′, F ← F′, s ← s′
11:         end if
12:     end for
13:     return c, F                      ▷ Mixed-Lang. Config.
14: end function


Algorithm 2 Mixed-Lang. Document Gen.

Require: D = [(x^1, g^1), ..., (x^N, g^N)]   ▷ Document
Require: E                                   ▷ L1 embedding matrix
Require: F^0                                 ▷ initial L2 embedding matrix

 1: function DOCGEN(D, F^0)
 2:     C ← []                               ▷ Configuration list
 3:     for i = 1; i ≤ N; i++ do
 4:         x^i, g^i ← D_i
 5:         c^i, F^i ← SEARCH(x^i, g^i, E, F^{i−1})
 6:         C ← C + [c^i]
 7:     end for
 8:     return C                             ▷ Mixed-Lang. Document
 9: end function
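A minimal Python rendering of the greedy procedure in Algorithms 1 and 2, under stated assumptions: `spm_update` and `mrr` are placeholders for the generic student model's embedding update and the MRR scoring function described above.

    def greedy_config_search(x, g, F_prev, spm_update, mrr):
        """Greedy left-to-right search over macaronic configurations (Algorithm 1).
        x: L1 tokens, g: aligned L2 glosses, F_prev: current L2 embedding matrix.
        spm_update(F_prev, config) -> new F;  mrr(F) -> score of an embedding matrix."""
        gloss_types = set(g)
        rho = lambda c: len({tok for tok in c if tok in gloss_types})  # L2 types exposed

        config, F, score = list(x), F_prev, mrr(F_prev)
        for i in range(len(x)):
            candidate = config[:i] + [g[i]] + list(x[i + 1:])  # replace token i with its gloss
            F_new = spm_update(F_prev, candidate)              # student learns from candidate
            s_new = mrr(F_new)
            if (s_new, -rho(candidate)) >= (score, -rho(config)):
                config, F, score = candidate, F_new, s_new
        return config, F

    def generate_document(doc, F0, spm_update, mrr):
        """Apply the search sentence by sentence (Algorithm 2)."""
        configs, F = [], F0
        for x, g in doc:
            c, F = greedy_config_search(x, g, F, spm_update, mrr)
            configs.append(c)
        return configs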


5.3 Variations in Generic Student Models

We developed two variations of the generic student model to compare and contrast the macaronic documents that can be generated.

5.3.1 Unidirectional Language Model

This variation restricts the bidirectional model (from Section 5.2.1) to be unidirectional (uGSM) and follows a standard recurrent neural network (RNN) language model (Mikolov et al., 2010).

\begin{align}
\log p(x) &= \sum_t \log p(x_t \mid h_t^f) \tag{5.8} \\
h_t^f &= \mathrm{LSTM}([x_0, \ldots, x_{t-1}]; \theta^f) \tag{5.9} \\
p(\cdot \mid h^f) &= \mathrm{softmax}(E \cdot h^f) \tag{5.10}
\end{align}

Once again, h^f ∈ R^{D×1} is the hidden state of the LSTM recurrent network, which is parameterized by θ^f, but unlike the model in Section 5.2.1, no backward LSTM and no projection function is used.

The same procedure from the bidirectional model is used to update L2 word embeddings (Section 5.2.2). While this model does not explicitly encode context from "future" tokens (i.e., words to the right of x_t), there is still pressure from the right-side tokens x_{t+1:T}, because the new embeddings will be adjusted to explain the tokens to the right as well. Fixing all the L1 parameters further strengthens this pressure on L2 embeddings from words to their right.


5.3.2 Direct Prediction Model

The previous two model variants adjust L2 embeddings using gradient steps to improve the pseudo-likelihood of the presented macaronic sentences. One drawback of such an approach is computation speed, owing to the bottleneck introduced by the softmax operation.

We designed an alternate student prediction model that can "directly" predict the embeddings for words in a sentence using contextual information. We refer to this variation as the Direct Prediction (DP) model. Like our previous generic student models, the DP model also uses bidirectional LSTMs to encode context and an L1 word embedding matrix E. However, the DP model does not attempt to produce a distribution over the output vocabulary; instead it tries to predict a real-valued vector using a feed-forward highway network (Srivastava, Greff, and Schmidhuber, 2015). The DP model's objective is to minimize the mean squared error (MSE) between a predicted word embedding and the true embedding. For a time-step t, the predicted word embedding x̂_t is generated by:

\begin{align}
h_t^f &= \mathrm{LSTM}([x_1, \ldots, x_{t-1}]; \theta^f) \tag{5.11} \\
h_t^b &= \mathrm{LSTM}([x_{t+1}, \ldots, x_T]; \theta^b) \tag{5.12} \\
\hat{x}_t &= \mathrm{FF}([x_t : h_t^f : h_t^b]; \theta^w) \tag{5.13} \\
L(\theta^f, \theta^b, \theta^w) &= \sum_t (\hat{x}_t - x_t)^2 \tag{5.14}
\end{align}

where FF(·; θ^w) denotes a feed-forward highway network with parameters θ^w. Thus, DP model training requires that we already have the "true embeddings" for all the L1 words in our corpus. We use pretrained L1 word embeddings from FastText as "true embeddings" (Bojanowski et al., 2017). This leaves the LSTM parameters θ^f, θ^b and the highway feed-forward network parameters θ^w to be learned. Equation 5.14 can be minimized by simply copying the input x_t as the prediction (ignoring all context). We use masked training to prevent the model from trivially copying (Devlin et al., 2018). We randomly "mask" 30% of the input embeddings during training. This masking operation replaces the original embedding with either (i) zero vectors, (ii) the vector of a random word in the vocabulary, or (iii) the vector of a "neighboring" word from the vocabulary.^3 The loss, however, is always computed with respect to the correct token embedding.

With the L1 parameters of the DP model trained, we turn to L2 learning. Once again the L2 vocabulary is encoded in F, which is initialized to 0 (i.e., before any sentence is observed). Consider the configuration The Nile is a Fluss in Africa. The tokens are converted into a sequence of embeddings [x_0 = E_{x_0}, ..., x_t = F_{f_t}, ..., x_T = E_{x_T}]. Note that at time-step t the L2 word-embedding matrix is used (t = 4, f_t = Fluss for the example above). A prediction x̂_t is generated by the model using Equations 5.11-5.13. Our hope is that the prediction is a "refined" version of the embedding for the L2 word. The refinement arises from considering the context of the L2 word. If Fluss was not seen before, x_t = F_{f_t} = 0, forcing the DP model to use only contextual information.

^3 We precompute 20 neighboring words (based on cosine similarity) for each word in the vocabulary using FastText embeddings before training.


We apply a simple update rule that modifies the L2 embeddings based on the direct predictions:

\begin{equation}
F_{f_t} \leftarrow (1 - \eta)\,F_{f_t} + \eta\,\hat{x}_t
\tag{5.15}
\end{equation}

where η controls the interpolation between the old values of a word embedding and the new values which have been predicted based on the current mixed sentence. If there are multiple L2 words in a configuration, say at positions i and j (where i < j), we can still follow Eqs. 5.11-5.13. However, to allow the predictions x̂_i and x̂_j to jointly influence each other, we need to execute multiple prediction iterations.

Concretely, let X^0 = [x_0, ..., F_{f_i}, ..., F_{f_j}, ..., x_T] be the sequence of word embeddings for a macaronic sentence. The DP model generates predictions X̂ = [x̂_0, ..., x̂_i, ..., x̂_j, ..., x̂_T]. We only use its predictions at time-steps corresponding to L2 tokens, since the L2 words are those we want to update (Eq. 5.16).

\begin{align}
X^1 &= \mathrm{DP}(X^0), \quad \text{where } X^0 = [x_0, \ldots, F_{f_i}, \ldots, F_{f_j}, \ldots, x_T] \nonumber \\
X^1 &= [x_0, \ldots, \hat{x}_i, \ldots, \hat{x}_j, \ldots, x_T] \tag{5.16} \\
X^k &= \mathrm{DP}(X^{k-1}) \quad \text{for } k = 2, \ldots, K \tag{5.17}
\end{align}

where X^1 contains the predictions at i and j and the original L1 word embeddings in the other positions. We then pass X^1 as input again to the DP model. This is executed for K iterations (Eq. 5.17). With each iteration, our hope is that the DP model's predictions x̂_i and x̂_j get refined by influencing each other and result in embeddings that are well-suited to the sentence context.

1 0 5 CHAPTER5. MACARONIC TEXT CONSTRUCTION

series data by Zhou and Huang(2018). Finally, after K − 1 iterations, we usethe predictions

K of x̂ i a n d x̂ j fr o m X to updatethe L2 word-e mbeddingsin F correspondingtothe L2

t o k e ns f i a n d f j .η wassetto 0.3 andthe nu mber ofiterations K = 5 .

F ← ( 1 − η )F + η x̂ K f i f i i

F ← ( 1 − η )F + η x̂ K ( 5. 1 8) f j f j j
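The iterative refinement and interpolation update of Eqs. 5.16-5.18 amount to a short loop. The sketch below is illustrative only; it assumes `dp_model` maps a (T, D) array of token embeddings to per-position predictions, and all names are assumptions rather than the dissertation's code.

```python
import numpy as np

def update_l2_embeddings(dp_model, X0, F, l2_positions, eta=0.3, K=5):
    """`X0` is the (T, D) embedding sequence of the macaronic sentence, `F` the
    L2 embedding matrix, and `l2_positions` a dict mapping sentence positions
    to L2 vocabulary ids."""
    X = X0.copy()
    for _ in range(K):
        X_hat = dp_model(X)                  # predictions at every position
        for pos in l2_positions:             # keep L1 positions fixed,
            X[pos] = X_hat[pos]              # refine only the L2 positions
    for pos, l2_id in l2_positions.items():
        # Eq. 5.18: interpolate the old L2 embedding with the refined prediction
        F[l2_id] = (1.0 - eta) * F[l2_id] + eta * X[pos]
    return F
```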

Figure 5.1: A screenshot of a macaronic sentence presented on Mechanical Turk.

5.4 Experiments with Synthetic L2

We first investigate the patterns of word replacement produced by the machine teacher under the influence of the different generic student models and how these replacements affect the guessability of L2 words. To this end, we used the machine teacher to generate macaronic documents and asked MTurk participants to guess the foreign words. Figure 5.1 shows an example screenshot of our guessing interface. The words in blue are L2 words whose meaning (in English) is guessed by MTurk participants. For our study, we created a synthetic L2 language by randomly replacing characters from English word types. This step lets us safely assume that all MTurk participants are "absolute beginners." We tried to ensure that the resulting synthetic words are pronounceable by replacing vowels with vowels, stop-consonants with other stop-consonants, etc. We also inserted or deleted one character from some of the words to prevent the reader from using the length of the synthetic word as a clue.

Metric           Model   r_max = 1        r_max = 4        r_max = 8
Replaced         GSM     0.25             0.31             0.35
                 uGSM    0.20             0.25             0.25
                 DP      0.19             0.22             0.21
Guess Accuracy   GSM     86.00 (±0.87)    74.00 (±1.10)    55.13 (±2.54)
                 uGSM    84.57 (±0.56)    73.89 (±1.72)    72.83 (±1.58)
                 DP      88.44 (±0.73)    81.07 (±1.03)    70.85 (±1.49)

Table 5.2: Results from MTurk data. The first section shows the percentage of tokens that were replaced with L2 glosses under each condition. The Guess Accuracy section shows the percentage token accuracy of MTurk participants' guesses along with 95% confidence intervals calculated via bootstrap resampling.
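The confidence intervals in Table 5.2 (and the later tables) come from bootstrap resampling. The following is a generic sketch of that procedure with illustrative names and an arbitrarily chosen resample count; the exact resampling unit used in each experiment is described in the corresponding table caption.

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, rng=None):
    """Resample the per-response correctness values with replacement and take
    the empirical (alpha/2, 1 - alpha/2) quantiles of the resampled means."""
    rng = rng or np.random.default_rng(0)
    values = np.asarray(values, dtype=float)
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    return values.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# e.g. bootstrap_ci(per_guess_correctness) -> (mean accuracy, lower bound, upper bound)
```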

We studied the three generic student models (GSM, uGSM, and DP) while keeping the rest of the machine teacher's components fixed (i.e., same scoring function and search algorithms).

[Table 5.3 is a grid of bar charts: columns for Open-Class, Closed-Class, and All tokens; rows for r_max = 1, 4, and 8; within each panel, paired bars for the DP, GSM, and uGSM models.]

Table 5.3: MTurk results split up by word-class. The y-axis is the percentage of tokens belonging to a word-class. The pink bar (right) shows the percentage of tokens (of a particular word-class) that were replaced with an L2 gloss. The blue bar (left) indicates the percentage of tokens (of a particular word-class) that were guessed correctly by MTurk participants. Error bars represent 95% confidence intervals computed with bootstrap resampling. For example, we see that only 5.0% (pink) of open-class tokens were replaced into L2 by the DP model at r_max = 1 and 4.3% of all open-class tokens were guessed correctly. Thus, even though the guess accuracy for DP at r_max = 1 for open-class is high (86%), we can see that participants were not exposed to many open-class word tokens.


All three models were constructed to have roughly the same number of L1 parameters (≈ 20M). The uGSM model used 2 unidirectional LSTM layers instead of a single bidirectional layer. The L1 and L2 word embedding size and the number of recurrent units D were set to 300 for all three models (to match the size of FastText's pretrained embeddings). We trained the three models on the WikiText-103 corpus (Merity et al., 2016).4 All models were trained for 8 epochs using the Adam optimizer (Kingma and Ba, 2014). We limit the L1 vocabulary to the 60k most frequent English types.

5.4.1 MTurk Setup

We selected 6 documents from Simple Wikipedia to serve as the input for macaronic content.5 To keep our study short enough for MTurk, we selected documents that contained 20-25 sentences. A participant could complete up to 6 HITs (Human Intelligence Tasks) corresponding to the 6 documents. Participants were given 25 minutes to complete each HIT (on average, the participants took 12 minutes to complete the HITs). To prevent typos, we used a 20k word English dictionary, which includes all the word types from the 6 Simple Wikipedia documents. We provided no feedback regarding the correctness of guesses. We recruited 128 English-speaking MTurk participants and obtained 162 responses, with each response encompassing a participant's guesses over a full document.6 Participants were compensated $4 per HIT.

4 FastText pretrained embeddings were trained on more data.
5 https://dumps.wikimedia.org/simplewiki/20190120/
6 Participants self-reported their English proficiency; only native or fluent speakers were allowed to participate. Our HITs were only available to participants from the US.


5.4.2 Experiment Conditions

We generated 9 macaronic versions (3 models {GSM, uGSM, DP} in combination with 3 rank thresholds r_max ∈ {1, 4, 8}) for each of the 6 Simple Wikipedia documents. For each HIT, an MTurk participant was randomly assigned one of the 9 macaronic versions.

GSM, r_max = 1:

    Hu Nile (‘‘an-nīl’’) ev a river um Africa. Up is hu longest river iñ Earth (about 6,650 km or 4,132 miles), though other rivers carry more water...
    Many ozvolomb types iv emoner live in or near hu waters iv hu Nile, including crocodiles, birds, fish ñb many others. Not only do animals depend iñ hu Nile for survival, but also people who live there need up zi everyday use like washing, as u jopi supply, keeping crops watered ñb other jobs...

GSM, r_max = 8:

    Hu Nile (‘‘an-nīl’’) ev u river um Africa. Up ev the longest river on Earth (about 6,650 km or 4,132 miles), though other rivers carry more water...
    Emu ozvolomb types of emoner live um or iul the waters of hu Uro, including crocodiles, ultf, yvh and emu others. Ip only do animals depend iñ the Nile zi survival, but also daudr who live there need up zi everyday use like washing, ez a jopi supply, keeping crops watered ñb other jobs...

Table 5.4: Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the GSM model.

Tables 5.4 to 5.6 show the output for the GSM, uGSM and DP generic student models at two settings of r_max for one of the documents. In these experiments we use a synthetic L2 language.

uGSM, r_max = 1:

    The Nile (‘‘an-nīl’’) ev a river um Africa. It ev hu longest river on Earth (about 6,650 km or 4,132 miles), though other rivers carry more jopi...
    Many different pita of emoner live in or near hu waters iv hu Nile, including crocodiles, ultf, fish and many others. Not mru do emoner depend iñ hu Nile for survival, but also people who live there need it for everyday use like washing, as a jopi supply, keeping crops watered ñb other jobs...

uGSM, r_max = 8:

    Hu Nile (‘‘an-nīl’’) ev u river um Africa. Up ev the longest river iñ Earth (about 6,650 km or 4,132 miles), though other rivers carry more jopi...
    Many different pita of emoner live um or near hu waters iv hu Nile, including crocodiles, ultf, fish and many others. Not mru do emoner depend on the Nile for survival, id also people who live there need it zi everyday use like washing, as u water supply, keeping crops watered ñb other jobs...

Table 5.5: Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the uGSM model.


DP, r_max = 1:

    Hu Nile (‘‘an-nīl’’) ev a river um Africa. Up ev hu longest river on Earth (about 6,650 km or 4,132 miles), though other rivers carry more water...
    Many different types iv animals live in or near hu waters iv hu Nile, including crocodiles, birds, fish and many others. Not only do animals depend iñ hu Nile for survival, but also people who live there need it for everyday use like washing, as u water supply, keeping crops watered and other jobs...

DP, r_max = 8:

    Hu Nile (‘‘an-nīl’’) ev a river um Africa. Up ev hu longest river on Earth (about 6,650 km or 4,132 miles), though udho rivers carry more water...
    Many different pita of animals live in or near hu waters of hu Nile, including crocodiles, birds, fish and many others. Not mru do animals depend iñ hu Nile zi survival, id also people who live there need it zi everyday use like washing, ez a water supply, keeping crops watered and udho jobs...

Table 5.6: Portions of one of our Simple Wikipedia articles. The document has been converted into a macaronic document by the machine teacher using the DP generic student model. Only common function words seem to be replaced with their L2 translations.

The two columns show the effect of the rank threshold r_max. Note that this macaronic document is 25 sentences long; here, we only show the first 2 sentences and another middle 2 sentences to save space. We see that r_max controls the number of L2 words the machine teacher deems guessable, which affects text readability. The increase in L2 words is most noticeable with the GSM model. We also see that the DP model differs from the others by favoring high frequency words almost exclusively. While the GSM and uGSM models similarly replace a number of high frequency words, they also occasionally replace lower frequency word classes like nouns and adjectives (emoner, Emu, etc.). Table 5.2 summarizes our findings. The first section of 5.2 shows the percentage of tokens that were deemed guessable by our machine teacher. The GSM model replaces more words as r_max is increased to 8, but we see that MTurkers had a hard time guessing the meaning of the replaced tokens: their guessing accuracy drops to 55% at r_max = 8 with the GSM model. The uGSM model, however, displays a reluctance to replace too many tokens, even as r_max was increased to 8.

We further analyzed the replacements and MTurk guesses based on word-class. We tagged the L1 tokens with their part-of-speech and categorized tokens into open or closed class following Universal Dependency guidelines ("Universal Dependencies v1: A Multilingual Treebank Collection.").7 Table 5.3 summarizes our analysis of model and human behavior when the data is separated by word-class. The pink bars indicate the percentage of tokens replaced per word-class. The blue bars represent the percentage of tokens from a particular word-class that MTurk users guessed correctly. Thus, an ideal machine teacher should strive for the highest possible pink bar while ensuring that the blue bar is as close as possible to the pink. Our findings suggest that the uGSM model at r_max = 8 and the GSM model at r_max = 4 show the desirable properties: high guessing accuracy and more representation of L2 words (particularly open-class words).

7 https://universaldependencies.org/u/pos/

Metric            Model     Closed            Open
Types Replaced    random    59                524
                  GSM       33                149
Guess Accuracy    random    62.06 (±1.54)     39.36 (±1.75)
                  GSM       74.91 (±0.94)     61.96 (±1.24)

Table 5.7: Results comparing our generic student based approach to a random baseline. The first part shows the number of L2 word types exposed by each model for each word class. The second part shows the average guess accuracy percentage for each model and word class. 95% confidence intervals (in brackets) were computed using bootstrap resampling.

5.4.3 Random Baseline

So far we've compared different generic student models against each other, but is our generic student based approach required at all? How much better (or worse) is this approach compared to a random baseline? To answer these questions, we compare the GSM with r_max = 4 model against a randomly generated macaronic document. As the name suggests, word replacements are decided randomly for the random condition, but we ensure that the number of tokens replaced in each sentence equals that from the GSM condition.

We used the 6 Simple Wikipedia documents from §5.4.1 and recruited 64 new MTurk participants who completed a total of 66 HITs (compensation was $4 per HIT).

Model     Closed           Open
random    9.86 (±0.94)     4.28 (±0.69)
GSM       35.53 (±1.03)    27.77 (±1.03)

Table 5.8: Results of our L2 learning experiments where MTurk subjects simply read a macaronic document and answered a vocabulary quiz at the end of the passage. The table shows the average guess accuracy percentage along with 95% confidence intervals computed from bootstrap resampling.

For each HIT, the participant was given either the randomly generated or the GSM-based macaronic document. Once again, participants were made to enter their guess for each L2 word that appears in a sentence. The results are summarized in Table 5.7.

We find that randomly replacing words with glosses exposes more L2 word types (59 and 524 closed-class and open-class words respectively) while the GSM model is more conservative with replacements (33 and 149). However, the random macaronic document is much harder to comprehend, indicated by significantly lower average guess accuracies than those with the GSM model. This is especially true for open-class words. Note that Table 5.7 shows the number of word types replaced across all 6 documents.


5.4.4 Learning Evaluation

Our macaronic based approach relies on incidental learning: if a novel word is repeatedly presented to a student with sufficient context, the student will eventually be able to learn the novel word. So far our experiments test MTurk participants on the "guessability" of novel words in context, but not learning. To study if students can actually learn the L2 words, we conduct an MTurk experiment where participants are simply required to read a macaronic document (one sentence at a time). At the end of the document an L2 vocabulary quiz is given. Participants must enter the meaning of every L2 word type they have seen during the reading phase.

Once again, we compare our GSM (r_max = 4) model against a random baseline using the 6 Simple Wikipedia documents. 47 HITs were obtained from 45 MTurk participants for this experiment. Participants were made aware that there would be a vocabulary quiz at the end of the document. Our findings are summarized in Table 5.8. We find the accuracy of guesses for the vocabulary quiz at the end of the document is considerably lower than guesses with context. However, subjects still managed to retain 35.53% and 27.77% of closed-class and open-class L2 word types respectively. On the other hand, when a random macaronic document was presented to participants, their guess accuracy dropped to 9.86% and 4.28% for closed and open class words respectively. Thus, even though more word types were exposed by the random baseline, fewer words were retained.

Additionally, we would like to investigate how our approach could be extended to enable phrasal learning (which should consider word-ordering differences between the L1 and L2). As the GSM and uGSM models showed the most promising results in our experiments, we believe these models could serve as the baseline for future work.

5.5 Spelling-Aware Extension

So far, our generic student model ignores the fact that a novel word like Afrika is guessable simply by its spelling similarity to Africa. Thus, we augment the generic student model to use character n-grams. We choose the bidirectional generic student model for our spelling-aware extension based on the pilot experiments detailed in §5.4.2. In addition to an embedding per word type, we learn embeddings for character n-gram types that appear in our L1 corpus. The row in E for a word w is now parameterized as:

\tilde{E} \cdot \tilde{w} + \sum_n \frac{1}{\mathbf{1} \cdot \tilde{w}_n}\, \tilde{E}_n \cdot \tilde{w}_n    (5.19)

where Ẽ is the full-word embedding matrix and w̃ is a one-hot vector associated with the word type w, Ẽ_n is a character n-gram embedding matrix and w̃_n is a multi-hot vector associated with all the character n-grams for the word type w. For each n, the summand gives the average embedding of all n-grams in w (where 1·w̃_n counts these n-grams). We set n to range from 3 to 4 (see §5.7). This formulation is similar to previous sub-word based embedding models (Wieting et al., 2016; Bojanowski et al., 2017).

Similarly, the embedding of an L2 word w is parameterized as

\tilde{F} \cdot \tilde{w} + \sum_n \frac{1}{\mathbf{1} \cdot \tilde{w}_n}\, \tilde{F}_n \cdot \tilde{w}_n    (5.20)

Crucially, we initialize F̃_n to µẼ_n (where µ > 0) so that L2 words can inherit part of their initial embedding from similarly spelled L1 words: F̃^4_{Afri} := µẼ^4_{Afri}.8 But we allow F̃_n to diverge over time in case an n-gram functions differently in the two languages. In the same way, we initialize each row of F̃ to the corresponding row of µ·Ẽ, if any, and otherwise to 0. Our experiments set µ = 0.2 (see §5.7). We refer to this spelling-aware extension to GSM as sGSM.

8 We set µ = 0.2 based on findings from our hyperparameter search (see §5.7).
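To illustrate the composition in Eqs. 5.19-5.20, the sketch below averages character n-gram embeddings (n = 3, 4) and adds them to the full-word embedding. The helper names and data layout are assumptions made for the illustration, not the dissertation's code.

```python
import numpy as np

def char_ngrams(word, n):
    """Character n-grams of a word padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def compose_embedding(word, word_id, word_emb, ngram_emb, ngram_vocab, orders=(3, 4)):
    """`word_emb` plays the role of E-tilde (or F-tilde), and `ngram_emb[n]` /
    `ngram_vocab[n]` hold one embedding matrix and one id lookup per order n."""
    vec = word_emb[word_id].copy()                       # full-word term
    for n in orders:
        ids = [ngram_vocab[n][g] for g in char_ngrams(word, n) if g in ngram_vocab[n]]
        if ids:
            vec += ngram_emb[n][ids].mean(axis=0)        # average n-gram term (Eq. 5.19)
    return vec

# Spelling-aware initialization (sGSM): each F-tilde_n is set to mu * E-tilde_n,
# and rows of F-tilde are copied (scaled by mu) from E-tilde when the type exists in L1.
```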

5.5.1 Scoring L2 embeddings

Did the simulated student learn correctly and usefully? Let P be the "reference set" of all (L1 word, L2 gloss) pairs from all tokens in the entire document. We assess the machine teacher's success by how many of these pairs the simulated student has learned. (The student may even succeed on some pairs that it has never been shown, thanks to n-gram clues.) Specifically, we measure the "goodness" of the updated L2 word embedding matrix F. For each pair p = (e, f) ∈ P, sort all the words in the entire L1 vocabulary according to their cosine similarity to the L2 word f, and let r_p denote the rank of e. For example, if the student had managed to learn a matrix F whose embedding of f exactly equalled E's embedding of e, then r_p would be 1. We then compute a mean reciprocal rank (MRR) score of F:

\mathrm{MRR}(F) = \frac{1}{|P|} \sum_{p \in P} \left( r_p^{-1}\ \text{if } r_p \leq r_{\max}\ \text{else } 0 \right)    (5.21)

We set r_max = 4 based on our pilot study. This threshold has the effect of only giving credit to an embedding of f such that the correct e is in the simulated student's top 4 guesses. As a result, §5.5.2's machine teacher focuses on introducing L2 tokens whose meaning can be deduced rather accurately from their single context (together with any prior exposure to that L2 type). This makes the macaronic text comprehensible for a human student, rather than frustrating to read. In our pilot study we found that imposing r_max substantially improved human learning.
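The MRR evaluation of Eq. 5.21 is straightforward to compute; the sketch below is illustrative (matrix and function names are assumptions), ranking L1 words by cosine similarity as described above.

```python
import numpy as np

def mrr_score(E, F, pairs, r_max=4):
    """`E` and `F` are the L1 and L2 embedding matrices; `pairs` is the
    reference set P of (L1 word id, L2 word id) pairs."""
    E_norm = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-8)
    F_norm = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-8)
    total = 0.0
    for e_id, f_id in pairs:
        sims = E_norm @ F_norm[f_id]                  # similarity of f to every L1 word
        rank = 1 + int((sims > sims[e_id]).sum())     # rank of the correct L1 word e
        if rank <= r_max:
            total += 1.0 / rank                       # reciprocal rank; 0 credit otherwise
    return total / len(pairs)
```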

5.5.2 Macaronic Configuration Search

Our current machine teacher produces the macaronic document greedily, one sentence at a time. Actual documents produced are shown in §5.9.

Let F_prev be the student model's embedding matrix after reading the first n - 1 macaronic sentences. We evaluate a candidate next sentence x by the score MRR(F) where F maximizes (5.5) and is thus the embedding matrix that the student would arrive at after reading x as the n-th macaronic sentence.

We use best-first search to seek a high-scoring x. A search state is a pair (i, x) where x is a macaronic configuration (Table 5.1) whose first i tokens may be either L1 or L2, but whose remaining tokens are still L1. The state's score is obtained by evaluating x as described above. In the initial state, i = 0 and x is the n-th sentence of the original L1 document. The state (i, x) is a final state if i = |x|. Otherwise its two successors are (i + 1, x) and (i + 1, x'), where x' is identical to x except that the (i + 1)-th token has been replaced by its L2 gloss. The search algorithm maintains a priority queue of states sorted by score. Initially, this contains only the initial state. A step of the algorithm consists of popping the highest-scoring state and, if it is not final, replacing it by its two successors. The queue is then pruned back to the top 8 states. When the queue becomes empty, the algorithm returns the configuration x from the highest-scoring final state that was ever popped.
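For concreteness, the following sketch implements the best-first search just described. Here `score_fn` stands in for evaluating MRR(F) after the simulated student reads a candidate configuration; all names are assumptions rather than the dissertation's code.

```python
def search_configuration(sentence, score_fn, beam=8):
    """`sentence` is a list of (l1_token, l2_gloss) pairs; returns the
    highest-scoring macaronic configuration found."""
    T = len(sentence)
    start = tuple(tok for tok, _ in sentence)              # the all-L1 sentence, i = 0
    queue = [(score_fn(start), 0, start)]                  # states: (score, i, config)
    best_final = (float("-inf"), start)
    while queue:
        score, i, config = queue.pop(0)                    # queue is kept sorted best-first
        if i == T:                                         # final state: every token decided
            best_final = max(best_final, (score, config))
            continue
        swap = config[:i] + (sentence[i][1],) + config[i + 1:]   # replace token i+1 by its L2 gloss
        queue.append((score, i + 1, config))               # successor 1: keep token i+1 in L1
        queue.append((score_fn(swap), i + 1, swap))        # successor 2: use the L2 gloss
        queue = sorted(queue, reverse=True)[:beam]         # prune back to the top `beam` states
    return best_final[1]
```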

5.6 Experiments with real L2

Does our machine teacher generate useful macaronic text? To answer this, we measure whether human students (i) comprehend the L2 words in context, and (ii) retain knowledge of those L2 words when they are later seen without context.

We assess (i) by displaying each successive sentence of a macaronic document to a human student and asking them to guess the L1 meaning for each L2 token f in the sentence. For a given machine teacher, all human subjects saw the same macaronic document, and each subject's comprehension score is the average quality of their guesses on all the L2 tokens presented by that teacher. A guess's quality q ∈ [0, 1] is a thresholded cosine similarity between the embeddings9 of the guessed word ê and the original L1 word e: q = cs(e, ê) if cs(e, ê) ≥ τ else 0. Thus, ê = e obtains q = 1 (full credit), while q = 0 if the guess is "too far" from the truth (as determined by τ).

9 Here we used pretrained word embeddings from Mikolov et al. (2018), in order to measure actual semantic similarity.
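As a quick illustration, the thresholded guess quality q can be computed as below; the embedding lookup `emb` and the function name are assumptions.

```python
import numpy as np

def guess_quality(e, e_hat, emb, tau=0.6):
    """q = cs(e, e_hat) if cs(e, e_hat) >= tau else 0, with `emb` mapping a
    word to its pretrained embedding vector."""
    v, w = emb[e], emb[e_hat]
    cs = float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-8))
    return cs if cs >= tau else 0.0    # full credit (q = 1) when e_hat == e
```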

To assess (ii), we administer an L2 vocabulary quiz after having human subjects simply read a macaronic passage (without any guessing as they are reading). They are then asked to guess the L1 translation of each L2 word type that appeared at least once in the passage. We used the same guess quality metric as in (i).10 This tests if human subjects naturally learn the meanings of L2 words, in informative contexts, well enough to later translate them out of context. The test requires only short-term retention, since we give the vocabulary quiz immediately after a passage is read.

We compared results on macaronic documents constructed with the generic student model (GSM), its spelling-aware variant (sGSM), and a random baseline. In the baseline, tokens to replace are randomly chosen while ensuring that each sentence replaces the same number of tokens as in the GSM document. This ignores context, spelling, and prior exposures as reasons to replace a token.

Our evaluation was aimed at native English (L1) speakers learning Spanish or German (L2). We recruited L2 "students" on Amazon Mechanical Turk (MTurk). They were absolute beginners, selected using a placement test and self-reported L2 ability.

10 If multiple L1 types e were glossed in the document with this L2 type, we generously use the e that maximizes cs(e, ê).

L2   Model     Closed-class         Open-class
Es   random    0.74 ±0.0126 (54)    0.61 ±0.0134 (17)
     GSM       0.72 ±0.0061 (54)    0.70 ±0.0084 (17)
     sGSM      0.82 ±0.0038 (41)    0.80 ±0.0044 (21)
De   random    0.59 ±0.0054 (34)    0.38 ±0.0065 (13)
     GSM       0.80 ±0.0033 (34)    0.78 ±0.0056 (13)
     sGSM      0.82 ±0.0063 (33)    0.79 ±0.0062 (14)

Table 5.9: Average token guess quality (τ = 0.6) in the comprehension experiments. The ± denotes a 95% confidence interval computed via bootstrap resampling of the set of human subjects. The % of L1 tokens replaced with L2 glosses is in parentheses. §5.8 evaluates with other choices of τ.

5.6.1 Comprehension Experiments

We used the first chapter of Jane Austen's "Sense and Sensibility" for Spanish, and the first 60 sentences of Franz Kafka's "Metamorphosis" for German. Bilingual speakers provided the L2 glosses (see §5.9 for examples).

For English-Spanish, 11, 8, and 7 subjects were assigned macaronic documents generated with sGSM, GSM, and the random baseline, respectively. The corresponding numbers for English-German were 12, 7 and 7. A total of 39 subjects were used in these experiments (some subjects did both languages). They were given 3 hours to complete the entire document (average completion time was ≈1.5 hours) and were compensated $10.

Table 5.9 reports the mean comprehension score over all subjects, broken down into comprehension of function words (closed-class POS) and content words (open-class POS).11 For Spanish, the sGSM-based teacher replaces more content words (but fewer function words), and furthermore the replaced words in both cases are better understood on average, which we hope leads to more engagement and more learning. For German, by contrast, the number of words replaced does not increase under sGSM, and comprehension only improves marginally. Both GSM and sGSM do strongly outperform the random baseline. But the sGSM-based teacher only replaces a few additional cognates (hundert but not Mutter), apparently because English-German cognates do not exhibit large exact character n-gram overlap. We hypothesize that character skip n-grams might be more appropriate for English-German.

11 https://universaldependencies.org/u/pos/

L2   Model     Closed-class         Open-class
Es   random    0.47 ±0.0058 (60)    0.40 ±0.0041 (46)
     GSM       0.48 ±0.0084 (60)    0.42 ±0.0105 (15)
     sGSM      0.52 ±0.0054 (47)    0.50 ±0.0037 (24)

Table 5.10: Average type guess quality (τ = 0.6) in the retention experiment. The % of L2 gloss types that were shown in the macaronic document is in parentheses. §5.8 evaluates with other choices of τ.

5.6.2 Retention Experiments

For retention experiments we used the first 25 sentences of our English-Spanish dataset. New participants were recruited and compensated $5. Each participant was assigned a macaronic document generated with the sGSM, GSM or random model (20, 18, and 22 participants respectively). As Table 5.10 shows, sGSM's advantage over GSM on comprehension holds up on retention. On the vocabulary quiz, students correctly translated > 30 of the 71 word types they had seen (Table 5.15), and more than half when near-synonyms earned partial credit (Table 5.10).

5.7 Hyperparameter Search

We tuned the model hyperparameters by hand on separate English-Spanish data, namely the second chapter of "Sense and Sensibility," equipped with glosses. Hyperparameter tuning results are reported in this section. All other English-Spanish results in the paper are on the first chapter of "Sense and Sensibility," which was held out for testing. We might have improved the results on English-German by tuning separate hyperparameters for that setting.

The tables below show the effect of different hyperparameter choices on the quality MRR(F) of the embeddings learned by the simulated student. Recall from §5.5.1 that the MRR score evaluates F using all glosses, not just those used in a particular macaronic document. Thus, it is comparable across the different macaronic documents produced by different machine teachers.

QueueSize (§5.5.2) affects only how hard the machine teacher searches for macaronic sentences that will help the simulated student. We find that larger QueueSize is in fact valuable.

The other choices (Model, n-grams, µ) affect how the simulated student actually learns. The machine teacher then searches for a document that will help that particular simulated student learn as many of the words in the reference set as possible. Thus, the MRR score is high to the extent that the simulated student "can be successfully taught." By choosing hyperparameters that achieve a high MRR score, we are assuming that human students are adapted (or can adapt online) to be teachable.

The scale factor µ (used only for sGSM) noticeably affects the macaronic document generated by the machine teacher. Setting it high (µ = 1.0) has an adverse effect on the MRR score. Table 5.11 shows how the MRR score of the simulated student (§5.5.1) varies according to the student model's µ value. Tables 5.12 and 5.13 show the result of the same hyperparameter sweep on the number of L1 word tokens and types replaced with L2 glosses.

Note that µ only affects initialization of the F parameters. Thus, with µ = 0, the L2 word and subword embeddings are initialized to 0, but the simulated sGSM student still has the ability to learn subword embeddings for both L1 and L2. This allows it to beat the simulated GSM student.

We see that for sGSM, µ = 0.2 results in replacing the most words (both types and tokens), and also has very nearly the highest MRR score. Thus, for sGSM, we decided to use µ = 0.2 and allow both 3-gram and 4-gram embeddings.

5.8 Results Varying τ

L2   τ     Model     Closed-class         Open-class
Es   0.0   random    0.81 ±0.0084 (54)    0.72 ±0.0088 (17)
           GSM       0.80 ±0.0045 (54)    0.79 ±0.0057 (17)
           sGSM      0.86 ±0.0027 (41)    0.84 ±0.0032 (21)
     0.2   random    0.81 ±0.0085 (54)    0.72 ±0.0089 (17)
           GSM       0.80 ±0.0045 (54)    0.79 ±0.0057 (17)
           sGSM      0.86 ±0.0027 (41)    0.84 ±0.0033 (21)
     0.4   random    0.79 ±0.0101 (54)    0.66 ±0.0117 (17)
           GSM       0.76 ±0.0057 (54)    0.75 ±0.0071 (17)
           sGSM      0.84 ±0.0033 (41)    0.82 ±0.0039 (21)
     0.6   random    0.74 ±0.0126 (54)    0.61 ±0.0134 (17)
           GSM       0.72 ±0.0061 (54)    0.70 ±0.0084 (17)
           sGSM      0.82 ±0.0038 (41)    0.80 ±0.0044 (21)
     0.8   random    0.62 ±0.0143 (54)    0.46 ±0.0124 (17)
           GSM       0.59 ±0.0081 (54)    0.58 ±0.0106 (17)
           sGSM      0.71 ±0.0052 (41)    0.67 ±0.0062 (21)
     1.0   random    0.62 ±0.0143 (54)    0.45 ±0.0124 (17)
           GSM       0.59 ±0.0081 (54)    0.55 ±0.0097 (17)
           sGSM      0.70 ±0.0052 (41)    0.64 ±0.0063 (21)
De   0.0   random    0.70 ±0.0039 (34)    0.56 ±0.0046 (13)
           GSM       0.85 ±0.0023 (34)    0.84 ±0.0039 (13)
           sGSM      0.87 ±0.0045 (33)    0.84 ±0.0044 (14)
     0.2   random    0.69 ±0.0042 (34)    0.56 ±0.0047 (13)
           GSM       0.85 ±0.0024 (34)    0.84 ±0.0039 (13)
           sGSM      0.87 ±0.0046 (33)    0.84 ±0.0044 (14)
     0.4   random    0.64 ±0.0052 (34)    0.45 ±0.0064 (13)
           GSM       0.83 ±0.0029 (34)    0.81 ±0.0045 (13)
           sGSM      0.84 ±0.0055 (33)    0.81 ±0.0054 (14)
     0.6   random    0.59 ±0.0054 (34)    0.38 ±0.0065 (13)
           GSM       0.80 ±0.0033 (34)    0.78 ±0.0056 (13)
           sGSM      0.82 ±0.0063 (33)    0.79 ±0.0062 (14)
     0.8   random    0.45 ±0.0058 (34)    0.25 ±0.0061 (13)
           GSM       0.72 ±0.0037 (34)    0.66 ±0.0081 (13)
           sGSM      0.75 ±0.0079 (33)    0.65 ±0.0077 (14)
     1.0   random    0.45 ±0.0058 (34)    0.24 ±0.0061 (13)
           GSM       0.71 ±0.0040 (34)    0.63 ±0.0082 (13)
           sGSM      0.75 ±0.0079 (33)    0.63 ±0.0081 (14)

Table 5.14: An expanded version of Table 5.9 (human comprehension experiments), reporting results with various values of τ.

A more comprehensive variant of Table 5.9 is given in Table 5.14. This table reports the same human-subjects experiments as before; it only varies the measure used to assess the quality of the humans' guesses, by varying the threshold τ. Note that τ = 1 assesses exact-match accuracy, τ = 0.6 as in Table 5.9 corresponds roughly to synonymy (at least for content words), and τ = 0 assesses average unthresholded cosine similarity. We find that sGSM consistently outperforms both GSM and the random baseline over the entire range of τ.

                                  Scale Factor µ
Model   n-grams   QueueSize   1.0     0.4     0.2     0.1     0.05    0.0
sGSM    2,3,4     1           0.108   0.207   0.264   0.263   0.238   0.175
sGSM    3,4       1           0.113   0.199   0.258   0.274   0.277   0.189
sGSM    3,4       4           -       -       0.267   0.286   -       -
sGSM    3,4       8           -       -       0.288   0.292   -       -
GSM     ∅         1           0.159
GSM     ∅         4           0.171
GSM     ∅         8           0.172

Table 5.11: MRR scores obtained with different hyperparameter settings.

As we get closer to exact match, the random baseline suffers the largest drop in performance. Similarly, Table 5.15 shows an expanded version of the retention results in Table 5.10. The gap between the models is smaller on retention than it was on comprehension. However, again sGSM > GSM > random across the range of τ. We find that for function words, the random baseline performs as well as GSM as τ is increased. For content words, however, the random baseline falls faster than GSM.

We warn that the numbers are not genuinely comparable across the 3 models, because each model resulted in a different document and thus a different vocabulary quiz. Our human subjects were asked to translate just the L2 words in the document they read.

                                  Scale Factor µ
Model   n-grams   QueueSize   1.0    0.4    0.2    0.1    0.05   0.0
sGSM    2,3,4     1           149    301    327    275    201    247
sGSM    3,4       1           190    340    439    399    341    341
sGSM    3,4       4           -      -      462    440    -      -
sGSM    3,4       8           -      -      478    450    -      -
GSM     ∅         1           549
GSM     ∅         4           557
GSM     ∅         8           530

Table 5.12: Number of L1 tokens replaced by L2 glosses under different hyperparameter settings.

In particular, sGSM taught fewer total types (71) than GSM (75) or the random baseline (106). All that Table 5.15 shows is that it taught its chosen types better (on average) than the other methods taught their chosen types.

5.9 Macaronic Examples

Below, we display the actual macaronic documents generated by our methods. The first few paragraphs of "Sense and Sensibility" with the sGSM model using µ = 0.2, 3- and 4-grams, priority queue size of 8, and r_max = 4 are shown below:

                                  Scale Factor µ
Model   n-grams   QueueSize   1.0    0.4    0.2    0.1    0.05   0.0
sGSM    2,3,4     1           39     97     121    106    75     88
sGSM    3,4       1           44     97     125    124    112    99
sGSM    3,4       4           -      -      124    127    -      -
sGSM    3,4       8           -      -      145    129    -      -
GSM     ∅         1           106
GSM     ∅         4           111
GSM     ∅         8           114

Table 5.13: Number of distinct L2 word types present in the macaronic document under different hyperparameter settings.

Sense y Sensibility

La family de Dashwood llevaba long been settled en Sussex.

Their estate era large, and their residencia was en Norland Park, in el centre de their property, where, for muchas generations, they habían lived en so respectable a manner as to engage el general good opinion of los surrounding acquaintance. El late owner de this propiedad was un single man, que lived to una very advanced age, y que durante many years of his life, had a constante companion and housekeeper in su sister. But ella death,

que happened ten años antes his own, produced a great alteration in su home; for to supply her loss, he invited and received into his house la family of su sobrino señor Henry Dashwood, the legal inheritor of the Norland estate, and the person to whom he intended to bequeath it. En la society de su nephew y niece, y their children, el old Gentleman's days fueron comfortably spent.

Su attachment a them all increased. The constant attention de

Mr. y Mrs. Henry Dashwood to sus wishes, que proceeded no merely from interest, but from goodness of heart, dio him every degree de solid comfort which su age podı́ a receive; and la cheerfulness of the children added a relish to his existencia.

By un former marriage, Mr. Henry Dashwood tenía one son: by su present lady, three hijas. El son, un steady respectable young man, was amply provided for por the fortuna de his mother, which había been large, y half of which devolved on him on his coming of edad. Por su own matrimonio, likewise, which happened soon después, he added a his wealth. To him therefore la succession a la Norland estate era no so really importante as to his sisters; para their fortuna, independent de what pudiera arise a ellas from su father's inheriting that propiedad, could ser but small. Su mother had nothing, y their father only seven mil pounds en his own disposición; for la remaining moiety of his

first esposa's fortune was also secured to her child, and él tenía sólo a life-interés in it.

el anciano gentleman died: his will was read, and like almost todo other will, dio as tanto disappointment as pleasure.

He fue neither so unjust, ni so ungrateful, as para leave his estate de his nephew; --but he left it a him en such terms as destroyed half the valor de el bequest. Mr. Dashwood había wished for it more por el sake of his esposa and hijas than for himself or su son; --but a his son, y su son's son, un child de four años old, it estaba secured, in tal a way, as a leave a himself no power de providing por those que were most dear para him, and who most necesitaban a provisión by any charge on la estate, or por any sale de its valuable woods. El whole fue tied arriba para the beneficio de this child, quien, in occasional visits with his padre and mother at Norland, had tan far gained on el affections de his uncle, by such attractions as are by no means unusual in children of two o three years old; una imperfect articulación, an earnest desire of having his own way, many cunning tricks, and a great deal of noise, as to outweigh all the value de all the attention which, for years, él había received from his niece and sus daughters. He meant no a ser unkind, however, y, como a mark de his affection for las three

girls, he left ellas un mil libras a-piece.


Next, the first few paragraphs of "Sense and Sensibility" with the GSM model using priority queue size of 8 and r_max = 4:

Sense y Sensibility

La family de Dashwood llevaba long been settled en Sussex.

Su estate era large, and su residence estaba en Norland Park, in el centre de their property, where, por many generations, they had lived in so respectable una manner as a engage el general good opinion de los surrounding acquaintance. El late owner de esta estate was un single man, que lived to una very advanced age, y who durante many years de su existencia, had una constant companion y housekeeper in his sister. But ella death, que happened ten years antes su own, produced a great alteration in su home; for para supply her loss, é l invited and received into his house la family de su nephew Mr. Henry Dashwood, the legal inheritor de the Norland estate, and the person to whom se intended to bequeath it. In the society de su nephew and niece, and sus children, el old Gentleman’s days fueron comfortably spent. Su attachment a them all increased. La constant attention de Mr. y Mrs. Henry Dashwood to sus wishes, which proceeded not merely from interest, but de goodness de heart, dio him every degree de solid comfort que his age could receive; y la cheerfulness of the children added un relish a su existence.

By un former marriage, Mr. Henry Dashwood tenía one son:

by su present lady, three hijas. El son, un steady respectable joven man, was amply provided for por la fortune de su madre, que había been large, y half de cuya devolved on him on su coming de edad. By su own marriage, likewise, que happened soon después, he added a su wealth. Para him therefore la succession a la

Norland estate was no so really importante as to his sisters; para their fortune, independent de what pudiera arise a them from su father’s inheriting that property, could ser but small. Su madre had nothing, y su padre only siete thousand pounds in su own disposal; for la remaining moiety of his first wife’s fortune era also secured a su child, y é l had only una life-interest in ello.

el old gentleman died: su will was read, y like almost every otro will, gave as tanto disappointment as pleasure. He fue neither so unjust, nor so ungrateful, as to leave su estate from his nephew; --but he left it to him en such terms como destroyed half the valor of the bequest. Mr. Dashwood habı́ a wished for it m á s for el sake de su wife and daughters than para himself or su hijo; --but a su hijo, y his son’s hijo, un child de four añ o s old, it estaba secured, en tal un way, as a leave a himself no power of providing for aquellos who were most dear para him, y who most needed un provision by any charge sobre la estate, or por any sale de its valuable woods. El whole was tied arriba for

el benefit de this child, quien, en ocasionales visits with his father and mother at Norland, had tan far gained on the affections of his uncle, by such attractions as are por no means unusual in children of two or three years old; an imperfect articulation, an earnest desire of having his own way, many cunning tricks, and a gran deal of noise, as to outweigh todo the value of all the attention which, for years, he had received from his niece and her daughters. He meant no a ser unkind, however, y, como una mark de su affection por las three girls, he left them un mil pounds a-pieza.


First few paragraphs of "The Metamorphosis" with the sGSM model using µ = 0.2, 3- and 4-grams, priority queue size of 8, and r_max = 4:

Metamorphosis

One morning, als Gregor Samsa woke from troubled dreams, he fand himself transformed in seinem bed into einem horrible vermin.

He lay on seinem armour-like back, und if er lifted seinen head a little he konnte see his brown belly, slightly domed und divided by arches into stiff sections. The bedding war hardly able zu cover it und seemed ready zu slide off any moment. His many legs, pitifully thin compared mit the size von dem rest von him, waved about helplessly as er looked.

‘‘What’s happened to mir?’’ he thought. His room, ein proper human room although ein little too small, lay peacefully between its four familiar walls. Eine collection of textile samples lay spread out on the table - Samsa was ein travelling salesman - and above it there hung a picture das he had recently cut out von einer illustrated magazine und housed in einem nice, gilded frame. It showed eine lady fitted out mit a fur hat und fur boa who sat upright, raising einen heavy fur muff der covered the whole von her lower arm towards dem viewer.

Gregor then turned zu look out the window at the dull weather. Drops von rain could sein heard hitting the pane, welche made him fühlen quite sad. ''How about if ich sleep ein little

bit longer and forget all diesen nonsense,'' he thought, aber that was something er was unable zu do because he war used zu sleeping auf his right, und in seinem present state couldn't bringen into that position. However hard he threw sich onto seine right, he always rolled zurück to where he was. He must haben tried it a hundert times, shut seine eyes so dass er wouldn't haben zu look at die floundering legs, and only stopped when er began zu fühlen einen mild, dull pain there das he hatte never felt before.

‘‘Ach, God,’’ he thought, ‘‘what a strenuous career it is das I’ve chosen! Travelling day in und day out. Doing business like diese takes viel more effort than doing your own business at home, und auf top of that there’s the curse des travelling, worries um making train connections, bad und irregular food, contact mit different people all the time so that du can nie get to know anyone or become friendly mit ihnen. It can alles go zum

Hell!’’ He felt a slight itch up auf his belly; pushed himself slowly up auf his back towards dem headboard so dass he konnte lift his head better; fand where das itch was, und saw that es was covered mit vielen of little weißen spots which he didn’t know what to make of; und als he versuchte to f ̈uhlen the place with one von seinen legs he drew it quickly back because as soon as he touched it he was overcome von a cold shudder.


First few paragraphs of "The Metamorphosis" with the GSM model using priority queue size of 8 and r_max = 4:

Metamorphosis

One morning, als Gregor Samsa woke from troubled dreams, he fand himself transformed in his bed into einem horrible vermin.

Er lay on seinem armour-like back, und if er lifted his head a little er could see seinen brown belly, slightly domed und divided by arches into stiff teile. das bedding was hardly fähig to cover es und seemed ready zu slide off any moment. His many legs, pitifully thin compared mit the size von dem rest von him, waved about helplessly als er looked.

‘‘What’s happened to mir?’’ er thought. His room, ein proper human room although ein little too klein, lay peacefully between seinen four familiar walls. Eine collection of textile samples lay spread out on the table - Samsa was ein travelling salesman - und above it there hung a picture that er had recently cut aus of einer illustrated magazine und housed in einem nice, gilded frame. Es showed a lady fitted out with a fur hat and fur boa who saß upright, raising a heavy fur muff der covered the whole of her lower arm towards dem viewer.

Gregor then turned zu look out the window at the dull weather. Drops von rain could sein heard hitting the pane, which machte him feel ganz sad. ‘‘How about if ich sleep ein little

bit longer and forget all diesen nonsense,'' he thought, but that war something he was unable to tun because er was used to sleeping auf his right, and in his present state couldn't get into that position. However hard he warf himself onto seine right, he always rolled zurück to wo he was. Er must haben tried it ein hundred times, shut seine eyes so dass he wouldn't haben to sehen at die floundering legs, und only stopped when he begann to feel einen mild, dull pain there that he hatte nie felt before.

‘‘Ach, God,’’ he thought, ‘‘what a strenuous career it ist that I’ve chosen! Travelling day in und day aus. Doing business like diese takes much mehr effort than doing your own business at home, und on oben of that there’s der curse of travelling, worries um making train connections, bad and irregular food, contact with different people all the time so that you kannst nie get to know anyone or become friendly with ihnen. It kann all go to Teufel!’’

He felt ein slight itch up auf seinem belly; pushed himself slowly up auf his back towards dem headboard so dass he could lift his head better; fand where das itch was, and saw that it was besetzt with lots of little weißen spots which he didn’t know what to make of; and als he tried to feel the place with one of his legs he drew it quickly back because as soon as he touched it he was overcome by a cold shudder.


5.10 Conclusion

We presented a method to generate macaronic (mixed-language) documents to aid foreign language learners with vocabulary acquisition. Our key idea is to derive a model of student learning from only a cloze language model, which uses both context and spelling features. We find that our model-based teacher generates comprehensible macaronic text that promotes vocabulary learning. We find noticeable differences between the word replacement choices by the GSM (only uses context) and sGSM (uses spelling and context) models, especially in the English-Spanish case shown in §5.9. We find more L2 replacements for words that have a high overlap with their spelling in English. For example, existencia, fortuna, matrimonio, propiedad, necesitaban, beneficio, articulación, interés, importante, constante and residencia were all replaced using the sGSM model. As further confirmation, we find exact replacements were also selected by the sGSM model, such as Dashwood, Park and general. The GSM model replaced fewer tokens with high overlap; ocasionales, importante and existencia can be seen in L2. We leave the task of extending it to phrasal translation and incorporating word reordering as future work. We also leave the exploration of alternate character-based compositions such as Kim et al. (2016) for future work. Beyond that, we envision machine teaching interfaces in which the student reader interacts with the macaronic text, advancing through the document, clicking on words for hints, and facing occasional quizzes (Renduchintala et al., 2016b), and engages with other educational stimuli.

As we began to explore in Renduchintala et al. (2016a) and Renduchintala, Koehn, and Eisner (2017), interactions provide feedback that the machine teacher could use to adjust its model of the student's lexicons (here E, F), inference (here θ^f, θ^b, θ^h, µ), and learning (here λ). In this context, we are interested in using models that are student-specific (to reflect individual learning styles), stochastic (since the student's observed behavior may be inconsistent owing to distraction or fatigue), and able to model forgetting as well as learning (Settles and Meeder, 2016).

L2   τ     Model     Closed-class         Open-class
Es   0.0   random    0.67 ±0.0037 (60)    0.60 ±0.0027 (46)
           GSM       0.67 ±0.0060 (60)    0.62 ±0.0076 (15)
           sGSM      0.71 ±0.0035 (47)    0.68 ±0.0028 (24)
     0.2   random    0.67 ±0.0037 (60)    0.60 ±0.0027 (46)
           GSM       0.67 ±0.0061 (60)    0.61 ±0.0080 (15)
           sGSM      0.71 ±0.0036 (47)    0.67 ±0.0029 (24)
     0.4   random    0.60 ±0.0051 (60)    0.50 ±0.0037 (46)
           GSM       0.60 ±0.0086 (60)    0.51 ±0.0106 (15)
           sGSM      0.66 ±0.0044 (47)    0.61 ±0.0037 (24)
     0.6   random    0.47 ±0.0058 (60)    0.40 ±0.0041 (46)
           GSM       0.48 ±0.0084 (60)    0.42 ±0.0105 (15)
           sGSM      0.52 ±0.0054 (47)    0.50 ±0.0037 (24)
     0.8   random    0.40 ±0.0053 (60)    0.30 ±0.0032 (46)
           GSM       0.41 ±0.0078 (60)    0.37 ±0.0097 (15)
           sGSM      0.46 ±0.0055 (47)    0.41 ±0.0041 (24)
     1.0   random    0.40 ±0.0053 (60)    0.29 ±0.0031 (46)
           GSM       0.40 ±0.0077 (60)    0.36 ±0.0092 (15)
           sGSM      0.45 ±0.0053 (47)    0.39 ±0.0042 (24)

Table 5.15: An expanded version of Table 5.10 (human retention experiments), reporting results with various values of τ.

Chapter 6

Knowledge Tracing in Sequential Learning of Inflected Vocabulary

Our macaronic framework facilitates learning novel vocabulary and linguistic structures while a student is progressing through a document sequentially. In doing so, the student should (hopefully) acquire new knowledge but may also forget what they have previously learned. Furthermore, new evidence, in the form of a new macaronic sentence for example, might force the student to adjust their understanding of previously seen L2 words and structures.

In other words, the previous chapters were concerned with what a student can learn when presented with a macaronic sentence. In Chapter 3 and Chapter 5 we make simplistic assumptions about what the student already knows and model what they gain from a new macaronic stimulus. In this chapter, we study knowledge tracing in the context of an inflection learning task. We view the learning process as a sequence of smaller learning events and model the interaction between new knowledge (arriving via some new evidence, perhaps a macaronic sentence or, in this chapter, a flash card) and existing knowledge, which could be corrupted by forgetting or by confusing similar vocabulary items.

Knowledge tracing attempts to reconstruct when a student acquired (or forgot) each of several facts. Yet we often hear that "learning is not just memorizing facts." Facts are not atomic objects to be discretely and independently manipulated. Rather, we suppose, a student who recalls a fact in a given setting is demonstrating a skill: solving a structured prediction problem that is akin to reconstructive memory (Schacter, 1989; Posner, 1989) or pattern completion (Hopfield, 1982; Smolensky, 1986). The attempt at structured prediction may draw on many cooperating feature weights, some of which may be shared with other facts or skills.

In this chapter, we study models for knowledge tracing in the task of foreign-language vocabulary inflection learning; we adopt a specific structured prediction model and learning algorithm. Different knowledge states correspond to model parameter settings (feature weights). Different learning styles correspond to different hyperparameters that govern the learning algorithm.1 As we interact with each student through a simple online tutoring system, we would like to track their evolving knowledge state and identify their learning style. That is, we would like to discover parameters and hyperparameters that can explain the evidence so far and predict how the student will react in future.

1 Currently, we assume that all students share the same hyperparameters (same learning style), although each student will have their own parameters, which change as they learn.

This could help us make good future choices about how to instruct this student, although we leave this reinforcement learning problem to future work. We show that we can predict the student's next answer.

In short, we expand the notion of a knowledge tracing model to include representations for a student's (i) current knowledge, (ii) retention of knowledge, and (iii) acquisition of new knowledge. Our reconstruction of the student's knowledge state remains interpretable, since it corresponds to the weights of hand-designed features (sub-skills). Interpretability may help a future teaching system provide useful feedback to students and to human teachers, and help it construct educational stimuli that are targeted at improving particular sub-skills, such as features that select correct verb suffixes.

As mentioned, we consider a verb conjugation task instead of a macaronic learning task, where a foreign language learner learns the verb conjugation paradigm by reviewing and interacting with a series of flash cards. This task is a good testbed, as it needs the learner to deploy sub-word features and to generalize to new examples. For example, a student learning Spanish verb conjugation might encounter pairs such as (tú entras, you enter) and (yo miro, I watch). Using these examples, the student needs to recognize suffix patterns and apply them to new pairs such as (yo entro, I enter). While we considered sub-word features even in our macaronic experiments, the verb inflection task is more focused on sub-word based generalizations that the student must understand in order to perform the task.

Vocabulary learning presents a challenging learning environment due to the large number of skills (words) that need to be traced. Learning vocabulary in conjunction with inflection further complicates the challenge due to the number of new sub-skills that are introduced. Huang, Guerra, and Brusilovsky (2016) suggest that modeling sub-skill interaction is crucial to several knowledge tracing domains. For our domain, a log-linear formulation elegantly allows for arbitrary sub-skills via feature functions.
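To make the log-linear idea concrete, the sketch below scores candidate English translations of an inflected foreign form using hand-designed sub-skill features (whole-item, lemma, and suffix identity). The feature templates, names, and crude lemma/suffix split are illustrative assumptions, not the model developed later in this chapter.

```python
import math
from collections import defaultdict

def features(src, candidate):
    """Sub-skill features for translating an inflected form, e.g. 'entras' -> 'you enter'."""
    lemma, suffix = src[:-2], src[-2:]            # crude split: "entras" -> "entr" + "as"
    return {f"item:{src}->{candidate}": 1.0,
            f"lemma:{lemma}->{candidate.split()[-1]}": 1.0,
            f"suffix:{suffix}->{candidate.split()[0]}": 1.0}

def predict(src, candidates, weights):
    """p(candidate | src) under a log-linear model; `weights` plays the role of
    the student's traced knowledge state (one weight per sub-skill)."""
    scores = {c: sum(weights[k] * v for k, v in features(src, c).items()) for c in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

weights = defaultdict(float)                      # all-zero knowledge state
candidates = ["you enter", "I enter", "you watch", "I watch"]
print(predict("entras", candidates, weights))     # uniform before any learning
```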

6.1 Related Work

Bayesian knowledge tracing (BKT; Corbett and Anderson, 1994) has long been the standard method to infer a student's knowledge from his or her performance on a sequence of task items. In BKT, each skill is modeled by an HMM with two hidden states ("known" or "not-known"), and the probability of success on an item depends on the state of the skill it exercises. Transition and emission probabilities are learned from the performance data using Expectation Maximization (EM). Many extensions of BKT have been investigated, including personalization (e.g., Lee and Brunskill, 2012; Khajah et al., 2014a) and modeling item difficulty (Khajah et al., 2014b).

Our approach could be called Parametric Knowledge Tracing (PKT) because we take a student's knowledge to be a vector of prediction parameters (feature weights) rather than a vector of skill bits. Although several BKT variants (Koedinger et al., 2011; Xu and Mostow, 2012; González-Brenes, Huang, and Brusilovsky, 2014) have modeled the fact that related skills share sub-skills or features, that work does not associate a real-valued weight with each feature at each time. Either skills are still represented with separate HMMs, whose transition and/or emission probabilities are parameterized in terms of shared features with time-invariant weights; or else HMMs are associated with the individual sub-skills, and the performance of a skill depends on which of its subskills are in the "known" state.

Our current version is not Bayesian since it assumes deterministic updates (but see footnote 4). A closely related line of work with deterministic updates is deep knowledge tracing (DKT) (Piech et al., 2015), which applied a classical LSTM model (Hochreiter and Schmidhuber, 1997) to knowledge tracing and showed strong improvements over BKT. Our PKT model differs from DKT in that the student's state at each time step is a more interpretable feature vector, and the state update rule is also interpretable: it is a type of error-correcting learning rule. In addition, the student's state is able to predict the student's actual response and not merely whether the response was correct. We expect that having an interpretable feature vector provides better inductive bias (see the experiment in section 6.6.1), and that it may be useful for planning future actions by smart flash card systems. Moreover, in this work we test different plausible state update rules and see how they fit actual student responses, in order to gain insight about learning.

Most recently, Settles and Meeder (2016)'s half-life regression assumes that a student's retention of a particular skill decays exponentially with time and learns a parameter that models the rate of decay ("half-life regression"). Like González-Brenes, Huang, and Brusilovsky (2014) and Settles and Meeder (2016), our model leverages a feature-rich formulation to predict the probability of a learner correctly remembering a skill, but can also capture complex spacing/retention patterns using a neural gating mechanism. Another distinction between our work and half-life regression is that we focus on knowledge tracing within a single session, while half-life regression collapses a session into a single data point and operates on many such data points over longer time spans.


[Figure 6.1 consists of eight screenshots, panels (a) through (h).]

Figure 6.1: Screen grabs of card modalities during training. These examples show cards for a native English speaker learning Spanish verb conjugation. Fig 6.1a is an EX card; Fig 6.1b shows an MC card before the student has made a selection; Figs 6.1c and 6.1d show MC cards after the student has made an incorrect or correct selection, respectively; Fig 6.1e shows an MC card that is giving the student another attempt (the system randomly decides to give the student up to three additional attempts); Fig 6.1f shows a TP card where a student is completing an answer; Fig 6.1g shows a TP card that has marked a student answer wrong and then revealed the right answer (the reveal is decided randomly); and finally Fig 6.1h shows a card that is giving a student feedback for their answer.


6.2 Verb Conjugation Task

We devised a flash card training system to teach verb conjugations in a foreign language. In this study, we only asked the student to translate from the foreign language to English, not vice versa.2

6.2.1 Task Setup

We consider a setting where students go through a series of interactive flash cards during a training session. Figure 6.1 shows the three types of cards: (i) Example (EX) cards simply display a foreign phrase and its English translation (for 7 seconds). (ii) Multiple-Choice (MC) cards show a single foreign phrase and require the student to select one of five possible English phrases shown as options. (iii) Typing (TP) cards show a foreign phrase and a text input box, requiring the student to type out what they think is the English translation. Our system can provide feedback for each student response. (i) Indicative Feedback: This refers to marking a student's answer as correct or incorrect (Fig. 6.1c, 6.1d and 6.1h). Indicative feedback is always shown for both MC and TP cards. (ii) Explicit Feedback: If the student makes an error on a TP card, the system has a 50% chance of showing them the true answer (Fig. 6.1g). (iii) Retry: If the student makes an error on an MC card, the system has a 50% chance of allowing them to try again, up to a maximum of 3 attempts.

2 We would regard these as two separate skills that share parameters to some degree, an interesting subject for future study.


Category  | Lemma: aceptar                   | Lemma: entrar                   | Lemma: mirar
Inf       | aceptar / to accept              | entrar / to enter               | mirar / to watch
SPre,1,N  | yo acepto / I accept             | yo entro / I enter              | yo miro / I watch*
SPre,2,N  | tú aceptas / you accept          | tú entras / you enter           | tú miras / you watch*
SPre,3,M  | él acepta / he accepts           | él entra / he enters            | él mira / he watches*
SPre,3,F  | ella acepta / she accepts        | ella entra / she enters         | ella mira / she watches
SF,1,N    | yo aceptaré / I will accept      | yo entraré / I will enter       | yo miraré / I will watch
SF,2,N    | tú aceptarás / you will accept*  | tú entrarás / you will enter    | tú mirarás / you will watch*
SF,3,M    | él aceptará / he will accept     | él entrará / he will enter      | él mirará / he will watch
SF,3,F    | ella aceptará / she will accept  | ella entrará / she will enter   | ella mirará / she will watch
SP,1,N    | yo acepté / I accepted*          | yo entré / I entered            | yo miré / I watched
SP,2,N    | tú aceptaste / you accepted      | tú entraste / you entered       | tú miraste / you watched
SP,3,M    | él aceptó / he accepted          | él entró / he entered           | él miró / he watched*
SP,3,F    | ella aceptó / she accepted       | ella entró / she entered        | ella miró / she watched

Table 6.1: Content used in training sequences (each cell gives the foreign phrase and its English translation for one lemma and inflection category). Phrase pairs marked with * were used for the quiz at the end of the training sequence. This Spanish content was then transformed using the method in section 6.5.1.

6.2.2 Task Content

In this particular task we used three verb lemmas, each inflected in 13 different ways (Table 6.1). The inflections included three tenses (simple past, present, and future) in each of four persons (first, second, third masculine, third feminine), as well as the infinitive form. We ensured that each surface realization was unique and regular, resulting in 39 possible phrases.3 Seven phrases from this set were randomly selected for a quiz, which is shown at the end of the training session, leaving 32 phrases that a student may see in the training session. The student's responses on the quiz do not receive any feedback from the system. We also limited the training session to 35 cards (some of which may require multiple rounds of interaction, owing to retries). All of the methods presented in this paper could be applied to larger content sets as well.

3 The inflected surface forms included explicit pronouns.


6.3 Notation

We will use the following conventions in this paper. System actions a_t, student responses y_t, and feedback items a′_t are subscripted by a time 1 ≤ t ≤ T. Other subscripts pick out elements of vectors or matrices. Ordinary lowercase letters indicate scalars (α, β, etc.), boldfaced lowercase letters indicate vectors (θ, y, w^zx), and boldfaced uppercase letters indicate matrices (Φ, W^hh, etc.). The roman-font superscripts are part of the vector or matrix name.

6.4 Student Models

6.4.1 Observable Student Behavior

A flash card is a structured object a = (x, O), where x ∈ X is the foreign phrase and O is a set of allowed responses. For an MC card, O is the set of 5 multiple-choice options on that card (or fewer on a retry attempt). For an EX or TP card, O is the set of all 39 English phrases (the TP user interface prevents the student from submitting a guess outside this set). For non-EX cards, we assume the student samples their response y ∈ O from a log-linear distribution parameterized by their knowledge state θ ∈ R^d:

\[ p(y \mid a; \theta) = p(y \mid x, O; \theta) = \frac{\exp(\theta \cdot \phi(x, y))}{\sum_{y' \in O} \exp(\theta \cdot \phi(x, y'))} \tag{6.1} \]

where ϕ(x, y) ∈ {0, 1}^d is a feature vector extracted from the (x, y) pair.
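To make this response model concrete, the following minimal sketch (ours, not the code used in the experiments) computes equation (6.1) with NumPy; the feature function phi is assumed to be supplied by the caller and to return a binary vector of length d.

```python
import numpy as np

def response_distribution(theta, x, options, phi):
    """Log-linear student response model of equation (6.1).

    theta   : knowledge state vector of shape (d,)
    x       : foreign phrase shown on the current card
    options : the set O of allowed responses for this card
    phi     : feature function mapping (x, y) to a {0,1} vector of shape (d,)
    Returns a dict mapping each y in options to p(y | x, O; theta).
    """
    scores = np.array([theta @ phi(x, y) for y in options])
    scores -= scores.max()          # subtract max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return dict(zip(options, probs))
```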

6.4.2 Feature Design

The student's knowledge state is described by the weights θ placed on the features ϕ(x, y) in equation (6.1). We assume the following binary features will suffice to describe the student's behavior.

• Phrasal features: We include a unique indicator feature for each possible (x, y) pair, yielding 39² features. For example, there exists a feature that fires iff x = yo miro ∧ y = I enter.

• Word features: We include indicator features for all (source word, target word) pairs: e.g., yo ∈ x ∧ enter ∈ y. (These words need not be aligned.)

• Morpheme features: We include indicator features for all (w, m) pairs, where w is a word of the source phrase x, and m is a possible tense, person, or number for the target phrase y (drawn from Table 6.1). For example, m might be 1st (first person) or SPre (simple present).

• Prefix and suffix features: For each word or morpheme feature that fires, 8 backoff features also fire, where the source word and (if present) the target word are replaced by their first or last i characters, for i ∈ {1, 2, 3, 4}.

These templates yield about 4600 features in all, so the knowledge state has d ≈ 4600 dimensions.
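The feature templates above can be sketched as follows; this is an illustration under our own naming conventions (the morpheme templates are omitted), not the extraction code used to build the actual 4600-dimensional feature space.

```python
def phrase_features(x, y):
    """Illustrative sketch of the binary feature templates of section 6.4.2.

    Returns the set of feature names that fire for the (x, y) pair; a binary
    vector can be obtained by indexing these names into a fixed feature list.
    """
    feats = {f"phrase:{x}|{y}"}                           # phrasal feature
    for sw in x.split():
        for tw in y.split():
            feats.add(f"word:{sw}|{tw}")                  # word-pair feature
            for i in (1, 2, 3, 4):                        # prefix/suffix backoffs
                feats.add(f"pre{i}:{sw[:i]}|{tw[:i]}")
                feats.add(f"suf{i}:{sw[-i:]}|{tw[-i:]}")
    return feats

# Example: a few of the features that fire for the pair ("yo miro", "I watch")
print(sorted(phrase_features("yo miro", "I watch"))[:5])
```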


6.4.3 Learning Models

We now turn to the question of modeling how the student's knowledge state changes during their session. θ_t denotes the state at the start of round t. We take θ_1 = 0 and assume that the student uses a deterministic update rule of the following form:4

\[ \theta_{t+1} = \beta_t \odot \theta_t + \alpha_t \odot u_t \tag{6.2} \]

where u_t is an update vector that depends on the student's experience (a_t, y_t, a′_t) at round t. In general, we can regard α_t ∈ (0, 1)^d as modeling the rates at which the learner updates the various parameters according to u_t, and β_t ∈ (0, 1)^d as modeling the rates at which those parameters are forgotten. These vectors correspond respectively to the input gates and forget gates in recurrent neural network architectures such as the LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014). As in those architectures, we will use neural networks to choose α_t, β_t at each time step t, so that they may be sensitive in nonlinear ways to the context at round t.

Why this form? First imagine that the student is learning by stochastic gradient descent on some L2-regularized loss function C · ∥θ∥² + Σ_t L_t(θ). This algorithm's update rule has the simplified form

\[ \theta_{t+1} = \beta_t \cdot \theta_t + \alpha_t \cdot u_t \tag{6.3} \]

where u_t = −∇L_t(θ) is the steepest-descent direction on example t, α_t > 0 is the learning rate at time t, and β_t = 1 − α_t C handles the weight decay due to following the gradient of the regularizer.

4 Since learning is not perfectly predictable, it would be more realistic to compute θ_t by a stochastic update, or equivalently, by a deterministic update that also depends on a random noise vector ϵ_t (which is drawn from, say, a Gaussian). These noise vectors are "nuisance parameters," but rather than integrating over their possible values, a straightforward approximation is to optimize them by gradient descent, along with the other update parameters, so as to locally maximize likelihood.

Adaptive versions of stochastic gradient descent, such as AdaGrad (Duchi, Hazan, and Singer, 2011) and AdaDelta (Zeiler, 2012), are more like our full rule (6.2) in that they allow different learning rates for different parameters.
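The update rule itself is a single elementwise expression. The sketch below (ours, for illustration) applies one step of equation (6.2) and shows how plain SGD with weight decay, equation (6.3), falls out as a special case.

```python
import numpy as np

def gated_update(theta, u, alpha, beta):
    """One knowledge-state update of equation (6.2):
    theta_{t+1} = beta_t * theta_t + alpha_t * u_t, applied elementwise."""
    return beta * theta + alpha * u

# Equation (6.3): SGD on an L2-regularized loss is the special case where
# alpha is a scalar learning rate and beta = 1 - alpha * C.
d, lr, C = 5, 0.1, 0.01
theta = np.zeros(d)
u = np.random.randn(d)               # stands in for -grad L_t(theta)
theta = gated_update(theta, u, alpha=lr, beta=1 - lr * C)
```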

6.4.3.1 Schemes for the Update Vector u_t

We assume that u_t is the gradient of some log-probability, so that the student learns by trying to increase the log-probability of the correct answer. However, the student does not always observe the correct answer y. For example, there is no output label provided when the student only receives feedback that their answer is incorrect. Even in such cases, the student can change their knowledge state.

In this section, we define schemes for deriving u_t from the experience (a_t, y_t, a′_t) at round t. Recall that a_t = (x_t, O_t). We omit the t subscripts below.

Suppose the student is told that a particular phrase y ∈ O is the correct translation of x (via an EX card or via feedback on an answer to an MC or TP card). Then an apt strategy for the student would be to use the following gradient:5

\[ \Delta^{✓} = \nabla_\theta \log p(y \mid x, O; \theta) = \phi(x, y) - \sum_{y' \in O} p(y' \mid x)\, \phi(x, y') \tag{6.4} \]

If the student is told that y is incorrect, an apt strategy is to move probability mass collectively to the other available options, increasing their total probability, since one of those options must be correct. We call this the redistribution gradient (RG):

\[ \Delta^{✗} = \nabla_\theta \log p(O - \{y\} \mid x, O; \theta) \tag{6.5} \]
\[ \phantom{\Delta^{✗}} = \sum_{y' \in O - \{y\}} p(y' \mid x, y' \neq y)\, \phi(x, y') - \sum_{y' \in O} p(y' \mid x)\, \phi(x, y') \tag{6.6} \]

where p(y′ | x, y′ ≠ y) is a renormalized distribution over just the options y′ ∈ O − {y}.

Note that if the student selects two wrong answers y_1, y_2 in a row on an MC card, the first update will subtract the average features of O and add those of O − {y_1}; the second update will subtract the average features of O − {y_1} and add those of O − {y_1, y_2}. The intermediate addition and subtraction cancel out if the same α vector is used at both rounds, so the net effect is to shift probability mass from the 5 initial options to the 3 remaining ones.6

An alternate scheme for incorrect y is to use −Δ^✓. We call this the negative gradient (NG).

5 An objection is that for an EX or TP card, the student may not actually know the exact set of options O in the denominator. We attempted setting O to be the set of English phrases the student has seen prior to the current question. Though intuitive, this setting performed worse on all the update and gating schemes.

6 Arguably, a zeroth update should be allowed as well: upon first viewing the MC card, the student should have the chance to subtract the average features of the full set of possibilities and add those of the 5 options in O, since again, the system is implying that one of those 5 options must be correct.


Update Scheme        | Correct          | Incorrect
redistribution (RG)  | u_t = Δ^✓        | u_t = Δ^✗
negative grad. (NG)  | u_t = Δ^✓        | u_t = −Δ^✓
feature vector (FG)  | u_t = ϕ(x, y)    | u_t = −ϕ(x, y)

Table 6.2: Summary of update schemes (other than RNG).

Since the RG and NG update vectors both worked well for handling incorrect y, we also tried linearly interpolating them (RNG), with u_t = γ_t ⊙ Δ^✗ + (1 − γ_t) ⊙ (−Δ^✓). The interpolation vector γ_t has elements in (0, 1), and may depend on the context (possibly different for MC and EX cards, for example).

Finally, the feature vector (FG) scheme simply adds the features ϕ(x, y) when y is correct or subtracts them when y is incorrect. This is appropriate for a student who pays attention only to y, without bothering to note that the alternative options in O are (respectively) incorrect or correct.

Recall from section 6.2.1 that the system sometimes gives both indicative and explicit feedback, telling the student that one phrase is incorrect and a different phrase is correct. We treat these as two successive updates with update vectors u_t and u_{t+1}. Notice that in the FG scheme, adding this pair of update vectors resembles a perceptron update. Table 6.2 summarizes our update schemes.
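For concreteness, the sketch below (our illustration, with an assumed feature function phi) computes the update vectors of Table 6.2 from equations (6.4)–(6.6); the RNG interpolation is omitted.

```python
import numpy as np

def expected_features(theta, x, options, phi):
    """Expected feature vector: sum over y' in options of p(y'|x, options; theta) * phi(x, y')."""
    feats = np.stack([phi(x, y) for y in options])        # shape (|O|, d)
    scores = feats @ theta
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ feats

def update_vector(theta, x, options, y, correct, scheme, phi):
    """Sketch of the RG, NG and FG update vectors of section 6.4.3.1."""
    if scheme == "FG":
        return phi(x, y) if correct else -phi(x, y)
    delta_ok = phi(x, y) - expected_features(theta, x, options, phi)   # eq (6.4)
    if correct:
        return delta_ok
    if scheme == "NG":
        return -delta_ok
    # RG: shift mass to the remaining options, equations (6.5)-(6.6)
    rest = [y2 for y2 in options if y2 != y]
    return expected_features(theta, x, rest, phi) - expected_features(theta, x, options, phi)
```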


6.4.3.2 Schemes for the Gates α_t, β_t, γ_t

We characterize each update t by a 7-dimensional context vector c_t, which summarizes what the student has experienced. The first three elements in c_t are binary indicators of the type of flash card (EX, MC or TP). The next three elements are binary indicators of the type of information that caused the update: correct student answer, incorrect student answer, or revealed answer (via an EX card or explicit feedback). As a reminder, the system can respond with an indication that the answer is correct or incorrect, or it can reveal the answer. Finally, the last element of c_t is 1/|O|, the chance probability of success on this card. From c_t, we define

\[ \alpha_t = \sigma(W^{\alpha} c_t + b^{\alpha} \mathbf{1}) \in (0, 1)^d \tag{6.7} \]
\[ \beta_t = \sigma(W^{\beta} c_{t-1} + b^{\beta} \mathbf{1}) \in (0, 1)^d \tag{6.8} \]
\[ \gamma_t = \sigma(W^{\gamma} c_t + b^{\gamma} \mathbf{1}) \in (0, 1)^d \tag{6.9} \]

where c_0 = 0. Each gate vector is now parameterized by a weight matrix W ∈ R^{d×7}, where d is the dimensionality of the gradient and knowledge state.
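A minimal sketch of the context-model gates, under the reading of equations (6.7)–(6.9) in which each b is a scalar bias broadcast across the d dimensions; the function and variable names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_gates(c_t, c_prev, W_a, b_a, W_b, b_b, W_g, b_g):
    """Context-model (CM) gates of equations (6.7)-(6.9).

    c_t, c_prev : 7-dimensional context vectors for rounds t and t-1
    W_*         : weight matrices of shape (d, 7)
    b_*         : scalar biases broadcast over the d dimensions
    Returns (alpha_t, beta_t, gamma_t), each with entries in (0, 1).
    """
    alpha = sigmoid(W_a @ c_t + b_a)       # acquisition gate
    beta  = sigmoid(W_b @ c_prev + b_b)    # retention (forget) gate
    gamma = sigmoid(W_g @ c_t + b_g)       # RG/NG interpolation gate
    return alpha, beta, gamma
```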

We also tried simpler versions of this model. In the vector model (VM), we define α_t = σ(b^α), and β_t, γ_t similarly. These vectors do not vary with time and simply reflect that some parameters are more labile than others. Finally, the scalar model (SM) defines α_t = σ(b^α 1), so that all parameters are equally labile. One could also imagine tying the gates for features derived from the same template, meaning that some kinds of features (in some contexts) are more labile than others, or reducing the number of parameters by learning low-rank W matrices.

While we also tried augmenting the context vector c_t with the knowledge state θ_t, this resulted in far too many parameters to train well, and did not help performance in pilot tests.

6.4.4 Parameter Estimation

We tune the W and b parameters of the model by maximum likelihood, so as to better predict the students' responses y_t. The likelihood function is

\[ p(y_1, \ldots, y_T \mid a_1, \ldots, a_T) = \prod_{t=1}^{T} p(y_t \mid a_{1:t}, y_{1:t-1}, a'_{1:t-1}) = \prod_{t=1}^{T} p(y_t \mid a_t; \theta_t) \tag{6.10} \]

where we take p(y_t | · · ·) = 1 at steps where the student makes no response (EX cards and explicit feedback). Note that the model assumes that θ_t is a sufficient statistic of the student's past experiences.

For each (update scheme, gating scheme) combination, we trained the parameters using SGD with RMSProp updates (Tieleman and Hinton, 2012) to maximize the regularized log-likelihood

\[ \Big( \sum_{t,\, \tau_t = 0} \log p(y_t \mid x_t; \theta_t) \Big) - C \cdot \| W \|^2 \tag{6.11} \]

summed over all students. Note that θ_t depends on the parameters through the gated update rule (6.2). The development set was used for early stopping and to tune the regularization parameter C.7

7 We searched C ∈ {0.00025, 0.0005, 0.001, ..., 0.01, 0.025, 0.05, 0.1} for each gating model and update scheme combination. C = 0.0025 gave best results for the CM models, 0.01 for VM and 0.0005 for SM.
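Putting the pieces together, the likelihood of equation (6.10) can be rolled out by alternating the response model and the gated update. The sketch below reuses the response_distribution sketch given after equation (6.1) and makes simplifying assumptions about how a session is represented (the event fields are our own invention).

```python
import numpy as np

def session_log_likelihood(events, theta0, phi, gates, update_fn):
    """Sketch of equation (6.10): accumulate log p(y_t | a_t; theta_t) over a session.

    events : list of dicts with keys x (foreign phrase), options (the set O),
             y (the phrase the update concerns: the response or the revealed answer),
             scored (True iff the student actually responded on this step),
             correct (whether y was correct), and ctx (7-dim context vector).
    gates  : function(ctx, prev_ctx) -> (alpha, beta) gate vectors
    update_fn : function producing u_t as in section 6.4.3.1
    """
    theta, prev_ctx, loglik = theta0.copy(), np.zeros(7), 0.0
    for e in events:
        if e["scored"]:                                   # non-response steps contribute p = 1
            probs = response_distribution(theta, e["x"], e["options"], phi)
            loglik += np.log(probs[e["y"]])
        alpha, beta = gates(e["ctx"], prev_ctx)
        u = update_fn(theta, e["x"], e["options"], e["y"], e["correct"])
        theta = beta * theta + alpha * u                  # equation (6.2)
        prev_ctx = e["ctx"]
    return loglik
```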


6.5 Data Collection

We recruited 153 unique "students" via Amazon Mechanical Turk (MTurk). MTurk participants were compensated $1 for completing the training and test sessions, and a bonus of $10 was given to the three top-scoring students. In our dataset, we retained only the 121 students who answered all questions.

6.5.1 Language Obfuscation

Fig. 6.1 shows a few example flash cards for a native English speaker learning Spanish. Table 6.1 shows all our Spanish-English phrase pairs. In our actual task, however, we invented an artificial language for the MTurk students to learn, which allowed us to ignore the problem of students with different initial knowledge levels. We generated our artificial language by enciphering the Spanish orthographic representations. We created a mapping from the true source string alphabet to an alternative, manually defined alphabet, while attempting to preserve pronounceability (by mapping vowels to vowels, etc.). For example, mirar was transformed into melil and tú aceptas became pi icedpiz.
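As an illustration of this kind of enciphering, the toy sketch below applies a fixed character substitution that maps vowels to vowels and consonants to consonants. The particular mapping is invented for this example and is not the mapping used to build the actual artificial language.

```python
# Toy pronounceability-preserving substitution cipher in the spirit of
# section 6.5.1. The mapping is invented for illustration only; it is NOT
# the alphabet mapping used in the actual experiments.
CIPHER = str.maketrans(
    "aeiou" + "bcdfghjklmnpqrstvwxyz",
    "ieoua" + "pcdfghjklmntqrsbvwxyz",   # vowels -> vowels, consonants -> consonants
)

def obfuscate(spanish_phrase: str) -> str:
    """Encipher a lowercase phrase character by character.
    Accented characters are passed through unchanged in this toy version."""
    return spanish_phrase.translate(CIPHER)

print(obfuscate("mirar"))   # the real system mapped "mirar" to "melil"; this toy mapping differs
```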


6.5.2 Card Ordering Policy

In the future, we expect to use planning or reinforcement learning to choose the sequence of stimuli for the student. For the present study of student behavior, however, we hand-designed a simple stochastic policy for choosing the stimuli.

The policy must decide what foreign phrase and card modality to use at each training step. Our policy likes to repeat phrases with which participants had trouble, in hopes that these already-taught phrases are on the verge of being learned. It also likes to pick out new phrases. This was inspired by the popular Leitner (1972) approach, which devised a system of buckets that control how frequently an item is reviewed by a student. Leitner proposed buckets with review frequency rates of every day, every 2 days, every 4 days and so on.

For each foreign phrase x ∈ X, we maintain a novelty score v_x, which is a function of the number of times the phrase is exposed to a student, and an error score e_x, which is a function of the number of times the student incorrectly responded to the phrase. These scores are initialized to 1 and updated as follows:8

\[ v_x \leftarrow v_x - 1 \quad \text{when } x \text{ is viewed} \]
\[ e_x \leftarrow \begin{cases} 2\, e_x & \text{when the student gets } x \text{ wrong} \\ 0.5\, e_x & \text{when the student gets } x \text{ right} \end{cases} \]
\[ x \sim \frac{g(v) + g(e)}{2} \tag{6.12} \]

8 Arguably we should have updated e_x instead by adding/subtracting 1, since it will be exponentiated later.


On each round, we sample a phrase x from either P_v or P_e (with equal probability); these distributions are computed by applying a softmax g(·) over the vectors v and e respectively (see Eq. 6.12). Once the phrase x is decided, the modality (EX, MC, TP) is chosen stochastically using probabilities (0.2, 0.4, 0.4), except that probabilities (1, 0, 0) are used for the first example of the session, and (0.4, 0.6, 0) if x is not "TP-qualified." A phrase is TP-qualified if the student has seen both x's pronoun and x's verb lemma on previous cards (even if their correct translation was not revealed). For an MC card, the distractor phrases are sampled uniformly without replacement from the 38 other phrases.
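The card-ordering policy can be sketched as follows; the function and variable names are ours, and the maintenance of the v_x and e_x scores (decrementing on views, doubling or halving on errors and successes) is assumed to happen elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pick_next_card(phrases, novelty, error, tp_qualified, first_card=False):
    """Sketch of the stochastic card-ordering policy of section 6.5.2.

    novelty, error : dicts holding the v_x and e_x scores for each phrase
    tp_qualified   : set of phrases currently eligible for TP cards
    """
    dist = softmax([novelty[x] for x in phrases]) if rng.random() < 0.5 \
           else softmax([error[x] for x in phrases])          # sample from P_v or P_e
    x = rng.choice(phrases, p=dist)
    if first_card:
        modality_probs = [1.0, 0.0, 0.0]
    elif x in tp_qualified:
        modality_probs = [0.2, 0.4, 0.4]
    else:
        modality_probs = [0.4, 0.6, 0.0]
    modality = rng.choice(["EX", "MC", "TP"], p=modality_probs)
    return x, modality
```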

6.6 Results & Experiments

We partitioned the students into three groups: 80 students for training, 20 for development, and 21 for testing. Most students found the task difficult; the average score on the 7-question quiz was 2.81 correct, with a maximum score of 6. (Recall from section 6.2.2 that the quiz questions were typing questions, not multiple choice questions.) The histogram of user performance is shown in Fig. 6.2.

After constructing each model, we evaluated it on the held-out data: the 728 responses from the 21 testing students. We measure the log-probability under the model of each actual response ("cross-entropy"), and also the fraction of responses that were correctly predicted if our prediction was the model's max-probability response ("accuracy").

Table 6.3 shows the results of our experiment. All of our models were predictive, doing far better than a uniform baseline that assigned equal probability 1/|O| to all options. Our best models are shown in the final two lines, RNG+VM and RNG+CM.

[Figure 6.2 is a histogram of quiz scores (count of users versus quiz performance, 0 to 7).]

Figure 6.2: Quiz performance distribution (after removing users who scored 0).

Which update scheme was best? Interestingly, although the RG update vector is principled from a machine learning viewpoint, the NG update vector sometimes achieved better accuracy (though worse perplexity) when predicting the responses of human learners.9 We got our best results on both metrics by interpolating between RG and NG (the RNG scheme). Recall that the NG scheme was motivated by the notion that students who guessed wrong may not study the alternative answers (even though one is correct), either because it is too much trouble to study them or because (for a TP card) those alternatives are not actually shown.

Which gating mechanism was best? In almost all cases, we found that more parameters helped, with CM > VM > SM on accuracy, and a similar pattern on cross-entropy (with VM sometimes winning, but only slightly). In short, it helps to use different learning rates for different features, and it probably helps to make them sensitive to the learning context.

9 Even the FG vector sometimes won (on both metrics!), but this happened only with the worst gating mechanism, SM.

[Figure 6.3 is a grouped bar chart of test-set prediction accuracy for each model (SM, VM, CM gates; RG, NG, FG updates) under the conditions MC, MC-C, MC-IC, TP, TP-C, TP-IC.]

Figure 6.3: Plot comparing the models on test data under different conditions. Conditions MC and TP indicate multiple-choice and typing questions respectively. These are broken down into the cases where the student answers them correctly (C) and incorrectly (IC). SM, VM, and CM represent scalar, vector, and context retention and acquisition gates (shown with different colors), respectively, while RG, NG and FG are the redistribution, negative and feature vector update schemes (shown with different hatching patterns).

Surprisingly, the simple FG scheme outperformed both RG and NG when used in conjunction with a scalar retention and acquisition gate. This, however, did not extend to the more complex gates.


Update Scheme       | Gating Mechanism | accuracy | cross-ent.
(Uniform baseline)  |                  | 0.133    | 2.459
FG                  | SM               | 0.239*   | 2.362
FG                  | VM               | 0.357†   | 2.130
FG                  | CM               | 0.401    | 2.025
RG                  | SM               | 0.135    | 3.194
RG                  | VM               | 0.397†   | 1.909
RG                  | CM               | 0.405    | 1.938
NG                  | SM               | 0.185*   | 4.674
NG                  | VM               | 0.394†   | 2.320
NG                  | CM               | 0.449†*  | 2.244
--------------------|------------------|----------|----------
RNG (mixed)         | SM               | 0.183    | 3.502
RNG (mixed)         | VM               | 0.427    | 1.855
RNG (mixed)         | CM               | 0.449    | 1.888

Table 6.3: Table summarizing prediction accuracy and cross-entropy (in nats per prediction) for different models. Larger accuracies and smaller cross-entropies are better. Within an update scheme, the † indicates significant improvement (McNemar's test, p < 0.05) over the next-best gating mechanism. Within a gating mechanism, the * indicates significant improvement over the next-best update scheme. For example, NG+CM is significantly better than NG+VM, so it receives a †; it is also significantly better than RG+CM, and receives a * as well. These comparisons are conducted only among the pure update schemes (above the double line). All other models are significantly better than RG+SM (p < 0.01).

Fig. 6.3 shows a breakdown of the prediction accuracy measures according to whether the card was MC or TP, and according to whether the student's answer was correct (C) or incorrect (IC). Unsurprisingly, all the models have an easier time predicting the student's guess when the student is correct, since the predicted parameters θ_t will often pick the correct answer. However, this is where the vector and context gates far outperform the scalar gates. All the models find predicting the incorrect answers of the students difficult. Moreover, when predicting these incorrect answers, the RG models do slightly better than the NG models.

The models obviously have higher accuracy when predicting student answers for MC cards than for TP cards, as MC cards have fewer options. Again, within both of these modalities, the vector and context gates outperform the scalar gate.


[Figure 6.4 contains two panels plotting surprisal reduction (bits) against training steps: (a) a student with quiz score 6/7; (b) a student with quiz score 2/7.]

Figure 6.4: Predicting a specific student's responses. For each response, the plot shows our model's improvement in log-probability over the uniform baseline model. TP cards are the square markers connected by solid lines (the final 7 squares are the quiz), while MC cards, which have a much higher baseline, are the circle markers connected by dashed lines. Hollow and solid markers indicate correct and incorrect answers respectively. The RNG+CM model is shown in blue and the FG+SM model in red.

Finally, Fig. 6.4 examines how these models behave when making specific predictions over a training sequence for a single student. At each step we plot the difference in log-probability between our model and a uniform baseline model. Thus, a marker above 0 means that our model assigned the student's answer a probability higher than chance.10 To contrast the performance difference, we show both the highest-accuracy model (RNG+CM) and the lowest-accuracy model (RG+SM). For a high-scoring student (Fig. 6.4a), we see RNG+CM has a large margin over RG+SM and a slight upward trend. A higher probability than chance is noticeable even when the student makes mistakes (indicated by hollow markers).

In contrast, for an average student (Fig. 6.4b), the margin between the two models is less perceptible. While the CM+NG model is still above the SM+RG line, there are some answers where CM+NG does very poorly. This is especially true for some of the wrong answers, for example at training steps 25, 29 and 33. Upon closer inspection into the model's error in step 33, we found the prompt received at this training step was ekki melü as an MC card, which had been shown to the student on three prior occasions, and the student had even answered correctly on one of these occasions. This explains why the model was surprised to see the student make this error.

10 For MC cards, the chance probability is in {1/5, 1/4, 1/3}, depending on how many options remain, while for TP cards it is 1/39.

6.6.1 Comparison with Less Restrictive Model

Our parametric knowledge tracing architecture models the student as a typical structured prediction system, which maintains weights for hand-designed features and updates them roughly as an online learning algorithm would. A natural question is whether this restricted architecture sacrifices performance for interpretability, or improves performance via useful inductive bias.

To consider the other end of the spectrum, we implemented a flexible LSTM model in the style of recent deep learning research. This alternative model predicts each response by a student (i.e., on an MC or TP card) given the entire history of previous interactions with that student, as summarized by an LSTM. The LSTM architecture is formally capable of capturing update rules exactly like those of PKT, but it is far from limited to such rules.

Much like equation (6.1), at each time t we predict

\[ p(y_t = y \mid a_t) = \frac{\exp(h_t \cdot \psi(y))}{\sum_{y' \in O_t} \exp(h_t \cdot \psi(y'))} \tag{6.13} \]

for each possible response y in the set of options O_t, where ψ(y) ∈ R^d is a learned embedding of response y. Here h_t ∈ R^d denotes the hidden state of the LSTM, which evolves as the student interacts with the system and learns. h_t depends on the LSTM inputs for all times < t, just like the knowledge state θ_t in equations (6.1)–(6.2). It also depends on the LSTM input for time t, since that specifies the flash card a_t to which we are predicting the response y_t.

Each flash card a = (x, O) is encoded by a concatenation a of three vectors: a one-hot 39-dimensional vector specifying the foreign phrase x, a 39-dimensional binary vector indicating the possible English options in O, and a one-hot vector indicating whether the card is EX, MC, or TP.

When reading the history of past interactions, the LSTM input at each time step t concatenates the vector representation a_t of the current flash card with vectors a_{t−1}, y_{t−1}, f_{t−1} that describe the student's experience in round t − 1: these respectively encode the previous flash card, the student's response to it (a one-hot 39-dimensional vector), and the resulting feedback (a 39-dimensional binary vector that indicates the remaining options after feedback). Thus, if the student receives no feedback, then f_{t−1} = O_{t−1}. Indicative feedback sets f_{t−1} = y_{t−1} or f_{t−1} = O_{t−1} − y_{t−1}, according to whether the student was correct or incorrect. Explicit feedback (including for an EX card) sets f_{t−1} to a one-hot representation of the correct answer.

Model   | Parameters | Accuracy (test) | Cross-Entropy
RNG+CM  | ≈97K       | 0.449           | 1.888
LSTM    | ≈25K       | 0.429           | 1.992

Table 6.4: Comparison of our best-performing PKT model (RNG+CM) to our LSTM model. On our dataset, PKT outperforms the LSTM both in terms of accuracy and cross-entropy.

Thus, f_{t−1} gives the set of "positive" options that we used in the RG update vector, while O_{t−1} gives the set of "negative" options, allowing the LSTM to similarly update its hidden state from h_{t−1} to h_t to reflect learning.11
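The input encoding described above can be sketched as follows; the helper names and the exact ordering of the concatenated blocks are our own illustrative choices.

```python
import numpy as np

N_PHRASES = 39
CARD_TYPES = ["EX", "MC", "TP"]

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def encode_card(x_id, option_ids, card_type):
    """Encode a flash card a = (x, O): [one-hot phrase | options indicator | card type]."""
    opts = np.zeros(N_PHRASES)
    opts[list(option_ids)] = 1.0
    return np.concatenate([one_hot(x_id, N_PHRASES), opts,
                           one_hot(CARD_TYPES.index(card_type), len(CARD_TYPES))])

def encode_step(card_vec, prev_card_vec, prev_response_id, prev_feedback_ids):
    """One LSTM input: the current card plus the previous round's experience
    (previous card, previous response, feedback = remaining options)."""
    resp = (one_hot(prev_response_id, N_PHRASES)
            if prev_response_id is not None else np.zeros(N_PHRASES))
    fb = np.zeros(N_PHRASES)
    fb[list(prev_feedback_ids)] = 1.0
    return np.concatenate([card_vec, prev_card_vec, resp, fb])
```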

As in section 6.4.4, we train the parameters by L2-regularized maximum likelihood, with early stopping on development data. The weights for the LSTM were initialized uniformly at random ∼ U(−δ, +δ), where δ = 0.01, and RMSProp was used for gradient descent. We settled on a regularization coefficient of 0.002 after a line search. The number of hidden units d was also tuned using line search. Interestingly, a dimensionality of just d = 10 performed best on dev data:12 at this size, the LSTM has fewer parameters than our best model.

The result is shown in Table 6.4. These results favor our restricted PKT architecture. We acknowledge that the LSTM might perform better when a larger training set was available (which would allow a larger hidden layer), or using a different form of regularization (Srivastava et al., 2014).

11 This architecture is formally able to mimic PKT. We would store θ in the LSTM's vector of cell activations, and configure the LSTM's "input" and "forget" gates to update this according to (6.2), where u_t is computed from the input. Observe that each feature in section 6.4.2 has the form ϕ_{ij}(x, y) = ξ_i(x) · ψ_j(y). Consider the hidden unit in h corresponding to this feature, with activation θ_{ij}. By configuring this unit's "output" gate to be ξ_i(x) (where x is the current foreign phrase given in the input), we would arrange for this hidden unit to have output ξ_i(x) · θ_{ij}, which will be multiplied by ψ_j(y) in (6.13) to recover θ_{ij} · ϕ_{ij}(x, y) just as in (6.1). (More precisely, the output would be sigmoid(ξ_i(x) · θ_{ij}), but we can evade this nonlinearity if we take the cell activations to be a scaled-down version of θ and scale up the embeddings ψ(y) to compensate.)

12 We searched {0.001, 0.002, 0.005, 0.01, 0.02, 0.05} for the regularization coefficient, and {5, 10, 15, 20, 50, 100, 200} for the number of hidden units.

Intermediate or hybrid models would of course also be possible. For example, we could predict p(y | a_t) via (6.1), defining θ_t as h_t^⊤ M, a learned linear function of h_t. This variant would again have access to our hand-designed features ϕ(x, y), so that it would know which flash cards were similar. In fact θ_t · ϕ(x, y) in (6.1) equals h_t · (M ϕ(x, y)), so M can be regarded as projecting ϕ(x, y) down to the LSTM's hidden dimension d, learning how to weight and use these features. In this variant, the LSTM would no longer need to take a_t as part of its input at time t: rather, h_t (just like θ_t in PKT) would be a pure representation of the student's knowledge state at time t, capable of predicting y_t for any a_t. This setup more closely resembles PKT, or the DKT LSTM of Piech et al. (2015). Unlike the DKT paper, however, it would still predict the student's specific response, not merely whether they were right or wrong.

6.7 Conclusion

We have presented a cognitively plausible model that traces a human student's knowledge as he or she interacts with a simple online tutoring system. The student must learn to translate very short inflected phrases from an unfamiliar language into English. Our model assumes that when a student recalls or guesses the translation, he or she is attempting to solve a structured prediction problem of choosing the best translation, based on salient features of the input-output pair. Specifically, we characterize the student's knowledge as a vector of feature weights, which is updated as the student interacts with the system. While the phrasal features memorize the translations of entire input phrases, the other features can pick up on the translations of individual words and sub-words, which are reusable across phrases.

We collected and modeled human-subjects data. We experimented with models using several different update mechanisms, focusing on the student's treatment of negative feedback and the degree to which the student tends to update or forget specific weights in particular contexts. We also found that, in comparison to a less constrained LSTM model, we can better fit the human behavior by using weight update schemes that are broadly consistent with schemes used in machine learning.

In the future, we plan to experiment with more variants of the model, including variants that allow noise and personalization. Most important, we mean to use the model for planning which flash cards, feedback, or other stimuli to show next to a given student.

Chapter 7

Conclusion & Future Direction

This thesis introduces the problem of generating macaronic language texts as a foreign language learning paradigm. Adult foreign language learning is a challenging task that requires dedicated time and effort in following a curriculum. We believe the macaronic framework introduced in this thesis allows a student to engage in language learning while simply reading any document. We hope that such instruction will be a valuable addition to the traditional foreign language learning process.

We have made progress towards identifying appropriate data structures to represent all possible macaronic configurations for a given sentence, devised a method to model the readability and guessability of foreign language words and phrases in macaronic configurations, and shown how a simple search heuristic can find pedagogically useful macaronic configurations. We have also presented an interaction mechanism for macaronic documents, hopefully leading to improved student engagement while gaining the ability to update the student model based on feedback. Finally, we also studied sequential modeling of a student's knowledge as they navigate through a restricted foreign language inflection learning activity.

There are several possible research directions moving forward. We are most interested in improving methods that follow the generic student model based approach, as it allows us to create macaronic documents from a wide variety of domains without data collection involving human students. We identify the following limitations currently in the generic student model based approach:

Capturing uncertainty of L2 word embeddings: Word embeddings are points in a subspace. Assigning each L2 word a single point in that subspace ignores the uncertainty associated with that word's meaning. This issue might not be very crucial when learning L1 embeddings, as we can assume (at least for frequent words) that we can learn their embeddings from different instances in the training data. However, our incremental L2 learning approach assigns/learns an embedding from (initially) just one exposure. Even subsequent exposures are not batched. Thus, it should be useful to maintain a range (or distribution) of reasonable embeddings for a new L2 word after each exposure, instead of a single point in the embedding space.

A possible approach could be to represent each L2 embedding by a multidimensional Gaussian with a mean vector µ ∈ R^D and variance σ² ∈ R^D. Similar ideas have been shown to help word embedding learning (Vilnis and McCallum, 2014; Athiwaratkun and Wilson, 2017) using "word2vec"-style objectives (CBOW or skip-gram) (Mikolov et al., 2013). We could also employ the recent reparameterization method (Kingma and Welling, 2013).
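A minimal sketch of what such a Gaussian embedding might look like with the reparameterization trick; this is an illustration of the proposed future direction, not a component implemented in this thesis, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianEmbedding:
    """An L2 word embedding represented as a diagonal Gaussian, sampled via
    the reparameterization trick (z = mu + sigma * eps, eps ~ N(0, I))."""

    def __init__(self, dim):
        self.mu = np.zeros(dim)            # mean vector in R^D
        self.log_var = np.zeros(dim)       # log of the per-dimension variance

    def sample(self):
        eps = rng.standard_normal(self.mu.shape)
        return self.mu + np.exp(0.5 * self.log_var) * eps

emb = GaussianEmbedding(dim=300)
z = emb.sample()                           # one plausible embedding for the L2 word
```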


Search Heuristics and Planning: We currently employ a simple "left-to-right" best-first search heuristic to search for the "best" macaronic configuration for a given sentence. We can explore several alternative search heuristics. One simple alternative is to replace the search ordering from "left-to-right" to "low-frequency to high-frequency". That is, we will try to replace low-frequency words in the sentence with their L2 translations before trying high-frequency words. This heuristic should provide more opportunities to replace low-frequency content words at the expense of high-frequency stop words; however, since high-frequency words are more likely to show up in the rest of the document, there will be other opportunities for the model to replace these with L2 translations. Pilot experiments with low-frequency-to-high-frequency ordering (in conjunction with best-first search) outperform the left-to-right heuristic in terms of the cosine similarity score as defined in §5.5.2.

Our current scheme does not consider the entire document when searching for the best macaronic configuration for a sentence. If the machine teacher knows, for example, that a certain L2 vocabulary item is more guessable in some future part of the document, then it could use the current sentence to teach a different L2 vocabulary item to the student. Thus, looking into the future of the document is a possible future direction of research. Our pilot experiments, using Monte Carlo tree search to find the best macaronic configuration, also suggest the same. However, with longer look-ahead horizons search takes more time to complete, which might hinder "online" search.

Contextual Representations from BERT and KL-Net: The cloze language model used in our generic student model is closely related to sentence representation models such as BERT and ELMo (Devlin et al., 2019; Peters et al., 2018). We could use these pretrained models instead of our cloze language model; however, we would be restricted by the L1 vocabulary used by these large masked language models.

Modified softmax layer in cloze model: Our incremental approach to learning the embeddings of novel L2 words is scored using cosine similarity (§5.5.1). However, the initial cloze model (§§5.2.1 and 5.2.2) does not take this particular scoring into account. When training the cloze model on L1 data and during incremental L2 word learning, the norms of the embeddings are not constrained, leading to common words having larger norms. This creates a mismatch between how we learn L1 embeddings and L2 embeddings (incrementally) and how we score them. While it is unclear if this dramatically changes the resulting macaronic configurations, a possible solution could be to use cosine similarity to obtain logits during L1 learning and during incremental L2 learning. This would encourage the initial cloze model to restrict the norms of word embeddings to be close to one.
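A sketch of the kind of cosine-similarity output layer suggested here; the temperature factor is our own addition to keep the logits in a useful range, and the function is illustrative rather than the thesis implementation.

```python
import numpy as np

def cosine_logits(hidden, embeddings, temperature=0.1):
    """Cosine-similarity logits for a cloze model's softmax layer.

    hidden     : context vector produced by the cloze model, shape (D,)
    embeddings : output word embeddings, shape (V, D)
    Returns one logit per vocabulary item; a softmax over them gives
    p(word | context) while discouraging large embedding norms.
    """
    h = hidden / (np.linalg.norm(hidden) + 1e-8)
    E = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    return (E @ h) / temperature
```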

Phrasal Translations: Finally, the one-to-one lexical translation setup is a limitation, as it only affords teaching single L2 words and not phrases. Additionally, it does not expose the student to word order differences between the L1 and L2. There are two main challenges when moving to non-lexical teaching. To address this limitation, we would first have to consider different scoring functions to guide the macaronic configuration search. Currently, the scoring function (§5.5.1) is straightforward for the lexical translation case but not for scoring L1-L2 phrase pairs, especially when also considering word-order differences between an L1 and L2 phrase. Consequently, we need to address how we represent L2 "knowledge" in our generic student model.

Markup (de)      | He gave a talk about how education and school kills creativity.
Prediction       | He gave a talk about how education und schulen kreativität tötet.
Markup (de)      | It was somebody who was trying to ask a question about Javascript.
Prediction       | Es war jemand, der versuchte, to ask a question about Javascript.
Markup (de, de)  | We were standing on the edge of thousands of acres of cotton.
Prediction       | Wir standen am rande of thousands of acres of baumwolle.
Markup (de)      | And we're building upon innovations of generations who went before us.
Prediction       | And we're building upon innovations of generationen, die vor uns gingen.

Table 7.1: Examples of inputs and predicted outputs by our experimental NMT model trained to generate macaronic language sentences using annotations on the input sequence (the de tags in the markup rows indicate the spans marked for translation). We see that the macaronic language translations are able to correctly order German portions of the sentences, especially at the sentence ending. The source features have also been learned by the NMT model and the translations are faithful to the markup. The case, tokenization and italics were added in post.

Merely using L2 word embeddings would be insufficient. One possible research endeavor is to not only learn new L2 word embeddings but also learn L2 recurrent parameters in an incremental fashion. That is, we can learn an entirely new L2 cloze language model. To score this cloze model we could use held-out fully-L2 sentences, perhaps from the remainder of the current document. Apart from redesigning the generic student model and incremental learning paradigm to enable phrasal L2 learning, we also have to alter how we generate the macaronic data structure that can support phrasal macaronic configurations. Creating the back-end macaronic data structure using statistical machine translation (§1.4) may result in translations that are not accurate. Neural machine translation may provide better partial translations, which could be used to generate the required back-end data structure. We find that we are able to generate fluent macaronic translations by tagging tokens in the input sequence (which is fully in L1) with either a Translate or No-Translate tag. Table 7.1 shows examples of generated En-De (L1-L2) macaronic sentences.

Longitudinal User Modelling: In this thesis, experiments involving human students were conducted on relatively short time-frames. Modelling long-term learning and forgetting patterns in a macaronic learning setup would lead to better configurations, as the machine teacher can account for a student's forgetting patterns. Such experiments, however, would exhibit high variation and would require a larger number of participants. Generally, longer studies also exhibit poor participant retention.

Bibliography

Ahn, Luis von (2013). "Duolingo: Learn a Language for Free While Helping to Translate the Web". In: Proceedings of the 2013 International Conference on Intelligent User Interfaces, pp. 1–2.

Alishahi, Afra, Afsaneh Fazly, and Suzanne Stevenson (2008). "Fast mapping in word learning: What probabilities tell us". In: Proceedings of the Twelfth Conference on Computational Natural Language Learning. Association for Computational Linguistics, pp. 57–64.

Athiwaratkun, Ben and Andrew Wilson (2017). "Multimodal Word Distributions". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1645–1656.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov (2017). "Enriching Word Vectors with Subword Information". In: Transactions of the Association for Computational Linguistics 5, pp. 135–146. URL: https://www.aclweb.org/anthology/Q17-1010.

Bojar, Ondřej, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi (2015). "Findings of the 2015 Workshop on Statistical Machine Translation". In: Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 1–46. URL: http://aclweb.org/anthology/W15-3001.

Burstein, Jill, Joel Tetreault, and Nitin Madnani (2013). "The E-Rater Automated Essay Scoring System". In: Handbook of Automated Essay Evaluation: Current Applications and New Directions. Ed. by Mark D. Shermis. Routledge, pp. 55–67.

Carey, Susan and Elsa Bartlett (1978). "Acquiring a single new word." In:

Chen, Tao, Naijia Zheng, Yue Zhao, Muthu Kumar Chandrasekaran, and Min-Yen Kan (July 2015). "Interactive Second Language Learning from News Websites". In: Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications. Beijing, China: Association for Computational Linguistics, pp. 34–42. URL: https://www.aclweb.org/anthology/W15-4406.

Cho, Kyunghyun, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio (Oct. 2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, pp. 1724–1734. URL: http://www.aclweb.org/anthology/D14-1179.

Church, Kenneth Ward and Patrick Hanks (Mar. 1990). "Word Association Norms, Mutual Information, and Lexicography". In: Computational Linguistics 16.1, pp. 22–29. URL: http://dl.acm.org/citation.cfm?id=89086.89095.

Clarke, Linda K. (Oct. 1988). "Invented versus Traditional Spelling in First Graders' Writings: Effects on Learning to Spell and Read". In: Research in the Teaching of English, pp. 281–309. URL: http://www.jstor.org.proxy1.library.jhu.edu/stable/40171140.

Constantino, Rebecca, Sy-Ying Lee, Kyung-Sook Cho, and Stephen Krashen (1997). "Free Voluntary Reading as a Predictor of TOEFL Scores." In: Applied Language Learning 8.1, pp. 111–18.

Corbett, Albert T and John R Anderson (1994). "Knowledge tracing: Modeling the acquisition of procedural knowledge". In: User Modeling and User-Adapted Interaction 4.4, pp. 253–278.

Daumé III, Hal (Aug. 2004). "Notes on CG and LM-BFGS Optimization of Logistic Regression". URL: http://hal3.name/megam/.

Daumé III, Hal (2007). "Frustratingly Easy Domain Adaptation". In: Proceedings of ACL. Prague, Czech Republic.

Daumé III, Hal (June 2007). "Frustratingly Easy Domain Adaptation". In: Proceedings of ACL, pp. 256–263. URL: http://www.aclweb.org/anthology/P07-1033.

Deutschlandfunk (2016). nachrichtenleicht. http://www.nachrichtenleicht.de/. Accessed: 2015-09-30. URL: www.nachrichtenleicht.de.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018). "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (June 2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423.

Dorr, Bonnie J. (Dec. 1994). "Machine Translation Divergences: A Formal Description and Proposed Solution". In: Computational Linguistics 20.4, pp. 597–633. URL: http://aclweb.org/anthology/J/J94/J94-4004.pdf.

Dreyer, Markus and Jason Eisner (Aug. 2009). "Graphical Models Over Multiple Strings". In: Proceedings of EMNLP. Singapore, pp. 101–110. URL: http://cs.jhu.edu/~jason/papers/#dreyer-eisner-2009.

Du, Xinya, Junru Shao, and Claire Cardie (July 2017). "Learning to Ask: Neural Question Generation for Reading Comprehension". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, pp. 1342–1352. URL: https://www.aclweb.org/anthology/P17-1123.

Duchi, John, Elad Hazan, and Yoram Singer (2011). "Adaptive subgradient methods for online learning and stochastic optimization". In: Journal of Machine Learning Research 12.Jul, pp. 2121–2159.

Dupuy, B and J McQuillan (1997). "Handcrafted books: Two for the price of one". In: Successful Strategies for Extensive Reading, pp. 171–180.

Elley, Warwick B and Francis Mangubhai (1983). "The impact of reading on second language learning". In: Reading Research Quarterly, pp. 53–67.

Gentry, J. Richard (Nov. 2000). "A Retrospective on Invented Spelling and a Look Forward". In: The Reading Teacher 54.3, pp. 318–332. URL: http://www.jstor.org.proxy1.library.jhu.edu/stable/20204910.

González-Brenes, José, Yun Huang, and Peter Brusilovsky (2014). "General features in knowledge tracing to model multiple subskills, temporal item response theory, and expert knowledge". In: Proceedings of the 7th International Conference on Educational Data Mining. University of Pittsburgh, pp. 84–91.

Grammarly (2009). Grammarly. https://app.grammarly.com. Accessed: 2019-02-20.

Heafield, Kenneth, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn (Aug. 2013). "Scalable Modified Kneser-Ney Language Model Estimation". In: Proceedings of ACL. Sofia, Bulgaria, pp. 690–696. URL: http://kheafield.com/professional/edinburgh/estimate_paper.pdf.

Heilman, Michael and Nitin Madnani (2012). "ETS: Discriminative Edit Models for Paraphrase Scoring". In: *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). Montréal, Canada: Association for Computational Linguistics, pp. 529–535. URL: https://www.aclweb.org/anthology/S12-1076.

Heilman, Michael and Noah A Smith (2010). "Good question! Statistical ranking for question generation". In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 609–617.

Hermjakob, Ulf, Jonathan May, Michael Pust, and Kevin Knight (July 2018). "Translating a Language You Don't Know In the Chinese Room". In: Proceedings of ACL 2018, System Demonstrations. Melbourne, Australia: Association for Computational Linguistics, pp. 62–67. URL: https://www.aclweb.org/anthology/P18-4011.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). "Long short-term memory". In: Neural Computation 9.8, pp. 1735–1780.

Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities". In: Proceedings of the National Academy of Sciences of the USA. Vol. 79. 8, pp. 2554–2558.

Hu, Chang, Benjamin B Bederson, and Philip Resnik (2010). "Translation by iterative collaboration between monolingual users". In: Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 54–55.

Hu, Chang, Benjamin B Bederson, Philip Resnik, and Yakov Kronrod (2011). "Monotrans2: A new human computation system to support monolingual translation". In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1133–1136.

Huang, Yun, J Guerra, and Peter Brusilovsky (2016). "Modeling Skill Combination Patterns for Deeper Knowledge Tracing". In: Proceedings of the 6th Workshop on Personalization Approaches in Learning Environments (PALE 2016). 24th Conference on User Modeling, Adaptation and Personalization. Halifax, Canada.

Huckin, Thomas and James Coady (1999). "Incidental vocabulary acquisition in a second language". In: Studies in Second Language Acquisition 21.2, pp. 181–193.

Kann, Katharina, Ryan Cotterell, and Hinrich Schütze (Apr. 2017). "Neural Multi-Source Morphological Reinflection". In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain: Association for Computational Linguistics, pp. 514–524. URL: https://www.aclweb.org/anthology/E17-1049.

Khajah, Mohammad, Rowan Wing, Robert Lindsey, and Michael Mozer (2014a). "Integrating latent-factor and knowledge-tracing models to predict individual differences in learning". In: Proceedings of the 7th International Conference on Educational Data Mining.

Khajah, Mohammad M, Yun Huang, José P González-Brenes, Michael C Mozer, and Peter Brusilovsky (2014b). "Integrating knowledge tracing and item response theory: A tale of two frameworks". In: Proceedings of the Workshop on Personalization Approaches in Learning Environments (PALE 2014) at the 22nd International Conference on User Modeling, Adaptation, and Personalization. University of Pittsburgh, pp. 7–12.

Kim, Haeyoung and Stephen Krashen (1998). "The Author Recognition and Magazine Recognition Tests, and Free Voluntary Rereading as Predictors of Vocabulary Development in English as a Foreign Language for Korean High School Students." In: System 26.4, pp. 515–23.

Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush (2016). "Character-aware neural language models". In: Thirtieth AAAI Conference on Artificial Intelligence.

Kingma, Diederik P. and Jimmy Ba (2014). "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980.

Kingma, Diederik P and Max Welling (2013). "Auto-encoding variational bayes". In: arXiv preprint arXiv:1312.6114.

Knowles, Rebecca, Adithya Renduchintala, Philipp Koehn, and Jason Eisner (Aug. 2016). "Analyzing Learner Understanding of Novel L2 Vocabulary". In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, pp. 126–135. URL: http://www.aclweb.org/anthology/K16-1013.


Koedinger, K. R., P. I. Pavlick Jr., J. Stamper, T. Nixon, and S. Ritter (2011). “Avoiding Problem Selection Thrashing with Conjunctive Knowledge Tracing”. In: Proceedings of the 4th International Conference on Educational Data Mining. Eindhoven, NL, pp. 91–100.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst (2007). “Moses: Open Source Toolkit for Statistical Machine Translation”. In: Proceedings of ACL: Short Papers, pp. 177–180. URL: http://www.aclweb.org/anthology/P/P07/P07-2045.

Krashen, S. (1993). “How Well do People Spell?” In: Reading Improvement 30.1. URL: http://www.sdkrashen.com/content/articles/1993_how_well_do_people_spell.pdf.

Krashen, Stephen (1989). “We acquire vocabulary and spelling by reading: Additional evidence for the input hypothesis”. In: The Modern Language Journal 73.4, pp. 440–464.

Krashen, Stephen D (2003). Explorations in language acquisition and use. Heinemann, Portsmouth, NH.

Labutov, Igor and Hod Lipson (June 2014). “Generating Code-switched Text for Lexical Learning”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 562–571. URL: https://www.aclweb.org/anthology/P14-1053.

Lee, Jung In and Emma Brunskill (2012). “The Impact on Individualizing Student Models on Necessary Practice Opportunities.” In: International Educational Data Mining Society.

Lee, Sy-Ying and Stephen Krashen (1996). “Free Voluntary Reading and Writing Competence in Taiwanese High School Students”. In: Perceptual and Motor Skills 83.2, pp. 687–690. URL: https://doi.org/10.2466/pms.1996.83.2.687.

Lee, Yon Ok, Stephen D Krashen, and Barry Gribbons (1996). “The effect of reading on the acquisition of English relative clauses”. In: ITL-International Journal of Applied Linguistics 113.1, pp. 263–273.

Leitner, Sebastian (1972). So lernt man lernen: der Weg zum Erfolg. Herder, Freiburg.

Lingua.ly (2013). Lingua.ly. https://lingua.ly/. Accessed: 2016-04-04.

Litman, Diane (2016). “Natural Language Processing for Enhancing Teaching and Learning”. In: Proceedings of AAAI.

Madnani, Nitin, Michael Heilman, Joel Tetreault, and Martin Chodorow (2012). “Identifying High-Level Organizational Elements in Argumentative Discourse”. In: Proceedings of NAACL-HLT. Association for Computational Linguistics, pp. 20–28.

Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher (2016). “Pointer sentinel mixture models”. In: arXiv preprint arXiv:1609.07843.


Mikolov, Tomas, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur (2010). “Recurrent neural network based language model.” In: INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048.

Mikolov, Tomas, Stefan Kombrink, Anoop Deoras, Lukas Burget, and Jan Cernocky (2011). “RNNLM — Recurrent Neural Network Language Modeling Toolkit”. In: Proc. of the 2011 ASRU Workshop, pp. 196–201.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean (2013). “Distributed representations of words and phrases and their compositionality”. In: Advances in Neural Information Processing Systems, pp. 3111–3119.

Mikolov, Tomas, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin (May 2018). “Advances in Pre-Training Distributed Word Representations”. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). Miyazaki, Japan: European Languages Resources Association (ELRA). URL: https://www.aclweb.org/anthology/L18-1008.

Mitkov, Ruslan and Le An Ha (2003). “Computer-aided generation of multiple-choice tests”. In: Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2. Association for Computational Linguistics, pp. 17–22.

Mousa, Amr and Björn Schuller (Apr. 2017). “Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A Generative Approach to Sentiment Analysis”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain, pp. 1023–1032. URL: https://www.aclweb.org/anthology/E17-1096.

Murphy, Kevin P., Yair Weiss, and Michael I. Jordan (1999). “Loopy belief propagation for approximate inference: An empirical study”. In: Proceedings of UAI. Morgan Kaufmann Publishers Inc., pp. 467–475.

Nelson, Mark (2007). The Alpheios Project. http://alpheios.net/. Accessed: 2016-04-05.

Nivre, Joakim, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, et al. (2016). “Universal Dependencies v1: A Multilingual Treebank Collection.” In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

OneThirdStories (2018). OneThirdStories. https://onethirdstories.com/. Accessed: 2019-02-20.

Özbal, Gözde, Daniele Pighin, and Carlo Strapparava (2014). “Automation and Evaluation of the Keyword Method for Second Language Learning”. In: Proceedings of ACL (Volume 2: Short Papers), pp. 352–357.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning (2014). “GloVe: Global Vectors for Word Representation”. In: Proceedings of EMNLP. Vol. 14, pp. 1532–1543. URL: http://www.aclweb.org/anthology/D14-1162.

Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (June 2018). “Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 2227–2237. URL: https://www.aclweb.org/anthology/N18-1202.

Philips, Lawrence (Dec. 1990). “Hanging on the Metaphone”. In: Computer Language 7.12.

Piech, Chris, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein (2015). “Deep Knowledge Tracing”. In: Advances in Neural Information Processing Systems, pp. 505–513.

Posner, Michael I (1989). Foundations of cognitive science. MIT Press, Cambridge, MA.

Preacher, K. J. (May 2002). Calculation for the test of the difference between two independent correlation coefficients [Computer software]. URL: http://www.quantpsy.org/corrtest/corrtest.htm.

Press, Ofir and Lior Wolf (Apr. 2017). “Using the Output Embedding to Improve Language Models”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain, pp. 157–163. URL: https://www.aclweb.org/anthology/E17-2025.

Rafferty, Anna N and Christopher D Manning (2008). “Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines”. In: Proceedings of the Workshop on Parsing German. Association for Computational Linguistics, pp. 40–46.

Recht, Benjamin, Christopher Re, Stephen Wright, and Feng Niu (2011). “Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”. In: Advances in Neural Information Processing Systems, pp. 693–701.


Renduchintala, Adithya, Philipp Koehn, and Jason Eisner (Aug. 2017). “Knowledge Tracing in Sequential Learning of Inflected Vocabulary”. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada, pp. 238–247. URL: https://www.aclweb.org/anthology/K17-1025.

Renduchintala, Adithya, Philipp Koehn, and Jason Eisner (Aug. 2019a). “Simple Construction of Mixed-Language Texts for Vocabulary Learning”. In: Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA). Florence. URL: http://cs.jhu.edu/~jason/papers/#renduchintala-et-al-2019.

Renduchintala, Adithya, Philipp Koehn, and Jason Eisner (Nov. 2019b). “Spelling-Aware Construction of Macaronic Texts for Teaching Foreign-Language Vocabulary”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 6438–6443. URL: https://www.aclweb.org/anthology/D19-1679.

Renduchintala, Adithya, Rebecca Knowles, Philipp Koehn, and Jason Eisner (Aug. 2016a). “Creating Interactive Macaronic Interfaces for Language Learning”. In: Proceedings of ACL-2016 System Demonstrations. Berlin, Germany, pp. 133–138. URL: https://www.aclweb.org/anthology/P16-4023.

Renduchintala, Adithya, Rebecca Knowles, Philipp Koehn, and Jason Eisner (Aug. 2016b). “User Modeling in Language Learning with Macaronic Texts”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pp. 1859–1869. URL: https://www.aclweb.org/anthology/P16-1175.

Rodrigo, Victoria, Jeff McQuillan, and Stephen Krashen (1996). “Free Voluntary Reading and Vocabulary Knowledge in Native Speakers of Spanish”. In: Perceptual and Motor Skills 83.2, pp. 648–650. URL: https://doi.org/10.2466/pms.1996.83.2.648.

Schacter, D. L. (1989). “Memory”. In: Foundations of Cognitive Science. Ed. by M. I. Posner. MIT Press, pp. 683–725.

Settles, Burr and Brendan Meeder (Aug. 2016). “A Trainable Spaced Repetition Model for Language Learning”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. Berlin, Germany: Association for Computational Linguistics, pp. 1848–1858. URL: http://www.aclweb.org/anthology/P16-1174.

Smolensky, Paul (1986). “Information Processing in Dynamical Systems: Foundations of Harmony Theory”. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Ed. by D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. Vol. 1: Foundations. Cambridge, MA: MIT Press/Bradford Books, pp. 194–281.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2014). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: Journal of Machine Learning Research 15, pp. 1929–1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.

Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber (2015). “Highway networks”. In: arXiv preprint arXiv:1505.00387.

Stokes, Jeffery, Stephen D Krashen, and John Kartchner (1998). “Factors in the acquisition of the present subjunctive in Spanish: The role of reading and study”. In: ITL-International Journal of Applied Linguistics 121.1, pp. 19–25.

Swych (2015). Swych. http://swych.it/. Accessed: 2019-02-20.

Tieleman, Tijmen and Geoffrey Hinton (2012). “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”. In: COURSERA: Neural networks for machine learning 4.2.

Vilnis, Luke and Andrew McCallum (2014). “Word Representations via Gaussian Embedding”. In: CoRR abs/1412.6623.

Vygotsky, Lev (2012). Thought and Language (Revised and Expanded Edition). MIT Press.

Weide, R. (1998). The CMU pronunciation dictionary, release 0.6.

Wieting, John, Mohit Bansal, Kevin Gimpel, and Karen Livescu (Nov. 2016). “Charagram: Embedding Words and Sentences via Character n-grams”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pp. 1504–1515. URL: https://www.aclweb.org/anthology/D16-1157.

Wikimedia Foundation (2016). Simple English Wikipedia. Retrieved from https://dumps.wikimedia.org/simplewiki/20160407/, 8 April 2016.


Wikipedia (2016). Leichte Sprache — Wikipedia, Die freie Enzyklopädie. [Online; accessed 16-March-2016]. URL: https://de.wikipedia.org/wiki/Leichte_Sprache.

Wood, David, Jerome S. Bruner, and Gail Ross (1976). “The role of tutoring in problem solving”. In: Journal of Child Psychology and Psychiatry 17.2, pp. 89–100.

Wu, Dekai (1997). “Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora”. In: Computational Linguistics 23.3, pp. 377–404.

Xu, Yanbo and Jack Mostow (2012). “Comparison of Methods to Trace Multiple Subskills: Is LR-DBN Best?” In: Proceedings of the 5th International Conference on Educational Data Mining, pp. 41–48.

Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le (2019). “XLNet: Generalized autoregressive pretraining for language understanding”. In: Advances in Neural Information Processing Systems, pp. 5753–5763.

Zeiler, Matthew D (2012). “ADADELTA: an adaptive learning rate method”. In: arXiv preprint arXiv:1212.5701.

Zhang, Haoran, Ahmed Magooda, Diane Litman, Richard Correnti, Elaine Wang, LC Matsumura, Emily Howe, and Rafael Quintana (2019). “eRevise: Using Natural Language Processing to Provide Formative Feedback on Text Evidence Usage in Student Writing”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33, pp. 9619–9625.

Zhou, Jingguang and Zili Huang (2018). “Recover missing sensor data with iterative imputing network”. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.

Zolf, Falk (1945). On Stranger Land: Pages of a Life.

Zolf, Falk and Martin Green (2003). On Foreign Soil: Tales of a Wandering Jew. Benchmark Publishing.

Vita

Adithya Renduchintala

arendu.github.io

adi.r@jhu.edu

INTERESTS

I am broadly interested in problems at the intersection of (Deep) Machine Learning, Neural Machine Translation, Natural Language Processing, & User Modeling.

EDUCATION

PhD, Computer Science, 2013–2020
Johns Hopkins University, Baltimore, MD

MS, Computer Science, 2010–2012
University of Colorado, Boulder, CO


MS, Electrical Engineering, Arts Media and Engineering, 2005–2008
Arizona State University, Tempe, AZ

BE, Electrical Engineering, 2001–2005
Anna University, SRM Engineering College, Chennai, India

PUBLICATIONS

1. Spelling-Aware Construction of Macaronic Texts for Teaching Foreign-Language Vocabulary, Adithya Renduchintala, Philipp Koehn and Jason Eisner. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.

2. Simple Construction of Mixed-Language Texts for Vocabulary Learning, Adithya Renduchintala, Philipp Koehn and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) Workshop on Innovative Use of NLP for Building Educational Applications, 2019.

3. Pretraining by Backtranslation for End-to-End ASR in Low-Resource Settings, Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, Chunxi Li, Najim Dehak and Sanjeev Khudanpur. Interspeech 2019.

4. A Call for Prudent Choice of Subword Merge Operations, Shuoyang Ding, Adithya Renduchintala, and Kevin Duh. Machine Translation Summit 2019.


5. Character-Aware Decoder for Translation into Morphologically Rich Languages, Adithya Renduchintala*, Pamela Shapiro*, Kevin Duh and Philipp Koehn (* equal contribution). Machine Translation Summit 2019.

6. The JHU/KyotoU Speech Translation System for IWSLT 2018, Hirofumi Inaguma, Xuan Zhang, Zhiqi Wang, Adithya Renduchintala, Shinji Watanabe and Kevin Duh. The International Workshop on Spoken Language Translation 2018 (IWSLT).

7. Multi-Modal Data Augmentation for End-to-End ASR, Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner and Shinji Watanabe. Interspeech 2018. Best Student Paper Award (3/700+).

8. ESPnet: End-to-End Speech Processing Toolkit, Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala and Tsubasa Ochiai. Interspeech 2018.

9. Knowledge Tracing in Sequential Learning of Inflected Vocabulary, Adithya Renduchintala, Philipp Koehn and Jason Eisner. Conference on Computational Natural Language Learning (CoNLL), 2017.

10. User Modeling in Language Learning with Macaronic Texts, Adithya Renduchintala, Rebecca Knowles, Philipp Koehn, and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) 2016.

11. Creating interactive macaronic interfaces for language learning, Adithya Renduchintala, Rebecca Knowles, Philipp Koehn, and Jason Eisner. Annual Meeting of the Association for Computational Linguistics (ACL) Demo Session 2016.

12. Analyzing learner understanding of novel L2 vocabulary, Rebecca Knowles, Adithya Renduchintala, Philipp Koehn, and Jason Eisner. Conference on Computational Natural Language Learning (CoNLL), 2016.

13. Algerian Arabic-French Code-Switched Corpus, Ryan Cotterell, Adithya Renduchintala, Naomi P. Saphra and Chris Callison-Burch. An LREC-2014 Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools. 2014.

14. Using Machine Learning and HL7 LOINC DO for Classification of Clinical Documents, Adithya Renduchintala, Amy Zhang, Thomas Polzin, G. Saadawi. American Medical Informatics Association (AMIA) 2013.

15. Collaborative Tagging and Persistent Audio Conversations, Ajita John, Shreeharsh Kelkar, Ed Peebles, Adithya Renduchintala, Doree Seligmann. Web 2.0 and Social Software Workshop in Conjunction with ECSCW. 2007.

16. Designing for Persistent Audio Conversations in the Enterprise, Adithya Renduchintala, Ajita John, Shreeharsh Kelkar, and Doree Duncan-Seligmann. Design for User Experience. 2007.


EXPERIENCE

Facebook AI, Menlo Park, CA, 2020–Present
Research Scientist
Working on problems related to Neural Machine Translation.

Johns Hopkins University, Baltimore, MD, 2013–2020
Research Assistant
Designed and evaluated AI foreign language teaching systems. Also worked on Machine Translation and End-to-End Speech Recognition.

Duolingo, Pittsburgh, PA, Summer 2017
Research Intern
Prototyped a chatbot system that detects and corrects word-ordering errors made by language learners. Explored spelling-error robustness of compositional word embeddings.

M*Modal, Pittsburgh, PA, 2012–2013
NLP Engineer
Developed an SVM-based clinical document classification system. Worked on feature engineering for statistical models (Document Classification, Entity Detection, Tokenization, Chunking).

Rosetta Stone, Boulder, CO, 2008–2012
Software Developer
Designed, prototyped and evaluated speech recognition based games and applications for language learning. Prototyped an image-to-concept relation visualization tool for second language vocabulary learning.

Avaya, Lincroft, NJ, Summer 2007
Research Scientist Intern
Developed an interactive graph based visualization tool to explore and annotate conference calls in enterprises.

Arizona State University, Tempe, AZ, 2006–2008
Research Assistant, Arts Media & Engineering
Designed and prototyped systems for serendipitous interactions in distributed workplaces.

TEACHING

1. Intro. to Human Language Technology, Teaching Assistant, Fall 2019
2. Neural Machine Translation Lab Session, JSALT Summer School, Summer 2018
3. Machine Translation, Teaching Assistant, Spring 2016
4. Intro. to Programming for Scientists & Engineers, Teaching Assistant, Fall 2013

PROGRAMMING

Advanced: Python (numpy, scipy, scikit-learn)
Proficient: Java, C/C++, Javascript, Jquery, NodeJs


Deep Learning Frameworks: PyTorch (Advanced), MxNet, Tensorflow
Deep Learning Toolkits: OpenNMT, Fairseq, ESPNet, Sockeye

NATIONALITY

Indian, Permanent US Resident

Updated 08/28/2020
