Improving Language + Assessment with Data

Burr Settles
staff scientist + engineer

launched in 2012 (CMU spinoff)
more than 200 million students globally
currently 76 courses (incl. High Valyrian)
expanding to 95+ courses (incl. Klingon)
100% FREE

Google: Best of the Best
Apple: App of the Year
TechCrunch: Education Startup of the Year

1,200,000,000 people learning a second language (~16% of the world’s population)

~800M satisfy three properties:
- learning English
- in a developing country
- to gain more opportunity
(Source: British Council)

86% mobile device access
64% toilet access
(Source: U.N. Report, 2015)

Bassin Caiman, Haiti

access, personalization, feedback

34 hours of Duolingo is equivalent to one university semester
(Vesselinov + Grego, 2012)

12 weeks of self-study w/ Duolingo is as effective as traditional F2F instruction

(Rachels + Rockinson-Szapkiw, 2017)

#1 language-learning platform
[Figure: chart comparing “duolingo” with “rosetta stone”, “babbel”, “busuu”, and “pimsleur”; annotation: began using empirical ML/NLP methods]

key aspect of teaching people: forgetting

(Settles + Meeder, ACL 2016)

Modeling Learning + Forgetting

Duolingo uses strength meters that remind students to practice …
… and prioritizes words in their practice sessions …
… based on a statistical model of word strength in long-term memory
(example lexeme tags: étant/être, un/un, enfant/enfant, il/il, est/être, petit/petit)

of course, we want this model to be personalized for each student!

The Spacing Effect (Ebbinghaus, 1885; Atkinson, 1972; Bloom + Shuell, 1981)

people learn better if practice is spaced over long intervals (instead of “cramming”)

The Forgetting Curve (Ebbinghaus, 1885)

the probability p of a correct answer as a function of:
• lag time Δ since the last practice
• half-life h in user’s memory

$p = 2^{-\Delta/h}$

[Figure: p (0 to 1) plotted against Δ/h (0 to 7)]

The Lag Effect (Melton, 1970; Scarborough, 1977)

people learn even better if the spacing between practices gradually increases (i.e., half-life increases with more practice)

The Pimsleur Method (Pimsleur, 1967)

first mainstream application of spacing and lag effects

• vocabulary is reviewed over exponentially increasing half-life intervals

• drawback: fixed schedule (not adaptive/personalized)

The Leitner System (Leitner, 1972)
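In contrast to a fixed schedule, a Leitner-style scheduler changes the review interval after every answer. The sketch below is only an illustration under assumptions that are not from the slides: the names `INTERVALS_DAYS`, `next_box`, and `simulate` are hypothetical, intervals are taken to be in days, and an incorrect answer moves an item down by exactly one box.

```python
# Hypothetical Leitner-style scheduler (illustrative only).
# Assumption: boxes map to doubling review intervals, measured in days.
INTERVALS_DAYS = [1, 2, 4, 8, 16]


def next_box(box: int, correct: bool) -> int:
    """Promote one box on a correct answer, demote one box on an incorrect one."""
    if correct:
        return min(box + 1, len(INTERVALS_DAYS) - 1)
    return max(box - 1, 0)


def simulate(answers):
    """Return the review interval chosen after each answer, starting in box 0."""
    box = 0
    schedule = []
    for correct in answers:
        box = next_box(box, correct)
        schedule.append(INTERVALS_DAYS[box])
    return schedule


print(simulate([True, True, False, True, True, True]))  # [2, 4, 2, 4, 8, 16]
```

The single incorrect answer pulls the item back to a shorter interval, which is the adaptive behavior a fixed schedule lacks.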

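1st Duolingo student model: Leitner adaptive flashcard algorithm
• practice intervals h: 1, 2, 4, 8, 16
• correctly-answered items move to a longer interval; incorrectly-answered items move to a shorter one

observation: half-life can be formalized into the equation:

$\hat{h} = 2^{x_\oplus - x_\ominus}$

where $x_\oplus$ and $x_\ominus$ count the correct and incorrect answers, or, more generally:

$\hat{h}_\Theta = 2^{\Theta \cdot \mathbf{x}}$

for model weights Θ and a feature vector x

Half-life Regression (HLR)

each session leaves a trace of what the student recalled correctly: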

can we learn model weights Θ from these traces?

each ✖ = ⟨p, Δ, x⟩ is a student/lexeme practice session
• p = recall rate (proportion correct) in session
• Δ = lag time since the last session
• x = feature vector summarizing practice history (e.g., correct = 5, incorrect = 1, lexeme = étant)

[Figure: recall rate (0 to 1) vs. time (in days, 0 to 30), with one ✖ per practice session]

Pimsleur + Leitner are special cases of this model!
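To make the model concrete, here is a minimal sketch of an HLR-style prediction: the half-life is 2 raised to the dot product of weights and features, and recall probability follows the forgetting curve. The function names, the dictionary-based feature encoding, and the choice of days as the time unit are assumptions for illustration, not the released implementation. With weights of +1 per correct answer and -1 per incorrect answer, the half-life reduces to the Leitner form.

```python
def half_life(weights, features):
    """Estimated half-life: h_hat = 2 ** (theta . x)."""
    dot = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 2.0 ** dot


def recall_probability(lag_days, h):
    """Forgetting curve: p = 2 ** (-lag / half-life)."""
    return 2.0 ** (-lag_days / h)


# Leitner as a special case: +1 per correct answer, -1 per incorrect answer.
leitner_weights = {"correct": 1.0, "incorrect": -1.0}
history = {"correct": 5, "incorrect": 1}  # hypothetical practice history

h = half_life(leitner_weights, history)
print(h)                          # 16.0, i.e. 2 ** (5 - 1)
print(recall_probability(h, h))   # 0.5: recall is 50% when the lag equals the half-life
```

Learned weights would replace `leitner_weights`, and extra entries such as a per-lexeme indicator could be added to the feature dictionary.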

goal: learn weights Θ such that model predictions fit the training set $\mathcal{D} = \{\langle p, \Delta, \mathbf{x} \rangle_i\}_{i=1}^{D}$

objective function:

$\Theta^* = \arg\min_\Theta \sum_{i=1}^{D} \left[ (p - \hat{p}_\Theta)^2 + \alpha (h - \hat{h}_\Theta)^2 \right] + \lambda \|\Theta\|_2^2$

(recall error term + est. half-life error term + L2 penalty term)
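As a rough sketch of how one stochastic gradient step on this objective could look, the code below computes the per-session loss and its gradient by the chain rule, using the forgetting curve for the predicted recall and 2 raised to the weighted feature sum for the predicted half-life. The function names, the default values for `alpha`, `lam`, and `lr`, the clipping of p away from 0 and 1, and applying the L2 penalty at every step are all assumptions made for illustration.

```python
import math

LN2 = math.log(2.0)


def predict(theta, x, lag):
    """Return (p_hat, h_hat) for weights theta, features x, and lag time."""
    h_hat = 2.0 ** sum(t * xi for t, xi in zip(theta, x))
    p_hat = 2.0 ** (-lag / h_hat)
    return p_hat, h_hat


def _targets(p, lag):
    """Clip p (assumption: avoids log2(1) = 0) and derive the observed half-life."""
    p = min(max(p, 1e-4), 1.0 - 1e-4)
    return p, -lag / math.log2(p)


def loss(theta, x, lag, p, alpha=0.01, lam=0.1):
    """Recall error + alpha * half-life error + L2 penalty for one session."""
    p, h = _targets(p, lag)
    p_hat, h_hat = predict(theta, x, lag)
    penalty = lam * sum(t * t for t in theta)
    return (p - p_hat) ** 2 + alpha * (h - h_hat) ** 2 + penalty


def sgd_step(theta, x, lag, p, alpha=0.01, lam=0.1, lr=0.001):
    """One gradient-descent update of theta on a single session."""
    p, h = _targets(p, lag)
    p_hat, h_hat = predict(theta, x, lag)
    # d(recall error)/d(theta . x) and d(half-life error)/d(theta . x)
    d_recall = 2.0 * (p_hat - p) * (LN2 ** 2) * p_hat * (lag / h_hat)
    d_half = 2.0 * alpha * (h_hat - h) * LN2 * h_hat
    return [t - lr * ((d_recall + d_half) * xi + 2.0 * lam * t)
            for t, xi in zip(theta, x)]


theta = [0.0, 0.0]
x, lag, p = [1.0, 1.0], 2.0, 0.9   # hypothetical session
before = loss(theta, x, lag, p)
theta = sgd_step(theta, x, lag, p, lr=0.01)
print(before > loss(theta, x, lag, p))  # True: the step reduced the loss
```

In practice the update would be repeated over shuffled sessions, and the hyperparameters would be tuned on held-out data.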

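expanded error terms:

recall error: $\left(p - 2^{-\Delta / 2^{\Theta \cdot \mathbf{x}}}\right)^2$
est. half-life error: $\alpha \left(\frac{-\Delta}{\log_2(p)} - 2^{\Theta \cdot \mathbf{x}}\right)^2$

optimize Θ using stochastic gradient descent

Historical Log Data Evaluation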

[Figure: bar chart of mean absolute error (MAE, lower is better; axis 0 to 0.5) for the HLR variants (HLR, HLR -lex, HLR -h, HLR -lex-h), the heuristics (Pimsleur, Leitner), and the logistic regression variants (LR, LR -lex)]

• lexeme features help a little bit
• half-life objective term helps a lot!
• only full HLR variants beat a fixed baseline! (p̄ = 0.89)

more metrics + results in our ACL’16 paper…
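For readers unfamiliar with the metric, the sketch below shows how mean absolute error could be computed for a model and for a fixed baseline. It assumes the fixed baseline always predicts the constant recall rate p̄ = 0.89; the observed and predicted values are made up for illustration and are not from the evaluation.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted recall rates."""
    return sum(abs(o - q) for o, q in zip(observed, predicted)) / len(observed)


# Hypothetical recall rates for five practice sessions (not real data).
observed = [1.0, 0.8, 1.0, 0.5, 1.0]
model_predictions = [0.9, 0.85, 0.95, 0.6, 0.9]

# Assumed fixed baseline: always predict the constant recall rate 0.89.
fixed_baseline = [0.89] * len(observed)

mae_model = mean_absolute_error(observed, model_predictions)
mae_baseline = mean_absolute_error(observed, fixed_baseline)
print(mae_model < mae_baseline)  # True for this made-up example
```

A model only adds value over the constant prediction when its MAE falls below the baseline's.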

Example Θ Weights

English                    German
+0.77 camera/camera        +0.87 Baby/Baby
+0.38 ends/end             +0.56 sprechen/sprechen
+0.08 circle/circle        +0.13 sehr/sehr
-0.09 rose/rise            -0.07 den/der
-0.48 performed/perform    -0.55 Ihnen/Sie
-0.81 writing/write        -1.10 war/sein

French                     Spanish
+0.94 visite/visiter       +0.83 liberal/liberal
+0.47 suis/être            +0.40 como/comer
+0.05 trou/trou            +0.10 encuentra/encontrar
-0.06 dessous/dessous      -0.05 está/estar
-0.45 ceci/ceci            -0.33 pensando/pensar
-0.91 fallait/falloir      -0.73 quedado/quedar

+ weights: common words, short/regular, cognates
- weights: rare words, long/irregular, advanced grammar

User Experiment

50% students kept Leitner model | 50% students given HLR model
(6 weeks; 1M students)

engagement = % students who returned the following day:
HLR +0.3% (p=0.03)

short-term reactions (week of launch)…
long-term reactions (2 months later)…

• culprit: extreme lexeme feature weights

• e.g., rare words, irregulars, etc. that are hard on average (but not for advanced students)

User Experiment #2

50% students given HLR model | 50% students given HLR -lexemes
(2 weeks; 3M students)

engagement = % students who returned the following day:
+1.7% Lessons (p<0.001)
+9.5% Practice (p<0.001)
+12.0% Anything (p<0.001)

Data + Code
github.com/duolingo/halflife-regression

… so many emails … other (better?) forgetting functions

≈HLR forgetting function

Other Forgetting Functions

power: $\hat{p} = \alpha \Delta^{-\beta}$
logarithm: $\hat{p} = \alpha - \beta \log(\Delta)$
expsqrt: $\hat{p} = \alpha \exp(-\beta \sqrt{\Delta})$
HLR: $\hat{p} = \alpha \exp(-\beta \Delta)$

α = intercept: highest student recall rate that seems likely
β = forgetting rate: how fast this prediction decays

following Ridgeway + Mozer (2016), we learn to personalize α (logistic function of features), but fit a global β
for HLR, β is learned but α = 1

Other Forgetting Functions: Historical Log Data Evaluation

[Figure: scatter plot of MAE (lower is better; 0 to 0.12) vs. correlation (higher is better; 0 to 0.26) for HLR, expsqrt, power, and logarithm; annotation: “better?”]

User Experiment #3

(3 months; 4.1M new students)

change relative to HLR:

              power    logarithm   expsqrt
retention     -0.3%    +0.2%       +0.1%
# sessions    -3.4%    -3.2%       -3.1%
revenue       -4.1%    -3.0%       -5.0%
quiz score    -0.8%    -8.8%       -4.0%

other projects in personalized learning

Levels

alternative to strength meter mechanism

each skill can teach the same content at different difficulty levels, e.g.:
- progress from receptive → productive
- progress from written → verbal

currently requires re-doing lessons, but in the future will level up automatically and tailor content based on learner modeling

Learned Variable Rewards

learn to predict how much bonus time to award for correct answers

+4.8%*** practice session activity
+19.4%*** practice session completion

Data-Driven Curriculum Design

[Figure: three PCA plots of vocabulary words and grammatical features (e.g., VERB, NOUN, Tense=Past, Number=Plur), with annotations such as “from RUS”, “from ARA”, “from POR”, and “from SPA”; axes are standardized PC1 (67.0%, 70.5%, 63.3% explained var.) and standardized PC2 (18.9%, 13.6%, 15.4% explained var.)]

Mixed-Initiative Chatbots

Computer-Adaptive Placement Tests

+1.3%*** for new users
+2.2%*** still active after 2 weeks

[Figure: sequence of correct (✔) and incorrect (✘) responses]

used by Yale, UCLA, NYU, Notre Dame, Dartmouth, Tufts, 200+ other institutions…

The Future: Closing the Loop

teaching ↔ assessment

a challenge for you!
sharedtask.duolingo.com

Data Set Available: December 18
Workshop @ NAACL-HLT 2018 in New Orleans, LA!

(also, we’re hiring ML + research positions! duolingo.com/jobs)