Improving Language Learning + Assessment with Data
Burr Settles, Staff Scientist + Engineer

Duolingo
- launched in 2012 (CMU spinoff)
- more than 200 million students globally
- currently 76 courses (incl. High Valyrian), expanding to 95+ courses (incl. Klingon)
- 100% free
- awards: "Best of the Best" (Google), "App of the Year" (Apple), "Education Startup of the Year" (TechCrunch)

Why language learning?
- 1,200,000,000 people are learning a second language (~16% of the world's population)
- ~800M of them satisfy three properties (Source: British Council):
  - learning English
  - in a developing country
  - to gain more opportunity
- 86% of the world has mobile device access, vs. 64% with toilet access (Source: U.N. Report, 2015)
  [photo: Bassin Caiman, Haiti]
- what technology can offer: access, personalization, feedback
- 34 hours of Duolingo is equivalent to one university semester (Vesselinov + Grego, 2012)
- 12 weeks of self-study w/ Duolingo is as effective as traditional F2F instruction (Rachels + Rockinson-Szapkiw, 2017)
- #1 language-learning platform [search-trend chart: "duolingo" vs. "rosetta stone", "babbel", "busuu", "pimsleur"]
- began using empirical ML/NLP methods; a key aspect of teaching people: forgetting (Settles + Meeder, ACL 2016)

Modeling Learning + Forgetting
- Duolingo uses strength meters that remind students to practice ...
- ... and prioritizes words in their practice sessions ...
- ... based on a statistical model of word strength in long-term memory
- words are tracked as lexeme tags, e.g.: étant/être<vb><gerund>, un/un<det><ind><m><sg>, enfant/enfant<n><mf><sg>, il/il<prn><p3><m><sg>, est/être<vb><pri><p3><sg>, petit/petit<adj><m><sg>
- of course, we want this model to be personalized for each student!

The Spacing Effect (Ebbinghaus, 1885; Atkinson, 1972; Bloom + Shuell, 1981)
- people learn better if practice is spaced over long intervals (instead of "cramming")

The Forgetting Curve (Ebbinghaus, 1885)
- the probability p of a correct answer as a function of:
  - the time Δ since the last practice
  - the half-life h in the user's memory
p = 2^(−Δ/h)

[plot: recall probability p decaying from 1 toward 0 as Δ/h grows from 0 to 7]

The Lag Effect (Melton, 1970; Scarborough, 1977)
- people learn even better if the spacing between practices gradually increases (i.e., half-life increases with more practice)

The Pimsleur Method (Pimsleur, 1967)
- first mainstream application of the spacing and lag effects
- vocabulary is reviewed over exponentially increasing half-life intervals
- drawback: fixed schedule (not adaptive/personalized)

The Leitner System (Leitner, 1972)
- adaptive flashcard algorithm; Duolingo's 1st student model
- half-life doubles after correctly-answered items and halves after incorrectly-answered items (h: 1, 2, 4, 8, 16, ...)
- observation: this can be formalized into the equation ĥ = 2^(x⊕ − x⊖), where x⊕ counts correct answers and x⊖ counts incorrect answers
- or, more generally: ĥ_Θ = 2^(Θ·x), for model weights Θ and a feature vector x

Half-life Regression (HLR)
- each session leaves a trace of what the student recalled correctly: can we learn the model weights Θ from these traces?
- each trace ✖ = ⟨p, Δ, x⟩ is a student/lexeme practice session, where:
  - p = recall rate (proportion correct) in the session
  - Δ = lag time (in days) since the last session
  - x = a feature vector summarizing the practice history
- [plot: recall rate vs. time (in days) for six sessions of one student/lexeme pair, each annotated with its p (e.g. 1.0, 0.5), its Δ (e.g. 0.6, 13.5 days), and a feature snapshot (e.g. correct = 2, incorrect = 1, lexeme = étant)]
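As a sketch in Python (with hypothetical weights and features, not Duolingo's actual values), the two equations above combine into a recall prediction:

```python
def estimated_half_life(theta, x):
    """h_theta = 2^(theta . x): estimated half-life (in days) from weights and features."""
    return 2.0 ** sum(t * xi for t, xi in zip(theta, x))

def predict_recall(theta, x, lag_days):
    """Forgetting curve p_hat = 2^(-delta / h_theta)."""
    return 2.0 ** (-lag_days / estimated_half_life(theta, x))

# Hypothetical weights over features [bias, # correct, # incorrect]:
theta = [1.0, 0.5, -0.5]
x = [1.0, 4.0, 1.0]  # one student/lexeme pair: 4 correct, 1 incorrect so far
print(predict_recall(theta, x, lag_days=7.0))
```

Note that with features x = (x⊕, x⊖) and fixed weights Θ = (1, −1), this reduces exactly to the Leitner schedule ĥ = 2^(x⊕ − x⊖).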
Half-life Regression (HLR)
- Pimsleur and Leitner are special cases of this model!
- goal: learn weights Θ such that model predictions fit the training set D = {⟨p, Δ, x⟩_i}_{i=1}^D

Objective function:

  Θ* = arg min_Θ Σ_{i=1}^D [ (p_i − p̂_Θ)² + α(ĥ_i − ĥ_Θ)² ] + λ‖Θ‖²₂

- recall error term: (p − 2^(−Δ/ĥ_Θ))², with predicted recall p̂_Θ = 2^(−Δ/ĥ_Θ)
- estimated half-life error term: α(−Δ/log₂(p) − 2^(Θ·x))², since inverting p = 2^(−Δ/h) gives the observed half-life ĥ = −Δ/log₂(p), while the model estimates ĥ_Θ = 2^(Θ·x)
- L2 penalty term: λ‖Θ‖²₂
- optimize Θ using stochastic gradient descent

Historical Log Data Evaluation (MAE; lower is better)
- compared spaced-repetition heuristics (Pimsleur, Leitner), logistic regression variants (LR, LR -lex), and HLR variants (HLR, HLR -lex, HLR -h, HLR -lex-h)
- lexeme features help a little bit
- the half-life objective term helps a lot!
- only the full HLR variants beat a fixed baseline of p̄ = 0.89 (more metrics + results in our ACL'16 paper)

Example Θ Weights

English:
  +0.77 camera/camera<n><sg>
  +0.38 ends/end<vb><pri><p3><sg>
  +0.08 circle/circle<n><sg>
  -0.09 rose/rise<vb><past>
  -0.48 performed/perform<vb><pp>
  -0.81 writing/write<vb><presp>
German:
  +0.87 Baby/Baby<n><nt><sg><acc>
  +0.56 sprechen/sprechen<vb><inf>
  +0.13 sehr/sehr<adv>
  -0.07 den/der<det><def><m><sg><acc>
  -0.55 Ihnen/Sie<pn><p3><pl><dat><formal>
  -1.10 war/sein<vb><imperf><p1><sg>
French:
  +0.94 visite/visiter<vb><pri><p3><sg>
  +0.47 suis/être<vb><pri><p1><sg>
  +0.05 trou/trou<n><m><sg>
  -0.06 dessous/dessous<adv>
  -0.45 ceci/ceci<pn><nt>
  -0.91 fallait/falloir<vb><imperf><p3><sg>
Spanish:
  +0.83 liberal/liberal<adj><sg>
  +0.40 como/comer<vb><pri><p1><sg>
  +0.10 encuentra/encontrar<vb><pri><p3><sg>
  -0.05 está/estar<vb><pri><p3><sg>
  -0.33 pensando/pensar<vb><gerund>
  -0.73 quedado/quedar<vb><pp><m><sg>

- + weights: common words, short/regular forms, cognates
- - weights: rare words, long/irregular forms, advanced grammar

User Experiment
- 50% of students kept the Leitner model; 50% were given the HLR model (6 weeks; 1M students)
- engagement = % of students who returned the following day: HLR +0.3% (p = 0.03)
- short-term reactions (week of launch) ... long-term reactions (2 months later) ...
- culprit: extreme lexeme feature weights, e.g. rare words, irregulars, etc. that are hard on average (but not for advanced students)

User Experiment #2
- 50% of students given the HLR model; 50% given HLR -lexemes (2 weeks; 3M students)
- engagement = % of students who returned the following day:
  - Lessons: +1.7% (p < 0.001)
  - Practice: +9.5% (p < 0.001)
  - Anything: +12.0% (p < 0.001)

Data + Code
- github.com/duolingo/halflife-regression
- ... so many emails ... about other (better?) forgetting functions
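A toy version of that training loop follows. This is not the released implementation: it uses finite-difference gradients instead of the paper's analytic ones, and the hyperparameters (alpha, lam, lr) are illustrative.

```python
import math
import random

def hlr_loss(theta, session, alpha=0.01, lam=0.01):
    """One-session HLR objective: recall error + half-life error + L2 penalty."""
    p, delta, x = session
    p = min(max(p, 0.01), 0.99)                       # keep log2(p) finite
    h_theta = 2.0 ** sum(t * xi for t, xi in zip(theta, x))
    p_hat = 2.0 ** (-delta / h_theta)                 # forgetting-curve prediction
    h_obs = -delta / math.log2(p)                     # invert p = 2^(-delta/h)
    return ((p - p_hat) ** 2
            + alpha * (h_obs - h_theta) ** 2
            + lam * sum(t * t for t in theta))

def sgd(sessions, dim, lr=0.05, epochs=300, eps=1e-5, seed=0):
    """Minimize the summed HLR loss by SGD with finite-difference gradients."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(epochs):
        for s in rng.sample(sessions, len(sessions)):  # one shuffled pass
            for j in range(dim):
                bumped = list(theta)
                bumped[j] += eps
                grad = (hlr_loss(bumped, s) - hlr_loss(theta, s)) / eps
                theta[j] -= lr * grad
    return theta
```

For example, sessions generated from a true half-life of 4 days, ⟨p, Δ, x⟩ = ⟨2^(−Δ/4), Δ, [1.0]⟩, drive the single weight toward log₂ 4 = 2 (shrunk somewhat by the L2 penalty).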
Other Forgetting Functions
- power: p̂ = α·Δ^(−β)
- logarithm: p̂ = α − β·log(Δ)
- expsqrt: p̂ = α·exp(−β·√Δ)
- HLR: p̂ = α·exp(−β·Δ); for HLR, β is learned but α = 1
- α (intercept): the highest student recall rate that seems likely
- β (forgetting rate): how fast the prediction decays
- following Ridgeway + Mozer (2016), we learn to personalize α (a logistic function of features), but fit a global β

Historical Log Data Evaluation
[scatter plot: MAE (lower is better) vs. correlation (higher is better); power, logarithm, and expsqrt all look better than HLR on historical log data ... better?]

User Experiment #3 (3 months; 4.1M new students), differences relative to the HLR control:

               power    logarithm   expsqrt
  retention    -0.3%    +0.2%       +0.1%
  # sessions   -3.4%    -3.2%       -3.1%
  revenue      -4.1%    -3.0%       -5.0%
  quiz score   -0.8%    -8.8%       -4.0%

Other Projects in Personalized Learning

Levels
- an alternative to the strength meter mechanism
- each skill can teach the same content at different difficulty levels, e.g.:
  - progress from receptive → productive
  - progress from written → verbal
- currently requires re-doing lessons, but in the future will level up automatically and tailor content based on learner modeling

Learned Variable Rewards
- learn to predict how much bonus time to award for correct answers
- +4.8%*** practice session activity
- +19.4%*** practice session completion

Data-Driven Curriculum Design
[word cloud: vocabulary items and morphological feature tags (e.g. Case=Acc, VerbForm=Inf, Mood=Imp) across courses, with markers like "from RUS" and "from ARA"]
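Returning to the forgetting functions compared above, here is a side-by-side sketch (the α and β values are illustrative, not the fitted ones):

```python
import math

# Each maps lag time delta (in days) to a predicted recall rate;
# real use would clip predictions into [0, 1].
def power_curve(delta, a=0.95, b=0.2):
    return a * delta ** (-b)

def log_curve(delta, a=0.95, b=0.1):
    return a - b * math.log(delta)

def expsqrt_curve(delta, a=0.95, b=0.3):
    return a * math.exp(-b * math.sqrt(delta))

def hlr_curve(delta, b=0.1):
    return math.exp(-b * delta)   # alpha fixed at 1 for HLR

for delta in (1.0, 7.0, 30.0):
    print(delta, [round(f(delta), 3)
                  for f in (power_curve, log_curve, expsqrt_curve, hlr_curve)])
```

Note the deck's punchline: the three alternatives looked better than HLR on historical log data, yet did worse in the live experiment, a useful caution about offline metrics.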