Improving Language Learning + Assessment with Data
Burr Settles, staff scientist + engineer

• launched in 2012 (CMU spinoff)
• more than 200 million students globally
• currently 76 courses (incl. High Valyrian)
• expanding to 95+ courses (incl. Klingon)
• 100% FREE
GOOGLE: Best of the Best · APPLE: App of the Year · TECHCRUNCH: Education Startup of the Year

1,200,000,000 people are learning a second language (~16% of the world's population)
~800M satisfy three properties:
• learning English
• in a developing country
• to gain more opportunity
(Source: British Council)

86% have mobile device access; 64% have toilet access
(Source: U.N. Report, 2015)

Bassin Caiman, Haiti: access, personalization, feedback

34 hours of Duolingo is equivalent to one university semester
(Vesselinov + Grego, 2012)

12 weeks of self-study w/ Duolingo is as effective as traditional face-to-face instruction
(Rachels + Rockinson-Szapkiw, 2017)

#1 language-learning platform: "duolingo" vs. "rosetta stone", "babbel", "busuu", "pimsleur"

began using empirical ML/NLP methods

a key aspect of teaching people: forgetting
Modeling Learning + Forgetting (Settles + Meeder, ACL 2016)

Duolingo uses strength meters that remind students to practice, and prioritizes words in their practice sessions, based on a statistical model of word strength (e.g., for étant/être).

Of course, we want this model to be personalized for each student!

The Spacing Effect (Ebbinghaus, 1885; Atkinson, 1972; Bloom + Shuell, 1981)

people learn better if practice is spaced over long intervals (instead of "cramming")

The Forgetting Curve (Ebbinghaus, 1885)
[figure: forgetting curve, recall probability decaying over time]

the probability p of a correct answer as a function of:
• the time Δ since the last practice
• the half-life h in the user's memory

p = 2^(−Δ/h)

The Lag Effect (Melton, 1970; Scarborough, 1977)
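The forgetting curve is a one-liner in code. A minimal sketch (the function name is illustrative):

```python
def recall_probability(delta_days: float, half_life_days: float) -> float:
    """Forgetting curve p = 2^(-delta/h): recall probability decays
    exponentially, halving after each half-life h."""
    return 2.0 ** (-delta_days / half_life_days)

# With a 2-day half-life:
# recall_probability(0, 2) -> 1.0    (just practiced)
# recall_probability(2, 2) -> 0.5    (one half-life later)
# recall_probability(4, 2) -> 0.25   (two half-lives later)
```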
people learn even better if the spacing between practices gradually increases (i.e., half-life increases with more practice)

The Pimsleur Method (Pimsleur, 1967)

first mainstream application of the spacing and lag effects
• vocabulary is reviewed over exponentially increasing half-life intervals
• drawback: fixed schedule (not adaptive/personalized)

The Leitner System (Leitner, 1972)
1st Duolingo student model: the Leitner adaptive flashcard algorithm
• correctly-answered items move to boxes with longer half-lives (h: 1, 2, 4, 8, 16 days)
• incorrectly-answered items move back toward shorter ones

observation: half-life can be formalized into the equation:

ĥ = 2^x   (x = # correctly-answered − # incorrectly-answered items)

or, more generally:

ĥ_Θ = 2^(Θ·x)

for model weights Θ and a feature vector x

Half-life Regression (HLR)

each session leaves a trace of what the student recalled correctly:
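The generalized half-life equation ĥ_Θ = 2^(Θ·x) reduces to the Leitner rule when x has a single feature, the number of correct minus incorrect answers, with weight 1. A minimal sketch (names and values are illustrative):

```python
def estimated_half_life(weights, features):
    """HLR half-life estimate: h_hat = 2^(Theta . x)."""
    return 2.0 ** sum(w * f for w, f in zip(weights, features))

# Leitner as a special case: a single feature x = #correct - #incorrect
# with weight 1 recovers the box schedule 1, 2, 4, 8, 16, ... days:
# estimated_half_life([1.0], [3 - 1]) -> 4.0
```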
can we learn model weights Θ from these traces?

Half-life Regression (HLR)

each ⟨p, Δ, x⟩ is a student/lexeme practice session, where:
• p = recall rate (proportion correct) in the session
• Δ = lag time (in days) since the last session
• x = feature vector summarizing the practice history (e.g., correct = 2, incorrect = 1, lexeme = étant)

[figure: recall rate vs. time in days for one student/lexeme pair; each session is marked ✖ and annotated with its p, Δ, and x values]

Pimsleur + Leitner are special cases of this model!
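Inverting the forgetting curve p = 2^(−Δ/h) turns each observed ⟨p, Δ⟩ pair into an observed half-life h = −Δ/log₂(p), which serves as a regression target. A minimal sketch (the clipping bounds here are assumptions for illustration):

```python
import math

MIN_HALF_LIFE = 15.0 / (24 * 60)   # 15 minutes, in days (assumed bound)
MAX_HALF_LIFE = 274.0              # ~9 months, in days (assumed bound)

def observed_half_life(p: float, delta_days: float) -> float:
    """Invert p = 2^(-delta/h) to get h = -delta / log2(p).
    p is clipped away from 0 and 1 so the logarithm stays finite."""
    p = min(max(p, 0.0001), 0.9999)
    h = -delta_days / math.log2(p)
    return min(max(h, MIN_HALF_LIFE), MAX_HALF_LIFE)

# 50% recall after a 4-day gap implies a 4-day half-life:
# observed_half_life(0.5, 4.0) -> 4.0
```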
goal: learn weights Θ such that model predictions fit the training set D = {⟨p, Δ, x⟩ᵢ} for i = 1…D

Half-life Regression (HLR)

objective function:

Θ* = argmin_Θ Σᵢ₌₁…D [ (pᵢ − p̂_Θ)² + α(hᵢ − ĥ_Θ)² ] + λ‖Θ‖₂²
(recall error term + est. half-life error term + L2 penalty term)

where:

p̂_Θ = 2^(−Δ/2^(Θ·x))   and   h = −Δ / log₂(p)

optimize Θ using stochastic gradient descent
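The fit can be sketched as a plain SGD loop over traces (a minimal illustration, not the production trainer; the hyperparameter values are made up):

```python
import math
import random

LN2 = math.log(2)

def train_hlr(traces, dim, alpha=0.01, lam=0.1, lr=0.001, epochs=10, seed=0):
    """Fit HLR weights Theta by stochastic gradient descent on
    (p - p_hat)^2 + alpha*(h - h_hat)^2 + lam*||Theta||^2,
    where h_hat = 2^(Theta . x) and p_hat = 2^(-delta / h_hat).

    Each trace is (p, delta, x, h): recall rate, lag time in days,
    feature vector, and observed half-life.
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(traces)
        for p, delta, x, h in traces:
            h_hat = 2.0 ** sum(t * xi for t, xi in zip(theta, x))
            p_hat = 2.0 ** (-delta / h_hat)
            for k in range(dim):
                # gradient via the chain rule through h_hat = 2^(Theta . x)
                grad = (2.0 * (p_hat - p) * p_hat * LN2 ** 2 * (delta / h_hat) * x[k]
                        + 2.0 * alpha * (h_hat - h) * LN2 * h_hat * x[k]
                        + 2.0 * lam * theta[k])
                theta[k] -= lr * grad
    return theta
```

On a toy trace set where the half-life doubles with each net correct answer, the learned weight on a (#correct − #incorrect) feature comes out positive, recovering Leitner-like behavior.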
Historical Log Data Evaluation

[bar chart: MAE (mean absolute error; lower is better) for spaced repetition heuristics (Pimsleur, Leitner), logistic regression variants (LR, LR -lex), and HLR variants (HLR, HLR -lex, HLR -h, HLR -lex-h)]

• lexeme features help a little bit
• the half-life objective term helps a lot!
• only the full HLR variants beat a fixed baseline (p̄ = 0.89)!

more metrics + results in our ACL'16 paper…

Example Θ Weights
+0.77 camera/camera
+0.94 visite/visiter

+ weights: common words; short/regular forms; cognates
− weights: rare words; long/irregular forms; advanced grammar

User Experiment
50% of students kept the Leitner model; 50% of students were given the HLR model (6 weeks; 1M students)

engagement = % of students who returned the following day: HLR +0.3% (p = 0.03)

User Experiment: short-term reactions (week of launch)…

User Experiment: long-term reactions (2 months later)…

User Experiment
• culprit: extreme lexeme feature weights
• e.g., rare words, irregulars, etc. that are hard on average (but not for advanced students)

User Experiment #2
50% of students were given the HLR model; 50% were given HLR -lexemes (2 weeks; 3M students)

engagement = % of students who returned the following day:
• Lessons: +1.7% (p < 0.001)
• Practice: +9.5% (p < 0.001)
• Anything: +12.0% (p < 0.001)

Data + Code: github.com/duolingo/halflife-regression
… so many emails …

Other Forgetting Functions

• other (better?) forgetting functions ≈ the HLR forgetting function