A Trainable Spaced Repetition Model for Language Learning
Burr Settles∗
Duolingo
Pittsburgh, PA USA

Brendan Meeder†
Uber Advanced Technologies Center
Pittsburgh, PA USA

∗ Corresponding author.
† Research conducted at Duolingo.

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1848–1858, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Abstract

We present half-life regression (HLR), a novel model for spaced repetition practice with applications to second language acquisition. HLR combines psycholinguistic theory with modern machine learning techniques, indirectly estimating the “half-life” of a word or concept in a student’s long-term memory. We use data from Duolingo, a popular online language learning application, to fit HLR models, reducing error by 45%+ compared to several baselines at predicting student recall rates. HLR model weights also shed light on which linguistic concepts are systematically challenging for second language learners. Finally, HLR was able to improve Duolingo daily student engagement by 12% in an operational user study.

1 Introduction

The spacing effect is the observation that people tend to remember things more effectively if they use spaced repetition practice (short study periods spread out over time) as opposed to massed practice (i.e., “cramming”). The phenomenon was first documented by Ebbinghaus (1885), using himself as a subject in several experiments to memorize verbal utterances. In one study, after a day of cramming he could accurately recite 12-syllable sequences (of gibberish, apparently). However, he could achieve comparable results with half as many practices spread out over three days.

The lag effect (Melton, 1970) is the related observation that people learn even better if the spacing between practices gradually increases. For example, a learning schedule might begin with review sessions a few seconds apart, then minutes, then hours, days, months, and so on, with each successive review stretching out over a longer and longer time interval.

The effects of spacing and lag are well-established in second language acquisition research (Atkinson, 1972; Bloom and Shuell, 1981; Cepeda et al., 2006; Pavlik Jr and Anderson, 2008), and benefits have also been shown for gymnastics, baseball pitching, video games, and many other skills. See Ruth (1928), Dempster (1989), and Donovan and Radosevich (1999) for thorough meta-analyses spanning several decades.

Most practical algorithms for spaced repetition are simple functions with a few hand-picked parameters. This is reasonable, since they were largely developed during the 1960s–80s, when people would have had to manage practice schedules without the aid of computers. However, the recent popularity of large-scale online learning software makes it possible to collect vast amounts of parallel student data, which can be used to empirically train richer statistical models.

In this work, we propose half-life regression (HLR) as a trainable spaced repetition algorithm, marrying psycholinguistically-inspired models of memory with modern machine learning techniques. We apply this model to real student learning data from Duolingo, a popular language learning app, and use it to improve its large-scale, operational, personalized learning system.

2 Duolingo

Duolingo is a free, award-winning, online language learning platform. Since launching in 2012, more than 150 million students from all over the world have enrolled in a Duolingo course, either via the website¹ or mobile apps for Android, iOS, and Windows devices. For comparison, that is more than the total number of students in U.S. elementary and secondary schools combined. At least 80 language courses are currently available or under development² for the Duolingo platform. The most popular courses are for learning English, Spanish, French, and German, although there are also courses for minority languages (Irish Gaelic), and even constructed languages (Esperanto).

More than half of Duolingo students live in developing countries, where Internet access has more than tripled in the past three years (ITU and UNESCO, 2015). The majority of these students are using Duolingo to learn English, which can significantly improve their job prospects and quality of life (Pinon and Haydon, 2010).

¹ https://www.duolingo.com
² https://incubator.duolingo.com

2.1 System Overview

Duolingo uses a playfully illustrated, gamified design that combines point-reward incentives with implicit instruction (DeKeyser, 2008), mastery learning (Block et al., 1971), explanations (Fahy, 2004), and other best practices. Early research suggests that 34 hours of Duolingo is equivalent to a full semester of university-level Spanish instruction (Vesselinov and Grego, 2012).

Figure 1(a) shows an example skill tree for English speakers learning French. This specifies the game-like curriculum: each icon represents a skill, which in turn teaches a set of thematically or grammatically related words or concepts. Students tap an icon to access lessons of new material, or to practice previously-learned material. Figure 1(b) shows a screen for the French skill Gerund, which teaches common gerund verb forms such as faisant (doing) and étant (being). This skill, as well as several others, has already been completed by the student. However, the Measures skill in the bottom right of Figure 1(a) has one lesson remaining. After completing each row of skills, students “unlock” the next row of more advanced skills. This is a game-like implementation of mastery learning, whereby students must reach a certain level of prerequisite knowledge before moving on to new material.

Figure 1: Duolingo screenshots for an English-speaking student learning French (iPhone app, 2016). Panels: (a) skill tree screen, (b) skill screen, (c) correct response, (d) incorrect response. (a) A course skill tree: golden skills have four bars and are “at full strength,” while other skills have fewer bars and are due for practice. (b) A skill screen detail (for the Gerund skill), showing which words are predicted to need practice. (c,d) Grading and explanations for a translation exercise.

Each language course also contains a corpus (large database of available exercises) and a lexeme tagger (statistical NLP pipeline for automatically tagging and indexing the corpus; see the Appendix for details and a lexeme tag reference). Figure 1(c,d) shows an example translation exercise that might appear in the Gerund skill, and Figure 2 shows the lexeme tagger output for this sentence. Since this exercise is indexed with a gerund lexeme tag (être.V.GER in this case), it is available for lessons or practices in this skill.

étant     être.V.GER
un        un.DET.INDF.M.SG
enfant    enfant.N.SG
il        il.PN.M.P3.SG
est       être.V.PRES.P3.SG
petit     petit.ADJ.M.SG

Figure 2: The French sentence from Figure 1(c,d) and its lexeme tags. Tags encode the root lexeme, part of speech, and morphological components (tense, gender, person, etc.) for each word in the exercise.

The lexeme tagger also helps to provide corrective feedback. Educational researchers maintain that incorrect answers should be accompanied by explanations, not simply a “wrong” mark (Fahy, 2004). In Figure 1(d), the student incorrectly used the 2nd-person verb form es (être.V.PRES.P2.SG) instead of the 3rd-person est (être.V.PRES.P3.SG). If Duolingo is able to parse the student response and detect a known grammatical mistake such as this, it provides an explanation³ in plain language. Each lesson continues until the student masters all of the target words being taught in the session, as estimated by a mixture model of short-term learning curves (Streeter, 2015).

2.2 Spaced Repetition and Practice

Once a lesson is completed, all the target words being taught in the lesson are added to the student model. This model captures what the student has learned, and estimates how well she can recall this knowledge at any given time. Spaced repetition is a key component of the student model: over time, the strength of a skill will decay in the student’s long-term memory, and this model helps the student manage her practice schedule.

Duolingo uses strength meters to visualize the student model, as seen beneath each of the completed skill icons in Figure 1(a).

... some weak words from the Gerund skill. Practice sessions are identical to lessons, except that the exercises are taken from those indexed with words (lexeme tags) due for practice according to the student model. As time passes, strength meters continuously update and decay until the student practices.

3 Spaced Repetition Models

In this section, we describe several spaced repetition algorithms that might be incorporated into our student model. We begin with two common, established methods in language learning technology, and then present our half-life regression model, which is a generalization of them.

3.1 The Pimsleur Method

Pimsleur (1967) was perhaps the first to make mainstream practical use of the spacing and lag effects, with his audio-based language learning program (now a franchise by Simon & Schuster). He referred to his method as graduated-interval recall, whereby new vocabulary is introduced and then tested at exponentially increasing intervals, interspersed with the introduction or review of other vocabulary. However, this approach is limited since the schedule is pre-recorded and cannot adapt to the learner’s actual ability. Consider an English-speaking French student who easily learns a cognate like pantalon (pants), but struggles to remember manteau (coat). With the Pimsleur method, she is forced to practice both words at the same fixed, increasing schedule.

3.2 The Leitner System

Leitner (1972) proposed a different spaced repetition algorithm intended for use with flashcards. It is more adaptive than Pimsleur’s, since the spacing intervals can increase or decrease depending