Enhancing human learning via spaced repetition optimization

Behzad Tabibian (a,b,1), Utkarsh Upadhyay (a), Abir De (a), Ali Zarezade (a), Bernhard Schölkopf (b), and Manuel Gomez-Rodriguez (a)

(a) Networks Learning Group, Max Planck Institute for Software Systems, 67663 Kaiserslautern, Germany; and (b) Empirical Inference Department, Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany

Edited by Richard M. Shiffrin, Indiana University, Bloomington, IN, and approved December 14, 2018 (received for review September 3, 2018)

Spaced repetition is a technique for efficient memorization which uses repeated review of content following a schedule determined by a spaced repetition algorithm to improve long-term retention. However, current spaced repetition algorithms are simple rule-based heuristics with a few hard-coded parameters. Here, we introduce a flexible representation of spaced repetition using the framework of marked temporal point processes and then address the design of spaced repetition algorithms with provable guarantees as an optimal control problem for stochastic differential equations with jumps. For two well-known human memory models, we show that, if the learner aims to maximize the recall probability of the content to be learned subject to a cost on the reviewing frequency, the optimal reviewing schedule is given by the recall probability itself. As a result, we can develop a simple, scalable online spaced repetition algorithm, MEMORIZE, to determine the optimal reviewing times. We then perform a large-scale natural experiment using data from Duolingo, a popular language-learning online platform, and show that learners who follow a reviewing schedule determined by our algorithm memorize more effectively than learners who follow alternative schedules determined by several heuristics.

spaced repetition | stochastic optimal control | marked temporal point processes | human learning

Significance

Understanding human memory has been a long-standing problem in various scientific disciplines. Early works focused on characterizing human memory using small-scale controlled experiments, and these empirical studies later motivated the design of spaced repetition algorithms for efficient memorization. However, current spaced repetition algorithms are rule-based heuristics with hard-coded parameters, which do not leverage the automated fine-grained monitoring and greater degree of control offered by modern online learning platforms. In this work, we develop a computational framework to derive optimal spaced repetition algorithms, specially designed to adapt to the learners' performance. A large-scale natural experiment using data from a popular language-learning online platform provides empirical evidence that spaced repetition algorithms derived using our framework are significantly superior to alternatives.

Author contributions: B.T., U.U., B.S., and M.G.-R. designed research; B.T., U.U., A.D., A.Z., and M.G.-R. performed research; B.T. and U.U. analyzed data; and B.T., U.U., and M.G.-R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY).

Data deposition: The MEMORIZE algorithm has been deposited on GitHub (https://github.com/Networks-Learning/memorize). Deposited January 8, 2019.

1 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1815156116/-/DCSupplemental.

Our ability to remember a piece of information depends critically on the number of times we have reviewed it, the temporal distribution of the reviews, and the time elapsed since the last review, as first shown by a seminal study by Ebbinghaus (1). The effect of these two temporal factors has been extensively investigated in the experimental psychology literature (2, 3), particularly in second language acquisition research (4–7). Moreover, these empirical studies have motivated the use of flashcards, small pieces of information a learner repeatedly reviews following a schedule determined by a spaced repetition algorithm (8), whose goal is to ensure that learners spend more (less) time working on forgotten (recalled) information.

In recent years, spaced repetition software and online platforms such as Mnemosyne (mnemosyne-proj.org), Synap (www.synap.ac), and Duolingo (www.duolingo.com) have become increasingly popular, often replacing the use of physical flashcards. The promise of these platforms is that automated fine-grained monitoring and a greater degree of control will result in more effective spaced repetition.

The task of designing spaced repetition algorithms has a rich history, starting with the Leitner system (9). More recently, several works (10, 11) have proposed heuristic algorithms that schedule reviews just as the learner is about to forget an item, i.e., when the probability of recall, as given by a memory model of choice (1, 12), falls below a threshold. An orthogonal line of research (7, 13) has pursued locally optimal scheduling by identifying which item would benefit most from a review at a fixed reviewing time. In doing so, researchers have proposed heuristic algorithms that greedily decide which item to review by selecting the item which is closest to its maximum learning rate.

However, most of the above algorithms are simple rule-based heuristics with a few hard-coded parameters (8), which are unable to fulfill this promise—adaptive, data-driven algorithms with provable guarantees have been largely missing until very recently (14, 15). Among these notable exceptions, the work most closely related to ours is by Reddy et al. (15), who proposed a queueing network model for a particular spaced repetition method—the Leitner system (9) for reviewing flashcards—and then developed a heuristic approximation algorithm for scheduling reviews. However, their heuristic does not have provable guarantees, it does not adapt to the learner's performance over time, and it is specially designed for the Leitner system.

In this work, we develop a computational framework to derive optimal spaced repetition algorithms, specially designed to adapt to the learner's performance, as monitored by modern spaced repetition software and online learning platforms. More specifically, we first introduce a flexible representation of spaced repetition using the framework of marked temporal point processes (16). For several well-known human memory models (1, 12, 17–19), we use this representation to express the dynamics of a learner's forgetting rates and recall probabilities for the content to be learned by means of a set of stochastic differential equations (SDEs) with jumps. Then, we can find the optimal reviewing schedule by solving a stochastic optimal control problem for SDEs with jumps (20–23).

In doing so, we need to introduce a proof technique to find a solution to the so-called Hamilton–Jacobi–Bellman (HJB) equation (SI Appendix, sections 3 and 4), which is of independent interest.

For two well-known memory models, we show that, if the learner aims to maximize the recall probability of the content to be learned subject to a cost on the reviewing frequency, the solution uncovers a linear relationship with a negative slope between the optimal rate of reviewing, or reviewing intensity, and the recall probability of the content to be learned. As a consequence, we can develop a simple, scalable online spaced repetition algorithm, which we name MEMORIZE, to determine the optimal reviewing times. Finally, we perform a large-scale natural experiment using data from Duolingo, a popular language-learning online platform, and show that learners who follow a reviewing schedule determined by our algorithm memorize more effectively than learners who follow alternative schedules determined by several heuristics. To facilitate research in this area, we are releasing an open-source implementation of our algorithm (24).

Modeling Framework of Spaced Repetition. Our framework is agnostic to the particular choice of memory model—it provides a set of techniques to find reviewing schedules that are optimal under a given memory model. Here, for ease of exposition, we showcase our framework for one well-known memory model from the psychology literature, the exponential forgetting curve model with binary recalls (1, 17), and use (a variant of) a recent machine-learning method, half-life regression (25), to estimate the effect of the reviews on the parameters of such a model. [In SI Appendix, sections 6 and 7, we apply our framework to two other popular memory models, the power-law forgetting curve model (18, 19) and the multiscale context model (MCM) (12).]

More specifically, given a learner who wants to memorize a set of items I using spaced repetition, i.e., repeated, spaced review of the items, we represent each reviewing event as a triplet

e := (i, t, r),

which means that the learner reviewed item i ∈ I at time t and either recalled it (r = 1) or forgot it (r = 0). Here, note that each reviewing event includes the outcome of a test (i.e., a recall) since, in most spaced repetition software and online platforms such as Mnemosyne, Synap, and Duolingo, the learner is tested in each review, following the seminal work of Roediger and Karpicke (26).

Given the above representation, we model the probability that the learner recalls (forgets) item i at time t using the exponential forgetting curve model; i.e.,

m_i(t) := P[r_i(t) = 1] = exp(−n_i(t)(t − t_r)),   [1]

where t_r is the time of the last review and n_i(t) ∈ R+ is the forgetting rate at time t, which may depend on many factors, e.g., item difficulty and/or the number of previous (un)successful recalls of the item. [Previous work often uses the inverse of the forgetting rate, referred to as memory strength or half-life, s(t) = n(t)^{−1} (15, 25). However, it is more tractable for us to work in terms of forgetting rates.] Then, we keep track of the reviewing times using a multidimensional counting process N(t), in which the ith entry, N_i(t), counts the number of times the learner has reviewed item i up to time t. Following the literature on temporal point processes (16), we characterize these counting processes using their corresponding intensities u(t), i.e., E[dN(t)] = u(t)dt, and think of the recalls r as their binary marks.

Moreover, every time that a learner reviews an item, the recall r has been experimentally shown to have an effect on the forgetting rate of the item (3, 15, 25). Here, we estimate such an effect using half-life regression (25), which implicitly assumes that recalls of an item i during a review have a multiplicative effect on the forgetting rate n_i(t)—a successful recall at time t_r changes the forgetting rate by (1 − α_i), i.e., n_i(t) = (1 − α_i)n_i(t_r), α_i ≤ 1, while an unsuccessful recall changes the forgetting rate by (1 + β_i), i.e., n_i(t) = (1 + β_i)n_i(t_r), β_i ≥ 0. In this context, the initial forgetting rate, n_i(0), captures the difficulty of the item, with more difficult items having higher initial forgetting rates compared with easier items, and the parameters α_i, β_i, and n_i(0) are estimated using real data (refer to SI Appendix, section 8 for more details).

Before we proceed farther, we acknowledge that several laboratory studies (6, 27) have provided empirical evidence that the retention rate follows an inverted U shape, i.e., massed practice does not improve the forgetting rate—if an item is in a learner's short-term memory when the review happens, the long-term retention does not improve. Thus, one could argue for time-varying parameters α_i(t) and β_i(t) in our framework. However, there are several reasons that prevent us from doing that: (i) The derivation of an optimal reviewing schedule under time-varying parameters becomes very challenging; (ii) for the reviewing sequences in our Duolingo dataset, allowing for time-varying α_i and β_i in our modeling framework does not lead to more accurate recall predictions (SI Appendix, section 9); and (iii) several popular spaced repetition heuristics, such as the Leitner system with exponential spacing and SuperMemo, have achieved reasonable success in practice despite implicitly assuming constant α_i and β_i. [The Leitner system with exponential spacing can be explicitly cast using our formulation with particular choices of α_i and β_i and the same initial forgetting rate, n_i(0) = n(0), for all items (SI Appendix, section 11).] That being said, it would be an interesting venue for future work to derive optimal reviewing schedules under time-varying parameters.

Next, we express the dynamics of the forgetting rate n_i(t) and the recall probability m_i(t) for each item i ∈ I using SDEs with jumps. This is very useful for the design of our spaced repetition algorithm using stochastic optimal control. More specifically, the dynamics of the forgetting rate n_i(t) are readily given by

dn_i(t) = −α_i n_i(t) r_i(t) dN_i(t) + β_i n_i(t)(1 − r_i(t)) dN_i(t),   [2]

where N_i(t) is the corresponding counting process and r_i(t) ∈ {0, 1} indicates whether item i has been successfully recalled at time t. Similarly, the dynamics of the recall probability m_i(t) are given by Proposition 1 (proved in SI Appendix, section 1):

Proposition 1. Given an item i ∈ I with reviewing intensity u_i(t), the recall probability m_i(t), defined by Eq. 1, is a Markov process whose dynamics can be defined by the following SDE with jumps,

dm_i(t) = −n_i(t) m_i(t) dt + (1 − m_i(t)) dN_i(t),   [3]

where N_i(t) is the counting process associated with the reviewing intensity u_i(t).†

Finally, given a set of items I, we cast the design of a spaced repetition algorithm as the search for the optimal item reviewing intensities u(t) = [u_i(t)]_{i∈I} that minimize the expected value of a particular (convex) loss function ℓ(m(t), n(t), u(t)) of the recall probabilities of the items, m(t) = [m_i(t)]_{i∈I}; the forgetting rates, n(t) = [n_i(t)]_{i∈I}; and the reviewing intensities themselves, u(t). The loss function is nonincreasing (nondecreasing) with respect to the recall probabilities (forgetting rates and intensities) so that it rewards recalls and long-lasting learning while limiting the number of item reviews.

† To derive Eq. 3, we assume that the recall probability m_i(t) is set to 1 every time item i is reviewed. Here, one may also account for item difficulty by considering that, for more difficult items, the recall probability is set to o_i ∈ [0, 1) every time item i is reviewed.
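Concretely, the per-item memory dynamics of Eqs. 1–3 can be simulated in a few lines of Python. The following is a minimal sketch of ours, not the released implementation; the class and parameter names (ItemMemory, n0, alpha, beta) are hypothetical stand-ins for n_i(0), α_i, and β_i, which in practice would be estimated with half-life regression as described above.

```python
import math
import random

class ItemMemory:
    """Single-item memory state under the exponential forgetting curve
    (Eq. 1) with multiplicative half-life-regression updates (Eq. 2)."""

    def __init__(self, n0, alpha, beta):
        self.n = n0          # forgetting rate n_i(t); n0 encodes item difficulty
        self.alpha = alpha   # successful recall shrinks n by (1 - alpha)
        self.beta = beta     # unsuccessful recall grows n by (1 + beta)
        self.t_last = 0.0    # time of the last review, t_r

    def recall_probability(self, t):
        # m_i(t) = exp(-n_i(t) * (t - t_r))   (Eq. 1)
        return math.exp(-self.n * (t - self.t_last))

    def review(self, t):
        # Draw the test outcome: r = 1 with probability m_i(t).
        r = 1 if random.random() < self.recall_probability(t) else 0
        # Jump update of the forgetting rate (Eq. 2).
        self.n *= (1.0 - self.alpha) if r == 1 else (1.0 + self.beta)
        self.t_last = t      # per Eq. 3, m_i resets to 1 right after a review
        return r
```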

We can write the spaced repetition problem over a time window (t0, tf] as

minimize_{u(t0, tf]}   E[ φ(m(tf), n(tf)) + ∫_{t0}^{tf} ℓ(m(τ), n(τ), u(τ)) dτ ]
subject to   u(t) ≥ 0   ∀ t ∈ (t0, tf],   [4]

where u(t0, tf] denotes the item reviewing intensities from t0 to tf, the expectation is taken over all possible realizations of the associated counting processes N(t) and recall outcomes r(t), and φ(m(tf), n(tf)) is an arbitrary penalty function. Here, note that the forgetting rates n(t) and recall probabilities m(t) depend on the reviewing intensities u(t) through Eqs. 2 and 3, and this dependence is necessary to derive the optimal reviewing intensities.

The MEMORIZE Algorithm. The spaced repetition problem, as defined by Eq. 4, can be tackled from the perspective of stochastic optimal control of jump SDEs (20). Here, we first derive a solution to the problem considering only one item, provide an efficient practical implementation of the solution, and then generalize it to the case of multiple items. To do so, we use Bellman's principle of optimality, derive the corresponding HJB equation (28), and exploit the unique structure of the HJB equation to find the optimal solution to the problem.

Given an item with reviewing intensity u(t), recall outcomes r(t), and associated counting process N(t), we can define its forgetting rate n(t) and recall probability m(t) using Eqs. 2 and 3, i.e., by the following two coupled jump SDEs,

dn(t) = −α n(t) r(t) dN(t) + β n(t)(1 − r(t)) dN(t)
dm(t) = −n(t) m(t) dt + (1 − m(t)) dN(t),   [5]

with initial conditions n(t0) = n0 and m(t0) = m0. Next, we define an optimal cost-to-go function J for the above problem.

Definition 2: The optimal cost-to-go J(m(t), n(t), t) is defined as the minimum of the expected value of the cost of going from state (m(t), n(t)) at time t to the final state at time tf:

J(m(t), n(t), t) = min_{u(t, tf]} E[ φ(m(tf), n(tf)) + ∫_{t}^{tf} ℓ(m(τ), n(τ), u(τ)) dτ ].   [6]

Now, we use Bellman's principle of optimality, which allows us to break the above problem into smaller subproblems [Bellman's principle of optimality readily follows using the Markov property of the recall probability m(t) and the forgetting rate n(t)]; i.e.,

J(m(t), n(t), t) = min_{u(t, t+dt]} { E[J(m(t + dt), n(t + dt), t + dt)] + ℓ(m(t), n(t), u(t)) dt }.

Then, to derive the HJB equation, we differentiate the above expression with respect to time t and obtain

0 = J_t(m(t), n(t), t) − J_m(m(t), n(t), t) n(t) m(t)
    + min_{u(t)} { ℓ(m(t), n(t), u(t)) + [J(1, (1 − α)n(t), t) m(t) + J(1, (1 + β)n(t), t)(1 − m(t)) − J(m(t), n(t), t)] u(t) }.   [7]

To solve the above equation, we need to define the loss ℓ. Following the literature on stochastic optimal control (28), we consider the following quadratic form, which is nonincreasing (nondecreasing) with respect to the recall probability (forgetting rate and intensity) so that it rewards learning while limiting the number of item reviews:

ℓ(m(t), n(t), u(t)) = (1/2)(1 − m(t))² + (1/2) q u²(t),   [8]

where q is a given parameter, which trades off recall probability and number of item reviews—the higher its value, the lower the number of reviews. Note that this particular choice of loss function does not directly place a hard constraint on the number of reviews; instead, it limits the number of reviews by penalizing high values of the reviewing intensities. (Given a desired average number of reviews, the value of the parameter q can be easily found by simulation since, in practice, the average number of reviews decreases monotonically with respect to q.)

Under these definitions, we can find the relationship between the optimal intensity and the optimal cost-to-go by taking the derivative with respect to u(t) in Eq. 7:

u*(t) = q^{−1} [J(m(t), n(t), t) − J(1, (1 − α)n(t), t) m(t) − J(1, (1 + β)n(t), t)(1 − m(t))].   [9]

Finally, we plug the above equation into Eq. 7 and find that the optimal cost-to-go J needs to satisfy the following nonlinear differential equation:

0 = J_t(m(t), n(t), t) − J_m(m(t), n(t), t) n(t) m(t) + (1/2)(1 − m(t))²
    − (1/(2q)) [J(m(t), n(t), t) − J(1, (1 − α)n(t), t) m(t) − J(1, (1 + β)n(t), t)(1 − m(t))]².

To continue farther, we rely on a technical lemma, Lemma 1 (SI Appendix, section 3), and on Lemma 2 (proved in SI Appendix, section 2), which derives the optimal cost-to-go J. Using Lemma 2, the optimal reviewing intensity is readily given by Theorem 3 (proved in SI Appendix, section 4):

Theorem 3. Given a single item, the optimal reviewing intensity for the spaced repetition problem, defined by Eq. 4, under the quadratic loss ℓ defined by Eq. 8, is given by

u*(t) = q^{−1/2} (1 − m(t)).

Note that the optimal intensity only depends on the recall probability, whose dynamics are given by Eqs. 2 and 3, and thus allows for a very efficient procedure to sample reviewing times, which we name MEMORIZE. Algorithm 1 provides a pseudocode implementation of MEMORIZE. Within the algorithm, Sample(u(t)) samples from an inhomogeneous Poisson process with intensity u(t) and returns the sampled time, and ReviewItem(t) reviews the item at time t and returns the recall outcome r(t), where r(t) = 1 indicates the item was recalled successfully and r(t) = 0 indicates it was not recalled.

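Because the optimal intensity of Theorem 3 is bounded above by the constant q^{−1/2}, the thinning step of the sampling procedure reduces to a few lines. The sketch below is ours, not the released implementation (24); the function name and signature are hypothetical, and it assumes the exponential forgetting curve of Eq. 1.

```python
import math
import random

def memorize_next_review(q, n, t_last, t_now, horizon):
    """Sample the next review time of one item under the MEMORIZE intensity
    u(t) = q**-0.5 * (1 - m(t)), with m(t) = exp(-n * (t - t_last)), using
    standard thinning (29): u(t) never exceeds the constant bound q**-0.5."""
    u_max = q ** -0.5
    t = t_now
    while t < t_now + horizon:
        t += random.expovariate(u_max)      # candidate from rate-u_max process
        m = math.exp(-n * (t - t_last))     # recall probability at candidate
        if random.random() < (1.0 - m):     # accept w.p. u(t) / u_max
            return t
    return None                             # no review within the horizon
```

After each sampled review, the forgetting rate n would be updated as in Eq. 2, and q can be calibrated by simulation to meet a desired average number of reviews, as noted above.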
Note that, within Algorithm 1, t denotes a (time) parameter, s and t0 denote specific (time) values, and we sample from an inhomogeneous Poisson process using a standard thinning algorithm (29). [In some practical deployments, one may want to discretize the optimal intensity u(t) and, e.g., "at top of each hour, decide whether to do a review or not."]

Given a set of multiple items I with reviewing intensities u(t) and associated counting processes N(t), recall outcomes r(t), recall probabilities m(t), and forgetting rates n(t), we can solve the spaced repetition problem defined by Eq. 4 similarly as in the case of a single item. More specifically, consider the following quadratic form for the loss ℓ,

ℓ(m(t), n(t), u(t)) = (1/2) ∑_{i∈I} (1 − m_i(t))² + (1/2) ∑_{i∈I} q_i u_i²(t),

where {q_i}_{i∈I} are given parameters, which trade off recall probability and number of item reviews and may favor the learning of one item over another. Then, one can exploit the independence-among-items assumption to derive the optimal reviewing intensity for each item, proceeding similarly as in the case of a single item:

Theorem 4. Given a set of items I, the optimal reviewing intensity for each item i ∈ I in the spaced repetition problem, defined by Eq. 4, under quadratic loss is given by

u_i*(t) = q_i^{−1/2} (1 − m_i(t)).   [10]

Finally, note that we can easily sample item reviewing times simply by running |I| instances of MEMORIZE, one per item.
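For illustration, here is a minimal sketch of this per-item decomposition, reusing the hypothetical helpers from the earlier sketches; the per-item trade-off parameters q_i are assumed given.

```python
def next_reviews(items, t_now, horizon):
    """items maps an item id to a (q_i, ItemMemory) pair. One MEMORIZE
    sampler per item suffices because the quadratic loss in Theorem 4
    decouples across items (Eq. 10)."""
    schedule = {}
    for item_id, (q_i, mem) in items.items():
        t_next = memorize_next_review(q_i, mem.n, mem.t_last, t_now, horizon)
        if t_next is not None:
            schedule[item_id] = t_next
    return schedule
```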

Experimental Design. We use data gathered from Duolingo, a popular language-learning online platform (the dataset is available at https://github.com/duolingo/halflife-regression), to validate our algorithm MEMORIZE. (Refer to SI Appendix, section 5 for an experimental validation of our algorithm using synthetic data, whose goal is analyzing it under a controlled setting using metrics and baselines that we cannot compute in the real data we have access to.) This dataset consists of ∼12 million sessions of study, involving ∼5.3 million unique (user, word) pairs, which we denote by D, collected over the period of 2 wk. In a single session, a user answers multiple questions, each of which contains multiple words. (Refer to SI Appendix, section 12 for additional details on the Duolingo dataset.) Each word maps to an item i, and the fraction of correct recalls of sentences containing a word i in the session is used as an empirical estimate of its recall probability m̂(t) at the time t of the session, as in previous work (25). If a word is recalled perfectly during a session, then it is considered a successful recall, i.e., r_i(t) = 1, and otherwise it is considered an unsuccessful recall, i.e., r_i(t) = 0. Since we can expect the estimation of the model parameters to be accurate only for users and items with enough numbers of reviewing events, we consider only users with at least 30 reviewing events and words that were reviewed at least 30 times. After this preprocessing step, our dataset consists of ∼5.2 million unique (user, word) pairs.

We compare the performance of our method with two baselines: (i) a uniform reviewing schedule, which sends item(s) for review at a constant rate µ, and (ii) a threshold-based reviewing schedule, which increases the reviewing intensity of an item by c exp((t − s)/ζ) at time s, when its recall probability reaches a threshold m_th. The threshold baseline is similar to the heuristics proposed by previous work (10, 11, 30), which schedule reviews just as the learner is about to forget an item. We do not compare with the algorithm proposed by Reddy et al. (15) because, as it is specially designed for the Leitner system, it assumes a discrete set of forgetting rate values and, as a consequence, is not applicable to our (more general) setting.

Although we cannot make actual interventions to evaluate the performance of each method, the following insight allows for a large-scale natural experiment: Duolingo uses hand-tuned spaced repetition algorithms, which propose reviewing times to the users; however, users often do not perform reviews exactly at the recommended times, and thus schedules for some (user, item) pairs will be closer to uniform than to threshold or MEMORIZE and vice versa, as shown in Fig. 1. As a consequence, we are able to assign each (user, item) pair to a treatment group (i.e., MEMORIZE) or a control group (i.e., uniform or threshold). More in detail, we leverage this insight to design a robust evaluation procedure which relies on (i) likelihood comparisons to determine how closely a user followed a particular reviewing schedule during all reviews but the last in a reviewing sequence, i.e., e1, ..., e_{n−1} in a sequence with n reviews, and (ii) a quality metric, the empirical forgetting rate n̂, which can be estimated using only the last review e_n (and the retention interval t_n − t_{n−1}) of each reviewing sequence and does not depend on the particular choice of memory model. Refer to Materials and Methods for more details on our evaluation procedure. [Note that our goal is to evaluate how well different reviewing schedules space the reviews—our objective is not to evaluate the predictive power of the underlying memory models; we are relying on previous work for that (18, 25). However, for completeness, we provide a series of benchmarks and evaluations of the memory models used in this work in SI Appendix, section 8.]

Fig. 1. Examples of (user, item) pairs whose corresponding reviewing times have high likelihood under MEMORIZE (Top), the threshold-based reviewing schedule (Middle), and the uniform reviewing schedule (Bottom). In every plot, each candlestick corresponds to a reviewing event with a green circle (red cross) if the recall was successful (unsuccessful), and time t = 0 corresponds to the first time the user is exposed to the item in our dataset, which may or may not correspond with the first reviewing event. The pairs whose reviewing times follow more closely MEMORIZE or the threshold-based schedule tend to increase the time interval between reviews every time a recall is successful while, in contrast, the uniform reviewing schedule does not. MEMORIZE tends to space the reviews more than the threshold-based schedule, achieving the same recall pattern with less effort.
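As a simplified illustration of this treatment/control assignment, one could compare the likelihoods of a sequence under the three schedules directly; the actual procedure, described in Materials and Methods, instead selects the top 25% of pairs per schedule and skips sequences that qualify for more than one schedule.

```python
def assign_group(ll_memorize, ll_uniform, ll_threshold):
    """Assign a (user, item) reviewing sequence to the schedule under which
    its training reviews are most likely (simplified stand-in for the
    top-25% selection with tie-skipping used in the paper)."""
    lls = {"memorize": ll_memorize, "uniform": ll_uniform,
           "threshold": ll_threshold}
    return max(lls, key=lls.get)
```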

Results

We first group (user, item) pairs by their training period, i.e., t_{n−1} − t_1. Then, for each recall pattern, i.e., number of reviews, we create the treatment (MEMORIZE) and control (uniform and threshold) groups and, for every reviewing sequence in each group, compute its empirical forgetting rate. Fig. 2 summarizes the results for sequences with up to seven reviews since the beginning of the observation window for three training periods. The results show that MEMORIZE offers a distinct competitive advantage with respect to the uniform- and threshold-based baselines and that, as the training period increases, the number of reviews under which MEMORIZE achieves the greatest competitive advantage increases. Here, we can rule out that this is a consequence of selection bias due to item difficulty (SI Appendix, section 13).

Next, we go a step farther and verify that, whenever a specific learner follows MEMORIZE more closely, her performance is superior. More specifically, for each learner with at least 70 reviewing sequences, we select the top and bottom 50% of reviewing sequences in terms of log-likelihood under MEMORIZE and compute the Pearson correlation coefficient between the log-likelihood and the empirical forgetting rate. Fig. 3 summarizes the results, which show that users, on average, achieve lower empirical forgetting rates whenever they follow MEMORIZE more closely.

Fig. 2. (A–C) Average empirical forgetting rate for the top 25% of pairs in terms of likelihood for MEMORIZE, the uniform reviewing schedule, and the threshold-based reviewing schedule, for reviewing sequences with different numbers of reviews n and different training periods T. Boxes indicate 25% and 75% quantiles and solid lines indicate median values, where lower values indicate better performance. For each distinct number of reviews, MEMORIZE achieves the greatest competitive advantage, with a statistically significant difference (Mann–Whitney U test; P-value < 0.05) between MEMORIZE vs. threshold and MEMORIZE vs. uniform scheduling.

Fig. 3. Pearson correlation coefficient between the log-likelihood of the top and bottom 50% of reviewing sequences in terms of likelihood under MEMORIZE and the associated empirical forgetting rate. Circles indicate median values and bars indicate standard error. Lower correlation values correspond to greater gains due to MEMORIZE. To ensure reliable estimation, we considered learners with at least 70 reviewing sequences with a training period T = 8 ± 3.2 d. There were 322 such learners.

Since the Leitner system (9), there has been a wealth of spaced repetition algorithms (7, 8, 10, 11, 13). However, there has been a paucity of work on designing adaptive data-driven spaced repetition algorithms with provable guarantees. In this work, we have introduced a principled modeling framework to design online spaced repetition algorithms with provable guarantees, which are specially designed to adapt to the learners' performance, as monitored by modern spaced repetition software and online platforms. Our modeling framework represents spaced repetition using marked temporal point processes and SDEs with jumps and, exploiting such representation, it casts the design of spaced repetition algorithms as a stochastic optimal control problem of such jump SDEs. Since our framework is agnostic to particular modeling choices, i.e., the memory model and the quadratic loss function, we believe it provides a powerful tool to find spaced repetition algorithms that are provably optimal under a given choice of memory model and loss.

There are many interesting directions for future work. For example, it would be interesting to perform large-scale interventional experiments to assess the performance of our algorithm in comparison with existing spaced repetition algorithms deployed by, e.g., Duolingo. Moreover, in our work, we consider a particular quadratic loss and soft constraints on the number of reviewing events; however, it would be useful to derive optimal reviewing intensities for other losses capturing particular learning goals and hard constraints on the number of reviewing events. We assumed that, by reviewing an item, one can only influence its recall probability and forgetting rate. However, items may be dependent, and one could influence the recall probabilities and forgetting rates of several items by reviewing a single item. The dataset we used only spans 2 wk, which places a limitation on the time intervals between reviews and the retention intervals we can evaluate our framework on; it would be very interesting to use datasets spanning longer periods of time. Finally, we believe that the mathematical techniques underpinning our algorithm, i.e., stochastic optimal control of SDEs with jumps, have the potential to drive the design of control algorithms in a wide range of applications.

Materials and Methods

Evaluation Procedure. To evaluate the performance of our proposed algorithm, we rely on the following evaluation procedure. For each (user, item) reviewing sequence, we first perform a likelihood-based comparison and determine how closely it follows a specific reviewing schedule (be it MEMORIZE, uniform, or threshold) during the first n − 1 reviews, the training reviews, where n is the number of reviews in the reviewing sequence. Second, we compute a quality metric, the empirical forgetting rate n̂(t_n), using the last review, the nth review or test review, and the retention interval t_n − t_{n−1}. Third, for each reviewing sequence, we record the value of the quality metric, the training period (i.e., T = t_{n−1} − t_1), and the likelihood under each reviewing schedule. Finally, we control for the training period and the number of reviewing events and create the treatment and control groups by picking the top 25% of pairs in terms of likelihood for each method, where we skip any sequence lying in the top 25% for more than one method. Refer to SI Appendix, section 13 for an additional analysis showing that our evaluation procedure satisfies the random-assignment assumption for the item difficulties between treatment and control groups (31).

In the above procedure, to do the likelihood-based comparison, we first estimate the parameters α and β and the initial forgetting rates n_i(0) using half-life regression on the Duolingo dataset. Here, note that we fit a single set of parameters α and β for all items and a different initial forgetting rate n_i(0) per item i, and we use the power-law forgetting curve model due to its better performance (in terms of MAE) in our experiments (refer to SI Appendix, section 8 for more details). Then, for each user, we use maximum-likelihood estimation to fit the parameter q in MEMORIZE and the parameter µ in the uniform reviewing schedule. For the threshold-based schedule, we fit one set of parameters c and ζ for each sequence of review events, using maximum-likelihood estimation for the parameter c and grid search for the parameter ζ, and we fit one parameter m_th for each user using grid search. Finally, we compute the likelihood of the times of the n − 1 reviewing events for each (user, item) pair under the intensity given by MEMORIZE, i.e., u(t) = q^{−1/2}(1 − m(t)); the intensity given by the uniform schedule, i.e., u(t) = µ; and the intensity given by the threshold-based schedule, i.e., u(t) = c exp((t − s)/ζ). The likelihood LL({t_i}) of a set of reviewing events {t_i} given an intensity function u(t) can be computed as follows (16):

LL({t_i}) = ∑_i log u(t_i) − ∫_0^T u(t) dt.

More details on the empirical distribution of likelihood values under each reviewing schedule are provided in SI Appendix, section 10.

Quality Metric: Empirical Forgetting Rate. For each (user, item) pair, the empirical forgetting rate is an empirical estimate of the forgetting rate by the time t_n of the last reviewing event; i.e.,

n̂ = −log(m̂(t_n)) / (t_n − t_{n−1}),

where m̂(t_n) is the empirical recall probability, which consists of the fraction of correct recalls of sentences containing word (item) i in the session at time t_n. Note that this empirical estimate does not depend on the particular choice of memory model and, given a sequence of reviews, the lower the empirical forgetting rate is, the more effective the reviewing schedule. Moreover, for a fairer comparison across items, we normalize each empirical forgetting rate using the average empirical initial forgetting rate of the corresponding item at the beginning of the observation window, n̂_0; i.e., for an item i,

n̂_0 = (1/|D_i|) ∑_{(u,i)∈D_i} n̂_{0,(u,i)},

where D_i ⊆ D is the subset of (user, item) pairs in which item i was reviewed. Furthermore, n̂_{0,(u,i)} = −log(m̂(t_{(u,i),1})) / (t_{(u,i),1} − t_{(u,i),0}), where t_{(u,i),k} is the kth review in the reviewing sequence associated with the (u, i) pair. However, our results are not sensitive to this normalization step, as shown in SI Appendix, section 14.
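For concreteness, the likelihood computation above can be sketched in a few lines of Python; the function name and the numerical integration on a uniform grid are our own choices, not the paper's implementation (in practice, one could also integrate u(t) in closed form between reviews).

```python
import math

def log_likelihood(review_times, u, T, grid=100000):
    """LL({t_i}) = sum_i log u(t_i) - integral_0^T u(t) dt for an
    inhomogeneous Poisson process with intensity u(t) on [0, T];
    the integral is approximated by a Riemann sum."""
    ll = sum(math.log(max(u(t), 1e-300)) for t in review_times)  # guard u = 0
    dt = T / grid
    ll -= sum(u(k * dt) for k in range(grid)) * dt
    return ll

# Candidate intensities for one (user, item) pair; q and mu denote the fitted
# per-user parameters and m(t) comes from the fitted memory model:
# u_memorize = lambda t: q ** -0.5 * (1.0 - m(t))
# u_uniform = lambda t: mu
```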

1. Ebbinghaus H (1885); trans Ruger HA, Bussenius CE (1913) Memory: A Contribution to Experimental Psychology (Columbia Univ Teachers College, New York).
2. Melton AW (1970) The situation with respect to the spacing of repetitions and memory. J Verbal Learn Verbal Behav 9:596–606.
3. Dempster FN (1989) Spacing effects and their implications for theory and practice. Educ Psychol Rev 1:309–330.
4. Atkinson RC (1972) Optimizing the learning of a second-language vocabulary. J Exp Psychol 96:124–129.
5. Bloom KC, Shuell TJ (1981) Effects of massed and distributed practice on the learning and retention of second-language vocabulary. J Educ Res 74:245–248.
6. Cepeda NJ, Pashler H, Vul E, Wixted JT, Rohrer D (2006) Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychol Bull 132:354–380.
7. Pavlik PI, Anderson JR (2008) Using a model to compute the optimal schedule of practice. J Exp Psychol Appl 14:101–117.
8. Branwen G (2016) Spaced repetition. Available at https://www.gwern.net/Spaced%20repetition. Accessed October 20, 2018.
9. Leitner S (1974) So Lernt Man Lernen (Herder, Freiburg, Germany).
10. Metzler-Baddeley C, Baddeley RJ (2009) Does adaptive training work? Appl Cogn Psychol 23:254–266.
11. Lindsey RV, Shroyer JD, Pashler H, Mozer MC (2014) Improving students' long-term knowledge retention through personalized review. Psychol Sci 25:639–647.
12. Pashler H, Cepeda N, Lindsey RV, Vul E, Mozer MC (2009) Predicting the optimal spacing of study: A multiscale context model of memory. Advances in Neural Information Processing Systems, eds Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A (Neural Information Processing Systems Foundation, New York), pp 1321–1329.
13. Mettler E, Massey CM, Kellman PJ (2016) A comparison of adaptive and fixed schedules of practice. J Exp Psychol Gen 145:897–917.
14. Novikoff TP, Kleinberg JM, Strogatz SH (2012) Education of a model student. Proc Natl Acad Sci USA 109:1868–1873.
15. Reddy S, Labutov I, Banerjee S, Joachims T (2016) Unbounded human learning: Optimal scheduling for spaced repetition. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, eds Smola A, Aggrawal C, Shen D, Rastogi R (ACM, San Francisco), pp 1815–1824.
16. Aalen O, Borgan O, Gjessing HK (2008) Survival and Event History Analysis: A Process Point of View (Springer, New York).
17. Loftus GR (1985) Evaluating forgetting curves. J Exp Psychol Learn Mem Cogn 11:397–406.
18. Wixted JT, Carpenter SK (2007) The Wickelgren power law and the Ebbinghaus savings function. Psychol Sci 18:133–134.
19. Averell L, Heathcote A (2011) The form of the forgetting curve and the fate of memories. J Math Psychol 55:25–35.
20. Hanson FB (2007) Applied Stochastic Processes and Control for Jump-Diffusions: Modeling, Analysis, and Computation (SIAM, Philadelphia).
21. Zarezade A, De A, Upadhyay U, Rabiee HR, Gomez-Rodriguez M (2018) Steering social activity: A stochastic optimal control point of view. J Mach Learn Res 18:205.
22. Zarezade A, Upadhyay U, Rabiee H, Gomez-Rodriguez M (2017) RedQueen: An online algorithm for smart broadcasting in social networks. Proceedings of the 10th ACM International Conference on Web Search and Data Mining, eds Tomkins A, Zhang M (ACM, New York), pp 51–60.
23. Kim J, Tabibian B, Oh A, Schoelkopf B, Gomez-Rodriguez M (2018) Leveraging the crowd to detect and reduce the spread of fake news and misinformation. Proceedings of the 11th ACM International Conference on Web Search and Data Mining, eds Liu Y, Maarek Y (ACM, New York), pp 324–332.
24. Tabibian B, et al. (2019) Data from "Enhancing human learning via spaced repetition optimization." GitHub. Available at https://github.com/Networks-Learning/memorize. Deposited January 8, 2019.
25. Settles B, Meeder B (2016) A Trainable Spaced Repetition Model for Language Learning (ACL, Berlin).
26. Roediger HL, III, Karpicke JD (2006) Test-enhanced learning: Taking memory tests improves long-term retention. Psychol Sci 17:249–255.
27. Cepeda NJ, Vul E, Rohrer D, Wixted JT, Pashler H (2008) Spacing effects in learning: A temporal ridgeline of optimal retention. Psychol Sci 19:1095–1102.
28. Bertsekas DP (1995) Dynamic Programming and Optimal Control (Athena Scientific, Belmont, MA).
29. Lewis PA, Shedler GS (1979) Simulation of nonhomogeneous Poisson processes by thinning. Naval Res Logistics Q 26:403–413.
30. Bjork EL, Bjork R (2011) Making things hard on yourself, but in a good way. Psychology in the Real World, eds Gernsbacher M, Pomerantz J (Worth Publishers, New York), pp 56–64.
31. Stuart EA (2010) Matching methods for causal inference: A review and a look forward. Stat Sci Rev J Inst Math Stat 25:1–21.
