Enhancing human learning via spaced repetition optimization

Behzad Tabibiana,b,1, Utkarsh Upadhyaya, Abir Dea, Ali Zarezadea, Bernhard Schölkopfb, and Manuel Gomez-Rodrigueza

aNetworks Learning Group, Max Planck Institute for Software Systems, 67663 Kaiserslautern, Germany; and bEmpirical Inference Department, Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany

Edited by Richard M. Shiffrin, Indiana University, Bloomington, IN, and approved December 14, 2018 (received for review September 3, 2018)

Spaced repetition is a technique for efficient memorization, which uses repeated review of content following a schedule determined by a spaced repetition algorithm to improve long-term retention. However, current spaced repetition algorithms are simple rule-based heuristics with a few hard-coded parameters. Here, we introduce a flexible representation of spaced repetition using the framework of marked temporal point processes and then address the design of spaced repetition algorithms with provable guarantees as an optimal control problem for stochastic differential equations with jumps. For two well-known human memory models, we show that, if the learner aims to maximize recall probability of the content to be learned subject to a cost on the reviewing frequency, the optimal reviewing schedule is given by the recall probability itself. As a result, we can then develop a simple, scalable online spaced repetition algorithm, MEMORIZE, to determine the optimal reviewing times. We perform a large-scale natural experiment using data from Duolingo, a popular language-learning online platform, and show that learners who follow a reviewing schedule determined by our algorithm memorize more effectively than learners who follow alternative schedules determined by several heuristics.

memorization | spaced repetition | human learning | marked temporal point processes | stochastic optimal control

Significance

Understanding human memory has been a long-standing problem in various scientific disciplines. Early works focused on characterizing human memory using small-scale controlled experiments, and these empirical studies later motivated the design of spaced repetition algorithms for efficient memorization. However, current spaced repetition algorithms are rule-based heuristics with hard-coded parameters, which do not leverage the automated fine-grained monitoring and greater degree of control offered by modern online learning platforms. In this work, we develop a computational framework to derive optimal spaced repetition algorithms, specially designed to adapt to the learners' performance. A large-scale natural experiment using data from a popular language-learning online platform provides empirical evidence that the spaced repetition algorithms derived using our framework are significantly superior to alternatives.

Our ability to remember a piece of information depends critically on the number of times we have reviewed it, the temporal distribution of the reviews, and the time elapsed since the last review, as first shown by a seminal study by Ebbinghaus (1). The effect of these factors has been extensively investigated in the experimental psychology literature (2, 3), particularly in second language acquisition research (4–7). Moreover, these empirical studies have motivated the use of flashcards, small pieces of information a learner repeatedly reviews following a schedule determined by a spaced repetition algorithm (8), whose goal is to ensure that learners spend more (less) time working on forgotten (recalled) information.

The task of designing spaced repetition algorithms has a rich history, starting with the Leitner system (9). More recently, several works (10, 11) have proposed heuristic algorithms that schedule reviews just as the learner is about to forget an item, i.e., when the probability of recall, as given by a memory model of choice (1, 12), falls below a threshold. An orthogonal line of research (7, 13) has pursued locally optimal scheduling by identifying which item would benefit the most from a review given a fixed reviewing time. In doing so, the researchers have also proposed heuristic algorithms that decide which item to review by greedily selecting the item which is closest to its maximum learning rate.

In recent years, spaced repetition software and online platforms such as Mnemosyne (mnemosyne-proj.org), Synap (www.synap.ac), and Duolingo (www.duolingo.com) have become increasingly popular, often replacing the use of physical flashcards. The promise of these pieces of software and online platforms is that automated fine-grained monitoring and a greater degree of control will result in more effective spaced repetition algorithms. However, most of the above spaced repetition algorithms are simple rule-based heuristics with a few hard-coded parameters (8), which are unable to fulfill this promise—adaptive, data-driven algorithms with provable guarantees have been largely missing until very recently (14, 15). Among these recent notable exceptions, the work most closely related to ours is by Reddy et al. (15), who proposed a queueing network model for a particular spaced repetition method—the Leitner system (9) for reviewing flashcards—and then developed a heuristic approximation algorithm for scheduling reviews. However, their heuristic does not have provable guarantees, it does not adapt to the learner's performance over time, and it is specifically designed for the Leitner system.

In this work, we develop a computational framework to derive optimal spaced repetition algorithms, specially designed to adapt to the learner's performance, as continuously monitored by modern spaced repetition software and online learning platforms. More specifically, we first introduce a flexible representation of spaced repetition using the framework of marked temporal point processes (16). For several well-known human memory models (1, 12, 17–19), we use this representation to express the dynamics of a learner's forgetting rates and recall probabilities for the content to be learned by means of a set of stochastic differential equations (SDEs) with jumps. Then, we can find the optimal reviewing schedule for spaced repetition by solving a stochastic optimal control problem for SDEs with jumps (20–23). In doing so, we need to introduce a proof technique of independent interest (SI Appendix, sections 3 and 4) to find a solution to the so-called Hamilton–Jacobi–Bellman (HJB) equation.

Author contributions: B.T., U.U., B.S., and M.G.-R. designed research; B.T., U.U., A.D., A.Z., and M.G.-R. performed research; B.T. and U.U. analyzed data; and B.T., U.U., and M.G.-R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY).

Data deposition: The MEMORIZE algorithm has been deposited on GitHub (https://github.com/Networks-Learning/memorize).

See Commentary on page 3953.

1 To whom correspondence should be addressed. Email: [email protected]

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1815156116/-/DCSupplemental.

Published online January 22, 2019.

For two well-known memory models, we show that, if a learner aims to maximize the recall probability of the content to be learned subject to a cost on the reviewing frequency, the optimal solution uncovers a linear relationship with a negative slope between the optimal reviewing intensity, or rate of reviewing, and the recall probability of the content to be learned. As a consequence, we can develop a simple, scalable online spaced repetition algorithm, which we name MEMORIZE, to determine the optimal reviewing times. Finally, we perform a large-scale natural experiment using data from Duolingo, a popular language-learning online platform, and show that learners who follow a reviewing schedule determined by our algorithm memorize more effectively than learners who follow alternative schedules determined by several heuristics. To facilitate research in this area, we are releasing an open-source implementation of our algorithm (24).

Modeling Framework of Spaced Repetition. Our framework is agnostic to the particular choice of memory model—it provides a set of techniques to find optimal reviewing schedules under a memory model of choice. Here, for ease of exposition, we showcase our framework using one of the well-known memory models from the psychology literature, the exponential forgetting curve model with binary recalls (1, 17), and we estimate (a variant of) its parameters using half-life regression (25), a recent machine-learning method. [In SI Appendix, sections 6 and 7, we apply our framework to two other popular memory models, the power-law forgetting curve model (18, 19) and the multiscale context model (MCM) of memory (12).]

More specifically, given a learner who wants to memorize a set of items I using spaced repetition, i.e., repeated, spaced review of the items, we represent each reviewing event as a triplet e := (i, t, r), which means that the learner reviewed item i ∈ I at time t and either recalled it (r = 1) or forgot it (r = 0). Here, note that each reviewing event includes the outcome of a test (i.e., recall) since, in most spaced repetition software and online platforms such as Mnemosyne, Synap, and Duolingo, the learner is tested in each review, following the seminal work of Roediger and Karpicke (26).

Given the above representation, we model the probability that the learner recalls (forgets) item i at time t using the exponential forgetting curve model; i.e.,

m_i(t) := P(r_i(t) = 1) = exp(−n_i(t)(t − t_r)),   [1]

where t_r is the time of the last review and n_i(t) ≥ 0 is the forgetting rate at time t, which may depend on many factors, e.g., item difficulty and/or the number of previous (un)successful recalls of the item. [Previous work often uses the inverse of the forgetting rate, n_i^{-1}(t), referred to as memory strength or half-life (15, 25). However, it is more tractable for us to work in terms of forgetting rates.] Then, we keep track of the reviewing times using a multidimensional counting process N(t), in which the ith entry, N_i(t), counts the number of times the learner has reviewed item i up to time t. Following the literature on temporal point processes (16), we characterize these counting processes using their corresponding intensities u(t), i.e., E[dN(t)] = u(t)dt, and think of the recall outcomes r_i(t) ∈ {0, 1} as their binary marks.

Moreover, every time a learner reviews an item, the recall of the item has a multiplicative effect on the forgetting rate of the item (3, 15, 25)—a successful recall changes the forgetting rate by (1 − α)n_i(t), with 0 ≤ α ≤ 1, while an unsuccessful recall changes the forgetting rate by (1 + β)n_i(t), with β ≥ 0. Such an assumption comes from half-life regression (25), which we use to estimate the model parameters. In this context, the initial forgetting rate, n_i(0), captures the difficulty of the item, with more difficult items having higher initial forgetting rates compared with easier items, and the parameters α and β are estimated using real data (refer to SI Appendix, section 8 for more details).

Before we proceed farther, we acknowledge that several laboratory studies (6, 27) have provided empirical evidence that the retention rate follows an inverted U shape, i.e., massed practice does not improve the learner's short-term retention and, when such a review happens, the forgetting rate—and thus the item's long-term retention—does not improve. Thus, one could argue for time-varying parameters α(t) and β(t) in our framework. However, there are several reasons that prevent us from deriving an optimal reviewing schedule under time-varying parameters: (i) the derivation of an optimal reviewing schedule becomes very challenging; (ii) allowing for time-varying parameters in our modeling framework does not lead to more accurate recall predictions in our Duolingo dataset (SI Appendix, section 9); and (iii) several popular spaced repetition heuristics with exponential spacing, such as the Leitner system and SuperMemo, have achieved reasonable success in practice despite implicitly assuming constant α and β. [The Leitner system with exponential spacing can be explicitly cast using our formulation with particular choices of α and β for all items (SI Appendix, section 11).] That being said, it would be an interesting venue for future work to derive optimal reviewing schedules under time-varying parameters.

Next, we express the dynamics of the forgetting rate n_i(t) and the recall probability m_i(t) during a review using SDEs with jumps. This is very useful for the design of our spaced repetition algorithm using stochastic optimal control. More specifically, the dynamics of the forgetting rate n_i(t) are given by the following proposition (proved in SI Appendix, section 1):

Proposition 1. Given an item i ∈ I with reviewing intensity u_i(t), the forgetting rate n_i(t) is a Markov process whose dynamics can be defined by the following SDE with jumps,

dn_i(t) = −αn_i(t)r_i(t)dN_i(t) + βn_i(t)(1 − r_i(t))dN_i(t),   [2]

where N_i(t) is the counting process associated to the reviewing intensity u_i(t), i.e., E[dN_i(t)] = u_i(t)dt, and r_i(t) ∈ {0, 1} indicates whether item i has been successfully recalled at time t.

Similarly, the dynamics of the recall probability m_i(t) ∈ [0, 1], defined by Eq. 1, are readily given by

dm_i(t) = −n_i(t)m_i(t)dt + (1 − m_i(t))dN_i(t).   [3]

[To derive Eq. 3, we assume that the recall probability m_i(t) is set to 1 every time item i is reviewed; here, one may also account for item difficulty by considering that, for more difficult items, the recall probability after a review is set to a value lower than 1.]

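To make Eqs. 1–3 concrete, the following minimal Python sketch (our illustration, not the paper's code; the parameter values are arbitrary assumptions) simulates the forgetting-rate jumps and recall probabilities of a single item under a fixed list of review times:

```python
import math
import random

def recall_probability(n, t, t_last):
    """Eq. 1: m(t) = exp(-n * (t - t_last))."""
    return math.exp(-n * (t - t_last))

def simulate_reviews(review_times, n0=1.0, alpha=0.3, beta=0.3, seed=0):
    """Simulate the jump dynamics of Eqs. 2 and 3 for one item: at each review,
    the recall outcome r is a Bernoulli draw of the current recall probability;
    a successful recall (r = 1) scales the forgetting rate by (1 - alpha),
    an unsuccessful one (r = 0) by (1 + beta)."""
    rng = random.Random(seed)
    n, t_last, trace = n0, 0.0, []
    for t in review_times:
        m = recall_probability(n, t, t_last)  # recall probability just before the review
        r = 1 if rng.random() < m else 0      # simulated test outcome at the review
        n *= (1.0 - alpha) if r == 1 else (1.0 + beta)  # jump of Eq. 2
        t_last = t                            # Eq. 3: m is set back to 1 after the review
        trace.append((t, m, r, n))
    return trace

if __name__ == "__main__":
    for t, m, r, n in simulate_reviews([1.0, 2.5, 5.0, 9.0]):
        print(f"t = {t:4.1f}  m = {m:.3f}  r = {r}  n = {n:.3f}")
```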
Finally, given a set of items I, we cast the design of a spaced repetition algorithm as the search of the optimal item reviewing intensities u(t) = [u_i(t)]_{i∈I} that minimize the expected value of a particular (convex) loss function ℓ(m(t), n(t), u(t)) of the recall probabilities of the items, m(t) = [m_i(t)]_{i∈I}; the forgetting rates, n(t) = [n_i(t)]_{i∈I}; and the intensities themselves, u(t); over a time window (t_0, t_f]; i.e.,

minimize_{u(t_0,t_f]}  E[ φ(m(t_f), n(t_f)) + ∫_{t_0}^{t_f} ℓ(m(τ), n(τ), u(τ)) dτ ]
subject to  u(t) ≥ 0  ∀ t ∈ (t_0, t_f],   [4]

where u(t_0, t_f] denotes the item reviewing intensities from t_0 to t_f, the expectation is taken over all possible realizations of the associated counting processes and (item) recalls, the loss function is nonincreasing (nondecreasing) with respect to the recall probabilities (forgetting rates and intensities) so that it rewards long-lasting learning while limiting the number of item reviews, and φ(m(t_f), n(t_f)) is an arbitrary penalty function. [The penalty function φ(m(t_f), n(t_f)) is necessary to derive the optimal reviewing intensities u*(t).] Here, note that the forgetting rates n(t) and recall probabilities m(t), as defined by Eqs. 2 and 3, depend on the reviewing intensities u(t) we aim to optimize since E[dN(t)] = u(t)dt.

The MEMORIZE Algorithm. The spaced repetition problem, as defined by Eq. 4, can be tackled from the perspective of stochastic optimal control of jump SDEs (20). Here, we first derive a solution to the problem considering only one item, provide an efficient practical implementation of the solution, and then generalize it to the case of multiple items.

Given an item i, we can write the spaced repetition problem, i.e., Eq. 4, for it with reviewing intensity u_i(t) = u(t) and associated counting process N_i(t) = N(t), recall outcome r_i(t) = r(t), recall probability m_i(t) = m(t), and forgetting rate n_i(t) = n(t). Further, using Eqs. 2 and 3, we can define the forgetting rate n(t) and recall probability m(t) by the following two coupled jump SDEs,

dn(t) = −αn(t)r(t)dN(t) + βn(t)(1 − r(t))dN(t)
dm(t) = −n(t)m(t)dt + (1 − m(t))dN(t),

with initial conditions n(t_0) = n_0 and m(t_0) = m_0.

Next, we define an optimal cost-to-go function J for the above problem, use Bellman's principle of optimality to derive the corresponding HJB equation (28), and exploit the unique structure of the HJB equation to find the optimal solution to the problem.

Definition 2: The optimal cost-to-go J(m(t), n(t), t) is defined as the minimum of the expected value of the cost of going from state (m(t), n(t)) at time t to the final state at time t_f:

J(m(t), n(t), t) = min_{u(t,t_f]} E_{(N(s),r(s)), s∈(t,t_f]} [ φ(m(t_f), n(t_f)) + ∫_t^{t_f} ℓ(m(τ), u(τ)) dτ ].   [5]

Now, we use Bellman's principle of optimality, which the above definition allows, to break the problem into smaller subproblems. [Bellman's principle of optimality readily follows using the Markov property of the recall probability m(t) and forgetting rate n(t).] With dJ(m(t), n(t), t) = J(m(t + dt), n(t + dt), t + dt) − J(m(t), n(t), t), we can, hence, rewrite Eq. 5 as

J(m(t), n(t), t) = min_{u(t,t+dt]} { E[J(m(t + dt), n(t + dt), t + dt)] + ℓ(m(t), n(t), u(t))dt }
0 = min_{u(t,t+dt]} { E[dJ(m(t), n(t), t)] + ℓ(m(t), n(t), u(t))dt }.   [6]

Then, to derive the HJB equation, we differentiate J with respect to time t, m(t), and n(t) using SI Appendix, section 2, Lemma 1:

0 = J_t(m, n, t) − nmJ_m(m, n, t) + min_{u(t,t+dt]} { ℓ(m, n, u) + [J(1, (1 − α)n, t)m + J(1, (1 + β)n, t)(1 − m) − J(m, n, t)]u(t) }.   [7]

To solve the above equation, we need to define the loss ℓ. Following the literature on stochastic optimal control (28), we consider the following quadratic form, which is nonincreasing (nondecreasing) with respect to the recall probabilities (intensities) so that it rewards learning while limiting the number of item reviews,

ℓ(m(t), n(t), u(t)) = (1/2)(1 − m(t))^2 + (1/2)q u^2(t),   [8]

where q is a given parameter, which trades off recall probability and number of item reviews—the higher its value, the lower the number of reviews. Note that this particular choice of loss function does not directly place a hard constraint on the number of reviews; instead, it limits the number of reviews by penalizing high reviewing intensities. (Given a desired level of practice, the value of the parameter q can be easily found by simulation since the average number of reviews decreases monotonically with respect to q.)

Under these definitions, we can find the relationship between the optimal intensity and the optimal cost by taking the derivative with respect to u(t) in Eq. 7 and clipping it at zero to satisfy the constraint u(t) ≥ 0:

u*(t) = q^{-1}[J(m(t), n(t), t) − J(1, (1 − α)n(t), t)m(t) − J(1, (1 + β)n(t), t)(1 − m(t))]_+.

Finally, we plug the above equation into Eq. 7 and find that the optimal cost-to-go J needs to satisfy the following nonlinear differential equation:

0 = J_t(m(t), n(t), t) − n(t)m(t)J_m(m(t), n(t), t) + (1/2)(1 − m(t))^2 − (1/2)q^{-1}[J(m(t), n(t), t) − J(1, (1 − α)n(t), t)m(t) − J(1, (1 + β)n(t), t)(1 − m(t))]_+^2.

To continue farther, we rely on a technical lemma (SI Appendix, section 3, Lemma 2), which derives the optimal cost-to-go J for a parameterized family of losses ℓ. Using SI Appendix, section 3, Lemma 2, the optimal reviewing intensity is readily given by Theorem 3 (proved in SI Appendix, section 4):

Theorem 3. Given a single item, the optimal reviewing intensity for the spaced repetition problem, defined by Eq. 4, under quadratic loss, defined by Eq. 8, is given by

u*(t) = q^{-1/2}(1 − m(t)).   [9]

Note that the optimal intensity depends only on the recall probability, whose dynamics are given by Eqs. 2 and 3, and thus allows for a very efficient procedure to sample reviewing times, which we name MEMORIZE. Algorithm 1 provides a pseudocode implementation of MEMORIZE. Within the algorithm, Sample(u(t)) samples from an inhomogeneous Poisson process with intensity u(t) and returns the sampled time, and ReviewItem(t′) returns the recall outcome r of an item reviewed at time t′, where r = 1 indicates the item was recalled successfully and r = 0 indicates it was not recalled.
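As a runnable stand-in for Algorithm 1, here is a minimal Python sketch of the single-item MEMORIZE loop (ours, under illustrative parameter values; the official implementation is at https://github.com/Networks-Learning/memorize). It samples review times from the intensity of Eq. 9 by thinning, anticipating the note below, and `review_item` is a hypothetical stand-in for the learner actually taking a test (ReviewItem in Algorithm 1):

```python
import math
import random

def memorize(q=1.0, n0=1.0, alpha=0.3, beta=0.3, horizon=30.0, seed=0):
    """Single-item MEMORIZE: sample reviews from u(t) = q^{-1/2} (1 - m(t)) (Eq. 9)
    via thinning (ref. 29); u(t) is bounded above by u_max = q^{-1/2}."""
    rng = random.Random(seed)

    def review_item(m):
        # Hypothetical stand-in for ReviewItem(t'): simulate the recall outcome.
        return 1 if rng.random() < m else 0

    u_max = q ** -0.5
    n, t_last, t, schedule = n0, 0.0, 0.0, []
    while True:
        t += rng.expovariate(u_max)            # candidate time from a rate-u_max process
        if t > horizon:
            return schedule
        m = math.exp(-n * (t - t_last))        # current recall probability (Eq. 1)
        if rng.random() < 1.0 - m:             # accept with probability u(t)/u_max
            r = review_item(m)                 # a review happens: observe the recall
            n *= (1.0 - alpha) if r == 1 else (1.0 + beta)  # Eq. 2 jump
            t_last = t                         # Eq. 3: m resets to 1 after the review
            schedule.append((round(t, 2), r))

if __name__ == "__main__":
    print(memorize(q=0.25, horizon=20.0))
```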

Moreover, note that we sample from an inhomogeneous Poisson process using a standard thinning algorithm (29). [In some practical deployments, one may want to discretize the (time) values and, e.g., "at the top of each hour, decide whether to do a review or not."]

Given a set of multiple items I, we can solve the spaced repetition problem defined by Eq. 4 similarly as in the case of a single item. More specifically, consider the following quadratic form for the loss ℓ,

ℓ(m(t), n(t), u(t)) = (1/2) Σ_{i∈I} (1 − m_i(t))^2 + (1/2) Σ_{i∈I} q_i u_i^2(t),   [10]

where {q_i}_{i∈I} are given parameters, which trade off recall probability and number of item reviews and may favor the learning of one item over another. Then, we can exploit the independence among items to derive the optimal reviewing intensity for each item, proceeding similarly as in the case of a single item:

Theorem 4. Given a set of items I with reviewing intensities u(t), associated counting processes N(t), recall outcomes r(t), recall probabilities m(t), and forgetting rates n(t), the optimal reviewing intensity for each item i in the spaced repetition problem defined by Eq. 4 under quadratic loss defined by Eq. 10 is given by u_i*(t) = q_i^{-1/2}(1 − m_i(t)).

Finally, note that we can easily sample item reviewing times by simply running |I| instances of MEMORIZE, one per item.

Experimental Design. We use data gathered from Duolingo, a popular language-learning online platform (the dataset is available at https://github.com/duolingo/halflife-regression), to validate our algorithm MEMORIZE. (Refer to SI Appendix, section 5 for an experimental validation of our algorithm using synthetic data, whose goal is analyzing it under a controlled setting and computing metrics and baselines that we cannot compute in the real data we have access to.) This dataset consists of ∼12 million sessions of study, involving ∼5.3 million unique (user, word) pairs, which we denote by D, collected over the period of 2 wk. In a single session, a user answers multiple questions, each of which contains multiple words. (Refer to SI Appendix, section 12 for additional details on the Duolingo dataset.) Each word maps to an item, and the fraction of correct recalls of sentences containing a word in a session is used as an empirical estimate of its recall probability m̂(t) at the time t of the session, as in previous work (25). If a word is recalled perfectly during a session, then it is considered a successful recall, i.e., r(t) = 1, and otherwise it is considered an unsuccessful recall, i.e., r(t) = 0. Since we can expect the estimation of the model parameters to be accurate only for users and items with enough numbers of reviewing events, we only consider users with at least 30 reviewing events and words that were reviewed at least 30 times. After this preprocessing step, our dataset consists of ∼5.2 million unique (user, word) pairs.

We compare the performance of our method with two baselines: (i) a uniform reviewing schedule, which sends item(s) for review at a constant rate μ, and (ii) a threshold-based reviewing schedule, which increases the reviewing intensity of an item by c·exp((t − s)/ζ) when its recall probability reaches a threshold m_th at time s. The threshold baseline is similar to the heuristics proposed by previous work (10, 11, 30), which schedule reviews just as the learner is about to forget an item. We do not compare with the algorithm proposed by Reddy et al. (15) because it is specially designed for the Leitner system—it assumes a discrete set of forgetting rate values and, as a consequence, it is not applicable to our (more general) setting.

Although we cannot make actual interventions to evaluate the performance of each method, we perform the following large-scale natural experiment: Duolingo uses hand-tuned spaced repetition algorithms, which propose reviewing times to the users; however, users do not perform reviews exactly at the recommended times, and thus, for some (user, item) pairs, the reviewing times will be closer to the times MEMORIZE would propose and, for some, closer to a uniform or a threshold-based schedule, and vice versa. We leverage this insight to design a robust evaluation procedure and assign each (user, item) pair to a treatment group (i.e., MEMORIZE) or a control group (i.e., uniform or threshold), as shown in Fig. 1. In more detail, our evaluation procedure relies on (i) a likelihood-based comparison to determine how closely a user followed a particular reviewing schedule during all reviews in a reviewing sequence but the last one, i.e., reviews 1, ..., n − 1 in a sequence with n reviews, and (ii) a quality metric, the empirical forgetting rate n̂, which can be estimated using only the last review and the retention interval t_n − t_{n−1}. Note that the retention interval does not depend on the particular choice of reviewing schedule. Refer to Materials and Methods for more details on our evaluation procedure.

Fig. 1. Examples of (user, item) pairs whose corresponding reviewing times have high likelihood under MEMORIZE (Top), the uniform reviewing schedule (Middle), and the threshold-based reviewing schedule (Bottom). In every plot, each candlestick corresponds to a reviewing event, a green circle (red cross) indicates the recall was successful (unsuccessful), and time t = 0 corresponds to the first reviewing event, which may or may not correspond to the first time the user is exposed to the item in our dataset. MEMORIZE tends to increase the time interval between reviews while the recall is successful; in contrast, the threshold-based schedule tends to do more reviews than MEMORIZE, which achieves the same recall pattern with less effort, and the uniform schedule tends to space the reviews every fixed period of time.

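To illustrate the likelihood-based comparison that underlies this assignment, here is a small self-contained Python sketch (ours; the toy sequence, the parameter values, the crude threshold proxy, and the grid-based integration are assumptions, not the paper's code) that scores one reviewing sequence under the three candidate intensities:

```python
import math

times, T = [0.5, 1.8, 4.0, 7.5], 10.0   # first n-1 review times of one (user, item) pair

def m(t, n=0.4):
    """Toy recall probability (Eq. 1) that resets at each review; the forgetting
    rate is kept constant here purely for illustration."""
    t_last = max([0.0] + [s for s in times if s < t])
    return math.exp(-n * (t - t_last))

def log_likelihood(event_times, u, T, grid=10000):
    """LL({t_i}) = sum_i log u(t_i) - int_0^T u(t) dt (see Materials and Methods),
    with the integral approximated by a Riemann sum."""
    dt = T / grid
    integral = sum(u(k * dt) for k in range(grid)) * dt
    return sum(math.log(max(u(t), 1e-12)) for t in event_times) - integral

schedules = {
    "MEMORIZE": lambda t: 1.0 * (1.0 - m(t)),             # u(t) = q^{-1/2} (1 - m(t))
    "uniform": lambda t: 0.4,                             # u(t) = mu
    "threshold": lambda t: 0.9 if m(t) < 0.3 else 1e-12,  # crude proxy for the baseline
}
scores = {name: log_likelihood(times, u, T) for name, u in schedules.items()}
print(max(scores, key=scores.get), {k: round(v, 2) for k, v in scores.items()})
```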

Fig. 2. (A–C) Average empirical forgetting rate for the top 25% of pairs in terms of likelihood for MEMORIZE, the uniform reviewing schedule, and the threshold-based reviewing schedule for sequences with different numbers of reviews n and different training periods T = t_{n−1} − t_1. Boxes indicate 25% and 75% quantiles and solid lines indicate median values, where lower values indicate better performance. MEMORIZE offers a competitive advantage with respect to the uniform and the threshold-based baselines and, as the training period increases, the number of reviews under which MEMORIZE achieves the greatest competitive advantage increases. For each distinct number of reviews and training periods, ∗ indicates a statistically significant difference (Mann–Whitney U test; P-value < 0.05) between MEMORIZE vs. threshold and MEMORIZE vs. uniform scheduling.
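As a pointer for reproducing the significance test named in the caption, a minimal sketch (ours, with toy numbers rather than the paper's data):

```python
# Mann-Whitney U test between empirical forgetting rates of the MEMORIZE
# group and a baseline group, as reported in Fig. 2 (toy numbers).
from scipy.stats import mannwhitneyu

memorize_rates = [0.21, 0.18, 0.25, 0.30, 0.16, 0.22, 0.19]
baseline_rates = [0.35, 0.28, 0.40, 0.31, 0.27, 0.38, 0.33]

stat, p_value = mannwhitneyu(memorize_rates, baseline_rates, alternative="two-sided")
print(f"U = {stat:.1f}, P = {p_value:.4f}")   # a difference is flagged if P < 0.05
```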

[Note that our goal is to evaluate how well different reviewing schedules space the reviews, not to evaluate the predictive power of the underlying memory models; for that, we are relying on previous work (18, 25). However, for completeness, we provide a series of benchmarks and evaluations for the memory models we used in this work in SI Appendix, section 8.]

Results
We first group (user, item) pairs by their number of reviews n and their training period, i.e., t_{n−1} − t_1. Then, for each recall pattern, we create the treatment (MEMORIZE) and control (uniform and threshold) groups and, for every reviewing sequence in each group, compute its empirical forgetting rate. Fig. 2 summarizes the results for sequences with up to seven reviews since the beginning of the observation window for three distinct training periods. The results show that MEMORIZE offers a competitive advantage with respect to the uniform- and threshold-based baselines and, as the training period increases, the number of reviews under which MEMORIZE achieves the greatest competitive advantage increases. Here, we can rule out that this advantage is a consequence of selection bias due to the item difficulty (SI Appendix, section 13).

Next, we go a step farther and verify that, whenever a specific learner follows MEMORIZE more closely, her performance is superior. More specifically, for each learner with at least 70 reviewing sequences with a training period T = 8 ± 3.2 d, we select the top and bottom 50% of reviewing sequences in terms of log-likelihood under MEMORIZE and compute the Pearson correlation coefficient between the empirical forgetting rate and log-likelihood values. Fig. 3 summarizes the results, which show that users, on average, achieve lower empirical forgetting rates whenever they follow MEMORIZE more closely.
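The per-learner analysis just described can be reproduced along these lines (a sketch of ours with toy values; the paper uses the learners' actual reviewing sequences):

```python
# Pearson correlation between the log-likelihood of a learner's reviewing
# sequences under MEMORIZE and their empirical forgetting rates; a negative
# coefficient means closer adherence to MEMORIZE goes with less forgetting.
from scipy.stats import pearsonr

log_likelihoods = [-4.2, -3.1, -5.0, -2.8, -3.7, -4.9, -2.5, -4.4]
forgetting_rates = [0.31, 0.22, 0.35, 0.18, 0.26, 0.33, 0.17, 0.30]

rho, p = pearsonr(log_likelihoods, forgetting_rates)
print(f"Pearson r = {rho:.2f} (P = {p:.3f})")
```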
Fig. 3. Pearson correlation coefficient between the log-likelihood of the top and bottom 50% of reviewing sequences of a learner under MEMORIZE and its associated empirical forgetting rate. The circles indicate median values and the bars indicate standard error. Lower correlation values correspond to greater gains due to MEMORIZE. To ensure reliable estimation, we considered learners with at least 70 reviewing sequences with a training period T = 8 ± 3.2 d. There were 322 of such learners.

Discussion
Since the Leitner system (9), there has been a wealth of spaced repetition algorithms (7, 8, 10, 11, 13). However, there has been a paucity of work on designing adaptive, data-driven spaced repetition algorithms with provable guarantees. In this work, we have introduced a principled modeling framework to design online spaced repetition algorithms with provable guarantees, which are specially designed to adapt to the learners' performance, as monitored by modern spaced repetition software and online platforms. Our modeling framework represents spaced repetition using the framework of marked temporal point processes and SDEs with jumps and, exploiting this representation, it casts the design of spaced repetition algorithms as a stochastic optimal control problem of such jump SDEs. Since our framework is agnostic to the particular modeling choices, i.e., the memory model and the quadratic loss function, we believe it provides a powerful tool to find spaced repetition algorithms that are provably optimal under a given choice of memory model and loss.

There are many interesting directions for future work. For example, it would be interesting to perform large-scale interventional experiments to assess the performance of our algorithm in comparison with existing spaced repetition algorithms deployed by, e.g., Duolingo. Moreover, in our work, we consider a particular quadratic loss and soft constraints on the number of reviewing events; however, it would be useful to derive optimal reviewing intensities for other losses capturing particular learning goals and hard constraints on the number of events. We assumed that, by reviewing an item, one can influence only its recall probability and forgetting rate. However, items may be dependent and, by reviewing an item, one can influence the recall probabilities and forgetting rates of several items. The dataset we used spans only 2 wk, and that places a limitation on the range of time intervals between reviews and retention intervals we can study. It would be very interesting to evaluate our framework in datasets spanning longer periods of time. Finally, we believe that the mathematical techniques underpinning our algorithm, i.e., stochastic optimal control of SDEs with jumps, have the potential to drive the design of control algorithms in a wide range of applications.

Materials and Methods

Evaluation Procedure. To evaluate the performance of our proposed algorithm, we rely on the following evaluation procedure. First, for each (user, item) pair, we perform a likelihood-based comparison to determine how closely its reviewing sequence follows a specific reviewing schedule (be it MEMORIZE, uniform, or threshold) during the first n − 1 reviews, where n is the number of reviews in the reviewing sequence. Second, we compute a quality metric, the empirical forgetting rate n̂, using the nth review, or test review, and the retention interval t_n − t_{n−1}, i.e., the time elapsed since the last review. Third, for each reviewing sequence, we record the value of the quality metric, the likelihood under each reviewing schedule, the training period (i.e., t_{n−1} − t_1), and the number of reviewing events. Finally, we create the treatment and control groups by picking, for each method, the top 25% of pairs in terms of likelihood, where we skip any sequence lying in the top 25% for more than one method. Refer to SI Appendix, section 13 for an analysis showing that our evaluation procedure satisfies the assumption of random assignment between treatment and control groups for item difficulties (31).

In the above procedure, to do the likelihood-based comparison, we first estimate the parameters α and β and the initial forgetting rate n_i(0) using half-life regression (25) on the Duolingo dataset. Here, note that we fit a single set of parameters α and β for all items and a different initial forgetting rate n_i(0) per item, and we use the power-law forgetting curve model due to its better performance (in terms of MAE) in our experiments (refer to SI Appendix, section 8 for more details). Then, for each user, we fit the parameter q in MEMORIZE and the parameter μ in the uniform reviewing schedule using maximum-likelihood estimation, and, for the threshold-based schedule, we fit the parameter c using maximum-likelihood estimation and the parameters m_th and ζ using grid search. Finally, we compute the likelihood of the times of the first n − 1 reviewing events for each (user, item) pair under the intensity given by MEMORIZE, i.e., u(t) = q^{-1/2}(1 − m(t)); the intensity given by the uniform schedule, i.e., u(t) = μ; and the intensity given by the threshold-based schedule, i.e., u(t) = c·exp((t − s)/ζ) whenever the estimated recall probability m̂(t) has reached the threshold m_th at time s, and u(t) = 0 otherwise. The likelihood LL({t_i}) of a set of reviewing events {t_i} given an intensity function u(t) can be computed as follows (16):

LL({t_i}) = Σ_i log u(t_i) − ∫_0^T u(t) dt.

More details on the empirical distribution of likelihood values under each reviewing schedule are provided in SI Appendix, section 10.

Quality Metric: Empirical Forgetting Rate. For each (user, item) pair, the empirical forgetting rate is an empirical estimate of the forgetting rate by the time of the last reviewing event; i.e.,

n̂ = −log(m̂(t_n))/(t_n − t_{n−1}),

where m̂(t_n) is the empirical recall probability, which consists of the fraction of correct recalls of sentences containing the word in the session at time t_n. Note that this empirical estimate does not depend on the particular choice of memory model and, given a sequence of reviews, the lower the empirical forgetting rate is, the more effective the reviewing schedule. Moreover, for a fairer comparison across items, we normalize each empirical forgetting rate using the average empirical initial forgetting rate of the corresponding item at the beginning of the observation window, i.e.,

n̂_0 = (1/|D_i|) Σ_{(u,i)∈D_i} n̂_{0,(u,i)},  with  n̂_{0,(u,i)} = −log(m̂(t_{(u,i),1}))/(t_{(u,i),1} − t_{(u,i),0}),

where D_i ⊆ D is the subset of (user, item) pairs in which item i was reviewed and t_{(u,i),k} is the time of the kth review in the reviewing sequence associated to the (u, i) pair. However, our results are not sensitive to this normalization step, as shown in SI Appendix, section 14.
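A minimal sketch (ours; toy numbers, not the paper's data) of this quality metric and its normalization:

```python
import math

def empirical_forgetting_rate(m_hat, t_n, t_prev):
    """n_hat = -log(m_hat(t_n)) / (t_n - t_{n-1}), where m_hat(t_n) is the fraction
    of correct recalls of sentences containing the word in the last session."""
    return -math.log(m_hat) / (t_n - t_prev)

# Average empirical initial forgetting rate of an item, estimated from the first
# review interval of every (user, item) pair in which the item appears (toy data:
# (m_hat at the first follow-up, t_1, t_0) per pair).
first_intervals = [(0.6, 1.0, 0.0), (0.7, 2.0, 0.0), (0.5, 1.5, 0.0)]
n_hat_0 = sum(empirical_forgetting_rate(m, t1, t0)
              for m, t1, t0 in first_intervals) / len(first_intervals)

# Normalized metric for one reviewing sequence of that item.
n_hat = empirical_forgetting_rate(0.8, t_n=9.0, t_prev=5.2)
print(f"raw n_hat = {n_hat:.3f}, normalized = {n_hat / n_hat_0:.3f}")
```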
Forgetting Empirical Metric: Quality each under values likelihood in of provided are distribution schedule empirical reviewing the on details More (16): follows eiwn events reviewing ae ceue i.e., schedule, based i.e., schedule, uniform ie yMMRZ,i.e., MEMORIZE, by given vr u eut r o estv oti omlzto tp ssonin shown as step, normalization this to 14 . sensitive section Appendix, not SI are results our ever, k Furthermore, where hrve nterveigsqec soitdt h (u, the to associated sequence reviewing the in review th oevr o oefi oprsnars tm,w omlz each normalize we items, across comparison fair more a for Moreover, eoie aur ,2019. 8, January misinformation. Deposited Data https://github.com/Networks-Learning/memorize. and at and Available news Github. Search 324–332. optimization.” pp fake Web York), New of on (ACM, Y Conference spread Maarek International Y, the Liu ACM eds reduce 11th Mining, and the of detect Proceedings to crowd the M Zhang A, Tomkins eds 51–60. Mining, pp York), Data New and (ACM, Search networks. Web on social Conference in International broadcasting smart for algorithm view. of point control optimal stochastic A activity: Computation and Analysis, Modeling, . function. savings 406. forward. 56–64. pp World Real the in ogy thinning. MA). Belmont, retention. optimal of ridgeline temporal retention. long-term improves Learning t n D oeta hseprcletmt osntdpn nteparticular the on depend not does estimate empirical this that Note . m(t ˆ i D ⊆ n ttSiRvJIs ahStat Math Inst J Rev Sci Stat AL Berlin). (ACL, aa e oitc Q Logistics Res Naval steeprclrcl rbblt,wihcnit ftefrac- the of consists which probability, recall empirical the is ) ahPsychol Math J n ˆ stesbe f(sr tm ar nwihitem which in pairs item) (user, of subset the is 0,(u,i i , sco Sci Psychol {t ) ple tcatcPoessadCnrlfrJump-Diffusions: for Control and Processes Stochastic Applied LL({t i u(t = yai rgamn n pia Control Optimal and Programming Dynamic } d enbce ,PmrnzJ(ot ulses e York), New Publishers, (Worth J Pomerantz M, Gernsbacher eds , ie nitniyfunction intensity an given − PNAS ) n ˆ u(t = i log( = }) 55:25–35. n ˆ u(t ) rial pcdRptto oe o agaeand Language for Model Repetition Spaced Trainable A c 0 18:133–134. − = = exp((t = m(t ) ˆ sco Sci Psychol 26:403–413. log( | n h nest ie ytethreshold- the by given intensity the and µ; X = |D i ac ,2019 5, March (u,i q 1 m(t −1/2 log ˆ i − | ),1 25:1–21. eto 10. section Appendix, SI SA,Philadelphia). (SIAM, (u,i ))/(t s)/ζ n u(t X sco Sci Psychol ))/(t (1 )∈D 17:249–255. i − .Telikelihood The ). (u,i o ah(sr tm,teempirical the item), (user, each For ) i − n x sco er e Cogn Mem Learn Psychol Exp J m(t n ˆ ),1 − 0,(u,i Z − 0 t ) h nest ie ythe by given intensity the )); ahLanRes Learn Mach J T | n−1 19:1095–1102. t ) u(t , (u,i o.116 vol. rceig fte1t ACM 10th the of Proceedings u(t ), ),0 ) dt ,where ), a ecmue as computed be can ) . LL({t i | ntessinat session the in Ahn Scientific, (Athena i o 10 no. a reviewed. was 18:205. i i fastof set a of }) ar How- pair. ) t (u,i ),k | Psychol- 11:397– sthe is 3993 n ˆ t 0 n ;
