
Unbounded Human Learning: Optimal Scheduling for Spaced Repetition

Siddharth Reddy, Department of Computer Science, Cornell University ([email protected])
Igor Labutov, Electrical and Computer Engineering, Cornell University ([email protected])
Siddhartha Banerjee, Operations Research and Information Engineering, Cornell University ([email protected])
Thorsten Joachims, Department of Computer Science, Cornell University ([email protected])

ABSTRACT
In the study of human learning, there is broad evidence that our ability to retain information improves with repeated exposure and decays with delay since last exposure. This plays a crucial role in the design of educational software, leading to a trade-off between teaching new material and reviewing what has already been taught. A common way to balance this trade-off is spaced repetition, which uses periodic review of content to improve long-term retention. Though spaced repetition is widely used in practice, e.g., in electronic flashcard software, there is little formal understanding of the design of these systems. Our paper addresses this gap in three ways. First, we mine log data from spaced repetition software to establish the functional dependence of retention on reinforcement and delay. Second, we use this memory model to develop a stochastic model for spaced repetition systems. We propose a queueing network model of the Leitner system for reviewing flashcards, along with a heuristic approximation that admits a tractable optimization problem for review scheduling. Finally, we empirically evaluate our queueing model through a Mechanical Turk experiment, verifying a key qualitative prediction of our model: the existence of a sharp phase transition in learning outcomes upon increasing the rate of new item introductions.

CCS Concepts
•Applied computing → Computer-assisted instruction; •Mathematics of computing → Queueing theory;

Keywords
Spaced Repetition; Queueing Models; Human Memory

1. INTRODUCTION
The ability to learn and retain a large number of new pieces of information is an essential component of human learning. Scientific theories of human memory, going all the way back to 1885 and the pioneering work of Ebbinghaus [9], identify two critical variables that determine the probability of recalling an item: reinforcement, i.e., repeated exposure to the item, and delay, i.e., time since the item was last reviewed. Accordingly, scientists have long been proponents of the spacing effect for learning: the phenomenon in which periodic, spaced review of content improves long-term retention.

A significant development in recent years has been a growing body of work that attempts to 'engineer' the process of human learning, creating tools that enhance the learning process by building on the scientific understanding of human memory. These educational devices usually take the form of 'flashcards' – small pieces of information content which are repeatedly presented to the learner on a schedule determined by a spaced repetition algorithm [4]. Though flashcards have existed for a while in physical form, a new generation of spaced repetition software such as SuperMemo [20], Anki [10], Mnemosyne [2], Pimsleur [18], and Duolingo [3] allows a much greater degree of control and monitoring of the review process. These software applications are growing in popularity [4], but there is a lack of formal mathematical models for reasoning about and optimizing such systems. In this work, we combine memory models from psychology with ideas from queueing theory to develop such a mathematical model for these systems. In particular, we focus on one of the simplest and oldest spaced repetition methods: the Leitner system [13].

The Leitner system, first introduced in 1970, is a heuristic for prioritizing items for review.
It is based on a series of decks of flashcards. After the user sees a new item for the first time, it enters the system at deck 1. The items at each deck form a first-in-first-out (FIFO) queue, and when the user requests an item to review, the system chooses a deck i according to some schedule, and presents the top item. If the user does not recall the item, the item is added to the bottom of deck i−1; else, it is added to the bottom of deck i+1. The aim of the scheduler is to ensure that items from lower decks are reviewed more often than those from higher decks, so the user spends more time working on forgotten items and less time on recalled items. Existing schemes for assigning review frequencies to different decks are based on heuristics that are not founded on any formal reasoning, and hence, have no optimality guarantees. One of our main contributions is a principled method for determining appropriate deck review frequencies.

The problem of deciding how frequently to review different decks in the Leitner system is a specific instance of the more general problem of review scheduling for spaced repetition software. The main challenge in all settings is that schedules must balance competing priorities of introducing new items and reviewing old items in order to maximize the rate of learning. While most existing systems use heuristics to make this trade-off, our work presents a principled understanding of the tension between novelty and reinforcement.
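As a concrete illustration of the deck mechanics just described, the following is a minimal Python sketch of the Leitner update rule. It is not the authors' released code; the function names, the n = 5 default, and the toy usage at the end are our own illustrative choices.

```python
from collections import deque

def init_decks(n_decks=5):
    """One FIFO queue per Leitner deck, plus a list of mastered items."""
    return {k: deque() for k in range(1, n_decks + 1)}, []

def review(decks, mastered, deck_idx, recalled, n_decks=5):
    """Apply the Leitner update: move the top item of deck `deck_idx` up one
    deck if recalled, down one deck (bounded below at deck 1) otherwise.
    Items recalled at the top deck are removed as 'mastered'."""
    item = decks[deck_idx].popleft()
    if recalled:
        if deck_idx == n_decks:
            mastered.append(item)
        else:
            decks[deck_idx + 1].append(item)
    else:
        decks[max(deck_idx - 1, 1)].append(item)

# Example: a new item enters at deck 1, is recalled once, then forgotten.
decks, mastered = init_decks()
decks[1].append("bonjour")
review(decks, mastered, 1, recalled=True)   # item moves up to deck 2
review(decks, mastered, 2, recalled=False)  # item falls back to deck 1
```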
1.1 Related Work
The scientific literature on modeling human memory is highly active and dates back more than a century. One of the simplest memory models, the exponential forgetting curve, was first studied by Ebbinghaus in 1885 [9] – it models the probability of recalling an item as an exponentially-decaying function of the time elapsed since previous review and the memory 'strength'. The exact nature of how strength evolves as a function of the number of reviews, length of review intervals, and other factors is a topic of debate, though there is some consensus on the existence of a spacing effect, in which spaced reviews lead to greater strength than massed reviews (i.e., cramming) [8, 5]. Recent studies have proposed more sophisticated probabilistic models of learning and forgetting [17, 15], and there is a large body of related work on item response theory and knowledge tracing [14, 7]. Our work both contributes to this literature (via observational studies on log data from the Mnemosyne software) and uses it as the basis for our queueing model and scheduling algorithm.

Though used extensively in practice (see [4] for an excellent overview), there is very limited literature on the design of spaced repetition software. One notable work in this regard is that of Novikoff et al. [16], who propose a theoretical framework for spaced repetition based on a set of deterministic operations on an infinite string of content pieces. They assume identical items and design schedules to implement deterministic spacing constraints, which are based on an intuitive understanding of the effect of memory models on different learning objectives. The focus in [16] is on characterizing the combinatorial properties (e.g., maximum asymptotic throughput) of schedules that implement various spacing constraints. Though our work shares the same spirit of formalizing the spaced repetition problem, we improve upon their work in three ways: (1) in terms of empirical verification, as our work leverages both software log data and large-scale experimentation to verify the memory models we use and to test the predictions made by our mathematical models; (2) in computational terms, wherein, by using appropriate stochastic models and approximations, we formulate optimization problems that are much easier to solve; and (3) in terms of flexibility, as our model can more easily incorporate various parameters such as the user's review frequency, non-identical item difficulties, and different memory models.

1.2 Our Contributions
The key contributions of this paper fall into two categories. First, the paper introduces a principled methodology for designing review scheduling systems with various learning objectives. Second, the models we develop provide qualitative insights and general principles for spaced repetition. The overall argument in this paper consists of the following three steps:

1. Mining large-scale log data to validate human memory models: First, we perform observational studies on data from Mnemosyne [2], a popular flashcard software tool, to compare different models of retention probability as a function of reinforcement and delay. Our results, presented in Section 2, add to the existing literature on memory models and provide the empirical foundation upon which we base our mathematical model of spaced repetition.

2. Mathematical modeling of spaced repetition systems: Our main contribution lies in embedding the above memory model into a stochastic model for spaced repetition systems, and using this model to optimize the review schedule. Our framework, which we refer to as the Leitner Queue Network, is based on ideas from queueing theory and job scheduling. Though conceptually simple and easy to simulate, the Leitner Queue Network does not provide a tractable way to optimize the review schedule. To this end, we propose a (heuristic) approximate model, which in simulations is close to our original model for low arrival rates, and which leverages the theory of product-form networks [11, 6] to greatly simplify the scheduling problem. This allows us to study several relevant questions: the maximum rate of learning, the effect of item difficulties, and the effect of a learner's review frequency on their overall rate of learning. We present our model, theory, and simulations in Section 3.

3. Verifying the mathematical model in controlled experiments: Finally, we use Amazon Mechanical Turk [1] to perform large-scale experiments to test our mathematical models. In particular, we verify a critical qualitative prediction of our mathematical model: the existence of a phase transition in learning outcomes upon increasing the rate of introduction of new content beyond a maximum threshold. Our experimental results agree well with our model's predictions, reaffirming the utility of our framework.

Our work provides the first mathematical model for spaced repetition systems which is empirically tested and admits a tractable optimization problem for review scheduling. It opens several directions for further research: developing better models for such systems, providing better analysis for the resulting models, and performing more empirical studies to understand these systems. We discuss some of these open questions in detail in Section 5. Our experimental platform can help serve as a testbed for future studies; to this end, we release all our data and software tools to facilitate replication and follow-up studies (see Section 4).

2. TESTING HUMAN MEMORY MODELS
To design a principled spaced repetition system, we must first understand how a user's ability to recall an item is affected by various system parameters. One well-studied model of human memory from the psychology literature is the exponential forgetting curve, which claims that the probability of recalling an item decays exponentially with the time elapsed since it was last reviewed, at a rate which decreases with the 'strength' of the item's memory trace. In this section, we conduct an observational study on large-scale log data collected from the Mnemosyne [2] flashcard software to validate the exponential forgetting curve.

Exponential Forgetting Curve.
We adopt a variant of the standard exponential forgetting curve model, where recall is binary (i.e., a user either completely recalls or forgets an item) and the probability of recalling an item has the following functional form:

P[recall] = exp(−θ · d/s),    (1)

where θ ∈ R+ is the item difficulty, d ∈ R+ is the time elapsed since previous review, and s ∈ R+ is the memory strength. Our formulation is slightly different from that of Ebbinghaus [9], in that we have added an explicit item difficulty parameter θ, which corresponds to the assumption that there is a constant, item-specific component of the memory decay rate.

Figure 1: Schematic of the classification task used to evaluate memory models: Each square corresponds to a user-item interaction with a binary outcome. The gray squares are thrown out. This training-validation split occurs on each fold, with sets of full and truncated histories changing across folds.

To justify the use of this memory model in our scheduling algorithm, we first explore how different forms of the exponential forgetting curve model fit empirical data. In particular, we explore the use of a global item difficulty θ vs. an item-specific difficulty θi, as well as several simple models of memory strength: a constant strength s = 1, a strength s = nij equal to the number of repetitions of item i for user j (where nij ≥ 1), and a strength s = qij equal to the position of item i in the Leitner system for user j (where qij ≥ 1).

Experiments on Log Data.
We use large-scale log data collected from Mnemosyne [2], a popular flashcard software tool, to validate our assumptions about the forgetting curve. After filtering out users and items with fewer than five interactions, we select a random subset of the data that contains 859,591 interactions, 2,742 users, and 88,892 items. Each interaction is annotated with a grade (on a 0-5 scale) that was self-reported by the user. Users are instructed by the Mnemosyne software to use a grade of 0 or 1 to indicate that they did not recall the item, and a grade of 2-5 to indicate that they did recall the item, with higher grades implying easier recall. We discretize grades into binary outcomes, where recall corresponds to grade ≥ 2, and observe an overall recall rate of 0.56 in the data. Additionally, we scale the time intervals between reviews to days.

We compare the exponential forgetting curve from Eqn. 1 to three benchmark models: the zero- and one-parameter logistic item response theory models (henceforth, 0PL-IRT and 1PL-IRT) and logistic regression. The 0PL-IRT user model assumes the recall likelihood follows P[recall] = φ(θj) for each user j observed in the training set (where φ is the logistic link function); similarly, the 0PL-IRT item model assumes P[recall] = φ(−βi) for each item i in the training set. The 1PL-IRT model [14], a mathematical formulation of the Rasch cognitive model [19], has the following recall likelihood: P[recall] = φ(θj − βi) for user j and item i, where θj is user proficiency and βi is item difficulty. The logistic regression model uses the following statistics of the previous review intervals and outcomes to predict recall: mean, median, min, max, range, length, first, and last.

Logistic regression and 1PL-IRT are trained using MAP estimation with an L2 penalty, where the regularization constant is selected to maximize log-likelihood on a validation set. All other models are trained using maximum-likelihood estimation (i.e., with an implicit uniform prior on model parameters). The hyperparameters in the IRT models are the user abilities θ and/or item difficulties β; the hyperparameters in the exponential forgetting models are the item-specific difficulties θi or a global difficulty θ. We use ten-fold cross-validation to evaluate the memory models on the task of predicting held-out outcomes. Our performance metric is area under the ROC curve (AUC), which measures the discriminative ability of a binary classifier that assigns probabilities to class membership.¹ On each fold, we train on the full histories of 90% of user-item pairs and the truncated histories of 10% of user-item pairs, and validate on the interactions immediately following the truncated histories. Truncations are made uniformly at random – see Fig. 1 for an illustration of this scheme. After using cross-validation to perform model selection, we evaluate the models on a held-out test set of truncated user-item histories (20% of the user-item pairs in the complete data set) that was not visible during the earlier model selection phase.

¹We use AUC as a metric instead of raw prediction accuracy because it is insensitive to class imbalance.
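To make the fitting and evaluation procedure concrete, the sketch below fits the simplest global-difficulty forgetting curve, P[recall] = exp(−θ · dij/nij) (row 5 of Table 1 below), by maximum likelihood and scores it with validation AUC. It is only an illustration, not the authors' released code: the column names (`delay_days`, `n_reviews`, `recall`), the synthetic placeholder data, and the single train/validation split are our assumptions, whereas the paper's protocol uses ten-fold cross-validation over truncated user-item histories.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize_scalar
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def recall_prob(theta, delay, strength):
    """Exponential forgetting curve: P[recall] = exp(-theta * delay / strength)."""
    return np.exp(-theta * delay / strength)

def neg_log_likelihood(theta, df):
    p = np.clip(recall_prob(theta, df["delay_days"], df["n_reviews"]), 1e-6, 1 - 1e-6)
    y = df["recall"].values
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# `logs` stands in for the interaction data: one row per review, with the time
# since the previous review (days), the number of past reviews of that item by
# that user, and the binarized recall outcome.
logs = pd.DataFrame({
    "delay_days": np.random.exponential(3.0, size=5000),
    "n_reviews": np.random.randint(1, 10, size=5000),
    "recall": np.random.binomial(1, 0.56, size=5000),
})
train, valid = train_test_split(logs, test_size=0.1, random_state=0)

# Maximum-likelihood estimate of the global item difficulty theta.
fit = minimize_scalar(neg_log_likelihood, args=(train,), bounds=(1e-6, 10.0), method="bounded")
theta_hat = fit.x

# Validation AUC of the fitted model's recall probabilities.
p_valid = recall_prob(theta_hat, valid["delay_days"], valid["n_reviews"])
print("theta =", round(theta_hat, 4),
      "validation AUC =", round(roc_auc_score(valid["recall"], p_valid), 3))
```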

Table 1 summarizes all the models that were evaluated, with rows 1-4 representing the benchmarks and rows 5-14 variants of the exponential forgetting curve model. We compare models which use a global item-difficulty parameter θ (rows 5-9) vs. item-specific difficulties θi (rows 10-14); moreover, we allow the memory strength to be constant (rows 6 and 11), proportional to the number of reviews nij (rows 5, 7, 10, 12), or proportional to the position of the item in the Leitner system qij (rows 8, 9, 13, 14).

    Row   P[recall]                 Model
    1     φ(θj)                     0PL-IRT user
    2     φ(−βi)                    0PL-IRT item
    3     φ(θj − βi)                1PL-IRT
    4     φ(β · x)                  Logistic regression
    5     exp(−θ · dij/nij)         Exponential forgetting curve
    6     exp(−θ · dij)
    7     exp(−θ/nij)
    8     exp(−θ · dij/qij)
    9     exp(−θ/qij)
    10    exp(−θi · dij/nij)
    11    exp(−θi · dij)
    12    exp(−θi/nij)
    13    exp(−θi · dij/qij)
    14    exp(−θi/qij)

Table 1: Summary of models used for prediction: In all cases, the subscripts refer to user j and item i. Rows 1-4 represent our benchmarks; here φ is the logistic function, while in row 4, x refers to the feature vector of review interval and outcome statistics described earlier in this section, and β is a vector of coefficients. In rows 5-14, dij is the time elapsed since previous review of item i for user j, qij denotes the position of item i in the Leitner system for user j, and nij is the number of past reviews of item i by user j. θ represents a global item difficulty, while θi is an item-specific difficulty for item i.

Figure 2: To evaluate the memory models' ability to predict outcomes in the Leitner system, validation AUC is computed for separate bins of data that control for an item's position qij in the Leitner system. The error bars show the standard error of validation AUC across the ten folds of cross-validation. Each curve corresponds to a row in Table 1. We have included only the best-performing benchmark model, 1PL-IRT (model 3), to reduce clutter.

The predictive performance of the models on validation and test data is given in Fig. 2 and 3. We make four key observations:

1. Positive impact of delay term: Incorporating a delay term improves the performance of the memory model. In Fig. 2, the solid lines (with delay term) and dashed lines (without delay term) of the same color encode comparable models with and without the delay term (model 5 vs. 7, 8 vs. 9, 10 vs. 12, 13 vs. 14).

2. Use of item-specific difficulties: Item-specific difficulties θi outperform the global item difficulty θ for lower decks (qij ≤ 2) and higher decks (qij > 5), but the global difficulty performs better for intermediate decks (model 5 vs. 10, 8 vs. 13); in Fig. 2, compare the solid cool colors (models with θ) to the solid warm colors (models with θi).

3. Leitner position vs. number of reviews: Setting the memory strength s to be equal to the Leitner deck position qij performs better than setting it to be proportional to the number of past reviews nij, which in turn is better than using a constant s; see Fig. 3 (model 8 vs. 5-6, and 13 vs. 10-11).

Figure 3: To compare the three memory strength models s = nij, s = 1, and s = qij, we compute AUCs for the full data set (instead of separate bins of data, as in Fig. 2). The box-plots show the spread of validation AUC across the ten folds of cross-validation, and the orange circles show AUC on the test set.

4. Performance w.r.t. benchmark models: Exponential forgetting models that include the delay term (models 5, 8, 10, and 13) perform comparably to 1PL-IRT (model 3), which is the best-performing benchmark model; in Fig. 2, compare the solid black line (model 3) to the other solid lines (models 5, 8, 10, and 13).

Based on these observations, our model of the Leitner system uses the following exponential forgetting curve:

P[recall] = exp(−θ · di/qi),    (2)

where for item i, di is the time since last reviewed, qi is its current deck in the Leitner system, and θ is the global difficulty. The choice of the former two parameters follows from observations 3 and 4. The choice between using θi or θ is less clear from the data, and we settle on the latter due to considerations of practicality (θi may be unknown and/or difficult to estimate) and mathematical tractability. We discuss extensions of our model to using θi in later sections.
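As a quick numerical illustration of Eqn. 2 (not part of the original exposition), the snippet below evaluates the recall probability for a few delays and deck positions, using the global difficulty θ = 0.0077 per second that is later estimated from the Mechanical Turk data in Section 4; the delay values are arbitrary.

```python
import math

def p_recall(delay_seconds, deck, theta=0.0077):
    """Eqn. 2: probability of recalling an item that sits in deck `deck`
    and was last reviewed `delay_seconds` ago."""
    return math.exp(-theta * delay_seconds / deck)

# Recall probability after a 1-minute and a 5-minute delay, by deck position.
for deck in (1, 3, 5):
    print(deck, round(p_recall(60, deck), 2), round(p_recall(300, deck), 2))
# Higher decks (stronger memories) decay more slowly, so the scheduler can
# afford longer gaps between reviews for items that have moved up.
```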

3. A STOCHASTIC MODEL FOR SPACED REPETITION SYSTEMS
Based on the memory model developed in the previous section (as summarized in Eqn. 2), we now present a stochastic model for a spaced repetition system, and outline how we can use it to design good review scheduling policies. We note again that all existing schemes for assigning review frequencies to decks in the Leitner system, and in fact, in all other spaced repetition systems, are based on heuristics with no formal optimality guarantees. One of our main contributions is to provide a principled method for determining appropriate schedules for spaced repetition systems.

We focus on a regime where the learner wants to memorize a very large set of items – in particular, the number of available items is much larger than the potential amount of time spent by the learner in memorizing them. A canonical example of such a setting is learning words in a foreign language. From a technical point of view, this translates to assuming that new items are always available for introduction into the system, similar to an open queueing system (i.e., one with a potentially infinite stream of arrivals). Moreover, this allows us to use the steady-state performance of the queue as a measure of the performance of the scheduler, which is an appropriate metric in this setting. We refer to this as the long-term learning regime.

As mentioned before, our model is based on the Leitner system [13], one of the simplest and oldest spaced repetition systems. It comprises a series of n decks of flashcards, indexed as {1, 2, . . . , n}, where new items enter the system at deck 1, and items upon being reviewed either move up a deck if recalled correctly or down if forgotten. In principle, the deck numbers can extend in both directions; in practice however, they are bounded both below and above – we follow this convention and assume that items in deck 1 are reflected (i.e., they remain in deck 1 if they are incorrectly reviewed), and all items which are recalled at deck n (which in experiments we take as n = 5) are declared to be 'mastered' and removed from the system. For simplicity of exposition, we also assume that the difficulty parameter θ is the same for all items (i.e., model 8 in Table 1), but we will discuss later how to allow for item-specific difficulties (i.e., model 13 in Table 1).

Figure 4: The Leitner Queue Network: Each queue represents a deck in the Leitner system. New items enter the network at deck 1. Green arrows indicate transitions that occur when an item is correctly recalled during review, and red arrows indicate transitions for incorrectly recalled items. Queue k is served (i.e., chosen for review) at a rate µk, and selects items for review in a FIFO manner.

3.1 The Leitner Queue Network
We model the dynamics of flashcards in an n-deck Leitner system using a network of n queues, as depicted in Fig. 4. Formally, at time t, we associate with each deck k the vector Sk(t) = (Qk(t), {Tk,1(t), Tk,2(t), . . . , Tk,Qk(t)}), where Qk is the number of items in the deck at time t, and Tk,j < t is the time at which the jth item in deck k first entered the system (note that the times are sorted). A new item is introduced into the system at a time determined by the scheduler – it is first shown to the user (who we assume has not seen it before), and then inserted into deck 1.

We assume that the learner has a review frequency budget (e.g., the maximum rate at which the user can review items) of U, which is to be divided between reviewing the decks as well as viewing new items. Formally, we assume that review instances are created following a Poisson process at rate U. Our aim is to design a scheduler which at each review instant chooses an item to review. When a deck is chosen for review, we assume that items are chosen from it following a FIFO discipline. When an item comes up for review, its transition to the next state depends on the recall probability, which depends on the deck number and delay (i.e., time elapsed since the last review of that item). In particular, at time t, for any deck k, let Dk = t − Tk,1 denote the delay (i.e., the time elapsed since that item was last reviewed) of the head-of-the-line (HOL) item in deck k. Then, using the memory model from Eqn. 2, we have that upon reviewing the HOL item from deck k (for k ∈ {1, 2, 3, . . . , n − 1}), its transition follows:

P[k → k + 1] = exp(−θ · Dk/k)
P[k → max{k − 1, 1}] = 1 − exp(−θ · Dk/k)

Note that items in deck 1 return to the same deck upon incorrect recall. Finally, items coming up for review from deck n exit the system with probability exp(−θ · Dn/n) (i.e., upon correct recall), else transition to deck n − 1. We define the learning rate λout to be the long-term rate of mastered items exiting from deck n, i.e.:

λout = lim_{T→∞} (1/T) · #{items mastered in the interval [0, T]}

The aim of a scheduling policy is to maximize λout.

Given any scheduling policy that depends only on the state S(t) = (S1(t), S2(t), . . . , Sn(t)), it can be easily verified that S(t) forms a Markov chain. The most general scheduler-design problem is to choose a dynamic state-dependent schedule, wherein review instances are created following a Poisson process at rate U, and at each review instant, the scheduler defines a map from the state S(t) to a control decision which involves choosing either to introduce a new card, or a deck from which to review an item. Analyzing such a dynamic schedule is difficult as the state space of the system is very high dimensional (in fact, it is infinite dimensional unless we impose some restriction on the maximum queue size). However, we can simplify this by restricting ourselves to static scheduling policies: we assume that new items are injected into deck 1 following a Poisson process with rate λext (henceforth referred to as the arrival rate), and for each deck k, we choose a service rate µk, which represents the rate at which items from that deck come up for review. We need to enforce that the arrival rate λext and deck service rates together satisfy the user's review frequency budget constraint, i.e., λext + Σk µk ≤ U.²

²In practice, this corresponds to the following: for each review instance, with probability µk/(λext + Σk µk), we review the oldest item in deck k; else, we introduce a new item.

Restricting to static schedulers greatly reduces the problem dimensionality – the aim now is to choose λext, {µk}k∈[n] so as to maximize the learning rate λout. We henceforth refer to this system as the Leitner Queue Network. The use of such static policies is a common practice in the stochastic control literature, and moreover, such schedules are commonly used in designing real Leitner systems (although the review rates are chosen in a heuristic manner). However, although the above model is potentially amenable to simulation optimization techniques, a hurdle in obtaining a more quantitative understanding is that the Markov chain over S(t) is time-inhomogeneous, as the transition probabilities change with t. In the next section, we propose a heuristic approximation that lets us obtain a tractable program for optimizing the review schedule.
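The Leitner Queue Network under a static schedule is straightforward to simulate. The following is a self-contained discrete-event sketch of the dynamics described above; the parameter values mirror those in the paper's simulations (n = 5, U = 0.1902, θ = 0.0077), but the code itself is illustrative rather than the authors' released simulator, and simplifications such as an unbounded item pool and wasted review instants on empty decks are our own assumptions.

```python
import math
import random
from collections import deque

def simulate_lqn(lam_ext, mu, theta=0.0077, n_decks=5, horizon=500 / 0.1902, seed=0):
    """Simulate the Leitner Queue Network under a static schedule.

    lam_ext : external arrival rate of new items (items/sec)
    mu      : list of deck service rates [mu_1, ..., mu_n] (reviews/sec)
    Returns an estimate of the learning rate lambda_out = mastered / horizon.
    """
    rng = random.Random(seed)
    rates = [lam_ext] + list(mu)                 # index 0 = introduce a new item
    decks = {k: deque() for k in range(1, n_decks + 1)}   # FIFO of (item, last_review_time)
    t, next_item, mastered = 0.0, 0, 0
    while True:
        t += rng.expovariate(sum(rates))         # next review instant (Poisson clock)
        if t > horizon:
            break
        choice = rng.choices(range(n_decks + 1), weights=rates)[0]
        if choice == 0:                          # inject a new item into deck 1
            decks[1].append((next_item, t))
            next_item += 1
        elif decks[choice]:                      # review the head-of-line item of this deck
            item, last = decks[choice].popleft()
            delay = t - last
            if rng.random() < math.exp(-theta * delay / choice):   # recalled (Eqn. 2)
                if choice == n_decks:
                    mastered += 1                # correct recall at the top deck: mastered
                else:
                    decks[choice + 1].append((item, t))
            else:
                decks[max(choice - 1, 1)].append((item, t))
        # an empty deck wastes the review instant in this simplified sketch
    return mastered / horizon

# Example: a schedule with mu_k decreasing in k, within the budget U = 0.1902.
lam_ext = 0.01
mu = [0.06, 0.05, 0.03, 0.02, 0.02]              # lam_ext + sum(mu) = 0.19 <= U
print("estimated lambda_out:", simulate_lqn(lam_ext, mu))
```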
3.2 The Mean-Recall Approximation
The stochastic model in Section 3.1 captures all the important details of the Leitner system – however, its time-inhomogeneous nature means that it does not admit a closed-form program for optimizing the review schedule. In this section, we propose an approximation technique for the Leitner Queue Network, under which the problem of choosing an optimal review schedule reduces to a low-dimensional deterministic optimization problem. Simulation results in Section 3.3 (see Fig. 6) suggest that the two models match closely for small arrival rates. We note however that this approximation is essentially a heuristic, and obtaining more rigorous approximations for the proposed Leitner Queue Network model is an important topic for future work.

The main idea behind converting the model in Section 3.1 to a time-homogeneous model is that for small values of λ, for which the system appears stable (i.e., the total number of packets in the system does not blow up), each Leitner deck k behaves similarly to an M/M/1 queue with service rate µk (chosen by the scheduler) and some appropriate input rate λk. Recall that for an M/M/1 queue with input rate λ and service rate µ, the total sojourn time for any packet is distributed as Exponential(µ − λ) (see [11]). Based on this, we assume that for an item from deck k coming under review, the recall likelihood is given by:

P[Recall | Deck k] = E[exp(−(θ/k) · Dk)] = (µk − λk)/(µk − λk + θ/k)    (3)

The above expression follows from the moment generating function of the exponential distribution. We henceforth refer to this as the mean-recall approximation.

Formally, we define the mean-recall approximation as follows: Suppose we choose a static schedule λext, {µk}k∈[n] (with λext < µ1), and in addition choose input rates λk < µk at each deck. Moreover, suppose the probability Pk that an item from deck k is recalled correctly upon review is given by Eqn. 3.³ Finally, we assume the arrival rates {λk} satisfy the following flow-balance equations:

λ1 = λext + (1 − P1)λ1 + (1 − P2)λ2
λi = Pi−1 λi−1 + (1 − Pi+1)λi+1,   for i ∈ {2, 3, . . . , n − 1}
λn = Pn−1 λn−1

³One way to view this is that for each item in deck k coming up for review, we ignore the true delay Dk and independently generate D̂k ∼ Exponential(µk − λk), which is then used to determine the recall probability Pk.

Under the above assumptions, the Leitner Queue Network is a Jackson network of M/M/1 queues [11, 12], with arrival rates {λk} and service rates {µk}, and from Jackson's theorem, we have that all queues are ergodic, and in steady-state, for each deck k, the sojourn time Dk is indeed distributed as Exponential(µk − λk). Moreover, ergodicity also gives that the learning rate λout is the same as the external injection rate λext. Putting everything together, we get the following static planning problem:

Maximize (over λext and {µk}^n_{k=1}):   λext    (4)
Subject to:
    U ≥ λext + Σ^n_{k=1} µk,
    λ1 = λext + (1 − P1)λ1 + (1 − P2)λ2,
    λk = Pk−1 λk−1 + (1 − Pk+1)λk+1,   for k ≠ 1, n,
    λn = Pn−1 λn−1,
    Pk = (µk − λk)/(µk − λk + θ/k)   ∀k ∈ [n],
    0 ≤ λk ≤ µk   ∀k ∈ [n].

Thus, as desired, the mean-recall approximation helps convert the stochastic control problem of designing an optimal review schedule to a low (O(n)) dimensional, deterministic optimization problem. Now, we can use a nonlinear solver (e.g., IPOPT) to solve the static planning problem. Note that our problem is unusual compared to typical network control problems, as the routing probabilities depend on the service rates µk. Also, note that ergodicity of the system is critically dependent on the conditions λk < µk, ∀k ∈ {1, 2, . . . , n} – if, however, one or more of these do not hold, then the resulting queue lengths (i.e., deck sizes), and thus the delays between reviews for items in those decks, grow unbounded for the decks for which the condition is violated. Moreover, since items move to lower decks upon being incorrectly recalled, choosing a high injection rate should result in items building up in the lowest deck. We verify these qualitative observations through experiments in Section 4.
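The static planning problem (4) is small enough to hand to an off-the-shelf nonlinear solver. The sketch below sets it up with SciPy's SLSQP rather than the IPOPT solver mentioned above; the variable layout, initial guess, and solver choice are our own illustrative assumptions, and the maximizing λext corresponds to the phase-transition threshold discussed next in Section 3.3.

```python
import numpy as np
from scipy.optimize import minimize

def solve_static_plan(U=0.1902, theta=0.0077, n=5):
    """Maximize lambda_ext subject to the flow-balance, budget, and stability
    constraints of the static planning problem (Eqn. 4).
    Decision vector x = [lambda_ext, lambda_1..lambda_n, mu_1..mu_n]."""
    def unpack(x):
        return x[0], x[1:1 + n], x[1 + n:]

    def P(lam, mu):                      # mean-recall approximation, Eqn. 3
        k = np.arange(1, n + 1)
        return (mu - lam) / (mu - lam + theta / k)

    def flow_balance(x):                 # residuals of the flow-balance equations
        lam_ext, lam, mu = unpack(x)
        p = P(lam, mu)
        res = np.zeros(n)
        res[0] = lam_ext + (1 - p[0]) * lam[0] + (1 - p[1]) * lam[1] - lam[0]
        for i in range(1, n - 1):
            res[i] = p[i - 1] * lam[i - 1] + (1 - p[i + 1]) * lam[i + 1] - lam[i]
        res[n - 1] = p[n - 2] * lam[n - 2] - lam[n - 1]
        return res

    cons = [
        {"type": "eq", "fun": flow_balance},
        {"type": "ineq", "fun": lambda x: U - x[0] - np.sum(x[1 + n:])},     # budget
        {"type": "ineq", "fun": lambda x: x[1 + n:] - x[1:1 + n] - 1e-6},    # lambda_k < mu_k
    ]
    x0 = np.concatenate(([0.01], np.full(n, 0.02), np.full(n, U / (n + 1))))
    bounds = [(0.0, U)] * (2 * n + 1)
    sol = minimize(lambda x: -x[0], x0, bounds=bounds, constraints=cons, method="SLSQP")
    return unpack(sol.x)

lam_ext, lam, mu = solve_static_plan()
print("max learning rate (phase-transition threshold):", round(lam_ext, 4))
print("expected delay per deck, 1/(mu_k - lambda_k):", np.round(1.0 / (mu - lam), 1))
```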

3.3 Features of Optimal Leitner Schedules
We now explore the properties of the optimal review schedule for the Leitner Queue Network under the mean-recall approximation. The main qualitative prediction from our model is the existence of a phase transition in learning outcomes: Given a schedule {µk}, there is a threshold λt such that for all λext > λt, there are no feasible solutions {λk} satisfying Eqn. 4. Moreover, if λext > λt, then the lowest Leitner deck (i.e., Q1) experiences packet accumulation and delay blow-up, and thus the learning rate λout goes to 0. In Fig. 5, we simulate a review session with 500 reviews and 50 unique items for different values of λext. We observe that a sharp phase transition indeed occurs as the arrival rate is increased: throughput initially increases linearly with arrival rate, then sharply decreases.

Figure 5: Phase transition in learning: Average learning rate λout vs. arrival rate λext in the Leitner Queue Network with clocked delays, for a session of 500 reviews over 50 items. We set number of decks n = 5, review frequency budget U = 0.1902, and global item difficulty θ = 0.0077. The dashed vertical line is the predicted phase transition threshold under the mean-recall approximation.

Figure 6: The mean-recall approximation: Average learning rate λout vs. arrival rate λext, for 500 reviews over 50 items. We set number of decks n = 5, review frequency budget U = 0.1902, and global item difficulty θ = 0.0077. The green curve is generated using clocked delays, while the blue curve uses the mean-recall approximation. The λout = λext line (red dashed curve) is the steady-state λout under the mean-recall approximation.

The simulation in Fig. 5 is for the Leitner Queue Network with actual (or clocked) delays, i.e., item routing is based on actual times between reviews. The dotted line indicates the phase transition threshold obtained under our mean-recall approximation (Eqn. 4), which appears to be a lower bound (i.e., a conservative estimate) for the true phase transition point for review sessions of moderate length. Fig. 6 verifies our intuition that the mean-recall approximation performs well for small values of λext. Obtaining more rigorous guarantees on the approximation remains an open question.

The above simulations suggest that the mean-recall approximation gives a good heuristic for optimizing the learning rate. Moreover, the tractability of the resulting optimization program (Eqn. 4) lets us investigate structural aspects of the optimal schedule under the mean-recall assumption. In Fig. 7, we see that the optimal schedule spends more time on lower decks than on higher decks (i.e., µk ≥ µk+1 ∀k). This is partly a result of the network topology, where items enter the system through deck 1 and exit the system through deck n. However, in Fig. 8 we observe that the Leitner Queue Network also increases the expected delay between subsequent reviews as an item moves up through the system. Note that longer review intervals do not follow from decreasing µk, as the (steady-state) deck sizes can be different, as is indeed the case (see Fig. 9). We note here that there is empirical support in the literature for expanding intervals between repetitions [5].

Finally, Fig. 10 and 11 show how the maximum achievable learning rate depends on the general difficulty of items and on the user's review frequency budget U. The convexity of the latter plot is encouraging, as it suggests that there are increasing returns (for lower U) as the user increases their budget.

3.4 Extension to item-specific difficulties
The assumption that all items have the same difficulty θ can be relaxed by discretizing difficulties into a fixed number of bins b, and creating b parallel copies of the network. We now have b parallel Leitner Queue Networks, coupled via the budget constraint, which applies to the sum of service rates across the networks for each θ. We can assume that the θi are known a priori (e.g., from log data or expert annotations). To understand the effect of different θi, we compare the optimal schedule for different θi (but using the same budget U) in Fig. 12. The result is interesting, because to the best of our knowledge, there is little understanding of how the user should adjust deck review frequencies when the general difficulty of items changes. Fig. 12 suggests that when items are generally easy, the user should spend a roughly uniform amount of time on each deck; however, when items are of higher general difficulty, the user should spend more time on lower decks than on higher decks.

4. EXPERIMENTAL VALIDATION
To empirically test the fidelity of the Leitner Queue Network as a model for spaced repetition, we perform an experiment on Amazon Mechanical Turk (MTurk) involving participants memorizing words from a foreign language. Our study is designed to experimentally verify the existence of the phase transition (shown in Fig. 5), the primary qualitative prediction made by our model.

4.1 Experiment Setup
A total of 331 users ('turkers') on the MTurk platform were solicited to participate in a vocabulary learning task. At the beginning of the task, vocabulary used in the task was selected randomly from one of two categories: Japanese (words) and American Sign Language (animated gestures). Items for the experiment were sampled from the list of common words in both languages.⁴

⁴https://en.wiktionary.org/wiki/Appendix:1000_Japanese_basic_words for Japanese and http://www.lifeprint.com/asl101/gifs-animated/ for American Sign Language.
Figure 7: Optimal review schedule {µk} for n = 20, U = 1, θ = 0.01.

Figure 8: Expected delays 1/(µk − λk) under the optimal schedule for n = 20, U = 1, θ = 0.01.

Figure 9: Expected queue size λk/(µk − λk) under the optimal schedule for n = 20, U = 1, θ = 0.01. The kinks at the boundaries, in this and the previous plot, arise from having a bounded number of decks.

Figure 10: Variation in maximum learning rate λ∗ext with item difficulty θ, for n = 20, U = 1.

Figure 11: Variation in learning rate λ∗ext with review frequency budget U for n = 5, θ = 0.01. Note the convexity (hence, increasing returns) for low U.

Figure 12: Optimal review schedule {µk} for n = 20, U = 1 and varying item difficulties θ.

Figure 13: Screenshot of our Mechanical Turk interface: The user sees a word in Japanese (not shown) and enters a guess. The card is then flipped to reveal the word's meaning in English (shown above), and the user then assesses herself.

Figure 14: The exit rate λout vs. arrival rate λext, where number of decks n = 5, review frequency budget U = 0.1902, and global item difficulty θ = 0.0077.

Each task was timed to last 15 minutes, and turkers were compensated $1.00 for completing the task regardless of their performance. The task used the interface depicted in Fig. 13: a flashcard initially displaying the item in the foreign language (either word or gesture), and a pair of YES/NO buttons to collect input from the user in response to the question, "Do you know this word?". If the user selected YES, she was then asked to type the translation of the word in English. Once the word was entered, the flashcard was "flipped", revealing the correct English word. The user was then asked to self-assess their correctness on a scale of 1 (completely wrong) to 4 (perfect). Following the submission of the rating, the next card was sampled from the deck. In all our experiments, we consider a self-assessment score of 3 (almost perfect) or 4 (perfect) as a "pass", and all other scores as a "fail".

At the beginning of each task, each turker was assigned to one of the 12 conditions corresponding to the arrival rate (λext) of new items: [0.002, 0.004, 0.010, 0.015, 0.020, 0.023, 0.029, 0.050, 0.076, 0.095, 0.11, 0.19] (items per second). The resulting data set consists of a total of 77,034 logs, 331 unique users, 446 unique items, an overall recall rate of 0.663, a fixed session duration of 15 minutes, and an average session length of 171 logs, where each log is a tuple (turkerID, itemID, score, timestamp). We set deck review rates to µk ∝ 1/√k – this roughly follows the shape of the optimal allocation in Fig. 7. We note that choosing an optimal scheduler is not essential for our purpose of observing the phase transition – moreover, we cannot optimize the review rates µk ex ante, as we do not know the item difficulty θ or the review budget U.
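The µk ∝ 1/√k allocation and the sampling rule from footnote 2 can be wired up as in the short sketch below. The budget value is taken from the paper's experiments, but the normalization (scaling the rates so that λext + Σµk = U) and the function names are our own assumptions, since the paper states only the proportionality.

```python
import math
import random

def deck_review_rates(lam_ext, U=0.1902, n_decks=5):
    """Deck review rates mu_k proportional to 1/sqrt(k), scaled so that
    lam_ext + sum(mu_k) = U (the scaling convention is an assumption)."""
    weights = [1.0 / math.sqrt(k) for k in range(1, n_decks + 1)]
    scale = (U - lam_ext) / sum(weights)
    return [scale * w for w in weights]

def next_action(lam_ext, mu, rng=random):
    """Sampling rule from footnote 2: at each review instant, introduce a new
    item with probability lam_ext / (lam_ext + sum(mu)); otherwise review the
    oldest item of deck k with probability mu_k / (lam_ext + sum(mu))."""
    options = ["new item"] + [f"review deck {k}" for k in range(1, len(mu) + 1)]
    return rng.choices(options, weights=[lam_ext] + list(mu))[0]

mu = deck_review_rates(lam_ext=0.01)
print([round(m, 4) for m in mu])   # decreasing in k, roughly like Fig. 7
print(next_action(0.01, mu))
```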

During the experiments, we set the number of decks in the system to n = ∞, so items never exit the system during a review session. Items incorrectly recalled at deck 1 are 'reflected' and stay in deck 1. In our post-hoc analysis of the data, we consider an item to be 'mastered' if its final position is in deck 6 or greater. We estimate the empirical review budget U as (average number of logs in a session) / (session duration), and the empirical item difficulty θ using maximum-likelihood estimation. We measure throughput λout as (average number of items mastered in a session) / (session duration).

Figure 15: The number of items that finish in each deck vs. arrival rate λext, where number of decks n = 5, review frequency budget U = 0.1902, and global item difficulty θ = 0.0077.

Figure 16: The fraction of items seen during a session that finish in each deck for different arrival rates λext, where number of decks n = 5, review frequency budget U = 0.1902, and global item difficulty θ = 0.0077. Deck 6 refers to the pile of mastered items.

4.2 Results
Fig. 14 overlays the empirical learning rate from the MTurk data for each arrival rate condition on the learning rate curve for the simulated Leitner Queue Network (same as Fig. 5), using the parameter values for θ and U measured from the MTurk data (see the caption of Fig. 5 for details). The simulated and empirical curves are in close agreement; in particular, the MTurk data shows the phase transition in learning rate predicted by our theoretical model.

In addition to computing the observed throughput for the various arrival rates in the MTurk data, we compute the average distribution of items across the five decks and the pile of mastered items at the end of a session. This gives insight into where items accumulate in unstable regimes. Fig. 15 illustrates the same phase transition observed earlier: as the arrival rate increases, we first see an increase in the number of mastered items. However, as the arrival rate increases past the optimum, relatively fewer items are mastered and relatively more items get 'stuck' in deck 1. Intuitively, the user gets overwhelmed by incoming items, so that fewer and fewer items get reviewed often enough to achieve mastery. Fig. 15 and 16 match the behavior suggested by our queueing model: for injection rates higher than the threshold, the number of items in deck 1 blows up while the other decks remain stable.

5. CONCLUSION AND OPEN QUESTIONS
Our work develops the first formal mathematical model for reasoning about spaced repetition systems that is validated by empirical data and provides a principled, computationally tractable algorithm for flashcard review scheduling. Our formalization of the Leitner system suggests the maximum speed of learning as a natural design metric for spaced repetition software; using techniques from queueing theory, we derive a tractable program for calibrating the Leitner system to optimize the speed of learning. Our queueing framework opens doors to leveraging an extensive body of work in this area to develop more sophisticated extensions. To inspire and facilitate future work in this direction, we release (1) all model and evaluation code, (2) framework code for carrying out user studies, and (3) the data collected in our Mechanical Turk study. The data and code for replicating our experiments are available online at http://siddharth.io/leitnerq.

Our work suggests several directions for further research. The primary follow-up is to obtain a better understanding of the Leitner Queue Network; in particular, better approximations with rigorous performance guarantees. Doing so will allow us to design better control policies, which ideally could maximize the learning rate in the transient regime. The latter is critical for designing policies for cramming [16], a complementary problem to long-term learning where the number of items to be learnt is of the same order as the number of reviews. Next, our queueing model should be modified to incorporate more sophisticated memory models that more accurately predict the effect of a particular review schedule on retention. Finally, there is a need for more extensive experimentation to understand how closely these models of spaced repetition apply to real-world settings.

6. ACKNOWLEDGEMENTS
We thank Peter Bienstman and the Mnemosyne project for making their data set publicly available. This research was funded in part through NSF Awards IIS-1247637, IIS-1217686, IIS-1513692, the Cornell Presidential Research Scholars Program, and a grant from the John Templeton Foundation provided through the Metaknowledge Network.

7. REFERENCES
[1] Amazon Mechanical Turk. https://www.mturk.com, 2005.
[2] The Mnemosyne project. http://mnemosyne-proj.org, 2006.
[3] Duolingo. https://www.duolingo.com, 2011.
[4] Spaced repetition. http://www.gwern.net/Spaced%20repetition, 2016.
[5] N. J. Cepeda, H. Pashler, E. Vul, J. T. Wixted, and D. Rohrer. Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3):354, 2006.
[6] X. Chao, M. Pinedo, and M. Miyazawa. Queueing networks: Customers, signals, and product form solutions. Wiley, 1999.
[7] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.
[8] F. N. Dempster. Spacing effects and their implications for theory and practice. Educational Psychology Review, 1(4):309–330, 1989.
[9] H. Ebbinghaus. Memory: A contribution to experimental psychology. Number 3. University Microfilms, 1913.
[10] D. Elmes. Anki. http://ankisrs.net, 2015.
[11] F. P. Kelly. Reversibility and stochastic networks. Cambridge University Press, 2011.
[12] D. G. Kendall. Stochastic processes occurring in the theory of queues and their analysis by the method of the imbedded Markov chain. The Annals of Mathematical Statistics, pages 338–354, 1953.
[13] S. Leitner. So lernt man lernen. Herder, 1974.
[14] W. Linden and R. K. Hambleton. Handbook of modern item response theory. New York, 1997.
[15] R. V. Lindsey, J. D. Shroyer, H. Pashler, and M. C. Mozer. Improving students' long-term knowledge retention through personalized review. Psychological Science, 25(3):639–647, 2014.
[16] T. P. Novikoff, J. M. Kleinberg, and S. H. Strogatz. Education of a model student. Proceedings of the National Academy of Sciences, 109(6):1868–1873, 2012.
[17] H. Pashler, N. Cepeda, R. V. Lindsey, E. Vul, and M. C. Mozer. Predicting the optimal spacing of study: A multiscale context model of memory. In Advances in Neural Information Processing Systems, pages 1321–1329, 2009.
[18] P. Pimsleur. A memory schedule. Modern Language Journal, pages 73–75, 1967.
[19] G. Rasch. Probabilistic models for some intelligence and attainment tests. ERIC, 1993.
[20] P. Wozniak and E. J. Gorzelanczyk. Optimization of repetition spacing in the practice of learning. Acta Neurobiologiae Experimentalis, 54:59–59, 1994.