RLTutor: Reinforcement Learning Based Adaptive Tutoring System by Modeling Virtual Student with Fewer Interactions

Yoshiki Kubotani1∗, Yoshihiro Fukuhara1, Shigeo Morishima2
1Waseda University  2Waseda Research Institute for Science and Engineering
[email protected], [email protected], [email protected]
∗Contact Author

Abstract

A major challenge in the field of education is providing review schedules that present learned items at appropriate intervals to each student so that memory is retained over time. In recent years, attempts have been made to formulate item reviews as sequential decision-making problems to realize adaptive instruction based on the knowledge state of students. It has been reported previously that reinforcement learning can help realize instructional strategies that maintain a high memory rate for mathematically modelled students. However, optimization using reinforcement learning requires a large number of interactions, and thus it cannot be applied directly to actual students. In this study, we propose a framework for optimizing teaching strategies by constructing a virtual model of the student while minimizing the interaction with the actual teaching target. In addition, we conducted an experiment considering actual instruction using the mathematical model and confirmed that the performance is comparable to that of conventional teaching methods. Our framework can directly substitute the mathematical models used in experiments with human students, and our results can serve as a buffer between theoretical instructional optimization and practical applications in e-learning systems.

1 Introduction

The demand for online education is increasing as schools around the world are forced to close due to the COVID-19 pandemic. E-learning systems that support self-study are gaining rapid popularity. While e-learning systems are advantageous in that students can study from anywhere without gathering, they also have the disadvantage of making it difficult to provide an individualized curriculum, owing to the lack of communication between teachers and students [Zounek and Sudicky, 2013].

Instruction suited to individual learning tendencies and memory characteristics has been investigated recently [Pavlik and Anderson, 2008; Khajah et al., 2014]. In particular, flashcard-style questions, where the answer to a question is uniquely determined, have attracted significant research attention. Knowledge tracing [Corbett and Anderson, 1994] aims to estimate the knowledge state of students based on their learning history and to use it for instruction. In this context, it has been reported that the success or failure of students' future answers can be accurately estimated using psychological findings [Lindsey et al., 2014; Choffin et al., 2019] and deep neural networks (DNNs) [Piech et al., 2015].

Furthermore, the acquisition of adaptive instruction has also been formulated as a sequential decision problem to optimize the instructional method [Rafferty et al., 2016; Whitehill and Movellan, 2017; Reddy et al., 2017; Utkarsh et al., 2018]. Reddy et al. [2017] considered the optimization of student instruction as an interaction between the environment (students) and the agent (teacher), and they attempted to optimize the instruction using reinforcement learning. Although their results outperform existing heuristic teaching methods for mathematically modelled students, there is a practical problem in that the approach cannot be applied directly to actual students, because of the extremely large number of interactions required for optimization.

In this study, we propose a framework that can optimize the teaching method while reducing the number of interactions between the environment (modeled students) and the agent (teacher), by using a pretrained mathematical model of the students. Our contributions are summarized as follows.

• We pretrained a mathematical model that imitates students using mass data, and realized adaptive instruction using existing reinforcement learning with a smaller number of interactions.
• We conducted an evaluation experiment of the proposed framework in a more practical setting, and showed that it can achieve comparable performance to existing methods with fewer interactions.
• We highlighted the need to reconsider the functional form of the loss function for the modeled students to realize more adaptive instruction.

2 Related Work

2.1 Knowledge Tracing

Knowledge tracing (KT) [Corbett and Anderson, 1994] is a task used to estimate the time-varying knowledge state of learners from their learning histories. Various methods have been proposed for KT, including those using Bayesian estimation [Michael et al., 2013] and DNN-based methods [Piech et al., 2015; Pandey and Karypis, 2019].

In this study, we focus on item response theory (IRT) [Frederic, 1952]. IRT is aimed at evaluating tests independently of individual abilities. The following is the simplest logistic model:

P(O_{i,j} = 1) = σ(α_i − δ_j).   (1)

Here, α_i is the ability of learner i, δ_j is the difficulty of question j, and O_{i,j} is a random variable representing the correctness o of learner i's answer to question j in binary form. σ(·) denotes the sigmoid function.
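
To make Equation (1) concrete, the following minimal Python sketch (not taken from the paper's released code) evaluates the logistic model for a single learner-item pair; the parameter values are purely illustrative.

```python
import math

def irt_correct_probability(alpha: float, delta: float) -> float:
    """Equation (1): P(O_{i,j} = 1) = sigmoid(alpha_i - delta_j)."""
    return 1.0 / (1.0 + math.exp(-(alpha - delta)))

# Illustrative values: a learner of ability 0.4 answering an item of difficulty 1.1.
print(irt_correct_probability(alpha=0.4, delta=1.1))  # ~0.33
```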

While ordinary IRT is a static model with no time variation, IRT-based KT attempts to realize KT by incorporating the learning history into Equation (1) [Cen et al., 2006; Pavlik et al., 2009; Vie and Kashima, 2019]. Lindsey et al. [2014] proposed the DASH model, which uses psychological knowledge to design a history parameter. Since DASH does not consider the case where multiple pieces of knowledge are associated with a single item, Choffin et al. [2019] extended DASH to consider correlations between items, and the model is called DAS3H:

P(O_{i,j,l} = 1) = σ( α_i − δ_j + Σ_{k∈KC_j} β_k + h_{θ,φ}(t_{i,j,1:l}, o_{i,j,1:l−1}) ),   (2)

h_{θ,φ}(t_{i,j,1:l}, o_{i,j,1:l−1}) = Σ_{k∈KC_j} Σ_{w=0}^{W−1} [ θ_{k,w} ln(1 + c_{i,k,w}) + φ_{k,w} ln(1 + n_{i,k,w}) ].   (3)

Equation (2) contains two additional terms relative to Equation (1): the proficiency β_k of the knowledge components (KC) associated with item j, and h_{θ,φ} defined in Equation (3). h_{θ,φ} represents the student's learning history: n_{i,k,w} refers to the number of times learner i attempts to answer skill k, and c_{i,k,w} refers to the number of correct answers out of those trials, both counted in each time window τ_w. The time window τ_w is a parameter that originates from the field of psychology [Rovee-Collier, 1995] and represents the time scale of the loss of memory. By dividing the counts over discrete time scales satisfying τ_w < τ_{w+1}, the memory rate can be estimated while taking into account the temporal distribution of the learning history [Lindsey et al., 2014].
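
The sketch below shows one way Equations (2)-(3) can be combined into a recall probability. It is a simplified illustration rather than the authors' implementation: the per-window counts c_{i,k,w} and n_{i,k,w} are assumed to have been computed from the learning history beforehand, and all parameter values in the example are invented.

```python
import math

def das3h_probability(alpha, delta, beta, theta, phi, counts_correct, counts_attempt, kcs):
    """
    Equations (2)-(3): logit = alpha_i - delta_j + sum_k beta_k
        + sum_k sum_w [theta_{k,w} ln(1 + c_{i,k,w}) + phi_{k,w} ln(1 + n_{i,k,w})].
    `counts_correct[k][w]` and `counts_attempt[k][w]` hold c_{i,k,w} and n_{i,k,w}
    for the current learner (assumed pre-computed per time window).
    """
    logit = alpha - delta
    for k in kcs:                       # knowledge components of item j
        logit += beta[k]
        for w in range(len(theta[k])):  # time windows w = 0 .. W-1
            logit += theta[k][w] * math.log(1 + counts_correct[k][w])
            logit += phi[k][w] * math.log(1 + counts_attempt[k][w])
    return 1.0 / (1.0 + math.exp(-logit))

# Toy example with one KC and two time windows (illustrative values only).
p = das3h_probability(
    alpha=0.2, delta=0.5,
    beta={0: 0.1}, theta={0: [0.3, 0.2]}, phi={0: [-0.1, -0.05]},
    counts_correct={0: [2, 5]}, counts_attempt={0: [3, 8]},
    kcs=[0],
)
print(p)
```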

[Figure 1: Illustration of the difference between the usual reinforcement learning setting and the proposed method. The proposed method updates the instructional strategies not by interacting with the students directly, but by interacting with the KT model estimated from the student's study history.]

2.2 Adaptive Instruction

A mainstream approach to adaptive instruction is the optimization of review intervals. The effects of repetitive learning on memory consolidation have been discussed in the field of psychology [Ebbinghaus, 1885], and various studies have experimentally confirmed that gradually increasing the repetition interval is effective for memory retention [Leitner, 1974; Wickelgren, 1974; Landauer and Bjork, 1978; Wixted and Carpenter, 2007; Cepeda et al., 2008].

Previously, the repetition interval was determined algorithmically when the item was presented [Khajah et al., 2014]. However, in recent years, there have been some attempts to obtain more personalized instruction by treating such instruction as a sequential decision problem [Rafferty et al., 2016; Whitehill and Movellan, 2017; Reddy et al., 2017; Utkarsh et al., 2018]. Rafferty et al. [2016] formulated student instruction as a partially observable Markov decision process (POMDP) and attempted to optimize instruction for real students through planning for multiple modelled students. Based on their formulation, Reddy et al. [2017] also attempted to optimize instructional strategies using trust region policy optimization (TRPO) [Schulman et al., 2015], a method of policy-based reinforcement learning. However, optimization by reinforcement learning requires a large number of interactions, which makes it inapplicable to real-life scenarios.

3 Proposed Framework

To address the issue of the large number of interactions, we consider student teaching as a POMDP and formulate a framework for acquiring adaptive instruction with a small number of interactions with the teaching target.

The proposed method has two main structures: a memory model that captures the knowledge state of the student (Inner Model) and a teaching model that acquires the optimal instructional strategy through reinforcement learning (RLTutor). As shown in Figure 1, RLTutor optimizes its strategy indirectly through interaction with the Inner Model, rather than with the actual student. In the following sections, we describe the detailed design of the Inner Model and RLTutor, and then explain the working principle of our framework.

3.1 Inner Model

The Inner Model is a virtual student model constructed from the learning history of the instructional target using the KT method described in Section 2.1. Specifically, given a learning history H = {i, j, t, o}_n with the same notation as in Section 2.1, the model returns the probability p_o of a correct answer to the next given question. In this study, we adopted the DAS3H model expressed in Equations (2) and (3) as the KT method. However, because this model uses discrete time windows to account for past learning, the estimated memory probability may deviate from 1 even if the same item is presented consecutively (e.g., without an interval). The human brain, however, is capable of remembering meaningless symbols for a few seconds to a minute [Miller, 1956]. To capture this sensory memory, we adopted the functional form proposed by Wickelgren [1974] and used

p_o(O = 1 | s, a) = (1 − p_D)(1 + h·Δt)^{−f} + p_D   (4)

as the equation for the internal model, where p_D is the prediction of Choffin et al.'s DAS3H model. In this study, h and f in Equation (4) are regarded as constants and set to h = 0.3 and f = 0.7, because individual differences in sensory memory are much smaller than those in long-term memory.
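
The following sketch illustrates Equation (4); p_D would come from the DAS3H model above, and h = 0.3, f = 0.7 as stated in the text. Treating Δt in days is an assumption of this sketch, since the time unit is not specified here.

```python
def inner_model_probability(p_das3h: float, dt: float, h: float = 0.3, f: float = 0.7) -> float:
    """
    Equation (4): p_o(O = 1 | s, a) = (1 - p_D) * (1 + h * dt)**(-f) + p_D,
    where p_D is the DAS3H output. Immediately after a review (dt = 0) the
    predicted recall is 1, and it decays toward p_D as dt grows.
    The unit of dt (assumed here to be days) is an assumption of this sketch.
    """
    return (1.0 - p_das3h) * (1.0 + h * dt) ** (-f) + p_das3h

print(inner_model_probability(p_das3h=0.55, dt=0.0))   # 1.0 right after a review
print(inner_model_probability(p_das3h=0.55, dt=12.0))  # decays toward 0.55
```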



[Figure 2: Graphical representation of each step of the proposed framework. First, the Inner Model is updated from the learning history of the student. Next, RLTutor optimizes the teaching policies for the updated Inner Model. Finally, RLTutor presents items to the target based on the optimized policy. Before the start of the experiment, the Inner Model is pretrained using the educational dataset, and it is then updated by minimizing the loss function in Equation (11) using all the historical data obtained up to that point.]

3.2 RLTutor

RLTutor is an agent that learns an optimal strategy π∗ using reinforcement learning based on the responses from the Inner Model, and uses the obtained strategy to provide suboptimal instruction to actual students.

As for the POMDP formulation, we assume that the true state S = R^{I×J×K×(W×K)²} × N_0^{(I×K×W)²} of the student is not available, and that only the correct and incorrect responses of the student can be obtained as an observation O = {0, 1}. We set the immediate reward r(s, a) as the average of the logarithmic memory retention over all problems at that time:

r(s, a) = (1/J) Σ_{j=1}^{J} log p_o(O_j = 1 | s, a).   (5)

For this partial observation problem, we applied the proximal policy optimization (PPO) method [Schulman et al., 2017], a derivative of TRPO. The policy function π(a|s) is composed of a neural network that includes a gated recurrent unit (GRU) with 512 hidden units. Following Reddy et al. [2017], we defined the input to the network as the concatenated vector

(j_e, t_e, o)^T ∈ R^{log 2J + 2 + 1},   (6)

where j_e is a vector embedding of the presented item number j following Equations (7)-(9), t_e is a vector embedding of the time Δt elapsed since the previous interaction following Equation (10), and o is the student's response:

v_1 = onehot(j) ∈ R^J,   (7)
v_2 = (v_1·(1 − o), v_1·o)^T ∈ R^{2J},   (8)
j_e = g(v_2) ∈ R^{log 2J},   (9)
t_e = (log Δt·(1 − o), log Δt·o)^T ∈ R^2.   (10)

Note that g is a random projection function (R^{2J} → R^{log 2J}) that compresses the sparse one-hot vector to a lower dimension without loss of information [Piech et al., 2015].
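
A minimal NumPy sketch of the observation encoding in Equations (6)-(10) is shown below. The random projection matrix used for g, the rounding of the log 2J target dimension, and the log base are assumptions of this sketch; the released code may differ.

```python
import numpy as np

def make_projection(n_items: int, seed: int = 0) -> np.ndarray:
    """Random projection g: R^{2J} -> R^{d} with d ~ log(2J).
    Rounding up and the natural-log base are assumptions of this sketch."""
    d = max(1, int(np.ceil(np.log(2 * n_items))))
    return np.random.default_rng(seed).normal(size=(d, 2 * n_items))

def encode_observation(item: int, dt: float, outcome: int, n_items: int, proj: np.ndarray) -> np.ndarray:
    v1 = np.zeros(n_items)
    v1[item] = 1.0                                           # Eq. (7): one-hot item
    v2 = np.concatenate([v1 * (1 - outcome), v1 * outcome])  # Eq. (8): split by correctness
    j_e = proj @ v2                                          # Eq. (9): random projection
    t_e = np.array([np.log(dt) * (1 - outcome),
                    np.log(dt) * outcome])                   # Eq. (10): elapsed-time embedding
    return np.concatenate([j_e, t_e, [outcome]])             # Eq. (6): network input

proj = make_projection(n_items=30)
x = encode_observation(item=4, dt=0.5, outcome=1, n_items=30, proj=proj)
print(x.shape)  # (d + 2 + 1,)
```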

3.3 Working Principle

Based on the two components described above, the proposed framework operates by repeating the following three steps (Figure 2).

1. Estimation: The Inner Model is updated based on the student's learning history.
2. Optimization: RLTutor optimizes its strategy for the updated Inner Model.
3. Instruction: RLTutor poses an item to the student using the current strategy.

Because it is difficult to train the Inner Model from scratch, we pretrained it using EdNet, the largest existing educational dataset [Youngduck et al., 2020]. We trained DAS3H on this dataset and determined the initial values of the Inner Model's weights from the distribution of the estimated parameters. For details, please see the code¹ to be released soon. In addition, in step 1, we used the following loss function to prevent excessive deviation from the pretrained weights:

L = L^MSE + Σ_m c_m { (1 − λ) L_m^DIST + λ L_m^FIX }.   (11)

Here, L^MSE is introduced to make the output p_o of the model closer to the student's response o obtained from the study history. L_m^DIST and L_m^FIX are constraint terms that prevent each parameter of the model from deviating from the distribution obtained in the prior training and from the value of the previous parameter, respectively. These can be expressed as:

L^MSE := Σ_n (o − p_o)^2,   (12)
L_m^DIST := Σ_n ( 1 − f_m(ρ_m) / f_m(μ_m) ),   (13)
L_m^FIX := Σ_n | ρ_m − ρ'_m |,   (14)

where f_m(x) is the normal distribution with mean μ_m and variance σ_m, both of which are obtained during pretraining, and ρ_m is a generalized representation of each parameter of the DAS3H model (ρ_1 = α, ρ_2 = δ, ρ_3 = β, ρ_4 = θ, ρ_5 = φ). In this manner, we attempt to obtain adaptive instruction while reducing the number of interactions with the teaching target.

¹The code will soon be available on https://github.com/YoshikiKubotani/rltutor.
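
The sketch below spells out Equations (11)-(14) for a single Inner Model update. It is an illustration under stated assumptions rather than the authors' training code: the constraint terms are summed over the elements of each parameter group, λ is set arbitrarily to 0.5, and c_m = 1 as in the experiment.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def inner_model_loss(p_pred, o_true, params, prev_params, priors, c=1.0, lam=0.5):
    """
    Sketch of Equations (11)-(14). `params` maps each DAS3H parameter group
    (alpha, delta, beta, theta, phi) to its current values, `prev_params` to the
    values from the previous session, and `priors` to the (mu, sigma) estimated
    during pretraining. Summing the constraint terms over the elements of each
    parameter group, and lam = 0.5, are assumptions of this sketch.
    """
    loss = np.sum((o_true - p_pred) ** 2)                          # Eq. (12): L_MSE
    for m, rho in params.items():
        mu, sigma = priors[m]
        l_dist = np.sum(1.0 - normal_pdf(rho, mu, sigma) / normal_pdf(mu, mu, sigma))  # Eq. (13)
        l_fix = np.sum(np.abs(rho - prev_params[m]))               # Eq. (14)
        loss += c * ((1.0 - lam) * l_dist + lam * l_fix)           # Eq. (11)
    return loss

# Toy example with a single scalar "alpha" group (illustrative values only).
loss = inner_model_loss(
    p_pred=np.array([0.7, 0.4]), o_true=np.array([1.0, 0.0]),
    params={"alpha": np.array([0.3])}, prev_params={"alpha": np.array([0.25])},
    priors={"alpha": (0.0, 1.0)},
)
print(loss)
```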

[Figure 3: Experimental flow. Students, who are substituted by the mathematical model, are required to study a total of 30 items over 15 days. There are two intensive learning periods (called "sessions") per day, and the teacher agents present 10 items in each session.]

[Figure 4: Diagrammatic representation of the plotting mechanism. The mathematical model used on behalf of the students can calculate the memory rate for each item at each step. First, an average over all items is evaluated. Because the responses to the items have been recorded 300 times (30 sessions with 10 items in each session), we then plot the mean values over each session.]

4 Experiment

To verify the effectiveness of the proposed framework, we conducted a comparison experiment with other instructional methods: RandomTutor, LeitnerTutor, ThresholdTutor, and RLTutor. RandomTutor is a teacher that presents problems completely randomly. LeitnerTutor is a teacher that presents problems according to the Leitner system, a typical interval iteration method. ThresholdTutor is a teacher that presents the problem whose estimated memory rate is closest to a certain threshold value. Since user testing with actual students is costly, the mathematical model (the DAS3H model) was used in place of the students taught in this experiment.

The experimental setup was such that the learner had to study 30 items in 15 days. To make the setting more realistic, we set up two sessions per day in which the items are studied intensively, with each session presenting 10 items. Figure 3 illustrates the outline of the experiment.

The Inner Model was updated for each session using a batch of all the previous historical data, and it was trained for 10 epochs. Note that the coefficients c_m of the constraint terms of the loss function in Equation (11) were all set to one. For PPO training, mini-batches of size 200 were created every 4000 steps, and 20 agents were trained in parallel for 10 epochs. For the hyperparameters, we set the clipping parameter to 0.2, the value function coefficient to 0.5, the entropy coefficient to 0.01, the generalized advantage estimator (GAE) parameter to 0.95, and the reward discount rate to 0.85. The initial learning rate for all parameters was 0.5. In both cases, we used the Adam optimizer with an initial learning rate of 0.0001, which was gradually reduced to zero every two sessions.
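
For reference, the hyperparameters listed above are collected into a single illustrative configuration below; the key names are placeholders chosen for this sketch, not the repository's actual configuration interface.

```python
# Hyperparameters reported in the text, gathered into one illustrative config.
# Key names are assumptions of this sketch, not the released code's API.
ppo_config = {
    "clip_range": 0.2,          # PPO clipping parameter
    "value_loss_coef": 0.5,     # value function coefficient
    "entropy_coef": 0.01,       # entropy coefficient
    "gae_lambda": 0.95,         # generalized advantage estimator parameter
    "gamma": 0.85,              # reward discount rate
    "minibatch_size": 200,      # mini-batches created every 4000 steps
    "rollout_steps": 4000,
    "num_parallel_agents": 20,
    "epochs_per_update": 10,
    "optimizer": "Adam",
    "learning_rate": 1e-4,      # decayed toward zero every two sessions
}
print(ppo_config)
```
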
The results of the experiments are shown in Figures 5a and 5b. Figure 5a compares RLTutor with LeitnerTutor, and Figure 5b compares RLTutor with ThresholdTutor. The experiments were conducted five times with different seed values, and we plotted the average of the students' memory rates over items and sessions. The solid lines represent the mean values, and the colored bands represent the standard deviations. To aid comprehension, we illustrate how the item-session average is calculated in Figure 4. Figure 5 shows that retention was strengthened under both methods of instruction, and that the instruction by the proposed method successfully maintains a higher memory rate than the other methods. In addition, the instruction by ThresholdTutor did not maintain the memory rate of the students as well as the other methods. However, as the sessions continued, the performance of our method decreased. The reasons for these results will be discussed in the Discussion section.

[Figure 5: Plot of experimental results. We compare LeitnerTutor in Figure 5a ("VS LeitnerTutor") and ThresholdTutor in Figure 5b ("VS ThresholdTutor"), in addition to RandomTutor and RLTutor. The vertical axis is the session mean of the retention rate, and the horizontal axis is the session index.]

5 Discussion

5.1 Why are the differences so small?

There are three possible reasons why all the methods improved the students' memory rate with approximately similar results: the experimental setup, the loss function of the Inner Model, and the gap between the model and actual students.

Experimental Setup
The first reason is the experimental setup. In this experiment, we set a relatively long learning period (15 days) for a small number of questions (30 questions). Therefore, all questions would have been learned sufficiently, regardless of how they were presented. This leads to similar final memory rates for all methods.

Inner Model's Loss Function
The next reason is the invariance of the Inner Model. As shown in Equation (11), constraint terms were added to the loss function of the Inner Model to avoid any deviation from the pre-learned weights and from the weights learned in the previous sessions. It is possible that these constraints are too strong, and that the Inner Model was not "fine-tuned" as the experiment progressed.

Gap between the Model and Actual Students
The last possible reason is the accuracy of the model. In this experiment, we used the mathematical model instead of actual students, and the gap between the model and actual students may have affected the results. In fact, Figure 5a shows that the instruction by the Leitner system, which is conventionally considered to be effective for memory retention, produced results comparable to random instruction.

5.2 Why does ThresholdTutor not work?

ThresholdTutor is a method that poses as the next item the one whose memory rate is closest to the threshold value. It has been reported that ThresholdTutor performs better than other methods when applied to mathematical models, because it can directly refer to the memory rate, which is normally unavailable [Khajah et al., 2014]. However, as can be seen from Figure 5b, its results in this experiment are worse than those of RandomTutor. This is likely due to the number of items to be tackled per session and the characteristics of the model used as the student.

In this experiment, we used the DAS3H model instead of real students. Depending on the values of θ and φ, this model can forget after some time, even after repeated learning. Therefore, for some items, the recall rate drops to values near the threshold after a half-day session interval, even if the rate is already high. In addition, because this experiment was set up to solve 10 items per session, it is considered that "moderately memorized but easily forgotten items" were repeatedly presented at the end of the experiment.
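
To make the selection rule under discussion explicit, here is a minimal sketch of how ThresholdTutor and RandomTutor choose the next item from per-item recall estimates. The threshold value is a placeholder, since the text does not state the value used in the experiment.

```python
import random

def threshold_tutor_next_item(recall_probs, threshold=0.4):
    """ThresholdTutor: present the item whose current predicted memory rate is
    closest to the threshold. The threshold value here is a placeholder."""
    return min(range(len(recall_probs)), key=lambda j: abs(recall_probs[j] - threshold))

def random_tutor_next_item(recall_probs):
    """RandomTutor: present an item chosen uniformly at random."""
    return random.randrange(len(recall_probs))

probs = [0.95, 0.42, 0.10, 0.60]         # illustrative per-item recall estimates
print(threshold_tutor_next_item(probs))  # -> 1 (0.42 is closest to 0.4)
print(random_tutor_next_item(probs))
```
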
6 Conclusion

In this paper, we proposed a practical framework for providing optimal instruction using reinforcement learning based on the estimated knowledge state of the learner. Our framework differs from conventional reinforcement learning methods by internally modeling the learner, which enables the instructional policy to be optimized even in settings where the student's learning history is limited. We also evaluated the effectiveness of the proposed method by conducting experiments using a mathematical model. The results show that the proposed framework retained a higher memory rate than existing empirical teaching methods, while suggesting the need to reconsider the learning method of the internal model. The current challenges are that the framework only supports the flashcard format and cannot be applied to more complex formats such as text-answer questions. Additionally, we were not able to conduct experiments using humans. In the future, we plan to test the effectiveness of the proposed method in different settings, such as with different numbers of problems and durations of the experiment. We also aim to experiment with different cognitive models. In addition, we plan to increase the number of instructional methods to be compared and to experiment with real students.

Acknowledgements

This research is supported by the JST-Mirai Program (JPMJMI19B2) and JSPS KAKENHI (JP19H01129).

References

[Cen et al., 2006] Hao Cen, Kenneth Koedinger, and Brian Junker. Learning factors analysis – a general method for cognitive model evaluation and improvement. In Intelligent Tutoring Systems, pages 164–175, Jhongli, Taiwan, June 2006. Springer.
[Cepeda et al., 2008] Nicholas Cepeda, Edward Vul, Doug Rohrer, John Wixted, and Harold Pashler. Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19(11):1095–1102, November 2008.
[Choffin et al., 2019] Benoit Choffin, Fabrice Popineau, Yolaine Bourda, and Jill-Jenn Vie. DAS3H: A new student learning and forgetting model for optimally scheduling distributed practice of skills. In Proceedings of the 12th International Conference on Educational Data Mining, pages 29–38, Montreal, Quebec, July 2019. International Educational Data Mining Society.
[Corbett and Anderson, 1994] Albert Corbett and John Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4:253–278, December 1994.
[Ebbinghaus, 1885] Hermann Ebbinghaus. Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot, 1885.
[Frederic, 1952] Lord Frederic. The relation of test score to the trait underlying the test. ETS Research Bulletin Series, 1952(2):517–549, December 1952.
[Khajah et al., 2014] Mohammad Khajah, Robert Lindsey, and Michael Mozer. Maximizing students' retention via spaced review: Practical guidance from computational models of memory. Topics in Cognitive Science, 6(1):157–169, January 2014.
[Landauer and Bjork, 1978] Thomas Landauer and Robert Bjork. Optimum rehearsal patterns and name learning. Practical Aspects of Memory, 1:625–632, November 1978.
[Leitner, 1974] Sebastian Leitner. So lernt man lernen. Herder, 1974.
[Lindsey et al., 2014] Robert Lindsey, Jeffery Shroyer, Harold Pashler, and Michael Mozer. Improving students' long-term knowledge retention through personalized review. Psychological Science, 25(3):639–647, March 2014.
[Michael et al., 2013] Yudelson Michael, Koedinger Kenneth, and Gordon Geoffrey. Individualized Bayesian knowledge tracing models. In Artificial Intelligence in Education, pages 171–180, Memphis, Tennessee, July 2013. Springer.
[Miller, 1956] George Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81–97, 1956.
[Pandey and Karypis, 2019] Shalini Pandey and George Karypis. A self-attentive model for knowledge tracing. In Proceedings of the 12th International Conference on Educational Data Mining, pages 384–389, Montreal, Quebec, July 2019. International Educational Data Mining Society.
[Pavlik and Anderson, 2008] Philip Pavlik and John Anderson. Using a model to compute the optimal schedule of practice. Journal of Experimental Psychology: Applied, 14(2):101–117, July 2008.
[Pavlik et al., 2009] Philip Pavlik, Hao Cen, and Kenneth Koedinger. Performance factors analysis – a new alternative to knowledge tracing. In Proceedings of the 14th International Conference on Artificial Intelligence in Education, pages 531–538, Brighton, England, July 2009. ERIC.
[Piech et al., 2015] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, December 2015.
[Rafferty et al., 2016] Anna Rafferty, Emma Brunskill, Thomas Griffiths, and Patrick Shafto. Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332, August 2016.
[Reddy et al., 2017] Siddharth Reddy, Anca Dragan, and Sergey Levine. Accelerating human learning with deep reinforcement learning. In NIPS Workshop on Teaching Machines, Robots, and Humans, December 2017.
[Rovee-Collier, 1995] C. Rovee-Collier. Time windows in cognitive development. Developmental Psychology, 31(2):147–169, 1995.
[Schulman et al., 2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[Utkarsh et al., 2018] Upadhyay Utkarsh, De Abir, and Gomez-Rodriguez Manuel. Deep reinforcement learning of marked temporal point processes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3172–3182, December 2018.
[Vie and Kashima, 2019] Jill-Jenn Vie and Hisashi Kashima. Knowledge tracing machines: Factorization machines for knowledge tracing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 750–757, Palo Alto, California, July 2019. AAAI Press.
[Whitehill and Movellan, 2017] Jacob Whitehill and Javier Movellan. Approximately optimal teaching of approximately optimal learners. IEEE Transactions on Learning Technologies, 11(2):152–164, April 2017.
[Wickelgren, 1974] Wayne Wickelgren. Single-trace fragility theory of memory dynamics. Memory & Cognition, 2(4):775–780, July 1974.
[Wixted and Carpenter, 2007] John Wixted and Shana Carpenter. The Wickelgren power law and the Ebbinghaus savings function. Psychological Science, 18(2):133–134, February 2007.
[Youngduck et al., 2020] Choi Youngduck, Lee Youngnam, Shin Dongmin, Cho Junghyun, Park Seoyon, Lee Seewoo, Baek Jineon, Bae Chan, Kim Byungsoo, and Heo Jaewe. EdNet: A large-scale hierarchical dataset in education. In International Conference on Artificial Intelligence in Education, pages 69–73, June 2020.
[Zounek and Sudicky, 2013] Jiri Zounek and Petr Sudicky. Heads in the cloud: Pros and cons of online learning. In Proceedings of the 8th DisCo 2013: New Technologies and Media Literacy Education, pages 58–63, June 2013.