Variational Deep Knowledge Tracing for Language Learning

Sherry Ruan ([email protected]), Stanford University, Stanford, California, United States
Wei Wei ([email protected]), Google, Sunnyvale, California, United States
James A. Landay ([email protected]), Stanford University, Stanford, California, United States

ABSTRACT

Deep Knowledge Tracing (DKT), which traces a student's knowledge change using deep recurrent neural networks, is widely adopted in student cognitive modeling. Current DKT models only predict a student's performance based on the observed learning history. However, a student's learning processes often contain latent events not directly observable in the learning history, such as partial understanding, making slips, and guessing answers. Current DKT models fail to model this kind of stochasticity in the learning process. To address this issue, we propose Variational Deep Knowledge Tracing (VDKT), a latent variable DKT model that incorporates stochasticity into DKT through latent variables. We show that VDKT outperforms both a sequence-to-sequence DKT baseline and previous SoTA methods on MAE, F1, and AUC by evaluating our approach on two language learning datasets. We also draw various interpretable analyses from VDKT and offer insights into students' stochastic behaviors in language learning.

[Figure 1: a sequence of answer attempts on sentences containing are ("You are buying shoes" ✓, "You are young" ✗, "Yes you are going" ✗, "You are not going" ✓, "You are not real" ✓, "You are eating cheese" ✓, "You are welcome" ✓), each tracked against two concepts, linking verb and tense, with right/wrong attempts marked ✓/✗.]

Figure 1: A probabilistic knowledge tracing system tracks the distribution of two concepts concerning the word are: understanding it as a linking verb and understanding it in the present progressive tense. ✓ and ✗ represent whether a student answers a certain question correctly or not. The height of the red and blue knowledge bars represents the system's beliefs in the student's knowledge about two different grammar rules over time. Higher bars represent higher probabilities of mastering the concept. Initially, the system has an equal belief in the student's knowledge of the two concepts. As the student practices more problems, the system grows its confidence and updates its estimates based on the student's performance.

CCS CONCEPTS

• Social and professional topics → Student assessment; • Computing methodologies → Neural networks; • Mathematics of computing → Variational methods.

KEYWORDS

knowledge tracing, language learning, student modeling, deep learning, variational inference

ACM Reference Format:
Sherry Ruan, Wei Wei, and James A. Landay. 2021. Variational Deep Knowledge Tracing for Language Learning. In LAK21: 11th International Learning Analytics and Knowledge Conference (LAK21), April 12–16, 2021, Irvine, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3448139.3448170

1 INTRODUCTION

Knowledge tracing is the task of modeling a student's knowledge acquisition and loss process based on the student's past trajectories of interactions with a learning system [2]. Knowledge tracing on language learning is particularly interesting for two reasons. First, modern language learning systems such as Duolingo [37, 38] provide vast amounts of interaction data that fuel the performance of advanced machine learning models. Second, since a language usually takes a long period to learn and retain, most language learning interaction trajectories are long and noisy sequences. As a result, a unique challenge is posed in applying knowledge tracing models to language learning.

Many existing methods are based on psychological insights. For example, Pimsleur [29] and Leitner [22] used psycho-linguistically inspired models with hand-picked parameters. Corbett and Anderson [7] proposed Bayesian Knowledge Tracing (BKT), a well-known Bayesian method for knowledge tracing. More recently, Settles and Meeder [38] and Renduchintala et al. [31] investigated more advanced statistical learning models in combination with psycho-linguistic theories.
However, due to the limitations of parameter size and the form of distributions they can choose from, these models could not effectively capture sophisticated patterns in the data provided by language learning systems.

To improve the performance of knowledge tracing models, researchers have built RNN-based knowledge tracing models, e.g., Deep Knowledge Tracing (DKT) by Piech et al. [28]. Current DKT models achieve superior performance compared to the aforementioned methods in many domains [28, 39, 42, 47], including language learning [37]. However, unlike latent variable models such as BKT, a DKT model works purely on observed data and ignores the underlying stochastic patterns produced by unobserved latent variables. Such a simplification makes it difficult for DKT methods to model language learning datasets, which are often stochastic [10].

A student's interaction with a language learning system is noisy and uncertain. For instance, one could slip on questions even when already mastering the underlying concepts. One could also guess the answer to a question correctly without fully understanding the corresponding concepts. Moreover, a student learns by acquiring new knowledge, understanding parts of it, and then forgetting parts of the learned concepts. As a result, the student's knowledge state fluctuates during the learning process. All these fluctuations lead to stochastic behaviors when the student practices knowledge. Therefore, the student's raw observed behavior is not always the best measure (or loss function in deep knowledge tracing) to optimize for, and these many latent events need to be uncovered at a level of abstraction higher than the raw observations. Unfortunately,
current DKT models only learn from the observed inputs and cannot adequately generalize to the stochastic learning processes that contain these latent events.

As further illustrated in Figure 1, a student is learning two linguistic rules pertaining to the word are. The first uses are as a linking verb followed by a noun phrase or an adjective, e.g., "you are not real". The second uses are to express the present progressive tense, e.g., "you are buying shoes". The student needs to answer a series of questions that involve the use of are in sentences. A beginning language learner is likely to be confused by these rules and to show uncertain learning behaviors when practicing the word are. Although a DKT model could implicitly model some stochastic behaviors, it would largely ignore the latent information and fail to generalize these learner behaviors. In contrast, a probabilistic knowledge tracing system is capable of maintaining a probability distribution over a student's knowledge representations (as demonstrated by the red and blue belief bars in Figure 1) and of separating noisy events from the actual knowledge representations, achieving better modeling performance.

Much of the previous work attempted to incorporate latent information by including richer features, e.g., problem tags, linguistic features, and user information [45, 48], into DKT. However, the critical difficulty of representing uncertainty remains, because current DKT models contain no latent variables. In this work, we tackle this challenge by introducing latent variables into DKT to capture the stochasticity and latent information in a student's learning process. To our knowledge, our proposed model is the first comprehensive attempt to incorporate uncertainty into deep learning-based knowledge tracing models.

Our contribution is three-fold: (1) We built a baseline Sequence-to-Sequence Deep Knowledge Tracing model (SDKT) adapted to the language learning domain. It systematically combines a student's sequential learning trajectories with question-level linguistic features and user-level features. (2) We presented Variational Deep Knowledge Tracing (VDKT), a latent variable model that introduces variation into the SDKT model. We evaluated VDKT and SDKT on two large second language learning datasets collected by Duolingo: one for vocabulary learning and the other for sentence learning. Results show that adding latent variables to the SDKT model substantially improved performance on MAE, F1, and AUC. Additionally, VDKT achieved state-of-the-art (SoTA) performance on these two real-world language learning datasets. (3) We performed various analyses comparing VDKT and SDKT, including creating visualizations to make these analyses more interpretable. These analyses provide further insights towards a better understanding of latent information in language learning modeling. Our work has many real-world educational applications, including helping to better assess student learning, understanding students' stochastic behavior, and providing insight into students' underlying learning processes. All of these are directly related to the interests of the learning analytics community.

2 RELATED WORK

Our work tackles the classical knowledge tracing problem and leverages variational autoencoders, so we summarize prior work in these two related areas: knowledge tracing and variational encoder-decoders.

2.1 Knowledge Tracing

The goal of knowledge tracing is to learn a representation of student knowledge over time based on past learning performance. The representation is evaluated by predicting student performance on future questions. The two most popular families of knowledge tracing models are Bayesian Knowledge Tracing (BKT) [7], based upon Hidden Markov Models (HMMs) [12], and Deep Knowledge Tracing (DKT) [28], backboned by sequence-to-sequence models [40].

BKT, the long-standing method for knowledge tracing, uses HMMs to model a student's prior knowledge and learning speed. To model latent information in a student's stochastic learning process, BKT introduces a slip parameter (failing a question despite mastering the skill) and a guess parameter (guessing a question correctly despite not yet mastering the skill). These parameters, together with the other HMM parameters, are typically optimized on past data using expectation-maximization algorithms [8]. BKT then infers a student's future performance based on these estimated parameters. Many extensions to the original BKT have been proposed. For example, latent factors such as individual differences can be integrated into BKT to enhance model performance [19]. Heathcote et al. [14] associated student performance on multi-skill questions with all required skills by listing the performance sequence repeatedly. Khajah et al. [18] incorporated a forgetting parameter to allow for forgetting of skills. Overall, the main advantage of BKT is that it has strong prior assumptions, which make its states easily interpretable.
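To make the role of the slip and guess parameters concrete, the sketch below implements the standard single-skill BKT posterior update in plain Python. The parameter values and function names are illustrative assumptions; in practice the parameters are fitted with EM [8] as described above.

```python
def bkt_update(p_mastery, correct, p_slip, p_guess, p_learn):
    """One Bayesian Knowledge Tracing step for a single skill.

    p_mastery : prior probability the student has mastered the skill
    correct   : whether the observed answer was correct
    p_slip    : P(incorrect | mastered);  p_guess : P(correct | not mastered)
    p_learn   : P(unmastered -> mastered after one practice opportunity)
    """
    if correct:
        # Posterior after a correct answer (which could have been a guess).
        evidence = p_mastery * (1 - p_slip) + (1 - p_mastery) * p_guess
        posterior = p_mastery * (1 - p_slip) / evidence
    else:
        # Posterior after an incorrect answer (which could have been a slip).
        evidence = p_mastery * p_slip + (1 - p_mastery) * (1 - p_guess)
        posterior = p_mastery * p_slip / evidence
    # Learning transition: the student may acquire the skill after practicing.
    return posterior + (1 - posterior) * p_learn

# Tracing a hypothetical student over a short answer sequence.
p = 0.3  # assumed prior mastery
for obs in [True, False, True, True]:
    p = bkt_update(p, obs, p_slip=0.1, p_guess=0.2, p_learn=0.15)
    print(round(p, 3))
```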
However, BKT generally requires manual topic labeling, makes overly simplistic assumptions of independence between topics, and often underperforms DKT on large educational datasets [28].

DKT has been developed in recent years with the resurgence of recurrent neural networks (RNNs) [34]. RNNs are a natural way to encode a student's interactions: the knowledge states are represented as hidden states in an RNN and forwarded to future time steps. DKT has shown more promising results than traditional models in various domains, including mathematics [28, 44], programming exercises [42], engineering statics [47], and language learning [17, 27]. Many variations of the original DKT [28] have emerged in recent years to augment its modeling performance. Su et al. [39] used a second recurrent neural network to encode question text to better characterize the relationships between questions. Zhang et al. [47] used two static matrices to store knowledge concepts and students' understanding of those concepts separately so as to uncover some underlying concepts. However, all existing DKT models only learn from the raw observed data. As a result, they cannot smoothly capture latent student behaviors such as guessing or slipping, which, by contrast, can be naturally modeled in BKT. To this end, our VDKT can be regarded as a first step towards unifying DKT with BKT.

2.2 Variational Encoder-Decoder

Autoencoders [13, 15, 35] encode inputs to low-dimensional intermediate representations and then decode them back to the original inputs. Their encoder-decoder architecture shares technical traits with our problem.
In knowledge tracing, the student's historical performance is encoded using an encoder, and when predicting the student's future performance, a decoder is used to generate predictions based on the encoded representation of the student's learning trajectories.

However, the difficulty with autoencoders is that the intermediate latent representation is not continuous, which makes the decoded results less accurate. Recently, variational versions of autoencoders [20, 32] have been proposed that impose probability distributions on the latent representations to solve this issue. Instead of encoding each input into a distinct representation, a variational autoencoder (VAE) encodes inputs into a probability distribution, and the decoder samples from this distribution when generating outputs. VAEs have proven effective in many application domains, such as image modeling [30] and music [33]. When applied to natural language, variational methods can be used to better model the stochastic nature of text and have produced promising results in areas including sentence generation [4], question answering [26], and dialogue systems [9, 36].

There are two major ways to incorporate latent variables into the encoder-decoder architecture. One is to add variation using a global latent variable that encodes either the entire input sequence [4] or the entire output sequence [5]. The other approach is to incorporate a local latent variable that is unique to each sample [36]. Our method aligns more closely with the second line of variational sequence modeling, since uncertainty may occur every time a student attempts a question.

3 VARIATIONAL DEEP KNOWLEDGE TRACING

We first present Sequence-to-Sequence Deep Knowledge Tracing (SDKT), a sequence-to-sequence deep knowledge tracing model that naturally separates a student's learning history and future performance through the encoder-decoder structure. The encoder is used to encode a student's learning history, and the decoder is used to predict the student's future performance. SDKT also systematically incorporates a wide range of linguistic and user-specific features to aid the characterization of students' knowledge states. Figure 2abc shows a model diagram of SDKT. Building upon SDKT, we then present Variational Deep Knowledge Tracing (VDKT) by replacing the regular prediction layer in SDKT with a variational inference layer, as shown in Figure 2d.

3.1 Notations and Embedding Representations

As can be seen from Figure 2a, given a student's learning history from time step $1$ to $k-1$, the goal of knowledge tracing is to predict the student's performance on all questions from time step $k$ (inclusive) onward. We use $LSTM$ and $BiLSTM$ as basic building blocks in our model, where $LSTM$ stands for long short-term memory [16] and $BiLSTM$ stands for bidirectional LSTM [23].

The first task we need to perform is to convert the input features, including the questions students worked on, the answers students attempted, and other meta-level features such as linguistic features and students' personal information, to vectors. Similar to word embeddings [41], we use trainable embeddings to convert inputs to vectors. These input embedding representations are obtained by passing question IDs, binary answers (correct or incorrect), and meta-level features converted to IDs through trainable embedding layers that are optimized together with the rest of the neural network. At time step $t$, we use $q_t$ to denote the embedded question representation, $a_t$ the embedded student answer representation, and $m_t$ the embedded meta-level representation.

We use $S_t$ to denote a state tuple that contains two vectors, $c_t$ and $h_t$, the memory and hidden state of an LSTM cell, respectively. We use $f$ and $g$ to denote activation functions and $W$ and $b$ to denote weight matrices and bias terms in dense layers. Lastly, we use $softmax$ to denote the softmax function [13], $(\,,\,)$ to denote tuples, and $[\,,\,]$ to denote concatenations.

3.2 Sequence-to-Sequence Deep Knowledge Tracing (SDKT)

We employ an encoder-decoder architecture for the SDKT model. The encoder (Figure 2a) first learns forward and backward representations $S^f_t$ and $S^b_t$ over the past learning trajectories $q_{1..t}$ and $a_{1..t}$ using a $BiLSTM$, where $LSTM_F$ is the forward LSTM and $LSTM_B$ is the backward one. The encoder is mathematically defined as follows: for $t = 1 \cdots k-1$,

$S^f_t = LSTM_F([a_t, q_t], S^f_{t-1})$  (1)
$S^b_{t-1} = LSTM_B([a_t, q_t], S^b_t)$  (2)

The outputs are further encoded using a one-layer $LSTM$ to form $S_t$.
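The following is a minimal sketch of the encoder in Equations (1)–(2), written in PyTorch rather than the TensorFlow 1.7 used by the authors. The layer sizes loosely follow the setup reported in Section 5.1, but the module name, vocabulary size, and batch shapes are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """BiLSTM over concatenated answer/question embeddings (Eqs. 1-2),
    followed by a one-layer LSTM that forms the state S_t."""

    def __init__(self, n_questions, n_answers=2, q_dim=64, a_dim=64, hidden=128):
        super().__init__()
        self.q_emb = nn.Embedding(n_questions, q_dim)  # embedded question q_t
        self.a_emb = nn.Embedding(n_answers, a_dim)    # embedded answer a_t
        self.bilstm = nn.LSTM(q_dim + a_dim, hidden, batch_first=True,
                              bidirectional=True)      # LSTM_F and LSTM_B
        self.merge = nn.LSTM(2 * hidden, hidden, batch_first=True)  # forms S_t

    def forward(self, q_ids, a_ids):
        x = torch.cat([self.a_emb(a_ids), self.q_emb(q_ids)], dim=-1)  # [a_t, q_t]
        fb, _ = self.bilstm(x)             # forward/backward states S^f_t, S^b_t
        h, (h_n, c_n) = self.merge(fb)     # per-step hidden states and final state
        return h, (h_n, c_n)

# Example: a batch of 4 students with history length k-1 = 20.
enc = HistoryEncoder(n_questions=5000)
q = torch.randint(0, 5000, (4, 20))
a = torch.randint(0, 2, (4, 20))
states, final = enc(q, a)
print(states.shape)  # torch.Size([4, 20, 128])
```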
The decoder design is divided into two versions, a concurrent decoder and an auto-regressive decoder, to accommodate the formats of the two language learning datasets described later in Section 4. The concurrent decoder assumes no dependencies between questions and predicts all answers at the same time. As illustrated in Figure 2b, $m_t$ packs the meta-data related to the question $q_t$ and enters a regular prediction layer together with $h_t$. The concurrent decoder is formulated as follows: for $t \ge k$,

$S_t = LSTM(q_t, S_{t-1}) = (c_t, h_t)$  (3)
$o_t = [m_t, h_t]$  (4)

The auto-regressive decoder assumes temporal dependencies between outputs. It is widely used in sequence-to-sequence modeling, e.g., machine translation [43]. Figure 2c illustrates the auto-regressive decoder.

[Figure 2 diagram: (a) encoder over (a_1, q_1), ..., (a_{k-1}, q_{k-1}) using LSTM_F, LSTM_B, and a merging LSTM; (b) concurrent decoder; (c) auto-regressive decoder; (d) R/V variational prediction layer with prior, posterior, and latent variable z_t.]

Figure 2: Model diagrams for SDKT (abc) and VDKT (abcd). Circular nodes are tensors, and square nodes are functions. Function R/V represents a regular or variational prediction layer. In SDKT, a bidirectional encoder (a) is used to encode student learning history, and a decoder is used for predicting the student’s performance on future questions. The concurrent decoder (b) is designed for vocabulary learning (the HLD dataset), and the auto-regressive decoder (c) is designed for sentence learning (the SLAM dataset). VDKT shares the same encoder with SDKT and uses a variational prediction layer (d) instead of the regular final prediction layer in the decoder.

Its mathematical formulation is as follows: for $t \ge k$,

$S_t = LSTM([a_{t-1}, q_{t-1}], S_{t-1}) = (c_t, h_t)$  (5)
$o_t = [q_t, m_t, h_t]$  (6)

where $o_t$ is the concatenation of $q_t$, $m_t$, and $h_t$, denoting the input passed to a regular or variational prediction layer, denoted by $R/V$ in Figure 2d. The regular prediction layer, which is a dense layer followed by a $softmax$ function, makes the final predictions of students' answers:

$a_t = softmax(f(W o_t + b))$  (7)

3.3 Variational Deep Knowledge Tracing (VDKT)

The deterministic nature of SDKT makes it difficult to model stochastic patterns in knowledge tracing. To address this limitation, we augment SDKT by adding a latent variable, $z_t$, to each time step. We first turn the variational input $o_t$ into the mean $\theta_t$ and covariance $\Sigma_t$ of a Gaussian-distributed random variable. Specifically, we use $\theta_t = g(W^{\theta} o_t + b^{\theta})$ and $\Sigma_t = g(W^{\Sigma} o_t + b^{\Sigma})$. The stochasticity in VDKT is then represented by a random variable $z_t$ drawn from this Gaussian distribution:

$z_t \sim \mathcal{N}(\theta_t, \Sigma_t)$  (8)

We calculate the final prediction, $a_t$, by passing both $z_t$ and $o_t$ to a dense layer followed by a $softmax$ function:

$a_t = softmax(f(W^z [z_t, o_t] + b^z))$  (9)

Figure 2d shows these steps being performed in the variational prediction layer.

3.4 Scalable Inference

We train the VDKT model by maximizing the Evidence Lower Bound (ELBO) [20]. The ELBO $\mathcal{L}$ is defined as:

$\mathcal{L} = -\lambda \, \mathrm{KL}\big(q(z_t \mid a_t, o_t) \,\|\, p(z_t \mid o_t)\big) + \mathbb{E}_{z_t \sim q} \log P(a_t \mid z_t, o_t)$  (10)

where KL stands for the Kullback-Leibler divergence [21], an effective way of measuring the distance between two probability distributions, and $\lambda$ is a variable weight on the KL term, first introduced by Bowman et al. [4] to control the cost annealing schedule. $p$ is the prior distribution of the latent variable $z_t$, and $q$ is the proposed approximation to the posterior distribution of $z_t$. We parametrize $q$ by conditioning $z_t$ on $a_t$ and $o_t$ and then using a separate set of parameters to instantiate the mean and covariance of a Gaussian distribution:

$q(z_t \mid a_t, o_t) = \mathcal{N}\big(g(W^{\theta'} [a_t, o_t] + b^{\theta'}),\; g(W^{\Sigma'} [a_t, o_t] + b^{\Sigma'})\big)$  (11)
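The sketch below illustrates the variational prediction layer and the ELBO in Equations (8)–(11) with diagonal Gaussians and the reparameterization trick [20, 32]. It is a simplified stand-in for the authors' TensorFlow implementation: the diagonal-covariance assumption, the softplus link for the variances, and all module names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalPredictionLayer(nn.Module):
    """Prior p(z_t|o_t), posterior q(z_t|a_t,o_t), and prediction head (Eqs. 8-11)."""

    def __init__(self, o_dim, a_dim, z_dim=128, n_classes=2):
        super().__init__()
        self.prior = nn.Linear(o_dim, 2 * z_dim)               # W^theta, W^Sigma
        self.posterior = nn.Linear(o_dim + a_dim, 2 * z_dim)   # W^theta', W^Sigma'
        self.out = nn.Linear(z_dim + o_dim, n_classes)         # W^z

    @staticmethod
    def _split(stats):
        mean, raw_var = stats.chunk(2, dim=-1)
        return mean, F.softplus(raw_var) + 1e-6  # keep the diagonal variance positive

    def forward(self, o_t, a_t=None):
        p_mean, p_var = self._split(self.prior(o_t))
        if a_t is not None:                      # training: sample from the posterior
            q_mean, q_var = self._split(self.posterior(torch.cat([a_t, o_t], dim=-1)))
            z = q_mean + q_var.sqrt() * torch.randn_like(q_var)   # reparameterization
            # KL(q || p) for two diagonal Gaussians, summed over latent dimensions.
            kl = 0.5 * (torch.log(p_var / q_var)
                        + (q_var + (q_mean - p_mean) ** 2) / p_var - 1).sum(-1)
        else:                                    # inference: sample from the prior
            z = p_mean + p_var.sqrt() * torch.randn_like(p_var)
            kl = torch.zeros(o_t.shape[:-1])
        logits = self.out(torch.cat([z, o_t], dim=-1))             # Eq. (9)
        return logits, kl

# Negative-ELBO-style loss for one toy batch (lambda follows Section 3.5).
layer = VariationalPredictionLayer(o_dim=128, a_dim=64)
o_t, a_emb = torch.randn(8, 128), torch.randn(8, 64)
labels = torch.randint(0, 2, (8,))
logits, kl = layer(o_t, a_emb)
loss = F.cross_entropy(logits, labels) + 0.1 * kl.mean()  # lambda = 0.1 (placeholder)
```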

3.5 Optimization

To balance the KL term and the marginal likelihood term in Equation 10, we change the variable weight $\lambda$ during training. At the beginning of training, we focus on the marginal likelihood term of the loss function. As training proceeds, we gradually increase $\lambda$ to allow the model to optimize more for the KL term. We borrowed this technique from Bowman et al. [4] and found it effective in our case as well. The schedule we chose was $\lambda = (\tanh((\text{global step} - \alpha)/\beta) + 1)/2$, where the global step is the total number of training iterations so far and $\alpha$ and $\beta$ are tuned per dataset.
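This schedule is a direct transcription of the formula above; the example values of α and β are placeholders, since the paper only states that they were tuned per dataset.

```python
import math

def kl_weight(global_step, alpha, beta):
    # lambda = (tanh((global_step - alpha) / beta) + 1) / 2
    return (math.tanh((global_step - alpha) / beta) + 1.0) / 2.0

# With hypothetical alpha=4000, beta=1000, the weight ramps smoothly from
# near 0 early in training toward 1 after roughly 6000 steps.
for step in (0, 2000, 4000, 6000, 8000):
    print(step, round(kl_weight(step, alpha=4000, beta=1000), 3))
```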
4 DATASETS

We evaluated VDKT against SDKT on two large real-world language learning datasets collected by Duolingo: the Historical Log dataset (HLD) for vocabulary learning [38] and the Second Language Acquisition Modeling dataset (SLAM) for sentence learning [37]. Both datasets are split such that each training/testing sample consists of a source sequence (a student's historical performance) and a target sequence (the same student's future performance that needs to be predicted). We apply the encoder in Figure 2a to the source sequence in both datasets, the concurrent decoder in Figure 2b to the target sequence in the Duolingo HLD dataset, and the auto-regressive decoder in Figure 2c to the target sequence in the Duolingo SLAM dataset.

4.1 Duolingo Historical Log Dataset (HLD)

The Duolingo HLD dataset (available at github.com/duolingo/halflife-regression) contains two weeks of student log data, about 13 million student-word session traces. A word session is a data instance in a sequence. Within each session, a student practiced a word multiple times, and the aggregated recall rate was provided as the ground truth answer label. The task for this dataset is to predict the recall rates for future student-word sessions. In our experiments, we treated one word session as one question in our model. Thus, the word practiced in a session was $q$, the aggregated recall rate was $a$, and the related meta information was $m$. Since there is no dependency between sessions, we used the concurrent decoder to model student learning in this dataset.

4.2 Duolingo Second Language Acquisition Modeling Dataset (SLAM)

The Duolingo SLAM dataset (available at sharedtask.duolingo.com) contains about 6,000 language learners' first 30 days of learning data on Duolingo. The dataset covers three languages: English, Spanish, and French. In total, students practiced 1,100,000 English exercises, 900,000 Spanish exercises, and 412,000 French exercises. Within each exercise, students practiced multiple words at the same time to form a sentence, e.g., my ducks do not swim every day. Duolingo's internal grading algorithm [37] determines whether the student's answer to each word is correct. Because of the strong dependency between words, we used the auto-regressive decoder to model each exercise. We formulated each word in an exercise as $q$ and the corresponding answer label as $a$. All the other information, such as student IDs, exercise types, and morpho-syntactic features, was recorded in $m$ using embedding matrices.

The prediction task for the Duolingo SLAM dataset is to predict binary labels $a$ for words in future exercises. The Duolingo SLAM dataset uses 0 for correct labels and 1 for incorrect labels; the Duolingo HLD dataset uses the opposite labels. To make the labeling consistent across the two datasets, we followed the labeling in the Duolingo HLD dataset and reversed the labels in the Duolingo SLAM dataset. We report results in this manner throughout the rest of the paper.

Additionally, the train-test splits for both datasets were provided by Duolingo [37, 38]. Our models were trained on the same training set and evaluated on the same testing set as prior work to ensure a fair comparison.

5 EVALUATION

We present the technical implementation details of our SDKT and VDKT models, as well as quantitative evaluation results on the Duolingo HLD and SLAM datasets.

5.1 Experiment Setup

We implemented our method using TensorFlow 1.7 [1] and trained on an NVIDIA Volta 100 SXM2 GPU.

Dataset   SDKT Params   VDKT Params   Difference
HLD       17.7M         17.8M         +0.5%
SLAM      2.5M          2.7M          +8.0%

Table 1: The total number of parameters in our SDKT and VDKT models on the Duolingo HLD and SLAM datasets. SDKT and VDKT have a comparable number of parameters on both datasets.

In our experiments, we carefully controlled the capacity of each layer and the total number of layers to conduct a fair comparison between SDKT and VDKT. Compared to SDKT, the only additional parameters VDKT possessed were the weights and bias terms in the prior distribution generation, $W^{\theta}, b^{\theta}, W^{\Sigma}, b^{\Sigma}$, and in the approximated posterior distribution generation, $W^{\theta'}, b^{\theta'}, W^{\Sigma'}, b^{\Sigma'}$. Both the encoder and decoder LSTMs consisted of 128 hidden units. The vector size of the word embedding layer was 300; all the other embedding layers used a vector size of 64. The latent variables and the weight and bias terms in the dense layers all contained 128 units. Each dense layer used tanh as the activation function, and the dropout rate was 0.2. We used FastText [3] to initialize our word embeddings for the multiple languages covered in the dataset. Table 1 shows the number of parameters in SDKT and VDKT; the two models have a comparable number of parameters on both datasets.

In the presented results, an upward arrow indicates that a higher score is better, and a downward arrow indicates that a lower score is better. The best result for each metric is in bold. We use * to indicate that the predicted results of VDKT and the model being compared have a statistically significant difference (p < 0.05). To compute statistical significance, we followed Settles and Meeder [38] in using a t-test for the Duolingo HLD dataset and used McNemar's test [25] for the Duolingo SLAM dataset. (We used different statistical tests because the ground truth answer labels are real numbers in the Duolingo HLD dataset but binary labels in the Duolingo SLAM dataset.)
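As a rough illustration of the two significance tests, the snippet below uses SciPy and statsmodels, neither of which the paper names; the toy arrays stand in for per-item model errors and paired correctness counts and are not from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

# HLD-style comparison: real-valued recall-rate errors -> paired t-test.
err_vdkt = np.array([0.10, 0.12, 0.08, 0.11, 0.09])
err_sdkt = np.array([0.13, 0.12, 0.11, 0.14, 0.10])
print(ttest_rel(err_vdkt, err_sdkt).pvalue)

# SLAM-style comparison: paired binary correctness -> McNemar's test.
# Rows: VDKT correct/incorrect; columns: SDKT correct/incorrect (toy counts).
table = np.array([[620, 42],
                  [18, 120]])
print(mcnemar(table, exact=False, correction=True).pvalue)
```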

5.2 Duolingo HLD Dataset Evaluation Results

We followed the settings of Settles and Meeder [38] by evaluating the models using Mean Absolute Error (MAE) and Area Under the Curve (AUC). Prior to our work, the Half-Life Regression (HLR) model [38] achieved the best (lowest) MAE, and the Leitner model [22] achieved the best (highest) AUC. (Recently, after we finished this work, we found a newly published method by Zaidi et al. [46] that achieved a lower MAE than Settles and Meeder; however, Zaidi et al. did not report AUC results, so we still compare our results to those in Settles and Meeder.) Table 2 shows that our VDKT model outperformed the previous state-of-the-art results on both MAE and AUC simultaneously. The differences between VDKT and all the other models were statistically significant, as annotated by *. Our baseline model, SDKT, performed close to the state-of-the-art results reported in Settles and Meeder [38] but could not reach the performance of VDKT, its variational version. This suggests that randomness could play an important role in predicting students' performance on this dataset.

Since there are two metrics, MAE and AUC, we noticed a tradeoff between minimizing MAE and maximizing AUC. As shown in Figure 3, we obtained multiple MAE-AUC result pairs depending on parameter tuning and where training was stopped. When deciding which MAE-AUC pair to report in Table 2, we selected the pair that minimized the sum of the vertical and horizontal distances to the result achieved by the HLR model in Settles and Meeder [38].

Figure 3: The Pareto front graph of MAE and AUC evaluation results of HLR, SDKT, and VDKT on the Duolingo HLD dataset. Green and orange dots represent different MAE-AUC pairs achieved by our SDKT and VDKT models, respectively. The dot annotated with "HLR" is the result achieved by the Half-Life Regression model in Settles and Meeder [38]. VDKT significantly outperformed SDKT and HLR on MAE and AUC simultaneously. The SDKT and VDKT results reported in Table 2 were selected by minimizing the sum of the vertical and horizontal distances to the result of HLR in the graph.
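The two metrics and the reporting-point selection rule can be reproduced with standard tooling; this sketch uses scikit-learn, the candidate (MAE, AUC) pairs are made-up placeholders rather than the paper's tuning runs, and the 0.5 binarization threshold for AUC is an assumption on our part.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, roc_auc_score

def evaluate(recall_true, recall_pred, threshold=0.5):
    """MAE on the real-valued recall rates; AUC on their thresholded labels
    (the exact binarization used by Settles and Meeder may differ)."""
    mae = mean_absolute_error(recall_true, recall_pred)
    auc = roc_auc_score((np.asarray(recall_true) >= threshold).astype(int), recall_pred)
    return mae, auc

def select_reported_pair(candidates, hlr_point):
    """Pick the (MAE, AUC) pair minimizing the summed vertical and horizontal
    distance to the HLR reference point, as described for Figure 3."""
    hlr_mae, hlr_auc = hlr_point
    return min(candidates, key=lambda p: abs(p[0] - hlr_mae) + abs(p[1] - hlr_auc))

# Hypothetical tuning runs versus the published HLR point (0.128, 0.538).
runs = [(0.115, 0.552), (0.118, 0.549), (0.121, 0.545)]
print(select_reported_pair(runs, hlr_point=(0.128, 0.538)))
```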

Model group              Model      MAE ↓    AUC ↑
Settles & Meeder [38]    HLR        0.128*   0.538
                         LR -lex    0.211*   0.514
                         Pimsleur   0.445*   0.510
                         Leitner    0.235*   0.542
Ours                     SDKT       0.127*   0.540
                         VDKT       0.115    0.552

Table 2: Evaluation results of SDKT and VDKT on the Duolingo HLD dataset compared with previous results in the literature [38]. All the models are single models. The best MAE was achieved by the Half-Life Regression (HLR) model and the best AUC by the Leitner model, as reported in Settles and Meeder [38]. VDKT outperformed SDKT and the previous state-of-the-art results on MAE and AUC simultaneously. The best result for each metric is in bold. * indicates that the predicted results of VDKT and the model being compared have a statistically significant difference (p < 0.05).

5.3 Duolingo SLAM Dataset Evaluation Results

Following the norm in the Duolingo SLAM competition [37], we report F1 and AUC scores and compare model performance on each language. Osika et al. [27] achieved the state-of-the-art F1 and AUC results on this dataset using an ensemble model combining a Gradient Boosted Decision Tree (GBDT) [11] and a customized LSTM model extending Piech et al. [28]'s DKT architecture. Since we wanted to focus on single-model performance and separate out the performance gain due to GBDTs, we performed three types of comparisons: single model, self ensemble, and GBDT ensemble. Self ensemble means models are ensembled with themselves, and GBDT ensemble means models are ensembled with GBDT models, which is what Osika et al. [27] used to achieve their state-of-the-art results. Our SDKT and VDKT models were trained separately on each of the language subsets. Table 3 shows that VDKT outperformed the previous state-of-the-art results reported by Osika et al. [27] in 8 out of 9 comparisons. VDKT also outperformed the non-variational version, SDKT, in all 18 experiments. All the differences between SDKT and VDKT were statistically significant, as shown by * in Table 3. (We only computed statistical significance for results between SDKT and VDKT since the exact predictions in Osika et al. [27] are not available.)
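The paper does not spell out the combination rule used for the self ensemble; the sketch below assumes simple probability averaging over independently trained runs of the same model, which is one common choice.

```python
import numpy as np

def self_ensemble(prob_runs):
    """Average per-word probabilities from several runs of the same model
    (an assumed, simple form of self-ensembling)."""
    return np.mean(np.stack(prob_runs, axis=0), axis=0)

# Three hypothetical VDKT runs over the same five target words.
runs = [np.array([0.62, 0.10, 0.55, 0.30, 0.81]),
        np.array([0.58, 0.14, 0.49, 0.28, 0.77]),
        np.array([0.66, 0.09, 0.52, 0.35, 0.84])]
print(self_ensemble(runs))
```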

Type             Model               English           French            Spanish
                                     F1 ↑    AUC ↑     F1 ↑    AUC ↑     F1 ↑    AUC ↑
Single Model     SDKT                0.531*  0.844     0.542*  0.837     0.490*  0.813
                 VDKT                0.539   0.847     0.550   0.839     0.505   0.818
Self Ensemble    Osika et al. [27]   –       0.851     –       0.841     –       0.830
                 SDKT                0.543*  0.850     0.548*  0.843     0.505*  0.820
                 VDKT                0.551   0.854     0.559   0.848     0.514   0.825
GBDT Ensemble    Osika et al. [27]   0.561   0.861     0.573   0.857     0.530   0.838
                 SDKT                0.559*  0.860     0.564*  0.851     0.513*  0.826
                 VDKT                0.564   0.863     0.575   0.859     0.531   0.838

Table 3: Evaluation results of our SDKT and VDKT on the Duolingo SLAM dataset compared with the state-of-the-art results by Osika et al. [27]. VDKT outperformed the state-of-the-art results in 8 out of 9 experiments and SDKT in all 18 experiments. The best result for each metric is in bold. * indicates that the predicted results of VDKT and the model being compared have a statistically significant difference (p < 0.05).

6 ANALYSIS AND DISCUSSION

In addition to showing modeling performance gains, we present quantitative analyses and qualitative visualizations to illustrate additional advantages of adding latent variables to knowledge tracing modeling.

6.1 Benefits of Probabilistic Representation

Going back to the motivating example discussed in Figure 1, where we presented a simple learning scenario involving two uses of the word are, we explain here how the VDKT model presented in Section 3 can be applied to this synthesized example. To predict how well students perform, we need to uncover their understanding of the two underlying grammatical rules, and this can only be done at a level of abstraction higher than raw score observations. In Figure 1, we used a two-dimensional latent variable, represented by belief bars, to capture the student's understanding of the two grammatical rules. In the actual VDKT architecture, we use a high-dimensional latent variable $z$ that can capture substantially richer latent information, including but not limited to the two grammatical rules.

Students also practiced the word are in the Duolingo SLAM dataset, the real language learning dataset. We present a real student's learning trajectory of the word are in the Duolingo SLAM dataset and visualize the advantage of VDKT over SDKT when predicting this student's performance in Figure 4. This student practiced six consecutive exercises containing the word are with three underlying linguistic rules: progressive forms, expletive constructions, and linking verbs. As Figure 4 shows, SDKT and VDKT performed equally well at predicting the student's performance on are2, are3, and are4, where the associated rule was expletive constructions. However, the deterministic SDKT failed to capture the student's knowledge representation when the linguistic rule changed to progressive forms or linking verbs. VDKT, on the other hand, was still able to represent the knowledge state through probabilistic distributions and predicted the student's performance on are1 and are6 correctly.

Figure 4: The student hKgzLBVC practiced the word are through six consecutive exercises covering three linguistic rules. Ground truth results and prediction results using SDKT and VDKT are shown in blue, orange, and green. VDKT results were randomly sampled ten times. ✓ indicates that the averaged VDKT result performs better than or as well as SDKT.
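Figure 4 averages ten random VDKT samples per word. A sketch of that inference-time procedure follows, assuming a module like the illustrative VariationalPredictionLayer shown earlier (our code, not the authors') that returns logits given $o_t$.

```python
import torch

@torch.no_grad()
def mc_predict(model, o_t, n_samples=10):
    """Sample the latent variable several times and average the resulting class
    probabilities, mirroring the ten-sample averaging used for Figure 4.
    `model` is any module returning (logits, kl) given o_t alone, in which case
    z_t is drawn from the prior on every call."""
    probs = []
    for _ in range(n_samples):
        logits, _ = model(o_t)
        probs.append(torch.softmax(logits, dim=-1))
    stacked = torch.stack(probs)            # (n_samples, batch, n_classes)
    return stacked.mean(0), stacked.std(0)  # averaged prediction and its spread
```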

6.2 Gains of VDKT vs. Stochasticity in Data

We then proceed to examine students' stochastic behaviors in language learning. Although we cannot formally characterize how stochastic a dataset is, we approximate slipping and guessing events as follows: we define a student's incorrect answer to a word as a slip if the same student's immediately preceding and following answers to the same word are correct; likewise, a guess occurs if a student answers a word correctly while the immediately preceding and following answers are incorrect. The percentages of slipping and guessing events in the Duolingo SLAM test dataset for the three languages are shown in Table 4. The distributions were similar for all three languages: students were much more likely to make slips than guesses in language learning. We also computed the gains of VDKT over SDKT on F1 and AUC in Table 4. The general trend is that the more stochastic events a dataset has, as characterized by the percentages of guess and slip behaviors, the better the performance VDKT achieves compared to SDKT.

Language   Size    F1 Gain   AUC Gain   Slip    Guess
English    1.05M   +1.47%    +0.51%     2.77%   0.60%
French     921K    +2.01%    +0.69%     3.01%   0.79%
Spanish    412K    +1.68%    +0.61%     2.74%   0.67%

Table 4: The performance gains of VDKT over SDKT on F1 and AUC versus the percentages of slips and guesses in the three Duolingo SLAM sub-datasets. Size indicates the number of exercises in the combined train and test dataset for each of the three languages. The general trend is that higher stochasticity, as characterized by the percentages of slips and guesses, leads to higher performance gains of VDKT compared to SDKT.
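The slip/guess approximation defined above is easy to state in code. The sketch below assumes each student-word history is an ordered list of binary correctness labels (1 = correct, 0 = incorrect); the function name is ours.

```python
def count_slips_and_guesses(history):
    """history: ordered correctness labels (1 = correct, 0 = incorrect) for one
    student practicing one word. A slip is an incorrect answer sandwiched
    between two correct ones; a guess is a correct answer sandwiched between
    two incorrect ones."""
    slips = guesses = 0
    for prev, cur, nxt in zip(history, history[1:], history[2:]):
        if cur == 0 and prev == 1 and nxt == 1:
            slips += 1
        elif cur == 1 and prev == 0 and nxt == 0:
            guesses += 1
    return slips, guesses

print(count_slips_and_guesses([1, 1, 0, 1, 0, 0, 1, 0, 0]))  # -> (1, 2)
```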

6.3 Latent Space Projection

To understand the variation the latent variables capture, we visualized the contributions of words and students to the latent variables.

For vocabulary visualization, we randomly selected a user in the Duolingo SLAM dataset and sampled 400 latent variables from the distributions generated for each of eight random words this student practiced. We then projected these latent variables into 1D and 2D spaces using t-SNE [24]. Figure 5 shows that simple and short words such as are have smaller variances, indicating that students have lower chances of making a guess or having a slip, probably because they can memorize these words better. In contrast, words that are relatively more complicated and longer, e.g., please and different, have higher variances. This indicates that students might have higher chances of exhibiting stochastic behaviors when learning these words.

Figure 5: Visualization of latent variables generated from different words studied by the same student, projected to 1D and 2D spaces.

The projected latent variables generated from different students studying the same word are are visualized in Figure 6. The model produces different variances for these students despite all of them learning the same word, which indicates that some students are more likely to have uncertain performance than others. This seems reasonable, since student performance varies based on personal differences even when they study the same materials.

Figure 6: Visualization of latent variables generated from different students studying the same word, projected to a 1D space. The X-axis lists student IDs.
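A sketch of the projection step with scikit-learn's t-SNE follows; the 400 samples per word match the setup described above, while the latent dimensionality, the word list, and the Gaussian stand-ins for VDKT's latent samples are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for 400 latent samples drawn per word from the VDKT prior; here each
# word simply gets its own Gaussian cloud in an assumed 128-d latent space.
words = ["are", "please", "different"]
latents = np.concatenate([rng.normal(loc=i, scale=0.5 + i, size=(400, 128))
                          for i, _ in enumerate(words)])
labels = np.repeat(words, 400)

# Project to 2D (use n_components=1 for the 1D view in Figure 5).
proj = TSNE(n_components=2, init="random", perplexity=30).fit_transform(latents)
print(proj.shape, labels.shape)  # (1200, 2) (1200,)
```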

6.4 Error Analysis

Figure 7 shows the performance of SDKT and VDKT on a specific exercise in the SLAM dataset. We find that the predictions of SDKT stay conservatively in the middle, while the predictions from VDKT are more spread out, especially for words that have wrong labels.

Figure 7: Visualization of predicted student answers using VDKT and SDKT versus ground truth labels. Ground truth results and prediction results using SDKT and VDKT are shown in blue, orange, and green. VDKT results were randomly sampled ten times. ✓ indicates that the averaged VDKT result performs better than or as well as SDKT.

We also analyzed the error rates of both SDKT and VDKT on the Duolingo SLAM dataset in Figure 8. A comparison of the confusion matrices shows that the class where VDKT outperformed SDKT the most is the wrong class, with a 4% difference.

Figure 8: Confusion matrices of SDKT and VDKT on the SLAM English dataset.
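The error-rate comparison in Figure 8 amounts to building a confusion matrix per model; a sketch with scikit-learn follows, using made-up label arrays rather than the paper's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy binary correctness labels for eight target words and two models.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred_sdkt = np.array([0, 0, 1, 0, 0, 0, 1, 1])
pred_vdkt = np.array([1, 0, 1, 0, 0, 0, 1, 1])

for name, pred in [("SDKT", pred_sdkt), ("VDKT", pred_vdkt)]:
    cm = confusion_matrix(y_true, pred, normalize="true")  # row-normalized rates
    print(name, "\n", np.round(cm, 2))
```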

6.5 Limitations and Future Work

Despite the promising results of modeling latent variables in learning, our work only studied one way of applying latent variational modeling to knowledge tracing: incorporating latent variables into the decoder. There might be other types of latent information in students' learning trajectories that need to be captured through different types of latent variable modeling, e.g., adding variation to the encoding phase [4] or to the LSTM states [6]. Future work can explore these directions to achieve even stronger modeling performance.

7 CONCLUSION

In this work, we examined the challenge of representing students' stochastic behaviors, which had long been overlooked in the deep knowledge tracing community. We addressed it by proposing Variational Deep Knowledge Tracing (VDKT), a latent variable deep knowledge tracing model that can model stochastic learning events. We evaluated VDKT on two large real-world language learning datasets and demonstrated its advantages over a deterministic baseline deep knowledge tracing model. To our knowledge, we are the first to systematically model latent variables in deep knowledge tracing, and we hope this work opens doors to future studies in the knowledge tracing field.

ACKNOWLEDGMENTS

We would like to thank Jia Li, Emma Brunskill, Chris Piech, Dan Jurafsky, and the Stanford NLP Group for providing valuable feedback throughout different stages of this work. We would also like to thank the reviewers for their thoughtful comments and revision suggestions, which helped us produce a better paper.

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
[2] John R Anderson, C Franklin Boyle, Albert T Corbett, and Matthew W Lewis. 1990. Cognitive modeling and intelligent tutoring. Artificial Intelligence 42, 1 (1990), 7–49.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (Berlin, Germany). Association for Computational Linguistics, 10–21.
[5] Kris Cao and Stephen Clark. 2017. Latent Variable Dialogue Models and their Diversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, 182–187.
[6] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems. 2980–2988.
[7] Albert T Corbett and John R Anderson. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4, 4 (1994), 253–278.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1 (1977), 1–38.
[9] Jiachen Du, Wenjie Li, Yulan He, Ruifeng Xu, Lidong Bing, and Xuan Wang. 2018. Variational Autoregressive Decoder for Neural Response Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3154–3163.
[10] Christopher M Ely. 1989. Tolerance of ambiguity and use of second language strategies. Foreign Language Annals 22, 5 (1989), 437–445.
[11] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[12] Zoubin Ghahramani. 2001. An Introduction to Hidden Markov Models and Bayesian Networks. World Scientific Publishing Co., Inc., USA, 9–42.
[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[14] Andrew Heathcote, Scott Brown, and Douglas JK Mewhort. 2000. The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review 7, 2 (2000), 185–207.
[15] Geoffrey E Hinton and Richard S Zemel. 1994. Autoencoders, minimum description length, and Helmholtz free energy. Advances in Neural Information Processing Systems 6 (1994), 3–10.
[16] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[17] Masahiro Kaneko, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. TMU System for SLAM-2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, New Orleans, Louisiana, 365–369.
[18] Mohammad Khajah, Robert V Lindsey, and Michael C Mozer. 2016. How deep is knowledge tracing? In Educational Data Mining (2016).
[19] Mohammad Khajah, Rowan Wing, Robert Lindsey, and Michael Mozer. 2014. Integrating latent-factor and knowledge-tracing models to predict individual differences in learning. In Educational Data Mining 2014.
[20] Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
[21] Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
[22] Sebastian Leitner. 1995. So lernt man lernen: angewandte Lernpsychologie – ein Weg zum Erfolg. Herder.
[23] Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016).
[24] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[25] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 2 (1947), 153–157.
[26] Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural Variational Inference for Text Processing. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), Volume 48 (New York, NY, USA). JMLR.org, 1727–1736.
[27] Anton Osika, Susanna Nilsson, Andrii Sydorchuk, Faruk Sahin, and Anders Huss. 2018. Second Language Acquisition Modeling: An Ensemble Approach. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, New Orleans, Louisiana, 217–222.
[28] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. 2015. Deep knowledge tracing. In Advances in Neural Information Processing Systems. 505–513.
[29] Paul Pimsleur. 1967. A memory schedule. The Modern Language Journal 51, 2 (1967), 73–75.

[30] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. 2016. Variational Autoencoder for Deep Learning of Images, Labels and Captions. In Advances in Neural Information Processing Systems, Vol. 29. Curran Associates, Inc., 2352–2360.
[31] Adithya Renduchintala, Philipp Koehn, and Jason Eisner. 2017. Knowledge Tracing in Sequential Learning of Inflected Vocabulary. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 238–247.
[32] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 32). PMLR, Beijing, China, 1278–1286.
[33] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80). PMLR, Stockholm, Sweden, 4364–4373.
[34] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533–536.
[35] Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks 61 (2015), 85–117.
[36] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (San Francisco, California, USA) (AAAI'17). AAAI Press, 3295–3301.
[37] B. Settles, C. Brust, E. Gustafson, M. Hagiwara, and N. Madnani. 2018. Second Language Acquisition Modeling. In Proceedings of the NAACL-HLT Workshop on Innovative Use of NLP for Building Educational Applications (BEA). ACL.
[38] B. Settles and B. Meeder. 2016. A Trainable Spaced Repetition Model for Language Learning. In Proceedings of the Association for Computational Linguistics (ACL). ACL, 1848–1858.
[39] Yu Su, Qingwen Liu, Qi Liu, Zhenya Huang, Yu Yin, Enhong Chen, Chris HQ Ding, Si Wei, and Guoping Hu. 2018. Exercise-Enhanced Sequential Modeling for Student Performance Prediction. In AAAI.
[40] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[41] TensorFlow. 2021. Word embeddings. https://www.tensorflow.org/tutorials/text/word_embeddings. Accessed: 2021-01-27.
[42] Lisa Wang, Angela Sy, Larry Liu, and Chris Piech. 2017. Deep Knowledge Tracing On Programming Exercises. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale (Cambridge, Massachusetts, USA) (L@S '17). ACM, New York, NY, USA, 201–204.
[43] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs/1609.08144 (2016).
[44] Xiaolu Xiong, Siyuan Zhao, Eric Van Inwegen, and Joseph Beck. 2016. Going Deeper with Deep Knowledge Tracing. In EDM. 545–550.
[45] Shuyao Xu, Jin Chen, and Long Qin. 2018. CLUF: a Neural Model for Second Language Acquisition Modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. 374–380.
[46] Ahmed Zaidi, Andrew Caines, Russell Moore, Paula Buttery, and Andrew Rice. 2020. Adaptive Forgetting Curves for Language Learning. In Artificial Intelligence in Education, Ig Ibert Bittencourt, Mutlu Cukurova, Kasia Muldner, Rose Luckin, and Eva Millán (Eds.). Springer International Publishing, Cham, 358–363.
[47] Jiani Zhang, Xingjian Shi, Irwin King, and Dit-Yan Yeung. 2017. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 765–774.
[48] Liang Zhang, Xiaolu Xiong, Siyuan Zhao, Anthony Botelho, and Neil T Heffernan. 2017. Incorporating rich features into deep knowledge tracing. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale. ACM, 169–172.