Modeling Second-Language Learning from a Psychological Perspective
Alexander S. Rich, Pamela J. Osborn Popp, David J. Halpern, Anselm Rothe, Todd M. Gureckis
Department of Psychology, New York University
{asr443, pamop, david.halpern, anselm, todd.gureckis}@nyu.edu

Abstract

Psychological research on learning and memory has tended to emphasize small-scale laboratory studies. However, large datasets of people using educational software provide opportunities to explore these issues from a new perspective. In this paper we describe our approach to the Duolingo Second Language Acquisition Modeling (SLAM) competition, which was run in early 2018. We used a well-known class of algorithms (gradient boosted decision trees), with features partially informed by theories from the psychological literature. After detailing our modeling approach and a number of supplementary simulations, we reflect on the degree to which psychological theory aided the model, and on the potential for cognitive science and predictive modeling competitions to gain from each other.

1 Introduction

Educational software that aims to teach people new skills, languages, and academic subjects has become increasingly popular. The widespread deployment of these tools has created interesting opportunities to study the process of learning in large samples. The Duolingo shared task on Second Language Acquisition Modeling (SLAM) was a competitive modeling challenge run in early 2018 (Settles et al., 2018). The challenge, organized by Duolingo (http://duolingo.com), a popular second-language learning app, was to use log data from thousands of users completing millions of exercises to predict patterns of future translation mistakes in held-out data. The data was divided into three sets covering Spanish speakers learning English (en_es), English speakers learning Spanish (es_en), and English speakers learning French (fr_en). This paper reports the approach used by our team, which finished in third place for the en_es data set, second place for es_en, and third place for fr_en.

Learning and memory have been a core focus of psychological science for over 100 years. Most of this work has sought to build explanatory theories of human learning and memory using relatively small-scale laboratory studies. Such studies have identified a number of important and apparently robust phenomena in memory, including the nature of the retention curve (Rubin and Wenzel, 1996), the advantage of spaced over massed practice (Ruth, 1928; Cepeda et al., 2006; Mozer et al., 2009), the testing effect (Roediger and Karpicke, 2006), and retrieval-induced forgetting (Anderson et al., 1994). The advent of large datasets such as the one provided in the Duolingo SLAM challenge may offer a new perspective and approach that could prove complementary to laboratory-scale science (Griffiths, 2015; Goldstone and Lupyan, 2016). First, the much larger sample sizes may help to better identify the parameters of psychological models. Second, datasets covering more naturalistic learning situations may allow us to test the predictive accuracy of psychological theories in a more generalizable fashion (Yarkoni and Westfall, 2017).

Despite these promising opportunities, it remains unclear how much of current psychological theory might be important for tasks such as the Duolingo SLAM challenge. In the field of educational data mining, researchers trying to build predictive models of student learning have typically relied on traditional, interpretable models and approaches rooted in cognitive science (e.g., Atkinson, 1972a,b; Corbett and Anderson, 1995; Pavlik and Anderson, 2008). However, a recent paper found that state-of-the-art results could be achieved using deep neural networks with little or no cognitive theory built in (so-called "deep knowledge tracing"; Piech et al., 2015). Khajah, Lindsey, and Mozer (2016) compared deep knowledge tracing (DKT) to the more standard Bayesian knowledge tracing (BKT) and showed that BKT could match DKT's performance when extended with additional features and parameters that represent core aspects of the psychology of learning and memory, such as forgetting and individual abilities (Khajah et al., 2016). An ongoing debate in this community is whether flexible models given lots of data can improve over more heavily structured, theory-based models (Tang et al., 2016; Xiong et al., 2016; Zhang et al., 2017).

For our approach to the SLAM competition, we decided to use a generic and fairly flexible model structure that we provided with hand-coded, psychologically inspired features, positioning our entry somewhere between the approaches mentioned above. Specifically, we used gradient boosted decision trees (GBDT; Ke et al., 2017) for the model structure, a powerful classification algorithm that is known to perform well across many kinds of data sets. Like deep learning, GBDT can extract complex interactions among features, but it has some advantages, including faster training and easier integration of diverse inputs.
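Ke et al. (2017) describe LightGBM, a widely used GBDT implementation. As a minimal sketch of how such a model can be set up, with toy data and illustrative column names standing in for the real instance table (none of these names come from the SLAM data):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy stand-in for the instance table; the column names are illustrative only.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "token": pd.Categorical(rng.choice(["casa", "perro", "libro"], n)),
    "format": pd.Categorical(rng.choice(["reverse_translate", "listen", "reverse_tap"], n)),
    "exercise_duration": rng.exponential(20.0, n),   # seconds to complete the exercise
    "days_in_course": rng.uniform(0.0, 100.0, n),    # time since the user started Duolingo
})
y = rng.integers(0, 2, n)  # 1 = word translated incorrectly (toy labels)

# High-cardinality categorical columns are kept as single columns; the trees
# can split on them directly rather than requiring one-hot encoding.
train_set = lgb.Dataset(X, label=y, categorical_feature=["token", "format"])

params = {"objective": "binary", "metric": "auc", "learning_rate": 0.05, "num_leaves": 31}
model = lgb.train(params, train_set, num_boost_round=100)
predictions = model.predict(X)  # per-instance mistake probabilities
```

This native handling of categorical columns is what allows the word-token features described in Section 2.1.2 below to be passed to the model without one-hot encoding.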
We then created a number of new psychologically grounded features for the SLAM dataset, covering aspects such as user perseverance, learning processes, contextual factors, and cognate similarity. After finding the model that provided the best held-out performance on the test data set, we conducted a number of "lesioning" studies in which we selectively removed features from the model and re-estimated its parameters in order to assess the contribution of particular types of features. We begin by describing our overall modeling approach, and then discuss some of the lessons learned from our analysis.

2 Task Approach

We approached the task as a binary classification problem over instances. Each instance was a single word within a sentence of a translation exercise, and the classification problem was to predict whether a user would translate the word correctly or not. Our approach can be divided into two components: constructing a set of features that is informative about whether a user will answer an instance correctly, and designing a model that can achieve high performance using this feature set.

2.1 Feature Engineering

We used a variety of features, including features directly present in the training data, features constructed from the training data, and features that draw on information external to the training data. Except where otherwise specified, categorical variables were one-hot encoded.

2.1.1 Exercise features

We encoded the exercise number, client, session, format, and duration (i.e., the number of seconds taken to complete the exercise), as well as the time since the user started using Duolingo for the first time.

2.1.2 Word features

Using spaCy (https://spacy.io/), we lemmatized each word to produce a root word. Both the root word token and the original token were used as categorical features. Due to their high cardinality, these features were not one-hot encoded but were preserved in single columns and handled in this form by the model (as described below; a short tagging sketch also appears at the end of this subsection).

Along with the tokens themselves, we encoded each instance word's part of speech, morphological features, and dependency edge label. We noticed that some words in the original dataset were paired with the wrong morphological features, particularly near where punctuation had been removed from the sentence. To fix this, we reprocessed the data using Google SyntaxNet (https://github.com/ljm625/syntaxnet-rest-api).

We also encoded word length and several word characteristics gleaned from external data sources. Research in psychology has suggested certain word features that play a role in how difficult a word is to process, as measured by how long readers look at the word as well as by people's performance in lexical-decision and word-identification tasks. Two such features that have somewhat independent effects are word frequency (i.e., how often the word occurs in natural language; Rayner, 1998) and age of acquisition (i.e., the age at which children typically exhibit the word in their vocabulary; Brysbaert and Cortese, 2011; Ferrand et al., 2011). We therefore included a feature that encoded the frequency of each word in the language being acquired, calculated from Speer et al. (2017), and a feature that encoded the mean age of acquisition (of the English word, in English native speakers), derived from published age-of-acquisition norms for 30,000 words (Kuperman et al., 2012), which covered many of the words present in the dataset. Additionally, words sharing a common linguistic derivation (also called "cognates"; e.g., "secretary" in English and "secretario" in Spanish) are easier to learn than words with dissimilar translations (De Groot and Keijzer, 2000). As an approximate measure of linguistic similarity, we used the Levenshtein edit distance between the word tokens and their translations, scaled by the length of the longer word. We found translations using Google Translate and calculated the Levenshtein distance to reflect the letter-by-letter similarity of the word and its translation (Hyyrö, 2001).
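The scaled edit distance just described is straightforward to compute directly. A minimal sketch, with the Google Translate lookup replaced by a hard-coded translation pair:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cognate_distance(word: str, translation: str) -> float:
    """Levenshtein distance scaled by the longer word: 0.0 means identical spelling."""
    return levenshtein(word, translation) / max(len(word), len(translation))

print(cognate_distance("secretary", "secretario"))  # 0.2, suggesting a likely cognate
```

The corpus-frequency feature mentioned above can likewise be looked up with the wordfreq package that accompanies Speer et al. (2017), e.g., word_frequency("secretario", "es").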
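Returning to the tagging pipeline at the start of this subsection: the lemma, part of speech, morphology, and dependency edge label are all available per token from spaCy. A minimal sketch using spaCy's current (v3) API; note that in our pipeline the morphological features ultimately came from the SyntaxNet reprocessing, so this is illustrative only:

```python
import spacy

# Requires a model download first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for token in nlp("She reads the books"):
    # root word, part of speech, morphological features, dependency edge label
    print(token.text, token.lemma_, token.pos_, str(token.morph), token.dep_)
```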
2.1.3 User features

We encoded the time of day at which each exercise began as a fraction of the 24-hour day, relative to the user's other exercises (0 indicating the same time, 0.25 indicating a relative shift by 6 hours, etc.). We then discretized this variable into 20-minute bins and computed the entropy of the empirical frequency distribution over these bins (a short sketch of this computation appears below, after Section 2.1.4). A lower entropy score indicated less variability in the times of day a user started their exercises.

The entropy score might also give an indication of context effects on users' memory. A user practicing exercises more regularly is more likely to be in the same physical location when using the app, which might result in better memory for previously studied words (Godden and Baddeley, 1975).

2.1.4 Positional features

To account for the effects of surrounding words on the difficulty of an instance, we created several positional features.
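To make the practice-time entropy from the user-features subsection above concrete, here is a minimal sketch. It assumes each exercise start time is given simply as an hour of the day, whereas our actual encoding was relative; the 20-minute binning follows the text:

```python
import numpy as np

def practice_time_entropy(start_hours, bin_minutes=20):
    """Entropy (in nats) of a user's exercise start times over the 24-hour day,
    discretized into 20-minute bins. Lower values mean more regular practice."""
    n_bins = int(24 * 60 / bin_minutes)          # 72 bins of 20 minutes each
    edges = np.linspace(0.0, 24.0, n_bins + 1)
    counts, _ = np.histogram(np.asarray(start_hours) % 24.0, bins=edges)
    p = counts[counts > 0] / counts.sum()        # empirical bin frequencies
    return float(-(p * np.log(p)).sum())

# A user who always practices around 8am scores lower than one whose
# practice times are scattered across the day.
print(practice_time_entropy([8.0, 8.1, 7.9, 8.2, 8.0]))    # ~0.50
print(practice_time_entropy([1.0, 5.5, 9.3, 14.2, 22.7]))  # ~1.61
```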