Modelling Language Learning of Duolingo Users
Floor Dikker
STUDENT NUMBER: 2017217

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN DATA SCIENCE & SOCIETY
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY

Thesis committee:
Dr. Hendrickson
Dr. Alishahi

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
June, 2019

Data Science and Society 2019

Preface

Dear reader,

I present to you my thesis about modelling language learning of users of Duolingo. First of all, I would like to thank Dr. Hendrickson for his guidance. I have found our meetings and your suggestions very helpful. Moreover, your enthusiasm is very contagious which made writing this thesis a more fun experience. I would also like to thank my parents and brother for offering support during my entire studies and always offering a listening ear. Finally, I would like to thank my boyfriend for supporting me all the way.

I hope you enjoy reading my thesis,
Floor Dikker

Table of Contents

1. Introduction
2. Related Work
  2.1 The SLAM challenge
  2.2 Relevant features
    2.2.1 Linguistic features
    2.2.2 Learning process features
  2.3 Unbalanced data
  2.4 Added value of the thesis
3. Experimental Setup
  3.1 The data set
    3.1.1 Description of the data set used in the experiment
    3.1.2 Data structure
    3.1.3 Software
  3.2 Data manipulation
    3.2.1 Preprocessing steps
    3.2.2 Feature engineering
    3.2.3 Outliers and missing values
  3.3 Explanatory Data Analysis
    3.3.1 The target feature
    3.3.2 Linguistic features
    3.3.3 Learning process features
  3.4 Methods
    3.4.1 Logistic Regression
    3.4.2 Random Forest Classifier
  3.5 Experimental procedure
    3.5.1 The Baseline
    3.5.2 Algorithms
    3.5.3 Hyperparameters
    3.5.4 Evaluation metrics
    3.5.5 Down sampling
4. Results
  4.1 The baseline model
    4.1.1 The LR baseline
    4.1.2 The ZeroR Classifier
  4.2 The other models
  4.3 Feature importance
  4.4 Error analysis
    4.4.1 Number of characters
    4.4.2 Morphological features
    4.4.3 Word
    4.4.4 Part of Speech
    4.4.5 Edge Label
5. Discussion
6. Conclusion
References
Appendix A: Features in the original data set
Appendix B: Parameter settings of the models
Appendix C: Plots of the logistic regression models
Appendix D: Plots of the random forest models
Appendix E: Visual representation of the features and the error
Appendix F: Distribution comparisons train and test set

Modelling language learning of Duolingo users
Floor Dikker

This thesis investigates which features are most effective for modelling second language acquisition. To do so in a transparent manner, the features are divided into learning process features and linguistic features. Both categories of features are modelled with a random forest classifier and a logistic regression. The linguistic features, modelled with a random forest classifier, are found to be the most effective for predicting word learning. Moreover, an error analysis is conducted to further investigate the errors made by the best-performing model.
Based on the error analysis, it is concluded that the model is more prone to mispredicting certain classes or ranges within the features, which is assumed to be caused by differences in the distribution of the features between the train and test sets.

1. Introduction

When learning a new language, a learner must devote attention to numerous interacting elements of the language. For instance, not only must the vocabulary be learned, but the structure of the language must also be assimilated. As these elements cannot be dealt with in isolation, many people find Second Language Acquisition (SLA) challenging.

Traditionally, one of the most common ways to learn a language has been classroom education. In recent years, however, there has been a large increase in the number of learners who choose to acquire a second language via educational software. This trend is often referred to as ‘mobile learning’ or ‘m-learning’, which implies that learners can learn not only where they want but also when they want (Kim & Yeonhee, 2012).

The rapid growth of mobile learning poses new challenges for researchers. People learn differently on such platforms than in more traditional settings and, as such, the field is said not to align with established SLA theories and other pedagogical research (Heil, Wu, Lee, & Schmidt, 2017). Goodwin & Highfield (2012) state that researchers in these fields have failed to keep pace with the exponential growth in the amount of educational software available for language learning. Nonetheless, the issue has been addressed to some extent by the teams participating in the Second Language Acquisition Modelling (SLAM) challenge, which used a data set from Duolingo. Duolingo is an award-winning language learning platform on which more than 200 million students have enrolled.
According to the company, its success comes from offering personalized education, a fun learning experience and accessibility from every location¹. A benefit of the many learners