Modelling language learning of Duolingo users

Floor Dikker
STUDENT NUMBER: 2017217

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN DATA SCIENCE & SOCIETY
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY

Thesis committee:

Dr. Hendrickson
Dr. Alishahi

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
June 2019


Preface

Dear reader,

I present to you my thesis about modelling language learning of users of Duolingo. First of all, I would like to thank Dr. Hendrickson for his guidance. I have found our meetings and your suggestions very helpful. Moreover, your enthusiasm is very contagious which made writing this thesis a more fun experience. I would also like to thank my parents and brother for offering support during my entire studies and always offering a listening ear. Finally, I would like to thank my boyfriend for supporting me all the way.

I hope you enjoy reading my thesis,

Floor Dikker


Table of Contents

1. Introduction
2. Related Work
   2.1 The SLAM challenge
   2.2 Relevant features
      2.2.1 Linguistic features
      2.2.2 Learning process features
   2.3 Unbalanced data
   2.4 Added value of the thesis
3. Experimental Setup
   3.1 The data set
      3.1.1 Description of the data set used in the experiment
      3.1.2 Data structure
      3.1.3 Software
   3.2 Data manipulation
      3.2.1 Preprocessing steps
      3.2.2 Feature engineering
      3.2.3 Outliers and missing values
   3.3 Exploratory Data Analysis
      3.3.1 The target feature
      3.3.2 Linguistic features
      3.3.3 Learning process features
   3.4 Methods
      3.4.1 Logistic Regression
      3.4.2 Random Forest Classifier
   3.5 Experimental procedure
      3.5.1 The Baseline
      3.5.2 Algorithms
      3.5.3 Hyperparameters
      3.5.4 Evaluation metrics
      3.5.5 Down sampling
4. Results
   4.1 The baseline model
      4.1.1 The LR baseline
      4.1.2 The ZeroR Classifier
   4.2 The other models
   4.3 Feature importance
   4.4 Error analysis
      4.4.1 Number of characters
      4.4.2 Morphological features
      4.4.3 Word
      4.4.4 Part of Speech
      4.4.5 Edge Label
5. Discussion
6. Conclusion
References
Appendix A: Features in the original data set
Appendix B: Parameter settings of the models
Appendix C: Plots of the logistic regression models
Appendix D: Plots of the random forest models
Appendix E: Visual representation of the features and the error
Appendix F: Distribution comparisons train and test set


Modelling language learning of Duolingo users

Floor Dikker

This thesis aims to determine which features are most effective when modelling second language acquisition. In order to do so in a transparent manner, a division is made between learning process features and linguistic features. Both categories of features are modelled with a random forest classifier and a logistic regression. The linguistic features, modelled with a random forest classifier, are found to be most effective for predicting word learning. Moreover, an error analysis is conducted to further investigate the errors made by the best performing model. Based on the error analysis, the conclusion is drawn that the model is more prone to predicting certain classes or ranges within the features incorrectly, which is assumed to be caused by a different distribution of those features in the train versus the test set.

1. Introduction

When learning a new language, a learner must devote attention to numerous interacting elements of a language. For instance, not only must the vocabulary be learned, but the structure of the language must also be assimilated. As these various elements cannot be dealt with in isolation, Second Language Acquisition (SLA) is found to be challenging for many people. Traditionally, one of the most common ways to learn a language has been by means of classroom education. However, in recent years, there has been a large increase in the number of learners that have chosen to acquire a second language via educational software. This new trend is often referred to as ‘mobile learning’ or ‘m-learning’, which implies that learners can learn not only where they want but also when they want (Kim & Yeonhee, 2012).

The rapid growth of mobile learning poses new challenges for researchers. People learn in a different manner when using such platforms compared to the more traditional ways and, as such, the field is said not to align with established SLA theories and other pedagogical research (Heil, Wu, Lee, & Schmidt, 2017). Goodwin & Highfield (2012) state that researchers in these fields have failed to keep pace with the exponential growth of the amount of educational software available for language learning. Nonetheless, the issue has been addressed to some extent by the teams participating in the Second Language Acquisition Modelling (SLAM) challenge, using a data set from Duolingo. Duolingo is an award-winning language learning platform on which more than 200 million students have enrolled. According to the company, their success comes from offering personalized education, a fun learning experience and accessibility from every location1.

A benefit of the many learners on Duolingo is that the company has gathered a large amount of data about the learners. The company states that it possesses “the world’s largest collection of language-learning data”2. To exploit this data, Duolingo has released various

1 https://www.duolingo.com/, accessed on 10-04-2019
2 https://ai.duolingo.com/, accessed on 05-05-2019


research projects itself. One of these has been the SLAM challenge, the goal of which was to predict whether a user of Duolingo will translate a word correctly. The user is given the task to translate a whole sentence, and as such, the words learned by a user occur in various grammatical forms. Hence, a learner must not only know the correct word, but must also be able to conjugate it. An illustration of an exercise is displayed in Figure 1.

Figure 1. Illustration of an exercise. Note: the learner's answer is fictional, as the answers of the learners were not provided.

To predict how successful users were when learning the words, a variety of modelling approaches were adopted by the 14 participating teams in the SLAM challenge. One common approach was to use feature engineering to represent expectations based on previous research about learning or linguistic effects (Nayak & Rao, 2018; Rich, Popp, Halpern, & Gureckis, 2018; Tomoschuk & Lovelett, 2018). Another common approach was to use deep learning techniques in order to predict word learning (Kaneko, Kajiwara, & Komachi, 2018; Osika et al., 2018; Xu & Chen, 2018; Yuan, 2018).

Numerous questions remain unanswered despite the work done in light of the SLAM challenge. One of these gaps in knowledge is that it remains unclear which features are most relevant for SLAM. The teams participating in the challenge obtained high scores; however, it remained unclear which features contributed to these scores. This observation is in line with the findings of Settles et al. (2018), the organizers of the SLAM challenge, who state that the variety of included features did not influence the results of the teams in a significant manner. The authors conclude that the methods used, rather than the features, were most relevant for obtaining a high result in the SLAM challenge. In conclusion, much ambiguity still exists regarding which features are most effective when modelling SLA, and further research is necessary.

From a practical perspective, shedding light on what influences language learning is relevant, since being proficient in a second language has become increasingly necessary in a globalizing world. A second language can be used to communicate our thinking, to interact with others and to apprehend the world around us (Marian & Shook, 2012). Nowadays, a popular way of acquiring a second language is by means of computer-based educational software. As many people are using these platforms, vast amounts of learning data are generated which can be used to obtain additional insights regarding how a second language is best acquired (Settles et al., 2018). In this thesis, such data will be exploited to pinpoint constructs that can benefit learners using educational software for language learning.

Making use of the same data set as published for the SLAM challenge, this thesis aims to create more transparency regarding which features influence language learning via educational platforms. Transparency will be enhanced by dividing the features into two categories. Firstly, linguistic features are considered, which are the qualitative and quantitative properties of the word. Secondly, learning process features are examined, which describe the circumstances in which the word was learned. By separating the features in such a way, the effect of the different kinds of features is isolated and hence, a conclusion can be drawn regarding which kinds of features are most relevant when modelling SLA. Next, all the features are pooled together to determine whether this further


enhances the predictive power of the model. If so, a mix of the different kinds of features is more appropriate than solely focusing on either linguistic or learning process features. Finally, the features of the best performing model are evaluated based on how informative they are to the model, providing a more precise answer to which features are important when predicting word learning. Subsequently, the following research question and the related sub-questions will be answered in this thesis:

RQ1: Which features are best able to predict whether a learner will translate a word correctly?

SQ1: Is a model that only includes linguistic features able to outperform the baseline?
SQ2: Is a model that only includes learning process features able to outperform the baseline?
SQ3: Does combining all the features into one model lead to a more predictive model than the other two models?
SQ4: What features of the best performing model are most informative to that model?

The baseline addressed in the above questions will largely be similar to the baseline used in the SLAM challenge, which is a logistic regression including only a few features (Settles et al., 2018). However, when aiming to replicate the baseline of the SLAM challenge, the results are found to deviate considerably. Therefore, down sampling is applied in all the models, an issue further outlined in Chapter 3.5.5. Additionally, all the models will be compared to the results of the ZeroR classifier.

Using the most predictive model found by answering the first research question, an error analysis will be conducted on the test set to examine under what circumstances the model makes a wrong prediction. To do so, the predictions of the best performing model are compared with the true labels of the exercises, establishing whether the model has made an erroneous prediction. Subsequently, the erroneous predictions are further investigated per feature to determine if for some classes within the features more erroneous predictions were made. Such an approach can help to identify what kind of errors the model makes and whether a pattern can be identified among those errors. Hence, conducting an error analysis can provide a better understanding of the model and create more transparency. These new insights can, in turn, be used for creating more tailor-made learning strategies in order to further assist learners in making use of an educational platform, such as Duolingo. Therefore, the second research question that will be answered in this thesis is:

RQ2: What explains erroneous predictions of the best performing model?

The main findings of this thesis are that the linguistic features, modelled with a random forest classifier, are best able to predict word learning of Duolingo users. The learning process features are found to perform worse than the logistic regression (LR) baseline for both the logistic regression and the random forest classifier model. The learning process features are found to perform similarly to the ZeroR classifier, which predicts the same class for every instance, and thus the learning process features are concluded not to be effective when modelling SLA. The model including both categories of features is found to outperform the LR baseline, but to perform worse than the linguistic features model. Regarding the error analysis, certain categories or ranges of values within a feature are concluded to be more often erroneously predicted than others. When comparing the distributions of those features in the train and the test set, they are found to be significantly different.


2. Related Work

2.1 The SLAM challenge

The data used for the SLAM challenge, and in this thesis, consists of learning trace data of Duolingo users during their first 30 days of using the platform. Because each Duolingo user has their own personal ID, their progress can be traced and the exercises they have done can be analyzed. Along with the exercises and the labels, various features were already included in the data set as released for the SLAM challenge; these are listed in Appendix A. Moreover, the data of the exercises has been published for three language tracks, namely:

1. Spanish speakers learning English
2. English speakers learning Spanish
3. English speakers learning French

For each language track, a separate leaderboard was created, ranking the teams participating in the challenge on the basis of their obtained AUROC score.

The winners of the SLAM challenge, finishing first in all three language tracks, were Nilsson et al. (2018). Their team built an ensemble model, combining the predictions of a gradient boosted decision tree (GBDT) with the predictions of a recurrent neural network (RNN). This approach was chosen because the sequential nature of the data can be modelled well with an RNN, whereas a GBDT is known for consistently achieving high results with tabular data. Because of the GBDT approach, it could be determined which features were most relevant for SLAM, a finding that generalized to all three language tracks. The most important features found were word, user, format, exercise ID and time. Similarly to Nilsson et al. (2018), several other teams opted to train an RNN (Kaneko et al., 2018; Xu & Chen, 2018; Yuan, 2018). All of these teams scored high on the leaderboard, implying that an RNN is effective as language learning is of a sequential nature. A disadvantage of using such deep learning approaches is that the model is not transparent and, thus, few insights can be obtained regarding how the model makes a prediction (Marcus, 2018).

Another popular approach in the SLAM challenge was to use a variant of a random forest classifier (RF). RF techniques are said to lend themselves to feature engineering (Tomoschuk & Lovelett, 2018), as frequently done by the teams that opted for an RF (Chen, Hauff, & Houben, 2018; Rich et al., 2018; Tomoschuk & Lovelett, 2018). The most popular engineered features applied in the SLAM challenge have been word corpus frequency, L1-L2 cognates, word embeddings and the word stem/root/lemma (Settles et al., 2018). A more complete description of the features adopted in the SLAM challenge can be found in Chapter 2.2.

Finally, a logistic regression has been frequently adopted in the SLAM challenge (Bestgen, 2018; Klerke, Alonso, & Plank, 2018; Nayak & Rao, 2018). These teams also often engineered additional features and used feature ablation experiments to determine which features were most informative (Bestgen, 2018; Klerke et al., 2018).

Settles et al. (2018) aimed to determine what has been the most successful approach in the SLAM challenge. To do so, the authors conducted a mixed-effects analysis regarding both the modelling approaches as well as the included features. With regard to the modeling approaches, non-linear algorithms were found to be particularly desirable, especially the RNN. The mixed effects analysis regarding the included features indicated that the impact of which features to include was considerably smaller than the impact of the chosen model.


2.2 Relevant features

2.2.1 Linguistic features

In this thesis, linguistic features are defined as the features that describe the qualitative and quantitative properties of the word itself, and are independent of the manner in which the word was taught. As SLA is a relatively old research field, many word properties are already known to affect word difficulty. Therefore, existing SLA research will be reviewed in order to shape expectations on which word properties are likely to affect word learning. Interesting to note is that a comparable study by Hopman, Thompson, Austerweil, & Lupyan (2018) has already identified numerous such word properties. The authors spoke of a ‘big data investigation’ (p.1) and also examined a data set of Duolingo. However, their data set is different from the one examined in this thesis, as it mostly includes linguistic features and less information about the user is available. Nonetheless, where possible, their research will also be drawn upon and their findings will be described.

Part of speech

Some grammatical categories are more difficult to learn than others. To illustrate, Gentner (2006) stated that nouns are easier to learn as they have the privilege of naming the easy cohesive parts of the world, whereas verbs and prepositions describe a more diffuse set of relational components. Hopman et al. (2018) investigated this assumption in their big data study and have indeed found an effect of the part of speech on word learning. The authors concluded that when a measure for the part of speech is added, other features, such as the concreteness of a word, are not significant anymore.

Length

The longer a word is, the more difficult the word is to learn. This claim has often been made in the traditional field of SLA. To illustrate, Baddeley et al. (1975) found that the longer the word is, the more often the word is erroneously recalled by language learners in the short term. However, Laufer (1990) stated that the empirical results of those SLA studies are not conclusive. According to the author, a possible explanation for this could be that within those studies, the effect of word length was not well isolated. With regard to the SLAM challenge, Tomoschuk & Lovelett (2018) included the word length in their analysis and found a small effect of the feature on the predictive power of their model.

Word Frequency

Word frequency is defined as how often the word occurs in natural language (Rich et al., 2018). The basic assumption of incorporating word frequency when estimating word difficulty is that difficult words are those appearing least often in print. Although the assumption does not hold for all words, word frequency is said to be one of the best methods to estimate word difficulty (Breland, 1996). Contrary to this claim, De Groot & Keijzer (2000) stated that many studies have manipulated word frequency, but no simple robust effect has been found. Using a larger data set, Hopman et al. (2018) only found a marginal effect of word frequency on word learning.


Linguistic Similarity

Words that are relatively comparable across languages, also known as cognates, are expected to be easier to learn (De Groot & Keijzer, 2000). Hopman et al. (2018) have added to this claim that there is a difference between semantic cognates, words sharing the same meaning, and translation similarity, words with a resemblance in how they are spelled. With regard to the SLAM challenge, Rich et al. (2018) have concluded that translation similarity leads to a small boost in the performance of their model.

Dependency Edge

Dependency is the notion that words in a sentence are dependent on each other through pre-specified links. The word that is not dependent on any of the other words is called the root. For example, in the sentence: “A man sleeps”, “a” depends on “man” and “man” depends on “sleeps”, hence, “sleeps” is the root (Debusmann, 2000). According to the notion of Dependency Locality Theory (DLT), longer dependencies, i.e. the words further away from the root, require more effort and are more troublesome to learn (Oya, 2011). In light of the SLAM challenge, Tomoschuk & Lovelett (2018) have assessed the importance of each feature and subsequently concluded that dependency edge is relatively informative to their model.

Concreteness

Concreteness can be defined as the degree to which the concept behind the word refers to a perceptible entity, or in other words, to something that exists in reality (Brysbaert, Warriner, & Kuperman, 2014). To illustrate, ‘hat’ is expected to be easier to learn than ‘liberty’, as ‘hat’ has a concrete meaning to it and is thus easier to imagine (Tomoschuk & Lovelett, 2018). Much attention has been paid to the effect of concreteness in the big data study of Hopman et al. (2018). The authors have used a list of 60,099 English words with a concreteness rating, published by Brysbaert et al. (2014). In their study, the effect of concreteness is found to disappear when a measure for the part of speech is added. Moreover, for Spanish and Portuguese learners studying English, the effect is positive but for Italian learners the effect is found to be negative. Therefore, no general conclusion can be drawn.

Word

The word itself has been the most popular feature to include during the SLAM challenge (Settles et al., 2018). However, a clear rationale for doing so is not offered by any of the teams. Nevertheless, a possible explanation could be that by adding the word itself, certain factors are captured that are not accounted for directly. Although such effects could be captured directly by creating additional features, many SLA theories exist and these can, to some extent, be represented by adding the word itself.

Lemma

By engineering the lemma, also known as the dictionary form of a word (Balakrishnan & Lloyd-Yemoh, 2014), words that have a different grammatical form, but are essentially the same words, are assigned to the same class. Consequently, the cardinality of the feature is decreased. This approach was taken by Rich et al. (2018) in the SLAM challenge; the authors found that this feature leads to a small increase in the performance of the model.


Morphology

Morphology is the study of the structure of words (Booij, 2019). To represent this structure, multiple morpho-syntactic features can be identified. For example, two classes that belong to morphology are the tense of a word and the gender of a word. According to Rastle (2018), there is a direct link between the morphological features of a word and the regularity of a word. The author illustrated this with the example of ‘punt’ and ‘pant’. The words sound similar but have different meanings. In the absence of any underlying regularity, as represented within the morphology, it can be difficult to connect those words to the right meaning. Klerke et al. (2018) have examined the effectiveness of this feature in light of the SLAM challenge and have concluded that the morphological features are helpful when predicting word learning.

2.2.2 Learning process features

Learning process features are defined in this thesis as the features that are unique to each user and describe the circumstances in which the word was taught. Hence, these features entail the level of the learner and how the exercise was conducted. The findings of the SLAM challenge will mainly be used to shape expectations about the learning process features. Traditional SLA research may not be generalizable to a mobile environment, as the manner in which people learn via educational platforms is said to differ from traditional classroom approaches (Heil et al., 2017).

User ID

All learners that make use of Duolingo are traced via their own personal user ID. Consequently, some measures unique to the user can be identified, as often done by the participating teams of the SLAM challenge. A number of these measures are outlined below:

1. Mean error rate per user This feature is constructed by looking at all the exercises done by a user and dividing the total number of mistakes by the total number of exercises. Therefore, the feature captures the fact that some users inevitably learn faster and make fewer errors than other users (Tomoschuk & Lovelett, 2018). This feature is the inverse of the mean accuracy per user, as used by Chen, Hauff, & Houben (2018).

2. Total exercises done per user The total exercises done per user is used as a proxy for motivation by Rich et al. (2018). The underlying thought is that the more motivated a user is, the more exercises are done in an attempt to learn the language.

3. Number of mastered words The number of mastered words is regarded as a proxy for student engagement by Chen, Hauff, & Houben (2018). The authors claim that a student’s engagement can be regarded as a useful indicator to predict the learning gain.


4. Total time spent on Duolingo This feature is included by Chen, Hauff, & Houben (2018) as the authors expect that the more time a user has spent on Duolingo, the more engaged he or she is. Therefore, the feature is stated to serve as a proxy for student engagement.

Considerably more user specific features have been created by the teams participating in the SLAM challenge. However, as claimed by Rich et al. (2018), the majority of the model’s predictive power could be achieved by only using the user IDs, represented as features with high cardinality. Consequently, this thesis will only include these four features.

Time

The more time users have spent solving an exercise, the more likely they are to give an incorrect answer (Chen et al., 2018). However, a user could also skip the question entirely if he or she does not know the answer, and thus, a small amount of time spent can also indicate an error. Chen et al. (2018) have shown that time and exercise-level accuracy are negatively related, which supports the first claim that the more time is needed to construct an answer, the more likely the answer is to be incorrect.

Days

The number of days a learner has been on Duolingo is included by almost all of the teams participating in the SLAM challenge; however, the teams do not motivate why this feature should be included. A possible justification could be that if a learner has spent more days learning the language, a deeper sense of the language has already developed and one can more easily rely on the context in order to learn a word (Prince, 2009). In line with this claim, Tomoschuk & Lovelett (2018) have confirmed that the number of days since a learner started on Duolingo is one of the most informative features in their model.

System

A learner can use Duolingo in three popular ways: via Android, iOS or the web. Klerke, Alonso & Plank (2018) found that the system used is an informative feature, which they concluded by means of feature ablation. However, the authors failed to indicate how it might impact word learning. All the authors state is that the chosen system is ‘a potential source of error’ (p.207). Tomoschuk & Lovelett (2018) adopted a somewhat different approach and merged Android users and iOS users into one category, consequently only preserving the distinction between the web and mobile apps. Although not stated by the authors, this division could be more informative because, nowadays, using an app is far more popular among smartphone users than using the web. Another explanation can be that apps offer a better user experience than the web (Spence, 2014). These characteristics of apps can possibly lead to a more efficient learning experience.

Session

Duolingo users can take part in three different session types: a lesson session, a practice session and a test session. Lessons make up about 77% of the data set and in these lessons, learners are introduced to new words or concepts. Practice sessions make up about 22% of the data set and should only contain previously seen words and concepts. Finally, test sessions make up


around 1% of the data set and are mini-quizzes that allow a learner to skip a single skill in the curriculum (Settles et al., 2018). Although not explicitly stated in the literature, the session format is expected to influence word learning as in a lesson session, learners encounter words they have not seen before, and are thus more likely to make mistakes. In the analysis of Klerke, Alonso & Plank (2018) regarding the SLAM challenge, the session is found to be an informative feature.

Format

Included in the data set are three distinct format types. The first format is the reverse translate format, where learners are given a prompt in the language they know and have to translate the items into the language they are learning. Moreover, a learner can choose the reverse tap format, a simpler version of the reverse translate format, where learners construct an answer using a bank of words and distractors. Finally, a learner can opt for the listening format, where learners hear an utterance in the language they are learning and must transcribe it (Settles et al., 2018). An illustration of the different formats is displayed in Figure 2.

Figure 2. Exercise formats, used with permission of Settles et al. (2018): (a) reverse_translate, (b) reverse_tap, (c) listen

According to Chen et al. (2018), the reverse tap format is expected to be most often correctly completed by a user, as the format has a larger element of recognition in it, rather than active recall. The authors confirmed this is the case and state that the learning format should be included when modelling SLA.

Spacing

The spacing effect, as described by Dempster (1989), implies that words are more successfully learned when their study is spread out over time rather than cramming a large number of words into a few sessions. This effect was visually examined by Chen et al. (2018), by


way of classifying students into high and low spacing subgroups. However, no visual effect was found. Contrary to this finding, Tomoschuk & Lovelett (2018) have identified the spacing features in their analysis as informative features.

2.3 Unbalanced data

Chen et al. (2018) noted that the data as provided for the SLAM challenge is unbalanced, an issue that is confirmed in Chapter 3. Unbalanced data is referred to in the literature as a data set in which one or more of the classes have a much greater number of observations than the others (Haixiang et al., 2017). This can degrade the performance of classifiers, as instances from the minority class may potentially be treated as noise by the model. Conversely, noise can be mistaken for minority examples, as both occur rarely in the data set (Beyan & Fisher, 2015).

According to Beyan & Fisher (2015), several common approaches can be identified to deal with the problem of unbalanced data. Firstly, there are algorithmic level approaches, which entail that the classifier is forced to converge to a decision boundary biased towards the minority class. Secondly, cost-sensitive approaches are frequently adopted in research. Within these approaches, different costs are assigned to training examples from the majority and the minority classes. Thirdly, data level approaches are also a popular technique. This approach implies that the data is resampled by over sampling the minority class or under sampling the majority class, and is often applied as a pre-processing technique prior to training the classifier. Finally, ensemble models are stated to also partially combat the problem of unbalanced classes, as for each individual model a different subset of the data is considered. However, although these models are frequently used for dealing with unbalanced data, they are unable to handle such data sets by themselves. Therefore, ensemble models are often combined with re-sampling techniques in order to combat the problem of unbalanced classes (Chawla, Lazarevic, Hall, & Bowyer, 2003).

2.4 Added value of the thesis

As outlined in Chapter 2.2, a large amount of existing work has already aimed to identify which factors increase the difficulty of learning a language. With regard to the linguistic features, a clear trend can be identified that most of these studies have taken place in small classrooms or laboratories (Hopman et al., 2018). Hence, a big data perspective provides further insights by determining which factors are likely to influence a large number of learners. This has to some extent been done by Hopman et al. (2018); however, the data set used in their research only allowed for the inclusion of two learning process features, namely the total number of times a user has seen a word and the total number of words practiced. As such, their big data study was not able to compare the effectiveness of linguistic features to learning process features.

In light of the SLAM challenge, both linguistic and learning process features were available; however, no clarity was obtained regarding which features contribute to word learning. Although several teams have identified a variety of features as informative to their model, according to Settles et al. (2018), the methods used by the participating teams were found to be more important when predicting word learning. Moreover, as the focus of the challenge was to end as high as possible on the leaderboard, approaches that led to more transparency were often not chosen, as described in Chapter 2.1. Besides adopting a more


transparent approach by separating the features into categories, an error analysis will also be conducted in this thesis. None of the teams participating in the SLAM challenge have followed this approach, however, as explained in Chapter 1, doing so can lead to unexpected insights which could be used to gain more knowledge about how people learn a second language.

In conclusion, the main contribution of this thesis is to shed light on which features are most important to predict language learning. Until now, no definite answer can be provided to this question as existing studies focused on only one category of features, or the included features did not influence the results in a significant manner.

3. Experimental Setup

3.1 The data set

3.1.1 Description of the data set used in the experiment

The data set, as released for the SLAM challenge, can be freely downloaded from the website describing the SLAM challenge and comes in the form of a tar.gz file.3 The file contains a separate train, validation and test set, including the true labels. A description of how the different sets will be used can be found in Chapter 3.5.3.

All users present in the validation and test set are also included in the train set. In Table 1, some further information on the different sets is provided. Based on this information, the conclusion is drawn that, at first glance, the different data sets are relatively similar.

Table 1. Information regarding the different sets

                               Train Set    Validation Set    Test Set
Number of exercises            2,622,939    387,369           386,600
Number of learners             2,593        2,592             2,593
Percentage correct exercises   87.39        85.71             84.92
Number of unique words         1,967        1,839             1,879

Of the three language tracks made available for the SLAM challenge, only the Spanish speakers learning English track will be considered in this thesis because of computational constraints. This language track is chosen as English is the most popular language in terms of people acquiring it as a second language (Eberhard, Simons, & Fennig, 2019).

3.1.2 Data structure

The data, as downloaded, is formatted per user, with users separated by a blank line. The first line of a user's exercises contains exercise-level metadata. For the purpose of the analysis, during preprocessing, the metadata is concatenated with the exercise-level data such that all data available for a single exercise is contained within a single row. A complete description of the features originally included in the data set, without any feature engineering, is added in Appendix A.
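To make this preprocessing step concrete, the sketch below shows one way such a file could be parsed into a flat table, assuming that metadata lines start with ‘#’ and contain space-separated key:value pairs and that token lines follow the column order listed in Appendix A. The file name, column names and exact layout are illustrative assumptions, not the implementation used in this thesis.

```python
import pandas as pd

def parse_slam_file(path, has_labels=True):
    """Read a SLAM-style file into one row per token, repeating the
    exercise-level metadata on every row (sketch; column order is assumed)."""
    rows, meta = [], {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                 # blank line: start of a new block
                meta = {}
                continue
            if line.startswith("#"):     # exercise-level metadata line
                # key:value pairs; free-text values (e.g. prompts) are only
                # handled approximately by this simple split
                meta.update(kv.split(":", 1)
                            for kv in line.lstrip("# ").split() if ":" in kv)
                continue
            cols = line.split()
            row = {"instance_id": cols[0], "token": cols[1],
                   "part_of_speech": cols[2], "morphology": cols[3],
                   "edge_label": cols[4], "edge_head": cols[5]}
            if has_labels:
                row["label"] = int(cols[6])
            row.update(meta)             # concatenate metadata with token data
            rows.append(row)
    return pd.DataFrame(rows)

# Hypothetical usage:
# train = parse_slam_file("data_es_en/es_en.slam.train")
```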

3 http://sharedtask.duolingo.com/#task-definition-data, accessed on 10-04-2019


3.1.3 Software

In this thesis, the programming language used is Python 3. Regarding the preprocessing, the integrated development environment PyCharm (Community Edition 2017.1) is used. However, Sklearn could not be imported within this environment; therefore, the models are trained and evaluated by means of the Jupyter Notebook web application server. The server was hosted on a local PC with Anaconda (version 5.2.0) installed. A variety of packages has been used in this thesis. These are outlined below:

• Pandas (McKinney, 2010)
• Numpy (Van der Walt, Colbert, & Varoquaux, 2011)
• Sklearn (Pedregosa et al., 2011)
• Seaborn (Waskom et al., 2014)
• Matplotlib (Hunter, 2007)
• NLTK (Bird, Loper, & Klein, 2009)
• SciPy (Jones, Oliphant, & Peterson, 2001)

To compute the high density region (HDR) as described in Chapter 4, no existing Python package was found. Therefore, R, within the RStudio environment (RStudio Team, 2015), will be used solely for this purpose, with the package hdrcde (Hyndman, 2018) installed.

3.2 Data manipulation

3.2.1 Preprocessing steps

The data as provided for the SLAM challenge is already fairly organized. However, as stated, to have all relevant information to predict the target feature contained in a single row, the meta information of an exercise is concatenated with the information regarding the specific exercise.

To make all categorical features compatible with the machine learning algorithms used, the categorical features with a small number of classes (3 or fewer) are processed by creating dummies. The other categorical features are preprocessed with the LabelEncoder method from the Sklearn package. The LabelEncoder is trained on the union of the train, validation and test set, as not all values within the validation and test set are included in the train set. Using the LabelEncoder for features with a large number of classes is preferred, as creating dummies for them would greatly increase the feature space.

Moreover, with regard to the word feature, all letters are first lower-cased before encoding the words. By doing so, words that are equal will receive the same label, even if one is written with a capital letter, which has also been done by Settles et al. (2018).
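A minimal sketch of this encoding step is given below. The column names and the split into low- and high-cardinality features are hypothetical placeholders; the actual feature lists are those described in Chapter 3.3.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column names for illustration.
LOW_CARDINALITY = ["session", "format", "client"]        # 3 or fewer classes -> dummies
HIGH_CARDINALITY = ["user", "token", "part_of_speech",
                    "morphology", "edge_label", "lemma"]  # label encoded

def encode_features(train, valid, test):
    # Lower-case the word feature so that "Man" and "man" share one label.
    for df in (train, valid, test):
        df["token"] = df["token"].str.lower()

    # Dummy-code the small categorical features.
    train, valid, test = (pd.get_dummies(df, columns=LOW_CARDINALITY)
                          for df in (train, valid, test))

    # Fit each LabelEncoder on the union of all three sets, since the
    # validation and test sets contain values absent from the train set.
    for col in HIGH_CARDINALITY:
        le = LabelEncoder()
        le.fit(pd.concat([train[col], valid[col], test[col]]))
        for df in (train, valid, test):
            df[col] = le.transform(df[col])
    return train, valid, test
```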

The data set originally has no missing values; however, as mentioned by the organizers of the SLAM challenge, logging issues occurred regarding the time feature (Settles et al., 2018). This has resulted in the feature having several negative values. These values are replaced with NaNs in the data frame and will consequently be treated as missing values, as described in Chapter 3.2.3. No other anomalies could be identified in the data set and thus the assumption is made that no logging issues have occurred other than those reported by Settles et al. (2018).


3.2.2 Feature engineering

This section is divided into two subsections in order to describe how the additional linguistic features and learning process features are created. The name of the feature as used in the analysis is stated in parentheses. Furthermore, an explanation of why these features are included can be found in Chapter 2.

Linguistic feature engineering

1. Word length (nchar) This feature is created by measuring the number of letters in a word.

2. Word Frequency (frequent) A binary feature, where a 1 implies that the word is frequent in the English language. In order to determine if a word is frequent, the words are compared with the 1000 most used words as reported by Google Research (2013).

3. Linguistic Similarity (Levenshtein) This feature measures the similarity between the word in the language of the learner and the word in the language learned. Originally, only the English word was included in the data set. In order to obtain the Spanish word, all words were translated using Google Translate4. Next, the Levenshtein distance was computed, which is the minimum edit distance between two words (Yujian & Bo, 2007). After the Levenshtein distance was obtained, the distance was scaled by the length of the longest word in line with Rich et al. (2018).

4. Lemma (Lemma) The lemmas of the English words are created by means of the NLTK package and are added to the data frame.

5. Concreteness (Concreteness) The concreteness ratings were taken from Brysbaert et al. (2014), who asked 4,000 participants to rate a list of words based on their perceived concreteness. Tomoschuk & Lovelett (2018) have also used these ratings for the SLAM challenge. Not all words included in the data set were present in the concreteness list provided by Brysbaert et al. (2014). If the word was not present, the lemma was searched for. In case the lemma was also not present, a missing value was imputed. A short sketch of how several of these linguistic features can be computed is given below.
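The following sketch illustrates how the word length, frequency flag, scaled Levenshtein distance and lemma could be derived with the packages listed in Chapter 3.1.3. The column names and the frequency word list are placeholders; the thesis used the 1000 most used words reported by Google Research (2013) and translations obtained via Google Translate.

```python
import nltk                                  # nltk.download('wordnet') may be needed
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
top_1000 = set()  # placeholder for the 1000 most frequent English words

def scaled_levenshtein(english_word, spanish_word):
    # Minimum edit distance, scaled by the length of the longest word
    # (in line with Rich et al., 2018).
    distance = nltk.edit_distance(english_word, spanish_word)
    return distance / max(len(english_word), len(spanish_word))

# Hypothetical data frame columns: 'token' (English) and 'translation' (Spanish).
df["nchar"] = df["token"].str.len()
df["frequent"] = df["token"].isin(top_1000).astype(int)
df["Levenshtein"] = [scaled_levenshtein(en, es)
                     for en, es in zip(df["token"], df["translation"])]
df["lemma"] = df["token"].apply(lemmatizer.lemmatize)
```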

Learning process feature engineering

1. Total exercises done per user (exercises) This feature is created by counting all the exercises done per user.

2. Number of correct exercises (correct) The number of correct exercises was calculated by summing up all the correct responses per user.

4 https://cloud.google.com/translate/, accessed on 12-03-2019


3. Mean accuracy rate per user (accuracy) This feature was constructed by dividing the number of correct answers by the total number of exercises done per user.

4. Total time spent on Duolingo (total_time) This feature is constructed by taking the sum of the total time it took a Duolingo user to construct and submit all the exercises done.

5. System (web/app) Although the system feature was originally included in the data set, the feature is recoded to represent the difference between learners using an app and learners using the web. This approach is taken to represent the expectation of Tomoschuk & Lovelett (2018) that a distinction between an app and the web is more likely to influence word learning than an additional separation between Android and iOS.

6. Spacing (class) This feature is engineered in line with Chen et al. (2018). Firstly, all students are sorted in ascending order according to their total time spent on Duolingo. Then, the students are divided among 10 equally sized bins labeled from 0, representing the students spending the least amount of time learning, to 9, representing the students spending the most amount of time learning. This implies that students who are in the same subgroup have roughly spent the same amount of time learning. Within each bin, the students are then sorted based on the number of exercises they have done and subsequently divided into two equally sized subgroups. Students who have done few exercises are considered to be low spacing and are classified with a 0. Students who have done many exercises are assumed to be high spacing and are labelled with a 1. A sketch of this construction is given below.
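The sketch below shows one way to construct the user-level learning process features and the spacing label with Pandas, under the assumption that the binning and the high/low split are done per user and within each time bin; the column names are hypothetical.

```python
import pandas as pd

# Aggregate the exercise-level data to one row per user (hypothetical columns).
users = df.groupby("user").agg(exercises=("label", "size"),
                               correct=("label", lambda s: (s == 0).sum()),
                               total_time=("time", "sum"))
users["accuracy"] = users["correct"] / users["exercises"]

# 10 equally sized bins of total learning time (0 = least, 9 = most time).
users["time_bin"] = pd.qcut(users["total_time"], q=10, labels=False,
                            duplicates="drop")

# Within each bin, students doing more exercises than the bin median are
# assumed to be high spacing (1); the others are low spacing (0).
bin_median = users.groupby("time_bin")["exercises"].transform("median")
users["class"] = (users["exercises"] > bin_median).astype(int)

# Merge the user-level features back onto the exercise-level data frame.
df = df.merge(users, on="user", how="left")
```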

3.2.3 Outliers and missing values

Most features included in the data set are of a categorical nature, and thus, outliers are not a prevalent issue. However, regarding the numerical features, the presence of outliers is analyzed as these might make the results unreliable. Subsequently, outliers are imputed with missing values. The features for which outliers were present are discussed below.

• Days; only the first 30 days of a learner using Duolingo are included in the data set. As can be seen in Figure 3, most exercises are conducted in an earlier stage. This may suggest that learners do not finish the whole 30 days or practice less frequently towards the end. These outliers are not deleted, as the time interval is relatively small and in this way it becomes possible to investigate whether users who have finished the whole 30 days make fewer mistakes.


Figure 3. Distribution of days

• Time; time is reported as the number of seconds it took a learner to construct and submit their answer. When examining the data, some learners took an extremely large amount of time to submit a single answer. This may suggest that a learner was distracted when filling in the exercise. As such errors are not relevant for SLA, these outliers are deleted. To identify such outliers, the interquartile range (IQR) is computed and outliers that fall above 3 times the IQR are deleted. Although this is a high threshold for identifying outliers, some learners can be inherently slower and therefore valuable information might be lost when setting the cutoff value lower. The computed cutoff value is 63 seconds and exercises that took longer are imputed with a missing value, which is in total 2.8% of the data set.

• Total time per user; as total time per user is the sum of the time for all exercises per user, this feature is likely to suffer from the same problem as the time feature. Hence, the same method is applied for identifying outliers, again using a cutoff value of 3. Consequently, users that have spent more than 142,424 seconds in total are substituted with missing values. This concerns 7% of the data set.

As the deleted values are now missing values, a reliable estimate must be found. Moreover, as the concreteness rating was not available for all the words included in the data set, this feature also contains missing values. The concreteness rating was not available for 387 exercises.

The density plots of the features with missing values are displayed in Figure 4. As can be observed from the plots, total_time and time roughly follow a normal distribution. Consequently, imputing the missing values with the mean would be an appropriate, unbiased estimate. However, as the values for concreteness are more dispersed, the missing values of this feature are imputed with the mode.

Figure 4. Distribution of missing value features: (a) total_time, (b) time, (c) concreteness
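A short sketch of the outlier handling and imputation described above is shown below. It uses the common quartile-based variant of the IQR rule; the exact rule and the cutoffs reported in this chapter (e.g. 63 seconds for time) were computed on the original data, so the function and its parameters are illustrative only.

```python
import numpy as np

def remove_time_outliers(series, k=3):
    # Values exceeding the upper quartile by more than k * IQR are treated
    # as outliers and replaced with NaN (missing values); one possible
    # reading of the cutoff rule described in the text.
    q1, q3 = series.quantile([0.25, 0.75])
    upper_bound = q3 + k * (q3 - q1)
    return series.where(series <= upper_bound, np.nan)

# time and total_time are roughly normal, so missing values get the mean;
# concreteness is more dispersed, so its missing values get the mode.
df["time"] = remove_time_outliers(df["time"]).fillna(df["time"].mean())
df["total_time"] = remove_time_outliers(df["total_time"]).fillna(df["total_time"].mean())
df["concreteness"] = df["concreteness"].fillna(df["concreteness"].mode()[0])
```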


3.3 Exploratory Data Analysis

3.3.1 The target feature

The target feature provided in the data set is a description of whether a word has been correctly translated by a learner. If so, the label is 0. If a learner did not know the word or has made a spelling error, the label is a 1. The value counts regarding the target feature are displayed in Figure 5.

Figure 5. Distribution of target feature within the train set

Within the train set, 87.39 percent of the words are correctly translated by the users. This observation implies that there is an unequal distribution of the classes within the target feature, and thus, the target feature is concluded to be unbalanced. This issue is further discussed in Chapter 3.5.4 and Chapter 3.5.5.

3.3.2 Linguistic features

When combining the linguistic features already included in the data set with the additionally engineered features, the linguistic features used for analysis are nchar, word, Part of Speech, edge label, edge head, frequent, lemma, Levenshtein distance and concreteness. The summary statistics of the numerical features are displayed in Table 2.

Table 2. Summary statistics of linguistic features

statistic       Count           Mean    Std     Min     25%     50%     75%     Max
nchar           2,622,939.00    3.82    2.00    1.00    2.00    3.00    5.00    14.00
frequent        2,622,939.00    0.41    0.49    0.00    0.00    0.00    1.00    1.00
Levenshtein     2,622,939.00    0.75    0.31    0.00    0.50    0.85    1.00    1.00
concreteness    2,622,939.00    2.65    1.54    0.00    1.59    2.42    3.93    5.00


To determine how the categorical features are distributed, plots of the value counts for Part of Speech, edge label, edge head and frequent are displayed in Figure 6. With regard to word, Morph and lemma, the value counts are not as informative because there are many different classes. From the figure, it can be observed that for the displayed categorical features the classes are not equally distributed. Especially for the features with a large number of classes, some classes are found to occur far more often than others. Hence, the displayed linguistic features are, similarly to the target feature, concluded to be unbalanced.

Figure 6. Distribution of the linguistic categorical features

To obtain more insight into how the linguistic features might affect the target feature, a correlation heat map is displayed in Figure 7. The features edge label, edge head and Morph are moderately correlated with the target feature. The other features demonstrate no correlation. These findings suggest that, at first glance, a linear model, such as a logistic regression, might be less applicable.

Figure 7. Correlations of linguistic features and target feature
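Such a heat map can be produced with the Seaborn and Matplotlib packages listed in Chapter 3.1.3; the sketch below assumes the encoded linguistic columns and the target are available in a single data frame under hypothetical names.

```python
import matplotlib.pyplot as plt
import seaborn as sns

linguistic_cols = ["nchar", "word", "part_of_speech", "edge_label",
                   "edge_head", "frequent", "lemma", "Levenshtein",
                   "concreteness", "label"]

# Pairwise Pearson correlations between the (encoded) features and the target.
corr = train[linguistic_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlations of linguistic features and target feature")
plt.tight_layout()
plt.show()
```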


3.3.3 Learning process features

The engineered features are combined with the learning process features already present in the data set and consequently, the model includes the following features: user, time, total_time, exercises, correct, accuracy, class, app, web, listen, reverse_tap, reverse_translate, lesson, practice and test. The summary statistics of the numeric features are displayed in Table 3.

Table 3. Summary statistics of learning process features

             Count           Mean        Std         Min       25%         50%         75%         Max
time         2,622,939.00    14.97       1.15        0.00      7.00        12.00       19.00       63.00
total_time   2,622,939.00    29,980.76   25,631.08   0.00      12,372.00   23,086.00   36,156.00   142,424.00
exercises    2,622,939.00    1,430.87    998.09      103.00    786.00      1,138.00    1,771.00    8,894.00
correct      2,622,939.00    1,248.94    897.28      94.00     683.00      990.00      1,514.00    8,561.00
accuracy     2,622,939.00    0.874       0.00        0.26      0.84        0.89        0.92        0.99

The summary statistics in Table 3 indicate that some further logging issues must have occurred regarding the time feature and consequently also the total_time feature. First, the minimum value of time is 0, implying that some users finished exercises in 0 seconds. Moreover, the minimum total_time in the data set is also 0 for some users, whereas the minimum number of exercises is consistently higher than 0. Thus, the total_time feature must be incorrect, as a user cannot conduct several exercises in 0 seconds. There are 67,011 exercises where time equals 0 and there are 13,099 exercises where total_time equals 0. All of these values will be imputed with the mean, as done in Chapter 3.2.3. The new summary statistics for time and total_time are presented in Table 4.

Table 4. Summary statistics of time and total_time

             Count           Mean        Std         Min       25%         50%         75%         Max
time         2,622,939.00    15.35       1.12        1.00      7.00        13.00       19.00       63.00
total_time   2,622,939.00    30,133.39   25,538.57   840.00    12,542.00   23,404.00   36,156.00   142,424.00

To gain insights into how the categorical features are distributed, plots of the value counts are displayed in Figure 8. Again, many features are found to be unbalanced, except for the class feature.

Figure 8. Distribution of the learning process features


It can be observed from the correlation heat map in Figure 9 that there is no correlation between any of the learning process features and the target feature. Therefore, this again indicates that a linear model, such as the logistic regression, might not be suitable to predict word learning based on these features.

Figure 9. Correlations of learning process features


3.4 Methods

In Chapter 2.1, an RNN was concluded to be the most effective modelling approach for SLA. However, as deep learning techniques are said to be a black box (Marcus, 2018), more transparent classifiers are chosen instead; these are outlined below.

3.4.1 Logistic Regression

A logistic regression (LR) will be applied in this thesis to predict word learning, as this classifier was used as the baseline for the SLAM challenge and was furthermore applied by several teams, as described in Chapter 2.1. A logistic regression is a linear classifier that estimates the probability of an instance belonging to a certain class using the underlying logistic function. The class is subsequently assigned based on a cutoff value of the probability, which creates a decision boundary (Witten, 2016).

A well-known disadvantage of a logistic regression is that it can only solve linear problems. This implies that if the features are not linearly associated with the target feature, a logistic regression is likely to output poor predictions (Witten, 2016). The correlation heat maps in Chapter 3.3 indicated that the linear correlations are only moderate or absent altogether; hence, a nonlinear classifier will also be evaluated, as described in the next section.

3.4.2 Random Forest Classifier

A random forest classifier (RF) is a combination of different decision tree predictors, such that each tree depends on the values of a random vector sampled from the same distribution for all trees in the forest. Subsequently, a large number of trees is grown, and a prediction is made by every tree for each instance. The instance is then classified by the label which got the majority vote of all the decision trees (Breiman, 2001).

An RF is chosen as it offers the possibility to evaluate how a nonlinear classifier performs on the data. Moreover, an RF has also often been applied in the SLAM challenge, as described in Chapter 2.1, and thus this classifier is expected to be suitable for the task of predicting word learning. Furthermore, an RF can uncover complex interactions (Denisko & Hoffman, 2018). Tomoschuk & Lovelett (2018) have found several interactions between the features that are relevant for predicting word learning. Finally, the RF classifier belongs to the ensemble techniques (Biau, Devroye, & Lugosi, 2008), which have the advantage of partially combating the problem of unbalanced data, as stated in Chapter 2.3. This only holds when such an ensemble model is combined with other techniques, as addressed in Chapter 3.5.5.

3.5 Experimental procedure

3.5.1 The Baseline

Firstly, a logistic regression without any feature engineering is chosen as a baseline, which will be referred to as the LR baseline. This baseline is adopted as it was the baseline for the SLAM challenge, and by choosing the same baseline, the results obtained in this thesis can be compared more easily to those of the SLAM challenge. Settles et al. (2018) only included the following features in the baseline: user, word, Part of Speech, Morphology and Edge Label.


The baseline features are only a selection of the original features included in the data set and moreover, no explanation of why these features were chosen is provided. However, to try to replicate their baseline, this thesis will follow the same approach. An additional benefit of adopting this baseline is that, because only a few features are included, the added value of the other features can easily be evaluated.

Additionally, the results of the models will also be compared to the ZeroR classifier. The ZeroR classifier simply predicts the majority class for all instances (Nasa & Suman, 2012). The ZeroR classifier does not learn from the data and makes a prediction regardless of the values of the features; therefore, it can serve as a clear threshold against which to compare the predictiveness of the other models.

Thus, this thesis will compare the results of the models to a baseline similar to that of the SLAM challenge as well as to the ZeroR classifier. Comparing the models against a second baseline is furthermore important, as the baseline similar to the SLAM challenge baseline does not behave as expected, which is described in Chapter 4.1.1.
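The thesis does not state how the ZeroR classifier was implemented; one straightforward option with the Sklearn library used here is the DummyClassifier, sketched below with hypothetical variable names.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# ZeroR: always predict the majority class, ignoring the features entirely.
zeror = DummyClassifier(strategy="most_frequent")
zeror.fit(X_train, y_train)

y_pred = zeror.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Because the predicted probabilities are constant, the AUROC of ZeroR
# is 0.5 by construction.
print("AUROC:", roc_auc_score(y_test, zeror.predict_proba(X_test)[:, 1]))
```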

3.5.2 Algorithms

The algorithms applied in this thesis are based on the description of the classifiers in Chapter 3.4. To implement these classifiers, the Sklearn library (Pedregosa et al., 2011) is used. To implement the logistic regression, the SGDClassifier from Sklearn is applied with the loss function parameter set to ‘log’. In this manner, the logistic regression is ensured to be trained via the stochastic gradient descent method. To implement the random forest classifier, the Sklearn RandomForestClassifier is used. This method belongs to the ensemble algorithms of the Sklearn library.
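A minimal sketch of how the two classifiers can be instantiated is given below. The logistic regression settings mirror the baseline values described in Chapter 3.5.3; the random forest values are placeholders, since the tuned settings per model are listed in Appendix B.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic regression trained with stochastic gradient descent:
# loss="log" makes the SGDClassifier optimise the logistic loss.
log_reg = SGDClassifier(loss="log", penalty="l2", alpha=20,
                        max_iter=10, learning_rate="constant", eta0=0.1)

# Random forest classifier; hyperparameters are tuned as described in
# Chapter 3.5.3, so these values are only illustrative.
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
```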

3.5.3 Hyperparameters

Regarding the baseline, the same approach as that of Settles et al. (2018) is adopted. This implies that the logistic regression is trained via L2 regularization with alpha set to 20. The number of iterations is 10 and the learning rate is set to 0.1.

In order to optimize the results of the classifiers used for the linguistic model, the learning process model, and the model combining all features, the models are trained with a variety of hyperparameters on the train set. For each hyperparameter setting, 5 different random states are explored, as the models are found to be sensitive to the random state setting. Although exploring more than 5 random states would offer a more reliable estimate, it would also increase the number of iterations substantially. Subsequently, the hyperparameters leading to the highest AUROC score on the validation set are chosen; these are listed in Appendix B. The final score of the metrics is then based on the performance of the models, with the selected hyperparameters, on the test set.
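A minimal sketch of this tuning procedure for the random forest is given below, assuming the train and validation splits are available as arrays; the grid corresponds to the parameter space listed in Appendix B.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid

def tune_random_forest(X_train, y_train, X_val, y_val, n_states=5):
    """Return the hyperparameters with the highest mean validation AUROC."""
    grid = ParameterGrid({"max_features": [5, 7, 10],
                          "min_samples_split": [5, 10, 20],
                          "n_estimators": [10, 20, 30, 40],
                          "max_depth": [10, 20, 30]})
    best_params, best_auroc = None, -np.inf
    for params in grid:
        scores = []
        for state in range(n_states):  # 5 random states per setting
            model = RandomForestClassifier(random_state=state, **params)
            model.fit(X_train, y_train)
            scores.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
        if np.mean(scores) > best_auroc:
            best_auroc, best_params = np.mean(scores), dict(params)
    return best_params, best_auroc
```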

3.5.4 Evaluation metrics

The main evaluation metric used in this thesis will be the Area Under the Receiver Operating Characteristic Curve (AUROC), in line with the SLAM challenge. The AUROC score indicates how well the model is able to discriminate between the classes and is thus an appropriate metric to deal with unbalanced data. Moreover, the AUROC metric is clearly interpretable. A value of 0.5 implies that the model has no predictive power and does not perform better than random guessing (Provost & Fawcett, 2013; Van Hulse, Khoshgoftaar, & Napolitano, 2007). The maximum AUROC score is 1, indicating a perfect classifier (Provost & Fawcett, 2013).

Moreover, to develop a sense of how often the model makes a correct prediction, accuracy will be reported. However, it must be stressed that this metric is reported solely to create a more intuitive sense of the results, rather than to assess the model. This is because accuracy does not reflect the nature of the results well. To illustrate, 87% of the words are correctly filled in by the learners in the train set. This implies that if the model predicts the class to be 0 for all instances, it will already have a training accuracy of 87%.

The F1-score has also been reported in the SLAM challenge; however, the F1-score can already be improved substantially just by tuning the classification threshold. The AUROC metric, in contrast, does not depend on a single classification threshold and can therefore better inform on how capable the model is of distinguishing between the classes (Settles et al., 2018). Consequently, the F1-score is not reported in this thesis.
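A sketch of how both metrics could be computed with scikit-learn, assuming a fitted classifier that exposes predict and predict_proba:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(model, X_test, y_test):
    """Report AUROC (the main metric) and accuracy (for intuition only)."""
    # AUROC is computed from the predicted probability of the positive class
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    # Accuracy uses the hard class predictions at the default threshold
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return auroc, accuracy
```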

3.5.5 Down sampling

To further combat the problem of unbalanced classes, all of the models are trained with the majority class of the target feature down sampled. It has been stated in Chapter 2.3 that down sampling is a common approach to deal with unbalanced classes. Although training the models in this manner deviates from the baseline as proposed by Settles et al. (2018), it offers more informative results. As described in Chapter 4.1, when the baseline model is trained without down sampling, it simply predicts the label to be 0 for every instance.
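A minimal sketch of the down sampling step, assuming the train set is held in a pandas DataFrame with the target stored in a column named 'label' (1 = incorrect, the minority class):

```python
import pandas as pd

def downsample_majority(train: pd.DataFrame, target: str = "label",
                        random_state: int = 0) -> pd.DataFrame:
    """Randomly drop majority-class rows until both classes are equally frequent."""
    minority_label = train[target].value_counts().idxmin()
    minority = train[train[target] == minority_label]
    majority = train[train[target] != minority_label]
    # Keep as many majority rows as there are minority rows, then shuffle
    majority_down = majority.sample(n=len(minority), random_state=random_state)
    return pd.concat([minority, majority_down]).sample(frac=1, random_state=random_state)
```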

4. Results

4.1 The baseline model

4.1.1 The LR baseline

The baseline consists of a logistic regression trained with stochastic gradient descent, which is the same baseline as adopted by Settles et al. (2018). However, when aiming to replicate the baseline of the SLAM challenge, the results, displayed in Table 5, are found to deviate. An exploration of the probable causes of why the results differ is given in Chapter 5.

Table 5. First results LR baseline model

                    Accuracy   AUROC
Score on test set   0.8492     0.5000

As described in Chapter 3.5.5, the model is found to predict the label to be 0 for every instance in an attempt to maximize accuracy. Consequently, the AUROC score is 0.5000 and accuracy is equal to the proportion of correct exercises in the test set, suggesting that the model does not have the ability to separate between the classes. To combat this problem, down sampling is applied to create a balanced train set and the model is retrained. The results are illustrated in Figure 10.


Figure 10. Plots of the baseline model with down sampling

As can be observed from Figure 10, accuracy and the AUROC score show considerable variance and are found to depend on the chosen random state, of which 1000 are evaluated. To find single measures for both scores, the highest density region (HDR) is computed. The HDR is the point where the distribution is most dense, also called the mode (Hyndman, 1996). Additionally, the upper and lower bounds are reported; these are the bounds within which 95% of the observations fall. This method is preferred over computing the mean and the corresponding confidence interval, as the data is not normally distributed. From the graphs, it can be concluded that more than one peak exists in the distributions and hence, examining the HDRs is more applicable. The statistics for both metrics are displayed in Table 6.

Table 6. Final results LR baseline model

              Accuracy   AUROC
Mode          0.6174     0.5612
Lower bound   0.3334     0.5247
Upper bound   0.7648     0.5645
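The mode and bounds reported in Table 6 can be approximated as sketched below; the thesis relies on Hyndman's HDR approach, so this kernel-density sketch is only an illustrative approximation, assuming the 1000 scores are available as an array.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mode_and_bounds(scores):
    """Approximate the mode (densest point) and the central 95% bounds of a score distribution."""
    scores = np.asarray(scores)
    kde = gaussian_kde(scores)                          # kernel density estimate of the scores
    grid = np.linspace(scores.min(), scores.max(), 1000)
    mode = grid[np.argmax(kde(grid))]                   # point of highest estimated density
    lower, upper = np.percentile(scores, [2.5, 97.5])   # bounds containing 95% of observations
    return mode, lower, upper
```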

From the scores in Table 6, it can be seen that the model is not predicting the label to be 0 for every instance anymore, as accuracy has decreased and the AUROC score has increased. However, the AUROC score is still not comparable to the baseline used in the SLAM challenge, which is 0.774. Although the results differ, these scores will be used as a baseline in order to draw a fair comparison.

4.1.2 The ZeroR Classifier

The results of the models are additionally evaluated against the results of the ZeroR classifier, added in Table 7. Test accuracy is found to be equal to the proportion of correct exercises in the test set and the AUROC score indicates that the model is not able to discriminate between the classes. Both findings are as expected for the ZeroR classifier, as the classifier outputs the majority class for every instance.

Table 7. Results ZeroR model

                          Accuracy   AUROC
Performance on test set   0.8492     0.5000


4.2 The other models

With regard to the other models, a logistic regression and a random forest classifier model are analyzed for each set of features, as motivated in Chapter 3.4. With regard to the logistic regression, 1000 random states are evaluated, similarly to the LR baseline. The results are found to be heavily dependent on the chosen random state for each set of features. The distributions of these scores are added in Appendix C. As the scores are not normally distributed, the mode with the upper bound and the lower bound will be reported for the logistic regression models. With regard to the random forest classifier, 250 different random states are evaluated because of limited computational resources. The distributions of these scores can be found in Appendix D. The random forest classifier is found to be far less sensitive to the random state and therefore the mean will be reported for these models.

Firstly, based on the reported AUROC scores in Table 8, the RF including only linguistic features is concluded not only to beat the LR baseline, but also to be the best performing model. On the other hand, when the linguistic features are modelled with a logistic regression, the mode of the AUROC score is found to be lower than that of the LR baseline. An important caveat to this finding is that the upper bound is higher than the upper bound of the LR baseline, and thus, if the optimal random state is chosen, the linguistic features modelled with a LR can in principle also beat the LR baseline. Moreover, the linguistic features modelled with a LR are found to outperform the ZeroR classifier, suggesting that the model is able to discriminate between the classes to some extent.

Secondly, the learning process features are found to perform worse than the LR baseline regardless of the chosen model. For both the logistic regression and the random forest classifier, the AUROC score lies around 0.5. Hence, the AUROC score is similar to that of the ZeroR classifier, suggesting that the models do not have the ability to separate between the classes. This can be confirmed by examining the accuracy scores, added in Table 9. The lower bound and the upper bound of the logistic regression are comparable to the proportions of correct and incorrect exercises in the test set.

Table 8. The test AUROC scores of the models

The AUROC score on the test set of all the models. The mean is reported for the RF models and the mode with the lower and upper bound for the LR models.

Features           Model   Mean     Mode     Lower bound   Upper bound
Linguistic         RF      0.6099   -        -             -
Linguistic         LR      -        0.5521   0.5141        0.5686
Learning process   RF      0.5012   -        -             -
Learning process   LR      -        0.4999   0.4647        0.5283
All features       RF      0.6027   -        -             -
All features       LR      -        0.4999   0.4950        0.5670


Finally, the model including both sets of features is found to outperform the LR baseline when modelled with a RF, based on its AUROC score. However, it scores lower than the linguistic features modelled with a RF with regard to both the AUROC score and accuracy. When all the features are modelled with a LR, the mode of the AUROC score is 0.5, which is comparable to the AUROC score of the ZeroR classifier, implying that the model has no predictive power. Nonetheless, the upper bound of the AUROC score is found to be much higher than the mode. This suggests that if the optimal random state is chosen, the LR model will be able to discriminate between the classes to some extent; however, it often gets trapped in a local minimum. The upper bound of the AUROC score is also higher than the upper bound of the LR baseline model. Thus, if the optimal random state is chosen, this model could outperform the LR baseline.

Table 9. The test accuracy scores of the models

The accuracy score on the test set of all the models. The mean is reported for the RF models and the mode with the lower and upper bound for the LR models.

Features           Model   Mean     Mode     Lower bound   Upper bound
Linguistic         RF      0.5137   -        -             -
Linguistic         LR      -        0.3918   0.2825        0.7929
Learning process   RF      0.5529   -        -             -
Learning process   LR      -        0.8286   0.1508        0.8491
All features       RF      0.4912   -        -             -
All features       LR      -        0.8184   0.1508        0.8469

For a more complete analysis, the train AUROC scores are added in Table 10. With regard to the RF models, the train score is higher than the test score; especially for the learning process features, the difference between the train and test AUROC scores is considerably large. With regard to the LR models, the difference between the train and the test set is minimal.


Table 10. The train AUROC scores of the models

The AUROC score on the train set of all the models. The mean is reported for the RF models and the mode with the lower and upper bound for the LR models.

Features           Model   Mean     Mode     Lower bound   Upper bound
Linguistic         RF      0.6145   -        -             -
Linguistic         LR      -        0.5435   0.5157        0.5466
Learning process   RF      0.7019   -        -             -
Learning process   LR      -        0.5000   0.4987        0.5013
All features       RF      0.6161   -        -             -
All features       LR      -        0.5000   0.4998        0.5463

4.3 Feature importance

In order to answer the question of which features are most informative to the best performing model, the importance of the linguistic features, as modelled with the random forest classifier, is displayed in Figure 11.

Figure 11. Feature importance

Morph is found to be the most relevant feature for predicting word learning. Additionally, edge label and edge head are also found to be relatively informative to the model. The other features are demonstrated to be less relevant. It is interesting to note that, in the EDA section in Chapter 3.3.2, Morph, edge label and edge head were also the only features found to correlate with the target feature.


4.4 Error analysis

In order to answer the second research question, an error analysis on the test set is conducted using the best performing model, which is the random forest classifier only including the linguistic features. The random forest classifier is considered to be relatively independent of the random state, and thus, the results of only one random state will be evaluated in the error analysis.

To conduct the error analysis, the error feature is created by comparing the value predicted by the model with the real value. A 0 is assigned if the values match, and a 1 otherwise. In order to examine how this feature correlates with the other features, a correlation plot is displayed in Figure 12. As a measure of interestingness, a correlation higher than 0.2 or lower than -0.2 is used as the cutoff to decide which features to investigate further. The visual representation of the features not discussed is added in Appendix E.
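A sketch of how the error feature and these correlations could be derived, assuming the test features are held in a numerically encoded DataFrame and the fitted model exposes predict:

```python
import pandas as pd

def error_correlations(model, X_test: pd.DataFrame, y_test, cutoff: float = 0.2):
    """Create the error feature and list features correlating with it beyond the cutoff."""
    data = X_test.copy()
    # 0 when the predicted label matches the true label, 1 otherwise
    data["error"] = (model.predict(X_test) != y_test).astype(int)
    correlations = data.corr()["error"].drop("error")
    # Keep features whose absolute correlation exceeds the cutoff (0.2 in this thesis)
    return correlations[correlations.abs() > cutoff]
```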

Figure 12. Correlation plot with the errors

4.4.1 Number of characters

First, the feature nchar is found to have the highest correlation with error. A boxplot of nchar grouped by error is displayed in Figure 13. The instances that are wrongly predicted by the model have, on average, a higher number of characters, as represented by the triangle in Figure 13.


Figure 13. Boxplot of nchar grouped by error

To determine whether the difference between the means is significant, a two-sided independent t-test between the two groups is conducted and the results are reported in Table 11. From the table, it can be concluded that the difference is significant.

Table 11. Two-sided independent t-test for nchar

Variable     Correct          Error            t-value     prob
             (n = 198,993)    (n = 187,607)
Nchar   M    2.2695           3.5750           -200.0670   0.000
        SD   (1.9265)         (2.1299)
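The comparison in Table 11 corresponds to a standard two-sided independent-samples t-test; a sketch with SciPy, assuming the nchar values of the correctly and wrongly predicted instances are available as two arrays:

```python
import numpy as np
from scipy import stats

def compare_nchar(nchar_correct, nchar_error):
    """Group means, standard deviations and a two-sided independent t-test for nchar."""
    summary = {"mean_correct": np.mean(nchar_correct), "sd_correct": np.std(nchar_correct, ddof=1),
               "mean_error": np.mean(nchar_error), "sd_error": np.std(nchar_error, ddof=1)}
    t_value, p_value = stats.ttest_ind(nchar_correct, nchar_error)
    return summary, t_value, p_value
```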

4.4.2 Morphological features

Morph is also found to be relatively correlated with error. However, as the feature has many categories, only the 10 most frequent categories are displayed to enhance the interpretability of the visualization. Several categories are shown to be frequently wrongly predicted; however, no clear relation between those categories can be found.

Figure 14. Morph grouped by error


4.4.3 Word

The word feature likewise contains many categories, and therefore, only the 20 words that occur most frequently are displayed. As can be seen in Figure 15, some words are more frequently erroneously predicted than others. To illustrate, ‘are’ is relatively often wrongly predicted by the model. If the word is ‘are’, the model only predicts 43.38% of the instances correctly, whereas the overall percentage of correct predictions is 51.37%. Moreover, for certain words, the model predicts the correct label relatively often. A clear example of this is the word ‘the’. As said, on average, the model predicts 51.37% of the instances correctly; however, if the word is ‘the’, the model predicts 90.37% correctly.

Figure 15. word grouped by error

4.4.4 Part of Speech

With respect to Part of Speech, illustrated in Figure 16, several categories are found to contain considerably more erroneous predictions. This is clearly visible for the categories ADJ and NOUN. The overall accuracy is 51.37 percent, whereas accuracy is 25.98 percent if the Part of Speech is ADJ and 23.51 percent if it is NOUN. This implies that certain grammatical groups are harder to predict for the model. However, the opposite is also found to hold: if the Part of Speech is PRON, accuracy equals 73.45 percent.

Figure 16. Part of Speech grouped by error


4.4.5 Edge Label

With respect to the edge label, represented in Figure 17, the model is again found to erroneously predict words of certain categories more often. To illustrate, if the edge label is ROOT, the model predicts 33.49 percent correctly, whereas 51.37 percent of all instances are correctly predicted. Furthermore, certain instances are also found to be predicted correctly more often. For instance, if the word is labeled with det, which means determiner, the model predicts 79.54 percent of the instances correctly.

Figure 17. Edge label grouped by error

5. Discussion

The main goal of this thesis was to shed light on which features are able to predict word learning when learners make use of an educational platform, such as Duolingo. From the literature, it became apparent that a wide variety of features were likely to affect word learning. These features were separated into two categories, linguistic features and learning process features. Additionally, this thesis aimed to explain the errors made by the best performing model by conducting an error analysis.

The LR baseline of this thesis did not match the AUROC score of the baseline adopted in the SLAM challenge. Although an attempt was made to match the parameters described for the baseline as closely as possible, the results deviated substantially. To explain this, a few possible causes are explored. Firstly, with regard to the baseline as used for the SLAM challenge, no preprocessing was done before modelling. In this thesis, some outliers were deleted and the missing data was imputed with the mean or the mode. Secondly, Settles et al. (2018) did not use the Sklearn library but created their own logistic regression model. A possibility is that because of some design differences, the Sklearn model obtained different results compared to the logistic regression of Settles et al. (2018). Finally, Settles et al. (2018) scaled the weights of the classes by their frequencies. In this thesis, down sampling was used to overcome the problem of unbalanced classes. A possibility is that the approach taken by Settles et al. (2018) offers a more sophisticated solution to the problem of unbalanced classes.

Moreover, the highest AUROC score obtained in the analyses done in this thesis was also considerably lower than that of the teams participating in the SLAM challenge. As stated, the teams scoring highest on the leaderboard often opted for an RNN; however, this was not done in this thesis, as deep learning models are known to lack transparency (Marcus, 2018). Nevertheless, even the teams that opted for similar models, thus a logistic regression or a random forest classifier, outperformed the scores in this thesis considerably. To illustrate, the lowest score among the teams that submitted a paper was obtained by Klerke et al. (2018), who used a logistic regression, yet still reached an AUROC of 0.817 for the English track. Another explanation for the deviation of the scores can be sought in the manner in which the parameters were set. Although many teams did not disclose their chosen parameters, Chen et al. (2018) state that they set the n_estimators parameter of their RF model to 1550 and Rich et al. (2018) set it to 512. The highest value of the n_estimators parameter explored in this thesis was 40. Exploring higher values of the parameters requires more computational resources, which were not available during the time of this thesis.

Interesting to note is that the random forest classifier performed better with regard to all the different sets of features. In the EDA section in Chapter 3.3, the linear correlations between the features and the target feature were found to be at most moderate, and thus, a linear model, such as the logistic regression, is less applicable. Furthermore, as stated in Chapter 3.4.2, a random forest classifier is capable of modelling interactions among the features, which were found to be present by Tomoschuk & Lovelett (2018). Finally, in all the models, the random forest classifier was less sensitive to the random state compared to the logistic regression.

The finding that the logistic regression is more sensitive to the random state can be explained by the fact that the model often gets trapped in a local minimum. Initializing different random states partly solves this problem, as learning begins at a different point on the error curve and thus may not encounter a local minimum. Nevertheless, the presence of local minima makes it more difficult to interpret the results, as can be illustrated by the logistic regression including linguistic features. The upper bound of this model was higher than the upper bound of the LR baseline, suggesting that if the model does not get trapped in a local minimum, it could outperform the LR baseline. However, as the mode shows, it often does get trapped, and more frequently than the LR baseline. Consequently, no clear conclusion can be drawn about whether the logistic regression including linguistic features outperforms the LR baseline. This also implies that no general conclusion can be made regarding how the linguistic features perform relative to the LR baseline. If the features are modelled with a random forest classifier, they clearly outperform the LR baseline; however, as the LR baseline is a different, linear model, a fair comparison cannot be made regarding the features. Nevertheless, it can be said with certainty that they are more effective than the learning process features, as the latter score similarly to the ZeroR classifier for both models.

The model containing both the linguistic and the learning process features is found to perform worse than the linguistic features model. One would expect more features to increase the predictive power of the model because more information is at hand. However, as the model only including linguistic features performs better, the learning process features are found to hurt the performance of the model. Moreover, the models only including learning process features both obtained an AUROC of around 0.5, suggesting that the models are not able to discriminate between the two classes. Thus, the learning process features are concluded not to be relevant for predicting word learning. This conclusion is comparable to the findings of Hopman et al. (2018). Their model obtained an overall accuracy of 90 percent on a different data set that focused on modelling linguistic features.


When conducting the error analysis, the five linguistic features that correlated the most with the error were further investigated. For some labels, expectations were formed in Chapter 2 that the relation between them and the target feature would be relatively straightforward. For instance, in the related work chapter, the expectation was formed that root words would affect word difficulty, as words that are the root of a sentence are easier to learn. However, in the error analysis, with regard to the edge label feature, root words were found to be often wrongly predicted by the model. Upon further examination, if the word is the root of a sentence, learners give a correct answer 81.5 percent of the time, which is lower than the proportion of correctly answered exercises in the test set. With regard to the Part of Speech feature, the expectation was that nouns would be easier to learn; however, the error analysis indicates that the model often predicts nouns wrongly. Upon further examination, nouns are shown to be correctly learned 82.1 percent of the time, which is again lower than the proportion of correctly answered exercises in the test set. Therefore, the expectations formed in the related work chapter regarding nouns and root words are found not to hold when further investigated.

Moreover, the error analysis shows that with regard to the morphological features, classes that occur often in the data set are still frequently erroneously predicted by the model. One would expect that if the models are trained on many instances of these classes, a more reliable prediction could be made. This expectation does hold with regard to the word feature: the four most frequently occurring words are very often correctly predicted by the model.

Furthermore, in general, the error analysis shows that the model often erroneously predicts certain categories or labels within the test set. A possible explanation of why the model fails to predict these values is that the distribution of these features differs between the train and the test set, which is also implied by the higher AUROC score of the models on the train set. In order to determine whether the distributions of the categorical features vary, a Chi-square test for goodness of fit is conducted on the value counts of the features in the train and test set. This test is used to determine whether there is a significant difference between an expected distribution and the actual distribution (Nesbitt, 1966). The expected distribution is the distribution of the train set, and hence, the test examines whether the test set matches this distribution. The results of the Chi-square test regarding Part of Speech, edge label, word and Morph are added in Appendix F. For all of these features, the Chi-square test indicates that the null hypothesis that the data is consistent with the expected distribution is rejected. To test whether the nchar feature is distributed differently in the test set compared to the train set, a Kolmogorov-Smirnov test is used, which is more appropriate for continuous features (Wolstenholme, 1999). The result of the Kolmogorov-Smirnov test is also added in Appendix F. The null hypothesis that the two samples have the same distribution is rejected. Based on these results, the assumption is made that the distribution of the examined features is different in the train set compared to the test set, which is referred to as a covariate shift (Storkey, 2013).
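Both tests can be carried out with SciPy as sketched below, assuming the feature values of the train and test set are available as pandas Series; the expected counts are derived from the train-set proportions and rescaled to the test-set size (categories absent from the train set are ignored in this simplification).

```python
import pandas as pd
from scipy import stats

def chi_square_goodness_of_fit(train_values: pd.Series, test_values: pd.Series):
    """Does the categorical distribution in the test set match that of the train set?"""
    categories = train_values.value_counts().index
    observed = test_values.value_counts().reindex(categories, fill_value=0)
    # Expected counts under the train-set distribution, rescaled to the test-set size
    expected = train_values.value_counts(normalize=True).loc[categories] * observed.sum()
    return stats.chisquare(f_obs=observed, f_exp=expected)

def kolmogorov_smirnov(train_values, test_values):
    """Two-sample Kolmogorov-Smirnov test for a continuous feature such as nchar."""
    return stats.ks_2samp(train_values, test_values)
```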

The occurrence of a covariate shift violates one of the important assumptions in machine learning, and is said to cause unreliable predictions of the model (Rabanser, Günnemann, & Lipton, 2018). A possible solution would be to reweight the training data by the probability of presence in the test data (Tran & Aussem, 2015). An interesting direction for future research would be to determine if indeed the performance of the model increases when the train data is reweighted.
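One common way to implement such reweighting, sketched below as an illustration of the general idea rather than the exact procedure of Tran & Aussem (2015), is to train a probabilistic classifier that distinguishes train from test instances and to use the estimated ratio p(test | x) / p(train | x) as a sample weight when refitting the model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def covariate_shift_weights(X_train, X_test, random_state=0):
    """Estimate importance weights p(test|x) / p(train|x) with a domain classifier."""
    X_domain = np.vstack([X_train, X_test])
    y_domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 0 = train, 1 = test
    domain_clf = RandomForestClassifier(n_estimators=50, random_state=random_state)
    domain_clf.fit(X_domain, y_domain)
    p_test = domain_clf.predict_proba(X_train)[:, 1]
    p_test = np.clip(p_test, 1e-3, 1 - 1e-3)  # avoid division by zero
    return p_test / (1 - p_test)

# The resulting weights could then be passed to the final model via fit(..., sample_weight=weights).
```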

In relation to the first research question, a more transparent answer has been provided to the question of how SLA could best be modelled, namely with the linguistic properties of a word. However, it should be stressed that the accuracy of the model is fairly low. This implies that the model is not able to make a correct prediction for a large number of instances. The models should consequently only be used to gain insights into which features the model uses to discriminate between the classes. They should not be used to predict actual word learning, as for that purpose, predicting the word to be correctly learned for all exercises is more effective. Furthermore, although the most informative features of the linguistic model have been identified in Chapter 4.3, no conclusion is reached yet on how these features affect the difficulty of a word. If, for instance, words with a certain edge label are known to be difficult to learn, more attention could be paid to such words when encountered by learners. This additionally offers an interesting direction for future research. With regard to the second research question, certain classes or values within the features are found to be often predicted wrongly by the model. When further investigating these features, their distributions are found to be different in the train data compared to the test data. This suggests that the model could be improved by reweighting the training data by the probability of presence in the test data.


6. Conclusion

The goal of this thesis was to conclude what the most effective features are to model second language acquisition. To do so, the features were separated into two categories, namely linguistic and learning process features. An advantage of this approach is that more transparency is obtained compared to previous research regarding what kind of features are most important. The first research question investigated in this thesis was:

RQ1: Which features are best able to predict whether a learner will translate a word correctly?

Based on the results, linguistic features are concluded to be best able to predict word learning. More specifically, the morphological structure, the edge label and the edge head were found to be the most informative features included in the linguistic model. The learning process features are found not to beat the LR baseline. The model with both linguistic and learning process features is found to beat the LR baseline when modelled with a random forest classifier, however, it does not outperform the linguistic features model.

Moreover, the second goal of this thesis was to further investigate the errors made by the best performing model and to determine if the model consistently erroneously predicts certain categories or ranges within features. The second research question was:

RQ2: What explains erroneous predictions of the best performing model?

By investigating the 5 features that correlated the most with the error feature, certain classes or values within those features are found to be more likely to be erroneously predicted by the model. When investigating those features further, their distributions are found to be different in the train set compared to the test set.

The main implication of this thesis is that attention should be focused on the linguistic properties of a word when modelling second language learning via educational platforms. Moreover, learning process features are found to be less relevant, and thus, the manner in which the word is taught and the level of the learner are found to be not as important. This suggests that certain words are bound to be more troublesome for a learner, regardless of the circumstances in which the word was learned.


References

Baddeley, A. D., Thomson, N., & Buchanan, M. (1975). Word length and the structure of working memory. Journal of Verbal Learning and Verbal Behavior, 14(6), 575–589. https://doi.org/10.1016/S0022-5371(75)80045-4

Balakrishnan, V., & Lloyd-Yemoh, E. (2014). Stemming and lemmatization: A comparison of retrieval performances. In Proceedings of SCEI Seoul Conferences (pp. 174–179). Seoul, Korea.

Bestgen, Y. (2018). Predicting Second Language Learner Successes and Mistakes by Means of Conjunctive Features. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 349–355). https://doi.org/10.18653/v1/w18-0542

Beyan, C., & Fisher, R. (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition, 48, 1653–1672. https://doi.org/10.1016/j.patcog.2014.10.032

Biau, G., Devroye, L., & Lugosi, G. (2008). Consistency of Random Forests and Other Averaging Classifiers. Journal of Machine Learning Research, (9), 2015–2033. https://doi.org/10.1145/1390681.1442799

Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python. O’Reilly Media Inc.

Booij, G. (2019). Introduction. In The morphology of Dutch (pp. 1–2). Oxford University Press.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

Breland, H. M. (1996). Word Frequency and Word Difficulty: A Comparison of Count in Four Corpora. American Psychological Society, 7(2), 96–99. https://doi.org/10.1111/j.1467-9280.1996.tb00336.x

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. https://doi.org/10.3758/s13428-013-0403-5

Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving Prediction. In European Conference on Principles of Data Mining and Knowledge Discovery (Vol. 2838, pp. 107–119). Retrieved from https://link.springer.com/content/pdf/10.1007%2F978-3-540-39804-2_12.pdf

Chen, G., Hauff, C., & Houben, G. (2018). Feature Engineering for Second Language Acquisition Modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 356–364).

De Groot, A. M. B., & Keijzer, R. (2000). What is hard to learn is easy to forget: The roles of word concreteness, cognate status, and word frequency in foreign-language vocabulary learning and forgetting. Language Learning, 50(1), 1–56. https://doi.org/10.1111/0023-8333.00110


Debusmann, R. (2000). An Introduction to Dependency Grammar. In Hausarbeit für das Hauptseminar Dependenzgrammatik SoSe 99 (pp. 1–16).

Dempster, F. N. (1989). Spacing effects and their implications for theory and practice. Educational Psychology Review, 1(4), 309–330. https://doi.org/10.1007/BF01320097

Denisko, D., & Hoffman, M. M. (2018). Classification and interaction in random forests. Proceedings of the National Academy of Sciences, 115(8), 1690–1692. https://doi.org/10.1073/pnas.1800256115

Eberhard, D. M., Simons, G. F., & Fennig, C. D. (2019). Ethnologue: Languages of the World. Retrieved from https://www.ethnologue.com/language/eng

Gentner, D. (2006). Why Verbs Are Hard to Learn. In Action meets word: How children learn verbs (pp. 544–564). Oxford University Press.

Goodwin, K., & Highfield, K. (2012). iTouch and iLearn: An examination of “educational” apps. In Early Education and Technology for Children conference (pp. 14–16). Retrieved from http://www.eetcconference.org/wp-content/uploads/Examination_of_educational_apps.pdf

Google Research. (2013). Ngram Viewer. Retrieved April 12, 2019, from https://books.google.com/ngrams/info

Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239. https://doi.org/10.1016/j.eswa.2016.12.035

Heil, C. R., Wu, J. S., Lee, J. J., & Schmidt, T. (2017). A Review of Mobile Language Learning Applications: Trends, Challenges, and Opportunities. The EuroCALL Review, 24(2), 32. https://doi.org/10.4995/eurocall.2016.6402

Hopman, E., Thompson, B., Austerweil, J., & Lupyan, G. (2018). Predictors of L2 word learning accuracy: A big data investigation. In the 40th Annual Conference of the Cognitive Science Society (CogSci 2018) (pp. 513–518).

Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9, 90–95. https://doi.org/10.1109/MCSE.2007.55

Hyndman, R. J. (1996). Computing and graphing highest density regions. The American Statistician, 50(2), 120–126.

Hyndman, R. J. (2018). hdrcde: Highest Density Regions and Conditional Density Estimation.

Jones, E., Oliphant, T., & Peterson, P. (2001). SciPy: Open Source Scientific Tools for Python. Retrieved April 19, 2019, from http://www.scipy.org/

Kaneko, M., Kajiwara, T., & Komachi, M. (2018). TMU System for SLAM-2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 365–369).

Kim, H., & Yeonhee, K. (2012). Exploring Smartphone Applications for Effective Mobile-Assisted Language Learning. Multimedia-Assisted Language Learning, 15(1), 31–57. Retrieved from http://kmjournal.bada.cc/wp-content/uploads/2013/05/15-1-2Kim.pdf


Klerke, S., Alonso, H. M., & Plank, B. (2018). Grotoco@SLAM: Second Language Acquisition Modeling with Simple Features, Learners and Task-wise Models. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 206–211).

Marcus, G. (2018). Deep Learning: A Critical Appraisal. arXiv Preprint arXiv:1801.00631.

Marian, V., & Shook, A. (2012). The Cognitive Benefits of Being Bilingual. Cerebrum, 2012(13), 1–12.

McKinney, W. (2010). Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 51–56).

Nasa, C., & Suman, S. (2012). Evaluation of Different Classification Techniques for WEB Data. International Journal of Computer Applications, 52(9), 34–40. https://doi.org/10.5120/8233-1389

Nayak, N. V, & Rao, A. R. (2018). Context Based Approach for Second Language Acquisition. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 212–216).

Nesbitt, J. E. (1966). Chi-square.

Osika, A., Nilsson, S., Sydorchuk, A., Sahin, F., Huss, A., & Labs, S. (2018). Second Language Acquisition Modeling: An Ensemble Approach. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 217–222).

Oya, M. (2011). Syntactic Dependency Distance as Sentence Complexity Measure. In Proceedings of The 16th Conference of Pan-Pacific Association of Applied Linguistics (pp. 313–316).

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, (12), 2825–2830. https://doi.org/10.1007/s13398-014-0173-7.2

Prince, P. (2009). Second Language Vocabulary Learning: The Role of Context versus Translations as a Function of Proficiency. The Modern Language Journal, 80(4), 478–493.

Provost, F., & Fawcett, T. (2013). Data Science for Business. O’REILLY.

Rabanser, S., Günnemann, S., & Lipton, Z. C. (2018). Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. arXiv Preprint arXiv:1810.11953. Retrieved from http://arxiv.org/abs/1810.11953

Rastle, K. (2018). The place of morphology in learning to read in English. Cortex, 1–10. https://doi.org/10.1016/j.cortex.2018.02.008

Rich, A. S., Popp, P. J. O., Halpern, D. J., & Gureckis, T. M. (2018). Modeling Second-Language Learning from a Psychological Perspective. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications.

RStudio Team. (2015). RStudio: Integrated Development for R. Boston, MA: RStudio, Inc. Retrieved from http://www.rstudio.com/


Settles, B., Brust, C., Gustafson, E., Hagiwara, M., & Madnani, N. (2018). Second Language Acquisition Modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 56–65).

Spence, E. (2014). The Mobile Browser Is Dead, Long Live The App. Retrieved April 2, 2019, from https://www.forbes.com/sites/ewanspence/2014/04/02/the-mobile-browser-is-dead-long-live-the-app/#3f7b8e3b614d

Storkey, A. J. (2013). When Training and Test Sets Are Different: Characterizing Learning Transfer. In Dataset Shift in Machine Learning (pp. 2–28). https://doi.org/10.7551/mitpress/9780262170055.003.0001

Tomoschuk, B., & Lovelett, J. T. (2018). A Memory-Sensitive Classification Model of Errors in Early Second Language Learning. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 231–239).

Tran, V. T., & Aussem, A. (2015). A practical approach to reduce the learning bias under covariate shift. In ECML PKDD 2015 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (pp. 71–86). https://doi.org/10.1007/978-3-319-23525-7_5

Van der Walt, S., Colbert, C. S., & Varoquaux, G. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13, 22–30. https://doi.org/10.1109/MCSE.2011.37

Van Hulse, J., Khoshgoftaar, T., & Napolitano, A. (2007). Experimental Perspectives on Learning from Imbalanced Data. In Proceedings of the 24th International Conference on Machine Learning (pp. 935–942). https://doi.org/10.1145/1273496.1273614

Waskom, M., Botvinnik, O., Hobson, P., Cole, J. B., Halchenko, Y., Hoyer, S., … Allan, D. (2014). seaborn: v0.5.0 (November 2014). https://doi.org/10.5281/ZENODO.12710

Witten, I. H. (2016). Linear Models. In Data Mining: Practical machine learning tools and techniques (pp. 128–134).

Wolstenholme, L. C. (1999). Reliability Modelling: a statistical approach.

Xu, S., & Chen, J. (2018). CLUF: A Neural Model for Second Language Acquisition Modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 374–380).

Yuan, Z. (2018). Neural sequence modelling for learner error prediction. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 381–388).

Yujian, L., & Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095. https://doi.org/10.1109/TPAMI.2007.1078


Appendix A: Features in the original data set

User: An 8-digit, anonymized, unique identifier for each student.
Countries: Two-letter country code from which a user has done the exercises.
Days: The number of days since the student started learning English on Duolingo.
Client: The student's device platform; this can either be android, ios or web.
Session: The session type; this can either be lesson, practice, or test.
Format: The exercise format; this can either be reverse_translate, reverse_tap, or listen.
Time: The amount of time (measured in seconds) it took the student to construct and submit their answer.
Unique 12-digit ID: The first 8 digits form an ID representing the session, the next 2 digits denote the index of this exercise within the session, and the last 2 digits denote the index of the word within this exercise.
Token (word): The word learned in the exercise by a user.
Part of Speech: The part of speech, represented in the universal dependencies (UD) format.
Morphological features: Encoded in the UD format; contains a description of the relation a word has to other words in the same language, such as stems, root words, prefixes and suffixes.
Dependency edge label: The relation of a word compared to the root of the sentence.
Dependency edge head: The index of the word in the sentence.
Label: The label can either be 0 or 1. A 0 corresponds to a correct exercise and a 1 corresponds to an incorrect exercise.


Appendix B: Parameter settings of the models

Table 12. Hyperparameters linguistic features LR

Hyperparameter   Setting   Explored parameter space
Penalty          L2        {None, L1, L2}
Alpha            10        {0.01, 0.1, 1, 10, 20, 40}
Max_iter         5         {5, 10, 20}
Learning_rate    0.01      {0.01, 0.1, 0.5}

Table 13. Hyperparameters linguistic features RF

Hyperparameter      Setting   Explored parameter space
Max_features        7         {5, 7, 10}
Min_samples_split   10        {5, 10, 20}
N_estimators        30        {10, 20, 30, 40}
Max_depth           10        {10, 20, 30}

Table 14. Hyperparameters learning process features LR

Hyperparameter   Setting   Explored parameter space
Penalty          None      {None, L1, L2}
Alpha            0.1       {0.01, 0.1, 1, 10, 20, 40}
Max_iter         5         {5, 10, 20}
Learning_rate    0.01      {0.01, 0.1, 0.5}

Table 15. Hyperparameters learning process features RF

Hyperparameter      Setting   Explored parameter space
Max_features        7         {5, 7, 10}
Min_samples_split   10        {5, 10, 20}
N_estimators        30        {10, 20, 30, 40}
Max_depth           10        {10, 20, 30}

Table 16. Hyperparameters all features LR

Hyperparameter   Setting   Explored parameter space
Penalty          L1        {None, L1, L2}
Alpha            0.1       {0.01, 0.1, 1, 10, 20, 40}
Max_iter         5         {5, 10, 20}
Learning_rate    0.01      {0.01, 0.1, 0.5}

Table 17. Hyperparameters all features RF

Hyperparameter      Setting   Explored parameter space
Max_features        10        {5, 7, 10}
Min_samples_split   10        {5, 10, 20}
N_estimators        40        {10, 20, 30, 40}
Max_depth           10        {10, 20, 30}

Appendix C: Plots of the logistic regression models

Figure 18. Plots of the linguistic features model LR

Figure 19. Plots of the learning process features model LR

Figure 20. Plots of the total model LR


Appendix D: Plots of the random forest models

Figure 21. Plots of the linguistic features model RF

Figure 22. Plots of the learning process model RF

Figure 23. Plots of the total model RF


Appendix E: Visual representation of the features and the error

Figure 24. concreteness grouped by error Figure 25. edge head grouped by error

Figure 26. Levenshtein grouped by error


Appendix F: Distribution comparisons train and test set

Table 18. Chi-squared test for the categorical features

Feature          Chi-squared test statistic   p-value
Part of Speech   1,909,779                    0.00
Edge label       13,336,966                   0.00
Word             1,194,548                    0.00
Morph            1,914,095                    0.00

Table 19. Kolmogorov-Smirnov test nchar

Feature   Kolmogorov-Smirnov test statistic   p-value
Nchar     0.2437                              0.00
