
Using Machine Learning Techniques to Identify the Native Language of an English User

Jake Browning

4th Year Project Report
Computer Science and Mathematics
School of Informatics
University of Edinburgh

April 2017

Acknowledgements

I thank my supervisor, Dr. Shay Cohen, for his invaluable insights and advice during the undertaking of this project.

Contents

0.1 Abstract

1 Introduction
  1.1 Previous Work
    1.1.1 Areas Preceding Native Language Identification
    1.1.2 Native Language Identification
  1.2 Motivation
  1.3 Outline and Goals
    1.3.1 Outline
    1.3.2 Overview of Goals
  1.4 A Note on Topic Bias

2 Description of the Data
  2.1 Corpus of Non-Native English
  2.2 Corpus of Native English
  2.3 The Lang-8 Corpus vs other English Learner Corpora
  2.4 Training and Testing Sets
    2.4.1 Native vs Non-Native
    2.4.2 Multi-Class Non-Native
  2.5 General Information on Each Class

3 Differing Notions of Grammatical Structure, and their Uses in this Task
  3.1 Constituency
  3.2 Dependency Grammar
  3.3 POS Tagset and Tagging Software
    3.3.1 Tagset

4 Classification Methods
  4.1 Gaussian Naive Bayes
  4.2 Logistic Regression
  4.3 Decision Trees and Random Forests
    4.3.1 Entropy
    4.3.2 Information Gain
    4.3.3 Using Information Gain to Rank Features
    4.3.4 Random Forests
  4.4 Normalisation

5 Using Language Models as Classification Features
  5.1 Description of a Language Model, and its Application to this Task
  5.2 Description of Language Models Tested
    5.2.1 The SLOR language model
    5.2.2 Potential issues with the SLOR model
    5.2.3 The POS language model
  5.3 Using Language Models to Classify Native vs Non-Native
    5.3.1 Results
    5.3.2 Summary of Results
  5.4 Using Language Models to Classify among Non-Natives

6 A Heuristic Approach to Classifying Native vs Non-Native
  6.1 Adverb Placement
  6.2 Left and Right Branching
    6.2.1 What is "branching"?
    6.2.2 Capturing Differences in Branching
    6.2.3 Passive Voice
    6.2.4 Sentence Length
    6.2.5 Results
  6.3 Combining the "Language Model" and "Heuristic" Features

7 Feature "Families" Used in the Multi-Class Problem

8 Using Techniques Influenced by Previous L1 Identification Research to Improve the Multi-Class Classification Problem
  8.1 Capturing and Categorising Spelling Errors
    8.1.1 Levenshtein Distance
    8.1.2 Using Levenshtein Distance to Capture Error Types
    8.1.3 Results and Feature Importance
  8.2 Using an Error Correction Library to Detect Errors
    8.2.1 Results
  8.3 Using Stopwords as Features
    8.3.1 Stopwords
    8.3.2 Method
    8.3.3 Results

9 Investigating New Feature Types in the Multi-Class Problem
  9.1 Capturing the Usage of Prepositions
    9.1.1 The "Caseframe" Method
    9.1.2 Justifications for Not Explicitly Counting Errors
    9.1.3 Using an Ensemble Method to Capture Differences in Preposition Usage
    9.1.4 Results
  9.2 Capturing the Usage of in the Global Context
    9.2.1 Simplifications to the Model Proposed by Tajiri et al
    9.2.2 Method
    9.2.3 Results
  9.3 Placement of Prepositional Phrases
    9.3.1 Method
    9.3.2 Results
  9.4 Correlative Conjunctions
    9.4.1 Linking Adjectives and Phrases with Conjunctions
    9.4.2 Correlative Conjunctions Spanning Multiple
    9.4.3 Frequencies

10 Further Methods to Increase the Performance of the Multi-Class Classifier, and Analysis of the Accuracy
  10.1 Incorporating Features Used in the Native vs Non-Native Problem
  10.2 Arguments against Dimensionality Reduction
  10.3 Confusion Matrix

11 Feature Analysis of the Multi-Class Classifier
  11.1 Methods used during Feature Analysis
    11.1.1 Information Gain
    11.1.2 Continuous Mutual Information
  11.2 Feature Analysis
    11.2.1 Using Information Gain to Rank all Features
    11.2.2 Using Information Gain to Rank Feature Families
  11.3 Feature Correlation

12 Unused and Unsuccessful Methods
  12.1 Using Simple Methods to Count Preposition Errors
  12.2 Overuse of "to me"
  12.3 Incorrect Tense used with Modal
  12.4 "One of" Followed by Singular Noun

13 Conclusion
  13.0.1 Insights Gained
  13.1 Motivation for Future Research

A Tagsets
  A.1 Table of Phrase-Level POS Tags
  A.2 Table of Word-Level POS Tags

Bibliography

Abstract

0.1 Abstract

In this project, we perform native language identification on English text written by native speakers of English, Korean, Chinese and Japanese using machine learning techniques. Statistical language models, error type analysis, and investigations into the differing styles and grammatical structure of text written by authors in each class are utilised as features in the classification task. We build a binary classifier which discriminates natives from non-natives with accuracies ranging from 97.17% to 99.19%, and a multi-class classifier which discriminates among Korean, Chinese and Japanese speakers with 68.8% accuracy. We also emphasise and report on the differences in structure and style found in texts across the four classes.

Chapter 1

Introduction

L1 interference is a highly researched area of linguistics concerned with studying the various ways in which a non-native user of English is influenced by their native language in their style of speech, as well as the frequency and types of mistakes that they make. Within the past sixteen years, the area of "native language detection" has become well established in computational linguistics, and involves using machine learning techniques to determine the native language of an English user. In this task, machine learning techniques were used to classify the native languages of native English, Korean, Chinese and Japanese speakers according to their use of English. The challenge of detecting whether a writer is native was found to be very different from that of determining the native language among non-native writers, with each requiring a different set of features. Therefore the problem proposed in this report was tackled as two sub-problems: detecting native vs non-native as a binary classification problem, and discriminating among native speakers of Chinese, Japanese and Korean as a multi-class classification problem.

In the native vs non-native problem, using language model confidence scores as a single feature was found to be very effective with accuracies of up to 97.98%. Building a separate classifier formed of more insightfully picked features which target specific differences in style between native and non-native English writing was also found to be effective, with accuracies of up to 94.74%. The most effective classifier in the native vs non-native problem was formed by combining the features in both classifiers, yielding accuracies of up to 99.19%.

Using language models as features in the multi-class non-native problem showed much lower performance at 41.96%. The low-performance multi-class classifier was built upon through utilising other feature types used in previous research, as well as through introducing several new previously unexplored feature types, and through this the multi-class accuracy was ultimately improved to 68.8%.

Aside from the practical achievement of providing classifiers which solve both sub-problems with acceptable accuracy, the research involved in feature engineering during this project uncovered some prominent differences in the style of writing and grammatical structure between English texts written by speakers in the English, Korean, Chinese and Japanese classes. While previous research in native language identification has largely focused on accuracy as an end to which feature engineering is a means, in this project the insights gained regarding the differences in style and structure across writers in each class will be emphasised.

1.1 Previous Work

1.1.1 Areas Preceding Native Language Identification

The area of L1 identification follows, and was originally largely motivated by, the areas of authorship attribution and accent recognition. Classifiers built in accent recognition research, such as Fung and Liu (1999), used acoustic features from audio recordings to capture the frequencies of phone classes such as nasals and affricates, in order to classify the accent type (and often, indirectly, the native language) of an English speaker. Indeed, research in L1 identification was first proposed by Tomokiyo and Jones (2001) as an alternative to accent recognition for cases where a transcript is available but no audio recording.

From the first published paper in native language identification until today, research in this area has used techniques heavily influenced by general authorship attribution problems, which include gender identification, liar identification and publisher identification among others. For example Argamon-Engelson et al. (1998) used frequencies of types of function words to identify the publishers of news articles. This technique carried on into L1 identification, where the majority of research also includes function word and stopword types as features.

1.1.2 Native Language Identification

As previously discussed, the first published research in this area is by Tomokiyo and Jones (2001), who used Naive Bayes to classify the native languages of 45 speakers of English, Japanese and Korean with accuracies of up to 100%, using frequencies of stopwords as features. While the methods employed in this research were effective, it would be difficult to draw conclusions about native speakers of these languages in general from such a limited group of writers. Following the release of the much larger International Corpus of Learner English (ICLE), Koppel et al. (2005) demonstrated that analysing the types of errors made in English text is an effective way of classifying a speaker's native language. Koppel et al. used several categories of spelling errors as features, as well as various syntactic error types and bigram language model probabilities, to classify speakers of Czech, French, Bulgarian, Russian and Spanish. Using these error types as features allowed the classifier to discriminate among speakers of these classes with accuracies above 80%, which clearly demonstrated that using error types as features to classify a speaker's native language is highly effective.

This research was built upon in some respects by Kochmar (2011), who selected features by identifying and targeting errors made by speakers of specific Indo-European languages when using English. Kochmar analysed not only error types and rates, but also their distribution throughout learner text. Multiple methods of detecting error types were also tested, including the use of a learner corpus pre-tagged with error occurrences as well as a trigram error detection model trained from native English data. This method was highly effective at classifying speakers of very closely related languages, with Danish and Swedish speakers being classified with 100% accuracy using character n-grams to detect

spelling errors. This research demonstrated that speakers of languages whose L1-influenced errors in English are likely to be very similar can still be correctly discriminated.

While continuing the paradigm of error analysis, Tetreault et al. (2012) used all of the features outlined by Koppel et al. (2005), and also showed the effectiveness of using several types of language model perplexity scores to classify native speakers of eleven languages using the TOEFL11 corpus. Tetreault et al. also attempted to capture differing frequencies of errors, as well as differing writing styles and discourse, by using an automatic essay scoring system which rates learner text as a feature. This technique managed to classify Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu and Turkish speakers with accuracies of up to 86%.

1.2 Motivation

The application of research pursued in this area is largely related to language education. In the context of error analysis, the systems built during the construction of the classifier, as well as the classifier itself, may lead to error correction systems which provide more targeted and specific error feedback to English users. For example, most spell-checking systems simply report a spelling as wrong; it would be more fruitful to identify the type and the root of the error as an influence from the writer's native language, in order to provide helpful feedback. Furthermore, exploring the L1 identification problem acts as a proxy for understanding the fundamental differences in the types of mistakes that non-native English speakers tend to make, which will lead to more guided education in general.

In the context of both error analysis and stylistic differences, uncovering the differences in grammatical structure and style among English users of different linguistic backgrounds is an inherently interesting linguistic problem.

Finally, it has been suggested that L1 identification could help to identify anonymous authors of text in situations such as criminal investigation (Perkins, 2014).

1.3 Outline and Goals

1.3.1 Outline

This project used a subset of the techniques which were successful in previous research, and also explored new feature types, outlined below, which increased the accuracy of the classifier. However, these proposed feature types proved not to be as effective as the full range of techniques used in previous research.

Native vs Non-Native: As explored in Kochmar (2011), language model confidence scores were used as features in a simple classifier with a single feature. Afterwards, a classifier using more complex and

targeted features was built to discriminate aspects of native and non-native speech. Feature types such as right/left branch lengths in grammar trees and adverb dispersion, which have not been explored in previous research, were tested, and this classifier was compared against the basic language model classifier. Sentence length, which was also explored in Kochmar (2011), was found to be the simplest but most effective feature in this classifier.

Multi-Class Non-Native (Chinese/Korean/Japanese): Unlike the binary native vs non-native problem, using language model confidence scores as features was not effective. Heuristically chosen features were appended to the classifier in order to improve the performance, including techniques from previous L1 identification research such as the use of stopwords, counting punctuation and grammatical errors, and spelling error types. New techniques drawn from unrelated research were also used to capture verb and preposition usage, which improved accuracy, and various feature types related to stylistic usage were also explored. Ultimately the feature types used in the native vs non-native problem were also found to improve the performance of the multi-class classifier.

1.3.2 Overview of Goals

Upon completion of the project, the following goals should be satisfied.

• Build a native vs non-native binary classifier which discriminates the two classes with reasonable accuracy.

• Build a multi-class Korean/Chinese/Japanese classifier which discriminates the three classes with reasonable accuracy.

• Give evidence of any differences in the style or structure of writing between native speakers of English, Korean, Chinese and Japanese which emerged from the feature selection process.

• Perform an analysis of the features used in the multi-class problem, reporting on which features were most important to the classification task and which were redundant.

1.4 A Note on Topic Bias

Topic bias can be described as the influence of the topic or topics discussed in a piece of text on its classification. It is favourable to avoid topic bias for multiple reasons. Firstly, it makes the classifier more general and potentially more effective at classifying a wider range of unseen data; the training data contains only a limited range of topics, and so if the classifier learns to discriminate between classes based on topics in the training data then it may perform poorly on future data which contains new topics. Secondly, not accounting for topic bias allows the classifier to be "tricked" easily. For example, if the classifier learns to classify Japanese speakers on the basis that they are more likely to write the words "Japan" and "sushi", it would be easy to manipulate the classifier into classifying text as Japanese by inserting random instances of the words "Japan" and "sushi" into the text.

Lau et al. (2016) approach the task of removing topic bias as equivalent to removing content words, and this approach will be used here. This assumption, alongside two others, is listed below:

• Removing content words is considered equivalent to removing topic bias.

• The influence of the topic on grammatical structure is negligible.

• The influence of the topic on the frequency of extremely common content verbs such as "be" and "go" is negligible.

Chapter 2

Description of the Data

2.1 Corpus of Non-Native English

Tokyo Metropolitan University has provided a corpus of non-native English text which was scraped from lang-8.com for use in Mizumoto et al. (2011), whose research was related to automatic correction of Japanese text. Lang-8.com is a website with over 750,000 users where language learners submit a mixture of diary and essay style text to be proof-read. Each entry, which is 12 sentences long on average, was represented as a data point in the classification task. 20,000 entries each from Japanese, Korean and Chinese speakers were used, which amounts to 60,000 entries in total.

2.2 Corpus of Native English

The Open American National Corpus (Ide and Suderman, 2004) provides a collection of entries in Slate, which is an online current affairs magazine covering topics related to art, cooking, politics, culture, sport and news. Slate currently hires 46 in-house editors and columnists, and also accepts articles written by the public. The corpus used in this task contains 4,351 entries in the format of articles, which were merged and broken down into 18,687 entries each averaging 10 sentences long, in order to be comparable to the non-native corpus.

2.3 The Lang-8 Corpus vs other English Learner Corpora

The Lang-8 corpus used in this task is composed of diary-style entries written recreationally by English learners in their own time, whereas the corpora used in previous L1 identification research are mostly composed of essays written by students under exam conditions. Brooke and Hirst (2011) discovered that accuracy scores on an L1 identification classifier trained using the Lang-8 corpus were 15% lower than when trained on the ICLE corpus, and conjectured the above explanation as a possible reason; the allowance of Lang-8 writers to stay within their comfort zones and write in their own time resulted in fewer errors than in the ICLE corpus. Brooke and Hirst also found that classifiers trained on both corpora did not generalise well when tested on the other corpus, and concluded that the exam-style essay text in the ICLE corpus does not generalise well to non-native speech in general.

2.4 Training and Testing Sets

2.4.1 Native vs Non-Native

The native vs non-native problem was framed as three separate binary classification problems: native vs Korean, native vs Chinese and native vs Japanese. For each non-native class, the 20,000 entries in that class were reduced to 18,687 in order to be balanced with the native class. Accuracy results shown for each binary classifier in the native vs non-native task are 5-fold cross-validation results across all of the 18,687 × 2 = 37,374 entries.

Whenever other statistics related to the dataset in the binary problem are produced, they are also across all of the entries.

2.4.2 Multi-Class Non-Native

The dataset in the multi-class problem contains 20,000 × 3 = 60,000 entries. However, since in this task various secondary methods were required to be trained on a portion of the dataset, a balanced 40% portion of the dataset was held out for training ensemble classifiers and language models in order to prevent overfitting. A further 20% of the corpus was withheld as a validation set for the aforementioned methods and for general tuning, and the remaining 40% (18,000 entries) was used as the cross-validation training/testing set: all accuracy results shown in the multi-class problem are 5-fold cross-validation results performed on this 40% portion.

Whenever other statistics related to features in the multi-class problem are shown, they are across the 60% held-out set. This is because unlike in the native vs non-native problem, various statistical results were used during the feature selection process in order to pick effective features. Using the cross-validation set to produce these statistics would put the classifier at risk of overfitting.

2.5 General Information on Each Class

Class      # Entries   # Words     # Sentences   Avg # words per entry   Avg # sentences per entry
Native     18687       4,389,008   186,889       235                     10
Korean     20000       2,469,118   292,789       123                     14
Chinese    20000       2,562,562   251,445       128                     13
Japanese   20000       1,142,742   244,351       57                      12

Based on this brief summary of the data, the native class tends to contain a smaller number of comparatively longer sentences, whereas the other classes use more sentences but far fewer words. This is particularly pronounced in the Japanese class, where the average number of sentences per entry is around the same as in the Chinese class but the average number of words per entry is less than half that of the Chinese class.

Chapter 3

Differing Notions of Grammatical Structure, and their Uses in this Task

In linguistics and natural language processing, grammatical structure is represented as a tree either in a "constituency" or "dependency" format. Both representations have different strengths, and both were required in this project. Their definitions and differing usages are given in this chapter.

3.1 Constituency Grammar

Constituency grammar (also known as phrase-structure grammar) was introduced in Chomsky (1957) as a linguistic tool for viewing grammatical structure. Within a constituency grammar, the entire sentence acts as a root node which is broken down by its descendants into noun phrases (NPs), verb phrases (VPs), adjective phrases (ADJPs) and prepositional phrases (PPs). These phrase types differ according to the class of their head word (noun, verb, adjective or preposition respectively), and are broken down further into phrase sub-trees or word-level leaf nodes such as verbs, nouns, prepositions and determiners. Within a constituency grammar, there is no limit on the nesting of phrase-level subtrees, which can result in very large trees which represent comparatively short sentences. Within this project, parse trees were used in constituency form when it was desirable to examine grammatical structure as formed out of types of phrases.

When a constituency parse tree structure was needed, the BLLIP parser (Charniak and Johnson, 2005) (also known as the Charniak-Johnson parser) was used to parse text.

3.2 Dependency Grammar

The idea of modelling grammatical structure as a tree of dependency chains was formalised by Tesnière in 1959 (Percival, 1990), and has been widely used since. In a dependency grammar, the head verb of the sentence acts as a root node. All other words in the sentence are thought of as "dependent" on the head

verb. A word at the nth level of the tree is directly dependent on some word at level n − 1, and indirectly dependent on a word at each level above n − 1, up until the root. In contrast with constituency grammar, where a word corresponds to a leaf element but can also give rise to a single-word phrase subtree, each word in a dependency grammar gives rise to exactly one node in the tree.

[Parse trees for the sentence "Pistachio is one of the few flavours that appeal to me", shown as figures in the original document.]

Figure 3.1: Example of a constituency grammar.

Figure 3.2: Example of a dependency grammar.

Dependency grammars were used in this project where it was useful to capture dependencies between words in a sentence, for example determining the object of a preposition or the subject of a passive phrase. Due to the fact that dependency grammars have a one-to-one correspondence between words and nodes, they are generally more minimal and therefore quicker to navigate (Osborne et al., 2011). Therefore dependency grammars were also used where possible to speed up computation time. The Python library SpaCy (Honnibal and Johnson, 2015) was used to parse sentences into dependency trees throughout the project.
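As an illustration of how such dependency structures can be accessed programmatically, the minimal sketch below parses the sentence from Figure 3.2 with SpaCy and prints each word with its head. The model name en_core_web_sm is an assumption (any English SpaCy model with a parser would do), and the exact label set depends on the model version.

```python
import spacy

# Assumes an English model is installed, e.g. `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Pistachio is one of the few flavours that appeal to me")
for token in doc:
    # Each word corresponds to exactly one node; token.head is the word it depends on,
    # and the root of the sentence is the token whose head is itself.
    print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")
```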

3.3 POS Tagset and Tagging Software

3.3.1 Tagset

The Penn Treebank POS tagset was used throughout this project for word-level POS tags, where the table in appendix A.2 from Santorini (1991) lists them in full. Phrase level tags, which appear alongside word-level tags in constituency trees, are outlined in appendix A.1. When a flat word-by-word parse was required, the POS tagger provided by NLTK (Loper and Bird, 2002) was used.

Chapter 4

Classification Methods

Three classification techniques were used at different stages of the project, due to their differing performances and speeds on different types of data and features. Overall, random forests of decision trees were found to give the best performance on the majority of the classification tasks; however, naive Bayes was found to be advantageous in chapter 5. Logistic regression was chosen during the feature engineering process in order to train ensemble classifiers, due to its speed and effectiveness over high-dimensional data. Naive Bayes and logistic regression will be outlined briefly in this chapter, whereas decision tree learning methods will be covered in more detail as this was the main method of classification; unless stated otherwise, all accuracy scores in the multi-class problem were computed using random forests of decision trees.

The Python library SKLearn (Buitinck et al., 2013) was used in the implementation of all of the classification techniques.

4.1 Gaussian Naive Bayes

Naive Bayes[1] is a probabilistic classification model which is based on Bayes' rule paired with the assumption that all features are statistically independent. Given classes $c_1, \cdots, c_m$ and a feature vector $\mathbf{x} = (x_1, \cdots, x_n)^T$, Bayes' rule applies as

$$P(c_i \mid x_1, \cdots, x_n) = \frac{P(c_i)\, P(x_1, \cdots, x_n \mid c_i)}{P(x_1, \cdots, x_n)}$$

Therefore the class of feature vector $\mathbf{x}$ can be predicted through calculating the probability of each class $c_i$ given $\mathbf{x}$, and choosing the maximum probability. Now several simplifications can be made, with the first and most important being that the independence assumption between features allows us to rewrite the following for each $x_j$ and class $c_i$:

$$P(x_j \mid c_i, x_1, \cdots, x_{j-1}, x_{j+1}, \cdots, x_n) = P(x_j \mid c_i)$$

Applying this assumption to all $x_j$, we have that for class $c_i$,

$$P(x_1, \cdots, x_n \mid c_i) = \prod_{j=1}^{n} P(x_j \mid c_i)$$

[1] Information in this section on Naive Bayes is partly taken from http://scikit-learn.org/stable/modules/naive_bayes.html

Two further simplifications can be made: firstly, the denominator $P(x_1, \cdots, x_n)$ is not variable across different classes and so does not change the class with maximum probability. Furthermore, in this specific task all classes are balanced and so $P(c_i)$ is constant across all classes, since this term is simply the frequency of datapoints of each class in the dataset. Therefore we have

$$P(c_i \mid x_1, \cdots, x_n) \propto \prod_{j=1}^{n} P(x_j \mid c_i)$$

Therefore taking this product for every class and choosing the maximum class gives us a naive Bayes class prediction.

In Gaussian Naive Bayes, a Gaussian distribution is fitted to each feature for each class, using the per-class mean and variance vectors of the training dataset, in order to estimate the probabilities $P(x_j \mid c_i)$. Gaussian Naive Bayes was chosen over discrete variants of Naive Bayes, since in this project it was applied to a classification task whose features are continuous.
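A minimal sketch of how such a classifier can be applied with SKLearn is shown below; the feature matrix here is a random placeholder standing in for continuous features such as language model scores, not the project's actual data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(200, 2)               # placeholder: two continuous features per entry
y = rng.randint(0, 2, 200)          # placeholder: 0 = native, 1 = non-native

clf = GaussianNB()                  # fits a per-class, per-feature Gaussian
print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```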

4.2 Logistic Regression

Logistic regression (Agresti, 2003, p. 165) is a binary classification method which assigns one class as 1 and the second as 0, and discriminates them using a linear decision boundary. The decision boundary is trained by constructing an optimum weight vector $\hat{\mathbf{w}}$, which is learned by maximising the log likelihood of the training data $D = ((\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_n, y_n))$ (represented as datapoint/class pairs) with respect to possible weight vectors $\mathbf{w}$:

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \log P(D \mid \mathbf{w})$$

where

$$\log P(D \mid \mathbf{w}) = \sum_{i=1}^{n} y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1 - y_i) \log\left(1 - \sigma(\mathbf{w}^T \mathbf{x}_i)\right)$$

and $\sigma$ is the sigmoid function bounded between 0 and 1. After the weights have been trained, new unseen datapoints are classified according to the probability that they belong to class 1, using the function $P(y_i = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$. Instances with probabilities above some threshold (usually 0.5) are assigned to class 1, which corresponds to the instance lying on the side of the hyperplane assigned by the classifier as belonging to "positive" instances. Probabilities of 0.5 lie directly on the decision boundary, and instances below 0.5 are classed as belonging to class 0. Due to the fact that the sigmoid function has its highest gradient as it approaches 0.5 from either side, small distances close to the decision boundary are weighted more heavily than larger distances far from the boundary.

Logistic regression is inherently a binary classifier; however, as in this case, it can be extended to a multi-class predictor by using a "one vs all" technique: that is, building a binary classifier for each class, assigning the given class as 1 while all other classes are attributed 0. For each datapoint, the probability that it belongs to the positive class is computed across all of the classifiers, and the class corresponding to the classifier which gives the highest probability is assigned to the datapoint. Logistic regression is generally quick and effective on high-dimensional data, and so it was used in this task in ensemble classifiers where data with over 300 features was required.
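The sketch below shows this one-vs-all setup with SKLearn's LogisticRegression on placeholder data; `multi_class="ovr"` is the relevant setting here (in recent SKLearn versions the same behaviour can be obtained by wrapping the estimator in OneVsRestClassifier).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(300, 350)                                  # placeholder high-dimensional features
y = rng.choice(["Korean", "Chinese", "Japanese"], 300)   # placeholder class labels

# One binary (class vs rest) logistic regression is trained per class; a datapoint
# is assigned the class whose classifier gives the highest positive-class probability.
clf = LogisticRegression(multi_class="ovr", max_iter=1000)
clf.fit(X, y)
print(clf.predict(X[:3]))
print(clf.predict_proba(X[:3]).round(2))
```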

4.3 Decision Trees and Random Forests

The decision tree (Quinlan, 1986) is a classification method which uses characteristics of the features in the dataset to build a tree of decision rules, where the leaves are instances of classes. Once a tree has been built from training data, an unseen datapoint is classified by moving down the tree, at each level choosing to descend the branch corresponding to the characteristic relevant to the datapoint. Eventually a leaf node is reached, which determines the predicted class of the datapoint.

4.3.1 Entropy

A large part of training a decision tree is deciding which feature to "split" on at each level of the tree: it is important to choose features which discriminate the classes effectively when the data is split according to characteristics related to that feature. The effectiveness of the split can be modelled using the notion of "impurity", which is how evenly represented each class is in each subset after a split. In order to discriminate classes well, we want splits to produce subsets of the training data which are as pure as possible; that is, subsets of the data after a split should each contain data mostly from a single class. While there are several criteria which can be used, in this project the impurity of a set is measured using entropy.

The function H for entropy is given below, where S is the set of data to be split on, and p(ci) is the proportion of datapoints in the subset being considered which belong to class ci.

$$H(S) = -\sum_{i=1}^{n} p(c_i) \log_2\left(p(c_i)\right)$$

As we would expect intuitively, $H$ reaches its maximum value (equal to 1 in the two-class case, and $\log_2 n$ in general) when the classes of the data being considered are perfectly balanced, i.e. $p(c_1) = \cdots = p(c_n)$, and reaches a minimum value of 0 when all of the points being considered belong to a single class.

4.3.2 Information Gain

The goal during the training process of decision trees in this project is to choose splits which maximise information gain, which can be defined as the drop in entropy which occurs from splitting the data by its characteristics related to a certain feature. This is calculated by computing entropy on the set being considered before the split, and subtracting the weighted average entropy of each subset after the split has partitioned the data into these subsets. Formally, for the set being split S and candidate feature to split on A, information gain is defined as

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

where $A$ refers to the feature, and $\mathrm{values}(A)$ is the set of values to split the feature on; if the feature is continuous, this is often a set of ranges. $S$ refers to the set of data to split, and $S_v$ the subset of data which is grouped according to a common value $v$.
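A small sketch of the two quantities defined above, computed directly from class labels, is given below; the split at the end is a toy example purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S): impurity of a set of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """IG(S, A): parent entropy minus the size-weighted entropy of the subsets
    produced by splitting on some feature A."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Toy example: a split that mostly separates the two classes has high gain.
parent = ["ko"] * 4 + ["ja"] * 4
split = [["ko", "ko", "ko", "ja"], ["ja", "ja", "ja", "ko"]]
print(round(information_gain(parent, split), 3))
```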

19 4.3.3 Using Information Gain to Rank Features

Information gain is a natural choice when ranking features using random forests of decision trees, as it is the goal of the classifier to maximise information gain during training time for optimal performance. It is an effective measure of how well a specific feature can discriminate classes (Strobl et al., 2007). SKLearn provides information gain figures as a class variable in its random forests classifier, which is quick to access. Throughout this project, the information gain of individual features will be used to rank features by effectiveness.

There are limitations to using information gain as a measure of effectiveness, which should be kept in mind. Namely, if multiple features have high information gain individually but are highly correlated, the first feature among these to be split on will be given a high information gain and the others will become uninformative after this split.

4.3.4 Random Forests

Using the random forest method is an effective way to avoid over-fitting and improve accuracy. During the training process, a specified number of decision trees are built from randomly sampled data in the training set, and the result of each decision tree is considered when predicting classes in the testing set. In this project, predictions are categorical and so each datapoint in the testing set will be classified based on the most frequently predicted class across the decision trees in the forest.

Choosing the number of trees in the forest

The finished random forest classifier was run on the validation set with forest sizes ranging from 1 to 120 trees.

After experimenting with tree counts ranging from 1 to 120, it was found that average accuracy did not improve beyond 80 trees. Therefore this is the forest size used in the remainder of the project.
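A minimal sketch of the forest configuration described above, with placeholder data, is shown below; `criterion="entropy"` selects information gain as the split criterion, and `feature_importances_` provides the per-feature importance figures referred to in section 4.3.3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(600, 20)                                   # placeholder feature matrix
y = rng.choice(["Korean", "Chinese", "Japanese"], 600)   # placeholder labels

# 80 trees, each trained on a bootstrap sample, with entropy as the split criterion.
forest = RandomForestClassifier(n_estimators=80, criterion="entropy", random_state=0)
print("5-fold accuracy:", cross_val_score(forest, X, y, cv=5).mean())

forest.fit(X, y)
print("top feature importances:", np.sort(forest.feature_importances_)[::-1][:5])
```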

4.4 Feature Normalisation

All features across all of the classifiers were normalised using SKLearn's StandardScaler class[2], where the mean was scaled to 0 and the variance to 1 for each feature.

[2] Documentation for StandardScaler is found at http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
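For reference, the scaling step looks like the following minimal sketch; fitting the scaler on training data and applying the same transform to held-out data is the usual precaution, though whether this was done per fold is not stated in the report.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0, 0.2], [12.0, 0.4], [14.0, 0.9]])  # placeholder features

scaler = StandardScaler().fit(X_train)       # learns per-feature mean and standard deviation
X_scaled = scaler.transform(X_train)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1 per feature
```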

Chapter 5

Using Language Models as Classification Features

5.1 Description of a Language Model, and its Application to this Task

A language model is a system which builds a probability distribution over n-grams based on their frequency in the training data, and uses this distribution to assign a probability to an unseen sentence as the product of the probabilities of the n-grams in that sentence.

A language model builds a distribution on the assumption that words within an n-gram are dependent, but that distinct n-grams are independent. This makes language models useful for capturing very local features of written language. Below is an example of the trigram probability of a sentence.

$$P(\text{Call me Ishmael}) = P(\text{Call} \mid \langle s \rangle, \langle s \rangle)\, P(\text{me} \mid \text{Call}, \langle s \rangle)\, P(\text{Ishmael} \mid \text{Call}, \text{me})\, P(\langle /s \rangle \mid \text{me}, \text{Ishmael}) \quad (5.1)$$

In equation (5.1), the token $\langle s \rangle$ refers to the beginning of the sentence and $\langle /s \rangle$ to the end. The hypothesis in this task is that non-native speakers will score lower than native speakers on a language model trained on native text, and that this score will therefore be an effective feature in the binary native vs non-native classification task. Three separate language models (one for each class) were also trained using the training portion of the learner corpus, with the hypothesis being that speakers of one non-native class will score higher on a language model trained using data from that class than on the language models trained on the other classes.

5.2 Description of Language Models Tested

Several language models were trained using the portion of the Open American National Corpus (Ide and Suderman, 2004) not used for building the native speaker class entries, consisting of 8 million words. KenLM (Heafield, 2011) was used to build the language models. For the purposes of this task, it is necessary that topic bias is removed when calculating the log probability of each entry. Otherwise, a class

22 of speaker who tends to often mention topics which also frequently arise in the language model’s training data will be scored higher, which is an unwanted consequence. For example, if the language model’s training corpus contains many articles about Kimchi, it may score Korean speakers higher on the merit that they are more likely to talk about Kimchi than other speakers.

5.2.1 The SLOR language model

The first language model considered uses the SLOR technique, used previously by Lau et al. (2016) to remove topic bias. This technique normalises the effect of word frequencies by subtracting the unigram log probability of each word from the trigram log probability of the sentence, and divides by the sentence length to normalise the probabilities of sentences with differing lengths. The formula penalises words which commonly appear in the language model's training data, whereas rarer words improve the score. This offsets the bias of the trigram model, which penalises rare words and rewards common ones.

$$L_{slor}(s) = \frac{L_{tri}(s) - \sum_{w \in s} L_{uni}(w)}{|s|}$$

where $L_{tri}(s)$ is the log trigram probability of the sentence, and $\sum_{w \in s} L_{uni}(w)$ is the summed log unigram probability of each word in the sentence.
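A minimal sketch of how such a score can be computed with the KenLM Python bindings is given below. The model path is a placeholder, and the unigram term is approximated here by scoring each word in isolation under the same model (the project may equally have used a dedicated unigram model); KenLM returns log10 probabilities.

```python
import kenlm

trigram = kenlm.Model("native_trigram.arpa")   # placeholder path to a trained trigram model

def slor(sentence):
    """SLOR: (log P_tri(s) - sum of log P_uni(w)) / |s|."""
    words = sentence.split()
    log_tri = trigram.score(sentence, bos=True, eos=True)
    # Scoring a single word with no sentence boundaries gives its unigram log-probability.
    log_uni = sum(trigram.score(w, bos=False, eos=False) for w in words)
    return (log_tri - log_uni) / len(words)

print(slor("call me ishmael"))
```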

5.2.2 Potential issues with the SLOR model

Some potential issues exist with the SLOR model's ability to eliminate topic bias, which were not addressed by Lau et al. (2016). The model assumes that topic bias is caused by unigrams, whereas content words are often represented as n-grams for higher n. For example, a native speaker may be more likely to use the term "United Kingdom", whereas the SLOR model would attempt to offset the bias of this term by subtracting the unigram probabilities of "United" and "Kingdom" separately. This is a potential issue as the words "united" and "kingdom" can arise in many other contexts when considered separately.

5.2.3 The POS language model

Another method of avoiding topic bias was tested, which involved replacing all content words in the language model’s training data with their respective part-of-speech (POS) tags. For example, the sentence

African wildlife managers are constantly dealing with questions surrounding the need to slow the growth of elephant populations.

Becomes the following sentence in the modified corpus:

JJ NN NNS VBP RB VBG with NNS VBG the NN to VB the NN of NN NNS .

Two language models were trained: one on the original corpus, and another on the modified POS corpus. In the first instance the SLOR method is used to remove topic bias, whereas in the second instance there is no need to account for bias. Therefore the formula used to score on the second model is simply the log trigram probability divided by the sentence length.
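A minimal sketch of the content-word replacement is given below using NLTK's default tagger; exactly which tag classes count as "content words" is an assumption here (nouns, verbs, adjectives and adverbs are replaced, everything else is kept), chosen to match the example above.

```python
import nltk  # may require: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")   # assumed set of content-word tag classes

def to_pos_corpus(sentence):
    """Replace content words with their Penn Treebank tags, keeping function words."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return " ".join(tag if tag.startswith(CONTENT_TAG_PREFIXES) else word
                    for word, tag in tagged)

print(to_pos_corpus("African wildlife managers are constantly dealing with questions "
                    "surrounding the need to slow the growth of elephant populations."))
```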

5.3 Using Language Models to Classify Native vs Non-Native

5.3.1 Results

Each of the three non-native classes was reduced from 20,000 entries to 18,687 entries and compared separately against the 18,687 entries in the native class. A Gaussian Naive Bayes classifier, which was found to give the best performance, was trained for each non-native class using 5-fold cross-validation on the cross-validation set. The mean score across the folds was taken to be the measure of accuracy for this classifier, and for every classifier mentioned subsequently in this report.

Classes            SLOR accuracy    POS accuracy
Native/Korean      86.45%           96.96%
Native/Chinese     78.22%           94.44%
Native/Japanese    86.94%           97.98%

Figure 5.1: Accuracy of the binary language model classifier.

5.3.2 Summary of Results

It is clear that trigram language models are very effective at discriminating between the native class and each non-native class. At first this may seem to be because there are many local differences between native and non-native speech which can be accurately captured by a trigram probability distribution. However, it is important to consider the fact that the tagger used to create the POS corpus is itself trained on a native corpus, and so may give peculiar and incorrect POS tags to awkward non-native English that it was not trained on. This may exaggerate the differences between native and non-native speech.

The increased effectiveness of the POS model can be predicted after plotting each class using the SLOR model and POS model as axes. Clearly there is greater variance along the axis of the SLOR model, however the POS model axis separates the classes more effectively.

[Scatter plots of each class pair using the SLOR model and POS model scores as axes: (a) Native vs Korean, (b) Native vs Chinese, (c) Native vs Japanese.]

5.4 Using Language Models to Classify among Non-Natives

The training portion of the learner corpus was separated by class, and used to build three separate trigram POS language models, one for each class. The cross-validation portion of the corpus was then used to train and test a classifier with confidence scores from each model as features. The language models built for each class did not discriminate the data nearly as well as in the native vs non-native problem, as the results and scatter plots below show.

Classes                    POS accuracy    Baseline
Korean/Chinese/Japanese    41.96%          33.33%

Figure 5.3: The non-native POS classification score is shown alongside a baseline of 33.3%, which is obtained through random prediction of the three equally-balanced classes.

While language models are excellent tools for classifying in the native vs non-native task, they prove to fall short for classifying among non-natives in this task, contrary to previous research such as Brooke and Hirst (2011), where accuracies above 70% were reached using language model probability scores alone. A possible reason for this is the limited amount of data available in this task to train the

25 respective non-native language models; if a larger learner corpus were to be used, better results may be obtained.

Figure 5.4: Scatter plots (panels (a) Chinese vs Japanese, (b) Chinese vs Korean, (c) Korean vs Japanese) in which the data in each pair of non-native classes are plotted using the relevant language model POS scores as axes, demonstrating the high correlation between them.

Chapter 6

A Heuristic Approach to Classifying Native vs Non-Native

In this chapter, a more insightful attempt was made at building a binary native vs non-native classifier, and its performance compared against the simple statistical classifier built previously. This classifier used features which capture differing placement of adverbs, left/right branch direction in dependency parse trees, frequency of the passive voice and sentence length.

6.1 Adverb Placement

Adverbs are classified as occurring either as the first word in a sentence, between the first word and the first verb, between the first verb and the end of the sentence, or as the very last word of the sentence. The percentage of adverbs occurring at each position was calculated for every class. The results found are shown below.

Class      Head    Mid - preceding 1st verb    Mid - succeeding 1st verb    Tail
Korean     0.14    0.38                        0.35                         0.13
Chinese    0.14    0.37                        0.36                         0.13
Japanese   0.12    0.37                        0.36                         0.15
Native     0.11    0.33                        0.47                         0.09

Clearly, speakers in the native class are less prone to place adverbs at the tail of the sentence than the non-native speakers, and more prone to place adverbs between the first verb and the tail. As a result, "Mid - succeeding 1st verb" and "tail" are taken as candidate features.
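The sketch below reproduces this measurement for a single entry using SpaCy; the exact token classes counted as adverbs and verbs (coarse tags ADV and VERB/AUX) and the treatment of sentence-final punctuation are assumptions rather than details taken from the report.

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def adverb_position_distribution(text):
    """Proportion of adverbs at each of the four positions described above."""
    counts = Counter()
    for sent in nlp(text).sents:
        tokens = list(sent)
        verbs = [t.i for t in tokens if t.pos_ in ("VERB", "AUX")]
        first_verb = verbs[0] if verbs else None
        # Treat the word before final punctuation as the sentence's last word.
        last_word = tokens[-2].i if len(tokens) > 1 and tokens[-1].is_punct else tokens[-1].i
        for tok in tokens:
            if tok.pos_ != "ADV":
                continue
            if tok.i == tokens[0].i:
                counts["head"] += 1
            elif tok.i == last_word:
                counts["tail"] += 1
            elif first_verb is None or tok.i < first_verb:
                counts["mid, preceding 1st verb"] += 1
            else:
                counts["mid, succeeding 1st verb"] += 1
    total = sum(counts.values()) or 1
    return {pos: n / total for pos, n in counts.items()}

print(adverb_position_distribution("Honestly, I never eat sushi. I eat pizza very often."))
```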

6.2 Left and Right Branching

Before describing the methodology and results of this feature type, the notion of branching in linguistics and its relevance to this task will be outlined.

27 6.2.1 What is "branching"?

The term "branching" refers to the direction in which a parse tree descends. In general, languages whose speakers produce parse trees which generally descend towards the right are called "right-branching", whereas languages whose parse trees descend towards the left are "left-branching". English is largely right-branching, which means that there are no strict rules regarding the branch direction but speakers instinctively use methods such as extraposition to shift the parse tree towards the right (Francis, 2010). On the other hand, Korean, Chinese and Japanese are strictly right-branching; the languages contain violatable rules which ensure their "left-branchedness".

A hypothesis can be made from the above information: speakers whose native languages are left-branching may be less inclined to right-branch when speaking English than natives.

6.2.2 Capturing Differences in Branching

In order to test this hypothesis, every sentence in each entry was parsed into a dependency tree. The length of the longest right branch starting at the second level in each tree was divided by the length of the longest left branch. The average of this value over every sentence in a given entry was then taken to be the feature value for that entry. If the hypothesis is true, this value will be significantly higher among the native class than among the other classes.
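A minimal sketch of this measurement with SpaCy is shown below. Branch lengths are measured downwards from the sentence root here, which approximates the report's "second level" starting point, and the floor of 1 on the left-branch length (to avoid division by zero) is an added assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def longest_branch(token, direction):
    """Longest chain of dependents extending only to the left or only to the right."""
    side = [c for c in token.children if (c.i > token.i) == (direction == "right")]
    return 0 if not side else 1 + max(longest_branch(c, direction) for c in side)

def branching_feature(entry_text):
    """Average (longest right branch / longest left branch) over the sentences of an entry."""
    ratios = []
    for sent in nlp(entry_text).sents:
        right = longest_branch(sent.root, "right")
        left = max(longest_branch(sent.root, "left"), 1)
        ratios.append(right / left)
    return sum(ratios) / len(ratios)

print(branching_feature("You can build a machine that performs better in less than one hour."))
```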

In the figure below, an extreme example is considered to illustrate the procedure. A very right-branching sentence is shifted left, and the right/left branching scores of the two versions are compared. The right-branching sentence

You can build a machine that performs better in less than one hour with a few million dollars.

if shifted to the left, becomes

With a few million dollars, in less than one hour you can build a better performing machine.

where the only difference in the word usage is the omission of "that", which appears in the first sentence. To illustrate their differences, these sentences are shown as separate constituency trees.

Figure 6.1: In the right-branching sentence (a), the branch proportion is 7/1 = 7. In the left-branching sentence (b), it is 5/3 = 1.67. The red branches indicate the longest left branch, whereas the blue branches indicate the longest right branch.

6.2.3 Passive Voice

The SpaCy dependency parser was used to detect instances of passive voice in each entry. The results showed that speakers in the native class make use of passive voice much more frequently than those in other classes. The results of the branching investigation and number of sentences containing the passive voice are given below, where passive voice instances are normalised by the number of sentences across each class. Bold indicates the highest value in the column.

Class      Longest right branch / longest left branch    % Sentences containing passive voice
Korean     1.81                                          4.16%
Chinese    2.03                                          5.14%
Japanese   1.83                                          4.44%
Native     2.64                                          15.13%

Figure 6.2: The average proportion of right-branch/left-branch among Native speakers is between 23% and 31% higher than the other classes. Members of the native class also use passive voice around three times as frequently as members of the non-native classes.
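A minimal sketch of the passive voice count with SpaCy is given below; the dependency labels checked ("nsubjpass" and "auxpass") are those produced by SpaCy's English models, though the exact label names can vary between model versions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def passive_sentence_rate(text):
    """Fraction of sentences in an entry containing at least one passive construction."""
    sents = list(nlp(text).sents)
    passive = sum(any(tok.dep_ in ("nsubjpass", "auxpass") for tok in sent)
                  for sent in sents)
    return passive / len(sents)

print(passive_sentence_rate("The cake was eaten by the dog. The dog ran away."))
```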

6.2.4 Sentence Length

Kochmar (2011) successfully used sentence length as a feature, and from observing the general information in section 2.5 we saw that entries in the native class tend to contain more words but fewer sentences. Therefore there is incentive to use sentence length as a feature.

29 6.2.5 Results

The performance of a classifier trained on each of these features individually and combined is reported below. Note that with the exception of the rightmost column, all scores are reported for a classifier with a single feature.

Classes            Adverb    Branch    Passive    Sentence length    All combined
Native/Korean      67.94%    81.45%    70.31%     93.88%             94.48%
Native/Chinese     68.45%    77.81%    71.94%     88.09%             88.69%
Native/Japanese    70.83%    80.59%    71.02%     94.54%             94.74%

Figure 6.3: The first four columns show the accuracy results (%) given by a classifier using a single feature. The final column shows the result of combining the four features.

Using a classifier with adverb distribution, right-branchedness, sentence length and frequency of passive voice is effective, with classification accuracy between 88.69% and 94.74%. Curiously, the branching factor feature alone is extremely effective at classifying Korean speakers, at 81.45%. Furthermore, despite each feature type being effective on its own, the most effective feature, sentence length, accounts for most of the accuracy when the features are combined.

Despite giving us more insight regarding the differences in how speakers in the native and non-native classes use English, these results do not beat the simple statistical language model. However, performance is maximised when the two classifier types are combined.

6.3 Combining the "Language Model" and "Heuristic" Features

The final investigation in this chapter is to combine the language model and heuristic features previously addressed separately in this chapter, in order to maximise the performance of the native vs non-native classifier. The results are shown below.

Classes            Accuracy
Native/Korean      99.09%
Native/Chinese     97.17%
Native/Japanese    99.19%

Figure 6.4: Final reported accuracy of the native vs non-native problem.

Chapter 7

Feature "Families" Used in the Multi-Class Problem

Throughout the rest of this report, very similar types of features which were engineered in the same section are grouped as belonging to "families". When reporting statistics regarding features, their families will be referred to using an abbreviated name. The table below provides a reference for the reader to be used during the rest of the report. The table provides an abbreviation of the family, a brief description and the section in which the feature family is developed and discussed.

Abbreviation           Description                                                                # Features    Section
POS                    The non-native language models                                             3             5.4
SPELL                  Spelling mistake features                                                  10            8.1
LCHECK                 General mistakes reported by the "LanguageCheck" library                   7             8.2
STOPW                  Stopword frequencies                                                       30            8.3
PREP                   Ensemble preposition usage features                                        3             9.1
VERB                   Ensemble verb usage features                                               3             9.2
PPPOS                  Frequency of prepositional phrases occurring before the 1st verb phrase    2             9.3
Native vs Non-Native   Features used during the native vs non-native classification problem      7             5.3, 6

Chapter 8

Using Techniques Influenced by Previous L1 Identification Research to Improve the Multi-Class Classification Problem

In this chapter, feature types broadly influenced by previous research are used to improve the performance of the non-native classifier: namely, capturing different types of spelling errors, counting general errors using an error correction library, and using stopword frequencies as features. However, in the case of investigating spelling errors, error types will not be captured using letter n-gram features as in Koppel et al. (2005), but rather using weighted variants of the Levenshtein distance between misspelled words and suggested corrections.

8.1 Capturing and Categorising Spelling Errors

The phonology of one's native language has a large influence on the types of errors that an L2 speaker of English makes when spelling (Randall and Isnin, 2004). These influences were investigated in an attempt to capture and categorise different error types which discriminate the classes. Ultimately, this feature family improved the classification performance by 5.8% when appended to the decision tree classifier.

Koppel et al. (2005) considered spelling errors as belonging to five categories, of which three were explored here. They are as follows:

Name                 Example
Substitution error   probrem instead of problem
Inversion error      fisrt instead of first
Insertion error      friegnd instead of friend
Deletion error       frend instead of friend
Conflated words      stucktogether instead of stuck together

The above error types, with the exception of "conflated words" and "inversion error", were explored. Error frequency as well as the "severity" of errors were also examined. Severity was measured as the Levenshtein distance (edit distance) between the word and the suggested correction, normalised by the maximum length of the two words.

8.1.1 Levenshtein Distance

Levenshtein distance (commonly known as edit distance) is a measure of similarity between two strings, first proposed by Levenshtein (1966). Roughly speaking, it can be described as the minimum number of edits required to transform one string to another. An edit in this case is classified as being a letter substitution, insertion or deletion. For example

L(car, far) = 1 (one substitution)

L(cannot, kanot) = 2 (one substitution, "k" for "c"; one deletion of an "n")

Weighted and unweighted Levenshtein distance metrics are used throughout this section.

Weighted Levenshtein Distance

In order to isolate the specific types of errors committed, it would be useful to count only the edit operations corresponding to that error type; for example, we would like to count only the number of substitution edits in the "letter swap" error type, so that L(flaner, flannel) = 1, accounting only for the r/l swap despite the need to insert an extra "n".

The Python library Weighted-Levenshtein[1] was used to weight specific edit operations. Due to the dynamic programming nature of the algorithm, weighting all of the unwanted edit types as 0 or close to 0 returned incorrect results. A solution to this could be to use the recursive variant of the algorithm; however, this runs in exponential time compared to the O(mn) dynamic programming algorithm (where m and n are the word lengths). Therefore it was favourable to devise a method that uses the dynamic programming variant. In order to count only the edits of one type, the weighted distance where the targeted edit type has weight 2 was calculated, and the unweighted distance was then subtracted from this value. Formally, let

$L_s$ be the Levenshtein distance which only counts swaps. This is calculated using $L$, the distance where all edit types have equal weight, and $L_w$, the distance where swap edits have weight 2. So for two words $w_1$ and $w_2$,

$$L_s(w_1, w_2) = \frac{L_w(w_1, w_2) - L(w_1, w_2)}{\max(|w_1|, |w_2|)}$$

For example, in the case below the swap distance before dividing by the maximum word length is 1, accounting for the one swap and ignoring the deletion.

$$L_s(\text{flaner}, \text{flannel}) = \frac{(2 \times 1 + 1) - 2}{7} = \frac{1}{7}$$

[1] Available at https://testpypi.python.org/pypi/weighted-levenshtein/0.9
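A minimal sketch of this weight-2 trick using the Weighted-Levenshtein package is given below; the 128-entry ASCII cost arrays follow that package's documented interface, and the inputs are assumed to be ASCII words.

```python
import numpy as np
from weighted_levenshtein import lev

def swap_only_distance(w1, w2):
    """Normalised distance counting only substitution ("swap") edits:
    weight substitutions as 2, subtract the unweighted distance,
    and normalise by the length of the longer word."""
    substitute_costs = np.full((128, 128), 2.0)          # every substitution costs 2
    weighted = lev(w1, w2, substitute_costs=substitute_costs)
    unweighted = lev(w1, w2)
    return (weighted - unweighted) / max(len(w1), len(w2))

print(swap_only_distance("flaner", "flannel"))   # expected 1/7 ≈ 0.143
```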

33 8.1.2 Using Levenshtein Distance to Capture Error Types

Weighted Levenshtein distance has been used previously in unrelated tasks to improve the accuracy of spelling correction systems, such as in Hicham et al. (2012), where errors requiring different edit types and edits involving specific letters were weighted by their frequencies in a training corpus which was analysed by hand. In this project, we notice that the substitution, insertion and deletion actions in the edit distance algorithm correspond suitably with the "substitution", "deletion" and "insertion" categories listed above. For example, the average deletion distance between a misspelled and corrected word can be used to model the occurrence and severity of insertion errors. The Python library Enchant[2] was used to detect misspelled words in the learner corpus and give suggested corrections. The unweighted Levenshtein distance between the misspelled word and the correction was taken to be a measure of the "severity" of the mistake made. Furthermore, the weighted distance procedure explained in section 8.1.1 was used to capture substitution, insertion and deletion errors.
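The detection step can be sketched as follows with PyEnchant; the edit-distance call uses the python-Levenshtein package here purely for illustration (any Levenshtein implementation, including the weighted one above, could be substituted).

```python
import enchant        # PyEnchant
import Levenshtein    # python-Levenshtein, used here only as an example distance

d = enchant.Dict("en_US")

def average_severity(words):
    """Average normalised distance between misspelled words and their top suggestion."""
    distances = []
    for w in words:
        if w.isalpha() and not d.check(w):
            suggestions = d.suggest(w)
            if suggestions:
                best = suggestions[0]
                distances.append(Levenshtein.distance(w, best) / max(len(w), len(best)))
    return sum(distances) / len(distances) if distances else 0.0

print(average_severity("I have a probrem with my frend".split()))
```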

The table below summarises the average normalised distances for each type of error across each class in the training set.

Class      Letter substitution    Letter insertion    Letter deletion    Unweighted distance
Korean     0.21                   0.032               0.081              0.35
Chinese    0.14                   0.037               0.11               0.34
Japanese   0.18                   0.034               0.1                0.36

It was also found that in every class, between 5.6% and 5.9% of words contained errors. Therefore, not only do speakers in each class commit errors with similar frequencies; if we take the unweighted distance between the mistake and the correction as a metric for the severity of a mistake, the severity of mistakes made across each class is also similar on average. However, the proportion of errors in the Korean class which are letter substitution errors is clearly higher than in the other classes.

Capturing Specific Substitution Errors

In this case, rather than counting the occurrences of any substitution error, only substitutions involving specific pairs of letters were considered. The letter swaps l/r and v/f/p/b were specifically targeted using knowledge of the languages in question: there is no distinction between /l/ and /r/ in Japanese phonology (Kubozono, 2015, p. 6) or in Korean, and the fricatives /v/ and /f/ also do not exist in Korean phonology (Brown and Yeon, 2015, p. 7). In the latter case, Koreans often substitute these fricatives with /b/ and /p/ when pronouncing loanwords from English, which may lead to general confusion over when to use the four sounds and the corresponding four letters. Apart from targeting these phonological differences, distances for other specific letter swaps were tested through trial and error, and the swaps m/n and c/k were found to differ noticeably among the classes.

2Available at https://pypi.python.org/pypi/pyenchant/

Class      l/r          v/f/p/b      c/k          m/n
Korean     7.6 × 10−4   1.1 × 10−3   5.3 × 10−4   7.6 × 10−2
Chinese    1.7 × 10−4   3.9 × 10−4   6.4 × 10−5   2.2 × 10−2
Japanese   1.7 × 10−3   7.3 × 10−4   1.1 × 10−3   4.2 × 10−2

Figure 8.1: Average weighted Levenshtein distance of specific swaps between the incorrectly spelled word and suggested correction across each class.

In the case of the error types chosen as a result of the languages' phonologies, the hypotheses are confirmed: speakers of the languages without an /r/–/l/ distinction (Korean and Japanese) are more likely to mix up the letters "r" and "l" when writing in English, and Koreans are more likely to confuse "v", "f", "p" and "b". The reason for the proportionally higher rate of "c/k" and "m/n" errors among Japanese and Korean speakers respectively is unknown.

Capturing Specific Insertion Errors

Chinese speakers were found to be most likely to add superfluous vowels to words, whereas Korean speakers most often incorrectly added consonants. The average weighted delete distances (corresponding to the letter deletions required to reach the correct word) are shown below.

Class      delete vowel   delete consonant
Korean     1.1 × 10−2     4.7 × 10−2
Chinese    2.5 × 10−2     2.2 × 10−2
Japanese   1.9 × 10−2     4.2 × 10−2

Figure 8.2: Average weighted Levenshtein distance of specific deletions made in the incorrectly spelled word to reach the suggested correction.

The specific insertion and swap error distances, as well as the general substitution, insertion, deletion distances and the unweighted distance were all taken to be features in the "spelling error" group.

8.1.3 Results and Feature Importance

This family of features improved the accuracy of the classifier by 5.8% when appended alongside the non-native POS log probability features. The table below shows the accuracy reported by a classifier containing only features in the POS family, and the performance after adding the features in this family (SPELL).

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%

Features in this family were ranked according to information gain; the table of features with their respective information gain scores is given below in order of descending importance. It was found that the general unweighted distance and the swap distance were approximately equally informative, and the specific swap errors were the least informative.

Feature name        Information Gain
Unweighted distance 0.205288
Letter swap         0.190701
Letter deletion     0.148508
Delete consonant    0.117607
Letter insertion    0.113546
Delete vowel        0.108587
Swap m/n            0.053574
Swap r/l            0.005928
Swap b/p/v/f        0.005264
Swap c/k            0.002451

8.2 Using an Error Correction Library to Detect Errors

In this approach, the Python library LanguageCheck (Miłkowski, 2010) was used to check for errors in each entry, and the most discriminative error types were used as features, improving classification accuracy by a further 9.12%. Over 400 error types were identified across the learner corpus, with the majority of error types occurring only once. A high-dimensional logistic regression classifier was first built, logistic regression being used for its speed over high-dimensional data. Features were ranked according to importance using recursive feature elimination: it was found that the 7 best-ranked features captured over 99% of the accuracy of the classifier, so these features were selected for the decision tree model. A graph of the 14 most important features (added by descending importance) against accuracy is shown below. The features along with their descriptions are also shown in descending order of importance.

Figure 8.3: Number of features (added by descending importance) against accuracy. Choosing the best seven features captures around 99% of the accuracy that all the features provide.

Feature name                     Description
Uppercase sentence start         Sentence starts with a lowercase letter.
Whitespace rule                  Sentence contains unnecessary whitespace.
Comma parenthesis whitespace     No space after comma.
I lowercase                      I (pronoun) appears as "i".
Unpaired brackets                Bracket is opened and not closed.
Double punctuation               Use of two consecutive dots or commas.
Sentence fragment                Sentence is actually a dependent clause.

With the exception of "sentence fragment", the most important features seem to capture superficial typos rather than grammatical errors resulting from true L1 interference. The significance of the rules "whitespace rule" and "comma parenthesis whitespace" may result from the fact that Chinese and Japanese do not use whitespace in their writing systems, whereas Korean does; speakers of Korean should therefore be better at using whitespace correctly. It may be more effective to find deeper grammatical errors which result from L1 interference.
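To make the error-counting step concrete, a minimal sketch is given below. It assumes the language_check Python wrapper around LanguageTool, whose check() method returns match objects carrying a ruleId; the example text and the rule identifiers mentioned in the comment are illustrative, not taken from the project's code.

from collections import Counter
import language_check

tool = language_check.LanguageTool('en-US')

def error_type_counts(entry_text):
    """Count the LanguageTool rule identifiers triggered by one learner entry."""
    return Counter(match.ruleId for match in tool.check(entry_text))

counts = error_type_counts("i went to school yesterday .He was there ,too.")
print(counts.most_common(5))   # e.g. UPPERCASE_SENTENCE_START, COMMA_PARENTHESIS_WHITESPACE, ...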

8.2.1 Results

The 9.12% increase in accuracy after adding this family (LCHECK) is given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%

8.3 Using Stopwords as Features

8.3.1 Stopwords

In computer science, stopwords are the extremely common words in a language which do not contain important topical information. They are mostly common verbs such as "be" or "have", as well as function words which contribute to syntax rather than meaning, such as "the" and "but". Stopwords have been successfully used as feature types in almost all research discussed in the background chapter, and so they are also utilised here.

8.3.2 Method

The top 30 most frequently appearing stopwords in the corpus were found, and for each stopword its frequency in each entry was taken to be a feature. This method significantly improved the performance of the classifier.
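A minimal sketch of this feature family is shown below; the stopword list is a hypothetical stand-in for the 30 most frequent stopwords found in the corpus, and normalising by entry length is an assumption (raw counts could equally be used).

from collections import Counter

# hypothetical stand-in for the 30 most frequent stopwords in the corpus
TOP_STOPWORDS = ["the", "to", "and", "i", "a", "of", "in", "is", "it", "was"]

def stopword_features(entry_tokens):
    """Relative frequency of each chosen stopword within one entry."""
    counts = Counter(token.lower() for token in entry_tokens)
    total = max(len(entry_tokens), 1)
    return [counts[stopword] / total for stopword in TOP_STOPWORDS]

print(stopword_features("I went to the park and it was fun".split()))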

8.3.3 Results

This family (STOPW) increased performance by 4.25% when appended to the classifier.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%

Chapter 9

Investigating New Feature Types in the Multi-Class Problem

In this chapter, several new ideas will be tested which capture preposition and verb usage without explicitly counting errors, as was done in previous research. Furthermore, some features which capture stylistic differences will be investigated, such as the use of correlative conjunctions and the placement of prepositional phrases. These feature types will be appended to the decision tree classifier used in the previous chapter in order to further improve performance.

9.1 Capturing the Usage of Prepositions

The usage of prepositions as grammatical components linking verbs to objects was investigated. The method used in this section does not explicitly count occurrences of prepositional errors, but rather models prepositional behaviour by recording which types of prepositions commonly occur with which verbs, linking to which object types and alongside which other grammatical particles. The features acquired through this investigation improved performance by 1.46% when appended to the classifier.

9.1.1 The "Caseframe" Method

Nagata et al. (2014) built a systematic method for detecting preposition errors in a sentence, where each sentence is converted and stored into a "caseframe" entry. A model of correct prepositional usage was created by building a large dictionary of caseframes from native corpora, and prepositional errors were found by creating caseframes from non-native written sentences and looking them up in the native dictionary. If the lookup returned no matches, the sentence was marked as containing a prepositional error. In this project a simplified caseframe model was built using the basic ideas in Nagata et al. (2014), which does not attempt to count errors or compare the learner caseframes with a native dictionary of caseframes. The following attributes are stored in each caseframe entry.

Attribute     Description                                                             Examples
Subj          Subject of the sentence                                                 I, you, he
Dobj          Direct object of the head                                               PRP, NNP, NN
Iobj          Indirect object of the head                                             PRP, NNP, NN
Head          Head verb of the sentence, converted to infinitive form                 go, find, look
Prt           Particle                                                                to, while, for
Prep_[prep]   Object of a preposition, where [prep] is the specific preposition       PRP, NNP, NN
              (i.e. to, around)

An example of a sentence missing the preposition "to" being converted to a caseframe entry is shown below, alongside an example where the "to" is present. In native speech, the verb "listen" cannot be followed by a pronoun direct object.

I listened him but I forgot. −→[head:listen, subj:I, dobj: PRP]

In this correct sentence, the pronoun is not a direct object but a prepositional object of "to":

I listened to him but I forgot. −→[head:listen, subj:I, prep_to: PRP]
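The sketch below illustrates how such a caseframe might be extracted with spaCy's dependency parser (section 9.1.3 states that spaCy was used to identify the attributes); the choice of dependency labels (nsubj, dobj, prt, prep, pobj), the restriction to the sentence's root verb, and the expected outputs in the comments are assumptions, and the exact frames depend on the parse.

import spacy

nlp = spacy.load("en_core_web_sm")

def caseframe(sentence):
    """Build a simplified caseframe for the root verb of a sentence."""
    doc = nlp(sentence)
    root = next((tok for tok in doc if tok.dep_ == "ROOT"), None)
    if root is None or root.pos_ != "VERB":
        return None
    frame = {"head": root.lemma_}              # head verb in infinitive (lemma) form
    for child in root.children:
        if child.dep_ == "nsubj":
            frame["subj"] = child.text
        elif child.dep_ == "dobj":
            frame["dobj"] = child.tag_         # store the POS tag, e.g. PRP, NNP, NN
        elif child.dep_ == "prt":
            frame["prt"] = child.lower_
        elif child.dep_ == "prep":
            for pobj in child.children:
                if pobj.dep_ == "pobj":
                    frame["prep_" + child.lower_] = pobj.tag_
    return frame

print(caseframe("I listened him but I forgot."))     # e.g. {'head': 'listen', 'subj': 'I', 'dobj': 'PRP'}
print(caseframe("I listened to him but I forgot."))  # e.g. {'head': 'listen', 'subj': 'I', 'prep_to': 'PRP'}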

9.1.2 Justifications for Not Explicitly Counting Errors

In this task, non-native caseframe entries were not looked up against a native dictionary. Instead, the caseframe entries across the three classes were compared. The following justification is given for this decision.

• A very large amount of native data is required in order to produce a respectable recall.

• It is time-consuming to process each sentence in a large corpus of native data into a caseframe entry.

• Analysing how each class's caseframes differ may capture a deeper and wider range of differences than simply counting preposition errors.

• The caseframe method proposed by Nagata et al. (2014) effectively captures preposition errors. Therefore if there are differences in preposition error types/rates between the classes, representing sentences as caseframes will still capture these differences whether they are looked up against native text or not.

9.1.3 Using an Ensemble Method to Capture Differences in Preposition Usage

For every entry in the training portion of the corpus, each sentence was converted to a caseframe using SpaCy’s dependency parser to identify each attribute, and was represented as a vector of the form

root_go   root_be   ...   subj_propn   subj_nn   ...   prep_to_nn   ...
1         0         ...   1            0         ...   1            ...

A secondary "caseframe classifier" was built from these features. In order to create a classifier with a manageable number of features, only sentences whose head verbs are among the 100 most common were considered, and all head verbs were converted to their infinitive forms using NLTK. A logistic regression classifier was built from these features for use in the ensemble method. The validation set was used to weight each of the 3 features, with the biggest improvement found with weights Chinese=1, Korean=0.985, Japanese=0.98. The diagram below shows the performance of the caseframe classifier on the training portion of the corpus after validation tuning.

Using this logistic regression classifier as an ensemble method proceeded as follows. For each entry across the cross-validation set, every sentence in the entry was then converted to a caseframe and represented as a vector like the one above. The three probabilities for each class given by the caseframe classifier for every sentence in an entry were then averaged, and these three average probabilities were appended to the original classifier as features corresponding to that entry.

Figure 9.1: A flowchart showing the conversion of a sentence to a feature vector, which is then fed into the caseframe classifier, whose probability scores are appended to the original classifier.
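A sketch of this ensemble step is given below; caseframe_clf (a fitted scikit-learn logistic regression over per-sentence caseframe vectors), vectorise, and the way the validation weights are applied are illustrative assumptions rather than the project's exact code.

import numpy as np

CLASS_WEIGHTS = np.array([1.0, 0.985, 0.98])   # Chinese, Korean, Japanese (validation-tuned)

def ensemble_features(entry_frames, caseframe_clf, vectorise):
    """Average per-class caseframe probabilities over the sentences of one entry.

    entry_frames:  caseframe dicts for each sentence of the entry (None if no frame)
    caseframe_clf: fitted classifier exposing predict_proba()
    vectorise:     maps a caseframe dict to the binary vector shown above
    """
    vectors = [vectorise(frame) for frame in entry_frames if frame is not None]
    if not vectors:
        return np.zeros(3)
    probs = caseframe_clf.predict_proba(np.vstack(vectors))   # shape (n_sentences, 3)
    return probs.mean(axis=0) * CLASS_WEIGHTS                 # three extra features per entry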

9.1.4 Results

The results after appending features in this family (PREP) are given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%
PREP            62.59%

9.2 Capturing the Usage of Verbs in the Global Context

A method proposed by Tajiri et al. (2012) was simplified and investigated as a means of capturing the differing usages of verbs across the non-native classes. Tajiri et al. use information regarding tense usage in the previous verb phrase to make predictions about the correctness of the tense used in the verb phrase in question. Furthermore, temporal adverbs (such as "today" and "yesterday") are taken into consideration when evaluating tense correctness, as well as other information such as the subjects, objects and conjunctions which appear in the verb phrase, and whether the verb phrase is in the main clause of the sentence. This feature family raised performance by a further 1.27%.

9.2.1 Simplifications to the Model Proposed by Tajiri et al

When building the system proposed by Tajiri et al. (2012), several simplifications were made, both for the sake of efficiency and to keep this section manageable within the scope of the project. The major simplifications are, firstly, that rather than building a system which analyses every verb phrase in a sentence, only the leftmost top-level verb phrase is taken into account. As for the global context, the top-level verb phrase in a given sentence is compared to the top-level verb phrase in the directly preceding sentence. Furthermore, rather than storing each head verb as a content word, the POS tag of the verb is stored instead. This design decision was made since it is easier to make generalisations about a limited set of POS tags than about a very large set of possible content words. Moreover, the "top 100" method used in the caseframe approach to overcome this would be difficult here, since every possible form of each of the top 100 verbs would need to be added as a feature, whereas in the caseframe method all verbs were converted to infinitives.

Finally, as in the preposition caseframe method, entries were not compared against a native corpus, nor were their tense errors explicitly counted. Instead, this method aims to capture and contrast the differences in verb usage across the three non-native classes.

9.2.2 Method

A method very similar to the caseframe method of creating a secondary ensemble classifier trained from the held-out training set was used. Also similarly, each sentence was stored as a set of "key components" in an entry, and this entry format was then used to build features in the secondary classifier. Afterwards, entries in the training set were converted to vectors of this form and run through the secondary classifier. The probabilities of each sentence of an entry belonging to each of the three classes were then averaged and appended to the vector corresponding to that entry in the original classifier.

The following attributes were used for each sentence.

Attribute   Description                                          Examples (not exhaustive)
Temporal    1st temporal adverb appearing in the verb phrase     today, last year, yesterday
Dobj        Direct object of the verb phrase                     PRP, NNP, NN
Conj        First conjunction used in the verb phrase            and, but, or
Head        Head verb of the verb phrase                         VB, VBG, VBZ
Subj        Subject of the sentence                              I, you, me
Aux         Auxiliary verb within the verb phrase                VB, VBG, VBZ
Pobj        Prepositional object within the verb phrase          NP, DT, NNP

For each attribute, the corresponding attribute in the previous sentence is also considered. As an example, consider the following sentences.

Yesterday I go to Shinjuku for dinner. Afterwards I drank with my friends.

Each sentence would be stored as an entry in the following manner:

Yesterday I go to Shinjuku for dinner. −→[temporal: yesterday, head: VB, subj: I, dobj: NNP, pobj: NP, temporal_previous: None, head_previous: None, subj_previous: None, dobj_previous: None, pobj_previous: None]

Afterwards I drank with my friends. −→[temporal: None, head: VBD, subj: I, dobj: None, pobj: NNS, temporal_previous: yesterday, head_previous: VB, subj_previous: I, dobj_previous: NNP, pobj_previous: NP]

While the pronoun is "I" in both sentences, the verb tenses are inconsistent. Also the tense "VB / infinitive verb" in this case is unnaturally paired with the temporal adverb "yesterday".

Also as in the caseframe method, the sentence "Afterwards I drank with my friends" is vectorised as follows.

temporal_yesterday   ...   head_VBD   ...   subj_I   ...   pobj_NP   ...   temporal_prev_yesterday   ...
0                    ...   1          ...   1        ...   1         ...   1                         ...
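One way to perform this vectorisation is scikit-learn's DictVectorizer, which one-hot encodes string-valued attributes as "attribute=value" indicator features; this is an illustrative sketch rather than the project's exact implementation.

from sklearn.feature_extraction import DictVectorizer

entries = [
    {"temporal": "yesterday", "head": "VB", "subj": "I", "dobj": "NNP", "pobj": "NP"},
    {"head": "VBD", "subj": "I", "pobj": "NNS", "temporal_previous": "yesterday",
     "head_previous": "VB", "subj_previous": "I", "dobj_previous": "NNP",
     "pobj_previous": "NP"},
]

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(entries)   # binary matrix, one column per attribute=value pair
print(X.shape)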

In order to find the portion of the sentence which corresponds to the top-level verb phrase, a depth-first search was performed on the sentence's constituency tree until a "VP" node was found, counting the number of occurrences of leaf nodes (words) along the way. A second depth-first search was then performed on the subtree with the "VP" node as its root, where a second count of each encountered leaf node was started. On the termination of the second depth-first search, the number of leaf nodes before finding "VP" was taken to be the index of the beginning of the verb phrase in the sentence, and this count summed with the leaf node count upon termination of the depth-first search on the "VP" subtree was the index of the end of the verb phrase in the sentence. The pseudocode for this method is given below.

Algorithm 1 Finding the start and end words of the top-level verb phrase in a sentence.

function VP_POSITION(tree)
    word_count = 0
    vp_found = False
    start, end = 0, 0
    for subtree in DFS(tree) do
        if is_leaf(subtree) then
            word_count += 1
        else if subtree.label == "VP" and vp_found == False then
            vp_found = True
            start = word_count
            for vp_subtree in DFS(subtree) do
                if is_leaf(vp_subtree) then
                    word_count += 1
                end if
                if word_count > 0 then
                    end = word_count - 1
                else
                    end = 0
                end if
            end for
            break    ▷ break out of the top-level DFS after the DFS of the verb phrase subtree terminates
        end if
    end for
    return (start, end)
end function

9.2.3 Results

Performance reported by the classifier after appending the VERB feature family is given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%
PREP            62.59%
VERB            63.86%

9.3 Placement of Prepositional Phrases

A prepositional phrase is a syntactic unit of a sentence whose head word is a preposition. In English, prepositional phrases can often be placed either before or after the head verb of the sentence:

Before going to class, he quickly did his homework.

He quickly did his homework before going to class.

The positioning of prepositional phrases was investigated across speakers of the three classes, where a prepositional phrase was categorised as either appearing before or after the first verb phrase in a sentence. It was found that Japanese speakers place prepositional phrases at the beginning significantly less often than speakers in the other classes, and features exploiting this fact were found to increase performance by 0.5%. The table below shows the percentage of prepositional phrases occurring before the first verb phrase.

Non-Native Class   % PP occurring before first VP
Korean             16.73%
Chinese            17.59%
Japanese           13.26%

9.3.1 Method

Most prepositional phrases were correctly identified as "PP"; however, a fragmented sentence would occasionally cause the parser to mark a prepositional phrase as a subordinate clause whose head word is a preposition. This second case was also considered in the algorithm, which is explained below. In this feature family, counts of prepositional phrases arising as PPs and counts arising as subordinate clauses were treated as separate features.

In order to find the number of prepositional phrases preceding the first verb phrase, we proceed as follows. Beginning at the root of the sentence's constituency tree, perform a depth-first search of the tree until a prepositional phrase (or a subordinating conjunction containing a preposition) is found. Treating the nodes on the level of the prepositional phrase as a string of words, record the index of the prepositional phrase (pp_index) along with the index of the first verb phrase at this level (vp_index). If pp_index < vp_index, record this sentence as containing a clause with a prepositional phrase preceding any verb phrases.

Algorithm 2 Determining whether a tree is a prepositional phrase involves also checking whether it is a subordinating conjunction with a preposition as its leftmost child. This function is used in the PP_FIRST function.

function IS_PP(tree)
    if tree.label == "PP" then
        return True
    end if
    children = tree.children
    for i from 0 to children.length do
        if children[i].label == "SBAR" then
            sbar_children = children[i].children
            if sbar_children[0].label == "IN" then
                return True
            end if
        end if
    end for
    return False
end function

Algorithm 3 Determining whether a prepositional phrase appears before the first verb phrase within any of a sentence's clauses.

function PP_FIRST(tree)
    pp_index = -1
    vp_index = -1
    for subtree in DFS(tree) do
        if pp_index > -1 then
            break
        end if
        children = subtree.children
        for i from 0 to children.length do
            if is_pp(children[i]) and pp_index == -1 then
                pp_index = i
            else if children[i].label == "VP" and vp_index == -1 then
                vp_index = i
            end if
        end for
    end for
    if pp_index > -1 and pp_index < vp_index then
        return True
    else
        return False
    end if
end function

9.3.2 Results

Performance reported by the classifier after appending this family (PPPOS) is given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%
PREP            62.59%
VERB            63.86%
PPPOS           64.36%

9.4 Correlative Conjunctions

Correlative conjunctions are pairs of conjunctions that appear together as a pattern in natural language; in English, "If S1, then S2" and "Both S1 and S2" are examples. Chinese makes heavy use of both of these structures, as well as of others which do not appear in native English speech; for example, the structure "Although S1, but S2" is natural in Chinese but not in English. Instances of different correlative conjunctions were counted for each class and used as features. The methodology for finding instances of correlative conjunctions is discussed first, then the results are given.

Adding this feature family on top of the classifier did not increase performance, and so these features were not useful in the classification task. However they are included in the main body of this report, as it was found that Chinese speakers were much more likely to make use of correlative conjunctions in English than the other speakers across every type of correlative conjunction that was tested. Therefore the research done in this portion of the task showed a clear difference in the stylistic usage of English among speakers in the three classes. The conjectured reason for their poor contribution to accuracy is that structures of this type are sparse in text written by non-natives, and each correlative conjunction has a different constituency tree structure so it is time-consuming to devise an algorithm for capturing each individual type. However if a very large number of these structures were investigated they could possibly provide a significant increase in performance, given their more frequent use by Chinese speakers.

9.4.1 Linking Adjectives and Phrases with Conjunctions

This structure aims to capture grammatical patterns of the form "Both A and B", "Neither A nor B", "Either A or B". After analysing constituency parse trees in the learner corpus by hand, it was found that this pattern can be captured according to the rule

[DT/CC] [JJ/NP/VP/ADJP] CC [JJ/NP/VP/ADJP]

where the second instance of [JJ/NP/VP/ADJP] must agree with the first. Informally, this pattern can be described as "two (noun/verb/adjective) phrases of the same type linked by two conjunctions". Strictly, the [DT/CC] should always be CC; however, the parser often tagged this part of the pattern incorrectly.
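A sketch of matching this rule over an NLTK constituency tree is shown below; it scans each node's children for the label sequence above and is only an approximation of the rule used in the project.

from nltk.tree import Tree

PHRASE_LABELS = {"JJ", "NP", "VP", "ADJP"}

def has_correlative_pattern(tree):
    """Look for [DT/CC] [JJ/NP/VP/ADJP] CC [JJ/NP/VP/ADJP] among any node's children,
    where the two phrase labels must agree."""
    for subtree in tree.subtrees():
        labels = [child.label() if isinstance(child, Tree) else child for child in subtree]
        for i in range(len(labels) - 3):
            first, a, cc, b = labels[i:i + 4]
            if first in {"DT", "CC"} and cc == "CC" and a in PHRASE_LABELS and a == b:
                return True
    return False

sentence = Tree.fromstring(
    "(S (NP (NP (NNP Today) (POS 's)) (NN class)) "
    "(VP (VBD was) (ADJP (DT both) (JJ helpful) (CC and) (JJ interesting))))")
print(has_correlative_pattern(sentence))   # True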

(S
  (NP (PRP He))
  (VP (VBZ is)
      (NP (CC neither)
          (NP (DT a) (JJ good) (NN role) (NN model))
          (CC nor)
          (NP (DT a) (JJ bad) (NN influence)))))

Figure 9.2: A sentence containing the phrase "Neither a good role model nor a bad influence", with the corresponding substring "CC NP CC NP" highlighted.

(S
  (NP (NP (NNP Today) (POS 's)) (NN class))
  (VP (VBD was)
      (ADJP (DT both) (JJ helpful) (CC and) (JJ interesting))))

Figure 9.3: A sentence containing the phrase "Both helpful and interesting", with the corresponding substring "DT JJ CC JJ" highlighted.

9.4.2 Correlative Conjunctions Spanning Multiple Clauses

In the previous example, the role of the correlative conjunction was to connect phrases in the same clause. In the case of the patterns "if,then", "although, but" and "because, so", two clauses are being connected where each conjunction in the pair normally appears at the beginning of its respective clause. Constituency parse trees were again analysed by hand in order to find out how these patterns normally appear.

In general, this type was represented in a parse tree where each clause is a separate subtree, and the conjunction within the first clause is a direct leftmost child of the root node of the subtree. However, the conjunction corresponding to the second clause is normally not contained inside the second clause's subtree, but rather directly adjacent on its left side. This is shown in the figure below, where "although" and "but" connect two clauses. While this was the most common pattern, there were many exceptions to this rule, and it proved very difficult to produce a good heuristic rule which provided high recall. The simplest procedure was opted for: perform a depth-first search until an instance of "although" is found as a subordinating conjunction, then perform another depth-first search on adjacent subtrees until an instance of "but" is found as a conjunction.

(S
  (SBAR (IN Although)
        (NP (NN somebody))
        (VP (MD will)
            (VP (VB feel)
                (ADJP (JJ sick))
                (PP (IN after) (S (VP (VBG watching)))))))
  (, ,)
  (CC but)
  (S (NP (PRP I))
     (VP (VBD enjoyed) (NP (PRP it)))))

Figure 9.4: Typical structure of a sentence containing the correlative conjunction "Although, but" connecting two clauses.

9.4.3 Frequencies

In the following table, the frequency of correlative conjunctions across each class is given, normalised by the number of sentences across each respective class.

Non-Native Class   If,then      Although,but   DT JJ CC JJ   JJ and JJ    Total
Korean             6.6 × 10−4   3.5 × 10−4     2.2 × 10−4    5.6 × 10−4   1.8 × 10−3
Chinese            8.7 × 10−4   7.2 × 10−4     5.1 × 10−4    2.2 × 10−3   4.3 × 10−3
Japanese           2.3 × 10−4   1.3 × 10−4     2.6 × 10−4    6.3 × 10−4   1.3 × 10−3

Among the correlative conjunctions analysed, their use by Chinese speakers is 3.3 times more frequent than by Japanese speakers, and 2.4 times more frequent than their use by Korean speakers.

Chapter 10

Further Methods to Increase the Performance of the Multi-Class Classifier, and Analysis of the Accuracy

A final attempt was made in order to improve the performance of the multi-class classifier, where features used in the native vs non-native problem were found to be effective when added to the multi-class classi- fier, and dimensionality reduction was considered and rejected. The ultimate accuracy of 68.8% is also analysed further by considering the confusion matrix of accuracies.

10.1 Incorporating Features Used in the Native vs Non-Native Problem

Although the features used in the binary native vs non-native problem were not designed with the multi- class problem in mind, they were found to boost the performance of the multi-class classifier significantly. The two binary language model feature types (SLOR score and POS score) as well as the heuristic features used in chapter 6 (branch factor, adverb placement and passive voice) were appended to the multi-class classifier. This increased performance from 64.36% to 68.8%. This new appended classifier will be considered in section 11.2, where specific feature importances will be examined.

10.2 Arguments against Dimensionality Reduction

The motivation for reducing the dimensionality of a classifier is generally to speed up training time and to improve performance by avoiding the "curse of dimensionality": as more dimensions are added to the feature space, the volume of the space increases exponentially, which greatly increases the regions of space over which the classifier has no predictive power (Donoho, p. 16). There was not much motivation to attempt dimensionality reduction in this task, since the number of features (65) is very small in the context of natural language processing classifiers, so training time is fast and dimensionality is low prior to any attempt at dimensionality reduction. Furthermore, the analysis in section 11.2 shows that the vast majority of features were informative and contributed positively to the classifier's performance.

10.3 Confusion Matrix

The confusion matrix displayed in this section shows that entries in the Korean class were classified with the greatest accuracy, with the Chinese class closely behind. Classification accuracy of the Japanese class is the lowest, at 66%.

Chapter 11

Feature Analysis of the Multi-Class Classifier

11.1 Methods used during Feature Analysis

Before commencing with the feature analysis portion of this chapter, the basic methods used to analyse the effectiveness of each feature and the correlations between features are outlined. These methods are, respectively, information gain and continuous mutual information.

11.1.1 Information Gain

Information gain, which was covered in section 4.3.2 and justified as a method of ranking features in section 4.3.3, is used here to judge the effectiveness of features.

11.1.2 Continuous Mutual Information

Informally, the mutual information I(A, B) between feature A and feature B is the amount of information that the two features share. Formally, it is the entropy of feature A minus the entropy of A given that B is already known:

I(A, B) = H(A) − H(A|B)

In the continuous case it is calculated as

I(A, B) = ∫A ∫B p(a, b) log( p(a, b) / (p(a) p(b)) ) da db

where p(a, b) is the joint probability density function of A and B, and p(a) and p(b) are the marginal probability density functions. Mutual information is useful for computing how correlated features are, and is used in this chapter as a metric for investigating this. The Python library MinePy (Albanese et al., 2012) was used, where the score between every pair of features was computed across the cross-validation set.
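A sketch of the pairwise computation is given below; it uses minepy's MINE estimator, taking its mic() score as the pairwise association measure, and assumes the features are available as columns of a NumPy array. The parameter values and function names are illustrative.

import numpy as np
from minepy import MINE

def pairwise_scores(X):
    """Association score for every pair of feature columns of X (n_samples, n_features)."""
    mine = MINE(alpha=0.6, c=15)      # default MINE parameters
    scores = {}
    n_features = X.shape[1]
    for i in range(n_features):
        for j in range(i + 1, n_features):
            mine.compute_score(X[:, i], X[:, j])
            scores[(i, j)] = mine.mic()
    return scores

X = np.random.rand(500, 5)            # stand-in for the 65-feature cross-validation matrix
top_pairs = sorted(pairwise_scores(X).items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_pairs)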

11.2 Feature Analysis

Despite the fact that feature elimination was not explored in this task, statistics on feature importances and correlations are provided here for the sake of insight, and to inform future research in this area where feature elimination may be utilised.

11.2.1 Using Information Gain to Rank all Features

Feature analysis was performed on the finished classifier using the cross-validation set in order to gauge how important each feature is to the final performance score of 68.8%. The graph of all features ranked by their information gain is given below.1

Figure 11.1: Ranked feature importances in the multi-class classifier.

All features in the classifier have nonzero information gain and so are informative in some way, and no single feature or small set of features dominates all of the others. The 20 most important features according to the random forest classifier are listed below.

The three most important features were found to be those from the native vs non-native problem (in chapters 5 and 6), whereas features from the effective STOPW family do not appear in the top 20 at all. The reason for this may be that an individual stopword feature is ineffective on its own; however, collectively as a group of 30 features they contribute significantly to the accuracy. Therefore, ranking importances on the level of individual features does not account for large, effective feature families formed of features which are not significant individually.

1Code used to generate this graph was taken from http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Feature name                       Information Gain   Family
SLOR model                         0.050421           Native vs Non-Native
Sentence length                    0.049080           Native vs Non-Native
Native/non-native POS model        0.043027           Native vs Non-Native
Comma parenthesis whitespace       0.038883           LCHECK
Preposition probability Korean     0.037695           PREP
Preposition probability Japanese   0.036206           PREP
Verb probability Japanese          0.033989           VERB
Preposition probability Chinese    0.033860           PREP
Verb probability Korean            0.031796           VERB
Branching factor                   0.027869           Native vs Non-Native
Verb probability Chinese           0.027479           VERB
Vowel insert distance              0.026313           SPELL
Chinese POS model                  0.025366           POS
Korean POS model                   0.025152           POS
Excess whitespace                  0.024170           LCHECK
Swap distance                      0.024088           SPELL
Japanese POS model                 0.024032           POS
General Levenshtein distance       0.023416           SPELL
Sentence start lowercase           0.023357           LCHECK
Delete distance                    0.019686           SPELL

Figure 11.2: The 20 most informative features ranked in descending order.

11.2.2 Using Information Gain to Rank Feature Families

Family Name            Total Information Gain
Native vs Non-Native   0.22462
STOPW                  0.22381
LCHECK                 0.13389
SPELL                  0.13920
LCHECK                 0.11589
PREP                   0.11110
VERB                   0.094692
POS                    0.075497
PPPOS                  0.01518

Figure 11.3: The most informative families ranked in descending order.

The STOPW family is evidently very informative, despite none of this family’s features appearing in the 20 most informative features.

11.3 Feature Correlation

The mutual information between all (65 choose 2) = 2080 possible feature pairs was computed and plotted as a histogram in order to obtain a general overview of how the correlations between features are distributed. The histogram shows a multi-modal distribution, with one large peak close to 0 and another peak at 0.05. The distribution also shows that the vast majority of feature pairs are not highly correlated, with a negligible number of pairs having mutual information higher than 0.2.

The table below shows the ten most correlated pairs. From this small subset of pairs, we can see that the most correlated pairs tend to be features which are intuitively very similar. The table also shows that the Chinese, Korean and Japanese POS model features are very highly correlated, as was seen in section 5.4.

Feature pair                                                       Mutual information
Korean POS model / Japanese POS model                              0.97099
Chinese POS model / Japanese POS model                             0.81211
Korean POS model / Chinese POS model                               0.80754
Verb probability Korean / Verb probability Japanese                0.71941
Verb probability Chinese / Verb probability Japanese               0.61863
Verb probability Korean / Verb probability Chinese                 0.57702
Delete distance / Delete consonant distance                        0.51155
Preposition probability Chinese / Preposition probability Japanese 0.45439
Delete distance / Delete vowel distance                            0.39565
Preposition probability Korean / Preposition probability Japanese  0.38375

Figure 11.4: The ten most correlated feature pairs, ranked in descending order of mutual information.

Chapter 12

Unused and Unsuccessful Methods

Over the course of this project, many avenues were explored and found to be interesting and discriminative, but exposed patterns which appeared too infrequently in the data to be practical. The most interesting are included briefly in this chapter.

12.1 Using Simple Methods to Count Preposition Errors

A list of 15 very common English verbs which require the addition of a preposition when used with an object was compiled. Sentences in the learner corpora were then parsed using NLTK's POS tagger, and a "missing preposition" error was detected when one of the verbs requiring a preposition was followed by a noun, determiner or pronoun without a preposition. An example would be "Listen music", where the correct expression is "Listen to music". Across each of the three classes, it was found that between 12% and 14% of occurrences of the 15 words were missing a preposition according to this metric. The conclusion is that either there is not a large difference in "missing preposition" errors across the classes, or the methodology was not adequate.
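A sketch of this detection rule is shown below; the verb list is a hypothetical stand-in for the 15 verbs compiled in the project, and the tag set used for the following word is an assumption. (It requires the NLTK "punkt" and "averaged_perceptron_tagger" data.)

import nltk

# hypothetical stand-in for the 15 verbs that require a preposition before an object
PREP_VERBS = {"listen", "go", "wait", "belong", "apologise"}
OBJECT_TAGS = {"NN", "NNS", "NNP", "PRP", "DT"}

def missing_preposition(sentence):
    """Flag a verb from PREP_VERBS directly followed by a noun, determiner or pronoun."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for (word, tag), (_, next_tag) in zip(tagged, tagged[1:]):
        if tag.startswith("VB") and word.lower() in PREP_VERBS and next_tag in OBJECT_TAGS:
            return True
    return False

print(missing_preposition("I listen music every day."))      # True
print(missing_preposition("I listen to music every day."))   # False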

12.2 Overuse of "to me"

Bauman (2006) lists the addition of "to me" after an opinionated expression as ubiquitous in English speech among Koreans. An example listed by Bauman is the expression "This pizza tastes good to me". This quirk was investigated using NLTK's POS tagger, by counting strings containing the POS substring "JJ to PRP", meaning an adjective followed by "to" followed by a pronoun. Results showed that this grammatical pattern is indeed more common among Korean speakers than in the other non-native classes. The raw count in each class, normalised by the number of words in the class, is shown in the table below.

Non-Native Class   Raw Count   Normalised by # Words
Korean             464         3.2 × 10−4
Chinese            273         1.8 × 10−4
Japanese           109         1.0 × 10−4

12.3 Incorrect Tense used with Modal Verb

While detecting tense errors with high recall is in general a difficult problem, there are various consistent tense rules in the English language which can be captured easily using a flat POS parse. One of these is that a modal verb (such as "should" or "can") must be followed by a verb in its infinitive form; for example, "I should going" is incorrect. Incorrect patterns of this form were detected by looking for substrings of the pattern MD [VBG/VBZ/VBN/VBD]; that is, a modal verb followed by a verb not in infinitive form. It was found that Chinese speakers committed this type of error most often. The table below reports the number of sentences in which this incorrect pattern was detected.

Non-Native Class   Raw Count   Normalised by # Sentences
Korean             189         6.4 × 10−4
Chinese            275         1.1 × 10−3
Japanese           106         4.3 × 10−4

12.4 "One of" Followed by Singular Noun

This type of mistake was investigated using a dependency tree structure; in general, the nodes corresponding to "one", "of" and the singular noun appear as a chain in the tree, as shown below. A depth-first search was performed on the tree to look for chains of this type. Below, the clause "Because one of my favorite view is fall landscape" is shown, with the chain highlighted.

is
    Because
    one
        of
            view
                my
                favorite
    landscape
        fall

Non-Native Class   Raw Count   Normalised by # Sentences
Korean             267         9.1 × 10−4
Chinese            164         6.5 × 10−4
Japanese           251         1.0 × 10−3
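A minimal sketch of this dependency-chain search using spaCy is given below; it is illustrative rather than the project's exact implementation, and whether the chain is found depends on the parser's analysis.

import spacy

nlp = spacy.load("en_core_web_sm")

def one_of_singular(sentence):
    """Detect the chain one -> of -> singular noun (tag NN) in the dependency tree."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.lower_ == "one":
            for prep in tok.children:
                if prep.lower_ == "of" and prep.dep_ == "prep":
                    for pobj in prep.children:
                        if pobj.dep_ == "pobj" and pobj.tag_ == "NN":
                            return True
    return False

print(one_of_singular("Because one of my favorite view is fall landscape."))   # True (parse-dependent)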

Chapter 13

Conclusion

In this project, machine learning techniques were used to approach the task of discriminating between native and non-native English users as well as among non-native English users whose native languages are Korean, Chinese and Japanese. The native vs non-native problem was solved effectively using both language model features and hand-picked features designed using linguistic knowledge, with accuracy scores of 94.44 − 97.98% and 88.69 − 94.74% respectively. The most effective classifier in the native vs non-native problem was built from combining these two feature types, resulting in accuracies of 97.17 − 99.19%.

In the multi-class task, POS language models alone were found to give a low accuracy score of 41.96%. Using methods which were broadly inspired by previous research, namely capturing spelling error types, stopword frequencies and using an error correction library to count frequencies of different error types brought the accuracy up to 61.13%. Further previously unexplored techniques developed in order to capture verb and preposition usage brought the accuracy up further to 64.36%. Afterwards, it was found that appending the feature types specifically designed for the native vs non-native problem significantly improved performance on the multi-class classifier, bringing the performance up to the final score of 68.8%.

While the accuracies reported in this project could be considered acceptable, they do not meet or exceed the accuracies reported in research such as Tetreault et al. (2012), who used the large ICLE corpus to discriminate speakers among whom were Chinese, Japanese and Korean native speakers, and did so with accuracies of up to 86% while making more extensive use of statistical language model features. It is likely that if the language models used in this project were trained on bigger data, and if character n-gram features were also utilised, the classification accuracy of 68.8% could be improved to match previous research.

13.0.1 Insights Gained

A secondary objective of this task was to report on differences between the writing style and structure used by speakers of the different classes. In the case of the native vs non-native problem, stark differences were found, although these differences should be interpreted with the disclaimer that the native data and non-native data were taken from different domains: the former from Slate magazine, and the latter from the Lang-8 website. While the differences found in writing between the native and non-native speakers make sense from a linguistic perspective, it is possible that these differences are more reflective of discrepancies between the domains of the data than of inherent linguistic differences. It would be interesting to perform the same analysis on native and non-native English writing from the same domain to verify that the differences still hold. Nevertheless, the following insights were found in the native vs non-native problem:

• Speakers in the native class produce more right-branching grammar trees than in the non-native class.

• Speakers in the native class place adverbs at the end of sentences far less often than in the non-native class.

• Passive voice is much more heavily used in the native class.

In the case of the non-native multi-class problem, speakers in each class belong to the same domain (the Lang-8 learner corpus), and so it can be said with more confidence that the differences found here likely reflect true inherent differences in how different non-native English users write, and that these differences likely come from L1 interference. They are as follows:

• Speakers across different classes commit spelling errors with roughly the same frequency and severity, however some specific types of errors are far more frequent in certain classes. For example, Chinese speakers are most likely to insert superfluous vowels, whereas Korean speakers are most likely to commit swap errors.

• Chinese speakers make the heaviest use of correlative conjunctions across the types of correlative conjunctions investigated.

• Japanese speakers are least likely to place prepositional phrases before verb phrases.

13.1 Motivation for Future Research

This project leaves several areas only partially explored and leads to several more unexplored problems. They are outlined below.

• Test the generalisation of this classifier on a different corpus such as ICLE or TOEFL11. It was shown in Brooke and Hirst (2011) that classifiers built using the ICLE corpus do not generalise well to the Lang-8 corpus. It would be interesting to test the generalisation of this classifier, which was built using the Lang-8 corpus, on another non-native English corpus such as ICLE or TOEFL11.

• Use language model confidence score features from models which are trained on a larger amount of data in the multi-class case, and utilise more types of features engineered using language models, as in Tetreault et al. (2012). The main effort of this project was to use more targeted and insightful feature selection; however, the result is that the classifier accuracy falls short of accuracies reported in previous research which makes heavier use of simple statistical models.

• Provide a more rigorous definition of topic bias, and what causes it. Assumptions regarding topic bias are made throughout this project, which were outlined in chapter 1. It would be useful to determine whether topic influences grammatical structure in a significant way. This would be a difficult task, since grammar can vary in many different ways. Such an experiment could be conducted by viewing the usage of grammatical function words across different domains where the only variable factor is topic, as well as by training a POS language model on a large corpus encompassing many topics, and comparing scores across each topic domain on a separate corpus.

• Explore further the methods which did not help the performance on the classifier despite discriminating classes well; try to find more examples of these cases to use as features so as to make their occurrences less sparse. The prime example is the use of correlative conjunctions, which discriminated Chinese speakers well but the appearance of correlative conjunctions was too sparse to contribute to classification accuracy.

Appendix A

Tagsets

A.1 Table of Phrase-Level POS Tags

Various phrase-level tags appear in constituency trees throughout this project. Their meanings are listed here.

Tag    Description
ADJP   Adjective phrase
NP     Noun phrase
POS    Possessive ending ('s)
PP     Prepositional phrase
S      Sentence
SBAR   Subordinating conjunction
VP     Verb phrase

A.2 Table of Word-Level POS Tags

This table from Santorini (1991) is to be used as a reference for the reader when word-level POS tags are used throughout this project.

Tag    Description
CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NP     Proper noun, singular
NPS    Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
TO     to
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb

Figure A.1: List of the Treebank POS tagset. See Santorini (1991) for a more thorough description of each tag.

Bibliography

A. Agresti. Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley, 2003. ISBN 9780471458760.

D. Albanese, M. Filosi, R. Visintainer, S. Riccadonna, G. Jurman, and C. Furlanello. minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 2012. doi: 10.1093/bioinformatics/bts707.

S. Argamon-Engelson, M. Koppel, and G. Avneri. Style-based text categorization: What newspaper am I reading? Proceedings of the AAAI Workshop on Text Categorization, pages 1–4, 1998.

N. R. Bauman. A Catalogue of Errors Made by Korean Learners of English. Proceedings of the 14th annual Korea TESOL international conference, pages 167–176, 2006.

J. Brooke and G. Hirst. Native language detection with ’cheap’ learner corpora. In In Learner Corpus Research 2011 (LCR 2011), Louvain-la-Neuve, 2011.

L. Brown and J. Yeon. The Handbook of Korean Linguistics. Wiley-Blackwell, 2015. ISBN 9781118354919.

L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. CoRR, abs/1309.0238, 2013. URL http://arxiv.org/abs/1309.0238.

E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. 2005.

N. Chomsky. Syntactic Structures. page 27, 1957.

D. L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. In AMS Conference on Math Challenges of the 21st Century, 2000.

E. Francis. Grammatical weight and extraposition in English. Cognitive Linguistics, 21: 37–45, 2010.

P. Fung and W. K. Liu. Fast Accent Identification and Accented Speech Recognition. 1999.

K. Heafield. KenLM: faster and smaller language model queries. WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, 2011.

G. Hicham, Y. Abdallah, and B. Mustapha. Introduction of the weight edition errors in the Levenshtein distance. International Journal of Advanced Research in Artificial Intelligence (IJARAI), 2012.

M. Honnibal and M. Johnson. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL https://aclweb.org/anthology/D/D15/D15-1162.

N. Ide and K. Suderman. The American National Corpus first release. In Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), pages 1681–1684, 2004.

E. Kochmar. Identification of a Writer’s Native Language by Error Analysis. University of Cambridge Master’s Thesis, 2011.

M. Koppel, J. Schler, and K. Zigdon. Determining an author’s native language by mining a text for errors. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005.

H. Kubozono. Handbook of Japanese Phonetics and Phonology. De Gruyter, 2015. ISBN 9781614511984.

J. H. Lau, A. Clark, and S. Lappin. Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge. Cognitive Science, pages 20–21, 2016.

V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707, February 1966.

E. Loper and S. Bird. NLTK: The Natural Language Toolkit. 2002.

T. Mizumoto, M. Komachi, M. Nagata, and Y. Matsumoto. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pp.147-155. Chiang Mai, Thailand, 2011.

M. Miłkowski. Developing an open-source, rule-based proofreading tool. 2010.

R. Nagata, M. Vilenius, and E. Whittaker. Correcting Preposition Errors in Learner English Using Error Case Frames and Feedback Messages. Proceedings of The 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

T. Osborne, M. Putnam, and T. Groß. Bare phrase structure, label-less trees, and specifier-less syntax: Is Minimalism becoming a dependency grammar? The Linguistic Review, pages 315–364, 2011.

K. Percival. Reflections on the history of dependency notions in linguistics. Historiographia Linguistica, 17:29–47, 1990.

R. Perkins. Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis. 2014.

J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986. ISSN 0885-6125. doi: 10.1023/A:1022643204877. URL http://dx.doi.org/10.1023/A:1022643204877.

M. Randall and N. D. Isnin. The Effects of Mother Tongue and First Language Literacy on Spelling. Centre for Research in Pedagogy and Practice, 2004.

B. Santorini. Part-of-Speech Tagging Guidelines for the Penn Treebank Project. 1991.

C. Strobl, A. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25, 2007. ISSN 1471-2105. doi: 10.1186/1471-2105-8-25. URL http://dx.doi.org/10.1186/1471-2105-8-25.

T. Tajiri, M. Komachi, and Y. Matsumoto. Tense and aspect error correction for ESL learners using global context. Proceeding ACL ’12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, 2:198–202, 2012.

J. Tetreault, D. Blanchard, A. Cahill, and M. Chodorow. Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification. Hunter College and the Graduate Center, City University of New York, 2012.

L. Tomokiyo and R. Jones. You’re not from ’round here, are you?: Naive Bayes detection of non-native utterance text. Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies (NAACL ’01), 2001.
