
Using Machine Learning Techniques to Identify the Native Language of an English User

Jake Browning

4th Year Project Report
Computer Science and Mathematics
School of Informatics
University of Edinburgh

April 2017

Acknowledgements

I thank my supervisor, Dr. Shay Cohen, for his invaluable insights and advice during the undertaking of this project.

Contents

0.1 Abstract

1 Introduction
  1.1 Previous Work
    1.1.1 Areas Preceding Native Language Identification
    1.1.2 Native Language Identification
  1.2 Motivation
  1.3 Outline and Goals
    1.3.1 Outline
    1.3.2 Overview of Goals
  1.4 A Note on Topic Bias

2 Description of the Data
  2.1 Corpus of Non-Native English
  2.2 Corpus of Native English
  2.3 The Lang-8 Corpus vs other English Learner Corpora
  2.4 Training and Testing Sets
    2.4.1 Native vs Non-Native
    2.4.2 Multi-Class Non-Native
  2.5 General Information on Each Class

3 Differing Notions of Grammatical Structure, and their Uses in this Task
  3.1 Constituency
  3.2 Dependency Grammar
  3.3 POS Tagset and Tagging Software
    3.3.1 Tagset

4 Classification Methods
  4.1 Gaussian Naive Bayes
  4.2 Logistic Regression
  4.3 Decision Trees and Random Forests
    4.3.1 Entropy
    4.3.2 Information Gain
    4.3.3 Using Information Gain to Rank Features
    4.3.4 Random Forests
  4.4 Normalisation

5 Using Language Models as Classification Features
  5.1 Description of a Language Model, and its Application to this Task
  5.2 Description of Language Models Tested
    5.2.1 The SLOR language model
    5.2.2 Potential issues with the SLOR model
    5.2.3 The POS language model
  5.3 Using Language Models to Classify Native vs Non-Native
    5.3.1 Results
    5.3.2 Summary of Results
  5.4 Using Language Models to Classify among Non-Natives

6 A Heuristic Approach to Classifying Native vs Non-Native
  6.1 Adverb Placement
  6.2 Left and Right Branching
    6.2.1 What is "branching"?
    6.2.2 Capturing Differences in Branching
    6.2.3 Passive Voice
    6.2.4 Sentence Length
    6.2.5 Results
  6.3 Combining the "Language Model" and "Heuristic" Features

7 Feature "Families" Used in the Multi-Class Problem

8 Using Techniques Influenced by Previous L1 Identification Research to Improve the Multi-Class Classification Problem
  8.1 Capturing and Categorising Spelling Errors
    8.1.1 Levenshtein Distance
    8.1.2 Using Levenshtein Distance to Capture Error Types
    8.1.3 Results and Feature Importance
  8.2 Using an Error Correction Library to Detect Errors
    8.2.1 Results
  8.3 Using Stopwords as Features
    8.3.1 Stopwords
    8.3.2 Method
    8.3.3 Results

9 Investigating New Feature Types in the Multi-Class Problem
  9.1 Capturing the Usage of Prepositions
    9.1.1 The "Caseframe" Method
    9.1.2 Justifications for Not Explicitly Counting Errors
    9.1.3 Using an Ensemble Method to Capture Differences in Preposition Usage
    9.1.4 Results
  9.2 Capturing the Usage of in the Global Context
    9.2.1 Simplifications to the Model Proposed by Tajiri et al
    9.2.2 Method
    9.2.3 Results
  9.3 Placement of Prepositional Phrases
    9.3.1 Method
    9.3.2 Results
  9.4 Correlative Conjunctions
    9.4.1 Linking Adjectives and Phrases with Conjunctions
    9.4.2 Correlative Conjunctions Spanning Multiple
    9.4.3 Frequencies

10 Further Methods to Increase the Performance of the Multi-Class Classifier, and Analysis of the Accuracy
  10.1 Incorporating Features Used in the Native vs Non-Native Problem
  10.2 Arguments against Dimensionality Reduction
  10.3 Confusion Matrix

11 Feature Analysis of the Multi-Class Classifier
  11.1 Methods used during Feature Analysis
    11.1.1 Information Gain
    11.1.2 Continuous Mutual Information
  11.2 Feature Analysis
    11.2.1 Using Information Gain to Rank all Features
    11.2.2 Using Information Gain to Rank Feature Families
  11.3 Feature Correlation

12 Unused and Unsuccessful Methods
  12.1 Using Simple Methods to Count Preposition Errors
  12.2 Overuse of "to me"
  12.3 Incorrect Tense used with Modal
  12.4 "One of" Followed by Singular Noun

13 Conclusion
  13.0.1 Insights Gained
  13.1 Motivation for Future Research

A Tagsets
  A.1 Table of Phrase-Level POS Tags
  A.2 Table of Word-Level POS Tags

Bibliography

Abstract

0.1 Abstract

In this project, we perform native language identification on English text written by native speakers of English, Korean, Chinese and Japanese using machine learning techniques. Statistical language models, error type analysis, and investigations into the differing styles and grammatical structure of text written by authors in each class are utilised as features in the classification task. We build a binary classifier which discriminates natives from non-natives with accuracies ranging from 97.17% to 99.19%, and a multi-class classifier which discriminates among Korean, Chinese and Japanese speakers with 68.8% accuracy. We also emphasise and report on the differences in structure and style found in texts across the four classes.

Chapter 1

Introduction

L1 interference is a highly researched area of linguistics concerned with studying the various ways in which a non-native user of English is influenced by their native language in their style of speech, as well as the frequency and types of mistakes that they make. Within the past sixteen years, the area of "native language detection" has become well established in computational linguistics, and involves using machine learning techniques to determine the native language of an English user. In this task, machine learning techniques were used to classify the native languages of native English, Korean, Chinese and Japanese speakers according to their use of English. The challenge of detecting whether a writer is native was found to be very different from that of determining the native language among non-native writers, with each requiring a different set of features. Therefore the problem proposed in this report was tackled as two sub-problems: detecting native vs non-native as a binary classification problem, and discriminating among native speakers of Chinese, Japanese and Korean as a multi-class classification problem.

In the native vs non-native problem, using language model confidence scores as a single feature was found to be very effective with accuracies of up to 97.98%. Building a separate classifier formed of more insightfully picked features which target specific differences in style between native and non-native English writing was also found to be effective, with accuracies of up to 94.74%. The most effective classifier in the native vs non-native problem was formed by combining the features in both classifiers, yielding accuracies of up to 99.19%.

Using language models as features in the multi-class non-native problem showed much lower performance at 41.96%. The low-performance multi-class classifier was built upon through utilising other feature types used in previous research, as well as through introducing several new previously unexplored feature types, and through this the multi-class accuracy was ultimately improved to 68.8%.

Aside from the practical achievement of providing classifiers which solve both sub-problems with acceptable accuracy, the research involved in feature engineering during this project uncovered some prominent differences in the style of writing and grammatical structure between English texts written by speakers in the English, Korean, Chinese and Japanese classes. While previous research in native language identification has largely focused on accuracy as an end to which feature engineering is a means, in this project the insights gained regarding the differences in style and structure across writers in each class will be emphasised.

1.1 Previous Work

1.1.1 Areas Preceding Native Language Identification

The area of L1 identification follows, and was originally largely motivated by, the areas of authorship attribution and accent recognition. Classifiers built in accent recognition research, such as Fung and Liu (1999), used acoustic features from audio recordings to capture the frequencies of phone classes such as nasals and affricates, in order to classify the accent type (and often, indirectly, the native language) of an English speaker. Indeed, research in L1 identification was first proposed by Tomokiyo and Jones (2001) as an alternative to accent recognition for cases where a transcript is available but no audio recording.

From the first published paper in native language identification until today, research in this area has used techniques heavily influenced by general authorship attribution problems, which include gender identification, liar identification and publisher identification among others. For example Argamon-Engelson et al. (1998) used frequencies of types of function words to identify the publishers of news articles. This technique carried on into L1 identification, where the majority of research also includes function word and stopword types as features.

1.1.2 Native Language Identification

As previously discussed, the first published research in this area is by Tomokiyo and Jones (2001), who used Naive Bayes to classify the native languages of 45 speakers of English, Japanese and Korean with accuracies of up to 100%, using frequencies of stopwords as features. While the methods employed in this research were effective, it would be difficult to draw conclusions about native speakers of these languages in general from such a limited group of writers. Following the release of the much larger International Corpus of Learner English (ICLE), Koppel et al. (2005) demonstrated that analysing the types of errors made in English text is an effective way of classifying a speaker's native language. Koppel et al. used several categories of spelling errors as features, as well as various syntactic error types and bigram language model probabilities, to classify speakers of Czech, French, Bulgarian, Russian and Spanish. Using these error types as features allowed the classifier to discriminate among speakers of these classes with accuracies above 80%, which clearly demonstrated that using error types as features to classify a speaker's native language is highly effective.

This research was built upon in some respects by Kochmar (2011), who selected features by identifying and targeting errors made by speakers of specific Indo-European languages when using English. Kochmar analysed not only error types and rates, but also their distribution throughout learner text. Multiple methods of detecting error types were also tested, including the use of a learner corpus pre-tagged with error occurrences as well as a trigram error detection model trained from native English data. This method was highly effective at classifying speakers of very closely related languages, with Danish and Swedish speakers being classified with 100% accuracy using character n-grams to detect

spelling errors. This research demonstrated that speakers of languages whose L1-influenced errors in English are likely to be very similar can still be correctly discriminated.

While continuing the paradigm of error analysis, Tetreault et al. (2012) used all of the features outlined by Koppel et al. (2005), and also showed the effectiveness of using several types of language model perplexity scores to classify native speakers of eleven languages using the TOEFL11 corpus. Tetreault et al. also attempted to capture differing frequencies of errors, as well as differing writing styles and discourse, by using an automatic essay scoring system which rates learner text as a feature. This technique managed to classify Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu and Turkish speakers with accuracies of up to 86%.

1.2 Motivation

The application of research pursued in this area is largely related to language education. In the context of error analysis, the systems built during the construction of the classifier, as well as the classifier itself, may lead to error correction systems which provide more targeted and specific error feedback to English users. For example, most spell-checking systems simply report a spelling as wrong; it would be more fruitful to identify the type and the root of the error as an influence from the writer's native language, in order to provide helpful feedback. Furthermore, exploring the L1 identification problem acts as a proxy for understanding the fundamental differences in the types of mistakes that non-native English speakers tend to make, which will lead to more guided education in general.

In the context of both error analysis and stylistic differences, uncovering the differences in grammatical structure and style among English users of different linguistic backgrounds is an inherently interesting linguistic problem.

Finally, it has been suggested that L1 identification could help to identify anonymous authors of text in situations such as criminal investigation (Perkins, 2014).

1.3 Outline and Goals

1.3.1 Outline

This project used a subset of the techniques which were successful in previous research, and also explored new feature types, outlined below, which increased the accuracy of the classifier. However, these proposed feature types proved not to be as effective as the full range of techniques used in previous research.

Native vs Non-Native: As explored in Kochmar (2011), language model confidence scores were used as features in a simple classifier with a single feature. Afterwards, a classifier using more complex and

targeted features was built to discriminate aspects of native and non-native speech. Feature types such as right/left branch lengths in grammar trees and adverb dispersion, which have not been explored in previous research, were tested, and this classifier was compared against the basic language model classifier. Sentence length, which was also explored in Kochmar (2011), was found to be the simplest but most effective feature in this classifier.

Multi-Class Non-Native (Chinese/Korean/Japanese): Unlike the binary native vs non-native problem, using language model confidence scores as features was not effective. Heuristically chosen features were appended to the classifier in order to improve the performance, including techniques from previous L1 identification research such as the use of stopwords, counting punctuation and grammatical errors, and spelling error types. New techniques drawn from unrelated research were also used to capture verb and preposition usage, which improved accuracy, and various feature types related to stylistic usage were also explored. Ultimately the feature types used in the native vs non-native problem were also found to improve the performance of the multi-class classifier.

1.3.2 Overview of Goals

Upon completion of the project, the following goals should be satisfied.

• Build a native vs non-native binary classifier which discriminates the two classes with reasonable accuracy.

• Build a multi-class Korean/Chinese/Japanese classifier which discriminates the three classes with reasonable accuracy.

• Give evidence of any differences in the style or structure of writing between native speakers of English, Korean, Chinese and Japanese which emerged from the feature selection process.

• Perform an analysis of the features used in the multi-class problem, reporting on which features were most important to the classification task and which were redundant.

1.4 A Note on Topic Bias

Topic bias can be described as the influence of the topic or topics discussed in a piece of text on its classification. It is favourable to avoid topic bias for multiple reasons. Firstly, it makes the classifier more general and potentially more effective at classifying a wider range of unseen data; the training data contains only a limited range of topics, and so if the classifier learns to discriminate between classes based on topics in the training data then it may perform poorly on future data which contains new topics. Secondly, not accounting for topic bias allows the classifier to be "tricked" easily. For example, if the classifier learns to classify Japanese speakers on the basis that they are more likely to write the words "Japan" and "sushi", it would be easy to manipulate the classifier into classifying text as Japanese by inserting random instances of the words "Japan" and "sushi" into the text.

Lau et al. (2016) approach the task of removing topic bias as equivalent to removing content words, and this approach will be used here. This assumption, alongside two others, is listed below:

• Removing content words is considered equivalent to removing topic bias.

• The influence of the topic on grammatical structure is negligible.

• The influence of the topic on the frequency of extremely common content verbs such as "be" and "go" is negligible.

Chapter 2

Description of the Data

2.1 Corpus of Non-Native English

Tokyo Metropolitan University has provided a corpus of non-native English text which was scraped from lang-8.com for use in Mizumoto et al. (2011), whose research was related to automatic correction of Japanese text. Lang-8.com is a website with over 750,000 users where language learners submit a mixture of diary and essay style text to be proof-read. Each entry, which is 12 sentences long on average, was represented as a data point in the classification task. 20,000 entries each from Japanese, Korean and Chinese speakers were used, which amounts to 60,000 entries in total.

2.2 Corpus of Native English

The Open American National Corpus (Ide and Suderman, 2004) provides a collection of entries in Slate, which is an online current affairs magazine covering topics related to art, cooking, politics, culture, sport and news. Slate currently hires 46 in-house editors and columnists, and also accepts articles written by the public. The corpus used in this task contains 4,351 entries in the format of articles, which were merged and broken down into 18,687 entries each averaging 10 sentences long, in order to be comparable to the non-native corpus.

2.3 The Lang-8 Corpus vs other English Learner Corpora

The Lang-8 corpus used in this task is composed of diary-style entries written recreationally by English learners in their own time, whereas the corpora used in previous L1 identification research are mostly composed of essays written by students under exam conditions. Brooke and Hirst (2011) discovered that accuracy scores on an L1 identification classifier trained using the Lang-8 corpus were 15% lower than when trained on the ICLE corpus, and conjectured the above explanation as a possible reason; the allowance of Lang-8 writers to stay within their comfort zones and write in their own time resulted in fewer errors than in the ICLE corpus. Brooke and Hirst also found that classifiers trained on both corpora did not generalise well when tested on the other corpus, and concluded that the exam-style essay text in the ICLE corpus does not generalise well to non-native speech in general.

2.4 Training and Testing Sets

2.4.1 Native vs Non-Native

The native vs non-native problem was framed as three separate binary classification problems: native vs Korean, native vs Chinese and native vs Japanese. For each non-native class, the 20,000 entries in that class were reduced to 18,687 in order to be balanced with the native class. Accuracy results shown for each binary classifier in the native vs non-native task are 5-fold cross-validation results across all of the 18,687 × 2 = 37,374 entries.

Whenever other statistics related to the dataset in the binary problem are produced, they are also across all of the entries.

2.4.2 Multi-Class Non-Native

The dataset in the multi-class problem contains 20,000 × 3 = 60,000 entries. However, since in this task various secondary methods were required to be trained on a portion of the dataset, a balanced 40% portion of the dataset was held out for training ensemble classifiers and language models in order to prevent overfitting. A further 20% of the corpus was withheld as a validation set for the aforementioned methods and for general tuning, and the remaining 40% (18,000 entries) was used as the cross-validation training/testing set: all accuracy results shown in the multi-class problem are 5-fold cross-validation results performed on this 40% portion.

Whenever other statistics related to features in the multi-class problem are shown, they are across the 60% held-out set. This is because unlike in the native vs non-native problem, various statistical results were used during the feature selection process in order to pick effective features. Using the cross-validation set to produce these statistics would put the classifier at risk of overfitting.

2.5 General Information on Each Class

Class      # Entries   # Words     # Sentences   Avg # words per entry   Avg # sentences per entry
Native     18687       4,389,008   186,889       235                     10
Korean     20000       2,469,118   292,789       123                     14
Chinese    20000       2,562,562   251,445       128                     13
Japanese   20000       1,142,742   244,351       57                      12

Based on this brief summary of the data, the native class tends to contain a smaller number of comparatively longer sentences, whereas the other classes use more sentences but far fewer words. This is particularly pronounced in the Japanese class, where the average number of sentences per entry is around the same as in the Chinese class but the average number of words per entry is less than half that of the Chinese class.

Chapter 3

Differing Notions of Grammatical Structure, and their Uses in this Task

In linguistics and natural language processing, grammatical structure is represented as a tree either in a "constituency" or "dependency" format. Both representations have different strengths, and both were required in this project. Their definitions and differing usages are given in this chapter.

3.1 Constituency Grammar

Constituency grammar (also known as phrase-structure grammar) was introduced in Chomsky (1957) as a linguistic tool for viewing grammatical structure. Within a constituency grammar, the entire sentence acts as a root node which is broken down by its descendants into noun phrases (NPs), verb phrases (VPs), adjective phrases (ADJPs) and prepositional phrases (PPs). These phrase types differ according to the class of their head word (noun, verb, adjective or preposition respectively), and are broken down further into phrase sub-trees or word-level leaf nodes such as verbs, nouns, prepositions and determiners. Within a constituency grammar, there is no limit on the nesting of phrase-level subtrees, which can result in very large trees which represent comparatively short sentences. Within this project, parse trees were used in constituency form when it was desirable to examine grammatical structure as formed out of types of phrases.

When a constituency parse tree structure was needed, the BLLIP parser (Charniak and Johnson, 2005) (also known as the Charniak-Johnson parser) was used to parse text.

3.2 Dependency Grammar

The idea of modelling grammatical structure as a tree of dependency chains was formalised by Tesnière in 1959 (Percival, 1990), and has been widely used since. In a dependency grammar, the head verb of the sentence acts as a root node. All other words in the sentence are thought of as "dependent" on the head

verb. A word at the nth level of the tree is directly dependent on some word at level n − 1, and indirectly dependent on a word at each level above n − 1, up until the root. In contrast with constituency grammar, where a word corresponds to a leaf element but can also give rise to a single-word phrase subtree, each word in a dependency grammar gives rise to exactly one node in the tree.

[Parse trees for the sentence "Pistachio is one of the few flavours that appeal to me", shown as figures in the original document.]

Figure 3.1: Example of a constituency grammar.

Figure 3.2: Example of a dependency grammar.

Dependency grammars were used in this project where it was useful to capture dependencies between words in a sentence, for example determining the object of a preposition or the subject of a passive phrase. Due to the fact that dependency grammars have a one-to-one correspondence between words and nodes, they are generally more minimal and therefore quicker to navigate (Osborne et al., 2011). Therefore dependency grammars were also used where possible to speed up computation time. The Python library SpaCy (Honnibal and Johnson, 2015) was used to parse sentences into dependency trees throughout the project.
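As an illustration of how such dependency structures can be accessed programmatically, the minimal sketch below parses the sentence from Figure 3.2 with SpaCy and prints each word with its head. The model name en_core_web_sm is an assumption (any English SpaCy model with a parser would do), and the exact label set depends on the model version.

```python
import spacy

# Assumes an English model is installed, e.g. `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Pistachio is one of the few flavours that appeal to me")
for token in doc:
    # Each word corresponds to exactly one node; token.head is the word it depends on,
    # and the root of the sentence is the token whose head is itself.
    print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")
```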

3.3 POS Tagset and Tagging Software

3.3.1 Tagset

The Penn Treebank POS tagset was used throughout this project for word-level POS tags, where the table in appendix A.2 from Santorini (1991) lists them in full. Phrase level tags, which appear alongside word-level tags in constituency trees, are outlined in appendix A.1. When a flat word-by-word parse was required, the POS tagger provided by NLTK (Loper and Bird, 2002) was used.

Chapter 4

Classification Methods

Three classification techniques were used at different stages of the project, due to their differing performances and speeds on different types of data and features. Overall, random forests of decision trees were found to give the best performance on the majority of the classification tasks; however, naive Bayes was found to be advantageous in chapter 5. Logistic regression was chosen during the feature engineering process in order to train ensemble classifiers, due to its speed and effectiveness over high-dimensional data. Naive Bayes and logistic regression will be outlined briefly in this chapter, whereas decision tree learning methods will be covered in more detail as this was the main method of classification; unless stated otherwise, all accuracy scores in the multi-class problem were computed using random forests of decision trees.

The Python library SKLearn (Buitinck et al., 2013) was used in the implementation of all of the classification techniques.

4.1 Gaussian Naive Bayes

Naive Bayes[1] is a probabilistic classification model which is based on Bayes' rule paired with the assumption that all features are statistically independent. Given classes $c_1, \cdots, c_m$ and a feature vector $\mathbf{x} = (x_1, \cdots, x_n)^T$, Bayes' rule applies as

$$P(c_i \mid x_1, \cdots, x_n) = \frac{P(c_i)\, P(x_1, \cdots, x_n \mid c_i)}{P(x_1, \cdots, x_n)}$$

Therefore the class of feature vector $\mathbf{x}$ can be predicted through calculating the probability of each class $c_i$ given $\mathbf{x}$, and choosing the maximum probability. Now several simplifications can be made, with the first and most important being that the independence assumption between features allows us to rewrite the following for each $x_j$ and class $c_i$:

$$P(x_j \mid c_i, x_1, \cdots, x_{j-1}, x_{j+1}, \cdots, x_n) = P(x_j \mid c_i)$$

Applying this assumption to all $x_j$, we have that for class $c_i$,

$$P(x_1, \cdots, x_n \mid c_i) = \prod_{j=1}^{n} P(x_j \mid c_i)$$

[1] Information in this section on Naive Bayes is partly taken from http://scikit-learn.org/stable/modules/naive_bayes.html

Two further simplifications can be made: firstly, the denominator $P(x_1, \cdots, x_n)$ is not variable across different classes and so does not change the class with maximum probability. Furthermore, in this specific task all classes are balanced and so $P(c_i)$ is constant across all classes, since this term is simply the frequency of datapoints of each class in the dataset. Therefore we have

$$P(c_i \mid x_1, \cdots, x_n) \propto \prod_{j=1}^{n} P(x_j \mid c_i)$$

Therefore taking this product for every class and choosing the maximum class gives us a naive Bayes class prediction.

In Gaussian Naive Bayes, a Gaussian distribution is fitted to each feature for each class, using the per-class mean and variance vectors of the training dataset, in order to estimate the probabilities $P(x_j \mid c_i)$. Gaussian Naive Bayes was chosen over discrete variants of Naive Bayes, since in this project it was applied to a classification task whose features are continuous.
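A minimal sketch of how such a classifier can be applied with SKLearn is shown below; the feature matrix here is a random placeholder standing in for continuous features such as language model scores, not the project's actual data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(200, 2)               # placeholder: two continuous features per entry
y = rng.randint(0, 2, 200)          # placeholder: 0 = native, 1 = non-native

clf = GaussianNB()                  # fits a per-class, per-feature Gaussian
print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```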

4.2 Logistic Regression

Logistic regression (Agresti, 2003, p. 165) is a binary classification method which assigns one class as 1 and the second as 0, and discriminates them using a linear decision boundary. The decision boundary is trained by constructing an optimum weight vector $\hat{\mathbf{w}}$, which is learned by maximising the log likelihood of the training data $D = ((\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_n, y_n))$ (represented as datapoint/class pairs) with respect to possible weight vectors $\mathbf{w}$:

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \log P(D \mid \mathbf{w})$$

where

$$\log P(D \mid \mathbf{w}) = \sum_{i=1}^{n} y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1 - y_i) \log\left(1 - \sigma(\mathbf{w}^T \mathbf{x}_i)\right)$$

and $\sigma$ is the sigmoid function bounded between 0 and 1. After the weights have been trained, new unseen datapoints are classified according to the probability that they belong to class 1, using the function $P(y_i = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$. Instances with probabilities above some threshold (usually 0.5) are assigned to class 1, which corresponds to the instance lying on the side of the hyperplane assigned by the classifier as belonging to "positive" instances. Probabilities of 0.5 lie directly on the decision boundary, and instances below 0.5 are classed as belonging to class 0. Due to the fact that the sigmoid function has its highest gradient as it approaches 0.5 from either side, small distances close to the decision boundary are weighted more heavily than larger distances far from the boundary.

Logistic regression is inherently a binary classifier; however, as in this case, it can be extended to a multi-class predictor by using a "one vs all" technique: that is, building a binary classifier for each class, assigning the given class as 1 while all other classes are attributed 0. For each datapoint, the probability that it belongs to the positive class is computed across all of the classifiers, and the class corresponding to the classifier which gives the highest probability is assigned to the datapoint. Logistic regression is generally quick and effective on high-dimensional data, and so it was used in this task in ensemble classifiers where data with over 300 features was required.
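The sketch below shows this one-vs-all setup with SKLearn's LogisticRegression on placeholder data; `multi_class="ovr"` is the relevant setting here (in recent SKLearn versions the same behaviour can be obtained by wrapping the estimator in OneVsRestClassifier).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(300, 350)                                  # placeholder high-dimensional features
y = rng.choice(["Korean", "Chinese", "Japanese"], 300)   # placeholder class labels

# One binary (class vs rest) logistic regression is trained per class; a datapoint
# is assigned the class whose classifier gives the highest positive-class probability.
clf = LogisticRegression(multi_class="ovr", max_iter=1000)
clf.fit(X, y)
print(clf.predict(X[:3]))
print(clf.predict_proba(X[:3]).round(2))
```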

4.3 Decision Trees and Random Forests

The decision tree (Quinlan, 1986) is a classification method which uses characteristics of the features in the dataset to build a tree of decision rules, where the leaves are instances of classes. Once a tree has been built from training data, an unseen datapoint is classified by moving down the tree, at each level choosing to descend the branch corresponding to the characteristic relevant to the datapoint. Eventually a leaf node is reached, which determines the predicted class of the datapoint.

4.3.1 Entropy

A large part of training a decision tree is deciding which feature to "split" on at each level of the tree: it is important to choose features which discriminate the classes effectively when the data is split according to characteristics related to that feature. The effectiveness of the split can be modelled using the notion of "impurity", which is how evenly represented each class is in each subset after a split. In order to discriminate classes well, we want splits to produce subsets of the training data which are as pure as possible; that is, subsets of the data after a split should each contain data mostly from a single class. While there are several criteria which can be used, in this project the impurity of a set is measured using entropy.

The function H for entropy is given below, where S is the set of data to be split on, and p(ci) is the proportion of datapoints in the subset being considered which belong to class ci.

$$H(S) = -\sum_{i=1}^{n} p(c_i) \log_2\left(p(c_i)\right)$$

As we would expect intuitively, $H$ reaches its maximum value (equal to 1 in the two-class case, and $\log_2 n$ in general) when the classes of the data being considered are perfectly balanced, i.e. $p(c_1) = \cdots = p(c_n)$, and reaches a minimum value of 0 when all of the points being considered belong to a single class.

4.3.2 Information Gain

The goal during the training process of decision trees in this project is to choose splits which maximise information gain, which can be defined as the drop in entropy which occurs from splitting the data by its characteristics related to a certain feature. This is calculated by computing entropy on the set being considered before the split, and subtracting the weighted average entropy of each subset after the split has partitioned the data into these subsets. Formally, for the set being split S and candidate feature to split on A, information gain is defined as

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

where $A$ refers to the feature, and $\mathrm{values}(A)$ is the set of values to split the feature on; if the feature is continuous, this is often a set of ranges. $S$ refers to the set of data to split, and $S_v$ the subset of data which is grouped according to a common value $v$.
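A small sketch of the two quantities defined above, computed directly from class labels, is given below; the split at the end is a toy example purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S): impurity of a set of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """IG(S, A): parent entropy minus the size-weighted entropy of the subsets
    produced by splitting on some feature A."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Toy example: a split that mostly separates the two classes has high gain.
parent = ["ko"] * 4 + ["ja"] * 4
split = [["ko", "ko", "ko", "ja"], ["ja", "ja", "ja", "ko"]]
print(round(information_gain(parent, split), 3))
```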

19 4.3.3 Using Information Gain to Rank Features

Information gain is a natural choice when ranking features using random forests of decision trees, as it is the goal of the classifier to maximise information gain during training time for optimal performance. It is an effective measure of how well a specific feature can discriminate classes (Strobl et al., 2007). SKLearn provides information gain figures as a class variable in its random forests classifier, which is quick to access. Throughout this project, the information gain of individual features will be used to rank features by effectiveness.

There are limitations to using information gain as a measure of effectiveness, which should be kept in mind. Namely, if multiple features have high information gain individually but are highly correlated, the first feature among these to be split on will be given a high information gain and the others will become uninformative after this split.

4.3.4 Random Forests

Using the random forest method is an effective way to avoid over-fitting and improve accuracy. During the training process, a specified number of decision trees are built from randomly sampled data in the training set, and the result of each decision tree is considered when predicting classes in the testing set. In this project, predictions are categorical and so each datapoint in the testing set will be classified based on the most frequently predicted class across the decision trees in the forest.

Choosing the number of trees in the forest

The finished random forest classifier was run on the validation set with forest sizes ranging from 1 to 120 trees.

After experimenting with tree counts ranging from 1 to 120, it was found that average accuracy did not improve beyond 80 trees. Therefore this is the forest size used in the remainder of the project.
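A minimal sketch of the forest configuration described above, with placeholder data, is shown below; `criterion="entropy"` selects information gain as the split criterion, and `feature_importances_` provides the per-feature importance figures referred to in section 4.3.3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(600, 20)                                   # placeholder feature matrix
y = rng.choice(["Korean", "Chinese", "Japanese"], 600)   # placeholder labels

# 80 trees, each trained on a bootstrap sample, with entropy as the split criterion.
forest = RandomForestClassifier(n_estimators=80, criterion="entropy", random_state=0)
print("5-fold accuracy:", cross_val_score(forest, X, y, cv=5).mean())

forest.fit(X, y)
print("top feature importances:", np.sort(forest.feature_importances_)[::-1][:5])
```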

4.4 Feature Normalisation

All features across all of the classifiers were normalised using SKLearn's StandardScaler class[2], where the mean was scaled to 0 and the variance to 1 for each feature.

[2] Documentation for StandardScaler is found at http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
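For reference, the scaling step looks like the following minimal sketch; fitting the scaler on training data and applying the same transform to held-out data is the usual precaution, though whether this was done per fold is not stated in the report.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0, 0.2], [12.0, 0.4], [14.0, 0.9]])  # placeholder features

scaler = StandardScaler().fit(X_train)       # learns per-feature mean and standard deviation
X_scaled = scaler.transform(X_train)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1 per feature
```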

Chapter 5

Using Language Models as Classification Features

5.1 Description of a Language Model, and its Application to this Task

A language model is a system which builds a probability distribution over n-grams based on their frequency in the training data, and uses this distribution to assign a probability to an unseen sentence as the product of the probabilities of the n-grams in that sentence.

A language model builds a distribution on the assumption that words within an n-gram are dependent, but that distinct n-grams are independent. This makes language models useful for capturing very local features of written language. Below is an example of the trigram probability of a sentence.

$$P(\text{Call me Ishmael}) = P(\text{Call} \mid \langle s \rangle, \langle s \rangle)\, P(\text{me} \mid \text{Call}, \langle s \rangle)\, P(\text{Ishmael} \mid \text{Call}, \text{me})\, P(\langle /s \rangle \mid \text{me}, \text{Ishmael}) \quad (5.1)$$

In equation (5.1), the token $\langle s \rangle$ refers to the beginning of the sentence and $\langle /s \rangle$ to the end. The hypothesis in this task is that non-native speakers will score lower than native speakers on a language model trained on native text, and that this score will therefore be an effective feature in the binary native vs non-native classification task. Three separate language models (one for each class) were also trained using the training portion of the learner corpus, with the hypothesis being that speakers of one non-native class will score higher on a language model trained using data from that class than on the language models trained on the other classes.

5.2 Description of Language Models Tested

Several language models were trained using the portion of the Open American National Corpus (Ide and Suderman, 2004) not used for building the native speaker class entries, consisting of 8 million words. KenLM (Heafield, 2011) was used to build the language models. For the purposes of this task, it is necessary that topic bias is removed when calculating the log probability of each entry. Otherwise, a class

22 of speaker who tends to often mention topics which also frequently arise in the language model’s training data will be scored higher, which is an unwanted consequence. For example, if the language model’s training corpus contains many articles about Kimchi, it may score Korean speakers higher on the merit that they are more likely to talk about Kimchi than other speakers.

5.2.1 The SLOR language model

The first language model considered uses the SLOR technique, used previously by Lau et al. (2016) to remove topic bias. This technique normalises the effect of word frequencies by subtracting the unigram log probability of each word from the trigram log probability of the sentence, and divides by the sentence length to normalise the probabilities of sentences with differing lengths. The formula penalises words which commonly appear in the language model's training data, whereas rarer words improve the score. This offsets the bias of the trigram model, which penalises rare words and rewards common ones.

$$L_{slor}(s) = \frac{L_{tri}(s) - \sum_{w \in s} L_{uni}(w)}{|s|}$$

where $L_{tri}(s)$ is the log trigram probability of the sentence, and $\sum_{w \in s} L_{uni}(w)$ is the summed log unigram probability of each word in the sentence.
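A minimal sketch of how such a score can be computed with the KenLM Python bindings is given below. The model path is a placeholder, and the unigram term is approximated here by scoring each word in isolation under the same model (the project may equally have used a dedicated unigram model); KenLM returns log10 probabilities.

```python
import kenlm

trigram = kenlm.Model("native_trigram.arpa")   # placeholder path to a trained trigram model

def slor(sentence):
    """SLOR: (log P_tri(s) - sum of log P_uni(w)) / |s|."""
    words = sentence.split()
    log_tri = trigram.score(sentence, bos=True, eos=True)
    # Scoring a single word with no sentence boundaries gives its unigram log-probability.
    log_uni = sum(trigram.score(w, bos=False, eos=False) for w in words)
    return (log_tri - log_uni) / len(words)

print(slor("call me ishmael"))
```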

5.2.2 Potential issues with the SLOR model

Some potential issues exist with the SLOR model's ability to eliminate topic bias, which were not addressed by Lau et al. (2016). The model assumes that topic bias is caused by unigrams, whereas content words are often represented as n-grams for higher n. For example, a native speaker may be more likely to use the term "United Kingdom", whereas the SLOR model would attempt to offset the bias of this term by subtracting the unigram probabilities of "United" and "Kingdom" separately. This is a potential issue as the words "united" and "kingdom" can arise in many other contexts when considered separately.

5.2.3 The POS language model

Another method of avoiding topic bias was tested, which involved replacing all content words in the language model’s training data with their respective part-of-speech (POS) tags. For example, the sentence

African wildlife managers are constantly dealing with questions surrounding the need to slow the growth of elephant populations.

Becomes the following sentence in the modified corpus:

JJ NN NNS VBP RB VBG with NNS VBG the NN to VB the NN of NN NNS .

Two language models were trained: one on the original corpus, and another on the modified POS corpus. In the first instance the SLOR method is used to remove topic bias, whereas in the second instance there is no need to account for bias. Therefore the formula used to score on the second model is simply the log trigram probability divided by the sentence length.
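A minimal sketch of the content-word replacement is given below using NLTK's default tagger; exactly which tag classes count as "content words" is an assumption here (nouns, verbs, adjectives and adverbs are replaced, everything else is kept), chosen to match the example above.

```python
import nltk  # may require: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")   # assumed set of content-word tag classes

def to_pos_corpus(sentence):
    """Replace content words with their Penn Treebank tags, keeping function words."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return " ".join(tag if tag.startswith(CONTENT_TAG_PREFIXES) else word
                    for word, tag in tagged)

print(to_pos_corpus("African wildlife managers are constantly dealing with questions "
                    "surrounding the need to slow the growth of elephant populations."))
```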

5.3 Using Language Models to Classify Native vs Non-Native

5.3.1 Results

Each of the three non-native classes was reduced from 20,000 entries to 18,687 entries and compared separately against the 18,687 entries in the native class. A Gaussian Naive Bayes classifier, which was found to give the best performance, was trained for each non-native class using 5-fold cross-validation on the cross-validation set. The mean score across the folds was taken to be the measure of accuracy for this classifier, and for every classifier mentioned subsequently in this report.

Classes            SLOR accuracy    POS accuracy
Native/Korean      86.45%           96.96%
Native/Chinese     78.22%           94.44%
Native/Japanese    86.94%           97.98%

Figure 5.1: Accuracy of the binary language model classifier.

5.3.2 Summary of Results

It is clear that trigram language models are very effective at discriminating between the native class and each non-native class. At first this may seem to be because there are many local differences between native and non-native speech which can be accurately captured by a trigram probability distribution. However, it is important to consider the fact that the tagger used to create the POS corpus is itself trained on a native corpus, and so may give peculiar and incorrect POS tags to awkward non-native English that it was not trained on. This may exaggerate the differences between native and non-native speech.

The increased effectiveness of the POS model can be predicted after plotting each class using the SLOR model and POS model as axes. Clearly there is greater variance along the axis of the SLOR model, however the POS model axis separates the classes more effectively.

[Scatter plots of each class pair using the SLOR model and POS model scores as axes: (a) Native vs Korean, (b) Native vs Chinese, (c) Native vs Japanese.]

5.4 Using Language Models to Classify among Non-Natives

The training portion of the learner corpus was separated by class, and used to build three separate trigram POS language models, one for each class. The cross-validation portion of the corpus was then used to train and test a classifier with confidence scores from each model as features. The language models built for each class did not discriminate the data nearly as well as in the native vs non-native problem, as the results and scatter plots below show.

Classes                    POS accuracy    Baseline
Korean/Chinese/Japanese    41.96%          33.33%

Figure 5.3: The non-native POS classification score is shown alongside a baseline of 33.3%, which is obtained through random prediction of the three equally-balanced classes.

While language models are excellent tools for classifying in the native vs non-native task, they prove to fall short for classifying among non-natives in this task, contrary to previous research such as Brooke and Hirst (2011), where accuracies above 70% were reached using language model probability scores alone. A possible reason for this is the limited amount of data available in this task to train the

25 respective non-native language models; if a larger learner corpus were to be used, better results may be obtained.

Figure 5.4: Scatter plots (panels (a) Chinese vs Japanese, (b) Chinese vs Korean, (c) Korean vs Japanese) in which the data in each pair of non-native classes are plotted using the relevant language model POS scores as axes, demonstrating the high correlation between them.

Chapter 6

A Heuristic Approach to Classifying Native vs Non-Native

In this chapter, a more insightful attempt was made at building a binary native vs non-native classifier, and its performance compared against the simple statistical classifier built previously. This classifier used features which capture differing placement of adverbs, left/right branch direction in dependency parse trees, frequency of the passive voice and sentence length.

6.1 Adverb Placement

Adverbs are classified as occurring either as the first word in a sentence, between the first word and the first verb, between the first verb and the end of the sentence, or as the very last word of the sentence. The percentage of adverbs occurring at each position was calculated for every class. The results found are shown below.

Class      Head    Mid - preceding 1st verb    Mid - succeeding 1st verb    Tail
Korean     0.14    0.38                        0.35                         0.13
Chinese    0.14    0.37                        0.36                         0.13
Japanese   0.12    0.37                        0.36                         0.15
Native     0.11    0.33                        0.47                         0.09

Clearly, speakers in the native class are less prone to place adverbs at the tail of the sentence than the non-native speakers, and more prone to place adverbs between the first verb and the tail. As a result, "Mid - succeeding 1st verb" and "tail" are taken as candidate features.
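The sketch below reproduces this measurement for a single entry using SpaCy; the exact token classes counted as adverbs and verbs (coarse tags ADV and VERB/AUX) and the treatment of sentence-final punctuation are assumptions rather than details taken from the report.

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def adverb_position_distribution(text):
    """Proportion of adverbs at each of the four positions described above."""
    counts = Counter()
    for sent in nlp(text).sents:
        tokens = list(sent)
        verbs = [t.i for t in tokens if t.pos_ in ("VERB", "AUX")]
        first_verb = verbs[0] if verbs else None
        # Treat the word before final punctuation as the sentence's last word.
        last_word = tokens[-2].i if len(tokens) > 1 and tokens[-1].is_punct else tokens[-1].i
        for tok in tokens:
            if tok.pos_ != "ADV":
                continue
            if tok.i == tokens[0].i:
                counts["head"] += 1
            elif tok.i == last_word:
                counts["tail"] += 1
            elif first_verb is None or tok.i < first_verb:
                counts["mid, preceding 1st verb"] += 1
            else:
                counts["mid, succeeding 1st verb"] += 1
    total = sum(counts.values()) or 1
    return {pos: n / total for pos, n in counts.items()}

print(adverb_position_distribution("Honestly, I never eat sushi. I eat pizza very often."))
```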

6.2 Left and Right Branching

Before describing the methodology and results of this feature type, the notion of branching in linguistics and its relevance to this task will be outlined.

27 6.2.1 What is "branching"?

The term "branching" refers to the direction in which a parse tree descends. In general, languages whose speakers produce parse trees which generally descend towards the right are called "right-branching", whereas languages whose parse trees descend towards the left are "left-branching". English is largely right-branching, which means that there are no strict rules regarding the branch direction but speakers instinctively use methods such as extraposition to shift the parse tree towards the right (Francis, 2010). On the other hand, Korean, Chinese and Japanese are strictly right-branching; the languages contain violatable rules which ensure their "left-branchedness".

A hypothesis can be made from the above information: speakers whose native languages are left-branching may be less inclined to right-branch when speaking English than natives.

6.2.2 Capturing Differences in Branching

In order to test this hypothesis, every sentence in each entry was parsed into a dependency tree. The length of the longest right branch starting at the second level in each tree was divided by the length of the longest left branch. The average of this value over every sentence in a given entry was then taken to be the feature value for that entry. If the hypothesis is true, this value will be significantly higher among the native class than among the other classes.
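A minimal sketch of this measurement with SpaCy is shown below. Branch lengths are measured downwards from the sentence root here, which approximates the report's "second level" starting point, and the floor of 1 on the left-branch length (to avoid division by zero) is an added assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def longest_branch(token, direction):
    """Longest chain of dependents extending only to the left or only to the right."""
    side = [c for c in token.children if (c.i > token.i) == (direction == "right")]
    return 0 if not side else 1 + max(longest_branch(c, direction) for c in side)

def branching_feature(entry_text):
    """Average (longest right branch / longest left branch) over the sentences of an entry."""
    ratios = []
    for sent in nlp(entry_text).sents:
        right = longest_branch(sent.root, "right")
        left = max(longest_branch(sent.root, "left"), 1)
        ratios.append(right / left)
    return sum(ratios) / len(ratios)

print(branching_feature("You can build a machine that performs better in less than one hour."))
```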

In the figure below, an extreme example is considered to illustrate the procedure. A very right-branching sentence is shifted left, and the right/left branching scores of the two versions are compared. The right-branching sentence

You can build a machine that performs better in less than one hour with a few million dollars.

if shifted to the left, becomes

With a few million dollars, in less than one hour you can build a better performing machine.

where the only difference in the word usage is the omission of "that", which appears in the first sentence. To illustrate their differences, these sentences are shown as separate constituency trees.

Figure 6.1: In the right-branching sentence (a), the branch proportion is 7/1 = 7. In the left-branching sentence (b), it is 5/3 = 1.67. The red branches indicate the longest left branch, whereas the blue branches indicate the longest right branch.

6.2.3 Passive Voice

The SpaCy dependency parser was used to detect instances of passive voice in each entry. The results showed that speakers in the native class make use of passive voice much more frequently than those in other classes. The results of the branching investigation and number of sentences containing the passive voice are given below, where passive voice instances are normalised by the number of sentences across each class. Bold indicates the highest value in the column.

Class      Longest right branch / longest left branch    % Sentences containing passive voice
Korean     1.81                                          4.16%
Chinese    2.03                                          5.14%
Japanese   1.83                                          4.44%
Native     2.64                                          15.13%

Figure 6.2: The average proportion of right-branch/left-branch among Native speakers is between 23% and 31% higher than the other classes. Members of the native class also use passive voice around three times as frequently as members of the non-native classes.
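A minimal sketch of the passive voice count with SpaCy is given below; the dependency labels checked ("nsubjpass" and "auxpass") are those produced by SpaCy's English models, though the exact label names can vary between model versions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def passive_sentence_rate(text):
    """Fraction of sentences in an entry containing at least one passive construction."""
    sents = list(nlp(text).sents)
    passive = sum(any(tok.dep_ in ("nsubjpass", "auxpass") for tok in sent)
                  for sent in sents)
    return passive / len(sents)

print(passive_sentence_rate("The cake was eaten by the dog. The dog ran away."))
```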

6.2.4 Sentence Length

Kochmar (2011) successfully used sentence length as a feature, and from observing the general information in section 2.5 we saw that entries in the native class tend to contain more words but fewer sentences. Therefore there is incentive to use sentence length as a feature.

29 6.2.5 Results

The performance of a classifier trained on each of these features individually and combined is reported below. Note that with the exception of the rightmost column, all scores are reported for a classifier with a single feature.

Classes            Adverb    Branch    Passive    Sentence length    All combined
Native/Korean      67.94%    81.45%    70.31%     93.88%             94.48%
Native/Chinese     68.45%    77.81%    71.94%     88.09%             88.69%
Native/Japanese    70.83%    80.59%    71.02%     94.54%             94.74%

Figure 6.3: The first four columns show the accuracy results (%) given by a classifier using a single feature. The final column shows the result of combining the four features.

Using a classifier with adverb distribution, right-branchedness, sentence length and frequency of passive voice is effective, with classification accuracy between 88.69% and 94.74%. Curiously, the branching factor feature alone is extremely effective at classifying Korean speakers, at 81.45%. Furthermore, despite each feature type being effective on its own, the most effective feature, sentence length, accounts for most of the accuracy when the features are combined.

Despite giving us more insight regarding the differences in how speakers in the native and non-native classes use English, these results do not beat the simple statistical language model. However, performance is maximised when the two classifier types are combined.

6.3 Combining the "Language Model" and "Heuristic" Features

The final investigation in this chapter is to combine the language model and heuristic features previously addressed separately in this chapter, in order to maximise the performance of the native vs non-native classifier. The results are shown below.

Classes            Accuracy
Native/Korean      99.09%
Native/Chinese     97.17%
Native/Japanese    99.19%

Figure 6.4: Final reported accuracy of the native vs non-native problem.

Chapter 7

Feature "Families" Used in the Multi-Class Problem

Throughout the rest of this report, very similar types of features which were engineered in the same section are grouped as belonging to "families". When reporting statistics regarding features, their families will be referred to using an abbreviated name. The table below provides a reference for the reader to be used during the rest of the report. The table provides an abbreviation of the family, a brief description and the section in which the feature family is developed and discussed.

Abbreviation           Description                                                                # Features    Section
POS                    The non-native language models                                             3             5.4
SPELL                  Spelling mistake features                                                  10            8.1
LCHECK                 General mistakes reported by the "LanguageCheck" library                   7             8.2
STOPW                  Stopword frequencies                                                       30            8.3
PREP                   Ensemble preposition usage features                                        3             9.1
VERB                   Ensemble verb usage features                                               3             9.2
PPPOS                  Frequency of prepositional phrases occurring before the 1st verb phrase    2             9.3
Native vs Non-Native   Features used during the native vs non-native classification problem      7             5.3, 6

Chapter 8

Using Techniques Influenced by Previous L1 Identification Research to Improve the Multi-Class Classification Problem

In this chapter, feature types broadly influenced by previous research are used to improve the performance of the non-native classifier: namely, capturing different types of spelling errors, counting general errors using an error correction library, and using stopword frequencies as features. However, in the case of investigating spelling errors, error types will not be captured using letter n-gram features as in Koppel et al. (2005), but rather using weighted variants of the Levenshtein distance between misspelled words and suggested corrections.

8.1 Capturing and Categorising Spelling Errors

The phonology of one's native language has a large influence on the types of errors that an L2 speaker of English makes when spelling (Randall and Isnin, 2004). These influences were investigated in an attempt to capture and categorise different error types which discriminate the classes. Ultimately, this feature family improved the classification performance by 5.8% when appended to the decision tree classifier.

Koppel et al. (2005) considered spelling errors as belonging to five categories, of which three were explored here. They are as follows:

Name                 Example
Substitution error   probrem instead of problem
Inversion error      fisrt instead of first
Insertion error      friegnd instead of friend
Deletion error       frend instead of friend
Conflated words      stucktogether instead of stuck together

The above error types, with the exception of "conflated words" and "inversion error", were explored. Error frequency as well as the "severity" of errors were also examined. Severity was measured as the Levenshtein distance (edit distance) between the word and the suggested correction, normalised by the maximum length of the two words.

8.1.1 Levenshtein Distance

Levenshtein distance (commonly known as edit distance) is a measure of similarity between two strings, first proposed by Levenshtein (1966). Roughly speaking, it can be described as the minimum number of edits required to transform one string to another. An edit in this case is classified as being a letter substitution, insertion or deletion. For example

L(car, far) = 1 (one substitution)

L(cannot, kanot) = 2 (one substitution, "k" for "c"; one deletion of an "n")

Weighted and unweighted Levenshtein distance metrics are used throughout this section.

Weighted Levenshtein Distance

In order to isolate the specific types of errors committed, it would be useful to count only the edit operations corresponding to that error type; for example, we would like to count only the number of substitution edits in the "letter swap" error type, so that L(flaner, flannel) = 1, accounting only for the r/l swap despite the need to insert an extra "n".

The Python library Weighted-Levenshtein[1] was used to weight specific edit operations. Due to the dynamic programming nature of the algorithm, weighting all of the unwanted edit types as 0 or close to 0 returned incorrect results. A solution to this could be to use the recursive variant of the algorithm; however, this runs in exponential time compared to the O(mn) dynamic programming algorithm (where m and n are the word lengths). Therefore it was favourable to devise a method that uses the dynamic programming variant. In order to count only the edits of one type, the weighted distance where the targeted edit type has weight 2 was calculated, and the unweighted distance was then subtracted from this value. Formally, let

$L_s$ be the Levenshtein distance which only counts swaps. This is calculated using $L$, the distance where all edit types have equal weight, and $L_w$, the distance where swap edits have weight 2. So for two words $w_1$ and $w_2$,

$$L_s(w_1, w_2) = \frac{L_w(w_1, w_2) - L(w_1, w_2)}{\max(|w_1|, |w_2|)}$$

For example, in the case below the swap distance before dividing by the maximum word length is 1, accounting for the one swap and ignoring the deletion.

$$L_s(\text{flaner}, \text{flannel}) = \frac{(2 \times 1 + 1) - 2}{7} = \frac{1}{7}$$

[1] Available at https://testpypi.python.org/pypi/weighted-levenshtein/0.9
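A minimal sketch of this weight-2 trick using the Weighted-Levenshtein package is given below; the 128-entry ASCII cost arrays follow that package's documented interface, and the inputs are assumed to be ASCII words.

```python
import numpy as np
from weighted_levenshtein import lev

def swap_only_distance(w1, w2):
    """Normalised distance counting only substitution ("swap") edits:
    weight substitutions as 2, subtract the unweighted distance,
    and normalise by the length of the longer word."""
    substitute_costs = np.full((128, 128), 2.0)          # every substitution costs 2
    weighted = lev(w1, w2, substitute_costs=substitute_costs)
    unweighted = lev(w1, w2)
    return (weighted - unweighted) / max(len(w1), len(w2))

print(swap_only_distance("flaner", "flannel"))   # expected 1/7 ≈ 0.143
```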

33 8.1.2 Using Levenshtein Distance to Capture Error Types

Weighted Levenshtein distance has been used previously in unrelated tasks to improve the accuracy of spelling correction systems, such as in Hicham et al. (2012), where errors requiring different edit types and edits involving specific letters were weighted by their frequencies in a training corpus which was analysed by hand. In this project, we notice that the substitution, insertion and deletion actions in the edit distance algorithm correspond suitably with the "substitution", "deletion" and "insertion" categories listed above. For example, the average deletion distance between a misspelled and corrected word can be used to model the occurrence and severity of insertion errors. The Python library Enchant[2] was used to detect misspelled words in the learner corpus and give suggested corrections. The unweighted Levenshtein distance between the misspelled word and the correction was taken to be a measure of the "severity" of the mistake made. Furthermore, the weighted distance procedure explained in section 8.1.1 was used to capture substitution, insertion and deletion errors.
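The detection step can be sketched as follows with PyEnchant; the edit-distance call uses the python-Levenshtein package here purely for illustration (any Levenshtein implementation, including the weighted one above, could be substituted).

```python
import enchant        # PyEnchant
import Levenshtein    # python-Levenshtein, used here only as an example distance

d = enchant.Dict("en_US")

def average_severity(words):
    """Average normalised distance between misspelled words and their top suggestion."""
    distances = []
    for w in words:
        if w.isalpha() and not d.check(w):
            suggestions = d.suggest(w)
            if suggestions:
                best = suggestions[0]
                distances.append(Levenshtein.distance(w, best) / max(len(w), len(best)))
    return sum(distances) / len(distances) if distances else 0.0

print(average_severity("I have a probrem with my frend".split()))
```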

The table below summarises the average normalised distances for each type of error across each class in the training set.

Class      Letter substitution    Letter insertion    Letter deletion    Unweighted distance
Korean     0.21                   0.032               0.081              0.35
Chinese    0.14                   0.037               0.11               0.34
Japanese   0.18                   0.034               0.1                0.36

It was also found that in every class, between 5.6% and 5.9% of words contained errors. Therefore, not only do speakers in each class commit errors with similar frequencies; if we take the unweighted distance between the mistake and the correction as a metric for the severity of a mistake, the severity of mistakes made across each class is also similar on average. However, the proportion of errors in the Korean class which are letter substitution errors is clearly higher than in the other classes.

Capturing Specific Substitution Errors

In this case, rather than counting the occurrences of any substitution error, only substitutions involving specific pairs of letters were considered. The letter swaps l/r and v/f/p/b were specifically targeted using knowledge of the languages in question: there is no distinction between /l/ and /r/ in Japanese phonology (Kubozono, 2015, p. 6) or in Korean, and the fricatives /v/ and /f/ also do not exist in Korean phonology (Brown and Yeon, 2015, p. 7). In the latter case, Koreans often substitute these fricatives with /b/ and /p/ when pronouncing loanwords from English, which may lead to general confusion over when to use the four sounds and the corresponding four letters. Apart from targeting these phonological differences, distances for other specific letter swaps were tested through trial and error, and the swaps m/n and c/k were found to differ noticeably among the classes.

2Available at https://pypi.python.org/pypi/pyenchant/

Class      l/r          v/f/p/b      c/k          m/n
Korean     7.6 × 10−4   1.1 × 10−3   5.3 × 10−4   7.6 × 10−2
Chinese    1.7 × 10−4   3.9 × 10−4   6.4 × 10−5   2.2 × 10−2
Japanese   1.7 × 10−3   7.3 × 10−4   1.1 × 10−3   4.2 × 10−2

Figure 8.1: Average weighted Levenshtein distance of specific swaps between the incorrectly spelled word and suggested correction across each class.

In the case of the error types chosen as a result of the languages' phonologies, the hypotheses are confirmed: speakers of the languages without an /r/–/l/ distinction (Korean and Japanese) are more likely to mix up the letters "r" and "l" when writing in English, and Koreans are more likely to confuse "v", "f", "p" and "b". The reason for the proportionally higher rate of "c/k" and "m/n" errors among Japanese and Korean speakers respectively is unknown.

Capturing Specific Insertion Errors

Chinese speakers were found to be most likely to add superfluous vowels to words, whereas Korean speakers most often incorrectly added consonants. The average weighted delete distances (corresponding to the letter deletions required to reach the correct word) are shown below.

Class      delete vowel   delete consonant
Korean     1.1 × 10−2     4.7 × 10−2
Chinese    2.5 × 10−2     2.2 × 10−2
Japanese   1.9 × 10−2     4.2 × 10−2

Figure 8.2: Average weighted Levenshtein distance of specific deletions made in the incorrectly spelled word to reach the suggested correction.

The specific insertion and swap error distances, as well as the general substitution, insertion, deletion distances and the unweighted distance were all taken to be features in the "spelling error" group.

8.1.3 Results and Feature Importance

This family of features improved the accuracy of the classifier by 5.8% when appended alongside the non-native POS log probability features. The table below shows the accuracy reported by a classifier containing only features in the POS family, and the performance after adding the features in this family (SPELL).

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%

Features in this family were ranked according to information gain; the table of features with their respective information gain scores is given below in order of descending importance. It was found that the general unweighted distance and the swap distance were approximately equally informative, and the specific swap errors were the least informative.

Feature name        Information Gain
Unweighted distance 0.205288
Letter swap         0.190701
Letter deletion     0.148508
Delete consonant    0.117607
Letter insertion    0.113546
Delete vowel        0.108587
Swap m/n            0.053574
Swap r/l            0.005928
Swap b/p/v/f        0.005264
Swap c/k            0.002451

8.2 Using an Error Correction Library to Detect Errors

In this approach, the Python library LanguageCheck (Miłkowski, 2010) was used to check for errors in each entry, and the most discriminative error types were used as features, improving classification accuracy by a further 9.12%. Over 400 error types were identified across the learner corpus, with the majority of error types occurring only once. A high-dimensional logistic regression classifier was first built, logistic regression being used for its speed over high-dimensional data. Features were ranked according to importance using recursive feature elimination: it was found that the 7 best-ranked features captured over 99% of the accuracy of the classifier, so these features were selected for the decision tree model. A graph of the 14 most important features (added by descending importance) against accuracy is shown below. The features along with their descriptions are also shown in descending order of importance.

Figure 8.3: Number of features (added by descending importance) against accuracy. Choosing the best seven features captures around 99% of the accuracy that all the features provide.

Feature name                     Description
Uppercase sentence start         Sentence starts with a lowercase letter.
Whitespace rule                  Sentence contains unnecessary whitespace.
Comma parenthesis whitespace     No space after comma.
I lowercase                      I (pronoun) appears as "i".
Unpaired brackets                Bracket is opened and not closed.
Double punctuation               Use of two consecutive dots or commas.
Sentence fragment                Sentence is actually a dependent clause.

With the exception of "sentence fragment", the most important features seem to capture superficial typos rather than grammatical errors resulting from true L1 interference. The significance of the rules "whitespace rule" and "comma parenthesis whitespace" may result from the fact that Chinese and Japanese do not use whitespace in their writing systems, whereas Korean does; speakers of Korean should therefore be better at using whitespace correctly. It may be more effective to find deeper grammatical errors which result from L1 interference.
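To make the error-counting step concrete, a minimal sketch is given below. It assumes the language_check Python wrapper around LanguageTool, whose check() method returns match objects carrying a ruleId; the example text and the rule identifiers mentioned in the comment are illustrative, not taken from the project's code.

from collections import Counter
import language_check

tool = language_check.LanguageTool('en-US')

def error_type_counts(entry_text):
    """Count the LanguageTool rule identifiers triggered by one learner entry."""
    return Counter(match.ruleId for match in tool.check(entry_text))

counts = error_type_counts("i went to school yesterday .He was there ,too.")
print(counts.most_common(5))   # e.g. UPPERCASE_SENTENCE_START, COMMA_PARENTHESIS_WHITESPACE, ...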

8.2.1 Results

The 9.12% increase in accuracy after adding this family (LCHECK) is given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%

8.3 Using Stopwords as Features

8.3.1 Stopwords

In computer science, stopwords are the extremely common words in a language which do not contain important topical information. They are mostly common verbs such as "be" or "have", as well as function words which contribute to syntax rather than meaning, such as "the" and "but". Stopwords have been successfully used as feature types in almost all research discussed in the background chapter, and so they are also utilised here.

8.3.2 Method

The top 30 most frequently appearing stopwords in the corpus were found, and for each stopword its frequency in each entry was taken to be a feature. This method significantly improved the performance of the classifier.
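A minimal sketch of this feature family is shown below; the stopword list is a hypothetical stand-in for the 30 most frequent stopwords found in the corpus, and normalising by entry length is an assumption (raw counts could equally be used).

from collections import Counter

# hypothetical stand-in for the 30 most frequent stopwords in the corpus
TOP_STOPWORDS = ["the", "to", "and", "i", "a", "of", "in", "is", "it", "was"]

def stopword_features(entry_tokens):
    """Relative frequency of each chosen stopword within one entry."""
    counts = Counter(token.lower() for token in entry_tokens)
    total = max(len(entry_tokens), 1)
    return [counts[stopword] / total for stopword in TOP_STOPWORDS]

print(stopword_features("I went to the park and it was fun".split()))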

8.3.3 Results

This family (STOPW) increased performance by 4.25% when appended to the classifier.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%

Chapter 9

Investigating New Feature Types in the Multi-Class Problem

In this chapter, several new ideas will be tested which capture preposition and verb usage without explicitly counting errors, as was done in previous research. Furthermore, some features which capture stylistic differences will be investigated, such as the use of correlative conjunctions and the placement of prepositional phrases. These feature types will be appended to the decision tree classifier used in the previous chapter in order to further improve performance.

9.1 Capturing the Usage of Prepositions

The usage of prepositions as grammatical components linking verbs to objects was investigated. The method used in this section does not explicitly count occurrences of prepositional errors, but rather models prepositional behaviour by recording which types of prepositions commonly occur with which verbs, linking to which object types and alongside which other grammatical particles. The features acquired through this investigation improved performance by 1.46% when appended to the classifier.

9.1.1 The "Caseframe" Method

Nagata et al. (2014) built a systematic method for detecting preposition errors in a sentence, where each sentence is converted and stored into a "caseframe" entry. A model of correct prepositional usage was created by building a large dictionary of caseframes from native corpora, and prepositional errors were found by creating caseframes from non-native written sentences and looking them up in the native dictionary. If the lookup returned no matches, the sentence was marked as containing a prepositional error. In this project a simplified caseframe model was built using the basic ideas in Nagata et al. (2014), which does not attempt to count errors or compare the learner caseframes with a native dictionary of caseframes. The following attributes are stored in each caseframe entry.

Attribute     Description                                                             Examples
Subj          Subject of the sentence                                                 I, you, he
Dobj          Direct object of the head                                               PRP, NNP, NN
Iobj          Indirect object of the head                                             PRP, NNP, NN
Head          Head verb of the sentence, converted to infinitive form                 go, find, look
Prt           Particle                                                                to, while, for
Prep_[prep]   Object of a preposition, where [prep] is the specific preposition       PRP, NNP, NN
              (i.e. to, around)

An example of a sentence missing the preposition "to" being converted to a caseframe entry is shown below, alongside an example where the "to" is present. In native speech, the verb "listen" cannot be followed by a pronoun direct object.

I listened him but I forgot. −→[head:listen, subj:I, dobj: PRP]

In this correct sentence, the pronoun is not a direct object but a prepositional object of "to":

I listened to him but I forgot. −→[head:listen, subj:I, prep_to: PRP]
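The sketch below illustrates how such a caseframe might be extracted with spaCy's dependency parser (section 9.1.3 states that spaCy was used to identify the attributes); the choice of dependency labels (nsubj, dobj, prt, prep, pobj), the restriction to the sentence's root verb, and the expected outputs in the comments are assumptions, and the exact frames depend on the parse.

import spacy

nlp = spacy.load("en_core_web_sm")

def caseframe(sentence):
    """Build a simplified caseframe for the root verb of a sentence."""
    doc = nlp(sentence)
    root = next((tok for tok in doc if tok.dep_ == "ROOT"), None)
    if root is None or root.pos_ != "VERB":
        return None
    frame = {"head": root.lemma_}              # head verb in infinitive (lemma) form
    for child in root.children:
        if child.dep_ == "nsubj":
            frame["subj"] = child.text
        elif child.dep_ == "dobj":
            frame["dobj"] = child.tag_         # store the POS tag, e.g. PRP, NNP, NN
        elif child.dep_ == "prt":
            frame["prt"] = child.lower_
        elif child.dep_ == "prep":
            for pobj in child.children:
                if pobj.dep_ == "pobj":
                    frame["prep_" + child.lower_] = pobj.tag_
    return frame

print(caseframe("I listened him but I forgot."))     # e.g. {'head': 'listen', 'subj': 'I', 'dobj': 'PRP'}
print(caseframe("I listened to him but I forgot."))  # e.g. {'head': 'listen', 'subj': 'I', 'prep_to': 'PRP'}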

9.1.2 Justifications for Not Explicitly Counting Errors

In this task, non-native caseframe entries were not looked up against a native dictionary. Instead, the caseframe entries across the three classes were compared. The following justification is given for this decision.

• A very large amount of native data is required in order to produce a respectable recall.

• It is time-consuming to process each sentence in a large corpus of native data into a caseframe entry.

• Analysing how each class's caseframes differ may capture a deeper and wider range of differences than simply counting preposition errors.

• The caseframe method proposed by Nagata et al. (2014) effectively captures preposition errors. Therefore if there are differences in preposition error types/rates between the classes, representing sentences as caseframes will still capture these differences whether they are looked up against native text or not.

9.1.3 Using an Ensemble Method to Capture Differences in Preposition Usage

For every entry in the training portion of the corpus, each sentence was converted to a caseframe using SpaCy’s dependency parser to identify each attribute, and was represented as a vector of the form

root_go   root_be   ...   subj_propn   subj_nn   ...   prep_to_nn   ...
1         0         ...   1            0         ...   1            ...

A secondary "caseframe classifier" was built from these features. In order to create a classifier with a manageable number of features, only sentences whose head verbs are among the 100 most common were considered, and all head verbs were converted to their infinitive forms using NLTK. A logistic regression classifier was built from these features for use in the ensemble method. The validation set was used to weight each of the 3 features, with the biggest improvement found with weights Chinese=1, Korean=0.985, Japanese=0.98. The diagram below shows the performance of the caseframe classifier on the training portion of the corpus after validation tuning.

Using this logistic regression classifier as an ensemble method proceeded as follows. For each entry across the cross-validation set, every sentence in the entry was then converted to a caseframe and represented as a vector like the one above. The three probabilities for each class given by the caseframe classifier for every sentence in an entry were then averaged, and these three average probabilities were appended to the original classifier as features corresponding to that entry.

Figure 9.1: A flowchart showing the conversion of a sentence to a feature vector, which is then fed into the caseframe classifier, whose probability scores are appended to the original classifier.
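A sketch of this ensemble step is given below; caseframe_clf (a fitted scikit-learn logistic regression over per-sentence caseframe vectors), vectorise, and the way the validation weights are applied are illustrative assumptions rather than the project's exact code.

import numpy as np

CLASS_WEIGHTS = np.array([1.0, 0.985, 0.98])   # Chinese, Korean, Japanese (validation-tuned)

def ensemble_features(entry_frames, caseframe_clf, vectorise):
    """Average per-class caseframe probabilities over the sentences of one entry.

    entry_frames:  caseframe dicts for each sentence of the entry (None if no frame)
    caseframe_clf: fitted classifier exposing predict_proba()
    vectorise:     maps a caseframe dict to the binary vector shown above
    """
    vectors = [vectorise(frame) for frame in entry_frames if frame is not None]
    if not vectors:
        return np.zeros(3)
    probs = caseframe_clf.predict_proba(np.vstack(vectors))   # shape (n_sentences, 3)
    return probs.mean(axis=0) * CLASS_WEIGHTS                 # three extra features per entry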

9.1.4 Results

The results after appending features in this family (PREP) are given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%
PREP            62.59%

9.2 Capturing the Usage of Verbs in the Global Context

A method proposed by Tajiri et al. (2012) was simplified and investigated as a means of capturing the differing usages of verbs across the non-native classes. Tajiri et al. use information regarding tense usage in the previous verb phrase to make predictions about the correctness of the tense used in the verb phrase in question. Furthermore, temporal adverbs (such as "today" and "yesterday") are taken into consideration when evaluating tense correctness, as well as other information such as the subjects, objects and conjunctions which appear in the verb phrase, and whether the verb phrase is in the main clause of the sentence. This feature family raised performance by a further 1.27%.

9.2.1 Simplifications to the Model Proposed by Tajiri et al

When building the system proposed by Tajiri et al. (2012), several simplifications were made, both for the sake of efficiency and to keep this section manageable within the scope of the project. The major simplifications are, firstly, that rather than building a system which analyses every verb phrase in a sentence, only the leftmost top-level verb phrase is taken into account. As for the global context, the top-level verb phrase in a given sentence is compared to the top-level verb phrase in the directly preceding sentence. Furthermore, rather than storing each head verb as a content word, the POS tag of the verb is stored instead. This design decision was made since it is easier to make generalisations about a limited set of POS tags than about a very large set of possible content words. Moreover, the "top 100" method used in the caseframe approach to overcome this would be difficult here, since every possible form of each of the top 100 verbs would need to be added as a feature, whereas in the caseframe method all verbs were converted to infinitives.

Finally, as in the preposition caseframe method, entries were not compared against a native corpus, nor were their tense errors explicitly counted. Instead, this method aims to capture and contrast the differences in verb usage across the three non-native classes.

9.2.2 Method

A method very similar to the caseframe method of creating a secondary ensemble classifier trained from the held-out training set was used. Also similarly, each sentence was stored as a set of "key components" in an entry, and this entry format was then used to build features in the secondary classifier. Afterwards, entries in the training set were converted to vectors of this form and run through the secondary classifier. The probabilities of each sentence of an entry belonging to each of the three classes were then averaged and appended to the vector corresponding to that entry in the original classifier.

The following attributes were used for each sentence.

Attribute   Description                                          Examples (not exhaustive)
Temporal    1st temporal adverb appearing in the verb phrase     today, last year, yesterday
Dobj        Direct object of the verb phrase                     PRP, NNP, NN
Conj        First conjunction used in the verb phrase            and, but, or
Head        Head verb of the verb phrase                         VB, VBG, VBZ
Subj        Subject of the sentence                              I, you, me
Aux         Auxiliary verb within the verb phrase                VB, VBG, VBZ
Pobj        Prepositional object within the verb phrase          NP, DT, NNP

For each attribute, the corresponding attribute in the previous sentence is also considered. As an example, consider the following sentences.

Yesterday I go to Shinjuku for dinner. Afterwards I drank with my friends.

Each sentence would be stored as an entry in the following manner:

Yesterday I go to Shinjuku for dinner. −→[temporal: yesterday, head: VB, subj: I, dobj: NNP, pobj: NP, temporal_previous: None, head_previous: None, subj_previous: None, dobj_previous: None, pobj_previous: None]

Afterwards I drank with my friends. −→[temporal: None, head: VBD, subj: I, dobj: None, pobj: NNS, temporal_previous: yesterday, head_previous: VB, subj_previous: I, dobj_previous: NNP, pobj_previous: NP]

While the pronoun is "I" in both sentences, the verb tenses are inconsistent. Also the tense "VB / infinitive verb" in this case is unnaturally paired with the temporal adverb "yesterday".

Also as in the caseframe method, the sentence "Afterwards I drank with my friends" is vectorised as follows.

temporal_yesterday   ...   head_VBD   ...   subj_I   ...   pobj_NP   ...   temporal_prev_yesterday   ...
0                    ...   1          ...   1        ...   1         ...   1                         ...
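One way to perform this vectorisation is scikit-learn's DictVectorizer, which one-hot encodes string-valued attributes as "attribute=value" indicator features; this is an illustrative sketch rather than the project's exact implementation.

from sklearn.feature_extraction import DictVectorizer

entries = [
    {"temporal": "yesterday", "head": "VB", "subj": "I", "dobj": "NNP", "pobj": "NP"},
    {"head": "VBD", "subj": "I", "pobj": "NNS", "temporal_previous": "yesterday",
     "head_previous": "VB", "subj_previous": "I", "dobj_previous": "NNP",
     "pobj_previous": "NP"},
]

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(entries)   # binary matrix, one column per attribute=value pair
print(X.shape)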

In order to find the portion of the sentence which corresponds to the top-level verb phrase, a depth-first search was performed on the sentence's constituency tree until a "VP" node was found, counting the number of occurrences of leaf nodes (words) along the way. A second depth-first search was then performed on the subtree with the "VP" node as its root, where a second count of each encountered leaf node was started. On the termination of the second depth-first search, the number of leaf nodes before finding "VP" was taken to be the index of the beginning of the verb phrase in the sentence, and this count summed with the leaf node count upon termination of the depth-first search on the "VP" subtree was the index of the end of the verb phrase in the sentence. The pseudocode for this method is given below.

Algorithm 1 Finding the start and end words of the top-level verb phrase in a sentence.

function VP_POSITION(tree)
    word_count = 0
    vp_found = False
    start, end = 0, 0
    for subtree in DFS(tree) do
        if is_leaf(subtree) then
            word_count += 1
        else if subtree.label == "VP" and vp_found == False then
            vp_found = True
            start = word_count
            for vp_subtree in DFS(subtree) do
                if is_leaf(vp_subtree) then
                    word_count += 1
                end if
                if word_count > 0 then
                    end = word_count - 1
                else
                    end = 0
                end if
            end for
            break    ▷ break out of the top-level DFS after the DFS of the verb phrase subtree terminates
        end if
    end for
    return (start, end)
end function

9.2.3 Results

Performance reported by the classifier after appending the VERB feature family is given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%
PREP            62.59%
VERB            63.86%

9.3 Placement of Prepositional Phrases

A prepositional phrase is a syntactic unit of a sentence whose head word is a preposition. In English, prepositional phrases can often be placed either before or after the head verb of the sentence:

Before going to class, he quickly did his homework.

He quickly did his homework before going to class.

The positioning of prepositional phrases was investigated across speakers of the three classes, where a prepositional phrase was categorised as either appearing before or after the first verb phrase in a sentence. It was found that Japanese speakers place prepositional phrases at the beginning significantly less often than speakers in the other classes, and features exploiting this fact were found to increase performance by 0.5%. The table below shows the percentage of prepositional phrases occurring before the first verb phrase.

Non-Native Class   % PP occurring before first VP
Korean             16.73%
Chinese            17.59%
Japanese           13.26%

9.3.1 Method

Most prepositional phrases were correctly identified as "PP"; however, a fragmented sentence would occasionally cause the parser to mark a prepositional phrase as a subordinate clause whose head word is a preposition. This second case was also considered in the algorithm, which is explained below. In this feature family, counts of prepositional phrases arising as PPs and counts arising as subordinate clauses were treated as separate features.

In order to find the number of prepositional phrases preceding the first verb phrase, we proceed as follows. Beginning at the root of the sentence's constituency tree, perform a depth-first search of the tree until a prepositional phrase (or a subordinating conjunction containing a preposition) is found. Treating the nodes on the level of the prepositional phrase as a string of words, record the index of the prepositional phrase (pp_index) along with the index of the first verb phrase at this level (vp_index). If pp_index < vp_index, record this sentence as containing a clause with a prepositional phrase preceding any verb phrases.

Algorithm 2 Determining whether a tree is a prepositional phrase involves also checking whether it is a subordinating conjunction with a preposition as its leftmost child. This function is used in the PP_FIRST function.

function IS_PP(tree)
    if tree.label == "PP" then
        return True
    end if
    children = tree.children
    for i from 0 to children.length do
        if children[i].label == "SBAR" then
            sbar_children = children[i].children
            if sbar_children[0].label == "IN" then
                return True
            end if
        end if
    end for
    return False
end function

Algorithm 3 Determining whether a prepositional phrase appears before the first verb phrase within any of a sentence's clauses.

function PP_FIRST(tree)
    pp_index = -1
    vp_index = -1
    for subtree in DFS(tree) do
        if pp_index > -1 then
            break
        end if
        children = subtree.children
        for i from 0 to children.length do
            if is_pp(children[i]) and pp_index == -1 then
                pp_index = i
            else if children[i].label == "VP" and vp_index == -1 then
                vp_index = i
            end if
        end for
    end for
    if pp_index > -1 and pp_index < vp_index then
        return True
    else
        return False
    end if
end function

9.3.2 Results

Performance reported by the classifier after appending this family (PPPOS) is given below.

Feature Added   Accuracy
POS             41.96%
SPELL           47.76%
LCHECK          56.88%
STOPW           61.13%
PREP            62.59%
VERB            63.86%
PPPOS           64.36%

9.4 Correlative Conjunctions

Correlative conjunctions are pairs of conjunctions that appear together as a pattern in natural language; in English, "If S1, then S2" and "Both S1 and S2" are examples. Chinese makes heavy use of both of these structures, as well as of others which do not appear in native English speech; for example, the structure "Although S1, but S2" is natural in Chinese but not in English. Instances of different correlative conjunctions were counted for each class and used as features. The methodology for finding instances of correlative conjunctions is discussed first, then the results are given.

Adding this feature family on top of the classifier did not increase performance, and so these features were not useful in the classification task. However they are included in the main body of this report, as it was found that Chinese speakers were much more likely to make use of correlative conjunctions in English than the other speakers across every type of correlative conjunction that was tested. Therefore the research done in this portion of the task showed a clear difference in the stylistic usage of English among speakers in the three classes. The conjectured reason for their poor contribution to accuracy is that structures of this type are sparse in text written by non-natives, and each correlative conjunction has a different constituency tree structure so it is time-consuming to devise an algorithm for capturing each individual type. However if a very large number of these structures were investigated they could possibly provide a significant increase in performance, given their more frequent use by Chinese speakers.

9.4.1 Linking Adjectives and Phrases with Conjunctions

This structure aims to capture grammatical patterns of the form "Both A and B", "Neither A nor B", "Either A or B". After analysing constituency parse trees in the learner corpus by hand, it was found that this pattern can be captured according to the rule

[DT/CC] [JJ/NP/VP/ADJP] CC [JJ/NP/VP/ADJP]

where the second instance of [JJ/NP/VP/ADJP] must agree with the first. Informally, this pattern can be described as "two (noun/verb/adjective) phrases of the same type linked by two conjunctions". Strictly, the [DT/CC] should always be CC; however, the parser often tagged this part of the pattern incorrectly.
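A sketch of matching this rule over an NLTK constituency tree is shown below; it scans each node's children for the label sequence above and is only an approximation of the rule used in the project.

from nltk.tree import Tree

PHRASE_LABELS = {"JJ", "NP", "VP", "ADJP"}

def has_correlative_pattern(tree):
    """Look for [DT/CC] [JJ/NP/VP/ADJP] CC [JJ/NP/VP/ADJP] among any node's children,
    where the two phrase labels must agree."""
    for subtree in tree.subtrees():
        labels = [child.label() if isinstance(child, Tree) else child for child in subtree]
        for i in range(len(labels) - 3):
            first, a, cc, b = labels[i:i + 4]
            if first in {"DT", "CC"} and cc == "CC" and a in PHRASE_LABELS and a == b:
                return True
    return False

sentence = Tree.fromstring(
    "(S (NP (NP (NNP Today) (POS 's)) (NN class)) "
    "(VP (VBD was) (ADJP (DT both) (JJ helpful) (CC and) (JJ interesting))))")
print(has_correlative_pattern(sentence))   # True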

(S
  (NP (PRP He))
  (VP (VBZ is)
      (NP (CC neither)
          (NP (DT a) (JJ good) (NN role) (NN model))
          (CC nor)
          (NP (DT a) (JJ bad) (NN influence)))))

Figure 9.2: A sentence containing the phrase "Neither a good role model nor a bad influence", with the corresponding substring "CC NP CC NP" highlighted.

(S
  (NP (NP (NNP Today) (POS 's)) (NN class))
  (VP (VBD was)
      (ADJP (DT both) (JJ helpful) (CC and) (JJ interesting))))

Figure 9.3: A sentence containing the phrase "Both helpful and interesting", with the corresponding substring "DT JJ CC JJ" highlighted.

9.4.2 Correlative Conjunctions Spanning Multiple Clauses

In the previous example, the role of the correlative conjunction was to connect phrases in the same clause. In the case of the patterns "if,then", "although, but" and "because, so", two clauses are being connected where each conjunction in the pair normally appears at the beginning of its respective clause. Constituency parse trees were again analysed by hand in order to find out how these patterns normally appear.

In general, this type was represented in a parse tree where each clause is a separate subtree, and the conjunction within the first clause is a direct leftmost child of the root node of the subtree. However, the conjunction corresponding to the second clause is normally not contained inside the second clause's subtree, but rather directly adjacent on its left side. This is shown in the figure below, where "although" and "but" connect two clauses. While this was the most common pattern, there were many exceptions to this rule, and it proved very difficult to produce a good heuristic rule which provided high recall. The simplest procedure was opted for: perform a depth-first search until an instance of "although" is found as a subordinating conjunction, then perform another depth-first search on adjacent subtrees until an instance of "but" is found as a conjunction.

(S
  (SBAR (IN Although)
        (NP (NN somebody))
        (VP (MD will)
            (VP (VB feel)
                (ADJP (JJ sick))
                (PP (IN after) (S (VP (VBG watching)))))))
  (, ,)
  (CC but)
  (S (NP (PRP I))
     (VP (VBD enjoyed) (NP (PRP it)))))

Figure 9.4: Typical structure of a sentence containing the correlative conjunction "Although, but" connecting two clauses.

9.4.3 Frequencies

In the following table, the frequency of correlative conjunctions across each class is given, normalised by the number of sentences across each respective class.

Non-Native Class   If,then      Although,but   DT JJ CC JJ   JJ and JJ    Total
Korean             6.6 × 10−4   3.5 × 10−4     2.2 × 10−4    5.6 × 10−4   1.8 × 10−3
Chinese            8.7 × 10−4   7.2 × 10−4     5.1 × 10−4    2.2 × 10−3   4.3 × 10−3
Japanese           2.3 × 10−4   1.3 × 10−4     2.6 × 10−4    6.3 × 10−4   1.3 × 10−3

Among the correlative conjunctions analysed, their use by Chinese speakers is 3.3 times more frequent than by Japanese speakers, and 2.4 times more frequent than their use by Korean speakers.

Chapter 10

Further Methods to Increase the Performance of the Multi-Class Classifier, and Analysis of the Accuracy

A final attempt was made in order to improve the performance of the multi-class classifier, where features used in the native vs non-native problem were found to be effective when added to the multi-class classi- fier, and dimensionality reduction was considered and rejected. The ultimate accuracy of 68.8% is also analysed further by considering the confusion matrix of accuracies.

10.1 Incorporating Features Used in the Native vs Non-Native Problem

Although the features used in the binary native vs non-native problem were not designed with the multi- class problem in mind, they were found to boost the performance of the multi-class classifier significantly. The two binary language model feature types (SLOR score and POS score) as well as the heuristic features used in chapter 6 (branch factor, adverb placement and passive voice) were appended to the multi-class classifier. This increased performance from 64.36% to 68.8%. This new appended classifier will be considered in section 11.2, where specific feature importances will be examined.

10.2 Arguments against Dimensionality Reduction

The motivation for reducing the dimensionality of a classifier is generally to speed up training time and to improve performance by avoiding the "curse of dimensionality": as more dimensions are added to the feature space, the volume of the space increases exponentially, which greatly increases the regions of space over which the classifier has no predictive power (Donoho, p. 16). There was not much motivation to attempt dimensionality reduction in this task, since the number of features (65) is very small in the context of natural language processing classifiers, so training time is fast and dimensionality is low prior to any attempt at dimensionality reduction. Furthermore, the analysis in section 11.2 shows that the vast majority of features were informative and contributed positively to the classifier's performance.

10.3 Confusion Matrix

The confusion matrix displayed in this section shows that entries in the Korean class were classified with the greatest accuracy, with the Chinese class closely behind. Classification accuracy of the Japanese class is the lowest, at 66%.

Chapter 11

Feature Analysis of the Multi-Class Classifier

11.1 Methods used during Feature Analysis

Before commencing with the feature analysis portion of this chapter, the basic methods used to analyse the effectiveness of each feature and the correlations between features are outlined. These methods are, respectively, information gain and continuous mutual information.

11.1.1 Information Gain

Information gain, which was covered in section 4.3.2 and justified as a method of ranking features in section 4.3.3, is used here to judge the effectiveness of features.

11.1.2 Continuous Mutual Information

Informally, the mutual information I(A, B) between feature A and feature B is the amount of information that the two features share. Formally, it is the entropy of feature A minus the entropy of A given that B is already known:

I(A, B) = H(A) − H(A|B)

In the continuous case it is calculated as

I(A, B) = ∫A ∫B p(a, b) log( p(a, b) / (p(a) p(b)) ) da db

where p(a, b) is the joint probability density function of A and B, and p(a) and p(b) are the marginal probability density functions. Mutual information is useful for computing how correlated features are, and is used in this chapter as a metric for investigating this. The Python library MinePy (Albanese et al., 2012) was used, where the score between every pair of features was computed across the cross-validation set.
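A sketch of the pairwise computation is given below; it uses minepy's MINE estimator, taking its mic() score as the pairwise association measure, and assumes the features are available as columns of a NumPy array. The parameter values and function names are illustrative.

import numpy as np
from minepy import MINE

def pairwise_scores(X):
    """Association score for every pair of feature columns of X (n_samples, n_features)."""
    mine = MINE(alpha=0.6, c=15)      # default MINE parameters
    scores = {}
    n_features = X.shape[1]
    for i in range(n_features):
        for j in range(i + 1, n_features):
            mine.compute_score(X[:, i], X[:, j])
            scores[(i, j)] = mine.mic()
    return scores

X = np.random.rand(500, 5)            # stand-in for the 65-feature cross-validation matrix
top_pairs = sorted(pairwise_scores(X).items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_pairs)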

11.2 Feature Analysis

Despite the fact that feature elimination was not explored in this task, statistics on feature importances and correlations are provided here for the sake of insight, and to inform future research in this area where feature elimination may be utilised.

11.2.1 Using Information Gain to Rank all Features

Feature analysis was performed on the finished classifier using the cross-validation set in order to gauge how important each feature is to the final performance score of 68.8%. The graph of all features ranked by their information gain is given below.1

Figure 11.1: Ranked feature importances in the multi-class classifier.

All features in the classifier have nonzero information gain and so are informative in some way, and no single feature or small set of features dominates all of the others. The 20 most important features according to the random forest classifier are listed below.

The three most important features were found to be those from the native vs non-native problem (in chapters 5 and 6), whereas features from the effective STOPW family do not appear in the top 20 at all. The reason for this may be that an individual stopword feature is ineffective on its own; however, collectively as a group of 30 features they contribute significantly to the accuracy. Therefore, ranking importances on the level of individual features does not account for large, effective feature families formed of features which are not significant individually.

1Code used to generate this graph was taken from http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Feature name                       Information Gain   Family
SLOR model                         0.050421           Native vs Non-Native
Sentence length                    0.049080           Native vs Non-Native
Native/non-native POS model        0.043027           Native vs Non-Native
Comma parenthesis whitespace       0.038883           LCHECK
Preposition probability Korean     0.037695           PREP
Preposition probability Japanese   0.036206           PREP
Verb probability Japanese          0.033989           VERB
Preposition probability Chinese    0.033860           PREP
Verb probability Korean            0.031796           VERB
Branching factor                   0.027869           Native vs Non-Native
Verb probability Chinese           0.027479           VERB
Vowel insert distance              0.026313           SPELL
Chinese POS model                  0.025366           POS
Korean POS model                   0.025152           POS
Excess whitespace                  0.024170           LCHECK
Swap distance                      0.024088           SPELL
Japanese POS model                 0.024032           POS
General Levenshtein distance       0.023416           SPELL
Sentence start lowercase           0.023357           LCHECK
Delete distance                    0.019686           SPELL

Figure 11.2: The 20 most informative features ranked in descending order.

11.2.2 Using Information Gain to Rank Feature Families

Family Name            Total Information Gain
Native vs Non-Native   0.22462
STOPW                  0.22381
LCHECK                 0.13389
SPELL                  0.13920
LCHECK                 0.11589
PREP                   0.11110
VERB                   0.094692
POS                    0.075497
PPPOS                  0.01518

Figure 11.3: The most informative families ranked in descending order.

The STOPW family is evidently very informative, despite none of this family’s features appearing in the 20 most informative features.

11.3 Feature Correlation

The mutual information between all (65 choose 2) = 2080 possible feature pairs was computed and plotted as a histogram in order to obtain a general overview of how the correlations between features are distributed. The histogram shows a multi-modal distribution, with one large peak close to 0 and another peak at 0.05. The distribution also shows that the vast majority of feature pairs are not highly correlated, with a negligible number of pairs having mutual information higher than 0.2.

The table below shows the ten most correlated pairs. From this small subset of pairs, we can see that the most correlated pairs tend to be features which are intuitively very similar. The table also shows that the Chinese, Korean and Japanese POS model features are very highly correlated, as was seen in section 5.4.

Feature pair                                                       Mutual information
Korean POS model / Japanese POS model                              0.97099
Chinese POS model / Japanese POS model                             0.81211
Korean POS model / Chinese POS model                               0.80754
Verb probability Korean / Verb probability Japanese                0.71941
Verb probability Chinese / Verb probability Japanese               0.61863
Verb probability Korean / Verb probability Chinese                 0.57702
Delete distance / Delete consonant distance                        0.51155
Preposition probability Chinese / Preposition probability Japanese 0.45439
Delete distance / Delete vowel distance                            0.39565
Preposition probability Korean / Preposition probability Japanese  0.38375

Figure 11.4: The ten most correlated feature pairs, ranked in descending order of mutual information.

Chapter 12

Unused and Unsuccessful Methods

Over the course of this project, many avenues were explored and found to be interesting and discriminative, but exposed patterns which appeared too infrequently in the data to be practical. The most interesting are included briefly in this chapter.

12.1 Using Simple Methods to Count Preposition Errors

A list of 15 very common English verbs which require the addition of a preposition when used with an object was compiled. Sentences in the learner corpora were then parsed using NLTK's POS tagger, and a "missing preposition" error was detected when one of the verbs requiring a preposition was followed by a noun, determiner or pronoun without a preposition. An example would be "Listen music", where the correct expression is "Listen to music". Across each of the three classes, it was found that between 12% and 14% of occurrences of the 15 words were missing a preposition according to this metric. The conclusion is that either there is not a large difference in "missing preposition" errors across the classes, or the methodology was not adequate.
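A sketch of this detection rule is shown below; the verb list is a hypothetical stand-in for the 15 verbs compiled in the project, and the tag set used for the following word is an assumption. (It requires the NLTK "punkt" and "averaged_perceptron_tagger" data.)

import nltk

# hypothetical stand-in for the 15 verbs that require a preposition before an object
PREP_VERBS = {"listen", "go", "wait", "belong", "apologise"}
OBJECT_TAGS = {"NN", "NNS", "NNP", "PRP", "DT"}

def missing_preposition(sentence):
    """Flag a verb from PREP_VERBS directly followed by a noun, determiner or pronoun."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for (word, tag), (_, next_tag) in zip(tagged, tagged[1:]):
        if tag.startswith("VB") and word.lower() in PREP_VERBS and next_tag in OBJECT_TAGS:
            return True
    return False

print(missing_preposition("I listen music every day."))      # True
print(missing_preposition("I listen to music every day."))   # False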

12.2 Overuse of "to me"

Bauman (2006) lists the addition of "to me" after an opinionated expression as ubiquitous in English speech among Koreans. An example listed by Bauman is the expression "This pizza tastes good to me". This quirk was investigated using NLTK's POS tagger, by counting strings containing the POS substring "JJ to PRP", meaning an adjective followed by "to" followed by a pronoun. Results showed that this grammatical pattern is indeed more common among Korean speakers than in the other non-native classes. The raw count in each class, normalised by the number of words in the class, is shown in the table below.

Non-Native Class   Raw Count   Normalised by # Words
Korean             464         3.2 × 10−4
Chinese            273         1.8 × 10−4
Japanese           109         1.0 × 10−4

12.3 Incorrect Tense used with Modal Verb

While detecting tense errors with high recall is in general a difficult problem, there are various consistent tense rules in the English language which can be captured easily using a flat POS parse. One of these is that a modal verb (such as "should" or "can") must be followed by a verb in its infinitive form; for example, "I should going" is incorrect. Incorrect patterns of this form were detected by looking for substrings of the pattern MD [VBG/VBZ/VBN/VBD]; that is, a modal verb followed by a verb not in infinitive form. It was found that Chinese speakers committed this type of error most often. The table below reports the number of sentences in which this incorrect pattern was detected.

Non-Native Class   Raw Count   Normalised by # Sentences
Korean             189         6.4 × 10−4
Chinese            275         1.1 × 10−3
Japanese           106         4.3 × 10−4

12.4 "One of" Followed by Singular Noun

This type of mistake was investigated using a dependency tree structure; in general, the nodes corresponding to "one", "of" and the singular noun appear as a chain in the tree, as shown below. A depth-first search was performed on the tree to look for chains of this type. Below, the clause "Because one of my favorite view is fall landscape" is shown, with the chain highlighted.

is
    Because
    one
        of
            view
                my
                favorite
    landscape
        fall

Non-Native Class   Raw Count   Normalised by # Sentences
Korean             267         9.1 × 10−4
Chinese            164         6.5 × 10−4
Japanese           251         1.0 × 10−3
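A minimal sketch of this dependency-chain search using spaCy is given below; it is illustrative rather than the project's exact implementation, and whether the chain is found depends on the parser's analysis.

import spacy

nlp = spacy.load("en_core_web_sm")

def one_of_singular(sentence):
    """Detect the chain one -> of -> singular noun (tag NN) in the dependency tree."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.lower_ == "one":
            for prep in tok.children:
                if prep.lower_ == "of" and prep.dep_ == "prep":
                    for pobj in prep.children:
                        if pobj.dep_ == "pobj" and pobj.tag_ == "NN":
                            return True
    return False

print(one_of_singular("Because one of my favorite view is fall landscape."))   # True (parse-dependent)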

Chapter 13

Conclusion

In this project, machine learning techniques were used to approach the task of discriminating between native and non-native English users as well as among non-native English users whose native languages are Korean, Chinese and Japanese. The native vs non-native problem was solved effectively using both language model features and hand-picked features designed using linguistic knowledge, with accuracy scores of 94.44 − 97.98% and 88.69 − 94.74% respectively. The most effective classifier in the native vs non-native problem was built from combining these two feature types, resulting in accuracies of 97.17 − 99.19%.

In the multi-class task, POS language models alone were found to give a low accuracy score of 41.96%. Using methods which were broadly inspired by previous research, namely capturing spelling error types, stopword frequencies and using an error correction library to count frequencies of different error types brought the accuracy up to 61.13%. Further previously unexplored techniques developed in order to capture verb and preposition usage brought the accuracy up further to 64.36%. Afterwards, it was found that appending the feature types specifically designed for the native vs non-native problem significantly improved performance on the multi-class classifier, bringing the performance up to the final score of 68.8%.

While the accuracies reported in this project could be considered acceptable, they do not meet or exceed the accuracies reported in research such as Tetreault et al. (2012), who used the large ICLE corpus to discriminate speakers among whom were Chinese, Japanese and Korean native speakers, and did so with accuracies of up to 86% while making more extensive use of statistical language model features. It is likely that if the language models used in this project were trained on bigger data, and if character n-gram features were also utilised, the classification accuracy of 68.8% could be improved to match previous research.

13.0.1 Insights Gained

A secondary objective of this task was to report on differences between the writing style and structure used by speakers of the different classes. In the case of the native vs non-native problem, stark differences were found, although these differences should be interpreted with the disclaimer that the native data and non-native data were taken from different domains: the former from Slate magazine, and the latter from the Lang-8 website. While the differences found in writing between the native and non-native speakers make sense from a linguistic perspective, it is possible that these differences are more reflective of discrepancies between the domains of the data than of inherent linguistic differences. It would be interesting to perform the same analysis on native and non-native English writing from the same domain to verify that the differences still hold. Nevertheless, the following insights were found in the native vs non-native problem:

• Speakers in the native class produce more right-branching grammar trees than in the non-native class.

• Speakers in the native class place adverbs at the end of sentences far less often than in the non-native class.

• Passive voice is much more heavily used in the native class.

In the case of the non-native multi-class problem, speakers in each class belong to the same domain (the Lang-8 learner corpus), and so it can be said with more confidence that the differences found here likely reflect true inherent differences in how different non-native English users write, and that these differences likely come from L1 interference. They are as follows:

• Speakers across different classes commit spelling errors with roughly the same frequency and severity, however some specific types of errors are far more frequent in certain classes. For example, Chinese speakers are most likely to insert superfluous vowels, whereas Korean speakers are most likely to commit swap errors.

• Chinese speakers make the heaviest use of correlative conjunctions across the types of correlative conjunctions investigated.

• Japanese speakers are least likely to place prepositional phrases before verb phrases.

13.1 Motivation for Future Research

This project leaves several areas only partially explored and leads to several more unexplored problems. They are outlined below.

• Test the generalisation of this classifier on a different corpus such as ICLE or TOEFL11. It was shown in Brooke and Hirst (2011) that classifiers built using the ICLE corpus do not generalise well to the Lang-8 corpus. It would be interesting to test the generalisation of this classifier, which was built using the Lang-8 corpus, on another non-native English corpus such as ICLE or TOEFL11.

• Use language model confidence score features from models which are trained on a larger amount of data in the multi-class case, and utilise more types of features engineered using language models, as in Tetreault et al. (2012). The main effort of this project was to use more targeted and insightful feature selection; however, the result is that the classifier accuracy falls short of accuracies reported in previous research which makes heavier use of simple statistical models.

• Provide a more rigorous definition of topic bias, and what causes it. Assumptions regarding topic bias are made throughout this project, which were outlined in chapter 1. It would be useful to determine whether topic influences grammatical structure in a significant way. This would be a difficult task, since grammar can vary in many different ways. Such an experiment could be conducted by viewing the usage of grammatical function words across different domains where the only variable factor is topic, as well as by training a POS language model on a large corpus encompassing many topics, and comparing scores across each topic domain on a separate corpus.

• Explore further the methods which did not help the performance on the classifier despite discriminating classes well; try to find more examples of these cases to use as features so as to make their occurrences less sparse. The prime example is the use of correlative conjunctions, which discriminated Chinese speakers well but the appearance of correlative conjunctions was too sparse to contribute to classification accuracy.

Appendix A

Tagsets

A.1 Table of Phrase-Level POS Tags

Various phrase-level tags appear in constituency trees throughout this project. Their meanings are listed here.

Tag    Description
ADJP   Adjective phrase
NP     Noun phrase
POS    Possessive ending ('s)
PP     Prepositional phrase
S      Sentence
SBAR   Subordinating conjunction
VP     Verb phrase

A.2 Table of Word-Level POS Tags

This table from Santorini (1991) is to be used as a reference for the reader when word-level POS tags are used throughout this project.

Tag    Description
CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NP     Proper noun, singular
NPS    Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
TO     to
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb

Figure A.1: List of the Treebank POS tagset. See Santorini (1991) for a more thorough description of each tag.

Bibliography

A. Agresti. Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley, 2003. ISBN 9780471458760.

D. Albanese, M. Filosi, R. Visintainer, S. Riccadonna, G. Jurman, and C. Furlanello. minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 2012. doi: 10.1093/bioinformatics/bts707.

S. Argamon-Engelson, M. Koppel, and G. Avneri. Style-based text categorization: What newspaper am I reading? Proceedings of the AAAI Workshop on Text Categorization, pages 1–4, 1998.

N. R. Bauman. A Catalogue of Errors Made by Korean Learners of English. Proceedings of the 14th annual Korea TESOL international conference, pages 167–176, 2006.

J. Brooke and G. Hirst. Native language detection with ’cheap’ learner corpora. In In Learner Corpus Research 2011 (LCR 2011), Louvain-la-Neuve, 2011.

L. Brown and J. Yeon. The Handbook of Korean Linguistics. Wiley-Blackwell, 2015. ISBN 9781118354919.

L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. CoRR, abs/1309.0238, 2013. URL http://arxiv.org/abs/1309.0238.

E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. 2005.

N. Chomsky. Syntactic Structures. page 27, 1957.

D. L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. In AMS Conference on Math Challenges of the 21st Century, 2000.

E. Francis. Grammatical weight and extraposition in English. Cognitive Linguistics, 21: 37–45, 2010.

P. Fung and W. K. Liu. Fast Accent Identification and Accented Speech Recognition. 1999.

K. Heafield. KenLM: faster and smaller language model queries. WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, 2011.

G. Hicham, Y. Abdallah, and B. Mustapha. Introduction of the weight edition errors in the Levenshtein distance. International Journal of Advanced Research in Artificial Intelligence (IJARAI), 2012.

M. Honnibal and M. Johnson. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL https://aclweb.org/anthology/D/D15/D15-1162.

N. Ide and K. Suderman. The American National Corpus first release. In Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), pages 1681–1684, 2004.

E. Kochmar. Identification of a Writer’s Native Language by Error Analysis. University of Cambridge Master’s Thesis, 2011.

M. Koppel, J. Schler, and K. Zigdon. Determining an author’s native language by mining a text for errors. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005.

H. Kubozono. Handbook of Japanese Phonetics and Phonology. De Gruyter, 2015. ISBN 9781614511984.

J. H. Lau, A. Clark, and S. Lappin. Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge. Cognitive Science, pages 20–21, 2016.

V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707, February 1966.

E. Loper and S. Bird. NLTK: The Natural Language Toolkit. 2002.

T. Mizumoto, M. Komachi, M. Nagata, and Y. Matsumoto. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pp.147-155. Chiang Mai, Thailand, 2011.

M. Miłkowski. Developing an open-source, rule-based proofreading tool. 2010.

R. Nagata, M. Vilenius, and E. Whittaker. Correcting Preposition Errors in Learner English Using Error Case Frames and Feedback Messages. Proceedings of The 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

T. Osborne, M. Putnam, and T. Groß. Bare phrase structure, label-less trees, and specifier-less syntax: Is Minimalism becoming a dependency grammar? The Linguistic Review, pages 315–364, 2011.

K. Percival. Reflections on the history of dependency notions in linguistics. Historiographia Linguistica, 17:29–47, 1990.

R. Perkins. Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis. 2014.

J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986. ISSN 0885-6125. doi: 10.1023/A:1022643204877. URL http://dx.doi.org/10.1023/A:1022643204877.

M. Randall and N. D. Isnin. The Effects of Mother Tongue and First Language Literacy on Spelling. Centre for Research in Pedagogy and Practice, 2004.

B. Santorini. Part-of-Speech Tagging Guidelines for the Penn Treebank Project. 1991.

C. Strobl, A. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25, 2007. ISSN 1471-2105. doi: 10.1186/1471-2105-8-25. URL http://dx.doi.org/10.1186/1471-2105-8-25.

T. Tajiri, M. Komachi, and Y. Matsumoto. Tense and aspect error correction for ESL learners using global context. Proceeding ACL ’12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, 2:198–202, 2012.

J. Tetreault, D. Blanchard, A. Cahill, and M. Chodorow. Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification. Hunter College and the Graduate Center, City University of New York, 2012.

L. Tomokiyo and R. Jones. You’re not from ’round here, are you?: Naive Bayes detection of non-native utterance text. Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies (NAACL ’01), 2001.
