An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification
Imperial College London
Department of Computing

Author: Clavance Lim
Supervisors: Prof Alessandra Russo, Nuri Cingillioglu

Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London

September 2019

Contents

Abstract
Acknowledgements
1 Introduction
    1.1 Motivation
    1.2 Aims and objectives
    1.3 Outline
2 Background
    2.1 Overview
        2.1.1 Text classification
        2.1.2 Training, validation and test sets
        2.1.3 Cross validation
        2.1.4 Hyperparameter optimization
        2.1.5 Evaluation metrics
    2.2 Text classification pipeline
    2.3 Feature extraction
        2.3.1 Count vectorizer
        2.3.2 TF-IDF vectorizer
        2.3.3 Word embeddings
    2.4 Classifiers
        2.4.1 Naive Bayes classifier
        2.4.2 Decision tree
        2.4.3 Random forest
        2.4.4 Logistic regression
        2.4.5 Support vector machines
        2.4.6 k-Nearest Neighbours
        2.4.7 Multilayer perceptron
        2.4.8 Convolutional neural networks
        2.4.9 Recurrent neural networks
        2.4.10 Hierarchical attention network
3 The EURLEX Dataset
    3.1 Structure of the EUR-Lex database
    3.2 The EUR-Lex paper
    3.3 Filtering the dataset
        3.3.1 Distribution of the distilled EURLEX dataset
    3.4 Analysis of the distilled EURLEX dataset
    3.5 Dataset visualisation
4 Experiments
    4.1 Overview
    4.2 Preprocessing
    4.3 Machine learning models
    4.4 Summary of results
    4.5 Analysis of results
        4.5.1 Naive Bayes classifier
        4.5.2 Decision tree
        4.5.3 Random forest
        4.5.4 Logistic regression
        4.5.5 k-NN
        4.5.6 Linear SVM
        4.5.7 Non-linear SVM
        4.5.8 MLP
        4.5.9 Preliminary conclusions
    4.6 Deep learning models
    4.7 Summary of results
        4.7.1 General trend
        4.7.2 Sentence embeddings
    4.8 Analysis of results
        4.8.1 MLP
        4.8.2 CNN
        4.8.3 LSTM
        4.8.4 HAN
        4.8.5 Conclusions on deep learning models
    4.9 Choice of best models
5 Related Work
    5.1 Deep learning approaches to NLP
    5.2 Deep learning approaches to text classification
    5.3 Text classification in the legal context
6 Conclusions
    6.1 Summary of contributions
    6.2 Practical implications
    6.3 Challenges
        6.3.1 Lack of labelled data
        6.3.2 Hardware limitations
    6.4 Possible improvements
    6.5 Future work
    6.6 Legal and ethical considerations
Appendices
    A Ethics checklist
    B Copy of readme for code repository

Abstract

This project provides a comprehensive comparative study of the performance of supervised machine learning models in the natural language processing task of text classification, specifically in the legal context. We distill a dataset of European Union legislation for multi-label classification into one for a single-label, multi-class classification task. We provide visualisations and analysis of the dataset. We then draw a distinction between ‘machine learning’ models, including the Naive Bayes classifier, logistic regression and support vector machines, and more contemporary ‘deep learning’ approaches, such as convolutional neural networks, long short-term memory networks and the hierarchical attention network. We experiment with traditional count-based vectorizers for feature embedding with the machine learning models, and pre-trained word embeddings for the deep learning models. We critically evaluate the performance of each model on its own, and with those in its group, before proposing a final model. Finally, we discuss the potential uses of such a classifier in professional legal practice.

Acknowledgements

I would like to express my utmost gratitude to the following people, without whom this project would not have been possible:

• Professor Alessandra Russo and Nuri Cingillioglu, for their continued guidance, encouragement and time throughout the course of this project.
• My family, for everything.
Chapter 1

Introduction

1.1 Motivation

The application of technology to assist legal professionals with the provision of legal services, a sector known as legal tech, has received tremendous investment and interest in recent years [29]. In particular, with the recent successes of machine learning methods in fields such as computer vision and pattern recognition, expectations have begun to arise that these methods will provide a panacea for the ills of the legal profession, such as repetitive administrative work [66].

The broad aim of this project is thus to apply the latest methods used in machine learning and in natural language processing (NLP) to a dataset in the legal context. More specifically, the goal will be to experiment with and compare the performance of several machine learning and deep learning methods for the task of text classification.

Text classification has many potential uses in the legal domain, particularly for categorising legal documents and cases, which can aid the process of legal research, and for the development of a knowledge management system (for a detailed example of such an implementation, see [5, 6]).

The task is an interesting one from an academic perspective, for several reasons. While text classification as an NLP task in general is well-studied, the specific study of text classification methods in the legal domain has remained relatively under-explored [28, 67]. Applying text classification methods specifically to the legal context is not a trivial problem: the fact that a method has proven useful in classifying texts of a general subject matter does not mean that it will necessarily work equally well in the legal context. This is because the structure of legal language can be distinguished from that of ordinary language in terms of vocabulary, syntax, semantics and other linguistic features [5, 65].
The types of texts used in NLP research tend to be user reviews, where the language is colloquial or informal (such as the IMDB dataset [45]), posts scraped from Twitter (which are limited to 280 characters) and other documents much shorter than a legal judgement or piece of legislation [28].

1.2 Aims and objectives

The broad aim of this project is to present a framework through which a document in the English language with legal subject matter can be classified into one of several predefined classes. Specifically, documents will be drawn from a dataset of European Union legislation, with each document belonging to one of 20 classes.

Concretely, the goal will be to propose a classification model. Given an unseen legal text document of length $n$, $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ is an individual token in the document, the model will assign $X$ to one of $k$ classes, where $k = 20$ in our case.

This aim will be achieved by fulfilling the following objectives:

• Extracting relevant sections of legal texts from their raw HTML source, obtained from a publicly accessible European Union law repository
• Preprocessing the unstructured data into a structured format
• Analysing the characteristics of the dataset (e.g. distribution of classes)
• Exploring different methods of evaluating the performance of classifiers
• Drawing a distinction between two groups of classifiers, ‘machine learning’ methods and ‘deep learning’ methods, and analysing each group separately, with different methods of feature extraction:
    – Classifiers based on various machine learning methods, with count vectorization and TF-IDF vectorization:
        1. Naive Bayes classifier
        2. Decision tree
        3. Random forest
        4. Logistic regression
        5. Support vector machines (SVMs)
        6. k-nearest neighbours (k-NN)
        7. Multilayer perceptron (MLP)
    – Comparing the performance of classifiers based on deep learning methods, with pre-trained