Enhancing Partially Labelled Data: Self Learning and Word Vectors in Natural Language Processing
Technological University Dublin
ARROW@TU Dublin
Dissertations, School of Computer Sciences, 2019

Enhancing Partially Labelled Data: Self Learning and Word Vectors in Natural Language Processing

Eamon McEntee
Technological University Dublin

Follow this and additional works at: https://arrow.tudublin.ie/scschcomdis
Part of the Computer Engineering Commons, and the Computer Sciences Commons

Recommended Citation
McEntee, E. (2019) Enhancing Partially Labelled Data: Self Learning and Word Vectors in Natural Language Processing, Masters Thesis, Technological University Dublin.

This Masters thesis is brought to you for free and open access by the School of Computer Sciences at ARROW@TU Dublin. It has been accepted for inclusion in Dissertations by an authorized administrator of ARROW@TU Dublin.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License.

Enhancing Partially Labelled Data: Self Learning and Word Vectors in Natural Language Processing

Eamon McEntee

A dissertation submitted in partial fulfilment of the requirements of Dublin Institute of Technology for the degree of M.Sc. in Computing (Stream)

Date: 23/09/2019

Declaration

I certify that this dissertation, which I now submit for examination for the award of MSc in Computing (Stream), is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

This dissertation was prepared according to the regulations for postgraduate study of the Dublin Institute of Technology and has not been submitted in whole or in part for an award in any other Institute or University. The work reported on in this dissertation conforms to the principles and requirements of the Institute's guidelines for ethics in research.
Signed:                    Date: 23/09/2019

Abstract

There has been an explosion in unstructured text data in recent years, with services like Twitter, Facebook and WhatsApp helping to drive this growth. Many of these companies face pressure to monitor the content on their platforms, and as such Natural Language Processing (NLP) techniques are more important than ever. Applications of NLP range from spam filtering and sentiment analysis of social media to automatic text summarisation and document classification. Many of the more powerful models are supervised techniques that need large amounts of labelled data in order to achieve excellent results. In many real-world cases, large text datasets, such as customer feedback, are available but unlabelled for a specific behaviour or sentiment of interest. Labelled data can be expensive to acquire, or at the very least represents a bottleneck in the modelling process as resources are allocated to apply labels. Self-learning is a technique that can take a small amount of labelled data and automatically label more data in order to grow the pool of training data and boost model performance.

This research investigated the merits of self-learning using a dataset of 10,000 Wikipedia natural language comments. It evaluates the performance of an iterative self-learning algorithm given small amounts (0.2%, 1.0%, 2.0%, 4.0%, 10.0%, 20.0%, 30%, 40%) of labelled data. One interesting finding was that just 400 labelled data points could be used to automatically label an additional 3,574 observations, resulting in an enhanced dataset with an Average Class Accuracy (ACA) of 97.1%. A Support Vector Machine (SVM) and a Convolutional Neural Network (CNN) were then trained on these enhanced datasets and their performance compared against models trained on the original datasets with no self-learning.
It was found that self-learning significantly improved model performance, particularly in cases with little labelled data (0.2%, 1.0%, 2.0%, 4.0%). A CNN trained on 400 labelled observations produced an ACA of 85.82%, less than the 87.11% ACA for an SVM trained on the same data. However, the introduction of self-learning to the process for this same dataset boosted CNN model performance to an ACA of 89.02%. A significant contribution of this research is to highlight the added value that self-learning can provide, particularly in situations where there is a small amount of labelled data or when working with algorithms that do not perform well with little labelled data.

Acknowledgments

I would sincerely like to thank my supervisor Sarah Jane Delaney for her prompt feedback, support and advice throughout the duration of this research. I would also like to thank my wife Miki and sons Kai and Jamie, who have supported me not only through this research but over the last 2 years while completing this course of study. Finally, I would like to thank my family and friends who also encouraged and helped me throughout this course; without their help and support none of this would have been possible.

Contents

Declaration
Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Background
  1.2 Research Project/Problem
  1.3 Research Objectives
  1.4 Research Methodologies
  1.5 Scope and Limitations
  1.6 Document Outline

2 Review of Existing Literature
  2.1 Domain Knowledge
    2.1.1 Text Representation
    2.1.2 Word Embeddings
    2.1.3 Text Modelling
    2.1.4 Self-Learning
  2.2 Algorithms
    2.2.1 Relevance Vector Machine
    2.2.2 Convolutional Neural Network
    2.2.3 Support Vector Machine
    2.2.4 Average Class Accuracy
    2.2.5 McNemar's Test
3 Experiment Design and Methodology
  3.1 Background
  3.2 Data
  3.3 Word Embeddings
  3.4 Self-Learning
    3.4.1 Enhancing the Data
  3.5 Classification
    3.5.1 Support Vector Machine
    3.5.2 Convolutional Neural Network
  3.6 Evaluation
    3.6.1 Self-Learning
    3.6.2 Classification

4 Results, Evaluation and Discussion
  4.1 Word Representation
    4.1.1 Aim
    4.1.2 Method
    4.1.3 Results
  4.2 Self-Learning
    4.2.1 Aim
    4.2.2 Method
    4.2.3 Results
  4.3 Classification
    4.3.1 Aim
    4.3.2 Method
    4.3.3 Results

5 Conclusion
  5.1 Research Overview
  5.2 Problem Definition
  5.3 Design/Experimentation, Evaluation & Results
    5.3.1 Word Representation
    5.3.2 Self-Learning
    5.3.3 Classification
  5.4 Contributions and Impact
  5.5 Future Work & Recommendations

References

A Additional Content

List of Figures

2.1 RVM Decision Boundary. Source: Bishop, C. M. (2006)
2.2 Example Layers of a Text CNN. Source: Zhang, Y., & Wallace, B. (2015)
2.3 Contingency table for McNemar's test
3.1 Distribution of Document Lengths
3.2 Snapshot of Dataset
3.3 Class Distribution
3.4 FastText embeddings most similar to "Strong"
3.5 Train datasets created for experiment
3.6 RVM Grid Search results on Dataset represented with word embeddings
3.7 RVM Grid Search results on Dataset represented as Term Document Matrix
3.8 SVM Grid Search Results
3.9 CNN Layers as defined in Keras
4.1 T1 - T8: Size of Train (labelled) / Test (initially unlabelled) datasets used for self-learning
4.2 Number of new labelled observations and Average Class Accuracy post self-learning
4.3 T1 - T8: Size of Train / Test for each training run
4.4 TE1 - TE8: Size of Train / Test for each training run
4.5 Average Class Accuracy: SVM vs. SVM Enhanced after 3 training cycles
4.6 Average Class Accuracy: CNN vs. CNN Enhanced after 3 training cycles
4.7 Average Class Accuracy: SVM Enhanced vs. CNN Enhanced after 3 training cycles

List of Tables

3.1 Dataset Features

List of Acronyms

ACA     Average Class Accuracy
BoW     Bag of Words
CNN     Convolutional Neural Network
LDA     Latent Dirichlet Allocation
LSA     Latent Semantic Analysis
NB      Naive Bayes
NLP     Natural Language Processing
RBF     Radial Basis Function
ReLU    Rectified Linear Unit
RVM     Relevance Vector Machine
SVM     Support Vector Machine
TDM     Term Document Matrix
TF-IDF  Term Frequency Inverse Document Frequency

Chapter 1

Introduction

1.1 Background

Big data has arrived: there has been an explosion in data creation in recent years, and this trend is only set to increase. Social network and messaging services such as Twitter, Facebook and WhatsApp are just some of the applications driving this growth, with the vast majority of this data being unstructured text. This growth has also made it harder for platforms to monitor content for inappropriate or offensive behaviour. Unstructured text data has traditionally been more difficult to analyse than structured data, and the domain of data science concerned with it is known as Natural Language Processing (NLP). NLP techniques work by turning large amounts of text into numeric feature vectors and then using statistical algorithms to find patterns within these vectors. These techniques range from well-established term-document matrix representations to the rising use of neural networks and word vectors. The more data provided to these algorithms, the more accurately they can learn. Some of the more powerful supervised techniques can classify text better than a human, and they utilise large amounts of labelled data to achieve this.
However, such labelled data may not be readily available for the task at hand and can be expensive to acquire. This thesis investigates self-learning, a technique that enables algorithms to learn from small amounts of labelled data, and evaluates its effectiveness.
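The core self-learning loop can be sketched briefly. The following is a minimal illustration only, not the thesis's experimental setup: it uses TF-IDF features with a logistic-regression classifier, invented toy texts, and an assumed confidence threshold of 0.6 (the thesis uses a Relevance Vector Machine on a Wikipedia comment dataset and tunes its own settings). The idea is the same: train on the small labelled pool, predict on the unlabelled pool, and move only confident predictions into the training data before retraining.

```python
# Minimal self-learning (self-training) sketch. Hypothetical example data
# and threshold; only the loop structure mirrors the technique discussed.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled_texts = ["great product", "awful service",
                  "really helpful staff", "terrible experience"]
labels = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative
unlabelled_texts = ["very helpful product", "awful and terrible",
                    "great and helpful", "service was terrible"]

# Represent all documents with TF-IDF term weights (one shared vocabulary).
vectoriser = TfidfVectorizer()
X_all = vectoriser.fit_transform(labelled_texts + unlabelled_texts)
X_lab = X_all[: len(labelled_texts)]
X_unlab = X_all[len(labelled_texts):]

THRESHOLD = 0.6  # assumed confidence cut-off for accepting a pseudo-label

for _ in range(3):  # a few self-learning iterations
    clf = LogisticRegression().fit(X_lab, labels)
    if X_unlab.shape[0] == 0:
        break
    probs = clf.predict_proba(X_unlab)
    confident = probs.max(axis=1) >= THRESHOLD
    if not confident.any():
        break  # nothing passes the threshold; stop growing the pool
    # Pseudo-label the confident observations and add them to the train pool.
    new_labels = clf.classes_[probs.argmax(axis=1)[confident]]
    X_lab = vstack([X_lab, X_unlab[confident]])
    labels = np.concatenate([labels, new_labels])
    X_unlab = X_unlab[~confident]

print(f"labelled pool now holds {len(labels)} observations")
```

The key design choice is the confidence threshold: set too low, the pool grows quickly but mislabelled observations contaminate later iterations; set too high, few observations are ever added and the technique offers little benefit.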