Global-Local Word Embedding for Text Classification
by
Mehran Kamkarhaghighi
A Thesis Submitted to the School of Graduate and Postdoctoral Studies in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in Electrical and Computer Engineering
Department of Electrical, Computer and Software Engineering
Faculty of Engineering and Applied Science
University of Ontario Institute of Technology
Oshawa, Ontario, Canada
April 2019
© Mehran Kamkarhaghighi, 2019

THESIS EXAMINATION INFORMATION

Submitted by: Mehran Kamkarhaghighi
Doctor of Philosophy in Electrical and Computer Engineering
Thesis title: Global-Local Word Embedding for Text Classification
An oral defense of this thesis took place on March 27, 2019 in front of the following examining committee:
Examining Committee:
Chair of Examining Committee Dr. Walid Morsi Ibrahim
Research Supervisor Dr. Masoud Makrehchi
Examining Committee Member Dr. Shahryar Rahnamayan
Examining Committee Member Dr. Qusay H. Mahmoud
University Examiner Dr. Jeremy S. Bradbury
External Examiner Dr. Chen (Cherie) Ding, Ryerson University
The above committee determined that the thesis is acceptable in form and content and that a satisfactory knowledge of the field covered by the thesis was demonstrated by the candidate during an oral examination. A signed copy of the Certificate of Approval is available from the School of Graduate and Postdoctoral Studies.
Abstract
Only humans can understand and comprehend the actual meaning that underlies natural written language, whereas machines can form semantic relationships only after humans have provided the parameters that are necessary to model the meaning. To enable computer models to access the underlying meaning in written language, accurate and sufficient document representation is crucial. Recent word embedding approaches have drawn much attention in text mining research. One of the main benefits of such approaches is the use of global corpora to generate pre-trained word vectors. Although very effective, these approaches have their disadvantages, namely sole reliance on pre-trained word vectors that may neglect the local context and increase word ambiguity. In this thesis, four new document representation approaches are introduced to mitigate the risk of word ambiguity and inject a local context into globally pre-trained word vectors. The proposed approaches, which are frameworks for document representation that use word embedding learning features for the task of text classification, are: Content Tree Word Embedding; Composed Maximum Spanning Content Tree; Embedding-based Word Clustering; and Autoencoder-based Word Embedding.
The results show improvement in the F-score measure for a document classification task applied to the IMDB Movie Reviews, Hate Speech Identification, 20 Newsgroups, Reuters-21578, and AG News benchmark datasets, in comparison to using three deep learning-based word embedding approaches, namely GloVe, Word2Vec, and fastText, as well as two other document representations: LSA and Random word embedding.
Keywords: Document Representation; Word Embedding; Text Classification; Deep Learning; Neural Networks
AUTHOR’S DECLARATION
I hereby declare that this thesis consists of original work that I have authored. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.
I authorize the University of Ontario Institute of Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research. I further authorize the University of Ontario Institute of Technology to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research. I understand that my thesis will be made electronically available to the public.
Mehran Kamkarhaghighi
Dedicated to my wife, without whose support and encouragement none of this would have been possible.
Acknowledgements
Firstly, I would like to express my sincere gratitude to my advisor, Dr. Masoud Makrehchi, for offering me the opportunity to work with him and for continuously supporting my Ph.D. studies and related research, as well as for his patience, motivation, and sharing of his immense knowledge. I could not imagine having a better advisor and mentor for my doctoral studies.
In addition to my advisor, I would like to thank the rest of my thesis committee, Dr. Shahryar Rahnamayan, Dr. Qusay H. Mahmoud, and Dr. Jeremy S. Bradbury, for not only their insightful comments and encouragement, but also for all the hard questions that encouraged me to widen my research from various perspectives.
My sincere thanks also go to my friends in SciLab, Dr. Somayyeh (Bahar) Aghababaei, Mahboubeh (Tara) Ahmadalinezhad, Fateme Azimlou, Iuliia Chepurna, Dr. Eren Gultepe, Neil Seward, and Afsaneh Towhidi, without whose precious support it would not have been possible to conduct this research.
I would also like to extend my thanks to my dear friends, Dr. Roozbeh Jalali and Dr. Reza Mohammad Alizadeh, for their encouragement and support throughout my research.
I also wish to express my appreciation to Mrs. Catherine Lee, for editing and proofreading this thesis.
Finally, I would like to express my greatest gratitude to my family: my dearest wife, Solmaz, and my father and my mother for their unconditional support during this period of my life.
STATEMENT OF CONTRIBUTIONS
I hereby certify that I am the sole author of this thesis and I have used standard referencing practices to acknowledge ideas, research techniques, or other materials that belong to others. Furthermore, I hereby certify that I am the sole source of the creative works and/or inventive knowledge described in this thesis. In all cases, I (Mehran Kamkarhaghighi) was the main investigator for each of the co-authored studies. Part of the work described in Chapter 2 as the literature review was previously published as:
M. Kamkarhaghighi, E. Gultepe, and M. Makrehchi, "Deep Learning for Document Representation," in Handbook of Deep Learning Applications: Springer, 2019, pp. 101-110.
The other part of the work described in Chapter 3 as the CTWE approach has been published as:
M. Kamkarhaghighi and M. Makrehchi, "Content tree word embedding for document representation," Expert Systems with Applications, vol. 90, pp. 241-249, 2017.
Table of Contents

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS AND SYMBOLS
CHAPTER 1. INTRODUCTION
1.1 Overview
1.2 Problem Statement
1.3 Research Questions
1.4 Research Objectives
1.5 Research Contribution
1.6 Thesis Outline
CHAPTER 2. LITERATURE REVIEW
2.1 Introduction
2.2 Traditional Document Representation Methods
2.3 Word2Vec
2.4 GloVe
2.5 fastText
2.6 LSA
2.7 Doc2Vec
2.8 Other Word Embedding Approaches
2.9 Taxonomy Induction
2.10 Summary
CHAPTER 3. PROPOSED DOCUMENT REPRESENTATION APPROACHES
3.1 Introduction
3.2 Content Tree Word Embedding
3.2.1 Document Content Tree
3.2.2 Content Tree Word Embedding Document Representation
3.3 Composed Maximum Spanning Content Tree
3.3.1 Maximum Spanning Content Tree Document Representation
3.3.2 Composed Maximum Spanning Content Tree Document Representation
3.4 Embedding-based Word Clustering Document Representation
3.5 Autoencoder-based Word Embedding
3.5.1 Autoencoder
3.5.2 Autoencoder-based Word Embedding Document Representation
3.6 Comparison with Similar Studies
3.7 Summary
CHAPTER 4. EXPERIMENTAL EVALUATION SETUP
4.1 Introduction
4.2 Datasets
4.2.1 IMDB Movie Review
4.2.2 Hate Speech Identification
4.2.3 20 Newsgroups
4.2.4 Reuters
4.2.5 AG News
4.3 Classification
4.3.1 Logistic Regression
4.3.2 Support Vector Machine
4.3.3 Naïve Bayes
4.3.4 Decision Tree
4.3.5 Random Forest
4.3.6 k-Nearest Neighbor
4.3.7 Deep Neural Networks
4.3.8 Convolutional Neural Networks
4.3.9 Recurrent Neural Networks
4.3.10 Hybrid Approaches
4.4 Evaluation Metrics
4.5 Experimental Setup
4.6 Summary
CHAPTER 5. EXPERIMENTAL RESULTS AND DISCUSSIONS
5.1 Introduction
5.2 The Effect of Document Representation on Traditional Classification Approaches
5.3 The Effect of Word Embedding on Deep Learning-based Approaches
5.4 The Effect of CTWE on Word Embeddings
5.4.1 The Effect of CTWE on Word2Vec
5.4.2 The Effect of CTWE on GloVe
5.4.3 The Effect of CTWE on fastText
5.4.4 The Effect of CTWE on LSA
5.4.5 The Effect of CTWE on Random Word Embedding
5.5 The Effect of CMSCT on Word Embeddings
5.5.1 The Effect of CMSCT on Word2Vec
5.5.2 The Effect of CMSCT on GloVe
5.5.3 The Effect of CMSCT on fastText
5.5.4 The Effect of CMSCT on LSA
5.5.5 The Effect of CMSCT on Random Word Embedding
5.6 CMSCT vs. MSCT
5.7 The Effect of AbWE on Word Embeddings
5.7.1 The Effect of AbWE on Word2Vec
5.7.2 The Effect of AbWE on GloVe
5.7.3 The Effect of AbWE on fastText
5.7.4 The Effect of AbWE on LSA
5.7.5 The Effect of AbWE on Random Word Embedding
5.8 The Effect of EbWC on Word Embeddings
5.8.1 The Effect of EbWC on Word2Vec
5.8.2 The Effect of EbWC on GloVe
5.8.3 The Effect of EbWC on fastText
5.8.4 The Effect of EbWC on LSA
5.8.5 The Effect of EbWC on Random Word Embedding
5.9 Comparison: The Presented Document Representation Approaches over Logistic Regression Efficiency
5.10 Comparison: The Presented Document Representation Approaches over Support Vector Machine Efficiency
5.11 Comparison: The Presented Document Representation Approaches over Naïve Bayes Efficiency
5.12 Comparison: The Presented Document Representation Approaches over Decision Tree Efficiency
5.13 Comparison: The Presented Document Representation Approaches over Random Forest Efficiency
5.14 Comparison: The Presented Document Representation Approaches over k-Nearest Neighbor Efficiency
5.15 Comparison: The Presented Document Representation Approaches over 1D-CNN Efficiency
5.16 Comparison: The Presented Document Representation Approaches over LSTM Efficiency
5.17 Comparison: The Presented Document Representation Approaches over Deep CNN-LSTM Tree Efficiency
5.18 Comparison: The Presented Document Representation Approaches over AdvCNN Tree Efficiency
5.19 Comparison: The Presented Document Representation Approaches over Boosted CNN Efficiency
5.20 Comparison between All the Represented Approaches among the Different Document Representations
5.21 The Total Effect of the Proposed Approaches over Word Embeddings
5.22 Comparison with the State-of-the-Art Results
5.23 Summary
CHAPTER 6. CONCLUSION AND FUTURE WORKS
6.1 Summary and Conclusion
6.2 Future Works
List of Tables

CHAPTER 4
TABLE 4.1. DATASET STATISTICS

CHAPTER 5
TABLE 5.1. THE EFFECT OF DOCUMENT REPRESENTATION ON TRADITIONAL CLASSIFICATION APPROACHES
TABLE 5.2. THE EFFECT OF WORD EMBEDDING ON DEEP LEARNING-BASED APPROACHES
TABLE 5.3. THE EFFECT OF CTWE ON WORD2VEC
TABLE 5.4. THE EFFECT OF CTWE ON GLOVE
TABLE 5.5. THE EFFECT OF CTWE ON FASTTEXT
TABLE 5.6. THE EFFECT OF CTWE ON LSA
TABLE 5.7. THE EFFECT OF CTWE ON RANDOM WORD EMBEDDING
TABLE 5.8. THE EFFECT OF CMSCT ON WORD2VEC
TABLE 5.9. THE EFFECT OF CMSCT ON GLOVE
TABLE 5.10. THE EFFECT OF CMSCT ON FASTTEXT
TABLE 5.11. THE EFFECT OF CMSCT ON LSA
TABLE 5.12. THE EFFECT OF CMSCT ON RANDOM WORD EMBEDDING
TABLE 5.13. COMPARISON OF CMSCT VS. MSCT
TABLE 5.14. THE EFFECT OF ABWE ON WORD2VEC
TABLE 5.15. THE EFFECT OF ABWE ON GLOVE
TABLE 5.16. THE EFFECT OF ABWE ON FASTTEXT
TABLE 5.17. THE EFFECT OF ABWE ON LSA
TABLE 5.18. THE EFFECT OF ABWE ON RANDOM WORD EMBEDDING
TABLE 5.19. THE EFFECT OF EBWC ON WORD2VEC
TABLE 5.20. THE EFFECT OF EBWC ON GLOVE
TABLE 5.21. THE EFFECT OF EBWC ON FASTTEXT
TABLE 5.22. THE EFFECT OF EBWC ON LSA
TABLE 5.23. THE EFFECT OF EBWC ON RANDOM WORD EMBEDDING
TABLE 5.24. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER LOGISTIC REGRESSION EFFICIENCY
TABLE 5.25. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER SUPPORT VECTOR MACHINE EFFICIENCY
TABLE 5.26. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER NAÏVE BAYES EFFICIENCY
TABLE 5.27. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER DECISION TREE EFFICIENCY
TABLE 5.28. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER RANDOM FOREST EFFICIENCY
TABLE 5.29. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER K-NEAREST NEIGHBOR EFFICIENCY
TABLE 5.30. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER CNN EFFICIENCY
TABLE 5.31. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER LSTM EFFICIENCY
TABLE 5.32. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER DEEP CNN-LSTM TREE EFFICIENCY
TABLE 5.33. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER ADVCNN TREE EFFICIENCY
TABLE 5.34. COMPARISON: THE PRESENTED DOCUMENT REPRESENTATION APPROACHES OVER BOOSTED CNN EFFICIENCY
List of Figures

CHAPTER 1
FIGURE 1.1. THE FOUR PROPOSED APPROACHES VS. OTHER DOCUMENT REPRESENTATION APPROACHES

CHAPTER 2
FIGURE 2.1. THE WORD2VEC ARCHITECTURES
FIGURE 2.2. DOC2VEC PV-DM MODEL
FIGURE 2.3. DOC2VEC PV-DBOW MODEL

CHAPTER 3
FIGURE 3.1. CONTENT TREE INDUCTION EARLY PROCESS
FIGURE 3.2. CONTENT TREE SAMPLE GENERATED FROM AN IMDB MOVIE REVIEW WITH THE WORD2VEC MODEL
FIGURE 3.3. BLOCK DIAGRAM OF THE CTWE APPROACH
FIGURE 3.4. BLOCK DIAGRAM OF THE MSCT APPROACH
FIGURE 3.5. BLOCK DIAGRAM OF THE CMSCT APPROACH
FIGURE 3.6. BLOCK DIAGRAM OF THE EBWC APPROACH
FIGURE 3.7. AUTOENCODER BLOCK DIAGRAM
FIGURE 3.8. BLOCK DIAGRAM OF THE ABWE

CHAPTER 5
FIGURE 5.1. HEATMAP CHART COMPARING PRESENTED APPROACHES
FIGURE 5.2. HEATMAP CHART COMPARING THE EFFECT OF PROPOSED APPROACHES OVER WORD EMBEDDINGS
LIST OF ABBREVIATIONS AND SYMBOLS

1D-CNN  One Dimensional Convolutional Neural Network
20 NG  20 Newsgroups
AbWE  Autoencoder-based Word Embedding
b  Scalar Bias
BERT  Bidirectional Encoder Representations from Transformers
C  Cluster
CMSCT  Composed Maximum Spanning Content Tree
CNN  Convolutional Neural Network
Cov  Covariance
CTWE  Content Tree Word Embedding
d  Elements in Word Vector
DAN  Deep Averaging Network
DNN  Deep Neural Network
DT  Decision Tree
e  Number of Edges
EbWC  Embedding-based Word Clustering
ELMo  Embeddings from Language Models
ESA  Explicit Semantic Analysis
EULA  End User License Agreement
FN  False Negative
FP  False Positive
GloVe  Global Vector
GNB  Gaussian Naïve Bayes
GRU  Gated Recurrent Unit
HSI  Hate Speech Identification
ICA  Independent Component Analysis
IMDB  Internet Movie Database
k-NN  k-Nearest Neighbor
LDA  Latent Dirichlet Allocation
LR  Logistic Regression
LSA  Latent Semantic Analysis
LSI  Latent Semantic Indexing
LSTM  Long Short-Term Memory
m  The Vocabulary Size of Each Document
M  Term-Document Matrix
MLP  Multilayer Perceptron
MSCT  Maximum Spanning Content Tree
MST  Maximum Spanning Tree
n  The Depth of the Word in the Content Tree
N  Number of Words
NB  Naïve Bayes
NLP  Natural Language Processing
O  Algorithm Complexity (worst-case scenario)
PCA  Principal Component Analysis
PV-DBOW  Distributed Bag of Words Version of Paragraph Vector
PV-DM  Distributed Memory Model of Paragraph Vectors
RBF  Radial Basis Function
ReLU  Rectified Linear Unit
RF  Random Forest
RNN  Recurrent Neural Network
ROC  Receiver-Operating Characteristic
S  Diagonal Matrix
SGD  Stochastic Gradient Descent
SVD  Singular Value Decomposition
SVM  Support Vector Machine
TF-IDF  Term Frequency - Inverse Document Frequency
TN  True Negative
TP  True Positive
U  Orthogonal Matrix
USE  Universal Sentence Encoder
v  Elements in EbWC Word Vector
Vec  Vector Representation
V  Orthogonal Matrix
W  Word
X  Word Co-Occurrence Matrix
σ  Standard Deviation
δ  Decay Factor
η  Comparison Function
Chapter 1. Introduction
1.1 Overview
Text mining is now broadly applied to a wide range of applications, including information retrieval, security, social media, and marketing. In this domain, the performance and computational cost of tasks such as classification, clustering, content analysis, and machine translation depend heavily on document representation [1]. The bag-of-words approach is one of the first and most popular document representation models in the field of natural language processing (NLP): each document is represented simply as the multiset of its words [2]. While this approach is fast, simple, and computationally cheap, it disregards grammar and even word order. It also suffers from the "Curse of Dimensionality": even a short sentence requires a very high-dimensional, sparse feature vector, and in this situation classifiers lose their discriminative power. This is where feature selection and feature transformation tools, such as Principal Component Analysis (PCA) [3], Latent Dirichlet Allocation (LDA) [4], Independent Component Analysis [5], Latent Semantic Indexing [6], and Document Frequency [7], come into play. An alternative to the bag-of-words is the word embedding approach, in which words or phrases are mapped to low-dimensional vectors in a continuous space. Word2Vec [8], Global Vectors (GloVe) [9], and fastText [10] are three successful deep learning-based word embedding models. To work accurately, these models must be trained on vast corpora. In summary, these models can provide an acceptable vector for each word in the training data.
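To make the dimensionality issue concrete, the following minimal sketch (not from the thesis; scikit-learn and the toy sentences are illustrative assumptions) builds a bag-of-words matrix and shows how the vocabulary size dictates the vector dimension:

```python
# A bag-of-words representation with scikit-learn, illustrating sparsity.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are friendly"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)   # sparse matrix: docs x vocabulary

print(bow.shape)      # (2, vocabulary size) -- grows with every new term
print(bow.toarray())  # mostly zeros: the sparsity discussed above
```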
Calculating the average and the summation of word vectors are the two common approaches used to represent a document, but they do not consider the context of the document. For example, the word vector for "jaguar" as a car is equal to its word vector as an animal. The Doc2Vec model [11] presents a vector for each document or paragraph that is trained on local data. The Doc2Vec model does not use background knowledge, but it can capture the context of the document. The drawback of this model is the high computational cost of creating a model each time, while Word2Vec,
GloVe, and fastText create their models from the training corpus only once. Another problem that the Word2Vec, GloVe, and fastText models cannot overcome is ignoring the relationship between terms that do not literally co-occur. To solve these problems, the use of ontology-based vectors has been suggested [12]. Taxonomy is the backbone of ontology, and using ontology can solve the problem of data sparseness by replacing words with concepts [13]. The main objective of this thesis is to present low-cost and accurate document representations that simultaneously use background knowledge and local data.
A standard method for information retrieval (IR) tasks is Term Frequency-Inverse Document Frequency (TF-IDF), a basic method based on word counts. A combination of this method and a classifier such as the Support Vector Machine (SVM) can serve as the state of the art for text classification tasks [14, 15].
Like the bag-of-words, TF-IDF is a high-dimensional text representation in which all vector elements are treated as independent whereas, in reality, words occur in highly correlated ways. A low-dimensional alternative is Latent Semantic Analysis (LSA) [16], which is based on the singular value decomposition (SVD) method.
Due to the variability of phrases and sentences, document representation is a more challenging problem than word representation.
Calculating the unweighted average of the embeddings of all the words that occur in a text [17] is the most popular, but not the best, way to build a document representation from pre-trained word embeddings. For example, the document vectors for the sentences "I like Apple computers" and "I prefer a green apple" are close to each other in the vector space, even though the sentences concern different topics.
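A brief sketch of this averaging scheme illustrates why the two sentences end up close together; the model file and tokenization below are hypothetical stand-ins, not artifacts of this thesis.

```python
# A minimal sketch of unweighted embedding averaging with pre-trained
# vectors (hypothetical file path; assumes the gensim 4.x API).
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("pretrained.bin", binary=True)

def average_vector(tokens):
    # Out-of-vocabulary tokens are simply skipped.
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0)

doc_a = average_vector("i like apple computers".split())
doc_b = average_vector("i prefer a green apple".split())
# Cosine similarity stays high although "apple" means different things:
cos = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cos)
```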
In this study, four novel methods are described in Chapter 3: Content Tree Word Embedding (CTWE); Composed Maximum Spanning Content Tree (CMSCT); Embedding-based Word Clustering (EbWC); and Autoencoder-based Word Embedding (AbWE). These approaches update the vector of each word in the training vocabulary, starting from the Word2Vec, GloVe, fastText, LSA, or Random word embedding models, and then use the average of the updated word vectors for document representation.
The first approach, CTWE, employs a semi-taxonomy structure named a content tree, and subsequently updates the word embedding vectors.
The second approach, MSCT, selects a root based on node degree and then generates a maximum spanning tree. CMSCT, a variant of this approach, avoids the large memory footprint of building the fully connected graph over the entire training vocabulary: it generates a small spanning tree for each document and then combines them into a spanning tree for the whole training data. In the last step, the word vectors are updated based on their location in the maximum spanning tree; a sketch of the per-document tree construction is given below.
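As a rough illustration only, the snippet below sketches the generic maximum-spanning-tree step over word-similarity edges for a single document. The cosine-similarity weighting and the networkx call are assumptions for illustration, not the thesis implementation.

```python
# A sketch of a per-document maximum spanning tree over word similarities.
import itertools
import numpy as np
import networkx as nx

def document_spanning_tree(words, wv):
    """words: tokens of one document; wv: dict-like word -> np.ndarray."""
    g = nx.Graph()
    for a, b in itertools.combinations(set(words), 2):
        sim = wv[a] @ wv[b] / (np.linalg.norm(wv[a]) * np.linalg.norm(wv[b]))
        g.add_edge(a, b, weight=sim)  # edge weight = cosine similarity
    # Keep the edges that maximize total similarity while forming a tree.
    return nx.maximum_spanning_tree(g)
```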
The third approach, EbWC, uses clustering to extract the conceptual structure of the context: each element of the new word embedding is the distance from the word's vector to the centroid of one word cluster.
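A minimal sketch of this idea follows, assuming k-means as the clustering method and Euclidean centroid distances; both are illustrative choices rather than the thesis configuration.

```python
# A sketch of clustering-based re-embedding: one feature per centroid.
from sklearn.cluster import KMeans

def cluster_embedding(word_matrix, n_clusters=50):
    """word_matrix: (vocabulary size, embedding dim) array of word vectors.
    Returns a (vocabulary size, n_clusters) matrix of centroid distances."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(word_matrix)
    # transform() returns the distance from each vector to each centroid.
    return kmeans.transform(word_matrix)
```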
The final method, AbWE, uses an autoencoder for dimensionality and noise reduction in the context. The main idea is to train an autoencoder to capture the concepts in the training data and then update the word vectors by encoding them.
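A minimal autoencoder sketch along these lines is shown below; the layer sizes, activations, and training settings are assumptions for illustration, not the configuration used in this thesis.

```python
# A sketch of encoding word vectors with a small Keras autoencoder.
from tensorflow.keras import layers, models

def autoencoder_embedding(word_matrix, code_dim=64, epochs=20):
    """word_matrix: (vocabulary size, embedding dim). Returns encoded vectors."""
    dim = word_matrix.shape[1]
    inp = layers.Input(shape=(dim,))
    code = layers.Dense(code_dim, activation="relu")(inp)   # encoder
    out = layers.Dense(dim, activation="linear")(code)      # decoder
    autoencoder = models.Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    # Reconstruct the word vectors from themselves (compression objective).
    autoencoder.fit(word_matrix, word_matrix, epochs=epochs, verbose=0)
    encoder = models.Model(inp, code)
    return encoder.predict(word_matrix)  # updated (encoded) word vectors
```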
1.2 Problem Statement
Text mining tasks that are based on a bag-of-words representation cannot guarantee satisfactory results [18]. With the bag-of-words approach, the word order and grammar of the context are lost. The high dimensionality of the bag-of-words feature vector also causes high computational complexity. In general, using words as the only features involves several limitations:
• Synonymy, when different words and phrases indicate the same concept. As an example, the words "manufacture" and "make" refer to the concept of production [19].
• Polysemy, when a word has several meanings; for example, the word "apple" as a company or as a fruit [19].
• Multi-word expressions, when words cannot be used independently. As an example, the phrase "passed away" refers to death, but neither word on its own is enough [19].
• Missing indicator terms, when a document contains no indicator terms. An example is the phrase "an armed man took the money", which refers to robbery although none of its words is a good indicator term [19].
Word-level embeddings such as Word2Vec or GloVe suffer from the out-of-vocabulary issue: the model ignores words that do not appear in its training data. From another perspective, to work accurately, the Word2Vec, GloVe, and fastText models must be trained on vast corpora. These models are able to suggest an acceptable vector for each word in the training corpus. Calculating the average and the summation of the word vectors in a document are the two common approaches built on Word2Vec, GloVe, and fastText. These approaches use vectors that are calculated from global knowledge rather than from the context of the document. As an example, the word vector for "blackberry" as a fruit is equal to its word vector as a company.
The Doc2Vec model presents a vector for each document or paragraph that is trained on local data. This approach does not use background knowledge but does involve the context. A drawback of the Doc2Vec model is the high computational cost of creating a new model each time, compared to Word2Vec, GloVe, and fastText, which create their models only once.
The main objective of this research is to provide document representations that simultaneously use global knowledge (pre-trained features) and the local context for text classification, which is one of the most popular tasks in the domain of text mining. No dramatic improvement is expected, since the focus of this research is on presenting new word embeddings that simultaneously employ global knowledge and the local context.
The presented approaches, which are described in Chapter 3, are: Content Tree Word Embedding (CTWE) [20]; Composed Maximum Spanning Content Tree (CMSCT); Embedding-based Word Clustering (EbWC); and Autoencoder-based Word Embedding (AbWE).
These approaches update the vector of each word in the training vocabulary, starting from the Word2Vec, GloVe, fastText, LSA, or Random word embedding models. For document representation, the average of the updated word vectors is calculated.
Figure 1.1 presents the criteria considered in this thesis and compares CTWE, AbWE, CMSCT, and EbWC to other document representation approaches.
[Figure 1.1 positions the approaches along two axes: strength of local-context embedding (weak to strong) and locally trained versus pre-trained approaches. The four proposed approaches (CTWE, AbWE, CMSCT, EbWC) appear in the strong local-context region together with Doc2Vec and LSA, while vector averaging with Word2Vec, GloVe, and fastText falls in the weak local-context, pre-trained region.]
Figure 1.1. The four proposed approaches vs. other document representation approaches
1.3 Research Questions
• To what extent does the local context affect the task of text classification?
• How can the local context be embedded into word embeddings to improve the task of classification?
1.4 Research Objectives
Two main objectives are defined for this thesis:
• To provide document representation approaches that simultaneously use global knowledge (pre-trained features) and the local context.
• To evaluate the effectiveness of simultaneously using the local context and global knowledge in word embedding for the task of text classification.
1.5 Research Contribution
This thesis provides four different approaches to improving three word embedding-based word representations, namely Word2Vec, GloVe, and fastText. The aim of this research is to provide approaches that present enhanced document representations by employing information in the local context while retaining the benefits of pre-trained global models.
In summary, the main contributions of this thesis are:
• A study of word embedding and document representation methods, provided in Chapter 2.
• Four novel approaches, presented in Chapter 3, for generating new document representations that use global knowledge as well as the local context:
o Content Tree Word Embedding
o Composed Maximum Spanning Content Tree
o Embedding-based Word Clustering
o Autoencoder-based Word Embedding
• A description of how the four novel approaches are implemented and generate new word embeddings, provided in Chapter 4. The new word embeddings are evaluated, and the results analysed and discussed, in Chapter 5.
1.6 Thesis Outline
This thesis consists of six chapters. Following the introduction, which contains an overview, Chapter 1 presents the problem statement, research questions, objectives, and contributions of this work. A review of the relevant literature is provided in Chapter 2. The proposed document representation methods are introduced in Chapter 3, while Chapter 4 describes the methodology. Experimental results are discussed in Chapter 5. Chapter 6 concludes the study by summarizing the main findings and suggesting future areas of exploration.
Chapter 2. Literature Review
2.1 Introduction
In this chapter, a comprehensive review of related research in the domain of word embedding and document representation is presented. Document representation methods used before word embedding are categorized as traditional document representation methods. The Word2Vec, GloVe, fastText, LSA, and Doc2Vec approaches are presented in detail. Other and combined methods are then described, followed by a review of taxonomy induction approaches.
2.2 Traditional Document Representation Methods
Using the words in a document is the most intuitive approach to document representation. Since 1954, different document representations have been introduced, the earliest of which is the bag-of-words by Harris [2]. In this model, the document is presented as a fixed-length vector whose length is the number of unique selected terms in the document repository. Stop-word filtering, stemming, and lemmatization are optional filtering methods that can be used to improve the quality of the representation. Using single words as features suffers from the "Curse of Dimensionality", in that a very high-dimensional feature vector is needed to represent even a short sentence. Losing the semantic meaning of the text and phrases is another limitation of the bag-of-words approach.
The n-gram document representation uses contiguous sequences of n words as features to represent a document. Each element of the document representation vector is a set of two or more neighboring words in the document repository. A similar approach uses fixed chunks of letters, where unique chunks of letters are the elements of the feature vector for each document; this approach has applications in language identification and spelling error detection. For documents with references, such as scholarly documents or web pages with hyperlinks, another method can be used to represent the document, as references are related to its contents. The weight of a reference can be modified based on its frequency of use and its location in the document. In this approach, each reference is one dimension of the feature vector, which has a much lower dimensionality in comparison to the bag-of-words and n-gram approaches [21].
Explicit Semantic Analysis (ESA) is a mixed approach based on the contents of and references in a document. In ESA, the similarity of documents is calculated based on their reference sets. In the feature vector, each element is weighted in relation to specific documents within the reference sets [22].
A compression-based similarity measure [23] has also been used to compute document representations. This method is based on the hypothesis that the similarity of two files can be estimated by comparing the compressed size of the concatenation of the two files with the sum of the compressed sizes of each file. The elements of the representation vector are the similarities between the document and the other documents in the repository.
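A minimal sketch of this hypothesis is the normalized compression distance, shown below with zlib; the specific compressor and formula are illustrative assumptions, not necessarily those of [23].

```python
# Normalized compression distance: smaller values mean more similar texts.
import zlib

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    # Concatenating similar texts compresses almost as well as one alone.
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"the cat sat on the mat", b"the cat sat on a mat"))
print(ncd(b"the cat sat on the mat", b"stock markets fell today"))
```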
In a study by Maas et al. [24], a word representation was introduced to capture the semantic and sentiment meaning of words. The model created vectors by an unsupervised probabilistic approach, so that words that co-occur in many documents have similar representations. In the next phase, a supervised learning-based method took the sentiment of words into account and the model was trained.
Yang et al. [25] extracted key terms from training documents based on the Gini Index, Information Gain, Mutual Information, Odds Ratio, Ambiguity Measure, and the Darmstadt Indexing Approach association factor. The extracted key terms were used for a document representation known as KT-Of-Doc. In the proposed approach, each document is represented by the key terms that appear in it, to enhance their effect, and by the non-key terms that do not appear in it, to weaken the effect of the non-key terms. In another study by Yang et al. [26], an attention network over a document was created by detecting the more and less important parts of the content, and the extracted network was used for document representation. The proposed model had two levels: a word level, which detects important words, and a sentence level, which detects and relates important sentences. The introduced representation was used for text classification tasks.
Wei et al. [1] tried to solve the problem of meaningless latent representations of documents. Discriminative neighbors were defined, and an autoencoder was trained by minimizing the Bernoulli cross-entropy error. This autoencoder was used to create a new document representation method.
In a study by Lao and Jagadeesh [27], legal questions were classified into 16 legal areas. Bag-of-words, bag-of-bigrams, TF-IDF, and averaged Word2Vec vectors of the questions were used as features and compared across five different classifiers for this classification task. The linear SVM classifier achieved the best results in comparison to the other four models: Logistic Regression (LR), Multinomial Naïve Bayes (NB), SVM with stochastic gradient descent (SGD), and a one-layer neural network.
In a study by Kim et al. [28], three approaches for document representation were introduced, based on the Word2Vec vectors of content words. The first approach was average pooling, the second a class-specific Gaussian Mixture distribution, and the third Semantic Space Allocation, which uses global Gaussian Mixture Model components. Average pooling had the best results of the three methods and outperformed the traditional LDA method when applied to a Chinese article classification task. Bernotas et al. [12] used a tagging-based document representation method and improved a clustering task by using ontology. Based on the results, tagging-based representation had a negative impact on small-scale documents but, for large-scale documents, outperformed word-based document representation.
Lu et al. [29] created a universe of clusters and, by segmenting documents into topics and assigning the topics to clusters, established a relationship between each document and the clusters. Each document was assigned as a member of its most strongly related cluster and associated with the second most strongly related cluster.
In a study by Socher et al. [30], a Neural Tensor Network was used to extract the relationships between entities in a knowledge base. The vector representation of an entity was calculated as the average of its word vectors, and the relation classifier and the entity representation were learned jointly. This approach is used for knowledge-base completion tasks, which are useful in query expansion, question answering, and information retrieval.
2.3 Word2Vec
Word co-occurrence is at the heart of several machine learning algorithms, including the recently introduced Word2Vec by Mikolov et al. [8]. Word2Vec is a two-layer neural network-based approach that learns embeddings for words. Negative sampling is used in the softmax step of the output layer. The objective function maximizes the log probability of a context word ($w_O$) given its input word ($w_I$). With negative sampling, the objective becomes maximizing the dot product of $w_I$ and $w_O$ while minimizing the dot products of $w_I$ and randomly selected negative words. The output is a vocabulary of words from the original corpus, each with an n-dimensional fixed-size vector representation. Words that co-occur in the training corpus are located adjacent to each other in the vector space. Figure 2.1 [8] illustrates how Word2Vec creates word vector representations using two architectures: Continuous Bag of Words (CBOW) and Skip-gram. The CBOW architecture, which predicts a word from the surrounding context words, works faster than Skip-gram, which predicts the surrounding words within a fixed-length window from a center word. For infrequent words, the Skip-gram architecture works better.
[Figure 2.1 shows the two Word2Vec architectures side by side: CBOW, which maps one-hot vectors of the surrounding words to the current word, and Skip-gram, which maps the one-hot vector of the current word to its surrounding words.]
Figure 2.1. The Word2Vec architectures
Word2Vec generates vector representation only for words while, for document representation, a representation for the entire document is needed. Averaging or summation of all the word vectors of a given document can be a naive solution for creating document representation.
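For reference, a minimal training sketch with gensim (assuming the gensim 4.x API; the toy corpus is illustrative) shows the Skip-gram and negative-sampling options described above and the per-word vectors the model produces.

```python
# Training Word2Vec on a toy corpus with gensim.
from gensim.models import Word2Vec

sentences = [["global", "local", "word", "embedding"],
             ["document", "representation", "for", "classification"]]
model = Word2Vec(sentences,
                 vector_size=100,  # n-dimensional fixed-size vectors
                 window=5,         # fixed-length context window
                 sg=1,             # 1 = Skip-gram, 0 = CBOW
                 negative=5,       # number of negative samples
                 min_count=1)
vector = model.wv["document"]      # one vector per vocabulary word
```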
During the training phase, words that appear in similar contexts are oriented in similar directions of the vector space by this unsupervised learning algorithm.
In [31], it was highlighted that not only the direction but also the length of word vectors carries important information. The length of a vector reflects both the frequency with which a word appears in the corpus and the similarity of the contexts in which the word appears. Accordingly, a word that is consistently used in similar contexts is represented by a longer vector than a word of the same frequency that is used in different contexts.
2.4 GloVe
Pennington et al. [9] introduced an unsupervised word embedding model, known as GloVe, for word representation.
GloVe tries to encode meaning as vector offsets in an embedding space. This model captures the frequency of word co-occurrences within a specific window in a large text corpus to generate linear dimensions of meaning and uses global matrix factorization and local context window methods. The model also offers a local cost function and includes a weighting function that is used to balance rare co-occurrences. Optimization methods are used to minimize the cost function.
Based on the hypothesis that similar words have similar distributions, it is expected that generally trained word vectors can be used to measure semantic similarity. Similar to Word2Vec, averaging the vectors of words in a document is an option for generating a fixed length vector for document representation.
To obtain GloVe word embeddings, word co-occurrence information is first collected as a word co-occurrence matrix ($X$). In this matrix, the value $X_{ij}$ is the number of times that word $i$ appears in the context (a fixed window size) of word $j$ in the training data. The decay factor shown in Equation 2.1 is applied to reduce the weight of distant words:

$\text{decay} = 1/\text{offset}$   (Equation 2.1)
In the next step, a soft constraint for each word pair is assigned by Equation 2.2:

$w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$   (Equation 2.2)
where $w_i$ represents the vector for the main word, $\tilde{w}_j$ is the vector representation for the context word, and $b_i$ and $\tilde{b}_j$ are scalar biases for the main and context words. The cost function is defined in Equation 2.3:

$J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$   (Equation 2.3)
In order to avoid learning only the form of extremely common word pairs, a weighting function $f$ (Equation 2.4) is used:

$f(X_{ij}) = \begin{cases} (X_{ij}/x_{\max})^{\alpha} & \text{if } X_{ij} < x_{\max} \\ 1 & \text{otherwise} \end{cases}$   (Equation 2.4)
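A minimal numpy sketch of Equations 2.3 and 2.4 follows; the values $\alpha = 0.75$ and $x_{\max} = 100$ follow the original GloVe paper, and the dense-matrix formulation is a simplification for illustration.

```python
# Evaluating the GloVe cost J for given vectors and biases.
import numpy as np

def glove_cost(X, W, W_ctx, b, b_ctx, x_max=100.0, alpha=0.75):
    """X: (V, V) co-occurrence counts; W, W_ctx: (V, d) word/context
    vectors; b, b_ctx: (V,) scalar biases. Returns the scalar cost J."""
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)     # Equation 2.4
    log_X = np.log(np.maximum(X, 1e-12))  # f(0) = 0 zeroes out empty cells
    residual = W @ W_ctx.T + b[:, None] + b_ctx[None, :] - log_X
    return np.sum(f * residual ** 2)                       # Equation 2.3
```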
The GloVe approach is different from Word2Vec: while Word2Vec is a "predictive" model, GloVe can be called a "count-based" model. A count-based model generates word vectors by applying dimensionality reduction methods to the co-occurrence matrix, while predictive models improve their ability to predict neighboring words by minimizing a loss.
2.5 fastText

fastText, an extension of Word2Vec, was proposed in 2016 by Bojanowski et al. [10] from Facebook AI Research. While fastText learns word representations, it considers the structure of the words, an approach that is useful for languages in which words are morphologically similar. The method also has an advantage over others when representing rarely occurring words. These features allow the algorithm to identify prefixes, suffixes, stems, and other phonological, morphological, and syntactic structures in a manner that does not rely on words being used in similar contexts and thus being represented in a similar vector space.
The fastText approach uses the n-grams of words to train a neural network. As an example, the tri-grams for the word "hello" are hel, ell, and llo. The word embedding vector for "hello" is the sum of the vectors of all these n-grams. A word embedding for all the n-grams in the training dataset is generated. By having the word embeddings of the n-grams, rare words can also be represented, since they may contain some of the n-grams.
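A minimal sketch of this character n-gram extraction, matching the example above (fastText itself additionally wraps each word in boundary symbols "<" and ">", which is omitted here to match the text):

```python
# Extracting character n-grams from a word.
def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("hello"))  # ['hel', 'ell', 'llo'] -- as in the example
```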
In theory, fastText embeddings should work more accurately on syntactic analogies, since they are mostly morphology-based. In a manner similar to Word2Vec, fastText learns word embeddings, with the difference that it enriches word vectors with subword information by using character n-grams of variable length.
It is expected that fastText will outperform Word2Vec and GloVe when the size of the dataset is small. fastText word embedding is based on the continuous Skip-gram architecture, using n-grams at the character level. A hashing function, called Fowler-Noll-Vo, is used to map n-grams to integers. This hashing function bounds the memory requirement of the model, which uses SGD on the negative log likelihood to solve the optimization problem.
In the fastText architecture, when there is a large number of classes, hierarchical softmax [32] is used to calculate the probability distribution over the predefined classes, since linear classifiers are computationally expensive in this situation.
For word representation, a hashed version of the n-grams, called the hashing trick, is used [33]. The average of the word vectors is used as the document representation and then fed into a linear classifier, similar to the CBOW model [8].
2.6 LSA
Latent Semantic Analysis [16], or latent semantic indexing, is a document analysis method for identifying the concepts and underlying meaning of documents.
In the first step, the term-document matrix $M$ (of size $m \times n$, where $m$, the number of rows, is the number of terms and $n$, the number of columns, is the number of documents, while $M[i, j]$ is the frequency of term $i$ in document $j$) is constructed. The SVD method is then applied to the matrix $M$ to decompose it into three matrices, according to Equation 2.5, where $S$ is a diagonal matrix and $U$ and $V$ are orthogonal matrices. Based on Equation 2.6, the $k$ largest singular values of the three matrices are used to form the reduced-dimension version of $M$, which is $M_k$:

$M = U S V^T$   (Equation 2.5)
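A minimal numpy sketch of this reduction (a stand-in term-document matrix and an illustrative choice of $k$; not the thesis code):

```python
# LSA: SVD of a term-document matrix, truncated to k singular values.
import numpy as np

M = np.random.rand(1000, 200)   # stand-in term-document matrix (m x n)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 50                           # number of latent concepts to keep
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of M
```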