Comparing Latent Dirichlet Allocation and Latent Semantic Analysis As Classifiers
Total Page:16
File Type:pdf, Size:1020Kb
COMPARING LATENT DIRICHLET ALLOCATION AND LATENT SEMANTIC ANALYSIS AS CLASSIFIERS Leticia H. Anaya, B.S., M.S. Dissertation Prepared for the Degree of DOCTOR OF PHILOSOPHY UNIVERSITY OF NORTH TEXAS December 2011 APPROVED: Nicholas Evangelopoulos, Major Professor Shailesh Kulkarni, Committee Member Robert Pavur, Committee Member Dan Peak, Committee Member Nourredine Boubekri, Committee Member Mary C. Jones, Chair of the Department of Information Technology and Decision Sciences Finley Graves, Dean of College of Business James D. Meernik, Acting Dean of the Toulouse Graduate School Anaya, Leticia H. Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers. Doctor of Philosophy (Management Science), December 2011, 226 pp., 40 tables, 23 illustrations, references, 72 titles. In the Information Age, a proliferation of unstructured text electronic documents exists. Processing these documents by humans is a daunting task as humans have limited cognitive abilities for processing large volumes of documents that can often be extremely lengthy. To address this problem, text data computer algorithms are being developed. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are two text data computer algorithms that have received much attention individually in the text data literature for topic extraction studies but not for document classification nor for comparison studies. Since classification is considered an important human function and has been studied in the areas of cognitive science and information science, in this dissertation a research study was performed to compare LDA, LSA and humans as document classifiers. The research questions posed in this study are: R1: How accurate is LDA and LSA in classifying documents in a corpus of textual data over a known set of topics? R2: How accurate are humans in performing the same classification task? R3: How does LDA classification performance compare to LSA classification performance? To address these questions, a classification study involving human subjects was designed where humans were asked to generate and classify documents (customer comments) at two levels of abstraction for a quality assurance setting. Then two computer algorithms, LSA and LDA, were used to perform classification on these documents. The results indicate that humans outperformed all computer algorithms and had an accuracy rate of 94% at the higher level of abstraction and 76% at the lower level of abstraction. At the high level of abstraction, the accuracy rates were 84% for both LSA and LDA and at the lower level, the accuracy rate were 67% for LSA and 64% for LDA. The findings of this research have many strong implications for the improvement of information systems that process unstructured text. Document classifiers have many potential applications in many fields (e.g., fraud detection, information retrieval, national security, and customer management). Development and refinement of algorithms that classify text is a fruitful area of ongoing research and this dissertation contributes to this area. Copyright 2011 by Leticia H. Anaya ii ACKNOWLEDGEMENTS I want to express my deepest gratitude and appreciation to my dissertation committee. To my Chair, Dr. Nicholas Evangelopoulos, for his mentoring, encouragement and guidance in my pursuit of the PhD. I honestly can never thank him enough for opening my eyes into a field that I never knew existed, for sharing his knowledge and for his overall support in this endeavor. To Dr. Robert Pavur who taught me how to think analytical way beyond my expectations. To Dr. Shailesh Kulkarni who showed me the beauty of incorporating analytical mathematics for manufacturing operations which I hope that one day I can expand upon. To Dr. Dan Peak, whose encouragement and words of wisdom were always welcomed and I thank him for sharing those interesting and funny anecdotes, stories, and images that made life interesting while pursuing this endeavor. To Dr. Nourredine Boubekri for his support and for his initial words of encouragement that allowed me to overcome the fear that prevented me from even taking the first step toward the pursuit of a PhD. To Dr. Victor Prybutok for his support, his encouragement, his sense of humor, his sense of fairness and overall great disposition in spite of having dealt with much stronger trials and tribulations that I ever had. To Dr. Sherry Ryan and Dr. Mary Jones, two of my female professors who left an everlasting positive impression in my life not only in terms of their scholastic work but also in terms of their attitudes toward life. To my family and all my PhD fellow friends who shared the PhD experience with me. iii TABLE OF CONTENTS Page ACKNOWLEDGMENTS ............................................................................................................. iii LIST OF TABLES ....................................................................................................................... viii LIST OF FIGURES .........................................................................................................................x 1. OVERVIEW ................................................................................................................................1 1.1 Introduction ....................................................................................................................1 1.2 Overview of Document Classification ...........................................................................2 1.3 Modeling Document Classification ...............................................................................3 1.4 Overview of Human Classification ................................................................................4 1.5 Overview of Latent Semantic Analysis .........................................................................5 1.6 Overview of Latent Dirichlet Allocation .......................................................................6 1.7 Research Question and Potential Contributions ............................................................7 2. BASIC TEXT MINING CONCEPTS AND METHODS .........................................................10 2.1 Introduction ..................................................................................................................10 2.2 Text Mining Introduction .............................................................................................10 2.3 Vector Space Model .....................................................................................................12 2.4 Text Mining Methods ..................................................................................................14 2.4.1 Latent Semantic Analysis .............................................................................14 2.4.2 Probabilistic Latent Semantic Analysis ........................................................17 2.4.3 Relationship between LSA and pLSA ..........................................................18 2.4.4 The Latent Dirichlet Allocation Method.......................................................20 2.4.5 Gibbs Sampling: Approximation to LDA .....................................................23 iv 2.5 Conclusion ...................................................................................................................25 3. CLASSIFICATION ...................................................................................................................25 3.1 Classification as Supervised Learning .........................................................................26 3.2 Data Points Representation ..........................................................................................27 3.3 What are Classifiers? ...................................................................................................28 3.4 Developing a Classifier ................................................................................................28 3.4.1 Loss Functions ..............................................................................................30 3.4.2 Data Selection ...............................................................................................31 3.5 Classifier Performance .................................................................................................32 3.5.1 Classification Matrix .....................................................................................32 3.5.2 Receiver Operating Characteristics (ROC) Curves ......................................33 3.5.3 F1 Micro-Sign and F1 Macro-Sign Tests .....................................................34 3.5.4 Misclassification Costs .................................................................................35 3.6 Document Classification ..............................................................................................36 3.7 Commonly Used Document Classifiers .......................................................................37 3.7.1 K-Nearest Neighborhood Classifier ..............................................................37 3.7.2 Centroid-Based Classifier .............................................................................38 3.7.3 Support Vector Machine ...............................................................................38 3.8 Conclusion ...................................................................................................................39 4. HUMAN CATEGORIZATION LITERATURE REVIEW ......................................................41