Clustering and Topic Analysis Final Report
CS 5604 Information Storage and Retrieval
Virginia Polytechnic Institute and State University
Fall 2017

Submitted by
Ashish Baghudana
Aman Ahuja
Pavan Bellam
Rammohan Chintha
Prathyush Sambaturu
Ashish Malpani
Shruti Shetty
Mo Yang

December 15, 2017
Blacksburg, Virginia 24061

Instructor: Dr. Edward A. Fox

Abstract

One of the key objectives of the CS-5604 course titled Information Storage and Retrieval is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and develop a retrieval system to extract information from that archive.

Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project.

Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework, and finally, outline a user manual describing the fields that we populate in HBase.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Clustering
  1.3 Topic Analysis
2 Literature Survey
  2.1 Clustering
    2.1.1 Partition-based Clustering
    2.1.2 Hierarchical Clustering
    2.1.3 Density-based Clustering
    2.1.4 Grid-based Clustering
    2.1.5 Model-based Clustering
  2.2 Topic Analysis
    2.2.1 TF-IDF
    2.2.2 Latent Semantic Indexing
    2.2.3 Latent Dirichlet Allocation
    2.2.4 Twitter-LDA
3 Requirements Gathering
  3.1 Clustering
  3.2 Topic Analysis
  3.3 Outputs
4 Design and Deliverables
  4.1 System Design
  4.2 Technologies Used
  4.3 Timeline
5 Implementation and Evaluation Techniques
  5.1 Preprocessing
  5.2 Clustering
    5.2.1 Implementation Details
    5.2.2 Evaluation
  5.3 Topic Analysis
    5.3.1 Implementation Details
    5.3.2 Evaluation
6 Results
  6.1 Remember April 16 Tweets
    6.1.1 Clustering
    6.1.2 Topic Analysis
  6.2 Solar Eclipse 2017 Tweets
    6.2.1 Clustering
    6.2.2 Topic Analysis
  6.3 Solar Eclipse 2017 Webpages
    6.3.1 Clustering
    6.3.2 Topic Analysis
  6.4 Hurricane Irma Webpages
    6.4.1 Clustering
    6.4.2 Topic Analysis
  6.5 Vegas Shooting Webpages
    6.5.1 Clustering
7 User Manual
  7.1 HBase schema
  7.2 Topic Analysis
    7.2.1 Help File
    7.2.2 Computational Complexity
  7.3 Clustering
    7.3.1 Running Clustering Algorithm
    7.3.2 Analysis
8 Developer Manual
  8.1 Clustering
  8.2 Topic Analysis
  8.3 HBase interaction
    8.3.1 Clustering
  8.4 File Inventory
9 Future Work and Enhancements
  9.1 Clustering
  9.2 Topic Analysis
Acknowledgements
Bibliography

List of Figures

2.1 Plate notation for LDA (courtesy Wikipedia)
2.2 Plate notation for Twitter-LDA [22]
4.1 Pipeline for text processing. The CTA team now begins the preprocessing pipeline at Step 3: Remove stop words and punctuation as the text is already tokenized and lowercased.
4.2 Latent Dirichlet Allocation uses a Python based system with three main capabilities – access to HBase, preprocessing and LDA, and visualization.
5.1 The three stages of our preprocessing pipeline – tokenization, mapping, and filtering.
6.1 Clean Tweet Data Sample
6.2 Calinski Harabaz index vs. number of clusters for “remember april 16” Dataset
6.3 k-means clustering results on “remember April 16” tweets.
6.4 Tweets distribution over clusters using hierarchical clustering algorithm.
6.5 Cluster distribution for “Solar Eclipse 2017” tweets
6.6 Cluster distribution for “Solar Eclipse 2017” webpages
6.7 Cluster distribution for “Hurricane Irma” webpages
6.8 Plots showing number of topics vs. log perplexity and number of topics vs. topic coherence for the collections Solar Eclipse webpages and Hurricane Irma webpages. We attempt to choose the best number of topics based on these two plots.
6.9 Cluster distribution for “Vegas Shooting” webpages
7.1 Computational complexity of running LDA for different collections. The results were benchmarked on a single node server with 20 cores.

List of Tables

1.1 Sample topics from a collection of Wikipedia articles collected using a keyword search for “computers”, “basketball”, and “economics”
4.1 Timeline of task list
5.1 A sample of collection specific stop words for Solar Eclipse 2017 and Hurricane Irma
6.1 Datasets description with category (tweet or document) and number of documents
6.2 Frequent words and events in each cluster for “Remember April 16” dataset
6.3 Top words for topics obtained through running LDA on the “Remember April 16” dataset. The results show only the best 6 topics. The remaining 4 topics were incoherent.
6.4 Topics from Twitter-LDA that did not appear in LDA for the “Remember April 16” dataset
6.5 Cluster Naming based on frequent words for the “Solar Eclipse 2017” tweet data
6.6 Keywords for topics in the collection “Solar Eclipse”
6.7 The cosine similarity analysis of “Solar Eclipse 2017” webpage data
6.8 Cluster Naming based on frequent words for the “Solar Eclipse 2017” webpage data
6.9 Keywords for topics in the collection “Solar Eclipse”
6.10 The cosine similarity analysis of “Hurricane Irma” webpage data
6.11 Cluster naming based on frequent words for the “Hurricane Irma” web data
6.12 Keywords for topics in the collection “Solar Eclipse”
6.13 The cosine similarity analysis of “Vegas Shooting” webpage data
6.14 Cluster naming based on frequent words for the “Vegas Shooting” webpage data
7.1 HBase Schema: Fields for Topic Analysis
7.2 HBase Schema: Fields for Clustering
8.1 File Inventory

Chapter 1
Introduction

The CS5604 course project aims to build a state-of-the-art information retrieval (IR) system in support of the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The semester-long project is divided into several subareas undertaken by different teams. These are Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Database and Indexing (SOLR), and Front-end and Visualization (FE). This report focuses on the results of the Clustering and Topic Analysis (CTA) team.

1.1 Problem Statement

Building a state-of-the-art information retrieval system involves several components. Each component is handled by a team. CMW and CMT crawl/collect from the Internet to fetch event related webpages and tweets, respectively. CLA refines the data processed by CMW and.