Clustering and Topic Analysis Final Report

CS 5604 Information Storage and Retrieval

Virginia Polytechnic Institute and State University Fall 2017

Submitted by

Ashish Baghudana Aman Ahuja Pavan Bellam Rammohan Chintha Prathyush Sambaturu Ashish Malpani Shruti Shetty Mo Yang

December 15, 2017 Blacksburg, Virginia 24061

Instructor: Dr. Edward A. Fox

Abstract

One of the key objectives of the CS-5604 course, titled Information Storage and Retrieval, is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and to develop a retrieval system to extract information from that archive. Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project. Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework and, finally, outline a user manual describing the fields that we populate in HBase.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Clustering
  1.3 Topic Analysis

2 Literature Survey
  2.1 Clustering
    2.1.1 Partition-based Clustering
    2.1.2 Hierarchical Clustering
    2.1.3 Density-based Clustering
    2.1.4 Grid-based Clustering
    2.1.5 Model-based Clustering
  2.2 Topic Analysis
    2.2.1 TF-IDF
    2.2.2 Latent Semantic Indexing
    2.2.3 Latent Dirichlet Allocation
    2.2.4 Twitter-LDA

3 Requirements Gathering
  3.1 Clustering
  3.2 Topic Analysis
  3.3 Outputs

4 Design and Deliverables
  4.1 System Design
  4.2 Technologies Used
  4.3 Timeline

5 Implementation and Evaluation Techniques
  5.1 Preprocessing
  5.2 Clustering
    5.2.1 Implementation Details
    5.2.2 Evaluation
  5.3 Topic Analysis
    5.3.1 Implementation Details
    5.3.2 Evaluation

6 Results
  6.1 Remember April 16 Tweets
    6.1.1 Clustering
    6.1.2 Topic Analysis
  6.2 Solar Eclipse 2017 Tweets
    6.2.1 Clustering
    6.2.2 Topic Analysis
  6.3 Solar Eclipse 2017 Webpages
    6.3.1 Clustering
    6.3.2 Topic Analysis
  6.4 Hurricane Irma Webpages
    6.4.1 Clustering
    6.4.2 Topic Analysis
  6.5 Vegas Shooting Webpages
    6.5.1 Clustering

7 User Manual
  7.1 HBase Schema
  7.2 Topic Analysis
    7.2.1 Help File
    7.2.2 Computational Complexity
  7.3 Clustering
    7.3.1 Running Clustering Algorithm
    7.3.2 Analysis

8 Developer Manual
  8.1 Clustering
  8.2 Topic Analysis
  8.3 HBase Interaction
    8.3.1 Clustering
  8.4 File Inventory

9 Future Work and Enhancements
  9.1 Clustering
  9.2 Topic Analysis

Acknowledgements

Bibliography

List of Figures

2.1 Plate notation for LDA (courtesy Wikipedia)
2.2 Plate notation for Twitter-LDA [22]

4.1 Pipeline for text processing. The CTA team now begins the preprocessing pipeline at Step 3 (remove stop words and punctuation), as the text is already tokenized and lowercased.
4.2 Latent Dirichlet Allocation uses a Python-based system with three main capabilities – access to HBase, preprocessing and LDA, and visualization.

5.1 The three stages of our preprocessing pipeline – tokenization, mapping, and filtering.

6.1 Clean tweet data sample
6.2 Calinski-Harabasz index vs. number of clusters for the “Remember April 16” dataset
6.3 k-means clustering results on “Remember April 16” tweets
6.4 Tweet distribution over clusters using the hierarchical clustering algorithm
6.5 Cluster distribution for “Solar Eclipse 2017” tweets
6.6 Cluster distribution for “Solar Eclipse 2017” webpages
6.7 Cluster distribution for “Hurricane Irma” webpages
6.8 Plots showing number of topics vs. log perplexity and number of topics vs. topic coherence for the Solar Eclipse webpage and Hurricane Irma webpage collections. We attempt to choose the best number of topics based on these two plots.
6.9 Cluster distribution for “Vegas Shooting” webpages

7.1 Computational complexity of running LDA for different collections. The results were benchmarked on a single-node server with 20 cores.

List of Tables

1.1 Sample topics from a collection of Wikipedia articles collected using a keyword search for “computers”, “basketball”, and “economics”

4.1 Timeline of task list

5.1 A sample of collection-specific stop words for Solar Eclipse 2017 and Hurricane Irma

6.1 Dataset descriptions with category (tweet or webpage) and number of documents
6.2 Frequent words and events in each cluster for the “Remember April 16” dataset
6.3 Top words for topics obtained by running LDA on the “Remember April 16” dataset. The results show only the best 6 topics; the remaining 4 topics were incoherent.
6.4 Topics from Twitter-LDA that did not appear in LDA for the “Remember April 16” dataset
6.5 Cluster naming based on frequent words for the “Solar Eclipse 2017” tweet data
6.6 Keywords for topics in the “Solar Eclipse 2017” tweet collection
6.7 The cosine similarity analysis of “Solar Eclipse 2017” webpage data
6.8 Cluster naming based on frequent words for the “Solar Eclipse 2017” webpage data
6.9 Keywords for topics in the “Solar Eclipse 2017” webpage collection
6.10 The cosine similarity analysis of “Hurricane Irma” webpage data
6.11 Cluster naming based on frequent words for the “Hurricane Irma” webpage data
6.12 Keywords for topics in the collection “Hurricane Irma”
6.13 The cosine similarity analysis of “Vegas Shooting” webpage data
6.14 Cluster naming based on frequent words for the “Vegas Shooting” webpage data

7.1 HBase Schema: Fields for Topic Analysis
7.2 HBase Schema: Fields for Clustering

8.1 File Inventory ...... 45

Chapter 1

Introduction

The CS5604 course project aims to build a state-of-the-art information retrieval (IR) system in support of the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The semester-long project is divided into several subareas undertaken by different teams. These are Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Database and Indexing (SOLR), and Front-end and Visualization (FE). This report focuses on the results of the Clustering and Topic Analysis (CTA) team.

1.1 Problem Statement

Building a state-of-the-art information retrieval system involves several components, each handled by one team. CMW and CMT crawl the Internet to fetch event-related webpages and tweets, respectively. CLA refines the data processed by CMW and CMT to associate webpages and tweets with a specific event. Our team (CTA) takes the classified data and learns cluster(s) and keywords/topics for each document, so that SOLR can index these documents using Lily. This helps FE fetch documents from the computer cluster when a user searches for something.

A naive approach to extracting keywords would be to find the top-n most used words in the entire collection. There are two major problems with this approach. First, it assumes all documents in a collection talk about the same keywords. It would not allow users to search for different facets of an event – such as relief-related documents vs. destruction-related documents in the collection Hurricane Harvey. Second, it is possible that a document may be about school shootings without mentioning the words “school” or “shooting”.

Clustering and topic modeling algorithms help identify semantically related words not just in a single document, but across similar documents. For instance, if “gun” occurred frequently with


“school” and “shooting”, a search for “gun” would yield results about school shootings. These algorithms also help us identify recurring sub-themes within the collection. This is particularly important in documents about events, as they often discuss different aspects of the event. As an example, these algorithms can identify different facets of the event Hurricane Irma: some articles talk about “destruction” and “damage”, while others talk about “weather” and “storm”. Once we find the themes in the corpus, the documents are indexed using the discovered topics and clusters to improve the quality of search and retrieval. The front-end team uses these topics to design a faceted search. These topics can also be used to validate the results of the clustering algorithms, and vice versa. The approaches for clustering and topic modeling are discussed in the subsequent subsections. More background information is given in Chapter 2.

1.2 Clustering

Clustering can be intuitively thought of as a process of placing similar objects close to each other and dissimilar objects away from each other. It is an example of unsupervised learning, where grouping into natural categories takes place when no class label is available. In clustering, the goal is to reduce the distance between objects in the same cluster and increase the distance between objects of different clusters [17]. Clustering can be used for finding latent groupings that will later be useful for categorization. It may help to gain insight into the nature of the data. In addition, it may also lead to the discovery of distinct subclasses or similarities among patterns. Clustering can be classified into two categories based on the assignment of objects.

1. Hard clustering: Each data point is assigned to only one of the given clusters.

2. Soft clustering: Instead of putting each data point into a single cluster, a probability or likelihood is assigned for a data point belonging to each cluster.

Clustering algorithms can also be grouped by the type of model used to generate clusters. These are:

1. Centroid Models: In these algorithms, similarity is derived from the closeness of a data point to the centroids of the clusters. These are iterative algorithms, and the popular k-means clustering algorithm falls under this category. It is described in Algorithm 1.

Algorithm 1 The k-means algorithm
 1: procedure KMeans(k, data)
 2:   centroids ← randomly select k points from data as initial centroids
 3:   while centroids do not change do        ▷ This is the convergence criterion
 4:     assign each point in data to the cluster of its nearest centroid
 5:     for i ← 1, k do
 6:       centroids[i] ← recompute centroid for cluster i
 7:     end for
 8:   end while
 9:   return centroids
10: end procedure

The closeness of data points within a cluster is represented by a distance measure. This could be based on the L1 distance, L2 distance, cosine similarity, correlation, or sum of squared errors. (A minimal NumPy sketch of Algorithm 1 appears at the end of this section.)

2. Distribution Models: In distribution models, data points are related to each other based on their likelihood of belonging to the same probability distribution. Popular algorithms like Expectation-Maximization (EM) and the Gaussian Mixture Model (GMM) fall under this category. Initially, we start with a fixed number of distributions and iteratively update them to fit the data distribution, such that the likelihood of the data given the distributions is maximized.

3. Hierarchical Models: These models hierarchically aggregate or divide points into clusters based on their distance from each other. The two main components of hierarchical models are the distance function (distance between points) and the linkage function (distance between clusters). Based on the recursive approach, there are two types of hierarchical models.

• Agglomerative Clustering Approach: Start with each data point as an individual cluster and aggregate them to form larger clusters.
• Divisive Clustering Approach: Start with all data points in a single cluster and partition the large cluster to form smaller clusters.

These models are very easy to interpret but lack scalability for handling big datasets.

4. Density Models: These models search the data space for regions of varying density of data points. They isolate the different density regions and assign the data points within the same region to the same cluster. DBSCAN is a popular example of a density model.

In this report, we focus on centroid and hierarchical clustering techniques, namely k-means and agglomerative clustering, using bag-of-words feature vectors. For each cluster, we output the n most frequent words in the cluster and set these as keywords for all documents in that cluster.
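To make Algorithm 1 concrete, the following is a minimal NumPy sketch of the iteration with the L2 distance. The data matrix, k, and the random seed are illustrative assumptions, and the sketch ignores edge cases such as empty clusters.

```python
import numpy as np

def kmeans(data, k, seed=0):
    """Plain k-means with Euclidean distance, mirroring Algorithm 1."""
    rng = np.random.default_rng(seed)
    # Randomly select k points from data as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    while True:
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([data[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence criterion
            return centroids, labels
        centroids = new_centroids

centroids, labels = kmeans(np.random.rand(100, 5), k=3)  # toy data
```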

1.3 Topic Analysis

Topic analysis or topic modeling aims to find latent (hidden) groups of words, called topics, in a large corpus of text documents. The topics discovered by these techniques can be defined as groups (or themes) of semantically similar words. Topic modeling uses statistical techniques to discover these topics, based on the co-occurrence of words in the documents. Given a set of documents about “computers”, “basketball”, and “economics”, sample words in each of the topics are given in Table 1.1.

Topic 1: computer, game, ibm, program, machine, design, software, memory
Topic 2: basketball, team, league, team, coach, player, nba, ncaa
Topic 3: economic, economy, government, investment, market, trade, growth, policy

Table 1.1: Sample topics from a collection of Wikipedia articles collected using a keyword search for “computers”, “basketball”, and “economics”

Algorithm 2 Latent Dirichlet Allocation Algorithm
 1: procedure LDA(k, documents, iterations)
 2:   Randomly initialize topic assignments Z
 3:   for each iteration do
 4:     for each document do
 5:       for word w ← 1, number of words in document do
 6:         z ← sampleTopic(w)        ▷ Ignore current assignment when sampling
 7:         Update topic assignments Z
 8:       end for
 9:     end for
10:   end for
11:   return topic assignments Z
12: end procedure

Recent work in topic modeling is based on Latent Dirichlet Allocation (LDA) [8]. LDA is a probabilistic generative model that observes word frequencies and co-occurrence in documents and infers the topic distribution using sampling techniques. It models each document as a mixture over topics, and each topic as a mixture of words. Using this assumption, the algorithm aims to find the top-ranked words for each topic. Since the document-topic distribution and topic-word distribution are treated as latent variables, we use an approximation technique to tease out the probability distributions. Two such techniques are Gibbs sampling and Expectation-Maximization (EM). The algorithm for LDA using Gibbs sampling is given in Algorithm 2.

Chapter 2

Literature Survey

2.1 Clustering

Clustering algorithms can be classified into five groups – partitioning-based, hierarchy-based, density-based, grid-based, and model-based.

2.1.1 Partition-based Clustering

In partitioning-based clustering algorithms, data objects are initially divided into a number of partitions. Each partition represents a cluster. The partitioning is optimized for a pre-specified criterion function. The most typical partitioning-based clustering algorithm is the k-means algorithm. k-means clustering [18] represents data as real-valued vectors in d-dimensional space R^d. Initially, data points are partitioned into K clusters with K center points. Then, the algorithm iteratively updates the center points to minimize the mean squared distance from each data point to its nearest center point. The major challenge in the k-means algorithm is to determine the number of clusters K. This is explored later in this report.

Based on k-means, fuzzy clustering methods such as fuzzy c-means (FCM) [7] have also been proposed. In FCM, data points are assigned to cluster centers with a degree of belief. Therefore, each data point may belong to more than one cluster with different memberships. FCM follows the same principle as k-means: it iteratively searches for the center points and updates the membership of each data object. However, its goal is to minimize the objective function J below.

J = \sum_{i=1}^{n} \sum_{k=1}^{c} \mu_{ik}^{m} \, |p_i - v_k|^2    (2.1)


Here, n is the number of data points, c is the number of clusters, \mu_{ik} is the degree of belief that data point i belongs to cluster k, m is a fuzziness factor, and |p_i - v_k| is the Euclidean distance between the i-th object p_i and the k-th cluster center v_k.

2.1.2 Hierarchical Clustering

In hierarchical clustering algorithms, data points are organized in a hierarchical manner depending on a measure of proximity [12]. Each data point is a leaf node in the tree-like hierarchical structure. Hierarchy-based algorithms can either be bottom-up or top-down. Bottom-up methods start with one cluster per data point and recursively merge clusters. Top-down methods start with one cluster and recursively split it into multiple clusters according to a certain metric. The major drawback of hierarchy-based methods is that merge or split steps cannot be undone.

BIRCH [21] is an efficient hierarchy-based clustering algorithm. It builds a clustering feature tree (CF tree) by scanning the dataset in an incremental and dynamic way. When a data point is encountered, the CF tree is traversed from root to leaf by choosing the closest node at each level. After the closest leaf cluster is identified, a test is performed to check whether the current data point belongs to this leaf cluster. If not, a new leaf cluster is created. Two major advantages of BIRCH are its ability to deal with large datasets and to handle noise. However, it does not have good stability and may not work well when clusters are not spherical.

Compared to BIRCH, CURE [13] is more robust in noise handling and can identify clusters with non-spherical shapes. CURE represents each cluster by a set of well-scattered points and shrinks them towards the center of the cluster by a specified fraction. With more than one representative point per cluster, CURE is able to adjust well to the geometry of clusters with sophisticated shapes, which suppresses the noise effect. In addition, CURE applies a combination of random sampling and partitioning to deal with large datasets.

2.1.3 Density-based Clustering

In density-based clustering algorithms, data objects are separated based on regions of density, connectivity, and boundary. A cluster is a connected, dense component with arbitrary shape. This feature provides natural protection against outliers and filters out noise. Two typical density-based clustering algorithms are density-based spatial clustering of applications with noise (DBSCAN) [11] and density-based clustering (DENCLUE) [15]. In DBSCAN, a data object is assigned to a cluster when the density in its neighborhood is high enough. Clusters are created from a data object by absorbing all objects in its neighborhood. DENCLUE models cluster distributions based on the sum of the influence functions of all data objects. An influence function describes the impact of a data object on its neighborhood. DENCLUE creates clusters according to density attractors, which are the local maxima of the overall density function.

DENCLUE is much faster than DBSCAN because it uses tree-based access structures.

2.1.4 Grid-based Clustering

In grid-based clustering algorithms, data objects are divided into grids. Clustering is then performed on the grids instead of directly on the large dataset. A major advantage of grid-based methods is speed, since the number of grid cells is usually much smaller than the size of the dataset. However, they are not good at handling datasets with irregular distributions. Optimal Grid (OptiGrid) [14] is a grid-based clustering algorithm which aims at achieving optimal grid partitioning. OptiGrid constructs the best cutting hyperplanes through a set of selected projections. Each cutting plane is chosen to have minimal point density. After grid construction, clusters can be found using a density function. The algorithm is then applied recursively on the clusters to achieve better clustering.

2.1.5 Model-based Clustering

Model-based clustering algorithms are designed to optimize the fit between a given dataset and a certain mathematical model. There are two types of model-based methods: statistical and neural network methods. Statistical methods use probability measures to determine clusters, while neural network methods utilize a set of weighted connections between input/output units to derive clusters. One example of a statistical method is the Expectation-Maximization (EM) algorithm [10]. As the name indicates, EM iterates between two steps. In the expectation step, data objects are fractionally assigned to each cluster according to the posterior distribution of the latent variables, which is derived using the current model parameters. In the maximization step, the model parameters are re-estimated from these fractional assignments using the maximum likelihood rule. However, the EM algorithm has many mathematical requirements and a slow convergence rate.

2.2 Topic Analysis

Topic modeling is used to represent a set of documents/text corpora by a distribution of hidden topics. Topics refer to unobserved structure that is discovered using observed data – the words present in the documents. Thus, a distribution of related words makes up a topic. Topics help preserve the essential relationships among words in a document, thereby reducing the dimensionality of the documents to a small set of topics.

2.2.1 TF-IDF

Different methodologies have been used to retrieve data from text corpora. The term frequency-inverse document frequency (tf-idf) method calculates the frequency of a word/term (tf) in a document and multiplies it by the inverse document frequency (idf) value to return a term-by-document matrix.

2.2.2 Latent Semantic Indexing

The Latent Semantic Indexing (LSI) method was later proposed; it applies a singular value decomposition (SVD) to the tf-idf matrix to obtain a reduced representation. Following this, the probabilistic Latent Semantic Indexing (pLSI) model was proposed by Hofmann [16]. It models a document as a mixture of topics based on the likelihood principle. However, there is no generative process for determining the document-topic distribution, which leads to problems when assigning probabilities to documents outside the training set.

2.2.3 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) [8] is an unsupervised machine learning technique which assumes a hierarchical Bayesian dependency between the documents, topics, and words for topic modeling. It was first presented by David Blei, Andrew Ng, and Michael I. Jordan as a generative probabilistic model for collections of discrete data such as text corpora. It overcomes the limitations of the pLSI method. Given a collection of documents, LDA assigns a distribution of words to every topic and a distribution of topics to every document.

Figure 2.1: Plate notation for LDA (courtesy Wikipedia)

As explained in [8], LDA assumes the following generative process:

• For each topic k ∈ {1, ..., K}, draw a topic-word distribution φ_k ∼ Dirichlet(β)
• For each document m:
  – Draw a document-topic distribution θ_m ∼ Dirichlet(α)
  – For each word n in document m:
    ∗ Draw a topic z_{m,n} ∼ Multinomial(θ_m)
    ∗ Draw a word w_{m,n} ∼ Multinomial(φ_{z_{m,n}})

The model is equivalently explained through a plate notation diagram (Figure 2.1). There are M documents in the corpus. For simplicity, the diagram assumes N words in each document. Each word w has a topic z, which is generated from the document-topic distribution θ. α and β are the hyperparameters of the model (also known as Dirichlet priors).

2.2.4 Twitter-LDA

Twitter-LDA [22] is a variant of the standard LDA model, developed for the purpose of modeling tweets. Given the constraint on tweet length (140 characters, later expanded to 280), this model assumes that a tweet contains a single topic, and it models background words and topic-related words separately to give a more realistic model of Twitter text. Topic modeling on tweets has also been studied in [6] and [19].

Figure 2.2: Plate notation for Twitter-LDA [22]

The generative process for Twitter-LDA is as follows:

• Draw a word-category distribution π ∼ Bernoulli(γ)
• Draw a background word distribution φ^B ∼ Dirichlet(β)
• For each topic k, draw a topic-word distribution φ′_k ∼ Dirichlet(β)
• For each user u:
  – Draw a topic distribution θ_u ∼ Dirichlet(α)
  – For each tweet m_{u,t} by user u:
    ∗ Draw a topic z_{u,t} ∼ Multinomial(θ_u)
    ∗ For each word w_{u,t,n}:
      · Draw a category y_{u,t,n} ∼ Bernoulli(π)
      · If y_{u,t,n} = 1, draw the word w_{u,t,n} ∼ Multinomial(φ′_{z_{u,t}}); otherwise (y_{u,t,n} = 0), draw w_{u,t,n} ∼ Multinomial(φ^B)

This model is equivalently explained by Figure 2.2. Each user u has a topic distribution θ_u. Since every tweet is only 140 (later 280) characters, there is only one topic z for a tweet t. Each word in the tweet is drawn from a multinomial distribution.

Chapter 3

Requirements Gathering

The system aims to support a helpful search experience for its users. Our team analyzes the topics in the clean text of tweets and webpages that have already been classified by CLA on the basis of different events. Since CMT and CMW already convert the text to lowercase, tokenize it, and remove stop words, we take in preprocessed text for our unsupervised learning algorithms. The text in these fields is UTF-8 encoded. The tokenization technique, using NLTK [2], “tokenizes a string to split off punctuation other than periods”.

The most relevant fields in the HBase schema for CTA are clean-webpage:clean-tokens and clean-tweet:clean-tokens. To use custom tokenization, one can also directly use clean-webpage:clean-text-cta and clean-tweet:clean-text-cta. We find that the preprocessed text for webpages is less noisy, so the default NLTK implementation might work. For tweets, however, a more customized implementation such as NLTK’s TweetTokenizer [1] or a similar implementation called tweetokenizer [20] might work better.

3.1 Clustering

For clustering, the tokenized text is converted to a bag-of-words representation, where we ignore the order of occurrence of words in a document. Each document is represented as a high-dimensional vector. We run the k-means algorithm on these vectors with different numbers of clusters K. To determine the best model (i.e., the best number of clusters), we plot the Calinski-Harabasz index for each model. The plot resembles an elbow plot, and the best number of clusters is determined by the point at which the Calinski-Harabasz index drops off steeply. This is discussed in more detail in the evaluation section. The keywords for each cluster are its n most frequently used words.
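The following is a minimal sketch of this selection loop, assuming scikit-learn is available; the toy documents and the range of K are illustrative (older scikit-learn releases spell the metric calinski_harabaz_score):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Hypothetical stand-in for a collection of cleaned, tokenized documents.
documents = ["total eclipse path oregon", "eclipse glasses eye safety",
             "hurricane irma florida landfall", "irma storm surge damage",
             "vegas shooting police", "shooting victims vigil"] * 20

X = CountVectorizer().fit_transform(documents).toarray()  # bag-of-words vectors

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
# Plot scores[k] against k and pick the point where the curve drops off steeply.
```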


3.2 Topic Analysis

For topic modeling, the tokenized text is mapped to a high-dimensional vector space where each term is encoded as a number. We run LDA on this vector space to determine the document-topic distribution and topic-word distribution. Using these two distributions, we find the keywords for each document in the corpus. The quality of our results is evaluated using two quantitative measures – perplexity and topic coherence. We favor models with low perplexity and high topic coherence. The two measures are described in the evaluation section. We also plan to conduct a qualitative study, where students in the class will be asked to determine whether a set of words is coherent or not.

3.3 Outputs

The outputs of the CTA team are the most probable clusters and topics for each document, and a set of words representing each topic or cluster. The topic analysis team maps the list of topics along with their probabilities to topic:topic-list and topic:probability-list, respectively. The clustering team populates the fields cluster:cluster-list and cluster-probability. To help the FE team, we also populate two fields – topic:topic-displaynames and cluster:cluster-displaynames. These correspond to the highest probability topic and cluster, respectively.

Chapter 4

Design and Deliverables

4.1 System Design

Figure 4.1: Pipeline for text processing. The CTA team now begins the preprocessing pipeline at Step 3 (remove stop words and punctuation), as the text is already tokenized and lowercased.

The CTA team uses tokenized text provided by the CMT and CMW teams. The incoming documents undergo preprocessing to normalize them and filter out redundant information.

In general, each document is converted to lowercase and tokenized into blocks of text based on word boundaries (each block called a token). A token is simply a sequence of characters that acts as a useful semantic unit for processing. For example, “Harvey was a catastrophic flood disaster in southeast Texas” will be split into harvey, was, a, catastrophic, flood, disaster, in, southeast, and texas. CMT and CMW have already uploaded tokenized and lowercased text to HBase, thereby eliminating the first two steps in our pipeline.

The CTA team begins at Step 3, which involves removing stop words such as the, is, a, etc. These offer no analytic value to the text mining process. Punctuation marks are another component with little or no importance to topic analysis; they can be safely eliminated without any loss of information. For example: “It’s hard to believe, but it has already been a month since Hurricane Harvey made landfall in Texas.” ⇒ it, ’s, hard, to, believe, ,, but, it, has, already, been, a, month, since, hurricane, harvey, made, landfall, in, texas. The emphasized strings are stop words and punctuation, which can be eliminated. In this example, we removed 55% of the tokens, which would otherwise have an impact on the overall runtime of the topic analysis system. Many documents in social media contain hashtags (#hashtag). These should not be removed, as they offer key insights into the themes of the document. URLs, on the other hand, can be safely removed from documents for clustering and topic analysis. As a rule of thumb, we also eliminate all tokens with length less than 3.

An optional step in the pipeline is stemming or lemmatization. Stemming reduces words to a base form using a simple algorithm. For example, “forest”, “forests”, and “forested” all reduce to a single word: “forest”. This helps reduce the dimensionality of the vector space and conflates words that differ only in tense or plurality. Lemmatization is another technique for reducing words to their root forms. Stemming employs a rule-based system, whereas lemmatization takes into account the part-of-speech tag and linguistic analysis to normalize words. For example, “am”, “are”, and “is” will all be lemmatized to “be”. Stemming or lemmatization can be helpful for webpages; however, these techniques are not suitable for tweets. Tweets often ignore normal grammar rules and may not consist of dictionary words, which makes stemming and lemmatization troublesome. Currently, we do not perform any stemming or lemmatization on the data. The preprocessing step is trivially parallelizable for both webpages and tweets, which increases the performance of our design.
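As a minimal sketch of the Step 3 filtering on the example sentence above, assuming NLTK is installed along with its punkt and stopwords data; the length cutoff mirrors the rule of thumb just described:

```python
import string
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
sentence = ("It's hard to believe, but it has already been a month "
            "since Hurricane Harvey made landfall in Texas.")

tokens = [t.lower() for t in word_tokenize(sentence)]
kept = [t for t in tokens
        if t not in stop_words                            # drop stop words
        and not all(c in string.punctuation for c in t)   # drop punctuation
        and len(t) >= 3]                                  # drop very short tokens
# kept now holds content words such as 'hurricane', 'harvey', 'landfall', 'texas'.
```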

4.2 Technologies Used

The topic modeling system uses Python extensively. Broadly, it has three main responsibilities:

1. Database Access: We use the Python library happybase to retrieve data from and put data into HBase. However, happybase is prone to failure when querying more than one million

records. To overcome this difficulty, we use a shell (.sh) script when pulling data with over a million records. We also batch our queries whenever possible, using happybase’s built-in batch method to reduce the load on the database server. (A sketch of this batched access pattern appears after this list.)

2. Preprocessing and LDA: Preprocessing generally includes the removal of stop words and punctuation. We augment the stop word list with collection-specific stop words. We achieve this using nltk and a custom preprocessing pipeline. The Python library gensim performs the heavy lifting of topic modeling. Since we run this on a single node of the cluster, we parallelize the algorithm across all 20 cores of the node using the LdaMulticore class of gensim.

3. Visualization: pyLDAvis is a bring-your-own-topic-model package for visualizing inter-topic distances and the most relevant words for a topic. We incorporate this library to understand and evaluate the coherence of our topic models.
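The following is a minimal sketch of the batched happybase access pattern from item 1; the host name, table name, row key, and batch sizes are illustrative assumptions, while the column names follow the schema described in Chapter 7.

```python
import happybase

connection = happybase.Connection('hbase-master.example.org')  # hypothetical host
table = connection.table('cs5604-f17-collection')              # hypothetical table

# Stream rows in server-side batches instead of loading everything at once.
for row_key, data in table.scan(columns=[b'clean-tweet:clean-tokens'],
                                batch_size=1000):
    tokens = data[b'clean-tweet:clean-tokens'].decode('utf-8').split()

# Write results back in client-side batches to reduce round trips.
with table.batch(batch_size=1000) as batch:
    batch.put(b'some-row-key', {b'topic:topic-list': b'2,5,7'})
```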

Figure 4.2: Latent Dirichlet Allocation uses a Python based system with three main capabilities – access to HBase, preprocessing and LDA, and visualization.

Alternate systems can be built using Scala, Spark, and MLlib (as used by the Fall 2016 team). However, we found Scala lacking the rich preprocessing and visualization support that Python-based libraries excel in. Additionally, we did not face scaling problems, as gensim is written using numpy and scipy, both of which wrap optimized C libraries. This makes gensim adequately fast for large collections of data.

We also tried PySpark to create topic models. However, the LDAModel implementation in PySpark is incomplete and lacks methods for labeling documents by topic and for evaluation. Therefore, we did not proceed with this approach.

The same technologies are used for database access and preprocessing in clustering. Scala and Spark are used to cluster the data. matplotlib and Python scripts are used for evaluating and plotting the results.

4.3 Timeline

The major tasks for our team are highlighted in the table below. The work is broadly divided into three phases – Literature Survey, Baseline Implementation and Scaling, and Experiments (Runs) and Evaluation.

Date | Task List | Team Member | Status

Literature Survey
Sep 19, 2017 | Literature survey | Entire team | Done
Sep 26, 2017 | Interim Report 1 | Entire team | Done

Baseline Implementation and Scaling
Oct 03, 2017 | Implement baseline LDA | Ashish B, Aman | Done
Oct 05, 2017 | Data preprocessing for clustering | Mo, Prathyush | Done
Oct 08, 2017 | Generate bag of words for tweet collection | Pavan, Ram | Done
Oct 10, 2017 | Implement baseline Twitter-LDA | Ashish B, Aman | Done
Oct 12, 2017 | Implement k-means clustering | Pavan, Ram | Done
Oct 13, 2017 | Evaluate LDA and Twitter-LDA on “election” data | Ashish B | Done
Oct 15, 2017 | Interpretation of results and visualization | Mo, Prathyush | Done
Oct 15, 2017 | Find best number of clusters qualitatively | Ashish B, Shruti | Done
Oct 18, 2017 | Interim Report 2 | Entire team | Done
Oct 19, 2017 | Implement hierarchical clustering | Pavan, Ram | Done
Oct 21, 2017 | Implement quantitative measures: perplexity and topic coherence | Ashish B | Done
Oct 21, 2017 | Performance comparison and selection of final value of K | Mo, Prathyush | Done
Oct 24, 2017 | Compare LDA with Twitter-LDA results | Ashish B, Aman | Done
Oct 31, 2017 | Package LDA experiments and evaluations as a single script | Ashish B | Done

Experiments and Evaluation
Nov 04, 2017 | Run LDA on CMW collections – “Solar Eclipse 2017”, “Hurricane Irma”, “Las Vegas Shooting” | Ashish B | Done

Nov 06, 2017 | Run clustering algorithm on CMW collections | Mo, Prathyush | Done
Nov 08, 2017 | Interim Report 3 | Entire team | Done
Nov 14, 2017 | Integration with HBase to pull and write data seamlessly | Shruti, Ashish M, Mo, Prathyush |
Nov 21, 2017 | Thanksgiving break | |
Nov 27, 2017 | Collate results between clustering and topic modeling | Ashish B, Pavan, Ram | Done
Nov 30, 2017 | Qualitative evaluation of topics and clusters | Entire team | Done
Dec 07, 2017 | Final Presentation | Entire team | Done
Dec 12, 2017 | Final Report | Entire team | Done

Table 4.1: Timeline of task list

Chapter 5

Implementation and Evaluation Techniques

5.1 Preprocessing

Each document is preprocessed before being fed into the clustering and topic analysis algorithms. The pseudocode for our approach is described in Algorithm 3.

Algorithm 3 Preprocessing Algorithm
 1: procedure Preprocess(document, tokenizer, mappers, filters)
 2:   tokens ← tokenize(document)
 3:   for each mapper in mappers do
 4:     tokens ← map(mapper.map, tokens)
 5:   end for
 6:   for each filter in filters do
 7:     tokens ← filter(filter.filter, tokens)
 8:   end for
 9:   return tokens
10: end procedure

We use Python’s functional programming helper functions map and filter to keep our interface simple. Examples of mappers include LowerCaseMapper, which converts all tokens to lowercase; WordNetLemmatizer, which lemmatizes each token using WordNet; and PorterStemmer, which stems words using Porter’s algorithm.

Examples of filters include StopWordFilter, which removes stop words, and PunctuationFilter, which removes all punctuation from the tokens.
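A minimal sketch of how such mappers and filters compose, mirroring Algorithm 3; the class bodies here are illustrative assumptions rather than the project’s exact implementations (the real lemmatizer and stemmer come from nltk.stem):

```python
import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

class LowerCaseMapper:
    def map(self, token):
        return token.lower()

class StopWordFilter:
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
    def filter(self, token):
        return token not in self.stop_words

class PunctuationFilter:
    def filter(self, token):
        return not all(ch in string.punctuation for ch in token)

def preprocess(tokens, mappers, filters):
    for mapper in mappers:
        tokens = map(mapper.map, tokens)
    for f in filters:
        tokens = filter(f.filter, tokens)
    return list(tokens)

tokens = preprocess(["It", "is", "hard", "to", "believe", ","],
                    [LowerCaseMapper()],
                    [StopWordFilter(), PunctuationFilter()])
# tokens -> ['hard', 'believe']
```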


Figure 5.1: The three stages of our preprocessing pipeline – tokenization, mapping, and filtering.

One of the most important aspects of preprocessing is maintaining a list of words that are stop words for a particular collection. For instance, webpages about Solar Eclipse 2017 mention words such as solar, eclipse, tse, etc. These words do not add any meaning to topic modeling or clustering results, as they are present in almost all documents. We developed these lists for each collection over multiple runs and experiments. A sample of these words is shown in Table 5.1.

5.2 Clustering

5.2.1 Implementation Details

The implementation of the clustering algorithm is shown in Algorithm 4. The first step is to represent each document as a bag of words, which yields a corpus. The second step is to determine the number of clusters K with the elbow method, using the Calinski-Harabasz index as the metric. A higher Calinski-Harabasz score indicates a model with better-defined clusters. For K clusters, the Calinski-Harabasz score s is the ratio of the between-cluster dispersion and the within-cluster dispersion:

s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}

W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T

B_k = \sum_{q} n_q (c_q - c)(c_q - c)^T

Here, N is the number of points in our data, C_q is the set of points in cluster q, c_q is the center of cluster q, c is the center of the entire dataset, and n_q is the number of points in cluster q.

Solar Eclipse 2017: solar, eclipse, totality, tse, eclipse2017, eclipses, subnav, aug, csbn, account, twitter, published, username, password

Hurricane Irma: hurricane, irma, national, global, us, sept, september, august, florida, business, insider, guardian, subscribe, bloomberg

Table 5.1: A sample of collection-specific stop words for Solar Eclipse 2017 and Hurricane Irma

We run the k-means and hierarchical clustering algorithms with the corpus, a maximum number of iterations, and the value K (the number of clusters) as input. The iterations parameter bounds how many iterations run before the algorithm terminates. In addition to the well-known k-means, we also implement hierarchical clustering. We use a divisive method, in which each cluster is split into sub-clusters in subsequent steps. Depending on the threshold we choose, we obtain a certain number of clusters; lowering the threshold yields more clusters. Currently, for hierarchical clustering, we set the threshold so that only four clusters are generated. Further, we have to decide on the number of clusters at which to stop the divisive process. A sketch of threshold-based cluster selection is shown after Algorithm 4.

Algorithm 4 Run Clustering Algorithm
 1: procedure RunCluster(documents, iterations)
 2:   corpus ← bag-of-words representation of each document
 3:   K ← Elbow(documents)
 4:   clusters_kmeans ← KMeans(corpus, iterations, K)
 5:   clusters_hier ← Hierarchical(corpus, iterations)
 6:   return clusters_kmeans, clusters_hier
 7: end procedure
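The following is a minimal sketch of threshold-based selection of the number of clusters, using scikit-learn’s AgglomerativeClustering. Note this is the bottom-up (agglomerative) variant rather than the divisive procedure described above, but the threshold behaves analogously: lowering it yields more clusters. The documents and threshold value are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

documents = ["eclipse path oregon", "eclipse glasses safety",
             "irma florida landfall", "irma storm surge"] * 10  # hypothetical docs

X = CountVectorizer().fit_transform(documents).toarray()

# n_clusters=None lets the distance threshold decide how many clusters emerge;
# lowering the threshold yields more (smaller) clusters.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
labels = model.fit_predict(X)
print(model.n_clusters_, "clusters at this threshold")
```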

Frequent Word Analysis: k-means returns the clusters but does not name them. We could name each cluster by looking at a handful of its documents, but this is tedious in a Big

Data scenario. Therefore, (i) we determine the most frequent words across all documents within a cluster, and (ii) name the cluster based on a few of these very frequent words. Here, we rely on the assumption that the frequent words in a cluster describe the information in the documents within that cluster. Algorithm 5 describes the first step, i.e., finding the frequent words in each cluster. Once we obtain the frequent words of a cluster, we use them to manually label the cluster with a suitable name related to these words.

Algorithm 5 Frequent Words Analysis Algorithm
Input: documents belonging to a single cluster, and a threshold on the minimum frequency required for a word to be considered “frequent”
Output: words with frequency at or above the threshold
 1: procedure FrequentWords(documents, threshold)
 2:   frequent_words ← ∅        ▷ the set of frequent words
 3:   frequency_map ← ∅         ▷ maps each word to its frequency in the cluster
 4:   for each document in documents do
 5:     for each word in document do
 6:       if word in frequency_map then
 7:         increase word’s frequency by 1 in frequency_map
 8:       else
 9:         add <word, 1> to frequency_map
10:       end if
11:     end for
12:   end for
13:   for each word in frequency_map do
14:     if frequency_map[word] ≥ threshold then
15:       add word to frequent_words
16:     end if
17:   end for
18:   return frequent_words
19: end procedure
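In Python, Algorithm 5 reduces to a few lines with collections.Counter; the sample documents and threshold are illustrative.

```python
from collections import Counter

def frequent_words(documents, threshold):
    """Return the words whose cluster-wide frequency meets the threshold."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, freq in counts.items() if freq >= threshold}

cluster_docs = [["remember", "sewol", "ferry"],
                ["remember", "april", "sewol"],
                ["remember", "victims"]]
print(frequent_words(cluster_docs, threshold=2))  # {'remember', 'sewol'}
```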

5.2.2 Evaluation

Silhouette

This is one of the methods used to validate or understand the clustering obtained. In the case of k-means, we have K clusters C_1, C_2, ..., C_K. Let d_i be a document in cluster C_i, and let a(d_i) represent the average dissimilarity of d_i to all other documents in the same cluster. We also compute the average dissimilarity of d_i to all documents in each other cluster C_j, j ≠ i. Let b(d_i) represent the

lowest of all these average dissimilarities. The silhouette of d_i is

s(d_i) = \frac{b(d_i) - a(d_i)}{\max(a(d_i),\, b(d_i))}    (5.1)

s(d_i) ranges from -1 to +1. If s(d_i) is close to 1, the document is said to be correctly clustered, whereas if s(d_i) is close to -1, the document is wrongly clustered. This follows directly from the definitions of a(d_i) and b(d_i).
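scikit-learn provides both the per-document silhouettes and their mean; a minimal sketch, with a random matrix standing in for the bag-of-words vectors of Section 5.2.1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.rand(60, 8)                     # stand-in for bag-of-words vectors
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

overall = silhouette_score(X, labels)         # mean s(d_i) over all documents
per_doc = silhouette_samples(X, labels)       # one s(d_i) in [-1, 1] per document
```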

Elbow

This method is used mainly in conjunction with k-means clustering to help determine the optimal number of clusters. It plots the sum of squared errors on the Y-axis against the number of clusters on the X-axis. The elbow of this plot indicates the optimal number of clusters.

5.3 Topic Analysis

5.3.1 Implementation Details

Algorithm 6 Building Vocabulary
 1: procedure BuildVocabulary(documents)
 2:   id2word ← dictionary()
 3:   word2id ← dictionary()
 4:   counter ← 0
 5:   for each document in documents do
 6:     tokens ← preprocess(document)
 7:     for each token in tokens do
 8:       if token not in word2id then
 9:         id2word[counter] ← token
10:         word2id[token] ← counter
11:         counter ← counter + 1
12:       end if
13:     end for
14:   end for
15:   return (id2word, word2id)
16: end procedure

We use Python 2.7 as our language of choice for topic modeling. We use the package gensim to train all our topic models, and its LdaMulticore module to distribute the workload across multiple cores. We follow the Gibbs sampling derivations in [5] to understand and implement Gibbs sampling in LDA and Twitter-LDA.

The input to LdaMulticore is a stream of vectors, i.e., vector-space representations of each document. Since we want our method to scale with the number of documents, we opt to operate from disk rather than main memory. This avoids loading all documents into memory and therefore scales easily.

The BuildVocabulary method helps us create a vector representation of each document. It ensures that every term gets a unique ID.
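In gensim, this bookkeeping is what the Dictionary class provides; a minimal sketch with illustrative token lists:

```python
from gensim import corpora

tokenized_docs = [["eclipse", "glasses", "sold"],
                  ["eclipse", "totality", "path"],
                  ["glasses", "eye", "safety"],
                  ["totality", "path", "oregon"]]

dictionary = corpora.Dictionary(tokenized_docs)               # token -> unique integer ID
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]  # sparse (id, count) vectors
```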

Algorithm 7 Run LDA using Gensim
 1: procedure RunLDA(documents, vocabulary, iterations, topics, hyperparameters)
 2:   corpus ← doc2id(documents)    ▷ Convert all documents into their vector representation
 3:   model ← LdaMulticore(corpus, vocabulary, topics, iterations, hyperparameters)
 4:   return model
 5: end procedure

The RunLDA method creates the topic model. Once trained, the model holds the document-topic distribution and topic-word distribution. For each topic, we extract the top 30 keywords and give the topic a human-readable name. Subsequently, for each document, we extract the top 3 topics and their corresponding probabilities. These are put into HBase as probable facets for the document. We also extract the most probable topic and populate topic:topic-displaynames with it. For quantitative evaluation, we calculate perplexity and topic coherence, which are described in the next section.
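A minimal sketch of this step, continuing from the dictionary and corpus built above; the parameter values mirror those reported in Chapter 6 but should otherwise be treated as assumptions:

```python
from gensim.models import LdaMulticore

lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10,
                   iterations=500, alpha=0.1, eta=0.01, workers=20)

keywords = lda.show_topic(0, topn=30)               # top (word, prob) pairs for topic 0
doc_topics = lda.get_document_topics(corpus[0])     # (topic, probability) pairs
top3 = sorted(doc_topics, key=lambda t: -t[1])[:3]  # the 3 most probable topics
```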

5.3.2 Evaluation

Quantitative

One of the main evaluation techniques for topic modeling is perplexity. Informally, perplexity measures the cross entropy between the empirical distribution and the predicted distribution. The perplexity of a model for a test set of M documents is given by:

\mathrm{Perp}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right\}

where N_d is the number of words in document d and p(w_d) is the probability the model assigns to the words of document d.

By this definition, a lower perplexity score indicates a better model.
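Both perplexity and topic coherence are available in gensim; a minimal sketch, assuming the lda model, corpus, tokenized_docs, and dictionary from Section 5.3.1:

```python
from gensim.models import CoherenceModel

bound = lda.log_perplexity(corpus)  # per-word likelihood bound; lower perplexity is better

coherence = CoherenceModel(model=lda, texts=tokenized_docs, dictionary=dictionary,
                           coherence='c_v').get_coherence()
# Higher coherence indicates more interpretable topics.
```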

Qualitative

Evaluating topic models quantitatively is fraught with challenges. Often, human evaluation is necessary to determine the coherence of a topic. The authors in [9] use a word intrusion task, asking humans to detect the odd one out of a set of words for a topic. If they are consistently able to find the out-of-place word, the topics are coherent. While we had initially planned to crowd-source such annotations from the class, we did not find enough time to complete this.

Chapter 6

Results

Our experiments are performed on both tweet and webpage data, as shown in Table 6.1. For example, 56,376 tweets were collected through a keyword search for “remember April 16”. 721 webpages and 2,667,720 tweets were collected using the keyword “solar eclipse 2017”. 2,741 and 913 webpages were collected for “Hurricane Irma” and “Vegas Shooting”, respectively. A sample of the tweets is shown in Figure 6.1.

Table 6.1: Dataset descriptions with category (tweet or webpage) and number of documents

Dataset Name        | Type    | No. of documents
Remember April 16   | tweet   | 56,376
Solar Eclipse 2017  | webpage | 721
Solar Eclipse 2017  | tweet   | 2,667,720
Hurricane Irma      | webpage | 2,741
Vegas Shooting      | webpage | 913

Figure 6.1: Clean Tweet Data Sample


6.1 Remember April 16 Tweets

6.1.1 Clustering

We use the Calinski-Harabasz index as the metric and apply the elbow method to find the optimal value of K in k-means clustering. As illustrated in Figure 6.2, the Calinski-Harabasz index does not drop much after K = 4. Therefore, K = 4 is the optimal value.

Figure 6.2: Calinski-Harabasz index vs. number of clusters for the “Remember April 16” dataset

However, after performing the experiments on various datasets, we observed that the cluster distribution is skewed, with one cluster holding the majority of the data even though those documents are not correlated. To reduce this effect, we can increase the number of clusters depending on the dataset; for the experiments conducted, we generally used K = 5 or 6. If the results are not comprehensible, we can also cluster the large cluster iteratively to derive more meaningful results. A sample of the k-means clustering result is in Figure 6.3a. The number highlighted by a blue box is the tweet ID, while the number highlighted by a brown box is the cluster index. For example, tweet 588503698176757762 belongs to cluster 0. Figure 6.3b presents the tweet distribution among the different clusters. For experiments with webpage data, as shown later, we also apply a cosine similarity analysis, which calculates the similarity of intra- and inter-cluster documents to evaluate clustering results. However, we did not perform the cosine similarity analysis for tweet data because of the large data size and the quadratic growth in the number of document pairs. To understand the meaning of each cluster, we find the most frequently used words in the tweets belonging to the cluster. From the frequent words, we infer the real-world event or news that is frequently discussed in that cluster’s tweets. The frequent words and the corresponding event/news for each cluster are shown in Table 6.2. For example, the frequent words in Cluster 0 are: Remember, Sewol, door, missing, and April. When we search for these keywords on the Internet, we find that these tweets remember the victims and sufferers of the Sewol ferry disaster that happened on April 16, 2014. While Cluster 0 is fairly coherent, Cluster 3 lacks clarity. It is not always easy to ascertain the real-world event from the frequently used words in a cluster.

(a) Sample of k-means clustering output; (b) Tweet distribution over clusters

Figure 6.3: k-means clustering results on “Remember April 16” tweets.

Table 6.2: Frequent words and events in each cluster for “Remember April 16” dataset

Cluster 0
Frequent words: April, 16, 2014, Remember, #Sewol, door, missing
Event: Tweets remembering the sinking of the MV Sewol ferry on April 16, 2014

Cluster 1
Frequent words: March, Selena, RIPSelena, Quintanilla
Event: Remembering the American singer Selena Quintanilla, who died on March 31, 1995

Cluster 2
Frequent words: Remember, 2007, April 16, Virginia, Students, Faculty, VT
Event: Remembering the victims and the suffering of the Virginia Tech shooting that took place on April 16, 2007

Cluster 3
Frequent words: Remember, Jongin, pray, hyung-ksoo
Event: Too few tweets belong to this cluster to infer an exact event

Cluster 4
Frequent words: world, voice, opportunity, elections
Event: Tweets celebrating World Voice Day, plus some tweets about the Scottish elections

We performed hierarchical clustering on the same data. For the same number of clusters, it gives a slightly different distribution, as shown in Figure 6.4. However, since the cluster distributions are similar for both types of clustering, we excluded hierarchical clustering from further experiments.

Figure 6.4: Tweet distribution over clusters using the hierarchical clustering algorithm.

6.1.2 Topic Analysis

LDA

Topic 1: votes, monday, elections, scottish, council
Topic 2: @louisemensch, schizophrenia, laughing, trolled, ruin
Topic 3: world, opportunity, 16, communicate, #worldvoiceday
Topic 4: sewol, thousands, ferry, stampede, family, people, died
Topic 5: selena, quintanilla-perez, love, la, always, great
Topic 6: years, never, tech, students, faculty, #neverforget, virginia

Table 6.3: Top words for topics obtained by running LDA on the “Remember April 16” dataset. The results show only the best 6 topics; the remaining 4 topics were incoherent.

Running the LDA algorithm on the same dataset yielded slightly different results than clustering. We trained the model with K = 10, α = 0.1, β = 0.01, iterations = 500. We obtained a list of the top 30 words for each topic and labeled the topics manually. The results are shown in Table 6.3.

Topic 1: Words like votes, monday, elections, scottish, council, registered, etc. are closely associated with the theme “Scottish elections”.
Topic 2: @louisemensch, schizophrenia, laughing, trolled, and ruin concern Louise Mensch – a British journalist and former Conservative Member of Parliament.
Topic 3: Monday, April 16 is observed as World Voice Day, and the words world, opportunity, 16, communicate, #worldvoiceday, and basic group together tweets celebrating #WorldVoiceDay.
Topic 4: Words such as sewol, thousands, ferry, stampede, family, people, and died account for the sinking of the MV Sewol off the coast of South Korea on April 16, 2014.
Topic 5: Selena, quintanilla-perez, love, la, always, and great refer to the American singer and songwriter Selena Quintanilla-Pérez, whose birthday was celebrated on April 16.
Topic 6: Words such as years, never, tech, students, faculty, lives, lost, #neverforget, va, and heroes are about the Virginia Tech massacre that occurred on April 16, 2007.

While some of these topics are coherent, we also obtained a few other topics that we could not annotate. We ran the Twitter-LDA algorithm as well; those results are described in the next section.

Twitter-LDA

Twitter-LDA assumes that each user tweets about certain topics. It further assumes that each tweet is about only one topic, whereas LDA models each document as a mixture over topics. While Twitter-LDA recovered many of the same topics, it also identified a few different ones, described in Table 6.4. Two of those topics were about Coach Frank Beamer and a certain Maverick Gamer, with words such as changed, lives, coach, beamer, #thanksfrank and maverickgamer, victims, still, vol, 1. However, neither of these topics was about a real-world event. Seeing these results, we decided to simplify our scripts and run only LDA instead of both LDA and Twitter-LDA.

Table 6.4: Topics from Twitter-LDA that did not appear in LDA for the “Remember April 16” dataset

Topic 1: changed, lives, coach, beamer, #thanksfrank
Topic 2: maverickgamer, victims, still, vol, 1

6.2 Solar Eclipse 2017 Tweets

6.2.1 Clustering

We also ran the clustering algorithm on the “Solar Eclipse 2017” tweet data. The optimal number of clusters K was found to be 6. The cluster distribution is shown in Figure 6.5. The clusters are named based on the frequent word analysis; the names are presented in Table 6.5.

Figure 6.5: Cluster distribution for “Solar Eclipse 2017” tweets

Table 6.5: Cluster Naming based on frequent words for the “Solar Eclipse 2017” tweet data

Cluster | Name
0 | DiamondRing
1 | WatchEclipse
2 | SafeEclipse
3 | ExoPlanetMusic
4 | MidFlightEclipse
5 | NonEnglish

6.2.2 Topic Analysis

Topic (eclipse): eclipse, watch, look, block, cover, totality, circle
Topic (safety): view, don’t, glass, eye, watch, safely
Topic (photos and pictures): photos, timelapse, pictures, photobomb, space, lifetime, live
Topic (experience): truly, breathtaking, remarkable, beautiful, great, pretty, happy
Topic (midflight): catch, flight path, mid, international, space, wow
Topic (weather and forecast): cloud, moon, shadow, weather, rain, outside, forecast
Topic (exo): exo, totaleclipse, thepowerofexo, planet, message, verexo, que

Table 6.6: Keywords for topics in the “Solar Eclipse 2017” tweet collection

We obtained the best results with the number of topics set to 10. However, we were only able to extract 7 coherent topics from the model; the remaining 3 were repeats of other topics. The coherent topics were about: describing the eclipse, safety, photos and pictures, the experience, experiencing the eclipse midflight, weather and forecast, and a music band called Exo.

6.3 Solar Eclipse 2017 Webpages

6.3.1 Clustering

We run the clustering algorithm on the webpage data for the event “Solar Eclipse 2017”. The optimal number of clusters K is found to be 6, using the same method as mentioned previously. The cluster distribution is shown in Figure 6.6. The cluster names based on frequent word analysis are presented in Table 6.8.

Figure 6.6: Cluster distribution for “Solar Eclipse 2017” webpages

To evaluate the result, a cosine similarity analysis is performed. Cosine similarity is a metric that indicates the similarity of two documents. It is defined as

\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)| \, |\vec{V}(d_2)|}

The intra- and inter-cluster cosine similarities are computed. The intra-cluster cosine similarity is the average cosine similarity over all pairs of documents in one cluster. The inter-cluster cosine similarity is computed in a similar way, but each pair contains documents from two different clusters. The result is shown in Table 6.7, with the intra-cluster cosine similarity on the diagonal. On average, the intra-cluster cosine similarity is about three times the inter-cluster similarity, indicating distinct clusters.
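A minimal sketch of this analysis, with random stand-ins for the bag-of-words matrix and cluster labels of a real collection:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.rand(40, 12)                 # stand-in for bag-of-words vectors
labels = np.random.randint(0, 6, size=40)  # stand-in for k-means cluster labels

sim = cosine_similarity(X)  # pairwise document-document similarities

def avg_cluster_similarity(i, j):
    """Average cosine similarity between documents of clusters i and j."""
    a, b = np.where(labels == i)[0], np.where(labels == j)[0]
    block = sim[np.ix_(a, b)]
    if i == j:  # intra-cluster: exclude each document's self-similarity (the diagonal)
        return (block.sum() - len(a)) / (len(a) * (len(a) - 1))
    return block.mean()

intra, inter = avg_cluster_similarity(0, 0), avg_cluster_similarity(0, 1)
```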

Table 6.7: The cosine similarity analysis of “Solar Eclipse 2017” webpage data

          Cluster 0  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
Cluster 0  0.235892   0.057602   0.095294   0.077461   0.067748   0.087039
Cluster 1  0.057602   0.579325   0.092020   0.083038   0.081589   0.076596
Cluster 2  0.095294   0.092020   0.146127   0.105437   0.095530   0.099726
Cluster 3  0.077461   0.0830     0.1054     0.1769     0.0772     0.0870
Cluster 4  0.067748   0.0816     0.0955     0.0772     0.1116     0.0849
Cluster 5  0.087039   0.0766     0.0997     0.0870     0.0849     0.7148

Table 6.8: Cluster Naming based on frequent words for the “Solar Eclipse 2017” webpage data

Cluster | Name
0 | EclipseChasers
1 | AjcEclipseNews
2 | EclipseScience
3 | BusinessInsiderEclipseArticles
4 | Eclipseville
5 | MuseumEclipse

6.3.2 Topic Analysis

Topic 1    Topic 2      Topic 3
total      map          world
totality   atlanta      nasa
science    carolina     lunar
sun        washington   annular
sky        denver       earth

Table 6.9: Keywords for topics in the collection “Solar Eclipse 2017” webpages

Using 3 as the number of topics, we ran our LDA code to obtain the following topics. The top words for each topic are shown in Table 6.9.

Topic 1: The words total, totality, science, sun, and sky convey that these documents discuss the science behind solar eclipses.

Topic 2: The words map, atlanta, carolina, washington, and denver seem to describe the locations in which the solar eclipse was observed.

Topic 3: The words world, nasa, lunar, annular, and earth are somewhat disjoint from each other. Unfortunately, this is one of the disadvantages of topic modeling: words that do not relate to each other are sometimes force-fit together into an artificial topic.

6.4 Hurricane Irma Webpages

6.4.1 Clustering

The cluster distribution and cosine similarity analysis results for “Hurricane Irma” webpage data are shown in Figure 6.7 and Table 6.10, respectively. The optimal number of clusters K is found to be 6. The names of the clusters obtained are presented in Table 6.11.

Figure 6.7: Cluster distribution for “Hurricane Irma” webpages

Table 6.10: The cosine similarity analysis of “Hurricane Irma” webpage data

           Cluster 0   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
Cluster 0  0.511192    0.037326    0.052042    0.030241    0.046346    0.041547
Cluster 1  0.037326    0.274652    0.120389    0.060207    0.081885    0.086328
Cluster 2  0.052042    0.120389    0.369712    0.088786    0.144062    0.128138
Cluster 3  0.030241    0.060207    0.088786    0.081079    0.067102    0.071955
Cluster 4  0.046346    0.081885    0.144062    0.067102    0.907489    0.119770
Cluster 5  0.041547    0.086328    0.128138    0.071955    0.119770    0.118860

Table 6.11: Cluster naming based on frequent words for the “Hurricane Irma” web data

Cluster   Name
0         HeavyDotComIrmaUpdates
1         GlobalNewsIrmaUpdates
2         ExpressCoUkIrmaUpdates
3         FloridaIrmaUpdates
4         PicturesIrma
5         IrmasPathAndDevastation

6.4.2 Topic Analysis

Figure 6.8: Plots showing number of topics vs. log perplexity and number of topics vs. topic coherence for the collections Solar Eclipse webpages and Hurricane Irma webpages. We attempt to choose the best number of topics based on these two plots.

We were able to obtain 5 topics on the Hurricane Irma dataset. These broadly corresponded to damage and destruction, the Caribbean islands, weather and winds, Florida, and President Trump.

damage        carribbean islands   weather winds   florida   trump
damage        carribbean           weather         florida   trump
destruction   coast                winds           miami     president
surge         islands              mph             orlando   us
water         puerto               tropical        coast     home
flood         rico                 forecast        state     help

Table 6.12: Keywords for topics in the collection “Hurricane Irma” webpages
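The perplexity and coherence sweep behind Figure 6.8 can be approximated with Gensim (a requirement listed in Section 8.2). Note that this sketch uses Gensim's variational LdaModel rather than our collapsed Gibbs sampling setup, so absolute numbers will differ; it is the trend across topic counts that guides the choice. The hyperparameter values mirror the lda.py defaults.

from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

def sweep_num_topics(tokenized_docs, topic_counts=(3, 5, 10, 15)):
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    for k in topic_counts:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       alpha=0.1, eta=0.01, passes=10, random_state=0)
        coherence = CoherenceModel(model=lda, texts=tokenized_docs,
                                   dictionary=dictionary,
                                   coherence='c_v').get_coherence()
        print("%d topics: log perplexity=%.3f coherence=%.3f"
              % (k, lda.log_perplexity(corpus), coherence))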

6.5 Vegas Shooting Webpages

6.5.1 Clustering

The cluster distribution and cosine similarity analysis results for “Vegas Shooting” webpage data are shown in Figure 6.9 and Table 6.13, respectively. The optimal number of clusters K is found to be 6. The names of the clusters obtained are presented in Table 6.14.

Figure 6.9: Cluster distribution for “Vegas Shooting” webpages

Table 6.13: The cosine similarity analysis of “Vegas Shooting” webpage data

           Cluster 0   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
Cluster 0  0.356642    0.115300    0.130751    0.042092    0.031763    0.028054
Cluster 1  0.115300    0.174116    0.138287    0.143103    0.109531    0.107970
Cluster 2  0.130751    0.138287    0.353778    0.154930    0.119397    0.113233
Cluster 3  0.042092    0.143103    0.154930    0.517013    0.036331    0.032462
Cluster 4  0.031763    0.109531    0.119397    0.036331    0.412272    0.024849
Cluster 5  0.028054    0.107970    0.113233    0.032462    0.024849    0.271590

Table 6.14: Cluster naming based on frequent words for the “Vegas Shooting” webpage data

Cluster   Name
0         ReviewJournalLasVegas
1         MandalayBayShooting
2         LasVegasShootingTheGuardian
3         DowntownShooting
4         RealEstateLasVegas
5         LocalNewsAndEntertainmentLasVegas

Chapter 7

User Manual

7.1 HBase schema

The results of clustering and topic modeling on webpages and tweets will be stored in an HBase table. The output from the CTA team will be useful to the SOLR and FE teams for faceted search. In the HBase table we will use column families to store our data; topic and cluster are the two column families that our team is concerned with. As illustrated in Table 7.1, the column family topic will consist of the columns topic-list, probability-list, and keyword-list. The topic-list will contain a list of topic labels obtained from topic modeling on our set of webpages and tweets. The probability-list will contain the corresponding topic probabilities. The keyword-list will contain the top 5 words from each of the top 2 topics in the topic list.

Similarly, the “cluster” column family consists of the columns cluster-list, display-clusternames, and cluster-probability. Here, cluster-list and display-clusternames contain the name of each cluster; display-clusternames is intended for the FE team to use as a facet in their interface. cluster-probability denotes the probability that the document belongs to that cluster. Note that in this case we use hard clustering, so the probability is always 1. An example of clustering results for “Solar Eclipse 2017” tweets stored in HBase is shown in Table 7.2.

Table 7.1: HBase Schema: Fields for Topic Analysis

column family: topic

topic-list                            probability-list   display-clusternames
photos pictures, midflight, eclipse   0.53,0.26,0.12     photos pictures
exo, eclipse, experience              0.78,0.14,0.04     exo
midflight, eclipse, experience        0.40,0.27,0.15     midflight


Table 7.2: HBase Schema: Fields for Clustering

column family: cluster

cluster-list     cluster-probability   display-clusternames
WatchEclipse     1                     WatchEclipse
DiamondRing      1                     DiamondRing
ExoPlanetMusic   1                     ExoPlanetMusic

The schema also contains a field keywords, which is currently present only in the column family webpages. The CTA team proposes that the same field be present for tweets as well, so that tweets can be indexed by their topics and clusters.
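As a sketch of how these fields can be populated with happybase, the library we use for HBase access (Section 8.2): the host, table name, and row key below are placeholders, and values follow happybase's byte-string convention.

import happybase

connection = happybase.Connection('hbase-master.example.com')  # placeholder host
table = connection.table('getar-collection')  # placeholder table name

# write the topic and cluster fields for a single document
table.put(b'doc-00042', {  # placeholder row key
    b'topic:topic-list': b'photos pictures, midflight, eclipse',
    b'topic:probability-list': b'0.53,0.26,0.12',
    b'cluster:cluster-list': b'WatchEclipse',
    b'cluster:cluster-probability': b'1',
    b'cluster:display-clusternames': b'WatchEclipse',
})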

7.2 Topic Analysis

7.2.1 Help File

The entire process of training a topic model can be run through a single Python script, lda.py. To view its help section, use the following command:

python lda.py -h

The following parameters need to be used while running the code:

• COLLECTION_NAME: The name of the data collection on which the LDA code will run.

• FILE: Location of the file containing the dataset.

• TOPICS: Number of topics k to be given to the LDA model as input. To run multiple instances of the model with a different number of topics in each instance, use the format k1,k2,k3.

In addition to these, the user can also pass the following optional arguments while running the code:

• ALPHA: The value of the hyperparameter α in the LDA model. If this flag is not specified, the default value of 0.1 will be used.

• BETA: The value of the hyperparameter β in the LDA model. If this flag is not specified, the default value of 0.01 will be used.

• ITER: The number of Gibbs sampling iterations for the LDA model. The default value of 800 will be used if this is not specified.

• PREPROCESS: Preprocessing options for the dataset if any fine-tuning is required.

• TOKENIZER: The tokenizer to use – valid options are CommaTokenizer, SemicolonTokenizer, SpaceTokenizer, and WordTokenizer.

• MAPPERS: The mappers to use for each of the tokens – valid options are WordnetLemmatizer, PorterStemmer, and LowercaseMapper.

• FILTERS: The filters to use for each of the tokens – valid options are ASCIIFilter, StopwordFilter, LengthFilter, and CollectionFilter.

• FILTER_WORDS: The file path to specific words to filter out for a collection.
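For instance, a run over the Solar Eclipse tweet collection might look like the following. The flag spellings and file name here are illustrative; consult python lda.py -h for the exact names.

python lda.py --collection SolarEclipse --file solar_eclipse_tweets.csv --topics 5,10,15 --tokenizer WordTokenizer --mappers LowercaseMapper,WordnetLemmatizer --filters StopwordFilter,LengthFilter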

The output of this script includes several files, mainly document-topics, topics-keywords, and a visualization.

7.2.2 Computational Complexity

We benchmarked our method on time taken; the results are shown in Figure 7.1. Preprocessing the text, creating a model, and running evaluations takes about one hour for approximately 2.5 million tweets. While we did not implement LDA using Spark, this is better than the Spark implementation from Fall 2016, which averaged about 3 hours on fewer tweets. However, this is only a preliminary comparison, as the datasets used might differ in composition.

Figure 7.1: Computational complexity of running LDA for different collections. The results were benchmarked on a single-node server with 20 cores.

7.3 Clustering

7.3.1 Running Clustering Algorithm

The user can use the Spark framework for clustering and can tune the algorithm by modifying parameters such as the number of clusters and iterations. First, a build.sbt has to be created with the following configuration so that sbt can package the project.

import sbt.Artifact

name := "Kmeans"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  "org.apache.spark" %% "spark-mllib" % "1.5.0"
)

Use the following command to generate a JAR file to run on Apache Spark:

sbt package

To run k-means, run the following command.

spark-submit <application-jar> -k <number-of-clusters>

For large datasets, if there are performance issues, the driver memory and executor memory can be increased by modifying the command as follows.

spark-submit --driver-memory 16g --executor-memory 16g <application-jar> -k <number-of-clusters>
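For reference, a minimal PySpark sketch of the same job under the Spark 1.5 MLlib API is shown below. The input path, feature choice, and k are placeholders; the production code lives in the Scala sources listed in Section 8.4.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="KmeansSketch")

# one cleaned document per line; the path is a placeholder
docs = sc.textFile("hdfs:///path/to/cleaned_docs.txt").map(lambda line: line.split())

# TF-IDF features (one common choice; the Scala sources may differ)
tf = HashingTF().transform(docs)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# train k-means and assign each document to a cluster
model = KMeans.train(tfidf, k=6, maxIterations=20)
assignments = model.predict(tfidf)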

7.3.2 Analysis

We plot the cluster distribution using the Python script in the analysis/distribution_analysis/ directory with the following command. The <input file> should be in the format "(document_ID, cluster_ID)".

python clustering_analysis.py <input file>

The inter- and intra-cluster cosine similarity analysis is performed using the Python scripts in the analysis/similarity_analysis/ directory with the following commands. The <input file> should be in the format "clean_document,cluster_ID".

python cos_sim_inter.py <input file>
python cos_sim_intra.py <input file>

Chapter 8

Developer Manual

8.1 Clustering

This section is intended for developers who want to continue development on the project. The implementation of clustering requires the following tools:

1. Python 2.7.0

2. Apache Spark 1.5.0

3. Scala >=2.10.4

4. SBT 1.0.1

5. Java 1.6.0_31

6. Pyspark >=1.5.2

7. Sklearn >=0.16.0

8. Scipy >=0.14.1

A complete reference manual for Scala and Spark 1.5.0, including installation guidelines, can be found at [3] and [4], respectively. The value of K was chosen based on the Calinski-Harabasz index and after examining the results. Developers can employ a different metric based on the input dataset to arrive at a value for K; this can be directly modified in both the Python and Scala implementations.
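A sketch of this model selection with scikit-learn follows. Newer scikit-learn releases spell the metric calinski_harabasz_score, while older releases use calinski_harabaz_score; the try/except below covers both. The candidate range and feature settings are illustrative.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
try:
    from sklearn.metrics import calinski_harabasz_score
except ImportError:
    # older scikit-learn releases use this spelling
    from sklearn.metrics import calinski_harabaz_score as calinski_harabasz_score

def choose_k(texts, candidates=range(2, 11)):
    # dense TF-IDF matrix; max_features keeps memory manageable
    X = TfidfVectorizer(max_features=5000).fit_transform(texts).toarray()
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
        scores[k] = calinski_harabasz_score(X, labels)
    # higher is better for the Calinski-Harabasz index
    return max(scores, key=scores.get), scores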


8.2 Topic Analysis

Topic Analysis for both documents and tweets is written in Python 2.7.14. The requirements for the code are:

1. Python 2.7.14
2. Gensim 3.1.0
3. nltk 3.2.5
4. numpy 1.13.3
5. tabulate 0.8.1
6. happybase 1.1.0

LDA can also be run on Apache Spark using the MLlib library. This is still under development, since that implementation uses variational inference rather than collapsed Gibbs sampling, which leads to different (and often poorer) results compared to Gibbs sampling.

We use virtualenv to encapsulate our environment for replicability. We strongly suggest that other developers also use virtualenv for each of their projects to manage versions of libraries.
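For example, an environment matching the versions listed above can be recreated as follows:

virtualenv venv
source venv/bin/activate
pip install gensim==3.1.0 nltk==3.2.5 numpy==1.13.3 tabulate==0.8.1 happybase==1.1.0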

We also use the library happybase to access the HBase database dynamically. This allows us to fetch and put rows into the database directly from our code.

8.3 HBase interaction

8.3.1 Clustering

To read data from HBase, run the script HBase_interaction/hbase_read_cluster.py with the following command. Please modify the table name and corresponding column list in the Python code accordingly. An output CSV file will be generated storing the data.

python hbase_read_cluster.py

To write data to HBase, run the script HBase_interaction/hbase_write_cluster.py with the following command. The <input file> should be a CSV file in the format "document_ID, cluster_name, cluster_probability". This will fill in the HBase fields "cluster-name" and "cluster-probability", with "document_ID" as the row key.

python hbase_write_cluster.py <input file>
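A condensed sketch of what the read script does with happybase is shown below. The host and table names are placeholders, and the columns follow the cluster column family described in Section 7.1.

import csv
import happybase

connection = happybase.Connection('hbase-master.example.com')  # placeholder host
table = connection.table('getar-collection')  # placeholder table name

# scan the cluster columns and dump them to a CSV file
with open('clusters.csv', 'wb') as out:  # Python 2 csv convention
    writer = csv.writer(out)
    for row_key, data in table.scan(columns=[b'cluster:cluster-list',
                                             b'cluster:cluster-probability']):
        writer.writerow([row_key,
                         data.get(b'cluster:cluster-list', b''),
                         data.get(b'cluster:cluster-probability', b'')])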

8.4 File Inventory

File                                                   Description
Clustering/kmeans/src/main/scala/kmeans.scala          Code for clustering
Clustering/kmeans/src/main/scala/Preprocess.scala      Preprocessing for clustering
Clustering/kmeans/src/main/scala/cluster.scala         Supporting file for clustering
Clustering/analysis/clustering_analysis.py             Clustering results visualization
Clustering/analysis/cos_sim_inter.py                   Inter-cluster similarity calculation
Clustering/analysis/cos_sim_intra.py                   Intra-cluster similarity calculation
Clustering/HBase_interaction/hbase_read_cluster.py     Reading data from HBase
Clustering/HBase_interaction/hbase_write_cluster.py    Writing clustering results to HBase
Topic_Analysis/tokenizers.py                           Different tokenizers for the data
Topic_Analysis/mappers.py                              Mapping tokens to a different form
Topic_Analysis/filter.py                               Filtering out specific tokens
Topic_Analysis/pipeline.py                             Preprocessing pipeline
Topic_Analysis/hbase.py                                HBase interaction
Topic_Analysis/lda.py                                  Source code for LDA
Topic_Analysis/utils.py                                Miscellaneous functions
Topic_Analysis/readers.py                              Read dataset from file

Table 8.1: File Inventory

Chapter 9

Future Work and Enhancements

9.1 Clustering

In clustering, we used only hard clustering algorithms; in the future, we would like to use soft clustering algorithms as well. This might help in understanding the community structure of the documents in cases where a document could be part of more than one cluster. We did perform hierarchical clustering on some tweets and documents, but would like to compare and contrast those results with those of k-means. In the future, we would also like to improve the frequent word analysis on the resulting clusters to include techniques from community detection.

9.2 Topic Analysis

Topic analysis focused on two models – LDA and Twitter-LDA. Twitter-LDA assumes that each user has a distribution over topics. Since the data collected is event-driven rather than user-driven, we realized that Twitter-LDA is not a good fit, as the data violates some of the assumptions of the Twitter-LDA model. In the LDA model, however, we noticed some room for improvement in our process:

1. Automatic elimination of collection-specific words: While we currently maintain lists of collection-specific words to discard in the preprocessing step, it would be better to do this automatically by understanding the word distribution in a collection.

2. Crowd-sourcing annotations for topic names: In the current approach, we follow either automatic or manual naming of clusters. Automatic naming is fraught with challenges, as naming depends on the order of words and does not capture adequate semantic meaning. Manual naming can get equally difficult when we have several collections and


topics. Moreover, neither approach helps us understand how coherent a topic really is. Therefore, we feel that topic naming is an ideal candidate for crowd-sourcing.

3. Joint model for tweets and webpages: Modeling tweets and webpages separately has its advantages; however, a joint model may better capture the latent themes in the combined collection. We found that the topics in tweets were generally different from those in webpages. This makes correlating the two collections difficult, which may be alleviated by using a joint model.

Acknowledgments

This work was a part of the Global Event and Trend Archive Research (GETAR) and Integrated Digital Event Archiving and Library (IDEAL) projects, supported by the National Science Foundation grants IIS-1619028 and IIS-1319578, respectively. We would also like to thank Dr. Edward A. Fox (Instructor) and Liuqing Li (GTA) from the course CS 5604 at Virginia Tech for their support towards the completion of this work.

Bibliography

[1] NLTK Tweet Tokenizer. http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.casual.TweetTokenizer. Accessed: 2017-11-08.

[2] NLTK Word Tokenize. http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize. Accessed: 2017-11-08.

[3] Scala Tutorial. http://www.scala-lang.org/. Accessed: 2017-11-08.

[4] Spark Tutorial. https://spark.apache.org/docs/1.5.0/. Accessed: 2017-11-08.

[5] Aman Ahuja, Wei Wei, and Kathleen M. Carley. Topic modeling in large scale social network data. Technical Report CMU-ISR-15-108, School of Computer Science, Carnegie Mellon University, 2015.

[6] Aman Ahuja, Wei Wei, and Kathleen M. Carley. Microblog sentiment topic model. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 1031–1038. IEEE, 2016.

[7] James C Bezdek, Robert Ehrlich, and William Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.

[8] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 601–608, 2002.

[9] Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems, 2009.

[10] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977.

[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.


[12] Adil Fahad, Najlaa Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y Zomaya, Sebti Foufou, and Abdelaziz Bouras. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3):267–279, 2014.

[13] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. In ACM SIGMOD Record, volume 27, pages 73–84. ACM, 1998.

[14] Alexander Hinneburg and Daniel A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, pages 506–517, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[15] Alexander Hinneburg, Daniel A Keim, et al. An efficient approach to clustering in large multimedia databases with noise. In KDD, volume 98, pages 58–65, 1998.

[16] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.

[17] Saurav Kaushik. An introduction to clustering. https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/, 2016 (accessed October 7, 2017).

[18] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

[19] Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 889–892. ACM, 2013.

[20] Jared Suttles. tweetokenize. https://github.com/jaredks/tweetokenize. Accessed: 2017-11-08.

[21] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, volume 25, pages 103–114. ACM, 1996.

[22] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. Comparing Twitter and traditional media using topic models. In European Conference on Information Retrieval, pages 338–349. Springer, 2011.