Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics Too!


Suzanna Sia, Ayush Dalmia, Sabrina J. Mielke
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
[email protected], [email protected], [email protected]

arXiv:2004.14914v2 [cs.CL] 6 Oct 2020

Abstract

Topic models are a useful analysis tool to uncover the underlying themes within document collections. The dominant approach is to use probabilistic topic models that posit a generative story, but in this paper we propose an alternative way to obtain topics: clustering pretrained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.

1 Introduction

Topic models are the standard approach for exploratory document analysis (Boyd-Graber et al., 2017), which aims to uncover main themes and underlying narratives within a corpus. But in times of distributed and even contextualized embeddings, are they the only option?

This work explores an alternative to topic modeling by casting 'key themes' or 'topics' as clusters of word types under the modern distributed representation learning paradigm: unsupervised pre-trained word embeddings provide a representation for each word type as a vector, allowing us to cluster them based on their distance in high-dimensional space. The goal of this work is not to strictly outperform, but rather to benchmark standard clustering of modern embedding methods against the classical approach of Latent Dirichlet Allocation (LDA; Blei et al., 2003).

We restrict our study to influential embedding methods and focus on centroid-based clustering algorithms as they provide a natural way to obtain the top words in each cluster based on distance from the cluster center.1

(Footnote 1: We found that using non-centroid-based hierarchical or density-based clustering algorithms like DBScan resulted in worse performance and more hyperparameters to tune.)

Aside from reporting the best performing combination of word embeddings and clustering algorithm, we are also interested in whether there are consistent patterns: embeddings which perform consistently well across clustering algorithms might be good representations for unsupervised document analysis, and clustering algorithms that perform consistently well are more likely to generalize to future word embedding methods.

To make our approach reliably work as well as LDA, we incorporate corpus frequency statistics directly into the clustering algorithm, and quantify the effects of two key methods: 1) weighting terms during clustering and 2) reranking terms for obtaining the top J representative words. Our contributions are as follows:

• We systematically apply centroid-based clustering algorithms on top of a variety of pre-trained word embeddings and embedding methods for document analysis.

• Through weighted clustering and reranking of top words we obtain sensible topics; the best performing combination is comparable with LDA, but with smaller time complexity and empirical runtime.

• We show that further speedups are possible by reducing the embedding dimensions by up to 80% using PCA.

2 Related Work and Background

Analyzing documents by clustering word embeddings is a natural idea—clustering has been used for readability assessment (Cha et al., 2017), argument mining (Reimers et al., 2019), document classification and document clustering (Sano et al., 2017), inter alia. So far, however, clustering word embeddings has not seen much success for the purposes of topic modeling. While many modern efforts have attempted to incorporate word embeddings into the probabilistic LDA framework (Liu et al., 2015; Nguyen et al., 2015; Das et al., 2015; Zhao et al., 2017; Batmanghelich et al., 2016; Xun et al., 2017; Dieng et al., 2019), relatively little work has examined the feasibility of clustering embeddings directly.

Xie and Xing (2013) and Viegas et al. (2019) first cluster documents and subsequently find words within each cluster for document analysis. Sridhar (2015) targets short texts where LDA performs poorly in particular, fitting GMMs to learned word2vec representations. De Miranda et al. (2019) cluster using self-organising maps, but provide only qualitative results.

In contrast, our proposed approach is straightforward to implement, feasible for regular-length documents, requires no retraining of embeddings, and yields qualitatively and quantitatively convincing results. We focus on centroid-based k-means (KM), spherical k-means (SK), and k-medoids (KD) for hard clustering, and von Mises-Fisher models (VMFM) and Gaussian mixture models (GMM) for soft clustering; as pre-trained embeddings we consider word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2017), Spherical (Meng et al., 2019), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2018).

3 Methodology

After preprocessing and extracting the vocabulary from our training documents, each word type is converted to its embedding representation (averaging all of its tokens for contextualized embeddings; details in §5.3). Following this we apply the various clustering algorithms on the entire training corpus vocabulary to obtain k clusters, using weighted (§3.2) or unweighted word types. After the clustering algorithm has converged, we obtain the top J words (§3.1) from each cluster for evaluation. Note that one potential shortcoming of our approach is the possibility of outliers forming their own cluster, which we leave to future work.

Crucial to our method is the incorporation of corpus statistics on top of vanilla clustering algorithms, which we describe in the remainder of this section.
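As a concrete illustration of this pipeline, the sketch below embeds a toy training vocabulary with pretrained static vectors and clusters the word types with k-means. It is a minimal sketch under stated assumptions, not the authors' released implementation: the toy `docs` corpus, the small `glove-wiki-gigaword-50` vectors from gensim, and the cluster count `k` are illustrative choices.

```python
# Minimal sketch of the Section 3 pipeline (illustrative, not the paper's code).
# Assumes gensim (with gensim-data access) and scikit-learn are installed.
import gensim.downloader as api
import numpy as np
from sklearn.cluster import KMeans

# Toy tokenized, preprocessed corpus; in practice this would be e.g. 20NG.
docs = [["nasa", "launched", "the", "orbiter", "into", "orbit"],
        ["the", "team", "scored", "late", "to", "win", "the", "game"],
        ["the", "senate", "passed", "the", "budget", "bill"]]

vectors = api.load("glove-wiki-gigaword-50")      # pretrained static embeddings
vocab = sorted({w for doc in docs for w in doc if w in vectors})
X = np.stack([vectors[w] for w in vocab])         # one embedding per word *type*

k = 3                                             # number of topics (clusters)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
```

For contextualized embeddings (ELMo, BERT), the paper instead averages all token occurrences of a word type (§5.3); the same matrix X would then be built from those averaged vectors.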
3.1 Obtaining top-J words

In traditional topic modeling (LDA), the top J words are those with highest probability under each topic-word distribution. For centroid-based clustering algorithms, the top words of some cluster i are naturally those closest to the cluster center c^{(i)}, or with highest probability under the cluster parameters. Formally, this means choosing the set of types J as

$$\operatorname*{argmin}_{J:\,|J|=10}\ \sum_{j \in J} \begin{cases} \lVert c^{(i)} - x_j \rVert_2^2 & \text{for KM/KD,} \\ -\cos\!\big(c^{(i)}, x_j\big) & \text{for SK,} \\ -f\!\big(x_j \mid c^{(i)}, \Sigma_i\big) & \text{for GMM/VMFM.} \end{cases}$$

Our results in §6 focus on KM and GMM, as we observe that k-medoids, spherical KM and von Mises-Fisher tend to perform worse than KM and GMM (see App. A and App. B).

Note that it is possible to extend this approach to obtain the top topics given a document: compute similarity scores between learned topic cluster centers and all word embeddings from that particular document, and normalize them using softmax to obtain a (non-calibrated) probability distribution.
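Continuing the sketch above, the KM/KD case of this selection rule amounts to taking the J word types with smallest squared distance to each center, and the softmax construction from the previous paragraph gives a per-document topic distribution. This is only an illustrative sketch that reuses the hypothetical `docs`, `vocab`, `X`, `vectors`, and `km` objects from the earlier block.

```python
import numpy as np
from scipy.special import softmax

J = 10
centers = km.cluster_centers_                                        # shape (k, dim)

# Top-J words per cluster: smallest squared Euclidean distance to the center
# (the KM/KD case of the selection rule above).
dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)    # (|vocab|, k)
topics = [[vocab[j] for j in np.argsort(dists[:, i])[:J]]
          for i in range(centers.shape[0])]

# Topic distribution for one document (cf. the last paragraph of Section 3.1):
# softmax over the mean cosine similarity between the document's word
# embeddings and each cluster center.
def doc_topic_dist(doc_tokens):
    emb = np.stack([vectors[w] for w in doc_tokens if w in vectors])
    sims = emb @ centers.T / (np.linalg.norm(emb, axis=1, keepdims=True)
                              * np.linalg.norm(centers, axis=1))
    return softmax(sims.mean(axis=0))                # non-calibrated distribution

print(topics[0], doc_topic_dist(docs[0]))
```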
3.2 Weighting while clustering

The intuition of weighted clustering is based on the formulation of classical LDA, which models the probability of word type t belonging to topic i as $\frac{N_{t,i} + \beta_t}{\sum_{t'} N_{t',i} + \beta_{t'}}$, where N_{t,i} refers to the number of times word type t has been assigned to topic i, and β is a parameter of the Dirichlet prior on the per-topic word distribution. In our case, illustrated by the schematic in Fig. 1, weighting is a natural way to account for the frequency effects of vocabulary terms during clustering.

[Figure 1: The figure on the left shows the cluster center without weighting, while the figure on the right shows that after weighting (larger points have higher weight) a hopefully more representative cluster center is found. Note that top words based on distance from the cluster center could still very well be low-frequency word types, motivating reranking (§3.3).]

3.3 Reranking when obtaining topics

When obtaining the top-J words that make up a cluster's topic, we also consider reranking terms, as there is no guarantee that words closest to cluster centers are important word types. We will show in Table 2 that without reranking, clustering yields "sensible" topics but low NPMI scores.

3.4 Which corpus statistics?

To incorporate corpus statistics into the clustering algorithm, we examine three different schemes to assign weights to word types, where n_t is the count of word type t in corpus D, and d is a document:

$$\mathrm{tf} = \frac{n_t}{\sum_{t'} n_{t'}} \quad (1)$$
$$\text{tf-df} = \mathrm{tf} \cdot \frac{|\{d \in D : t \in d\}|}{|D|} \quad (2)$$
$$\text{tf-idf} = \mathrm{tf} \cdot \log \frac{|D|}{|\{d \in D : t \in d\}| + 1} \quad (3)$$

These scores can now be used for weighting word types when clustering (w), for reranking the top 100 words afterwards (r), for both (wr), or for neither.

[...] O(tkN), where N is the number of all tokens, so when N ≫ n, clustering methods can potentially achieve better performance-complexity tradeoffs. Note that running ELMo and BERT over documents also requires iterating over all tokens, but only once, and not for every topic and iteration.

4.1 Cost of obtaining Embeddings

For readily available pretrained word embeddings such as word2vec, FastText, GloVe and Spherical, the embeddings can be considered as 'given', as the practitioner does not need to generate these embeddings from scratch. However, for contextual embeddings such as ELMo and BERT, there is additional computational cost in obtaining these embeddings before clustering, which requires passing through RNN and transformer layers respectively. This can be trivially parallelised by batching the context window (usually a sentence). We use standard pretrained ELMo and BERT models in our experiments and therefore do not consider the runtime of training these models from scratch.

5 Experimental Setup

Our implementation is freely available online.

5.1 Datasets

We use the 20 newsgroup dataset (20NG), which contains around 18000 documents and 20 categories, and a subset of Reuters21578, which con[...]
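Putting §3.2-§3.4 together, the sketch below computes the three weighting schemes of Eqs. (1)-(3), passes one of them to k-means as per-sample weights (scikit-learn's KMeans.fit accepts a sample_weight argument), and reranks the 100 words nearest each center by corpus weight to obtain the final top J. As before, this is an illustrative sketch reusing the hypothetical docs, vocab, X, and k from the earlier blocks; which scheme is used for weighting and which for reranking is a hyperparameter choice, and the specific picks here (tf for weighting, tf-idf for reranking) are assumptions.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Corpus statistics per word type (Eqs. 1-3).
counts = Counter(w for doc in docs for w in doc)          # n_t
df = Counter(w for doc in docs for w in set(doc))         # |{d in D : t in d}|
total = sum(counts.values())

tf = np.array([counts[w] / total for w in vocab])                          # Eq. (1)
tf_df = tf * np.array([df[w] / len(docs) for w in vocab])                  # Eq. (2)
tf_idf = tf * np.log(len(docs) / (np.array([df[w] for w in vocab]) + 1))   # Eq. (3)

# Weighting while clustering (Section 3.2): frequent word types pull the
# cluster centers towards them via per-sample weights.
km_w = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X, sample_weight=tf)

# Reranking (Section 3.3): take the 100 words closest to each center, then
# keep the J of them with the largest corpus weight (tf-idf here).
J = 10
dists = ((X[:, None, :] - km_w.cluster_centers_[None, :, :]) ** 2).sum(axis=-1)
topics = []
for i in range(k):
    nearest = np.argsort(dists[:, i])[:100]
    reranked = sorted(nearest, key=lambda j: -tf_idf[j])[:J]
    topics.append([vocab[j] for j in reranked])
```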