<<

Topic Modeling with Contextualized Word Representation Clusters

Laure Thompson David Mimno University of Massachusetts Amherst Cornell University [email protected] [email protected]

Abstract Topic modeling is often associated with prob- abilistic generative models in the machine learn- Clustering token-level contextualized word ing literature, but from the perspective of most representations produces output that shares actual applications the core benefit of such models many similarities with topic models for En- glish text collections. Unlike clusterings of is that they provide an interpretable latent space vocabulary-level word embeddings, the result- that is grounded in the text of a specific collec- ing models more naturally capture polysemy tion. Standard topic modeling algorithms operate and can be used as a way of organizing docu- by estimating the assignment of individual tokens ments. We evaluate token clusterings trained to topics, either through a Gibbs sampling state from several different output layers of popular or through parameters of variational distributions. contextualized language models. We find that These token-level assignments can then provide BERT and GPT-2 produce high quality clus- terings, but RoBERTa does not. These cluster disambiguation of tokens based on context, a broad models are simple, reliable, and can perform overview of the themes of a corpus, and visual- as well as, if not better than, LDA topic mod- izations of the location of those themes within the els, maintaining high topic quality even when corpus (Boyd-Graber et al., 2017). the number of topics is large relative to the size A related but distinct objective is vocabulary of the local collection. clustering. These methods operate at the level of distinct word types, but have no inherent connec- 1 Introduction tion to words in context (e.g. Brown et al., 1992; Contextualized word representations such as those Arora et al., 2013; Lancichinetti et al., 2015). Re- produced by BERT (Devlin et al., 2019) have revo- cently, there has also been considerable interest in lutionized natural language processing for a num- continuous type-level embeddings such as GloVe ber of structured prediction problems. Recent work (Pennington et al., 2014) and word2vec (Mikolov has shown that these contextualized representations et al., 2013a,b), which can be clustered to form can support type-level semantic clusters (Sia et al., interpretable semantic groups. Although it has not 2020). In this work we show that token-level clus- been widely used, the original word2vec distribu- tering provides contextualized semantic informa- tion includes code for k-means clustering of vec- tion equivalent to that recovered by statistical topic tors. Sia et al.(2020) extends this behavior to arXiv:2010.12626v1 [cs.CL] 23 Oct 2020 models (Blei et al., 2003). From the perspective contextualized embeddings, but does not take ad- of contextualized word representations, this result vantage of the contextual, token-based nature of suggests new directions for semantic analysis us- such embeddings. ing both existing models and new architectures In this work, we demonstrate a new property of more specifically suited for such analysis. From contextualized word representations: if you run the perspective of topic modeling, this result im- a simple k-means algorithm on token-level em- plies that transfer learning through contextualized beddings, the resulting word clusters share similar word representations can fill gaps in probabilistic properties to the output of an LDA model. Tra- modeling (especially for short documents and small ditional topic modeling can be viewed as token collections) but also suggests new approaches for clustering. Indeed, a clustering of tokens based latent semantic analysis that are more closely tied on BERT vectors is functionally indistinguishable to mainstream transformer architectures. from a Gibbs sampling state for LDA, which as- signs each token to exactly one topic. For topic sentence-level embeddings have been shown to pro- modeling, clustering is based on local context (the duce semantically related document clusters (Aha- current topic disposition of words in the same docu- roni and Goldberg, 2020). But such models cannot ment) and on global information (the current topic represent topic mixtures or provide an interpretable disposition of other words of the same type). We word-based representation without additional map- find that contextualized representations offer simi- ping from clusters to documents to words. It is lar local and global information, but at a richer and widely known that token-level representations of more representationally powerful level. single word types provide contextual disambigua- We argue that pretrained contextualized embed- tion. For example, Coenen et al.(2019) show an dings provide a simple, reliable method for users example distinguishing uses of die between the Ger- to build fine-grained, semantically rich represen- man article, a verb for “perish” and a game piece. tations of text collections, even with limited local We explore this property on the level of whole col- training data. While for this study we restrict our lections, looking at all word types simultaneously. attention to English text, we see no reason contex- There are a number of models that solve the topic tualized models trained on non-English data (e.g. model objective directly using contemporary neu- Martin et al., 2020; Nguyen and Nguyen, 2020) ral network methods (e.g. Srivastava and Sutton, would not have the same properties. It is impor- 2016; Miao et al., 2017; Dieng et al., 2020). There tant to note, however, that we make no claim that are also a number of neural models that incorporate clustering contextualized word representations is topic models to improve performance on a vari- the optimal approach in all or even many situations. ety of tasks (e.g. Chen et al., 2016; Narayan et al., Rather, our goal is to demonstrate the capabilities 2018; Wang et al., 2018; Peinelt et al., 2020). Ad- of contextualized embeddings for token-level se- ditionally, BERT has been used for word sense dis- mantic clustering and to offer an additional useful ambiguation (Wiedemann et al., 2019). In contrast, application in cases where models like BERT are our goal is not to create hybrid or special-purpose already in use. models but to show that simple contextualized em- bedding clusters support token-level topic analysis 2 Related Work in themselves with no significant additional mod- eling. Since our goal is simply to demonstrate this We selected three contextualized language models property and not to declare overall “winners”, we based on their general performance and ease of focus on LDA in empirical comparisons because it accessibility to practitioners: BERT (Devlin et al., is the most widely used and straightforward, high- 2019), GPT-2 (Radford et al., 2019), and RoBERTa lighting the similarities and differences between (Liu et al., 2019). All three use similar Transformer contextualized embedding clusters and topics. (Vaswani et al., 2017) based architectures, but their objective functions vary in significant ways. These 3 Data and Methods models are known to encode substantial informa- tion about lexical semantics (Petroni et al., 2019; We use three real-world corpora of varying size, Vulic´ et al., 2020). content, and document length: Wikipedia articles Clustering of vocabulary-level embeddings has (WIKIPEDIA), Supreme Court of the been shown to produce semantically related word legal opinions (SCOTUS), and Amazon product re- clusters (Sia et al., 2020). But such embeddings views (REVIEWS). We select WIKIPEDIA for its cannot easily account for polysemy or take advan- affinity with the training data of the pretrained mod- tage of local context to disambiguate word senses els. Because its texts are similar to ones the mod- since each word type is modeled as a single vec- els have already seen, WIKIPEDIA is a “best-case” tor. Since these embeddings are not grounded in scenario for our clustering algorithms. If a cluster- specific documents, we cannot directly use them to ing method performs poorly on WIKIPEDIA, we track the presence of thematic clusters in a particu- expect the method to perform poorly in general. lar collection. In addition, Sia et al.(2020) find that In contrast, we select SCOTUS and REVIEWS for reweighting their type-level clustering by corpus their content variability. Legal opinions tend to be frequencies is helpful. In contrast, such frequen- long and contain many technical legal terms, while cies are “automatically” accounted for when we user-generated product reviews tend to be short and operate on the token level. Similarly, clusterings of highly variable in content and vocabulary. Term Model Top Words sea coast Beach Point coastal land Long Bay m sand beach tide Norfolk shore Ocean Coast areas LDA land acres County ha facilities State location property acre cost lot site parking settlers Department arrived arrival landing landed arriving arrive returning settled departed land leaving sailed arrives land BERT land property rights estate acres lands territory estates properties farm farmland Land fields acre arrived landed arriving landfall arrive arrives arrival landing land departed ashore embarked Back GPT-2 land sea ice forest rock mountain ground sand surface beach ocean soil hill lake snow sediment metals metal potassium sodium + lithium compounds electron ions hydrogen chemical atomic – ion LDA metal folk bands music genre band debut Metal heavy musicians lyrics instruments acts groups metals elements metal electron element atomic periodic electrons chemical atoms ions atom metal BERT rock dance pop metal Rock folk punk comedy Dance heavy funk alternative soul street club rock pop hop dance metal folk hip B soul funk alternative rap heavy disco electronic GPT-2 plutonium hydrogen carbon sodium potassium metal lithium uranium oxygen diamond radioactive

Table 1: Automatically selected examples of polysemy in contextualized embedding clusters. Clusters containing “land” or “metal” as top words from BERT L[−1], GPT-2 L[−2], and LDA with K = 500. All models capture multiple senses of the noun “metal”, but BERT and GPT-2 are better than LDA at capturing the syntactic variation of “land” as a verb and noun.

Corpus Docs Types Tokens and surrogate Unicode codepoints since they are in- WIKIPEDIA 1.0K 22K 1.2M SCOTUS 5.3K 58K 10.8M ternally removed by BERT’s tokenizer. We extract REVIEWS 100K 52K 9.4M contextualized word representations from BERT (cased version), GPT-2, and RoBERTa using pre- Table 2: Corpus statistics for number of documents, trained models available through the huggingface types, and tokens. Document and type counts are listed transformers in thousands (K), token counts in millions (M). library (Wolf et al., 2019). All methods break low-frequency words into multiple subword tokens: BERT uses WordPiece (Wu et al., WIKIPEDIA. In this collection, documents are 2016), while GPT-2 and RoBERTa use a byte-level Wikipedia articles (excluding headings). We ran- variant of byte pair encoding (BPE) (Sennrich et al., domly selected 1,000 articles extracted from the 2016). For example, the word disillusioned is rep- raw/character-level training split of Wikitext-103 resented by four subtokens “di -si -llus -ioned” in (Merity et al., 2017). We largely use the existing to- BERT and by two subtokens “disillusion -ed” in kenization, but recombine internal splits on dot and GPT-2 and RoBERTa. One key difference between comma characters but not hyphens so that “Amazon these tokenizers is that byte-level BPE can encode @.@ com” becomes “Amazon.com”, “1 @,@ 000” all inputs, while WordPiece replaces all Unicode becomes “1,000”, and “best @-@ selling” becomes codepoints it has not seen in pretraining with the “best - selling”. special token UNK. For simplicity, rather than us- SCOTUS. In this collection, documents are legal ing a sentence splitter we divide documents into opinions from the Supreme Court of the United the maximum length subtoken blocks. To make vo- States filed from 1980 through 2019.1 These docu- cabularies comparable across models with different ments can be very long, but have a regular structure. subword tokenization schemes, we reconstitute the original word tokens by averaging the vectors for REVIEWS. In this collection, documents are subword units (Bommasani et al., 2020). Amazon product reviews. For four product cat- egories (Books, Electronics, Movies and TV, CDs and Vinyl), we select 25,000 reviews from category- Clustering. We cluster tokens using spherical k- level dense subsets of Amazon product reviews (He means (Dhillon and Modha, 2001) with spkm++ and McAuley, 2016; McAuley et al., 2015). initialization (Endo and Miyamoto, 2015) because of its simplicity and high-performance, and cosine Data Preparation. For SCOTUS and REVIEWS, similarities are commonly used in other embed- we tokenize documents using the spaCy NLP ding contexts. Although we extract contextualized toolkit.2 Tokens are case-sensitive non-whitespace features for all tokens, prior to clustering we re- character sequences. For consistency across mod- move frequent words occurring in more than 25% els, we also delete all control, format, private-use, of documents and rare words occurring in fewer 1https://www.courtlistener.com/ than five documents. Each clustering is run for 2https://spacy.io/ 1000 iterations or until convergence. For LDA, we train models using Mallet (McCallum, 2002) coherence metric, P P log D(wi,wj )+ , where i j<1 D(wi) with hyperparameter optimization occurring every D refers to the number of documents that contain 20 intervals after the first 50. For each embed- a word or word-pair. For external coherence we ding model, we cluster the token vectors extracted use Newman et al.(2010)’s topic coherence met- from the final layer L[−1], the penultimate layer ric: P P log Pr(wi,wj )+ , where probabilities i j<1 Pr(wi) Pr(wj ) L[−2], and the antepenultimate layer L[−3]. Vulic´ are estimated from the number of 25-word sliding et al.(2020) suggest combining multiple layers, windows that contain a word or word-pair in an but no combination we tried provided additional external corpus. We use the New York Times An- benefit for this specific task. We consider more notated Corpus (Sandhaus, 2008) as our external than the final hidden layer of each model because collection with documents corresponding to articles of the variability in anisotropy across layers (Etha- (headline, article text, and corrected text) tokenized yarajh, 2019). In a space where any two words with spaCy. For both metrics, we use the top 20 have near perfect cosine similarity, clustering will words of each topic and set the smoothing factor  only capture the general word distribution of the to 10−12 to reduce penalties for non-cooccurring corpus. Since Ethayarajh(2019) has shown GPT- words (Stevens et al., 2012). We ignore words that 2’s final layer to be extremely anisotropic, we do do not appear in the external corpus and do not con- not expect to produce viable topics in this case. For sider topics that have fewer than 10 attested words. each test case, we build ten models each of size These “skipped” topics are often an indicator of K ∈ {50, 100, 500}. model failure. Higher scores are better.

4 Evaluation Metrics Exclusivity. A topic model can attain high coher- ence by repeating a single high-quality topic multi- We evaluate the quality of “topics” produced by ple times. To balance this effect, we measure topic clustering contextualized word representations with diversity using Bischof and Airoldi(2012)’s word- several quantitative measures. For all models we level exclusivity metric to quantify how exclusive use hard topic assignments, so each word token has a word w is to a specific topic z: Pr(wi|z) .A P Pr(w |z0) a word type wi and topic assignment z. Note that z0 i we use “topic” and “cluster” interchangeably. word prevalent in many topics will have a low ex- clusivity score near 0, while a word occurring in Word Entropy. As a proxy for topic specificity, few topics will have a score near 1. We lift this mea- we measure a topic’s word diversity using the sure to topics by computing the average exclusivity conditional entropy of word types given a topic: of each topic’s top 20 words. While higher scores P − i Pr(wi|z) log Pr(wi|z). Topics composed of are not inherently better, low scores are indicative tokens from a small set of types will have low of topics with high levels of overlap. entropy (minimum 0), while topics more evenly spread out across the whole vocabulary will have 5 Results high entropy (maximum log of vocabulary size; approx. 10 for WIKIPEDIA). There is no best fit be- We evaluate whether contextualized word represen- tween quality and specificity, but extreme entropy tation clusters can group together related words, scores indicate bad topics. Topics with extremely distinguishing distinct uses of the same word based low entropy are overly specialized, while those on local context. Compared to bag-of-words LDA, with extremely high entropy are overly general. we expect contextualized embedding clusters to en- code more syntactic information. As we are not Coherence. We measure the semantic quality of doing any kind of fine-tuning, we expect perfor- a topic using two word-cooccurrence-based coher- mance to be best on text similar to the pretraining ence metrics. These coherence metrics measure data. We also expect contextualized embedding whether a topic’s words actually occur together. In- clusters to be useful in describing differences be- ternal coherence uses word cooccurrences from the tween partitions of a working collection. working collection, while external coherence relies on word cooccurrences from a held-out external BERT produces meaningful topic models. collection. The former measures fit to a dataset, BERT cluster models consistently form semanti- while the latter measures generalization. For inter- cally meaningful topics, with the final layer per- nal coherence we use Mimno et al.(2011)’s topic forming marginally better for larger K. Figure1 Figure 2: Distinct words per cluster for LDA, BERT L[−1], GPT-2 L[−2], and RoBERTa L[−1] for K = 500. Although BERT clusters cover fewer word types on average, RoBERTa produces more clusters with very few (< 20) word types.

other (low exclusivity). They also highlight the dif- ferences between our two coherence scores. Since these clusters tend to repeatedly echo the back- ground distribution of WIKIPEDIA, they perform relatively well for internal coherence, but poorly for external coherence. Since the final layer of GPT-2 has such high anisotropy, we cannot expect vector Figure 1: Contextualized embedding clusters produce directionalities to encode semantically meaningful mean internal and external coherence scores compara- ble to LDA (dashed line). BERT clusters (blue) have information. In contrast, the penultimate and ante- high mean external coherence, better than LDA for penultimate layer clusterings perform much better. large numbers of topics. BERT clusters contain more We see a large improvement in external coherence unique words, while RoBERTa (red) and GPT-2 (green) surpassing LDA for K = 500. Topic word entropy L[−1] clusters tend to repeat similar clusters. BERT and exclusivity are also improved. clusters have the highest word concentrations. For K = 500, GPT-2 L[-3] has surprisingly low mean internal coherence—the worst scores in Fig- shows that BERT clusterings have the highest exter- ure1 by a significant margin. The number of topics nal coherence, matching LDA for K ∈ {50, 100} below the 10-valid-words threshold is similar to and beating LDA for K = 500. For internal co- BERT, so this result is comparable. We posit that herence, the opposite is true, with BERT on par this layer is relying more on transferred knowledge with LDA for smaller K, while LDA “fits” better from the pretrained GPT-2 model than the working for K = 500. This distinction suggests that at collection. Because of this less explained behavior, very fine-grained topics, LDA may be overfitting we will only consider GPT-2 L[-2] going forward. to noise. BERT has relatively low word entropy, in- RoBERTa clusters are noticeably worse. dicating more focused topics on average. Figure2 Given BERT’s success and GPT-2’s partial shows the number of word types per cluster. BERT success, we were surprised to find that RoBERTa clusters are on average smaller than LDA topics cluster models were consistently of poor quality, (counted from an unsmoothed sampling state), but with very low exclusivity scores and high word very few BERT clusters fall below our 10-valid- entropies. Although RoBERTa scores fairly well words threshold for coherence scoring. BERT clus- in coherence, this is not indicative of collectively ters are not only semantically meaningful, but also high quality topics because of the correspondingly unique. Figure1 shows that BERT clusters have low exclusivity scores. As shown in Figure1, exclusivity scores as high if not higher than LDA RoBERTa has the highest average number of topics on average. Since there is little difference distinct words per cluster, but also large numbers between layers, we will only consider BERT L[-1] of clusters that contain very few distinct words. for the remainder of this work. For K = 500, 25–50 clusters are skipped on GPT-2 can produce meaningful topic models. average for different layer choices. For example, As expected, the final layer clusterings of GPT-2 one topic consists entirely of the words game, form bad topics. These clusters tend to be homo- games, Game, another just ago, and one simply geneous (low word entropy) and similar to each the symbol. The remaining tokens are thus limited to a smaller number of more general topics that are words of a cluster and calculate the entropy of that closer to the corpus distribution. distribution. If all 20 words are the same POS, this While it is commonly accepted that RoBERTa value will be 0, while if POS tags are more diffuse outperforms BERT for a variety of natural language it will be larger. We find that BERT and GPT-2 clus- understanding tasks (Wang et al., 2019), we find the ters have consistently lower entropy. In Table3, we opposite to be true for semantic clustering. There see that the 25th percentile for LDA topics has en- are a number of differences between BERT and tropy 0.97, higher than the median entropy for both RoBERTa, but our experimental results do not mark BERT and GPT-2. We find that these results are a clear cause. The tokenization method is a very consistent across model sizes. Although contextu- unlikely source since GPT-2 uses the same scheme. alized embedding clusters are more homogeneous in POS, LDA may appear more homogeneous be- Contextualized embedding clusters capture pol- cause it is dominated by nouns. For LDA, nouns ysemy. A limitation of many methods that rely and proper nouns account for 43.7% and 33.4% on vocabulary-level embeddings is that they can- respectively of all the words in the top 20 for all not explicitly account for polysemy and contextual topics, while verbs make up 8.5% and adjectives differences in meaning. In contrast, token-based 6.7%. These proportions are 39.0%, 25.3%, 14.9%, topic models are able to distinguish differences in and 8.1% for BERT, and 37.0%, 23.4%, 16.9%, meaning between contexts. There has already been and 9.5% for GPT-2. evidence that token-level contextualized embed- dings are able to distinguish contextual meanings Compression improves efficiency while main- for specific examples (Wiedemann et al., 2019; Co- taining quality. We have established that we can enen et al., 2019), but can they also do this for effectively learn topic models by clustering token- entire collections? level contextualized embeddings, and we have shown that there are advantages to clustering at the Instead of manually selecting terms we expect token rather than vocabulary level. But for token- to be polysemous, we choose terms that occur as level clustering to be more than a curiosity we need top words for clusters with dissimilar word distribu- to address computational complexity. Vocabularies tions (high Jensen-Shannon divergence). While dis- are typically on the order of tens of thousands of similarity is not indicative of polysemy—different words, while collections often contain millions of topics can use a term in the same way—it narrows words. Even storing full 768-dimensional vectors our focus to words that are more likely to be pol- for millions of tokens, much less clustering them, ysemous. In Table1, we see topics for two such can be beyond the capability of many potential terms “land” and “metal”. All models are able to users’ computing resources. Therefore, we investi- distinguish metal the material from metal the genre, gate the effects of feature dimensionality reduction but BERT and GPT-2 are also able to distinguish to reduce the memory footprint of our method. land the noun from land the verb. The hidden layers of deep learning models are Contextualized embedding clusters are more known to learn representations with dimensionali- syntactically consistent than LDA topics. Con- ties much lower than the number of neurons (Raghu textualized word representations are known to rep- et al., 2017). We apply two methods for dimen- resent a large amount of syntactic information that sionality reduction: principal component analysis is not available to traditional bag-of-words topic (PCA) and sparse random projection (SRP) (Li 3 models (Goldberg, 2019). We therefore expect et al., 2006; Achlioptas, 2001). that token-level clusterings of contextualized word We find that reducing our token vectors to as few representations will have more homogeneity of syn- as 100 dimensions can have little negative effect. tactic form, and indeed we find that they do. Figure3 shows that reduced PCA features produce As a simple proxy for syntactic similarity, we improved internal coherence and little significant find the most likely part of speech (POS) for the top change in external coherence, but reduced SRP words in each cluster. We use this method because features are worse in both metrics, especially for it is easily implemented; inaccuracies should be BERT. We note that more clusters pass the 10-valid- consistent across models. To quantitatively evalu- words threshold for reduced PCA and SRP features. ate the homogeneity of POS within each topic, we Instead of skipping 10 BERT clusters and 7 GPT-2 count the distribution of POS tags for the top 20 3All implementations from (Pedregosa et al., 2011). Model Perc. Entr. Top Words (noun verb adj adv other) 5% 0.69 Valley Death valley Creek California mining ◦ Range Nevada Desert 25% 0.97 army forces soldiers campaign troops captured defeated Battle victory commander LDA 50% 1.11 society News Week Good Spirit Fruit says Doug host free 75% 1.28 Washington Delaware ceremony Grand Capitol building 156 Number laying Master 95% 1.53 critics reviews review positive mixed list Entertainment Times style something 5% 0.00 1997 1996 1995 1937 1895 1935 96 1896 1795 97 25% 0.61 Jewish Israel Jews Ottoman Arab Muslim Israeli Islamic Jerusalem Islam BERT 50% 0.86 captured defeated attacked capture attack siege destroyed surrender defeat occupied 75% 1.09 hop dance hipBR Dance HipZ Hop rapper 95% 1.48 separate combined co joint shared divided common combination distinct respective 5% 0.00 2004 2003 2015 2000 2014 1998 2001 2013 2002 1997 25% 0.42 Atlantic Pacific Gulf Mediterranean Caribbean Columbia Indian Baltic Bay Florida GPT-2 50% 0.73 knew finds discovers learned reveals discovered know heard discover learns 75% 1.02 Olympic League FA Summer Premier Division UEFA European Winter Tour 95% 1.42 positive mixed critical negative garnered favorable mostly attracted commercial

Table 3: Contextualized embedding clusters are more syntactically aware than LDA. Topics ranked by the entropy of POS distribution of the top 20 words (10 shown) with K = 500.

Figure 3: Mean internal and external coherence for re- duced features of BERT and GPT-2. Features reduced with PCA tend to have higher coherence than SRP. clusters on average, no clusters are skipped for PCA Figure 4: BERT and GPT-2 produce coherent topics for reduced features and only 1 for 100-dimensional less familiar (w.r.t. pretraining) collections. BERT con- SRP features. For 300-dimensional SRP features sistently produces more unique clusters. LDA external there is only a significant drop for GPT-2 with 3 coherence drops for K = 500. skipped on average. This decrease in skipped top- ics indicates that overly specific topics are being replaced with less specific ones. We hypothesize ated offline and immediately applied to embedding that dimensionality reduction is smoothing away vectors as soon as they are generated. To over- “spikes” in the embeddings space that cause the al- come the memory limitations of PCA, we use a gorithm to identify small clusters. Finally, larger batch approximation, incremental PCA (Ross et al., dimensionality reductions decrease concentration 2008). Using 100 dimensions and scikit-learn’s and exclusivity, making clusters more general. default batch size of five times the number of fea- PCA is significantly better than SRP, especially tures (3840), we find no significant difference in for more aggressive dimensionality reductions. We results between PCA and incremental PCA. For the find that mean-centering SRP features does not sig- remainder of experiments we use 100-dimensional nificantly improve results. An advantage of SRP, vectors which correspond to the top 100 compo- however, is that the projection matrix can be gener- nents produced by incremental PCA. book books author novel novels work Book fiction by authors Kindle published volume works literature books read reading copy Read reads Reading readable reader reread across follow opened researched study hand problem children problems course power lives mystery questions issues words death example reality battle use up than off used back over using there about work need down thing nice other full no easy small electronics screen quality sound device power battery unit system software remote video player mode drive audio setup remote battery mode card set range input signal support setting manual stand menu GPS fan power movie movies films flick theater Movie flicks game cinema film comedies Movies pictures westerns movies into up over through between off down than about around during against under found away along though film movie picture screen documentary films Film cinema feature work production filmmaker piece Picture album albums record release Album LP releases records effort up label EP thing titled studio label titled cds songs tracks hits tunes singles material stuff Songs ballads cuts ones numbers sounds artists Hits versions lyrics guitar vocals voice bass singing solo vocal sound work music piano chorus style here live playing

Table 4: Unlike vocabulary-level clusters, token-level clusters are grounded in specific documents and can be used to analyze collections. Here we show the most prominent BERT topics (K = 500) for the four product categories in REVIEWS. This analysis is purely post hoc, neither BERT nor its clustering have access to product labels.

than beyond Than twice upon much except besides half times less unlike nor yet per alone exceeded beside above within day days Day morning today date daily Days month 19 basis night period 1979 shortly moment term rainy mood sunny can Can able possible manage could knows lets capable allows can’t easily s Cannot cant ability Can’t can‘t barely cannot when When once time whenever Once soon everytime upon during Whenever moment Everytime At near before anytime too enough Too overly taste beyond tired sufficiently plenty somehow Enough unnecessary sufficient sick half overkill because since due Because Since considering cause meaning given thanks means based result order therefore being went took got happened came started did turned ended fell used kept left taken gave stopped won ran made moved would ’d Would might d normally I’d imagine otherwise happily woulda wouldn’t Wouldn’t envision probably Iwould up off Up ready upload ups along end forth uploading used down away unpacked securely rope onto open unpacking instead rather either matter Instead opposed other depending based choice Rather than favor otherwise regardless no

Table 5: The ten BERT topics from REVIEWS with the most uniform distribution over product categories.

Pretrained embeddings are effective for differ- higher mean exclusivity scores for both SCOTUS ent collections. While BERT and GPT-2 clus- and REVIEWS, while GPT-2 tends to have scores as ter models produce useful topics for WIKIPEDIA, good as LDA for K ∈ {50, 100}, but significantly will this hold for collections less similar to the lower for K = 500. training data of these pretrained models? Does it work well for collections of much shorter and Contextualized embedding clusters support col- much longer texts than Wikipedia articles? We find lection analysis. Token-level clusterings of con- that both BERT and GPT-2 produce semantically textualized word representations support sophisti- meaningful topics for SCOTUS and REVIEWS, but cated corpus analysis with little additional com- BERT continues to outperform GPT-2. As with plexity. In practice, topic models are often used WIKIPEDIA, we find that contextualized embed- as a way to “map” a specific collection, for exam- ding clusters have the largest advantage over LDA ple by identifying key documents or measuring the for large K. Figure4 shows that for K = 500 prevalence of topics over time (Boyd-Graber et al., BERT and GPT-2 clusters have significantly higher 2017). A key disadvantage therefore of vocabulary- external coherence scores on average than LDA level semantic clustering is that it is not grounded topics for SCOTUS and very similar scores for RE- in specific documents in a collection. Token-level VIEWS. For smaller K, LDA has the highest ex- clustering, in contrast, supports a wide range of ternal coherence scores followed by BERT. Inter- analysis techniques for researchers interested in nal coherence is more difficult to interpret because using a topic model as a means for studying a spe- of the variability in exclusivity. With K = 500, cific collection. We provide two case studies that BERT clusters have substantially worse internal both use additional metadata to organize token- coherence scores. In contrast, GPT-2 clusters tend level clusterings post hoc. to experience a smaller drop in scores, but this can Given a partition of the working collection, we be partially explained by their much lower aver- can count tokens assigned to each topic within a age exclusivity. We find that BERT consistently given partition to estimate locally prominent topics. produces the most unique topics for SCOTUS and Table4 shows the three most prominent topics for REVIEWS. BERT consistently has significantly the four product categories in REVIEWS from a BERT cluster model with K = 500. Many of these 1980–2019 Top Words union employment labor bargaining Labor workers job strike unions working gas coal oil natural mining mineral fuel mine fishing hunting compensation wages pension wage salary welfare compensate salaries retirement bonus discrimination prejudice unfair bias harassment segregation retaliation boycott persecution medical health care hospital physician patient Medical physicians clinic hospitals market competition competitive markets compete demand marketplace trading competitor trade election vote voting electoral ballot voter elected votes elect Election violence firearm gun violent weapon firearms armed weapons arms lethal patent copyright Copyright Patent patents trademark invention patented copyrighted patentee

Table 6: Grounded topics allow us to analyze trends in the organization of a corpus using a BERT topic model with K = 500. Here we measure the prevalence of topics from 1980 to 2019 in US Supreme Court opinions by counting token assignments. Topics related to unions and natural resources are more prevalent earlier in the collection, while topics related to firearms and intellectual property have become more prominent. topics are clearly interpretable as aspects of a par- 6 Conclusion ticular product (e.g. full albums, individual songs, descriptions of songs). Two topics contain mostly prepositions. While we could have added these We have presented a simple, reliable method for words to a stoplist before clustering, these less extracting mixed-membership models from pre- obviously interpretable clusters can nevertheless trained contextualized word representations. This represent distinct discourses, such as descriptions result is of interest in several ways. First, it pro- into, up, over, through, ... of action in Movies ( ) or vides insight into the affordances of contextualized use, descriptions of physical objects in Electronics ( representations. For example, our result suggests a up, than, off, ... ). We can also find the topics that way to rationalize seemingly ad hoc methods such least are associated with any one product category as averaging token vectors to build a representation by calculating the entropy of their distribution of of a sentence. Second, it suggests directions for tokens across categories. These are shown in Table further analysis and development of contextualized 5, and appear to represent subtle variations of sub- representation models and algorithms. The signifi- overkill possibilities reasons jective experiences: , , , cant differences we observe in superficially similar time periods and . We emphasize that this analysis systems such as BERT and RoBERTa require ex- requires no additional modeling, simply counting. planations that could expand our theoretical under- standing of what these models are doing. Why, for For partitions that have a natural order, such as example, is RoBERTa more prone to very small, years, we can create time series in the same post specific clusters, while BERT is not? Furthermore, hoc manner. Thus, we can use a BERT clustering if models like BERT are producing output similar of SCOTUS to examine the changes in subject of to topic model algorithms, this connection may sug- the cases brought before the US Supreme Court. gest new directions for simpler and more efficient Table6 shows time series for nine manually se- language model algorithms, as well as more rep- lected topics from a BERT clustering of SCOTUS resentationally powerful topic model algorithms. with K = 500, ordered by the means of their dis- Third, there may be substantial practical benefits tributions over years. We find that topics related to for researchers analyzing collections. Although labor and collective bargaining, oil and gas explo- running BERT on a large-scale corpus may for now ration, and compensation have decreased in inten- be substantially more computationally inefficient sity since the 1980s, while those related to medical than running highly-tuned LDA algorithms, pass- care and elections have remained relatively stable. ing a collection through such a system is likely It appears that competitive markets was a less com- to become an increasingly common analysis step. mon subject in the middle years, but has returned If such practices could be combined with online to prominence. Meanwhile, discrimination has re- clustering algorithms that would not require storing mained a prominent topic throughout the period, large numbers of dense token-level vectors, data but with higher intensities in the 1980s. Addition- analysts who are already using BERT-based work- ally, topics related to gun violence and patents and flows could easily extract high-quality topic model copyright appear to be increasing in intensity. output with essentially no additional work. Acknowledgments Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of This work was supported by NSF #1652536 and deep bidirectional transformers for language under- the Alfred P. Sloan foundation. standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), References pages 4171–4186, Minneapolis, Minnesota. Associ- ation for Computational Linguistics. Dimitris Achlioptas. 2001. Database-friendly random projections. In PODS ’01. Inderjit S Dhillon and Dharmendra S Modha. 2001. Concept decompositions for large sparse text data us- Roee Aharoni and Yoav Goldberg. 2020. Unsupervised ing clustering. Machine learning, 42(1-2):143–175. domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Asso- Adji B. Dieng, Francisco J. R. Ruiz, and David M. ciation for Computational Linguistics, pages 7747– Blei. 2020. Topic modeling in embedding spaces. 7763, Online. Association for Computational Lin- Transactions of the Association for Computational guistics. Linguistics, 8:439–453.

Sanjeev Arora, Rong Ge, Yonatan Halpern, David Yasunori Endo and Sadaaki Miyamoto. 2015. Spher- Mimno, Ankur Moitra, David Sontag, Yichen Wu, ical k-means++ clustering. In International Con- and Michael Zhu. 2013. A practical algorithm for ference on Modeling Decisions for Artificial Intelli- topic modeling with provable guarantees. In Interna- gence, pages 103–114. Springer. tional Conference on Machine Learning, pages 280– Kawin Ethayarajh. 2019. How contextual are contex- 288. tualized word representations? comparing the geom- etry of BERT, ELMo, and GPT-2 embeddings. In Jonathan Bischof and Edoardo M Airoldi. 2012. Sum- Proceedings of the 2019 Conference on Empirical marizing topical content with word frequency and Methods in Natural Language Processing and the Proceedings of the 29th International exclusivity. In 9th International Joint Conference on Natural Lan- Conference on Machine Learning (ICML-12), pages guage Processing (EMNLP-IJCNLP), pages 55–65, 201–208. Hong Kong, China. Association for Computational Linguistics. David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of ma- Yoav Goldberg. 2019. Assessing bert’s syntactic abili- chine Learning research, 3(Jan):993–1022. ties. ArXiv, abs/1901.05287.

Rishi Bommasani, Kelly Davis, and Claire Cardie. Ruining He and Julian McAuley. 2016. Ups and downs: 2020. Interpreting Pretrained Contextualized Repre- Modeling the visual evolution of fashion trends with sentations via Reductions to Static Embeddings. In one-class collaborative filtering. In proceedings of Proceedings of the 58th Annual Meeting of the Asso- the 25th international conference on world wide ciation for Computational Linguistics, Online. Asso- web, pages 507–517. ciation for Computational Linguistics. Andrea Lancichinetti, M Irmak Sirer, Jane X Wang, Jordan Boyd-Graber, Yuening Hu, David Mimno, et al. Daniel Acuna, Konrad Kording,¨ and Lu´ıs A Nunes 2017. Applications of topic models. Foundations Amaral. 2015. High-reproducibility and high- and Trends in Information Retrieval, 11(2-3):143– accuracy method for automated topic classification. 296. Physical Review X, 5(1):011007. Ping Li, Trevor J. Hastie, and Kenneth Ward Church. Peter F. Brown, Peter V. deSouza, Robert L. Mer- 2006. Very sparse random projections. In KDD ’06. cer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural lan- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- guage. Computational Linguistics, 18(4). dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Wenhu Chen, Evgeny Matusov, Shahram Khadivi, Roberta: A robustly optimized bert pretraining ap- and Jan-Thorsten Peter. 2016. Guided alignment proach. ArXiv, abs/1907.11692. training for topic-aware neural machine translation. AMTA 2016, Vol., page 121. Louis Martin, Benjamin Muller, Pedro Javier Or- tiz Suarez,´ Yoann Dupont, Laurent Romary, Eric´ Andy Coenen, Emily Reif, Ann Yuan, Been Kim, de la Clergerie, Djame´ Seddah, and Benoˆıt Sagot. Adam Pearce, Fernanda B. Viegas,´ and Martin Wat- 2020. CamemBERT: a tasty French language model. tenberg. 2019. Visualizing and measuring the ge- In Proceedings of the 58th Annual Meeting of the ometry of BERT. In Advances in Neural Informa- Association for Computational Linguistics, pages tion Processing Systems 32, pages 8594–8603. Cur- 7203–7219, Online. Association for Computational ran Associates, Inc. Linguistics. Julian McAuley, Christopher Targett, Qinfeng Shi, and Nicole Peinelt, Dong Nguyen, and Maria Liakata. 2020. Anton Van Den Hengel. 2015. Image-based recom- tBERT: Topic models and BERT joining forces for mendations on styles and substitutes. In Proceed- semantic similarity detection. In Proceedings of the ings of the 38th International ACM SIGIR Confer- 58th Annual Meeting of the Association for Compu- ence on Research and Development in Information tational Linguistics, pages 7047–7055, Online. As- Retrieval, pages 43–52. sociation for Computational Linguistics.

Andrew Kachites McCallum. 2002. Mallet: A machine Jeffrey Pennington, Richard Socher, and Christopher learning for language toolkit. Manning. 2014. Glove: Global vectors for word rep- resentation. In Proceedings of the 2014 Conference Stephen Merity, Caiming Xiong, James Bradbury, and on Empirical Methods in Natural Language Process- Richard Socher. 2017. Pointer sentinel mixture mod- ing (EMNLP), pages 1532–1543, Doha, Qatar. Asso- els. In 5th International Conference on Learning ciation for Computational Linguistics. Representations, ICLR 2017. Fabio Petroni, Tim Rocktaschel,¨ Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Yishu Miao, Edward Grefenstette, and Phil Blunsom. Alexander Miller. 2019. Language models as knowl- 2017. Discovering discrete latent topics with neural edge bases? In Proceedings of the 2019 Confer- variational inference. In PMLR. ence on Empirical Methods in Natural Language Processing and the 9th International Joint Confer- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey ence on Natural Language Processing (EMNLP- Dean. 2013a. Efficient estimation of word represen- IJCNLP), pages 2463–2473, Hong Kong, China. As- tations in vector space. In Proceedings of Workshop sociation for Computational Linguistics. at ICLR. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor- Dario Amodei, and Ilya Sutskever. 2019. Language rado, and Jeffrey Dean. 2013b. Distributed repre- models are unsupervised multitask learners. OpenAI sentations of words and phrases and their composi- Blog, 1(8):9. tionality. In Advances in Neural Information Pro- cessing Systems 26, pages 3111–3119. Curran Asso- Maithra Raghu, Justin Gilmer, Jason Yosinski, and ciates, Inc. Jascha Sohl-Dickstein. 2017. Svcca: Singular vec- tor canonical correlation analysis for deep learning David Mimno, Hanna Wallach, Edmund Talley, dynamics and interpretability. In Advances in Neu- Miriam Leenders, and Andrew McCallum. 2011. ral Information Processing Systems 30, pages 6076– Optimizing semantic coherence in topic models. In 6085. Curran Associates, Inc. Proceedings of the 2011 Conference on Empirical David A Ross, Jongwoo Lim, Ruei-Sung Lin, and Methods in Natural Language Processing, pages Ming-Hsuan Yang. 2008. Incremental learning for 262–272, Edinburgh, Scotland, UK. Association for robust visual tracking. International journal of com- Computational Linguistics. puter vision, 77(1-3):125–141. Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Evan Sandhaus. 2008. The new york times annotated 2018. Don’t give me the details, just the summary! corpus. Linguistic Data Consortium, Philadelphia, topic-aware convolutional neural networks for ex- 6(12):e26752. treme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Lan- Rico Sennrich, Barry Haddow, and Alexandra Birch. guage Processing, pages 1797–1807, Brussels, Bel- 2016. Neural machine translation of rare words gium. Association for Computational Linguistics. with subword units. In Proceedings of the 54th An- nual Meeting of the Association for Computational David Newman, Youn Noh, Edmund Talley, Sarvnaz Linguistics (Volume 1: Long Papers), pages 1715– Karimi, and Timothy Baldwin. 2010. Evaluating 1725, Berlin, Germany. Association for Computa- topic models for digital libraries. In Proceedings of tional Linguistics. the 10th annual joint conference on Digital libraries, pages 215–224. Suzanna Sia, Ayush Dalmia, and Sabrina J. Mielke. 2020. Tired of topic models? clusters of pretrained Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. word embeddings make for fast and good topics too! PhoBERT: Pre-trained language models for Viet- In Proceedings of the 2020 Conference on Empirical namese. Findings of EMNLP. Methods in Natural Language Processing. Akash Srivastava and Charles Sutton. 2016. Neural F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, variational inference for topic models. In NeurIPS B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, Bayesian deep learning workshop. R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- Keith Stevens, Philip Kegelmeyer, David Andrzejew- esnay. 2011. Scikit-learn: Machine learning in ski, and David Buttler. 2012. Exploring topic coher- Python. Journal of Machine Learning Research, ence over many models and many topics. In Pro- 12:2825–2830. ceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Com- putational Natural Language Learning, pages 952– 961, Jeju Island, Korea. Association for Computa- tional Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems 30, pages 5998–6008. Curran Asso- ciates, Inc. Ivan Vulic,´ Edoardo Maria Ponti, Robert Litschko, Goran Glavas,ˇ and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural In- formation Processing Systems 32, pages 3266–3280. Curran Associates, Inc.

Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. 2018. A reinforced topic-aware con- volutional sequence-to-sequence model for abstrac- tive text summarization. In Proceedings of the 27th International Joint Conference on Artificial Intelli- gence, pages 4453–4460. Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does bert make any sense? interpretable word sense disambiguation with con- textualized embeddings. In KONVENS. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow- icz, and Jamie Brew. 2019. Huggingface’s trans- formers: State-of-the-art natural language process- ing. ArXiv, abs/1910.03771. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin John- son, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rud- nick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neu- ral machine translation system: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144.