Arxiv:2010.12626V1 [Cs.CL] 23 Oct 2020 Models (Blei Et Al., 2003)

Topic Modeling with Contextualized Word Representation Clusters Laure Thompson David Mimno University of Massachusetts Amherst Cornell University [email protected] [email protected] Abstract Topic modeling is often associated with probabilistic generative models in the machine learn- Clustering token-level contextualized word ing literature, but from the perspective of most representations produces output that shares actual applications the core benefit of such models many similarities with topic models for En- glish text collections. Unlike clusterings of is that they provide an interpretable latent space vocabulary-level word embeddings, the result- that is grounded in the text of a specific collec- ing models more naturally capture polysemy tion. Standard topic modeling algorithms operate and can be used as a way of organizing docu- by estimating the assignment of individual tokens ments. We evaluate token clusterings trained to topics, either through a Gibbs sampling state from several different output layers of popular or through parameters of variational distributions. contextualized language models. We find that These token-level assignments can then provide BERT and GPT-2 produce high quality clusterings, but RoBERTa does not. These cluster disambiguation of tokens based on context, a broad models are simple, reliable, and can perform overview of the themes of a corpus, and visual- as well as, if not better than, LDA topic mod- izations of the location of those themes within the els, maintaining high topic quality even when corpus (Boyd-Graber et al., 2017). the number of topics is large relative to the size A related but distinct objective is vocabulary of the local collection. clustering. These methods operate at the level of distinct word types, but have no inherent connec- 1 Introduction tion to words in context (e.g. Brown et al., 1992; Contextualized word representations such as those Arora et al., 2013; Lancichinetti et al., 2015). Re- produced by BERT (Devlin et al., 2019) have revo- cently, there has also been considerable interest in lutionized natural language processing for a num- continuous type-level embeddings such as GloVe ber of structured prediction problems. Recent work (Pennington et al., 2014) and word2vec (Mikolov has shown that these contextualized representations et al., 2013a,b), which can be clustered to form can support type-level semantic clusters (Sia et al., interpretable semantic groups. Although it has not 2020). In this work we show that token-level clus- been widely used, the original word2vec distribu- tering provides contextualized semantic information includes code for k-means clustering of vec- tion equivalent to that recovered by statistical topic tors. Sia et al.(2020) extends this behavior to arXiv:2010.12626v1 [cs.CL] 23 Oct 2020 models (Blei et al., 2003). From the perspective contextualized embeddings, but does not take ad- of contextualized word representations, this result vantage of the contextual, token-based nature of suggests new directions for semantic analysis us- such embeddings. ing both existing models and new architectures In this work, we demonstrate a new property of more specifically suited for such analysis. From contextualized word representations: if you run the perspective of topic modeling, this result im- a simple k-means algorithm on token-level em- plies that transfer learning through contextualized beddings, the resulting word clusters share similar word representations can fill gaps in probabilistic properties to the output of an LDA model. Tra- modeling (especially for short documents and small ditional topic modeling can be viewed as token collections) but also suggests new approaches for clustering. Indeed, a clustering of tokens based latent semantic analysis that are more closely tied on BERT vectors is functionally indistinguishable to mainstream transformer architectures. from a Gibbs sampling state for LDA, which as- signs each token to exactly one topic. For topic sentence-level embeddings have been shown to pro- modeling, clustering is based on local context (the duce semantically related document clusters (Aha- current topic disposition of words in the same docu- roni and Goldberg, 2020). But such models cannot ment) and on global information (the current topic represent topic mixtures or provide an interpretable disposition of other words of the same type). We word-based representation without additional map- find that contextualized representations offer simi- ping from clusters to documents to words. It is lar local and global information, but at a richer and widely known that token-level representations of more representationally powerful level. single word types provide contextual disambigua- We argue that pretrained contextualized embed- tion. For example, Coenen et al.(2019) show an dings provide a simple, reliable method for users example distinguishing uses of die between the Ger- to build fine-grained, semantically rich represen- man article, a verb for “perish” and a game piece. tations of text collections, even with limited local We explore this property on the level of whole col- training data. While for this study we restrict our lections, looking at all word types simultaneously. attention to English text, we see no reason contex- There are a number of models that solve the topic tualized models trained on non-English data (e.g. model objective directly using contemporary neu- Martin et al., 2020; Nguyen and Nguyen, 2020) ral network methods (e.g. Srivastava and Sutton, would not have the same properties. It is impor- 2016; Miao et al., 2017; Dieng et al., 2020). There tant to note, however, that we make no claim that are also a number of neural models that incorporate clustering contextualized word representations is topic models to improve performance on a vari- the optimal approach in all or even many situations. ety of tasks (e.g. Chen et al., 2016; Narayan et al., Rather, our goal is to demonstrate the capabilities 2018; Wang et al., 2018; Peinelt et al., 2020). Ad- of contextualized embeddings for token-level se- ditionally, BERT has been used for word sense dis- mantic clustering and to offer an additional useful ambiguation (Wiedemann et al., 2019). In contrast, application in cases where models like BERT are our goal is not to create hybrid or special-purpose already in use. models but to show that simple contextualized embedding clusters support token-level topic analysis 2 Related Work in themselves with no significant additional modeling. Since our goal is simply to demonstrate this We selected three contextualized language models property and not to declare overall “winners”, we based on their general performance and ease of focus on LDA in empirical comparisons because it accessibility to practitioners: BERT (Devlin et al., is the most widely used and straightforward, high- 2019), GPT-2 (Radford et al., 2019), and RoBERTa lighting the similarities and differences between (Liu et al., 2019). All three use similar Transformer contextualized embedding clusters and topics. (Vaswani et al., 2017) based architectures, but their objective functions vary in significant ways. These 3 Data and Methods models are known to encode substantial information about lexical semantics (Petroni et al., 2019; We use three real-world corpora of varying size, Vulic´ et al., 2020). content, and document length: Wikipedia articles Clustering of vocabulary-level embeddings has (WIKIPEDIA), Supreme Court of the United States been shown to produce semantically related word legal opinions (SCOTUS), and Amazon product re- clusters (Sia et al., 2020). But such embeddings views (REVIEWS). We select WIKIPEDIA for its cannot easily account for polysemy or take advan- affinity with the training data of the pretrained mod- tage of local context to disambiguate word senses els. Because its texts are similar to ones the mod- since each word type is modeled as a single vec- els have already seen, WIKIPEDIA is a “best-case” tor. Since these embeddings are not grounded in scenario for our clustering algorithms. If a cluster- specific documents, we cannot directly use them to ing method performs poorly on WIKIPEDIA, we track the presence of thematic clusters in a particu- expect the method to perform poorly in general. lar collection. In addition, Sia et al.(2020) find that In contrast, we select SCOTUS and REVIEWS for reweighting their type-level clustering by corpus their content variability. Legal opinions tend to be frequencies is helpful. In contrast, such frequen- long and contain many technical legal terms, while cies are “automatically” accounted for when we user-generated product reviews tend to be short and operate on the token level. Similarly, clusterings of highly variable in content and vocabulary. Term Model Top Words sea coast Beach Point coastal land Long Bay m sand beach tide Norfolk shore Ocean Coast areas LDA land acres County ha facilities State location property acre cost lot site parking settlers Department arrived arrival landing landed arriving arrive returning settled departed land leaving sailed arrives land BERT land property rights estate acres lands territory estates properties farm farmland Land fields acre arrived landed arriving landfall arrive arrives arrival landing land departed ashore embarked Back GPT-2 land sea ice forest rock mountain ground sand surface beach ocean soil hill lake snow sediment metals metal potassium sodium + lithium compounds electron ions hydrogen chemical atomic – ion LDA metal folk bands music genre band debut Metal heavy musicians lyrics instruments acts groups metals elements metal electron element atomic periodic electrons chemical atoms ions atom metal BERT rock dance pop metal Rock folk jazz punk comedy Dance heavy funk alternative soul street club rock pop hop dance metal folk hip punk jazz B soul funk alternative rap heavy disco electronic GPT-2 plutonium hydrogen carbon sodium potassium metal lithium uranium oxygen diamond radioactive Table 1: Automatically selected examples of polysemy in contextualized embedding clusters. Clusters containing “land” or “metal” as top words from BERT L[−1], GPT-2 L[−2], and LDA with K = 500.

Load more