Efficient Methods for Inferring Large Sparse Topic Hierarchies

Doug Downey, Chandra Sekhar Bhagavatula, Yi Yang
Electrical Engineering and Computer Science, Northwestern University
[email protected], {csb,yiyang}@u.northwestern.edu

Abstract

Latent variable topic models such as Latent Dirichlet Allocation (LDA) can discover topics from text in an unsupervised fashion. However, scaling the models up to the many distinct topics exhibited in modern corpora is challenging. "Flat" topic models like LDA have difficulty modeling sparsely expressed topics, and richer hierarchical models become computationally intractable as the number of topics increases.

In this paper, we introduce efficient methods for inferring large topic hierarchies. Our approach is built upon the Sparse Backoff Tree (SBT), a new prior for latent topic distributions that organizes the latent topics as leaves in a tree. We show how a document model based on SBTs can effectively infer accurate topic spaces of over a million topics. We introduce a collapsed sampler for the model that exploits sparsity and the tree structure in order to make inference efficient. In experiments with multiple data sets, we show that scaling to large topic spaces results in much more accurate models, and that SBT document models make use of large topic spaces more effectively than flat LDA.

1 Introduction

Latent variable topic models, such as Latent Dirichlet Allocation (Blei et al., 2003), are popular approaches for automatically discovering topics in document collections. However, learning models that capture the large numbers of distinct topics expressed in today's corpora is challenging. While efficient methods for learning large topic models have been developed (Li et al., 2014; Yao et al., 2009; Porteous et al., 2008), these methods have focused on "flat" topic models such as LDA. Flat topic models over large topic spaces are prone to overfitting: even in a Web-scale corpus, some words are expressed rarely, and many documents are brief. Inferring a large topic distribution for each word and document given such sparse data is challenging. As a result, LDA models in practice tend to consider a few thousand topics at most, even when training on billions of words (Mimno et al., 2012).

A promising alternative to flat topic models is found in recent hierarchical topic models (Paisley et al., 2015; Blei et al., 2010; Li and McCallum, 2006; Wang et al., 2013; Kim et al., 2013; Ahmed et al., 2013). Topics of words and documents can be naturally arranged into hierarchies. For example, an article on the topic of the Chicago Bulls is also relevant to the more general topics of NBA, Basketball, and Sports. Hierarchies can combat data sparsity: if data is too sparse to place the term "Pau Gasol" within the Chicago Bulls topic, perhaps it can be appropriately modeled at somewhat less precision within the Basketball topic. A hierarchical model can make fine-grained distinctions where data is plentiful, and back-off to more coarse-grained distinctions where data is sparse. However, current hierarchical models are hindered by computational complexity. The existing inference methods for the models have runtimes that increase at least linearly with the number of topics, making them intractable on large corpora with large numbers of topics.

In this paper, we present a hierarchical topic model that can scale to large numbers of distinct topics. Our approach is built upon a new prior for latent topic distributions called a Sparse Backoff Tree (SBT). SBTs organize the latent topics as leaves in a tree, and smooth the distributions for each topic with those of similar topics nearby in the tree. SBT priors use absolute discounting and learned backoff distributions for smoothing sparse observation counts, rather than the fixed additive discounting utilized in Dirichlet and Chinese Restaurant Process models. We show how the SBT's characteristics enable a novel collapsed sampler that exploits the tree structure for efficiency, allowing SBT-based document models (SBTDMs) that scale to hierarchies of over a million topics.

We perform experiments in text modeling and hyperlink prediction, and find that SBTDM is more accurate compared to LDA and the recent nested Hierarchical Dirichlet Process (nHDP) (Paisley et al., 2015). For example, SBTDMs with a hundred thousand topics achieve perplexities 28-52% lower when compared with a standard LDA configuration using 1,000 topics. We verify that the empirical time complexity of inference in SBTDM increases sub-linearly in the number of topics, and show that for large topic spaces SBTDM is more than an order of magnitude more efficient than the hierarchical Pachinko Allocation Model (Mimno et al., 2007) and nHDP. Lastly, we release an implementation of SBTDM as open-source software.[1]

[1] http://websail.cs.northwestern.edu/projects/sbts/

2 Previous Work

The intuition in SBTDM that topics are naturally arranged in hierarchies also underlies several other models from previous work. Paisley et al. (2015) introduce the nested Hierarchical Dirichlet Process (nHDP), which is a tree-structured generative model of text that generalizes the nested Chinese Restaurant Process (nCRP) (Blei et al., 2010). Both the nCRP and nHDP model the tree structure as a random variable, defined over a flexible (potentially infinite in number) topic space. However, in practice the infinite models are truncated to a maximal size. We show in our experiments that SBTDM can scale to larger topic spaces and achieve greater accuracy than nHDP. To our knowledge, our work is the first to demonstrate a hierarchical topic model that scales to more than one million topics, and to show that the larger models are often much more accurate than smaller models. Similarly, compared to other recent hierarchical models of text and other data (Petinot et al., 2011; Wang et al., 2013; Kim et al., 2013; Ahmed et al., 2013; Ho et al., 2010), our focus is on scaling to larger data sets and topic spaces.

The Pachinko Allocation Model (PAM) introduced by Li & McCallum (Li and McCallum, 2006) is a general approach for modeling correlations among topic variables in latent variable models. Hierarchical organizations of topics, as in SBT, can be considered as a special case of a PAM, in which inference is particularly efficient. We show that our model is much more efficient than an existing PAM topic modeling implementation in Section 5.

Hu and Boyd-Graber (2012) present a method for augmenting a topic model with known hierarchical correlations between words (taken from e.g. WordNet synsets). By contrast, our focus is on automatically learning a hierarchical organization of topics from data, and we demonstrate that this technique improves accuracy over LDA.

Lastly, SparseLDA (Yao et al., 2009) is a method that improves the efficiency of inference in LDA by only generating portions of the sampling distribution when necessary. Our collapsed sampler for SBTDM utilizes a related intuition at each level of the tree in order to enable fast inference.

3 Sparse Backoff Trees

In this section, we introduce the Sparse Backoff Tree, which is a prior for a multinomial distribution over a latent variable. We begin with an example showing how an SBT transforms a set of observation counts into a probability distribution. Consider a latent variable topic model of text documents, similar to LDA (Blei et al., 2003) or pLSI (Hofmann, 1999). In the model, each token in a document is generated by first sampling a discrete latent topic variable Z from a document-specific topic distribution, and then sampling the token's word type from a multinomial conditioned on Z.

We will focus on the document's distribution over topics, ignoring the details of the word types for illustration. We consider a model with 12 latent topics, denoted as integers from the set {1, ..., 12}. Assume we have assigned latent topic values to five tokens in the document, specifically the topics {1, 4, 4, 5, 12}. We indicate the number of times topic value z has been selected as n_z (Figure 1).

Given the five observations, the key question faced by the model is: what is the topic distribution over a sixth topic variable from the same document? In the case of the Dirichlet prior utilized for the topic distribution in LDA, the probability that the sixth topic variable has value z is proportional to n_z + α, where α is a hyperparameter of the model.

Figure 1: An example Sparse Backoff Tree over 12 latent variable values. (In the figure, the root node has discount δ = 0.24, its two children have δ = 0.36, and the four nodes below them have δ = 0.30; the leaves are the topics z = 1, ..., 12, with observation counts n_z = 1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 1 and resulting pseudo-counts P(Z|S, n) proportional to 0.46, 0.36, 0.36, 1.56, 0.56, 0.46, 0.14, 0.14, 0.14, 0.24, 0.24, 0.34.)

SBT differs from LDA in that it organizes the topics into a tree structure in which the topics are leaves (see Figure 1). In this paper, we assume the tree structure, like the number of latent topics, is manually selected in advance. With an SBT prior, the estimate of the probability of a topic z is increased when nearby topics in the tree have positive counts. Each interior node a of the SBT T has a discount δ_a associated with it. The SBT transforms the observation counts n_z into pseudo-counts (shown in the last row in the figure) by subtracting δ_a from each non-zero descendent of each interior node a, and re-distributing the subtracted value uniformly among the descendants of a. For example, the first state has a total of 0.90 subtracted from its initial count n_1 = 1, and then receives 0.30/3 from its parent, 1.08/6 from its grandparent, and 0.96/12 from the root node for a total pseudo-count of 0.46. The document's distribution over a sixth topic variable is then proportional to these pseudo-counts.

When each document tends to discuss a set of related topics, the SBT prior will assign a higher likelihood to the data when related topics are located nearby in the tree. Thus, by inferring latent variable values to maximize likelihood, nearby leaves in the tree will come to represent related topics. SBT, unlike LDA, encodes the intuition that a topic becomes more likely in a document that also discusses other, related topics. In the example, the pseudo-count the SBT produces for topic six (which is related to other topics that occur in the document) is almost three times larger than that of topic eight, even though the observation counts are zero in each case. In LDA, topics six and eight would have equal pseudo-counts (proportional to α).

3.1 Definitions

Let Z be a discrete random variable that takes integer values in the set {1, ..., L}. Z is drawn from a multinomial parameterized by a vector θ of length L.

Definition 1 A Sparse Backoff Tree SBT(T, δ, Q(z)) for the discrete random variable Z consists of a rooted tree T containing L leaves, one for each value of Z; a coefficient δ_a > 0 for each interior node a of T; and a backoff distribution Q(z).

Figure 1 shows an example SBT. The example includes simplifications we also utilize in our experiments, namely that all nodes at a given depth in the tree have the same number of children and the same δ value. However, the inference techniques we present in Section 4 are applicable to any tree T and set of coefficients {δ_a}.

For a given SBT S, let ∆_S(z) indicate the sum of all δ_a values for all ancestors a of leaf node z (i.e., all interior nodes on the path from the root to z). For example, in the figure, ∆_S(z) = 0.90 for all z. This amount is the total absolute discount that the SBT applies to the random variable value z.

We define the SBT prior implicitly in terms of the posterior distribution it induces on a random variable Z drawn from a multinomial θ with an SBT prior, after θ is integrated out. Let the vector n = [n_1, ..., n_L] denote the sufficient statistics for any given observations drawn from θ, where n_z is the number of times value z has been observed. Then, the distribution over a subsequent draw of Z given SBT prior S and observations n is defined as:

  P(Z = z | S, n) ≡ [max(n_z − ∆_S(z), 0) + B(S, z, n) Q(z)] / K(S, Σ_i n_i)    (1)

where K(S, Σ_i n_i) is a normalizing constant that ensures the distribution sums to one for any fixed number of observations Σ_i n_i, and B(S, z, n) and Q(z) are defined as below.

The quantity B(S, z, n) expresses how much of the discounts from all other leaves z′ contribute to the probability of z. For an interior node a, let desc(a) indicate the number of leaves that are descendants of a, and let desc+(a) indicate the number of leaf descendants z of a that have non-zero values n_z. Then the contribution of the discount δ_a of node a to each of its descendent leaves is b(a, n) = δ_a desc+(a)/desc(a). We then define B(S, z, n) to be the sum of b(a, n) over all interior nodes a on the path from the root to z.

The function Q(z) is a backoff distribution. It allows the portion of the discount probability mass that is allocated to z to vary with a proposed distribution Q(z). This is useful because in practice the SBT is used as a prior for a conditional distribution, for example the distribution P(Z|w) over topic Z given a word w in a topic model of text. In that case, an estimate of P(Z|w) for a rare word w may be improved by "mixing in" the marginal topic distribution Q(z) = P(Z = z), analogous to backoff techniques in language modeling. Our document model described in the following section utilizes two different Q functions, one uniform (Q(z) = 1/T) and another related to the marginal topic distribution P(z).
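To make the running example concrete, the following minimal sketch (our own illustration, not the released SBTDM implementation) computes the unnormalized pseudo-counts of Equation 1 for the balanced tree of Figure 1, taking Q(z) = 1 so that the output matches the last row of the figure:

```python
from math import prod

def sbt_pseudocounts(branching, deltas, counts, Q=lambda z: 1.0):
    """Unnormalized max(n_z - Delta_S(z), 0) + B(S, z, n) * Q(z) for a balanced SBT.
    branching[d]: children per interior node at depth d; deltas[d]: discount delta_a
    shared by interior nodes at depth d; counts[z]: observation count n_z."""
    L = len(counts)
    assert prod(branching) == L
    total_discount = sum(deltas)                   # Delta_S(z), identical for every leaf
    pseudo = []
    for z in range(L):
        backoff_mass = 0.0                         # B(S, z, n)
        for d, delta in enumerate(deltas):
            size = prod(branching[d:])             # leaves under z's depth-d ancestor
            start = (z // size) * size             # first leaf under that ancestor
            nonzero = sum(1 for c in counts[start:start + size] if c > 0)
            backoff_mass += delta * nonzero / size # b(a, n) = delta_a * desc+(a) / desc(a)
        pseudo.append(max(counts[z] - total_discount, 0.0) + backoff_mass * Q(z))
    return pseudo

counts = [1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 1]      # topics 1..12 from Figure 1
print([round(x, 2) for x in sbt_pseudocounts([2, 2, 3], [0.24, 0.36, 0.30], counts)])
# [0.46, 0.36, 0.36, 1.56, 0.56, 0.46, 0.14, 0.14, 0.14, 0.24, 0.24, 0.34]
```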

4 The SBT Document Model

We now present the SBT document model, a probabilistic latent variable model of text documents that utilizes SBT priors. We then provide a collapsed sampler for the model that exploits the tree structure to make inference more efficient.

Our document model follows the Latent Dirichlet Allocation (LDA) Model (Blei et al., 2003), illustrated graphically in Figure 2 (left). In LDA, a corpus of documents is generated by sampling a topic distribution θ_d for each document d, and also a distribution over words φ_z for each topic. Then, in document d each token w is generated by first sampling a topic z from the multinomial P(Z | θ_d), and then sampling w from the multinomial P(W | Z, φ_z).

The SBTDM is the same as LDA, with one significant difference. In LDA, the parameters θ and φ are sampled from two Dirichlet priors, with separate hyperparameters α and β. In SBTDM, the parameters are instead sampled from particular SBT priors. Specifically, the generative model is:

  θ ~ SBT(T, δ^θ, Q_θ(z) = 1/T)
  φ′ ~ SBT(T, δ^φ, Q_φ(z) = P*(z))
  λ ~ Dirichlet(α′)
  Z | θ ~ Discrete(θ)
  W | z, φ′, λ ~ Discrete(λ φ′_{·,z} / P(z | φ′))

The variable φ′ represents the distribution of topics given words, P(Z|W). The SBTDM samples a distribution φ′_w over topics for each word type w in the vocabulary (of size V). In SBTDM, the random variable φ′_w has dimension L, rather than V for φ_z as in LDA. We also draw a prior word frequency distribution, λ = {λ_w} for each word w.[2] We then apply Bayes Rule to obtain the conditional distributions P(W | Z, φ′) required for inference. The expression λ φ′_{·,z} / P(z | φ′) denotes the normalized element-wise product of two vectors of length V: the prior distribution λ over words, and the vector of probabilities P(z|w) = φ′_{w,z} over words w for the given topic z.

The SBT priors for φ′ and θ share the same tree structure T, which is fixed in advance. The SBTs have different discount factors, denoted by the hyperparameters δ^θ and δ^φ. Finally, the backoff distribution for θ is uniform, whereas φ's backoff distribution P* is defined below.

[2] The word frequency distribution does not impact the inferred topics (because words are always observed), and in our experiments we simply use maximum likelihood estimates for λ_w (i.e., setting α′ to be negligibly small). Exploring other word frequency distributions is an item of future work.

Figure 2: The Latent Dirichlet Allocation Model (left) and our SBT Document Model (right).

4.1 Backoff distribution P*(z)

SBTDM requires choosing a backoff distribution P*(z) for φ′. As we now show, it is possible to select a natural backoff distribution P*(z) that enables scalable inference.

Given a set of observations n, we will set P*(z) proportional to P(z | S^φ, n). This is a recursive definition, because P(z | S^φ, n) depends on P*(z). Thus, we define:

  P*(z) ≡ [Σ_w max(n_z^w − ∆_S(z), 0)] / [C − Σ_w B_w(S^φ, z, n)]    (2)

where C > Σ_w B_w(S^φ, z, n) is a hyperparameter, n_z^w is the number of observations of topic z for word w in n, and B_w indicates the function B(S^φ, z, n) defined in Section 3.1, for the particular word w. That is, Σ_w B_w(S^φ, z, n) is the total quantity of smoothing distributed to topic z across all words, before the backoff distribution P*(z) is applied.

The form of P*(z) has two key advantages. The first is that setting P*(z) proportional to the marginal topic probability allows SBTDM to back-off toward marginal estimates, a successful technique in language modeling (Katz, 1987) (where it has been utilized for word probabilities, rather than topic probabilities). Secondly, setting the backoff distribution in this way allows us to simplify inference, as described below.
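A direct transcription of Equation 2 into code (our own sketch with hypothetical variable names, not part of the released implementation) makes the bookkeeping explicit:

```python
def backoff_p_star(z, counts_by_word, delta_total, B_by_word, C):
    """Equation 2: counts_by_word[w][z] is n_z^w, delta_total is Delta_S(z),
    B_by_word[w][z] is B_w(S_phi, z, n), and C exceeds sum_w B_by_word[w][z]."""
    numerator = sum(max(n_w[z] - delta_total, 0.0) for n_w in counts_by_word)
    denominator = C - sum(b_w[z] for b_w in B_by_word)
    return numerator / denominator
```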

4.2 Inference with Collapsed Sampling

Given a corpus of documents D, we infer the values of the hidden variables Z using the collapsed Gibbs sampler popular in Latent Dirichlet Allocation models (Griffiths and Steyvers, 2004). Each variable Z_i is sampled given the settings of all other variables (denoted as n_−i):

  P(Z_i = z | n_−i, D) ∝ P(z | n_−i, T, δ^θ) · P(w_i | z, n_−i, T, δ^φ)    (3)

The first term on the right-hand side is given by Equation 1. The second can be rewritten as:

  P(w_i | z, n_−i, T, δ^φ) = P(z, w_i | n_−i, T, δ^φ) / P(z | n_−i, T, δ^φ)    (4)

4.3 Efficient Inference Implementation

The primary computational cost when scaling to large topic spaces involves constructing the sampling distribution. Both LDA with collapsed sampling and SBTDM share an advantage in space complexity: the model parameters are specified by a sparse set of non-zero counts denoting how often tokens of each word or document are assigned to each topic. However, in general the sampling distribution for SBTDM has non-uniform probabilities for each of L different latent variable values. Thus, even if many parameters are zero, a naive approach that computes the complete sampling distribution will still take time linear in L. However, in SBTs the sampling distribution can be constructed efficiently using a simple recursive algorithm that exploits the structure of the tree. The result is an inference algorithm that often requires far less than linear time in L, as we verify in our experiments.
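For contrast with the tree-based method below, the naive linear-time baseline that materializes the full distribution of Equation 3 could be sketched as follows (our own illustration; the per-topic factors would come from SBT posteriors computed with token i's counts removed):

```python
import random

def sample_topic_naive(doc_scores, word_scores):
    """O(L) baseline: score every topic z by the product of the document-side
    and word-side factors of Equation 3, then draw one topic proportionally."""
    scores = [d * w for d, w in zip(doc_scores, word_scores)]
    r = random.random() * sum(scores)
    acc = 0.0
    for z, s in enumerate(scores):
        acc += s
        if acc >= r:
            return z
    return len(scores) - 1
```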

First, we note that P(z, w_i | n_−i, T, δ^φ) is proportional to the sum of two quantities: the discounted count max(n_z − ∆_S, 0) and the smoothing probability mass B(S, z, n)Q(z). By choosing Q(z) = P*(z), we can be ensured that P*(z) normalizes this sum. Thus, the backoff distribution cancels through the normalization. This means we can normalize the SBT for φ′ in advance by scaling the non-zero counts by a factor of 1/P*(z), and then at inference time we need only multiply pointwise two multinomials with SBT priors and uniform backoff distributions.

The intersection of two multinomials drawn from SBT priors with uniform backoff distributions can be performed efficiently for sparse trees. The algorithm relies on two quantities defined for each node of each tree. The first, b(a, n), was defined in Section 3. It denotes the smoothing that the interior node a distributes (uniformly) to each of its descendent leaves. We denote b(a, n) as b(a) in this section for brevity. The second quantity, τ(a), expresses the total count mass of all leaf descendants of a, excluding the smoothing from ancestors of a.

With the quantities b(a) and τ(a) for all a, we can efficiently compute the sampling distribution of the product of two SBT-governed multinomials (which we refer to as an SBTI). The method is shown in Algorithm 1. It takes two SBT nodes as arguments; these are corresponding nodes from two SBT priors that share the same tree structure T. It returns an SBTI, a data structure representing the sampling distribution.

Algorithm 1 Compute the sampling distribution for a product of two multinomials with SBT priors with Q(z) = 1
  function INTERSECT(SBT Node a_r, SBT Node a_l)
      if a_r, a_l are leaves then
          τ(i) ← τ(a_r) τ(a_l)
          return i
      end if
      i.r ← a_r
      r(i) ← b(a_l) · τ(a_r)
      i.l ← a_l ; b(i.l) ← 0
      l(i) ← b(a_r) · τ(a_l) − b(a_r) b(a_l) desc(a_r)
      τ(i) += r(i) + l(i)
      for all child c non-zero for a_r and a_l do
          i_c ← INTERSECT(a_r.c, a_l.c)
          i.children += i_c
          τ(i) += τ(i_c)
      end for
      return i
  end function

The efficiency of Algorithm 1 is reflected in the fact that the algorithm only recurses for child nodes c with non-zero τ(c) for both of the SBT node arguments. Because such cases will be rare for sparse trees, often Algorithm 1 only needs to traverse a small portion of the original SBTs in order to compute the sampling distribution exactly. Our experiments illustrate the efficiency of this algorithm in practice.

Finally, we can efficiently sample from either an SBTI or a single SBT-governed multinomial. The sampling methods are straightforward recursive methods, supplied in Algorithms 2 and 3.

Algorithm 2 Sample(SBT Node a)
  procedure SAMPLE(SBT Node a)
      if a is a leaf then return a
      end if
      Sample from {b(a) desc(a), τ(a) − b(a) desc(a)}
      if back-off distribution b(a) desc(a) selected then
          return Uniform[a's descendents]
      else
          Sample a's child c ∼ τ(c)
          return SAMPLE(c)
      end if
  end procedure

Algorithm 3 Sampling from an SBTI
  function SAMPLE(SBTI Node i)
      if i is a leaf then return i
      end if
      Sample from {r(i), l(i), τ(i) − r(i) − l(i)}
      if right distribution r(i) selected then
          return SAMPLE(i.r)
      else
          if left distribution l(i) selected then
              return SAMPLE(i.l)
          else
              Sample i's child c ∼ τ(c)
              return SAMPLE(c)
          end if
      end if
  end function

4.4 Expansion

Much of the computational expense encountered in inference with SBTDM occurs shortly after initialization. After a slow first several sampling passes, the conditional distributions over topics for each word and document become concentrated on a sparse set of paths through the SBT. From that point forward, sampling is faster and requires much less memory.

We utilize the hierarchical organization of the topic space in SBTs to side-step this computational complexity by adding new leaves to the SBTs of a trained SBTDM. This is a "coarse-to-fine" (Petrov and Charniak, 2011) training approach that we refer to as expansion. Using expansion, the initial sampling passes of the larger model can be much more time and space efficient, because they leverage the already-sparse structure of the smaller trained SBTDM.

Our expansion method takes as input an inferred sampling distribution n for a given tree T. The algorithm adds k new branches to each leaf of T to obtain a larger tree T′. We then transform the sampling state by replacing each n_i ∈ n with one of its children in T′. For example, in Figure 1, expanding with k = 3 would result in a new tree containing 36 topics, and the single observation of topic 4 in T would be re-assigned randomly to one of the topics {10, 11, 12} in T′.
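As a concrete illustration of the re-assignment step (our own sketch; the released implementation may organize this differently), with 1-indexed topics each old leaf z maps to the k new leaves (z−1)·k + 1, ..., z·k:

```python
import random

def expand_assignment(z, k):
    """Re-assign an observation of old topic z uniformly at random to one of
    its k children in the expanded tree (topics are 1-indexed as in Figure 1)."""
    return (z - 1) * k + random.randint(1, k)

# With k = 3, an observation of topic 4 lands on one of topics {10, 11, 12}.
```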
5 Experiments

We now evaluate the efficiency and accuracy of SBTDM. We evaluate SBTs on two data sets, the RCV1 Reuters corpus of newswire text (Lewis et al., 2004), and a distinct data set of Wikipedia links, WPL. We consider two disjoint subsets of RCV1, a small training set (RCV1s) and a larger training set (RCV1).

Data Set   Tokens        Vocabulary   Documents
RCV1s      2,669,093     46,130       22,149
RCV1       101,184,494   283,911      781,262
WPL        82,154,551    1,141,670    199,000

Table 1: Statistics of the three training corpora.

We compare the accuracy and efficiency of SBTDM against flat LDA and two existing hierarchical models, the Pachinko Allocation Model (PAM) and nested Hierarchical Dirichlet Process (nHDP).

To explore how the SBT structure impacts modeling performance, we experiment with two different SBTDM configurations. SBTDM-wide is a shallow tree in which the branching factor increases from the root downward in the sequence 3, 6, 6, 9, 9, 12, 12. Thus, the largest model we consider has 3·6·6·9·9·12·12 = 1,259,712 distinct latent topics. SBTDM-tall has lower branching factors of either 2 or 3 (so in our evaluation its depth ranges from 3 to 15). As in SBTDM-wide, in SBTDM-tall the lower branching factors occur toward the root of the tree. We vary the number of topics by considering balanced subtrees of each model. For nHDP, we use the same tree structures as in SBT-wide. In preliminary experiments, using the tall structure in nHDP yielded similar accuracy but was somewhat slower.

Similar to our LDA implementation, SBTDM optimizes hyperparameter settings as sampling proceeds. We use local beam search to choose new hyperparameters that maximize leave-one-out likelihood for the distributions P(Z|d) and P(Z|w) on the training data, evaluating the parameters against the current state of the sampler.

We trained all models by performing 100 sampling passes through the full training corpus (i.e., approximately 10 billion samples for RCV1, and 8 billion samples for WPL). We evaluate performance on held-out test sets of 998 documents for RCV1 (122,646 tokens), and 200 documents for WPL (84,610 tokens). We use the left-to-right algorithm (Wallach et al., 2009) over a randomized word order with 20 particles to compute perplexity. We re-optimize the LDA hyperparameters at regular intervals during sampling.

5.1 Small Corpus Experiments

We begin with experiments over a small corpus to highlight the efficiency advantages of SBTDM. As argued above, existing hierarchical models require inference that becomes expensive as the topic space increases in size. We illustrate this by comparing our model with PAM and nHDP. We also compare against a fast "flat" LDA implementation, SparseLDA, from the MALLET software package (McCallum, 2002).

For SBTDM we utilize a parallel inference approach, sampling all variables using a fixed estimate of the counts n, and then updating the counts after each full sampling pass (as in (Wang et al., 2009)). The SparseLDA and nHDP implementations are also parallel. All parallel methods use 15 threads. The PAM implementation provided in MALLET is single-threaded.

Our efficiency measurements are shown in Figure 3. We plot the mean wall-clock time to perform 100 sampling passes over the RCV1s corpus, starting from randomly initialized models (i.e. without expansion in SBTDM). For the largest plotted topic sizes for PAM and nHDP, we estimate total runtime using a small number of iterations. The results show that SBTDM's time to sample increases well below linear in the number of topics. Both SBTDM methods have runtimes that increase at a rate substantially below that of the square root of the number of topics (plotted as a blue line in the figure for reference). For the largest numbers of topics in the plot, when we increase the number of topics by a factor of 12, the time to sample increases by less than a factor of 1.7 for both SBT configurations.

We also evaluate the accuracy of the models on the small corpus. We do not compare against PAM, as the MALLET implementation lacks a method for computing perplexity for a PAM model. The results are shown in Table 3.
The SBT models tend to achieve lower perplexity than LDA, and SBTDM-tall performs slightly better than SBTDM-wide for most topic sizes. The best model, SBT-wide with 8,748 topics, achieves perplexity 14% lower than the best LDA model and 2% lower than the best SBTDM-tall model. The LDA model overfits for the largest topic configuration, whereas at that size both SBT models remain at least as accurate as any of the LDA models in Table 3.

We also evaluated using the topic coherence measure from (Mimno et al., 2011), which reflects how well the learned topics reflect word co-occurrence statistics in the training data. Following recent experiments with the measure (Stevens et al., 2012), we use ε = 10^−12 pseudo-co-occurrences of each word pair and we evaluate the average coherence using the top 10 words for each topic. Table 2 shows the results. PAM, LDA, and nHDP have better coherence at smaller topic sizes, but SBT maintains higher coherence as the number of topics increases.

Figure 3: Time (in seconds) to perform a sampling pass over the RCV1s corpus as number of topics varies, plotted on a log-log scale. The SBT models scale sub-linearly in the number of topics.

Topics   LDA      PAM      nHDP     SBTDM-wide   SBTDM-tall
18       -420.8   -421.2   -422.9   -444.3       -440.2
108      -434.8   -430.9   -554.3   -445.4       -445.8
972      -451.2   -        -548.1   -443.3       -443.8
8,748    -615.3   -        -        -444.3       -444.1

Table 2: Average topic coherence on the small RCV1s corpus.

5.1.1 Evaluating Expansion

The results discussed above do not utilize expansion in SBTDM. To evaluate expansion, we performed separate experiments in which we expanded a 972-topic model trained on RCV1s to initialize a 8,748-topic model. Compared to random initialization, expansion improved efficiency and accuracy. Inference in the expanded model executed approximately 30% faster and used 70% less memory, and the final 8,748-topic models had approximately 10% lower perplexity.

5.2 Large Corpus Results

Our large corpus experiments are reported in Table 4. Here, we compare the test set perplexity of a single model for each topic size and model type. We focus on SBTDM-tall for the large corpora. We utilize expansion (see Section 4.4) for SBTDM-tall models with more than a thousand topics on each data set. The results show that on both data sets, SBTDM-tall utilizes larger numbers of topics more effectively. On RCV1, LDA improves only marginally between 972 and 8,748 topics, whereas SBTDM-tall improves dramatically. For 8,748 topics, SBTDM-tall achieves a perplexity score 17% lower than the LDA model on RCV1, and 29% lower on WPL. SBT improves even further in larger topic configurations. Training and testing LDA with our implementation using over a hundred thousand topics was not tractable on our data sets due to space complexity (the MALLET implementation exceeded our maximum 250G of heap space). As discussed above, expansion enables SBTDM to dramatically reduce space complexity for large topic spaces.

The results highlight the accuracy improvements found from utilizing larger numbers of topics than are typically used in practice. For example, an SBTDM with 104,976 topics achieves perplexity 28-52% lower when compared with a standard LDA configuration using only 1,000 topics.

                  RCV1                      WPL
# Topics     LDA     SBTDM-tall        LDA     SBTDM-tall
108          1,121   1,148             7,049   7,750
972          820     841               2,598   2,095
8,748        772     637               1,730   1,236
104,976      -       593               -       1,242
1,259,712    -       626               -       -

Table 4: Model accuracy on large corpora (corpus perplexity measure). The SBT model utilizes larger numbers of topics more effectively.

                           Number of Topics
Model         18            108           972            8,748         104,976
LDA           1420 (16.3)   1016 (9.8)    844 (1.8)      845 (3.3)     1058 (4.1)
nHDP          1433 (19.6)   1446 (53.3)   1583 (157.7)   -             -
SBTDM-wide    1510 (31.5)   1091 (31.8)   797 (3.5)      723 (18.2)    844 (60.1)
SBTDM-tall    1480 (13.5)   1051 (9.1)    787 (10.5)     736 (3.2)     776 (14.1)

Table 3: Small training corpus (RCV1s) performance. Shown is perplexity averaged over three runs for each method and number of topics, with standard deviation in parens. Both SBTDM models achieve lower perplexity than LDA and nHDP for the larger numbers of topics.

Figure 4: An example of topics from a 104,976-topic SBTDM defined over Wikipedia pages. (The six topics shown, T13-T18, consist of radio-station call letters such as WVCB, WOWZ, WRJD, WNFZ, WSCW, and WQSE, with per-topic labels indicating state and format, e.g. NC Christian AM stations, NC Spanish AM stations, TN stations.)

Figure 5: An example of topics from an 8,748-topic SBTDM defined over the RCV1 corpus. (The subtree shown groups terms such as White House, McCurry, CIA, Lake, nomination, Senate, Lebanese, Beirut, and Lebanon.)

5.3 Exploring the Learned Topics

Lastly, we qualitatively examine whether the SBTDM's learned topics reflect meaningful hierarchical relationships. From an SBTDM of 104,976 topics trained on the Wikipedia links data set, we examined the first 108 leaves (these are contained in a single subtree of depth 5). 760 unique terms (i.e. Wikipedia pages) had positive counts for the topics, and 500 of these terms were related to radio stations.

The leaves appear to encode fine-grained sub-categorizations of the terms. In Figure 4, we provide examples from one subtree of six topics (topics 13-18). For each topic t, we list the top three terms w (ranked by symmetric conditional probability, P(w|t)P(t|w)), and a specific categorization that applies to the three terms. Interestingly, as shown in the figure, the top terms for the six topics we examined were all four-character "call letters" for particular radio stations. Stations with similar content or in nearby locations tend to cluster together in the tree. For example, the two topics focused on radio stations in Tennessee (TN) share the same parent, as do the topics focused on North Carolina (NC) AM stations. More generally, all six topics focus on radio stations in the southern US.

Figure 5 shows a different example, from a model trained on the RCV1 corpus. In this example, we first select only those terms that occur at least 2,000 times in the corpus and have a statistical association with their topic that exceeds a threshold, and we again rank terms by symmetric conditional probability. The 27-topic subtree detailed in the figure appears to focus on terms from major storylines in United States politics in early 1997, including El Niño, Lebanon, White House Press Secretary Mike McCurry, and the Senate confirmation hearings of CIA Director nominee Tony Lake.

6 Conclusion and Future Work

We introduced the Sparse Backoff Tree (SBT), a prior for latent topic distributions that organizes the latent topics as leaves in a tree. We presented and experimentally analyzed a document model based on the SBT, called an SBTDM. The SBTDM was shown to utilize large topic spaces more effectively than previous techniques.

There are several directions of future work. One limitation of the current work is that the SBT is defined only implicitly. We plan to investigate explicit representations of the SBT prior or related variants. Other directions include developing methods to learn the SBT structure from data, as well as applying the SBT prior to other tasks, such as sequential language modeling.

Acknowledgments

This research was supported in part by NSF grants IIS-1065397 and IIS-1351029, DARPA contract D11AP00268, and the Allen Institute for Artificial Intelligence. We thank the anonymous reviewers for their helpful comments.

References

[Ahmed et al.2013] Amr Ahmed, Liangjie Hong, and Alexander Smola. 2013. Nested chinese restaurant franchise process: Applications to user tracking and document modeling. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1426–1434.

[Blei et al.2003] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

[Blei et al.2010] David M Blei, Thomas L Griffiths, and Michael I Jordan. 2010. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM), 57(2):7.

[Griffiths and Steyvers2004] Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235.

[Ho et al.2010] Qirong Ho, Ankur P Parikh, Le Song, and Eric P Xing. 2010. Infinite hierarchical mmsb model for nested communities/groups in social networks. arXiv preprint arXiv:1010.1868.

[Hofmann1999] Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM.

[Hu and Boyd-Graber2012] Yuening Hu and Jordan Boyd-Graber. 2012. Efficient tree-based topic modeling. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 275–279. Association for Computational Linguistics.

[Katz1987] Slava Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. Acoustics, Speech and Signal Processing, IEEE Transactions on, 35(3):400–401.

[Kim et al.2013] Suin Kim, Jianwen Zhang, Zheng Chen, Alice Oh, and Shixia Liu. 2013. A hierarchical aspect-sentiment model for online reviews. In Proceedings of AAAI.

[Lewis et al.2004] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397.

[Li and McCallum2006] Wei Li and Andrew McCallum. 2006. Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 577–584, New York, NY, USA. ACM.

[Li et al.2014] Aaron Q Li, Amr Ahmed, Sujith Ravi, and Alexander J Smola. 2014. Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 891–900. ACM.

[McCallum2002] Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

[Mimno et al.2007] David Mimno, Wei Li, and Andrew McCallum. 2007. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning, pages 633–640. ACM.

[Mimno et al.2011] David Mimno, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272. Association for Computational Linguistics.

[Mimno et al.2012] David Mimno, Matt Hoffman, and David Blei. 2012. Sparse stochastic inference for latent dirichlet allocation. arXiv preprint arXiv:1206.6425.

[Paisley et al.2015] J. Paisley, C. Wang, D.M. Blei, and M.I. Jordan. 2015. Nested hierarchical dirichlet processes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(2):256–270, Feb.

[Petinot et al.2011] Yves Petinot, Kathleen McKeown, and Kapil Thadani. 2011. A hierarchical model of web summaries. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 670–675. Association for Computational Linguistics.

[Petrov and Charniak2011] Slav Petrov and Eugene Charniak. 2011. Coarse-to-fine natural language processing. Springer Science & Business Media.

[Porteous et al.2008] Ian Porteous, Arthur Asuncion, David Newman, Padhraic Smyth, Alexander Ihler, and Max Welling. 2008. Fast collapsed gibbs sampling for latent dirichlet allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569–577.

[Stevens et al.2012] Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961. Association for Computational Linguistics.

[Wallach et al.2009] Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1105–1112. ACM.

[Wang et al.2009] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y Chang. 2009. PLDA: Parallel latent dirichlet allocation for large-scale applications. In Algorithmic Aspects in Information and Management, pages 301–314. Springer.

[Wang et al.2013] Chi Wang, Marina Danilevsky, Nihit Desai, Yinan Zhang, Phuong Nguyen, Thrivikrama Taula, and Jiawei Han. 2013. A phrase mining framework for recursive construction of a topical hierarchy. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 437–445, New York, NY, USA. ACM.

[Yao et al.2009] Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM.
