SPRITE: Generalizing Topic Models with Structured Priors

Michael J. Paul and Mark Dredze
Department of Computer Science
Human Language Technology Center of Excellence
Johns Hopkins University, Baltimore, MD 21218
[email protected], [email protected]

Abstract

We introduce SPRITE, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing SPRITE to be tailored to particular settings. We demonstrate this flexibility by constructing a SPRITE-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.

1 Introduction

Topic models can be a powerful aid for analyzing large collections of text by uncovering latent interpretable structures without manual supervision. Yet people often have expectations about topics in a given corpus and how they should be structured for a particular task. It is crucial for the user experience that topics meet these expectations (Mimno et al., 2011; Talley et al., 2011), yet black-box topic models provide no control over the desired output.

This paper presents SPRITE, a family of topic models that provide a flexible framework for encoding preferences as priors for how topics should be structured. SPRITE can incorporate many types of structure that have been considered in prior work, including hierarchies (Blei et al., 2003a; Mimno et al., 2007), factorizations (Paul and Dredze, 2012; Eisenstein et al., 2011), sparsity (Wang and Blei, 2009; Balasubramanyan and Cohen, 2013), correlations between topics (Blei and Lafferty, 2007; Li and McCallum, 2006), preferences over word choices (Andrzejewski et al., 2009; Paul and Dredze, 2013), and associations between topics and document attributes (Ramage et al., 2009; Mimno and McCallum, 2008).

SPRITE builds on a standard topic model, adding structure to the priors over the model parameters. The priors are given by log-linear functions of underlying components (§2), which provide additional latent structure that we will show can enrich the model in many ways. By applying particular constraints and priors to the component hyperparameters, a variety of structures can be induced, such as hierarchies and factorizations (§3), and we will show that this framework captures many existing topic models (§4).

After describing the general form of the model, we show how SPRITE can be tailored to particular settings by describing a specific model for the applied task of jointly inferring topic hierarchies and perspective (§6). We experiment with this topic+perspective model on sets of political debates and online reviews (§7), and demonstrate that SPRITE learns desired structures while outperforming many baselines at predictive tasks.

2 Topic Modeling with Structured Priors

Our model family generalizes latent Dirichlet allocation (LDA) (Blei et al., 2003b). Under LDA, there are K topics, where a topic is a categorical distribution over V words parameterized by φ_k. Each document has a categorical distribution over topics, parameterized by θ_m for the mth document. Each observed word in a document is generated by drawing a topic z from θ_m, then drawing the word from φ_z. θ and φ have priors given by Dirichlet distributions.

Our generalization adds structure to the generation of the Dirichlet parameters. The priors for these parameters are modeled as log-linear combinations of underlying components. Components are real-valued vectors of length equal to the vocabulary size V (for priors over word distributions) or length equal to the number of topics K (for priors over topic distributions).

For example, we might assume that topics about sports like baseball and football share a common prior – given by a component – with general words about sports. A fine-grained topic about steroid use in sports might be created by combining components about broader topics like sports, medicine, and crime. By modeling the priors as combinations of components that are shared across all topics, we can learn interesting connections between topics, where components provide an additional latent layer for corpus understanding.

As we'll show in the next section, by imposing certain requirements on which components feed into which topics (or documents), we can induce a variety of model structures. For example, if we want to model a topic hierarchy, we require that each topic depend on exactly one parent component. If we want to jointly model topic and ideology in a corpus of political documents (§6), we make topic priors a combination of one component from each of two groups: a topical component and an ideological component, resulting in ideology-specific topics like "conservative economics".

Components construct priors as follows. For the topic-specific word distributions φ, there are C^(φ) topic components. The kth topic's prior over φ_k is a weighted combination (with coefficient vector β_k) of the C^(φ) components (where component c is denoted ω_c). For the document-specific topic distributions θ, there are C^(θ) document components. The mth document's prior over θ_m is a weighted combination (coefficients α_m) of the C^(θ) components (where component c is denoted δ_c).

Once conditioned on these priors, the model is identical to LDA. The generative story is described in Figure 1. We call this family of models SPRITE: Structured PRIor Topic modEls.

• Generate hyperparameters: α, β, δ, ω (§3)
• For each document m, generate parameters:
  1. θ̃_mk = exp(Σ_{c=1..C^(θ)} α_mc δ_ck),  1 ≤ k ≤ K
  2. θ_m ~ Dirichlet(θ̃_m)
• For each topic k, generate parameters:
  1. φ̃_kv = exp(Σ_{c=1..C^(φ)} β_kc ω_cv),  1 ≤ v ≤ V
  2. φ_k ~ Dirichlet(φ̃_k)
• For each token (m, n), generate data:
  1. Topic (unobserved): z_{m,n} ~ θ_m
  2. Word (observed): w_{m,n} ~ φ_{z_{m,n}}

Figure 1: The generative story of SPRITE. The difference from latent Dirichlet allocation (Blei et al., 2003b) is the generation of the Dirichlet parameters.
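To make the generative story concrete, the following is a minimal Python/NumPy sketch that samples a toy corpus from fixed hyperparameters. It illustrates the equations in Figure 1 and is not the authors' implementation; all sizes (V, K, M, N, C^(θ), C^(φ)) and the random draws of α, β, δ, ω are assumptions made purely for the example.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions (not from the paper).
V, K = 1000, 10          # vocabulary size, number of topics
M, N = 5, 50             # documents, tokens per document
C_theta, C_phi = 3, 4    # number of document / topic components

# Hyperparameters: component coefficients (alpha, beta) and
# component weights (delta, omega), here drawn at random.
alpha = rng.normal(size=(M, C_theta))   # document-component coefficients
delta = rng.normal(size=(C_theta, K))   # document components (length K)
beta  = rng.normal(size=(K, C_phi))     # topic-component coefficients
omega = rng.normal(size=(C_phi, V))     # topic components (length V)

# Priors are log-linear combinations of components (Figure 1).
theta_tilde = np.exp(alpha @ delta)     # (M, K) Dirichlet parameters
phi_tilde   = np.exp(beta @ omega)      # (K, V) Dirichlet parameters

# Conditioned on these priors, generation proceeds exactly as in LDA.
theta = np.array([rng.dirichlet(theta_tilde[m]) for m in range(M)])
phi   = np.array([rng.dirichlet(phi_tilde[k]) for k in range(K)])

corpus = []
for m in range(M):
    z = rng.choice(K, size=N, p=theta[m])               # topic per token
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # word per token
    corpus.append(w)

The only difference from sampling under plain LDA is the construction of theta_tilde and phi_tilde; everything after those two lines is the standard LDA generative process.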
To illustrate the role that components can play, consider an example in which we are modeling research topics in a corpus of NLP abstracts (as we do in §7.3). Consider three speech-related topics: signal processing, automatic speech recognition, and dialog systems. Conceptualized as a hierarchy, these topics might belong to a higher-level category of spoken language processing. SPRITE allows the relationship between these three topics to be defined in two ways. One, we can model that these topics will all have words in common. This is handled by the topic components – these three topics could all draw from a common "spoken language" topic component, with high-weight words such as speech and spoken, which informs the prior of all three topics. Second, we can model that these topics are likely to occur together in documents. For example, articles about dialog systems are likely to discuss automatic speech recognition as a subroutine. This is handled by the document components – there could be a "spoken language" document component that gives high weight to all three topics, so that if a document draws its prior from this component, then it is more likely to give probability to these topics together.

The next section will describe how particular priors over the coefficients can induce various structures such as hierarchies and factorizations, and components and coefficients can also be provided as input to incorporate supervision and prior knowledge. The general prior structure used in SPRITE can be used to represent a wide array of existing topic models, outlined in Section 4.

3 Topic Structures

By changing the particular configuration of the hyperparameters – the component coefficients (α and β) and the component weights (δ and ω) – we obtain a diverse range of model structures and behaviors. We now describe possible structures and the corresponding priors.

3.1 Component Structures

This subsection discusses various graph structures that can describe the relation between topic components and topics, and between document components and documents, illustrated in Figure 2.

[Figure 2: Example graph structures describing possible relations between components (middle row) and topics or documents (bottom row): (a) Dense DAG, (b) Sparse DAG, (c) Tree, (d) Factored Forest. Edges correspond to non-zero values for α or β (the component coefficients defining priors over the document and topic distributions). The root node is a shared prior over the component weights (with other possibilities discussed in §3.3).]

3.1.1 Directed Acyclic Graph

The general SPRITE model can be thought of as a dense directed acyclic graph (DAG), where every document or topic is connected to every component with some weight α or β. When many of the α or β coefficients are zero, the DAG becomes sparse. A sparse DAG has an intuitive interpretation: each document or topic depends on some subset of components.
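To illustrate how the pattern of coefficients determines the component graph, the sketch below (reusing the toy NumPy setup from the earlier example) constructs dense, sparse, and tree-structured versions of β by hand. The masks and parent assignments are illustrative assumptions; in the model such patterns would be induced by priors over the coefficients rather than fixed in advance.

import numpy as np

rng = np.random.default_rng(1)

K, C_phi, V = 10, 4, 1000   # assumed toy sizes, as in the earlier sketch
omega = rng.normal(size=(C_phi, V))

# Dense DAG: every topic has a (generally non-zero) coefficient for
# every component.
beta_dense = rng.normal(size=(K, C_phi))

# Sparse DAG: many coefficients are exactly zero, so each topic depends
# on only a subset of components.  The zero pattern here is hand-picked
# for illustration.
mask = rng.random((K, C_phi)) < 0.3
beta_sparse = np.where(mask, beta_dense, 0.0)

# Tree: each topic depends on exactly one parent component.  One way to
# encode this is a one-hot indicator per topic multiplied by a
# real-valued weight (e.g., Gaussian), giving a weighted tree.
parent = rng.integers(C_phi, size=K)          # assumed parent assignment
indicator = np.eye(C_phi)[parent]             # (K, C_phi) one-hot rows
weight = rng.normal(loc=1.0, size=(K, 1))     # per-topic edge weight
beta_tree = indicator * weight

# In every case the topic priors are formed the same way (Figure 1).
phi_tilde_tree = np.exp(beta_tree @ omega)

The indicator-times-weight construction in the tree case is one way to realize the weighted-tree parameterization discussed next; only the structure it imposes, not the particular values, is the point of the sketch.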
3.1.2 Tree

For a weighted tree, α and β could be a product of two variables: an "integer-like" indicator vector with a sparse Dirichlet prior as suggested above, combined with a real-valued weight (e.g., with a Gaussian prior). We take this approach in our model of topic and perspective (§6).

3.1.3 Factored Forest