SPRITE: Generalizing Topic Models with Structured Priors

Michael J. Paul and Mark Dredze
Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218
[email protected], [email protected]

Abstract

We introduce SPRITE, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing SPRITE to be tailored to particular settings. We demonstrate this flexibility by constructing a SPRITE-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.

1 Introduction

Topic models can be a powerful aid for analyzing large collections of text by uncovering latent interpretable structures without manual supervision. Yet people often have expectations about topics in a given corpus and how they should be structured for a particular task. It is crucial for the user experience that topics meet these expectations (Mimno et al., 2011; Talley et al., 2011), yet black box topic models provide no control over the desired output.

This paper presents SPRITE, a family of topic models that provide a flexible framework for encoding preferences as priors for how topics should be structured. SPRITE can incorporate many types of structure that have been considered in prior work, including hierarchies (Blei et al., 2003a; Mimno et al., 2007), factorizations (Paul and Dredze, 2012; Eisenstein et al., 2011), sparsity (Wang and Blei, 2009; Balasubramanyan and Cohen, 2013), correlations between topics (Blei and Lafferty, 2007; Li and McCallum, 2006), preferences over word choices (Andrzejewski et al., 2009; Paul and Dredze, 2013), and associations between topics and document attributes (Ramage et al., 2009; Mimno and McCallum, 2008).

SPRITE builds on a standard topic model, adding structure to the priors over the model parameters. The priors are given by log-linear functions of underlying components (§2), which provide additional latent structure that we will show can enrich the model in many ways. By applying particular constraints and priors to the component hyperparameters, a variety of structures can be induced such as hierarchies and factorizations (§3), and we will show that this framework captures many existing topic models (§4).

After describing the general form of the model, we show how SPRITE can be tailored to particular settings by describing a specific model for the applied task of jointly inferring topic hierarchies and perspective (§6). We experiment with this topic+perspective model on sets of political debates and online reviews (§7), and demonstrate that SPRITE learns desired structures while outperforming many baselines at predictive tasks.

2 Topic Modeling with Structured Priors

Our model family generalizes latent Dirichlet allocation (LDA) (Blei et al., 2003b). Under LDA, there are K topics, where a topic is a categorical distribution over V words parameterized by φ_k. Each document has a categorical distribution over topics, parameterized by θ_m for the mth document. Each observed word in a document is generated by drawing a topic z from θ_m, then drawing the word from φ_z. θ and φ have priors given by Dirichlet distributions.

Our generalization adds structure to the generation of the Dirichlet parameters. The priors for these parameters are modeled as log-linear combinations of underlying components. Components are real-valued vectors of length equal to the vocabulary size V (for priors over word distributions) or length equal to the number of topics K (for priors over topic distributions).



For example, we might assume that topics about sports like baseball and football share a common prior – given by a component – with general words about sports. A fine-grained topic about steroid use in sports might be created by combining components about broader topics like sports, medicine, and crime. By modeling the priors as combinations of components that are shared across all topics, we can learn interesting connections between topics, where components provide an additional latent layer for corpus understanding.

As we'll show in the next section, by imposing certain requirements on which components feed into which topics (or documents), we can induce a variety of model structures. For example, if we want to model a topic hierarchy, we require that each topic depend on exactly one parent component. If we want to jointly model topic and ideology in a corpus of political documents (§6), we make topic priors a combination of one component from each of two groups: a topical component and an ideological component, resulting in ideology-specific topics like “conservative economics”.

Components construct priors as follows. For the topic-specific word distributions φ, there are C^(φ) topic components. The kth topic's prior over φ_k is a weighted combination (with coefficient vector β_k) of the C^(φ) components (where component c is denoted ω_c). For the document-specific topic distributions θ, there are C^(θ) document components. The mth document's prior over θ_m is a weighted combination (coefficients α_m) of the C^(θ) components (where component c is denoted δ_c).

Once conditioned on these priors, the model is identical to LDA. The generative story is described in Figure 1. We call this family of models SPRITE: Structured PRIor Topic modEls.

• Generate hyperparameters: α, β, δ, ω (§3)
• For each document m, generate parameters:
  1. θ̃_mk = exp(Σ_{c=1}^{C^(θ)} α_mc δ_ck), 1 ≤ k ≤ K
  2. θ_m ∼ Dirichlet(θ̃_m)
• For each topic k, generate parameters:
  1. φ̃_kv = exp(Σ_{c=1}^{C^(φ)} β_kc ω_cv), 1 ≤ v ≤ V
  2. φ_k ∼ Dirichlet(φ̃_k)
• For each token (m, n), generate data:
  1. Topic (unobserved): z_{m,n} ∼ θ_m
  2. Word (observed): w_{m,n} ∼ φ_{z_{m,n}}

Figure 1: The generative story of SPRITE. The difference from latent Dirichlet allocation (Blei et al., 2003b) is the generation of the Dirichlet parameters.

To illustrate the role that components can play, consider an example in which we are modeling research topics in a corpus of NLP abstracts (as we do in §7.3). Consider three speech-related topics: signal processing, automatic speech recognition, and dialog systems. Conceptualized as a hierarchy, these topics might belong to a higher level category of spoken language processing. SPRITE allows the relationship between these three topics to be defined in two ways. One, we can model that these topics will all have words in common. This is handled by the topic components – these three topics could all draw from a common “spoken language” topic component, with high-weight words such as speech and spoken, which informs the prior of all three topics. Second, we can model that these topics are likely to occur together in documents. For example, articles about dialog systems are likely to discuss automatic speech recognition as a subroutine. This is handled by the document components – there could be a “spoken language” document component that gives high weight to all three topics, so that if a document draws its prior from this component, then it is more likely to give probability to these topics together.

The next section will describe how particular priors over the coefficients can induce various structures such as hierarchies and factorizations, and components and coefficients can also be provided as input to incorporate supervision and prior knowledge. The general prior structure used in SPRITE can be used to represent a wide array of existing topic models, outlined in Section 4.
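To make the generative story concrete, here is a minimal sketch (with made-up sizes and randomly initialized components; not the authors' implementation) that builds the structured Dirichlet parameters of Figure 1 and samples one document:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: V words, K topics, C_theta document components, C_phi topic components.
V, K, C_theta, C_phi, doc_len = 1000, 20, 5, 5, 50

# Component weight vectors (delta over topics, omega over words) and coefficients (alpha, beta).
delta = rng.normal(0, 1, size=(C_theta, K))   # document components
omega = rng.normal(0, 1, size=(C_phi, V))     # topic components
alpha = rng.normal(0, 1, size=C_theta)        # one document's coefficients
beta = rng.normal(0, 1, size=(K, C_phi))      # per-topic coefficients

# Structured Dirichlet parameters: log-linear combinations of components (Figure 1).
theta_tilde = np.exp(alpha @ delta)           # length K
phi_tilde = np.exp(beta @ omega)              # K x V

# Given the priors, generation is identical to LDA.
theta = rng.dirichlet(theta_tilde)
phi = np.array([rng.dirichlet(phi_tilde[k]) for k in range(K)])
z = rng.choice(K, size=doc_len, p=theta)
words = np.array([rng.choice(V, p=phi[k]) for k in z])
```

With a single component and constant coefficients this reduces to ordinary LDA with a fixed Dirichlet prior; all of the structure discussed below comes from how delta, omega, alpha, and beta are constrained.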


Figure 2: Example graph structures describing possible relations between components (middle row) and topics or documents (bottom row): (a) dense DAG, (b) sparse DAG, (c) tree, (d) factored forest. Edges correspond to non-zero values for α or β (the component coefficients defining priors over the document and topic distributions). The root node is a shared prior over the component weights (with other possibilities discussed in §3.3).

3 Topic Structures

By changing the particular configuration of the hyperparameters – the component coefficients (α and β) and the component weights (δ and ω) – we obtain a diverse range of model structures and behaviors. We now describe possible structures and the corresponding priors.

3.1 Component Structures

This subsection discusses various graph structures that can describe the relation between topic components and topics, and between document components and documents, illustrated in Figure 2.

3.1.1 Directed Acyclic Graph

The general SPRITE model can be thought of as a dense directed acyclic graph (DAG), where every document or topic is connected to every component with some weight α or β. When many of the α or β coefficients are zero, the DAG becomes sparse. A sparse DAG has an intuitive interpretation: each document or topic depends on some subset of components.

The default prior over coefficients that we use in this study is a 0-mean Gaussian distribution, which encourages the weights to be small. We note that to induce a sparse graph, one could use a 0-mean Laplace distribution as the prior over α and β, which prefers parameters such that some components are zero.

3.1.2 Tree

When each document or topic has exactly one parent (one nonzero coefficient) we obtain a two-level tree structure. This structure naturally arises in topic hierarchies, for example, where fine-grained topics are children of coarse-grained topics.

To create an (unweighted) tree, we require α_mc ∈ {0, 1} and Σ_c α_mc = 1 for each document m. Similarly, β_kc ∈ {0, 1} and Σ_c β_kc = 1 for each topic k. In this setting, α_m and β_k are indicator vectors which select a single component.

In this study, rather than strictly requiring α_m and β_k to be binary-valued indicator vectors, we create a relaxation that allows for easier parameter estimation. We let α_m and β_k be real-valued variables in a simplex, but place a prior over their values to encourage sparse values, favoring vectors with a single component near 1 and others near 0. This is achieved using a Dirichlet(ρ < 1) distribution as the prior over α and β, which has higher density near the boundaries of the simplex.¹

For a weighted tree, α and β could be a product of two variables: an “integer-like” indicator vector with sparse Dirichlet prior as suggested above, combined with a real-valued weight (e.g., with a Gaussian prior). We take this approach in our model of topic and perspective (§6).

3.1.3 Factored Forest

By using structured sparsity over the DAG, we can obtain a structure where components are grouped into G factors, and each document or topic has one parent from each group. Figure 2(d) illustrates this: the left three components belong to one group, the right two belong to another, and each bottom node has exactly one parent from each. This is a DAG that we call a “factored forest” because the subgraphs associated with each group in isolation are trees. This structure arises in “multi-dimensional” models like SAGE (Eisenstein et al., 2011) and Factorial LDA (Paul and Dredze, 2012), which allow tokens to be associated with multiple variables (e.g. a topic along with a variable denoting positive or negative sentiment). This allows word distributions to depend on both factors.

The “exactly one parent” indicator constraint is the same as in the tree structure but enforces a tree only within each group. This can therefore be (softly) modeled using a sparse Dirichlet prior as described in the previous subsection. In this case, the subsets of components belonging to each factor have separate sparse Dirichlet priors. Using the example from Figure 2(d), the first three component indicators would come from one Dirichlet, while the latter two component indicators would come from a second.

¹This generalizes the technique used in Paul and Dredze (2012), who approximated binary variables with real-valued variables in (0, 1), by using a “U-shaped” Beta(ρ < 1) distribution as the prior to encourage sparsity. The Dirichlet distribution is the multivariate extension of the Beta distribution.
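As a quick illustration of the relaxation in §3.1.2, the sketch below (hypothetical settings, not tied to the paper's code) draws coefficient vectors from a Dirichlet with ρ < 1 and checks how close they come to indicator vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
C, rho = 5, 0.01  # number of components, sparse Dirichlet concentration

# Each draw lives on the simplex but concentrates near a vertex when rho < 1,
# so it behaves like a soft indicator vector selecting one parent component.
beta_k = rng.dirichlet(np.full(C, rho), size=10)
fraction_near_binary = np.mean((beta_k < 1e-3) | (beta_k > 1 - 1e-3))
print(beta_k.round(3))
print(f"fraction of entries within 0.001 of 0 or 1: {fraction_near_binary:.2f}")
```

Smaller ρ pushes more mass toward the simplex corners; Table 3 later measures this fraction for the learned indicators under different annealing schedules.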


3.2 Tying Topic and Document Components

A desirable property for many situations is for the topic and document components to correspond to each other. For example, if we think of the components as coarse-grained topics in a hierarchy, then the coefficients β enforce that topic word distributions share a prior defined by their parent ω component, while the coefficients α represent a document's proportions of coarse-grained topics, which affects the document's prior over child topics (through the δ vectors). Consider the example with spoken language topics in §2: these three topics (signal processing, speech recognition, and dialog systems) are a priori likely both to share the same words and to occur together in documents. By tying these together, we ensure that the patterns are consistent across the two types of components, and the patterns from both types can reinforce each other during inference.

In this case, the number of topic components is the same as the number of document components (C^(φ) = C^(θ)), and the coefficients (β_cz) of the topic components should correlate with the weights of the document components (δ_zc). The approach we take (§6) is to define δ and β as a product of two variables (suggested in §3.1.2): a binary mask variable (with sparse Dirichlet prior), which we let be identical for both δ and β, and a positive real-valued weight.

3.3 Deep Components

As for priors over the component weights δ and ω, we assume they are generated from a 0-mean Gaussian. While not experimented with in this study, it is also possible to allow the components themselves to have rich priors which are functions of higher level components. For example, rather than assuming a mean of zero, the mean could be a weighted combination of higher level weight vectors. This approach was used by Paul and Dredze (2013) in Factorial LDA, in which each ω component had its own Gaussian prior provided as input to guide the parameters.

4 Special Cases and Extensions

We now describe several existing Dirichlet prior topic models and show how they are special cases of SPRITE. Table 1 summarizes these models and their relation to SPRITE. In almost every case, we also describe how the SPRITE representation of the model offers improvements over the original model or can lead to novel extensions.

Model | Sec. | Document priors              | Topic priors
LDA   | 4.1  | Single component             | Single component
SCTM  | 4.2  | Single component             | Sparse binary β
SAGE  | 4.3  | Single component             | Sparse ω
FLDA  | 4.3  | Binary δ is transpose of β   | Factored binary β
PAM   | 4.4  | α are supertopic weights     | Single component
DMR   | 4.5  | α are feature values         | Single component

Table 1: Topic models with Dirichlet priors that are generalized by SPRITE. The description of each model can be found in the noted section number. PAM is not equivalent, but captures very similar behavior. The described component formulations of SCTM and SAGE are equivalent, but these differ from SPRITE in that the components directly define the parameters, rather than priors over the parameters.

4.1 Latent Dirichlet Allocation

In LDA (Blei et al., 2003b), all θ vectors are drawn from the same prior, as are all φ vectors. This is a basic instance of our model with only one component at the topic and document levels, C^(θ) = C^(φ) = 1, with coefficients α = β = 1.

4.2 Shared Components Topic Models

Shared components topic models (SCTM) (Gormley et al., 2010) define topics as products of “components”, where components are word distributions. To use the notation of our paper, the kth topic's word distribution in SCTM is parameterized by φ_kv ∝ Π_c ω_cv^{β_kc}, where the ω vectors are word distributions (rather than vectors in ℝ^V), and the β_kc ∈ {0, 1} variables are indicators denoting whether component c is in topic k.

This is closely related to SPRITE, where topics also depend on products of underlying components. A major difference is that in SCTM, the topic-specific word distributions are exactly defined as a product of components, whereas in SPRITE, it is only the prior that is a product of components.² Another difference is that SCTM has an unweighted product of components (β is binary), whereas SPRITE allows for weighted products. The log-linear parameterization leads to simpler optimization procedures than the product parameterization. Finally, the components in SCTM only apply to the word distributions, and not the topic distributions in documents.

²The posterior becomes concentrated around the prior when the Dirichlet variance is low, in which case SPRITE behaves like SCTM. SPRITE is therefore more general.
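The contrast drawn in §4.2 can be summarized in a few lines. The following sketch (schematic, with made-up dimensions; not the SCTM or SPRITE implementations) shows SCTM defining φ directly as a product of component word distributions, versus SPRITE exponentiating a weighted sum of components to define only the Dirichlet prior φ̃:

```python
import numpy as np

rng = np.random.default_rng(1)
V, C = 1000, 4                                  # vocabulary size, number of components

# SCTM: components are word distributions; a binary vector picks which ones enter the product.
omega_sctm = rng.dirichlet(np.ones(V), size=C)  # C word distributions
b = np.array([1, 0, 1, 0])                      # binary indicators for one topic
phi_k = np.prod(omega_sctm ** b[:, None], axis=0)
phi_k /= phi_k.sum()                            # phi is *defined* by the product

# SPRITE: components are real-valued vectors; a weighted log-linear sum defines only the prior.
omega_sprite = rng.normal(0, 1, size=(C, V))
beta_k = rng.normal(0, 1, size=C)               # real-valued coefficients
phi_tilde_k = np.exp(beta_k @ omega_sprite)     # Dirichlet parameters, not phi itself
phi_k_sprite = rng.dirichlet(phi_tilde_k)       # phi is then drawn from the prior
```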


4.3 Factored Topic Models

Factored topic models combine multiple aspects of the text to generate the document (instead of just topics). One such topic model is Factorial LDA (FLDA) (Paul and Dredze, 2012). In FLDA, “topics” are actually tuples of potentially multiple variables, such as aspect and sentiment in online reviews (Paul et al., 2013). Each document distribution θ_m is a distribution over pairs (or higher-dimensional tuples if there are more than two factors), and each pair (j, k) has a word distribution φ_(j,k). FLDA uses a similar log-linear parameterization of the Dirichlet priors as SPRITE. Using our notation, the Dirichlet(φ̃_(j,k)) prior for φ_(j,k) is defined as φ̃_(j,k),v = exp(ω_jv + ω_kv), where ω_j is a weight vector over the vocabulary for the jth component of the first factor, and ω_k encodes the weights for the kth component of the second factor. (Some bias terms are omitted for simplicity.) The prior over θ_m has a similar form: θ̃_m,(j,k) = exp(α_mj + α_mk), where α_mj is document m's preference for component j of the first factor (and likewise for k of the second).

This corresponds to an instantiation of SPRITE using an unweighted factored forest (§3.1.3), where β_zc = δ_cz (§3.2; recall that δ are document components while β are the topic coefficients). Each subtopic z (which is a pair of variables in the two-factor model) has one parent component from each factor, indicated by β_z, which is binary-valued. At the document level in the two-factor example, δ_j is an indicator vector with values of 1 for all pairs with j as the first component, and thus the coefficient α_mj controls the prior for all such pairs of the form (j, ·); likewise δ_k indicates pairs with k as the second component, controlling the prior over (·, k).

The SPRITE representation offers a benefit over the original FLDA model. FLDA assumes that the entire Cartesian product of the different factors is represented in the model (e.g. φ parameters for every possible tuple), which leads to issues with efficiency and overparameterization with higher numbers of factors. With SPRITE, we can simply fix the number of “topics” to a number smaller than the size of the Cartesian product, and the model will learn which subset of tuples are included, through the values of β and δ.

Finally, another existing model family that allows for topic factorization is the sparse additive generative model (SAGE) (Eisenstein et al., 2011). SAGE uses a log-linear parameterization to define word distributions. SAGE is a general family of models that need not be factored, but is presented as an efficient solution for including multiple factors, such as topic and geography or topic and author ideology. Like SCTM, φ is exactly defined as a product of ω weights, rather than our approach of using the product to define a prior over φ.

4.4 Topic Hierarchies and Correlations

While the two previous subsections primarily focused on word distributions (with FLDA being an exception that focused on both), SPRITE's priors over topic distributions also have useful characteristics. The component-specific δ vectors can be interpreted as common topic distribution patterns, where each component is likely to give high weight to groups of topics that tend to occur together. Each document's α weights encode which of the topic groups are present in that document.

Similar properties are captured by the Pachinko allocation model (PAM) (Li and McCallum, 2006). Under PAM, each document has a distribution over supertopics. Each supertopic is associated with a Dirichlet prior over subtopic distributions, where subtopics are the low level topics that are associated with word parameters φ. Documents also have supertopic-specific distributions over subtopics (drawn from each supertopic-specific Dirichlet prior). Each topic in a document is drawn by first drawing a supertopic from the document's distribution, then drawing a subtopic from that supertopic's document distribution.

While not equivalent, this is quite similar to SPRITE where document components correspond to supertopics. Each document's α weights can be interpreted to be similar to a distribution over supertopics, and each δ vector is that supertopic's contribution to the prior over subtopics. The prior over the document's topic distribution is thus affected by the document's supertopic weights α.

The SPRITE formulation naturally allows for powerful extensions to PAM. One possibility is to include topic components for the word distributions, in addition to document components, and to tie together δ_cz and β_zc (§3.2). This models the intuitive characteristic that subtopics belonging to similar supertopics (encoded by δ) should come from similar priors over their word distributions (since they will have similar β values). That is, children of a supertopic are topically related – they are likely to share words. This is a richer alternative to the hierarchical variant of PAM proposed by Mimno et al. (2007), which modeled separate word distributions for supertopics and subtopics, but the subtopics were not dependent on the supertopic word distributions. Another extension is to form a strict tree structure, making each subtopic belong to exactly one supertopic: a true hierarchy.
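To illustrate how the factored and tied structures of §4.3–4.4 can be encoded, the sketch below (hypothetical factor sizes; an illustration rather than FLDA's or the paper's code) builds binary β indicators that give each topic tuple one parent per factor and ties the document components δ to the transpose of β:

```python
import numpy as np
from itertools import product

# Two factors, e.g. 3 aspects x 2 sentiments; each tuple (j, k) is one "topic".
F1, F2, V = 3, 2, 1000
tuples = list(product(range(F1), range(F2)))     # all (j, k) pairs
K, C = len(tuples), F1 + F2                      # topics, total components

# Binary topic coefficients: topic z has exactly one parent in each factor.
beta = np.zeros((K, C))
for z, (j, k) in enumerate(tuples):
    beta[z, j] = 1.0          # parent from factor 1
    beta[z, F1 + k] = 1.0     # parent from factor 2

delta = beta.T                # tying document and topic components (Section 3.2)

# Component word weights; the prior for each tuple is exp(omega_j + omega_k), as in Section 4.3.
omega = np.random.default_rng(2).normal(0, 1, size=(C, V))
phi_tilde = np.exp(beta @ omega)   # K x V Dirichlet parameters
```

In SPRITE the rows of beta need not enumerate the full Cartesian product; K can be smaller, and the indicator values are learned rather than fixed.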


4.5 Conditioning on Document Attributes

SPRITE also naturally provides the ability to condition document topic distributions on features of the document, such as a user rating in a review. To do this, let the number of document components be the number of features, and let the value of α_mc be the mth document's value of the cth feature. The δ vectors then influence the document's topic prior based on the feature values. For example, increasing α_mc will increase the prior for topic z if δ_cz is positive and decrease the prior if δ_cz is negative. This is similar to the structure used for PAM (§4.4), but here the α weights are fixed and provided as input, rather than learned and interpreted as supertopic weights. This is identical to the Dirichlet-multinomial regression (DMR) topic model (Mimno and McCallum, 2008). The DMR topic model defines each document's Dirichlet prior over topics as a log-linear function of the document's feature values and regression coefficients for each topic. The cth feature's regression coefficients correspond to the δ_c vector in SPRITE.

5 Inference and Parameter Estimation

We now discuss how to infer the posterior of the latent variables z and parameters θ and φ, and find maximum a posteriori (MAP) estimates of the hyperparameters α, β, δ, and ω, given their hyperpriors. We take a Monte Carlo EM approach, using a collapsed Gibbs sampler to sample from the posterior of the topic assignments z conditioned on the hyperparameters, then optimizing the hyperparameters using gradient-based optimization conditioned on the samples.

Given the hyperparameters, the sampling equations are identical to the standard LDA sampler (Griffiths and Steyvers, 2004). The partial derivative of the collapsed log likelihood L of the corpus with respect to each hyperparameter β_kc is:

∂L/∂β_kc = ∂P(β)/∂β_kc + Σ_v ω_cv φ̃_kv [Ψ(n^k_v + φ̃_kv) − Ψ(φ̃_kv) + Ψ(Σ_v′ φ̃_kv′) − Ψ(Σ_v′ n^k_v′ + φ̃_kv′)]   (1)

where φ̃_kv = exp(Σ_c′ β_kc′ ω_c′v), n^k_v is the number of times word v is assigned to topic k (in the samples from the E-step), and Ψ is the digamma function, the derivative of the log of the gamma function. The digamma terms arise from the Dirichlet-multinomial distribution, when integrating out the parameters φ. P(β) is the hyperprior. For a 0-mean Gaussian hyperprior with variance σ², ∂P(β)/∂β_kc = −β_kc/σ². Under a Dirichlet(ρ) hyperprior, when we want β to represent an indicator vector (§3.1.2), ∂P(β)/∂β_kc = (ρ − 1)/β_kc.

The partial derivatives for the other hyperparameters are similar. Rather than involving a sum over the vocabulary, ∂L/∂δ_ck sums over documents, while ∂L/∂ω_cv and ∂L/∂α_mc sum over topics.

Our inference algorithm alternates between one Gibbs iteration and one iteration of gradient ascent, so that the parameters change gradually. For unconstrained parameters, we use the update rule x^{t+1} = x^t + η_t ∇L(x^t), for some variable x and a step size η_t at iteration t. For parameters constrained to the simplex (such as when β is a soft indicator vector), we use exponentiated gradient ascent (Kivinen and Warmuth, 1997) with the update rule x^{t+1}_i ∝ x^t_i exp(η_t ∇_i L(x^t)).

5.1 Tightening the Constraints

For variables that we prefer to be binary but have softened to continuous variables using sparse Beta or Dirichlet priors, we can straightforwardly strengthen the preference to be binary by modifying the objective function to favor the prior more heavily. Specifically, under a Dirichlet(ρ < 1) prior we will introduce a scaling parameter τ_t ≥ 1 to the prior log likelihood: τ_t log P(β), with partial derivative τ_t (ρ − 1)/β_kc, which adds extra weight to the sparse Dirichlet prior in the objective. The algorithm used in our experiments begins with τ_1 = 1 and optionally increases τ over time. This is a deterministic annealing approach, where τ corresponds to an inverse temperature (Ueda and Nakano, 1998; Smith and Eisner, 2006).

As τ approaches infinity, the prior-annealed MAP objective max_β P(φ|β)P(β)^τ approaches max_β P(φ|β) max_β P(β): annealing only the prior P(β) results in maximization of this term only, while the outer max chooses a good β under P(φ|β) as a tie-breaker among all β values that maximize the inner max (binary-valued β).³ We show experimentally (§7.2.2) that annealing the prior yields values that satisfy the constraints.

³Other modifications could be made to the objective function to induce sparsity, such as entropy regularization (Balasubramanyan and Cohen, 2013).


6 A Factored Hierarchical Model of Topic and Perspective

We will now describe a SPRITE model that encompasses nearly all of the structures and extensions described in §3–4, followed by experimental results using this model to jointly capture topic and “perspective” in a corpus of political debates (where perspective corresponds to ideology) and a corpus of online doctor reviews (where perspective corresponds to the review sentiment).

First, we will create a topic hierarchy (§4.4). The hierarchy will model both topics and documents, where α_m is document m's supertopic proportions, δ_c is the cth supertopic's subtopic prior, ω_c is the cth supertopic's word prior, and β_k is the weight vector that selects the kth topic's parent supertopic, which incorporates (soft) indicator vectors to encode a tree structure (§3.1.2).

We want a weighted tree; while each β_k has only one nonzero element, the nonzero element can be a value other than 1. We do this by replacing the single coefficient β_kc with a product of two variables: b_kc β̂_kc. Here, β̂_k is a real-valued weight vector, while b_k is a binary indicator vector which zeroes out all but one element of β_k. We do the same with the δ vectors, replacing δ_ck with b_kc δ̂_ck. The b variables are shared across both topic and document components, which is how we tie these together (§3.2). We relax the binary requirement and instead allow a positive real-valued vector whose elements sum to 1, with a Dirichlet(ρ < 1) prior to encourage sparsity (§3.1.2).

To be properly interpreted as a hierarchy, we constrain the coefficients α and β (and by extension, δ) to be positive. To optimize these parameters in a mathematically convenient way, we write β_kc as exp(log β_kc), and instead optimize log β_kc ∈ ℝ rather than β_kc ∈ ℝ+.

Second, we factorize (§4.3) our hierarchy such that each topic depends not only on its supertopic, but also on a value indicating perspective. For example, a conservative topic about energy will appear differently from a liberal topic about energy. The prior for a topic will be a log-linear combination of both a supertopic (e.g. energy) and a perspective (e.g. liberal) weight vector. The variables associated with the perspective component are denoted with superscript (P) rather than subscript c.

To learn meaningful perspective parameters, we include supervision in the form of document attributes (§4.5). Each document includes a positive or negative score denoting the perspective, which is the variable α^(P)_m for document m. Since α^(P) are the coefficients for δ^(P), positive values of δ^(P)_k indicate that topic k is more likely if the author is conservative (which has a positive α score in our data), and less likely if the author is liberal (which has a negative score). There is only a single perspective component, but it represents two ends of a spectrum with positive and negative weights; β^(P) and δ^(P) are not constrained to be positive, unlike the supertopics. We also set β^(P)_k = δ^(P)_k. This means that topics with positive δ^(P) will also have a positive β coefficient that is multiplied with the perspective word vector ω^(P).

Finally, we include “bias” component vectors denoted ω^(B) and δ^(B), which act as overall weights over the vocabulary and topics, so that the component-specific ω and δ weights can be interpreted as deviations from the global bias weights.

Figure 3 summarizes the model. This includes most of the features described above (trees, factored structures, tying topic and document components, and document attributes), so we can ablate model features to measure their effect.

• b_k ∼ Dirichlet(ρ < 1) (soft indicator)
• α^(P) is given as input (perspective value)
• δ^(P)_k = β^(P)_k
• φ̃_kv = exp(ω^(B)_v + β^(P)_k ω^(P)_v + Σ_c b_kc β̂_kc ω_cv)
• θ̃_mk = exp(δ^(B)_k + α^(P)_m δ^(P)_k + Σ_c b_kc α_mc δ̂_ck)

Figure 3: Summary of the hyperparameters in our SPRITE-based topic and perspective model (§6).
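The two prior equations in Figure 3 can be computed directly; the sketch below (made-up sizes and randomly initialized parameters, not the released code) assembles φ̃ and θ̃ for one document of the topic and perspective model:

```python
import numpy as np

rng = np.random.default_rng(3)
V, K, C = 1000, 20, 5        # vocabulary, topics, supertopics (hypothetical sizes)

# Components and coefficients in the notation of Figure 3.
omega   = rng.normal(0, 1, size=(C, V))          # supertopic word priors
omega_B = rng.normal(0, 1, size=V)               # bias word weights
omega_P = rng.normal(0, 1, size=V)               # perspective word weights
delta_hat = rng.normal(0, 1, size=(C, K))        # supertopic subtopic priors
delta_B   = rng.normal(0, 1, size=K)             # bias topic weights
beta_P  = rng.normal(0, 1, size=K)               # per-topic perspective coefficients
delta_P = beta_P                                 # tied, as in the model
beta_hat = np.exp(rng.normal(0, 1, size=(K, C))) # positive weights (optimized in log space)
b = rng.dirichlet(np.full(C, 0.01), size=K)      # soft indicators picking each topic's parent

# One document: supertopic proportions and an observed perspective score.
alpha   = np.exp(rng.normal(0, 1, size=C))
alpha_P = 0.7

# Structured Dirichlet priors (the two equations of Figure 3).
phi_tilde   = np.exp(omega_B + np.outer(beta_P, omega_P) + (b * beta_hat) @ omega)      # K x V
theta_tilde = np.exp(delta_B + alpha_P * delta_P + np.einsum('kc,c,ck->k', b, alpha, delta_hat))
```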


7 Experiments

7.1 Datasets and Experimental Setup

We applied our models to two corpora:

• Debates: A set of floor debates from the 109th–112th U.S. Congress, collected by Nguyen et al. (2013), who also applied a hierarchical topic model to this data. Each document is a transcript of one speaker's turn in a debate, and each document includes the first dimension of the DW-NOMINATE score (Lewis and Poole, 2004), a real-valued score indicating how conservative (positive) or liberal (negative) the speaker is. This value is α^(P). We took a sample of 5,000 documents from the House debates (850,374 tokens; 7,426 types), balanced across party affiliation. We sampled from the most partisan speakers, removing scores below the median value.

• Reviews: Doctor reviews from RateMDs.com, previously analyzed using FLDA (Paul et al., 2013; Wallace et al., 2014). The reviews contain ratings on a 1–5 scale for multiple aspects. We centered the ratings around the middle value 3, then took reviews that had the same sign for all aspects, and averaged the scores to produce a value for α^(P). Our corpus contains 20,000 documents (476,991 tokens; 10,158 types), balanced across positive/negative scores.

Unless otherwise specified, K=50 topics and C=10 components (excluding the perspective component) for Debates, and K=20 and C=5 for Reviews. These values were chosen as a qualitative preference, not optimized for predictive performance, but we experiment with different values in §7.2.2. We set the step size η_t according to AdaGrad (Duchi et al., 2011), where the step size is the inverse of the sum of squared historical gradients.⁴ We place a sparse Dirichlet(ρ=0.01) prior on the b variables, and apply weak regularization to all other hyperparameters via a N(0, 10²) prior. These hyperparameters were chosen after only minimal tuning, and were selected because they showed stable and reasonable output qualitatively during preliminary development.

We ran our inference algorithm for 5000 iterations, estimating the parameters θ and φ by averaging the final 100 iterations. Our results are averaged across 10 randomly initialized samplers.⁵

⁴AdaGrad decayed too quickly for the b variables. For these, we used a variant suggested by Zeiler (2012) which uses an average of historical gradients rather than a sum.
⁵Our code and the data will be available at: http://cs.jhu.edu/~mpaul.

7.2 Evaluating the Topic Perspective Model

7.2.1 Analysis of Output

Figure 4 shows examples of topics learned from the Reviews corpus. The figure includes the highest probability words in various topics as well as the highest weight words in the supertopic components and perspective component, which feed into the priors over the topic parameters. We see that one supertopic includes many words related to surgery, such as procedure and performed, and has multiple children, including a topic about dental work. Another supertopic includes words describing family members such as kids and husband. One topic has both supertopics as parents, which appears to describe surgeries that saved a family member's life, with top words including {saved, life, husband, cancer}. The figure also illustrates which topics are associated more with positive or negative reviews, as indicated by the value of δ^(P).

Interpretable parameters were also learned from the Debates corpus. Consider two topics about energy that have polar values of δ^(P). The conservative-leaning topic is about oil and gas, with top words including {oil, gas, companies, prices, drilling}. The liberal-leaning topic is about renewable energy, with top words including {energy, new, technology, future, renewable}. Both of these topics share a common parent of an industry-related supertopic whose top words are {industry, companies, market, price}. A nonpartisan topic under this same supertopic has top words {credit, financial, loan, mortgage, loans}.

7.2.2 Quantitative Evaluation

We evaluated the model on two predictive tasks as well as topic quality. The first metric is perplexity of held-out text. The held-out set is based on tokens rather than documents: we trained on even numbered tokens and tested on odd tokens. This is a type of “document completion” evaluation (Wallach et al., 2009b) which measures how well the model can predict held-out tokens of a document after observing only some.

We also evaluated how well the model can predict the attribute value (DW-NOMINATE score or user rating) of the document. We trained a linear regression model using the document topic distributions θ as features. We held out half of the documents for testing and measured the mean absolute error. When estimating document-specific SPRITE parameters for held-out documents, we fix the feature value α^(P)_m = 0 for that document.

These predictive experiments do not directly measure performance at many of the particular tasks that topic models are well suited for, like data exploration, summarization, and visualization. We therefore also include a metric that more directly measures the quality and interpretability of topics. We use the topic coherence metric introduced by Mimno et al. (2011), which is based on co-occurrence statistics among each topic's most probable words and has been shown to correlate with human judgments of topic quality. This metric measures the quality of each topic, and we measure the average coherence across all topics:

(1/K) Σ_{k=1}^{K} Σ_{m=2}^{M} Σ_{l=1}^{m−1} log[(DF(v_km, v_kl) + 1) / DF(v_kl)]   (2)



Figure 4: Examples of topics (gray boxes) and components (colored boxes) learned on the Reviews corpus with 20 topics and 5 components. Words with the highest and lowest values of ω^(P), the perspective component, are shown on the left, reflecting positive and negative sentiment words. The words with largest ω values in two supertopic components are also shown, with manually given labels. Arrows from components to topics indicate that the topic's word distribution draws from that component in its prior (with non-zero β value). There are also implicit arrows from the perspective component to all topics (omitted for clarity). The vertical positions of topics reflect the topic's perspective value δ^(P). Topics centered above the middle line are more likely to occur in reviews with positive scores, while topics below the middle line are more likely in negative reviews. Note that this is a “soft” hierarchy because the tree structure is not strictly enforced, so some topics have multiple parent components. Table 3 shows how strict trees can be learned by tuning the annealing parameter.

where DF(v, w) is the document frequency of words v and w (the number of documents in which they both occur), DF(v) is the document frequency of word v, and v_ki is the ith most probable word in topic k. We use the top M = 20 words. This metric is limited to measuring only the quality of word clusters, ignoring the potentially improved interpretability of organizing the data into certain structures. However, it is still useful as an alternative measure of performance and utility, independent of the models' predictive abilities.

Using these three metrics, we compared to several variants (denoted in bold) of the full model to understand how the different parts of the model affect performance:

• Variants that contain the hierarchy components but not the perspective component (Hierarchy only), and vice versa (Perspective only).

• The “hierarchy only” model using only document components δ and no topic components. This is a PAM-style model because it exhibits similar behavior to PAM (§4.4). We also compared to the original PAM model.

• The “hierarchy only” model using only topic components ω and no document components. This is a SCTM-style model because it exhibits similar behavior to SCTM (§4.2).

• The full model where α^(P) is learned rather than given as input. This is a FLDA-style model that has similar behavior to FLDA (§4.3). We also compared to the original FLDA model.

• The “perspective only” model but without the ω^(P) topic component, so the attribute value affects only the topic distributions and not the word distributions. This is identical to the DMR model of Mimno and McCallum (2008) (§4.5).

• A model with no components except for the bias vectors ω^(B) and δ^(B). This is equivalent to LDA with optimized hyperparameters (learned). We also experimented with using fixed symmetric hyperparameters, using values suggested by Griffiths and Steyvers (2004): 50/K and 0.01 for topic and word distributions.

To put the results in context, we also compare to two types of baselines: (1) “bag of words” baselines, where we measure the perplexity of add-one smoothed unigram language models, we measure the prediction error using bag of words features, and we measure coherence of the unigram distribution; (2) naive baselines, where we measure the perplexity of the uniform distribution over each dataset's vocabulary, the prediction error when simply predicting each attribute as the mean value in the training set, and the coherence of 20 randomly selected words (repeated for 10 trials).
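The coherence metric of Eq. 2 requires only document frequencies of each topic's top words; a minimal sketch (hypothetical inputs; not the evaluation code used here):

```python
import numpy as np

def topic_coherence(top_words, doc_freq, co_doc_freq):
    """Average coherence (Eq. 2) over topics.

    top_words:    list of K lists, each holding a topic's M most probable words.
    doc_freq:     dict word -> number of documents containing the word.
    co_doc_freq:  dict (word, word) -> number of documents containing both words.
    """
    scores = []
    for words in top_words:
        score = 0.0
        for m in range(1, len(words)):
            for l in range(m):
                joint = co_doc_freq.get((words[m], words[l]), 0)
                score += np.log((joint + 1) / doc_freq[words[l]])
        scores.append(score)
    return float(np.mean(scores))
```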


                    Debates                                           Reviews
Model               Perplexity      Prediction error  Coherence       Perplexity       Prediction error  Coherence
Full model          †1555.5 ± 2.3   †0.615 ± 0.001    -342.8 ± 0.9    †1421.3 ± 8.4    †0.787 ± 0.006    -512.7 ± 1.6
Hierarchy only      †1561.8 ± 1.4    0.620 ± 0.002    -342.6 ± 1.1    †1457.2 ± 6.9    †0.804 ± 0.007    -509.1 ± 1.9
Perspective only    †1567.3 ± 2.3   †0.613 ± 0.002    -342.1 ± 1.2    †1413.7 ± 2.2    †0.800 ± 0.002    -512.0 ± 1.7
SCTM-style           1572.5 ± 1.6    0.620 ± 0.002   †-335.8 ± 1.1     1504.0 ± 1.9    †0.837 ± 0.002   †-490.8 ± 0.9
PAM-style           †1567.4 ± 1.9    0.620 ± 0.002    -347.6 ± 1.4    †1440.4 ± 2.7    †0.835 ± 0.004    -542.9 ± 6.7
FLDA-style          †1559.5 ± 2.0    0.617 ± 0.002    -340.8 ± 1.4    †1451.1 ± 5.4    †0.809 ± 0.006    -505.3 ± 2.3
DMR                  1578.0 ± 1.1    0.618 ± 0.002    -343.1 ± 1.0    †1416.4 ± 3.0    †0.799 ± 0.003    -511.6 ± 2.0
PAM                  1578.9 ± 0.3    0.622 ± 0.003   †-336.0 ± 1.1     1514.8 ± 0.9    †0.835 ± 0.003   †-493.3 ± 1.2
FLDA                 1574.1 ± 2.2    0.618 ± 0.002    -344.4 ± 1.3     1541.9 ± 2.3     0.856 ± 0.003    -502.2 ± 3.1
LDA (learned)        1579.6 ± 1.5    0.620 ± 0.001    -342.6 ± 0.6     1507.9 ± 2.4     0.846 ± 0.002    -501.4 ± 1.2
LDA (fixed)          1659.3 ± 0.9    0.622 ± 0.002    -349.5 ± 0.8     1517.2 ± 0.4     0.920 ± 0.003    -585.2 ± 0.9
Bag of words         2521.6 ± 0.0    0.617 ± 0.000   †-196.2 ± 0.0     1633.5 ± 0.0     0.813 ± 0.000   †-408.1 ± 0.0
Naive baseline       7426.0 ± 0.0    0.677 ± 0.000    -852.9 ± 7.4    10158.0 ± 0.0     1.595 ± 0.000    -795.2 ± 13.0

Table 2: Perplexity of held-out tokens and mean absolute error for attribute prediction using various models (± std. error). † indicates significant improvement (p < 0.05) over optimized LDA under a two-sided t-test.

Table 2 shows that the full SPRITE model substantially outperforms the LDA baseline at both predictive tasks. Generally, model variants with more structure perform better predictively.

The difference between SCTM-style and PAM-style is that the former uses only topic components (for word distributions) and the latter uses only document components (for the topic distributions). Results show that the structured priors are more important for topic than word distributions, since PAM-style has lower perplexity on both datasets. However, models with both topic and document components generally outperform either alone, including comparing the Perspective only and DMR models. The former includes both topic and document perspective components, while DMR has only a document level component.

PAM does not significantly outperform optimized LDA in most measures, likely because it updates the hyperparameters using a moment-based approximation, which is less accurate than our gradient-based optimization. FLDA perplexity is 2.3% higher than optimized LDA on Reviews, comparable to the 4% reported by Paul and Dredze (2012) on a different corpus. The FLDA-style SPRITE variant, which is more flexible, significantly outperforms FLDA in most measures.

The results are quite different under the coherence metric. It seems that topic components (which influence the word distributions) improve coherence over LDA, while document components worsen coherence. SCTM-style (which uses only topic components) does the best in both datasets, while PAM-style (which uses only document components) does the worst. PAM also significantly improves over LDA, despite worse perplexity.

The LDA (learned) baseline substantially outperforms LDA (fixed) in all cases, highlighting the importance of optimizing hyperparameters, consistent with prior research (Wallach et al., 2009a). Surprisingly, many SPRITE variants also outperform the bag of words regression baseline, even though the latter was tuned to optimize performance using heavy ℓ2 regularization, which we applied only weakly (without tuning) to the topic model features. We also point out that the “bag of words” version of the coherence metric (the coherence of the top 20 words) is higher than the average topic coherence, which is an artifact of how the metric is defined: the most probable words in the corpus also tend to co-occur together in most documents, so these words are considered to be highly coherent when grouped together.

Parameter Sensitivity  We evaluated the full model at the two predictive tasks with varying numbers of topics ({12, 25, 50, 100} for Debates and {5, 10, 20, 40} for Reviews) and components ({2, 5, 10, 20}). Figure 5 shows that performance is more sensitive to the number of topics than components, with generally less variance among the latter. More topics improve performance monotonically on Debates, while performance declines at 40 topics on Reviews. The middle range of components (5–10) tends to perform better than too few (2) or too many (20) components.

Regardless of quantitative differences, the choice of parameters may depend on the end application and the particular structures that the user has in mind, if interpretability is important. For example, if the topic model is used as a visualization tool, then 2 components would not likely result in an interesting hierarchy to the user, even if this setting produces low perplexity.


Figure 5: Predictive performance of the full model with different numbers of topics K across different numbers of components, represented on the x-axis (log scale).

Structured Sparsity  We use a relaxation of the binary b that induces a “soft” tree structure. Table 3 shows the percentage of b values which are within ε = .001 of 0 or 1 under various annealing schedules, increasing the inverse temperature τ by 0.1% after each iteration (i.e. τ_t = 1.001^t), as well as 0.3%, and no annealing at all (τ = 1). At τ = 0, we model a DAG rather than a tree, because the model has no preference that b is sparse. Many of the values are binary in the DAG case, but the sparse prior substantially increases the number of binary values, obtaining fully binary structures with sufficient annealing. We compare the DAG and tree structures more in the next subsection.

τ_t                   Debates   Reviews
0.000 (Sparse DAG)    58.1%     42.4%
1.000 (Soft Tree)     93.2%     74.6%
1.001^t (Hard Tree)   99.8%     99.4%
1.003^t (Hard Tree)   100%      100%

Table 3: The percentage of indicator values that are sparse (near 0 or 1) when using different annealing schedules.

7.3 Structure Comparison

The previous subsection experimented with models that included a variety of structures, but did not provide a comparison of each structure in isolation, since most model variants were part of a complex joint model. In this section, we experiment with the basic SPRITE model for the three structures described in §3: a DAG, a tree, and a factored forest. For each structure, we also experiment with each type of component: document, topic, and both types (combined).

For this set of experiments, we included a third dataset that does not contain a perspective value:

• Abstracts: A set of 957 abstracts from the ACL anthology (97,168 tokens; 8,246 types). These abstracts have previously been analyzed with FLDA (Paul and Dredze, 2012), so we include it here to see if the factored structure that we explore in this section learns similar patterns.

Based on our sparsity experiments in the previous subsection, we set τ_t = 1.003^t to induce hard structures (tree and factored) and τ = 0 to induce a DAG. We keep the same parameters as the previous subsection: K=50 and C=10 for Debates and K=20 and C=5 for Reviews. For the factored structures, we use two factors, with one factor having more components than the other: 3 and 7 components for Debates, and 2 and 3 components for Reviews (the total number of components across the two factors is therefore the same as for the DAG and tree experiments). The Abstracts experiments use the same parameters as with Debates.

Since the Abstracts dataset does not have a perspective value to predict, we do not include prediction error as a metric, instead focusing on held-out perplexity and topic coherence (Eq. 2). Table 4 shows the results of these two metrics.

Some trends are clear and consistent. Topic components always hurt perplexity, while these components typically improve coherence, as was observed in the previous subsection. It has previously been observed that perplexity and topic quality are not correlated (Chang et al., 2009). These results show that the choice of components depends on the task at hand. Combining the two components tends to produce results somewhere in between, suggesting that using both component types is a reasonable “default” setting.

Document components usually improve perplexity, likely due to the nature of the document completion setup, in which half of each document is held out. The document components capture correlations between topics, so by inferring the components that generated the first half of the document, the prior is adjusted to give more probability to topics that are likely to occur in the unseen second half. Another interesting trend is that the factored structure tends to perform well under both metrics, with the lowest perplexity and highest coherence in a majority of the nine comparisons (i.e. each row). Perhaps the models are capturing a natural factorization present in the data.


                 Perplexity                                          Coherence
             DAG              Tree             Factored          DAG             Tree            Factored
Debates
Document     1572.0 ± 0.9     1568.7 ± 2.0     1566.8 ± 2.0      -342.9 ± 1.2    -346.0 ± 0.9    -343.2 ± 1.0
Topic        1575.0 ± 1.5     1573.4 ± 1.8     1559.3 ± 1.5      -342.4 ± 0.6    -339.2 ± 1.7    -333.9 ± 0.9
Combined     1566.7 ± 1.7     1559.9 ± 1.9     1552.5 ± 1.9      -342.9 ± 1.3    -342.6 ± 1.2    -340.3 ± 1.0
Reviews
Document     1456.9 ± 3.8     1446.4 ± 4.0     1450.4 ± 5.5      -512.2 ± 4.6    -527.9 ± 6.5    -535.4 ± 7.4
Topic        1508.5 ± 1.7     1517.9 ± 2.0     1502.0 ± 1.9      -500.1 ± 1.2    -499.0 ± 0.9    -486.1 ± 1.5
Combined     1464.1 ± 3.3     1455.1 ± 5.6     1448.5 ± 8.5      -504.9 ± 1.4    -527.8 ± 6.1    -535.5 ± 8.2
Abstracts
Document     3107.7 ± 7.7     3089.5 ± 9.1     3098.7 ± 10.2     -393.2 ± 0.8    -390.8 ± 0.9    -392.8 ± 1.5
Topic        3241.7 ± 2.1     3455.9 ± 10.2    3507.4 ± 9.7      -389.0 ± 0.8    -388.8 ± 0.7    -332.2 ± 1.1
Combined     3200.8 ± 3.5     3307.2 ± 7.8     3364.9 ± 19.1     -373.1 ± 0.8    -360.6 ± 0.9    -342.3 ± 0.9

Table 4: Quantitative results for different structures (columns) and different components (rows) for two metrics (± std. error) across three datasets. The best (structure, component) pair for each dataset and metric is in bold.

To understand the factored structure qualitatively, Figure 6 shows examples of components from each factor along with example topics that draw from all pairs of these components, learned on Abstracts. We find that the factor with the smaller number of components (left of the figure) seems to decompose into components representing the major themes or disciplines found in ACL abstracts, with one component expressing computational approaches (top) and the other expressing linguistic theory (bottom). The third component (not shown) has words associated with speech, including {spoken, speech, recognition}.

The factor shown on the right seems to decompose into different research topics: one component represents semantics (top), another syntax (bottom), with others including morphology (top words including {segmentation, chinese, morphology}) and information retrieval (top words including {documents, retrieval, ir}).

Many of the topics intuitively follow from the components of these two factors. For example, the two topics at the top left and right (one expressing vector space models) both draw from the “computational” and “semantics” components, while the topics at the middle left and right (one expressing ontologies) draw from “linguistics” and “semantics”.

The factorization is similar to what had previously been induced by FLDA. Figure 3 of Paul and Dredze (2012) shows components that look similar to the computational methods and linguistic theory components here, and the factor with the largest number of components also decomposes by research topic. These results show that SPRITE is capable of recovering similar structures as FLDA, a more specialized model. SPRITE is also much more flexible than FLDA. While FLDA strictly models a one-to-one mapping of topics to each pair of components, SPRITE allows multiple topics to belong to the same pair (as in the semantics examples above), and conversely SPRITE does not require that all pairs have an associated topic. This property allows SPRITE to scale to larger numbers of factors than FLDA, because the number of topics is not required to grow with the number of all possible tuples.

8 Related Work

Our topic and perspective model is related to supervised hierarchical LDA (SHLDA) (Nguyen et al., 2013), which learns a topic hierarchy while also learning regression parameters to associate topics with feature values such as political perspective. This model does not explicitly incorporate perspective-specific word priors into the topics (as in our factorized approach). The regression structure is also different. SHLDA is a “downstream” model, where the perspective value is a response variable conditioned on the topics. In contrast, SPRITE is an “upstream” model, where the topics are conditioned on the perspective value. We argue that the latter is more accurate as a generative story (the emitted words depend on the author's perspective, not the other way around). Moreover, in our model the perspective influences both the word and topic distributions (through the topic and document components, respectively).

Inverse regression topic models (Rabinovich and Blei, 2014) use document feature values (such as political ideology) to alter the parameters of the topic-specific word distributions. This is an alternative to the more common approach to regression based topic modeling, where the variables affect the topic distributions rather than the word distributions. Our SPRITE-based model does both: the document features adjust the prior over topic distributions (through δ), but by tying together the document and topic components (with β), the document features also affect the prior over word distributions. To the best of our knowledge, this is the first topic model to condition both topic and word distributions on the same features.


Figure 6: Examples of topics (gray boxes) and components (colored boxes) learned on the Abstracts corpus with 50 topics using a factored structure. The components have been grouped into two factors, one factor with 3 components (left) and one with 7 (right), with two examples shown from each (labeled “Computational” and “Linguistics” for the first factor, and “Semantics” and “Syntax” for the second). Each topic prior draws from exactly one component from each factor.

Inverse regression topic models (Rabinovich and Blei, 2014) use document feature values (such as political ideology) to alter the parameters of the topic-specific word distributions. This is an alternative to the more common approach to regression-based topic modeling, where the variables affect the topic distributions rather than the word distributions. Our SPRITE-based model does both: the document features adjust the prior over topic distributions (through δ), but by tying together the document and topic components (with β), the document features also affect the prior over word distributions. To the best of our knowledge, this is the first topic model to condition both topic and word distributions on the same features.

The topic aspect model (Paul and Girju, 2010a) is also a two-dimensional factored model that has been used to jointly model topic and perspective (Paul and Girju, 2010b). However, this model does not use structured priors over the parameters, unlike most of the models discussed in §4.

An alternative approach to incorporating user preferences and expertise is interactive topic modeling (Hu et al., 2013), a complementary approach to SPRITE.

9 Discussion and Conclusion

We have presented SPRITE, a family of topic models that utilize structured priors to induce preferred topic structures. Specific instantiations of SPRITE are similar or equivalent to several existing topic models. We demonstrated the utility of SPRITE by constructing a single model with many different characteristics, including a topic hierarchy, a factorization of topic and perspective, and supervision in the form of document attributes. These structures were incorporated into the priors of both the word and topic distributions, unlike most prior work that considered one or the other. Our experiments explored how each of these model features affects performance, and our results showed that models with structured priors perform better than baseline LDA models.

Our framework offers clear advances over existing structured topic models. For example, SPRITE is more general and offers simpler inference than the shared components topic model (Gormley et al., 2010), and SPRITE allows for more flexible and scalable factored structures than FLDA, as described in earlier sections. Both of these models were motivated by their ability to learn interesting structures rather than their performance at any predictive task. Similarly, our goal in this study was not to provide state-of-the-art results for a particular task, but to demonstrate a framework for learning structures that are richer than previous structured models. Therefore, our experiments focused on understanding how SPRITE compares to commonly used models with similar structures, and how the different variants compare under different metrics. Ultimately, the model design choice depends on the application and the user's needs. By unifying such a wide variety of topic models, SPRITE can serve as a common framework for enabling model exploration and bringing application-specific preferences and structure into topic models.


Acknowledgments

We thank Jason Eisner and Hanna Wallach for helpful discussions, and Viet-An Nguyen for providing the Congressional debates data. Michael Paul is supported by a Microsoft Research PhD fellowship.

References

D. Andrzejewski, X. Zhu, and M. Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML.

R. Balasubramanyan and W. Cohen. 2013. Regularization of latent variable models to obtain sparsity. In SIAM Conference on Data Mining.

D. Blei and J. Lafferty. 2007. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35.

D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. 2003a. Hierarchical topic models and the nested Chinese restaurant process. In NIPS.

D. Blei, A. Ng, and M. Jordan. 2003b. Latent Dirichlet allocation. JMLR.

J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei. 2009. Reading tea leaves: How humans interpret topic models. In NIPS.

J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.

J. Eisenstein, A. Ahmed, and E. P. Xing. 2011. Sparse additive generative models of text. In ICML.

M.R. Gormley, M. Dredze, B. Van Durme, and J. Eisner. 2010. Shared components topic models. In NAACL.

T. Griffiths and M. Steyvers. 2004. Finding scientific topics. In Proceedings of the National Academy of Sciences of the United States of America.

Y. Hu, J. Boyd-Graber, B. Satinoff, and A. Smith. 2013. Interactive topic modeling. Machine Learning, 95:423–469.

J. Kivinen and M.K. Warmuth. 1997. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63.

J.B. Lewis and K.T. Poole. 2004. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2):105–127.

W. Li and A. McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML.

D. Mimno and A. McCallum. 2008. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI.

D. Mimno, W. Li, and A. McCallum. 2007. Mixtures of hierarchical topics with Pachinko allocation. In ICML.

D. Mimno, H.M. Wallach, E. Talley, M. Leenders, and A. McCallum. 2011. Optimizing semantic coherence in topic models. In EMNLP.

V. Nguyen, J. Boyd-Graber, and P. Resnik. 2013. Lexical and hierarchical topic regression. In NIPS.

M.J. Paul and M. Dredze. 2012. Factorial LDA: Sparse multi-dimensional text models. In NIPS.

M.J. Paul and M. Dredze. 2013. Drug extraction from the web: Summarizing drug experiences with multi-dimensional topic models. In NAACL.

M.J. Paul and R. Girju. 2010a. A two-dimensional topic-aspect model for discovering multi-faceted topics. In AAAI.

M.J. Paul and R. Girju. 2010b. Summarizing contrastive viewpoints in opinionated text. In EMNLP.

M.J. Paul, B.C. Wallace, and M. Dredze. 2013. What affects patient (dis)satisfaction? Analyzing online doctor ratings with a joint topic-sentiment model. In AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI.

M. Rabinovich and D. Blei. 2014. The inverse regression topic model. In ICML.

D. Ramage, D. Hall, R. Nallapati, and C.D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In EMNLP.

N.A. Smith and J. Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In COLING-ACL.

E.M. Talley, D. Newman, D. Mimno, B.W. Herr II, H.M. Wallach, G.A.P.C. Burns, M. Leenders, and A. McCallum. 2011. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6):443–444.

N. Ueda and R. Nakano. 1998. Deterministic annealing EM algorithm. Neural Networks, 11(2):271–282.

B.C. Wallace, M.J. Paul, U. Sarkar, T.A. Trikalinos, and M. Dredze. 2014. A large-scale quantitative analysis of latent factors and sentiment in online doctor reviews. Journal of the American Medical Informatics Association, 21(6):1098–1103.

H.M. Wallach, D. Mimno, and A. McCallum. 2009a. Rethinking LDA: Why priors matter. In NIPS.

H.M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. 2009b. Evaluation methods for topic models. In ICML.

C. Wang and D. Blei. 2009. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS.

M.D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.


