
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, Applications and Case Studies

A Model of Text for Experimentation in the Social Sciences

Margaret E. Roberts (a), Brandon M. Stewart (b), and Edoardo M. Airoldi (c)

(a) Department of Political Science, University of California San Diego, San Diego, CA, USA; (b) Department of Sociology, Princeton University, Princeton, NJ, USA; (c) Department of Statistics, Harvard University, Cambridge, MA, USA

ABSTRACT
Statistical models of text have become increasingly popular in statistics and computer science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this article, we develop a model of text data that supports this type of substantive research. Our approach is to posit a hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are specified as a simple generalized linear model on an arbitrary number of document-level covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods quantify the effect of newswire source on both the frequency and nature of topic coverage. Supplementary materials for this article are available online.

ARTICLE HISTORY
Received March; Revised October

KEYWORDS
Causal inference; Experimentation; High-dimensional inference; Social sciences; Text analysis; Variational approximation

1. Introduction

Written documents provide a valuable source of data for the measurement of latent linguistic, political, and psychological variables (e.g., Socher et al. 2009; Grimmer 2010; Quinn et al. 2010; Grimmer and Stewart 2013). Social scientists are primarily interested in how document metadata, that is, observable covariates such as author or date, influence the content of the text. With the rapid digitization of texts, larger and larger document collections are becoming available for analysis, for which such metadata information is recorded. A fruitful approach for the analysis of text data is the use of mixtures and mixed membership models (Airoldi et al. 2014a), often referred to as topic models in the literature (Blei 2012). While these models can provide insights into the topical structure of a document collection, they cannot easily incorporate the observable metadata information. Here, we develop a framework for modeling text data that can flexibly incorporate a wide range of document-level covariates and metadata, and capture their effect on topical content. We apply our model to learn about how media coverage of China's rise varies over time and by newswire service.

Quantitative approaches to text data analysis have a long history in the social sciences (Mendenhall 1887; Zipf 1932; Yule 1944; Miller, Newman, and Friedman 1958). Today, the most common representation of text data involves representing a document d as a vector of word counts, w_d ∈ Z_+^V, where each of the V entries maps to a unique term in a vocabulary of interest (with V in the order of thousands to tens of thousands) specified prior to the analysis. This representation is often referred to as the bag of words representation, since the order in which words are used within a document is completely disregarded. One milestone in the statistical analysis of text was the analysis of the disputed authorship of "The Federalist" articles (Mosteller and Wallace 1963, 1964, 1984), which featured an in-depth study of the extent to which assumptions used to reduce the complexity of text data representations hold in practice. Because the bag of words representation retains word co-occurrence information, but loses the subtle nuances of grammar and syntax, it is most appropriate for settings where the quantity of interest is a coarse summary such as topical content (Manning, Raghavan, and Schütze 2008; Turney and Pantel 2010). In recent years, there has been a surge of interest in methods for text data analysis in the statistics literature, most of which use the bag of words representation (e.g., Blei, Ng, and Jordan 2003; Erosheva, Fienberg, and Lafferty 2004; Griffiths and Steyvers 2004; Genkin, Lewis, and Madigan 2007; Jeske and Liu 2007; Airoldi et al. 2010; Taddy 2013; Jia et al. 2014). A few studies also test the appropriateness of the assumptions underlying such a representation (e.g., Airoldi and Fienberg 2003; Airoldi et al. 2006).

Perhaps the simplest topic model, to which, arguably, much of the recent interest in statistical text analysis research can be ascribed, is known as the latent Dirichlet allocation (LDA henceforth), or also as the generative aspect model (Blei, Ng, and Jordan 2001, 2003; Minka and Lafferty 2002). Consider a collection of D documents, indexed by d, each containing N_d words, a vocabulary of interest of V distinct terms, and K subpopulations, indexed by k and referred to as topics.

CONTACT Edoardo M. Airoldi [email protected] Department of Statistics, Harvard University, Cambridge, MA. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/JASA. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA. © American Statistical Association

Each topic is associated with a V-dimensional probability mass function, β_k, that controls the frequency according to which terms are generated from that topic. The data-generating process for document d assigns terms in the vocabulary to each of the N_d positions; instances of terms that fill these positions are typically referred to as the words. Terms in the vocabulary are unique, while distinct words in a document may instantiate multiple occurrences of the same term. The process begins by drawing a K-dimensional Dirichlet vector θ_d that captures the expected proportion of words in document d that can be attributed to each topic. Then for each position (or, equivalently, for each word) in the document, indexed by n, it proceeds by sampling an indicator z_{d,n} from a Multinomial_K(θ_d, 1) whose positive component denotes which topic such position is associated with. The process ends by sampling the actual word indicator w_{d,n} from a Multinomial_V(B z_{d,n}, 1), where the matrix B = [β_1 | · · · | β_K] encodes the distributions over terms in the vocabulary associated with the K topics.
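To make this data-generating process concrete, the following is a minimal sketch of it in R; the dimensions (D, V, K), the document-length distribution, and the Dirichlet hyperparameters are illustrative choices, not values taken from the article.

    ## Minimal R sketch of the LDA data-generating process; all dimensions
    ## and hyperparameters below are illustrative.
    set.seed(1)
    D <- 100; V <- 500; K <- 5
    rdirichlet <- function(alpha) { x <- rgamma(length(alpha), alpha); x / sum(x) }

    B <- t(replicate(K, rdirichlet(rep(0.05, V))))   # K x V matrix [beta_1 | ... | beta_K]'
    docs <- lapply(seq_len(D), function(d) {
      N_d     <- rpois(1, 100)                       # number of positions in document d
      theta_d <- rdirichlet(rep(1, K))               # expected topic proportions
      z       <- sample.int(K, N_d, replace = TRUE, prob = theta_d)  # topic indicators
      vapply(z, function(k) sample.int(V, 1L, prob = B[k, ]), integer(1))  # word indicators
    })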
In practice, social scientists often know more about a document than its word counts. For example, open-ended responses collected as part of a survey include additional information about the respondents, such as gender or political party (Roberts et al. 2014b). From a statistical perspective, it would be desirable to include additional covariates and information about the experimental design into the model to improve estimation of the topics. In addition, the relationships between the observed covariates and latent topics are most frequently the estimand of scientific interest. Here, we allow for such observed covariates to affect two components of the model: the proportion of a document devoted to a topic, which we refer to as topic prevalence, and the word rates used in discussing a topic, which we refer to as topical content.

We leverage generalized linear models (GLMs henceforth) to introduce covariate information into the model. Prior distributions with globally shared parameters in the latent Dirichlet allocation model are replaced with distributions parameterized by a linear function of observed covariates. Specifically, for topic prevalence, the Dirichlet distribution that controls the proportion of words in a document attributable to the different topics is replaced with a logistic Normal distribution with a mean vector parameterized as a function of the covariates (Aitchison and Shen 1980). For topical content, we define the distribution over the terms associated with the different topics as an exponential family model, similar to a multinomial logistic regression, parameterized as a function of the marginal frequency of occurrence for each term, and of deviations from it that are specific to topics, covariates, and their interactions. We shall often refer to the resulting model as the structural topic model (STM), because the inclusion of covariates is informative about structure in the document collection and its design. From an inferential perspective, including covariate information allows for partial pooling of parameters along the structure defined by the covariates.

As with other topic models, the exact posterior for the proposed model is intractable, and suffers from identifiability issues in theory (Airoldi et al. 2014a).
Inference is further complicated in our setting by the nonconjugacy of the logistic Normal with the multinomial likelihood. We develop a partially collapsed variational expectation-maximization algorithm that uses a Laplace approximation to the nonconjugate portion of the model (Dempster, Laird, and Rubin 1977; Liu 1994; Meng and Van Dyk 1997; Blei and Lafferty 2007; Wang and Blei 2013). This inference strategy provides a computationally efficient approach to model fitting that is sufficiently fast and well behaved to support the analysis of large collections of documents in practice. We use posterior predictive checks (Gelman, Meng, and Stern 1996) to examine the model and assess model fit, and tools for interpretation we developed in substantive companion articles (Roberts et al. 2014b; Lucas et al. 2015).

The central contribution of this article is twofold: we introduce a new model of text that can flexibly incorporate various forms of document-level information, and we demonstrate how this model enables an original analysis of the differences among newswire services, in the frequency with which they cover topics and the vocabulary with which they describe topics. In particular, we are interested in characterizing how Chinese sources represent topics differently than foreign sources, or whether they leave out specific topics completely. The model allows us to produce the first quantification of media slant in various Chinese and international newswire services over a 10-year period of China's rise. In addition, the model allows us to summarize slant more quickly than reading large swaths of text, the method more frequently used by China scholars. The article is organized as follows. We motivate the use of text analysis in the social sciences and provide the essential background for our model. We describe the structural topic model and discuss the proposed estimation strategy. We empirically validate the frequentist coverage of STM in a realistic simulation, and provide a comparative performance analysis with state-of-the-art models on real data. We use STM to study media coverage of China's rise by analyzing variations in topic prevalence and content across five different newswire services over time.

To make the model accessible to social scientists, we developed the R package stm (Roberts, Stewart, and Tingley 2014a), which handles model estimation, summary, and visualization (http://cran.r-project.org/web/packages/stm).

1.1. Statistical Analysis of Text Data in the Social Sciences

Our development of the proposed model is motivated by a common structure in the application of models for text data within the social sciences. In these settings, the typical application involves estimating latent topics for a corpus of interesting documents and subsequently comparing how topic proportions vary with an external covariate of interest. While informative, these applications raise a practical and theoretical tension. Documents are typically assumed to be exchangeable according to the model used for data analysis, but the exchangeability assumption is then often invalidated by the research findings reported.

This problem has motivated the development of a series of application-specific models designed to capture particular quantities of interest (Ahmed and Xing 2010; Grimmer 2010; Quinn et al. 2010; Gerrish and Blei 2012). Many of the models designed to incorporate various forms of meta-data allow the topic mixing proportions (θ_d) or the observed words (w) to be drawn from document-specific prior distributions rather

than globally shared priors α, β_k in the LDA model. We refer to the distribution over the document-topic proportions as the prior on topical prevalence, and we refer to the topic-specific distribution over words as the topical content prior. For example, the author-topic model allows the prevalence of topics to vary by author (Rosen-Zvi et al. 2004), the geographic topic model allows topical content to vary by region (Eisenstein et al. 2010), and the dynamic topic model allows topic prevalence and topic content to drift over time (Blei and Lafferty 2006).

However, for the vast majority of social scientists, designing a specific model for each application is prohibitively difficult. These users need a general model that balances the flexibility to accommodate unique research problems with ease of use.

Our approach to this task builds on two prior efforts to incorporate general covariate information into topic models, the Dirichlet-multinomial regression topic model of Mimno and McCallum (2008) and the sparse additive generative model of Eisenstein, Ahmed, and Xing (2011). The model of Mimno and McCallum (2008) replaces the Dirichlet prior on the topic mixing proportions in the LDA model with a Dirichlet-multinomial regression over arbitrary covariates. This allows the prior distribution over document-topic proportions to be specific to a set of observed document features through a linear model. Our model extends this approach by allowing covariance among topics and emphasizing the use of nonlinear functional forms of the features.

While the Dirichlet-multinomial regression model focuses on topical prevalence, the sparse additive generative model allows topical content to vary by observed categorical covariates. In this framework, topics are modeled as sparse log-transformed deviations from a baseline distribution over words.
Regularization to the corpus mean ensures that rarely occurring words do not produce the most extreme loadings onto topics (Eisenstein, Ahmed, and Xing 2011). Because the model is linear in the log-probability, it becomes simple to combine several effects (e.g., topic, covariate, or topic-covariate interactions) by simply including the deviations additively in the linear predictor. We adopt a similar infrastructure to capture changes in topical content and extend the setting to any covariates.

An alternative is to fit word counts directly as a function of observable covariates and fixed or random effects (Taddy 2013), at the cost of specifying thousands of such effects.

Our solution to the need for a flexible model combines and extends these existing approaches to create the structural topic model (STM henceforth), so-called because we use covariates to structure the corpus beyond exchangeable documents.

2. A Model of Text that Leverages Covariate Information

We introduce the basic structural topic model and notation in Section 2.1. We discuss how covariates inform the model in Section 2.1.1, and prior specifications in Section 2.1.2.

2.1. Basic Structural Topic Model

Recall that we index the documents by d ∈ {1, ..., D} and the words (or positions) within the documents by n ∈ {1, ..., N_d}. Primary observations consist of words w_{d,n} that are instances of unique terms from a vocabulary of terms, indexed by v ∈ {1, ..., V}, deemed of interest in the analysis. The model also assumes that the analyst has specified the number of topics K, indexed by k ∈ {1, ..., K}. Additional observed information is given by two design matrices, one for topic prevalence and one for topical content, where each row defines a vector of covariates for a given document specified by the analyst. The matrix of topic prevalence covariates is denoted by X, and has dimension D × P. The matrix of topical content covariates is denoted by Y and has dimension D × A. Rows of these matrices are denoted by x_d and y_d, respectively. Last, we define m_v to be the marginal log frequency of term v in the vocabulary, easily estimable from total counts (see, e.g., Airoldi, Cohen, and Fienberg 2005).

The proposed model can be conceptually divided into three components: (1) a topic prevalence model, which controls how words are allocated to topics as a function of covariates; (2) a topical content model, which controls the frequency of the terms in each topic as a function of covariates; and (3) a core language (or observation) model, which combines these two sources of variation to produce the actual words in each document. Next, we discuss each component of the model in turn. A graphical illustration of the full data-generating process for the proposed model is provided in Figure 1.

To illustrate the model clearly, we will specify a particular default set of priors. The model, however, as well as the R package stm, allows for a number of alternative prior specifications, which we discuss in Section 2.1.2.
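For concreteness, the following sketch shows how the two design matrices are supplied in practice through the R package stm; the corpus object out (as produced by the package's textProcessor and prepDocuments utilities) and the metadata columns source and month are hypothetical.

    ## A sketch of a fit with the stm R package; object names are hypothetical.
    library(stm)
    fit <- stm(documents  = out$documents,       # word indices and counts per document
               vocab      = out$vocab,           # vocabulary of terms
               K          = 100,                 # number of topics
               prevalence = ~ source + s(month), # topic prevalence design (X)
               content    = ~ source,            # topical content design (Y)
               data       = out$meta)
    labelTopics(fit)   # highest probability and FREX words per topic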

Figure . Agraphicalillustrationofthestructuraltopicmodel. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 991

The data-generating process for document d, given the number of topics K, the observed words {w_{d,n}}, the design matrices for topic prevalence X and topical content Y, scalar hyperparameters s, r, ρ, and the K-dimensional hyperparameter vector σ, is as follows:

  γ_k ~ Normal_P(0, σ_k² I_P),  for k = 1, ..., K − 1,   (1)
  θ_d ~ LogisticNormal_{K−1}(Γ′x_d′, Σ),   (2)
  z_{d,n} ~ Multinomial_K(θ_d),  for n = 1, ..., N_d,   (3)
  w_{d,n} ~ Multinomial_V(B z_{d,n}),  for n = 1, ..., N_d,   (4)
  β_{d,k,v} = exp(m_v + κ^(t)_{k,v} + κ^(c)_{y_d,v} + κ^(i)_{y_d,k,v}) / Σ_v exp(m_v + κ^(t)_{k,v} + κ^(c)_{y_d,v} + κ^(i)_{y_d,k,v}),  for v = 1, ..., V and k = 1, ..., K,   (5)

where Γ = [γ_1 | · · · | γ_{K−1}] is a P × (K − 1) matrix of coefficients for the topic prevalence model specified by Equations (1) and (2), and {κ^(t), κ^(c), κ^(i)} is a collection of coefficients for the topical content model specified by Equation (5) and further discussed below. Equations (3) and (4) denote the core language model.

The core language model allows for correlations in the topic proportions using the logistic normal distribution (Aitchison and Shen 1980; Aitchison 1982). For a model with K topics, we can represent the logistic normal by drawing η_d ~ Normal_{K−1}(µ_d, Σ) and mapping to the simplex, by specifying θ_{d,k} = exp(η_{d,k}) / Σ_{i=1}^K exp(η_{d,i}), where η_{d,K} is fixed to zero to render the model identifiable. Given the topic proportion vector θ_d, for each word within document d, a topic is sampled from a multinomial distribution z_{d,n} ~ Multinomial(θ_d), and conditional on such a topic, a word is chosen from the appropriate distribution over terms B z_{d,n}, also denoted β_{z_{d,n}} for simplicity. While in previous research (e.g., Blei and Lafferty 2007) both µ and B are global parameters shared by all documents, in the proposed model they are specified as a function of document-level covariates.
2.1.1. Modeling Topic Prevalence and Topic Content With Covariates

The topic prevalence component of the model allows the expected document-topic proportions to vary as a function of the matrix of observed document-level covariates (X), rather than arising from a single prior shared by all documents. We model the mean vector of the logistic normal as a simple linear model such that µ_d = Γ′x_d′, with an additional regularizing prior on the elements of Γ to avoid over-fitting. Intuitively, the topic prevalence model takes the form of a multivariate normal linear model with a single shared variance-covariance matrix of parameters. In the absence of covariates, but with a constant intercept, this portion of the model reduces to the model by Blei and Lafferty (2007).

To model the way covariates affect topical content, we draw on a parameterization that has proved useful in the text analysis literature for modeling differential word usage (e.g., Mosteller and Wallace 1984; Airoldi et al. 2006; Eisenstein, Ahmed, and Xing 2011). The idea is to parameterize the (multinomial) distribution of word occurrences in terms of log-transformed rate deviations from the rates of a corpus-wide background distribution m, which can be estimated or fixed to a distribution of interest. The log-transformed rate deviations can then be specified as a function of topics, of observed covariates, and of topic-covariate interactions. In the proposed model, the log-transformed rate deviations are denoted by a collection of parameters {κ}, where the superscript indicates which set they belong to, that is, t for topics, c for covariates, or i for topic-covariate interactions. In detail, κ^(t) is a K-by-V matrix containing the log-transformed rate deviations for each topic k and term v, over the baseline log-transformed rate for term v. These deviations are shared across all A levels of the content covariate Y_d. The matrix κ^(c) has dimension A × V, and it contains the log-transformed rate deviation for each level of the covariate Y_d and each term v, over the baseline log-transformed rate for term v. These deviations are shared across all topics. Finally, the array κ^(i) has dimension A × K × V, and it collects the covariate-topic interaction effects. For example, for the simple case where there is a single covariate (Y_d) denoting a mutually exclusive and exhaustive group of documents, such as newswire source, the distribution over terms is obtained by adding these log-transformed effects such that the rate β_{d,k,v} ∝ exp(m_v + κ^(t)_{k,v} + κ^(c)_{y_d,v} + κ^(i)_{y_d,k,v}), where m_v is the marginal log-transformed rate of term v. Typically, m_v is specified as the estimated (marginal) log-transformed rate of occurrence of term v in the document collection under study (see, e.g., Airoldi, Cohen, and Fienberg 2005), but it can alternatively be specified as any baseline distribution of interest. The content model is completed by positing sparsity-inducing priors for the {κ} parameters, so that topic and covariate effects represent sparse deviations from the background distribution over terms. We defer discussion of prior specification to Section 2.1.2. Intuitively, the proposed topical content model replaces the multinomial likelihood for the words with a multinomial logistic regression, where the covariates are the word-level topic latent variables {z_{d,n}}, the user-supplied covariates {Y_d}, and their interactions. In principle, we need not restrict ourselves to models with single categorical covariates; in practice, computational considerations dictate that the number of levels of topical content covariates be relatively small.

The specification of the topic prevalence model is inspired by generalized additive models (Hastie and Tibshirani 1990). Each covariate is included with B-splines (De Boor et al. 1978), which allows nonlinearity in the effects on the latent topic prevalence, but the covariates themselves remain additive in the specification. The inclusion of a particular covariate allows the model to borrow strength from documents with similar covariate values when estimating the document-topic proportions, analogously to partial pooling in other Bayesian hierarchical models (Gelman and Hill 2007). We also include covariates that affect the rate at which terms are used within a topic through the topical content model. Unlike covariates for topical prevalence, for each observed content covariate combination it is necessary to maintain a dense K × V matrix, namely, the expected number of occurrences of term v attributable to topic k, within documents having that observed covariate level.
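The following is a minimal sketch, with made-up dimensions and zero-initialized deviations, of how the topical content model assembles the distribution over terms in Equation (5): baseline log rates m_v plus topic, covariate, and interaction deviations, passed through a softmax.

    V <- 2518; K <- 100; A <- 5              # terms, topics, content-covariate levels
    m       <- log(rep(1 / V, V))            # baseline log rates (here: uniform)
    kappa_t <- matrix(0, K, V)               # topic deviations, K x V
    kappa_c <- matrix(0, A, V)               # covariate deviations, A x V
    kappa_i <- array(0, c(A, K, V))          # interaction deviations, A x K x V

    beta_dk <- function(k, a) {              # term distribution for topic k at level a
      eta <- m + kappa_t[k, ] + kappa_c[a, ] + kappa_i[a, k, ]
      exp(eta - max(eta)) / sum(exp(eta - max(eta)))   # softmax of Equation (5)
    }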

2.1.2. Prior Specifications

The prior specification for the topic prevalence parameters is a zero mean Gaussian distribution with a shared variance parameter, that is, γ_{p,k} ~ Normal(0, σ_k²), and σ_k² ~ Inverse-Gamma(a, b), where p indexes the covariates, k indexes the topics, and a, b are fixed hyperparameters (see the Appendix for more details). There is no prior on the intercept, if included as a covariate. This prior shrinks coefficients toward zero, but does not induce sparsity.

In the topical content specification, we posit a Laplace prior (Friedman, Hastie, and Tibshirani 2010) to induce sparsity on the collection of {κ} parameters. This is necessary for interpretability. See the Appendix for details of how the hyperparameters are calibrated.

2.2. Estimation and Interpretation

The full posterior of interest, p(η, z, κ, Γ, Σ | w, X, Y), is proportional to

  Π_{d=1}^D { Normal(η_d | X_d Γ, Σ) Π_{n=1}^{N_d} [ Multinomial(z_{d,n} | θ_d) Multinomial(w_{d,n} | β_{d,k=z_{d,n}}) ] } p(κ) p(Γ),

with θ_{d,k} = exp(η_{d,k}) / Σ_{i=1}^K exp(η_{d,i}) and β_{d,k,v} ∝ exp(m_v + κ^(t)_{k,v} + κ^(c)_{y_d,v} + κ^(i)_{y_d,k,v}), and the priors on the prevalence and content coefficients Γ, κ specific to the options chosen by the user.

As with most topic models, the posterior distribution for the structural topic model is intractable and so we turn to methods of approximate inference. To allow for ease of use in iterative model fitting, we use a fast variant of nonconjugate variational expectation-maximization (EM).

Traditionally, topic models have been fit using either collapsed Gibbs sampling or mean field variational Bayes (Blei, Ng, and Jordan 2003; Griffiths and Steyvers 2004). Because the logistic normal distribution introduces nonconjugacy, these standard methods are not available. The original work on logistic normal topic models used an approximate variational Bayes procedure by maximizing a novel lower bound on the marginal likelihood (Blei and Lafferty 2007) but the bound can be quite loose (Ahmed and Xing 2007; Knowles and Minka 2011). Later work drew on inference for logistic regression models (Groenewald and Mokgatlhe 2005; Holmes and Held 2006) to develop a Gibbs sampler using auxiliary variable schemes (Mimno, Wallach, and McCallum 2008). Recently, Chen et al. (2013) developed a scalable Gibbs sampling algorithm by leveraging the Polya-Gamma auxiliary variable scheme of Polson, Scott, and Windle (2013). Instead, we developed an approximate variational EM algorithm using a Laplace approximation to the expectations rendered intractable by the nonconjugacy (Wang and Blei 2013). To speed convergence, empirically, we also integrate out the word-level topic indicator z while estimating the variational parameters for the logistic normal latent variable, and then reintroduce it when maximizing the topic-word distributions, β. Thus inference consists in optimizing the variational posterior for each document's topic proportions in the E-step, and estimating the topical prevalence and content coefficients in the M-step.

2.2.1. Variational Expectation-Maximization

Recall that we can write the logistic normal document-topic proportions in terms of the (K − 1)-dimensional Gaussian random variable η_d such that θ_{d,k} = exp(η_{d,k}) / Σ_{k=1}^K exp(η_{d,k}), where η_d ~ Normal(x_d Γ, Σ) and η_{d,K} is set to 0 for identification. Inference involves finding the approximate posterior Π_d q(η_d) q(z_d), which maximizes the approximate evidence lower bound (ELBO),

  ELBO ≈ Σ_{d=1}^D E_q[log p(η_d | µ_d, Σ)] + Σ_{d=1}^D Σ_{n=1}^{N_d} E_q[log p(z_{d,n} | η_d)] + Σ_{d=1}^D Σ_{n=1}^{N_d} E_q[log p(w_{d,n} | β_{d,k=z_{d,n}}, z_{d,n})] + H(q),   (6)

where q(η_d) is fixed to be Gaussian with mean λ_d and covariance ν_d, and q(z_d) is a variational multinomial with parameter φ_d. H(q) denotes the entropies of the approximating distributions. We qualify the ELBO as approximate to emphasize that it is not a true bound on the marginal likelihood (due to the Laplace approximation) and it is not being directly maximized by the updates (see, e.g., Wang and Blei 2013, for more discussion).

In the E-step, we iterate through each document updating the variational posteriors q(η_d), q(z_d). In the M-step, we maximize the approximate ELBO with respect to the model parameters Γ, Σ, and κ. After detailing the E-step and M-step, we discuss convergence, properties, and initialization before summarizing the complete algorithm.

In practice, one can monitor convergence in terms of relative changes to the approximate ELBO. This boils down to a sum over the document-level contributions, and can be dramatically simplified from Equation (6) to the following,

  L_ELBO = Σ_{d=1}^D [ Σ_v w_{d,v} log(θ_d′ β_{d,·,v}) − 0.5 log|Σ| − 0.5 (λ_d − µ_d)′ Σ⁻¹ (λ_d − µ_d) + 0.5 log|ν_d| ].   (7)

Variational E-Step. Because the logistic-normal is not conjugate with the multinomial, q(η_d) does not have a closed-form update. We instead adopt the Laplace approximation advocated in Wang and Blei (2013), which involves finding the maximum a posteriori (MAP) estimate η̂_d and approximating the posterior with a quadratic Taylor expansion. This results in a Gaussian form for the variational posterior q(η_d) ≈ N(η̂_d, −∇²f(η̂_d)⁻¹), where ∇²f(η̂_d) is the Hessian of f(η_d) evaluated at the mode. In standard variational approximation algorithms for the correlated topic model (CTM), inference iterates between the word-level latent variables q(z_d) and the document-level latent variables q(η_d) until local convergence. This process can be slow, and so we integrate out the latent variables z and find the joint optimum using quasi-Newton methods

(Khan and Bouchard 2009). Solving for η̂_d for a given document amounts to optimizing the function

  f(η_d) ∝ −0.5 (η_d − µ_d)′ Σ⁻¹ (η_d − µ_d) + Σ_v c_{d,v} log( Σ_k β_{d,k,v} e^{η_{d,k}} ) − W_d log( Σ_k e^{η_{d,k}} ),   (8)

where c_{d,v} is the count of the vth term in the vocabulary within the dth document and W_d is the total count of words in the document. We optimize the objective with quasi-Newton methods using the gradient

  ∇f(η_d)_k = Σ_v c_{d,v} ⟨φ_{d,v,k}⟩ − W_d θ_{d,k} − [Σ⁻¹(η_d − µ_d)]_k,   (9)

where θ_d is the simplex-mapped version of η_d and we define the expected probability of observing a given topic-word pair as ⟨φ_{d,v,k}⟩ = exp(η_{d,k}) β_{d,v,k} / Σ_k exp(η_{d,k}) β_{d,v,k}. This gives us our variational posterior q(η_d) = N(λ_d = η̂_d, ν_d = −∇²f(η̂_d)⁻¹). We then solve for q(z_d) in closed form,

  φ_{d,n,k} ∝ exp(λ_{d,k}) β_{d,k,w_n}.   (10)
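The following is a sketch of this document-level E-step in R, implementing Equations (8) and (9) directly and maximizing with BFGS; the inputs (term counts c_v, the K × V matrix beta, and the prior mean mu and inverse covariance Siginv over the K − 1 free coordinates) are assumed to come from the surrounding algorithm.

    ## Sketch of the E-step for one document: maximize Equation (8) in eta.
    estep_doc <- function(c_v, beta, mu, Siginv) {
      K <- nrow(beta); W <- sum(c_v); nz <- which(c_v > 0)
      f <- function(eta_free) {              # Equation (8), up to an additive constant
        eta <- c(eta_free, 0)                # eta_K is fixed at zero
        r   <- eta_free - mu
        -0.5 * sum(r * (Siginv %*% r)) +
          sum(c_v[nz] * log(colSums(beta[, nz, drop = FALSE] * exp(eta)))) -
          W * log(sum(exp(eta)))
      }
      gr <- function(eta_free) {             # Equation (9)
        eta   <- c(eta_free, 0)
        theta <- exp(eta) / sum(exp(eta))
        phi   <- beta[, nz, drop = FALSE] * exp(eta)   # unnormalized <phi_dvk>
        phi   <- sweep(phi, 2, colSums(phi), "/")
        g     <- phi %*% c_v[nz] - W * theta - c(Siginv %*% (eta_free - mu), 0)
        g[-K]                                # drop the fixed coordinate
      }
      opt <- optim(mu, f, gr, method = "BFGS", control = list(fnscale = -1))
      list(lambda = opt$par,                             # variational mode lambda_d
           nu = solve(-optimHess(opt$par, f, gr)))       # nu_d = -Hessian^{-1}
    }

The Hessian at the mode, obtained here with optimHess, yields the variational covariance ν_d = −∇²f(η̂_d)⁻¹ used in Equation (6).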

M-Step. In the M-step, we update the coefficients in the topic prevalence model, the topical content model, and the global covariance matrix.

The prior on document-topic proportions maximizes the approximate ELBO with respect to the document-specific mean µ_{d,k} = X_d γ_k and the topic covariance matrix Σ. Updates for γ_k correspond to a linear regression for each topic under the user-specified prior, with λ_k as the outcome variable. By default we give the γ_k a Normal(0, σ_k²) prior where σ_k² is either manually selected or given a broad inverse-gamma prior. We also provide an option to estimate γ_k using an L1 penalty.

The matrix Σ is then estimated as a convex combination of the maximum likelihood estimate (MLE) and a diagonalized form of the MLE,

  Σ̂_MLE = (1/D) Σ_d [ ν_d + (λ_d − X_d Γ̂)(λ_d − X_d Γ̂)′ ],
  Σ̂ = w_Σ diag(Σ̂_MLE) + (1 − w_Σ) Σ̂_MLE,   (11)

where the weight w_Σ ∈ [0, 1] is set by the user and we default to zero.

Updates for the topic-word distributions correspond to estimation of the coefficients (κ) in a multinomial logistic regression model where the observed words are the output, and the design matrix includes the expectations of the word-level topic assignments E[q(z_d)] = φ_d, the topical content covariates Y_d, and their interactions. The intercept m is fixed to be the empirical log probability of the terms in the corpus. (See the Appendix for details.)
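A minimal sketch of the covariance update in Equation (11); the inputs (the D × (K − 1) matrix Lambda of variational means, the list nu of document covariances, and the design and coefficient matrices X and Gamma) are assumed given.

    ## Covariance update of Equation (11): convex combination of the MLE
    ## and its diagonalized form, with user weight w (default 0).
    update_sigma <- function(Lambda, nu, X, Gamma, w = 0) {
      E <- Lambda - X %*% Gamma                                     # lambda_d - X_d Gamma
      Sigma_mle <- (Reduce(`+`, nu) + crossprod(E)) / nrow(Lambda)  # MLE
      w * diag(diag(Sigma_mle)) + (1 - w) * Sigma_mle
    }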
Remarks on Inference. Much progress on the analysis of the behavior of the inference task in mixed membership models has been accomplished in the past few years. A thread of research in applied statistics has explored the properties of the inference task in mixed membership models, empirically, for a number of model variants (see, e.g., Pritchard et al. 2000; Blei, Ng, and Jordan 2003; Erosheva, Fienberg, and Lafferty 2004; Braun and McAuliffe 2010). While, from a theoretical perspective, mixed membership models similar to the one we consider in this article suffer from multiple symmetric modes in the likelihood defining an equivalence class of solutions (see, e.g., Stephens 2000; Buot and Richards 2006; Airoldi et al. 2014b), a number of successful solutions exist to mitigate the issue in practice, such as using multiple starting points, clever initialization, and procrustes transforms to identify and estimate a canonical element of the equivalence class of solutions (Hurley and Cattell 1962; Wallach et al. 2009b). The takeaway from these articles, which report extensive empirical evaluations of the inference task in mixed membership models, is that inference is expected to have good frequentist properties. More recently, a few articles have been able to analyze theoretical properties of the inference task (Mukherjee and Blei 2009; Tang et al. 2014; Nguyen 2015). These articles essentially show that inference on the mixed membership vectors has good frequentist properties, thus providing a welcome confirmation of the earlier empirical studies, but also conditions under which inference is expected to behave well. While exactly characterizing the theoretical complexity of the optimization problem is beyond the scope of this article, we note that inference even in simple topic models has been shown to be NP-hard (Arora, Ge, and Moitra 2012). In the next section, we carry out an extensive empirical evaluation, including a frequentist coverage analysis, in scenarios that closely resemble real data, and a comparative performance analysis with state-of-the-art methods, in out-of-sample prediction on real data. These evaluations provide confidence in the results and conclusions we report in the case study. An important component of our strong performance in these settings is the use of an initialization strategy based on the spectral method of moments algorithm of Arora et al. (2013). We describe this approach and compare its performance to a variety of alternatives in Roberts, Stewart, and Tingley (2016).

2.2.2. Interpretation

After fitting the model, we are left with the task of summarizing the topics in an interpretable way (Chang et al. 2009). The majority of topic models are summarized by the most frequent terms within a topic, although there are several methods for choosing higher order phrases (Mei, Shen, and Zhai 2007; Blei and Lafferty 2009). Instead, here we use a metric to summarize topics that combines term frequency and exclusivity to that topic into a univariate summary referred to as FREX (Bischof and Airoldi 2012; Airoldi and Bischof in press). This statistic calculates the harmonic mean of the empirical CDF of a term's frequency under a topic with the empirical CDF of exclusivity to that topic. Denoting the K × V matrix of topic-conditional term probabilities as B, the FREX statistic is defined as

  FREX_{k,v} = ( ω / ECDF(β_{k,v} / Σ_{j=1}^K β_{j,v}) + (1 − ω) / ECDF(β_{k,v}) )⁻¹,

where ω is a weight that balances the influence of frequency and exclusivity, which we set to 0.5. The harmonic mean ensures that chosen terms are both frequent and exclusive, rather than simply an extreme on a single dimension. We use a plugin estimator for the FREX statistic using the collection of B coefficients estimated using variational EM.
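A minimal sketch of the FREX computation from the formula above, with the ECDFs taken across terms within each topic; beta is a K × V matrix of topic-conditional term probabilities.

    ## FREX: harmonic mean of within-topic ECDFs of exclusivity and frequency.
    frex <- function(beta, omega = 0.5) {
      excl <- sweep(beta, 2, colSums(beta), "/")   # exclusivity beta_kv / sum_j beta_jv
      out  <- beta
      for (k in seq_len(nrow(beta))) {
        F_freq <- ecdf(beta[k, ]); F_excl <- ecdf(excl[k, ])
        out[k, ] <- 1 / (omega / F_excl(excl[k, ]) +
                         (1 - omega) / F_freq(beta[k, ]))
      }
      out
    }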

3. Empirical Evaluation and Data Analysis

In this section, we demonstrate that our proposed model is useful with a combination of simulation evidence and an example application in political science. From a social science perspective, we are interested in studying how media coverage of China's rise varies between mainstream Western news sources and the Chinese state-owned news agency, Xinhua. We use the STM on a corpus of newswire reports to analyze the differences in both topic prevalence and topical content across five major news agencies.

Before proceeding to our application, we present a series of simulation studies. In Section 3.1, we start with a very simple simulation that captures the intuition of why we expect the model to be useful in practice. This section also lays the foundation for our simulation procedures. In Section 3.2, we demonstrate that the model is able to recover parameters of interest in a more complicated simulation setting that closely parallels our real data. In Section 3.3, we further motivate our applied question and present our data. Using the China data we perform a held-out likelihood comparison to three competing models (Section 3.3.1) and check model fit using posterior predictive checks (Section 3.3.2). Finally, having validated the model through simulation, held-out experiments, and model checking, we present our results in Section 3.3.3.

3.1. Estimating Nonlinear Covariate Effects

In this simulation, we build intuition for why including covariate information into the topic model is useful for recovering trends in topical prevalence. We compare STM with latent Dirichlet allocation (LDA) using a very simple data-generating process that generates 100 documents using three topics and a single continuous covariate. We start by drawing the topic word distributions for each topic, β_k ~ Dirichlet_49(0.05). Collecting the topic word distributions into the 3 by 50 matrix B, each document is simulated by sampling: N_d ~ Pois(50), x_d ~ Uniform(0, 1), θ_d ~ LogisticNormal_2(µ = (0.5, cos(10x_d)), Σ = 0.5I), and w_{d,n} ~ Multinomial(Bθ_d), where we have omitted the token-level latent variable z to reduce sampling variance.
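The following sketch reproduces this data-generating process in R; the random seed is arbitrary, and the words are drawn directly from the mixture B′θ_d, omitting z as described above.

    ## Simulation of Section 3.1: 3 topics, 50-term vocabulary, 100 documents.
    set.seed(2)
    rdirichlet <- function(alpha) { x <- rgamma(length(alpha), alpha); x / sum(x) }
    D <- 100; V <- 50; K <- 3
    B <- t(replicate(K, rdirichlet(rep(0.05, V))))       # 3 x 50 topic-term matrix

    x    <- runif(D)                                     # single continuous covariate
    docs <- lapply(seq_len(D), function(d) {
      N_d   <- rpois(1, 50)
      eta   <- c(rnorm(2, mean = c(0.5, cos(10 * x[d])), sd = sqrt(0.5)), 0)
      theta <- exp(eta) / sum(exp(eta))                  # logistic-normal theta_d
      ## term counts drawn directly from the mixture, omitting z as in the text
      tabulate(sample.int(V, N_d, replace = TRUE,
                          prob = drop(crossprod(B, theta))), V)
    })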
We simulate from this data-generating process 50 times. For each simulated dataset, we fit an LDA model using collapsed Gibbs sampling and an STM model. For both cases, we use the correctly specified number of topics. For STM, we specify the model with the covariate x_d for each document using a B-spline with 10 degrees of freedom. Crucially, we do not provide it any information about the true functional form. LDA cannot use the covariate information.

Interpreting the simulation results is complicated due to posterior invariance to label switching. For both LDA and STM, we match the estimated topics to the simulated parameters using the Hungarian algorithm to maximize the dot product of the true θ and the MAP estimate (Papadimitriou and Steiglitz 1998; Hornik 2005).

In Figure 2, we plot the Loess-smoothed (span = 1/3) relationship between the covariate and the MAP estimate for θ_d of the second topic. Each line corresponds to one run of the model and the true relationship is depicted with a thick black line. For comparison, the third panel shows the case using the true values of θ. While the fits based on the LDA model vary quite widely, the proposed model fits essentially all 50 samples with a recognizable representation of the true functional form. This is in some sense not at all surprising: the proposed model has access to valuable information about the covariate that LDA does not incorporate. The result is a very favorable bias-variance tradeoff in which our prior produces a very mild bias in the estimate of the covariate effects in return for a substantial variance reduction across simulations.

This simulation demonstrates that STM is able to capture a nonlinear covariate effect on topical prevalence. The focus here on the document-topic proportions (θ_d) differs from prior work in computer science, which typically focuses on the recovery of the topic-word distributions (β_k). Recovery of β_k is an easier task in the sense that the parameters are global and our estimates can be expected to improve as the number of documents increases (Arora et al. 2013). By contrast, θ_d is a document-level parameter where it makes less sense to speak of the number of words increasing toward infinity. Nevertheless, estimates of covariate relationships based on the document-level parameters θ_d are often the primary focus for applied social scientists and thus we emphasize them here.

Figure . Plot of fitted covariate-topic relationships from  simulated datasets using LDA and the proposed structural topic model of text. The third panel shows the estimated relationship using the true values of the topic and thus only reflects sampling variability in the data-generating process. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 995

3.2. Frequentist Coverage Evaluation in a Realistic Setting

In this section, we expand the quantitative evaluation of the proposed model to a more complex and realistic setting. Using the fitted model from the application in Section 3.3.3 as a reference, we simulate synthetic data from the estimated model parameters. The simulated dataset includes 11,980 documents, a vocabulary of V = 2518 terms, K = 100 topics, and covariates for both topic prevalence and topical content. We set the true values of θ and β to the MAP estimates of the reference model and simulate new observed words as above. We then fit the model to the synthetic documents using the same settings (and observed covariates) as we did in estimating the reference model. We repeat this process 100 times, and, as above, align the topics to the reference model using the Hungarian algorithm. This is a substantially more rigorous test of the inference procedure. With 100 topics, a content covariate with 5 levels, and 2518 vocabulary terms, there are over 1.2 million topic-word probabilities that need to be estimated. The documents themselves are on average 167 words long, and for each one of them over 100 topic proportions need to be estimated.

We evaluate the simulations by examining the frequentist coverage of the credible interval for θ and the expected error between the MAP estimate and the truth. The most straightforward method for defining credible intervals for θ is using the Laplace approximation to the unnormalized topic proportions η. By simulating draws from the variational posterior over η and applying the softmax transformation, we can recover the credible intervals for θ. However, this procedure poses a computational challenge, as the covariance matrix ν_d, which is of dimension (K − 1) × (K − 1), cannot easily be stored for each document, and recalculating ν_d can be computationally unfeasible. Instead, we introduce a simpler global approximation of the covariance matrix ν_d, which leverages the MLE of the global covariance matrix Σ,

  ν̃ = Σ̂_MLE − (1/D) Σ_d (λ_d − X_d Γ̂)(λ_d − X_d Γ̂)′ = (1/D) Σ_d ν_d.   (12)

The approximation ν̃ equals the sample average of the estimated document-specific covariance matrices {ν_d}. Under this approximation, it is still necessary to simulate from the multivariate Normal variational posterior, but there are substantial computational gains from avoiding the need to recalculate the covariance matrix for each document. As we show next, this approximation yields credible intervals with good coverage properties. To summarize, for each document we simulate 2500 draws from the variational posterior N(λ̂_d, ν̃), using the document-specific variational mode λ̂_d and the global approximation ν̃ to the covariance matrix. We then apply the softmax transformation to these draws and recover the 95% credible interval of θ_d. We calculate coverage along each topic separately.

The left panel of Figure 3 shows boxplots of the coverage rates grouped by the size of the true θ, with the dashed line indicating the nominal 95% coverage. We can see that for very small values of θ (<0.05) and moderate to large values (>0.15), coverage is extremely close to the nominal 95% level. The observed discrepancies between empirical and nominal coverage are reasonable. There are several sources of variability that contribute to these deviations. First, the variational posterior is conditional on the point estimates of the topic-word distributions β̂, which are estimated with error. Many of the documents are quite short relative to the total number of topics, thus the accuracy of the Laplace approximation may suffer. Finally, the optimization procedure only finds a local optimum.
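A sketch of this credible-interval construction in R, using MASS::mvrnorm for the multivariate normal draws; the variational mode lambda_d and the global covariance approximation nu_tilde of Equation (12) are assumed given.

    ## Per-document credible intervals: draw eta, softmax, take quantiles.
    library(MASS)
    theta_ci <- function(lambda_d, nu_tilde, n = 2500, level = 0.95) {
      eta   <- cbind(mvrnorm(n, mu = lambda_d, Sigma = nu_tilde), 0)  # eta_K = 0
      theta <- exp(eta) / rowSums(exp(eta))                           # softmax by row
      a <- (1 - level) / 2
      apply(theta, 2, quantile, probs = c(a, 1 - a))  # interval per topic
    }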
Next we consider how well the MAP estimates of θ compare to the true values. The right panel of Figure 3 provides a series of boxplots of the expected L1 error grouped by the true θ. For very small values of θ, the estimates are extremely accurate, and the size of the errors grows little as the true parameter value increases.

Figure . Coverage rates for a 95% credible interval on the document-topic proportions (θ )inasimulatedK 100 topic model. The left panel shows the distribution of d = coverage rates on a nominal 95% credible interval grouped by the size of the true θ . The right panel shows the distribution of the L errors, E[ θ θ ],wheretheθ is d 1 d ˆd ˆd the MAP estimate. | − | 996 M. E. ROBERTS ET AL. the size of the errors grows little as the true parameter value To explore how diferent news agencies have treated China’s increases. For very large values of θ,thereisasmall,butper- rise diferently, we analyze a stratifed random sample (Rosen- sistent, negative bias that results in underestimation of the large baum 1987)of11,980newsreportscontainingtheterm“China” elements of θ. dated from 1997–2006 and originating from fve diferent inter- This simulation represents a challenging case but the model national news sources. For each document in our sample, we performs well. Additional simulation results can be found in observe the day it was written and the newswire service pub- Roberts et al. (2014b)includingapermutationstyletestfortopi- lishing the report. Our data include fve news sources: Agence cal prevalence covariate efects in which a covariate is randomly France Presse (AFP), the Associated Press (AP), British Broad- permuted and the model is repeatedly reestimated. This can help casting Corporation (BBC), Japan Economic Newswire (JEN), the analyst determine if there is a risk of overftting in reported and Xinhua (XIN), the state-owned Chinese news agency. We covariate efects. Next, we validate the model using real data. include the month a document was written and the news agency as covariates on topical prevalence. We also include news agency as a covariate afecting topical content to estimate how topics are discussed in diferent ways by diferent news agencies. In 3.3. Media Coverage of China’s Rise our case study, we estimated the number of topics to be 100, Over the past decade, “rising” China has been a topic of conver- by evaluating and maximizing topics’ coherence using a cross- sations, news sources, speeches, and lengthy books. However, validation scheme while changing the number of topics (Airoldi what rising China means for China, the West and the rest of the et al. 2010). world is subject to much intense debate (Ikenberry 2008;Fergu- son 2010). Tellingly, both Western countries and China accuse each other of slanting their respective medias to obfuscate the ... Comparative Performance Evaluation with true quality of Chinese governance or meaning of China’s new- State-of-the-Art found power (Fang 2001;JohnstonandStockmann2007). West- To provide a fully automated comparison of our model to exist- ern “slant” and Chinese censorship and propaganda have been ing alternatives, we estimate the heldout likelihood using the blamed for polarizing views among the American and Chi- document completion approach (Asuncion et al. 2009; Wallach nese public (Roy 1996;JohnstonandStockmann2007), possi- et al. 2009b). To demonstrate that the covariates provide use- bly increasing the probability of future confict between the two ful predictive information, we compare the proposed structural countries. topic model (STM) to latent Dirichlet allocation (LDA), the In Section 3.3,westudybothWesternandChinesemedia Dirichlet multinomial regression topic model (DMR), and the slant about China’s rise through a collection of newspapers con- sparse additive generative text model (SAGE). 
We use a measure of predictive power to evaluate comparative performance among these models: for a subset of the documents, we hold back half of the document and evaluate the likelihood of the held-out words (Asuncion et al. 2009; Paisley, Wang, and Blei 2012). Higher numbers indicate a more predictive model.

Figure 4 shows the heldout likelihood for a variety of topic values. We show two plots. On the left is the average heldout likelihood for each model on 100 datasets, and their 95% quantiles. At first glance, in this plot, it seems that STM is doing much better or about the same as the other three models. However, looking at the second plot, the paired differences between the models on each individual dataset, we see that STM consistently outperforms all other models when the models are run on the same dataset.

Figure . STM versus SAGE, LDA, and DMR heldout likelihood comparison. On the left is the mean heldout likelihood and % quantiles. On the right is the mean paired difference between the three comparison models and STM. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 997

Figure . Posterior predictive checks using the methodology outlined in Mimno and Blei (). The plot shows the top ten most probable words for each of three topics marginalizing over the covariate-specific word distributions. The x-axis gives the instantaneous mutual information that would be in the true data-generating process. The black closed circle gives the observed value. the models on each individual dataset, we see that STM consis- is the observed word, k is the topic, and H()is the entropy. When tently outperforms all other models when the models are run on the model assumptions hold, we expect this quantity to be close the same dataset. With the exception of the 40 topic run, STM to zero because the entropy of each word should be the same as does better than all models in every dataset for every topic num- the entropy of the topic distribution under the data-generating ber. Focusing on paired comparison suggests that STM is the process. To provide a reference for the observed value, we plot preferred choice for prediction. the value for the top 10 words for three diferent topics along The main takeaway from this table is that STM performs sig- with 20 draws from the simulated posterior predictive distribu- nifcantly better than competing models, except for the case of tion (Gelman, Meng, and Stern 1996;MimnoandBlei2011). 40 topics, when it has comparable predictive ability to Dirich- Figure 5 gives an example of these checks for three topics. let multinomial regression model. This suggests that including The posterior predictive checks give us an indication of where information on topical prevalence and topical content aids in the model assumptions do and do not hold. Cases where there prediction. Further, STM has more interpretable quantities of is a large gap between the observed value (dark circle) and interest than its closest competitor because it allows correlations the reference distribution (open circles) indicate cases where between topics and covariates on topic content. We cover these the model assumptions do not hold. Generally, these discrepan- qualitative advantages in the next section. cies occur for terminology, which is specifc to a sub-component of the topic. For example, in the left plot on SARS/avian fu, the two terms with the greatest discrepancies are the word stems ... Assessing Model Fit for “SARS” and “bird.” The distribution of occurrence for these The most efective method for assessing model ft is to carefully terms would naturally be heavy-tailed, in the sense that once we read documents that are closely associated with particular top- have observed the occurrence of the term in a document, the ics to verify that the semantic concept covered by the topic is likelihood observing it again would increase. A model that splits refected in the text. The parameter θ provides an estimate of d SARS and avian fu into separate topics would be unlikely to each document’s association with every topic making it straight- have this problem. However, for our purposes here combining forward to efectively direct analyst engagement with the texts them into one topic is not a problem. (see, e.g., Roberts, Stewart, and Tingley 2014a). An overview of manual validation procedures can be found in Grimmer and Stewart (2013). When automated tools are required, we can use the frame- ... Substantive Analysis of Differential Newswire work of posterior predictive checks to assess components of Reporting model ft (Gelman, Meng, and Stern 1996). 
Mimno and Blei (2011) outlined a framework for posterior predictive checks for the latent Dirichlet allocation model using mutual information between document indices and observed words as the realized discrepancy function. Under the data-generating process, knowing the document index would provide us no additional information about the terms it contains after conditioning on the topic. In practice, topical words often have heavy-tailed distributions of occurrence, thus we may not expect independence to hold (Doyle and Elkan 2009).

As in Mimno and Blei (2011), we operationalize the check using the instantaneous mutual information between words and document indices conditional on the topic: IMI(w, D | k) = H(D | k) − H(D | W = w, k), where D is the document index, w is the observed word, k is the topic, and H(·) is the entropy. When the model assumptions hold, we expect this quantity to be close to zero because the entropy of each word should be the same as the entropy of the topic distribution under the data-generating process. To provide a reference for the observed value, we plot the value for the top 10 words for three different topics along with 20 draws from the simulated posterior predictive distribution (Gelman, Meng, and Stern 1996; Mimno and Blei 2011). Figure 5 gives an example of these checks for three topics.

The posterior predictive checks give us an indication of where the model assumptions do and do not hold. Cases where there is a large gap between the observed value (dark circle) and the reference distribution (open circles) indicate cases where the model assumptions do not hold. Generally, these discrepancies occur for terminology which is specific to a sub-component of the topic. For example, in the left plot on SARS/avian flu, the two terms with the greatest discrepancies are the word stems for "SARS" and "bird." The distribution of occurrence for these terms would naturally be heavy-tailed, in the sense that once we have observed the occurrence of the term in a document, the likelihood of observing it again would increase. A model that splits SARS and avian flu into separate topics would be unlikely to have this problem. However, for our purposes here, combining them into one topic is not a problem.
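A minimal sketch of the IMI computation; the input counts is a hypothetical documents × terms matrix of token counts assigned to topic k, which in practice would be derived from the fitted word-level topic assignments.

    ## IMI(w, D | k) = H(D | k) - H(D | W = w, k) from topic-k token counts.
    entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
    imi <- function(counts, w) {
      H_D  <- entropy(rowSums(counts) / sum(counts))   # H(D | k)
      H_Dw <- entropy(counts[, w] / sum(counts[, w]))  # H(D | W = w, k)
      H_D - H_Dw
    }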

3.3.3. Substantive Analysis of Differential Newswire Reporting

While it is useful to demonstrate that STM shows predictive gains, our primary motivation in developing the STM is to create a tool that can help us answer social science questions. Specifically, we want to study how the various news sources cover topics related to the last 10 years of China's development and the vocabulary with which these newswires describe the same events. We are interested in how Chinese and Western sources represent prominent international events during this time period differently, that is, describe the same event with different vocabulary, and the differences between how much Chinese and Western sources discuss a particular topic. Accusations of "slant" have been largely anecdotal, and the STM provides us with a unique opportunity to measure characterizations of news about China on a large scale.

For the purposes of the following analyses, we labeled each topic individually by looking at the most frequent words and at the most representative articles. We start with a general topic related to Chinese governance, which includes Chinese government strategy, leadership transitions, and future policy. We might call this a "China trajectory" topic. Figure 6 shows the highest probability words in this topic for each of the news sources. The news sources have vastly different accounts of China's trajectory. AFP and AP talk about China's rule with words like "Tiananmen," referring to the 1989 Tiananmen student movement, and "Zhao," referring to the reformer Zhao Ziyang who fell out of power during that incident due to his support of the students. Even though Tiananmen occurred 10 years before our sample starts, these Western news sources discuss it as central to China's current trajectory.

Figure 6. China trajectory topic. Each group of words contains the highest probability words for the news source.

Xinhua, on the other hand, has a more positive view of China's direction, with words like "build" and "forward," omitting words like "corrupt" or mentions of the Tiananmen crackdown. Interestingly, the BBC and JEN also have a forward-looking view on China's trajectory, discussing "reform," "advancing," and references to the formation of laws in China. The analysis provides clear evidence of varying perspectives in both Western and Chinese sources on China's political future, and surprisingly shows significant variation within Western sources.

Second, we turn to a very controversial event within China during our time period, the crackdown on Falungong. Falungong is a spiritual group that became very popular in China during the 1990s. Due to the scale and organization of the group, the Chinese government outlawed Falungong beginning in 1999, arresting followers and dismantling the organization. This topic appears within all of our news sources, since the crackdown occurred within the time period we are studying.

Figure 7 (left panel) shows the different ways in which the news sources portray the Falungong incident. Again, we see that the AP and AFP have the most "Western" view of the incident, using words like "dissident," "crackdown," and "activist." The BBC, on the other hand, uses much milder language to talk about the incident, with words such as "illegal" or "according." JEN talks a lot about asylum for those fleeing China, with words such as "asylum," "refugee," and "immigration." Xinhua, on the other hand, talks about the topic using exclusively language about crime, for example, "crime," "smuggle," "suspect," and "terrorist." Again, we see not only the difference between Western and Chinese sources, but interestingly large variation in language within Western sources.

Since we included news source as a covariate in the topical prevalence part of the model, we can estimate the differences in frequency, or how much each of the news sources discussed the Falungong topic. As shown in Figure 7 (right panel), we see unsurprisingly that Xinhua talks significantly less about the topic than Western news sources. This would be unsurprising to China scholars, but reassuringly agrees with expectations. Interestingly, the Western news sources we would identify to have the most charged language, AFP and AP, also talk about the topic more. Slant has a fundamental relationship with topical prevalence, where those with a positive slant on China talk about negative topics less, and those with a negative slant on China talk about negative topics more.
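A sketch of how such a prevalence comparison can be run with the stm package's estimateEffect function; the fitted model fit, the metadata meta, and the index of the Falungong topic (here, 7) are hypothetical.

    ## Prevalence differences by news source for a single topic.
    library(stm)
    eff <- estimateEffect(c(7) ~ source + s(month), fit, metadata = meta)
    summary(eff)                              # covariate effects on topic prevalence
    plot(eff, covariate = "source", topics = 7,
         method = "pointestimate")            # mean prevalence by news agency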
of China’s direction, with words like “build” and “for- In general, our model picks up both short-lived events like ward,” omitting words like “corrupt” or mentions of the the Olympics and invasion of Iraq, and long-term topical trends, Tiananmen crackdown. Interestingly, the BBC and JEN also such as discussion about North Korea and nuclear weapons over
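As a concrete illustration of the workflow behind Figures 6 and 7, the following sketch shows how a model of this kind might be fit and the source-level prevalence differences estimated with the stm R package. It is a sketch rather than the authors' replication code: the objects docs, vocab, and meta, the choice K = 50, and topic index 7 (standing in for the Falungong topic) are all hypothetical.

library(stm)

fit <- stm(documents = docs, vocab = vocab,
           K = 50,                           # number of topics, for illustration
           prevalence = ~ source + s(time),  # topic frequency varies by source and time
           content = ~ source,               # topic vocabulary varies by source
           data = meta)

## Regress the estimated topic proportions on the prevalence covariates
prep <- estimateEffect(7 ~ source + s(time), fit, metadata = meta)

## Mean prevalence of the topic within each news source corpus,
## as in Figure 7 (right panel)
plot(prep, covariate = "source", topics = 7, method = "pointestimate")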

Figure . Falungong topic. Each group of words are the highest probability words for the news source (left panel). Mean prevalence of Falungong topic within eachnews source corpus (right panel). JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 999

Figure . SARS and avian flu. Each dot represents the average topic proportion in a document in that month and the line is a smoothed average across time (left panel). Comparisons between news sources (right panel). time and discussion of the environment, both increasing over time, our model is able to capture the SARS and subsequent time. avian fu events, described above. The topic model shows how As an illustration, we turn to the difering news coverage of the news both quickly picked up outbreaks of SARS and avian SARS during the outbreak of the disease during 2003 in China. fu and quickly stopped talking about them when the epidemics First, in Figure 8 (left panel) we show that by smoothing over were resolved. The Chinese government received a lot of

Figure . Correlation between topics, Xinhua versus BBC. 1000 M. E. ROBERTS ET AL. international criticism for its news coverage of SARS, mostly which are important for analyzing experiments and for carry- because it reported on the disease much later than it knew ing out other causal analyses when the outcome comes in the that the epidemic was occurring. As shown in Figure 8 (right form of text data. We then demonstrated the proposed meth- panel), our model picks up small diferences in news coverage ods to address questions about the variation in news coverage between Chinese and Western sources once news coverage of China’s rise. In related work, we have applied these methods began happening, although not substantial. In particular, while to study open-ended survey responses (Roberts et al. 2014b), Western news sources seemed to talk a lot about death, Chinese comparative politics literature (Lucas et al. 2015), and student- news sources mainly focused on policy-related words, such as generated text in massive open online courses (Reich et al. “control,” “fght,” and “aid,” and avoided mentions of death by 2015). the disease. We conclude by highlighting some areas of work that would Finally, because the model allows for the inclusion of cor- be fruitful for expanding the role of this type of analysis, espe- related topics, we can also visualize the relationship between cially in the social sciences. China-related topics in the 1997–2006 period. In particular, we Aproductivelineofinquiryhasfocusedontheinterpre- can see how topics are correlated diferently for diferent news tation of topic models (Chang et al. 2009;Mimnoetal.2011; wires, indicating how topics are connected and framed difer- Airoldi and Bischof in press). These methods are aided by ently in each newswire. In Figure 9,wefndalledgesbetween techniques for dealing with the practical threats to interpreta- topics where they exhibit a positive correlation above 0.1. Pairs tion such as excessive stop-words and categories with overlap- of topics where an edge exists in both Xinhua and BBC, we ping keywords (Wallach, Mimno, and McCallum 2009a;Zou denote with a light blue dot. Pairs of topics where Xinhua, but and Adams 2012). In addition to fully automated approaches, not BBC have an edge between them, we denote with a red work on interactive topic modeling and user-specifed con- square, and those where BBC, but not Xinhua have an edge straints is particularly appropriate to social scientists who between them we denote with a blue square. may have a deep knowledge of their particular document sets We then sort the matrix by topics that are similarly corre- (Ramage et al. 2009;Andrzejewskietal.2011;Hu,Boyd- lated with other topics in Xinhua and BBC to those that are Graber, and Satinof 2011). One advantage of our approach is not similarly correlated. Topics such as accidents and disasters, that the meta-information is incorporated by means of gen- tourism, factory production, and manufacturing are correlated eralized linear models, which are already familiar to social with similar topics in both BBC and Xinhua. However, top- scientists. ics related to democracy, human rights, and the environment Asecondareawewanttoemphasizeistherecentwork have diferent topic correlations between the two corpuses. For on general methods for evaluation and model checking (Wal- example, in BBC, the environment and pollution is correlated lach et al. 
In conclusion, the STM allows us to measure how the various newswire services differentially treat China's rise over a 10-year period. We see much variation in how the different newswires discuss the rise of China. Unsurprisingly, the Xinhua news service talks about negative topics less than Western sources, and focuses on the positive aspects of topics related to China's rise. Interestingly, however, we see high variation within Western news sources, with AFP and AP taking a much more negative slant on China's rise than the BBC. We believe we are the first to quantify media slant on China's rise in news sources from all over the world, adding to the discussion of how China is perceived across many different countries. Conveniently, the STM allows us to summarize the newspapers' perspectives more quickly than would reading large swaths of text, the method currently most frequently used by China scholars.

4. Concluding Remarks

In this article, we have outlined a new mixed membership model for the analysis of documents with meta-information. We also have outlined some of the features of the proposed models, which are important for analyzing experiments and for carrying out other causal analyses when the outcome comes in the form of text data. We then demonstrated the proposed methods by addressing questions about the variation in news coverage of China's rise. In related work, we have applied these methods to study open-ended survey responses (Roberts et al. 2014b), the comparative politics literature (Lucas et al. 2015), and student-generated text in massive open online courses (Reich et al. 2015).

We conclude by highlighting some areas of work that would be fruitful for expanding the role of this type of analysis, especially in the social sciences.

A productive line of inquiry has focused on the interpretation of topic models (Chang et al. 2009; Mimno et al. 2011; Airoldi and Bischof in press). These methods are aided by techniques for dealing with practical threats to interpretation, such as excessive stop-words and categories with overlapping keywords (Wallach, Mimno, and McCallum 2009a; Zou and Adams 2012). In addition to fully automated approaches, work on interactive topic modeling and user-specified constraints is particularly appropriate for social scientists, who may have deep knowledge of their particular document sets (Ramage et al. 2009; Andrzejewski et al. 2011; Hu, Boyd-Graber, and Satinoff 2011). One advantage of our approach is that the meta-information is incorporated by means of generalized linear models, which are already familiar to social scientists.

A second area we want to emphasize is the recent work on general methods for evaluation and model checking (Wallach et al. 2009b; Mimno and Blei 2011; Airoldi and Bischof in press). As noted in both the computer science literature (Blei 2012) and the political science literature (Grimmer and Stewart 2013), validation of the model becomes even more important when unsupervised methods are used for inference or measurement than when they are used for prediction or exploration. While model-based fit statistics are an important part of the process, we also believe that recent work on the automated visualization of topic models (Chaney and Blei 2012; Chuang et al. 2012b; Chuang, Manning, and Heer 2012a) is of equal or greater importance for helping users substantively engage with the underlying texts. And user engagement is important to ultimately deliver interesting substantive conclusions (Grimmer and King 2011).

Alternative inference strategies for the proposed model, and for topic models generally, are an area of current research. With regard to our model, an alternative inference approach would be to develop a Markov chain Monte Carlo (MCMC) sampler based on the Polya-Gamma data augmentation scheme (Chen et al. 2013; Polson, Scott, and Windle 2013). This has the advantage of retaining asymptotic guarantees on recovering the true posterior. However, while MCMC samplers get to the right answer in theory, in the limit, in practice they also get stuck in local modes, and they often converge more slowly than variational approaches. A second approach would be to explore techniques based on stochastic approximations (Toulis and Airoldi 2014, 2015), which have the advantage of scaling well to larger collections of documents while retaining the asymptotic properties of MCMC. Elsewhere, we have also developed a strategy to appropriately include measurement uncertainty from the variational posterior in regressions where the latent topic is used as the outcome variable (Roberts et al. 2014b).
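To make the Polya-Gamma augmentation mentioned above concrete, the following is a minimal sketch of the scheme for a plain Bayesian logistic regression, the conditionally conjugate building block that the cited samplers exploit; it is not the full topic-model sampler. It assumes the BayesLogit R package, which provides the Polya-Gamma sampler rpg().

library(BayesLogit)

## Gibbs sampler for y_i ~ Bernoulli(logit^{-1}(x_i' beta)),
## beta ~ N(b0, B0 I), via omega_i | beta ~ PG(1, x_i' beta)
gibbs_logit <- function(X, y, iters = 1000, b0 = 0, B0 = 10) {
  p     <- ncol(X)
  beta  <- rep(0, p)
  draws <- matrix(NA_real_, iters, p)
  kappa <- y - 1/2                 # kappa_i = y_i - 1/2
  B_inv <- diag(1 / B0, p)         # prior precision
  for (t in seq_len(iters)) {
    omega <- rpg(nrow(X), h = 1, z = as.vector(X %*% beta))
    ## beta | omega ~ N(m, V) with V = (X' Omega X + B_inv)^{-1}
    V <- solve(crossprod(X * sqrt(omega)) + B_inv)
    m <- V %*% (crossprod(X, kappa) + B_inv %*% rep(b0, p))
    beta <- as.vector(m + t(chol(V)) %*% rnorm(p))
    draws[t, ] <- beta
  }
  draws
}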
Software availability. The R package stm implements the methods described here, in addition to a suite of visualization and post-estimation tools (http://cran.r-project.org/web/packages/stm).

Supplementary Materials

The supplementary materials contain the article's appendix. Additionally, materials are available on Dataverse and can be found at http://dx.doi.org/10.7910/DVN/SIGIAU.

Acknowledgments

The authors thank Ryan Adams, Ken Benoit, David Blei, Patrick Brandt, Amy Catalinac, Sean Gerrish, Adam Glynn, Justin Grimmer, Gary King, Christine Kuang, Chris Lucas, Brendan O'Connor, Arthur Spirling, Alex Storer, Hanna Wallach, Daniel Young, and in particular Dustin Tingley, for useful discussions, and the editor and two anonymous reviewers for their valuable input. The first two authors contributed equally to this work.

Funding

This research was supported, in part, by The Eunice Kennedy Shriver National Institute of Child Health & Human Development under grant P2-CHD047879 to the Office of Population Research at Princeton University, by the National Science Foundation under grants CAREER IIS-1149662 and IIS-1409177, and by the Office of Naval Research under grant YIP N00014-14-1-0485 to Harvard University. This research was largely performed while Brandon M. Stewart was a National Science Foundation Graduate Research Fellow at Harvard University. Edoardo M. Airoldi is an Alfred P. Sloan Research Fellow. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH, the NSF, or the ONR.

References

Ahmed, A., and Xing, E. (2007), "On Tight Approximate Inference of the Logistic-Normal Topic Admixture Model," in Proceedings of AISTATS, pp. 19-26.
——— (2010), "Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 1140-1150.
Airoldi, E., Erosheva, E., Fienberg, S., Joutard, C., Love, T., and Shringarpure, S. (2010), "Reconceptualizing the Classification of PNAS Articles," Proceedings of the National Academy of Sciences, 107, 20899-20904.
Airoldi, E. M., Anderson, A. G., Fienberg, S. E., and Skinner, K. K. (2006), "Who Wrote Ronald Reagan's Radio Addresses?" Bayesian Analysis, 1, 289-320.
Airoldi, E. M., and Bischof, J. M. (in press), "A Regularization Scheme on Word Occurrence Rates That Improves Estimation and Interpretation of Topical Content" (with discussion), Journal of the American Statistical Association.
Airoldi, E. M., Blei, D. M., Erosheva, E. A., and Fienberg, S. E. (eds.) (2014a), Handbook of Mixed Membership Models and Their Applications (Handbooks of Modern Statistical Methods), Boca Raton, FL: Chapman & Hall/CRC.
——— (2014b), "Introduction to Mixed Membership Models and Methods," in Handbook of Mixed Membership Models and Their Applications (Handbooks of Modern Statistical Methods), Boca Raton, FL: Chapman & Hall/CRC, pp. 3-14.
Airoldi, E. M., Cohen, W. W., and Fienberg, S. E. (2005), "Bayesian Models for Frequent Terms in Text," in Proceedings of the Classification Society of North America and INTERFACE Annual Meetings.
Airoldi, E. M., and Fienberg, S. E. (2003), "Who Wrote Ronald Reagan's Radio Addresses?" Technical Report no. CMU-STAT-03-789, Carnegie Mellon University.
Aitchison, J. (1982), "The Statistical Analysis of Compositional Data," Journal of the Royal Statistical Society, Series B, 44, 139-177.
Aitchison, J., and Shen, S. M. (1980), "Logistic-Normal Distributions: Some Properties and Uses," Biometrika, 67, 261-272.
Andrzejewski, D., Zhu, X., Craven, M., and Recht, B. (2011), "A Framework for Incorporating General Domain Knowledge Into Latent Dirichlet Allocation Using First-Order Logic," in IJCAI, pp. 25-32.
Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. (2013), "A Practical Algorithm for Topic Modeling With Provable Guarantees," in Proceedings of the 30th International Conference on Machine Learning, pp. 280-288.
Arora, S., Ge, R., and Moitra, A. (2012), "Learning Topic Models--Going Beyond SVD," in IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 1-10.
Asuncion, A., Welling, M., Smyth, P., and Teh, Y.-W. (2009), "On Smoothing and Inference for Topic Models," in UAI, pp. 27-34.
Bischof, J. M., and Airoldi, E. M. (2012), "Summarizing Topical Content With Word Frequency and Exclusivity," in International Conference on Machine Learning (Vol. 29), pp. 201-208.
Blei, D. M. (2012), "Probabilistic Topic Models," Communications of the ACM, 55, 77-84.
Blei, D. M., and Lafferty, J. D. (2006), "Dynamic Topic Models," in ICML, pp. 113-120.
——— (2007), "A Correlated Topic Model of Science," Annals of Applied Statistics, 1, 17-35.
——— (2009), "Topic Models," in Text Mining: Classification, Clustering, and Applications, eds. A. N. Srivastava and M. Sahami, Boca Raton, FL: CRC Press, pp. 71-94.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2001), "Latent Dirichlet Allocation," in Advances in Neural Information Processing Systems, pp. 601-608.
——— (2003), "Latent Dirichlet Allocation," Journal of Machine Learning Research, 3, 993-1022.
Braun, M., and McAuliffe, J. (2010), "Variational Inference for Large-Scale Models of Discrete Choice," Journal of the American Statistical Association, 105, 324-335.
Buot, M., and Richards, D. (2006), "Counting and Locating the Solutions of Polynomial Systems of Maximum Likelihood Equations, I," Journal of Symbolic Computation, 41, 234-244.
Chaney, A., and Blei, D. (2012), Visualizing Topic Models, Princeton, NJ: Department of Computer Science, Princeton University.
Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Blei, D. M. (2009), "Reading Tea Leaves: How Humans Interpret Topic Models," in NIPS, pp. 288-296.
Chen, J., Zhu, J., Wang, Z., Zheng, X., and Zhang, B. (2013), "Scalable Inference for Logistic-Normal Topic Models," in Advances in Neural Information Processing Systems, pp. 2445-2453.
Chuang, J., Manning, C., and Heer, J. (2012a), "Termite: Visualization Techniques for Assessing Textual Topic Models," in Proceedings of the International Working Conference on Advanced Visual Interfaces, ACM, pp. 74-77.
Chuang, J., Ramage, D., Manning, C., and Heer, J. (2012b), "Interpretation and Trust: Designing Model-Driven Visualizations for Text Analysis," in Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems, ACM, pp. 443-452.
De Boor, C. (1978), A Practical Guide to Splines (Vol. 27), New York: Springer-Verlag.
Dempster, A. P., Laird, N., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39, 1-38.
Doyle, G., and Elkan, C. (2009), "Accounting for Burstiness in Topic Models," in ICML, pp. 281-288.
Eisenstein, J., Ahmed, A., and Xing, E. (2011), "Sparse Additive Generative Models of Text," in Proceedings of ICML, pp. 1041-1048.
Eisenstein, J., O'Connor, B., Smith, N., and Xing, E. (2010), "A Latent Variable Model for Geographic Lexical Variation," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 1277-1287.
Erosheva, E. A., Fienberg, S. E., and Lafferty, J. (2004), "Mixed-Membership Models of Scientific Publications," Proceedings of the National Academy of Sciences, 101, 5220-5227.
Fang, Y.-J. (2001), "Reporting the Same Events? A Critical Analysis of Chinese Print News Media Texts," Discourse & Society, 12, 585-613.
Ferguson, N. (2010), "Complexity and Collapse: Empires on the Edge of Chaos," Foreign Affairs, 89, 18-32.
Friedman, J., Hastie, T., and Tibshirani, R. (2010), "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, 33, 1-22.
Gelman, A., and Hill, J. (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models (Vol. 3), New York: Cambridge University Press.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies," Statistica Sinica, 6, 733-760.
Genkin, A., Lewis, D. D., and Madigan, D. (2007), "Large-Scale Bayesian Logistic Regression for Text Categorization," Technometrics, 49, 291-304.
Gerrish, S., and Blei, D. (2012), "How They Vote: Issue-Adjusted Models of Legislative Behavior," in Advances in Neural Information Processing Systems, 25, pp. 2762-2770.
Griffiths, T. L., and Steyvers, M. (2004), "Finding Scientific Topics," Proceedings of the National Academy of Sciences of the United States of America, 101, 5228-5235.
Grimmer, J. (2010), "A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases," Political Analysis, 18, 1-35.
Grimmer, J., and King, G. (2011), "General Purpose Computer-Assisted Clustering and Conceptualization," Proceedings of the National Academy of Sciences, 108, 2643-2650.
Grimmer, J., and Stewart, B. M. (2013), "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts," Political Analysis, 21, 267-297.
Groenewald, P. C., and Mokgatlhe, L. (2005), "Bayesian Computation for Logistic Regression," Computational Statistics & Data Analysis, 48, 857-868.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models (Vol. 43), Boca Raton, FL: Chapman & Hall/CRC.
Holmes, C. C., and Held, L. (2006), "Bayesian Auxiliary Variable Models for Binary and Multinomial Regression," Bayesian Analysis, 1, 145-168.
Hornik, K. (2005), "A CLUE for CLUster Ensembles," Journal of Statistical Software, 14, 1-25.
Hu, Y., Boyd-Graber, J. L., and Satinoff, B. (2011), "Interactive Topic Modeling," in ACL, pp. 248-257.
Hurley, J., and Cattell, R. (1962), "The Procrustes Program: Producing Direct Rotation to Test a Hypothesized Factor Structure," Behavioral Science, 7, 258-262.
Ikenberry, G. J. (2008), "The Rise of China and the Future of the West: Can the Liberal System Survive?" Foreign Affairs, 87, 23-37.
Jeske, D. R., and Liu, R. Y. (2007), "Mining and Tracking Massive Text Data: Classification, Construction of Tracking Statistics, and Inference Under Misclassification," Technometrics, 49, 116-128.
Jia, J., Miratrix, L., Yu, B., Gawalt, B., El Ghaoui, L., Barnesmoore, L., and Clavier, S. (2014), "Concise Comparative Summaries (CCS) of Large Text Corpora With a Human Experiment," Annals of Applied Statistics, 8, 499-529.
Johnston, A. I., and Stockmann, D. (2007), Chinese Attitudes Toward the United States and Americans, Ithaca, NY: Cornell University Press, pp. 157-195.
Khan, M. E., and Bouchard, G. (2009), "Variational EM Algorithms for Correlated Topic Models," Technical Report, University of British Columbia.
Knowles, D. A., and Minka, T. (2011), "Non-Conjugate Variational Message Passing for Multinomial and Binary Regression," in Advances in Neural Information Processing Systems, pp. 1701-1709.
Liu, J. S. (1994), "The Collapsed Gibbs Sampler in Bayesian Computations With Applications to a Gene Regulation Problem," Journal of the American Statistical Association, 89, 958-966.
Lucas, C., Nielsen, R., Roberts, M. E., Stewart, B. M., Storer, A., and Tingley, D. (2015), "Computer Assisted Text Analysis for Comparative Politics," Political Analysis, 23, 254-277.
Manning, C. D., Raghavan, P., and Schütze, H. (2008), An Introduction to Information Retrieval, Cambridge, England: Cambridge University Press.
Mei, Q., Shen, X., and Zhai, C. (2007), "Automatic Labeling of Multinomial Topic Models," in KDD, pp. 490-499.
Mendenhall, T. C. (1887), "The Characteristic Curves of Composition," Science, 11, 237-249.
Meng, X.-L., and Van Dyk, D. (1997), "The EM Algorithm--An Old Folk-Song Sung to a Fast New Tune," Journal of the Royal Statistical Society, Series B, 59, 511-567.
Miller, G. A., Newman, E. B., and Friedman, E. A. (1958), "Length-Frequency Statistics for Written English," Information and Control, 1, 370-389.
Mimno, D., and Blei, D. (2011), "Bayesian Checking for Topic Models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 227-237.
Mimno, D., and McCallum, A. (2008), "Topic Models Conditioned on Arbitrary Features With Dirichlet-Multinomial Regression," in UAI, pp. 411-418.
Mimno, D., Wallach, H., and McCallum, A. (2008), "Gibbs Sampling for Logistic Normal Topic Models With Graph-Based Priors," in NIPS Workshop on Analyzing Graphs, pp. 1-8.
Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011), "Optimizing Semantic Coherence in Topic Models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 262-272.
Minka, T., and Lafferty, J. D. (2002), "Expectation-Propagation for the Generative Aspect Model," in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 352-359.
Mosteller, F., and Wallace, D. (1963), "Inference in an Authorship Problem," Journal of the American Statistical Association, 58, 275-309.
Mosteller, F., and Wallace, D. L. (1964), Inference and Disputed Authorship: The Federalist, Reading, MA: Addison-Wesley.
——— (1984), Applied Bayesian and Classical Inference: The Case of The Federalist Papers, New York: Springer-Verlag.
Mukherjee, I., and Blei, D. M. (2009), "Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation," in Neural Information Processing Systems, pp. 1129-1136.
Nguyen, X. (2015), "Posterior Contraction of the Population Polytope in Finite Admixture Models," Bernoulli, 21, 618-646.
Paisley, J., Wang, C., and Blei, D. (2012), "The Discrete Infinite Logistic Normal Distribution," Bayesian Analysis, 7, 235-272.
Papadimitriou, C. H., and Steiglitz, K. (1998), Combinatorial Optimization: Algorithms and Complexity, Mineola, NY: Courier Dover Publications.
Polson, N. G., Scott, J. G., and Windle, J. (2013), "Bayesian Inference for Logistic Models Using Polya-Gamma Latent Variables," Journal of the American Statistical Association, 108, 1339-1349.
Pritchard, J. K., Stephens, M., Rosenberg, N. A., and Donnelly, P. (2000), "Association Mapping in Structured Populations," American Journal of Human Genetics, 67, 170-181.
Quinn, K., Monroe, B., Colaresi, M., Crespin, M., and Radev, D. (2010), "How to Analyze Political Attention With Minimal Assumptions and Costs," American Journal of Political Science, 54, 209-228.
Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009), "Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora," in EMNLP, pp. 248-256.
Reich, J., Tingley, D., Leder-Luis, J., Roberts, M. E., and Stewart, B. M. (2015), "Computer Assisted Reading and Discovery for Student Generated Text in Massive Open Online Courses," Journal of Learning Analytics, 2, 156-184.
Roberts, M. E., Stewart, B. M., and Tingley, D. (2014a), "stm: R Package for Structural Topic Models," Technical Report, Harvard University.
——— (2016), "Navigating the Local Modes of Big Data: The Case of Topic Models," in Computational Social Science: Discovery and Prediction, ed. R. M. Alvarez, New York: Cambridge University Press, pp. 51-97.
Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., and Rand, D. (2014b), "Structural Topic Models for Open-Ended Survey Responses," American Journal of Political Science, 58, 1064-1082.
Rosenbaum, P. R. (1987), "Model-Based Direct Adjustment," Journal of the American Statistical Association, 82, 387-394.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004), "The Author-Topic Model for Authors and Documents," in UAI, pp. 487-494.
Roy, D. (1996), "The 'China Threat' Issue: Major Arguments," Asian Survey, 36, 758-771.
Socher, R., Gershman, S., Perotte, A., Sederberg, P., Blei, D., and Norman, K. (2009), "A Bayesian Analysis of Dynamics in Free Recall," in Advances in Neural Information Processing Systems, 22, pp. 1714-1722.
Stephens, M. (2000), "Bayesian Analysis of Mixtures With an Unknown Number of Components--An Alternative to Reversible Jump Methods," Annals of Statistics, 28, 40-74.
Taddy, M. (2013), "Multinomial Inverse Regression for Text Analysis," Journal of the American Statistical Association, 108, 755-770.
——— (2015), "Distributed Multinomial Regression," Annals of Applied Statistics, 9, 1394-1414.
Tang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. (2014), "Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis," Journal of Machine Learning Research, 32, 190-198.
Toulis, P., and Airoldi, E. M. (2014), "Implicit Stochastic Gradient Descent," arXiv no. 1408.2923. Available at https://arxiv.org/abs/1408.2923.
——— (2015), "Scalable Estimation Strategies Based on Stochastic Approximations: Classical Results and New Insights," Statistics and Computing, 25, 781-795.
Turney, P. D., and Pantel, P. (2010), "From Frequency to Meaning: Vector Space Models of Semantics," Journal of Artificial Intelligence Research, 37, 141-188.
Wallach, H., Mimno, D., and McCallum, A. (2009a), "Rethinking LDA: Why Priors Matter," in NIPS, pp. 1973-1981.
Wallach, H., Murray, I., Salakhutdinov, R., and Mimno, D. (2009b), "Evaluation Methods for Topic Models," in ICML, pp. 1105-1112.
Wang, C., and Blei, D. (2013), "Variational Inference in Nonconjugate Models," Journal of Machine Learning Research, 14, 1005-1031.
Yule, U. (1944), The Statistical Study of Literary Vocabulary, Cambridge, UK: Cambridge University Press.
Zipf, G. K. (1932), Selected Studies of the Principle of Relative Frequency in Language, Cambridge, MA: Harvard University Press.
Zou, J., and Adams, R. (2012), "Priors for Diversity in Generative Latent Variable Models," in Advances in Neural Information Processing Systems, 25, pp. 3005-3013.