A Non-Markov Continuous-Time Model of Topical Trends

Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends Xuerui Wang, Andrew McCallum Department of Computer Science University of Massachusetts Amherst, MA 01003 {xuerui,mccallum}@cs.umass.edu ABSTRACT patterns) [15]. In each case, graphical model structures are This paper presents an LDA-style topic model that captures carefully-designed to capture the relevant structure and co- not only the low-dimensional structure of data, but also occurrence dependencies in the data. how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of Many of the large data sets to which these topic models time, here each topic is associated with a continuous distri- are applied do not have static co-occurrence patterns; they bution over timestamps, and for each generated document, are instead dynamic. The data are often collected over the mixture distribution over topics is influenced by both time, and generally patterns present in the early part of word co-occurrences and the document’s timestamp. Thus, the collection are not in effect later. Topics rise and fall in the meaning of a particular topic can be relied upon as con- prominence; they split apart; they merge to form new top- stant, but the topics’ occurrence and correlations change ics; words change their correlations. For example, across 17 significantly over time. We present results on nine months years of the Neural Information Processing Systems (NIPS) of personal email, 17 years of NIPS research papers and conference, activity in “analog circuit design” has fallen off over 200 years of presidential state-of-the-union addresses, somewhat, while research in “support vector machines” has showing improved topics, better timestamp prediction, and recently risen dramatically. The topic “dynamic systems” interpretable trends. used to co-occur with “neural networks,” but now co-occurs with “graphical models.” Categories and Subject Descriptors However none of the topic models mentioned above are aware I.2.6 [Artificial Intelligence]: Learning; H.2.8 [Database of these dependencies on document timestamps. Not model- Management]: Database Applications—data mining ing time can confound co-occurrence patterns and result in unclear, sub-optimal topic discovery. For example, in topic General Terms analysis of U.S. Presidential State-of-the-Union addresses, Algorithms, experimentation LDA confounds Mexican-American War (1846-1848) with some aspects of World War I (1914-1918), because LDA is Keywords unaware of the 70-year separation between the two events. Some previous work has performed some post-hoc analysis— Graphical models, temporal analysis, topic modeling discovering topics without the use of timestamps and then projecting their occurrence counts into discretized time [4]— 1. INTRODUCTION but this misses the opportunity for time to improve topic Research in statistical models of co-occurrence has led to the discovery. development of a variety of useful topic models—mechanisms for discovering low-dimensional, multi-faceted summaries of This paper presents Topics over Time (TOT), a topic model documents or other discrete data. These include models that explicitly models time jointly with word co-occurrence of words alone, such as Latent Dirichlet Allocation (LDA) patterns. Significantly, and unlike some recent work with [2, 4], of words and research paper citations [3], of word similar goals, our model does not discretize time, and does sequences with Markov dependencies [5], of words and their not make Markov assumptions over state transitions in time. authors [11], of words in a social network of senders and Rather, TOT parameterizes a continuous distribution over recipients [9], and of words and relations (such as voting time associated with each topic, and topics are responsible for generating both observed timestamps as well as words. Parameter estimation is thus driven to discover topics that simultaneously capture word co-occurrences and locality of those patterns in time. When a strong word co-occurrence pattern appears for a brief moment in time then disappears, TOT will create a topic with a narrow time distribution. (Given enough ev- ACM SIGKDD-2006 August 20-23, 2005, Philadelphia, Pennsylvania, USA idence, arbitrarily small spans can be represented, unlike schemes based on discretizing time.) When a pattern of naturally injected into other more richly structured topic word co-occurrence remains consistent across a long time models, such as the Author-Recipient-Topic model to cap- span, TOT will create a topic with a broad time distribu- ture changes in social network roles over time [9], and the tion. In current experiments, we use a Beta distribution over Group-Topic model to capture changes in group formation a (normalized) time span covering all the data, and thus we over time [15]. can also flexibly represent various skewed shapes of rising and falling topic prominence. We present experimental results with three real-world data sets. On over two centuries of U.S. Presidential State-of-the- The model’s generative storyline can be understood in two Union addresses, we show that TOT discovers topics with different ways. We fit the model parameters according to a both time-localization and word-clarity improvements over generative model in which a per-document multinomial dis- LDA. On the 17-year history of the NIPS conference, we tribution over topics is sampled from a Dirichlet, then for show clearly interpretable topical trends, as well as a two- each word occurrence we sample a topic; next a per-topic fold increase in the ability to predict time given a document. multinomial generates the word, and a per-topic Beta dis- On nine months of the second author’s email archive, we tribution generates the document’s time stamp. Here the show another example of clearly interpretable, time-localized time stamp (which in practice is always observed and con- topics, such as springtime faculty recruiting. On all three stant across the document) is associated with each word in datasets, TOT provides more distinct topics, as measured the document. We can also imagine an alternative, cor- by KL divergence. responding generative model in which the time stamp is generated once per document, conditioned directly on the 2. TOPICS OVER TIME per-document mixture over topics. In both cases, the likeli- Before introducing the Topics over Time (TOT) model, let hood contribution from the words and the contribution from us review the basic Latent Dirichlet Allocation model. Our the timestamps may need to be weighted by some factor, as notation is summarized in Table 1, and the graphical model in the balancing of acoustic models and language models representations of both LDA and our TOT models are shown in speech recognition. The later generative storyline more in Figure 1. directly corresponds to common data sets (with one timestamp per document); the former is easier to fit, and can also Latent Dirichlet Allocation (LDA) is a Bayesian network allow some flexibility in which different parts of the docu- that generates a document using a mixture of topics [2]. ment may be discussing different time periods. In its generative process, for each document d, a multinomial distribution θ over topics is randomly sampled from a Some previous studies have also shown that topic discov- Dirichlet with parameter α, and then to generate each word, ery can be influenced by information in addition to word a topic z is chosen from this topic distribution, and a word, co-occurrences. For example, the Group-Topic model [15] di w , is generated by randomly sampling from a topic-specific showed that the joint modeling of word co-occurrence and di multinomial distribution φ . The robustness of the model voting relations resulted in more salient, relevant topics. zdi is greatly enhanced by integrating out uncertainty about the The Mixed-Membership model [3] also showed interesting per-document topic distribution θ and the per-topic word results for research papers and their citations. distribution φ. Note that, in contrast to other work that models trajecto- In the TOT model, topic discovery is influenced not only by ries of individual topics over time, TOT topics and their word co-occurrences, but also temporal information. Rather meaning are modeled as constant over time. TOT captures than modeling a sequence of state changes with a Markov changes in the occurrence (and co-occurrence conditioned assumption on the dynamics, TOT models (normalized) ab- on time) of the topics themselves, not changes in the word solute timestamp values. This allows TOT to see long-range distribution of each topic. While choosing to model individ- dependencies in time, to predict absolute time values given ual topics as mutable could be useful, it can also be dan- an unstamped document, and to predict topic distributions gerous. Imagine a subset of documents containing strong given a timestamp. It also helps avoid a Markov model’s co-occurrence patterns across time: first between birds and risk of inappropriately dividing a topic in two when there is aerodynamics, then aerodynamics and heat, then heat and a brief gap in its appearance. quantum mechanics—this could lead to a single topic that follows this trajectory, and lead the user to inappropriately Time is intrinsically continuous. Discretization of time al- conclude that birds and quantum mechanics are time-shifted ways begs the question of selecting the slice size, and the versions of the same topic. Alternatively, consider a large size is invariably too small for some regions and too large subject like medicine, which has changed drastically over for others. time. In TOT we choose to model these shifts as changes in topic co-occurrence—a decrease in occurrence of topics Avoiding discretization, in TOT each topic is associated about blood-letting and bile, and an increase in topics about with a continuous distribution over time.

A Non-Markov Continuous-Time Model of Topical Trends

The Exponential Family 1 Definition

Machine Learning Conjugate Priors and Monte Carlo Methods

Polynomial Singular Value Decompositions of a Family of Source-Channel Models

A Compendium of Conjugate Priors

Bayesian Filtering: from Kalman Filters to Particle Filters, and Beyond ZHE CHEN

36-463/663: Hierarchical Linear Models

A Geometric View of Conjugate Priors

Gibbs Sampling, Exponential Families and Orthogonal Polynomials

Sparsity Enforcing Priors in Inverse Problems Via Normal Variance

Jeffreys Priors 1 Priors for the Multivariate Gaussian

Bayesian Inference for the Multivariate Normal

Prior Distributions for Variance Parameters in Hierarchical Models