Mining Causal Topics in Text Data: Iterative Topic Modeling with Time Series Feedback

Hyun Duk Kim, Dept. of Computer Science, University of Illinois at Urbana-Champaign, [email protected]
Malu Castellanos, Information Analytics Lab, HP Laboratories, [email protected]
Meichun Hsu, Information Analytics Lab, HP Laboratories, [email protected]
ChengXiang Zhai, Dept. of Computer Science, University of Illinois at Urbana-Champaign, [email protected]
Thomas Rietz, Dept. of Finance, The University of Iowa, [email protected]
Daniel Diermeier, Kellogg School of Management, Northwestern University, d-diermeier@kellogg.northwestern.edu

ABSTRACT
Many applications require analyzing textual topics in conjunction with external time series variables such as stock prices. We develop a novel general text mining framework for discovering such causal topics from text. Our framework naturally combines any given probabilistic topic model with time series causal analysis to discover topics that are both semantically coherent and correlated with time series data. We iteratively refine topics, increasing the correlation of discovered topics with the time series. The time series data provides feedback at each iteration by imposing prior distributions on the topic model parameters. Experimental results show that the proposed framework is effective.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing—text analysis; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—linguistic processing; H.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Causal Topic Mining, Iterative Topic Mining, Time Series

1. INTRODUCTION
Probabilistic topic models [4, 8] have proven very useful for mining text data in a range of areas including opinion analysis [11, 17], text information retrieval [19], image retrieval [9], natural language processing [6], and social network analysis [12].

Most existing topic modeling techniques focus on text alone. However, text topics often occur in conjunction with other variables through time. Such data calls for integrated analysis of text and non-text time series data. The causal relationships between the two may be of particular interest. For example, news about companies can affect stock prices, sales numbers, etc. Understanding the impact of, for example, news coverage or customer reviews is of great practical importance.

While there are many variants of topic models [2, 3, 18], no existing model jointly incorporates text and associated "external" time series variables to identify causal topics. A semantically coherent topic is "causal" if it has strong, possibly lagged, associations with a non-textual time series variable. This allows for two-way relationships: topics may affect the time series and/or vice versa. Our method can be tailored to the specific application and helps an analyst quickly identify a small set of possibly causal topics for further analysis.1

A basic approach to identifying causal topics is to (1) find topics with topic modeling techniques and then (2) identify causal topics using correlations and causality tests. This approach, however, ignores the possibly relevant information contained in the time series: the candidate topic set is the same for every time series.

We instead propose a novel general text mining framework: Iterative Topic Modeling with Time Series Feedback (ITMTF). ITMTF naturally combines probabilistic topic modeling with time series causal analysis to uncover topics that are both semantically coherent and correlated with time series data. ITMTF can accommodate any topic modeling technique and any causality measure, which enables users to easily adapt different topic models and causality measures as needed. ITMTF iteratively refines a topic model, gradually increasing the correlation of discovered topics with the time series data through a feedback mechanism. In each iteration, the time series data informs a prior distribution over the topic model parameters that feeds back into the topic model. Thus, the discovered topics are dynamically adapted to fit the patterns of different time series data.

We evaluate ITMTF on a news data set with multiple stock price time series, including stock prices from the Iowa Electronic Markets and those of two large US companies (American Airlines and Apple). The results show that ITMTF can effectively discover causal topics from text data and that the iterative process improves the quality of the discovered causal topics.

1Our definition allows using "correlation" and "cause" interchangeably as convenient.

CIKM'13, Oct. 27–Nov. 1, 2013, San Francisco, CA, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2263-8/13/10 ...$15.00.
http://dx.doi.org/10.1145/2505515.2505612

2. RELATED WORK
There are two basic topic models: Probabilistic Latent Semantic Analysis (PLSA) [8] and Latent Dirichlet Allocation (LDA) [4]. Both focus on word co-occurrences. Recent advanced techniques analyze the dynamics of topics on a time line [2, 18]. However, they do not conduct integrated analyses of topics and external variables; the topic analysis is separate from the external time series.

There are some efforts to incorporate external knowledge into modeling. Supervised LDA [3] models topics with better prediction power than simple LDA by incorporating a reference value (e.g., movie review articles with movie ratings) in the modeling process. Labeled LDA [16] associates categorical labels, and even text labels, with topics. Another way of incorporating external knowledge is to use conjugate prior probabilities in the topic modeling process [13]. The Topic Sentiment Mixture (TSM) model builds positive and negative sentiment topics using seed sentiment words such as "good" or "bad." While these methods show that topic mining can be guided by external variables, none achieves our objective of capturing the correlation structure between text topics and external time series variables. Moreover, while these models are specialized for supervision with specific external data, our general approach can flexibly combine any reasonable topic model with any causal analysis method.

Research on stock prediction using financial news content also relates to our work [14]. Such research typically identifies the most predictive words and labels news according to its effect on stock prices on a specific day, using a supervised regression or classification setup. In contrast, we search for general causal topics with unsupervised methods.

Granger testing [7] is popular for testing causality in economics using lead/lag relationships across multiple time series. Recent evidence shows that Granger tests can be used in an opinion mining context: predicting stock price movements with a sentiment curve [5]. However, Granger testing has not been used directly in text mining and topic analysis.

A demo system based on our framework is presented in [10], with a very brief description of the system components and sample results. Here, we describe the general problem and framework in detail and evaluate the algorithms rigorously.

3. MINING CAUSAL TOPICS IN TEXT WITH SUPERVISION OF TIME SERIES DATA
Consider time series data x1, ..., xn with time stamps t1, ..., tn, and a collection of time-stamped documents from the same period, D = {(d1, td1), ..., (dm, tdm)}. The goal is to discover a set of causal topics T1, ..., Tk with associated time lags L1, ..., Lk.

Granger tests are more structured measures of causality, measuring statistical significance at different time lags using autoregression to identify causal relationships. Let yt and xt be two time series. To see whether xt "Granger causes" yt with a maximum time lag of p, run the following regression:

yt = a0 + a1 yt−1 + ... + ap yt−p + b1 xt−1 + ... + bp xt−p.

Then, use F-tests to evaluate the joint significance of the lagged x terms. The coefficients of the lagged x terms estimate the impact of x on y. We average the lagged-x coefficients, (b1 + ... + bp)/p, as an impact value.

4.2 An Iterative Topic Modeling Framework with Time Series Feedback
Input: time series data X = x1, ..., xn with time stamps t1, ..., tn; a collection of text documents with time stamps from the same period, D = {(d1, td1), ..., (dm, tdm)}; a topic modeling method M; a causality measure C; and a parameter tn (how many topics to model).
Output: k potentially causal topics (k ≤ tn): (T1, L1), ..., (Tk, Lk).

The topic modeling method M identifies topics. The causality measure C gives significance measures (e.g., p-values) and impact orientations. Figure 1 illustrates our iterative algorithm. It works as follows:

1. Apply M to D to generate tn topics T1, ..., Ttn.

2. Use C to find topics with significance values sig(C, X, T) > γ (e.g., 95%). Let CT be the set of candidate causal topics with lags: CT = {(Tc1, L1), ..., (Tck, Lk)}.

3. For each candidate topic in CT, apply C to find the most significant causal words among the top words w ∈ T. Record the impact values of these significant words (e.g., word-level Pearson correlations with the time series variable).

4. Define a prior on the topic model parameters using the significant terms and their impact values.
(a) Separate positive impact terms and negative impact terms. If one orientation is very weak (< δ%, e.g., 10%), ignore the minor group.
(b) Assign prior probability proportions according to significance levels.

5.
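The Granger regression and impact value described above can be made concrete. The following is a minimal sketch, not the authors' implementation: it fits the unrestricted autoregression (with lagged x terms) and the restricted one (without) by least squares, forms the F statistic for the joint significance of the lagged x terms, and averages the lagged-x coefficients as the impact value. The helper name `granger_impact` is hypothetical.

```python
import numpy as np
from scipy.stats import f as f_dist

def granger_impact(y, x, p):
    """Test whether x Granger-causes y with maximum lag p.

    Returns (p-value of the F-test on the lagged x terms,
             impact value = average of the lagged-x coefficients)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = len(y)
    # Design matrix rows are t = p..n-1; column blocks: intercept,
    # y_{t-1}..y_{t-p}, then x_{t-1}..x_{t-p}.
    Xu = np.column_stack(
        [np.ones(n - p)]
        + [y[p - i:n - i] for i in range(1, p + 1)]
        + [x[p - i:n - i] for i in range(1, p + 1)]
    )
    Xr = Xu[:, : p + 1]          # restricted model: drop the x lags
    yt = y[p:]
    bu, *_ = np.linalg.lstsq(Xu, yt, rcond=None)
    br, *_ = np.linalg.lstsq(Xr, yt, rcond=None)
    ssr_u = np.sum((yt - Xu @ bu) ** 2)
    ssr_r = np.sum((yt - Xr @ br) ** 2)
    df_denom = len(yt) - Xu.shape[1]
    F = ((ssr_r - ssr_u) / p) / (ssr_u / df_denom)
    pval = f_dist.sf(F, p, df_denom)
    impact = bu[p + 1:].mean()   # (b1 + ... + bp) / p
    return pval, impact
```

A small p-value with a positive impact value indicates that past values of x help predict y and push it upward; a negative impact value indicates downward pressure.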
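Steps 2-4 of the iterative algorithm can likewise be sketched. The code below assumes a topic model has already produced document-topic proportions (theta) and topic-word distributions (phi), and uses lagged Pearson correlation as a stand-in for the causality measure C (the framework admits any measure; word-level Pearson correlation appears in step 3). The names `build_series` and `feedback_priors`, the normalization, and the choice to keep only the dominant impact orientation per topic are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import pearsonr

def build_series(values, stamps):
    """Average rows of `values` sharing a time stamp: one row per stamp."""
    times = np.unique(stamps)
    return np.array([values[stamps == t].mean(axis=0) for t in times])

def feedback_priors(counts, theta, phi, stamps, x,
                    max_lag=3, gamma=0.05, delta=0.1, top_n=20):
    """One feedback round (steps 2-4), with lagged Pearson correlation as C.

    counts: docs x vocab term counts; theta: docs x K topic coverage;
    phi: K x V topic-word distributions; stamps: per-document time index;
    x: external series, one value per distinct time stamp.
    Returns {topic: {word_index: prior_weight}} for significant topics."""
    counts, theta, phi = map(np.asarray, (counts, theta, phi))
    stamps, x = np.asarray(stamps), np.asarray(x, float)
    topic_ts = build_series(theta, stamps)                    # T x K
    word_ts = build_series(counts / counts.sum(axis=1, keepdims=True),
                           stamps)                            # T x V
    priors = {}
    for k in range(phi.shape[0]):
        # Step 2: most significant lag for topic k against x.
        pval, lag = min((pearsonr(topic_ts[:-l, k], x[l:])[1], l)
                        for l in range(1, max_lag + 1))
        if pval >= gamma:
            continue
        # Step 3: significant words among the topic's top words,
        # split by impact orientation (sign of the correlation).
        pos, neg = {}, {}
        for w in np.argsort(phi[k])[::-1][:top_n]:
            r, p = pearsonr(word_ts[:-lag, w], x[lag:])
            if p < gamma:
                (pos if r > 0 else neg)[int(w)] = 1.0 - p
        total = len(pos) + len(neg)
        if total == 0:
            continue
        # Step 4a: ignore a very weak minority orientation.
        if len(neg) < delta * total:
            neg = {}
        if len(pos) < delta * total:
            pos = {}
        # Step 4b (simplified): prior mass proportional to significance,
        # keeping only the dominant orientation for this topic.
        group = pos if len(pos) >= len(neg) else neg
        if not group:
            continue
        z = sum(group.values())
        priors[k] = {w: v / z for w, v in group.items()}
    return priors
```

The returned word-weight dictionaries would then seed the prior for the next run of the topic model M (step 5), closing the feedback loop.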