Discovery of Causal Time Intervals
Total Page:16
File Type:pdf, Size:1020Kb
Discovery of Causal Time Intervals Zhenhui Li∗ Guanjie Zheng∗ Amal Agarwaly Lingzhou Xuey Thomas Lauvauxz Abstract Granger test X causes Y: rejected Causality analysis, beyond \mere" correlations, has be- X[t0:t1] causes Y[t0:t1]: accepted come increasingly important for scientific discoveries X and policy decisions. Many of these real-world appli- cations involve time series data. A key observation is Y that the causality between time series could vary signif- icantly over time. For example, a rain could cause severe t0 t1 time traffic jams during the rush hours, but has little impact on the traffic at midnight. However, previous studies Figure 1: An example illustrating the causality in mostly look at the whole time series when determining partial time series. The original Granger test fails to the causal relationship between them. Instead, we pro- detect the causal relationship, since X causes Y only pose to detect the partial time intervals with causality. during the time interval [t0 : t1]. Our goal is to find As it is time consuming to enumerate all time intervals such time intervals. and test causality for each interval, we further propose an efficient algorithm that can avoid unnecessary com- causes Y by Granger test if the full model is significantly putations based on the bounds of F -test in the Granger better than the reduced model (i.e., it is critical to use causality test. We use both synthetic datasets and real the values of X to predict Y ). datasets to demonstrate the efficiency of our pruning However, in real-world scenarios, causal relation- techniques and that our method can effectively discover ships may exist only in partial time series. This is interesting causal intervals in the time series data. illustrated in Figure 1, where X causes Y only during the time interval [t0 : t1]. In such cases, if we apply the 1 Introduction traditional Granger test (i.e., fitting the models using In recent years, driven by a wide range of real-world the whole time series), we may not detect any causal- applications and scientific discoveries, detecting causal- ity between X and Y . However, if we only test on the ity from the data has gained increasing attention in partial time series in interval [t0 : t1], the causality can the data mining community. Time series data is a be discovered. Therefore, given two time series X and type of data frequently seen in many applications such Y , our goal in this paper is to find the causal time in- as stock, traffic, and climate. Various methods have tervals in which X causes Y . This is highly motivated been developed to determine the causality between two by a number of real world applications. Consider the time series. Among all causality test methods, Granger following two examples. causality test [10] is one of the most popular meth- ods [25, 2, 20, 18, 17]. EXAMPLE 1.1. Suppose that a local event (e.g., a In short, the idea of Granger test is based on football game) attracts many people, generating a large volume of traffic at the venue (e.g., a football stadium) prediction: Given a pair of time series X = x1x2 : : : xT during the event hours. In such scenario, time series and Y = y1y2 : : : yT , two different predictive models, called the full model and the reduced model, are fitted X could be the frequency of geo-tagged tweets at the using the whole time series (i.e., all data points). The venue (as a proxy of the popularity of local events), and full model uses the past values of X and the past values time series Y could be the taxi pick-up volume at the of Y to predict Y , whereas the reduced model uses past same location. Intuitively, we can say that the tweet values of Y only to predict Y . Then, we say that X frequency X (which represents the people attending the event) causes traffic volume Y during the time of the event. However, for the rest of the time, tweet frequency ∗College of Information Sciences and Technology, Pennsylvania State University. Email: [email protected] does not necessarily cause the pick-up volume. In fact, yDepartment of Statistics, Pennsylvania State University. during the normal days, the pick-ups are mostly due zDepartment of Meteorology, Pennsylvania State University. to human routine behaviors (e.g., commuting between home and work) and may not have any correlation with computations for the Granger test over similar time the frequency of geo-tagged tweets. intervals. • We conduct experiments on both synthetic data EXAMPLE 1.2. Nowadays, environmental pollution and real data to demonstrate the effectiveness and is a big concern in people's daily life. For example, efficiency of the proposed method. poor air quality in big cities in China [27, 28] and po- tential air/water contamination from shale gas devel- Before proceeding, we emphasize that our work opment [15, 22] in the United States have drawn a lot is built upon the Granger test. We do not claim to of attention lately. Here, one biggest scientific question improve the Granger test itself, which is known to have is whether certain type of emissions (e.g., CO2) gener- some weaknesses (more details in Section 2). Instead, ated by industrial, vehicular, or other human activities we focus on an important yet unaddressed problem in causes the environmental change. But the causal rela- practice, that is, discovering partial time intervals with tionship between an emission source and the air quality Granger causality. may vary over time. For example, vehicle emissions re- act in the presence of sunlight and form ground level 2 Related Work ozone, a primary ingredient of smog, whereas industrial Dependency/causality discovery on time series. emissions have a bigger impact on the air quality in cool Many methods have been proposed to discover the tem- and humid days. poral dependency between time series, such as autocor- relation, cross-correlation [7], transfer entropy [6], ran- To detect all the casual time intervals, a naive way domization test [16], phase slope index [21]. However, is to enumerate all the time intervals of the time series temporal correlation or dependency does not necessar- and conduct Granger test for each interval. But such an ily indicate causality. Granger causality test [10] was enumeration could be very time consuming in practice. first introduced in the area of econometrics for time se- To address this challenge, in this paper we propose an ries analysis and has since gained tremendous success efficient algorithm for causal time interval discovery. across many domains due to its simplicity and robust- Here, our key insight is that similar time intervals (i.e., ness [12, 13, 3, 8]. intervals with significant overlaps) are likely to have Weakness of the Granger test. A well-known diffi- similar Granger test results. In fact, given the fitted culty in causality reasoning (including the Granger test) models of a time interval, we can derive useful upper and is the confounding issue, that is, spurious causality may lower bounds for the statistical significance of Granger be detected if both processes X and Y are impacted by causality for similar time intervals. Such bounds can a third (possibly unknown) process (i.e., a confounder). be used to quickly identify intervals that will definitely In [14], variables which mediate the impact of an ac- pass or fail the test without fitting the actual models. tion on the outcome are identified to reduce the bias We have conducted experiments to verify the ef- in assessing the causal effect for online advertising. [5] fectiveness and efficiency of our method on two inter- exploits confounder path delays in an attempt to cancel esting real-world datasets. First, we study the causal- out the spurious confounder effect. However, a princi- ity relationship between local event (represented by the pled solution to this problem remains elusive. frequency of geo-tagged tweets at the venue) and the traffic volume (represented by the taxi drop-offs/pick- Variations of the Granger test. While the original ups). We have observed that, when there is an event, Granger causality test was designed for two time series, taxi drop-offs cause geo-tagged tweets, and geo-tagged several methods [2, 18, 17] have been proposed to an- tweets cause pick-ups. Second, we analyze variables of alyze time series data involving many features and to climate and their causal relationship with temperature. learn a causal graph structure. Following the work [2], We have observed that the causality varies over time [20] detects causality of spatial time series, [18] pro- and is different for different variables. poses to use hidden Markov Random Field method, [17] handles extreme values in time series, [4] detects In summary, our key contributions are as follows: Granger causality from irregular time series, and [5] • To the best of our knowledge, we are the first to presents Copula-Granger method to efficiently capture study the problem of detecting partial time inter- non-linearity in the data. Learning temporal causal vals in which a causal relationship exists between graph has been applied to biology applications [25], cli- two time series based on the Granger test. mate analysis [9], microbiology [19], fMRI data analy- sis [24], anomaly detection [23], and longitudinal anal- • We propose an efficient algorithm which utilizes ysis [26]. However, none of these work address the is- the bounds of regression error to avoid unnecessary sues that causality could change over time, and that causality only exists in partial time intervals. Previ- (i.e., the number of independent variables) in the models ous study [1] uses a fixed-length sliding window to test Mf and Mr, respectively, and T is the length of the time Granger causality, but they cannot detect all the causal series.