Causal Inference in Time Series Via Supervised Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18) Causal Inference in Time Series via Supervised Learning Yoichi Chikahara and Akinori Fujino NTT Communication Science Laboratories, Kyoto 619-0237, Japan [email protected], [email protected] Abstract only on its past values are significantly reduced by addition- ally using the past values of X. When the regression model Causal inference in time series is an important can be well fitted to the data, we can infer correct causal di- problem in many fields. Traditional methods use rections. However, in practice, selecting an appropriate re- regression models for this problem. The infer- gression model for each time series data is difficult and re- ence accuracies of these methods depend greatly quires a deep understanding of the data analysis. Therefore, on whether or not the model can be well fitted to it is not easy to identify correct causal directions with these the data, and therefore we are required to select an model-based methods. appropriate regression model, which is difficult in practice. This paper proposes a supervised learn- The goal of this paper is to build an approach to causal ing framework that utilizes a classifier instead of inference in time series that does not require a deep under- regression models. We present a feature represen- standing of the data analysis. To realize this goal, we pro- tation that employs the distance between the con- pose a supervised learning framework that utilizes a classifier ditional distributions given past variable values and instead of regression models. Specifically, we propose solv- show experimentally that the feature representation ing the problem of Granger causality identification by ternary provides sufficiently different feature vectors for classification, in other words, by training a classifier that as- time series with different causal relationships. Fur- signs ternary causal labels (X ! Y , X Y , or No Cau- thermore, we extend our framework to multivariate sation) to time series. In fact, several methods have already time series and present experimental results where been proposed that perform classification to infer causal re- our method outperformed the model-based meth- lationships from i.i.d. data, which have worked well ex- ods and the supervised learning method for i.i.d. perimentally [Bontempi and Flauder, 2015; Guyon, 2013; data. Lopez-Paz et al., 2015; 2017]. To solve causal inference in time series via classification, we formulate a feature representation that provides sufficiently different feature vectors for 1 Introduction time series with different causal relationships. The idea for Discovering temporal causal directions is an important task obtaining such feature vectors is founded on the definition of in time series analysis and has key applications in various Granger causality: X is the cause of Y if the following two fields. For instance, finding the causal direction that indi- conditional distributions of the future value of Y are differ- cates that the research and development (R&D) expenditure ent; one is given the past values of Y and the other is given X influences the total sales Y , but not vice versa, is helpful the past values of X and Y . To build the classifier for Granger for decision making in companies. In addition, identifying causality identification, we utilize the distance between these causal (regulatory) relationships between genes from time se- distributions when preparing feature vectors. To compute the ries gene expression data is one of the most important topics distance, by using kernel mean embedding, we map each dis- in bioinformatics. tribution to a point in the feature space called the reproduc- As a definition of temporal causality, Granger causality ing kernel Hilbert space (RKHS) and measure the distance [Granger, 1969] is widely used [Kar et al., 2011; Yao et al., between the points, which is termed the maximum mean dis- 2015]. According to its definition, the variable X is the cause crepancy (MMD) [Gretton et al., 2007]. of the variable Y if the past values of X are helpful in pre- In experiments, our method sufficiently outperformed the dicting the future value of Y . model-based Granger causality methods and the supervised Traditional methods for identifying Granger causality use learning method for i.i.d. data by using the same feature rep- regression models [Bell et al., 1996; Cheng et al., 2014; resentation and the same classifier. Furthermore, we describe Granger, 1969; Marinazzo et al., 2008; Sun, 2008] such as the how our approach can be extended to multivariate time series vector autoregressive (VAR) model and the generalized addi- and show experimentally that feature vectors have a sufficient tive models (GAM). With these methods, we can determine difference that depends on Granger causality, which demon- that X is the cause of Y if the prediction errors of Y based strates the effectivity of our proposed framework. 2042 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18) 2 Granger Causality Basic Ideas for Granger Causality Identification Granger causality defines X as the cause of Y if the past val- Simply by using the definition of Granger causality (Defini- tion 1)1, for instance, we regard the causal label as X ! Y if ues of X contain helpful information for predicting the future value of Y . Formally, it is defined as follows: X is the cause of Y , and if Y is not the cause of X. Formally, we regard the causal label as 2 Definition 1 (Granger causality[Granger, 1969]) Suppose P (X jS ;S ) = P (X jS ) we have a stationary sequence of random variables f(Xt, X ! Y if t+1 X Y t+1 X (2) Yt)g (t 2 N), where Xt and Yt are on X and Y, respectively. P (Yt+1jSX ;SY ) 6= P (Yt+1jSY ) Let SX and SY be observations of fX1, ··· , Xtg and fY1, P (Xt+1jSX ;SY ) 6= P (Xt+1jSX ) ··· , Ytg, respectively. X Y if (3) P (Y jS ;S ) = P (Y jS ) Granger causality defines fXtg as the cause of fYtg if t+1 X Y t+1 Y P (Xt+1jSX ;SY ) = P (Xt+1jSX ) P (Yt+1jSX ;SY ) 6= P (Yt+1jSY ) No Causation if (4) P (Yt+1jSX ;SY ) = P (Yt+1jSY ) and states that fXtg is not the cause of fYtg if To assign causal labels to time series based on (2), (3), P (Yt+1jSX ;SY ) = P (Yt+1jSY ) (1) and (4), it is necessary to determine whether or not the two conditional distributions are identical. To represent information about conditional distributions, instead of using regres- To see if the two conditional distributions P (Y jS ;S ) sion models, we utilize kernel mean embedding. Kernel mean t+1 X Y embedding maps a distribution to a point in the feature space and P (Yt+1jSY ) are identical, traditional methods [Bell et al., 1996; Granger, 1969; Marinazzo et al., 2008; Sun, 2008] called the RKHS. Interestingly, when a characteristic kernel use statistical testing to determine whether or not the two con- (e.g., a Gaussian kernel) is used, the mapping is injective: different distributions are not mapped to the same point [Sripe- ditional means E[Yt+1jSX ;SY ] and E[Yt+1jSY ] are equal, which is a much simpler problem than (1). For instance, in rumbudur et al., 2010]. [Granger, 1969], the conditional means are represented by us- Suppose that kernel mean embedding maps condi- ing the (V)AR model to compute the test statistic based on the tional distributions P (Xt+1jSX ;SY ), P (Xt+1jSX ) and P (Y jS ;S ) P (Y jS ) µ prediction errors. t+1 X Y , t+1 Y to the points Xt+1jSX ;SY , µ 2 H and µ , µ 2 H , respec- To represent the conditional means, these methods require Xt+1jSY X Yt+1jSX ;SY Yt+1jSY Y us to use an appropriate regression model that can be well tively, where HX and HY are the RKHSs. Then, when using fitted to the data; however, such a model is difficult to select a characteristic kernel, (2), (3), and (4) can be written as in practice. For this problem, we propose a novel approach µ = µ that utilizes a classifier instead of regression models. X ! Y if Xt+1jSX ;SY Xt+1jSX (5) µYt+1jSX ;SY 6= µYt+1jSY µ 6= µ 3 Proposed Method X Y if Xt+1jSX ;SY Xt+1jSX (6) µYt+1jSX ;SY = µYt+1jSY 3.1 Classification Setup µX jS ;S = µX jS Suppose the training data are N pairs of bivariate time se- No Causation if t+1 X Y t+1 X (7) µ = µ ries S1, ··· , SN , where each time series Sj with the fixed Yt+1jSX ;SY Yt+1jSY length Tj consists of the observations of random variables To assign labels based on (5), (6), and (7), we only have f(Xj;Y j), ··· , (Xj ;Y j )g (j 2 f1, ··· , Ng). Here, each 1 1 Tj Tj to determine whether or not two points in the RKHS are the time series Sj has a causal label lj 2 f+1, −1, 0g that indi- same over time t or, equivalently, to determine whether or not cates Xj ! Y j, Xj Y j, or No Causation, where Xj = the distance between the two points in the RKHS, which is j j j j j termed the MMD [Gretton , 2007], is zero over time t. (X1 , ··· , XT ) and Y = (Y1 , ··· , YT ). et al. j j For this reason, to develop the classifier for Granger causal- Using a function ν(·) that maps Sj to a single feature vec- ity identification, in our feature representation, we utilize the tor, we first train a classifier with f(ν(Sj); lj)gN .

Causal Inference in Time Series Via Supervised Learning

The Practice of Causal Inference in Cancer Epidemiology

Statistics and Causal Inference (With Discussion)

Bayesian Causal Inference

Articles Causal Inference in Civil Rights Litigation

Causation and Causal Inference in Epidemiology

Experiments & Observational Studies: Causal Inference in Statistics

Elements of Causal Inference

A Logic for Causal Inference in Time Series with Discrete and Continuous Variables

Week 10: Causality with Measured Confounding

Editor's Comments

Natural Experiments and at the Same Time Distinguish Them from Randomized Controlled Experiments

Causal Inference in Statistics: an Overview∗†‡