
CAUSE: Learning Granger Causality from Event Sequences using Attribution Methods

Wei Zhang 1 Thomas Kobber Panum 2 Somesh Jha 1 3 Prasad Chalasani 3 David Page 4

1 Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, USA. 2 Department of Electronic Systems, Aalborg University, Aalborg, Denmark. 3 XaiPient, Princeton, NJ, USA. 4 Duke University, Durham, NC, USA. Correspondence to: Wei Zhang.

Abstract

We study the problem of learning Granger causality between event types from asynchronous, interdependent, multi-type event sequences. Existing work suffers from either limited model flexibility or poor model explainability and thus fails to uncover Granger causality across a wide variety of event sequences with diverse event interdependency. To address these weaknesses, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a novel framework for the studied task. The key idea of CAUSE is to first implicitly capture the underlying event interdependency by fitting a neural point process, and then extract from the process a Granger causality statistic using an axiomatic attribution method. Across multiple datasets riddled with diverse event interdependency, we demonstrate that CAUSE achieves superior performance on correctly inferring the inter-type Granger causality over a range of state-of-the-art methods.

1. Introduction

Many real-world processes generate a massive number of asynchronous and interdependent events in real time. Examples include the diagnosis and drug prescription histories of patients in electronic health records, the posting and responding behaviors of users on social media, and the purchase and selling orders executed by traders in stock markets, among others. Such data can be generally viewed as multi-type event sequences, in which each event consists of both a timestamp and a type label, indicating when and what the event is, respectively.

In this paper, we focus on the fundamental problem of uncovering causal structure among event types from multi-type event sequences. Since the question of "true causality" is deeply philosophical (Schaffer, 2003), and there are still massive debates on its definition (Pearl, 2009; Imbens & Rubin, 2015), we consider a weaker, prediction-based notion of causality: Granger causality. While Granger causality was initially used for studying the dependence structure of multivariate time series (Granger, 1969; Dahlhaus & Eichler, 2003), it has also been extended to multi-type event sequences (Didelez, 2008). Intuitively, for event sequence data, an event type is said to be (strongly) Granger causal for another event type if the inclusion of historical events of the former type leads to better predictions of future events of the latter type.

Due to their asynchronous nature, multi-type event sequences are often modeled in the literature by multivariate point processes (MPPs), a class of stochastic processes that characterize the random generation of points on the real line. Existing point process models for inferring inter-type Granger causality from multi-type event sequences appear to be limited to a particular case of MPPs, the Hawkes process (Eichler et al., 2017; Xu et al., 2016; Achab et al., 2018), which assumes past events can only independently and additively excite the occurrence of future events according to a collection of pairwise kernel functions. While these Hawkes process-based models are very interpretable and enjoy favorable statistical properties, the strong parametric assumptions inherent in Hawkes processes render them unsuitable for many real-world event sequences with potentially abundant inhibitive effects or event interactions. For example, maintenance events should reduce the chances of a system breaking down, and a patient who takes multiple medicines at the same time is more likely to experience unexpected adverse events.

Regarding event sequence modeling in general, a new class of MPPs, loosely referred to as neural point processes (NPPs), has recently emerged in the literature (Du et al., 2016; Xiao et al., 2017; Mei & Eisner, 2017; Xiao et al., 2019). NPPs use deep (mostly recurrent) neural networks to capture complex event dependencies, and thus excel at predicting future events due to their model flexibility. However, NPPs are uninterpretable and unable to provide insight into the Granger causality between event types.

To address this tension between model explainability and model flexibility in existing point process models, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a framework for obtaining Granger causality from multi-type event sequences using information captured by a highly predictive NPP model. At the core of CAUSE are two steps: first, it trains a flexible NPP model to capture the complex event interdependency; then it computes a novel Granger causality statistic by inspecting the trained NPP with an axiomatic attribution method. In this way, CAUSE is the first technique that brings model-agnostic explainability to NPPs and endows NPPs with the ability to discover Granger causality from multi-type event sequences exhibiting various types of event interdependencies.

A multivariate point process (MPP) characterizes the random occurrence of events of K types in continuous time. The most common way to define an MPP is through a set of conditional intensity functions (CIFs), one for each event type. Specifically, let N_k(t) := \sum_{i=1}^{\infty} \mathbb{1}(t_i \le t \wedge k_i = k) be the number of events of type k that have occurred up to t, and let H(t) := \{(t_i, k_i) \mid t_i < t\} be the history of all types of events before t. The CIF for event type k is defined as the expected instantaneous event occurrence rate conditioned on the history, i.e.,

\lambda_k^*(t) := \lim_{\Delta t \downarrow 0} \frac{\mathbb{E}[N_k(t + \Delta t) - N_k(t) \mid H(t)]}{\Delta t},

where the asterisk is a notational convention to emphasize that the intensity is conditioned upon H(t).

Contributions. Our contributions are:

• We bring model-agnostic explainability to NPPs and propose CAUSE, a novel framework for learning Granger causality from multi-type event sequences exhibiting various types of event interdependency.
• We design a two-level batching algorithm that enables efficient computation of the proposed Granger causality statistic over large event-sequence datasets.

Different parameterizations of CIFs lead to different MPPs. One classic example of an MPP is the multivariate Hawkes process (Hawkes, 1971a;b), which assumes each \lambda_k^*(t) has the following form:

\lambda_k^*(t) = \mu_k + \sum_{i: t_i < t} \phi_{k, k_i}(t - t_i),   (1)

where \mu_k \ge 0 is the background intensity of type k and each \phi_{k,k'}(\cdot) is a pairwise kernel function characterizing the excitative effect of past events of type k' on future events of type k.
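As a concrete illustration of Eq. (1), the following sketch evaluates the conditional intensities of a toy multivariate Hawkes process with exponential kernels phi_{k,k'}(dt) = alpha_{k,k'} * beta * exp(-beta * dt). All parameter values and function names here are made up for illustration; they are not taken from the paper.

import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta):
    """Evaluate lambda*_k(t) of Eq. (1) for all K types, assuming exponential
    kernels phi_{k,k'}(dt) = alpha[k, k'] * beta * exp(-beta * dt).

    history: list of (t_i, k_i) with t_i < t; mu: (K,) background rates;
    alpha: (K, K) excitation weights; beta: scalar decay rate.
    """
    lam = mu.copy()
    for t_i, k_i in history:
        if t_i < t:
            lam += alpha[:, k_i] * beta * np.exp(-beta * (t - t_i))
    return lam

# Toy example with K = 2 event types.
mu = np.array([0.10, 0.05])
alpha = np.array([[0.5, 0.0],
                  [0.3, 0.2]])          # alpha[k, k']: influence of type k' on type k
history = [(0.3, 0), (0.9, 1), (1.4, 0)]
print(hawkes_intensity(t=2.0, history=history, mu=mu, alpha=alpha, beta=1.0))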

2.3. Granger Causality for Multi-Type Event Sequences

The Granger causality definition for multi-type event sequences is established based on point process theory (Daley & Vere-Jones, 2003). To proceed formally, for any \mathcal{K} \subseteq [K], we denote by H_{\mathcal{K}}(t) the natural filtration expanded by the sub-process \{N_k(t)\}_{k \in \mathcal{K}}; that is, the sequence of smallest \sigma-algebras expanded by the past event history of any type k \in \mathcal{K} and t \in \mathbb{R}_+, i.e., H_{\mathcal{K}}(t) = \sigma(\{N_k(s) \mid k \in \mathcal{K}, s < t\}).¹ We further write H_{-k}(t) = H_{[K] \setminus \{k\}}(t) for any k \in [K].

Definition 1 (Eichler et al., 2017). For a K-dimensional MPP, event type k is Granger non-causal for event type k' if \lambda_{k'}^*(t) is H_{-k}(t)-measurable for all t.

The above definition amounts to saying that a type k is Granger non-causal for another type k' if, given the history of events other than type k, historical events of type k do not further contribute to the future \lambda_{k'}^*(t) at any time. Otherwise, type k is said to be Granger causal for type k'.

Uncovering Granger causality from event sequences is generally a very challenging task, as the underlying MPP may have rather complex CIFs with abundant event interactions and non-excitative effects. As a result, existing work tends to restrict consideration to certain classes of parametric MPPs, such as Hawkes processes (Eichler et al., 2017; Xu et al., 2016; Achab et al., 2018). Specifically, for the multivariate Hawkes process, it is straightforward from (1) that a type k is Granger non-causal for another type k' if and only if the corresponding kernel function \phi_{k',k}(\cdot) = 0.

2.4. Attribution Methods

We view an attribution method for black-box functions as another "black box", which takes in a function, an input, and a baseline, and outputs a set of meaningful attribution scores, one per feature. The following is a formal definition of attribution methods.

Definition 2 (Attribution Method). Suppose x \in \mathcal{X} \subseteq \mathbb{R}^d is a d-dimensional input and \bar{x} \in \mathcal{X} a suitable baseline. Let \mathcal{F}_d be a class of functions from \mathcal{X} to \mathbb{R}. A functional A : \mathcal{F}_d \times \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}^d is called an attribution method for \mathcal{F}_d if A_i(f, x, \bar{x}) measures the contribution of x_i to the prediction f(x) relative to \bar{x} for any f \in \mathcal{F}_d, x, \bar{x} \in \mathcal{X}, and i \in [d].

Since it is very challenging (and often subjective) to compare different attribution methods, Sundararajan et al. (2017) argue that attribution methods should ideally satisfy a number of axioms (i.e., desirable properties), which we re-state in Definition 3.

Definition 3. An attribution method A is said to satisfy the axiom of:

1. Linearity, if for any f, g \in \mathcal{F}_d, x, \bar{x} \in \mathcal{X}, and c \in \mathbb{R},
   A(f, x, \bar{x}) + A(g, x, \bar{x}) = A(f + g, x, \bar{x}),  A(cf, x, \bar{x}) = c \cdot A(f, x, \bar{x}).   (A1)
2. Completeness/Efficiency, if for any f \in \mathcal{F}_d and x, \bar{x} \in \mathcal{X},
   f(x) - f(\bar{x}) = \sum_{i=1}^{d} A_i(f, x, \bar{x}).   (A2)
3. Null player, if for any f \in \mathcal{F}_d such that f does not mathematically depend on a dimension i,
   A_i(f, x, \bar{x}) = 0 for all x, \bar{x} \in \mathcal{X}.   (A3)
4. Implementation invariance, if for any x, \bar{x} \in \mathcal{X} and any f, g \in \mathcal{F}_d such that f(x') = g(x') for all x' \in \mathcal{X},
   A(f, x, \bar{x}) = A(g, x, \bar{x}).   (A4)

Besides these four axioms, we also identify two other useful properties of attribution methods, which are less explicitly mentioned in the literature. We state these two properties in Definition 4.

Definition 4. An attribution method A is said to satisfy

1. Fidelity-to-control, if for any f \in \mathcal{F}_d, x, \bar{x} \in \mathcal{X}, and i \in [d],
   x_i = \bar{x}_i \Rightarrow A_i(f, x, \bar{x}) = 0.   (P1)
2. Batchability, if for any f \in \mathcal{F}_d and any n input/baseline pairs \{(x_i, \bar{x}_i)\}_{i \in [n]}, there exists a function F : \mathcal{X}^n \mapsto \mathbb{R} such that
   A(F, X, \bar{X}) = [A(f, x_1, \bar{x}_1), \ldots, A(f, x_n, \bar{x}_n)],   (P2)
   where X := [x_1, \ldots, x_n] and \bar{X} := [\bar{x}_1, \ldots, \bar{x}_n].

Many popular attribution methods satisfy most of these six properties, as we show in Propositions 1 and 2.

Proposition 1. Integrated Gradients (Sundararajan et al., 2017) satisfies all four axioms (A1–A4) and both properties (P1–P2), and DeepLIFT (Shrikumar et al., 2017) satisfies all but implementation invariance (A4). In particular, a choice of F for both methods satisfying batchability (P2) is F(X) := \sum_{i=1}^{n} f(x_i).

Proposition 2. For any U \subseteq [d], let \bar{U} := [d] \setminus U and define x_U \sqcup \bar{x}_{\bar{U}} to be the spliced data point between x and \bar{x} such that for any i \in [d]

[x_U \sqcup \bar{x}_{\bar{U}}]_i := \begin{cases} x_i & i \in U, \\ \bar{x}_i & i \in \bar{U}. \end{cases}   (3)

¹ Here, we abuse our previous notation H(t), which denotes the set of events that occurred prior to t. Appendix D includes a primer on measure and probability theory for readers who are less familiar with some concepts in this subsection.

Then Shapley values (Shapley, 1953) with the value function v_{f,x,\bar{x}}(U) := f(x_U \sqcup \bar{x}_{\bar{U}}) satisfy all four axioms (A1–A4) and fidelity-to-control (P1).

Proof. We include proofs for both propositions in Appendix B.

3. Proposed: CAUSE

In this section, we formally present CAUSE, a novel framework for learning Granger causality from multi-type event sequences. Our framework consists of two steps: first, it trains a neural point process (NPP) to fit the underlying event sequence data; then it inspects the predictions of the trained NPP to compute a Granger causality statistic with some "well-behaved" attribution method A(·), which we assume satisfies the following properties: linearity (A1), completeness (A2), null player (A3), fidelity-to-control (P1), and batchability (P2).

In what follows, we first describe the architecture of the NPP used in CAUSE in Section 3.1. We then elaborate the intuition behind and the definition of our Granger causality statistic in Section 3.2. Section 3.3 explains the computational challenges and presents a highly efficient algorithm for computing the statistic. We conclude this section by discussing the choice of attribution methods for CAUSE in Section 3.4.

3.1. A Semi-Parametric Neural Point Process

The design of our NPP follows the general encoder-decoder architecture of existing NPPs (Section 2.2), but we innovate on the decoder part to enable both modeling flexibility and computational feasibility.

Encoder. First, we convert each event i into an embedding vector v_i that summarizes both the temporal and the type information of that event, as follows:

v_i = [\vartheta(t_i - t_{i-1});\, V^\top z_i],   (4)

where \vartheta(\cdot) is a pre-specified function that transforms the elapsed time into one or more temporal features (chosen to be simply the identity function in our experiments), V is the embedding matrix for event types, and recall that z_i is the one-hot encoding of the event type k_i. We then obtain the embedding of a history from the event embedding sequence by

h_i = \mathrm{Enc}(v_1, v_2, \ldots, v_i),   (5)

where Enc(·) is a sequence encoder, chosen to be a Gated Recurrent Unit (GRU) (Cho et al., 2014) in our experiments.

Decoder. An ideal decoder should fulfill the following two desiderata: (a) it should be flexible enough to produce from h_i a wide variety of \lambda_k^*(t) with complex time-varying patterns; and (b) it should also be computationally manageable, particularly in terms of computing the cumulative intensity \int_{t_i}^{t_{i+1}} \lambda_k^*(t')\,dt', a key term in the log-likelihood-based training given in (2) and in the definition of our Granger causality statistic in the subsequent subsections.

We propose a novel semi-parametric decoder that enjoys both flexible modeling of the CIFs and computational feasibility. Specifically, for each i \in [n], we define the CIF \lambda_k^*(t) on (t_i, t_{i+1}] to be a weighted sum of a set of basis functions, as follows:

\lambda_k^*(t) = \sum_{r=1}^{R} \alpha_{k,r}(h_i)\, \psi_r(t - t_i),   (6)

where \{\psi_r(\cdot)\}_{r=1}^{R} is a set of pre-specified positive-valued basis functions, and \alpha : \mathbb{R}^{N_h} \mapsto \mathbb{R}_+^{K \times R} is a feedforward neural network that computes R positive weights for each of the K event types. In this way, by choosing \{\psi_r(\cdot)\}_{r=1}^{R} to be a rich-enough function family, the CIFs defined in (6) can express a wide variety of time-varying patterns. More importantly, since the parts relevant to neural networks, \alpha(\cdot) and Enc(·), are separated from the basis functions, we can evaluate the integral \int_{t_i}^{t_{i+1}} \lambda_k^*(t')\,dt' analytically, as follows:

\int_{t_i}^{t_{i+1}} \lambda_k^*(t')\,dt' = \sum_{r=1}^{R} \alpha_{k,r}(h_i)\, \Psi_r(t_{i+1} - t_i),   (7)

where \Psi_r(\Delta t) := \int_0^{\Delta t} \psi_r(t)\,dt is available in closed form for many parametric basis functions.

We choose the basis functions \{\psi_r(\cdot)\}_{r=1}^{R} to be the densities of a Gaussian family with increasing means and variances. This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as time increases. The details are given in Appendix B.3.
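The following is a minimal PyTorch sketch of the encoder-decoder design just described. Layer sizes, class and function names (SemiParametricNPP, weight_net, cumulative_intensity), and the handling of the first inter-event time are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class SemiParametricNPP(nn.Module):
    """Sketch of the semi-parametric NPP of Section 3.1: event embedding (Eq. 4),
    GRU history encoder (Eq. 5), and basis weights alpha_{k,r}(h_i) (Eq. 6)."""

    def __init__(self, num_types, embed_dim=64, hidden_dim=64, num_bases=7):
        super().__init__()
        self.type_embed = nn.Embedding(num_types, embed_dim)
        self.encoder = nn.GRU(embed_dim + 1, hidden_dim, batch_first=True)
        # alpha(.): feedforward net mapping h_i to K x R positive weights.
        self.weight_net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_types * num_bases), nn.Softplus(),
        )
        self.num_types, self.num_bases = num_types, num_bases

    def forward(self, times, types):
        # times: (B, N) event timestamps; types: (B, N) integer type labels.
        dt = torch.cat([times[:, :1], times[:, 1:] - times[:, :-1]], dim=1)
        v = torch.cat([dt.unsqueeze(-1), self.type_embed(types)], dim=-1)   # Eq. (4)
        h, _ = self.encoder(v)                                              # Eq. (5)
        weights = self.weight_net(h)                                        # alpha_{k,r}(h_i)
        return weights.view(*times.shape, self.num_types, self.num_bases)

def cumulative_intensity(weights, dt_next, basis_integrals):
    """Closed-form integral of Eq. (7): sum_r alpha_{k,r}(h_i) * Psi_r(t_{i+1} - t_i).

    basis_integrals(dt) should return a (..., R) tensor of Psi_r(dt)."""
    return (weights * basis_integrals(dt_next).unsqueeze(-2)).sum(-1)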

3.2. From Event Contributions to a Granger Causality Statistic

Now that we have trained a flexible NPP that can successively update the history embedding after each event i and then predict the future CIFs \lambda_k^*(t) from t_i until t_{i+1}, we would like to ask: can we quantify the contribution of each past event to each prediction? Since in our case the \lambda_k^*(t)'s are instantiated by two potentially highly nonlinear neural networks (i.e., Enc(·) and \alpha(\cdot)), it is not as straightforward to obtain a past event's contribution to the current event occurrence as in the case of some parametric MPPs (e.g., Hawkes processes).

A natural idea for answering this question is to apply some attribution method to the \lambda_k^*(t)'s. To do so, however, there are two challenges. First, the predictions in our case are time-varying functions rather than static quantities (e.g., the probability of a class, as commonly seen in existing applications of attribution methods); thus it is unclear which target should be attributed. Second, as the inputs to the \lambda_k^*(t)'s are multi-type event sequences with asynchronous timestamps, it is also unclear which baseline should be used.

We tackle the first challenge by setting the cumulative intensity \int_{t_i}^{t_{i+1}} \lambda_k^*(t')\,dt' to be the attribution target. This is not only because the cumulative intensity reflects the overall effect of \lambda_k^*(t') on (t_i, t_{i+1}], but also because it has a clear meaning in the context of point processes: it is the rate of the Poisson distribution that the number of events of type k on (t_i, t_{i+1}] satisfies. More importantly, since the cumulative intensity has a closed form as in (7), its gradients with respect to its input can be computed both precisely and efficiently. Note that by adopting this target, the input now includes not only \{(t_j, z_j)\}_{j \le i} but also t_{i+1}; thus we define x_i := [t_1, z_1, \ldots, t_i, z_i, t_{i+1}] and write the target \int_{t_i}^{t_{i+1}} \lambda_k^*(t')\,dt' as f_k(x_i).

As for the second challenge, we choose the baseline of an input x_i to be \bar{x}_i := [t_1, 0, \ldots, t_i, 0, t_{i+1}]; that is, the one-hot encodings of all observed event types are replaced with zero vectors. Since x_i and \bar{x}_i differ only in the dimensions corresponding to the event types, i.e., \{z_{j,k_j}\}_{j \le i}, by fidelity-to-control (P1) only these dimensions have non-zero attributions. With completeness (A2), it further follows that for every type k

f_k(x_i) - f_k(\bar{x}_i) = \sum_{j=1}^{i} A_j(f_k, x_i, \bar{x}_i),   (8)

where A_j(f_k, x_i, \bar{x}_i) is the attribution to z_{j,k_j}. Thus, the term A_j(f_k, x_i, \bar{x}_i) can be viewed as the event contribution of the j-th event to the cumulative intensity prediction f_k(x_i) relative to the baseline f_k(\bar{x}_i). Besides, event timestamps are identical in x_i and \bar{x}_i, so this contribution comes only from the event type and denotes how type k_j contributes to the prediction of type k for a specific event history x_i.

A Granger Causality Statistic. We have established the A_j(f_k, x_i, \bar{x}_i)'s as past events' contributions to the cumulative intensity f_k(x_i) on the interval (t_i, t_{i+1}]. A further question is: can we infer from these event contributions for individual predictions the population-level Granger causality among event types?

To answer this question, we define a novel statistic indicating the Granger causality from type k' to type k as follows:

Y_{k,k'} := \frac{\sum_{s=1}^{S} \sum_{i=1}^{n_s} \sum_{j=1}^{i} \mathbb{1}(k_j^s = k')\, A_j(f_k, x_i^s, \bar{x}_i^s)}{\sum_{s=1}^{S} \sum_{j=1}^{n_s} \mathbb{1}(k_j^s = k')}.   (9)

Here the numerator aggregates the event contributions over all event occurrences in the whole dataset, and the denominator accounts for the fact that some event types may occur far more frequently than others, which could lead to unreasonably large scores without such normalization. Note that an event contribution A_j(f_k, x_i^s, \bar{x}_i^s) may be negative when event j exerts an inhibitive effect; thus Y_{k,k'} can also be negative and characterize the Granger causality from type k' to type k even when the influence is inhibitive.

Attribution Regularization. One caveat in (8) and (9) is that our chosen baselines \bar{x}_i never appear in the training procedure, so the value of f(\bar{x}_i) may be meaningless or even misleading. Ideally, we would like f_k(\bar{x}_i) to be the cumulative intensity of type k given that the history prior to t_i consists of "null" events at t_1, t_2, \ldots, t_i. Thus a natural prior reflecting this idea is to make f_k(\bar{x}) nearly zero for any handcrafted baseline \bar{x}. Such an "invariance" property on f can be achieved by adding an auxiliary \ell_1 regularization for each \bar{x}_i to the NLL given in (2), leading to the training objective

\sum_{s=1}^{S} \sum_{i=1}^{n_s} \Big\{ \underbrace{-\log \lambda_{k_i^s}^*(t_i^s) + \sum_{k=1}^{K} f_k(x_{i-1}^s)}_{\text{negative log-likelihood}} + \underbrace{\eta \sum_{k=1}^{K} f_k(\bar{x}_{i-1}^s)}_{\text{regularization}} \Big\},   (10)

where \eta is a hyperparameter.

3.3. Computing the Granger Causality Statistic

While (9) defines the Y_{k,k'}'s analytically, it is rather challenging to compute them. This is because a naive implementation would require applying A(·) at each event occurrence, which is computationally prohibitive for a dataset of millions of events. Note that the normalization in (9) is easy to calculate; so if we write \tilde{Y}^s_{k,k'} := \sum_{i=1}^{n_s} \sum_{j=1}^{i} \mathbb{1}(k_j^s = k')\, A_j(f_k, x_i^s, \bar{x}_i^s), the problem is reduced to efficiently computing \sum_{s=1}^{S} \tilde{Y}^s_{k,k'}.

We propose an efficient algorithm for computing \sum_{s=1}^{S} \tilde{Y}^s_{k,k'}, which is summarized in Algorithm 1. At the core of our algorithm are two levels of batching: (a) intra-sequence batching, which allows the computation of \tilde{Y}^s_{k,k'} with only one call of A(·); and (b) inter-sequence batching, which enables batched computation of \{\tilde{Y}^s_{k,k'}\}_{s \in B} for a mini-batch of event sequences indexed by B. We explain the details of these two levels of batching as follows.

Algorithm 1: Computation of the Granger causality statistic.
Input: Event sequences \{\{(t_i^s, k_i^s)\}_{i \in [n_s]}\}_{s \in [S]}, an attribution method A(·), and a trained NPP.
Output: Granger causality statistic Y.
1: Initialize \tilde{Y} = 0 and I = [S].
2: while |I| > 0 do
3:     Sample a batch of sequence indices B \subset I.
4:     for k = 1, \ldots, K do
5:         Compute C = A(\sum_{s \in B} \sum_{i=1}^{n_s} F^s_{k,i}, X, \bar{X}).
6:         for k' = 1, \ldots, K do
7:             \tilde{Y}_{k,k'} \mathrel{+}= \sum_{s \in B} \sum_{j=1}^{n_s} \mathbb{1}(k_j^s = k')\, C_j^s.
8:     I \leftarrow I \setminus B.
9: Compute Y_{k,k'} = \tilde{Y}_{k,k'} / \sum_{s=1}^{S} \sum_{j=1}^{n_s} \mathbb{1}(k_j^s = k') for all k, k' \in [K].

Intra-Sequence Batching. As this part deals with only a particular event sequence, to simplify the notation we omit the sequence index s for now. Note that x_1 \prec x_2 \prec \cdots \prec x_n and, due to the recurrent nature of f, all f(x_i) for i \in [n] can be computed in a single forward pass with the shared input x_n. Denote by F = \{F_{k,i}(\cdot)\}_{k \in [K], i \in [n]} a matrix-valued function such that F_{k,i}(x_n) = f_k(x_i) for any k \in [K] and i \in [n]. The equivalence between f and F means that

A_j(f_k, x_i, \bar{x}_i) = A_j(F_{k,i}, x_n, \bar{x}_n),

which further implies that we can rewrite \tilde{Y}^s_{k,k'} as a weighted sum of attribution scores for the same input x_n and baseline \bar{x}_n. Since we are not interested in the individual attribution scores but in their sum, we can leverage the linearity property (A1) to compute the attribution scores directly for the sum, as shown in the following proposition.

Proposition 3. For an attribution method A(·) with linearity (A1) and null player (A3), it holds that

\tilde{Y}_{k,k'} = \sum_{j=1}^{n} \mathbb{1}(k_j = k')\, A_j\Big(\sum_{i=1}^{n} F_{k,i},\, x_n,\, \bar{x}_n\Big).   (11)

Proof. The proof is in Appendix B.4.
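To make the aggregation of Eq. (9) and the single-call trick of Eq. (11) concrete, here is a small NumPy-style sketch without inter-sequence batching. The helper names attribute and summed_target are hypothetical placeholders: attribute stands for any attribution method returning per-event scores A_j, and summed_target(k, x) stands for the summed cumulative-intensity target \sum_i F_{k,i}(x).

import numpy as np

def granger_statistic(attribute, summed_target, sequences, num_types):
    """Sketch of Algorithm 1 without inter-sequence batching.

    attribute(f, x, x_bar) -> per-event scores A_j (length n);
    summed_target(k, x)    -> sum_i F_{k,i}(x), the target of Eq. (11);
    each item of `sequences` is (x_n, x_bar_n, types) for one event sequence.
    """
    Y_tilde = np.zeros((num_types, num_types))
    counts = np.zeros(num_types)
    for x_n, x_bar_n, types in sequences:
        for k in range(num_types):
            scores = attribute(lambda x: summed_target(k, x), x_n, x_bar_n)
            for j, k_j in enumerate(types):       # scatter A_j into column k_j
                Y_tilde[k, k_j] += scores[j]
        for k_j in types:
            counts[k_j] += 1
    return Y_tilde / np.maximum(counts, 1)        # normalization of Eq. (9)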

Inter-Sequence Batching. We now discuss how to efficiently compute \{\tilde{Y}^s_{k,k'}\}_{s \in B} for a mini-batch of event sequences indexed by B. The key idea for a significant computational speed-up here is that, if A(·) satisfies batchability (P2), we can batch the computation for different sequences into a single call of A(·).

To simplify the discussion, we assume without loss of generality that B = \{1, \ldots, |B|\} and n_s \equiv n for all s \in B. Let X = [x_n^s]_{s \in B} and define the corresponding baselines \bar{X} analogously. We further override our previous notation and denote by F = \{F^s_{k,i}(\cdot)\}_{s \in [S], k \in [K], i \in [n]} a new tensor-valued function such that F^s_{k,i}(X) = f_k(x_i^s). Then with Proposition 1, we have that

A\Big(\sum_{s \in B} \sum_{i=1}^{n_s} F^s_{k,i},\, X,\, \bar{X}\Big) = \Big[ A_j\Big(\sum_{i=1}^{n_s} F^s_{k,i},\, x_n^s,\, \bar{x}_n^s\Big) \Big]_{s \in B,\, j \in [n]}.

Time Complexity Analysis. With our two-level batching scheme, Algorithm 1 requires only O(SK/B) invocations of A(·), a significant reduction from the O(SNK) invocations required by a naive implementation that directly calculates the \tilde{Y}^s_{k,k'}'s, where N is the average sequence length. Since modern computing hardware (such as GPUs) makes calling A(·) on a batch of inputs almost as fast as calling it on a single input, our algorithm can achieve up to three orders of magnitude of speedup over a naive implementation on datasets with relatively large N and B. (See Section 4.2.3 for empirical evaluations.)

3.4. Choice of Attribution Methods

In our experiments, we choose the attribution method A(·) to be Integrated Gradients, which is defined as follows:

\mathrm{IG}(f, x, \bar{x}) := (x - \bar{x}) \odot \int_0^1 \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x} = \bar{x} + \alpha(x - \bar{x})}\, d\alpha,   (12)

where \odot is the Hadamard product. Nevertheless, CAUSE does not depend on a specific attribution method but on the set of properties stated upfront; this means that any other attribution method satisfying these properties (e.g., DeepLIFT) should be applicable to CAUSE. Also note that batchability (P2) is used only in the inter-sequence batching for speeding up the computation; thus, if efficiency is less of a concern, or if the computation of attributions for different inputs can be accelerated in alternative ways,² attribution methods that only violate batchability, such as Shapley values, should also be applicable.

² In fact, for almost all attribution methods, the attribution for different inputs is embarrassingly parallelizable.
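A minimal NumPy sketch of Eq. (12), approximating the line integral with a midpoint Riemann sum. The step count and function names are illustrative (the experiments in Appendix C.2 report using 20 to 50 steps).

import numpy as np

def integrated_gradients(f, grad_f, x, x_bar, steps=50):
    """Approximate IG(f, x, x_bar) of Eq. (12) with a midpoint Riemann sum.

    grad_f(x) must return the gradient of f at x (same shape as x)."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_f(x_bar + a * (x - x_bar))
    return (x - x_bar) * total / steps

# Sanity check of completeness (A2) on a toy function.
f = lambda z: (z ** 2).sum()
grad_f = lambda z: 2 * z
x, x_bar = np.array([1.0, 2.0, 3.0]), np.zeros(3)
attr = integrated_gradients(f, grad_f, x, x_bar)
print(attr, attr.sum(), f(x) - f(x_bar))   # the attributions sum to f(x) - f(x_bar)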

4. Experiments

In this section, we present experiments designed to evaluate CAUSE and to answer the following three questions:

• Goodness-of-Fit: How good is CAUSE at fitting multi-type event sequences?
• Causality Discovery: How accurate is CAUSE at discovering Granger causality between event types?
• Scalability: How scalable is CAUSE?

The experimental results on the five datasets show that CAUSE (a) outperforms state-of-the-art methods in both fitting and discovering Granger causality from event sequences of diverse event interdependency, (b) identifies Granger causality on real-world datasets that agrees with human intuition, and (c) computes the Granger causality statistic three orders of magnitude faster due to our optimization.

4.1. Experimental Setup

Datasets. We designed three synthetic datasets to reflect various types of event interactions and temporal effects.

• Excitation: This dataset was generated by a multivariate Hawkes process, whose CIFs are defined in (1). Exponential decay kernels were used, and a weighted ground-truth causality matrix was constructed from the \ell_1 norms of the kernel functions \phi_{k,k'}(\cdot).
• Inhibition: This dataset was generated by a multivariate self-correcting process (Isham & Westcott, 1979), whose CIFs are of the form \lambda_k^*(t) = \exp(\alpha_k t + \sum_{i: t_i < t} w_{k,k_i}), where \alpha_k > 0 and w_{k,k'} \le 0.

Descriptions of the remaining datasets (Synergy, IPTV, and MT) are given in Appendix C.1.

Baselines. We compared CAUSE against the following methods:

• HExp: Hawkes process with exponential decay kernels, a standard parametric baseline.
• HSG and NPHC: Hawkes process with a sum of Gaussian kernels (Xu et al., 2016) and the nonparametric Hawkes process cumulant method (Achab et al., 2018). These two methods represent the state-of-the-art parametric and nonparametric methods for learning Granger causality for Hawkes processes, respectively.
• RPPN: Recurrent point process network (Xiao et al., 2019), to the best of our knowledge the only NPP that provides a measure of Granger causality, which is enabled by its use of an attention mechanism.

The implementation details and hyperparameter configurations for CAUSE and the various baselines are provided in Appendix C.2.

Evaluation Metrics. The hold-out negative log-likelihood (NLL) was used for evaluating the goodness-of-fit of each method on the various datasets, and Kendall's τ coefficient and the area under the ROC curve (AUC) were used for evaluating the estimated Granger causality matrix against the ground truth. Non-binary ground-truth causality matrices were binarized when computing the AUC.
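As an illustration of this evaluation protocol, the following sketch computes the two metrics with SciPy and scikit-learn on flattened causality matrices; the exact binarization and masking used in the actual experiments may differ.

import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

def causality_metrics(estimated, ground_truth):
    """Kendall's tau and AUC between an estimated K x K Granger causality
    matrix and a (possibly weighted) ground-truth matrix."""
    est, gt = np.ravel(estimated), np.ravel(ground_truth)
    tau, _ = kendalltau(est, gt)
    auc = roc_auc_score((gt != 0).astype(int), est)   # binarize a weighted ground truth
    return auc, tau

rng = np.random.default_rng(0)
gt = (rng.random((5, 5)) < 0.3) * rng.random((5, 5))  # toy weighted ground truth
est = gt + 0.1 * rng.standard_normal((5, 5))          # noisy estimate
print(causality_metrics(est, gt))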

Table 1: Results for Granger causality discovery on the four datasets with ground-truth causality available. The best and the second best results on each dataset are emboldened and italicized, respectively.

Method | Excitation AUC | Excitation Kendall's τ | Inhibition AUC | Inhibition Kendall's τ | Synergy AUC | Synergy Kendall's τ | MT AUC | MT Kendall's τ
HExp   | 0.858±0.004 | 0.453±0.005 | 0.546±0.002 | 0.102±0.002  | 0.872±0.058 | 0.251±0.039 | 0.404±0.009 | -0.061±0.005
HSG    | 0.997±0.001 | 0.635±0.002 | 0.490±0.002 | -0.013±0.002 | 0.876±0.007 | 0.254±0.039 | 0.539±0.008 | 0.024±0.005
NPHC   | 0.782±0.007 | 0.337±0.010 | 0.400±0.054 | -0.138±0.067 | 0.741±0.129 | 0.163±0.087 | N/A         | N/A
RPPN   | 0.595±0.010 | 0.136±0.012 | 0.448±0.003 | -0.066±0.002 | 0.891±0.043 | 0.264±0.029 | 0.492±0.004 | -0.005±0.002
CAUSE  | 0.920±0.012 | 0.533±0.013 | 0.921±0.021 | 0.532±0.021  | 0.991±0.004 | 0.331±0.003 | 0.623±0.012 | 0.075±0.007

Figure 1: Hold-out NLLs of the various methods (HExp, HSG, RPPN, CAUSE) on (a) Excitation, (b) Inhibition, (c) Synergy, (d) IPTV, and (e) MT, where horizontal lines denote the ground-truth NLLs. CAUSE attains the best NLLs on all datasets.

Figure 2: Visualization of the estimated Granger causality statistic matrices on IPTV; rows and columns correspond to the 16 TV program categories (e.g., ads, daily life, drama, education, entertainment, finance, kids, laws, military, movie, music, news, others, records, science, sports).

Second, once the underlying data generation process violates the assumptions of the Hawkes process and exhibits complex event interactions other than excitation, Hawkes process-based methods tend to perform poorly, as seen on Inhibition and Synergy.

Finally, despite both being NPP-based methods, RPPN performs significantly worse than CAUSE on all datasets. We suspect that this underperformance is caused by two issues in RPPN's construction of the Granger causality statistics from attention weights. First, RPPN restricts all attention weights to be positive and thus cannot distinguish between excitative and inhibitive effects. Second, the attention mechanism may not correctly attribute the model's predictions to its inputs, as shown in several recent studies (Jain & Wallace, 2019; Serrano & Smith, 2019).

Qualitative Analysis. Figure 2 shows the heat map of the Granger causality matrix of the IPTV dataset estimated by CAUSE. Almost all diagonal entries have large positive values, indicating that users, on average, exhibit strong tendencies to watch TV programs of the same category. Several positive associations between different TV program categories are also observed, such as from military, laws, finance, and education to news, and from kids and music to drama. These results agree with common sense and are consistent with the findings of an existing study with HSG (Xu et al., 2016). Our method also suggests several meaningful negative associations, including ads to drama and education to entertainment; such negative associations, however, can never be detected by models that only consider excitation between events, such as HSG.

Appendix C.4 provides a detailed analysis of the estimated Granger causality matrix for the MT dataset.

4.2.3. Scalability

Finally, we investigate the scalability of CAUSE in computing the Granger causality statistic with Algorithm 1. Figure 3 shows how much speedup Algorithm 1 achieves over a naive implementation for different sequence lengths and batch sizes. The results demonstrate that when the batch size and the average sequence length are both relatively large (i.e., at least 16 and 100, respectively), our algorithm achieves over three orders of magnitude of speedup relative to a naive implementation. Furthermore, the speedup scales almost linearly with sequence length and batch size when they do not exceed 150 and 16, respectively, which is consistent with our analysis in Section 3.3. Beyond this regime, only a sublinear relationship between the speedup and batch size or sequence length is observed, because the GPU we tested on was reaching its maximum utilization.

Figure 3: The speedup achieved by Algorithm 1 relative to a naive implementation for different average sequence lengths (10 to 200) and batch sizes (1, 4, 8, 16, 32).

5. Conclusion

We have presented CAUSE, a novel framework for learning Granger causality between event types from multi-type event sequences. At the core of CAUSE are two steps: first, it trains a flexible NPP model to capture the complex event interdependency; then it computes a novel Granger causality statistic by inspecting the trained model with an axiomatic attribution method. A two-level batching algorithm is derived to compute the statistic efficiently. We evaluate CAUSE on both synthetic and real-world datasets abundant with diverse event interactions and show the effectiveness of CAUSE in identifying the Granger causality between event types.

References

Achab, M., Bacry, E., Gaïffas, S., Mastromatteo, I., and Muzy, J.-F. Uncovering causality from multivariate Hawkes integrated cumulants. The Journal of Machine Learning Research, 18(1):6998–7025, 2018.

Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, pp. 1–16, 2018.

Arnold, A., Liu, Y., and Abe, N. Temporal causal modeling with graphical Granger methods. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 66–75, 2007.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):1–46, 2015.

Bacry, E., Bompaire, M., Gaïffas, S., and Poulsen, S. Tick: a Python library for statistical learning, with a particular emphasis on time-dependent modelling. arXiv preprint arXiv:1707.03003, 2017.

Bahdanau, D., Cho, K. H., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.

Bao, Y., Kuang, Z., Peissig, P., Page, D., and Willett, R. Hawkes process modeling of adverse drug reactions with longitudinal observational data. In Machine Learning for Healthcare, volume 68, pp. 1–14, 2017.

Bhattacharjya, D., Subramanian, D., and Gao, T. Proximal graphical event models. In Advances in Neural Information Processing Systems, pp. 8147–8156, 2018.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, 2014.

Dahlhaus, R. and Eichler, M. Causality and graphical models in time series analysis. Oxford Statistical Science Series, pp. 115–137, 2003.

Daley, D. J. and Vere-Jones, D. An Introduction to the Theory of Point Processes, volume I of Probability and its Applications. Springer-Verlag, New York, second edition, 2003.

Didelez, V. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):245–264, 2008.

Diks, C. and Panchenko, V. A new statistic and practical guidelines for nonparametric Granger causality testing. Journal of Economic Dynamics and Control, 30(9-10):1647–1669, 2006.

Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., and Song, L. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555–1564, 2016.

Durrett, R. Probability: Theory and Examples, volume 49. Cambridge University Press, fifth edition, 2019.

Eichler, M., Dahlhaus, R., and Dueck, J. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225–242, 2017.

Granger, C. W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969.

Gunawardana, A., Meek, C., and Xu, P. A model for temporal dependencies in event streams. In Advances in Neural Information Processing Systems, pp. 1962–1970, 2011.

Hawkes, A. G. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society: Series B (Methodological), 1971a.

Hawkes, A. G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971b.

Hiemstra, C. and Jones, J. D. Testing for linear and nonlinear Granger causality in the stock price-volume relation. The Journal of Finance, 49(5):1639–1664, 1994.

Hlavackova-Schindler, K., Palus, M., Vejmelka, M., and Bhattacharya, J. Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441(1):1–46, 2007.

Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Isham, V. and Westcott, M. A self-correcting point process. Stochastic Processes and Their Applications, 8(3):335–347, 1979.

Jain, S. and Wallace, B. C. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3543–3556, 2019.

Luo, D., Xu, H., Zhen, Y., Ning, X., Zha, H., Yang, X., and Zhang, W. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In IJCAI International Joint Conference on Artificial Intelligence, pp. 3685–3691, 2015.

Mei, H. and Eisner, J. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, 2017.

Pearl, J. Causality. Cambridge University Press, 2009.

Schaffer, J. The metaphysics of causation. 2003.

Serrano, S. and Smith, N. A. Is attention interpretable? In ACL, pp. 2931–2951, 2019.

Shapley, L. S. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning, 2017.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.

Weiss, J. C. and Page, D. Forest-based point process for event prediction from electronic health records. Lecture Notes in Computer Science, 8190:547–562, 2013.

Xiao, S., Yan, J., Chu, S. M., Yang, X., and Zha, H. Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 1597–1603, 2017.

Xiao, S., Yan, J., Farajtabar, M., Song, L., Yang, X., and Zha, H. Learning time series associated event sequences with recurrent point process networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2019.

Xu, H., Farajtabar, M., and Zha, H. Learning Granger causality for Hawkes processes. In International Conference on Machine Learning, pp. 1717–1726, 2016.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. 2015.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. Lecture Notes in Computer Science, 8689:818–833, 2014.

A. Additional Related Work

Event Sequence Modeling. With the increasing availability of multi-type event sequences, there has been considerable interest in modeling such data for both prediction and inference. The majority of prior research in this direction has been based on the theory of point processes (Daley & Vere-Jones, 2003), a particular class of stochastic processes that characterize the distribution of random points on the real line. Notably, the Hawkes process (Hawkes, 1971a;b), a special class of point process, has been widely investigated, partly due to its ability to capture mutual excitation among events and its mathematical tractability. However, the Hawkes process assumes that past events can only independently and additively influence the occurrence of future events, and that this influence can only be excitative; these inherent limitations restrict its modeling flexibility and render it unable to capture complex event interactions in real-world data. As such, other more flexible models have been proposed, including the piecewise-constant conditional intensity model (PCIM) (Gunawardana et al., 2011) and its variants (Weiss & Page, 2013; Bhattacharjya et al., 2018), and more recently a class of models loosely referred to as neural point processes (NPPs) (Du et al., 2016; Xiao et al., 2017; Mei & Eisner, 2017; Xiao et al., 2019). These models, particularly NPPs, generally enjoy better predictive performance than parametric point processes, since they use more expressive machine learning models (e.g., decision trees, random forests, or recurrent neural networks) to sequentially compute the conditional intensity until the next event is generated. A significant weakness of these models, however, is that they are generally uninterpretable and thus unable to provide summary statistics for determining the Granger causality among event types.

Granger Causality Discovery. In his seminal paper, Granger (1969) first proposed the concept of Granger causality for time series data. Many approaches have been proposed for uncovering Granger causality for multivariate time series, including the Hiemstra-Jones test (Hiemstra & Jones, 1994) and its improved variant (Diks & Panchenko, 2006), the Lasso-Granger method (Arnold et al., 2007), and approaches based on information-theoretic measures (Hlavackova-Schindler et al., 2007). However, as these methods are designed for synchronous multivariate time series, they are not directly applicable to asynchronous multi-type event sequence data, since otherwise one has to discretize the continuous observation window. Didelez (2008) first established the Granger causality for event types in event sequences under the framework of marked point processes. Later, Eichler et al. (2017) showed that Granger causality for Hawkes processes is entirely encoded in the excitation kernel functions (also called impact functions). To the best of our knowledge, existing research on Granger causality discovery from event sequences appears to be limited to the case of the Hawkes process (Eichler et al., 2017; Xu et al., 2016; Achab et al., 2018), possibly because of this direct link between the process parameterization and Granger causality.

Prediction Attribution for Black-Box Models. Prediction attribution, the task of assigning to each input feature a score representing the feature's contribution to a model's prediction, has been attracting considerable interest due to its ability to provide insight into predictions made by black-box models such as neural networks. While various approaches have been proposed, two groups are prominent: perturbation-based and gradient-based approaches. Perturbation approaches (Zeiler & Fergus, 2014) typically comprise, first, removing, masking, or altering a feature, and then measuring the attribution score of that feature by the change in the model's output. While perturbation-based methods are simple, intuitive, and applicable to almost all black-box models, the quality of the resultant scores is often sensitive to how the perturbation is performed. Moreover, as these methods scale linearly with the number of input features, they become computationally unaffordable for high-dimensional inputs. In contrast, backpropagation-based methods construct the attribution scores from estimates of the local gradients of the model around the input instance, computed with backpropagation. Ordinary gradients, however, can suffer from a "saturation" problem for neural networks with activation functions that contain constant-valued regions (e.g., rectified linear units (ReLUs)); that is, the gradient coming into a ReLU during the backward pass is zeroed out if the input to the ReLU during the forward pass lies in a constant region. One valid solution to this issue is to replace gradients with discrete gradients and use a modified form of backpropagation to compose discrete gradients into attributions, as in layer-wise relevance propagation (LRP) (Bach et al., 2015) and DeepLIFT (Shrikumar et al., 2017). Another solution, proposed by Integrated Gradients (IG) (Sundararajan et al., 2017), is to use the line integral of the gradients along the path from a chosen baseline to the input. Sundararajan et al. (2017) show that IG satisfies many desirable properties, as detailed in Proposition 1. It is worth mentioning that much existing work uses the intermediate results produced by certain intelligible neural network architectures as the attribution scores for an input. A most notable example of this idea is the use of attention weights induced by some attention mechanism as the importance of the input (Bahdanau et al., 2015; Xu et al., 2015).

Recently, however, there have been growing concerns about the validity of attention weights being used as explanations of neural networks (Jain & Wallace, 2019; Serrano & Smith, 2019). In particular, Jain & Wallace (2019) show that across a variety of NLP tasks, the learned attention weights are frequently uncorrelated with feature importance produced by gradient-based prediction attribution methods, and that random permutation of attention weights can nonetheless yield equivalent predictions.

B. Additional Technical Details

B.1. Proof of Proposition 1

Proof. That both IG and DeepLIFT satisfy A1–A4 has been established by Sundararajan et al. (2017). P1 is straightforward from the definition of either method. Thus, we only prove that both methods satisfy batchability (P2) with F(X) := \sum_{i=1}^{n} f(x_i).

To prove that IG satisfies batchability, we first rewrite IG(F, X, \bar{X}) as follows:

\begin{aligned}
\mathrm{IG}(F, X, \bar{X}) &= (X - \bar{X}) \odot \int_0^1 \nabla_X F[\bar{X} + \alpha(X - \bar{X})]\, d\alpha \\
&= (X - \bar{X}) \odot \int_0^1 \sum_{i=1}^{n} \nabla_X f[\bar{x}_i + \alpha(x_i - \bar{x}_i)]\, d\alpha \\
&= (X - \bar{X}) \odot \int_0^1 \sum_{i=1}^{n} \{\nabla_{x_i} f[\bar{x}_i + \alpha(x_i - \bar{x}_i)]\}\, e_i^\top\, d\alpha \\
&= \Big[ (x_i - \bar{x}_i) \odot \int_0^1 \nabla_{x_i} f[\bar{x}_i + \alpha(x_i - \bar{x}_i)]\, d\alpha \Big]_{i=1,\ldots,n},
\end{aligned}

where the second step holds because summation and gradients can be swapped, and the third step holds because the gradients of the different terms are separable. Thus, we have

[\mathrm{IG}(F, X, \bar{X})]_{:,i} = (x_i - \bar{x}_i) \odot \int_0^1 \nabla_{x_i} f[\bar{x}_i + \alpha(x_i - \bar{x}_i)]\, d\alpha = \mathrm{IG}(f, x_i, \bar{x}_i),   (13)

which establishes the formula.

The proof that DeepLIFT satisfies batchability can be established in a similar way as for IG. The key part, shown in Proposition 2 of Ancona et al. (2018), is that the attribution scores produced by DeepLIFT for a neural-network-like function f, an input x, and a baseline \bar{x}, i.e., DeepLIFT(f, x, \bar{x}), can be viewed as the Hadamard product between x - \bar{x} and a modified gradient of f at all its internal nonlinear layers. Since the last layer of F is a simple linear addition of all the f(x_i)'s, the modified gradient of F for input x_i is the same as that of f for x_i. Thus, we have

[\mathrm{DeepLIFT}(F, X, \bar{X})]_{:,i} = \mathrm{DeepLIFT}(f, x_i, \bar{x}_i).   (14)



B.2. Proof of Proposition 2

We first briefly review Shapley values. Suppose there is a team of d players working together to earn a certain amount of value. The value that each coalition U \subseteq [d] achieves is v(U), where v : 2^d \mapsto \mathbb{R} is a value function. Shapley values, proposed by Shapley (1953), provide a well-motivated way to decide how the total earning v([d]) should be distributed among the d players. Specifically, the Shapley value for each player i \in [d] is defined as

\phi_v(i) = \sum_{U \subseteq [d] \setminus \{i\}} \frac{|U|!\,(d - |U| - 1)!}{d!}\, \big[ v(U \cup \{i\}) - v(U) \big].   (15)

For any target function f \in \mathcal{F}_d, input x \in \mathcal{X}, and baseline \bar{x} \in \mathcal{X}, we define a value function v_{f,x,\bar{x}}(U) := f(x_U \sqcup \bar{x}_{\bar{U}}) for any U \subseteq [d], where x_U \sqcup \bar{x}_{\bar{U}} is the spliced data point between x and \bar{x} defined in (3). Then the Shapley values [\phi_{v_{f,x,\bar{x}}}(i)]_{i \in [d]} can be viewed as an attribution method that provides attribution scores for any f, x, and \bar{x}.
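For small d, the Shapley values of Eq. (15) with this value function can be computed by brute force. The sketch below is illustrative and exponential in d; function names are not from the paper.

from itertools import combinations
from math import factorial
import numpy as np

def shapley_attributions(f, x, x_bar):
    """Exact Shapley values of Eq. (15) with v(U) = f(x on U, x_bar elsewhere)."""
    d = len(x)
    def v(U):
        spliced = np.where(np.isin(np.arange(d), list(U)), x, x_bar)  # x_U splice (Eq. 3)
        return f(spliced)
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            for U in combinations(others, size):
                weight = factorial(size) * factorial(d - size - 1) / factorial(d)
                phi[i] += weight * (v(set(U) | {i}) - v(set(U)))
    return phi

f = lambda z: z[0] * z[1] + 2 * z[2]
print(shapley_attributions(f, np.array([1.0, 2.0, 3.0]), np.zeros(3)))
# The scores sum to f(x) - f(x_bar), illustrating completeness (A2).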

Now we prove that this attribution method based on Shapley values satisfies all four axioms (A1–A4) and fidelity-to-control (P1), as stated in Proposition 2.

Proof. First, it is clear from the definition of Shapley values in (15) that \phi_{v_{f,x,\bar{x}}}(\cdot) satisfies linearity (A1) and implementation invariance (A4). Since Shapley (1953) shows that for any value function v the Shapley values \phi_v(\cdot) satisfy

v([d]) - v(\emptyset) = \sum_{i=1}^{d} \phi_v(i),

substituting our definition of the value function v_{f,x,\bar{x}}(\cdot) into the above equation yields

f(x) - f(\bar{x}) = \sum_{i=1}^{d} \phi_{v_{f,x,\bar{x}}}(i),

which establishes completeness (A2). For any i \in [d] and U \subseteq [d] \setminus \{i\}, we have

v_{f,x,\bar{x}}(U \cup \{i\}) - v_{f,x,\bar{x}}(U) = f(x_{U \cup \{i\}} \sqcup \bar{x}_{\bar{U} \setminus \{i\}}) - f(x_U \sqcup \bar{x}_{\bar{U}}).

Note that x_{U \cup \{i\}} \sqcup \bar{x}_{\bar{U} \setminus \{i\}} and x_U \sqcup \bar{x}_{\bar{U}} can differ only in the i-th dimension. If f does not depend on the i-th dimension of its input, or if x_i = \bar{x}_i (which implies x_{U \cup \{i\}} \sqcup \bar{x}_{\bar{U} \setminus \{i\}} = x_U \sqcup \bar{x}_{\bar{U}}), then f(x_{U \cup \{i\}} \sqcup \bar{x}_{\bar{U} \setminus \{i\}}) = f(x_U \sqcup \bar{x}_{\bar{U}}) and thereby \phi_{v_{f,x,\bar{x}}}(i) = 0. Thus, \phi_{v_{f,x,\bar{x}}}(\cdot) satisfies the null player axiom (A3) and fidelity-to-control (P1).

B.3. Dyadic Gaussian Basis

Inspired by the dyadic interval bases used by Bao et al. (2017), we choose the basis functions \{\psi_r(\cdot)\}_{r=1}^{R} to be the densities of a Gaussian family \{\mathcal{N}(\mu_r, \sigma_r^2)\}_{r=1}^{R}, which we term the dyadic Gaussian basis. The means of the dyadic Gaussian basis are given by

\mu_r = \begin{cases} 0, & r = 1, \\ L / 2^{R - r}, & r = 2, \ldots, R, \end{cases}   (16)

and the standard deviations by \sigma_r = \max(\mu_r / 3, \mu_2 / 3) for r \in [R]. This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as time increases. As shown in Figure 4 for an example with L = 100 and R = 5, the first few bases, due to their small means and variances, capture short-term effects, whereas the last several characterize mid- and long-term effects.

Figure 4: An example of dyadic Gaussian bases for L = 100 and R = 5.
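A small sketch of this construction follows, including closed-form integrals Psi_r(dt) obtained from the Gaussian CDF; using the CDF for Psi_r is an assumption consistent with Eq. (7), not a quotation of the authors' code.

import numpy as np
from scipy.stats import norm

def dyadic_gaussian_basis(L, R):
    """Means and standard deviations of the dyadic Gaussian basis (Eq. 16)."""
    mu = np.array([0.0] + [L / 2 ** (R - r) for r in range(2, R + 1)])
    sigma = np.maximum(mu / 3.0, mu[1] / 3.0)
    return mu, sigma

def basis_values(dt, mu, sigma):
    """psi_r(dt): densities of N(mu_r, sigma_r^2) evaluated at dt >= 0."""
    return norm.pdf(dt, loc=mu, scale=sigma)

def basis_integrals(dt, mu, sigma):
    """Psi_r(dt) = integral_0^dt psi_r(t) dt, via the Gaussian CDF."""
    return norm.cdf(dt, loc=mu, scale=sigma) - norm.cdf(0.0, loc=mu, scale=sigma)

mu, sigma = dyadic_gaussian_basis(L=100, R=5)
print(mu)                                  # [0, 12.5, 25, 50, 100]
print(basis_integrals(10.0, mu, sigma))    # cumulative weights over a 10-unit gap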

B.4. Proof of Proposition 3

Proof. We omit the index s in this proof for brevity. First, we rewrite \tilde{Y}_{k,k'} as

\begin{aligned}
\tilde{Y}_{k,k'} &= \sum_{i=1}^{n} \sum_{j=1}^{i} \mathbb{1}(k_j = k')\, A_j(f_k, x_i, \bar{x}_i) \\
&= \sum_{j=1}^{n} \mathbb{1}(k_j = k') \Big[ \sum_{i=j}^{n} A_j(f_k, x_i, \bar{x}_i) \Big] \\
&= \sum_{j=1}^{n} \mathbb{1}(k_j = k') \Big[ \sum_{i=j}^{n} A_j(F_{k,i}, x_n, \bar{x}_n) \Big],
\end{aligned}

where in the last step we replace f with F. Since F_{k,i}, i.e., \int_{t_i}^{t_{i+1}} \lambda_k^*(t')\,dt', does not depend on the events after the i-th event, the null player axiom (A3) gives A_j(F_{k,i}, x_n, \bar{x}_n) = 0 for any j > i, which further implies that

\tilde{Y}_{k,k'} = \sum_{j=1}^{n} \mathbb{1}(k_j = k') \Big[ \sum_{i=1}^{n} A_j(F_{k,i}, x_n, \bar{x}_n) \Big].


With linearity (A1), we have

\tilde{Y}_{k,k'} = \sum_{j=1}^{n} \mathbb{1}(k_j = k')\, A_j\Big( \sum_{i=1}^{n} F_{k,i},\, x_n,\, \bar{x}_n \Big),

which establishes the formula.



C. Additional Experimental Details

C.1. Settings for the Synthetic and Real-World Datasets

We describe below the setup and preprocessing details for the five datasets considered in this paper. The statistics of these datasets are summarized in Table 2.

• Excitation. This dataset was generated by a multivariate Hawkes process whose CIFs are of the form

\lambda_k^*(t) = \mu_k + \sum_{i: t_i < t} \alpha_{k,k_i}\, \beta_{k,k_i} \exp[-\beta_{k,k_i}(t - t_i)].

We set S = 1000, K = 10, n_s \sim \mathrm{Poisson}(250), \mu_k \sim \mathrm{Uniform}(0, 0.01), and \beta_{k,k'} \sim \mathrm{Exp}(0.05). To generate a sparse excitation weight matrix A := [\alpha_{k,k'}]_{k,k' \in [K]}, we first selected all its diagonal entries and M = 16 random off-diagonal entries, then generated the values for these entries from Uniform(0, 1), and finally scaled the matrix to have a spectral radius of 0.8. (A simulation sketch for this setup is given after this list.)

• Inhibition. This dataset was generated by a multivariate self-correcting point process whose CIFs take the form

\lambda_k^*(t) = \exp\Big( \alpha_k t + \sum_{i: t_i < t} w_{k,k_i} \Big).

We chose S = 1000, K = 10, n_s \sim \mathrm{Poisson}(250), and \alpha_k \sim \mathrm{Uniform}(0, 0.05). To generate a sparse weight matrix W = [w_{k,k'}]_{k,k' \in [K]}, we first selected all its diagonal entries and M = 16 random off-diagonal entries and generated the values for these entries from Uniform(-0.5, 0).

• Synergy. This dataset was generated by a proximal graphical event model (PGEM) (Bhattacharjya et al., 2018). A PGEM assumes that the CIF of an event type depends only on whether or not its parent event types (specified by a dependency graph) have occurred in the most recent history. We designed a local dependency structure that consists of five event types labeled A–E. Among these event types, type E is the outcome and can be excited by the occurrence of type A, B, or C; types A and B, only when both have occurred in the most recent history, incur a large synergistic effect on type E; type C has an isolated excitative effect on type E and does not interact with other event types; and finally, type D has no excitative effect and is introduced to complicate the learning task. The dependency graph, together with the corresponding time windows and intensity tables, is illustrated in Figure 5. To add more complexity to this dataset, we further replicated this local structure once more, leading to a total of K = 10 event types. We generated S = 1000 event sequences with a maximum time span of T = 1000.

• IPTV. We obtained the dataset from the source given in footnote 4. We normalized the timestamps into days and split long event sequences so that the length of each sequence is at most 1000.

• MT. We downloaded the raw MemeTracker phrase data from the source given in footnote 5. We kept the phrases that occurred from 2008-08-01 to 2008-09-30 and from the top-100 website domains. We further normalized the timestamps into hours and filtered out those event sequences (i.e., phrase cascades) whose lengths are not between 3 and 500.
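As an illustration of the Excitation setup above, the following self-contained sketch simulates one sequence with Ogata's thinning algorithm. It is not the toolkit used in the paper, and the parameter-sampling details are simplified (e.g., the adjacency here is diagonal only).

import numpy as np

def simulate_hawkes_exp(mu, A, beta, T, rng):
    """Ogata's thinning for a multivariate Hawkes process with CIFs
    lambda*_k(t) = mu_k + sum_{t_i < t} A[k, k_i] * beta[k, k_i] * exp(-beta[k, k_i] (t - t_i))."""
    K = len(mu)
    events, t = [], 0.0

    def intensities(t):
        lam = mu.copy()
        for t_i, k_i in events:
            lam += A[:, k_i] * beta[:, k_i] * np.exp(-beta[:, k_i] * (t - t_i))
        return lam

    while t < T:
        lam_bar = intensities(t).sum() + 1e-9    # valid upper bound: intensities decay after t
        t = t + rng.exponential(1.0 / lam_bar)
        if t >= T:
            break
        lam = intensities(t)
        if rng.random() < lam.sum() / lam_bar:   # accept the candidate, then pick its type
            k = rng.choice(K, p=lam / lam.sum())
            events.append((t, k))
    return events

rng = np.random.default_rng(0)
K = 3
mu = rng.uniform(0, 0.01, size=K)
A = np.diag(rng.uniform(0, 1, size=K))           # toy adjacency (diagonal only)
A *= 0.8 / max(np.abs(np.linalg.eigvals(A)))     # rescale to spectral radius 0.8
beta = rng.exponential(1 / 0.05, size=(K, K))
print(len(simulate_hawkes_exp(mu, A, beta, T=1000.0, rng=rng)))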

Table 2: Statistics for various datasets.

Dataset    | S       | K   | # of events | Ground truth
Excitation | 1,000   | 10  | 250,447     | Weighted
Inhibition | 1,000   | 10  | 250,442     | Weighted
Synergy    | 1,000   | 10  | 178,338     | Binary
IPTV       | 1,869   | 16  | 966,338     | N/A
MT         | 382,014 | 100 | 3,419,399   | Weighted


Figure 5: The dependency graph, time windows, and intensity tables for the PGEM used in generating the Synergy dataset.

C.2. Implementation Details and Hyperparameter Configurations for Various Methods

For CAUSE, Enc(·) and \alpha(\cdot) were implemented by a single-layer GRU and a two-layer fully connected network with skip connections, respectively. The dimension of the event type embeddings was fixed to 64, and the number of hidden units of the GRU was set to 64 for the synthetic datasets and 128 for the real-world datasets. The number of basis functions R and the maximum mean L were chosen by a rule of thumb such that \mu_2 and \mu_R are of the same scale as the 50th and the 99th percentiles of the inter-event elapsed times, respectively. The optimization was conducted with Adam with an initial learning rate of 0.001. A hold-out validation set consisting of 10% of the sequences of the training set was used for model selection; the model snapshot that attains the smallest validation loss was chosen. As event sequence lengths vary greatly on the two real-world datasets, in constructing mini-batches for both training and inference we adopted a bucketing technique to reduce the amount of wasted computation caused by padding. Finally, the line integral of IG, defined in (12), was approximated with 20 steps for MT and 50 steps for the other datasets; a smaller number of steps, although it may result in a certain loss of accuracy, allows for a larger batch size and thus a shorter execution time for attribution.

For the Hawkes process-based baselines HExp, HSG, and NPHC, we used the implementation provided by the package tick (Bacry et al., 2017). The most relevant hyperparameters of each method were tuned by cross-validation. As there is no publicly available code for RPPN, we implemented it with our best effort. Its overall settings for architecture and optimization are similar to those for CAUSE.
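A minimal sketch of the bucketing idea mentioned above (group sequences of similar length so that per-batch padding stays small); this is an illustration, not the authors' implementation.

def bucketed_minibatches(sequences, batch_size):
    """Group sequences of similar length into mini-batches to limit padding waste."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = [sequences[i] for i in idx]
        max_len = max(len(s) for s in batch)
        # Pad every sequence only up to this bucket's own maximum length.
        yield [s + [(0.0, 0)] * (max_len - len(s)) for s in batch], idx

# Usage: each element of `sequences` is a list of (timestamp, type) pairs.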

4 https://github.com/HongtengXu/Hawkes-Process-Toolkit/tree/master/Data
5 https://www.memetracker.org/data.html#raw

(a) MT (Community 1) (b) MT (Community 2)

Figure 6: The top-two communities of the estimated Granger causality statistic matrix on MT (Figures 6a and 6b). Best viewed on screen.

C.3. Platform and Runtime

All experiments were conducted on a server with a 16-core CPU, 512 GB of memory, and two Quadro P5000 GPUs. On the largest dataset, MT, the total runtime for CAUSE was less than 3 hours, including both training and computing the Granger causality statistic.

C.4. Qualitative Analysis on MT

Since there are too many event types in MT to display as a heat map, we instead visualize the causality matrix as a graph and show in Figure 6a and Figure 6b the top-two communities of that graph, where the directed edges denote the estimated Granger causality between pairs of domains.6 In Figure 6a, the domain news.google.com sits in the center and is pointed to by many sites, which is unsurprising because Google News aggregates articles from other publishers and websites. Our method also correctly identifies other major “information-consuming” domains such as bogleheads.org, an active forum for investment-related Q&A. In Figure 6b, the then-popular social networking website blog.myspace.com sits at the center of the community. Our method also identifies credible excitative relationships between the subdomains of craigslist.org, a mega-website that hosts classified local advertisements.

6 The graph visualization and community detection were performed using the software Gephi.
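The figures were produced with Gephi (footnote 6); as a rough programmatic analogue (our own sketch, not part of the original pipeline), one can build a directed graph from an estimated Granger causality statistic matrix A and extract communities with NetworkX, although the greedy modularity heuristic used here may yield communities different from Gephi's:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def causality_communities(A, names, threshold=0.0):
    """A[i, j]: estimated Granger causality statistic of type i on type j;
    `names` maps indices to domain names. Edges with statistic below
    `threshold` are dropped before community detection."""
    G = nx.DiGraph()
    G.add_nodes_from(names)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            if i != j and A[i, j] > threshold:
                G.add_edge(names[i], names[j], weight=float(A[i, j]))
    # Modularity-based community detection operates on the undirected view.
    comms = greedy_modularity_communities(G.to_undirected(), weight="weight")
    return G, sorted(comms, key=len, reverse=True)  # largest communities first
```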

D. A Primer on Measure and Probability Theory

In this section, we review some basic definitions from measure theory, which may help in understanding the definition of Granger causality for multivariate point processes. Most of the content in this section is adapted primarily from Chapter 1 of (Durrett, 2019) and Appendix 3 of (Daley & Vere-Jones, 2003).

Let Ω be a set of “outcomes” and F a nonempty collection of subsets of Ω. The set F is a σ-algebra on Ω if it is closed under complements and countable unions; that is,
1. if A ∈ F, then Ω \ A ∈ F, and
2. if A_i ∈ F is a countable sequence of sets, then ∪_i A_i ∈ F.
Since the intersection of an arbitrary (possibly uncountable) family of σ-algebras on Ω is again a σ-algebra, it follows that given a nonempty set Ω and a collection A of subsets of Ω, there is a smallest σ-algebra containing A; we denote this smallest σ-algebra by σ(A). One σ-algebra of particular interest is the Borel σ-algebra, that is, the smallest σ-algebra containing all open sets in R^d, denoted by R^d. Specifically, let S^d be the collection consisting of the empty set plus all sets of the form (a_1, b_1] × · · · × (a_d, b_d] ⊂ R^d, where −∞ ≤ a_i < b_i ≤ ∞; then R^d = σ(S^d). The superscript d is dropped when d = 1.

A pair (Ω, F), in which Ω is a set and F is a σ-algebra on Ω, is called a measurable space. A measure defined on (Ω, F) is a nonnegative, countably additive set function; that is, a function µ : F → R with
1. µ(A) ≥ µ(∅) = 0 for all A ∈ F, and
2. if A_i ∈ F is a countable sequence of disjoint sets, then µ(∪_i A_i) = Σ_i µ(A_i).

If µ(Ω) = 1, we call such a µ a probability measure. The triplet (Ω, F, µ) is called a measure space, and a probability space if µ is a probability measure.

Given a probability space (Ω, F, µ), a real-valued function X defined on Ω is said to be a random variable if for every Borel set B ∈ R we have X^{-1}(B) ≜ {ω : X(ω) ∈ B} ∈ F; in other words, X is F-measurable. A stochastic process is a collection of random variables {X_i}_{i∈I} defined on a common probability space and indexed by an index set I. In most cases, the index set is the natural numbers N_+ or the real half-line R_+. A filtration is a family of σ-algebras {F_i}_{i∈I} such that F_j ⊆ F_i whenever j ≤ i for i, j ∈ I. Given a stochastic process {X_i}_{i∈I} defined on (Ω, F, µ), the natural filtration of F with respect to the process is given by

H_i ≜ σ({X_j^{-1}(B) | j ∈ I, j ≤ i, B ∈ R}). (17)

It is, in a sense, the simplest filtration available for studying the given process: all information concerning the process, and only that information, is available in the natural filtration. Thus, the natural filtration H_i can often be viewed as the “history” of the subprocess {X_j}_{j≤i, j∈I}. Note that the definition in (17) is sometimes simply written as H_i ≜ σ({X_j | j ∈ I, j ≤ i}).
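As a small illustration of (17) (added here for concreteness, not part of the original text), let Ω = {0, 1}^2, I = {1, 2}, and let X_1(ω) = ω_1 and X_2(ω) = ω_2 be the coordinate maps. Then H_1 = σ({X_1^{-1}(B) | B ∈ R}) = {∅, {0}×{0,1}, {1}×{0,1}, Ω}, which distinguishes outcomes only through their first coordinate, whereas H_2 is the full power set of Ω, since the preimages of X_1 and X_2 together separate all four outcomes. In particular, H_1 ⊆ H_2 as required of a filtration, and H_i encodes exactly the information revealed by the process up to index i.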

A point process {T_i}_{i≥1} is a real-valued stochastic process indexed by N_+ such that T_i ≤ T_{i+1} almost surely. Each random variable is generally viewed as the arrival timestamp of an event. For each point process, one can define an associated, continuously indexed stochastic process called the counting process, N(t) ≜ Σ_{i=1}^∞ 1(T_i ≤ t). From this definition, it is easily seen that every realization of a counting process is a càdlàg (i.e., right continuous with left limits) step function, and that a counting process N(t) equivalently defines a point process, as one can recover the event timestamps by T_i = inf{t ≥ 0 : N(t) = i}. Due to this equivalence, the phrases point process and counting process, as well as their notation, {T_i}_{i∈N_+} and N(t), are often used interchangeably in the literature.

A K-dimensional multivariate point process (MPP) is a coupling of K point/counting processes, N(t) = [N_1(t), N_2(t), ..., N_K(t)]. A realization of a multivariate point process is a multi-type event sequence {(t_i, k_i)}_{i∈N_+}, where t_i indicates the timestamp of the i-th event and k_i indicates which dimension the i-th event comes from (often interpreted as the event type). The most common way to define an MPP is through a set of conditional intensity functions (CIFs), one for each event type. Specifically, let H(t) ≜ σ({N_k(s) | k ∈ [K], s < t}) be the natural filtration of the MPP for any t, and let H(t−) ≜ lim_{s↑t} H(s). The CIF for event type k is defined as the expected instantaneous event occurrence rate conditioned on the natural filtration, i.e.,

λ_k^*(t) ≜ lim_{∆t↓0} E[N_k(t + ∆t) − N_k(t) | H(t)] / ∆t,

where the use of the asterisk is a notational convention to emphasize that the intensity λ_k^*(t) must be H(t)-measurable for every t.
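To make the point-process/counting-process correspondence described above concrete, here is a minimal Python sketch (purely illustrative; it assumes a finite, strictly increasing list of timestamps):

```python
import numpy as np

def N(timestamps, t):
    """Counting process N(t) = sum_i 1(T_i <= t)."""
    return int(np.sum(np.asarray(timestamps) <= t))

def T(timestamps, i):
    """Recover T_i = inf{t >= 0 : N(t) = i}. Since N only jumps at event
    times, the infimum is attained at the first event time where N hits i."""
    return min(t for t in timestamps if N(timestamps, t) == i)

events = [0.7, 1.9, 3.2]      # T_1, T_2, T_3
assert N(events, 2.0) == 2    # two events occurred by t = 2.0
assert T(events, 2) == 1.9    # the second event time is recovered
```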

Finally, for any K ⊆ [K], denote by H_K(t) the natural filtration generated by the sub-process {N_k(t)}_{k∈K}, i.e., H_K(t) = σ({N_k(s) | k ∈ K, s < t}), and further write H_{−k}(t) = H_{[K]∖{k}}(t) for any k ∈ [K]. For a K-dimensional MPP, event type k is Granger non-causal for event type k′ if λ_{k′}^*(t) is H_{−k}(t)-measurable for all t. This definition amounts to saying that a type k is Granger non-causal for another type k′ if, conditioned on the history of events other than type k, the future λ_{k′}^*(t) does not depend on the historical events of type k at any time. Otherwise, type k is said to be Granger causal for type k′.
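As an illustrative example (added here, not part of the original text), consider a bivariate MPP in which type 2 has the exponentially decaying intensity

λ_2^*(t) = µ_2 + α Σ_{i: t_i < t, k_i = 1} exp(−β(t − t_i)), with µ_2, β > 0.

If α = 0, then λ_2^*(t) ≡ µ_2 is deterministic and hence trivially H_{−1}(t)-measurable for all t, so type 1 is Granger non-causal for type 2. If α ≠ 0, then λ_2^*(t) depends on the past type-1 event times, is no longer H_{−1}(t)-measurable, and type 1 is therefore Granger causal for type 2.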