Recent Advance in Temporal Point Process: from Machine Learning Perspective
Junchi Yan
Shanghai Jiao Tong University
[email protected]

Abstract

Temporal point process (TPP) has served as a versatile framework for modeling event sequences in continuous time space. It spans a wide range of applications, as event data is prevalent and increasingly available, e.g. online purchases and device failures. Tailored TPP learning algorithms have been devised for different special processes, complemented by recent neural network based approaches. In general, traditional statistical TPP models are more interpretable and less data ravenous; their success rests on an appropriate selection of the intensity function via domain knowledge. In contrast, emerging network based models have higher capacity to digest massive event data with less reliance on model selection, though their physical meaning becomes less comprehensible. From the machine learning perspective, this survey presents a literature review on these two threads of research. We walk through several working examples to provide a concrete disclosure of representative techniques.

1 Introduction

Many real-world scenarios produce data that can be modeled by the temporal point process (TPP). Examples include device disorders with their error codes, and earthquakes with magnitude and location, whereby various types of asynchronous events interact with each other and exhibit complex dynamic patterns in the continuous time domain. Investigating this dynamic process and the underlying causal relationships lays the foundation for further applications such as micro- and macro-level event prediction and root cause diagnosis. Specifically, TPP has facilitated the tackling of many domain-specific applications, given the increasing availability of large-scale event data, e.g. from social media and online commerce.

Despite the abounding literature on time series based sequence models such as the Markov chain, the hidden Markov model and the vector auto-regressive model, learning methods that explicitly address problems with asynchronously generated event data only started to emerge over the last decade.

1.1 Preliminaries on temporal point process

Temporal point process (TPP) [Daley and David, 2007] is a classic mathematical tool for modeling stochastic point processes in continuous time space, which often takes the event sequence as a concrete embodiment¹. A temporal point process is a random process whose realization consists of a sequence of (labeled) events in continuous time. TPP provides a principled treatment by directly absorbing the raw timestamps, so that the time information is accurately kept. Compared with the time series representation, which converts an event sequence by aggregation over a predefined time interval, TPP dismisses the unwanted discretization error, formally referred to as the Modifiable Areal Unit Problem [Fotheringham and Wong, 1991], i.e. the learning can be sensitive to the choice of the interval length used for aggregation. Moreover, TPP has a remarkably well-established theoretical foundation, and it can mathematically incorporate the whole history without specifying a fixed order as required by Markovian models.

¹We use the event sequence as the concrete form of a stochastic point process for discussion in this paper without loss of generality.

A temporal point process is equivalent to a counting process, denoted by N(t), which counts the number of events before time t. The keystone of TPP is its (conditional) intensity function, i.e. the stochastic model for the next event conditioned on the history. Formally, for an infinitesimal time window [t, t+dt), let λ*(t) be the occurrence rate of a future event conditioned on the history H_t = {(z_i, t_i) | t_i < t} up to but not including time t. We have the following definition:

\lambda^*(t)\,dt = \mathbb{P}\big(\text{event in } [t, t+dt) \mid \mathcal{H}_t\big) = \mathbb{E}\big[dN(t) \mid \mathcal{H}_t\big]

where E[dN(t) | H_t] is the expected number of events happening in the interval [t, t+dt) given the historical observations H_t. Note that we assume a regular point process [Rubin, 1972] (as most of the point process literature does), i.e. two events coincide with probability 0, so dN(t) ∈ {0, 1}. The * notation is a reminder that the function depends on the history, and we omit H_t for conciseness. The conditional intensity function plays a central role, and many processes differ mainly in how it is parameterized. Readers are referred to the textbook [Daley and David, 2007] for a more detailed and rigorous treatment.
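To make the role of the conditional intensity concrete, the following minimal Python sketch samples an event sequence by Ogata's thinning algorithm, a standard simulation technique for TPPs. The function names and the constant dominating rate lam_bar are illustrative assumptions, not constructs from the works cited above.

```python
import math
import random

def simulate_by_thinning(lam, lam_bar, T, seed=0):
    """Sample event times on [0, T] via Ogata's thinning.

    lam(t, history): conditional intensity lambda*(t) given past event times.
    lam_bar: constant assumed to dominate lam on [0, T] (illustrative choice).
    """
    rng = random.Random(seed)
    t, history = 0.0, []
    while True:
        # Propose the next candidate from a homogeneous Poisson(lam_bar) process.
        t += rng.expovariate(lam_bar)
        if t > T:
            return history
        # Accept the candidate with probability lambda*(t) / lam_bar.
        if rng.random() * lam_bar <= lam(t, history):
            history.append(t)

# Example: Hawkes intensity lambda*(t) = mu + sum_i alpha * exp(-beta * (t - t_i)).
mu, alpha, beta = 0.5, 0.8, 1.0
hawkes = lambda t, hist: mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in hist)
events = simulate_by_thinning(hawkes, lam_bar=10.0, T=50.0)
```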
1.2 Motivation of the survey

TPP provides a solid mathematical framework for modeling event sequences [Daley and David, 2007]. However, only recently has the machine learning community begun to widely adopt this tool for practical problems. The bond with modern machine learning techniques has, in turn, significantly advanced its theory and methods, e.g. via the alternating direction method of multipliers [Zhou et al., 2013a] and adversarial learning [Xiao et al., 2018]. To give a principled picture of recent advances and to help readers understand the technical details to an appropriate extent, a survey from the machine learning perspective is called for.

There are a few relevant surveys: [González et al., 2016] reviews the specific spatio-temporal point process, with a focus on the self-exciting process. For its theoretical importance and practical dominance in applications, the Hawkes process and its applications in finance are reviewed in [Hawkes, 2018], authored by Hawkes himself. More examples and discussion of the Hawkes process in finance can be found in another excellent survey [Bacry et al., 2015], which is likewise tailored to the finance setting. Many relevant works from the machine learning community are missing from these articles, especially neural network based ones, which are covered in this paper.

The purpose of this survey is to identify recent advances in TPP from the machine learning perspective, whereby learning objectives and algorithms are the main focus. The discussion navigates from traditional statistical models to neural network methods, with the latter showing promising capability for learning from massive data with less reliance on prior knowledge. Due to its prevalence in both theoretical study and real-world applications, the self-exciting Hawkes process and its variants are frequently discussed.

1.3 Traditional TPP vs. neural TPP

Temporal point process models are often used for either future prediction or quasi-causality discovery. These two targets are closely related to two central problems in machine learning: model capacity (for prediction accuracy) and model interpretability [Choi et al., 2016]. The history of TPP likewise evolves from traditional parametric models, whereby the form of the conditional intensity function is manually pre-specified, to more recent neural network based models, which we call neural point processes in line with the term used in [Mei and Eisner, 2017; Du et al., 2016], and which remove the need for explicit parametric intensity form selection. In fact, it is generally observed [Mei and Eisner, 2017; Du et al., 2016; Xiao et al., 2017b] that traditional TPP models with an explicit parametric intensity function excel at clear interpretation of the learning problem, while neural point process models show high capacity for learning arbitrary and unknown distributions. In the following, we navigate through the works along these two threads. In particular, as the neural point process is a more emerging area, we describe it in more detail, including specific formulas used in representative works.

2.1 Likelihood of event sequences

The derivation below is based on the definition of the so-called regular point process [Rubin, 1972]. For notational conciseness in a compact survey, in the following we ignore the event mark, which does not significantly change the equations.

The joint density function can be written as:

f(t_1, t_2, \ldots, t_n) = \prod_j f^*(t_j \mid \mathcal{H}_j)    (1)

where H_j = {t_1, t_2, …, t_{j−1}} is the history for the event at t_j, covering events from t_0 up to but not including t_j. Recall that the conditional density function f*(t_{j+1}) (again we omit H_j in f*) has an elegant relation with the conditional intensity function: λ*(t) S*(t) = f*(t), where

S^*(t) = \exp\Big(-\int_{t_j}^{t} \lambda^*(\tau)\, d\tau\Big)

is the probability that no new event occurs up to time t since t_j. Hence for each t_{j+1} we have

f^*(t_{j+1}) = \lambda^*(t_{j+1}) \exp\Big(-\int_{t_j}^{t_{j+1}} \lambda^*(\tau)\, d\tau\Big)    (2)

Accordingly, the log-likelihood function can be written as:

\log f(t_1, t_2, \ldots, t_n) = \sum_{j=1}^{n} \log \lambda^*(t_j) - \int_{t_0}^{t_n} \lambda^*(\tau)\, d\tau    (3)

Note that for conciseness, the equations presented in this paper are mainly for the unmarked point process. Under the same framework, readers are referred to [Liniger, 2009] for details of the multi-dimensional case, i.e. the marked point process.
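As a concrete instance of Eq. (3), the sketch below evaluates the log-likelihood of an unmarked Hawkes process with an exponential kernel g_exc(s) = α exp(−βs), for which the integral term has a closed form. The function name is ours, and the O(n²) intensity evaluation is kept for clarity; a well-known O(n) recursion exists for this kernel.

```python
import math

def hawkes_loglik(times, mu, alpha, beta, T):
    """Eq. (3) for lambda*(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)),
    with events `times` observed on the window [0, T]."""
    # First term: sum of log-intensities at the observed event times.
    log_term = 0.0
    for j, tj in enumerate(times):
        lam = mu + sum(alpha * math.exp(-beta * (tj - ti)) for ti in times[:j])
        log_term += math.log(lam)
    # Second term: integral of lambda*(tau) over [0, T], which is available
    # in closed form for the exponential kernel.
    integral = mu * T + sum(
        (alpha / beta) * (1.0 - math.exp(-beta * (T - ti))) for ti in times
    )
    return log_term - integral

# Example usage on a toy sequence:
print(hawkes_loglik([0.7, 1.2, 3.5], mu=0.5, alpha=0.8, beta=1.0, T=5.0))
```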
2.2 Popular intensity function forms

Traditional works are mostly developed around the design of the intensity function; a short code sketch of representative forms is given after the list:

i) Poisson process: the basic form is history independent, λ(t) = λ_0, and dates back to the 1900s. Relaxing the constant constraint to let λ(t) be a function of time leads to the non-homogeneous Poisson process, and a further extension to a stochastic intensity leads to the famous doubly stochastic Poisson process, also called the Cox process, which first appeared in [Cox, 1955];

ii) Reinforced Poisson process [Pemantle, 2007]: the model captures the 'rich-get-richer' mechanism via λ(t) = λ_0 f(t) i(t), where f(t) mimics the aging effect while i(t) accumulates the history events;

iii) Self-exciting process [Hawkes, 1971]: also called the Hawkes process, it provides an additive model capturing the self-exciting effect of history events: λ(t) = λ_0 + Σ_{t_i<t} g_exc(t − t_i). This model also has an alternative representation via the Poisson branching process [Hawkes and Oakes, 1974];

iv) Reactive point process [Ertekin et al., 2015]: it can be regarded as a generalization of the Hawkes process, adding a self-inhibiting term to account for the inhibiting effects of history events: λ(t) = λ_0 + Σ_{t_i<t} g_exc(t − t_i) − Σ_{t_i<t} g_inh(t − t_i);

v) Self-correcting process [Isham and Westcott, 1979].
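To contrast the parametric forms above, here is a minimal sketch of items i), iii) and iv), assuming exponential kernels for the exciting and inhibiting terms; the parameter values and the clipping of the reactive intensity at zero are illustrative choices.

```python
import math

def poisson_intensity(t, history, lam0=1.0):
    # i) Homogeneous Poisson process: independent of the history.
    return lam0

def hawkes_intensity(t, history, lam0=0.5, a_exc=0.8, b_exc=1.0):
    # iii) Hawkes process: baseline plus additive excitation from past events.
    return lam0 + sum(a_exc * math.exp(-b_exc * (t - ti))
                      for ti in history if ti < t)

def reactive_intensity(t, history, lam0=0.5,
                       a_exc=0.8, b_exc=1.0, a_inh=0.3, b_inh=0.5):
    # iv) Reactive point process: excitation minus inhibition; clipping at
    # zero keeps the intensity non-negative (an illustrative safeguard).
    exc = sum(a_exc * math.exp(-b_exc * (t - ti)) for ti in history if ti < t)
    inh = sum(a_inh * math.exp(-b_inh * (t - ti)) for ti in history if ti < t)
    return max(0.0, lam0 + exc - inh)

# Example: evaluate each intensity at t = 2.0 given two past events.
past = [0.5, 1.4]
for f in (poisson_intensity, hawkes_intensity, reactive_intensity):
    print(f.__name__, f(2.0, past))
```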