CARRNN: A Continuous Autoregressive Recurrent Neural Network for Deep Representation Learning from Sporadic Temporal Data

Mostafa Mehdipour Ghazi, Lauge Sørensen, Sébastien Ourselin, and Mads Nielsen

The corresponding author M. Mehdipour Ghazi ([email protected]) was with the Department of Medical Physics and Biomedical Engineering, University College London, London, UK, and Biomediq A/S, Copenhagen, DK. He is now with the Department of Computer Science, University of Copenhagen, Copenhagen, DK. L. Sørensen and M. Nielsen are with the Department of Computer Science, University of Copenhagen, Copenhagen, DK. They are also with Biomediq A/S and Cerebriu A/S, Copenhagen, DK. S. Ourselin was with the Department of Medical Physics and Biomedical Engineering, University College London, London, UK. He is now with the School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK. Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

Abstract—Learning temporal patterns from multivariate longitudinal data is challenging, especially in cases when data is sporadic, as often seen in, e.g., healthcare applications where the data can suffer from irregularity and asynchronicity as the time between consecutive data points can vary across features and samples, hindering the application of existing deep learning models that are constructed for complete, evenly spaced data with fixed sequence lengths. In this paper, a novel deep learning-based model is developed for modeling multiple temporal features in sporadic data using an integrated deep learning architecture based on a recurrent neural network (RNN) unit and a continuous-time autoregressive (CAR) model. The proposed model, called CARRNN, uses a generalized discrete-time autoregressive model that is trainable end-to-end using neural networks modulated by time lags to describe the changes caused by the irregularity and asynchronicity. It is applied to multivariate time-series regression tasks using data provided for Alzheimer's disease progression modeling and intensive care unit (ICU) mortality rate prediction, where the proposed model based on a gated recurrent unit (GRU) achieves the lowest prediction errors among the proposed RNN-based models and state-of-the-art methods using GRUs and long short-term memory (LSTM) networks in their architecture.

Index Terms—Deep learning, recurrent neural network, long short-term memory network, gated recurrent unit, autoregressive model, multivariate time-series regression, sporadic time series.

I. INTRODUCTION

The rapid development of computational resources in recent years has enabled the processing of large-scale temporal sequences, especially in healthcare, using deep learning architectures. State-of-the-art multivariate sequence learning methods such as recurrent neural networks (RNNs) have been applied to extract high-level, time-dependent patterns from longitudinal data. Moreover, different variants of RNNs with gating mechanisms such as long short-term memory (LSTM) [1] and gated recurrent unit (GRU) [2] were introduced to tackle the vanishing and exploding gradients problem [3], [4] and capture long-term dependencies efficiently.

However, RNNs are modeled as discrete-time dynamical systems with evenly-spaced input and output time points, and are thereby ill-suited to process sporadic data, which is common in many healthcare applications [5], [6]. This real-world data problem, as shown in the left part of Figure 1, can, e.g., arise from different acquisition dates and missing values and is associated with irregularity and asynchronicity of the features, where the time between consecutive timestamps or visits can vary across different features and subjects. To address this issue, most of the existing approaches [7] use a two-step process assuming a fixed interval and applying missing data imputation techniques to complete the data before modeling the time-series measurements using RNNs. On the other hand, a majority of methods that can inherently model varied-length data using RNNs without taking notice of the sampling time information [8] fail to handle irregular or asynchronous data. Recently, there have been efforts to incorporate the time distance attribute into the RNN architectures for dealing with sporadic data.
For instance, the PLSTM [9], T-LSTM [10], GRU-D [11], tLSTM [12], and DLSTM [13] have placed exponential or linear time gates in LSTM or GRU architectures heuristically as multiplicative time-modulating functions with learnable parameters and achieved good results when applied to sporadic data. However, none of these studies have analytically investigated the proposed models, nor have they provided motivation for the RNN architecture modifications. In other words, current studies mainly focus on the design of deep architectures or loss functions, or apply the proposed solutions to only one specific type of RNN, while the capability of the method using different types of RNNs is not examined.

In this paper, a novel model is reported for modeling multiple temporal features in sporadic multivariate longitudinal data using an integrated deep learning architecture, as represented in Figure 2, based on an RNN, LSTM, or GRU to learn the long-term dependencies from evenly-spaced data and a continuous-time autoregressive (CAR) model implemented as a neural network layer modulated by time lags to compensate for the existing irregularities. The proposed model, called CARRNN, introduces an analytical solution to the ordinary differential equation (ODE) problem and utilizes a generalized discrete-time autoregressive model which is trainable end-to-end. The asynchronous events are aligned using data binning to reduce the effects of missing values and noise. Finally, the remaining missing values are estimated by using a univariate CAR(1) model applied to the adjacent observations, and in cases when the feature points are fully missing, as shown in the bottom of the left part of Figure 1, the input array and loss function are normalized with the number of available data points [8].

Fig. 1: Illustrative example of time-series data with irregularly-spaced time points and asynchronous events, and the proposed CARRNN structure for modeling the binned time points. The left subfigure shows five feature sequences of a sample subject aligned with a bin width of τ and a time gap of ∆t_k defined between the two consecutive data points at time t_k and t_{k−1}. The right subfigure shows the CARRNN model structure that learns the multivariate temporal dependencies from the binned data. Note that the bottom feature includes only missing values. (The right subfigure annotates how each challenge is handled: a normalized loss function for variable sequence lengths across subjects/samples; LSTM/GRU for long-term dependencies; CAR(1) for irregularity; (dynamic) binning and aligning of the features for asynchronicity; a diagonal, independent CAR(1) applied to the neighboring points of each feature for incomplete features; and zeros passed with the normalized input array and loss function for fully missing features.)
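As a concrete, simplified illustration of this alignment step, the sketch below assumes a fixed bin width τ and simple averaging of observations falling into the same bin, which only approximates the (dynamic) binning referred to above; the function name bin_and_align and the toy data are hypothetical and not part of the authors' implementation.

import numpy as np

def bin_and_align(events, num_features, tau):
    """Align asynchronous per-feature observations on a common time grid.

    events: list of (feature_index, timestamp, value) triplets.
    Returns an array of shape (num_bins, num_features); bins with no
    observation for a feature stay NaN, to be imputed later (e.g., by a
    univariate CAR(1) model applied to the neighboring points).
    """
    times = np.array([t for _, t, _ in events], dtype=float)
    bins = np.floor((times - times.min()) / tau).astype(int)  # bin index per event
    num_bins = bins.max() + 1
    sums = np.zeros((num_bins, num_features))
    counts = np.zeros((num_bins, num_features))
    for (f, _, v), b in zip(events, bins):
        sums[b, f] += v          # accumulate values falling into the same bin
        counts[b, f] += 1
    grid = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    return grid

# Toy usage: three features observed at irregular, asynchronous times.
events = [(0, 0.1, 1.2), (1, 0.3, 0.7), (0, 1.4, 1.5), (2, 2.2, -0.3), (1, 2.3, 0.9)]
aligned = bin_and_align(events, num_features=3, tau=1.0)  # shape (3, 3), with NaNs

Entries left as NaN would then be filled by the univariate CAR(1) imputation mentioned above, or, when a feature is fully missing, accounted for through the normalized input array and loss function.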
Fig. 2: Illustration of the proposed continuous-time autoregressive recurrent neural network model using an RNN and a CAR(1) model simulated as a built-in linear neural network layer for end-to-end learning.

II. THE PROPOSED METHOD

The proposed continuous-time autoregressive recurrent neural network model, which is trainable end-to-end as shown in Figure 2, applies an RNN to learn the long-term dependencies from the evenly-spaced (binned) data and a CAR(1) model, implemented as a linear neural network layer, to account for the irregular time gaps. Assume that the multivariate observations $x_t \in \mathbb{R}^{N \times 1}$ follow a first-order autoregressive model, AR(1), of the form

$$x_t = \phi\, x_{t-1} + c + \varepsilon_t ,$$

where $\phi \in \mathbb{R}^{N \times N}$ is the matrix of the slope coefficients of the model, $c \in \mathbb{R}^{N \times 1}$ is the vector of the regression constants or intercepts, and $\varepsilon_t \in \mathbb{R}^{N \times 1}$ is white noise as the prediction error. In general, the time interval between the consecutive observations can vary, making the sequence seem like irregularly-sampled data. Therefore, the aforementioned AR(1) model can be rewritten as

$$x(t_k) = \phi(\Delta t_k)\, x(t_{k-1}) + c(\Delta t_k) + \varepsilon(t_k) ,$$

where $\Delta t_k = t_k - t_{k-1}$ is the time gap between the two consecutive data points at time $t_k$ and $t_{k-1}$, as shown in Figure 1, and the matrix $\phi(\cdot)$, modulated by $\Delta t_k$, contains the autoregressive effects in the main diagonal and cross-lagged effects in the off-diagonals. The model can then be expressed by the first-order differential equation [14], [15] using continuous-time autoregressive parameters (drift matrix and bias vector) $\Phi \in \mathbb{R}^{N \times N}$ and $\varsigma \in \mathbb{R}^{N \times 1}$ with an exponential solution as

$$\frac{dx(t)}{dt} = \Phi\, x(t) + \varsigma + \Gamma\, \frac{d\varepsilon(t)}{dt} ,$$

$$x(t_k) = e^{\Phi \Delta t_k}\, x(t_{k-1}) + \Phi^{-1}\left(e^{\Phi \Delta t_k} - I_N\right) \varsigma + \eta(t_k) ,$$

where $I_N$ is the identity matrix of size $N \times N$, $\Gamma \in \mathbb{R}^{N \times N}$ is the Cholesky triangle of the innovation covariance or diffusion matrix $\Psi$ ($\Psi = \Gamma\Gamma^{\mathsf{T}}$, where $\mathsf{T}$ is the transpose operator), and $\eta(t_k) \in \mathbb{R}^{N \times 1}$ is the continuous-time error vector at time $t_k$.
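As an illustration of the closed-form transition above, the following is a minimal sketch (not the authors' released code; the function name car1_step, the toy parameter values, and the omission of the noise term $\eta(t_k)$ are assumptions made here for exposition) of how the exponential solution can be evaluated for irregular time gaps with NumPy/SciPy.

import numpy as np
from scipy.linalg import expm

def car1_step(x_prev, dt, Phi, varsigma):
    """One deterministic CAR(1) transition over an irregular time gap dt:
    x(t_k) = e^{Phi*dt} x(t_{k-1}) + Phi^{-1} (e^{Phi*dt} - I_N) varsigma,
    i.e., the exponential solution above with the noise term eta(t_k) omitted.
    """
    N = Phi.shape[0]
    A = expm(Phi * dt)                                     # time-modulated slope phi(dt)
    b = np.linalg.solve(Phi, (A - np.eye(N)) @ varsigma)   # time-modulated intercept c(dt)
    return A @ x_prev + b

# Toy usage: a 3-feature state propagated across irregular gaps delta t_k.
rng = np.random.default_rng(0)
Phi = -0.5 * np.eye(3) + 0.05 * rng.standard_normal((3, 3))  # drift matrix
varsigma = rng.standard_normal(3)                            # bias vector
x = rng.standard_normal(3)
for dt in (0.4, 1.7, 0.9):
    x = car1_step(x, dt, Phi, varsigma)

The two time-modulated quantities computed here play the roles of $\phi(\Delta t_k)$ and $c(\Delta t_k)$ in the rewritten AR(1) model above; in CARRNN the corresponding parameters are learned end-to-end as part of the network rather than fixed in advance.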