UNDER REVIEW 1

Uncertainty-Aware Variational-Recurrent Imputation Network for Clinical Time Series Ahmad Wisnu Mulyadi, Eunji Jun, and Heung-Il Suk, Member, IEEE

Abstract—Electronic health records (EHR) consist of longitudi- data yields poor performance at high missing rates and also nal clinical observations portrayed with sparsity, irregularity, and requires modeling separately for the distinct dataset. In fact, high-dimensionality, which become major obstacles in drawing those missing values reflect the decisions made by health-care reliable downstream clinical outcomes. Although there exist great numbers of imputation methods to tackle these issues, most of providers [5]. Therefore, the missing values and their patterns them ignore correlated features, temporal dynamics and entirely contain information regarding a patient’s health status [5] and set aside the uncertainty. Since the missing value estimates involve correlate with the outcomes or target labels [6]. Thus, we the risk of being inaccurate, it is appropriate for the method to resort to the imputation approach by exploiting those missing handle the less certain information differently than the reliable patterns to improve the prediction of the clinical outcomes as data. In that regard, we can use the uncertainties in estimating the missing values as the fidelity score to be further utilized to the downstream task. alleviate the risk of biased missing value estimates. In this work, There exist numerous strategies for imputing missing values we propose a novel variational-recurrent imputation network, in the literature. Generally, imputation methods can be clas- which unifies an imputation and a prediction network by taking sified into a deterministic or stochastic approach, depending into account the correlated features, temporal dynamics, as well on the use of randomness [7]. Deterministic methods, such as as the uncertainty. Specifically, we leverage the deep generative model in the imputation, which is based on the distribution mean [8] and median filling [9], produce only one possible among variables, and a recurrent imputation network to exploit value when estimating the missing values. However, it is the temporal relations, in conjunction with utilization of the desirable for the imputation model to generate values or sam- uncertainty. We validated the effectiveness of our proposed model ples by considering the distribution of the available observed on two publicly available real-world EHR datasets: PhysioNet data. Thus, it leads us to the employment of stochastic-based Challenge 2012 and MIMIC-III, and compared the results with other competing state-of-the-art methods in the literature. imputation methods. The recent rise of deep learning models has offered potential Index Terms—Bioinformatics, Deep generative model, Deep solutions in accommodating such circumstances. Variational learning, Electronic health records (EHR), In-hospital mortal- ity prediction, Missing value imputation, Time-series modeling, autoencoders (VAEs) [10] and generative adversarial networks Uncertainty (GANs) [11] exploit the latent distribution of high-dimensional incomplete data and generate comparable data points as the approximation estimates for the missing or corrupted values I.INTRODUCTION [12, 13, 14]. However, such deep generative models are LECTRONIC health records (EHRs) store longitudinal insufficient for estimating the missing values of multivariate E data consisting of patients’ clinical observations in the time series owing to their nature of ignoring temporal relations intensive care unit (ICU). Despite the surge of interest in clin- between a span of time points. On the other hand, by virtue ical research on EHR, it still holds diverse challenging issues of recurrent neural networks (RNNs), which have proved to to be tackled, these include high-dimensionality, temporality, perform remarkably well in modeling sequential data, we can sparsity, irregularity, and bias [1, 2]. Specifically, sequences estimate the complete data by taking into account the temporal arXiv:2003.00662v2 [cs.LG] 14 Nov 2020 of multidimensional medical measurements are recorded ir- characteristics. In this approach, GRU-D [6] introduced a mod- regularly in terms of its variables and times. The reasons ified gated-recurrent unit (GRU) cell to model missing patterns behind these typical measurements are diverse, such as lack in the form of masking vectors and temporal delays. Likewise, of collection, documentation, or even recording faults [1, 3]. BRITS [15] exploited the temporal relations of bidirectional Since it carries essential information regarding a patient’s dynamics by considering feature correlations in estimating the health status, improper handling of missing values might cause missing values. Moreover, inspired by the residual networks an unintentional bias [3, 4], yielding an unreliable downstream and graph-based methods, [16] proposed the LIME-RNN to analysis and verdict. tackle the imputation task by integrating the previous hidden Complete-case analysis is an approach that draws clinical states of RNN in the forms of linear memory vector. outcomes by disregarding the missing values and relying Even though such models employed the stochastic approach only on the observed values. However, excluding the missing for inferring and generating samples by utilizing both features A. W. Mulyadi and E. Jun are with the Department of Brain and Cog- and temporal relations, they scarcely exploited the uncertainty nitive Engineering, Korea University, Seoul 02841, Korea (e-mail: wisnu- in estimating the missing values in multivariate time series [email protected], [email protected]). H.-I. Suk is with the Department data (i.e., since the imputation estimates are not thoroughly of Artificial Intelligence and the Department of Brain and Cognitive Engi- neering, Korea University, Seoul 02841 Korea (e-mail: [email protected]). accurate, we may introduce a fidelity score denoted by the (Corresponding author: Heung-Il Suk) uncertainty, which enhances the downstream task performance UNDER REVIEW 2

by emphasizing the reliable information more than the less II.RELATED WORK certain information) [14, 17, 18]. We can use the imputation model, which captures the aleatoric uncertainty in estimating Imputation strategies have been extensively devised to re- the missing values by placing a distribution over the output solve the issue of sparse high-dimensional time series data of the model [19]. In particular, we would like to estimate by means of , machine learning, or deep learning the heteroscedastic aleatoric uncertainties, which are useful in methods. For instance, previous works exploited statistical cases where observation noises vary with the input [19]. attributes of observed data, such as mean [8] and median In this work, we define our primary task as the prediction filling [9], which clearly ignored the temporal relations and of in-hospital mortality on clinical time series data. However, the correlations among variables. From the machine learning since such data are portrayed with sparse and irregularly- approaches, k-nearest neighbors (KNN) [20], and principal sampled characteristics, we devise an imputation model as component analysis (PCA) [21] were proposed by considering the secondary problem to enhance the clinical outcome pre- the relationships of the features either in the original or in dictions. We propose a novel variational-recurrent imputation the latent space. In addition, imputation methods based on network (V-RIN) which unifies the imputation and prediction the expectation-maximization (EM) algorithm [22, 23] were networks for multivariate time series EHR data, governing also proposed in tackling the missing values estimations. Fur- both correlations among variables and temporal relations. thermore, multiple imputation by chained equations (MICE) Specifically, given the sparse data, an inference network of [24, 25] introduced variability by means of repeating the VAEs is employed to capture data distribution in the latent imputation process multiple times. However, these methods space. From this, we employ a generative network to obtain the ignore the temporal relations as crucial attributes in time series reconstructed data as the imputation estimates for the missing modeling and disregard the exploitation of the uncertainties in values and the uncertainty indicating the imputation fidelity estimating the missing values. score. Then, we integrate the temporal and feature correlations The deep learning-based imputation models, are closely into a combined vector and feed it into a novel uncertainty- related to our proposed models. A previous study [12] lever- aware GRU in the recurrent imputation network. Finally, we aged VAEs to generate stochastic imputation estimates by obtain the mortality prediction as a clinical verdict from the exploiting the distribution and correlations of features in the complete imputed data. In general, our main contributions in latent space. However, it ignored the temporal relations as well this study are as follows: as the uncertainties. Meanwhile, GP-VAE [26] was proposed to obtain the latent representation by means of VAEs and model • We estimate the missing values by utilizing a deep temporal dynamics in the latent space using the Gaussian generative model combined with a recurrent imputation process. However, since the model is merely focused on the network to capture both feature correlations and the imputation task, they required a separate model for the further temporal dynamics jointly, yielding the uncertainty. downstream outcome. Recently, the sparse autoencoder (SAE) • We effectively incorporate the uncertainty with the impu- was exploited to extract the features in combination with tation estimates in our novel uncertainty-aware GRU cell a complex recurrent broad learning system (RBLS) [27] to for better prediction results. capture the temporal dynamic characteristics in the time series. • We evaluate the effectiveness of the proposed models by Nevertheless, uncertainties in imputing the missing values training the imputation and prediction networks jointly in were ignored. an end-to-end manner, achieving superior performance on To deal with time series data, a series of RNN-based real-world multivariate time series EHR data compared to imputation models were proposed. GRU-D [6] considered other competing state-of-the-art methods. the temporal dynamics by incorporating the missing patterns, This study extends the preliminary work published in [14]. together with the mean imputation, and forward filling with Unlike to the preceding study, we have further expanded past values using temporal decay factor. Similarly, GRU-I the proposed model by introducing more complex recurrent [13] trained the RNNs using such temporal decay factor imputation networks, which utilize the uncertainties, instead and further incorporated an adversarial scheme of GANs as of vanilla RNNs. We also include two additional real-world the stochastic approach. In the meantime, BRITS [15] were EHR datasets in our experiments, described in Section (IV-A), proposed to combine the feature correlations and temporal and validate the robustness of our proposed networks by dynamic networks using bidirectional dynamics, which en- comparing them with existing state-of-the-art models in the hanced the accuracy by estimating missing values in both literature (Section IV-C). Furthermore, we have conducted forward and backward directions. By considering the delayed extensive experiments to discover the impacts of missing value gradients of the missingness in the forward and backward estimation by utilizing the uncertainties when performing the directions, their models were able to achieve more accurate downstream task (Section IV-D - IV-F). missing values imputations. Likewise, M-RNN [28] utilized The rest of the paper is organized as follows. In Section II, bidirectional recurrent dynamics by operating interpolation we discuss the works closely related to our proposed model (intra-stream) and imputation (inter-stream). Despite temporal in imputing the missing values. In Section III, we detail our dynamics and stochastic methods being considered in the mod- proposed model. In Section IV, we report on the experimental els, the uncertainties for imputation purposes were scarcely results and analysis by comparing them with state-of-the-art incorporated. methods. Finally, we conclude the work in Section V. As we are unsure of the actual values, we argue that these UNDER REVIEW 3 Prediction Mortality

GRU GRU

Imputation Imputation

Variational Autoencoder Imputation Network Recurrent Imputation Network

Fig. 1. Architecture of the proposed model, which unifies two key networks, namely variational autoencoder and recurrent imputation network. The model considers the feature correlations, temporal relations, and the uncertainties in estimating the missing values to get a better prediction of clinical outcome. Refer to Section III for more details of the notation. uncertainties are beneficial and can be utilized to estimate and RNNs, we expect to acquire more likely estimates by the missing values. Such uncertainty can be captured by considering feature correlations and temporal relations over accommodating the distribution over the model output [19]. time, including the utilization of the uncertainty. By doing so, For this purpose, we exploited VAEs [10] as the Bayesian we expect a more reliable prediction outcomes (i.e. in-hospital networks, which are able to model the data distribution. In this mortality) could be obtained. We describe each of the networks work, we introduce the uncertainty as the imputation fidelity of more specifically in the following sections after introducing estimates, which compensates for the potential impairment of the data representation. imputation estimates. Therefore, we assumed that our model could provide reliable estimates while giving less attention A. Data Representation to the unreliable ones determined by its uncertainties. We expect to obtain better estimates of the missing values leading Given the multivariate time series EHR data of N number of to a better prediction performance as a downstream task. patients, a set of clinical observations and their corresponding (n) (n) N However, since VAEs alone are not designed to model the label is denoted as {X , y }n=1. For each patient, we (n) (n) (n) (n) > T ×D temporal dynamics, we combined the model with the recurrent have X = [x1 ,..., xt ,..., xT ] ∈ R , where T imputation network by further utilizing those uncertainties. and D represent the time points and variables, respectively; (n) D Thus, our proposed model differs from the aforementioned xt ∈ R denotes all observable variables at time point t, d,(n) models in which that it integrates the imputation and prediction and xt is the d-th element of variables at time point t. In networks jointly, and the utilization of the uncertainties serve addition, it has the corresponding clinical label y(n) ∈ {0, 1}, as the major motivation in our works. representing the clinical outcome. In our case, it denotes the in-hospital mortality of a patient, which falls into a binary III.PROPOSED METHODS classification problem. For the sake of clarity, we omit the superscript (n) hereafter. Our proposed model architecture consists of two key net- As X is characterized with sparsity properties, we ad- works – an imputation and a prediction network – as depicted dress the missing values by introducing masking matrix in Fig. 1. The imputation network is devised based on VAEs to M ∈ { }T ×D, indicating whether values are observed or capture the latent distribution of the sparse data by means of 0, 1 missing. In addition, we define a new data representation its inference network (i.e., encoder E). Then, the subsequent X˜ x˜ x˜ x˜ > ∈ T ×D to be fed into the model, generative network of VAEs (i.e., decoder D) estimates the = [ 1,..., t,..., T ] R where we initialize the missing value with zero [5, 12, 14] as reconstructed data distribution. We regard its reconstructed follows: values as the imputation estimates while exploiting its vari- ( ( ances as the uncertainties to be further utilized in the recurrent 1 if xd is observed, xd if md = 1, md = t , x˜d = t t imputation network sequentially. t 0 otherwise t 0 otherwise. The succeeding recurrent imputation networks are built upon RNNs to model the temporal dynamics. For each time Another consideration in dealing with the sparse data is that step, we employ the regression layer to model the temporal there exists a time gap between two observed values. Such relations incorporated within the hidden states of RNNs cell information in fact carries a piece of essential information in imputing the missing values. However, as we also consider in estimating the missing values over the times. Thus, we the time gap between observed values, we incorporate the time accommodate this information by further devising the time decay factor, which is then exploited in such hidden states, delay matrix ∆ ∈ RT ×D, which is derived from s ∈ RT , leading to decayed hidden states. Eventually, by systemati- denoting the timestamp of the measurement. As initialization, cally unifying the imputation estimates obtained from VAEs we fix ∆t = 1 for the t = 1, while setting the time delay for UNDER REVIEW 4

Fig. 2. Graphical illustrations of comparing three different forms of GRU cells (F u: update gate, F r: reset gate). the remaining times (t > 1) by referring to a masking matrix quantify this uncertainty as the fidelity score of the missing M and a timestamp vector s as follows: value estimates. In particular, we set the corresponding uncer- ( tainty to zero if the data is observed, indicating that we are s − s if md = 1, ∆d = t t−1 t−1 confident with full trust in the observation, and set it as a value t − d d st st−1 + ∆t−1 otherwise. σx¯,t if the corresponding value is missing as

u¯t = (1 − mt) σˆx¯,t B. VAE-based Imputation Network Finally, as an output of this VAE-based imputation network, Given the observations at each time point x˜t, we infer we acquire the set {X¯ , U¯ } denoting the imputed values and z ∈ Rk as the latent representation with k as its corresponding their corresponding uncertainty, respectively. Furthermore, to dimension by making use of the inference network, using the alleviate the bias of missing value estimations, we utilize this true posterior distribution pφ(z|x˜t). Intuitively, we assume that uncertainty in the following recurrent imputation network. x˜t is generated from some unobserved random variable z by some conditional distribution p (x˜ |z), while z is generated θ t C. Recurrent Imputation Network from a prior distribution pθ(z), which can be interpreted as the hidden health status of the patient. In addition, we define The recurrent imputation network is based on RNNs, where we further model the temporal relations in the imputed data the marginal likelihood as pθ(x˜t) by integrating out the joint and exploit the uncertainties. While both GRU [30] (depicted distribution pθ(x˜t, z) for z defined as in Fig. 2a) and long-short term memory (LSTM) [31] are Z Z feasible choices, inspired by the previous work of GRU-D[6] pθ(x˜t) = pθ(x˜t, z)dz = pθ(x˜t|z)pθ(z)dz. depicted in Fig. 2b, we employed a modified GRU cell by leveraging the uncertainty-aware GRU cell to consider However, in practice, this is analytically intractable. Since, further uncertainty and the temporal decaying factor, which pθ(z|x˜t) = pθ(x˜t|z)pθ(z)/pθ(x˜t), the true posterior becomes is depicted in Fig. 2c. intractable as well. Therefore, we approximate it with qφ(z|x˜t) Specifically, at each time step t, we produce the uncertainty 2 using a Gaussian distribution N (µz, diag(σz)) [14, 29], where decay factor υt in Eq. (1) using a negative exponential rectifier the mean and log-variance are obtained such that to guarantee υt ∈ (0, 1] [6, 15]. 2 µz = Eµ(x˜t; φ), log σz = Eσ(x˜t; φ), υt = exp{−max(0, Wu¯ u¯t + bu¯ )} (1) where E{µ,σ} denotes the inference network with parameter We utilize such factors to emphasize the reliable estimates and φ. Furthermore, we apply the reparameterization trick [10] as give less attention to the uncertain ones. Hereafter, we denote z = µz + σz z, where z ∼ N (0, I), with denoting W and b as the weights and biases of a fully-connected layer, element-wise multiplication, thus, making it possible to be respectively. In particular, we first employ a fully-connected differentiated and trained using standard gradient methods. layer to x¯t and element-wise multiply with the uncertainty Furthermore, given this latent vector z, we estimate pθ(x˜t|z) decay factor υt as follows by means of the generative network D with parameters θ as υ 2 xt = (Wυx¯t + bυ) υt. (2) µx¯,t = Dµ(z; θ), log σx¯,t = Dσ(z; θ), 2 Note that we zeroed-out the diagonal of the parameter Wυ where µx¯,t and σx¯,t denote the mean and variance of recon- structed data distribution, respectively. We regard the mean as to enforce the estimations based on other features. Thus, we xυ feature-based correlated estimates the estimate to the missing values and maintain the observed obtain t as the to the missing values. values in x¯ by making use of corresponding masks vector as t In addition, we further consider the missing value estimates x¯t = mt x˜t + (1 − mt) µx¯,t. based on the temporal relations. For this purpose, we employ the time delay ∆t which is an essential element to capture In the meantime, we regard the variance of reconstructed temporal relations and missing patterns of the data [6]. We data as the uncertainty to be further utilized in the exploit this information as the temporal decay factor γt ∈ recurrent imputation process. For this purpose, we in- 0 1 as follows ¯ T ×D ˆ ( , ] troduce an uncertainty matrix U ∈ R with Σx¯ = > T ×D [diag(σx¯,1),..., diag(σx¯,t),..., diag(σx¯,T )] ∈ R . We γt = exp{−max(0, Wγ ∆t + bγ )}. (3) UNDER REVIEW 5

Meanwhile, by employing the GRU, we obtain the hidden Algorithm 1: Algorithm of our proposed model state h as the comprehensive information compiled from the input : clinical time series data {X˜ , y, M, ∆} preceding sequences. Thus, we take advantage of the temporal output: imputed values Xc; outcome prediction yˆ decay factor in governing the influence of past observations 1 ∇ ← 0 embedded into hidden states using the form of decayed hidden Θ 2 while not converge do states as ˜ 2 3 qφ(Z|X) ← N (µz, diag(σz)) hˆt−1 = ht−1 γt. (4) 4 Z ← µz + Σz z, z ∼ N (0, I) ˆ ˜ 2 Thereby, given the previous hidden states ht−1, we can esti- 5 pθ(X|Z) ← N (µx¯, diag(σx¯)) r mate the current complete observation xt through regression. 6 X¯ ← M X˜ + (1 − M) µx¯ ¯ ˆ r ˆ 7 U ← (1 − M) Σx¯ xt = Wrht−1 + br. (5) 8 h0 ← 0 In addition, we further make use of those estimates by apply- 9 for t ← 1 to T do ing another operation as 10 // Eqs. (1) - (9) 11 υt ← exp{−max(0, Wu¯ u¯t + bu¯ )} τ r υ xt = Wτ xt + bτ , (6) 12 xt ← (Wυx¯t + bυ) υt 13 γ ← exp{−max(0, W ∆ + b )} again setting the diagonal parameter of W to be zeros to t γ t γ τ ˆ consider the feature-based estimation on top of the temporal 14 ht−1 ← ht−1 γt r ˆ relations from the previous hidden states. 15 xt ← Wrht−1 + br τ r Hence, we have a pair of imputed values {xυ, xτ }, corre- 16 xt ← Wτ xt + bτ t t υ τ sponding to missing value estimates obtained from the VAE 17 ct ← x t ∗ xt c by considering the uncertainties, and from the recurrent impu- 18 xt ← mt x¯t + (1 − mt) ct c ˆ tation network, respectively. We then merge this information 19 ht ← GRU([xt ◦ mt], ht−1) jointly to get combined vector ct comprising both estimates 20 end by simply employing a 1 × 1 convolution operation (∗) as 21 yˆ ← σ(WyhT + by) 22 L ← υ τ total Eq. (13) ct = xt ∗ xt (7) ∂ 23 ∇Θ ← ∂Θ Ltotal c 24 Θ ← optimize(Θ, ∇ ) Finally, we obtain the complete vector xt by replacing the Θ missing values with the combined estimates as 25 end c xt = mt x˜t + (1 − mt) ct. (8) In addition, we concatenate the complete vector with the combined imputation estimates by the mean absolute error corresponding mask using ◦ operator, and then feed it into (MAE) as Lreg. the modified GRU cell to obtain the subsequent hidden states N T c X X (n) h = σ(W hˆ + V [x ◦ m ] + b ) (9) Lvae = E (n) [log pθ(x˜t |z)] t h t−1 h t t h qφ(z|x˜t ) n=1 t=1 with σ denotes the activation function. Lastly, to predict the (n) in-hospital mortality as the clinical outcome, we utilize the − DKL[qφ(z|x˜t )kpθ(z)] + λ1k(θ; φ)k1 (11) last hidden state hT to get the predicted label yˆ such that N T X X (n) (n) (n) (n) Lreg = LMAE(x˜ m , c m ) (12) yˆ = σ(WyhT + by). (10) t t t t n=1 t=1 Hereby, W{u¯,υ,γ,r,τ ,h,y}, Vh, and b{u¯,υ,γ,r,τ ,h,y} are the Furthermore, we define the binary cross-entropy loss function learnable parameters in our recurrent imputation network. Lpred to evaluate the prediction of in-hospital mortality. Thus, we define the overall composite loss function Ltotal as D. Learning L = α L + β L + L , (13) We describe the composite loss function, comprising total vae reg pred the imputation and prediction loss function to where α and β are the hyperparameters to represent the ratio tune all model parameters jointly, which are between the Lvae and Lreg, respectively. Θ = {θ, φ, W{u¯,υ,γ,r,τ ,h,y}, Vh, b{u¯,υ,γ,r,τ ,h,y}}. Such Note that our proposed model is also applicable to consider loss function accommodates the VAEs and the recurrent the bidirectional dynamics. Such a scenario can be carried imputation network as well. By means of VAEs, we define out by having the forward and backward direction of the data the loss function Lvae to maximize the variational evidence fed into the recurrent imputation network. By doing so, we lower bound (ELBO) that comprises the reconstruction loss make our proposed model a fair comparison to M-RNN [28], term and the Kullback-Leibler divergence [32]. We add BRITS-I [15], and BRITS [15] which adopted such strategy `1-regularization to introduce sparsity into the network with to achieve better estimates of the missing values and the λ1 as the hyperparameter. Moreover, for each time step, we prediction outcomes. In bidirectional cases, we employ an measure the difference between the observed data and the additional consistency loss function [15] to Ltotal in order UNDER REVIEW 6 to impose a consistency estimates for each time step in both for each hidden layer, we used batch normalization, drop out directions as rate of 0.3 for both of classification and imputation tasks, and N T tanh activation function. Furthermore, 64 hidden units were X X c,(n) c0,(n) Lcons = LMAE(xt , xt ), (14) employed for the recurrent imputation network. n=1 t=1 We trained the proposed model on both datasets using c,(n) c0,(n) an Adam optimizer with 100 epochs and 64 mini-batches. with x and x denoting the estimates from the forward t t For imputation task, we fixed the learning rates of 0.0005 and backward direction, respectively. Another hyperparameter and 0.0003 for PhysioNet and MIMIC-III, respectively, while ξ could be introduced for this consistency loss to optimize 0.005 for both datasets on the classification task. We set λ and the model. Lastly, we use stochastic gradient descent in an 1 weight decay equally with 1e−5. As for bidirectional models, end-to-end manner to optimize the model parameters during we set ξ = 0.1 as the hyperparameter of the consistency loss. the training. We summarize the overall training steps of our proposed framework in Algorithm 1. B. Tasks IV. EXPERIMENTS A. Dataset and Implementation Setup In this work, we validated the performance of our proposed models from two perspectives: (1) the in-hospital mortality PhysioNet 2012 Challenge [33, 34] consists of 35 irregu- prediction (classification), and (2) missing value imputation. larly sampled clinical variables (e.g., heart and respiration 1) Classification Task: Our primary goal for this work is rate, blood pressure, etc.) from 4,000 patients during their to predict the in-hospital mortality as the binary classification first 48 hours of medical care in the ICU. Note that we task. For this purpose, we reported the test result on the ignore the demographic information and categorical data types in-hospital mortality prediction task from the 5-fold cross- from this dataset. Hereby, we exploit only the clinical time validation in terms of the average of the Area Under the series data. From those samples, we excluded three patients Curve (AUC) ROC. Additionally, to measure the robustness with no observations at all. We sampled the observations of the models in dealing with the imbalance data portrayed in hourly, using the time window as the timestamps, and took the both datasets, we also reported the Area Under the Precision- average of values in cases of multiple measurements within Recall Curve (AUPRC). We randomly removed samples of the this time window. It resulted in sparse EHR data with an training data with {5%, 10%} scenarios and left the validation average missing rate of 80.51%. Our aim is to predict the and test set untouched to be reported as the results. in-hospital mortality of patients, with 554 positive mortality 2) Imputation Task: For the secondary task, we additionally labels (13.86%). As for the implementation setup of the evaluated imputation performance on the missing values. As PhysioNet dataset, we employed three layers of feedforward for this task, we randomly removed samples with settings of networks for the inference network of VAEs with hidden units {5%, 10%} of observed data from all training, validation, and of {64, 24, 10}, with 10 denoting the dimension of latent test set as the ground truth. Then we reported the test result representation. The generative network has equal numbers from the 5-fold cross-validation by measuring the MAE. In of hidden units with those of the inference network but in addition, we also measured the other most-used imputation the reverse order. We employed hyperbolic tangent (tanh) similarity metric, namely mean relative error (MRE) and mean as the non-linear activation function for each hidden layer. squared error (MSE). Given x(i) and xc,(i) as the ground truth Prior to those activation functions, we also applied batch and imputation estimates of i-th item, respectively, and N normalization and a dropout rate of {0.1, 0.3} for classification GT ground truth items in total, we defined MAE, MRE, and MSE and imputation tasks, respectively. We employed modified as GRU for recurrent imputation network with 64 hidden units. PNGT |xc,(i) − x(i)| The Medical Information Mart for Intensive Care MAE = i=1 , (15) (MIMIC-III) [35] dataset consists of 53,432 ICU stays of NGT adult patients in the Beth Israel Deaconess Medical Center PNGT c,(i) (i) in the period of 2001 – 2012 [35]. We selected 99 variables |x − x | MRE = i=1 , (16) from several source tables, such as laboratory tests, inputs to PNGT (i) i=1 |x | patients, outputs from patients, and drug prescriptions tables, resulting in a cohort of 13,998 patients with 1,181 positive PNGT (xc,(i) − x(i))2 in-hospital mortality labels (8.44%) among them. Moreover, MSE = i=1 . (17) N from those irregular measurements, we further sampled the GT data into two hourly samples for the first 48 hours of their medical care, leading to an average missing rate of 93.92%. As in the case of PhysioNet, we took the average value if there C. Comparative Models existed multiple measurements. We referred to [6, 36] in pre- processing the MIMIC-III cohort. We employed three layers We compared the performance of our proposed model in of feedforward networks for the inference network of VAEs carrying the aforementioned tasks with the closely-related with hidden units of {128, 32, 16} and an equal number of competing state-of-the-art models in the literature by grouping hidden units in reverse for the generative network. Likewise, them into unidirectional and bidirectional models. UNDER REVIEW 7

β = 0.1 β = 0.25 β = 0.5 β = 0.75 β = 1.0

0.86 V-RIN V-RIN-full

0.84

AUC 0.82

0.80

0.1 0.25 0.5 0.75 1.0 0.1 0.25 0.5 0.75 1.0 0.1 0.25 0.5 0.75 1.0 0.1 0.25 0.5 0.75 1.0 0.1 0.25 0.5 0.75 1.0 α α α α α

Fig. 3. Ablation studies on the impact of hyperparameter pair {α, β} for the classification task on PhysioNet dataset.

TABLE I ABLATION STUDY RESULTS ON BOTH PHYSIONETAND MIMIC-III DATASETS USING 10% REMOVAL SCENARIO.*)FORTHE VAE+RNN [14], WE OBTAINED THE MISSING VALUE ESTIMATES SOLELY FROM THE VAE INRECONSTRUCTINGVALUES

Dataset Task Metric VRNN [37] VAE+RNN [14] V-RIN (Ours) V-RIN-full (Ours) AUC 0.7964 ± 0.0117 0.8098 ± 0.0154 0.8311 ± 0.0079 0.8347 ± 0.0125 Classification AUPRC 0.4082 ± 0.0369 0.4311 ± 0.0409 0.4781 ± 0.0529 0.4792 ± 0.0277 PhysioNet MAE 0.6948 ± 0.0171 0.6493 ± 0.0226∗ 0.2938 ± 0.0080 0.2898 ± 0.0081 Imputation MRE 0.8790 ± 0.0186 0.9126 ± 0.0203∗ 0.4130 ± 0.0049 0.40727 ± 0.0055 MSE 1.0910 ± 0.02000 .8538 ± 0.0906∗ 0.3889 ± 0.0579 0.3807 ± 0.0546 AUC 0.7925 ± 0.0126 0.8147 ± 0.0140 0.8634 ± 0.0137 0.8650 ± 0.0091 Classification AUPRC 0.3087 ± 0.0291 0.3482 ± 0.0159 0.4099 ± 0.0163 0.4144 ± 0.0117 MIMIC-III MAE 0.7042 ± 0.0048 0.6268 ± 0.0063∗ 0.3322 ± 0.0072 0.3279 ± 0.0055 Imputation MRE 0.9922 ± 0.0026 0.9947 ± 0.0011∗ 0.5271 ± 0.0080 0.5204 ± 0.0065 MSE 1.0661 ± 0.02081 .0523 ± 0.1110∗ 0.6172 ± 0.1174 0.5865 ± 0.1038

1) Unidirectional Models: • We extended our proposed model of V-RIN and V- RIN-full into the bidirectional models by means of the • GRU-D [6] estimates the missing values by utilizing the recurrent imputation networks. Similar to BRITS-I and informative missing value patterns in the form of the BRITS, we further computed consistency loss in Eq. (14). masking and time decay factor using modified GRU cells. • RITS-I [15] utilizes unidirectional dynamics that rely solely on temporal relations through regression. D. Experimental Results : Ablation Studies • RITS [15] is devised based on RITS-I by further taking As part of the ablation studies, we report the performance of into account the feature correlations function. Further- the unidirectional model of the V-RIN and V-RIN-full models more, it utilizes the temporal decay as the factor to weigh on the in-hospital mortality classification task. Firstly, we between both features and temporal-based estimates. investigated the effect of the pair {α, β} as the hyperparameter • V-RIN (Ours) is based on our proposed model except that in Eq. (13). We reflected these parameters as a ratio to weigh we ignored the uncertainty decay factor. Specifically, we the imputation by the VAEs and the recurrent imputation excluded Eq. (1) and omitted the element-wise multipli- network to achieve the optimal performance on estimating the cation operation of the υt in Eq. (2). missing values and classifying the clinical outcomes. For each • V-RIN-full (Ours) executed all operations in the proposed parameter, we defined set of values in the range of [0.1, 1.0]. model including feature-based correlations, temporal re- For PhysioNet, as illustrated in Fig. 3, in general, we ob- lations, and the uncertainty decay utilization. served that in almost all the combination settings, V-RIN-full 2) Bidirectional Models: achieved higher performance than V-RIN. We interpreted these findings that introducing the uncertainty helps the model in • M-RNN [28] exploits the multi-directional RNNs which estimating the missing values, leading to better classifying the execute both interpretation and imputation. outcome. Both models were able to achieve high performance • BRITS-I [15] is based on the RITS-I by extending it to in terms of their average AUC scores of 0.8311 ± 0.0079 be able to handle bidirectional dynamics in estimating the for V-RIN and 0.8347 ± 0.0125 for V-RIN-full. These AUC missing values. results were obtained with settings of {α = 0.75, β = 0.25} • BRITS [15] takes the bidirectional dynamics of RITS and {α = 1.0, β = 0.75}, respectively. For these results, in handling the sparsity in the data. Both BRITS-I and we observed that the model favored the emphasis on the BRITS additionally employ consistency loss of forward feature correlations over the temporal relations to obtain its and backward directions in their attempt to estimate the best performance. For the case of V-RIN, once we tried to missing values more precisely. increase the β > 0.25, we observed that the classification UNDER REVIEW 8

TABLE II PERFORMANCEOFIN-HOSPITAL MORTALITY PREDICTION TASK ON BOTH PHYSIONETAND MIMIC-IIIDATASETS

10% 5% Dataset Models AUC AUPRC AUC AUPRC GRU-D [6] 0.8096 ± 0.0308 0.4645 ± 0.0464 0.8096 ± 0.0308 0.4645 ± 0.0464 RITS-I [15] 0.8160 ± 0.0108 0.4477 ± 0.0395 0.8179 ± 0.0110 0.4385 ± 0.0222 RITS [15] 0.8159 ± 0.0153 0.4484 ± 0.0361 0.8109 ± 0.0159 0.4568 ± 0.0394 V-RIN (Ours) 0.8311 ± 0.0079 0.4781 ± 0.0529 0.8292 ± 0.0127 0.4773 ± 0.0428 V-RIN-full (Ours) 0.8347 ± 0.0125 0.4792 ± 0.0277 0.8348 ± 0.0116 0.4773 ± 0.0374

PhysioNet Unidirectional M-RNN [28] 0.7885 ± 0.0236 0.4096 ± 0.0296 0.7817 ± 0.0230 0.3839 ± 0.0426 BRITS-I [15] 0.8253 ± 0.0215 0.4521 ± 0.0347 0.8168 ± 0.0175 0.4508 ± 0.0327 BRITS [15] 0.8225 ± 0.0134 0.4580 ± 0.0284 0.8242 ± 0.0041 0.4600 ± 0.0420 V-RIN (Ours) 0.8393 ± 0.0177 0.4760 ± 0.0496 0.8377 ± 0.0127 0.4886 ± 0.0402

Bidirectional V-RIN-full (Ours) 0.8401 ± 0.0100 0.4924 ± 0.0363 0.8422 ± 0.0070 0.4871 ± 0.0217 GRU-D [6] 0.8536 ± 0.0138 0.3809 ± 0.0198 0.8536 ± 0.0138 0.3809 ± 0.0198 RITS-I [15] 0.8533 ± 0.0127 0.3687 ± 0.0119 0.8543 ± 0.0067 0.3707 ± 0.0119 RITS [15] 0.8601 ± 0.0044 0.3922 ± 0.0048 0.8547 ± 0.0091 0.3739 ± 0.0289 V-RIN (Ours) 0.8634 ± 0.0137 0.4099 ± 0.0163 0.8620 ± 0.0112 0.3951 ± 0.0185 V-RIN-full (Ours) 0.8650 ± 0.0091 0.4144 ± 0.0117 0.8616 ± 0.0101 0.3924 ± 0.0220

MIMIC-III Unidirectional M-RNN [28] 0.8243 ± 0.0065 0.3221 ± 0.0235 0.8227 ± 0.0101 0.3176 ± 0.0162 BRITS-I [15] 0.8625 ± 0.0093 0.4283 ± 0.0359 0.8621 ± 0.0089 0.4065 ± 0.0159 BRITS [15] 0.8649 ± 0.0110 0.4129 ± 0.0324 0.8638 ± 0.0141 0.4144 ± 0.0250 V-RIN (Ours) 0.8695 ± 0.0139 0.4333 ± 0.0288 0.8694 ± 0.0121 0.4158 ± 0.0204

Bidirectional V-RIN-full (Ours) 0.8691 ± 0.0126 0.4270 ± 0.0255 0.8705 ± 0.0148 0.4199 ± 0.0160 performance was degraded to some degree. In contrast, the both imputation and classification tasks. performance of V-RIN-full was considerably better when we increased the β parameter. We also carried out similar ablation E. Classification Result Analysis studies on the MIMIC-III dataset, which is reported in the supplementary material. To summarize, for both datasets, we We presented the experimental results of the in-hospital argue that both features and temporal relations are essential in mortality prediction in comparison with other competing meth- estimating the missing values with some latent proportion. ods in terms of average AUC and AUPRC in Table II for both datasets. We reported the evaluation of both unidirectional Table I presents the comparison of our model with closely and bidirectional models for 10% and 5% removal on both related models on both classification and imputation tasks, datasets. For the case of unidirectional models on PhysioNet such as VRNN [37], which integrates VAEs for each time with 10% and 5% removal, our V-RIN model achieved bet- step of RNNs, and VAE+RNN [14], which employs VAEs ter performance in terms of AUC and AUPRC by a large followed by RNNs without incorporating the uncertainty. For margin compared to all comparative models even with the the case of VRNN on PhysioNet and MIMIC-III, we observed RITS, which utilized both the features and temporal relations, that the classification performance was the lowest among sequentially. The highest AUC was obtained by the V-RIN- reported models in terms of both its AUC and AUPRC. In full in both removal percentage scenarios. As for bidirectional addition, their imputation performance was low in comparison models, all models in the 10% and 5% scenarios improved to both our proposed models. As in the case of VAE+RNN, in their performance, except for the M-RNN, which achieved comparison to VRNN, we noticed a considerable improvement lowest results among all competing methods. Although M- in the performance on PhysioNet and MIMIC-III, especially RNN employed similar strategies using bidirectional dynam- in terms of AUPRC. VAE+RNN is, in fact, the model closely ics, it struggles to perform the task properly. Both AUC and related to ours that it executes the imputation process by AUPRC of M-RNN was lower than GRU-D, despite GRU- first exploiting the feature correlations followed by temporal D using only the forward directional data. The highest AUC dynamics in exact order. However, [14] employed the vanilla was achieved by our V-RIN-full with 0.8401 ± 0.0100 and RNNs instead of recurrent imputation network, which is a 0.8422 ± 0.0070 for the 10% and 5% removal scenarios, novel extension in this study. By introducing the temporal respectively. These findings reassure that the utilization of the decay in V-RIN, the model was better able to learn the uncertainty is truly beneficial in estimating the missing values temporal dynamics effectively, resulting in better AUC and to better predict the clinical outcomes. AUPRC, as well as better imputation results in terms of MAE, As for MIMIC-III, we observed quite similar patterns to MRE and MSE on both datasets, by a large margin. Finally, PhysioNet, with our model outperforming all competing mod- once we introduced the uncertainty which is incorporated in els. The imbalance ratio, missing ratios, and dimensionality of the recurrent imputation network of V-RIN-full, we observed MIMIC-III are much greater than PhysioNet. We found that a significant enhancement of overall performance metrics on the performance of V-RIN and V-RIN-full was comparable both tasks and datasets. Thus, we conclude that the utilization to some extent. In unidirectional models with 10%, our V- of both temporal decay and the uncertainties are beneficial in RIN-full achieved the highest AUC and AUPRC, whereas UNDER REVIEW 9

TABLE III PERFORMANCE OF IMPUTATION TASK ON BOTH PHYSIONETAND MIMIC-IIIDATASETS

10% 5% Models MAE MRE MSE MAE MRE MSE PhysioNet GRU-D [6] 0.661 ± 0.016 0.930 ± 0.014 0.900 ± 0.107 0.662 ± 0.013 0.929 ± 0.015 0.906 ± 0.105 RITS-I [15] 0.407 ± 0.009 0.572 ± 0.004 0.476 ± 0.060 0.415 ± 0.010 0.582 ± 0.009 0.507 ± 0.087 RITS [15] 0.307 ± 0.008 0.431 ± 0.005 0.390 ± 0.057 0.327 ± 0.008 0.459 ± 0.007 0.431 ± 0.074 V-RIN (Ours) 0.293 ± 0.008 0.412 ± 0.004 0.388 ± 0.057 0.320 ± 0.008 0.449 ± 0.007 0.426 ± 0.077 V-RIN-full (Ours) 0.289 ± 0.008 0.407 ± 0.005 0.380 ± 0.054 0.315 ± 0.008 0.442 ± 0.009 0.422 ± 0.075 Unidirectional M-RNN [28] 0.397 ± 0.013 0.558 ± 0.009 0.439 ± 0.066 0.411 ± 0.011 0.577 ± 0.010 0.481 ± 0.088 BRITS-I [15] 0.371 ± 0.011 0.521 ± 0.008 0.416 ± 0.064 0.380 ± 0.010 0.534 ± 0.010 0.448 ± 0.085 BRITS [15] 0.280 ± 0.009 0.393 ± 0.006 0.351 ± 0.058 0.298 ± 0.008 0.418 ± 0.007 0.390 ± 0.076 V-RIN (Ours) 0.268 ± 0.008 0.375 ± 0.006 0.348 ± 0.059 0.292 ± 0.008 0.411 ± 0.008 0.387 ± 0.075

Bidirectional V-RIN-full (Ours) 0.264 ± 0.008 0.371 ± 0.006 0.343 ± 0.059 0.289 ± 0.008 0.406 ± 0.008 0.386 ± 0.075 MIMIC-III GRU-D [6] 0.583 ± 0.005 0.926 ± 0.005 0.973 ± 0.105 0.585 ± 0.007 0.930 ± 0.007 0.978 ± 0.096 RITS-I [15] 0.457 ± 0.007 0.725 ± 0.005 0.809 ± 0.111 0.464 ± 0.006 0.737 ± 0.003 0.820 ± 0.099 RITS [15] 0.332 ± 0.006 0.527 ± 0.005 0.562 ± 0.099 0.356 ± 0.004 0.567 ± 0.003 0.604 ± 0.096 V-RIN (Ours) 0.332 ± 0.007 0.527 ± 0.008 0.617 ± 0.117 0.358 ± 0.005 0.570 ± 0.003 0.636 ± 0.096 V-RIN-full (Ours) 0.327 ± 0.005 0.520 ± 0.006 0.586 ± 0.103 0.353 ± 0.004 0.561 ± 0.001 0.610 ± 0.098 Unidirectional M-RNN [28] 0.441 ± 0.007 0.700 ± 0.006 0.758 ± 0.110 0.451 ± 0.006 0.717 ± 0.004 0.776 ± 0.096 BRITS-I [15] 0.421 ± 0.007 0.668 ± 0.005 0.732 ± 0.110 0.429 ± 0.005 0.682 ± 0.002 0.745 ± 0.100 BRITS [15] 0.313 ± 0.005 0.497 ± 0.006 0.530 ± 0.098 0.333 ± 0.004 0.529 ± 0.003 0.566 ± 0.095 V-RIN (Ours) 0.314 ± 0.007 0.499 ± 0.008 0.584 ± 0.115 0.339 ± 0.003 0.539 ± 0.002 0.604 ± 0.099

Bidirectional V-RIN-full (Ours) 0.311 ± 0.006 0.494 ± 0.006 0.561 ± 0.084 0.333 ± 0.005 0.530 ± 0.004 0.579 ± 0.107 the V-RIN achieved the highest performance by a small feature over time on both datasets in Fig. 4. The observed margin in the bidirectional models. This contrasts with the values are illustrated as black circles. By means of VAEs, 5% scenarios, where V-RIN-full with bidirectional models we obtained the imputation estimates, which are highlighted achieved the highest AUC and AUPRC of 0.8705 ± 0.0148 by the hollow diamond markers alongside the uncertainties, and 0.4199 ± 0.016, respectively. illustrated with the shaded areas. Then, by employing the recurrent imputation network, we acquired the final estimates F. Imputation Results Analysis to the missing values depicted as the hollow circles. Overall, As the secondary task, we further evaluated the imputation we observed that the relevances were getting stronger toward performance of our proposed model in contrast with the the end of the time period, regardless of its sign. Furthermore, comparative models in Table III. In PhysioNet, GRU-D which compared to the observed values, the missing value estimates exploited the mean and temporal relations in imputing the with high uncertainties tend to hold a low relevances. Thus, values struggle to achieve good performance. By using bidirec- this demonstrates the benefits of the uncertainty utilization in tional dynamics, M-RNN obtained better results than GRU-D. the recurrent imputation networks for the downstream task. However, those models are still inferior to the RITS variants in both directional scenarios. Overall, both of our proposed V. CONCLUSION models revealed its imputation robustness on this dataset, In this study, we proposed a novel unified framework by exhibiting best performance, in both unidirectional and consisting of imputation and prediction networks for sparse bidirectional scenario, consistently. Meanwhile, overall results high-dimensional multivariate time series. It combined a deep on MIMIC-III showed that our proposed model performed generative model with a recurrent model to capture feature better than other comparative models on MAE and MRE, correlations and temporal relations for estimating the missing and slightly inferior to RITS and BRITS on MSE. Further values by taking into account the uncertainty. We utilized the investigation on these findings are worthy for our future work uncertainties as the fidelity of our estimation and incorporated to uncover the imputation behavior on various clinical datasets them for predicting clinical outcomes. We evaluated the ef- with diverse (patients’) characteristics. fectiveness of the proposed model with the PhysioNet 2012 We further exploited layer-wise relevance propaga- Challenge and MIMIC-III datasets as the real-world EHR tion (LRP) [38] through a publicly available toolbox: multivariate time series data, proving the superiority of our the innvestigate [39] to examine the relevance of the features. model in the in-mortality prediction task, compared to other Specifically, we employed the LRP-α1β0 rule [38, 39] to comparative state-of-the-art models in the literature. discover which features induce the activation of the neurons in carrying out the classification task. As a result, we depicted the ACKNOWLEDGMENT imputed values as the input to the recurrent imputation network This work was supported by Institute of Information & com- and further revealed the positive and negative relevance of each munications Technology Planning & Evaluation(IITP) grant UNDER REVIEW 10

Fig. 4. Analysis of imputed values relevance in carrying classification task by utilizing LRP [38, 39] to our recurrent imputation network. We obtained the positive and negative relevances depicted as red and blue, respectively, while illustrate the observed values with black circles (•), imputed values by VAEs with hollow diamonds () with shaded areas as its corresponding uncertainties, and combined estimates by means of recurrent imputation network as hollow circles (◦). Color intensity is normalized to the maximum absolute relevance per feature over time. funded by the Korea government(MSIT) (No. 2019-0-00079, [7] J. Brick and G. Kalton, “Handling in survey Department of Artificial Intelligence (Korea University)) and research,” in Stat. Methods Med. Res., vol. 5, no. 3, 1996, the National Research Foundation of Korea(NRF) grant funded pp. 215–238. by the Korea government(MSIT) (No. 2019R1A2C1006543). [8] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data. John Wiley & Sons, Inc., 1987. [9] E. Acuna˜ and C. Rodriguez, The Treatment of Missing REFERENCES Values and its Effect on Classifier Accuracy. Springer [1] Y. Cheng, F. Wang, P. Zhang, and J. Hu, “Risk pre- Berlin Heidelberg, 2004. diction with electronic health records: A deep learning [10] D. P. Kingma and M. Welling, “Auto-encoding varia- approach,” in Proc. 2016 SIAM Int. Conf. Data Min., tional bayes,” in Proc. Int. Conf. Learn. Represent., 2014. 2016, pp. 432–440. [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, [2] P. Yadav, M. Steinbach, V. Kumar, and G. Simon, “Min- D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ing electronic health records (EHRs): A survey,” in ACM “Generative adversarial nets,” in Proc. Adv. Neural Inf. Comput. Surv., vol. 50, no. 6, 2018, pp. 85:1–85:40. Process. Syst., 2014, pp. 2672–2680. [3] B. J. Wells, K. M. Chagin, A. S. Nowacki, and M. W. [12] A. Nazabal, P. M. Olmos, Z. Ghahramani, and I. Valera, Kattan, “Strategies for handling missing data in elec- “Handling incomplete heterogeneous data using VAEs,” tronic health record derived data,” in J. Electron. Health 2018. [Online]. Available: https://arxiv.org/abs/1807. Data Methods, vol. 1, no. 3, 2013, p. 1035. 03653 [4] B. K. Beaulieu-Jones and J. H. Moore, “Missing data [13] Y. Luo, X. Cai, Y. Zhang, J. Xu, and Y. Xiaojie, “Multi- imputation in the electronic health record using deeply variate time series imputation with generative adversarial learned autoencoders,” in Pacific Symp. Biocomput., networks,” in Proc. Adv. Neural Inf. Process. Syst., 2018, vol. 22, 2017, pp. 207–218. pp. 1596–1607. [5] Z. C. Lipton, D. Kale, and R. Wetzel, “Directly mod- [14] E. Jun, A. W. Mulyadi, and H.-I. Suk, “Stochastic eling missing data in sequences with RNNs: Improved imputation and uncertainty-aware attention to EHR for classification of clinical time series,” in Proc. 1st Mach. mortality prediction,” in Int. Joint Conf. Neural Netw., Learn. Healthcare Conf., ser. Proceedings of Machine 2019. Learning Research, vol. 56. PMLR, 18–19 Aug 2016, [15] W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y. Li, pp. 253–270. “BRITS: Bidirectional recurrent imputation for time se- [6] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu, ries,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. “Recurrent neural networks for multivariate time series 6775–6785. with missing values,” in Scientific Rep., vol. 8, 2018. [16] Q. Ma, S. Li, L. Shen, J. Wang, J. Wei, Z. Yu, and G. W. UNDER REVIEW 11

Cottrell, “End-to-end incomplete time-series modeling 1997. from linear memory of latent variables,” in IEEE Trans. [32] S. Kullback and R. A. Leibler, “On information and Cybern., 2019, pp. 1–13. sufficiency,” in The Annals of Math. Stat., vol. 22, no. 1, [17] Y. He, “Missing using multiple imputation,” 1951, pp. 79–86. in Circulation: Cardiovascular Quality and Outcomes, [33] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. vol. 3, no. 1, 2010, pp. 98–105. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. [18] J. Gemmeke, U. Remes, and K. J. Palomaki,¨ “Observa- Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, tion uncertainty measures for sparse imputation,” in Proc. physiotoolkit, and physionet,” in Circulation, vol. 101, Interspeech, 2010, pp. 2262–2265. no. 23, 2000, pp. e215–e220. [19] A. Kendall and Y. Gal, “What uncertainties do we need [34] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. in bayesian deep learning for computer vision?” in Proc. Mark, “Predicting in-hospital mortality of ICU patients: Adv. Neural Inf. Process. Syst., 2017, pp. 5574–5584. The physionet/computing in cardiology challenge 2012,” [20] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, in Comput. in Cardiology, vol. 39, 2012. T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman, [35] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, “Missing value estimation methods for DNA microarrays M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. An- ,” in Bioinf., vol. 17, no. 6, 2001, pp. 520–525. thony Celi, and R. G. Mark, “MIMIC-III, a freely ac- [21] S. Mohamed, Z. Ghahramani, and K. A. Heller, cessible critical care database,” in Scientific Data, vol. 3, “Bayesian exponential family PCA,” in Proc. Adv. Neural 2016. Inf. Process. Syst., 2009, pp. 1089–1096. [36] S. Purushotham, C. Meng, Z. Che, and Y. Liu, “Bench- [22] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maxi- marking deep learning models on large healthcare mum likelihood from incomplete data via the EM algo- datasets,” in J. Biomed. Inform., vol. 83, 2018, pp. 112 rithm,” in J. Royal Stat. Soc.: Series B (Methodological), – 134. vol. 39, no. 1, 1977, pp. 1–22. [37] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, [23] L. Chen, L. Wang, J. Zhao, and W. Wang, “Relevance and Y. Bengio, “A recurrent latent variable model for vector machines-based time series prediction for incom- sequential data,” in Proc. Adv. Neural Inf. Process. Syst., plete training dataset: Two comparative approaches,” in 2015, pp. 2980–2988. IEEE Trans. Cybern., 2019, pp. 1–14. [38] G. Montavon, W. Samek, and K.-R. Muller,¨ “Methods for [24] I. R. White, P. Royston, and A. M. Wood, “Multiple interpreting and understanding deep neural networks,” in imputation using chained equations: Issues and guidance Digit. Signal Process., vol. 73, 2018, pp. 1 – 15. for practice,” in Stat. Medicine, vol. 30, no. 4, 2011, pp. [39] M. Alber, S. Lapuschkin, P. Seegerer, M. Hagele,¨ K. T. 377–399. Schutt,¨ G. Montavon, W. Samek, K.-R. Muller,¨ and P.- [25] M. Azur, E. Stuart, C. Frangakis, and P. Leaf, “Multiple J. K. Sven Dahne,¨ “innvestigate neural networks!” 2018. imputation by chained equations: What is it and how [Online]. Available: https://arxiv.org/abs/1808.04260 does it work?” in Int. J. of Methods in Psychiatric Res., vol. 20, no. 1, 3 2011, pp. 40–49. [26] V. Fortuin, D. Baranchuk, G. Ratsch,¨ and S. Mandt, “GP- VAE: Deep probabilistic time series imputation,” 2019. [Online]. Available: https://arxiv.org/abs/1907.04155 [27] M. Xu, M. Han, C. L. P. Chen, and T. Qiu, “Recurrent broad learning systems for time series prediction,” in IEEE Trans. Cybern., vol. 50, no. 4, 2020, pp. 1405– 1417. [28] J. Yoon, W. R. Zame, and M. v. Schaar, “Multi- directional recurrent neural networks: A novel method for estimating missing data,” in Int. Conf. Mach. Learn. Time Series Workshop, 2017. [29] J. E. Contreras-Reyes, F. O. Lopez´ Quintero, and R. Wiff, “Bayesian modeling of individual growth variability using back-calculation: Application to pink cusk-eel (genypterus blacodes) off chile,” in Ecological Model., vol. 385, 2018, pp. 145 – 153. [30] K. Cho, B. van Merrienboer,¨ C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Conf. Empirical Meth- ods in Natural Lang. Process., 2014, pp. 1724–1734. [31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780,