
Mixture-based Multiple Imputation Model for Clinical Data with a Temporal Dimension

1st Ye Xue, Northwestern University, Evanston, USA, [email protected]
2nd Diego Klabjan, Northwestern University, Evanston, USA, [email protected]
3rd Yuan Luo, Northwestern University, Chicago, USA, [email protected]

Abstract—The problem of missing values in multivariable time series is a key challenge in many applications such as clinical data mining. Although many imputation methods show their effectiveness in many applications, few of them are designed to accommodate clinical multivariable time series. In this work, we propose a multiple imputation model that captures both cross-sectional information and temporal correlations. We integrate Gaussian processes with mixture models and introduce individualized mixing weights to handle the variance of predictive confidence of Gaussian process models. The proposed model is compared with several state-of-the-art imputation algorithms on both real-world and synthetic datasets. Experiments show that our best model provides more accurate imputation than the benchmarks on all of our datasets.

Index Terms—Imputation, EHR, Gaussian process, data mining

arXiv:1908.04209v3 [cs.LG] 2 Mar 2020

I. INTRODUCTION

Computational modeling in clinical applications attracts growing interest with the realization that a quantitative understanding of patient pathophysiological progression is crucial to clinical studies [1]. With comprehensive and precise modeling, we can gain a better understanding of a patient's state, offer more precise diagnoses and provide better individualized therapies [2]. Researchers are increasingly motivated to build more accurate computational models from multiple types of clinical data. However, missing values in clinical data challenge researchers who apply analytic techniques for modeling, as many of these techniques are designed for complete data.

Traditional strategies used in clinical studies to handle missing values include deleting records with missing values and imputing missing entries with mean values. However, deleting records with missing values and other filtering strategies can introduce biases [3] that can impact modeling in many ways, thus limiting its generalizability. Mean imputation is widely used by researchers to handle missing values, but it has been shown to yield less effective estimates than many modern imputation techniques [4]–[7], such as maximum likelihood approaches and multiple imputation methods (e.g., multiple imputation by chained equations (MICE) [8]). The maximum likelihood and multiple imputation methods are based on solid statistical foundations and have become standard in the last few decades [9], [10].

Multiple imputation was originally proposed by Rubin [11] with the idea of replacing each missing value with multiple estimates that reflect the variation within each imputation model. The classic multiple imputation method treats the complete variables as predictors and the incomplete variables as outcomes, and uses multivariate regression models to estimate missing values [12], [13]. Another imputation approach relies on chained equations, where a sequence of regression models is fitted by treating one incomplete variable as the outcome and the other variables, with possible best estimates for their missing values, as predictors [14]–[16]. Multiple imputation has seen success in healthcare applications, where many imputation methods designed for various tasks are based on it [17]–[23].

In recent years, additional imputation methods have been proposed. Although many of them [7], [8], [14], [24]–[28] show their effectiveness in many applications, few are designed for time series-based clinical data. These clinical data are usually multivariable time series, where patients have measurements of multiple laboratory tests at different times. Many methods are designed for cross-sectional imputation (measurements taken at the same time point) and do not consider temporal information that is useful in making predictions or imputing missing values. Ignoring informative temporal correlations and only capturing cross-sectional information may yield less effective imputation.

In order to address the limitations mentioned above, we present a mixture-based multiple imputation model (MixMI) for clinical time series.¹ Our model captures both cross-sectional information and temporal correlations to estimate missing values using mixture models. We model the distribution of measurements using a mixture model. The mixture is composed of linear regression to model cross-sectional correlations and Gaussian processes (GPs) to capture temporal correlations. The problem of integrating GPs within a standard mixture model is that the GP models in all patient cases get the same mixing weight, while the confidence of predictions by GP models can vary largely across different patient cases.

¹The implementation of our model is available at: https://github.com/y-xue/MixMI

This work is supported in part by NIH grant R21LM012618.

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/BigData47090.2019.9005672

We overcome this problem by introducing individualized mixing weights for each patient case, instead of assigning a fixed weight. We train our model using the Expectation-Maximization (EM) algorithm. We demonstrate the effectiveness of our model by comparing it with several state-of-the-art imputation algorithms on multiple clinical datasets.

Our main contributions are summarized as follows.

1. To the best of our knowledge, we are the first to build an imputation model for time series by integrating GPs within mixture models. We address the problem that the GP models in all patient cases get a fixed mixing weight by introducing individualized mixing weights.

2. We test the performance of our model on two real-world clinical datasets and several synthetic datasets. Our best model outperforms all comparison models, including several state-of-the-art imputation models. Using synthetic datasets, we also explore and discover the properties of the data that benefit our model and/or comparison models. Experiments show that our best model is robust to the variation of these properties and outperforms comparison models on all synthetic datasets.

The remainder of this paper is structured as follows. Section II discusses related work, while in Section III the proposed method is described. The experimental setup, including dataset collection and the evaluation procedure, is described in Section IV. Section V discusses the computational results and underlying analyses. The conclusions are drawn in Section VI.

II. RELATED WORK

Research in designing imputation methods for multivariable time series has attracted growing interest in recent decades. Previous studies generally fall into two categories. One comes from methods using linear or other simple parametric functions to estimate missing values. The other is the methods treating time series as smooth curves and estimating missing values using GPs or other nonparametric methods.

In the first category, multivariable time series are modeled based on either linear models [29], linear mixed models [30], [31] or autoregressive models [32], [33]. However, in these methods, the potential trajectories of variables are limited to linear or other simple parametric functions. Alternatively, many authors choose GPs or other nonparametric functions to model time series. Compared to linear models, GPs only have locality constraints, in which close time points in a time series usually have close values. Therefore, GPs bring more flexibility in capturing the temporal trajectories of variables.

However, directly applying GPs to imputation for multivariable time series usually yields less effective imputation due to the lack of consideration for the similarities and correlations across time series. Futoma et al. [34] extend the GP-based approach to imputation in the multivariable setting by applying multi-task Gaussian process (MTGP) prediction [35]. However, the quality of the estimated missing values relies on the estimation of the covariance structure among variables when using MTGP or other multi-task functional approaches [36]–[38]. To make a confident estimation of the covariance, a large number of time points with shared observations of these variables is often required by these multi-task approaches [35], [39]. Due to the fact that many patients only have records with a limited number of time points, time series of inpatient clinical laboratory tests fall short of such a requirement. Hori et al. [40] extend MTGP to imputation of multi-environmental trial data in a third-order tensor. However, this approach is not applicable to clinical data with a large number of patients, since the covariance matrix of observed values needs to be computed together with its inverse, which is intractable for our datasets.

Recently, Luo et al. [41] explore the application of GPs to clinical data and propose an algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that combines the imputations from GP and traditional MICE based on weighting equations. However, the weighting equations are calculated only based on the standard deviations of the GP and MICE predictions for missing values. The weighting strategy is static and not optimized. We postulate that calculating weights through an optimization problem can help to improve the imputation quality. In our work, instead of the predictive mean matching used in [41], we choose linear regression as one component of our model. Our model is also based on a statistical model and is thus statistically justified, which is not the case for [41].

The mixture model proposed in this work differs from the mixtures of Gaussian processes (MGP) [42] in several ways. The mixing components in our model are not limited to Gaussian processes, thus bringing in possibilities of capturing correlations with various choices of parametric functions. More importantly, our model is designed for multivariable time series and captures both temporal and cross-sectional correlations. The latter is ignored in [42] when MGP is applied to multivariable time series, which yields less effective predictions since some clinical tests have strong cross-sectional relations. Without a Gaussian process component, the standard Gaussian mixture model (GMM) is well studied and has been extended to imputation tasks [43]–[46]. These approaches still suffer from losing informative temporal correlations when imputing missing values in longitudinal data.

In order to effectively model the interaction between different aspects, we represent the data as a tensor with each aspect being one mode. Therefore, our method can also be considered a tensor completion approach. Tensor completion problems are extensively studied. However, the classic tensor completion methods [47]–[49] focus on general tensors and usually do not consider temporal aspects. In recent years, many studies explore the application of temporally augmented tensor completion to imputing missing values in time series [50]–[54]. These methods discretize time into evenly sampled intervals. However, due to the fact that inpatient clinical laboratory tests are usually measured at varying intervals, assembling clinical data over regularly sampled time periods might have several drawbacks, such as leading to sparse tensors if time is discretized at fine granularity (e.g., every minute) while some laboratory tests are measured less frequently (e.g., daily). Furthermore, extending these methods to the case where time is not regularly sampled is not easy and straightforward, requiring changes to design details and to the objective functions being optimized. Recently, Yang et al. [55] propose a tensor completion method that can deal with irregularly sampled time. However, most components of this approach are tailored to the characteristics of septic patients. We implement this approach with only its time-aware mechanism, which is general and applicable to our experimental settings. However, this approach is not so effective in our experiments and is thus not included as a benchmark in this paper. Lately, Zhe et al. [56] propose a Bayesian nonparametric tensor decomposition model that captures temporal correlations between interacting events. However, this approach is not directly applicable to continuous multivariable time series because it focuses on discrete events and captures temporal correlations between the occurrences of events.

Recurrent Neural Networks (RNNs) are another choice to capture temporal relationships and have been studied in imputation tasks. Early studies [57]–[59] in RNN-based imputation models capture temporal correlations either within each time series or across all time series [60]. Yoon et al. [60] recently propose a novel multi-directional RNN that captures temporal correlations both within and across time series. Compared with traditional statistical models, neural network-based models are usually harder to train and provide limited explainability, while the performance is not always necessarily better. We use [60] as a benchmark since it is the most recent model. Besides imputation, researchers also attempt to build classification models with RNNs that utilize the missing data patterns [61]–[63]. Instead of relying on the estimation of missing values, these models perform predictions by exploiting the missingness itself, and no separate imputation is required. These models cannot be a substitute for imputation methods since they do not provide imputed values.

III. METHODOLOGY

A. Imputation Framework

In many predictive tasks on temporal clinical data, time series are often aligned into same-length sequences to derive more robust patient phenotypes through matrix decomposition or to discover feature groups by applying sequence/graph mining techniques [55], [64]. We model under this assumption. In this work, we use a tensor representation, in which patients have the same number of time points. We represent the data collected from P patients with V laboratory tests and B time points as two 3D tensors X ∈ R^{P×V×B} and T ∈ R^{P×V×B}, shown in Fig. 1. Each laboratory test measurement x_{p,v,b} is stored in the measurement tensor X. Each time t_{p,v,b}, at which x_{p,v,b} is measured, is stored in the time tensor T.

TABLE I
MAIN SYMBOLS AND DEFINITIONS

Symbol | Definition
X, T | Measurement and time tensors
x_{p,v,-b} | Measurements in fiber x_{p,v,:} excluding x_{p,v,b}
t_{p,v,-b} | Times in fiber t_{p,v,:} excluding t_{p,v,b}
Mix_{v,b} | Mixture model for v and b
V_{v,b} | Concatenation of x_{:,-v,b} and x_{:,v,-b}
N(μ^{(k)}_{v,b}, Σ^{(k)}_{v,b}) | The kth prior multivariate normal distribution in Mix_{v,b}
m^{(3)}(·), Σ^{(3)}(·) | Predictive mean and variance of a GP model
γ_{v,b} | The set of all trainable parameters of Mix_{v,b}
I | The identity matrix

Fig. 1. Measurement and time tensor. An example of the inputs and output of the mixture model Mix_{V,B} is shown as the colorful fiber and matrices. In Mix_{V,B}, the target output is x_{:,V,B}, shown in purple, and the inputs are x_{:,-V,B} and x_{:,V,-B}, shown as the red and blue matrices, respectively, excluding the purple fiber. We model the output with a mixture model, where we train a linear regression on the red matrix and train GPs or/and another linear regression on the blue matrix.

Table I lists the main symbols we use throughout the paper. Missing values in the measurement tensor are denoted as x^{mis}_{p,v,b}, showing that the value of test v at time index b for patient p is missing. Correspondingly, x^{obs}_{p,v,b} denotes an observed value. The time tensor T is complete, since we only collect patient records at the times when at least one laboratory test measurement is available. We assume we know the "prescribed" time at which a missing measurement should have been taken. In the matrix x_{:,:,b} at time index b, the measurement times t_{p,v,b} and t_{q,v,b} can be different when p ≠ q, whereas for a given patient p, we have t_{p,v,b} = t_{p,u,b} for v, u ∈ [1 : V]. That is, all tests for a particular patient are taken at the same time.

Disregarding the temporal dimension, the imputation problem is well studied. If one dimension is time, in order to apply imputation methods that are not designed for time series, we need to disregard the temporal aspect or ignore the temporal correlations of the data. However, temporal trajectories can reveal a patient's underlying pathophysiological evolution, the modeling of which can help to better estimate missing values. Because both cross-sectional information and temporal aspects can impact the estimation of missing measurements, we explore mixture models, which are composed of several base models through either a cross-sectional or a temporal view. We introduce these base models in Section III-B.

In our imputation framework, a mixture model is trained for each variable and time index. We use Mix_{v,b} to denote the mixture model that imputes missing values of variable v at time index b. The missing values x^{mis}_{:,v,b} in the fiber x_{:,v,b} are imputed by the optimized Mix_{v,b}. In each iteration of the algorithm, it is assumed that all other values are known and only x^{mis}_{:,v,b} is imputed. This is wrapped in an outer loop. We call this procedure simple-pass tensor imputation, i.e., one pass through all v, b. Since several simple-pass tensor imputations are conducted, our approach is also considered an iterative imputation [65], which can also be regarded as a sampling-based approach where a Gibbs sampler is used to approximate convergence. The convergence of iterative imputation methods can be quite fast, within a few iterations [8].

In detail, the iterative imputation approaches start by replacing all missing data with values from simple guesses; we fill in all missing values with initial estimates by taking random draws from the observed values. This procedure is called an initial imputation. Then we perform iterative tensor imputation on each copy separately. The training procedure and the imputation of fibers are introduced in Sections III-C and III-D. We also rely on the concept of multiple imputation, where several iterative imputations are performed and the imputed values are averaged at the end. Each iterative imputation starts with a different initial imputation tensor and/or uses a different order of v, b.

In summary, the algorithm creates M different copies X^1, ..., X^M of X, each one filled with different random X^{mis}. For each i = 1, ..., M, we then perform K simple-pass tensor imputations. Each simple pass has a loop over all v, b, which uses Mix_{v,b} to adjust X^{i,mis}. At the end of the M imputations, the X^{i,mis} are averaged across all i to yield the final imputed tensor. The whole imputation process involves M × K × V × B imputation models. We next first focus on the base models behind Mix_{v,b} and then on the actual mixture model.

B. Base Models

Our mixture models are composed of three components that are derived from two base models, linear regression and Gaussian processes. One component consists of GP models and the other two components are linear models through two different views of the measurement tensor. Through a cross-sectional view, the tensor can be considered as a vector of patient-by-variable matrices at different time indices. Through a temporal view, we can view the tensor as a vector of patient-by-time matrices for different variables.

1) Linear model through cross-sectional view: We can view the measurement tensor X as a vector of patient-by-variable matrices. On the slice x_{:,:,b}, we use a linear regression model to fit the target variable v as a function of the other variables except v. The target values x_{:,v,b} are modeled as

  x_{:,v,b} = x_{:,-v,b} β^{(1)}_{v,b} + ε^{(1)}_{v,b},   ε^{(1)}_{v,b} ~ N(0, σ^{(1)2}_{v,b} I)   (1)

where β^{(1)}_{v,b} is the column vector of coefficients and σ^{(1)}_{v,b} is the standard deviation of the error ε^{(1)}_{v,b}, regarding the regression model through the cross-sectional view for variable v and time index b.

The likelihood distribution of x_{:,v,b} is then given by

  x_{:,v,b} | x_{:,-v,b}, β^{(1)}_{v,b}, σ^{(1)2}_{v,b} ~ N(x_{:,-v,b} β^{(1)}_{v,b}, σ^{(1)2}_{v,b} I).   (2)

The training data consists of the observed target values (x_{p,v,b})_{p∈P^{tr}_{v,b}} and the input data (x_{p,-v,b})_{p∈P^{tr}_{v,b}}, where P^{tr}_{v,b} is the training patient set and includes p only if x_{p,v,b} is observed.

2) Linear model through temporal view: In addition to the cross-sectional view, we can also view the measurement tensor X as a vector of patient-by-time matrices. On the matrix x_{:,v,:}, we use linear regression to model the measurements at time index b against those at the other indices. The target values x_{:,v,b} are modeled as

  x_{:,v,b} = x_{:,v,-b} β^{(2)}_{v,b} + ε^{(2)}_{v,b},   ε^{(2)}_{v,b} ~ N(0, σ^{(2)2}_{v,b} I)   (3)

where β^{(2)}_{v,b} is the column vector of coefficients and σ^{(2)}_{v,b} is the standard deviation of the error ε^{(2)}_{v,b}, regarding the linear regression model through the temporal view for variable v and time index b.

The likelihood distribution of x_{:,v,b} is given by

  x_{:,v,b} | x_{:,v,-b}, β^{(2)}_{v,b}, σ^{(2)2}_{v,b} ~ N(x_{:,v,-b} β^{(2)}_{v,b}, σ^{(2)2}_{v,b} I).   (4)

The training data consists of the observed target values (x_{p,v,b})_{p∈P^{tr}_{v,b}} and the input data (x_{p,v,-b})_{p∈P^{tr}_{v,b}}.

3) Gaussian processes through temporal view: Gaussian processes are commonly used to capture trajectories of variables, and are thus used in our mixture model to capture temporal correlations. Through the same temporal view as introduced above, on the matrix x_{:,v,:}, we fit GPs on the time series of each patient. The target value x_{p,v,b} is modeled as

  x_{p,v,b} = μ_{p,v,b} + f(t_{p,v,b}),   f(t_{p,v,b}) ~ GP(0, K(t_{p,v,b}, t_{p,v,b'}))   (5)

where μ_{p,v,b} is the overall mean of the model and f(·) is a Gaussian process with mean 0 and a covariance matrix K(t, t') of time pairs (t, t'). Then the likelihood distribution of x_{p,v,b} is written as

  x_{p,v,b} | α_{p,v,b} ~ N(m^{(3)}(α_{p,v,b}), Σ^{(3)}(α_{p,v,b})),   α_{p,v,b} = (θ_{v,b}, x_{p,v,-b}, t_{p,v,-b})   (6)

where θ_{v,b} are the kernel parameters of the GP models, and the predictive mean and variance are given by m^{(3)}(·) and Σ^{(3)}(·); see more details in Appendix B. For a given v and b, all GP models share the same kernel parameters θ_{v,b}.

C. The Mixture Model

Given the likelihood distributions of all three components, we model the joint mixture distribution, regarding variable v and time index b, in the following way:

  p(x_{:,v,b}, V_{v,b})
  = π^{(1)}_{v,b} N(V_{v,b} | δ^{(1)}_{v,b}) N(x_{:,v,b} | x_{:,-v,b} β^{(1)}_{v,b}, σ^{(1)2}_{v,b} I)
  + π^{(2)}_{v,b} N(V_{v,b} | δ^{(2)}_{v,b}) N(x_{:,v,b} | x_{:,v,-b} β^{(2)}_{v,b}, σ^{(2)2}_{v,b} I)
  + π^{(3)}_{v,b} N(V_{v,b} | δ^{(3)}_{v,b}) N(x_{:,v,b} | m^{(3)}(α_{v,b}), diag(Σ^{(3)}(α_{v,b})))   (7)

where we define V_{v,b} = [x_{:,-v,b} x_{:,v,-b}], α_{v,b} = (α_{p,v,b})_{p∈P}, δ^{(k)}_{v,b} = (μ^{(k)}_{v,b}, Σ^{(k)}_{v,b}), and π^{(k)}_{v,b} is the mixing weight of the kth component. This model can be interpreted as the joint distribution between the observed data V_{v,b} and the missing values x_{:,v,b}, which consists of a mixture of three distributions. The first one, p_1(x_{:,v,b}, V_{v,b}), is modeled as

  p_1(x_{:,v,b}, V_{v,b}) = p(x_{:,v,b} | V_{v,b}) p(V_{v,b})
  = N(x_{:,v,b} | μ_1(V_{v,b}), σ_1(V_{v,b})) N(V_{v,b} | δ^{(1)}_{v,b})
  = N(x_{:,v,b} | x_{:,-v,b} β^{(1)}_{v,b}, σ^{(1)2}_{v,b} I) N(V_{v,b} | δ^{(1)}_{v,b})   (8)

and the remaining two follow the same logic.

By marginalizing over x_{:,v,b}, the prior probability distribution p(V_{v,b}) is written as

  p(V_{v,b}) = Σ_{k=1}^{3} π^{(k)}_{v,b} N(V_{v,b} | δ^{(k)}_{v,b})   (9)

which is a mixture of Gaussians. It also follows that

  p(x_{:,v,b} | V_{v,b}) = p(x_{:,v,b}, V_{v,b}) / p(V_{v,b})
  = C^{(1)}_{v,b} N(x_{:,v,b} | x_{:,-v,b} β^{(1)}_{v,b}, σ^{(1)2}_{v,b} I)
  + C^{(2)}_{v,b} N(x_{:,v,b} | x_{:,v,-b} β^{(2)}_{v,b}, σ^{(2)2}_{v,b} I)
  + C^{(3)}_{v,b} N(x_{:,v,b} | m^{(3)}(α_{v,b}), diag(Σ^{(3)}(α_{v,b})))   (10)

where we define C^{(k)}_{v,b} = π^{(k)}_{v,b} N(V_{v,b} | δ^{(k)}_{v,b}) / Σ_{j=1}^{3} π^{(j)}_{v,b} N(V_{v,b} | δ^{(j)}_{v,b}).

We train our mixture model on the observed target values by maximizing the log likelihood of the joint mixture distribution

  γ̂_{v,b} = arg max_{γ_{v,b}} ln p(x^{obs}_{:,v,b}, V^{tr}_{v,b}; γ_{v,b})   (11)

where V^{tr}_{v,b} is the training input data, defined as the concatenation of (x_{p,-v,b})_{p∈P^{tr}_{v,b}} and (x_{p,v,-b})_{p∈P^{tr}_{v,b}}, and γ_{v,b} is the set of all trainable parameters, composed of β^{(1)}_{v,b}, σ^{(1)2}_{v,b}, β^{(2)}_{v,b}, σ^{(2)2}_{v,b}, θ_{v,b}, and π^{(k)}_{v,b}, μ^{(k)}_{v,b}, Σ^{(k)}_{v,b} for k = 1, 2, 3.

D. Mixture Parameter Estimation

Let ℓ(γ_{v,b}) be the log likelihood ln p(x^{obs}_{:,v,b}, V^{tr}_{v,b}; γ_{v,b}). Explicitly maximizing ℓ(γ_{v,b}) is hard. Instead, we use the EM algorithm to repeatedly construct a lower bound on ℓ(γ_{v,b}) and then optimize that lower bound L(γ_{v,b}). We first define a latent indicator variable q_{v,b} ∈ {1, 2, 3} that specifies which mixing component data points come from. Then we use Jensen's inequality to get the lower bound L(γ_{v,b}), which is given by

  L(γ_{v,b}) = Σ_{p∈P^{tr}_{v,b}} Σ_{q_{v,b}} Q_p(q_{v,b}) ln [ p(x_{p,v,b}, V_{p,v,b}, q_{v,b}; γ_{v,b}) / Q_p(q_{v,b}) ]
  ≤ Σ_{p∈P^{tr}_{v,b}} ln Σ_{k=1}^{3} p(q_{v,b} = k) p(x_{p,v,b}, V_{p,v,b}; γ_{v,b} | q_{v,b} = k)
  = ℓ(γ_{v,b})   (12)

where

  Q_p(q_{v,b} = k) = p(q_{v,b} = k) p(x_{p,v,b}, V_{p,v,b} | q_{v,b} = k) / Σ_{j=1}^{3} p(q_{v,b} = j) p(x_{p,v,b}, V_{p,v,b} | q_{v,b} = j).   (13)

In (12) and (13), the marginal distribution p(q_{v,b} = k) over q_{v,b} is specified by the mixing coefficients π^{(k)}_{v,b} = p(q_{v,b} = k). We can view Q_p(q_{v,b} = k) as the responsibility that component k of the mixture model Mix_{v,b} takes to "explain" x_{p,v,b}. We use the standard EM algorithm to maximize the lower bound L(γ_{v,b}); see more details about the estimation of the parameters in Appendix A.

E. The Imputation Model

The proposed mixture model provides flexibility in changing base models and can be solved under various assumptions. For example, a full mixture model consists of all three base models introduced earlier. However, under the assumption of only linear correlations within time series, mixture models composed of only the two linear components (denoted MixMI-LL) can be used. Although the choice of mixture can be determined empirically or based on expert knowledge of the data, we propose an imputation model (denoted MixMI) that chooses the type of mixture automatically.

In MixMI, for each variable and time index, we train both MixMI-LL and the full mixture model, and then select the one with the lower training error as the final mixture model used to perform imputation in inference. We use the absolute training error, instead of the likelihood, as the selection criterion because the likelihood of a mixture model with two linear models is not on the same scale as that of the full mixture model.

After training, the missing values are imputed using individualized (per-patient) mixing weights that are derived from the conditional distribution. The missing value x^{mis}_{p,v,b} is imputed by

  Π^{(1)}_{p,v,b} x_{p,-v,b} β̂^{(1)}_{v,b} + Π^{(2)}_{p,v,b} x_{p,v,-b} β̂^{(2)}_{v,b} + Π^{(3)}_{p,v,b} m^{(3)}(α̂_{p,v,b})

where the individualized mixing weight of the kth component for patient p, variable v and time index b is defined as

  Π^{(k)}_{p,v,b} = π̂^{(k)}_{v,b} N(V_{p,v,b} | μ̂^{(k)}_{v,b}, Σ̂^{(k)}_{v,b}) / Σ_{j=1}^{3} π̂^{(j)}_{v,b} N(V_{p,v,b} | μ̂^{(j)}_{v,b}, Σ̂^{(j)}_{v,b}),   k = 1, 2, 3   (14)

where V_{p,v,b} is the observed data, defined as the concatenation of x_{p,-v,b} and x_{p,v,-b}.

IV. EXPERIMENTAL SETUP

A. Real-world Datasets

We collect two real-world datasets from the Medical Information Mart for Intensive Care (MIMIC-III) database [66] and the Northwestern Medicine Enterprise Data Warehouse

Dataset GP MTGP1 M-RNN GMM MICE 3D-MICE MixMI-LL MixMI Real-world MIMIC (d=0) 0.13072 0.12599 0.12512 0.11741 0.11763 0.11186 0.09304 0.09285 Synthetic MIMIC (d=0.5) 0.10466 - 0.12320 0.11455 0.11573 0.09087 0.08612 0.08448 Synthetic MIMIC (d=1) 0.09220 - 0.12201 0.11436 0.11561 0.07715 0.08427 0.07538 NMEDW 0.18353 0.17370 0.14607 0.13593 0.13718 0.13624 0.11600 0.11589 1 MTGP cannot have multiple inputs, thus not applicable to the synthetic datasets.
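The scores above are MASE values, computed per variable by scaling each patient's absolute imputation errors by the mean absolute successive difference of that patient's observed series and then averaging over all masked entries (the formal definition appears later, in the evaluation section). A minimal sketch of that computation, assuming plain Python lists and a helper name of our own:

```python
def mase(masked_errors, observed_series):
    """Mean Absolute Scaled Error for one variable: each patient's absolute
    imputation errors are divided by the mean absolute first difference of
    that patient's observed sequence, then averaged over all masked entries.

    masked_errors:   per-patient lists of |imputed - truth| on masked entries
    observed_series: per-patient sequences of observed values (length >= 2)
    """
    total_masked = sum(len(errs) for errs in masked_errors)
    scaled_sum = 0.0
    for errs, y in zip(masked_errors, observed_series):
        # in-sample scale: mean |Y_j - Y_{j-1}| over the observed sequence
        scale = sum(abs(y[j] - y[j - 1]) for j in range(1, len(y))) / (len(y) - 1)
        scaled_sum += sum(errs) / scale
    return scaled_sum / total_masked
```

Because each error is scaled by the variability of the patient's own observed series, scores are comparable across laboratory tests with very different units, which is why a single overall number per dataset is meaningful.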

TABLE III
MASE ON THE REAL-WORLD MIMIC DATASET BY VARIABLE AND IMPUTATION MODEL. THE BOLD NUMBERS ARE THE BEST VALUES AMONG ALL IMPUTATION MODELS.

Variable | GP | MTGP | M-RNN | GMM | MICE | 3D-MICE | MixMI-LL | MixMI
Chloride | 0.12993 | 0.12549 | 0.10865 | 0.10650 | 0.10575 | 0.10836 | 0.08664 | 0.08603
Potassium | 0.11533 | 0.11256 | 0.11222 | 0.10963 | 0.10997 | 0.10822 | 0.09453 | 0.09442
Bicarbonate | 0.13196 | 0.12905 | 0.12371 | 0.12231 | 0.12275 | 0.11984 | 0.10302 | 0.10254
Sodium | 0.12525 | 0.11939 | 0.11557 | 0.10159 | 0.10138 | 0.10727 | 0.08787 | 0.08807
Hematocrit | 0.11436 | 0.10780 | 0.09189 | 0.06518 | 0.06558 | 0.06726 | 0.05486 | 0.05482
Hemoglobin | 0.14168 | 0.12997 | 0.06556 | 0.05774 | 0.05772 | 0.06301 | 0.05117 | 0.05103
MCV | 0.14215 | 0.13732 | 0.13798 | 0.13391 | 0.13474 | 0.13340 | 0.11634 | 0.11657
Platelets | 0.13855 | 0.13087 | 0.14369 | 0.14203 | 0.14236 | 0.12815 | 0.10090 | 0.10070
WBC count | 0.13963 | 0.13614 | 0.14583 | 0.14026 | 0.14068 | 0.13060 | 0.10934 | 0.10913
RDW | 0.14592 | 0.13668 | 0.15938 | 0.15778 | 0.15836 | 0.13897 | 0.11340 | 0.11340
Blood urea nitrogen (BUN) | 0.12358 | 0.12720 | 0.16345 | 0.15160 | 0.15189 | 0.11814 | 0.09479 | 0.09410
Creatinine | 0.13341 | 0.12803 | 0.14904 | 0.13211 | 0.13212 | 0.12217 | 0.10067 | 0.10014
Glucose | 0.12491 | 0.12420 | 0.11677 | 0.11771 | 0.11794 | 0.11921 | 0.10493 | 0.10501

(NMEDW). Each dataset contains inpatient test results from 13 laboratory tests. These tests are quantitative, frequently measured on hospital inpatients, and commonly used in downstream classification tasks [63], [67]. They are the same tests as those used in the imputation study of [41]. We organize the data by unique admissions and distinguish multiple admissions of the same patient. Each admission consists of time series of the 13 laboratory tests. Although our model is evaluated only on continuous variables, it is also applicable to categorical variables with a simple extension (e.g., through predictive mean matching [68]).

In both the MIMIC-III and NMEDW datasets, the length of time series varies across admissions. To apply our imputation model to these datasets, we truncate time series so that they all have the same length, which we set to the average number of time points across all admissions. Before truncating, the average number of time points in the MIMIC-III dataset is 11. We first exclude admissions that have fewer than 11 time points, and then truncate time series by removing measurements taken after the 11th time point. We also exclude admissions that contain time series with no observed values. Our MIMIC-III dataset includes 26,154 unique admissions and its missing rate is about 28.71%. The same data collection procedure is applied to the NMEDW dataset (approved by the Northwestern IRB), where we end up with 13,892 unique admissions with 7 time points each. The missing rate of the NMEDW dataset is 24.22%.

B. Synthetic Datasets

We create synthetic datasets to explore which properties of the data might benefit our model and/or the comparison models. In the synthetic datasets, we augment the correlation between measurements and times. We do not do so by imposing hard constraints under which closer measurements must have closer values. Instead, we generate synthetic times by altering the real times so that such constraints are "slightly" stronger than in the real-world data, and we introduce a scaling factor d to control the strength of the constraints in the synthetic data.

The synthetic datasets are generated from the real-world MIMIC-III dataset. We move two consecutive times of a time series closer together if the relative difference Δx̃_i of the two consecutive measurements is smaller than the relative difference Δt̃_i of the two consecutive times. For a time series of length B, the relative differences are given by

\Delta\tilde{x}_i = \frac{|x_i - x_{i-1}|}{\sum_{j=2}^{B} |x_j - x_{j-1}|}, \qquad
\Delta\tilde{t}_i = \frac{|t_i - t_{i-1}|}{\sum_{j=2}^{B} |t_j - t_{j-1}|}.

The scaling factor d ∈ [0, 1] controls how much we move times. If d = 0, we do not move times at all; in other words, the synthetic dataset at d = 0 is identical to the real-world MIMIC-III dataset. As d increases, stronger constraints are introduced into the synthetic data. The synthetic times t' of a time series are generated as

t'_i =
\begin{cases}
t_1, & \text{if } i = 1 \\
t_i + \sum_{j=2}^{i} d\,(\Delta\tilde{x}_j - \Delta\tilde{t}_j)\,S, & \text{otherwise}
\end{cases}
\qquad
S = \sum_{j=2}^{B} |t_j - t_{j-1}|.

If a time series has missing values, we first calculate the synthetic times for the observed measurements. Then we perform a linear interpolation between the real and synthetic times of the observed measurements to generate synthetic times for the missing measurements.
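The synthetic-time construction above can be sketched as follows. This is an illustrative reimplementation of the published equations for fully observed series (the linear-interpolation step for missing measurements is omitted), not the authors' code, and the function and variable names are our own.

```python
import numpy as np

def synthetic_times(t, x, d):
    """Shift real times t toward times implied by measurement gaps x,
    following t'_i = t_i + sum_{j<=i} d * (dx_j - dt_j) * S."""
    t, x = np.asarray(t, float), np.asarray(x, float)
    gaps_x = np.abs(np.diff(x))           # |x_i - x_{i-1}|, i = 2..B
    gaps_t = np.abs(np.diff(t))           # |t_i - t_{i-1}|
    dx = gaps_x / gaps_x.sum()            # relative measurement differences
    dt = gaps_t / gaps_t.sum()            # relative time differences
    S = gaps_t.sum()                      # total span of observed time gaps
    shift = np.cumsum(d * (dx - dt) * S)  # cumulative shift for i = 2..B
    return np.concatenate(([t[0]], t[1:] + shift))

t = [0.0, 1.0, 2.0, 6.0]
x = [5.0, 5.1, 5.2, 9.0]
print(synthetic_times(t, x, 0.0))  # d = 0 leaves the times unchanged
print(synthetic_times(t, x, 0.5))
```

Note that because the relative differences each sum to one, the cumulative shift telescopes to zero at the last index, so the first and last times are preserved while interior times move.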

C. Evaluation of Imputation Quality

We randomly mask 20% of the observed measurements in a dataset as missing and treat the masked values as the test set. The remaining observed values are used in training. We impute originally missing and masked values together, and compare the imputed values with the ground truth of the masked data to evaluate imputation performance. The Mean Absolute Scaled Error (MASE) [69] is used to measure the quality of imputation on the test set. MASE is a scale-free measure of prediction accuracy and is recommended by Hyndman et al. [69], [70] for measuring the accuracy of predictions for series. We calculate MASE for each variable and take a weighted average according to the number of masked values of each variable to get an overall MASE per dataset.

Let mask_{p,v} be the set, of cardinality I_{p,v}, of all time indices that have been masked for patient p and variable v. Also let Y_{p,v} = (x^{obs}_{p,v,j})_j be the sequence, of length J_{p,v}, of all observed values for patient p and variable v, and let x̃_{p,v,i} represent the imputed value. The MASE for variable v is defined as

\mathrm{MASE}(v) = \frac{1}{\sum_{\bar{p}} I_{\bar{p},v}} \sum_{p} \frac{\sum_{i \in mask_{p,v}} |\tilde{x}_{p,v,i} - x^{obs}_{p,v,i}|}{\frac{1}{J_{p,v}-1}\sum_{j=2}^{J_{p,v}} |Y_{p,v,j} - Y_{p,v,j-1}|}.

To show the effectiveness of our imputation model MixMI and its variant MixMI-LL, we compare their MASE scores with those of six other imputation methods: (a) MICE [8] with 100 imputations, where the average of all imputations is used for evaluation; (b) pure Gaussian processes (GP), where a GP model is fitted to the observed data of each time series using GPfit [71] in R and missing values are replaced with the predictions of the fitted GP models; (c) 3D-MICE, a state-of-the-art imputation model [41], for which we obtain the authors' code and adapt it to account for our use of the tensor representation; (d) multi-task Gaussian process prediction (MTGP) [35]; (e) a Gaussian mixture model for imputation (GMM) [43], [72]; and (f) M-RNN, a state-of-the-art RNN-based imputation model [60].

To tune hyperparameters, where present, we mask out 20% of the observed measurements in the training set as a validation set and tune the hyperparameters on it. We introduce π^(1), π^(2) and π^(3) as hyperparameters of our imputation model.
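The masked-evaluation protocol and the MASE formula above can be sketched as follows, for a single variable. This is an illustrative reimplementation with hypothetical argument names, not the authors' code, and the per-variable weighted average across variables is omitted.

```python
import numpy as np

def mase(observed_seqs, masked_vals, imputed_vals):
    """MASE for one variable.

    observed_seqs: list of per-patient observed-value sequences Y_{p,v}
    masked_vals:   list of arrays of held-out ground-truth values
    imputed_vals:  list of arrays of the corresponding imputed values
    """
    total_masked = sum(len(m) for m in masked_vals)  # sum_p I_{p,v}
    total = 0.0
    for Y, truth, imp in zip(observed_seqs, masked_vals, imputed_vals):
        Y = np.asarray(Y, float)
        # in-sample naive-forecast MAE: mean of |Y_j - Y_{j-1}|
        scale = np.abs(np.diff(Y)).mean()
        total += np.abs(np.asarray(imp) - np.asarray(truth)).sum() / scale
    return total / total_masked

# one patient, constant observed gap 2 -> scale = 2; errors 1.0 and 0.5
score = mase([[1.0, 3.0, 5.0, 7.0]], [[2.0, 6.0]], [[3.0, 6.5]])
print(score)  # (1.0/2 + 0.5/2) / 2 = 0.375
```

The scaling by the naive-forecast error of each patient's observed series is what makes the score comparable across laboratory tests measured on very different scales.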
In our experiments, all mixture models start with the same mixing weights, i.e., π^(k)_{v,b} = π^(k) for k = 1, 2, 3. The other trainable parameters are initialized directly from the data after the initial imputation and do not require manual specification.

We also evaluate our imputation model under a classification task on our MIMIC dataset and compare it with GRU-D [63], a state-of-the-art RNN model for classification with multivariable time series that does not require a separate imputation stage. We build an RNN model with the standard Gated Recurrent Unit as our classification model (GRU-MixMI), trained on data imputed by MixMI; GRU-D is trained on data without imputation. We predict whether a patient dies within 30 days after admission, instead of 48 hours as in [63], because of severe class imbalance in our cohort, where only 0.57% of patients have positive labels within 48 hours.

V. RESULTS

A. Performance Comparison

1) Imputation task: Table II and Fig. 2 compare all imputation models on the 4 datasets. Table II shows the overall MASE score of each imputation model on all datasets, and Fig. 2 shows the percentage deviation of each model's MASE score from that of 3D-MICE. We select 3D-MICE as the reference since it is a state-of-the-art benchmark model. We observe that MixMI outperforms all comparison models on all 4 datasets and is significantly better than the second best model (p = .001, permutation test with 1000 replicates) on all 4 datasets.

Fig. 2. Percentage deviation of MASE score against 3D-MICE of each imputation model.

Table III shows a variable-wise comparison of the imputation models on the real-world MIMIC (d=0) dataset. Our two imputation models outperform all comparison models on all variables, and MixMI is better than MixMI-LL on most variables. All models except GP and MTGP achieve a much lower error on Hematocrit and Hemoglobin than on the other variables. The reason is that these two variables are highly correlated: methods that capture the correlation between variables can reasonably infer missing values of Hematocrit from observed measurements of Hemoglobin, and vice versa. Compared to MICE, MixMI achieves even lower errors on these two variables, which indicates that the temporal correlations captured by our model help produce better estimates of missing values even in the presence of a more dominant cross-sectional correlation.

As shown in Table II, all models except MTGP benefit from an increase of d, the scaling factor used to generate the synthetic data. The reason is that these models take temporal aspects into account, and the measurements in the synthetic time series have stronger temporal correlations as d increases. MICE and GMM also benefit from the temporal correlations because we include time as a feature in addition to the variable features. We also tried excluding times; however, experiments show that these two models perform better when times are included. MixMI-LL is outperformed by 3D-MICE when d increases to 1, while MixMI is robust to the variation of d in our current experimental settings.

We also compare the running times of all models on the real-world MIMIC dataset. MixMI-LL (4.2 hours), GP (1.1 hours), MTGP (1.2 hours), GMM (2.2 hours) and M-RNN (0.9 hours) are the fastest models; 3D-MICE (156.1 hours) is the slowest; MICE (77.5 hours) and MixMI (109.5 hours) are in the middle.

2) Classification task: Our classification model GRU-MixMI is compared with GRU-D on data with different splits and different random seeds. First, we duplicate the original data and split both copies into the same training and test set. One copy is then imputed by MixMI and used to train GRU-MixMI; the other is used by GRU-D directly, without imputation. Both models are trained and tested 5 times with different random seeds pertaining to the algorithms, and the area under the ROC curve (AUC) scores on the test sets are reported. This procedure is repeated 3 times with different splits. On average, GRU-MixMI ties with GRU-D in terms of AUC score (0.7857 vs. 0.7853, p > 0.5). However, compared with models that perform imputation and prediction together, our imputation model, as a pure imputation method, offers more opportunities for other supervised and unsupervised tasks, for modeling choices, and for effective feature engineering.

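The significance claims above rest on a permutation test with 1000 replicates. A generic paired (sign-flip) permutation test of this kind can be sketched as follows; this is our illustration with made-up error arrays, not the authors' code, and the exact test statistic they used may differ.

```python
import numpy as np

def paired_permutation_test(err_a, err_b, n_perm=1000, seed=0):
    """Two-sided paired permutation test on the mean error difference.
    Randomly swaps the two models' errors per case and counts how often
    the permuted mean difference is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(err_a) - np.asarray(err_b)
    observed = abs(diff.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diff.size)
        if abs((signs * diff).mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # smallest attainable p is 1/(n_perm+1)

rng = np.random.default_rng(1)
err_model_a = rng.normal(0.075, 0.01, size=200)  # consistently smaller errors
err_model_b = err_model_a + 0.01
print(paired_permutation_test(err_model_a, err_model_b))
```

With 1000 replicates the smallest attainable p-value is about .001, which matches the granularity of the p = .001 reported in the text.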
B. Component Weights

The key information our imputation model uses to estimate missing values is the component weight π, which quantifies the interaction of cross-sectional and temporal correlations. We study MixMI on the real-world MIMIC dataset and visualize how this interaction is captured by our model. Fig. 3 shows a comparison of all component weights for the variables across all patients and times.

Fig. 3. Component weights comparison on the real-world MIMIC dataset.

When imputing Hematocrit and Hemoglobin, our model relies mostly (79.6% and 84.8%, respectively) on the linear model through the cross-sectional view because of the strong correlation between these two variables. Interestingly, most values of π^(3) are lower than 10%, which indicates that temporal correlations are not strong for these variables in our dataset. However, our model is still able to detect the weak temporal correlations and utilize them to improve imputation. Additionally, we observe that when predicting missing values at the beginning and the end of a time series, our model reasonably uses a lower value of π^(3) than in the middle: on average, π^(3) at the first and last time indices is 13.9% lower than at the middle indices. This is due to the usage of GPs, which usually produce less confident estimates at the end points of series. Furthermore, we observe an increase of π^(3) as we impose stronger temporal correlations in the synthetic datasets, which further validates the ability of our model to capture such interaction.

C. Individualized Weights

By introducing the individualized (per patient) mixing weights Π defined in (11), we improve the MASE score from 0.08351 to 0.07538, an improvement of 9.73% compared against the model where each mixture component has a fixed weight for all patient cases. The reason that individualized weights are better than fixed weights in our model might be that they better approximate the responsibilities Q defined in (13).

In training, we can optimize the responsibility a component should take to "explain" an observed target value x_{p,v,b} for p ∈ P^tr_{v,b}. However, when making inference, the responsibility each component should take to "explain" a missing value is unknown, because responsibilities depend on observed target values, according to (13). We therefore use Π, the individualized mixing weights, as an approximation of the responsibilities in inference. As defined in (11), Π depends only on the inputs, so we can calculate these weights when making inferences on the test set.

In a standard mixture model, we could use π^(k)_{v,b}, the average of the responsibilities of the kth component across all training patients, as a fixed weight that the kth component contributes when imputing the missing values x^mis_{:,v,b} of all test patients. However, patient time series can be very different, and the confidence of the predictions of the GP component can vary largely across patient cases. A fixed weight cannot reflect such variation in prediction confidence.

We instead view an individualized mixing weight as an approximation of how much responsibility a component should take to impute the missing value of a particular patient case; it is tailored to each patient. To visualize how individualized mixing weights help to produce better estimates, in Fig. 4 we plot the distribution of the individualized weights Π of the GP component on the training set and compare it with the distribution of the optimized responsibility values Q. The responsibilities that the GP component should take vary considerably across patient cases, especially on the synthetic dataset, which implies that it is more reasonable to give patients individualized mixing weights than a fixed weight. We also observe that the individualized mixing weights reasonably mimic the distribution of the optimized responsibilities on the training set. The improvement of our model on the test set attests that the individualized weights approximate the responsibilities better than fixed weights.

Fig. 4. A comparison between the individualized mixing weights Π and the optimized responsibilities Q that the GP component should take to "explain" observed measurements of training patients, for the mixture models of Chloride at time points 1 and 6 on the real-world MIMIC (d=0) and synthetic MIMIC (d=0.5) datasets: (a) Chloride, T1, d=0; (b) Chloride, T6, d=0; (c) Chloride, T1, d=0.5; (d) Chloride, T6, d=0.5. The distributions of the optimized responsibilities are shown in blue and those of the individualized mixing weights in yellow.

VI. CONCLUSIONS

We present and demonstrate a mixture-based imputation model, MixMI, for multivariable clinical time series. Our model captures both cross-sectional and temporal correlations in time series. We integrate Gaussian processes with mixture models and introduce individualized mixing weights to further improve imputation accuracy. We show that MixMI provides more accurate imputation than several state-of-the-art imputation models. Although in this work our model is tested on inpatient clinical data, the proposed idea is general and should work for any multivariable time series in which interactions of cross-sectional and temporal correlations exist.

REFERENCES

[1] R. L. Winslow, N. Trayanova, D. Geman, and M. I. Miller, "Computational medicine: Translating models to clinical care," Science Translational Medicine, vol. 4, no. 158, pp. 158rv11–158rv11, 2012.
[2] I. S. Kohane, "Ten things we have to do to achieve precision medicine," Science, vol. 349, no. 6243, pp. 37–38, 2015.
[3] G. M. Weber, W. G. Adams, E. V. Bernstam, J. P. Bickel, K. P. Fox, K. Marsolo, V. A. Raghavan, A. Turchin, X. Zhou, S. N. Murphy et al., "Biases introduced by filtering electronic health records for patients with 'complete data'," Journal of the American Medical Informatics Association, vol. 24, no. 6, pp. 1134–1141, 2017.
[4] R. L. Brown, "Efficacy of the indirect approach for estimating structural equation models with missing data: A comparison of five methods," Structural Equation Modeling: A Multidisciplinary Journal, vol. 1, no. 4, pp. 287–316, 1994.
[5] W. Wothke, "Longitudinal and multi-group modeling with missing data," in Modeling Longitudinal and Multiple-Group Data: Practical Issues, Applied Approaches, and Specific Examples, T. D. Little, K. U. Schnabel, and J. Baumert, Eds., 2000, pp. 219–240.
[6] J. L. Arbuckle, "Full information estimation in the presence of incomplete data," in Advanced Structural Equation Modeling: Issues and Techniques, G. A. Marcoulides and R. E. Schumacker, Eds., 1996, vol. 243, p. 277.
[7] A. K. Waljee, A. Mukherjee, A. G. Singal, Y. Zhang, J. Warren, U. Balis, J. Marrero, J. Zhu, and P. D. Higgins, "Comparison of imputation methods for missing laboratory data in medicine," BMJ Open, vol. 3, no. 8, p. e002847, 2013.
[8] S. Buuren and K. Groothuis-Oudshoorn, "mice: Multivariate imputation by chained equations in R," Journal of Statistical Software, vol. 45, no. 3, 2011.
[9] J. W. Graham, "Missing data analysis: Making it work in the real world," Annual Review of Psychology, vol. 60, pp. 549–576, 2009.
[10] J. L. Schafer and J. W. Graham, "Missing data: Our view of the state of the art," Psychological Methods, vol. 7, no. 2, p. 147, 2002.
[11] D. B. Rubin, "Multiple imputations in sample surveys: A phenomenological Bayesian approach to nonresponse," in Proceedings of the Survey Research Methods Section of the American Statistical Association, vol. 1. American Statistical Association, 1978, pp. 20–34.
[12] ——, Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, 1987.
[13] J. L. Schafer, Analysis of Incomplete Multivariate Data. Chapman and Hall/CRC, 1997.
[14] T. E. Raghunathan, J. M. Lepkowski, J. Van Hoewyk, and P. Solenberger, "A multivariate technique for multiply imputing missing values using a sequence of regression models," Survey Methodology, vol. 27, no. 1, pp. 85–96, 2001.
[15] S. Van Buuren, Flexible Imputation of Missing Data. Chapman and Hall/CRC, 2018.
[16] S. Van Buuren, J. P. Brand, C. G. Groothuis-Oudshoorn, and D. B. Rubin, "Fully conditional specification in multivariate imputation," Journal of Statistical Computation and Simulation, vol. 76, no. 12, pp. 1049–1064, 2006.
[17] O. Harel and X.-H. Zhou, "Multiple imputation for the comparison of two screening tests in two-phase Alzheimer studies," Statistics in Medicine, vol. 26, no. 11, pp. 2370–2388, 2007.
[18] L. Qi, Y.-F. Wang, and Y. He, "A comparison of multiple imputation and fully augmented weighted estimators for Cox regression with missing covariates," Statistics in Medicine, vol. 29, no. 25, pp. 2592–2604, 2010.
[19] Y.-S. Su, A. Gelman, J. Hill, M. Yajima et al., "Multiple imputation with diagnostics (mi) in R: Opening windows into the black box," Journal of Statistical Software, vol. 45, no. 2, pp. 1–31, 2011.
[20] C.-H. Hsu, J. M. Taylor, S. Murray, and D. Commenges, "Survival analysis using auxiliary variables via non-parametric multiple imputation," Statistics in Medicine, vol. 25, no. 20, pp. 3503–3517, 2006.
[21] Q. Long, C.-H. Hsu, and Y. Li, "Doubly robust nonparametric multiple imputation for ignorable missing data," Statistica Sinica, vol. 22, p. 149, 2012.
[22] S. Van Buuren, H. C. Boshuizen, and D. L. Knook, "Multiple imputation of missing blood pressure covariates in survival analysis," Statistics in Medicine, vol. 18, no. 6, pp. 681–694, 1999.
[23] Y. Deng, C. Chang, M. S. Ido, and Q. Long, "Multiple imputation for general missing data patterns in the presence of high-dimensional data," Scientific Reports, vol. 6, p. 21689, 2016.
[24] D. J. Stekhoven and P. Bühlmann, "MissForest—non-parametric missing value imputation for mixed-type data," Bioinformatics, vol. 28, no. 1, pp. 112–118, 2011.
[25] R. Little and H. An, "Robust likelihood-based analysis of multivariate data with missing values," Statistica Sinica, pp. 949–968, 2004.
[26] Y. Luo, P. Szolovits, A. S. Dighe, and J. M. Baron, "Using machine learning to predict laboratory test results," American Journal of Clinical Pathology, vol. 145, no. 6, pp. 778–788, 2016.
[27] G. Zhang and R. Little, "Extensions of the penalized spline of propensity prediction method of imputation," Biometrics, vol. 65, no. 3, pp. 911–918, 2009.
[28] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman, "Missing value estimation methods for DNA microarrays," Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
[29] A. Shah, N. Laird, and D. Schoenfeld, "A random-effects model for multiple characteristics with possibly missing data," Journal of the American Statistical Association, vol. 92, no. 438, pp. 775–779, 1997.
[30] M. Liu, J. M. Taylor, and T. R. Belin, "Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies," Biometrics, vol. 56, no. 4, pp. 1157–1163, 2000.
[31] J. L. Schafer and R. M. Yucel, "Computational strategies for multivariate linear mixed-effects models with missing values," Journal of Computational and Graphical Statistics, vol. 11, no. 2, pp. 437–457, 2002.
[32] E. E. Holmes, E. J. Ward, and K. Wills, "MARSS: Multivariate autoregressive state-space models for analyzing time-series data," R Journal, vol. 4, no. 1, 2012.
[33] F. Bashir and H.-L. Wei, "Handling missing data in multivariate time series using a vector autoregressive model-imputation (VAR-IM) algorithm," Neurocomputing, 2017.
[34] J. Futoma, S. Hariharan, and K. Heller, "Learning to detect sepsis with a multitask Gaussian process RNN classifier," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1174–1182.
[35] E. V. Bonilla, K. M. Chai, and C. Williams, "Multi-task Gaussian process prediction," in Advances in Neural Information Processing Systems, 2008, pp. 153–160.
[36] Y. He, R. Yucel, and T. E. Raghunathan, "A functional multiple imputation approach to incomplete longitudinal data," Statistics in Medicine, vol. 30, no. 10, pp. 1137–1156, 2011.
[37] J.-M. Chiou, Y.-C. Zhang, W.-H. Chen, and C.-W. Chang, "A functional data approach to missing value imputation and outlier detection for traffic flow data," Transportmetrica B: Transport Dynamics, vol. 2, no. 2, pp. 106–129, 2014.
[38] S. Kliethermes and J. Oleson, "A Bayesian approach to functional mixed-effects modeling for longitudinal data with binomial outcomes," Statistics in Medicine, vol. 33, no. 18, pp. 3130–3146, 2014.
[39] K. Yu, V. Tresp, and A. Schwaighofer, "Learning Gaussian processes from multiple tasks," in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 1012–1019.
[40] T. Hori, D. Montcho, C. Agbangla, K. Ebana, K. Futakuchi, and H. Iwata, "Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials," Theoretical and Applied Genetics, vol. 129, no. 11, pp. 2101–2115, 2016.
[41] Y. Luo, P. Szolovits, A. S. Dighe, and J. M. Baron, "3D-MICE: Integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data," Journal of the American Medical Informatics Association, vol. 25, no. 6, pp. 645–653, 2017.
[42] V. Tresp, "Mixtures of Gaussian processes," in Advances in Neural Information Processing Systems, 2001, pp. 654–660.
[43] M. Di Zio, U. Guarnera, and O. Luzi, "Imputation through finite Gaussian mixture models," Computational Statistics & Data Analysis, vol. 51, no. 11, pp. 5305–5316, 2007.
[44] O. Delalleau, A. Courville, and Y. Bengio, "Efficient EM training of Gaussian mixtures with missing data," arXiv preprint arXiv:1209.0521, 2012.
[45] X. Yan, W. Xiong, L. Hu, F. Wang, and K. Zhao, "Missing value imputation based on Gaussian mixture model for the internet of things," Mathematical Problems in Engineering, vol. 2015, 2015.
[46] D. S. Silva and C. V. Deutsch, "Multivariate data imputation using Gaussian mixture models," Spatial Statistics, vol. 27, pp. 74–90, 2018.
[47] M. Filipović and A. Jukić, "Tucker factorization with missing data with application to low-n-rank tensor completion," Multidimensional Systems and Signal Processing, vol. 26, no. 3, pp. 677–692, 2015.
[48] Y. Liu, F. Shang, L. Jiao, J. Cheng, and H. Cheng, "Trace norm regularized CANDECOMP/PARAFAC decomposition with missing data," IEEE Transactions on Cybernetics, vol. 45, no. 11, pp. 2437–2448, 2015.
[49] R. Tomioka, K. Hayashi, and H. Kashima. (2010) Estimation of low-rank tensors via convex optimization. [Online]. Available: https://arxiv.org/pdf/1010.0789.pdf
[50] M. T. Bahadori, Q. R. Yu, and Y. Liu, "Fast multivariate spatio-temporal analysis via low rank tensor learning," in Advances in Neural Information Processing Systems, 2014, pp. 3491–3499.
[51] R. Yu, D. Cheng, and Y. Liu, "Accelerated online low-rank tensor learning for multivariate spatio-temporal streams," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 238–247.
[52] H. Ge, J. Caverlee, N. Zhang, and A. Squicciarini, "Uncovering the spatio-temporal dynamics of memes in the presence of incomplete information," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 2016, pp. 1493–1502.
[53] K. Takeuchi, H. Kashima, and N. Ueda, "Autoregressive tensor factorization for spatio-temporal predictions," IEEE International Conference on Data Mining, 2017.
[54] Y. Cai, H. Tong, W. Fan, P. Ji, and Q. He, "Facets: Fast comprehensive mining of coevolving high-order time series," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 79–88.
[55] X. Yang, Y. Zhang, and M. Chi, "Time-aware subgroup matrix decomposition: Imputing missing data using forecasting events," in 2018 IEEE International Conference on Big Data. IEEE, 2018, pp. 1524–1533.
[56] S. Zhe and Y. Du, "Stochastic nonparametric event-tensor decomposition," in Advances in Neural Information Processing Systems, 2018, pp. 6855–6865.
[57] Y. Bengio and F. Gingras, "Recurrent neural networks for missing or asynchronous data," in Advances in Neural Information Processing Systems, 1996, pp. 395–401.
[58] V. Tresp and T. Briegel, "A solution for missing data in recurrent neural networks with an application to blood glucose prediction," in Advances in Neural Information Processing Systems, 1998, pp. 971–977.
[59] S. Parveen and P. Green, "Speech recognition with missing data using recurrent neural nets," in Advances in Neural Information Processing Systems, 2002, pp. 1189–1195.
[60] J. Yoon, W. R. Zame, and M. van der Schaar, "Estimating missing data in temporal data streams using multi-directional recurrent neural networks," IEEE Transactions on Biomedical Engineering, vol. 66, no. 5, pp. 1477–1490, 2018.
[61] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun, "Doctor AI: Predicting clinical events via recurrent neural networks," in Machine Learning for Healthcare Conference, 2016, pp. 301–318.
[62] Z. C. Lipton, D. Kale, and R. Wetzel, "Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series," in Machine Learning for Healthcare Conference, 2016, pp. 253–270.
[63] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu, "Recurrent neural networks for multivariate time series with missing values," Scientific Reports, vol. 8, no. 1, p. 6085, 2018.
[64] Y. Xue, D. Klabjan, and Y. Luo, "Predicting ICU readmission using grouped physiological and medication trends," Artificial Intelligence in Medicine, 2018.
[65] A. Gelman, "Parameterization and Bayesian modeling," Journal of the American Statistical Association, vol. 99, no. 466, pp. 537–545, 2004.
[66] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, p. 160035, 2016.
[67] J.-R. Le Gall, S. Lemeshow, and F. Saulnier, "A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study," JAMA, vol. 270, no. 24, pp. 2957–2963, 1993.
[68] R. J. Little, "Missing-data adjustments in large surveys," Journal of Business & Economic Statistics, vol. 6, no. 3, pp. 287–296, 1988.
[69] R. J. Hyndman and A. B. Koehler, "Another look at measures of forecast accuracy," International Journal of Forecasting, vol. 22, no. 4, pp. 679–688, 2006.
[70] P. H. Franses, "A note on the mean absolute scaled error," International Journal of Forecasting, vol. 32, no. 1, pp. 20–22, 2016.
[71] B. MacDonald, P. Ranjan, and H. Chipman, "GPfit: An R package for fitting a Gaussian process model to deterministic simulator outputs," Journal of Statistical Software, vol. 64, no. 1, pp. 1–23, 2015.
[72] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

APPENDIX A
PARAMETER ESTIMATION IN EM

In the E (Expectation) step, we calculate the responsibilities w^{(k)}_{p,v,b} = Q_p(q_{v,b} = k) for p ∈ P^tr_{v,b} using the current values of the parameters in iteration j:

[C^{(1)}_{p,v,b}]^{(j)} = [\pi^{(1)}_{v,b}]^{(j)} [D^{(1)}_{p,v,b}]^{(j)} \mathcal{N}(x_{p,v,b} \mid x_{p,-v,b}[\beta^{(1)}_{v,b}]^{(j)}, [\sigma^{2(1)}_{v,b}]^{(j)})
[C^{(2)}_{p,v,b}]^{(j)} = [\pi^{(2)}_{v,b}]^{(j)} [D^{(2)}_{p,v,b}]^{(j)} \mathcal{N}(x_{p,v,b} \mid x_{p,v,-b}[\beta^{(2)}_{v,b}]^{(j)}, [\sigma^{2(2)}_{v,b}]^{(j)})
[C^{(3)}_{p,v,b}]^{(j)} = [\pi^{(3)}_{v,b}]^{(j)} [D^{(3)}_{p,v,b}]^{(j)} \mathcal{N}(x_{p,v,b} \mid m^{G}([\alpha_{p,v,b}]^{(j)}), \Sigma^{G}([\alpha_{p,v,b}]^{(j)}))
[D^{(k)}_{p,v,b}]^{(j)} = \mathcal{N}(V_{p,v,b} \mid [\mu^{(k)}_{v,b}]^{(j)}, [\Sigma^{(k)}_{v,b}]^{(j)}), \quad k = 1, 2, 3
[w^{(k)}_{p,v,b}]^{(j)} = \frac{[C^{(k)}_{p,v,b}]^{(j)}}{\sum_{i=1}^{3} [C^{(i)}_{p,v,b}]^{(j)}}, \quad k = 1, 2, 3.

Let Z_{v,b} = (x_{p,-v,b})_{p \in P^{tr}_{v,b}} and Y_{v,b} = (x_{p,v,-b})_{p \in P^{tr}_{v,b}}. In the M (Maximization) step, we re-estimate the parameters in iteration (j+1) using the jth responsibilities:

[\pi^{(k)}_{v,b}]^{(j+1)} = \frac{1}{|P^{tr}_{v,b}|} \sum_{p \in P^{tr}_{v,b}} [w^{(k)}_{p,v,b}]^{(j)}, \quad k = 1, 2, 3
[\mu^{(k)}_{v,b}]^{(j+1)} = \frac{\sum_{p \in P^{tr}_{v,b}} [w^{(k)}_{p,v,b}]^{(j)} V_{p,v,b}}{\sum_{p \in P^{tr}_{v,b}} [w^{(k)}_{p,v,b}]^{(j)}}
[\Sigma^{(k)}_{v,b}]^{(j+1)} = \frac{1}{\sum_{p \in P^{tr}_{v,b}} [w^{(k)}_{p,v,b}]^{(j)}} \sum_{p \in P^{tr}_{v,b}} [w^{(k)}_{p,v,b}]^{(j)} [U^{(k)}_{p,v,b}]^{(j+1)}
[U^{(k)}_{p,v,b}]^{(j+1)} = \{V_{p,v,b} - [\mu^{(k)}_{v,b}]^{(j+1)}\}\{V_{p,v,b} - [\mu^{(k)}_{v,b}]^{(j+1)}\}'
[\beta^{(1)}_{v,b}]^{(j+1)} = \{\{Z'_{v,b}[w^{(1)}_{v,b}]^{(j)} Z_{v,b}\}^{-1} Z'_{v,b}[w^{(1)}_{v,b}]^{(j)} x^{obs}_{:,v,b}\}'
[\sigma^{2(1)}_{v,b}]^{(j+1)} = \frac{[S^{(1)}_{v,b}]^{(j+1)}}{\sum_{p \in P^{tr}_{v,b}} [w^{(1)}_{p,v,b}]^{(j)}}
[S^{(1)}_{v,b}]^{(j+1)} = \sum_{p \in P^{tr}_{v,b}} [w^{(1)}_{p,v,b}]^{(j)} \{x_{p,v,b} - x_{p,-v,b}[\beta^{(1)}_{v,b}]^{(j+1)}\}^2
[\beta^{(2)}_{v,b}]^{(j+1)} = \{\{Y'_{v,b}[w^{(2)}_{v,b}]^{(j)} Y_{v,b}\}^{-1} Y'_{v,b}[w^{(2)}_{v,b}]^{(j)} x^{obs}_{:,v,b}\}'
[\sigma^{2(2)}_{v,b}]^{(j+1)} = \frac{[S^{(2)}_{v,b}]^{(j+1)}}{\sum_{p \in P^{tr}_{v,b}} [w^{(2)}_{p,v,b}]^{(j)}}
[S^{(2)}_{v,b}]^{(j+1)} = \sum_{p \in P^{tr}_{v,b}} [w^{(2)}_{p,v,b}]^{(j)} \{x_{p,v,b} - x_{p,v,-b}[\beta^{(2)}_{v,b}]^{(j+1)}\}^2
[\theta_{v,b}]^{(j+1)} = G([w^{(3)}_{v,b}]^{(j)}, [\theta_{v,b}]^{(j)}, x^{obs}_{:,v,b}, Y_{v,b}, t_{:,v,:})

where [w^{(k)}_{v,b}]^{(j)} is the vector of [w^{(k)}_{p,v,b}]^{(j)} for p ∈ P^tr_{v,b} in iteration j. The kernel parameters θ_{v,b} of the GP models are estimated by the function G, a gradient descent method that computes [θ_{v,b}]^{(j+1)} to maximize L_{v,b}(γ), using [θ_{v,b}]^{(j)} as the starting point. The first-order derivatives of L_{v,b}(γ) with respect to θ_{v,b} that are used in G are given in Appendix C.

APPENDIX B
GP MODEL

We assume the GP model discussed here is part of a mixture model for a certain variable and time, and thus we exclude the subscripts v and b. We use x_{p,t} to denote a measurement of the time series x_p at time t for patient p of a certain variable, and x_{p,-t} to denote the time series without the measurement at time t. The GP model is given by

x_{p,t} = \mu_{p,t} + f(t), \qquad f(t) \sim \mathcal{GP}(0, K(t, t'))

where μ_{p,t} is the overall mean of the model and f(t) is a Gaussian process with mean 0 and covariance K(t, t'). Following the maximum likelihood approach, the best linear unbiased predictor (BLUP; see Sacks et al., "Design and analysis of computer experiments," Statistical Science, 1989, pp. 409–423) at t and its mean squared error are

m^{(3)}(\theta, x_{p,-t}, \bar{t}) = \left(\frac{1 - r^T R^{-1} 1_n}{1_n^T R^{-1} 1_n}\, 1_n^T + r^T\right) R^{-1} x_{p,-t}
\Sigma^{(3)}(\theta, x_{p,-t}, \bar{t}) = \sigma_f^2 \left[1 - r^T R^{-1} r + \frac{(1 - 1_n^T R^{-1} r)^2}{1_n^T R^{-1} 1_n}\right]

where r_t(t') = corr(f(t), f(t')), r is the vector of r_t(t') over all observed times, \bar{t} is the vector of times excluding time t, R is the (B-1) × (B-1) correlation matrix, and the correlation function is given by

R_{t,t'} = \exp(-\theta |t - t'|^2).

The estimator of σ_f^2 is

\sigma_f^2 = \frac{C^T R^{-1} C}{n}, \qquad C = x_{p,-t} - 1_n (1_n^T R^{-1} 1_n)^{-1} (1_n^T R^{-1} x_{p,-t})

where 1_n is the all-ones vector of length (B-1).

APPENDIX C
PARTIAL DERIVATIVES IN GP

To simplify the notation, we assume that the likelihood function L under consideration is for a mixture model for a certain variable and time. The partial derivative with respect to the Gaussian process parameters θ is

\frac{\partial L}{\partial \theta} = \sum_{p=1}^{|P^{tr}|} w_p \frac{\partial}{\partial \theta} \ln \mathcal{N}(x_{p,t};\, m^{(3)}(\theta, x_{p,-t}, \bar{t}),\, \Sigma^{(3)}(\theta, x_{p,-t}, \bar{t})).

Letting g_p(θ) = m^{(3)}(θ, x_{p,-t}, \bar{t}) and h_p(θ) = Σ^{(3)}(θ, x_{p,-t}, \bar{t}), we have

\frac{\partial L}{\partial \theta} = \sum_{p=1}^{|P^{tr}|} w_p \frac{\partial}{\partial \theta} \ln \mathcal{N}(x_{p,t}; g_p(\theta), h_p(\theta))
= \sum_{p=1}^{|P^{tr}|} w_p \frac{\partial}{\partial \theta} \left\{\ln \frac{1}{\sqrt{2\pi h_p(\theta)}} - \frac{[x_{p,t} - g_p(\theta)]^2}{2 h_p(\theta)}\right\}
= \sum_{p=1}^{|P^{tr}|} w_p \left\{-\frac{1}{2 h_p(\theta)} \frac{\partial h_p(\theta)}{\partial \theta} - \frac{\partial}{\partial \theta} \frac{[x_{p,t} - g_p(\theta)]^2}{2 h_p(\theta)}\right\}
= \sum_{p=1}^{|P^{tr}|} w_p \left\{-\frac{1}{2 h_p(\theta)} \frac{\partial h_p(\theta)}{\partial \theta} - \frac{1}{2 h_p(\theta)^2}\left\{2[x_{p,t} - g_p(\theta)]\left[-\frac{\partial g_p(\theta)}{\partial \theta}\right] h_p(\theta) - \frac{\partial h_p(\theta)}{\partial \theta}[x_{p,t} - g_p(\theta)]^2\right\}\right\}
= \sum_{p=1}^{|P^{tr}|} w_p \left\{-\frac{1}{2 h_p(\theta)} \frac{\partial h_p(\theta)}{\partial \theta} + \frac{[x_{p,t} - g_p(\theta)] \frac{\partial g_p(\theta)}{\partial \theta}}{h_p(\theta)} + \frac{\frac{\partial h_p(\theta)}{\partial \theta} [x_{p,t} - g_p(\theta)]^2}{2 h_p(\theta)^2}\right\}.

Then \frac{\partial g_p(\theta)}{\partial \theta} and \frac{\partial h_p(\theta)}{\partial \theta} are given by

\frac{\partial g_p(\theta)}{\partial \theta} = \left(\frac{\partial H_1}{\partial \theta} R^{-1} + H_1 \frac{\partial R^{-1}}{\partial \theta}\right) x_{p,-t}
\frac{\partial h_p(\theta)}{\partial \theta} = \sigma_f^2 \frac{\partial H_3}{\partial \theta} + \frac{\partial \sigma_f^2}{\partial \theta} H_3

where H_1, \frac{\partial H_1}{\partial \theta}, H_3 and \frac{\partial H_3}{\partial \theta} are given as follows:

H_1 = \frac{1 - r^T R^{-1} 1_n}{1_n^T R^{-1} 1_n}\, 1_n^T + r^T
\frac{\partial H_1}{\partial \theta} = \frac{-\left(\frac{\partial r^T}{\partial \theta} R^{-1} + r^T \frac{\partial R^{-1}}{\partial \theta}\right) 1_n \,(1_n^T R^{-1} 1_n) - \left(1_n^T \frac{\partial R^{-1}}{\partial \theta} 1_n\right)(1 - r^T R^{-1} 1_n)}{(1_n^T R^{-1} 1_n)^2}\, 1_n^T + \frac{\partial r^T}{\partial \theta}
f_c = (1 - 1_n^T R^{-1} r)^2, \qquad g_c = 1_n^T R^{-1} 1_n
\frac{\partial f_c}{\partial \theta} = 2(1 - 1_n^T R^{-1} r)\left[-1_n^T \left(\frac{\partial R^{-1}}{\partial \theta} r + R^{-1} \frac{\partial r}{\partial \theta}\right)\right]
\frac{\partial g_c}{\partial \theta} = 1_n^T \frac{\partial R^{-1}}{\partial \theta} 1_n
H_2 = \frac{f_c}{g_c} = \frac{(1 - 1_n^T R^{-1} r)^2}{1_n^T R^{-1} 1_n}
\frac{\partial H_2}{\partial \theta} = \frac{\frac{\partial f_c}{\partial \theta} g_c - \frac{\partial g_c}{\partial \theta} f_c}{g_c^2}
H_3 = 1 - r^T R^{-1} r + H_2
\frac{\partial H_3}{\partial \theta} = -\left(\frac{\partial r^T}{\partial \theta} R^{-1} r + r^T \frac{\partial R^{-1}}{\partial \theta} r + r^T R^{-1} \frac{\partial r}{\partial \theta}\right) + \frac{\partial H_2}{\partial \theta}
H_4 = x_{p,-t} - 1_n \frac{1_n^T R^{-1} x_{p,-t}}{1_n^T R^{-1} 1_n}
\frac{\partial H_4}{\partial \theta} = \frac{-1_n}{(1_n^T R^{-1} 1_n)^2}\left[\left(1_n^T \frac{\partial R^{-1}}{\partial \theta} x_{p,-t}\right)(1_n^T R^{-1} 1_n) - \left(1_n^T \frac{\partial R^{-1}}{\partial \theta} 1_n\right)(1_n^T R^{-1} x_{p,-t})\right]
\frac{\partial \sigma_f^2}{\partial \theta} = \frac{1}{n}\left[\left(\frac{\partial H_4}{\partial \theta}\right)^T R^{-1} H_4 + H_4^T \frac{\partial R^{-1}}{\partial \theta} H_4 + H_4^T R^{-1} \frac{\partial H_4}{\partial \theta}\right].
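As an illustrative numerical companion to Appendix B, the following sketch evaluates the BLUP m^(3), the variance Σ^(3), and the σ_f^2 estimator with the Gaussian correlation R_{t,t'} = exp(-θ|t - t'|^2). It uses an algebraically equivalent form of the BLUP (m = μ + r'R^{-1}(x - μ1) with μ the generalized-least-squares mean); the function and variable names are ours and this is not the authors' implementation.

```python
import numpy as np

def blup_predict(theta, t_obs, x_obs, t_new):
    """Ordinary-kriging-style BLUP from Appendix B:
    m = ((1 - r'R^{-1}1)/(1'R^{-1}1) 1' + r') R^{-1} x,
    with correlation R_{t,t'} = exp(-theta (t - t')^2)."""
    t_obs = np.asarray(t_obs, float)
    x = np.asarray(x_obs, float)
    n = len(t_obs)
    ones = np.ones(n)
    R = np.exp(-theta * (t_obs[:, None] - t_obs[None, :]) ** 2)
    r = np.exp(-theta * (t_obs - t_new) ** 2)
    Ri = np.linalg.inv(R)
    denom = ones @ Ri @ ones
    mu = (ones @ Ri @ x) / denom       # generalized least squares mean
    m = mu + r @ Ri @ (x - mu * ones)  # equivalent form of the BLUP
    # process variance estimate sigma_f^2 = C' R^{-1} C / n, C = x - mu*1
    C = x - mu * ones
    sigma_f2 = (C @ Ri @ C) / n
    var = sigma_f2 * (1 - r @ Ri @ r + (1 - ones @ Ri @ r) ** 2 / denom)
    return m, var

t_obs = [0.0, 1.0, 2.0, 4.0]
x_obs = [3.0, 3.4, 3.1, 2.5]
m, v = blup_predict(1.0, t_obs, x_obs, 3.0)
print(m, v)
```

Two properties worth checking by hand: at an observed time the predictor interpolates exactly and the variance vanishes, while away from the observed times the variance grows, which is precisely the behavior the individualized mixing weights in Section V-C are designed to account for.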