Missing Data Imputation using Optimal Transport

Boris Muzellec 1   Julie Josse 2 3 *   Claire Boyer 4   Marco Cuturi 5 1

* This research was done while JJ was a visiting researcher at Google Brain Paris. 1 CREST-ENSAE, IP Paris, Palaiseau, France. 2 XPOP, INRIA Saclay, France. 3 CMAP, UMR 7641, École Polytechnique, IP Paris, Palaiseau, France. 4 LPSM, Sorbonne Université, ENS Paris, France. 5 Google Brain, Paris, France. Correspondence to: Boris Muzellec.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

arXiv:2002.03860v3 [stat.ML] 1 Jul 2020

Abstract

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.

1. Introduction

Data collection is usually a messy process, resulting in datasets that have many missing values. This has been an issue for as long as data scientists have prepared, curated and obtained data, and is all the more inevitable given the vast amounts of data currently collected. The literature on the subject is therefore abundant (Little & Rubin, 2019; van Buuren, 2018): a recent survey indicates that there are more than 150 implementations available to handle missing data (Mayer et al., 2019). These methods differ in the objectives of their analysis (estimation of parameters and their variance, matrix completion, prediction), the nature of the variables considered (categorical, mixed, etc.), the assumptions about the data, and the missing data mechanisms.

Imputation methods, which consist in filling missing entries with plausible values, are very appealing as they allow both to get a guess for the missing entries and to perform (with care) downstream machine learning methods on the completed data. Efficient methods include, among others, methods based on low-rank assumptions (Hastie et al., 2015), iterative random forests (Stekhoven & Bühlmann, 2011) and imputation using variational autoencoders (Mattei & Frellsen, 2019; Ivanov et al., 2019). A desirable property for imputation methods is that they should preserve the joint and marginal distributions of the data. Non-parametric Bayesian strategies (Murray & Reiter, 2016) or recent approaches based on generative adversarial networks (Yoon et al., 2018) are attempts in this direction. However, they can be quite cumbersome to implement in practice.

We argue in this work that the optimal transport (OT) toolbox constitutes a natural, sound and straightforward alternative. Indeed, optimal transport provides geometrically meaningful distances to compare discrete distributions, and therefore data. Furthermore, thanks to recent computational advances grounded on regularization (Cuturi, 2013), OT-based divergences can be computed in a scalable and differentiable way (Peyré et al., 2019). These advances have allowed OT to be used successfully as a loss function in many applications, including multi-label classification (Frogner et al., 2015), inference of pathways (Schiebinger et al., 2019) and generative modeling (Arjovsky et al., 2017; Genevay et al., 2018; Salimans et al., 2018). Considering the similarities between generative modeling and missing data imputation, it is quite natural to use OT as a loss for the latter.

Contributions. This paper presents two main contributions. First, we leverage OT to define a loss function for missing value imputation. This loss function is the mathematical translation of the simple intuition that two random batches from the same dataset should follow the same distribution. Next, we provide algorithms for imputing missing values according to this loss. Two types of algorithms are presented, the first (i) being non-parametric, and the second (ii) defining a class of parametric models. The non-parametric algorithm (i) enjoys the most degrees of freedom, and can therefore output imputations which respect the global shape of the data while taking into account its local features.

The parametric algorithm (ii) is trained in a round-robin fashion similar to iterative conditional imputation techniques, as implemented for instance in the mice package (van Buuren & Groothuis-Oudshoorn, 2011). Compared to the non-parametric method, this algorithm allows out-of-sample imputation. This creates a very flexible framework which can be combined with many imputing strategies, including imputation with Multi-Layer Perceptrons. Finally, these methods are showcased in extensive experiments on a variety of datasets and for different missing value proportions and mechanisms, including the difficult case of informative missing entries. The code to reproduce these experiments is available at https://github.com/BorisMuzellec/MissingDataOT.

Notations. Let Ω = (ω_ij) ∈ {0, 1}^{n×d} be a binary mask encoding observed entries, i.e. ω_ij = 1 (resp. 0) iff the entry (i, j) is observed (resp. missing). We observe the following incomplete data matrix:

    X^{(obs)} = X ⊙ Ω + NA ⊙ (1_{n×d} − Ω),

where X^{(obs)} ∈ R^{n×d} contains the observed entries, ⊙ is the elementwise product and 1_{n×d} is an n×d matrix filled with ones. Given the data matrix X, our goal is to construct an estimate X̂ filling the missing entries of X, which can be written as

    X̂ = X^{(obs)} ⊙ Ω + X̂^{(imp)} ⊙ (1_{n×d} − Ω),

where X̂^{(imp)} ∈ R^{n×d} contains the imputed values. Let x_{i:} denote the i-th row of the data set X, such that X = (x_{i:}^T)_{1≤i≤n}. Similarly, x_{:j} denotes the j-th column (variable) of the data set X, such that X = (x_{:1} | ... | x_{:d}), and X_{:−j} denotes the dataset X in which the j-th variable has been removed. For K ⊂ {1, ..., n} a set of m indices, X_K = (x_{k:})_{k∈K} denotes the corresponding batch, and μ_m(X_K) the empirical measure associated to X_K, i.e.

    μ_m(X_K) := (1/m) Σ_{k∈K} δ_{x_{k:}}.

Finally, Δ_n := {a ∈ R_+^n : Σ_{i=1}^n a_i = 1} is the simplex in dimension n.

2. Background

Missing data. Rubin (1976) defined a widely used - yet controversial (Seaman et al., 2013) - nomenclature for missing value mechanisms. This nomenclature distinguishes between three cases: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In MCAR, the missingness is independent of the data, whereas in MAR, the probability of being missing depends only on observed values. A substantial part of the literature, with notable exceptions (Kim & Ying, 2018; Mohan & Pearl, 2019), only considers these "simple" mechanisms and struggles with the harder yet prevalent MNAR case. MNAR values lead to important biases in the data, as the probability of missingness then depends on the unobserved values. On the other hand, MCAR and MAR are "ignorable" mechanisms, in the sense that they do not make it necessary to model explicitly the distribution of missing values when maximizing the observed likelihood.

The naive workaround which consists in deleting observations with missing entries is not an alternative in high dimension. Indeed, let us assume as in Zhu et al. (2019) that X is an n×d data matrix in which each entry is missing independently with probability 0.01. When d = 5, this would result in around 95% of the individuals (rows) being retained, but for d = 300, only around 5% of rows have no missing entries. Hence, providing plausible imputations for missing values quickly becomes necessary.
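As a quick sanity check of these numbers (our own illustration, not from the paper): the probability that a row of dimension d is fully observed under this mechanism is 0.99^d.

```python
# Probability that a row with d entries has no missing value when each entry
# is missing independently with probability 0.01.
for d in (5, 300):
    print(d, round(0.99 ** d, 3))
# 5   0.951  -> about 95% of the rows are complete
# 300 0.049  -> only about 5% of the rows are complete
```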
Classical imputation methods impute according to a joint distribution which is either explicit, or implicitly defined through a set of conditional distributions. As an example, explicit joint modeling methods include imputation models that assume a Gaussian distribution for the data, whose parameters are estimated using EM algorithms (Dempster et al., 1977). Missing values are then imputed by drawing from their predictive distribution. A second instance of such joint modeling methods are imputations assuming a low-rank structure (Josse et al., 2016). The conditional modeling approach (van Buuren, 2018), also known as "sequential imputation" or "imputation using chained equations" (ice), consists in specifying one model for each variable. It predicts the missing values of each variable using the other variables as explanatory, and cycles through the variables, iterating this procedure to update the imputations until predictions stabilize.

Non-parametric methods like k-nearest neighbors imputation (Troyanskaya et al., 2001) or random forest imputation (Stekhoven & Bühlmann, 2011) have also been developed, and account for the local geometry of the data. The methods proposed herein lie at the intersection of global and local approaches, and are derived in a non-parametric and a parametric version.

Wasserstein distances, entropic regularization and Sinkhorn divergences. Let α = Σ_{i=1}^{n} a_i δ_{x_i} and β = Σ_{i=1}^{n'} b_i δ_{y_i} be two discrete distributions, described by their supports (x_i)_{i=1}^{n} ∈ R^{n×p} and (y_i)_{i=1}^{n'} ∈ R^{n'×p} and weight vectors a ∈ Δ_n and b ∈ Δ_{n'}. Optimal transport compares α and β by considering the most efficient way of transporting the masses a and b onto each other, according to a ground cost between the supports. The (2-)Wasserstein distance corresponds to the case where this ground cost is quadratic:

    W_2^2(α, β) := min_{P ∈ U(a,b)} ⟨P, M⟩,                                   (1)

where U(a, b) := {P ∈ R_+^{n×n'} : P 1_{n'} = a, P^T 1_n = b} is the set of transportation plans, and M = (‖x_i − y_j‖^2)_{ij} ∈ R^{n×n'} is the matrix of pairwise squared distances between the supports.

W_2 is not differentiable, and requires solving a costly linear program via network simplex methods (Peyré et al., 2019, §3). Entropic regularization alleviates both issues: consider

    OT_ε(α, β) := min_{P ∈ U(a,b)} ⟨P, M⟩ + ε h(P),                           (2)

where ε > 0 and h(P) := Σ_{ij} p_{ij} log p_{ij} is the negative entropy. Then, OT_ε(α, β) is differentiable and can be solved using Sinkhorn iterations (Cuturi, 2013). However, due to the entropy term, OT_ε is no longer positive. This issue is solved through debiasing, by subtracting auto-correlation terms. Let

    S_ε(α, β) := OT_ε(α, β) − (1/2)(OT_ε(α, α) + OT_ε(β, β)).                 (3)

Eq. (3) defines the Sinkhorn divergences (Genevay et al., 2018), which are positive, convex, and can be computed with little additional cost compared to entropic OT (Feydy et al., 2019). Sinkhorn divergences hence provide a differentiable and tractable proxy for Wasserstein distances, and will be used in the following.

OT gradient-based methods. Not only are the OT metrics described above good measures of distributional closeness, they are also well-adapted to gradient-based imputation methods. Indeed, let X_K, X_L be two batches drawn from X. Then, gradient updates for OT_ε(μ_m(X_K), μ_m(X_L)), ε ≥ 0, w.r.t. a point x_{k:} in X_K correspond to taking steps along the so-called barycentric transport map. Indeed, with (half) quadratic costs, it holds (Cuturi & Doucet, 2014, §4.3) that

    ∇_{x_{k:}} OT_ε(μ_m(X_K), μ_m(X_L)) = Σ_ℓ P*_{kℓ} (x_{k:} − x_{ℓ:}),

where P* is the optimal (regularized) transport plan. Therefore, a gradient-based update is of the form

    x̂_{k:} ← (1 − t) x̂_{k:} + t Σ_ℓ P*_{kℓ} x̂_{ℓ:}.                          (4)

In a missing value imputation context, Eq. (4) thus corresponds to updating values to make them closer to the target points given by transportation plans. Building on this fact, OT gradient-based imputation methods are proposed in the next section.
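To make Eqs. (2)-(3) concrete, the snippet below is a minimal, self-contained PyTorch sketch (our own, not the authors' code) of an entropic OT solver and the debiased Sinkhorn divergence between two batches with uniform weights. It runs a fixed number of log-domain iterations without convergence checks, and the entropic term is taken relative to the product measure a ⊗ b, which differs from Eq. (2) only by an additive constant that cancels in Eq. (3) when the batches have equal size; a production implementation would rather rely on a dedicated library such as geomloss.

```python
import math
import torch

def entropic_ot(x, y, eps=1e-2, n_iter=200):
    """Regularized OT cost (Eq. 2, up to a constant) between the uniform
    empirical measures supported on the rows of x (n, p) and y (m, p)."""
    M = torch.cdist(x, y, p=2) ** 2                 # pairwise squared distances
    n, m = M.shape
    log_a = torch.full((n,), -math.log(n), device=x.device)
    log_b = torch.full((m,), -math.log(m), device=x.device)
    f = torch.zeros(n, device=x.device)
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iter):                         # log-domain Sinkhorn iterations
        f = -eps * torch.logsumexp(log_b[None, :] + (g[None, :] - M) / eps, dim=1)
        g = -eps * torch.logsumexp(log_a[:, None] + (f[:, None] - M) / eps, dim=0)
    log_P = log_a[:, None] + log_b[None, :] + (f[:, None] + g[None, :] - M) / eps
    P = log_P.exp()                                 # regularized transport plan
    kl = (P * (log_P - log_a[:, None] - log_b[None, :])).sum()
    return (P * M).sum() + eps * kl

def sinkhorn_divergence(x, y, eps=1e-2):
    """Debiased Sinkhorn divergence S_eps (Eq. 3)."""
    return entropic_ot(x, y, eps) - 0.5 * (entropic_ot(x, x, eps) + entropic_ot(y, y, eps))
```

Since the iterations are unrolled, gradients with respect to the batch coordinates are obtained directly by backpropagation, which is exactly what the imputation algorithms of Section 3 require.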
3. Imputing Missing Values using OT

Let X_K and X_L be two batches respectively extracted from the complete rows and the incomplete rows of X, such that only the batch X_L contains missing values. In this one-sided setting, a good imputation of X_L's missing values is such that X_K should be close to X_L in terms of distributions. The OT-based metrics described in Section 2 provide natural criteria to capture this distributional proximity and derive imputation methods. However, as observed in Section 2, in high dimension or with a high proportion of missing values, it is unlikely or even impossible to obtain batches from X with no missing values. Nonetheless, a good imputation method should still ensure that the distributions of any two i.i.d. incomplete batches X_K and X_L, both containing missing values, are close. This implies in particular that OT metrics between the distributions μ_m(X_K) and μ_m(X_L) should have small values. This criterion, which is weaker than the one-sided criterion above but is more amenable, will be considered from now on.

Direct imputation. Algorithm 1 is a direct implementation of this criterion, aiming to impute missing values for quantitative variables by minimizing OT distances between batches. First, missing values of any variable are initialized with the mean of the observed entries plus a small amount of noise (to preserve the marginals and to facilitate the optimization). Then, batches are sequentially sampled and the Sinkhorn divergence between batches is minimized with respect to the imputed values, using gradient updates (here using RMSprop (Tieleman & Hinton, 2012)).

Algorithm 1 Batch Sinkhorn Imputation
  Input: X ∈ (R ∪ {NA})^{n×d}, Ω ∈ {0, 1}^{n×d}, α, η, ε > 0, n ≥ m > 0
  Initialization: for j = 1, ..., d,
    - for i s.t. ω_ij = 0, x̂_ij ← x̄_{:j}^{obs} + ε_ij, with ε_ij ~ N(0, η) and x̄_{:j}^{obs} the mean of the observed entries in the j-th variable (missing entries);
    - for i s.t. ω_ij = 1, x̂_ij ← x_ij (observed entries).
  for t = 1, 2, ..., t_max do
    Sample two sets K and L of m indices
    ℒ ← S_ε(μ_m(X̂_K), μ_m(X̂_L))
    X̂_{K∪L}^{(imp)} ← X̂_{K∪L}^{(imp)} − α RMSprop(∇_{X̂_{K∪L}^{(imp)}} ℒ)
  end for
  Output: X̂

OT as a loss for missing data imputation. Taking a step back, one can see that Algorithm 1 essentially uses Sinkhorn divergences between batches as a loss function to impute values, for a model in which "one parameter equals one imputed value". Formally, for a fixed batch size m, this loss is defined as

    ℒ_m(X̂) := E[ S_ε(μ_m(X̂_K), μ_m(X̂_L)) ],                                 (5)

where the expectation is taken over index sets K = {k_1 < ... < k_m} and L = {l_1 < ... < l_m} drawn uniformly from {1, ..., n}, and where the "parameters" are the imputed values X̂^{(imp)}. The same loss can be used beyond this non-parametric setting: Algorithm 2 uses it to fit a parametric imputation model with a parameter Θ such that Imputer(X, Ω, Θ) returns imputations for the missing values in X. This imputer has to be differentiable w.r.t. its parameter Θ, so that the batch Sinkhorn loss ℒ can be back-propagated through X̂ to perform gradient-based updates of Θ. Algorithm 2 does not only return the completed data matrix X̂, but also the trained parameter Θ̂, which can then be re-used to impute missing values in out-of-sample data.

Algorithm 2 Batch Sinkhorn Imputation with a Parametric Model
  Input: X ∈ (R ∪ {NA})^{n×d}, Ω ∈ {0, 1}^{n×d}, Imputer(·, ·, ·), Θ_0, ε > 0, n ≥ m > 0
  X̂ ← same initialization as in Algorithm 1
  Θ̂ ← Θ_0
  for t = 1, 2, ..., t_max do
    for k = 1, 2, ..., K do
      X̂ ← Imputer(X̂^t, Ω, Θ̂)
      Sample two sets K and L of m indices
      ℒ ← S_ε(μ_m(X̂_K), μ_m(X̂_L))
      ∇_Θ ℒ ← AutoDiff(ℒ)
      Θ̂ ← Θ̂ − α Adam(∇_Θ ℒ)
    end for
    X̂^{t+1} ← Imputer(X̂^t, Ω, Θ̂)
  end for
  Output: Completed data X̂ = X̂^{t_max}, Imputer(·, ·, Θ̂)

Round-robin imputation. A remaining unaddressed point in Algorithm 2 is how to perform the "X̂ ← Imputer(X̂^t, Ω, Θ̂)" step in the presence of missing values. A classical method is to perform imputations over variables in a round-robin fashion, i.e. to iteratively predict missing coordinates using the other coordinates as features, in a cyclical manner. The main advantage of this method is that it decouples variables being used as inputs from those being imputed. This requires having d sets of parameters (θ_j)_{1≤j≤d}, one for each variable, where each θ_j refers to the parameters used to predict the j-th variable. The j-th variable is iteratively imputed using the d − 1 remaining variables, according to the chosen model with parameter θ_j: θ̂_j is first fitted (using e.g. regression or Bayesian methods), then the j-th variable is imputed. The algorithm then moves on to the next variable j + 1, in a cyclical manner. This round-robin scheme is implemented for instance in the mice package (van Buuren & Groothuis-Oudshoorn, 2011); Algorithm 3 adapts it to the batch Sinkhorn loss.

Algorithm 3 Round-Robin Batch Sinkhorn Imputation
  Input: X ∈ R^{n×d}, Ω ∈ {0, 1}^{n×d}, Imputer(·, ·, ·), Θ_0, ε > 0, n ≥ m > 0
  X̂^0 ← same initialization as in Algorithm 1
  (θ̂_1, ..., θ̂_d) ← Θ_0
  for t = 1, 2, ..., t_max do
    for j = 1, 2, ..., d do
      for k = 1, 2, ..., K do
        X̂_{:j} ← Imputer(X̂_{:−j}, Ω_{:j}, θ̂_j)
        Sample two sets K and L of m indices
        ℒ ← S_ε(μ_m(X̂_K), μ_m(X̂_L))
        ∇_{θ_j} ℒ ← AutoDiff(ℒ)
        θ̂_j ← θ̂_j − α Adam(∇_{θ_j} ℒ)
      end for
      X̂_{:j}^t ← Imputer(X̂_{:−j}^t, Ω_{:j}, θ̂_j)
    end for
    X̂^{t+1} ← X̂^t
  end for
  Output: Imputations X̂^{t_max}, Imputer(·, ·, Θ̂)

In Algorithm 3, the data matrix X̂ is updated starting from an initial guess X̂^0. The algorithm then consists of three nested loops. (i) The inner-most loop is dedicated to gradient-based updates of the parameter θ̂_j, as illustrated in Figure 1. Once this inner-most loop is finished, the j-th variable of X̂^t is updated using the last update of θ̂_j. (ii) This is performed cyclically over all variables of X̂, yielding X̂^{t+1}. (iii) This fitting-and-imputation procedure over all variables is repeated until convergence, or until a given number of iterations is reached.
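Returning to Algorithm 1, the loop below is a condensed PyTorch sketch of the direct, non-parametric method: the imputed entries are the optimization variables, and the batch Sinkhorn loss of Eq. (5) is estimated from one pair of batches per step. It reuses the sinkhorn_divergence helper defined above; function and variable names are ours, and the authors' reference implementation in the linked repository should be preferred.

```python
import torch

def batch_sinkhorn_impute(X, mask, eps=1e-2, lr=1e-2, batch_size=128,
                          n_iter=1000, noise=0.1):
    """Sketch of Algorithm 1. X is an (n, d) float tensor with NaN at missing
    entries; mask follows the convention of Omega (1 = observed, 0 = missing).
    Assumes 2 * batch_size <= n and standardized data."""
    n, d = X.shape
    obs_mean = torch.nansum(X, dim=0) / mask.sum(dim=0)       # column means over observed entries
    # Initialize missing entries with the column mean plus small Gaussian noise
    imps = (obs_mean[None, :] + noise * torch.randn(n, d))[mask == 0]
    imps = imps.clone().requires_grad_(True)
    opt = torch.optim.RMSprop([imps], lr=lr)
    for _ in range(n_iter):
        X_hat = X.clone()
        X_hat[mask == 0] = imps                                # plug in current imputations
        idx = torch.randperm(n)
        K, L = idx[:batch_size], idx[batch_size:2 * batch_size]  # two disjoint batches
        loss = sinkhorn_divergence(X_hat[K], X_hat[L], eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
    X_hat = X.detach().clone()
    X_hat[mask == 0] = imps.detach()
    return X_hat
```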

Figure 1: Round-robin imputation: illustration of the imputation of the j-th variable in the inner-most loop of Algorithm 3.
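The fitting-and-imputation cycle depicted in Figure 1 can be sketched as follows, again in PyTorch and reusing sinkhorn_divergence from above. Here imputers is a list of d differentiable models (one per variable), playing the role of Imputer(·, ·, θ_j) in Algorithm 3; all names are ours, and several refinements used in the paper (noise at initialization, several batch pairs per step, variable ordering) are omitted.

```python
import torch

def round_robin_impute(X, mask, imputers, eps=1e-2, lr=1e-2,
                       batch_size=128, n_cycles=10, n_steps=10):
    """Sketch of Algorithm 3. imputers[j] maps the other d-1 columns to a
    prediction for column j; mask = 1 for observed entries."""
    n, d = X.shape
    col_mean = torch.nansum(X, dim=0) / mask.sum(dim=0)
    X_hat = torch.where(mask == 1, X, col_mean[None, :].expand(n, d)).clone()
    others = [torch.arange(d) != j for j in range(d)]
    for _ in range(n_cycles):                          # outer loop (iii)
        for j in range(d):                             # cycle over variables (ii)
            opt = torch.optim.Adam(imputers[j].parameters(), lr=lr)
            for _ in range(n_steps):                   # inner gradient loop (i)
                pred = imputers[j](X_hat[:, others[j]]).squeeze(-1)
                X_tmp = X_hat.clone()
                X_tmp[:, j] = torch.where(mask[:, j] == 1, X_hat[:, j], pred)
                idx = torch.randperm(n)
                K, L = idx[:batch_size], idx[batch_size:2 * batch_size]
                loss = sinkhorn_divergence(X_tmp[K], X_tmp[L], eps)
                opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():                      # refresh column j with the fitted imputer
                pred = imputers[j](X_hat[:, others[j]]).squeeze(-1)
                X_hat[:, j] = torch.where(mask[:, j] == 1, X_hat[:, j], pred)
    return X_hat
```

With linear models as imputers[j], this corresponds to the Linear RR method of Section 4; with the shallow MLPs described there, it corresponds to MLP RR.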

In practice, several improvements on the generic Algorithms 2 and 3 can be implemented:

1. To better estimate Eq. (5), one can sample several pairs of batches (instead of a single one) and define ℒ as the average of the corresponding S_ε divergences.

2. For Algorithm 3 in a MCAR setting, instead of sampling in each pair two batches from X̂, one of the two batches can be sampled with no missing values on the j-th variable, and the other with missing values on the j-th variable. This allows the imputations for the j-th variable to be fitted on actual non-missing values, and helps ensure that the imputations for the j-th variable will have a marginal distribution close to that of the non-missing values.

3. The order in which the variables are imputed can be adapted. A simple heuristic is to impute variables in increasing order of missing values.

4. During training, the loss can be hard to monitor due to the high variance induced by estimating Eq. (5) from a few pairs of batches. Therefore, it can be useful to define a validation set on which fictional additional missing values are sampled to monitor the training of the algorithm, according to the desired accuracy score (e.g. MAE, RMSE or W_2 as in Section 4).

Note that item 2 is a priori only legitimate in a MCAR setting. Indeed, under MAR or MNAR assumptions, the distribution of non-missing data is in general not equal to the original (unknown) distribution of missing data.(1) Finally, the use of Adam (Kingma & Ba, 2014) in Algorithms 2 and 3, compared to RMSprop in Algorithm 1, is motivated by empirical performance, but does not have a crucial impact. It was observed however that the quality of the imputations given by Algorithm 1 seems to decrease when gradient updates with momentum are used.

(1) Consider as an example census data in which low/high income people are more likely to fail to answer an income survey than medium income people.

4. Experimental Results

Baselines. We compare our methods to three baselines:

(i) mean is the coordinate-wise mean imputation;

(ii) ice (imputation by chained equations) consists in (iterative) imputation using conditional expectation. Here, we use scikit-learn's (Pedregosa et al., 2011) IterativeImputer method, which is based on mice (van Buuren & Groothuis-Oudshoorn, 2011). This is one of the most popular imputation methods, as it provides empirically good imputations in many scenarios and requires little tuning;

(iii) softimpute (Hastie et al., 2015) performs missing value imputation using iterative soft-thresholded SVDs. This method is based on a low-rank assumption for the data, and is justified by the fact that many large matrices are well approximated by a low-rank structure (Udell & Townsend, 2019).

Deep learning methods. Additionally, we compare our methods to three DL-based methods:

(iv) MIWAE (Mattei & Frellsen, 2019) fits a deep latent variable model (DLVM) (Kingma & Welling, 2014), by optimizing a version of the importance weighted autoencoder (IWAE) bound (Burda et al., 2016) adapted to missing data;

(v) GAIN (Yoon et al., 2018) is an adaptation of generative adversarial networks (GAN) (Goodfellow et al., 2014) to missing data imputation;

(vi) VAEAC (Ivanov et al., 2019) are VAEs with easily approximable conditionals that allow handling missing data.
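For reference, the mean and ice baselines correspond to standard scikit-learn imputers; the minimal sketch below shows the calls involved (with default hyperparameters, which may differ from the exact settings used in the experiments; the toy array is ours).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before IterativeImputer)
from sklearn.impute import IterativeImputer

# X_nan is an (n, d) array with np.nan at missing positions.
X_nan = np.array([[1.0, np.nan, 3.0],
                  [4.0, 5.0, np.nan],
                  [np.nan, 8.0, 9.0]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X_nan)   # (i) mean baseline
X_ice = IterativeImputer(max_iter=10).fit_transform(X_nan)     # (ii) ice baseline
```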

Transport methods. Three variants of the proposed methods are evaluated:

(vii) Sinkhorn designates the direct non-parametric imputation method detailed in Algorithm 1.

For Algorithm 3, two classes of imputers are considered:

(viii) Linear RR corresponds to Algorithm 3 where, for 1 ≤ j ≤ d, Imputer(·, θ_j) is a linear model w.r.t. the d − 1 other variables, with weights and biases given by θ_j. This is similar to mice or IterativeImputer, but fitted with the OT loss of Eq. (5);

(ix) MLP RR denotes Algorithm 3 with shallow Multi-Layer Perceptrons (MLP) as imputers. These MLPs have the following architecture: (i) a first (d − 1) × 2(d − 1) layer followed by a ReLU layer, then (ii) a 2(d − 1) × (d − 1) layer followed by a ReLU layer, and finally (iii) a (d − 1) × 1 linear layer. All linear layers have bias terms. Each Imputer(·, θ_j), 1 ≤ j ≤ d, is one such MLP with a different set of weights θ_j.

Toy experiments. In Figure 2, we generate two-dimensional datasets with strong structures, such as an S-shape, half-moon(s), or concentric circles. A 20% missing rate is introduced (void rows are discarded), and imputations performed using Algorithm 1 or the ice method are compared to the ground truth dataset. While the ice method is not able to catch the non-linear structure of the distributions at all, Sinkhorn performs efficiently by imputing faithfully to the underlying complex data structure (despite the two half-moons and the S-shape being quite challenging). This is remarkable, since Algorithm 1 does not rely on any parametric assumption for the data. This underlines, in a low-dimensional setting, the flexibility of the proposed method. Finally, note that the trailing points which can be observed for the S-shape or the two-moons shape come from the fact that Algorithm 1 was used as-is, i.e. with pairs of batches both containing missing values, even though these toy examples would have allowed using batches without missing values. In that case, we obtain imputations that are visually indistinguishable from the ground truth.

Figure 2: Toy examples: 20% missing values (MCAR) on toy datasets. Blue points have no missing values, orange points have one missing value on either coordinate. ice outputs conditional expectation imputations, which are irrelevant due to the high non-linearity of these examples. Since Algorithm 1 does not assume a parametric form for the imputations, it is able to satisfyingly impute missing values. (Panels: ground truth, ice imputation, Sinkhorn imputation.)

Large-scale experimental setup. We evaluate each method on 23 datasets from the UCI machine learning repository(2) (see Table 1) with varying proportions of missing data and different missing data mechanisms. These datasets only contain quantitative features. Prior to running the experiments, the data is whitened (i.e. centered and scaled to variable-wise unit variance). For each dataset, all methods are evaluated on 30 different draws of missing value masks. For all Sinkhorn-based imputation methods, the regularization parameter ε is set to 5% of the median distance between initialization values, with no further dataset-dependent tuning. If the dataset has more than 256 points, the batch size is fixed to 128, otherwise to ⌊n/2⌋ where n is the size of the dataset. The noise parameter η in Algorithm 1 is fixed to 0.1. For Sinkhorn round-robin models (Linear RR and MLP RR), the maximum number of cycles is 10, 10 pairs of batches are sampled per gradient update, and an ℓ2 weight regularization of magnitude 10^{-5} is applied during training. For all 3 Sinkhorn-based methods, we use gradient methods with adaptive step sizes as per Algorithms 1 and 3, with an initial step size fixed to 10^{-2}. For softimpute, the hyperparameter is selected at each run through cross-validation on a small grid; this CV is performed by sampling additional missing values. For DL-based methods, the implementations provided in open access by the authors were used,(3)(4)(5) with the hyperparameter settings recommended in the corresponding papers. In particular, for GAIN the α parameter is selected using cross-validation. GPUs are used for Sinkhorn and deep learning methods. The code to reproduce the experiments is available at https://github.com/BorisMuzellec/MissingDataOT.

(2) https://archive.ics.uci.edu/ml/index.php
(3) https://github.com/pamattei/miwae
(4) https://github.com/jsyoon0823/GAIN
(5) https://github.com/tigvarts/vaeac
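For concreteness, the per-variable MLP imputer described in (ix) can be written in a few lines of PyTorch; this is our own rendering of the architecture stated above (the helper name is ours).

```python
import torch.nn as nn

def mlp_imputer(d):
    """One per-variable imputer for MLP RR: (d-1) -> 2(d-1) -> (d-1) -> 1,
    with ReLU activations and bias terms on every linear layer."""
    h = d - 1
    return nn.Sequential(
        nn.Linear(h, 2 * h), nn.ReLU(),
        nn.Linear(2 * h, h), nn.ReLU(),
        nn.Linear(h, 1),
    )

# One such network per variable, each with its own weights theta_j:
# imputers = [mlp_imputer(d) for _ in range(d)]
```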

Figure 3: (30% MCAR) Imputation methods on 23 datasets from the UCI repository (Table 1). Sinkhorn denotes Algorithm 1, and Linear RR, MLP RR the two instances of Algorithm 3 described previously. 30% of the values are missing MCAR. All methods are evaluated on 30 random missing value draws. Error bars correspond to ±1 std. For readability we display the scaled mean W_2^2, i.e. for each dataset we renormalize the results by the maximum W_2^2. For some datasets, W_2 results are not displayed due to their large size, which makes evaluating the unregularized W_2 distance costly.

Missing value generation mechanisms. The implementation of a MCAR mechanism is straightforward. On the contrary, many different mechanisms can lead to a MAR or MNAR setting; we here describe those used in our experiments. In the MCAR setting, each value is masked according to the realization of a Bernoulli random variable with a fixed parameter. In the MAR setting, for each experiment, a fixed subset of variables that cannot have missing values is sampled. Then, the remaining variables have missing values according to a logistic model with random weights, which takes the non-missing variables as inputs. A bias term is fitted using line search to attain the desired proportion of missing values. Finally, two different mechanisms are implemented in the MNAR setting. The first is identical to the previously described MAR mechanism, but the inputs of the logistic model are then masked by a MCAR mechanism: hence, the logistic model's outcome now depends on potentially missing values. The second mechanism, "self-masked", samples a subset of variables whose values in the lower and upper p-th percentiles are masked according to a Bernoulli random variable, while the values in-between are not missing. As detailed in the appendix, MCAR experiments were performed with 10%, 30% and 50% missing rates, while MAR and both MNAR settings (quantile and logistic masking) were evaluated with a 30% missing rate.
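The two simplest mechanisms above (MCAR Bernoulli masking and MAR logistic masking) can be sketched as follows in NumPy. This is our own illustration: it follows the paper's convention Ω = 1 for observed entries, but the intercept that calibrates the exact missing proportion (fitted by line search in the paper) is omitted, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcar_omega(n, d, p_miss=0.3):
    # MCAR: each entry is missing independently with probability p_miss
    return (rng.random((n, d)) >= p_miss).astype(float)

def mar_omega(X, p_obs=0.3):
    """MAR sketch: a random subset of always-observed columns feeds a logistic
    model (with random weights) that drives missingness in the other columns."""
    n, d = X.shape
    obs_cols = rng.choice(d, size=max(1, int(p_obs * d)), replace=False)
    W = rng.normal(size=(len(obs_cols), d))
    p_missing = 1.0 / (1.0 + np.exp(-(X[:, obs_cols] @ W)))   # no intercept calibration here
    omega = (rng.random((n, d)) >= p_missing).astype(float)
    omega[:, obs_cols] = 1.0                                   # these columns stay fully observed
    return omega
```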

Figure 4: (30% MNAR) Imputation methods on 23 datasets from the UCI repository (Table 1). Values are missing MNAR according to the logistic mechanism described in Section 4, with 30% of the variables used as inputs of a logistic masking model for the 70% remaining variables. 30% of those input variables are then masked at random. Hence, all variables have 30% missing values. All methods are evaluated on the same 30 random missing value draws. Error bars correspond to ±1 std. For readability we display the scaled mean W_2^2, i.e. for each dataset we renormalize the results by the maximum W_2^2. For some datasets, W_2 results are not displayed due to their large size, which makes evaluating the unregularized W_2 distance costly.

Metrics. Imputation methods are evaluated according to two "pointwise" metrics, mean absolute error (MAE) and root mean square error (RMSE), and one metric on distributions, the squared Wasserstein distance between empirical distributions on points with missing values. Let X ∈ R^{n×d} be a dataset with missing values. When (i, j) spots a missing entry, recall that x̂_ij denotes the corresponding imputation, and let us write x^{true}_ij for the ground truth. Let m_0 := #{(i, j) : ω_ij = 0} and m_1 := #{i : ∃j, ω_ij = 0} respectively denote the total number of missing values and the number of data points with at least one missing value, and set M_1 := {i : ∃j, ω_ij = 0}. We define the MAE, RMSE and W_2 imputation metrics as

    MAE  := (1/m_0) Σ_{(i,j): ω_ij = 0} |x^{true}_ij − x̂_ij|,                    (MAE)
    RMSE := sqrt( (1/m_0) Σ_{(i,j): ω_ij = 0} (x^{true}_ij − x̂_ij)^2 ),           (RMSE)
    W_2  := W_2^2( μ_{m_1}(X̂_{M_1}), μ_{m_1}(X^{(true)}_{M_1}) ).                 (W2)

Results. The complete results of the experiments are reported in the Appendix. In Figures 3 and 4, the proposed methods are respectively compared to the baselines and to the Deep Learning (DL) methods, in a MCAR and a logistic-masking MNAR setting with 30% missing data. As can be seen from Figure 3, the linear round-robin model matches or out-performs scikit-learn's iterative imputer (ice) on MAE and RMSE scores for most datasets. Since both methods are based on the same cyclical linear imputation model but with different loss functions, this shows that the batched Sinkhorn loss in Eq. (5) is well-adapted to imputation with parametric models. Comparison with DL methods (Figure 4) shows that the proposed OT-based methods consistently outperform DL-based methods, and have the additional benefit of a lower variance in their results overall. Interestingly, while the MAE and RMSE scores of the round-robin MLP model are comparable to those of the linear RR, its W_2 scores are generally better. This suggests that more powerful base imputer models lead to better W_2 scores, from which one can conclude that Eq. (5) is a good proxy for optimizing the unavailable Eq. (1) score, and that Algorithm 3 is efficient at doing so. Furthermore, one can observe that the direct imputation method is very competitive over all data and metrics and is in general the best-performing OT-based method, as could be expected from the fact that its imputation model is not restricted by a parametric assumption. This favorable behaviour tends to be exacerbated with a growing proportion of missing data, see Figure 9 in the appendix.
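The pointwise scores translate directly into code; the sketch below (our own, assuming the ground truth is available as in the benchmark) computes MAE and RMSE over imputed entries only, and indicates how the W_2 score could be obtained with an exact OT solver such as the POT package.

```python
import numpy as np

def pointwise_scores(X_true, X_imp, omega):
    """MAE and RMSE over imputed entries only (omega = 1 for observed entries)."""
    miss = omega == 0
    err = X_imp[miss] - X_true[miss]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return mae, rmse

# The W2 score compares rows containing at least one missing value with their
# ground truth, e.g. with the POT package (assumed installed):
# import ot
# rows = (omega == 0).any(axis=1)
# M = ot.dist(X_imp[rows], X_true[rows])            # squared Euclidean costs by default
# w2 = ot.emd2(ot.unif(rows.sum()), ot.unif(rows.sum()), M)
```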

Table 1: Summary of datasets

dataset                      n      d
airfoil self noise           1503   5
blood transfusion            748    4
breast cancer diagnostic     569    30
california                   20640  8
climate model crashes        540    18
concrete compression         1030   7
concrete slump               103    7
connectionist bench sonar    208    60
connectionist bench vowel    990    10
ecoli                        336    7
glass                        214    9
ionosphere                   351    34
iris                         150    4
libras                       360    90
parkinsons                   195    23
planning relax               182    12
qsar biodegradation          1055   41
seeds                        210    7
wine                         178    13
wine quality red             1599   10
wine quality white           4898   11
yacht hydrodynamics          308    6
yeast                        1484   8

MAR and MNAR. Figure 4 above and Figures 10 to 12 in the appendix display the results of our experiments in MAR and MNAR settings, and show that the proposed methods perform well and are robust to difficult missingness mechanisms. This is remarkable, as the proposed methods do not attempt to model those mechanisms. Finally, note that the few datasets on which the proposed methods do not perform as well as the baselines - namely libras and, to a smaller extent, planning relax - remain consistently the same across all missingness mechanisms and missing rates. This suggests that this behavior is due to the particular structure of those datasets, rather than to the missingness mechanisms themselves.

Out-of-sample imputation. As mentioned in Section 3, a key benefit of fitting a parametric imputing model with Algorithms 2 and 3 is that the resulting model can then be used to impute missing values in out-of-sample (OOS) data. In Figure 5, we evaluate the Linear RR and MLP RR models in an OOS imputation experiment. We compare the training and OOS MAE, RMSE and OT scores on a collection of datasets selected to have a sufficient number of points. At each run, we randomly sample 70% of the data to be used for training, and the remaining 30% to evaluate OOS imputation. 30% of the values are missing MCAR, uniformly over training and testing sets. Out of the methods presented earlier, we keep those that allow OOS imputation: for the ice, Linear RR and MLP RR methods, OOS imputation is simply performed using the round-robin scheme without further fitting of the parameters on the new data. For the mean baseline, missing values in the testing data are imputed using the mean observed values from the training data. Figure 5 confirms that the good performance of Linear RR and MLP RR is stable at testing time.

Figure 5: (OOS) Out-of-sample imputation: 70% of the data is used for training (filled bars) and 30% for testing with fixed parameters (dotted bars). 30% of the values are missing MCAR across both training and testing sets. (Panels: mean MAE, mean RMSE and mean W_2^2 (renormalized) for the Mean, ICE, Linear RR and MLP RR methods; datasets include airfoil, blood, california, compression, ecoli, vowel, wine (red), wine (white) and yeast.)
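Continuing the round-robin sketch from Section 3, out-of-sample imputation with fitted imputers amounts to running the refresh cycle with frozen parameters; a minimal sketch follows (names and the initialization choice are ours).

```python
import torch

def impute_out_of_sample(X_new, mask_new, imputers, n_cycles=3):
    """Out-of-sample imputation with already-fitted round-robin imputers:
    parameters are frozen, only the round-robin refresh is run."""
    n, d = X_new.shape
    col_mean = torch.nansum(X_new, dim=0) / mask_new.sum(dim=0)
    X_hat = torch.where(mask_new == 1, X_new, col_mean[None, :].expand(n, d)).clone()
    with torch.no_grad():
        for _ in range(n_cycles):
            for j in range(d):
                others = torch.arange(d) != j
                pred = imputers[j](X_hat[:, others]).squeeze(-1)
                X_hat[:, j] = torch.where(mask_new[:, j] == 1, X_hat[:, j], pred)
    return X_hat
```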
Conclusion

We have shown in this paper how OT metrics can be used to define a relevant loss for missing data imputation. This loss corresponds to the expectation of Sinkhorn divergences between randomly sampled batches. To minimize it, two classes of algorithms were proposed: one that freely estimates one parameter per imputed value, and one that fits a parametric model. The former class does not rely on making parametric assumptions on the underlying data distribution, and can be used in a very wide range of settings. The latter class, on the other hand, allows out-of-sample imputation after training. To make parametric models trainable, the classical round-robin mechanism was used. Experiments on a variety of datasets, and for numerous missing value settings (MCAR, MAR and MNAR with varying missing value proportions), showed that the proposed models are very competitive, even compared to recent methods based on deep learning. These results confirmed that our loss is a good optimizable proxy for imputation metrics. Future work includes further theoretical study of our loss function Eq. (5) within the OT framework.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 214–223, 2017.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 2292–2300, 2013.

Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, pp. 685–693, 2014.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

Fatras, K., Zine, Y., Flamary, R., Gribonval, R., and Courty, N. Learning with minibatch Wasserstein: asymptotic and gradient properties. CoRR, abs/1910.04091, 2019.

Feydy, J., Séjourné, T., Vialard, F., Amari, S., Trouvé, A., and Peyré, G. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 2681–2690, 2019.

Frogner, C., Zhang, C., Mobahi, H., Araya-Polo, M., and Poggio, T. Learning with a Wasserstein loss. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 2053–2061, 2015.

Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp. 1608–1617, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Hastie, T., Mazumder, R., Lee, J. D., and Zadeh, R. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(1):3367–3402, 2015.

Ivanov, O., Figurnov, M., and Vetrov, D. Variational autoencoder with arbitrary conditioning. In International Conference on Learning Representations, 2019.

Josse, J., Husson, F., et al. missMDA: a package for handling missing values in multivariate data analysis. Journal of Statistical Software, 70(1):1–31, 2016.

Kim, J. and Ying, Z. Data Missing Not at Random (special issue). Statistica Sinica, Institute of Statistical Science, Academia Sinica, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Little, R. J. A. and Rubin, D. B. Statistical Analysis with Missing Data. John Wiley & Sons, 2019.

Mattei, P.-A. and Frellsen, J. MIWAE: Deep generative modelling and imputation of incomplete data sets. In Proceedings of the 36th International Conference on Machine Learning, pp. 4413–4423, 2019.

Mayer, I., Josse, J., Tierney, N., and Vialaneix, N. R-miss-tastic: a unified platform for missing values methods and workflows. arXiv preprint arXiv:1908.04822, 2019.

Mohan, K. and Pearl, J. Graphical models for processing missing data. Journal of the American Statistical Association (JASA), 2019.

Murray, J. S. and Reiter, J. P. Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association, 111(516):1466–1479, 2016.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Peyré, G., Cuturi, M., et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Rubin, D. B. Inference and missing data. Biometrika, 63(3):581–592, 1976.

Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs using optimal transport. arXiv preprint arXiv:1803.05573, 2018.

Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 2019.

Seaman, S., Galati, J., Jackson, D., and Carlin, J. What is meant by "missing at random"? Statistical Science, pp. 257–268, 2013.

Stekhoven, D. J. and Bühlmann, P. MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2011. doi: 10.1093/bioinformatics/btr597.

Tieleman, T. and Hinton, G. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.

Udell, M. and Townsend, A. Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160, 2019.

van Buuren, S. Flexible Imputation of Missing Data. Chapman and Hall/CRC, Boca Raton, FL, 2018.

van Buuren, S. and Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67, 2011.

Yoon, J., Jordon, J., and van der Schaar, M. GAIN: Missing data imputation using generative adversarial nets. In Proceedings of the 35th International Conference on Machine Learning, pp. 5689–5698, 2018.

Zhu, Z., Wang, T., and Samworth, R. J. High-dimensional principal component analysis with heterogeneous missingness. arXiv preprint arXiv:1906.12125, 2019.

A. Appendix

This appendix contains a full account of our experimental results. These results correspond to the missing value mechanisms described in Section 4:

1. 10% MCAR (Figure 7), 30% MCAR (Figure 8) and 50% MCAR (Figure 9);

2. 30% MAR on 70% of the variables with a logistic masking model (Figure 10);

3. 30% MNAR generated with a logistic masking model, whose inputs are then themselves masked (Figure 11);

4. 30% MNAR on 30% of the variables, generated by censoring upper and lower quartiles (Figure 12).

These experiments follow the setup described in Section 4. In all the following figures, error bars correspond to ±1 standard deviation across the 30 runs performed on each dataset. For some datasets, the W_2 score is not represented: this is due to their large size, which makes computing unregularized OT computationally intensive. The results show that the proposed methods, Algorithm 1 and Algorithm 3 with linear and shallow MLP imputers, are very competitive compared to state-of-the-art methods, including those based on deep learning (Mattei & Frellsen, 2019; Yoon et al., 2018; Ivanov et al., 2019), in a wide range of missing data regimes.

Runtimes. Figure 6 reports the average runtimes of the methods evaluated in Figure 11. These runtimes show that Algorithm 1 has computational running times on par with VAEAC, and faster than the two remaining DL-based methods (GAIN and MIWAE). Round-robin methods are the slowest overall, but the base imputer model being used seems to have nearly no impact on runtimes. This is because the computational bottleneck of the proposed methods is the number of Sinkhorn batch divergences that are computed. This number can be made lower by e.g. reducing the number of gradient steps performed for each variable (parameter K in Algorithm 3), or the number of cycles t_max. This fact suggests that more complex models could be used in round-robin imputation without much additional computational cost.

Figure 6: Average runtimes (in seconds, over 30 runs and 23 datasets) for the experiment described in Figure 11. Note that these times are indicative, as runs were randomly assigned to different GPU models, which may have an impact on runtimes.

Figure 7: (10% MCAR)

Figure 8: (30% MCAR)

Figure 9: (50% MCAR)

Figure 10: (30% MAR)

Figure 11: (30% MNAR, logistic masking)

Figure 12: (30% MNAR, quantile masking)