Missing Data Imputation Using Optimal Transport
Boris Muzellec (CREST-ENSAE, IP Paris, Palaiseau, France), Julie Josse (XPOP, INRIA Saclay, France; CMAP, UMR 7641, École Polytechnique, IP Paris, Palaiseau, France), Claire Boyer (LPSM, Sorbonne Université, ENS Paris, France), Marco Cuturi (Google Brain, Paris, France). Correspondence to: Boris Muzellec <[email protected]>. This research was done while Julie Josse was a visiting researcher at Google Brain Paris.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.

Abstract

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function for imputing missing data values. We propose practical methods to minimize these losses using end-to-end learning, which may or may not exploit parametric assumptions on the underlying distribution of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.

1. Introduction

Data collection is usually a messy process, resulting in datasets with many missing values. This has been an issue for as long as data scientists have prepared, curated and obtained data, and is all the more inevitable given the vast amounts of data currently collected. The literature on the subject is therefore abundant (Little & Rubin, 2019; van Buuren, 2018): a recent survey indicates that more than 150 implementations are available to handle missing data (Mayer et al., 2019). These methods differ in the objectives of their analysis (estimation of parameters and their variance, matrix completion, prediction), the nature of the variables considered (categorical, mixed, etc.), the assumptions about the data, and the missing data mechanisms. Imputation methods, which consist in filling missing entries with plausible values, are very appealing, as they both provide a guess for the missing entries and make it possible to run downstream machine learning methods (with care) on the completed data. Efficient methods include, among others, methods based on low-rank assumptions (Hastie et al., 2015), iterative random forests (Stekhoven & Bühlmann, 2011) and imputation using variational autoencoders (Mattei & Frellsen, 2019; Ivanov et al., 2019). A desirable property for imputation methods is that they should preserve the joint and marginal distributions of the data. Non-parametric Bayesian strategies (Murray & Reiter, 2016) and recent approaches based on generative adversarial networks (Yoon et al., 2018) are attempts in this direction, but they can be quite cumbersome to implement in practice.

We argue in this work that the optimal transport (OT) toolbox constitutes a natural, sound and straightforward alternative. Indeed, optimal transport provides geometrically meaningful distances with which to compare discrete distributions, and therefore data. Furthermore, thanks to recent computational advances grounded in regularization (Cuturi, 2013), OT-based divergences can be computed in a scalable and differentiable way (Peyré et al., 2019). These advances have made it possible to use OT successfully as a loss function in many applications, including multi-label classification (Frogner et al., 2015), inference of pathways (Schiebinger et al., 2019) and generative modeling (Arjovsky et al., 2017; Genevay et al., 2018; Salimans et al., 2018). Given the similarities between generative modeling and missing data imputation, it is quite natural to use OT as a loss for the latter as well.

Contributions. This paper presents two main contributions. First, we leverage OT to define a loss function for missing value imputation. This loss function is the mathematical translation of the simple intuition that two random batches from the same dataset should follow the same distribution. Next, we provide algorithms for imputing missing values according to this loss. Two types of algorithms are presented: the first (i) is non-parametric, and the second (ii) defines a class of parametric models. The non-parametric algorithm (i) enjoys the most degrees of freedom, and can therefore output imputations that respect the global shape of the data while taking its local features into account. The parametric algorithm (ii) is trained in a round-robin fashion similar to iterative conditional imputation techniques, as implemented for instance in the mice package (van Buuren & Groothuis-Oudshoorn, 2011). Compared to the non-parametric method, this algorithm makes out-of-sample imputation possible. This creates a very flexible framework that can be combined with many imputation strategies, including imputation with multi-layer perceptrons. Finally, these methods are showcased in extensive experiments on a variety of datasets, for different proportions and mechanisms of missing values, including the difficult case of informative missing entries. The code to reproduce these experiments is available at https://github.com/BorisMuzellec/MissingDataOT.
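To make the batch-comparison idea concrete, the sketch below treats the missing entries as free parameters, repeatedly draws two random batches from the partially completed dataset, and updates the parameters by gradient descent on a Sinkhorn divergence between the two batches. This is only an illustration of the non-parametric approach (i), not the authors' exact algorithm: it assumes the geomloss package for the Sinkhorn divergence, and the optimizer, initialization and hyper-parameters are arbitrary choices.

```python
# Illustrative sketch of gradient-based, batch-wise OT imputation.
# Assumes PyTorch and geomloss are installed; hyper-parameters are placeholders.
import torch
from geomloss import SamplesLoss  # entropy-regularized Sinkhorn divergence

def ot_impute(X, mask, n_iter=2000, batch_size=128, lr=1e-2, blur=0.05):
    """X: (n, d) float tensor; entries where mask == 0 are ignored.
    mask: (n, d) tensor with 1 for observed entries and 0 for missing ones."""
    n, d = X.shape
    obs = mask.bool()
    X0 = torch.where(obs, X, torch.zeros_like(X))    # zero out missing entries
    col_means = X0.sum(0) / obs.sum(0).clamp(min=1)  # per-column observed means
    # One trainable scalar per missing entry, initialized at the column means
    imps = col_means.repeat(n, 1)[~obs].clone().requires_grad_(True)
    sinkhorn = SamplesLoss("sinkhorn", p=2, blur=blur)
    opt = torch.optim.Adam([imps], lr=lr)
    for _ in range(n_iter):
        X_filled = X0.clone()
        X_filled[~obs] = imps                        # plug current imputations in
        # Two random batches drawn from the same (completed) dataset
        idx1 = torch.randperm(n)[:batch_size]
        idx2 = torch.randperm(n)[:batch_size]
        loss = sinkhorn(X_filled[idx1], X_filled[idx2])
        opt.zero_grad()
        loss.backward()
        opt.step()
    X_filled = X0.clone()
    X_filled[~obs] = imps.detach()
    return X_filled
```

Resampling a fresh pair of batches at every step is what turns the criterion "two random batches should share the same distribution" into a stochastic training signal for the imputed values.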
Notations. Let $\Omega = (\omega_{ij})_{ij} \in \{0, 1\}^{n \times d}$ be a binary mask encoding the observed entries, i.e. $\omega_{ij} = 1$ (resp. $0$) iff the entry $(i, j)$ is observed (resp. missing). We observe the following incomplete data matrix:

$X^{(\mathrm{obs})} = X \odot \Omega + \mathrm{NA} \odot (\mathbf{1}_{n \times d} - \Omega),$

where $X^{(\mathrm{obs})} \in \mathbb{R}^{n \times d}$ contains the observed entries, $\odot$ is the elementwise product and $\mathbf{1}_{n \times d}$ is an $n \times d$ matrix filled with ones. Given the data matrix $X$, our goal is to construct an estimate $\hat{X}$ filling in the missing entries of $X$, which can be written as

$\hat{X} = X^{(\mathrm{obs})} \odot \Omega + \hat{X}^{(\mathrm{imp})} \odot (\mathbf{1}_{n \times d} - \Omega),$

where $\hat{X}^{(\mathrm{imp})} \in \mathbb{R}^{n \times d}$ contains the imputed values. Let $x_{i:}$ denote the $i$-th row of the dataset $X$, such that $X = (x_{i:}^{T})_{1 \le i \le n}$. Similarly, $x_{:j}$ denotes the $j$-th column (variable) of $X$, such that $X = (x_{:1} \,|\, \ldots \,|\, x_{:d})$, and $X_{:-j}$ denotes the dataset $X$ in which the $j$-th variable has been removed. For $K \subset \{1, \ldots, n\}$ a set of $m$ indices, $X_K = (x_{k:})_{k \in K}$ denotes the corresponding batch, and $\mu_m(X_K)$ the empirical measure associated with $X_K$, i.e.

$\mu_m(X_K) := \frac{1}{m} \sum_{k \in K} \delta_{x_{k:}}.$

Finally, $\Delta_n \stackrel{\mathrm{def}}{=} \{ a \in \mathbb{R}_+^n : \sum_{i=1}^n a_i = 1 \}$ is the simplex in dimension $n$.
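As a concrete reading of these notations, the following minimal NumPy sketch (the toy data, missingness rate and mean-imputation placeholder are purely illustrative) builds a mask $\Omega$, the observed matrix $X^{(\mathrm{obs})}$ with NA entries, a completed matrix $\hat{X}$, and a batch $X_K$ with its uniform empirical weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))                       # complete (unobserved) data
Omega = (rng.random((n, d)) > 0.3).astype(float)  # 1 = observed, 0 = missing

# X_obs = X ⊙ Ω + NA ⊙ (1_{n×d} − Ω): keep observed entries, mark the rest as NaN
X_obs = np.where(Omega == 1, X, np.nan)

# X_hat = X_obs ⊙ Ω + X_imp ⊙ (1_{n×d} − Ω): here X_imp is a simple
# column-mean imputation, standing in for the OT-based imputations
col_means = np.nanmean(X_obs, axis=0)
X_hat = np.where(Omega == 1, X_obs, col_means)

# A batch X_K of m = 3 rows and the uniform weights of mu_m(X_K)
K = rng.choice(n, size=3, replace=False)
X_K = X_hat[K]                                    # support of the empirical measure
weights = np.full(len(K), 1.0 / len(K))           # weight 1/m on each row
```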
2. Background

Missing data. Rubin (1976) defined a widely used, yet controversial (Seaman et al., 2013), nomenclature for missing value mechanisms. This nomenclature distinguishes between three cases: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In MCAR, the missingness is independent of the data, whereas in MAR the probability of being missing depends only on the observed values. MNAR values lead to important biases in the data, as the probability of missingness then depends on the unobserved values. On the other hand, MCAR and MAR are "ignorable" mechanisms, in the sense that they do not require the distribution of the missing values to be modeled explicitly when maximizing the observed likelihood.

The naive workaround that consists in deleting observations with missing entries is not an option in high dimension. Indeed, assume as in Zhu et al. (2019) that $X$ is an $n \times d$ data matrix in which each entry is missing independently with probability $0.01$. Since a row is then fully observed with probability $0.99^d$, around 95% of the individuals (rows) are retained when $d = 5$, but for $d = 300$ only around 5% of the rows have no missing entries. Hence, providing plausible imputations for missing values quickly becomes necessary. Classical imputation methods impute according to a joint distribution which is either explicit, or implicitly defined through a set of conditional distributions. As an example, explicit joint modeling methods include imputation models that assume a Gaussian distribution for the data, whose parameters are estimated using EM algorithms (Dempster et al., 1977). Missing values are then imputed by drawing from their predictive distribution. A second instance of such joint modeling methods is imputation assuming a low-rank structure (Josse et al., 2016). The conditional modeling approach (van Buuren, 2018), also known as "sequential imputation" or "imputation using chained equations" (ice), consists in specifying one model for each variable. It predicts the missing values of each variable using the other variables as explanatory variables, and cycles through the variables, iterating this procedure to update the imputations until the predictions stabilize.

Non-parametric methods such as k-nearest neighbors imputation (Troyanskaya et al., 2001) or random forest imputation (Stekhoven & Bühlmann, 2011) have also been developed, and account for the local geometry of the data. The methods proposed here lie at the intersection of global and local approaches, and are derived in a non-parametric and a parametric version.

Wasserstein distances, entropic regularization and Sinkhorn divergences. Let $\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\beta = \sum_{i=1}^{n'} b_i \delta_{y_i}$ be two discrete distributions, described by their supports $(x_i)_{i=1}^{n} \in \mathbb{R}^{n \times p}$ and $(y_i)_{i=1}^{n'} \in \mathbb{R}^{n' \times p}$ and their weight vectors $a \in \Delta_n$ and $b \in \Delta_{n'}$. Optimal transport compares $\alpha$ and $\beta$ by considering the most efficient way of transporting the masses $a$ and $b$ onto each other, according to a ground cost between the supports. The (2-)Wasserstein distance corresponds to the case where this ground cost is quadratic:

$W_2^2(\alpha, \beta) \stackrel{\mathrm{def}}{=} \min_{P \in U(a, b)} \langle P, M \rangle, \qquad (1)$

where $M = (\|x_i - y_j\|_2^2)_{ij}$ is the matrix of pairwise squared Euclidean distances between the supports, and $U(a, b)$ is the set of transport plans, i.e. non-negative $n \times n'$ matrices whose rows sum to $a$ and whose columns sum to $b$.
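As a quick numerical illustration of eq. (1), the sketch below computes the exact squared Wasserstein distance between two small empirical measures, together with its entropy-regularized (Sinkhorn) counterpart (Cuturi, 2013). It assumes the POT library (`pip install pot`); the toy data and regularization strength are arbitrary.

```python
# Minimal illustration of eq. (1) with the POT library (assumed installed).
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)
n, n_prime, p = 64, 80, 3
x = rng.normal(size=(n, p))                    # support of alpha
y = rng.normal(loc=0.5, size=(n_prime, p))     # support of beta
a = np.full(n, 1.0 / n)                        # uniform weights in the simplex
b = np.full(n_prime, 1.0 / n_prime)

M = ot.dist(x, y)                              # pairwise squared Euclidean costs
w2_exact = ot.emd2(a, b, M)                    # exact W_2^2: min over transport plans
w2_entropic = ot.sinkhorn2(a, b, M, reg=0.1)   # entropy-regularized OT cost
print(w2_exact, w2_entropic)
```

The entropic variant is the one that scales and differentiates well; in the imputation setting described above, two batches of the completed dataset play the roles of $\alpha$ and $\beta$, with uniform weights.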