Optimal Recovery of Missing Values for Non-Negative Matrix Factorization: A Probabilistic Error Bound


Rebecca Chen (1), Lav R. Varshney (1, 2)

(1) Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. (2) Salesforce Research, Palo Alto, CA, USA. Correspondence to: Lav R. Varshney <[email protected]>.

This work was supported by Air Force STTR Grant FA8650-16-M-1819 and grant number 2018-182794 from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation. Presented at the first Workshop on the Art of Learning with Missing Values (Artemiss) hosted by the 37th International Conference on Machine Learning (ICML). Copyright 2020 by the author(s).

Abstract

Missing values imputation is often evaluated on some similarity measure between actual and imputed data. However, it may be more meaningful to evaluate downstream algorithm performance after imputation than the imputation itself. We describe a straightforward unsupervised imputation algorithm, a minimax approach based on optimal recovery, and derive probabilistic error bounds on downstream non-negative matrix factorization (NMF). We also comment on fair imputation.

1. Introduction

The performance of missing values imputation is typically measured by how similar the imputed data is to the original data, but Tuikkala et al. (2008) argue that it is more meaningful to measure the accuracy of downstream analysis (e.g. clustering accuracy) after imputation. Several groups have evaluated the effects of different imputation methods on clustering and classification accuracy (Chiu et al., 2013; de Souto et al., 2015). In earlier work (Chen & Varshney, 2019), we extended this to quantify non-negative matrix factorization (NMF) accuracy after imputation and introduced a new imputation technique. In this paper, we find a probabilistic bound for NMF error.

NMF is a popular clustering algorithm used by scientists because of its identifiability, or model uniqueness, and its interpretability, with the resulting latent factors being intuitive (Li & Ngom, 2013; Qi et al., 2009). NMF has been used for speech and audio separation, community detection, and topic modeling. In biology, NMF of gene count matrices can discover cell groups and lower-dimensional manifolds (latent factors) describing gene count ratios for different cell types. However, due to channel noise, incomplete survey data, or biological limitations, data matrices are usually incomplete, and matrix imputation is often needed before further analysis (Li & Ngom, 2013). In particular, "newer MF algorithms that model missing data are essential for [single-cell RNA] data" (Stein-O'Brien et al., 2018).

We took up this challenge using optimal recovery (Golomb & Weinberger, 1959; Micchelli & Rivlin, 1976; 1985), an estimation-theoretic approach used for signal and image interpolation (Donoho, 1994; Muresan & Parks, 2004; Shenoy & Parks, 1992). We previously found a tight worst-case bound on NMF error following our minimax imputation, and experiments showed competitive performance with more complicated imputation techniques (Chen & Varshney, 2019). In this paper, we find several probabilistic error bounds, which better characterize experimental results and serve as useful benchmarks for algorithms. We made no assumptions on the missingness pattern for the worst-case bound (Little & Rubin, 2002), but we assume samples are missing completely at random (MCAR) for our probabilistic bounds. Finally, we discuss how our minimax approach aligns with certain notions of fairness.

2. Related Work

2.1. Non-negative matrix factorization (NMF)

When used to perform cluster analysis, NMF outputs latent factors that characterize the clusters. Donoho & Stodden (2004) interpret NMF as the problem of finding cones in the positive orthant which contain clouds of data points. Liu & Tan (2017) show that a rank-one NMF gives a good description of near-separable data and provide an upper bound on the relative reconstruction error. Since certain biological data is often linearly separable on some manifold or in a high-dimensional space (Clarke et al., 2008), the bound given by rank-one NMF is valid. We describe this conical model below (Donoho & Stodden, 2004; Fu et al., 2019; Liu & Tan, 2017).

Let $\mathbf{V} \in \mathbb{R}_+^{F \times N}$ be a matrix of $N$ sample points with $F$ non-negative observations. Suppose the columns of $\mathbf{V}$ are generated from $K$ clusters. Then there exist $\mathbf{W} \in \mathbb{R}_+^{F \times K}$ and $\mathbf{H} \in \mathbb{R}_+^{K \times N}$ such that $\mathbf{V} = \mathbf{W}\mathbf{H}$. This is the NMF of $\mathbf{V}$ (Lee & Seung, 1999).
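To ground the notation, here is a minimal sketch (ours, not the authors' code) that factors a toy non-negative matrix with scikit-learn and reports the relative Frobenius error, the quantity bounded in (5) below; the dimensions F, N, K are arbitrary:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
F, N, K = 20, 100, 3

# Toy data: columns generated from K non-negative latent factors,
# so V = W_true @ H_true is non-negative with rank at most K.
W_true = rng.random((F, K))
H_true = rng.random((K, N))
V = W_true @ H_true

model = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)   # F x K, analogous to W above
H = model.components_        # K x N, analogous to H above

rel_err = np.linalg.norm(V - W @ H, ord="fro") / np.linalg.norm(V, ord="fro")
print(f"relative Frobenius reconstruction error: {rel_err:.4f}")
```

With exactly rank-K data the error is near zero; the bounds below control it when the columns only approximately follow the cone model.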
Suppose the $N$ data points originate from $K$ cones. We define a circular cone $C(\mathbf{u}, \alpha)$ by a direction vector $\mathbf{u}$ and an angle $\alpha$:

$$C(\mathbf{u}, \alpha) := \left\{ \mathbf{x} \in \mathbb{R}^F \setminus \{0\} : \frac{\mathbf{x} \cdot \mathbf{u}}{\|\mathbf{x}\|_2} \geq \cos \alpha \right\}. \tag{1}$$

We truncate the circular cones to be in the non-negative orthant $P$. All $\mathbf{x}$ belonging to $C_k$ can be considered noisy versions of $\mathbf{u}_k$, where $k = 1, \ldots, K$. We call the angle between cones $\beta_{ij} := \arccos(\mathbf{u}_i \cdot \mathbf{u}_j)$. Assume that

$$\min_{i,j \in [K],\, i \neq j} \beta_{ij} > \max_{i,j \in [K],\, i \neq j} \{\max\{\alpha_i + 3\alpha_j,\; 3\alpha_i + \alpha_j\}\}. \tag{2}$$

This is a common assumption used to guarantee clustering performance (Bu et al., 2017; Liu & Tan, 2017; Ng et al., 2001). We can then partition $\mathbf{V}$ into $K$ sets, denoted $V_k := \{\mathbf{v}_n \in C_k \cap P\}$, and rewrite $\mathbf{V}_k$ as the sum of a rank-one matrix $\mathbf{A}_k$ (parallel to $\mathbf{u}_k$) and a perturbation matrix $\mathbf{E}_k$ (orthogonal to $\mathbf{u}_k$). For any vector $\mathbf{z} \in V_k$, $\mathbf{z} = \|\mathbf{z}\|_2 (\cos \beta)\, \mathbf{u}_k + \mathbf{y}$, where $\|\mathbf{y}\|_2 = \|\mathbf{z}\|_2 (\sin \beta) \leq \|\mathbf{z}\|_2 (\sin \alpha_k)$. This rank-one approximation is used to find error bounds (Liu & Tan, 2017).

2.2. Optimal recovery imputation

When there are missing values in clustered data, local imputation approaches, or those based on local patterns (such as cluster membership or closest neighboring points), can be used to impute values. Popular algorithms that utilize local structure include k-nearest neighbors, local least squares, and bicluster Bayesian component analysis (Hastie et al., 1999; Kim et al., 2005; Meng et al., 2014).

Optimal recovery imputation minimizes NMF error under certain geometric assumptions on the data and enables error bound derivation (Chen & Varshney, 2019). Suppose we are given an unknown signal $\mathbf{v}$ that lies in some signal class $C_k$. The optimal recovery estimate $\hat{\mathbf{v}}$ minimizes the maximum error between $\hat{\mathbf{v}}$ and all signals in the feasible signal class. Given well-clustered non-negative data $\mathbf{V}$, we impute missing samples in $\mathbf{V}$ so that the maximum error is minimized over feasible clusters, regardless of the missingness pattern.

Suppose there are missing values in $\mathbf{V}$. Let $\Omega \in \{0,1\}^{F \times N}$ be a matrix of observed-value indicators with $\Omega_{ij} = 1$ if $v_{ij}$ is observed and $0$ otherwise. We define the projection operator of a matrix $\mathbf{Y}$ onto an index set $\Omega$ by

$$[P_\Omega(\mathbf{Y})]_{ij} = \begin{cases} Y_{ij} & \text{if } \Omega_{ij} = 1 \\ 0 & \text{if } \Omega_{ij} = 0 \end{cases}.$$

We use the subscripted vector $(\cdot)_{fo}$ to denote fully-observed data points (columns), or data points with no missing values, and the subscripted vector $(\cdot)_{po}$ to denote partially-observed data points. We use a subscripted matrix $(\cdot)_{fo}$ or $(\cdot)_{po}$ to denote the set of all fully-observed or partially-observed data columns in the matrix.
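In code, the indicator matrix and projection operator amount to a boolean mask. The sketch below (our illustration, not the paper's code) also generates the Bernoulli(γ) MCAR missingness assumed for the probabilistic bounds and identifies the fully-observed columns used by Alg. 1 later:

```python
import numpy as np

rng = np.random.default_rng(1)
F, N = 20, 100
V = rng.random((F, N))             # toy non-negative data matrix

gamma = 0.2                        # P(entry is missing), MCAR
Omega = (rng.random((F, N)) >= gamma).astype(int)  # 1 = observed, 0 = missing

def P_Omega(Y, Omega):
    """Projection onto the observed index set: keeps Y_ij where Omega_ij = 1."""
    return np.where(Omega == 1, Y, 0.0)

V_observed = P_Omega(V, Omega)
fo = Omega.sum(axis=0) == F        # boolean selector for fully-observed columns
V_fo, V_po = V_observed[:, fo], V_observed[:, ~fo]
```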
We can impute a partially-observed vector $\mathbf{v}_{po}$ by seeing where its observed samples intersect with the clusters $C_1, \ldots, C_K$, as shown in Fig. 1. Let the missing values plane be the restriction set over $\mathbb{R}^F$ that satisfies the constraints on the observed values of $\mathbf{v}_{po}$. We call this intersection the feasible set $W$:

$$W = \bigcup_{k=1}^{K} \{\hat{\mathbf{v}}_{po} \in C_k : P_\Omega(\hat{\mathbf{v}}_{po}) = P_\Omega(\mathbf{v}_{po})\}. \tag{3}$$

[Figure 1. Feasible set of estimators.]

For some norm or error function $\|\cdot\|$, the optimal recovery estimator $\hat{\mathbf{v}}^*_{po}$ minimizes the maximum error over the feasible set of estimates:

$$\hat{\mathbf{v}}^*_{po} = \arg\min_{\hat{\mathbf{v}}_{po} \in C_k} \max_{\mathbf{v} \in C_k} \|\hat{\mathbf{v}}_{po} - \mathbf{v}\|. \tag{4}$$

If $W$ contains estimators belonging to more than one $C_k$, $W$ can be partitioned into disjoint sets $W_k$. Feasible clusters are those for which $W_k$ is not empty.

If the columns of $\mathbf{V}$ come from $K$ circular cones defined as in (1), there is a pair of factor matrices $\mathbf{W}^* \in \mathbb{R}_+^{F \times K}$, $\mathbf{H}^* \in \mathbb{R}_+^{K \times N}$ such that

$$\frac{\|\mathbf{V} - \mathbf{W}^* \mathbf{H}^*\|_F}{\|\mathbf{V}\|_F} \leq \max_{k \in [K]} \{\sin \alpha_k\}. \tag{5}$$

The optimal recovery estimator $\hat{\mathbf{v}}^*_{po}$ would minimize $\alpha_k$ in (1), which is equivalent to:

$$\hat{\mathbf{v}}^*_{po} = \arg\max_{\hat{\mathbf{v}}_{po} \in C_k} \{(\hat{\mathbf{v}}_{po} \cdot \mathbf{u}_k)^2 - (\hat{\mathbf{v}}_{po} \cdot \hat{\mathbf{v}}_{po}) \cos^2(\alpha_k)\}. \tag{6}$$

If the geometric assumption (2) holds, a greedy clustering algorithm (Alg. 1) returns the correct clustering of partially observed data under certain conditions, and the error is bounded as

$$\frac{\|\mathbf{V} - \mathbf{W}^*_{po} \mathbf{H}^*_{po}\|_F}{\|\mathbf{V}\|_F} \leq \max_{k \in [K]} \{\sin 2\alpha_k\}, \tag{7}$$

where $\mathbf{W}^*_{po}$ and $\mathbf{H}^*_{po}$ are found by Alg. 2. Experiments show that this algorithm performs well on biological and imaging data in comparison to more complex methods (Chen & Varshney, 2019).

Algorithm 1: Greedy Clustering with Missing Values
Input: $\mathbf{V} \in \mathbb{R}_+^{F \times N}$, $K \in \mathbb{N}$, $\Omega \in \{0,1\}^{F \times N}$
Output: $\mathbf{J} \in \{0, 1, \ldots, K\}^N$, $\boldsymbol{\alpha} \in (0, \pi/2)^K$, $\mathbf{u} \in \mathbb{R}_+^{F \times K}$
1: Partition the columns of $\mathbf{V}$ into subsets $\mathbf{V}_{fo}$ and $\mathbf{V}_{po}$, where $\mathbf{V}_{fo}$ contains the data columns for which $\sum_i \Omega_{ij} = F$, and $\mathbf{V}_{po}$ contains the remaining columns.
2: Normalize $\mathbf{V}_{fo}$ so that all columns have unit $\ell_2$-norm.

We first cluster the data using Alg. 1. After obtaining the clusters, we compute the minimum covering sphere (MCS) on the points in each cluster (Hopp & Reeve, 1996). This gives us $K$ balls with $N_k$ points in each ball. Now suppose we have partially observed entries in our data. Let the missingness of a point be a Bernoulli($\gamma$) random variable …
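As an illustration of how (6) can be realized: if the columns are pre-normalized to unit ℓ2-norm (step 2 of Alg. 1), then over the feasible set, with the observed entries held fixed, maximizing the alignment of the completion with the cone axis has a closed form. The sketch below is our reading of that special case under the unit-norm assumption, not the authors' Alg. 2:

```python
import numpy as np

def impute_in_cone(v, observed, u):
    """Complete a unit-norm column v so the result stays unit-norm and is
    maximally aligned with the cone axis u (a unit vector with u >= 0).

    v        : (F,) array; observed entries hold data, missing entries ignored
    observed : (F,) boolean mask, True where v_i is observed
    u        : (F,) unit direction vector of the candidate cone C(u, alpha)
    """
    v_hat = np.where(observed, v, 0.0).astype(float)
    # Norm budget left for the missing block if the completed column is unit-norm.
    residual = max(1.0 - float(np.sum(v_hat[observed] ** 2)), 0.0)
    u_miss = u[~observed]
    norm_u_miss = np.linalg.norm(u_miss)
    if norm_u_miss > 0.0:
        # Pointing the missing block along u's missing block maximizes
        # v_hat . u subject to the unit-norm and observed-entry constraints;
        # non-negativity holds automatically since u >= 0.
        v_hat[~observed] = np.sqrt(residual) * u_miss / norm_u_miss
    return v_hat
```

One would evaluate this completion for each feasible cluster k (those with nonempty W_k) and keep the candidate scoring best under the objective in (6).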
Recommended publications
  • imputeTS: Time Series Missing Value Imputation in R by Steffen Moritz and Thomas Bartz-Beielstein
    Abstract: The imputeTS package specializes in univariate time series imputation. It offers multiple state-of-the-art imputation algorithm implementations along with plotting functions for time series missing data statistics. While imputation in general is a well-known problem and widely covered by R packages, finding packages able to fill missing values in univariate time series is more complicated. The reason for this lies in the fact that most imputation algorithms rely on inter-attribute correlations, while univariate time series imputation instead needs to employ time dependencies. This paper provides an introduction to the imputeTS package and its provided algorithms and tools. Furthermore, it gives a short overview of univariate time series imputation in R. Introduction: In almost every domain, from industry (Billinton et al., 1996) to biology (Bar-Joseph et al., 2003), finance (Taylor, 2007), and up to social science (Gottman, 1981), different time series data are measured. While the recorded datasets themselves may be different, one common problem is missing values. Many analysis methods require missing values to be replaced with reasonable values up front. In statistics, this process of replacing missing values is called imputation. Time series imputation is a special sub-field in the imputation research area. Most popular techniques, like Multiple Imputation (Rubin, 1987), Expectation-Maximization (Dempster et al., 1977), Nearest Neighbor (Vacek and Ashikaga, 1980), and Hot Deck (Ford, 1983), rely on inter-attribute correlations to estimate values for the missing data. Since univariate time series do not possess more than one attribute, these algorithms cannot be applied directly.
  • Uncertainty-Aware Variational-Recurrent Imputation Network for Clinical Time Series by Ahmad Wisnu Mulyadi, Eunji Jun, and Heung-Il Suk
    Abstract: Electronic health records (EHR) consist of longitudinal clinical observations portrayed with sparsity, irregularity, and high-dimensionality, which become major obstacles in drawing reliable downstream clinical outcomes. Although there exist great numbers of imputation methods to tackle these issues, most of them ignore correlated features, temporal dynamics, and entirely set aside the uncertainty. Since the missing value estimates involve the risk of being inaccurate, it is appropriate for the method to handle the less certain information differently than the reliable data. In that regard, we can use the uncertainties in estimating the missing values as the fidelity score to be further utilized to alleviate the risk of biased missing value estimates. In this work, we propose a novel variational-recurrent imputation network, which unifies an imputation and a prediction network by taking into account the correlated features, temporal dynamics, as well … Introduction: … data yields poor performance at high missing rates and also requires modeling separately for the distinct dataset. In fact, those missing values reflect the decisions made by health-care providers [5]. Therefore, the missing values and their patterns contain information regarding a patient's health status [5] and correlate with the outcomes or target labels [6]. Thus, we resort to the imputation approach by exploiting those missing patterns to improve the prediction of the clinical outcomes as the downstream task. There exist numerous strategies for imputing missing values in the literature. Generally, imputation methods can be classified into a deterministic or stochastic approach, depending on the use of randomness [7].
  • Potential Sources of Missing Data in a Meta-Analysis
    SMG advanced workshop, Cardiff, March 4-5, 2010. Potential sources of missing data in a meta-analysis. Julian Higgins, MRC Biostatistics Unit, Cambridge, UK, with thanks to Ian White, Fred Wolf, Angela Wood, Alex Sutton. Missing data: studies not found; outcome not reported; outcome partially reported (e.g. missing SD or SE); extra information required for meta-analysis (e.g. imputed correlation coefficient); missing participants; no information on study characteristic for heterogeneity analysis. Concepts in missing data: 1. 'Missing completely at random': as if the missing observation was randomly picked; missingness unrelated to the true (missing) value; e.g. (genuinely) accidental deletion of a file, observations measured on a random sample, statistics not reported due to ignorance of their importance(?). It is OK (unbiased) to analyse just the data available, but the sample size is reduced, which may not be ideal. Not very common! 2. 'Missing at random': missingness depends on things you know about, and not on the missing data themselves; e.g. older people more likely to drop out (irrespective of the effect of treatment), recent cluster trials more likely to report intraclass correlation, experimental treatment more likely to cause drop-out (independent of therapeutic effect). Usually surmountable. Strategies for dealing with missing data: …
  • Multiple Imputation: A Review of Practical and Theoretical Findings by Jared S. Murray
    Submitted to Statistical Science. Jared S. Murray, University of Texas at Austin. arXiv:1801.04058 [stat.ME], 12 Jan 2018. Abstract: Multiple imputation is a straightforward method for handling missing data in a principled fashion. This paper presents an overview of multiple imputation, including important theoretical results and their practical implications for generating and using multiple imputations. A review of strategies for generating imputations follows, including recent developments in flexible joint modeling and sequential regression/chained equations/fully conditional specification approaches. Finally, we compare and contrast different methods for generating imputations on a range of criteria before identifying promising avenues for future research. Key words and phrases: missing data, proper imputation, congeniality, chained equations, fully conditional specification, sequential regression multivariate imputation. Introduction: Multiple imputation (MI) (Rubin, 1987) is a simple but powerful method for dealing with missing data. MI as originally conceived proceeds in two stages: a data disseminator creates a small number of completed datasets by filling in the missing values with samples from an imputation model. Analysts compute their estimates in each completed dataset and combine them using simple rules to get pooled estimates and standard errors that incorporate the additional variability due to the missing data. MI was originally developed for settings in which statistical agencies or other data disseminators provide multiply imputed databases to distinct end-users. There are a number of benefits to MI in this setting: the disseminator can support approximately valid inference for a wide range of potential analyses with a small set of imputations, and the burden of dealing with the missing data is on the imputer rather than the analyst.
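    The "simple rules" referred to above are Rubin's combining rules: average the per-imputation point estimates, and add the within- and between-imputation variances. A minimal sketch with made-up numbers:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()              # pooled point estimate
    w_bar = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    t = w_bar + (1.0 + 1.0 / m) * b       # total variance
    return q_bar, np.sqrt(t)              # estimate and pooled standard error

# e.g. a regression coefficient estimated on m = 5 completed datasets
est, se = pool_rubin([2.1, 1.9, 2.3, 2.0, 2.2],
                     [0.04, 0.05, 0.04, 0.06, 0.05])
```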
  • Within-Industry Productivity Dispersion and Imputation for Missing Data
    Imputation in U.S. Manufacturing Data and Implications for Within-Industry Productivity Dispersion. T. Kirk White (Center for Economic Studies, U.S. Census Bureau), Jerome P. Reiter (Duke University), and Amil Petrin (University of Minnesota, Twin Cities and NBER). Corresponding author: T. Kirk White, [email protected]. Abstract: In the U.S. Census Bureau's 2002 and 2007 manufacturing data, respectively, 79% and 73% of observations have imputed data for at least one variable used to compute total factor productivity. The Bureau imputes for missing values using methods known to result in underestimation of variability and potential bias in multivariate inferences. We present an alternative strategy for handling the missing data based on multiple imputation via sequences of classification and regression trees. We use our imputations and the Bureau's imputations to estimate within-industry productivity dispersions for every manufacturing industry. The results suggest that there is even more within-industry productivity dispersion in U.S. manufacturing than previous research has indicated. We also estimate relationships between plant exit, productivity, prices, and demand shocks. For these estimands, we find the results are substantively robust to using an alternative imputation strategy. (Some of the research in this paper was conducted while the first author was a Census Bureau employee. Any opinions and conclusions expressed herein are those of the authors and do not necessarily represent the views of the U.S. Census Bureau. All results have been reviewed to ensure that no confidential information is disclosed.) Introduction: Nearly all economic surveys suffer from item nonresponse, i.e., respondents answer some questions but not others.
  • Dealing with Missing Standard Deviation and Mean Values in Meta-Analysis of Continuous Outcomes: A Systematic Review by Christopher J. Weir et al.
    Weir et al. BMC Medical Research Methodology (2018) 18:25, https://doi.org/10.1186/s12874-018-0483-0. Research Article, Open Access. Christopher J. Weir, Isabella Butcher, Valentina Assi, Stephanie C. Lewis, Gordon D. Murray, Peter Langhorne, and Marian C. Brady. Abstract. Background: Rigorous, informative meta-analyses rely on availability of appropriate summary statistics or individual participant data. For continuous outcomes, especially those with naturally skewed distributions, summary information on the mean or variability often goes unreported. While full reporting of original trial data is the ideal, we sought to identify methods for handling unreported mean or variability summary statistics in meta-analysis. Methods: We undertook two systematic literature reviews to identify methodological approaches used to deal with missing mean or variability summary statistics. Five electronic databases were searched, in addition to the Cochrane Colloquium abstract books and the Cochrane Statistics Methods Group mailing list archive. We also conducted cited-reference searching and emailed topic experts to identify recent methodological developments. Details recorded included the description of the method, the information required to implement the method, any underlying assumptions, and whether the method could be readily applied in standard statistical software. We provided a summary description of the methods identified, illustrating selected methods in example meta-analysis scenarios. Results: For missing standard deviations (SDs), following screening of 503 articles, fifteen methods were identified in addition to those reported in a previous review. These included Bayesian hierarchical modelling at the meta-analysis level; summary-statistic-level imputation based on observed SD values from other trials in the meta-analysis; a practical approximation based on the range; and algebraic estimation of the SD based on other summary statistics.
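    For concreteness, two of the simpler recoveries that abstract alludes to, algebraic estimation of the SD from other summary statistics and the range-based approximation, correspond to standard identities; a sketch (ours, not the review's code):

```python
import numpy as np

def sd_from_se(se, n):
    """Algebraic: the SD of the data is SE * sqrt(n)."""
    return se * np.sqrt(n)

def sd_from_ci(lower, upper, n, z=1.96):
    """Algebraic: for a 95% CI around a mean, SD = sqrt(n) * (upper - lower) / (2z)."""
    return np.sqrt(n) * (upper - lower) / (2.0 * z)

def sd_from_range(minimum, maximum):
    """Practical approximation: SD is roughly range / 4."""
    return (maximum - minimum) / 4.0
```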
  • Missing Value Imputation Using Feature-Specific Generative Adversarial Networks
    IFGAN: Missing Value Imputation using Feature-specific Generative Adversarial Networks. Wei Qiu (Yuanpei College, Peking University), Yangsibo Huang (College of Computer Science and Technology, Zhejiang University), and Quanzheng Li (Department of Radiology, Massachusetts General Hospital). Abstract: Missing value imputation is a challenging and well-researched topic in data mining. In this paper, we propose IFGAN, a missing value imputation algorithm based on Feature-specific Generative Adversarial Networks (GAN). Our idea is intuitive yet effective: a feature-specific generator is trained to impute missing values, while a discriminator is expected to distinguish the imputed values from observed ones. The proposed architecture is capable of handling different data types, data distributions, missing mechanisms, and missing rates. It also improves post-imputation analysis by preserving inter-feature correlations. We empirically show on several real-life datasets that IFGAN outperforms the current state-of-the-art algorithms under various missing conditions. Index Terms: data preprocessing, missing value imputation, deep adversarial network. Introduction: … established state-of-the-art performance in many applications. When it comes to the task of imputation, deep architectures have also shown greater promise than classical ones in recent studies [7]–[9], due to their capability to automatically learn latent representations and inter-variable associations. Nonetheless, these works all formulated data imputation as a mapping from a missing matrix to a complete one, which ignores the difference in data type and statistical importance between features and makes the model unable to learn feature-specific representations. To overcome the limitations of classical models and to incorporate feature-specific representations into deep learning frameworks, we propose IFGAN, a feature-specific generative adversarial network for missing data imputation.
  • Should a Normal Imputation Model Be Modified to Impute Skewed Variables?
    Sociological Methods and Research, 2013, 42(1), 105-138. Paul T. von Hippel. Abstract: Researchers often impute continuous variables under an assumption of normality, yet many incomplete variables are skewed. We find that imputing skewed continuous variables under a normal model can lead to bias; the bias is usually mild for popular estimands such as means, standard deviations, and linear regression coefficients, but the bias can be severe for more shape-dependent estimands such as percentiles or the coefficient of skewness. We test several methods for adapting a normal imputation model to accommodate skewness, including methods that transform, truncate, or censor (round) normally imputed values, as well as methods that impute values from a quadratic or truncated regression. None of these modifications reliably reduces the biases of the normal model, and some modifications can make the biases much worse. We conclude that, if one has to impute a skewed variable under a normal model, it is usually safest to do so without modifications, unless you are more interested in estimating percentiles and shape than in estimating means, variances, and regressions. In the conclusion, we briefly discuss promising developments in the area of continuous imputation models that do not assume normality. Key words: missing data; missing values; incomplete data; regression; transformation; normalization; multiple imputation; imputation. Paul T. von Hippel is Assistant Professor, LBJ School of Public Affairs, University of Texas, 2315 Red River, Box Y, Austin, TX 78712, [email protected]. I thank Ray Koopman for help with the calculations in Section 2.2.
  • Supervised Nonnegative Matrix Factorization to Predict ICU Mortality Risk by Guoqing Chao, Chengsheng Mao, Fei Wang, Yuan Zhao, and Yuan Luo
    Guoqing Chao, Chengsheng Mao, Yuan Zhao, and Yuan Luo (Feinberg School of Medicine, Northwestern University, Chicago, US) and Fei Wang (Weill Cornell Medicine, Cornell University, New York, US). Abstract: ICU mortality risk prediction is a tough yet important task. On one hand, due to the complex temporal data collected, it is difficult to identify the effective features and interpret them easily; on the other hand, good prediction can help clinicians take timely actions to prevent mortality. These correspond to the interpretability and accuracy problems. Most existing methods lack interpretability, but recently Subgraph Augmented Nonnegative Matrix Factorization (SANMF) has been successfully applied to time series data to provide a path to interpret the features well. Therefore, we adopted this approach as the backbone to analyze the patient data. One limitation of the original SANMF method is its poor prediction ability due to its unsupervised nature. In this paper, we describe a supervised nonnegative matrix factorization (NMF) algorithm that performs predictive modeling by the exploration of atomic and higher-order features jointly. We automate the mining of higher-order features by converting data into graph representation. Moreover, supervised latent group identification reduces dimensionality of different feature types for patients, and simultaneously groups temporal trends to form effective features for patient outcome prediction. Applications on patient physiological time series [1] show significant performance improvements from …
  • Optimal Recovery of Missing Values for Non-Negative Matrix Factorization by Rebecca Chen and Lav R. Varshney
    bioRxiv preprint, doi: https://doi.org/10.1101/647560, this version posted May 24, 2019, under a CC-BY-NC-ND 4.0 International license. Abstract: We extend the approximation-theoretic technique of optimal recovery to the setting of imputing missing values in clustered data, specifically for non-negative matrix factorization (NMF), and develop an implementable algorithm. Under certain geometric conditions, we prove tight upper bounds on NMF relative error, which is the first bound of this type for missing values. We also give probabilistic bounds for the same geometric assumptions. Experiments on image data and biological data show that this theoretically-grounded technique performs as well as or better than other imputation techniques that account for local structure. Index Terms: imputation, missing values, non-negative matrix factorization, optimal recovery. I. Introduction: Matrix factorization is commonly used for clustering and dimensionality reduction in computational biology, imaging, and other fields. Non-negative matrix factorization (NMF) is particularly favored by biologists because non-negativity constraints preclude negative values that are difficult to interpret in biological processes [2], [3]. NMF of gene expression matrices can discover cell groups and lower-dimensional manifolds in gene counts for different cell types. Due to physical and biological limitations of DNA- and RNA-sequencing techniques, however, gene-expression matrices are usually incomplete and matrix imputation is often necessary before further analysis [3].
  • Convex Nonnegative Matrix Factorization with Missing Data by Ronan Hamon, Valentin Emiya, and Cédric Févotte
    Ronan Hamon, Valentin Emiya (Aix Marseille Univ, CNRS, LIF, Marseille, France) and Cédric Févotte (CNRS & IRIT, Toulouse, France). 2016 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 13-16, 2016, Vietri sul Mare, Salerno, Italy. HAL Id: hal-01346492, https://hal-amu.archives-ouvertes.fr/hal-01346492, submitted on 7 Oct 2016. Abstract: Convex nonnegative matrix factorization (CNMF) is a variant of nonnegative matrix factorization (NMF) in which the components are a convex combination of atoms of a known dictionary. … example concerns the field of image or audio inpainting [5, 6, 7, 8], where CNMF may improve the current reconstruction techniques. In inpainting of audio spectrograms, for example, setting up the dictionary to be a comprehensive collection of notes from a specific instrument may guide the factorization …
  • From Predictive Methods to Missing Data Imputation: an Optimization Approach
    Journal of Machine Learning Research 18 (2018) 1-39. Submitted 2/17; Revised 3/18; Published 4/18. Dimitris Bertsimas ([email protected]), Colin Pawlowski ([email protected]), and Ying Daisy Zhuo ([email protected]), Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139. Editor: Francis Bach. Abstract: Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework based on formal optimization to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models including K-nearest neighbors, support vector machines, and decision tree based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high quality solutions in seconds following a general imputation algorithm opt.impute presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. In all scenarios of missing at random mechanisms and various missing percentages, opt.impute produces the best overall imputation in most data sets benchmarked against five other methods: mean impute, K-nearest neighbors, iterative knn, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3% against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, demonstrated by computational experiments on 10 downstream tasks.