Kernel Weighted Least Square Approach for Imputing Missing Values of Metabolomics Data Nishith Kumar1*, Md
Total Page:16
File Type:pdf, Size:1020Kb
www.nature.com/scientificreports OPEN Kernel weighted least square approach for imputing missing values of metabolomics data Nishith Kumar1*, Md. Aminul Hoque2 & Masahiro Sugimoto3,4 Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantifed data that often contain missing cells in the data matrix as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, all conventional existing techniques only solve the missing value problems. They do not relieve the problems of outliers. Therefore, outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based proposed missing data imputation technique that resolves the problems of missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artifcially generated data and experimentally measured data analysis in both the absence and presence of diferent rates of outliers. Performances based on both artifcial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https:// github. com/ Nishi thPaul/ tWLSA. Metabolomics datasets produced by mass spectrometry (MS) ofen contain a wide number of missing cells in the data matrix that can be generated from various sources, including both technological and biological haz- ards. Generally, there are approximately 10% to 40% missing values in metabolomics datasets1–3. Te reasons include: (i) the metabolite concentration peak is below the analytical method’s detectable threshold; (ii) the metabolite concentration peak is not initially present in the chromatogram; (iii) overlapping signal separation; (iv) deconvolution may give false negatives during the separation of overlapping signals, (v) computational and/or measurement error, (vi) the concentration of the metabolite is present in the sample but vanishes dur- ing downstream processing, and (vii) the concentration of a particular metabolite is identifed in one sample, but does not exist at a signifcant concentration in another sample 1,3–6. Tese missing values can be categorised as (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR). If a missing variable is not related to any observed variable or response it is MCAR. If a missing vari- able is linked with one or more observed variables, but not to the response, it is MAR. Te response associated with missing is MNAR. In metabolomics datasets, if the concentration of a metabolite is not seen in one group of samples, but is present in another group of samples, the missing values most likely occur for a biological reason and can be classifed as MNAR. However, if the peak of metabolite concentration is smaller than the analytical method’s detection threshold, this missing type is a combination of biological and technological issues and can be considered as MNAR. Finally, MCAR is caused by only technological reasons, for example, errors related to peak picking sofware, in which the peak was evident but not included in the raw data. Te easiest and most straight forward method of dealing with missing values is the fltering method. In this method, variables7,8 or samples9,10 are removed. In recent times, this is applicable only when the data matrix includes a greater percentage of missing data. To handle the missing value problem, an alternative approach is the imputation technique. Te conventional and widely used missing imputation techniques in diferent studies and sofware for imputing missing data are half of the minimum value replacement2,11, mean replacement12, median replacement12, k-nearest neighbour (kNN)13, Bayesian principal component analysis (BPCA)14,15, probabilistic 1Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh. 2Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh. 3Health Promotion and Preemptive Medicine, Research and Development Center for Minimally Invasive Therapies, Tokyo Medical University, Shinjuku, Tokyo 160-8402, Japan. 4Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0052, Japan. *email: [email protected] Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 1 Vol.:(0123456789) www.nature.com/scientificreports/ principal component analysis (PPCA)16, zero imputation17, multiple imputations with expectation maximisation (EM) algorithm and Monte Carlo Markov chain (MCMC) method 18, expectation–maximization principal com- ponent analysis (EM-PCA)19, and random forest (RF) imputation 20. Recently developed techniques include Gibbs sampler-based lef-censored missing value imputation approach (GSimp)21, quantile regression imputation of lef-censored data (QRILC)2, kNN on observations with variable pre-selection (“kNN-obs-sel”)22, BayesMetab23, robust missing imputation using mean absolute error (rmiMAE)24, multivariate imputation by chained equations (MICE)25, and others. Several missing imputation techniques are described in the literature. However, the selec- tion of the missing imputation technique has a profound impact on univariate and multivariate (unsupervised and supervised) data analyses and interpretation 1,26–28. Terefore, the appropriate handling of missing data is very important according to the structure or nature of the original data for downstream analysis. Te pattern of metabolomics datasets is very complicated because metabolomics datasets contain outliers29, non-normality, and inherent correlation structure 30. However, the missing value imputation techniques, such as mean, kNN, EM-PCA, PPCA, BPCA, and RF, are sensitive to outliers25. All the aforementioned techniques can only handle the problem of missing values. Tey cannot signifcantly and simultaneously reduce the outlier problem. Tis is because the conventional imputation algorithms do not directly consider any outlier-robust function or any out- lier identifcation and substitute algorithms. Furthermore, existing outlier resolving techniques do not consider missing value problems. For these reasons, we have developed a novel kernel-weight-based missing imputation (KMI) method that can simultaneously overcome both the missing value imputation problems and outliers. We compared our proposed method with widely used conventional techniques and recently developed techniques. To evaluate the performance of the proposed weight-based missing imputation method compared to the other existing missing value imputation methods, we took into account nine widely used well-known missing imputation methods: zero imputation, mean imputation, median imputation, half of the minimum value imputa- tion, kNN imputation, BPCA imputation, PPCA imputation, EM-PCA imputation, and RF imputation. We also considered fve recently developed missing imputation techniques: GSimp, QRILC, BayesMetab, rmiMAE, and MICE. We measured the performances of the missing imputation methods, including the proposed technique, using both artifcial and real data analysis in the absence and presence of diferent rates of outliers. Material and methods In this dissertation, we developed a new missing data imputation method by minimising the two-way kernel weighted square error loss function. To compare the competence of the proposed method, we considered nine widely used traditional missing imputation techniques as described above. Substituting all missing values are by zero is known as zero imputation. In the mean, median, and half of the minimum value imputation, missing data for each metabolite are substituted by the corresponding metabolite average, median, and half of the mini- mum value, respectively. Missing data substitution using kNN, EM-PCA, and RF are found in the “impute”, “missMDA” and “missForest” packages, respectively of the R platform. Moreover, BPCA and PPCA imputation can be done using “pcaMethods” package in Bioconductor. As comparators of our proposed missing imputation method, we also considered fve recently developed missing imputation techniques: GSimp, QRILC, BayesMetab, rmiMAE, and MICE. Among the techniques, rmiMAE is a comparatively more robust missing imputation technique which is computed by minimising the two-way mean absolute error loss function, i.e., L1 (Least 1 n 1 n absolute deviation) loss function like minimizing n j=1 eij = n j=1 |xij − ricj| , which is more robust against n 2 n 2 outliers than L2 (Least square error) loss function like minimizing j=1 (eij) = j=1 (xij − ricj) . To reduce the infuence of outliers in the least square error loss function, here, we used the weighted squared error loss 2 function, where the weight function is wj = exp − 2 (xij − median(xj)) . Te speciality of the weight 2(mad(xj)) function is that the weight will be close to zero if the corresponding observation is apart from its median and if the corresponding observation is the neighbour of the median, the weight will be close to one. A detailed descrip- tion of the proposed missing value imputation method using a two-way