www.nature.com/scientificreports

OPEN Kernel weighted least square approach for imputing missing values of metabolomics Nishith Kumar1*, Md. Aminul Hoque2 & Masahiro Sugimoto3,4

Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantifed data that often contain missing cells in the data matrix as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data techniques are described in the literature, all conventional existing techniques only solve the missing value problems. They do not relieve the problems of outliers. Therefore, outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based proposed missing data imputation technique that resolves the problems of missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artifcially generated data and experimentally measured in both the absence and presence of diferent rates of outliers. Performances based on both artifcial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://​github.​com/​NishithPaul/​ ​tWLSA.

Metabolomics datasets produced by mass spectrometry (MS) ofen contain a wide number of missing cells in the data matrix that can be generated from various sources, including both technological and biological haz- ards. Generally, there are approximately 10% to 40% missing values in metabolomics ­datasets1–3. Te reasons include: (i) the metabolite concentration peak is below the analytical method’s detectable threshold; (ii) the metabolite concentration peak is not initially present in the chromatogram; (iii) overlapping signal separation; (iv) deconvolution may give false negatives during the separation of overlapping signals, (v) computational and/or measurement error, (vi) the concentration of the metabolite is present in the sample but vanishes dur- ing downstream processing, and (vii) the concentration of a particular metabolite is identifed in one sample, but does not exist at a signifcant concentration in another sample­ 1,3–6. Tese missing values can be categorised as (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR). If a missing variable is not related to any observed variable or response it is MCAR. If a missing vari- able is linked with one or more observed variables, but not to the response, it is MAR. Te response associated with missing is MNAR. In metabolomics datasets, if the concentration of a metabolite is not seen in one group of samples, but is present in another group of samples, the missing values most likely occur for a biological reason and can be classifed as MNAR. However, if the peak of metabolite concentration is smaller than the analytical method’s detection threshold, this missing type is a combination of biological and technological issues and can be considered as MNAR. Finally, MCAR is caused by only technological reasons, for example, errors related to peak picking sofware, in which the peak was evident but not included in the raw data. Te easiest and most straight forward method of dealing with missing values is the fltering method. In this method, variables­ 7,8 or ­samples9,10 are removed. In recent times, this is applicable only when the data matrix includes a greater percentage of missing data. To handle the missing value problem, an alternative approach is the imputation technique. Te conventional and widely used missing imputation techniques in diferent studies and sofware for imputing missing data are half of the minimum value ­replacement2,11, ­replacement12, ­replacement12, k-nearest neighbour (kNN)13, Bayesian principal component analysis (BPCA)14,15, probabilistic

1Department of , Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh. 2Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh. 3Health Promotion and Preemptive Medicine, Research and Development Center for Minimally Invasive Therapies, Tokyo Medical University, Shinjuku, Tokyo 160‑8402, Japan. 4Institute for Advanced Biosciences, Keio University, Tsuruoka 997‑0052, Japan. *email: [email protected]

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 1 Vol.:(0123456789) www.nature.com/scientificreports/

principal component analysis (PPCA)16, zero ­imputation17, multiple imputations with expectation maximisation (EM) algorithm and Monte Carlo (MCMC) method­ 18, expectation–maximization principal com- ponent analysis (EM-PCA)19, and random forest (RF) imputation­ 20. Recently developed techniques include Gibbs sampler-based lef-censored missing value imputation approach (GSimp)21, quantile regression imputation of lef-censored data (QRILC)2, kNN on observations with variable pre-selection (“kNN-obs-sel”)22, ­BayesMetab23, robust missing imputation using mean absolute error (rmiMAE)24, multivariate imputation by chained equations (MICE)25, and others. Several missing imputation techniques are described in the literature. However, the selec- tion of the missing imputation technique has a profound impact on univariate and multivariate (unsupervised and supervised) data analyses and interpretation­ 1,26–28. Terefore, the appropriate handling of missing data is very important according to the structure or nature of the original data for downstream analysis. Te pattern of metabolomics datasets is very complicated because metabolomics datasets contain ­outliers29, non-normality, and inherent correlation structure­ 30. However, the missing value imputation techniques, such as mean, kNN, EM-PCA, PPCA, BPCA, and RF, are sensitive to ­outliers25. All the aforementioned techniques can only handle the problem of missing values. Tey cannot signifcantly and simultaneously reduce the outlier problem. Tis is because the conventional imputation algorithms do not directly consider any outlier-robust function or any out- lier identifcation and substitute algorithms. Furthermore, existing outlier resolving techniques do not consider missing value problems. For these reasons, we have developed a novel kernel-weight-based missing imputation (KMI) method that can simultaneously overcome both the missing value imputation problems and outliers. We compared our proposed method with widely used conventional techniques and recently developed techniques. To evaluate the performance of the proposed weight-based missing imputation method compared to the other existing missing value imputation methods, we took into account nine widely used well-known missing imputation methods: zero imputation, mean imputation, median imputation, half of the minimum value imputa- tion, kNN imputation, BPCA imputation, PPCA imputation, EM-PCA imputation, and RF imputation. We also considered fve recently developed missing imputation techniques: GSimp, QRILC, BayesMetab, rmiMAE, and MICE. We measured the performances of the missing imputation methods, including the proposed technique, using both artifcial and real data analysis in the absence and presence of diferent rates of outliers. Material and methods In this dissertation, we developed a new missing data imputation method by minimising the two-way kernel weighted square error . To compare the competence of the proposed method, we considered nine widely used traditional missing imputation techniques as described above. Substituting all missing values are by zero is known as zero imputation. In the mean, median, and half of the minimum value imputation, missing data for each metabolite are substituted by the corresponding metabolite average, median, and half of the mini- mum value, respectively. Missing data substitution using kNN, EM-PCA, and RF are found in the “impute”, “missMDA” and “missForest” packages, respectively of the R platform. Moreover, BPCA and PPCA imputation can be done using “pcaMethods” package in Bioconductor. As comparators of our proposed missing imputation method, we also considered fve recently developed missing imputation techniques: GSimp, QRILC, BayesMetab, rmiMAE, and MICE. Among the techniques, rmiMAE is a comparatively more robust missing imputation technique which is computed by minimising the two-way mean absolute error loss function, i.e., L1 (Least 1 n 1 n absolute deviation) loss function like minimizing n j=1 eij = n j=1 |xij − ricj| , which is more robust against   n 2 n 2 outliers than L2 (Least square error) loss function like minimizing  j=1 (eij) = j=1 (xij − ricj) . To reduce the infuence of outliers in the least square error loss function, here, we used the weighted squared error loss 2 function, where the weight function is wj = exp − 2 (xij − median(xj)) . Te speciality of the weight 2(mad(xj))  function is that the weight will be close to zero if the corresponding observation is apart from its median and if the corresponding observation is the neighbour of the median, the weight will be close to one. A detailed descrip- tion of the proposed missing value imputation method using a two-way kernel weighted square error loss func- tion is given below.

Missing data imputation using two‑way kernel weighted least square error approach (proposed). Let X = xij be metabolomics data, where i = 1, 2, ..., p represents the metabolites and  j = 1, 2, ..., n represents the samples. Tus, in the metabolomics data X, diferent rows indicate diferent metab- olites, and the columns indicate diferent samples.

x11 x12 ··· x1n   x21 x22 ··· x2n X =  . . . .   . . .. .     xp1 xp2 ··· xpn 

Each cell of the metabolomics data could be represented as the product of the metabolite (row) efect and the sample (column) efect. Mathematically, it is written in a bilinear form, xij = ricj (1)

where ri and cj represent the i-th row efect (i.e. metabolite efect) and the j-th column efect (i.e. sample efect), respectively. Te observed metabolomic data matrix usually contains missing cells and outliers. Tus, both the missing cell and outliers in the data matrix can be estimated by considering the efect of the corresponding row and column. In Eq. (1), ri and cj are both unknown. Terefore, our motive is to determine ri and cj to forecast the ij-th missing cell or outlying cell. To estimate ri and cj, consider model

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 2 Vol:.(1234567890) www.nature.com/scientificreports/

xij = ricj+∈ij, (2)

where xij is the yield corresponding to the efect of the ith metabolite (row) and jth sample (column), ri indicates the factors of the ith metabolite, and cj indicates the factors of the j-th sample and ∈ij indicates the error term. From model (2), we must estimate ri and cj simultaneously. To estimate ri and cj, we developed a weighted least 2 square approach using a kernel weight function wj = exp − 2 (xij − median(xj)) and updated r and 2(mad(xj))  i cj by an iterative procedure, where mad represents the median absolute deviation. Te speciality of the kernel weight function is that it lies between zero and one. Te weight will be close to zero if the corresponding observa- tion is apart from its median. If the corresponding observation is the neighbour of the median, the weight will be close to one. In the kernel weight function, is the tuning parameter, where the value of is chosen by k-fold cross-validation. Te details of the appropriate selection procedure are given in Supplementary Information 1 (Supplementary Fig. S1). If the is clean (i.e. no outliers), then will be zero. In this condition, all the weights will be 1, that is, the technique will be the classical least-squares approach. Te steps for estimating ri and cj are given below:

Step 1 To initialise the j-th column (sample) efect (cj), calculate the j-th column median of X. Column median is computed by excluding the missing values j = 1, 2, ..., n. Step 2 Using the weighted least square approach, estimate the i-th row efect (i.e. metabolite efect) r by n 2 n 2 i minimising eij = wij xij − ricj , based on the i-th row of X, by eliminating the missing j=1   j=1   values, i = 1, 2, ..., p. Step 3 Revise the j-th column effect cj , using the weighted least square approach by minimising p 2 w x − r c j = 1, 2, ..., n i=1 ij ij i j , based on the j-th column rof X, rby eliminatingc c the missing values, .   | new− old |+| new− old | Step 4 Repeat Steps 2 and 3 until it satisfes the rule n+p ≤ ε ; here ε is a very small positive number, which depends on the researcher’s interest. Here, we choose ε = 0.01.

ˆ (1) r c r T Step 5 • Compute the first fitted bilinear form as X =ˆ1 ˆ1 , where ˆ1 = (rˆ1, rˆ2, ..., rˆp) and cˆ1 = (cˆ1, cˆ2, ..., cˆn) are obtained from Step 4. ˆ (1) r c • Calculate the frst remainder matrix (XR1) as XR1 = X − X = X −ˆ1 ˆ1 (excluding the missing cells of the data matrix) ˆ r c • Using steps 1–4 on XR1, compute the second ftted bilinear form as,XR1 =ˆ2 ˆ2 and calculate the second ˆ r c r c remainder matrix (XR2) as, XR2 = XR1 − XR1 = X −ˆ1 ˆ1 −ˆ2 ˆ2 (excluding the missing cells of the data matrix) ˆ r r c • Similarly, calculate the r-th remainder (XRr) as, XRr = XR(r−1) − XR(r−1) = X − k=1 ˆk ˆk that is, r r c r r c X = XRr + k=1 ˆk ˆk . Te number of r is selected in such a way that the total row variations of k=1 ˆk ˆk can explain (1 − α)100% variations of X (using the concept of singular value decomposition; the details of the r selection procedure are given in Appendix 1 of the supplementary materials), where α is chosen by the researcher interest. In this case, α = 0.05. Terefore, the approximation of X is: r X ≈ Xˆ (r) = rˆ cˆ k k (3) k=1 Step 6 Substitute the missing values and the outlying cells of X by the corresponding cells of Xˆ (r) that produce the reconstructed full and clean data matrix X˜ . Here, the inter quartile (IQR) ­rule31 was used to detect outliers.

Te application procedure of the proposed method in metabolomics data is given below. Te metabolomics dataset may contain several groups of samples in their data structure. If a metabolomics dataset contains k groups of samples, then the dataset is split according to the groups as

group−1 group−2 ··· group−k   x x ··· x x x ··· x ··· x x ··· x � 11 12�� 1g1� � 1(g1+1) 1(g1+��2) 1(g1+g2)� ���� � 1(g1+···+gk−1+1) 1(g1+···+��gk−1+2) 1(g1+···+gk)�  x x ··· x x x ··· x ··· x x ··· x   21 22 2g1 2(g1+1) 2(g1+2) 2(g1+g2) 2(g1+···+gk−1+1) 2(g1+···+gk−1+2) 2(g1+···+gk)  X =  ......   ......   ......   ··· ··· ··· ···   xp1 xp2 xpg1 xp(g1+1) xp(g1+2) xp(g1+g2) xp(g1+···+gk−1+1) xp(g1+···+gk−1+2) xp(g1+···+gk)     

where g1 is the column number (subjects) of group-1, g2 is the column number (subjects) of group-2, and so on g1 + g2 +···+gk = n. Terefore, we checked whether the metabolomics data matrix X contained multiple groups in the samples. If X contains multiple groups, then partition matrix X as X = ( X1 X2 ··· Xk ) according to k groups of samples,

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 3 Vol.:(0123456789) www.nature.com/scientificreports/

x11 x12 ··· x1g x1(g +1) x1(g +2) ··· x1(g +g )  1   1 1 1 2  x21 x22 ··· x2g1 x2(g1+1) x2(g1+2) ··· x2(g1+g2) where X1 =  . . . . , X2 =  . . . .  and  . . .. .   . . .. .       xp1 xp2 ··· xpg1   xp(g1+1) xp(g1+2) ··· xp(g1+g2) 

x1(g +···+g +1) x1(g +···+g +2) ··· x1(g +···+g )  1 k−1 1 k−1 1 k  x2(g1+···+gk−1+1) x2(g1+···+gk−1+2) ··· x2(g1+···+gk) Xk =  . . . . ;  . . .. .     xp(g1+···+gk−1+1) xp(g1+···+gk−1+2) ··· xp(g1+···+gk)  x11 x12 ··· x1n   x21 x22 ··· x2n otherwise, X =  . . . .   . . .. .     xp1 xp2 ··· xpn 

If X contains k groups, then apply Steps 1–6 for each partitioned data matrix and compute X˜ 1, X˜ 2, ··· X˜ k . ˜ Tus, the reconstructed full and clean data matrix X = (X˜ 1 X˜ 2 ··· X˜ k ) . Otherwise, apply Steps 1–6 for the data matrix X and compute the reconstructed full and clean data matrix X˜ . User can install the package in R platform using the following R code

library(devtools) install_github("NishithPaul/tWLSA") library(tWLSA)

Artifcially generated metabolomics data. To simulate metabolomics datasets, we used the following additive : xijk = µi + gij+∈ijk (4) th th th where xijk is the concentration of the i metabolite, j group, and k sample; the average concentration for the th th i-th metabolite is µi ; gij represents the j group efect of the i metabolite and the random error term of the i- th metabolite, j-th group, and k-th sample is ∈ijk . To generate the data, we considered, µi ∼ uniform(5, 10) and ∈ijk∼ N(0, 1) . To measure the efciency of the proposed technique, we created three types of metabolomics datasets: (i) without a class level in the samples, (ii) two class levels (two groups) in the samples, and (iii) three class levels (three groups) in the samples. In the case of two-and three-class level-based datasets, we also gener- ated two types of metabolites: (a) equal concentration (EE) metabolites and (b) diferential concentrations (DE) metabolites. DE metabolites were classifed into two groups: upregulated and down-regulated metabolites. For up-concentrated metabolites, we used gij ∼ N(0, 1) the healthy group and gij ∼ N(2, 1) the disease group. Simi- larly, for down-regulated metabolites, we used gij ∼ N(2, 1) the healthy group and gij ∼ N(0, 1) the disease group. For EE metabolites, gij ∼ N(0, 1) in both groups. We generated 200 metabolites and 90 samples for each dataset. In two-and three-class datasets, we considered 80 metabolites as DE and 120 metabolites as EE. We generated 100 datasets for each type of dataset. We also incorporated various rates (5%, 10%, 15%, and 20%) of missing cells in the data matrix. Among the total missing values, 60% MAR and 40% for lower values. To investigate the efciency of our proposed technique in the presence of outliers, we also included various rates (3%, 5%, 7%, and 2 10%) of outliers in the artifcial datasets. In the i-th metabolite, we provided N(5*μi, σi ) as outliers, where μi 2 and σi are the mean and of the i-th metabolite; these outliers were distributed randomly in the dataset; thus, outliers may occur anywhere in the dataset.

Real metabolomics data. To measure the performance of our proposed missing imputation method, we frst considered two publicly available fully defned real metabolomics data matrices. One is the Human Cachexia dataset­ 32, collected from 1H-NMR profles of urinary metabolites that are available in the R-specmine library. Te other is the treated dataset­ 33, which is also available in the R-metabolomics library. Since, these two data matrices did not contain any missing values, to investigate the efciency of the proposed technique compared to the other techniques we randomly incorporated diferent rates (5%, 10%, 15% and 20%) of missing values and also com- puted the mean square error (MSE) between the reconstructed data and original data. We also considered two datasets: hepatocellular carcinoma (HCC) with 26.52% missing values/cells34 and MDA-MB-231 breast cancer dataset with 15.81% missing ­values35 to evaluate the performance of the proposed missing value imputation method. Te HCC and MDA-MB-231 datasets were also modifed by artifcially including various rates (3%, 5%, 7%, and 10%) of outliers to investigate the performance of the proposed method. Outliers are distributed 2 2 randomly and follow N(5*μi, σi ), where μi and σi are the mean and variance of the i-th metabolite, respectively. Results To demonstrate the performance of the proposed missing imputation technique compared to the extensively used conventional techniques and recently developed missing imputation techniques, we analysed both artifcial and experimentally measured metabolomics datasets.

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 4 Vol:.(1234567890) www.nature.com/scientificreports/

Figure 1. Performance investigation of diferent missing imputation techniques using average MSE for without class level data.

Artifcial data analysis results. In simulation studies, we frst measured the performance of the proposed missing imputation technique compared to the other ten missing imputation methods (zero, mean, median, half of the minimum value, kNN, BPCA, PPCA, EM-PCA, RF imputations and rmiMAE) using the distance-based measurement. We computed the MSE between the original simulated dataset and the reconstructed missing imputed dataset in both the presence and absence of outliers. We generated three types of simulated metabo- lomics datasets and 100 datasets for each type and calculated the average MSE from 100 MSEs for each type of dataset for diferent rates of outliers (0%, 3%, 5%, 7%, and 10%) and diferent rates (5%, 10%, 15%, and 20%) of missing values. For the datasets with no class level in the samples, the results of the above calculation are shown in Fig. 1. Similarly, for two class levels (two groups) in the sample datasets and three class levels (three groups) in the sample datasets, the results of the aforementioned calculation are given in the Supplementary Information in Fig. S1 and Fig. S2. In the same way, a comparison of the performance of our proposed method with the recently developed techniques (GSimp, QRILC, BayesMetab, rmiMAE, and MICE) using the datasets with no class level in the samples are given in the Supplementary Information in Fig. S3. In all these fgures, the proposed missing value imputation technique produced lower average MSEs for various rates (0%, 3%, 5%, 7%, and 10%) of outli- ers, as well as for various rates (5%, 10%, 15%, and 20%) of missing values. Terefore, our missing imputation method was better than the other existing techniques. Second, we evaluated the performance of our developed KMI method using the misclassifcation error rate (MER), receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC) through DE metab- olite identifcation for two groups and three groups of datasets. To calculate the performance indices (MER, ROC curve, and AUC values), we identifed the DE metabolites from the diferent reconstructed datasets (missing were

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 5 Vol.:(0123456789) www.nature.com/scientificreports/

Figure 2. Performance investigation of diferent missing value imputation techniques using receiver operating characteristic curve of DE calculation for two class level dataset with 5% missing values in absence and presence of outliers.

Without outliers MER 3% outliers MER 5% outliers MER 7% outliers MER 10% outliers MER Methods (AUC) (AUC) (AUC) (AUC) (AUC) RF 4.23 (0.956) 19.76 (0.797) 27.41 (0.720) 31.39 (0.679) 35.52 (0.638) PPCA 3.77 (0.964) 18.59 (0.811) 26.03 (0.737) 30.06 (0.698) 34.48 (0.652) kNN 4.83 (0.952) 20.41 (0.794) 27.85 (0.719) 31.83 (0.678) 35.93 (0.637) BPCA 3.45 (0.967) 21.31 (0.788) 28.65 (0.703) 32.34 (0.675) 36.45 (0.641) EM-PCA 3.38 (0.969) 18.76 (0.829) 26.08 (0.733) 30.09 (0.707) 36.03 (0.649) Zero 16.57 (0.829) 28.88 (0.709) 34.15 (0.651) 37.13 (0.623) 39.95 (0.598) Mean 5.23 (0.949) 21.29 (0.789) 28.62 (0.719) 32.37 (0.672) 36.41 (0.642) Median 5.12 (0.951) 20.99 (0.791) 28.36 (0.706) 32.18 (0.676) 36.25 (0.643) Minimum 12.44 (0.867) 26.45 (0.728) 32.26 (0.673) 35.52 (0.640) 38.75 (0.604) rmiMAE 2.94 (0.971) 4.21 (0.958) 4.77 (0.951) 4.98 (0.948) 5.13 (0.964) Proposed 2.73 (0.973) 2.93 (0.971) 2.98 (0.970) 3.03 (0.969) 3.05 (0.969)

Table 1. Average misclassifcation error rate (MER) and area under the receiver operating characteristic curve (AUC) of DE calculation for two class simulated data with 5% missing values and diferent rates of outliers. Bold indicates the lower MER and Higher AUC throughout the column.

imputed by diferent methods) using a t-test for the two class level dataset and (ANOVA) for the multiclass level dataset. Since the DE and EE metabolites were known in the simulated dataset, we computed the MER, ROC curve, and AUC for diferent missing imputed datasets in both the absence and presence of vari- ous rates of outliers. Te above calculation procedures are provided in Supplementary Information in Fig. S4. Te ROC curve of DE calculation for two-class datasets with 5% missing data and various rates of outliers are depicted in Fig. 2. Similarly, for three classes of simulated datasets, the ROC curve of the DE calculation is also shown in the Supplementary Information in Fig. S5. Similarly, for 10%, 15%, and 20% missing values, the ROC curves are given in the Supplementary Information (Fig. S6–S11). In addition, Table 1 presents the MER and AUC values of the DE calculation for two-class datasets with 5% missing as well as various rates of outli- ers. Moreover, for the two classes of datasets with 5% missing as well as various rates of outliers, the MER and AUC values of DE identifcation are also presented in the Supplementary Information in Table S1. Similarly, for

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 6 Vol:.(1234567890) www.nature.com/scientificreports/

Figure 3. Performance investigation of diferent missing value imputation techniques using receiver operating characteristic curve of sample classifcation for two class level dataset with 5% missing values in presence of outliers.

10%, 15%, and 20% missing values, the MER and AUC values of the DE calculation are also given in the Sup- plementary Information (Tables S2 to S7). Te results of the performance measures of Fig. 2, Table 1, Fig. S5 to S11, and Tables S1 to S7 show that the proposed missing imputation method produced lower average MER and higher average AUC values for diferent rates (5%, 10%, 15%, and 20%) of missing values and various rates (0%, 3%, 5%, 7%, and 10%) of outliers. Terefore, the proposed KMI technique was better than the existing missing value imputation techniques. Finally, we measured the performance of our proposed KMI technique through sample classifcation using only DE metabolites. Although taking only the diferentially expressed variables may give over-optimistic values for prediction performance, however, to increase accuracy, it is ofen used as the feature selection approach. To overcome this problem we used the cross-validation approach. Te performance measure calculation pro- cedure of diferent imputation methods based on sample classifcation (using a SVM classifer) is given in the Supplementary Information in Fig. S12. Te ROC curve based on sample classifcation using a test dataset for two-class simulated datasets with 5% and 10% missing values and various rates (3%, 5%, 7%, and 10%) of outliers are presented in Fig. 3 and Supplementary Fig. S13, respectively. Figure 3 and Fig. S13 show that our proposed KMI technique gave a higher average true positive rate at any point of average false positive rate compared to the other missing imputation methods in the presence of diferent rates of outliers (3%, 5%, 7%, and 10%). We also computed the average MER and AUC in the appearance of 5% missing data as well as the diferent percent- ages of outliers using two-and three class level datasets, which are presented in Tables 2 and 3. Similarly, for 10% missing data, as well as diferent percentages of outliers using two-and three class level datasets, the average MER and AUC are given in Supplementary Tables S8 and S9. Tables 2 and 3 show that the proposed KMI technique produced lower average MER and higher average AUC values at various rates of missing values and diferent rates of outliers for two-and three class level simulated metabolomics data. Terefore, in simulation studies, our proposed KMI technique was better than the existing missing value imputation methods.

Real data analysis results. Here, we used four real metabolomics datasets to evaluate the efciency of our newly developed KMI technique compared to other missing imputation methods for real data analysis. Since the Human Cachexia and treated datasets are fully defned, to explore the performance of our proposed technique we artifcially incorporated various percentage of missing values (5%, 10%, 15% and 20%) and reconstructed the data matrix using several missing value imputation methods including the proposed one. We measured the MSE between the original and reconstructed datasets. We also repeated the aforementioned calculation 100 times and

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 7 Vol.:(0123456789) www.nature.com/scientificreports/

Methods 3% Outliers MER (AUC) 5% Outliers MER (AUC) 7% Outliers MER (AUC) 10% Outliers MER (AUC) RF 4.10 (0.9516) 8.67 (0.9143) 16.17 (0.8395) 17 (0.8332) PPCA 5.67 (0.9408) 7.37 (0.9276) 15.53 (0.8454) 19.37 (0.8048) kNN 5.53 (0.9412) 9.27 (0.9054) 16.57 (0.83885) 19.6 (0.8042) BPCA 5.77 (0.9391) 8.50 (0.9149) 16.27 (0.84025) 21.2 (0.7882) EM-PCA 5.03 (0.9460) 7.43 (0.9270) 15.33 (0.8475) 15.77 (0.8385) Zero 7.73 (0.9224) 12.70 (0.8759) 19 (0.8068) 18.9 (0.8114) Mean 5.93 (0.9371) 9.30 (0.9059) 16.5 (0.8358) 21.37 (0.7858) Median 6.17 (0.9353) 9.67 (0.9021) 16.43 (0.8366) 19.87 (0.8021) Minimum 8.67 (0.9088) 13.70 (0.8675) 18.87 (0.8088) 19.27 (0.8073) rmiMAE 1.69 (0.9831) 1.81 (0.9819) 2.36 (0.9764) 2.82 ( 0.9718) Proposed 1.27 (0.9882) 1.30 (0.9878) 1.30 (0.9876) 1.32 (0.9872)

Table 2. Average misclassifcation error rate(MER) and area under the receiver operating characteristic curve (AUC) for two class simulated data with 5% missing values and diferent rates of outliers. Bold indicates the lower MER and Higher AUC throughout the column.

Methods 3% Outliers MER (AUC) 5% Outliers MER (AUC) 7% Outliers MER (AUC) 10% Outliers MER (AUC) RF 5.60 (0.9661) 10.60 (0.9122) 12.97 (0.8631) 21.63 (0.7687) PPCA 4.33 (0.9718) 7.73 (0.9375) 11.03 (0.8797) 21.47 (0.7734) kNN 3.67 (0.9783) 9.40 (0.9251) 13.80 (0.8634) 18.60 (0.7785) BPCA 4.00 (0.9735) 10.17 (0.9131) 13.37 (0.8688) 21.77 (0.7562) EM-PCA 5.20 (0.9645) 8.93 (0.9302) 11.07 (0.8799) 21.67 (0.7674) Zero 4.67 (0.9696) 8.67 (0.9172) 15.80 (0.8546) 15.20 (0.8054) Mean 4.13 (0.9724) 10.97 (0.8995) 13.93 (0.8620) 18.23 (0.7924) Median 4.20 (0.9721) 10.03 (0.9152) 13.50 (0.8673) 17.93 (0.7948) Minimum 5.23 (0.9626) 8.30 (0.9193) 12.13 (0.8913) 14.33 (0.8214) rmiMAE 1.82 (0.9816) 2.21 (0.9783) 2.57 (0.9721) 3.29 (0.9672) Proposed 1.28 (0.9892) 1.32 (0.9877) 1.39 (0.9862) 1.48 (0.9835)

Table 3. Average misclassifcation error rate and area under the receiver operating characteristic curve (AUC) for three class simulated data with 5% missing values and diferent rates of outliers. Bold indicates the lower MER and Higher AUC throughout the column.

Figure 4. Performance investigation of diferent missing value imputation techniques using MSE calculation for diferent rates of missing values of (a) Human Cachexia dataset and (b) treated dataset.

computed the average MSE for diferent rates of missing values, as presented in Fig. 4. Te fgure shows that the proposed missing value imputation technique produced a lower average MSE for diferent rates of missing values for the Human Cachexia dataset (Fig. 4a) and the treated dataset (Fig. 4b). Terefore, our proposed imputation method displayed comparatively better performance than the other ten conventional missing value imputation methods. Moreover, we conducted a comparative study of the efciency of our proposed missing imputation technique and fve recently developed techniques (GSimp, QRILC, BayesMetab, rmiMAE, and MICE) using

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 8 Vol:.(1234567890) www.nature.com/scientificreports/

Figure 5. Performance measures calculation procedure for real dataset on the basis of sample classifcation.

Without outliers MER 3% outliers MER 5% outliers MER 7% outliers MER 10% outliers MER Methods (AUC) (AUC) (AUC) (AUC) (AUC) RF 10.67 (0.8903) 13.61 (0.8642) 20.16 (0.8001) 21.18 (0.7883) 22.61 (0.7751) PPCA 15.22 (0.8495) 16.45 (0.8365) 26.44 (0.7364) 26.56 (0.7355) 26.67 (0.7323) kNN 13.53 (0.8795) 13.94 (0.8676) 25.56 (0.7484) 25.97 (0.7402) 26.33 (0.7375) BPCA 14.61 (0.8571) 16.28 (0.8342) 21.67 (0.7882) 23.38 (0.7679) 25.45 (0.7452) EM-PCA 11.44 (0.8894) 11.58 (0.8813) 23.72 (0.7665) 23.70 (0.7648) 23.69 (0.7618) Zero 11.55 (0.8874) 12.39 (0.8747) 15.58 (0.8433) 17.51 (0.8246) 19.44 (0.8026) Mean 10.56 (0.8914) 13.61 (0.8644) 20.73 (0.7947) 22.69 (0.7723) 24.89 (0.7502) Median 9.94 (0.9068) 12.38 (0.8783) 17.61 (0.8276) 20.17 (0.7975) 23.54 (0.7637) Minimum 10.56 (0.8937) 11.24 (0.8894) 13.57 (0.8646) 15.67 (0.8427) 17.44 (0.8265) rmiMAE 0.78 (0.9922) 1.84 (0.9815) 2.36 (0.9773) 2.97 (0.9701) 3.65 (0.9644) Proposed 0.00 (1.00) 0.32 (0.9972) 0.75 (0.9929) 1.18 (0.9895) 1.87 (0.9837)

Table 4. Average misclassifcation error rate and area under the receiver operating characteristic curve (AUC) of sample classifcation for two class real dataset (hepatocellular carcinoma) with 26.52% missing values and artifcially imputed diferent rates of outliers. Bold indicates the lower MER and Higher AUC throughout the column.

MSE on the Cachexia dataset with various rates of missing values. Tis is presented in the Supplementary Infor- mation in Fig. S14. We also measured the competency of our proposed KMI technique using MER and AUC of sample clas- sifcation for both the two-class hepatocellular carcinoma dataset and the three-class MDA-MB-231 dataset. To evaluate the performance of all well-known missing value imputation methods in the presence of outliers, we modifed both datasets by artifcially incorporating diferent rates of outliers (3%, 5%, 7%, and 10%). Te performance measure calculation procedure for diferent missing imputation techniques is shown in Fig. 5. Te calculation of performance measures (MER and AUC) using the HCC dataset and the MDA-MB-231 dataset are shown in Tables 4 and 5, respectively. Te data indicated that our proposed KMI technique produced a lower average MER and higher AUC values compared to other missing imputation methods in the appearance of vari- ous rates of outliers. Terefore, both simulation studies and real data analysis showed that our proposed missing value imputation method performed better than the existing missing value imputation methods.

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 9 Vol.:(0123456789) www.nature.com/scientificreports/

Without outliers MER 3% outliers MER 5% outliers MER 7% outliers MER 10% outliers MER Methods (AUC) (AUC) (AUC) (AUC) (AUC) RF 4.23 (0.9635) 21.57 (0.7996) 23.53 (0.7771) 31.96 (0.6951) 34.70 (0.6777) PPCA 4.33 (0.9629) 18.27 (0.8288) 25.13 (0.7564) 21.83 (0.7938) 30.63 (0.7238) kNN 3.43 (0.9759) 18.63 (0.8296) 21.80 (0.7940) 24.56 (0.7646) 43.56 (0.6672) BPCA 8.07 (0.9267) 18.77 (0.8207) 22.06 (0.7836) 26.43 (0.7475) 33.70 (0.6858) EM-PCA 4.46 (0.9619) 19.76 (0.8101) 19.63 (0.8161) 21.66 (0.7938) 34.93 (0.6741) Zero 3.45 (0.9755) 12.20 (0.8843) 25.86 (0.7554) 27.40 (0.7347) 38.53 (0.6374) Mean 3.73 (0.9728) 11.73 (0.8858) 25.30 (0.7597) 26.2 (0.7452) 35.90 (0.6668) Median 3.67 (0.9732) 11.36 (0.8861) 22.73 (0.7874) 23.16 (0.7753) 34.90 (0.6748) Minimum 3.43 (0.9758) 13.10 (0.8774) 24.64 (0.7647) 27.16 (0.7457) 37.10 (0.6457) rmiMAE 1.45 (0.9854) 2.47 (0.9757) 3.04 (0.9753) 3.56 (0.9668) 4.01 (0.9611) Proposed 0.17 (0.9992) 0.25 (0.9983) 0.30 (0.9975) 0.52 (0.9951) 1.13 (0.9892)

Table 5. Average misclassifcation error rate and area under the receiver operating characteristic curve (AUC) of sample classifcation for three class real dataset (MDA-MB-231) with 15.81% missing values and artifcially imputed diferent rates of outliers. Bold indicates the lower MER and Higher AUC throughout the column.

Discussion We examined the performance of each missing imputation technique by optimising the parameter settings using a trial-and-error basis to avoid biased comparisons. For example, in the case of kNN imputation, we chose k, for which the MSE and MER were smaller and the accuracy was maximum. Te performance of diferent missing imputation techniques may depend on the structure and the value/intensity of data. Terefore, we presently generated three types of simulated metabolomics datasets and 100 datasets for each type and calculated the average MSE from 100 MSEs for each type of dataset at diferent rates of outliers (0%, 3%, 5%, 7%, and 10%) and diferent rates (5%, 10%, 15%, and 20%) of missing values. MAR may occur at any position in the data matrix, thus, we generated 100 modifed real datasets, includ- ing diferent MAR positions of the data matrix, to measure the performance of diferent missing imputation techniques. To compute the performance of various missing imputation methods through MER and AUC using the classifcation technique, we divided the dataset into two parts: the test dataset and the training dataset. To reduce the error during the calculation of MER and AUC, we generated 100 training datasets and 100 test datasets for each case and computed the average MER and AUC for measuring the performance of diferent missing imputation methods. Te detailed calculation procedure of diferent performance measures calculated by diferent missing imputation methods for the artifcial dataset is shown in the Supplementary Information in Fig. S4 and Fig. S12. As well, information for the artifcial dataset is presented in Fig. 5. We calculated the execu- tion time (speed of execution) for diferent methods, including the proposed method, for diferent numbers of metabolites and samples (Supplementary Information Table S10). Te URL of the R package and the user manual of our proposed method are https://​github.​com/​Nishi​thPaul/​tWLSA. Conclusion Te Selection of the missing imputation method afects consecutive metabolomics data analysis. Moreover, metabolomics data generated from diferent platforms ofen contain missing values and outliers. Tus, in this study, we developed a new outlier-robust kernel-weight-based two-way alternating weighted least square approach for imputing missing values. We also measured the performance of our proposed KMI technique compared to the existing conventional methods (zero, mean, median, half of the minimum value, kNN, BPCA, PPCA, EM-PCA, and RF imputations) and recently developed missing imputation methods (GSimp, QRILC, BayesMetab, rmiMAE, and MICE) through both artifcial and real metabolomics data analysis. Based on our computational results, the presently developed missing value imputation method is better than the existing miss- ing value imputation methods in both the absence and presence of outliers. For this reason, our recommendation is to apply our proposed two-way kernel weighted least square-based missing value imputation method instead of existing missing imputation methods to substitute the missing values in metabolomics datasets for consecutive univariate, multivariate, and exploratory metabolomics data analysis. Data availability Comparisons were evaluated using R language. To identify the DE metabolites in R we used ‘t.test’ function (for two groups) and ‘anova’ function (for more than two groups) from “stats” package. “ROCR” and “pROC” packages have been used to draw receiver operating characteristic (ROC) curve as well as to calculate the area under the ROC curve. We also used the support vector machine ‘svm’ function from “e1071” package to calcu- late the misclassifcation error rate (MER) for sample classifcation. Moreover, Te source code and packages of diferent missing imputation techniques are given below, Proposed: R package and the user manual of our proposed method are available at https://github.​ com/​ Nishi​ thPaul/​ tWLSA​ . rmiMAE: Te R code for rmiMAE is available at https://github.​ com/​ Nishi​ thPaul/​ missi​ ngImp​ utati​ on/​ blob/​ main/​ rmiMAE.R​ . GSimp: Te R code for GSimp is available at https://​github.​com/​Wande​Rum/​GSimp. kNN: Here, kNN imputation technique has been implemented using "impute" package from bioconductor. Te reference manual of this package can be found

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 10 Vol:.(1234567890) www.nature.com/scientificreports/

at https://​www.​bioco​nduct​or.​org/​packa​ges/​relea​se/​bioc/​manua​ls/​impute/​man/​impute.​pdf. EM-PCA: We used "missMDA" package in R to impute missing values by EM-PCA method. Te reference manual of this package can be found at https://cran.r-​ proje​ ct.​ org/​ web/​ packa​ ges/​ missM​ DA/​ missM​ DA.​ pdf​ . RF: "missForest" package has been used to impute the missing values by Random Forest (RF) method. Te reference manual of this package can be found at https://cran.r-​ ​proje​ct.​org/​web/​packa​ges/​missF​orest/​missF​orest.​pdf. BPCA and PPCA: BPCA and PPCA imputation techniques have been implemented using "pcaMethods" package from bioconductor. Te reference manual is available at https://www.​ bioco​ nduct​ or.​ org/​ packa​ ges/​ relea​ se/​ bioc/​ manua​ ls/​ pcaMe​ thods/​ man/​ ​ pcaMethods.​ pdf​ . QRILC: Here QRILC imputation technique has been implemented using “imputeLCMD” pack- age in R. Te reference manual can be found at https://​cran.r-​proje​ct.​org/​web/​packa​ges/​imput​eLCMD/​imput​ eLCMD.pdf​ . BayesMetab: R code for BayesMetab method is available at https://bmcbi​ oinfo​ rmati​ cs.​ biome​ dcent​ ​ ral.com/​ artic​ les/​ 10.​ 1186/​ s12859-​ 019-​ 3250-2​ . MICE: Here, “missCompare” package in R has been used to impute the missing values by MICE method. Te reference manual of the package can be found at https://​cran.r-​proje​ ct.​org/​web/​packa​ges/​missC​ompare/​missC​ompare.​pdf

Received: 4 January 2021; Accepted: 13 May 2021

References 1. Gromski, P. S. et al. Infuence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites 4, 433–452. https://​doi.​org/​10.​3390/​metab​o4020​433 (2014). 2. Wei, R. et al. Missing value imputation approach for mass spectrometry-based metabolomics data. Sci. Rep. 8, 663. https://​doi.​ org/​10.​1038/​s41598-​017-​19120-0 (2018). 3. Hrydziuszko, O. & Viant, M. R. Missing values in mass spectrometry based metabolomics: an undervalued step in the data pro- cessing pipeline. Metabolomics 8, 161–174. https://​doi.​org/​10.​1007/​s11306-​011-​0366-4 (2012). 4. Steuer, R., Morgenthal, K., Weckwerth, W. & Selbig, J. A gentle guide to the analysis of metabolomic data. In Metabolomics—Meth- ods and Protocols (ed. Weckwerth, W.) 105–126 (Human Press, 2007). 5. Di Guida, R. et al. Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics 12, 93. https://​doi.​org/​10.​1007/​s11306-​016-​1030-9 (2016). 6. Armitage, E. G., Godzien, J., Alonso-Herranz, V., Lopez-Gonzalvez, A. & Barbas, C. Missing value imputation strategies for metabolomics data. Electrophoresis 36, 3050–3060. https://​doi.​org/​10.​1002/​elps.​20150​0352 (2015). 7. Navarrete, A. et al. Metabolomic evaluation of Mitomycin C and rapamycin in a personalized treatment of pancreatic cancer. Pharmacol. Res. Perspect. 2, e00067. https://​doi.​org/​10.​1002/​prp2.​67 (2014). 8. Qiu, Y. et al. Multivariate classifcation analysis of metabolomic data for candidate biomarker discovery in type 2 diabetes mellitus. Metabolomics 4, 337–346. https://​doi.​org/​10.​1007/​s11306-​008-​0123-5 (2008). 9. Kirwan, J. A., Weber, R. J., Broadhurst, D. I. & Viant, M. R. Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and . Sci. Data 1, 140012. https://​doi.​org/​10.​1038/​sdata.​2014.​12 (2014). 10. Krug, S. et al. Te dynamic range of the human metabolome revealed by challenges. FASEB J. 26, 2607–2619. https://​doi.​org/​10.​ 1096/​f.​11-​198093 (2012). 11. Sun, X. & Weckwerth, W. COVAIN: a toolbox for uni- and , time-series and correlation network analysis and inverse estimation of the diferential Jacobian from metabolomics data. Metabolomics 8, 81–93. https://​doi.​org/​ 10.​1007/​s11306-​012-​0399-3 (2012). 12. Madhu, G., Bharadwaj, B. L., Vardhan, K. S. & Chandrika, G. N. A normalized mean algorithm for imputation of missing data values in medical databases. In Innovations in Electronics and Communication Engineering (eds Saini, H. S. et al.) 773–781 (Springer, 2020). 13. Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. (Oxford, England) 17, 520–525. https://​doi.​org/​10.​1093/​bioin​forma​tics/​17.6.​520 (2001). 14. Nyamundanda, G., Brennan, L. & Gormley, I. C. Probabilistic principal component analysis for metabolomic data. BMC Bioinform. 11, 571. https://​doi.​org/​10.​1186/​1471-​2105-​11-​571 (2010). 15. Xia, J. & Wishart, D. S. Web-based inference of biological patterns, functions and pathways from metabolomic data using Meta- boAnalyst. Nat. Protoc. 6, 743–760. https://​doi.​org/​10.​1038/​nprot.​2011.​319 (2011). 16. Ilin, A. & Raiko, T. Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 11, 1957–2000 (2010). 17. Jansen, J. J., Hoefsloot, H. C. J., Boelens, H. F. M., van der Greef, J. & Smilde, A. K. Analysis of longitudinal metabolomics data. Bioinformatics 20, 2438–2446. https://​doi.​org/​10.​1093/​bioin​forma​tics/​bth268 (2004). 18. Lin, T. H. A comparison of multiple imputation with EM algorithm and MCMC method for quality of life missing data. Qual. Quant. 44, 277–287. https://​doi.​org/​10.​1007/​s11135-​008-​9196-5 (2010). 19. Roweis, S. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, 10, 626–632 (MIT Press, 1998). 20. Stekhoven, D. J. & Bühlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118. https://​doi.​org/​10.​1093/​bioin​forma​tics/​btr597 (2012). 21. Wei, R. et al. GSimp: a Gibbs sampler based lef-censored missing value imputation approach for metabolomics studies. PLoS Comput. Biol. 14, e1005973. https://​doi.​org/​10.​1371/​journ​al.​pcbi.​10059​73 (2018). 22. Do, K. T. et al. Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 14, 128. https://​doi.​org/​10.​1007/​s11306-​018-​1420-2 (2018). 23. Shah, J., Brock, G. N. & Gaskins, J. BayesMetab: treatment of missing values in metabolomic studies using a Bayesian modeling approach. BMC Bioinform. 20, 673. https://​doi.​org/​10.​1186/​s12859-​019-​3250-2 (2019). 24. Kumar, N., Hoque, M. A., Shahjaman, M., Islam, S. M. & Mollah, M. N. A new approach of outlier-robust missing value imputation for metabolomics data analysis. Curr. Bioinform. 14, 43–52. https://​doi.​org/​10.​2174/​15748​93612​66617​11211​54655 (2019). 25. Faquih, T. et al. A workfow for missing values imputation of untargeted metabolomics data. Metabolites 10, 486. https://​doi.​org/​ 10.​3390/​metab​o1012​0486 (2020). 26. Pedreschi, R. et al. Treatment of missing values for multivariate statistical analysis of gel-based proteomics data. Proteomics 8, 1371–1383. https://​doi.​org/​10.​1002/​pmic.​20070​0975 (2008). 27. Scheel, I. et al. Te infuence of missing values imputation on detection of diferentially expressed genes from microarray data. Bioinformatics 21, 4272–4279. https://​doi.​org/​10.​1093/​bioin​forma​tics/​bti708 (2005). 28. de Brevern, A. G., Hazout, S. & Malpertuy, A. Infuence of microarrays missing values on the stability of gene groups by hierarchical clustering. BMC Bioinform. 5, 114. https://​doi.​org/​10.​1186/​1471-​2105-5-​114 (2004).

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 11 Vol.:(0123456789) www.nature.com/scientificreports/

29. Blanchet, L. & Smolinska, A. Data fusion in metabolomics and proteomics for biomarker discovery. In Statistical Analysis in Proteomics (ed. Jung, K.) 209–223 (Humana Press, 2016). 30. Tzoulaki, I., Ebbels, T. M., Valdes, A., Elliott, P. & Ioannidis, J. P. Design and analysis of metabolomics studies in epidemiologic research: a primer on-omic technologies. Am. J. Epidemiol. 180, 129–139. https://​doi.​org/​10.​1093/​aje/​kwu143 (2014). 31. Tibshirani, R. & Hastie, T. Outlier sums for diferential gene expression analysis. 8, 2–8. https://doi.​ org/​ 10.​ 1093/​ biost​ ​ atist​ics/​kxl005 (2007). 32. Eisner, R. et al. Learning to predict cancer-associated skeletal muscle wasting from 1H-NMR profles of urinary metabolites. Metabolomics 7, 25–34. https://​doi.​org/​10.​1007/​s11306-​010-​0232-9 (2011). 33. De Livera, A. M. & Bowne, J. Metabolomics: a collection of functions for analysing metabolomics data. R package version 0.1.1, https://​rdrr.​io/​cran/​metab​olomi​cs/ (2013). 34. Kumar, N., Hoque, M. A., Shahjaman, M., Islam, S. M. & Mollah, M. N. H. Metabolomic biomarker identifcation in presence of outliers and missing values. Biomed. Res. Int. 2017, 2437608. https://​doi.​org/​10.​1155/​2017/​24376​08 (2017). 35. Kotze, H. L. et al. A novel untargeted metabolomics correlation-based network analysis incorporating human metabolic recon- structions. BMC Syst. Biol. 7, 107. https://​doi.​org/​10.​1186/​1752-​0509-7-​107 (2013). Acknowledgements Tis work was supported by research funds from JSPS KAKENHI, Grant Number 20H05743. We would like to thank Editage (www.​edita​ge.​com) for English language editing. Author contributions N.K. developed a two-way kernel weighted least square-based missing imputation technique. N.K. also analysed the data, drafed the manuscript, and performed the statistical analysis. Md.A.H. and M.S. coordinated and supervised the project. M.S. revised the manuscript. All authors have read and approved the fnal manuscript.

Competing interests Te authors declare no competing interests. Additional information Supplementary information Te online version contains supplementary material available at https://​doi.​org/​ 10.​1038/​s41598-​021-​90654-0. Correspondence and requests for materials should be addressed to N.K. Reprints and permissions information is available at www.nature.com/reprints. Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional afliations. Open Access Tis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. Te images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://​creat​iveco​mmons.​org/​licen​ses/​by/4.​0/.

© Te Author(s) 2021

Scientifc Reports | (2021) 11:11108 | https://doi.org/10.1038/s41598-021-90654-0 12 Vol:.(1234567890)