Optimization and Expansion of Non-Negative Matrix Factorization
Xihui Lin and Paul C. Boutros
Lin and Boutros, BMC Bioinformatics (2020) 21:7. https://doi.org/10.1186/s12859-019-3312-5. Methodology article, open access.

Correspondence: [email protected]. Informatics & Biocomputing, Ontario Institute for Cancer Research, Toronto, Canada. Full list of author information is available at the end of the article.

Abstract

Background: Non-negative matrix factorization (NMF) is a technique widely used in various fields, including artificial intelligence (AI), signal processing and bioinformatics. However, existing algorithms and R packages cannot be applied to large matrices, due to their slow convergence, or to matrices with missing entries. Moreover, most NMF research focuses only on blind decompositions: decomposition without utilizing prior knowledge. Finally, the lack of well-validated methodology for choosing the rank hyperparameter also raises concerns about derived results.

Results: We adapt the idea of sequential coordinate-wise descent to NMF to increase the convergence rate. We demonstrate that NMF can handle missing values naturally, and that this property leads to a novel method for determining the rank hyperparameter. Further, we demonstrate some novel applications of NMF and show how to use masking to inject prior knowledge and desirable properties to achieve a more meaningful decomposition.

Conclusions: We show through complexity analysis and experiments that our implementation converges faster than well-known methods. We also show that using NMF for tumour content deconvolution can achieve results similar to those of existing methods like ISOpure. Our proposed missing value imputation is more accurate than conventional methods like multiple imputation, and comparable to missForest while achieving significantly better computational efficiency. Finally, we argue that the suggested rank tuning method based on missing value imputation is theoretically superior to existing methods. All algorithms are implemented in the R package NNLM, which is freely available on CRAN and GitHub.

Keywords: Non-negative matrix factorization, Deconvolution, Imputation

Background

Non-negative matrix factorization (NMF or NNMF) [1] has been widely used as a general method for dimension reduction and feature extraction on non-negative data. The main difference between NMF and other factorization methods, such as SVD, is the non-negativity constraint, which allows only additive combinations of intrinsic 'parts', i.e. the hidden features. This is demonstrated in [1], where NMF learns parts of faces and a face is naturally represented as an additive linear combination of those parts. Indeed, negative combinations are not as intuitive or natural as positive combinations.

In bioinformatics, NMF is sometimes used to find 'metagenes' from expression profiles, which may be related to biological pathways [2, 3]. NMF has also been used to extract trinucleotide mutational signatures from mutations found in cancer genomic sequences, and it was suggested that the trinucleotide profile of each cancer type is a positive linear combination of these signatures [4].

There are several different algorithms available for NMF decomposition, including the multiplicative algorithms proposed in [1], gradient descent and alternating non-negative least squares (ANLS). ANLS is gaining attention because it is guaranteed to converge to a stationary point and because fast algorithms exist for the underlying non-negative least squares (NNLS) problems.
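To make the multiplicative approach concrete, below is a minimal base-R sketch of the classic updates of [1] for the squared-error loss; the function and variable names (lee_mse_nmf, A, W, H, eps) are illustrative choices and not part of the original paper or the NNLM package.

    # Minimal sketch of the multiplicative updates of [1] for A ~ W %*% H
    # under mean squared error. A must be non-negative; k is the rank.
    lee_mse_nmf <- function(A, k, n_iter = 200, eps = 1e-16) {
      n <- nrow(A); m <- ncol(A)
      W <- matrix(runif(n * k), n, k)   # random non-negative initialization
      H <- matrix(runif(k * m), k, m)
      for (i in seq_len(n_iter)) {
        # element-wise multiplicative updates keep W and H non-negative
        H <- H * (t(W) %*% A) / (t(W) %*% W %*% H + eps)
        W <- W * (A %*% t(H)) / (W %*% H %*% t(H) + eps)
      }
      list(W = W, H = H)
    }

Each update is cheap, but many epochs are typically needed, which motivates the faster sequential coordinate-wise descent adopted in this paper.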
In this paper, we first unify various regularization forms on the resulting components, which encourage desired properties such as orthogonality and sparsity, and show how the conventional multiplicative algorithms [1] can be modified to accommodate these regularizations, inspired by [3]. We then adopt the ANLS approach, but incorporate a solution to the NNLS problem using the coordinate-wise algorithm proposed by [5], in which each unknown variable can be solved sequentially and explicitly as a simple quadratic optimization problem. We demonstrate that this algorithm converges much faster than the traditional multiplicative algorithms. For NMF with Kullback-Leibler divergence loss, we extend the methodology by approximating the loss with a quadratic function.

NMF is a dimension reduction method, as the resulting decomposed matrices have fewer entries than the original matrix. This means that one does not need all the entries of the original matrix to perform a decomposition, so NMF should be able to handle missing entries in the target matrix. Indeed, the factorization can be carried out by dropping the loss terms related to the missing entries, provided the target loss function is a sum of per-entry losses, e.g., mean squared error (MSE) or Kullback-Leibler (KL) divergence. Furthermore, the reconstructed matrix has values for entries that are missing in the original matrix, which reveals the capability of NMF for missing value imputation. Inspired by this observation and by the popular training-validation tuning strategy in supervised models, we introduce a novel method to optimize the only hyperparameter k, i.e. the rank of the NMF decomposition.

NMF is essentially unsupervised. It performs a blind decomposition, which puts the meaning of the result in question. This might limit the application of unsupervised methods in areas where strong interpretability is critical, including most biomedical research. On the other hand, a decomposition that does not utilize prior knowledge may not be effective, especially with a small sample size. To overcome these challenges, we apply a masking technique to the NMF decomposition during the iterative algorithms, to retain certain structures or patterns in one or both resulting matrices; these structures can be designed according to our prior knowledge or research interest. This technique can be used to perform a pathway- or sub-network-guided decomposition, or to separate different cell types from mixed tissue samples.

All of these algorithmic innovations are implemented in the popular R programming language. They serve as an alternative to the widely used NMF package [6], which was first translated from a MATLAB package and later optimized via C++ for some algorithms. The sparse alternating NNLS (ANLS) of [3] is expected to be fast in theory, but its implementation in R leads to slow performance in practice. Our NNLM package combines the efficient NNLS algorithm with Rcpp, which seamlessly integrates R and C++ [7], and is freely available and open-source.

In summary, the main contributions of this work include:

• unifying various regularization forms on the resulting components and adapting the multiplicative algorithms to accommodate them;
• developing a faster algorithm for NMF using sequential coordinate-wise descent;
• introducing a method to handle missing entries in the target matrix, which results in a novel method to determine the rank k and a new application of NMF for missing value imputation;
• introducing a masking technique to integrate prior knowledge and desirable properties, and demonstrating how it can be used to achieve tumour content deconvolution.
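As a concrete illustration of the sequential coordinate-wise descent idea of [5] that underlies our NMF updates, the following base-R sketch solves a single NNLS problem, minimize ||b - Ax||^2 subject to x >= 0, one coordinate at a time. It is a simplified sketch rather than the package's actual implementation, and the names scd_nnls and n_iter are illustrative.

    # Sketch of sequential coordinate-wise descent for NNLS:
    #   minimize ||b - A x||^2  subject to  x >= 0.
    # With all other coordinates fixed, each x[j] has a closed-form
    # non-negative minimizer, so no step size is needed.
    scd_nnls <- function(A, b, n_iter = 100) {
      AtA <- crossprod(A)        # t(A) %*% A
      Atb <- crossprod(A, b)     # t(A) %*% b
      x <- rep(0, ncol(A))
      for (iter in seq_len(n_iter)) {
        for (j in seq_along(x)) {
          g <- sum(AtA[j, ] * x) - Atb[j]      # gradient in coordinate j (up to a factor of 2)
          x[j] <- max(0, x[j] - g / AtA[j, j]) # exact coordinate minimizer, clipped at zero
        }
      }
      x
    }

In the alternating NMF scheme, an update of this form is applied to each column of H with W fixed, and then to each row of W with H fixed, so an epoch costs roughly as much as a multiplicative epoch while usually reducing the loss more, as in the comparisons below.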
Results

Algorithms comparison

We carried out an experiment for illustration purposes, using a subset of microarray data from a group of non-small cell lung cancer (NSCLC) samples ([8], available in the NNLM package), to compare SCD with Lee's multiplicative algorithms. Results are shown in Fig. 1 and Table 1. One can see that the SCD and Lee's algorithms have roughly the same run time per epoch, i.e., per update of all W and H entries. However, SCD generally converges much faster, achieving the same accuracy in fewer epochs and in a much shorter time. Algorithms with mean KL loss are slower per epoch than those with MSE, but reduce the error somewhat more in each epoch. The multiplicative algorithm with MSE is faster when a multi-epoch update (N_i > 1) is performed in each outer alternating iteration (LEE-MSE vs LEE-MSE-1).

Missing value imputation

A comparison of different imputation methods is shown in Fig. 2 and Table 2. A subset of the NSCLC dataset [8] is used, with 30% of the entries randomly selected to be missing. One can see that NMF is almost as good as missForest [9] but much faster, and clearly better than MICE [10] and simple median imputation in this example.

Choice of k

We performed a simulation study to illustrate our proposed method for selecting the rank hyperparameter k. Entries of W ∈ R^{400×3} and H ∈ R^{3×50} were sampled independently and uniformly from the intervals (0, 1) and (0, 10) respectively. A was constructed as WH plus noise sampled independently for each entry from the standard normal distribution. All negative values in A were set to 0 (a very unlikely event given the scale of WH). We chose MSE as the loss and ran the proposed algorithm 5 times, each with a random 30% of the entries deleted.

The result is shown in Fig. 3. Different runs (indicated by different colours) give consistent results. The mean square errors (MSEs) of the reconstruc-
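A minimal R sketch of this simulation and of the imputation-based choice of k is shown below, for a single run. It assumes the NNLM function nnmf(A, k) accepts NA entries (dropping them from the loss) and returns components W and H, as the package documentation describes; treat the exact interface as an assumption and the helper names (candidate_k, mse_heldout) as illustrative.

    library(NNLM)  # assumed interface: nnmf(A, k) tolerating NA entries

    set.seed(123)
    # Simulate A = W %*% H + noise as described above
    W_true <- matrix(runif(400 * 3, 0, 1), 400, 3)
    H_true <- matrix(runif(3 * 50, 0, 10), 3, 50)
    A <- W_true %*% H_true + matrix(rnorm(400 * 50), 400, 50)
    A[A < 0] <- 0

    # Hold out a random 30% of entries and score each candidate rank by how
    # well the reconstruction predicts the held-out values
    mask <- matrix(runif(length(A)) < 0.3, nrow(A), ncol(A))
    A_train <- A
    A_train[mask] <- NA

    candidate_k <- 1:6
    mse_heldout <- sapply(candidate_k, function(k) {
      fit <- nnmf(A_train, k = k)       # missing entries are dropped from the loss
      A_hat <- fit$W %*% fit$H
      mean((A_hat[mask] - A[mask])^2)   # error on the held-out entries only
    })

    best_k <- candidate_k[which.min(mse_heldout)]  # expected to recover k = 3 here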