Optimal Recovery of Missing Values for Non-Negative Matrix Factorization: A Probabilistic Error Bound
Rebecca Chen [1]   Lav R. Varshney [1,2]

[1] Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. [2] Salesforce Research, Palo Alto, CA, USA. Correspondence to: Lav R. Varshney <varshney@illinois.edu>.

This work was supported by Air Force STTR Grant FA8650-16-M-1819 and grant number 2018-182794 from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation. Presented at the first Workshop on the Art of Learning with Missing Values (Artemiss) hosted by the 37th International Conference on Machine Learning (ICML). Copyright 2020 by the author(s).

Abstract

Missing-values imputation is often evaluated on some similarity measure between the actual and imputed data. However, it may be more meaningful to evaluate downstream algorithm performance after imputation than the imputation itself. We describe a straightforward unsupervised imputation algorithm, a minimax approach based on optimal recovery, and derive probabilistic error bounds on downstream non-negative matrix factorization (NMF). We also comment on fair imputation.

1. Introduction

The performance of missing-values imputation is typically measured by how similar the imputed data are to the original data, but Tuikkala et al. (2008) argue that it is more meaningful to measure the accuracy of downstream analysis (e.g., clustering accuracy) after imputation. Several groups have evaluated the effects of different imputation methods on clustering and classification accuracy (Chiu et al., 2013; de Souto et al., 2015). We previously extended this line of work to quantify non-negative matrix factorization (NMF) accuracy after imputation and introduced a new imputation technique (Chen & Varshney, 2019). In this paper, we find a probabilistic bound for NMF error.

NMF is a popular clustering algorithm used by scientists because of its identifiability, or model uniqueness, and its interpretability, the resulting latent factors being intuitive (Li & Ngom, 2013; Qi et al., 2009). NMF has been used for speech and audio separation, community detection, and topic modeling. In biology, NMF of gene count matrices can discover cell groups and lower-dimensional manifolds (latent factors) describing gene count ratios for different cell types. However, due to channel noise, incomplete survey data, or biological limitations, data matrices are usually incomplete, and matrix imputation is often needed before further analysis (Li & Ngom, 2013). In particular, "newer MF algorithms that model missing data are essential for [single-cell RNA] data" (Stein-O'Brien et al., 2018).

We took up this challenge using optimal recovery (Golomb & Weinberger, 1959; Micchelli & Rivlin, 1976; 1985), an estimation-theoretic approach used for signal and image interpolation (Donoho, 1994; Muresan & Parks, 2004; Shenoy & Parks, 1992). We previously found a tight worst-case bound on NMF error following our minimax imputation, and experiments showed performance competitive with more complicated imputation techniques (Chen & Varshney, 2019). In this paper, we find several probabilistic error bounds, which better characterize experimental results and serve as useful benchmarks for algorithms. We made no assumptions on the missingness pattern for the worst-case bound (Little & Rubin, 2002), but we assume samples are missing completely at random (MCAR) for the probabilistic bounds. Finally, we discuss how our minimax approach aligns with certain notions of fairness.

2. Related Work

2.1. Non-negative matrix factorization (NMF)

When used to perform cluster analysis, NMF outputs latent factors that characterize the clusters. Donoho & Stodden (2004) interpret NMF as the problem of finding cones in the positive orthant which contain clouds of data points. Liu & Tan (2017) show that a rank-one NMF gives a good description of near-separable data and provide an upper bound on the relative reconstruction error. Since certain biological data are often linearly separable on some manifold or in a high-dimensional space (Clarke et al., 2008), the bound given by rank-one NMF is valid. We describe this conical model below (Donoho & Stodden, 2004; Fu et al., 2019; Liu & Tan, 2017).

Let $\mathbf{V} \in \mathbb{R}_+^{F \times N}$ be a matrix of $N$ sample points with $F$ non-negative observations, and suppose the columns of $\mathbf{V}$ are generated from $K$ clusters. Then there exist $\mathbf{W} \in \mathbb{R}_+^{F \times K}$ and $\mathbf{H} \in \mathbb{R}_+^{K \times N}$ such that $\mathbf{V} = \mathbf{W}\mathbf{H}$. This is the NMF of $\mathbf{V}$ (Lee & Seung, 1999).
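As a concrete illustration of the factorization just defined, the sketch below fits an NMF to synthetic $K$-cluster data with scikit-learn's general-purpose solver and reports the relative Frobenius error, the type of quantity bounded in (5) below. The solver choice, dimensions, and data generation are our illustrative assumptions, not the paper's Alg. 2.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
F, N, K = 50, 200, 3

# Synthetic non-negative data: each column is a scaled copy of one of K
# non-negative direction vectors plus a small non-negative perturbation,
# mimicking the K-cluster (conical) model V = WH.
U = rng.random((F, K))                       # cluster direction vectors
labels = rng.integers(K, size=N)             # cluster assignment per column
V = U[:, labels] * rng.uniform(0.5, 1.5, size=(F, N)) + 0.05 * rng.random((F, N))

model = NMF(n_components=K, init="nndsvda", max_iter=500)
W = model.fit_transform(V)                   # W is F x K, non-negative
H = model.components_                        # H is K x N, non-negative

# Relative Frobenius reconstruction error ||V - WH||_F / ||V||_F.
rel_err = np.linalg.norm(V - W @ H, "fro") / np.linalg.norm(V, "fro")
print(f"relative reconstruction error: {rel_err:.4f}")
```

With tightly clustered columns (narrow cones), the reported relative error is small, consistent with the rank-one-per-cone picture used in the bounds below.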
Suppose the $N$ data points originate from $K$ cones. We define a circular cone $C(\mathbf{u}, \alpha)$ by a direction vector $\mathbf{u}$ and an angle $\alpha$:

$$C(\mathbf{u}, \alpha) := \left\{ \mathbf{x} \in \mathbb{R}^F \setminus \{\mathbf{0}\} : \frac{\mathbf{x} \cdot \mathbf{u}}{\|\mathbf{x}\|_2} \geq \cos \alpha \right\}. \quad (1)$$

We truncate the circular cones to lie in the non-negative orthant $\mathcal{P}$. All $\mathbf{x}$ belonging to $C_k$ can be considered noisy versions of $\mathbf{u}_k$, where $k = 1, \ldots, K$. We call the angle between cones $\beta_{ij} := \arccos(\mathbf{u}_i \cdot \mathbf{u}_j)$. Assume that

$$\min_{i,j \in [K],\, i \neq j} \beta_{ij} > \max_{i,j \in [K],\, i \neq j} \left\{ \max\{\alpha_i + 3\alpha_j,\; 3\alpha_i + \alpha_j\} \right\}. \quad (2)$$

This is a common assumption used to guarantee clustering performance (Bu et al., 2017; Liu & Tan, 2017; Ng et al., 2001). We can then partition $\mathbf{V}$ into $K$ sets, denoted $\mathbf{V}_k := \{\mathbf{v}_n \in C_k \cap \mathcal{P}\}$, and rewrite $\mathbf{V}_k$ as the sum of a rank-one matrix $\mathbf{A}_k$ (parallel to $\mathbf{u}_k$) and a perturbation matrix $\mathbf{E}_k$ (orthogonal to $\mathbf{u}_k$): for any vector $\mathbf{z} \in \mathbf{V}_k$, $\mathbf{z} = \|\mathbf{z}\|_2 (\cos \beta) \mathbf{u}_k + \mathbf{y}$, where $\|\mathbf{y}\|_2 = \|\mathbf{z}\|_2 (\sin \beta) \leq \|\mathbf{z}\|_2 (\sin \alpha_k)$. This rank-one approximation is used to find error bounds (Liu & Tan, 2017).

2.2. Optimal recovery imputation

When there are missing values in clustered data, local imputation approaches, i.e., those based on local patterns such as cluster membership or the closest neighboring points, can be used to impute values. Popular algorithms that utilize local structure include k-nearest neighbors, local least squares, and bicluster Bayesian component analysis (Hastie et al., 1999; Kim et al., 2005; Meng et al., 2014).

Optimal recovery imputation minimizes NMF error under certain geometric assumptions on the data and enables the derivation of error bounds (Chen & Varshney, 2019). Suppose we are given an unknown signal $\mathbf{v}$ that lies in some signal class $C_k$. The optimal recovery estimate $\hat{\mathbf{v}}$ minimizes the maximum error between $\hat{\mathbf{v}}$ and all signals in the feasible signal class. Given well-clustered non-negative data $\mathbf{V}$, we impute missing samples in $\mathbf{V}$ so that the maximum error is minimized over feasible clusters, regardless of the missingness pattern.

Suppose there are missing values in $\mathbf{V}$. Let $\mathbf{\Omega} \in \{0, 1\}^{F \times N}$ be a matrix of observed-value indicators, with $\Omega_{ij} = 1$ if $v_{ij}$ is observed and $\Omega_{ij} = 0$ otherwise. We define the projection operator of a matrix $\mathbf{Y}$ onto an index set $\mathbf{\Omega}$ by

$$[P_\Omega(\mathbf{Y})]_{ij} = \begin{cases} Y_{ij} & \text{if } \Omega_{ij} = 1 \\ 0 & \text{if } \Omega_{ij} = 0. \end{cases}$$
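A minimal NumPy rendering of this projection operator (the function name and toy matrices are ours, not the paper's):

```python
import numpy as np

def project_omega(Y, Omega):
    """P_Omega(Y): keep entries where Omega_ij = 1, zero out the rest."""
    return np.where(Omega == 1, Y, 0.0)

# Toy example: a 3 x 2 matrix with two unobserved entries.
Y = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])
Omega = np.array([[1, 0],
                  [1, 1],
                  [0, 1]])
print(project_omega(Y, Omega))
# [[1. 0.]
#  [2. 5.]
#  [0. 6.]]
```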
We use the subscripted vector $(\cdot)_{fo}$ to denote fully-observed data points (columns), i.e., data points with no missing values, and the subscripted vector $(\cdot)_{po}$ to denote partially-observed data points. We use a subscripted matrix $(\cdot)_{fo}$ or $(\cdot)_{po}$ to denote the set of all fully-observed or partially-observed data columns in the matrix.

We can impute a partially-observed vector $\mathbf{v}_{po}$ by seeing where its observed samples intersect with the clusters $C_1, \ldots, C_K$, as shown in Fig. 1.

[Figure 1. Feasible set of estimators.]

Let the missing values plane be the restriction set over $\mathbb{R}^F$ that satisfies the constraints on the observed values of $\mathbf{v}_{po}$. We call the intersection of this plane with the clusters the feasible set $\mathcal{W}$:

$$\mathcal{W} = \bigcup_{k=1}^{K} \left\{ \hat{\mathbf{v}}_{po} \in C_k : P_\Omega(\hat{\mathbf{v}}_{po}) = P_\Omega(\mathbf{v}_{po}) \right\}. \quad (3)$$

If $\mathcal{W}$ contains estimators belonging to more than one $C_k$, it can be partitioned into disjoint sets $\mathcal{W}_k$. Feasible clusters are those for which $\mathcal{W}_k$ is not empty.

For some norm or error function $\|\cdot\|$, the optimal recovery estimator $\hat{\mathbf{v}}_{po}^*$ minimizes the maximum error over the feasible set of estimates:

$$\hat{\mathbf{v}}_{po}^* = \arg\min_{\hat{\mathbf{v}}_{po} \in C_k} \max_{\mathbf{v} \in C_k} \|\hat{\mathbf{v}}_{po} - \mathbf{v}\|. \quad (4)$$

If the columns of $\mathbf{V}$ come from $K$ circular cones defined as in (1), there is a pair of factor matrices $\mathbf{W}^* \in \mathbb{R}_+^{F \times K}$ and $\mathbf{H}^* \in \mathbb{R}_+^{K \times N}$ such that

$$\frac{\|\mathbf{V} - \mathbf{W}^* \mathbf{H}^*\|_F}{\|\mathbf{V}\|_F} \leq \max_{k \in [K]} \{\sin \alpha_k\}. \quad (5)$$

Since the bound in (5) grows with $\alpha_k$, the optimal recovery estimator $\hat{\mathbf{v}}_{po}^*$ would minimize $\alpha_k$ in (1), which is equivalent to

$$\hat{\mathbf{v}}_{po}^* = \arg\max_{\hat{\mathbf{v}}_{po} \in C_k} \left\{ (\hat{\mathbf{v}}_{po} \cdot \mathbf{u}_k)^2 - (\hat{\mathbf{v}}_{po} \cdot \hat{\mathbf{v}}_{po}) \cos^2(\alpha_k) \right\}. \quad (6)$$

If the geometric assumption (2) holds, a greedy clustering algorithm (Alg. 1) returns the correct clustering of partially observed data under certain conditions, and the error is bounded as

$$\frac{\|\mathbf{V} - \mathbf{W}_{po}^* \mathbf{H}_{po}^*\|_F}{\|\mathbf{V}\|_F} \leq \max_{k \in [K]} \{\sin 2\alpha_k\}, \quad (7)$$

where $\mathbf{W}_{po}^*$ and $\mathbf{H}_{po}^*$ are found by Alg. 2. Experiments show that this algorithm performs well on biological and imaging data in comparison to more complex methods (Chen & Varshney, 2019).

Algorithm 1 Greedy Clustering with Missing Values
Input: $\mathbf{V} \in \mathbb{R}_+^{F \times N}$, $K \in \mathbb{N}$, $\mathbf{\Omega} \in \{0, 1\}^{F \times N}$
Output: $J \in \{0, 1, \ldots, K\}^N$, $\alpha \in (0, \pi/2)^K$, $\mathbf{u} \in \mathbb{R}_+^{F \times K}$
1: Partition the columns of $\mathbf{V}$ into subsets $\mathbf{V}_{fo}$ and $\mathbf{V}_{po}$, where $\mathbf{V}_{fo}$ contains the data columns for which $\sum_i \Omega_{ij} = F$, and $\mathbf{V}_{po}$ contains the remaining columns.
2: Normalize $\mathbf{V}_{fo}$ so that all columns have unit $\ell_2$-norm.

... using Alg. 1. After obtaining the clusters, we compute the minimum covering sphere (MCS) of the points in each cluster (Hopp & Reeve, 1996). This gives us $K$ balls with $N_k$ points in each ball.

Now suppose we have partially observed entries in our data. Let the missingness of a point be a Bernoulli($\gamma$) random variable ...
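A minimal NumPy sketch of the first two steps of Alg. 1 together with a Bernoulli($\gamma$) MCAR mask, assuming an entry is missing independently with probability $\gamma$. The function names and the mask convention are illustrative assumptions, not the paper's code; the remaining steps of Alg. 1 do not appear in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcar_mask(F, N, gamma, rng):
    """MCAR indicator matrix Omega: each entry is unobserved independently
    with probability gamma, i.e. (1 - Omega_ij) ~ Bernoulli(gamma)."""
    return (rng.random((F, N)) >= gamma).astype(int)

def partition_and_normalize(V, Omega):
    """Steps 1-2 of Alg. 1 (sketch): split columns into fully observed
    (all F entries observed, i.e. sum_i Omega_ij = F) and partially
    observed, then scale the fully observed columns to unit l2-norm."""
    fully = Omega.sum(axis=0) == V.shape[0]
    V_fo = V[:, fully]
    V_po = V[:, ~fully]
    V_fo = V_fo / np.linalg.norm(V_fo, axis=0, keepdims=True)
    return V_fo, V_po, fully

# Usage on a small random non-negative matrix with ~20% MCAR missingness.
V = rng.random((5, 8))
Omega = mcar_mask(5, 8, gamma=0.2, rng=rng)
V_fo, V_po, fully = partition_and_normalize(V, Omega)
print(fully, np.linalg.norm(V_fo, axis=0))   # fully observed columns now have unit norm
```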