<<

Robust Matrix Decomposition with Sparse Corruptions

Daniel Hsu, Sham M. Kakade, and Tong Zhang

Abstract—Suppose a given observation matrix can be decomposed as the sum of a low-rank matrix and a sparse matrix, and the goal is to recover these individual components from the observed sum. Such additive decompositions have applications in a variety of numerical problems including system identification, latent variable graphical modeling, and principal components analysis. We study conditions under which recovering such a decomposition is possible via a combination of $\ell_1$ norm and trace norm minimization. We are specifically interested in the question of how many sparse corruptions are allowed so that convex programming can still achieve accurate recovery, and we obtain stronger recovery guarantees than previous studies. Moreover, we do not assume that the spatial pattern of corruptions is random, which stands in contrast to related analyses under such assumptions via matrix completion.

Index Terms—Matrix decompositions, sparsity, low-rank, outliers

D. Hsu is with Microsoft Research New England, Cambridge, MA 02142 USA (e-mail: [email protected]).
S. M. Kakade is with Microsoft Research New England, Cambridge, MA 02142 USA, and also with the Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104 (e-mail: [email protected]). This author was partially supported by NSF grant IIS-1018651.
T. Zhang is with the Department of Statistics, Rutgers University (e-mail: [email protected]). This author was partially supported by the following grants: AFOSR FA9550-09-1-0425, NSA-AMS 081024, NSF DMS-1007527, and NSF IIS-1016061.

I. INTRODUCTION

THIS work studies additive decompositions of matrices into sparse and low-rank components. Such decompositions have found applications in a variety of numerical problems, including system identification [1], latent variable graphical modeling [2], and principal component analysis (PCA) [3]. In these settings, the user has an input matrix $Y \in \mathbb{R}^{m \times n}$ which is believed to be the sum of a sparse matrix $X_S$ and a low-rank matrix $X_L$. For instance, in the application to PCA, $X_L$ represents a matrix of $m$ data points from a low-dimensional subspace of $\mathbb{R}^n$, and is corrupted by a sparse matrix $X_S$ of errors before being observed as
$$Y = \underbrace{X_S}_{\text{(sparse)}} + \underbrace{X_L}_{\text{(low-rank)}}.$$
The goal is to recover the original data matrix $X_L$ (and the error components $X_S$) from the corrupted observations $Y$. In the latent variable model application of Chandrasekaran et al. [2], $Y$ represents the precision matrix over visible nodes of a Gaussian graphical model, and $X_S$ represents the precision matrix over the visible nodes when conditioned on the hidden nodes. In general, $Y$ may be dense as a result of dependencies between visible nodes through the hidden nodes. However, $X_S$ will be sparse when the visible nodes are mostly independent after conditioning on the hidden nodes, and the difference $X_L = Y - X_S$ will be low-rank when the number of hidden nodes is small. The goal is then to infer the relevant dependency structure from just the visible nodes and measurements of their correlations.

Even if the matrix $Y$ is exactly the sum of a sparse matrix $X_S$ and a low-rank matrix $X_L$, it may be impossible to identify these components from the sum. For instance, the sparse matrix $X_S$ may be low-rank, or the low-rank matrix $X_L$ may be sparse. In such cases, these components may be confused for each other, and thus the desired decomposition of $Y$ may not be identifiable. Therefore, one must impose conditions on the sparse and low-rank components in order to guarantee their identifiability from $Y$.

We present sufficient conditions under which $X_S$ and $X_L$ are identifiable from the sum $Y$. Essentially, we require that $X_S$ not be too dense in any single row or column, and that the singular vectors of $X_L$ not be too sparse. The levels of denseness and sparseness are considered jointly in the conditions in order to obtain the weakest possible conditions. Under a mild strengthening of the conditions, we also show that $X_S$ and $X_L$ can be recovered by solving certain convex programs, and that the solution is robust under small perturbations of $Y$. The first program we consider is
$$\min \ \lambda \|X_S\|_{\mathrm{vec}(1)} + \|X_L\|_*$$
(subject to certain feasibility constraints such as $\|X_S + X_L - Y\| \le \epsilon$), where $\|\cdot\|_{\mathrm{vec}(1)}$ is the entry-wise 1-norm and $\|\cdot\|_*$ is the trace norm. These norms are natural convex surrogates for the sparsity of $X_S$ and the rank of $X_L$ [4], [5], which are generally intractable to optimize. We also consider a regularized formulation
$$\min \ \frac{1}{2\mu} \|X_S + X_L - Y\|_{\mathrm{vec}(2)}^2 + \lambda \|X_S\|_{\mathrm{vec}(1)} + \|X_L\|_*$$
where $\|\cdot\|_{\mathrm{vec}(2)}$ is the Frobenius norm; this formulation may be more suitable in certain applications and enjoys different recovery guarantees.

A. Related work

Our work closely follows that of Chandrasekaran et al. [1], who initiated the study of rank-sparsity incoherence and its application to matrix decompositions. There, the authors identify parameters that characterize the incoherence of $X_S$ and $X_L$ sufficient to guarantee identifiability and recovery using convex programs. However, their analysis of this characterization yields conditions that are significantly stronger than those given in our present work. For instance, the allowed fraction of non-zero entries in $X_S$ is quickly vanishing as a function of the matrix size, even under the most favorable conditions on $X_L$; our analysis does not have this restriction and allows $X_S$ to have up to $\Omega(mn)$ non-zero entries when $X_L$ is low-rank and has non-sparse singular vectors. In terms of the PCA application, our analysis allows for up to a constant fraction of the data matrix entries to be corrupted by noise of arbitrary magnitude, while the analysis of [1] requires that it decrease as a function of the matrix dimensions. Moreover, [1] only considers exact decompositions, which may be unrealistic in certain applications; we allow for approximate decompositions, and study the effect of perturbations on the accuracy of the recovered components.

The application to principal component analysis with gross sparse errors was studied by Candès et al. [3], building on previous results and analysis techniques for the related matrix completion problem (e.g., [6], [7]). The sparse errors model of [3] requires that the support of the sparse matrix $X_S$ be random, which can be unrealistic in some settings. However, the conditions are significantly weaker than those of [1]: for instance, they allow for $\Omega(mn)$ non-zero entries in $X_S$. Our work makes no probabilistic assumption on the sparsity pattern of $X_S$ and instead studies purely deterministic structural conditions. The price we pay, however, is roughly a factor of $\mathrm{rank}(X_L)$ in what is allowed for the support size of $X_S$ (relative to the probabilistic analysis of [3]). Narrowing this gap with alternative deterministic conditions is an interesting open problem. Follow-up work to [3] studies the robustness of the recovery procedure [8], as well as quantitatively weaker conditions on $X_S$ [9], but these works are only considered under the random support model. Our work is therefore largely complementary to these probabilistic analyses.

B. Outline

We describe our main results in Section II. In Section III, we review a number of technical tools such as matrix and operator norms that are used to characterize the rank-sparsity incoherence properties of the desired decomposition. Section IV analyzes these incoherence properties in detail, giving sufficient conditions for identifiability as well as for certifying the (approximate) optimality of a target decomposition for our optimization formulations. The main recovery guarantees are proved in Sections V and VI.

II. MAIN RESULTS

Fix an observation matrix $Y \in \mathbb{R}^{m \times n}$. Our goal is to (approximately) decompose the matrix $Y$ into the sum of a sparse matrix $X_S$ and a low-rank matrix $X_L$.

A. Optimization formulations

We consider two convex optimization problems over $(X_S, X_L) \in \mathbb{R}^{m \times n} \times \mathbb{R}^{m \times n}$. The first is the constrained formulation (parametrized by $\lambda > 0$, $\epsilon_{\mathrm{vec}(1)} \ge 0$, and $\epsilon_* \ge 0$)
$$\begin{aligned} \min \quad & \lambda \|X_S\|_{\mathrm{vec}(1)} + \|X_L\|_* \\ \text{s.t.} \quad & \|X_S + X_L - Y\|_{\mathrm{vec}(1)} \le \epsilon_{\mathrm{vec}(1)} \\ & \|X_S + X_L - Y\|_* \le \epsilon_* \end{aligned} \qquad (1)$$
where $\|\cdot\|_{\mathrm{vec}(1)}$ is the entry-wise 1-norm, and $\|\cdot\|_*$ is the trace norm (i.e., sum of singular values). The second is the regularized formulation (with regularization parameter $\mu > 0$)
$$\min \ \frac{1}{2\mu} \|X_S + X_L - Y\|_{\mathrm{vec}(2)}^2 + \lambda \|X_S\|_{\mathrm{vec}(1)} + \|X_L\|_* \qquad (2)$$
where $\|\cdot\|_{\mathrm{vec}(2)}$ is the Frobenius norm (entry-wise 2-norm).

We also consider adding a constraint to control $\|X_L\|_{\mathrm{vec}(\infty)}$, the entry-wise $\infty$-norm of $X_L$. To (1), we add the constraint
$$\|X_L\|_{\mathrm{vec}(\infty)} \le b$$
and to (2), we add
$$\|X_S - Y\|_{\mathrm{vec}(\infty)} \le b.$$
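Both formulations are standard convex programs and can be prototyped with an off-the-shelf modeling tool. The following sketch is ours, not part of the paper: it poses (1) and (2) with cvxpy, with the function names and optional box constraints chosen for illustration.

```python
# Minimal sketch (ours, not from the paper): formulations (1) and (2) in cvxpy.
import cvxpy as cp

def solve_constrained(Y, lam, eps_vec1, eps_tr, b=None):
    """Formulation (1): min lam*||X_S||_vec(1) + ||X_L||_* subject to entry-wise
    1-norm and trace-norm feasibility constraints on the residual X_S + X_L - Y."""
    m, n = Y.shape
    XS, XL = cp.Variable((m, n)), cp.Variable((m, n))
    R = XS + XL - Y
    constraints = [cp.sum(cp.abs(R)) <= eps_vec1,    # ||R||_vec(1) <= eps_vec(1)
                   cp.normNuc(R) <= eps_tr]          # ||R||_*      <= eps_*
    if b is not None:
        constraints.append(cp.max(cp.abs(XL)) <= b)  # optional ||X_L||_vec(inf) <= b
    objective = cp.Minimize(lam * cp.sum(cp.abs(XS)) + cp.normNuc(XL))
    cp.Problem(objective, constraints).solve()
    return XS.value, XL.value

def solve_regularized(Y, lam, mu, b=None):
    """Formulation (2): min (1/(2*mu))*||X_S + X_L - Y||_F^2 + lam*||X_S||_vec(1)
    + ||X_L||_*, optionally with the box constraint ||X_S - Y||_vec(inf) <= b."""
    m, n = Y.shape
    XS, XL = cp.Variable((m, n)), cp.Variable((m, n))
    objective = cp.Minimize(cp.sum_squares(XS + XL - Y) / (2 * mu)
                            + lam * cp.sum(cp.abs(XS)) + cp.normNuc(XL))
    constraints = [] if b is None else [cp.max(cp.abs(XS - Y)) <= b]
    cp.Problem(objective, constraints).solve()
    return XS.value, XL.value
```

A generic conic solver suffices for moderate dimensions; scaling to large matrices would call for specialized first-order methods, which are outside the scope of the paper.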

The parameter $b$ is intended as a natural bound for $X_L$ and is typically known in applications. For example, in image processing, the values of interest may lie in the interval $[0, 255]$ (say), and hence, we might take $b = 500$ as a relaxation of the box constraint $[0, 255]$. The core of our analyses does not rely on these additional constraints; we only consider them to obtain improved robustness guarantees for recovering $X_L$, which may be important in some applications.

B. Identifiability conditions

Our first result is a refinement of the rank-sparsity incoherence notion developed by [1]. We characterize a target decomposition of $Y$ into $Y = \bar{X}_S + \bar{X}_L$ by the projection operators to subspaces associated with $\bar{X}_S$ and $\bar{X}_L$. Let
$$\bar{\Omega} = \Omega(\bar{X}_S) := \{X \in \mathbb{R}^{m \times n} : \mathrm{supp}(X) \subseteq \mathrm{supp}(\bar{X}_S)\}$$
be the space of matrices whose supports are subsets of the support of $\bar{X}_S$, and let $P_{\bar{\Omega}}$ be the orthogonal projector to $\bar{\Omega}$ under the inner product $\langle A, B \rangle = \mathrm{tr}(A^\top B)$; this projection is given by
$$[P_{\bar{\Omega}}(M)]_{i,j} = \begin{cases} M_{i,j} & \text{if } (i,j) \in \mathrm{supp}(\bar{X}_S) \\ 0 & \text{otherwise} \end{cases}$$
for all $i \in [m] := \{1, \dots, m\}$, $j \in [n] := \{1, \dots, n\}$. Furthermore, let
$$\bar{T} = T(\bar{X}_L) := \{X_1 + X_2 \in \mathbb{R}^{m \times n} : \mathrm{range}(X_1) \subseteq \mathrm{range}(\bar{X}_L), \ \mathrm{range}(X_2^\top) \subseteq \mathrm{range}(\bar{X}_L^\top)\}$$
be the span of matrices either with row-space contained in that of $\bar{X}_L$, or with column-space contained in that of $\bar{X}_L$. Let $P_{\bar{T}}$ be the orthogonal projector to $\bar{T}$, again, under the inner product $\langle A, B \rangle = \mathrm{tr}(A^\top B)$; this projection is given by
$$P_{\bar{T}}(M) = \bar{U}\bar{U}^\top M + M\bar{V}\bar{V}^\top - \bar{U}\bar{U}^\top M \bar{V}\bar{V}^\top$$
where $\bar{U} \in \mathbb{R}^{m \times \bar{r}}$ and $\bar{V} \in \mathbb{R}^{n \times \bar{r}}$ are, respectively, matrices of left and right orthonormal singular vectors corresponding to the non-zero singular values of $\bar{X}_L$, and $\bar{r}$ is the rank of $\bar{X}_L$. We will see that certain operator norms of $P_{\bar{\Omega}}$ and $P_{\bar{T}}$ can be bounded in terms of structural properties of $\bar{X}_S$ and $\bar{X}_L$.

The first property measures the maximum number of non-zero entries in any row or column of $\bar{X}_S$:
$$\alpha(\rho) := \max\{\rho \,\|\mathrm{sign}(\bar{X}_S)\|_{1 \to 1}, \ \rho^{-1} \|\mathrm{sign}(\bar{X}_S)\|_{\infty \to \infty}\},$$
where $\|M\|_{p \to q} := \max\{\|Mv\|_q : v \in \mathbb{R}^n, \|v\|_p \le 1\}$,
$$\mathrm{sign}(M)_{i,j} = \begin{cases} -1 & \text{if } M_{i,j} < 0 \\ 0 & \text{if } M_{i,j} = 0 \\ +1 & \text{if } M_{i,j} > 0 \end{cases} \quad \forall i \in [m], \ j \in [n],$$
and $\rho > 0$ is a balancing parameter to accommodate disparity between the number of rows and columns; a natural choice for the balancing parameter is $\rho := \sqrt{n/m}$. We remark that $\rho$ is only a parameter for the analysis; the optimization formulations do not directly involve $\rho$. Note that $\bar{X}_S$ may have $\Omega(mn)$ non-zero entries and $\alpha(\sqrt{n/m}) = O(\sqrt{mn})$ as long as the non-zero entries of $\bar{X}_S$ are spread out over the entire matrix. Conversely, a sparse matrix with just $O(m + n)$ non-zero entries could have $\alpha(\sqrt{n/m}) = \sqrt{mn}$ by having all of its non-zero entries in just a few rows and columns.

The second property measures the sparseness of the singular vectors of $\bar{X}_L$:
$$\beta(\rho) := \rho^{-1} \|\bar{U}\bar{U}^\top\|_{\mathrm{vec}(\infty)} + \rho \,\|\bar{V}\bar{V}^\top\|_{\mathrm{vec}(\infty)} + \|\bar{U}\|_{2 \to \infty} \|\bar{V}\|_{2 \to \infty}.$$
For instance, if the singular vectors of $\bar{X}_L$ are perfectly aligned with the coordinate axes, then $\beta(\rho) = \Omega(1)$. On the other hand, if the left and right singular vectors have entries bounded by $\sqrt{c/m}$ and $\sqrt{c/n}$, respectively, for some $c \ge 1$, then $\beta(\sqrt{n/m}) \le 3c\bar{r}/\sqrt{mn}$.

Our main identifiability result is the following.

Theorem 1. If $\inf_{\rho > 0} \alpha(\rho)\beta(\rho) < 1$, then $\bar{\Omega} \cap \bar{T} = \{0\}$.

Theorem 1 is an immediate consequence of the following lemma (also given as Lemma 10).

Lemma 1. For all $M \in \mathbb{R}^{m \times n}$, $\|P_{\bar{\Omega}}(P_{\bar{T}}(M))\|_{\mathrm{vec}(1)} \le \inf_{\rho > 0} \alpha(\rho)\beta(\rho) \,\|M\|_{\mathrm{vec}(1)}$.

Proof of Theorem 1: Take any $M \in \bar{\Omega} \cap \bar{T}$. By Lemma 1, $\|P_{\bar{\Omega}}(P_{\bar{T}}(M))\|_{\mathrm{vec}(1)} \le \alpha(\rho)\beta(\rho) \,\|M\|_{\mathrm{vec}(1)}$. On the other hand, $P_{\bar{\Omega}}(P_{\bar{T}}(M)) = M$, so $\alpha(\rho)\beta(\rho) < 1$ implies $\|M\|_{\mathrm{vec}(1)} = 0$, i.e., $M = 0$.

Clearly, if $\bar{\Omega} \cap \bar{T}$ contains a matrix other than $0$, then $\{(\bar{X}_S + M, \bar{X}_L - M) : M \in \bar{\Omega} \cap \bar{T}\}$ gives a family of sparse/low-rank decompositions of $Y = \bar{X}_S + \bar{X}_L$ with at least the same sparsity and rank as $(\bar{X}_S, \bar{X}_L)$. Conversely, if $\bar{\Omega} \cap \bar{T} = \{0\}$, then any matrix in the direct sum $\bar{\Omega} \oplus \bar{T}$ has exactly one decomposition into a matrix $A \in \bar{\Omega}$ plus a matrix $B \in \bar{T}$, and in this sense $(\bar{X}_S, \bar{X}_L)$ is identifiable.

Note that, as we have argued above, the condition $\inf_{\rho > 0} \alpha(\rho)\beta(\rho) < 1$ may be achieved even by matrices $\bar{X}_S$ with $\Omega(mn)$ non-zero entries, provided that the non-zero entries of $\bar{X}_S$ are sufficiently spread out, and that $\bar{X}_L$ is low-rank and has singular vectors far from the coordinate basis. This is in contrast with the conditions studied by [1]. Their analysis uses a different characterization of $\bar{X}_S$ and $\bar{X}_L$, which leads to a stronger identifiability condition in certain cases. Roughly, if $\bar{X}_S$ has an approximately symmetric sparsity pattern (so $\|\mathrm{sign}(\bar{X}_S)\|_{1 \to 1} \approx \|\mathrm{sign}(\bar{X}_S)\|_{\infty \to \infty}$), then [1] requires $\alpha(1)\sqrt{\beta(1)} < 1$ for square $n \times n$ matrices.¹ Since $\beta(1) = \Omega(1/n)$ for any $\bar{X}_L \in \mathbb{R}^{n \times n}$, the condition implies $\alpha(1)^2 = O(n)$. Therefore $\bar{X}_S$ must have at most $O(n)$ non-zero entries (or else $\alpha(1)^2$ becomes super-linear). In other words, the fraction of non-zero entries allowed in $\bar{X}_S$ by the condition $\alpha(1)\sqrt{\beta(1)} < 1$ is quickly vanishing as a function of $n$.

¹ [1] does not explicitly work out the non-square case, but claims that $n$ can be replaced in their analysis by the larger matrix dimension $\max\{m, n\}$. However this does not seem possible, and the analysis there should only lead to the quite suboptimal dimensionality dependency $\min\{m, n\}$. This is because a rectangular matrix $\bar{X}_L$ will have left and right singular vectors of different dimensions and thus different allowable ranges of infinity norms.

C. Recovery guarantees

Our next results are guarantees for (approximately) recovering the sparse/low-rank decomposition $(\bar{X}_S, \bar{X}_L)$ from $Y = \bar{X}_S + \bar{X}_L$ via solving either convex optimization problem (1) or (2). We require a mild strengthening of the condition $\inf_{\rho > 0} \alpha(\rho)\beta(\rho) < 1$, as well as appropriate settings of $\lambda > 0$ and $\mu > 0$ for our recovery guarantees. Before continuing, we first define another property of $\bar{X}_L$:
$$\gamma := \|\bar{U}\bar{V}^\top\|_{\mathrm{vec}(\infty)}$$
which is approximately the same as (in fact, bounded above by) the third term in the definition of $\beta(\rho)$.

The quantities $\alpha(\rho)$, $\beta(\rho)$, and $\gamma$ are central to our analysis. Therefore we state the following proposition for reference, which provides a more intuitive understanding of their behavior. We note that this is the only part in which any explicit dimensional dependence comes into our analysis.

Proposition 1. Let $m_0$ be the maximum number of non-zero entries of $\bar{X}_S$ per column and $n_0$ be the maximum number of non-zero entries of $\bar{X}_S$ per row. Let $\bar{r}$ be the rank of $\bar{U}$ and $\bar{V}$. Assume further that $m_0 \le c_1 m/\bar{r}$ and $n_0 \le c_1 n/\bar{r}$ for some $c_1 \in (0, 1)$, and $\|\bar{U}\|_{\mathrm{vec}(\infty)} \le \sqrt{c_2/m}$ and $\|\bar{V}\|_{\mathrm{vec}(\infty)} \le \sqrt{c_2/n}$ for some $c_2 > 0$. Then with $\rho = \sqrt{n/m}$, we have
$$\alpha(\rho) \le \frac{c_1}{\bar{r}} \sqrt{mn}, \qquad \beta(\rho) \le \frac{3 c_2 \bar{r}}{\sqrt{mn}}, \qquad \gamma \le \frac{c_2 \bar{r}}{\sqrt{mn}}.$$

We now proceed with conditions for the regularized formulation (2). Let $E := Y - (\bar{X}_S + \bar{X}_L)$ and
$$\epsilon_{2\to2} := \|E\|_{2\to2}, \qquad \epsilon_{\mathrm{vec}(\infty)} := \|E\|_{\mathrm{vec}(\infty)} + \|P_{\bar{T}}(E)\|_{\mathrm{vec}(\infty)}.$$
We require the following, for some $\rho > 0$ and $c > 1$:
$$\alpha(\rho)\beta(\rho) < 1 \qquad (3)$$
$$\lambda \le \frac{(1 - \alpha(\rho)\beta(\rho))\,(1 - c \cdot \mu^{-1}\epsilon_{2\to2})}{c \cdot \alpha(\rho)} - \frac{\alpha(\rho)\,\mu^{-1}\epsilon_{\mathrm{vec}(\infty)} + \alpha(\rho)\,\gamma}{\alpha(\rho)} \qquad (4)$$
$$\lambda \ge c \cdot \frac{\gamma + \mu^{-1}(2 - \alpha(\rho)\beta(\rho))\,\epsilon_{\mathrm{vec}(\infty)}}{1 - \alpha(\rho)\beta(\rho) - c \cdot \alpha(\rho)\beta(\rho)} > 0. \qquad (5)$$
For instance, if for some $\rho > 0$,
$$\alpha(\rho)\gamma \le \frac{1}{41} \quad \text{and} \quad \alpha(\rho)\beta(\rho) \le \frac{3}{41}, \qquad (6)$$
then the conditions are satisfied for $c = 2$ provided that $\mu$ and $\lambda$ are chosen to satisfy
$$\mu \ge \max\left\{4 \cdot \epsilon_{2\to2}, \ \frac{2}{15} \cdot \frac{\epsilon_{\mathrm{vec}(\infty)}}{\lambda}\right\} \quad \text{and} \quad \frac{15}{2} \cdot \gamma \le \lambda \le \frac{15}{82} \cdot \frac{1}{\alpha(\rho)}. \qquad (7)$$
Note that (6) can be satisfied when $c_1 \le c_2^{-1}/41$ in Proposition 1.

For the constrained formulation (1), our analysis requires the same conditions as above, except with $E$ set to $0$. Note that our analysis still allows for approximate decompositions; it is only the conditions that are formulated with $E = 0$. Specifically, we require for some $\rho > 0$ and $c > 1$:
$$\alpha(\rho)\beta(\rho) < 1 \qquad (8)$$
$$\lambda \le \frac{1 - \alpha(\rho)\beta(\rho) - c \cdot \alpha(\rho)\gamma}{c \cdot \alpha(\rho)} \qquad (9)$$
$$\lambda \ge c \cdot \frac{\gamma}{1 - \alpha(\rho)\beta(\rho) - c \cdot \alpha(\rho)\beta(\rho)} > 0. \qquad (10)$$
For instance, if for some $\rho > 0$,
$$\alpha(\rho)\gamma \le \frac{1}{15} \quad \text{and} \quad \alpha(\rho)\beta(\rho) \le \frac{1}{5}, \qquad (11)$$
then the conditions are satisfied for $c = 2$ provided that $\lambda$ is chosen to satisfy
$$5\gamma \le \lambda \le \frac{1}{3\alpha(\rho)}. \qquad (12)$$
Note that (11) can be satisfied when $c_1 \le c_2^{-1}/15$ in Proposition 1.
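Since the conditions above are stated entirely in terms of $\alpha(\rho)$, $\beta(\rho)$, and $\gamma$, they are easy to evaluate numerically for a given target pair. The sketch below is our illustration (the function names are our own): it computes the three quantities for a candidate $(\bar{X}_S, \bar{X}_L)$ and, when the simplified constrained-formulation condition (11) holds, returns a $\lambda$ from the admissible range (12).

```python
# Illustrative sketch (ours): evaluate alpha(rho), beta(rho), gamma and check condition (11).
import numpy as np

def incoherence_quantities(XS_bar, XL_bar, rho):
    S = np.sign(XS_bar)
    # alpha(rho): balanced induced norms of sign(XS_bar), i.e. max column / row counts.
    alpha = max(rho * np.abs(S).sum(axis=0).max(),      # ||sign(XS)||_{1->1}
                np.abs(S).sum(axis=1).max() / rho)      # ||sign(XS)||_{inf->inf}
    # Thin SVD of XL_bar restricted to its non-zero singular values.
    U, s, Vt = np.linalg.svd(XL_bar, full_matrices=False)
    r = int((s > 1e-12 * s.max()).sum())
    U, V = U[:, :r], Vt[:r, :].T
    beta = (np.abs(U @ U.T).max() / rho + rho * np.abs(V @ V.T).max()
            + np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max())
    gamma = np.abs(U @ V.T).max()                        # ||U V^T||_vec(inf)
    return alpha, beta, gamma

def lambda_for_constrained(XS_bar, XL_bar):
    """If condition (11) holds with rho = sqrt(n/m), return a lambda inside (12)."""
    m, n = XS_bar.shape
    rho = np.sqrt(n / m)
    alpha, beta, gamma = incoherence_quantities(XS_bar, XL_bar, rho)
    if alpha * gamma <= 1 / 15 and alpha * beta <= 1 / 5:
        return 0.5 * (5 * gamma + 1 / (3 * alpha))       # any value in [5*gamma, 1/(3*alpha)]
    return None
```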

4 In summary, Proposition 1 shows that our results can It is possible to modify the constraints in (1) to use be applied even with m0 = Ω(m/r¯) and n0 = Ω(n/r¯) norms other than k · kvec(1) and k · k∗; the analysis could corruptions. In contrast, the results of [1] only apply un- at the very least be modified by simply using standard p der the condition max(m0, n0) = O( min(m, n)/r¯), relationships to change between norms, although this which is significantly stronger. Moreover, unlike the may introduce new slack in the bounds. Finally, the analysis of [3], we do not have to assume that supp(X¯S) second part of the theorem shows how the accuracy of is random. XˆL in Frobenius norm can be improved by adding an The following theorem gives our recovery guarantee additional constraint or by post-processing the solution. for the constrained formulation (1). Now we state our recovery guarantees for the regular- ¯ ¯ m×n ized formulation (2). Theorem 2. Fix a target pair (XS, XL) ∈ R × m×n ¯ ¯ ¯ ¯ m×n R satisfying kY − (XS + XL)kvec(1) ≤ vec(1) and Theorem 3. Fix a target pair (XS, XL) ∈ R × ¯ ¯ m×n ¯ ¯ kY − (XS + XL)k∗ ≤ ∗. Assume the conditions (8), R . Let E := Y − (XS + XL) and (9), and (10) hold for some ρ > 0 and c > 1. ˆ ˆ m×n  := kEk Let (XS, XL) ∈ R be the solution to the convex 2→2 2→2 optimization problem (1). We have vec(∞) := kEkvec(∞) + kPT¯(E)kvec(∞) 0 n ˆ ¯ ˆ ¯ o ∗ := kPT¯(E)k∗. max kXS − XSkvec(1), kXL − XLkvec(1) ¯   Let k := | supp(X¯S)| and r¯ := rank(X¯L). Assume the −1 2 − α(ρ)β(ρ) ≤ 1 + (1 − 1/c) · · vec(1) conditions (3), (4), and (5) hold for some ρ > 0 and c > 1 − α(ρ)β(ρ) ˆ ˆ m×n 1. Let (XS, XL) ∈ R be the solution to the convex −1 2 − α(ρ)β(ρ) + (1 − 1/c) · · ∗/λ. optimization problem (2) augmented with the constraint 1 − α(ρ)β(ρ) ¯ kXS − Y kvec(∞) ≤ b for some b ≥ kXS − Y kvec(∞) ¯ If, in addition for some b ≥ kXLkvec(∞), either: (b = ∞ is allowed). Let • the optimization problem is augmented with the (1) r¯0 := constraint kX k ≤ b (and letting X := L vec(∞) eL   ¯   ˆ vec(∞) 2k vec(∞) XL), or λ + · · λ + γ + µ 1 − α(ρ)β(ρ) µ • XˆL is post-processed by replacing [XˆL]i,j with ˆ −1  [XeL]i,j := min{max{[XL]i,j, −b}, b} for all i, j, + 1 + 2µ 2→2 · 2¯r     then we also have 2α(ρ) vec(∞) 22→2 · · λ + γ + + 1 + . 1 − α(ρ)β(ρ) µ µ ¯ kXeL − XLkvec(2)  q  We have ˆ ¯ ˆ ¯ ≤ min kXL − XLkvec(1), 2b · kXL − XLkvec(1) . kXˆS − X¯Sk vec(1) √ The proof of Theorem 2 is in Section V. It is clear that r¯0·µ ¯ ¯ ¯ (1−1/c)λ + λk · µ + 2 kr¯ · µ + k · vec(∞) if Y = X¯ + X¯ , then we can set  =  = 0 and ≤ S L vec(1) ∗ 1 − α(ρ)β(ρ) we obtain exact recovery: XˆS = X¯S and XˆL = X¯L. kXˆ − X¯ k Moreover, any perturbation Y − (X¯S + X¯L) affects S S vec(2) the accuracy of (Xˆ , Xˆ ) in entry-wise 1-norm by an  q  S L ≤ min kXˆ − X¯ k , 2b · kXˆ − X¯ k amount O( +  /λ). Note that here, the parameter S S vec(1) S S vec(1) vec(1) ∗ √ λ serves to balance the entry-wise 1-norm and trace ˆ ¯ ˆ ¯ 0 kXL − XLk∗ ≤ 2¯r · kXS − XSkvec(2) + ∗ norm of the perturbation in the same way it is used r¯0 · (1 − 1/c)−1  in the objective function of (1). So, for instance, if we + + 2¯r · µ. have the simplified conditions (11), then we may choose 2 p λ = (5/3)γ/α(ρ) to satisfy (12), upon which the error The proof of Theorem 3 is in Section VI. As before, bound becomes if Y = X¯S + X¯L so E = 0, then we can set µ → 0 and n o obtain exact recovery with Xˆ = X¯ and Xˆ = X¯ . max kXˆ − X¯ k , kXˆ − X¯ k S S L L S S vec(1) L L vec(1) When the perturbation E is non-zero, we control the s ! 
¯ α(ρ) accuracy of XS in entry-wise 1-norm and 2-norm, and = O  + ·  . the accuracy of X¯ in trace norm. Under the simplified vec(1) γ ∗ L conditions (6), we can choose λ = (15/82)/α(ρ) and

5 µ = max{42→2, 2vec(∞)/(15λ)} to satisfy (7); this at random over all families of r¯ orthonormal vectors in leads to the error bounds Rm and Rn, respectively. Using arguments similar to ˆ ¯  those in [6], one can show that with high probability, kXS − XSkvec(1) = O rα¯ (ρ) max{2→2, α(ρ)vec(∞)} ˆ ¯ r¯log m kXL − XLk∗ = ¯ ¯ > kUU kvec(∞) = O √ q  m ˆ ¯ ˆ ¯ O r¯min b · kXS − XSkvec(1), kXS − XSkvec(1) r¯log n kV¯ V¯ >k = O  vec(∞) n 0  +  +r ¯ · max 2→2, α(ρ)vec(∞) r ! ∗ r¯log m kU¯k2→∞ = O (here, we have used the facts k¯ ≤ α(ρ)2, α(ρ)λ = Θ(1), m 0 ¯ r ! and r¯ = O(¯r), which also implies that k · vec(∞) = r¯log n kV¯ k = O , O(α(ρ)·α(ρ)vec(∞))). Finally, note that if the constraint 2→∞ n kXS − Y kvec(∞) ≤ b is added (i.e., b < ∞), then the ¯ requirement b ≥ kXS − Y kvec(∞) can be satisfied with ¯ so for the previously chosen ρ, we have b := kXSkvec(∞) + vec(∞). This allows for a possibly ˆ ¯ r ! improved bound on kXL − XLk∗. (log m)(log n) Our analysis centers around the construction of a dual β(ρ) = O r¯ and mn certificate using a least-squares method similar to that r ! in related works [1], [3]. The construction requires the (log m)(log n) γ = O r¯ . invertibility of PΩ¯ ◦ PT¯ (a composition of projection mn operators), which is established in our analysis by study- ing certain operator norms of PΩ¯ and PT¯ (in previous Therefore works, invertibility is established only under probabilis- ! tic assumptions [3] or stricter sparsity conditions [1]). k˜r¯(log m)(log n) α(ρ)β(ρ) = O and The rest of the analysis then relates the accuracy of the mn solutions to (1) and (2) to properties of the constructed ! dual certificate. k˜r¯(log m)(log n) α(ρ)γ = O , mn D. Examples both of which are  1 provided that We illustrate our main results with some simple ex- amples. mn k˜ ≤ δ · 1) Random models: We first consider a random model r¯(log m)(log n) for the matrices X¯S and X¯L [1]. Let the support of X¯S be chosen uniformly at random k˜ times over the [m]×[n] for a small enough constant δ ∈ (0, 1). In other words, matrix entries (so that one entry can be selected multiple when X¯L is low-rank, the matrix X¯S can have nearly times). The value of the entries in the chosen support can a constant fraction of its entries be non-zero while still be arbitrary. With high probability, we have allowing for exact decomposition of Y = X¯S +X¯L. Our guarantee improves over that of [1] by roughly a factor of ˜ ! k log n Ω((mn)1/4), but is worse by a factor of r¯(log m)(log n) k sign(X¯S)k1→1 = O and n relative to the guarantees of [3] for the random model. ! Therefore there is a gap between our generic determin- k˜ log m k sign(X¯S)k∞→∞ = O istic analysis and a direct probabilistic analysis of this m random model, and this gap seems unavoidable with sparsity conditions based on α(ρ). This is because X¯ so for ρ := p(n log m)/(m log n), we have L could be an n × n (for simplicity) block r ! (log m)(log n) with r blocks of n/r × n/r rank-1 matrices; such a α(ρ) = O k˜ . 2 mn matrix guarantees β(1) = O(r/n) but has just n /r non-zero entries. It is an interesting open problem to The logarithmic factors are due to collisions in the find alternative characterizations of supp(X¯S) that can random process. Now let U¯ and V¯ be chosen uniformly narrow or close this gap.
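The random model just described is easy to simulate. The following sketch is ours, not from the paper: it draws the support of $\bar{X}_S$ by $\tilde{k}$ independent uniform draws over the entries (so collisions are allowed, as in the text) and builds $\bar{X}_L$ from uniformly random orthonormal singular vector matrices; the particular choice of entry values and singular values is arbitrary, as the model permits.

```python
# Illustrative sketch (ours): a synthetic instance from the random model of Section II-D.1.
import numpy as np

def random_model_instance(m, n, k_tilde, r, seed=None):
    rng = np.random.default_rng(seed)
    # Sparse component: k_tilde uniform draws of entry positions (collisions possible),
    # with arbitrary (here standard normal) values on the chosen support.
    XS = np.zeros((m, n))
    rows = rng.integers(0, m, size=k_tilde)
    cols = rng.integers(0, n, size=k_tilde)
    XS[rows, cols] = rng.standard_normal(k_tilde)
    # Low-rank component: U, V uniformly random orthonormal r-frames (QR of Gaussians),
    # with arbitrary positive singular values.
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    XL = U @ np.diag(1.0 + rng.random(r)) @ V.T
    return XS, XL, XS + XL          # (X_S_bar, X_L_bar, Y)
```

Such instances can be fed to the solver sketches given after the optimization formulations to check exact recovery empirically.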

6 2) Principal component analysis with sparse corrup- III.TECHNICALPRELIMINARIES tions: Suppose X¯ is matrix of m data points lying in a L A. Norms, inner products, and projections low-dimensional subspace of Rn, and Z is a random matrix with independent Gaussian noise entries with Our analysis involves a variety of norms of vectors, 2 0 variance σ . Then Y = X¯L + Z is the standard matrices (viewed as elements of a as well model for principal component analysis. We augment as linear operators of vectors), and linear operators of the model with a sparse noise component X¯S to obtain matrices; we define these and related notions in this Y = X¯S + X¯L + Z; here, we allow the non-zero entries section. of X¯S to possibly approach infinity. 1) Entry-wise norms: For any p ∈ [1, ∞], define P p 1/p According to Theorem 3, we need to estimate kvkp := ( i |vi| ) be the p-norm of a vector v kZk2→2, kZkvec(∞), kPT¯(Z)kvec(∞), and kPT¯(Z)k∗. (with kvk∞ := maxi |vi|). Also, define kMkvec(p) := P p 1/p We have the following with high probability [10], ( |Mi,j| ) to be the entry-wise p-norm of a √ √ i,j matrix M (again, with kMkvec(∞) := maxi,j |Mi,j|). kZk2→2 ≤ σ m + σ n + O(σ). Note that k · kvec(2) corresponds to the Frobenius norm. Using standard arguments with the rotational invariance 2) Inner products, linear operators, and orthogonal of the Gaussian distribution, we also have m×n projections: We endow R with the inner product kZkvec(∞) ≤ O(σ log(mn)) and h·, ·i between matrices that induces the Frobenius norm > k · kvec(2); this is given by hM,Ni = tr(M N). kPT¯(Z)kvec(∞) ≤ O(σ log(mn)) For a linear operator T : Rm×n → Rm×n, we denote with high probability. Finally, by Lemma 5, we have its adjoint by T ∗; this is the unique linear operator that √ √ satisfies hT ∗(M),Ni = hM, T (N)i for all M ∈ m×n kPT¯(Z)k∗ ≤ 2¯rkZk2→2 ≤ 2¯rσ m + 2¯rσ n + O(¯rσ). R √ and N ∈ m×n (in this work, we only consider bounded Suppose (X¯ , X¯ ) has α(ρ) ≤ c ( mn/r¯), β(ρ) = R √ S L √ 1 linear operators). For any two linear operators T and Θ(¯r/ mn), and γ = Θ(¯r/ mn) and satisfies the 1 T , we let T ◦ T denote their composition as defined simplified condition (6). This can be achieved with 2 1 2 by (T ◦ T )(M) := T (T (M)). c c ≤ 1/41 in Proposition 1. Also assume λ and µ are 1 2 1 2 1 2 Given a subspace W ⊆ m×n, we let W ⊥ denote its chosen to satisfy (7), and that b ≥ kX¯ k + . R L vec(∞) vec(∞) orthogonal complement, and let P : m×n → m×n Then we note that k¯ = O(c2mn/r¯2), and thus have from W R R 1 denote the orthogonal projector to W with respect to Theorem 3 (see the discussion thereafter): h·, ·i, i.e., the unique linear operator with range W and ˆ ¯ ∗ kXS − XSkvec(1) satisfying PW = PW and PW ◦ PW = PW . √ √ √ √  = O c1 mn max{σ m + σ n, σ mn log(mn)/r¯} 3) Induced norms: For any two vector norms k · kp and k · kq, define kMkp→q := maxx6=0 kMxkq/kxkp to = O (σc1mn log(mn)/r¯) be the corresponding induced operator norm of a matrix ˆ ¯ p kXL − XLk∗ = O bσc1mn log(mn)/r¯ M. Our analysis uses the following special cases which √ √ √  +rσ ¯ ( m + n)) + c1 mn , have alternative definitions: • where we may take b = O(σ log(mn) + kXLkvec(∞))). kMk1→1 = maxj kMejk1, Now consider the situation where both m, n → • kMk1→2 = maxj kMejk2, ¯ • ∞, and assume that kXLkvec(∞) remains bounded. 
If kMk2→2 = spectral norm of M 2 c1(log(mn)) = o(1)—which means that the number (i.e., largest of M), > of corruptions per column is o(m/(log(mn))2) and • kMk2→∞ = maxi kM eik2, and > the number of deterministic corruptions per row is • kMk∞→∞ = maxi kM eik1. 2 o(n/(log(mn)) )—then Here, ei is the ith coordinate vector which has a 1 in the √ √ ith position and 0 elsewhere. kXˆL − X¯Lk∗ = O(¯rσ( m + n)) Finally, we also consider induced operator norms so the normalized trace norm error of Xˆ tends to zero m×n m×n L of linear matrix operators T : R → R (in 1 particular, projection operators with respect to h·, ·i). √ kXˆL − X¯Lk∗ → 0. mn For any two matrix norms k · k♦ and k · k♥, define This means that we can correctly recover the principal kT k♦→♥ := maxM6=0 kT (M)k♥/kMk♦. components of X¯L with both deterministic corruptions 4) Other norms: The trace norm (or nuclear norm) and random noise, when both m and n are large and kMk∗ of a matrix M is the sum of the singular values 2 c1(log(mn)) = o(1) in Proposition 1. of M. We will also make use of a hybrid matrix norm

7 k · k](ρ), parametrized by ρ > 0, which we define by −1 The following lemma is the dual of Lemma 2. kMk](ρ) := max{ρkMk1→1, ρ kMk∞→∞}. Lemma 3. For any M ∈ m×n, we have for all ρ > 0, Also define kMk := sup hM,Ni, i.e., the R [(ρ) kNk](ρ)≤1 dual of k · k](ρ) (see below). kMk[(ρ) ≤ kMk∗. 5) Dual pairs: The matrix norm k · k♥ is said to m×n be dual to k · k♠ if, for all M ∈ R , kMk♥ = Proof: We know that kMk[(ρ) = hM,Ni for some sup hM,Ni matrix N such that kNk](ρ) = 1. Therefore kNk2→2 ≤ kNk♠≤1 . 1 from Lemma 2, and thus using Proposition 2, Proposition 2. Fix any matrix norm k·k♠, and let k·k♥ m×n m×n be its dual. For all M ∈ R and N ∈ R , we have kMk[(ρ) = hM,Ni ≤ kMk∗kNk2→2 ≤ kMk∗.
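All of the norms catalogued in this section reduce to a few lines of numpy. The sketch below is our illustration (function names are ours): it implements the special-case induced norms, the entry-wise norms, and the hybrid $\sharp(\rho)$ norm, and numerically exercises the inequality of Lemma 2 on a random matrix.

```python
# Illustrative sketch (ours): the matrix norms of Section III and the Lemma 2 inequality.
import numpy as np

def norm_1to1(M):    return np.abs(M).sum(axis=0).max()                # max column abs sum
def norm_inftoinf(M): return np.abs(M).sum(axis=1).max()               # max row abs sum
def norm_2to2(M):    return np.linalg.svd(M, compute_uv=False)[0]      # spectral norm
def norm_2toinf(M):  return np.linalg.norm(M, axis=1).max()            # max row 2-norm
def norm_vec(M, p):  return np.linalg.norm(M.ravel(), p)               # entry-wise p-norm
def norm_sharp(M, rho):                                                # hybrid #(rho) norm
    return max(rho * norm_1to1(M), norm_inftoinf(M) / rho)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((30, 50))
    for rho in (0.5, 1.0, 2.0):
        assert norm_2to2(M) <= norm_sharp(M, rho) + 1e-9               # Lemma 2
```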

hM,Ni ≤ kMk♠kNk♥. Proposition 3. Fix any any linear matrix operator T : Finally we state a lemma concerning the invertibility m×n m×n of a certain block-form operator used in our analysis. R → R and any pair of matrix norms k · k♠ and k · k . We have m×n ♣ Lemma 4. Fix any matrix norm k · k♠ on R ∗ m×n m×n kT k♠→♣ = kT k♦→♥, and linear operators T1 : R → R and T2 : Rm×n → Rm×n. Let I : Rm×n → Rm×n be the identity where k·k♥ is dual to k·k♠, and k·k♦ is dual to k·k♣. operator, and suppose kT1 ◦ T2k♠→♠ < 1.

The following pairs of matrix norms are dual to each 1) I − T1 ◦ T2 is invertible and satisfies other: −1 1 1) k · kvec(p) and k · kvec(q) where 1/p + 1/q = 1; k(I − T1 ◦ T2) k♠→♠ ≤ . 1 − kT1 ◦ T2k♠→♠ 2) k · k∗ and k · k2→2; m×n m×n 3) k · k](ρ) and k · k[(ρ) (by definition). 2) The linear operator on R × R 6) Some lemmas: First we show that the k·k](ρ) norm   IT1 (for any ρ > 0) bounds the spectral norm k · k2→2. T2 I m×n Lemma 2. For any M ∈ R , we have for all ρ > 0, is invertible, and its inverse is given by kMk ≤ kMk . 2→2 ](ρ)  −1 IT1 Proof: Let σ be the largest singular value of M, T I m n 2 and let u ∈ R and v ∈ R be, respectively, associated  −1 −1  (I − T1 ◦ T2) −(I − T1 ◦ T2) ◦ T1 left and right singular vectors. Then = −1 −1 . −(I − T2 ◦ T1) ◦ T2 (I − T2 ◦ T1)  0 ρM   ρ1/2u  Proof: The first claim is a standard application ρ−1M > 0 ρ−1/2v 1 of Taylor expansions. The second claim then follows  ρ1/2Mv   ρ1/2u  = = σ . from formulae of inverses using Schur ρ−1/2M >u ρ−1/2v 1 1 complements.
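Lemma 4 is the mechanism later used to invert $I - P_{\bar{\Omega}} \circ P_{\bar{T}}$, and its proof (a Taylor, i.e. Neumann, expansion) also suggests how the inverse can be applied in practice: sum the series $\sum_k (T_1 \circ T_2)^k$, which converges when the composition is a contraction. A small sketch of this idea, ours and with the operators supplied as callables on matrices:

```python
# Illustrative sketch (ours): applying (I - T1∘T2)^{-1} via the Neumann series of Lemma 4.
import numpy as np

def apply_inverse(T1, T2, M, tol=1e-12, max_terms=10_000):
    """Return (I - T1∘T2)^{-1}(M) = sum_k (T1∘T2)^k(M), valid when ||T1∘T2|| < 1
    in some operator norm (e.g. the contraction of Lemma 9)."""
    term = M.copy()
    out = M.copy()
    for _ in range(max_terms):
        term = T1(T2(term))          # next term of the series
        out += term
        if np.abs(term).max() <= tol:
            break
    return out
```

In the dual-certificate construction of Section IV-B, $T_1$ and $T_2$ would be the projectors $P_{\bar{\Omega}}$ and $P_{\bar{T}}$, with the required contraction supplied by Lemma 9.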

Moreover, by definition of k · k1→1,

   1/2  B. Projection operators and subdifferential sets 0 ρM ρ u ρ−1/2M > 0 ρ−1/2v Recall the definitions of the following subspaces 1    1/2  m×n 0 ρM ρ u Ω(XS) := {X ∈ : supp(X) ⊆ supp(XS)} ≤ . R ρ−1M > 0 ρ−1/2v 1→1 1 and Therefore m×n   T (XL) := {X1 + X2 ∈ R : 0 ρM kMk2→2 = σ ≤ −1 > range(X ) ⊆ range(X ), ρ M 0 1 L 1→1 > > −1 > range(X2 ) ⊆ range(XL )}. = max{kρ M k1→1, kρMk1→1} −1 = max{ρ kMk∞→∞, ρkMk1→1} The orthogonal projectors to these spaces are given in the following proposition. = kMk](ρ).

8 m×n m×n Proposition 4. Fix any XS ∈ R and XL ∈ R . Proposition 5. The subdifferential set of XS 7→ m×n For any matrix M ∈ R , kXSkvec(1) is  Mi,j if (i, j) ∈ supp(XS) ∂X (kXSkvec(1)) [PΩ(X )(M)]i,j = S S 0 otherwise m×n = {G ∈ R : kGkvec(∞) ≤ 1, PΩ(XS )(G) for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, and = sign(XS)};

> > > > PT (XL)(M) = UU M + MVV − UU MVV the subdifferential set of XL 7→ kXLk∗ is where U and V are the matrices of left and right singular m×n ∂XL (kXLk∗) = {G ∈ R : vectors of XL. kGk2→2 ≤ 1, PT (XL)(G) = orth(XL)}. Lemma 5. Under the setting of Proposition 4, with k := | supp(XS)|, The following lemma is a simple consequence of √ subgradient properties. kP (M)k ≤ kkP (M)k Ω(XS ) vec(1) √ Ω(XS ) vec(2) Lemma 6. Fix λ > 0 and define the function ≤ kkMkvec(2) g(XS,XL) := λkXSkvec(1) + kXLk∗. Consider any (X¯ , X¯ ) in m×n × m×n. If there exists Q ∈ m×n kPΩ(XS )(M)kvec(1) ≤ kkPΩ(XS )(M)kvec(∞) S L R R R such that: Q is a subgradient of λkXSkvec(1) at XS = ≤ kkMkvec(∞) X¯S, Q is a subgradient of kXLk∗ at XL = X¯L, and kPT (XL)(M)k2→2 ≤ 2kMk2→2 ⊥ ⊥ kPΩ(X¯S ) (Q)kvec(∞) ≤ λ/c and kPT (X¯L) (Q)k2→2 ≤ kPT (XL)(M)k∗ ≤ 2 rank(XL)kMk2→2 1/c for some c > 1, then p kPT (XL)(M)kvec(2) ≤ 2 rank(XL)kMk2→2. g(XS,XL) − g(X¯S, X¯L)

Proof: The first and second claims rely on the fact ≥ hQ, XS + XL − X¯S − X¯Li that | supp(PΩ(X )(M))| ≤ | supp(XS)|, as well as the ¯ S + (1 − 1/c)λkPΩ¯ ⊥ (XS − XS)kvec(1) fact that P is an orthonormal projector with respect Ω(XS ) ¯ + (1 − 1/c)kPT¯⊥ (XL − XL)k∗ to the inner product that induces the k·kvec(2) norm. For the third claim, note that m×n m×n for all (XS,XL) ∈ R × R . ¯ ¯ ¯ ¯ kPT (XL)(M)k2→2 Proof: Let Ω := Ω(XS), T := T (XL), ∆S := ¯ ¯ ≤ kUU >Mk + k(I − UU >)MVV >k XS − XS, and ∆L : XL − XL. For any subgradient 2→2 2→2 ¯ G ∈ ∂XS (λkXSkvec(1)), we have G − Q = PΩ¯ (G) + ≤ 2kMk2→2. PΩ¯ ⊥ (G) − PΩ¯ (Q) − PΩ¯ ⊥ (Q) = PΩ¯ ⊥ (G) − PΩ¯ ⊥ (Q). The remaining claims use a similar decomposition as Therefore the third claim as well as the fact that rank(UU >M) ≤ λkX¯ + ∆ k − λkX¯ k − hQ, ∆ i rank(X ) and rank((I−UU >)MVV >)} ≤ rank(X ). S S vec(1) S vec(1) S L L ¯ ≥ sup{hG, ∆Si − hQ, ∆Si : G ∈ ∂XS (λkXSkvec(1))} ¯ Define ≥ sup{hG − Q, ∆Si : G ∈ ∂XS (λkXSkvec(1))}

= sup{hP¯ ⊥ (G) − P¯ ⊥ (Q), ∆ i : sign(X ) ∈ {−1, 0, +1}m×n Ω Ω S S ¯ G ∈ ∂XS (λkXSkvec(1))} to be the matrix whose (i, j)th entry is sign([XS]i,j), = sup{hPΩ¯ ⊥ (G) − PΩ¯ ⊥ (Q), PΩ¯ ⊥ (∆S)i : and define ¯ G ∈ ∂X (λkXSkvec(1))} > S orth(XL) := UV , = sup{hPΩ¯ ⊥ (G), PΩ¯ ⊥ (∆S)i ¯ where U and V , respectively, are matrices of the left and − hPΩ¯ ⊥ (Q), PΩ¯ ⊥ (∆S)i : G ∈ ∂XS (λkXSkvec(1))} right orthonormal singular vectors of XL corresponding = λkPΩ¯ ⊥ (∆S)kvec(1) − hPΩ¯ ⊥ (Q), PΩ¯ ⊥ (∆S)i to non-zero singular values. The following proposition ≥ λkPΩ¯ ⊥ (∆S)kvec(1) characterizes the subdifferential sets for the non-smooth − kPΩ¯ ⊥ (Q)kvec(∞)kPΩ¯ ⊥ (∆S)kvec(1) norms k · kvec(1) and k · k∗ [11]. ≥ λ(1 − 1/c)kPΩ¯ ⊥ (∆S)kvec(1)

9 where the second-to-last inequality uses the duality of This implies, for all ρ > 0, k · kvec(1) and k · kvec(∞) and Proposition 3. Similarly, kPΩ¯ kvec(∞)→](ρ) ≤ α(ρ). ¯ ¯ m×n kXL − ∆Lk∗ − kXLk∗ − hQ, ∆Li Proof: Define s(XS) ∈ {0, 1} to be the entry-

≥ (1 − 1/c)kPT¯⊥ (∆L)k∗ wise absolute value of sign(XS). We have by noting the duality of k · k∗ and k · k2→2. Combining kPΩ¯ (M)kp→p = max{kPΩ¯ (M)vkp : kvkp ≤ 1} these gives the desired inequality. ≤ kPΩ¯ (M)kvec(∞)

max{ks(PΩ¯ (M))vkp : kvkp ≤ 1} IV. RANK-SPARSITY INCOHERENCE ≤ kMkvec(∞) Throughout this section, we fix a target (X¯S, X¯L) ∈ max{ks(X¯S)vkp : kvkp ≤ 1} m×n × m×n, and let Ω¯ := Ω(X¯ ) and T¯ := T (X¯ ). R R S L = kMk k sign(X¯ )k . Also let U¯ and V¯ be, respectively, matrices of the left and vec(∞) S p→p ¯ right singular vectors of XL corresponding to non-zero The second part follows from the definitions of k · k](ρ) singular values. Recall the following structural properties and α(ρ). of X¯S and X¯L: Lemma 8. For any M ∈ Rm×n, we have ¯ α(ρ) := k sign(XS)k](ρ) kP ¯(M)kvec(∞) ¯ −1 ¯ T = max{ρk sign(XS)k1→1, ρ k sign(XS)k∞→∞}; ¯ ¯ > ¯ ¯ > ≤ kUU kvec(∞)kMk1→1 + kV V kvec(∞)kMk∞→∞ −1 ¯ ¯ > ¯ ¯ > β(ρ) := ρ kUU kvec(∞) + ρkV V kvec(∞) + kU¯k2→∞kV¯ k2→∞kMk2→2. + kU¯k2→∞kV¯ k2→∞; This implies, for all ρ > 0, ¯ ¯ ¯ > γ := k orth(XL)kvec(∞) = kUV kvec(∞). kPT¯k](ρ)→vec(∞) ≤ β(ρ). The parameter ρ is a balancing parameter to handle ¯ ¯ > disparity between row and column dimensions. The Proof: We have kPT¯(M)kvec(∞) = kUU M + ¯ ¯ > ¯ ¯ > ¯ ¯ > ¯ ¯ > quantity α(ρ) is the maximum number of non-zero MV V − UU MV V kvec(∞) ≤ kUU Mkvec(∞) + ¯ ¯ > ¯ ¯ > ¯ ¯ > entries in any single row or column. The quantities β(ρ) kMV V kvec(∞) + kUU MV V kvec(∞) by the trian- and γ measure the coherence of the singular vectors of gle inequality. The bounds for each term now follow from the definitions: X¯L, that is, the alignment of the singular vectors with the coordinate basis. For instance, under the conditions ¯ ¯ > > ¯ ¯ > kUU Mkvec(∞) = max kM UU eik∞ of Proposition 1, we have (with ρ = pn/m) i > > ≤ kM k∞→∞ max kU¯U¯ eik∞ √ i α(ρ) ≤ c1 mn, = kMk kU¯U¯ >k ; 3c rank(X¯ ) c rank(X¯ ) 1→1 vec(∞) β (ρ) ≤ 2 √ L and γ ≤ 2 √ L mn mn ¯ ¯ > ¯ ¯ > kMV V kvec(∞) = max kMV V ejk∞ j for some constants c1 and c2. > ≤ kMk∞→∞ max kV¯ V¯ ejk∞ j ¯ ¯ A. Operator norms of projection operators = kMk∞→∞kV V kvec(∞);

We show that under the condition infρ>0 α(ρ)β(ρ) < and 1, the pair (X¯ , X¯ ) is identifiable from its sum X¯ + S L S ¯ ¯ > ¯ ¯ > kUU MV V kvec(∞) X¯L (Theorem 1). This is achieved by proving that the > ¯ ¯ > ¯ ¯ > composition of projection operators PΩ¯ and PT¯ is a = max |ei U(U MV )V ej| i,j contraction as per Lemma 1, which in turn implies that ¯ > ¯ > ¯ ¯ > ¯ ¯ ≤ max kU eik2kU MV k2→2kV ejk2 Ω ∩ T = {0}. i,j The following two lemmas bound the projection op- ≤ kMk2→2kU¯k2→∞kV¯ k2→∞ erators PΩ¯ and PT¯ in complementary norms. ¯ ¯ ≤ kMk](ρ)kUk2→∞kV k2→∞ Lemma 7. For any M ∈ Rm×n and p ∈ {1, ∞}, we have where the second step follows by Cauchy-Schwarz, and the fourth step follows from Lemma 2. The second part ¯ kPΩ¯ (M)kp→p ≤ k sign(XS)kp→pkMkvec(∞). now follows the definition of β(ρ).
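The two projectors, and the contraction that drives Theorem 1, can also be checked numerically. The following sketch is ours: it implements $P_{\bar{\Omega}}$ and $P_{\bar{T}}$ exactly as defined in Section II-B and verifies the bound of Lemma 1 (Lemma 10) on random inputs.

```python
# Illustrative sketch (ours): the projectors P_Omega, P_T and the contraction of Lemma 1.
import numpy as np

def proj_omega(M, support_mask):
    """P_Omega: keep entries on supp(XS_bar), zero elsewhere."""
    return M * support_mask

def proj_T(M, U, V):
    """P_T: U U^T M + M V V^T - U U^T M V V^T, with U, V the singular vector matrices."""
    PU, PV = U @ U.T, V @ V.T
    return PU @ M + M @ PV - PU @ M @ PV

def check_contraction(XS_bar, XL_bar, rho, trials=20, seed=None):
    rng = np.random.default_rng(seed)
    mask = (XS_bar != 0).astype(float)
    U, s, Vt = np.linalg.svd(XL_bar, full_matrices=False)
    r = int((s > 1e-12 * s.max()).sum())
    U, V = U[:, :r], Vt[:r, :].T
    alpha = max(rho * mask.sum(axis=0).max(), mask.sum(axis=1).max() / rho)
    beta = (np.abs(U @ U.T).max() / rho + rho * np.abs(V @ V.T).max()
            + np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max())
    for _ in range(trials):
        M = rng.standard_normal(XS_bar.shape)
        lhs = np.abs(proj_omega(proj_T(M, U, V), mask)).sum()   # ||P_Omega(P_T(M))||_vec(1)
        assert lhs <= alpha * beta * np.abs(M).sum() + 1e-8     # Lemma 1 / Lemma 10
    return alpha * beta
```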

10 Now we show that the composition of PΩ¯ and PT¯ with E = 0, while the guarantees for the regularized gives a contraction under the certain norms and their formulation takes E = Y − (X¯S + X¯L). duals. Theorem 5. Pick any c > 1, ρ > 0, and E ∈ Rm×n. ¯ Lemma 9. For all ρ > 0, Let k := | supp(X¯S)| and r¯ := rank(X¯L). Let

1) kPΩ¯ ◦ PT¯k](ρ)→](ρ) ≤ α(ρ)β(ρ); 2→2 := kEk2→2 2) kPT¯ ◦ PΩ¯ kvec(∞)→vec(∞) ≤ α(ρ)β(ρ); vec(∞) := kEkvec(∞) + kPT¯(E)kvec(∞). Proof: Immediate from Lemma 7 and Lemma 8. If the following conditions hold: Lemma 10. For all ρ > 0, α(ρ)β(ρ) < 1 (13) 1) kPT¯ ◦ PΩ¯ k[(ρ)→[(ρ) ≤ α(ρ)β(ρ); (1 − α(ρ)β(ρ))(1 − c · µ−1 ) 2) kPΩ¯ ◦ PT¯kvec(1)→vec(1) ≤ α(ρ)β(ρ). λ ≤ 2→2 ∗ ∗ ∗ c · α(ρ) Proof: First note that (PT¯ ◦ PΩ¯ ) = PΩ¯ ◦ PT¯ = −1 α(ρ)µ vec(∞) + α(ρ)γ PΩ¯ ◦ PT¯ because PΩ¯ and PT¯ are self-adjoint, and − (14) ∗ α(ρ) similarly (PΩ¯ ◦PT¯) = PT¯ ◦PΩ¯ . Now the claim follows −1 by Proposition 3 and Lemma 9, using the facts that γ + µ (2 − α(ρ)β(ρ))vec(∞) λ ≥ c · > 0 (15) k · k[(ρ) is dual to k · k](ρ) and that k · kvec(1) is dual 1 − α(ρ)β(ρ) − c · α(ρ)β(ρ) to k · k . vec(∞) (these are a restatement of (3), (4), and (5)), then Note that Lemma 1 is encompassed by Lemma 10. −1 ¯ ¯ Another consequence of these contraction properties is QΩ¯ := (I − PΩ¯ ◦ PT¯) λ sign(XS) − PΩ¯ (orth(XL)) the following uncertainty principle, analogous to one −1  − µ (P¯ ◦ P ¯⊥ )(E) ∈ Ω¯ and stated by [1], which effectively states that a matrix X Ω T −1 ¯ ¯ QT¯ := (I − PT¯ ◦ PΩ¯ ) orth(XL) − λPT¯(sign(XS)) cannot have both k sign(X)k](ρ) and k orth(X)kvec(∞) −1  ¯ simultaneously small. − µ (PT¯ ◦ PΩ¯ ⊥ )(E) ∈ T

Theorem 4. If X = X¯S = X¯L 6= 0, then are well-defined and satisfy infρ>0 α(ρ)β(ρ) ≥ 1. −1 ¯ PΩ¯ (QΩ¯ + QT¯ + µ E) = λ sign(XS) −1 ¯ Proof: Note that the non-zero element X lives in PT¯(QΩ¯ + QT¯ + µ E) = orth(XL) Ω¯ ∩ T¯, so we get the conclusion by the contrapositive of and Theorem 1. −1 kPΩ¯ ⊥ (QΩ¯ + QT¯ + µ E)kvec(∞) ≤ λ/c −1 kPT¯⊥ (QΩ¯ + QT¯ + µ E)k2→2 ≤ 1/c. B. Dual certificate Moreover, The incoherence properties allow us to construct an α(ρ) approximate dual certificate (Q¯ ,Q ¯) ∈ Ω¯ × T¯ that is −1  Ω T kQΩ¯ k2→2 ≤ · λ + γ + µ vec(∞) central to the analysis of the optimization problems (1) 1 − α(ρ)β(ρ) and (2). 2α(ρ) −1  kQ ¯k ≤ · λ + γ + µ  The certificate is constructed as the solution to the T 2→2 1 − α(ρ)β(ρ) vec(∞) −1 linear system + 1 + 2µ 2→2  −1 ¯ PΩ¯ (QΩ¯ + QT¯ + µ E) = λ sign(XS) kQT¯k∗ ≤ 2¯rkQT¯k2→2 −1 ¯ PT¯(QΩ¯ + QT¯ + µ E) = orth(XL) 1 −1  kQ ¯k ≤ · λ + γ + µ  T vec(∞) 1 − α(ρ)β(ρ) vec(∞) for some matrix E ∈ Rm×n; this can be equivalently 2 −1  written as kQ¯ k ≤ · λ + γ + µ  Ω vec(∞) 1 − α(ρ)β(ρ) vec(∞)     −1  IP¯ Q¯ λ sign(X¯ ) − µ P¯ (E) Ω Ω S Ω kQ¯ k ≤ k¯kQ¯ k = ¯ −1 . Ω vec(1) Ω vec(∞) PT¯ I QT¯ orth(XL) − µ PT¯(E) 2 −1 −1  kQΩ¯ + QT¯kvec(2) ≤ λkQΩ¯ kvec(1) 1 + µ λ vec(∞) We show the existence of the dual certificate (QΩ¯ ,QT¯) −1  + kQ ¯k∗ 1 + 2µ 2→2 . under the conditions (3), (4), and (5) relative to an T arbitrary matrix E. Recall that the recovery guarantees Remark 1. The dual certificate constitutes an approxi- −1 for the constrained formulation requires the conditions mate subgradient in the sense that QΩ¯ + QT¯ + µ E

11 ¯ is a subgradient of both λkXSkvec(1) at XS = XS, and Now we bound kQT¯kvec(∞) as kXLk∗ at XL = X¯L. Proof: Under the condition (13), we have kQT¯kvec(∞) −1 ¯ ¯ α(ρ)β(ρ) < 1, and therefore Lemma 9 and Lemma 4 = (I − PT¯ ◦ PΩ¯ ) orth(XL) − λPT¯(sign(XS) −1  imply that the operators I − PΩ¯ ◦ PT¯ and I − PT¯ ◦ PΩ¯ ⊥ − µ (PT¯ ◦ PΩ¯ )(E) vec(∞) are invertible and satisfy 1 ¯ ¯ ≤ · orth(XL) − λP ¯(sign(XS) 1 − α(ρ)β(ρ) T −1 −1 1 − µ (PT¯ ◦ PΩ¯ ⊥ )(E) vec(∞) k(I − PΩ¯ ◦ PT¯) k](ρ)→](ρ) ≤ , 1 − α(ρ)β(ρ) 1 ¯ ≤ · k orth(XL)kvec(∞) −1 1 1 − α(ρ)β(ρ) k(I − P ¯ ◦ P¯ ) k ≤ . T Ω vec(∞)→vec(∞) 1 − α(ρ)β(ρ) ¯ + λkPT¯(sign(XS))kvec(∞) −1  + µ k(PT¯ ◦ PΩ¯ ⊥ )(E)kvec(∞) 1 Thus QΩ¯ and QT¯ are well-defined. We can bound ≤ · γ + λα(ρ)β(ρ) + µ−1  1 − α(ρ)β(ρ) vec(∞) kQΩ¯ k2→2 as (Lemma 9).

kQΩ¯ k2→2 ≤ kQΩ¯ k](ρ) (Lemma 2) Above, we have used the bound k(PT¯ ◦ −1 ¯ ¯ P¯ ⊥ )(E)k = kP ¯(E) − (P ¯ ◦ P¯ )(E)k ≤ = (I − PΩ¯ ◦ PT¯) λ sign(XS) − PΩ¯ (orth(XL)) Ω vec(∞) T T Ω vec(∞) −1  kPT¯(E)kvec(∞) + α(ρ)β(ρ)kEkvec(∞) ≤ vec(∞). − µ (P¯ ◦ P ¯⊥ )(E) Ω T ](ρ) Therefore, 1 ¯ ¯ ≤ · λ sign(XS) − PΩ¯ (orth(XL)) 1 − α(ρ)β(ρ) −1 kP¯ ⊥ (Q ¯ + µ E)k −1 Ω T vec(∞) − µ (PΩ¯ ◦ PT¯⊥ )(E) ](ρ) −1 ≤ kQT¯kvec(∞) + µ kPΩ¯ ⊥ (E)kvec(∞) 1 ¯ 1 ≤ · λk sign(XS)k](ρ) ≤ · γ + λα(ρ)β(ρ) + µ−1  1 − α(ρ)β(ρ) 1 − α(ρ)β(ρ) vec(∞) ¯ −1  + kP¯ (orth(XL))k](ρ) + µ k(P¯ ◦ P ¯⊥ )(E)k](ρ) −1 Ω Ω T + µ vec(∞). α(ρ) −1  ≤ · λ + γ + µ kPT¯⊥ (E)kvec(∞) 1 − α(ρ)β(ρ) The condition (15) now implies that this quantity is at (Lemma 7) most λ/c. α(ρ) We also have ≤ · λ + γ + µ−1  . 1 − α(ρ)β(ρ) vec(∞) −1 ¯ kQT¯k2→2 = kPT¯(QΩ¯ + µ E) − orth(XL)k2→2 2α(ρ) −1  Above, we have used the bound kP ¯⊥ (E)k = ≤ · λ + γ + µ  T vec(∞) 1 − α(ρ)β(ρ) vec(∞) kE − PT¯(E)kvec(∞) ≤ vec(∞). Therefore, −1 + 1 + 2µ 2→2

−1 since kP ¯(Q¯ )k ≤ 2kQ¯ k and kP ¯(E)k ≤ kPT¯⊥ (QΩ¯ + µ E)k2→2 T Ω 2→2 Ω 2→2 T 2→2 22→2 by Lemma 5, and > > kPT¯⊥ (E)k2→2 ≤ k(I − U¯U¯ )Q¯ (I − V¯ V¯ )k + Ω 2→2 µ −1 ¯ −1 kQΩ¯ kvec(∞) = kPΩ¯ (QT¯ + µ E) − λ sign(XS)kvec(∞) ≤ kQΩ¯ k2→2 + µ 2→2 1 −1  α(ρ) −1 −1 ≤ · λ + γ + µ vec(∞) ≤ · (λ + γ + µ vec(∞)) + µ 2→2. 1 − α(ρ)β(ρ) 1 − α(ρ)β(ρ) −1 + λ + µ vec(∞).

The condition (14) now implies that this quantity is at The bounds on kQT¯k∗ and kQΩ¯ kvec(1) follow from ¯ most 1/c. the facts that rank(QT¯) ≤ 2¯r and k supp(QΩ¯ )k ≤ k.

12 Finally, Using the triangle inequality, we have 2 kQΩ¯ + QT¯kvec(2) ˆ ˆ = hQΩ¯ , PΩ¯ (QΩ¯ + QT¯)i + hQT¯, PT¯(QΩ¯ + QT¯)i λkXSkvec(1) + kXLk∗ ¯ −1 ¯ ¯ = hQΩ¯ , λPΩ¯ (sign(XS)) − µ PΩ¯ (E)i = λkXS + ∆Skvec(1) + kXL + ∆Lk∗ ¯ −1 ¯ + hQT¯, PT¯(orth(XL)) − µ PT¯(E)i = λkXS + ∆mid + ∆avgkvec(1) −1 −1  ≤ λkQΩ¯ kvec(1) 1 + µ λ kPΩ¯ (E)kvec(∞) + kX¯L − ∆mid + ∆avgk∗ −1  ¯ + kQT¯k∗ 1 + µ kPT¯(E)k2→2 ≥ λkXS + ∆midkvec(1) − λk∆avgkvec(1) −1 −1  ¯ ≤ λkQΩ¯ kvec(1) 1 + µ λ vec(∞) + kXL − ∆midk∗ − k∆avgk∗. −1  + kQT¯k∗ 1 + 2µ 2→2 . ˆ ˆ Now using the fact that λkXSkvec(1) + kXLk∗ ≤ λkX¯ k + kX¯ k gives the claim. V. ANALYSIS OF CONSTRAINED FORMULATION S vec(1) L ∗ Throughout this section, we fix a target decomposition Lemma 13. Let k¯ := | supp(X¯ )|. Assume the condi- ¯ ¯ S (XS, XL) that satisfies the constraints of (1), and let tions of Theorem 5 hold with E = 0. We have (XˆS, XˆL) be the optimal solution to (1). Let ∆S := XˆS − X¯S and ∆L := XˆL − X¯L. We show that under the conditions of Theorem 5 with E = 0 (recall that kP¯ (∆mid)kvec(1) this does not mean we assume Y − X¯ − X¯ = 0) and Ω S L (1 − 1/c)−1 appropriately chosen λ, solving (1) accurately recovers ≤ · (k∆ k + k∆ k /λ). 1 − α(ρ)β(ρ) avg vec(1) avg ∗ the target decomposition (X¯S, X¯L). We decompose the errors into symmetric and anti- symmetric parts ∆avg := (∆S + ∆L)/2 and ∆mid := Proof: ∆ = P¯ (∆ ) + (∆S − ∆L)/2. The constraints allow us to easily bound Because mid Ω mid P¯ ⊥ (∆ ) = P ¯(∆ ) + P ¯⊥ (∆ ) ∆avg, so most of the analysis involves bounding ∆mid Ω mid T mid T mid , we have in terms of ∆avg. the equation

Lemma 11. k∆avgkvec(1) ≤ vec(1) and k∆avgk∗ ≤ ∗.

PΩ¯ (∆mid) − PT¯(∆mid) = −PΩ¯ ⊥ (∆mid) + PT¯⊥ (∆mid). Proof: Since both (XˆS, XˆL) and (X¯S, X¯L) are feasible solutions to (1), we have for ♦ ∈ {vec(1), ∗},

1 k∆avgk♦ = /2k∆S + ∆Lk♦ Separately applying PΩ¯ and PT¯ to both sides gives 1 = /2k(XˆS + XˆL − Y ) − (X¯S + X¯L − Y )k♦   1 ˆ ˆ ¯ ¯       ≤ /2 kX + X − Y k + kX + X − Y k IP¯ P¯ (∆ ) (P¯ ◦ P ¯⊥ )(∆ ) S L ♦ S L ♦ Ω Ω mid = Ω T mid . PT¯ I −PT¯(∆mid) −(PT¯ ◦ PΩ¯ ⊥ )(∆mid) ≤ ♦.

Under the condition α(ρ)β(ρ) < 1, Lemma 10 and Lemma 12. Assume the conditions of Theorem 5 hold Lemma 4 imply that with E = 0. We have

λkPΩ¯ ⊥ (∆mid)kvec(1) + kPT¯⊥ (∆mid)k∗ −1 1 −1  k(I − P¯ ◦ P ¯) kvec(1)→vec(1) ≤ ≤ (1 − 1/c) λk∆avgkvec(1) + k∆avgk∗ . Ω T 1 − α(ρ)β(ρ)

Proof: Let Q := QΩ¯ + QT¯ be the dual certificate guaranteed by Theorem 5. Note that Q satisfies the and that conditions of Lemma 6, so we have ¯ ¯ λkXS + ∆midkvec(1) + kXL − ∆midk∗ −1 P¯ (∆mid) = (I − P¯ ◦ P ¯) (P¯ ◦ P ¯⊥ )(∆mid) − λkX¯Sk − kX¯Lk∗ Ω Ω T Ω T vec(1)   + (PΩ¯ ◦ PT¯ ◦ PΩ¯ ⊥ )(∆mid) . ≥ (1 − 1/c) λkPΩ¯ ⊥ (∆mid)kvec(1) + kPT¯⊥ (∆mid)k∗ .

13 Therefore by Lemma 12 and Lemma 13. The bounds on k∆Skvec(1) and k∆Lkvec(1) follow from the bounds kPΩ¯ (∆mid)kvec(1) on k∆midkvec(1), k∆avgkvec(1), and k∆avgk∗ (from 1 Lemma 11). ≤ · k(PΩ¯ ◦ PT¯⊥ )(∆mid)kvec(1) 1 − α(ρ)β(ρ) If the constraint kX k ≤ b is added, then we  L vec(∞) + k(PΩ¯ ◦ PT¯ ◦ PΩ¯ ⊥ )(∆mid)kvec(1) can use the facts 1 p ˆ ¯ ≤ · k¯ · kP ¯⊥ (∆ )k k∆Lkvec(∞) ≤ kXLkvec(∞) + kXLkvec(∞) 1 − α(ρ)β(ρ) T mid vec(2)  ≤ 2b + α(ρ)β(ρ) · kPΩ¯ ⊥ (∆mid)kvec(1) (Lemma 10) and 1 p¯ ≤ · k · kPT¯⊥ (∆mid)k∗ q 1 − α(ρ)β(ρ) k∆Lkvec(2) ≤ k∆Lkvec(∞)k∆Lkvec(1)  + α(ρ)β(ρ) · kP¯ ⊥ (∆mid)kvec(1) q Ω ≤ 2bk∆ k . (1 − 1/c)−1 p L vec(1) ≤ · max k,¯ α(ρ)β(ρ)/λ 1 − α(ρ)β(ρ) If XˆL is post-processed, then (letting clip(XˆL) be the · (λk∆avgkvec(1) + k∆avgk∗) (Lemma 12) result of the post-processing) (1 − 1/c)−1 |[X ] − [X¯ ] | ≤ |[Xˆ ] − [X¯ ] | ≤ · (k∆ k + k∆ k /λ) eL i,j L i,j L i,j L i,j 1 − α(ρ)β(ρ) avg vec(1) avg ∗ for all i, j, so ¯ 2 where the last inequality uses the facts k ≤ α(ρ) , ¯ kXeL − XLkvec(1) ≤ k∆Lkvec(1) α(ρ)β(ρ) < 1, and λα(ρ) ≤ 1 (this last inequality uses the condition in (14)). and q We now prove Theorem 2, which we restate here for ¯ ¯ kXeL − XLkvec(2) ≤ 2bkXeL − XLkvec(1) convenience. q Theorem 6 (Theorem 2 restated). Assume the conditions ≤ 2bk∆Lkvec(1). of Theorem 5 hold with E = 0. We have

max{k∆Skvec(1), k∆Lkvec(1)} VI.ANALYSIS OF REGULARIZED FORMULATION   −1 2 − α(ρ)β(ρ) Throughout this section, we fix a target decomposition ≤ 1 + (1 − 1/c) · · vec(1) ¯ ¯ ¯ 1 − α(ρ)β(ρ) (XS, XL) that satisfies kXS − Y kvec(∞) ≤ b, and let ˆ ˆ −1 2 − α(ρ)β(ρ) (XS, XL) be the optimal solution to (2) augmented with + (1 − 1/c) · · ∗/λ. ¯ 1 − α(ρ)β(ρ) the constraint kXS −Y kvec(∞) ≤ b for some b ≥ kXS − ˆ ¯ ¯ Y kvec(∞) (b = ∞ is allowed). Let ∆S := XS − XS and If, in addition for some b ≥ kXLkvec(∞), either: ∆L := XˆL − X¯L. We show that under the conditions of • the optimization problem (1) is augmented with the Theorem 5 with E = Y − (X¯S + X¯L) and appropriately constraint kXLkvec(∞) ≤ b (and letting XeL := chosen λ and µ, solving (2) accurately recovers the target XˆL), or decomposition (X¯S, X¯L). • XˆL is post-processed by replacing [XˆL]i,j with ˆ Lemma 14. There exists G ,G ,H ∈ m×n such that [XeL]i,j := min{max{[XL]i,j, −b}, b} for all i, j, S L S R −1 ˆ ˆ then we also have 1) µ (XS + XL − Y ) + λGS + HS = 0; n q o kGSkvec(∞) ≤ 1; ¯ −1 kXeL−XLkvec(2) ≤ min k∆Lkvec(1), 2bk∆Lkvec(1) . 2) µ (XˆS + XˆL − Y ) + GL = 0; kGLk2→2 ≤ 1; 3) [HS]i,j[∆S]i,j ≥ 0 ∀i, j . Proof: First note that since ∆S = ∆avg + ∆mid and ∆L = ∆avg − ∆mid, we have Proof: We express the constraint kXS −Y kvec(∞) ≤ max{k∆Skvec(1), k∆Lkvec(1)} ≤ k∆avgkvec(1) + b in (2) as 2mn constraints [XS]i,j − Yi,j − b ≤ 0 k∆midkvec(1). We can bound k∆midkvec(1) as and −[XS]i,j + Yi,j − b ≤ 0 for all i, j. Now the corresponding Lagrangian is k∆midkvec(1) ≤ kPΩ¯ ⊥ (∆mid)kvec(1) + kPΩ¯ (∆mid)kvec(1)   1 −1 1 2 ≤ (1 − 1/c) · 1 + kXS + XL − Y kvec(2) + λkXSkvec(1) + kXLk∗ 1 − α(ρ)β(ρ) 2µ + − + hΛ ,XS − Y − b1m,ni + hΛ , −XS + Y − b1m,ni · (k∆avgkvec(1) + k∆avgk∗/λ)

14 + − ¯ ¯ where Λ , Λ ≥ 0 and 1m,n is the all-ones m × n hold with E = Y − (XS + XL), and let (QΩ¯ ,QT¯) be matrix. First-order optimality conditions imply that there the dual certificate from the conclusion. We have ˆ exists a subgradient GS of kXSkvec(1) at XS = XS and a subgradient G of kX k at X = Xˆ such that L L ∗ L L (1 − α(ρ)β(ρ)) · k∆Skvec(1) −1 + − −1 −1 2 µ (XˆS + XˆL − Y ) + λGS + (Λ − Λ ) = 0 ≤ λ (1 − 1/c) kQΩ¯ + QT¯kvec(2)µ ¯ p¯ ¯ and + λkµ + 2 krµ¯ + kk(PΩ¯ ◦ PT¯⊥ )(E)kvec(∞), −1 µ (XˆS + XˆL − Y ) + GL = 0. ¯ ¯ Now since kXS − Y kvec(∞) ≤ b, we have [XS]i,j ≤ ¯ n q o Yi,j + b and −[XS]i,j ≤ −Yi,j + b. By complementary k∆ k ≤ min k∆ k , 2bk∆ k , + ˆ S vec(2) S vec(1) S vec(1) slackness, if Λi,j > 0, then [XS]i,j −Yi,j −b = 0, which means [XˆS]i,j − [X¯S]i,j ≥ [XˆS]i,j − (Yi,j + b) = 0. So + − ˆ Λi,j[∆S]i,j ≥ 0. Similarly, if Λi,j > 0, then [XS]i,j − [X¯ ] ≤ 0. So Λ− [∆ ] ≤ 0. Therefore H := Λ+ − −1 2 S i,j i,j S i,j k∆Lk[(ρ) ≤ (1 − 1/c) kQ¯ + Q ¯k µ/2 − Ω T vec(2) Λ satisfies Hi,j[∆S]i,j ≥ 0. n √ o + min β(ρ)k∆Skvec(1), 2¯rk∆Skvec(2) Lemma 15. Assume the conditions of Theorem 5 hold ¯ ¯ + kP ¯(E)k∗ + 2¯rµ, with E = Y − (XS + XL), and let (QΩ¯ ,QT¯) be the T dual certificate from the conclusion. We have and λkPΩ¯ ⊥ (∆S)kvec(1) + kPT¯⊥ (∆L)k∗ −1 2 ≤ (1 − 1/c) kQΩ¯ + QT¯kvec(2)µ/2. −1 2 k∆Lk∗ ≤ (1 − 1/c) kQΩ¯ + QT¯k µ/2 √ vec(2) Proof: Let Q := Q¯ + Q ¯ and ∆ := ∆S + ∆L. Ω T + 2¯rk∆Skvec(2) + kPT¯(E)k∗ + 2¯rµ. Since Q + µ−1E satisfies the conditions of Lemma 6,  (1 − 1/c) λkPΩ¯ ⊥ (∆S)kvec(1) + kPT¯⊥ (∆L)k∗ Proof: From Lemma 14, we obtain GS,GL,HS ∈ ˆ ˆ ¯ ¯ m×n and the following equations: ≤ (λkXSkvec(1) + kXLk∗) − (λkXSkvec(1) + kXLk∗) R −1 − hQ + µ E, ∆S + ∆Li. −1 λPΩ¯ (GS) = −µ (PΩ¯ (∆S) + PΩ¯ (∆L) − PΩ¯ (E)) Furthermore, by the optimality of (XˆS, XˆL), − PΩ¯ (HS) (16) ˆ ˆ ¯ ¯ −1 (λkXSkvec(1) + kXLk∗) − (λkXSkvec(1) + kXLk∗) PT¯(GL) = −µ (PT¯(∆S) + PT¯(∆L) − PT¯(E)) (17) 1 ¯ ¯ 2 1 ˆ ˆ 2 ≤ kXS + XL − Y kvec(2) − kXS + XL − Y kvec(2) −1 2µ 2µ (PΩ¯ ◦ PT¯)(GL) = −µ (PΩ¯ ◦ PT¯)(∆S) 1 2 1 2  = kEk − k∆ + ∆ − Ek + (PΩ¯ ◦ PT¯)(∆L) − (PΩ¯ ◦ PT¯)(E) . (18) 2µ vec(2) 2µ S L vec(2) 1 = (2hE, ∆i − h∆, ∆i). Subtracting (18) from (16) gives 2µ

Combining the inequalities gives −1 µ PΩ¯ (∆S) − (PΩ¯ ◦ PT¯ ◦ PΩ¯ )(∆S)   (1 − 1/c) λkPΩ¯ ⊥ (∆S)kvec(1) + kPT¯⊥ (∆L)k∗ − (PΩ¯ ◦ PT¯ ◦ PΩ¯ ⊥ )(∆S) + (PΩ¯ ◦ PT¯⊥ )(∆L) 1 ≤ −hQ, ∆i − h∆, ∆i ≤ kQk2 µ/2 + PΩ¯ (HS) 2µ vec(2) = −λPΩ¯ (GS) + (PΩ¯ ◦ PT¯)(GL) where the last inequality follows by taking the maximum −1 + µ (PΩ¯ ◦ PT¯⊥ )(E). value over ∆ at ∆ = −µQ. Now we prove Theorem 3, restated below (with an Moreover, we have hsign(∆S), PΩ¯ (∆S)i = additional result for k∆Lk[(ρ)). kPΩ¯ (∆S)kvec(1) and hsign(∆S), PΩ¯ (HS)i = ¯ ¯ Theorem 7 (Theorem 3 restated). Let k := | supp(XS)| kPΩ¯ (HS)kvec(1), so taking inner products with and r¯ := rank(X¯L). Assume the conditions of Theorem 5 sign(∆S) on both sides of the equation gives the

15 following chain of inequalities: For the third and fourth bounds, we obtain from (17)

−1 kPT¯(∆L)k[(ρ) µ kPΩ¯ (∆S)kvec(1) + kPΩ¯ (HS)kvec(1) −1 ≤ kPT¯(∆S)k[(ρ) + kPT¯(E)k[(ρ) + µkPT¯(GL)k[(ρ) ≤ µ k(PΩ¯ ◦ PT¯ ◦ PΩ¯ )(∆S)kvec(1) −1 ≤ kPT¯kvec(1)→[(ρ)k∆Skvec(1) + kPT¯(E)k∗ + µ k(PΩ¯ ◦ PT¯ ◦ PΩ¯ ⊥ )(∆S)kvec(1) −1 + µkPT¯(GL)k∗ (Lemma 3) + µ k(PΩ¯ ◦ PT¯⊥ )(∆L)kvec(1) ∗ = kPT¯k](ρ)→vec(∞)k∆Skvec(1) + kPT¯(E)k∗ + λkPΩ¯ (GS)kvec(1) + k(PΩ¯ ◦ PT¯)(GL)kvec(1) −1 + µkPT¯(GL)k∗ (Proposition 3) + µ k(PΩ¯ ◦ PT¯⊥ )(E)kvec(1) −1 ≤ β(ρ)k∆Skvec(1) + kPT¯(E)k∗ + µkPT¯(GL)k∗ ≤ µ α(ρ)β(ρ)kPΩ¯ (∆S)kvec(1) −1 ¯ (Lemma 8) + µ α(ρ)β(ρ)kPΩ¯ ⊥ (∆S)kvec(1) + λk ≤ β(ρ)k∆ k + kP ¯(E)k + 2¯rµ −1p p S vec(1) T ∗ + µ k¯kP ¯⊥ (∆ )k + k¯kP ¯(G )k T L vec(2) T L vec(2) (Lemma 5 and kG k ≤ 1) −1¯ L 2→2 + µ kk(PΩ¯ ◦ PT¯⊥ )(E)kvec(∞) −1 and ≤ µ α(ρ)β(ρ)kPΩ¯ (∆S)kvec(1) −1 kP ¯(∆L)k∗ + µ α(ρ)β(ρ)kPΩ¯ ⊥ (∆S)kvec(1) T −1p¯ ≤ kPT¯(∆S)k∗ + kPT¯(E)k∗ + µkPT¯(GL)k∗ + µ kkPT¯⊥ (∆L)kvec(2) √ p ≤ 2¯rk∆Skvec(2) + kP ¯(E)k∗ + 2¯rµ + λk¯ + 2 k¯r¯kG k T L 2→2 (Lemma 5 and kG k ≤ 1). −1¯ L 2→2 + µ kk(PΩ¯ ◦ PT¯⊥ )(E)kvec(∞) −1 Now we combine these with ≤ µ α(ρ)β(ρ)kPΩ¯ (∆S)kvec(1) −1 k∆ k ≤ kP ¯⊥ (∆ )k + kP ¯(∆ )k + µ α(ρ)β(ρ)kPΩ¯ ⊥ (∆S)kvec(1) L [(ρ) T L [(ρ) T L [(ρ) −1p¯ ≤ kPT¯⊥ (∆L)k∗ + µ kkPT¯⊥ (∆L)kvec(2) + min{kP ¯(∆ )k , kP ¯(∆ )k } p −1 T L ∗ T L [(ρ) + λk¯ + 2 k¯r¯ + µ k¯k(P¯ ◦ P ¯⊥ )(E)k . Ω T vec(∞) (Lemma 3)

The second and third inequalities above follow from k∆Lk∗ ≤ kPT¯⊥ (∆L)k∗ + kPT¯(∆L)k∗ Lemma 5 and Lemma 10, and the fourth inequality uses and Lemma 15. the fact that kG k ≤ 1. Rearranging the inequality L 2→2 Note that we have an error bound for ∆ in k · k and applying Lemma 15 gives L [(ρ) norm, which can be significantly smaller than the bound for the trace norm of ∆L. (1 − α(ρ)β(ρ)) · kPΩ¯ (∆S)kvec(1) p¯ ≤ α(ρ)β(ρ)kPΩ¯ ⊥ (∆S)kvec(1) + kkPT¯⊥ (∆L)kvec(2) ACKNOWLEDGMENT ¯ p¯ ¯ We thank Emmanuel Candes` for clarifications about + λkµ + 2 krµ¯ + kk(PΩ¯ ◦ PT¯⊥ )(E)kvec(∞) p the results in [3]. ≤ maxα(ρ)β(ρ)/λ, k¯ −1 2 · (1 − 1/c) kQΩ¯ + QT¯kvec(2)µ/2 REFERENCES ¯ p¯ ¯ [1] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, + λkµ + 2 krµ¯ + kk(PΩ¯ ◦ PT¯⊥ )(E)kvec(∞) “Rank-Sparsity Incoherence for Matrix Decomposition,” ArXiv e- −1 −1 2 prints, Jun. 2009. ≤ λ (1 − 1/c) kQΩ¯ + QT¯k µ/2 vec(2) [2] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky, “Latent p Variable Graphical Model Selection via Convex Optimization,” + λkµ¯ + 2 k¯rµ¯ + k¯k(P¯ ◦ P ¯⊥ )(E)k Ω T vec(∞) ArXiv e-prints, Aug. 2010. 2 [3] E. J. Candes,` X. Li, Y. Ma, and J. Wright, “Robust Principal since k¯ ≤ α(ρ) , α(ρ)β(ρ) < 1, and λα(ρ) ≤ 1. Now Component Analysis?” ArXiv e-prints, Dec. 2009. we combine this with k∆Skvec(1) ≤ kPΩ¯ ⊥ (∆S)kvec(1) + [4] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Royal. Statist. Soc B., vol. 58, no. 1, pp. 267–288, 1996. kP¯ (∆ )k and Lemma 15 to get the first bound. Ω S vec(1) [5] M. Fazel, “Matrix rank minimization with applications,” Ph.D. For the second bound, we use the facts dissertation, Department of Electrical Engineering, Stanford Uni- ˆ ¯ versity, 2002. k∆Skvec(∞) ≤ kXS −Y kvec(∞) +kXS −Y kvec(∞) ≤ 2b p [6] E. Candes` and R. Recht, “Exact matrix completion via convex op- and k∆Skvec(2) ≤ k∆Skvec(1)k∆Skvec(∞) ≤ p timization,” Foundations of Computational , vol. 9, 2bk∆Skvec(1). pp. 717–772, 2009.

[7] D. Gross, "Recovering low-rank matrices from few coefficients in any basis," ArXiv e-prints, Oct. 2009.
[8] Z. Zhou, X. Li, J. Wright, E. J. Candès, and Y. Ma, "Stable principal component pursuit," in Proceedings of the International Symposium on Information Theory, 2010.
[9] A. Ganesh, J. Wright, X. Li, E. J. Candès, and Y. Ma, "Dense error correction for low-rank matrices via principal component pursuit," in Proceedings of the International Symposium on Information Theory, 2010.
[10] K. R. Davidson and S. J. Szarek, "Local operator theory, random matrices and Banach spaces," in Handbook of the Geometry of Banach Spaces, Vol. I, 2001, pp. 317–366.
[11] G. A. Watson, "Characterization of the subdifferential of some matrix norms," Linear Algebra and its Applications, vol. 170, pp. 1039–1053, 1992.

Daniel Hsu received a B.S. in Computer Science and Engineering from the University of California, Berkeley in 2004 and a Ph.D. in Computer Science from the University of California, San Diego in 2010. From 2010 to 2011, he was a postdoctoral scholar at Rutgers University and a visiting scholar at the Wharton School of the University of Pennsylvania. He is currently a postdoctoral researcher at Microsoft Research New England. His research interests are in algorithmic statistics and .

Sham M. Kakade is currently an associate professor of statistics at the Wharton School at the University of Pennsylvania, and a Senior Researcher at Microsoft Research New England. He received his B.A. in Physics from the California Institute of Technology and his Ph.D. from the Gatsby Computational Neuroscience Unit affiliated with University College London. He spent the following two years as a Postdoctoral Researcher at the Department of Computer and Information Science at the University of Pennsylvania. Subsequently, he joined the Toyota Technological Institute, where he was an assistant professor for four years. His research focuses on artificial intelligence and machine learning, and their connections to other areas such as game theory and economics.

Tong Zhang received a B.A. in Mathematics and Computer Science from Cornell University in 1994 and a Ph.D. in Computer Science from Stanford University in 1998. After graduation, he worked at IBM T.J. Watson Research Center in Yorktown Heights, New York, and Yahoo Research in New York City. He is currently a professor of statistics at Rutgers University. His research interests include machine learning, algorithms for statistical computation, their mathematical analysis and applications.
