IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 1 Matrix Completion with Deterministic Sampling: Theories and Methods

Guangcan Liu, Senior Member, IEEE, Qingshan Liu, Senior Member, IEEE, Xiao-Tong Yuan, Member, IEEE, and Meng Wang

!

Abstract—In some significant applications such as data forecasting, observed data observed data the locations of missing entries cannot obey any non-degenerate dis- missing data missing data tributions, questioning the validity of the prevalent assumption that the missing data is randomly chosen according to some probabilistic model.

To break through the limits of random sampling, we explore in this value paper the problem of real-valued matrix completion under the setup of value deterministic sampling. We propose two conditions, isomeric condition and relative well-conditionedness, for guaranteeing an arbitrary matrix to be recoverable from a sampling of the matrix entries. It is provable that the proposed conditions are weaker than the assumption of uniform time time sampling and, most importantly, it is also provable that the isomeric Fig. 1. The unseen future values of time series are essentially a special condition is necessary for the completions of any partial matrices to type of missing data. be identifiable. Equipped with these new tools, we prove a collection of theorems for missing data recovery as well as convex/nonconvex Problem 1.1 (Matrix Completion). Denote by [·]ij the (i, j)th matrix completion. Among other things, we study in detail a Schatten m×n entry of a matrix. Let L0 ∈ R be an unknown matrix of quasi-norm induced method termed isomeric dictionary pursuit (IsoDP), interest. The rank of L0 is unknown either. Given a sampling of and we show that IsoDP exhibits some distinct behaviors absent in the the entries in L and a 2D sampling set Ω ⊆ {1, ··· , m} × {1, traditional bilinear programs. 0 ··· , n} consisting of the locations of observed entries, i.e., given

Index Terms—matrix completion, deterministic sampling, identifiability, Ω and {[L0]ij|(i, j) ∈ Ω}, isomeric condition, relative well-conditionedness, Schatten quasi-norm, bilinear programming. can we identify the target L0? If so, under which conditions? In general cases, matrix completion is an ill-posed prob- lem, as the missing entries can be of arbitrary values. Thus, NTRODUCTION 1 I some assumptions are necessary for studying Problem 1.1. N the presence of missing data, the representativeness Candes` and Recht [10] proved that the target L0, with I of data samples may be reduced significantly and the high probability, is exactly restored by , inference about data is therefore distorted seriously. Given provided that L0 is low rank and incoherent and the set Ω this pressing circumstance, it is crucially important to de- of locations corresponding to the observed entries is a set

arXiv:1805.02313v5 [cs.IT] 6 Sep 2019 vise computational methods that can restore unseen data sampled uniformly at random (i.e., uniform sampling). This from available observations. As the data in practice is often pioneering work provides people several useful tools to organized in matrix form, it is considerably significant to investigate matrix completion and many other related prob- study the problem of matrix completion [1–9], which aims to lems. Its assumptions, including low-rankness, incoherence fill in the missing entries of a partially observed matrix. and uniform sampling, are now standard and widely used in the literatures, e.g., [11–18]. However, the assumption of uniform sampling is often invalid in practice: • G. Liu is with B-DAT and CICAEET, School of Automation, Nan- jing University of Information Science and Technology, Nanjing, China • A ubiquitous type of missing data is the unseen 210044. Email: [email protected]. future data, e.g., the next few values of a time series • Q. Liu is with B-DAT, School of Automation, Nanjing University of Information Science and Technology, Nanjing, China 210044. Email: as shown in Figure 1. It is certain that the (missing) [email protected]. future data is not randomly selected, not even being • X-T. Yuan is with B-DAT and CICAEET, School of Automation, Nan- sampled uniformly at random. In this case, as will be jing University of Information Science and Technology, Nanjing, China shown in Section 6.1, the theories built upon uniform 210044. Email: [email protected]. • M. Wang is with School of Computer Science and Information Engineer- sampling are no longer applicable. ing, Hefei University of Technology, 193 Tunxi Road, Hefei, Anhui, China • Even when the underlying regime of the missing 230009. E-mail: [email protected]. data pattern is a probabilistic model, the reasons for different observations being missing could be IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 2

correlated rather than independent. In fact, most real- 휔 = {1} 휔 = {3} 휔 = {1,3} world datasets cannot satisfy the uniform sampling assumption, as pointed out by [19, 20]. 1 2 1 2 1 2 푀휔,: 3 4 푀휔,: 3 4 푀 3 4 There has been sparse research in the direction of deter- 휔,: 5 6 5 6 5 6 ministic or nonuniform sampling, e.g., [19–25]. For example, Negahban and Wainwright [23] studied the case of weighted entrywise sampling, which is more general than the setup Ω = { 1,1 , 1,3 , 2,1 , {2,2}} of uniform sampling but still a special form of random sampling. In particular, Kiraly´ et al. [21, 22] treated matrix 1 2 1 2 1 2 completion as an algebraic problem and proposed determin- 1 푀Ω ,: 3 4 푀Ω2,: 3 4 푀Ω3,: 3 4 istic conditions to decide whether a particular entry of a 5 6 5 6 5 6 generic matrix can be restored. Pimentel-Alarcon´ et al. [25] Fig. 2. Illustrations of the sampled submatrices. built deterministic sampling conditions for ensuring that, almost surely, there are only finitely many matrices that agree formula of which can be derived from Schatten quasi-norm with the observed entries. However, strictly speaking, those minimization [4], and we show that IsoDP is superior to the conditions ensure only the recoverability of a special kind traditional bilinear programs. In summary, the main contri- of matrices, but they cannot guarantee the identifiability of bution of this work is to establish deterministic sampling an arbitrary L0 for sure. This gap is indeed striking, as the conditions for ensuring the success in completing arbitrary data matrices arising from modern applications are often matrices from a subset of the matrix entries, producing some of complicate structures and unnecessary to be generic. theoretical results useful for understanding the completion Moreover, the sampling conditions given in [21, 22, 25] are regimes of arbitrary missing data patterns. not so interpretable and thus not easy to use while applying to the other related problems such as matrix recovery (which 2 SUMMARYOF MAIN NOTATIONS Ω is matrix completion with being unknown) [11]. Capital and lowercase letters are used to represent (real- To break through the limits of random sampling, we valued) matrices and vectors, respectively, except that some propose in this work two deterministic conditions, isomeric lowercase letters, such as i, j, k, m, n, l, p, q, r, s and t, are condition [26] and relative well-conditionedness, for guarantee- used to denote integers. For a matrix M, [M]ij is the (i, j)th ing an arbitrary matrix to be recoverable from a sampling of entry of M, [M]i,: is its ith row, and [M]:,j is its jth its entries. The isomeric condition is a mixed concept that column. Let ω1 = {i1, i2, ··· , ik} and ω2 = {j1, j2, ··· , js} combines together the rank and coherence of L0 with the be two 1D sampling sets. Then [M]ω1,: denotes the sub- locations and amount of the observed entries. In general, matrix of M obtained by selecting the rows with indices isomerism (noun of isomeric) ensures that the sampled sub- i1, i2, ··· , ik, [M]:,ω is the submatrix constructed by choos- matrices (see Section 2) are not rank deficient1. Remarkably, it 2 ing the columns at j1, j2, ··· , js, and similarly for [M]ω1,ω2 . is provable that isomerism is necessary for the identifiability For a 2D sampling set Ω ⊆ {1, ··· , m} × {1, ··· , n}, L of 0: Whenever the isomeric condition is violated, there we imagine it as a sparse matrix and define its “rows”, exist infinity many matrices that can fit the observed entries “columns” and “transpose” as follows: the ith row Ωi = not worse than L0 does. Hence, logically speaking, the j {j1|(i1, j1) ∈ Ω, i1 = i}, the jth column Ω = {i1|(i1, j1) ∈ conditions given in [21, 22, 25] should suffice to ensure T Ω, j1 = j}, and the transpose Ω = {(j1, i1)|(i1, j1) ∈ Ω}. isomerism. While necessary, unfortunately isomerism does These notations are important for understanding the pro- L not suffice to guarantee the identifiability of 0 in a deter- posed conditions. For the ease of presentation, we shall call ministic fashion. This is because isomerism does not exclude [M]ω,: as a sampled submatrix of M (see Figure 2), where ω the unidentifiable cases where the sampled submatrices is a 1D sampling set. are severely ill-conditioned. To compensate this weakness, Three types of matrix norms are used in this paper: 1) the we further propose the so-called relative well-conditionedness, operator norm or 2-norm denoted by kMk, 2) the Frobenius which encourages the smallest singular values of the sam- norm denoted by kMkF and 3) the nuclear norm denoted pled submatrices to be away from 0. by kMk∗. The only used vector norm is the `2 norm, which Equipped with these new tools, isomerism and relative is denoted by k · k2. Particularly, the symbol | · | is reserved well-conditionedness, we prove a set of theorems pertaining for the cardinality of a set. to missing data recovery [27] and matrix completion. In partic- The special symbol (·)+ is reserved to denote the Moore- ular, we prove that the exact solutions that identify the target Penrose pseudo-inverse of a matrix. More precisely, for a matrix L0 are strict local minima to the commonly used 2 T matrix M with SVD M = UM ΣM VM , its pseudo-inverse is bilinear programs. Although theoretically sound, the classic + −1 T given by M = VM ΣM UM . For convenience, we adopt the bilinear programs suffer from a weakness that the rank of conventions of using span{M} to denote the linear space L0 has to be known. To fix this flaw, we further consider spanned by the columns of a matrix M, using y ∈ span{M} a method termed isomeric dictionary pursuit (IsoDP), the to denote that a vector y belongs to the space span{M}, and Y ∈ span{M} 1. In this paper, rank deficiency means that a submatrix does not have using to denote that all the column vectors the largest possible rank. Specifically, suppose that M 0 is a submatrix of a matrix Y belong to span{M}. of some matrix M, then M 0 is rank deficient iff (i.e., if and only if) rank (M 0) < rank (M). Note here that a submatrix is rank deficient 2. In this paper, SVD always refers to skinny SVD. For a rank- m×n T does not necessarily mean that the submatrix does not have full rank, r matrix M ∈ R , its SVD is of the form UM ΣM VM , where m×r r×r n×r and a submatrix of full rank could be rank deficient. UM ∈ R , ΣM ∈ R and VM ∈ R . IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 3

3 IDENTIFIABILITY CONDITIONS Similar to k-isomerism, Ω-isomerism also assumes that n {[M] j } In this section, we introduce the so-called isomeric condi- the sampled submatrices, Ω ,: j=1, are not rank defi- Ω tion [26] and relative well-conditionedness. cient. The main difference is that -isomerism requires the rank of M to be preserved by the submatrices sampled according to a specific sampling set Ω, and k-isomerism 3.1 Isomeric Condition assumes that every submatrix consisting of k rows of M has For the ease of understanding, we shall begin with a concept the same rank as M. Hence, Ω-isomerism is less strict than j called k-isomerism (or k-isomeric in adjective form), which k-isomerism. More precisely, provided that |Ω | ≥ k, ∀1 ≤ can be regarded as an extension of low-rankness. j ≤ n, a matrix M is k-isomeric ensures that M is Ω- m×l isomeric as well, but not vice versa. In the extreme case Definition 3.1 (k-isomeric). A matrix M ∈ R is called k- where M is nonzero at only one row, interestingly, M can isomeric iff any k rows of M can linearly represent all rows in be Ω-isomeric as long as the locations of the nonzero entries M. That is, are included in Ω. For example, the following rank-1 matrix M is not 1-isomeric but still Ω-isomeric for some Ω with rank ([M]ω,:) = rank (M) , ∀ω ⊆ {1, ··· , m}, |ω| = k, |Ωj| = 1, ∀1 ≤ j ≤ n: |ω|×l where |·| is the cardinality of a sampling set and [M]ω,: ∈ R  1 1  is called a “sampled submatrix” of M. Ω = {(1, 1), (1, 2), (1, 3)} and M =  0 0  , In short, a matrix M is k-isomeric means that the sam- 0 0 [M] |ω| = k 3 pled submatrix ω,: (with ) is not rank deficient . where it is configured that m = n = 3 and l = 2. k According to the above definition, -isomerism has a nice With the notation of ΩT = {(j , i )|(i , j ) ∈ Ω}, the M k M 1 1 1 1 property; that is, suppose is 1-isomeric, then is also isomeric property can be also defined on the column vectors k k ≥ k 2-isomeric for any 2 1. So, to verify whether a matrix of a matrix, as shown in the following definition. M is k-isomeric with unknown k, one just needs to find the T m×n smallest k¯ such that M is k¯-isomeric. Definition 3.3 (Ω/Ω -isomeric). Let M ∈ R and Ω ⊆ j Generally, k-isomerism is somewhat similar to Spark [28], {1, ··· , m} × {1, ··· , n}. Suppose Ωi 6= ∅ and Ω 6= ∅, ∀i, j. T which defines the smallest linearly dependent subset of the Then the matrix M is called Ω/Ω -isomeric iff M is Ω-isomeric T T rows of a matrix. For a matrix M to be k-isomeric, it is neces- and M is Ω -isomeric as well. rank (M) ≤ k k sary that , not sufficient. In fact, -isomerism To solve Problem 1.1 without the assumption of missing is also somehow related to the concept of coherence [10, 29]. m×n T at random, as will be shown later, it is necessary to assume For a rank-r matrix M ∈ with SVD UM ΣM V , its T R M that L0 is Ω/Ω -isomeric. This condition has excluded the coherence is denoted as µ(M) and given by unidentifiable cases where any rows or columns of L0 are T m 2 n 2 wholly missing. Moreover, Ω/Ω -isomerism has partially µ(M) = max( max k[UM ]i,:kF , max k[VM ]j,:kF ). 1≤i≤m r 1≤j≤n r considered the cases where L0 is of high coherence: For the extreme case where L is 1 at only one entry and 0 M ∈ m×l 0 When the coherence of a matrix R is not too high, everywhere else, L cannot be Ω/ΩT -isomeric unless the M k k k = rank (M) 0 could be -isomeric with a small , e.g., . index of the nonzero element is included in Ω. In general, M Whenever the coherence of is very high, one may need there are numerous reasons for the target matrix L to be k k 0 a large to satisfy the -isomeric property. For example, isomeric. For example, the standard assumptions of low- M consider an extreme case where is a rank-1 matrix with rankness, incoherence and uniform sampling are indeed one row being 1 and everywhere else being 0. In this case, sufficient to ensure isomerism, not necessary. we need k = m to ensure that M is k-isomeric. However, m×n the connection between isomerism and coherence is not Theorem 3.1. Let L0 ∈ R and Ω ⊆ {1, ··· , m}×{1, ··· , indestructible. A counterexample is the Hadamard matrix n}. Denote n1 = max(m, n), n2 = min(m, n), µ0 = µ(L0) m with 2 rows and 2 columns. In this case, the matrix has an and r0 = rank (L0). Suppose that Ω is a set sampled uniformly optimal coherence of 1, but the matrix is not k-isomeric for at random, namely Pr((i, j) ∈ Ω) = ρ0 and Pr((i, j) ∈/ Ω) = m−1 any k ≤ 2 . 1 − ρ0. If ρ0 > cµ0r0(log n1)/n2 for some numerical constant c −10 T While Definition 3.1 involves all 1D sampling sets of then, with probability at least 1 − n1 , L0 is Ω/Ω -isomeric. k cardinality , we often need the isomeric property to be Notice, that the isomeric condition can be also proven by Ω associated with a certain 2D sampling set . To this end, we discarding the uniform sampling assumption and accessing Ω Ω define below a concept called -isomerism (or -isomeric). only the concept of coherence (see Theorem 3.4). Further- m×l Definition 3.2 (Ω-isomeric). Let M ∈ R and Ω ⊆ {1, ··· , more, the isomeric condition could be even obeyed in the m}×{1, ··· , n}. Suppose that Ωj 6= ∅ (empty set), ∀1 ≤ j ≤ n. case of high coherence. For example, Then the matrix M is called Ω-isomeric iff 1 0 0  Ω={(1, 1),(1, 2),(1, 3),(2, 1),(3, 1)} L = 0 0 0 , rank [M]Ωj ,: = rank (M) , ∀j = 1, ··· , n. and 0   (1) 0 0 0 Note here that Ωj (i.e., jth column of Ω) is a 1D sampling set and where L0 is not incoherent and the sampling is not uniform l 6= n is allowed. T either, but it can be verified that L0 is Ω/Ω -isomeric. In

3. Here, the largest possible rank is rank (M). So rank ([M]ω,:) = fact, the isomeric condition is necessary for the identifiability rank (M) gives that the submatrix [M]ω,: is not rank deficient. of L0, as shown in the following theorem. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 4

m×n Theorem 3.2. Let L0 ∈ R and Ω ⊆ {1, ··· , m}×{1, ··· , γω(M) = 1 whenever ω = {1, ··· , m}. The concept of ω- T T n}. If either L0 is not Ω-isomeric or L0 is not Ω -isomeric then relative can be extended to the case of 2D m×n there exist infinity many matrices (denoted as L ∈ R ) that sampling sets, as shown below. fit the observed entries not worse than L does: 0 Definition 3.5 (Ω-relative condition number). Let M ∈ m×l L 6= L0, rank (L) ≤ rank (L0) , [L]ij = [L0]ij, ∀(i, j) ∈ Ω. R and Ω ⊆ {1, ··· , m} × {1, ··· , n}. Suppose that [M]Ωj ,: 6= 0, ∀1 ≤ j ≤ n. Then the Ω-relative condition number 0 In other words, for any partial matrix M with sampling of M is denoted as γΩ(M) and given by set Ω, if there exists a completion M that is not Ω/ΩT - γΩ(M) = min γΩj (M), isomeric, then there are infinity many completions that are 1≤j≤n different from M and have a rank not greater than that of M. j In other words, isomerism is also necessary for the so-called where Ω is a 1D sampling set corresponding to the jth column finitely completable property explored in [21, 22, 25]. As a of Ω. Again, note here that l 6= n is allowed. consequence, logically speaking, the deterministic sampling Using the notation of ΩT , we can define the concept of conditions established in [21, 22, 25] should suffice to ensure Ω/ΩT -relative condition number as in the following. isomerism. The above theorem illustrates that the isomeric T condition is indeed necessary for the identifiability of the Definition 3.6 (Ω/Ω -relative condition number). Let M ∈ m×n completions to any partial matrices, no matter how the R and Ω ⊆ {1, ··· , m} × {1, ··· , n}. Suppose that j observed entries are chosen. [M]Ω ,: 6= 0 and [M]:,Ωi 6= 0, ∀1 ≤ i ≤ m, 1 ≤ j ≤ n. Then the Ω/ΩT -relative condition number of M is denoted as γΩ,ΩT (M) and given by 3.2 Relative Well-Conditionedness T γ T (M) = min(γ (M), γ T (M )). While necessary, the isomeric condition is unfortunately Ω,Ω Ω Ω unable to guarantee the identifiability of L0 for sure. More To make sure that an arbitrary matrix L0 is recoverable concretely, consider the following example: from a subset of the matrix entries, we need to assume that γ T (L )  1 10  Ω,Ω 0 is reasonably large; this is the so-called relative Ω = {(1, 1), (2, 2)} L = 9 . and 0 9 1 (2) well-conditionedness. Under the standard settings of uniform 10 sampling and incoherence, we have the following theorem T It can be verified that L0 is Ω/Ω -isomeric. However, there to bound γΩ,ΩT (L0). still exist infinitely many rank-1 completions different than m×n Theorem 3.3. Let L0 ∈ R and Ω ⊆ {1, ··· , m}×{1, ··· , L0, e.g., L∗ = [1, 1; 1, 1], which is a matrix of all ones. For n}. Denote n1 = max(m, n), n2 = min(m, n), µ0 = µ(L0) this particular example, L∗ is the optimal rank-1 completion and r0 = rank (L0). Suppose that Ω is a set sampled uniformly in the sense of coherence. In general, isomerism is only at random, namely Pr((i, j) ∈ Ω) = ρ0 and Pr((i, j) ∈/ Ω) = a condition for the sampled submatrices to be not rank 1 − ρ0. For any α > 1, if ρ0 > αcµ0r0(log n1)/n2 for some deficient, but there is no guarantee that the sampled subma- −10 numerical constant c then, with probability at least 1 − n1 , trices are well-conditioned. To compensate this weakness, √ γΩ,ΩT (L0) > (1 − 1/ α)ρ0. we further propose an additional hypothesis called relative well-conditionedness, which encourages the smallest singular The above theorem illustrates that, under the setting of value of the sampled submatrices to be far from 0. uniform sampling plus incoherence, the relative condition Again, we shall begin with a simple concept called ω- number approximately corresponds to the fraction of the relative condition number, with ω being a 1D sampling set. observed entries. Actually, the relative condition number can be bounded from below without the assumption of Definition 3.4 (ω-relative condition number). Let M ∈ m×l uniform sampling. and ω ⊆ {1, ··· , m}. Suppose that [M]ω,: 6= 0. Then R m×n the ω-relative condition number of the matrix M is denoted as Theorem 3.4. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × γω(M) and given by {1, ··· , n}. Denote µ0 = µ(L0) and r0 = rank (L0). Denote by ρ the smallest fraction of the observed entries in each column + 2 γω(M) = 1/kM([M]ω,:) k , and row of L0; namely, where (·)+ and k · k are the pseudo-inverse and operator norm of |Ω | |Ωj| ρ = min( min i , min ). a matrix, respectively. 1≤i≤m n 1≤j≤n m

Regarding the bound of the ω-relative condition number For any 0 ≤ α < 1, if ρ > 1 − (1 − α)/(µ0r0) then the matrix T γω(M), simple calculations yield L0 is Ω/Ω -isomeric and γΩ,ΩT (L0) > α. 2 2 σmin/kMk ≤ γω(M) ≤ 1, It is worth noting that the relative condition number could be large even if the coherence of L0 is extremely where σmin is the smallest singular value of [M]ω,:. Hence, high. For the example shown in (1), it can be calculated that the sampled submatrix [M]ω,: has a large minimum singular γΩ,ΩT (L0) = 1. value is sufficient for ensuring that γω(M) is large, not necessary. Roughly, the value of γω(M) measures how much information of a matrix M is contained in the sampled 4 THEORIESAND METHODS submatrix [M]ω,:. The more information [M]ω,: contains, the In this section, we shall prove some theorems pertaining larger γω(M) is (this will be more clear later). For example, to matrix completion as well as missing data recovery. In IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 5

addition, we suggest a method termed IsoDP for matrix 4.2 Convex Matrix Completion completion, which possesses some remarkable features that Low rank matrix completion concerns the problem of seek- we miss in the traditional bilinear programs. ing a matrix that not only attains the lowest rank but also satisfies the constraints given by the observed entries: 4.1 Missing Data Recovery min rank (L) , s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω. Before exploring the matrix completion problem, we would L like to consider a missing data recovery problem studied m by [27], which is described as follows: Let y0 ∈ R be a Unfortunately, this idea is of little practical because the data vector drawn form some low-dimensional subspace, problem above is essentially NP-hard and cannot be solved m denoted as y0 ∈ S0 ⊂ R . Suppose that y0 contains some in polynomial time [30]. To achieve practical matrix comple- k available observations in yb ∈ R and some missing entries tion, Candes` and Recht [10, 31] suggested an alternative that m−k in yu ∈ R . Namely, after a permutation, minimizes instead the nuclear norm; namely,   yb k m−k min kLk∗, s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω, (7) y0 = , yb ∈ R , yu ∈ R . (3) L yu

Given the observations in yb, we seek to restore the unseen where k · k∗ denotes the nuclear norm, i.e., the sum of the entries in yu. To do this, we consider the prevalent idea that singular values of a matrix. Under the context of uniform represents a data vector as a linear combination of the bases sampling, it has been proved that the above convex program in a given dictionary: succeeds in recovering the target L0. Although its theory is built upon the assumption of y = Ax , 0 0 (4) missing at random, as observed widely in the literatures, the m×p where A ∈ R is a dictionary constructed in advance convex program (7) actually works even when the locations p and x0 ∈ R is the representation of y0. Utilizing the same of the missing entries are distributed in a correlated and permutation used in (3), we can partition the rows of A into nonuniform fashion. This phenomenon could be explained two parts according to the locations of the observed and by the following theorem, which states that the solution missing entries: to the problem in (7) is unique and exact, provided that   the isomeric condition is obeyed and the relative condition Ab k×p (m−k)×p A = ,Ab ∈ R ,Au ∈ R . (5) number of L0 is large enough. Au m×n Theorem 4.2. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × In this way, the equation in (4) gives that T {1, ··· , n}. If L0 is Ω/Ω -isomeric and γΩ,ΩT (L0) > 0.75 yb = Abx0 and yu = Aux0. then L0 is the unique minimizer to the problem in (7).

As we now can see, the unseen data yu is exactly restored, as Roughly speaking, the assumption γΩ,ΩT (L0) > 0.75 long as the representation x0 is retrieved by only accessing requires that more than three quarters of the information in the available observations in yb. In general cases, there are L0 is observed. Such an assumption is seemingly restrictive infinitely many representations that satisfy y0 = Ax0, e.g., but technically difficult to reduce in general cases. + + x0 = A y0, where (·) is the pseudo-inverse of a matrix. + Since A y0 is the representation of minimal `2 norm, we revisit the traditional `2 program: 4.3 Nonconvex Matrix Completion

1 2 The problem of missing data recovery is closely related to min kxk , s.t. yb = Abx, (6) x 2 2 matrix completion, which is actually to restore the missing entries in multiple data vectors simultaneously. Hence, we where k · k2 is the `2 norm of a vector. The above problem + would transfer the spirits of the `2 program (6) to the has a closed-form solution given by A yb. Under some b case of matrix completion. Following (6), one may consider verifiable conditions, the above `2 program is indeed con- Frobenius norm minimization for matrix completion: sistently successful in a sense as in the following: For any y0 ∈ S0 with an arbitrary partition y0 = [yb; yu] (i.e., 1 2 + min kXk , s.t. [AX]ij = [L0]ij, ∀(i, j) ∈ Ω, (8) arbitrarily missing), the desired representation x0 = A y0 X 2 F is the unique minimizer to the problem in (6). That is, the m×p unseen data yu is exactly recovered by firstly computing where A ∈ R is a dictionary matrix assumed to be + x∗ = Ab yb and then calculating yu = Aux∗. given. Similar to (6), the convex program (8) can also exactly recover the desired representation matrix A+L , as shown y = [y ; y ] ∈ m 0 Theorem 4.1. Let 0 b u R be an authentic sample in the theorem below. drawn from some low-dimensional subspace S0. Denote by k m×n the number of available observations in yb. Then the convex Theorem 4.3. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × program (6) is consistently successful, as long as S0 ⊆ span{A} {1, ··· , n}. Provided that L0 ∈ span{A} and the given dic- + and the given dictionary A is k-isomeric. tionary A is Ω-isomeric, the desired representation X0 = A L0 is the unique minimizer to the problem in (8). The above theorem says that, in order to recover an m-dimensional vector sampled from some subspace deter- Theorem 4.3 tells us that, in general, even when the mined by a given k-isomeric dictionary A, one only needs locations of the missing entries are placed arbitrarily, the to see k entries of the vector. target L0 is restored as long as we have a proper dictionary IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 6

A. This motivates us to consider the commonly used bilinear 4.4 Isomeric Dictionary Pursuit program that seeks both A and X simultaneously: Theorem 4.4 illustrates that program (9) relies on the as- 1 2 2 sumption of p = rank (L0). This is consistent with the min (kAk +kXk ), s.t. [AX]ij = [L0]ij, ∀(i, j) ∈ Ω, (9) A,X 2 F F widely observed phenomenon that program (9) may not work well while the parameter p is far from the true rank A ∈ m×p X ∈ p×n where R and R . The problem above of L0. To overcome this drawback, again, we recall Theo- is bilinear and therefore nonconvex. So, it would be hard rem 4.3. Notice, that the Ω-isomeric condition imposed on to obtain a strong performance guarantee as done in the the dictionary matrix A requires that convex programs, e.g., [10, 29]. What is more, the setup j of deterministic sampling requires a deterministic recovery rank (A) ≤ |Ω |, ∀j = 1, ··· , n. guarantee, the proof of which is much more difficult than This, together with the condition of L0 ∈ span{A}, moti- a probabilistic guarantee. Interestingly, under the very mild vates us to combine the formulation (8) with the popular condition of isomerism, the problem in (9) is proven to in- idea of nuclear norm minimization, resulting in a bilinear clude the exact solutions that identify the target matrix L0 as program termed IsoDP, which estimates both A and X by the critical points. Furthermore, when the relative condition minimizing a mixture of the nuclear and Frobenius norms: number of L0 is sufficiently large, the local optimality of the 1 2 exact solutions is guaranteed surely. min kAk + kXk , s.t. [AX]ij =[L0]ij,∀(i, j) ∈ Ω, (11) A,X ∗ 2 F m×n Theorem 4.4. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × where A ∈ m×p and X ∈ p×n. The above formula {1, ··· , n}. Denote the rank and the SVD of L0 as r0 and R R T can be also derived from the framework of Schatten quasi- U0Σ0V0 , respectively. Define norm minimization [4, 32, 33]. It has been proven in [32, 33] 1 1 m×n 2 T 2 T p×r0 T r L ∈ A0 = U0Σ0 Q ,X0 = QΣ0 V0 , ∀Q ∈ R ,Q Q = I. that, for any rank- matrix R with singular values σ1, ··· , σr, the following holds: Then we have the following: 1 1 1 kLkq = min kAkq1 + kXkq2 , AX = L, T q q1 q2 s.t. (12) 1. If L0 is Ω/Ω -isomeric then the exact solution, denoted q A,X q1 q2 as (A0,X0), is a critical point to the problem in (9). T as long as p ≥ r and 1/q = 1/q1 +1/q2 (q, q1, q2 > 0), where 2. If L is Ω/Ω -isomeric, γ T (L ) > 0.5 and p = r r q 0 Ω,Ω 0 0 kLk = (P σ )1/q is the Schatten-q norm. In that sense, then (A ,X ) is a local minimum to the problem in (9), q i=1 i 0 0 the IsoDP program (11) is related to the following Schatten-q and the local optimality is strict while ignoring the differ- quasi-norm minimization problem with q = 2/3: ences among the exact solutions that equally recover L0. 3 2/3 The condition of γ T (L0) > 0.5, roughly, demands min kLk , s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω. (13) Ω,Ω L 2 2/3 that more than half of the information in L0 is observed. Unless some extra assumptions are imposed, this condition Nevertheless, programs (13) and (11) are not equivalent to p < m m ≤ n is not reducible, because counterexamples do exist when each other; this is obvious if (assume ). In fact, even when p ≥ m, the conclusion (12) only implies that γΩ,ΩT (L0) < 0.5. Consider a concrete case with the global minima of (13) and (11) are equivalent, but their " √ # 1 α2 − 1 local minima and critical points could be different. More Ω = {(1, 1), (2, 2)} and L0 = √ 1 , (10) 2 1 precisely, any local minimum to (13) certainly corresponds α −1 4 √ to a local minimum to (11), but not vice versa . For the T where α > 2. Then it can be verified that L0 is Ω/Ω - same reason, the bilinear program (9) is not equivalent to isomeric. Via some calculations, we have (assume p = r0) the convex program (7). 1 1 1 Regarding the recovery performance of the IsoDP pro- γ T (L ) = min(1 − , ) = < 0.5, gram (11), we establish the following theorem that repro- Ω,Ω 0 α2 α2 α2 " 1 # " # duces Theorem 4.4 without the assumption of p = r0. 2 4 (α − 1) 1 2 1 1 4 m×n A0 = and X0 = 1 , (α − 1) . Theorem 4.5. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × 1 2 4 (α2−1) 4 (α − 1) {1, ··· , n}. Denote the rank and the SVD of L0 as r0 and T Now, construct U0Σ0V0 , respectively. Define 1 2 1 " (α2−1) 4 # " # 3 T 3 T p×r0 T 1 +  2 1 A0 = U0Σ0 Q ,X0 = QΣ0 V0 , ∀Q ∈ R ,Q Q = I. A = 1+ and X = , (α − 1) 4 , 2 1 2 1 1/(α − 1) 4 (α − 1) 4 4. Suppose that L1 is a local minimum to the problem in (13). (A ,X ) = arg min kAk + 0.5 kXk2 AX = L  > 0 (A ,X ) Let 1 1 A,X ∗ F , s.t. 1. Then where . It is easy to see that  √ is a feasible (A ,X ) 2 1 1 has to be a local minimum to (11). This can be proven by solution to (9). However, as long as 0 <  < α − 1 − 1, it the method of reduction to absurdity. Assume that (A1,X1) is not can be verified that a local minimum to (11). Then there exists some feasible solution, denoted as (A2,X2), that is arbitrarily close to (A1,X1) and satisfies 2 2 2 2 2 2 kAkF + kXkF < kA0kF + kX0kF , kA2k∗+0.5 kX2kF < kA1k∗+0.5 kX1kF . Taking L2 = A2X2, we have 3 2/3 2 that L2 is arbitrarily close to L1 and 2 kL2k2/3 ≤ kA2k∗+0.5 kX2kF < which implies that (A0,X0) is not a local minimum to (9). 2/3 kA k + 0.5 kX k2 = 3 kL k , which contradicts the premise that In fact, for the particular example shown in (10), it can be 1 ∗ 1 F 2 1 2/3 L1 is a local minimum to (13). So, a local minimum to (13) also gives a proven that a global minimum to (9) is given by (A∗ = local minimum to (11). But the converge of this statement may not be [1; 1],X∗ = [1, 1]), which cannot correctly reconstruct L0. true, and (11) might have more local minima than (13). IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 7

Then we have the following: Algorithm 1 Solving problem (14) by alternating proximal T 1: Input: {[L0]ij|(i, j) ∈ Ω}. 1. If L0 is Ω/Ω -isomeric then the exact solution (A0,X0) 2: Output: the dictionary A and the representation X. is a critical point to the problem in (11). T 3: Initialization: A = I. 2. If L0 is Ω/Ω -isomeric and γΩ,ΩT (L0) > 0.5 then 4: repeat (A0,X0) is a local minimum to the problem in (11), and 5: Update the representation matrix X by (16). the local optimality is strict while ignoring the differences 6: Update the dictionary matrix A by (17). among the exact solutions that equally recover L0. 7: until convergence Due to the advantages of the nuclear norm, the above theorem does not require the assumption of p = rank (L0) any more. Empirically, unlike (9), which exhibits superior 5 MATHEMATICAL PROOFS performance only if p is close to rank (L ) and the initial 0 This section shows the detailed proofs of the theorems solution is chosen carefully, IsoDP can work well by simply proposed in this work. choosing p = m and using A = I as the initial solution.

5.1 Notations 4.5 Optimization Algorithm Besides of the notations presented in Section 2, there are Considering the fact that the observations in reality are some other notations used throughout the proofs. Letters often contaminated by noise, we shall investigate instead U, V , Ω and their variants (complements, subscripts, etc.) the following bilinear program that can also approximately are reserved for left singular vectors, right singular vec- solve the problem in (11): tors and support set, respectively. For convenience, we 1 2 1 X 2 shall abuse the notation U (resp. V ) to denote the linear min λ(kAk + kXk )+ ([AX]ij −[L0]ij) , (14) A,X ∗ 2 F 2 space spanned by the columns of U (resp. V ), i.e., the (i,j)∈Ω column space (resp. row space). The orthogonal projection where A ∈ m×m (i.e., p = m), X ∈ m×n and λ > 0 is onto the column space U, is denoted by PU and given R R T taken as a parameter. by PU (M) = UU M, and similarly for the row space T The optimization problem in (14) can be solved by any of PV (M) = MVV . Also, we denote by PT the projection the many first-order methods established in the literatures. to the sum of the column space U and the row space V , T T T T For the sake of simplicity, we choose to use the proximal i.e., PT (·) = UU (·) + (·)VV − UU (·)VV . The same methods by [34, 35]. Let (At,Xt) be the solution estimated notation is also used to represent a subspace of matrices M ∈ P at the tth iteration. Define a function gt(·) as (i.e., the image of an operator), e.g., we say that U for any matrix M which satisfies PU (M) = M. The symbol PΩ 1 X g (A) = ([AX ] − [L ] )2. denotes the orthogonal projection onto Ω: t 2 t+1 ij 0 ij (i,j)∈Ω  [M]ij, if (i, j) ∈ Ω, [PΩ(M)]ij = Then the solution to (14) is updated via iterating the follow- 0, otherwise. ing two procedures: ⊥ Similarly, the symbol PΩ denotes the orthogonal projection ⊥ λ 2 1 X 2 onto the complement space of Ω; that is, PΩ + PΩ = I, Xt+1 = arg min kXkF + ([AtX]ij − [L0]ij) , (15) X 2 2 where I is the identity operator. (i,j)∈Ω

λ 1 ∂gt(At) 2 5.2 Basic Lemmas At+1 = arg min kAk∗ + kA − (At − )kF , A µt 2 µt While its definitions are associated with a certain matrix, the isomeric condition is actually characterizing some prop- where µt > 0 is a penalty parameter and ∂gt(At) is the erties of a space, as shown in the lemma below. gradient of the function gt(A) at A = At. According to [34], 2 m×n the penalty parameter µt could be set as µt = kXt+1k . The Lemma 5.1. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × T two optimization problems in (15) both have closed-form {1, ··· , n}. Denote the SVD of L0 as U0Σ0V0 . Then we have: solutions. To be more precise, the X-subproblem is a least 1. L0 is Ω-isomeric iff U0 is Ω-isomeric. square regression problem: T T T 2. L0 is Ω -isomeric iff V0 is Ω -isomeric. [X ] = (AT A + λI)−1AT y , ∀1 ≤ j ≤ n, (16) t+1 :,j j j j j Proof. It can be manipulated that where Aj = [At]Ωj ,: and yj = [L0]Ωj ,j. The A-subproblem T [L ] j = ([U ] j )Σ V , ∀j = 1, ··· , n. is solved by Singular Value Thresholding (SVT) [36]: 0 Ω ,: 0 Ω ,: 0 0 T T Since Σ0V0 is row-wisely full rank, we have At+1 = UHλ/µ (Σ)V , (17) t   T rank [L0]Ωj ,: = rank [U0]Ωj ,: , ∀j = 1, ··· , n. where UΣV is the SVD of At − ∂gt(At)/µt and Hλ/µt (·) denotes the shrinkage operator with parameter λ/µt. As a consequence, L0 is Ω-isomeric is equivalent to U0 is The whole optimization procedure is also summarized Ω-isomeric. Similarly, the second claim is proven. in Algorithm 1. Without loss of generality, assume that m ≤ The isomeric property is indeed subspace successive, as n. Then the computational complexity of each iteration in shown in the next lemma. Algorithm 1 is O(m2n) + O(m3). IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 8

m×n Lemma 5.2. Let Ω ⊆ {1, ··· , m} × {1, ··· , n} and U0 ∈ Lemma 5.5. Let L ∈ R be a rank-r matrix with r ≤ p. m×r m T R be the basis matrix of a subspace embedded in R . Suppose Denote the SVD of L as UΣV . Then we have the following: T that U is a subspace of U0, i.e., U = U0U0 U. If U0 is Ω-isomeric 3  2  1 2 then U is Ω-isomeric as well. tr Σ 3 = min kAk∗ + kXk , s.t. AX = L, m×p p×n F 2 A∈R ,X∈R 2 Proof. By U = U U T U and U is Ω-isomeric, 0 0 0 where tr (·) is the trace of a square matrix.   T   T  rank [U] j = rank ([U ] j )U U = rank U U Ω ,: 0 Ω ,: 0 0 Proof. Denote the singular values of L as σ1 ≥ · · · ≥ σr > 0.  T  We first consider the case that rank (A) = rank (L) = r. = rank U0U U = rank (U) , ∀1 ≤ j ≤ n. 0 Since AX = L, the SVD of A must have a form of T UQΣAVA , where Q is an orthogonal matrix of size r × r and ΣA = diag (α1, ··· , αr) with α1 ≥ · · · ≥ αr > 0. Since The following lemma reveals the fact that the isomeric + 2 A L = arg minX kXkF , s.t. AX = L, we have property is related to the invertibility of matrices. 1 2 1 + 2 Lemma 5.3. Let Ω ⊆ {1, ··· , m} × {1, ··· , n} and U0 ∈ kAk∗ + kXkF ≥ kAk∗ + kA LkF m×r m T 2 2 R be the basis matrix of a subspace of R . Denote by ui the 1   T T = tr (Σ ) + tr Σ−1QT Σ2QΣ−1 . ith row of U0, i.e., U0 = [u1 ; ··· ; um]. Define δij as A 2 A A  1, if (i, j) ∈ Ω, It can be proven that the eigenvalues of Σ−1QT Σ2QΣ−1 are δij = (18) A A 0, otherwise. {σ2/α2 }r {α }r given by i πi i=1, where πi i=1 is a permutation of r Pm T {αi}i=1. By rearrangement inequality, Then the matrices, i=1 δijuiui , ∀1 ≤ j ≤ n, are all invertible r r iff U0 is Ω-isomeric.   X σ2 X σ2 tr Σ−1QT Σ2QΣ−1 = i ≥ i . A A α2 α2 Proof. Note that i=1 πi i=1 i m m T X 2 T X T As a consequence, we have ([U0]Ωj ,:) ([U0]Ωj ,:) = (δij) uiui = δijuiui . r  2  r  2  i=1 i=1 1 2 X σi X 1 1 σi kAk∗ + kXk ≥ αi + = αi + αi + Pm T 2 F 2α2 2 2 2α2 Now, it is easy to see that the matrix i=1 δijuiui is invert- i=1 i i=1 i T r ible is equivalent to the matrix ([U ] j ) ([U ] j ) is posi- 2 0 Ω ,: 0 Ω ,: X 3 3  2   3 3 rank [U ] j = ≥ σ = tr Σ . tive definite, which is further equivalent to 0 Ω ,: 2 i 2 rank (U0), ∀j = 1, ··· , n. i=1 rank (A) ≥ rank (L) The following lemma gives some insights to the relative Regarding the general case of , we can A = UU T A AX = L A X = L condition number. construct 1 . By , 1 . Since rank (A1) = rank (L), we have m×l Lemma 5.4. Let M ∈ R and ω ⊆ {1, ··· , m}. Define m 1 2 1 2 {δi}i=1 with δi = 1 if i ∈ ω and 0 otherwise. Define a dialog kAk∗ + kXkF ≥ kA1k∗ + kXkF m×m 2 2 matrix D ∈ R as D = diag (δ1, δ2, ··· , δm). Denote the T 1 + 2 3  2  SVD of M as UΣV . If rank ([M]ω,:) = rank (M) then ≥ kA k + kA Lk ≥ tr Σ 3 . 1 ∗ 2 1 F 2

γω(M) = σmin,  2  3 3 Finally, the optimal value of 2 tr Σ is attained by A∗ = where σmin is the the smallest singular value (or eigenvalue) of 2 T 1 T T UΣ 3 H and X∗ = HΣ 3 V , ∀H H = I. the matrix U T DU. The next lemma will be used multiple times in the proofs Proof. First note that [M]ω,: can be equivalently written as T presented in this paper. DUΣV . By the assumption of rank ([M]ω,:) = rank (M), DU is column-wisely full rank. Thus, Lemma 5.6. Let Ω ⊆ {1, ··· , m} × {1, ··· , n} and P be m×n + T T + T T + + an orthogonal projection onto some subspace of R . Then the M([M]ω,:) = UΣV (DUΣV ) = UΣV (ΣV ) (DU) following are equivalent: = U(DU)+ = U(U T DU)−1U T D, 1. PPΩP is invertible. ⊥ which gives that 2. kPPΩ Pk < 1. ⊥ + + T T −1 T 3. P ∩ PΩ = {0}. M([M]ω,:) (M([M]ω,:) ) = U(U DU) U . → vec(·) + 2 Proof. 1 2: Let denote the vectorization of a matrix As a result, we have kM([M]ω,:) k = 1/σmin, and thereby formed by stacking the columns of the matrix into a single γ (M) = 1/kM([M] )+k2 = σ . column vector. Suppose that the basis matrix associated ω ω,: min mn×r T with P is given by P ∈ R ,P P = I; namely, vec(P(M)) = PP T vec(M), ∀M ∈ m×n. 1 2 R It has been proven in [37] that kLk∗ = minA,X 2 (kAkF + 2 δ D kXkF ), s.t. AX = L. We have an analogous result, which Denote ij as in (18) and define a diagonal matrix as has also been proven by [4, 32, 33]. mn×mn D = diag(δ11, δ21, ··· , δij, ··· , δmn) ∈ R . IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 9 P∞ ⊥ i Notice that Similarly, it can be also proven that (I + i=1(PPΩ P) ) P∞ ⊥ i X T T X T T PPΩP(M) = M. Hence, I + i=1(PPΩ P) is indeed the P(M) = P( hM, eie ieie ) = hM, eie iP(eie ), j j j j inverse operator of PPΩP. i,j i,j The lemma below is adapted from the arguments in [38]. where ei is the ith standard basis and h·i denotes the inner m×p product between two matrices. With this notation, it is easy Lemma 5.7. Let A ∈ R be a matrix with column space U, + to see that and let A1 = A + ∆. If ∆ ∈ U and k∆k < 1/kA k then T T T T kA+k [vec(P(e1e1 )), vec(P(e2e1 )), ··· , vec(P(emen ))] = PP . rank (A ) = rank (A) and kA+k ≤ . 1 1 1 − kA+kk∆k Similarly, we have Proof. By ∆ ∈ U, X T T PPΩP(M) = hP(M), eiej i(δijP(eiej )), T + + A1 = A + UU ∆ = A + AA ∆ = A(I + A ∆). i,j By k∆k < 1/kA+k, I + A+∆ is invertible and thus and thereby rank (A1) = rank (A). T vec(PPΩP(M)) = PP Dvec(P(M)) To prove the second claim, we denote by V1 the row A = PP T DPP T vec(M). space of 1. Then we have T + + + T V1V1 = A1 A1 = A1 A(I + A ∆), For PPΩP to be invertible, the matrix P DP must be T + T + −1 positive definite. Because, whenever P DP is singular, which gives that A1 A = V1V1 (I + A ∆) . Since A1 ∈ U, mn T there exists z ∈ R that satisfies z 6= 0 and P DP z = 0, we have and thus there exists M ∈ P and M 6= 0 such that + + T + + T + −1 + A1 = A1 UU = A1 AA = V1V1 (I + A ∆) A , PPΩP(M) = 0; this contradicts the assumption that PPΩP is invertible. Denote the minimal singular value of P T DP from which the conclusion follows. T as 0 < σmin ≤ 1. Since P DP is positive definite, we have ⊥ ⊥ 5.3 Critical Lemmas kPPΩ P(M)kF = kvec(PPΩ P(M))k2 T T T The following lemma has a critical role in the proofs. = k(I − P DP )P vec(M)k2 ≤ (1 − σmin)kP vec(M)k2 L ∈ m×n Ω ⊆ {1, ··· , m} × = (1 − σmin)kP(M)kF , Lemma 5.8. Let 0 R and T {1, ··· , n}. Let the SVD of L0 be U0Σ0V0 . Denote PU0 (·) = kPP⊥Pk ≤ 1 − σ < 1 T T which gives that Ω min . U0U0 (·) and PV0 (·) = (·)V0V0 . Then we have the following: ⊥ 2→3: Suppose that M ∈ P ∩ PΩ , i.e., M = P(M) = ⊥ ⊥ 1. PU0 PΩPU0 is invertible iff U0 is Ω-isomeric. PΩ (M). Then we have M = PPΩ P(M) and thus T 2. PV0 PΩPV0 is invertible iff V0 is Ω -isomeric. ⊥ ⊥ kMkF = kPPΩ P(M)kF ≤ kPPΩ PkkMkF ≤ kMkF . Proof. The above two claims are proven in the same way, Since kPP⊥Pk < 1, the last equality above can hold only and thereby we only present the proof of the first one. Since Ω P P P P when M = 0. the operator U0 Ω U0 is linear and U0 is a linear space of 3→1: Consider a nonzero matrix M ∈ P. Then we have finite dimension, the sufficiency can be proven by showing that PU0 PΩPU0 is an injection. That is, we need to prove 2 2 ⊥ 2 kMkF = kP(M)kF = kPΩP(M) + PΩ P(M)kF that the following linear system has no nonzero solution: 2 ⊥ 2 = kPΩP(M)k + kP P(M)k , F Ω F PU0 PΩPU0 (M) = 0, s.t. M ∈ PU0 . which gives that Assume that PU0 PΩPU0 (M) = 0. Then we have ⊥ 2 ⊥ 2 2 2 T T kPPΩ P(M)kF ≤ kPΩ P(M)kF = kMkF − kPΩP(M)kF . U0 PΩ(U0U0 M) = 0. ⊥ i j U U T M By P ∩ PΩ = {0}, PΩP(M) 6= 0. Thus, Denote the th row and th column of 0 and 0 as uT b U = [uT ; uT ; ··· ; uT ] ⊥ 2 2 i and j, respectively; that is, 0 1 2 m and kPPΩ Pk ≤ 1 − inf kPΩP(M)kF < 1. U T M = [b , b , ··· , b ] δ j kMk =1 0 1 2 n . Define ij as in (18). Then the th F T T Pm T column of U0 PΩ(U0U0 M) is given by ( i=1 δijuiui )bj. ⊥ P∞ ⊥ i Pm T Provided that kPPΩ Pk < 1, I + i=1(PPΩ P) is well By Lemma 5.3, the matrix i=1 δijuiui is invertible. Hence, T T defined. Notice that, for any M ∈ P, the following holds: U0 PΩ(U0U0 M) = 0 implies that ∞ X ⊥ i bj = 0, ∀j = 1, ··· , n, PPΩP(I + (PPΩ P) )(M) T i=1 i.e., U0 M = 0. By the assumption of M ∈ PU0 , M = 0. ∞ It remains to prove the necessity. Assume U0 is not Ω- ⊥ X ⊥ i = P(I − PPΩ P)(I + (PPΩ P) )(M) isomeric. By Lemma 5.3, there exists j1 such that the matrix i=1 Pm T i=1 δij1 uiui is singular and therefore has a nonzero null ∞ ∞ T X ⊥ i ⊥ X ⊥ i space. So, there exists M1 6= 0 such that U0 PΩ(U0M1) = 0. = P(I + (PPΩ P) − PPΩ P − (PPΩ P) )(M) Let M = U0M1. Then we have M 6= 0, M ∈ PU and i=1 i=2 0

= P(M) = M. PU0 PΩPU0 (M) = 0. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 10

This contradicts the assumption that PU0 PΩPU0 is invert- which gives that ible. As a consequence, U must be Ω-isomeric. 0 s 1 The next four lemmas establish some connections be- k(P P P )−1P P P⊥ k ≤ − 1. U0 Ω U0 U0 Ω U0 tween the relative condition number and the operator norm. γΩ(L0) m×n Lemma 5.9. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × Using a similar argument as in the proof of Lemma 5.9, T p {1, ··· , n}, and let the SVD of L0 be U0Σ0V0 . Denote it can be proven that the value of 1/γΩ(L0) − 1 is at- T T T PU0 (·) = U0U0 (·) and PV0 (·) = (·)V0V0 . If L0 is Ω/Ω - tainable. To be more precise, assume without loss of gen- isomeric then erality that j1 = arg minj σj, where σj is the smallest T ∗ ∗ ⊥ ⊥ T singular value of U0 DjU0. Denote by σ and v the largest kP P P k = 1 − γ (L ), kP P P k = 1 − γ T (L ). U0 Ω U0 Ω 0 V0 Ω V0 Ω 0 singular value and the corresponding right singular vec- T −1 T T Proof. We only need to prove the first claim. Denote δij tor of (U0 DjU0) U0 Dj(I − U0U0 ), respectively. Then n ∗ as in (18) and define a set of diagonal matrices {Dj}j=1 the above justifications have already proven that σ = m×m p as Dj = diag(δ1j, δ2j, ··· , δmj) ∈ R . Denote the jth 1/γΩ(L0) − 1. Construct an m × n matrix M with the j v∗ column of PU0 (M) as bj. Then we have 1th column being and everywhere else being zero. Then k(P P P )−1P P P⊥ (M)k = ⊥ T T T it can be verified that U0 Ω U0 U0 Ω U0 F k[PU P PU (M)]:,jk2 = kU0U bj − U0(U DjU0)U bjk2 p 0 Ω 0 0 0 0 1/γΩ(L0) − 1kMkF . T T T T = k(I − U0 DjU0)U0 bjk2 ≤ k(I − U0 DjU0)kkU0 bjk2. Lemma 5.11. Let L ∈ m×n and Ω ⊆ {1, ··· , m} × By Lemma 5.3, U T D U is positive definite. As a conse- 0 R 0 j 0 {1, ··· , n}, and let the SVD of L be U Σ V T . Denote quence, σ I U T D U I, where σ > 0 is the minimal 0 0 0 0 j 4 0 j 0 4 j P (·) = U U T (·)+(·)V V T −U U T (·)V V T . If L is Ω/ΩT - eigenvalue of U T D U . By Lemma 5.4 and Definition 3.5, T0 0 0 0 0 0 0 0 0 0 0 j 0 isomeric then σj ≥ γΩ(L0), ∀1 ≤ j ≤ n. Thus, n ⊥ kP P P k ≤ 2(1 − γ T (L )). ⊥ 2 X 2 2 T0 Ω T0 Ω,Ω 0 kPU0 PΩ PU0 (M)kF ≤ (1 − σj) kbjk2 j=1 Proof. Using the same arguments as in the proof of 2 2 ⊥ ⊥ 2 ≤ (1 − γΩ(L0)) kPU0 (M)kF , Lemma 5.6, it can be proven that kPPΩ Pk = kPPΩ k , with m×n P being any orthogonal projection onto a subspace of R . where gives that kP P⊥P k ≤ 1 − γ (L ). U0 Ω U0 Ω 0 Thus, we have the following It remains to prove that the value of 1 − γΩ(L0) is attainable. Without loss of generality, assume that j1 = ⊥ ⊥ 2 ⊥ 2 kPT0 PΩ PT0 k = kPT0 PΩ k = sup kPT0 PΩ (M)kF arg minj σj, i.e., σj1 = γΩ(L0). Construct a r0 × r0 matrix B kMkF =1 with the j1th column being the eigenvector corresponding = sup kP P⊥(M) + P⊥ P P⊥(M)k2 T U0 Ω U0 V0 Ω F to the smallest eigenvalue of U0 Dj1 U0 and everywhere else kMkF =1 being zero. Let M1 = U0B. Then it can be verified that = sup (kP P⊥(M)k2 + kP⊥ P P⊥(M)k2 ) ⊥ U0 Ω F U0 V0 Ω F kPU0 PΩ PU0 (M1)kF = (1 − γΩ(L0))kM1kF . kMkF =1 m×n ≤ sup kP P⊥(M)k2 + sup kP P⊥(M)k2 Lemma 5.10. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × U0 Ω F V0 Ω F T kMkF =1 kMkF =1 {1, ··· , n}. Let the SVD of L0 be U0Σ0V0 . Denote PU0 (·) = T T T = kP P⊥k2 + kP P⊥k2, U0U0 (·) and PV0 (·) = (·)V0V0 . If L0 is Ω/Ω -isomeric then: U0 Ω V0 Ω s 1 which, together with Lemma 5.9, gives that k(P P P )−1P P P⊥ k = − 1, U0 Ω U0 U0 Ω U0 γΩ(L0) ⊥ ⊥ ⊥ s kPT0 PΩ PT0 k ≤ kPU0 PΩ PU0 k + kPV0 PΩ PV0 k −1 ⊥ 1 T k(P P P ) P P P k = − 1. = 1 − γ (L ) + 1 − γ T (L ) ≤ 2(1 − γ T (L )) V0 Ω V0 V0 Ω V0 T Ω 0 Ω 0 Ω,Ω 0 γΩT (L0 ) m×n Proof. We shall prove the first claim. Let M ∈ R . Denote −1 ⊥ the jth column of M and (PU0 PΩPU0 ) PU0 PΩPU (M) 0 L ∈ m×n Ω ⊆ {1, ··· , m} × as bj and yj, respectively. Denote δij as in (18) and Lemma 5.12. Let 0 R and T n {1, ··· , n}, and let the SVD of L0 be U0Σ0V . Denote define a set of diagonal matrices {Dj}j=1 as Dj = 0 m×m P (·) = U U T (·) + (·)V V T − U U T (·)V V T . If the operator diag(δ1j, δ2j, ··· , δmj) ∈ R . Then we have T0 0 0 0 0 0 0 0 0 PT PΩPT is invertible, then we have −1 ⊥ 0 0 yj = [(PU PΩPU ) PU PΩP (M)]:,j 0 0 0 U0 s T −1 T T 1 = U0(U0 DjU0) U0 Dj(I − U0U0 )bj. k(P P P )−1P P P⊥ k = − 1. T0 Ω T0 T0 Ω T0 ⊥ 1 − kPT P PT k It can be calculated that 0 Ω 0 2 T −1 T T 2 2 kyjk2 ≤ k(U0 DjU0) U0 Dj(I − U0U0 )k kbjk2 = Proof. We shall use again the two notations, vec(·) and T −1 T T T −1 2 D, defined in the proof of Lemma 5.6. Let P ∈ k(U0 DjU0) U0 Dj(I − U0U0 )DU0(U0 DjU0) kkbjk2 mn×r   R be a column-wisely orthonormal matrix such that T −1 2 1 2 vec(P (M)) = PP T vec(M) ∀M P P P = k(U0 DjU0) − Ikkbjk2 ≤ − 1 kbjk2, T0 , . Since T0 Ω T0 is in- γΩ(L0) vertible, it follows that P T DP is positive definite. Denote IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 11

by σmin(·) the smallest singular value of a matrix. Then we Since PU0 (·) = PU0 PT0 (·) = PT0 PU0 (·), we have have the following: ⊥ ⊥ ⊥ kPU0 PΩ PU0 k = kPU0 PT0 PΩ PT0 PU0 k ≤ kPT0 PΩ PT0 k < 1. −1 ⊥ 2 k(PT0 PΩPT0 ) PT0 PΩPT k 0 Due to the virtues of Lemma 5.6, Lemma 5.8 and Lemma 5.1, = kP (P T DP )−1P T D(I − PP T )k2 it can be concluded that L0 is Ω-isometric with probability T −1 T T T −1 T −10 = kP (P DP ) P D(I − PP )DP (P DP ) P k at least 1 − n1 . In a similar way, it can be also proven that T T −10 1 L0 is Ω -isometric with probability at least 1 − n . = k(P T DP )−1 − Ik = − 1 1 σ (P T DP ) min Proof. (proof of Theorem 3.2) When L0 is not Ω-isomeric, 1 1 Lemma 5.1 and Lemma 5.8 give that PU0 PΩPU0 is not = T − 1 = ⊥ − 1. ⊥ 1 − kP (I − D)P k 1 − kPT0 PΩ PT0 k invertible. By Lemma 5.6, PU0 ∩PΩ 6= {0}. Thus, there exists ⊥ ∆ 6= 0 that satisfies ∆ ∈ PU0 and ∆ ∈ PΩ . Now construct L = L0 + ∆. Then we have L 6= L0, PΩ(L) = PΩ(L0) The following lemma is more general than Theorem 4.3. and rank (L) = rank (PU0 (L0 + ∆)) ≤ rank (L0). Since ⊥ m×n PU ∩ P is a nonempty linear space, there are indeed Lemma 5.13. Let L0 ∈ R and Ω ⊆ {1, ··· , m} × 0 Ω {1, ··· , n}. Consider the following convex problem: infinitely many choices for L. Proof. (proof of Theorem 3.3) Using the same arguments as min kXk , s.t. PΩ(AX − L0) = 0, (19) X UI in the proof of Theorem 3.1, we conclude that the following holds with probability at least 1 − n−10: where k·kUI generally denotes a convex unitary invariant norm 1 m×p and A ∈ R is given. If L0 ∈ span{A} and A is Ω- ⊥ ρ0 + kPU0 PΩ PU0 k < 1 − ρ0 + √ , isomeric then X0 = A L0 is the unique minimizer to the convex α optimization problem in (19). which, together with Lemma 5.9, gives that γΩ(L0) > (1 − T √ T Proof. Denote the SVD of A as U Σ V . Then it follows T A A A 1/ α)ρ√0. Similarity, it can be also proven that γΩ (L0 ) > P (AX − L ) = 0 L ∈ span{A} −10 from Ω 0 and 0 that (1 − 1/ α)ρ0 with probability at least 1 − n1 .

PUA PΩPUA (AX − L0) = 0. 5.5 Proof of Theorem 3.4 P P P By Lemma 5.1 and Lemma 5.8, UA Ω UA is invertible and T AX = L P (AX − L ) = 0 Let the SVD of L0 be U0Σ0V0 . Denote the ith row of U0 thus 0. Hence, Ω 0 is equivalent to T T T T AX = L as ui , i.e., U0 = [u1 ; u2 ; ··· ; um]. Define δij as in (18), 0. Notice, that Theorem 4.1 of [14] actually holds n and define a collection of diagonal matrices {Dj}j=1 as for any convex unitary invariant norms. That is, m×m Dj = diag(δ1j, δ2j, ··· , δmj) ∈ . With these nota- + R A L0 = arg min kXkUI , s.t. AX = L0, P P⊥P X tions, we shall show that the operator norm of U0 Ω U0 j + can be bounded from above. Considering the th column of which implies that A L0 is the unique minimizer to the ⊥ PU0 PΩ PU0 (X), ∀X, j, we have problem in (19). ⊥ T T [PU0 PΩ PU0 (X)]:,j = U0U0 (I − Dj)U0U0 [X]:,j, 5.4 Proofs of Theorems 3.1, 3.2 and 3.3 which gives that We need to use some notations as follows. Let the SVD of L0 ⊥ T T T T T k[PU0 PΩ PU0 (X)]:,jk2 ≤ kU0U0 (I − Dj)U0U0 kk[X]:,jk2. be U0Σ0V0 . Denote PU0 (·) = U0U0 (·), PV0 (·) = (·)V0V0

and PT0 (·) = PU0 (·) + PV0 (·) − PU0 PV0 (·). Since the diagonal of Dj has at most (1 − ρ)m zeros,

Proof. (proof of Theorem 3.1) Define an operator H in the m1 T T X T same way as in [10]: kU0U0 (I − Dj)U0U0 k = k (1 − δij)uiui k i=1 1 m H = PT0 − PT0 PΩA PT0 . X T ρ0 ≤ (1 − δij)kuiui k ≤ (1 − ρ)µ0r0, According to Theorem 4.1 of [10], there exists some numeri- i=1 cal constant c > 0 such that the inequality, where the last inequality follows from the definition of s coherence. Thus, we have cµ0r0 log n1 kHk ≤ , ⊥ kPU P PU k ≤ (1 − ρ)µ0r0. ρ0n2 0 Ω 0 −10 Similarly, based on the assumption that at least ρn entries in holds with probability at least 1 − n provided that the 1 each row of L are observed, we have right hand side is smaller than 1. So, kHk < 1 provided that 0 ⊥ cµ0r0 log n1 kPV0 PΩ PV0 k ≤ (1 − ρ)µ0r0. ρ0 > . n2 By the assumption ρ > 1 − (1 − α)/(µ0r0), When kHk < 1, we have ⊥ ⊥ kPU0 PΩ PU0 k < 1 − α and kPV0 PΩ PV0 k < 1 − α. ⊥ kPT0 PΩ PT0 k = kρ0H + (1 − ρ0)PT0 k T By Lemma 5.6 and Lemma 5.8, L0 is Ω/Ω -isomeric. In ≤ ρ0kHk + (1 − ρ0)kPT k < 1. 0 addition, it follows from Lemma 5.9 that γΩ,ΩT (L0) > α. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 12 5.6 Proofs of Theorems 4.1 and 4.3 5.8 Proof of Theorem 4.4 1 1 2 T 2 T Theorem 4.3 is indeed an immediate corollary of Proof. Since A0 = U0Σ0 Q and X0 = QΣ0 V0 , we have Lemma 5.13. So we only prove Theorem 4.1. the following: 1) A0X0 = L0; 2) L0 ∈ span{A0} and A0 T T T T is Ω-isomeric; 3) L0 ∈ span{X0 } and X0 is Ω -isomeric. + Proof. By y0 ∈ S0 ⊆ span{A}, y0 = AA y0 and therefore Hence, according to Lemma 5.13, we have + + yb = AbA y0. That is, x0 = A y0 is a feasible solution to + 2 k X0 = A0 L0 = arg min kXkF , s.t. PΩ(A0X − L0) = 0, the problem in (6). Provided that yb ∈ R and the dictionary X A k rank (A ) = + 2 matrix is -isomeric, Definition 3.1 gives that b A0 = L0X0 = arg min kAkF , s.t. PΩ(AX0 − L0) = 0. rank (A), which implies that A (A ,X ) T T Hence, 0 0 is a critical point to the problem in (9). span{Ab } = span{A }. It remains to prove the second claim. Suppose that (A =

+ T A0 + ∆0,X = X0 + E0) with k∆0k ≤ ε and kE0k ≤ ε is a On the other hand, it is easy to see that A y0 ∈ span{A }. p feasible solution to (9). We want to prove that Hence, there exists a dual vector w ∈ R that obeys 1 2 2 1 2 2 1 (kAkF + kXkF ) ≥ (kA0kF + kX0kF ) AT w = A+y , i.e., AT w ∈ ∂ kA+y k2. 2 2 b 0 b 2 0 2 holds for some small ε, and show that the equality can hold + By standard convexity arguments [39], x0 = A y0 is an only if AX = L0. Denote optimal solution to the problem in (6). Since the squared T T PU (·) = U0U (·), PV (·) = (·)V0V , (20) `2 norm is a strongly convex function, it follows that the 0 0 0 0 P = (P P P )−1P P P⊥ , optimal solution to (6) is unique. 1 U0 Ω U0 U0 Ω U0 P = (P P P )−1P P P⊥ . 2 V0 Ω V0 V0 Ω V0 5.7 Proof of Theorem 4.2 Define T Proof. Let the SVD of L0 be U0Σ0V0 . Denote PT0 (·) = A¯ = A + P (∆ ) X¯ = X + P (E ). T T T T 0 0 U0 0 and 0 0 V0 0 (21) U0U0 (·)+(·)V0V0 −U0U0 (·)V0V0 . Since γΩ,ΩT (L0) > 0.5, ⊥ + + it follows from Lemma 5.11 that kPT0 PΩ PT0 k is strictly Provided that ε < min(1/kA0 k, 1/kX0 k), it follows from smaller than 1. By Lemma 5.6, PT0 PΩPT0 is invertible and Lemma 5.7 that ⊥ T ∩ Ω = {0} γ T (L ) > 0.75 0 . Given Ω,Ω 0 , Lemma 5.12 and ¯  ¯  Lemma 5.11 imply that rank A0 = rank X0 = r0, (22) kA+k kX+k s ¯+ 0 ¯ + 0 kA0 k ≤ + and kX0 k ≤ + . −1 ⊥ 1 1 − kA kε 1 − kX kε k(PT PΩPT ) PT PΩP k = − 1 0 0 0 0 0 T0 1 − kP P⊥P k T0 Ω T0 P (AX − L ) = 0 s By Ω 0 , 1 ≤ − 1 < 1. PΩ(A0E0 + ∆0X0 + ∆0E0) = 0. 2γΩ,ΩT (L0) − 1 Then it can be manipulated that Next, we shall consider a feasible solution L = L0 + ∆ and show that the objective strictly increases unless ∆ = 0. By PΩ(A¯0E0) ⊥ PΩ(∆) = 0, PΩPT (∆) = −PΩP (∆). Since the operator ⊥ ⊥ 0 T0 = −P (∆ X¯ − P P (∆ E ) + P P (∆ E )). Ω 0 0 U0 V0 0 0 U0 V0 0 0 PT0 PΩPT0 is invertible, we have P P P −1 ⊥ Since U0 Ω U0 is invertible, we have PT0 (∆) = −(PT0 PΩPT0 ) PT0 PΩPT (∆). 0 ⊥ ¯ ⊥ −1 ¯ PV (A0E0) = −PV (PU0 PΩPU0 ) PU0 PΩ(∆0X0 (23) −1 ⊥ 0 0 By k(PT PΩPT ) PT PΩP k < 1, kPT (∆)k∗ < ⊥ ⊥ 0 0 0 T0 0 − PU PV (∆0E0) + P P (∆0E0)) kP⊥ (∆)k P⊥ (∆) = 0 0 0 U0 V0 T0 ∗ holds unless T0 . By the convexity of ⊥ ⊥ ⊥ ⊥ ⊥ = −P P1P (∆0X¯0) − P P1P P (∆0E0) the nuclear norm, V0 U0 V0 U0 V0

T Similarly, by the invertibility of PV0 PΩPV0 , kL0 + ∆k∗ − kL0k∗ ≥ h∆,U0V0 + W i, ⊥ P (∆0X¯0) (24) W ∈ P⊥ kW k ≤ 1 U0 where T0 and . Due to the duality between ⊥ ⊥ ⊥ ⊥ ⊥ = −P P2P (A¯0E0) − P P2P P (∆0E0). the nuclear norm and operator norm, we can construct a W U0 V0 U0 U0 V0 ⊥ such that h∆,W i = kP (∆0)k∗. Thus, T0 The combination of (23) and (24) gives that

⊥ ⊥ ⊥ ⊥ kL0 + ∆k∗ − kL0k∗ ≥ kP (∆)k∗ − kPU PV (∆)k∗ P (A¯ E ) = P P P P (A¯ E )+ T0 0 0 V0 0 0 V0 1 2 V0 0 0 ⊥ ⊥ ⊥ ⊥ ≥ kP (∆)k∗ − kPT (∆)k∗. P (P P − P )P P (∆ E ). T0 0 V0 1 2 1 U0 V0 0 0 ¯  Hence, kL0 + ∆k∗ is strictly greater than kL0k∗ unless ∆ ∈ By rank A0 = r0 = p, ⊥ T0. Since T0 ∩ Ω = {0}, it follows that L0 is the unique P⊥ P⊥ (∆ E ) = P⊥ P⊥ (∆ A¯+A¯ E ). minimizer to the problem in (7). U0 V0 0 0 U0 V0 0 0 0 0 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 13

By Lemma 5.10 and the assumption of γΩ,ΩT (L0) > 0.5, We also have kP1k < 1 and kP2k < 1. Thus, ¯ + ˜ ˜T ¯ + ˜ ˜T ⊥ ¯ ⊥ ¯ hA0 +∆0 −L0X0 , U0Q i = h∆0 + A0E0X0 , U0Q i kPV (A0E0)k ≤ kP1P2kkPV (A0E0)k 0 0 = h∆ X¯ X¯ + + A E X¯ +, U˜ Q˜T i, + ε(kP P k + kP k)kA¯+kkP⊥ (A¯ E )k 0 0 0 0 0 0 0 1 2 1 0 V0 0 0 ! 1 2εkA+k which gives that ≤ − 1 + 0 kP⊥ (A¯ E )k. + V0 0 0 γΩ,ΩT (L0) 1 − kA kε 0 ¯ + ˜ ˜T ¯ + abs(hA0 + ∆0 − L0X0 , U0Q i) ≤ kX0 kkPU0 PV0 (27) Let (∆ X¯ + A E )k ≤ kX¯ +kk∆ X¯ + P (A E )k , ! 0 0 0 0 ∗ 0 0 0 V0 0 0 ∗ 1 2γ T (L ) − 1 ε < min , Ω,Ω 0 . + + abs(·) 2kA0 k 4kA0 kγΩ,ΩT (L0) where we denote by the absolute value of a real number. By PΩ(A0E0 + ∆0X0 + ∆0E0) = 0, kP⊥ (A¯ E )k < kP⊥ (A¯ E )k Then we have that V0 0 0 V0 0 0 strictly P⊥ (A¯ E ) = 0 rank A¯  = r = p ∆ X¯ + P (A E ) = −P P⊥ (A E ) − P P⊥ (∆ E ) holds unless V0 0 0 . Since 0 0 , 0 0 V0 0 0 2 V0 0 0 2 V0 0 0 P⊥ (A¯ E ) = 0 simply leads to E ∈ P . Hence, ⊥ ⊥ V0 0 0 0 V0 = −P (−P P (∆ X + ∆ E ) − P P (∆ E )) 2 V0 1 0 0 0 0 V0 U0 0 0 ⊥ − P P⊥ (∆ E ) = P P (∆ X + ∆ E ) − P P⊥ (∆ E ) A0E0 + ∆0X0 + ∆0E0 ∈ PV0 ∩ PΩ = {0}, 2 V0 0 0 2 1 0 0 0 0 2 U0 0 0 ⊥ ⊥ = P2P1(∆0X¯0) + P2P1P (∆0E0) − P2P (∆0E0). which implies that AX = L0. Thus, we finally have V0 U0 1 1 2 2 2 2 By Lemma 5.10 and the assumption of γΩ,ΩT (L0) > 0.5, (kAkF + kXkF ) ≥ kL0k∗ = (kA0kF + kX0kF ), 2 2 kP1k < 1 and kP2k < 1. As a result, we have where the inequality follows from kAXk∗ = 1 2 2 k∆ X¯ + P (A E )k (28) minA,X 2 (kAkF + kXkF ) [37]. 0 0 V0 0 0 ∗ ≤ kP⊥ (∆ X¯ )k + 2kP⊥ (∆ E )k . U0 0 0 ∗ U0 0 0 ∗ 5.9 Proof of Theorem 4.5 2 1 3 T 3 T Let Proof. Since A0 = U0Σ0 Q and X0 = QΣ0 V0 , we have   the following: 1) A0X0 = L0; 2) L0 ∈ span{A0} and A0 0.1kX0k 0.175 T T T T is Ω-isomeric; 3) L ∈ span{X } and X is Ω -isomeric. ε < min , + . 0 0 0 1 + 1.1τ0 kX k Due to Lemma 5.13, we have 0 ⊥ ⊥ + 2 Due to (27), (28) and the assumption of kP (∆0)P ¯ k∗ ≥ X0 = A0 L0 = arg min kXkF , s.t. PΩ(A0X − L0) = 0, U0 Q X ⊥ 2τ kP (∆ )P ¯ k + 0 U0 0 Q ∗, it can be calculated that A0 = L0X0 = arg min kAk∗, s.t. PΩ(AX0 − L0) = 0. A ¯ + ˜ ˜T abs(hA0 + ∆0 − L0X0 , U0Q i) (29) Hence, (A0,X0) is a critical point to the problem in (11). + ⊥ + ⊥ ⊥ ≤ kX¯ kkP (∆ X¯ )k + 2kX¯ kkP (∆ (P ¯ + P )E )k Regarding the second claim, we consider a feasible so- 0 U0 0 0 ∗ 0 U0 0 Q Q¯ 0 ∗ (A = A + ∆ ,X = X + E ) k∆ k ≤ ε + ⊥ + ⊥ lution 0 0 0 0 , with 0 and ≤ kX¯ kkX¯0kkP (∆0)P ¯ k∗ + 2εkX¯ kkP (∆0)P ¯ k∗ ¯ ¯ 0 U0 Q 0 U0 Q kE0k ≤ ε. Define PU0 , PV0 , P1, P2, A0 and X0 in the same + ⊥ ⊥ ⊥ + 2εkX¯ kkP (∆0)P ¯ k∗ ≤ 1.1τ0kP (∆0)P ¯ k∗ way as in (20) and (21). Note that the statements in (22) still 0 U0 Q U0 Q ⊥ ⊥ ⊥ p ≥ r X¯ + 0.2τ kP (∆ )P ¯ k + 0.35kP (∆ )P k hold in the general case of 0. Denote the SVD of 0 as 0 U0 0 Q ∗ U0 0 Q¯ ∗ ¯ ¯ ¯ T T ¯ ¯ T QΣV0 . Then we have V0V0 = V0V0 . Denote ≤ (0.65 + 0.35)kP⊥ (∆ )P ⊥k = kP⊥ (∆ )P ⊥k . U0 0 Q¯ ∗ U0 0 Q¯ ∗ ¯ ¯T ⊥ ¯ ¯T PQ¯ = QQ and PQ¯ = I − QQ . Now, combining (25), (26) and (29), we have Denote the condition number of X0 as τ0. With these nota- tions, we shall finish the proof by exploring two cases. kAk − kL X¯ +k ≥ kP⊥ (∆ )P ⊥k ∗ 0 0 ∗ U0 0 Q¯ ∗ + T ⊥ ⊥ ⊥ − abs(hA + ∆ − L X¯ , U˜ Q˜ i) ≥ 0, 5.9.1 Case 1: kP (∆ )P k ≥ 2τ kP (∆ )P ¯ k 0 0 0 0 0 U0 0 Q¯ ∗ 0 U0 0 Q ∗ ¯ + ˜ ˜ ˜T Denote the SVD of L0X0 as U0ΣQ . Then we have which, together with Lemma 5.5, simply leads to ˜ ˜ T T ˜ ˜T ¯ ¯T U0U0 = U0U0 and QQ = QQ . 1 2 ¯ + kAk∗ + kXkF = (kAk∗ − kL0X k∗) By the convexity of the nuclear norm, 2 0 + 1 2 + 1 2 ¯ + ¯ + + (kL0X¯ k∗ + kXk ) ≥ kL0X¯ k∗ + kX¯0k kAk∗ − kL0X0 k∗ = kA0 + ∆0k∗ − kL0X0 k∗ (25) 0 2 F 0 2 F ¯ + ˜ ˜T 3  2  1 ≥ hA0 + ∆0 − L0X0 , U0Q + W i, ≥ tr Σ 3 = kA k + kX k2 . 2 0 0 ∗ 2 0 F m×p ˜ T ˜ where W ∈ R , U0 W = 0, W Q = 0 and kW k ≤ 1. Due 2 2 to the duality between the nuclear norm and operator norm, For the equality of kAk∗ + 0.5kXkF = kA0k∗ + 0.5kX0kF we can construct a W such that to hold, at least, kXkF = kX¯0kF must be obeyed, which implies that E ∈ P . Hence, we have A E + ∆ X + h∆ ,W i = kP⊥ (∆ )P ⊥k . 0 V0 0 0 0 0 0 U0 0 Q¯ ∗ (26) ⊥ ∆0E0 ∈ PV0 ∩ PΩ = {0}, which gives that AX = L0. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 14

1 0.8 200 nonuniform uniform actual 0.8 Theorem 3.4 0.6 150

0.6 dB 0.4 100 miss 80% 0.4 m=100 miss 50% PSNR m=200 0.2 0.2 miss 20% 50 m=500 m=1000

0 relative condition number condition relative relative condition number condition relative 0 0 20 40 60 80 100 200 300 400 500 20 40 60 80 missing entries (%) matrix size (m) missing entries (%)

Fig. 3. Left: The relative condition number γΩ,ΩT (L0) vs the missing rate 1 − ρ0 at m = 500. Middle: The relative condition number vs Fig. 4. Visualizing the configurations of Ω used in our simulations. The the matrix size m. Right: Plotting the recovery performance of convex white points correspond to the locations of the observed entries. In these optimization as a function of the missing rate. two examples, 90% entries of the matrix are missing.

⊥ ⊥ ⊥ 5.9.2 Case 2: kP (∆0)P k∗ ≤ 2τ0kP (∆0)P ¯ k∗ U0 Q¯ U0 Q convolution matrix [40]. Thus, the forecasting tasks here can Using a similar manipulation as in the proof of Theorem 4.4, be converted to matrix completion problems, with we have L = A(x) and Ω = supp(A(y)), ⊥ ¯ ⊥ ⊥ ¯ 0 PU (∆0X0) = PU P2P1PU (∆0X0)+ 0 0 0 5 ⊥ ⊥ ⊥ ⊥ ⊥ where A(·) is the convolution matrix of a tensor , and P (P2P1 − P2)P P (∆0E0) = P P2P1P (∆0X¯0) U0 U0 V0 U0 U0 supp(·) is the support set of a matrix. In this example, ⊥ ⊥ ⊥ ⊥ + P (P P − P )P P (∆ P ¯ E + ∆ P E ). m×m U0 2 1 2 U0 V0 0 Q 0 0 Q¯ 0 L0 ∈ R is a circulant matrix that is perfectly incoherent and low rank; namely, rank (L0) ≡ 2 and µ(L0) ≡ 1, Due to Lemma 5.10 and the assumption of γΩ,ΩT (L0) > 0.5, ∀m > 2. Moreover, each column and each row of Ω have we have kP1k < 1 and kP2k < 1. By the assumption of ⊥ ⊥ ⊥ exactly a cardinality of ρ0m. We use the convex program (7) kP (∆ )P k ≤ 2τ kP (∆ )P ¯ k U0 0 Q¯ ∗ 0 U0 0 Q ∗, to restore L0 from the given observations. kP⊥ (∆ X¯ )k ≤ kP P kkP⊥ (∆ X¯ )k The results are shown in Figure 3. It can be seen that U0 0 0 ∗ 2 1 U0 0 0 ∗ ⊥ ⊥ the relative condition number is independent of the matrix + (4τ0 + 2)εkP (∆0)P ¯ k∗ = kP2P1kkP (∆0X¯0)k∗ U0 Q U0 sizes and monotonously deceases as the missing rate grows. + (4τ + 2)εkP⊥ (∆ )X¯ X¯ +k 0 U0 0 0 0 ∗ As we can see from the right hand side of Figure 3, the ! 1 (4τ + 2)εkX+k recovery performance visibly declines when the missing ≤ − 1 + 0 0 kP⊥ (∆ X¯ )k . + U0 0 0 ∗ rate exceeds 30% (i.e., ρ0 < 0.7), which approximately γΩ,ΩT (L0) 1 − kX0 kε corresponds to γΩ,ΩT (L0) < 0.55. When ρ0 < 0.3 (which Let corresponds approximately to γΩ,ΩT (L0) < 0.15), matrix ! completion totally breaks down. These results illustrate that 1 2γΩ,ΩT (L0) − 1 ε < min + , + . relative well-conditionedness is important for guaranteeing 2kX0 k (8τ0 + 4)kX0 kγΩ,ΩT (L0) the success of matrix completion in practice. Of course, the kP⊥ (∆ X¯ )k < kP⊥ (∆ X¯ )k lower bound on γΩ,ΩT (L0) would depend on the character- Then U0 0 0 ∗ U0 0 0 ∗ strictly holds unless ⊥ istics of data, and the condition γΩ,ΩT (L0) > 0.75 proven P (∆0X¯0) = 0. That is, U0 in Theorem 4.2 is just a universal bound for guaranteeing ⊥ ⊥ ⊥ P (∆ )P ¯ = 0 P (∆ )P = 0. U0 0 Q and thus U0 0 Q¯ exact recovery in the worst case. In addition, the estimate given in Theorem 3.4 is accurate only when the missing rate Hence, we have P⊥ (∆ ) = 0, which simply leads to U0 0 is low, as shown in the left part of Figure 3. ⊥ Among the other things, it is worth noting that the A0E0 + ∆0X0 + ∆0E0 ∈ PU0 ∩ PΩ = {0}, sampling complexity does not decrease as the matrix size and which gives that AX = L0. By Lemma 5.5, m grows. This phenomenon is in conflict with the uniform  2  sampling based matrix completion theories, which prove 1 2 3 3 1 2 kAk∗ + kXk ≥ tr Σ = kA0k∗ + kX0k . O((log m)2/m) 2 F 2 0 2 F that a small fraction of entries should suffice to recover L0 [41], and which implies that the sampling complexity should decrease to zero when the matrix size m goes to infinity. Hence, as aforementioned, the theories built upon uniform sampling are no longer applicable when 6 EXPERIMENTS applying to the deterministic missing data patterns. 6.1 Investigating the Relative Condition Number To study the properties of the relative condition number, m 6.2 Results on Randomly Generated Matrices we generate a vector x ∈ R according to the model To evaluate the performance of various matrix completion [x]t = sin(2tπ/m), t = 1, ··· , m. That is, x is a univariate time series of dimension m. We consider the forecasting methods, we generate a collection of m × n (m = n = 100) m×r0 tasks of recovering x from a collection of l observations, target matrices according to L0 = BC, where B ∈ R r0×n l and C ∈ R are N (0, 1) matrices. The rank of L0, {[x]t}t=1, where l = ρ0m varies from 0.1m to 0.9m with step size 0.1m. Let y ∈ m be the mask vector of the R 5. Unlike [40], we adopt here the circulant boundary condition. Thus, sampling operator, i.e., [y]t is 1 if [x]t is observed and 0 the jth column of A(x) is simply the vector obtained by circularly otherwise. In order to recover x, it suffices to recover its shifting the elements in x by j − 1 positions. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 15

convex optimization LRFD IsoDP 95 95 95 75 75 75 convex optimization LRFD IsoDP 9555 9555 9555

7535 7535 7535

5515 5515 5515

observed entries (%) 351 observed entries (%) 351 observed entries (%) 351 1 15 35 55 75 95 1 15 35 55 75 95 1 15 35 55 75 95 15 15 15 rank(L ) rank(L ) rank(L ) 0 0 0 observed entries (%) observed entries (%) observed entries (%) 1 convex optimization 1 LRFD 1 IsoDP 1 15 35 55 75 95 (a)1 15 nonuniform35 55 75 95 1 15 35 55 75 95 95 rank(L ) 95 rank(L ) 95 rank(L ) Fig. 7. An example image from the Oxford dinosaur sequence and the 75 0 75 0 75 0 locations of the observed entries in the data matrix of trajectories. In this convex optimization LRFD IsoDP 9555 9555 9555 dataset, 74.29% entries of the trajectory matrix are missing. 7535 7535 7535 (a) input (b) convex 5515 5515 5515

observed entries (%) 351 observed entries (%) 351 observed entries (%) 351 1 15 35 55 75 95 1 15 35 55 75 95 1 15 35 55 75 95 15 rank(L ) 15 rank(L ) 15 rank(L ) 0 0 0 observed entries (%) 1 observed entries (%) 1 observed entries (%) 1 1 15 35 55 75 95 1 15 35 55 75 95 1 15 35 55 75 95 rank(L ) rank(L ) rank(L ) 0 0 0 (b) uniform Fig. 5. Comparing IsoDP with convex optimization and LRFD. The numbers plotted on the above figures are the success rates within 20 random trials. The white and black areas mean “succeed” and “fail”, respectively. Here, the success is in a sense that PSNRdB ≥ 40. (c) LRFD (d) IsoDP

uniform nonuniform 95 95

75 75

55 55

35 35

observed entries (%) entries observed 15

observed entries (%) entries observed 15 Fig. 8. Some examples of the originally incomplete and fully restored 1 1 trajectories. (a) The original incomplete trajectories. (b) The trajectories 1 15 35 55 75 95 1 15 35 55 75 95 restored by convex optimization [10]. (c) The trajectories restored by rank(L ) rank(L ) 0 0 LRFD [29]. (d) The trajectories restored by IsoDP. Fig. 6. Visualizing the regions in which the isomeric condition holds. terms of the number of successfully restored matrices, IsoDP i.e., r0, is configured as r0 = 1, 5, 10, ··· , 90, 95. Regard- outperforms both convex optimization and LRFD by 44%. ing the sampling set Ω consisting of the locations of the These results verify the effectiveness of IsoDP. Figure 6 observed entries, we consider two settings: One is to cre- plots the regions where the isometric condition is valid. ate Ω by using a Bernoulli model to randomly sample a By comparing Figure 5 to Figure 6, it can be seen that the subset from {1, ··· , m} × {1, ··· , n} (referred to as “uni- recovery performance of IsoDP has not reached the upper form”), the other is to let the locations of the observed limit defined by isomerism. That is, there is still some room entries be centered around the main diagonal of a matrix left for improvement. (referred to as “nonuniform”). Figure 4 shows how the sampling set Ω looks like. The observation fraction is set 6.3 Results on Motion Data as |Ω|/(mn) = 0.01, 0.05, ··· , 0.9, 0.95. To show the advan- We now consider the Oxford dinosaur sequence6, which tages of IsoDP, we include for comparison two prevalent contains in total 72 image frames corresponding to 4983 methods: convex optimization [10] and Low-Rank Factor track points observed by at least 2 among 36 views. The Decomposition (LRFD) [29]. The same as IsoDP, these two values of the observations range from 8.86 to 629.82. We methods do not assume that rank of L0 either. When p = m select 195 track points which are observed by at least 6 views and the identity matrix is used to initialize the dictionary for experiment, resulting in a 72 × 195 trajectory matrix A, the bilinear program (9) does not outperform convex 74.29% entries of which are missing (see Figure 7). The optimization, thereby we exclude it from the comparison. tracked dinosaur model is rotating around its center, and The accuracy of recovery, i.e., the similarity between thus the true trajectories should form complete circles [42]. L Lˆ 0 and 0, is measured by Peak Signal-to-Noise Ratio The results in Theorem 4.5 imply that our IsoDP may PSNR ( dB). Figure 5 compares IsoDP to convex optimization possess the ability to attain a solution of strictly low rank. and LRFD. It can be seen that IsoDP works distinctly better To confirm this, we evaluate convex optimization, LRFD than the competing methods. Namely, while handling the and IsoDP by examining the rank of the restored trajectory nonuniformly missing data, the number of matrices suc- matrix as well as the fitting error on the observed entries. cessfully restored by IsoDP is 102% and 71% more than Table 1 shows the evaluation results. It can be seen that, convex optimization and LRFD, respectively. While dealing with the missing entries chosen uniformly at random, in 6. Available at http://www.robots.ox.ac.uk/∼vgg/data1.html IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 16

TABLE 1 better than convex optimization and LRFD, confirming the Mean square error (MSE) on the Oxford dinosaur sequence. Here, the −4 effectiveness of IsoDP on realistic datasets. rank of a matrix is estimated by #{i|σi ≥ 10 σ1}, where σ1 ≥ σ2 ≥ · · · are the singular values of the matrix. The regularization parameter in each method is manually tuned such that the rank of the restored matrix meets a certain value. Here, the MSE values are 7 CONCLUSION evaluated on the training data (i.e., observed entries). This work studied the identifiability of real-valued matrices rank of the restored matrix convex optimization LRFD IsoDP under the convex of deterministic sampling. We established 6 426.1369 28.4649 0.6140 two deterministic conditions, isomerism and relative well- 7 217.9963 21.6968 0.4682 conditionedness, for ensuring that an arbitrary matrix is 8 136.7643 17.2269 0.1480 identifiable from a subset of the matrix entries. We first 9 94.4673 13.954 0.0585 10 53.9864 6.3768 0.0468 proved that the proposed conditions can hold even if the 11 43.2613 5.9877 0.0374 missing data pattern is irregular. Then we proved a series of 12 29.7542 4.5136 0.0302 theorems for missing data recovery and convex/nonconvex TABLE 2 matrix completion. In general, our results could help to MSE on the MovieLens dataset. The regularization parameters of the understand the completion regimes of arbitrary missing competing methods have been manually tuned to the best. Here, the data patterns, providing a basis for investigating the other MSE values are evaluated on the testing data. related problems such as data forecasting. methods MSE random 3.7623 average 1.6097 convex optimization 0.9350 ACKNOWLEDGEMENT LRFD 0.9213 IsoDP (λ = 0.0005) 0.8412 This work is supported in part by New Generation AI Major IsoDP (λ = 0.0008) 0.8250 Project of Ministry of Science and Technology under Grant IsoDP (λ = 0.001) 0.8228 SQ2018AAA010277, in part by national Natural Science IsoDP (λ = 0.002) 0.8295 IsoDP (λ = 0.005) 0.8583 Foundation of China (NSFC) under Grant 61622305, Grant 61532009 and Grant 71490725, in part by Natural Science Foundation of Jiangsu Province of China (NSFJPC) under while the restored matrices have the same rank, the fitting Grant BK20160040, in part by SenseTime Research Fund. error produced by IsoDP is much smaller than the compet- ing methods. The error of convex optimization is quite large, because the method cannot produce a solution of exactly REFERENCES low rank unless a biased regularization parameter is chosen. Figure 8 shows some examples of the originally incomplete [1] E. Candes` and T. Tao, “The power of convex relaxation: and fully restored trajectories. Our IsoDP method can ap- Near-optimal matrix completion,” IEEE Transactions on proximately recover the circle-like trajectories. Information Theory, vol. 56, no. 5, pp. 2053–2080, 2009. [2] E. Candes` and Y. Plan, “Matrix completion with noise,” in IEEE Proceeding, vol. 98, 2010, pp. 925–936. 6.4 Results on Movie Ratings [3] K. Mohan and M. Fazel, “New restricted isometry re- We also consider the MovieLens [43] datasets that are sults for noisy low-rank recovery,” in IEEE International widely used in research and industry. The dataset we use Symposium on Information Theory, 2010, pp. 1573–1577. is consist of 100,000 ratings (integers between 1 and 5) from [4] R. Mazumder, T. Hastie, and R. Tibshirani, “Spectral 943 users on 1682 movies. The distribution of the observed regularization algorithms for learning large incomplete entries is severely imbalanced: The number of movies rated matrices,” Journal of Machine Learning Research, vol. 11, by each user ranges from 20 to 737, and the number of users pp. 2287–2322, 2010. who have rated for each movie ranges from 1 to 583. We [5] A. Krishnamurthy and A. Singh, “Low-rank matrix and remove the users that have less than 80 ratings, and so for tensor completion via adaptive sampling,” in Neural the movies. Thus the final dataset used for experiments is Information Processing Systems, 2013, pp. 836–844. consist of 14,675 ratings from 231 users on 206 movies. For [6] W. E. Bishop and B. M. Yu, “Deterministic symmetric the sake of quantitative evaluation, we randomly select 1468 positive semidefinite matrix completion,” in Neural In- ratings as the testing data, i.e., those ratings are intentionally formation Processing Systems, 2014, pp. 2762–2770. set unknown to the matrix completion methods. So, the [7] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix com- percentage of the observed entries used as inputs for matrix pletion from noisy entries,” Journal of Machine Learning completion is only 27.75%. Research, vol. 11, pp. 2057–2078, 2010. Despite convex optimization and LRFD, we also con- [8] ——, “Matrix completion from a few entries,” IEEE sider two “trivial” baselines: One is to estimate the unseen Transactions on Information Theory, vol. 56, no. 6, pp. ratings by randomly choosing an integer from the range of 2980–2998, 2010. 1 to 5, the other is to simply use the average rating of 3 to [9] T. Lee and A. Shraibman, “Matrix completion from fill the unseen entries. The comparison results are shown in any given set of observations,” in Neural Information Table 2. As we can see, all the considered matrix completion Processing Systems, 2013, pp. 1781–1787. methods outperform distinctly the trivial baselines, illustrat- [10] E. Candes` and B. Recht, “Exact matrix completion ing that matrix completion is beneficial on this dataset. In via convex optimization,” Foundations of Computational particular, IsoDP with proper parameters performs much Mathematics, vol. 9, no. 6, pp. 717–772, 2009. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, 2019 17

[11] E. J. Candes,` X. Li, Y. Ma, and J. Wright, “Robust of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003. principal component analysis?” Journal of the ACM, [29] G. Liu and P. Li, “Low-rank matrix completion in the vol. 58, no. 3, pp. 1–37, 2011. presence of high coherence,” IEEE Transactions on Signal [12] H. Xu, C. Caramanis, and S. Sanghavi, “Robust PCA Processing, vol. 64, no. 21, pp. 5623–5633, 2016. via outlier pursuit,” IEEE Transactions on Information [30] A. L. Chistov and D. Grigoriev, “Complexity of quan- Theory, vol. 58, no. 5, pp. 3047–3064, 2012. tifier elimination in the theory of algebraically closed [13] R. Sun and Z.-Q. Luo, “Guaranteed matrix completion fields,” in Proceedings of the Mathematical Foundations of via non-convex factorization,” IEEE Transactions on In- Computer Science, 1984, pp. 17–31. formation Theory, vol. 62, no. 11, pp. 6535 – 6579, 2016. [31] B. Recht, W. Xu, and B. Hassibi, “Necessary and suffi- [14] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust cient conditions for success of the nuclear norm heuris- recovery of subspace structures by low-rank represen- tic for rank minimization,” CalTech, Tech. Rep., 2008. tation,” IEEE Transactions on Pattern Recognition and [32] F. Shang, Y. Liu, and J. Cheng, “Scalable algorithms for Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013. tractable schatten quasi-norm minimization,” in AAAI [15] P. Netrapalli, U. N. Niranjan, S. Sanghavi, A. Anand- Conference on Artificial Intelligence, 2016, pp. 2016–2022. kumar, and P. Jain, “Non-convex robust PCA,” in Ad- [33] C. Xu, Z. Lin, and H. Zha, “A unified convex surro- vances in Neural Information Processing Systems, 2014, pp. gate for the schatten-p norm,” in AAAI Conference on 1107–1115. Artificial Intelligence, 2017, pp. 926–932. [16] G. Liu, H. Xu, J. Tang, Q. Liu, and S. Yan, “A deter- [34] H. Attouch and J. Bolte, “On the convergence of the ministic analysis for LRR,” IEEE Transactions on Pattern proximal algorithm for nonsmooth functions involving Recognition and Machine Intelligence, vol. 38, no. 3, pp. analytic features,” Mathematical Programming, vol. 116, 417–430, 2016. no. 1-2, pp. 5–16, 2009. [17] T. Zhao, Z. Wang, and H. Liu, “A nonconvex optimiza- [35] J. Bolte, S. Sabach, and M. Teboulle, “Proximal al- tion framework for low rank matrix estimation,” in ternating linearized minimization for nonconvex and Neural Information Processing Systems, 2015, pp. 559–567. nonsmooth problems,” Mathematical Programming, vol. [18] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has 146, no. 1, pp. 459–494, 2014. no spurious local minimum,” in Neural Information [36] J. Cai, E. Candes, and Z. Shen, “A singular value Processing Systems, 2016, pp. 2973–2981. thresholding algorithm for matrix completion,” SIAM [19] R. Salakhutdinov and N. Srebro, “Collaborative fil- J. on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010. tering in a non-uniform world: Learning with the [37] B. Recht, M. Fazel, and P. Parrilo, “Guaranteed weighted trace norm,” in Neural Information Processing minimum-rank solutions of linear matrix equations via Systems, 2010, pp. 2056–2064. nuclear norm minimization,” SIAM Review, vol. 52, [20] R. Meka, P. Jain, and I. S. Dhillon, “Matrix completion no. 3, pp. 471–501, 2010. from power-law distributed samples,” in Neural Infor- [38] G. W. Stewart, “On the continuity of the generalized mation Processing Systems, 2009, pp. 1258–1266. inverse,” SIAM Journal on Applied Mathematics, vol. 17, [21] F. Kiraly´ and R. Tomioka, “A combinatorial algebraic no. 1, pp. 33–45, 1969. approach for the identi ability of low-rank matrix com- [39] R. Rockafellar, Convex Analysis. Princeton, NJ, USA: pletion,” in International Conference on Machine Learning, Princeton University Press, 1970. 2012, pp. 2056–2064. [40] G. Liu, S. Chang, and Y. Ma, “Blind image deblurring [22] F. Kiraly,´ L. Theran, and R. Tomioka, “The algebraic using spectral properties of convolution operators,” combinatorial approach for low-rank matrix comple- IEEE Transactions on Image Processing, vol. 23, no. 12, tion,” J. Mach. Learn. Res., vol. 16, no. 1, pp. 1391–1436, pp. 5047–5056, 2014. Jan. 2015. [41] Y. Chen, “Incoherence-optimal matrix completion,” [23] S. Negahban and M. J. Wainwright, “Restricted strong IEEE Transactions on Information Theory, vol. 61, no. 5, convexity and weighted matrix completion: Optimal pp. 2909–2923, 2015. bounds with noise,” Journal of Machine Learning Re- [42] Y. Zheng, G. Liu, S. Sugimoto, S. Yan, and M. Okutomi, search, vol. 13, pp. 1665–1697, 2012. “Practical low-rank matrix approximation under robust [24] Y. Chen, S. Bhojanapalli, S. Sanghavi, and R. Ward, l1-norm,” in IEEE Conference on Computer Vision and “Completing any low-rank matrix, provably,” Journal of Pattern Recognition, 2012, pp. 1410–1417. Machine Learning Research, vol. 16, pp. 2999–3034, 2015. [43] F. M. Harper and J. A. Konstan, “The movielens [25] D. L. Pimentel-Alarcon,´ N. Boston, and R. D. Nowak, datasets: History and context,” ACM Transactions on “A characterization of deterministic sampling patterns Interactive Intelligent Systems, vol. 5, no. 4, pp. 1901– for low-rank matrix completion,” J. Sel. Topics Signal 1919, 2015. Processing, vol. 10, no. 4, pp. 623–636, 2016. [26] G. Liu, Q. Liu, and X.-T. Yuan, “A new theory for matrix completion,” in Neural Information Processing Systems, 2017, pp. 785–794. [27] Y. Zhang, “When is missing data recoverable?” CAAM Technical Report TR06-15, 2006. [28] D. L. Donoho and M. Elad, “Optimally sparse repre- sentation in general (nonorthogonal) dictionaries via `1 minimization,” Proceedings of the National Academy