arXiv:2102.01570v1 [cs.LG] 2 Feb 2021

Symmetric Boolean Factor Analysis with Applications to InstaHide

Sitan Chen∗    Zhao Song†    Runzhou Tao‡    Ruizhe Zhang§

Abstract

In this work we examine the security of InstaHide, a recently proposed scheme for distributed learning [HSLA20]. A number of recent works have given reconstruction attacks for InstaHide in various regimes [CLSZ21, CDG+20, Car20, HST+20] by leveraging an intriguing connection to the following matrix factorization problem: given the Gram matrix of a collection of $m$ random $k$-sparse Boolean vectors in $\{0,1\}^r$, recover the vectors (up to the trivial symmetries). Equivalently, this can be thought of as a sparse, symmetric variant of the well-studied problem of Boolean factor analysis, or as an average-case version of the classic problem of recovering a $k$-uniform hypergraph from its line graph.

As previous algorithms either required $m$ to be exponentially large in $k$ [CLSZ21] or only applied to $k = 2$ [CDG+20, HST+20], they left open the question of whether InstaHide possesses some form of "fine-grained security" against reconstruction attacks for moderately large $k$. In this work, we answer this in the negative by giving a simple $O(m^{\omega+1})$-time algorithm for the above matrix factorization problem. Our algorithm, based on tensor decomposition, only requires $m = \widetilde{\Omega}(r)$. We complement this result with a quasipolynomial-time algorithm for a worst-case setting of the problem where the collection of $k$-sparse vectors is chosen arbitrarily.

∗[email protected]. MIT. This work was supported in part by NSF CAREER Award CCF-1453261, NSF Large CCF-1565235 and Ankur Moitra's ONR Young Investigator Award.
†[email protected]. Princeton University.
‡[email protected]. Columbia University.
§[email protected]. The University of Texas at Austin.

1 Introduction

We study the following symmetric variant of the well-known nonnegative matrix factorization (NMF) problem [AGKM12, Moi13, RSW16, SWZ17, SWZ19]. Given a symmetric, nonnegative $m \times m$ matrix $M$ and a rank parameter $r$, we are interested in the optimization problem:

$$\min_{W \in \mathcal{S}} \|M - WW^\top\| \tag{1}$$

where $\mathcal{S}$ is some family of nonnegative $m \times r$ matrices and $\|\cdot\|$ denotes some matrix norm. When $\mathcal{S}$ consists of all nonnegative $m \times r$ matrices, this is the problem of symmetric NMF, which is closely related to kernel $k$-means [DHS05] and has been studied in a variety of contexts like graph clustering, topic modeling, and computer vision [ZS05, YHD+12, YGL+13, CRDH08, KG12, HXZ+11, WLW+11, KDP12].

We consider the setting where $\mathcal{S}$ consists of matrices with $\{0,1\}$-valued entries, in which case (1) is the symmetric version of the well-studied problem of binary matrix factorization (BMF) [ZLDZ07, CIK17, BBB+19, FGL+19, KPRW19], which is connected to a diverse array of problems like LDPC codes [RPG16], optimizing passive OLED displays [KPRW19], and graph partitioning [CIK17]. For instance, if we equip $\{0,1\}$ with the structure of the Boolean semiring¹ (in which case BMF is sometimes called Boolean factor analysis) and $M$ is the adjacency matrix of an undirected graph $G$, then (1) is equivalent to finding the best possible covering of $G$ by $r$ cliques.
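To make the clique-cover interpretation concrete, here is a small numpy sketch (our own illustration; the four-vertex graph and its two-clique cover are invented for this example). Each column of $W$ is the indicator vector of a clique, and the Boolean-semiring product $WW^\top$ reproduces the adjacency matrix off the diagonal exactly when the cliques cover every edge and introduce no non-edges:

```python
import numpy as np

# Triangle {0,1,2} plus the edge {2,3}: coverable by two cliques.
# Column 0 of W indicates clique {0,1,2}; column 1 indicates {2,3}.
W = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])

# Boolean semiring product: (W W^T)_{ij} = OR_l (W_{il} AND W_{jl}).
M = (W @ W.T) > 0

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]], dtype=bool)   # adjacency matrix of the graph

off_diag = ~np.eye(4, dtype=bool)
assert (M[off_diag] == A[off_diag]).all()  # perfect clique cover with r = 2
```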

Boolean Factor Analysis and Distributed Learning. Our motivation for studying Boolean factor analysis comes from an intriguing connection, recently identified by [CLSZ21, CDG+20], to designing attacks on a proposed scheme for private distributed learning called InstaHide [HSLA20]². At a high level (see Section 5 for details), InstaHide is a procedure for taking a private dataset of size $r$ and generating a synthetic dataset of size $m$ via the following procedure. Fix a sparsity parameter $k$, and for every $i = 1, \ldots, m$: 1) sample a random subset $S$ of $\{1, \ldots, r\}$ of size $k$, 2) define $v$ to be the average of the $k$ private feature vectors indexed by $S$, 3) add to the synthetic dataset the entrywise absolute value of $v$.

The security of InstaHide has recently been placed under some scrutiny [Jag20, CLSZ21, CDG+20, Car20, HST+20], and a number of works have given attacks for reconstructing private feature vectors out of synthetic ones generated by InstaHide. [CLSZ21] gave a provable attack running in $\mathrm{poly}(m)$ time when the private dataset is Gaussian but required $m$ to scale exponentially in $k$,³ while [CDG+20] gave a heuristic attack on real image data for $k = 2$. These works left open the following question, which is the main focus of the present work:

Question 1.1. Do there exist efficient reconstruction attacks on InstaHide for arbitrary $k$?

Interestingly, [CLSZ21, CDG+20] both essentially reduce the reconstruction problem to solving the following instantiation of the optimization problem (1):

$$\min_{W \in \mathcal{S}_{m,r,k}} \|M - WW^\top\|_0 \quad \text{for} \quad \mathcal{S}_{m,r,k} = \big\{W \in \{0,1\}^{m\times r} : \|W_j\|_0 = k \ \ \forall\, j \in [m]\big\}, \tag{2}$$

where $\|\cdot\|_0$ denotes the number of nonzero entries.

In the reduction, the input $M$ to the optimization problem (2) is the similarity matrix whose $(i,j)$-th entry is 1 if synthetic feature vectors $i$ and $j$ have at least one constituent private feature

¹In the Boolean semiring, addition is given by logical OR, and multiplication is given by logical AND.
²In a follow-up work, this was generalized to the text domain [HSC+20].
³In the special case of $k = 2$, the bound in [CLSZ21] was later refined by [HST+20].

vector in common, and 0 otherwise (see Definition 5.4). In practice, it is very reasonable to assume one can extract $M$ from the synthetic dataset, e.g. by training an appropriate neural network [CDG+20], and when the private dataset is Gaussian, there is a provable algorithm for doing so [CLSZ21].

The key point is that this matrix admits an exact factorization $M = WW^\top$ over the Boolean semiring, where $W \in \{0,1\}^{m\times r}$ is the matrix whose $(i,\ell)$-th coordinate is 1 if synthetic feature vector $i$ contains private feature vector $\ell$, and is 0 otherwise (see Fact 5.5). In particular, because every synthetic feature vector has $k$ constituent private feature vectors, $W \in \mathcal{S}_{m,r,k}$. [CLSZ21] used a rather involved combinatorial argument to "partially" factorize $M$, while [CDG+20] gave a heuristic algorithm for completely factorizing $M$ when $k = 2$ by solving a min-cost max-flow problem (see Section 1.1 for details).

Our first result is to give a polynomial-time algorithm for (2) when $M$ is the similarity matrix for a synthetic dataset generated by InstaHide:

Theorem 1.2. Fix any integer $2 \le k \le r$, failure probability $\delta \in (0,1)$, and suppose $m \ge \widetilde{\Omega}(rk\log(1/\delta))$. Let $W \in \{0,1\}^{m\times r}$ be generated by the following random process: for every $i \in [m]$, the $i$-th row of $W$ is a uniformly random $k$-sparse binary vector. Define $M \triangleq WW^\top$, where the matrix multiplication is over the Boolean semiring. There is an algorithm which runs in $O(m^{\omega+1})$ time and, with probability $1-\delta$ over the randomness of $W$, outputs a matrix $\widehat{W} \in \{0,1\}^{m\times r}$ whose columns are a permutation of those of $W$.⁴

Note that not only is the minimum $m$ for which Theorem 1.2 holds a fixed polynomial in $r, k$ (rather than $r^{\Omega(k)}$), but in fact the dependence on $r$ is near optimal (Remark 1.5)! As a consequence, we show a synthetic dataset generated by InstaHide is vulnerable to efficient reconstruction attacks as soon as its size is comparable to that of the private dataset from which it was generated, answering Question 1.1 in the affirmative:

Theorem 1.3 (Informal, see Theorem 5.1). Fix any integer $k \ge 2$, failure probability $\delta \in (0,1)$, and suppose $m \ge \widetilde{\Omega}(rk\log(d/\delta))$. Given a synthetic dataset of size $m$ generated by InstaHide, together with its similarity matrix, there is an $O(m^{\omega+1} + d\cdot r\cdot m)$-time algorithm for approximately recovering the magnitudes of the "heavy" coordinates of every feature vector in the private dataset. Here, a coordinate of a feature vector is "heavy" if its magnitude is $\Omega(k)$ times the average value of any private image in that coordinate.

One can interpret Theorem 1.2 as handling a realizable, average-case version of (2): it assumes the input $M$ admits an exact factorization and furthermore was generated by a random process. One can also ask how to handle worst-case instances of (2), where $M$ can be arbitrary and in particular need not admit an exact factorization, and indeed we complement the results above with a quasipolynomial-time algorithm for (2) in this setting:

Theorem 1.4 (Worst-case guarantee, see Theorem C.3 for formal statement). Given a symmetric matrix $M \in \mathbb{Z}^{m\times m}$, rank parameter $r \le m$, and accuracy parameter $\epsilon \in (0,1]$, there is an algorithm that runs in $m^{O(\epsilon^{-1}k^2\log r)}$ time and outputs $\widehat{W} \in \mathcal{S}_{m,r,k}$ such that

$$\|M - \widehat{W}\widehat{W}^\top\|_0 \le \min_{W\in\mathcal{S}_{m,r,k}} \|M - WW^\top\|_0 + \epsilon m^2,$$

where the matrix multiplication can be over $\mathbb{R}$ or over the Boolean semiring.

⁴$\omega \approx 2.373$ is the exponent of matrix multiplication.

1.1 Related Work

Symmetric NMF. The most standard setting of NMF is $\min_{U,V\ge 0}\|M - UV^\top\|$, where $U, V$ range over all possible $m\times r$ matrices with nonnegative entries. This has been the subject of a significant body of theoretical and applied work, and we refer to the survey [Gil12] for a comprehensive overview of this literature.

Symmetric NMF has received comparatively less attention but is nevertheless a popular clustering technique [HXZ+11, KDP12, DHS05] where, similar to our application to InstaHide, one takes the input matrix $M$ to be some similarity matrix for a collection of data points and interprets the factorization $W$ as specifying a soft clustering of these points into $r$ groups, where $W_{i,\ell}$ is the "probability" that point $i$ lies in cluster $\ell$. Like NMF [Vav10], the symmetric variant of NMF that we study is also hard in the worst case. Indeed, matrices which admit an exact factorization $WW^\top$ for a nonnegative matrix $W$ are called completely positive, and even determining whether a matrix is completely positive is known to be NP-hard [DG14]. While there exist efficient provable algorithms for asymmetric NMF under certain separability assumptions [AGKM12, Moi13], the bulk of the work on symmetric NMF has been focused on designing iterative solvers for converging to a stationary point [HXZ+11, KDP12, VGL+16, LHW17, ZLLL18]; we refer to the recent work of [DBd19] for one such result and an exposition of this line of work.

Binary Matrix Factorization. Most works on NMF where the factors are further constrained to have $\{0,1\}$-valued entries focus on the asymmetric setting. Over the reals, this is directly related to the bipartite clique partition problem [Orl77, CHHK14, CIK17]. [KPRW19] gave the first constant-factor approximation algorithm for this problem that runs in time $2^{O(r^2\log r)}\cdot\mathrm{poly}(m)$. Our Theorem 1.4 also extends to this asymmetric setting (see Theorem C.3); see Remark C.4 for a comparison.

On the other hand, over the Boolean semiring, this problem is directly related to the bipartite clique cover problem. Also called Boolean factor analysis, the discrete basis problem, or minimal noise role mining, it has received substantial attention in the context of topic modeling, database tiling, association rule mining, etc. [SBM03, ŠH06, BV10, MMG+08, VAG07, LVAH12, MSVA16]. The best algorithm in this case, due to [FGL+19], runs in time $2^{2^{O(r)}}\cdot m^2$, matching the lower bound of [CIK17]. However, our problem requires that each row of $W$ is $k$-sparse, which corresponds to the constraint that each vertex belongs to $k$ bicliques in the bipartite clique cover problem. [EU18] considered the decision version of this problem, whose goal is to decide the minimum clique cover size $r$, and proved that it is NP-hard if $k \ge 5$. A more general version of this problem was studied by [AGSS12] with applications to community detection. Our algorithm also applies to this setting (see Corollary 6.4). In particular, even without the sparsity condition, the running time still matches the lower bound (see Remark C.6). More generally, we refer to the comprehensive survey of [MN20] for other results on BMF. As for symmetric BMF, [ZWA13] studied this in the context of community detection with overlapping communities.

Line Graphs of Hypergraphs. In combinatorics, there is a large body of work on recovering hypergraphs from their line graphs, which is equivalent to (2) when the input matrix $M$ admits an exact factorization. Indeed, we can regard any $W \in \mathcal{S}_{m,r,k}$ as the incidence matrix of a $k$-uniform hypergraph $H$ with $m$ hyperedges and $r$ vertices, so if we work over the Boolean semiring, $M \triangleq WW^\top$ is exactly the adjacency matrix for the line graph of $H$.

When $k = 2$, Whitney's isomorphism theorem [Whi92] characterizes which graphs are uniquely identified by their line graphs; in our notation, this theorem characterizes when one can uniquely

identify $W$ (up to permutation) from $M = WW^\top$. In this case, [Rou73, Leh74, Sys82, DS95, LTVM15] have given efficient algorithms for reconstruction, i.e. recovering $W$ from $M$. Indeed, the attack on InstaHide of [HST+20] used Whitney's theorem to show that $m = O(r\log r)$ synthetic images suffice for reconstruction.

Unfortunately, for $k > 2$, no analogue of Whitney's theorem exists [Lov77]. In fact, even detection, i.e. determining whether a given graph is the line graph of some $k$-uniform hypergraph, or equivalently whether the objective value of (2) is zero for an input $M$, is NP-complete [LT93, PRT81]. A number of results for reconstructing a hypergraph $H$ from its line graph are known under additional assumptions on $H$ [JKL97, MT97, SST05]. In the language of line graphs, Theorem 1.2 says that whereas even detection is NP-complete for worst-case hypergraphs, reconstruction is tractable for random ones.

Remark 1.5. The connection to line graphs makes clear why m must be at least Ω(r log r) for unique recovery of W (up to permutation) from M = WW⊤ to be possible, even for k = 2. In this case, M is simply the adjacency matrix for the line graph of a random (multi)graph G given by sampling m edges with replacement. It is well-known that such a graph is w.h.p. not connected if m = o(r log r) [ER60], so by Whitney’s theorem there will be multiple non-isomorphic graphs for which M is the adjacency matrix of their line graph.

2 Preliminaries

We use $\mathbb{F}_q$ to denote the finite field of order $q$ and sometimes use $\mathbb{F}$ to denote $\mathbb{F}_2$. For any nonnegative function $f$, we use $\widetilde{O}(f)$ to denote $f\cdot\mathrm{poly}(\log f)$ and $\widetilde{\Omega}(f)$ to denote $f/\mathrm{poly}(\log f)$.

For a vector $x \in \mathbb{R}^n$, we use $\mathrm{supp}(x)$ to denote its set of nonzero indices, $\|x\|_0$ to denote its number of nonzero entries, and $\|x\|_p$ to denote its $\ell_p$ norm. We use $\vec{1}_d$ to denote the all-ones vector in $\mathbb{R}^d$ and suppress the subscript when the context is clear.

For a matrix $A$, we use $A^+$ to denote the pseudo-inverse of $A$. We use $\|A\|_F$ to denote the Frobenius norm of $A$, $\|A\|_1 = \sum_{i,j}|A_{i,j}|$ to denote its entrywise $\ell_1$ norm, $\|A\|_0$ to denote its number of nonzero entries, and $\|A\|_\infty$ to denote $\max_{i,j}|A_{i,j}|$.

Given $r, k$, we let $\binom{[r]}{k}$ denote the collection of all subsets of $[r]$ of size $k$.

Fact 2.1 (Chernoff bound). For any $0 < \delta < 1$, given i.i.d. draws $X_1,\ldots,X_m$ of a Bernoulli random variable with mean $p$, with probability at least $1-\delta$ the empirical mean $\widehat{p} = \frac{1}{m}\sum_{i=1}^m X_i$ satisfies $|\widehat{p}-p| \le O\big(\sqrt{p\log(1/\delta)/m} + \log(1/\delta)/m\big)$.

3 Technical Overview

In this section we overview the main technical ingredients for our results.

3.1 Average-Case Guarantee

The main idea in the proof of Theorem 1.2 is the following thought experiment. If instead of getting access to the matrix $M = \sum_{i=1}^r W_i^{\otimes 2}$ over the Boolean semiring, suppose we had the tensor $T = \sum_{i=1}^r W_i^{\otimes 3}$ over $\mathbb{Z}$. Provided the columns $W_i$ of $W$ are linearly independent over $\mathbb{R}$, then we can run a standard tensor decomposition algorithm to recover $W_1,\ldots,W_r$ up to permutation. As

such, there are two technical steps: 1) forming the tensor $T$ given only $M$, and 2) showing linear independence of the $W_i$.

For 1), note that for any $a, b, c \in [m]$, $T_{a,b,c}$ is the number of coordinates in $[r]$ that the supports of $W_a, W_b, W_c$ all have in common. Denote these supports by $S_a, S_b, S_c$. We observe that there is a simple monotonically decreasing function $\mu : \mathbb{Z} \to [0,1]$ (see (4)) mapping $|S_a\cup S_b\cup S_c|$ to the probability that a random size-$k$ subset of $[r]$ does not intersect any of $S_a, S_b, S_c$. This latter probability can be estimated by computing the fraction of columns of $M$ which are simultaneously zero in rows $a, b, c$, so provided the number of columns $m$ is sufficiently large, we can estimate this probability and invert along $\mu$ to exactly recover $|S_a\cup S_b\cup S_c|$. We can recover $|S_a\cup S_b|$, $|S_a\cup S_c|$, and $|S_b\cup S_c|$ in a similar fashion, from which we obtain $T_{a,b,c} = |S_a\cap S_b\cap S_c|$ (Lemma 4.2).

Part 2), formally stated in Lemma A.5, is the most technically involved component of our proof of Theorem 1.2. Note that by a simple net argument, when $m = \widetilde{\Omega}(r^2)$ one can show that $W$ is not only full rank, but polynomially well-conditioned. But as our emphasis is on having the sample complexity $m$ depend near-optimally on the number of private feature vectors $r$, we need to be much more careful. We note that similar problems have been studied before in discrete probability (see Section A.4).

Showing Lemma A.5 for odd $k$ turns out to be more straightforward. In this case, to show linear independence of the $W_i$'s over $\mathbb{R}$, we first observe that it suffices to show linear independence over $\mathbb{F}_2$ (Lemma A.4). For a given $u \in \mathbb{F}_2^r$, one can explicitly compute the probability that $Wu = 0$, and by giving fine enough estimates for these probabilities (Proposition A.7) and taking a union bound, we can complete the proof of Lemma A.5.

This proof strategy breaks down for even $k$ because in this case $W$ is not full-rank over $\mathbb{F}_2$: the columns of $W$ add up to zero over $\mathbb{F}_2$. Instead, we build on ideas from [FKS20], which studies the square matrix version of this problem and upper bounds the probability that there exists some $x$ for which $Wx = 0$ and for which the most frequent entry in $x$ occurs a fixed number of times. Intuitively, if the most frequent entry does not occur too many times, then the probability that $Wx = 0$ is very small. Otherwise, even if the probability is large, there are "few" such vectors. Unfortunately, [FKS20] requires $k \ge \Omega(\log r)$, and in order to handle the practically relevant regime of $k = O(1)$, we need to adapt their techniques and exploit the fact that $W$ is a tall matrix in our setting.

3.2 Implication for InstaHide

For real-world image data [CDG+20], the similarity matrix $M$ is generated by training a neural network to classify whether two given synthetic images generated by InstaHide have some constituent private image in common. For Gaussian data [CLSZ21], one can provably construct $M$ via covariance estimation of a certain folded normal distribution. For these reasons, it is quite reasonable to assume that the attacker has access to $M$, so we do not belabor this point further.

As mentioned in the introduction, the key subroutine in the attacks of [CLSZ21, CDG+20] is to factorize the similarity matrix $M$. Given this factorization, it remains to specify how to efficiently recover parts of the private dataset. At a high level, this amounts to solving a piecewise linear system given by the equality $|Wp_j| = z_j$ for every feature index $j \in [d]$, where $p_j \in \mathbb{R}^r$ (resp. $z_j \in \mathbb{R}^m$) is the unknown (resp. known) vector whose $i$-th entry is the $j$-th feature of private (resp. synthetic) feature vector $i$. We elaborate in Section 5 on how [CLSZ21, CDG+20] do this.

A takeaway from [CDG+20] is that empirically, given $W$, it is quite easy to solve this system exactly by alternating minimization. While it remains open how to do so provably in the regime of Theorem 1.2, we exhibit a simple estimator (Lemma 5.6) for obtaining a multiplicative approximation to the "heavy" coordinates of every private feature vector. At a high level, we show w.h.p. that for any $j$, the vector

$$\frac{1}{m}\sum_{i=1}^m (w_i - \lambda\cdot\vec{1})\cdot (z_j)_i^2$$

3.3 Worst-Case Guarantee To obtain Theorem 1.4, our quasi-polynomial time algorithm for the worst-case setting of (2), we construct an appropriate 2-CSP instance out of M and apply an existing approximation algorithm for dense Max 2-CSP (Theorem 6.2) proposed by [DM18]. Concretely, given M = WW⊤ + , E where W is a minimizer in (2), we construct a 2-CSP instance over a complete graph where the vertices correspond to the rows of W, the alphabet consists of all k-sparse Boolean vectors of length r, and the constraint on any edge checks whether the inner product of the vectors assigned to the two vertices is equal to that entry in M (Theorem 6.3). This type of reduction turns out to be quite flexible, and we are similarly able to obtain guarantees for the asymmetric analogue of (2) (Theorem C.3), as well as for the case where matrix multiplication is defined over the Boolean semiring instead of the reals (Corollary 6.4).

Roadmap. In Section 4, we sketch our proof of Theorem 1.2, deferring the proof details to Appendix A. In Section 6, we sketch our proof of Theorem 1.4, deferring the proof details to Appendix C. In Section 5 we show how to leverage Theorem 1.2 to give our reconstruction attack on InstaHide (Theorem 1.3).

4 Average-Case Complexity Upper Bound

In this section we prove Theorem 1.2. Our algorithm is given by TensorRecover below.

Algorithm 1: TensorRecover(M)
Input: $M \in \{0,1\}^{m\times m}$ s.t. $M = WW^\top$ over the Boolean semiring for some $W \in \mathcal{S}_{m,r,k}$
Output: Matrix $\widehat{W} \in \mathcal{S}_{m,r,k}$ which is equal to $W$ up to column permutation
/* Form the tensor T */
1 for $(a,b,c) \in [m]\times[m]\times[m]$ do
2   Let $\mu_{abc}$ be the fraction of $\ell \in [m]$ for which $M_{a,\ell} = M_{b,\ell} = M_{c,\ell}$ are all zero. Define $\mu_{ab}, \mu_{ac}, \mu_{bc}$ analogously. // Can compute with fast matrix multiplication (Lemma 4.2)
3   Let $t_{abc}$ be the nonnegative integer $t$ for which $\mu_t$ (see (4)) is closest to $\mu_{abc}$. Define $t_{ab}, t_{ac}, t_{bc}$ analogously.
4   $T_{a,b,c} \leftarrow t_{abc} - t_{ab} - t_{ac} - t_{bc} + 3k$.
/* Run Jennrich's on T */
5 Randomly sample unit vectors $v_1, v_2 \in \mathbb{S}^{m-1}$.
6 $M_1 \leftarrow T(\mathrm{Id},\mathrm{Id},v_1)$, $M_2 \leftarrow T(\mathrm{Id},\mathrm{Id},v_2)$.
7 Let $\widetilde{w}_1,\ldots,\widetilde{w}_r \in \mathbb{R}^m$ denote the left eigenvectors outside the kernel of $M_1M_2^+$.
8 Round $\{\widetilde{w}_i\}$ to Boolean vectors $\{\widehat{w}_i\}$; let $\widehat{W} \in \{0,1\}^{m\times r}$ consist of $\{\widehat{w}_i\}$.
9 return $\widehat{W}$.

4.1 Forming the Tensor

The key insight is that although the similarity matrix $M$ only gives access to "second-order" information about correlations between the entries of $W$, we can bootstrap third-order information in

the form of

$$T \triangleq \sum_{i=1}^r W_i^{\otimes 3}, \tag{3}$$

where $W_i$ is the $i$-th column of $W$ and the tensor is defined over $\mathbb{R}$ rather than the Boolean semiring.

Given sets $S_1,\ldots,S_c \subset [r]$, let $\mu(S_1,\ldots,S_c)$ denote the probability that a randomly chosen subset $T$ of $[r]$ of size $k$ does not intersect any of them. It is easy to see that

$$\mu(S_1,\ldots,S_c) = \binom{r - |S_1\cup\cdots\cup S_c|}{k}\Big/\binom{r}{k} \triangleq \mu_{|S_1\cup\cdots\cup S_c|}. \tag{4}$$

Fact 4.1. There exist absolute constants $C, C' > 0$ for which the following holds. If $r \ge C\cdot k^2$, then for any $0 \le t \le 3k$, we have that $\mu_t \ge \mu_{t+1} + C'k/r$ and $\mu_t \ge 1 - O(tk/r)$.

We defer the proof to Section A.1. This will allow us to extract $T$ from $M$. The algorithm outlined in the proof of the following lemma is given formally in Lines 1 to 6 of TensorRecover; a Python sketch of these steps is shown below.
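The following sketch (our own illustrative implementation; the function names are ours) carries out Lines 1 to 6 in the naive $O(m^4)$ way: estimate the zero-pattern frequencies, snap each to the nearest $\mu_t$, and combine via inclusion-exclusion as in Line 4 of TensorRecover. Lemma 4.2 below explains how fast matrix multiplication improves this to $O(m^{\omega+1})$.

```python
import numpy as np
from math import comb

def mu(t, r, k):
    # mu_t from eq. (4): probability a random k-subset of [r] avoids a fixed t-set.
    return comb(r - t, k) / comb(r, k)

def invert_mu(freq, r, k):
    # Snap an empirical frequency to the nearest mu_t for t = 0, ..., 3k;
    # Fact 4.1 guarantees consecutive mu_t are separated by Omega(k/r).
    return min(range(3 * k + 1), key=lambda t: abs(mu(t, r, k) - freq))

def form_tensor(M, r, k):
    # Naive O(m^4) version of Lines 1-6 of TensorRecover.
    m = M.shape[0]
    Z = (M == 0)                     # Z[a, l] = 1 iff M_{a,l} = 0

    def union_size(rows):
        # |S_a ∪ S_b ∪ ...| from the fraction of columns simultaneously zero
        # in the given rows, inverted along t -> mu_t.
        return invert_mu(Z[list(rows)].all(axis=0).mean(), r, k)

    T = np.zeros((m, m, m))
    for a in range(m):
        for b in range(m):
            for c in range(m):
                T[a, b, c] = (union_size([a, b, c]) - union_size([a, b])
                              - union_size([a, c]) - union_size([b, c]) + 3 * k)
    return T
```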

Lemma 4.2. If $m \ge \widetilde{\Omega}(rk\log(1/\delta))$, then with probability at least $1-\delta$ over the randomness of $M$, there is an algorithm for computing $T_{a,b,c}$ for any $a,b,c\in[m]$ in time $O(m^{\omega+1})$.

To show this, note that for any $a, b, c \in [m]$, if entries $a, b, c$ correspond to subsets $S_1, S_2, S_3$ of $[r]$, then

$$T_{a,b,c} = |S_1\cap S_2\cap S_3| = |S_1\cup S_2\cup S_3| - |S_1\cup S_2| - |S_2\cup S_3| - |S_1\cup S_3| + 3k. \tag{5}$$

We can compute each of the terms on the right-hand side by estimating the corresponding probabilities $\mu(\cdot)$ to within error $O(k/r)$ and inverting along the univariate function $t \mapsto \mu_t$. As for the runtime, we can compute each slice of $T$ by setting up an appropriate matrix multiplication. We defer a full proof of Lemma 4.2 to Section A.2.

It suffices to apply the following standard guarantee for (noiseless) tensor decomposition:

Lemma 4.3. Given a collection of linearly independent vectors $w_1,\ldots,w_r \in \mathbb{R}^m$, there is an algorithm that takes any tensor $T = \sum_{i=1}^r w_i^{\otimes 3}$, runs in time $O(m^\omega)$, and outputs a list of vectors $\widehat{w}_1,\ldots,\widehat{w}_r$ for which there exists a permutation $\pi$ satisfying $\widehat{w}_i = w_{\pi(i)}$ for all $i\in[r]$.

For completeness, we provide a short proof of this well-known fact in Section A.3.

It remains to show that the columns of $W$ in the context of Theorem 1.2 are indeed linearly independent with high probability. This will occupy the bulk of our analysis. We emphasize that because we wish to obtain $m$ with near-optimal dependence on $r$, the argument will be quite involved. Formally, we show:

Theorem 4.4 (Linear independence of $W$). Let $W \in \{0,1\}^{m\times r}$ be a random matrix whose rows are i.i.d. random vectors, each following a uniform distribution over $\{0,1\}^r$ with exactly $k$ ones. For constant $k \ge 1$ and $m = \Omega((r/k)\log r)$, the $r$ columns of $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

We defer the details of Theorem 4.4 to Section A.4, as well as a discussion of previous works which have studied similar questions. Altogether, this allows us to conclude Theorem 1.2, and we defer a formal proof of this to Section A.5.
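As a quick empirical sanity check on Theorem 4.4 (a simulation sketch, not used in the proof; the constant 4 in the choice of $m$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 50, 3
m = int(4 * (r / k) * np.log(r))          # m = Omega((r/k) log r)

# Each row of W is a uniformly random k-sparse binary vector in {0,1}^r.
W = np.zeros((m, r))
for i in range(m):
    W[i, rng.choice(r, size=k, replace=False)] = 1

print(np.linalg.matrix_rank(W))           # r, with high probability
```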

5 Connection to InstaHide

In this section we review the details of InstaHide and the known connection to the optimization problem (2). In lieu of "feature vectors," we will refer to the elements of the private/synthetic datasets as "images" and their individual entries as "pixels." The main result of this section is to show the following:

Theorem 5.1. For any absolute constant $\eta > 0$, there is an absolute constant $c > 0$ for which the following holds. Fix any integer $k \ge 2$, failure probability $\delta \in (0,1)$, and suppose $m \ge \widetilde{\Omega}(rk\log(d/\delta))$. Given a synthetic dataset of size $m$ generated by InstaHide from an image matrix $X$, together with its similarity matrix, there is an $O(m^{\omega+1} + d\cdot r\cdot m)$-time algorithm which outputs an image matrix $\widehat{X}$ such that for any $(i,j) \in [r]\times[d]$ satisfying $|X_{i,j}| \ge (ck/r)\sum_{i'\in[r]}|X_{i',j}|$, we have that $|\widehat{X}_{i,j}| = |X_{i,j}|\cdot(1\pm\eta)$.

That is, we give a reconstruction attack that approximately recovers the magnitudes of the "heavy" pixels of every image in the private dataset, where a pixel of some image $i$ is "heavy" if its value is roughly $k$ times larger in absolute value than the average pixel value in that location across all private images.

5.1 InstaHide Details

We first recall the process by which InstaHide generates synthetic images from private ones.

Definition 5.2 (Image matrix notation). Let the private image matrix $X \in \mathbb{R}^{d\times r}$ be a matrix whose columns consist of vectors $x_1,\ldots,x_r$ corresponding to $r$ images, each with $d$ pixels taking values in $\mathbb{R}$. It will also be convenient to refer to the rows of $X$ as $p_1,\ldots,p_d \in \mathbb{R}^r$.

Definition 5.3 (Synthetic images). Given a private image matrix $X$ and a subset $S \subset [r]$ of size $k$, the corresponding synthetic image is the vector $y^{X,S} \triangleq |\sum_{i\in S} x_i|$, where $|\cdot|$ denotes entrywise absolute value. We will refer to the images $\{x_i\}_{i\in S}$ as the constituent private images of $y^{X,S}$. To generate a synthetic dataset of size $m$ via InstaHide, one independently samples subsets $S_1,\ldots,S_m \subset [r]$ of size $k$ and outputs $\{y^{X,S_1},\ldots,y^{X,S_m}\}$.

The main finding of [HSLA20] is that given a private dataset, one can train on the synthetic dataset generated by InstaHide and still achieve good test accuracy on the private dataset. The hope in that work was that taking the entrywise absolute value of $\sum_{i\in S}x_i$ makes it difficult to recover properties of the private images.

5.2 Matrix Factorization and InstaHide

As observed by [CLSZ21, CDG+20], in many settings one can compute the matrix $M' \in \mathbb{R}^{m\times m}$ whose $(i,j)$-th entry is $|S_i\cap S_j|$. Indeed, when the entries of the private image matrix $X$ are independent Gaussians, [CLSZ21] gave a provable algorithm for extracting $M'$ from the synthetic dataset. For real-world image datasets, when $k = 2$, in which case the entries of $M'$ are $\{0,1\}$-valued, [CDG+20] gave a heuristic algorithm for computing $M'$ by training a neural network to classify pairs of synthetic images based on whether they have a constituent private image in common.

For real-world image datasets and general $k$, while it is not necessarily so easy to recover $M'$ itself with the latter strategy, as the entries of $M'$ take values in $\{0,\ldots,k\}$ in general, it is nevertheless very plausible that one could still train a neural network to classify pairs of synthetic images as sharing at least one constituent private image or not. In other words, it is quite reasonable to assume access to the following matrix.

Definition 5.4 (Similarity Matrix). Given a synthetic dataset $\{y^{X,S_1},\ldots,y^{X,S_m}\}$ generated by InstaHide, let $M \in \{0,1\}^{m\times m}$ denote the matrix for which $M_{i,j} = \mathbb{1}[S_i\cap S_j \ne \emptyset]$ for all $i,j \in [m]$. Also define the selection matrix $W \in \{0,1\}^{m\times r}$ to be the matrix whose $i$-th row is the indicator vector for the subset $S_i$. Note that $W$ is distributed as a matrix whose rows are independent random $k$-sparse Boolean vectors.

The key property that relates InstaHide to problem (2) is the following basic fact:

Fact 5.5. Over the Boolean semiring, the similarity matrix M satisfies M = WW⊤.
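A minimal simulation of Definitions 5.2–5.4 that verifies Fact 5.5 on random data (our own sketch; in an actual attack $M$ would be obtained as described above, not from the subsets themselves, and the "images" here are just Gaussian vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, k, m = 8, 10, 3, 40

X = rng.standard_normal((d, r))                       # private images (Def. 5.2)
subsets = [rng.choice(r, size=k, replace=False) for _ in range(m)]

# Synthetic images y^{X,S} = |sum_{i in S} x_i| (Definition 5.3).
Y = np.stack([np.abs(X[:, S].sum(axis=1)) for S in subsets], axis=1)

# Selection matrix W and similarity matrix M (Definition 5.4).
W = np.zeros((m, r), dtype=int)
for i, S in enumerate(subsets):
    W[i, S] = 1
M = np.array([[int(bool(set(Si) & set(Sj))) for Sj in subsets]
              for Si in subsets])

# Fact 5.5: over the Boolean semiring, M = W W^T.
assert (M == ((W @ W.T) > 0).astype(int)).all()
```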

Reconstructing $X$ From a Factorization. Suppose one can successfully construct the selection matrix $W$ (up to column permutation) given the similarity matrix $M$. By definition of the synthetic dataset, we have that for any pixel index $j \in [d]$, if $z_j \in \mathbb{R}^m$ is the vector whose $i$-th entry is the $j$-th coordinate of $y^{X,S_i}$, then

$$|Wp_j| = z_j. \tag{6}$$

At this point it is certainly information-theoretically possible to recover the private dataset (up to unavoidable ambiguities in sign) by brute-forcing all $2^m$ possible sign patterns.

As for computationally efficient algorithms, note that (6) can be thought of as an instance of phase retrieval in a non-standard setting where the sensing matrix $W$ consists of $k$-sparse binary measurements. [HST+20] showed that if $W \in \mathcal{S}_{m,r,k}$ is chosen adversarially instead of randomly, then this problem is computationally hard in general, even if $k = 2$.

When $W$ is random as in Definition 5.4, the system (6) appears to be easy to solve in practice. [CDG+20] demonstrated empirically in this case that when $X$ is given by real-world image data, then one can efficiently solve (6) by alternating minimization. In terms of provable algorithms, in [CLSZ21], the authors showed that if $m \ge r^{\Omega(k)}$, then with high probability over the choice of $W$ there exists a submatrix of $W$ of size $\binom{k+2}{k}$ consisting of indicator vectors for all size-$k$ subsets of a particular set $T \subset [r]$ of size $k+2$, and by brute-forcing over sign patterns for that particular collection of constraints in the piecewise-linear system (6), one can provably recover the coordinates of $p_j$ indexed by $T$ for any $j$. In other words, one can exactly recover the $k+2$ images in the private dataset that are indexed by $T$.

Here we give a different approach that allows us to recover the "heavy" coordinates of $p_j$ for any pixel index $j \in [d]$.

Algorithm 2: GetHeavyPixels(M)
Input: Similarity matrix $M$ for synthetic images generated from image matrix $X$
Output: Image matrix $\widehat{X}$ which approximates the heavy entries of $X$ (see Theorem 5.1)
1 $\widehat{W} \leftarrow$ TensorRecover($M$).
2 for $j \in [d]$ do
3   Let $z \in \mathbb{R}^m$ have $i$-th entry equal to the $j$-th coordinate of synthetic image $i$.
4   Form the vector $p' \triangleq \frac{1}{m}\sum_{i=1}^m \big(\widehat{w}_i - \frac{k-1}{r-2}\cdot\vec{1}\big)\, z_i^2$, where $\widehat{w}_i$ is the $i$-th row of $\widehat{W}$.
5   Set the $j$-th row of $\widehat{X}$ to be $p'\cdot\frac{r(r-1)}{k(r-2k+1)}$.
6 return $\widehat{X}$.

Lemma 5.6. For any absolute constant $\eta > 0$, there is an absolute constant $c > 0$ for which the following holds as long as $m \ge \Omega(\log(d/\delta))$. There is an algorithm GetHeavyPixels that takes

as input $W$ defined according to Definition 5.4 and a vector $z$ satisfying $|Wp| = z$ for some vector $p \in \mathbb{R}^r$, runs in time $O(r\cdot m)$, and outputs $\widetilde{p}$ such that for every $i \in [r]$ for which $|p_i| \ge (ck/r)\cdot\|p\|_1$, we have that $\widetilde{p}_i = |p_i|\cdot(1\pm\eta)$.

We defer the proof of this to Section B.1.
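For intuition, the estimator computed in Lines 4–5 of GetHeavyPixels is a few lines of numpy. The sketch below (our own; it assumes $\widehat{W}$ has already been recovered) processes a single pixel index $j$:

```python
import numpy as np

def heavy_pixel_estimate(W_hat, z, r, k):
    # Line 4: correlate the centered rows of W-hat with the squared
    # synthetic pixel values z_i^2.
    centered = W_hat - (k - 1) / (r - 2)          # w_i - (k-1)/(r-2) * all-ones
    p_prime = (centered * (z ** 2)[:, None]).mean(axis=0)
    # Line 5: rescale. By Lemma 5.6, entries with |p_i| >= (ck/r) * ||p||_1
    # come out as |p_i| (1 +/- eta) once m is large enough.
    return p_prime * r * (r - 1) / (k * (r - 2 * k + 1))
```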

6 Worst-case Algorithm for Matrix Factorization

In this section, we give a quasi-polynomial time algorithm for (2) in the worst case. We reduce the problem to solving a constraint satisfaction problem (CSP), and then use the 2-CSP solver developed by [DM18] to find the solution. We first give the definition of a 2-CSP.

Definition 6.1 (Max 2-CSP). A 2-CSP problem is defined by a tuple $(\Sigma, V, E, \mathcal{C})$. $\Sigma$ is an alphabet set of size $q$, $V$ is a variable set of size $n$, and $E \subseteq V\times V$ is the constraint set. $V$ and $E$ define an underlying graph of the 2-CSP instance, and $\mathcal{C} = \{C_e\}_{e\in E}$ describes the constraints. For each $e \in E$, $C_e$ is a function $\Sigma\times\Sigma \to \{0,1\}$. The goal is to find an assignment $\sigma : V \to \Sigma$ with maximal value, defined to be the number of satisfied edges $e = (u,v) \in E$ (i.e., for which $C_e(\sigma(u),\sigma(v)) = 1$).

We will use the following known algorithm for solving "dense" 2-CSP instances:

Theorem 6.2 ([DM18]). Define the density $\delta$ of a 2-CSP instance to be $\delta \triangleq |E|/\binom{|V|}{2}$. For any $0 < \epsilon \le 1$, there is an approximation algorithm that, given any $\delta$-dense 2-CSP instance with optimal value OPT, runs in time $(nq)^{O(\epsilon^{-1}\cdot\delta^{-1}\cdot\log q)}$ and outputs an assignment with value OPT $-\,\epsilon|E|$, where $n = |V|$ and $q = |\Sigma|$.

The main theorem of this section is as follows:

Theorem 6.3. Given $m, k, r \ge 0$ and a symmetric $m\times m$ matrix $M$ as the input of an instance of the sparse Boolean matrix factorization problem, let OPT be the optimal value of the problem, i.e., OPT $:= \min_W \|M - WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$. For any accuracy $\epsilon \in (0,1)$, there is an algorithm running in time

$$m^{O(\epsilon^{-1}k\log r)}\, r^{O(\epsilon^{-1}k^2\log r)} \tag{7}$$

which finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M - \widehat{W}\widehat{W}^\top\|_0 \le$ OPT $+\,\epsilon m^2$.

The proof is deferred to Section C. As a corollary, we also get an algorithm for (2) over Boolean arithmetic.

Corollary 6.4. Given $m, k, r \ge 0$ and a symmetric Boolean $m\times m$ matrix $M$, let OPT be the optimal value of the problem, i.e., OPT $:= \min_W \|M - WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$ and the matrix multiplication is over the Boolean semiring, i.e., $a + b$ is $a\vee b$ and $a\cdot b$ is $a\wedge b$. For any accuracy parameter $\epsilon \in (0,1)$, there exists an algorithm that runs in $m^{O(\epsilon^{-1}k\log r)}\, r^{O(\epsilon^{-1}k^2\log r)}$ time and finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M - \widehat{W}\widehat{W}^\top\|_0 \le$ OPT $+\,\epsilon m^2$.

We defer its proof to Section C, where we also give algorithms for the asymmetric setting where one wishes to solve $\min_{U,V}\|M - UV^\top\|_0$.
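To make the reduction behind Theorem 6.3 concrete, here is a sketch (our own illustration) of the 2-CSP instance constructed from $M$. The alphabet has size $\binom{r}{k} \le r^k$, so $\log q = O(k\log r)$, which is what drives the exponents in the running time above:

```python
import numpy as np
from itertools import combinations

def build_csp(M, r, k):
    """Dense 2-CSP instance from the reduction sketched in Section 3.3."""
    # Alphabet: all k-sparse Boolean vectors of length r (size binom(r, k)).
    alphabet = []
    for S in combinations(range(r), k):
        v = np.zeros(r, dtype=int)
        v[list(S)] = 1
        alphabet.append(v)
    m = M.shape[0]
    # One variable per row of W; the underlying graph is complete.
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]

    # Edge (u, v) is satisfied by letters (a, b) iff the inner product of the
    # assigned k-sparse vectors equals M_{u,v}.
    def constraint(u, v, a, b):
        return int(alphabet[a] @ alphabet[b] == M[u, v])

    return alphabet, edges, constraint
```

Feeding this fully dense instance ($\delta = 1$, $q \le r^k$) to the solver of Theorem 6.2 yields an assignment, i.e. a row $k$-sparse $\widehat{W}$, whose number of violated constraints exceeds the optimum by at most $\epsilon\binom{m}{2} \le \epsilon m^2$.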

10 References

[AGKM12] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 145–162. https://arxiv.org/pdf/1111.0952.pdf, 2012.

[AGSS12] Sanjeev Arora, Rong Ge, Sushant Sachdeva, and Grant Schoenebeck. Finding overlapping communities in social networks: toward a rigorous approach. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 37–54, 2012.

[AHP20] Elad Aigner-Horev and Yury Person. On sparse random combinatorial matrices. arXiv preprint arXiv:2010.07648, 2020.

[BBB+19] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P Woodruff. A PTAS for ℓp-low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 747–766. SIAM, 2019.

[BV10] Radim Belohlavek and Vilem Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences, 76(1):3–20, 2010.

[Car20] Nicholas Carlini. InstaHide disappointingly wins Bell Labs Prize, 2nd place, Dec 2020.

[CDG+20] Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, and Florian Tramer. An attack on instahide: Is private learning possible with instance encoding? arXiv preprint arXiv:2011.05315, 2020.

[CFK+15] Marek Cygan, Fedor V Fomin, Łukasz Kowalik, Daniel Lokshtanov, Dániel Marx, Marcin Pilipczuk, Michał Pilipczuk, and Saket Saurabh. Parameterized algorithms, volume 5. Springer, 2015.

[CHHK14] Parinya Chalermsook, Sandy Heydrich, Eugenia Holm, and Andreas Karrenbauer. Nearly tight approximability results for minimum biclique cover and partition. In European Symposium on Algorithms (ESA), pages 235–246. Springer, 2014.

[CIK17] Sunil Chandran, Davis Issac, and Andreas Karrenbauer. On the parameterized complexity of biclique cover and partition. In 11th International Symposium on Parameterized and Exact Computation (IPEC). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[CLSZ21] Sitan Chen, Xiaoxiao Li, Zhao Song, and Danyang Zhuo. On instahide, phase retrieval, and sparse matrix factorization. In ICLR. arXiv preprint arXiv:2011.11181, 2021.

[CRDH08] Yanhua Chen, Manjeet Rege, Ming Dong, and Jing Hua. Non-negative matrix factorization for semi-supervised data clustering. Knowledge and Information Systems, 17(3):355–379, 2008.

[CV08] Kevin P Costello and Van H Vu. The rank of random graphs. Random Structures & Algorithms, 33(3):269–285, 2008.

[CV10] Kevin P Costello and Van Vu. On the rank of random sparse matrices. Combinatorics, Probability and Computing, 19(3):321–342, 2010.

[DBd19] Radu-Alexandru Dragomir, Jérôme Bolte, and Alexandre d'Aspremont. Fast gradient methods for symmetric nonnegative matrix factorization. arXiv preprint arXiv:1901.10791, 2019.

[DG14] Peter JC Dickinson and Luuk Gijben. On the computational complexity of membership problems for the completely positive cone and its dual. Computational optimization and applications, 57(2):403–415, 2014.

[DHS05] Chris Ding, Xiaofeng He, and Horst D Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the SIAM International Conference on Data Mining (ICDM), pages 606–610. SIAM, 2005.

[DM18] Irit Dinur and Pasin Manurangsi. ETH-hardness of approximating 2-CSPs and directed Steiner network. arXiv preprint arXiv:1805.03867, 2018.

[DS95] Daniele Giorgio Degiorgi and Klaus Simon. A dynamic algorithm for line graph recognition. In International Workshop on Graph-Theoretic Concepts in Computer Science, pages 37–48. Springer, 1995.

[ER60] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

[EU18] Alessandro Epasto and Eli Upfal. Efficient approximation for restricted biclique cover problems. Algorithms, 11(6):84, 2018.

[FGL+19] Fedor V Fomin, Petr A Golovach, Daniel Lokshtanov, Fahad Panolan, and Saket Saurabh. Approximation schemes for low-rank binary matrix approximation problems. ACM Transactions on Algorithms (TALG), 16(1):1–39, 2019.

[FJLS20] Asaf Ferber, Vishesh Jain, Kyle Luh, and Wojciech Samotij. On the counting problem in inverse Littlewood–Offord theory. Journal of the London Mathematical Society, 2020.

[FKS20] Asaf Ferber, Matthew Kwan, and Lisa Sauermann. Singularity of sparse random matrices: simple proofs. arXiv preprint arXiv:2011.01291, 2020.

[FMPS09] Herbert Fleischner, Egbert Mujuni, Daniël Paulusma, and Stefan Szeider. Covering graphs with few complete bipartite subgraphs. Theoretical Computer Science, 410(21- 23):2045–2053, 2009.

[Gil12] Nicolas Gillis. Sparse and unique nonnegative matrix factorization through data preprocessing. The Journal of Machine Learning Research, 13(1):3349–3386, 2012.

[Har70] Richard A Harshman. Foundations of the parafac procedure: Models and conditions for an "explanatory" multimodal factor analysis. 1970.

[HSC+20] Yangsibo Huang, Zhao Song, Danqi Chen, Kai Li, and Sanjeev Arora. Texthide: Tackling data privacy in language understanding tasks. In EMNLP. arXiv preprint arXiv:2010.06053, 2020.

[HSLA20] Yangsibo Huang, Zhao Song, Kai Li, and Sanjeev Arora. Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning (ICML), pages 4507–4518, 2020.

[HST+20] Baihe Huang, Zhao Song, Runzhou Tao, Ruizhe Zhang, and Danyang Zhuo. Instahide's sample complexity when mixing two private images. arXiv preprint arXiv:2011.11877, 2020.

[Hua18] Jiaoyang Huang. Invertibility of adjacency matrices for random d-regular graphs. arXiv preprint arXiv:1807.06465, 2018.

[HXZ+11] Zhaoshui He, Shengli Xie, Rafal Zdunek, Guoxu Zhou, and Andrzej Cichocki. Symmetric nonnegative matrix factorization: Algorithms and applications to probabilistic clustering. IEEE Transactions on Neural Networks, 22(12):2117–2131, 2011.

[Jag20] Matthew Jagielski. InstaHide security experiment, 2020. https://colab.research.google.com/drive/1ONVjStz2m3BdKCE16axVHZ00hcwdivH2?usp=sharing.

[Jai19] Vishesh Jain. Approximate Spielman-Teng theorems for the least singular value of random combinatorial matrices. arXiv preprint arXiv:1904.10592, 2019.

[JKL97] Michael S Jacobson, André E Kézdy, and Jenő Lehel. Recognizing intersection graphs of linear uniform hypergraphs. Graphs and Combinatorics, 13(4):359–367, 1997.

[KDP12] Da Kuang, Chris Ding, and Haesun Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the SIAM International Conference on Data Mining (ICDM), pages 106–117. SIAM, 2012.

[KG12] Vassilis Kalofolias and Efstratios Gallopoulos. Computing symmetric nonnegative rank factorizations. Linear Algebra and its Applications, 436(2):421–435, 2012.

[KPRW19] Ravi Kumar, Rina Panigrahy, Ali Rahimi, and David Woodruff. Faster algorithms for binary matrix factorization. In International Conference on Machine Learning (ICML), pages 3551–3559, 2019.

[Leh74] Philippe GH Lehot. An optimal algorithm to detect a line graph and output its root graph. Journal of the ACM (JACM), 21(4):569–575, 1974.

[LHW17] Songtao Lu, Mingyi Hong, and Zhengdao Wang. A nonconvex splitting method for symmetric nonnegative matrix factorization: Convergence analysis and optimality. IEEE Transactions on Signal Processing, 65(12):3120–3135, 2017.

[Lov77] L Lovász. Problem, Beitrag zur Graphentheorie und deren Anwendungen, vorgetragen auf dem Intern. Koll., 1977.

[LRA93] Sue E Leurgans, Robert T Ross, and Rebecca B Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.

[LT93] AG Levin and Regina Iosifovna Tyshkevich. Edge hypergraphs. Diskretnaya Matematika, 5(1):112–129, 1993.

[LTVM15] Dajie Liu, Stojan Trajanovski, and Piet Van Mieghem. ILIGRA: an efficient inverse line graph algorithm. Journal of Mathematical Modelling and Algorithms in Operations Research, 14(1):13–33, 2015.

[LVAH12] Haibing Lu, Jaideep Vaidya, Vijayalakshmi Atluri, and Yuan Hong. Constraint-aware role mining via extended boolean matrix decomposition. IEEE Transactions on Dependable and Secure Computing, 9(5):655–669, 2012.

[MMG+08] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering (TKDE), 20(10):1348–1362, 2008.

[MN20] Pauli Miettinen and Stefan Neumann. Recent developments in boolean matrix factorization. arXiv preprint arXiv:2012.03127, 2020.

[Moi13] Ankur Moitra. An almost optimal algorithm for computing nonnegative rank. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1454–1464. SIAM, 2013.

[MSVA16] Barsha Mitra, Shamik Sural, Jaideep Vaidya, and Vijayalakshmi Atluri. A survey of role mining. ACM Computing Surveys (CSUR), 48(4):1–37, 2016.

[MT97] Yury Metelsky and Regina Tyshkevich. On line graphs of linear 3-uniform hypergraphs. Journal of , 25(4):243–251, 1997.

[Ngu13] Hoi H Nguyen. On the singularity of random combinatorial matrices. SIAM Journal on Discrete Mathematics, 27(1):447–458, 2013.

[Orl77] James Orlin. Contentment in graph theory: Covering graphs with cliques. In Indagationes Mathematicae (Proceedings), volume 80, pages 406–424. Elsevier, 1977.

[Pol19] Yury Polyanskiy. Hypercontractivity of spherical averages in hamming space. SIAM Journal on Discrete Mathematics, 33(2):731–754, 2019.

[PRT81] Svatopluk Poljak, Vojtěch Rödl, and Daniel Turzik. Complexity of representation of graphs by set systems. Discrete Applied Mathematics, 3(4):301–312, 1981.

[Rou73] Nicholas D Roussopoulos. A max{m, n} algorithm for determining the graph H from its line graph G. Information Processing Letters, 2(4):108–112, 1973.

[RPG16] Siamak Ravanbakhsh, Barnabás Póczos, and Russell Greiner. Boolean matrix factorization and noisy completion via message passing. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 945–954, 2016.

[RSW16] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 250–263, 2016.

[RV08] Mark Rudelson and Roman Vershynin. The Littlewood–Offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.

[SBM03] Jouni K Seppänen, Ella Bingham, and Heikki Mannila. A simple algorithm for topic identification in 0–1 data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 423–434. Springer, 2003.

[ŠH06] Tomáš Šingliar and Miloš Hauskrecht. Noisy-or component analysis and its application to link analysis. Journal of Machine Learning Research (JMLR), 7(Oct):2189–2213, 2006.

[SST05] PV Skums, SV Suzdal, and RI Tyshkevich. Edge intersection graphs of linear 3-uniform hypergraphs. Electronic Notes in Discrete Mathematics, 22:33–40, 2005.

[SWZ17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 688–701, 2017.

[SWZ19] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2772–2789. SIAM, 2019.

[Sys82] Maciej M Syslo. A labeling algorithm to recognize a line digraph and output its root graph. Information Processing Letters, 15(1):28–30, 1982.

[VAG07] Jaideep Vaidya, Vijayalakshmi Atluri, and Qi Guo. The role mining problem: finding a minimal descriptive set of roles. In Proceedings of the 12th ACM symposium on Access control models and technologies, pages 175–184, 2007.

[Vav10] Stephen A Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2010.

[VGL+16] Arnaud Vandaele, Nicolas Gillis, Qi Lei, Kai Zhong, and Inderjit Dhillon. Efficient and non-convex coordinate descent for symmetric nonnegative matrix factorization. IEEE Transactions on Signal Processing, 64(21):5571–5584, 2016.

[Vu08] Van Vu. Random discrete matrices. In Horizons of combinatorics, pages 257–280. Springer, 2008.

[Whi92] Hassler Whitney. Congruent graphs and the connectivity of graphs. In Hassler Whitney Collected Papers, pages 61–79. Springer, 1992.

[WLW+11] Fei Wang, Tao Li, Xin Wang, Shenghuo Zhu, and Chris Ding. Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3):493–521, 2011.

[YGL+13] Xiaohui Yan, Jiafeng Guo, Shenghua Liu, Xueqi Cheng, and Yanfeng Wang. Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In Proceedings of the SIAM International Conference on Data Mining (ICDM), pages 749–757. SIAM, 2013.

[YHD+12] Zhirong Yang, Tele Hao, Onur Dikmen, Xi Chen, and Erkki Oja. Clustering by nonnegative matrix factorization using graph random walk. Advances in Neural Information Processing Systems (NeurIPS), 25:1079–1087, 2012.

[ZLDZ07] Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. Binary matrix factorization with applications. In Seventh IEEE international conference on data mining (ICDM), pages 391–400. IEEE, 2007.

[ZLLL18] Zhihui Zhu, Xiao Li, Kai Liu, and Qiuwei Li. Dropping symmetry for fast symmetric nonnegative matrix factorization. Advances in Neural Information Processing Systems (NeurIPS), 31:5154–5164, 2018.

[ZS05] Ron Zass and Amnon Shashua. A unifying approach to hard and probabilistic clustering. In Tenth IEEE International Conference on Computer Vision (ICCV), pages 294–301. IEEE, 2005.

[ZWA13] Zhong-Yuan Zhang, Yong Wang, and Yong-Yeol Ahn. Overlapping community detection in complex networks using symmetric binary matrix factorization. Physical Review E, 87(6):062803, 2013.

A Deferred Proofs from Section 4

A.1 Proof of Fact 4.1

Fact A.1. There exist absolute constants $C, C' > 0$ for which the following holds. If $r \ge C\cdot k^2$, then for any $0 \le t \le 3k$, we have that $\mu_t \ge \mu_{t+1} + C'k/r$ and $\mu_t \ge 1 - O(tk/r)$.

Proof. For any $0 \le t \le r$, note that

$$\binom{r-t}{k}\Big/\binom{r}{k} - \binom{r-t-1}{k}\Big/\binom{r}{k} = \frac{r-t-1}{r}\cdot\frac{r-t-2}{r-1}\cdots\frac{r-t-k+1}{r-k+2}\cdot\Big(\frac{r-t}{r-k+1} - \frac{r-t-k}{r-k+1}\Big)$$
$$\ge \Big(1 - \frac{t+1}{r-k+2}\Big)^{k-1}\cdot\frac{k}{r-k+1}$$
$$\ge \Big(1 - \frac{(t+1)(k-1)}{r-k+2}\Big)\cdot\frac{k}{r-k+1} \ \ge\ \Omega(k/r),$$

where in the last step we used the assumptions that $r \ge \Omega(k^2)$ and $t \le O(k)$. For the second part of the claim, we similarly have that

$$\binom{r-t}{k}\Big/\binom{r}{k} = \frac{r-t}{r}\cdot\frac{r-t-1}{r-1}\cdots\frac{r-t-k+1}{r-k+1} \ \ge\ \Big(1 - \frac{t}{r-k+1}\Big)^k \ \ge\ 1 - \frac{tk}{r-k+1} \ \ge\ 1 - O(tk/r). \qquad \square$$
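As a quick numerical illustration of Fact 4.1 (our own check, not part of the proof; the constants 0.2 and 2 are ad hoc):

```python
from math import comb

r, k = 400, 5                      # r >= C * k^2 for a moderate constant C
mu = [comb(r - t, k) / comb(r, k) for t in range(3 * k + 2)]

# Consecutive gaps are on the order of k/r, and mu_t stays 1 - O(tk/r).
assert all(mu[t] - mu[t + 1] > 0.2 * k / r for t in range(3 * k + 1))
assert all(mu[t] >= 1 - 2 * t * k / r for t in range(3 * k + 1))
```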

A.2 Proof of Lemma 4.2

Lemma A.2. If $m \ge \widetilde{\Omega}(rk\log(1/\delta))$, then with probability at least $1-\delta$ over the randomness of $M$, there is an algorithm for computing $T_{a,b,c}$ for any $a,b,c\in[m]$ in time $O(m^{\omega+1})$.

Proof. If entries $a, b, c$ correspond to subsets $S_1, S_2, S_3$ of $[r]$, then

$$T_{a,b,c} = |S_1\cap S_2\cap S_3| = |S_1\cup S_2\cup S_3| - |S_1\cup S_2| - |S_2\cup S_3| - |S_1\cup S_3| + 3k. \tag{8}$$

By Fact 2.1 and the second part of Fact 4.1 applied to $t = |S_i\cup S_j|$ or $t = |S_1\cup S_2\cup S_3|$, we conclude that any $\mu_{|S_i\cup S_j|}$ or $\mu_{|S_1\cup S_2\cup S_3|}$ can be estimated to error $C'k/2r$ with probability $1 - \delta/m^3$ provided

that $m \ge \Omega\big((t^2 r/k)\log(m^3/\delta)\big)$. Note that $t \le 3k$, so this holds by the bound on $m$ in the hypothesis of the lemma. By the first part of Fact 4.1, provided these quantities can be estimated within the desired accuracy, we can exactly recover every $|S_i\cup S_j|$ as well as $|S_1\cup S_2\cup S_3|$. So by (8) and a union bound over all $(a,b,c)$, with probability at least $1-\delta$ we can recover every entry of $T$.

It remains to show that $T$ can be computed in the claimed time. Note that naively, each entry would require $O(m)$ time to compute, leading to an $O(m^4)$ runtime. We now show how to do this more efficiently with fast matrix multiplication. Fix any $a\in[m]$ and consider the $a$-th slice of $T$. Recall that for every $b,c\in[m]$, we would like to compute the number of columns $\ell\in[m]$ for which $M_{a,\ell}, M_{b,\ell}, M_{c,\ell}$ are all zero (note that we can compute the other relevant statistics, like the number of $\ell$ for which $M_{a,\ell}$ and $M_{b,\ell}$ are both zero, in total time $O(m^3)$ across all $a, b$ even naively). We can first restrict our attention to the set $L_a$ of $\ell$ for which $M_{a,\ell} = 0$, which can be computed in time $O(m)$. Let $\overline{M}_a$ denote the matrix given by restricting $M$ to the columns indexed by $L_a$ and subtracting every resulting entry from 1. By design, $(\overline{M}_a\overline{M}_a^\top)_{b,c}$ is equal to the number of $\ell\in L_a$ for which $M_{b,\ell} = M_{c,\ell} = 0$, and the matrix $\overline{M}_a\overline{M}_a^\top$ can be computed in time $O(m^\omega)$. The claimed runtime follows. $\square$

A.3 Proof of Lemma 4.3

Lemma A.3. Given a collection of linearly independent vectors $w_1,\ldots,w_r \in \mathbb{R}^m$, there is an algorithm that takes any tensor $T = \sum_{i=1}^r w_i^{\otimes 3}$, runs in time $O(m^\omega)$, and outputs a list of vectors $\widehat{w}_1,\ldots,\widehat{w}_r$ for which there exists a permutation $\pi$ satisfying $\widehat{w}_i = w_{\pi(i)}$ for all $i\in[r]$.

Proof. The algorithm is simply to run Jennrich's algorithm ([Har70, LRA93]), but we include a proof for completeness. Pick random $v_1, v_2 \in \mathbb{S}^{m-1}$ and define $M_1 \triangleq T(\mathrm{Id},\mathrm{Id},v_1)$ and $M_2 \triangleq T(\mathrm{Id},\mathrm{Id},v_2)$. If $W \in \mathbb{R}^{m\times r}$ is the matrix whose columns consist of $w_1,\ldots,w_r$, then we can write $M_a = \sum_{i=1}^r \langle w_i, v_a\rangle\, w_iw_i^\top = WD_aW^\top$, where $D_a \triangleq \mathrm{diag}(\langle w_1,v_a\rangle,\ldots,\langle w_r,v_a\rangle)$. As a result, $M_1M_2^+ = WD_1D_2^+W^+$. This gives an eigendecomposition of $M_1M_2^+$ because the entries of $D_1D_2^+$ are distinct almost surely and because the $\{w_i\}$ are linearly independent. We conclude that the non-trivial eigenvectors of $M_1M_2^+$ are precisely the vectors $\{w_i\}$ up to permutation, as claimed. Forming $M_1$ and $M_2$ takes $O(m^3)$ time, and forming $M_1M_2^+$ and computing its eigenvectors takes $O(m^\omega)$ time. $\square$
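For intuition, Jennrich's algorithm as used in this proof is only a few lines of numpy (a simulation sketch; the test tensor is random and the Boolean rounding step of TensorRecover is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 30, 5
W = rng.standard_normal((m, r))              # linearly independent columns w.h.p.
T = np.einsum('ia,ja,ka->ijk', W, W, W)      # T = sum_i w_i tensor-cubed

v1, v2 = rng.standard_normal(m), rng.standard_normal(m)
M1 = np.einsum('ijk,k->ij', T, v1)           # T(Id, Id, v1) = W D1 W^T
M2 = np.einsum('ijk,k->ij', T, v2)           # T(Id, Id, v2) = W D2 W^T

# Eigenvectors of M1 M2^+ with nonzero eigenvalue recover the columns of W
# up to permutation and scaling.
eigvals, eigvecs = np.linalg.eig(M1 @ np.linalg.pinv(M2))
top = np.argsort(-np.abs(eigvals))[:r]
U = np.real(eigvecs[:, top])

U /= np.linalg.norm(U, axis=0)
Wn = W / np.linalg.norm(W, axis=0)
# Each row/column of |U^T Wn| has a single entry ~1 (the matched column).
print(np.round(np.abs(U.T @ Wn), 2))
```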

A.4 Proof of Theorem 4.4

In this section we prove Theorem 4.4, restated here for convenience:

Theorem 4.4 (Linear independence of $W$). Let $W \in \{0,1\}^{m\times r}$ be a random matrix whose rows are i.i.d. random vectors, each following a uniform distribution over $\{0,1\}^r$ with exactly $k$ ones. For constant $k\ge 1$ and $m = \Omega((r/k)\log r)$, the $r$ columns of $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

Related Results. This kind of problem has been well-studied in random matrix theory when each entry of $W$ is an i.i.d. Bernoulli random variable of value $\pm 1$ (see for example [CV10, RV08, CV08]). However, in our setting, there exists some dependence within a row since we require that the row sum equals $k$. Most of the previous work for this model (for example [Ngu13, FJLS20, Jai19, AHP20, FKS20]) studied the case when $k$ is large (i.e., when $k = \Omega(\log r)$), and there was a long-standing conjecture in [Vu08] that the singularity probability of the adjacency matrix of a random $k$-regular graph is $o(1)$ when $3 \le k < r$. A recent work [Hua18] fully resolved this conjecture using a local central limit theorem and large deviation estimates. Note that our model

is a little different, since $W$ corresponds to the adjacency matrix of a bipartite graph which is only left-regular. Here, we give a more elementary proof, drawing in part on ideas from [FKS20], to show that for "tall" matrices, the column-singularity probability is $r^{-O(1)}$ for any constant $k \ge 1$.

Our Proof. We treat the cases of odd and even $k$ separately. For the former, we will show that the columns of $W$ are linearly independent over $\mathbb{F}_2$ with high probability, which by the following is sufficient for linear independence over $\mathbb{R}$:

Lemma A.4 (Reduction from $\mathbb{F}_2$ to $\mathbb{R}$). Let $W \in \{0,1\}^{m\times r}$. If the columns of $W$ are linearly independent over $\mathbb{F}_2$, then they are also linearly independent over $\mathbb{R}$.

Proof. We prove the contrapositive statement, i.e., if the columns are linearly dependent over $\mathbb{R}$, then they are also dependent over $\mathbb{F}_2$.

Let $w_1,\ldots,w_r$ be the columns of $W$. If they are linearly dependent over $\mathbb{R}$, then there exist $c_1,\ldots,c_r \in \mathbb{R}$, not all zero, such that $\sum_{i=1}^r c_iw_i = 0$. By Gaussian elimination, it is easy to see that we may take $c_1,\ldots,c_r \in \mathbb{Q}$. By multiplying through by a common factor (and dividing out any common powers of two), we can get $r$ integers $c'_1,\ldots,c'_r \in \mathbb{Z}$ such that not all of them are even numbers and

$$c'_1w_1 + c'_2w_2 + \cdots + c'_rw_r = 0.$$

Then, applying $\pmod 2$ to both sides of the equation, for any $j\in[m]$ the $j$-th entry of the resulting vector is

$$\sum_{i=1}^r c'_i\,W_{ij} \pmod 2 \;=\; \bigoplus_{i=1}^r (c'_i \bmod 2)\cdot W_{ij} \;=\; 0, \tag{9}$$

where the first step follows from $W_{ij} \in \{0,1\}$.

Define a vector $a \in \mathbb{F}_2^r$ such that $a_i := c'_i \bmod 2$ for $i\in[r]$. Then $a \ne 0$, and Eq. (9) implies that $Wa = 0$ over $\mathbb{F}_2$, which means the columns of $W$ are linearly dependent over $\mathbb{F}_2$. The claim hence follows. $\square$
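For illustration, checking linear independence over $\mathbb{F}_2$ is a short Gaussian-elimination routine (our own hypothetical helper, not part of the paper's algorithm); by Lemma A.4, full $\mathbb{F}_2$-rank certifies independence over $\mathbb{R}$ as well:

```python
import numpy as np

def rank_gf2(A):
    # Gaussian elimination over F_2 on a copy of the 0/1 matrix A.
    A = (A.astype(int) % 2).copy()
    m, n = A.shape
    rank, row = 0, 0
    for col in range(n):
        pivots = np.nonzero(A[row:, col])[0]
        if pivots.size == 0:
            continue
        A[[row, row + pivots[0]]] = A[[row + pivots[0], row]]  # swap pivot up
        for i in range(m):
            if i != row and A[i, col]:
                A[i] ^= A[row]                                 # XOR-eliminate
        rank, row = rank + 1, row + 1
        if row == m:
            break
    return rank

# If rank_gf2(W) == r, the columns of W are independent over F_2, hence over R.
```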

We are now ready to handle the case of odd $k$. The following lemma proves the $\mathbb{F}_2$ case.

Lemma A.5 (Linear independence over $\mathbb{F}_2$). For $r \ge 0$ and odd constant $k$, if $m = \Omega((r/k)\log r) \ge r$, then with probability $1 - \frac{1}{\mathrm{poly}(r)}$, the columns of the matrix $W$ are linearly independent over $\mathbb{F}_2$.

Proof. To show that the columns of $W$ are linearly independent, equivalently, we can show that $\ker(W) = \emptyset$ with high probability; that is, for all $x \in \mathbb{F}_2^r\setminus\{0\}$, $Wx \ne 0$.

For $1 \le \lambda \le r$, let $P_\lambda := \Pr[Wu = 0]$ for $u \in \mathbb{F}_2^r$ with $|u| = \lambda$. Note that this probability is the same for all weight-$\lambda$ vectors. Then, we have

$$\Pr[\ker(W) \ne \emptyset] \ \le\ \sum_{\lambda=1}^r P_\lambda\cdot\big|\{u\in\mathbb{F}_2^r : |u| = \lambda\}\big| \ \le\ \sum_{\lambda=1}^r P_\lambda\cdot\binom{r}{\lambda}.$$

Fix $\lambda \in [r]$ and $u \in \mathbb{F}_2^r$ with weight $\lambda$. Since the rows of $W$ are independent, we have

$$P_\lambda = \Pr[Wu = 0] = \prod_{i=1}^m \Pr[\langle w_i, u\rangle = 0] = \big(\Pr[\langle w, u\rangle = 0]\big)^m,$$

where $w$ is a uniformly random vector in the set $\{v \in \mathbb{F}_2^r : |v| = k\}$. It is easy to see that $k$ should be an odd number; otherwise, $W\vec{1} = 0$. By Proposition A.7, we have

$$\Pr[\ker W \ne \emptyset] \le \sum_{\lambda=1}^{r/2}\Big(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\Big)^m\binom{r}{\lambda} + \Big(\frac12\Big)^m\sum_{\lambda=1}^{r/2}\binom{r}{\lambda}$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\Big(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\Big)^m\binom{r}{\lambda} \qquad \Big(\textstyle\sum_{i=0}^{r/2}\binom{r}{i} = 2^{r-1}\Big)$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\Big(\frac{1+k\lambda/r}{1+2k\lambda/r}\Big)^m\binom{r}{\lambda} \qquad \Big(\big(1-\tfrac{2k}{r}\big)^{\lambda} \le \tfrac{1}{1+2k\lambda/r}\Big)$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\exp\Big(-\frac{km\lambda/r}{1+2k\lambda/r}\Big)\binom{r}{\lambda} \qquad (1-x \le e^{-x})$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\exp\Big(\Big(-\frac{km/r}{1+2k\lambda/r} + \log r\Big)\lambda\Big) \qquad \Big(\binom{r}{\lambda} \le r^{\lambda}\Big)$$

When $m = \Omega((r/k)\log r)$, we have $\exp\big(-\frac{km/r}{1+2k\lambda/r} + \log r\big) \le \frac{1}{r^{O(1)}}$. Hence,

$$\Pr[\ker W \ne \emptyset] \le 2^{r-1-\Omega(r\log r)} + (r/2)\cdot\frac{1}{r^{O(1)}} = \frac{1}{r^{O(1)}},$$

and the lemma is then proved. $\square$

Combining Lemma A.4 and Lemma A.5, we immediately have the following corollary:

Corollary A.6 (Linear independence over $\mathbb{R}$ for odd $k$). For $r \ge 0$ and odd constant $k \ge 1$, when $m = \Omega((r/k)\log r) \ge r$, the columns of the matrix $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

In the proof of Lemma A.5 above, we used the following helper proposition:

Proposition A.7. For $1 \le \lambda \le \frac{r}{2}$, we have either
$$P_\lambda \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m} \quad\text{and}\quad P_{r-\lambda} \le \Big(\frac12\Big)^{m},$$
or
$$P_\lambda \le \Big(\frac12\Big)^{m} \quad\text{and}\quad P_{r-\lambda} \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m}.$$

Proof. We can write the probability that $\langle w, u\rangle = 0$ as follows:
$$\Pr[\langle w,u\rangle = 0] = \binom{r}{k}^{-1}\sum_{i=0,\ i\ \mathrm{even}}^{k-1} \binom{\lambda}{i}\binom{r-\lambda}{k-i}.$$
We first consider the following sum with alternating signs:
$$K^r_k(\lambda) := \sum_{i=0}^{k} \binom{\lambda}{i}\binom{r-\lambda}{k-i}(-1)^{i},$$
which is the binary Krawtchouk polynomial. Then, for $1 \le \lambda \le \frac{r}{2}$, we have
$$\Pr[\langle w,u\rangle = 0] = \binom{r}{k}^{-1}\sum_{i=0,\ i\ \mathrm{even}}^{k-1}\binom{\lambda}{i}\binom{r-\lambda}{k-i} = \frac12 + \frac{K^r_k(\lambda)}{2\binom{r}{k}}.$$
By symmetry, for $\frac{r}{2} < \lambda < r$,
$$\Pr[\langle w,u\rangle = 0] = \binom{r}{k}^{-1}\sum_{i=0,\ i\ \mathrm{even}}^{k-1}\binom{\lambda}{i}\binom{r-\lambda}{k-i} = \binom{r}{k}^{-1}\sum_{i=1,\ i\ \mathrm{odd}}^{k}\binom{r-\lambda}{i}\binom{\lambda}{k-i} = \frac12 - \frac{K^r_k(r-\lambda)}{2\binom{r}{k}}.$$
By Lemma A.8, we know that for $1 \le \lambda \le \frac{r}{2}$,
$$\big|K^r_k(\lambda)\big| \le \binom{r}{k}\Big(1 - \frac{2k}{r}\Big)^{\lambda}.$$
Therefore, if $K^r_k(\lambda) > 0$, then
$$P_\lambda \le \Pr[\langle w,u\rangle = 0]^m \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m}, \qquad P_{r-\lambda} \le \Big(\frac12\Big)^{m}.$$
If $K^r_k(\lambda) < 0$, then
$$P_{r-\lambda} \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m}, \qquad P_\lambda \le \Big(\frac12\Big)^{m}.$$
The proposition hence follows.
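The identity $\Pr[\langle w,u\rangle=0]=\frac12+K^r_k(\lambda)/(2\binom{r}{k})$ is easy to verify by brute force for small parameters. The following sketch (our illustration, not from the paper) checks it by enumerating all weight-$k$ vectors $w$.

```python
from itertools import combinations
from math import comb

def krawtchouk(r, k, lam):
    """Binary Krawtchouk polynomial K^r_k(lambda)."""
    return sum((-1) ** i * comb(lam, i) * comb(r - lam, k - i)
               for i in range(k + 1))

def prob_even_overlap(r, k, lam):
    """Pr[<w,u> = 0 over F_2] for uniform weight-k w and fixed weight-lam u."""
    u = set(range(lam))  # wlog u is supported on the first lam coordinates
    good = sum(1 for S in combinations(range(r), k)
               if len(u & set(S)) % 2 == 0)
    return good / comb(r, k)

r, k, lam = 10, 3, 4
lhs = prob_even_overlap(r, k, lam)
rhs = 0.5 + krawtchouk(r, k, lam) / (2 * comb(r, k))
assert abs(lhs - rhs) < 1e-12
```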

In the proof of Proposition A.7, we needed the following estimate for the binary Krawtchouk polynomial:

Lemma A.8 ([Pol19]). For $k \le 0.16r$ and $\lambda \le \frac{r}{2}$, we have
$$\big|K^r_k(\lambda)\big| \le \binom{r}{k}\cdot\Big(1-\frac{2k}{r}\Big)^{\lambda}.$$
Next, we turn to the case of even $k$. When $k$ is even we cannot use Lemma A.4, because the columns of $W$ are not linearly independent over $\mathbb{F}_2$ (indeed, $W\mathbf{1} = 0$). Instead, we use a variant of the proof in [FKS20] to show that in this case, the linear independence of the columns of $W$ over $\mathbb{R}$ still holds with high probability:

Lemma A.9. For $r \ge 0$ and even constant $k > 0$, when $m = \Omega((r/k)\log r)$, the columns of the matrix $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

Proof. Suppose the columns of $W$ are not linearly independent, i.e., there exists a nonzero $x \in \mathbb{Q}^r$ such that $Wx = 0$. Then we can multiply by an integer and divide by a power of two to obtain a vector $c \in \mathbb{Z}^r$ with at least one odd entry such that $\langle w_i, c\rangle = 0$ for all $i \in [m]$, where $w_i$ denotes the $i$-th row of $W$.

Case 1: $|\mathrm{supp}(c)| < r$. Note that the reduction (Lemma A.4) fails only when all of the entries of $c$ are odd, because $W\mathbf{1} = 0$ over $\mathbb{F}_2$ when $k$ is even. That is, when $|\mathrm{supp}(c)| < r$, by Lemma A.5 we know that the columns of $W$ are linearly dependent with probability at most $r^{-O(1)}$.

Case 2: $|\mathrm{supp}(c)| = r$. Following the proof in [FKS20], we define a fibre of a vector to be the set of all indices whose entries are equal to a particular value. Define the set
$$\mathcal{P} := \{c \in \mathbb{Z}^r : c \text{ has largest fibre of size at most } (1-\delta)r\},$$
where $\delta \in (0,1)$ is a constant to be chosen later. Then, writing $m = m_1 + m_2$ where $m_2 = \Theta(\log r)$, we have

$$\Pr\big[Wc = 0 \text{ with } |\mathrm{supp}(c)| = r\big] \le B_1 + B_2,$$
where

$$B_1 := \Pr\big[\exists\ \text{nonzero } c \notin \mathcal{P} \text{ such that } \langle w_i, c\rangle = 0\ \forall i \in [m_1]\big],$$
$$B_2 := \max_{c\in\mathcal{P}} \Pr\big[\langle w_i, c\rangle = 0\ \forall i \in [m_2]\big].$$
For $B_1$, let $q = k-1$. It suffices to prove that there is no non-constant vector $c \in \mathbb{Z}_q^r$ with largest fibre of size at least $(1-\delta)r$ such that $\langle w_i, c\rangle \equiv 0 \pmod q$ for all $i \in [m_1]$. Then, by Lemma A.10 with $t = O(r/k)$, the probability can be upper bounded by

$$\begin{aligned}
B_1 &\le \sum_{s=1}^{\delta r} \binom{r}{s}\, q^{s+1}\, (P_s)^{m_1} \\
&\le \sum_{s=1}^{t} 2^{s\log r + (s+1)\log q - O(sk/r)\cdot m_1} + \sum_{s=t+1}^{\delta r} 2^{s\log r + (s+1)\log q - \Omega(m_1)} \\
&\le \delta r\cdot 2^{-\Omega(\log r)} = r^{-O(1)},
\end{aligned}$$

if we take $m_1 = \Omega((r/k)\log r)$ and $\delta = \frac{3}{2k} < 1$.

For $B_2$, by Lemma A.11, we have

$$B_2 = \max_{c\in\mathcal{P}} \Pr\big[\langle w_i, c\rangle = 0\ \forall i\in[m_2]\big] \le O\Big(\sqrt{r/(\delta r k)}\Big)^{\Omega(\log r)} \le r^{-\Omega(1)}.$$
Hence, combining $B_1$ and $B_2$ together, we know that in Case 2,
$$\Pr\big[\exists c : Wc = 0 \ \wedge\ |\mathrm{supp}(c)| = r\big] \le r^{-O(1)}.$$
Therefore, by a union bound over Cases 1 and 2, we get
$$\Pr\big[\text{the columns of } W \text{ are linearly independent}\big] \ge 1 - r^{-O(1)},$$
which completes the proof of the lemma.

In the above proof of Lemma A.9, we used the following two results from [FKS20]:

Lemma A.10 ([FKS20]). Let $w \in \{0,1\}^r$ be a random vector with exactly $k$ ones, where $k$

Proof. By Lemma 4.2, as long as $m$ satisfies the bound in the hypothesis, with probability at least $1-\delta$ one can successfully form the tensor $T = \sum_{i=1}^r W_i^{\otimes 3}$ in time $O(m^{\omega+1})$. By Theorem 4.4, the columns of $W$ are linearly independent with high probability. By Lemma 4.3, one can therefore recover the columns of $W$ up to permutation.

B Deferred Proofs from Section 5

In this section we provide complete proofs that were deferred from the main body of the paper, as well as a discussion of how the setup in Section 5 differs from that of the original InstaHide paper [HSLA20].

B.1 Proof of Lemma 5.6

In this section we prove Lemma 5.6, restated here for convenience.

Lemma B.1. For any absolute constant $\eta > 0$, there is an absolute constant $c > 0$ for which the following holds as long as $m \ge \Omega(\log(d/\delta))$. There is an algorithm GetHeavyPixels that takes as input $W$ defined according to Definition 5.4 and a vector $z$ satisfying $Wp = z$ for some vector $p \in \mathbb{R}^r$, runs in time $O(r\cdot m)$, and outputs $\hat{p}$ such that for every $i \in [r]$ for which $|p_i| \ge (ck/r)\cdot\overline{p}$, we have that $\hat{p}_i = p_i\cdot(1\pm\eta)$.

We will need the following basic calculation. Henceforth, given a vector $p$, let $\overline{p}$ denote the sum of its entries.

Fact B.2. For any vector $p \in \mathbb{R}^r$,
$$\mathbb{E}_S\big[\langle e_S, p\rangle^2\big] = \frac{k(r-k)}{r(r-1)}\,\|p\|_2^2 + \frac{k(k-1)}{r(r-1)}\,\overline{p}^{\,2}, \tag{10}$$
where the expectation is over a random size-$k$ subset $S \subset [r]$.

Proof. Let $\xi_i$ denote the indicator for the event that $i \in S$, so that

$$\begin{aligned}
\mathbb{E}_S\big[\langle e_S,p\rangle^2\big] &= \mathbb{E}_S\Big[\sum_{i,j\in[r]}\xi_i\xi_j\, p_i p_j\Big] \\
&= \sum_{i\in[r]} p_i^2\cdot\mathbb{E}[\xi_i] + \sum_{i\ne j} p_i p_j\cdot\mathbb{E}[\xi_i\xi_j] \\
&= \Big(\frac{k}{r} - \frac{k(k-1)}{r(r-1)}\Big)\|p\|_2^2 + \frac{k(k-1)}{r(r-1)}\,\overline{p}^{\,2} \\
&= \frac{k(r-k)}{r(r-1)}\,\|p\|_2^2 + \frac{k(k-1)}{r(r-1)}\,\overline{p}^{\,2},
\end{aligned}$$
as claimed.
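Eq. (10) is also easy to check numerically; the following Monte Carlo sketch (our illustration, with arbitrary toy parameters) compares the empirical second moment against the closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
r, k, trials = 20, 5, 100_000

p = rng.standard_normal(r)
pbar, pnorm2 = p.sum(), float(p @ p)

# empirical E_S[<e_S, p>^2] over uniformly random size-k subsets S of [r]
emp = np.mean([p[rng.choice(r, size=k, replace=False)].sum() ** 2
               for _ in range(trials)])

closed = (k * (r - k) / (r * (r - 1)) * pnorm2
          + k * (k - 1) / (r * (r - 1)) * pbar ** 2)
print(emp, closed)  # should agree up to Monte Carlo error
```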

We are now ready to prove Lemma 5.6.

Proof of Lemma 5.6. Define the vector

$$\tilde{p} := \mathbb{E}_S\big[\langle e_S, p\rangle^2\cdot e_S\big], \tag{11}$$
where the expectation is over a random subset $S \subset [r]$ of size $k$, and $e_S \in \{0,1\}^r$ is the indicator vector of the subset $S$. The $i$-th entry of $\tilde{p}$ is given by

$$\tilde{p}_i = \binom{r}{k}^{-1}\cdot\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\big(\langle e_S, p\rangle + p_i\big)^2 = \binom{r}{k}^{-1}\cdot\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\big(p_i^2 + 2p_i\cdot p_S + p_S^2\big) \qquad \Big(p_S := \textstyle\sum_{j\in S} p_j\Big)$$
$$= \binom{r}{k}^{-1}\cdot\Bigg(p_i^2\cdot\binom{r-1}{k-1} + 2p_i\cdot\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S + \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S^2\Bigg).$$
For the second term, we have

$$\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S = \sum_{j\in[r]\setminus\{i\}} p_j\cdot\binom{r-2}{k-2}.$$

For the third term, we have

$$\begin{aligned}
\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S^2 &= \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\ \sum_{j,\ell\in S} p_j p_\ell \\
&= \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\ \sum_{j\in S} p_j^2 \;+\; \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\ \sum_{j\ne\ell\in S} p_j p_\ell \\
&= \sum_{j\in[r]\setminus\{i\}} p_j^2\cdot\binom{r-2}{k-2} + \sum_{\substack{j,\ell\in[r]\setminus\{i\}\\ j\ne\ell}} p_j p_\ell\cdot\binom{r-3}{k-3}.
\end{aligned}$$
Hence, the $i$-th entry of $\mathbb{E}_S[\langle e_S,p\rangle^2\cdot e_S]$ is
$$\mathbb{E}_S\big[e_S\cdot\langle e_S,p\rangle^2\big]_i = p_i^2\cdot\frac{k(r-2k+1)}{r(r-1)} + \|p\|_2^2\cdot\frac{k(k-1)(r-k)}{r(r-1)(r-2)} + 2p_i\overline{p}\cdot\frac{k(k-1)}{r(r-1)} + \overline{p}^{\,2}\cdot\frac{k(k-1)(k-2)}{r(r-1)(r-2)}.$$
We conclude by Fact B.2 that the $i$-th entry of $\mathbb{E}_S\big[\big(e_S - \frac{k-1}{r-2}\cdot\vec{1}\big)\cdot\langle e_S,p\rangle^2\big]$ is bounded by
$$p_i^2\cdot\frac{k(r-2k+1)}{r(r-1)} + 2p_i\overline{p}\cdot\frac{k(k-1)}{r(r-1)} + \overline{p}^{\,2}\cdot\frac{k(k-1)(2k-3)}{r(r-1)(r-2)}.$$
We do not have exact access to $\tilde{p}$, but we may form the unbiased estimator

$$\tilde{p}' := \frac{1}{m}\sum_{i=1}^m \Big(w_i - \frac{k-1}{r-2}\cdot\vec{1}\Big)\, z_i^2, \tag{12}$$
where $w_i$ is the $i$-th row of $W$. For any $i\in[m]$, each coordinate of $w_i\cdot z_i^2$ is bounded within the interval $[-\|z\|_\infty^2, \|z\|_\infty^2]$, so by Chernoff, provided that $m \ge \Omega(\log(d/\delta)/\epsilon^2)$, we ensure that $\tilde{p}'_i \in \tilde{p}_i\cdot(1\pm\epsilon)$ for all $i$ with probability at least $1-\delta$. Now consider the following estimator for $p_i^2$:
$$\hat{q}_i := \tilde{p}'_i\cdot\frac{r(r-1)}{k(r-2k+1)}. \tag{13}$$
We can thus upper bound the error of $\hat{q}_i$ relative to $p_i^2$ by
$$O\Big(\frac{\overline{p}}{p_i}\cdot\frac{k}{r} + \frac{\overline{p}^{\,2}}{p_i^2}\cdot\frac{k^2}{r^2}\Big) \;\pm\; O(\epsilon)\cdot O\Big(1 + \frac{\overline{p}}{p_i}\cdot\frac{k}{r} + \frac{\overline{p}^{\,2}}{p_i^2}\cdot\frac{k^2}{r^2}\Big).$$
If we assume that $|p_i| \ge \Omega(k/r)\cdot\overline{p}$ and $\epsilon = O(1)$ with appropriately chosen constant factors, then we have $\hat{q}_i \in p_i^2\cdot(1\pm\eta)$, as desired.

We are now ready to complete the proof of Theorem 5.1:

Proof. By Theorem 1.2 and the assumed lower bound on $m$, we can exactly recover the selection matrix $W$ (up to some column permutation) in time $O(m^{\omega+1})$. Using Lemma 5.6, for every pixel index $j\in[d]$ we can run GetHeavyPixels to recover the pixels in position $j$ which are heaviest among the $r$ private images, in time $O(m\cdot r)$, yielding the desired guarantee.
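Estimators (12) and (13) translate directly into code. The sketch below is our illustration with arbitrary toy parameters, not the paper's implementation; it recovers the squared magnitude of a planted heavy coordinate from $(W, z)$.

```python
import numpy as np

def heavy_pixel_estimates(W, z, k):
    """Estimate p_i^2 from z = Wp via the unbiased estimator (12)-(13).

    W : (m, r) 0/1 matrix with k-sparse rows; z : length-m vector.
    Reliable only for coordinates with |p_i| = Omega(k/r) * sum(p).
    """
    m, r = W.shape
    centered = W - (k - 1) / (r - 2)                      # rows w_i - (k-1)/(r-2) * 1
    p_prime = (centered * z[:, None] ** 2).mean(axis=0)   # Eq. (12)
    return p_prime * r * (r - 1) / (k * (r - 2 * k + 1))  # Eq. (13)

# toy usage with one planted heavy coordinate
rng = np.random.default_rng(2)
r, k, m = 50, 3, 20_000
p = rng.uniform(0, 0.1, size=r)
p[7] = 5.0                                   # heavy coordinate
W = np.zeros((m, r))
for i in range(m):
    W[i, rng.choice(r, size=k, replace=False)] = 1
q_hat = heavy_pixel_estimates(W, W @ p, k)
print(q_hat[7], p[7] ** 2)  # roughly equal, up to the lower-order error terms
```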

B.2 Comparison of Setup to [HSLA20]

In the original formulation of InstaHide [HSLA20], a private and a public dataset are used in tandem to construct the synthetic dataset. As was shown in [CLSZ21], by a connection to phase retrieval it is not hard to remove the contribution of the public dataset to the synthetic dataset, at least when the datasets are Gaussian. For real-world image datasets, it was empirically demonstrated by [CDG+20] that, heuristically, one can essentially treat the contribution of the public images as benign white noise. In any case, the setting where one only uses private images to generate the synthetic dataset is at least as hard, because one can always treat the public dataset as private. For these reasons, in this work we focused on the case where the synthetic dataset is generated purely from private images.

C Deferred Proofs from Section 6

In this section, we give a quasipolynomial-time algorithm for the sparse matrix factorization problem. We first define a general (asymmetric) version of the problem as follows:

Definition C.1 (Sparse Boolean matrix factorization (sparse BMF)). Given an $m\times m$ matrix $M$ where each entry is in $\{0, 1, \dots, k\}$, suppose $M$ can be factorized into two matrices $U \in \{0,1\}^{m\times r}$ and $V\in\{0,1\}^{r\times m}$, where each row of $U$ is $k$-sparse and each column of $V$ is $k$-sparse. The task is to find a row $k$-sparse matrix $\hat{U}\in\{0,1\}^{m\times r}$ and a column $k$-sparse matrix $\hat{V}\in\{0,1\}^{r\times m}$ such that $M = \hat{U}\hat{V}$.

We can also define sparse Boolean matrix factorization as an optimization problem.

Definition C.2 (Sparse BMF, optimization version). Given an $m\times m$ matrix $M$ where each entry is in $\{0, 1, \dots, k\}$, the goal is to find a row $k$-sparse matrix $\hat{U}\in\{0,1\}^{m\times r}$ and a column $k$-sparse matrix $\hat{V}\in\{0,1\}^{r\times m}$ such that the number of differing entries $\|M-\hat{U}\hat{V}\|_0$ is minimized.

We give a reduction from the general sparse BMF problem to a Max 2-CSP problem, and then use a quasipolynomial-time 2-CSP solver to find an approximate solution.
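As a concrete reference point for Definitions C.1 and C.2, the objective being optimized is straightforward to evaluate; here is a minimal sketch (our illustration, not from the paper).

```python
import numpy as np

def bmf_cost(M, U_hat, V_hat, k):
    """||M - U_hat V_hat||_0 for a candidate sparse BMF, after verifying the
    sparsity constraints of Definition C.2 (rows of U_hat and columns of
    V_hat have at most k ones)."""
    assert (U_hat.sum(axis=1) <= k).all(), "rows of U_hat must be k-sparse"
    assert (V_hat.sum(axis=0) <= k).all(), "columns of V_hat must be k-sparse"
    return int((M != U_hat @ V_hat).sum())
```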

Theorem C.3 (QPTAS for asymmetric sparse BMF). Given $m, k, r \ge 0$ and an $m\times m$ matrix $M$ as the input to an instance of the sparse Boolean matrix factorization problem, let $\mathrm{OPT}$ be the optimal value of the problem, i.e., $\mathrm{OPT} := \min_{U,V}\|M-UV\|_0$, where $U, V$ satisfy the sparsity constraints of the problem.

For any $1 \ge \epsilon > 0$, there exists an algorithm that runs in $m^{O(\epsilon^{-1}k\log r)}\cdot r^{O(\epsilon^{-1}k^2\log r)}$ time and finds a row $k$-sparse matrix $\hat{U}$ and a column $k$-sparse matrix $\hat{V}$ satisfying $\|M-\hat{U}\hat{V}\|_0 \le \mathrm{OPT}+\epsilon m^2$.

Remark C.4. We briefly compare this to the guarantee of [KPRW19], who obtained a constant-factor approximation algorithm running in time $2^{O(r^2\log r)}\cdot\mathrm{poly}(m)$. By introducing a sparsity constraint on the rows of $U, V$, we circumvent the exponential dependence on $r$, at the cost of running in time quasipolynomial in $m$. In particular, our guarantee dominates when the rank parameter $r$ is at least roughly $\widetilde{\Omega}(\sqrt{\log m})$, though strictly speaking our guarantee is incomparable because we aim for an additive approximation and only measure error in $L_0$ rather than in Frobenius norm.

Proof. For the input matrix $M$, let $U$ and $V$ be the ground truth of the factorization. Let $b_1^\top,\dots,b_m^\top$ be the rows of $U$ and $c_1,\dots,c_m$ be the columns of $V$. We construct a 2-CSP instance $\mathcal{F}_M$ that finds $U$ and $V$ as follows:

• Let $\Sigma = \big\{(q_1,\dots,q_r) : q_i\in\{0,1\}\ \forall i\in[r]\ \text{and}\ \sum_{i\in[r]} q_i = k\big\}$ be the alphabet.

• The underlying graph is a complete bipartite graph. The left-side vertices $V_L = [m]$ correspond to the rows of $U$, and the right-side vertices $V_R = [m]$ correspond to the columns of $V$.

• For $e = (u,v)\in V_L\times V_R$, define the constraint $C_e$ to be: for all $(p_1,\dots,p_r), (q_1,\dots,q_r)\in\Sigma\times\Sigma$,
$$C_e\big((p_1,\dots,p_r),(q_1,\dots,q_r)\big) = 1 \iff \sum_{i=1}^r p_i q_i = M_{u,v}.$$

Note that $\mathcal{F}_M$ has value $m^2-\mathrm{OPT}$. Indeed, we can create an assignment from the ground truth such that $\sigma(u) = b_u$ for $u\in V_L$ and $\sigma(v) = c_v$ for $v\in V_R$. By definition of the sparse Boolean matrix factorization problem, this is a legal assignment. Also, since the number of pairs $(u,v)\in[m]\times[m]$ with $M_{u,v} = \langle b_u, c_v\rangle$ is $m^2-\mathrm{OPT}$, we can see that all such edges are satisfied by this assignment.

Then we can run the QPTAS (Theorem 6.2) on $\mathcal{F}_M$ and obtain an assignment under which at most $\mathrm{OPT}+\epsilon|E|$ constraints are unsatisfied, which means the number of differing entries between $M$ and $\hat{U}\hat{V}$ is at most $\mathrm{OPT}+\epsilon m^2$.

The alphabet size of $\mathcal{F}_M$ is $\binom{r}{k}\le r^k$. The reduction time is $O(m^2 r^k)$ and the 2-CSP solving time is $(mr^k)^{O(\epsilon^{-1}\log(r^k))}$ by Theorem 6.2, since the density of a complete bipartite graph is $\delta = \frac12$. The theorem is then proved.
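The reduction in the proof above is mechanical to implement. The sketch below is our illustration (only practical for very small $m$, $r$, $k$, since the alphabet has size $\binom{r}{k}$); it materializes the alphabet and the constraint tables, which would then be handed to a Max 2-CSP solver such as the one behind Theorem 6.2.

```python
from itertools import combinations

import numpy as np

def build_2csp(M, r, k):
    """Reduce sparse BMF on an m x m matrix M to a Max 2-CSP instance.

    Alphabet: all weight-k 0/1 vectors of length r.
    Variables: a left vertex per row of U, a right vertex per column of V.
    Edge (u, v) is satisfied by symbols (a, b) iff <sigma_a, sigma_b> = M[u, v].
    """
    m = M.shape[0]
    alphabet = np.array([[1 if i in S else 0 for i in range(r)]
                         for S in combinations(range(r), k)])
    inner = alphabet @ alphabet.T  # inner[a, b] = <sigma_a, sigma_b>
    # constraint table for each edge: a |Sigma| x |Sigma| boolean matrix
    tables = {(u, v): (inner == M[u, v]) for u in range(m) for v in range(m)}
    return alphabet, tables
```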

A similar reduction can be used to prove Theorem 6.3.

Theorem 6.3. Given $m, k, r\ge 0$ and a symmetric $m\times m$ matrix $M$ as the input to an instance of the sparse Boolean matrix factorization problem, let $\mathrm{OPT}$ be the optimal value of the problem, i.e., $\mathrm{OPT} := \min_{W}\|M-WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$.

For any accuracy $\epsilon\in(0,1)$, there is an algorithm running in time
$$m^{O(\epsilon^{-1}k\log r)}\cdot r^{O(\epsilon^{-1}k^2\log r)} \tag{7}$$
which finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M-\widehat{W}\widehat{W}^\top\|_0 \le \mathrm{OPT}+\epsilon m^2$.

Proof. The construction of the 2-CSP instance $\mathcal{F}_M$ is almost the same as in the proof of Theorem C.3, except that in this case the underlying graph is a complete graph, where the vertices $V = [m]$ correspond to the rows of $W$. Then, each constraint $C_{(u,v)}$ checks whether $\langle b_u, b_v\rangle = M_{u,v}$. The correctness of the reduction follows exactly as in the proof of Theorem C.3, and we omit it here. The density of $\mathcal{F}_M$ in this case is $1$, and hence the running time of the algorithm is $(mr^k)^{O(\epsilon^{-1}\log(r^k))}$.

A direct corollary of Theorem 6.3 is that sparse BMF over the Boolean semiring can also be solved in quasipolynomial time.

Corollary C.5. Given $m, k, r\ge 0$ and a symmetric Boolean $m\times m$ matrix $M$, let $\mathrm{OPT}$ be the optimal value of the problem, i.e., $\mathrm{OPT} := \min_W \|M - WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$ and the matrix multiplication is over the Boolean semiring, i.e., $a+b$ is $a\vee b$ and $a\cdot b$ is $a\wedge b$.

For any accuracy parameter $\epsilon\in(0,1)$, there exists an algorithm that runs in $m^{O(\epsilon^{-1}k\log r)}\cdot r^{O(\epsilon^{-1}k^2\log r)}$ time and finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M-\widehat{W}\widehat{W}^\top\|_0\le\mathrm{OPT}+\epsilon m^2$.

Proof. The construction can easily be adapted to the case where matrix multiplication is over the Boolean semiring, where $a+b$ becomes $a\vee b$ and $a\cdot b$ becomes $a\wedge b$: we simply modify the constraints of the 2-CSP instance in the reduction accordingly, and it is easy to see that the algorithm still works.
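For Corollary C.5, the only change to the reduction is the predicate inside each constraint: the integer inner product is replaced by an OR of ANDs. A minimal sketch of the modified predicate (our illustration):

```python
import numpy as np

def boolean_inner(p, q):
    """Inner product over the Boolean semiring: OR_i (p_i AND q_i)."""
    return int(np.any(np.logical_and(p, q)))

# In build_2csp above, edge (u, v) would now be satisfied by symbols
# (sigma_a, sigma_b) iff boolean_inner(sigma_a, sigma_b) == M[u, v].
```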

Remark C.6. Factorizing a Boolean matrix with Boolean arithmetic is equivalent to the bipartite clique cover problem. It was proved by [CIK17] that the time complexity lower bound for the exact version of this problem is $2^{2^{\Omega(r)}}$. Since the approximation error is $\epsilon m^2$, when $\epsilon < \frac{1}{m^2}$ the output of our algorithm is the exact solution. Further, if we do not have the row-sparsity condition, i.e., $k = r$, then the time complexity becomes $2^{O(m^2(\log m)\cdot r^2\log r)}$. In the realm of parameterized complexity (see for example [CFK+15]), due to the kernelization in [FMPS09] we may assume $m \le 2^r$, and the running time of our algorithm is $2^{\widetilde{O}(2^{2r}\cdot r^3)}$, which matches the lower bound for this problem.
