arXiv:2102.01570v1 [cs.LG] 2 Feb 2021

Symmetric Boolean Factor Analysis with Applications to InstaHide

Sitan Chen∗    Zhao Song†    Runzhou Tao‡    Ruizhe Zhang§

Abstract

In this work we examine the security of InstaHide, a recently proposed scheme for distributed learning [HSLA20]. A number of recent works have given reconstruction attacks for InstaHide in various regimes [CLSZ21, CDG+20, Car20, HST+20] by leveraging an intriguing connection to the following matrix factorization problem: given the Gram matrix of a collection of $m$ random $k$-sparse Boolean vectors in $\{0,1\}^r$, recover the vectors (up to the trivial symmetries). Equivalently, this can be thought of as a sparse, symmetric variant of the well-studied problem of Boolean factor analysis, or as an average-case version of the classic problem of recovering a $k$-uniform hypergraph from its line graph.

As previous algorithms either required $m$ to be exponentially large in $k$ [CLSZ21] or only applied to $k = 2$ [CDG+20, HST+20], they left open the question of whether InstaHide possesses some form of "fine-grained security" against reconstruction attacks for moderately large $k$. In this work, we answer this in the negative by giving a simple $O(m^{\omega+1})$-time algorithm for the above matrix factorization problem. Our algorithm, based on tensor decomposition, only requires $m = \widetilde{\Omega}(r)$. We complement this result with a quasipolynomial-time algorithm for a worst-case setting of the problem where the collection of $k$-sparse vectors is chosen arbitrarily.

∗[email protected]. MIT. This work was supported in part by NSF CAREER Award CCF-1453261, NSF Large CCF-1565235 and Ankur Moitra's ONR Young Investigator Award.
†[email protected]. Princeton University.
‡[email protected]. Columbia University.
§[email protected]. The University of Texas at Austin.

1 Introduction

We study the following symmetric variant of the well-known nonnegative matrix factorization (NMF) problem [AGKM12, Moi13, RSW16, SWZ17, SWZ19]. Given a symmetric, nonnegative $m \times m$ matrix $M$ and a rank parameter $r$, we are interested in the optimization problem:

$$\min_{W \in \mathcal{S}} \|M - WW^\top\| \tag{1}$$

where $\mathcal{S}$ is some family of nonnegative $m \times r$ matrices and $\|\cdot\|$ denotes some matrix norm. When $\mathcal{S}$ consists of all nonnegative $m \times r$ matrices, this is the problem of symmetric NMF, which is closely related to kernel $k$-means [DHS05] and has been studied in a variety of contexts like graph clustering, topic modeling, and computer vision [ZS05, YHD+12, YGL+13, CRDH08, KG12, HXZ+11, WLW+11, KDP12].

We consider the setting where $\mathcal{S}$ consists of matrices with $\{0,1\}$-valued entries, in which case (1) is the symmetric version of the well-studied problem of binary matrix factorization (BMF) [ZLDZ07, CIK17, BBB+19, FGL+19, KPRW19], which is connected to a diverse array of problems like LDPC codes [RPG16], optimizing passive OLED displays [KPRW19], and graph partitioning [CIK17]. For instance, if we equip $\{0,1\}$ with the structure of the Boolean semiring¹ (in which case BMF is sometimes called Boolean factor analysis) and $M$ is the adjacency matrix of an undirected graph $G$, then (1) is equivalent to finding the best possible covering of $G$ by $r$ cliques.
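To make the clique-cover interpretation concrete, here is a small numpy sketch (our own illustration; the four-vertex graph and its two-clique cover are invented for this example). Each column of $W$ is the indicator vector of a clique, and the Boolean-semiring product $WW^\top$ reproduces the adjacency matrix off the diagonal exactly when the cliques cover every edge and introduce no non-edges:

```python
import numpy as np

# Triangle {0,1,2} plus the edge {2,3}: coverable by two cliques.
# Column 0 of W indicates clique {0,1,2}; column 1 indicates {2,3}.
W = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])

# Boolean semiring product: (W W^T)_{ij} = OR_l (W_{il} AND W_{jl}).
M = (W @ W.T) > 0

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]], dtype=bool)   # adjacency matrix of the graph

off_diag = ~np.eye(4, dtype=bool)
assert (M[off_diag] == A[off_diag]).all()  # perfect clique cover with r = 2
```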

Boolean Factor Analysis and Distributed Learning. Our motivation for studying Boolean factor analysis comes from an intriguing connection, recently identified by [CLSZ21, CDG+20], to designing attacks on a proposed scheme for private distributed learning called InstaHide [HSLA20]². At a high level (see Section 5 for details), InstaHide is a procedure for taking a private dataset of size $r$ and generating a synthetic dataset of size $m$ via the following procedure. Fix a sparsity parameter $k$, and for every $i = 1, \ldots, m$: 1) sample a random subset $S$ of $\{1, \ldots, r\}$ of size $k$, 2) define $v$ to be the average of the $k$ private feature vectors indexed by $S$, 3) add to the synthetic dataset the entrywise absolute value of $v$.

The security of InstaHide has recently been placed under some scrutiny [Jag20, CLSZ21, CDG+20, Car20, HST+20], and a number of works have given attacks for reconstructing private feature vectors out of synthetic ones generated by InstaHide. [CLSZ21] gave a provable attack running in $\mathrm{poly}(m)$ time when the private dataset is Gaussian but required $m$ to scale exponentially in $k$,³ while [CDG+20] gave a heuristic attack on real image data for $k = 2$. These works left open the following question, which is the main focus of the present work:

Question 1.1. Do there exist efficient reconstruction attacks on InstaHide for arbitrary $k$?

Interestingly, [CLSZ21, CDG+20] both essentially reduce the reconstruction problem to solving the following instantiation of the optimization problem (1):

$$\min_{W \in \mathcal{S}_{m,r,k}} \|M - WW^\top\|_0 \quad \text{for} \quad \mathcal{S}_{m,r,k} = \big\{W \in \{0,1\}^{m\times r} : \|W_j\|_0 = k \ \ \forall\, j \in [m]\big\}, \tag{2}$$

where $\|\cdot\|_0$ denotes the number of nonzero entries.

In the reduction, the input $M$ to the optimization problem (2) is the similarity matrix whose $(i,j)$-th entry is 1 if synthetic feature vectors $i$ and $j$ have at least one constituent private feature

¹In the Boolean semiring, addition is given by logical OR, and multiplication is given by logical AND.
²In a follow-up work, this was generalized to the text domain [HSC+20].
³In the special case of $k = 2$, the bound in [CLSZ21] was later refined by [HST+20].

vector in common, and 0 otherwise (see Definition 5.4). In practice, it is very reasonable to assume one can extract $M$ from the synthetic dataset, e.g. by training an appropriate neural network [CDG+20], and when the private dataset is Gaussian, there is a provable algorithm for doing so [CLSZ21].

The key point is that this matrix admits an exact factorization $M = WW^\top$ over the Boolean semiring, where $W \in \{0,1\}^{m\times r}$ is the matrix whose $(i,\ell)$-th coordinate is 1 if synthetic feature vector $i$ contains private feature vector $\ell$, and is 0 otherwise (see Fact 5.5). In particular, because every synthetic feature vector has $k$ constituent private feature vectors, $W \in \mathcal{S}_{m,r,k}$. [CLSZ21] used a rather involved combinatorial argument to "partially" factorize $M$, while [CDG+20] gave a heuristic algorithm for completely factorizing $M$ when $k = 2$ by solving a min-cost max-flow problem (see Section 1.1 for details).

Our first result is to give a polynomial-time algorithm for (2) when $M$ is the similarity matrix for a synthetic dataset generated by InstaHide:

Theorem 1.2. Fix any integer $2 \le k \le r$, failure probability $\delta \in (0,1)$, and suppose $m \ge \widetilde{\Omega}(rk\log(1/\delta))$. Let $W \in \{0,1\}^{m\times r}$ be generated by the following random process: for every $i \in [m]$, the $i$-th row of $W$ is a uniformly random $k$-sparse binary vector. Define $M \triangleq WW^\top$, where the matrix multiplication is over the Boolean semiring. There is an algorithm which runs in $O(m^{\omega+1})$ time and, with probability $1-\delta$ over the randomness of $W$, outputs a matrix $\widehat{W} \in \{0,1\}^{m\times r}$ whose columns are a permutation of those of $W$.⁴

Note that not only is the minimum $m$ for which Theorem 1.2 holds a fixed polynomial in $r, k$ (rather than $r^{\Omega(k)}$), but in fact the dependence on $r$ is near optimal (Remark 1.5)! As a consequence, we show a synthetic dataset generated by InstaHide is vulnerable to efficient reconstruction attacks as soon as its size is comparable to that of the private dataset from which it was generated, answering Question 1.1 in the affirmative:

Theorem 1.3 (Informal, see Theorem 5.1). Fix any integer $k \ge 2$, failure probability $\delta \in (0,1)$, and suppose $m \ge \widetilde{\Omega}(rk\log(d/\delta))$. Given a synthetic dataset of size $m$ generated by InstaHide, together with its similarity matrix, there is an $O(m^{\omega+1} + d\cdot r\cdot m)$-time algorithm for approximately recovering the magnitudes of the "heavy" coordinates of every feature vector in the private dataset. Here, a coordinate of a feature vector is "heavy" if its magnitude is $\Omega(k)$ times the average value of any private image in that coordinate.

One can interpret Theorem 1.2 as handling a realizable, average-case version of (2): it assumes the input $M$ admits an exact factorization and furthermore was generated by a random process. One can also ask how to handle worst-case instances of (2), where $M$ can be arbitrary and in particular need not admit an exact factorization, and indeed we complement the results above with a quasipolynomial-time algorithm for (2) in this setting:

Theorem 1.4 (Worst-case guarantee, see Theorem C.3 for formal statement). Given a symmetric matrix $M \in \mathbb{Z}^{m\times m}$, rank parameter $r \le m$, and accuracy parameter $\epsilon \in (0,1]$, there is an algorithm that runs in $m^{O(\epsilon^{-1}k^2\log r)}$ time and outputs $\widehat{W} \in \mathcal{S}_{m,r,k}$ such that

$$\|M - \widehat{W}\widehat{W}^\top\|_0 \le \min_{W\in\mathcal{S}_{m,r,k}} \|M - WW^\top\|_0 + \epsilon m^2,$$

where the matrix multiplication can be over $\mathbb{R}$ or over the Boolean semiring.

⁴$\omega \approx 2.373$ is the exponent of matrix multiplication.

1.1 Related Work

Symmetric NMF. The most standard setting of NMF is $\min_{U,V\ge 0}\|M - UV^\top\|$, where $U, V$ range over all possible $m\times r$ matrices with nonnegative entries. This has been the subject of a significant body of theoretical and applied work, and we refer to the survey [Gil12] for a comprehensive overview of this literature.

Symmetric NMF has received comparatively less attention but is nevertheless a popular clustering technique [HXZ+11, KDP12, DHS05] where, similar to our application to InstaHide, one takes the input matrix $M$ to be some similarity matrix for a collection of data points and interprets the factorization $W$ as specifying a soft clustering of these points into $r$ groups, where $W_{i,\ell}$ is the "probability" that point $i$ lies in cluster $\ell$. Like NMF [Vav10], the symmetric variant of NMF that we study is also hard in the worst case. Indeed, matrices which admit an exact factorization $WW^\top$ for a nonnegative matrix $W$ are called completely positive, and even determining whether a matrix is completely positive is known to be NP-hard [DG14]. While there exist efficient provable algorithms for asymmetric NMF under certain separability assumptions [AGKM12, Moi13], the bulk of the work on symmetric NMF has been focused on designing iterative solvers for converging to a stationary point [HXZ+11, KDP12, VGL+16, LHW17, ZLLL18]; we refer to the recent work of [DBd19] for one such result and an exposition of this line of work.

Binary Matrix Factorization. Most works on NMF where the factors are further constrained to have $\{0,1\}$-valued entries focus on the asymmetric setting. Over the reals, this is directly related to the bipartite clique partition problem [Orl77, CHHK14, CIK17]. [KPRW19] gave the first constant-factor approximation algorithm for this problem that runs in time $2^{O(r^2\log r)}\cdot\mathrm{poly}(m)$. Our Theorem 1.4 also extends to this asymmetric setting (see Theorem C.3); see Remark C.4 for a comparison.

On the other hand, over the Boolean semiring, this problem is directly related to the bipartite clique cover problem. Also called Boolean factor analysis, the discrete basis problem, or minimal noise role mining, it has received substantial attention in the context of topic modeling, database tiling, association rule mining, etc. [SBM03, ŠH06, BV10, MMG+08, VAG07, LVAH12, MSVA16]. The best algorithm in this case, due to [FGL+19], runs in time $2^{2^{O(r)}}\cdot m^2$, matching the lower bound of [CIK17]. However, our problem requires that each row of $W$ is $k$-sparse, which corresponds to the constraint that each vertex belongs to $k$ bicliques in the bipartite clique cover problem. [EU18] considered the decision version of this problem, whose goal is to decide the minimum clique cover size $r$, and proved that it is NP-hard if $k \ge 5$. A more general version of this problem was studied by [AGSS12] with applications to community detection. Our algorithm also applies to this setting (see Corollary 6.4). In particular, even without the sparsity condition, the running time still matches the lower bound (see Remark C.6). More generally, we refer to the comprehensive survey of [MN20] for other results on BMF. As for symmetric BMF, [ZWA13] studied this in the context of community detection with overlapping communities.

Line Graphs of Hypergraphs. In combinatorics, there is a large body of work on recovering hypergraphs from their line graphs, which is equivalent to (2) when the input matrix $M$ admits an exact factorization. Indeed, we can regard any $W \in \mathcal{S}_{m,r,k}$ as the incidence matrix of a $k$-uniform hypergraph $H$ with $m$ hyperedges and $r$ vertices, so if we work over the Boolean semiring, $M \triangleq WW^\top$ is exactly the adjacency matrix for the line graph of $H$.

When $k = 2$, Whitney's isomorphism theorem [Whi92] characterizes which graphs are uniquely identified by their line graphs; in our notation, this theorem characterizes when one can uniquely

identify $W$ (up to permutation) from $M = WW^\top$. In this case, [Rou73, Leh74, Sys82, DS95, LTVM15] have given efficient algorithms for reconstruction, i.e. recovering $W$ from $M$. Indeed, the attack on InstaHide of [HST+20] used Whitney's theorem to show that $m = O(r\log r)$ synthetic images suffice for reconstruction.

Unfortunately, for $k > 2$, no analogue of Whitney's theorem exists [Lov77]. In fact, even detection, i.e. determining whether a given graph is the line graph of some $k$-uniform hypergraph, or equivalently whether the objective value of (2) is zero for an input $M$, is NP-complete [LT93, PRT81]. A number of results for reconstructing a hypergraph $H$ from its line graph are known under additional assumptions on $H$ [JKL97, MT97, SST05]. In the language of line graphs, Theorem 1.2 says that whereas even detection is NP-complete for worst-case hypergraphs, reconstruction is tractable for random ones.

Remark 1.5. The connection to line graphs makes clear why m must be at least Ω(r log r) for unique recovery of W (up to permutation) from M = WW⊤ to be possible, even for k = 2. In this case, M is simply the adjacency matrix for the line graph of a random (multi)graph G given by sampling m edges with replacement. It is well-known that such a graph is w.h.p. not connected if m = o(r log r) [ER60], so by Whitney’s theorem there will be multiple non-isomorphic graphs for which M is the adjacency matrix of their line graph.

2 Preliminaries

We use $\mathbb{F}_q$ to denote the finite field of order $q$ and sometimes use $\mathbb{F}$ to denote $\mathbb{F}_2$. For any nonnegative function $f$, we use $\widetilde{O}(f)$ to denote $f\cdot\mathrm{poly}(\log f)$ and $\widetilde{\Omega}(f)$ to denote $f/\mathrm{poly}(\log f)$.

For a vector $x \in \mathbb{R}^n$, we use $\mathrm{supp}(x)$ to denote its set of nonzero indices, $\|x\|_0$ to denote its number of nonzero entries, and $\|x\|_p$ to denote its $\ell_p$ norm. We use $\vec{1}_d$ to denote the all-ones vector in $\mathbb{R}^d$ and suppress the subscript when the context is clear.

For a matrix $A$, we use $A^+$ to denote the pseudo-inverse of $A$. We use $\|A\|_F$ to denote the Frobenius norm of $A$, $\|A\|_1 = \sum_{i,j}|A_{i,j}|$ to denote its entrywise $\ell_1$ norm, $\|A\|_0$ to denote its number of nonzero entries, and $\|A\|_\infty$ to denote $\max_{i,j}|A_{i,j}|$.

Given $r, k$, we let $\binom{[r]}{k}$ denote the collection of all subsets of $[r]$ of size $k$.

Fact 2.1 (Chernoff bound). For any $0 < \delta < 1$, given i.i.d. draws $X_1,\ldots,X_m$ of a Bernoulli random variable with mean $p$, with probability at least $1-\delta$ the empirical mean $\widehat{p} = \frac{1}{m}\sum_{i=1}^m X_i$ satisfies $|\widehat{p}-p| \le O\big(\sqrt{p\log(1/\delta)/m} + \log(1/\delta)/m\big)$.

3 Technical Overview

In this section we overview the main technical ingredients for our results.

3.1 Average-Case Guarantee

The main idea in the proof of Theorem 1.2 is the following thought experiment. If instead of getting access to the matrix $M = \sum_{i=1}^r W_i^{\otimes 2}$ over the Boolean semiring, suppose we had the tensor $T = \sum_{i=1}^r W_i^{\otimes 3}$ over $\mathbb{Z}$. Provided the columns $W_i$ of $W$ are linearly independent over $\mathbb{R}$, then we can run a standard tensor decomposition algorithm to recover $W_1,\ldots,W_r$ up to permutation. As

such, there are two technical steps: 1) forming the tensor $T$ given only $M$, and 2) showing linear independence of the $W_i$.

For 1), note that for any $a, b, c \in [m]$, $T_{a,b,c}$ is the number of coordinates in $[r]$ that the supports of $W_a, W_b, W_c$ all have in common. Denote these supports by $S_a, S_b, S_c$. We observe that there is a simple monotonically decreasing function $\mu : \mathbb{Z} \to [0,1]$ (see (4)) mapping $|S_a\cup S_b\cup S_c|$ to the probability that a random size-$k$ subset of $[r]$ does not intersect any of $S_a, S_b, S_c$. This latter probability can be estimated by computing the fraction of columns of $M$ which are simultaneously zero in rows $a, b, c$, so provided the number of columns $m$ is sufficiently large, we can estimate this probability and invert along $\mu$ to exactly recover $|S_a\cup S_b\cup S_c|$. We can recover $|S_a\cup S_b|$, $|S_a\cup S_c|$, and $|S_b\cup S_c|$ in a similar fashion, from which we obtain $T_{a,b,c} = |S_a\cap S_b\cap S_c|$ (Lemma 4.2).

Part 2), formally stated in Lemma A.5, is the most technically involved component of our proof of Theorem 1.2. Note that by a simple net argument, when $m = \widetilde{\Omega}(r^2)$ one can show that $W$ is not only full rank, but polynomially well-conditioned. But as our emphasis is on having the sample complexity $m$ depend near-optimally on the number of private feature vectors $r$, we need to be much more careful. We note that similar problems have been studied before in discrete probability (see Section A.4).

Showing Lemma A.5 for odd $k$ turns out to be more straightforward. In this case, to show linear independence of the $W_i$'s over $\mathbb{R}$, we first observe that it suffices to show linear independence over $\mathbb{F}_2$ (Lemma A.4). For a given $u \in \mathbb{F}_2^r$, one can explicitly compute the probability that $Wu = 0$, and by giving fine enough estimates for these probabilities (Proposition A.7) and taking a union bound, we can complete the proof of Lemma A.5.

This proof strategy breaks down for even $k$ because in this case $W$ is not full-rank over $\mathbb{F}_2$: the columns of $W$ add up to zero over $\mathbb{F}_2$. Instead, we build on ideas from [FKS20], which studies the square matrix version of this problem and upper bounds the probability that there exists some $x$ for which $Wx = 0$ and for which the most frequent entry in $x$ occurs a fixed number of times. Intuitively, if the most frequent entry does not occur too many times, then the probability that $Wx = 0$ is very small. Otherwise, even if the probability is large, there are "few" such vectors. Unfortunately, [FKS20] requires $k \ge \Omega(\log r)$, and in order to handle the practically relevant regime of $k = O(1)$, we need to adapt their techniques and exploit the fact that $W$ is a tall matrix in our setting.

3.2 Implication for InstaHide

For real-world image data [CDG+20], the similarity matrix $M$ is generated by training a neural network to classify whether two given synthetic images generated by InstaHide have some constituent private image in common. For Gaussian data [CLSZ21], one can provably construct $M$ via covariance estimation of a certain folded normal distribution. For these reasons, it is quite reasonable to assume that the attacker has access to $M$, so we do not belabor this point further.

As mentioned in the introduction, the key subroutine in the attacks of [CLSZ21, CDG+20] is to factorize the similarity matrix $M$. Given this factorization, it remains to specify how to efficiently recover parts of the private dataset. At a high level, this amounts to solving a piecewise linear system given by the equality $|Wp_j| = z_j$ for every feature index $j \in [d]$, where $p_j \in \mathbb{R}^r$ (resp. $z_j \in \mathbb{R}^m$) is the unknown (resp. known) vector whose $i$-th entry is the $j$-th feature of private (resp. synthetic) feature vector $i$. We elaborate in Section 5 on how [CLSZ21, CDG+20] do this.

A takeaway from [CDG+20] is that empirically, given $W$, it is quite easy to solve this system exactly by alternating minimization. While it remains open how to do so provably in the regime of Theorem 1.2, we exhibit a simple estimator (Lemma 5.6) for obtaining a multiplicative approximation to the "heavy" coordinates of every private feature vector. At a high level, we show w.h.p. that for any $j$, the vector

$$\frac{1}{m}\sum_{i=1}^m (w_i - \lambda\cdot\vec{1})\cdot (z_j)_i^2$$

3.3 Worst-Case Guarantee To obtain Theorem 1.4, our quasi-polynomial time algorithm for the worst-case setting of (2), we construct an appropriate 2-CSP instance out of M and apply an existing approximation algorithm for dense Max 2-CSP (Theorem 6.2) proposed by [DM18]. Concretely, given M = WW⊤ + , E where W is a minimizer in (2), we construct a 2-CSP instance over a complete graph where the vertices correspond to the rows of W, the alphabet consists of all k-sparse Boolean vectors of length r, and the constraint on any edge checks whether the inner product of the vectors assigned to the two vertices is equal to that entry in M (Theorem 6.3). This type of reduction turns out to be quite flexible, and we are similarly able to obtain guarantees for the asymmetric analogue of (2) (Theorem C.3), as well as for the case where matrix multiplication is defined over the Boolean semiring instead of the reals (Corollary 6.4).

Roadmap. In Section 4, we sketch our proof of Theorem 1.2, deferring the proof details to Appendix A. In Section 6, we sketch our proof of Theorem 1.4, deferring the proof details to Appendix C. In Section 5 we show how to leverage Theorem 1.2 to give our reconstruction attack on InstaHide (Theorem 1.3).

4 Average-Case Complexity Upper Bound

In this section we prove Theorem 1.2. Our algorithm is given by TensorRecover below.

Algorithm 1: TensorRecover(M)
Input: $M \in \{0,1\}^{m\times m}$ s.t. $M = WW^\top$ over the Boolean semiring for some $W \in \mathcal{S}_{m,r,k}$
Output: Matrix $\widehat{W} \in \mathcal{S}_{m,r,k}$ which is equal to $W$ up to column permutation
/* Form the tensor T */
1 for $(a,b,c) \in [m]\times[m]\times[m]$ do
2   Let $\mu_{abc}$ be the fraction of $\ell \in [m]$ for which $M_{a,\ell} = M_{b,\ell} = M_{c,\ell}$ are all zero. Define $\mu_{ab}, \mu_{ac}, \mu_{bc}$ analogously. // Can compute with fast matrix multiplication (Lemma 4.2)
3   Let $t_{abc}$ be the nonnegative integer $t$ for which $\mu_t$ (see (4)) is closest to $\mu_{abc}$. Define $t_{ab}, t_{ac}, t_{bc}$ analogously.
4   $T_{a,b,c} \leftarrow t_{abc} - t_{ab} - t_{ac} - t_{bc} + 3k$.
/* Run Jennrich's on T */
5 Randomly sample unit vectors $v_1, v_2 \in \mathbb{S}^{m-1}$.
6 $M_1 \leftarrow T(\mathrm{Id},\mathrm{Id},v_1)$, $M_2 \leftarrow T(\mathrm{Id},\mathrm{Id},v_2)$.
7 Let $\widetilde{w}_1,\ldots,\widetilde{w}_r \in \mathbb{R}^m$ denote the left eigenvectors outside the kernel of $M_1M_2^+$.
8 Round $\{\widetilde{w}_i\}$ to Boolean vectors $\{\widehat{w}_i\}$; let $\widehat{W} \in \{0,1\}^{m\times r}$ consist of $\{\widehat{w}_i\}$.
9 return $\widehat{W}$.

4.1 Forming the Tensor

The key insight is that although the similarity matrix $M$ only gives access to "second-order" information about correlations between the entries of $W$, we can bootstrap third-order information in

the form of

$$T \triangleq \sum_{i=1}^r W_i^{\otimes 3}, \tag{3}$$

where $W_i$ is the $i$-th column of $W$ and the tensor is defined over $\mathbb{R}$ rather than the Boolean semiring.

Given sets $S_1,\ldots,S_c \subset [r]$, let $\mu(S_1,\ldots,S_c)$ denote the probability that a randomly chosen subset $T$ of $[r]$ of size $k$ does not intersect any of them. It is easy to see that

$$\mu(S_1,\ldots,S_c) = \binom{r - |S_1\cup\cdots\cup S_c|}{k}\Big/\binom{r}{k} \triangleq \mu_{|S_1\cup\cdots\cup S_c|}. \tag{4}$$

Fact 4.1. There exist absolute constants $C, C' > 0$ for which the following holds. If $r \ge C\cdot k^2$, then for any $0 \le t \le 3k$, we have that $\mu_t \ge \mu_{t+1} + C'k/r$ and $\mu_t \ge 1 - O(tk/r)$.

We defer the proof to Section A.1. This will allow us to extract $T$ from $M$. The algorithm outlined in the proof of the following lemma is given formally in Lines 1 to 6 of TensorRecover; a Python sketch of these steps is shown below.
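The following sketch (our own illustrative implementation; the function names are ours) carries out Lines 1 to 6 in the naive $O(m^4)$ way: estimate the zero-pattern frequencies, snap each to the nearest $\mu_t$, and combine via inclusion-exclusion as in Line 4 of TensorRecover. Lemma 4.2 below explains how fast matrix multiplication improves this to $O(m^{\omega+1})$.

```python
import numpy as np
from math import comb

def mu(t, r, k):
    # mu_t from eq. (4): probability a random k-subset of [r] avoids a fixed t-set.
    return comb(r - t, k) / comb(r, k)

def invert_mu(freq, r, k):
    # Snap an empirical frequency to the nearest mu_t for t = 0, ..., 3k;
    # Fact 4.1 guarantees consecutive mu_t are separated by Omega(k/r).
    return min(range(3 * k + 1), key=lambda t: abs(mu(t, r, k) - freq))

def form_tensor(M, r, k):
    # Naive O(m^4) version of Lines 1-6 of TensorRecover.
    m = M.shape[0]
    Z = (M == 0)                     # Z[a, l] = 1 iff M_{a,l} = 0

    def union_size(rows):
        # |S_a ∪ S_b ∪ ...| from the fraction of columns simultaneously zero
        # in the given rows, inverted along t -> mu_t.
        return invert_mu(Z[list(rows)].all(axis=0).mean(), r, k)

    T = np.zeros((m, m, m))
    for a in range(m):
        for b in range(m):
            for c in range(m):
                T[a, b, c] = (union_size([a, b, c]) - union_size([a, b])
                              - union_size([a, c]) - union_size([b, c]) + 3 * k)
    return T
```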

Lemma 4.2. If $m \ge \widetilde{\Omega}(rk\log(1/\delta))$, then with probability at least $1-\delta$ over the randomness of $M$, there is an algorithm for computing $T_{a,b,c}$ for any $a,b,c\in[m]$ in time $O(m^{\omega+1})$.

To show this, note that for any $a, b, c \in [m]$, if entries $a, b, c$ correspond to subsets $S_1, S_2, S_3$ of $[r]$, then

$$T_{a,b,c} = |S_1\cap S_2\cap S_3| = |S_1\cup S_2\cup S_3| - |S_1\cup S_2| - |S_2\cup S_3| - |S_1\cup S_3| + 3k. \tag{5}$$

We can compute each of the terms on the right-hand side by estimating the corresponding probabilities $\mu(\cdot)$ to within error $O(k/r)$ and inverting along the univariate function $t \mapsto \mu_t$. As for the runtime, we can compute each slice of $T$ by setting up an appropriate matrix multiplication. We defer a full proof of Lemma 4.2 to Section A.2.

It suffices to apply the following standard guarantee for (noiseless) tensor decomposition:

Lemma 4.3. Given a collection of linearly independent vectors $w_1,\ldots,w_r \in \mathbb{R}^m$, there is an algorithm that takes any tensor $T = \sum_{i=1}^r w_i^{\otimes 3}$, runs in time $O(m^\omega)$, and outputs a list of vectors $\widehat{w}_1,\ldots,\widehat{w}_r$ for which there exists a permutation $\pi$ satisfying $\widehat{w}_i = w_{\pi(i)}$ for all $i\in[r]$.

For completeness, we provide a short proof of this well-known fact in Section A.3.

It remains to show that the columns of $W$ in the context of Theorem 1.2 are indeed linearly independent with high probability. This will occupy the bulk of our analysis. We emphasize that because we wish to obtain $m$ with near-optimal dependence on $r$, the argument will be quite involved. Formally, we show:

Theorem 4.4 (Linear independence of $W$). Let $W \in \{0,1\}^{m\times r}$ be a random matrix whose rows are i.i.d. random vectors, each following a uniform distribution over $\{0,1\}^r$ with exactly $k$ ones. For constant $k \ge 1$ and $m = \Omega((r/k)\log r)$, the $r$ columns of $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

We defer the details of Theorem 4.4 to Section A.4, as well as a discussion of previous works which have studied similar questions. Altogether, this allows us to conclude Theorem 1.2, and we defer a formal proof of this to Section A.5.
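As a quick empirical sanity check on Theorem 4.4 (a simulation sketch, not used in the proof; the constant 4 in the choice of $m$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 50, 3
m = int(4 * (r / k) * np.log(r))          # m = Omega((r/k) log r)

# Each row of W is a uniformly random k-sparse binary vector in {0,1}^r.
W = np.zeros((m, r))
for i in range(m):
    W[i, rng.choice(r, size=k, replace=False)] = 1

print(np.linalg.matrix_rank(W))           # r, with high probability
```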

5 Connection to InstaHide

In this section we review the details of InstaHide and the known connection to the optimization problem (2). In lieu of "feature vectors," we will refer to the elements of the private/synthetic datasets as "images" and their individual entries as "pixels." The main result of this section is to show the following:

Theorem 5.1. For any absolute constant $\eta > 0$, there is an absolute constant $c > 0$ for which the following holds. Fix any integer $k \ge 2$, failure probability $\delta \in (0,1)$, and suppose $m \ge \widetilde{\Omega}(rk\log(d/\delta))$. Given a synthetic dataset of size $m$ generated by InstaHide from an image matrix $X$, together with its similarity matrix, there is an $O(m^{\omega+1} + d\cdot r\cdot m)$-time algorithm which outputs an image matrix $\widehat{X}$ such that for any $(i,j) \in [r]\times[d]$ satisfying $|X_{i,j}| \ge (ck/r)\sum_{i'\in[r]}|X_{i',j}|$, we have that $|\widehat{X}_{i,j}| = |X_{i,j}|\cdot(1\pm\eta)$.

That is, we give a reconstruction attack that approximately recovers the magnitudes of the "heavy" pixels of every image in the private dataset, where a pixel of some image $i$ is "heavy" if its value is roughly $k$ times larger in absolute value than the average pixel value in that location across all private images.

5.1 InstaHide Details

We first recall the process by which InstaHide generates synthetic images from private ones.

Definition 5.2 (Image matrix notation). Let the private image matrix $X \in \mathbb{R}^{d\times r}$ be a matrix whose columns consist of vectors $x_1,\ldots,x_r$ corresponding to $r$ images, each with $d$ pixels taking values in $\mathbb{R}$. It will also be convenient to refer to the rows of $X$ as $p_1,\ldots,p_d \in \mathbb{R}^r$.

Definition 5.3 (Synthetic images). Given a private image matrix $X$ and a subset $S \subset [r]$ of size $k$, the corresponding synthetic image is the vector $y^{X,S} \triangleq |\sum_{i\in S} x_i|$, where $|\cdot|$ denotes entrywise absolute value. We will refer to the images $\{x_i\}_{i\in S}$ as the constituent private images of $y^{X,S}$. To generate a synthetic dataset of size $m$ via InstaHide, one independently samples subsets $S_1,\ldots,S_m \subset [r]$ of size $k$ and outputs $\{y^{X,S_1},\ldots,y^{X,S_m}\}$.

The main finding of [HSLA20] is that given a private dataset, one can train on the synthetic dataset generated by InstaHide and still achieve good test accuracy on the private dataset. The hope in that work was that taking the entrywise absolute value of $\sum_{i\in S}x_i$ makes it difficult to recover properties of the private images.

5.2 Matrix Factorization and InstaHide

As observed by [CLSZ21, CDG+20], in many settings one can compute the matrix $M' \in \mathbb{R}^{m\times m}$ whose $(i,j)$-th entry is $|S_i\cap S_j|$. Indeed, when the entries of the private image matrix $X$ are independent Gaussians, [CLSZ21] gave a provable algorithm for extracting $M'$ from the synthetic dataset. For real-world image datasets, when $k = 2$, in which case the entries of $M'$ are $\{0,1\}$-valued, [CDG+20] gave a heuristic algorithm for computing $M'$ by training a neural network to classify pairs of synthetic images based on whether they have a constituent private image in common.

For real-world image datasets and general $k$, while it is not necessarily so easy to recover $M'$ itself with the latter strategy, as the entries of $M'$ take values in $\{0,\ldots,k\}$ in general, it is nevertheless very plausible that one could still train a neural network to classify pairs of synthetic images as sharing at least one constituent private image or not. In other words, it is quite reasonable to assume access to the following matrix.

Definition 5.4 (Similarity Matrix). Given a synthetic dataset $\{y^{X,S_1},\ldots,y^{X,S_m}\}$ generated by InstaHide, let $M \in \{0,1\}^{m\times m}$ denote the matrix for which $M_{i,j} = \mathbb{1}[S_i\cap S_j \ne \emptyset]$ for all $i,j \in [m]$. Also define the selection matrix $W \in \{0,1\}^{m\times r}$ to be the matrix whose $i$-th row is the indicator vector for the subset $S_i$. Note that $W$ is distributed as a matrix whose rows are independent random $k$-sparse Boolean vectors.

The key property that relates InstaHide to problem (2) is the following basic fact:

Fact 5.5. Over the Boolean semiring, the similarity matrix M satisfies M = WW⊤.
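A minimal simulation of Definitions 5.2–5.4 that verifies Fact 5.5 on random data (our own sketch; in an actual attack $M$ would be obtained as described above, not from the subsets themselves, and the "images" here are just Gaussian vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, k, m = 8, 10, 3, 40

X = rng.standard_normal((d, r))                       # private images (Def. 5.2)
subsets = [rng.choice(r, size=k, replace=False) for _ in range(m)]

# Synthetic images y^{X,S} = |sum_{i in S} x_i| (Definition 5.3).
Y = np.stack([np.abs(X[:, S].sum(axis=1)) for S in subsets], axis=1)

# Selection matrix W and similarity matrix M (Definition 5.4).
W = np.zeros((m, r), dtype=int)
for i, S in enumerate(subsets):
    W[i, S] = 1
M = np.array([[int(bool(set(Si) & set(Sj))) for Sj in subsets]
              for Si in subsets])

# Fact 5.5: over the Boolean semiring, M = W W^T.
assert (M == ((W @ W.T) > 0).astype(int)).all()
```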

Reconstructing $X$ From a Factorization. Suppose one can successfully construct the selection matrix $W$ (up to column permutation) given the similarity matrix $M$. By definition of the synthetic dataset, we have that for any pixel index $j \in [d]$, if $z_j \in \mathbb{R}^m$ is the vector whose $i$-th entry is the $j$-th coordinate of $y^{X,S_i}$, then

$$|Wp_j| = z_j. \tag{6}$$

At this point it is certainly information-theoretically possible to recover the private dataset (up to unavoidable ambiguities in sign) by brute-forcing all $2^m$ possible sign patterns.

As for computationally efficient algorithms, note that (6) can be thought of as an instance of phase retrieval in a non-standard setting where the sensing matrix $W$ consists of $k$-sparse binary measurements. [HST+20] showed that if $W \in \mathcal{S}_{m,r,k}$ is chosen adversarially instead of randomly, then this problem is computationally hard in general, even if $k = 2$.

When $W$ is random as in Definition 5.4, the system (6) appears to be easy to solve in practice. [CDG+20] demonstrated empirically in this case that when $X$ is given by real-world image data, then one can efficiently solve (6) by alternating minimization. In terms of provable algorithms, in [CLSZ21], the authors showed that if $m \ge r^{\Omega(k)}$, then with high probability over the choice of $W$ there exists a submatrix of $W$ of size $\binom{k+2}{k}$ consisting of indicator vectors for all size-$k$ subsets of a particular set $T \subset [r]$ of size $k+2$, and by brute-forcing over sign patterns for that particular collection of constraints in the piecewise-linear system (6), one can provably recover the coordinates of $p_j$ indexed by $T$ for any $j$. In other words, one can exactly recover the $k+2$ images in the private dataset that are indexed by $T$.

Here we give a different approach that allows us to recover the "heavy" coordinates of $p_j$ for any pixel index $j \in [d]$.

Algorithm 2: GetHeavyPixels(M)
Input: Similarity matrix $M$ for synthetic images generated from image matrix $X$
Output: Image matrix $\widehat{X}$ which approximates the heavy entries of $X$ (see Theorem 5.1)
1 $\widehat{W} \leftarrow$ TensorRecover($M$).
2 for $j \in [d]$ do
3   Let $z \in \mathbb{R}^m$ have $i$-th entry equal to the $j$-th coordinate of synthetic image $i$.
4   Form the vector $p' \triangleq \frac{1}{m}\sum_{i=1}^m \big(\widehat{w}_i - \frac{k-1}{r-2}\cdot\vec{1}\big)\, z_i^2$, where $\widehat{w}_i$ is the $i$-th row of $\widehat{W}$.
5   Set the $j$-th row of $\widehat{X}$ to be $p'\cdot\frac{r(r-1)}{k(r-2k+1)}$.
6 return $\widehat{X}$.

Lemma 5.6. For any absolute constant $\eta > 0$, there is an absolute constant $c > 0$ for which the following holds as long as $m \ge \Omega(\log(d/\delta))$. There is an algorithm GetHeavyPixels that takes

as input $W$ defined according to Definition 5.4 and a vector $z$ satisfying $|Wp| = z$ for some vector $p \in \mathbb{R}^r$, runs in time $O(r\cdot m)$, and outputs $\widetilde{p}$ such that for every $i \in [r]$ for which $|p_i| \ge (ck/r)\cdot\|p\|_1$, we have that $\widetilde{p}_i = |p_i|\cdot(1\pm\eta)$.

We defer the proof of this to Section B.1.
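For intuition, the estimator computed in Lines 4–5 of GetHeavyPixels is a few lines of numpy. The sketch below (our own; it assumes $\widehat{W}$ has already been recovered) processes a single pixel index $j$:

```python
import numpy as np

def heavy_pixel_estimate(W_hat, z, r, k):
    # Line 4: correlate the centered rows of W-hat with the squared
    # synthetic pixel values z_i^2.
    centered = W_hat - (k - 1) / (r - 2)          # w_i - (k-1)/(r-2) * all-ones
    p_prime = (centered * (z ** 2)[:, None]).mean(axis=0)
    # Line 5: rescale. By Lemma 5.6, entries with |p_i| >= (ck/r) * ||p||_1
    # come out as |p_i| (1 +/- eta) once m is large enough.
    return p_prime * r * (r - 1) / (k * (r - 2 * k + 1))
```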

6 Worst-case Algorithm for Matrix Factorization

In this section, we give a quasi-polynomial time algorithm for (2) in the worst case. We reduce the problem to solving a constraint satisfaction problem (CSP), and then use the 2-CSP solver developed by [DM18] to find the solution. We first give the definition of a 2-CSP.

Definition 6.1 (Max 2-CSP). A 2-CSP problem is defined by a tuple $(\Sigma, V, E, \mathcal{C})$. $\Sigma$ is an alphabet set of size $q$, $V$ is a variable set of size $n$, and $E \subseteq V\times V$ is the constraint set. $V$ and $E$ define an underlying graph of the 2-CSP instance, and $\mathcal{C} = \{C_e\}_{e\in E}$ describes the constraints. For each $e \in E$, $C_e$ is a function $\Sigma\times\Sigma \to \{0,1\}$. The goal is to find an assignment $\sigma : V \to \Sigma$ with maximal value, defined to be the number of satisfied edges $e = (u,v) \in E$ (i.e., for which $C_e(\sigma(u),\sigma(v)) = 1$).

We will use the following known algorithm for solving "dense" 2-CSP instances:

Theorem 6.2 ([DM18]). Define the density $\delta$ of a 2-CSP instance to be $\delta \triangleq |E|/\binom{|V|}{2}$. For any $0 < \epsilon \le 1$, there is an approximation algorithm that, given any $\delta$-dense 2-CSP instance with optimal value OPT, runs in time $(nq)^{O(\epsilon^{-1}\cdot\delta^{-1}\cdot\log q)}$ and outputs an assignment with value OPT $-\,\epsilon|E|$, where $n = |V|$ and $q = |\Sigma|$.

The main theorem of this section is as follows:

Theorem 6.3. Given $m, k, r \ge 0$ and a symmetric $m\times m$ matrix $M$ as the input of an instance of the sparse Boolean matrix factorization problem, let OPT be the optimal value of the problem, i.e., OPT $:= \min_W \|M - WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$. For any accuracy $\epsilon \in (0,1)$, there is an algorithm running in time

$$m^{O(\epsilon^{-1}k\log r)}\, r^{O(\epsilon^{-1}k^2\log r)} \tag{7}$$

which finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M - \widehat{W}\widehat{W}^\top\|_0 \le$ OPT $+\,\epsilon m^2$.

The proof is deferred to Section C. As a corollary, we also get an algorithm for (2) over Boolean arithmetic.

Corollary 6.4. Given $m, k, r \ge 0$ and a symmetric Boolean $m\times m$ matrix $M$, let OPT be the optimal value of the problem, i.e., OPT $:= \min_W \|M - WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$ and the matrix multiplication is over the Boolean semiring, i.e., $a + b$ is $a\vee b$ and $a\cdot b$ is $a\wedge b$. For any accuracy parameter $\epsilon \in (0,1)$, there exists an algorithm that runs in $m^{O(\epsilon^{-1}k\log r)}\, r^{O(\epsilon^{-1}k^2\log r)}$ time and finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M - \widehat{W}\widehat{W}^\top\|_0 \le$ OPT $+\,\epsilon m^2$.

We defer its proof to Section C, where we also give algorithms for the asymmetric setting where one wishes to solve $\min_{U,V}\|M - UV^\top\|_0$.
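To make the reduction behind Theorem 6.3 concrete, here is a sketch (our own illustration) of the 2-CSP instance constructed from $M$. The alphabet has size $\binom{r}{k} \le r^k$, so $\log q = O(k\log r)$, which is what drives the exponents in the running time above:

```python
import numpy as np
from itertools import combinations

def build_csp(M, r, k):
    """Dense 2-CSP instance from the reduction sketched in Section 3.3."""
    # Alphabet: all k-sparse Boolean vectors of length r (size binom(r, k)).
    alphabet = []
    for S in combinations(range(r), k):
        v = np.zeros(r, dtype=int)
        v[list(S)] = 1
        alphabet.append(v)
    m = M.shape[0]
    # One variable per row of W; the underlying graph is complete.
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]

    # Edge (u, v) is satisfied by letters (a, b) iff the inner product of the
    # assigned k-sparse vectors equals M_{u,v}.
    def constraint(u, v, a, b):
        return int(alphabet[a] @ alphabet[b] == M[u, v])

    return alphabet, edges, constraint
```

Feeding this fully dense instance ($\delta = 1$, $q \le r^k$) to the solver of Theorem 6.2 yields an assignment, i.e. a row $k$-sparse $\widehat{W}$, whose number of violated constraints exceeds the optimum by at most $\epsilon\binom{m}{2} \le \epsilon m^2$.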

10 References

[AGKM12] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 145–162. https://arxiv.org/pdf/1111.0952.pdf, 2012.

[AGSS12] Sanjeev Arora, Rong Ge, Sushant Sachdeva, and Grant Schoenebeck. Finding overlapping communities in social networks: toward a rigorous approach. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 37–54, 2012.

[AHP20] Elad Aigner-Horev and Yury Person. On sparse random combinatorial matrices. arXiv preprint arXiv:2010.07648, 2020.

[BBB+19] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P Woodruff. A PTAS for ℓp-low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 747–766. SIAM, 2019.

[BV10] Radim Belohlavek and Vilem Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences, 76(1):3–20, 2010.

[Car20] Nicholas Carlini. InstaHide disappointingly wins Bell Labs Prize, 2nd place, Dec 2020.

[CDG+20] Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, and Florian Tramer. An attack on instahide: Is private learning possible with instance encoding? arXiv preprint arXiv:2011.05315, 2020.

[CFK+15] Marek Cygan, Fedor V Fomin, Łukasz Kowalik, Daniel Lokshtanov, Dániel Marx, Marcin Pilipczuk, Michał Pilipczuk, and Saket Saurabh. Parameterized algorithms, volume 5. Springer, 2015.

[CHHK14] Parinya Chalermsook, Sandy Heydrich, Eugenia Holm, and Andreas Karrenbauer. Nearly tight approximability results for minimum biclique cover and partition. In European Symposium on Algorithms (ESA), pages 235–246. Springer, 2014.

[CIK17] Sunil Chandran, Davis Issac, and Andreas Karrenbauer. On the parameterized complexity of biclique cover and partition. In 11th International Symposium on Parameterized and Exact Computation (IPEC). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[CLSZ21] Sitan Chen, Xiaoxiao Li, Zhao Song, and Danyang Zhuo. On instahide, phase retrieval, and sparse matrix factorization. In ICLR. arXiv preprint arXiv:2011.11181, 2021.

[CRDH08] Yanhua Chen, Manjeet Rege, Ming Dong, and Jing Hua. Non-negative matrix factorization for semi-supervised data clustering. Knowledge and Information Systems, 17(3):355–379, 2008.

[CV08] Kevin P Costello and Van H Vu. The rank of random graphs. Random Structures & Algorithms, 33(3):269–285, 2008.

[CV10] Kevin P Costello and Van Vu. On the rank of random sparse matrices. Combinatorics, Probability and Computing, 19(3):321–342, 2010.

[DBd19] Radu-Alexandru Dragomir, Jérôme Bolte, and Alexandre d'Aspremont. Fast gradient methods for symmetric nonnegative matrix factorization. arXiv preprint arXiv:1901.10791, 2019.

[DG14] Peter JC Dickinson and Luuk Gijben. On the computational complexity of membership problems for the completely positive cone and its dual. Computational optimization and applications, 57(2):403–415, 2014.

[DHS05] Chris Ding, Xiaofeng He, and Horst D Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the SIAM International Conference on Data Mining (ICDM), pages 606–610. SIAM, 2005.

[DM18] Irit Dinur and Pasin Manurangsi. ETH-hardness of approximating 2-CSPs and directed Steiner network. arXiv preprint arXiv:1805.03867, 2018.

[DS95] Daniele Giorgio Degiorgi and Klaus Simon. A dynamic algorithm for line graph recognition. In International Workshop on Graph-Theoretic Concepts in Computer Science, pages 37–48. Springer, 1995.

[ER60] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

[EU18] Alessandro Epasto and Eli Upfal. Efficient approximation for restricted biclique cover problems. Algorithms, 11(6):84, 2018.

[FGL+19] Fedor V Fomin, Petr A Golovach, Daniel Lokshtanov, Fahad Panolan, and Saket Saurabh. Approximation schemes for low-rank binary matrix approximation problems. ACM Transactions on Algorithms (TALG), 16(1):1–39, 2019.

[FJLS20] Asaf Ferber, Vishesh Jain, Kyle Luh, and Wojciech Samotij. On the counting problem in inverse Littlewood–Offord theory. Journal of the London Mathematical Society, 2020.

[FKS20] Asaf Ferber, Matthew Kwan, and Lisa Sauermann. Singularity of sparse random matrices: simple proofs. arXiv preprint arXiv:2011.01291, 2020.

[FMPS09] Herbert Fleischner, Egbert Mujuni, Daniël Paulusma, and Stefan Szeider. Covering graphs with few complete bipartite subgraphs. Theoretical Computer Science, 410(21- 23):2045–2053, 2009.

[Gil12] Nicolas Gillis. Sparse and unique nonnegative matrix factorization through data preprocessing. The Journal of Machine Learning Research, 13(1):3349–3386, 2012.

[Har70] Richard A Harshman. Foundations of the parafac procedure: Models and conditions for an "explanatory" multimodal factor analysis. 1970.

[HSC+20] Yangsibo Huang, Zhao Song, Danqi Chen, Kai Li, and Sanjeev Arora. Texthide: Tackling data privacy in language understanding tasks. In EMNLP. arXiv preprint arXiv:2010.06053, 2020.

[HSLA20] Yangsibo Huang, Zhao Song, Kai Li, and Sanjeev Arora. Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning (ICML), pages 4507–4518, 2020.

[HST+20] Baihe Huang, Zhao Song, Runzhou Tao, Ruizhe Zhang, and Danyang Zhuo. Instahide's sample complexity when mixing two private images. arXiv preprint arXiv:2011.11877, 2020.

[Hua18] Jiaoyang Huang. Invertibility of adjacency matrices for random d-regular graphs. arXiv preprint arXiv:1807.06465, 2018.

[HXZ+11] Zhaoshui He, Shengli Xie, Rafal Zdunek, Guoxu Zhou, and Andrzej Cichocki. Symmetric nonnegative matrix factorization: Algorithms and applications to probabilistic clustering. IEEE Transactions on Neural Networks, 22(12):2117–2131, 2011.

[Jag20] Matthew Jagielski. InstaHide security experiment, 2020. https://colab.research.google.com/drive/1ONVjStz2m3BdKCE16axVHZ00hcwdivH2?usp=sharing.

[Jai19] Vishesh Jain. Approximate Spielman-Teng theorems for the least singular value of random combinatorial matrices. arXiv preprint arXiv:1904.10592, 2019.

[JKL97] Michael S Jacobson, André E Kézdy, and Jenő Lehel. Recognizing intersection graphs of linear uniform hypergraphs. Graphs and Combinatorics, 13(4):359–367, 1997.

[KDP12] Da Kuang, Chris Ding, and Haesun Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the SIAM International Conference on Data Mining (ICDM), pages 106–117. SIAM, 2012.

[KG12] Vassilis Kalofolias and Efstratios Gallopoulos. Computing symmetric nonnegative rank factorizations. Linear Algebra and its Applications, 436(2):421–435, 2012.

[KPRW19] Ravi Kumar, Rina Panigrahy, Ali Rahimi, and David Woodruff. Faster algorithms for binary matrix factorization. In International Conference on Machine Learning (ICML), pages 3551–3559, 2019.

[Leh74] Philippe GH Lehot. An optimal algorithm to detect a line graph and output its root graph. Journal of the ACM (JACM), 21(4):569–575, 1974.

[LHW17] Songtao Lu, Mingyi Hong, and Zhengdao Wang. A nonconvex splitting method for symmetric nonnegative matrix factorization: Convergence analysis and optimality. IEEE Transactions on Signal Processing, 65(12):3120–3135, 2017.

[Lov77] L Lovász. Problem, Beitrag zur Graphentheorie und deren Anwendungen, vorgetragen auf dem Intern. Koll., 1977.

[LRA93] Sue E Leurgans, Robert T Ross, and Rebecca B Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.

[LT93] AG Levin and Regina Iosifovna Tyshkevich. Edge hypergraphs. Diskretnaya Matematika, 5(1):112–129, 1993.

[LTVM15] Dajie Liu, Stojan Trajanovski, and Piet Van Mieghem. ILIGRA: an efficient inverse line graph algorithm. Journal of Mathematical Modelling and Algorithms in Operations Research, 14(1):13–33, 2015.

[LVAH12] Haibing Lu, Jaideep Vaidya, Vijayalakshmi Atluri, and Yuan Hong. Constraint-aware role mining via extended boolean matrix decomposition. IEEE Transactions on Dependable and Secure Computing, 9(5):655–669, 2012.

[MMG+08] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering (TKDE), 20(10):1348–1362, 2008.

[MN20] Pauli Miettinen and Stefan Neumann. Recent developments in boolean matrix factorization. arXiv preprint arXiv:2012.03127, 2020.

[Moi13] Ankur Moitra. An almost optimal algorithm for computing nonnegative rank. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1454–1464. SIAM, 2013.

[MSVA16] Barsha Mitra, Shamik Sural, Jaideep Vaidya, and Vijayalakshmi Atluri. A survey of role mining. ACM Computing Surveys (CSUR), 48(4):1–37, 2016.

[MT97] Yury Metelsky and Regina Tyshkevich. On line graphs of linear 3-uniform hypergraphs. Journal of , 25(4):243–251, 1997.

[Ngu13] Hoi H Nguyen. On the singularity of random combinatorial matrices. SIAM Journal on Discrete Mathematics, 27(1):447–458, 2013.

[Orl77] James Orlin. Contentment in graph theory: Covering graphs with cliques. In Indagationes Mathematicae (Proceedings), volume 80, pages 406–424. Elsevier, 1977.

[Pol19] Yury Polyanskiy. Hypercontractivity of spherical averages in hamming space. SIAM Journal on Discrete Mathematics, 33(2):731–754, 2019.

[PRT81] Svatopluk Poljak, Vojtěch Rödl, and Daniel Turzik. Complexity of representation of graphs by set systems. Discrete Applied Mathematics, 3(4):301–312, 1981.

[Rou73] Nicholas D Roussopoulos. A max{m, n} algorithm for determining the graph H from its line graph G. Information Processing Letters, 2(4):108–112, 1973.

[RPG16] Siamak Ravanbakhsh, Barnabás Póczos, and Russell Greiner. Boolean matrix factorization and noisy completion via message passing. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 945–954, 2016.

[RSW16] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 250–263, 2016.

[RV08] Mark Rudelson and Roman Vershynin. The Littlewood–Offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.

[SBM03] Jouni K Seppänen, Ella Bingham, and Heikki Mannila. A simple algorithm for topic identification in 0–1 data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 423–434. Springer, 2003.

[ŠH06] Tomáš Šingliar and Miloš Hauskrecht. Noisy-or component analysis and its application to link analysis. Journal of Machine Learning Research (JMLR), 7(Oct):2189–2213, 2006.

[SST05] PV Skums, SV Suzdal, and RI Tyshkevich. Edge intersection graphs of linear 3-uniform hypergraphs. Electronic Notes in Discrete Mathematics, 22:33–40, 2005.

[SWZ17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 688–701, 2017.

[SWZ19] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2772–2789. SIAM, 2019.

[Sys82] Maciej M Syslo. A labeling algorithm to recognize a line digraph and output its root graph. Information Processing Letters, 15(1):28–30, 1982.

[VAG07] Jaideep Vaidya, Vijayalakshmi Atluri, and Qi Guo. The role mining problem: finding a minimal descriptive set of roles. In Proceedings of the 12th ACM symposium on Access control models and technologies, pages 175–184, 2007.

[Vav10] Stephen A Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2010.

[VGL+16] Arnaud Vandaele, Nicolas Gillis, Qi Lei, Kai Zhong, and Inderjit Dhillon. Efficient and non-convex coordinate descent for symmetric nonnegative matrix factorization. IEEE Transactions on Signal Processing, 64(21):5571–5584, 2016.

[Vu08] Van Vu. Random discrete matrices. In Horizons of combinatorics, pages 257–280. Springer, 2008.

[Whi92] Hassler Whitney. Congruent graphs and the connectivity of graphs. In Hassler Whitney Collected Papers, pages 61–79. Springer, 1992.

[WLW+11] Fei Wang, Tao Li, Xin Wang, Shenghuo Zhu, and Chris Ding. Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3):493–521, 2011.

[YGL+13] Xiaohui Yan, Jiafeng Guo, Shenghua Liu, Xueqi Cheng, and Yanfeng Wang. Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In Proceedings of the SIAM International Conference on Data Mining (ICDM), pages 749–757. SIAM, 2013.

[YHD+12] Zhirong Yang, Tele Hao, Onur Dikmen, Xi Chen, and Erkki Oja. Clustering by nonnegative matrix factorization using graph random walk. Advances in Neural Information Processing Systems (NeurIPS), 25:1079–1087, 2012.

[ZLDZ07] Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. Binary matrix factorization with applications. In Seventh IEEE international conference on data mining (ICDM), pages 391–400. IEEE, 2007.

[ZLLL18] Zhihui Zhu, Xiao Li, Kai Liu, and Qiuwei Li. Dropping symmetry for fast symmetric nonnegative matrix factorization. Advances in Neural Information Processing Systems (NeurIPS), 31:5154–5164, 2018.

[ZS05] Ron Zass and Amnon Shashua. A unifying approach to hard and probabilistic clustering. In Tenth IEEE International Conference on Computer Vision (ICCV), pages 294–301. IEEE, 2005.

[ZWA13] Zhong-Yuan Zhang, Yong Wang, and Yong-Yeol Ahn. Overlapping community detection in complex networks using symmetric binary matrix factorization. Physical Review E, 87(6):062803, 2013.

A Deferred Proofs from Section 4

A.1 Proof of Fact 4.1

Fact A.1. There exist absolute constants $C, C' > 0$ for which the following holds. If $r \ge C\cdot k^2$, then for any $0 \le t \le 3k$, we have that $\mu_t \ge \mu_{t+1} + C'k/r$ and $\mu_t \ge 1 - O(tk/r)$.

Proof. For any $0 \le t \le r$, note that

$$\binom{r-t}{k}\Big/\binom{r}{k} - \binom{r-t-1}{k}\Big/\binom{r}{k} = \frac{r-t-1}{r}\cdot\frac{r-t-2}{r-1}\cdots\frac{r-t-k+1}{r-k+2}\cdot\Big(\frac{r-t}{r-k+1} - \frac{r-t-k}{r-k+1}\Big)$$
$$\ge \Big(1 - \frac{t+1}{r-k+2}\Big)^{k-1}\cdot\frac{k}{r-k+1}$$
$$\ge \Big(1 - \frac{(t+1)(k-1)}{r-k+2}\Big)\cdot\frac{k}{r-k+1} \ \ge\ \Omega(k/r),$$

where in the last step we used the assumptions that $r \ge \Omega(k^2)$ and $t \le O(k)$. For the second part of the claim, we similarly have that

$$\binom{r-t}{k}\Big/\binom{r}{k} = \frac{r-t}{r}\cdot\frac{r-t-1}{r-1}\cdots\frac{r-t-k+1}{r-k+1} \ \ge\ \Big(1 - \frac{t}{r-k+1}\Big)^k \ \ge\ 1 - \frac{tk}{r-k+1} \ \ge\ 1 - O(tk/r). \qquad \square$$
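As a quick numerical illustration of Fact 4.1 (our own check, not part of the proof; the constants 0.2 and 2 are ad hoc):

```python
from math import comb

r, k = 400, 5                      # r >= C * k^2 for a moderate constant C
mu = [comb(r - t, k) / comb(r, k) for t in range(3 * k + 2)]

# Consecutive gaps are on the order of k/r, and mu_t stays 1 - O(tk/r).
assert all(mu[t] - mu[t + 1] > 0.2 * k / r for t in range(3 * k + 1))
assert all(mu[t] >= 1 - 2 * t * k / r for t in range(3 * k + 1))
```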

A.2 Proof of Lemma 4.2

Lemma A.2. If $m \ge \widetilde{\Omega}(rk\log(1/\delta))$, then with probability at least $1-\delta$ over the randomness of $M$, there is an algorithm for computing $T_{a,b,c}$ for any $a,b,c\in[m]$ in time $O(m^{\omega+1})$.

Proof. If entries $a, b, c$ correspond to subsets $S_1, S_2, S_3$ of $[r]$, then

$$T_{a,b,c} = |S_1\cap S_2\cap S_3| = |S_1\cup S_2\cup S_3| - |S_1\cup S_2| - |S_2\cup S_3| - |S_1\cup S_3| + 3k. \tag{8}$$

By Fact 2.1 and the second part of Fact 4.1 applied to $t = |S_i\cup S_j|$ or $t = |S_1\cup S_2\cup S_3|$, we conclude that any $\mu_{|S_i\cup S_j|}$ or $\mu_{|S_1\cup S_2\cup S_3|}$ can be estimated to error $C'k/2r$ with probability $1 - \delta/m^3$ provided

that $m \ge \Omega\big((t^2 r/k)\log(m^3/\delta)\big)$. Note that $t \le 3k$, so this holds by the bound on $m$ in the hypothesis of the lemma. By the first part of Fact 4.1, provided these quantities can be estimated within the desired accuracy, we can exactly recover every $|S_i\cup S_j|$ as well as $|S_1\cup S_2\cup S_3|$. So by (8) and a union bound over all $(a,b,c)$, with probability at least $1-\delta$ we can recover every entry of $T$.

It remains to show that $T$ can be computed in the claimed time. Note that naively, each entry would require $O(m)$ time to compute, leading to an $O(m^4)$ runtime. We now show how to do this more efficiently with fast matrix multiplication. Fix any $a\in[m]$ and consider the $a$-th slice of $T$. Recall that for every $b,c\in[m]$, we would like to compute the number of columns $\ell\in[m]$ for which $M_{a,\ell}, M_{b,\ell}, M_{c,\ell}$ are all zero (note that we can compute the other relevant statistics, like the number of $\ell$ for which $M_{a,\ell}$ and $M_{b,\ell}$ are both zero, in total time $O(m^3)$ across all $a, b$ even naively). We can first restrict our attention to the set $L_a$ of $\ell$ for which $M_{a,\ell} = 0$, which can be computed in time $O(m)$. Let $\overline{M}_a$ denote the matrix given by restricting $M$ to the columns indexed by $L_a$ and subtracting every resulting entry from 1. By design, $(\overline{M}_a\overline{M}_a^\top)_{b,c}$ is equal to the number of $\ell\in L_a$ for which $M_{b,\ell} = M_{c,\ell} = 0$, and the matrix $\overline{M}_a\overline{M}_a^\top$ can be computed in time $O(m^\omega)$. The claimed runtime follows. $\square$

A.3 Proof of Lemma 4.3

Lemma A.3. Given a collection of linearly independent vectors $w_1,\ldots,w_r \in \mathbb{R}^m$, there is an algorithm that takes any tensor $T = \sum_{i=1}^r w_i^{\otimes 3}$, runs in time $O(m^\omega)$, and outputs a list of vectors $\widehat{w}_1,\ldots,\widehat{w}_r$ for which there exists a permutation $\pi$ satisfying $\widehat{w}_i = w_{\pi(i)}$ for all $i\in[r]$.

Proof. The algorithm is simply to run Jennrich's algorithm ([Har70, LRA93]), but we include a proof for completeness. Pick random $v_1, v_2 \in \mathbb{S}^{m-1}$ and define $M_1 \triangleq T(\mathrm{Id},\mathrm{Id},v_1)$ and $M_2 \triangleq T(\mathrm{Id},\mathrm{Id},v_2)$. If $W \in \mathbb{R}^{m\times r}$ is the matrix whose columns consist of $w_1,\ldots,w_r$, then we can write $M_a = \sum_{i=1}^r \langle w_i, v_a\rangle\, w_iw_i^\top = WD_aW^\top$, where $D_a \triangleq \mathrm{diag}(\langle w_1,v_a\rangle,\ldots,\langle w_r,v_a\rangle)$. As a result, $M_1M_2^+ = WD_1D_2^+W^+$. This gives an eigendecomposition of $M_1M_2^+$ because the entries of $D_1D_2^+$ are distinct almost surely and because the $\{w_i\}$ are linearly independent. We conclude that the non-trivial eigenvectors of $M_1M_2^+$ are precisely the vectors $\{w_i\}$ up to permutation, as claimed. Forming $M_1$ and $M_2$ takes $O(m^3)$ time, and forming $M_1M_2^+$ and computing its eigenvectors takes $O(m^\omega)$ time. $\square$
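For intuition, Jennrich's algorithm as used in this proof is only a few lines of numpy (a simulation sketch; the test tensor is random and the Boolean rounding step of TensorRecover is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 30, 5
W = rng.standard_normal((m, r))              # linearly independent columns w.h.p.
T = np.einsum('ia,ja,ka->ijk', W, W, W)      # T = sum_i w_i tensor-cubed

v1, v2 = rng.standard_normal(m), rng.standard_normal(m)
M1 = np.einsum('ijk,k->ij', T, v1)           # T(Id, Id, v1) = W D1 W^T
M2 = np.einsum('ijk,k->ij', T, v2)           # T(Id, Id, v2) = W D2 W^T

# Eigenvectors of M1 M2^+ with nonzero eigenvalue recover the columns of W
# up to permutation and scaling.
eigvals, eigvecs = np.linalg.eig(M1 @ np.linalg.pinv(M2))
top = np.argsort(-np.abs(eigvals))[:r]
U = np.real(eigvecs[:, top])

U /= np.linalg.norm(U, axis=0)
Wn = W / np.linalg.norm(W, axis=0)
# Each row/column of |U^T Wn| has a single entry ~1 (the matched column).
print(np.round(np.abs(U.T @ Wn), 2))
```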

A.4 Proof of Theorem 4.4

In this section we prove Theorem 4.4, restated here for convenience:

Theorem 4.4 (Linear independence of $W$). Let $W \in \{0,1\}^{m\times r}$ be a random matrix whose rows are i.i.d. random vectors, each following a uniform distribution over $\{0,1\}^r$ with exactly $k$ ones. For constant $k\ge 1$ and $m = \Omega((r/k)\log r)$, the $r$ columns of $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

Related Results. This kind of problem has been well-studied in random matrix theory when each entry of $W$ is an i.i.d. Bernoulli random variable of value $\pm 1$ (see for example [CV10, RV08, CV08]). However, in our setting, there exists some dependence within a row since we require that the row sum equals $k$. Most of the previous work for this model (for example [Ngu13, FJLS20, Jai19, AHP20, FKS20]) studied the case when $k$ is large (i.e., when $k = \Omega(\log r)$), and there was a long-standing conjecture in [Vu08] that the singularity probability of the adjacency matrix of a random $k$-regular graph is $o(1)$ when $3 \le k < r$. A recent work [Hua18] fully resolved this conjecture using a local central limit theorem and large deviation estimates. Note that our model

is a little different, since $W$ corresponds to the adjacency matrix of a bipartite graph which is only left-regular. Here, we give a more elementary proof, drawing in part on ideas from [FKS20], to show that for "tall" matrices, the column-singularity probability is $r^{-O(1)}$ for any constant $k \ge 1$.

Our Proof. We treat the cases of odd and even $k$ separately. For the former, we will show that the columns of $W$ are linearly independent over $\mathbb{F}_2$ with high probability, which by the following is sufficient for linear independence over $\mathbb{R}$:

Lemma A.4 (Reduction from $\mathbb{F}_2$ to $\mathbb{R}$). Let $W \in \{0,1\}^{m\times r}$. If the columns of $W$ are linearly independent over $\mathbb{F}_2$, then they are also linearly independent over $\mathbb{R}$.

Proof. We prove the contrapositive statement, i.e., if the columns are linearly dependent over $\mathbb{R}$, then they are also dependent over $\mathbb{F}_2$.

Let $w_1,\ldots,w_r$ be the columns of $W$. If they are linearly dependent over $\mathbb{R}$, then there exist $c_1,\ldots,c_r \in \mathbb{R}$, not all zero, such that $\sum_{i=1}^r c_iw_i = 0$. By Gaussian elimination, it is easy to see that we may take $c_1,\ldots,c_r \in \mathbb{Q}$. By multiplying through by a common factor (and dividing out any common powers of two), we can get $r$ integers $c'_1,\ldots,c'_r \in \mathbb{Z}$ such that not all of them are even numbers and

$$c'_1w_1 + c'_2w_2 + \cdots + c'_rw_r = 0.$$

Then, applying $\pmod 2$ to both sides of the equation, for any $j\in[m]$ the $j$-th entry of the resulting vector is

$$\sum_{i=1}^r c'_i\,W_{ij} \pmod 2 \;=\; \bigoplus_{i=1}^r (c'_i \bmod 2)\cdot W_{ij} \;=\; 0, \tag{9}$$

where the first step follows from $W_{ij} \in \{0,1\}$.

Define a vector $a \in \mathbb{F}_2^r$ such that $a_i := c'_i \bmod 2$ for $i\in[r]$. Then $a \ne 0$, and Eq. (9) implies that $Wa = 0$ over $\mathbb{F}_2$, which means the columns of $W$ are linearly dependent over $\mathbb{F}_2$. The claim hence follows. $\square$
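For illustration, checking linear independence over $\mathbb{F}_2$ is a short Gaussian-elimination routine (our own hypothetical helper, not part of the paper's algorithm); by Lemma A.4, full $\mathbb{F}_2$-rank certifies independence over $\mathbb{R}$ as well:

```python
import numpy as np

def rank_gf2(A):
    # Gaussian elimination over F_2 on a copy of the 0/1 matrix A.
    A = (A.astype(int) % 2).copy()
    m, n = A.shape
    rank, row = 0, 0
    for col in range(n):
        pivots = np.nonzero(A[row:, col])[0]
        if pivots.size == 0:
            continue
        A[[row, row + pivots[0]]] = A[[row + pivots[0], row]]  # swap pivot up
        for i in range(m):
            if i != row and A[i, col]:
                A[i] ^= A[row]                                 # XOR-eliminate
        rank, row = rank + 1, row + 1
        if row == m:
            break
    return rank

# If rank_gf2(W) == r, the columns of W are independent over F_2, hence over R.
```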

We are now ready to handle the case of odd $k$. The following lemma proves the $\mathbb{F}_2$ case.

Lemma A.5 (Linear independence over $\mathbb{F}_2$). For $r \ge 0$ and odd constant $k$, if $m = \Omega((r/k)\log r) \ge r$, then with probability $1 - \frac{1}{\mathrm{poly}(r)}$, the columns of the matrix $W$ are linearly independent over $\mathbb{F}_2$.

Proof. To show that the columns of $W$ are linearly independent, equivalently, we can show that $\ker(W) = \emptyset$ with high probability; that is, for all $x \in \mathbb{F}_2^r\setminus\{0\}$, $Wx \ne 0$.

For $1 \le \lambda \le r$, let $P_\lambda := \Pr[Wu = 0]$ for $u \in \mathbb{F}_2^r$ with $|u| = \lambda$. Note that this probability is the same for all weight-$\lambda$ vectors. Then, we have

$$\Pr[\ker(W) \ne \emptyset] \ \le\ \sum_{\lambda=1}^r P_\lambda\cdot\big|\{u\in\mathbb{F}_2^r : |u| = \lambda\}\big| \ \le\ \sum_{\lambda=1}^r P_\lambda\cdot\binom{r}{\lambda}.$$

Fix $\lambda \in [r]$ and $u \in \mathbb{F}_2^r$ with weight $\lambda$. Since the rows of $W$ are independent, we have

$$P_\lambda = \Pr[Wu = 0] = \prod_{i=1}^m \Pr[\langle w_i, u\rangle = 0] = \big(\Pr[\langle w, u\rangle = 0]\big)^m,$$

where $w$ is a uniformly random vector in the set $\{v \in \mathbb{F}_2^r : |v| = k\}$. It is easy to see that $k$ should be an odd number; otherwise, $W\vec{1} = 0$. By Proposition A.7, we have

$$\Pr[\ker W \ne \emptyset] \le \sum_{\lambda=1}^{r/2}\Big(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\Big)^m\binom{r}{\lambda} + \Big(\frac12\Big)^m\sum_{\lambda=1}^{r/2}\binom{r}{\lambda}$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\Big(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\Big)^m\binom{r}{\lambda} \qquad \Big(\textstyle\sum_{i=0}^{r/2}\binom{r}{i} = 2^{r-1}\Big)$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\Big(\frac{1+k\lambda/r}{1+2k\lambda/r}\Big)^m\binom{r}{\lambda} \qquad \Big(\big(1-\tfrac{2k}{r}\big)^{\lambda} \le \tfrac{1}{1+2k\lambda/r}\Big)$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\exp\Big(-\frac{km\lambda/r}{1+2k\lambda/r}\Big)\binom{r}{\lambda} \qquad (1-x \le e^{-x})$$
$$\le 2^{r-1-m} + \sum_{\lambda=1}^{r/2}\exp\Big(\Big(-\frac{km/r}{1+2k\lambda/r} + \log r\Big)\lambda\Big) \qquad \Big(\binom{r}{\lambda} \le r^{\lambda}\Big)$$

When $m = \Omega((r/k)\log r)$, we have $\exp\big(-\frac{km/r}{1+2k\lambda/r} + \log r\big) \le \frac{1}{r^{O(1)}}$. Hence,

$$\Pr[\ker W \ne \emptyset] \le 2^{r-1-\Omega(r\log r)} + (r/2)\cdot\frac{1}{r^{O(1)}} = \frac{1}{r^{O(1)}},$$

and the lemma is then proved. $\square$

Combining Lemma A.4 and Lemma A.5, we immediately have the following corollary:

Corollary A.6 (Linear independence over $\mathbb{R}$ for odd $k$). For $r \ge 0$ and odd constant $k \ge 1$, when $m = \Omega((r/k)\log r) \ge r$, the columns of the matrix $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

In the proof of Lemma A.5 above, we used the following helper proposition:

Proposition A.7. For $1 \le \lambda \le \frac{r}{2}$, we have either
$$P_\lambda \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m} \quad\text{and}\quad P_{r-\lambda} \le \Big(\frac12\Big)^{m},$$
or
$$P_\lambda \le \Big(\frac12\Big)^{m} \quad\text{and}\quad P_{r-\lambda} \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m}.$$

Proof. We can write the probability that $\langle w, u\rangle = 0$ as follows:
$$\Pr[\langle w,u\rangle = 0] = \binom{r}{k}^{-1}\sum_{i=0,\ i\ \mathrm{even}}^{k-1} \binom{\lambda}{i}\binom{r-\lambda}{k-i}.$$
We first consider the following sum with alternating signs:
$$K^r_k(\lambda) := \sum_{i=0}^{k} \binom{\lambda}{i}\binom{r-\lambda}{k-i}(-1)^{i},$$
which is the binary Krawtchouk polynomial. Then, for $1 \le \lambda \le \frac{r}{2}$, we have
$$\Pr[\langle w,u\rangle = 0] = \binom{r}{k}^{-1}\sum_{i=0,\ i\ \mathrm{even}}^{k-1}\binom{\lambda}{i}\binom{r-\lambda}{k-i} = \frac12 + \frac{K^r_k(\lambda)}{2\binom{r}{k}}.$$
By symmetry, for $\frac{r}{2} < \lambda < r$,
$$\Pr[\langle w,u\rangle = 0] = \binom{r}{k}^{-1}\sum_{i=0,\ i\ \mathrm{even}}^{k-1}\binom{\lambda}{i}\binom{r-\lambda}{k-i} = \binom{r}{k}^{-1}\sum_{i=1,\ i\ \mathrm{odd}}^{k}\binom{r-\lambda}{i}\binom{\lambda}{k-i} = \frac12 - \frac{K^r_k(r-\lambda)}{2\binom{r}{k}}.$$
By Lemma A.8, we know that for $1 \le \lambda \le \frac{r}{2}$,
$$\big|K^r_k(\lambda)\big| \le \binom{r}{k}\Big(1 - \frac{2k}{r}\Big)^{\lambda}.$$
Therefore, if $K^r_k(\lambda) > 0$, then
$$P_\lambda \le \Pr[\langle w,u\rangle = 0]^m \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m}, \qquad P_{r-\lambda} \le \Big(\frac12\Big)^{m}.$$
If $K^r_k(\lambda) < 0$, then
$$P_{r-\lambda} \le \left(\frac12 + \frac12\Big(1-\frac{2k}{r}\Big)^{\lambda}\right)^{m}, \qquad P_\lambda \le \Big(\frac12\Big)^{m}.$$
The proposition hence follows.
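The identity $\Pr[\langle w,u\rangle=0]=\frac12+K^r_k(\lambda)/(2\binom{r}{k})$ is easy to verify by brute force for small parameters. The following sketch (our illustration, not from the paper) checks it by enumerating all weight-$k$ vectors $w$.

```python
from itertools import combinations
from math import comb

def krawtchouk(r, k, lam):
    """Binary Krawtchouk polynomial K^r_k(lambda)."""
    return sum((-1) ** i * comb(lam, i) * comb(r - lam, k - i)
               for i in range(k + 1))

def prob_even_overlap(r, k, lam):
    """Pr[<w,u> = 0 over F_2] for uniform weight-k w and fixed weight-lam u."""
    u = set(range(lam))  # wlog u is supported on the first lam coordinates
    good = sum(1 for S in combinations(range(r), k)
               if len(u & set(S)) % 2 == 0)
    return good / comb(r, k)

r, k, lam = 10, 3, 4
lhs = prob_even_overlap(r, k, lam)
rhs = 0.5 + krawtchouk(r, k, lam) / (2 * comb(r, k))
assert abs(lhs - rhs) < 1e-12
```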

In the proof of Proposition A.7, we needed the following estimate for the binary Krawtchouk polynomial:

Lemma A.8 ([Pol19]). For $k \le 0.16r$ and $\lambda \le \frac{r}{2}$, we have
$$\big|K^r_k(\lambda)\big| \le \binom{r}{k}\cdot\Big(1-\frac{2k}{r}\Big)^{\lambda}.$$
Next, we turn to the case of even $k$. When $k$ is even we cannot use Lemma A.4, because the columns of $W$ are not linearly independent over $\mathbb{F}_2$ (indeed, $W\mathbf{1} = 0$). Instead, we use a variant of the proof in [FKS20] to show that in this case, the linear independence of the columns of $W$ over $\mathbb{R}$ still holds with high probability:

Lemma A.9. For $r \ge 0$ and even constant $k > 0$, when $m = \Omega((r/k)\log r)$, the columns of the matrix $W$ are linearly independent over $\mathbb{R}$ with probability at least $1 - \frac{1}{\mathrm{poly}(r)}$.

Proof. Suppose the columns of $W$ are not linearly independent, i.e., there exists a nonzero $x \in \mathbb{Q}^r$ such that $Wx = 0$. Then we can multiply by an integer and divide by a power of two to obtain a vector $c \in \mathbb{Z}^r$ with at least one odd entry such that $\langle w_i, c\rangle = 0$ for all $i \in [m]$, where $w_i$ denotes the $i$-th row of $W$.

Case 1: $|\mathrm{supp}(c)| < r$. Note that the reduction (Lemma A.4) fails only when all of the entries of $c$ are odd, because $W\mathbf{1} = 0$ over $\mathbb{F}_2$ when $k$ is even. That is, when $|\mathrm{supp}(c)| < r$, by Lemma A.5 we know that the columns of $W$ are linearly dependent with probability at most $r^{-O(1)}$.

Case 2: $|\mathrm{supp}(c)| = r$. Following the proof in [FKS20], we define a fibre of a vector to be the set of all indices whose entries are equal to a particular value. Define the set
$$\mathcal{P} := \{c \in \mathbb{Z}^r : c \text{ has largest fibre of size at most } (1-\delta)r\},$$
where $\delta \in (0,1)$ is a constant to be chosen later. Then, writing $m = m_1 + m_2$ where $m_2 = \Theta(\log r)$, we have

$$\Pr\big[Wc = 0 \text{ with } |\mathrm{supp}(c)| = r\big] \le B_1 + B_2,$$
where

$$B_1 := \Pr\big[\exists\ \text{nonzero } c \notin \mathcal{P} \text{ such that } \langle w_i, c\rangle = 0\ \forall i \in [m_1]\big],$$
$$B_2 := \max_{c\in\mathcal{P}} \Pr\big[\langle w_i, c\rangle = 0\ \forall i \in [m_2]\big].$$
For $B_1$, let $q = k-1$. It suffices to prove that there is no non-constant vector $c \in \mathbb{Z}_q^r$ with largest fibre of size at least $(1-\delta)r$ such that $\langle w_i, c\rangle \equiv 0 \pmod q$ for all $i \in [m_1]$. Then, by Lemma A.10 with $t = O(r/k)$, the probability can be upper bounded by

$$\begin{aligned}
B_1 &\le \sum_{s=1}^{\delta r} \binom{r}{s}\, q^{s+1}\, (P_s)^{m_1} \\
&\le \sum_{s=1}^{t} 2^{s\log r + (s+1)\log q - O(sk/r)\cdot m_1} + \sum_{s=t+1}^{\delta r} 2^{s\log r + (s+1)\log q - \Omega(m_1)} \\
&\le \delta r\cdot 2^{-\Omega(\log r)} = r^{-O(1)},
\end{aligned}$$

if we take $m_1 = \Omega((r/k)\log r)$ and $\delta = \frac{3}{2k} < 1$.

For $B_2$, by Lemma A.11, we have

$$B_2 = \max_{c\in\mathcal{P}} \Pr\big[\langle w_i, c\rangle = 0\ \forall i\in[m_2]\big] \le O\Big(\sqrt{r/(\delta r k)}\Big)^{\Omega(\log r)} \le r^{-\Omega(1)}.$$
Hence, combining $B_1$ and $B_2$ together, we know that in Case 2,
$$\Pr\big[\exists c : Wc = 0 \ \wedge\ |\mathrm{supp}(c)| = r\big] \le r^{-O(1)}.$$
Therefore, by a union bound over Cases 1 and 2, we get
$$\Pr\big[\text{the columns of } W \text{ are linearly independent}\big] \ge 1 - r^{-O(1)},$$
which completes the proof of the lemma.

In the above proof of Lemma A.9, we used the following two results from [FKS20]:

Lemma A.10 ([FKS20]). Let $w \in \{0,1\}^r$ be a random vector with exactly $k$ ones, where $k$

Proof. By Lemma 4.2, as long as $m$ satisfies the bound in the hypothesis, with probability at least $1-\delta$ one can successfully form the tensor $T = \sum_{i=1}^r W_i^{\otimes 3}$ in time $O(m^{\omega+1})$. By Theorem 4.4, the columns of $W$ are linearly independent with high probability. By Lemma 4.3, one can therefore recover the columns of $W$ up to permutation.

B Deferred Proofs from Section 5

In this section we provide complete proofs that were deferred from the main body of the paper, as well as a discussion of how the setup in Section 5 differs from that of the original InstaHide paper [HSLA20].

B.1 Proof of Lemma 5.6

In this section we prove Lemma 5.6, restated here for convenience.

Lemma B.1. For any absolute constant $\eta > 0$, there is an absolute constant $c > 0$ for which the following holds as long as $m \ge \Omega(\log(d/\delta))$. There is an algorithm GetHeavyPixels that takes as input $W$ defined according to Definition 5.4 and a vector $z$ satisfying $Wp = z$ for some vector $p \in \mathbb{R}^r$, runs in time $O(r\cdot m)$, and outputs $\hat{p}$ such that for every $i \in [r]$ for which $|p_i| \ge (ck/r)\cdot\overline{p}$, we have that $\hat{p}_i = p_i\cdot(1\pm\eta)$.

We will need the following basic calculation. Henceforth, given a vector $p$, let $\overline{p}$ denote the sum of its entries.

Fact B.2. For any vector $p \in \mathbb{R}^r$,
$$\mathbb{E}_S\big[\langle e_S, p\rangle^2\big] = \frac{k(r-k)}{r(r-1)}\,\|p\|_2^2 + \frac{k(k-1)}{r(r-1)}\,\overline{p}^{\,2}, \tag{10}$$
where the expectation is over a random size-$k$ subset $S \subset [r]$.

Proof. Let $\xi_i$ denote the indicator for the event that $i \in S$, so that

$$\begin{aligned}
\mathbb{E}_S\big[\langle e_S,p\rangle^2\big] &= \mathbb{E}_S\Big[\sum_{i,j\in[r]}\xi_i\xi_j\, p_i p_j\Big] \\
&= \sum_{i\in[r]} p_i^2\cdot\mathbb{E}[\xi_i] + \sum_{i\ne j} p_i p_j\cdot\mathbb{E}[\xi_i\xi_j] \\
&= \Big(\frac{k}{r} - \frac{k(k-1)}{r(r-1)}\Big)\|p\|_2^2 + \frac{k(k-1)}{r(r-1)}\,\overline{p}^{\,2} \\
&= \frac{k(r-k)}{r(r-1)}\,\|p\|_2^2 + \frac{k(k-1)}{r(r-1)}\,\overline{p}^{\,2},
\end{aligned}$$
as claimed.
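Eq. (10) is also easy to check numerically; the following Monte Carlo sketch (our illustration, with arbitrary toy parameters) compares the empirical second moment against the closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
r, k, trials = 20, 5, 100_000

p = rng.standard_normal(r)
pbar, pnorm2 = p.sum(), float(p @ p)

# empirical E_S[<e_S, p>^2] over uniformly random size-k subsets S of [r]
emp = np.mean([p[rng.choice(r, size=k, replace=False)].sum() ** 2
               for _ in range(trials)])

closed = (k * (r - k) / (r * (r - 1)) * pnorm2
          + k * (k - 1) / (r * (r - 1)) * pbar ** 2)
print(emp, closed)  # should agree up to Monte Carlo error
```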

We are now ready to prove Lemma 5.6.

Proof of Lemma 5.6. Define the vector

$$\tilde{p} := \mathbb{E}_S\big[\langle e_S, p\rangle^2\cdot e_S\big], \tag{11}$$
where the expectation is over a random subset $S \subset [r]$ of size $k$, and $e_S \in \{0,1\}^r$ is the indicator vector of the subset $S$. The $i$-th entry of $\tilde{p}$ is given by

$$\tilde{p}_i = \binom{r}{k}^{-1}\cdot\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\big(\langle e_S, p\rangle + p_i\big)^2 = \binom{r}{k}^{-1}\cdot\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\big(p_i^2 + 2p_i\cdot p_S + p_S^2\big) \qquad \Big(p_S := \textstyle\sum_{j\in S} p_j\Big)$$
$$= \binom{r}{k}^{-1}\cdot\Bigg(p_i^2\cdot\binom{r-1}{k-1} + 2p_i\cdot\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S + \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S^2\Bigg).$$
For the second term, we have

$$\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S = \sum_{j\in[r]\setminus\{i\}} p_j\cdot\binom{r-2}{k-2}.$$

For the third term, we have

$$\begin{aligned}
\sum_{S\in\binom{[r]\setminus\{i\}}{k-1}} p_S^2 &= \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\ \sum_{j,\ell\in S} p_j p_\ell \\
&= \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\ \sum_{j\in S} p_j^2 \;+\; \sum_{S\in\binom{[r]\setminus\{i\}}{k-1}}\ \sum_{j\ne\ell\in S} p_j p_\ell \\
&= \sum_{j\in[r]\setminus\{i\}} p_j^2\cdot\binom{r-2}{k-2} + \sum_{\substack{j,\ell\in[r]\setminus\{i\}\\ j\ne\ell}} p_j p_\ell\cdot\binom{r-3}{k-3}.
\end{aligned}$$
Hence, the $i$-th entry of $\mathbb{E}_S[\langle e_S,p\rangle^2\cdot e_S]$ is
$$\mathbb{E}_S\big[e_S\cdot\langle e_S,p\rangle^2\big]_i = p_i^2\cdot\frac{k(r-2k+1)}{r(r-1)} + \|p\|_2^2\cdot\frac{k(k-1)(r-k)}{r(r-1)(r-2)} + 2p_i\overline{p}\cdot\frac{k(k-1)}{r(r-1)} + \overline{p}^{\,2}\cdot\frac{k(k-1)(k-2)}{r(r-1)(r-2)}.$$
We conclude by Fact B.2 that the $i$-th entry of $\mathbb{E}_S\big[\big(e_S - \frac{k-1}{r-2}\cdot\vec{1}\big)\cdot\langle e_S,p\rangle^2\big]$ is bounded by
$$p_i^2\cdot\frac{k(r-2k+1)}{r(r-1)} + 2p_i\overline{p}\cdot\frac{k(k-1)}{r(r-1)} + \overline{p}^{\,2}\cdot\frac{k(k-1)(2k-3)}{r(r-1)(r-2)}.$$
We do not have exact access to $\tilde{p}$, but we may form the unbiased estimator

$$\tilde{p}' := \frac{1}{m}\sum_{i=1}^m \Big(w_i - \frac{k-1}{r-2}\cdot\vec{1}\Big)\, z_i^2, \tag{12}$$
where $w_i$ is the $i$-th row of $W$. For any $i\in[m]$, each coordinate of $w_i\cdot z_i^2$ is bounded within the interval $[-\|z\|_\infty^2, \|z\|_\infty^2]$, so by Chernoff, provided that $m \ge \Omega(\log(d/\delta)/\epsilon^2)$, we ensure that $\tilde{p}'_i \in \tilde{p}_i\cdot(1\pm\epsilon)$ for all $i$ with probability at least $1-\delta$. Now consider the following estimator for $p_i^2$:
$$\hat{q}_i := \tilde{p}'_i\cdot\frac{r(r-1)}{k(r-2k+1)}. \tag{13}$$
We can thus upper bound the error of $\hat{q}_i$ relative to $p_i^2$ by
$$O\Big(\frac{\overline{p}}{p_i}\cdot\frac{k}{r} + \frac{\overline{p}^{\,2}}{p_i^2}\cdot\frac{k^2}{r^2}\Big) \;\pm\; O(\epsilon)\cdot O\Big(1 + \frac{\overline{p}}{p_i}\cdot\frac{k}{r} + \frac{\overline{p}^{\,2}}{p_i^2}\cdot\frac{k^2}{r^2}\Big).$$
If we assume that $|p_i| \ge \Omega(k/r)\cdot\overline{p}$ and $\epsilon = O(1)$ with appropriately chosen constant factors, then we have $\hat{q}_i \in p_i^2\cdot(1\pm\eta)$, as desired.

We are now ready to complete the proof of Theorem 5.1:

Proof. By Theorem 1.2 and the assumed lower bound on $m$, we can exactly recover the selection matrix $W$ (up to some column permutation) in time $O(m^{\omega+1})$. Using Lemma 5.6, for every pixel index $j\in[d]$ we can run GetHeavyPixels to recover the pixels in position $j$ which are heaviest among the $r$ private images, in time $O(m\cdot r)$, yielding the desired guarantee.
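Estimators (12) and (13) translate directly into code. The sketch below is our illustration with arbitrary toy parameters, not the paper's implementation; it recovers the squared magnitude of a planted heavy coordinate from $(W, z)$.

```python
import numpy as np

def heavy_pixel_estimates(W, z, k):
    """Estimate p_i^2 from z = Wp via the unbiased estimator (12)-(13).

    W : (m, r) 0/1 matrix with k-sparse rows; z : length-m vector.
    Reliable only for coordinates with |p_i| = Omega(k/r) * sum(p).
    """
    m, r = W.shape
    centered = W - (k - 1) / (r - 2)                      # rows w_i - (k-1)/(r-2) * 1
    p_prime = (centered * z[:, None] ** 2).mean(axis=0)   # Eq. (12)
    return p_prime * r * (r - 1) / (k * (r - 2 * k + 1))  # Eq. (13)

# toy usage with one planted heavy coordinate
rng = np.random.default_rng(2)
r, k, m = 50, 3, 20_000
p = rng.uniform(0, 0.1, size=r)
p[7] = 5.0                                   # heavy coordinate
W = np.zeros((m, r))
for i in range(m):
    W[i, rng.choice(r, size=k, replace=False)] = 1
q_hat = heavy_pixel_estimates(W, W @ p, k)
print(q_hat[7], p[7] ** 2)  # roughly equal, up to the lower-order error terms
```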

B.2 Comparison of Setup to [HSLA20]

In the original formulation of InstaHide [HSLA20], a private and a public dataset are used in tandem to construct the synthetic dataset. As was shown in [CLSZ21], by a connection to phase retrieval it is not hard to remove the contribution of the public dataset to the synthetic dataset, at least when the datasets are Gaussian. For real-world image datasets, it was empirically demonstrated by [CDG+20] that, heuristically, one can essentially treat the contribution of the public images as benign white noise. In any case, the setting where one only uses private images to generate the synthetic dataset is at least as hard, because one can always treat the public dataset as private. For these reasons, in this work we focused on the case where the synthetic dataset is generated purely from private images.

C Deferred Proofs from Section 6

In this section, we give a quasipolynomial-time algorithm for the sparse matrix factorization problem. We first define a general (asymmetric) version of the problem as follows:

Definition C.1 (Sparse Boolean matrix factorization (sparse BMF)). Given an $m\times m$ matrix $M$ where each entry is in $\{0, 1, \dots, k\}$, suppose $M$ can be factorized into two matrices $U \in \{0,1\}^{m\times r}$ and $V\in\{0,1\}^{r\times m}$, where each row of $U$ is $k$-sparse and each column of $V$ is $k$-sparse. The task is to find a row $k$-sparse matrix $\hat{U}\in\{0,1\}^{m\times r}$ and a column $k$-sparse matrix $\hat{V}\in\{0,1\}^{r\times m}$ such that $M = \hat{U}\hat{V}$.

We can also define sparse Boolean matrix factorization as an optimization problem.

Definition C.2 (Sparse BMF, optimization version). Given an $m\times m$ matrix $M$ where each entry is in $\{0, 1, \dots, k\}$, the goal is to find a row $k$-sparse matrix $\hat{U}\in\{0,1\}^{m\times r}$ and a column $k$-sparse matrix $\hat{V}\in\{0,1\}^{r\times m}$ such that the number of differing entries $\|M-\hat{U}\hat{V}\|_0$ is minimized.

We give a reduction from the general sparse BMF problem to a Max 2-CSP problem, and then use a quasipolynomial-time 2-CSP solver to find an approximate solution.
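As a concrete reference point for Definitions C.1 and C.2, the objective being optimized is straightforward to evaluate; here is a minimal sketch (our illustration, not from the paper).

```python
import numpy as np

def bmf_cost(M, U_hat, V_hat, k):
    """||M - U_hat V_hat||_0 for a candidate sparse BMF, after verifying the
    sparsity constraints of Definition C.2 (rows of U_hat and columns of
    V_hat have at most k ones)."""
    assert (U_hat.sum(axis=1) <= k).all(), "rows of U_hat must be k-sparse"
    assert (V_hat.sum(axis=0) <= k).all(), "columns of V_hat must be k-sparse"
    return int((M != U_hat @ V_hat).sum())
```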

Theorem C.3 (QPTAS for asymmetric sparse BMF). Given $m, k, r \ge 0$ and an $m\times m$ matrix $M$ as the input to an instance of the sparse Boolean matrix factorization problem, let $\mathrm{OPT}$ be the optimal value of the problem, i.e., $\mathrm{OPT} := \min_{U,V}\|M-UV\|_0$, where $U, V$ satisfy the sparsity constraints of the problem.

For any $1 \ge \epsilon > 0$, there exists an algorithm that runs in $m^{O(\epsilon^{-1}k\log r)}\cdot r^{O(\epsilon^{-1}k^2\log r)}$ time and finds a row $k$-sparse matrix $\hat{U}$ and a column $k$-sparse matrix $\hat{V}$ satisfying $\|M-\hat{U}\hat{V}\|_0 \le \mathrm{OPT}+\epsilon m^2$.

Remark C.4. We briefly compare this to the guarantee of [KPRW19], who obtained a constant-factor approximation algorithm running in time $2^{O(r^2\log r)}\cdot\mathrm{poly}(m)$. By introducing a sparsity constraint on the rows of $U, V$, we circumvent the exponential dependence on $r$, at the cost of running in time quasipolynomial in $m$. In particular, our guarantee dominates when the rank parameter $r$ is at least roughly $\widetilde{\Omega}(\sqrt{\log m})$, though strictly speaking our guarantee is incomparable because we aim for an additive approximation and only measure error in $L_0$ rather than in Frobenius norm.

Proof. For the input matrix $M$, let $U$ and $V$ be the ground truth of the factorization. Let $b_1^\top,\dots,b_m^\top$ be the rows of $U$ and $c_1,\dots,c_m$ be the columns of $V$. We construct a 2-CSP instance $\mathcal{F}_M$ that finds $U$ and $V$ as follows:

• Let $\Sigma = \big\{(q_1,\dots,q_r) : q_i\in\{0,1\}\ \forall i\in[r]\ \text{and}\ \sum_{i\in[r]} q_i = k\big\}$ be the alphabet.

• The underlying graph is a complete bipartite graph. The left-side vertices $V_L = [m]$ correspond to the rows of $U$, and the right-side vertices $V_R = [m]$ correspond to the columns of $V$.

• For $e = (u,v)\in V_L\times V_R$, define the constraint $C_e$ to be: for all $(p_1,\dots,p_r), (q_1,\dots,q_r)\in\Sigma\times\Sigma$,
$$C_e\big((p_1,\dots,p_r),(q_1,\dots,q_r)\big) = 1 \iff \sum_{i=1}^r p_i q_i = M_{u,v}.$$

Note that $\mathcal{F}_M$ has value $m^2-\mathrm{OPT}$. Indeed, we can create an assignment from the ground truth such that $\sigma(u) = b_u$ for $u\in V_L$ and $\sigma(v) = c_v$ for $v\in V_R$. By definition of the sparse Boolean matrix factorization problem, this is a legal assignment. Also, since the number of pairs $(u,v)\in[m]\times[m]$ with $M_{u,v} = \langle b_u, c_v\rangle$ is $m^2-\mathrm{OPT}$, we can see that all such edges are satisfied by this assignment.

Then we can run the QPTAS (Theorem 6.2) on $\mathcal{F}_M$ and obtain an assignment under which at most $\mathrm{OPT}+\epsilon|E|$ constraints are unsatisfied, which means the number of differing entries between $M$ and $\hat{U}\hat{V}$ is at most $\mathrm{OPT}+\epsilon m^2$.

The alphabet size of $\mathcal{F}_M$ is $\binom{r}{k}\le r^k$. The reduction time is $O(m^2 r^k)$ and the 2-CSP solving time is $(mr^k)^{O(\epsilon^{-1}\log(r^k))}$ by Theorem 6.2, since the density of a complete bipartite graph is $\delta = \frac12$. The theorem is then proved.
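The reduction in the proof above is mechanical to implement. The sketch below is our illustration (only practical for very small $m$, $r$, $k$, since the alphabet has size $\binom{r}{k}$); it materializes the alphabet and the constraint tables, which would then be handed to a Max 2-CSP solver such as the one behind Theorem 6.2.

```python
from itertools import combinations

import numpy as np

def build_2csp(M, r, k):
    """Reduce sparse BMF on an m x m matrix M to a Max 2-CSP instance.

    Alphabet: all weight-k 0/1 vectors of length r.
    Variables: a left vertex per row of U, a right vertex per column of V.
    Edge (u, v) is satisfied by symbols (a, b) iff <sigma_a, sigma_b> = M[u, v].
    """
    m = M.shape[0]
    alphabet = np.array([[1 if i in S else 0 for i in range(r)]
                         for S in combinations(range(r), k)])
    inner = alphabet @ alphabet.T  # inner[a, b] = <sigma_a, sigma_b>
    # constraint table for each edge: a |Sigma| x |Sigma| boolean matrix
    tables = {(u, v): (inner == M[u, v]) for u in range(m) for v in range(m)}
    return alphabet, tables
```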

A similar reduction can be used to prove Theorem 6.3.

Theorem 6.3. Given $m, k, r\ge 0$ and a symmetric $m\times m$ matrix $M$ as the input to an instance of the sparse Boolean matrix factorization problem, let $\mathrm{OPT}$ be the optimal value of the problem, i.e., $\mathrm{OPT} := \min_{W}\|M-WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$.

For any accuracy $\epsilon\in(0,1)$, there is an algorithm running in time
$$m^{O(\epsilon^{-1}k\log r)}\cdot r^{O(\epsilon^{-1}k^2\log r)} \tag{7}$$
which finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M-\widehat{W}\widehat{W}^\top\|_0 \le \mathrm{OPT}+\epsilon m^2$.

Proof. The construction of the 2-CSP instance $\mathcal{F}_M$ is almost the same as in the proof of Theorem C.3, except that in this case the underlying graph is a complete graph, where the vertices $V = [m]$ correspond to the rows of $W$. Then, each constraint $C_{(u,v)}$ checks whether $\langle b_u, b_v\rangle = M_{u,v}$. The correctness of the reduction follows exactly as in the proof of Theorem C.3, and we omit it here. The density of $\mathcal{F}_M$ in this case is $1$, and hence the running time of the algorithm is $(mr^k)^{O(\epsilon^{-1}\log(r^k))}$.

A direct corollary of Theorem 6.3 is that sparse BMF over the Boolean semiring can also be solved in quasipolynomial time.

Corollary C.5. Given $m, k, r\ge 0$ and a symmetric Boolean $m\times m$ matrix $M$, let $\mathrm{OPT}$ be the optimal value of the problem, i.e., $\mathrm{OPT} := \min_W \|M - WW^\top\|_0$, where $W$ is a row $k$-sparse matrix in $\{0,1\}^{m\times r}$ and the matrix multiplication is over the Boolean semiring, i.e., $a+b$ is $a\vee b$ and $a\cdot b$ is $a\wedge b$.

For any accuracy parameter $\epsilon\in(0,1)$, there exists an algorithm that runs in $m^{O(\epsilon^{-1}k\log r)}\cdot r^{O(\epsilon^{-1}k^2\log r)}$ time and finds a row $k$-sparse matrix $\widehat{W}$ satisfying $\|M-\widehat{W}\widehat{W}^\top\|_0\le\mathrm{OPT}+\epsilon m^2$.

Proof. The construction can easily be adapted to the case where matrix multiplication is over the Boolean semiring, where $a+b$ becomes $a\vee b$ and $a\cdot b$ becomes $a\wedge b$: we simply modify the constraints of the 2-CSP instance in the reduction accordingly, and it is easy to see that the algorithm still works.
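For Corollary C.5, the only change to the reduction is the predicate inside each constraint: the integer inner product is replaced by an OR of ANDs. A minimal sketch of the modified predicate (our illustration):

```python
import numpy as np

def boolean_inner(p, q):
    """Inner product over the Boolean semiring: OR_i (p_i AND q_i)."""
    return int(np.any(np.logical_and(p, q)))

# In build_2csp above, edge (u, v) would now be satisfied by symbols
# (sigma_a, sigma_b) iff boolean_inner(sigma_a, sigma_b) == M[u, v].
```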

Remark C.6. Factorizing a Boolean matrix with Boolean arithmetic is equivalent to the bipartite clique cover problem. It was proved by [CIK17] that the time complexity lower bound for the exact version of this problem is $2^{2^{\Omega(r)}}$. Since the approximation error is $\epsilon m^2$, when $\epsilon < \frac{1}{m^2}$ the output of our algorithm is the exact solution. Further, if we do not have the row-sparsity condition, i.e., $k = r$, then the time complexity becomes $2^{O(m^2(\log m)\cdot r^2\log r)}$. In the realm of parameterized complexity (see for example [CFK+15]), due to the kernelization in [FMPS09] we may assume $m \le 2^r$, and the running time of our algorithm is $2^{\widetilde{O}(2^{2r}\cdot r^3)}$, which matches the lower bound for this problem.
