Algorithms for Generalized Topic Modeling

Avrim Blum, Toyota Technological Institute at Chicago, [email protected]
Nika Haghtalab, Computer Science Department, Carnegie Mellon University, [email protected]

Abstract

Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In this work we consider a broad generalization of the traditional topic modeling framework, where we no longer assume that words are drawn i.i.d. and instead view a topic as a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a predictor that, given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. We present several natural conditions under which one can do this from unlabeled data only, and give efficient algorithms to do so, also discussing issues such as noise tolerance and sample complexity. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning.

1 Introduction

Topic modeling is an area with significant recent work in the intersection of algorithms and machine learning [4, 5, 3, 1, 2, 8]. In topic modeling, a topic (such as sports, business, or politics) is modeled as a probability distribution over words, expressed as a vector $a_i$. A document is generated by first selecting a mixture $w$ over topics, such as 80% sports and 20% business, and then choosing words i.i.d. from the associated mixture distribution, which in this case would be $0.8\,a_{\mathrm{sports}} + 0.2\,a_{\mathrm{business}}$. Given a large collection of such documents (and some assumptions about the distributions $a_i$ as well as the distribution over mixture vectors $w$), the goal is to recover the topic vectors $a_i$ and then to use the $a_i$ to correctly classify new documents according to their topic mixtures. Algorithms for this problem have been developed with strong provable guarantees even when documents consist of only two or three words each [5, 1, 18]. In addition, algorithms based on this problem formulation perform well empirically on standard datasets [9, 15].

As a theoretical model for document generation, however, an obvious problem with the standard topic modeling framework is that documents are not actually created by independently drawing words from some distribution. Moreover, important words within a topic often have meaningful correlations, like shooting a free throw or kicking a field goal. Better would be a model in which sentences are drawn i.i.d. from a distribution over sentences. Even better would be paragraphs drawn i.i.d. from a distribution over paragraphs (this would account for the word correlations that exist within a coherent paragraph). Or, even better, how about a model in which paragraphs are drawn non-independently, so that the second paragraph in a document can depend on what the first paragraph was saying, though presumably with some amount of additional entropy as well? This is the type of model we study here.

Note that an immediate problem with considering such a model is that now the task of learning an explicit distribution (over sentences or paragraphs) is hopeless. While a distribution over words can be reasonably viewed as a probability vector, one could not hope to learn or even represent an explicit distribution over sentences or paragraphs. Indeed, except in cases of plagiarism, one would not expect to see the same paragraph twice in the entire corpus. Moreover, this is likely to be true even if we assume paragraphs have some natural feature-vector representation. Instead, we bypass this issue by aiming to directly learn a predictor for documents: a function that, given a document, predicts its mixture over topics, without explicitly learning topic distributions. Another way to think of this is that our goal is not to learn a model that could be used to write a new document, but instead just a model that could be used to classify a document written by others.

This is much as in standard supervised learning, where algorithms such as SVMs learn a decision boundary (such as a linear separator) for making predictions on the labels of examples without explicitly learning the distributions $D^+$ and $D^-$ over positive and negative examples, respectively. However, our setting is unsupervised (we are not given labeled data containing the correct classifications of the documents in the training set) and, furthermore, rather than each data item belonging to one of the $k$ classes (topics), each data item belongs to a mixture of the $k$ topics. Our goal is, given a new data item, to output what that mixture is.

We begin by describing our high-level theoretical formulation. This formulation can be viewed as a generalization both of standard topic modeling and of multi-view learning or co-training [10, 12, 11, 7, 21]. We then describe several natural assumptions under which we can indeed efficiently solve the problem, learning accurate topic mixture predictors.

2 Preliminaries

We assume that paragraphs are described by $n$ real-valued features and so can be viewed as points $x$ in an instance space $X \subseteq \mathbb{R}^n$. We assume that each document consists of at least two paragraphs and denote it by $(x^1, x^2)$. Furthermore, we consider $k$ topics and partial membership functions $f_1, \ldots, f_k : X \to [0,1]$, such that $f_i(x)$ determines the degree to which paragraph $x$ belongs to topic $i$, and $\sum_{i=1}^k f_i(x) = 1$. For any vector of probabilities $w \in \mathbb{R}^k$ (which we sometimes refer to as mixture weights) we define $X^w = \{x \in \mathbb{R}^n \mid \forall i,\ f_i(x) = w_i\}$ to be the set of all paragraphs with partial membership values $w$. We assume that both paragraphs of a document have the same partial membership values, that is, $(x^1, x^2) \in \bigcup_w X^w \times X^w$, although we also allow some noise later on. To better relate to the literature on multi-view learning, we also refer to topics as "classes" and refer to paragraphs as "views" of the document.

Much like the standard topic models, we consider an unlabeled sample set that is generated by a two-step process. First, we consider a distribution $P$ over vectors of mixture weights and draw $w$ according to $P$. Then we consider a distribution $D^w$ over the set $X^w \times X^w$ and draw a document $(x^1, x^2)$ according to $D^w$. We consider two settings. In the first setting, which is addressed in Section 3, the learner receives the instance $(x^1, x^2)$. In the second setting, the learner receives samples $(\hat{x}^1, \hat{x}^2)$ that have been perturbed by some noise; we discuss two noise models, in Section 4 and Section F.2. In both cases, the goal of the learner is to recover the partial membership functions $f_i$.

More specifically, in this work we consider partial membership functions of the form $f_i(x) = f(v_i \cdot x)$, where $v_1, \ldots, v_k \in \mathbb{R}^n$ are linearly independent and $f : \mathbb{R} \to [0,1]$ is a monotonic function. (We emphasize that linear independence is a much milder assumption than requiring the topic vectors to be orthogonal.) For the majority of this work, we consider $f$ to be the identity function, so that $f_i(x) = v_i \cdot x$. Define $a_i \in \mathrm{span}\{v_1, \ldots, v_k\}$ such that $v_i \cdot a_i = 1$ and $v_j \cdot a_i = 0$ for all $j \neq i$. In other words, the matrix containing the $a_i$'s as columns is the pseudoinverse of the matrix containing the $v_i$'s as columns, and $a_i$ can be viewed as the projection of a paragraph that is purely of topic $i$ onto $\mathrm{span}\{v_1, \ldots, v_k\}$. Define $\Delta = \mathrm{CH}(\{a_1, \ldots, a_k\})$ to be the convex hull of $a_1, \ldots, a_k$.

Throughout this work, we use $\|\cdot\|_2$ to denote the spectral norm of a matrix or the $L_2$ norm of a vector. When it is clear from the context, we simply use $\|\cdot\|$ to denote these quantities. We denote by $B_r(x)$ the ball of radius $r$ around $x$. For a matrix $M$, we use $M^+$ to denote the pseudoinverse of $M$.

Generalization of Standard Topic Modeling

Let us briefly discuss how the above model is a generalization of the standard topic modeling framework. In the standard framework, a topic is modeled as a probability distribution over $n$ words, expressed as a vector $a_i \in [0,1]^n$, where $a_{ij}$ is the probability of word $j$ in topic $i$. A document is generated by first selecting a mixture $w \in [0,1]^k$ over $k$ topics, and then choosing words i.i.d. from the associated mixture distribution $\sum_{i=1}^k w_i a_i$. The document vector $\hat{x}$ is then the vector of word counts, normalized by dividing by the number of words in the document so that $\|\hat{x}\|_1 = 1$.

As a thought experiment, consider infinitely long documents. In the standard framework, all infinitely long documents of a mixture weight $w$ have the same representation $x = \sum_{i=1}^k w_i a_i$. This representation implies $x \cdot v_i = w_i$ for all $i \in [k]$, where $V = (v_1, \ldots, v_k)$ is the pseudoinverse of $A = (a_1, \ldots, a_k)$. Thus, by partitioning the document into two halves (views) $x^1$ and $x^2$, our noise-free model with $f_i(x) = v_i \cdot x$ generalizes the standard topic model for long documents. However, our model is substantially more general: features within a view can be arbitrarily correlated, the views themselves can also be correlated, and even in the zero-noise case, documents of the same mixture can look very different so long as they have the same projection onto the span of $a_1, \ldots, a_k$.

For a shorter document $\hat{x}$, each feature $\hat{x}_i$ is drawn according to a distribution with mean $x_i$, where $x = \sum_{i=1}^k w_i a_i$. Therefore, $\hat{x}$ can be thought of as a noisy measurement of $x$; the fewer the words in a document, the larger the noise in $\hat{x}$. Existing work in topic modeling, such as [5, 2], provides elegant procedures for handling the large noise caused by drawing only 2 or 3 words from the distribution induced by $x$. As we show in Section 4, our method can also tolerate large amounts of noise under some conditions. While our method cannot deal with documents that are only 2 or 3 words long, the benefit is a model that is much more general in many other respects.

Generalization of Co-training Framework

Here, we briefly discuss how our model is a generalization of the co-training framework. The standard co-training framework of [10] considers learning a binary classifier from primarily unlabeled instances, where each instance $(x^1, x^2)$ is a pair of views that have the same classification. For example, [10] and [6] show that if views are independent of each other given the classification, then one can efficiently learn a halfspace from primarily unlabeled data. In the language of our model, this corresponds to a setting with $k = 2$ classes and unknown class vectors $v_1 = -v_2$, where each view of an instance belongs fully to one class, using membership function $f_i(x) = \mathrm{sign}(v_i \cdot x)$. Our work generalizes co-training by extending it to multi-class settings where each instance belongs to one or more classes partially, using a partial membership function $f_i(\cdot)$.
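Since every later algorithm consumes data drawn from this model, it may help to see the model written out as a sampler. The following sketch is our own construction for illustration, not code from the paper: the function names, the Dirichlet prior over mixture weights, and the specific way the "off-span" content is generated are hypothetical choices; the only properties used later are that the two views share the values $v_i \cdot x$ and that a fraction of documents may receive Gaussian noise.

```python
import numpy as np

def make_model(n, k, seed=0):
    """Hypothetical instantiation of the model: random linearly independent v_i's
    and the corresponding a_i's (columns of the pseudoinverse of V^T)."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(n, k))            # columns are v_1..v_k
    A = np.linalg.pinv(V.T)                # columns a_i satisfy v_i . a_j = 1[i == j]
    return V, A, rng

def sample_documents(V, A, rng, m, sigma=0.0, p0=1.0):
    """Draw m documents (x^1, x^2) with shared mixture weights, as in Section 2.

    Each view is the shared on-span part A @ w plus an independent component that
    is orthogonal to every v_i, so v_i . x^1 = v_i . x^2 = w_i. With probability
    1 - p0 both views are additionally perturbed by N(0, sigma^2 I_n) noise
    (the noise model used later in Section 4)."""
    n, k = V.shape
    W = rng.dirichlet(np.ones(k), size=m)                  # mixture weights, rows sum to 1
    on_span = A @ W.T                                      # (n, m), shared across the two views
    off_span = lambda Z: Z - V @ np.linalg.lstsq(V, Z, rcond=None)[0]
    X1 = on_span + off_span(rng.normal(size=(n, m)))
    X2 = on_span + off_span(rng.normal(size=(n, m)))
    noisy = rng.random(m) > p0
    X1[:, noisy] += sigma * rng.normal(size=(n, noisy.sum()))
    X2[:, noisy] += sigma * rng.normal(size=(n, noisy.sum()))
    return X1, X2, W

# Example usage: V, A, rng = make_model(n=50, k=3); X1, X2, W = sample_documents(V, A, rng, m=1000)
```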
3 An Easier Case with Simplifying Assumptions

We make two main simplifying assumptions in this section, both of which will be relaxed in Section 4: 1) the documents are not noisy, i.e., $x^1 \cdot v_i = x^2 \cdot v_i$; 2) there is non-negligible probability density on instances that belong purely to one class. In this section we demonstrate ideas and techniques.

The Setting: We make the following assumptions. The documents are not noisy, that is, for any document $(x^1, x^2)$ and for all $i \in [k]$, $x^1 \cdot v_i = x^2 \cdot v_i$. Regarding distribution $P$, we assume that a non-negligible probability density is assigned to pure documents for each class. More formally, for some $\xi > 0$, for all $i \in [k]$, $\Pr_{w \sim P}[w = e_i] \geq \xi$. Regarding distribution $D^w$, we allow the two paragraphs in a document, i.e., the two views $(x^1, x^2)$ drawn from $D^w$, to be correlated as long as for any subspace $Z \subset \mathrm{null}\{v_1, \ldots, v_k\}$ of dimension strictly less than $n - k$, $\Pr_{(x^1,x^2)\sim D^w}[(x^1 - x^2) \notin Z] \geq \zeta$ for some non-negligible $\zeta$. One way to view this in the context of topic modeling is that if, say, "sports" is a topic, then it should not be the case that the second paragraph always talks about the exact same sport as the first paragraph; else "sports" would really be a union of several separate but closely-related topics. Thus, while we do not require independence, we do require some non-correlation between the paragraphs.

[Figure 1: $v_1, v_2$ correspond to classes 1 and 2, and $a_1$ and $a_2$ correspond to the canonical vectors purely of class 1 and class 2, respectively.]

Algorithm and Analysis: The main idea behind our approach is to use the consistency of the two views of the samples to first recover the subspace spanned by $v_1, \ldots, v_k$ (Phase 1). Once this subspace is recovered, we show that the projection of a sample onto this space is the convex combination of the class vectors using the mixture weight that was used for that sample. Therefore, we find vectors $a_1, \ldots, a_k$ that purely belong to each class by taking the extreme points of the projected samples (Phase 2). The class vectors $v_1, \ldots, v_k$ are the unique vectors (up to permutations) that classify $a_1, \ldots, a_k$ as pure samples. Phase 2 is similar to that of [5]. Algorithm 1 formalizes the details of this approach.

Algorithm 1: ALGORITHM FOR GENERALIZED TOPIC MODELS - NO NOISE
Input: A sample set $S = \{(x_i^1, x_i^2) \mid i \in [m]\}$ such that for each $i$, first a vector $w$ is drawn from $P$ and then $(x_i^1, x_i^2)$ is drawn from $D^w$.
Phase 1: Let $X^1$ and $X^2$ be matrices whose $i$th columns are $x_i^1$ and $x_i^2$, respectively. Let $P$ be the projection matrix on the last $k$ left singular vectors of $(X^1 - X^2)$.
Phase 2: Let $S_{\parallel} = \{P x_i^j \mid i \in [m],\ j \in \{1, 2\}\}$. Let $A$ be a matrix whose columns are the extreme points of the convex hull of $S_{\parallel}$. (This can be found using farthest traversal or linear programming.)
Output: Return the columns of $A^+$ as $v_1, \ldots, v_k$.

In Phase 1, for recovering $\mathrm{span}\{v_1, \ldots, v_k\}$, note that for any sample $(x^1, x^2)$ drawn from $D^w$ we have $v_i \cdot x^1 = v_i \cdot x^2 = w_i$. Therefore, regardless of which $w$ was used to produce the sample, we have that $v_i \cdot (x^1 - x^2) = 0$ for all $i \in [k]$. That is, $v_1, \ldots, v_k$ are in the null space of all such $(x^1 - x^2)$. The assumptions on $D^w$ show that after seeing sufficiently many samples, the differences $(x_i^1 - x_i^2)$ span an $(n-k)$-dimensional subspace. So $\mathrm{span}\{v_1, \ldots, v_k\}$ can be recovered by taking $\mathrm{null}\{(x^1 - x^2) \mid (x^1, x^2) \in X^w \times X^w,\ \forall w \in \mathbb{R}^k\}$. This null space is spanned by the last $k$ left singular vectors of $X^1 - X^2$, where $X^1$ and $X^2$ are matrices with columns $x_i^1$ and $x_i^2$, respectively. The next lemma (see Appendix A.1 for a proof) formalizes this discussion.

Lemma 3.1. Let $Z = \mathrm{span}\{(x_i^1 - x_i^2) \mid i \in [m]\}$. Then $m = O\!\left(\frac{n-k}{\zeta}\log\frac{1}{\delta}\right)$ is sufficient such that with probability $1 - \delta$, $\mathrm{rank}(Z) = n - k$.

Using Lemma 3.1, Phase 1 of Algorithm 1 recovers $\mathrm{span}\{v_1, \ldots, v_k\}$. Next, we show that the pure samples are the extreme points of the convex hull of all samples when projected onto the subspace $\mathrm{span}\{v_1, \ldots, v_k\}$. Figure 1 demonstrates the relation between the class vectors $v_i$, the projection of samples, and the projection of pure samples $a_i$. The next lemma, whose proof appears in Appendix A.2, formalizes this claim.

Lemma 3.2. For any $x$, let $x_{\parallel}$ represent the projection of $x$ on $\mathrm{span}\{v_1, \ldots, v_k\}$. Then $x_{\parallel} = \sum_{i \in [k]} (v_i \cdot x)\, a_i$.

With $\sum_{i \in [k]} (v_i \cdot x)\, a_i$ representing the projection of $x$ on $\mathrm{span}\{v_1, \ldots, v_k\}$, it is clear that the extreme points of the set of all projected instances that belong to $X^w$, over all $w$, are $a_1, \ldots, a_k$. Since in a large enough sample set, with high probability there is a pure sample of type $i$ for every $i \in [k]$, taking the extreme points of the set of projected samples also yields $a_1, \ldots, a_k$. The following lemma, whose proof appears in Appendix A.3, formalizes this discussion.

Lemma 3.3. Let $m = c\left(\frac{1}{\xi}\log\frac{k}{\delta}\right)$ for a large enough constant $c > 0$. Let $P$ be the projection matrix for $\mathrm{span}\{v_1, \ldots, v_k\}$ and $S_{\parallel} = \{P x_i^j \mid i \in [m],\ j \in \{1, 2\}\}$ be the set of projected samples. With probability $1 - \delta$, $\{a_1, \ldots, a_k\}$ is the set of extreme points of $\mathrm{CH}(S_{\parallel})$.

Therefore, $a_1, \ldots, a_k$ can be learned by taking the extreme points of the convex hull of all samples projected onto $\mathrm{span}\{v_1, \ldots, v_k\}$. Furthermore, $V = A^+$ is unique, therefore $v_1, \ldots, v_k$ can easily be found by taking the pseudoinverse of the matrix $A$. Together with Lemmas 3.1 and 3.3, this proves the next theorem regarding learning class vectors in the absence of noise.

Theorem 3.4 (No Noise). There is a polynomial time algorithm for which $m = O\!\left(\frac{n-k}{\zeta}\ln\frac{1}{\delta} + \frac{1}{\xi}\ln\frac{k}{\delta}\right)$ is sufficient to recover $v_i$ exactly for all $i \in [k]$, with probability $1 - \delta$.
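As a concrete rendering of Algorithm 1, the sketch below implements the two phases with NumPy. It is an illustration under our own conventions rather than the authors' code: the extreme points of the projected samples are found with a simple farthest-point traversal (one of the two options the algorithm mentions), which relies on the pure samples guaranteed by the setting actually being present, and the bottom-$k$ subspace is computed from the $n \times n$ matrix $DD^\top$ rather than a full SVD.

```python
import numpy as np

def last_k_eigvecs(D, k):
    """Orthonormal basis for the k least significant left singular directions of D,
    computed from the n x n matrix D D^T (cheap even when m is large)."""
    _, U = np.linalg.eigh(D @ D.T)                       # eigenvalues ascending
    return U[:, :k]

def farthest_traversal(points, k):
    """Greedy stand-in for the 'farthest traversal or linear programming' step:
    pick k points that approximately realize the extreme points of the simplex."""
    centroid = points.mean(axis=0)
    chosen = [int(np.argmax(np.linalg.norm(points - centroid, axis=1)))]
    base = points[chosen[0]]
    for _ in range(k - 1):
        residual = (points - base).T                     # distance to the affine hull built so far
        if len(chosen) > 1:
            Q, _ = np.linalg.qr((points[chosen[1:]] - base).T)
            residual = residual - Q @ (Q.T @ residual)
        chosen.append(int(np.argmax(np.linalg.norm(residual, axis=0))))
    return chosen

def recover_topics_no_noise(X1, X2, k):
    """Sketch of Algorithm 1. X1, X2 are (n, m); column i holds the two views of document i."""
    # Phase 1: every difference x^1 - x^2 is orthogonal to each v_i.
    B = last_k_eigvecs(X1 - X2, k)                       # basis of span{v_1..v_k}
    projected = B @ (B.T @ np.hstack([X1, X2]))          # projections of all paragraphs
    # Phase 2: the pure samples are the extreme points of the projected set.
    A = projected[:, farthest_traversal(projected.T, k)] # columns approximate a_1..a_k
    V = np.linalg.pinv(A)                                # rows approximate v_1..v_k (v_i . a_j = 1[i == j])
    return V, A
```

On documents produced by a sampler like the one sketched after Section 2, and with enough samples for Lemmas 3.1 and 3.3 to apply, the rows of the returned matrix should match the true $v_i$ up to permutation.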
4 Relaxing the Assumptions

In this section, we relax the two main simplifying assumptions from Section 3. We relax the assumption on non-noisy documents and allow a large fraction of the documents to not satisfy $v_i \cdot x^1 = v_i \cdot x^2$; in the standard topic model, this corresponds to having a large fraction of short documents. Furthermore, we relax the assumption on the existence of pure documents to an assumption on the existence of "almost-pure" documents.

The Setting: We assume that any sampled document has a non-negligible probability of being non-noisy, and with the remaining probability the two views of the document are perturbed by additive Gaussian noise, independently. More formally, for a given sample $(x^1, x^2)$, with probability $p_0 > 0$ the algorithm receives $(x^1, x^2)$ and with the remaining probability $1 - p_0$ the algorithm receives $(\hat{x}^1, \hat{x}^2)$, where $\hat{x}^j = x^j + e^j$ and $e^j \sim N(0, \sigma^2 I_n)$.

We assume that for each topic, the probability that a document is mostly about that topic is non-negligible. More formally, for any topic $i \in [k]$, $\Pr_{w \sim P}[\|e_i - w\|_1 \leq \epsilon] \geq g(\epsilon)$, where $g$ is a polynomial function of its input. A stronger form of this assumption, better known as the dominant admixture assumption, assumes that every document is mostly about one topic and has been empirically shown to hold on several real-world data sets [8]. Furthermore, in the Latent Dirichlet Allocation model, $\Pr_{w \sim P}[\max_{i \in [k]} w_i \geq 1 - \epsilon] \geq O(\epsilon^2)$ for typical values of the concentration parameter.

We also make assumptions on the distribution over instances. We assume that the covariance of the distribution over $(x_i^1 - x_i^2)$ dominates the noise covariance $\sigma^2$: for some $\delta_0 > 0$, the least significant non-zero eigenvalue of $\mathbb{E}_{(x_i^1, x_i^2)}[(x_i^1 - x_i^2)(x_i^1 - x_i^2)^\top]$, equivalently its $(n-k)$th eigenvalue, is greater than $6\sigma^2 + \delta_0$. (This assumption is only used in Phase 1. One can ensure that it holds by taking the average of several documents in Phase 1, where the average of documents $(\hat{x}_1^1, \hat{x}_1^2), \ldots, (\hat{x}_m^1, \hat{x}_m^2)$ is $\left(\sum_{i=1}^m \hat{x}_i^1/m,\ \sum_{i=1}^m \hat{x}_i^2/m\right)$. Since the noise shrinks in the averaged documents, the noise level falls under the required level; this would mildly increase the sample complexity.) Moreover, we assume that the $L_2$ norm of each view of a sample is bounded by some $M > 0$. We also assume that for all $i \in [k]$, $\|a_i\| \leq \alpha$ for some $\alpha > 0$. At a high level, the $\|a_i\|$'s are inversely proportional to the non-zero singular values of $V = (v_1, \ldots, v_k)$, so $\|a_i\| \leq \alpha$ implies that the topic vectors are sufficiently different.

Algorithm and Results: Our approach follows the general theme of the previous section: first recover $\mathrm{span}\{v_1, \ldots, v_k\}$, and then recover $a_1, \ldots, a_k$ by taking the extreme points of the projected samples. In this case, in the first phase we recover $\mathrm{span}\{v_1, \ldots, v_k\}$ approximately, by finding a projection matrix $\hat{P}$ such that $\|P - \hat{P}\| \leq \epsilon$ for an arbitrarily small $\epsilon$, where $P$ is the projection matrix on $\mathrm{span}\{v_1, \ldots, v_k\}$. At this point in the algorithm, the projections of samples on $\hat{P}$ can include points that are arbitrarily far from $\Delta$. This is because the noisy samples are perturbed by $N(0, \sigma^2 I_n)$, so for large values of $\sigma$ some noisy samples map to points that are quite far from $\Delta$. Therefore, we have to detect and remove these samples before continuing to the second phase. For this purpose, we show that the low-density regions of the projected samples can safely be removed such that the convex hull of the remaining points is close to $\Delta$. In the second phase, we consider projections of each sample using $\hat{P}$. To approximately recover $a_1, \ldots, a_k$, we find samples $x$ that are far from the convex hull of the remaining points when $x$ and a ball of points close to it are removed; we then show that such points are close to one of the pure class vectors $a_i$. Algorithm 2 and the details of the above approach and its performance are as follows.

Algorithm 2: ALGORITHM FOR GENERALIZED TOPIC MODELS - WITH NOISE
Input: A sample set $\{(\hat{x}_i^1, \hat{x}_i^2) \mid i \in [m]\}$ such that for each $i$, first a vector $w$ is drawn from $P$ and then $(x_i^1, x_i^2)$ is drawn from $D^w$; then, with probability $p_0$, $\hat{x}_i^j = x_i^j$, else with probability $1 - p_0$, $\hat{x}_i^j = x_i^j + N(0, \sigma^2 I_n)$, for $i \in [m]$ and $j \in \{1, 2\}$.
Phase 1:
1. Take $m_1 = \Omega\!\left(\frac{n-k}{\zeta}\ln\frac{1}{\delta} + \frac{n\sigma^2 M^4 r^2}{\delta_0^2\epsilon^2}\,\mathrm{polylog}\frac{nrM}{\delta}\right)$ samples.
2. Let $\hat{X}^1$ and $\hat{X}^2$ be matrices whose $i$th columns are $\hat{x}_i^1$ and $\hat{x}_i^2$, respectively.
3. Let $\hat{P}$ be the projection matrix on the last $k$ left singular vectors of $\hat{X}^1 - \hat{X}^2$.
Denoising Phase:
4. Let $\epsilon' = \epsilon/(8r)$ and $\gamma = g(\epsilon'/(8k\alpha))$.
5. Take $m_2 = \Omega\!\left(\frac{k}{p_0\gamma}\ln\frac{1}{\delta}\right)$ fresh samples and let $\hat{S}_{\parallel} = \{\hat{P}\hat{x}_i^1 \mid i \in [m_2]\}$.
6. Remove from $\hat{S}_{\parallel}$ any $\hat{x}$ for which there are fewer than $\frac{p_0\gamma m_2}{2}$ points within distance $\epsilon'/2$ of it in $\hat{S}_{\parallel}$.
Phase 2:
7. For all $\hat{x}$ in $\hat{S}_{\parallel}$, if $\mathrm{dist}(\hat{x}, \mathrm{CH}(\hat{S}_{\parallel} \setminus B_{6r\epsilon'}(\hat{x}))) \geq 2\epsilon'$, add $\hat{x}$ to $C$.
8. Cluster $C$ using single linkage with threshold $16r\epsilon'$. Assign any point from cluster $i$ as $\hat{a}_i$.
Output: Return $\hat{a}_1, \ldots, \hat{a}_k$.

Theorem 4.1. For any small enough $\epsilon > 0$ and any $\delta > 0$, there is an efficient algorithm for which an unlabeled sample set of size
$$m = O\!\left(\frac{n-k}{\zeta}\ln\frac{1}{\delta} + \frac{n\sigma^2 M^4 r^2}{\delta_0^2\epsilon^2}\,\mathrm{polylog}\frac{nrM}{\delta} + \frac{k\ln(1/\delta)}{p_0\, g(\epsilon/(kr\alpha))}\right)$$
is sufficient to recover $\hat{a}_i$ such that $\|\hat{a}_i - a_i\|_2 \leq \epsilon$ for all $i \in [k]$, with probability $1 - \delta$, where $r$ is a parameter that depends on the geometry of the simplex $\Delta$ and will be defined in Section 4.3.

The proof of Theorem 4.1 relies on the next lemmas regarding the performance of each phase of the algorithm. We formally state them here, but defer their proofs to Sections 4.1, 4.2, and 4.3.

Lemma 4.2 (Phase 1). For any $\sigma, \epsilon > 0$, it is sufficient to have an unlabeled sample set of size
$$m = O\!\left(\frac{n-k}{\zeta}\ln\frac{1}{\delta} + \frac{n\sigma^2 M^2}{\delta_0^2\epsilon^2}\,\mathrm{polylog}\frac{n}{\delta}\right)$$
so that with probability $1 - \delta$, Phase 1 of Algorithm 2 returns a matrix $\hat{P}$ such that $\|P - \hat{P}\|_2 \leq \epsilon$.

Lemma 4.3 (Denoising). Let $\epsilon' \leq \frac{1}{3}\sigma\sqrt{k}$, $\|P - \hat{P}\| \leq \epsilon'/8M$, and $\gamma = g\!\left(\frac{\epsilon'}{8k\alpha}\right)$. An unlabeled sample of size $m = O\!\left(\frac{k}{p_0\gamma}\ln\frac{1}{\delta}\right)$ is sufficient such that for $\hat{S}_{\parallel}$ as defined in Step 6 of Algorithm 2 the following holds with probability $1 - \delta$: for any $x \in \hat{S}_{\parallel}$, $\mathrm{dist}(x, \Delta) \leq \epsilon'$, and for all $i \in [k]$ there exists $\hat{a}_i \in \hat{S}_{\parallel}$ such that $\|\hat{a}_i - a_i\| \leq \epsilon'$.

Lemma 4.4 (Phase 2). Let $\hat{S}_{\parallel}$ be a set for which the conclusion of Lemma 4.3 holds with the value $\epsilon' = \epsilon/8r$. Then Phase 2 of Algorithm 2 returns $\hat{a}_1, \ldots, \hat{a}_k$ such that for all $i \in [k]$, $\|a_i - \hat{a}_i\| \leq \epsilon$.

We now prove our main Theorem 4.1 by directly leveraging the three lemmas we just stated.

Proof of Theorem 4.1. By Lemma 4.2, a sample set of size $m_1$ is sufficient such that Phase 1 of Algorithm 2 leads to $\|P - \hat{P}\| \leq \frac{\epsilon}{32Mr}$, with probability $1 - \delta/2$. Let $\epsilon' = \frac{\epsilon}{8r}$ and take a fresh sample of size $m_2$. By Lemma 4.3, with probability $1 - \delta/2$, for any $x \in \hat{S}_{\parallel}$, $\mathrm{dist}(x, \Delta) \leq \epsilon'$, and for all $i \in [k]$ there exists $\hat{a}_i \in \hat{S}_{\parallel}$ such that $\|\hat{a}_i - a_i\| \leq \epsilon'$. Finally, by Lemma 4.4, Phase 2 of Algorithm 2 returns $\hat{a}_i$ such that for all $i \in [k]$, $\|a_i - \hat{a}_i\| \leq \epsilon$.

Theorem 4.1 discusses the approximation of $a_i$ for all $i \in [k]$. It is not hard to see that such an approximation also translates to an approximation of the class vectors $v_i$ for all $i \in [k]$. That is, using the properties of perturbation of pseudoinverse matrices (see Proposition B.5), one can show that $\|\hat{A}^+ - V\| \leq O(\|\hat{A} - A\|)$. Therefore, $\hat{V} = \hat{A}^+$ is a good approximation for $V$.
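The following sketch shows one way the Denoising Phase and Phase 2 of Algorithm 2 could be realized with NumPy/SciPy. It is an illustration under our own choices, not the authors' implementation: the distance to a convex hull is computed by a small quadratic program over convex-combination weights via scipy.optimize.minimize (SLSQP), single-linkage clustering comes from scipy.cluster.hierarchy, and each cluster is summarized by its mean, whereas the algorithm as stated takes an arbitrary representative. The thresholds (min_neighbors, the ball radii, and the linkage cutoff) are supplied by the caller following the statement of Algorithm 2, and no attempt is made to be efficient.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def dist_to_hull(x, points):
    """Euclidean distance from x to the convex hull of the rows of `points`,
    solved as a small QP over the convex-combination weights."""
    m = len(points)
    if m == 0:
        return np.inf
    obj = lambda lam: np.sum((points.T @ lam - x) ** 2)
    cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0, 1)] * m,
                   constraints=cons, method='SLSQP')
    return np.sqrt(res.fun)

def denoise_and_recover(S_proj, eps_prime, r, min_neighbors):
    """Sketch of the Denoising Phase and Phase 2 of Algorithm 2.

    S_proj        : (N, n) array of projected first views, P_hat @ x_hat.
    eps_prime     : the parameter eps' = eps / (8 r).
    r             : geometry parameter of the simplex Delta.
    min_neighbors : the threshold p_0 * gamma * m_2 / 2.
    """
    # Denoising: drop low-density points (fewer than min_neighbors within eps'/2).
    D = cdist(S_proj, S_proj)
    keep = (D <= eps_prime / 2).sum(axis=1) - 1 >= min_neighbors
    S = S_proj[keep]

    # Phase 2: a point is a candidate corner if it stays far from the hull of the
    # remaining points once a ball of radius 6 r eps' around it is removed.
    C = []
    for x in S:
        others = S[np.linalg.norm(S - x, axis=1) > 6 * r * eps_prime]
        if dist_to_hull(x, others) >= 2 * eps_prime:
            C.append(x)
    C = np.array(C)

    # Cluster the candidates with single linkage; one cluster per corner a_i.
    labels = fcluster(linkage(C, method='single'),
                      t=16 * r * eps_prime, criterion='distance')
    return np.array([C[labels == c].mean(axis=0) for c in np.unique(labels)])
```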
4.1 Proof of Lemma 4.2 - Phase 1

For $j \in \{1, 2\}$, let $X^j$ and $\hat{X}^j$ be $n \times m$ matrices with the $i$th columns being $x_i^j$ and $\hat{x}_i^j$, respectively. As we demonstrated in Lemma 3.1, with high probability $\mathrm{rank}(X^1 - X^2) = n - k$. Note that the null space of the columns of $X^1 - X^2$ is spanned by the left singular vectors of $X^1 - X^2$ that correspond to its $k$ zero singular values. We show that this null space can be approximated to within any desired accuracy by the space spanned by the $k$ least significant left singular vectors of $\hat{X}^1 - \hat{X}^2$, given a sufficiently large number of samples.

Let $D = X^1 - X^2$ and $\hat{D} = \hat{X}^1 - \hat{X}^2$. For ease of exposition, assume that all samples are perturbed by Gaussian noise $N(0, \sigma^2 I_n)$. (The assumption that with non-negligible probability a sample is non-noisy is not needed for the analysis and correctness of Phase 1 of Algorithm 2; it only comes into play in the denoising phase.) Since each view of a sample is perturbed by an independent draw from a Gaussian noise distribution, we can write $\hat{D} = D + E$, where each column of $E$ is drawn i.i.d. from $N(0, 2\sigma^2 I_n)$. Then $\frac{1}{m}\hat{D}\hat{D}^\top = \frac{1}{m}DD^\top + \frac{1}{m}DE^\top + \frac{1}{m}ED^\top + \frac{1}{m}EE^\top$. As a thought experiment, consider this equation in expectation. Since $\mathbb{E}[\frac{1}{m}EE^\top] = 2\sigma^2 I_n$ is the covariance matrix of the noise and $\mathbb{E}[DE^\top + ED^\top] = 0$, we have
$$\mathbb{E}\!\left[\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n\right] = \mathbb{E}\!\left[\frac{1}{m}DD^\top\right]. \quad (1)$$
Moreover, the eigenvectors and their order are the same in $\mathbb{E}[\frac{1}{m}\hat{D}\hat{D}^\top]$ and $\mathbb{E}[\frac{1}{m}\hat{D}\hat{D}^\top] - 2\sigma^2 I_n$. Therefore, one can recover the null space of $\mathbb{E}[\frac{1}{m}DD^\top]$ by taking the space of the $k$ smallest eigenvectors of $\mathbb{E}[\frac{1}{m}\hat{D}\hat{D}^\top]$. Next, we show how to recover the null space using $\hat{D}\hat{D}^\top$, rather than $\mathbb{E}[\hat{D}\hat{D}^\top]$. Assume that the following properties hold:

1. Equation (1) holds not only in expectation, but also with high probability. That is, with high probability, $\|\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top\|_2 \leq \epsilon$.
2. With high probability, $\lambda_{n-k}(\frac{1}{m}\hat{D}\hat{D}^\top) > 4\sigma^2 + \delta_0/2$, where $\lambda_i(\cdot)$ denotes the $i$th most significant eigenvalue.

Let $D = U\Sigma V^\top$ and $\hat{D} = \hat{U}\hat{\Sigma}\hat{V}^\top$ be SVD representations. We have $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n = \hat{U}(\frac{1}{m}\hat{\Sigma}^2 - 2\sigma^2 I_n)\hat{U}^\top$. By Property 2, $\lambda_{n-k}(\frac{1}{m}\hat{\Sigma}^2) > 4\sigma^2 + \delta_0/2$. That is, the eigenvectors and their order are the same in $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n$ and $\frac{1}{m}\hat{D}\hat{D}^\top$. As a result, the projection matrix $\hat{P}$ on the least significant $k$ eigenvectors of $\frac{1}{m}\hat{D}\hat{D}^\top$ is the same as the projection matrix $Q$ on the least significant $k$ eigenvectors of $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n$.

Recall that $\hat{P}$, $P$, and $Q$ are the projection matrices on the least significant $k$ eigenvectors of $\frac{1}{m}\hat{D}\hat{D}^\top$, $\frac{1}{m}DD^\top$, and $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n$, respectively. As we discussed, $\hat{P} = Q$. Now, using the Wedin $\sin\theta$ theorem [13, 24] (see Proposition B.1) from matrix perturbation theory, we have
$$\|P - \hat{P}\|_2 = \|P - Q\|_2 \leq \frac{\|\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top\|_2}{\lambda_{n-k}(\frac{1}{m}\hat{D}\hat{D}^\top) - 2\sigma^2 - \lambda_{n-k+1}(\frac{1}{m}DD^\top)} \leq \frac{2\epsilon}{\delta_0},$$
where we use Properties 1 and 2 and the fact that $\lambda_{n-k+1}(\frac{1}{m}DD^\top) = 0$ in the last transition.

Concentration. It remains to prove Properties 1 and 2. We briefly describe our proof that when $m$ is large, with high probability $\|\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top\|_2 \leq \epsilon$ and $\lambda_{n-k}(\frac{1}{m}\hat{D}\hat{D}^\top) > 4\sigma^2 + \delta_0/2$. Let us first describe $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top$ in terms of the error matrices. We have
$$\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top = \left(\frac{1}{m}EE^\top - 2\sigma^2 I_n\right) + \left(\frac{1}{m}DE^\top + \frac{1}{m}ED^\top\right). \quad (2)$$
It suffices to show that for large enough $m > m_{\epsilon,\delta}$, $\Pr[\|\frac{1}{m}EE^\top - 2\sigma^2 I_n\|_2 \geq \epsilon] \leq \delta$ and $\Pr[\|\frac{1}{m}DE^\top + \frac{1}{m}ED^\top\|_2 \geq \epsilon] \leq \delta$. In the former, note that $\frac{1}{m}EE^\top$ is the sample covariance of the Gaussian noise matrix and $2\sigma^2 I_n$ is the true covariance matrix of the noise distribution. The next two claims follow from the convergence properties of the sample covariance of Gaussians and the use of the Matrix Bernstein inequality [22] (Appendix B).

Claim 1. $m = O\!\left(\frac{n\sigma^4}{\epsilon^2}\log\frac{1}{\delta}\right)$ is sufficient to get $\|\frac{1}{m}EE^\top - 2\sigma^2 I_n\|_2 \leq \epsilon$, with probability $1 - \delta$.

Claim 2. $m = O\!\left(\frac{n\sigma^2 M^2}{\epsilon^2}\,\mathrm{polylog}\frac{n}{\delta}\right)$ is sufficient to get $\|\frac{1}{m}DE^\top + \frac{1}{m}ED^\top\|_2 \leq \epsilon$, with probability $1 - \delta$. (See Appendix C.1 for a proof.)

We now prove that $\lambda_{n-k}(\frac{1}{m}\hat{D}\hat{D}^\top) > 4\sigma^2 + \delta_0/2$. Since for any two matrices the difference in $\lambda_{n-k}$ can be bounded by the spectral norm of their difference (see Proposition B.4), by Equation (2) we have
$$\lambda_{n-k}\!\left(\frac{1}{m}\hat{D}\hat{D}^\top\right) - \lambda_{n-k}\!\left(\frac{1}{m}DD^\top\right) \leq \left\|2\sigma^2 I_n + \left(\frac{1}{m}EE^\top - 2\sigma^2 I_n\right) + \left(\frac{1}{m}DE^\top + \frac{1}{m}ED^\top\right)\right\|_2 \leq 2\sigma^2 + \frac{\delta_0}{4},$$
where in the last transition we use Claims 1 and 2 with the value $\delta_0/8$ to bound the last two terms by a total of $\delta_0/4$. Since $\lambda_{n-k}(\mathbb{E}[\frac{1}{m}DD^\top]) \geq 6\sigma^2 + \delta_0$, it is sufficient to show that $|\lambda_{n-k}(\mathbb{E}[\frac{1}{m}DD^\top]) - \lambda_{n-k}(\frac{1}{m}DD^\top)| \leq \delta_0/4$. Similarly as before, this is bounded by $\|\frac{1}{m}DD^\top - \mathbb{E}[\frac{1}{m}DD^\top]\|$. We use the Matrix Bernstein inequality to prove this concentration result; see Appendix C.2 for a proof.

Claim 3. $m = O\!\left(\frac{M^4}{\delta_0^2}\log\frac{n}{\delta}\right)$ is sufficient to get $\|\frac{1}{m}DD^\top - \mathbb{E}[\frac{1}{m}DD^\top]\| \leq \frac{\delta_0}{4}$, with probability $1 - \delta$.

This completes the analysis of Phase 1 of our algorithm, and the proof of Lemma 4.2 follows directly from the above analysis and the application of Claims 1 and 2 with the error of $\delta_0$, and Claim 3.
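The heart of the Phase 1 analysis is that the noise term only adds roughly $2\sigma^2 I_n$ to $\frac{1}{m}\hat{D}\hat{D}^\top$, which shifts eigenvalues but not eigenvectors, so the bottom-$k$ eigenspace is barely rotated once $m$ is large. The toy experiment below (our own construction, not from the paper) checks this numerically for a synthetic rank-$(n-k)$ difference matrix; the dimensions and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 40, 4, 0.3
B = np.linalg.qr(rng.normal(size=(n, n - k)))[0]   # signal directions; their orthocomplement plays span{v_i}

def last_k_eigvecs(D, k):
    return np.linalg.eigh(D @ D.T)[1][:, :k]        # eigenvalues ascending, so the first k are the smallest

def subspace_error(m):
    D = B @ rng.normal(size=(n - k, m))             # noise-free differences, rank n - k
    D_hat = D + np.sqrt(2) * sigma * rng.normal(size=(n, m))   # each view perturbed by N(0, sigma^2 I_n)
    U, U_hat = last_k_eigvecs(D, k), last_k_eigvecs(D_hat, k)
    P, P_hat = U @ U.T, U_hat @ U_hat.T
    return np.linalg.norm(P - P_hat, 2)

for m in (500, 5000, 50000):
    print(m, subspace_error(m))                     # ||P - P_hat||_2 shrinks as m grows
```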
4.2 Proof of Lemma 4.3 - Denoising Step

We use the projection matrix $\hat{P}$ to partially denoise the samples while approximately preserving $\Delta = \mathrm{CH}(\{a_1, \ldots, a_k\})$. At a high level we show that, in the projection of samples on $\hat{P}$, 1) the regions around the $a_i$ have sufficiently high density, and 2) the regions that are far from $\Delta$ have low density.

We claim that if $\hat{x} \in \hat{S}_{\parallel}$ is non-noisy and corresponds almost purely to one class, then $\hat{S}_{\parallel}$ also includes a non-negligible number of points within $O(\epsilon')$ distance of $\hat{x}$. This is due to the fact that a non-negligible number of points (about $p_0\gamma m$ points) correspond to non-noisy and almost-pure samples that, using $P$, would get projected to points within a distance of $O(\epsilon')$ of each other. Furthermore, the inaccuracy in $\hat{P}$ can only perturb the projections by up to $O(\epsilon')$ distance. So the projections of all non-noisy samples that are almost purely of class $i$ fall within $O(\epsilon')$ of $a_i$. The following claim, whose proof appears in Appendix D.1, formalizes this discussion.

In the following lemmas, let $D$ denote the flattened distribution of the first paragraphs. That is, the distribution over $\hat{x}^1$ where we first take $w \sim P$, then take $(x^1, x^2) \sim D^w$, and finally take $\hat{x}^1$.

Claim 4. For all $i \in [k]$, $\Pr_{x \sim D}[\hat{P}x \in B_{\epsilon'/4}(a_i)] \geq p_0\gamma$.

On the other hand, any projected point that is far from the convex hull of $a_1, \ldots, a_k$ has to be noisy, and as a result has been generated by a Gaussian distribution with variance $\sigma^2$. For a choice of $\epsilon'$ that is small with respect to $\sigma$, such points do not concentrate well within any ball of radius $\epsilon'$. In the next claim, we show that the regions that are far from the convex hull have low density.

Claim 5. For any $z$ such that $\mathrm{dist}(z, \Delta) \geq \epsilon'$, we have $\Pr_{x \sim D}[\hat{P}x \in B_{\epsilon'/2}(z)] \leq \frac{p_0\gamma}{4}$.

Proof. We first show that $B_{\epsilon'/2}(z)$ does not include any non-noisy points. Take any non-noisy sample $x$. Note that $Px = \sum_{i=1}^k w_i a_i$, where the $w_i$ are the mixture weights corresponding to the point $x$. We have
$$\|z - \hat{P}x\| = \left\|z - \sum_{i=1}^k w_i a_i + (P - \hat{P})x\right\| \geq \left\|z - \sum_{i=1}^k w_i a_i\right\| - \|P - \hat{P}\|\,\|x\| \geq \epsilon'/2.$$
Therefore, $B_{\epsilon'/2}(z)$ only contains noisy points. Since noisy points are perturbed by a spherical Gaussian, the projections of these points on any $k$-dimensional subspace can be thought of as points generated from $k$-dimensional Gaussian distributions with variance $\sigma^2$ and potentially different centers. One can show that the densest ball of any radius is at the center of a Gaussian; here, we prove a slightly weaker claim. Consider one such Gaussian distribution, $N(0, \sigma^2 I_k)$. Note that the pdf of the Gaussian distribution decreases as we get farther from its center. By a coupling between the densities of the points, $B_{\epsilon'/2}(0)$ has higher density than any $B_{\epsilon'/2}(c)$. Therefore,
$$\sup_c \Pr_{x \sim N(0, \sigma^2 I_k)}[x \in B_{\epsilon'/2}(c)] \leq \Pr_{x \sim N(0, \sigma^2 I_k)}[x \in B_{3\epsilon'/2}(0)].$$
So, over $D$ this value will be maximized if the Gaussians had the same center (see Figure 2). Moreover, in $N(0, \sigma^2 I_k)$, $\Pr[\|x\|_2 \leq \sigma\sqrt{k(1-t)}] \leq \exp(-kt^2/16)$. Since $3\epsilon'/2 \leq \sigma\sqrt{k}/2 \leq \sigma\sqrt{k\left(1 - \sqrt{\tfrac{16}{k}\ln\tfrac{4}{p_0\gamma}}\right)}$, we have
$$\Pr_{\hat{x} \sim D}[x \in B_{\epsilon'/2}(c)] \leq \Pr_{x \sim N(0, \sigma^2 I_k)}[\|x\|_2 \leq 3\epsilon'/2] \leq \frac{p_0\gamma}{4}.$$

[Figure 2: Density is maximized when the blue and red Gaussians coincide and the ball is at their center.]

The next claim shows that in a large sample set, the fraction of samples that fall within any of the regions described in Claims 4 and 5 is close to the density of that region. The proof of this claim follows from the VC dimension of the set of balls.

Claim 6. Let $D$ be any distribution over $\mathbb{R}^k$ and let $x_1, \ldots, x_m$ be $m$ points drawn i.i.d. from $D$. Then $m = O\!\left(\frac{k}{\gamma}\ln\frac{1}{\delta}\right)$ is sufficient so that with probability $1 - \delta$, for any ball $B \subseteq \mathbb{R}^k$ such that $\Pr_{x \sim D}[x \in B] \geq 2\gamma$, $|\{x_i \mid x_i \in B\}| > \gamma m$, and for any ball $B \subseteq \mathbb{R}^k$ such that $\Pr_{x \sim D}[x \in B] \leq \gamma/2$, $|\{x_i \mid x_i \in B\}| < \gamma m$.

Therefore, upon seeing $\Omega\!\left(\frac{k}{p_0\gamma}\ln\frac{1}{\delta}\right)$ samples, with probability $1 - \delta$, for all $i \in [k]$ there are more than $p_0\gamma m/2$ projected points within distance $\epsilon'/4$ of $a_i$ (by Claims 4 and 6), and no point that is $\epsilon'$ far from $\Delta$ has more than $p_0\gamma m/2$ points in its $\epsilon'/2$-neighborhood (by Claims 5 and 6). The denoising step of Algorithm 2 leverages these properties of the set of projected points for denoising the samples while preserving $\Delta$: remove any point from $\hat{S}_{\parallel}$ with fewer than $p_0\gamma m/2$ neighbors within distance $\epsilon'/2$.

We conclude the proof of Lemma 4.3 by noting that the remaining points in $\hat{S}_{\parallel}$ are all within distance $\epsilon'$ of $\Delta$. Furthermore, any point in $B_{\epsilon'/4}(a_i)$ has more than $p_0\gamma m/2$ points within distance $\epsilon'/2$. Therefore, such points remain in $\hat{S}_{\parallel}$, and any one of them can serve as $\hat{a}_i$, for which $\|a_i - \hat{a}_i\| \leq \epsilon'/4$.

4.3 Proof of Lemma 4.4 - Phase 2

At a high level, we consider two balls around each projected sample point $\hat{x} \in \hat{S}_{\parallel}$ with appropriate choices of radii $r_1 < r_2$ (see Figure 3). Consider the set of projections $\hat{S}_{\parallel}$ when the points in $B_{r_2}(x)$ are removed from it. For points that are far from all $a_i$, this set still includes points that are close to $a_i$ for all topics $i \in [k]$. So the convex hull of $\hat{S}_{\parallel} \setminus B_{r_2}(x)$ is close to $\Delta$ and, in particular, intersects $B_{r_1}(x)$. On the other hand, for $x$ that is close to some $a_i$, $\hat{S}_{\parallel} \setminus B_{r_2}(x)$ does not include an extreme point of $\Delta$ or points close to it. So the convex hull of $\hat{S}_{\parallel} \setminus B_{r_2}(x)$ is considerably smaller than $\Delta$ and, in particular, does not intersect $B_{r_1}(x)$.

[Figure 3: Demonstrating the distinction between points close to $a_i$ and far from $a_i$. The convex hull $\mathrm{CH}(\hat{S}_{\parallel} \setminus B_{r_2}(\hat{x}))$, which is a subset of the blue and gray region, intersects $B_{r_1}(\hat{x})$ only for $\hat{x}$ that is sufficiently far from the $a_i$'s.]

[Figure 4: Parameter $r$ is determined by the geometry of $\Delta$.]

The geometry of the simplex and the angles between $a_1, \ldots, a_k$ play an important role in choosing the appropriate $r_1$ and $r_2$. Note that when the samples are perturbed by noise, $a_1, \ldots, a_k$ can only be approximately recovered if they are sufficiently far apart and the angles of the simplex at each $a_i$ are far from flat. That is, we assume that for all $i \neq j$, $\|a_i - a_j\| \geq 3\epsilon$. Furthermore, define $r \geq 1$ to be the smallest value such that the distance between $a_i$ and $\mathrm{CH}(\Delta \setminus B_{r\epsilon}(a_i))$ is at least $\epsilon$. Note that such a value of $r$ always exists and depends entirely on the angles of the simplex defined by the class vectors. Therefore, the number of samples needed for our method depends on the value of $r$: the smaller the value of $r$, the larger the separation between the topic vectors and the easier it is to identify them. The next claim, whose proof appears in Appendix E.1, demonstrates this concept.

Claim 7. Let $\epsilon' = \epsilon/8r$ and let $\hat{S}_{\parallel}$ be the set of denoised projections, as in Step 6 of Algorithm 2. For any $\hat{x} \in \hat{S}_{\parallel}$ such that $\|\hat{x} - a_i\| > 8r\epsilon'$ for all $i$, we have $\mathrm{dist}(\hat{x}, \mathrm{CH}(\hat{S}_{\parallel} \setminus B_{6r\epsilon'}(\hat{x}))) \leq 2\epsilon'$. Furthermore, for all $i \in [k]$ there exists $\hat{a}_i \in \hat{S}_{\parallel}$ such that $\|\hat{a}_i - a_i\| < \epsilon'$ and $\mathrm{dist}(\hat{a}_i, \mathrm{CH}(\hat{S}_{\parallel} \setminus B_{6r\epsilon'}(\hat{a}_i))) > 2\epsilon'$.

Given the above structure, it is clear that the points in $C$ are all within $\epsilon$ of one of the $a_i$'s. So we can cluster $C$ using single linkage with threshold $16r\epsilon' = 2\epsilon$ to recover each $a_i$ up to accuracy $\epsilon$.

5 Additional Results and Extensions

In this section, we briefly mention some additional results and extensions. We explain these and discuss other extensions (such as alternative noise models) in more detail in Appendix F.

Sample Complexity Lower Bound. As we observed, the number of samples required by our method is $\mathrm{poly}(n)$. However, as the number of classes can be much smaller than the number of features, one might hope to recover $v_1, \ldots, v_k$ with a number of samples that is polynomial in $k$ rather than $n$. Here, we show that in the general case $\Omega(n)$ samples are needed to learn $v_1, \ldots, v_k$ regardless of the value of $k$. See Appendix F for more information.

General function $f(\cdot)$. We also consider the general model described in Section 2, where $f_i(x) = f(v_i \cdot x)$ for an unknown strictly increasing function $f : \mathbb{R}^+ \to [0,1]$ such that $f(0) = 0$. We describe how variations of the techniques discussed up to now extend to this more general setting. See Appendix F for more information.

References

[1] Anima Anandkumar, Yi-Kai Liu, Daniel J. Hsu, Dean P. Foster, and Sham M. Kakade. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917-925, 2012.
[2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773-2832, 2014.
[3] Sanjeev Arora, Rong Ge, Yonatan Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 280-288, 2013.
[4] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization - provably. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC), pages 145-162. ACM, 2012.
[5] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models - going beyond SVD. In Proceedings of the 53rd Symposium on Foundations of Computer Science (FOCS), pages 1-10, 2012.
[6] Maria-Florina Balcan and Avrim Blum. A discriminative model for semi-supervised learning. Journal of the ACM (JACM), 57(3):19, 2010.
[7] Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In Advances in Neural Information Processing Systems, pages 89-96, 2004.
[8] Trapit Bansal, Chiranjib Bhattacharyya, and Ravindran Kannan. A provable SVD-based algorithm for learning topics in dominant admixture corpus. In Advances in Neural Information Processing Systems, pages 1997-2005, 2014.
[9] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.
[10] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Conference on Computational Learning Theory (COLT), pages 92-100. ACM, 1998.
[11] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.
[12] Sanjoy Dasgupta, Michael L. Littman, and David McAllester. PAC generalization bounds for co-training. Advances in Neural Information Processing Systems, 1:375-382, 2002.
[13] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1-46, 1970.
[14] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In Proceedings of the 26th Conference on Computational Learning Theory (COLT), pages 354-375, 2013.
[15] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289-296. Morgan Kaufmann Publishers Inc., 1999.
[16] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Communications of the ACM, 55(2):113-120, 2012.
[17] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 93-102. IEEE, 2010.
[18] Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 159-168. ACM, 1998.
[19] G. W. Stewart. On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634-662, 1977.
[20] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory (Computer Science and Scientific Computing). 1990.
[21] S. Sun. A survey of multi-view machine learning. Neural Computing and Applications, 23:2031-2038, 2013.
[22] Joel A. Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
[23] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[24] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99-111, 1972.
A Omitted Proofs from Section 3 - No Noise

A.1 Proof of Lemma 3.1

For all $j \leq n - k$, let $Z_j = \{(x_i^1 - x_i^2) \mid i \leq \frac{j}{\zeta}\ln\frac{n}{\delta}\}$. We prove by induction that for all $j$, $\mathrm{rank}(Z_j) < j$ with probability at most $\frac{j\delta}{n}$.

For $j = 0$, the claim trivially holds. Now assume that the induction hypothesis holds for some $j$, and furthermore assume that $\mathrm{rank}(Z_j) \geq j$. Then $\mathrm{rank}(Z_{j+1}) < j + 1$ only if the additional $\frac{1}{\zeta}\ln\frac{n}{\delta}$ samples in $Z_{j+1}$ all belong to $\mathrm{span}(Z_j)$. Since the space of such samples has rank $< n - k$, this happens with probability at most $(1 - \zeta)^{\frac{1}{\zeta}\ln\frac{n}{\delta}} \leq \frac{\delta}{n}$. Together with the induction hypothesis that $\mathrm{rank}(Z_j) < j$ with probability at most $\frac{j\delta}{n}$, we have that $\mathrm{rank}(Z_{j+1}) < j + 1$ with probability at most $\frac{(j+1)\delta}{n}$. Therefore, $\mathrm{rank}(Z) = \mathrm{rank}(Z_{n-k}) = n - k$ with probability at least $1 - \delta$.

A.2 Proof of Lemma 3.2

First note that $V$ is the pseudoinverse of $A$, so their spans are equal. Hence, $\sum_{i\in[k]}(v_i \cdot x)\,a_i \in \mathrm{span}\{v_1, \ldots, v_k\}$. It remains to show that $x - \sum_{i\in[k]}(v_i \cdot x)\,a_i \in \mathrm{null}\{v_1, \ldots, v_k\}$. We do so by showing that this vector is orthogonal to $v_j$ for all $j$. We have
$$\left(x - \sum_{i=1}^k (v_i \cdot x)\,a_i\right) \cdot v_j = x \cdot v_j - \sum_{i=1}^k (v_i \cdot x)(a_i \cdot v_j) = x \cdot v_j - \sum_{i \neq j}(v_i \cdot x)(a_i \cdot v_j) - (v_j \cdot x)(a_j \cdot v_j) = x \cdot v_j - x \cdot v_j = 0,$$
where the second equality follows from the fact that when $A^+ = V$, for all $i$, $a_i \cdot v_i = 1$ and $a_j \cdot v_i = 0$ for $j \neq i$. Therefore, $\sum_{i\in[k]}(v_i \cdot x)\,a_i$ is the projection of $x$ on $\mathrm{span}\{v_1, \ldots, v_k\}$.

A.3 Proof of Lemma 3.3

Assume that $S$ includes samples that are purely of type $i$, for all $i \in [k]$. That is, for all $i \in [k]$ there is $j \leq m$ such that $v_i \cdot x_j^1 = v_i \cdot x_j^2 = 1$ and $v_{i'} \cdot x_j^1 = v_{i'} \cdot x_j^2 = 0$ for $i' \neq i$. By Lemma 3.2, the set of projected vectors forms the set $\{\sum_{i=1}^k (v_i \cdot x_j)\,a_i \mid j \in [m]\}$. Note that $\sum_{i=1}^k (v_i \cdot x_j)\,a_i$ is in the simplex with vertices $a_1, \ldots, a_k$. Moreover, for each $i$ there exists a pure sample in $S$ of type $i$. Therefore, $\mathrm{CH}\{\sum_{i=1}^k (v_i \cdot x_j)\,a_i \mid j \in [m]\}$ is the simplex on the linearly independent vertices $a_1, \ldots, a_k$. As a result, $a_1, \ldots, a_k$ are its extreme points.

It remains to prove that with probability $1 - \delta$, the sample set has a document of purely type $j$, for all $j \in [k]$. By the assumption on the probability distribution $P$, with probability at most $(1 - \xi)^m$ there is no document of type purely $j$. Using the union bound, we get the final result.

B Technical Spectral Lemmas

Proposition B.1 ($\sin\theta$ theorem [13]). Let $B, \hat{B} \in \mathbb{R}^{p\times p}$ be symmetric, with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_p$ and $\hat{\lambda}_1 \geq \cdots \geq \hat{\lambda}_p$, respectively. Fix $1 \leq r \leq s \leq p$ and let $V = (v_r, \ldots, v_s)$ and $\hat{V} = (\hat{v}_r, \ldots, \hat{v}_s)$ be the orthonormal eigenvectors corresponding to $\lambda_r, \ldots, \lambda_s$ and $\hat{\lambda}_r, \ldots, \hat{\lambda}_s$. Let $\delta = \inf\{|\hat{\lambda} - \lambda| : \lambda \in [\lambda_s, \lambda_r],\ \hat{\lambda} \in (-\infty, \hat{\lambda}_{s+1}] \cup [\hat{\lambda}_{r-1}, \infty)\} > 0$. Then
$$\|\sin\Theta(V, \hat{V})\|_2 \leq \frac{\|\hat{B} - B\|_2}{\delta},$$
where $\sin\Theta(V, \hat{V}) = P_V - P_{\hat{V}}$, and $P_V$ and $P_{\hat{V}}$ are the projection matrices for $V$ and $\hat{V}$.

Proposition B.2 (Corollary 5.50 of [23]). Consider a Gaussian distribution in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Let $A \in \mathbb{R}^{n\times m}$ be a matrix whose columns are drawn i.i.d. from this distribution, and let $\Sigma_m = \frac{1}{m}AA^\top$. For every $\epsilon \in (0, 1)$ and $t$, if $m \geq c\,n(t/\epsilon)^2$ for some constant $c$, then with probability at least $1 - 2\exp(-t^2 n)$, $\|\Sigma_m - \Sigma\|_2 \leq \epsilon\|\Sigma\|_2$.

Proposition B.3 (Matrix Bernstein [22]). Let $S_1, \ldots, S_n$ be independent, centered random matrices with common dimension $d_1 \times d_2$, and assume that each one is uniformly bounded: $\mathbb{E}S_i = 0$ and $\|S_i\|_2 \leq L$ for all $i \in [n]$. Let $Z = \sum_{i=1}^n S_i$, and let $v(Z)$ denote the matrix variance:
$$v(Z) = \max\left\{\left\|\sum_{i=1}^n \mathbb{E}[S_i S_i^\top]\right\|, \left\|\sum_{i=1}^n \mathbb{E}[S_i^\top S_i]\right\|\right\}.$$
Then,
$$\Pr[\|Z\| \geq t] \leq (d_1 + d_2)\exp\!\left(\frac{-t^2/2}{v(Z) + Lt/3}\right).$$

Proposition B.4 (Theorem 4.10 of [20]). Let $\hat{A} = A + E$, and let $\lambda_1, \ldots, \lambda_n$ and $\lambda_1', \ldots, \lambda_n'$ be the eigenvalues of $A$ and $A + E$, respectively. Then $\max_i|\lambda_i' - \lambda_i| \leq \|E\|_2$.

Proposition B.5 (Theorem 3.3 of [19]). For any $A$ and $B = A + E$,
$$\|B^+ - A^+\| \leq 3\max\{\|A^+\|^2, \|B^+\|^2\}\,\|E\|,$$
where $\|\cdot\|$ is an arbitrary norm.
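Proposition B.5 is what licenses the step from an estimate $\hat{A}$ of $A$ to the estimate $\hat{V} = \hat{A}^+$ of $V$ at the end of Section 4. The snippet below is a quick numerical check of the bound as stated above on random matrices; it is our own sketch, and the dimensions and perturbation size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 4
A = rng.normal(size=(n, k))
E = 1e-3 * rng.normal(size=(n, k))       # a small perturbation, as in B = A + E
B = A + E

lhs = np.linalg.norm(np.linalg.pinv(B) - np.linalg.pinv(A), 2)
rhs = 3 * max(np.linalg.norm(np.linalg.pinv(A), 2),
              np.linalg.norm(np.linalg.pinv(B), 2)) ** 2 * np.linalg.norm(E, 2)
print(lhs <= rhs, lhs, rhs)              # the perturbation of the pseudoinverse is O(||E||)
```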

C Omitted Proofs from Section 4.1 - Phase 1

C.1 Proof of Claim 2

Let $e_i$ and $d_i$ be the $i$th columns of $E$ and $D$. Then $ED^\top = \sum_{i=1}^m e_id_i^\top$ and $DE^\top = \sum_{i=1}^m d_ie_i^\top$. Let
$$S_i = \frac{1}{m}\begin{pmatrix} 0 & e_id_i^\top \\ d_ie_i^\top & 0 \end{pmatrix}.$$
Then $\|\frac{1}{m}DE^\top + \frac{1}{m}ED^\top\|_2 \leq 2\|\sum_{i=1}^m S_i\|_2$. We will use Matrix Bernstein to show that $\sum_{i\in[m]} S_i$ is small with high probability.

First note that the distribution of $e_i$ is a Gaussian centered at $0$, therefore $\mathbb{E}[S_i] = 0$. Furthermore, for each $i$, with probability $1 - \delta$, $\|e_i\|_2 \leq \sigma\sqrt{n\log\frac{1}{\delta}}$. So, with probability $1 - \delta$, for all samples $i \in [m]$, $\|e_i\|_2 \leq \sigma\sqrt{n\log\frac{m}{\delta}}$. Moreover, $\|d_i\| = \|x_i^1 - x_i^2\| \leq 2M$ by assumption. Therefore, with probability $1 - \delta$,
$$L = \max_i\|S_i\|_2 = \frac{1}{m}\max_i\|e_i\|\|d_i\| \leq \frac{\sigma\sqrt{n}M}{m}\,\mathrm{polylog}\frac{n}{\delta}.$$
Note that $\|\mathbb{E}[S_iS_i^\top]\| = \frac{1}{m^2}\|\mathbb{E}[(e_id_i^\top)^2]\| \leq L^2$. Since $S_i$ is Hermitian, the matrix variance defined by the Matrix Bernstein inequality is
$$v(Z) = \max\left\{\left\|\sum_{i=1}^m \mathbb{E}[S_iS_i^\top]\right\|, \left\|\sum_{i=1}^m \mathbb{E}[S_i^\top S_i]\right\|\right\} = \left\|\sum_{i=1}^m \mathbb{E}[S_iS_i^\top]\right\| \leq mL^2.$$
If $\epsilon \leq v(Z)/L$ and $m \in \Omega\!\left(\frac{n\sigma^2M^2}{\epsilon^2}\,\mathrm{polylog}\frac{n}{\delta}\right)$, or $\epsilon \geq v(Z)/L$ and $m \in \Omega\!\left(\frac{\sqrt{n}\sigma M}{\epsilon}\,\mathrm{polylog}\frac{n}{\delta}\right)$, then using the Matrix Bernstein inequality (Proposition B.3) we have
$$\Pr\!\left[\left\|\frac{1}{m}DE^\top + \frac{1}{m}ED^\top\right\| \geq \epsilon\right] = \Pr\!\left[\left\|\sum_{i=1}^m S_i\right\| \geq \frac{\epsilon}{2}\right] \leq \delta.$$

C.2 Proof of Claim 3

Let $d_i$ be the $i$th column of $D$. Then $DD^\top = \sum_{i=1}^m d_id_i^\top$. Let $S_i = \frac{1}{m}d_id_i^\top - \frac{1}{m}\mathbb{E}[d_id_i^\top]$. Then $\|\frac{1}{m}DD^\top - \mathbb{E}[\frac{1}{m}DD^\top]\|_2 = \|\sum_{i=1}^m S_i\|_2$. Since $d_i = x_i^1 - x_i^2$ and $\|x_i^j\| \leq M$, we have that for any $i$, $\|d_id_i^\top - \mathbb{E}[d_id_i^\top]\| \leq 4M^2$. Then
$$L = \max_i\|S_i\|_2 = \frac{1}{m}\max_i\|d_id_i^\top - \mathbb{E}[d_id_i^\top]\|_2 \leq \frac{4M^2}{m},$$
and $\|\mathbb{E}[S_iS_i^\top]\| \leq L^2$. Note that $S_i$ is Hermitian, so the matrix variance is
$$v(Z) = \max\left\{\left\|\sum_{i=1}^m \mathbb{E}[S_iS_i^\top]\right\|, \left\|\sum_{i=1}^m \mathbb{E}[S_i^\top S_i]\right\|\right\} = \left\|\sum_{i=1}^m \mathbb{E}[S_iS_i^\top]\right\| \leq mL^2.$$
If $\delta_0 \leq 4M^2$ and $m \in \Omega\!\left(\frac{M^4}{\delta_0^2}\log\frac{n}{\delta}\right)$, or $\delta_0 \geq 4M^2$ and $m \in \Omega\!\left(\frac{M^2}{\delta_0}\log\frac{n}{\delta}\right)$, then by the Matrix Bernstein inequality (Proposition B.3) we have
$$\Pr\!\left[\left\|\sum_{i=1}^m S_i\right\| \geq \frac{\delta_0}{2}\right] \leq \delta.$$

D Omitted Proofs from Section 4.2 - Denoising

D.1 Proof of Claim 4

Recall that for any $i \in [k]$, with probability $\gamma = g(\epsilon'/(8k\alpha))$ a nearly pure weight vector $w$ is generated from $P$, such that $\|w - e_i\| \leq \epsilon'/(8k\alpha)$. And independently, with probability $p_0$ the point is not noisy. Therefore, there is $p_0\gamma$ density on non-noisy points that are almost purely of class $i$. Note that for such points $x$,
$$\|Px - a_i\| = \left\|\sum_{j=1}^k w_ja_j - a_i\right\| \leq k\left(\frac{\epsilon'}{8k\alpha}\right)\alpha \leq \frac{\epsilon'}{8}.$$
Since $\|P - \hat{P}\| \leq \epsilon'/8M$, we have
$$\|a_i - \hat{P}x\| \leq \|a_i - Px\| + \|Px - \hat{P}x\| \leq \frac{\epsilon'}{8} + \frac{\epsilon'}{8} \leq \frac{\epsilon'}{4}.$$
The claim follows immediately.

E Omitted Proofs from Section 4.3 - Phase 2

E.1 Proof of Claim 7

Recall that by Lemma 4.3, for any $\hat{x} \in \hat{S}_{\parallel}$ there exists $x \in \Delta$ such that $\|\hat{x} - x\| \leq \epsilon'$, and for all $i \in [k]$ there exists $\hat{a}_i \in \hat{S}_{\parallel}$ such that $\|\hat{a}_i - a_i\| \leq \epsilon'$. For the first part, let $x = \sum_i \alpha_ia_i \in \Delta$ be the point corresponding to $\hat{x}$, where the $\alpha_i$ are the coefficients of the convex combination. Furthermore, let $x' = \sum_i \alpha_i\hat{a}_i$. We have
$$\|x' - \hat{x}\| \leq \left\|\sum_{i=1}^k \alpha_i\hat{a}_i - \sum_{i=1}^k \alpha_ia_i\right\| + \|x - \hat{x}\| \leq \max_{i\in[k]}\|\hat{a}_i - a_i\| + \|x - \hat{x}\| \leq 2\epsilon'.$$
The first claim follows from the fact that $\|\hat{x} - a_i\| > 8r\epsilon'$ for all $i$ and, as a result, $x' \in \mathrm{CH}(\hat{S}_{\parallel} \setminus B_{6r\epsilon'}(\hat{x}))$.

Next, note that $B_{4r\epsilon'}(a_i) \subseteq B_{5r\epsilon'}(\hat{a}_i)$. So, by the fact that $\|a_i - \hat{a}_i\| \leq \epsilon'$,
$$\mathrm{dist}(\hat{a}_i, \mathrm{CH}(\Delta \setminus B_{5r\epsilon'}(\hat{a}_i))) \geq \mathrm{dist}(a_i, \mathrm{CH}(\Delta \setminus B_{4r\epsilon'}(a_i))) - \epsilon' \geq 3\epsilon'.$$
Next, we argue that if there is $\hat{x} \in \mathrm{CH}(\hat{S}_{\parallel} \setminus B_{5r\epsilon'}(\hat{a}_i))$, then there exists $x' \in \mathrm{CH}(\Delta \setminus B_{4r\epsilon'}(\hat{a}_i))$ such that $\|x' - \hat{x}\| \leq \epsilon'$. To see this, let $\hat{x} = \sum_i \alpha_i\hat{z}_i$ be a convex combination of $\hat{z}_1, \ldots, \hat{z}_\ell \in \hat{S}_{\parallel} \setminus B_{5r\epsilon'}(\hat{a}_i)$. By Lemma 4.3, there are $z_1, \ldots, z_\ell \in \Delta$ such that $\|z_i - \hat{z}_i\| \leq \epsilon'$ for all $i$. Furthermore, by the proximity of $z_i$ to $\hat{z}_i$ we have that $z_i \notin B_{4r\epsilon'}(\hat{a}_i)$. Therefore, $z_1, \ldots, z_\ell \in \Delta \setminus B_{4r\epsilon'}(\hat{a}_i)$, and $x' = \sum_i \alpha_iz_i$ is within distance $\epsilon'$ of $\hat{x}$.

Using this claim, we have
$$\mathrm{dist}\!\left(\hat{a}_i, \mathrm{CH}(\hat{S}_{\parallel} \setminus B_{6r\epsilon'}(\hat{a}_i))\right) \geq 2\epsilon'.$$

F Additional Results, Extensions, and Open Problems

F.1 Sample Complexity Lower Bound

As we observed, the number of samples required by our method is $\mathrm{poly}(n)$. However, as the number of classes can be much smaller than the number of features, one might hope to recover $v_1, \ldots, v_k$ with a number of samples that is polynomial in $k$ rather than $n$. Here, we show that in the general case $\Omega(n)$ samples are needed to learn $v_1, \ldots, v_k$ regardless of the value of $k$.

For ease of exposition, let $k = 1$ and note that in this case every sample should be purely of one type. Assume that the class vector $v$ is promised to be in the set $C = \{v^j \mid v^j_\ell = 1/\sqrt{2}\ \text{if}\ \ell = 2j-1\ \text{or}\ 2j,\ \text{else}\ v^j_\ell = 0\}$. Consider instances $(x_j^1, x_j^2)$ such that the $\ell$th coordinate of $x_j^1$ is $x_{j\ell}^1 = -1/\sqrt{2}$ if $\ell = 2j-1$ and $1/\sqrt{2}$ otherwise, and $x_{j\ell}^2 = -1/\sqrt{2}$ if $\ell = 2j$ and $1/\sqrt{2}$ otherwise. For a given $(x_j^1, x_j^2)$, we have $v^j \cdot x_j^1 = v^j \cdot x_j^2 = 0$. On the other hand, for all $\ell \neq j$, $v^\ell \cdot x_j^1 = v^\ell \cdot x_j^2 = 1$. Therefore, sample $(x_j^1, x_j^2)$ is consistent with $v = v^\ell$ for any $\ell \neq j$, but not with $v = v^j$. That is, each instance $(x_j^1, x_j^2)$ renders only one candidate of $C$ invalid. Even after observing at most $\frac{n}{2} - 2$ samples of this type, at least 2 possible choices for $v$ remain. So, $\Omega(n)$ samples are indeed needed to find the appropriate $v$.

The next theorem, whose proof appears in Appendix G, generalizes this construction and result to the case of any $k$.

Theorem F.1. For any $k \leq n$, any algorithm that for all $i \in [k]$ learns $v_i'$ such that $\|v_i' - v_i\|_2 \leq 1/\sqrt{2}$ requires $\Omega(n)$ samples.

Note that in the above construction samples have large components in the irrelevant features. It would be interesting to see if this lower bound can be circumvented using additional natural assumptions in this model, such as assuming that the samples have length $\mathrm{poly}(k)$.
F.2 Alternative Noise Models

Consider the problem of recovering $v_1, \ldots, v_k$ in the presence of agnostic noise, where for an $\epsilon$ fraction of the samples $(x^1, x^2)$, $x^1$ and $x^2$ correspond to different mixture weights. Furthermore, assume that the distribution over the instance space is rich enough that any subspace other than $\mathrm{span}\{v_1, \ldots, v_k\}$ is inconsistent with a set of instances of non-negligible density. (This assumption is similar to the richness assumption made in the standard case, where we assume that there is enough "entropy" between the two views of the samples such that even in the non-noisy case the subspace can be uniquely determined by taking the null space of $X^1 - X^2$.) Since the VC dimension of the set of $k$-dimensional subspaces in $\mathbb{R}^n$ is $\min\{k, n-k\}$, from the information-theoretic point of view one can recover $\mathrm{span}\{v_1, \ldots, v_k\}$, as it is the only subspace that is inconsistent with less than an $O(\epsilon)$ fraction of $\tilde{O}(k/\epsilon)$ samples. Furthermore, we can detect and remove any noisy sample for which the two views are not consistent with $\mathrm{span}\{v_1, \ldots, v_k\}$. And finally, we can recover $a_1, \ldots, a_k$ using Phase 2 of Algorithm 1.

In the above discussion, it is clear that once we have recovered $\mathrm{span}\{v_1, \ldots, v_k\}$, denoising and finding the extreme points of the projections can be done in polynomial time. For the problem of recovering a $k$-dimensional nullspace, [14] introduced an efficient algorithm that tolerates agnostic noise up to $\epsilon = O(k/n)$. Furthermore, they provide evidence that this result might be tight. It would be interesting to see whether additional structure present in our model, such as the fact that samples are convex combinations of classes, can allow us to efficiently recover the nullspace in the presence of more noise.

Another interesting open problem is whether it is possible to handle the case of $p_0 = 0$, that is, when every document is affected by Gaussian noise $N(0, \sigma^2 I_n)$ for $\sigma \gg \epsilon$. A simpler form of this problem is as follows. Consider a distribution induced by first drawing $x \sim D$, where $D$ is an arbitrary and unknown distribution over $\Delta = \mathrm{CH}(\{a_1, \ldots, a_k\})$, and taking $\hat{x} = x + N(0, \sigma^2 I_n)$. Can we learn the $a_i$'s within error $\epsilon$ using polynomially many samples? Note that when $D$ is only supported on the corners of $\Delta$, this problem reduces to learning a mixture of Gaussians, for which there is a wealth of literature on estimating Gaussian means and mixture weights [12, 16, 17]. It would be interesting to see under what regimes the $a_i$ (and not necessarily the mixture weights) can be learned when $D$ is an arbitrary distribution over $\Delta$.

F.3 General Function $f(\cdot)$

Consider the general model described in Section 2, where $f_i(x) = f(v_i \cdot x)$ for an unknown strictly increasing function $f : \mathbb{R}^+ \to [0,1]$ such that $f(0) = 0$. We describe how variations of the techniques discussed up to now can extend to this more general setting.

For ease of exposition, consider the non-noisy case. Since $f$ is a strictly increasing function, $f(v_i \cdot x^1) = f(v_i \cdot x^2)$ if and only if $v_i \cdot x^1 = v_i \cdot x^2$. Therefore, we can recover $\mathrm{span}(v_1, \ldots, v_k)$ by the same approach as in Phase 1 of Algorithm 1. Although, by the definition of pseudoinverse matrices, the projection of $x$ is still represented by $x_{\parallel} = \sum_i (v_i \cdot x)a_i$, this is not necessarily a convex combination of the $a_i$'s anymore; the values $v_i \cdot x$ can add up to more than 1, depending on $x$. However, $x_{\parallel}$ is still a non-negative combination of the $a_i$'s. Moreover, the $a_i$'s are linearly independent, so $a_i$ cannot be expressed as a nontrivial non-negative combination of other samples. Therefore, for all $i$, $a_i/\|a_i\|$ can be recovered by taking the extreme rays of the convex cone of the projected samples. So we can recover $v_1, \ldots, v_k$ by taking the pseudoinverse of the $a_i/\|a_i\|$ and re-normalizing the outcome so that $\|v_i\|_2 = 1$. When samples are perturbed by noise, a similar argument that also takes into account the smoothness of $f$ proves similar results.

It would be interesting to see whether a more general class of similarity functions, such as kernels, can also be learned in this context.

G Proof of Theorem F.1 - Lower Bound

For ease of exposition, assume that $n$ is a multiple of $k$. Furthermore, in this proof we adopt the notation $(x_i, x_i')$ to represent the two views of the $i$th sample. For any vector $u \in \mathbb{R}^n$ and $i \in [k]$, we use $(u)_i$ to denote the $i$th $\frac{n}{k}$-dimensional block of $u$, i.e., coordinates $u_{(i-1)\frac{n}{k}+1}, \ldots, u_{i\frac{n}{k}}$.

Consider the $\frac{n}{k}$-dimensional vector $u_j$ such that $u_{j\ell} = 1$ if $\ell = 2j-1$ or $2j$, and $u_{j\ell} = 0$ otherwise. And consider $\frac{n}{k}$-dimensional vectors $z_j$ and $z_j'$ such that $z_{j\ell} = -1$ if $\ell = 2j-1$ and $z_{j\ell} = 1$ otherwise, and $z_{j\ell}' = -1$ if $\ell = 2j$ and $z_{j\ell}' = 1$ otherwise. Consider a setting where $v_i$ is restricted to the set of candidates $C_i = \{v_i^j \mid (v_i^j)_i = u_j/\sqrt{2},\ (v_i^j)_{i'} = 0\ \text{for}\ i' \neq i\}$. In other words, the $\ell$th coordinate of $v_i^j$ is $1/\sqrt{2}$ if $\ell = (i-1)\frac{n}{k} + 2j - 1$ or $(i-1)\frac{n}{k} + 2j$, else 0. Furthermore, consider instances $(x_i^j, x_i'^j)$ such that $(x_i^j)_i = z_j/\sqrt{2}$ and $(x_i'^j)_i = z_j'/\sqrt{2}$, and for all $i' \neq i$, $(x_i^j)_{i'} = (x_i'^j)_{i'} = 0$. In other words,
$$x_i^j = \tfrac{1}{\sqrt{2}}(0, \ldots, 0,\ 1, \ldots, 1,\ -1,\ 1,\ 1, \ldots, 1,\ 0, \ldots, 0),$$
$$x_i'^j = \tfrac{1}{\sqrt{2}}(0, \ldots, 0,\ 1, \ldots, 1,\ 1,\ -1,\ 1, \ldots, 1,\ 0, \ldots, 0),$$
$$v_i^j = \tfrac{1}{\sqrt{2}}(0, \ldots, 0,\ 0, \ldots, 0,\ 1,\ 1,\ 0, \ldots, 0,\ 0, \ldots, 0),$$
where the displayed $\pm 1$ entries sit in coordinates $(i-1)\frac{n}{k} + 2j - 1$ and $(i-1)\frac{n}{k} + 2j$ of the $i$th block, and all coordinates outside the $i$th block are zero.

First note that for any $i, i' \in [k]$ and any $j, j' \in [\frac{n}{2k}]$, $v_i^j \cdot x_{i'}^{j'} = v_i^j \cdot x_{i'}'^{j'}$. That is, the two views of all instances are consistent with each other with respect to all candidate vectors. Furthermore, for any $i$ and $i'$ such that $i \neq i'$, for all $j, j'$, $v_i^j \cdot x_{i'}^{j'} = 0$. Therefore, for any observed sample $(x_i^j, x_i'^j)$, the sample should be purely of type $i$.

For a given $i$, consider all the samples $(x_i^j, x_i'^j)$ observed by the algorithm. Note that $v_i^j \cdot x_i^j = v_i^j \cdot x_i'^j = 0$, and for all $j' \neq j$, $v_i^{j'} \cdot x_i^j = v_i^{j'} \cdot x_i'^j = 1$. Therefore, observing $(x_i^j, x_i'^j)$ only rules out $v_i^j$ as a candidate, while this sample is consistent with the candidates $v_i^{j'}$ for $j' \neq j$. Therefore, even after observing at most $\frac{n}{2k} - 2$ samples of this type, at least 2 possible choices for $v_i$ remain valid. Moreover, the distance between any two $v_i^j, v_i^{j'} \in C_i$ is $\sqrt{2}$. Therefore, $\frac{n}{2k} - 1$ samples are needed to learn $v_i$ to an accuracy better than $\sqrt{2}/2$.

Note that the consistency of the data with $v_{i'}$ is not affected by the samples of type $x_i^j$ observed by the algorithm when $i' \neq i$. So, $\Omega(k \cdot \frac{n}{k}) = \Omega(n)$ samples are required to approximate all the $v_i$'s to an accuracy better than $\sqrt{2}/2$.