A Theoretical Analysis of Contrastive Unsupervised Representation Learning

Sanjeev Arora 1 2, Hrishikesh Khandeparkar 1, Mikhail Khodak 3, Orestis Plevrakis 1, Nikunj Saunshi 1
{arora, hrk, orestisp, nsaunshi}@cs.princeton.edu, [email protected]

Abstract

Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging the availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.

1 Princeton University, Princeton, New Jersey, USA. 2 Institute for Advanced Study, Princeton, New Jersey, USA. 3 Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Copyright 2019 by the authors.

1. Introduction

This paper concerns unsupervised representation learning: using unlabeled data to learn a representation function f such that replacing data point x by feature vector f(x) in new classification tasks reduces the requirement for labeled data. This is distinct from semi-supervised learning, where learning can leverage unlabeled as well as labeled data. (Section 7 surveys other prior ideas and models.)

For images, a proof of existence for broadly useful representations is the output of the penultimate layer (the one before the softmax) of a powerful deep net trained on ImageNet. In natural language processing (NLP), low-dimensional representations of text – called text embeddings – have been computed with unlabeled data (Peters et al., 2018; Devlin et al., 2018). Often the embedding function is trained by using the embedding of a piece of text to predict the surrounding text (Kiros et al., 2015; Logeswaran & Lee, 2018; Pagliardini et al., 2018). Similar methods that leverage similarity of nearby frames in a video clip have had some success for images as well (Wang & Gupta, 2015).

Many of these algorithms are related: they assume access to pairs or tuples (in the form of co-occurrences) of text/images that are more semantically similar than randomly sampled text/images, and their objective forces representations to respect this similarity on average. For instance, in order to learn a representation function f for sentences, a simplified version of what Logeswaran & Lee (2018) minimize is the following

$$\mathbb{E}_{x, x^+, x^-}\left[ -\log\left( \frac{e^{f(x)^T f(x^+)}}{e^{f(x)^T f(x^+)} + e^{f(x)^T f(x^-)}} \right) \right]$$

where (x, x⁺) is a similar pair and x⁻ is presumably dissimilar to x (often chosen to be a random point) and is typically referred to as a negative sample. Though reminiscent of past ideas – e.g. kernel learning, metric learning, co-training (Cortes et al., 2010; Bellet et al., 2013; Blum & Mitchell, 1998) – these algorithms lack a theoretical framework quantifying when and why they work. While it seems intuitive that minimizing such loss functions should lead to representations that capture 'similarity,' formally it is unclear why the learned representations should do well on downstream linear classification tasks – their somewhat mysterious success is often treated as an obvious consequence. To analyze this success, a framework must connect 'similarity' in unlabeled data with the semantic information that is implicitly present in downstream tasks.
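For concreteness, the simplified objective displayed above can be written in a few lines of NumPy. This is only an illustrative sketch of that displayed loss (the representation f, the batching, and all names below are our own), not the implementation of Logeswaran & Lee (2018).

```python
import numpy as np

def simplified_contrastive_loss(f, x, x_pos, x_neg):
    """-log( e^{f(x)^T f(x+)} / (e^{f(x)^T f(x+)} + e^{f(x)^T f(x-)}) ), averaged over a batch."""
    fx, fp, fn = f(x), f(x_pos), f(x_neg)          # each of shape (batch, d)
    pos = np.sum(fx * fp, axis=1)                  # f(x)^T f(x+)
    neg = np.sum(fx * fn, axis=1)                  # f(x)^T f(x-)
    # numerically stable -log softmax over the two scores
    m = np.maximum(pos, neg)
    return np.mean(-(pos - m) + np.log(np.exp(pos - m) + np.exp(neg - m)))

# Toy usage: random vectors standing in for sentence embeddings on (x, x+, x-) triples
rng = np.random.default_rng(0)
f = lambda v: v                                    # identity representation, for illustration only
x, xp, xn = (rng.normal(size=(32, 50)) for _ in range(3))
print(simplified_contrastive_loss(f, x, xp, xn))
```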

We propose the term Contrastive Learning for such methods and provide a new conceptual framework with minimal assumptions (the alternative would be to make assumptions about generative models of data, which is difficult for images and text). Our main contributions are the following:

1. We formalize the notion of semantic similarity by introducing latent classes. Similar pairs are assumed to be drawn from the same latent class. A downstream task is comprised of a subset of these latent classes.

2. Under this formalization, we prove that a representation function f learned from a function class F by contrastive learning has low average linear classification loss if F contains a function with low unsupervised loss. Additionally, we show a generalization bound for contrastive learning that depends on the Rademacher complexity of F. After highlighting inherent limitations of negative sampling, we show sufficient properties of F which allow us to overcome these limitations.

3. Using insights from the above framework, we provide a novel extension of the algorithm that can leverage larger blocks of similar points than pairs, has better theoretical guarantees, and performs better in practice.

Ideally, one would like to show that contrastive learning always gives representations that compete with those learned from the same function class with plentiful labeled data. Our formal framework allows a rigorous study of such questions: we show a simple counterexample that prevents such a blanket statement without further assumptions. However, if the representations are well-concentrated and the mean classifier (Definition 2.1) has good performance, we can show a weaker version of the ideal result (Corollary 5.1.1). Sections 2 and 3 give an overview of the framework and the results, and subsequent sections deal with the analysis. Related work is discussed in Section 7, and Section 8 describes experimental verification and support for our framework.

2. Framework for Contrastive Learning

We first set up notation and describe the framework for unlabeled data and classification tasks that will be essential for our analysis. Let X denote the set of all possible data points. Contrastive learning assumes access to similar data in the form of pairs (x, x⁺) that come from a distribution D_sim, as well as k i.i.d. negative samples x₁⁻, x₂⁻, ..., x_k⁻ from a distribution D_neg that are presumably unrelated to x. Learning is done over F, a class of representation functions f: X → R^d such that ||f(·)|| ≤ R for some R > 0.

Latent Classes

To formalize the notion of semantically similar pairs (x, x⁺), we introduce the concept of latent classes.

Let C denote the set of all latent classes. Associated with each class c ∈ C is a probability distribution D_c over X. Roughly, D_c(x) captures how relevant x is to class c. For example, X could be natural images and c the class "dog" whose associated D_c assigns high probability to images containing dogs and low/zero probabilities to other images. Classes can overlap arbitrarily.² Finally, we assume a distribution ρ over the classes that characterizes how these classes naturally occur in the unlabeled data. Note that we make no assumption about the functional form of D_c or ρ.

² An image of a dog by a tree can appear in both D_dog and D_tree.

Semantic Similarity

To formalize similarity, we assume similar data points x, x⁺ are i.i.d. draws from the same class distribution D_c for some class c picked randomly according to measure ρ. Negative samples are drawn from the marginal of D_sim:

$$D_{sim}(x, x^+) = \mathbb{E}_{c \sim \rho}\left[ D_c(x) D_c(x^+) \right] \qquad (1)$$

$$D_{neg}(x^-) = \mathbb{E}_{c \sim \rho}\left[ D_c(x^-) \right] \qquad (2)$$

Since classes are allowed to overlap and/or be fine-grained, this is a plausible formalization of "similarity." As the identity of the class is not revealed, we call it unlabeled data. Currently, empirical works heuristically identify such similar pairs from co-occurring image or text data.

Supervised Tasks

We now characterize the tasks that a representation function f will be tested on. A (k+1)-way³ supervised task T consists of distinct classes {c₁, ..., c_{k+1}} ⊆ C. The labeled dataset for the task T consists of m i.i.d. draws from the following process: a label c ∈ {c₁, ..., c_{k+1}} is picked according to a distribution D_T, then a sample x is drawn from D_c. Together they form a labeled pair (x, c) with distribution

$$D_T(x, c) = D_c(x) D_T(c) \qquad (3)$$

A key subtlety in this formulation is that the classes in downstream tasks and their associated data distributions D_c are the same as in the unlabeled data. This provides a path to formalizing how capturing similarity in unlabeled data can lead to quantitative guarantees on downstream tasks. D_T is assumed to be uniform for theorems in the main paper.⁴

³ We use k as the number of negative samples later.
⁴ We state and prove the general case in the Appendix.

Evaluation Metric for Representations

The quality of the representation function f is evaluated by its performance on a multi-class classification task T using linear classification. For this subsection, we fix a task T = {c₁, ..., c_{k+1}}. A multi-class classifier for T is a function g: X → R^{k+1} whose output coordinates are indexed by the classes c in task T.

The loss incurred by g on point (x, y) ∈ X × T is defined as ℓ({g(x)_y − g(x)_{y'}}_{y'≠y}), which is a function of a k-dimensional vector of differences in the coordinates. The two losses we consider in this work are the standard hinge loss ℓ(v) = max{0, 1 + max_i{−v_i}} and the logistic loss ℓ(v) = log₂(1 + Σ_i exp(−v_i)) for v ∈ R^k. Then the supervised loss of the classifier g is

$$L_{sup}(\mathcal{T}, g) := \mathbb{E}_{(x,c) \sim D_{\mathcal{T}}}\left[ \ell\left( \{ g(x)_c - g(x)_{c'} \}_{c' \ne c} \right) \right]$$

To use a representation function f with a linear classifier, a matrix W ∈ R^{(k+1)×d} is trained and g(x) = W f(x) is used to evaluate classification loss on tasks. Since the best W can be found by fixing f and training a linear classifier, we abuse notation and define the supervised loss of f on T to be the loss when the best W is chosen for f:

$$L_{sup}(\mathcal{T}, f) = \inf_{W \in \mathbb{R}^{(k+1) \times d}} L_{sup}(\mathcal{T}, W f) \qquad (4)$$

Crucial to our results and experiments will be a specific W whose rows are the means of the representations of each class, which we define below.

Definition 2.1 (Mean Classifier). For a function f and task T = (c₁, ..., c_{k+1}), the mean classifier is W^µ whose c-th row is the mean µ_c of representations of inputs with label c: µ_c := E_{x∼D_c}[f(x)]. We use L^µ_sup(T, f) := L_sup(T, W^µ f) as shorthand for its loss.

Since contrastive learning has access to data with latent class distribution ρ, it is natural to have better guarantees for tasks involving classes that have higher probability in ρ.

Definition 2.2 (Average Supervised Loss). Average loss for a function f on (k+1)-way tasks is defined as

$$L_{sup}(f) := \mathbb{E}_{\{c_i\}_{i=1}^{k+1} \sim \rho^{k+1}} \left[ L_{sup}\big(\{c_i\}_{i=1}^{k+1}, f\big) \,\middle|\, c_i \ne c_j \right]$$

The average supervised loss of its mean classifier is

$$L^{\mu}_{sup}(f) := \mathbb{E}_{\{c_i\}_{i=1}^{k+1} \sim \rho^{k+1}} \left[ L^{\mu}_{sup}\big(\{c_i\}_{i=1}^{k+1}, f\big) \,\middle|\, c_i \ne c_j \right]$$

Contrastive Learning Algorithm

We describe the training objective for contrastive learning: the choice of loss function is dictated by the ℓ used in the supervised evaluation, and k denotes the number of negative samples used for training. Let (x, x⁺) ∼ D_sim and (x₁⁻, ..., x_k⁻) ∼ D_neg^k, as defined in Equations (1) and (2).

Definition 2.3 (Unsupervised Loss). The population loss is

$$L_{un}(f) := \mathbb{E}\left[ \ell\left( \left\{ f(x)^T \big( f(x^+) - f(x_i^-) \big) \right\}_{i=1}^{k} \right) \right] \qquad (5)$$

and its empirical counterpart with M samples (x_j, x_j⁺, x_{j1}⁻, ..., x_{jk}⁻)_{j=1}^M from D_sim × D_neg^k is

$$\widehat{L}_{un}(f) = \frac{1}{M} \sum_{j=1}^{M} \ell\left( \left\{ f(x_j)^T \big( f(x_j^+) - f(x_{ji}^-) \big) \right\}_{i=1}^{k} \right) \qquad (6)$$

Note that, by the assumptions of the framework described above, we can now express the unsupervised loss as

$$L_{un}(f) = \mathbb{E}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim D_{c^+}^2,\; x_i^- \sim D_{c_i^-}}} \left[ \ell\left( \left\{ f(x)^T \big( f(x^+) - f(x_i^-) \big) \right\}_{i=1}^{k} \right) \right]$$

The algorithm to learn a representation function from F is to find a function f̂ ∈ argmin_{f∈F} L̂_un(f) that minimizes the empirical unsupervised loss. This function f̂ can be subsequently used for supervised linear classification tasks. In the following section we proceed to give an overview of our results that stem from this framework.
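To make Definitions 2.1 and 2.3 concrete, here is a minimal NumPy sketch (not the authors' code; the array layout and helper names are our own) of the empirical hinge-loss objective in Equation (6) and of evaluating a representation with the mean classifier.

```python
import numpy as np

def hinge_loss(v):
    # l(v) = max{0, 1 + max_i(-v_i)} applied along the last axis of v
    return np.maximum(0.0, 1.0 + np.max(-v, axis=-1))

def empirical_unsupervised_loss(f, x, x_pos, x_negs):
    # Eq. (6): x has shape (M, ...), x_pos has shape (M, ...), x_negs has shape (M, k, ...)
    fx, fpos = f(x), f(x_pos)                                            # (M, d)
    fnegs = np.stack([f(x_negs[:, i]) for i in range(x_negs.shape[1])], axis=1)  # (M, k, d)
    diffs = np.einsum('md,mkd->mk', fx, fpos[:, None, :] - fnegs)        # f(x)^T (f(x+) - f(x_i^-))
    return hinge_loss(diffs).mean()

def mean_classifier_loss(f, labeled_x, labeled_y, task_classes):
    # Definition 2.1: W^mu has one row per class, the mean representation of that class.
    # `task_classes` is a list of class ids; every class is assumed to appear in `labeled_y`.
    feats = f(labeled_x)                                                 # (n, d)
    W = np.stack([feats[labeled_y == c].mean(axis=0) for c in task_classes])
    scores = feats @ W.T                                                 # (n, k+1)
    losses = []
    for i, c in enumerate(labeled_y):
        ci = task_classes.index(c)
        losses.append(hinge_loss(scores[i, ci] - np.delete(scores[i], ci)))
    return float(np.mean(losses))
```

Here f is any callable mapping a batch of inputs to d-dimensional representations; in our experiments it would be a deep network, but the sketch works with any NumPy-compatible map.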

6= low, then fbhas high supervised performance. Both Lun(f) Remark. The complexity measure RS(F) is tightly related and s(f) can potentially be made small for rich enough F. to the labeled sample complexity of the classification tasks. For the function class G = {wT f(·)|f ∈ F, kwk ≤ 1} Ideally, however, one would want to show that fbcan com- that one would use to solve a binary task from scratch using pete on classification tasks with every f ∈ F labeled data, it can be shown that RS (F) ≤ dRS (G), where RS (G) is the usual Rademacher complexity of G on (Ideal Result): Lsup(fb) ≤ αLsup(f) + η GenM (7) S (Definition 3.1 from (Mohri et al., 2018)). Unfortunately, we show in Section 5.1 that the algorithm can pick something far from the optimal f. However, we We state two key lemmas needed to prove the theorem. extend Theorem 4.5 to a bound similar to (7) (where the Lemma 4.2. With probability at least 1−δ over the training classification is done using the mean classifier) under as- set S, for all f ∈ F sumptions about the intraclass concentration of f and about its mean classifier having high margin. Lun(fb) ≤ Lun(f) + GenM Sections 6.1 and 6.2 extend our results to the more compli- We prove Lemma 4.2 in Appendix A.3. cated setting where the algorithm uses k negative samples (5) and note an interesting behavior: increasing the num- Lemma 4.3. For all f ∈ F ber of negative samples beyond a threshold can hurt the µ 1 performance. In Section 6.3 we show a novel extension of L (f) ≤ (Lun(f) − τ) sup (1 − τ) the algorithm that utilizes larger blocks of similar points. Finally, we perform controlled experiments in Section8 to Proof. The key idea in the proof is the use of Jensen’s in- validate components of our framework and corroborate our equality. Unlike the unsupervised loss which uses a random suspicion that the mean classifier of representations learned point from a class as a classifier, using the mean of the using labeled data has good classification performance. class as the classifier should only make the loss lower. Let µc = E f(x) be the mean of the class c. 4. Guaranteed Average Binary Classification x∼Dc  T + −  To provide the main insights, we prove the algorithm’s guar- Lun(f) = `(f(x) (f(x ) − f(x ))) + E (x,x )∼Dsim antee when we use only 1 negative sample (k = 1). For − µ x ∼Dneg this section, let Lsup(f) and Lsup(f) be as in Definition =(a) `(f(x)T (f(x+) − f(x−))) 2.2 for binary tasks. We will refer to the two classes in the + E− 2 + E + − c ,c ∼ρ x ∼Dc+ supervised task as well as the unsupervised loss as c , c . − x∼Dc+ x ∼D + − M c− Let S = {xj, x , x } be our training set sampled from j j j=1 (b)  T  ≥ `(f(x) (µc+ − µc− )) the distribution D × D and f ∈ arg min L (f). + E− 2 E sim neg b f∈F bun c ,c ∼ρ x∼Dc+ (c) µ + − + − = (1 − τ) E [Lsup({c , c }, f)|c 6= c ] + τ 4.1. Upper Bound using Unsupervised Loss c+,c−∼ρ2 + −  3dM (d) µ Let f = f (x ), f (x ), f (x ) ∈ be = (1 − τ)Lsup(f) + τ |S t j t j t j j∈[M],t∈[d] R the restriction on S for any f ∈ F. 
Then, the statistical where (a) follows from the definitions in (1) and (2), (b) complexity measure relevant to the estimation of the repre- follows from the convexity of ` and Jensen’s inequality by sentations is the following Rademacher average taking the expectation over x+, x− inside the function, (c)   + − RS (F) = E suphσ, f|S i follows by splitting the expectation into the cases c = c σ∼{±1}3dM f∈F and c+ 6= c−, from symmetry in c+ and c− in sampling 0 and since classes in tasks are uniformly distributed (general Let τ = E 1{c = c } be the probability that two classes c,c0∼ρ2 distributions are handled in Appendix B.1). Rearranging sampled independently from ρ are the same. terms completes the proof. Theorem 4.1. With probability at least 1 − δ, for all f ∈ F Proof of Theorem 4.1. The result follows directly by apply- µ 1 1 L (fb) ≤ (Lun(f) − τ) + GenM sup (1 − τ) (1 − τ) ing Lemma 4.3 for fband finishing up with Lemma 4.2. where One could argue that if F is rich enough such that Lun can  s  1 be made small, then Theorem 4.1 suffices. However, in the RS (F) 2 log δ GenM = O R + R  next section we explain that unless τ  1, this may not M M always be possible and we show one way to alleviate this. Contrastive Unsupervised Representation Learning
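The following is a small synthetic Monte Carlo check of Lemma 4.3; the Gaussian classes, the fixed normalization map standing in for f, and all numerical choices are our own illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, n = 10, 16, 200000          # latent classes, representation dimension, Monte Carlo samples
means = rng.normal(size=(C, d))   # toy classes: D_c = N(mu_c, 0.3^2 I); rho is uniform, so tau = 1/C

def draw(classes):
    # one sample from D_c for each class id in `classes`
    return means[classes] + 0.3 * rng.normal(size=(len(classes), d))

f = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)   # a fixed unit-norm representation
hinge = lambda t: np.maximum(0.0, 1.0 - t)                    # binary hinge loss l(v) = max{0, 1 - v}

# Unsupervised loss L_un(f): c+, c- ~ rho^2, then x, x+ ~ D_{c+} and x- ~ D_{c-}
cp, cm = rng.integers(C, size=n), rng.integers(C, size=n)
x, xp, xm = f(draw(cp)), f(draw(cp)), f(draw(cm))
L_un = hinge(np.sum(x * (xp - xm), axis=1)).mean()
tau = 1.0 / C

# Mean classifier loss on average binary tasks {c+, c-} with c+ != c-  (Definitions 2.1 and 2.2)
mu = np.stack([f(draw(np.full(5000, c))).mean(axis=0) for c in range(C)])
mask = cp != cm
L_mu_sup = hinge(np.sum(x[mask] * (mu[cp[mask]] - mu[cm[mask]]), axis=1)).mean()

# Lemma 4.3 predicts L_mu_sup <= (L_un - tau) / (1 - tau), up to Monte Carlo error
print(L_mu_sup, (L_un - tau) / (1 - tau))
```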

4.2. Price of Negative Sampling: Class Collision

Note first that the unsupervised loss can be decomposed as

$$L_{un}(f) = \tau L^{=}_{un}(f) + (1-\tau) L^{\ne}_{un}(f) \qquad (8)$$

where L^{≠}_un(f) is the loss suffered when the similar pair and the negative sample come from different classes,

$$L^{\ne}_{un}(f) = \mathbb{E}_{\substack{c^+, c^- \sim \rho^2 \\ x, x^+ \sim D_{c^+}^2,\; x^- \sim D_{c^-}}} \left[ \ell\big(f(x)^T (f(x^+) - f(x^-))\big) \,\middle|\, c^+ \ne c^- \right]$$

and L^{=}_un(f) is when they come from the same class. Let ν be a distribution over C with ν(c) ∝ ρ²(c); then

$$L^{=}_{un}(f) = \mathbb{E}_{\substack{c \sim \nu \\ x, x^+, x^- \sim D_c^3}} \left[ \ell\big(f(x)^T (f(x^+) - f(x^-))\big) \right] \ge \mathbb{E}_{c \sim \nu,\, x \sim D_c} \left[ \ell\big(f(x)^T (\mu_c - \mu_c)\big) \right] = 1$$

by Jensen's inequality again, which implies L_un(f) ≥ τ. In general, without any further assumptions on f, L_un(f) can be far from τ, rendering the bound in Theorem 4.1 useless. However, as we will show, the magnitude of L^{=}_un(f) can be controlled by the intraclass deviation of f. Let Σ(f, c) be the covariance matrix of f(x) when x ∼ D_c. We define a notion of intraclass deviation as follows:

$$s(f) := \mathbb{E}_{c \sim \nu} \left[ \sqrt{\|\Sigma(f,c)\|_2}\; \mathbb{E}_{x \sim D_c} \|f(x)\| \right] \qquad (9)$$

Lemma 4.4. For all f ∈ F,

$$L^{=}_{un}(f) - 1 \le c' s(f)$$

where c' is a positive constant.

We prove Lemma 4.4 in Appendix A.1. Theorem 4.1 combined with Equation (8) and Lemma 4.4 gives the following result.

Theorem 4.5. With probability at least 1 − δ, ∀f ∈ F

$$L_{sup}(\hat{f}) \le L^{\mu}_{sup}(\hat{f}) \le L^{\ne}_{un}(f) + \beta s(f) + \eta\, \mathrm{Gen}_M$$

where β = c' τ/(1−τ), η = 1/(1−τ) and c' is a constant.

The above bound highlights two sufficient properties of the function class for unsupervised learning to work: when the function class F is rich enough to contain some f with low βs(f) as well as low L^{≠}_un(f), then f̂, the empirical minimizer of the unsupervised loss – learned using a sufficiently large number of samples – will have good performance on supervised tasks (low L_sup(f̂)).

5. Towards Competitive Guarantees

We provide intuition and counter-examples for why contrastive learning does not always pick the best supervised representation f ∈ F and show how our bound captures these. Under additional assumptions, we show a competitive bound where classification is done using the mean classifier.

5.1. Limitations of contrastive learning

The bound provided in Theorem 4.5 might not appear as the most natural guarantee for the algorithm. Ideally one would like to show a bound like the following: for all f ∈ F,

$$\text{(Ideal 1):} \quad L_{sup}(\hat{f}) \le \alpha L_{sup}(f) + \eta\, \mathrm{Gen}_M \qquad (10)$$

for constants α, η and generalization error Gen_M. This guarantees that f̂ is competitive against the best f on the average binary classification task. However, the bound we prove has the following form: for all f ∈ F,

$$L^{\mu}_{sup}(\hat{f}) \le \alpha L^{\ne}_{un}(f) + \beta s(f) + \eta\, \mathrm{Gen}_M$$

To show that this discrepancy is not an artifact of our analysis but rather stems from limitations of the algorithm, we present two examples in Figure 1. Our bound appropriately captures these two issues individually, owing to the large values of L^{≠}_un(f) or s(f) in each case, for the optimal f.

In Figure 1a, we see that there is a direction on which f₁ can be projected to perfectly separate the classes. Since the algorithm takes inner products between the representations, it inevitably considers the spurious components along the orthogonal directions. This issue manifests in our bound as the term L^{≠}_un(f₁) being high even when s(f₁) = 0. Hence, contrastive learning will not always work when the only guarantee we have is that F can make L_sup small.

This should not be too surprising, since we show a relatively strong guarantee – a bound on L^µ_sup for the mean classifier of f̂. This suggests a natural stronger assumption that F can make L^µ_sup small (which is observed experimentally in Section 8 for function classes of interest) and raises the question of showing a bound that looks like the following: for all f ∈ F,

$$\text{(Ideal 2):} \quad L^{\mu}_{sup}(\hat{f}) \le \alpha L^{\mu}_{sup}(f) + \eta\, \mathrm{Gen}_M \qquad (11)$$

without accounting for any intraclass deviation – recall that s(f) captures a notion of this deviation in our bound. However this is not true: high intraclass deviation may not imply high L^µ_sup(f), but can make L^{=}_un(f) (and thus L_un(f)) high, resulting in the failure of the algorithm. Consequently, the term s(f) also increases while L^{≠}_un does not necessarily have to. This issue, apparent in Figure 1b, shows that a guarantee like (11) cannot be shown without further assumptions.
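To make the quantities in Equations (8)–(9) concrete, here is a small NumPy sketch (our own toy setup, not the paper's code) that estimates τ and the intraclass deviation s(f) for Gaussian-like classes under an illustrative representation.

```python
import numpy as np

rng = np.random.default_rng(1)
C, d = 10, 16
rho = rng.dirichlet(np.ones(C))            # an arbitrary class distribution (our choice)
means = rng.normal(size=(C, d))
spreads = rng.uniform(0.1, 1.0, size=C)    # per-class scale, so intraclass deviation varies

def f_samples(c, m):
    # representations of m draws from D_c under an identity representation (illustrative only)
    return means[c] + spreads[c] * rng.normal(size=(m, d))

# tau = P[c = c'] for c, c' ~ rho^2, and nu(c) proportional to rho(c)^2   (Section 4.2)
tau = float(np.sum(rho ** 2))
nu = rho ** 2 / tau

# s(f) = E_{c~nu}[ sqrt(||Sigma(f,c)||_2) * E_{x~D_c} ||f(x)|| ]           (Eq. 9)
s_f = 0.0
for c in range(C):
    feats = f_samples(c, 20000)
    cov_norm = np.linalg.norm(np.cov(feats, rowvar=False), 2)   # spectral norm of Sigma(f, c)
    s_f += nu[c] * np.sqrt(cov_norm) * np.linalg.norm(feats, axis=1).mean()

print(f"tau = {tau:.3f}, s(f) = {s_f:.3f}")  # both enter the bound via beta = c' tau / (1 - tau)
```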

[Figure 1: two toy examples; panel (a) "Mean is bad", panel (b) "High intraclass variance".]

Figure 1. In both examples we have a uniform distribution over classes C = {c₁, c₂}; blue and red points are in c₁ and c₂ respectively, and D_{c_i} is uniform over the points of c_i. In the first figure we have one point per class, while in the second we have two points per class. Let F = {f₀, f₁}, where f₀ maps all points to (0, 0) and f₁ is defined in the figure. In both cases, using the hinge loss, L_sup(f₁) = 0, L_sup(f₀) = 1, and in the second case L^µ_sup(f₁) = 0. However, in both examples the algorithm will pick f₀, since L_un(f₀) = 1 but L_un(f₁) = Ω(r²).

5.2. Competitive Bound via Intraclass Concentration

We saw that L^µ_sup(f) being small does not imply low L^µ_sup(f̂) if f is not concentrated within the classes. In this section we show that when there is an f that has intraclass concentration in a strong sense (sub-Gaussianity) and can separate classes with high margin (on average) with the mean classifier, then L^µ_sup(f̂) will be low.

Let ℓ_γ(x) = (1 − x/γ)₊ be the hinge loss with margin γ and L^µ_{γ,sup}(f) be L^µ_sup(f) with the loss function ℓ_γ.

Lemma 5.1. For f ∈ F, if the random variable f(X), where X ∼ D_c, is σ²-sub-Gaussian in every direction for every class c and has maximum norm R = max_{x∈X} ||f(x)||, then for all ε > 0,

$$L^{\ne}_{un}(f) \le \gamma L^{\mu}_{\gamma, sup}(f) + \epsilon$$

where γ = 1 + c' R σ √(log(R/ε)) and c' is some constant.

The proof of Lemma 5.1 is provided in Appendix A.2. Using Lemma 5.1 and Theorem 4.5, we get the following:

Corollary 5.1.1. For all ε > 0, with probability at least 1 − δ, for all f ∈ F,

$$L^{\mu}_{sup}(\hat{f}) \le \gamma(f)\, L^{\mu}_{\gamma(f), sup}(f) + \beta s(f) + \eta\, \mathrm{Gen}_M + \epsilon$$

where γ(f) is as defined in Lemma 5.1, β = c' τ/(1−τ), η = 1/(1−τ) and c' is a constant.

6. Multiple Negative Samples and Block Similarity

In this section we explore two extensions to our analysis. First, in Section 6.1, inspired by empirical works like Logeswaran & Lee (2018) that often use more than one negative sample for every similar pair, we show provable guarantees for this case by careful handling of class collision. Additionally, in Section 6.2 we show simple examples where increasing negative samples beyond a certain threshold can hurt contrastive learning. Second, in Section 6.3, we explore a modified algorithm that leverages access to blocks of similar data, rather than just pairs, and show that it has stronger guarantees as well as performs better in practice.

6.1. Guarantees for k Negative Samples

Here the algorithm utilizes k negative samples x₁⁻, ..., x_k⁻ drawn i.i.d. from D_neg for every positive sample pair x, x⁺ drawn from D_sim, and minimizes (6). As in Section 4, we prove a bound for f̂ of the following form:

Theorem 6.1 (Informal version). For all f ∈ F

$$L_{sup}(\hat{f}) \le L^{\mu}_{sup}(\hat{f}) \le \alpha L^{\ne}_{un}(f) + \beta s(f) + \eta\, \mathrm{Gen}_M$$

where L^{≠}_un(f) and Gen_M are extensions of the corresponding terms from Section 4 and s(f) remains unchanged. The formal statement of the theorem and its proof appear in Appendix B.1. The key differences from Theorem 4.5 are β and the distribution of tasks in L_sup that we describe below. The coefficient β of s(f) increases with k, e.g. when ρ is uniform and k ≪ |C|, β ≈ k/|C|.

The average supervised loss that we bound is

$$L_{sup}(\hat{f}) := \mathbb{E}_{\mathcal{T} \sim \mathcal{D}} \left[ L_{sup}(\mathcal{T}, \hat{f}) \right]$$

where D is a distribution over tasks, defined as follows: sample k+1 classes c⁺, c₁⁻, ..., c_k⁻ ∼ ρ^{k+1}, conditioned on the event that c⁺ does not also appear as a negative sample. Then, set T to be the set of distinct classes in {c⁺, c₁⁻, ..., c_k⁻}. L^µ_sup(f̂) is defined analogously by using L^µ_sup(T, f̂).

Remark. Bounding the above L_sup(f̂) directly gives a bound for the average (k+1)-wise classification loss from Definition 2.2, since the latter is at most L_sup(f̂)/p, where p is the probability that the k+1 sampled classes are distinct. For k ≪ |C| and ρ ≈ uniform, these metrics are almost equal.

We also extend our competitive bound from Section 5.2 for the above f̂ in Appendix B.2.

6.2. Effect of Excessive Negative Sampling

The standard belief is that increasing the number of negative samples always helps, at the cost of increased computational costs. In fact, for Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), which is invoked to explain the success of negative sampling, increasing negative samples has been shown to provably improve the asymptotic variance of the learned parameters. However, we find that such a phenomenon does not always hold for contrastive learning – larger k can hurt performance for the same inherent reasons highlighted in Section 5.1, as we illustrate next.

When ρ is close to uniform and the number of negative samples is k = Ω(|C|), frequent class collisions can prevent the unsupervised algorithm from learning the representation f ∈ F that is optimal for the supervised problem. In this case, owing to the contribution of s(f) being high, a large number of negative samples could hurt. This problem, in fact, can arise even when the number of negative samples is much smaller than the number of classes. For instance, if the best representation function f ∈ F groups classes into t "clusters" (this can happen when F is not rich enough), such that f cannot contrast well between classes from the same cluster, then L^{≠}_un will contribute to the unsupervised loss being high even when k = Ω(t). We illustrate, by examples, how these issues can lead to picking a suboptimal f̂ in Appendix C. Experimental results in Figures 2a and 2b also suggest that larger numbers of negative samples hurt performance beyond a threshold, confirming our suspicions.

6.3. Blocks of Similar Points

Often a dataset consists of blocks of similar data instead of just pairs: a block consists of x₀, x₁, ..., x_b that are i.i.d. draws from a class distribution D_c for a class c ∼ ρ. In text, for instance, paragraphs can be thought of as a block of sentences sampled from the same latent class. How can an algorithm leverage this additional structure?

We propose an algorithm that uses two blocks: one of positive samples x, x₁⁺, ..., x_b⁺ that are i.i.d. samples from c⁺ ∼ ρ, and another one of negative samples x₁⁻, ..., x_b⁻ that are i.i.d. samples from c⁻ ∼ ρ. Our proposed algorithm then minimizes the following loss:

$$L^{block}_{un}(f) := \mathbb{E}\left[ \ell\left( f(x)^T \left( \frac{\sum_i f(x_i^+)}{b} - \frac{\sum_i f(x_i^-)}{b} \right) \right) \right] \qquad (12)$$

To understand why this loss function makes sense, recall that the connection between L^µ_sup and L_un was made in Lemma 4.3 by applying Jensen's inequality. Thus, the algorithm that uses the average of the positive and negative samples in blocks as a proxy for the classifier, instead of just one point each, should have a strictly better bound owing to Jensen's inequality getting tighter. We formalize this intuition below. Let τ be as defined in Section 4.

Proposition 6.2. ∀f ∈ F

$$L_{sup}(f) \le \frac{1}{1-\tau}\big( L^{block}_{un}(f) - \tau \big) \le \frac{1}{1-\tau}\big( L_{un}(f) - \tau \big)$$

This bound tells us that L^block_un is a better surrogate for L_sup, making it a more attractive choice than L_un when larger blocks are available (a rigorous comparison of the generalization errors is left for future work). The algorithm can be extended, analogously to Equation (5), to handle more than one negative block. Experimentally we find that minimizing L^block_un instead of L_un can lead to better performance, and our results are summarized in Section 8.2. We defer the proof of Proposition 6.2 to Appendix A.4.
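A minimal NumPy sketch of the block objective in Equation (12) follows (the array layout and helper names are our own, not the authors' implementation); with b = 1 it reduces to the pairwise loss of Definition 2.3 with one negative sample.

```python
import numpy as np

def hinge(t):
    return np.maximum(0.0, 1.0 - t)

def block_unsupervised_loss(f, x, pos_block, neg_block):
    """Empirical version of Eq. (12).

    x:         (M, ...)    anchor points
    pos_block: (M, b, ...) b additional points from the same class as each anchor
    neg_block: (M, b, ...) b points from a (randomly drawn) negative class
    """
    fx = f(x)                                                    # (M, d)
    fpos = np.stack([f(pos_block[:, i]) for i in range(pos_block.shape[1])], axis=1)
    fneg = np.stack([f(neg_block[:, i]) for i in range(neg_block.shape[1])], axis=1)
    # block averages act as a proxy for the class means in Lemma 4.3
    proxy = fpos.mean(axis=1) - fneg.mean(axis=1)                # (M, d)
    return hinge(np.sum(fx * proxy, axis=1)).mean()
```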

7. Related Work

The contrastive learning framework is inspired by several empirical works, some of which were mentioned in the introduction. The use of co-occurring words as semantically similar points and negative sampling for learning word embeddings was introduced in Mikolov et al. (2013). Subsequently, similar ideas have been used by Logeswaran & Lee (2018) and Pagliardini et al. (2018) for sentence representations and by Wang & Gupta (2015) for images. Notably, the sentence representations learned by the quick thoughts (QT) method in Logeswaran & Lee (2018) that we analyze have state-of-the-art results on many text classification tasks. Previous attempts have been made to explain negative sampling (Dyer, 2014) using the idea of Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), which relies on the assumption that the data distribution belongs to some known parametric family. This assumption enables them to consider a broader class of distributions for negative sampling. The mean classifier that appears in our guarantees is of significance in meta-learning and is a core component of ProtoNets (Snell et al., 2017).

Our data model for similarity is reminiscent of the one in co-training (Blum & Mitchell, 1998). They assume access to pairs of "views" with the same label that are conditionally independent given the label. Our unlabeled data model can be seen as a special case of theirs, where the two views have the same conditional distributions. However, they additionally assume access to some labeled data (semi-supervised), while we learn representations using only unlabeled data, which can subsequently be used for classification when labeled data is presented. Two-stage kernel learning (Cortes et al., 2010; Kumar et al., 2012) is similar in this sense: in the first stage, a positive linear combination of some base kernels is learned and is then used for classification in the second stage; they assume access to labels in both stages. Similarity/metric learning (Bellet et al., 2012; 2013) learns a linear feature map that gives low distance to similar points and high distance to dissimilar ones. While they identify dissimilar pairs using labels, due to lack of labels we resort to negative sampling and pay the price of class collision. While these works analyze linear function classes, we can handle arbitrarily powerful representations. Learning of representations that are broadly useful on a distribution of tasks is done in multitask learning, specifically in the learning-to-learn model (Maurer et al., 2016), but using labeled data.

Recently Hazan & Ma (2016) proposed "assumption-free" methods for representation learning via MDL/compression arguments, but do not obtain any guarantees comparable to ours on downstream classification tasks. As noted by Arora & Risteski (2017), this compression approach has to preserve all input information (e.g. preserve every pixel of the image), which seems suboptimal.

8. Experimental Results

We report experiments in text and vision domains supporting our theory. Since contrastive learning has already been shown to obtain state-of-the-art results on text classification by quick thoughts (QT) in Logeswaran & Lee (2018), most of our experiments are conducted to corroborate our theoretical analysis. We also show that our extension to similarity blocks in Section 6.3 can improve QT on a real-world task.

Datasets: Two datasets were used in the controlled experiments. (1) The CIFAR-100 dataset (Krizhevsky, 2009), consisting of 32x32 images categorized into 100 classes with a 50000/10000 train/test split. (2) Lacking an appropriate NLP dataset with a large number of classes, we create the Wiki-3029 dataset, consisting of 3029 Wikipedia articles as the classes and 200 sentences from each article as samples. The train/dev/test split is 70%/10%/20%. To test our method on a more standard task, we also use the unsupervised part of the IMDb review corpus (Maas et al., 2011), which consists of 560K sentences from 50K movie reviews. Representations trained using this corpus are evaluated on the supervised IMDb binary classification task, consisting of training and testing sets with 25K reviews each.

8.1. Controlled Experiments

To simulate the data generation process described in Section 2, we generate similar pairs (blocks) of data points by sampling from the same class. Dissimilar pairs (negative samples) are selected randomly. Contrastive learning was done using our objective (5) and compared to the performance of standard supervised training, with both using the same architecture for the representation f. For CIFAR-100 we use VGG-16 (Simonyan & Zisserman, 2014) with an additional 512x100 linear layer added at the end to make the final representations 100-dimensional, while for Wiki-3029 we use a Gated Recurrent Network (GRU) (Chung et al., 2015) with output dimension 300 and fix the word embedding layer with pretrained GloVe embeddings (Pennington et al., 2014). The unsupervised model for CIFAR-100 is trained with 500 blocks of size 2 with 4 negative samples, and for Wiki-3029 we use 20 blocks of size 10 with 8 negative samples. We test (1) learned representations on average tasks by using the mean classifier and compare to representations trained using labeled data; (2) the effect of various parameters like the amount of unlabeled data (N)⁷, number of negative samples (k) and block size (b) on representation quality; (3) whether the supervised loss tracks the unsupervised loss as suggested by Theorem 4.1; (4) performance of the mean classifier of the supervised model.

⁷ If we used M similar blocks of size b and k negative blocks for each similar block, N = Mb(k+1). In practice, however, we reuse the blocks for negative sampling and lose the factor of k+1.
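The mean-classifier evaluation used for the µ and µ-5 columns of Table 1 can be summarized by the following sketch; it is our own reconstruction of the protocol described above, not the authors' evaluation code, and the helper name and defaults are illustrative.

```python
import numpy as np

def avg_k_accuracy(feats, labels, k=2, n_labeled=5, n_tasks=500, rng=None):
    """Estimate AVG-k accuracy of the mean classifier (Definition 2.1).

    feats:     (n, d) representations of a held-out labeled set
    labels:    (n,)   class ids (each class assumed to have >= n_labeled examples)
    n_labeled: labeled points per class used to estimate each mean (5 mimics the mu-5 column)
    """
    rng = rng or np.random.default_rng(0)
    classes = np.unique(labels)
    correct, total = 0, 0
    for _ in range(n_tasks):
        task = rng.choice(classes, size=k, replace=False)        # a random k-way task
        mus = []
        for c in task:
            idx = rng.choice(np.where(labels == c)[0], size=n_labeled, replace=False)
            mus.append(feats[idx].mean(axis=0))                  # estimated class mean
        W = np.stack(mus)                                        # rows of the mean classifier
        test_idx = np.isin(labels, task)
        preds = task[np.argmax(feats[test_idx] @ W.T, axis=1)]
        correct += int(np.sum(preds == labels[test_idx]))
        total += int(test_idx.sum())
    return correct / total
```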

Results: These appear in Table 1. For Wiki-3029 the unsupervised performance is very close to the supervised performance in all respects, while for CIFAR-100 the avg-k performance is respectable, rising to good for binary classification. One surprise is that the mean classifier, central to our analysis of unsupervised learning, performs well also with representations learned by supervised training on CIFAR-100. Even the mean computed from just 5 labeled samples performs well, getting within 2% accuracy of the 500-sample mean classifier on CIFAR-100. This suggests that representations learnt by standard supervised deep learning are actually quite concentrated. We also notice that the supervised representations have fairly low unsupervised training loss (as low as 0.4), even though the optimization is minimizing a different objective.

Table 1. Performance of supervised and unsupervised representations on average k-wise classification tasks (AVG-k) and, for comparison, on full multiclass classification (TOP-R), which is not covered by our theory. The classifier can have a trained output layer (TR) or the mean classifier (µ) of Definition 2.1, with µ-5 indicating that the mean was computed using only 5 labeled examples.

                         SUPERVISED              UNSUPERVISED
                     TR     µ     µ-5         TR     µ     µ-5
WIKI-3029  AVG-2    97.8   97.7   97.0       97.3   97.7   96.9
           AVG-10   89.1   87.2   83.1       88.4   87.4   83.5
           TOP-10   67.4   59.0   48.2       64.7   59.0   45.8
           TOP-1    43.2   33.2   21.7       38.7   30.4   17.0
CIFAR-100  AVG-2    97.2   95.9   95.8       93.2   92.0   90.6
           AVG-5    92.7   89.8   89.4       80.9   79.4   75.7
           TOP-5    88.9   83.5   82.5       70.4   65.6   59.0
           TOP-1    72.1   69.9   67.3       36.9   31.8   25.0

To measure the sample complexity benefit provided by contrastive learning, we train the supervised model on just a 10% fraction of the dataset and compare it with an unsupervised model trained on unlabeled data whose mean classifiers are computed using the same amount of labeled data. We find that the unsupervised model beats the supervised model by almost 4% on the 100-way task and by 5% on the average binary task when only 50 labeled samples are used.

Figure 2 highlights the positive effect of increasing the number of negative samples as well as the amount of data used by the unsupervised algorithm. In both cases, using a lot of negative examples stops helping after a point, confirming our suspicions in Section 6.2. We also demonstrate how the supervised loss tracks the unsupervised test loss in Figure 2c.

[Figure 2: plots omitted; panels show accuracy vs. amount of unlabeled data for various numbers of negative samples k on (a) CIFAR-100 and (b) Wiki-3029, and unsupervised train/test loss and supervised loss vs. epoch on (c) Wiki-3029.]

Figure 2. Effect of the amount of unlabeled data and the number of negative samples on unsupervised representations, measured on binary classification for CIFAR-100 in (a) and on top-1 performance on Wiki-3029 in (b) (top-1 performance is used because avg binary was the same for all k). Fig. (c) shows the dynamics of train/test loss; supervised loss roughly tracks unsupervised test loss, as suggested by Theorem 4.1.

8.2. Effect of Block Size

As suggested in Section 6.3, a natural extension to the model would be access to blocks of similar points. We refer to our method of minimizing the loss in (12) as CURL, for Contrastive Unsupervised Representation Learning, and perform experiments on CIFAR-100, Wiki-3029, and IMDb. In Table 2 we see that for CIFAR-100 and Wiki-3029, increasing block size yields an improvement in classification accuracy. For IMDb, as is evident in Table 2, using larger blocks provides a clear benefit and the method does better than QT, which has state-of-the-art performance on many tasks. A thorough evaluation of CURL and its variants on other unlabeled datasets is left for future work.

Table 2. Effect of larger block size on representations. For CIFAR-100 and Wiki-3029 we measure the average binary classification accuracy. IMDB representations are tested on the IMDB supervised task. CURL is our large-block-size contrastive method; QT is the algorithm from (Logeswaran & Lee, 2018). For larger block sizes, QT uses all pairs within a block as similar pairs. We use the same GRU architecture for both CURL and QT for a fair comparison.

DATASET     METHOD   b = 2   b = 5   b = 10
CIFAR-100   CURL     88.1    89.6    89.7
WIKI-3029   CURL     96.6    97.5    97.7
IMDB        CURL     89.2    89.6    89.7
IMDB        QT       86.5    87.7    86.7

9. Conclusion

Contrastive learning methods have been empirically successful at learning useful feature representations. We provide a new conceptual framework for thinking about this form of learning, which also allows us to formally treat issues such as guarantees on the quality of the learned representations. The framework gives fresh insights into what guarantees are possible and impossible, and shapes the search for new assumptions to add to the framework that allow tighter guarantees. The framework currently ignores issues of efficient minimization of various loss functions, and instead studies the interrelationships of their minimizers as well as sample complexity requirements for training to generalize, while clarifying what generalization means in this setting. Our approach should be viewed as a first cut; possible extensions include allowing tree structure – more generally, metric structure – among the latent classes. Connections to meta-learning and transfer learning may arise.

We use experiments primarily to illustrate and support the new framework. But one experiment on sentence embeddings already illustrates how fresh insights derived from our framework can lead to improvements upon state-of-the-art models in this active area. We hope that further progress will follow, and that our theoretical insights will begin to influence practice, including the design of new heuristics to identify semantically similar/dissimilar pairs.

10. Acknowledgements

This work is supported by NSF, ONR, the Simons Foundation, the Schmidt Foundation, Mozilla Research, Amazon Research, DARPA, and SRC. We thank Rong Ge, Elad Hazan, Sham Kakade, Karthik Narasimhan, Karan Singh and Yi Zhang for helpful discussions and suggestions.

References

Arora, S. and Risteski, A. Provable benefits of representation learning. arXiv, 2017.

Bellet, A., Habrard, A., and Sebban, M. Similarity learning for provably accurate sparse linear classification. arXiv preprint arXiv:1206.6476, 2012.

Bellet, A., Habrard, A., and Sebban, M. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.

Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, 1998.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Gated feedback recurrent neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15. JMLR.org, 2015.

Cortes, C., Mohri, M., and Rostamizadeh, A. Two-stage learning kernel algorithms. 2010.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dyer, C. Notes on noise contrastive estimation and negative sampling. CoRR, abs/1410.8251, 2014. URL http://arxiv.org/abs/1410.8251.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Hazan, E. and Ma, T. A non-generative framework and convex relaxations for unsupervised learning. In Neural Information Processing Systems, 2016.

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. Skip-thought vectors. In Neural Information Processing Systems, 2015.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Kumar, A., Niculescu-Mizil, A., Kavukcoglu, K., and Daumé, H. A binary classification framework for two-stage multiple kernel learning. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, 2012.

Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In Proceedings of the International Conference on Learning Representations, 2018.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies, 2011.

Maurer, A. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pp. 3–17. Springer, 2016.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. J. Mach. Learn. Res., 2016.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems, 2013.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2018.

Pagliardini, M., Gupta, P., and Jaggi, M. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the North American Chapter of the ACL: Human Language Technologies, 2018.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of NAACL-HLT, 2018.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30, 2017.

Wang, X. and Gupta, A. Unsupervised learning of visual representations using videos. In Proc. of IEEE International Conference on Computer Vision, 2015.

A. Deferred Proofs

A.1. Class Collision Lemma

We prove a general lemma, from which Lemma 4.4 can be derived directly.

Lemma A.1. Let c ∈ C and ℓ: R^t → R be either the t-way hinge loss or the t-way logistic loss, as defined in Section 2. Let x, x⁺, x₁⁻, ..., x_t⁻ be i.i.d. draws from D_c. For all f ∈ F, let

$$L^{=}_{un,c}(f) = \mathbb{E}_{x, x^+, x_i^-} \left[ \ell\left( \left\{ f(x)^T \big( f(x^+) - f(x_i^-) \big) \right\}_{i=1}^{t} \right) \right]$$

Then

$$L^{=}_{un,c}(f) - \ell(\vec{0}) \le c'\, t\, \sqrt{\|\Sigma(f,c)\|_2}\; \mathbb{E}_{x \sim D_c}\big[\|f(x)\|\big] \qquad (13)$$

where c' is a positive constant.

Lemma 4.4 is a direct consequence of the above lemma, by setting t = 1 (which makes ℓ(\vec{0}) = 1), taking an expectation over c ∼ ν in Equation (13) and noting that E_{c∼ν}[L^{=}_{un,c}(f)] = L^{=}_{un}(f).

Proof of Lemma A.1. Fix an f ∈ F and let z_i = f(x)^T (f(x_i^-) − f(x^+)) and z = max_{i∈[t]} z_i. First, we show that L^{=}_{un,c}(f) − ℓ(\vec{0}) ≤ c' E[|z|], for some constant c'. Note that E[|z|] = P[z ≥ 0]E[z|z ≥ 0] + P[z ≤ 0]E[−z|z ≤ 0] ≥ P[z ≥ 0]E[z|z ≥ 0].

t-way hinge loss: By definition ℓ(v) = max{0, 1 + max_{i∈[t]}{−v_i}}. Here, L^{=}_{un,c}(f) = E[(1+z)_+] ≤ E[max{1+z, 1}] = 1 + P[z ≥ 0]E[z|z ≥ 0] ≤ 1 + E[|z|].

t-way logistic loss: By definition ℓ(v) = log₂(1 + Σ_{i=1}^{t} e^{−v_i}), so we have L^{=}_{un,c}(f) = E[log₂(1 + Σ_i e^{z_i})] ≤ E[log₂(1 + t e^{z})] ≤ max{ P[z ≥ 0]E[z|z ≥ 0]/log 2 + log₂(1+t), log₂(1+t) } = P[z ≥ 0]E[z|z ≥ 0]/log 2 + log₂(1+t) ≤ E[|z|]/log 2 + log₂(1+t).

Finally, E[|z|] ≤ E[max_{i∈[t]} |z_i|] ≤ t E[|z_1|]. But,

$$\mathbb{E}[|z_1|] = \mathbb{E}_{x, x^+, x_1^-}\left[ \left| f(x)^T \big( f(x_1^-) - f(x^+) \big) \right| \right] \le \mathbb{E}_x\left[ \|f(x)\| \sqrt{ \mathbb{E}_{x^+, x_1^-}\left[ \left( \frac{f(x)^T}{\|f(x)\|} \big( f(x_1^-) - f(x^+) \big) \right)^2 \right] } \right] \le \sqrt{2}\, \sqrt{\|\Sigma(f,c)\|_2}\; \mathbb{E}_{x \sim D_c}\big[\|f(x)\|\big]$$
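The following quick numerical sanity check of the last display is our own illustration (the skewed Gaussian standing in for the pushforward of D_c under f, and all sizes, are arbitrary choices), not part of the paper.

```python
import numpy as np

# Check: E|z1| <= sqrt(2) * sqrt(||Sigma(f,c)||_2) * E||f(x)|| for x, x+, x1- i.i.d. from one class.
rng = np.random.default_rng(2)
d, n = 8, 200000
A = rng.normal(size=(d, d))
mean = rng.normal(size=d)

def f_of_sample(m):
    # representations of m draws from a single class D_c (a skewed Gaussian, for illustration)
    return mean + rng.normal(size=(m, d)) @ A.T

fx, fpos, fneg = f_of_sample(n), f_of_sample(n), f_of_sample(n)
z1 = np.sum(fx * (fneg - fpos), axis=1)

lhs = np.abs(z1).mean()
cov_spectral = np.linalg.norm(np.cov(fx, rowvar=False), 2)     # ||Sigma(f, c)||_2
rhs = np.sqrt(2) * np.sqrt(cov_spectral) * np.linalg.norm(fx, axis=1).mean()
print(lhs <= rhs, lhs, rhs)    # expected: True (up to Monte Carlo error)
```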

A.2. Proof of Lemma 5.1

Fix an f ∈ F and suppose that within each class c, f is σ²-sub-Gaussian in every direction.⁸ Let µ_c = E_{x∼D_c}[f(x)]. This means that for all c ∈ C and unit vectors v, for x ∼ D_c, we have that vᵀ(f(x) − µ_c) is σ²-sub-Gaussian. Let ε > 0 and γ = 1 + 2Rσ√(2 log R + log 3/ε).⁹ Consider fixed c⁺, c⁻, x and let f(x)ᵀ(f(x⁻) − f(x⁺)) = µ + z, where

$$\mu = f(x)^T (\mu_{c^-} - \mu_{c^+}) \quad \text{and} \quad z = f(x)^T \big( f(x^-) - \mu_{c^-} \big) - f(x)^T \big( f(x^+) - \mu_{c^+} \big)$$

For x⁺ ∼ D_{c⁺}, x⁻ ∼ D_{c⁻} independently, z is the sum of two independent R²σ²-sub-Gaussians (x is fixed), so z is 2R²σ²-sub-Gaussian and thus p = Pr[z ≥ γ − 1] ≤ e^{−4R²σ²(2 log R + log 3/ε)/(4R²σ²)} = ε/(3R²). So, E_z[(1 + µ + z)_+] ≤ (1 − p)(γ + µ)_+ + p(2R² + 1) ≤ γ(1 + µ/γ)_+ + ε (where we used that µ + z ≤ 2R²). By taking the expectation over c⁺, c⁻ ∼ ρ², x ∼ D_{c⁺} we have

⁸ A random variable X is called σ²-sub-Gaussian if E[e^{λ(X − E[X])}] ≤ e^{λ²σ²/2}, ∀λ ∈ R. A random vector V ∈ R^d is σ²-sub-Gaussian in every direction if ∀u ∈ R^d, ||u|| = 1, the random variable ⟨u, V⟩ is σ²-sub-Gaussian.
⁹ We implicitly assume here that R ≥ 1, but for R < 1 we just set γ = 1 + 2Rσ√(log 3/ε) and the same argument holds.

$$\begin{aligned}
L^{\ne}_{un}(f) &\le \mathbb{E}_{\substack{c^+, c^- \sim \rho^2 \\ x \sim D_{c^+}}} \left[ \gamma \left( 1 + \frac{f(x)^T (\mu_{c^-} - \mu_{c^+})}{\gamma} \right)_+ \,\middle|\, c^+ \ne c^- \right] + \epsilon \\
&= \gamma\, \mathbb{E}_{c^+, c^- \sim \rho^2} \left[ \frac{1}{2} \mathbb{E}_{x \sim D_{c^+}}\left( 1 + \frac{f(x)^T (\mu_{c^-} - \mu_{c^+})}{\gamma} \right)_+ + \frac{1}{2} \mathbb{E}_{x \sim D_{c^-}}\left( 1 + \frac{f(x)^T (\mu_{c^+} - \mu_{c^-})}{\gamma} \right)_+ \,\middle|\, c^+ \ne c^- \right] + \epsilon \\
&= \gamma\, \mathbb{E}_{c^+, c^- \sim \rho^2} \left[ L^{\mu}_{\gamma, sup}(\{c^+, c^-\}, f) \,\middle|\, c^+ \ne c^- \right] + \epsilon
\end{aligned} \qquad (14)$$

where L^µ_{γ,sup}({c⁺, c⁻}, f) is L^µ_sup({c⁺, c⁻}, f) when ℓ_γ(x) = (1 − x/γ)_+ is the loss function. Observe that in (14) we used that D_T is uniform for binary T, which is an assumption we work with in Section 4, but we remove it in Section 5. The proof finishes by observing that the last line in (14) is equal to γ L^µ_{γ,sup}(f) + ε.

A.3. Generalization Bound

We first state the following general lemma in order to bound the generalization error of the function class F on the unsupervised loss function L_un(·). Lemma 4.2 can be directly derived from it.

Lemma A.2. Let ℓ: R^k → R be η-Lipschitz and bounded by B. Then with probability at least 1 − δ over the training set S = {(x_j, x_j⁺, x_{j1}⁻, ..., x_{jk}⁻)}_{j=1}^M, for all f ∈ F

$$L_{un}(\hat{f}) \le L_{un}(f) + O\left( \frac{\eta R \sqrt{k}\, \mathcal{R}_S(\mathcal{F})}{M} + B \sqrt{\frac{\log \frac{1}{\delta}}{M}} \right) \qquad (15)$$

where

$$\mathcal{R}_S(\mathcal{F}) = \mathbb{E}_{\sigma \sim \{\pm 1\}^{(k+2)dM}} \left[ \sup_{f \in \mathcal{F}} \langle \sigma, f_{|S} \rangle \right] \qquad (16)$$

and f_{|S} = ( f_t(x_j), f_t(x_j⁺), f_t(x_{j1}⁻), ..., f_t(x_{jk}⁻) )_{j∈[M], t∈[d]}.

Note that for (k+1)-way classification, for the hinge loss we have η = 1 and B = O(R²), while for the logistic loss η = 1 and B = O(R² + log k). Setting k = 1, we get Lemma 4.2. We now prove Lemma A.2.

Proof of Lemma A.2. First, we use the classical bound for the generalization error in terms of the Rademacher complexity of the function class (see (Mohri et al., 2018), Theorem 3.1). For a real function class G whose functions map from a set Z to [0, 1] and for any δ > 0, if S is a training set composed of M i.i.d. samples {z_j}_{j=1}^M, then with probability at least 1 − δ/2, for all g ∈ G

$$\mathbb{E}[g(z)] \le \frac{1}{M} \sum_{j=1}^{M} g(z_j) + \frac{2 \mathcal{R}_S(\mathcal{G})}{M} + 3\sqrt{\frac{\log \frac{4}{\delta}}{2M}} \qquad (17)$$

where R_S(G) is the usual Rademacher complexity. We apply this bound to our case by setting Z = X^{k+2}, letting S be our training set, and taking the function class to be

$$\mathcal{G} = \left\{ g_f(x, x^+, x_1^-, ..., x_k^-) = \frac{1}{B}\, \ell\left( \left\{ f(x)^T \big( f(x^+) - f(x_i^-) \big) \right\}_{i=1}^{k} \right) \,\middle|\, f \in \mathcal{F} \right\} \qquad (18)$$

We will show that for some universal constant c, R_S(G) ≤ c (ηR√k/B) R_S(F), or equivalently

$$\mathbb{E}_{\sigma \sim \{\pm 1\}^M} \left[ \sup_{f \in \mathcal{F}} \big\langle \sigma, (g_f)_{|S} \big\rangle \right] \le c\, \frac{\eta R \sqrt{k}}{B}\, \mathbb{E}_{\sigma \sim \{\pm 1\}^{d(k+2)M}} \left[ \sup_{f \in \mathcal{F}} \langle \sigma, f_{|S} \rangle \right] \qquad (19)$$

where (g_f)_{|S} = {g_f(x_j, x_j⁺, x_{j1}⁻, ..., x_{jk}⁻)}_{j=1}^M. To do that we will use the following vector-contraction inequality.

Theorem A.3 (Corollary 4 in (Maurer, 2016)). Let Z be any set, and S = {z_j}_{j=1}^M ∈ Z^M. Let F̃ be a class of functions f̃: Z → R^n and h: R^n → R be L-Lipschitz. For all f̃ ∈ F̃, let g_{f̃} = h ∘ f̃. Then

$$\mathbb{E}_{\sigma \sim \{\pm 1\}^M} \left[ \sup_{\tilde{f} \in \tilde{\mathcal{F}}} \big\langle \sigma, (g_{\tilde{f}})_{|S} \big\rangle \right] \le \sqrt{2}\, L\; \mathbb{E}_{\sigma \sim \{\pm 1\}^{nM}} \left[ \sup_{\tilde{f} \in \tilde{\mathcal{F}}} \big\langle \sigma, \tilde{f}_{|S} \big\rangle \right]$$

where f̃_{|S} = ( f̃_t(z_j) )_{t∈[n], j∈[M]}.

We apply Theorem A.3 to our case by setting Z = X^{k+2}, n = d(k+2) and

$$\tilde{\mathcal{F}} = \left\{ \tilde{f}(x, x^+, x_1^-, ..., x_k^-) = \big( f(x), f(x^+), f(x_1^-), ..., f(x_k^-) \big) \,\middle|\, f \in \mathcal{F} \right\}$$

We also use g_{f̃} = g_f where f is derived from f̃ as in the definition of F̃. Observe that now Theorem A.3 is exactly in the form of (19), and we need to show that L ≤ (c/√2)(ηR√k/B) for some constant c. But, for z = (x, x⁺, x₁⁻, ..., x_k⁻), we have g_{f̃}(z) = (1/B) ℓ(φ(f̃(z))) where φ: R^{(k+2)d} → R^k and φ((v_t, v_t⁺, v_{t1}⁻, ..., v_{tk}⁻)_{t∈[d]}) = ( Σ_t v_t(v_t⁺ − v_{ti}⁻) )_{i∈[k]}. Thus, we may use h = (1/B) ℓ ∘ φ to apply Theorem A.3.

Now, we see that φ is √(6k) R-Lipschitz when Σ_t v_t², Σ_t (v_t⁺)², Σ_t (v_{tj}⁻)² ≤ R², by computing its Jacobian. Indeed, for all i, j ∈ [k] and t ∈ [d], we have ∂φ_i/∂v_t = v_t⁺ − v_{ti}⁻, ∂φ_i/∂v_t⁺ = v_t and ∂φ_i/∂v_{tj}⁻ = −v_t 1{i = j}. From the triangle inequality, the Frobenius norm of the Jacobian J of φ is

$$\|J\|_F = \sqrt{ \sum_{i,t} (v_t^+ - v_{ti}^-)^2 + 2k \sum_t v_t^2 } \le \sqrt{4kR^2 + 2kR^2} = \sqrt{6k}\, R$$

Now, taking into account that ||J||₂ ≤ ||J||_F, we have that φ is √(6k) R-Lipschitz on its domain, and since ℓ is η-Lipschitz, we have L ≤ √6 ηR√k/B.

Now, we have that with probability at least 1 − δ/2

$$L_{un}(\hat{f}) \le \widehat{L}_{un}(\hat{f}) + O\left( \frac{\eta R \sqrt{k}\, \mathcal{R}_S(\mathcal{F})}{M} + B \sqrt{\frac{\log \frac{1}{\delta}}{M}} \right) \qquad (20)$$

Let f* ∈ argmin_{f∈F} L_un(f). With probability at least 1 − δ/2, we have that L̂_un(f*) ≤ L_un(f*) + 3B√(log(2/δ)/(2M)) (Hoeffding's inequality). Combining this with Equation (20), the fact that L̂_un(f̂) ≤ L̂_un(f*), and applying a union bound finishes the proof.
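The Rademacher average R_S(F) in Equation (16) can be estimated by Monte Carlo for simple function classes. The sketch below (our own toy example) does this for the linear class F = {x ↦ Wx : ||W||_F ≤ 1}, for which the supremum has a closed form; it is only an illustration of the definition, not a bound used in the paper.

```python
import numpy as np

def rademacher_average_linear(points, d_rep=16, n_draws=200, rng=None):
    """Monte Carlo estimate of R_S(F) in Eq. (16) for F = { x -> W x : ||W||_F <= 1 }.

    points: (P, d_in) array of every point appearing in S (anchors, positives, negatives),
            so P = (k+2) * M. sigma_p in {+-1}^{d_rep} are the signs on point p's coordinates.
    For this class, sup_f <sigma, f|S> = || sum_p sigma_p x_p^T ||_F.
    """
    rng = rng or np.random.default_rng(0)
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=(points.shape[0], d_rep))
        # sup over ||W||_F <= 1 of <W, sigma^T X> is the Frobenius norm of sigma^T X
        vals.append(np.linalg.norm(sigma.T @ points, 'fro'))
    return float(np.mean(vals))

# Example: M = 50 triples (k = 1), so P = 3 * 50 points in 10 input dimensions
X = np.random.default_rng(1).normal(size=(150, 10))
print(rademacher_average_linear(X))
```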

A.4. Proof of Proposition 6.2

By convexity of ℓ,

$$\ell\left( f(x)^T \left( \frac{\sum_i f(x_i^+)}{b} - \frac{\sum_i f(x_i^-)}{b} \right) \right) = \ell\left( \frac{1}{b} \sum_i f(x)^T \big( f(x_i^+) - f(x_i^-) \big) \right) \le \frac{1}{b} \sum_i \ell\left( f(x)^T \big( f(x_i^+) - f(x_i^-) \big) \right)$$

Thus,

$$L^{block}_{un}(f) = \mathbb{E}_{x, x_i^+, x_i^-}\left[ \ell\left( f(x)^T \left( \frac{\sum_i f(x_i^+)}{b} - \frac{\sum_i f(x_i^-)}{b} \right) \right) \right] \le \mathbb{E}_{x, x_i^+, x_i^-}\left[ \frac{1}{b} \sum_i \ell\left( f(x)^T \big( f(x_i^+) - f(x_i^-) \big) \right) \right] = L_{un}(f)$$

The proof of the lower bound is analogous to that of Lemma 4.3.
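The upper bound above can be illustrated numerically; the toy classes, noise scale, and block size in this NumPy sketch are our own choices and not from the paper.

```python
import numpy as np

# Check numerically that the block loss of Eq. (12) is never larger than the pairwise loss L_un.
rng = np.random.default_rng(3)
C, d, b, n = 20, 12, 5, 100000
means = rng.normal(size=(C, d))
hinge = lambda t: np.maximum(0.0, 1.0 - t)

def draw(classes, m):
    # m draws from D_c for each class id in `classes`: shape (len(classes), m, d)
    return means[classes][:, None, :] + 0.5 * rng.normal(size=(len(classes), m, d))

cp, cm = rng.integers(C, size=n), rng.integers(C, size=n)
x = draw(cp, 1)[:, 0]               # anchors
pos, neg = draw(cp, b), draw(cm, b)

pair_losses = hinge(np.einsum('nd,nbd->nb', x, pos - neg))           # each (x_i^+, x_i^-) used separately
block_losses = hinge(np.sum(x * (pos.mean(axis=1) - neg.mean(axis=1)), axis=1))
print(block_losses.mean() <= pair_losses.mean(), block_losses.mean(), pair_losses.mean())
```

By convexity of the hinge, the inequality here holds sample by sample, not just in expectation, which is exactly the Jensen step in the proof above.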

B. Results for k Negative Samples B.1. Formal theorem statement and proof We now present Theorem B.1 as the formal statement of Theorem 6.1 and prove it. First we define some necessary quantities. + − − + − − Let (c , c1 , . . . , ck ) be k + 1 not necessarily distinct classes. We define Q(c , c1 , . . . , ck ) to be the set of distinct classes + − − − + + in this tuple. We also define I (c1 , ..., ck ) = {i ∈ [k] | ci = c } to be the set of indices where c reappears in the negative samples. We will abuse notation and just write Q, I+ when the tuple is clear from the context. 6= + − − k+1 To define Lun(f) consider the following tweak in the way the latent classes are sampled: sample c , c1 , . . . , ck ∼ ρ + − + + 2 conditioning on |I | < k and then remove all ci , i ∈ I . The datapoints are then sampled as usual: x, x ∼ Dc+ and − x ∼ D − , i ∈ [k], independently. i ci h   i 6=  T + −  + Lun(f) := ` f(x) f(x ) − f(xi ) + |I | < k +E − i/∈I c ,ci + − x,x ,xi which always contrasts points from different classes, since it only considers the negative samples that are not from c+. The generalization error is 10  s  √ R (F) log 1 Gen = O R k S + (R2 + log k) δ M  M M 

  were RS (F) is the extension of the definition in Section4: RS (F) = E supf∈F hσ, f|S i , where f|S = σ∼{±1}(k+2)dM  + − −  ft(xj), ft(xj ), ft(xj1), . . . , , ft(xjk) . j∈[M],t∈[d] + − − k+1 + 0 + − For c , c1 , ..., ck ∼ ρ , let τk = P[I 6= ∅] and τ = P[c = ci , ∀i]. Observe that τ1, as defined in Section4, is + − P[c = c1 ]. Let pmax(T ) = maxc DT (c) and + + +  ρmin(T ) = min Pc+,c−∼ρk+1 c = c|Q = T ,I = ∅ c∈T i   ρ+ (T ) min µ ˆ In Theorem B.1 we will upper bound the following quantity: E p (T ) Lsup(T , f) (D was defined in Section 6.1). T ∼D max ˆ Theorem B.1. Let f ∈ arg minf∈F Lbun(f). With probability at least 1 − δ, for all f ∈ F

 +  0 ρmin(T ) µ ˆ 1 − τ 6= 0 τ1 1 E Lsup(T , f) ≤ Lun(f) + c k s(f) + GenM T ∼D pmax(T ) 1 − τk 1 − τk 1 − τk where c0 is a constant.

Note that the definition of s(f) used here is defined in Section4

Proof. First, we note that both hinge and logistic loss satisfy the following property: ∀I1,I2 such that I1 ∪ I2 = [t] we have that

`({vi}i∈I1 ) ≤ `({vi}i∈[t]) ≤ `({vi}i∈I1 ) + `({vi}i∈I2 ) (21)

We now prove the Theorem in 3 steps. First, we leverage the convexity of ` to upper bound a supervised-type loss with the unsupervised loss Lun(f) of any f ∈ F. We call it supervised-type loss because it also includes degenerate tasks: |T | = 1.

10The log k term can be made O(1) for the hinge loss. Contrastive Unsupervised Representation Learning

Then, we decompose the supervised-type loss into an average loss over a distribution of supervised tasks, as defined in the 6= Theorem, plus a degenerate/constant term. Finally, we upper bound the unsupervised loss Lun(f) with two terms: Lun(f) that measures how well f contrasts points from different classes and an intraclass deviation penalty, corresponding to s(f). ˆ Step 1 (convexity): When the class c is clear from context, we write µˆc = [f(x)]. Recall that the sampling procedure for xE∼c + − − k+1 + 2 − unsupervised data is as follows: sample c , c , ..., c ∼ ρ and then x, x ∼ D + and x ∼ D − , i ∈ [k]. So, we have 1 k c i ci

\[
\begin{aligned}
L_{un}(\hat{f}) &= \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-}}}\Big[\ell\Big(\big\{\hat{f}(x)^T\big(\hat{f}(x^+) - \hat{f}(x_i^-)\big)\big\}_{i=1}^{k}\Big)\Big] \\
&= \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\Bigg[\mathop{\mathbb{E}}_{\substack{x^+ \sim \mathcal{D}_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-}}}\Big[\ell\Big(\big\{\hat{f}(x)^T\big(\hat{f}(x^+) - \hat{f}(x_i^-)\big)\big\}_{i=1}^{k}\Big)\Big]\Bigg] \ge \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T\big(\hat{\mu}_{c^+} - \hat{\mu}_{c_i^-}\big)\big\}_{i=1}^{k}\Big)\Big]
\end{aligned}
\tag{22}
\]
where the last inequality follows by applying the usual Jensen's inequality and the convexity of $\ell$. Note that in the upper bounded quantity, the $c^+, c_1^-, \ldots, c_k^-$ don't have to be distinct and so the tuple does not necessarily form a task.

Step 2 (decomposing into supervised tasks): We now decompose the above quantity to handle repeated classes.

\[
\begin{aligned}
&\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T\big(\hat{\mu}_{c^+} - \hat{\mu}_{c_i^-}\big)\big\}_{i=1}^{k}\Big)\Big] \\
&\ge (1 - \tau_k)\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T\big(\hat{\mu}_{c^+} - \hat{\mu}_{c_i^-}\big)\big\}_{i=1}^{k}\Big) \,\Big|\, I^+ = \emptyset\Big] + \tau_k \mathop{\mathbb{E}}_{c^+, c_i^- \sim \rho^{k+1}}\Big[\ell\big(\underbrace{0, \ldots, 0}_{|I^+| \text{ times}}\big) \,\Big|\, I^+ \neq \emptyset\Big] \\
&\ge (1 - \tau_k)\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T\big(\hat{\mu}_{c^+} - \hat{\mu}_{c}\big)\big\}_{c \in Q,\, c \neq c^+}\Big) \,\Big|\, I^+ = \emptyset\Big] + \tau_k \mathop{\mathbb{E}}_{c^+, c_i^- \sim \rho^{k+1}}\Big[\ell_{|I^+|}(\vec{0}) \,\Big|\, I^+ \neq \emptyset\Big]
\end{aligned}
\tag{23}
\]
where $\ell_t(\vec{0}) = \ell(0, \ldots, 0)$ ($t$ times). Both inequalities follow from the LHS of Equation (21). Now we are closer to our goal of lower bounding an average supervised loss, since the first expectation in the RHS has a loss which is over a set of distinct classes. However, notice that this loss is for separating $c^+$ from $Q(c^+, c_1^-, \ldots, c_k^-) \setminus \{c^+\}$. We now proceed to a symmetrization of this term to alleviate this issue. Recall that in the main paper, sampling $\mathcal{T}$ from $\mathcal{D}$ is defined as sampling the $(k+1)$-tuple from $\rho^{k+1}$ conditioned on $I^+ = \emptyset$ and setting $\mathcal{T} = Q$. Based on this definition, by the tower property of expectation, we have
\[
\begin{aligned}
&\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T(\hat{\mu}_{c^+} - \hat{\mu}_{c})\big\}_{c \in Q,\, c \neq c^+}\Big) \,\Big|\, I^+ = \emptyset\Big] \\
&= \mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}}\Bigg[\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T(\hat{\mu}_{c^+} - \hat{\mu}_{c})\big\}_{c \in Q,\, c \neq c^+}\Big) \,\Big|\, Q = \mathcal{T},\, I^+ = \emptyset\Big]\Bigg] \\
&= \mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}}\Bigg[\mathop{\mathbb{E}}_{\substack{c^+ \sim \rho^+(\mathcal{T}) \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T(\hat{\mu}_{c^+} - \hat{\mu}_{c})\big\}_{c \in \mathcal{T},\, c \neq c^+}\Big)\Big]\Bigg]
\end{aligned}
\tag{24}
\]

where $\rho^+(\mathcal{T})$ is the distribution of $c^+$ when $(c^+, c_1^-, \ldots, c_k^-)$ are sampled from $\rho^{k+1}$ conditioned on $Q = \mathcal{T}$ and $I^+ = \emptyset$. Recall that $\rho^+_{\min}(\mathcal{T})$ from the theorem's statement is exactly the minimum of these $|\mathcal{T}|$ probabilities. Now, to lower bound the last quantity with the LHS in the theorem statement, we just need to observe that for all tasks $\mathcal{T}$

\[
\begin{aligned}
\mathop{\mathbb{E}}_{\substack{c^+ \sim \rho^+(\mathcal{T}) \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T(\hat{\mu}_{c^+} - \hat{\mu}_{c})\big\}_{c \in \mathcal{T},\, c \neq c^+}\Big)\Big]
&\ge \frac{\rho^+_{\min}(\mathcal{T})}{p_{\max}(\mathcal{T})} \mathop{\mathbb{E}}_{\substack{c^+ \sim \mathcal{D}_{\mathcal{T}} \\ x \sim \mathcal{D}_{c^+}}}\Big[\ell\Big(\big\{\hat{f}(x)^T(\hat{\mu}_{c^+} - \hat{\mu}_{c})\big\}_{c \in \mathcal{T},\, c \neq c^+}\Big)\Big] \\
&= \frac{\rho^+_{\min}(\mathcal{T})}{p_{\max}(\mathcal{T})}\, L^{\mu}_{sup}(\mathcal{T}, \hat{f})
\end{aligned}
\tag{25}
\]

By combining this with Equations (22), (23), (25) we get
\[
(1 - \tau_k)\mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}}\bigg[\frac{\rho^+_{\min}(\mathcal{T})}{p_{\max}(\mathcal{T})}\, L^{\mu}_{sup}(\mathcal{T}, \hat{f})\bigg] \le L_{un}(\hat{f}) - \tau_k \mathop{\mathbb{E}}_{c^+, c_i^- \sim \rho^{k+1}}\Big[\ell_{|I^+|}(\vec{0}) \,\Big|\, I^+ \neq \emptyset\Big]
\tag{26}
\]
Now, by applying Lemma A.2, we bound the generalization error: with probability at least $1 - \delta$, $\forall f \in \mathcal{F}$

\[
L_{un}(\hat{f}) \le L_{un}(f) + \mathrm{Gen}_M
\tag{27}
\]

However, $L_{un}(f)$ cannot be made arbitrarily small. One can see that for all $f \in \mathcal{F}$, $L_{un}(f)$ is lower bounded by the second term on the RHS of Equation (23), which cannot be made arbitrarily small as $\tau_k > 0$:
\[
L_{un}(f) \ge \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-}}}\Big[\ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \in I^+}\Big)\Big] \ge \tau_k \mathop{\mathbb{E}}_{c^+, c_i^- \sim \rho^{k+1}}\Big[\ell_{|I^+|}(\vec{0}) \,\Big|\, I^+ \neq \emptyset\Big]
\tag{28}
\]
where the first inequality uses the LHS of Equation (21) and for the second we applied Jensen's inequality. Since $\tau_k$ is not 0, the above quantity can never be arbitrarily close to 0 (no matter how rich $\mathcal{F}$ is).

Step 3 ($L_{un}$ decomposition): Now, we decompose $L_{un}(f)$ by applying the RHS of Equation (21)

\[
L_{un}(f) \le \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-}}}\Big[\ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \notin I^+}\Big) + \ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \in I^+}\Big)\Big]
\tag{29}
\]
\[
= \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-},\, i \notin I^+}}\Big[\ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \notin I^+}\Big)\Big] + \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-},\, i \in I^+}}\Big[\ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \in I^+}\Big)\Big]
\tag{30}
\]
\[
= (1 - \tau')\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-},\, i \notin I^+}}\Big[\ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \notin I^+}\Big) \,\Big|\, |I^+| < k\Big] + \tau_k \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-},\, i \in I^+}}\Big[\ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \in I^+}\Big) \,\Big|\, I^+ \neq \emptyset\Big]
\tag{31}
\]

Observe that the first term is exactly $(1 - \tau')L^{\neq}_{un}(f)$. Thus, combining (26), (27) and (31) we get
\[
\begin{aligned}
(1 - \tau_k)\mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}}\bigg[\frac{\rho^+_{\min}(\mathcal{T})}{p_{\max}(\mathcal{T})}\, L^{\mu}_{sup}(\mathcal{T}, \hat{f})\bigg]
&\le (1 - \tau')\, L^{\neq}_{un}(f) + \mathrm{Gen}_M \\
&\quad + \tau_k \underbrace{\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x, x^+ \sim \mathcal{D}^2_{c^+} \\ x_i^- \sim \mathcal{D}_{c_i^-},\, i \in I^+}}\Big[\ell\Big(\big\{f(x)^T\big(f(x^+) - f(x_i^-)\big)\big\}_{i \in I^+}\Big) - \ell_{|I^+|}(\vec{0}) \,\Big|\, I^+ \neq \emptyset\Big]}_{\Delta(f)}
\end{aligned}
\tag{32}
\]

From the definition of $I^+$, $c_i^- = c^+$ for all $i \in I^+$. Thus, from Lemma A.1, we get that

\[
\Delta(f) \le c' \mathop{\mathbb{E}}_{c^+, c_i^- \sim \rho^{k+1}}\Big[|I^+|\, \sqrt{\|\Sigma(f, c^+)\|_2}\, \mathop{\mathbb{E}}_{x \sim \mathcal{D}_{c^+}}\big[\|f(x)\|\big] \,\Big|\, I^+ \neq \emptyset\Big]
\tag{33}
\]
for some constant $c'$. Let $u$ be a distribution over classes with $u(c) = \mathbb{P}_{c^+, c_i^- \sim \rho^{k+1}}\big[c^+ = c \mid I^+ \neq \emptyset\big]$, and it is easy to see that $u(c) \propto \rho(c)\big(1 - (1 - \rho(c))^k\big)$. By applying the tower property to Equation (33) we have
\[
\Delta(f) \le c' \mathop{\mathbb{E}}_{c \sim u}\bigg[\mathop{\mathbb{E}}_{c^+, c_i^- \sim \rho^{k+1}}\big[|I^+| \,\big|\, c^+ = c,\, I^+ \neq \emptyset\big]\, \sqrt{\|\Sigma(f, c)\|_2}\, \mathop{\mathbb{E}}_{x \sim \mathcal{D}_c}\big[\|f(x)\|\big]\bigg]
\tag{34}
\]
But,

\[
\begin{aligned}
\mathop{\mathbb{E}}_{c^+, c_i^- \sim \rho^{k+1}}\big[|I^+| \,\big|\, c^+ = c,\, I^+ \neq \emptyset\big]
&= \sum_{i=1}^{k} \mathbb{P}_{c^+, c_i^- \sim \rho^{k+1}}\big[c_i^- = c^+ \,\big|\, c^+ = c,\, I^+ \neq \emptyset\big] \\
&= k\,\mathbb{P}_{c^+, c_i^- \sim \rho^{k+1}}\big[c_1^- = c^+ \,\big|\, c^+ = c,\, I^+ \neq \emptyset\big]
= k\,\frac{\mathbb{P}_{c^+, c_i^- \sim \rho^{k+1}}\big[c_1^- = c^+ = c\big]}{\mathbb{P}_{c^+, c_i^- \sim \rho^{k+1}}\big[c^+ = c,\, I^+ \neq \emptyset\big]} \\
&= k\,\frac{\rho^2(c)}{\rho(c)\big(1 - (1 - \rho(c))^k\big)} = k\,\frac{\rho(c)}{1 - (1 - \rho(c))^k}
\end{aligned}
\tag{35}
\]

Now, using the fact that $\tau_k = 1 - \sum_{c'} \rho(c')(1 - \rho(c'))^k = \sum_{c'} \rho(c')\big(1 - (1 - \rho(c'))^k\big)$ and $\tau_1 = \sum_c \rho^2(c)$,
\[
\begin{aligned}
\frac{\tau_k}{1 - \tau_k}\,\Delta(f)
&\le c'\,\frac{\tau_k}{1 - \tau_k}\mathop{\mathbb{E}}_{c \sim u}\bigg[k\,\frac{\rho(c)}{1 - (1 - \rho(c))^k}\, \sqrt{\|\Sigma(f, c)\|_2}\, \mathop{\mathbb{E}}_{x \sim \mathcal{D}_c}\big[\|f(x)\|\big]\bigg] \\
&= c'\,\frac{\tau_k}{1 - \tau_k}\, k\sum_c \frac{\rho^2(c)}{\sum_{c'} \rho(c')\big(1 - (1 - \rho(c'))^k\big)}\, \sqrt{\|\Sigma(f, c)\|_2}\, \mathop{\mathbb{E}}_{x \sim \mathcal{D}_c}\big[\|f(x)\|\big] \\
&= c'\, k\,\frac{\tau_1}{1 - \tau_k}\mathop{\mathbb{E}}_{c \sim \nu}\Big[\sqrt{\|\Sigma(f, c)\|_2}\, \mathop{\mathbb{E}}_{x \sim \mathcal{D}_c}\big[\|f(x)\|\big]\Big] = c'\, k\,\frac{\tau_1}{1 - \tau_k}\, s(f)
\end{aligned}
\tag{36}
\]
and we are done.
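The quantities $\tau_k$, $\tau_1$ and $\tau'$ have the simple closed forms used above, so the weights appearing in Theorem B.1 are easy to evaluate for a given $\rho$. The snippet below (our own illustration) computes them for a uniform $\rho$ over $n$ classes and shows how the weights on $L^{\neq}_{un}(f)$, on $s(f)$ (up to the constant $c'$) and on $\mathrm{Gen}_M$ behave as $k$ grows.

```python
import numpy as np

def theorem_b1_coefficients(rho, k):
    """tau_k, tau_1, tau' and the weights in Theorem B.1 for class distribution rho."""
    rho = np.asarray(rho, dtype=float)
    tau_k = 1.0 - np.sum(rho * (1.0 - rho) ** k)   # P[I^+ != empty set]
    tau_1 = np.sum(rho ** 2)                       # P[c^+ = c_1^-]
    tau_p = np.sum(rho ** (k + 1))                 # tau' = P[c_i^- = c^+ for all i]
    alpha = (1.0 - tau_p) / (1.0 - tau_k)          # weight on L_un^{!=}(f)
    beta = k * tau_1 / (1.0 - tau_k)               # weight on s(f), up to c'
    eta = 1.0 / (1.0 - tau_k)                      # weight on Gen_M
    return tau_k, tau_1, alpha, beta, eta

n = 100
rho = np.full(n, 1.0 / n)                          # uniform latent classes
for k in (1, 5, 20, 50):
    tau_k, tau_1, alpha, beta, eta = theorem_b1_coefficients(rho, k)
    print(f"k={k:2d}  tau_k={tau_k:.3f}  alpha={alpha:.2f}  beta={beta:.3f}  eta={eta:.2f}")
```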

B.2. Competitive Bound

As in Section 5.2, we prove a competitive type of bound, under similar assumptions. Let $\ell_\gamma(v) = \max\{0, 1 + \max_i\{-v_i\}/\gamma\}$, $v \in \mathbb{R}^k$, be the multiclass hinge loss with margin $\gamma$, and for any $\mathcal{T}$ let $L^{\mu}_{\gamma, sup}(\mathcal{T}, f)$ be $L^{\mu}_{sup}(\mathcal{T}, f)$ when $\ell_\gamma$ is used as the loss function. For all tasks $\mathcal{T}$, let $\rho'^{+}(\mathcal{T})$ be the distribution of $c^+$ when $(c^+, c_1^-, \ldots, c_k^-)$ are sampled from $\rho^{k+1}$ conditioned on $Q = \mathcal{T}$ and $|I^+| < k$. Also, let $\rho'^{+}_{\max}(\mathcal{T})$ be the maximum of these $|\mathcal{T}|$ probabilities and $p_{\min}(\mathcal{T}) = \min_{c \in \mathcal{T}} \mathcal{D}_{\mathcal{T}}(c)$.

We will show a competitive bound against the following quantity, for all $f \in \mathcal{F}$: $\mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}'}\Big[\frac{\rho'^{+}_{\max}(\mathcal{T})}{p_{\min}(\mathcal{T})}\, L^{\mu}_{\gamma, sup}(\mathcal{T}, f)\Big]$, where $\mathcal{D}'$ is defined as follows: sample $c^+, c_1^-, \ldots, c_k^- \sim \rho^{k+1}$, conditioned on $|I^+| < k$, and set $\mathcal{T} = Q$. Observe that when $I^+ = \emptyset$ with high probability, we have $\mathcal{D}' \approx \mathcal{D}$.

Lemma B.2. For all $f \in \mathcal{F}$, suppose the random variable $f(X)$, where $X \sim \mathcal{D}_c$, is $\sigma^2(f)$-subgaussian in every direction for every class $c$ and has maximum norm $R(f) = \max_{x \in \mathcal{X}} \|f(x)\|$. Let $\hat{f} \in \arg\min_{f \in \mathcal{F}} \widehat{L}_{un}(f)$. Then for all $\epsilon > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$

\[
\mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}}\bigg[\frac{\rho^+_{\min}(\mathcal{T})}{p_{\max}(\mathcal{T})}\, L^{\mu}_{sup}(\mathcal{T}, \hat{f})\bigg] \le \alpha\,\gamma(f)\mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}'}\bigg[\frac{\rho'^{+}_{\max}(\mathcal{T})}{p_{\min}(\mathcal{T})}\, L^{\mu}_{\gamma, sup}(\mathcal{T}, f)\bigg] + \beta\, s(f) + \eta\, \mathrm{Gen}_M + \epsilon
\]

where $\gamma(f) = 1 + c' R(f)\sigma(f)\big(\sqrt{\log k} + \sqrt{\log\frac{R(f)}{\epsilon}}\big)$, $c'$ is some constant, $\alpha = \frac{1 - \tau'}{1 - \tau_k}$, $\beta = k\frac{\tau_1}{1 - \tau_k}$, and $\eta = \frac{1}{1 - \tau_k}$.

Proof. We will show that $\forall f \in \mathcal{F}$
\[
L^{\neq}_{un}(f) \le \gamma(f)\mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}'}\bigg[\frac{\rho'^{+}_{\max}(\mathcal{T})}{p_{\min}(\mathcal{T})}\, L^{\mu}_{\gamma, sup}(\mathcal{T}, f)\bigg]
\tag{37}
\]
and the Lemma follows from Theorem 6.1. Now, we fix an $\epsilon > 0$, an $f \in \mathcal{F}$, and we drop most of the arguments $f$ in the rest of the proof. Also, fix $c^+, c_1^-, \ldots, c_k^-, x$ and let $t = k - |I^+|$. We assume without loss of generality that $c^+ \neq c_i^-$, $\forall i \in [t]$. Now,
\[
\max_{i \in [t]} f(x)^T\big(f(x_i^-) - f(x^+)\big) \le \mu + \max_i z_i^- - z^+
\tag{38}
\]

where $\mu = \max_{i \in [t]} f(x)^T(\mu_{c_i^-} - \mu_{c^+})$, $z_i^- = f(x)^T\big(f(x_i^-) - \mu_{c_i^-}\big)$ and $z^+ = f(x)^T\big(f(x^+) - \mu_{c^+}\big)$. The $z_i^-$ are centered $\sigma^2 R^2$-subgaussian, so from standard properties of subgaussian random variables $\mathbb{P}\big[\max_i z_i^- \ge \sqrt{2}\,\sigma R\sqrt{\log t} + \sqrt{2c_1}\,\sigma R\sqrt{\log R/\epsilon}\big] \le (\epsilon/R)^{c_1}$ (again we consider here the case where $R \ge 1$; for $R < 1$ the same arguments hold but with $R$ removed from the log). $z^+$ is also centered $\sigma^2 R^2$-subgaussian, so $\mathbb{P}\big[z^+ \ge \sqrt{2c_1}\,\sigma R\sqrt{\log R/\epsilon}\big] \le (\epsilon/R)^{c_1}$. Let $\gamma = 1 + c'\sigma R\big(\sqrt{\log t} + \sqrt{\log R/\epsilon}\big)$ for an appropriate constant $c'$. By the union bound, we have $p = \mathbb{P}\big[\max_i z_i^- - z^+ \ge \gamma - 1\big] \le 2(\epsilon/R)^{c_1}$. Thus, $\mathop{\mathbb{E}}_{z^+, z_i^-}\big[(1 + \mu + \max_i z_i^- - z^+)_+\big] \le (1 - p)(\mu + \gamma)_+ + p(2R^2 + 1) \le \gamma(1 + \mu/\gamma)_+ + \epsilon$ (for an appropriate constant $c_1$). By taking expectation over $c^+, c_i^- \sim \rho^{k+1}$, conditioned on $|I^+| < k$, and over $x \sim \mathcal{D}_{c^+}$ we get
\[
\begin{aligned}
L^{\neq}_{un}(f) &\le \gamma \mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\bigg[\Big(1 + \frac{\max_{c \in Q,\, c \neq c^+} f(x)^T(\mu_c - \mu_{c^+})}{\gamma}\Big)_+ \,\bigg|\, |I^+| < k\bigg] \\
&= \gamma \mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}'}\bigg[\mathop{\mathbb{E}}_{\substack{c^+, c_i^- \sim \rho^{k+1} \\ x \sim \mathcal{D}_{c^+}}}\bigg[\Big(1 + \frac{\max_{c \in Q,\, c \neq c^+} f(x)^T(\mu_c - \mu_{c^+})}{\gamma}\Big)_+ \,\bigg|\, Q = \mathcal{T},\, |I^+| < k\bigg]\bigg] \\
&= \gamma \mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}'}\bigg[\mathop{\mathbb{E}}_{\substack{c^+ \sim \rho'^{+}(\mathcal{T}) \\ x \sim \mathcal{D}_{c^+}}}\bigg[\Big(1 + \frac{\max_{c \in \mathcal{T},\, c \neq c^+} f(x)^T(\mu_c - \mu_{c^+})}{\gamma}\Big)_+\bigg]\bigg] \le \gamma \mathop{\mathbb{E}}_{\mathcal{T} \sim \mathcal{D}'}\bigg[\frac{\rho'^{+}_{\max}(\mathcal{T})}{p_{\min}(\mathcal{T})}\, L^{\mu}_{\gamma, sup}(\mathcal{T}, f)\bigg]
\end{aligned}
\tag{39}
\]

C. Examples for Section 6.2

Here, we illustrate via examples two ways in which an increase of $k$ can lead to a suboptimal $\hat{f}$. We consider the hinge loss as the loss function, though the examples carry over trivially to the logistic loss.

1. The first example is the case where, even though there exist representations in $\mathcal{F}$ that can separate every class, a suboptimal representation is picked by the algorithm when $k = \Omega(|\mathcal{C}|)$. Let $\mathcal{C} = \{c_i\}_{i \in [n]}$ where for each class, $\mathcal{D}_{c_i}$ is uniform over two points $\{x_i^1, x_i^2\}$. Let $e_i$ be the indicator vectors in $\mathbb{R}^n$ and let the class $\mathcal{F}$ consist of $\{f_0, f_1\}$ with $f_0, f_1 : \mathcal{X} \to \mathbb{R}^n$, where $f_1(x_i^1) = \frac{3}{2} r e_i$ and $f_1(x_i^2) = \frac{1}{2} r e_i$ for all $i$, for some $r > 0$, and $f_0 = \vec{0}$. Finally, $\rho$ is uniform over $\mathcal{C}$. Now, when the number of negative samples is $\Omega(n)$, the probability that $\exists j \in [k]$ such that $c^+ = c_j^-$ is constant, and therefore $L_{un}(f_1) = \Omega(r^2) > 1 = L_{un}(f_0)$ when $r$ is large. This means that despite $L_{sup}(\mathcal{C}, f_1) = 0$, the algorithm will pick $f_0$, which is a suboptimal representation.

2. We can extend the first example to the case where, even when $k = o(|\mathcal{C}|)$, the algorithm picks a suboptimal representation. To do so, we simply 'replicate' the first example to create clusters of classes. Formally, let $\mathcal{C} = \{c_{ij}\}_{i, j \in [n]}$ where for each class, $\mathcal{D}_{c_{ij}}$ is uniform over two points $\{x_{ij}^1, x_{ij}^2\}$. As above, let $\mathcal{F}$ consist of two functions $\{f_0, f_1\}$. The function $f_1$ maps $f_1(x_{ij}^1) = \frac{3}{2} r e_i$ and $f_1(x_{ij}^2) = \frac{1}{2} r e_i$ for all $i, j$, and $f_0 = \vec{0}$. $\rho$ is uniform over $\mathcal{C}$. Now, note that $f_1$ 'clusters' the $n^2$ classes and their points into $n$ clusters, one along each $e_i$. Thus, it is only useful for contrasting classes from different clusters. However, note that the probability of an intra-cluster collision with $k$ negative samples is $1 - (1 - 1/n)^k$. When $k = o(n)$, we have that $L_{un}(f_1) = o(1) < 1 = L_{un}(f_0)$, so the algorithm will pick $f_1$. However, when $k = \Omega(n)$, $L_{un}(f_1) = \Omega(r^2) > 1 = L_{un}(f_0)$ and the algorithm will pick the suboptimal representation $f_0$. Thus, despite $|\mathcal{C}| = n^2$, having more than $n$ negative samples can hurt performance, since even though $f_1$ cannot solve all the tasks, the average supervised loss over $t$-way tasks, $t = o(n)$, is $L_{sup}(f_1) \le O(1 - (1 - 1/n)^{t-1}) = o(1)$.
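The first example is easy to reproduce numerically. The sketch below (our own illustration; the particular values of $n$, $r$ and the range of $k$ are arbitrary choices) builds the $n$ classes and the representation $f_1$ and estimates its $k$-negative hinge loss by Monte Carlo, while $L_{un}(f_0) = 1$ exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 50, 10.0                                     # number of classes, scale r
E = np.eye(n)

def f1(i, which):
    # f1(x_i^1) = (3/2) r e_i,  f1(x_i^2) = (1/2) r e_i
    return (1.5 if which == 0 else 0.5) * r * E[i]

def hinge(v):
    return max(0.0, 1.0 + np.max(-v))

def unsup_loss_f1(k, trials=20000):
    total = 0.0
    for _ in range(trials):
        c_pos = rng.integers(n)
        c_neg = rng.integers(n, size=k)             # negatives may collide with c_pos
        fx  = f1(c_pos, rng.integers(2))
        fxp = f1(c_pos, rng.integers(2))
        fxn = np.stack([f1(c, rng.integers(2)) for c in c_neg])
        total += hinge((fxp - fxn) @ fx)
    return total / trials

# f0 = 0 gives every score 0, hence hinge loss exactly 1 for any k.
for k in (1, 10, n, 2 * n):
    print(f"k={k:3d}  L_un(f1) ~ {unsup_loss_f1(k):7.2f}   L_un(f0) = 1.0")
# With very few negatives f1 beats f0; once collisions with c^+ become likely,
# L_un(f1) grows like r^2 and the minimizer switches to the degenerate f0,
# even though f1 separates every class.
```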

D. Experiments

D.1. Wiki-3029 construction

We use the Wikipedia dump and select articles that have entries in WordNet, have at least 8 sections, and have at least 12 sentences of length at least 4 per section. At the end of this filtering we are left with 3029 articles with at least 200 sentences per article. We then sample 200 sentences from each article and do a 70%/10%/20% train/dev/test split.
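One plausible reading of this construction as code is sketched below (a schematic only: `articles` and `wordnet_titles` are hypothetical inputs standing in for a parsed Wikipedia dump and a WordNet title index, the section filter is our interpretation of the criterion above, and we assume the 70%/10%/20% split is taken over the 200 sentences of each article).

```python
import random

def build_wiki3029(articles, wordnet_titles, seed=0):
    """Schematic Wiki-3029 construction (assumptions noted in the text above).

    `articles`: iterable of (title, sections) pairs, each section a list of sentences.
    `wordnet_titles`: set of article titles that have WordNet entries.
    """
    rng = random.Random(seed)
    corpus = {}
    for title, sections in articles:
        if title not in wordnet_titles:
            continue
        # keep sections that contain >= 12 sentences of length >= 4 tokens
        kept = [[s for s in sec if len(s.split()) >= 4] for sec in sections]
        kept = [sec for sec in kept if len(sec) >= 12]
        if len(kept) < 8:
            continue
        sentences = [s for sec in kept for s in sec]
        if len(sentences) < 200:
            continue
        corpus[title] = rng.sample(sentences, 200)   # 200 sentences per article
    # 70%/10%/20% split of each article's 200 sentences
    splits = {"train": {}, "dev": {}, "test": {}}
    for title, sents in corpus.items():
        rng.shuffle(sents)
        splits["train"][title] = sents[:140]
        splits["dev"][title]   = sents[140:160]
        splits["test"][title]  = sents[160:]
    return splits
```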

D.2. GRU model

We use a bi-directional GRU with output dimension of 300 trained using dropout 0.3. The input word embeddings are initialized to pretrained CC GloVe vectors and fixed throughout training.
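A minimal PyTorch sketch consistent with this description (our own reconstruction, not the authors' code): we read the "output dimension of 300" as the dimension of the final sentence representation, i.e. a per-direction hidden size of 150, and apply the dropout to the (frozen) input embeddings; both readings are assumptions.

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Bi-directional GRU sentence encoder with frozen pretrained embeddings."""

    def __init__(self, glove_weights, hidden_size=150, dropout=0.3):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pretrained CC GloVe vectors,
        # kept fixed throughout training (freeze=True).
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(
            input_size=glove_weights.size(1),
            hidden_size=hidden_size,          # 150 per direction -> 300 overall
            bidirectional=True,
            batch_first=True,
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) LongTensor of word indices
        emb = self.dropout(self.embed(token_ids))
        _, h_n = self.gru(emb)                # h_n: (2, batch, hidden_size)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 300)

# Usage with random stand-in embeddings:
encoder = GRUEncoder(torch.randn(10000, 300))
reps = encoder(torch.randint(0, 10000, (4, 20)))     # -> shape (4, 300)
```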