Strength from Weakness: Fast Learning Using Weak Supervision

Joshua Robinson 1   Stefanie Jegelka 1   Suvrit Sra 1

1 Massachusetts Institute of Technology, Cambridge, MA 02139. Correspondence to: Joshua Robinson.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We study generalization properties of weakly supervised learning, that is, learning where only a few "strong" labels (the actual target for prediction) are present but many more "weak" labels are available. In particular, we show that pretraining using weak labels and finetuning using strong labels can accelerate the learning rate for the strong task to the fast rate of O(1/n), where n is the number of strongly labeled data points. This acceleration can happen even if, by itself, the strongly labeled data admits only the slower O(1/√n) rate. The acceleration depends continuously on the number of weak labels available, and on the relation between the two tasks. Our theoretical results are reflected empirically across a range of tasks and illustrate how weak labels speed up learning on the strong task.

1. Introduction

While access to large amounts of labeled data has enabled the training of big models with great successes in applied machine learning, labeled data remains a key bottleneck. In numerous settings (e.g., scientific measurements, experiments, medicine), obtaining a large number of labels can be prohibitively expensive, error prone, or otherwise infeasible.

When labels are scarce, a common alternative is to use additional sources of information: "weak labels" that contain information about the "strong" target task and are more readily available, e.g., a related task, or noisy versions of strong labels from non-experts or cheaper measurements. Such a setting is called weakly supervised learning, and, given its great practical relevance, it has received much attention (Zhou, 2018; Pan & Yang, 2009; Liao et al., 2005; Dai et al., 2007; Huh et al., 2016). A prominent example that enabled breakthrough results in computer vision and is now standard is pretraining, where one first trains a complex model on a related, large-data task, and then uses the learned features for finetuning on the small-data target task (Girshick et al., 2014; Donahue et al., 2014; Zeiler & Fergus, 2014; Sun et al., 2017). Numerous approaches to weakly supervised learning have succeeded in a variety of tasks beyond computer vision (Oquab et al., 2015; Durand et al., 2017; Carreira & Zisserman, 2017; Fries et al., 2019). Examples include clinical text classification (Wang et al., 2019), sentiment analysis (Medlock & Briscoe, 2007), social media content tagging (Mahajan et al., 2018) and many others. Weak supervision is also closely related to unsupervised learning methods such as complementary and contrastive learning (Xu et al., 2019; Chen & Batmanghelich, 2019; Arora et al., 2019), and particularly to self-supervised learning (Doersch et al., 2015), where feature maps learned via supervised training on artificially constructed tasks have been found to even outperform ImageNet-learned features on certain downstream tasks (Misra & van der Maaten, 2019).

In this paper, we make progress towards building theoretical foundations for weakly supervised learning, i.e., where we have a few strong labels, but too few to learn a good model in a conventional supervised manner. Specifically we ask,

    Under what conditions can large amounts of weakly labeled data provably help us learn a better model than strong labels alone?

We answer this question by analyzing a generic feature learning algorithm that learns features by pretraining on the weak task, and fine-tunes a model on those features for the strong downstream task. While generalization bounds for supervised learning typically scale as O(1/√n), where n is the number of strongly labeled data points, we show that the pretrain-finetune algorithm can do better, achieving the superior rate of Õ(n^{−γ}) for 1/2 ≤ γ ≤ 1, where γ depends on how much weak data is available, and on the generalization error for the weak task. This rate smoothly interpolates between Õ(1/n) in the best case, when weak data is plentiful and the weak task is not too difficult, and slower rates when less weak data is available or the weak task itself is hard.

One instantiation of our results for categorical weak labels says that, if we can train a model with O(1/√m) excess risk for the weak task (where m is the amount of weak data), and m = Ω(n²), then we obtain a "fast rate" Õ(1/n) on the excess risk of the strong task. This speedup is significant compared to the commonly observed O(1/√n) "slow rates".

To obtain any such results, it is necessary to capture the relatedness between the weak and strong tasks. We formulate a general sufficient condition: that there exists a shared mutual embedding for which predicting the strong label is "easy", and predicting the weak label is possible. "Easy" prediction is formalized via the central condition (van Erven et al., 2012; 2015), a property that ensures that learning improves quickly as more data is added. We merely assume existence of such an embedding; a priori we do not know what this shared embedding is. Our theoretical analysis shows that learning an estimate of the embedding by pretraining on the weak task still allows fast learning on the strong task.

In short, we make the following contributions:

• We introduce a theoretical framework for analyzing weakly supervised learning problems.
• We propose the shared embedding plus central condition as a viable way to quantify relatedness between weak and strong tasks. The condition merely posits the existence of such an embedding; this makes obtaining generalization bounds non-trivial.
• We obtain generalization bounds for the strong task. These bounds depend continuously on two key quantities: 1) the growth rate of the number m of weak labels in terms of the number n of strong labels, and 2) generalization performance on the weak task.
• We show that in the best case, when m is sufficiently larger than n, weak supervision delivers fast rates.

We validate our theoretical findings, and observe that our fast and intermediate rates are indeed observed in practice.

1.1. Examples of Weak Supervision

Coarse Labels. It is often easier to collect labels that capture only part of the information about the true label of interest (Zhao et al., 2011; Guo et al., 2018; Yan et al., 2015; Taherkhani et al., 2019). A particularly pertinent example is semantic labels obtained from hashtags attached to images (Mahajan et al., 2018; Li et al., 2017). Such tags are generally easy to gather in large quantities, but tend to only capture certain aspects of the image that the person tagging them focused on. For example, an image with the tag #dog could easily also contain children, or other label categories that have not been explicitly tagged.

Crowd Sourced Labels. A primary way of obtaining large labeled datasets is via crowd-sourcing using platforms such as Amazon Mechanical Turk (Khetan et al., 2018; Kleindessner & Awasthi, 2018). Even for the simplest of labeling tasks, crowd-sourced labels can often be noisy (Zhang & Sabuncu, 2018; Branson et al., 2017; Zhang et al., 2014), which becomes worse for labels requiring expert knowledge. Typically, more knowledgeable labelers are more expensive (e.g., professional doctors versus medical students for a medical imaging task), which introduces a tradeoff between label quality and cost that the user must carefully manage.

Object Detection. A common computer vision task is to draw bounding boxes around objects in an image (Oquab et al., 2015). A popular alternative to expensive bounding box annotations is a set of words describing the objects present, without localization information (Bilen & Vedaldi, 2016; Branson et al., 2017; Wan et al., 2018). This setting too is an instance of coarse labeling.

Model Personalization. In examples like recommender systems (Ricci et al., 2011), online advertising (Naumov et al., 2019), and personalized medicine (Schork, 2015), one needs to make predictions for individuals, while information shared by a larger population acts as supportive, weak supervision (Desrosiers & Karypis, 2011).

2. Weakly Supervised Learning

We begin with some notation. The spaces X and Y denote as usual the space of features and strong labels. In weakly supervised learning, we have in addition W, the space of weak labels. We receive the tuple (X, W, Y) drawn from the product space X × W × Y. The goal is then to predict the strong label Y using the features X, possibly benefiting from the related information captured by W.

More specifically, we work with two datasets: (1) a weakly labeled dataset D_m^weak of m examples drawn independently from the marginal distribution P_{X,W}; and (2) a dataset D_n^strong of n strongly labeled examples drawn from the marginal P_{X,Y}. Typically, n ≪ m.

We then use the weak labels to learn an embedding in a latent space Z ⊆ R^s. In particular, we assume that there exists an unknown "good" embedding Z = g_0(X) ∈ Z, using which a linear predictor A* can determine W, i.e., A*Z = A*g_0(X) = W in the regression case, and σ(A*g_0(X)) = W in the classification setting, where σ is the softmax function. The g_0 assumption holds, for example, whenever W is a deterministic function of X. This assumption is made to simplify the exposition; if it does not hold exactly, then one can still obtain a generalization guarantee by introducing an additive term to the final generalization bound equal to the smallest error attainable by any measurable hypothesis, reflecting the inherent noise in the problem of predicting W from X.

Using the latent space Z, we define two function classes: strong predictors F ⊂ { f : X × Z → Y }, and weak feature maps G ⊂ { g : X → Z }. Later we will assume that the class F is parameterized, and identify functions f in F with parameter vectors.
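To make the setup concrete, the short synthetic sketch below generates data satisfying the shared-embedding assumption above. The linear form of g_0, the dimensions, and all variable names are illustrative assumptions for this sketch only, not part of the formal setup.

import numpy as np

rng = np.random.default_rng(0)
s, d, n, m = 8, 32, 200, 20000            # latent dim, feature dim, # strong, # weak examples

G0 = rng.normal(size=(s, d)) / np.sqrt(d)  # unknown "good" embedding g_0 (linear here for simplicity)
A_star = rng.normal(size=(1, s))           # linear weak predictor: W = A* g_0(X)  (regression case)
B_star = rng.normal(size=(1, s))           # the strong label also depends only on Z = g_0(X)

def g0(x):                                 # Z = g_0(X)
    return x @ G0.T

X_weak = rng.normal(size=(m, d))           # weak dataset D_m^weak = {(x_i, w_i)} from P_{X,W}
W_weak = g0(X_weak) @ A_star.T

X_strong = rng.normal(size=(n, d))         # strong dataset D_n^strong = {(x_i, y_i)} from P_{X,Y}
Y_strong = (g0(X_strong) @ B_star.T > 0).astype(int)   # a binary strong label that is easy given Z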

We then learn a predictor f ∈ F by replacing the latent vector Z with an embedding ĝ(X) ∈ Z that we learn from weakly labeled data. Corresponding to each of these function classes we introduce two loss functions.

First, ℓ : Y × Y → R_+ measures the loss of the strong predictor; we assume this loss to be continuously differentiable in its first argument. We will equivalently write ℓ_f(x, z, y) := ℓ(f(x, z), y) for predicting from a latent vector z ∈ Z; similarly, for predicting from an estimate ẑ = g(x), we write the loss as ℓ_{f(·,g)}(x, y) := ℓ(f(x, g(x)), y).

Second, ℓ^weak : W × W → R_+ measures the loss for the weak task. This loss also applies to measuring the loss of feature maps g : X → Z, by using the best possible downstream linear classifier, i.e., ℓ_g^weak(x, w) = ℓ^weak(A_g g(x), w), where A_g ∈ argmin_{A ∈ R^{|W|×s}} E[ℓ^weak(A g(X), W)]. In the classification case, for notational simplicity, we fold the softmax function into ℓ^weak to convert numeric predictions to a probability distribution.

Our primary goal is to learn a model ĥ = f̂(·, ĝ) : X → Y that achieves low risk E[ℓ_ĥ(X, Y)]. To that end, we seek to bound the excess risk:

    E_P[ℓ_ĥ(X, Y) − ℓ_{h*}(X, Y)],    (1)

for h* = f*(·, g*), where g* and f* are given by

    g* ∈ argmin_{g ∈ G} E[ℓ_g^weak(X, W)],
    f* ∈ argmin_{f ∈ F} E[ℓ_{f(·,g*)}(X, Y)].

The comparison of ĥ to h*, based on the best embedding g* for the weak task, is the most natural one for the pretrain-finetune algorithm that we analyze (see Algorithm 1).

We are interested in studying the rate at which the excess risk (1) goes to zero. Specifically, we are interested in the learning rate parameter γ for which the excess risk is O(n^{−γ}). We refer to γ ≤ 1/2 as a slow rate, and γ ≥ 1 as a fast rate (possibly ignoring logarithmic factors, i.e., Õ(1/n)). When 1/2 < γ < 1 we have intermediate rates.

2.1. Pretrain-finetune meta-algorithm

The algorithm we analyze solves two supervised learning problems in sequence. The first step runs an algorithm,

    ĝ ← Alg_m(G, P_{X,W}),

on m i.i.d. observations from P_{X,W}, and outputs a feature map ĝ ∈ G. Using the resulting ĝ we form an augmented dataset D_n^aug = {(x_i, z_i, y_i)}_{i=1}^n, where z_i := ĝ(x_i) for (x_i, y_i) ∈ D_n^strong. Therewith, we have n i.i.d. samples from the distribution P̂(X, Z, Y) := P(X, Y) 1{Z = ĝ(X)}.

The second step then runs an algorithm,

    f̂ ← Alg_n(F, P̂),

on n i.i.d. samples from P̂, and outputs a strong predictor f̂ ∈ F. The final output is then simply the composition ĥ = f̂(·, ĝ). This procedure is summarized in Algorithm 1 and the high-level schema in Figure 1.

Algorithm 1 Pretrain-finetune meta-algorithm
1: input D_m^weak, D_n^strong, F, G
2: Obtain weak embedding ĝ ← Alg_m(G, P_{X,W})
3: Form dataset D_n^aug = {(x_i, z_i, y_i)}_{i=1}^n where z_i := ĝ(x_i) for (x_i, y_i) ∈ D_n^strong
4: Define distribution P̂(X, Z, Y) = P(X, Y) 1{Z = ĝ(X)}
5: Obtain strong predictor f̂ ← Alg_n(F, P̂)
6: return ĥ(·) := f̂(·, ĝ(·))

Figure 1. Schema for weakly supervised learning using Algorithm 1. The dotted lines denote the flow of strong data, and the solid lines the flow of weak data.
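As a concrete illustration, here is a minimal PyTorch-style sketch of the two steps of Algorithm 1 for categorical weak and strong labels. The architectures, the out_dim attribute on the embedding network, the optimizers, and the epoch counts are illustrative assumptions and are not prescribed by the meta-algorithm (the paper's experiments use ResNet feature maps; see Appendix C).

import torch
import torch.nn as nn
import torch.nn.functional as F_nn

def pretrain_weak(g, weak_loader, num_weak_classes, epochs=10, lr=1e-3):
    """Step 2: learn the weak embedding g_hat by supervised training on (x, w) pairs."""
    head = nn.Linear(g.out_dim, num_weak_classes)   # linear weak predictor on top of g
    opt = torch.optim.Adam(list(g.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(epochs):
        for x, w in weak_loader:
            loss = F_nn.cross_entropy(head(g(x)), w)
            opt.zero_grad(); loss.backward(); opt.step()
    return g                                        # g_hat

def finetune_strong(g_hat, f, strong_loader, epochs=10, lr=1e-3):
    """Steps 3-5: ERM for f on the augmented dataset {(x_i, g_hat(x_i), y_i)}."""
    for p in g_hat.parameters():                    # freeze the weak embedding
        p.requires_grad_(False)
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in strong_loader:
            z = g_hat(x).detach()                   # z_i := g_hat(x_i)
            loss = F_nn.cross_entropy(f(x, z), y)   # f : X x Z -> Y
            opt.zero_grad(); loss.backward(); opt.step()
    # Step 6: the final predictor h_hat(.) = f_hat(., g_hat(.))
    return lambda x: f(x, g_hat(x))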
Algorithm 1 is generic because in general the two supervised learning steps can use any learning algorithm. Our analysis treats the case where Alg_n(F, P̂) is empirical risk minimization (ERM), but is agnostic to the choice of learning algorithm Alg_m(G, P_{X,W}). Our results use high-level properties of these two steps, in particular their generalization error, which we introduce next.

We break the generalization analysis into two terms depending on the bounds for each of the two supervised learning steps. We introduce here the notation Rate(·) to enable a more convenient discussion of these rates. We describe our notation in the format of definitions to expedite the statement of the theoretical results in Section 3.

Definition 1 (Weak learning). Let Rate_m(G, P_{X,W}; δ) be such that a (possibly randomized) algorithm Alg_m(G, P_{X,W}), which takes as input a function class G and m i.i.d. observations from P_{X,W}, returns a weak embedding ĝ ∈ G for which,

    E_P[ℓ_ĝ^weak(X, W)] ≤ Rate_m(G, P_{X,W}; δ),

with probability at least 1 − δ.

We are interested in two particular cases of loss function ℓ^weak: (i) ℓ^weak(w, w′) = 1{w ≠ w′} when W is a categorical space; and (ii) ℓ^weak(w, w′) = ‖w − w′‖ (for some norm ‖·‖ on W) when W is a continuous space.

Definition 2 (Strong learning). Let Rate_n(F, Q; δ) be such that a (possibly randomized) algorithm Alg_n(F, Q), which takes as input a function space F and n i.i.d. observations from a distribution Q(X × Z × Y), returns a strong predictor f̂ ∈ F for which,

    E_{U∼Q}[ℓ_f̂(U) − ℓ_{f*}(U)] ≤ Rate_n(F, Q; δ),

with probability at least 1 − δ.

Henceforth, we drop δ from the rate symbols, for example writing Rate_m(G, P_{X,W}) instead of Rate_m(G, P_{X,W}; δ).

It is important to note that the algorithms Alg_m(G, P_{X,W}) and Alg_n(F, Q) can use any loss functions during training. In particular, even though we assume ℓ^weak to be a metric, it is possible to use non-metric losses such as cross entropy during training. This is because the only requirement we place on these algorithms is that they imply generalization bounds in terms of the losses ℓ^weak and ℓ respectively. For concreteness, our analysis focuses on the case where Alg_n(F, Q) is ERM using the loss ℓ.

3. Excess Risk Analysis

In this section we analyze Algorithm 1 with the objective of obtaining high-probability excess risk bounds (see (1)) for the strong predictor ĥ = f̂(·, ĝ). Informally, the main theorem we prove is the following.

Theorem 3 (Informal). Suppose that Rate_m(G, P_{X,W}) = O(m^{−α}) and that Alg_n(F, P̂) is ERM. Under suitable assumptions on (ℓ, P, F), Algorithm 1 obtains excess risk

    O( (αβ log n + log(1/δ))/n + 1/n^{αβ} )

with probability 1 − δ, when m = Ω(n^β) for W discrete, or m = Ω(n^{2β}) for W continuous.

For the prototypical scenario where Rate_m(G, P_{X,W}) = O(1/√m), one obtains fast rates when m = Ω(n²) and m = Ω(n⁴) in the discrete and continuous cases, respectively. More generally, if αβ < 1 then O(n^{−αβ}) is the dominant term and we observe intermediate or slow rates.

In order to obtain any such result, it is necessary to quantify how the weak and strong tasks relate to one another: if they are completely unrelated, then there is no reason to expect the representation ĝ(X) to benefit the strong task. The next subsection addresses this question.

3.1. Relating weak and strong tasks

Next, we formally quantify the relationship between the weak and strong tasks via two concepts. First, we assume that the two tasks share a mutual embedding Z = g_0(X). Alone, this is not enough, since otherwise one could simply take the trivial embedding X = g_0(X), which will not inform the strong task. The central condition that we introduce in this section quantifies how the embedding makes the strong task "easy". Second, we assume a shared stability via a "relative Lipschitz" property: small perturbations to the feature map g that do not hurt the weak task do not affect the strong prediction loss much either.

Definition 4. We say that f is L-Lipschitz relative to G if for all x ∈ X, y ∈ Y, and g, g′ ∈ G,

    |ℓ_{f(·,g)}(x, y) − ℓ_{f(·,g′)}(x, y)| ≤ L ℓ^weak(A_g g(x), A_{g′} g′(x)).

We say the function class F is L-Lipschitz relative to G if every f ∈ F is L-Lipschitz relative to G.

The Lipschitz terminology is justified since the domain uses the pushforward pseudometric (z, z′) ↦ ℓ^weak(A_g z, A_{g′} z′), and the range is a subset of R_+. In the special case where Z = W, g(X) is actually an estimate of the weak label W, and relative Lipschitzness reduces to conventional Lipschitzness of ℓ(f(x, w), y) in w.

The central condition is well known to yield fast rates for supervised learning (van Erven et al., 2015); it directly implies that we could learn a map (X, Z) ↦ Y with Õ(1/n) excess risk. The difficulty with this naive view is that at test time we would need access to the latent value Z = g_0(X), an implausible requirement. To circumnavigate this hurdle, we replace g_0 with ĝ by solving the supervised problem (ℓ, P̂, F), for which we will have access to data.

But it is not clear whether this surrogate problem would continue to satisfy the central condition. One of our main theoretical contributions is to show that (ℓ, P̂, F) indeed satisfies a weak central condition (Propositions 7 and 8), and to show that this weak central condition still enables strong excess risk guarantees (Proposition 9).

We are now ready to define the central condition. In essence, this condition requires that (X, Z) is highly predictive of Y, which, combined with the fact that g_0(X) = Z has zero risk on W, links the weak and strong tasks together.

Definition 5 (The Central Condition). A learning problem (ℓ, P, F) on U := X × Z × Y is said to satisfy the ε-weak η-central condition if there exists an f* ∈ F such that

    E_{U∼P}[ e^{−η(ℓ_f(U) − ℓ_{f*}(U))} ] ≤ e^{ηε},

for all f ∈ F. The 0-weak central condition is known as the strong central condition.

We assume that the strong central condition holds for our weakly supervised problem (ℓ, P, F) with P = P_U = P_{X,Z,Y}, where Z = g_0(X). A concrete but general example of a class of weakly supervised problems satisfying the shared embedding assumption and the central condition are those where the weak and strong labels share a common latent embedding Z, and Y is a logistic model on Z. In detail, let Z = g_0(X) be an arbitrary embedding of the input X, and let W be a deterministic function of Z. Suppose also that Y = σ(A*Z) = f*(Z) for some matrix A*, where σ denotes the softmax function. Then, as observed by Foster et al. (2018), the learning problem (ℓ, P, F) is Vovk mixable, and hence the central condition holds (van Erven et al., 2015), where ℓ is the logistic loss and F = {Z ↦ σ(AZ) : A ∈ R^{|Y|×s}}.
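As a quick numerical illustration of Definition 5, the script below Monte Carlo estimates the left-hand side of the central condition for the logistic example just described. For the log loss with η = 1 and f* the true conditional model, the expectation equals e^{η·0} = 1 exactly, so the estimates should hover around 1; the dimensions, sampler, and choice of competitor models are illustrative choices, not taken from the paper.

import numpy as np

rng = np.random.default_rng(1)
s, k, num_samples, eta = 5, 3, 200_000, 1.0     # latent dim, # classes, MC samples, eta

def softmax(u):
    u = u - u.max(axis=1, keepdims=True)
    e = np.exp(u)
    return e / e.sum(axis=1, keepdims=True)

A_star = rng.normal(size=(k, s))                # true conditional: Y | Z ~ Categorical(softmax(A* Z))
Z = rng.normal(size=(num_samples, s))
P_star = softmax(Z @ A_star.T)
Y = (P_star.cumsum(axis=1) > rng.random((num_samples, 1))).argmax(axis=1)

def log_loss(A):                                # l_f(U) for f = softmax(A .)
    P = softmax(Z @ A.T)
    return -np.log(P[np.arange(num_samples), Y])

loss_star = log_loss(A_star)
for _ in range(5):                              # a few random competitors f in F
    A = A_star + 0.5 * rng.normal(size=(k, s))
    lhs = np.mean(np.exp(-eta * (log_loss(A) - loss_star)))
    print(f"E[exp(-eta*(l_f - l_f*))] ~= {lhs:.3f}  (should be ~1, i.e. e^(eta*0))")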
It is important to note that if one knows that a problem (ℓ, P, F) on U := X × Z × Y satisfies the central condition, then it is not a priori clear if one can construct a hypothesis set F̃ ⊆ {X → Y} such that (ℓ, P_{X,Y}, F̃) satisfies the central condition. In the standard supervised setting, with samples only from P_{X,Y}, this is likely impossible in general. This is because in the latter case the feature Z is no longer an input to the model, and so potentially valuable predictive features are lost. However, one perspective on the analysis in this section is that we show that, with the added support of samples from P_{X,W}, the hypothesis class F̃ = {f(·, ĝ(·)) : f ∈ F}, where ĝ(X) is learned using the weakly labeled samples, does indeed satisfy the central condition with a slightly larger ε.

The central condition and related literature. The central condition unifies many well-studied conditions known to imply fast rates (van Erven et al., 2015), including Vapnik and Chervonenkis' original condition that there is an f* ∈ F with zero risk (Vapnik & Chervonenkis, 1971; 1974). The popular strong-convexity condition (Kakade & Tewari, 2009; Lecué et al., 2014) is also a special case, as is (stochastic) exponential concavity, which is satisfied by density estimation: where F are probability densities, and ℓ_f(u) = −log f(u) is the logarithmic loss (Audibert et al., 2009; Juditsky et al., 2008; Dalalyan et al., 2012). Another example is Vovk mixability (Vovk, 1990; 1998), which holds for online logistic regression (Foster et al., 2018), and also holds for uniformly bounded functions with the square loss. A modified version of the central condition also generalizes the Bernstein condition and Tsybakov's margin condition (Bartlett & Mendelson, 2006; Tsybakov et al., 2004).

As noted earlier, Z is not observable at train or test time, so we cannot simply treat the problem as a single supervised learning problem. Therefore, obtaining fast or intermediate rates is a nontrivial challenge. We approach this challenge by splitting the learning procedure into two supervised tasks (Algorithm 1). In its second step, Algorithm 1 replaces (ℓ, P, F) with (ℓ, P̂, F). Our strategy to obtain generalization bounds is first to guarantee that (ℓ, P̂, F) satisfies the weak central condition, and then to show that the weak central condition implies the desired generalization guarantees. The rest of this section develops the theoretical machinery needed for obtaining our bounds. We summarize the key steps of our argument below.

1. Decompose the excess risk into two components: the excess risk of the weak predictor and the excess risk on the learning problem (ℓ, P̂, F) (Proposition 6).
2. Show that the learning problem (ℓ, P̂, F) satisfies a relaxed version of the central condition, the "weak central condition" (Propositions 7 and 8).
3. Show that the ε-weak central condition yields excess risk bounds that improve as ε decreases (Proposition 9).
4. Combine all previous results to obtain generalization bounds for Algorithm 1 (Theorem 10).

3.2. Generalization Bounds for Weakly Supervised Learning

The first item on the agenda is Proposition 6, which obtains a generic bound on the excess risk in terms of Rate_m(G, P_{X,W}) and Rate_n(F, P̂).

Proposition 6 (Excess risk decomposition). Suppose that f* is L-Lipschitz relative to G. Then the excess risk E[ℓ_ĥ(X, Y) − ℓ_{h*}(X, Y)] is bounded by,

    2L Rate_m(G, P_{X,W}) + Rate_n(F, P̂).

The first term corresponds to the excess risk on the weak task, which we expect to be small since that environment is data-rich. Hence, the problem of obtaining excess risk bounds reduces to bounding the second term, Rate_n(F, P̂).
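To see where the two terms in Proposition 6 come from, the following is a rough sketch of one way the bound can be assembled from Definitions 1, 2, and 4. It is only a heuristic outline, not the formal proof; note that E_{P̂}[ℓ_f] = E_P[ℓ_{f(·,ĝ)}], so Definition 2 applied with Q = P̂ controls the second term even though f* need not be optimal for P̂.

% Sketch only; each inequality holds with probability at least 1 - delta.
\begin{align*}
\mathbb{E}\!\left[\ell_{\hat h}(X,Y) - \ell_{h^*}(X,Y)\right]
 &= \mathbb{E}\!\left[\ell_{f^*(\cdot,\hat g)} - \ell_{f^*(\cdot,g^*)}\right]
  + \mathbb{E}\!\left[\ell_{\hat f(\cdot,\hat g)} - \ell_{f^*(\cdot,\hat g)}\right] \\
 &\le L\,\mathbb{E}\!\left[\ell^{\mathrm{weak}}\!\big(A_{\hat g}\,\hat g(X),\, A_{g^*}\,g^*(X)\big)\right]
  + \mathrm{Rate}_n(\mathcal{F},\hat P)
  && \text{(Definition 4 with $f = f^*$; Definition 2)} \\
 &\le L\,\mathbb{E}\!\left[\ell^{\mathrm{weak}}_{\hat g}(X,W) + \ell^{\mathrm{weak}}_{g^*}(X,W)\right]
  + \mathrm{Rate}_n(\mathcal{F},\hat P)
  && \text{($\ell^{\mathrm{weak}}$ is a metric: triangle inequality through $W$)} \\
 &\le 2L\,\mathrm{Rate}_m(\mathcal{G},P_{X,W}) + \mathrm{Rate}_n(\mathcal{F},\hat P).
  && \text{($g^*$ minimizes the weak risk; Definition 1)}
\end{align*}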
This second term is much more opaque; we spend the rest of the section primarily analyzing it.

We now prove that if (ℓ, P, F) satisfies the ε-weak central condition, then the artificial learning problem (ℓ, P̂, F), obtained by replacing the true population distribution P with the estimate P̂, satisfies a slightly weaker central condition. We consider the categorical and continuous W-space cases separately, obtaining an improved rate in the categorical case. In both cases, the proximity of this weaker central condition to the ε-weak central condition is governed by Rate_m(G, P_{X,W}), but the dependencies are different.

Proposition 7 (Categorical weak label). Suppose that ℓ^weak(w, w′) = 1{w ≠ w′}, that ℓ is bounded by B > 0, that F is Lipschitz relative to G, and that (ℓ, P, F) satisfies the ε-weak central condition. Then (ℓ, P̂, F) satisfies the (ε + O(e^B Rate_m(G, P_{X,W})))-weak central condition with probability at least 1 − δ.

Next, we consider the norm-induced loss. In this case it is also possible to obtain the weak central condition for the artificially augmented problem (ℓ, P̂, F).

Proposition 8 (Continuous weak label). Suppose that ℓ^weak(w, w′) = ‖w − w′‖, that ℓ is bounded by B > 0, that F is L-Lipschitz relative to G, and that (ℓ, P, F) satisfies the ε-weak central condition. Then (ℓ, P̂, F) satisfies the (ε + O(L e^B √(Rate_m(G, P_{X,W}))))-weak central condition with probability at least 1 − δ.

For both propositions, a slight modification of the proofs easily eliminates the e^B term when Rate_m(G, P_{X,W}) ≤ O(e^{−B}). Since we typically consider the regime where Rate_m(G, P_{X,W}) is close to zero, Propositions 7 and 8 essentially say that replacing P by P̂ only increases the weak central condition parameter slightly.

The next, and final, step in our argument is to obtain a generalization bound for ERM under the ε-weak central condition. Once we have this bound, one can obtain good generalization bounds for the learning problem (ℓ, P̂, F), since the previous two propositions guarantee that it satisfies the weak central condition for some small ε. Combining this observation with the results from the previous section finally allows us to obtain generalization bounds on Algorithm 1 when Alg_n(F, P̂) is ERM.

For this final step, we assume that our strong predictor class F is parameterized by a vector in R^d, and identify each f with this parameter vector. We also assume that the parameters live in an L2 ball of radius R. By Lagrangian duality this is equivalent to our learning algorithm being ERM with L2-regularization for some regularization parameter.

Proposition 9. Suppose (ℓ, Q, F) satisfies the ε-weak central condition, ℓ is bounded by B > 0, each f ∈ F is L′-Lipschitz in its parameters in the ℓ2 norm, F is contained in the Euclidean ball of radius R, and Y is compact. Then when Alg_n(F, Q) is ERM, the excess risk E_Q[ℓ_f̂(U) − ℓ_{f*}(U)] is bounded by,

    O( V · (d log(RL′/ε) + log(1/δ))/n + Vε ),

with probability at least 1 − δ, where V = B + ε.

Any parameterized class of functions that is continuously differentiable in its parameters satisfies the L′-Lipschitz requirement, since we assume the parameters live in a closed ball of radius R. The Y compactness assumption can be dropped in the case where y ↦ ℓ(y, ·) is Lipschitz.

Observe that the bound in Proposition 9 depends linearly on d, the number of parameters of F. Since we consider the regime where n is small, the user might use only a small model (e.g., a shallow network) to parameterize F, so d may not be too large. On the other hand, the bound is independent of the complexity of G. This is important since the user may want to use a powerful model class for g to profit from the bountiful amounts of weak labels.

Proposition 9 gives a generalization bound for any learning problem (ℓ, Q, F) satisfying the weak central condition, and may therefore be of interest in the theory of fast rates more broadly. For our purposes, however, we shall apply it only to the particular learning problem (ℓ, P̂, F). In this case, the ε shall depend on Rate_m(G, P_{X,W}), yielding strong generalization bounds when ĝ has low excess risk.

Combining Proposition 9 with both of the two previous propositions yields fast rate guarantees (Theorem 10) for the double estimation algorithm (Algorithm 1) with ERM. The final bound depends on the rate of learning for the weak task, and on the quantity m of weak data available.

Theorem 10 (Main result). Suppose the assumptions of Proposition 9 hold, (ℓ, P, F) satisfies the central condition, and Rate_m(G, P_{X,W}) = O(m^{−α}). Then, when Alg_n(F, P̂) is ERM, we obtain excess risk E_P[ℓ_ĥ(X, Y) − ℓ_{h*}(X, Y)] bounded by,

    O( (dαβ log(RL′n) + log(1/δ))/n + L/n^{αβ} ),

with probability at least 1 − δ, if either of the following conditions hold:

1. m = Ω(n^β) and ℓ^weak(w, w′) = 1{w ≠ w′} (discrete W-space).
2. m = Ω(n^{2β}) and ℓ^weak(w, w′) = ‖w − w′‖ (continuous W-space).

To reduce clutter we absorb the dependence on B into the big-O. The key quantities governing the ultimate learning rate are α, the learning rate on the weak task, and β, which determines the amount of weak labels relative to strong.
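To see informally how the pieces combine in the discrete case, plugging the weak central condition parameter from Proposition 7 into Proposition 9 reproduces the shape of the Theorem 10 bound. The display below is only heuristic bookkeeping, suppressing constants and the dependence on B and L.

% Heuristic bookkeeping for the discrete case (constants and the dependence on B, L suppressed).
\[
  \varepsilon \;\asymp\; \mathrm{Rate}_m(\mathcal{G}, P_{X,W}) \;=\; O(m^{-\alpha})
  \;=\; O(n^{-\alpha\beta}) \qquad \text{when } m = \Omega(n^{\beta}),
\]
\[
  \text{Proposition 9 with } Q = \hat{P}: \quad
  O\!\left(\frac{d\log(RL'/\varepsilon) + \log(1/\delta)}{n} + \varepsilon\right)
  \;=\; O\!\left(\frac{d\,\alpha\beta\,\log(RL'\,n) + \log(1/\delta)}{n} + \frac{1}{n^{\alpha\beta}}\right).
\]

Hence αβ ≥ 1 (for instance α = 1/2 and m = Ω(n²)) yields the fast Õ(1/n) rate, while αβ < 1 leaves n^{−αβ} as the dominant term, matching the informal Theorem 3.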
4. Experiments

We experimentally study two types of weak labels: noisy and coarse. We study two cases: when the amount of weak data grows linearly with the amount of strong data, and when the amount of weak data grows quadratically with the amount of strong data (plus a baseline with no weak data). Note that in a log-log plot the negative of the gradient is the learning rate γ such that excess risk is O(n^{−γ}). All image-based experiments use either a ResNet-18 or ResNet-34 for the weak feature map g (see Appendix C for full details).

4.1. Noisy Labels

We simulate a noisy labeler who makes labelling mistakes in a data-dependent way (as opposed to independent random noise) by training an auxiliary deep network on a held-out dataset to classify at a certain accuracy; for our CIFAR-10 experiments we train to 90% accuracy. This is intended to mimic human annotators working on a crowd-sourcing platform. The predictions of the auxiliary network are used as weak labels.
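The learning rates γ reported in the figures can be read off a log-log plot as described above; a minimal sketch of that estimate is given below. The error values are placeholders for illustration, not the paper's measurements.

import numpy as np

# Fit log(error) ~= c - gamma * log(n): the slope of a least-squares line in
# log-log coordinates gives (minus) the empirical learning rate gamma.
n_strong = np.array([2000, 2500, 3000, 3500, 4000, 4500, 5000])
test_error = np.array([0.46, 0.41, 0.37, 0.34, 0.31, 0.29, 0.27])   # placeholder values

slope, intercept = np.polyfit(np.log(n_strong), np.log(test_error), deg=1)
gamma = -slope
print(f"estimated learning rate gamma ~= {gamma:.2f}")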

Figure 2. Generalization error on CIFAR-10 using noisy weak labels for different growth rates of m. The left-hand diagram is for the simulated "noisy labeler", the right-hand picture is for random noise. (Each panel plots error against the number of strong labels for m = 0, m = Ω(n), and m = Ω(n²), with the fitted learning rate γ annotated per curve.)

Figure 3. Coarse labels. Generalization error on various datasets using coarse weak label grouping for different growth rates of m. Datasets left to right: MNIST, SVHN, and CIFAR-10. (Each panel plots error against the number of strong labels for m = 0, m = Ω(n), and m = Ω(n²), with the fitted learning rate γ annotated per curve.)

We also run experiments using independent random noise, flipping each label randomly with 10% chance. See Figure 2 for results.

In each case, the generalization error when using additional weak data is lower, and the learning rate itself is higher. Indeed, the learning rate improvement is significant. For simulated noisy labels, γ = 0.44 when m = 0, and γ = 0.92 for m = Ω(n²). Random noisy labels give a similar result, with γ = 0.45 and γ = 0.81 for m = 0 and m = Ω(n²) respectively. The experimental results are in line with our theoretical result that the learning rate should double when β doubles.

4.2. Coarse Labels

We also consider fine-tuning on a small subset of CIFAR-100, labeled with one of 100 categories. For weak pretraining we use a larger dataset of examples labeled with 20 semantically meaningful "super-categories". Each super-category contains exactly 5 of the 100 fine-grained categories. For example, the categories "maple", "oak", "palm", "pine", and "willow" are all part of the super-category "trees". The results are presented in Figure 4. In the natural language domain we consider the TREC question categorization dataset. Similarly to CIFAR-100, examples are naturally divided into six coarse groups of questions, concerning numerics, humans, locations, etc. They are further divided into 50 fine-grained classes, used as strong labels. Since we observed the natural language experiments to be much noisier than the vision tasks, we ran 20 repeats to get a reliable average to observe the central tendency.

Again, generalization error is consistently lower, and the learning rate consistently higher, for larger m growth rates. The differences are generally very significant, e.g. for CIFAR-100, where the top-5 accuracy learning rate is γ = 0.40 for m = 0 and γ = 0.82 for m = Ω(n²), and for MNIST, where γ = 0.89 and γ = 1.52 for m = 0 and m = Ω(n²) respectively. The contrast is slightly less pronounced in the TREC experiments, but the same broad trend is observed.

5. Related Work

Weakly supervised learning. There exists previous work on the case where one only has weak labels. Khetan et al. (2018) consider crowd-sourced labels and use an EM-style algorithm to model the quality of individual workers.

Figure 4. Left and middle: Generalization error on CIFAR-100 using coarse weak labels for different growth rates of m. The left diagram is top-1 accuracy, and the middle diagram is top-5 accuracy. Right: top-1 error for the TREC question categorization task. (Each panel plots error against the number of strong labels for m = 0, m = Ω(n), and m = Ω(n²), with the fitted learning rate γ annotated per curve.)

Another approach, proposed by Ratner et al. (2016; 2019), uses correlation between multiple different weak label sources to estimate the ground truth label. A different approach is to use pairwise semantic (dis)similarity as a form of weak signal about unlabeled data (Arora et al., 2019; Bao et al., 2018), or to use complementary labels, which provide a class that the input does not belong to (Xu et al., 2019; Ishida et al., 2017).

Fast rates. There is a large body of work studying a variety of favorable situations under which it is possible to obtain rates better than slow rates. From a generalization and optimization perspective, strongly convex losses enable fast rates for generalization and fast convergence of stochastic gradient methods (Kakade & Tewari, 2009; Hazan et al., 2007; Foster & Syrgkanis, 2019). These works are special cases of exponentially concave learning, which is itself a special case of the central condition. There are completely different lines of work on fast rates, such as developing data-dependent local Rademacher averages (Bartlett et al., 2005), and herding, which has been used to obtain fast rates for integral approximation (Welling, 2009).

Learning with a nuisance component. The two-step estimation algorithm we study in this paper is closely related to statistical learning under a nuisance component (Chernozhukov et al., 2018; Foster & Syrgkanis, 2019). In that setting one wishes to obtain excess risk bounds for the model f̂(·, g_0(·)), where W = g_0(X) is the true weak predictor. The analysis of learning in such settings rests crucially on the Neyman orthogonality assumption (Neyman & Scott, 1965).
Our setting has the important difference of seeking excess risk bounds for the compositional model f̂(·, ĝ(·)).

Self-supervised learning. In self-supervised learning the user artificially constructs pretext learning problems based on attributes of unlabeled data (Doersch et al., 2015; Gidaris et al., 2018). In other words, it is often possible to construct a weakly supervised learning problem where the weak labels are a design choice of the user. In line with our analysis, the success of self-supervised representations relies on picking pretext labels that capture useful information about the strong label, such as invariances and spatial understanding (Noroozi & Favaro, 2016; Misra & van der Maaten, 2019). Conversely, weakly supervised learning can be viewed as a special case of self-supervision where the pretext task is selected from some naturally occurring label source (Jing & Tian, 2019).

6. Discussion

Our work focuses on analyzing weakly supervised learning. We believe, however, that the same framework could be used to analyze other popular learning paradigms. One immediate extension of our analysis would be to multiple inconsistent sources of weak labels, as considered by Khetan et al. (2018).

Other important extensions would be to include self-supervised pretraining. Cases where the marginal P(X) does not shift fall within the scope of our analysis. However, a key technical difference between our setting and approaches such as image super-resolution (Ledig et al., 2017), solving jigsaw puzzles (Noroozi & Favaro, 2016), and image inpainting (Pathak et al., 2016) is that in the latter, the marginal distribution of features P(X) is potentially different on the pretext task as compared to the downstream tasks of interest. Because of this difference our analysis does not immediately transfer over to these settings, leaving an interesting avenue for future work.

Another option is to use our representation transfer analysis to study multi-task or meta-learning settings where one wishes to reuse an embedding across multiple tasks with shared characteristics, with the aim of obtaining certified performance across all tasks.

A completely different direction, based on the observation that our analysis is predicated on the idea of "cheap" weak labels and "costly" strong labels, is to ask how best to allocate a finite budget for label collection when faced with label sources of varying quality.

Acknowledgements

This work was supported by NSF TRIPODS+X grant (DMS-1839258), NSF-BIGDATA (IIS-1741341), and the MIT-MSR TRAC collaboration. The authors thank Ching-Yao Chuang for discussions.

References

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. In Int. Conference on Machine Learning (ICML), 2019.
Audibert, J.-Y. et al. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4), 2009.
Bao, H., Niu, G., and Sugiyama, M. Classification from pairwise similarity and unlabeled data. 2018.
Bartlett, P. L. and Mendelson, S. Empirical minimization. Probability Theory and Related Fields, 135(3), 2006.
Bartlett, P. L., Bousquet, O., Mendelson, S., et al. Local Rademacher complexities. The Annals of Statistics, 33(4), 2005.
Bilen, H. and Vedaldi, A. Weakly supervised deep detection networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Branson, S., Van Horn, G., and Perona, P. Lean crowdsourcing: Combining humans and machines in an online system. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Carl, B. and Stephani, I. Entropy, Compactness and the Approximation of Operators. Number 98. Cambridge University Press, 1990.
Carreira, J. and Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Chen, J. and Batmanghelich, K. Weakly supervised disentanglement by pairwise similarities. In Association for the Advancement of Artificial Intelligence (AAAI), 2019.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters, 2018.
Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. Boosting for transfer learning. In Int. Conference on Machine Learning (ICML), 2007.
Dalalyan, A. S., Tsybakov, A. B., et al. Mirror averaging with sparsity priors. Bernoulli, 18(3), 2012.
Desrosiers, C. and Karypis, G. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. Springer, 2011.
Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Int. Conference on Computer Vision (ICCV), 2015.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In Int. Conference on Machine Learning (ICML), 2014.
Durand, T., Mordan, T., Thome, N., and Cord, M. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Foster, D. J. and Syrgkanis, V. Orthogonal statistical learning. In Conference on Learning Theory (COLT), 2019.
Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. Logistic regression: The importance of being improper. In Conference on Learning Theory (COLT), 2018.
Fries, J. A., Varma, P., Chen, V. S., Xiao, K., Tejeda, H., Saha, P., Dunnmon, J., Chubb, H., Maskatia, S., Fiterau, M., et al. Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences. Nature Communications, 10(1), 2019.
Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. preprint arXiv:1803.07728, 2018.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
Guo, Y., Liu, Y., Bakker, E. M., Guo, Y., and Lew, M. S. CNN-RNN: A large-scale hierarchical image classification framework. Multimedia Tools and Applications, 77(8), 2018.
Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3), 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Huh, M., Agrawal, P., and Efros, A. A. What makes ImageNet good for transfer learning? preprint arXiv:1608.08614, 2016.
Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
Jing, L. and Tian, Y. Self-supervised visual feature learning with deep neural networks: A survey. preprint arXiv:1902.06162, 2019.
Juditsky, A., Rigollet, P., Tsybakov, A. B., et al. Learning by mirror averaging. The Annals of Statistics, 36(5), 2008.
Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2009.
Khetan, A., Lipton, Z. C., and Anandkumar, A. Learning from noisy singly-labeled data. Int. Conf. on Learning Representations (ICLR), 2018.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Int. Conf. on Learning Representations (ICLR), 2015.
Kleindessner, M. and Awasthi, P. Crowdsourcing with arbitrary adversaries. In Int. Conference on Machine Learning (ICML), 2018.
Lecué, G., Rigollet, P., et al. Optimal learning with Q-aggregation. The Annals of Statistics, 42(1), 2014.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Li, W., Wang, L., Li, W., Agustsson, E., and van Gool, L. Webvision database: Visual learning and understanding from web data. preprint arXiv:1708.02862, 2017.
Liao, X., Xue, Y., and Carin, L. Logistic regression with an auxiliary data source. In Int. Conference on Machine Learning (ICML), 2005.
Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. In Europ. Conference on Computer Vision (ECCV), 2018.
Medlock, B. and Briscoe, T. Weakly supervised learning for hedge classification in scientific literature. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 2007.
Mehta, N. A. Fast rates with high probability in exp-concave statistical learning. In Proc. Int. Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
Misra, I. and van der Maaten, L. Self-supervised learning of pretext-invariant representations. preprint arXiv:1912.01991, 2019.
Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A. G., et al. Deep learning recommendation model for personalization and recommendation systems. preprint arXiv:1906.00091, 2019.
Neyman, J. and Scott, E. L. Asymptotically optimal tests of composite hypotheses for randomized experiments with noncontrolled predictor variables. Journal of the American Statistical Association, 60(311), 1965.
Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Europ. Conference on Computer Vision (ECCV). Springer, 2016.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Is object localization for free? - Weakly-supervised learning with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 2009.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context encoders: Feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Ratner, A., Hancock, B., Dunnmon, J., Sala, F., Pandey, S., and Ré, C. Training complex models with multi-task weak supervision. In Association for the Advancement of Artificial Intelligence (AAAI), volume 33, 2019.
Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., and Ré, C. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
Ricci, F., Rokach, L., and Shapira, B. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer, 2011.
Schork, N. J. Personalized medicine: time for one-person trials. Nature, 520(7549), 2015.
Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Int. Conference on Computer Vision (ICCV), 2017.
Taherkhani, F., Kazemi, H., Dabouei, A., Dawson, J., and Nasrabadi, N. M. A weakly supervised fine label classifier enhanced by coarse supervision. In Int. Conference on Computer Vision (ICCV), 2019.
Tsybakov, A. B. et al. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1), 2004.
van Erven, T., Grünwald, P., Reid, M., and Williamson, R. Mixability in statistical learning. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
van Erven, T., Grünwald, P., Mehta, N. A., Reid, M., Williamson, R., et al. Fast rates in statistical and online learning. Journal of Machine Learning Research, 2015.
Vapnik, V. and Chervonenkis, A. Theory of pattern recognition, 1974.
Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 1971.
Vovk, V. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2), 1998.
Vovk, V. G. Aggregating strategies. Proc. of Computational Learning Theory, 1990.
Wan, F., Wei, P., Jiao, J., Han, Z., and Ye, Q. Min-entropy latent model for weakly supervised object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Wang, Y., Sohn, S., Liu, S., Shen, F., Wang, L., Atkinson, E. J., Amin, S., and Liu, H. A clinical text classification paradigm using weak supervision and deep representation. BMC Medical Informatics and Decision Making, 19(1), 2019.
Welling, M. Herding dynamical weights to learn. In Int. Conference on Machine Learning (ICML), 2009.
Xu, Y., Gong, M., Chen, J., Liu, T., Zhang, K., and Batmanghelich, K. Generative-discriminative complementary learning. preprint arXiv:1904.01612, 2019.
Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., and Yu, Y. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In Int. Conference on Computer Vision (ICCV), 2015.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional neural networks. In Europ. Conference on Computer Vision (ECCV), 2014.
Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
Zhang, Z. and Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Zhao, B., Li, F., and Xing, E. P. Large-scale category structure aware image categorization. In Advances in Neural Information Processing Systems (NeurIPS), 2011.
Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, 5(1), 2018.