Arxiv:2102.04848V1 [Cs.CV] 9 Feb 2021 Et Al

Train a One-Million-Way Instance Classiﬁer for Unsupervised Visual Representation Learning

Yu Liu, Lianghua Huang, Pan Pan, Bin Wang, Yinghui Xu, Rong Jin Machine Intelligence Technology Lab, Alibaba Group ly103369, xuangen.hlh, panpan.pp, ganfu.wb, renji.xyh, jinrong.jr @alibaba-inc.com { }

Abstract N instance classes Pool + MLP This paper presents a simple unsupervised visual represen- Phase 1: Unsupervised pretraining tation learning method with a pretext task of discriminat- Encoder ing all images in a dataset using a parametric, instance-level classifier. The overall framework is a replica of a supervised x ~ N instances C semantic classes classification model, where semantic classes (e.g., dog, bird, Pool + FC and ship) are replaced by instance IDs. However, scaling up Phase 2: Supervised transferring the classification task from thousands of semantic labels to millions of instance labels brings specific challenges includ- (a) Outline of our representation learning framework. ing 1) the large-scale softmax computation; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes that can be noisy. This work presents several novel techniques to handle these difficulties. First, we introduce a hybrid parallel training framework to make large-scale training feasible. Second, we present a raw-feature initialization mechanism for classification weights, which we assume offers a contrastive prior for instance discrimination and can clearly speed up converge in our experiments. Finally, we propose to smooth the labels of a few hardest classes to avoid optimizing over very similar negative pairs. While being conceptually simple, our framework achieves competitive or superior performance compared to state-of-the-art unsupervised approaches, i.e., SimCLR, Mo- (b) Correlation between instance and semantic classification. CoV2, and PIC under ImageNet linear evaluation protocol and on several downstream visual tasks, verifying that full Figure 1: (a) An overview of our unsupervised visual rep- instance classification is a strong pretraining technique for resentation learning framework. Without manual labels, we many semantic visual tasks. simply train an instance-level classifier that tries to distin- guish all images in a dataset to learn discriminative repre- Unsupervised visual representation learning has recently sentations that can be well transferred to supervised tasks. shown encouraging progress (He et al. 2020; Chen et al. (b) The relationship between instance (unsupervised) and se- 2020a). Methods using instance discrimination as a pretext mantic (supervised) classification accuracies. We observe a task (Tian, Krishnan, and Isola 2019; He et al. 2020; Chen strong positive correlation between them in our experiments. arXiv:2102.04848v1 [cs.CV] 9 Feb 2021 et al. 2020a) have demonstrated competitive or even superior performance compared to supervised counterparts under ImageNet (Deng et al. 2009) linear evaluation protocol and 2018a), momentum encoder (He et al. 2020; Chen et al. on many downstream visual tasks. This shows the potential 2020c), large batch size (Chen et al. 2020a,b), or shuffled of unsupervised representation learning methods since they batch normalization (BN) (He et al. 2020; Chen et al. 2020c) can utilize almost unlimited data without manual labels. to compensate for the lack of negative samples or handle the To solve the instance discrimination task, usually a dual- information leakage issue (i.e., samples on a same GPU tend branch structure is used, where two transformed views of a to get closer due to shared BN statistics). same image are encouraged to get close, while transformed Unlike dual-branch approaches, one-branch scheme (e.g., views from different images are expected to get far apart (He parametric instance-level classification) usually avoids the et al. 2020; Chen et al. 2020a,c). These methods often rely information leakage issue, and can potentially explore a on specialized designs such as memory bank (Wu et al. larger set of negative samples. ExemplarCNN (Dosovitskiy Copyright © 2021, Association for the Advancement of Artificial et al. 2014) and PIC (Cao et al. 2020) are of this category. Intelligence (www.aaai.org). All rights reserved. Nevertheless, due to the high GPU computation and memory Top-K Gather Partial Partial Partial Smoothed lookup One-hot Sub-batch features weights logits loss labels table labels Forward pass Backward pass Matrix production GPU 1 Encoder Pool + MLP W1 J1

GPU 2 Encoder Pool + MLP W2 J2

GPU T Encoder Pool + MLP WT JT

Data Parallel Model Parallel Label Smoothing

Figure 2: An outline of our distributed hybrid parallel (DHP) training process on T GPU nodes. Data parallel: Following data parallel mechanism, we copy the encoding and MLP layers to all nodes, each processing a subset of minibatch data. Model parallel: Following model parallel mechanism, we evenly divide the classification weights to different nodes, and distribute the computation of classification scores (forward pass) and weight/feature gradients (backward pass) to different GPUs. Label smoothing: We smooth labels of the top-K hardest negative classes for each instance to avoid optimizing over noisy pairs. overhead of large-scale instance-level classification, these weighted average of other instance features extracted in pre- methods are either tested on small datasets (Dosovitskiy vious iterations. On the other hand, initializing classification et al. 2014) or rely on negative class sampling (Cao et al. weights as instance features in essence converts the classifi- 2020) to make training feasible. cation task to a pair-wise instance comparison task, provid- ing a warm start for convergence. This work summarizes typical challenges of using one- branch instance discrimination for unsupervised representa- Finally, regarding the massive number of negative in- tion learning, including 1) the large-scale classification, 2) stance classes that significantly raises the risk of optimizing the slow convergence due to the infrequent instance access, over very similar negative pairs, we propose to smooth the and 3) a large number of negative classes that can be noisy, labels of the top-K hardest negative classes to make training and proposes novel techniques to handle them. First, we in- easier. Specifically, we compute cosine similarities between troduce a hybrid parallel training framework to make large- instance proxies – their corresponding classification weights scale classifier training feasible. It relies on model paral- – and find the negative classes with the top-K highest sim- lelism that divides classification weights to different GPUs, ilarities for each instance. The labels of these classes are and evenly distribute the softmax computation (both in for- smoothed by a factor of α (i.e., from y = 0 to y = α/K). ward and backward passes) to them. Figure 2 shows an The right part of Figure 2 shows the− smoothing− process. overview of our distributed training process. This training Note these top-K indices are computed once per training framework can theoretically support up to 100-million-way epoch, which is very efficient and adds only minimal com- classification using 256 GPUs (Song et al. 2020), which far putational overhead for the training process. exceeds the number of 1.28 million ImageNet-1K instances, We evaluate our method under the common ImageNet lin- indicating the scalability of our method. ear evaluation protocol as well as on several downstream Second, instance classification faces the slow conver- tasks related to detection or fine-grained classification. De- gence problem due to the extremely infrequent visiting of in- spite its simplicity, our method shows competitive results on stance samples (Cao et al. 2020). In this work, we tackle this these tasks. For example, our method achieves a top-1 accu- problem by introducing a a contrastive prior to the instance racy of 71.4% under ImageNet linear classification protocol, classifier. Specifically, with a randomly initialized network, outperforming all other instance discrimination based meth- we fix all but BN layers of it and run an inference epoch ods (Chen et al. 2020a,c; Cao et al. 2020). We also obtain a to extract all instance features; then we directly assign these semi-supervised accuracy of 81.8% on ImageNet-1K when features to classification weights as an initialization. The in- only 1% of labels are provided, surpassing previous best re- tuition is two-fold. On the one hand, we presume that run- sult by around 4.7%. We hope our full instance classification ning BNs may offer a contrastive prior in the output instance framework can serve as a simple and strong baseline for the features, since in each iteration the features will subtract a unsupervised representation learning community. Related Work Methodology Unsupervised visual representation learning. Unsuper- Overall Framework vised or self-supervised visual representation learning aims This work presents an unsupervised representation learning to learn discriminative representation from visual data where method with a pretext task of classifying all image instances no manual labels are available. Usually a pretext task is in a dataset. Figure 1 (a) shows the outline of our method. utilized to determine the quality of the learned representa- The pipeline is similar to common supervised classification, tion and to iteratively optimize the parameters. Represen- where semantic classes are replaced by instance IDs. In- tative pretext tasks include transformation prediction (Gi- spired by the design improvements used in recent unsuper- daris, Singh, and Komodakis 2018; Zhang et al. 2019), in- vised frameworks (Chen et al. 2020a), we slightly modify painting (Pathak et al. 2016), spatial or temporal patch or- some components, including using stronger data augmen- der prediction (Doersch, Gupta, and Efros 2015; Noroozi tation (i.e., random crop, color jitter, and Gaussian blur), a and Favaro 2016; Noroozi et al. 2018), colorization (Zhang, two-layer MLP projection head, and a cosine softmax loss. Isola, and Efros 2016), clustering (Caron et al. 2018; The cosine softmax loss is defined as Zhuang, Zhai, and Yamins 2019a; Asano, Rupprecht, and Vedaldi 2020), data generation (Jenni and Favaro 2018; 1 X exp(cos(wi, xi)/τ) J = log , (1) Donahue and Simonyan 2019; Donahue, Krahenb¨ uhl,¨ and − I PN i I j=1 exp(cos(wj, xi)/τ) Darrell 2016), geometry (Dosovitskiy et al. 2015), and a | | ∈ combination of multiple pretext tasks (Doersch and Zisser- where I denotes the indices of sampled image instances in a man 2017; Feng, Xu, and Tao 2019). minibatch, xi is the projected embedding of instance i, W = D N Contrastive visual representation learning. More re- w1, w2, , wN R × represents the instance clas- cently, contrastive representation learning methods (Henaff´ sification{ weights,··· }cos( ∈ w , x ) = (wTx )/( w x ) j i j i k jk2 · k ik2 et al. 2019; He et al. 2020) have shown significant perfor- denotes the cosine similarity between wj and xi, and τ is a mance improvements by using strong data augmentation and temperature adjusting the scale of cosine similarities. proper loss functions (Chen et al. 2020c,a). For these meth- Nevertheless, there are still challenges for this vanilla in- ods, usually a dual-branch structure is employed, where two stance classification model to learn good representation, in- augmented views of an image are encouraged to get close cluding 1) the large-scale instance classes (e.g., 1.28 million while augmented views from different images are forced to instance classes for ImageNet-1K dataset); 2) the extremely get far apart. One problem of these methods is the short- infrequent visiting of instance samples; and 3) the massive age of negative samples. Some methods rely on large batch number of negative classes that makes training difficult. We size (Chen et al. 2020a), memory bank (Wu et al. 2018a), or propose three efficient techniques to improve the represen- momentum encoder (He et al. 2020; Chen et al. 2020c) to tation learning and the scalability of our method: enlarge the negative pool. Another issue is regarding the information leakage issue (He et al. 2020; Chen et al. 2020c) • Hybrid parallelism. To support large-scale instance clas- where features extracted on a same GPU tend to get close sification, we rely on hybrid parallelism and evenly dis- due to the shared BN statistics. MoCo (He et al. 2020; Chen tribute the softmax computation (both in forward and et al. 2020c) solves this problem by using shuffled batch nor- backward passes) to different GPUs. Figure 2 shows a T malization (BN), while SimCLR (Chen et al. 2020a) handles schematic of the distributed training process on GPUs. the problem with a synchronized global BN. • A contrastive prior. To improve the convergence, we pro- Instance discrimination for representation learning. Un- pose to introduce a a contrastive prior to the instance clas- like the two-branch structure used in contrastive methods, sifier. This is simply achieved by initializing classification some approaches (Dosovitskiy et al. 2014; Cao et al. 2020) weights as raw instance features extracted by a fixed ran- employ a parametric, one-branch structure for instance dis- dom network with running BNs. crimination, which avoids the information leakage issue. • Smoothing labels of hardest classes. the massive num- Exemplar-CNN (Dosovitskiy et al. 2014) learns a classifier ber of negative classes raises the risk of optimizing over to discriminate between a set of “surrogate classes”, each very similar pairs. We apply label smoothing on the top-K class represents different transformed patches of a single hardest instance classes to alleviate this issue. image. Nevertheless, it shows worse performance than non- parametric approaches (Wu et al. 2018a). PIC (Cao et al. Note that the above improvements only bring little or no 2020) improves Exemplar-CNN in two ways: 1) it intro- computational overhead for the training process. Next we duces a sliding-window data scheduler to alleviate the infre- will introduce these techniques respectively in details. quent instance visiting problem; 2) it utilizes recent classes sampling to reduce the GPU memory consumption. Despite Hybrid Parallelism its effectiveness, it relies on complicated scheduling and op- Training an instance classifier usually requires to learn a timization processes, and it cannot fully explore the large large-scale fc layer. For example, for ImageNet-1K with ap- number of negative instances. This work presents a much proximately 1.28 million images, one needs to optimize a D 1280000 simpler instance discrimination method that uses an ordi- weight matrix of size W R × . For ImageNet- ∈ D 14200000 nary data scheduler and optimization process. In addition, it 21K, the size further enlarged to W R × . This is able to make full usage of the massive number of negative is often infeasible when using a regular∈ distributed data par- instances in every training iteration. allel (DDP) training pipeline. In this work, we introduce a 100 30 DDP 93.4 91.9 DHP Intra-class similarity 25 Inter-class similarity 80

60 15 TableTable 2: Ablation 2: Ablation study study on the on effectiveness the effectiveness of different of different componentscomponents in our in method. our method. 39.6 10 40 LinearLinear eval. eval. 30.3Init. methodInit. method (10(epoch epochs)(epoch 10) 10) 5 Cosine similarity (%) RandomRandom init. init. 12.4 12.4 20

Maximum class number (Million) Raw withRaw fixed with BNsfixed BNs 22.7 22.7 0 10 15 20 25 30 35 Raw withRaw running with running BNs BNs 27.4 27.4 GPU memory overhead (GB) 0 Raw with Raw with Figure 3: Comparison of the maximum number of classes fixed BNs running BNs 100 100 supported by DDP and DHP training frameworks under dif- 93.4 91.993.4 91.9 Figure 4: Table: comparison of ImageNet linear evaluationIntra-classIntra-class sim. sim. ferent GPU memory constraints. Inter-classInter-class sim. sim. accuracies of different80 weight-initialization80 methods, evaluated after 10 epochs training. Bar chart: comparison of av- Table 1: Comparison of the GPU memory consumption and erage intra- and inter-instance-class60 60 cosine similarities when the training time per epoch of DDPFigure andFigure 3: DHP Comparison 3: Comparison training ofGPU frame- of GPU memory memory consumption consumption of of DDPDDP and DHP and DHPtraining training frameworks. frameworks. using different initialization methods.TableTableTable 2: AblationTable 2: 2:Ablation Ablation 2: studyAblation study studyon studyonthe on the effectiveness onthe effectiveness the effectiveness effectiveness of differentof ofdifferent differentof different 39.6 39.6 works. Experiments are conducted on ImageNet-1K and 40 40 componentscomponentscomponentscomponents in our in our method.in our method.in our method. method. ImageNet-21K datasets. 30.3 30.3 TableTable 1: Ablation 1: Ablation study study on the on effectiveness the effectiveness of different of different Init.Init. methodInit. methodInit. method method Linear Linear eval.Linear eval.Linear eval. eval. Cosine similarity (%) Cosine similarity (%) componentscomponents in our in method. our method. 20 20 RandomRandomRandom init.Random init. init. init. 12.4 12.4 12.4 12.4 note that the DHP can benefit fromRaw moreRaw withRaw with fixed GPUsRaw with fixed BNs with fixed BNs fixed toBNs BNs support 22.7 22.7 22.7 22.7 DDP DHP RawRaw withRaw with runningRaw with running with running BNs running BNs BNs BNs27.427.427.427.4 DDPDDPlarger-scale DHP DHP instance classification, but DDP does not bear Dataset #Instances DatasetDataset #Instances #Instances Memory Time Memorymemory Timememory speed speed memory memory speed speed 0 0 this scalability. Table 1 comparesRaw withRaw Rawwith with theRaw with training efficiency of ImageNet-1K 1.28M 15.6GBImageNet-1KImageNet-1K 428s 9.18GB 1.28M 1.28M 15.6G 302s 15.6G 428s 428s 9.18G 9.18G 302s 302s fixed BNsfixedrunning BNs running BNs BNs ImageNet-21K* ImageNet-21K 14.2M 14.2MOOMOOM-- 21.51GDDP 21.51G and 2940s DHP 2940s frameworks on ImageNet-1K and ImageNet- ImageNet-21K 14.2M OOM - 21.51GB 2940s 100 100 100 100 21K under the same batch size settings.93.4 91.993.4 93.491.9 We91.993.4 91.9 show that the * FigureFigure 4: Comparison 4: Comparison of initialization of initialization methods. methods.Intra-classIntra-classIntra-class sim.Intra-class sim. sim. sim. OOM: Out-of-memory. Inter-classInter-classInter-class sim.Inter-class sim. sim. sim. DHP training framework not only80 consumes80 80 80 less GPU mem- • Raw-feature• Raw-feature initialization initialization with runningwith running BNs.ory, BNs.Tobut improveTo also improve trains much faster than the DDP counterpart. the convergence,the convergence, we propose we propose to initialize to initialize the classification the classification modelmodel parallel parallelmechanismmechanism60 60 and60 60 split and splitthe weightsthe weights evenly evenly to to weightsweights as raw as instance raw instance featuresFigureFigure featuresFigure 3: extractedFigure Comparison 3: 3:Comparison extracted Comparison 3: Comparisonby of a GPUbyofrandom of GPU a memory GPUofrandom memory GPU memory consumption memoryT consumptionGPUs. consumptionT consumptionGPUs. ofAt of each ofAt oftraining each training iteration iteration and for and each for GPUeach GPU node, node, TableTableTable 2:TableTable 2:Ablation 2:Table AblationTable 2:Ablation 2:Table Ablation 2:Ablation 2:study Ablation 2:studyAblation study Ablation on study study on the onstudy studythe on effectivenessthe on study effectivenessthe on theeffectiveness on theeffectiveness on effectivenessthe effectivenessthe effectiveness of effectiveness of different of different of different of different of different of different of different different DDPDDP andDDP and DHPDDP and DHP trainingand DHP trainingA DHP training frameworks.Contrastive training frameworks. frameworks. frameworks.we Prior 1)we extract 1) extract features features of a subset of a subset39.6 of39.6 minibatch39.6 of39.6 minibatch samples; samples; 2) 2) distributed hybrid parallel (DHP)initial traininginitial network,framework network, where where all but (Song all BN but layers BN layers of it are of it fixed. are fixed. We We 40 40 40 40componentscomponentscomponentscomponentscomponentscomponents incomponents our incomponents in our method. our in in method. our ourmethod. in in ourmethod. method. our in method. our method. method. assumeassume that such that suchinitialization initialization offers offers a contrastive a contrastive prior prior gathergather features features from from all other all other nodes;30.3 nodes; 3)30.3 compute30.3 3)30.3 compute partial partial co- co- et al. 2020) to make large-scale classification feasible. TableTableTable 1: AblationTable 1: 1:Ablation Ablation 1: studyAblationInstance study studyon studyonthe on the effectiveness classification onthe effectiveness the effectiveness effectiveness of differentof ofdifferent faces differentof different the slow convergenceInit.Init. methodInit. methodInit.Init. method problemInit. method methodInit. methodInit. method method linear linear linear eval. linear linear eval. eval. linear linear eval. eval. linear eval. eval. eval. for instancefor instance discrimination. discrimination. sine logitssine logits using using localCosine similarity (%) localclassificationCosine similarity (%) Cosine similarity (%) classificationCosine similarity (%) weights; weights; 4) compute 4) compute the the componentscomponentscomponentscomponents in our in our method.in our method.in our method. method. 20 20 20 20 RandomRandomRandomRandom init.Random init.Random init.Random init.Random init. init. init. init. 12.4 12.4 12.4 12.4 12.4 12.4 12.4 12.4 Figure 2 summarizes the outline of the distributed hybrid in early epochsexponential dueexponential to the values infrequent values of logits of logits and visiting sum and them sumRawRawRaw ofwith them overRaw withRaw withfixed instanceRaw withfixedRaw with fixed allBNs overRaw with fixed BNs withfixed BNs nodes withfixed BNs fixed allBNs fixed BNs BNs22.7 nodes to BNs22.7 22.7 22.7 22.7 to 22.7 22.7 22.7 • Smoothing• Smoothing labels labels of hardest of hardest classes. classes.The massiveThe massive amount amount RawRawRaw withRaw withRaw withrunningRaw withrunningRaw with runningRaw with runningBNs withrunning BNs withrunning BNs running BNs runningBNs27.4 BNs BNs27.427.4 BNs27.427.427.427.427.4 parallel training process on T GPU nodes. For encoding and samplesDDP (i.e.,DDPDDP onceDDPobtain DHP accessobtain DHP the DHP softmax DHP the per softmax denominators; epoch). denominators; A recent 5) compute 5) compute work softmax (Cao softmax prob- prob- of negativeof negative classes classes raises raises theDataset risk theDatasetDataset of riskDataset optimizing of #Instancesoptimizing #Instances #Instances over #Instances very over very 0 0 0 0 memorymemorymemory speedmemory speed speed memoryabilities speed memory memoryabilities speed memory and speed speed the and speed cross the cross entropyRaw entropy withRaw loss RawwithRaw withRaw on losswithRaw with the Rawwith on withRaw subset the with subset data; data;6) de- 6) de- MLP layers, we follow the data parallelsimilarsimilar pairs.pipeline pairs. We apply We and apply labelcopy label smoothing smoothinget on al. the on 2020) top- theK top- handlesK this infrequent visitingfixed fixedBNsfixed BNsrunning fixedproblemBNsrunning BNs BNsrunning BNsrunning BNs BNs by using ImageNet-1KImageNet-1KImageNet-1KImageNet-1K 1.28M 1.28M 1.28M 15.6G1.28M 15.6G 15.6G 428s 15.6G 428s 428s 9.18Gduce 428s 9.18G 9.18Gduce gradients 302s 9.18G 302s gradients 302s 302sof the of local the local lossmodel with lossmodelmodel parallelmodelmodelrespectwith parallelmodel parallelmodel parallel parallelmechanismmodelrespect parallelmechanism parallelmechanism to parallelmechanismmechanism featuresmechanism andmechanism to andmechanismsplit and featuressplit and andsplitthe andsplitthesplit and weightsthesplit and weightsthesplitthe weights andsplitthe weights weights evenlythe weights evenlythe weights evenly weights evenlyto evenly to evenly to evenly to evenlyto to to to hardesthardest instance instance classes classes to alleviateImageNet-21K toImageNet-21K alleviateImageNet-21K thisImageNet-21K issue. this 14.2M 14.2Missue. 14.2M 14.2MOOMOOMOOMOOM-- 21.51G- 21.51G- 21.51G 2940s 21.51G 2940s 2940s 2940s T GPUs.TTGPUs.GPUs.TT AtGPUs.GPUs.T AtTeachGPUs. AtGPUs.T each Ateach AttrainingGPUs. each Attrainingeach Attraining each Attrainingeach trainingiteration each trainingiteration trainingiteration trainingiteration iteration and iteration and iteration forand iteration forand each andfor each andfor eachfor andGPU eachfor andGPU each for GPU node, each for GPUeach GPUnode, node, each GPU GPUnode, node, GPU node, node, node, them to different GPUs, each processing a subset of mini- a sliding-windowweights; dataweights; scheduler, 7) gather 7) gather gradientsFigure whichFigure gradientsFigure 4: ComparisonfromFigure 4: sampleswe Comparison 4:we 1)we Comparisonfrom extract1) all4:we 1)we extract Comparison extract1)we 1) GPUwe extractfeaturesof extract1) allwe 1) features initializationextractoverlapped featuresof extract1) GPUfeatures initializationnodeextractfeatures of featuresaof initialization featuresof subset a featuresaof subsetinitializationnode ofand subset a aof ofmethods.subset ofsubset a minibatchof aofsumsubset ofmethods.and minibatchsubset a minibatchof ofmethods.subset minibatch minibatchofsumof methods. samples; minibatch minibatchof samples; samples; minibatch samples; samples; 2) samples; 2) samples; 2) samples; 2) 2) 2) 2) 2) gathergathergather featuresgathergather featuresgather featuresgather features featuresfromgather featuresfrom featuresfrom all featuresfrom fromall other all from other fromall other all nodes; from other all other nodes; all nodes; other all other3) nodes; nodes; othercompute3) nodes; 3) nodes;compute compute3) 3)nodes; compute compute3) partial 3) compute partial compute3) partial compute co- partial partial co- partial co- partial co- co- partial co- co- co- batch data; while for the large-scaleNotefcNote thatlayer, the that above the we aboveimprovements follow improvements the only bringonlybatches bring little littleor between no or no them; adjacentthem; 8) run 8)iterations. a run step a of step optimization of This optimizationsinesine increasessine logitssine logitstosine logits usingsine updatelogitssine logits using usingsine logitsto local logits using using local updatelogits local usingclassification parametersusingthe local classificationlocal usingclassification local classificationlocal classificationpos- parameters local classification classification weights; classification weights; weights; of weights; weights; 4) weights; compute 4) weights; 4) ofcompute weights; compute 4) 4) compute compute 4) the 4) compute the compute 4) the compute the the the the the computationalcomputational overhead overhead for the for training the training process. process. Next Next we we encoding,encoding, MLP, MLP, and classification and classificationexponentialexponentialexponential layers.exponentialexponentialexponential valuesexponential layers. valuesexponentialThe values of values values logitsof valuespipelineofThelogits values logitsof of values and logits logitsof and pipelineof sumand logits logitsof sumand and sumis them logits and sum themre-sum and them over sumand them sumis themover over allsum them re- themover all nodesover all them nodesover all nodesover all to nodesover all nodes to all to nodes all nodes to to nodes to to to model parallel mechanism and split the weights evenly• Raw-feature• toRaw-feature• Raw-feature• Raw-feature initialization initializationitive initialization initialization instance with with running with running with running visiting BNs. running BNs.To BNs. improveTo BNs.To improve but improveTo improveit also significantlyobtainobtainobtain theobtainobtain the softmaxobtain the multipliesobtain softmax the softmax theobtain softmax the softmaxdenominators; the softmaxdenominators; the softmaxdenominators; softmaxdenominators; denominators; denominators; denominators; 5) the denominators; compute5) 5) compute compute5) 5) compute compute5) softmax 5) compute softmax compute5) softmax compute softmax softmaxprob- softmaxprob- softmaxprob- softmaxprob- prob- prob- prob- prob- will introducewill introduce these these techniques techniques respectively respectively in details. in details. modelmodelmodel parallelmodel parallel parallelmechanism parallelabilitiesmechanismabilitiesabilitiesmechanismabilitiesabilities andmechanismabilities and theabilities and theabilities and cross and the andsplit cross and the cross the and entropysplit cross the and cross entropy thetheand entropysplit cross the cross entropy lossentropythe weightssplit cross lossentropythe lossonentropy weights lossonentropy the lossthe onweights the subsetloss evenlyon the losson weights subset the subsetlosson the evenly on data; subset the subset evenlyon the data; to subsetdata; the 6) subset evenly data; to de-6) data; subset 6) de- data; to de-6) data; 6) de- data; to de-6) 6) de- de-6) de- thethe convergence,the convergence,the convergence, convergence, we proposewe we propose proposewe toFigure propose initializeFigure toFigure initialize toFigure 3:Figure initialize3: ComparisonFigure to 3:Figure Comparisonthe initialize3: ComparisonFigure 3:peated Comparisontheclassification 3: Comparison 3: Comparisonthe classification 3: Comparison ofpeated Comparisontheclassification of GPU toof GPU classification of GPU ofloop memory GPU of GPU memoryto of memory GPU of GPU overloopmemory memory GPUconsumption memory consumption memory consumption theover memory consumption consumption consumptioncomplete consumption theof consumption of of complete of of of ofdataset of dataset for several for several epochs epochs T GPUs. At each training iteration and for each GPU node, time of looping over the complete dataset.duceduceduce gradientsduce gradientsduce gradientsduce gradientsduce gradients ofduce gradients theofgradients of thegradients local theof of local the localtheof loss of localthe localloss theof losswith local the locallosswith loss with respect local loss with respect losswith respect losswith respect withto respect featureswithto respect to respect features featuresto respect to features featuresandto to featuresand featuresandto featuresand and and and and weightsweightsweights asweights raw as rawas instance raw as instance raw instance features instance featuresDDP featuresDDPDDP extractedand featuresDDPDDP and DHPextractedandDDP DHPDDPand extractedandDHP trainingDDP andDHPto DHPtrainingextractedbyand training DHPoptimizeand DHPtrainingaby trainingframeworks. randomDHPto trainingframeworks.by a trainingframeworks. optimizerandom trainingframeworks.aby frameworks. random frameworks.a for frameworks. random frameworks. betterT forGPUs.T betterGPUs. representation.T GPUs. AtT GPUs. representation.each At Ateach training eachAt training each training iteration training iteration iteration and iteration and for and foreach and for each GPU foreach GPU each node, GPU node, GPU node, node, wewe 1)we extract 1)we extract1) extract 1) features extract featuresweights;weights; featuresweights; ofweights; featuresweights; 7) a ofweights; 7)gathersubsetweights; 7) aof gatherweights; 7)gathersubset 7) a ofgradientsgather 7)gathersubset of 7)gradients a gather gradients 7)gatherminibatchsubset of gradients gradientsgather fromminibatchof gradients gradientsfrom minibatch fromof gradientsall from fromminibatchall GPUsamples; all from GPU fromall GPUsamples; all node from GPU all GPUsamples; node all node GPU and all2) GPUsamples; node nodeand GPU andsum2) node nodeand sum and 2)sum node and sum andsum 2) andsum sum sum we 1) extract features of a subsetHybrid ofHybrid minibatch Parallelism Parallelism samples;initial 2)initial network,initial network,initial network, where network, whereIn whereall butthis whereall allbut BN but work,all BN layers but BN layers BN oflayers itwe oflayers are it ofFigure arefixed. handle it of are fixed. itFigure We arefixed. 3 We fixed.compares this We 3 We compares problem the GPU the fromthem; GPU memorythem;them; 8)them;them; run8) amemory 8)them; runthem; adifferent run8) 8) stepthem; a runoverhead run8)a step 8) step of runa a run8) step optimizationof step of a runoverhead optimization a step optimizationof step of a optimization step optimizationof per-of optimization optimizationof to DDP optimization update to to of update update to toDDP updateparameters update to toparameters update parameters update to parameters updateparameters parameters of parameters of parameters of of of of of of assumeassumeassume thatassume that such that such thatinitialization such initialization such initialization initializationTable offersTableTable 1: offersTableTable 1:Ablation offers1:Tablea AblationTable contrastive1: Ablation 1: offersTablea Ablation 1:Ablation contrastive 1:studya Ablation contrastive1:study Ablation studya Ablation on contrastivestudy study prioron the onstudy studythe prioron effectivenessthe on study effectivenessthe prioron theeffectiveness on theeffectiveness prioron effectivenessthegather effectivenessthe effectiveness ofgather effectiveness of differentgather of features different of different ofgather features different of different of features different offrom different featuresencoding, different fromencoding,encoding, all fromencoding,encoding, other all fromMLP,encoding,encoding, allother MLP, MLP,encoding, nodes; and other all MLP, MLP, and nodes;classification and otherMLP, MLP, classification and nodes; and classification3) MLP, and classification nodes;computeclassification and 3) classification and computeclassification 3) layers. classification layers.compute 3) layers. partial Thelayers. compute layers. The partiallayers. Thepipeline layers. Thepipeline partial Theco-layers. pipeline The pipeline partial Thepipelineco- is re- Thepipeline is co- pipeline is re- re- pipeline is co-is re- re- is is re- re- is re- gather features from all other nodes;TrainingTraining 3) an compute instance-level an instance-level partial classifierfor classifier co-for instance usuallyfor instancefor instanceusually discrimination. instance requires discrimination.spective: discrimination. requires discrimination. to learncomponentscomponentsto wecomponents learncomponentscomponents proposecomponents incomponents our incomponents inand our method. our in in method. our ourmethod.DHP in inand ourmethod. method. our into method. our method.DHP training speed method. trainingsine frameworks upsine logitssine theframeworkslogitssine using logitsconvergence using logits localwhen usingpeatedpeated local usingpeated classification tolocalpeatedwhenpeated increasing loopto classificationpeated tolocal looppeated classificationloop to over topeatedincreasing loop over loopto classification over tothe loop overby loopto the weights;over complete the loopthe over complete the weights;over completethe intro- over completethe(pseudo) weights; complete the datasetthe 4)complete the weights; datasetcomplete datasetcompute 4)complete(pseudo) dataset for dataset compute 4) for datasetseveral for datasetcompute 4)several for datasetseveral for the compute several forepochs several for the epochs several forepochs several the epochs several epochs the epochs epochs epochs to optimizetoto optimize optimizetoto optimize optimizeto forto optimize for optimizebetterto for optimizebetter for better for representation. better for better representation. for representation. better for better representation. representation. better representation. representation. representation. a large-scalea large-scalefc layer.fc layer. For example, For example, for ImageNet-1K for ImageNet-1K with with classclass numberDDP numberDDPDDP fromDDPDDPexponentialDDPDDP fromexponentialDDP 10 DHPexponential DHP DHPthousandexponential DHP10 DHP values DHP DHPthousand values DHP values of to logits values of 30 logits of to andmillion. logits of 30 and logits sum andmillion. sum them and sumThe them sum over them exper- The over them all over nodes allexper- over allnodes tonodes all tonodes to to • Smoothing• Smoothing• Smoothing• Smoothing labels labels labelsof hardest labelsof hardest of hardest of classes. hardestDataset classes.DatasetDataset classes.TheDatasetDataset classes.DatasetTheDataset massiveTheDataset #Instances massive #InstancesThe #Instances massive #Instancesamount #Instances massive #Instances amount #Instances #Instancesamount amount FigureFigureFigureFigure 3Figure compares 3Figure 3 comparesFigure compares 3Figure 3 compares compares 3 the3 compares compares the3 GPU the compares GPU the theGPU memory theGPU GPU memorythe memory GPU the GPU memory memory overhead GPU memory overhead memory overhead memory overhead overhead of overhead overheadofDDP of overheadDDP ofDDP of DDP ofDDP of DDP ofDDP DDP sine logits using local classification weights; 4) compute the ducing a contrastive priormemorymemorymemorymemorymemory speed tomemory speedmemory speedobtain classificationmemory speed memory speed memoryobtain speed memory speedobtain thememory speed memoryspeed memoryobtainspeed thesoftmax memoryspeed thememoryspeed softmax speedspeedthe softmax speed denominators;speed softmax weights. denominators; denominators; denominators; 5) computeSpecifi- 5) compute 5) compute 5) softmax compute softmax softmax prob- softmax prob- prob- prob- approximatelyapproximately 1.28 million 1.28 million images,of images, negativeof one negativeof negativeof needs classes one negative classes needs classes to raises classes optimize raises tothe raises optimize theraisesrisk the risk aof therisk optimizing of risk optimizinga of optimizingiment of optimizing overiment overis very overconducted very overis very conducted very on 64 on V100 64and V100and GPUs DHPand DHPandand DHP trainingand DHP GPUs DHPtrainingand training with DHPand DHPtraining frameworkstraining DHP frameworkstraining frameworkstraining with32G frameworkstraining frameworks frameworks when frameworks memory.32G when frameworks when increasing when when increasing memory. increasing when when increasing increasing when the increasing increasing the (pseudo) the increasing (pseudo) the (pseudo) the (pseudo) the (pseudo) the (pseudo) the (pseudo) (pseudo) ImageNet-1KImageNet-1KImageNet-1KImageNet-1KImageNet-1KImageNet-1KImageNet-1KImageNet-1K 1.28M 1.28M 1.28M 1.28M 1.28M 15.6G 1.28M 15.6G1.28M 15.6G 1.28M 15.6G 15.6G 428s 15.6G 428s 15.6G 428sabilities 15.6G 428s 9.18G 428s 9.18Gabilities 428s 9.18G 428sabilities 9.18G 428s 9.18G 302s and 9.18Gabilities 302s 9.18G 302s and the 9.18G 302s 302s and thecross 302sclass 302s and theclass cross 302sclass numberentropy thecrossclassclass number numberentropyclass crossclass number numberentropyfromclass loss numberfrom numberentropyfrom 10 loss numberfrom on from 10 thousand 10loss from the thousandon from 10 thousand 10 loss from onthe subsetthousand 10 thousand 10 to the thousandon subset 10 to30thousand to thesubset 30thousanddata; million. to30 to million. subset30data; million. to30 to6) million. 30data; million. The to30 de- 6)million. The 30data; million. The exper- de- 6) million. The exper- The exper- de- 6) The exper- The exper- de- The exper- exper- exper- exponential values of logits and sum them over all nodesC 1280000 toC 1280000 cally, beforeImageNet-21KImageNet-21KImageNet-21KImageNet-21KImageNet-21K trainingImageNet-21KImageNet-21KImageNet-21K 14.2M 14.2M 14.2M 14.2M 14.2M started,OOM 14.2M 14.2MOOMOOM 14.2MOOMOOM-OOMOOM-- weOOM 21.51G-- 21.51G 21.51G- run- 21.51G 21.51G 2940s- 21.51G 2940s 21.51G 2940s an 21.51G 2940s 2940s 2940sinference 2940s 2940s epoch us- weightweight matrix matrix of size ofW size WR ⇥similarRsimilar⇥similar pairs.;similar forpairs. pairs.We ImageNet-21K,; forpairs.We apply We ImageNet-21K, apply We applylabel applylabel smoothing label smoothing label smoothing smoothing onDDP onthe onDDP thereports top- onthe top-K thereports top-K out-of-memory top-K duceK out-of-memoryduce gradientsduce gradientsduce gradients ofgradients (OOM)iment the ofimentiment the oflocal is(OOM)imentimentthe conducted is oflocal isimenterror conducted lossimentthe conductedlocal is isiment conducted loss conducted is localwith iserroron conducted losswhen conducted iswith on 64 onconductedrespect loss 64with V100 on 64 on whenrespect V100 64thewith onV100 64 onrespect GPUs V100 64to V100 on GPUs 64class respectGPUs V100features 64tothe V100 with GPUs GPUs V100features withto GPUswith 32Gclass GPUs features withto 32G with GPUsand 32G memory. features with 32G with memory.and 32G memory. with 32G and memory. 32G memory. 32G memory.and memory. memory. 2 2 C 14200000C 14200000 obtain the softmax denominators;the 5) sizethe compute further size further enlarged softmax enlarged to W prob- tohardestWhardesthardest instance⇥hardest instance instance⇥ classesing instance classes. This classesthe to classes.alleviate to isThisfixed alleviate to of- alleviate to is this alleviaterandom of- this issue. thisnumber issue. this issue.number issue. initial reaches reaches network 4.7weights;weights; million, 4.7weights; 7)weights; million,with gather 7) while 7)gatherDDPrunning gather 7)gradientsDDPDDP while reportsgatherthe gradientsDDPDDP reports reports gradientsDDP DHPDDP reports out-of-memory reportsthe gradientsDDPfrom out-of-memory reports out-of-memory reportsBNsfrom DHP out-of-memorytraining reports out-of-memory all from out-of-memory out-of-memory GPUall fromto out-of-memorytraining (OOM) all GPU (OOM) ex- (OOM)frame- node GPUall (OOM) (OOM)error node GPU (OOM)error and(OOM)errorframe- node when (OOM)error errorwhenand node sumwhen error theand errorwhen whensum the error class the whenand sumwhen class the theclass whensum theclass class the class the class class R R numbernumbernumbernumber reachesnumber reachesnumber reachesnumber reachesnumber 4.7reaches 4.7reaches million, 4.7reaches million, 4.7reaches million,4.7 million,4.7 million,while 4.7 million,while 4.7 million,while the million,while while the DHP the while whileDHP the DHPthe trainingwhile DHPthe DHPtraining the training DHP the DHPtraining trainingframe- DHP trainingframe- trainingframe- trainingframe- frame- frame- frame- frame- Note2 Note thatNote2 thatNote the that theabove that the above theabove improvements aboveimprovements improvements improvements only only bring only bring only little bring little bring or little no or little noor no orthem; nothem; 8)them; run 8)them; run 8) a step run 8) a step run aof step optimization aof step optimization of optimization of optimization to update toD update toN update toparameters update parameters parameters parameters of of of of abilities and the cross-entropyten loss infeasibleten on infeasible the when subset when using using data;a regular a regular 6) distributed distributedtract data alltraining data• instanceRaw-feature training• •Raw-featureRaw-feature••Raw-featureRaw-feature• •Raw-featureRaw-feature initialization•workRaw-feature initializationfeatures initialization initialization initializationworkcan initialization initialization with initialization with support with running canX with running with running with supportrunning with= runningBNs. with runningBNs.up runningBNs.To runningxBNs. to BNs.To improveTo1 BNs.up improve30, BNs.To improveTox BNs. improvetoTomillion improve2To improve30,To improvework improve millionworkwork numbercan,workwork canx support canworkN supportwork can numbercan supportwork cansupport upsupportof can uptosupport can upsupportclasses, 30to uptosupport up30 millionof 30to upto million up30classes, million 30×to upto million number30 million which 30to number million number;30 million number millionof number which classes,ofnumber ofnumber classes, classes,ofnumber of classes, whichclasses, of of whichclasses, whichclasses, of whichclasses, which which which which computationalcomputationalcomputationalcomputational overhead overhead overhead overheadfor for the for the trainingthe for thethe training convergence,the theconvergence, trainingthe convergence,the process. convergence,the training convergence,the process. convergence,the convergence, we process. convergence, we propose Nextwe process. propose we propose we Next propose we proposetowe Nextwe initialize proposeto we proposetowe initializeNext initialize proposeto towe initialize initialize to theencoding, towe initialize the classificationinitialize to theencoding, classificationinitialize the classification theencoding, classification the classification theencoding, MLP, classification the classification MLP, classification and MLP,is 6is and MLP, classification.is46 and.6is4is classification.of466is and.of4 classificationthe.is4of6 the.6is4 DDP’sofclassification the.of46 DDP’s the.of4 theDDP’s layers.of theDDP’s limit.R DDP’sof the layers. limit. DDP’s the limit. DDP’s layers. TheWe limit. DDP’s limit. We layers. alsoTheWe limit. pipeline limit. alsoWe TheWe also limit.note pipeline We also note alsoTheWe notepipeline that alsoWe isnote also thatnote pipeline that there- also note is thatnote the that DHP there- isnote that DHP the that there-DHP is that theDHP DHP there- DHP the DHP DHP (DDP)(DDP) pipeline. pipeline. In this In work, this work, we introduce we introduce a distributed a distributed hy- hy- is 6.4is 6of.4 theof DDP’s the{ DDP’s limit. limit.··· We⇥ also We⇥⇥ ⇥ note⇥also}⇥⇥ ∈⇥ notethat the that DHP the DHP deduce gradients of the local loss with respect to featureswillwill introducewill introducewill introduce introducethese thesethen techniques these techniques these techniques we techniques respectively directlyweights respectivelyweightsweights respectivelyweightsweights as respectivelyweights asrawinweights as rawweights assignasdetails.rawinstance asin rawinstance asraw details.instancein as raw instance asdetails.rawinstance⇥ featuresin rawinstance features details.instance features them instance features⇥ features extracted features extracted features extracted featurespeatedextracted to extracted by extractedpeated extractedbyclassification a by randompeatedextracted ato by a randomby randomlooppeated ato by a randomby random loopa to bya overrandom randomloop atocan overrandomcan loopthe benefitcan over benefitcan the completecan benefit overcan frombenefitweights the benefitcompletecan from benefitcan from the completemore benefit from morefrombenefit datasetcomplete more GPUsfrom morefrom moreGPUsdataset GPUsfrom moreW todataset moreGPUs forGPUs support to more to GPUsdataset support forGPUsseveral support to= to GPUs supportfor larger-scale supportseveral to to larger-scale support forseveral larger-scale support to epochs larger-scale support larger-scale several epochs larger-scale instance larger-scale epochs instance larger-scale instance epochs instance instance instance instance instance brid parallelbrid parallel (DHP) (DHP) training training framework framework (Song (Song et al. et2020) al. 2020)initial toinitialinitial network,initialinitial network,to network,initialinitial network,can network, whereinitial network, where network, where benefit network, all wherecan where all but allwhere but where BNbenefit all but all where BN frombutD layers BNall but all layers BN but BNlayers all butN of BNlayers frombutmoreto layers BN it of ofare layersoptimize BN itto layers it ofare fixed. of are layers optimize moreitto GPUs itfixed. of are fixed. ofare optimize it Weto itfixed. ofare fixed. forWe are optimize it WeGPUs fixed. areto fixed. forbetterWe We fixed.classification, support forWe better Weclassification,classification, to representation. forbetterWeclassification,classification, support representation. betterclassification,classification, larger-scale representation. butclassification, but DDP representation. but DDP butlarger-scale DDP but does DDP but DDPdoes but does not DDP but DDPdoes not does bearinstance not DDP does bear not does bearnot this does bearnot this bearinstance scalability.not this bear scalability.not this bear this scalability. bear this scalability. scalability. this Tablescalability. this scalability. Table Tablescalability. 1 Table Table 1 1 Table Table 1 1 Table 1 1 1 and weights; 7) gather feature gradientsmakemake the large-scale thefrom large-scale all instance GPU instance classification node classification feasible.w feasible.1, w2,assumeassumeassumeassume thatassume, thatassumew thatassumesuchclassification, thatassumesuchN that such initialization that such thatinitializationsuch initializationclassification, thatsuch initializationsuch initialization suchR initialization initialization offers initialization offers× offers but a offers offers contrastive a DDPaoffers contrastive offersas contrastive but a a offers contrastive contrastive a an DDPa doescontrastive prior contrastive a prior contrastiveinitialization. prior priornot does prior priorcompares prior bearcompares priornotcomparescomparescompares thisbear thecomparescompares the training thecompares scalability. training thisthe thetrainingThe thetraining training efficiencythe scalability. training efficiencythe training efficiency training efficiencyintu- efficiency Tableof efficiency efficiency ofDDP of efficiency DDP ofDDP Tableof and 1 DDP ofDDP and of andDHP DDP ofDDP andDHP and DHP1 DDP frame- and DHP andDHP frame- frame- andDHP DHP frame- frame- DHP frame- frame- frame- HybridHybridHybrid ParallelismHybrid Parallelism Parallelism Parallelism{ for···for instancefor instancefor instancefor instancefor instance discrimination.for instance discrimination.for instance discrimination. instance discrimination.} discrimination. discrimination.∈ discrimination. discrimination. FigureFigureFigure 3 comparesFigure 3 compares 3works comparesworks 3works compares theonworksworks onImageNet-1K theworkson GPUworks ImageNet-1K onImageNet-1K theonworks GPU ImageNet-1K onImageNet-1K memorythe onGPU ImageNet-1K onImageNet-1K memory GPUand ImageNet-1K andmemory ImageNet-21Kand overhead ImageNet-21Kand memory and ImageNet-21K overheadand ImageNet-21K ImageNet-21Kand overhead ImageNet-21Kand ImageNet-21K ofoverhead underImageNet-21K under DDP of under the under DDPof under the same the under DDP of undersame the samethe under DDP samethe same the same the same same FigureFigure 2 summarizes 2 summarizes the outline the outline of the of distributed the distributed hybrid• Smoothing• hybrid•SmoothingSmoothing••SmoothingSmoothing• •Smoothing labelsSmoothing• labelscomparesSmoothing labels of labels labels hardest of oflabelscompares hardest labels hardest of of labels hardest classes. hardest of of classes.the hardest classes. hardest of classes. hardestThe classes. trainingThe classes.theThe massiveclasses.The massiveclasses.The massivetrainingThe massiveThe massiveamount efficiencyThe massiveamount massiveamount massiveamount amount efficiency amount amountbatch amountbatch ofbatch sizebatchbatch DDPsize sizesettings.batch ofbatch sizesettings. size settings.batch DDPsize andsettings. sizesettings. We sizesettings. We settings. show We DHP andsettings. show We Weshow that Weshow show Wethat DHP thatframe-the show We show thatthe thatDHP the show thatDHP the thatframe-DHPthe training thatDHPthe DHP training the training DHP the DHP training trainingframe- DHP trainingframe- trainingframe- trainingframe- frame- frame- frame- frame- and sum them; 8) run a step of optimization to updateTrainingTraining pa-Training anTraining instance-level an instance-levelan instance-level anition instance-level classifier behind classifier classifier usually classifier usually this usually requires usually initializationrequires requires to requires learn to learn to learn to learnand mechanismand DHPand DHP trainingand DHP training DHP trainingwork frameworks trainingworkwork is frameworks notworkworknot frameworks two-fold. only notwork onlyworknot frameworks notonly consumeswork when notonly consumesonly not consumes onlywhen not consumesonly consumes increasing when onlyless consumes consumes lessincreasing when First,less GPU consumes increasinglessGPU less GPU memory, lessincreasing GPU thelessGPU memory, memory, lessGPU the(pseudo)GPU memory, memory, but GPU the memory,(pseudo) but memory, also but the(pseudo) memory, also but butalsotrains (pseudo) butalsotrains also buttrains also buttrains alsotrains alsotrains trains trains parallelparallel training training process process on T a onGPU large-scaleaT large-scaleaGPU nodes. large-scalea large-scalefc nodes. Forlayer.fc layer.fc encoding Forlayer.fc Forlayer. example, encoding For example, Forand example,of negativeof forexample,of negativeand negativeof for ImageNet-1Kof negative negativeof classesfor ImageNet-1Kof negative classesworks negativeof classesfor ImageNet-1K negative classesraises classes ImageNet-1K raises classesworks raises classes onthe raises with classesraises the risk the raisesImageNet-1K with raisesrisk the ofriskon the raises with optimizingofrisk the risk of theImageNet-1K optimizing with risk optimizingof theclass ofrisk optimizing optimizingofriskclass of optimizingover number optimizingofclass over optimizingover very numberandclass over very over numbervery over from veryImageNet-21K over very numberand over veryfrom very 10 from veryImageNet-21K thousand10 from 10 thousand thousand10 tothousand under 30 to million.30to under the 30 million.to million.30samethe The million. The same exper- The exper- The exper- exper- rameters of encoding, MLP, and classification layers. The running BNssimilarsimilarsimilarsimilar pairs.similar can pairs.similar pairs.similar We pairs.similar pairs. We offer applyWe pairs. pairs. applyWe We apply pairs. label We apply applyWe label a labelapply We smoothing apply labelcontrastive label smoothing apply smoothing label label smoothing smoothing label smoothing on smoothing on the smoothing on the on top- the on top- the onK the top- on priorK the top- on top-K the top-K theK top-muchK top-muchKmuch in fasterKmuchmuch faster the fastermuch thanmuch faster faster thanmuch than thefaster output faster than the thanDDP thefaster thanDDP the thanDDP the counterpart. thanDDP the DDPcounterpart. the counterpart.DDP fea- the DDPcounterpart. counterpart. DDP counterpart. counterpart. counterpart. MLPMLP layers, layers, we follow we follow the data theapproximatelyapproximately paralleldataapproximatelyapproximately parallel 1.28pipeline 1.28 million 1.28pipeline million 1.28 and million images, million images, andhardest images, onehardesthardest images, onehardest instanceneedshardest instanceonehardest instanceneedshardestbatch instanceonehardest toinstance needs classes instanceoptimize classes toinstance needs classesbatch instanceoptimize classesto size classes alleviate to optimizeclasses to classesalleviate alleviatea tosettings.optimize classesto size alleviate alleviate to thisa to alleviate this alleviatea toissue. thissettings.iment alleviate issue.this thisa issue.iment We this issue. issue.thisiment is issue.this issue.show conductediment is We issue. conducted isshow conductedthat is conducted on the that 64on onDHP V10064 the 64on V100DHP GPUsV100 64 training GPUsV100 GPUs withtraining GPUs with frame- 32G with 32G with memory.frame- 32G memory. 32G memory. memory. C copy1280000C C1280000copy1280000C 1280000 Raw-FeatureRaw-FeatureRaw-FeatureRaw-FeatureRaw-FeatureRaw-FeatureRaw-FeatureRaw-Feature Initialization Initialization Initialization Initialization Initialization Initialization Initialization withInitialization with with Running with withRunning Running with withRunning Running with BNs Running Running BNs BNs Running BNs BNs BNs BNs BNs pipeline is repeated to loop through the complete datasetweightweight forweight matrixweight matrix matrix of sizematrix oftures, size ofW size ofW sizeRW since⇥RWR⇥ ⇥R in; for⇥ ; each ImageNet-21K,for; for ImageNet-21K,; ImageNet-21K,for inference ImageNet-21K,DDP phase,DDP reportsDDP reportsDDP reports the out-of-memory reports out-of-memory features out-of-memory out-of-memory (OOM) computed(OOM) (OOM) error (OOM) error when error when error thewhen thewhen class the class the class class themthem to different to different GPUs, GPUs, each eachprocessing processing a subset a subset2 of2 mini-2 ofNote2NoteC mini-Note that14200000NoteCNote that thattheNoteC14200000Note thatthe abovethat thework14200000NoteC abovethat the abovethatthe14200000 improvements abovethatthe above thework improvements not improvementsabove the above improvements improvements above only improvements not improvements only improvements onlyconsumes only bring only bringonly bring littleonlyconsumes bringonly bring little littleonly bringor bring little noorlittle less bringor nolittle noorlittle or nolittleGPU noor lessInstance or noInstance noorInstance noGPUInstance memory,Instance classificationInstance classificationInstance classificationInstancememory, classification classification classification classificationbut faces classification faces faces thealso facesbut faces the slow the faces slow faces thetrains slowthealso convergence faces slowthe convergenceslow the convergence slow thetrains convergenceslow convergence slow convergence convergence problem convergence problem problem problem problem problem problem problem thethe sizethe size furtherthe size further size further enlarged further enlarged enlarged to enlargedW to W tocomputationalcomputationalWR tocomputationalcomputational⇥WRcomputationalcomputationalR⇥computational overheadcomputational⇥R overhead overhead.⇥ This overhead overhead. for This overhead overheadfor. is the for This overhead of- the for.is training thefor This of-trainingthefor is thetraining for of- thetraining foris training process.thenumber of-training process.the training process.number training process. process. Nextnumber process. Next reachesprocess. Nextnumber we process. Next reaches Next we we Next reaches Next we 4.7we Next reaches we 4.7we million, we 4.7 million, 4.7 million, while million, while the while the whileDHP the DHP training the DHP training DHP training frame- training frame- frame- frame- batchbatch data; data; while while for the for large-scale the large-scalefc layer,fc layer, we follow we follow2 the2 2 the2 muchmuch faster faster than the than DDP the DDPcounterpart.in counterpart.in earlyin earlyin earlyin epochs earlyin early epochsin epochs earlyin early epochs due epochs early due epochs due to epochs due tothe epochs due to the due infrequent tothe due to infrequent the due tothe infrequent to the infrequent infrequent tothe infrequentvisiting the infrequent visiting infrequentvisiting visiting visitingof visitingof instance visitingof instance visitingof instance of instance of instance of instance of instance instance several epochs to optimize for better representation. tenten infeasibleten infeasibleten infeasible infeasible when whenafter using when using when a using everyregular a using regularwill awill regularwill distributedintroduce a BNwill introduceregularwill distributedintroducewill introducewill distributedintroduce these layerwill introduce distributedintroducethese datathese introducetechniques these thesetechniques data techniques training these will datathesetechniques techniques training thesetechniquesrespectively data techniques training respectively subtracttechniquesrespectively training respectively respectively respectivelywork inrespectively details.in respectivelywork in details. details.caninwork in a details. details.in can supportworkrunning in details. details.canin support details. can supportsamples up supportsamples tosamples upsamples average30samples up(i.e.,tosamples million(i.e., 30 tosamples up(i.e., oncesamples 30million(i.e., once(i.e.,to once accessmillion (i.e.,30 once(i.e.,number onceaccess accessofmillion (i.e., once number peronceaccess access perotheroncenumber epoch)access perof access epoch) pernumber perepoch)accessclasses, of perepoch) and epoch) perclasses, of and epoch) per theand epoch)classes, of theandwhich epoch) lackand the classes, lackand thewhich lacktheand of lacktheandwhich ofprior lack the of prior lack thewhich ofprior lack of prior lack ofprior of prior ofprior prior (DDP)(DDP)(DDP) pipeline.(DDP) pipeline. pipeline. In pipeline. this In this In work, this In work, this we work, weintroduce work, we introduce weintroduce introducea distributed a distributed a distributed a distributed hy- hy- hy- hy-is 6is.46is.46ofis.4 the6of.4of the DDP’sknowledge theofknowledge DDP’sknowledge the DDP’sknowledgeknowledge limit.knowledge inDDP’sknowledge limit. the inknowledge in theWelimit. classifier. the in in classifier. theWelimit. classifier.the alsoin in classifier.theWe classifier. thealso in A classifier.note theWe recent classifier.A also A recent classifier.note recent A also thatA work noterecent recent A work thatA workrecent notethe (Cao recent A workthat work (Cao recent the (Cao DHP worket that work (Cao theal. (Caoet DHP worket al.2020) (Cao theal. (Caoet DHP et2020) al.2020) (Cao al. et DHP et2020) al.2020) al. et 2020) al.2020) 2020) Figure 3 compares the GPU memory overhead of DDP instanceHybrid featuresHybridHybridHybridHybrid ParallelismHybrid ParallelismHybrid ParallelismHybrid Parallelism Parallelism Parallelism extracted Parallelism Parallelism in previous iterations. Second, bridbrid parallelbrid parallelbrid parallel (DHP) parallel (DHP) (DHP) training (DHP) training training framework training framework framework framework (Song (Song et(Song al. et(Song 2020) al. et 2020)al. et to2020) al. to2020) tocan tocan benefit⇥can benefit⇥can benefit from⇥ benefit from⇥ morehandles fromhandles morehandles from GPUs morehandles thehandles GPUs thehandles more infrequent thehandles toGPUs infrequent thehandles infrequentthe support toGPUs infrequentthe infrequent the support toinstance infrequent the infrequent instancesupport toinstance infrequentlarger-scale instancesupport instance visiting larger-scale instance visiting instance visitinglarger-scale instance visiting problemvisiting larger-scale problemvisiting instance problemvisiting problemvisiting instance problem by problem instanceby using problem by using problem instanceby using by a using by using a by a using by using a a using a a a TrainingTrainingTrainingTrainingTraining anTraining aninstance-levelTraining an instance-levelTraining aninstance-level an instance-level aninstance-level an instance-level aninstance-level classifier instance-level classifier classifier classifier classifier usually classifier usually classifier usually classifier usually requires usually requires usually requires usually requires usually requiresto learnrequires to requiresto learn learnrequires to to learn learn to to learnsliding-window learn tosliding-window learnsliding-windowsliding-windowsliding-windowsliding-windowsliding-windowsliding-window data data data scheduler, data scheduler, data scheduler, data scheduler, data scheduler, data which scheduler, scheduler, which which scheduler, samples which which samples which samples which samples samples which overlapped samples overlapped samples overlapped samples overlapped overlapped overlapped overlapped overlapped and DHP training frameworks when increasing the (pseudo)makemake themake thelarge-scalemake the large-scale thelarge-scale large-scaleassigning instance instance instance classification instancea classification large-scaleaa large-scaleclassification features large-scaleaa large-scaleclassification large-scaleaa large-scalefcfeasible. large-scalealayer.fc large-scalefc feasible.layer.layer.fcfc Forfeasible.layer. tolayer.fc Forfc example, Forfeasible.layer.layer.fc example, Forweights For example,layer. For example, example, For for example, For forexample, ImageNet-1K for example, ImageNet-1Kclassification, for ImageNet-1Kfor ImageNet-1Kforclassification, ImageNet-1Kapproximately for ImageNet-1Kclassification, for ImageNet-1K with ImageNet-1Kclassification, with with but with withbatches withbut DDP withbatches butbatches with DDPbatches does betweenbatches butDDP betweenbatches does betweenbatches DDP not converts betweenbatches does between adjacent not bear between adjacent does between adjacent not bear between adjacent this adjacent iterations. not bear adjacent iterations. this adjacentscalability. iterations. bear adjacent thisiterations.the iterations. scalability. This iterations. thisscalability. iterations. This This increasesiterations. scalability. This increasesTable This increases This increasesTable This increases the This1 increasesTable theincreases pos- the 1 increasesTable pos- the pos-the 1 pos-the pos- the 1 pos- the pos- pos- FigureFigureFigure 2 summarizesFigure 2 summarizes 2 summarizes 2 summarizes the theoutline the outlineapproximately theoutlineapproximately ofapproximately outlineapproximately theapproximately ofapproximately thedistributed ofapproximately 1.28approximately the distributedof 1.28 1.28 million thedistributed 1.28 million 1.28 million distributed 1.28 hybrid million images,1.28 million images,1.28 millionhybrid images, million images,hybrid million oneimages, oneimages, hybrid needs oneimages, needs oneimages,compares one needs to one needscompares needs one optimize to to needs one optimizecompares needs optimize to to needs optimizecompares the optimize to a to optimize a the optimize totraining a optimizethe aitive traininga itive a the trainingitive ainstance efficiencyitive ainstanceitive training instanceitive efficiency instanceitive instance visiting efficiencyitive instance visiting instance visiting efficiency ofinstance visiting but visiting DDP but visitingof it but visiting also itDDPof but visiting itbut also and also significantlyDDP itbutof it but also significantly alsoand itsignificantlyDDP butDHP it also significantly and also significantly it DHP also significantly andframe- significantly multiplies DHP significantly multiplies frame- multiplies DHP multiplies frame- multiplies the multiplies frame- multiplies the the multiplies the the the the the C C1280000C1280000C1280000C 1280000C1280000C1280000C12800001280000 class number from 10K to 30M. The experiment is con- classificationweightweightweight matrixweightweight matrixweight matrixtaskweight ofmatrix matrixweight sizeof matrixof sizematrix toWsize of ofmatrixW size sizeofW a ofR sizeWW pair-wisesizeofR⇥RW size⇥WR⇥RW⇥R⇥;R for⇥;R for⇥; ImageNet-21K, for⇥; ImageNet-21K,; for ImageNet-21K, for metric; ImageNet-21K, for; ImageNet-21K, for; ImageNet-21K, for ImageNet-21K, ImageNet-21K, learningtimetimetime oftime looping oftime of loopingtime looping oftime of loopingtasktime overlooping of of overlooping overlooping of the overlooping the over completein the over complete the over complete the early over complete the complete the dataset. complete the dataset. complete dataset. complete dataset. dataset. dataset. dataset. dataset. parallelparallelparallel trainingparallel training training process training process process on processT onGPU onT GPUT on nodes.GPUT nodes.GPU nodes. For nodes. For encoding For encoding2 For encoding22 andencoding22 and2C2 andC142000002C14200000 andworksC14200000C 14200000C14200000worksC14200000worksC14200000 on14200000 ImageNet-1Kworkson onImageNet-1K ImageNet-1Kon ImageNet-1K and and ImageNet-21K and ImageNet-21K and ImageNet-21K ImageNet-21K under under theunder theundersame the same thesame same thethe sizethe sizethe sizethe further sizethe further sizethe further sizethe further enlarged size further enlarged size further enlarged further enlarged further enlarged to enlargedW to enlarged toW enlargedW to toRWW toR⇥ toRWW⇥ toR⇥RW⇥R⇥R. This⇥R.⇥ This. This⇥ is.. This ofis This is. of- This. ofis This is. of- This ofis is of-In of- isIn this of-In thisIn this work,In this work,In this work,In this we work,In this work, we present this work,we work, present we wepresent work, wepresent apresent we simple apresent we apresent simple simple apresent a simple weight simple a weighta simple weight simple a weight initializationsimple weight initialization weight initialization weight initialization weight initialization initialization initialization initialization ducted on 64 V100 GPUs with 32GB memory and aMLP totalMLP layers,MLP layers,MLP layers, we layers, wefollow weepochs, follow wefollow the follow thedata thedataten parallelwhich thedataten infeasibleten parallel infeasibledataten infeasibleten parallel infeasibletenpipeline infeasibleten parallel when infeasible istenpipeline infeasible when when infeasible usingpipeline relatively when when usingand usingpipeline when a when usingand regular usingcopy a when a regular usingand2 regular usingcopy a a2 regular distributedusingand regular2copy a distributed a regular2 distributed2 regularcopy easiera distributed2 regularbatch distributed2 distributeddata2batch distributed data distributeddata sizetrainingbatch data training datato sizetrainingbatch settings. data training data sizetraining converge settings. data training sizetraining settings.mechanism trainingmechanism We settings.mechanismmechanism Wemechanism showmechanism We tomechanism show toandhandlemechanism Wethat showto handle tohandle thatto show the handlethe tohandle offersthat to the thehandle DHPconvergence the tohandle that convergencethehandle DHPtheconvergence thetrainingconvergence the DHPconvergence the aconvergence thetraining DHPconvergence issue. trainingconvergence issue.issue. frame- trainingSpecifically, issue. issue. frame-Specifically, Specifically,issue.issue. frame- Specifically, Specifically, issue. frame-Specifically, Specifically, Specifically, themthem tothem different tothem different to different to GPUs, different GPUs, GPUs, each GPUs, each processing each(DDP) processing(DDP) each(DDP) processing pipeline.(DDP)(DDP) pipeline. processing(DDP) pipeline.(DDP) a pipeline. subset pipeline.(DDP) In a pipeline. this In subset pipeline. In a this pipeline. work, thissubset In of In a work, this this Inwork, subsetmini- of In we this work, work, this In wemini-of introduce we work, this introducework, mini- weof introduce we work, introduce wemini- introduce we awork introduce distributed we introducea adistributedwork introduce distributed anot awork distributed distributed anot ahy-distributedonlywork distributed ahy- not hy-distributedonly consumes hy- not hy-onlybefore hy-consumes hy-beforeonlybefore consumes hy- trainingbeforebeforeconsumes traininglessbefore trainingbefore trainingless trainingstarted,before GPU trainingstarted,less trainingstarted, GPU trainingstarted,less wememory, started, GPU we started, run wememory, started, GPU run wean started, run wememory, aninference run we runanbut wememory, inference runaninference wean run but also inference aninference run anepochbut inferencealso anepochinferencetrains epochbut also inference using epochtrains epoch usingalso usingepochtrains theepoch using using theepochtrains the using using the the using the the the batch size of 4096. DDP reports out-of-memory (OOM)batchbatch er- data;batch data;batch while data; while data; forwhilewarm forthewhile for the large-scale forthe large-scale start the large-scalebridbrid large-scalebrid parallelfcforbrid parallelbrid parallellayer,fcbrid parallel (DHP)brid parallellayer, instancefc (DHP)brid parallel (DHP) weparallellayer,fc training(DHP) parallel (DHP) we followlayer, training (DHP) training (DHP) we follow training frameworktraining(DHP) we follow frameworktraining theclassification.framework training follow frameworktraining the framework framework(Song the framework (Song frameworkmuch(Song the et (Song (Songmuch al. et et (Song al.2020) fastermuch (Song al. et 2020)et (Song al.2020) fastermuch al. et to 2020)et al.than2020)faster to al. et to 2020) al.than2020)faster to thetofixed 2020) than tofixed thetoDDPfixedrandom than tofixed thefixedrandom DDPrandomfixed counterpart. theDDPfixedrandomrandom initialfixed counterpart.random initialDDPrandom initial counterpart. networkrandom initial initial network counterpart. networkinitial initial network network withinitial network with network withtrainable network withtrainable withtrainable withtrainable withtrainable BNs withtrainable BNstrainable BNstotrainable BNs extractto BNsto extract BNs extractto BNsto extract BNs extracttoto extract extractto extract C CNCNCNC NCNCNCN N makemakemake themakemake the large-scale themake large-scalemake the large-scalethemake large-scalethe large-scale the large-scaleinstance the large-scale instance large-scaleinstance instance instance classification instance classification instance classification instance classification classification classification classification feasible. classification feasible. feasible. feasible. feasible. feasible. feasible. feasible.allall instanceall instanceall instanceall instanceall instanceall features instanceall features instance features instance features featuresX featuresX= featuresX= featuresX=Xx1=,Xx=xX1x2,1,=x,Xx=2x1,21,,=,x,x21x2,N,,1,x,,x2xN1,2N,,,x, x2NR,N, x,RN⇥xRN, ⇥x;RN⇥R; ⇥R;⇥R;⇥;R⇥ ; ⇥; ; ror when the class number reaches 4.7 million, while the FigureFigure 4FigureFigure compares 2FigureFigure summarizes 2Figure 2 summarizesFigure summarizes 2 2Figure summarizes summarizes 2 2 summarizes thesummarizes 2 thesummarizes outline the the outline the outlinethe outlinetheof outline thediscriminative theof outline theof outline the distributed theof outlineof distributed the distributedthe of of distributedthe distributed theof distributedhybrid the distributed hybrid distributedhybrid hybrid hybrid hybrid hybridthenability hybridthenthen wethen wethen directly wethen directly wethen directly wethen directly we assigndirectlyof we assigndirectly we assigndirectly different them assigndirectly assign them assignthem{ to assign them classification{ themto assign{ to classification them classification{ themto{ to··· classification them classification{ to···{ to··· classification classification{ to··· weights}··· classification weights2}··· weights}···2 weights2}··· weightsW} 2 weightsW2}= weightsW}2= weightsW2}=W2=W=W=W= = C CNCNCNC NCNCNCN N parallelparallelparallelparallel trainingparallel trainingparallel trainingparallel trainingparallel processtraining processtraining processtraining processtraining on process onT process on processGPUT onT processGPU onGPUTnodes. onTGPU onGPU nodes.Tnodes. onTGPU ForGPUnodes.T nodes. ForGPU encodingnodes. For nodes. encoding For Forencodingnodes. Forencoding encoding For and encoding For andencoding and encoding and and andw and1w, andw1,21w,,w21,1,2w,,,w1w2,2,1,Nw,,w,w21wN,,2,Nw,,wwR2N,,NwR,⇥wRN,⇥NwRas⇥RN anas⇥R⇥asR aninitialization.⇥ anasRas initialization.⇥ aninitialization. anas⇥as initialization. aninitialization. anas initialization. aninitialization. The initialization. The Theintu- Theintu- The intu- The intu- Theintu- Theintu- intu- intu- DHP training framework can support up to 30 million num- classifierMLPMLP initializationMLP layers,MLPMLP layers, layers,MLPMLP welayers, layers,MLP we followlayers, we layers, follow we followwelayers, thefollowwe follow we thedata followthe we followdata schemes,thedata thefollow paralleldata the paralleldata the paralleldata the paralleldata parallelpipelinedata parallelpipeline parallelpipeline parallelpipelinepipeline and i.e.,pipeline andpipelinecopy andpipelinecopy and andcopy random andcopycopy andcopy andition{copyition{copyition{ behindition{ behindition{ behind···ition{ weight behindthis···ition{ behind··· thisition{ behind initialization this··· behind···} initialization this2 behindthis··· initialization}···}2 this initialization2 initialization this···} initial-}2 initialization this2 initialization} mechanism}2 initialization mechanism2} mechanism2 mechanism mechanism mechanism is mechanism two-fold. is mechanism is two-fold. two-fold. is is two-fold. two-fold. is First, is two-fold. First,two-fold. is First, two-fold. First, First, First, First, First, themthemthem tothemthem differentto tothem differentthem differentto tothem different differentto GPUs, to different GPUs, differentto GPUs, different GPUs,each GPUs, each GPUs, each processingGPUs, each processingGPUs,each processing each processingeach processing each processing a processing subset a processing a subset subset a a of subset subset a ofmini- a subset of subsetmini- a ofmini- of subset mini- ofmini- of mini-a ofmini- weighta mini-a weight weightaa weightvector weighta vectora weight vector weightaw vector weightvectoriwrepresents vectorwi vectorrepresentsiwrepresentsw vectori irepresentswrepresentswi therepresentsiwrepresents thei feature therepresents feature the feature the feature thecenter feature the center feature thecenter feature of center feature center differentof ofcenter different center differentof of center different differentof of different differentof different ber of classes, which is 6.4 of the DDP’s limit. We also izationbatch (randombatchbatch data;batchbatch data; data;batch whilebatch data; data;whilebatch while data; for data;while whileinit. for thedata; for while thewhile forlarge-scale thefor while large-scale thefor large-scalethefor large-scalethe forlarge-scale the large-scalefc the large-scale short),layer,fc large-scalefclayer,layer,fcfc welayer,layer,fc we followfc welayer, followlayer,fc we follow we raw-featurelayer, the follow we follow we the follow the we follow the the followtransformed thetransformed thetransformed thetransformedtransformedtransformed viewstransformed initialization viewstransformed views of views views instanceof of views instance views instanceof of views instance instanceofi, of thus instancei, instanceofi thus, thus instanceinitializingi,i, thus initializing thusi initializing,i thus, initializing thus initializingi, thus initializing weights initializing weights initializing weights weights weightsas weightsas weightsas weightsas as as as as × Table 2: Ablation study on the effectiveness of different Table 4: Ablation study that compares Gaussian random and components in our method. a contrastive prior for classifier initialization.

MLP A Label ImageNet Gaussian A contrastive #Epochs head contrastive smoothing Top-1 Top-5 random prior prior XX 58.6 83.1 10 12.4 27.4 (+15.0) X 67.3 87.7 25 40.8 46.3 (+5.5) XX 67.6 88.0 50 56.1 58.0 (+1.9) XXX 68.2 88.5 100 62.9 64.1 (+1.2) 200 67.3 67.6 (+0.3) 400 69.3 69.7 (+0.4) Table 3: Ablation study that compares full instance classification and sampled instance classification. Table 5: Ablation study of label smoothing with different #Sampled instances Top-1 Top-5 number of hard classes K and smoothing factor α. 210 64.8 86.3 Hard class Smoothing 212 65.3 86.7 Top-1 Top-5 number K factor α 214 65.4 86.7 no smoothing 67.6 88.0 216 65.5 86.8 100 0.1 67.7 88.2 Full 67.3 87.7 100 0.2 68.2 88.5 100 0.3 67.5 88.2 50 0.2 68.0 88.5 with fixed BNs (raw with fixed BNs for short), and raw- 200 0.2 67.9 88.4 feature initialization with running BNs (raw with running BNs for short). We use ImageNet linear evaluation accu- i racy (evaluated after 10 epochs training) as well as aver- top-K hardest negative classes H = c1, c2, , cK . The − { ··· } age intra- and inter-instance-class similarities as the indica- label of class j 1, 2, ,N is then defined as ∈ { ··· } tors. As shown in Figure 4, raw with running BNs achieves 1 α, j = i,  − the best linear evaluation accuracy, and clearly outperforms i  i other initialization methods. In addition, raw with running yj = α/K, j H , (2)  ∈ − BNs obtains a much larger similarity gap ( 9.3%) between  0, otherwise. ∼ positive and negative instance pairs than raw with fixed BNs The loss function in Eq. (1) is redefined as ( 1.5%), validating the assumption that running BNs may ∼ PN i provide a contrastive prior for instance discrimination. 1 X j=1 yj exp(cos(wj, xi)/τ) J = log . (3) We also note that the instance features extracted by a ran- − I PN i I j=1 exp(cos(wj, xi)/τ) dom network with running BNs are also a robust start for | | ∈ semantic classification. We run an instance retrieval experi- The top-K similarities between instance weights are com- ment on the train set of ImageNet-1K with a randomly ini- puted once per epoch, which only amounts for a small frac- tialized ResNet-50 network to extract all image features, and tion of training time. The smoothed softmax cross-entropy determine whether the searched instance and the query in- loss reduces the impact of noisy or very similar negative stance are of a same semantic category. We find that a 3% pairs on the learned representation. This is also verified later top-1 accuracy can be achieved, which far exceeds the top-1 in our ablation study, where smoothing labels of several accuracy of 0.1% of a random guess. hardest classes improves the transfer performance.

Smoothing Labels of Hardest Classes Experiments A challenge of instance-level classification is that it intro- Experiment Settings duces a very large number of negative classes, significantly Training datasets. Unless specified, we use ImageNet- raising the risk of optimizing over very similar pairs that can 1K to train our unsupervised model for most experiments. be noisy and make the training hard to converge. ImageNet-1K consists of around 1.28 million images be- In this work, we handle this problem by applying label longing to 1000 classes. We treat every image instance smoothing on a few hardest instance classes. Although other (along with its various transformed views) in the dataset as a techniques (e.g., clustering) are also applicable, we choose unique class, and train a 1.28 million-way instance classifier label smoothing for its simplicity and efficiency. We no- as a pretext task to learn visual representation. tice that the semantically similar instance pairs are relatively Evaluation datasets. The learned visual representations are stable across the training process. Therefore, we represent evaluated in three ways: First, under the linear evaluation each instance i as its corresponding classification weights protocol of ImageNet-1K, we fix the representation model wi (instead of its unstable features xi), and compute the co- and learn a linear classifier upon it. The top-1/top-5 classi- sine similarities between wi and all other weights W¯i = fication accuracies are employed to compare different unsu- D (N 1) w1, , wi 1, wi+1, , wN R × − to find the pervised methods. Second, we evaluate the semi-supervised { ··· − ··· } ∈ Table 6: State-of-the-art comparison of linear classification Table 7: Comparison of semi-supervised learning accuracy accuracy of unsupervised methods on ImageNet-1K. of both label propagation and representation learning based methods on ImageNet-1K. Method Top-1 Top-5 Supervised 76.3 93.1 Label Fraction Method Exemplar (Dosovitskiy et al. 2014) 48.6 - 1% 10% CPC (Oord, Li, and Vinyals 2018) 48.7 73.6 Supervised 48.4 80.4 InstDisc. (Wu et al. 2018a) 54.0 - Label propagation: BigBiGAN (Donahue and Simonyan 2019) 56.0 77.4 PseudoLabels (Zhai et al. 2019) 51.6 82.4 NPID++ (Wu et al. 2018b) 59.0 - VAT+Entropy Min. (Miyato et al. 2018) 47.0 83.4 LocalAgg. (Zhuang, Zhai, and Yamins 2019b) 60.2 - UDA (Xie et al. 2019) - 88.5 MoCo (He et al. 2020) 60.6 - FixMatch (Sohn et al. 2020) - 89.1 SeLa (YM., C., and A. 2020) 61.5 84.0 PIRL (Misra and Maaten 2020) 63.6 - Representation Learning: CPCv2 (Henaff´ et al. 2019) 63.8 85.3 InstDisc. (Dosovitskiy et al. 2014) 39.2 77.4 CMC (Tian, Krishnan, and Isola 2019) 64.1 85.4 PIRL (Misra and Maaten 2020) 57.2 83.8 SimCLR (Chen et al. 2020a) 69.3 89.0 PCL (Li et al. 2020) 75.6 86.2 PIC (Cao et al. 2020) 70.8 90.0 SimCLR (Chen et al. 2020a) 75.5 87.8 MoCoV2 (Chen et al. 2020c) 71.1 - PIC (Cao et al. 2020) 77.1 88.7 Ours 71.4 90.3 Ours 81.8 89.2 learning performance on ImageNet-1K, where methods are required to classify images in the val set when only a small sification model can already achieve a top-1 accuracy of fraction (i.e., 1% or 10%) of manual labels are provided 67.3%, suggesting that full instance classification is a very in the train set. Third, we evaluate the transferring perfor- strong baseline for unsupervised representation learning. mance by finetuning the representations on several down- The contrastive prior and label smoothing on the top-K stream tasks and compute performance gains. In our experi- hardest classes further boost the linear evaluation accuracy ments, downstream tasks include Pascal-VOC object detec- by around 1%. tion (Everingham et al. 2010), iNaturalist18 fine-grained im- Ablation: full instance classification v.s. sampled instance age classification (Van Horn et al. 2018), and many others. classification. Table 3 compares linear classification accura- Implementation details. We use ResNet-50 (He et al. 2016) cies using representations learned by full instance classifica- as the backbone in all our experiments. We train our model tion and by sampled instance classification, with sampling using the SGD optimizer, where the weight decay and mo- sizes ranging from 210 to 216. Note we remove a contrastive mentum are set to 0.0001 and 0.9, respectively. The initial prior and label smoothing in the experiments and only an- learning rate (lr) is set to 0.48 and decays using the cosine alyze the impact of class sampling. We observe that full in- annealing scheduler. In addition, we use 10 epochs of linear stance classification clearly outperforms sampled instance lr warmup to stabilize training. The minibatch size is 4096 classification by a margin of 1.8%, verifying the benefits of and the feature dimension D = 128. We set the temperature exploring the complete set of negative instances. in Eq. (1) as τ = 0.15, and the smoothing factor in Eq. (3) as α = 0.2. For fair comparison, following practices in re- Ablation: a contrastive prior v.s. random initialization. cent works (Chen et al. 2020a; Cao et al. 2020), we feed two Table 4 compares the linear evaluation performance of our augmented views per instance for training. All experiments method using Gaussian random and a contrastive prior for are conducted on 64 V100 GPUs with 32GB memory. classifier initialization, with training length increased from 10 to 400 epochs. We observe that a contrastive prior sig- Ablation Study nificantly speeds up convergence compared to random initialization, especially in early epochs (i.e., epoch 10, 25, This section validates several modeling and configuration and 50). Besides, the accuracy of a contrastive prior version options for our method. We compare the quality of repre- consistently outperforms random initialization counterpart, sentations using ImageNet linear protocol evaluated on the showing the robustness of our initialization mechanism. val set. In each experiment, the linear classifier is trained with a batch size of 2048 and a lr of 40 that decays during Ablation: label smoothing on hardest classes. Table 5 training under the cosine annealing rule. shows the impact of label smoothing on representation learn- Ablation: effectiveness of components. Table 2 shows the ing. We vary the considered number K of hardest negative linear evaluation performance using different combinations classes, and the smoothing factor α.A no smoothing base- of components in our method, including a two-layer MLP line where K = 0 and α = 0 is also included for com- head, a contrastive prior, and label smoothing. Accuracies parison. We observe that smoothing labels of a few hardest are measured after 200-epochs training. We find that all the classes improves linear evaluation performance over non- three components bring performance gains, improving the smoothing baseline in most hyper-parameter settings. The top-1 accuracy of our method from 58.6% to a competi- best accuracy can be obtained with K = 100 and α = 0.2, tive 68.2%. We also observe that a vanilla instance clas- where a 0.6% gain of top-1 accuracy can be achieved. Table 8: Comparison of transferring performance on PAS- Table 10: Transferring performance of different pretrained CAL VOC object detection. models on more downstream visual tasks.

Method AP AP50 AP75 Method CIFAR10 CIFAR100 SUN397 DTD Supervised 53.5 81.3 58.8 Scratch 95.9 80.2 53.6 64.8 MoCo (He et al. 2020) 55.9 81.5 62.6 Supervised 97.5 86.4 64.3 74.6 MoCoV2 (Chen et al. 2020c) 57.4 82.5 64.0 SimCLR 97.7 85.9 63.5 73.2 PIC (Cao et al. 2020) 57.1 82.4 63.4 Ours 97.8 86.2 64.2 77.6 Ours 57.2 82.2 64.0

Table 9: Comparison of transferring performance on iNatu- end-to-end on the trainval07+12 set of the PASCAL VOC ralist fine-grained classification. dataset (Everingham et al. 2010). We adopt the same experiment settings as MoCoV2 (Chen et al. 2020c). The AP (Av- Method Top-1 Top-5 erage Precision), AP50, and AP75 scores on the test2007 set Scratch 65.4 85.5 are used as indicators. Table 8 shows the results. Our trans- Supervised 66.0 85.6 ferring performance is significantly better than supervised MoCo (He et al. 2020) 65.7 85.7 pretraining counterpart (+3.7% in AP), and is competitive PIC (Cao et al. 2020) 66.2 85.7 Ours 66.2 86.2 with state-of-the-art unsupervised learning methods. iNaturalist fine-grained classification: We finetune the pretrained model end-to-end on the train set of iNaturalist 2018 dataset (Van Horn et al. 2018) and evaluate the top-1 Comparison with Previous Results and top-5 classification accuracies on the val set. Results are ImageNet linear evaluation. Table 6 compares our work shown in Table 9. Our method is closely competitive with with previous unsupervised visual representation learning the ImageNet supervised pretraining counterpart as well as methods under the ImageNet linear evaluation protocol. previous state-of-the-art unsupervised methods. The results We follow recent practices (Chen et al. 2020a,c) to train a indicate the discriminative ability of our pretrained represen- longer length, i.e., 1000 epochs. The proposed unsupervised tation in fine-grained classification. learning framework achieves a top-1 accuracy of 71.4% on More downstream tasks: Table 10 shows transferring re- ImageNet-1K, outperforming SimCLR (+2.1%), PIC (0.6%) sults on more downstream tasks, including image classifi- and MoCoV2 (+0.3%). The results verify that a simple full cation on CIFAR10, CIFAR100 (Krizhevsky, Hinton et al. instance classification framework can learn very competitive 2009), SUN397 (Xiao et al. 2010), and DTD (Cimpoi et al. visual representations. The performance gains can partly be 2014). To summarize, our method performs competitively attributed to the ability of large-scale full negative instance with ImageNet supervised pretraining as well as state-of- exploration, which is not supported by previous unsuper- the-art unsupervised pretraining. vised frameworks (Chen et al. 2020a,c; Cao et al. 2020). Semi-supervised learning. Following (Kolesnikov, Zhai, and Beyer 2019; Chen et al. 2020a), we sample a 1% or 10% Conclusion fraction of labeled data from ImageNet, and train a classifier In this work, we present an unsupervised visual represen- starting from our unsupervised pretrained model to evaluate tation learning framework where the pretext task is to dis- the semi-supervised learning performance. For 1% labels, tinguish all instances in a dataset with a parametric classi- we train the backbone with a lr of 0.001 and the classifier fier. The task is similar to supervised semantic classifica- with a lr of 15. For 10% labels, the lrs for the backbone tion, but with a much larger number of classes (equal to and the classifier are set to 0.001 and 10, respectively (Li the dataset size) and finer granularity. We first introduce a et al. 2020). Table 7 compares our work with both represen- hybrid parallel training framework to make large-scale in- tation learning based and label propagation based methods. stance classification feasible, which significantly reduces the We obtain a top-5 accuracy of 81.8% when only 1% of la- GPU memory overhead and speeds up training in our ex- bels are used, outperforming all previous methods by a non- periments. Second, we propose to improve the convergence negligible margin (+4.7%). We also achieve the best result by introducing a contrastive prior to the instance classifier. when 10% of labels are provided, surpassing SimCLR and This is achieved by initializing the classification weights as PIC by 1.4% and 0.5%, respectively. The results suggest the raw instance features extracted by a fixed random network strong discriminative ability of our learned representation. with running BNs. We show in our experiments that this Transfer learning. To further evaluate the learned represen- simple strategy clearly speeds up convergence and improves tation, we apply the pretrained model to several downstream the transferring performance. Finally, to reduce the impact visual tasks (including detection, fine-grained classification, of noisy negative instance pairs, we propose to smooth the and many others) to evaluate the transferring performance. labels of a few hardest classes. Extensive experiments on PASCAL VOC Object Detection: Following (He et al. ImageNet classification, semi-supervised classification, and 2020), we use Faster-RCNN (Ren et al. 2015) with ResNet- many downstream tasks show that our simple unsupervised 50 backbone as the object detector. We initialize ResNet- representation learning method performs comparable or su- 50 with our pretrained weights, and finetune all layers perior than state-of-the-art unsupervised methods. References Feng, Z.; Xu, C.; and Tao, D. 2019. Self-supervised repre- Asano, Y. M.; Rupprecht, C.; and Vedaldi, A. 2020. Self- sentation learning by rotation feature decoupling. In Pro- labelling via simultaneous clustering and representation ceedings of the IEEE Conference on Computer Vision and learning . Pattern Recognition, 10364–10374. Cao, Y.; Xie, Z.; Liu, B.; Lin, Y.; Zhang, Z.; and Hu, H. Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsuper- 2020. Parametric Instance Classification for Unsupervised vised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 Visual Feature Learning. arXiv preprint arXiv:2006.14618 . . He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Momentum contrast for unsupervised visual representation Deep clustering for unsupervised learning of visual features. learning. In Proceedings of the IEEE/CVF Conference on In Proceedings of the European Conference on Computer Computer Vision and Pattern Recognition, 9729–9738. Vision (ECCV), 132–149. He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep resid- Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020a. ual learning for image recognition. In Proceedings of the A Simple Framework for Contrastive Learning of Visual IEEE conference on computer vision and pattern recogni- Representations. arXiv preprint arXiv:2002.05709 . tion, 770–778. Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; and Hin- Henaff,´ O. J.; Srinivas, A.; De Fauw, J.; Razavi, A.; Doer- ton, G. 2020b. Big Self-Supervised Models are Strong Semi- sch, C.; Eslami, S.; and Oord, A. v. d. 2019. Data-efficient Supervised Learners. arXiv preprint arXiv:2006.10029 . image recognition with contrastive predictive coding. arXiv Chen, X.; Fan, H.; Girshick, R.; and He, K. 2020c. Improved preprint arXiv:1905.09272 . baselines with momentum contrastive learning. arXiv Jenni, S.; and Favaro, P. 2018. Self-supervised feature learn- preprint arXiv:2003.04297 . ing by learning to spot artifacts. In Proceedings of the IEEE Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; ; and Conference on Computer Vision and Pattern Recognition, Vedaldi, A. 2014. Describing Textures in the Wild. In Pro- 2733–2742. ceedings of the IEEE Conf. on Computer Vision and Pattern Kolesnikov, A.; Zhai, X.; and Beyer, L. 2019. Revisiting Recognition (CVPR). self-supervised visual representation learning. In Proceed- ings of the IEEE conference on Computer Vision and Pattern Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei- Recognition, 1920–1929. Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple pattern recognition, 248–255. Ieee. layers of features from tiny images . Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsuper- Li, J.; Zhou, P.; Xiong, C.; Socher, R.; and Hoi, S. C. 2020. vised visual representation learning by context prediction. In Prototypical Contrastive Learning of Unsupervised Repre- Proceedings of the IEEE international conference on com- sentations. arXiv preprint arXiv:2005.04966 . puter vision, 1422–1430. Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learn- Doersch, C.; and Zisserman, A. 2017. Multi-task self- ing of pretext-invariant representations. In Proceedings of supervised visual learning. In Proceedings of the IEEE In- the IEEE/CVF Conference on Computer Vision and Pattern ternational Conference on Computer Vision, 2051–2060. Recognition, 6707–6717. Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Donahue, J.; Krahenb¨ uhl,¨ P.; and Darrell, T. 2016. Adver- Virtual adversarial training: a regularization method for su- sarial feature learning. arXiv preprint arXiv:1605.09782 . pervised and semi-supervised learning. IEEE transactions Donahue, J.; and Simonyan, K. 2019. Large scale adversar- on pattern analysis and machine intelligence 41(8): 1979– ial representation learning. In Advances in Neural Informa- 1993. tion Processing Systems, 10542–10552. Noroozi, M.; and Favaro, P. 2016. Unsupervised learning of Dosovitskiy, A.; Fischer, P.; Springenberg, J. T.; Riedmiller, visual representations by solving jigsaw puzzles. In Euro- M.; and Brox, T. 2015. Discriminative unsupervised feature pean Conference on Computer Vision, 69–84. Springer. learning with exemplar convolutional neural networks. IEEE Noroozi, M.; Vinjimoor, A.; Favaro, P.; and Pirsiavash, H. transactions on pattern analysis and machine intelligence 2018. Boosting self-supervised learning via knowledge 38(9): 1734–1747. transfer. In Proceedings of the IEEE Conference on Com- Dosovitskiy, A.; Springenberg, J. T.; Riedmiller, M.; and puter Vision and Pattern Recognition, 9359–9367. Brox, T. 2014. Discriminative unsupervised feature learning Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation with convolutional neural networks. In Advances in neural learning with contrastive predictive coding. arXiv preprint information processing systems, 766–774. arXiv:1807.03748 . Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and and Zisserman, A. 2010. The pascal visual object classes Efros, A. A. 2016. Context encoders: Feature learning by (voc) challenge. International journal of computer vision inpainting. In Proceedings of the IEEE conference on com- 88(2): 303–338. puter vision and pattern recognition, 2536–2544. Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster Proceedings of the IEEE International Conference on Com- R-CNN: Towards Real-Time Object Detection with Region puter Vision, 6002–6012. Proposal Networks. In Cortes, C.; Lawrence, N. D.; Lee, Zhuang, C.; Zhai, A. L.; and Yamins, D. 2019b. Local aggre- D. D.; Sugiyama, M.; and Garnett, R., eds., Advances in gation for unsupervised learning of visual embeddings. In Neural Information Processing Systems 28, 91–99. Proceedings of the IEEE International Conference on Com- Sohn, K.; Berthelot, D.; Li, C.-L.; Zhang, Z.; Carlini, N.; puter Vision, 6002–6012. Cubuk, E. D.; Kurakin, A.; Zhang, H.; and Raffel, C. 2020. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685 . Song, L.; Pan, P.; Zhao, K.; Yang, H.; Chen, Y.; Zhang, Y.; Xu, Y.; and Jin, R. 2020. Large-Scale Training System for 100-Million Classification at Alibaba. In Proceedings of the 26th ACM SIGKDD International Conference on Knowl- edge Discovery & Data Mining, 2909–2930. Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive Mul- tiview Coding. arXiv preprint arXiv:1906.05849 . Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; and Belongie, S. 2018. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, 8769–8778. Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018a. Unsuper- vised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742. Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018b. Unsuper- vised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742. Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, 3485–3492. IEEE. Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; and Le, Q. V. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848 . YM., A.; C., R.; and A., V. 2020. Self-labelling via simultaneous clustering and representation learning. In Interna- tional Conference on Learning Representations (ICLR). Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. S4l: Self-supervised semi-supervised learning. In Proceed- ings of the IEEE international conference on computer vision, 1476–1485. Zhang, L.; Qi, G.-J.; Wang, L.; and Luo, J. 2019. Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2547–2555. Zhang, R.; Isola, P.; and Efros, A. A. 2016. Colorful image colorization. In European conference on computer vision, 649–666. Springer. Zhuang, C.; Zhai, A. L.; and Yamins, D. 2019a. Local aggre- gation for unsupervised learning of visual embeddings. In