DAP: Detection-Aware Pre-training with Weak Supervision
Yuanyi Zhong1, Jianfeng Wang2, Lijuan Wang2, Jian Peng1, Yu-Xiong Wang1, Lei Zhang2
1 University of Illinois at Urbana-Champaign  {yuanyiz2, jianpeng, yxw}@illinois.edu
2 Microsoft  {jianfw, lijuanw, leizhang}@microsoft.com

Abstract

This paper presents a detection-aware pre-training (DAP) approach, which leverages only weakly-labeled classification-style datasets (e.g., ImageNet) for pre-training, but is specifically tailored to benefit object detection tasks. In contrast to the widely used image classification-based pre-training (e.g., on ImageNet), which does not include any location-related training task, we transform a classification dataset into a detection dataset through a weakly supervised object localization method based on Class Activation Maps, and use it to directly pre-train a detector, making the pre-trained model location-aware and capable of predicting bounding boxes. We show that DAP can outperform the traditional classification pre-training in terms of both sample efficiency and convergence speed in downstream detection tasks including VOC and COCO. In particular, DAP boosts the detection accuracy by a large margin when the number of examples in the downstream task is small.

Figure 1. The DAP workflow. It consists of 4 steps: (1) classifier pre-training on a weak-supervision dataset, (2) pseudo box generation by WSOL (e.g., through CAM as illustrated), (3) detector pre-training with the generated pseudo boxes, and (4) downstream detection tasks. The traditional classification pre-training and fine-tuning pipeline goes directly from Step (1) to Step (4), while DAP inserts the additional Steps (2) and (3) in between. In both cases, the pre-trained weights are used to initialize the downstream models. DAP gives the model a chance to learn how to perform explicit localization, and it can pre-train detection-specific components that classification pre-training cannot, such as the FPN, RPN, and box regressor in a Faster R-CNN detector.

1. Introduction

Pre-training and fine-tuning have been a dominant paradigm for deep learning-based object recognition in computer vision [14, 10, 29, 17]. In such a paradigm, neural network weights are typically pre-trained on a large dataset (e.g., through ImageNet [8] classification training) and then transferred to initialize models in downstream tasks. Pre-training can presumably help improve downstream tasks in multiple ways. The low-level convolutional filters, such as edge, shape, and texture filters, are already well learned in pre-training [42]. The pre-trained network is also capable of providing meaningful semantic representations: in the case of ImageNet classification pre-training, since the number of categories is large (1000 classes), the downstream object categories are likely related to a subset of the pre-training categories and can reuse the pre-trained feature representations. Pre-training may also help the optimizer avoid bad local minima by providing a better initialization point than a completely random initialization [12]. Therefore, fine-tuning requires only a relatively small number of gradient steps to achieve competitive accuracy.

However, the empirical gain that classification pre-training brings to object detection diminishes with successively larger pre-training datasets, ranging from ImageNet-1M and ImageNet-5k [17] to ImageNet-21k (14M), JFT-300M [36], and billion-scale Instagram images [25]. Meanwhile, [16] shows that training from random initialization (i.e., from scratch) can work equally well given sufficiently large data (COCO [24]) and a sufficiently long training schedule, which makes the effect of classification pre-training questionable.

We conjecture that the diminishing gain of classification pre-training for object detection is due to several mismatches between the pre-training and fine-tuning tasks. First, the task objectives differ. Classification pre-training is typically unaware of the downstream detection task: it adopts a single whole-image classification loss that encourages translation- and scale-invariant features, whereas detection fine-tuning involves several classification and regression losses that are sensitive to object locations and scales. Second, the data distributions are misaligned. The localization information required by detection is not explicitly available in classification pre-training. Third, the architectures are misaligned. The network used in pre-training is a bare backbone, such as a ResNet [18] followed by average pooling and a linear classification layer. In contrast, an object detector contains various additional architectural components such as the Region Proposal Network (RPN) [29], the Feature Pyramid Network (FPN) [22], the ROI classification heads, and the bounding box regression heads [29]. These detector-specific components are not pre-trained and are instead randomly initialized during detection fine-tuning, which can be sub-optimal.
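To make the third mismatch concrete, the following sketch (ours, not from the paper) counts how many parameters of a Faster R-CNN with an FPN can actually inherit ImageNet classification weights, versus how many must start from random initialization. It assumes torchvision's Faster R-CNN implementation and its weights / weights_backbone builder arguments, whose exact names may differ across torchvision versions.

# Illustrative sketch (not the paper's code): with classification pre-training,
# only the ResNet trunk of a Faster R-CNN can receive pre-trained weights;
# the FPN, RPN, and ROI box heads start from random initialization.
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None,                       # no detection-level pre-training
    weights_backbone="IMAGENET1K_V1",   # classification pre-training only
    num_classes=21,                     # e.g., 20 VOC classes + background
)

pretrained, random_init = 0, 0
for name, param in detector.named_parameters():
    if name.startswith("backbone.body."):   # the ResNet trunk
        pretrained += param.numel()
    else:                                   # backbone.fpn, rpn, roi_heads
        random_init += param.numel()

print(f"inherited from classification pre-training: {pretrained:,} parameters")
print(f"randomly initialized at fine-tuning time:   {random_init:,} parameters")

The second group, which covers the FPN, RPN, and box heads, is exactly the part that DAP is designed to pre-train as well.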
Aiming to bridge the gap between pre-training on classification data and detection fine-tuning, we introduce a Detection-Aware Pre-training (DAP) procedure, shown in Figure 1. Two properties are necessary to pre-train a detector: (1) classification should be done locally rather than globally, and (2) the features should be capable of predicting bounding boxes and easily adaptable to any desired object categories during fine-tuning. With these properties in mind, DAP starts by pre-training a classifier on the classification data and extracts localization information with existing tools developed in Weakly Supervised Object Localization (WSOL), based on Class Activation Maps (CAM) [47]. The next step treats the localized instances as pseudo bounding boxes to pre-train a detection model. Finally, the pre-trained weights are used for model initialization in downstream detection tasks such as VOC [13] and COCO [24]. DAP enables the pre-training of (almost) the entire detector architecture and offers the model the opportunity to adapt its representation to perform localization explicitly. Our problem setting focuses on leveraging the weak image-level supervision in classification-style data for pre-training (ImageNet-1M and ImageNet-14M) [8], and therefore allows a head-to-head comparison with traditional classification pre-training. Note that our setting differs from unsupervised pre-training [15, 4, 5], which uses only unlabeled images, and from fully supervised detection pre-training [32], which is hard to scale.
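To make Steps (1) and (2) concrete, the sketch below derives a single pseudo box from the normalized CAM of an ImageNet-pre-trained ResNet-50 classifier. It is our minimal illustration of the general recipe rather than the authors' code: it assumes torchvision 0.13 or newer for the string weights argument, the 0.3 threshold mirrors the value shown in the paper's pseudo box generation figure (Figure 2), and taking one tight box around all above-threshold pixels (rather than, say, the largest connected component) is a simplification on our part.

import numpy as np
import torch
import torch.nn.functional as F
import torchvision

# Step (1): a classifier pre-trained with image-level (weak) supervision.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()

# Cache the last convolutional feature map via a forward hook.
features = {}
model.layer4.register_forward_hook(lambda m, inp, out: features.update({"map": out}))

@torch.no_grad()
def pseudo_box(image, class_idx, threshold=0.3):
    """image: (1, 3, H, W) normalized tensor; returns one (x1, y1, x2, y2) pseudo box."""
    _, _, H, W = image.shape
    model(image)
    fmap = features["map"][0]                      # (C, h, w) feature map
    w_c = model.fc.weight[class_idx]               # (C,) classifier weights of the label
    cam = torch.einsum("c,chw->hw", w_c, fmap).clamp(min=0)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    cam = F.interpolate(cam[None, None], size=(H, W), mode="bilinear",
                        align_corners=False)[0, 0]
    ys, xs = np.where((cam >= threshold).numpy())  # Step (2): threshold the CAM
    if xs.size == 0:                               # fall back to the whole image
        return (0, 0, W - 1, H - 1)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

Looping this over the classification dataset with each image's ground-truth label yields the pseudo detection dataset used in Step (3), where a standard detector (e.g., Faster R-CNN with FPN) is trained on the pseudo boxes before being fine-tuned downstream in Step (4).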
Comprehensive experiments demonstrate that inserting the simple, lightweight DAP steps between the traditional classification pre-training and fine-tuning stages yields consistent gains across different downstream detection tasks. The improvement is especially significant in the low-data regime, which is particularly useful in practice for saving annotation effort. In the full-data setting, DAP leads to faster convergence than classification pre-training and also improves the final detection accuracy by a decent margin. Our work suggests that a carefully designed detection-specific pre-training strategy built on classification-style data can still benefit object detection. We believe this work makes the first attempt towards detection-aware pre-training with weak supervision.

2. Related Work

Pre-training and fine-tuning paradigm. Pre-training has contributed to many breakthroughs in applying CNNs to object recognition [14, 10, 29, 17]. A common strategy, for example, is to pre-train a network through supervised learning on the ImageNet classification dataset [8, 30] and then fine-tune the weights in downstream tasks. Zeiler et al. visualize the convolutional filters in a pre-trained network and find that intermediate layers capture universal local patterns, such as edges and corners, that generalize to other vision tasks [42]. Pre-training may also ease the difficult optimization problem of fitting deep neural networks with first-order methods [12]. Recently, the limits of supervised pre-training have been pushed by scaling up the datasets. In Big Transfer (BiT), the authors show that surprisingly high transfer performance can be achieved across 20 downstream tasks by classification pre-training on a dataset of 300M noisily labeled images (JFT-300M) [5]. Notably, pre-training on JFT-300M drastically improves performance with small downstream data. Similarly, Mahajan et al. explore the limits of (weakly) supervised pre-training with noisy hashtags on billions of social media (Instagram) images [25]. The traditional ImageNet-1M becomes a small dataset compared to the Instagram data, and a gain of 5.6% in ImageNet-1M classification accuracy can be achieved by pre-training on the billion-scale data. In other deep learning fields, pre-training is also a dominant strategy, for instance in natural language processing (NLP) and speech processing [31, 41]: BERT [9] and GPT-3 [3] show that language models pre-trained on massive corpora generalize well to various NLP tasks.

Pre-training and object detection. However, the story of how, and to what extent, classification pre-training helps object detection is up for debate. On one hand, it is observed that pre-training is important when downstream data is limited [1, 16]. On the other hand, there is a line of work reporting competitive accuracy when training modern object detectors from scratch [37, 33, 49, 16]. The gain brought by classification pre-training on larger datasets seems to diminish [20, 25, 16], and classification pre-training may sometimes even harm localization, while still benefiting classification, when the downstream data is abundant [25].

[Figure 2 panel labels: Components; Label: Kite (bird); Normalized CAM for Kite; Threshold: 0.3; Pseudo Boxes.]