Factors of Influence for Transfer Learning Across Diverse Appearance Domains and Task Types
Thomas Mensink*, Jasper Uijlings*, Alina Kuznetsova, Michael Gygli, and Vittorio Ferrari

Abstract—Transfer learning enables re-using knowledge learned on a source task to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tuning on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output task types relevant to modern computer vision applications. In total we carry out over 1200 transfer experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations for practitioners.

Index Terms—Transfer Learning, Computer Vision.

• Google Research
• Primary contacts: [email protected] and [email protected].
* Equal contribution.
Manuscript submitted in March, 2021 to TPAMI. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
arXiv:2103.13318v1 [cs.CV] 24 Mar 2021

1 INTRODUCTION

Transfer learning is omnipresent in computer vision. The common practice is transfer learning through ILSVRC'12 pre-training: train on the ILSVRC'12 image classification task [4], copy the resulting weights to a target model, then fine-tune for the target task at hand. This strategy was shown to be effective on a wide variety of datasets and task types, including image classification [7], [17], [37], [42], [55], object detection [35], semantic segmentation [90], human pose estimation [38], [123], [126], and depth estimation [16], [29]. Intuitively, the reason for this success is that the network learns a strong generic visual representation, providing a better starting point for learning a new task than training from scratch.

But can we do better than ILSVRC'12 pre-training? And what factors make a source task good for transferring to a given target task? Some previous works aim to demonstrate that a generic representation trained on a single large source dataset works well for a variety of classification tasks [54], [71], [122]. Others instead try to automatically find a subset of the source dataset that transfers best to a given target [33], [74], [77], [117]. By virtue of dataset collection suites such as VTAB [122] and VisDA [76], recently several works in transfer learning and domain adaptation experiment with target datasets spanning a variety of image domains [23], [54], [57], [71], [76], [122], [124]. However, most previous work focuses solely on image classification [23], [33], [57], [66], [77], [87], [122] or a single structured prediction task [40], [62], [101], [103], [117].

In this paper, we go beyond previous works by providing a large-scale exploration of transfer learning across a wide variety of image domains and task types. In particular, we perform over 1200 transfer learning experiments across 20 datasets spanning seven diverse image domains (consumer, driving, aerial, underwater, indoor, synthetic, close-ups) and four task types (semantic segmentation, object detection, depth estimation, keypoint detection). In many of our experiments the source and target come from different image domains, task types, or both. For example, we find that semantic segmentation on COCO [65] is a good source for depth estimation on SUN RGB-D [94] (Tab. 3e); and even that keypoint detection on Stanford Dogs [8], [48] helps object detection on the Underwater Trash [30] dataset (Tab. 4b). We then do a systematic meta-analysis of these experiments, relating transfer learning performance to three underlying factors of variation: the difference in image domain between source and target tasks, their difference in task type, and the size of their training sets. This yields new insights into when transfer learning brings benefits and which source works best for a given target.

At a high level, our main conclusions are: (1) for most target tasks we are able to find sources that significantly outperform ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should include the image domain of the target dataset to achieve the best results; (4) conversely, we observe almost no negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types.

The rest of our paper is organized as follows: Sect. 2 discusses related work. Sect. 3 discusses the three factors of variation. Sect. 4 details and validates the network architectures that we use. Sect. 5 presents our transfer learning experiments. Sect. 6 presents a detailed analysis of our results. Sect. 7 concludes our paper.
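The pre-train, copy-weights, fine-tune recipe described above can be sketched in miniature. The following is a hypothetical toy illustration, not the paper's actual pipeline: a random projection plays the role of the pre-trained backbone weights, the target task is a synthetic binary classification problem, and only a freshly initialized head is fine-tuned with plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-trained" backbone weights: a toy stand-in for a representation
# learned on a source task (e.g. ILSVRC'12 classification).
W_backbone = rng.normal(size=(16, 8)) / 4.0   # maps 16-D inputs to 8-D features

def features(x):
    # Frozen backbone forward pass: the weights copied from the source model.
    return np.maximum(x @ W_backbone, 0.0)    # ReLU features

# Target task: toy binary classification data with a new task-specific head.
X = rng.normal(size=(64, 16))
y = (X[:, 0] > 0).astype(float)
w_head = np.zeros(8)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-features(X) @ w))
    return -np.mean(y * np.log(p + 1e-9) + (1.0 - y) * np.log(1.0 - p + 1e-9))

loss_before = loss(w_head)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-features(X) @ w_head))
    # Fine-tune the head only; "full" fine-tuning would also update W_backbone.
    w_head -= 0.05 * features(X).T @ (p - y) / len(y)
loss_after = loss(w_head)
```

The key point of the recipe is that `W_backbone` is initialized from the source model rather than from scratch, so the target model starts from an already useful representation; whether the backbone is kept frozen or also updated is a design choice.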
Fig. 1: We explore transfer learning across a wide variety of image domains and task types. We show here example images for the 20 datasets we consider, highlighting their visual diversity. We grouped them into manually defined image domains: consumer in orange, driving in green, synthetic in red, aerial in purple, underwater in blue, close-ups in yellow, and indoor in magenta.

2 RELATED WORK

We review here related work on transfer learning in computer vision. We take a rather broad definition of this term, and include several families of works that transfer knowledge from one task to another, even though they are not explicitly positioned as 'transfer learning'. We pay particular attention to their experimental settings and analyses, as this is the aspect most related to our work.

Domain adaptation. This family of works adapts image classifiers from a source domain to a target domain, typically containing the same classes but appearing in different kinds of images. Some of the most common techniques are minimizing the discrepancy in feature distributions [32], [66], [87], [96], [102], [105], [124], embedding domain alignment layers into a network [13], [68], [84], and directly transforming images from the target to the source domain [9], [86], [88]. Earlier works consider relatively small domain shifts, e.g. where the source and target datasets are captured by different cameras [87]. Recent works explored larger domain shifts, e.g. across clipart, web-shopping products, consumer photos and artist paintings [105], or across simple synthetic renderings and real consumer photos [57], [76], and even fusing multiple datasets spanning several modalities such as hand-drawn sketches, synthetic renderings and consumer photos [124]. Despite this substantial variation in domain, most works consider only object classification tasks, with relatively few papers tackling semantic segmentation [40], [62], [76], [98], [101], [119].

Few-shot learning. Works in this area aim to learn to classify new target classes from very few examples, typically 1-10, by transferring knowledge from source classes with many training samples. Typically the source and target classes come from the same dataset, hence there is no domain shift. There are two main approaches: metric-based and optimization-based. Optimization-based methods employ meta-learning [28], [75], [79], [80], [112]. These methods train a model such that it can be later adapted to new classes with few update steps. MAML [28] is the most prominent example of this line of work. Despite its success, [79] later showed that its performance largely stems from learning a generic feature embedding, rather than a model that can adapt to new data faster. Metric-based methods [52], [69], [93], [99], [106] aim at learning an embedding space which allows classifying examples of any class using a distance-based classifier, e.g. nearest neighbor [93]. Overall, the community seems to be reaching a consensus [15], [34], [41], [78], [95], [100]: the key ingredient to high-performing few-shot classification is learning a general representation, rather than sophisticated algorithms for adapting to the new classes. In line with these works, we study what representation is suitable for solving a target task. In contrast with these works, our focus is on more complex structured prediction. While most few-shot learning works focus on classification, there are a few exceptions, e.g.

large source dataset works well for all target tasks [54], [122]. Others instead try to automatically find a subset of the source dataset that transfers best to a given target [33], [74], [77], [117]. The subsets correspond either to subtrees of the source class hierarchy [74], [77], or are derived by clustering source images by appearance [117]. After selecting a subset, they retrieve its corresponding pre-trained source model and fine-tune only that one on the target dataset.

Taskonomy [121] explores a converse scenario. They perform transfer learning across many task types, including several dense prediction ones, e.g. surface normal estimation and semantic segmentation. However, there is only one image domain (indoor scenes) as all the tasks are defined on the same images (one dataset). This convenient setting enables them to study which source task type helps the most for which target task type, by exhaustively trying out all pairs. As a powerful outcome, they derive a taxonomy of task types, with directed edges weighted by how much a task
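The metric-based few-shot idea discussed above, classifying queries by distance to per-class prototypes in an embedding space, in the spirit of nearest-neighbor classifiers such as [93], can be sketched as follows. Everything here is a hypothetical toy: the embedding is a fixed near-identity map standing in for a representation learned on the source classes, and the class names and data are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Embedding standing in for a representation learned on many source classes;
# here just a small perturbation of the identity, purely for illustration.
W_embed = np.eye(4) + 0.05 * rng.normal(size=(4, 4))

def embed(x):
    return x @ W_embed

# A few-shot "episode": two novel classes with only 3 support examples each
# (toy, well-separated data; class names are invented).
support = {
    "urchin": np.array([5.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=(3, 4)),
    "coral": np.array([-5.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=(3, 4)),
}

# One prototype per class: the mean of its embedded support examples.
prototypes = {c: embed(s).mean(axis=0) for c, s in support.items()}

def classify(query):
    # Distance-based classification: label a query by its nearest prototype.
    z = embed(query)
    return min(prototypes, key=lambda c: np.linalg.norm(z - prototypes[c]))
```

No parameters are updated for the novel classes; all the work is done by the quality of the embedding, which matches the consensus cited above that a general representation is the key ingredient.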