FACTORS OF POSITIVE TRANSFER 1 Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types

Thomas Mensink*, Jasper Uijlings*, Alina Kuznetsova, Michael Gygli, and Vittorio Ferrari

Abstract—Transfer learning enables to re-use knowledge learned on a source task to help learning a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tune on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output tasks types relevant to modern computer vision applications. In total we carry out over 1200 transfer experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations for practitioners.

Index Terms—Transfer Learning, Computer Vision. !

1 INTRODUCTION

ranfer learning is omnipresent in computer vision. ing a large-scale exploration of transfer learning across a T The common practice is transfer learning through wide variety of image domains and task types. In particular, ILSVRC’12 pre-training: train on the ILSVRC’12 image clas- we perform over 1200 transfer learning experiments across sification task [4], copy the resulting weights to a target 20 datasets spanning seven diverse image domains (con- model, then fine-tune for the target task at hand. This sumer, driving, aerial, underwater, indoor, synthetic, close- strategy was shown to be effective on a wide variety of ups) and four task types (semantic segmentation, object datasets and task types, including image classification [7], detection, depth estimation, keypoint detection). In many of [17], [37], [42], [55], object detection [35], semantic segmen- our experiments the source and target come from different tation [90], human pose estimation [38], [123], [126], and image domains, task types, or both. For example, we find depth estimation [16], [29]. Intuitively, the reason for this that semantic segmentation on COCO [65] is a good source success is that the network learns a strong generic visual for depth estimation on SUN RGB-D [94](Tab. 3e); and even representation, providing a better starting point for learning that keypoint detection on Stanford Dogs [8], [48] helps ob- a new task than training from scratch. ject detection on the Underwater Trash [30] dataset (Tab. 4b). But can we do better than ILSVRC’12 pre-training? And We then do a systematic meta-analysis of these experiments, what factors make a source task good for transferring to a relating transfer learning performance to three underlying given target task? Some previous works aim to demonstrate factors of variation: the difference in image domain between that a generic representation trained on a single large source source and target tasks, their difference in task type, and the dataset works well for a variety of classification tasks [54], size of their training sets. This yields new insights into when [71], [122]. Others instead try to automatically find a subset transfer learning brings benefits and which source works arXiv:2103.13318v1 [cs.CV] 24 Mar 2021 of the source dataset that transfers best to a given target [33], best for a given target. [74], [77], [117]. By virtue of dataset collection suites such as At a high level, our main conclusions are: (1) for most VTAB [122] and VisDA [76], recently several works in trans- target tasks we are able to find sources that significantly fer learning and domain adaptation experiment with target outperform ILSVRC’12 pre-training; (2) the image domain datasets spanning a variety of image domains [23], [54], [57], is the most important factor for achieving positive transfer; [71], [76], [122], [124]. However, most previous work focuses (3) the source dataset should include the image domain of solely on image classification [23], [33], [57], [66], [77], [87], the target dataset to achieve best results; (4) conversely, we [122] or a single structured prediction task [40], [62], [101], observe almost no negative effects when the image domain [103], [117]. of the source task is much broader than that of the target; (5) In this paper, we go beyond previous works by provid- transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types. The rest of our paper is organized as follows: Sect. 2 • Research discusses related work. Sect. 3 discusses the three factors • Primary contacts: [email protected] and [email protected]. of variation. Sect. 4 details and validates the network archi- * Equal contribution. tectures that we use. Sect. 5 presents our transfer learning Manuscript submitted in March, 2021 to TPAMI. This work has been submitted to the IEEE for possible publication. Copyright may be transferred experiments. Sect. 6 presents a detailed analysis of our without notice, after which this version may no longer be accessible. results. Sect. 7 concludes our paper. FACTORS OF POSITIVE TRANSFER 2

Fig. 1: We explore transfer learning across a wide variety of image domain and task types. We show here example images for the 20 datasets we consider, highlighting their visual diversity. We grouped them into manually defined image domains: consumer in orange, driving in green, synthetic in red, aerial in purple, underwater in blue, close-ups in yellow, and indoor in magenta.

2 RELATED WORK ferent cameras [87]. Recent works explored larger domain We review here related work on transfer learning in com- shifts, e.g. across clipart, web-shopping products, consumer puter vision. We take a rather broad definition of this photos and artist paintings [105], or across simple synthetic term, and include several families of works that transfer renderings and real consumer photos [57], [76], and even knowledge from a task to another, even though they are not fusing multiple datasets spanning several modalities such as explicitly positioned as ‘transfer learning’. We pay particular hand-drawn sketeches, synthetic renderings and consumer attention to their experimental setting and analyses, as this photos [124]. Despite this substantial variation in domain, is the aspect most related to our work. most works consider only object classification tasks, with Domain adaptation. This family of works adapts image relatively few papers tackling semantic segmentation [40], classifiers from a source domain to a target domain, typi- [62], [76], [98], [101], [119]. cally containing the same classes but appearing in different Few-shot learning. Works in this area aim to learn to kind of images. Some of the most common techniques are classify new target classes from very few examples, typically minimizing the discrepancy in feature distributions [32], 1-10, by transferring knowledge from source classes with [66], [87], [96], [102], [105], [124], embedding domain align- many training samples. Typically the source and target ment layers into a network [13], [68], [84], directly transform classes come from the same dataset, hence there is no images from the target to the source domain [9], [86], [88]. domain shift. There are two main approaches: metric-based Earlier works consider relatively small domain shifts, e.g. and optimization-based. Optimization-based methods em- where the source and target datasets are captured by dif- ploy meta-learning [28], [75], [79], [80], [112]. These methods FACTORS OF POSITIVE TRANSFER 3 train a model such that it can be later adapted to new classes large source dataset works well for all target tasks [54], with few update steps. MAML [28] is the most prominent [122]. Others instead try to automatically find a subset of the example of this line of work. Despite its success, [79] later source dataset that transfers best to a given target [33], [74], showed that its performance largely stems from learning [77], [117]. The subsets correspond either to subtrees of the a generic feature embedding, rather than a model that source class hierarchy [74], [77], or are derived by clustering can adapt to new data faster. Metric-based methods [52], source images by appearance [117]. After selecting a subset, [69], [93], [99], [106] aim at learning an embedding space they retrieve its corresponding pre-trained source model which allows to classify examples of any class using a and fine-tune only that one on the target dataset. distance-based classifier, e.g. nearest neighbor [93]. Overall, Taskonomy [121] explores a converse scenario. They the community seems to be reaching a consensus [15], [34], perform transfer learning across many task types, including [41], [78], [95], [100]: the key ingredient to high-performing several dense prediction ones, e.g. surface normal estima- few-shot classification is learning a general representation, tion and semantic segmentation. However, there is only one rather than sophisticated algorithms for adapting to the new image domain (indoor scenes) as all the tasks are defined classes. In line with these works, we study what repre- on the same images (one dataset). This convenient setting sentation is suitable for solving a target task. In contrast enables them to study which source task type helps the most with these works, our focus is on more complex structured which target task type, by exhaustively trying out all pairs. prediction. As an powerful outcome, they derive a taxonomy of task While most few-shot learning works focus on classifica- types, with directed edges weighted by how much a task tion, there are a few exceptions, e.g. for object detection [108], helps another. [109], [113] and semantic segmentation [89]. These works Finally, several other papers perform transfer learning follow a similar setup with source and targets coming from experiments on more complex tasks than image classifi- the same dataset. We also focus on structured prediction cation as a way to validate their proposed method, but tasks, however, we tackle more realistic scenarios, where a typically only for a few source-target dataset pairs, and only model is adapted to new datasets, appearance domains, and within the same task type (e.g. from Open Images [58] to task types. CityScapes [18] for instance segmentation [117]; and across Transfer learning. A common practice in computer vision pairs of ILSVRC’12 [4], COCO [65], Pascal VOC [2] for object is to start from a neural network trained for image classifica- detection [103]). tion on ILSVRC’12 [4] as a generic source (Sect. 1). Because Other related work. Finally we discuss two other directions of its success, this strategy is the starting point in all our of related work. Experimenting over a collection of datasets experiments, and should be seen as the baseline to beat for is getting more common, e.g. for robust vision approaches [3], any transfer learning method. [60], [81]. However, the general aim for robust methods is While many works simply apply this strategy as a to learn a single model which performs well across several practical ‘trick of the trade’, several papers explore transfer datasets for one task type. learning in more depth, attempting to understand when it A difficulty for sequential learning of neural networks, works, to find even better source datasets than ILSVRC’12, is the tendency of catastrophic forgetting [21], [51], [91]. In or to propose more sophisticated transfer techniques. The our transfer learning setup, the new target model might typical setting considers a very large source dataset: besides have forgotten the old source tasks. However, forgetting ILSVRC’12 [23], [33], [54], [71], [74], also ImageNet21k with is not relevant in our study, since we are interested in 9M images [23], [54], [71], [77], [122], Open Images with performance on the target task only. Moreover, we analyse 1.7M images [117], Places-205 with 2.5M images [33], and in which conditions the target task can successfully re-use even JFT-300M with 300M images [23], [54], [71], [74], [77]. the knowledge from the source task. Several works [23], [54], [71], [122] report experiments on target datasets spanning different image domains, especially 3 FACTORSOF INFLUENCE since the advent of the VTAB suite [122] which assembles We study transfer learning at scale across three factors of datasets captured using a standard camera, as well as re- influence: the difference in image domain between source mote sensing, medical, and synthetic ones. Yet, each paper and target tasks (Sect. 3.2), their difference in task type considers little variation in the source dataset, experiment- (Sect. 3.3), and the size of the source and target training sets ing with 1-3 sources overall, and sometimes picking just (Sect. 3.4). To study these factors of influence, we introduce one for each target dataset [33]. More importantly, the vast a collection of 20 datasets in Sect. 3.5, which we have chosen majority of reported experiments are on image classification, to cover these factors well. for both source and target datasets. Note how VTAB down- grades some structured tasks to simpler versions that can be expressed as classification, e.g. predict the depth of the 3.1 Transfer Learning through pre-training closest object to the camera, as opposed to a depth value for This paper explores transfer learning from a source task each object or . to a target task. We define a task as the combination of a In terms of method, most works follow the classical task type (e.g. object detection, semantic segmentation) and protocol of learning a feature representation from the source a dataset (e.g. COCO). We follow the widespread practice dataset, then fine-tune with a new task head on each target of initializing the backbones of our networks with the dataset in turn. However, they differ in how they select weights obtained from image classification pre-training on source images for a given target. Some aim to demonstrate ILSVRC’12 [7], [17], [35], [38], [42], [55], [90], [126]. This leads that a single generic representation trained on the whole to the following process: FACTORS OF POSITIVE TRANSFER 4

1) we train a model on ILSVRC’12 classification. on these tasks yields spatially sensitive features, we hypoth- 2) we copy the weights of the backbone of the ILSVRC’12 esise that these tasks could benefit each other. We consider classification model to the source model. We randomly four task types: initialize the head of the source model, which is specific 1) semantic segmentation: predict the class label for each to the task type. pixel in the image. 3) we train on the source task. 2) object detection: predict tight bounding boxes around 4) we copy the weights of the backbone of the source objects and predict their class labels. model to the target model. Again, we randomly ini- 3) keypoint detection: detect the image location of body tialize its head. joints or parts, for the human and dog classes in the 5) we train on the target training set. datasets we consider. 6) we evaluate on the target validation set. 4) depth estimation: predict for each pixel the distance from This protocol essentially defines a transfer chain: the camera to the physical surface (from a single image). ILSVRC’12 → source task → target task. We compare These four task types significantly vary in nature: semantic these transfer chains to the default practice: ILSVRC’12 → segmentation and depth estimation are pixel-wise predic- target task. To ensure fair comparisons, we have one set of tion tasks, whereas keypoint detection requires identifying ILSVRC’12 pre-trained weights which are use throughout all sparse (but related) points in an image, and object detection experiments in this paper. Analogously, for each source task requires predicting bounding boxes. Hence, while these we create one set of pre-trained weights used throughout all tasks are all about spatial localisation, their different nature our experiments. makes it interesting to study the influence of transferring from one task type to another. 3.2 Image Domain 3.4 Dataset Size We want to study transfer learning across a wide range of different image domains. For this we considered many The size of the target training set is important (e.g. [54]). publicly available datasets and manually selected the fol- When a target dataset is very large, the effect of transfer lowing domain types: consumer photos, driving, indoor, aerial, learning is likely to be minimal: all the required visual underwater, close-ups, and synthetic We deliberately only use knowledge can be gathered directly from this target dataset. domains from RGB image sensors, and exclude imagery Therefore we consider two transfer learning settings, one from other sensors, such as CT scans, multi-spectral im- using only a small number of target images and one using agery, or lidar, because it is unclear how knowledge could the full target dataset. We believe that using a small target be re-used when transferring across datasets acquired by training set is most relevant in practice, given that we often different sensors. want to train strong models from a small set of annotated To measure the difference between source and target images. domains, one way is to use these manually defined domain We also perform some experiments varying the size types. This results in a simple binary same or different mea- of the source training set. A source model trained on a sure. We also use a continuous similarity metric obtained larger dataset is likely to be more beneficial for transfer directly from the visual appearance of the source and target learning [42], [67], [97]. Hence, in some of our experiments datasets, to offer a more fine grained metric. In particular, we limit all source sets to have a maximum number of we extract image features from a backbone based on our images, sampled uniformly from the dataset. This allows multi-source semantic segmentation model (Sect. 5.1), where to study the influence of source domain versus source size. we attach a spatial average pooling layer to the backbone, resulting in an 720 dimensional image vector. The domain 3.5 Dataset collection difference is then the average distance of a target image to its closest source image: To study these factors of influence on transfer learning we selected 20 public available datasets annotated for one (or 1 X   more) of our task types, ensuring to span various very D(T |S) = min(d(ft, fs) (1) |T | s different image domains and with a large variety in dataset t size. An overview of the datasets and their task types which where, | · | denotes the cardinality, f denotes an image we use is given in Tab. 1. The variety in visual appearance feature vector, and for d(·, ·) we use the Euclidean dis- is illustrated in Fig. 1. tance. From each dataset we sample 1000 images to com- While we consider the annotations listed in Tab. 1, pute D(T |S). The same images are used irrespective of some datasets have additional annotation types available. whether the dataset is used as target or source. Due to the For example, CityScapes [18] also provides disparity and min operation in Eq. (1) this is an asymmetric measure, Berkeley Deep Drive [118] also includes road segmentation i.e. D(A|B) 6= D(B|A). This measure enables more fine- annotations. grained analysis than using our manually defined domains. In our experiments, each dataset plays both roles of source and target, in different experiments in turn. This allows us to extensively study the aforementioned factors of 3.3 Task type influence for transfer learning over a wide range of image In this study we focus on structured prediction tasks, which domains, both within task types and across task types, and all involve some form of spatial localization. Since training for training sets of various size. FACTORS OF POSITIVE TRANSFER 5

Nr Name Reference Domain Semantic Detection Keypoints Depth # Train # Val 1 Pascal Context [70] Consumer 60 - - - 5K 5K 2 Pascal VOC [27] Consumer 22 20 - - 10K 1449 a 3 MPII Human Pose [6] Consumer - - 16 - 17K 7K 4 ADE20K [125] Consumer 150 - - - 20K 2K 5 COCO Panoptic [12], [50], [65] Consumer 134 80 17 - 118K 5K 6 KITTI [5] Driving 30 - - - 150 50 7 CamVid [10] Driving 23 - - - 367 101 8 CityScapes [18] Driving 33 - - - 3K 500 9 India Driving Dataset (IDD) [104] Driving 35 - - - 7K 981 10 Berkeley Deep Drive (BDD) [118] Driving 20 10 - - 7K 3K b 11 Mapillary Vista Dataset [72] Driving 66 - - - 18K 2K 12 ISPRS [83] Aerial 6 - - - 4K 437 c 13 iSAID [110], [114] Aerial 16 - - - 27K 9K c 14 SUN RGB-D [94] Indoor 37 - - X 5K 425 d 15 ScanNet [20] Indoor 41 - - - 19K 5K 16 SUIM [44] Underwater 8 - - - 1525 110 17 Underwater Trash (UWT) [30] Underwater - 3 - - 6K 1144 18 Stanford Dogs [8], [48] Close-ups - - 20 - 7K 5K 19 vKITTI2 [11], [31] Synthetic (driving) 9 - - X 43K 7K 20 vGallery [111] Synthetic (indoor) 8 - - X 44K 12K ILSVRC’12/ImageNet [22], [85] Websearch 1,000 classes for classification 1.2M 50K a From the Pascal VOC datasets we use VOC2012 for semantic segmentation and VOC2007 for object detection. b For object detection BDD100K is used. c iSAID and ISPRS consist of few very high resolution images. Using the procedure of the iSAID devkit[110], we split each high-res image into multiple partially overlapping low-res images (800 × 800), which we use for training and evaluation. Train and validation images originate from different high-res images and hence are not overlapping in content. d The SUN RBG-D dataset contains imagery from NYU depth v2 [92], Berkeley B3DO [45], and SUN3D [115].

TABLE 1: Overview of the 20 datasets used in this paper. They vary in size, task types (we consider semantic segmentation, objct detection, keypoint detection, depth estimation), number of classes, and image domain. We included datasets from consumer imagery, driving, aerial, indoor, underwater, close-ups, and synthetic domains. Some datasets (such as KTTI [5]) have more annotation types available, but are not used in this study.

4 SETTINGUP NETWORKSFOR TRANSFER Illumination normalization. For each dataset we normalize LEARNING the images so that each color channel has zero mean and standard deviation 1. This form of whitening minimizes In this section we describe in detail the network architecture illumination biases from the different datasets as much as we use for our transfer learning experiments. Three things possible. are important for our study: a common framework with shared data augmentation, a common backbone, and high Data augmentation. To unify the data augmentation across quality models specific to each task type. all four task types considered, we ran many experiments to estimate the importance of common data augmentation techniques. We found the following augmentations to al- 4.1 Data normalization and augmentation ways have a positive (or neutral) effect and thus use them During training various data normalization and augmen- in all experiments: (1) random horizontal flipping of the tation techniques are typically used [19], [56] which have image; (2) random rescaling of the image; and (3) taking a significant impact on model performance. We unify the data random crop. For object detection and keypoint detection, normalization and augmentation step, because it changes we consider only random crops fully containing at least the input of the network, i.e. it influences the low-level one object. As in [60], we found that the image scales that statistics of the input imagery. This allows to prevent varia- lead to the best performance are intrinsic to the dataset, tions in data normalization and augmentation from affecting and hence dataset-dependent. Since we see this (partly) as transfer learning performance. a property of the image domain, we optimized for each Data augmentation also influences what the model task the input resolution of the network. Overall, the input 420 × 420 learns. For example, applying large rotation transforma- resolutions during training range from for Pascal 713 × 713 512 × 1024 tions or anisotropic scaling makes the network invariant to Context to for SUN RGB-D to for those aspects, which may be beneficial for some tasks but CityScapes, where the latter is a landscape format common detrimental for others, e.g. full rotation invariant networks to most driving datasets. We always evaluate at a single are not able to distinguish 6 from 9. In this paper we scale and resize each image such that one side matches the do not want data normalization and augmentation to be input resolution used for that task. confounding factors in transfer learning experiments. Hence After careful consideration, we decided not to use image we aim to keep data normalization and augmentation as rotation, varying image aspect ratio, or any form of color simple as possible without compromising on accuracy, and augmentation. Image rotation is incompatible with object de- apply the same protocol to all experiments. tection which (generally) assumes axis-aligned ground-truth FACTORS OF POSITIVE TRANSFER 6

HR-Net Classification 128 2-strided 3x3 C

H/4 x W/4 x 15C x W/4 x H/4 256 256 Add & ⨁ 2-strided 3x3 2C 512 512 Add & 1 x NrClass

H x W x 3 ⨁ 4C 2-strided 3x3 Avg Linear Soft 1024 1024 ⨁ 2048 8C Upsample Add & 1x1 Pool Classifier Max & Concat (a) Fig. 2: Schematic illustration of the HRNet backbone. Each Illustration of HR-Net backbone: Every row is half the resolution of the row below, purple blocks indicate a series of convolutionalhorizontal layers, connection layer between depicts blocks a include resolution strided convolutions and eachand upsampling block to the rep- respective Description Top-1 Acc dimensions.resents Each row a doubles series the number of residual of channels. We units. use C=48 The channels, blocks resulting in are an output fully of H/4 x W/4 x 720. ResNet-50 [107] 76.9 connected using all resolutions with either downscaling or HRNetV2-W44 [107] 77.0 upsampling. The final representation is obtained by concate- nating all upsampled layers to H/4 × W/4. ours HRNetV2-W48 79.5 (b) Backbone Depth Semseg Detection Keypoints Fig. 3: Illustration of the classification head architecture (3a) 69M 721 14K 2.4M 10M and table with top-1 accuracies for different networks and TABLE 2: Number of parameters in the backbone and the backbones on the ILSVRC’12 validation set (3b). respective task heads. These values refer to semantic seg- mentation and object detection with 20 classes, and 17 for Architecture. The HRNetV2 backbone extends the ResNet keypoint detection. The backbone forms the majority of the architecture to preserve high-resolution spatial features. A neural network and contains visual knowledge which could regular ResNet consists of blocks organised in four ‘stages’, be shared to new target tasks. each reducing the image resolution by factor 2. HRNetV2 follows this design, but after each stage it also keeps a par- boxes. Moreover, we were able to reproduce the semantic allel high-resolution branch (Fig. 2). All branches from one segmentation performance of [60] even without rotation stage are fed into the next, so this next stage incorporates in- augmentation (see Sect. 4.3). Varying image aspect ratio did formation from the representations at different resolutions. not yield positive effects on semantic segmentation, and The backbone produces 4 feature maps, each with different also such augmentation is uncommon for object detection resolution and number of channels. These are combined into and keypoint detection. Color augmentation random changes a single feature map by upscaling and concatenating them in hue, contrast, saturation, and brightness did not yield along the channel dimension. This results in a final output positive effects for us on semantic segmentation and object feature map of shape W/4 × H/4 × 15C, where W and H detection, while substantially slowing down training. are the original image width and height, and C = 48 is the Our Setup. To conclude, we consistently apply a limited number of channels in the highest resolution output layer. number of data augmentation techniques across all experi- Hence the final feature map has 15 × 48 = 720 channels. ments. The only difference across experiments is the input Training. In order to validate our backbone architecture resolution, which is fixed per-dataset and is thus consistent we use image classification on ILSVRC’12. We extend the with the image domain factor of variation we want to backbone with the classification head proposed in [107] study (Sect. 3.2). This uniform protocol effectively cancels (Fig. 3a). This head increases the number of channels, while out varying data augmentation as a potential causal factor reducing the resolution to H/32 × W/32, and then uses an influencing the performance of transfer learning. average pooling layer with a linear classifier. The network is trained for 200 epochs, with SGD using momentum and 4.2 Backbone Architecture a large batch-size of 4096, learning-rate decay after 60, 120, and 180 epochs. Transfer learning through pre-training is only possible when Evaluation. Performance is evaluated using Top-1 accuracy the models for all task types share the same backbone ar- on the ILSVRC’12 validation set. chitecture. To have meaningful results, we need to choose a Results. In Fig. 3b we show the results of our network com- backbone which works well across all task types we explore. pared to the ResNet-50 and HRNetV2-W44 architectures Therefore we choose the recent high-resolution backbone from [107]. Our model performs competitively, reaching HRNetV2 [107]. It has two main advantages. First, HRNetV2 79.5% Top-1 accuracy, validating our choice of architecture. was shown to outperform ResNet [39], ResNeXt [116], Wide ResNet [120], and stacked hourglass networks [73] on three of the tasks we explore: semantic segmentation, keypoint 4.3 Semantic Segmentation detection, and object detection. Second, its design allows to Architecture. For semantic segmentation, we adopt the use relative shallow task type specific heads compared to network head proposed in [107](Fig. 4a). It consists of three the number of parameters in the backbone. In Tab. 2 we stages: (1) a 1x1 convolution changing the dimensionality provide an overview of the number of trainable parameters. of the backbone output to the number of classes K. This The backbone consists of 69M parameters, making up 87%- implements a linear classifier at each pixel; (2) a softmax 99.99% of all parameters depending on the task type and the non-linearity; (3) a bi-linear up sampling layer to produce number of classes. the final predicted segmentation map H × W × K. Depth Per Pixel Soft Up H x W x 1 Prediction Plus Sample

FACTORS OF POSITIVE TRANSFER 7

Semantic Segmentation Object Detection

Per Pixel Soft Up H x W x NrC H/4 x W/4 x NrC Heatmap 3x3 Max Classifier Max Sample Prediction Pool

(a) Feature Offset H/4 x W/4 x 2 Adapter Prediction NMS

Dataset MSEG [60] Ours Box Predictions Width/Height H/4 x W/4 x 2 BDD 63.2 65.0 Prediction CityScapes 77.6 75.7 COCO Panoptic 52.6 54.0 (a) Pascal Context 46.0 47.1 NMS is really helpful for Pascal, but not for COCO and BDD. ScanNet 62.2 61.9 Description→ Maybe because in Pascal larger objects are detected MAP (wrt COCO and BDD). → Or because the number of boxes per image which is lower by Pascal ResNet-101 - CenterNet [126] 34.6 (b) HourglassFeature adapter: ResNet-104 1x1 conv - CenterNet to align the [126 feature] pyramid 40.3from the Fig. 4: Semantic segmentation model and performance. 4a Hourglassbackbone. ResNet-104 Followed by MS+flip regular 3x3 - CenterNet conv + non-linearity [126] 45.1 + BN. shows an illustration of the head architecture, where pur- HRNetV2-W48-pyramid - CenterNet [107] 43.4 ple blocks indicate trainable parameters (in total (720 + Box Predictions: 100 boxes per image. Maxima of 1) ∗ NrC) and blue blocks have no trainable parameters. oursclass-specificHRNetV2-W48 heatmaps - CenterNet are used as centerpoints 39.0 with their respective offsets and width/height 4b compares our performance against that of the MSEG (b) setup [60] on several datasets of the MSEG collection. Our setup is validated by the fact we perform on par with [60]. Fig. 5: Object detection model and performance. 5a illus- trates our CenterNet-based [126] head. 5b compares the performance to recent works. Our model perfors close to the Training. The network is trained using the cross-entropy CenterNet Hourglass ResNet-14 [126] under the same con- loss at each pixel. Starting from ILSVRC’12 pre-trained ditions: no data augmentation during evaluation (neither weights, we train on the source dataset using SGD with Multi-Scale (MS) nor horizontal flipping (flip)). This demon- momentum. We use multiple Google Cloud TPU-v3-8 ac- strates that we base our transfer learning experiments on a celerators using synchronized batch norm and a batch size strong modern object detection framework. of 32. For almost all sources datasets we use a stepwise learning rate decay, and optimize per source the starting learning rate, the number of steps after which the learning each class, a binary classifier evaluates whether this pixel is rate is lowered (by a fixed factor of 10), and the total number the center point of an object bounding box. Additionally, at of training steps. However, we found performance to be these points we predict a translation correcting offset (in x- unstable for the small SUIM and KITTI datasets. For these and y- coordinates) and the size of the bounding box (width we found that switching to a “” learning rate policy and height). Using center points to predict bounding boxes curr step p results in a simpler model than a two-stage architecture like (base lr × (1 − max steps ) , p = 0.9) stabilizes training, as also done in [14], [60]. Faster-RCNN [82], while being about as accurate [24], [126]. Evaluation. Performance is evaluated by the Intersection- Following [107], [126] we add a feature adapter (neck) over-Union (IoU), averaged over classes [26]. During evalu- between the backbone and the head. This neck consists ation we process complete images at a single resolution only. of 1x1 convolution followed by a 3x3 convolution, batch- We scale each image to match the resolution used during normalization [43] and ReLu. The detection CenterNet head training. outputs three prediction maps, each is the result of a 3x3 Results. To validate our training setup, we compare the convolution, ReLu, and another 3x3 convolution. The three performance of our networks on a subset of the MSEG pixel-wise prediction maps are: dataset collection [60]. While these datasets are also part of M1a K-channel heatmap, one per class. Each pixel is our collection, for this experiment we use the annotation assigned the likelihood to be the center on an object provided in MSEG, which has fewer classes than in our of a certain class. transfer setting setup (c.f. Sect. 5). As Fig. 4b shows, our M2 a two-channel offset map representing x- and y-offsets models perform on par with those of [60], which also use a of the point to the real center of the bounding box; and HRNetV2 backbone and linear semantic segmentation layer, M3 a two-channel width/height map, which predicts the but have different normalisation and data augmentation width and height of an object centered at that pixel. steps and a different implementation. Following [126] the final box predictions are obtained by combining the top T = 100 highest scoring pixels from the 4.4 Object Detection class-specific heatmaps, with their respective x/y offsets to Architecture. We follow the CenterNet1 approach of Zhou define the box-centers. The box dimensions are extracted et al. [126](Fig. 5a). For each pixel in the feature map and for from the width/height prediction map. Then, optionally, non-maximum suppression (NMS) is applied. 1. With CenterNet we denote a family of object detection models Training. For training we use a different loss for each which predicts boxes via their center points with fully convolutational head [126]: architectures [24], [126]. The paper of [24] is called CenterNet while the GitHub of [126] is https://github.com/xingyizhou/CenterNet. L1 a logistic regression focal loss [61], [64] for the class FACTORS OF POSITIVE TRANSFER 8

Keypoint Detection We do not use a neck, but directly feed the feature maps Box: centers produced by the backbone into these heads. The maps Box: offsets H/4 x W/4 x 2 output by the first three heads are analogous to the object Box: Width/Height H/4 x W/4 x 2 detection ones, but for a single class only, to detect the KP: Heatmap H/4 x W/4 x 2 box around the person (or dog for another dataset). The KP: Offsets H/4 x W/4 x K+1 remaining three maps are: KP: Initial location H/4 x W/4 x 2 M4a K-channel keypoint heatmap, where each channel is

H/4 x W/4 x 2K KeyPoint Prediction the output of a binary classifier predicting one of the K keypoints; M5 a keypoint offset map: a two-channel x/y offset map for (a) a translation correction of the keypoint; M6 a keypoint allocation map: a 2 × K-channel map mark- Description AP50 ing the displacement from the center of an object at Hourglass-104 - CenterNet - single stage [126] 84.2 this pixel to each of the K keypoints belonging to that HRNetV1-W44 - two stage cascade [107] 90.8 object. ours HRNetV2-W48 - CenterNet - single stage 81.1 We can obtain keypoints for a image by processing these maps: First, a bounding box around the person/dog (b) is obtained by combining the box centers (M1), with the Fig. 6: Keypoint Detection model and performance. 6a il- offsets (M2) and the box dimensions (M3). Then, M4 and lustrates our model head architecture. 6b provides a per- M5 are combined to create a set of candidate keypoint formance comparison. The two stage cascade model first locations with accompanying confidence scores, within the detects persons, crops out their image pixels, and uses person/dog bounding box. Finally, M6 is used to associate a separate network to predict keypoints. We follow the each candidate keypoint to the person/dog object it belongs simpler single stage approach used in e.g. [126]. to. Training. For M1-M3 we use the same losses as for object detection. For M4-M6 we add the following [126]: heatmap, acting on all pixels; L4 a focal loss [61], [64] is used for multi-class joint classi- L2 a L1-loss for offset prediction, acting only on pixels that fication, similar to M1 in the object detection training; arecenter points of objects in the ground-truth; L5 a L1 loss for offset prediction, similar to M2 in the object L3 a L1-loss for width and height prediction, again acting detection training; only on ground-truth centers. L6 a L1 loss is used to directly regress to the initial key- We train using the same procedure as in Sect. 4.3, but using point locations. a batch size of 128 and the Adam optimizer [49] until We largely follow the training setup for object detection, convergence. using a batch size of 128 and the Adam optimizer [49]. Evaluation. We use the COCO definition of mean Average We set the starting learning rate and number of steps per Precision (mAP) [65]. They evaluate mAP at various IoU dataset. thresholds and average them. For COCO we do not use Non Evaluation. We report the mean Average Precision at 0.5 Maximum Suppression, since this did not improve results Object Keypoint Similarity (OKS), as defined by the COCO on this dataset as also observed by [126] (on other datasets challenge [1]. NMS was useful, see details in Appendix A). Results. To validate our setup we train for keypoint Results. We validate our setup by training on the COCO prediction on the COCO dataset. In Fig. 6b we report training set and evaluating performance on its validation the performance of our network architecture and compare set. We compare to results reported in [126] and for a to [126]. They use as backbone a stacked hourglass [73] HRNetV2-W48 backbone with a feature pyramid using the consisting of two consecutive hourglass models, which they CenterNet variant [107]. call Hourglass-104. Their model yields 84.2 AP50.2 For As can be seen in Fig. 5b, the best results are obtained completeness, we include the results of a more complex by using use data augmentation during evaluation (multi- two stage cascade keypoint detection approach [107], which scale and horizontal flip) or by using a feature pyramid. obtains very good performance. In this paper we use the Without these enhancements, our implementation performs HRNetV2 backbone with the simpler single-stage CenterNet close to the Hourglass ResNet-104 results of [126] (39.0 mAP heads. Our model yields 81.1 mAP, which is near [126] and and 40.3 mAP respectively). Hence we conclude that we can thus strong enough for our transfer learning exploration. base our transfer learning experiments on a strong, modern object detection framework. 4.6 Depth Estimation Architecture. For monocular depth estimation, we modify 4.5 Keypoint Detection the architecture for semantic segmentation. We propose to Architecture. We again follow the CenterNet approach add a single regression layer on top of the output of the of [126](Fig. 6a). The keypoint head architecture resembles the object detection architecture, yet with some differences. 2. Result obtained via correspondence with the authors: The public available model in the zoo obtaining 64.0 AP50 on coco-test, obtains We create 6 pixelwise prediction heads, each composed 84.2 mAP on coco-val, when evaluated without flipping, see: of a 3x3 convolution, ReLu, and another 3x3 convolution. https://github.com/xingyizhou/CenterNet FACTORS OF POSITIVE TRANSFER 9

Depth Estimation 5.1 Setup Per Pixel Soft Up H x W x 1 Training target models. A target model T for task results Prediction Plus Sample from training the network architecture for a specific task type on the training data of a specific target dataset. We (a) train a target model via transfer learning by fine-tuning from a given source model S. More specifically, we copy the Description RMSE (↓) δ < 1.25(↑) weights of S into the backbone component of T , randomly ResNet DORN [29] .509 82.8 initialise the head components, and fine-tune on the target ResNet D-DFN [16] .528 86.6 training set. We always use a chain of transfer, ILSVRC’12 → source → target, since all our source models are obtained ours HRNetV2-W48 .493 82.5 after pre-training on ILSVRC’12. We re-use most of the training setup from training the (b) source model (Sect. 4) for the target model. For the small Fig. 7: Monocular Depth Estimation. 7a illustrates the task target training set experiments, we lowered both the number head architecture. 7b shows a comparison of performance of total training steps and steps after which we lower the on NYUDepth- V2 [92] dataset using a RMSE and δ < 1.25 learning rate. Furthermore, since the source task may be accuracy as performance measures. While our architecture similar to the target task, the starting weights may already features a light-weighted head, it still outperforms depth- be close to what is desirable. Therefore we also train using specific architectures in terms of RMSE and is on par with two additional lower initial learning rates. For the full δ < 1.25. target training setting the learning rates and learning rate schedules are the same as for the source models, which we backbone, followed by a soft-plus layer and a bi-linear found to bring optimal accuracy. For both the small target upsampling layer to match the original resolution (Fig. 7a). set experiments and full target set experiments we report The softplus activation converts the logit value to depth: the best result over the different learning rates. log(exp(x) + 1) and is also used in [36], [127]. This is a Evaluation. We are interested in the gain obtained from differentiable clipping function which ensures that depth fine-tuning from a specific source model compared to fine- predictions are positive. tuning from ILSVRC’12 image classification (the default in Training. To train the network, we combine two losses the community). The core question is: are additional gains following [16], [63]: (1) a L1-loss between the predicted possible, by picking a good source? We therefore look at the depth map and the ground truth map at each pixel; (2) a relative transfer gain: smoothness loss which encourages differences in neighbour-  m(T |S)  ing depth pixels to be the same in the predicted depth map r(T |S) = − 1 ∗ 100 (2) m(T |ILSVRC) and the ground-truth. More precisely, we take the derivative in x-direction for both the predicted and ground-truth maps, where m denotes a metric specific for the task type (e.g. and compare them with an L1-loss. We repeat this for the y- mean-IoU for semantic segmentation), which is evaluated direction. for all tasks on complete images at a single resolution only. For training we use a batch size of 64 and the Adam Since for depth estimation lower values of m mean better optimizer [49]. performance, we multiply r by −1 in that case. Evaluation. We validate on the NYUDepthV2 [92] dataset, This notion of gain is similar in spirit to the one defined using the root mean squared error (RMSE) metric and the in Taskonomy [121]. However their gain is the percentage of zˆ z δ < 1.25 accuracy, where δ = max( z , zˆ) is a measure of evaluation images for which the transfer model outperforms relative accuracy defined in [59]. a model trained from scratch. Instead, our relative transfer Results. We compare our proposed network to two re- gain metric refers to a much stronger baseline: a model fine- cent ResNet based models: [29] which uses depth-specific tuned from ILSVRC’12. Moreover, we evaluate how much losses based on ordinal regression; and [16] which uses the better the target model becomes in terms of a standard metric same losses as we do, but with a depth-specific network specific to each task type, averaged over all evaluation architecture on top of the ResNet backbone. The results images. are presented in Fig. 7b. We observe that our model is To facilitate the discussion in Sect. 6, we distinguish four slightly better than both others, when measured using the levels of transfer gains: root mean squared error (RMSE), while slightly behind on VP Very positive transfer effect, when r > 10; the δ < 1.25 measure. Hence, we conclude that our light- P Positive transfer effect, when r > 2; weight depth prediction head is well suited for monocular I Insignificant transfer effect, when 2 ≤ r > −2; depth estimation. N Negative transfer effect when r < −2. Multi-Source Training. Inspired by the success of using 5 TRANSFERLEARNINGEXPERIMENTS generic visual representations for various computer vision In this section we describe our transfer learning experi- tasks [53], [54], [60], we include a multi-source model in our ments. We mainly conduct experiments in two settings: experiments. transfer learning with a small target training set and with We train a multi-source model for a specific task type the full target set. The analysis of the results across all based on several datasets. For each dataset a separate head experiments will be discussed in Sect. 6. is attached to a single, common backbone: FACTORS OF POSITIVE TRANSFER 10

Underwater Mapillary Aerial PContext Consumer Driving Indoor Synthetic ScanNet vGallery ADE20K CamVid Scratch vKITTI2 18K

• 19K 44K 20K COCO 5K 367 ISPRS 42K 120K semantic segmentation is trained across iSAID, COCO, PVOC iSAID 4.0K SUIM KITTI Multi 1.5K 10K 28K 3.0K BDD 150 SUN City IDD 7K 5K Mapillary, ScanNet, SUIM, vGallery, and vKITTI2; 7K • depth estimation is trained across all three depth ISPRS - 19 9 16 16 22 15 -5 11 17 14 19 -4 -1 7 2 5 32 -72 datasets we consider: SUN RGB-D, vGallery, and iSAID -16 - -7 -12 -12 -8 -13 -20 -16 -9 -12 -9 -14 -12 -8 -20 -18 - -52 vKITTI2; PContext -15 -7 - 27 0 40 -7 -11 -11 -9 -12 -6 -19 -29 -12 -11 -8 32 -80 • object detection is trained across all four object detec- PVOC -23 -8 29 - -13 31 -9 -13 -12 -12 -12 -12 -26 -31 -12 -12 -11 26 -84

tion datasets: COCO, BDD, Pascal VOC, and Underwa- ADE20K -4 1 -3 -8 - 13 -2 -3 -2 -3 -6 -1 -5 -11 -6 -1 -4 14 -67

ter Trash. COCO -7 -3 -4 -5 -7 - -3 -4 -6 -5 -7 -7 -10 -15 -6 -4 -3 - -77

The multi-source models are trained using dataset inter- KITTI 1 2 3 2 6 6 - 2 11 8 6 10 -6 -3 -1 1 2 9 -21 leaving at the batch level, i.e. each batch is sampled from CamVid 3 1 4 2 11 2 9 - 13 9 11 17 4 3 -1 -1 3 12 -18 another dataset, and each head classifies only the images City -5 -2 0 -2 2 4 -1 -1 - 9 6 17 -6 -9 -2 -2 -3 15 -41 from that particular dataset. When doing transfer learning, IDD -5 -1 1 0 1 6 -1 -4 4 - 3 12 -5 -9 -2 0 -4 10 -44 To: Segmentation only the weights of the common backbone are used as a BDD -8 -2 1 1 7 12 1 2 10 13 - 29 -8 -12 -3 -2 -5 21 -41 source model. Mapillary -7 -1 1 -1 4 11 -3 -4 9 10 7 - -10 -12 -3 -3 -4 - -60

SUN -16 -6 10 -1 55 60 -1 -11 -11 -12 -10 -11 - 61 0 -10 -7 73 -66

5.2 Transfer learning with a small target training set ScanNet -21 -6 3 -2 31 56 -4 -5 -7 -6 -7 -5 71 - -6 -6 -6 - -62

SUIM -8 1 -1 1 0 4 -1 -5 -4 -1 -2 -6 -8 -10 - 0 -3 - -46 The first setting we study is transfer learning with a small vKITTI2 -1 0 -1 -1 1 0 0 1 2 2 1 2 -1 -2 0 - 1 - -16 target training set, in which we limit the number of an- notated examples are available for training. We deem this vGallery -1 0 0 -1 1 1 0 1 0 1 0 1 0 0 -1 0 - - -10 setting as the most challenging and practically relevant: (a)

Depth Detection St. Dogs Keypoints vGallery Scratch obtain good performance for a structured prediction task vKITTI2 12K 44K COCO COCO 42K 120K 120K Multi UWT BDD SUN 70K MPII 25K 6K by annotating only a few images. Transfer learning is par- 5K ticularly relevant in such a low-data regime. iSAID -13 -35 -21 -29 -40 -28 -21 -17 -13 -36 -52 Concretely, we limit the number of target training images to 150 per dataset (this is less than 3% of the available PContext -16 -40 -15 -4 -47 -39 -7 -14 -6 -37 -80 train data for 14 out of the 20 datasets). For COCO and COCO -8 -14 -6 -2 -44 -24 -5 -8 -5 -36 -77 ADE20k we make an exception and use 1000 images, which is still less than 1% of the available train data for COCO KITTI -3 -6 -4 -4 3 -6 -1 -3 -3 -8 -21 and 5% for ADE20K. The reason is that these datasets have CamVid -4 0 4 -1 10 -1 4 -13 -4 -11 -18 a large number of classes following a long-tail distribution (COCO has 134 classes for segmentation, and ADE20K has IDD -4 -9 -5 -6 -3 -13 -2 -6 -3 -11 -44 To: Segmentation 150). Just 150 images did not properly cover all classes: In SUN 8 -39 -12 -9 -34 -31 -3 -10 -4 -28 -66 all experiments we use a seeded selection, such that for each dataset all models are fine-tuned using the exact same ScanNet 13 -38 -13 -18 -36 -27 -11 -9 -3 -27 -62 training samples. SUIM -8 -16 -4 -6 -19 -9 -3 -5 -1 -14 -46 Experiment. Our results are shown in Tab. 3 (all relative transfer gains). Positive transfer gains are in green, nega- (b) Mapillary PContext

Detection Depth St. Dogs Keypoints Segmentation vKITTI2 vKITTI2 VOC07 18K 12K COCO COCO COCO 5K 42K 42K 120K 120K 120K SUIM Multi Multi 1.5K 5K UWT BDD BDD SUN SUN 70K MPII 25K 6K 7K tive transfer gains are in purple, no gains are represented 5K 5K by white. For ease of presentation, we group results by VOC07 - -14 -8 207 -49 -74 -45 -15 -11 0 -21 4 -38 -28 -22 -26 25 8 the target task type, which leads to multiple . Then UWT 6 - -12 52 -18 -29 -17 -5 -8 9 -9 -11 -6 -12 -14 -16 -8 -9 we group by the source task type (two separate tables for semantic segmentation, blue vertical line in other ex- BDD 15 -1 - 94 -16 -23 -2 -4 -3 19 -3 5 -12 14 23 -4 -1 15 To: Detection periments). Then we group by image domain (marked in COCO 30 5 -5 - -26 -72 -14 -6 1 8 -6 -1 -11 -6 -4 -1 16 12 blue above in Tab. 3a). Finally, within each image domain (c) Keypoints Detection St. Dogs Depth Segmentation 12K we order sources by their size. For completeness, absolute COCO COCO COCO 120K 120K 120K Multi Multi Multi SUN MPII 25K performance table are available in Appendix A. We tried 5K to keep all experiments as homogenous as possible, while St. Dogs - 110 42 -25 -90 141 128 -33 -20 maintaining a wide scope of transfer across many diverse MPII -8 - 166 -30 -81 36 37 -31 -46

COCO -29 90 - -54 -98 39 39 -51 -63 domains and task types. This leads to two peculiarities, To: Keypoints which we clarify below. (d) Depth Detection Keypoints Segmentation St. Dogs vGallery vGallery Scratch vKITTI2 vKITTI2

First, in most experiments, the source training set and the 12K 44K 44K COCO COCO COCO 42K 42K 120K 120K 120K Multi Multi UWT BDD BDD SUN SUN 70K MPII 25K 6K 7K target training set are disjoint (i.e. always when transferring 5K 5K within a task type, and most of the time across task types). SUN - -9 2 1 -7 -6 0 1 0 -3 9 -1 4 -7 -3 7 -33

However, when transferring across task types, a few of the vKITTI2 0 - -1 -1 0 0 0 -2 -1 0 0 -1 0 0 -1 1 -16 To: Depth source-target pairs use (partly) the same training set, e.g. vGallery 0 -6 - -5 -3 -3 -5 -1 -3 -6 3 -1 -5 -2 1 3 -46 COCO Object Detection as source for COCO Keypoint De- tection in Tab. 3d or BDD Semantic Segmentation as source (e) for BDD Object Detection in Tab. 3c. In these experiments TABLE 3: Results for transfer learning in the small target the source models have been trained on all images of the training setting. Performance measured in relative improve- training dataset for the source task type, while the target ment gain over a model fine-tuned from ILSVRC’12. In Table (a) both the source and target task types are semantic segmentation. See main text for details. FACTORS OF POSITIVE TRANSFER 11

Underwater Underwater Mapillary Aerial PContext Consumer Driving Indoor Synthetic

Aerial Consumer Driving Mapillary Indoor Synthetic PContext ScanNet vGallery ADE20K ScanNet CamVid vGallery Scratch ADE20K vKITTI2 CamVid Scratch vKITTI2 18K 19K 44K 20K COCO 5K 367 ISPRS 42K 120K PVOC iSAID COCO ISPRS 4.0K SUIM PVOC KITTI Multi 1.5K iSAID 10K 28K 3.0K BDD SUIM 150 KITTI SUN City BDD IDD SUN 7K 5K City 7K IDD

ISPRS - 30 29 24 19 31 30 20 34 24 32 29 10 8 21 38 20 44 -68 ISPRS - 19 10 12 0 8 15 -5 -12 12 25 0 -20 -3 4 3 17 -72

iSAID 0 - -1 -1 -3 -3 0 -1 0 0 -1 0 -2 -5 -1 -1 0 1 -44 iSAID -18 - -4 -2 -10 -12 -13 -20 -14 -11 -14 -12 -9 -13 -7 -19 -18 -52

PContext -1 0 - 5 0 8 0 -1 0 -1 -1 -1 -2 -4 -1 0 0 5 -46 PContext -14 -4 - 12 -4 12 -7 -11 -8 -8 -11 -5 -13 -18 -10 -8 -5 -80

PVOC -1 -1 -1 - -1 1 0 0 -2 -1 -1 -1 -2 -2 -1 0 -1 0 -38 PVOC -21 -7 18 - -11 11 -9 -13 -12 -8 -9 -7 -17 -20 -11 -11 -9 -84

ADE20K -2 -2 -4 -5 - -1 -2 -2 -2 -2 -4 -2 -4 -5 -4 -3 -3 1 -39 ADE20K -5 0 -3 -3 - 1 -2 -3 -2 -2 -4 -1 -3 -9 -6 0 -3 -67

COCO -6 -5 -6 -6 -6 - -6 -5 -6 -6 -6 -6 -6 -7 -6 -6 -6 -2 -37 COCO -7 -1 -5 -3 -6 - -3 -4 -4 -5 -4 -4 -7 -9 -6 -3 -1 -77 KITTI 1 3 5 3 6 7 - 2 11 6 6 11 -4 -2 3 -1 1 11 -24 KITTI -2 -3 1 -2 -1 2 - 2 8 2 3 1 -4 -11 0 -1 1 -21 CamVid -9 -8 -1 -9 2 -5 -5 - 4 3 2 12 -8 -4 -2 -1 1 6 -24 CamVid 5 3 3 6 4 3 9 - 11 9 11 7 -2 4 1 1 1 -18 City -6 -5 -2 -6 -6 -6 -5 -4 - -4 -5 5 -8 -10 -2 -4 -5 1 -32 City -8 -3 -2 -4 -1 0 -1 -1 - 3 3 6 -9 -12 -5 -4 -5 -41 IDD -1 -1 -1 -1 -1 -1 1 0 0 - -1 2 -1 -3 -1 0 -1 1 -22

To: Segmentation IDD -5 -2 0 -1 -3 -2 -1 -4 3 - 1 6 -8 -8 -3 0 -4 -44 BDD -2 -2 -3 -2 -1 -1 0 1 2 2 - 0 -3 -6 0 0 -1 3 -33 To: Segmentation BDD -10 -3 0 2 1 -1 1 1 8 7 - 14 -11 -10 -4 -2 -3 -41 Mapillary -3 -3 -4 -5 -6 -3 -2 -3 -2 -2 -4 - -7 -8 -4 -3 -4 -1 -30

Mapillary -7 -2 0 -1 -2 -1 -3 -4 7 5 5 - -9 -12 -4 -3 -3 -60 SUN -4 -3 -4 -5 3 5 -3 -2 -1 -1 -1 -3 - 1 -3 -3 -2 4 -33

SUN -16 -1 3 -1 20 15 -1 -11 -7 -6 -10 -7 - 40 -4 -10 -5 -66 ScanNet 0 0 0 -1 2 5 1 2 1 1 0 1 1 - -1 1 0 5 -33

SUIM -5 -3 -7 -7 -3 -4 -2 -3 -4 -5 -2 -5 -3 -4 - -4 -3 -3 -36 ScanNet -17 -3 1 -4 14 10 -4 -5 -8 -2 -9 -6 49 - -4 -5 -5 -62

vKITTI2 -1 -1 -1 -1 -1 -1 0 0 0 4 7 7 7 -1 7 - 0 5 -5 SUIM -8 1 -5 2 -3 2 -1 -5 -6 -4 -3 -2 -2 -9 - -4 -2 -46

vGallery -1 0 0 -1 0 0 0 0 0 0 0 0 0 0 -1 0 - 1 -9 vKITTI2 -3 -2 -2 -4 -2 -2 0 1 0 -3 -1 -2 -2 -4 -3 - -2 -16

vGallery -2 -1 -1 -1 -1 -1 0 1 -1 0 -1 0 -1 -1 -2 -1 - -10 (a) Detection Depth Keypoints Segmentation St. Dogs vKITTI2 vKITTI2 VOC07 12K COCO COCO COCO 42K 42K 120K 120K 120K SUIM Multi Multi Multi 1.5K 5K UWT BDD BDD SUN SUN 70K 6K 7K 5K 5K TABLE 5: Results for transfer learning in the small source and small target setting. All results transferring within VOC07 - -10 -16 18 11 -11 -41 -8 -3 -7 -4 -7 -3 -2 -17 -1 semantic segmentation. Performance measured in relative UWT 15 - -1 16 -9 -12 -24 -10 7 10 11 -3 -5 0 -9 6 improvement gain over a model fine-tuned from ILSVRC’12. BDD 1 0 - 2 2 -1 -8 0 0 0 0 -1 0 0 -25 1 See main text for details. To: Detection COCO 1 0 1 - -2 -3 -9 -3 -2 0 -2 -2 -2 -1 -5 -3

(b) Keypoints Detection Depth We perform a study similar to Sect. 5.2, but focus more St. Dogs

Depth Segmentation vGallery vKITTI2 12K COCO COCO COCO 120K 120K 120K 44K 42K Multi Multi Multi on transfer within a task type. The results are shown Multi SUN MPII 25K SUN 5K 5K in Tab. 4, again grouped by respectively target task type, St. Dogs - 0 -1 -2 -53 -2 -3 -8 -1 SUN - -6 -3 -3 source task type, image domain, and source size. Absolute

MPII 7 - 26 6 -66 7 9 -2 -1 vKITTI2 3 - 1 3 numbers can be found in Appendix A. To: Depth COCO 0 0 - 0 -34 -2 -2 -1 0 vGallery 1 0 - -3 To: Keypoints 5.4 Small source and small target training set (c) (d) In this last setting we limit the size of the target training set TABLE 4: Results for transfer learning in the full target as in Sect. 5.2, and also limit the size of each source training training setting. Performance measured in relative improve- set to 1500 samples. This enables to study transfer learning ment gain over a model fine-tuned from ILSVRC’12. In effects where sources differ in appearance domain, but not Table (a) both the source and target task types are semantic in the amount of labeled images available. For all except two segmentation. See main text for details. datasets, this implies a reduction to the number of images available (KITTI and CamVid have only respectively 150 and 367 training images overall). model is fine-tuned on only a small part of that training set, We study transfer learning in this setting only for the se- and for a different task type. This setting is relevant if one is mantic segmentation task type and only for transfer within- interested in extending the annotation of a dataset from one task. The results are shown in Tab. 5. task type to another (e.g. one has a lot of object bounding boxes, but very few depth masks for a dataset). 6 ANALYSIS Second, when transferring within a task type, the multi- In this section we analyse our results across different set- source models are only applied for target datasets which tings, task types and domains. To facilitate the discussion, were not part of their training (hence the blank entries for we report the percentage of experiments for which a positive this column in Tab. 3a). This is necessary for meaningful (P), very positive (VP) or negative (N) transfer effect is experiments, as otherwise some target training sets would observed. We do not report insignificant transfer (I). Tab. 6 simply be a subset of the images and annotations present in shows the main results, split into the small target training set the multi-source. and full target training set settings, and filtered by whether we transfer within/across image domain and within/across 5.3 Transfer learning with full target training set task type. In this setting we use the full available target training set. A1: Classic ILSVRC’12 transfer learning always outper- We expect transfer learning to have a lesser effect, given that forms training a model from scratch. For all our experi- the target training set contains more information. ments starting from ILSVRC’12 outperforms training from FACTORS OF POSITIVE TRANSFER 12

Underwater Mapillary PContext ScanNet vGallery ADE20K Aerial Consumer CamVid Driving Indoor Synthetic scratch. And by a large margin: even when using the full vKITTI2 COCO ISPRS PVOC iSAID SUIM KITTI BDD SUN City target training set, transferring from ILSVRC’12 improves IDD performance by 5% − 71% (see for example Tab. 4a, right- ISPRS - 13 17 17 17 16 20 21 19 18 16 19 17 17 17 20 20 most column). This confirms that ILSVRC’12 pre-training is iSAID 16 - 19 19 18 18 21 22 20 20 18 20 19 19 18 21 20 a solid way of (starting) transfer learning, which explains PContext 27 25 - 12 17 15 23 23 23 20 20 20 24 22 23 21 25 why this practice is widespread. A2: For most target tasks there exists a source task which PVOC 27 26 12 - 18 15 23 23 23 20 20 20 24 22 24 22 26 brings further benefits on top of ILSVCR’12 pre-training. ADE20K 23 22 15 16 - 15 20 20 20 18 17 17 20 19 21 19 22

This can be seen in Tab. 3 and Tab. 4, by noticing that for COCO 27 25 16 16 17 - 22 22 22 19 19 19 24 21 23 21 25 almost any row there are green entries, indicating positive KITTI 26 24 14 15 13 13 - 12 11 10 11 10 24 22 21 12 23 relative transfer gain over ILSVRC’12 for that source model. CamVid 27 25 14 15 12 13 11 - 11 10 10 10 24 21 22 12 23 To quantify this observation, we have computed the number of target tasks for which there is a source leading City 20 20 13 14 12 14 10 11 - 10 10 10 20 20 21 12 19 to positive (P) or very positive (VP) transfer effect on top of IDD 29 26 14 15 13 13 13 12 14 - 11 10 26 23 22 13 26

ILSVRC’12 pre-training. To do this, for each target task we BDD 29 26 14 15 13 14 14 14 16 11 - 11 26 23 23 14 27 take the transfer gain brought by the best source (Tab. 7). In Mapillary 45 39 15 16 14 14 21 18 26 12 12 - 40 32 34 18 41 the small target training set regime, there is a positive effect SUN 18 18 15 15 13 14 21 22 20 19 19 20 - 12 19 20 18 for 85% of the target tasks, and very positive for 70%. Even in the full target training set regime, for 56% of the target ScanNet 20 20 16 16 15 15 22 23 21 21 20 21 13 - 20 21 18 tasks there is a positive effect, and very positive for 22%. SUIM 19 17 19 19 19 18 23 23 23 22 20 21 21 20 - 21 19 So although ILSVRC’12 pre-training is the de-facto stan- vKITTI2 26 23 14 14 14 14 12 13 13 12 12 12 24 21 21 - 22 dard way to do transfer learning, there is surprisingly much vGallery 20 19 16 15 15 15 20 20 19 19 19 18 15 15 20 19 - to gain from an additional transfer step, and this for all task types. Next, we analyze what are the factors that influence Fig. 8: Domain distance between datasets (Eq. (1), asymmet- these benefits. ric). Rows are targets, columns are sources. Lowest distance A3: The image domain strongly affects transfer gains. is within a manually defined image domain, as expected. From Tab. 3a we see that most positive gains occur when Consumer to driving and indoor domains yield close dis- the source and target tasks are in the same image domain tance values, but not vice versa. This shows that consumer (within-domain transfer). Conversely, often transfer across images include both the driving and indoor domains. image domains yields negative gains. For example, for the consumer datasets as target, all other domains are bad sources. This makes the image domain an important factor. can compare its influence on transfer gains to the influence As can be seen, the size of the source dataset also plays of source size. We first measure the rank correlation between a role: for example, for driving as a target domain, the domain distance and transfer gains using Kendall τ [47]. larger driving and consumer datasets are generally better For the small target training set experiments this yields a sources than the smaller ones. However, this effect is far less correlation of 0.38, while the correlation between source size important than the image domain. Finally, Tab. 3b shows and transfer gains is 0.12. Therefore the image domain is a transfer from other task types to segmentation (cross-task- more important factor of influence than source size. type transfer). Here, most effects are negative. A4: For positive transfer, the source image domain should To quantify, we first compare the individual effects of include the target domain. In most of our experiments a image domain and task type in Tab. 6. Out of all experi- source from a broader domain helps a target from a more ments with small target training set, within-domain, cross- specific domain contained within it. This can be qualita- task-type setting, 43% yields positive transfer gains and tively observed from Tab. 3a. For example, sources from 37% negative ones. Conversely, out of all experiments in the consumer domain achieve positive transfer not only on the cross-domain, within-task-type setting, only 19% yields consumer targets, but also on driving or indoor. Indeed, positive transfer gains while 51% negative ones. This pattern the consumer datasets also contain images of street views is repeated when using the full target training set. Hence we and indoor scenes (Fig. 1). Conversely, if we look at the conclude that, given a target task, the image domain of the consumer domain as a target, none of the driving or indoor source is more important to achieve good transfer gains than sources yield any positive transfer. The same pattern can its task type. also explain the success of the multi-source model: this Above we considered image domains as manually de- model always yields positive transfer, and by design it fined, intuitive types. As presented in Sect. 3.2, we also contains images from each of the manually defined target consider a continuous measure of domain distance based domains. on image appearance. In Fig. 8, we visualize the domain To visualize domain inclusion, we created a t-SNE plot distance for semantic segmentation datasets, with lighter of the activations of 150 images from each semantic seg- colors indicating a smaller distance. This appearance dis- mentation dataset (Fig. 9a; same features as Sect. 3.2). The tance correlates with the manually defined domains. One visualization show that the consumer domain points are exception is the vKITTI2 dataset, which seems to be closer indeed scattered around the plot, covering almost all other to other driving datasets than to the other synthetic dataset domains. Other domains form more compact clusters, e.g. vGallery. aerial in green, and driving in yellow/orange. With such a continuous measure of domain distance we Finally, we also quantitatively demonstrate that domain FACTORS OF POSITIVE TRANSFER 13

Small target training set Full target training set image domain task type P VP N # P VP N # transfer transfer r > 2% r > 10% r < −2% r > 2% r > 10% r < −2% all all 23% 13% 54% 509 18% 7% 38% 382 within within 69% 47% 17% 64 35% 12% 22% 78 within cross 43% 22% 37% 49 22% 6% 39% 18 cross within 19% 8% 51% 242 14% 6% 38% 242 cross cross 5% 2% 79% 154 7% 2% 64% 44

TABLE 6: Percentage of experiments (i.e. source-target pairs, table entries in Tab. 3, Tab. 4, Tab. 5) for which we measured positive/negative transfer effects of a certain magnitude. We aggregate these under different image domain and task type transfer constraints. # denotes the number of experiments considered according to the constraint.

Small target training set Full target training set image domain task type P VP N # P VP N # transfer transfer r > 2% r > 10% r < −2% r > 2% r > 10% r < −2% all all 85% 70% 4% 27 56% 22% 7% 27 within within 73% 68% 14% 22 50% 19% 12% 26 within cross 65% 35% 18% 17 33% 17% 0% 6 cross within 59% 33% 26% 27 33% 7% 19% 27 cross cross 32% 11% 53% 19 29% 14% 14% 7

TABLE 7: Transfer effects induced by the best available source for each target task (i.e. max over entries in a row in Tab. 3, Tab. 4, and Tab. 5). We aggregated these under different image domain and task type transfer constraints. # denotes the number of target tasks considered according to the constraints.

τ Assignment method this in the small source and small target set setting (Sect. 5.4). (1) Earth Mover’s Distance 0.17 The results are shown in Tab. 8. The assignment measure (2) Target image → closest source image 0.39 capturing inclusion (2) has the highest correlation with → (3) Source image closest target image 0.02 transfer gains. Interestingly, measure (3) is uncorrelated. (4) Source → Target & Target → Source 0.23 This suggests that even if the source domain has images far from the target domain, this does not influence transfer TABLE 8: Kendall-τ correlation between transfer gains and gains. This confirms that the most important aspect for a dataset distance for semantic segmentation in the small source to yield positive transfer is that is should include the source small target setting (Sect. 5.4). As distance we take target image domain. the average distance between images from the source and A5: Multi-source models yield good transfer, but are target datasets, while varying the assignment method (i.e. outperformed by the largest within-domain source. min function in Eq. (1)). The assignment with the highest Tab. 3a shows that our multi-source semantic segmenta- correlation measures inclusion of the target by the source. tion model yields positive transfer gains for all targets. This is in line with our previous observations that the source do- main should include the target, while it is less important that inclusion matters. To do so, we change the function used it spans a much broader range. Indeed, by construction the in Eq. (1) to match target images to source images. We have source data of this multi-source model spans all domains. four strategies: (1) Assign one-to-one each target image to a However, we also observe that the largest within-domain source image such that the overall Earth Mover’s Distance source almost always yields better transfer gains: In 6 out is minimized (i.e. matching is done through the Hungarian of these 10 experiments the largest within-domain source algorithm). This is a measure of overlap of the distributions yields significantly better transfer (i.e. > 2%). Only for the of the two image domains. (2) Assign each target image to its targets ISPRS and SUN RGB-D the multi-source model is closest source. This measures inclusion of the target dataset better, which can be explained by many sources being good in the source dataset and is identical to Eq. (1). (3) Assign for both datasets: almost all datasets are good for ISPRS, each source image to its closest target. A large distance here while both the consumer and indoor domains are good for means that many source images are far from the target. SUN RGB-D. Hence the multi-source models arguable cover (4) Take the average of (2) and (3), yielding a symmetric the domain for these targets better than the largest single measure. source. We correlate each of the four measures to transfer gains A6: Transfer across task types can bring positive transfer using Kendall-τ rank correlation. To remove any influence gains. of task type and dataset size (both source and target), we do Depending on the choice of task types for the source FACTORS OF POSITIVE TRANSFER 14

ILSVRC'12 Segmentation Detection Keypoints

(b) Different networks on COCO.

ILSVRC'12 Segmentation Depth

ISPRS ADE20K CityScapes Mapillary SUIM iSAID COCO BDD SUN vKITTI2 PContext KITTI IDD ScanNet vGallery PVOC CamVid

(a) Multi source network features for all semantic segmentation datasets. (c) Different networks on SUN RGB-D.

Fig. 9: Illustration of the apperance distribution of the datasets, according to different feature encoding networks, using t- SNE. Fig. 9a visualizes the semantic segmentation datasets using features extracted from the multi-source network. Fig. 9b and Fig. 9c visualizes the COCO and SUN RGB-D datasets using networks trained on different task types (ILSVRC’12 indicates classification).

and target, cross-task-type transfer can be beneficial: for same dataset same domain 65% of the targets within the same image domain as the source, cross-task-type transfer results in positive transfer PN# PN# gains (Tab. 7). To study this effect more precisely, we split the seg → obj det 100% 0% 2 63% 25% 8 results over different cross-task-type pairs in Tab. 9. The left seg → kp det 0% 100% 1 0% 100% 2 column looks at cross-task-type transfer on the same dataset → seg depth 33% 0% 3 33% 33% 3 (the specific setting studied in detail in Taskonomy [121]). obj det → seg 0% 0% 1 30% 50% 10 obj det → kp det 100% 0% 1 100% 0% 2 Here we can see that object detection and keypoint detection kp det → seg 0% 100% 1 0% 100% 3 help each other. Segmentation helps object detection but not kp det → obj det 100% 0% 1 0% 33% 3 vice versa. Semantic segmentation and keypoint detection depth → seg 100% 0% 1 100% 0% 1 hurt each other. When transfering to a different dataset, but still saying in the same image domain (right column), results TABLE 9: Percentage of positive and negative transfer exper- show similar effects (except that keypoint detection stops iments in the cross-task-type, small target setting. Semantic helping object detection). segmentation is denoted as ‘seg’, object detection as ‘obj Our results confirm some of the observations in [121] det’, keypoint detection as ‘kp det’, depth estimation as as we also observe that some task types are more easily to ‘depth’. We consider the setting where the images of the transfer from / to than others, and that transfer across task source are the same (dataset) as the target, and when the types is asymmetric, i.e. if task type A is beneficial for task source and target share the same domain but are different type B, the reverse might not hold. datasets. Note that the two experiments where segmentation to object detection fails for the same domain, are for Under- However, even when we observe gains through cross- water Trash as a target. Arguably, this is a rather different task-type transfer within the same dataset, if there is another underwater domain then that of SUIM which acts as the within-domain source with the same task type, it yields even source (once alone, once as a multi-source model). better gains. Two examples are the object detection results for BDD and COCO as targets in Tab. 3c. This suggests that having a source with the same image domain and task type FACTORS OF POSITIVE TRANSFER 15 as the target is better than having annotations for another versa, out of all experiments with positive transfer gains task type on the target dataset. in the small target setting, only 25% also have significant Tab. 6 shows that for cross-task-type transfer to work, gains in the full target setting (for most of these experiments the image domain of source and target should be the same: the gains become insignificant). When looking at negative cross-domain, cross-task transfer yields negative transfer for transfer effects: if transfer effects are negative on a small 79% of the experiments, and positive for 5%. Moreover, target training set, these effects remain negative or become out of those 5%, all except one experiment have consumer neutral in the full target set 98% of the time. images as a source which includes the target domain of Practically, this suggests a quick test for whether a source driving, indoor, and (arguably) underwater. Hence cross- dataset may be beneficial for a target. First train on a small task-type transfer can only be expected to work if the image subset of the target training set. Then, if transfer effects domain from the source includes that of the target. are negative, discard this source for this target. If transfer To understand how the task type influences the features effects are positive, explore this source further, since there learned by the source models, we visualize the features is a good chance its benefits are kept when using the full learned on the same dataset but for different task types target training set. using t-SNE (cosine distance, which removes arbitrary scal- A10: The source domain including the target is more ing effects). We do this for the exact same set of images important than the number of source samples. Here we for COCO (Fig. 9b) and SUN RGB-D (Fig. 9c). We observe aim to disentangle the size of the source training set from the that each model creates compact cluster representations, breadth of the image domain it spans (e.g. COCO covers a showing there is a larger distance between an identical broad image domain as visualized in Fig. 9b). The influence image in the feature spaces of two different models than of source size is eliminated in our experiment in Tab. 5, between two different images in the feature space of one where we use a small training set for all sources (and a model. small target training set too). It is instructive to compare A7: Transfer within-task-type and within-domain yields this Tab. 3a, where full source sets are used. very positive effects. Combining the observations A3 and Even after fixing source size, most patterns stay roughly A6, our recommendation to obtain positive transfer is to the same. For example, COCO remains a good source for use a source model trained on the same domain and for other consumer and indoor datasets. Indeed, COCO is not the same task type as the target task. In this setting, a total only the largest but also the most diverse consumer dataset. of 69% of all source-target pairs exhibits positive transfer Similarly, Mapillary remains a good source for most other (Tab. 6). Moreover, the best available source in this setting driving datasets, while other driving sources are generally leads to positive transfer for 73% of the target tasks (and worse. Mapillary was designed to capture driving con- 68% very positive). ditions around all continents, whereas the India Driving A8: Transfer naturally flows from larger to smaller Dataset covers only Indian roads, Berkeley Deep Drive is datasets. To study the effect of dataset size, we look at US only, while CityScapes is (mostly) Germany. Camvid transfer learning within-domain and within-task-type. In consists of frames of four videos, and therefore has a narrow this setting, the transfer effect is best visibile in Tab. 3a and image domain. This suggests that much of the effects in Tab. 4a, when looking at the blocks around the diagonal. Tab. 3a which we attributed to dataset size in A8, can in fact Within each block of datasets of the same domain, they be attributed to the property of the source domain including are ordered by training set size. We observe a clear upper- the target domain. Of course, source size is generally corre- triangle pattern, where the larger datasets positively transfer lated to the breadth of its domain (e.g. COCO and Mapillary towards the smaller datasets. The reverse is much less were both designed to span broader domains), which is why present. care needs to be taken to disentangle their effects. Quantitatively, we measure a positive transfer in 60% of within-domain and within-task-type experiments where the source training set is larger than the target dataset. 7 CONCLUSION Conversely, when the source training set is smaller than In this paper we performed over 1200 transfer learning the target training set, only 5% of our experiments result experiments across a wide variety of image domains and in positive transfer. task types. Our systematic analysis of these experiments A9: Transfer learning effects are larger for small target lead to the following conclusions: (1) for most tasks there training sets. We expect this to hold because when the tar- exists a source which significantly outperforms ILSVRC’12 get task already offers a large training set, the target model pre-training; (2) the image domain is the most important can learn the required visual knowledge from the target factor for achieving positive transfer; (3) the source task task directly. We can clearly see such effects by comparing should include the image domain of the target task to achieve the within-domain segmentation results of the small and best result; (4) conversely, we observe little negative effect full target training set setting, i.e. Tab. 3a vs Tab. 4a. The when the image domain of the source task is much broader absolute transfer gains are higher in the small target training than that of the target; (5) transfer across task types can be set setting than in the full one, while overall patterns remain beneficial, but its success is heavily dependent on both the roughly the same. source and target task types. We now compare Tab. 3a and Tab. 4a quantitatively. We Our findings provide support for the success of large- first examine positive transfer gains: out of all experiments scale pre-training for transfer learning, with a single very with positive transfer gains in the full target setting, 78% large source spanning a mixture of domains [46], [54], [67], also have a positive transfer in the small target setting. Vice [97], [122]. If a good source set should include the target FACTORS OF POSITIVE TRANSFER 16 but can be arbitrarily broad, training on a massive dataset [25] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from which covers most of the visual domains should yield a a single image using a multi-scale deep network. In NeurIPS, 2014. good source model for most target tasks. However, this [26] M. Everingham, S. Eslami, L. van Gool, C. Williams, J. Winn, and form of pre-training inevitably requires transfer across task A. Zisserman. The PASCAL visual object classes challenge: A types, whose success depends on both the source and target retrospective. IJCV, 2015. task types. This suggests that future works exploring pre- [27] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, training should focus also on structured prediction tasks. and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal- network.org/challenges/VOC/voc2012/workshop/index.html, 2012. REFERENCES [28] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017. [1] Common objects in context (COCO) keypoint challenge - evalu- [29] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ation. https://cocodataset.org/#keypoints-eval. ordinal regression network for monocular depth estimation. In [2] The PASCAL visual object classes challenge. http://host.robots. CVPR, 2018. ox.ac.uk/pascal/VOC/. [30] M. Fulton, J. Hong, and J. Sattar. Trash-icra19: A bounding box [3] Robust vision challenge. http://www.robustvision.net/. labeled dataset of underwater trash. In Proc. Intl. Conf. on Robotics [4] Imagenet large scale visual recognition challenge (ILSVRC). http: and Automation, 2020. //www.image-net.org/challenges/LSVRC/2012/index, 2012. [31] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as [5] H. Alhaija, S. Mustikovela, L. Mescheder, A. Geiger, and proxy for multi-object tracking analysis. In Proceedings of the IEEE C. Rother. Augmented reality meets computer vision: Efficient conference on Computer Vision and Pattern Recognition, pages 4340– data generation for urban driving scenes. IJCV, 2018. 4349, 2016. [6] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human [32] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by pose estimation: New benchmark and state of the art analysis. In backpropagation. In ICML, 2014. CVPR, 2014. [33] W. Ge and Y. Yu. Borrowing treasures from the wealthy: Deep [7] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. transfer learning through selective joint fine-tuning. In CVPR, Factors of transferability for a generic convnet representation. 2017. IEEE Trans. on PAMI, 2015. [34] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning [8] B. Biggs, O. Boyne, J. Charles, A. Fitzgibbon, and R. Cipolla. Who without forgetting. In CVPR, 2018. left the dogs out?: 3D animal reconstruction with expectation maximization in the loop. In ECCV, 2020. [35] R. Girshick. Fast R-CNN. In ICCV, 2015. [9] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krish- [36] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova. Depth from nan. Unsupervised pixel-level domain adaptation with genera- videos in the wild: Unsupervised monocular depth learning from tive adversarial networks. In CVPR, 2017. unknown cameras. In ICCV, 2019. [10] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes [37] M. Gygli, J. Uijlings, and V. Ferrari. Training neural networks to in video: A high-definition ground truth database. Patt. Rec. produce compatible features. In AAAI, 2020. Letters, 30(2):88–97, 2009. [38] K. He, G. Gkioxari, P. Dollar,´ and R. Girshick. Mask R-CNN. In [11] Y. Cabon, N. Murray, and M. Humenberger. Virtual kitti 2, 2020. ICCV, 2017. [12] H. Caesar, J. Uijlings, and V. Ferrari. COCO-Stuff: Thing and stuff [39] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for classes in context. In CVPR, 2018. image recognition. In CVPR, 2016. [13] W.-G. Chang, T. You, S. Seo, S. Kwak, and B. Han. Domain- [40] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel- specific batch normalization for unsupervised domain adapta- level adversarial and constraint-based adaptation. arXiv preprint tion. In CVPR, 2019. arXiv:1612.02649, 2016. [14] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and [41] S. Huang and D. Tao. All you need is a good representation: A. Yuille. DeepLab: Semantic Image Segmentation with Deep A multi-level and classifier-centric representation for few-shot Convolutional Nets, Atrous Convolution, and Fully Connected learning. arXiv preprint arXiv:1911.12476, 2019. CRFs. IEEE Trans. on PAMI, 2018. [42] M. Huh, P. Agrawal, and A. Efros. What makes imagenet good [15] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang. A for transfer learning? In NIPS LSCVS workshop, 2016. closer look at few-shot classification. In ICLR, 2019. [43] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep [16] Y. Chen, T. Mensink, and E. Gavves. 3d neighborhood convolu- network training by reducing internal covariate shift. In ICML, tion: Learning depth-aware features for rgb-d and rgb semantic 2015. segmentation. In 3DV, 2019. [44] M. J. Islam, C. Edge, Y. Xiao, P. Luo, M. Mehtaz, C. Morse, [17] B. Chu, V. Madhavan, O. Beijbom, J. Hoffman, and T. Darrell. S. S. Enan, and J. Sattar. Semantic Segmentation of Underwater Best practices for fine-tuning visual classifiers to new domains. Imagery: Dataset and Benchmark. In IEEE/RSJ International In ECCV, 2016. Conference on Intelligent Robots and Systems (IROS), 2020. [18] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, [45] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes T. Darrell. A category-level 3-d object dataset: Putting the kinect dataset for semantic urban scene understanding. In CVPR, 2016. to work. In ICCV Workshop, 2011. [19] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. [46] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning Autoaugment: Learning augmentation strategies from data. In visual features from large weakly supervised data. In ECCV, CVPR, 2019. 2016. [20] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and Biometrika M. Nießner. ScanNet: Richly-annotated 3d reconstructions of [47] M. Kendall. A new measure of rank correlation. , 1938. indoor scenes. In CVPR, 2017. [48] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel [21] G. Davidson and M. C. Mozer. Sequential mastery of multiple dataset for fine-grained image categorization. In Workshop on visual tasks: Networks naturally learn to learn and forget to Fine-Grained Visual Categorization, CVPR Workshop, 2011. forget. In CVPR, 2020. [49] D. P. Kingma and J. L. Ba. Adam: A method for stochastic [22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-fei. optimization. In ICLR, 2015. ImageNet: A large-scale hierarchical image database. In CVPR, [50] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar.´ Panoptic 2009. segmentation. In CVPR, 2019. [23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, [51] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Des- T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, jardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska- et al. An image is worth 16x16 words: Transformers for image Barwinska, et al. Overcoming catastrophic forgetting in neural recognition at scale. arXiv preprint arXiv:2010.11929, 2020. networks. Proc. Nat. Acad. Sci. USA, 2017. [24] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian. Centernet: [52] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural net- Keypoint triplets for object detection. In ICCV, 2019. works for one-shot image recognition. In ICML Workshop, 2015. FACTORS OF POSITIVE TRANSFER 17

[53] I. Kokkinos. Ubernet: Training a ‘universal’ cnn for low-, mid-, [82] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards and high- level vision using diverse datasets and limited mem- real-time object detection with region proposal networks. In ory. In CVPR, 2017. NeurIPS, 2015. [54] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, [83] F. Rottensteiner, G. Sohn, M. Gerke, and J. D. Wegner. Theme and N. Houlsby. Big Transfer (BiT): general visual representation section — urban object detection and 3d building reconstruction. learning. In ECCV, 2020. ISPRS Journal of Photogrammetry and Remote Sensing, 93:143 – 144, [55] S. Kornblith, J. Shlens, and Q. V. Le. Do better imagenet models 2014. transfer better? In CVPR, 2019. [84] S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci. [56] A. Krizhevsky. Learning multiple layers of features from tiny Unsupervised domain adaptation using feature-whitening and images. Technical report, University of Toronto, 2009. consensus loss. In CVPR, 2019. [57] J. N. Kundu, N. Venkat, A. Revanur, R. V. Babu, et al. Towards [85] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, inheritable models for open-set domain adaptation. In CVPR, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and 2020. L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. [58] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont- International Journal of Computer Vision (IJCV), 115(3):211–252, Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. The 2015. open images dataset v4. IJCV, 2020. [86] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source [59] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of to target and back: symmetric bi-directional adaptive gan. In perspective. In CVPR, 2014. CVPR, 2018. [60] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun. MSeg: A [87] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual composite dataset for multi-domain semantic segmentation. In category models to new domains. In ECCV, 2010. CVPR, 2020. [88] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. [61] H. Law and J. Deng. Cornernet: Detecting objects as paired Generate to adapt: Aligning domains using generative adversar- keypoints. In ECCV, 2018. ial networks. In CVPR, 2018. [62] S. Lee, J. Hyun, H. Seong, and E. Kim. Unsupervised domain [89] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots. One-shot adaptation for semantic segmentation by content transfer. In learning for semantic segmentation. In arXiv, 2017. AAAI, 2020. [90] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional net- [63] J. Li, R. Klein, and A. Yao. A two-streamed network for estimat- works for semantic segmentation. IEEE Trans. on PAMI, 2016. ing fine-scaled depth maps from single RGB images. In ICCV, [91] K. Shmelkov, C. Schmid, and K. Alahari. Incremental learning of 2017. object detectors without catastrophic forgetting. In ICCV, 2017. [64] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar.´ Focal loss [92] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor seg- for dense object detection. In ICCV, 2017. mentation and support inference from rgbd images. In ECCV, 2012. [65] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, [93] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for P. Dollar,´ and C. Zitnick. Microsoft COCO: Common objects in few-shot learning. In NeurIPS, 2017. context. In ECCV, 2014. [94] S. Song, S. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene [66] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable understanding benchmark suite. In CVPR, 2015. features with deep adaptation networks. In ICML, 2015. [95] J.-C. Su, S. Maji, and B. Hariharan. When does self-supervision [67] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, improve few-shot learning? In ECCV, 2020. A. Bharambe, and L. van der Maaten. Exploring the limits of [96] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep weakly supervised pretraining. In ECCV, 2018. domain adaptation. In ECCV Workshop, 2016. [68] F. Maria Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. Rota Bulo. [97] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting AutoDIAL: Automatic DomaIn Alignment Layers. In ICCV, 2017. unreasonable effectiveness of data in deep learning era. In ICCV, [69] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance- 2017. based image classification: Generalizing to new classes at near- [98] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros. Unsupervised zero cost. IEEE Trans. on PAMI, 2013. domain adaptation through self-supervision. arXiv preprint [70] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, arXiv:1909.11825, 2019. R. Urtasun, and A. Yuille. The role of context for object detection [99] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. and semantic segmentation in the wild. In CVPR, 2014. Hospedales. Learning to compare: Relation network for few-shot [71] B. Mustafa, A. Loh, J. Freyberg, P. MacWilliams, A. Karthike- learning. In CVPR, 2018. salingam, N. Houlsby, and V. Natarajan. Supervised trans- [100] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola. fer learning at scale for medical imaging. arXiv preprint Rethinking few-shot image classification: a good embedding is arXiv:2101.05913, 2021. all you need? In ECCV, 2020. [72] G. Neuhold, T. Ollmann, S. Rota Bulo,` and P. Kontschieder. The [101] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and mapillary vistas dataset for semantic understanding of street M. Chandraker. Learning to adapt structured output space for scenes. In ICCV, 2017. semantic segmentation. In CVPR, 2018. [73] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for [102] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial human pose estimation. In ECCV. Springer, 2016. discriminative domain adaptation. In CVPR, 2017. [74] J. Ngiam, D. Peng, V. Vasudevan, S. Kornblith, Q. V. Le, and [103] J. Uijlings, S. Popov, and V. Ferrari. Revisiting knowledge transfer R. Pang. Domain adaptive transfer learning with specialist for training object class detectors. In CVPR, 2018. models. arXiv preprint arXiv:1811.07056, 2018. [104] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and [75] A. Nichol, J. Achiam, and J. Schulman. On first-order meta- C. Jawahar. Idd: A dataset for exploring problems of autonomous learning algorithms. arXiv preprint arXiv:1803.02999, 2018. navigation in unconstrained environments. In Proc. WACV, 2019. [76] X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, and [105] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Pan- K. Saenko. Visda: A synthetic-to-real benchmark for visual chanathan. Deep hashing network for unsupervised domain domain adaptation. In CVPR Workshop, 2018. adaptation. In CVPR, 2017. [77] J. Puigcerver, C. Riquelme, B. Mustafa, C. Renggli, A. S. Pinto, [106] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and S. Gelly, D. Keysers, and N. Houlsby. Scalable transfer learning D. Wierstra. Matching networks for one shot learning. NeurIPS, with expert models. In ICLR, 2020. 2016. [78] H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with [107] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, imprinted weights. In CVPR, 2018. Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao. Deep high- [79] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals. Rapid learning resolution representation learning for visual recognition. IEEE or feature reuse? towards understanding the effectiveness of Trans. on PAMI, 2020. MAML. ICLR, 2020. [108] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng. Panet: Few- [80] S. Ravi and H. Larochelle. Optimization as a model for few-shot shot image semantic segmentation with prototype alignment. In learning. In ICLR, 2016. ICCV, October 2019. [81] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual [109] X. Wang, T. Huang, J. Gonzalez, T. Darrell, and F. Yu. Frustrat- domains with residual adapters. In NeurIPS, 2017. ingly simple few-shot object detection. In Proceedings of the 37th FACTORS OF POSITIVE TRANSFER 18

International Conference on Machine Learning, volume 119, pages APPENDIX A 9919–9928. PMLR, 2020. ADDITIONAL RESULTS TABLES [110] S. Waqas Zamir, A. Arora, A. Gupta, S. Khan, G. Sun, F. Shah- baz Khan, F. Zhu, L. Shao, G.-S. Xia, and X. Bai. isaid: A large- his appendix reports more extensive results for all of scale dataset for instance segmentation in aerial images. In CVPR our experiments, i.e. including both absolute perfor- Workshops, 2019. T [111] P. Weinzaepfel, G. Csurka, Y. Cabon, and M. Humenberger. mance as well as relative transfer gain. We use the following Visual localization by learning objects-of-interest dense match evaluation metrics: regression. In CVPR, 2019. • For semantic segmentation we use Intersection-over- [112] L. Weng. Meta-learning: Learning to learn fast. lilianweng.github.io/lil-log, 2018. Union (IoU), averaged over classes [26]. See Fig. 10 and [113] J. Wu, S. Liu, D. Huang, and Y. Wang. Multi-scale positive Fig. 11. sample refinement for few-shot object detection. In A. Vedaldi, • For object detection, we use mean Average Precision H. Bischof, T. Brox, and J.-M. Frahm, editors, ECCV. Springer International Publishing, 2020. (mAP) at a threshold of 0.5 (box-based) IoU [26] for Pas- [114] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, cal VOC and Underwater Trash. For COCO and BDD M. Pelillo, and L. Zhang. Dota: A large-scale dataset for object we use the COCO mAP variant [65], which evaluates detection in aerial images. In The IEEE Conference on Computer mAP at multiple IoU thresholds and averages them. As Vision and Pattern Recognition (CVPR), June 2018. [115] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big earlier observed in [126], the common post-processing spaces reconstructed using SfM and object labels. In ICCV, 2013. stage of Non-Maximum Suppression (NMS) did not [116] J. Xie, M. Kiefel, M.-T. Sun, and A. Geiger. Semantic instance improve results on COCO or BDD, but we found it annotation of urban scenes by 3d to 2d label transfer. In CVPR, 2016. beneficial on Pascal VOC and SUIM. For the latter two [117] X. Yan, D. Acuna, and S. Fidler. Neural data server: A large-scale dataset we use NMS at an IoU of 0.3. Results are in search engine for transfer learning data. CVPR, 2020. Fig. 12. [118] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, • For keypoint detection we report the mean Average and T. Darrell. Bdd100k: A diverse driving dataset for heteroge- neous multitask learning. In CVPR, June 2020. Precision at 0.5 Object Keypoint Similarity (OKS), as [119] F. Yu, M. Zhang, H. Dong, S. Hu, B. Dong, and L. Zhang. defined by the COCO challenge [1]. Results are in Dast: Unsupervised domain adaptation in semantic segmentation Fig. 13. based on discriminator attention and self-training. In AAAI, 2021. • [120] S. Zagoruyko and N. Komodakis. Wide residual networks. In For depth estimation we use the Root Mean Square BMVC, 2016. Error (RMSE), linear version, as defined in [25], and also [121] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and zˆ z the standard metric δ < 1.25, where δ = max( z , zˆ) is a S. Savarese. Taskonomy: Disentangling task transfer learning. measure of relative accuracy as defined in [59]. Results In CVPR, 2018. [122] X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, are in Fig. 14. M. Lucic, J. Djolonga, A. S. Pinto, M. Neumann, A. Dosovitskiy, Relative transfer gains are calculated w.r.t. the L. Beyer, O. Bachem, M. Tschannen, M. Michalski, O. Bousquet, ILSVRC’12 column, cf Eq. 2 in the main paper. S. Gelly, and N. Houlsby. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019. [123] H. Zhang, H. Ouyang, S. Liu, X. Qi, X. Shen, R. Yang, and J. Jia. Human pose estimation with spatial contextual information. In arXiv, 2019. [124] Y. Zhao, H. Ali, and R. Vidal. Stretching domain adaptation: How far is too far? arXiv preprint arXiv:1712.02286, 2017. [125] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017. [126] X. Zhou, D. Wang, and P. Krahenb¨ uhl.¨ Objects as points. In arXiv, 2019. [127] J. Zhu and Y. Fang. Learning object-specific distance from a monocular image. In ICCV, 2019. FACTORS OF POSITIVE TRANSFER 19

Underwater Underwater ILSVRC'12 Mapillary Mapillary Aerial PContext Consumer Driving Indoor Synthetic Aerial PContext Consumer Driving Indoor Synthetic ScanNet ScanNet vGallery vGallery ADE20K ADE20K CamVid CamVid Scratch Scratch 1.2M vKITTI2 vKITTI2 18K 18K 19K 19K 44K 44K 20K 20K COCO COCO 5K 5K 367 367 ISPRS ISPRS 42K 42K 120K 120K PVOC PVOC iSAID iSAID 4.0K 4.0K SUIM SUIM KITTI KITTI Multi Multi 1.5K 1.5K 10K 10K 28K 28K 3.0K 3.0K BDD BDD 150 150 SUN SUN City City IDD IDD 7K 7K 5K 5K 7K 7K

ISPRS - 0.38 0.35 0.37 0.37 0.39 0.37 0.31 0.36 0.38 0.36 0.38 0.31 0.32 0.34 0.33 0.34 0.42 0.09 0.32 ISPRS - 19 9 16 16 22 15 -5 11 17 14 19 -4 -1 7 2 5 32 -72

iSAID 0.36 - 0.39 0.37 0.37 0.39 0.37 0.34 0.36 0.39 0.37 0.39 0.36 0.38 0.39 0.34 0.35 - 0.20 0.42 iSAID -16 - -7 -12 -12 -8 -13 -20 -16 -9 -12 -9 -14 -12 -8 -20 -18 - -52

PContext 0.24 0.26 - 0.36 0.28 0.39 0.26 0.25 0.25 0.26 0.25 0.26 0.23 0.20 0.25 0.25 0.26 0.37 0.06 0.28 PContext -15 -7 - 27 0 40 -7 -11 -11 -9 -12 -6 -19 -29 -12 -11 -8 32 -80

PVOC 0.41 0.50 0.69 - 0.47 0.71 0.49 0.47 0.47 0.47 0.47 0.47 0.40 0.37 0.47 0.47 0.48 0.68 0.08 0.54 PVOC -23 -8 29 - -13 31 -9 -13 -12 -12 -12 -12 -26 -31 -12 -12 -11 26 -84

ADE20K 0.22 0.23 0.22 0.21 - 0.26 0.23 0.22 0.23 0.23 0.22 0.23 0.22 0.20 0.22 0.23 0.22 0.26 0.08 0.23 ADE20K -4 1 -3 -8 - 13 -2 -3 -2 -3 -6 -1 -5 -11 -6 -1 -4 14 -67

COCO 0.29 0.30 0.30 0.29 0.29 - 0.30 0.30 0.29 0.29 0.29 0.29 0.28 0.26 0.29 0.30 0.30 - 0.07 0.31 COCO -7 -3 -4 -5 -7 - -3 -4 -6 -5 -7 -7 -10 -15 -6 -4 -3 - -77

KITTI 0.46 0.47 0.47 0.47 0.49 0.49 - 0.47 0.51 0.50 0.49 0.51 0.43 0.45 0.46 0.46 0.47 0.50 0.36 0.46 KITTI 1 2 3 2 6 6 - 2 11 8 6 10 -6 -3 -1 1 2 9 -21

CamVid 0.51 0.50 0.51 0.50 0.54 0.50 0.54 - 0.56 0.54 0.55 0.58 0.51 0.51 0.49 0.49 0.51 0.55 0.40 0.49 CamVid 3 1 4 2 11 2 9 - 13 9 11 17 4 3 -1 -1 3 12 -18

City 0.41 0.42 0.43 0.42 0.44 0.45 0.43 0.43 - 0.47 0.46 0.51 0.41 0.39 0.42 0.42 0.42 0.49 0.26 0.43 City -5 -2 0 -2 2 4 -1 -1 - 9 6 17 -6 -9 -2 -2 -3 15 -41

IDD 0.37 0.38 0.39 0.39 0.39 0.41 0.38 0.37 0.40 - 0.40 0.43 0.37 0.35 0.38 0.38 0.37 0.42 0.22 0.38 IDD -5 -1 1 0 1 6 -1 -4 4 - 3 12 -5 -9 -2 0 -4 10 -44 To: Segmentation To: Segmentation BDD 0.42 0.45 0.46 0.46 0.49 0.51 0.46 0.46 0.50 0.51 - 0.59 0.42 0.40 0.44 0.45 0.43 0.55 0.27 0.45 BDD -8 -2 1 1 7 12 1 2 10 13 - 29 -8 -12 -3 -2 -5 21 -41

Mapillary 0.25 0.26 0.27 0.26 0.27 0.29 0.26 0.25 0.29 0.29 0.28 - 0.24 0.23 0.26 0.26 0.25 - 0.11 0.26 Mapillary -7 -1 1 -1 4 11 -3 -4 9 10 7 - -10 -12 -3 -3 -4 - -60

SUN 0.17 0.19 0.22 0.20 0.31 0.32 0.20 0.18 0.18 0.18 0.18 0.18 - 0.33 0.20 0.18 0.19 0.35 0.07 0.20 SUN -16 -6 10 -1 55 60 -1 -11 -11 -12 -10 -11 - 61 0 -10 -7 73 -66

ScanNet 0.11 0.13 0.15 0.14 0.19 0.22 0.14 0.14 0.13 0.13 0.13 0.14 0.24 - 0.14 0.13 0.13 - 0.05 0.14 ScanNet -21 -6 3 -2 31 56 -4 -5 -7 -6 -7 -5 71 - -6 -6 -6 - -62

SUIM 0.49 0.53 0.53 0.54 0.53 0.55 0.52 0.50 0.51 0.53 0.52 0.50 0.49 0.48 - 0.53 0.51 - 0.29 0.53 SUIM -8 1 -1 1 0 4 -1 -5 -4 -1 -2 -6 -8 -10 - 0 -3 - -46

vKITTI2 0.55 0.56 0.55 0.55 0.56 0.56 0.56 0.56 0.56 0.56 0.56 0.57 0.55 0.55 0.55 - 0.56 - 0.47 0.56 vKITTI2 -1 0 -1 -1 1 0 0 1 2 2 1 2 -1 -2 0 - 1 - -16

vGallery 0.77 0.78 0.78 0.77 0.78 0.78 0.78 0.78 0.78 0.78 0.77 0.78 0.78 0.77 0.77 0.78 - - 0.70 0.78 vGallery -1 0 0 -1 1 1 0 1 0 1 0 1 0 0 -1 0 - - -10

(a) Small target training set. Underwater Underwater ILSVRC'12 Mapillary Mapillary Aerial PContext Consumer Driving Indoor Synthetic Aerial PContext Consumer Driving Indoor Synthetic ScanNet ScanNet vGallery vGallery ADE20K ADE20K CamVid CamVid Scratch Scratch 1.2M vKITTI2 vKITTI2 18K 18K 19K 19K 44K 44K 20K 20K COCO COCO 5K 5K 367 367 ISPRS ISPRS 42K 42K 120K 120K PVOC PVOC iSAID iSAID 4.0K 4.0K SUIM SUIM KITTI KITTI Multi Multi 1.5K 1.5K 10K 10K 28K 28K 3.0K 3.0K BDD BDD 150 150 SUN SUN City City IDD IDD 7K 7K 5K 5K 7K 7K

ISPRS - 0.41 0.41 0.39 0.38 0.41 0.41 0.38 0.42 0.39 0.42 0.41 0.35 0.34 0.38 0.44 0.38 0.45 0.10 0.32 ISPRS - 30 29 24 19 31 30 20 34 24 32 29 10 8 21 38 20 44 -68

iSAID 0.82 - 0.82 0.81 0.80 0.80 0.82 0.81 0.82 0.82 0.81 0.82 0.80 0.78 0.81 0.81 0.82 0.83 0.46 0.82 iSAID 0 - -1 -1 -3 -3 0 -1 0 0 -1 0 -2 -5 -1 -1 0 1 -44

PContext 0.46 0.47 - 0.49 0.47 0.51 0.47 0.47 0.47 0.46 0.46 0.47 0.46 0.45 0.46 0.47 0.47 0.49 0.25 0.47 PContext -1 0 - 5 0 8 0 -1 0 -1 -1 -1 -2 -4 -1 0 0 5 -46

PVOC 0.77 0.77 0.77 - 0.77 0.79 0.78 0.78 0.77 0.77 0.77 0.78 0.77 0.76 0.78 0.78 0.78 0.78 0.49 0.78 PVOC -1 -1 -1 - -1 1 0 0 -2 -1 -1 -1 -2 -2 -1 0 -1 0 -38

ADE20K 0.41 0.40 0.40 0.39 - 0.41 0.41 0.40 0.41 0.41 0.40 0.41 0.40 0.39 0.40 0.40 0.40 0.42 0.25 0.41 ADE20K -2 -2 -4 -5 - -1 -2 -2 -2 -2 -4 -2 -4 -5 -4 -3 -3 1 -39

COCO 0.51 0.51 0.51 0.51 0.51 - 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.50 0.51 0.51 0.51 0.53 0.34 0.54 COCO -6 -5 -6 -6 -6 - -6 -5 -6 -6 -6 -6 -6 -7 -6 -6 -6 -2 -37

KITTI 0.46 0.47 0.48 0.47 0.48 0.48 - 0.46 0.50 0.48 0.48 0.50 0.43 0.44 0.47 0.45 0.46 0.50 0.34 0.45 KITTI 1 3 5 3 6 7 - 2 11 6 6 11 -4 -2 3 -1 1 11 -24

CamVid 0.50 0.51 0.54 0.50 0.56 0.53 0.53 - 0.57 0.57 0.56 0.62 0.51 0.53 0.54 0.55 0.56 0.59 0.42 0.55 CamVid -9 -8 -1 -9 2 -5 -5 - 4 3 2 12 -8 -4 -2 -1 1 6 -24

City 0.54 0.54 0.56 0.53 0.53 0.53 0.54 0.54 - 0.55 0.54 0.60 0.53 0.51 0.56 0.55 0.54 0.57 0.39 0.57 City -6 -5 -2 -6 -6 -6 -5 -4 - -4 -5 5 -8 -10 -2 -4 -5 1 -32

IDD 0.52 0.52 0.51 0.51 0.52 0.52 0.52 0.52 0.52 - 0.52 0.53 0.52 0.50 0.52 0.52 0.52 0.53 0.41 0.52 IDD -1 -1 -1 -1 -1 -1 1 0 0 - -1 2 -1 -3 -1 0 -1 1 -22 To: Segmentation To: Segmentation BDD 0.63 0.63 0.62 0.63 0.64 0.64 0.64 0.65 0.65 0.65 - 0.64 0.62 0.60 0.64 0.64 0.64 0.66 0.43 0.64 BDD -2 -2 -3 -2 -1 -1 0 1 2 2 - 0 -3 -6 0 0 -1 3 -33

Mapillary 0.48 0.48 0.47 0.47 0.46 0.48 0.48 0.48 0.49 0.49 0.47 - 0.46 0.46 0.47 0.48 0.47 0.49 0.35 0.49 Mapillary -3 -3 -4 -5 -6 -3 -2 -3 -2 -2 -4 - -7 -8 -4 -3 -4 -1 -30

SUN 0.41 0.42 0.41 0.41 0.44 0.45 0.42 0.42 0.43 0.43 0.42 0.42 - 0.43 0.42 0.42 0.42 0.45 0.29 0.43 SUN -4 -3 -4 -5 3 5 -3 -2 -1 -1 -1 -3 - 1 -3 -3 -2 4 -33

ScanNet 0.42 0.42 0.42 0.42 0.43 0.44 0.43 0.43 0.43 0.43 0.42 0.43 0.43 - 0.42 0.43 0.42 0.44 0.28 0.42 ScanNet 0 0 0 -1 2 5 1 2 1 1 0 1 1 - -1 1 0 5 -33

SUIM 0.72 0.73 0.71 0.71 0.74 0.73 0.74 0.73 0.73 0.73 0.74 0.72 0.74 0.73 - 0.73 0.73 0.74 0.48 0.76 SUIM -5 -3 -7 -7 -3 -4 -2 -3 -4 -5 -2 -5 -3 -4 - -4 -3 -3 -36

vKITTI2 0.58 0.58 0.58 0.58 0.58 0.58 0.58 0.58 0.58 0.61 0.63 0.63 0.62 0.58 0.63 - 0.58 0.61 0.56 0.58 vKITTI2 -1 -1 -1 -1 -1 -1 0 0 0 4 7 7 7 -1 7 - 0 5 -5

vGallery 0.78 0.79 0.79 0.78 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.78 0.79 - 0.80 0.72 0.79 vGallery -1 0 0 -1 0 0 0 0 0 0 0 0 0 0 -1 0 - 1 -9

(b) Full target training set. Underwater Underwater ILSVRC'12 Mapillary Aerial Consumer Driving Indoor Synthetic Aerial Consumer Driving Mapillary Indoor Synthetic PContext PContext ScanNet ScanNet vGallery vGallery ADE20K ADE20K CamVid CamVid Scratch Scratch vKITTI2 vKITTI2 COCO COCO ISPRS ISPRS PVOC PVOC iSAID iSAID SUIM SUIM KITTI KITTI BDD BDD SUN SUN City City IDD IDD

ISPRS - 0.38 0.35 0.36 0.32 0.35 0.37 0.31 0.28 0.36 0.40 0.32 0.26 0.31 0.33 0.33 0.38 0.09 0.32 ISPRS - 19 10 12 0 8 15 -5 -12 12 25 0 -20 -3 4 3 17 -72

iSAID 0.35 - 0.41 0.41 0.38 0.38 0.37 0.34 0.36 0.38 0.37 0.38 0.39 0.37 0.39 0.34 0.35 0.20 0.42 iSAID -18 - -4 -2 -10 -12 -13 -20 -14 -11 -14 -12 -9 -13 -7 -19 -18 -52

PContext 0.24 0.27 - 0.31 0.27 0.32 0.26 0.25 0.26 0.26 0.25 0.27 0.24 0.23 0.25 0.26 0.27 0.06 0.28 PContext -14 -4 - 12 -4 12 -7 -11 -8 -8 -11 -5 -13 -18 -10 -8 -5 -80

PVOC 0.42 0.50 0.64 - 0.48 0.59 0.49 0.47 0.47 0.50 0.49 0.50 0.45 0.43 0.48 0.48 0.49 0.08 0.54 PVOC -21 -7 18 - -11 11 -9 -13 -12 -8 -9 -7 -17 -20 -11 -11 -9 -84

ADE20K 0.22 0.23 0.22 0.22 - 0.23 0.23 0.22 0.23 0.23 0.22 0.23 0.22 0.21 0.22 0.23 0.22 0.08 0.23 ADE20K -5 0 -3 -3 - 1 -2 -3 -2 -2 -4 -1 -3 -9 -6 0 -3 -67

COCO 0.29 0.31 0.30 0.30 0.29 - 0.30 0.30 0.30 0.30 0.30 0.30 0.29 0.28 0.29 0.30 0.31 0.07 0.31 COCO -7 -1 -5 -3 -6 - -3 -4 -4 -5 -4 -4 -7 -9 -6 -3 -1 -77

KITTI 0.45 0.45 0.46 0.45 0.45 0.47 - 0.47 0.50 0.47 0.47 0.47 0.44 0.41 0.46 0.46 0.46 0.36 0.46 KITTI -2 -3 1 -2 -1 2 - 2 8 2 3 1 -4 -11 0 -1 1 -21

CamVid 0.52 0.51 0.51 0.52 0.51 0.51 0.54 - 0.55 0.54 0.55 0.53 0.49 0.51 0.50 0.50 0.50 0.40 0.49 CamVid 5 3 3 6 4 3 9 - 11 9 11 7 -2 4 1 1 1 -18

City 0.40 0.42 0.42 0.41 0.43 0.43 0.43 0.43 - 0.44 0.44 0.46 0.39 0.38 0.41 0.41 0.41 0.26 0.43 City -8 -3 -2 -4 -1 0 -1 -1 - 3 3 6 -9 -12 -5 -4 -5 -41

IDD 0.36 0.38 0.39 0.38 0.37 0.38 0.38 0.37 0.40 - 0.39 0.41 0.35 0.35 0.37 0.39 0.37 0.22 0.38 IDD -5 -2 0 -1 -3 -2 -1 -4 3 - 1 6 -8 -8 -3 0 -4 -44 To: Segmentation To: Segmentation BDD 0.41 0.44 0.46 0.47 0.46 0.45 0.46 0.46 0.49 0.49 - 0.52 0.41 0.41 0.44 0.44 0.44 0.27 0.46 BDD -10 -3 0 2 1 -1 1 1 8 7 - 14 -11 -10 -4 -2 -3 -41

Mapillary 0.25 0.26 0.27 0.26 0.26 0.26 0.26 0.25 0.28 0.28 0.28 - 0.24 0.23 0.26 0.26 0.26 0.11 0.26 Mapillary -7 -2 0 -1 -2 -1 -3 -4 7 5 5 - -9 -12 -4 -3 -3 -60

SUN 0.17 0.20 0.21 0.20 0.24 0.23 0.20 0.18 0.19 0.19 0.18 0.19 - 0.28 0.19 0.18 0.19 0.07 0.20 SUN -16 -1 3 -1 20 15 -1 -11 -7 -6 -10 -7 - 40 -4 -10 -5 -66

ScanNet 0.12 0.14 0.15 0.14 0.16 0.16 0.14 0.14 0.13 0.14 0.13 0.14 0.21 - 0.14 0.14 0.14 0.05 0.14 ScanNet -17 -3 1 -4 14 10 -4 -5 -8 -2 -9 -6 49 - -4 -5 -5 -62

SUIM 0.49 0.53 0.50 0.54 0.51 0.54 0.52 0.50 0.50 0.51 0.51 0.52 0.52 0.48 - 0.51 0.52 0.29 0.53 SUIM -8 1 -5 2 -3 2 -1 -5 -6 -4 -3 -2 -2 -9 - -4 -2 -46

vKITTI2 0.54 0.55 0.54 0.53 0.54 0.54 0.56 0.56 0.56 0.54 0.55 0.55 0.54 0.53 0.54 - 0.55 0.47 0.56 vKITTI2 -3 -2 -2 -4 -2 -2 0 1 0 -3 -1 -2 -2 -4 -3 - -2 -16

vGallery 0.76 0.77 0.77 0.77 0.77 0.77 0.78 0.78 0.77 0.77 0.77 0.77 0.77 0.76 0.76 0.77 - 0.70 0.78 vGallery -2 -1 -1 -1 -1 -1 0 1 -1 0 -1 0 -1 -1 -2 -1 - -10

(c) Small source and small target training set.

Fig. 10: Absolute performance and the corresponding relative transfer gains for semantic segmentation as both a source and target task type. Left tables: mean Intersection-over-Union. Right tables: relative transfer gains. FACTORS OF POSITIVE TRANSFER 20 ILSVRC'12 St. Dogs

Depth Detection Keypoints Depth Detection St. Dogs Keypoints vGallery vGallery Scratch Scratch 1.2M vKITTI2 vKITTI2 12K 12K 44K 44K COCO COCO COCO COCO 42K 42K 120K 120K 120K 120K Multi Multi UWT UWT BDD BDD SUN SUN 70K 70K MPII MPII 25K 25K 6K 6K 5K 5K

iSAID 0.37 0.28 0.34 0.30 0.25 0.31 0.34 0.35 0.37 0.27 0.20 0.42 iSAID -13 -35 -21 -29 -40 -28 -21 -17 -13 -36 -52

PContext 0.24 0.17 0.24 0.27 0.15 0.17 0.26 0.24 0.26 0.18 0.06 0.28 PContext -16 -40 -15 -4 -47 -39 -7 -14 -6 -37 -80

COCO 0.28 0.27 0.29 0.30 0.17 0.23 0.29 0.28 0.29 0.20 0.07 0.31 COCO -8 -14 -6 -2 -44 -24 -5 -8 -5 -36 -77

KITTI 0.45 0.43 0.44 0.44 0.47 0.43 0.45 0.44 0.44 0.42 0.36 0.46 KITTI -3 -6 -4 -4 3 -6 -1 -3 -3 -8 -21

CamVid 0.47 0.49 0.51 0.49 0.54 0.49 0.51 0.43 0.47 0.44 0.40 0.49 CamVid -4 0 4 -1 10 -1 4 -13 -4 -11 -18

IDD 0.37 0.35 0.37 0.36 0.37 0.33 0.38 0.36 0.37 0.34 0.22 0.38 IDD -4 -9 -5 -6 -3 -13 -2 -6 -3 -11 -44 To: Segmentation To: Segmentation SUN 0.22 0.12 0.18 0.18 0.13 0.14 0.20 0.18 0.19 0.15 0.07 0.20 SUN 8 -39 -12 -9 -34 -31 -3 -10 -4 -28 -66

ScanNet 0.16 0.09 0.12 0.12 0.09 0.10 0.13 0.13 0.14 0.10 0.05 0.14 ScanNet 13 -38 -13 -18 -36 -27 -11 -9 -3 -27 -62

SUIM 0.49 0.44 0.51 0.50 0.43 0.48 0.51 0.50 0.52 0.45 0.29 0.53 SUIM -8 -16 -4 -6 -19 -9 -3 -5 -1 -14 -46

Fig. 11: Transfer from other task types to semantic segmentation, in the small target training setting. Left: absolute metric (Intersection-over-Union) and the corresponding relative transfer gains (right). ILSVRC'12 Mapillary PContext

Detection Depth St. Dogs Keypoints Segmentation 1.2M vKITTI2 vKITTI2 VOC07 18K 12K COCO COCO COCO 5K 42K 42K 120K 120K 120K SUIM Multi Multi 1.5K 5K UWT BDD BDD SUN SUN 70K MPII 25K 6K 7K 5K 5K

VOC07 - 0.20 0.21 0.71 0.12 0.06 0.13 0.20 0.20 0.23 0.18 0.24 0.14 0.17 0.18 0.17 0.29 0.25 0.23

UWT 0.35 - 0.30 0.51 0.28 0.24 0.28 0.32 0.31 0.37 0.31 0.30 0.32 0.30 0.29 0.28 0.31 0.31 0.34

BDD 0.07 0.06 - 0.12 0.05 0.05 0.06 0.06 0.06 0.07 0.06 0.06 0.05 0.07 0.07 0.06 0.06 0.07 0.06 To: Detection COCO 0.10 0.08 0.08 - 0.06 0.02 0.07 0.08 0.08 0.09 0.08 0.08 0.07 0.08 0.08 0.08 0.09 0.09 0.08 Mapillary PContext

Detection Depth St. Dogs Keypoints Segmentation vKITTI2 vKITTI2 VOC07 18K 12K COCO COCO COCO 5K 42K 42K 120K 120K 120K SUIM Multi Multi 1.5K 5K UWT BDD BDD SUN SUN 70K MPII 25K 6K 7K 5K 5K

VOC07 - -14 -8 207 -49 -74 -45 -15 -11 0 -21 4 -38 -28 -22 -26 25 8

UWT 6 - -12 52 -18 -29 -17 -5 -8 9 -9 -11 -6 -12 -14 -16 -8 -9

BDD 15 -1 - 94 -16 -23 -2 -4 -3 19 -3 5 -12 14 23 -4 -1 15 To: Detection COCO 30 5 -5 - -26 -72 -14 -6 1 8 -6 -1 -11 -6 -4 -1 16 12

(a) Small target training set. Top: mean Average Precision. Bottom: relative transfer gain.

Detection Depth Keypoints Segmentation ILSVRC'12 St. Dogs 1.2M vKITTI2 vKITTI2 VOC07 12K COCO COCO COCO 42K 42K 120K 120K 120K SUIM Multi Multi Multi 1.5K 5K UWT BDD BDD SUN SUN 70K 6K 7K 5K 5K

VOC07 - 0.65 0.60 0.85 0.80 0.64 0.43 0.66 0.70 0.67 0.69 0.67 0.70 0.70 0.60 0.71 0.72

UWT 0.51 - 0.44 0.52 0.40 0.39 0.34 0.40 0.48 0.49 0.49 0.43 0.42 0.45 0.41 0.47 0.45

BDD 0.26 0.26 - 0.27 0.27 0.26 0.24 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.19 0.26 0.26 To: Detection COCO 0.40 0.39 0.39 - 0.38 0.38 0.36 0.38 0.38 0.39 0.38 0.38 0.38 0.39 0.37 0.38 0.39

Detection Depth Keypoints Segmentation St. Dogs vKITTI2 vKITTI2 VOC07 12K COCO COCO COCO 42K 42K 120K 120K 120K SUIM Multi Multi Multi 1.5K 5K UWT BDD BDD SUN SUN 70K 6K 7K 5K 5K

VOC07 - -10 -16 18 11 -11 -41 -8 -3 -7 -4 -7 -3 -2 -17 -1

UWT 15 - -1 16 -9 -12 -24 -10 7 10 11 -3 -5 0 -9 6

BDD 1 0 - 2 2 -1 -8 0 0 0 0 -1 0 0 -25 1 To: Detection COCO 1 0 1 - -2 -3 -9 -3 -2 0 -2 -2 -2 -1 -5 -3

(b) Full target training set. Top: mean Average Precision. Bottom: relative transfer gain.

Fig. 12: Absolute metrics and the corresponding relative transfer gains for object detection as a target task type. FACTORS OF POSITIVE TRANSFER 21

Keypoints Detection ILSVRC'12 Keypoints Detection ILSVRC'12 St. Dogs St. Dogs

Depth Segmentation 1.2M Depth Segmentation 1.2M 12K 12K COCO COCO COCO COCO COCO COCO 120K 120K 120K 120K 120K 120K Multi Multi Multi Multi Multi Multi SUN SUN MPII MPII 25K 25K 5K 5K

St. Dogs - 0.48 0.32 0.17 0.02 0.55 0.52 0.15 0.18 0.23 St. Dogs - 0.92 0.92 0.90 0.43 0.91 0.90 0.85 0.91 0.92

MPII 0.26 - 0.75 0.20 0.05 0.38 0.39 0.20 0.15 0.28 MPII 0.70 - 0.82 0.69 0.22 0.70 0.71 0.64 0.65 0.66

COCO 0.16 0.43 - 0.10 0.00 0.32 0.31 0.11 0.08 0.23 COCO 0.81 0.81 - 0.81 0.53 0.79 0.80 0.80 0.81 0.81 To: Keypoints To: Keypoints

Keypoints Detection Keypoints Detection St. Dogs Depth Segmentation St. Dogs Depth Segmentation 12K 12K COCO COCO COCO COCO COCO COCO 120K 120K 120K 120K 120K 120K Multi Multi Multi Multi Multi Multi SUN SUN MPII MPII 25K 25K 5K 5K

St. Dogs - 110 42 -25 -90 141 128 -33 -20 St. Dogs - 0 -1 -2 -53 -2 -3 -8 -1

MPII -8 - 166 -30 -81 36 37 -31 -46 MPII 7 - 26 6 -66 7 9 -2 -1

COCO -29 90 - -54 -98 39 39 -51 -63 COCO 0 0 - 0 -34 -2 -2 -1 0 To: Keypoints To: Keypoints

(a) Small target training set. Top: mean Average Precision at (b) Full target training set. Top: mean Average Precision at 0.5 0.5 Object Keypoint Similarity. Bottom: relative transfer gain. Object Keypoint Similarity. Bottom: relative transfer gain.

Fig. 13: Absolute metrics and the corresponding relative transfer gains for keypoint detection as target task type.

Depth Detection Keypoints Segmentation ILSVRC'12 Depth St. Dogs vGallery vGallery ILSVRC'12 Scratch 1.2M vKITTI2 vKITTI2 vGallery 12K 44K 44K COCO COCO COCO 1.2M 42K 42K 120K 120K 120K vKITTI2 Multi Multi UWT BDD BDD 44K SUN SUN 70K MPII 25K 42K 6K 7K 5K 5K Multi SUN 5K

SUN - 0.90 0.81 0.82 0.89 0.88 0.83 0.82 0.83 0.86 0.76 0.84 0.80 0.89 0.86 0.77 1.11 0.83 SUN - 0.629 0.610 0.613 0.592

vKITTI2 4.66 - 4.72 4.69 4.67 4.66 4.67 4.75 4.73 4.65 4.68 4.72 4.65 4.67 4.72 4.60 5.40 4.66 vKITTI2 4.051 - 4.132 4.068 4.182 To: Depth To: Depth vGallery 0.87 0.93 - 0.91 0.90 0.90 0.92 0.88 0.90 0.93 0.84 0.88 0.91 0.89 0.87 0.84 1.27 0.87 vGallery 0.850 0.859 - 0.883 0.861

Depth Detection Keypoints Segmentation Depth St. Dogs vGallery vGallery Scratch vKITTI2 vKITTI2 vGallery 12K 44K 44K COCO COCO COCO 42K 42K 120K 120K 120K vKITTI2 Multi Multi UWT BDD BDD 44K SUN SUN 70K MPII 25K 42K 6K 7K 5K 5K Multi SUN 5K

SUN - -9 2 1 -7 -6 0 1 0 -3 9 -1 4 -7 -3 7 -33 SUN - -6 -3 -3

vKITTI2 0 - -1 -1 0 0 0 -2 -1 0 0 -1 0 0 -1 1 -16 vKITTI2 3 - 1 3 To: Depth To: Depth vGallery 0 -6 - -5 -3 -3 -5 -1 -3 -6 3 -1 -5 -2 1 3 -46 vGallery 1 0 - -3

(a) Small target training set. Top: RMSE. Bottom: relative transfer gain. (b) Full target training set. Top: RMSE. Bottom: relative transfer gain. Depth Detection Keypoints Segmentation ILSVRC'12 Depth St. Dogs vGallery vGallery ILSVRC'12 Scratch 1.2M vKITTI2 vKITTI2 vGallery 12K 44K 44K COCO COCO COCO 1.2M 42K 42K 120K 120K 120K vKITTI2 Multi Multi UWT BDD BDD 44K SUN SUN 70K MPII 25K 42K 6K 7K 5K 5K Multi SUN 5K

SUN - 0.50 0.57 0.56 0.52 0.53 0.55 0.56 0.56 0.55 0.61 0.55 0.60 0.52 0.54 0.61 0.41 0.56 SUN - 0.74 0.76 0.76 0.77

vKITTI2 0.88 - 0.89 0.89 0.89 0.89 0.90 0.90 0.88 0.89 0.89 0.89 0.89 0.89 0.89 0.90 0.87 0.90 vKITTI2 0.93 - 0.93 0.93 0.93 To: Depth To: Depth vGallery 0.48 0.46 - 0.48 0.49 0.49 0.48 0.49 0.50 0.48 0.50 0.49 0.47 0.48 0.49 0.50 0.36 0.49 vGallery 0.50 0.49 - 0.47 0.49

Depth Detection Keypoints Segmentation Depth St. Dogs vGallery vGallery Scratch vKITTI2 vKITTI2 vGallery 12K 44K 44K COCO COCO COCO 42K 42K 120K 120K 120K vKITTI2 Multi Multi UWT BDD BDD 44K SUN SUN 70K MPII 25K 42K 6K 7K 5K 5K Multi SUN 5K

SUN - -9 2 2 -6 -4 -1 1 1 -2 10 -1 7 -6 -3 9 -26 SUN - -4 -1 -1

vKITTI2 -2 - 0 -1 -1 0 0 0 -1 -1 -1 -1 -1 -1 -1 0 -3 vKITTI2 0 - 0 0 To: Depth To: Depth vGallery -3 -7 - -3 -1 -1 -3 0 1 -3 1 0 -4 -2 0 1 -27 vGallery 1 0 - -4

(c) Small target training set. Top: δ < 1.25. Bottom: relative transfer gain. (d) Full target training set. Top: δ < 1.25. Bottom: relative transfer gain.

Fig. 14: Absolute metrics and the corresponding relative transfer gains for depth estimation as a target task type.