Pretraining boosts out-of-domain robustness for pose estimation

Alexander Mathis1,2∗†, Thomas Biasi2∗, Steffen Schneider1,3, Mert Yüksekgönül2,3, Byron Rogers4, Matthias Bethge3, Mackenzie W. Mathis1,2‡

1EPFL  2Harvard University  3University of Tübingen  4Performance Genetics

∗Equal contribution. †Correspondence: [email protected]  ‡Correspondence: [email protected]

Abstract

Neural networks are highly effective tools for pose estimation. However, as in other computer vision tasks, robustness to out-of-domain data remains a challenge, especially for small training sets that are common for real-world applications. Here, we probe the generalization ability with three architecture classes (MobileNetV2s, ResNets, and EfficientNets) for pose estimation. We developed a dataset of 30 horses that allowed for both "within-domain" and "out-of-domain" (unseen horse) benchmarking—this is a crucial test for robustness that current human pose estimation benchmarks do not directly address. We show that better ImageNet-performing architectures perform better on both within- and out-of-domain data if they are first pretrained on ImageNet. We additionally show that better ImageNet models generalize better across animal species. Furthermore, we introduce Horse-C, a new benchmark for common corruptions for pose estimation, and confirm that pretraining increases performance in this domain shift context as well. Overall, our results demonstrate that transfer learning is beneficial for out-of-domain robustness.

Figure 1. Transfer learning boosts performance, especially on out-of-domain data: using pretrained ImageNet models gives a 2X boost on out-of-domain data vs. from-scratch training. Normalized pose estimation error vs. ImageNet Top-1 accuracy for different backbones (MobileNetV2-0.35, MobileNetV2-1, ResNet-50, EfficientNet-B0, EfficientNet-B3) on train, within-domain test, and out-of-domain test data, with transfer learning and trained from scratch. While training from scratch reaches the same task performance as fine-tuning, the networks remain less robust, as demonstrated by poor accuracy on out-of-domain horses. Mean ± SEM, 3 shuffles.

1. Introduction

Pose estimation is an important tool for measuring behavior, and thus widely used in technology, medicine and biology [5, 39, 30, 34]. Due to innovations in both deep learning algorithms [17, 7, 11, 24, 38, 8] and large-scale datasets [29, 4, 3], pose estimation on humans has become very powerful. However, typical human pose estimation benchmarks, such as MPII pose and COCO [29, 4, 3], contain many different individuals (>10k) in different contexts, but only very few example postures per individual. In real-world applications of pose estimation, users often want to create customized networks that estimate the location of user-defined bodyparts by only labeling a few hundred frames on a small subset of individuals, yet want this to generalize to new individuals [39, 30, 34, 42]. Thus, one naturally asks the following question: Assume you have trained an algorithm that performs with high accuracy on a given (individual) animal for the whole repertoire of movement—how well will it generalize to different individuals that have slightly or dramatically different appearances? Unlike in common human pose estimation benchmarks, here the setting is that datasets have many (annotated) poses per individual (>200) but only a few individuals (≈10).

To allow the field to tackle this challenge, we developed a novel benchmark comprising 30 diverse Thoroughbred horses, for which 22 body parts were labeled by an expert in 8,114 frames (dataset available at http://horse10.deeplabcut.org). Horses have various coat colors and the "in-the-wild" aspect of the collected data at various Thoroughbred farms added additional complexity. With this dataset we could directly test the effect of pretraining on out-of-domain data.

Here we report two key insights: (1) ImageNet performance predicts generalization both within domain and on out-of-domain data for pose estimation; (2) while we confirm that task-training can catch up with fine-tuning pretrained models given sufficiently large training sets [10], we show this is not the case for out-of-domain data (Figure 1). Thus, transfer learning improves robustness and generalization. Furthermore, we contrast the domain shift inherent in this dataset with domain shift induced by common image corruptions [14, 35], and we find pretraining on ImageNet also improves robustness against those corruptions.

Figure 2. Horse Dataset: Example frames for each Thoroughbred horse in the dataset. The videos vary in horse color, background, lighting conditions, and relative horse size. The sunlight variation between each video added to the complexity of the learning challenge, as well as the handlers often wearing horse-leg-colored clothing. Some horses were in direct sunlight while others had the light behind them, and others were walking into and out of shadows, which was particularly problematic with a dataset dominated by dark-colored coats. To illustrate the Horse-10 task we arranged the horses according to one split: the ten leftmost horses were used for train/test within-domain, and the rest are the out-of-domain held-out horses.

2. Related Work

2.1. Pose and keypoint estimation

Typical human pose estimation benchmarks, such as MPII pose and COCO [29, 4, 3], contain many different individuals (>10k) in different contexts, but only very few example postures per individual. Along similar lines, but for animals, Cao et al. created a dataset comprising a few thousand images for five domestic animals with one pose per individual [6]. There are also papers for facial keypoints in horses [41] and sheep [47, 16], and recently a large-scale dataset featuring 21.9K faces from 334 diverse species was introduced [20]. Our work adds a dataset comprising multiple different postures per individual (>200) from 30 diverse race horses, for which 22 body parts were labeled by an expert in 8,114 frames. This pose estimation dataset allowed us to address within- and out-of-domain generalization. Our dataset could be important for further testing and developing recent work on domain adaptation in animal pose estimation on a real-world dataset [28, 42, 36].

2.2. Transfer learning

Transfer learning has become accepted wisdom: fine-tuning pretrained weights of large-scale models yields the best results [9, 48, 25, 33, 27, 49]. He et al. nudged the field to rethink this accepted wisdom by demonstrating that for various tasks, directly training on the task data can match this performance [10]. We confirm this result, but show that on held-out individuals ("out-of-domain") this is not the case. Raghu et al. showed that for target medical tasks (with little similarity to ImageNet) transfer learning offers little benefit over lightweight architectures [40]. Kornblith et al. showed for many object recognition tasks that better ImageNet performance leads to better performance on these other benchmarks [23]. We show that this is also true for pose estimation, both for within-domain and out-of-domain data (on different horses, and for different species), as well as for corruption resilience.

Figure 3. Transfer learning boosts performance, especially on out-of-domain data. A: Illustration of the normalized error metric, i.e. error measured as a fraction of the distance from nose to eye (which is 30 cm on a horse). B: Normalized error vs. network performance as ranked by Top-1 accuracy on ImageNet (ordered by increasing ImageNet performance: MobileNetV2-0.35, MobileNetV2-0.5, MobileNetV2-0.75, MobileNetV2-1.0, ResNet-50, ResNet-101, EfficientNets B0 to B6), using a 5% training fraction. The faint lines indicate data for the three splits. Test data is in red, train is blue, grey is out-of-domain data. C: Same as B but with a 50% training fraction. D: Example frames with human-annotated body parts vs. predicted body parts for the MobileNetV2-0.35 and EfficientNet-B6 architectures with ImageNet pretraining on out-of-domain horses.

What is the limit of transfer learning? Would ever larger datasets give better generalization? Interestingly, it appears to strongly depend on what task the network was pretrained on. Recent work by Mahajan et al. showed that pretraining for large-scale hashtag prediction on Instagram data (3.5 billion images) improves classification, while at the same time possibly harming localization performance for tasks like object detection, instance segmentation, and keypoint detection [32]. This highlights the importance of the task, rather than the sheer size, as a crucial factor. Further corroborating this insight, Li et al. showed that pretraining on a large-scale object detection task can improve performance for tasks that require fine, spatial information like segmentation [27]. Thus, one interesting future direction to boost robustness could be to utilize networks pretrained on OpenImages, which contains bounding boxes for 15 million instances and close to 2 million images [26].

2.3. Robustness

Studying robustness to common image corruptions based on benchmarks such as ImageNet-C [14, 35, 44] is a fruitful avenue for making models more robust. Apart from evaluating our pose estimation algorithms on novel horses (domain shift), we also investigate robustness with respect to image corruptions. Hendrycks et al. study robustness to out-of-distribution data on CIFAR-10, CIFAR-100 and TinyImageNet (but not pose estimation); the authors report that pretraining is important for adversarial robustness [15]. Shah et al. found that pose estimation algorithms are highly robust against adversarial attacks [45], but they neither directly test out-of-domain robustness on different individuals, nor robustness to common image corruptions, as we do in this study.

3. Data and Methods

3.1. Datasets and evaluation metrics

We developed a novel horse dataset comprising 8,114 frames across 30 different horses captured for 4–10 seconds with a GoPro camera (resolution: 1920×1080, frame rate: 60 FPS), which we call Horse-30 (Figure 2). We downsampled the frames to 15% of the original resolution to speed up the benchmarking process (288×162 pixels; one video was downsampled to 30%). We annotated 22 previously established anatomical landmarks for equines [31, 2]. The following 22 body parts were labeled in 8,114 frames: Nose, Eye, Nearknee, Nearfrontfetlock, Nearfrontfoot, Offknee, Offfrontfetlock, Offfrontfoot, Shoulder, Midshoulder, Elbow, Girth, Wither, Nearhindhock, Nearhindfetlock, Nearhindfoot, Hip, Stifle, Offhindhock, Offhindfetlock, Offhindfoot, Ischium. We used the DeepLabCut 2.0 toolbox [37] for labeling. We created 3 splits that contain 10 randomly selected training horses each (referred to as Horse-10). For each training set we took a subset of 5% (≈160 frames) and 50% (≈1,470 frames) of the frames for training, and then evaluated the performance on the training, test, and unseen (defined as "out-of-domain") horses (i.e. the other horses that were not in the given split of Horse-10). As the horses could vary dramatically in size across frames, due to the "in-the-wild" variation in distance from the camera, we normalized the raw pixel errors by the eye-to-nose distance and report the fraction of this distance (normalized error), as well as the percent correct keypoint metric [4]; we used a matching threshold of 30% of the head segment length (nose to eye per horse; see Figure 3A).
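To make the evaluation protocol concrete, the following is a minimal NumPy sketch of the two metrics described above (the normalized error and [email protected]). The array layout and the helper names are illustrative assumptions, not code from the benchmark; in practice, keypoints without ground-truth labels would additionally need to be masked out.

```python
import numpy as np

def normalized_errors(pred, gt, eye, nose):
    """Per-keypoint errors as a fraction of the eye-to-nose distance.

    pred, gt: (num_frames, num_keypoints, 2) arrays of (x, y) coordinates.
    eye, nose: (num_frames, 2) ground-truth eye and nose positions per frame.
    """
    head_len = np.linalg.norm(eye - nose, axis=-1, keepdims=True)  # (num_frames, 1)
    pixel_err = np.linalg.norm(pred - gt, axis=-1)                 # (num_frames, num_keypoints)
    return pixel_err / head_len

def pck_at_03(pred, gt, eye, nose):
    """[email protected]: fraction of keypoints within 30% of the nose-to-eye segment length."""
    return float(np.mean(normalized_errors(pred, gt, eye, nose) < 0.3))
```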

For the generalization experiments, we also tested on the Animal Pose dataset [6] to test the generality of our findings (Figure 4). We extracted all single-animal images from this dataset, giving us 1,091 cat, 1,176 dog, 486 horse, 237 cow, and 214 sheep images. To note, we corrected errors in the ground truth labels for the dogs (in about 10% of frames). Because nearly all images in this dataset are twice the size of the Horse-10 data, we also downsampled the images by a factor of 2 before training and testing. Given the lack of a consistent eye-to-nose distance across the dataset due to the varying orientations, we normalized as follows: the raw pixel errors were normalized by the square root of the bounding box area for each individual. For training the various architectures, the best schedules from cross validation on Horse-10 were used (see Section 3.2).

We also applied common image corruptions [14] to the Horse-10 dataset, yielding a variant of the benchmark which we refer to as Horse-C. Horse-C images are corrupted with 15 forms of digital transforms, blurring filters, point-wise noise, or simulated weather conditions. All conditions are applied following the evaluation protocol and implementation by Michaelis et al. [35]. In total, we arrived at 75 variants of the dataset (15 different corruptions at 5 different severities), yielding over 600K images.

3.2. Architectures and Training Parameters

We utilized the pose estimation toolbox DeepLabCut [33, 37, 18], added MobileNetV2 [43] and EfficientNet backbones [46] to the ResNets [13] that were present, and added imgaug for data augmentation [19]. The TensorFlow [1]-based network architectures could be easily exchanged while keeping data loading, training, and evaluation consistent. The feature detectors in DeepLabCut consist of a backbone followed by deconvolutional layers to predict pose scoremaps and location refinement maps (offsets), which can then be used for predicting the pose while also providing a confidence score. As previously, for the ResNet backbones we utilize an output stride of 16 and then upsample the filter banks with deconvolutions by a factor of two to predict the heatmaps and location refinement at 1/8th of the original image size [18, 33]. For MobileNetV2 [43], we configured the output stride as 16 (by changing the last stride-2 convolution to stride 1). We utilized four variants of MobileNetV2 with different expansion ratios (0.35, 0.5, 0.75 and 1), as this ratio modulates the ImageNet accuracy from 60.3% to 71.8%, and used pretrained ImageNet models from TensorFlow [1, 43]. The base EfficientNet model was designed by Tan et al. [46] through a neural architecture search to optimize for accuracy and inverse FLOPs. From B0 to B6, compound scaling is used to increase the width, depth, and resolution of the network, which directly corresponds to an increase in ImageNet performance [46]. We used the AutoAugment pretrained checkpoints from TensorFlow and likewise adapted the EfficientNets' output stride to 16 (by changing the otherwise last stride-2 convolution to stride 1).

The training loss is defined as the cross-entropy loss for the scoremaps plus the location refinement error via a Huber loss with weight 0.05 [33]. The loss is minimized via ADAM with batch size 8 [21]. For training, a cosine learning rate schedule, as in [23], with the ADAM optimizer and batch size 8 was used; we also performed augmentation, using imgaug [19], with random cropping and rotations. Initial learning rates and decay target points were cross-validated for MobileNetV2 0.35 and 1.0, ResNet-50, and EfficientNet B0, B3, and B5 for the pretrained and from-scratch models (see Supplementary Material). For each model that was not cross-validated (MobileNetV2 0.5 and 0.75, ResNet-101, EfficientNet B1, B2, B4, B6), the best performing training parameters from the most similar cross-validated model were used (i.e., the cross-validated EfficientNet-B0 schedule was used for EfficientNet-B1; see Supplementary Material). For MobileNetV2s, we trained the batch normalization layers too (this had little effect on task performance for MobileNetV2-0.35). Pretrained models were trained for 30k iterations (as they converged), while models from scratch were trained for 180k iterations. From-scratch variants of the architectures used He initialization [12], while all pretrained networks were initialized from their ImageNet-trained weights.

3.3. Cross Validation of Learning Schedules

To fairly compare the pose estimation networks with different backbones, we cross-validated the learning schedules. For models with pretraining and from scratch, we cross-validated the cosine learning rate schedules by performing a grid search over potential initial learning rates and decay targets to optimize their performance on out-of-domain data. Given that our main result is that while task-training can catch up with fine-tuning pretrained models given sufficiently large training sets on within-domain data (consistent with [10]), we will show that this is not the case for out-of-domain data; thus, in order to give models trained from scratch the best shot, we optimized their performance on out-of-domain data. Tables in the Supplementary Material describe the various initial learning rates explored during cross validation as well as the best learning schedules for each model.

3.4. Similarity Analysis

To elucidate the differences between pretrained models and models trained from scratch, we analyzed the representational similarity between the variants. We use linear centered kernel alignment (CKA) [22] to compare the image representations at various depths in the backbone networks. The results were aggregated with respect to the similarity of representations of within-domain images versus out-of-domain images, and averaged over the three shuffles.
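For reference, the linear CKA measure of Kornblith et al. [22] used in this analysis can be written in a few lines. The sketch below assumes that the per-layer activations have already been extracted from the two backbones for the same set of images and flattened to (images x features) matrices; that extraction step is not shown.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices.

    X: (n_images, d1) activations from one network/layer.
    Y: (n_images, d2) activations from another network/layer (same images, same order).
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2   # ||Y^T X||_F^2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```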

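As an illustration of how corrupted variants of the kind used for Horse-C (Section 3.1) can be generated, the sketch below uses the imagecorruptions package released alongside [35]. The file paths and output layout are placeholders, and the exact generation script used for the benchmark may differ.

```python
import os
import numpy as np
from PIL import Image
from imagecorruptions import corrupt, get_corruption_names

os.makedirs("horse_c", exist_ok=True)
frame = np.asarray(Image.open("frame_0001.png").convert("RGB"))  # placeholder path

# 15 corruption types x 5 severity levels = 75 corrupted variants per frame.
for name in get_corruption_names():  # e.g. 'gaussian_noise', 'motion_blur', 'snow', ...
    for severity in range(1, 6):
        corrupted = corrupt(frame, corruption_name=name, severity=severity)
        Image.fromarray(corrupted).save(f"horse_c/{name}_s{severity}_frame_0001.png")
```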
4. Results

To test within- and out-of-domain performance we created a new dataset of 30 different Thoroughbreds that are led by different humans, resulting in a dataset of 8,114 images with 22 labeled body parts each. These videos differ strongly in horse appearance, context, and background (Figure 2). Thus, this dataset is ideal for testing robustness and out-of-sample generalization. We created 3 splits containing 10 random horses each, and then varied the amount of training data from these 10 horses (referred to as Horse-10, see Methods). As the horses could vary dramatically in size across frames, due to the "in-the-wild" variation in distance from the camera, we used a normalized pixel error; i.e. we normalized the raw pixel errors by the eye-to-nose distance and report the fraction of this distance (see Methods).

Table 1. Average [email protected] (%).

Model               Within domain   Out-of-domain
MobileNetV2-0.35         95.2            63.5
MobileNetV2-0.5          97.1            70.4
MobileNetV2-0.75         97.8            73.0
MobileNetV2-1            98.8            77.6
ResNet-50                99.8            81.3
ResNet-101               99.9            84.3
EfficientNet-B0          99.9            81.6
EfficientNet-B1          99.9            84.5
EfficientNet-B2          99.9            84.3
EfficientNet-B3          99.9            86.6
EfficientNet-B4          99.9            86.9
EfficientNet-B5          99.9            87.7
EfficientNet-B6          99.9            88.4

Figure 4. Generalization Across Species. Normalized pose estimation error vs. ImageNet Top-1 accuracy with different backbones (as in Figure 1), but for 10 additional out-of-domain tests: training on either a single species or four species while holding one species (either cow or sheep) out.

4.1. ImageNet performance vs. task performance

To probe the impact of ImageNet performance on pose estimation robustness, we selected modern convolutional architectures as backbones with a wide range of ImageNet performance (see Methods; 13 models spanning from 60% to 84% ImageNet performance). To fairly compare the MobileNetV2, ResNet and EfficientNet backbones, we cross-validated the learning schedules for each model (see Methods). In total, we found that all ImageNet-pretrained architectures exhibited strong performance on Horse-10 within domain, i.e. low average errors and high average percent correct keypoints (aPCK; Figure 3, Table 1). Performance on Horse-10 within domain also closely matched performance on Horse-30 (see Supplementary Material). Next, we directly compared the ImageNet performance to their respective performance on this pose estimation task. We found that Top-1 ImageNet accuracy correlates with pose estimation test error (linear fit for test: slope −0.33%, R² = 0.93, p = 1.4 × 10⁻⁷; Figure 3). Results for different bodyparts are displayed in Table 2.

4.2. Generalization to novel horses

Next, we evaluated the performance of the networks on different horses in different contexts, i.e. out-of-domain horses (Figures 3A–C). Most strikingly, on out-of-domain horses, the relationship between ImageNet performance and performance on Horse-10 was even stronger. This can be quantified by comparing the linear regression slope for out-of-domain test data: −0.93% pose-estimation improvement per percentage point of ImageNet performance, R² = 0.93, p = 9 × 10⁻⁸, vs. −0.33% for within-domain test data, R² = 0.93, p = 1.4 × 10⁻⁷ (for 50% training data). Results for several different bodyparts of the full 22 are displayed in Table 3, highlighting that better models also generalized better in a bodypart-specific way. In other words, less powerful models overfit more on the training data.

4.3. Generalization across species

Does the improved generalization to novel individuals also hold for a more difficult out-of-domain generalization, namely, across species? Thus, we turned to a pose estimation dataset comprising multiple species.
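The slopes, R² and p-values reported in Sections 4.1 and 4.2 correspond to ordinary least-squares fits of mean normalized test error against backbone ImageNet Top-1 accuracy. A minimal sketch of such a fit is shown below; the numbers are placeholders, not the measured values.

```python
from scipy import stats

# ImageNet Top-1 accuracy (%) per backbone and the corresponding mean normalized
# test error on Horse-10; both lists are illustrative placeholders.
imagenet_top1 = [60.3, 65.4, 69.8, 71.8, 76.0, 77.2, 81.6]
normalized_error = [0.28, 0.24, 0.21, 0.20, 0.17, 0.16, 0.13]

fit = stats.linregress(imagenet_top1, normalized_error)
print(f"slope per % ImageNet accuracy: {fit.slope:.3f}, "
      f"R^2: {fit.rvalue ** 2:.2f}, p: {fit.pvalue:.1e}")
```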

Table 2. [email protected] (%) for several bodyparts and architectures on within-domain horses (FF = front foot; HF = hind foot; HH = hind hock).

                    Nose   Eye    Shoulder  Wither  Elbow  NearFF  OffFF  Hip    NearHH  NearHF  OffHF
MobileNetV2 0.35    90.7   94.1   97.6      96.9    96.7   92.3    93.7   96.4   94.1    94.2    92.5
MobileNetV2 1.0     97.7   98.8   99.7      99.1    99.0   97.6    97.3   99.4   98.4    98.5    98.9
ResNet 50           99.9   100.0  99.8      99.9    99.8   99.8    99.6   99.9   99.9    99.6    99.8
ResNet 101          99.9   100.0  99.9      99.8    99.9   99.8    99.7   99.8   99.9    99.7    99.9
EfficientNet-B0     99.7   99.9   100.0     99.9    100.0  99.6    99.5   100.0  99.9    99.7    99.7
EfficientNet-B3     99.9   99.9   99.9      99.9    99.9   99.7    99.6   99.7   99.8    99.6    99.9
EfficientNet-B6     99.9   99.9   99.9      99.8    100.0  99.8    99.9   99.8   99.8    99.7    99.8

Table 3. [email protected] (%) for several bodyparts and architectures on out-of-domain horses (FF=front foot; HF = Hind foot; HH = Hind Hock). Nose Eye Shoulder Wither Elbow NearFF OffFF Hip NearHH NearHF OffHF MobileNetV2 0.35 45.6 53.1 65.5 68.0 69.1 56.4 57.6 65.9 65.9 60.5 62.5 MobileNetV2 1.0 59.0 67.2 83.8 79.7 84.0 70.1 72.1 82.0 79.9 76.0 76.7 ResNet 50 68.2 73.6 85.4 85.8 88.1 72.6 70.2 89.2 85.7 77.0 74.1 ResNet 101 67.7 72.4 87.6 86.0 89.0 79.9 78.0 92.6 87.2 83.4 80.0 EfficientNet-B0 60.3 62.5 84.9 84.6 87.2 77.0 75.4 86.7 86.7 79.6 79.4 EfficientNet-B3 71.7 76.6 88.6 88.7 92.0 80.4 81.8 90.6 90.8 85.0 83.6 EfficientNet-B6 74.7 79.7 90.3 89.8 92.8 83.6 84.4 92.1 92.1 87.8 85.3

We evaluated the performance of the various architectures on the Animal Pose dataset from Cao et al. [6]. Here, images and poses of horses, dogs, sheep, cats, and cows allow us to test performance across animal classes. Using ImageNet pretraining and the cross-validated schedules from our Horse-10 experiments, we trained on individual animal classes or multiple animal classes (holding out sheep/cows) and examined how the architectures generalized to sheep/cows, respectively (Figure 4). For both cows and sheep, better ImageNet architectures trained on the pose data of other animal classes performed better in most cases. We mused that this improved generalization could be a consequence of the ImageNet pretraining or the architectures themselves. Therefore, we turned to Horse-10 and trained the different architectures directly on horse pose estimation from scratch.

4.4. Task-based training from scratch

To assess the impact of ImageNet pretraining we also trained several architectures from scratch. Thereby we could directly test if the increased slope for out-of-domain performance across networks was merely a result of more powerful network architectures. He et al. demonstrated that training Mask R-CNN with ResNet backbones directly on the COCO object detection, instance segmentation and keypoint detection tasks catches up with the performance of ImageNet-pretrained variants if training for substantially more iterations than typical training schedules [10]. However, due to the nature of these tasks, they could not test this relationship on out-of-domain data. For fine-tuning from ImageNet-pretrained models, we trained for 30k iterations (as the loss had flattened; see Figure 5). First, we searched for the best performing schedules for training from scratch while substantially increasing the training time (6X longer). We found that cosine decay with restarts was the best for out-of-domain performance (see Methods; Figure 5).

Using this schedule, and consistent with He et al. [10], we found that randomly initialized networks could closely match the performance of pretrained networks, given enough data and time (Figure 5). As expected, for smaller training sets (5% training data; 160 images), this was not the case (Figure 5). While task-training could therefore match the performance of pretrained networks given enough training data, this was not the case for novel horses (out-of-domain data). The networks trained from scratch never caught up and indeed plateaued early (Figure 5; Figure 1). Quantitatively, we also found that for stronger networks (ResNets and EfficientNets) generalization was worse if trained from scratch (Figure 1). Interestingly, that was not the case for the lightweight models, i.e. MobileNetV2s (cf. [40]).

4.5. Network similarity analysis

We hypothesized that the differences in generalization are due to more invariant representations in networks with higher ImageNet performance, and tested this with Centered Kernel Alignment (CKA) [22]. We first verified that the representations change with task training (Supplementary Material Figure 1). We then compared the representations of within-domain and out-of-domain images across networks trained from ImageNet vs. from scratch. We found that early blocks are similar for from-scratch vs. transfer learning for both sets of horses. In later layers, the representations diverge, but comparisons between within-domain and out-of-domain trends were inconclusive as to why, e.g., EfficientNets generalize better (Supplementary Material Figure 2).
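Section 4.4 above selected cosine decay with restarts for from-scratch training. A minimal tf.keras sketch of such a schedule is given below; the initial rate, cycle length, and multipliers are illustrative, not the cross-validated values reported in the Supplementary Material.

```python
import tensorflow as tf

# Cosine learning rate schedule with warm restarts (illustrative hyperparameters).
schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=1e-3,  # placeholder; cross-validated per backbone in practice
    first_decay_steps=30_000,    # length of the first cosine cycle
    t_mul=2.0,                   # each restart cycle is twice as long as the previous one
    m_mul=0.5,                   # each restart starts at half the previous peak rate
    alpha=1e-2,                  # learning rate floor, as a fraction of the initial rate
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)  # ADAM, as in Section 3.2
```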

Figure 5. Training randomly initialized networks longer cannot rescue out-of-domain performance. Top row: best performing (cross-validated) learning schedules used for training. Middle row: normalized error vs. training iterations for MobileNetV2-0.35, ResNet-50 and EfficientNet-B0 using 50% of the training data. Test errors when training from scratch (solid lines) closely match the transfer learning (dashed lines) performance after many iterations. Crucially, out-of-domain testing does not approach the performance of the pretrained networks (stars). Bottom row: same as the middle row but using 5% of the training data; note, however, that for just 5% training data the test errors do not approach the test error of pretrained models for larger models.

4.6. Horse-C: Robustness to image corruptions

To elucidate the difficulty of the Horse-10 benchmark, we more broadly evaluate pose estimation performance under different forms of domain shift (Figure 6). Recently, Schneider, Rusak et al. demonstrated that simple unsupervised domain adaptation methods can greatly enhance performance on corruption robustness benchmarks [44]. We therefore settled on a full adaptation evaluation protocol: we re-trained MobileNetV2 0.35 and 1.0, ResNet-50, as well as EfficientNet B0 and B3 with batch normalization enabled. During evaluation we then re-computed separate batch norm statistics for each horse and corruption type.

We use batch norm adaptation [44] during our evaluation on Horse-C. On clean out-of-domain data, we see improvements for MobileNetV2s and ResNets when using pretrained networks, and for all models when training models from scratch (Figure 7). On common corruptions, utilizing adaptation is crucial to final performance (see full results in the Supplementary Material). In the batch norm adapted models, we compared four test conditions comprised of within-domain and out-of-domain identities for both original (clean) and corrupted images (Figure 6). First, we find that even with batch norm adapted models, Horse-C is as hard as Horse-10; namely, performance is significantly affected on corrupted data (Figure 8). Secondly, we find that corruption plus "out-of-domain" identity is even harder—the performance degradation induced by different horse identities is on the same order of magnitude as the mean error induced on the corrupted dataset. Finally, and consistent with our other results, we found a performance gain by using pretrained networks (Figure 8).
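The batch norm adaptation used for the Horse-C evaluation amounts to re-estimating the BatchNorm statistics from unlabeled images of a given test horse or corruption while leaving all learned weights untouched [44]. A hedged tf.keras sketch of this idea follows; the model and dataset objects are assumptions, and DeepLabCut's own implementation differs in its details.

```python
import tensorflow as tf

def adapt_batchnorm(model: tf.keras.Model, images: tf.data.Dataset, steps: int = 50) -> tf.keras.Model:
    """Re-estimate BatchNorm statistics on unlabeled test-domain images.

    Only forward passes are run and no gradients are applied, so all learned weights
    stay unchanged; calling the model with training=True lets each BatchNormalization
    layer (left trainable, the default) update its moving mean/variance toward the
    statistics of the new domain.
    """
    for batch in images.take(steps):
        model(batch, training=True)
    return model

# Hypothetical usage: adapt separately for each held-out horse or corruption type,
# then evaluate the adapted statistics on the corresponding images.
```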

Figure 6. Measuring the impact of common image corruptions on pose estimation (Horse-C): we adapt the image corruptions considered by Hendrycks et al. and contrast the impact of common image corruptions with that of out-of-domain evaluation.

Figure 7. Impact of test-time normalization (normalized test error vs. ImageNet Top-1 accuracy, for same and different identity). Models trained with adaptive BN layers slightly outperform our baseline models for the MobileNetV2 and ResNet architectures in the out-of-domain evaluation. Lines with alpha transparency represent fixed models (vs. adapted). Mean ± SEM is shown.

Figure 8. Impact of the distribution shift introduced by horse identities and common corruptions (normalized test error vs. ImageNet Top-1 accuracy). We tested within identity (i.e., equivalent to within-domain in Horse-10; left) or out-of-domain identity (right). Lines with alpha transparency represent corrupted images, whereas solid lines show the original (clean) images. Mean ± SEM is shown.

5. Discussion and conclusions

We developed a novel pose estimation benchmark for out-of-domain robustness (Horse-10), and for testing common image corruptions on pose estimation (Horse-C). The data and benchmarks are available at http://horse10.deeplabcut.org. Furthermore, we report two key findings: (1) pretrained ImageNet networks offer known advantages (shorter training times and lower data requirements) as well as a novel advantage: robustness on out-of-domain data; and (2) pretrained networks that have higher ImageNet performance lead to better generalization. Collectively, this sheds new light on the inductive biases of "better ImageNet architectures", which prove particularly beneficial for robustness on visual tasks.

We introduced novel DeepLabCut model variants, part of https://github.com/DeepLabCut/DeepLabCut, that can achieve high accuracy, but with higher inference speed (up to double) than the original ResNet backbone (see Supplementary Material for an inference speed benchmark).

In summary, transfer learning offers multiple advantages. Not only does pretraining networks on ImageNet allow for using smaller datasets and shorter training times (Figure 5), it also strongly improves robustness and generalization, especially for more powerful, over-parameterized models. In fact, we found a strong correlation between generalization and ImageNet accuracy (Figure 3). While we found a significant advantage (>2X boost) of using pretrained networks vs. from scratch for out-of-domain robustness, there is still a 3-fold difference in performance between within-domain and out-of-domain data (Figure 1). We believe that this work demonstrates that transfer learning approaches are powerful for building robust architectures, and are particularly important for further developing performance improvements on real-world datasets, such as Horse-10 and derived benchmarks such as Horse-C. Furthermore, by sharing our animal pose robustness benchmark dataset, we also believe that the community can collectively work towards closing the gap.

Acknowledgements: Funding was provided by a Rowland Fellowship (MWM), a CZI EOSS Grant (MWM & AM), the Bertarelli Foundation (MWM), and the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A; StS & MB). St.S. thanks the International Max Planck Research School for Intelligent Systems and acknowledges his membership in the European Laboratory for Learning and Intelligent Systems (ELLIS) PhD program. We thank Maxime Vidal for ground truth corrections to the Animal Pose dataset.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] TM Anderson and CW McIlwraith. Longitudinal development of equine conformation from weanling to age 3 years in the thoroughbred. Equine Veterinary Journal, 36(7):563–570, 2004.
[3] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. Posetrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
[4] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
[5] Daniel Bachmann, Frank Weichert, and Gerhard Rinkenauer. Evaluation of the leap motion controller as a new contact-free pointing device. Sensors, 15(1):214–233, 2015.
[6] Jinkun Cao, Hongyang Tang, Hao-Shu Fang, Xiaoyong Shen, Cewu Lu, and Yu-Wing Tai. Cross-domain adaptation for animal pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9498–9507, 2019.
[7] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
[8] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S Huang, and Lei Zhang. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5386–5395, 2020.
[9] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
[10] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
[15] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. In ICML, 2019.
[16] Charlie Hewitt and Marwa Mahmoud. Pose-informed face alignment for extreme head pose variations in animals. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–6. IEEE, 2019.
[17] Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. Arttrack: Articulated multi-person tracking in the wild. In CVPR, 2017.
[18] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
[19] Alexander B. Jung, Kentaro Wada, Jon Crall, Satoshi Tanaka, Jake Graving, Christoph Reinders, Sarthak Yadav, Joy Banerjee, Gábor Vecsei, Adam Kraft, Zheng Rui, Jirka Borovec, Christian Vallentin, Semen Zhydenko, Kilian Pfeiffer, Ben Cook, Ismael Fernández, François-Michel De Rainville, Chi-Hung Weng, Abner Ayala-Acevedo, Raphael Meudec, Matias Laporte, et al. imgaug. https://github.com/aleju/imgaug, 2020. Online; accessed 01-Feb-2020.
[20] Muhammad Haris Khan, John McDonagh, Salman Khan, Muhammad Shahabuddin, Aditya Arora, Fahad Shahbaz Khan, Ling Shao, and Georgios Tzimiropoulos. Animalweb: A large-scale hierarchical dataset of annotated animal faces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6939–6948, 2020.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
[22] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414, 2019.
[23] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
[24] Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. Pifpaf: Composite fields for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11977–11986, 2019.
[25] Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge. Deepgaze ii: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563, 2016.
[26] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
[27] Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, and Larry S Davis. An analysis of pre-training on object detection. arXiv preprint arXiv:1904.05871, 2019.
[28] Siyuan Li, Semih Gunel, Mirela Ostrek, Pavan Ramdya, Pascal Fua, and Helge Rhodin. Deformation-aware unpaired image translation for pose estimation on laboratory animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13158–13168, 2020.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[30] Pablo Maceira-Elvira, Traian Popa, Anne-Christine Schmid, and Friedhelm C Hummel. Wearable technology in stroke rehabilitation: towards improved diagnosis and treatment of upper-limb motor impairment. Journal of NeuroEngineering and Rehabilitation, 16(1):142, 2019.
[31] Lars-Erik Magnusson and B Thafvellin. Studies on the conformation and related traits of standardbred trotters in Sweden. Journal of Animal Physiology and Animal Nutrition (Germany, FR), 1990.
[32] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
[33] Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21(9):1281–1289, 2018.
[34] Mackenzie Weygandt Mathis and Alexander Mathis. Deep learning tools for the measurement of animal behavior in neuroscience. Current Opinion in Neurobiology, 60:1–11, 2020.
[35] Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.
[36] Jiteng Mu, Weichao Qiu, Gregory D Hager, and Alan L Yuille. Learning from synthetic animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12386–12395, 2020.
[37] Tanmay Nath, Alexander Mathis, An Chi Chen, Amir Patel, Matthias Bethge, and Mackenzie Weygandt Mathis. Using deeplabcut for 3d markerless pose estimation across species and behaviors. Nature Protocols, 14:2152–2176, 2019.
[38] Guanghan Ning, Jian Pei, and Heng Huang. Lighttrack: A generic framework for online top-down human pose tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1034–1035, 2020.
[39] Mirela Ostrek, Helge Rhodin, Pascal Fua, Erich Müller, and Jörg Spörri. Are existing monocular computer vision-based 3d motion capture approaches ready for deployment? A methodological study on the example of alpine skiing. Sensors, 19(19):4323, 2019.
[40] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems, pages 3342–3352, 2019.
[41] Maheen Rashid, Xiuye Gu, and Yong Jae Lee. Interspecies knowledge transfer for facial keypoint detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6894–6903, 2017.
[42] Artsiom Sanakoyeu, Vasil Khalidov, Maureen S McCarthy, Andrea Vedaldi, and Natalia Neverova. Transferring dense pose to proximal animal classes. arXiv preprint arXiv:2003.00080, 2020.
[43] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[44] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Removing covariate shift improves robustness against common corruptions. Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS), 2020.
[45] Sahil Shah, Abhishek Sharma, Arjun Jain, et al. On the robustness of human pose estimation. arXiv preprint arXiv:1908.06401, 2019.
[46] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[47] Heng Yang, Renqiao Zhang, and Peter Robinson. Human and sheep facial landmarks localisation by triplet interpolated features. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–8. IEEE, 2016.
[48] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[49] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. arXiv preprint arXiv:1911.02685, 2019.
