Pretraining boosts out-of-domain robustness for pose estimation

Alexander Mathis1,2*†, Thomas Biasi2*, Steffen Schneider1,3, Mert Yüksekgönül2,3, Byron Rogers4, Matthias Bethge3, Mackenzie W. Mathis1,2‡

1EPFL  2Harvard University  3University of Tübingen  4Performance Genetics

*Equal contribution. †Correspondence: [email protected]. ‡Correspondence: [email protected]

arXiv:1909.11229v2 [cs.CV] 12 Nov 2020

Abstract

Neural networks are highly effective tools for pose estimation. However, as in other computer vision tasks, robustness to out-of-domain data remains a challenge, especially for the small training sets that are common in real-world applications. Here, we probe the generalization ability of three architecture classes (MobileNetV2s, ResNets, and EfficientNets) for pose estimation. We developed a dataset of 30 horses that allowed for both "within-domain" and "out-of-domain" (unseen horse) benchmarking; this is a crucial test for robustness that current human pose estimation benchmarks do not directly address. We show that architectures that perform better on ImageNet also perform better on both within- and out-of-domain data if they are first pretrained on ImageNet. We additionally show that better ImageNet models generalize better across animal species. Furthermore, we introduce Horse-C, a new benchmark of common corruptions for pose estimation, and confirm that pretraining increases performance in this domain-shift context as well. Overall, our results demonstrate that transfer learning is beneficial for out-of-domain robustness.

Figure 1. Transfer learning boosts performance, especially on out-of-domain data. Normalized pose estimation error vs. ImageNet Top-1 accuracy for different backbones (MobileNetV2s, ResNet-50, EfficientNets), trained from scratch or with transfer learning, and evaluated on train, within-domain test, and out-of-domain test data. Transfer learning (using pretrained ImageNet models) gives a 2X boost on out-of-domain data vs. training from scratch. While training from scratch reaches the same task performance as fine-tuning, the networks remain less robust, as demonstrated by poor accuracy on out-of-domain horses. Mean ± SEM, 3 shuffles.

1. Introduction

Pose estimation is an important tool for measuring behavior, and is thus widely used in technology, medicine, and biology [5, 40, 30, 35]. Due to innovations in both deep learning algorithms [17, 7, 11, 24, 39, 8] and large-scale datasets [29, 4, 3], pose estimation on humans has become very powerful. However, typical human pose estimation benchmarks, such as MPII pose and COCO [29, 4, 3], contain many different individuals (>10k) in different contexts, but only very few example postures per individual. In real-world applications of pose estimation, users often want to create customized networks that estimate the locations of user-defined body parts by labeling only a few hundred frames on a small subset of individuals, yet they want these networks to generalize to new individuals [40, 30, 35, 43]. Thus, one naturally asks the following question: assume you have trained an algorithm that performs with high accuracy on a given (individual) animal for the whole repertoire of movement. How well will it generalize to different individuals with slightly or dramatically different appearances? Unlike in common human pose estimation benchmarks, the setting here is that datasets have many (annotated) poses per individual (>200) but only a few individuals (≈10).

To allow the field to tackle this challenge, we developed a novel benchmark comprising 30 diverse Thoroughbred horses, for which 22 body parts were labeled by an expert in 8114 frames (dataset available at http://horse10.deeplabcut.org). Horses have various coat colors, and the "in-the-wild" aspect of the data, collected at various Thoroughbred farms, added further complexity. With this dataset we could directly test the effect of pretraining on out-of-domain data. Here we report two key insights: (1) ImageNet performance predicts generalization on both within-domain and out-of-domain data for pose estimation; (2) while we confirm that task training can catch up with fine-tuning of pretrained models given sufficiently large training sets [10], we show this is not the case for out-of-domain data (Figure 1). Thus, transfer learning improves robustness and generalization. Furthermore, we contrast the domain shift inherent in this dataset with the domain shift induced by common image corruptions [14, 36], and we find that pretraining on ImageNet also improves robustness against those corruptions.

Figure 2. Horse Dataset: Example frames for each Thoroughbred horse in the dataset. The videos vary in horse color, background, lighting conditions, and relative horse size. The sunlight variation between videos added to the complexity of the learning challenge, as did the handlers often wearing horse-leg-colored clothing. Some horses were in direct sunlight, others had the light behind them, and others were walking into and out of shadows, which was particularly problematic for a dataset dominated by dark-colored coats. To illustrate the Horse-10 task, the horses are arranged according to one split: the ten leftmost horses were used for within-domain train/test, and the rest are the held-out, out-of-domain horses.
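To make the Horse-10 protocol described in the Figure 2 caption concrete, here is a minimal Python sketch of how labeled frames could be partitioned into within-domain and out-of-domain sets. This is an illustration only, not the authors' code: the file name, column names, and the 50% training fraction are hypothetical assumptions.

# Hypothetical sketch of the Horse-10 evaluation protocol: train on frames from
# a subset of horses, then test on held-out frames of those same horses
# ("within domain") and on entirely unseen horses ("out of domain").
import pandas as pd

labels = pd.read_csv("horse30_labels.csv")  # hypothetical file: one row per labeled frame
horses = sorted(labels["horse_id"].unique())
within_domain_horses = horses[:10]          # e.g., one Horse-10 split of 10 horses

within = labels[labels["horse_id"].isin(within_domain_horses)]
out_of_domain = labels[~labels["horse_id"].isin(within_domain_horses)]

# Within-domain frames are further split into train/test; out-of-domain horses
# are never seen during training and are used purely for evaluation.
train = within.sample(frac=0.5, random_state=0)  # e.g., a 50% training fraction
test_within = within.drop(train.index)
test_out_of_domain = out_of_domain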
2. Related Work

2.1. Pose and keypoint estimation

Typical human pose estimation benchmarks, such as MPII pose and COCO [29, 4, 3], contain many different individuals (>10k) in different contexts, but only very few example postures per individual. Along similar lines, but for animals, Cao et al. created a dataset comprising a few thousand images of five domestic animals with one pose per individual [6]. There are also papers on facial keypoints in horses [42] and sheep [48, 16], and recently a large-scale dataset featuring 21.9k faces from 334 diverse species was introduced [20]. Our work adds a dataset comprising multiple different postures per individual (>200) across 30 diverse race horses, for which 22 body parts were labeled by an expert in 8114 frames. This pose estimation dataset allowed us to address within- and out-of-domain generalization. Our dataset could also be important for further testing and developing recent work on domain adaptation in animal pose estimation with a real-world dataset [28, 43, 37].

2.2. Transfer learning

Transfer learning has become accepted wisdom: fine-tuning the pretrained weights of large-scale models yields the best results [9, 49, 25, 33, 27, 50]. He et al. nudged the field to rethink this accepted wisdom by demonstrating that, for various tasks, directly training on the task data can match fine-tuning performance [10]. We confirm this result, but show that it does not hold on held-out individuals ("out-of-domain" data). Raghu et al. showed that for target medical tasks (with little similarity to ImageNet), transfer learning offers little benefit over lightweight architectures [41]. Kornblith et al. showed, for many object recognition tasks, that better ImageNet performance leads to better performance on those other benchmarks [23]. We show that this is also true for pose estimation, both for within-domain and out-of-domain data (on different horses, and for different species), as well as for corruption resilience.
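The two training regimes contrasted throughout this paper differ only in how the backbone weights are initialized: from ImageNet-pretrained weights (transfer learning) or at random (training from scratch). The following is a minimal, hypothetical PyTorch-style sketch of that contrast for heatmap-based pose estimation; it is not the authors' implementation, and the deconvolution head, layer sizes, and use of the torchvision weights API (torchvision >= 0.13) are illustrative assumptions.

# Illustrative sketch only: "transfer learning" vs. "from scratch" differ solely
# in the backbone initialization; the architecture is otherwise identical.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def make_pose_net(num_keypoints: int = 22, pretrained: bool = True) -> nn.Module:
    weights = ResNet50_Weights.IMAGENET1K_V1 if pretrained else None
    backbone = resnet50(weights=weights)
    # Keep the convolutional trunk, drop the global pooling and classification head.
    trunk = nn.Sequential(*list(backbone.children())[:-2])  # output: (N, 2048, H/32, W/32)
    # Simple deconvolution head predicting one heatmap per body part.
    head = nn.Sequential(
        nn.ConvTranspose2d(2048, 256, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.ConvTranspose2d(256, num_keypoints, kernel_size=4, stride=2, padding=1),
    )
    return nn.Sequential(trunk, head)

finetuned = make_pose_net(pretrained=True)      # transfer learning: ImageNet initialization
from_scratch = make_pose_net(pretrained=False)  # same architecture, random initialization

x = torch.randn(1, 3, 162, 288)                 # a downsampled Horse-10-sized frame
print(finetuned(x).shape)                       # heatmaps at roughly 1/8 of the input resolution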
What is the limit of transfer learning? Would ever larger datasets give better generalization? Interestingly, this appears to depend strongly on the task the network was pretrained on. Recent work by Mahajan et al. showed that pretraining for large-scale hashtag prediction on Instagram data (3.5 billion images) improves classification, while at the same time possibly harming localization performance for tasks like object detection, instance segmentation, and keypoint detection [32]. This highlights the pretraining task, rather than sheer dataset size, as the crucial factor. Further corroborating this insight, Li et al. showed that pretraining on a large-scale object detection task can improve performance on tasks that require fine spatial information, like segmentation [27]. Thus, one interesting future direction to boost robustness could be to utilize networks pretrained on OpenImages, which contains bounding boxes for 15 million instances in close to 2 million images [26].

Figure 3. Transfer learning boosts performance, especially on out-of-domain data. A: Illustration of the normalized error metric, i.e., the error measured as a fraction of the distance from nose to eye (which is approximately 30 cm on a horse). B: Normalized error vs. network performance as ranked by Top-1 accuracy on ImageNet, at a 5% training fraction (ordered by increasing ImageNet performance: MobileNetV2-0.35, MobileNetV2-0.5, MobileNetV2-0.75, MobileNetV2-1.0, ResNet-50, ResNet-101, EfficientNets B0 to B6). The faint lines indicate data for the three splits. Test data is in red, train data in blue, and out-of-domain data in grey. C: Same as B but with a 50% training fraction. D: Example frames with human-annotated vs. predicted body parts for the MobileNetV2-0.35 and EfficientNet-B6 architectures with ImageNet pretraining, on out-of-domain horses.

3. Data and Methods

3.1. Datasets and evaluation metrics

We developed a novel horse dataset comprising 8114 frames across 30 different horses, each captured for 4-10 seconds with a GoPro camera (resolution: 1920 × 1080, frame rate: 60 FPS), which we call Horse-30 (Figure 2). We downsampled the frames to 15% of the original resolution (288 × 162 pixels; one video was downsampled to 30%) to speed up the benchmarking process. We annotated 22 previously established anatomical landmarks for equines [31, 2]. The following 22 body parts were labeled in the 8114 frames: Nose, Eye, Nearknee, Nearfrontfetlock, Nearfrontfoot, Offknee, Offfrontfetlock, Offfrontfoot, Shoulder, Midshoulder, Elbow, Girth, Wither, Nearhindhock, Nearhindfetlock, ...
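Figure 3A defines the evaluation metric used throughout: the Euclidean distance between a predicted and a ground-truth keypoint, expressed as a fraction of the ground-truth nose-to-eye distance (approximately 30 cm on a horse). Below is a short sketch of that computation; the array layout and keypoint indices are assumptions for illustration, not the authors' evaluation code.

# Sketch of the normalized error from Figure 3A (illustrative, with an assumed
# array layout): pixel errors divided by the per-frame nose-to-eye distance.
import numpy as np

def normalized_errors(pred, gt, nose_idx=0, eye_idx=1):
    """pred, gt: arrays of shape (num_bodyparts, 2) with (x, y) pixel coordinates for one frame."""
    per_bodypart_error = np.linalg.norm(pred - gt, axis=1)    # Euclidean error per body part, in pixels
    nose_to_eye = np.linalg.norm(gt[nose_idx] - gt[eye_idx])  # per-frame normalization scale
    return per_bodypart_error / nose_to_eye

# Example with dummy coordinates (nose, eye, and one additional body part):
gt = np.array([[50.0, 40.0], [62.0, 35.0], [80.0, 120.0]])
pred = gt + np.array([[1.0, -2.0], [0.5, 0.5], [4.0, 3.0]])
print(normalized_errors(pred, gt))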