A Shape Transformation-Based Dataset Augmentation Framework for Pedestrian Detection
International Journal of Computer Vision manuscript No. (will be inserted by the editor)

arXiv:1912.07010v1 [cs.CV] 15 Dec 2019

Zhe Chen 1 · Wanli Ouyang 2 · Tongliang Liu 1 · Dacheng Tao 1

the date of receipt and acceptance should be inserted later

Zhe Chen ([email protected])
Wanli Ouyang ([email protected])
Tongliang Liu ([email protected])
Dacheng Tao ([email protected])
1 UBTECH Sydney Artificial Intelligence Centre, The University of Sydney, Sydney, Australia
2 The University of Sydney, Sydney, Australia

Abstract Deep learning-based computer vision is usually data hungry. Many researchers attempt to augment datasets with synthesized data to improve model robustness. However, the augmentation of popular pedestrian datasets, such as Caltech and CityPersons, can be extremely challenging because real pedestrians are commonly of low quality. Due to factors such as occlusion, blur, and low resolution, it is very difficult for existing augmentation approaches, which generally synthesize data using 3D engines or generative adversarial networks (GANs), to generate realistic-looking pedestrians. Alternatively, to access much more natural-looking pedestrians, we propose to augment pedestrian detection datasets by transforming real pedestrians from the same dataset into different shapes. Accordingly, we propose the Shape Transformation-based Dataset Augmentation (STDA) framework. The proposed framework is composed of two sequential modules, i.e. the shape-guided deformation and the environment adaptation. In the first module, we introduce a shape-guided warping field to help deform the shape of a real pedestrian into a different shape. Then, in the second module, we propose an environment-aware blending map to better adapt the deformed pedestrians into surrounding environments, obtaining more realistic-looking pedestrians and more beneficial augmentation results for pedestrian detection. Extensive empirical studies on different pedestrian detection benchmarks show that the proposed STDA framework consistently produces much better augmentation results than other pedestrian synthesis approaches using low-quality pedestrians. By augmenting the original datasets, our proposed framework also improves the baseline pedestrian detector by up to 38% on the evaluated benchmarks, achieving state-of-the-art performance.

Keywords Pedestrian Detection · Dataset Augmentation · Pedestrian Rendering

1 Introduction

With the introduction of large-scale pedestrian datasets (Dollár et al., 2009; Dollar et al., 2012; Zhang et al., 2017; Geiger et al., 2013), deep convolutional neural networks (DCNNs) have achieved promising detection accuracy. However, the trained DCNN detectors may not be robust enough because negative background examples greatly outnumber positive foreground examples during training. Recent studies have confirmed that DCNN detectors trained with limited foreground examples can be vulnerable to difficult objects which have unexpected states (Huang and Ramanan, 2017) and diversified poses (Alcorn et al., 2018).

To improve detector robustness, besides designing new machine learning algorithms, many researchers have attempted to augment training datasets by generating new foreground examples. For instance, Huang and Ramanan (2017) used a 3D game engine to simulate pedestrians and adapt them into the pedestrian datasets. Other studies (Ma et al., 2018; Siarohin et al., 2018; Zanfir et al., 2018; Ge et al., 2018) attempted to augment person re-identification datasets by transferring the poses of pedestrians using generative adversarial networks (GANs). Despite this progress, it is still very challenging to adequately apply existing augmentation approaches to the common pedestrian detection datasets. First, synthesizing pedestrians using external platforms such as 3D game engines may introduce a significant domain gap between synthesized pedestrians and real pedestrians, limiting how much the generated pedestrians can improve model robustness for detecting real pedestrians. Moreover, methods that utilize GANs to render pedestrians generally require rich appearance details from paired training images to help define the desired output of the generative networks during training. However, in common pedestrian detection datasets like Caltech (Dollar et al., 2012) and CityPersons (Zhang et al., 2017), pedestrians are usually of low quality due to factors such as heavy occlusion, blurry appearance, and low resolution caused by small sizes. As a result, the available real pedestrians provide only an extremely limited amount of appearance detail that can be used for training generative networks. Without a sufficient description of the desired appearance of synthesized pedestrians, we show in our experiments that current GAN-based methods generate only less realistic or even corrupted pedestrians when using very low-quality pedestrians from common pedestrian detection datasets.

Fig. 1 [Diagram; panel labels: Pedestrian i, Shape j, Background Patch, Shape-guided Deformation, Shape-guided Warping Field, Environment Adaptation, Environment-aware Blending Map, Low-quality Pedestrian Dataset, Rich Diversified Pedestrians Generated for Dataset Augmentation, Robust Pedestrian Detection.] We propose the shape transformation-based dataset augmentation framework for pedestrian detection. In the framework, we successively introduce the shape-guided warping field to deform pedestrians and the environment-aware blending map to adapt the deformed pedestrians into background environments. Our proposed framework can effectively generate more realistic-looking pedestrians for augmenting pedestrian datasets in which real pedestrians are usually of low quality. Best viewed in color.

To address the above issues, we propose to augment pedestrian datasets by transforming real pedestrians from the same dataset according to different shapes (i.e. segmentation masks in this study) rather than rendering new pedestrians. Our motivation comes from the following observations. First, unlike existing methods that require sufficient appearance details to define the desired output, it is much easier to access rich pixel-level shape deformation supervision, which defines the deformation from one shape to another, even if only low-quality pedestrian examples are available in the datasets. The learned deformation between different shapes can guide the deformation of the appearances of the real pedestrians, avoiding the requirement of detailed supervision information to directly define the transformed appearances. In addition, since the shape information can naturally distinguish foreground areas from background areas, we can simply focus on adapting synthesized foreground appearances into background environments, avoiding the risk of further generating unnatural background environments together with the synthesized pedestrians, as required in current GAN-based approaches. Last but not least, we find that transforming real pedestrians based on different shapes can effectively increase foreground sample diversity while still adequately maintaining the appearance characteristics of real pedestrians.

Based on these observations, we devise a Shape Transformation-based Dataset Augmentation (STDA) framework to fulfill the pedestrian dataset augmentation task more effectively. In particular, the framework first deforms a real pedestrian into a similar pedestrian with a different shape and then adapts the shape-deformed pedestrian into the surrounding environment on the image to be augmented. In the STDA framework, we introduce a shape-guided warping field, which is a set of vectors that define the warping operation between shapes, to further define an appropriate deformation between the shapes and the appearances of the real pedestrians. Moreover, we introduce an environment-aware blending map to help the shape-deformed pedestrians better blend into various background environments, delivering more realistic-looking pedestrians on the image.

In this study, our key contributions are listed as follows:

- We propose a shape transformation-based dataset augmentation framework to augment the pedestrian detection datasets and improve pedestrian detection accuracy. To the best of our knowledge, we are the first to apply the shape-transformation-based data synthesis methodology to pedestrian detection.

et al., 2012; Zhang et al., 2017; Geiger et al., 2013), researchers have greatly improved the pedestrian detection performance with DCNNs (Simonyan and Zisserman, 2014; He et al., 2016). Among the DCNN detectors, two-stage detection pipelines (Ouyang and Wang, 2013; Ren et al., 2015; Li et al., 2018; Cai et al., 2016; Zhang et al., 2016b; Du et al., 2017) usually perform better than single-stage detection pipelines (Liu et al., 2016; Redmon et al., 2016; Lin et al., 2018). Despite this progress, the extreme imbalance between foreground and background examples in pedestrian datasets still adversely affects the robustness of the DCNN detectors. Current pedestrian detectors can still be fragile to even small transformations of pedestrians. To tackle this problem, many researchers tend to augment the datasets by synthesizing new foreground data.

-