
International Journal of Computer Vision manuscript No. (will be inserted by the editor)

A Shape Transformation-based Dataset Augmentation Framework for Pedestrian Detection

Zhe Chen 1 · Wanli Ouyang 2 · Tongliang Liu 1 · Dacheng Tao 1

the date of receipt and acceptance should be inserted later

Abstract Deep learning-based computer vision is usually data hungry. Many researchers attempt to augment datasets with synthesized data to improve model robustness. However, the augmentation of popular pedestrian datasets, such as Caltech and CityPersons, can be extremely challenging because real pedestrians are commonly in low quality. Due to factors like occlusions, blurs, and low resolution, it is significantly difficult for existing augmentation approaches, which generally synthesize data using 3D engines or generative adversarial networks (GANs), to generate realistic-looking pedestrians. Alternatively, to access much more natural-looking pedestrians, we propose to augment pedestrian detection datasets by transforming real pedestrians from the same dataset into different shapes. Accordingly, we propose the Shape Transformation-based Dataset Augmentation (STDA) framework. The proposed framework is composed of two subsequent modules, i.e. the shape-guided deformation and the environment adaptation. In the first module, we introduce a shape-guided warping field to help deform the shape of a real pedestrian into a different shape. Then, in the second module, we propose the environment-aware blending map to better adapt the deformed pedestrians into surrounding environments, obtaining more realistic-looking pedestrians and more beneficial augmentation results for pedestrian detection. Extensive empirical studies on different pedestrian detection benchmarks show that the proposed STDA framework consistently produces much better augmentation results than other pedestrian synthesis approaches using low-quality pedestrians. By augmenting the original datasets, our proposed framework also improves the baseline pedestrian detector by up to 38% on the evaluated benchmarks, achieving state-of-the-art performance.

Keywords Pedestrian Detection · Dataset Augmentation · Pedestrian Rendering

1 Introduction

With the introduction of large-scale pedestrian datasets (Dollár et al., 2009; Dollar et al., 2012; Zhang et al., 2017; Geiger et al., 2013), deep convolutional neural networks (DCNNs) have achieved promising detection accuracy. However, the trained DCNN detectors may not be robust enough due to the issue that negative background examples greatly exceed positive foreground examples during training. Recent studies have confirmed that DCNN detectors trained with limited foreground examples can be vulnerable to difficult objects which have unexpected states (Huang and Ramanan, 2017) and diversified poses (Alcorn et al., 2018).

To improve detector robustness, besides designing new machine learning algorithms, many researchers attempted to augment training datasets by generating new foreground examples. For instance, Huang et al. (Huang and Ramanan, 2017) used a 3D game engine to simulate pedestrians and adapt them into pedestrian datasets. Other studies (Ma et al., 2018; Siarohin et al., 2018; Zanfir et al., 2018; Ge et al., 2018) attempted to augment person re-identification datasets by transferring the poses of pedestrians using generative adversarial networks (GANs). Despite progress, it is still very challenging to adequately apply existing augmentation approaches on the common pedestrian detection datasets.

Zhe Chen ([email protected]) · Wanli Ouyang ([email protected]) · Tongliang Liu ([email protected]) · Dacheng Tao ([email protected])
1 UBTECH Sydney Artificial Intelligence Centre, The University of Sydney, Sydney, Australia
2 The University of Sydney, Sydney, Australia


Fig. 1 We propose the shape transformation-based dataset augmentation framework for pedestrian detection. In the framework, we subsequently introduce the shape-guided warping field to deform pedestrians and the environment-aware blending map to adapt the deformed pedestrians into background environments. Our proposed framework can effectively generate realistic-looking pedestrians for augmenting pedestrian datasets in which real pedestrians are usually in low quality. Best viewed in color.

First, synthesizing pedestrians using external platforms like 3D game engines may introduce a significant domain gap between synthesized pedestrians and real pedestrians, limiting the overall benefits of the generated pedestrians for improving model robustness on real pedestrians. Moreover, methods that utilize GANs to render pedestrians generally require rich appearance details from paired training images to help define the desired output of the generative networks during training. However, in common pedestrian detection datasets like Caltech (Dollar et al., 2012) and CityPersons (Zhang et al., 2017), pedestrians are usually in low quality due to factors like heavy occlusions, blurry appearance, and low resolution caused by small sizes. As a result, the available real pedestrians only provide an extremely limited amount of appearance details that can be used for training generative networks. Without sufficient description of the desired appearance of synthesized pedestrians, as we show in our experiments, current GAN-based methods only generate less realistic or even corrupted pedestrians when trained on the very low-quality pedestrians from common pedestrian detection datasets.

To address the above issues, we propose to augment pedestrian datasets by transforming real pedestrians from the same dataset according to different shapes (i.e. segmentation masks in this study) rather than rendering new pedestrians. Our motivation comes from the following observations. First, unlike existing methods that require sufficient appearance details to define the desired output, it is much easier to access rich pixel-level shape deformation supervision, which defines the deformation from one shape to another, even if only low-quality pedestrian examples are available in the datasets. The learned deformation between different shapes can then guide the deformation of the appearances of real pedestrians, avoiding the requirement of detailed supervision information to directly define the transformed appearances. In addition, since the shape information naturally distinguishes foreground areas from background areas, we can simply focus on adapting synthesized foreground appearances into background environments, avoiding the risk of further generating unnatural background environments together with the synthesized pedestrians as required in current GAN-based approaches. Last but not least, we find that transforming real pedestrians based on different shapes can effectively increase foreground sample diversity while still adequately maintaining the appearance characteristics of real pedestrians.

Based on these observations, we devise a Shape Transformation-based Dataset Augmentation (STDA) framework to fulfill the pedestrian dataset augmentation task more effectively. In particular, the framework first deforms a real pedestrian into a similar pedestrian with a different shape and then adapts the shape-deformed pedestrian into the surrounding environments of the image to be augmented. In the STDA framework, we introduce a shape-guided warping field, which is a set of vectors that define the warping operation between shapes, to further define an appropriate deformation between the shapes and the appearances of real pedestrians. Moreover, we introduce an environment-aware blending map to help the shape-deformed pedestrians better blend into various background environments, delivering more realistic-looking pedestrians on the image.

In this study, our key contributions are listed as follows:

– We propose a shape transformation-based dataset augmentation framework to augment pedestrian detection datasets and improve pedestrian detection accuracy. To the best of our knowledge, we are the first to apply the shape-transformation-based data synthesis methodology to pedestrian detection.
– We propose the shape-guided warping field to help define a proper shape deformation procedure. We also introduce an environment-aware blending map to better adapt the shape-transformed pedestrians into different backgrounds, achieving better augmentation results on the image.
– We introduce a shape constraining operation to improve shape deformation quality and a novel hard positive mining loss to further magnify the benefits of the synthesized pedestrians for improving detection robustness.
– Our proposed framework is promising for generating pedestrians using low-quality examples. Comprehensive evaluations on the well-known Caltech (Dollar et al., 2012) and CityPersons (Zhang et al., 2017) benchmarks validate that our proposed framework can generate more realistic-looking pedestrians than existing methods using low-quality data. With pedestrian datasets augmented by our framework, we promisingly boost the performance of the baseline pedestrian detector, accessing performance superior to other cutting-edge pedestrian detectors.

2 Related Work

2.1 Pedestrian Detection

Pedestrian detection is critical in many applications such as robotics and autonomous driving (Enzweiler and Gavrila, 2008; Dollár et al., 2009; Dollar et al., 2012; Zhang et al., 2016c). Traditional pedestrian detectors generally use hand-crafted features (Viola et al., 2005; Ran et al., 2007) and adopt a human part-based detection strategy (Felzenszwalb et al., 2010b) or cascaded structures (Felzenszwalb et al., 2010a; Bar-Hillel et al., 2010; Felzenszwalb et al., 2008). Recently, by taking advantage of large-scale pedestrian datasets (Dollár et al., 2009; Dollar et al., 2012; Zhang et al., 2017; Geiger et al., 2013), researchers have greatly improved pedestrian detection performance with DCNNs (Simonyan and Zisserman, 2014; He et al., 2016). Among the DCNN detectors, two-stage detection pipelines (Ouyang and Wang, 2013; Ren et al., 2015; Li et al., 2018; Cai et al., 2016; Zhang et al., 2016b; Du et al., 2017) usually perform better than single-stage detection pipelines (Liu et al., 2016; Redmon et al., 2016; Lin et al., 2018). Despite progress, the issue that foreground and background examples are extremely unbalanced in pedestrian datasets still adversely affects the robustness of DCNN detectors. Current pedestrian detectors can still be fragile to even small transformations of pedestrians. To tackle this problem, many researchers tend to augment the datasets by synthesizing new foreground data.

2.2 Simulation-based Dataset Augmentation

To achieve dataset augmentation, researchers have used 3D simulation platforms to synthesize new examples for the datasets. For example, (Lerer et al., 2016; Ros et al., 2016) used a 3D game engine to help build new datasets. More related studies used 3D simulation platforms to augment pedestrian-related datasets. In particular, (Pishchulin et al., 2011; Hattori et al., 2015) employed a game engine to synthesize training data for pedestrian detection. In addition, (Huang and Ramanan, 2017) applied a GAN to narrow the domain gap between 3D-simulated pedestrians and natural pedestrians, but this method brings limited improvement on common pedestrian detection, suggesting that the domain gap is still large. Such a gap can further pose negative effects on DCNN detectors, making the augmented datasets deliver only incremental improvements on pedestrian detection.

2.3 GAN-based Dataset Augmentation

Recently, with several improvements (Radford et al., 2015; Arjovsky et al., 2017; Gulrajani et al., 2017), GANs (Goodfellow et al., 2014) have shown great benefits on synthesis-based applications such as image-to-image translation (Isola et al., 2017; Liu et al., 2017a; Zhu et al., 2017) and skeleton-to-image generation (Villegas et al., 2017; Yan et al., 2017). In the literature of the person re-identification task, many works attempted to transfer the poses of real pedestrians to deliver diversified pedestrians for augmentation. For instance, (Liu et al., 2018; Ma et al., 2018; Siarohin et al., 2018; Zanfir et al., 2018; Ge et al., 2018; Zheng et al., 2017; Ma et al., 2017) introduced various techniques to transform the human appearance according to 2D or 3D poses and improve person re-identification performance.

In practice, these methods require accurate pose information or paired training images that contain rich appearance details to achieve successful transformation. However, existing widely used pedestrian datasets like Caltech provide neither pose annotations nor paired appearance information for training such GANs. Furthermore, in current pedestrian datasets, the large number of small pedestrians whose appearances are usually in low quality can make it difficult for existing pose estimators to deliver reasonable predictions. Fig. 2 shows some examples illustrating that the estimated poses of low-quality pedestrians are much more unstable than the masks estimated using the same Mask RCNN (He et al., 2017) detector. As a result, it is quite infeasible to seamlessly apply these pose transfer models for augmenting current pedestrian datasets.

In pedestrian detection, some studies have introduced specifically designed GANs for the augmentation. As an example, (Ouyang et al., 2018b) modified the pix2pixGAN (Isola et al., 2017) to make it more suitable for pedestrian generation, but this method lacks a particular mechanism that helps produce diversified pedestrians and still delivers poor generation results based on low-quality data.

Fig. 2 Some examples showing that the shape estimation results are more accurate than the pose estimation results on low-quality images using the same Mask RCNN model. Best viewed in color.

In this study, we propose that transforming pedestrians from the original dataset by altering their shapes can produce diversified and much more lifelike pedestrians without requiring rich appearance details for supervision.

3 Shape Transformation-based Dataset Augmentation Framework

3.1 Problem Definition

Data augmentation, commonly formulated as transformations of raw data, has been used to access the vast majority of the state-of-the-art results in image recognition. Data augmentation is intuitively explained as increasing the training data size and as a regularizer that can model hypothesis complexity (Goodfellow et al., 2016; Zhang et al., 2016a; Dao et al., 2019). In particular, the hypothesis complexity can be used to measure the generalization error, which is the difference between the training and test errors of learning algorithms (Vapnik, 2013; Liu et al., 2017b). A larger hypothesis complexity usually implies a larger generalization error and vice versa. In practice, a small training error and a small generalization error are favoured to guarantee a small test error. As a result, data augmentation is especially useful for deep learning models, which are powerful in maintaining a small training error but have a large hypothesis complexity. It has been empirically demonstrated that data augmentation operations can greatly improve the generalization ability of deep models (Cireşan et al., 2010; Dosovitskiy et al., 2015; Sajjadi et al., 2016).

In this study, the overall goal is to devise a more effective dataset augmentation framework to improve pedestrian detection models. The framework is supposed to generate diversified and more realistic-looking pedestrian examples to enrich the corresponding datasets in which real pedestrians are usually in very low quality. We achieve this goal by transforming real pedestrians into different shapes rather than rendering new pedestrians. In fact, the devised data augmentation method, focusing only on transforming pedestrians, can be efficacious in regularizing the hypothesis complexity. This can be empirically justified by our experiments, which show that the pedestrian detection performance of the baseline model can be significantly improved without introducing extra model complexity. In general, the top line of Fig. 3 briefly describes our augmentation methodology.

Formally, suppose z_i is an image patch containing a real pedestrian in the dataset and s_i is its extracted shape or segmentation mask. Denote by s_j a different shape, which can be obtained from another real pedestrian's shape. We implement a shape transformation-based dataset augmentation function, denoted as f_STDA, to generate a new pedestrian by transforming a real pedestrian into a more realistic-looking appearance with another shape s_j for the augmentation:

    z^gen_{i→j} = f_STDA(z_i, s_j, I),    (1)

where z^gen_{i→j} is a patch containing the newly generated pedestrian obtained by transforming the shape of z_i into s_j, and I is the image to be augmented.

Fig. 3 Overview of the proposed shape transformation-based dataset augmentation framework for pedestrian datasets with low-quality pedestrian data. In particular, we introduce the shape-guided warping field, V_{i→j}, and the environment-aware blending map, α(x, y), to respectively help implement the shape-guided deformation and the environment adaptation, obtaining the deformed shape s^w_{i→j}, the deformed pedestrian z^w_{i→j}, and the transformation result z^gen_{i→j}. By placing the z^gen_{i→j} into the image I, we can effectively augment the original image. In practice, we employ a U-Net to predict both the V_{i→j} and the α(x, y). Training losses for the U-Net include L_shape, L_adv, L_cyc, and L_hpm. Best viewed in color.

3.2 Framework Overview

In pedestrian detection datasets, it is difficult to access sufficient appearance details to define the desired z^gen_{i→j}, making it extremely challenging to generate realistic-looking pedestrians using low-quality appearance. To properly implement f_STDA using low-quality pedestrians, we decompose the pedestrian generation task into two sub-tasks, i.e. shape-guided deformation and environment adaptation. The first task focuses on varying the appearances to enrich data diversity, and the second task mainly adapts the deformed pedestrians into different environments to blend the transformed pedestrians within each image to be augmented. More specifically, given a pedestrian image patch z_i and a different shape s_j, we first deform the image z_i into a pedestrian with a similar appearance but a shape s_j that is different from the original shape s_i. The deformation can be defined according to the transformation from s_i into s_j. Then, we adapt the deformed pedestrian image into some background environments on the image I. Denote by f_SD the function that implements the shape-guided deformation, and by f_EA the function that implements the environment adaptation. The proposed framework implements f_STDA as follows:

    f_STDA(z_i, s_j, I) = f_EA(f_SD(z_i, s_j), I).    (2)

Fig. 3 shows the detailed architecture of the proposed framework. As illustrated in the figure, we introduce a shape-guided warping field, denoted as V_{i→j}, to help implement the shape-guided deformation function. We also propose to apply the environment-aware blending map, denoted as α(x, y), to achieve environment adaptation.
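To make the dual-output design concrete, the following is a minimal PyTorch sketch of a network head that emits both quantities in one forward pass. The `STDAHead` wrapper, the channel layout, and the assumption that the backbone preserves spatial size are ours, not the authors' released code.

```python
import torch
import torch.nn as nn

class STDAHead(nn.Module):
    """Wrap a U-Net-style backbone so one forward pass yields both the
    shape-guided warping field V (2 channels: horizontal and vertical
    displacement) and the environment-aware blending map alpha.
    `backbone` is assumed to map the concatenated inputs to a 3-channel
    map of the same spatial size as its input."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, z_i, s_i, s_j, bg):
        # Inputs: pedestrian patch (N, 3, H, W), its shape (N, 1, H, W),
        # target shape (N, 1, H, W), background patch (N, 3, H, W).
        x = torch.cat([z_i, s_i, s_j, bg], dim=1)
        out = self.backbone(x)
        warp_field = out[:, :2]                       # V_{i->j}, in pixels
        alpha = 1.0 + 0.2 * torch.tanh(out[:, 2:3])   # alpha(x, y) in (0.8, 1.2)
        return warp_field, alpha
```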

In particular, with the help of V_{i→j}, the deformation between different shapes can guide the deformation of the appearances of real pedestrians. Then, after better adapting the shape-deformed pedestrian into the background environments using α(x, y), we obtain diversified and more realistic-looking pedestrians to augment pedestrian detection datasets. In practice, we employ a single end-to-end U-Net (Ronneberger et al., 2015) to fulfill both sub-tasks in a single pass. According to Eq. 2, the employed network takes as input the pedestrian patch z_i, its shape s_i, the target shape s_j, and a background patch from I, and then predicts both the V_{i→j} and the α(x, y).

In the following sections, we subsequently describe in detail the implementation of shape-guided deformation in Sec. 3.2.1 and the implementation of environment adaptation in Sec. 3.2.2.

Fig. 4 An example of the shape constraining operation. In particular, we combine the shape s_j with s_i based on a weighting function γ(y). c^j_y and l^j_y respectively denote the middle point and the width of the foreground areas on the line whose vertical offset is y on the shape s_j; c^i_y and l^i_y are the corresponding middle point and width of the foreground areas on the shape s_i. The γ(y) is a linear function of y, controlling the combination of s_i and s_j. Best viewed in color.

3.2.1 Shape-guided Deformation

In this study, we implement deformation according to warping operations. To obtain a detailed description of the warping operations, we introduce the shape-guided warping field, which is the assignment of the vectors for warping between shapes, to further help deform pedestrians. Denote by v_{i→j}(x, y) the warping vector located at (x, y) that helps warp the shape s_i into the shape s_j. The set of these warping vectors, i.e. V_{i→j} = {v_{i→j}(x, y)}, then forms a shape-guided warping field. An example of this warping field can be found in Fig. 3 under the symbol V_{i→j}, where the warping field helps deform s_i (colored in blue) into s_j (colored in purple). Then, supposing f_warp is the function that warps the input image patch according to the predicted warping field, we implement f_SD by:

    z^w_{i→j} = f_SD(z_i, s_j) = f_warp(z_i; V_{i→j}),    (3)

where z^w_{i→j} is the warped pedestrian z_i according to the shape s_j. In practice, we define each warping vector v_{i→j}(x, y) as a 2D vector which contains the horizontal and vertical displacements between the mapped warping point and the original point located at (x, y). Thus, we can make the employed network directly predict the V_{i→j}. In addition, we implement f_warp with the help of bilinear interpolation, since bilinear interpolation can properly back-propagate the gradients obtained from z^w_{i→j} to the V_{i→j}, aiding the training of the employed network. For more details about using bilinear interpolation for warping and training, we refer readers to (Jaderberg et al., 2015; Dai et al., 2017).
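A differentiable f_warp of this kind can be realized with `grid_sample` in PyTorch. The sketch below is one plausible implementation under the displacement-field definition above; the exact normalization and sampling options used by the authors are not specified in the paper.

```python
import torch
import torch.nn.functional as F

def f_warp(patch: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `patch` (N, C, H, W) with a dense displacement field `flow`
    (N, 2, H, W), where flow[:, 0] / flow[:, 1] hold horizontal / vertical
    displacements in pixels. Bilinear sampling keeps the op differentiable,
    so gradients flow back into the predicted warping field."""
    n, _, h, w = patch.shape
    # Base sampling grid in pixel coordinates, x-coordinate first.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=patch.dtype),
                            torch.arange(w, dtype=patch.dtype), indexing="ij")
    base = torch.stack((xs, ys)).unsqueeze(0).to(patch.device)  # (1, 2, H, W)
    coords = base + flow                                        # mapped points
    # Normalize to [-1, 1], the coordinate convention of grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # (N, H, W, 2)
    return F.grid_sample(patch, grid, mode="bilinear", align_corners=True)

# Shape supervision (see Eq. 4/8): loss = (s_j - f_warp(s_i, V)).abs().mean()
```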
Thus, esize that varying more for the lower body of a natural we can make the employed network directly predict the pedestrian is more acceptable than varying more for the Vi→j. In addition, we implement the fwarp with the upper body. In particular, we find that changing too help of bilinear interpolation, since the bilinear inter- much the upper body of a pedestrian would generally polation can properly back-propagate the gradients ob- require the change of the viewpoint for that pedestrian w tained from zi→j to the Vi→j, aiding the training of (e.g.from the side view to the front view) to obtain a the employed network. For more details about using - natural appearance, but the warping operations do not linear interpolation for warping and training, we refer generate new image contents to generate the pedestrian readers to (Jaderberg et al., 2015; Dai et al., 2017). in a different viewpoint. As a result, we apply a con- To make the shape-guided warping field adequately straint for the target shape sj to avoid varying too much describe the deformation between shapes, we define that for the upper body. More specifically, we constrain the the estimated warping field should warp the shape si target shape sj by combining it with the input shape si A Shape Transformation-based Dataset Augmentation Framework for Pedestrian Detection 7 according to a weighted function. The combination is where α(x, y) is an entry value of the environment- defined on the middle point and the width of the fore- aware blending map located at (x, y). An example of ground areas in each horizontal line on the sj. Suppose the estimated α(x, y) can be found in Fig.3. y is the vertical offset for a horizontal line on sj. We re- In practice, it is difficult to define the desired refine- j j spectively denote cy and ly as the middle point and the ment result and the desired environment-aware blend- width of the foreground areas on that line. We define ing map. Therefore, we can not access appropriate su- the shape constraining operation as: pervision information to train the employed network for environment adaptation. Without supervision, we  j j i cy = γ(y) cy + (1 − γ(y)) cy, apply an adversarial loss to facilitate the employed net- j j i (5) ly = γ(y) ly + (1 − γ(y)) ly, work to learn and blend the deformed pedestrians into the environments effectively. Similar to the shape-guided i i where cy and ly refer to the middle point and width warping field, we make the employed network directly of foreground areas on the si at line y, and γ(y) is the predict environment-aware blending map. Note that we weight function w.r.t.y controlling the strictness of the constrain the environment-aware blending map to pre- constraint. According to Eq.5, the smaller weight value vent changing the appearance of the deformed pedes- γ(y) can make the target shape contribute less to the trians too much. In particular, we adopt a shifted and combination result and vice versa. Accordingly, to allow rescaled tanh squashing function to make the values of more variations only for the lower body of a pedestrian, α(x, y) lie in a range of 0.8 and 1.2. we define γ(y) as a linear function which increases from 0 to 1 with y varying from the top to the bottom. Fig.4 shows a visual example of the introduced 3.3 Objectives shape constraining operation when constraining sj ac- cording to Eq.5. 
We can observe that the proposed shape constraining operation adequately constrains s_j by making the output shape closer to s_i for the upper body of the pedestrian and closer to s_j for the lower body.

3.2.2 Environment Adaptation

After the shape-guided deformation, we place the deformed pedestrians into the image I to fulfill the augmentation. However, directly pasting deformed pedestrians on the image would sometimes produce significant appearance mismatch due to issues like discontinuities in illumination conditions and imperfect shapes predicted by the mask extractor. To refine the generated pedestrians according to the environments, we further perform environment adaptation.

To properly blend a shape-deformed pedestrian into the image I by considering the surrounding environments, we introduce an environment-aware blending map to help refine the deformed pedestrians. We formulate this refinement procedure as follows:

    z^gen_{i→j} = f_EA(z^w_{i→j}, I) = {z^a_{i→j}(x, y)},    (6)

where z^a_{i→j}(x, y) is the environment adaptation result located at (x, y):

    z^a_{i→j}(x, y) = s_j(x, y) · α(x, y) · z^w_{i→j}(x, y) + (1 − s_j(x, y) · α(x, y)) · I(x, y),    (7)

where α(x, y) is an entry of the environment-aware blending map located at (x, y). An example of the estimated α(x, y) can be found in Fig. 3.

In practice, it is difficult to define the desired refinement result and the desired environment-aware blending map. Therefore, we cannot access appropriate supervision information to train the employed network for environment adaptation. Without supervision, we apply an adversarial loss to facilitate the employed network to learn to blend the deformed pedestrians into the environments effectively. Similar to the shape-guided warping field, we make the employed network directly predict the environment-aware blending map. Note that we constrain the environment-aware blending map to prevent it from changing the appearance of the deformed pedestrians too much. In particular, we adopt a shifted and rescaled tanh squashing function to make the values of α(x, y) lie in the range of 0.8 to 1.2.
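Eq. 7 is a per-pixel convex-like blend and is straightforward to write down; the sketch below follows the equation literally (note the foreground weight s_j · α may slightly exceed 1 inside the mask, which is how Eq. 7 is stated).

```python
import torch

def environment_adapt(z_w, s_j, bg, alpha_raw):
    """Blend the shape-deformed pedestrian z_w into a background patch bg
    following Eq. 7. alpha_raw is the network's unbounded blending output;
    the shifted, rescaled tanh keeps alpha(x, y) within (0.8, 1.2) so the
    pedestrian's appearance is only mildly modulated.
    Shapes: z_w, bg are (N, 3, H, W); s_j, alpha_raw are (N, 1, H, W)."""
    alpha = 1.0 + 0.2 * torch.tanh(alpha_raw)   # values in (0.8, 1.2)
    weight = s_j * alpha                        # foreground blending weight
    return weight * z_w + (1.0 - weight) * bg   # Eq. 7, applied per pixel
```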

3.3 Objectives

Since we employ a single network to predict both the shape-guided warping field and the environment-aware blending map, we can unify the objectives for training.

First, to obtain a proper shape-guided warping field, we introduce a shape deformation loss and a cyclic reconstruction loss. The shape deformation loss ensures that the predicted warping field satisfies the constraint described in Eq. 4. The cyclic reconstruction loss then ensures that the deformed shape and pedestrian can be deformed back to the input shape and pedestrian. We define the shape deformation loss L_shape for a pair of samples i → j as follows:

    L_shape = E[||s_j − s^w_{i→j}||_1],    (8)

and the cyclic reconstruction loss as follows:

    L_cyc = E[||s_i − s^w_{j→i}||_1 + ||z_i − z^w_{j→i}||_1],    (9)

where s^w_{j→i} is the deformation result of s^w_{i→j} according to s_i, and z^w_{j→i} is the deformation result of z^w_{i→j} using the same warping field as for computing s^w_{j→i}. As a result, Eq. 8 describes the L1-based shape deformation loss, and Eq. 9 forms the cyclic reconstruction loss.

In addition, an adversarial loss, denoted as L_adv, is included to make sure that the shape-guided deformation and environment adaptation help produce more realistic-looking pedestrian patches. As in typical GANs, the adversarial loss is computed by introducing a discriminator D for the employed network:

    L_adv = E[log D(z)] + E[log(1 − D(z^gen_{i→j}))],    (10)

where z refers to any real pedestrian in the dataset.

Hard Positive Mining Loss: Since our final goal is to improve detection performance, we further apply a hard positive mining loss to magnify the benefits of the transformed pedestrians for improving detection robustness. Inspired by the study of hard positive generation (Wang et al., 2017), we attempt to generate pedestrians that are not very easy to recognize by an RCNN detector (Girshick et al., 2014). Different from the study (Wang et al., 2017), which additionally introduced an occlusion mask and spatial transformation operations to generate hard positives, we only introduce a loss function on the transformed pedestrians to make the employed network learn to produce harder positives for the RCNN detector. To compute this loss, we additionally train an RCNN, denoted as R, to distinguish pedestrian patches from background patches which do not contain pedestrians. Suppose L_hpm is the hard positive mining loss; then we have:

    L_hpm = E[log(1 − R(z^gen_{i→j}))] + E[log R(z)] + E[log(1 − R(b))],    (11)

where b refers to background image patches in the dataset. The major difference between L_hpm and L_adv is that R distinguishes pedestrian patches from background patches, while D in L_adv distinguishes true pedestrian patches from shape-transformed pedestrian patches.

Overall Loss: To sum up, the overall training objective L of the network employed to implement the proposed framework can be written as follows:

    L = ω_1 L_shape + ω_2 L_cyc + ω_3 L_adv + ω_4 L_hpm,    (12)

where ω_1, ω_2, ω_3, ω_4 are the corresponding loss weights. In general, we borrow the setting from the implementation of pix2pixGAN¹ and set ω_1 and ω_3 to 100 and 1, respectively. Since we find in experiments that the network can hardly learn a proper shape-guided warping field if ω_2 is too large, we empirically set ω_2 to a small value, i.e. 0.5. Similarly, we also set ω_4 to 0.5 to make the hard positive mining loss contribute less to the overall objective. In practice, the network is obtained by minimizing the overall loss L, the discriminator D is obtained by maximizing L_adv, and R is obtained by maximizing L_hpm.

¹ https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
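As an illustration, the generator-side objective (Eq. 12) could be assembled as below. This is a sketch, not the authors' code: it assumes D and R end in sigmoids, that D and R are updated separately by maximizing Eq. 10 and Eq. 11 as the text states, and it writes the generator's adversarial and hard-positive terms in their literal (saturating) log(1 − x) form.

```python
import torch
import torch.nn.functional as F

# Loss weights as set in the text (omega_1 ... omega_4).
W_SHAPE, W_CYC, W_ADV, W_HPM = 100.0, 0.5, 1.0, 0.5
EPS = 1e-8  # numerical guard for the log terms

def overall_loss(s_j, s_w_ij, s_i, s_w_ji, z_i, z_w_ji, d_fake, r_fake):
    """Training objective for the U-Net (Eq. 12). d_fake = D(z_gen) and
    r_fake = R(z_gen) are probabilities from the discriminator and the
    pedestrian-vs-background classifier; minimizing the log(1 - x) terms
    pushes both scores toward 1, which D and R in turn resist."""
    loss_shape = F.l1_loss(s_w_ij, s_j)                          # Eq. 8
    loss_cyc = F.l1_loss(s_w_ji, s_i) + F.l1_loss(z_w_ji, z_i)   # Eq. 9
    loss_adv = torch.log(1.0 - d_fake + EPS).mean()              # Eq. 10, fake term
    loss_hpm = torch.log(1.0 - r_fake + EPS).mean()              # Eq. 11, generated term
    return (W_SHAPE * loss_shape + W_CYC * loss_cyc
            + W_ADV * loss_adv + W_HPM * loss_hpm)               # Eq. 12
```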

3.4 Dataset Augmentation

When augmenting the pedestrian datasets with the proposed framework, we attempt to sample natural locations and sizes for placing the transformed pedestrians in the image. Fortunately, pedestrian datasets deliver sufficient knowledge, encoded within bounding box annotations, to define these geometric statistics of a natural pedestrian. For example, in the Caltech dataset (Dollár et al., 2009; Dollar et al., 2012), the aspect ratio of a pedestrian is usually around 0.41. In addition, it is also possible to describe the bottom edge y_box and the height h_box of an annotated bounding box for a pedestrian using a linear model (Park et al., 2010): h_box = k · y_box + b, where k and b are coefficients. In the Caltech dataset, whose images are 480 by 640, k and b are found to be around 1.15 and -194.24. For each image to be augmented, we sample several locations and sizes according to this linear model. Then, for each sampled location and size, we run the proposed framework and put the transformation result into the image. Algorithm 1 describes the detailed pipeline of applying the proposed framework to augment a pedestrian dataset.

Algorithm 1 Pedestrian Dataset Augmentation Pipeline
Input: Natural pedestrians {z_i}, the corresponding shapes {s_i}, and images from the original dataset I = {I}.
Output: Augmented dataset images I.
 1: for each image I ∈ I do
 2:   Uniformly sample a number n from the set {1, 2, 3, 4, 5};
 3:   for m = 1 : n do
 4:     Sample a location and a size according to h_box = k · y_box + b;
 5:     Sample a pedestrian patch z_i, a shape s_j from a different pedestrian, and a background patch cropped from the sampled location and size on I;
 6:     Perform shape-guided deformation according to Eq. 3 and then perform environment adaptation according to Eq. 6, obtaining z^gen_{i→j};
 7:     Place the z^gen_{i→j} into the image I according to the sampled location and size;
 8:   end for
 9: end for
10: return I
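A compact Python rendering of Algorithm 1 is given below. The minimum box height, the rejection-sampling loop, and the `run_stda`/`paste` callables are assumptions of this sketch standing in for the framework and compositing calls.

```python
import random

# Geometry prior measured on Caltech (480x640 frames): box height is roughly
# linear in its bottom edge, h_box = K * y_box + B, and the pedestrian aspect
# ratio (width / height) is around 0.41. MIN_H is an assumption of this sketch.
K, B, ASPECT, MIN_H = 1.15, -194.24, 0.41, 50.0

def sample_placement(img_h=480, img_w=640, tries=100):
    """Rejection-sample a plausible (x, y_top, w, h) box for a new pedestrian."""
    for _ in range(tries):
        y_box = random.uniform(0.0, img_h)   # bottom edge of the box
        h_box = K * y_box + B
        w_box = ASPECT * h_box
        if h_box < MIN_H or y_box - h_box < 0 or w_box >= img_w:
            continue
        x_box = random.uniform(0.0, img_w - w_box)
        return x_box, y_box - h_box, w_box, h_box
    return None

def augment_image(image, pedestrians, shapes, run_stda, paste):
    """Algorithm 1 for one frame: place n in {1..5} transformed pedestrians."""
    for _ in range(random.randint(1, 5)):
        box = sample_placement()
        if box is None:
            continue
        z_i = random.choice(pedestrians)   # real pedestrian patch
        s_j = random.choice(shapes)        # shape from a different pedestrian
        paste(image, run_stda(z_i, s_j, image, box), box)
    return image
```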
4 Experiments

We perform a comprehensive evaluation of the proposed STDA framework for augmenting pedestrian datasets. We use the popular Caltech (Dollár et al., 2009; Dollar et al., 2012) and CityPersons (Zhang et al., 2017) benchmarks for the evaluation.

In this section, we first present the overall dataset augmentation results on the evaluated datasets. Then, we validate the improvements in detection accuracy obtained by applying our proposed STDA framework to augment different datasets, comparing to other cutting-edge pedestrian detectors. Subsequently, we perform detailed ablation studies on the STDA framework to analyze the effects of its different components on generating more realistic-looking pedestrians and on improving detection accuracy.

4.1 Settings and Implementation Details

For evaluation, we consider the log-average miss rate (MR) against different false positive rates as the major metric to represent pedestrian detection performance. On Caltech, we follow the protocol of (Zhang et al., 2016b) and use around 42k images for training and 4024 images for testing. On CityPersons, as suggested in the original study, we use 2975 images for training and perform the evaluation on the 500 images of the validation set. We apply a Mask RCNN to extract shapes on Caltech and use the annotated pedestrian masks on CityPersons. To augment the datasets, for each frame, we transform n pedestrians using our framework, where n is uniformly sampled from {1, 2, 3, 4, 5}. Thus, each image has its number of positive pedestrians increased by 1 ∼ 5.

For the network employed to implement the framework, we use the U-Net architecture with 8 blocks. All input and output patches have a size of 256 × 256. Both the D introduced in Eq. 10 and the R introduced in Eq. 11 are CNNs with 3 convolutional blocks. During optimization, we reduce the updating frequency of D and R to stabilize the training, i.e. we update D and R once at every 40-th update of the U-Net. The learning rate is set to 1e-5 and we perform training with 80 epochs for a dataset.

We adopt a ResNet50-based FPN detector (Lin et al., 2017) as our baseline detector. When training this detector, we modified some default parameters according to the pedestrian detection task. First, for the region proposal network in FPN, we follow (Zhang et al., 2016b) and only use the anchor with the aspect ratio of 2.44. We discard the 512x512 anchors in FPN because they do not contribute much to the performance. In addition, we set the batch size to 512 for both the Region Proposal Network (RPN) and the Regional CNN (RCNN) in FPN. To reduce the false positive rates of FPN, we further set the foreground thresholds of RPN and RCNN to 0.5 and 0.7, respectively. During training, we make the length of the shorter side of input images 720 for Caltech and 1024 for CityPersons. We train the FPN detector on Caltech for 3 epochs and on CityPersons for 6 epochs. In general, the final performance of the baseline detector is a 10.4% mean miss rate on the Caltech test set and a 13.9% mean miss rate on the CityPersons validation set. Note that we weight the loss values for synthesized pedestrians by a factor of 0.1, reducing potential biases towards generated pedestrians rather than real pedestrians.
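For readers reproducing the setup, the settings above can be summarized as a plain configuration dictionary. This is a hypothetical summary written by us for convenience; the values mirror the text, but the key names are not from any released code base.

```python
# Hypothetical summary of the detector and framework settings described above.
STDA_EXPERIMENT_CFG = {
    "detector": {
        "backbone": "ResNet50-FPN",
        "rpn": {"anchor_aspect_ratios": [2.44], "batch_size": 512,
                "fg_iou_thresh": 0.5, "drop_512x512_anchors": True},
        "rcnn": {"batch_size": 512, "fg_iou_thresh": 0.7},
        "input_short_side": {"caltech": 720, "citypersons": 1024},
        "epochs": {"caltech": 3, "citypersons": 6},
        "synthesized_loss_weight": 0.1,  # down-weight generated pedestrians
    },
    "stda_network": {
        "unet_blocks": 8, "patch_size": 256,
        "d_and_r_conv_blocks": 3, "d_and_r_update_every": 40,
        "learning_rate": 1e-5, "epochs": 80,
        "pedestrians_per_image": (1, 5),
    },
}
```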
4.2 Dataset Augmentation Results

We first present the pedestrian synthesis results of applying our STDA framework to augment pedestrian datasets.

4.2.1 Pedestrian Synthesis Results

In Figure 5, we illustrate the dataset augmentation results on both the evaluated Caltech and CityPersons datasets. Even though some of the pedestrians are blurry and lack rich appearance details, we can observe that the shape-transformed pedestrians can still be naturally blended into the environments of the image, yielding very realistic-looking pedestrians for dataset augmentation. Furthermore, STDA can also produce pedestrians in uncommon walking areas, such as the middle of the street. This increases the number of irregular foreground examples for pedestrian detection, making the model more robust in detecting pedestrians after augmentation. Moreover, with a geometry arrangement similar to that of real pedestrians, the illustrated results demonstrate that the proposed STDA framework is effective in generating pedestrians in a domain similar to real pedestrians. Besides, our method can produce occlusion cases, e.g. by overlapping generated pedestrians over real pedestrians, which can promisingly increase the number of occlusion cases for training and thus improve the detection robustness for occlusions.

In addition, we also compare our method with another recently published powerful GAN-based data rendering technique, i.e. PS-GAN (Ouyang et al., 2018b), using the same background patches. We train the compared PS-GAN using protocols similar to the original study, with the major difference that we use the very low-quality pedestrian data provided in the pedestrian dataset. Training schemes for both our study and PS-GAN are kept the same. Figure 6(a) shows some pedestrian synthesis results using existing GANs. We can find that the compared GAN-based method produces very blurry pedestrians due to the low-quality training data.

Fig. 5 Dataset augmentation results of the proposed STDA on images from Caltech (top 2 rows) and CityPersons (bottom 3 rows), respectively. Light green bounding boxes indicate the synthesized pedestrians. The presented image patches are cropped and zoomed to better illustrate the details. Best viewed in color.

Besides, the generated backgrounds can also be unnatural and distorted due to the lack of sufficient appearance details during training. On the contrary, as shown in Figure 6(b), our proposed STDA framework can effectively generate much more realistic and natural-looking pedestrians in different background patches. Our method achieves a significantly lower score than PS-GAN, meaning that the STDA-generated pedestrians are much more similar to the true data. This illustrates its superiority over the GAN-based data rendering methods.

Fig. 6 Pedestrian synthesis results of STDA, comparing to another cutting-edge GAN-based data generation method: (a) synthesis results of PS-GAN (Ouyang et al., 2018b); (b) synthesis results of STDA (ours). Synthesized pedestrians are in the middle of each image patch and background patches are kept the same. Best viewed in color.

Fig. 7 Effects of the proposed STDA framework for augmenting the Caltech pedestrian dataset, comparing to other cutting-edge pedestrian detectors ("+" means multi-scale testing). Log-average miss rates (lower is better): 11.5% UDN+, 10.4% FPN (baseline), 10.3% FasterRCNN+ATT, 9.9% MS-CNN, 9.7% SA-FastRCNN, 9.6% RPN+BF, 9.2% AdaptFasterRCNN, 8.2% F-DNN+SS, 7.8% GDFL, 7.4% SDS-RCNN, 7.2% FPN w STDA (ours), 6.4% FPN w STDA+ (ours).

Fig. 8 Effects of the proposed STDA framework for augmenting the Caltech pedestrian dataset, comparing to other pedestrian synthesis methods such as random pasting, PoseGen (Ma et al., 2017), and PS-GAN (Ouyang et al., 2018b). Log-average miss rates: 10.4% FPN (baseline), 9.7% FPN w PS-GAN, 9.5% FPN w Pasting, 7.2% FPN w STDA (ours).

4.2.2 Improvements for Pedestrian Detection

Caltech: To evaluate the augmentation results of our proposed STDA framework, we first perform the evaluation on the test set of the Caltech benchmark. We evaluate the performance gains with respect to the baseline detector to demonstrate the effectiveness. Fig. 7 shows the detailed performance of our method compared to other cutting-edge methods (Ouyang et al., 2018a; Zhang et al., 2018b; Cai et al., 2016; Li et al., 2018; Zhang et al., 2016b, 2017; Du et al., 2017; Brazil et al., 2017; Lin et al., 2018). In particular, our framework improves the miss rate by around 30% over the baseline detector. By further applying multi-scale testing, we achieve a 38% improvement, significantly outperforming other cutting-edge pedestrian detectors such as MS-CNN (Cai et al., 2016), which introduced a more complicated architecture.

In Figure 8, we also compare our framework with some other augmentation methods, including Pasting, which directly pastes real pedestrians randomly, and PS-GAN (Ouyang et al., 2018b), which generates pedestrian patches based on a pix2pixGAN (Isola et al., 2017) pipeline.

Methods / MR (%)                        OCC (heavy)   OCC (partial)
UDN+ (Ouyang et al., 2018a)                 70.3           28.2
FasterRCNN+ATT (Zhang et al., 2018b)        45.2           22.3
MS-CNN (Cai et al., 2016)                   59.9           19.2
SA-FastRCNN (Li et al., 2018)               64.4           24.8
RPN+BF (Zhang et al., 2016b)                74.4           24.2
AdaptFasterRCNN (Zhang et al., 2017)        57.6           26.5
F-DNN+SS (Du et al., 2017)                  53.8           15.1
GDFL (Lin et al., 2018)                     43.2           16.7
SDS-RCNN (Brazil et al., 2017)              58.5           14.9
FPN (baseline) (Lin et al., 2017)           59.3           22.9
FPN w STDA (ours)                           41.9           17.2
FPN w STDA+ (ours)                          43.5           12.4

Table 1 Performance on occluded (OCC) pedestrians on Caltech test set. Best results are highlighted in bold. “+” means multi-scale testing.

Methods / MR (%)                        AR (typical)   AR (a-typical)
UDN+ (Ouyang et al., 2018a)                 7.8             14.9
FasterRCNN+ATT (Zhang et al., 2018b)        6.0             19.4
MS-CNN (Cai et al., 2016)                   6.3             15.7
SA-FastRCNN (Li et al., 2018)               5.7             15.8
RPN+BF (Zhang et al., 2016b)                6.0             14.5
AdaptFasterRCNN (Zhang et al., 2017)        5.0             16.2
F-DNN+SS (Du et al., 2017)                  5.1             13.3
GDFL (Lin et al., 2018)                     4.5             14.7
SDS-RCNN (Brazil et al., 2017)              4.6             11.7
FPN (baseline) (Lin et al., 2017)           6.6             15.7
FPN w STDA (ours)                           4.5             11.3
FPN w STDA+ (ours)                          3.9              9.9

Table 2 Performance on pedestrians with diversified aspect ratios (AR) on Caltech test set. Best results are highlighted in bold. "typical" means pedestrians with normal aspect ratios; "a-typical" means pedestrians with unusual aspect ratios; "+" means multi-scale testing.

We can find that the three other compared methods also slightly improve the baseline detector, suggesting that augmenting pedestrian datasets with synthesized pedestrians is useful for improving detection accuracy. However, due to the unnatural pedestrians synthesized from low-quality data, as presented in Figure 6(a), the improvement brought by PS-GAN is very limited. Even randomly pasting real pedestrians delivers a slightly better improvement using low-quality data. Moreover, we can also observe that the compared methods have higher false positives per image than the baseline detector at low miss rates, suggesting that the baseline detector may be distracted by unnatural pedestrians to some extent. Compared to the other pedestrian synthesis methods, the performance gain brought by our proposed STDA is much more significant with respect to the baseline detector, confirming that our proposed framework is much more effective in augmenting pedestrian datasets using low-quality pedestrian data. Furthermore, with the more realistic-looking pedestrians synthesized by STDA, the augmented dataset consistently improves the baseline detector at all presented false-positives-per-image rates.

Besides the overall performance, we also present performance on specific detection attributes. For example, Table 1 shows the detection accuracy on pedestrians with partial or heavy occlusions. According to the statistics, we can find that the proposed STDA effectively reduces the average miss rate of the baseline detector for both partially and heavily occluded pedestrians, achieving favorable performance compared to other cutting-edge pedestrian detectors. This confirms that synthesizing pedestrians with occlusions using our proposed STDA framework can promisingly help improve the detection robustness and accuracy for occluded pedestrians in the test set. In addition, we also evaluate the performance of applying STDA to augment Caltech on pedestrians with different aspect ratios in Table 2. In particular, for the detection of pedestrians with "typical" aspect ratios, our proposed framework is able to boost the performance of the baseline detector by up to 41%.

Method / MR (%)                       Reasonable   Heavy   Partial   Bare
Citypersons (Zhang et al., 2017)         15.4        -        -        -
TLL (Song et al., 2018)                  14.4       52.0     15.9     9.2
RepulsionLoss (Wang et al., 2018)        13.2       56.9     16.8     7.6
OR-CNN (Zhang et al., 2018a)             12.8       55.7     15.3     6.7
FPN (baseline) (Lin et al., 2017)        13.9       52.9     15.4     8.5
FPN w STDA (ours)                        11.0       44.1     11.3     6.4
FPN w STDA+ (ours)                       10.2       41.9     10.5     5.8

Table 3 Performance on the validation set of CityPersons. Best results are highlighted in bold. "+" means multi-scale testing.

When detecting pedestrians with "a-typical" aspect ratios, our method also promisingly improves the baseline performance, obtaining the lowest average miss rate among the compared pedestrian detectors. These results demonstrate that our framework can produce richly diversified and beneficial pedestrians for the augmentation.

CityPersons: In this section, we also report the performance on the validation set of CityPersons. The experiment settings are similar to those of the Caltech evaluation except that image sizes are 1024 × 2048 for training and testing.

Table 3 presents the detailed statistics of the evaluated methods. We can find that our framework effectively augments the original dataset and improves the performance of the baseline FPN detector. State-of-the-art performance can be accessed by applying our proposed framework, further demonstrating that our proposed framework can consistently augment different pedestrian datasets with low-quality pedestrian data.

4.3 Ablation Studies

In this section, we perform a comprehensive component analysis of the proposed STDA framework for both pedestrian generation and pedestrian detection augmentation, using the low-quality pedestrians of the Caltech dataset and the Caltech benchmark for training.

4.3.1 Qualitative Study

We first evaluate the qualitative effects of the different components of the STDA framework on the pedestrian generation task. In particular, we start the experiments from only using the shape-guided deformation supervised by L_shape for pedestrian generation. Then, we gradually add the shape constraining operation, the cyclic reconstruction loss L_cyc, the adversarial loss L_adv, the environment-aware blending map α(x, y), and the hard positive mining loss L_hpm to help generate pedestrians. We present the effects of the different components by generating pedestrians based on low-quality real pedestrian data in Fig. 9. According to the presented results, we can observe that the quality of the generated pedestrians is progressively improved by introducing more components, demonstrating the effectiveness of the different components of the STDA framework. More specifically, the shape constraining operation first helps the deformation produce less distorted pedestrians. Then, by adding the cyclic loss L_cyc and the adversarial loss L_adv, the obtained pedestrians become more realistic-looking in their details. Subsequently, the introduced environment-aware blending map helps the transformed pedestrians better adapt into the background image patch. Lastly, L_hpm can slightly change some appearance characteristics, such as illumination or color, to make the pedestrians less distinguishable from the environments, which further improves the pedestrian generation results.

4.3.2 Quantitative Study

To perform the ablation studies, we split the training set of Caltech into one smaller training set and one validation set. More specifically, we collect the frames from the first four sets of the training data as training images, while the frames from the last set are considered validation images. We sample every 30-th frame in the overall dataset to set up the training/validation sets. Note that this training/validation setting is ONLY used for the ablation study.

Table 4 presents the detailed results. We can find that each of the introduced components, including the shape constraining operation (SC), the cyclic loss (L_cyc), the adversarial loss (L_adv), the environment-aware blending map (EBM), and the hard positive mining (HPM), contributes a promising average miss rate reduction. In particular, the cyclic and adversarial losses, which help better deform pedestrians, and the environment-aware blending map, which helps better adapt the deformed pedestrians, both greatly boost the benefits of the synthesized pedestrians for improving detection accuracy. The proposed hard positive mining scheme further improves the detection accuracy, demonstrating its effectiveness for dataset augmentation. Based on the qualitative analysis shown in Figure 9, we can further conclude that augmenting pedestrian datasets with more realistic-looking pedestrians delivers better improvements on detection accuracy.


Fig. 9 Effects of different components in the proposed STDA framework. "SC" means shape constraining operation; "EBM" means environment-aware blending map; "HPM" means hard positive mining. Best viewed in color.

Configuration                                      MR (%)
baseline                                           10.77
+ L_shape                                          10.58
+ L_shape + SC                                     10.31
+ L_shape + SC + L_cyc                              9.03
+ L_shape + SC + L_cyc + L_adv                      8.56
+ L_shape + SC + L_cyc + L_adv + EBM                7.73
+ L_shape + SC + L_cyc + L_adv + EBM + HPM          7.49

Table 4 Effects of different components in the proposed STDA framework on the selected validation set of the Caltech dataset. "SC" means shape constraining operation; "EBM" means environment-aware blending map; "HPM" means hard positive mining.

5 Conclusions

In this study, we present a novel shape transformation-based dataset augmentation framework to improve pedestrian detection. The proposed framework can effectively deform natural pedestrians into a different shape and can adequately adapt the deformed pedestrians into various background environments. Using the low-quality pedestrian data available in the datasets, our proposed framework produces much more lifelike pedestrians than other cutting-edge data synthesis techniques. By applying the proposed framework on two different well-known pedestrian benchmarks, i.e. Caltech and CityPersons, we improve the baseline pedestrian detector by a large margin, achieving state-of-the-art performance on both of the evaluated benchmarks.

References

Alcorn MA, Li Q, Gong Z, Wang C, Mai L, Ku WS, Nguyen A (2018) Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. arXiv preprint arXiv:1811.11553
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875
Bar-Hillel A, Levi D, Krupka E, Goldberg C (2010) Part-based feature synthesis for human detection. In: ECCV, Springer, pp 127–142
Brazil G, Yin X, Liu X (2017) Illuminating pedestrians via simultaneous detection & segmentation. In: ICCV
Cai Z, Fan Q, Feris RS, Vasconcelos N (2016) A unified multi-scale deep convolutional neural network for fast object detection. In: ECCV, Springer, pp 354–370
Cireşan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural Computation 22(12):3207–3220
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: ICCV, pp 764–773
Dao T, Gu A, Ratner A, Smith V, De Sa C, Re C (2019) A kernel theory of modern data augmentation. In: ICML, pp 1528–1537
Dollár P, Wojek C, Schiele B, Perona P (2009) Pedestrian detection: A benchmark. In: CVPR, IEEE, pp 304–311
Dollar P, Wojek C, Schiele B, Perona P (2012) Pedestrian detection: An evaluation of the state of the art. T-PAMI 34(4):743–761
Dosovitskiy A, Fischer P, Springenberg JT, Riedmiller M, Brox T (2015) Discriminative unsupervised feature learning with exemplar convolutional neural networks. T-PAMI 38(9):1734–1747
Du X, El-Khamy M, Lee J, Davis L (2017) Fused dnn: A deep neural network fusion approach to fast and robust pedestrian detection. In: WACV, IEEE, pp 953–961
Enzweiler M, Gavrila DM (2008) Monocular pedestrian detection: Survey and experiments. T-PAMI (12):2179–2195
Felzenszwalb P, McAllester D, Ramanan D (2008) A discriminatively trained, multiscale, deformable part model. In: CVPR, IEEE, pp 1–8
Felzenszwalb PF, Girshick RB, McAllester D (2010a) Cascade object detection with deformable part models. In: CVPR, IEEE, pp 2241–2248
Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010b) Object detection with discriminatively trained part-based models. T-PAMI 32(9):1627–1645
Ge Y, Li Z, Zhao H, Yin G, Yi S, Wang X, et al. (2018) Fd-gan: Pose-guided feature distilling gan for robust person re-identification. In: NIPS, pp 1230–1241
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32(11):1231–1237
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp 580–587
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: NIPS, pp 2672–2680
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. In: NIPS, pp 5767–5777
Hattori H, Naresh Boddeti V, Kitani KM, Kanade T (2015) Learning scene-specific pedestrian detectors without real data. In: CVPR, pp 3819–3827
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: ICCV, IEEE, pp 2980–2988
Huang S, Ramanan D (2017) Expecting the unexpected: Training detectors for unusual pedestrians with adversarial imposters. In: CVPR, IEEE
Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. arXiv preprint
Jaderberg M, Simonyan K, Zisserman A, et al. (2015) Spatial transformer networks. In: NIPS, pp 2017–2025
Lerer A, Gross S, Fergus R (2016) Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312

Li J, Liang X, Shen S, Xu T, Feng J, Yan S (2018) Scale-aware fast r-cnn for pedestrian detection. TMM 20(4):985–996
Lin C, Lu J, Wang G, Zhou J (2018) Graininess-aware deep feature learning for pedestrian detection. In: ECCV, Springer, pp 732–747
Lin TY, Dollár P, Girshick RB, He K, Hariharan B, Belongie SJ (2017) Feature pyramid networks for object detection. In: CVPR
Liu J, Ni B, Yan Y, Zhou P, Cheng S, Hu J (2018) Pose transferrable person re-identification. In: CVPR, IEEE, pp 4099–4108
Liu MY, Breuel T, Kautz J (2017a) Unsupervised image-to-image translation networks. In: NIPS, pp 700–708
Liu T, Lugosi G, Neu G, Tao D (2017b) Algorithmic stability and hypothesis complexity. In: ICML, pp 2159–2167
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. In: ECCV, Springer, pp 21–37
Ma L, Jia X, Sun Q, Schiele B, Tuytelaars T, Van Gool L (2017) Pose guided person image generation. In: NIPS, pp 406–416
Ma L, Sun Q, Georgoulis S, Van Gool L, Schiele B, Fritz M (2018) Disentangled person image generation. In: CVPR, pp 99–108
Ouyang W, Wang X (2013) Joint deep learning for pedestrian detection. In: CVPR, pp 2056–2063
Ouyang W, Zhou H, Li H, Li Q, Yan J, Wang X (2018a) Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection. T-PAMI 40(8):1874–1887
Ouyang X, Cheng Y, Jiang Y, Li CL, Zhou P (2018b) Pedestrian-synthesis-gan: Generating pedestrian data in real scene and beyond. arXiv preprint arXiv:1804.02047
Park D, Ramanan D, Fowlkes C (2010) Multiresolution models for object detection. In: ECCV, Springer, pp 241–254
Pishchulin L, Jain A, Wojek C, Andriluka M, Thormählen T, Schiele B (2011) Learning people detection models from few training samples. In: CVPR, IEEE, pp 1473–1480
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
Ran Y, Weiss I, Zheng Q, Davis LS (2007) Pedestrian detection via periodic motion analysis. International Journal of Computer Vision 71(2):143–160
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: CVPR, IEEE, pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS, pp 91–99
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: MICCAI, Springer, pp 234–241
Ros G, Sellart L, Materzynska J, Vazquez D, Lopez AM (2016) The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In: CVPR, IEEE, pp 3234–3243
Sajjadi M, Javanmardi M, Tasdizen T (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: NIPS, pp 1163–1171
Siarohin A, Sangineto E, Lathuilière S, Sebe N (2018) Deformable gans for pose-based human image generation. In: CVPR, pp 3408–3416
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Song T, Sun L, Xie D, Sun H, Pu S (2018) Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In: ECCV
Vapnik V (2013) The nature of statistical learning theory. Springer Science & Business Media
Villegas R, Yang J, Zou Y, Sohn S, Lin X, Lee H (2017) Learning to generate long-term future via hierarchical prediction. arXiv preprint arXiv:1704.05831
Viola P, Jones MJ, Snow D (2005) Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision 63(2):153–161
Wang X, Shrivastava A, Gupta A (2017) A-fast-rcnn: Hard positive generation via adversary for object detection. In: CVPR, pp 2606–2615
Wang X, Xiao T, Jiang Y, Shao S, Sun J, Shen C (2018) Repulsion loss: detecting pedestrians in a crowd. In: CVPR, pp 7774–7783

Yan Y, Xu J, Ni B, Zhang W, Yang X (2017) Skeleton-aided articulated motion generation. In: ACM MM, pp 199–207
Zanfir M, Popa AI, Zanfir A, Sminchisescu C (2018) Human appearance transfer. In: CVPR, pp 5391–5399
Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2016a) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530
Zhang L, Lin L, Liang X, He K (2016b) Is faster r-cnn doing well for pedestrian detection? In: ECCV, Springer, pp 443–457
Zhang S, Benenson R, Omran M, Hosang J, Schiele B (2016c) How far are we from solving pedestrian detection? In: CVPR, IEEE, pp 1259–1267
Zhang S, Benenson R, Schiele B (2017) Citypersons: A diverse dataset for pedestrian detection. In: CVPR, IEEE
Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018a) Occlusion-aware r-cnn: detecting pedestrians in a crowd. In: ECCV, pp 637–653
Zhang S, Yang J, Schiele B (2018b) Occluded pedestrian detection through guided attention in cnns. In: CVPR, IEEE, pp 6995–7003
Zheng Z, Zheng L, Yang Y (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: ICCV, pp 3754–3762
Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, IEEE