International Journal of Computer Vision manuscript No. (will be inserted by the editor)
A Shape Transformation-based Dataset Augmentation Framework for Pedestrian Detection
Zhe Chen 1 · Wanli Ouyang 2 · Tongliang Liu 1 · Dacheng Tao 1
arXiv:1912.07010v1 [cs.CV] 15 Dec 2019

Abstract Deep learning-based computer vision is usually data-hungry. Many researchers attempt to augment datasets with synthesized data to improve model robustness. However, the augmentation of popular pedestrian datasets, such as Caltech and CityPersons, can be extremely challenging because real pedestrians are commonly of low quality. Due to factors like occlusion, blur, and low resolution, it is significantly difficult for existing augmentation approaches, which generally synthesize data using 3D engines or generative adversarial networks (GANs), to generate realistic-looking pedestrians. Alternatively, to access much more natural-looking pedestrians, we propose to augment pedestrian detection datasets by transforming real pedestrians from the same dataset into different shapes. Accordingly, we propose the Shape Transformation-based Dataset Augmentation (STDA) framework. The proposed framework is composed of two successive modules, i.e. shape-guided deformation and environment adaptation. In the first module, we introduce a shape-guided warping field to help deform the shape of a real pedestrian into a different shape. Then, in the second module, we propose an environment-aware blending map to better adapt the deformed pedestrians into surrounding environments, obtaining more realistic-looking pedestrians and more beneficial augmentation results for pedestrian detection. Extensive empirical studies on different pedestrian detection benchmarks show that the proposed STDA framework consistently produces much better augmentation results than other pedestrian synthesis approaches using low-quality pedestrians. By augmenting the original datasets, our proposed framework also improves the baseline pedestrian detector by up to 38% on the evaluated benchmarks, achieving state-of-the-art performance.

Keywords Pedestrian Detection · Dataset Augmentation · Pedestrian Rendering

B Zhe Chen ([email protected]) · Wanli Ouyang ([email protected]) · Tongliang Liu ([email protected]) · Dacheng Tao ([email protected])
1 UBTECH Sydney Artificial Intelligence Centre, The University of Sydney, Sydney, Australia
2 The University of Sydney, Sydney, Australia

1 Introduction

With the introduction of large-scale pedestrian datasets (Dollár et al., 2009; Dollar et al., 2012; Zhang et al., 2017; Geiger et al., 2013), deep convolutional neural networks (DCNNs) have achieved promising detection accuracy. However, the trained DCNN detectors may not be robust enough because negative background examples greatly exceed positive foreground examples during training. Recent studies have confirmed that DCNN detectors trained with limited foreground examples can be vulnerable to difficult objects with unexpected states (Huang and Ramanan, 2017) and diversified poses (Alcorn et al., 2018).

To improve detector robustness, besides designing new machine learning algorithms, many researchers attempt to augment training datasets by generating new foreground examples. For instance, Huang et al. (Huang and Ramanan, 2017) used a 3D game engine to simulate pedestrians and adapt them into pedestrian datasets. Other studies (Ma et al., 2018; Siarohin et al., 2018; Zanfir et al., 2018; Ge et al., 2018) attempted to augment person re-identification datasets by transferring the poses of pedestrians using generative adversarial networks (GANs). Despite this progress, it is still very challenging to adequately apply existing
Fig. 1 We propose the shape transformation-based dataset augmentation framework for pedestrian detection. In the framework, we successively introduce the shape-guided warping field to deform pedestrians and the environment-aware blending map to adapt the deformed pedestrians into background environments. Our proposed framework can effectively generate more realistic-looking pedestrians for augmenting pedestrian datasets in which real pedestrians are usually of low quality. Best viewed in color.

augmentation approaches on the common pedestrian detection datasets. First, synthesizing pedestrians using external platforms like 3D game engines may introduce a significant domain gap between synthesized pedestrians and real pedestrians, limiting the overall benefit of the generated pedestrians for improving model robustness on real pedestrians. Moreover, methods that utilize GANs to render pedestrians generally require rich appearance details from paired training images to help define the desired output of the generative networks during training.

However, in common pedestrian detection datasets like Caltech (Dollar et al., 2012) and CityPersons (Zhang et al., 2017), pedestrians are usually of low quality due to factors like heavy occlusion, blurry appearance, and the low resolution caused by small sizes. As a result, the available real pedestrians provide only an extremely limited amount of appearance detail that can be used for training generative networks. Without a sufficient description of the desired appearance of synthesized pedestrians, as we show in our experiments, current GAN-based methods only generate less realistic or even corrupted pedestrians from the very low-quality pedestrians in common pedestrian detection datasets.

To address the above issues, we propose to augment pedestrian datasets by transforming real pedestrians from the same dataset according to different shapes (i.e. segmentation masks in this study) rather than rendering new pedestrians. Our motivation comes from the following observations. First, unlike existing methods that require sufficient appearance details to define the desired output, it is much easier to access rich pixel-level shape deformation supervision, which defines the deformation from one shape to another, when only low-quality pedestrian examples are available in the datasets. The learned deformation between different shapes can guide the deformation of the appearances of real pedestrians, avoiding the requirement of detailed supervision information to directly define the transformed appearances. In addition, since shape information naturally distinguishes foreground areas from background areas, we can simply focus on adapting synthesized foreground appearances into background environments, avoiding the risk of further generating unnatural background environments together with the synthesized pedestrians, as required in current GAN-based approaches. Last but not least, we find that transforming real pedestrians based on different shapes can effectively increase foreground sample diversity while still adequately maintaining the appearance characteristics of real pedestrians.

Based on these observations, we devise a Shape Transformation-based Dataset Augmentation (STDA) framework to fulfill the pedestrian dataset augmentation task more effectively. In particular, the framework first deforms a real pedestrian into a similar pedestrian with a different shape and then adapts the shape-deformed pedestrians into the surrounding environments of the image to be augmented. In the STDA framework, we introduce a shape-guided warping field, which is a set of vectors that define the warping operation between shapes, to further define an appropriate deformation between the shapes and the appearances of real pedestrians. Moreover, we introduce an environment-aware blending map to help the shape-deformed pedestrians blend better into various background environments, delivering more realistic-looking pedestrians on the image.

In this study, our key contributions are listed as follows:

– We propose a shape transformation-based dataset augmentation framework to augment pedestrian detection datasets and improve pedestrian detection accuracy. To the best of our knowledge, we are the first to apply the shape-transformation-based data synthesis methodology to pedestrian detection.
– We propose the shape-guided warping field to help define a proper shape deformation procedure. We also introduce an environment-aware blending map to better adapt the shape-transformed pedestrians into different backgrounds, achieving better augmentation results on the image.
– We introduce a shape constraining operation to improve shape deformation quality and a novel hard positive mining loss to further magnify the benefits of the synthesized pedestrians for improving detection robustness.
– Our proposed framework is promising for generating pedestrians from low-quality examples. Comprehensive evaluations on the well-known Caltech (Dollar et al., 2012) and CityPersons (Zhang et al., 2017) benchmarks validate that our proposed framework can generate more realistic-looking pedestrians than existing methods using low-quality data. With pedestrian datasets augmented by our framework, we promisingly boost the performance of the baseline pedestrian detector, achieving performance superior to other cutting-edge pedestrian detectors.

2 Related Work

2.1 Pedestrian Detection

Pedestrian detection is critical in many applications such as robotics and autonomous driving (Enzweiler and Gavrila, 2008; Dollár et al., 2009; Dollar et al., 2012; Zhang et al., 2016c). Traditional pedestrian detectors generally use hand-crafted features (Viola et al., 2005; Ran et al., 2007) and adopt human part-based detection strategies (Felzenszwalb et al., 2010b) or cascaded structures (Felzenszwalb et al., 2010a; Bar-Hillel et al., 2010; Felzenszwalb et al., 2008). Recently, by taking advantage of large-scale pedestrian datasets (Dollár et al., 2009; Dollar et al., 2012; Zhang et al., 2017; Geiger et al., 2013), researchers have greatly improved pedestrian detection performance with DCNNs (Simonyan and Zisserman, 2014; He et al., 2016). Among DCNN detectors, two-stage detection pipelines (Ouyang and Wang, 2013; Ren et al., 2015; Li et al., 2018; Cai et al., 2016; Zhang et al., 2016b; Du et al., 2017) usually perform better than single-stage detection pipelines (Liu et al., 2016; Redmon et al., 2016; Lin et al., 2018). Despite this progress, the extreme imbalance between foreground and background examples in pedestrian datasets still adversely affects the robustness of DCNN detectors, and current pedestrian detectors can still be fragile to even small transformations of pedestrians. To tackle this problem, many researchers tend to augment the datasets by synthesizing new foreground data.

2.2 Simulation-based Dataset Augmentation

To achieve dataset augmentation, researchers have used 3D simulation platforms to synthesize new examples for the datasets. For example, (Lerer et al., 2016; Ros et al., 2016) used a 3D game engine to help build new datasets. More related studies used 3D simulation platforms to augment pedestrian-related datasets. In particular, (Pishchulin et al., 2011; Hattori et al., 2015) employed a game engine to synthesize training data for pedestrian detection. In addition, (Huang and Ramanan, 2017) applied a GAN to narrow the domain gap between 3D-simulated pedestrians and natural pedestrians to augment pedestrian datasets, but this method brings limited improvement on common pedestrian detection benchmarks, suggesting that a significant domain gap remains between simulated and real pedestrians. Such a gap can further impose negative effects on DCNN detectors, making the augmented datasets deliver only incremental improvements on pedestrian detection.

2.3 GAN-based Dataset Augmentation

Recently, with several improvements (Radford et al., 2015; Arjovsky et al., 2017; Gulrajani et al., 2017), GANs (Goodfellow et al., 2014) have shown great benefits for synthesis-based applications such as image-to-image translation (Isola et al., 2017; Liu et al., 2017a; Zhu et al., 2017) and skeleton-to-image generation (Villegas et al., 2017; Yan et al., 2017). In the person re-identification literature, many works attempted to transfer the poses of real pedestrians to deliver diversified pedestrians for augmentation. For instance, (Liu et al., 2018; Ma et al., 2018; Siarohin et al., 2018; Zanfir et al., 2018; Ge et al., 2018; Zheng et al., 2017; Ma et al., 2017) introduced various techniques to transform human appearance according to 2D or 3D poses and improve person re-identification performance. In practice, these methods require accurate pose information or paired training images that contain rich appearance details to achieve successful transformation. However, existing widely used pedestrian datasets like Caltech provide neither pose annotations nor paired appearance information for training GANs. Furthermore, in current pedestrian datasets, the large number of small pedestrians whose appearances are usually of low quality makes it difficult for existing pose estimators to deliver reasonable predictions. Fig. 2 shows some examples illustrating that the poses estimated for low-quality pedestrians are much more unstable than the masks estimated using the same Mask RCNN (He et al., 2017) detector. As a result, it is quite infeasible to seamlessly apply these pose transfer models to augment current pedestrian datasets.

Fig. 2 Some examples showing that shape estimation results are more accurate than pose estimation results on low-quality images using the same Mask RCNN model. Best viewed in color.

In pedestrian detection, some studies have introduced specifically designed GANs for augmentation. As an example, (Ouyang et al., 2018b) modified the pix2pixGAN (Isola et al., 2017) to make it more suitable for pedestrian generation, but this method lacks a particular mechanism for producing diversified pedestrians and still delivers poor generation results on low-quality data.

In this study, we propose that transforming pedestrians from the original dataset by altering their shapes can produce diversified and much more lifelike pedestrians without requiring rich appearance details for supervision.

3 Shape Transformation-based Dataset Augmentation Framework

3.1 Problem Definition

Data augmentation, commonly formulated as transformations of raw data, has been used to achieve the vast majority of state-of-the-art results in image recognition. Data augmentation is intuitively explained as increasing the training data size and as acting as a regularizer that can control hypothesis complexity (Goodfellow et al., 2016; Zhang et al., 2016a; Dao et al., 2019). In particular, hypothesis complexity can be used to measure the generalization error, i.e. the difference between the training and test errors, of learning algorithms (Vapnik, 2013; Liu et al., 2017b). A larger hypothesis complexity usually implies a larger generalization error and vice versa. In practice, a small training error and a small generalization error are favoured to guarantee a small test error. As a result, data augmentation is especially useful for deep learning models, which are powerful in maintaining a small training error but have a large hypothesis complexity. It has been empirically demonstrated that data augmentation operations can greatly improve the generalization ability of deep models (Cireşan et al., 2010; Dosovitskiy et al., 2015; Sajjadi et al., 2016).

In this study, the overall goal is to devise a more effective dataset augmentation framework to improve pedestrian detection models. The framework is supposed to generate diversified and more realistic-looking pedestrian examples to enrich the corresponding datasets, in which real pedestrians are usually of very low quality. We achieve this goal by transforming real pedestrians into different shapes rather than rendering new pedestrians. In fact, the devised data augmentation method, which focuses only on transforming pedestrians, can be efficacious in regularizing hypothesis complexity. This can be empirically justified by our experiments, which show that the pedestrian detection performance of the baseline model can be significantly improved without introducing extra model complexity. In general, the top line of Fig. 3 briefly describes our augmentation methodology.

Formally, suppose $z_i$ is an image patch containing a real pedestrian in the dataset and $s_i$ is its extracted shape, or segmentation mask. Denote by $s_j$ a different shape, which can be obtained from another real pedestrian's shape. We implement a shape transformation-based dataset augmentation function, denoted $f_{STDA}$, to generate a new pedestrian by transforming a real pedestrian into a new pedestrian with a more realistic-looking appearance but with another shape $s_j$ for the
Fig. 3 Overview of the proposed shape transformation-based dataset augmentation framework for pedestrian datasets with low-quality pedestrian data. In particular, we introduce the shape-guided warping field, $V_{i \to j}$, and the environment-aware blending map, $\alpha(x, y)$, to respectively help implement the shape-guided deformation and the environment adaptation, obtaining the deformed shape $s^{w}_{i \to j}$, the deformed pedestrian $z^{w}_{i \to j}$, and the transformation result $z^{gen}_{i \to j}$. By placing $z^{gen}_{i \to j}$ into the image $I$, we can effectively augment the original image. In practice, we employ a U-Net to predict both $V_{i \to j}$ and $\alpha(x, y)$. Training losses for the U-Net include $\mathcal{L}_{shape}$, $\mathcal{L}_{adv}$, $\mathcal{L}_{cyc}$, and $\mathcal{L}_{hpm}$. Best viewed in color.
augmentation:

$z^{gen}_{i \to j} = f_{STDA}(z_i, s_j, I)$,  (1)

where $z^{gen}_{i \to j}$ is a patch containing the newly generated pedestrian obtained by transforming the shape of $z_i$ into $s_j$, and $I$ is the image to be augmented.

3.2 Framework Overview

In pedestrian detection datasets, it is difficult to access sufficient appearance details to define the desired $z^{gen}_{i \to j}$, making it extremely challenging to generate realistic-looking pedestrians from low-quality appearances. To properly implement $f_{STDA}$ using low-quality pedestrians, we decompose the pedestrian generation task into two sub-tasks: shape-guided deformation and environment adaptation. The first task focuses on varying appearances to enrich data diversity, and the second task mainly adapts the deformed pedestrians to different environments to blend the transformed pedestrians into each image to be augmented. More specifically, given a pedestrian image patch $z_i$ and a different shape $s_j$, we first deform the image $z_i$ into a pedestrian with a similar appearance but a shape $s_j$ that is different from the original shape $s_i$. The deformation can be defined according to the transformation from $s_i$ into $s_j$. Then, we adapt the deformed pedestrian image into some background environments of the image $I$. Denote by $f_{SD}$ the function that implements the shape-guided deformation, and by $f_{EA}$ the function that implements the environment adaptation. The proposed framework implements $f_{STDA}$ as follows:

$f_{STDA}(z_i, s_j, I) = f_{EA}(f_{SD}(z_i, s_j), I)$.  (2)

Fig. 3 shows the detailed architecture of the proposed framework. As illustrated in the figure, we introduce a shape-guided warping field, denoted $V_{i \to j}$, to help implement the shape-guided deformation function. We also propose to apply the environment-aware blending map, denoted $\alpha(x, y)$, to achieve environment adaptation. In particular, with the help of $V_{i \to j}$, the deformation between different shapes can guide the deformation of the appearances of real pedestrians. Then, after better adapting the shape-deformed pedestrians into the background environments using $\alpha(x, y)$, we obtain diversified and more realistic-looking pedestrians to augment pedestrian detection datasets. In practice, we can employ a single end-to-end U-Net (Ronneberger et al., 2015) to fulfill both sub-tasks in a single pass. According to Eq. 2, the employed network takes as input the pedestrian patch $z_i$, its shape $s_i$, the target shape $s_j$, and a background patch from $I$, and then predicts both $V_{i \to j}$ and $\alpha(x, y)$.

In the following sections, we describe in detail the implementation of shape-guided deformation in Sec. 3.2.1 and the implementation of environment adaptation in Sec. 3.2.2.

3.2.1 Shape-guided Deformation

In this study, we implement deformation using warping operations. To obtain a detailed description of the warping operations, we introduce the shape-guided warping field, which is the assignment of the vectors for warping between shapes, to further help deform pedestrians. Denote by $v_{i \to j}(x, y)$ the warping vector located at $(x, y)$ that helps warp the shape $s_i$ into the shape $s_j$. The set of these warping vectors, i.e. $V_{i \to j} = \{v_{i \to j}(x, y)\}$, then forms a shape-guided warping field. An example of this warping field can be found in Fig. 3 under the symbol $V_{i \to j}$, where the warping field helps deform $s_i$ (colored in blue) into $s_j$ (colored in purple). Then, supposing $f_{warp}$ is the function that warps the input image patch according to the predicted warping field, we implement $f_{SD}$ by:

$z^{w}_{i \to j} = f_{SD}(z_i, s_j) = f_{warp}(z_i; V_{i \to j})$,  (3)

where $z^{w}_{i \to j}$ is the pedestrian $z_i$ warped according to the shape $s_j$. In practice, we define each warping vector $v_{i \to j}(x, y)$ as a 2D vector containing the horizontal and vertical displacements between the mapped warping point and the original point located at $(x, y)$. Thus, we can make the employed network directly predict $V_{i \to j}$. In addition, we implement $f_{warp}$ with the help of bilinear interpolation, since bilinear interpolation can properly back-propagate the gradients obtained from $z^{w}_{i \to j}$ to $V_{i \to j}$, aiding the training of the employed network. For more details about using bilinear interpolation for warping and training, we refer readers to (Jaderberg et al., 2015; Dai et al., 2017).

To make the shape-guided warping field adequately describe the deformation between shapes, we define that the estimated warping field should warp the shape $s_i$ into the shape $s_j$. Accordingly, the desired warping field $V_{i \to j}$ should make the following equation hold:

$s_j = s^{w}_{i \to j} = f_{warp}(s_i; V_{i \to j})$,  (4)

where $s^{w}_{i \to j}$ is the shape $s_i$ warped according to $V_{i \to j}$. Since $s_i$ and $s_j$ can easily be obtained from the pedestrian datasets, we are able to access sufficient pixel-level supervision to train the employed network. We mainly apply the L1 loss, $||s_j - s^{w}_{i \to j}||_1$, to make the network satisfy the above definition during training. By satisfying the above definition, we can obtain the desired warping field that helps generate shape-transformed natural pedestrians based on Eq. 3.

Shape Constraining Operation: In practice, we observe that if the target shape $s_j$ varies too much w.r.t. $s_i$, the obtained warping field may distort the input pedestrians after warping, resulting in unnatural results that could degrade the augmentation performance. To avoid this, we apply a shape constraining operation on the target shape.

To avoid inadequate shape deformation, we hypothesize that varying the lower body of a natural pedestrian more is more acceptable than varying the upper body more. In particular, we find that changing the upper body of a pedestrian too much would generally require changing the viewpoint of that pedestrian (e.g. from the side view to the front view) to obtain a natural appearance, but the warping operations do not generate the new image content needed to render the pedestrian from a different viewpoint. As a result, we apply a constraint on the target shape $s_j$ to avoid varying the upper body too much. More specifically, we constrain the target shape $s_j$ by combining it with the input shape $s_i$ according to a weighting function. The combination is defined on the middle point and the width of the foreground areas in each horizontal line of $s_j$. Suppose $y$ is the vertical offset of a horizontal line on $s_j$. We respectively denote $c^{j}_{y}$ and $l^{j}_{y}$ as the middle point and the width of the foreground areas on that line. We define the shape constraining operation as:

$c^{j}_{y} = \gamma(y)\, c^{j}_{y} + (1 - \gamma(y))\, c^{i}_{y}$,
$l^{j}_{y} = \gamma(y)\, l^{j}_{y} + (1 - \gamma(y))\, l^{i}_{y}$,  (5)

where $c^{i}_{y}$ and $l^{i}_{y}$ refer to the middle point and width of the foreground areas of $s_i$ at line $y$, and $\gamma(y)$ is the weighting function w.r.t. $y$ controlling the strictness of the constraint. According to Eq. 5, a smaller weight value $\gamma(y)$ makes the target shape contribute less to the combination result and vice versa. Accordingly, to allow more variation only for the lower body of a pedestrian, we define $\gamma(y)$ as a linear function that increases from 0 to 1 as $y$ varies from the top to the bottom.

Fig. 4 shows a visual example of the introduced shape constraining operation when constraining $s_j$ according to Eq. 5. We can observe that the proposed shape constraining operation adequately constrains $s_j$ by making the output shape closer to $s_i$ for the upper body of the pedestrian and closer to $s_j$ for the lower body.

Fig. 4 An example of the shape constraining operation. In particular, we combine the shape $s_j$ with $s_i$ based on a weighting function $\gamma(y)$. $c^{j}_{y}$ and $l^{j}_{y}$ respectively denote the middle point and the width of the foreground areas on the line of shape $s_j$ whose vertical offset is $y$; $c^{i}_{y}$ and $l^{i}_{y}$ are the corresponding middle point and width of the foreground areas of shape $s_i$. $\gamma(y)$ is a linear function of $y$, controlling the combination of $s_i$ and $s_j$. Best viewed in color.

3.2.2 Environment Adaptation

After the shape-guided deformation, we place the deformed pedestrians into the image $I$ to fulfill the augmentation. However, directly pasting deformed pedestrians onto the image can produce significant appearance mismatches due to issues like discontinuities in illumination conditions and imperfect shapes predicted by the mask extractor. To refine the generated pedestrians according to their environments, we further perform environment adaptation.

To properly blend a shape-deformed pedestrian into the image $I$ while considering the surrounding environments, we introduce an environment-aware blending map to help refine the deformed pedestrians. We formulate this refinement procedure as follows:

$z^{gen}_{i \to j} = f_{EA}(z^{w}_{i \to j}, I) = \{z^{a}_{i \to j}(x, y)\}$,  (6)

where $z^{a}_{i \to j}(x, y)$ is the environment adaptation result located at $(x, y)$:

$z^{a}_{i \to j}(x, y) = s_j(x, y)\,\alpha(x, y)\, z^{w}_{i \to j}(x, y) + \big(1 - s_j(x, y)\,\alpha(x, y)\big)\, I(x, y)$,  (7)

where $\alpha(x, y)$ is an entry of the environment-aware blending map located at $(x, y)$. An example of the estimated $\alpha(x, y)$ can be found in Fig. 3.

In practice, it is difficult to define the desired refinement result and the desired environment-aware blending map. Therefore, we cannot access appropriate supervision to train the employed network for environment adaptation. Without supervision, we apply an adversarial loss to encourage the employed network to learn to blend the deformed pedestrians into the environments effectively. Similar to the shape-guided warping field, we make the employed network directly predict the environment-aware blending map. Note that we constrain the environment-aware blending map to prevent it from changing the appearance of the deformed pedestrians too much. In particular, we adopt a shifted and rescaled tanh squashing function to make the values of $\alpha(x, y)$ lie in the range of 0.8 to 1.2.

3.3 Objectives

Since we employ a single network to predict both the shape-guided warping field and the environment-aware blending map, we can unify the objectives for training.

First, to obtain a proper shape-guided warping field, we introduce a shape deformation loss and a cyclic reconstruction loss. The shape deformation loss ensures that the predicted warping field satisfies the constraint described in Eq. 4. The cyclic reconstruction loss then ensures that the deformed shape and pedestrian can be deformed back to the input shape and pedestrian. We define the shape deformation loss $\mathcal{L}_{shape}$ for a pair of samples $i \to j$ as:

$\mathcal{L}_{shape} = \mathbb{E}[\,||s_j - s^{w}_{i \to j}||_1\,]$,  (8)

and the cyclic reconstruction loss as:

$\mathcal{L}_{cyc} = \mathbb{E}[\,||s_i - s^{w}_{j \to i}||_1 + ||z_i - z^{w}_{j \to i}||_1\,]$,  (9)

where $s^{w}_{j \to i}$ is the deformation result of $s^{w}_{i \to j}$ according to $s_i$, and $z^{w}_{j \to i}$ is the deformation result of $z^{w}_{i \to j}$ using the same warping field as for computing $s^{w}_{j \to i}$. As a result, Eq. 8 describes the L1-based shape deformation loss, and Eq. 9 forms the cyclic reconstruction loss.

In addition, an adversarial loss, denoted $\mathcal{L}_{adv}$, is included to make sure that the shape-guided deformation and environment adaptation produce more realistic-looking pedestrian patches. As in typical GANs, the adversarial loss is computed by introducing a discriminator $D$ for the employed network:
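The decomposition of $f_{STDA}$ into a deformation step followed by an adaptation step (Eq. 2) can be sketched as plain function composition. The following is an illustrative toy example, not the authors' code: `f_SD` and `f_EA` are trivial stand-ins for the learned shape-guided deformation and environment adaptation, and patches are tiny nested-list grids.

```python
# Toy sketch of f_STDA(z_i, s_j, I) = f_EA(f_SD(z_i, s_j), I) (Eq. 2).
# f_SD and f_EA below are hypothetical stand-ins, not the learned modules.

def f_SD(z_i, s_j):
    # Stand-in deformation: keep pedestrian pixels where the target
    # shape s_j is foreground (1), zero out the rest.
    return [[z * s for z, s in zip(zr, sr)] for zr, sr in zip(z_i, s_j)]

def f_EA(z_w, bg):
    # Stand-in adaptation: paste non-zero (foreground) pixels over the
    # background patch taken from the image I.
    return [[z if z != 0 else b for z, b in zip(zr, br)]
            for zr, br in zip(z_w, bg)]

def f_STDA(z_i, s_j, bg):
    return f_EA(f_SD(z_i, s_j), bg)

z_i = [[5, 6], [7, 8]]   # pedestrian patch
s_j = [[1, 0], [1, 1]]   # target shape (binary mask)
bg  = [[1, 1], [1, 1]]   # background patch from the image
print(f_STDA(z_i, s_j, bg))   # -> [[5, 1], [7, 8]]
```

The point of the composition is that the two sub-tasks can be developed and supervised separately while still forming a single end-to-end mapping.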
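A minimal pure-Python sketch of $f_{warp}$ with bilinear interpolation (Eqs. 3-4) may help make the warping field concrete. One assumption here that the text does not state explicitly: a backward-warping convention, where the output pixel at $(x, y)$ samples the input at the displaced location $(x + v_x, y + v_y)$, and out-of-range samples read as 0.

```python
# Hedged sketch of f_warp (Eqs. 3-4): backward warping with bilinear
# interpolation on a tiny grayscale grid. field[y][x] = (vx, vy).

def bilinear_sample(img, x, y):
    h, w = len(img), len(img[0])
    x0, y0 = int(x // 1), int(y // 1)   # floor of the sample location
    dx, dy = x - x0, y - y0
    def px(xi, yi):  # zero padding outside the patch
        return img[yi][xi] if 0 <= xi < w and 0 <= yi < h else 0.0
    return ((1 - dx) * (1 - dy) * px(x0, y0)
            + dx * (1 - dy) * px(x0 + 1, y0)
            + (1 - dx) * dy * px(x0, y0 + 1)
            + dx * dy * px(x0 + 1, y0 + 1))

def f_warp(img, field):
    return [[bilinear_sample(img, x + field[y][x][0], y + field[y][x][1])
             for x in range(len(img[0]))] for y in range(len(img))]

img = [[0.0, 1.0], [2.0, 3.0]]
identity = [[(0, 0), (0, 0)], [(0, 0), (0, 0)]]
print(f_warp(img, identity))   # -> [[0.0, 1.0], [2.0, 3.0]]
half = [[(0.5, 0), (0.5, 0)], [(0.5, 0), (0.5, 0)]]
print(f_warp(img, half))       # -> [[0.5, 0.5], [2.5, 1.5]]
```

Because the bilinear weights are differentiable in $(x, y)$, a deep-learning framework can back-propagate through this sampling to the predicted field, which is exactly why the paper uses bilinear interpolation for training.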
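The shape constraining operation (Eq. 5) can be sketched on a tiny binary mask. This is an illustrative reading of the equation, with a few assumptions the text leaves open: one contiguous foreground run per row, $\gamma(y) = y/(H-1)$ as the linear weight, and round-half-up when converting the combined midpoint and width back to pixel columns.

```python
# Hedged sketch of the shape constraining operation (Eq. 5). For each row y,
# the foreground run of s_j is pulled toward that of s_i, with gamma(y)
# rising linearly from 0 (top: keep s_i) to 1 (bottom: keep s_j).
# Assumes one contiguous foreground run per row.

def run_stats(row):
    cols = [x for x, v in enumerate(row) if v]
    c = (cols[0] + cols[-1]) / 2.0   # middle point c_y
    l = cols[-1] - cols[0] + 1       # width l_y
    return c, l

def constrain(s_i, s_j):
    h, w = len(s_j), len(s_j[0])
    out = []
    for y in range(h):
        g = y / (h - 1)              # gamma(y)
        ci, li = run_stats(s_i[y])
        cj, lj = run_stats(s_j[y])
        c = g * cj + (1 - g) * ci    # Eq. 5 for the middle point
        l = g * lj + (1 - g) * li    # Eq. 5 for the width
        # rebuild the row, rounding half-up to keep the run centered at c
        start = int(c - (l - 1) / 2.0 + 0.5)
        end = int(c + (l - 1) / 2.0 + 0.5)
        out.append([1 if start <= x <= end else 0 for x in range(w)])
    return out

s_i = [[0, 1, 1, 0, 0], [0, 1, 1, 0, 0], [0, 1, 1, 0, 0]]
s_j = [[0, 0, 1, 1, 0], [0, 0, 1, 1, 0], [0, 1, 1, 1, 1]]
out = constrain(s_i, s_j)
print(out[0])    # top row follows s_i: [0, 1, 1, 0, 0]
print(out[-1])   # bottom row follows s_j: [0, 1, 1, 1, 1]
```

The interpolation between rows realizes the paper's intent: the upper body stays close to the source shape while the lower body is free to follow the target.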
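Toy versions of the L1-based losses in Eqs. 8-9 make the supervision signal explicit. Names here are illustrative; in the paper these are expectations over training pairs, whereas this sketch evaluates a single pair of small masks.

```python
# Illustrative single-sample versions of Eq. 8 (shape deformation loss) and
# Eq. 9 (cyclic reconstruction loss), with the mean absolute difference
# standing in for the expected L1 norm.

def l1(a, b):
    flat = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(abs(x - y) for x, y in flat) / len(flat)

def loss_shape(s_j, s_w_ij):
    # Eq. 8: the warped shape should match the target shape s_j
    return l1(s_j, s_w_ij)

def loss_cyc(s_i, s_w_ji, z_i, z_w_ji):
    # Eq. 9: warping back should recover both the shape and the appearance
    return l1(s_i, s_w_ji) + l1(z_i, z_w_ji)

s_j = [[1, 0], [1, 1]]
print(loss_shape(s_j, s_j))                 # perfect warp -> 0.0
print(loss_shape(s_j, [[1, 1], [1, 1]]))    # one wrong pixel in four -> 0.25
```

The cyclic term plays the same role as in cycle-consistent image translation: it discourages warping fields that reach the target shape by destroying appearance information that cannot be recovered.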
3.4 Dataset Augmentation