
Faster AutoAugment: Learning Augmentation Strategies using Backpropagation

Ryuichiro Hataya¹,²  Jan Zdenek¹  Kazuki Yoshizoe²  Hideki Nakayama¹
¹ Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
² RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Abstract

Data augmentation methods are indispensable heuristics to boost the performance of deep neural networks, especially in image recognition tasks. Recently, several studies have shown that augmentation strategies found by search algorithms outperform hand-made strategies. Such methods employ black-box search algorithms over image transformations with continuous or discrete parameters and require a long time to obtain better strategies. In this paper, we propose a differentiable policy search pipeline for data augmentation, which is much faster than previous methods. We introduce approximate gradients for several transformation operations with discrete parameters as well as the differentiable mechanism for selecting operations. As the objective of training, we minimize the distance between the distributions of augmented data and the original data, which can be differentiated. We show that our method, Faster AutoAugment, achieves significantly faster searching than prior work without a performance drop.

[Figure 1: a policy augments original images; a critic asks "Original or augmented?" and "Classified correctly?", and the policy is updated by backpropagation.]
Figure 1. Overview of our proposed model. We propose to use a differentiable data augmentation pipeline to achieve faster policy search by using adversarial learning.

1. Introduction

Data augmentation is a powerful technique for machine learning to virtually increase the amount and diversity of data, which improves the performance especially in image recognition tasks. Conventional data augmentation methods include geometric transformations such as rotation and color enhancing such as auto-contrast. Similarly to other hyper-parameters, the designers of data augmentation strategies usually select transformation operations based on their prior knowledge (e.g., required invariance). For example, horizontal flipping is expected to be effective for general object recognition but probably not for digit recognition. In addition to the selection, the designers need to combine several operations and set their magnitudes (e.g., degree of rotation). Therefore, designing data augmentation strategies is a complex combinatorial problem.

Dataset   | AA     | PBA | Fast AA | Faster AA (ours)
CIFAR-10  | 5,000  | 5.0 | 3.5     | 0.23
SVHN      | 1,000  | 1.0 | 1.5     | 0.061
ImageNet  | 15,000 | -   | 450     | 2.3

Table 1. Faster AutoAugment is faster than others, without a significant performance drop (see section 5). GPU hours comparison of Faster AutoAugment (Faster AA), AutoAugment (AA) [5], PBA [12] and Fast AutoAugment (Fast AA) [18].

When designing data augmentation strategies in a data-driven manner, one can regard the problem as searching for optimal hyper-parameters in a search space, which becomes prohibitively large as the combinations get complex. Therefore, efficient methods are required to find optimal strategies. If gradient information of these hyper-parameters is available, they can be efficiently optimized by gradient descent [20]. However, the gradient information is usually difficult to obtain because some magnitude parameters are discrete, and the selection process of operations is non-differentiable. Therefore, previous research to automatically design data augmentation policies has used black-box optimization methods that require no gradient information. For example, AutoAugment [5] used reinforcement learning.

In this paper, we propose to solve the problem by approximating gradient information and thus enabling gradient-based optimization for data augmentation policies. To this end, we approximate the gradients of discrete image operations using a straight-through estimator [3] and make the selection process of operations differentiable by incorporating a recent differentiable neural architecture search method [19]. As the objective, we minimize the distance between the distributions of the original images and augmented images, because we want the data augmentation pipeline to transform images so that it fills missing points in the training data [18] (see Figure 2). To make the transformed images match the distribution of original images, we use adversarial learning (see Figure 1). As a result, the searching process becomes end-to-end differentiable and significantly faster than prior work such as AutoAugment, PBA and Fast AutoAugment (see Table 1¹).

¹ Note that [18] and we estimate the GPU hours with an NVIDIA V100 GPU while [5] did with an NVIDIA P100 GPU.

We empirically show that our method, which we call Faster AutoAugment, enables much faster policy search while achieving comparable performance with that of prior work on standard benchmarks: CIFAR-10, CIFAR-100 [16], SVHN [21] and ImageNet [26].

In summary, our contributions are the following three points:

1. We introduce gradient approximations for several non-differentiable data augmentation operations.

2. We make the searching of data augmentation policies end-to-end differentiable by gradient approximations, differentiable selection of operations, and a differentiable objective that measures the distance between the original and augmented image distributions.

3. We show that our proposed method, Faster AutoAugment, significantly reduces the searching time compared to prior methods without a performance drop.

[Figure 2: original images (top) and their augmented counterparts (bottom).]
Figure 2. We regard data augmentation as a process that fills missing data points of the original training data; therefore, our objective is to minimize the distance between the distributions of augmented data and the original data using adversarial learning.

2. Related Work

Neural Architecture Search

Neural Architecture Search (NAS) aims to automatically design architectures of neural networks to achieve higher performance than manually designed ones. To this end, NAS algorithms are required to select better combinations of components (e.g., convolution with a 3x3 kernel) from discrete search spaces using searching algorithms such as reinforcement learning [38] and evolution strategy [24]. Recently, DARTS [19] achieved faster search by relaxing the discrete search space to a continuous one, which allowed them to use gradient-based optimization. While AutoAugment [5] was inspired by [38], our method is influenced by DARTS [19].

Data Augmentation

Data augmentation methods improve the performance of learnable models by increasing the virtual size and diversity of training data without collecting additional data samples. Traditionally, geometric transformations and color enhancing transformations have been used in image recognition tasks. For example, [17, 11] randomly apply horizontal flipping and cropping as well as alternation of image hues. In recent years, other image manipulation methods have been shown to be effective. [37, 6] cut out a random patch from the image and replace it with random noise or a constant value. Another strategy is to mix multiple images of different classes either by convex combinations [36, 29] or by creating a patchwork from them [34]. In these studies, the selection of operations, their magnitudes and the probabilities to be applied are carefully hand-designed.

Automating Data Augmentation

Similar to NAS, it is a natural direction to aim to automate data augmentation. One direction is to search for better combinations of symbolic operations using black-box optimization techniques: reinforcement learning [5, 23], evolution strategy [32], Bayesian optimization [18] and Population Based Training [12]. As the objective, [5, 32, 12] directly aim to minimize the error rate, or equivalently to maximize accuracy, while [23, 18] try to match the densities of augmented and original images.

Another direction is to use generative adversarial networks (GANs) [9]. [30, 1] use conditional GANs to generate images that promote the performance of image classifiers. [27, 28] use GANs to modify the outputs of simulators to look like real objects.

Automating data augmentation can also be applied to representation learning such as semi-supervised learning [4, 33] and domain generalization [32].

3. Preliminaries

In this section, we describe the common basis of AutoAugment [5], PBA [12] and Fast AutoAugment [18] (see also Figure 3). Faster AutoAugment also follows this problem setting.

In these works, input images are augmented by a policy which consists of L different sub-policies S^(l) (l = 1, 2, ..., L). A randomly selected sub-policy transforms each image X. A single sub-policy consists of K consecutive image processing operations O_1^(l), ..., O_K^(l) which are applied to the image one by one. We refer to the number of consecutive operations K as the operation count. In the rest of this paper, we focus on sub-policies; therefore, we omit the superscripts l.

Each method first searches for better policies. After the searching phase, the obtained policy is used as a data augmentation pipeline to train neural networks.

Figure 3. Schematic view of the problem setting. Each image is augmented by a sub-policy randomly selected from the policy. A single sub-policy is composed of K consecutive operations (O_1, ..., O_K), such as shear_x and solarize. An operation O_k operates on a given image with probability p_k and magnitude µ_k.

Operation      | Magnitude µ
(Affine transformations)
shear x        | continuous
shear y        | continuous
translate x    | continuous
translate y    | continuous
rotate         | continuous
flip           | none
(Color enhancing operations)
solarize       | discrete
posterize      | discrete
invert         | none
contrast       | continuous
color          | continuous
brightness     | continuous
sharpness      | continuous
auto contrast  | none
equalize       | none
(Other operations)
cutout         | discrete
sample pairing | continuous

Table 2. Operations used in AutoAugment, PBA, Fast AutoAugment and Faster AutoAugment. Some operations have discrete magnitude parameters µ, while others have no or continuous magnitude parameters. Different from previous works, we approximate gradients of operations w.r.t. the discrete magnitude µ, which we describe in section 4.1.

3.1. Operations

Operations used in each sub-policy include affine transformations such as shear_x and color enhancing operations such as solarize. In addition, we use cutout [6] and sample pairing [13] following [5, 12, 18]. We show all 16 operations used in these works in Table 2. We denote the set of operations as O = {shear_x, solarize, ...}.

Some operations have magnitude parameters that are free variables, e.g., the angle in rotate. On the other hand, some operations, such as invert, have no magnitude parameter. For simplicity, we use the following expressions as if every operation had its magnitude parameter µ_O ∈ [0, 1]. Each operation is applied with probability p_O ∈ [0, 1]. Therefore, each image X is augmented as

X → O(X; µ_O) with probability p_O, and X → X with probability 1 − p_O.   (1)

Rewriting this mapping as O(·; µ_O, p_O), each sub-policy S consisting of operations O_1, O_2, ..., O_K can be written as

S(X; µ_S, p_S) = (O_K ◦ · · · ◦ O_1)(X; µ_S, p_S),   (2)

where µ_S = (µ_O1, ..., µ_OK) and p_S = (p_O1, ..., p_OK). In the rest of this paper, we represent an image operation as O, O(·; µ) and O(·; µ, p) interchangeably according to the context.
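To make equations (1) and (2) concrete, here is a minimal PyTorch sketch of a sub-policy; the rotate stand-in and its magnitude-to-angle mapping are our own illustrative assumptions, not the operations the paper actually implements.

```python
import torch

# Illustrative stand-in for one of the operations in Table 2; mapping
# mu in [0, 1] to quarter turns is an assumption made for this sketch.
def rotate(x: torch.Tensor, mu: float) -> torch.Tensor:
    return torch.rot90(x, k=int(mu * 3), dims=(-2, -1))

def apply_op(x: torch.Tensor, op, mu: float, p: float) -> torch.Tensor:
    # Equation (1): apply O(X; mu_O) with probability p_O, else keep X.
    return op(x, mu) if torch.rand(()).item() < p else x

def sub_policy(x: torch.Tensor, ops, mus, ps) -> torch.Tensor:
    # Equation (2): S(X) = (O_K ∘ ... ∘ O_1)(X), applied one by one.
    for op, mu, p in zip(ops, mus, ps):
        x = apply_op(x, op, mu, p)
    return x

image = torch.rand(3, 32, 32)  # a dummy CHW image
augmented = sub_policy(image, ops=[rotate, rotate], mus=[0.5, 1.0], ps=[0.9, 0.3])
```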
3.2. Search Space

The goal of searching is to find the best operation combinations O_1, ..., O_K and parameter sets (µ_S, p_S) for L sub-policies. Therefore, the size of the total search space is roughly (#O × [0, 1] × [0, 1])^(KL). Using multiple sub-policies results in a prohibitively large search space for brute-force searching. [18] uses Bayesian optimization in this search space. [5, 12] discretize the continuous part [0, 1] into 10 or 11 values and search the space using reinforcement learning and population based training. Nevertheless, the problem is still difficult to solve naively even after discretizing the search space. For instance, if the number of sub-policies L is 10 with K = 2 consecutive operations, the discretized space size becomes (16 × 10 × 11)^(2×10) ≈ 8.1 × 10^64.

Previous methods [5, 12, 18] use black-box optimization. Therefore, they need to train CNNs with candidate policies and obtain their validation accuracy. The repetition of this process requires a lot of time. In contrast, Faster AutoAugment achieves faster searching with gradient-based optimization to avoid repetitive evaluations, even though the search space is the same as in Fast AutoAugment. We describe the details in the next section.

4. Faster AutoAugment

Faster AutoAugment explores the search space to find better policies in a gradient-based manner, which distinguishes our method from previous ones. In section 4.1, we describe the details of gradient approximation for policy searching. To accomplish gradient-based training, we adopt distance minimization between the distributions of the augmented and the original images as the learning objective, which we present in section 4.2.

4.1. Differentiable Data Augmentation Pipeline

Previous searching methods [5, 12, 18] have used image processing libraries (e.g., Pillow) which do not support backpropagation through the operations in Table 2. Contrary to previous work, we modify these operations to be differentiable, so that each of them can be differentiated with respect to the probability p and the magnitude µ. Thanks to this modification, the searching problem becomes an optimization problem. The sequence of operations in each sub-policy also needs to be optimized in the same fashion.

On the probability parameter p

First, we regard equation (1) as

b O(X; µ) + (1 − b) X,   (3)

where b ∈ {0, 1} is sampled from the Bernoulli distribution Bern(b; p), i.e., b = 1 with probability p. Since this distribution is non-differentiable, we instead use the relaxed Bernoulli distribution [14]

ReBern(b; p, λ) = ς((1/λ) {log(p / (1 − p)) + log(u / (1 − u))}).   (4)

Here, ς(x) = 1 / (1 + exp(−x)) is a sigmoid function that keeps the range of the function in (0, 1), and u is a value sampled from a uniform distribution on [0, 1]. With a low temperature λ, this relaxed distribution behaves like the Bernoulli distribution. Using this reparameterization, each operation O(·; µ_O, p_O) becomes differentiable w.r.t. its probability parameter p.
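As a sketch of equations (3) and (4), the relaxed Bernoulli sample below lets gradients flow into the probability parameter; the stand-in operation, variable names, and the clamping are our own assumptions. PyTorch's torch.distributions.RelaxedBernoulli offers an equivalent rsample().

```python
import torch

def relaxed_bernoulli(p: torch.Tensor, lam: float = 0.05) -> torch.Tensor:
    # Equation (4): b = sigmoid((1/lambda) * (logit(p) + logit(u))), u ~ U(0, 1).
    u = torch.rand_like(p).clamp(1e-6, 1 - 1e-6)  # avoid log(0) in this sketch
    logit = lambda t: torch.log(t) - torch.log(1.0 - t)
    return torch.sigmoid((logit(p) + logit(u)) / lam)

p = torch.tensor(0.7, requires_grad=True)  # probability of applying the op
x = torch.rand(3, 32, 32)
op_x = x.flip(-1)                          # stand-in for O(X; mu)

b = relaxed_bernoulli(p)
y = b * op_x + (1.0 - b) * x               # equation (3): b*O(X) + (1 - b)*X
y.sum().backward()
print(p.grad)                              # gradient w.r.t. p now exists
```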
On the magnitude parameter µ

For some operations, such as rotate or translate_x, gradients w.r.t. their magnitude parameters µ can be obtained easily. However, some operations such as posterize and solarize discretize magnitude values. In such cases, gradients w.r.t. µ cannot backpropagate through these operations. Thus, we approximate their gradients in a similar manner to the straight-through estimator [3, 31]. More precisely, we approximate the (i, j)-th element of an image augmented by an operation O as

Õ(X; µ)_(i,j) = StopGrad(O(X; µ)_(i,j) − µ) + µ,   (5)

where StopGrad is a stop-gradient operation which treats its operand as a constant. During the forward computation, the augmentation is exactly operated: Õ(X; µ)_(i,j) = O(X; µ)_(i,j). However, during the backward computation, the first term of the right-hand side of equation (5) is ignored because it is constant, and then we obtain an approximated gradient:

∂O(X)_(i,j) / ∂µ ≈ ∂Õ(X)_(i,j) / ∂µ = 1.   (6)

Despite its simplicity, we find that this method works well in our experiments. Using this approximation, each operation O(·; µ_O, p_O) becomes differentiable w.r.t. its magnitude parameter µ.
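The following sketch mirrors equations (5) and (6), with PyTorch's detach() playing the role of StopGrad; posterize here is a simplified stand-in of our own, not the paper's CUDA implementation.

```python
import torch

def posterize(x: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    # Simplified stand-in: mu in [0, 1] selects 1..8 bits; the rounding makes
    # the output piecewise constant, so d(output)/d(mu) is zero almost everywhere.
    levels = 2 ** int(1 + mu.item() * 7)
    return torch.floor(x * levels) / levels

def straight_through(x: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    # Equation (5): the forward pass returns exactly posterize(x, mu); in the
    # backward pass the detached term is constant, so dO~/dmu = 1 (equation (6)).
    return (posterize(x, mu) - mu).detach() + mu

mu = torch.tensor(0.4, requires_grad=True)
x = torch.rand(3, 32, 32)
straight_through(x, mu).sum().backward()
print(mu.grad)  # 3*32*32 = 3072: every output element contributes a gradient of 1
```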

Searching for operations in sub-policies

Each sub-policy S consists of K operations. To select the appropriate operation O_k, where k ∈ {1, 2, ..., K}, we use a strategy similar to the one used in neural architecture search [19] (see also Algorithm 1 and Figure 4 for details). To be specific, we approximate the output of a single selected k-th operation O_k(X) by the weighted sum of the outputs of all operations as

Σ_(n=1)^(#O) [σ_η(w_k)]_n O_k^(n)(X; µ_k^(n), p_k^(n)),   (7)

where O_k^(n) is an operation in O, and O_k^(n) and O_k^(m) are different operations if n ≠ m. w_k is a learnable parameter, and σ_η is a softmax function σ_η(z)_i = exp(z_i/η) / Σ_j exp(z_j/η) with a temperature parameter η > 0. With a low temperature η, σ_η(w_k) becomes a onehot-like vector. During inference, we sample the k-th operation according to the categorical distribution Cat(σ_η(w_k)).

Algorithm 1 Selection of operations in a single sub-policy during searching. Refer to Figure 4 for the K = 2 case.
Input: X: input image; {w_1, ..., w_K}: learnable weights; σ_η: softmax with temperature η
for k in {1, 2, ..., K}:
    Augment X by the k-th stage operations: X ← Σ_(n=1)^(#O) [σ_η(w_k)]_n O_k^(n)(X; µ_k^(n), p_k^(n))
return X

Figure 4. Schematic view of the selection of operations in a single sub-policy when K = 2. During searching, we apply all operations to an image and take a weighted sum of the results as an augmented image. The weights, w_1 and w_2, are also updated as other parameters. After searching, we sample operations according to the trained weights.

4.2. Data Augmentation as Density Matching

Using the techniques described above, we can backpropagate through the data augmentation process. In this section, we describe the objective of policy learning.

One possible candidate for the objective is the minimization of the validation loss as in DARTS [19]. However, this bi-level formulation takes a lot of time and costs a large memory footprint [7]. To avoid this problem, we adopt a different approach.

Data augmentation can be seen as a process that fills missing data points in training data [18, 23, 30]. Therefore, we minimize the distance between the distributions of the original images and the augmented images. This goal can be achieved by minimizing the Wasserstein distance d_θ between these distributions using Wasserstein GAN [2] with gradient penalty [10]. Here, θ denotes the parameters of its critic, or almost equivalently, discriminator. Unlike usual GANs for image modification, our model does not have a typical generator that learns to transform images using conventional neural network layers. Instead, a policy, as explained in the previous sections, is trained, and it transforms images using predefined operations. Following prior work [5, 12, 18], we use WideResNet-40-2 [35] (for CIFAR-10, CIFAR-100 and SVHN) or ResNet-50 [11] (for ImageNet) and replace their classifier heads with a two-layer perceptron that serves as a critic. Besides, we add a classification loss to prevent images of a certain class from being transformed into images of another class (see Algorithm 2).

Algorithm 2 Training of Faster AutoAugment
Input: M, P, W: learnable parameters of a sub-policy; d_θ(·, ·): distance between two densities with learnable parameters θ; f: image classifier; L: cross entropy loss; ε: coefficient of the classification loss; D: training set
while not converged:
    Sample a pair of batches B, B′ from D
    Augment data: A = {S(X; M, P, W); (X, ·) ∈ B}
    Measure the distance: d = d_θ(A, B′)
    Classification loss: l = E_((X,y)∼A) L(f(X), y) + E_((X′,y′)∼B′) L(f(X′), y′)
    Update parameters M, P, W, θ to minimize d + εl using stochastic gradient descent (e.g., Adam)
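Below is a minimal sketch of the training loop of Algorithm 2 under assumptions of our own: a toy two-operation policy stands in for the full pipeline of section 4.1, random tensors stand in for data, and, for brevity, the critic is updated jointly rather than adversarially with a gradient penalty as a WGAN-GP critic would be.

```python
import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """Stand-in for the differentiable policy: two fixed operations mixed by a
    softmax over learnable weights w (equation (7)); eta is the temperature."""
    def __init__(self, eta: float = 0.05):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(2))
        self.eta = eta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ops = [x.flip(-1), 1.0 - x]  # stand-ins for O^(1), O^(2)
        weights = torch.softmax(self.w / self.eta, dim=0)
        return weights[0] * ops[0] + weights[1] * ops[1]

policy = ToyPolicy()
features = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
critic = nn.Linear(64, 1)        # perceptron head serving as the critic
classifier = nn.Linear(64, 10)   # f in Algorithm 2
ce, eps = nn.CrossEntropyLoss(), 0.1
params = [*policy.parameters(), *features.parameters(),
          *critic.parameters(), *classifier.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(10):  # "while not converged"
    B, Bp = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)  # batches B, B'
    y, yp = torch.randint(10, (8,)), torch.randint(10, (8,))
    A = policy(B)                                               # augmented batch
    d = critic(features(A)).mean() - critic(features(Bp)).mean()
    l = ce(classifier(features(A)), y) + ce(classifier(features(Bp)), yp)
    opt.zero_grad()
    (d + eps * l).backward()  # minimize d + eps*l, as in Algorithm 2
    opt.step()
```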
5. Experiments and Results

In this section, we show the empirical results of our approach on the CIFAR-10, CIFAR-100 [16], SVHN [21] and ImageNet [26] datasets and compare the results with AutoAugment [5], PBA [12] and Fast AutoAugment [18]. Except for ImageNet, we run all experiments three times and report the average results. The details of the datasets are presented in Table 3.

Dataset        | Training set size | Subset size for policy training
CIFAR-10 [16]  | 50,000            | 4,000
CIFAR-100 [16] | 50,000            | 4,000
SVHN [21]      | 603,000           | 1,000
ImageNet [26]  | 1,200,000         | 6,000

Table 3. Summary of datasets used in the experiments. For the policy training on ImageNet, we use only 6,000 images from the 120 selected classes following [5, 18].

5.1. Implementation Details

Prior methods [5, 12, 18] employed Python's Pillow² as the image processing library. We transplanted the operations described in section 3.1 to PyTorch [22], a tensor computation library with automatic differentiation. For geometric operations, we extend functions in kornia [25]. For color-enhancing operations, sample pairing [13] and cutout [6], we implement them using PyTorch. Operations with discrete magnitude parameters are implemented as described in section 4.1 with additional CUDA kernels.

We use CNN models and baseline preprocessing procedures available from the Fast AutoAugment repository³ and follow their settings and hyper-parameters for CNN training, such as the initial learning rate and learning rate scheduling.

² https://python-pillow.org/
³ https://github.com/kakaobrain/fast-autoaugment/tree/master/FastAutoAugment/networks

5.2. Experimental Settings

To compare our results with previous studies [5, 18, 12], we follow their experimental settings on each dataset. We train the policy on randomly selected subsets of each dataset presented in Table 3. In the evaluation phase, we train CNN models from scratch on each dataset with learned Faster AutoAugment policies. For SVHN, we use both training and additional datasets.

Similar to Fast AutoAugment [18], our policies are composed of 10 sub-policies, each of which has operation count K = 2 as described in section 3.2. We train the policies for 20 epochs using ResNet-50 for ImageNet and WideResNet-40-2 for the other datasets. In all experiments, we set the temperature parameters λ and η to 0.05. We use the Adam optimizer [15] with a learning rate of 1.0 × 10⁻³, coefficients for running averages (betas) of (0, 0.999), a coefficient for the classification loss ε of 0.1, and a coefficient for the gradient penalty of 10. Because GPUs are optimized for batched tensor computation, we apply sub-policies to chunks of images. The number of chunks determines the balance between speed and diversity. We set the chunk size to 16 for ImageNet and 8 for the other datasets during searching. For evaluation, we use a chunk size of 32 for ImageNet and 16 for the other datasets (see the code sketch after Figure 5).

5.3. Results

CIFAR-10 and CIFAR-100

In Table 4, we show test error rates on CIFAR-10 and CIFAR-100 with various CNN models: WideResNet-40-2, WideResNet-28-10 [35], and Shake-Shake (26 2 × {32, 96, 112}d) [8]. We train the WideResNets for 200 epochs and the Shake-Shakes for 1,800 epochs as in [5], and report values averaged over three runs for Faster AutoAugment. The results of Baseline and Cutout are from [5, 18]. Faster AutoAugment not only shows competitive results with prior work, but is also significantly faster to train than the others (see Table 1). For CIFAR-100, we report results with policies trained on reduced CIFAR-10 following [5] as well as policies trained on reduced CIFAR-100. The latter results are better than the former ones, which suggests the importance of training the policy on the target dataset.

We also show several examples of augmented images in Figure 5. The policy seems to prefer color enhancing operations, as reported in AutoAugment [5].

In Table 5⁴, we report error rates on Reduced CIFAR-10 to show the effect of Faster AutoAugment in the low-resource scenario. In this experiment, we randomly sample 4,000 images from the training dataset. We train the policy using the subset and evaluate the policy with WideResNet-28-10 on the same subset for 200 epochs. As can be seen, Faster AutoAugment improves the performance by 7.7% over Cutout and achieves an error rate close to that of AutoAugment. This result implies that data augmentation can moderately unburden the difficulty of learning from small data.

⁴ [5] reports better Baseline and Cutout performance than us (18.8% and 16.5%, respectively), but we could not reproduce the results.

SVHN

In Table 4, we show test error rates on SVHN with WideResNet-28-10 trained for 200 epochs. For Faster AutoAugment, we report the average value of three runs. Faster AutoAugment achieves an error rate of 1.2%, which is a 0.1% improvement over Cutout and on par with PBA. Besides, we show the augmented images in Figure 5 with the used sub-policies, which seem to select more geometric transformations than CIFAR-10's policy, as reported in AutoAugment [5].

[Figure 5 residue: the used sub-policies include SamplePairing(p=0.805, µ=0.669) → Invert(p=0.157); Contrast(p=0.386, µ=0.344) → AutoContrast(p=0.582); Color(p=0.809, µ=0.766) → TranslateY(p=0.876, µ=0.756); Brightness(p=0.973, µ=0.829) → AutoContrast(p=0.220); TranslateY(p=0.746, µ=0.591) → SamplePairing(p=0.601, µ=0.578); Posterize(p=0.792, µ=0.415) → Rotate(p=0.260, µ=0.201); CutOut(p=0.450, µ=0.572) → Equalize(p=0.283); CutOut(p=0.513, µ=0.489) → AutoContrast(p=0.739).]

Figure 5. Original and augmented images of CIFAR-10 (upper) and SVHN (lower). As can be seen, Faster AutoAugment can transform original images into diverse augmented images with the sub-policies shown at the right-hand side.
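As a sketch of the chunk-wise application described in section 5.2 (with stand-in sub-policies and names of our own choosing), each chunk of a batch is transformed by one randomly chosen sub-policy, trading diversity for batched GPU throughput:

```python
import torch

def augment_by_chunks(batch: torch.Tensor, sub_policies, chunk_size: int):
    # One randomly chosen sub-policy per chunk: larger chunks are faster to
    # compute in a batched fashion but yield less diverse augmentations.
    out = []
    for chunk in batch.split(chunk_size):
        idx = int(torch.randint(len(sub_policies), ()).item())
        out.append(sub_policies[idx](chunk))
    return torch.cat(out)

flip = lambda x: x.flip(-1)   # stand-ins for learned sub-policies
invert = lambda x: 1.0 - x
batch = torch.rand(32, 3, 32, 32)
augmented = augment_by_chunks(batch, [flip, invert], chunk_size=16)
```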

Dataset   | Model                        | Baseline | Cutout [6] | AA [5] | PBA [12] | Fast AA [18] | Faster AA (ours)
CIFAR-10  | WideResNet-40-2 [35]         | 5.3      | 4.1        | 3.7    | -        | 3.6          | 3.7
CIFAR-10  | WideResNet-28-10 [35]        | 3.9      | 3.1        | 2.6    | 2.6      | 2.7          | 2.6
CIFAR-10  | Shake-Shake (26 2×32d) [8]   | 3.6      | 3.0        | 2.5    | 2.5      | 2.7          | 2.7
CIFAR-10  | Shake-Shake (26 2×96d) [8]   | 2.9      | 2.6        | 1.9    | 2.0      | 2.0          | 2.0
CIFAR-10  | Shake-Shake (26 2×112d) [8]  | 2.8      | 2.6        | 1.9    | 2.0      | 2.0          | 2.0
CIFAR-100 | WideResNet-40-2 [35]         | 26.0     | 25.2       | 20.7   | -        | 20.7         | 22.1 / 21.4
CIFAR-100 | WideResNet-28-10 [35]        | 18.8     | 18.4       | 17.1   | 16.7     | 17.3         | 17.8 / 17.3
CIFAR-100 | Shake-Shake (26 2×96d) [8]   | 17.1     | 16.0       | 14.3   | 15.3     | 14.9         | 15.6 / 15.0
SVHN      | WideResNet-28-10 [35]        | 1.5      | 1.3        | 1.1    | 1.2      | 1.1          | 1.2

Table 4. Faster AutoAugment yields comparable performance with prior work. Test error rates on CIFAR-10, CIFAR-100 and SVHN. We report average rates over three runs. For CIFAR-100, we report results obtained with policies trained on CIFAR-10 / CIFAR-100.

Baseline | Cutout [6] | AA [5] | Faster AA (ours)
24.3     | 22.5       | 14.1   | 14.8

Table 5. Test error rates with models trained on Reduced CIFAR-10, which consists of 4,000 images randomly sampled from the training set. We show that the policy obtained by Faster AutoAugment is useful for the low-resource scenario.

                 | Baseline | with Policy | Gain
AA [5]           | 23.7/6.9 | 22.4/6.2    | 1.3/0.7
Fast AA [18]     | -        | 22.4/6.3    | 1.3/0.6
Faster AA (ours) | 24.1/7.2 | 23.5/6.8    | 0.6/0.4

Table 6. Top-1/Top-5 validation error rates on ImageNet [26] with ResNet-50 [11]. Faster AutoAugment achieves comparable performance gain to AA and Fast AA.

ImageNet

In Table 6, we compare the top-1 and top-5 validation error rates on ImageNet with [5, 18]. To align our results with [5], we also train ResNet-50 for 200 epochs. [5, 18] report top-1/top-5 error rates of 23.7%/6.9%; however, despite efforts to reproduce the results, we could not reach the same baseline performance. Faster AutoAugment achieves a 1.0% improvement over the baseline on the top-1 error rate. This gain is close to that of AutoAugment and Fast AutoAugment, which verifies that Faster AutoAugment has an effect comparable to prior work on a large and complex dataset.

6. Analysis

6.1. Changing the Number of Sub-policies

The number of sub-policies L is arbitrary. Figure 6 shows the relationship between the number of sub-policies and the final test error on the CIFAR-10 dataset with WideResNet-40-2. As can be seen, the more sub-policies we have, the lower the error rate is. This phenomenon is straightforward because the number of sub-policies determines the diversity of augmented images; however, an increase in the number of sub-policies results in exponential growth of the search space, which is prohibitive for standard searching methods.

Figure 6. As the number of sub-policies grows, performance increases. The relationship between the number of sub-policies and the test error rate (CIFAR-10 with WideResNet-40-2). We plot test error rates and their standard deviations averaged over three runs.

Figure 7. As the operation count grows, performance increases. The relationship between the operation count of each sub-policy and the average test error rate of three runs (CIFAR-10 with WideResNet-40-2).

6.2. Changing Operation Count

The operation count K of each sub-policy is also arbitrary. Like the number of sub-policies L, the operation count K of a sub-policy also exponentially increases the search space. We change K from 1 to 4 on the CIFAR-10 dataset with WideResNet-40-2. We present the resulting error rates in Figure 7. As can be seen, as the operation count in each sub-policy grows, the performance increases, i.e., the error rates decrease. The results of section 6.1 and section 6.2 show that Faster AutoAugment is scalable to a large search space.

6.3. Changing Data Size

In the main experiments in section 5, we use a subset of CIFAR-10 of 4,000 images for policy training. To validate the effect of this sampling, we train a policy on the full CIFAR-10 of 50,000 images as in [18] and evaluate the obtained policy with WideResNet-40-2. We find that the increase of data size causes a significant performance drop (from 3.7% to 4.1%) with the number of sub-policies L = 10. We hypothesize that this drop is because of the lower capability of the policy when L = 10. Therefore, we train a policy with L = 80 sub-policies and randomly sample 10 sub-policies to evaluate the policy, which results in comparable error rates (3.8%). We present the results in Table 7, comparing with Fast AutoAugment [18], which shows the effectiveness of using subsets for Fast AutoAugment and Faster AutoAugment.

Data size | Fast AA [18] | Faster AA (ours)
4,000     | 3.6          | 3.7
50,000    | 3.7          | 3.8

Table 7. Test error rates on CIFAR-10 using policies trained on the reduced CIFAR-10 (4,000 images) and the full CIFAR-10 (50,000 images) with WideResNet-40-2.

6.4. The Effect of Policy Training

To confirm that trained policies are more effective than randomly initialized policies, we compare test error rates on CIFAR-10 with and without policy training, as performed in AutoAugment [5]. Using WideResNet-28-10, trained policies achieve an error rate of 2.6% while randomly initialized policies have a slightly worse error rate of 2.7% (both error rates are averages of three runs). These results imply that data augmentation policy searching is a meaningful research direction, but still has much room to improve.

7. Conclusion

In this paper, we have proposed Faster AutoAugment, which achieves faster policy searching for data augmentation than previous methods [5, 12, 18]. To achieve this, we have introduced gradient approximation for several non-differentiable image operations and made the policy searching process end-to-end differentiable. We have verified our method on several standard benchmarks and showed that Faster AutoAugment can achieve competitive performance with other methods for automatic data augmentation. Besides, our additional experiments suggest that gradient-based policy optimization can scale to more complex scenarios.

We believe that faster policy searching will be beneficial for research on representation learning such as semi-supervised learning [4, 33] and domain generalization [32]. Additionally, learning from small data using learnable policies might be an interesting future direction.

References

[1] A. Antoniou, A. Storkey, and H. Edwards. Data Augmentation Generative Adversarial Networks. In ICLR, 2018.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In ICML, 2017.
[3] Y. Bengio, N. Léonard, and A. Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv, 2013.
[4] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel. MixMatch: A Holistic Approach to Semi-Supervised Learning. In NeurIPS, 2019.
[5] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning Augmentation Policies from Data. In CVPR, 2018.
[6] T. DeVries and G. W. Taylor. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv, 2017.
[7] C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 2017.
[8] X. Gastaldi. Shake-Shake regularization of 3-branch residual networks. In ICLR, 2017.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. In NIPS, 2014.
[10] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved Training of Wasserstein GANs. In NIPS, 2017.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[12] D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen. Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules. In ICML, 2019.
[13] H. Inoue. Data Augmentation by Pairing Samples for Images Classification. arXiv, 2018.
[14] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. In ICLR, 2017.
[15] D. P. Kingma and J. L. Ba. Adam: a Method for Stochastic Optimization. In ICLR, 2015.
[16] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, 2009.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[18] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim. Fast AutoAugment. In NeurIPS, 2019.
[19] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable Architecture Search. In ICLR, 2018.
[20] D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.
[21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[23] A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. In NIPS, 2017.
[24] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized Evolution for Image Classifier Architecture Search. In AAAI, 2019.
[25] E. Riba, D. Mishkin, D. Ponsa, E. Rublee, and G. Bradski. Kornia: an Open Source Differentiable Computer Vision Library for PyTorch. In WACV, 2019.
[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[27] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
[28] L. Sixt, B. Wild, and T. Landgraf. RenderGAN: Generating realistic labeled data. In Frontiers Robotics AI, 2018.
[29] Y. Tokozume, Y. Ushiku, and T. Harada. Between-class Learning for Image Classification. In CVPR, 2018.
[30] T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid. A Bayesian Data Augmentation Approach for Learning Deep Models. In NIPS, 2017.
[31] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural Discrete Representation Learning. In NIPS, 2017.
[32] R. Volpi, J. Duchi, H. Namkoong, V. Murino, O. Sener, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In NeurIPS, 2018.
[33] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le. Unsupervised Data Augmentation. arXiv, 2019.
[34] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
[35] S. Zagoruyko and N. Komodakis. Wide Residual Networks. In BMVC, 2016.
[36] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In ICLR, 2018.
[37] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv, 2017.
[38] B. Zoph and Q. V. Le. Neural Architecture Search with Reinforcement Learning. In ICLR, 2017.