A Kernel Theory of Modern Data Augmentation


Tri Dao¹  Albert Gu¹  Alexander J. Ratner¹  Virginia Smith²  Christopher De Sa³  Christopher Ré¹

¹Department of Computer Science, Stanford University, California, USA. ²Department of Electrical and Computer Engineering, Carnegie Mellon University, Pennsylvania, USA. ³Department of Computer Science, Cornell University, New York, USA. Correspondence to: Tri Dao <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding data augmentation. We approach this from two directions: First, we provide a general model of augmentation as a Markov process, and show that kernels appear naturally with respect to this model, even when we do not employ kernel classification. Next, we analyze more directly the effect of augmentation on kernel classifiers, showing that data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. These frameworks both serve to illustrate the ways in which data augmentation affects the downstream learning model, and the resulting analyses provide novel connections between prior work in invariant kernels, tangent propagation, and robust optimization. Finally, we provide several proof-of-concept applications showing that our theory can be useful for accelerating machine learning workflows, such as reducing the amount of computation needed to train using augmented data, and predicting the utility of a transformation prior to training.

1. Introduction

The process of augmenting a training dataset with synthetic examples has become a critical step in modern machine learning pipelines. The aim of data augmentation is to artificially create new training data by applying transformations, such as rotations or crops for images, to input data while preserving the class labels. This practice has many potential benefits: data augmentation can encode prior knowledge about data or task-specific invariances, act as a regularizer to make the resulting model more robust, and provide resources to data-hungry deep learning models. As a testament to its growing importance, the technique has been used to achieve nearly all state-of-the-art results in image recognition (Cireşan et al., 2010; Dosovitskiy et al., 2016; Graham, 2014; Sajjadi et al., 2016), and is becoming a staple in many other areas as well (Uhlich et al., 2017; Lu et al., 2006). Learning augmentation policies alone can also boost state-of-the-art performance in image classification tasks (Ratner et al., 2017; Cubuk et al., 2018).

Despite its ubiquity and importance to the learning process, data augmentation is typically performed in an ad-hoc manner with little understanding of the underlying theoretical principles. In the field of deep learning, for example, data augmentation is commonly understood to act as a regularizer by increasing the number of data points and constraining the model (Goodfellow et al., 2016; Zhang et al., 2017). However, even for simpler models, it is not well understood how training on augmented data affects the learning process, the parameters, and the decision surface of the resulting model. This is exacerbated by the fact that data augmentation is performed in diverse ways in modern machine learning pipelines, for different tasks and domains, thus precluding a general model of transformation. Our results show that regularization is only part of the story.

In this paper, we aim to develop a theoretical understanding of data augmentation. First, in Section 3, we analyze data augmentation as a Markov process, in which augmentation is performed via a random sequence of transformations. This formulation closely matches how augmentation is often applied in practice.
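As a minimal sketch of this Markov view (the transformations, parameter ranges, and chain length below are illustrative placeholders, not the paper's experimental setup), a single augmented example can be produced by composing a random sequence of randomly parameterized base transformations:

```python
import numpy as np

# Hypothetical base transformations: each maps a data point x to a
# randomly perturbed, class-preserving variant. Real pipelines would use
# crops, flips, affine warps, etc. with hand-tuned parameter ranges.
def random_rotation(x, rng):
    # Placeholder: small random rotation in the first two coordinates.
    theta = rng.uniform(-0.1, 0.1)
    c, s = np.cos(theta), np.sin(theta)
    y = x.copy()
    y[0], y[1] = c * x[0] - s * x[1], s * x[0] + c * x[1]
    return y

def random_jitter(x, rng):
    # Placeholder: additive Gaussian noise.
    return x + rng.normal(scale=0.01, size=x.shape)

BASE_TRANSFORMS = [random_rotation, random_jitter]

def augment(x, n_steps=4, rng=None):
    """One draw from the augmentation Markov chain: starting from x,
    apply a random sequence of base transformations."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(n_steps):
        transform = BASE_TRANSFORMS[rng.integers(len(BASE_TRANSFORMS))]
        x = transform(x, rng)
    return x
```

Each call to `augment` is one sample path of the chain; the distribution of its output over many calls is the augmented data distribution that this formulation studies.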
Surprisingly, we show that performing k-nearest neighbors with this model asymptotically results in a kernel classifier, where the kernel is a function of the base augmentations. These results demonstrate that kernels appear naturally with respect to data augmentation, regardless of the base model, and illustrate the effect of augmentation on the learned representation of the original data.

Motivated by the connection between data augmentation and kernels, in Section 4 we show that a kernel classifier on augmented data approximately decomposes into two components: (i) an averaged version of the transformed features, and (ii) a data-dependent variance regularization term. This suggests a more nuanced explanation of data augmentation: namely, that it improves generalization both by inducing invariance and by reducing model complexity. We validate the quality of our approximation empirically, and draw connections to other generalization-improving techniques, including recent work in invariant learning (Zhao et al., 2017; Mroueh et al., 2015; Raj et al., 2017) and robust optimization (Namkoong & Duchi, 2017).
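To make the two components concrete, here is a hedged sketch of this decomposition (the squared loss, the weight `lam`, and all names are our own illustration; the paper derives the approximation for kernel classifiers). It reuses the `augment` sampler sketched above:

```python
import numpy as np

def augmented_objective(w, X, y, phi, lam=0.1, n_samples=32, seed=0):
    """Approximate training on augmented data for a linear model w on
    features phi by (i) averaging the model output over sampled
    augmentations of each point and (ii) penalizing the variance of that
    output across the augmentations."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for x_i, y_i in zip(X, y):
        feats = np.stack([phi(augment(x_i.copy(), rng=rng))
                          for _ in range(n_samples)])
        outputs = feats @ w          # model output on each augmented copy
        mean_out = outputs.mean()    # first-order: feature averaging
        var_out = outputs.var()      # second-order: variance regularization
        total += (mean_out - y_i) ** 2 + lam * var_out
    return total / len(X)
```

For a linear model, averaging the outputs equals evaluating the model on the averaged features, which pushes the fit toward invariance; the variance term acts as a data-dependent regularizer that shrinks model complexity.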
Finally, in Section 5, to illustrate the utility of our theoretical understanding of augmentation, we explore promising practical applications, including: (i) developing a diagnostic to determine, prior to training, the importance of an augmentation; (ii) reducing training costs for kernel methods by allowing augmentations to be applied directly to features, rather than to the raw data, via a random Fourier features approach; and (iii) suggesting a heuristic for training neural networks that reduces computation while realizing most of the accuracy gain from augmentation.

2. Related Work

Data augmentation has long played an important role in machine learning. For many years it has been used, for example, in the form of jittering and virtual examples in the neural network and kernel methods literatures (Sietsma & Dow, 1991; Schölkopf et al., 1996; Decoste & Schölkopf, 2002). These methods aim to augment or modify the raw training data so that the learned model will be invariant to known transformations or perturbations. Given its importance, recent efforts have been made to apply data augmentation more efficiently, for example with subsampling (Kuchnik & Smith, 2019). There has also been significant work on incorporating invariance directly into the model or training procedure, rather than by expanding the training set (van der Wilk et al., 2018; Tai et al., 2019). One illustrative example is that of tangent propagation for neural networks (Simard et al., 1992; 1998), which proposes a regularization penalty to enforce local invariance, and has been extended in several recent works (Rifai et al., 2011; Demyanov et al., 2015; Zhao et al., 2017). However, while efforts have been made that loosely connect traditional data augmentation with these methods (Leen, 1995; Zhao et al., 2017), there has not been a rigorous study of how these sets of procedures relate in the context of modern models and transformations. In this work, we make explicit the connection between augmentation and modifications to the model, effectively applied for kernel methods and deep learning architectures (Section 5), which we show can be used to reduce training computation and diagnose the effectiveness of various transformations.

Prior theory also does not capture the complex process by which data augmentation is often applied. For example, previous work (Bishop, 1995; Chapelle et al., 2001) shows that adding noise to input data has the effect of regularizing the model, but these effects have yet to be explored for the more complex transformations commonly applied in practice, and it is not well understood how the inductive bias embedded in complex transformations manifests itself in the invariance of the model (addressed here in Section 4). A common recipe for achieving state-of-the-art accuracy in image classification is to apply a sequence of more complex transformations, such as crops, flips, or local affine transformations, to the training data, with parameters drawn randomly from hand-tuned ranges (Cireşan et al., 2010; Dosovitskiy et al., 2014). Similar strategies have also been employed in classification applications for audio (Uhlich et al., 2017) and text (Lu et al., 2006). In Section 3, we analyze a motivating model reaffirming the connection between augmentation and kernel methods, even in the setting of complex and composed transformations.

Finally, while data augmentation has been well studied in the kernels literature (Burges, 1999; Schölkopf et al., 1996; Muandet et al., 2012), it is typically explored in the context of simple geometric invariances with closed forms. For example, van der Wilk et al. (2018) use Gaussian processes to learn these invariances from data by maximizing the marginal likelihood. Further, the connection is often approached in the opposite direction, by looking for kernels that satisfy certain invariance properties (Haasdonk & Burkhardt, 2007; Teo et al., 2008). We instead approach the connection directly via data augmentation, and show that even complicated augmentation procedures akin to those used in practice can be represented as a kernel method.

3. Data Augmentation as a Kernel

To begin our study of data augmentation, we propose and investigate a model of augmentation as a Markov process, inspired by the general manner in which the process is applied in practice.
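As a rough illustration of the kind of object this section studies (a hypothetical Monte Carlo sketch built on the `augment` sampler from Section 1; the paper instead characterizes the kernel analytically as a function of the base augmentations), the similarity that the augmentation chain induces between two points can be estimated by comparing their augmented copies:

```python
import numpy as np

def induced_kernel(x1, x2, n_samples=200, bandwidth=0.5, seed=0):
    """Monte Carlo estimate of an augmentation-induced similarity:
    draw augmented copies of x1 and x2 and average a Gaussian affinity
    over all pairs. Illustrative only; the Gaussian affinity and
    bandwidth are our own choices, not the paper's construction."""
    rng = np.random.default_rng(seed)
    a1 = np.stack([augment(x1.copy(), rng=rng) for _ in range(n_samples)])
    a2 = np.stack([augment(x2.copy(), rng=rng) for _ in range(n_samples)])
    # Pairwise squared distances between the two clouds of augmented copies.
    sq_dists = ((a1[:, None, :] - a2[None, :, :]) ** 2).sum(axis=-1)
    return float(np.exp(-sq_dists / (2 * bandwidth**2)).mean())
```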
