Generalization Error of Invariant Classifiers
Jure Sokolić¹  Raja Giryes²  Guillermo Sapiro³  Miguel R. D. Rodrigues¹
¹University College London  ²Tel-Aviv University  ³Duke University

Abstract

This paper studies the generalization error of invariant classifiers. In particular, we consider the common scenario where the classification task is invariant to certain transformations of the input and the classifier is constructed (or learned) to be invariant to these transformations. Our approach relies on factoring the input space into a product of a base space and a set of transformations. We show that whereas the generalization error of a non-invariant classifier is proportional to the complexity of the input space, the generalization error of an invariant classifier is proportional to the complexity of the base space. We also derive a set of sufficient conditions on the geometry of the base space and the set of transformations that ensure that the complexity of the base space is much smaller than the complexity of the input space. Our analysis applies to general classifiers such as convolutional neural networks. We demonstrate the implications of the developed theory for such classifiers with experiments on the MNIST and CIFAR-10 datasets.

1 Introduction

One of the fundamental topics in statistical learning theory is that of the generalization error (GE). Given a training set and a hypothesis class, a learning algorithm chooses a hypothesis based on the training set in such a way that it minimizes an empirical loss. This loss, which is calculated on the training set, is also called the training loss, and it often underestimates the expected loss. The GE is the difference between the empirical loss and the expected loss.

There are various approaches in the literature that aim at bounding the GE via complexity measures of the hypothesis class, such as the VC-dimension (Vapnik, 1999; Vapnik and Chervonenkis, 1991), the fat-shattering dimension (Alon et al., 1997), and the Rademacher and Gaussian complexities (Bartlett and Mendelson, 2002). Another line of work provides GE bounds based on the stability of the algorithm, measuring how sensitive its output is to the removal or change of a single training sample (Bousquet and Elisseeff, 2002). Finally, there is a recent work by Xu and Mannor (2012) that bounds the GE in terms of the notion of algorithmic robustness.

An important property of the (traditional) GE bounds is that they are distribution agnostic, i.e., they hold for any distribution on the sample space. Moreover, GE bounds can lead to a principled derivation of learning algorithms with GE guarantees, e.g., the Support Vector Machine (SVM) (Cortes and Vapnik, 1995) and its extension to non-linear classification with kernel machines (Hofmann et al., 2008).

However, the design of learning algorithms in practice does not rely only on complexity measures of the hypothesis class; it also relies on exploiting the underlying structure present in the data. A prominent example is the field of computer vision, where features and learning algorithms are designed to be invariant to the intrinsic variability in the data (Soatto and Chiuso, 2016). Image classification is a particular computer vision task that requires representations invariant to various nuisances/transformations, such as the viewpoint and illumination variations commonly present in natural images, which carry no "helpful information" as to the identity of the classified object. This motivates us to develop a theory for learning algorithms that are invariant to certain sets of transformations.

The GE of invariant methods has been studied via the VC-dimension by Abu-Mostafa (1993), where it is shown that the subset of a hypothesis class that is invariant to certain transformations is smaller than the general hypothesis class and therefore has a smaller VC-dimension. Yet, the authors do not provide any characterization of how much smaller the VC-dimension of an invariant method might be. Similarly, group symmetry in the data distribution was explored in the problem of covariance estimation, where it is shown that leveraging group symmetry leads to gains in the sample complexity of covariance matrix estimation (Shah and Chandrasekaran, 2012; Soloveychik et al., 2016).

There are various other examples in the literature that aim to understand or leverage the role of invariance in data processing. For example, Convolutional Neural Networks (CNNs), which are known to achieve state-of-the-art results in image recognition, speech recognition, and many other tasks (LeCun et al., 2015), are known to possess certain invariances. The invariance in CNNs is achieved either by careful design of the architecture so that it is (approximately) invariant to various transformations such as rotation, scale, and affine deformations (Cohen and Welling, 2016; Dieleman et al., 2016; Gens and Domingos, 2014), or by training with an augmented training set, meaning that the training set is augmented with transformed versions of the training samples so that the learned network is approximately invariant (Simard et al., 2003); a minimal sketch of the latter appears below. Another example of a translation-invariant method is the scattering transform, a CNN-like transform based on wavelets and point-wise non-linearities (Bruna and Mallat, 2012); see also (Sifre and Mallat, 2013; Wiatowski and Bölcskei, 2015). In practice, such learning techniques achieve a lower GE than their "non-invariant" counterparts.
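To make the augmentation route concrete, the following minimal sketch enlarges a training set with rotated copies of each image, in the spirit of (Simard et al., 2003). It is our own toy illustration, not code from the paper: the function name, the reliance on scipy, and the choice of 90-degree rotations as the transformation set are all assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, labels, angles=(90, 180, 270)):
    """Augment (images, labels) with rotated copies of every sample.

    images: array of shape (m, H, W); labels: array of shape (m,).
    Returns a set of size m * (1 + len(angles)) on which a trained
    classifier is encouraged to be approximately rotation-invariant.
    """
    aug_images = [images]
    aug_labels = [labels]
    for angle in angles:
        # Rotate all images in the (H, W) plane; the label is unchanged
        # because the classification task is assumed invariant to it.
        aug_images.append(rotate(images, angle, axes=(1, 2), reshape=False))
        aug_labels.append(labels)
    return np.concatenate(aug_images), np.concatenate(aug_labels)
```

Counting the identity, this sketch uses a transformation set of size T = 4; the contributions in Section 1.1 below relate this T to the achievable reduction of the GE.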
Poggio et al. (2012) and Anselmi et al. (2014, 2016) study biologically plausible learning of invariant representations and connect their results to CNNs. The role of convolutions and pooling in the context of natural images is also studied by Cohen and Shashua (2016).

There are various works that study the GE of CNNs (Huang et al., 2015; Neyshabur et al., 2015; Shalev-Shwartz and Ben-David, 2014; Sokolić et al., 2016); however, they do not establish any connection between the network's invariance and its GE.

Motivated by the above examples, this work proposes a theoretical framework to study the GE of invariant learning algorithms and shows that an invariant learning technique may have a much smaller GE than a non-invariant learning technique. Moreover, our work directly relates the difference in GE bounds to the size of the set of transformations that a learning algorithm is invariant to. Our approach differs significantly from that of (Abu-Mostafa, 1993) because it focuses on the complexity of the data rather than on the complexity of the hypothesis class.

1.1 Contributions

The main contribution of this paper can be summarized as follows: we prove that given a learning method invariant to a set of transformations of size $T$, the GE of this method may be up to a factor $\sqrt{T}$ smaller than the GE of a non-invariant learning method.

Additionally, our other contributions include:

• We define notions of stable invariant classifiers and provide GE bounds for such classifiers;
• We establish a set of sufficient conditions that ensure that the bound on the GE of a stable invariant classifier is much smaller than the GE of a robust non-invariant classifier; we are not aware of any other works in the literature that achieve this;
• Our theory also suggests that explicitly enforcing invariance when training the networks should improve the generalization of the learning algorithm;
• The theoretical results are supported by experiments on the MNIST and CIFAR-10 datasets.

2 Problem Statement

We start by describing the problem of supervised learning and its associated GE. Then we define the notions of invariance in the classification task and the notion of an invariant algorithm.

2.1 Generalization Error

We consider learning a classifier from training samples. In particular, we assume that there is a probability distribution $P$ defined on the sample space $\mathcal{Z}$ and that we have a training set drawn i.i.d. from $P$, denoted by $S_m = \{s_i\}_{i=1}^{m}$, $s_i \in \mathcal{Z}$, $i = 1, \ldots, m$. A learning algorithm $\mathcal{A}$ takes the training set $S_m$ and maps it to a learned hypothesis $\mathcal{A}_{S_m}$. The loss of the hypothesis $\mathcal{A}_{S_m}$ on a sample $z \in \mathcal{Z}$ is denoted by $l(\mathcal{A}_{S_m}, z)$. The empirical loss and the expected loss of the learned hypothesis $\mathcal{A}_{S_m}$ are defined as

$$l_{\mathrm{emp}}(\mathcal{A}_{S_m}) = \frac{1}{m} \sum_{s_i \in S_m} l(\mathcal{A}_{S_m}, s_i) \quad \text{and} \tag{1}$$

$$l_{\exp}(\mathcal{A}_{S_m}) = \mathbb{E}_{s \sim P}\left[ l(\mathcal{A}_{S_m}, s) \right], \tag{2}$$

respectively; and the GE is defined as

$$\mathrm{GE}(\mathcal{A}_{S_m}) = \left| l_{\mathrm{emp}}(\mathcal{A}_{S_m}) - l_{\exp}(\mathcal{A}_{S_m}) \right|. \tag{3}$$

We consider a classification problem, where the sample space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ is a product of the input space $\mathcal{X}$ and the label space $\mathcal{Y}$: a vector $x \in \mathcal{X} \subseteq \mathbb{R}^N$ represents an observation with a corresponding class label $y \in \mathcal{Y} = \{1, 2, \ldots, N_{\mathcal{Y}}\}$. We will write $z = (x, y)$ and $s_i = (x_i, y_i)$.
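The quantities in (1)-(3) translate directly into code. The sketch below is our own illustration (the function names and the choice of the 0-1 loss are ours); since the distribution $P$ is unknown in practice, the expected loss (2) is approximated by an empirical average over a large held-out sample drawn from $P$.

```python
import numpy as np

def empirical_loss(loss_fn, hypothesis, X, y):
    """Average loss over a finite sample, as in Eq. (1)."""
    return np.mean([loss_fn(hypothesis, xi, yi) for xi, yi in zip(X, y)])

def generalization_error(loss_fn, hypothesis, X_train, y_train, X_held, y_held):
    """|l_emp - l_exp| as in Eq. (3), with the expectation in Eq. (2)
    replaced by a Monte Carlo estimate on a large held-out sample."""
    l_emp = empirical_loss(loss_fn, hypothesis, X_train, y_train)
    l_exp = empirical_loss(loss_fn, hypothesis, X_held, y_held)
    return abs(l_emp - l_exp)

def zero_one_loss(hypothesis, x, y):
    """0-1 loss: 1 if the hypothesis misclassifies x, else 0."""
    return float(hypothesis(x) != y)
```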
2.2 Stable Classifier and its Generalization

Stability of a learning algorithm defined in this way ensures that a learned classifier has a small GE, as we shall see in Theorem 1.

We also need a measure of the complexity/size of the input space $\mathcal{X}$, which is given by the covering number.

Definition 3. Consider a space $\mathcal{X}$ and a metric $d$.
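Since the covering number is the complexity measure that drives the bounds in this paper, a small numerical sketch may help make it concrete. It follows the standard definition (a $\rho$-cover of $\mathcal{X}$ is a set of centers such that every point of $\mathcal{X}$ lies within distance $\rho$ of some center, and the covering number is the cardinality of the smallest $\rho$-cover); the greedy strategy and all names are our own assumptions, not the paper's.

```python
import numpy as np

def greedy_covering_number(X, rho):
    """Upper-bound the covering number of a finite point cloud X
    (shape (n, N)) under the Euclidean metric by greedily picking
    still-uncovered points as cover centers."""
    uncovered = np.ones(len(X), dtype=bool)
    num_centers = 0
    while uncovered.any():
        center = X[np.argmax(uncovered)]  # first still-uncovered point
        num_centers += 1
        # Everything within distance rho of the new center is now covered.
        uncovered &= np.linalg.norm(X - center, axis=1) > rho
    return num_centers
```

Comparing such an estimate on samples from the full input space $\mathcal{X}$ with the estimate on a base space (the same samples with the nuisance transformations factored out) gives a feel for the roughly $T$-fold complexity gap that, through GE bounds growing with the square root of the covering number, is consistent with the factor-$\sqrt{T}$ improvement stated in Section 1.1.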