
arXiv:1602.07576v3 [cs.LG] 3 Jun 2016

Group Equivariant Convolutional Networks

Taco S. Cohen                                                    T.S.COHEN@UVA.NL
University of Amsterdam

Max Welling                                                      M.WELLING@UVA.NL
University of Amsterdam
University of California Irvine
Canadian Institute for Advanced Research

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We introduce Group equivariant Convolutional Neural Networks (G-CNNs), a natural generalization of convolutional neural networks that reduces sample complexity by exploiting symmetries. G-CNNs use G-convolutions, a new type of layer that enjoys a substantially higher degree of weight sharing than regular convolution layers. G-convolutions increase the expressive capacity of the network without increasing the number of parameters. Group convolution layers are easy to use and can be implemented with negligible computational overhead for discrete groups generated by translations, reflections and rotations. G-CNNs achieve state of the art results on CIFAR10 and rotated MNIST.

1. Introduction

Deep convolutional neural networks (CNNs, convnets) have proven to be very powerful models of sensory data such as images, video, and audio. Although a strong theory of neural network design is currently lacking, a large amount of empirical evidence supports the notion that both convolutional weight sharing and depth (among other factors) are important for good predictive performance.

Convolutional weight sharing is effective because there is a translation symmetry in most perception tasks: the label and the data distribution are both approximately invariant to shifts. By using the same weights to analyze or model each part of the image, a convolution layer uses far fewer parameters than a fully connected one, while preserving the capacity to learn many useful transformations.

Convolution layers can be used effectively in a deep network because all the layers in such a network are translation equivariant: shifting the image and then feeding it through a number of layers is the same as feeding the original image through the same layers and then shifting the resulting feature maps (at least up to edge-effects). In other words, the symmetry (translation) is preserved by each layer, which makes it possible to exploit it not just in the first, but also in higher layers of the network.

In this paper we show how convolutional networks can be generalized to exploit larger groups of symmetries, including rotations and reflections. The notion of equivariance is key to this generalization, so in section 2 we will discuss this concept and its role in deep representation learning. After discussing related work in section 3, we recall a number of mathematical concepts in section 4 that allow us to define and analyze the G-convolution in a generic manner.

In section 5, we analyze the equivariance properties of standard CNNs, and show that they are equivariant to translations but may fail to equivary with more general transformations. Using the mathematical framework from section 4 we can define G-CNNs (section 6) by analogy to standard CNNs (the latter being the G-CNN for the translation group). We show that G-convolutions, as well as various kinds of layers used in modern CNNs, such as arbitrary pointwise nonlinearities, batch normalization, pooling, and residual blocks, are all equivariant, and thus compatible with G-CNNs. In section 7 we provide concrete implementation details for group convolutions.

In section 8 we report experimental results on MNIST-rot and CIFAR10, where G-CNNs achieve state of the art results (2.28% error on MNIST-rot, and 4.19% resp. 6.46% on augmented and plain CIFAR10). We show that replacing planar convolutions with G-convolutions consistently improves results without additional tuning. In section 9 we provide a discussion of these results and consider several extensions of the method, before concluding in section 10.

2. Structured & Equivariant Representations

Deep neural networks produce a sequence of progressively more abstract representations by mapping the input through a series of parameterized functions (LeCun et al., 2015). In the current generation of neural networks, the representation spaces are usually endowed with very minimal internal structure, such as that of a linear space R^n.

In this paper we construct representations that have the structure of a linear G-space, for some chosen group G. This means that each vector in the representation space has a pose associated with it, which can be transformed by the elements of some group of transformations G. This additional structure allows us to model data more efficiently: a filter in a G-CNN detects co-occurrences of features that have the preferred relative pose, and can match such a feature constellation in every global pose through an operation called the G-convolution.

A representation space can obtain its structure from other representation spaces to which it is connected. For this to work, the network or layer Φ that maps one representation to another should be structure preserving. For G-spaces this means that Φ has to be equivariant:

    \Phi(T_g x) = T'_g \Phi(x),                                            (1)

That is, transforming an input x by a transformation g (forming T_g x) and then passing it through the learned map Φ should give the same result as first mapping x through Φ and then transforming the representation.

Equivariance can be realized in many ways, and in particular the operators T and T' need not be the same. The only requirement for T and T' is that for any two transformations g and h, we have T(gh) = T(g)T(h) (i.e. T is a linear representation of G).

From equation 1 we see that the familiar concept of invariance is a special kind of equivariance where T'_g is the identity transformation for all g. In deep learning, general equivariance is more useful than invariance because it is impossible to determine if features are in the right spatial configuration if they are invariant.

Besides improving statistical efficiency and facilitating geometrical reasoning, equivariance to symmetry transformations constrains the network in a way that can aid generalization. A network Φ can be non-injective, meaning that non-identical vectors x and y in the input space become identical in the output space (for example, two instances of a face may be mapped onto a single vector indicating the presence of any face). If Φ is equivariant, then the G-transformed inputs T_g x and T_g y must also be mapped to the same output. Their "sameness" (as judged by the network) is preserved under symmetry transformations.
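To make the equivariance condition of equation 1 concrete, here is a minimal numpy sketch (not part of the original paper) that checks it numerically for the translation group: Φ is a toy single-filter correlation layer and T_g, T'_g are both circular shifts. The filter, shift and sizes are arbitrary illustrative choices; the wrap-around boundary is used only to sidestep edge effects.

```python
import numpy as np
from scipy.ndimage import correlate  # cross-correlation, here with circular boundary


def phi(x, psi):
    """A toy layer Phi: planar cross-correlation of image x with filter psi."""
    return correlate(x, psi, mode='wrap')


def T(g, x):
    """Translation operator T_g: shift the image by the integer vector g."""
    return np.roll(x, shift=g, axis=(0, 1))


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))   # toy input image
psi = rng.standard_normal((3, 3))   # toy filter
g = (5, -3)                         # an arbitrary translation

lhs = phi(T(g, x), psi)             # transform the input, then map it
rhs = T(g, phi(x, psi))             # map the input, then transform the output
assert np.allclose(lhs, rhs)        # eq. (1): Phi(T_g x) = T'_g Phi(x)
```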
3. Related Work

There is a large body of literature on invariant representations. Invariance can be achieved by pose normalization using an equivariant detector (Lowe, 2004; Jaderberg et al., 2015) or by averaging a possibly nonlinear function over a group (Reisert, 2008; Skibbe, 2013; Manay et al., 2006; Kondor, 2007).

Scattering convolution networks use wavelet convolutions, nonlinearities and group averaging to produce stable invariants (Bruna & Mallat, 2013). Scattering networks have been extended to use convolutions on the group of translations, rotations and scalings, and have been applied to object and texture recognition (Sifre & Mallat, 2013; Oyallon & Mallat, 2015).

A number of recent works have addressed the problem of learning or constructing equivariant representations. This includes work on transforming autoencoders (Hinton et al., 2011), equivariant Boltzmann machines (Kivinen & Williams, 2011; Sohn & Lee, 2012), equivariant descriptors (Schmidt & Roth, 2012), and equivariant filtering (Skibbe, 2013).

Lenc & Vedaldi (2015) show that the AlexNet CNN (Krizhevsky et al., 2012) trained on imagenet spontaneously learns representations that are equivariant to flips, scaling and rotation. This supports the idea that equivariance is a good inductive bias for deep convolutional networks. Agrawal et al. (2015) show that useful representations can be learned in an unsupervised manner by training a convolutional network to be equivariant to ego-motion.

Anselmi et al. (2014; 2015) use the theory of locally compact topological groups to develop a theory of statistically efficient learning in sensory cortex. This theory was implemented for the commutative group consisting of time- and vocal tract length shifts for an application to speech recognition by Zhang et al. (2015).

Gens & Domingos (2014) proposed an approximately equivariant convolutional architecture that uses sparse, high-dimensional feature maps to deal with high-dimensional groups of transformations. Dieleman et al. (2015) showed that rotation symmetry can be exploited in convolutional networks for the problem of galaxy morphology prediction by rotating feature maps, effectively learning an equivariant representation. This work was later extended (Dieleman et al., 2016) and evaluated on various computer vision problems that have cyclic symmetry.

Cohen & Welling (2014) showed that the concept of disentangling can be understood as a reduction of the operators T_g in an equivariant representation, and later related this notion of disentangling to the more familiar statistical notion of decorrelation (Cohen & Welling, 2015).

4. Mathematical Framework

In this section we present a mathematical framework that enables a simple and generic definition and analysis of G-CNNs for various groups G. We begin by defining symmetry groups, and study in particular two groups that are used in the G-CNNs we have built so far. Then we take a look at functions on groups (used to model feature maps in G-CNNs) and their transformation properties.

4.1. Symmetry Groups

A symmetry of an object is a transformation that leaves the object invariant. For example, if we take the sampling grid of our image, Z^2, and flip it over we get −Z^2 = {(−n, −m) | (n, m) ∈ Z^2} = Z^2. So the flipping operation is a symmetry of the sampling grid.

If we have two symmetry transformations g and h and we compose them, the result gh is another symmetry transformation (i.e. it leaves the object invariant as well). Furthermore, the inverse g^{-1} of any symmetry is also a symmetry, and composing it with g gives the identity transformation e. A set of transformations with these properties is called a symmetry group.

One simple example of a group is the set of 2D integer translations, Z^2. Here the group operation ("composition of transformations") is addition: (n, m) + (p, q) = (n + p, m + q). One can verify that the sum of two translations is again a translation, and that the inverse (negative) of a translation is a translation, so this is indeed a group.

Although it may seem fancy to call 2-tuples of integers a group, this is helpful in our case because, as we will see in section 6, a useful notion of convolution can be defined for functions on any group^1, of which Z^2 is only one example. The important properties of the convolution, such as equivariance, arise primarily from the group structure.

^1 At least, on any locally compact group.

4.2. The group p4

The group p4 consists of all compositions of translations and rotations by 90 degrees about any center of rotation in a square grid. A convenient parameterization of this group in terms of three integers r, u, v is

    g(r, u, v) = \begin{bmatrix} \cos(r\pi/2) & -\sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{bmatrix},       (2)

where 0 ≤ r < 4 and (u, v) ∈ Z^2. The group operation is given by matrix multiplication.

The composition and inversion operations could also be represented directly in terms of integers (r, u, v), but the equations are cumbersome. Hence, our preferred method of composing two group elements represented by integer tuples is to convert them to matrices, multiply these matrices, and then convert the resulting matrix back to a tuple of integers (using the atan2 function to obtain r).

The group p4 acts on points in Z^2 (pixel coordinates) by multiplying the matrix g(r, u, v) by the homogeneous coordinate vector x(u', v') of a point (u', v'):

    g x \simeq \begin{bmatrix} \cos(r\pi/2) & -\sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix}.       (3)

4.3. The group p4m

The group p4m consists of all compositions of translations, mirror reflections, and rotations by 90 degrees about any center of rotation in the grid. Like p4, we can parameterize this group by integers:

    g(m, r, u, v) = \begin{bmatrix} (-1)^m \cos(r\pi/2) & -(-1)^m \sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{bmatrix},

where m ∈ {0, 1}, 0 ≤ r < 4 and (u, v) ∈ Z^2. The reader may verify that this is indeed a group.

Again, composition is most easily performed using the matrix representation. Computing r, u, v from a given matrix g can be done using the same method we use for p4, and for m we have m = ½(1 − det(g)).
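The tuple-to-matrix composition route can be sketched as follows. This is an illustrative numpy version of the procedure described above, not code from the paper; the helper names (p4_matrix, p4_tuple, p4_compose) and the test elements are assumptions made for this example.

```python
import numpy as np


def p4_matrix(r, u, v):
    """Homogeneous 3x3 matrix for the p4 element g(r, u, v) of eq. (2)."""
    c = int(round(np.cos(r * np.pi / 2)))
    s = int(round(np.sin(r * np.pi / 2)))
    return np.array([[c, -s, u],
                     [s,  c, v],
                     [0,  0, 1]])


def p4_tuple(M):
    """Recover the integer tuple (r, u, v) from a p4 matrix, using atan2 for r."""
    r = int(round(np.arctan2(M[1, 0], M[0, 0]) / (np.pi / 2))) % 4
    return r, int(M[0, 2]), int(M[1, 2])


def p4_compose(g1, g2):
    """Compose two p4 elements given as integer tuples: convert, multiply, convert back."""
    return p4_tuple(p4_matrix(*g1) @ p4_matrix(*g2))


# Example: compose a rotation and a translation, then check g g^{-1} = e.
g = p4_compose((1, 2, 0), (0, 0, 3))
g_inv = p4_tuple(np.linalg.inv(p4_matrix(*g)).round().astype(int))
assert p4_compose(g, g_inv) == (0, 0, 0)
```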
4.4. Functions on groups

We model images and stacks of feature maps in a conventional CNN as functions f : Z^2 → R^K supported on a bounded (typically rectangular) domain. At each pixel coordinate (p, q) ∈ Z^2, the stack of feature maps returns a K-dimensional vector f(p, q), where K denotes the number of channels.

Although the feature maps must always be stored in finite arrays, modeling them as functions that extend to infinity (while being non-zero on a finite region only) simplifies the mathematical analysis of CNNs.

We will be concerned with transformations of the feature maps, so we introduce the following notation for a transformation g acting on a set of feature maps:

    [L_g f](x) = [f \circ g^{-1}](x) = f(g^{-1} x)                                    (4)

Computationally, this says that to get the value of the g-transformed feature map L_g f at the point x, we need to do a lookup in the original feature map f at the point g^{-1}x, which is the unique point that gets mapped to x by g. This operator L_g is a concrete instantiation of the transformation operator T_g referenced in section 2, and one may verify that

    L_g L_h = L_{gh}.                                                                 (5)
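The following sketch (again illustrative, not the paper's implementation) applies L_g for a p4 element to a planar feature map by looking up f at g^{-1}x, and checks eq. (5) numerically. It reuses the hypothetical p4_compose helper from the sketch in section 4.2, places the origin at the central pixel of an odd-sized map, and wraps out-of-range lookups around periodically purely for convenience (the paper instead models feature maps as extending to infinity with zeros).

```python
import numpy as np


def apply_Lg(r, u, v, f):
    """[L_g f](x) = f(g^{-1} x) for g = g(r, u, v) in p4, on a square odd-sized
    feature map whose origin is the central pixel; periodic boundary."""
    n = f.shape[0]
    c = n // 2
    # R^{-1} is the transpose of the rotation block of eq. (2).
    R_inv = np.array([[np.cos(r * np.pi / 2),  np.sin(r * np.pi / 2)],
                      [-np.sin(r * np.pi / 2), np.cos(r * np.pi / 2)]]).round().astype(int)
    out = np.empty_like(f)
    for i in range(n):
        for j in range(n):
            x = np.array([i - c, j - c])            # pixel coordinate relative to the origin
            y = R_inv @ (x - np.array([u, v]))      # g^{-1} x = R^{-1}(x - t)
            out[i, j] = f[(y[0] + c) % n, (y[1] + c) % n]
    return out


# Check eq. (5): L_g L_h = L_{gh}, using p4_compose from the earlier sketch.
rng = np.random.default_rng(1)
f = rng.standard_normal((9, 9))
g, h = (1, 2, -1), (3, 0, 2)
assert np.allclose(apply_Lg(*g, apply_Lg(*h, f)),
                   apply_Lg(*p4_compose(g, h), f))
```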

If g represents a pure translation t = (u, v) ∈ Z^2 then g^{-1}x simply means x − t. The inverse on g in equation 4 ensures that the function is shifted in the positive direction when using a positive translation, and that L_g satisfies the criterion for being a homomorphism (eq. 5) even for transformations g and h that do not commute (i.e. gh ≠ hg).

As will be explained in section 6.1, feature maps in a G-CNN are functions on the group G, instead of functions on the group Z^2. For functions on G, the definition of L_g is still valid if we simply replace x (an element of Z^2) by h (an element of G), and interpret g^{-1}h as composition.

It is easy to mentally visualize a planar feature map f : Z^2 → R undergoing a transformation, but we are not used to visualizing functions on groups. To visualize a feature map or filter on p4, we plot the four patches associated with the four pure rotations on a circle, as shown in figure 1 (left). Each pixel in this figure has a rotation coordinate (the patch in which the pixel appears), and two translation coordinates (the pixel position within the patch).

Figure 1. A p4 feature map and its rotation by r.

When we apply the 90 degree rotation r to a function on p4, each planar patch follows its red r-arrow (thus incrementing the rotation coordinate by 1 (mod 4)), and simultaneously undergoes a 90-degree rotation. The result of this operation is shown on the right of figure 1. As we will see in section 6, a p4 feature map in a p4-CNN undergoes exactly this motion under rotation of the input image.

For p4m, we can make a similar plot, shown in figure 2. A p4m function has 8 planar patches, each one associated with a mirroring m and rotation r. Besides red rotation arrows, the figure now includes small blue reflection lines (which are undirected, since reflections are self-inverse).

Figure 2. A p4m feature map and its rotation by r.

Upon rotation of a p4m function, each patch again follows its red r-arrows and undergoes a 90 degree rotation. Under a mirroring, the patches connected by a blue line will change places and undergo the mirroring transformation.

This rich transformation structure arises from the group operation of p4 or p4m, combined with equation 4, which describes the transformation of a function on a group.

Finally, we define the involution of a feature map, which will appear in section 6.1 when we study the behavior of the G-convolution, and which also appears in the gradient of the G-convolution. We have:

    f^*(g) = f(g^{-1})                                                                (6)

For Z^2 feature maps the involution is just a point reflection, but for G-feature maps the meaning depends on the structure of G. In all cases, f^{**} = f.

5. Equivariance properties of CNNs

In this section we recall the definitions of the convolution and correlation operations used in conventional CNNs, and show that these operations are equivariant to translations but not to other transformations such as rotation. This is certainly well known and easy to see by mental visualization, but deriving it explicitly will make it easier to follow the derivation of group equivariance of the group convolution defined in the next section.

At each layer l, a regular convnet takes as input a stack of feature maps f : Z^2 → R^{K^l} and convolves or correlates it with a set of K^{l+1} filters ψ^i : Z^2 → R^{K^l}:

    [f * \psi^i](x) = \sum_{y \in \mathbb{Z}^2} \sum_{k=1}^{K^l} f_k(y) \, \psi^i_k(x - y)
                                                                                      (7)
    [f \star \psi^i](x) = \sum_{y \in \mathbb{Z}^2} \sum_{k=1}^{K^l} f_k(y) \, \psi^i_k(y - x)

If one employs convolution (*) in the forward pass, the correlation (⋆) will appear in the backward pass when computing gradients, and vice versa. We will use the correlation in the forward pass, and refer generically to both operations as "convolution".
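As a reference point for the later sketches, here is a direct (and deliberately slow) transcription of the correlation in eq. 7 into numpy. It is illustrative only; the function name and shapes are choices made for this example, and positions outside the image are treated as zero, following the "extend by zero" convention of section 4.4.

```python
import numpy as np


def plane_correlate(f, psi):
    """[f ⋆ psi](x) = sum_y sum_k f_k(y) psi_k(y - x), for a single filter psi.
    f has shape (K, H, W); psi has shape (K, n, n) with n odd, origin at its center.
    Values of f outside the stored array are taken to be zero."""
    K, H, W = f.shape
    c = psi.shape[-1] // 2
    out = np.zeros((H, W))
    for x0 in range(H):
        for x1 in range(W):
            for d0 in range(-c, c + 1):          # d = y - x, the filter offset
                for d1 in range(-c, c + 1):
                    y0, y1 = x0 + d0, x1 + d1
                    if 0 <= y0 < H and 0 <= y1 < W:
                        # dot product over the K channels
                        out[x0, x1] += f[:, y0, y1] @ psi[:, c + d0, c + d1]
    return out
```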
Using the substitution y → y + t, and leaving out the summation over feature maps for clarity, we see that a translation followed by a correlation is the same as a correlation followed by a translation:

    [[L_t f] \star \psi](x) = \sum_y f(y - t) \, \psi(y - x)
                            = \sum_y f(y) \, \psi(y + t - x)
                            = \sum_y f(y) \, \psi(y - (x - t))                        (8)
                            = [L_t [f \star \psi]](x).
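The translation case of eq. 8 was already checked numerically in the sketch of section 2. The following complementary sketch (illustrative, with an arbitrary asymmetric filter and circular boundaries so that the equalities are exact) shows that correlation does not commute with a 90-degree rotation unless the filter is inverse-rotated, which is exactly the statement of eq. 9 below.

```python
import numpy as np
from scipy.ndimage import correlate

rng = np.random.default_rng(2)
f = rng.standard_normal((9, 9))          # odd size: np.rot90 rotates about the center pixel
psi = np.array([[0., 1., 0.],            # a deliberately asymmetric 3x3 filter
                [0., 0., 0.],
                [0., 0., 0.]])

corr = lambda a, w: correlate(a, w, mode='wrap')

# Rotating the image and correlating is NOT the same as correlating and rotating ...
assert not np.allclose(corr(np.rot90(f), psi), np.rot90(corr(f, psi)))

# ... but it IS the same if the filter is rotated the other way (eq. 9 below).
assert np.allclose(corr(np.rot90(f), psi),
                   np.rot90(corr(f, np.rot90(psi, -1))))
```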

And so we say that "correlation is an equivariant map for the translation group", or that "correlation and translation commute". Using an analogous computation one can show that also for the convolution, [L_t f] * ψ = L_t[f * ψ].

Although convolutions are equivariant to translation, they are not equivariant to other isometries of the sampling lattice. For instance, as shown in the supplementary material, rotating the image and then convolving with a fixed filter is not the same as first convolving and then rotating the result:

    [[L_r f] \star \psi](x) = L_r[f \star [L_{r^{-1}} \psi]](x)                       (9)

In words, this says that the correlation of a rotated image L_r f with a filter ψ is the same as the rotation by r of the original image f convolved with the inverse-rotated filter L_{r^{-1}}ψ. Hence, if an ordinary CNN learns rotated copies of the same filter, the stack of feature maps is equivariant, although individual feature maps are not.

6. Group Equivariant Networks

In this section we will define the three layers used in a G-CNN (G-convolution, G-pooling, nonlinearity) and show that each one commutes with G-transformations of the domain of the image.

6.1. G-Equivariant correlation

The correlation (eq. 7) is computed by shifting a filter and then computing a dot product with the feature maps. By replacing the shift by a more general transformation from some group G, we get the G-correlation used in the first layer of a G-CNN:

    [f \star \psi](g) = \sum_{y \in \mathbb{Z}^2} \sum_k f_k(y) \, \psi_k(g^{-1} y).          (10)

Notice that both the input image f and the filter ψ are functions of the plane Z^2, but the feature map f ⋆ ψ is a function on the discrete group G (which may contain translations as a subgroup). Hence, for all layers after the first, the filters ψ must also be functions on G, and the correlation operation becomes

    [f \star \psi](g) = \sum_{h \in G} \sum_k f_k(h) \, \psi_k(g^{-1} h).                     (11)

The equivariance of this operation is derived in complete analogy to eq. 8, now using the substitution h → uh:

    [[L_u f] \star \psi](g) = \sum_{h \in G} \sum_k f_k(u^{-1} h) \, \psi_k(g^{-1} h)
                            = \sum_{h \in G} \sum_k f_k(h) \, \psi_k(g^{-1} u h)
                            = \sum_{h \in G} \sum_k f_k(h) \, \psi_k((u^{-1} g)^{-1} h)       (12)
                            = [L_u [f \star \psi]](g)

The equivariance of eq. 10 is derived similarly. Note that although equivariance is expressed by the same formula [L_u f] ⋆ ψ = L_u[f ⋆ ψ] for both the first-layer G-correlation (eq. 10) and the full G-correlation (eq. 11), the meaning of the operator L_u is different: for the first layer correlation, the inputs f and ψ are functions on Z^2, so L_u f denotes the transformation of such a function, while L_u[f ⋆ ψ] denotes the transformation of the feature map, which is a function on G. For the full G-correlation, both the inputs f and ψ and the output f ⋆ ψ are functions on G.

Note that if G is not commutative, neither the G-convolution nor the G-correlation is commutative. However, the feature maps ψ ⋆ f and f ⋆ ψ are related by the involution (eq. 6):

    f \star \psi = (\psi \star f)^*.                                                          (13)

Since the involution is invertible (it is its own inverse), the information content of f ⋆ ψ and ψ ⋆ f is the same. However, f ⋆ ψ is more efficient to compute when using the method described in section 7, because transforming a small filter is faster than transforming a large feature map.

It is customary to add a bias term to each feature map in a convolution layer. This can be done for G-conv layers as well, as long as there is only one bias per G-feature map (instead of one bias per spatial feature plane within a G-feature map). Similarly, batch normalization (Ioffe & Szegedy, 2015) should be implemented with a single scale and bias parameter per G-feature map in order to preserve equivariance. The sum of two G-equivariant feature maps is also G-equivariant, thus G-conv layers can be used in highway networks and residual networks (Srivastava et al., 2015; He et al., 2015).
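A naive but readable sketch of the first-layer p4-correlation of eq. 10 follows, together with a numerical check of the equivariance property of eq. 12. This is an illustrative numpy version, not the paper's (or any library's) implementation: the output is stored as an array of shape (4, H, W) with one planar map per stabilizer rotation, np.rot90 stands in for the rotated filter L_s ψ (so the rotation-index labeling follows numpy's convention), and circular boundaries are used so the check is exact.

```python
import numpy as np
from scipy.ndimage import correlate


def p4_correlate_first_layer(f, psi):
    """First-layer p4-correlation (eq. 10) for a single-channel image f and filter psi.
    Output shape (4, H, W): for each rotation r of the stabilizer, a planar correlation
    of f with the rotated filter; the translations are handled by the planar correlation."""
    out = np.empty((4,) + f.shape)
    for r in range(4):
        out[r] = correlate(f, np.rot90(psi, r), mode='wrap')
    return out


rng = np.random.default_rng(3)
f = rng.standard_normal((9, 9))
psi = rng.standard_normal((3, 3))

F = p4_correlate_first_layer(f, psi)
F_rot = p4_correlate_first_layer(np.rot90(f), psi)
for r in range(4):
    # Rotating the input cyclically permutes the rotation channels and rotates each
    # planar patch: exactly the p4 feature-map motion of figure 1 (cf. eq. 12).
    assert np.allclose(F_rot[r], np.rot90(F[(r - 1) % 4]))
```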
6.2. Pointwise nonlinearities

Equation 12 shows that G-correlation preserves the transformation properties of the previous layer. What about nonlinearities and pooling?

Recall that we think of feature maps as functions on G. In this view, applying a nonlinearity ν : R → R to a feature map amounts to function composition. We introduce the composition operator

    C_\nu f(g) = [\nu \circ f](g) = \nu(f(g)),                                        (14)

which acts on functions by post-composing them with ν. Since the left transformation operator L acts by pre-composition, C_ν and L_h commute:

    C_\nu L_h f = \nu \circ [f \circ h^{-1}] = [\nu \circ f] \circ h^{-1} = L_h C_\nu f,       (15)

so the rectified feature map inherits the transformation properties of the previous layer.

6.3. Subgroup pooling and coset pooling

In order to simplify the analysis, we split the pooling operation into two steps: the pooling itself (performed without stride), and a subsampling step. The non-strided max-pooling operation applied to a feature map f : G → R can be modeled as an operator P that acts on f as

    Pf(g) = \max_{k \in gU} f(k),                                                     (16)

where gU = {gu | u ∈ U} is the g-transformation of some pooling domain U ⊂ G (typically a neighborhood of the identity transformation). In a regular convnet, U is usually a 2 × 2 or 3 × 3 square including the origin (0, 0), and g is a translation.

As shown in the supplementary material, pooling commutes with L_h:

    P L_h = L_h P                                                                     (17)

Since pooling tends to reduce the variation in a feature map, it makes sense to sub-sample the pooled feature map, or equivalently, to do a "pooling with stride". In a G-CNN, the notion of "stride" is generalized by subsampling on a subgroup H ⊂ G. That is, H is a subset of G that is itself a group (i.e. closed under multiplication and inverses). The subsampled feature map is then equivariant to H but not G.

In a standard convnet, pooling with stride 2 is the same as pooling and then subsampling on H = {(2i, 2j) | (i, j) ∈ Z^2}, which is a subgroup of G = Z^2. For the p4-CNN, we may subsample on the subgroup H containing all 4 rotations, as well as shifts by multiples of 2 pixels.

We can obtain full G-equivariance by choosing our pooling region U to be a subgroup H ⊂ G. The pooling domains gH that result are called cosets in group theory. The cosets partition the group into non-overlapping regions. The feature map that results from pooling over cosets is invariant to the right-action of H, because the cosets are similarly invariant (ghH = gH). Hence, we can arbitrarily choose one coset representative per coset to subsample on. The feature map that results from coset pooling may be thought of as a function on the quotient space G/H, in which two transformations are considered equivalent if they are related by a transformation in H.

As an example, in a p4 feature map, we can pool over all four rotations at each spatial position (the cosets of the subgroup R of rotations around the origin). The resulting feature map is a function on Z^2 ≅ p4/R, i.e. it will transform in the same way as the input image. Another example is given by a feature map on Z, where we could pool over the cosets of the subgroup nZ of shifts by multiples of n. This gives a feature map on Z/nZ, which has a cyclic transformation law under translations.

This concludes our analysis of G-CNNs. Since all layer types are equivariant, we can freely stack them into deep networks and expect G-conv parameter sharing to be effective at arbitrary depth.
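For a p4 feature map stored as in the earlier sketches (an assumed layout with the rotation axis in front), coset max-pooling over the rotation subgroup is just a maximum along that axis. The following illustrative check shows the resulting planar map transforms like the input image, as described above.

```python
import numpy as np


def p4_rotation_coset_pool(F):
    """Coset max-pooling over rotations: a p4 feature map of shape (..., 4, H, W)
    becomes a planar (Z^2) feature map of shape (..., H, W)."""
    return F.max(axis=-3)


def rotate_p4(F):
    """The p4 feature-map motion of figure 1: cyclically shift the rotation axis
    and rotate each planar patch by 90 degrees."""
    return np.rot90(np.roll(F, 1, axis=-3), axes=(-2, -1))


rng = np.random.default_rng(4)
F = rng.standard_normal((4, 9, 9))                     # a toy p4 feature map
# Rotating the p4 feature map only rotates the pooled planar map:
assert np.allclose(p4_rotation_coset_pool(rotate_p4(F)),
                   np.rot90(p4_rotation_coset_pool(F)))
```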
7. Efficient Implementation

Computing the G-convolution involves nothing more than indexing arithmetic and inner products, so it can be implemented straightforwardly. Here we present the details for a G-convolution implementation that can leverage recent advances in fast computation of planar convolutions (Mathieu et al., 2014; Vasilache et al., 2015; Lavin & Gray, 2015).

A plane symmetry group G is called split if any transformation g ∈ G can be decomposed into a translation t ∈ Z^2 and a transformation s in the stabilizer of the origin (i.e. s leaves the origin invariant). For the group p4, we can write g = ts for t a translation and s a rotation about the origin, while p4m splits into translations and rotation-flips. Using this split of G and the fact that L_g L_h = L_{gh}, we can rewrite the G-correlation (eq. 10 and 11) as follows:

    [f \star \psi](ts) = \sum_{h \in X} \sum_k f_k(h) \, [L_t [L_s \psi_k]](h),              (18)

where X = Z^2 in layer one and X = G in further layers. Thus, to compute the p4 (or p4m) correlation f ⋆ ψ we can first compute L_s ψ ("filter transformation") for all four rotations (or all eight rotation-flips) and then call a fast planar correlation routine on f and the augmented filter bank.

The computational cost of the algorithm presented here is roughly equal to that of a planar convolution with a filter bank that is the same size as the augmented filter bank used in the G-convolution, because the cost of the filter transformation is negligible.

7.1. Filter transformation

The set of filters at layer l is stored in an array F[·] of shape K^l × K^{l−1} × S^{l−1} × n × n, where K^l is the number of channels at layer l, S^{l−1} denotes the number of transformations in G that leave the origin invariant (e.g. 1, 4 or 8 for Z^2, p4 or p4m filters, respectively), and n is the spatial (or translational) extent of the filter. Note that typically, S^1 = 1 for 2D images, while S^l = 4 or S^l = 8 for l > 1.

The filter transformation L_s amounts to a permutation of the entries of each of the K^l × K^{l−1} scalar-valued filter channels in F. Since we are applying S^l transformations to each filter, the output of this operation is an array of shape K^l × S^l × K^{l−1} × S^{l−1} × n × n, which we call F^+.

The permutation can be implemented efficiently by a GPU kernel that does a lookup into F for each output cell of F^+, using a precomputed index associated with the output cell. To precompute the indices, we define an invertible map g(s, u, v) that takes an input index (valid for an array of shape S^{l−1} × n × n) and produces the associated group element g as a matrix (section 4.2 and 4.3). For each input index (s, u, v) and each transformation s′, we compute s̄, ū, v̄ = g^{−1}(g(s′, 0, 0)^{−1} g(s, u, v)). This index is used to set F^+[i, s′, j, s, u, v] = F[i, j, s̄, ū, v̄] for all i, j.

The G-convolution for a new group can be added by simply implementing a map g(·) from indices to matrices.
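A compact sketch of this scheme for p4, on layers after the first, is given below. It is an illustrative numpy/scipy version, not the paper's GPU implementation: the filter transformation uses np.rot90 and a cyclic shift of the input-rotation axis instead of the precomputed index map, the planar correlations are applied in an explicit loop rather than through the reshape-and-batch trick described in section 7.2 below, circular boundaries are used so the equivariance check is exact, and rotate_p4 from the pooling sketch of section 6.3 is reused. All shapes and names are assumptions made for this example.

```python
import numpy as np
from scipy.ndimage import correlate


def p4_filter_transform(w):
    """Expand p4 filters w of shape (K_out, K_in, 4, n, n) into the augmented bank
    F+ of shape (K_out, 4, K_in, 4, n, n): for each output rotation s', rotate the
    spatial part and cyclically shift the input-rotation axis."""
    K_out, K_in, S, n, _ = w.shape
    assert S == 4
    w_plus = np.empty((K_out, 4, K_in, 4, n, n))
    for s_out in range(4):
        for s_in in range(4):
            w_plus[:, s_out, :, s_in] = np.rot90(
                w[:, :, (s_in - s_out) % 4], k=s_out, axes=(-2, -1))
    return w_plus


def p4_gconv(x, w):
    """Full p4-correlation (eq. 11): x has shape (K_in, 4, H, W), filters w have
    shape (K_out, K_in, 4, n, n); output has shape (K_out, 4, H, W)."""
    K_out = w.shape[0]
    H, W = x.shape[-2:]
    w_plus = p4_filter_transform(w)
    y = np.zeros((K_out, 4, H, W))
    for o in range(K_out):
        for s_out in range(4):
            for i in range(x.shape[0]):
                for s_in in range(4):
                    y[o, s_out] += correlate(x[i, s_in],
                                             w_plus[o, s_out, i, s_in],
                                             mode='wrap')
    return y


# Equivariance check: a rotation of the input p4 feature maps produces the same
# rotation of the output p4 feature maps (rotate_p4 as in the section 6.3 sketch).
rng = np.random.default_rng(5)
x = rng.standard_normal((3, 4, 9, 9))      # K_in = 3 input p4 feature maps
w = rng.standard_normal((2, 3, 4, 3, 3))   # K_out = 2 filters on p4
assert np.allclose(p4_gconv(rotate_p4(x), w),
                   rotate_p4(p4_gconv(x, w)))
```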
7.2. Planar convolution

The second part of the G-convolution algorithm is a planar convolution using the expanded filter bank F^+. If S^l > 1, the sum over X in eq. 18 involves a sum over the stabilizer. This sum can be folded into the sum over feature channels performed by the planar convolution routine by reshaping F^+ from K^l × S^l × K^{l−1} × S^{l−1} × n × n to S^l K^l × S^{l−1} K^{l−1} × n × n. The resulting array can be interpreted as a conventional filter bank with S^{l−1} K^{l−1} planar input channels and S^l K^l planar output channels, which can be correlated with the feature maps f (similarly reshaped).

8. Experiments

8.1. Rotated MNIST

The rotated MNIST dataset (Larochelle et al., 2007) contains 62000 randomly rotated handwritten digits. The dataset is split into training, validation and test sets of size 10000, 2000 and 50000, respectively.

We performed model selection using the validation set, yielding a CNN architecture (Z2CNN) with 7 layers of 3 × 3 convolutions (4 × 4 in the final layer), 20 channels in each layer, relu activation functions, batch normalization, dropout, and max-pooling after layer 2. For optimization, we used the Adam algorithm (Kingma & Ba, 2015). This baseline architecture outperforms the models tested by Larochelle et al. (2007) (when trained on 12k and evaluated on 50k), but does not match the previous state of the art, which uses prior knowledge about rotations (Schmidt & Roth, 2012) (see table 1).

Next, we replaced each convolution by a p4-convolution (eq. 10 and 11), divided the number of filters by √4 = 2 (so as to keep the number of parameters approximately fixed), and added max-pooling over rotations after the last convolution layer. This architecture (P4CNN) was found to perform better without dropout, so we removed it. The P4CNN almost halves the error rate of the previous state of the art (2.28% vs 3.98% error).

We then tested the hypothesis that premature invariance is undesirable in a deep architecture (section 2). We took the Z2CNN and replaced each convolution layer by a p4-convolution (eq. 10) followed by a coset max-pooling over rotations. The resulting feature maps consist of rotation-invariant features, and have the same transformation law as the input image. This network (P4CNNRotationPooling) outperforms the baseline and the previous state of the art, but performs significantly worse than the P4CNN, which does not pool over rotations in intermediate layers.

    Network                      Test Error (%)
    Larochelle et al. (2007)     10.38 ± 0.27
    Sohn & Lee (2012)             4.2
    Schmidt & Roth (2012)         3.98
    Z2CNN                         5.03 ± 0.0020
    P4CNNRotationPooling          3.21 ± 0.0012
    P4CNN                         2.28 ± 0.0004

    Table 1. Error rates on rotated MNIST (with standard deviation under variation of the random seed).

8.2. CIFAR-10

The CIFAR-10 dataset consists of 60k images of size 32 × 32, divided into 10 classes. The dataset is split into 40k training, 10k validation and 10k testing splits.

We compared the p4-, p4m- and standard planar Z^2 convolutions on two kinds of baseline architectures. Our first baseline is the All-CNN-C architecture by Springenberg et al. (2015), which consists of a sequence of 9 strided and non-strided convolution layers, interspersed with rectified linear activation units, and nothing else. Our second baseline is a residual network (He et al., 2016), which consists of an initial convolution layer, followed by three stages of 2n convolution layers using k_i filters at stage i, followed by a final classification layer (6n + 2 layers in total). The first convolution in each stage i > 1 uses a stride of 2, so the feature map sizes are 32, 16, and 8 for the three stages. We use n = 7, k_i = 32, 64, 128, yielding a wide 44-layer network called ResNet44.

To evaluate G-CNNs, we replaced all convolution layers of the baseline architectures by p4 or p4m convolutions. For a constant number of filters, this increases the size of the feature maps 4- or 8-fold, which in turn increases the number of parameters required per filter in the next layer.
Hence, we halve the number of filters in each p4-conv layer, and divide it by roughly √8 ≈ 3 in each p4m-conv layer. This way, the number of parameters is left approximately invariant, while the size of the internal representation is increased. Specifically, we used k_i = 11, 23, 45 for p4m-ResNet44.

To evaluate the impact of data augmentation, we compare the networks on CIFAR10 and augmented CIFAR10+. The latter denotes moderate data augmentation with horizontal flips and small translations, following Goodfellow et al. (2013) and many others.

The training procedure for the All-CNN was reproduced as closely as possible from Springenberg et al. (2015). For the ResNets, we used stochastic gradient descent with an initial learning rate of 0.05 and momentum 0.9. The learning rate was divided by 10 at epochs 50, 100 and 150, and training was continued for 300 epochs.

    Network     G      CIFAR10   CIFAR10+   Param.
    All-CNN     Z^2    9.44      8.86       1.37M
                p4     8.84      7.67       1.37M
                p4m    7.59      7.04       1.22M
    ResNet44    Z^2    9.45      5.61       2.64M
                p4m    6.46      4.94       2.62M

    Table 2. Comparison of conventional (i.e. Z^2), p4 and p4m CNNs on CIFAR10 and augmented CIFAR10+. Test set error rates and number of parameters are reported.

To the best of our knowledge, the p4m-CNN outperforms all published results on plain CIFAR10 (Wan et al., 2013; Goodfellow et al., 2013; Lin et al., 2014; Lee et al., 2015b; Srivastava et al., 2015; Clevert et al., 2015; Lee et al., 2015a). However, due to radical differences in model sizes and architectures, it is difficult to infer much about the intrinsic merit of the various techniques.
It is quite possible that the cited methods would yield better results when deployed in larger networks or in combination with other techniques. Extreme data augmentation and model ensembles can also further improve the numbers (Graham, 2014).

Inspired by the wide ResNets of Zagoruyko & Komodakis (2016), we trained another ResNet with 26 layers and k_i = (71, 142, 248) (for planar convolutions) or k_i = (50, 100, 150) (for p4m convolutions). When trained with moderate data augmentation, this network achieves an error rate of 5.27% using planar convolutions, and 4.19% with p4m convolutions. This result is comparable to the 4.17% error reported by Zagoruyko & Komodakis (2016), but uses fewer parameters (7.2M vs 36.5M).

9. Discussion & Future work

Our results show that p4 and p4m convolution layers can be used as a drop-in replacement of standard convolutions that consistently improves the results.

G-CNNs benefit from data augmentation in the same way as convolutional networks, as long as the augmentation comes from a group larger than G. Augmenting with flips and small translations consistently improves the results for the p4 and p4m-CNN.

The CIFAR dataset is not actually symmetric, since objects typically appear upright. Nevertheless, we see substantial increases in accuracy on this dataset, indicating that there need not be a full symmetry for G-convolutions to be beneficial.

In future work, we want to implement G-CNNs that work on hexagonal lattices, which have an increased number of symmetries relative to square grids, as well as G-CNNs for 3D space groups. All of the theory presented in this paper is directly applicable to these groups, and the G-convolution can be implemented in such a way that new groups can be added by simply specifying the group operation and a bijective map between the group and the set of indices.

One limitation of the method as presented here is that it only works for discrete groups. Convolution on continuous (locally compact) groups is mathematically well-defined, but may be hard to approximate in an equivariant manner. A further challenge, already identified by Gens & Domingos (2014), is that a full enumeration of the transformations in a group may not be feasible if the group is large.

Finally, we hope that the current work can serve as a concrete example of the general philosophy of "structured representations", outlined in section 2. We believe that adding mathematical structure to a representation (while making sure that maps between representations preserve this structure) could enhance the ability of neural nets to see abstract similarities between superficially different concepts.

10. Conclusion

We have introduced G-CNNs, a generalization of convolutional networks that substantially increases the expressive capacity of a network without increasing the number of parameters. By exploiting symmetries, G-CNNs achieve state of the art results on rotated MNIST and CIFAR10.

We have developed the general theory of G-CNNs for discrete groups, showing that all layer types are equivariant to the action of the chosen group G. Our experimental results show that G-convolutions can be used as a drop-in replacement for spatial convolutions in modern network architectures, improving their performance without further tuning.

Acknowledgements

We would like to thank Joan Bruna, Sander Dieleman, Robert Gens, Chris Olah, and Stefano Soatto for helpful discussions. This research was supported by NWO (grant number NAI.14.108), Google and Facebook.

References

Agrawal, P., Carreira, J., and Malik, J. Learning to See by Moving. In International Conference on Computer Vision (ICCV), 2015.

Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, A., and Poggio, T. Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? Technical Report 001, MIT Center for Brains, Minds and Machines, 2014.

Anselmi, F., Rosasco, L., and Poggio, T. On Invariance and Selectivity in Representation Learning. Technical report, MIT Center for Brains, Minds and Machines, 2015.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1872–86, 2013.

Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289v3, 2015.

Cohen, T. and Welling, M. Learning the Irreducible Representations of Commutative Lie Groups. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 31, pp. 1755–1763, 2014.

Cohen, T. S. and Welling, M. Transformation Properties of Learned Visual Representations. International Conference on Learning Representations (ICLR), 2015.

Dieleman, S., Willett, K. W., and Dambre, J. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2), 2015.

Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Exploiting Cyclic Symmetry in Convolutional Neural Networks. In International Conference on Machine Learning (ICML), 2016.

Gens, R. and Domingos, P. Deep Symmetry Networks. In Advances in Neural Information Processing Systems (NIPS), 2014.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1319–1327, 2013.

Graham, B. Fractional Max-Pooling. arXiv:1412.6071, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Identity Mappings in Deep Residual Networks. arXiv:1603.05027, 2016.

Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. ICANN-11: International Conference on Artificial Neural Networks, Helsinki, 2011.

Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167v3, 2015.

Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. Spatial Transformer Networks. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015.

Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Kivinen, J. J. and Williams, C. K. I. Transformation equivariant Boltzmann machines. In 21st International Conference on Artificial Neural Networks, 2011.

Kondor, R. A novel set of rotationally and translationally invariant features for images based on the non-commutative bispectrum. arXiv:0701127, 2007.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

Lavin, A. and Gray, S. Fast Algorithms for Convolutional Neural Networks. arXiv:1509.09308, 2015.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Lee, C., Gallagher, P. W., and Tu, Z. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree. arXiv:1509.08985, 2015a.

Lee, C., Xie, S., Gallagher, P. W., Zhang, Z., and Tu, Z. Deeply-Supervised Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38, pp. 562–570, 2015b.

Lenc, K. and Vedaldi, A. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Lin, M., Chen, Q., and Yan, S. Network In Network. International Conference on Learning Representations (ICLR), 2014.

Lowe, D. G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

Manay, S., Cremers, D., Hong, B. W., Yezzi, A. J., and Soatto, S. Integral invariants for shape matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1602–1617, 2006.

Mathieu, M., Henaff, M., and LeCun, Y. Fast Training of Convolutional Networks through FFTs. In International Conference on Learning Representations (ICLR), 2014.

Oyallon, E. and Mallat, S. Deep Roto-Translation Scattering for Object Classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2865–2873, 2015.

Reisert, M. Group Integration Techniques in Pattern Analysis. PhD thesis, Albert-Ludwigs-University, 2008.

Schmidt, U. and Roth, S. Learning rotation-aware features: From invariant priors to equivariant descriptors. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

Sifre, L. and Mallat, S. Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Skibbe, H. Spherical Algebra for Biomedical Image Analysis. PhD thesis, Albert-Ludwigs-Universität Freiburg im Breisgau, 2013.

Sohn, K. and Lee, H. Learning Invariant Representations with Local Transformations. Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for Simplicity: The All Convolutional Net. Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Training Very Deep Networks. Advances in Neural Information Processing Systems (NIPS), 2015.

Vasilache, N., Johnson, J., Mathieu, M., Chintala, S., Piantino, S., and LeCun, Y. Fast convolutional nets with fbfft: A GPU performance evaluation. In International Conference on Learning Representations (ICLR), 2015.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using dropconnect. International Conference on Machine Learning (ICML), pp. 109–111, 2013.

Zagoruyko, S. and Komodakis, N. Wide Residual Networks. arXiv:1605.07146, 2016.

Zhang, C., Voinea, S., Evangelopoulos, G., Rosasco, L., and Poggio, T. Discriminative template learning in group-convolutional networks for invariant speech representations. InterSpeech, pp. 3229–3233, 2015.

Appendix A: Equivariance Derivations

We claim in the paper that planar correlation is not equivariant to rotations. Let f : R^2 → R^K be an image with K channels, and let ψ : R^2 → R^K be a filter. Take a rotation r about the origin. The ordinary planar correlation ⋆ is not equivariant to rotations, i.e., [L_r f] ⋆ ψ ≠ L_r[f ⋆ ψ]. Instead we have:

    [[L_r f] \star \psi](x) = \sum_{y \in \mathbb{Z}^2} \sum_k L_r f_k(y) \, \psi_k(y - x)
                            = \sum_{y \in \mathbb{Z}^2} \sum_k f_k(r^{-1} y) \, \psi_k(y - x)
                            = \sum_{y \in \mathbb{Z}^2} \sum_k f_k(y) \, \psi_k(r y - x)
                            = \sum_{y \in \mathbb{Z}^2} \sum_k f_k(y) \, \psi_k(r(y - r^{-1} x))       (19)
                            = \sum_{y \in \mathbb{Z}^2} \sum_k f_k(y) \, L_{r^{-1}} \psi_k(y - r^{-1} x)
                            = [f \star [L_{r^{-1}} \psi]](r^{-1} x)
                            = L_r[f \star [L_{r^{-1}} \psi]](x)

Line by line, we used the following definitions, facts and manipulations:

1. The definition of the correlation ⋆.
2. The definition of L_r, i.e. L_r f(x) = f(r^{-1} x).
3. The substitution y → ry, which does not change the summation bounds since rotation is a symmetry of the sampling grid Z^2.
4. Distributivity.
5. The definition of L_r.
6. The definition of the correlation ⋆.
7. The definition of L_r.

A visual proof can be found in (Dieleman et al., 2016).

Using a similar line of reasoning, we can show that pooling commutes with L_h:

    P L_h f(g) = \max_{k \in gU} L_h f(k)
               = \max_{k \in gU} f(h^{-1} k)
               = \max_{k \in h^{-1} gU} f(k)                                                           (20)
               = P f(h^{-1} g)
               = L_h P f(g)

Appendix B: Gradients

To train a G-CNN, we need to compute gradients of a loss function with respect to the parameters of the filters. If we use the fast algorithm explained in section 7 of the main paper, we only have to implement the gradient of the indexing operation (section 7.1, "filter transformation"), because the 2D convolution routine and its gradient are given.

This gradient is computed as follows. The gradient of the loss with respect to cell i in the input of the indexing operation is the sum of the gradients of the output cells j that index cell i. On current GPU hardware, this can be implemented efficiently using a kernel that is instantiated for each cell j in the output array. The kernel adds the value of the gradient of the loss with respect to cell j to cell i of the array that holds the gradient of the loss with respect to the input of the indexing operation (this array is to be initialized at zero). Since multiple kernels write to the same cell i, the additions must be done using atomic operations to avoid concurrency problems.

Alternatively, one could implement the filter transformation using a precomputed permutation matrix. This is not as efficient, but the gradient is trivial, and most computation graph / deep learning packages will have implemented the matrix multiplication and its gradient.
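The principle can be illustrated in numpy (this is a sketch, not the GPU kernel described above): the forward pass is a gather, its gradient is a scatter-add into the input, and np.add.at plays the role of the atomic additions; the flattened array and the index vector idx are illustrative stand-ins for the precomputed indices of section 7.1.

```python
import numpy as np

rng = np.random.default_rng(6)
F = rng.standard_normal(10)                 # flattened filter bank (input of the indexing op)
idx = rng.integers(0, 10, size=25)          # each output cell reads one input cell
F_plus = F[idx]                             # forward: gather

# Backward: accumulate the upstream gradient into the cells that were read.
# Several output cells may read the same input cell, so additions must accumulate.
grad_F_plus = rng.standard_normal(25)       # dL/dF_plus
grad_F = np.zeros_like(F)
np.add.at(grad_F, idx, grad_F_plus)         # scatter-add (atomic adds on a GPU)

# Cross-check against the permutation-matrix view: F_plus = P @ F, so dL/dF = P^T @ dL/dF_plus.
P = np.zeros((25, 10))
P[np.arange(25), idx] = 1.0
assert np.allclose(grad_F, P.T @ grad_F_plus)
```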

Appendix C: G-conv calculus

Although the gradient of the filter transformation operation is all that is needed to do backpropagation in a G-CNN for a split group G, it is instructive to derive the analytical gradients of the G-correlation operation. This leads to an elegant "G-conv calculus", included here for the interested reader.

Let feature map k at layer l be denoted f^l_k = f^{l-1} ⋆ ψ^{lk}, where f^{l-1} is the stack of feature maps in the previous layer. At some point in the backprop algorithm, we will have computed the derivative ∂L/∂f^l_k for all k, and we need to compute ∂L/∂f^{l-1}_j (to backpropagate to lower layers) as well as ∂L/∂ψ^{lk}_j (to update the parameters). We find that

    \frac{\partial L}{\partial f^{l-1}_j(g)}
        = \sum_{h,k} \frac{\partial L}{\partial f^l_k(h)} \frac{\partial f^l_k(h)}{\partial f^{l-1}_j(g)}
        = \sum_{h,k} \frac{\partial L}{\partial f^l_k(h)}
            \frac{\partial}{\partial f^{l-1}_j(g)} \sum_{h',k'} f^{l-1}_{k'}(h') \, \psi^{lk}_{k'}(h^{-1} h')
        = \sum_{h,k} \frac{\partial L}{\partial f^l_k(h)} \, \psi^{lk}_j(h^{-1} g)                            (21)
        = \left[ \frac{\partial L}{\partial f^l} \star \psi^{l*}_j \right](g),

where the superscript ∗ denotes the involution

    \psi^*(g) = \psi(g^{-1}),                                                                                 (22)

and ψ^l_j is the set of filter components applied to input feature map j at layer l:

    \psi^l_j(g) = \left( \psi^{l1}_j(g), \ldots, \psi^{lK^l}_j(g) \right)                                     (23)

To compute the gradient with respect to component j of filter k, we have to G-convolve the j-th input feature map with the k-th output feature map:

    \frac{\partial L}{\partial \psi^{lk}_j(g)}
        = \sum_h \frac{\partial L}{\partial f^l_k(h)} \frac{\partial f^l_k(h)}{\partial \psi^{lk}_j(g)}
        = \sum_h \frac{\partial L}{\partial f^l_k(h)}
            \frac{\partial}{\partial \psi^{lk}_j(g)} \sum_{h',k'} f^{l-1}_{k'}(h') \, \psi^{lk}_{k'}(h^{-1} h')
        = \sum_h \frac{\partial L}{\partial f^l_k(h)} \, f^{l-1}_j(h g)                                       (24)
        = \left[ \left( \frac{\partial L}{\partial f^l_k} \right)^{*} \star \left( f^{l-1}_j \right)^{*} \right](g)

So we see that both the forward and backward passes involve convolution or correlation operations, as is the case in standard convnets.