Structured Receptive Fields in Cnns

Structured Receptive Fields in CNNs Jorn-Henrik¨ Jacobsen1, Jan van Gemert1,2, Zhongyou Lou1, Arnold W. M. Smeulders1 1University of Amsterdam, The Netherlands 2TU Delft, The Netherlands {j.jacobsen,z.lou,a.w.m.smeulders}@uva.nl, [email protected] Abstract Learning powerful feature representations with CNNs is hard when training data are limited. Pre-training is one way to overcome this, but it requires large datasets suffi- ciently similar to the target domain. Another option is to design priors into the model, which can range from tuned hy- Figure 1: A subset of filters of the first structured recep- perparameters to fully engineered representations like Scat- tive field CNN layer as trained on 100-class ILSVRC2012 tering Networks. We combine these ideas into structured and the Gaussian derivative basis they are learned from. The receptive field networks, a model which has a fixed filter network learns scaled and rotated versions of zero, first, sec- basis and yet retains the flexibility of CNNs. This flexibil- ond and third order filters. Furthermore, the filters learn to ity is achieved by expressing receptive fields in CNNs as a recombine the different input color channels which is a cru- weighted sum over a fixed basis which is similar in spirit cial property of CNNs. to Scattering Networks. The key difference is that we learn arbitrary effective filter sets from the basis rather than mod- eling the filters. This approach explicitly connects clas- eter spaces. Since all images are spatially coherent and hu- sical multiscale image analysis with general CNNs. With man observers are considered to only cast local variations structured receptive field networks, we improve consider- up to a certain order as meaningful [11, 18] our key as- ably over unstructured CNNs for small and medium dataset sumption is that it is unnecessary to learn these properties scenarios as well as over Scattering for large datasets. We in the network. When visualizing the intermediate layers validate our findings on ILSVRC2012, Cifar-10, Cifar-100 of a trained network, see e.g. [39] and Figure 2, it becomes and MNIST. As a realistic small dataset example, we show evident that the filters as learned in a CNN are locally coher- state-of-the-art classification results on popular 3D MRI ent and as a consequence can be decomposed into a smooth brain-disease datasets where pre-training is difficult due to compact filter basis [12]. a lack of large public datasets in a similar domain. We aim to maintain the CNN’s capacity to learn general 1. Introduction variances and invariances in arbitrary images. Following from our assumptions, the demand is posed on the filter set Where convolutional networks have appeared enor- that i) a linear combination of a finite basis set is capable of mously powerful in the classification of images when ample forming any arbitrary filter necessary for the task at hand, data are available [14], we focus on smaller image datasets. as illustrated in Figure 1 and ii) that we preserve the full We propose structuring receptive fields in CNNs as linear learning capacity of the network. For i) we choose the fam- combinations of basis functions to train them with fewer ily of Gaussian filters and its smooth derivatives for which image data. it has been proven [12] that 3-rd or 4-th order is sufficient The common approach to smaller datasets is to per- to capture all local image variation perceivable by humans. form pre-training on a large dataset, usually ImageNet [29]. According to scale-space theory [11, 35], the Gaussian fam- Where CNNs generalize well to domains similar to the do- ily constitutes the Taylor expansion of the image function main where the pre-training came from [27, 40], the per- which guarantees completeness. For ii) we maintain back- formance decreases significantly when moving away from propagation parameter optimization in the network, now the pre-training domain [40, 37]. We aim to make learning applied to learning the weights by which the filters are more effective for smaller sets by restricting CNNs param- summed into the effective filter set. 2610 2. Related Work 2.1. Scale•space: the deep structure of images Scale-space theory [35] provides a model for the structure of images by steadily convolving the image with filters Figure 2: Filters randomly sampled from all layers of the of increasing scale, effectively reducing the resolution in GoogLenet model [33], from left to right layer number in- each scale step. While details of the image will slowly dis- creases. Without being forced to do so, the model exhibits appear, the order by which they do so will uniquely encode spatial coherence (seen as smooth functions almost every- the deep structure of the image [11]. Gaussian filters have where) after being trained on ILSVRC2012. This behaviour the advantage in that they do not introduce any artifacts [18] reflects the spatial coherence of the input feature maps even in the image while Gaussian derivative filters form a com- in the highest layers. plete and stable basis to decompose locally any realistic image. The set of responses to the derivative filters describing one patch is called the N-jet [5]. Similarly motivated, the Scattering Transform [2, 20, In the same vein, CNNs can be perceived to also model 30], a special type of CNN, uses a complete set of wavelet the deep structure of images, this time in a non-linear fash- filters ordered in a cascade. However, different from a clas- ion. The pooling layers in a CNN effectively reduce reso- sical CNN, the filters parameters are not learned by back- lution of input feature maps. Viewed from the top of the propagation but rather they are fixed from the start and the network down, the spatial extent of a convolution kernel is whole network structure is motivated by signal processing increased in each layer by a factor 2, where a 5x5 kernel at principles. In the Scattering Network the choice of local the higher layer measures 10x10 pixels on the layer below. and global invariances are tailored to the type of images The deep structure in a CNN models the image on several specifically. In the Scattering Transform invariance to group discrete levels of resolution simultaneously, precisely in line actions beyond local translation and deformation requires with Scale-space theory. explicit design [20] with the regards to the variability en- Where CNNs typically reduce resolution by max pooling countered in the target domain such as translation [2], rota- in a non-linear fashion, Scale-space offers a linear theory tion [30] or scale. As a consequence, when the desired in- for continuous reduction of resolution. Scale-space theory variance groups are known a priori, Scattering delivers very treats an image as a function of the mathematical apparatus effective networks. to reveal the local image structure. In this paper, we exploit Our paper takes the best of two worlds. On the one hand, the descriptive power of Scale-space theory to decompose we adopt the Scattering principle of using fixed filter bases the image locally on a fixed filter basis of multiple scales. as a function prior in the network. But on the other hand, we maintain from plain CNNs the capacity to learn arbitrary ef- 2.2. CNNs and their parameters fective filter combinations to form complex invariances and equivariances. CNNs [15] have large numbers of parameters to Our main contributions are: learn [13]. This is their strength as they can solve ex- tremely complicated problems [13, 34]. At the same time, their number of unrestricted parameters is a limiting factor Deriving the structured receptive field network in terms of the large amounts of data needed to train. To • (RFNN) from first principles by formulating filter prevent overfitting, which is an issue even when training on learning as a linear decomposition onto a filter ba- large datasets like the million images of the ILSVRC2012 sis, unifying CNNs and multiscale image analysis in challenge [29], usually regularization is imposed with meth- a learnable model. ods like dropout [32] and weight decay [22]. Regulariza- tion is essential to achieving good performance. In cases Combining the strengths of Scattering and CNNs. We • where limited training data are available, CNN training do well on both domains: i) small datasets where Scat- quickly overfits regardless and the learned representations tering is best but CNNs are weak; ii) complex datasets do not generalize well. Transfer learning from models pre- where CNNs excel but Scattering is weak. trained in similar domains to the new domain is necessary to achieve competitive results [23]. One thing pre-training State-of-the-art classification results on a small dataset on large datasets provides is knowledge about properties in- • where pre-training is infeasible. The task is herent to all natural images, such as spatial coherence and Alzheimer’s disease classification on two widely used robustness to uninformative variability. In this paper, we brain MRI datasets. We outperform all published re- aim to design these properties into CNNs to improve gener- sults on the ADNI dataset. alization when limited training data are available. 2611 2.3. The Scattering representation 3x3 convolution layers. 5x5 convolutions and 2 subsequent 3x3 convolutions have the same effective receptive field size To reduce model complexity we draw inspiration from while each receptive field has 18 instead of 25 trainable pa- the elegant convolutional Scattering Network [2, 20, 30]. rameters. This regularization enables learning larger mod- Scattering uses a multi-layer cascade of a pre-defined els that are less prone to overfitting. In this paper, we follow wavelet filter bank with nonlinearity and pooling operators.

Load more