Adaptive pooling for convolutional neural networks

Moritz Wolter
Fraunhofer Center for Machine Learning and Fraunhofer SCAI
Institute for Computer Science, University of Bonn
[email protected]

Jochen Garcke
Institute for Numerical Simulation and Fraunhofer SCAI, University of Bonn
Fraunhofer Center for Machine Learning
[email protected]

Abstract

Convolutional neural networks (CNNs) have become the go-to choice for most image and video processing tasks. Most CNN architectures rely on pooling layers to reduce the resolution along spatial dimensions. The reduction allows subsequent deep convolution layers to operate with greater efficiency. This paper introduces adaptive wavelet pooling layers, which employ fast wavelet transforms (FWT) to reduce the feature resolution. The FWT decomposes the input features into multiple scales, reducing the feature dimensions by removing the fine-scale subbands. Our approach adds extra flexibility through wavelet-basis function optimization and coefficient weighting at different scales. The adaptive wavelet layers integrate directly into well-known CNNs like the LeNet, Alexnet, or Densenet architectures. Using these networks, we validate our approach and find competitive performance on the MNIST, CIFAR-10, and SVHN (street view house numbers) data sets.

1 Introduction

The machine learning community has largely turned to convolutional networks (CNNs) for image [He et al., 2016], audio [Nagrani et al., 2019], and video processing [Carreira and Zisserman, 2017] tasks. Within CNNs, pooling layers boost computational efficiency and introduce translation invariance. Pooling operations replace the internal network representation with a summary of the features at that point [Goodfellow et al., 2016]. Pooling operations are important yet imperfect because they often rely only on simple max or mean operations, or their mixture, in a relatively small neighborhood. Adaptive pooling attempts to improve classic pooling approaches by introducing learned parameters within the pooling layer. [Tsai et al., 2015] found adaptive pooling to be beneficial on image segmentation tasks, while [McFee et al., 2018] made similar observations for audio processing.

Recently more sophisticated pooling strategies have been introduced. These approaches utilize basis representations in the frequency domain [Rippel et al., 2015], or in time (spatial) and frequency [Williams and Li, 2018]; a forward transform handles the conversion. Pooling is implemented by truncating those components a-priori deemed least important in that basis representation and then transferring the features back into the original space. Note that pooling by truncation in Fourier- or wavelet-basis representations can be considered a form of regularization by projection [Natterer, 1977, Engl et al., 1996]. [Zeiler and Fergus, 2013] presented stochastic pooling as an efficient regularizer.

Previous wavelet-based [Bruna and Mallat, 2013] layer and pooling architectures mostly utilized static hand-crafted wavelets. Optimizable wavelet basis representations have been designed previously for network compression [Wolter et al., 2020] and graph processing networks [Rustamov and Guibas, 2013]. From a pooling point of view, wavelets are an approach that can more accurately represent the feature contents with fewer artifacts than nearest-neighbor interpolation methods such as max- or mean-pooling [Williams and Li, 2018].

To the best of our knowledge, we are the first to propose wavelet (time and frequency) based adaptive pooling.
In this paper, we make the following contributions:

• We introduce adaptive- and scaled-wavelet pooling as an alternative to spectral- [Rippel et al., 2015] and static-wavelet pooling [Williams and Li, 2018].

• We propose an improved cost function for wavelet optimization based on the alias cancellation and perfect reconstruction conditions.

• We show that adaptive and scaled wavelet pooling performs competitively in convolutional machine learning architectures on the MNIST, CIFAR-10, and SVHN data sets.

To aid with reproducing this work, source code for all models and our fast wavelet transformation implementation is available at https://github.com/Fraunhofer-SCAI/wavelet_pooling .

2 Related work

Alternating convolutional and pooling layers followed by one or multiple fully connected layers are the building blocks of most modern neural networks used for object recognition. It is therefore not surprising that the machine learning literature has long been studying pooling operations. Early investigations explored max-pooling and non-linear subsampling [Scherer et al., 2010]. More recent work proposed subsampling by exclusively using strides in convolutions [Springenberg et al., 2015] and pooling for graph convolutional neural networks [Porrello et al., 2019]. Next, we highlight the regularizing, adaptive, and Fourier- or wavelet-based pooling approaches, which are of particular relevance for this paper.

2.1 Pooling and regularization

[Zeiler and Fergus, 2013] introduced stochastic pooling. The stochastic approach randomly picks the activation in each pooling neighborhood. Like dropout, these layers act as a regularizer. A similar idea has appeared in [Malinowski and Fritz, 2013], which explored random and learned choices of the pooling region.

2.2 Adaptive pooling

Learned or adaptive pooling layers seek to improve performance by adding extra flexibility. For semantic segmentation, adaptive region pooling [Tsai et al., 2015] has been proposed. Working on a weakly labeled sound event detection task, [McFee et al., 2018] propose an adaptive pooling operator. Their approach interpolates between min-, max-, and average-pooling features. [Gulcehre et al., 2014] finds that learned-norm pooling can be seen as a generalization of average, root mean square, and max-pooling; [Liu et al., 2017] found this approach helpful for video processing tasks. [Gopinath et al., 2019] devised an adaptive pooling approach for graph convolutional neural networks and found improved performance on brain surface analysis tasks.

2.3 Fourier and wavelet domain pooling

[Rippel et al., 2015] proposes to learn filter weights in the frequency domain and uses the Fast Fourier Transform for dimensionality reduction by low-pass filtering the frequency domain coefficients. Alternatively, [Williams and Li, 2018] found the separable fast wavelet transform (FWT) useful for feature compression. The FWT obtains a multiscale analysis recursively. The approach computes a two-scale analysis wavelet decomposition and discards the first, fine-scale resolution level. The synthesis transform only uses the second, coarse-scale level coefficients to construct the reduced representation. [Williams and Li, 2018] proposes to use a fixed Haar wavelet basis; we consequently refer to this approach as wavelet pooling.

The papers closest to ours are [Williams and Li, 2018] and [Wolter et al., 2020]. [Wolter et al., 2020] proposes to use flexible wavelets for network compression and formulates a cost function to optimize the wavelets. We build on both approaches for feature pooling, add coefficient weights, and devise an improved broader loss function.

3 Methods

Our pooling approach relies on the multiscale representation we obtain from the fast wavelet transform (FWT). The recurrent evaluation of the FWT produces new scale coefficients each time it runs. We refer to the number of runs as levels. This section discusses the FWT and the properties our wavelet filters must have, how to turn these into a cost function, and the rescaling of wavelet coefficients.

3.1 The fast wavelet transform

The fast wavelet transform constitutes a change of representation and a form of multiscale analysis. It expresses the input data points in terms of a wavelet filter pair by computing [Strang and Nguyen, 1996]:

b = Ax,    (1)

Figure 1: Structure of the Haar analysis fast wavelet transformation matrix on the left, followed by the structures of the individual scale processing matrices. The full analysis matrix (left) is the product of multiple (here three) matrices. Each describes the FWT operations at the current level. Two convolution matrices H0 and H1 are clearly visible in the three matrices on the left [Wolter, 2021].

The analysis matrix A is the product of multiple matrices, each decomposing the input signal x at an individual scale. Finally, multiplication of the total matrix A with the input data yields the wavelet coefficients b.

We show the structure of the individual matrices in figure 1. We observe a growing identity block matrix I as the FWT moves through the different scales. The filter matrix blocks move in steps of two. Each FWT step cuts the input length in half. The identity submatrices appear where the results from the previous steps have been stored. The reoccurring diagonals denote convolution operations with the analysis filter pair h0 and h1. Given the wavelet filter degree d, each filter has N = 2d coefficients. The filters are arranged in vectors in R^N. The filter pair appears in the convolution matrices H0 and H1. Overall we observe the pattern [Strang and Nguyen, 1996],

$$A = \cdots \begin{pmatrix} H_0 & \\ H_1 & \\ & I \end{pmatrix} \begin{pmatrix} H_0 \\ H_1 \end{pmatrix}. \qquad (2)$$

The first two FWT matrices are shown. The dots act as a placeholder for additional analysis matrices, all of which factor into the final analysis matrix.

We have seen how to construct the analysis matrix A and will now invert this process. Overall the inverse FWT (IFWT) can again be thought of as a linear operation,

Sb = x.    (3)

The synthesis matrix S is constructed using the synthesis filter pair f0, f1, where structurally the synthesis matrices are transposed in comparison to their analysis counterparts. We show the multi-scale reconstruction matrices in figure 2. Given the transposed convolution matrices F0 and F1, S is constructed by [Strang and Nguyen, 1996]

$$S = \begin{pmatrix} F_0 & F_1 \end{pmatrix} \begin{pmatrix} F_0 & F_1 & \\ & & I \end{pmatrix} \cdots. \qquad (4)$$

In order to guarantee invertibility we must have SA = I. To enforce it, conditions on the filter pairs h0, h1 as well as f0, f1, which make up the convolution and transposed convolution matrices, are required. In summary, computation of the fast wavelet transform relies on convolution pairs, which recursively build on each other.
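To make the matrix picture concrete, the following minimal sketch (ours, not the released implementation) builds the two-scale Haar analysis matrix of equation (2) for a length-eight signal and checks the invertibility requirement SA = I; because the Haar filters are orthogonal, the synthesis matrix is simply the transpose in this special case.

# Minimal sketch, assuming orthogonal Haar filters; not the authors' implementation.
import torch

def haar_analysis_matrix(n):
    """Single-scale analysis matrix with h0 = [1, 1]/sqrt(2) and h1 = [1, -1]/sqrt(2)."""
    h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
    h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
    H = torch.zeros(n, n)
    for row in range(n // 2):
        H[row, 2 * row:2 * row + 2] = h0           # convolution matrix H0, stride two
        H[n // 2 + row, 2 * row:2 * row + 2] = h1  # convolution matrix H1, stride two
    return H

n = 8
A1 = haar_analysis_matrix(n)                                              # first scale
A2 = torch.block_diag(haar_analysis_matrix(n // 2), torch.eye(n // 2))    # second scale plus identity block
A = A2 @ A1                                                               # full analysis matrix, equation (2)
S = A.T                                                                   # orthogonal Haar filters: S = A^T
x = torch.arange(float(n))
b = A @ x                                                                 # wavelet coefficients, equation (1)
assert torch.allclose(S @ b, x, atol=1e-6)                                # SA = I, cf. equations (3) and (4)

For longer, non-orthogonal filters the synthesis matrix has to be assembled from the separate synthesis pair f0, f1 as in equation (4); the transpose shortcut only holds for orthogonal wavelets such as Haar.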
3.2 The two-dimensional wavelet transform

The two-dimensional wavelet transform is based on the same principles as the one-dimensional case. Separable and non-separable two-dimensional wavelet transforms exist [Jensen and la Cour-Harbo, 2001]. Separable two-dimensional transforms work with two one-dimensional transforms. Non-separable transforms use a single two-dimensional FWT. Separable transforms have previously been found to problematically single out axis-aligned and diagonal structures [Jensen and la Cour-Harbo, 2001]. We, therefore, choose to work with non-separable two-dimensional wavelet transforms. We now have to handle the extra dimension and require two-dimensional filter quadruplets instead of pairs.

Given the one-dimensional filter pairs f0, f1 and h0, h1, their two-dimensional counterparts are computed using outer products [Vyas et al., 2018]. An outer product of two column vectors a and b is defined as ab^T. The required four filters are computed using the outer products of all four combinations,

f_ll = f0 f0^T,   f_lh = f0 f1^T,    (5)
f_hl = f1 f0^T,   f_hh = f1 f1^T.    (6)

This filter set {f_ll, f_lh, f_hl, f_hh} ∋ f_k is used in all forward convolutions at every level.
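The outer-product construction of equations (5)-(6) maps directly onto a strided 2D convolution. The sketch below assumes Haar filters and uses torch.nn.functional.conv2d; the helper name analysis_filter_bank is ours and not taken from the released code.

# A sketch of equations (5)-(7) with Haar filters; names are ours.
import torch
import torch.nn.functional as F

def analysis_filter_bank(f0, f1):
    """Stack the four 2D filters f_ll, f_lh, f_hl, f_hh built from outer products."""
    filters = [torch.outer(a, b) for a in (f0, f1) for b in (f0, f1)]
    return torch.stack(filters).unsqueeze(1)      # shape [4, 1, N, N]

f0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
f1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
filt = analysis_filter_bank(f0, f1)

x = torch.randn(1, 1, 32, 32)                     # one single-channel feature map
coeffs = F.conv2d(x, filt, stride=2)              # equation (7): ll, lh, hl, hh subbands
ll, lh, hl, hh = coeffs.unbind(dim=1)             # each subband has spatial size 16x16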


Figure 2: Structures of the Haar synthesis fast wavelet transformation matrices for three scales (right) as well as the complete inverse matrix (left). We observe that the synthesis matrix is the product of stacked transposed convolution matrices F0 and F1. In comparison to the forward transform, the matrices are reversed [Wolter, 2021].

In the non-separated case, two-dimensional convolution, denoted by ∗, is used to compute the wavelet coefficients. Denoting the filter with k and using i for the scale, convolution with each of the four filters

x_i ∗ f_k = k_i    (7)

yields the four subbands with k ∈ [ll, lh, hl, hh]. The last three coefficients are stored, and ll_i is used to compute the next scale. In other words, the input signal x_i must be ll_{i−1} after the first level, where i > 1.

In the inverse or synthesis case, the two-dimensional filters are once more constructed using outer products. Given the four subbands at the higher level, ll_{i−1} is reconstructed using a transposed two-dimensional convolution. The reconstruction completes the three stored subband tensors on the level below, which we keep during the forward pass. While inversely traversing the list of sub-tensors stored during the analysis transform, transposed convolutions successively reconstruct the original input, combining the newly computed ll_{i−1} with the stored submatrices at every step.
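A hedged sketch of the recursive analysis and the inverse traversal described above, for a single-channel input (feature maps with many channels would use grouped convolutions); this is our illustration, not the reference implementation.

# Our sketch of the recursive 2D FWT with Haar filters; single-channel input only.
import torch
import torch.nn.functional as F

def wavelet_analysis(x, filt, levels):
    """Return [[lh_0, hl_0, hh_0], ..., [ll_I, lh_I, hl_I, hh_I]] as in section 3.4."""
    coeff_list = []
    ll = x
    for _ in range(levels):
        ll, lh, hl, hh = F.conv2d(ll, filt, stride=2).split(1, dim=1)
        coeff_list.append([lh, hl, hh])
    coeff_list[-1].insert(0, ll)
    return coeff_list

def wavelet_synthesis(coeff_list, filt):
    """Inverse traversal with transposed convolutions."""
    ll = coeff_list[-1][0]
    for scale in reversed(coeff_list):
        detail = scale[-3:]                        # lh, hl, hh stored during analysis
        stacked = torch.cat([ll] + detail, dim=1)
        ll = F.conv_transpose2d(stacked, filt, stride=2)
    return ll

h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
filt = torch.stack([torch.outer(a, b) for a in (h0, h1) for b in (h0, h1)]).unsqueeze(1)

x = torch.randn(1, 1, 32, 32)
coeffs = wavelet_analysis(x, filt, levels=2)
assert torch.allclose(wavelet_synthesis(coeffs, filt), x, atol=1e-5)

Because the orthonormal Haar filters make the strided convolution an orthogonal map, the transposed convolution with the same filters serves as the synthesis transform in this sketch; learnable filter pairs would use separate synthesis filters, as discussed in section 3.5.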

3.3 Wavelet pooling

The fast wavelet transform turns into a pooling method when the first-level, fine-scale coefficients are discarded [Williams and Li, 2018]. The full analysis two-dimensional transform is computed for two or more scales. The synthesis transform, however, is computed without using the subbands at the first scale, which cuts the resolution in half [Williams and Li, 2018]. We are comparing pooling based on non-separable and separable transforms [Williams and Li, 2018]. Separable transforms treat the rows first and afterwards, independently, the columns. Non-separable two-dimensional wavelet transforms use a single 2D convolution, see equations (5)-(7). Without re-weighting, the inverse transform recombines higher-level sub-tensors into the original input. In Section 3.4 we introduce sub-tensor weights, which make multiscale wavelet pooling meaningful.

3.4 Scaled wavelet pooling

To further benefit from multiscale analysis transforms, we introduce scaling weights for the pooling operation. Our weights rescale the wavelet coefficient subbands at higher levels. The idea is to allow our scaled wavelet pooling layers to in- or decrease the contribution of a particular scale to the pooled result.

The forward two-dimensional wavelet transform produces the coefficient quadruples ll_i, lh_i, hl_i, hh_i at each scale i. It suffices to store the lh, hl, hh submatrices at each level to fully reconstruct the signal, since the low-low ll_i coefficients serve as input to compute the next quadruple at i + 1. We introduce three scale coefficient parameters w_{lh,i}, w_{hl,i}, w_{hh,i} ∈ R, all > 0, at all but the last scale, where we additionally use a single w_ll > 0. The learned parameters are multiplied with the submatrices. For example, in the final low-low case, the inverse transform would work with w_ll · ll. With I indicating the total number of scales and dots denoting intermediate subbands, overall the weighted list,

[[w_{lh,0} · lh_0, w_{hl,0} · hl_0, w_{hh,0} · hh_0], . . . ,    (8)
 [w_{ll,I} · ll_I, w_{lh,I} · lh_I, w_{hl,I} · hl_I, w_{hh,I} · hh_I]]    (9)

is fed into the inverse FWT to compute the pooled features. Our approach gives the pooling operation the opportunity to rescale features at various scales according to their importance.
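Putting sections 3.3 and 3.4 together, a pooling layer can drop the finest-scale subbands and rescale the remaining ones with learned weights. The sketch below reuses the wavelet_analysis and wavelet_synthesis helpers from the previous listing; the class and parameter names are ours, positivity of the weights is not enforced, and in this reading the weights are only applied to the scales that are kept.

# Our sketch of a scaled wavelet pooling layer; single-channel features, Haar filters.
import torch

class ScaledWaveletPool2d(torch.nn.Module):
    """Halve the resolution by dropping the finest-scale subbands (section 3.3)
    and rescaling the remaining subbands with learned weights (section 3.4)."""
    def __init__(self, filt, levels=2):
        super().__init__()
        self.register_buffer("filt", filt)                 # fixed Haar filters in this sketch
        self.levels = levels
        # one weight per kept subband, initialised to one as in the paper
        self.detail_weights = torch.nn.Parameter(torch.ones(levels - 1, 3))
        self.ll_weight = torch.nn.Parameter(torch.ones(1))

    def forward(self, x):
        coeffs = wavelet_analysis(x, self.filt, self.levels)
        kept = coeffs[1:]                                  # discard the finest scale -> pooling
        for i, scale in enumerate(kept):
            scale[-3:] = [w * c for w, c in zip(self.detail_weights[i], scale[-3:])]
        kept[-1][0] = self.ll_weight * kept[-1][0]
        return wavelet_synthesis(kept, self.filt)          # spatial size is halved

h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
filt = torch.stack([torch.outer(a, b) for a in (h0, h1) for b in (h0, h1)]).unsqueeze(1)
pool = ScaledWaveletPool2d(filt, levels=2)
print(pool(torch.randn(1, 1, 32, 32)).shape)               # torch.Size([1, 1, 16, 16])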

3.5 Wavelet properties

The chosen wavelet basis representation impacts the pooling. We now aim to optimize the wavelet basis. We observe that for invertibility, the synthesis transform must undo the analysis transform. This requirement does not allow us to work with any filter. The so-called alias cancellation and perfect reconstruction conditions must hold [Strang and Nguyen, 1996]. Filters that satisfy these conditions turn into wavelets. The analysis filter pair is called h0, h1, while one refers to the synthesis pair as f0, f1. Both conditions work on their z-transformed counterparts. For a complex number z ∈ C, the transformed filters H_p(z) and F_p(z), p ∈ {0, 1}, are defined as

$$H_p(z) = \sum_{n \in \mathbb{Z}} h_p(n) z^{-n}, \qquad F_p(z) = \sum_{n \in \mathbb{Z}} f_p(n) z^{-n}, \qquad (10)$$

for both pairs. In the z-domain the alias cancellation condition is given by [Strang and Nguyen, 1996]

H_0(−z)F_0(z) + H_1(−z)F_1(z) = 0.    (11)

In filter design the alias cancellation condition is often solved by the alternating signs solution F_0(z) = H_1(−z) and F_1(z) = −H_0(−z) [Strang and Nguyen, 1996]. For the perfect reconstruction condition to hold we must have [Strang and Nguyen, 1996],

H_0(z)F_0(z) + H_1(z)F_1(z) = 2z^c.    (12)

The equation requires the filter product-sum at the center z^c to be equal to two, and zero everywhere else. The power at the center is denoted as c.

3.6 The adaptive wavelet loss

It turns out that we do not have to compute the z-transform in order to evaluate both conditions. The coefficient representations can be used, by exploiting the equivalence of polynomial multiplication and convolution [Wolter et al., 2020]. Instead of measuring the deviation from the alternating signs solution, we choose to work with convolutions for both our alias cancellation and perfect reconstruction losses. Based on (11) we now define for alias cancellation

$$\mathcal{L}_{ac}(\theta_w) = \sum_{k=0}^{N-1} \Big[ \big([h_0 \cdot (-1)^k] * f_0\big)_k + \big([h_1 \cdot (-1)^k] * f_1\big)_k \Big]^2. \qquad (13)$$

For the perfect reconstruction loss we follow [Wolter et al., 2020] and use, based on (12),

$$\mathcal{L}_{pr}(\theta_w) = \sum_{k=0}^{N-1} \Big[ (h_0 * f_0)_k + (h_1 * f_1)_k - 2 \cdot e_{\lfloor N/2 \rfloor, k} \Big]^2, \qquad (14)$$

where e_{⌊N/2⌋} is the ⌊N/2⌋-th unit vector and k the filter position as counted from left to right. Summing up both expressions leads us to the overall wavelet loss

L(θ_w) = L_ac(θ_w) + L_pr(θ_w).    (15)

To allow joint optimization of the network and wavelet weights, this pooling loss function must be added to the task-dependent loss function that we want the optimizer to improve.
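The convolutional losses of equations (13)-(15) can be evaluated directly on the filter coefficients. The sketch below follows our reading of those equations (in particular, the placement of the target unit vector at the centre of the full convolution) and is not the released code.

# Our sketch of the wavelet loss of equations (13)-(15).
import torch
import torch.nn.functional as F

def full_conv(a, b):
    """Full 1D convolution, i.e. the coefficient sequence of the polynomial product."""
    return F.conv1d(a.view(1, 1, -1), b.flip(0).view(1, 1, -1), padding=b.numel() - 1).view(-1)

def wavelet_loss(h0, h1, f0, f1):
    signs = torch.tensor([(-1.0) ** k for k in range(h0.numel())])
    # equation (13): coefficients of H0(-z)F0(z) + H1(-z)F1(z) should vanish
    ac = full_conv(h0 * signs, f0) + full_conv(h1 * signs, f1)
    loss_ac = (ac ** 2).sum()
    # equation (14): coefficients of H0(z)F0(z) + H1(z)F1(z) should equal two at the centre
    pr = full_conv(h0, f0) + full_conv(h1, f1)
    target = torch.zeros_like(pr)
    target[pr.numel() // 2] = 2.0
    loss_pr = ((pr - target) ** 2).sum()
    return loss_ac + loss_pr                       # equation (15)

# The Haar quadruple with time-reversed synthesis filters yields (numerically) zero loss:
h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
print(wavelet_loss(h0, h1, f0=h0.flip(0), f1=h1.flip(0)))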
4 Experiments

All experiments use Pytorch [Paszke et al., 2017] and run on Nvidia Titan Xp cards with 12GB memory.

Starting with a proof of concept experiment, we compute a forward 2d-FWT, set the first subband coefficients to zero, and compute the synthesis FWT. The resulting image appears in figure 3 on the left. Using the cost function defined in equation 15, we optimize zero-padded Haar wavelet filters to minimize the information loss resulting from not having the first level coefficients. The second image from the left in figure 3 shows the resulting reconstruction. Right next to it, figure 3 depicts the mean channel difference of both reconstructions. We observe that the new reconstruction is improved, especially at the edges, and shows fewer artifacts. Finally, the mean squared error versus the optimization steps appears on the very right of figure 3. We conclude that the filter optimization approach could help to improve wavelet pooling layers in principle.

Haar-wavelet initialization tends to get stuck at the Haar wavelet. In subsequent experiments, we choose to work with randomly initialized filter coefficients. The randomly initialized wavelet weights are pre-trained to obtain sufficiently wavelet-like coefficients. These, in turn, ensure stable FWT and IFWT computations.

We repeat the proof of concept experiment with randomly initialized filters and show the converged filter in figure 4. The solution displays a pattern that solves the anti-aliasing condition. Some possible solutions require F_0(z) = H_1(−z) and F_1(z) = −H_0(−z). Substituting (−z) produces a minus sign at odd powers in the coefficient polynomial. Multiplication with (−1) shifts the pattern to even powers. Whenever F_0 and H_1 share the same sign, F_1 and H_0 do not, and the other way around. Other patterns appear too. We picked an alternating sign example because it is well known, and other solutions to the convolutions in equations (13) and (14) are harder to recognize.

In the next sections, we will evaluate the different wavelet pooling approaches compared to max- and mean-pooling on several data sets. Here, working with standard CNN architectures, we compare the methods by replacing the local pooling layers. We work with first-degree wavelets. The wavelets in our CNN experiments, therefore, use two coefficients in all four filter banks. Each wavelet-pooling layer has two scales and its own weights.

Figure 3: Wavelet optimization proof of concept. The first level coefficients are set to zero. Using the remaining coefficients the input image is reconstructed. We show the original Haar-wavelet and optimized Haar-wavelet reconstruction. To the right, we show the difference between the two reconstructions, divided by its largest absolute value for better visibility. On the very right, the mean squared error is plotted during optimization. We observe that our approach reduces the reconstruction error.
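For illustration, a hedged sketch of such a proof-of-concept optimization follows. It assumes the wavelet_analysis, wavelet_synthesis, and wavelet_loss helpers sketched in section 3, uses a random image as a stand-in for the test image, ties the synthesis transform to the transposed convolution with time-reversed filters (the paper keeps f0, f1 as separate filters), and uses Adam rather than the authors' exact optimizer settings.

# Our proof-of-concept sketch, not the authors' code; relies on earlier helpers.
import torch

image = torch.rand(1, 1, 128, 128)                       # stand-in for the test image
h0 = torch.nn.Parameter(torch.tensor([1.0, 1.0]) / 2.0 ** 0.5)   # Haar initialization
h1 = torch.nn.Parameter(torch.tensor([1.0, -1.0]) / 2.0 ** 0.5)
optimizer = torch.optim.Adam([h0, h1], lr=1e-3)

for step in range(1500):
    optimizer.zero_grad()
    filt = torch.stack([torch.outer(a, b) for a in (h0, h1) for b in (h0, h1)]).unsqueeze(1)
    coeffs = wavelet_analysis(image, filt, levels=2)
    coeffs[0] = [torch.zeros_like(c) for c in coeffs[0]]  # zero the first-scale subbands
    recon = wavelet_synthesis(coeffs, filt)
    mse = torch.mean((recon - image) ** 2)
    loss = mse + wavelet_loss(h0, h1, h0.flip(0), h1.flip(0))   # MSE plus equation (15)
    loss.backward()
    optimizer.step()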

Table 1: Pooling method comparison on the MNIST data set. Our evaluation uses a LeNet5-like network. [Williams and Li, 2018] used a network based on Zeiler's network and what we refer to as wavelet pooling. On MNIST adaptive wavelet pooling compares favorably to classic pooling methods, while scaled wavelet pooling performs competitively. For our experiments mean and standard deviation over ten runs are shown.

Experiment                          Accuracy [%]
adaptive wavelet                    99.173 ± 0.037
scaled wavelet                      99.075 ± 0.112
wavelet                             99.073 ± 0.103
max                                 99.055 ± 0.140
adaptive max                        98.809 ± 0.065
mean                                99.059 ± 0.122
adaptive mean                       99.064 ± 0.038
separable wavelet                   99.071 ± 0.0967
wavelet [Williams and Li, 2018]     99.01

Figure 4: Converged wavelet using the sum of our loss formulation and mean squared error on the proof of concept experiment shown in figure 3, using a random initialization. H0, H1 denote the analysis and F0, F1 the synthesis filter pairs.

Wavelet and scaled wavelet layers use Haar wavelets. The scales are initially set to one. Adaptive wavelet layers work with random filter initializations. Initial values for the adaptive filters are drawn from the uniform distribution U[−1, 1]. To ensure working FWT and IFWT transforms, we pre-train the adaptive filters for 250 steps to ensure that our forward and backward fast wavelet transform produce useful values from the start.

4.1 MNIST

The MNIST data set [LeCun et al., 1998] consists of sixty thousand train and ten thousand test images. All images have 28x28 pixels and belong to ten classes, one for each digit from zero to nine. We solve this classification problem using a LeNet-like network with two convolution, two pooling, and three fully connected layers. We choose to optimize using stochastic gradient descent with a learning rate of 0.12, a momentum value of 0.6, and decay the learning rate by multiplication with 0.95 after every weight update. Training stops after 25 epochs.
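A hedged sketch of this optimization setup follows; the network below is a generic LeNet-like stand-in with max pooling (the paper swaps such pooling layers for the wavelet variants), and the dummy batch merely keeps the snippet self-contained.

# Our sketch of the MNIST setup, not the authors' exact architecture or training script.
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),    # pooling layers get swapped per experiment
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(), nn.Linear(84, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.12, momentum=0.6)
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lambda _: 0.95)

images, labels = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))   # dummy batch
for epoch in range(25):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    # for adaptive wavelet pooling, the wavelet loss of equation (15) is added here (not shown)
    loss.backward()
    optimizer.step()
    scheduler.step()           # multiply the learning rate by 0.95 after every weight update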

Table 1 shows experimental results on MNIST. We observe the highest mean accuracy, with a small standard deviation, for the adaptive wavelet approach, followed by scaled wavelet, mean, and max pooling. In these experiments, using adaptive mean or max pooling did not help. To evaluate the impact of choosing a separable or non-separable wavelet transform, we implemented and explored the effect of a separable transform in the last row of Table 1, using a LeNet5-like architecture as we do in all other MNIST experiments. [Jensen and la Cour-Harbo, 2001] discourage this approach, but it does not reduce accuracy on MNIST.

Measurements of three wavelet loss formulations during training are shown in figure 5. The perfect reconstruction loss formulation defined in equation 14 is shown in green. Two different alias cancellation loss formulations are compared. The blue line shows the alias cancellation loss from equation 13. The yellow plot depicts a measure of the mean squared deviation from F_0(z) = H_1(−z) and F_1(z) = −H_0(−z),

$$\mathcal{L}_{ac}(\theta_w) = \sum_{k=0}^{N} \Big[ \big(f_{0,k} - (-1)^k h_{1,k}\big)^2 + \big(f_{1,k} + (-1)^k h_{0,k}\big)^2 \Big], \qquad (16)$$

as proposed by [Wolter et al., 2020]. While this loss formulation leads to alias cancelling wavelet filters if optimized, we find it to be unnecessarily restrictive. Figure 5 illustrates this point. The loss shown by the blue line converges to a minimum which the formulation from equation 16 would not have allowed.

Figure 5: Logarithmically scaled wavelet loss during training in the second MNIST pooling layer. The green line shows the perfect reconstruction loss as defined in equation 14. The blue line shows the alias cancellation loss from equation 13. Both plots overlap most of the time. The yellow plot depicts the alias cancellation loss function proposed in [Wolter et al., 2020]. We observe that our wavelet has converged to a working solution that the previous formulation does not allow.

Figure 6 depicts the evolution of the wavelet coefficient submatrix weights during training. Since all weights were initialized to one, we see that the optimizer moves these factors significantly as it adjusts the contribution of the individual subbands. Table 1 showed us that the scaling weights can lead to performance gains over fixed wavelet pooling, while basis optimization appears to be more efficient in this case.

Figure 6: Wavelet scale weights during training in the second MNIST pooling layer. Originally initialized to have a weight of one, over time the optimizer adjusts the relative weight of the subbands with respect to each other.
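For reference, the restrictive alternating-sign formulation of equation (16) compared in figure 5 can be written in a few lines; this is our sketch, not the code used to produce the figure.

# Our sketch of equation (16): squared deviation from F0(z) = H1(-z) and F1(z) = -H0(-z).
import torch

def alt_sign_loss(h0, h1, f0, f1):
    signs = torch.tensor([(-1.0) ** k for k in range(h0.numel())])
    return ((f0 - signs * h1) ** 2).sum() + ((f1 + signs * h0) ** 2).sum()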

4.2 CIFAR-10

The CIFAR-10 data set [Krizhevsky et al., 2009] consists of 60k 32 by 32 color images, of which 50k are used for training and 10k for testing. The images show airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Networks trained on this data set have to recognize these correctly.

We start our comparison of pooling methods using an Alexnet [Krizhevsky, 2014] like network, which we modify to work on CIFAR-10. While moving from experiment to experiment, we always swap all pooling layers. We train using stochastic gradient descent with a learning rate of 0.06 and 0.6 momentum. The learning rate was divided by 10 after 150 and again after 250 epochs. The training process was stopped after 300 epochs.

Results are tabulated in Table 2; again, our adaptive wavelet pooling performed competitively. Overall we find that the Alexnet backbone does much better than the Zeiler-like approach used jointly with dropout in [Williams and Li, 2018].

Table 2: Pooling method comparison on the CIFAR-10 data set using an Alexnet backbone. The last row again shows the wavelet-pooling result from a Zeiler-like network. We find our adaptive pooling approach to perform well, while the scaled approach does slightly worse yet remains competitive in this case.

Experiment                          Accuracy
adaptive wavelet                    89.79%
scaled wavelet                      88.94%
wavelet                             89.01%
max                                 89.64%
mean                                89.24%
separable wavelet                   88.72%
wavelet [Williams and Li, 2018]     80.28%

We repeat our experiments replacing the Alexnet with a densenet [Huang et al., 2017] architecture. For all experiments, we replace the pooling layers within all transition blocks and leave the final global pooling layer untouched. Optimization again used stochastic gradient descent with momentum. The learning rate was set to 0.05, again dividing it by ten after 150 and 250 epochs. Training stopped after 300 epochs. Weight decay helped in this case and was set to 10^−4. Table 3 shows the best test set performance observed during training. We find that adaptive and max-pooling work best, while scaled and wavelet pooling do better than mean pooling.

Table 3: Pooling method comparison on the CIFAR-10 data set using a densenet backbone. We find that adaptive wavelet and max pooling work best, while scaled wavelet and fixed wavelet pooling outperform mean pooling.

Experiment           Accuracy
adaptive wavelet     94.41%
scaled wavelet       94.30%
wavelet              94.22%
max                  94.37%
mean                 94.17%

4.3 SVHN

The Street View House Numbers (SVHN) Dataset [Netzer et al., 2011] has 73k color training and 26k testing images. We choose to use the CIFAR-10-like setting with 32 by 32 color images.

We again use an Alexnet [Krizhevsky, 2014] structure. Wavelet initialization and pretraining were left unchanged. For optimization, we choose stochastic gradient descent with a learning rate of 0.08 and no momentum.

Table 4: Pooling method comparison on the SVHN data set. Experiments again are based on a modified Alexnet architecture. We find our adaptive pooling method to perform on par with max pooling in this case.

Experiment                          Accuracy
adaptive wavelet                    94.87%
scaled wavelet                      94.66%
wavelet                             94.29%
max                                 94.87%
mean                                94.42%
wavelet [Williams and Li, 2018]     91.10%

The results shown in Table 4 confirm our previous observation that adaptive-wavelet and max-pooling perform best. The wavelet approach did not outperform mean pooling. This observation is in line with [Williams and Li, 2018], for their experiments with and without dropout. We note that scaled wavelet pooling outperformed mean pooling in this case and conclude that some form of learned flexibility is required within wavelet pooling layers.

5 Conclusion

We have introduced the usage of non-separable two-dimensional fast wavelet transforms for pooling layers in CNNs. Additionally, we proposed scaled and adaptive extensions. Here, an improved convolutional alias cancellation wavelet loss formulation permits wavelets which an earlier employed, restrictive alternating sign formulation did not allow. In our experiments, we have found that adaptive and scaled wavelet pooling delivers competitive results and adds extra flexibility to wavelet pooling.

Experimentally we found that the extra flexibility added through scaling or wavelet optimization leads to improved results in comparison to static wavelet pooling. In particular, we saw that fixed wavelet pooling often failed to outperform mean pooling. With the exception of our CIFAR-10 experiments using the Alexnet backbone, we saw that adding the scaling of the coefficients generally leads to better performance in comparison to mean pooling.

Wavelet optimization had a bigger impact than rescaling the subbands. We speculate that this might be a result of the small input images we chose to use for computational reasons. These small images limit the number of meaningful scales the FWT can produce.
Using more scales in future work might give the scaled wavelet pooling approach an opportunity to have more of an impact.

Our framework supports high-degree wavelets in principle, but these require additional padding to be invertible. Eventually, the extra padding impacts results. Adopting boundary filters [Strang and Nguyen, 1996] and wavelets instead of a padding-based approach may help here. We hope to investigate this in future work.

Future work could also train orthogonal wavelets. The orthogonal FWT matrices do not change the norm or the length of feature input vectors. Such orthogonality is of particular importance for the stability of recurrent neural networks, and could also help stabilize CNNs. Having orthogonal wavelet filters should make the FWT analysis matrix orthogonal, AA^T = I and A^T A = I [Strang and Nguyen, 1996]. Translating these conditions back into the filters leads to f_0[k] = h_0[−k] and f_1[k] = h_1[−k] [Jensen and la Cour-Harbo, 2001]. This condition suggests an orthogonal loss function. Such a loss could, for example, measure the mean squared deviation from orthogonality, which might be interesting in future investigations.

6 Acknowledgments

We thank Helmut Harbrecht for insightful discussions and our anonymous reviewers for their feedback. Moritz would like to thank the University of Bonn for access to the Auersberg and Teufelskapelle clusters. This work was supported by the Fraunhofer Cluster of Excellence Cognitive Internet Technologies (CCIT).

References

[Bruna and Mallat, 2013] Bruna, J. and Mallat, S. (2013). Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886.

[Carreira and Zisserman, 2017] Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308.

[Engl et al., 1996] Engl, H., Hanke, M., and Neubauer, A. (1996). Regularization of Inverse Problems. Kluwer.

[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

[Gopinath et al., 2019] Gopinath, K., Desrosiers, C., and Lombaert, H. (2019). Adaptive graph convolution pooling for brain surface analysis. In IPMI.

[Gulcehre et al., 2014] Gulcehre, C., Cho, K., Pascanu, R., and Bengio, Y. (2014). Learned-norm pooling for deep feedforward and recurrent neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 530–546. Springer.

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

[Huang et al., 2017] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708.

[Jensen and la Cour-Harbo, 2001] Jensen, A. and la Cour-Harbo, A. (2001). Ripples in mathematics: the discrete wavelet transform. Springer Science & Business Media.

[Krizhevsky, 2014] Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.

[Krizhevsky et al., 2009] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

[LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

[Liu et al., 2017] Liu, D., Zhou, Y., Sun, X., Zha, Z., and Zeng, W. (2017). Adaptive pooling in multi-instance learning for web video annotation. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 318–327.

[Malinowski and Fritz, 2013] Malinowski, M. and Fritz, M. (2013). Learnable pooling regions for image classification. International Conference on Learning Representations (workshop track).

[McFee et al., 2018] McFee, B., Salamon, J., and Bello, J. (2018). Adaptive pooling operators for weakly labeled sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26:2180–2193.

[Nagrani et al., 2019] Nagrani, A., Chung, J. S., Xie, W., and Zisserman, A. (2019). Voxceleb: Large-scale speaker verification in the wild. Computer Science and Language.

[Natterer, 1977] Natterer, F. (1977). Regularisierung schlecht gestellter Probleme durch Projektionsverfahren. Numer. Math., 28:329–341.

[Netzer et al., 2011] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

[Paszke et al., 2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In Conference on Neural Information Processing Systems Workshop Autodiff.

[Porrello et al., 2019] Porrello, A., Abati, D., Calderara, S., and Cucchiara, R. (2019). Classifying signals on irregular domains via convolutional cluster pooling. In Proceedings of Machine Learning Research, pages 1388–1397.

[Rippel et al., 2015] Rippel, O., Snoek, J., and Adams, R. P. (2015). Spectral representations for convolutional neural networks. In Advances in Neural Information Processing Systems.

[Rustamov and Guibas, 2013] Rustamov, R. and Guibas, L. J. (2013). Wavelets on graphs via deep learning. In Advances in Neural Information Processing Systems, pages 998–1006.

[Scherer et al., 2010] Scherer, D., Müller, A., and Behnke, S. (2010). Evaluation of pooling operations in convolutional architectures for object recognition. In Artificial Neural Networks – ICANN 2010, pages 92–101. Springer Berlin Heidelberg.

[Springenberg et al., 2015] Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (workshop track).

[Strang and Nguyen, 1996] Strang, G. and Nguyen, T. (1996). Wavelets and filter banks. SIAM.

[Tsai et al., 2015] Tsai, Y.-H., Hamsici, O. C., and Yang, M.-H. (2015). Adaptive region pooling for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 731–739.

[Vyas et al., 2018] Vyas, A., Yu, S., and Paik, J. (2018). Multiscale transforms with application to image processing. Springer.

[Williams and Li, 2018] Williams, T. and Li, R. (2018). Wavelet pooling for convolutional neural networks. In International Conference on Learning Representations.

[Wolter, 2021] Wolter, M. (2021). Frequency Domain Methods in Recurrent Neural Networks for Sequential Data Processing. PhD thesis, Institut für Informatik, Universität Bonn, Universitäts- und Landesbibliothek Bonn. Submitted.

[Wolter et al., 2020] Wolter, M., Lin, S., and Yao, A. (2020). Neural network compression via learnable wavelet transforms. In 29th International Conference on Artificial Neural Networks.

[Zeiler and Fergus, 2013] Zeiler, M. D. and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations 2013. arXiv preprint arXiv:1301.3557.