Adaptive pooling for convolutional neural networks

Moritz Wolter
Fraunhofer Center for Machine Learning and Fraunhofer SCAI
Institute for Computer Science, University of Bonn
[email protected]

Jochen Garcke
Institute for Numerical Simulation and Fraunhofer SCAI, University of Bonn
Fraunhofer Center for Machine Learning
[email protected]

Abstract

Convolutional neural networks (CNNs) have become the go-to choice for most image and video processing tasks. Most CNN architectures rely on pooling layers to reduce the resolution along spatial dimensions. The reduction allows subsequent deep convolution layers to operate with greater efficiency. This paper introduces adaptive wavelet pooling layers, which employ fast wavelet transforms (FWT) to reduce the feature resolution. The FWT decomposes the input features into multiple scales, reducing the feature dimensions by removing the fine-scale subbands. Our approach adds extra flexibility through wavelet-basis function optimization and coefficient weighting at different scales. The adaptive wavelet layers integrate directly into well-known CNNs like the LeNet, Alexnet, or Densenet architectures. Using these networks, we validate our approach and find competitive performance on the MNIST, CIFAR-10, and SVHN (street view house numbers) data sets.

1 Introduction

The machine learning community has largely turned to convolutional networks (CNNs) for image [He et al., 2016], audio [Nagrani et al., 2019], and video processing [Carreira and Zisserman, 2017] tasks. Within CNNs, pooling layers boost computational efficiency and introduce translation invariance. Pooling operations replace the internal network representation with a summary of the features at that point [Goodfellow et al., 2016]. Pooling operations are important yet imperfect because they often rely only on simple max or mean operations, or their mixture, in a relatively small neighborhood. Adaptive pooling attempts to improve classic pooling approaches by introducing learned parameters within the pooling layer. [Tsai et al., 2015] found adaptive pooling to be beneficial on image segmentation tasks, while [McFee et al., 2018] made similar observations for audio processing.

Recently more sophisticated pooling strategies have been introduced. These approaches utilize basis representations in the frequency domain [Rippel et al., 2015], or in time (spatial) and frequency [Williams and Li, 2018]; a forward transform handles the conversion. Pooling is implemented by truncating those components a-priori deemed least important in that basis representation and then transferring the features back into the original space. Note that pooling by truncation in Fourier- or wavelet-basis representations can be considered a form of regularization by projection [Natterer, 1977, Engl et al., 1996]. [Zeiler and Fergus, 2013] presented stochastic pooling as an efficient regularizer.

Previous wavelet-based [Bruna and Mallat, 2013] layer and pooling architectures mostly utilized static hand-crafted wavelets. Optimizable wavelet basis representations have been designed previously for network compression [Wolter et al., 2020] and graph processing networks [Rustamov and Guibas, 2013]. From a pooling point of view, wavelets are an approach that can more accurately represent the feature contents with fewer artifacts than nearest-neighbor interpolation methods such as max- or mean-pooling [Williams and Li, 2018].

To the best of our knowledge, we are the first to propose wavelet (time and frequency) based adaptive pooling.
In this paper, we make the following contributions:

• We introduce adaptive- and scaled-wavelet pooling as an alternative to spectral- [Rippel et al., 2015] and static-wavelet pooling [Williams and Li, 2018].

• We propose an improved cost function for wavelet optimization based on the alias cancellation and perfect reconstruction conditions.

• We show that adaptive and scaled wavelet pooling performs competitively in convolutional machine learning architectures on the MNIST, CIFAR-10, and SVHN data sets.

To aid with reproducing this work, source code for all models and our fast wavelet transformation implementation is available at https://github.com/Fraunhofer-SCAI/wavelet_pooling .

2 Related work

Alternating convolutional and pooling layers followed by one or multiple fully connected layers are the building blocks of most modern neural networks used for object recognition. It is therefore not surprising that the machine learning literature has long been studying pooling operations. Early investigations explored max-pooling and non-linear subsampling [Scherer et al., 2010]. More recent work proposed subsampling by exclusively using strides in convolutions [Springenberg et al., 2015] and pooling for graph convolutional neural networks [Porrello et al., 2019]. Next, we highlight the regularizing, adaptive, and Fourier- or wavelet-based pooling approaches, which are of particular relevance for this paper.

2.1 Pooling and regularization

[Zeiler and Fergus, 2013] introduced stochastic pooling. The stochastic approach randomly picks the activation in each pooling neighborhood. Like dropout, these layers act as a regularizer. A similar idea has appeared in [Malinowski and Fritz, 2013], which explored random and learned choices of the pooling region.

2.2 Adaptive pooling

Learned or adaptive pooling layers seek to improve performance by adding extra flexibility. For semantic segmentation, adaptive region pooling [Tsai et al., 2015] has been proposed. Working on a weakly labeled sound event detection task, [McFee et al., 2018] propose an adaptive pooling operator. Their approach interpolates between min-, max-, and average-pooling features. [Gulcehre et al., 2014] finds that learned-norm pooling can be seen as a generalization of average, root mean square, and max-pooling; [Liu et al., 2017] found this approach helpful for video processing tasks. [Gopinath et al., 2019] devised an adaptive pooling approach for graph convolutional neural networks and found improved performance on brain surface analysis tasks.

2.3 Fourier and wavelet domain pooling

[Rippel et al., 2015] proposes to learn filter weights in the frequency domain and uses the Fast Fourier Transform for dimensionality reduction by low-pass filtering the frequency domain coefficients. Alternatively, [Williams and Li, 2018] found the separable fast wavelet transform (FWT) useful for feature compression. The FWT obtains a multiscale analysis recursively. The approach computes a two-scale analysis wavelet decomposition and discards the first, fine-scale resolution level. The synthesis transform only uses the second, coarse-scale level coefficients to construct the reduced representation. [Williams and Li, 2018] proposes to use a fixed Haar wavelet basis; we consequently refer to this approach as wavelet pooling.

The papers closest to ours are [Williams and Li, 2018] and [Wolter et al., 2020]. [Wolter et al., 2020] proposes to use flexible wavelets for network compression and formulates a cost function to optimize the wavelets. We build on both approaches for feature pooling, add coefficient weights, and devise an improved broader loss function.

3 Methods

Our pooling approach relies on the multiscale representation we obtain from the fast wavelet transform (FWT). The recurrent evaluation of the FWT produces new scale coefficients each time it runs. We refer to the number of runs as levels. This section discusses the FWT and the properties our wavelet filters must have, how to turn these into a cost function, and the rescaling of wavelet coefficients.

3.1 The fast wavelet transform

The fast wavelet transform constitutes a change of representation and a form of multiscale analysis. It expresses the input data points in terms of a wavelet filter pair by computing [Strang and Nguyen, 1996]:

b = Ax,    (1)

Figure 1: Structure of the Haar analysis fast wavelet transformation matrix on the left, followed by the structures of the individual scale processing matrices. The full analysis matrix (left) is the product of multiple (here three) matrices. Each describes the FWT operations at the current level. Two convolution matrices H0 and H1 are clearly visible in the three matrices on the left [Wolter, 2021].

The analysis matrix A is the product of multiple matrices, each decomposing the input signal x at an individual scale. Finally, multiplication of the total matrix A with the input data yields the wavelet coefficients b.

We show the structure of the individual matrices in figure 1. We observe a growing identity block matrix I as the FWT moves through the different scales. The filter matrix blocks move in steps of two. Each FWT step cuts the input length in half. The identity submatrices appear where the results from the previous steps have been stored. The reoccurring diagonals denote convolution operations with the analysis filter pair h0 and h1. Given the wavelet filter degree d, each filter has N = 2d coefficients. The filters are arranged in vectors in R^N. The filter pair appears in the convolution matrices H0 and H1. Overall we observe the pattern [Strang and Nguyen, 1996],

$$A = \cdots \begin{pmatrix} H_0 & \\ H_1 & \\ & I \end{pmatrix} \begin{pmatrix} H_0 \\ H_1 \end{pmatrix}. \qquad (2)$$

The first two FWT matrices are shown. The dots act as a placeholder for additional analysis matrices, all of which factor into the final analysis matrix.

We have seen how to construct the analysis matrix A and will now invert this process. Overall the inverse FWT (IFWT) can again be thought of as a linear operation,

Sb = x.    (3)

The synthesis matrix S is constructed using the synthesis filter pair f0, f1, where structurally the synthesis matrices are transposed in comparison to their analysis counterparts. We show the multi-scale reconstruction matrices in figure 2. Given the transposed convolution matrices F0 and F1, S is constructed by [Strang and Nguyen, 1996]

$$S = \begin{pmatrix} F_0 & F_1 \end{pmatrix} \begin{pmatrix} F_0 & F_1 & \\ & & I \end{pmatrix} \cdots. \qquad (4)$$

In order to guarantee invertibility we must have SA = I. To enforce it, conditions on the filter pairs h0, h1 as well as f0, f1, which make up the convolution and transposed convolution matrices, are required. In summary, computation of the fast wavelet transform relies on convolution pairs, which recursively build on each other.
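To make the matrix picture concrete, the following minimal sketch (ours, not the released implementation) builds the two-scale Haar analysis matrix of equation (2) for a length-eight signal and checks the invertibility requirement SA = I; because the Haar filters are orthogonal, the synthesis matrix is simply the transpose in this special case.

# Minimal sketch, assuming orthogonal Haar filters; not the authors' implementation.
import torch

def haar_analysis_matrix(n):
    """Single-scale analysis matrix with h0 = [1, 1]/sqrt(2) and h1 = [1, -1]/sqrt(2)."""
    h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
    h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
    H = torch.zeros(n, n)
    for row in range(n // 2):
        H[row, 2 * row:2 * row + 2] = h0           # convolution matrix H0, stride two
        H[n // 2 + row, 2 * row:2 * row + 2] = h1  # convolution matrix H1, stride two
    return H

n = 8
A1 = haar_analysis_matrix(n)                                              # first scale
A2 = torch.block_diag(haar_analysis_matrix(n // 2), torch.eye(n // 2))    # second scale plus identity block
A = A2 @ A1                                                               # full analysis matrix, equation (2)
S = A.T                                                                   # orthogonal Haar filters: S = A^T
x = torch.arange(float(n))
b = A @ x                                                                 # wavelet coefficients, equation (1)
assert torch.allclose(S @ b, x, atol=1e-6)                                # SA = I, cf. equations (3) and (4)

For longer, non-orthogonal filters the synthesis matrix has to be assembled from the separate synthesis pair f0, f1 as in equation (4); the transpose shortcut only holds for orthogonal wavelets such as Haar.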
3.2 The two-dimensional wavelet transform

The two-dimensional wavelet transform is based on the same principles as the one-dimensional case. Separable and non-separable two-dimensional wavelet transforms exist [Jensen and la Cour-Harbo, 2001]. Separable two-dimensional transforms work with two one-dimensional transforms. Non-separable transforms use a single two-dimensional FWT. Separable transforms have previously been found to problematically single out axis-aligned and diagonal structures [Jensen and la Cour-Harbo, 2001]. We, therefore, choose to work with non-separable two-dimensional wavelet transforms. We now have to handle the extra dimension and require two-dimensional filter quadruplets instead of pairs.

Given the one-dimensional filter pairs f0, f1 and h0, h1, their two-dimensional counterparts are computed using outer products [Vyas et al., 2018]. An outer product of two column vectors a and b is defined as ab^T. The required four filters are computed using the outer products of all four combinations,

f_ll = f0 f0^T,   f_lh = f0 f1^T,    (5)
f_hl = f1 f0^T,   f_hh = f1 f1^T.    (6)

This filter set {f_ll, f_lh, f_hl, f_hh} ∋ f_k is used in all forward convolutions at every level.
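The outer-product construction of equations (5)-(6) maps directly onto a strided 2D convolution. The sketch below assumes Haar filters and uses torch.nn.functional.conv2d; the helper name analysis_filter_bank is ours and not taken from the released code.

# A sketch of equations (5)-(7) with Haar filters; names are ours.
import torch
import torch.nn.functional as F

def analysis_filter_bank(f0, f1):
    """Stack the four 2D filters f_ll, f_lh, f_hl, f_hh built from outer products."""
    filters = [torch.outer(a, b) for a in (f0, f1) for b in (f0, f1)]
    return torch.stack(filters).unsqueeze(1)      # shape [4, 1, N, N]

f0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
f1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
filt = analysis_filter_bank(f0, f1)

x = torch.randn(1, 1, 32, 32)                     # one single-channel feature map
coeffs = F.conv2d(x, filt, stride=2)              # equation (7): ll, lh, hl, hh subbands
ll, lh, hl, hh = coeffs.unbind(dim=1)             # each subband has spatial size 16x16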


Figure 2: Structures of the Haar synthesis fast wavelet transformation matrices for three scales (right) as well as the complete inverse matrix (left). We observe that the synthesis matrix is the product of stacked transposed convolution matrices F0 and F1. In comparison to the forward transform, the matrices are reversed [Wolter, 2021].

In the non-separated case, two-dimensional convolution, denoted by ∗, is used to compute the wavelet coefficients. Denoting the filter with k and using i for the scale, convolution with each of the four filters

x_i ∗ f_k = k_i    (7)

yields the four subbands with k ∈ [ll, lh, hl, hh]. The last three coefficients are stored, and ll_i is used to compute the next scale. In other words, the input signal x_i must be ll_{i−1} after the first level, where i > 1.

In the inverse or synthesis case, the two-dimensional filters are once more constructed using outer products. Given the four subbands at the higher level, ll_{i−1} is reconstructed using a transposed two-dimensional convolution. The reconstruction completes the three stored subband tensors on the level below, which we keep during the forward pass. While inversely traversing the list of sub-tensors stored during the analysis transform, transposed convolutions successively reconstruct the original input, combining the newly computed ll_{i−1} with the stored submatrices at every step.
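A hedged sketch of the recursive analysis and the inverse traversal described above, for a single-channel input (feature maps with many channels would use grouped convolutions); this is our illustration, not the reference implementation.

# Our sketch of the recursive 2D FWT with Haar filters; single-channel input only.
import torch
import torch.nn.functional as F

def wavelet_analysis(x, filt, levels):
    """Return [[lh_0, hl_0, hh_0], ..., [ll_I, lh_I, hl_I, hh_I]] as in section 3.4."""
    coeff_list = []
    ll = x
    for _ in range(levels):
        ll, lh, hl, hh = F.conv2d(ll, filt, stride=2).split(1, dim=1)
        coeff_list.append([lh, hl, hh])
    coeff_list[-1].insert(0, ll)
    return coeff_list

def wavelet_synthesis(coeff_list, filt):
    """Inverse traversal with transposed convolutions."""
    ll = coeff_list[-1][0]
    for scale in reversed(coeff_list):
        detail = scale[-3:]                        # lh, hl, hh stored during analysis
        stacked = torch.cat([ll] + detail, dim=1)
        ll = F.conv_transpose2d(stacked, filt, stride=2)
    return ll

h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
filt = torch.stack([torch.outer(a, b) for a in (h0, h1) for b in (h0, h1)]).unsqueeze(1)

x = torch.randn(1, 1, 32, 32)
coeffs = wavelet_analysis(x, filt, levels=2)
assert torch.allclose(wavelet_synthesis(coeffs, filt), x, atol=1e-5)

Because the orthonormal Haar filters make the strided convolution an orthogonal map, the transposed convolution with the same filters serves as the synthesis transform in this sketch; learnable filter pairs would use separate synthesis filters, as discussed in section 3.5.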

3.3 Wavelet pooling

The fast wavelet transform turns into a pooling method when the first-level, fine-scale coefficients are discarded [Williams and Li, 2018]. The full analysis two-dimensional transform is computed for two or more scales. The synthesis transform, however, is computed without using the subbands at the first scale, which cuts the resolution in half [Williams and Li, 2018]. We are comparing pooling based on non-separable and separable transforms [Williams and Li, 2018]. Separable transforms treat the rows first and afterwards, independently, the columns. Non-separable two-dimensional wavelet transforms use a single 2D convolution, see equations (5)-(7). Without re-weighting, the inverse transform recombines higher-level sub-tensors into the original input. In Section 3.4 we introduce sub-tensor weights, which make multiscale wavelet pooling meaningful.

3.4 Scaled wavelet pooling

To further benefit from multiscale analysis transforms, we introduce scaling weights for the pooling operation. Our weights rescale the wavelet coefficient subbands at higher levels. The idea is to allow our scaled wavelet pooling layers to in- or decrease the contribution of a particular scale to the pooled result.

The forward two-dimensional wavelet transform produces the coefficient quadruples ll_i, lh_i, hl_i, hh_i at each scale i. It suffices to store the lh, hl, hh submatrices at each level to fully reconstruct the signal, since the low-low ll_i coefficients serve as input to compute the next quadruple at i + 1. We introduce three scale coefficient parameters w_{lh,i}, w_{hl,i}, w_{hh,i} ∈ R, all > 0, at all but the last scale, where we additionally use a single w_ll > 0. The learned parameters are multiplied with the submatrices. For example, in the final low-low case, the inverse transform would work with w_ll · ll. With I indicating the total number of scales and dots denoting intermediate subbands, overall the weighted list,

[[w_{lh,0} · lh_0, w_{hl,0} · hl_0, w_{hh,0} · hh_0], . . . ,    (8)
 [w_{ll,I} · ll_I, w_{lh,I} · lh_I, w_{hl,I} · hl_I, w_{hh,I} · hh_I]]    (9)

is fed into the inverse FWT to compute the pooled features. Our approach gives the pooling operation the opportunity to rescale features at various scales according to their importance.
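Putting sections 3.3 and 3.4 together, a pooling layer can drop the finest-scale subbands and rescale the remaining ones with learned weights. The sketch below reuses the wavelet_analysis and wavelet_synthesis helpers from the previous listing; the class and parameter names are ours, positivity of the weights is not enforced, and in this reading the weights are only applied to the scales that are kept.

# Our sketch of a scaled wavelet pooling layer; single-channel features, Haar filters.
import torch

class ScaledWaveletPool2d(torch.nn.Module):
    """Halve the resolution by dropping the finest-scale subbands (section 3.3)
    and rescaling the remaining subbands with learned weights (section 3.4)."""
    def __init__(self, filt, levels=2):
        super().__init__()
        self.register_buffer("filt", filt)                 # fixed Haar filters in this sketch
        self.levels = levels
        # one weight per kept subband, initialised to one as in the paper
        self.detail_weights = torch.nn.Parameter(torch.ones(levels - 1, 3))
        self.ll_weight = torch.nn.Parameter(torch.ones(1))

    def forward(self, x):
        coeffs = wavelet_analysis(x, self.filt, self.levels)
        kept = coeffs[1:]                                  # discard the finest scale -> pooling
        for i, scale in enumerate(kept):
            scale[-3:] = [w * c for w, c in zip(self.detail_weights[i], scale[-3:])]
        kept[-1][0] = self.ll_weight * kept[-1][0]
        return wavelet_synthesis(kept, self.filt)          # spatial size is halved

h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
filt = torch.stack([torch.outer(a, b) for a in (h0, h1) for b in (h0, h1)]).unsqueeze(1)
pool = ScaledWaveletPool2d(filt, levels=2)
print(pool(torch.randn(1, 1, 32, 32)).shape)               # torch.Size([1, 1, 16, 16])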

3.5 Wavelet properties

The chosen wavelet basis representation impacts the pooling. We now aim to optimize the wavelet basis. We observe that for invertibility, the synthesis transform must undo the analysis transform. This requirement does not allow us to work with any filter. The so-called alias cancellation and perfect reconstruction conditions must hold [Strang and Nguyen, 1996]. Filters that satisfy these conditions turn into wavelets. The analysis filter pair is called h0, h1, while one refers to the synthesis pair as f0, f1. Both conditions work on their z-transformed counterparts. For a complex number z ∈ C, the transformed filters H_p(z) and F_p(z), p ∈ {0, 1}, are defined as

$$H_p(z) = \sum_{n \in \mathbb{Z}} h_p(n) z^{-n}, \qquad F_p(z) = \sum_{n \in \mathbb{Z}} f_p(n) z^{-n}, \qquad (10)$$

for both pairs. In the z-domain the alias cancellation condition is given by [Strang and Nguyen, 1996]

H_0(−z)F_0(z) + H_1(−z)F_1(z) = 0.    (11)

In filter design the alias cancellation condition is often solved by the alternating signs solution F_0(z) = H_1(−z) and F_1(z) = −H_0(−z) [Strang and Nguyen, 1996]. For the perfect reconstruction condition to hold we must have [Strang and Nguyen, 1996],

H_0(z)F_0(z) + H_1(z)F_1(z) = 2z^c.    (12)

The equation requires the filter product-sum at the center z^c to be equal to two, and zero everywhere else. The power at the center is denoted as c.

3.6 The adaptive wavelet loss

It turns out that we do not have to compute the z-transform in order to evaluate both conditions. The coefficient representations can be used, by exploiting the equivalence of polynomial multiplication and convolution [Wolter et al., 2020]. Instead of measuring the deviation from the alternating signs solution, we choose to work with convolutions for both our alias cancellation and perfect reconstruction losses. Based on (11) we now define for alias cancellation

$$\mathcal{L}_{ac}(\theta_w) = \sum_{k=0}^{N-1} \Big[ \big([h_0 \cdot (-1)^k] * f_0\big)_k + \big([h_1 \cdot (-1)^k] * f_1\big)_k \Big]^2. \qquad (13)$$

For the perfect reconstruction loss we follow [Wolter et al., 2020] and use, based on (12),

$$\mathcal{L}_{pr}(\theta_w) = \sum_{k=0}^{N-1} \Big[ (h_0 * f_0)_k + (h_1 * f_1)_k - 2 \cdot e_{\lfloor N/2 \rfloor, k} \Big]^2, \qquad (14)$$

where e_{⌊N/2⌋} is the ⌊N/2⌋-th unit vector and k the filter position as counted from left to right. Summing up both expressions leads us to the overall wavelet loss

L(θ_w) = L_ac(θ_w) + L_pr(θ_w).    (15)

To allow joint optimization of the network and wavelet weights, this pooling loss function must be added to the task-dependent loss function that we want the optimizer to improve.
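The convolutional losses of equations (13)-(15) can be evaluated directly on the filter coefficients. The sketch below follows our reading of those equations (in particular, the placement of the target unit vector at the centre of the full convolution) and is not the released code.

# Our sketch of the wavelet loss of equations (13)-(15).
import torch
import torch.nn.functional as F

def full_conv(a, b):
    """Full 1D convolution, i.e. the coefficient sequence of the polynomial product."""
    return F.conv1d(a.view(1, 1, -1), b.flip(0).view(1, 1, -1), padding=b.numel() - 1).view(-1)

def wavelet_loss(h0, h1, f0, f1):
    signs = torch.tensor([(-1.0) ** k for k in range(h0.numel())])
    # equation (13): coefficients of H0(-z)F0(z) + H1(-z)F1(z) should vanish
    ac = full_conv(h0 * signs, f0) + full_conv(h1 * signs, f1)
    loss_ac = (ac ** 2).sum()
    # equation (14): coefficients of H0(z)F0(z) + H1(z)F1(z) should equal two at the centre
    pr = full_conv(h0, f0) + full_conv(h1, f1)
    target = torch.zeros_like(pr)
    target[pr.numel() // 2] = 2.0
    loss_pr = ((pr - target) ** 2).sum()
    return loss_ac + loss_pr                       # equation (15)

# The Haar quadruple with time-reversed synthesis filters yields (numerically) zero loss:
h0 = torch.tensor([1.0, 1.0]) / 2.0 ** 0.5
h1 = torch.tensor([1.0, -1.0]) / 2.0 ** 0.5
print(wavelet_loss(h0, h1, f0=h0.flip(0), f1=h1.flip(0)))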
4 Experiments

All experiments use Pytorch [Paszke et al., 2017] and run on Nvidia Titan Xp cards with 12GB memory.

Starting with a proof of concept experiment, we compute a forward 2d-FWT, set the first subband coefficients to zero, and compute the synthesis FWT. The resulting image appears in figure 3 on the left. Using the cost function defined in equation 15, we optimize zero-padded Haar wavelet filters to minimize the information loss resulting from not having the first level coefficients. The second image from the left in figure 3 shows the resulting reconstruction. Right next to it, figure 3 depicts the mean channel difference of both reconstructions. We observe that the new reconstruction is improved, especially at the edges, and shows fewer artifacts. Finally, the mean squared error versus the optimization steps appears on the very right of figure 3. We conclude that the filter optimization approach could help to improve wavelet pooling layers in principle.

Haar-wavelet initialization tends to get stuck at the Haar wavelet. In subsequent experiments, we choose to work with randomly initialized filter coefficients. The randomly initialized wavelet weights are pre-trained to obtain sufficiently wavelet-like coefficients. These, in turn, ensure stable FWT and IFWT computations.

We repeat the proof of concept experiment with randomly initialized filters and show the converged filter in figure 4. The solution displays a pattern that solves the anti-aliasing condition. Some possible solutions require F_0(z) = H_1(−z) and F_1(z) = −H_0(−z). Substituting (−z) produces a minus sign at odd powers in the coefficient polynomial. Multiplication with (−1) shifts the pattern to even powers. Whenever F_0 and H_1 share the same sign, F_1 and H_0 do not, and the other way around. Other patterns appear too. We picked an alternating sign example because it is well known, and other solutions to the convolutions in equations (13) and (14) are harder to recognize.

In the next sections, we will evaluate the different wavelet pooling approaches compared to max- and mean-pooling on several data sets. Here, working with standard CNN architectures, we compare the methods by replacing the local pooling layers. We work with first-degree wavelets. The wavelets in our CNN experiments, therefore, use two coefficients in all four filter banks. Each wavelet-pooling layer has two scales and its own weights.

Figure 3: Wavelet optimization proof of concept. The first level coefficients are set to zero. Using the remaining coefficients the input image is reconstructed. We show the original Haar-wavelet and optimized Haar-wavelet reconstruction. To the right, we show the difference between the two reconstructions, divided by its largest absolute value for better visibility. On the very right, the mean squared error is plotted during optimization. We observe that our approach reduces the reconstruction error.
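For illustration, a hedged sketch of such a proof-of-concept optimization follows. It assumes the wavelet_analysis, wavelet_synthesis, and wavelet_loss helpers sketched in section 3, uses a random image as a stand-in for the test image, ties the synthesis transform to the transposed convolution with time-reversed filters (the paper keeps f0, f1 as separate filters), and uses Adam rather than the authors' exact optimizer settings.

# Our proof-of-concept sketch, not the authors' code; relies on earlier helpers.
import torch

image = torch.rand(1, 1, 128, 128)                       # stand-in for the test image
h0 = torch.nn.Parameter(torch.tensor([1.0, 1.0]) / 2.0 ** 0.5)   # Haar initialization
h1 = torch.nn.Parameter(torch.tensor([1.0, -1.0]) / 2.0 ** 0.5)
optimizer = torch.optim.Adam([h0, h1], lr=1e-3)

for step in range(1500):
    optimizer.zero_grad()
    filt = torch.stack([torch.outer(a, b) for a in (h0, h1) for b in (h0, h1)]).unsqueeze(1)
    coeffs = wavelet_analysis(image, filt, levels=2)
    coeffs[0] = [torch.zeros_like(c) for c in coeffs[0]]  # zero the first-scale subbands
    recon = wavelet_synthesis(coeffs, filt)
    mse = torch.mean((recon - image) ** 2)
    loss = mse + wavelet_loss(h0, h1, h0.flip(0), h1.flip(0))   # MSE plus equation (15)
    loss.backward()
    optimizer.step()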

Table 1: Pooling method comparison on the MNIST data set. Our evaluation uses a LeNet5-like network. [Williams and Li, 2018] used a network based on Zeiler's network and what we refer to as wavelet pooling. On MNIST adaptive wavelet pooling compares favorably to classic pooling methods, while scaled wavelet pooling performs competitively. For our experiments mean and standard deviation over ten runs are shown.

Experiment                          Accuracy [%]
adaptive wavelet                    99.173 ± 0.037
scaled wavelet                      99.075 ± 0.112
wavelet                             99.073 ± 0.103
max                                 99.055 ± 0.140
adaptive max                        98.809 ± 0.065
mean                                99.059 ± 0.122
adaptive mean                       99.064 ± 0.038
separable wavelet                   99.071 ± 0.0967
wavelet [Williams and Li, 2018]     99.01

Figure 4: Converged wavelet using the sum of our loss formulation and mean squared error on the proof of concept experiment shown in figure 3, using a random initialization. H0, H1 denote the analysis and F0, F1 the synthesis filter pairs.

Wavelet and scaled wavelet layers use Haar wavelets. The scales are initially set to one. Adaptive wavelet layers work with random filter initializations. Initial values for the adaptive filters are drawn from the uniform distribution U[−1, 1]. To ensure working FWT and IFWT transforms, we pre-train the adaptive filters for 250 steps to ensure that our forward and backward fast wavelet transform produce useful values from the start.

4.1 MNIST

The MNIST data set [LeCun et al., 1998] consists of sixty thousand train and ten thousand test images. All images have 28x28 pixels and belong to ten classes, one for each digit from zero to nine. We solve this classification problem using a LeNet-like network with two convolution, two pooling, and three fully connected layers. We choose to optimize using stochastic gradient descent with a learning rate of 0.12, a momentum value of 0.6, and decay the learning rate by multiplication with 0.95 after every weight update. Training stops after 25 epochs.
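A hedged sketch of this optimization setup follows; the network below is a generic LeNet-like stand-in with max pooling (the paper swaps such pooling layers for the wavelet variants), and the dummy batch merely keeps the snippet self-contained.

# Our sketch of the MNIST setup, not the authors' exact architecture or training script.
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),    # pooling layers get swapped per experiment
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(), nn.Linear(84, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.12, momentum=0.6)
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lambda _: 0.95)

images, labels = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))   # dummy batch
for epoch in range(25):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    # for adaptive wavelet pooling, the wavelet loss of equation (15) is added here (not shown)
    loss.backward()
    optimizer.step()
    scheduler.step()           # multiply the learning rate by 0.95 after every weight update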

Table 1 shows experimental results on MNIST. We observe the highest mean accuracy, with a small standard deviation, for the adaptive wavelet approach, followed by scaled wavelet, mean, and max pooling. In these experiments, using adaptive mean or max pooling did not help. To evaluate the impact of choosing a separable or non-separable wavelet transform, we implemented and explored the effect of a separable transform in the last row of Table 1, using a LeNet5-like architecture as we do in all other MNIST experiments. [Jensen and la Cour-Harbo, 2001] discourage this approach, but it does not reduce accuracy on MNIST.

Measurements of three wavelet loss formulations during training are shown in figure 5. The perfect reconstruction loss formulation defined in equation 14 is shown in green. Two different alias cancellation loss formulations are compared. The blue line shows the alias cancellation loss from equation 13. The yellow plot depicts a measure of the mean squared deviation from F_0(z) = H_1(−z) and F_1(z) = −H_0(−z),

$$\mathcal{L}_{ac}(\theta_w) = \sum_{k=0}^{N} \Big[ \big(f_{0,k} - (-1)^k h_{1,k}\big)^2 + \big(f_{1,k} + (-1)^k h_{0,k}\big)^2 \Big], \qquad (16)$$

as proposed by [Wolter et al., 2020]. While this loss formulation leads to alias cancelling wavelet filters if optimized, we find it to be unnecessarily restrictive. Figure 5 illustrates this point. The loss shown by the blue line converges to a minimum which the formulation from equation 16 would not have allowed.

Figure 5: Logarithmically scaled wavelet loss during training in the second MNIST pooling layer. The green line shows the perfect reconstruction loss as defined in equation 14. The blue line shows the alias cancellation loss from equation 13. Both plots overlap most of the time. The yellow plot depicts the alias cancellation loss function proposed in [Wolter et al., 2020]. We observe that our wavelet has converged to a working solution that the previous formulation does not allow.

Figure 6 depicts the evolution of the wavelet coefficient submatrix weights during training. Since all weights were initialized to one, we see that the optimizer moves these factors significantly as it adjusts the contribution of the individual subbands. Table 1 showed us that the scaling weights can lead to performance gains over fixed wavelet pooling, while basis optimization appears to be more efficient in this case.

Figure 6: Wavelet scale weights during training in the second MNIST pooling layer. Originally initialized to have a weight of one, over time the optimizer adjusts the relative weight of the subbands with respect to each other.
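For reference, the restrictive alternating-sign formulation of equation (16) compared in figure 5 can be written in a few lines; this is our sketch, not the code used to produce the figure.

# Our sketch of equation (16): squared deviation from F0(z) = H1(-z) and F1(z) = -H0(-z).
import torch

def alt_sign_loss(h0, h1, f0, f1):
    signs = torch.tensor([(-1.0) ** k for k in range(h0.numel())])
    return ((f0 - signs * h1) ** 2).sum() + ((f1 + signs * h0) ** 2).sum()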

4.2 CIFAR-10

The CIFAR-10 data set [Krizhevsky et al., 2009] consists of 60k 32 by 32 color images, of which 50k are used for training and 10k for testing. The images show airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Networks trained on this data set have to recognize these correctly.

We start our comparison of pooling methods using an Alexnet [Krizhevsky, 2014] like network, which we modify to work on CIFAR-10. While moving from experiment to experiment, we always swap all pooling layers. We train using stochastic gradient descent with a learning rate of 0.06 and 0.6 momentum. The learning rate was divided by 10 after 150 and again after 250 epochs. The training process was stopped after 300 epochs.

Results are tabulated in Table 2; again, our adaptive wavelet pooling performed competitively. Overall we find that the Alexnet backbone does much better than the Zeiler-like approach used jointly with dropout in [Williams and Li, 2018].

Table 2: Pooling method comparison on the CIFAR-10 data set using an Alexnet backbone. The last row again shows the wavelet-pooling result from a Zeiler-like network. We find our adaptive pooling approach to perform well, while the scaled approach does slightly worse yet remains competitive in this case.

Experiment                          Accuracy
adaptive wavelet                    89.79%
scaled wavelet                      88.94%
wavelet                             89.01%
max                                 89.64%
mean                                89.24%
separable wavelet                   88.72%
wavelet [Williams and Li, 2018]     80.28%

We repeat our experiments replacing the Alexnet with a densenet [Huang et al., 2017] architecture. For all experiments, we replace the pooling layers within all transition blocks and leave the final global pooling layer untouched. Optimization again used stochastic gradient descent with momentum. The learning rate was set to 0.05, again dividing it by ten after 150 and 250 epochs. Training stopped after 300 epochs. Weight decay helped in this case and was set to 10^−4. Table 3 shows the best test set performance observed during training. We find that adaptive and max-pooling work best, while scaled and wavelet pooling do better than mean pooling.

Table 3: Pooling method comparison on the CIFAR-10 data set using a densenet backbone. We find that adaptive wavelet and max pooling work best, while scaled wavelet and fixed wavelet pooling outperform mean pooling.

Experiment           Accuracy
adaptive wavelet     94.41%
scaled wavelet       94.30%
wavelet              94.22%
max                  94.37%
mean                 94.17%

4.3 SVHN

The Street View House Numbers (SVHN) Dataset [Netzer et al., 2011] has 73k color training and 26k testing images. We choose to use the CIFAR-10-like setting with 32 by 32 color images.

We again use an Alexnet [Krizhevsky, 2014] structure. Wavelet initialization and pretraining were left unchanged. For optimization, we choose stochastic gradient descent with a learning rate of 0.08 and no momentum.

Table 4: Pooling method comparison on the SVHN data set. Experiments again are based on a modified Alexnet architecture. We find our adaptive pooling method to perform on par with max pooling in this case.

Experiment                          Accuracy
adaptive wavelet                    94.87%
scaled wavelet                      94.66%
wavelet                             94.29%
max                                 94.87%
mean                                94.42%
wavelet [Williams and Li, 2018]     91.10%

The results shown in Table 4 confirm our previous observation that adaptive-wavelet and max-pooling perform best. The wavelet approach did not outperform mean pooling. This observation is in line with [Williams and Li, 2018], for their experiments with and without dropout. We note that scaled wavelet pooling outperformed mean pooling in this case and conclude that some form of learned flexibility is required within wavelet pooling layers.

5 Conclusion

We have introduced the usage of non-separable two-dimensional fast wavelet transforms for pooling layers in CNNs. Additionally, we proposed scaled and adaptive extensions. Here, an improved convolutional alias cancellation wavelet loss formulation permits wavelets which an earlier employed, restrictive alternating sign formulation did not allow. In our experiments, we have found that adaptive and scaled wavelet pooling delivers competitive results and adds extra flexibility to wavelet pooling.

Experimentally we found that the extra flexibility added through scaling or wavelet optimization leads to improved results in comparison to static wavelet pooling. In particular, we saw that fixed wavelet pooling often failed to outperform mean pooling. With the exception of our CIFAR-10 experiments using the Alexnet backbone, we saw that adding the scaling of the coefficients generally leads to better performance in comparison to mean pooling.

Wavelet optimization had a bigger impact than rescaling the subbands. We speculate that this might be a result of the small input images we chose to use for computational reasons. These small images limit the number of meaningful scales the FWT can produce.
Using more scales in future work might give the scaled wavelet pooling approach an opportunity to have more of an impact.

Our framework supports high-degree wavelets in principle, but these require additional padding to be invertible. Eventually, the extra padding impacts results. Adopting boundary filters [Strang and Nguyen, 1996] and wavelets instead of a padding-based approach may help here. We hope to investigate this in future work.

Future work could also train orthogonal wavelets. The orthogonal FWT matrices do not change the norm or the length of feature input vectors. Such orthogonality is of particular importance for the stability of recurrent neural networks, and could also help stabilize CNNs. Having orthogonal wavelet filters should make the FWT analysis matrix orthogonal, AA^T = I and A^T A = I [Strang and Nguyen, 1996]. Translating these conditions back into the filters leads to f_0[k] = h_0[−k] and f_1[k] = h_1[−k] [Jensen and la Cour-Harbo, 2001]. This condition suggests an orthogonal loss function. Such a loss could, for example, measure the mean squared deviation from orthogonality, which might be interesting in future investigations.

6 Acknowledgments

We thank Helmut Harbrecht for insightful discussions and our anonymous reviewers for their feedback. Moritz would like to thank the University of Bonn for access to the Auersberg and Teufelskapelle clusters. This work was supported by the Fraunhofer Cluster of Excellence Cognitive Internet Technologies (CCIT).

References

[Bruna and Mallat, 2013] Bruna, J. and Mallat, S. (2013). Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886.

[Carreira and Zisserman, 2017] Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308.

[Engl et al., 1996] Engl, H., Hanke, M., and Neubauer, A. (1996). Regularization of Inverse Problems. Kluwer.

[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

[Gopinath et al., 2019] Gopinath, K., Desrosiers, C., and Lombaert, H. (2019). Adaptive graph convolution pooling for brain surface analysis. In IPMI.

[Gulcehre et al., 2014] Gulcehre, C., Cho, K., Pascanu, R., and Bengio, Y. (2014). Learned-norm pooling for deep feedforward and recurrent neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 530–546. Springer.

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

[Huang et al., 2017] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708.

[Jensen and la Cour-Harbo, 2001] Jensen, A. and la Cour-Harbo, A. (2001). Ripples in mathematics: the discrete wavelet transform. Springer Science & Business Media.

[Krizhevsky, 2014] Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.

[Krizhevsky et al., 2009] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

[LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

[Liu et al., 2017] Liu, D., Zhou, Y., Sun, X., Zha, Z., and Zeng, W. (2017). Adaptive pooling in multi-instance learning for web video annotation. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 318–327.

[Malinowski and Fritz, 2013] Malinowski, M. and Fritz, M. (2013). Learnable pooling regions for image classification. International Conference on Learning Representations (workshop track).

[McFee et al., 2018] McFee, B., Salamon, J., and Bello, J. (2018). Adaptive pooling operators for weakly labeled sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26:2180–2193.

[Nagrani et al., 2019] Nagrani, A., Chung, J. S., Xie, W., and Zisserman, A. (2019). Voxceleb: Large-scale speaker verification in the wild. Computer Science and Language.

[Natterer, 1977] Natterer, F. (1977). Regularisierung schlecht gestellter Probleme durch Projektionsverfahren. Numer. Math., 28:329–341.

[Netzer et al., 2011] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

[Paszke et al., 2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In Conference on Neural Information Processing Systems Workshop Autodiff.

[Porrello et al., 2019] Porrello, A., Abati, D., Calderara, S., and Cucchiara, R. (2019). Classifying signals on irregular domains via convolutional cluster pooling. In Proceedings of Machine Learning Research, pages 1388–1397.

[Rippel et al., 2015] Rippel, O., Snoek, J., and Adams, R. P. (2015). Spectral representations for convolutional neural networks. In Advances in Neural Information Processing Systems.

[Rustamov and Guibas, 2013] Rustamov, R. and Guibas, L. J. (2013). Wavelets on graphs via deep learning. In Advances in Neural Information Processing Systems, pages 998–1006.

[Scherer et al., 2010] Scherer, D., Müller, A., and Behnke, S. (2010). Evaluation of pooling operations in convolutional architectures for object recognition. In Artificial Neural Networks – ICANN 2010, pages 92–101. Springer Berlin Heidelberg.

[Springenberg et al., 2015] Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (workshop track).

[Strang and Nguyen, 1996] Strang, G. and Nguyen, T. (1996). Wavelets and filter banks. SIAM.

[Tsai et al., 2015] Tsai, Y.-H., Hamsici, O. C., and Yang, M.-H. (2015). Adaptive region pooling for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 731–739.

[Vyas et al., 2018] Vyas, A., Yu, S., and Paik, J. (2018). Multiscale transforms with application to image processing. Springer.

[Williams and Li, 2018] Williams, T. and Li, R. (2018). Wavelet pooling for convolutional neural networks. In International Conference on Learning Representations.

[Wolter, 2021] Wolter, M. (2021). Frequency Domain Methods in Recurrent Neural Networks for Sequential Data Processing. PhD thesis, Institut für Informatik, Universität Bonn, Universitäts- und Landesbibliothek Bonn. Submitted.

[Wolter et al., 2020] Wolter, M., Lin, S., and Yao, A. (2020). Neural network compression via learnable wavelet transforms. In 29th International Conference on Artificial Neural Networks.

[Zeiler and Fergus, 2013] Zeiler, M. D. and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations 2013. arXiv preprint arXiv:1301.3557.