
Analyzing the Linear and Nonlinear Transformations of AlexNet to Gain Insight into Its Performance

Jyoti Nigam, Srishti Barahpuriya and Renu M. Rameshan Indian Institute of Technology, Mandi, Himachal Pradesh, India

Keywords: Convolution, Correlation, Linear Transformation, Nonlinear Transformation.

Abstract: AlexNet, one of the earliest successful deep learning networks, has given great performance in the image classification task. There are some fundamental properties a network must have for good classification, such as: the network preserves the important information of the input data; and the network is able to see differently points from different classes. In this work we experimentally verify that these core properties hold for the AlexNet architecture. We analyze the effect of the linear and nonlinear transformations applied to the input data across the layers, modeling the convolution filters as linear transformations. The verified results motivate conclusions on the desirable properties of a transformation that aid in better classification.

1 INTRODUCTION AND RELATED WORK

Convolutional neural networks (CNNs) have led to considerable improvements in performance for many computer vision (LeCun et al., 1989; Krizhevsky et al., 2012) and natural language processing tasks (Young et al., 2018). In the recent literature there are many papers (Giryes et al., 2016; Sokolić et al., 2017; Sokolić et al., 2017; Oyallon, 2017) which provide an analysis of why deep neural networks (DNNs) are efficient classifiers. (Kobayashi, 2018; Dosovitskiy and Brox, 2016) analyze CNNs by looking at visualizations of the neuron activations. Statistical models (Xie et al., 2016) have also been used to derive feature representations based on a simple statistical model.

We choose to analyze the network in a manner different from all of the above, by modeling the filters as a linear transformation. The effect of the nonlinear operations is analyzed using measures such as the Mahalanobis distance and the angular as well as Euclidean separation between points of different classes.

AlexNet (Krizhevsky et al., 2012) is one of the oldest successful CNNs that recognizes images of the ImageNet dataset (Deng et al., 2009). We analyze this network experimentally with the aim of understanding the mathematical reasons for its success, using data from two classes of ImageNet to study its performance.

The contributions of this analysis are as follows:

• We derive the structure of the linear transformation corresponding to the convolution filters and analyze its effect on the data using bounds on the eigenvalues of the linear transformation.

• Using a specific data selection plan, we show empirically that data from the same class shrinks together while the separation between two different classes increases.

2 ANALYSIS

AlexNet employs five convolution layers and three max pooling layers for extracting features, and three fully connected layers for classifying images, as shown in Fig. 1. Each layer makes use of the rectified linear unit (ReLU) for nonlinear neuron activation.

In CNNs, feature extraction from the input data is done by the convolution layers, while the fully connected layers act as the classifier. Each convolution layer generates a feature map in 3D format that is fed into the subsequent layer. The feature map from the last convolution layer is given to the fully connected layers as a flattened vector, and a 1000-dimensional vector is generated as the output of the last fully connected layer. This is followed by normalization and then a softmax layer. In the normalized output vector, each dimension gives the probability of the image belonging to the corresponding image class.
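The final normalization-and-softmax step of the classifier head can be sketched as follows. This is a minimal numpy illustration, not the paper's code; the 1000-dimensional score vector is a random stand-in for real AlexNet outputs.

```python
import numpy as np

# Minimal sketch: a 1000-dimensional score vector is turned into class
# probabilities by a softmax. Scores here are random placeholders.

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.random.default_rng(0).normal(size=1000)   # stand-in logits
probs = softmax(scores)
assert np.isclose(probs.sum(), 1.0) and (probs >= 0).all()
```

Each entry of `probs` then plays the role of the per-class probability described above.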


Nigam, J., Barahpuriya, S. and Rameshan, R. Analyzing the Linear and Nonlinear Transformations of AlexNet to Gain Insight into Its Performance. DOI: 10.5220/0007582408600865. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 860-865. ISBN: 978-989-758-351-3. Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.


Figure 1: AlexNet Architecture (Kim et al., 2017).

A CNN contains more than one linear transformation and two types of nonlinear transformations (ReLU and pooling), which are applied in repetition. The nonlinear transformation confines the data to the positive orthant of a higher-dimensional space.
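This confinement is easy to see concretely. A short numpy check (illustrative only, assuming an elementwise ReLU):

```python
import numpy as np

# ReLU zeroes every negative coordinate, so any output vector lies in the
# nonnegative orthant regardless of where the input was.
relu = lambda v: np.maximum(v, 0.0)

x = np.array([-1.5, 0.3, -0.2, 2.0])
y = relu(x)
assert (y >= 0).all()            # confined to the nonnegative orthant
```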

2.1 Analysis of Linear Transformation

The main operation in CNNs is convolution. The filters are correlated with the input to get the feature maps, and this operation is termed convolution in the literature. Since correlation is nothing other than convolution without flipping the kernel, the correlation operation can also be represented as a matrix-vector product. We refer to this matrix as the linear transformation. 1D and 2D correlation can be represented as shown in Eq. (1) and Eq. (2), respectively:

y(n) = ∑_k h(k) x(n + k),   (1)

y(m,n) = ∑_l ∑_k h(l,k) x(m + l, n + k),   (2)

where x is the input, y is the output, and h is the kernel. Eq. (1) leads to a Toeplitz matrix and Eq. (2) leads to a block Toeplitz matrix. In CNNs the order of convolution is higher and correspondingly one gets a matrix which is Toeplitz with Toeplitz blocks. Typically each layer has multiple filters leading to multiple maps, and the Toeplitz matrix corresponding to each filter is stacked vertically to get the overall transformation from the input space to the output space.

As an example, let x ∈ R^(N1×N2×N3) be an input vector and y ∈ R^(M1×M2×M3) be the output vector, where N3 and M3 are the number of channels in the input and the number of filters, respectively. Then the transformation matrix T is such that T ∈ R^(M1M2M3×N1N2N3). T is obtained by stacking the f_k, 1 ≤ k ≤ M3, as shown in Fig. 2. Each f_k is Toeplitz with Toeplitz blocks and has full rank. The input x is convolved with each filter, generating a feature map m_k. Eq. (3) describes the convolution operation:

y = Tx.   (3)

Figure 2: Convolution operation. The input is of size N1N2N3 × 1 and there are M3 filters (f_1, ..., f_M3) which generate M3 feature maps (m_1, ..., m_M3).

2.1.1 Analysis based on Nature of Transformation Matrix

The desirable properties for the transformation matrix T to aid classification are:

1. The null space of T should be such that the difference of vectors from two different classes does not lie in the null space of T. This in turn demands that the difference should not lie in the null space of any f_k, 1 ≤ k ≤ M3, i.e. if x_i ∈ C_i and x_j ∈ C_j, where C_i and C_j are different classes,

x_j − x_i ∉ N(f_k), 1 ≤ k ≤ M3.   (4)

Proof. Let x1, x2 ∈ R^(N1×N2×N3) be two points and x = x1 − x2 their difference; the norm of x is

‖x‖ = ‖x2 − x1‖.   (5)

x1, x2 are transformed to

y1 = Tx1,  y2 = Tx2.   (6)

861 ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods

Table 1: Analysis of norm values at each layer.

Layer   Total filters   Filters with ‖T‖₂ < 1   Min    Max
1       96              13                       0.28   4.16
2       256             2                        0.03   4.26
3       384             2                        0.96   2.22
4       384             6                        0.96   2.12
5       256             0                        1.16   2.19
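The correlation-as-matrix view of Eq. (1) and Eq. (3) can be sketched numerically. The following is an illustrative numpy construction for the 1D "valid" case; the filter `h`, signal `x`, and helper `correlation_matrix` are our own illustrative names, not from the paper.

```python
import numpy as np

# Sketch: 1D correlation y(n) = sum_k h(k) x(n+k) written as a matrix-vector
# product y = T x, where T is a Toeplitz matrix (Eq. (1) and Eq. (3)).

def correlation_matrix(h, n):
    """Toeplitz matrix T such that T @ x equals 'valid' correlation of x with h."""
    k = len(h)
    m = n - k + 1                      # number of valid output samples
    T = np.zeros((m, n))
    for i in range(m):
        T[i, i:i + k] = h              # each row holds h, shifted right by one
    return T

h = np.array([1.0, -2.0, 1.0])         # second-difference filter
x = np.arange(6, dtype=float)          # a linear ramp
T = correlation_matrix(h, len(x))
y = T @ x
# np.correlate in 'valid' mode computes the same un-flipped sliding product
assert np.allclose(y, np.correlate(x, h, mode="valid"))
```

Stacking one such matrix per filter vertically gives the overall transformation T of Eq. (3).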

The norm of the difference of y1 and y2 is

‖y2 − y1‖ = ‖T(x2 − x1)‖,   (7)

which can be written as

‖T(x2 − x1)‖ = ( ∑_{k=1}^{M3} ‖f_k(x2 − x1)‖² )^{1/2},

so x2 − x1 ∉ N(T) only if x2 − x1 ∉ N(f_k) for all k. This is important to maintain separation between classes after the transformation.

2. λ_min(TᵀT) > 1 and λ_max(TᵀT) > 1, where λ_min and λ_max are the minimum and maximum eigenvalues, respectively, of TᵀT.

Proof. For proper classification, two vectors from different classes should be separated at least by ‖x2 − x1‖. Since

y2 − y1 = T(x2 − x1),
‖y2 − y1‖ = ‖T(x2 − x1)‖ = ‖x2 − x1‖ ‖Tz‖,  where ‖z‖ = 1.

Let λ_min and λ_max be the minimum and maximum eigenvalues, respectively, of TᵀT; then √λ_min ≤ ‖Tz‖ ≤ √λ_max, and hence

√λ_min ‖x2 − x1‖ ≤ ‖y2 − y1‖ ≤ √λ_max ‖x2 − x1‖.   (8)

From Eq. (8) it is evident that λ_min, λ_max > 1 is ideal. We observe that for all the layers λ_min > 0, as shown in Tab. 1, but λ_min > 1 only for the last layer. Note that it is not necessary that T is full rank when all the full-rank f_k are stacked vertically, but we observe from the experiments that it is full rank.

Figure 3: Angles between representative points of both classes. The first entry on the x-axis shows the input value, followed by the five layers of AlexNet. Near min and Near max are the minimum and maximum angle values, respectively, from the near region; similarly, the minimum and maximum values are shown for the medium and far regions.

2.2 Analysis of Nonlinear Transformation

In this section, we analyze and verify (using AlexNet and its pre-trained weights) the effect of the nonlinear transformations on the input data. In this direction there is a work (Giryes et al., 2016) which studies how distances and angles change after the transformations within the layers. The analysis in (Giryes et al., 2016) is based on networks with random Gaussian weights and exploits tools used in the compressed sensing and dictionary learning literature. It shows that if the angle between the inputs is large, then the Euclidean distance between them at the output layer will be large, and vice versa.

We analyze the effect of the nonlinear transformations on the input data by measuring the following key points (details are given in the experiments section):

• Effect of the transformation on the angles between representative points of the two classes.

• Change in the Euclidean distance of points from the mean within each class.

• Mahalanobis distance between the mean points of the two classes.

• Minimum and maximum Euclidean distance among points within a class and between classes.
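The measures listed above can be sketched as follows: a minimal numpy illustration on synthetic stand-ins for the activations of the two classes. The arrays `A` and `B`, and the reduction to a single principal direction before the between-means distance (as the paper does for the Mahalanobis computation), are our own illustrative choices, not the paper's code or data.

```python
import numpy as np

# Synthetic placeholders for per-layer activations of the two classes
# (the paper extracts these from AlexNet with pre-trained weights).
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(200, 64))   # class 1 activations
B = rng.normal(2.0, 1.0, size=(200, 64))   # class 2 activations

def angle_deg(u, v):
    """Angle between two representative points, in degrees."""
    c = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

theta = angle_deg(A[0], B[0])              # angle between one representative pair

# Euclidean distance of points from their own class mean (within-class spread)
mu_a = A.mean(axis=0)
spread_a = np.linalg.norm(A - mu_a, axis=1).mean()

# Between-means distance in Mahalanobis style: project onto the single most
# significant principal direction first, since the full covariance matrix is
# impractical to estimate in high dimension.
X = np.vstack([A, B])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p = Vt[0]                                  # top principal direction
a1, b1 = A @ p, B @ p                      # one-dimensional projections
pooled_var = 0.5 * (a1.var(ddof=1) + b1.var(ddof=1))
d_between = abs(a1.mean() - b1.mean()) / np.sqrt(pooled_var)
```

Repeating these computations on the activations of each layer gives the layer-wise curves analyzed in the experiments.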


3 EXPERIMENTS

In this section we analyze how the angles and Euclidean distances change within the layers. We focus on the case of ReLU as the activation function and max pooling as the pooling strategy. In order to verify the observations which conclude that the network provides good classification, we consider data from two classes, namely cat and dog (each having 1000 images). We use the Mahalanobis distance to find the distance between the classes.

In our experiments we use AlexNet with its pre-trained weights, and all images from both classes are passed through the network. To measure the distances and angles within each class and between the class data, the six best representative points are selected from each class.

3.1 Method for Selecting the Representative Points

Each class is divided into three different regions: near the mean (N), at medium distance from the mean (M), and far away from the mean (F). In order to divide the regions and find the representative points, we follow the steps provided in Algorithm 1.

Algorithm 1: Selection of representative points.
Require: Let X and Y be the two classes, X = {x_1, ..., x_m} and Y = {y_1, ..., y_n}.
Ensure: Six representative points (x1, ..., x6) and (y1, ..., y6), respectively, from the two classes.
1: μ1 = (1/m) ∑_{i=1}^m x_i,  μ2 = (1/n) ∑_{i=1}^n y_i.
2: σ1² = (1/(m−1)) ∑_{i=1}^m (x_i − μ1)²,  σ2² = (1/(n−1)) ∑_{i=1}^n (y_i − μ2)².
3: To find the near-mean (N) representative points, N1 = {x_i ∈ X : ‖x_i − μ1‖ ≤ σ1}. N2 is defined similarly.
4: Select x1, x2, the points with minimum and maximum distance, respectively, from the set N1, and similarly for N2.
5: To find the medium-distance (M) region, M1 = {x_i ∈ X : (d_max + d_min)/2 − σ1² ≤ ‖x_i − μ1‖ ≤ (d_max + d_min)/2 + σ1²}, where d_max = max_i ‖x_i − μ1‖² and d_min = min_i ‖x_i − μ1‖². M2 is defined similarly.
6: Select x3, x4, the points with minimum and maximum distance, respectively, from the set M1, and similarly for M2.
7: To find the far-away (F) region, F1 = {x_i ∈ X : ‖x_i − μ1‖ > d_max − σ1²}. F2 is defined similarly.
8: Select x5, x6, the points with minimum and maximum distance, respectively, from the set F1, and similarly for F2.
9: The representative points (y1, ..., y6) of the other class are obtained in the same manner.

4 RESULTS AND DISCUSSION

Due to normalization, the input data (X ∈ R^(N1×N2×N3)) lies on a sphere/manifold, and we apply the ReLU (ρ) as the activation function. The nonlinear transformation followed by normalization sends the output data (Y ∈ R^(M1×M2×M3)) to a sphere/manifold with unit radius.

4.1 Effect of Transformation on Angles between Representative Points of Two Classes

To show the layer-wise influence on the angles between the data of the two classes, we plot the angle values for all representative points. It can be seen from Fig. 3 that the minimum as well as the maximum angles are much higher than the respective input angles.

4.2 Mahalanobis Distance between the Mean Points of Two Classes

In order to analyze the separation between the two classes as the data passes through the layers, we analyze the Mahalanobis distance between the means of the two classes at each layer. To calculate the Mahalanobis distance we need the covariance matrix, but due to the high dimension of the data it is hard to compute in practice. Hence, we use principal component analysis to reduce the dimension of the data and take only the one direction with the most significant variation. It is clear from Fig. 4 that the Mahalanobis distance is increasing, pointing to the fact that the separation between the classes is increasing.

Figure 4: Mahalanobis distance between the means of the classes.

4.3 Transformation of Euclidean Distance of Representative Points from Mean

In this section, we analyze the influence of the network in terms of the change in the distance of the representative points from their means for both classes. We observe from Fig. 5 and Fig. 6 that the Euclidean distance between the representative points and their respective means is reduced as the points pass through the subsequent layers. This reflects that the points from the same class are getting clustered together.

Figure 5: Euclidean distance of representative points from the mean (class 1). The first entry on the x-axis shows the input value, followed by the five layers of AlexNet. Near min and Near max are the minimum and maximum distance values, respectively, from the near region; similarly, the minimum and maximum values are shown for the medium and far regions.

Figure 6: Euclidean distance of representative points from the mean (class 2). The axes follow the same convention as Fig. 5.

4.4 Euclidean Distance between Representative Points within and Across Classes

We also analyze how the distances between representative points change after passing through each subsequent layer. It is seen from Fig. 7, Fig. 8 and Fig. 9 that the distances among points within a class decrease, but the distances among points between the two classes also reduce. Even though the distances between points from the two classes are also decreasing, we can say that the network is doing good classification, as the other parameters we have seen, such as the Mahalanobis distance between the means, the singular values of the transformation matrix for the filters, and the increase of the angles between points of the two classes, indicate that the two classes are getting apart as they pass through the layers of the network.

Figure 7: Euclidean distance among representative points (class 1). The axes follow the same convention as Fig. 5.

Figure 8: Euclidean distance among representative points (class 2). The axes follow the same convention as Fig. 5.

Figure 9: Euclidean distance among representative points (between classes). The axes follow the same convention as Fig. 5.

5 CONCLUSION

In this study, we consider the architecture of AlexNet and its linear and nonlinear transformation operations. We analyze the criteria required of the transformation matrix for appropriate classification and see that the conditions are met fully for the last layer and partially for the intermediate layers. We select six representative points from each class and observe the effect of the nonlinear transformations on the input data by measuring the change in angle and distance between these points. We observe that data from the same class is bunched together and data from different classes is well separated, in spite of the fact that all data points come closer irrespective of class.

REFERENCES

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

Dosovitskiy, A. and Brox, T. (2016). Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4829–4837.

Giryes, R., Sapiro, G., and Bronstein, A. M. (2016). Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457.

Kim, H., Nam, H., Jung, W., and Lee, J. (2017). Performance analysis of CNN frameworks for GPUs. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

Kobayashi, T. (2018). Analyzing filters toward efficient ConvNet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5619–5628.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.

Oyallon, E. (2017). Building a regular decision boundary with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1886–1894.

Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. D. (2017). Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280.

Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. D. (2017). Generalization error of deep neural networks: Role of classification margin and data structure. In 2017 International Conference on Sampling Theory and Applications (SampTA), pages 147–151.

Xie, L., Zheng, L., Wang, J., Yuille, A. L., and Tian, Q. (2016). InterActive: Inter-layer activeness propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279.

Young, T., Hazarika, D., Poria, S., and Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75.
