
Analyzing the Linear and Nonlinear Transformations of AlexNet to Gain Insight into Its Performance

Jyoti Nigam, Srishti Barahpuriya and Renu M. Rameshan Indian Institute of Technology, Mandi, Himachal Pradesh, India

Keywords: Convolution, Correlation, Linear Transformation, Nonlinear Transformation.

Abstract: AlexNet, one of the earliest successful deep learning networks, has given great performance in the image classification task. There are some fundamental properties a network must have for good classification, such as: the network preserves the important information of the input data; and the network is able to see differently points from different classes. In this work we experimentally verify that these core properties hold for the AlexNet architecture. We analyze the effect of the linear and nonlinear transformations applied to the input data across the layers, modeling the convolution filters as linear transformations. The verified results motivate conclusions on the desirable properties of a transformation that aid in better classification.

1 INTRODUCTION AND RELATED WORK

Convolutional neural networks (CNNs) have led to considerable improvements in performance for many computer vision (LeCun et al., 1989; Krizhevsky et al., 2012) and natural language processing tasks (Young et al., 2018). In the recent literature there are many papers (Giryes et al., 2016; Sokolić et al., 2017; Sokolić et al., 2017; Oyallon, 2017) which provide an analysis of why deep neural networks (DNNs) are efficient classifiers. (Kobayashi, 2018; Dosovitskiy and Brox, 2016) analyze CNNs by looking at visualizations of the neuron activations. Statistical models (Xie et al., 2016) have also been used to derive feature representations based on a simple statistical model.

We choose to analyze the network in a manner different from all of the above, by modeling the filters as a linear transformation. The effect of the nonlinear operations is analyzed using measures such as the Mahalanobis distance and the angular as well as Euclidean separation between points of different classes.

AlexNet (Krizhevsky et al., 2012) is one of the oldest successful CNNs that recognizes images of the ImageNet dataset (Deng et al., 2009). We analyze this network experimentally with the aim of understanding the mathematical reasons for its success, using data from two classes of ImageNet to study its performance.

The contributions of this analysis are as follows:

• We derive the structure of the linear transformation corresponding to the convolution filters and analyze its effect on the data using bounds on the eigenvalues of the linear transformation.

• Using a specific data selection plan, we show empirically that data from the same class shrinks together while the separation between two different classes increases.

2 ANALYSIS

AlexNet employs five convolution layers and three max pooling layers for extracting features, and three fully connected layers for classifying images, as shown in Fig. 1. Each layer makes use of the rectified linear unit (ReLU) for nonlinear neuron activation.

In CNNs, feature extraction from the input data is done by the convolution layers, while the fully connected layers act as the classifier. Each convolution layer generates a feature map in 3D format that is fed into the subsequent layer. The feature map from the last convolution layer is given to the fully connected layers as a flattened vector, and a 1000-dimensional vector is generated as the output of the last fully connected layer. This is followed by normalization and then a softmax layer. In the normalized output vector, each dimension gives the probability of the image belonging to the corresponding image class.
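The final normalization-and-softmax step of the classifier head can be sketched as follows. This is a minimal numpy illustration, not the paper's code; the 1000-dimensional score vector is a random stand-in for real AlexNet outputs.

```python
import numpy as np

# Minimal sketch: a 1000-dimensional score vector is turned into class
# probabilities by a softmax. Scores here are random placeholders.

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.random.default_rng(0).normal(size=1000)   # stand-in logits
probs = softmax(scores)
assert np.isclose(probs.sum(), 1.0) and (probs >= 0).all()
```

Each entry of `probs` then plays the role of the per-class probability described above.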


Nigam, J., Barahpuriya, S. and Rameshan, R. Analyzing the Linear and Nonlinear Transformations of AlexNet to Gain Insight into Its Performance. DOI: 10.5220/0007582408600865. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 860-865. ISBN: 978-989-758-351-3. Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.


Figure 1: AlexNet Architecture (Kim et al., 2017).

A CNN contains more than one linear transformation and two types of nonlinear transformations (ReLU and pooling), which are applied in repetition. The nonlinear transformation confines the data to the positive orthant of a higher-dimensional space.
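This confinement is easy to see concretely. A short numpy check (illustrative only, assuming an elementwise ReLU):

```python
import numpy as np

# ReLU zeroes every negative coordinate, so any output vector lies in the
# nonnegative orthant regardless of where the input was.
relu = lambda v: np.maximum(v, 0.0)

x = np.array([-1.5, 0.3, -0.2, 2.0])
y = relu(x)
assert (y >= 0).all()            # confined to the nonnegative orthant
```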

2.1 Analysis of Linear Transformation

The main operation in CNNs is convolution. The filters are correlated with the input to get the feature maps, and this operation is termed convolution in the literature. Since correlation is nothing other than convolution without flipping the kernel, the correlation operation can also be represented as a matrix-vector product. We refer to this matrix as the linear transformation. 1D and 2D correlation can be represented as shown in Eq. (1) and Eq. (2), respectively:

y(n) = ∑_k h(k) x(n + k),   (1)

y(m,n) = ∑_l ∑_k h(l,k) x(m + l, n + k),   (2)

where x is the input, y is the output, and h is the kernel. Eq. (1) leads to a Toeplitz matrix and Eq. (2) leads to a block Toeplitz matrix. In CNNs the order of convolution is higher and correspondingly one gets a matrix which is Toeplitz with Toeplitz blocks. Typically each layer has multiple filters leading to multiple maps, and the Toeplitz matrix corresponding to each filter is stacked vertically to get the overall transformation from the input space to the output space.

As an example, let x ∈ R^(N1×N2×N3) be an input vector and y ∈ R^(M1×M2×M3) be the output vector, where N3 and M3 are the number of channels in the input and the number of filters, respectively. Then the transformation matrix T is such that T ∈ R^(M1M2M3×N1N2N3). T is obtained by stacking the f_k, 1 ≤ k ≤ M3, as shown in Fig. 2. Each f_k is Toeplitz with Toeplitz blocks and has full rank. The input x is convolved with each filter, generating a feature map m_k. Eq. (3) describes the convolution operation:

y = Tx.   (3)

Figure 2: Convolution operation. The input is of size N1N2N3 × 1 and there are M3 filters (f_1, ..., f_M3) which generate M3 feature maps (m_1, ..., m_M3).

2.1.1 Analysis based on Nature of Transformation Matrix

The desirable properties for the transformation matrix T to aid classification are:

1. The null space of T should be such that the difference of vectors from two different classes does not lie in the null space of T. This in turn demands that the difference should not lie in the null space of any f_k, 1 ≤ k ≤ M3, i.e. if x_i ∈ C_i and x_j ∈ C_j, where C_i and C_j are different classes,

x_j − x_i ∉ N(f_k), 1 ≤ k ≤ M3.   (4)

Proof. Let x1, x2 ∈ R^(N1×N2×N3) be two points and x = x1 − x2 their difference; the norm of x is

‖x‖ = ‖x2 − x1‖.   (5)

x1, x2 are transformed to

y1 = Tx1,  y2 = Tx2.   (6)

861 ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods

Table 1: Analysis of norm values at each layer.

Layer   Total filters   Filters with ‖T‖₂ < 1   Min    Max
1       96              13                       0.28   4.16
2       256             2                        0.03   4.26
3       384             2                        0.96   2.22
4       384             6                        0.96   2.12
5       256             0                        1.16   2.19
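The correlation-as-matrix view of Eq. (1) and Eq. (3) can be sketched numerically. The following is an illustrative numpy construction for the 1D "valid" case; the filter `h`, signal `x`, and helper `correlation_matrix` are our own illustrative names, not from the paper.

```python
import numpy as np

# Sketch: 1D correlation y(n) = sum_k h(k) x(n+k) written as a matrix-vector
# product y = T x, where T is a Toeplitz matrix (Eq. (1) and Eq. (3)).

def correlation_matrix(h, n):
    """Toeplitz matrix T such that T @ x equals 'valid' correlation of x with h."""
    k = len(h)
    m = n - k + 1                      # number of valid output samples
    T = np.zeros((m, n))
    for i in range(m):
        T[i, i:i + k] = h              # each row holds h, shifted right by one
    return T

h = np.array([1.0, -2.0, 1.0])         # second-difference filter
x = np.arange(6, dtype=float)          # a linear ramp
T = correlation_matrix(h, len(x))
y = T @ x
# np.correlate in 'valid' mode computes the same un-flipped sliding product
assert np.allclose(y, np.correlate(x, h, mode="valid"))
```

Stacking one such matrix per filter vertically gives the overall transformation T of Eq. (3).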

The norm of the difference of y1 and y2 is

‖y2 − y1‖ = ‖T(x2 − x1)‖,   (7)

which can be written as

‖T(x2 − x1)‖ = ( ∑_{k=1}^{M3} ‖f_k(x2 − x1)‖² )^{1/2},

so x2 − x1 ∉ N(T) only if x2 − x1 ∉ N(f_k) for all k. This is important to maintain separation between classes after the transformation.

2. λ_min(TᵀT) > 1 and λ_max(TᵀT) > 1, where λ_min and λ_max are the minimum and maximum eigenvalues, respectively, of TᵀT.

Proof. For proper classification, two vectors from different classes should be separated at least by ‖x2 − x1‖. Since

y2 − y1 = T(x2 − x1),
‖y2 − y1‖ = ‖T(x2 − x1)‖ = ‖x2 − x1‖ ‖Tz‖,  where ‖z‖ = 1.

Let λ_min and λ_max be the minimum and maximum eigenvalues, respectively, of TᵀT; then √λ_min ≤ ‖Tz‖ ≤ √λ_max, and hence

√λ_min ‖x2 − x1‖ ≤ ‖y2 − y1‖ ≤ √λ_max ‖x2 − x1‖.   (8)

From Eq. (8) it is evident that λ_min, λ_max > 1 is ideal. We observe that for all the layers λ_min > 0, as shown in Tab. 1, but λ_min > 1 only for the last layer. Note that it is not necessary that T is full rank when all the full-rank f_k are stacked vertically, but we observe from the experiments that it is full rank.

Figure 3: Angles between representative points of both classes. The first entry on the x-axis shows the input value, followed by the five layers of AlexNet. Near min and Near max are the minimum and maximum angle values, respectively, from the near region; similarly, the minimum and maximum values are shown for the medium and far regions.

2.2 Analysis of Nonlinear Transformation

In this section, we analyze and verify (using AlexNet and its pre-trained weights) the effect of the nonlinear transformations on the input data. In this direction there is a work (Giryes et al., 2016) which studies how distances and angles change after the transformations within the layers. The analysis in (Giryes et al., 2016) is based on networks with random Gaussian weights and exploits tools used in the compressed sensing and dictionary learning literature. It shows that if the angle between the inputs is large, then the Euclidean distance between them at the output layer will be large, and vice versa.

We analyze the effect of the nonlinear transformations on the input data by measuring the following key points (details are given in the experiments section):

• Effect of the transformation on the angles between representative points of the two classes.

• Change in the Euclidean distance of points from the mean within each class.

• Mahalanobis distance between the mean points of the two classes.

• Minimum and maximum Euclidean distance among points within a class and between classes.
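The measures listed above can be sketched as follows: a minimal numpy illustration on synthetic stand-ins for the activations of the two classes. The arrays `A` and `B`, and the reduction to a single principal direction before the between-means distance (as the paper does for the Mahalanobis computation), are our own illustrative choices, not the paper's code or data.

```python
import numpy as np

# Synthetic placeholders for per-layer activations of the two classes
# (the paper extracts these from AlexNet with pre-trained weights).
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(200, 64))   # class 1 activations
B = rng.normal(2.0, 1.0, size=(200, 64))   # class 2 activations

def angle_deg(u, v):
    """Angle between two representative points, in degrees."""
    c = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

theta = angle_deg(A[0], B[0])              # angle between one representative pair

# Euclidean distance of points from their own class mean (within-class spread)
mu_a = A.mean(axis=0)
spread_a = np.linalg.norm(A - mu_a, axis=1).mean()

# Between-means distance in Mahalanobis style: project onto the single most
# significant principal direction first, since the full covariance matrix is
# impractical to estimate in high dimension.
X = np.vstack([A, B])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p = Vt[0]                                  # top principal direction
a1, b1 = A @ p, B @ p                      # one-dimensional projections
pooled_var = 0.5 * (a1.var(ddof=1) + b1.var(ddof=1))
d_between = abs(a1.mean() - b1.mean()) / np.sqrt(pooled_var)
```

Repeating these computations on the activations of each layer gives the layer-wise curves analyzed in the experiments.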


3 EXPERIMENTS

In this section we analyze how the angles and Euclidean distances change within the layers. We focus on the case of ReLU as the activation function and max pooling as the pooling strategy. In order to verify the observations which conclude that the network provides good classification, we consider data from two classes, namely cat and dog (each having 1000 images). We use the Mahalanobis distance to find the distance between the classes.

In our experiments we use AlexNet with its pre-trained weights, and all images from both classes are passed through the network. To measure the distances and angles within each class and between the class data, the six best representative points are selected from each class.

3.1 Method for Selecting the Representative Points

Each class is divided into three different regions: near the mean (N), at medium distance from the mean (M), and far away from the mean (F). In order to divide the regions and find the representative points, we follow the steps provided in Algorithm 1.

Algorithm 1: Selection of representative points.
Require: Let X and Y be the two classes, X = {x_1, ..., x_m} and Y = {y_1, ..., y_n}.
Ensure: Six representative points (x1, ..., x6) and (y1, ..., y6), respectively, from the two classes.
1: μ1 = (1/m) ∑_{i=1}^m x_i,  μ2 = (1/n) ∑_{i=1}^n y_i.
2: σ1² = (1/(m−1)) ∑_{i=1}^m (x_i − μ1)²,  σ2² = (1/(n−1)) ∑_{i=1}^n (y_i − μ2)².
3: To find the near-mean (N) representative points, N1 = {x_i ∈ X : ‖x_i − μ1‖ ≤ σ1}. N2 is defined similarly.
4: Select x1, x2, the points with minimum and maximum distance, respectively, from the set N1, and similarly for N2.
5: To find the medium-distance (M) region, M1 = {x_i ∈ X : (d_max + d_min)/2 − σ1² ≤ ‖x_i − μ1‖ ≤ (d_max + d_min)/2 + σ1²}, where d_max = max_i ‖x_i − μ1‖² and d_min = min_i ‖x_i − μ1‖². M2 is defined similarly.
6: Select x3, x4, the points with minimum and maximum distance, respectively, from the set M1, and similarly for M2.
7: To find the far-away (F) region, F1 = {x_i ∈ X : ‖x_i − μ1‖ > d_max − σ1²}. F2 is defined similarly.
8: Select x5, x6, the points with minimum and maximum distance, respectively, from the set F1, and similarly for F2.
9: The representative points (y1, ..., y6) of the other class are obtained in the same manner.

4 RESULTS AND DISCUSSION

Due to normalization, the input data (X ∈ R^(N1×N2×N3)) lies on a sphere/manifold, and we apply the ReLU (ρ) as the activation function. The nonlinear transformation followed by normalization sends the output data (Y ∈ R^(M1×M2×M3)) to a sphere/manifold with unit radius.

4.1 Effect of Transformation on Angles between Representative Points of Two Classes

To show the layer-wise influence on the angles between the data of the two classes, we plot the angle values for all representative points. It can be seen from Fig. 3 that the minimum as well as the maximum angles are much higher than the respective input angles.

4.2 Mahalanobis Distance between the Mean Points of Two Classes

In order to analyze the separation between the two classes as the data passes through the layers, we analyze the Mahalanobis distance between the means of the two classes at each layer. To calculate the Mahalanobis distance we need the covariance matrix, but due to the high dimension of the data it is hard to compute in practice. Hence, we use principal component analysis to reduce the dimension of the data and take only the one direction with the most significant variation. It is clear from Fig. 4 that the Mahalanobis distance is increasing, pointing to the fact that the separation between the classes is increasing.

Figure 4: Mahalanobis distance between the means of the classes.

4.3 Transformation of Euclidean Distance of Representative Points from Mean

In this section, we analyze the influence of the network in terms of the change in the distance of the representative points from their means for both classes. We observe from Fig. 5 and Fig. 6 that the Euclidean distance between the representative points and their respective means is reduced as the points pass through the subsequent layers. This reflects that the points from the same class are getting clustered together.

Figure 5: Euclidean distance of representative points from the mean (class 1). The first entry on the x-axis shows the input value, followed by the five layers of AlexNet. Near min and Near max are the minimum and maximum distance values, respectively, from the near region; similarly, the minimum and maximum values are shown for the medium and far regions.

Figure 6: Euclidean distance of representative points from the mean (class 2). The axes follow the same convention as Fig. 5.

4.4 Euclidean Distance between Representative Points within and Across Classes

We also analyze how the distances between representative points change after passing through each subsequent layer. It is seen from Fig. 7, Fig. 8 and Fig. 9 that the distances among points within a class decrease, but the distances among points between the two classes also reduce. Even though the distances between points from the two classes are also decreasing, we can say that the network is doing good classification, as the other parameters we have seen, such as the Mahalanobis distance between the means, the singular values of the transformation matrix for the filters, and the increase of the angles between points of the two classes, indicate that the two classes are getting apart as they pass through the layers of the network.

Figure 7: Euclidean distance among representative points (class 1). The axes follow the same convention as Fig. 5.

Figure 8: Euclidean distance among representative points (class 2). The axes follow the same convention as Fig. 5.

Figure 9: Euclidean distance among representative points (between classes). The axes follow the same convention as Fig. 5.

5 CONCLUSION

In this study, we consider the architecture of AlexNet and its linear and nonlinear transformation operations. We analyze the criteria required of the transformation matrix for appropriate classification and see that the conditions are met fully for the last layer and partially for the intermediate layers. We select six representative points from each class and observe the effect of the nonlinear transformations on the input data by measuring the change in angle and distance between these points. We observe that data from the same class is bunched together and data from different classes is well separated, in spite of the fact that all data points come closer irrespective of class.

REFERENCES

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

Dosovitskiy, A. and Brox, T. (2016). Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4829–4837.

Giryes, R., Sapiro, G., and Bronstein, A. M. (2016). Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457.

Kim, H., Nam, H., Jung, W., and Lee, J. (2017). Performance analysis of CNN frameworks for GPUs. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

Kobayashi, T. (2018). Analyzing filters toward efficient ConvNet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5619–5628.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.

Oyallon, E. (2017). Building a regular decision boundary with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1886–1894.

Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. D. (2017). Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280.

Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. D. (2017). Generalization error of deep neural networks: Role of classification margin and data structure. In 2017 International Conference on Sampling Theory and Applications (SampTA), pages 147–151.

Xie, L., Zheng, L., Wang, J., Yuille, A. L., and Tian, Q. (2016). InterActive: Inter-layer activeness propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279.

Young, T., Hazarika, D., Poria, S., and Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75.
