
Statistics & Discrete Methods of Data Sciences
CS395T(51800), CSE392(63625), M393C(54377)
MW 11:00a.m.-12:30p.m., GDC 5.304
Lecture Notes: Geometry of Deep Learning: Deep CNN
[email protected]

1 Convolutional Neural Networks (CNN)

1.1 Introduction

Convolutional Neural Networks (CNNs) are bio-inspired artificial neural networks. Unlike traditional approaches, a CNN can be fed raw image pixel values rather than hand-crafted feature vectors.

1.2 Architecture of CNN

Figure 1: Architecture of a CNN ([GBC, YB])

The basic design principle of a CNN is to develop an architecture and a learning scheme that reduce the number of parameters without compromising the learning power. In general, one way to interpret the architecture is as a linear convolution operation followed by nonlinear activations, pooling layers, and a deep neural network classifier. The convolution processes act as appropriate feature detectors that deal with a large amount of (low-level) information. A complete convolution layer has several different feature detectors, so multiple features can be extracted from the same image. A single feature detector is smaller in size than the input image and is slid over the image. Hence all units in a feature detector share the same weights and bias, which helps detect the same feature at all positions of the image and, in turn, gives invariance to transformations and shifts of the images. Local connections between pixels are used many times in the architecture. Introducing local receptive fields allows the extraction of elementary features such as edges, corners, and end points (see [Yann2] for more details). Higher-order complex features are detected in deeper hidden layers when these elementary features are combined. Sparse connectivity between subsequent layers, parameter sharing of weights between adjacent pixels, and equivariant representation make CNNs effective for image recognition and image classification. We now discuss the components of a CNN part by part.
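To make the parameter reduction concrete, here is a small Python sketch comparing the parameter count of a fully connected layer with that of a convolution layer on the same input; the image and kernel sizes are illustrative choices, not taken from the notes.

```python
# Illustrative comparison of parameter counts (sizes are made up for the example):
# a fully connected layer connects every input pixel to every output unit,
# while a convolution layer shares one small kernel across all positions.

H, W, C = 32, 32, 3          # input image: 32 x 32 pixels, 3 channels
k1, k2, KD = 5, 5, 6         # six 5 x 5 kernels

n_inputs = H * W * C
n_outputs = (H - k1 + 1) * (W - k2 + 1) * KD   # stride 1, no padding

fully_connected_params = n_inputs * n_outputs + n_outputs   # weights + biases
conv_params = (k1 * k2 * C) * KD + KD                       # shared kernels + biases

print(fully_connected_params)   # 14455392
print(conv_params)              # 456
```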

1.2.1 Convolution Layers

Suppose the input of the convolution layer has dimension $H \times W \times C$ (channels stacked together). Convolution layers are sets of parallel feature maps, formed by sliding different kernels (feature detectors) over the input and recording the element-wise dot product at each position as the feature map. Suppose each kernel in the set has dimension $k_1 \times k_2 \times C$, and let $K_D$ denote the size of the set; $K_D$ determines how many output channels the convolution layer produces. Further, assume there is a stride $Z_s$, which represents how far the kernel slides at each step, and a zero-padding parameter $Z_p$, which controls the size of the feature maps. Then the output of such a convolution layer has dimension $H_1 \times W_1 \times D_1$, where

\[
(H_1, W_1, D_1) = \left( \frac{H + 2Z_p - k_1}{Z_s} + 1,\; \frac{W + 2Z_p - k_2}{Z_s} + 1,\; K_D \right)
\]
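A minimal Python sketch of this output-shape formula; the function name and the example sizes are illustrative assumptions, and integer division plays the role of the floor.

```python
# Output shape of a convolution layer, following the formula above.
# Variable names (H, W, k1, k2, KD, Zs, Zp) mirror the notes' notation.

def conv_output_shape(H, W, k1, k2, KD, Zs=1, Zp=0):
    """Return (H1, W1, D1) for KD kernels of size (k1, k2), stride Zs, padding Zp."""
    H1 = (H + 2 * Zp - k1) // Zs + 1
    W1 = (W + 2 * Zp - k2) // Zs + 1
    return H1, W1, KD

# Example: a 32x32 image, 5x5 kernels, 6 kernels, stride 1, no padding.
print(conv_output_shape(32, 32, 5, 5, 6))   # (28, 28, 6)
```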


1.2.2 Activation Functions

Activation functions define the output of a neuron based on a given set of inputs. The weighted sum of the linear net input is passed through an activation function for a non-linear transformation. A typical activation is based on a conditional probability which returns the value one or zero as the output,

\[
o_p \in \{P(o_p = 1 \mid i_p) \ \text{or}\ P(o_p = 0 \mid i_p)\}.
\]

When the network input $i_p$ crosses the threshold value, the activation function returns the value 1 and passes the information to the next layer; when the network input $i_p$ is less than the threshold value, the activation function returns the value 0 and the information is not passed on. Here are some common activation functions (implemented in the code sketch after this list):

• Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$, $f'(x) = f(x)(1 - f(x))$.

• Tanh: $f(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, $f'(x) = 1 - f(x)^2$.

• ReLU: $f(x) = \begin{cases} 0 & x < 0 \\ x & x \ge 0 \end{cases}$, $\quad f'(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$.

• Leaky ReLU: $f(x) = \begin{cases} 0.01x & x < 0 \\ x & x \ge 0 \end{cases}$, $\quad f'(x) = \begin{cases} 0.01 & x < 0 \\ 1 & x \ge 0 \end{cases}$.

• Softmax: $f(x_j) = \frac{e^{x_j}}{\sum_{k=1}^{d} e^{x_k}}$, $\quad \frac{\partial f(x_j)}{\partial x_i} = f(x_j)(\delta_{ij} - f(x_i))$.
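Below is a minimal NumPy sketch of these activation functions and some of their derivatives; the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.where(x < 0, 0.0, x)

def relu_prime(x):
    return np.where(x < 0, 0.0, 1.0)

def leaky_relu(x, slope=0.01):
    return np.where(x < 0, slope * x, x)

def softmax(x):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()
```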

1.2.3 Pooling Layers

Pooling layers are down-sampling layers that combine the outputs of a neighborhood of neurons in the previous layer into a single neuron. If we denote by $k$ the kernel size (assume here the kernel is square), by $D_n$ the number of kernel windows, and by $Z_s$ the stride used to build the pooling layer, then for an $H_1 \times W_1 \times D_1$ input the output dimension of the pooling layer will be:

\[
H_2 \times W_2 \times D_2, \quad \text{where } (H_2, W_2, D_2) = \left( \frac{H_1 - k}{Z_s} + 1,\; \frac{W_1 - k}{Z_s} + 1,\; D_n \right)
\]

The common pooling operations are (see the sketch after this list):

• Max-pooling

• Average Pooling

• L2 norm Pooling
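A minimal NumPy sketch of max pooling with a square $k \times k$ window and stride $Z_s$, matching the output-size formula above (changing `max` to `mean` gives average pooling); names and the example sizes are illustrative.

```python
import numpy as np

def max_pool2d(X, k, Zs):
    """Max pooling of a single-channel map X with a k x k window and stride Zs."""
    H1, W1 = X.shape
    H2 = (H1 - k) // Zs + 1
    W2 = (W1 - k) // Zs + 1
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            window = X[i * Zs:i * Zs + k, j * Zs:j * Zs + k]
            out[i, j] = window.max()     # use window.mean() for average pooling
    return out

# Example: pooling a 4x4 map with a 2x2 window and stride 2 gives a 2x2 output.
print(max_pool2d(np.arange(16).reshape(4, 4), k=2, Zs=2))
```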

1.2.4 Fully Connected Dense Layers

After the pooling layers, the pixels of the pooling layers are stretched into a single column vector. These vectorized and concatenated data points are fed into dense layers, known as fully connected layers, for the classification.

1.2.5 Loss or Cost Function

A loss (or cost) function maps an event of one or more variables onto a real number associated with some cost. The loss function is used to measure the performance of the model and the inconsistency between the actual $y_i$ and the predicted value $\hat{y}_i^{L+1}$. The performance of the model increases as the value of the loss function decreases. If the output vector of all possible outputs is $y_i \in \{0, 1\}$ and an event has the set of input variables $x = \{x_1, x_2, \ldots, x_t\}$, then the mapping of $x_i$ to $y_i$ is given by

\[
\hat{y}_i^{L+1} = f(\sigma(x), w, b)
\]

for some mapping $f$, and we can define the loss function as

\[
L(\hat{y}^{L+1}, y) = \frac{1}{t} \sum_{i=1}^{t} \langle y_i, f(\sigma(x), w, b) \rangle
\]

where $L$ is the loss function, $\sigma$ is the activation function, $w$ are the weight parameters, and $b$ is the bias term. Here $\langle \cdot, \cdot \rangle$ measures the "difference" between the true $y_i$ and the result given by the network based on $\sigma(x)$, $w$, and $b$. Some common examples of loss functions are listed below (a code sketch follows the list):

• Mean Squared Error (MSE):
\[
L(\hat{y}^{L+1}, y) = \frac{1}{t} \sum_{i=1}^{t} (y_i - \hat{y}_i^{L+1})^2
\]

• Mean Squared Logarithmic Error (MSLE):

\[
L(\hat{y}^{L+1}, y) = \frac{1}{t} \sum_{i=1}^{t} \left( \ln(y_i + 1) - \ln(\hat{y}_i^{L+1} + 1) \right)^2
\]

• $L_2$ loss:
\[
L(\hat{y}^{L+1}, y) = \sum_{i=1}^{t} (y_i - \hat{y}_i^{L+1})^2
\]

• $L_1$ loss:
\[
L(\hat{y}^{L+1}, y) = \sum_{i=1}^{t} |y_i - \hat{y}_i^{L+1}|
\]

• Cross Entropy: If the probability that the output $y_i$ belongs to the training-set label $\hat{y}_i^{L+1}$ is $P(y_i \mid z_i^{l-1}) = \hat{y}_i^{L+1} = 1$, the probability that the output $y_i$ does not belong to the training-set label $\hat{y}_i^{L+1}$ is $P(y_i \mid z_i^{l-1}) = \hat{y}_i^{L+1} = 0$, and the expected label is $y$, then

\[
P(y_i \mid z_i^{l-1}) = (\hat{y}_i^{L+1})^{y_i} \, (1 - \hat{y}_i^{L+1})^{(1-y_i)}
\]

We wish to maximize the likelihood, which is equivalent to minimizing:

\[
-\ln P(y_i \mid z_i^{l-1}) = -\ln\!\left[ (\hat{y}_i^{L+1})^{y_i} \, (1 - \hat{y}_i^{L+1})^{(1-y_i)} \right]
\]

In the case of $t$ training samples, the cross entropy cost function is:

\[
L(\hat{y}^{L+1}, y) = -\frac{1}{t} \sum_{i=1}^{t} \left[ y_i \ln(\hat{y}_i^{L+1}) + (1 - y_i) \ln(1 - \hat{y}_i^{L+1}) \right]
\]
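A minimal NumPy sketch of the loss functions above; the inputs are vectors of length $t$ (true labels `y` and predictions `y_hat`), and the function names and the small `eps` guard are illustrative assumptions, not part of the notes.

```python
import numpy as np

def mse(y_hat, y):
    return np.mean((y - y_hat) ** 2)

def msle(y_hat, y):
    return np.mean((np.log(y + 1.0) - np.log(y_hat + 1.0)) ** 2)

def l2_loss(y_hat, y):
    return np.sum((y - y_hat) ** 2)

def l1_loss(y_hat, y):
    return np.sum(np.abs(y - y_hat))

def cross_entropy(y_hat, y, eps=1e-12):
    # eps guards against log(0); it is not part of the formula in the notes.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```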

2 Learning of CNNs

2.1 Feed-Forward Run

The feed-forward run (or forward propagation) can be described as follows: multiply the input values by the randomly initialized weights of each connection of every neuron, sum all of these products together with the bias to obtain the net input, and then pass the net input value through a non-linear activation function.

In a discrete setting, an image and a kernel can be represented as 3D tensors with dimensions $(H, W, C)$ and $(k_1, k_2, C)$, where the first two indices indicate the spatial coordinates and the last index indicates the color channel. The mathematical form of the convolution operation over a multi-dimensional tensor can be written as:

\[
(I \otimes K)_{ij} = \sum_{m=1}^{k_1} \sum_{n=1}^{k_2} \sum_{c=1}^{C} K_{m,n,c} \, I_{i+m,\, j+n,\, c}
\]

For a grey-scale image, the convolution process can be expressed as


\[
(I \otimes K)_{ij} = \sum_{m=1}^{k_1} \sum_{n=1}^{k_2} K_{m,n} \, I_{i+m,\, j+n}
\]

If a kernel bank $K^{p,q}_{u,v}$ is convolved with an image $I_{m,n}$ with stride value 1 and zero-padding value 0, then the feature maps of the convolution layer $C^{p,q}_{m,n}$ can be computed by

\[
C^{p,q}_{m,n} = \sum_{u=1}^{k_1} \sum_{v=1}^{k_2} I_{m-u,\, n-v} \cdot K^{p,q}_{u,v} + b^{p,q}
\]

where $b^{p,q}$ is the bias term. These feature maps are passed through a non-linear activation function $\sigma$:

\[
C^{p,q}_{m,n} = \sigma\!\left( \sum_{u=1}^{k_1} \sum_{v=1}^{k_2} I_{m-u,\, n-v} \cdot K^{p,q}_{u,v} + b^{p,q} \right)
\]

where $\sigma$ is a ReLU activation function. The max pooling layer can be calculated as

\[
P^{p,q}_{m,n} = \max(C^{p,q}_{m,n})
\]

This pooling layer $P^{p,q}$ is concatenated to form a long vector of length $p \times q$ and is fed into fully connected dense layers for the classification; the vectorized data points $a^{l-1}_i$ in layer $l-1$ are then given by

\[
a^{l-1}_i = f(P^{p,q})
\]

This long vector is fed into fully connected dense layers from the $l$-th layer up to the $(L+1)$-th layer. For example, if the fully connected part is built with $L$ layers and $n$ neurons per layer, then $l$ is the first layer. The forward run between the layers is given by:

\[
\begin{bmatrix} z^l_1 \\ \vdots \\ z^l_i \\ \vdots \end{bmatrix}
=
\begin{bmatrix}
w^l_{11} & w^l_{12} & w^l_{13} & \cdots & w^l_{1n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w^l_{i1} & w^l_{i2} & w^l_{i3} & \cdots & w^l_{in} \\
\vdots & \vdots & \vdots & \ddots & \vdots
\end{bmatrix}
\begin{bmatrix} a^{l-1}_1 \\ \vdots \\ a^{l-1}_i \\ \vdots \end{bmatrix}
+
\begin{bmatrix} b^l_1 \\ \vdots \\ b^l_i \\ \vdots \end{bmatrix}
\tag{1}
\]

or

\[
z^l = (W^l)^T a^{l-1} + b^l \tag{2}
\]
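The following is a minimal NumPy sketch of this feed-forward run for a single grey-scale image: one convolution (implemented as the $(I \otimes K)$ operation above), a ReLU, a 2×2 max pool with stride 2, flattening, and one sigmoid dense layer. All sizes and the random parameters are illustrative assumptions.

```python
import numpy as np

def conv2d(I, K, b=0.0):
    """Valid convolution of image I with kernel K, as in the (I ⊗ K) formula."""
    k1, k2 = K.shape
    H, W = I.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + k1, j:j + k2] * K) + b
    return out

def forward(I, K, b_conv, W_dense, b_dense):
    C = np.maximum(conv2d(I, K, b_conv), 0.0)                        # convolution + ReLU
    H2, W2 = C.shape[0] // 2, C.shape[1] // 2
    P = C[:2 * H2, :2 * W2].reshape(H2, 2, W2, 2).max(axis=(1, 3))   # 2x2 max pool, stride 2
    a = P.reshape(-1, 1)                                             # flatten: a^{l-1}
    z = W_dense.T @ a + b_dense                                      # z^l = (W^l)^T a^{l-1} + b^l
    return 1.0 / (1.0 + np.exp(-z))                                  # a^l = sigmoid(z^l)

# Example: a 6x6 image, a 3x3 kernel, and a dense layer mapping the
# 4 pooled values to 2 outputs.
rng = np.random.default_rng(0)
out = forward(rng.random((6, 6)), rng.random((3, 3)), 0.1,
              rng.random((4, 2)), rng.random((2, 1)))
print(out.shape)   # (2, 1)
```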


Figure 2: Forward run in a fully connected layer

Figure 3: Forward run in a single neuron in one layer

Consider a single neuron (with subscript j) in a fully connected layer l, then

\[
a^l_j = \sigma\!\left( \sum_{i=1}^{n} w^l_{ij} a^{l-1}_i + b^l_j \right)
\]

In vector form, the output $a^l$ is:

\[
a^l = \sigma((W^l)^T a^{l-1} + b^l) = \sigma(z^l)
\]

where $(a^l)^T = \sigma(z^l)^T := \left[ \sigma(z^l_1), \sigma(z^l_2), \cdots, \sigma(z^l_i), \cdots \right]$.


In the same manner, the output value of the last layer $L$ is

\[
a^L = \sigma((W^L)^T a^{L-1} + b^L) = \sigma(z^L)
\]

Expanding this to the classification layer, the final predicted output value $\hat{y}_i^{L+1}$ of a neuron unit (with subscript $i$) at the $(L+1)$-th layer will be:

\[
\hat{y}_i^{L+1} = \sigma\Big( (W^L)^T \cdots \, \sigma\big( (W^2)^T \sigma\big( (W^1)^T a^1 + b^1 \big) + b^2 \big) \cdots + b^L \Big)
\]

If the predicted value is $\hat{y}_i^{L+1}$ and the actual labeled value is $y_i$, then the performance of the model can be measured by the cross entropy loss function.

3 Back Propagation (Backward Run)

The backward run, also known as backpropagation, refers to the backward propagation of errors, which is used to compute the gradient of the loss function with respect to the parameters such as the weights and biases.

Figure 4: Back propagation in a fully connected layer

In backpropagation, the parameters $W^{L+1}, b^{L+1}, W^l, b^l, K^{p,q}, b^{p,q}$ need to be updated in order to minimize the cost function. During backward propagation, the gradient of the loss function with respect to the parameters of the final layer is computed first, and the gradient of the first layer is computed last. Partial derivatives of one layer are reused in the next by the chain rule. For the cross entropy loss:


\[
\begin{aligned}
\frac{\partial L(\hat{y}^{L+1}, y)}{\partial \hat{y}_i^{L+1}}
&= \frac{1}{t} \sum_{j=1}^{t} \frac{-\partial \left[ y_j \ln(\hat{y}_j^{L+1}) + (1 - y_j) \ln(1 - \hat{y}_j^{L+1}) \right]}{\partial \hat{y}_i^{L+1}} \\
&= \frac{1}{t} \left( \frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}} \right)
\end{aligned}
\tag{3}
\]

In the case of multiclass categorical classification, the gradient of the loss function at the classification layer $L+1$ is:

\[
\begin{bmatrix}
\dfrac{\partial L(\hat{y}^{L+1}, y)}{\partial \hat{y}_1^{L+1}} \\[8pt]
\dfrac{\partial L(\hat{y}^{L+1}, y)}{\partial \hat{y}_2^{L+1}} \\[8pt]
\vdots \\[4pt]
\dfrac{\partial L(\hat{y}^{L+1}, y)}{\partial \hat{y}_i^{L+1}} \\[8pt]
\vdots
\end{bmatrix}
= \frac{1}{t}
\begin{bmatrix}
\dfrac{-y_1}{\hat{y}_1^{L+1}} + \dfrac{1 - y_1}{1 - \hat{y}_1^{L+1}} \\[8pt]
\dfrac{-y_2}{\hat{y}_2^{L+1}} + \dfrac{1 - y_2}{1 - \hat{y}_2^{L+1}} \\[8pt]
\vdots \\[4pt]
\dfrac{-y_i}{\hat{y}_i^{L+1}} + \dfrac{1 - y_i}{1 - \hat{y}_i^{L+1}} \\[8pt]
\vdots
\end{bmatrix}
\tag{4}
\]

Now we need to calculate the derivative with respect to the weight $w^L_{uv}$ in the final layer, which is the $L$-th layer. We apply the chain rule:

\[
\begin{aligned}
\frac{\partial L(\hat{y}^{L+1}, y)}{\partial w^L_{uv}}
&= \sum_{j=1}^{t} \frac{\partial L(\hat{y}^{L+1}, y)}{\partial \hat{y}_j^{L+1}} \frac{\partial \hat{y}_j^{L+1}}{\partial w^L_{uv}} \\
&= \frac{1}{t} \sum_{j=1}^{t} \left( \frac{-y_j}{\hat{y}_j^{L+1}} + \frac{1 - y_j}{1 - \hat{y}_j^{L+1}} \right) \frac{\partial \sigma(z_j^L)}{\partial w^L_{uv}} \\
&= \frac{1}{t} \sum_{j=1}^{t} \left( \frac{-y_j}{\hat{y}_j^{L+1}} + \frac{1 - y_j}{1 - \hat{y}_j^{L+1}} \right) \frac{\partial \sigma(z_j^L)}{\partial z_j^L} \frac{\partial z_j^L}{\partial w^L_{uv}}
\end{aligned}
\tag{5}
\]

In this final layer (the $L$-th layer), the sigmoid activation function is used for the non-linear transformation:

\[
\sigma(z_i^L) = \frac{1}{1 + \exp(-z_i^L)}
\]

Also, we have

\[
z_j^L = \sum_{i=1}^{n} w^L_{ij} a_i^{L-1} + b_j^L
\]

Applying these formulas, we hence obtain

\[
\begin{aligned}
\frac{\partial L(\hat{y}^{L+1}, y)}{\partial w^L_{uv}}
&= \frac{1}{t} \sum_{j=1}^{t} \left( \frac{-y_j}{\sigma(z_j^L)} + \frac{1 - y_j}{1 - \sigma(z_j^L)} \right) \sigma(z_j^L)\big(1 - \sigma(z_j^L)\big)\, \delta_{jv}\, a_u^{L-1} \\
&= \frac{1}{t} \left( \frac{-y_v}{\sigma(z_v^L)} + \frac{1 - y_v}{1 - \sigma(z_v^L)} \right) \sigma(z_v^L)\big(1 - \sigma(z_v^L)\big)\, a_u^{L-1} \\
&= \frac{1}{t} \big( \sigma(z_v^L) - y_v \big)\, a_u^{L-1}
\end{aligned}
\tag{6}
\]

Similarly, the partial derivative of the cost function with respect to the bias $b_i^L$ of the $i$-th neuron in the $L$-th layer is


\[
\frac{\partial L(\hat{y}^{L+1}, y)}{\partial b_i^L}
= \sum_{j=1}^{t} \frac{\partial L(\hat{y}^{L+1}, y)}{\partial \sigma(z_j^L)} \frac{\partial \sigma(z_j^L)}{\partial z_j^L} \frac{\partial z_j^L}{\partial b_i^L}
= \frac{1}{t} \big[ \sigma(z_i^L) - y_i \big]
\tag{7}
\]

In order to perform the learning of convolutional nets, it is also necessary to update the kernel bank weights and bias values in the convolution layers as well as in the pooling layers.
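A minimal NumPy sketch of the output-layer gradients, taking formulas (6) and (7) literally (including the $1/t$ normalization used in the notes); the function name and the shapes are illustrative assumptions.

```python
import numpy as np

def output_layer_gradients(a_prev, z_L, y):
    """a_prev: activations a^{L-1}, shape (n, 1); z_L: pre-activations z^L, shape (m, 1);
    y: labels, shape (m, 1). Returns (dW, db) with dW of shape (n, m)."""
    t = y.shape[0]                           # 1/t factor, following the notes
    sigma = 1.0 / (1.0 + np.exp(-z_L))       # sigmoid(z^L)
    delta = (sigma - y) / t                  # (sigma(z^L_v) - y_v) / t
    dW = a_prev @ delta.T                    # entry (u, v) matches equation (6)
    db = delta                               # entry i matches equation (7)
    return dW, db
```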

3.1 Parameter Updates

To minimize the loss function, we need to update the learning parameters via a gradient descent scheme. With back propagation, we can compute the partial derivatives, so the following update scheme is computable:

\[
W^l = W^l - \alpha \frac{\partial L(\hat{y}^{L+1}, y)}{\partial W^l}
\]

\[
b^l = b^l - \alpha \frac{\partial L(\hat{y}^{L+1}, y)}{\partial b^l}
\]

\[
K^{p,q} = K^{p,q} - \alpha \frac{\partial L(\hat{y}^{L+1}, y)}{\partial K^{p,q}_{u,v}}
\]
\[
b^{p,q} = b^{p,q} - \alpha \frac{\partial L(\hat{y}^{L+1}, y)}{\partial b^{p,q}}
\]

where $\alpha$ is the learning rate.
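A minimal Python sketch of this gradient descent update: every parameter is moved against its gradient, scaled by the learning rate $\alpha$. The dictionary-based bookkeeping of parameters and gradients is an illustrative assumption, not part of the notes.

```python
def gradient_descent_step(params, grads, alpha=0.01):
    """Apply one update: theta = theta - alpha * dL/dtheta for every parameter."""
    for name in params:
        params[name] = params[name] - alpha * grads[name]
    return params

# Example usage with the dense-layer gradients sketched earlier:
# params = {"W_L": W_L, "b_L": b_L}
# grads  = {"W_L": dW, "b_L": db}
# params = gradient_descent_step(params, grads, alpha=0.1)
```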

References

[BHK] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of Data Science.

[GBC] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org

[MP] Murugan, Pushparaja. Feed forward and backward run in deep convolution neural network. arXiv preprint arXiv:1711.03278 (2017).

[YB] LeCun, Yann, and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361.10 (1995).

[Yann] LeCun, Yann. Generalization and network design strategies. Connectionism in Perspective (1989): 143-155.

[Yann2] LeCun, Yann, et al. Object recognition with gradient-based learning. Shape, Contour and Grouping in Computer Vision. Springer, Berlin, Heidelberg, 1999. 319-345.

[YBH] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature 521.7553 (2015): 436.

[Bishop] Bishop, Christopher M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[ASH] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012.
