Statistics & Discrete Methods of Data Sciences
CS395T(51800), CSE392(63625), M393C(54377)
MW 11:00a.m.-12:30p.m. GDC 5.304
Lecture Notes: Geometry of Deep Learning: Deep CNN
[email protected]

1 Convolutional Neural Network (CNN)

1.1 Introduction

Convolutional Neural Networks (CNNs) are bio-inspired artificial neural networks. Unlike traditional machine learning models, a CNN can be fed raw image pixel values rather than pre-computed feature vectors.

1.2 Architecture of CNN

[Figure 1: Architecture of CNN ([GBC, YB])]

The basic design principle of a CNN is to develop an architecture and learning algorithm that reduce the number of parameters without compromising the expressive power of the model. In general, the architecture can be read as a linear convolution operation followed by nonlinear activations, pooling layers, and a deep neural network for classification. The convolution stages act as feature detectors that process a large amount of low-level information. A complete convolution layer contains several different feature detectors, so multiple features can be extracted from the same image. A single feature detector is small compared to the input image and is slid over the image; all units of a feature detector therefore share the same weights and bias. This sharing lets the layer detect the same feature at every position of the image and, in addition, gives a degree of invariance to shifts and transformations of the input.

Local connections between pixels are used many times in the architecture. The idea of a local receptive field allows the extraction of elementary features such as edges, corners, and end points (see [Yann2] for more details). More complex, higher-order features are detected in deeper hidden layers, where these elementary features are combined. Sparse connectivity between subsequent layers, parameter sharing of weights between adjacent pixels, and equivariant representations make CNNs effective at image recognition and image classification. We now discuss the components of a CNN part by part.

1.2.1 Convolution Layers

Convolution layers are sets of parallel feature maps, formed by sliding different kernels (feature detectors) over the input and recording the element-wise dot product at each position as the feature map. Suppose the input of the convolution layer has dimension $H \times W \times C$ (channels stacked together) and each kernel in the kernel set has dimension $k_1 \times k_2 \times C$; then $K_D$, the number of kernels in the set, determines the number of output channels of the convolution layer. Further, assume a stride $Z_s$, which specifies how far the kernel slides at each step, and a zero-padding parameter $Z_p$, which controls the size of the feature maps by padding the border of the input with zeros. The output of such a convolution layer then has dimension $H_1 \times W_1 \times D_1$, where
$$(H_1, W_1, D_1) = \left( \frac{H + 2Z_p - k_1}{Z_s} + 1, \; \frac{W + 2Z_p - k_2}{Z_s} + 1, \; K_D \right)$$
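To make the dimension formula concrete, here is a minimal Python sketch (the function name and example values are ours, assuming the stride divides the padded extent evenly):

```python
def conv_output_dims(H, W, C, k1, k2, KD, Zs=1, Zp=0):
    """Output dimensions H1 x W1 x D1 of a convolution layer.

    H, W, C : input height, width, and channels
    k1, k2  : kernel height and width (each kernel is k1 x k2 x C)
    KD      : number of kernels in the set, i.e. output depth
    Zs, Zp  : stride and zero-padding
    """
    H1 = (H + 2 * Zp - k1) // Zs + 1
    W1 = (W + 2 * Zp - k2) // Zs + 1
    return H1, W1, KD

# Example: a 32x32 RGB image, 16 kernels of size 5x5x3, stride 1, no padding
print(conv_output_dims(32, 32, 3, 5, 5, 16))  # (28, 28, 16)
```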
1.2.2 Activation functions

Activation functions define the output of a neuron for a given set of inputs. The weighted sum forming the net input is passed through an activation function, which applies a non-linear transformation.

A typical activation function is based on a conditional probability that returns one or zero as the output $op$: $P(op = 1 \mid ip)$ or $P(op = 0 \mid ip)$. When the net input $ip$ crosses the threshold value, the activation function returns 1 and passes the information on to the next layer; when $ip$ is below the threshold, the activation function returns 0 and the information is not passed on. Here are some common activation functions (a NumPy sketch follows the list):

• Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$, $f'(x) = f(x)(1 - f(x))$.

• Tanh: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, $f'(x) = 1 - f(x)^2$.

• ReLU: $f(x) = \begin{cases} 0 & x < 0 \\ x & x \ge 0 \end{cases}$, $\quad f'(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$

• Leaky ReLU: $f(x) = \begin{cases} 0.01x & x < 0 \\ x & x \ge 0 \end{cases}$, $\quad f'(x) = \begin{cases} 0.01 & x < 0 \\ 1 & x \ge 0 \end{cases}$

• Softmax: $f(x_j) = \frac{e^{x_j}}{\sum_{k=1}^{d} e^{x_k}}$, $\quad \frac{\partial f(x_j)}{\partial x_i} = f(x_j)(\delta_{ij} - f(x_i))$.
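As a minimal NumPy sketch of these activations (the function names are ours; softmax is written in the shifted form for numerical stability, which leaves its value unchanged):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # f'(x) = f(x)(1 - f(x))

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2   # f'(x) = 1 - f(x)^2

def relu(x):
    return np.where(x < 0, 0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)

def softmax(x):
    e = np.exp(x - np.max(x))      # shift by max(x) for numerical stability
    return e / e.sum()
```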
1.2.3 Pooling Layers

Pooling layers are down-sampling layers that combine the outputs of a neighborhood in the previous layer into a single neuron. If we denote by $k$ the kernel size (assume for now a square kernel), by $D_n$ the number of kernel windows, and by $Z_s$ the stride used to build the pooling layer, then for an $H_1 \times W_1 \times D_1$ input the output of the pooling layer has dimension
$$H_2 \times W_2 \times D_2, \quad \text{where } (H_2, W_2, D_2) = \left( \frac{H_1 - k}{Z_s} + 1, \; \frac{W_1 - k}{Z_s} + 1, \; D_n \right)$$
Common choices of pooling operation are:

• Max pooling

• Average pooling

• L2-norm pooling

1.2.4 Fully Connected Dense Layers

After the pooling layers, the pixels of the pooling layers are stretched into a single column vector. These vectorized and concatenated data points are fed into dense layers, known as fully connected layers, for classification.

1.2.5 Loss or Cost Function

A loss function maps an event of one or more variables onto a real number associated with some cost. The loss function is used to measure the performance of the model, i.e. the inconsistency between the actual value $y_i$ and the predicted value $\hat{y}_i^{L+1}$; the performance of the model improves as the value of the loss function decreases. If the set of all possible outputs is $y_i \in \{0, 1\}$ and an event has the set of input variables $x = \{x_1, x_2, \ldots, x_t\}$, then the mapping of $x_i$ to $y_i$ is given by
$$\hat{y}_i^{L+1} = f(\sigma(x), w, b) \quad \text{for some mapping } f,$$
and we can define the loss function
$$L(\hat{y}^{L+1}, y) = \frac{1}{t} \sum_{i=1}^{t} \langle y_i, f(\sigma(x), w, b) \rangle$$
where $L$ is the loss function, $\sigma$ is the activation function, $w$ are the weight parameters, and $b$ is the bias term. Here $\langle \cdot \rangle$ measures the "difference" between the true $y_i$ and the result produced by the network from $\sigma(x)$, $w$, and $b$. Some common examples of loss functions are:

• Mean Squared Error (MSE):
$$L(\hat{y}^{L+1}, y) = \frac{1}{t} \sum_{i=1}^{t} (y_i - \hat{y}_i^{L+1})^2$$

• Mean Squared Logarithmic Error (MSLE):
$$L(\hat{y}^{L+1}, y) = \frac{1}{t} \sum_{i=1}^{t} \left( \ln(y_i + 1) - \ln(\hat{y}_i^{L+1} + 1) \right)^2$$

• L2 loss:
$$L(\hat{y}^{L+1}, y) = \sum_{i=1}^{t} (y_i - \hat{y}_i^{L+1})^2$$

• L1 loss:
$$L(\hat{y}^{L+1}, y) = \sum_{i=1}^{t} |y_i - \hat{y}_i^{L+1}|$$

• Cross Entropy: Suppose the probability that the output $y_i$ is in the training-set label is $P(y_i \mid z_i^{l-1}) = \hat{y}_i^{L+1} = 1$, and the probability that $y_i$ is not in the training-set label is $P(y_i \mid z_i^{l-1}) = \hat{y}_i^{L+1} = 0$. With expected label $y$, the two cases combine into
$$P(y_i \mid z_i^{l-1}) = (\hat{y}_i^{L+1})^{y_i} (1 - \hat{y}_i^{L+1})^{(1 - y_i)}$$
We wish to maximize the likelihood, which is equivalent to minimizing
$$-\ln P(y_i \mid z_i^{l-1}) = -\ln\left[ (\hat{y}_i^{L+1})^{y_i} (1 - \hat{y}_i^{L+1})^{(1 - y_i)} \right]$$
In the case of $t$ training samples, the cost function of Cross Entropy is
$$L(\hat{y}^{L+1}, y) = -\frac{1}{t} \sum_{i=1}^{t} \left[ y_i \ln(\hat{y}_i^{L+1}) + (1 - y_i) \ln(1 - \hat{y}_i^{L+1}) \right]$$

2 Learning of CNNs

2.1 Feed-Forward Run

A feed-forward run (or forward propagation) can be described as follows: multiply the input values by the randomly initialized weights of each connection of every neuron, sum all the products together with the bias of each neuron, and pass the resulting net input through a non-linear activation function.

In a discrete color space, an image and a kernel can be represented as 3D tensors of dimension $(H, W, C)$ and $(k_1, k_2, C)$, where the first two indices indicate the spatial coordinates (row and column of a pixel) and the last index indicates the color channel. The convolution operation over such multi-dimensional tensors can be written as
$$(I \otimes K)_{ij} = \sum_{m=1}^{k_1} \sum_{n=1}^{k_2} \sum_{c=1}^{C} K_{m,n,c} \, I_{i+m, j+n, c}$$
For a grayscale image, the convolution process reduces to
$$(I \otimes K)_{ij} = \sum_{m=1}^{k_1} \sum_{n=1}^{k_2} K_{m,n} \, I_{i+m, j+n}$$
If a kernel bank $K_{u,v}^{p,q}$ is convolved with an image $I_{m,n}$ with stride value 1 and zero-padding value 0, then the feature maps $C_{m,n}^{p,q}$ of the convolution layer can be computed by
$$C_{m,n}^{p,q} = \sum_{u=1}^{k_1} \sum_{v=1}^{k_2} I_{m-u, n-v} \cdot K_{u,v}^{p,q} + b^{p,q}$$
where $b^{p,q}$ is the bias term. These feature maps are passed through a non-linear activation function $\sigma$:
$$C_{m,n}^{p,q} = \sigma\left( \sum_{u=1}^{k_1} \sum_{v=1}^{k_2} I_{m-u, n-v} \cdot K_{u,v}^{p,q} + b^{p,q} \right)$$
where $\sigma$ is a ReLU activation function. The max-pooling layer can then be calculated as
$$P_{m,n}^{p,q} = \max(C_{m,n}^{p,q})$$
where the maximum is taken over the pooling window at position $(m, n)$. The pooling layer $P^{p,q}$ is concatenated to form a long vector of length $p \times q$, which is fed into fully connected dense layers for classification; the vectorized data points $a_i^{l-1}$ in layer $l - 1$ are given by
$$a_i^{l-1} = f(P^{p,q})$$
This long vector is then fed through the fully connected dense layers from the $l$-th layer to the $(L + 1)$-th layer.
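The following minimal NumPy sketch (our own illustrative code, under the same assumptions as above: one grayscale image, one kernel, stride 1, zero padding 0) traces this forward pass from convolution through ReLU and max pooling to the flattened vector:

```python
import numpy as np

def conv2d(I, K, b=0.0):
    """Valid convolution of grayscale image I with kernel K plus bias b."""
    k1, k2 = K.shape
    H, W = I.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sliding-window sum, matching (I (x) K)_{ij} above
            out[i, j] = np.sum(I[i:i + k1, j:j + k2] * K) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(C, k=2, Zs=2):
    """Max pooling with window k and stride Zs: dims (H1-k)/Zs + 1 per side."""
    H, W = C.shape
    H2, W2 = (H - k) // Zs + 1, (W - k) // Zs + 1
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            out[i, j] = C[i * Zs:i * Zs + k, j * Zs:j * Zs + k].max()
    return out

# Forward pass: convolution -> ReLU -> max pooling -> flatten
I = np.random.rand(8, 8)    # toy grayscale image
K = np.random.randn(3, 3)   # one 3x3 kernel
a = max_pool(relu(conv2d(I, K))).ravel()  # vector fed to the dense layers
```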
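Continuing the sketch, the flattened vector would then pass through the dense layers and be scored with a loss. Here is a self-contained toy version using the Cross Entropy cost from Section 1.2.5 (the single output neuron, its weights, and the layer size are hypothetical choices of ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(y_hat, y, eps=1e-12):
    """Average cross-entropy between predicted probabilities and 0/1 labels."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid ln(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy dense layer acting on a flattened pooling vector a (as produced above)
a = np.random.rand(9)                   # stand-in for the flattened P^{p,q}
W = 0.1 * np.random.randn(1, a.size)    # weights of one output neuron
b = np.zeros(1)
y_hat = sigmoid(W @ a + b)              # predicted probability \hat{y}
print(cross_entropy(y_hat, np.array([1.0])))  # loss for true label y_i = 1
```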