
In The Name of Allah

ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Iman Jami Moghaddam
Digital Image Processing, Dey 96

Outline

• Problem Definition: Image classification
• Artificial Neural Networks
• Convolutional Neural Networks
• AlexNet

Problem Definition: Image classification
Detailed explanation

Classification Problem

• Given: a training set, a labeled set of $N$ input-output pairs $D = \{(x_i, y_i)\}$, $y_i \in \{1, \dots, K\}$
• Goal: given an input $x$, assign it to one of the $K$ classes

ImageNet LSVRC-2010

• ImageNet is a dataset of over 15 million labeled high-resolution (variable-resolution) images belonging to roughly 22,000 categories.
• LSVRC uses a subset of ImageNet. The dataset has 1.2 million high-resolution training images and 150,000 testing images.
• The classification task: get the “correct” class in your top 5 bets. There are 1000 classes.
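The top-5 criterion can be made concrete with a small numpy sketch; the function name and the toy scores below are illustrative and not part of the LSVRC tooling.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true class is not among the k highest-scoring classes."""
    # Indices of the k largest scores per example (order within the top k does not matter)
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy usage: 4 examples, 10 classes (LSVRC uses 1000 classes)
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))
labels = rng.integers(0, 10, size=4)
print(top_k_error(scores, labels, k=5))
```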

Artificial Neural Networks
Detailed explanation

A typical neuron

• There is one axon that branches.
• There is a dendritic tree that collects input from other neurons.
• A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.
• There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.

Synapse

• When a spike of activity travels along an axon and arrives at a synapse, it causes vesicles of transmitter chemical to be released.
• There are several kinds of transmitter.
• The transmitter molecules diffuse across the synaptic cleft and bind to receptor molecules in the membrane of the post-synaptic neuron, thus changing their shape.

How the brain works

• Each neuron receives inputs from other neurons.
• The effect of each input line on the neuron is controlled by a synaptic weight.
• The synaptic weights adapt so that the whole network learns to perform useful computations.
• You have about $10^{11}$ neurons, each with about $10^{4}$ weights.

Artificial neuron

• A set of weights $W$

• $Y = \sum_j w_j x_j$
• The output is a function of $Y$
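A minimal numpy sketch of such a unit; the bias term and the tanh activation are illustrative choices, not specified on the slide.

```python
import numpy as np

def neuron(x, w, b=0.0, activation=np.tanh):
    """Artificial neuron: weighted sum of the inputs passed through an activation function."""
    y = np.dot(w, x) + b          # Y = sum_j w_j * x_j (plus an optional bias)
    return activation(y)

# Toy usage with made-up inputs and weights
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron(x, w))
```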

Perceptrons

• They were popularised by Frank Rosenblatt in the early 1960’s.
• Binary threshold neurons (McCulloch-Pitts)

Perceptron Learning

• Pick training cases using any policy that ensures that every training case will keep getting picked.
• If the output unit is correct, leave its weights alone.
• If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
• If the output unit incorrectly outputs a one, subtract the input vector from the weight vector.
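A minimal numpy sketch of this rule, assuming a single binary threshold unit with the bias folded into the weights; the AND-function data is only a toy example.

```python
import numpy as np

def train_perceptron(X, t, epochs=10):
    """Perceptron learning rule for a single binary threshold unit.

    X: (num_cases, num_features) inputs (include a constant 1 column for the bias)
    t: (num_cases,) target outputs, each 0 or 1
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                  # every training case keeps getting picked
        for x, target in zip(X, t):
            y = 1 if np.dot(w, x) >= 0 else 0
            if y == target:
                continue                     # correct: leave the weights alone
            elif target == 1:                # incorrectly output a zero: add the input vector
                w += x
            else:                            # incorrectly output a one: subtract the input vector
                w -= x
    return w

# Toy usage: learn the AND function (inputs augmented with a bias feature of 1)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
print(train_perceptron(X, t))
```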

Limitations of perceptrons

• The perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every “generously feasible” set of weights.
• This guarantee cannot be extended to networks with hidden units, so “multi-layer” neural networks do not use the perceptron learning procedure.

A different look at the learning procedure

• Instead of showing that the weights get closer to a good set of weights, show that the actual output values get closer to the target values.
• The simplest example is a linear neuron with a squared error measure.

Linear neurons

• $y = b + \sum_i x_i w_i$

• $E = \tfrac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$

Changing weights
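The original slide here is a figure; as a sketch of the gradient-descent update it presumably illustrates for this linear neuron and error measure (with $\varepsilon$ a learning rate, my notation):

$$
\frac{\partial E}{\partial w_i}
  = \frac{1}{2}\sum_{n}\frac{\partial\,(t^{n}-y^{n})^{2}}{\partial w_i}
  = -\sum_{n} x_i^{n}\,(t^{n}-y^{n}),
\qquad
\Delta w_i = -\varepsilon\,\frac{\partial E}{\partial w_i}
  = \varepsilon \sum_{n} x_i^{n}\,(t^{n}-y^{n})
$$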

Non-Linear neurons

• Sigmoid

• Tanh

• ReLU
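A minimal sketch of these three nonlinearities; the numpy implementations are my own, only the names come from the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); saturates for large |z|

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1); also saturating

def relu(z):
    return np.maximum(0.0, z)         # f(z) = max(0, z); non-saturating for z > 0

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```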

Learning with hidden units

• Networks without hidden units are very limited in the input-output mappings they can model.
• Randomly perturb one weight and see if it improves performance. If so, save the change.
• Very inefficient.

The backpropagation algorithm

• We don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.
• Instead of using desired activities to train the hidden units, use error derivatives w.r.t. the hidden activities.
• We can compute error derivatives for all the hidden units efficiently at the same time.
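A minimal numpy sketch of this idea for a two-layer net: compute the error derivatives w.r.t. the hidden activities, then reuse them for the weights below. The layer sizes, learning rate, and toy data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny net: 3 inputs -> 4 sigmoid hidden units -> 1 linear output, squared error
W1 = rng.normal(scale=0.1, size=(3, 4))
W2 = rng.normal(scale=0.1, size=(4, 1))
x = rng.normal(size=(5, 3))                 # 5 toy training cases
t = rng.normal(size=(5, 1))                 # toy targets

for step in range(100):
    # Forward pass
    h = sigmoid(x @ W1)                     # hidden activities
    y = h @ W2                              # outputs
    E = 0.5 * np.sum((t - y) ** 2)

    # Backward pass: error derivatives w.r.t. activities, then w.r.t. weights
    dE_dy = y - t                           # derivative of the squared error
    dE_dW2 = h.T @ dE_dy
    dE_dh = dE_dy @ W2.T                    # derivatives w.r.t. the hidden activities
    dE_dz1 = dE_dh * h * (1 - h)            # back through the sigmoid
    dE_dW1 = x.T @ dE_dz1

    # Gradient-descent weight update
    W1 -= 0.1 * dE_dW1
    W2 -= 0.1 * dE_dW2

print(E)
```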

Changing weights

Multi-class classification - SVM

Multi-class classification - Softmax

Convolutional Neural Networks
Introduction

Viewpoint invariance methods

• Use redundant invariant features: extract a large, redundant set of features that are invariant under transformations.
• Put a box around the object and use normalized pixels. But choosing the box is difficult because of segmentation errors, occlusion, and unusual orientations; we need to recognize the shape to get the box right!
• The replicated feature approach (CNN)

CNN

• Convolutional Neural Networks take advantage of the fact that the input consists of images, and they constrain the architecture in a more sensible way.
• In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth.

Example Architecture

• INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R, G, B.
• CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in a volume such as [32x32x12] if we decided to use 12 filters.
• RELU layer will apply an elementwise activation function, such as the max(0, x) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
• POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
• FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.
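A minimal PyTorch sketch of this INPUT → CONV → RELU → POOL → FC stack; only the volume shapes come from the text, while the 3x3 kernel, the padding, and the use of PyTorch are my own illustrative choices.

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),  # [32x32x3] -> [32x32x12] (12 filters)
    nn.ReLU(),                                   # elementwise max(0, x), shape unchanged
    nn.MaxPool2d(kernel_size=2),                 # [32x32x12] -> [16x16x12]
    nn.Flatten(),
    nn.Linear(12 * 16 * 16, 10),                 # class scores for the 10 CIFAR-10 categories
)

x = torch.randn(1, 3, 32, 32)                    # one random CIFAR-10-sized image
print(net(x).shape)                              # torch.Size([1, 10])
```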

Architecture Overview

Convolutional Layer

• The CONV layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
• As we slide the filter over the width and height of the input volume, we produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.
• For example, suppose that the input volume has size [32x32x3] (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (plus 1 bias parameter).

• Three hyperparameters control the size of the output volume:
  • Depth
  • Stride
  • Zero-padding

• Size of the output of each filter: (W − F + 2P)/S + 1
• Backpropagation learning
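A quick sketch that plugs numbers into this formula; the CIFAR-sized values are just for illustration.

```python
def conv_output_size(W, F, P, S):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    size = (W - F + 2 * P) / S + 1
    assert size.is_integer(), "hyperparameters must tile the input exactly"
    return int(size)

# 32x32 input, 5x5 filter, no padding, stride 1 -> 28x28 activation map
print(conv_output_size(32, 5, 0, 1))   # 28
# Zero-padding with P = 2 preserves the spatial size
print(conv_output_size(32, 5, 2, 1))   # 32
```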

Pooling Layer

• Its function is to progressively reduce the spatial size of the representation, to reduce the amount of parameters and computation in the network, and hence to also control overfitting.
• The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
• Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass.
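A minimal numpy sketch of 2x2 max pooling with stride 2 on a single depth slice; the toy 4x4 slice is illustrative.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on one depth slice (H and W assumed even)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

slice_ = np.arange(16).reshape(4, 4)
print(max_pool_2x2(slice_))   # [[ 5  7]
                              #  [13 15]]
```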

Comparison with a fully connected network

• Compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters, and so they are easier to train.
• Their theoretically-best performance is likely to be only slightly worse.
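To make "much fewer parameters" concrete, here is a back-of-the-envelope count for the earlier [32x32x3] → [32x32x12] example, assuming 5x5 filters for the conv layer (an illustrative choice).

```python
# Parameters needed to map a [32x32x3] volume to a [32x32x12] volume
conv_params = 12 * (5 * 5 * 3 + 1)              # 12 shared filters, each 5x5x3 weights + 1 bias
fc_params = (32 * 32 * 12) * (32 * 32 * 3 + 1)  # every output unit connected to every input
print(conv_params, fc_params)                   # 912 vs. 37,761,024
```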

AlexNet

Architecture of AlexNet

• The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels.
• The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5×5×48.
• The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers.
• The third convolutional layer has 384 kernels of size 3×3×256 connected to the (normalized, pooled) outputs of the second convolutional layer.
• The fourth convolutional layer has 384 kernels of size 3×3×192.
• The fifth convolutional layer has 256 kernels of size 3×3×192.
• The fully-connected layers have 4096 neurons each.
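A single-stream PyTorch sketch of this stack. The paper splits most layers across two GPUs (hence kernel depths like 48 and 192 above), so the full-depth kernels, the padding values, and the omission of response normalization and dropout here are simplifications of mine.

```python
import torch
from torch import nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),   # conv1
    nn.MaxPool2d(kernel_size=3, stride=2),                              # overlapping pool
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),            # conv2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),           # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),           # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),           # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),                            # fc6
    nn.Linear(4096, 4096), nn.ReLU(),                                   # fc7
    nn.Linear(4096, 1000),                                              # 1000-way class scores
)

x = torch.randn(1, 3, 224, 224)
print(alexnet(x).shape)   # torch.Size([1, 1000])
```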

ReLU Nonlinearity

• The standard way to model a neuron’s output f as a function of its input x is with f(x) = tanh(x).
• In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x).

Training on multiple GPUs

• A single GTX 580 GPU has only 3 GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU.
• Therefore we spread the net across two GPUs.

Overlapping pooling

• If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling.
• This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions.
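A quick PyTorch check of the two schemes on a conv1-sized volume; the 55×55 input is an illustrative choice.

```python
import torch
from torch import nn

x = torch.randn(1, 96, 55, 55)                        # e.g. a conv1-sized output volume
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # z = 3, s = 2 (s < z)
traditional = nn.MaxPool2d(kernel_size=2, stride=2)   # z = 2, s = 2 (s = z)
print(overlapping(x).shape, traditional(x).shape)     # both 1x96x27x27; only the windows overlap
```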

Intuition of first-layer features

Results

Qualitative Evaluations

After AlexNet …

Now: Dynamic Routing Between Capsules

Geoffrey Hinton and Sara Sabour, holding a two-piece pyramid puzzle, are researching a system that could let computers see more like humans at a laboratory in Toronto.

Any Questions?!

Thanks!