
CS7015 (Deep Learning) : Lecture 11
Convolutional Neural Networks, LeNet, AlexNet, ZF-Net, VGGNet, GoogLeNet and ResNet

Mitesh M. Khapra

Department of Computer Science and Engineering Indian Institute of Technology Madras

Module 11.1 : The convolution operation

Suppose we are tracking the position of an aeroplane using a laser sensor at discrete time intervals, giving measurements x_0, x_1, x_2, ...

Now suppose our sensor is noisy. To obtain a less noisy estimate we would like to average several measurements. More recent measurements are more important, so we would like to take a weighted average:

$s_t = (x \ast w)_t = \sum_{a=0}^{\infty} x_{t-a}\, w_{-a}$

Here x is the input, w is the filter, and the operation is called a convolution.

In practice, we would only sum over a small window:

$s_t = \sum_{a=0}^{6} x_{t-a}\, w_{-a}$

The weight array w is known as the filter. We just slide the filter over the input and compute the value of s_t based on a window around x_t. For example:

W (filter): w_{-6} = 0.01, w_{-5} = 0.01, w_{-4} = 0.02, w_{-3} = 0.02, w_{-2} = 0.04, w_{-1} = 0.4, w_0 = 0.5
X (input):  1.00  1.10  1.20  1.40  1.70  1.80  1.90  2.10  2.20  2.40  2.50  2.70
S (output): 1.80  1.96  2.11  2.16  2.28  2.42   (the first entry is s_6)

$s_6 = x_6 w_0 + x_5 w_{-1} + x_4 w_{-2} + x_3 w_{-3} + x_2 w_{-4} + x_1 w_{-5} + x_0 w_{-6}$

Here the input (and the kernel) is one dimensional. Can we use a convolutional operation on a 2D input also?

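This sliding-window weighted average is easy to check numerically. Below is a minimal NumPy sketch (the array names and the use of np.convolve are illustrative choices, not part of the slides), using the filter taps from the W row and the measurements from the X row:

```python
import numpy as np

# Filter taps ordered w_{-6}, w_{-5}, ..., w_{-1}, w_0 (the W row above)
w = np.array([0.01, 0.01, 0.02, 0.02, 0.04, 0.4, 0.5])
# Noisy measurements x_0, x_1, ... (the X row above)
x = np.array([1.00, 1.10, 1.20, 1.40, 1.70, 1.80,
              1.90, 2.10, 2.20, 2.40, 2.50, 2.70])

# s_t = sum_{a=0}^{6} x_{t-a} * w_{-a}: a weighted average of x_t and the six
# measurements before it, with the recent measurements weighted more heavily.
T = len(w)
s = np.array([x[t - T + 1: t + 1] @ w for t in range(T - 1, len(x))])
print(np.round(s, 2))   # s_6, s_7, ... : the smoothed estimates

# np.convolve flips its second argument, so flipping w first gives the same values
print(np.round(np.convolve(x, w[::-1], mode="valid"), 2))
```
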

We can think of images as 2D inputs. We would now like to use a 2D filter (m × n). First let us see what the 2D formula looks like:

$S_{ij} = (I \ast K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\, j-b}\, K_{a,b}$

This formula looks at all the preceding neighbours (i − a, j − b). In practice, we use the following formula, which looks at the succeeding neighbours:

$S_{ij} = (I \ast K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i+a,\, j+b}\, K_{a,b}$

Let us apply this idea to a toy example and see the result.

Input (3 × 4):
a b c d
e f g h
i j k l

Kernel (2 × 2):
w x
y z

Output (2 × 3):
aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz

For the rest of the discussion we will use the following formula for convolution:

$S_{ij} = (I \ast K)_{ij} = \sum_{a=\lfloor -\frac{m}{2} \rfloor}^{\lfloor \frac{m}{2} \rfloor} \; \sum_{b=\lfloor -\frac{n}{2} \rfloor}^{\lfloor \frac{n}{2} \rfloor} I_{i-a,\, j-b}\, K_{\frac{m}{2}+a,\, \frac{n}{2}+b}$

In other words, we will assume that the kernel is centered on the pixel of interest, so we will be looking at both the preceding and the succeeding neighbours.

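The centered formula can be translated almost literally into code. The following is an illustrative NumPy sketch (the function name, the assumption of an odd-sized kernel and the choice to leave border pixels at zero are assumptions of the sketch, not something fixed by the slides):

```python
import numpy as np

def conv2d_centered(I, K):
    """S_ij = sum_a sum_b I[i-a, j-b] * K[cm+a, cn+b], with the kernel centered
    on the pixel of interest (i, j). Border pixels, where the kernel would cross
    the image boundary, are simply left at zero here."""
    m, n = K.shape
    cm, cn = m // 2, n // 2          # kernel centre (assumes odd m and n)
    H, W = I.shape
    S = np.zeros((H, W))
    for i in range(cm, H - cm):
        for j in range(cn, W - cn):
            total = 0.0
            for a in range(-cm, cm + 1):
                for b in range(-cn, cn + 1):
                    total += I[i - a, j - b] * K[cm + a, cn + b]
            S[i, j] = total
    return S

# a 3x3 averaging kernel applied to a small test image
I = np.arange(36, dtype=float).reshape(6, 6)
K = np.ones((3, 3)) / 9.0
print(conv2d_centered(I, K))
```
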
Let us see some examples of 2D convolutions applied to images.

Convolving an image with the kernel
1 1 1
1 1 1
1 1 1
blurs the image.

Convolving with
 0 -1  0
-1  5 -1
 0 -1  0
sharpens the image.

Convolving with
 1  1  1
 1 -8  1
 1  1  1
detects the edges.

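As a working example in code, the three kernels above can be applied with SciPy's convolve2d; the random test image and the 1/9 normalisation of the averaging kernel are assumptions made for this sketch:

```python
import numpy as np
from scipy.signal import convolve2d

blur    = np.ones((3, 3)) / 9.0          # averaging kernel; dividing by 9 keeps the intensity range
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)
edges   = np.array([[ 1,  1,  1],
                    [ 1, -8,  1],
                    [ 1,  1,  1]], dtype=float)

img = np.random.rand(64, 64)             # stand-in for a grayscale image

for name, K in [("blur", blur), ("sharpen", sharpen), ("edge", edges)]:
    out = convolve2d(img, K, mode="same", boundary="symm")
    print(name, out.shape)               # each output has the same spatial size as the input
```
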
We will now see a working example of 2D convolution.

We just slide the kernel over the input image. Each time we slide the kernel we get one value in the output. The resulting output is called a feature map. We can use multiple filters to get multiple feature maps.

Question: In the 1D case, we slide a one dimensional filter over a one dimensional input. In the 2D case, we slide a two dimensional filter over a two dimensional input. What would happen in the 3D case?

What would a 3D filter look like? It will be 3D, and we will refer to it as a volume. Once again we will slide the volume over the 3D input (for example an image with R, G and B channels) and compute the convolution operation.

Note that in this lecture we will assume that the filter always extends to the depth of the image. In effect, we are doing a 2D convolution operation on a 3D input (because the filter moves along the height and the width, but not along the depth). As a result the output will be 2D (only width and height, no depth). Once again we can apply multiple filters to get multiple feature maps.

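A sketch of such volume filters on a 3-channel input: each filter has the same depth as the image, slides only along the height and the width, and therefore produces one 2D feature map, so K filters give K feature maps (all sizes, names and random values below are illustrative):

```python
import numpy as np

def conv_volumes(image, filters):
    """image: H x W x D (e.g. RGB, D = 3); filters: K x f x f x D.
    Each filter spans the full depth, so it slides only along height and width
    and contributes one 2D feature map; the output is K x (H-f+1) x (W-f+1)."""
    H, W, D = image.shape
    K, f, _, d = filters.shape
    assert d == D, "filter depth must equal input depth"
    out = np.zeros((K, H - f + 1, W - f + 1))
    for k in range(K):
        for i in range(H - f + 1):
            for j in range(W - f + 1):
                out[k, i, j] = np.sum(image[i:i + f, j:j + f, :] * filters[k])
    return out

rgb     = np.random.rand(32, 32, 3)      # a 32 x 32 input with 3 channels
filters = np.random.rand(4, 5, 5, 3)     # K = 4 volume filters of size 5 x 5 x 3
print(conv_volumes(rgb, filters).shape)  # (4, 28, 28): four 2D feature maps
```
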
Module 11.2 : Relation between input size, output size and filter size

So far we have not said anything explicit about the dimensions of (1) the inputs, (2) the filters, (3) the outputs, and the relations between them. We will see how they are related, but before that we will define a few quantities:

the Width (W1), Height (H1) and Depth (D1) of the original input,
the stride S (we will come back to this later),
the number of filters K, and
the spatial extent (F) of each filter (the depth of each filter is the same as the depth of the input).

The output is W2 × H2 × D2 (we will soon see a formula for computing W2, H2 and D2).

Let us compute the dimensions (W2, H2) of the output. Notice that we cannot place the kernel at the corners, as it would cross the input boundary. This is true for all the shaded points (the kernel crosses the input boundary), and it results in an output which is of smaller dimensions than the input.

As the size of the kernel increases, this becomes true for even more pixels: for example, with a 5 × 5 kernel we get an even smaller output. In general,

$W_2 = W_1 - F + 1$
$H_2 = H_1 - F + 1$

We will refine this formula further.

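A one-line check of this shrinkage (SciPy's "valid" mode computes exactly the corner-respecting convolution described above; the 7 × 7 test input is an arbitrary choice):

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.rand(7, 7)                     # W1 = H1 = 7
k = np.random.rand(3, 3)                     # F = 3
print(convolve2d(x, k, mode="valid").shape)  # (5, 5), i.e. W1 - F + 1 = 5
```
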
What if we want the output to be of the same size as the input? We can use something known as padding: pad the input with an appropriate number of 0 inputs so that we can now apply the kernel at the corners. Let us use pad P = 1 with a 3 × 3 kernel; this means we add one row and one column of 0 inputs at the top, bottom, left and right. We now have,

$W_2 = W_1 - F + 2P + 1$
$H_2 = H_1 - F + 2P + 1$

We will refine this formula further.


What does the stride S do? It defines the intervals at which the filter is applied (here S = 2). Here, we are essentially skipping every 2nd pixel, which will again result in an output which is of smaller dimensions.

So what should our final formula look like?

W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1

22/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
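The output-size formula above is easy to check in code. Below is a minimal sketch in plain Python (the function name conv_output_size is ours, not something defined in the lecture):

def conv_output_size(w1, f, p, s):
    """Spatial output size of a convolution: floor((W1 - F + 2P) / S) + 1."""
    return (w1 - f + 2 * p) // s + 1

# Same-size output with a 3x3 kernel, stride 1, pad 1:
assert conv_output_size(32, 3, 1, 1) == 32
# The two exercises that follow in the lecture:
assert conv_output_size(227, 11, 0, 4) == 55   # 11x11 kernels, stride 4, no padding
assert conv_output_size(32, 5, 0, 1) == 28     # 5x5 kernels, stride 1, no padding
# With K filters the output volume is K x W2 x H2, i.e. D2 = K.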

Finally, coming to the depth of the output. Each filter gives us one 2D output, and K filters will give us K such 2D outputs. We can think of the resulting output as a K × W2 × H2 volume. Thus D2 = K.

W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K

[Figure: an input volume of size W1 × H1 × D1 convolved with K filters gives an output volume of size W2 × H2 × D2.]

23/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

Let us do a few exercises.

Input 227 × 227 × 3, with 96 filters of size 11 × 11, Stride = 4, Padding = 0:
W2 = (227 − 11)/4 + 1 = 55
H2 = (227 − 11)/4 + 1 = 55
D2 = 96

24/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

Input 32 × 32 × 1, with 6 filters of size 5 × 5, Stride = 1, Padding = 0:
W2 = (32 − 5)/1 + 1 = 28
H2 = (32 − 5)/1 + 1 = 28
D2 = 6

25/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

25/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Module 11.3 : Convolutional Neural Networks

26/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

Putting things into perspective: what is the connection between this operation (convolution) and neural networks? We will try to understand this by considering the task of “image classification”.

27/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

27/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Features

Raw pixels car, bus, monument, flower

Edge Detector car, bus, monument, flower

SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 car, bus, monument, flower

Edge Detector car, bus, monument, flower

SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

Features

Raw pixels

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Edge Detector car, bus, monument, flower

SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

Features

Raw pixels car, bus, monument, flower

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Edge Detector car, bus, monument, flower

SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

Features

Raw pixels car, bus, monument, flower

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 car, bus, monument, flower

SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

Features

Raw pixels car, bus, monument, flower

Edge Detector

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

Features

Raw pixels car, bus, monument, flower

Edge Detector car, bus, monument, flower

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

Features

Raw pixels car, bus, monument, flower

Edge Detector car, bus, monument, flower

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

Features

Raw pixels car, bus, monument, flower

Edge Detector car, bus, monument, flower

SIF T/HOG

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 static feature extraction (no learning) learning weights of classifier

Features

Raw pixels car, bus, monument, flower

Edge Detector car, bus, monument, flower

SIF T/HOG car, bus, monument, flower

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Features

Raw pixels car, bus, monument, flower

Edge Detector car, bus, monument, flower

SIF T/HOG car, bus, monument, flower

static feature extraction (no learning) learning weights of classifier

28/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 car, bus, monument, flower

-1.21358689e-03 3.23652686e-03 ······ -2.06615720e-02 -1.52757822e-03 2.36130832e-03 ······ -1.19824838e-02 ...... Learn these weights ...... -8.25322699e-04 -5.14897937e-03 ······ -9.90395527e-03

Input Features Classifier

car, bus, monument, flower

0 0 0 0 0 0 1 1 1 0 0 1 -8 1 0 0 1 1 1 0 0 0 0 0 0

Instead of using handcrafted kernels such as edge detectors, can we learn meaningful kernels/filters in addition to learning the weights of the classifier?

29/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
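To make the "handcrafted kernel" baseline concrete, here is a small sketch that applies the 3 × 3 core of the edge-detecting kernel shown above to an image with plain NumPy. The image array and the helper name are made up for illustration; nothing here comes from the lecture itself.

import numpy as np

def convolve2d_valid(image, kernel):
    """'Valid' 2D convolution (no padding, stride 1), written out explicitly."""
    H, W = image.shape
    F = kernel.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + F, j:j + F] * kernel)
    return out

# The handcrafted edge detector from the slide (a Laplacian-like filter).
edge_kernel = np.array([[1, 1, 1],
                        [1, -8, 1],
                        [1, 1, 1]], dtype=float)

image = np.random.rand(32, 32)      # stand-in for a real grayscale image
edges = convolve2d_valid(image, edge_kernel)
print(edges.shape)                  # (30, 30), consistent with W1 - F + 1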


Instead of using handcrafted kernels such as edge detectors can we learn meaningful ker- nels/filters in addition to learning the weights of the classifier? 29/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 -0.02337041-0.01871333 -0.03243878 -0.01075948············ 0.04684572-0.04728875 -0.053751580.00104325 -0.05350766 0.01935937······ ······-0.043236740.01016542 ...... -0.007925010.03008777 -0.00503319 0.00335217 ············ -0.027911280.00174674

Input Features Classifier

car, bus, monument, flower

0 0 0 0 0 0 1 1 1 0 0 1 -8 1 0 0 1 1 1 0 0 0 0 0 0

car, bus, monument, flower

-1.21358689e-03 3.23652686e-03 ······ -2.06615720e-02 -1.52757822e-03 2.36130832e-03 ······ -1.19824838e-02 ...... -8.25322699e-04 -5.14897937e-03 ······ -9.90395527e-03 Even better: Instead of using handcrafted kernels (such as edge detectors)can we learn multiple meaningful kernels/filters in addition to learning the weights of the clas- sifier? 30/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 -1.21358689e-03-0.01871333 -0.01075948 3.23652686e-03 ············ -2.06615720e-020.04684572 -1.52757822e-030.00104325 0.01935937 2.36130832e-03 ············ -1.19824838e-020.01016542 ...... -8.25322699e-040.03008777 0.00335217-5.14897937e-03 ············ -9.90395527e-03-0.02791128

Input Features Classifier

car, bus, monument, flower

0 0 0 0 0 0 1 1 1 0 0 1 -8 1 0 0 1 1 1 0 0 0 0 0 0

car, bus, monument, flower

-0.02337041 -0.03243878 ······ -0.04728875 -0.05375158 -0.05350766 ······ -0.04323674 ...... -0.00792501 -0.00503319 ······ 0.00174674 Even better: Instead of using handcrafted kernels (such as edge detectors)can we learn multiple meaningful kernels/filters in addition to learning the weights of the clas- sifier? 30/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 -0.02337041-1.21358689e-03 -0.03243878 3.23652686e-03············ -2.06615720e-02-0.04728875 -0.05375158-1.52757822e-03 -0.05350766 2.36130832e-03······ ······-0.04323674-1.19824838e-02 ...... -0.00792501-8.25322699e-04 -0.00503319 -5.14897937e-03············ -9.90395527e-030.00174674

Input Features Classifier

car, bus, monument, flower

0 0 0 0 0 0 1 1 1 0 0 1 -8 1 0 0 1 1 1 0 0 0 0 0 0

car, bus, monument, flower

-0.01871333 -0.01075948 ······ 0.04684572 0.00104325 0.01935937 ······ 0.01016542 ...... 0.03008777 0.00335217 ······ -0.02791128 Even better: Instead of using handcrafted kernels (such as edge detectors)can we learn multiple meaningful kernels/filters in addition to learning the weights of the clas- sifier? 30/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Input Classifier

car, bus, monument, flower

-1.21358689e-03 3.23652686e-03 ······ -2.06615720e-02 -0.01112582 0.02185669 ······ 0.00015161 -1.52757822e-03 2.36130832e-03 ······ -1.19824838e-02 -0.00687587 0.01229961 ······ 0.00214013 ...... ...... -8.25322699e-04 -5.14897937e-03 ······ -9.90395527e-03 -0.00372989 -0.00886137 ······ -0.01974954

Yes, we can ! Simply by treating these kernels as parameters and learning them in addition to the weights of the classifier (using back propagation) Such a network is called a Convolutional Neural Network.

Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier?

31/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 -1.21358689e-03 3.23652686e-03 ······ -2.06615720e-02 -0.01112582 0.02185669 ······ 0.00015161 -1.52757822e-03 2.36130832e-03 ······ -1.19824838e-02 -0.00687587 0.01229961 ······ 0.00214013 ...... backpropagation ...... -8.25322699e-04 -5.14897937e-03 ······ -9.90395527e-03 -0.00372989 -0.00886137 ······ -0.01974954

Simply by treating these kernels as parameters and learning them in addition to the weights of the classifier (using back propagation) Such a network is called a Convolutional Neural Network.

Input Classifier

car, bus, monument, flower

Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier? Yes, we can !

31/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Such a network is called a Convolutional Neural Network.

Input Classifier

car, bus, monument, flower

-1.21358689e-03 3.23652686e-03 ······ -2.06615720e-02 -0.01112582 0.02185669 ······ 0.00015161 -1.52757822e-03 2.36130832e-03 ······ -1.19824838e-02 -0.00687587 0.01229961 ······ 0.00214013 ...... backpropagation ...... -8.25322699e-04 -5.14897937e-03 ······ -9.90395527e-03 -0.00372989 -0.00886137 ······ -0.01974954

Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier? Yes, we can ! Simply by treating these kernels as parameters and learning them in addition to the weights of the classifier (using back propagation)

31/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Input Classifier

car, bus, monument, flower

-1.21358689e-03 3.23652686e-03 ······ -2.06615720e-02 -0.01112582 0.02185669 ······ 0.00015161 -1.52757822e-03 2.36130832e-03 ······ -1.19824838e-02 -0.00687587 0.01229961 ······ 0.00214013 ...... backpropagation ...... -8.25322699e-04 -5.14897937e-03 ······ -9.90395527e-03 -0.00372989 -0.00886137 ······ -0.01974954

Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier? Yes, we can ! Simply by treating these kernels as parameters and learning them in addition to the weights of the classifier (using back propagation)

Such a network is called a Convolutional Neural Network.

31/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
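A minimal sketch of this idea, assuming PyTorch (the layer sizes and names below are illustrative, not the lecture's): the convolution kernels are ordinary parameters, so the same backpropagation step that updates the classifier weights also updates the kernels.

import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # The 6 kernels of size 5x5 are learnable parameters, not handcrafted.
        self.conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.pool = nn.MaxPool2d(2)
        self.classifier = nn.Linear(6 * 14 * 14, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x))   # (N, 6, 28, 28) for 32x32 inputs
        x = self.pool(x)               # (N, 6, 14, 14)
        return self.classifier(x.flatten(1))

model = TinyConvNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # kernels and classifier together
x, y = torch.randn(8, 1, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()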

Okay, I get it that the idea is to learn the kernels/filters by just treating them as parameters of the classification model. But how is this different from a regular feedforward neural network? Let us see.

32/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

32/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 10 classes(digits)

This is what a regular feed-forward neural network will look like ... There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11 16 Contrast this to what happens in the case of convolution

2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 10 classes(digits)

This is what a regular feed-forward neural network will look like ... There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 This is what a regular feed-forward neural network will look like ... There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits)

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 This is what a regular feed-forward neural network will look like ... There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits)

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 This is what a regular feed-forward neural network will look like ... There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits)

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 This is what a regular feed-forward neural network will look like

There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits) ...

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 This is what a regular feed-forward neural network will look like

There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits) ...

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 This is what a regular feed-forward neural network will look like

There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits) ...

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 This is what a regular feed-forward neural network will look like

There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits) ...

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits)

This is what a regular feed-forward neural network will look like ...

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 For example all the 16 input neurons are contributing to the computation of h11

Contrast this to what happens in the case of convolution

10 classes(digits)

This is what a regular feed-forward neural network will look like ... There are many dense connections here

16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Contrast this to what happens in the case of convolution

10 classes(digits)

This is what a regular feed-forward neural network will look like ... There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11 16 2

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 10 classes(digits)

This is what a regular feed-forward neural network will look like ... There are many dense connections here

For example all the 16 input neurons are contributing to the computation of h11 16 Contrast this to what happens in the 2 case of convolution

33/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 h11 h12 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 contribute to h11

The connections are much sparser

We are taking advantage of the structure of the image(interactions h11 between neighboring pixels are more interesting)

This sparse connectivity reduces the number of parameters in the model

...

16

2 * =

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 h11 h12

For example, only pixels 1, 2, 5, 6 contribute to h11

The connections are much sparser

We are taking advantage of the structure of the image(interactions h11 between neighboring pixels are more interesting)

This sparse connectivity reduces the number of parameters in the model

Only a few local neurons participate in the computation of h11 ...

16

2 * =

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 h12

The connections are much sparser

We are taking advantage of the structure of the image(interactions between neighboring pixels are more interesting)

This sparse connectivity reduces the number of parameters in the model

h11 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 ... contribute to h11

16

h11 2 * =

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 The connections are much sparser

We are taking advantage of the structure of the image(interactions h11 between neighboring pixels are more interesting)

This sparse connectivity reduces the number of parameters in the model

h11 h12 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 ... contribute to h11

16

h12 2 * =

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 The connections are much sparser

We are taking advantage of the structure of the image(interactions h11 between neighboring pixels are more interesting)

This sparse connectivity reduces the number of parameters in the model

h11 h12 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 ... contribute to h11

16

h13 2 * =

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 The connections are much sparser

We are taking advantage of the structure of the image(interactions h11 between neighboring pixels are more interesting)

This sparse connectivity reduces the number of parameters in the model

h11 h12 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 ... contribute to h11

16

h14 2 * =

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 We are taking advantage of the structure of the image(interactions h11 between neighboring pixels are more interesting)

This sparse connectivity reduces the number of parameters in the model

h11 h12 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 ... contribute to h11

16 The connections are much sparser

h14 2 * =

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 h11

This sparse connectivity reduces the number of parameters in the model

h11 h12 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 ... contribute to h11

16 The connections are much sparser We are taking advantage of the structure of the image(interactions h14 between neighboring pixels are more 2 * = interesting)

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 h11

h11 h12 Only a few local neurons participate in the computation of h11

For example, only pixels 1, 2, 5, 6 ... contribute to h11

16 The connections are much sparser We are taking advantage of the structure of the image(interactions h14 between neighboring pixels are more * = interesting) 2 This sparse connectivity reduces the number of parameters in the model

34/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Aren’t we losing information (by los- ing interactions between some input pixels)

Well, not really

The two highlighted neurons (x1 & ∗ x5) do not interact in layer 1

But they indirectly contribute to the computation of g3 and hence interact indirectly

But is sparse connectivity really good thing ?

∗ Goodfellow-et-al-2016

35/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Well, not really

The two highlighted neurons (x1 & ∗ x5) do not interact in layer 1

But they indirectly contribute to the computation of g3 and hence interact indirectly

But is sparse connectivity really good thing ?

Aren’t we losing information (by los- ing interactions between some input pixels)

∗ Goodfellow-et-al-2016

35/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 The two highlighted neurons (x1 & ∗ x5) do not interact in layer 1

But they indirectly contribute to the computation of g3 and hence interact indirectly

But is sparse connectivity really good thing ?

Aren’t we losing information (by los- ing interactions between some input pixels)

Well, not really

∗ Goodfellow-et-al-2016

35/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 But they indirectly contribute to the computation of g3 and hence interact indirectly

But is sparse connectivity really good thing ?

Aren’t we losing information (by los- ing interactions between some input pixels)

Well, not really

The two highlighted neurons (x1 & ∗ x5) do not interact in layer 1

∗ Goodfellow-et-al-2016

35/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 But is sparse connectivity really good thing ?

Aren’t we losing information (by los- ing interactions between some input pixels)

Well, not really

The two highlighted neurons (x1 & ∗ x5) do not interact in layer 1

But they indirectly contribute to the computation of g3 and hence interact indirectly

∗ Goodfellow-et-al-2016

35/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 16

Kernel 1

Kernel 2 4x4 Image

Consider the following net- work

Do we want the kernel weights to be different for dif- ferent portions of the image?

Imagine that we are trying to learn a kernel that detects edges

Shouldn’t we be applying the same kernel at all the por- tions of the image?

Another characteristic of CNNs is weight sharing

36/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 16

Kernel 1

Kernel 2 4x4 Image

Do we want the kernel weights to be different for dif- ferent portions of the image?

Imagine that we are trying to learn a kernel that detects edges

Shouldn’t we be applying the same kernel at all the por- tions of the image?

Another characteristic of CNNs is weight sharing

Consider the following net- work

36/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Do we want the kernel weights to be different for dif- ferent portions of the image?

Imagine that we are trying to learn a kernel that detects edges

Shouldn’t we be applying the same kernel at all the por- tions of the image?

Another characteristic of CNNs is weight sharing

Consider the following net- work

16

Kernel 1

Kernel 2 4x4 Image

36/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Imagine that we are trying to learn a kernel that detects edges

Shouldn’t we be applying the same kernel at all the por- tions of the image?

Another characteristic of CNNs is weight sharing

Consider the following net- work

16 Do we want the kernel weights to be different for dif- ferent portions of the image?

Kernel 1

Kernel 2 4x4 Image

36/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Shouldn’t we be applying the same kernel at all the por- tions of the image?

Another characteristic of CNNs is weight sharing

Consider the following net- work

16 Do we want the kernel weights to be different for dif- ferent portions of the image?

Kernel 1 Imagine that we are trying to learn a kernel that detects edges Kernel 2 4x4 Image

36/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Another characteristic of CNNs is weight sharing

Consider the following net- work

16 Do we want the kernel weights to be different for dif- ferent portions of the image?

Kernel 1 Imagine that we are trying to learn a kernel that detects edges Kernel 2 4x4 Image Shouldn’t we be applying the same kernel at all the por- tions of the image? 36/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Yes, indeed

In other words, shouldn't the orange and pink kernels (the two copies applied at different locations) be the same? Yes, indeed. This would make the job of learning easier (instead of trying to learn the same weights/kernels at different locations again and again).

But does that mean we can have only one kernel? No, we can have many such kernels, but each kernel will be shared by all locations in the image.

This is called “weight sharing”.

37/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
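The effect of sparse connectivity plus weight sharing on the parameter count can be seen with a back-of-the-envelope calculation; the numbers below are our own sketch for the 4 × 4 image in the figure, not something computed in the lecture.

# Dense layer: every one of the 16 input pixels connects to every one of the
# 9 hidden units produced by sliding a 2x2 window over a 4x4 image.
dense_weights = 16 * 9                    # 144 distinct weights

# Convolutional layer with one shared 2x2 kernel: the same 4 weights are
# reused at all 9 locations.
shared_kernel_weights = 2 * 2             # 4 weights

# Even with several kernels the count stays small.
num_kernels = 6
conv_weights = num_kernels * shared_kernel_weights   # 24 weights

print(dense_weights, conv_weights)        # 144 vs 24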

So far, we have focused only on the convolution operation. Let us see what a full convolutional neural network looks like.

38/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

38/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 It has alternate convolution and pooling layers What does a pooling layer do? Let us see

Input (32 × 32) → Convolution Layer 1 (28 × 28) → Pooling Layer 1 (14 × 14) → Convolution Layer 2 (10 × 10) → Pooling Layer 2 (5 × 5) → FC 1 (120) → FC 2 (84) → Output (10)

39/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

39/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 1 4 2 1

5 8 3 4
7 6 4 5
1 3 1 2

Applying max pooling to this 4 × 4 input (the output of 1 filter) with 2 × 2 filters and stride 2 gives

8 4
7 5

while max pooling with 2 × 2 filters and stride 1 gives

8 8 4
8 8 5
7 6 5

Instead of max pooling we can also do average pooling.

40/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
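Both pooling variants are a few lines of NumPy; the sketch below (array and function names are ours) reproduces the numbers above.

import numpy as np

def pool2d(x, size=2, stride=2, op=np.max):
    rows = (x.shape[0] - size) // stride + 1
    cols = (x.shape[1] - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = op(x[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

x = np.array([[1, 4, 2, 1],
              [5, 8, 3, 4],
              [7, 6, 4, 5],
              [1, 3, 1, 2]])

print(pool2d(x, 2, 2, np.max))    # [[8. 4.] [7. 5.]]
print(pool2d(x, 2, 1, np.max))    # [[8. 8. 4.] [8. 8. 5.] [7. 6. 5.]]
print(pool2d(x, 2, 2, np.mean))   # average pooling instead of max pooling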

40/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 We will now see some case studies where convolution neural networks have been successful

41/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

LeNet-5 for handwritten character recognition

Input (32 × 32) → Convolution Layer 1 → Pooling Layer 1 → Convolution Layer 2 → Pooling Layer 2 → FC 1 (120) → FC 2 (84) → Output (10)

Convolution Layer 1: S = 1, F = 5, K = 6, P = 0; output 28 × 28; Param = 150
Pooling Layer 1: F = 2 (stride 2), K = 6; output 14 × 14; Param = 0
Convolution Layer 2: S = 1, F = 5, K = 16, P = 0; output 10 × 10; Param = 2400
Pooling Layer 2: F = 2 (stride 2), K = 16; output 5 × 5; Param = 0
FC 1 (120): Param = 48120
FC 2 (84): Param = 10164
Output (10): Param = 850

42/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
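The layer sizes and parameter counts on this slide can be checked mechanically. The sketch below is ours (plain Python). Note that the slide counts convolution parameters without biases and fully connected parameters with biases, and that although the slide lists S = 1 for the pooling layers, the spatial sizes 28 → 14 and 10 → 5 correspond to 2 × 2 pooling with stride 2, which is what we use here.

def conv_out(w, f, p, s):
    return (w - f + 2 * p) // s + 1

w = 32
w = conv_out(w, 5, 0, 1); conv1 = 5 * 5 * 1 * 6       # 28, 150
w = conv_out(w, 2, 0, 2); pool1 = 0                   # 14
w = conv_out(w, 5, 0, 1); conv2 = 5 * 5 * 6 * 16      # 10, 2400
w = conv_out(w, 2, 0, 2); pool2 = 0                   # 5
fc1 = 5 * 5 * 16 * 120 + 120                          # 48120
fc2 = 120 * 84 + 84                                   # 10164
out = 84 * 10 + 10                                    # 850

print(w, conv1, conv2, fc1, fc2, out)   # 5 150 2400 48120 10164 850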

42/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 How do we train a convolutional neural network ?

43/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

[Figure: a 1d input convolved with a kernel (w, x, y, z) to produce the output (l, m, n, o), drawn as a feedforward layer in which only the coloured weights are active and the gray weights are zero]

A CNN can be implemented as a feedforward neural network wherein only a few weights (in color) are active; the rest of the weights (in gray) are zero.

We can thus train a convolutional neural network using backpropagation by thinking of it as a feedforward neural network with sparse connections.
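To make this concrete, here is a small sketch (not from the lecture; numpy and all values are assumptions) that builds the weight matrix of the equivalent feedforward layer for a 1d convolution: each row contains the same kernel values shifted by one position, every other entry stays zero, and gradients can therefore flow through it exactly as in an ordinary feedforward network.

import numpy as np

def conv_as_matrix(kernel, input_len):
    # Weight matrix of the feedforward layer equivalent to a 1d "valid" convolution:
    # each row holds the shared kernel, shifted by one position; all other weights are zero.
    k = len(kernel)
    out_len = input_len - k + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        W[i, i:i + k] = kernel
    return W

x = np.arange(10, dtype=float)        # the input values
w = np.array([1.0, -2.0, 0.5, 3.0])   # the kernel w, x, y, z
W = conv_as_matrix(w, len(x))

# The sparse matrix product gives the same output as the convolution itself.
assert np.allclose(W @ x, np.correlate(x, w, mode="valid"))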

44/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

Module 11.4 : CNNs (success stories on ImageNet)

45/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

ImageNet Success Stories (roadmap for rest of the talk): AlexNet, ZFNet, VGGNet

46/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

[Figure: ILSVRC winners by year, with network depth and top-5 error rate (%)]
ILSVRC'10  shallow                28.2
ILSVRC'11  shallow                25.8
ILSVRC'12  AlexNet (8 layers)     16.4
ILSVRC'13  ZFNet (8 layers)       11.7
ILSVRC'14  VGG (19 layers)         7.3
ILSVRC'14  GoogleNet (22 layers)   6.7
ILSVRC'15  ResNet (152 layers)     3.57

47/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

ImageNet Success Stories (roadmap for rest of the talk): AlexNet, ZFNet, VGGNet

48/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

48/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 9752,H2 ==? 552723119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

55 27 23 11 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 11 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 Convolution MaxPooling 1000 96 256 MaxPooling Convolution 4096 4096 96 Convolution

227

227

3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 272311 9752,H2 ==? 2723119752 Parameters:Parameters:Parameters: (3 (3Parameters: (5×××33×5××384)256)384)96)×× 0?×384256384256 = = = 1 0. 0.327.68MMM

TotalInput:Input:Input: Parameters: 11 27FC1 97 ××119727××× 2738425696.55M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 256,384,×256) 23× 55 540961000××F×52355=×4096 = =× 53×256 16 425696MM = 4M FS = 1,3,PS = 02

Output:W2 = 55,H2 = 55 Parameters: (11 × 11 × 3) × 96 = 34K

55 27 23 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 Convolution MaxPooling 1000 96 256 MaxPooling Convolution 4096 4096 96 Convolution

Input: 227 × 227 × 3 Conv1: K = 96,F = 11 S = 4,P = 0

Output:W2 =?,H2 =? Parameters: ?

227

11

11

227

3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 272311 9752,H2 ==? 2723119752 Parameters:Parameters:Parameters: (3 (3Parameters: (5×××33×5××384)256)384)96)×× 0?×384256384256 = = = 1 0. 0.327.68MMM

TotalInput:Input:Input: Parameters: 11 27FC1 97 ××119727××× 2738425696.55M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 256,384,×256) 23× 55 540961000××F×52355=×4096 = =× 53×256 16 425696MM = 4M FS = 1,3,PS = 02

Output:W2 =?,H2 =? Parameters: (11 × 11 × 3) × 96 = 34K

27 23 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000 96 256 MaxPooling Convolution 4096 4096

Input: 227 × 227 × 3 Conv1: K = 96,F = 11 S = 4,P = 0

Output:W2 = 55,H2 = 55 Parameters: ?

227 55

11

11

55 227

96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 272311 9752,H2 ==? 2723119752 Parameters:Parameters:Parameters: (3 (3Parameters: (5×××33×5××384)256)384)96)×× 0?×384256384256 = = = 1 0. 0.327.68MMM

TotalInput:Input:Input: Parameters: 11 27FC1 97 ××119727××× 2738425696.55M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 256,384,×256) 23× 55 540961000××F×52355=×4096 = =× 53×256 16 425696MM = 4M FS = 1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

27 23 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000 96 256 MaxPooling Convolution 4096 4096

Input: 227 × 227 × 3 Conv1: K = 96,F = 11 S = 4,P = 0

Output:W2 = 55,H2 = 55 Parameters: (11 × 11 × 3) × 96 = 34K

227 55

11

11

55 227

96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552311 9752,H2 ==? 5523119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 54096F1000×F×=523=×4096 =11 =× 53256 16 4256MM = 4M FS = 4,1,3,PS = 02

Output:W2 = 27,H2 = 27 Parameters: 0

27 23 11 9 5 7 5 dense 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000 96 256 MaxPooling Convolution 4096 4096

Max Pool Input: 55 × 55 × 96 F = 3,S = 2

Output:W2 =?,H2 =? Parameters: ?

227 55

11 3 3 11

55 227

96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552311 9752,H2 ==? 5523119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 54096F1000×F×=523=×4096 =11 =× 53256 16 4256MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: 0

23 11 9 5 7 5 dense 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 5 2 7 9 256 11 256 MaxPooling 23 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000 256 Convolution 4096 4096

Max Pool Input: 55 × 55 × 96 F = 3,S = 2

Output:W2 = 27,H2 = 27 Parameters: ?

227 55 27 11 3 3 11

27 55 227

96 MaxPooling 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552311 9752,H2 ==? 5523119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 54096F1000×F×=523=×4096 =11 =× 53256 16 4256MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

23 11 9 5 7 5 dense 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 5 2 7 9 256 11 256 MaxPooling 23 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000 256 Convolution 4096 4096

Max Pool Input: 55 × 55 × 96 F = 3,S = 2

Output:W2 = 27,H2 = 27 Parameters: 0

227 55 27 11 3 3 11

27 55 227

96 MaxPooling 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552711 9752,H2 ==? 5527119752 Parameters:Parameters:Parameters: (3 (3Parameters: (11××3×3××11384)256)384)× 3)× 0?××38438425696 = = 1 34.0327.8KMM

TotalInput:Input: Parameters: 11227FC1 97 ××1197227×× 27384256×.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,384,256,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 3×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 = 23,H2 = 23 Parameters: (5 × 5 × 96) × 256 = 0.6M

23 11 9 7 dense 3 3 3 3 5 3 dense dense 3 3 3 3 3 2 5 2 7 9 256 11 256 MaxPooling 23 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000 256 Convolution 4096 4096

Input: 27 × 27 × 96 Conv1: K = 256,F = 5 S = 1,P = 0

Output:W2 =?,H2 =? Parameters: ?

227 55 27 11 3 5 3 5 11

27 55 227

96 MaxPooling 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552711 9752,H2 ==? 5527119752 Parameters:Parameters:Parameters: (3 (3Parameters: (11××3×3××11384)256)384)× 3)× 0?××38438425696 = = 1 34.0327.8KMM

TotalInput:Input: Parameters: 11227FC1 97 ××1197227×× 27384256×.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,384,256,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 3×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: (5 × 5 × 96) × 256 = 0.6M

11 9 7 dense 3 3 3 3 5 3 dense dense 3 3 3 3 3 2 5 2 7 9 256 11 256 MaxPooling 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000

4096 4096

Input: 27 × 27 × 96 Conv1: K = 256,F = 5 S = 1,P = 0

Output:W2 = 23,H2 = 23 Parameters: ?

227 55 27 23 11 3 5 3 5 11

27 23 55 227

96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552711 9752,H2 ==? 5527119752 Parameters:Parameters:Parameters: (3 (3Parameters: (11××3×3××11384)256)384)× 3)× 0?××38438425696 = = 1 34.0327.8KMM

TotalInput:Input: Parameters: 11227FC1 97 ××1197227×× 27384256×.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,384,256,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 3×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

11 9 7 dense 3 3 3 3 5 3 dense dense 3 3 3 3 3 2 5 2 7 9 256 11 256 MaxPooling 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000

4096 4096

Input: 27 × 27 × 96 Conv1: K = 256,F = 5 S = 1,P = 0

Output:W2 = 23,H2 = 23 Parameters: (5 × 5 × 96) × 256 = 0.6M

227 55 27 23 11 3 5 3 5 11

27 23 55 227

96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552723 9752,H2 ==? 5527239752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input:× Input:K 4096 40962=× 96,256,384,×256)× 55 54096F1000××F×=555=×4096 =11 = 53×256 16 496MM = 4M FS = 4,1,3,PS = 02

Output:W2 = 11,H2 = 11 Parameters: 0

11 9 7 dense 3 3 3 5 3 dense dense 3 3 3 3 2 5 2 7 9 256 11 256 MaxPooling 384 Convolution 384 Convolution 256 Convolution MaxPooling 1000

4096 4096

Max Pool Input: 23 × 23 × 256 F = 3,S = 2

Output:W2 =?,H2 =? Parameters: ?

227 55 27 23 11 5 3 3 3 3 5 11

27 23 55 227

96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552723 9752,H2 ==? 5527239752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input:× Input:K 4096 40962=× 96,256,384,×256)× 55 54096F1000××F×=555=×4096 =11 = 53×256 16 496MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: 0

9 7 dense 3 3 3 5 3 dense dense 3 3 3 3 2 5 2 7 9 256 256 MaxPooling 384 Convolution 384 Convolution Convolution 1000

4096 4096

Max Pool Input: 23 × 23 × 256 F = 3,S = 2

Output:W2 = 11,H2 = 11 Parameters: ?

227 55 27 23 11 11 5 3 3 3 3 5 11

11 27 23 55 227 256 MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 552723 9752,H2 ==? 5527239752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input:× Input:K 4096 40962=× 96,256,384,×256)× 55 54096F1000××F×=555=×4096 =11 = 53×256 16 496MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

9 7 dense 3 3 3 5 3 dense dense 3 3 3 3 2 5 2 7 9 256 256 MaxPooling 384 Convolution 384 Convolution Convolution 1000

4096 4096

Max Pool Input: 23 × 23 × 256 F = 3,S = 2

Output:W2 = 11,H2 = 11 Parameters: 0

227 55 27 23 11 11 5 3 3 3 3 5 11

11 27 23 55 227 256 MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 752,H2 ==? 55272311752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)384)96)× 3)×× 0?××38425625696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 227 27FC1 97 ×××9727227×× 27384×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 = 9,H2 = 9 Parameters: (3 × 3 × 256) × 384 = 0.8M

9 7 dense 3 3 5 3 dense dense 3 3 3 2 5 2 7 9 256 256 MaxPooling 384 Convolution 384 Convolution Convolution 1000

4096 4096

Input: 11 × 11 × 256 Conv1: K = 384,F = 3 S = 1,P = 0

Output:W2 =?,H2 =? Parameters: ?

227 55 27 23 11 11 3 5 3 3 3 3 5 3 11

11 27 23 55 227 256 MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 752,H2 ==? 55272311752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)384)96)× 3)×× 0?××38425625696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 227 27FC1 97 ×××9727227×× 27384×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: (3 × 3 × 256) × 384 = 0.8M

7 dense 3 3 5 3 dense dense 3 3 3 2 5 2 7 256 256 MaxPooling 384 Convolution Convolution

1000

4096 4096

Input: 11 × 11 × 256 Conv1: K = 384,F = 3 S = 1,P = 0

Output:W2 = 9,H2 = 9 Parameters: ?

227 55 27 23 11 11 9 3 5 3 3 3 3 5 3 11

11 9 27 23 55 384 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 752,H2 ==? 55272311752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)384)96)× 3)×× 0?××38425625696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 227 27FC1 97 ×××9727227×× 27384×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

7 dense 3 3 5 3 dense dense 3 3 3 2 5 2 7 256 256 MaxPooling 384 Convolution Convolution

1000

4096 4096

Input: 11 × 11 × 256 Conv1: K = 384,F = 3 S = 1,P = 0

Output:W2 = 9,H2 = 9 Parameters: (3 × 3 × 256) × 384 = 0.8M

227 55 27 23 11 11 9 3 5 3 3 3 3 5 3 11

11 9 27 23 55 384 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 952,H2 ==? 55272311952 Parameters:Parameters:Parameters: (3Parameters: (5 (11×××35×11×256)384)96)× 3)× 0?××25638425696 = = 0 340..68KMM

TotalInput:Input:Input: Parameters: 11227 27FC1 7 ×××11727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 = 7,H2 = 7 Parameters: (3 × 3 × 384) × 384 = 1.327M

7 dense 3 5 3 dense dense 3 3 2 5 2 7 256 256 MaxPooling 384 Convolution Convolution

1000

4096 4096

Input: 9 × 9 × 384 Conv1: K = 384,F = 3 S = 1,P = 0

Output:W2 =?,H2 =? Parameters: ?

227 55 27 23 11 11 9 3 5 3 3 3 3 3 5 3 3 11

11 9 27 23 55 384 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 952,H2 ==? 55272311952 Parameters:Parameters:Parameters: (3Parameters: (5 (11×××35×11×256)384)96)× 3)× 0?××25638425696 = = 0 340..68KMM

TotalInput:Input:Input: Parameters: 11227 27FC1 7 ×××11727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: (3 × 3 × 384) × 384 = 1.327M

dense 3 5 3 dense dense 3 3 2 5 2 256 256 MaxPooling Convolution

1000

4096 4096

Input: 9 × 9 × 384 Conv1: K = 384,F = 3 S = 1,P = 0

Output:W2 = 7,H2 = 7 Parameters: ?

227 55 27 23 11 11 9 3 5 7 3 3 3 3 3 5 3 3 11 7 11 9 27 23 384 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 952,H2 ==? 55272311952 Parameters:Parameters:Parameters: (3Parameters: (5 (11×××35×11×256)384)96)× 3)× 0?××25638425696 = = 0 340..68KMM

TotalInput:Input:Input: Parameters: 11227 27FC1 7 ×××11727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

dense 3 5 3 dense dense 3 3 2 5 2 256 256 MaxPooling Convolution

1000

4096 4096

Input: 9 × 9 × 384 Conv1: K = 384,F = 3 S = 1,P = 0

Output:W2 = 7,H2 = 7 Parameters: (3 × 3 × 384) × 384 = 1.327M

227 55 27 23 11 11 9 3 5 7 3 3 3 3 3 5 3 3 11 7 11 9 27 23 384 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 972,H2 ==? 55272311972 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)96)× 3)×× 0?××38425638496 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 9 ×××11927227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 = 5,H2 = 5 Parameters: (3 × 3 × 384) × 256 = 0.8M

5 dense 3 dense dense 3 2 5 2 256 256 MaxPooling Convolution

1000

4096 4096

Input: 7 × 7 × 384 Conv1: K = 256,F = 3 S = 1,P = 0

Output:W2 =?,H2 =? Parameters: ?

227 55 27 23 11 11 9 3 5 7 3 3 3 3 3 3 5 3 3 3 11 7 11 9 27 23 384 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 972,H2 ==? 55272311972 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)96)× 3)×× 0?××38425638496 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 9 ×××11927227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: (3 × 3 × 384) × 256 = 0.8M

dense 3 dense dense 3 2 2 256 MaxPooling

1000

4096 4096

Input: 7 × 7 × 384 Conv1: K = 256,F = 3 S = 1,P = 0

Output:W2 = 5,H2 = 5 Parameters: ?

227 55 27 23 11 11 9 3 5 7 3 3 3 3 5 3 3 5 3 3 3 11 5 7 11 9 256 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 972,H2 ==? 55272311972 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)96)× 3)×× 0?××38425638496 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 9 ×××11927227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

dense 3 dense dense 3 2 2 256 MaxPooling

1000

4096 4096

Input: 7 × 7 × 384 Conv1: K = 256,F = 3 S = 1,P = 0

Output:W2 = 5,H2 = 5 Parameters: (3 × 3 × 384) × 256 = 0.8M

227 55 27 23 11 11 9 3 5 7 3 3 3 3 5 3 3 5 3 3 3 11 5 7 11 9 256 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 975,H2 ==? 55272311975 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input: Input:×K 4096 40962=× 96,256,384,×256) 23× 554096F1000××F×=2355=4096 =11 =× 53× 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 = 2,H2 = 2 Parameters: 0

dense 2 dense dense 2 256 MaxPooling

1000

4096 4096

Max Pool Input: 5 × 5 × 256 F = 3,S = 2

Output:W2 =?,H2 =? Parameters: ?

227 55 27 23 11 11 9 5 7 3 5 3 3 3 3 3 3 3 5 3 3 3 3 11 5 7 11 9 256 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 975,H2 ==? 55272311975 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input: Input:×K 4096 40962=× 96,256,384,×256) 23× 554096F1000××F×=2355=4096 =11 =× 53× 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: 0

dense dense dense

1000

4096 4096

Max Pool Input: 5 × 5 × 256 F = 3,S = 2

Output:W2 = 2,H2 = 2 Parameters: ?

227 55 27 23 11 11 9 5 7 3 5 3 3 3 3 3 3 3 2 5 3 3 3 3 11 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 975,H2 ==? 55272311975 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:Parameters:MaxMaxConv1: Pool Pool (2 Input: Input:×K 4096 40962=× 96,256,384,×256) 23× 554096F1000××F×=2355=4096 =11 =× 53× 16 425696MM = 4M FS = 4,1,3,PS = 02

Output:W2 =?,H2 =? Parameters: ?

dense dense dense

1000

4096 4096

Max Pool Input: 5 × 5 × 256 F = 3,S = 2

Output:W2 = 2,H2 = 2 Parameters: 0

227 55 27 23 11 11 9 5 7 3 5 3 3 3 3 3 3 3 2 5 3 3 3 3 11 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 9752,H2 ==? 552723119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool Input: Input: Input:K 4096 4096= 96,256,384,× 23× 55 54096F1000××F=52355=× =11 =× 53×256 16 425696MM FS = 4,1,3,PS = 02

dense dense

1000

4096

FC1 Parameters: (2 × 2 × 256) × 4096 = 4M

227 55 27 23 11 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 5 3 3 3 3 11 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 4096 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 9752,H2 ==? 552723119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 40962=× 96,256,384,256) 23× 55 5F1000××F×=52355=×4096 11 =× 53×256 425696M = 4M FS = 4,1,3,PS = 02

dense

1000

FC1 Parameters: 4096 × 4096 = 16M

227 55 27 23 11 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 dense 5 3 3 3 3 11 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 96 256 MaxPooling Convolution 4096 4096 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 9752,H2 ==? 552723119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

TotalInput:Input:Input: Parameters: 11227 27FC1 97 ×××119727227××× 27384256×96.553 M Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 40962=× 96,256,384,×256) 23 55 54096F××F×=52355=×4096 =11× 53×256 1625696M = 4M FS = 4,1,3,PS = 02

FC1 Parameters: 4096 × 1000 = 4M

227 55 27 23 11 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 11 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 1000 96 256 MaxPooling Convolution 4096 4096 96 Convolution 3 Input

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Output:Output:Output:WWW2 2=2==? 55272311 9752,H2 ==? 552723119752 Parameters:Parameters:Parameters:Parameters: (3 (3Parameters: (5 (11×××3×3×5×11×384)256)384)96)× 3)×× 0?××38425638425696 = = = 1 0 34.0.327.68KMMM

Input:Input:Input: 11227 27FC1 97 ×××119727227×××384256×963 Parameters:Parameters:Parameters:MaxMaxMaxConv1: Pool Pool Pool (2 Input: Input:× Input:K 4096 40962=× 96,256,384,×256) 23× 55 54096F1000××F×=52355=×4096 =11 =× 53×256 16 425696MM = 4M FS = 4,1,3,PS = 02

Total Parameters: 27.55M

227 55 27 23 11 11 9 5 7 3 5 dense 3 3 3 3 3 3 3 2 dense dense 5 3 3 3 3 11 5 2 7 9 256 11 256 MaxPooling 27 23 384 Convolution 55 384 Convolution 256 227 Convolution MaxPooling 1000 96 256 MaxPooling Convolution 4096 4096 96 Convolution 3 Input
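The whole walkthrough can be reproduced with a short calculator (a sketch, not from the lecture; the layer list is just a transcription of the slide, sizes follow W2 = (W1 − F + 2P)/S + 1, and biases are ignored).

# A minimal sketch reproducing the AlexNet numbers above (biases ignored).
layers = [
    # (kind, K, F, S, P)
    ("conv", 96,  11, 4, 0),
    ("pool", 96,   3, 2, 0),
    ("conv", 256,  5, 1, 0),
    ("pool", 256,  3, 2, 0),
    ("conv", 384,  3, 1, 0),
    ("conv", 384,  3, 1, 0),
    ("conv", 256,  3, 1, 0),
    ("pool", 256,  3, 2, 0),
]

W, D, total = 227, 3, 0
for kind, K, F, S, P in layers:
    W = (W - F + 2 * P) // S + 1
    params = F * F * D * K if kind == "conv" else 0
    total += params
    print(f"{kind}: output {W} x {W} x {K}, parameters {params:,}")
    D = K

fan_in = W * W * D                         # 2 x 2 x 256 = 1024
for name, units in [("FC1", 4096), ("FC2", 4096), ("Output", 1000)]:
    params = fan_in * units
    total += params
    print(f"{name}: parameters {params:,}")
    fan_in = units

# The exact count is ~28.8M; the slide's 27.55M matches the sum of the rounded
# per-layer values quoted above (0.6M, 0.8M, 1.327M, 0.8M, 4M, 16M, 4M).
print(f"Total: {total/1e6:.2f}M")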

49/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

Let us look at the connections in the fully connected layers in more detail.

We will first stretch out the last conv or maxpool layer to make it a 1d vector ("make linear"): 2 × 2 × 256 = 1024.

This 1d vector is then densely connected to the other layers (4096, 4096, 1000) just as in a regular feedforward neural network.

[Figure: the 2 × 2 × 256 MaxPooling output is flattened and densely connected to layers of size 4096, 4096 and 1000]
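A minimal sketch of this "make linear" step (not from the lecture; numpy and the random values are assumptions): the 2 × 2 × 256 maxpool output is reshaped into a 1024-dimensional vector and then fed through an ordinary dense layer.

import numpy as np

pooled = np.random.randn(2, 2, 256)          # last MaxPooling output of AlexNet
flat = pooled.reshape(-1)                    # "make linear": 2 * 2 * 256 = 1024 values

W_fc1 = 0.01 * np.random.randn(4096, 1024)   # dense weights, 1024 -> 4096 (~4M parameters)
b_fc1 = np.zeros(4096)
h1 = np.maximum(0.0, W_fc1 @ flat + b_fc1)   # a regular feedforward (ReLU) layer from here on

print(flat.shape, h1.shape)                  # (1024,) (4096,)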

50/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

ImageNet Success Stories (roadmap for rest of the talk): AlexNet, ZFNet, VGGNet

51/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

[Figure: AlexNet (top) and ZFNet (bottom) compared layer by layer; both take a 227 × 227 × 3 input and end in dense layers of size 4096, 4096 and 1000, but ZFNet uses 7 × 7 filters in the first layer and 512, 1024 and 512 feature maps in the later convolutions]

ZFNet differs from AlexNet as follows:

Layer 1: F = 11 → 7; Difference in Parameters: ((11² − 7²) × 3) × 96 = 20.7K
Layer 2: No difference
Layer 3: No difference
Layer 4: No difference
Layer 5: K = 384 → 512; Difference in Parameters: (3 × 3 × 256) × (512 − 384) = 0.29M
Layer 6: K = 384 → 1024; Difference in Parameters: 3 × 3 × ((384 × 384) − (512 × 1024)) = 0.8M
Layer 7: K = 256 → 512; Difference in Parameters: 3 × 3 × ((384 × 256) − (1024 × 512)) = 0.36M
Layer 8: No difference
Layer 9: No difference
Layer 10: No difference

Difference in Total No. of Parameters: 1.45M
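Two of the rows above can be checked directly; the sketch below (not from the lecture) just evaluates the slide's expressions for layers 1 and 5.

def conv_params(F, depth, K):
    # Weights in a convolution layer with K filters of size F x F x depth (no biases).
    return F * F * depth * K

# Layer 1: AlexNet's 11 x 11 filters shrink to 7 x 7 in ZFNet.
diff1 = conv_params(11, 3, 96) - conv_params(7, 3, 96)        # ((11^2 - 7^2) * 3) * 96 = 20,736 ~ 20.7K

# Layer 5: 384 feature maps grow to 512 (input depth 256 in both networks).
diff5 = conv_params(3, 256, 512) - conv_params(3, 256, 384)   # (3 * 3 * 256) * (512 - 384) = 294,912 ~ 0.29M

print(diff1, diff5)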

52/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

ImageNet Success Stories (roadmap for rest of the talk): AlexNet, ZFNet, VGGNet

53/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11

[Figure: VGGNet - a 224 × 224 × 3 input passes through stacks of 3 × 3 convolutions with 64, 128, 256 and 512 feature maps, each stack followed by maxpooling (224 → 112 → 56 → 28 → 14 → 7), ending in fc 4096, fc 4096, fc 1000 and a softmax]

Kernel size is 3 × 3 throughout

Total parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1000) = ∼ 122M

Total parameters in non-FC layers = ∼ 16M

Most parameters are in the first FC layer (∼ 102M)
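As a quick check of these totals (a sketch, not from the lecture), multiplying out the three fully connected layers shows how strongly the first one dominates:

fc1 = 512 * 7 * 7 * 4096   # 7 x 7 x 512 maps flattened into 4096 units -> ~102.8M
fc2 = 4096 * 4096          #                                            -> ~16.8M
fc3 = 4096 * 1000          # 1000-way output layer                      -> ~4.1M

total_fc = fc1 + fc2 + fc3
print(f"FC parameters: {total_fc/1e6:.1f}M (FC1 alone: {fc1/1e6:.1f}M)")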


54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M

softmax

1000 fc

4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M

Most parameters are in the first FC layer (∼ 102M)

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv Input Conv fc

4096

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M

softmax

1000

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M

Most parameters are in the first FC layer (∼ 102M)

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv Input Conv fc fc

4096 4096

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M

softmax

1000

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M

Most parameters are in the first FC layer (∼ 102M)

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv Input Conv fc fc

4096 4096

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M

Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M

Total parameters in non FC layers = ∼ 16M

Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M

Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M Total Parameters in FC layers =

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 + (4096 × 4096) + (4096 × 1024) = ∼ 122M Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M Total Parameters in FC layers = (512 × 7 × 7 × 4096)

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 + (4096 × 1024) = ∼ 122M Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096)

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 = ∼ 122M Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024)

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 Most parameters are in the first FC layer (∼ 102M)

softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M

54/68 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11 softmax

224 224 112 112 56 56 28 28 14 14 7 7 28 14 14 28 56 56 512 112 112

224 224 512 512 256 512 128 256 maxpool Conv maxpool Conv maxpool 64 128 maxpool Conv 64 maxpool Conv 1000 Input Conv fc fc

4096 4096

Kernel size is 3 × 3 throughout Total parameters in non FC layers = ∼ 16M Total Parameters in FC layers = (512 × 7 × 7 × 4096) + (4096 × 4096) + (4096 × 1024) = ∼ 122M Most parameters are in the first FC layer (∼ 102M)
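As a quick sanity check on these counts, here is a small Python sketch (my own illustration, not part of the original slides) that tallies the fully connected weights using the layer sizes from the figure above (biases ignored):

```python
# Fully connected part of the VGG-style network above:
# flatten(7 x 7 x 512) -> fc 4096 -> fc 4096 -> fc 1000
flattened = 7 * 7 * 512
fc_sizes = [flattened, 4096, 4096, 1000]

fc_weights = sum(a * b for a, b in zip(fc_sizes, fc_sizes[1:]))
first_fc = fc_sizes[0] * fc_sizes[1]

print(f"first FC layer : {first_fc / 1e6:.1f}M weights")    # ~102.8M
print(f"all FC layers  : {fc_weights / 1e6:.1f}M weights")  # ~123.6M (quoted as ~122M above)
```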

Module 11.5 : Image Classification continued (GoogLeNet and ResNet)

[Figure: an output volume of size W × H × D, followed by four candidate operations: max pooling (output H1 × W1), a 1 × 1 convolution, a 3 × 3 convolution (output H2 × W2), and a 5 × 5 convolution (output H3 × W3)]

Consider the output at a certain layer of a convolutional neural network.
After this layer we could apply a max-pooling layer, or a 1 × 1 convolution, or a 3 × 3 convolution, or a 5 × 5 convolution.
Question: Why choose between these options (convolution, maxpooling, filter sizes)?
Idea: Why not apply all of them at the same time and then concatenate the feature maps?

Well this naive idea could result in a large number of computations.
If P = 0 and S = 1, then convolving a W × H × D input with an F × F × D filter results in a (W − F + 1) × (H − F + 1) sized output.
Each element of the output requires O(F × F × D) computations.
Can we reduce the number of computations?

[Figure: a W × H × D volume reduced by 1 × 1 convolutions to a W × H × D1 volume]

Yes, by using 1 × 1 convolutions.
Huh?? What does a 1 × 1 convolution do? It aggregates along the depth.
So convolving a D × W × H input with D1 1 × 1 filters (D1 < D) will result in a D1 × W × H output (S = 1, P = 0).
If D1 < D then this effectively reduces the dimension of the input and hence the computations: specifically, instead of O(F × F × D) we will need O(F × F × D1) computations.
We could then apply the subsequent 3 × 3 and 5 × 5 filters on this reduced output.
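To make the savings concrete, here is a small PyTorch sketch (my own illustration; the sizes 256, 64 and 128 are arbitrary example values) that reduces the depth with a 1 × 1 convolution before a 5 × 5 convolution, together with a rough multiply count for both orderings:

```python
import torch
import torch.nn as nn

D, D1, D_out, F = 256, 64, 128, 5            # depths and filter size (example values)
x = torch.randn(1, D, 28, 28)                # a 28 x 28 x 256 input volume

reduce = nn.Conv2d(D, D1, kernel_size=1)     # 1x1 conv: aggregates along the depth
conv5 = nn.Conv2d(D1, D_out, kernel_size=F, padding=2)  # 5x5 conv on the reduced volume
print(conv5(reduce(x)).shape)                # torch.Size([1, 128, 28, 28])

# Rough multiply count per output position (biases ignored):
direct = F * F * D * D_out                   # 5x5 conv applied directly at depth D
reduced = D * D1 + F * F * D1 * D_out        # 1x1 reduction + 5x5 conv at depth D1
print(direct, reduced)                       # 819200 vs 221184
```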

[Figure: the Inception module on a 28 × 28 × 256 input, with parallel branches of (i) 1 × 1 convolutions, (ii) 1 × 1 convolutions (dimensionality reduction) followed by 3 × 3 convolutions on the reduced input, (iii) 1 × 1 convolutions (dimensionality reduction) followed by 5 × 5 convolutions on the reduced input, and (iv) 3 × 3 maxpooling followed by 1 × 1 convolutions (dimensionality reduction), all joined by filter concatenation]

But we might want to use different dimensionality reductions before the 3 × 3 and 5 × 5 filters.
So we can use D1 and D2 1 × 1 filters before the 3 × 3 and 5 × 5 filters respectively.
We can then add the maxpooling layer followed by dimensionality reduction.
And a new set of 1 × 1 convolutions.
And finally we concatenate all these layers.
This is called the Inception module.
We will now see GoogLeNet, which contains many such inception modules.
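A minimal PyTorch sketch of such a module is shown below (my own illustration, not code from the lecture); each branch keeps the spatial size unchanged, so the outputs can be concatenated along the depth. The example widths are those of GoogLeNet's first inception module (3a), listed in the table below.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, n1x1, n3x3_red, n3x3, n5x5_red, n5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolutions
        self.b1 = nn.Conv2d(in_ch, n1x1, kernel_size=1)
        # Branch 2: 1x1 dimensionality reduction, then 3x3 convolutions
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, n3x3_red, kernel_size=1), nn.ReLU(),
            nn.Conv2d(n3x3_red, n3x3, kernel_size=3, padding=1))
        # Branch 3: 1x1 dimensionality reduction, then 5x5 convolutions
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, n5x5_red, kernel_size=1), nn.ReLU(),
            nn.Conv2d(n5x5_red, n5x5, kernel_size=5, padding=2))
        # Branch 4: 3x3 maxpooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the channel (depth) dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Inception 3a: 28 x 28 x 192 input, output depth 64 + 128 + 32 + 32 = 256
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)    # torch.Size([1, 256, 28, 28])
```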

[Figure: GoogLeNet. Input 229 × 229 → Conv (64) → 112 × 112 → maxpool → 56 × 56 → Conv (192) → maxpool → 28 × 28 → Inception 3a (256) → Inception 3b (480) → maxpool → 14 × 14 → Inception 4a (512) → 4b (512) → 4c (512) → 4d (528) → 4e (832) → maxpool → 7 × 7 → Inception 5a (832) → 5b (1024) → avgpool → 1 × 1 × 1024 → dropout(40%) → fc (1000) → softmax (1000)]

The filter counts used inside each inception module (from the figure) are:

Module   Input            #1×1   #3×3 reduce   #3×3   #5×5 reduce   #5×5   pool proj   Output depth
3a       28 × 28 × 192     64     96            128    16            32     32          256
3b       28 × 28 × 256     128    128           192    32            96     64          480
4a       14 × 14 × 480     192    96            208    16            48     64          512
4b       14 × 14 × 512     160    112           224    24            64     64          512
4c       14 × 14 × 512     128    128           256    24            64     64          512
4d       14 × 14 × 512     112    144           288    32            64     64          528
4e       14 × 14 × 528     256    160           320    32            128    128         832
5a       7 × 7 × 832       256    160           320    32            128    128         832
5b       7 × 7 × 832       384    192           384    48            128    128         1024

[Figure: the 7 × 7 × 1024 output of the last inception module, either flattened and passed to a 1000-way fully connected layer (W ∈ R^{50176 × 1000}) or average-pooled to a 1024-dimensional vector and then passed to the fully connected layer (W ∈ R^{1024 × 1000})]

Important Trick: Got rid of the fully connected layer.
Notice that the output of the last layer is 7 × 7 × 1024 dimensional.
What if we were to add a fully connected layer with 1000 nodes (for 1000 classes) on top of this? We would have 7 × 7 × 1024 × 1000 ≈ 49M parameters.
Instead they use an average pooling of size 7 × 7 on each of the 1024 feature maps. This results in a 1024-dimensional output and significantly reduces the number of parameters.
Overall, GoogLeNet has 12× fewer parameters than AlexNet, at the cost of roughly 2× more computations.
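The saving is easy to verify; the sketch below (my own illustration in PyTorch) compares the classifier weights needed with and without the 7 × 7 average pooling.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1024, 7, 7)              # output of the last inception module

# Without the trick: flatten and attach a 1000-way fully connected layer
fc_flat = nn.Linear(1024 * 7 * 7, 1000)     # 50176 x 1000 weight matrix

# GoogLeNet's trick: 7x7 average pooling first, then the fully connected layer
pool = nn.AvgPool2d(kernel_size=7)          # (1, 1024, 7, 7) -> (1, 1024, 1, 1)
fc_pooled = nn.Linear(1024, 1000)           # 1024 x 1000 weight matrix

print(fc_pooled(pool(x).flatten(1)).shape)  # torch.Size([1, 1000])
print(fc_flat.weight.numel())               # 50176000 weights
print(fc_pooled.weight.numel())             # 1024000 weights
```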

[Figure: GoogLeNet vs. ResNet]

Suppose we have been able to train a shallow neural network well.
Now suppose we construct a deeper network which has a few more layers (in orange).
Intuitively, if the shallow network works well then the deep network should also work well, by simply learning to compute identity functions in the new layers.
Essentially, the solution space of a shallow neural network is a subset of the solution space of a deep neural network.

But in practice it is observed that this doesn't happen: notice that the deeper network has a higher error rate on the test set.

[Figure: two stacked layers (x → layer → relu → layer → relu) computing H(x), and the same two layers with an identity connection from x added before the final relu, so that H(x) = F(x) + x]

Consider any two stacked layers in a CNN.
The two layers are essentially learning some function of the input.
What if we enable it to learn only a residual function of the input?

H(x) = F(x) + x
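A minimal PyTorch sketch of such a residual block (my own illustration; batch normalization and shape-matching projections are omitted, and the channel count is an example value):

```python
import torch
import torch.nn as nn
import torch.nn.functional as fn

class ResidualBlock(nn.Module):
    """Two stacked conv layers learning a residual F(x); the block outputs H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = fn.relu(self.conv1(x))      # first stacked layer
        out = self.conv2(out)             # second stacked layer computes the residual F(x)
        return fn.relu(out + x)           # identity connection: H(x) = F(x) + x

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```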

Why would this help? Remember our argument that a deeper version of a shallow network would do just fine by learning identity transformations in the new layers.
This identity connection from the input allows a ResNet to retain a copy of the input.
Using this idea they were able to train really deep networks.

ResNet (152 layers) took 1st place in all five main tracks:
ImageNet Classification: "Ultra-deep" 152-layer nets
ImageNet Detection: 16% better than the 2nd best system
ImageNet Localization: 27% better than the 2nd best system
COCO Detection: 11% better than the 2nd best system
COCO Segmentation: 12% better than the 2nd best system

Bag of tricks used to train ResNet (152 layers):
Batch Normalization after every CONV layer
Xavier/2 initialization from [He et al.]
SGD + Momentum (0.9)
Learning rate: 0.1, divided by 10 when validation error plateaus
Mini-batch size 256
Weight decay of 1e-5
No dropout used
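A hedged sketch of this recipe in PyTorch is given below (my own illustration; the placeholder model stands in for a ResNet, and ReduceLROnPlateau is just one way to implement the "divide by 10 when validation error plateaus" rule):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # placeholder; stands in for a ResNet with BatchNorm after every conv

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                 # initial learning rate
    momentum=0.9,           # SGD + momentum
    weight_decay=1e-5)      # weight decay as stated above

# Divide the learning rate by 10 whenever the monitored validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

# Inside the training loop (mini-batch size 256, no dropout):
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   scheduler.step(validation_error)   # after each validation pass
```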
