
Lecture 8: CNN Architectures

Reminder: A2 due today!

Due at 11:59pm

Remember to run the validation script!

Soon: Assignment 3!

Modular API for:
- Fully-connected networks
- Dropout
- Update rules: SGD+Momentum, RMSprop, Adam
- Convolutional networks

Will be released today or tomorrow
Will be due two weeks from the day it is released

Last Time: Components of Convolutional Networks

Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Functions, Normalization

ImageNet Classification Challenge

[Figure: ImageNet top-5 error rate by year and winning entry. 2010: 28.2 (Lin et al, shallow), 2011: 25.8 (Sanchez & Perronnin, shallow), 2012: 16.4 (Krizhevsky et al, AlexNet, 8 layers), 2013: 11.7 (Zeiler & Fergus, 8 layers), 2014: 7.3 (Simonyan & Zisserman, VGG, 19 layers), 2014: 6.7 (Szegedy et al, GoogLeNet, 22 layers), 2015: 3.6 (He et al, ResNet, 152 layers), 2016: 3.0 (Shao et al), 2017: 2.3 (Hu et al, SENet). Human performance (Russakovsky et al): 5.1]

AlexNet

- 227 x 227 inputs
- 5 convolutional layers
- Max pooling
- 3 fully-connected layers
- ReLU nonlinearities

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Used "Local response normalization"; not used anymore.

Trained on two GTX 580 GPUs with only 3GB of memory each! Model split over two GPUs.

[Figure: AlexNet citations per year (as of 9/30/2019): 2013: 284, 2014: 942, 2015: 2672, 2016: 5955, 2017: 10173, 2018: 11533, 2019: 14951. Total citations: 46,510]

Citation counts for comparison:
- Darwin, "On the origin of species", 1859: 50,007
- Shannon, "A mathematical theory of communication", 1948: 69,351
- Watson and Crick, "Molecular Structure of Nucleic Acids", 1953: 13,111
- ATLAS Collaboration, "Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC", 2012: 14,424

AlexNet, layer by layer (each quantity is derived in the worked examples below):

Layer   | In C x H/W | filters | kernel | stride | pad | Out C x H/W | memory (KB) | params (k) | flop (M)
conv1   | 3 x 227    | 64      | 11     | 4      | 2   | 64 x 56     | 784         | 23         | 73
pool1   | 64 x 56    |         | 3      | 2      | 0   | 64 x 27     | 182         | 0          | 0
conv2   | 64 x 27    | 192     | 5      | 1      | 2   | 192 x 27    | 547         | 307        | 224
pool2   | 192 x 27   |         | 3      | 2      | 0   | 192 x 13    | 127         | 0          | 0
conv3   | 192 x 13   | 384     | 3      | 1      | 1   | 384 x 13    | 254         | 664        | 112
conv4   | 384 x 13   | 256     | 3      | 1      | 1   | 256 x 13    | 169         | 885        | 145
conv5   | 256 x 13   | 256     | 3      | 1      | 1   | 256 x 13    | 169         | 590        | 100
pool5   | 256 x 13   |         | 3      | 2      | 0   | 256 x 6     | 36          | 0          | 0
flatten | 256 x 6    |         |        |        |     | 9216        | 36          | 0          | 0
fc6     | 9216       | 4096    |        |        |     | 4096        | 16          | 37,749     | 38
fc7     | 4096       | 4096    |        |        |     | 4096        | 16          | 16,777     | 17
fc8     | 4096       | 1000    |        |        |     | 1000        | 4           | 4,096      | 4

Recall: number of output channels = number of filters, so conv1 produces 64 channels.

Recall: W' = (W - K + 2P) / S + 1
           = (227 - 11 + 2*2) / 4 + 1
           = 220 / 4 + 1 = 56

Memory for the conv1 output:
Number of output elements = C * H' * W' = 64 * 56 * 56 = 200,704
Bytes per element = 4 (for 32-bit floating point)
KB = (number of elements) * (bytes per element) / 1024
   = 200,704 * 4 / 1024
   = 784

Learnable parameters for conv1:
Weight shape = Cout x Cin x K x K = 64 x 3 x 11 x 11
Bias shape = Cout = 64
Number of weights = 64*3*11*11 + 64 = 23,296

Number of floating point operations for conv1 (multiply+add):
= (number of output elements) * (ops per output element)
= (Cout x H' x W') * (Cin x K x K)
= (64 * 56 * 56) * (3 * 11 * 11)
= 200,704 * 363
= 72,855,552
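As a sanity check, these conv-layer formulas can be turned into a few lines of Python (a sketch using plain arithmetic; the hyperparameters are taken from the conv1 row above):

```python
# Sketch: reproduce the conv1 row of the table above with plain arithmetic.
def conv_stats(Cin, W, Cout, K, S, P):
    W_out = (W - K + 2 * P) // S + 1                # output spatial size
    mem_kb = Cout * W_out * W_out * 4 / 1024        # output activations, 4 bytes per float32
    params = Cout * Cin * K * K + Cout              # weights + biases
    flops = (Cout * W_out * W_out) * (Cin * K * K)  # one multiply+add per weight per output
    return W_out, mem_kb, params, flops

print(conv_stats(Cin=3, W=227, Cout=64, K=11, S=4, P=2))
# (56, 784.0, 23296, 72855552) -> matches 56, 784 KB, 23k params, 73 MFLOP
```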

For the pooling layer pool1:
#output channels = #input channels = 64
W' = floor((W - K) / S + 1)
   = floor((56 - 3) / 2 + 1) = floor(27.5) = 27

Memory for the pool1 output:
#output elems = Cout x H' x W'
Bytes per elem = 4
KB = Cout * H' * W' * 4 / 1024
   = 64 * 27 * 27 * 4 / 1024
   = 182.25

Pooling layers have no learnable parameters!

Floating-point ops for the pooling layer:
= (number of output positions) * (flops per output position)
= (Cout * H' * W') * (K * K)
= (64 * 27 * 27) * (3 * 3)
= 419,904
= 0.4 MFLOP
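The pooling rows can be checked the same way (a sketch; max pooling is counted here as one op per element of each K x K window, as above):

```python
import math

# Sketch: reproduce the pool1 row. Pooling has no learnable parameters.
def pool_stats(C, W, K, S):
    W_out = math.floor((W - K) / S + 1)       # floor((56 - 3) / 2 + 1) = 27
    mem_kb = C * W_out * W_out * 4 / 1024     # output activations, float32
    flops = (C * W_out * W_out) * (K * K)     # ops per output position
    return W_out, mem_kb, flops

print(pool_stats(C=64, W=56, K=3, S=2))
# (27, 182.25, 419904) -> matches 27, 182 KB, ~0.4 MFLOP
```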

Flatten output size = C x H x W = 256 * 6 * 6 = 9216

For a fully-connected layer (fc6):
FC params = Cin * Cout + Cout = 9216 * 4096 + 4096 = 37,752,832
FC flops  = Cin * Cout = 9216 * 4096 = 37,748,736

How to choose all of these hyperparameters (numbers of filters, kernel sizes, strides)? Trial and error =(

Interesting trends here!

Most of the memory usage is in the early convolution layers.
Most floating-point ops occur in the convolution layers.
Nearly all parameters are in the fully-connected layers.

[Figure: per-layer bar charts of memory (KB), params (K), and MFLOP for AlexNet]
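If torchvision is available, the same trend is easy to see by counting parameters in its AlexNet implementation (a sketch; it assumes the torchvision AlexNet, whose layer sizes match the table above, and totals may differ slightly across versions):

```python
import torch.nn as nn
import torchvision

# Sketch: where do AlexNet's parameters live? (assumes torchvision is installed)
model = torchvision.models.alexnet()

conv_params = sum(p.numel() for m in model.features.modules()
                  if isinstance(m, nn.Conv2d) for p in m.parameters())
fc_params = sum(p.numel() for m in model.classifier.modules()
                if isinstance(m, nn.Linear) for p in m.parameters())

print(f"conv: {conv_params / 1e6:.1f}M params, fc: {fc_params / 1e6:.1f}M params")
# Roughly 2.5M conv params vs ~58.6M fc params (~61M total):
# nearly all parameters are in the fully-connected layers.
```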

ZFNet: A Bigger AlexNet

ImageNet top-5 error: 16.4% -> 11.7%

AlexNet but:
- CONV1: change from (11x11 stride 4) to (7x7 stride 2)
- CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

More trial and error =(

Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks", ECCV 2014

VGG: Deeper Networks, Regular Design

VGG design rules:
- All conv are 3x3 stride 1 pad 1
- All max pool are 2x2 stride 2
- After pool, double #channels

[Figure: layer-by-layer diagrams of AlexNet, VGG16, and VGG19]

The network has 5 convolutional stages:
Stage 1: conv-conv-pool
Stage 2: conv-conv-pool
Stage 3: conv-conv-pool
Stage 4: conv-conv-conv-[conv]-pool
Stage 5: conv-conv-conv-[conv]-pool
(VGG-19 has 4 conv in stages 4 and 5)

Why 3x3 convolutions? Compare two options that map C channels to C channels and preserve spatial size:

Option 1: Conv(5x5, C -> C)
  Params: 25C^2
  FLOPs: 25C^2 HW

Option 2: Conv(3x3, C -> C), Conv(3x3, C -> C)
  Params: 18C^2
  FLOPs: 18C^2 HW

Two 3x3 convs have the same receptive field as a single 5x5 conv, but have fewer parameters and take less computation!

Why double the channels after each pool? Compare a conv layer before and after a pooling step:

Input: C x 2H x 2W                Input: 2C x H x W
Layer: Conv(3x3, C -> C)          Layer: Conv(3x3, 2C -> 2C)
Memory: 4HWC                      Memory: 2HWC
Params: 9C^2                      Params: 36C^2
FLOPs: 36HWC^2                    FLOPs: 36HWC^2

Conv layers at each spatial resolution take the same amount of computation!

Simonyan and Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR 2015
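To make the Option 1 vs. Option 2 comparison concrete, here is a small PyTorch sketch (C=64 is just an illustrative channel count) that counts the parameters of one 5x5 conv against a stack of two 3x3 convs:

```python
import torch.nn as nn

# Sketch: one 5x5 conv vs. two 3x3 convs, both mapping C -> C channels
# and preserving the spatial size.
C = 64
option1 = nn.Conv2d(C, C, kernel_size=5, padding=2)
option2 = nn.Sequential(nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
                        nn.Conv2d(C, C, kernel_size=3, padding=1))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(option1), count(option2))
# 102464 vs 73856: about 25C^2 vs 18C^2 (plus biases), with the same 5x5 receptive field.
```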

AlexNet vs VGG-16: Much bigger network!

[Figure: per-layer bar charts comparing AlexNet and VGG-16 memory (KB), params (M), and MFLOPs]

AlexNet total: 1.9 MB memory, 61M params, 0.7 GFLOP
VGG-16 total: 48.6 MB memory (25x), 138M params (2.3x), 13.6 GFLOP (19.4x)

GoogLeNet: Focus on Efficiency

Many innovations for efficiency: reduce parameter count, memory usage, and computation

Szegedy et al, "Going deeper with convolutions", CVPR 2015

GoogLeNet: Aggressive Stem

Stem network at the start aggressively downsamples input (Recall in VGG-16: Most of the compute was at the start)

Layer     | In C x H/W | filters | kernel | stride | pad | Out C x H/W | memory (KB) | params (K) | flop (M)
conv      | 3 x 224    | 64      | 7      | 2      | 3   | 64 x 112    | 3136        | 9          | 118
max-pool  | 64 x 112   |         | 3      | 2      | 1   | 64 x 56     | 784         | 0          | 2
conv      | 64 x 56    | 64      | 1      | 1      | 0   | 64 x 56     | 784         | 4          | 13
conv      | 64 x 56    | 192     | 3      | 1      | 1   | 192 x 56    | 2352        | 111        | 347
max-pool  | 192 x 56   |         | 3      | 2      | 1   | 192 x 28    | 588         | 0          | 1

Total from 224 to 28 spatial resolution:    Compare VGG-16:
Memory: 7.5 MB                              Memory: 42.9 MB (5.7x)
Params: 124K                                Params: 1.1M (8.9x)
MFLOP: 418                                  MFLOP: 7485 (17.8x)

Szegedy et al, "Going deeper with convolutions", CVPR 2015

GoogLeNet: Inception Module

Inception module: a local unit with parallel branches.

Local structure repeated many times throughout the network


Uses 1x1 “Bottleneck” layers to reduce channel dimension before expensive conv (we will revisit this with ResNet!)

Szegedy et al, "Going deeper with convolutions", CVPR 2015

GoogLeNet: Global Average Pooling

No large FC layers at the end! Instead uses global average pooling to collapse spatial dimensions, and one linear layer to produce class scores (Recall VGG-16: Most parameters were in the FC layers!)

GoogLeNet classifier head:
Layer     | In C x H/W | filters | kernel | stride | pad | Out C x H/W | memory (KB) | params (k) | flop (M)
avg-pool  | 1024 x 7   |         | 7      | 1      | 0   | 1024 x 1    | 4           | 0          | 0
fc        | 1024       | 1000    |        |        |     | 1000        | 4           | 1025       | 1

Compare with VGG-16:
Layer     | In C x H/W | filters | kernel | stride | pad | Out C x H/W | memory (KB) | params (K) | flop (M)
flatten   | 512 x 7    |         |        |        |     | 25088       | 98          | 0          | 0
fc6       | 25088      | 4096    |        |        |     | 4096        | 16          | 102760     | 103
fc7       | 4096       | 4096    |        |        |     | 4096        | 16          | 16777      | 17
fc8       | 4096       | 1000    |        |        |     | 1000        | 4           | 4096       | 4
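As an illustration, here is a minimal PyTorch sketch of a GoogLeNet-style head (global average pool plus one linear layer) next to a VGG-style fc6, showing the parameter gap; the channel counts are taken from the tables above:

```python
import torch
import torch.nn as nn

# Sketch: global-average-pooling head vs. a VGG-style flatten + big FC layer.
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(1024, 1000))   # 1024*1000 + 1000 = 1,025,000 params
vgg_fc6 = nn.Linear(512 * 7 * 7, 4096)            # 25088*4096 + 4096 = 102,764,544 params

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(gap_head), count(vgg_fc6))

x = torch.randn(1, 1024, 7, 7)                    # a 1024 x 7 x 7 feature map
print(gap_head(x).shape)                          # torch.Size([1, 1000])
```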

GoogLeNet: Auxiliary Classifiers

Training using loss at the end of the network didn’t work well: Network is too deep, gradients don’t propagate cleanly

As a hack, attach “auxiliary classifiers” at several intermediate points in the network that also try to classify the image and receive loss

GoogLeNet was before batch normalization! With BatchNorm we no longer need to use this trick.

Residual Networks

Once we have Batch Normalization, we can train networks with 10+ layers. What happens as we go deeper?

[Figure: test error vs. training iterations for a 20-layer and a 56-layer network]

The deeper model does worse than the shallow model!

Initial guess: the deep model is overfitting, since it is much bigger than the other model.

[Figure: training error and test error vs. iterations; the 56-layer model is worse than the 20-layer model on both]

In fact the deep model seems to be underfitting, since it also performs worse than the shallow model on the training set!

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016

Residual Networks

A deeper model can emulate a shallower model: copy layers from shallower model, set extra layers to identity

Thus deeper models should do at least as well as shallow models

Hypothesis: This is an optimization problem. Deeper models are harder to optimize, and in particular don’t learn identity functions to emulate shallow models


Solution: Change the network so learning identity functions with extra layers is easy!

[Figure: "Plain" block vs. Residual block. A plain block passes x through two conv layers (with ReLU) to compute H(x) directly. A residual block computes F(x) with two conv layers and adds the input back through an additive "shortcut", so the output is F(x) + x, followed by a ReLU.]

If you set the weights of the conv layers inside the residual branch to 0, the whole block will compute the identity function!

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016

Residual Networks

A residual network is a stack of many residual blocks.

Regular design, like VGG: each residual block has two 3x3 conv.

Network is divided into stages: the first block of each stage halves the resolution (with stride-2 conv) and doubles the number of channels.
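As a concrete illustration, here is a minimal PyTorch sketch of one such residual block (stride-1 case only; batch normalization placement follows the original ResNet, and the stride-2 first block of each stage, which also needs a projection shortcut, is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a "basic" residual block: two 3x3 convs plus an additive shortcut.
class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first 3x3 conv
        out = self.bn2(self.conv2(out))         # second 3x3 conv
        return F.relu(out + x)                  # additive shortcut, then ReLU

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64)(x).shape)                  # torch.Size([1, 64, 56, 56])
```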

Uses the same aggressive stem as GoogLeNet to downsample the input 4x before applying residual blocks:

Layer     | In C x H/W | filters | kernel | stride | pad | Out C x H/W | memory (KB) | params (k) | flop (M)
conv      | 3 x 224    | 64      | 7      | 2      | 3   | 64 x 112    | 3136        | 9          | 118
max-pool  | 64 x 112   |         | 3      | 2      | 1   | 64 x 56     | 784         | 0          | 2

Like GoogLeNet, no big fully-connected layers: instead use global average pooling and a single linear layer at the end.

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016

ResNet-18:
Stem: 1 conv layer
Stage 1 (C=64): 2 res. blocks = 4 conv
Stage 2 (C=128): 2 res. blocks = 4 conv
Stage 3 (C=256): 2 res. blocks = 4 conv
Stage 4 (C=512): 2 res. blocks = 4 conv
Linear
ImageNet top-5 error: 10.92
GFLOP: 1.8

ResNet-34:
Stem: 1 conv layer
Stage 1: 3 res. blocks = 6 conv
Stage 2: 4 res. blocks = 8 conv
Stage 3: 6 res. blocks = 12 conv
Stage 4: 3 res. blocks = 6 conv
Linear
ImageNet top-5 error: 8.58
GFLOP: 3.6

VGG-16:
ImageNet top-5 error: 9.62
GFLOP: 13.6

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision.

Residual Networks: Basic Block

"Basic" residual block:
Conv(3x3, C->C)    FLOPs: 9HWC^2
Conv(3x3, C->C)    FLOPs: 9HWC^2
Total FLOPs: 18HWC^2

Residual Networks: Bottleneck Block

"Bottleneck" residual block (more layers, less computational cost!), listed input to output:
Conv(1x1, 4C->C)   FLOPs: 4HWC^2
Conv(3x3, C->C)    FLOPs: 9HWC^2
Conv(1x1, C->4C)   FLOPs: 4HWC^2
Total FLOPs: 17HWC^2

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016
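A matching sketch of the bottleneck block (again stride-1 only, with batch norm omitted for brevity; 4C = 256, C = 64 here is just an example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a "bottleneck" residual block on 4C channels:
# 1x1 reduce 4C -> C, 3x3 at width C, 1x1 expand C -> 4C, plus the shortcut.
class Bottleneck(nn.Module):
    def __init__(self, C):
        super().__init__()
        self.reduce = nn.Conv2d(4 * C, C, kernel_size=1)
        self.conv = nn.Conv2d(C, C, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(C, 4 * C, kernel_size=1)

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.conv(out))
        return F.relu(self.expand(out) + x)   # additive shortcut

x = torch.randn(1, 256, 28, 28)               # 4C = 256 channels
print(Bottleneck(64)(x).shape)                # torch.Size([1, 256, 28, 28])
```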

Residual Networks

Model      | Block type | Stem layers | Stage 1 | Stage 2 | Stage 3  | Stage 4 | FC layers | GFLOP | ImageNet top-5 error
           |            |             |     (residual blocks / conv layers per stage)
ResNet-18  | Basic      | 1           | 2 / 4   | 2 / 4   | 2 / 4    | 2 / 4   | 1         | 1.8   | 10.92
ResNet-34  | Basic      | 1           | 3 / 6   | 4 / 8   | 6 / 12   | 3 / 6   | 1         | 3.6   | 8.58
ResNet-50  | Bottleneck | 1           | 3 / 9   | 4 / 12  | 6 / 18   | 3 / 9   | 1         | 3.8   | 7.13
ResNet-101 | Bottleneck | 1           | 3 / 9   | 4 / 12  | 23 / 69  | 3 / 9   | 1         | 7.6   | 6.44
ResNet-152 | Bottleneck | 1           | 3 / 9   | 8 / 24  | 36 / 108 | 3 / 9   | 1         | 11.3  | 5.94

ResNet-50 is the same as ResNet-34, but replaces Basic blocks with Bottleneck blocks. This is a great baseline architecture for many tasks even today!

Deeper ResNet-101 and ResNet-152 models are more accurate, but also more computationally heavy.

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision.

- Able to train very deep networks
- Deeper networks do better than shallow networks (as expected)
- Swept 1st place in all ILSVRC and COCO 2015 competitions
- Still widely used today!

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016

Improving Residual Networks: Block Design

Original ResNet block: Conv -> BatchNorm -> ReLU -> Conv -> BatchNorm inside the residual, with a ReLU after the addition. Note the ReLU after the residual: the block cannot actually learn the identity function, since its outputs are nonnegative!

"Pre-Activation" ResNet block: BatchNorm -> ReLU -> Conv -> BatchNorm -> ReLU -> Conv inside the residual, with nothing after the addition. Note the ReLU is inside the residual: the block can learn the true identity function by setting the conv weights to zero!

Slight improvement in accuracy (ImageNet top-1 error):
ResNet-152: 21.3 vs 21.1
ResNet-200: 21.8 vs 20.7

Not actually used that much in practice.

He et al, "Identity mappings in deep residual networks", ECCV 2016

Comparing Complexity

[Figure from Canziani et al: ImageNet accuracy vs. computational cost for many CNN architectures, with parameter counts also shown]

Inception-v4: ResNet + Inception!
VGG: Highest memory, most operations
GoogLeNet: Very efficient!
AlexNet: Low compute, lots of parameters
ResNet: Simple design, moderate efficiency, high accuracy

Canziani et al, "An analysis of deep neural network models for practical applications", 2017

ImageNet 2016 winner: Model Ensembles

Multi-scale ensemble of Inception, Inception-Resnet, Resnet, Wide Resnet models

Shao et al, 2016

Improving ResNets

"Bottleneck" residual block (recap), listed input to output:
Conv(1x1, 4C->C)   FLOPs: 4HWC^2
Conv(3x3, C->C)    FLOPs: 9HWC^2
Conv(1x1, C->4C)   FLOPs: 4HWC^2
Total FLOPs: 17HWC^2

Improving ResNets: ResNeXt

ResNeXt block: G parallel pathways, each a narrow bottleneck with internal width c (listed input to output):
Conv(1x1, 4C->c)   FLOPs: 4HWCc
Conv(3x3, c->c)    FLOPs: 9HWc^2
Conv(1x1, c->4C)   FLOPs: 4HWCc
Total FLOPs over G pathways: (8Cc + 9c^2) HWG

Equal cost to the 17HWC^2 bottleneck block when 9Gc^2 + 8GCc - 17C^2 = 0
Example: C=64, G=4, c=24; C=64, G=32, c=4

Xie et al, "Aggregated residual transformations for deep neural networks", CVPR 2017

Grouped Convolution

Convolution with groups=1 (normal convolution):
Input: Cin x H x W
Weight: Cout x Cin x K x K
Output: Cout x H' x W'
FLOPs: Cout Cin K^2 HW
All convolutional kernels touch all Cin channels of the input.

Convolution with groups=2: two parallel convolution layers that each work on half the channels:
Input: Cin x H x W, split into two groups of (Cin/2) x H x W
Each group: Conv(K x K, Cin/2 -> Cout/2), producing (Cout/2) x H' x W'
Concatenate the two outputs: Cout x H' x W'

Convolution with groups=G: G parallel conv layers; each "sees" Cin/G input channels and produces Cout/G output channels:
Input: Cin x H x W, split to G x [(Cin/G) x H x W]
Weight: G x (Cout/G) x (Cin/G) x K x K, applied as G parallel convolutions
Output: G x [(Cout/G) x H' x W'], concatenated to Cout x H' x W'
FLOPs: Cout Cin K^2 HW / G

Depthwise Convolution
Special case: G = Cin, Cout = n Cin
Each input channel is convolved with n different K x K filters to produce n output channels.

Grouped Convolution in PyTorch

PyTorch convolution gives an option for groups!
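For example (the channel counts below are illustrative; in_channels and out_channels must both be divisible by groups):

```python
import torch
import torch.nn as nn

# groups=1: normal convolution; every filter sees all 64 input channels.
normal = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=1)
# groups=8: 8 parallel convolutions, each seeing 64/8 = 8 input channels.
grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=8)
# groups = in_channels: depthwise convolution.
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)

x = torch.randn(1, 64, 32, 32)
for conv in (normal, grouped, depthwise):
    print(conv(x).shape, sum(p.numel() for p in conv.parameters()))
# Parameter count (and FLOPs) shrink by roughly a factor of `groups`.
```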

Improving ResNets: ResNeXt

Equivalent formulation with grouped convolution: the G parallel pathways of the ResNeXt block can be fused into a single bottleneck that uses grouped convolution:
Conv(1x1, 4C->Gc)
Conv(3x3, Gc->Gc, groups=G)
Conv(1x1, Gc->4C)
Total FLOPs: (8Cc + 9c^2) HWG
Equal cost to the standard bottleneck when 9Gc^2 + 8GCc - 17C^2 = 0
Example: C=64, G=4, c=24; C=64, G=32, c=4

Xie et al, "Aggregated residual transformations for deep neural networks", CVPR 2017

ResNeXt: Maintain computation by adding groups!

Model       | Groups | Group width | Top-1 Error
ResNet-50   |      1 |          64 |        23.9
ResNeXt-50  |      2 |          40 |        23.0
ResNeXt-50  |      4 |          24 |        22.6
ResNeXt-50  |      8 |          14 |        22.3
ResNeXt-50  |     32 |           4 |        22.2

Model       | Groups | Group width | Top-1 Error
ResNet-101  |      1 |          64 |        22.0
ResNeXt-101 |      2 |          40 |        21.7
ResNeXt-101 |      4 |          24 |        21.4
ResNeXt-101 |      8 |          14 |        21.3
ResNeXt-101 |     32 |           4 |        21.2

Adding groups improves performance with same computational complexity!

Xie et al, "Aggregated residual transformations for deep neural networks", CVPR 2017

Squeeze-and-Excitation Networks

Adds a "squeeze-and-excite" branch to each residual block that performs global pooling and fully-connected layers, then multiplies the result back onto the feature map.

Adds global context to each residual block!

Won ILSVRC 2017 with ResNeXt-152-SE
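A minimal sketch of such a branch (the reduction ratio of 16 is a typical choice from the paper, assumed here for illustration):

```python
import torch
import torch.nn as nn

# Minimal sketch of a squeeze-and-excite branch: global average pool, two FC layers,
# then multiply the per-channel weights back onto the feature map.
class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=16):   # reduction=16 is an assumed default
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                 # squeeze: (N, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))   # excite: per-channel weights
        return x * s.view(n, c, 1, 1)                          # rescale the feature map

x = torch.randn(2, 256, 14, 14)
print(SqueezeExcite(256)(x).shape)   # torch.Size([2, 256, 14, 14])
```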

Hu et al, "Squeeze-and-Excitation networks", CVPR 2018

Completion of the challenge: the annual ImageNet competition is no longer held after 2017; it has moved to Kaggle.

Densely Connected Neural Networks

[Figure: DenseNet architecture: input, then alternating Dense Blocks and pool / 1x1 conv transition layers, ending with pooling, FC, and Softmax. Within a Dense Block, each conv layer takes the concatenation of all previous layers' outputs as input.]

Dense blocks where each layer is connected to every other layer in feedforward fashion.

Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse.
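A minimal sketch of the dense connectivity pattern (the growth rate and number of layers are illustrative choices, not the paper's exact configuration, and the BatchNorm/ReLU/1x1-conv details inside each layer are simplified):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a dense block: each 3x3 conv takes the concatenation of the
# block input and all previous layers' outputs as its input.
class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers))

    def forward(self, x):
        features = [x]
        for conv in self.convs:
            out = F.relu(conv(torch.cat(features, dim=1)))  # concatenate all previous features
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 28, 28)
print(DenseBlock(64)(x).shape)   # torch.Size([1, 192, 28, 28]) = 64 + 4*32 channels
```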

Huang et al, "Densely Connected Convolutional Networks", CVPR 2017

MobileNets: Tiny Networks (For Mobile Devices)

Standard convolution block (total cost: 9C^2 HW):
Conv(3x3, C->C)            cost: 9C^2 HW
Batch Norm
ReLU

Depthwise separable convolution (total cost: (9C + C^2) HW):
Conv(3x3, C->C, groups=C)  cost: 9CHW     "Depthwise Convolution"
Batch Norm
ReLU
Conv(1x1, C->C)            cost: C^2 HW   "Pointwise Convolution"
Batch Norm
ReLU

Speedup = 9C^2 / (9C + C^2) = 9C / (9 + C) => 9 (as C -> inf)
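A minimal sketch of the depthwise separable block, compared against a standard 3x3 conv (C=256 is an illustrative channel count):

```python
import torch.nn as nn

# Minimal sketch of a MobileNet-style depthwise separable block:
# 3x3 depthwise conv (groups=C) followed by a 1x1 pointwise conv, each with BN + ReLU.
def depthwise_separable(C):
    return nn.Sequential(
        nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False),  # depthwise: 9C weights
        nn.BatchNorm2d(C), nn.ReLU(inplace=True),
        nn.Conv2d(C, C, kernel_size=1, bias=False),                       # pointwise: C^2 weights
        nn.BatchNorm2d(C), nn.ReLU(inplace=True))

C = 256
standard = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)          # 9C^2 weights
separable = depthwise_separable(C)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))
# 589824 vs 68864: roughly 9C^2 vs (9C + C^2) plus BatchNorm params, close to the 9C/(9+C) speedup.
```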

Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", 2017

Also related:
ShuffleNet: Zhang et al, CVPR 2018
MobileNetV2: Sandler et al, CVPR 2018
ShuffleNetV2: Ma et al, ECCV 2018

Neural Architecture Search

Designing neural network architectures is hard – let’s automate it!

- One network (controller) outputs network architectures
- Sample child networks from controller and train them
- After training a batch of child networks, make a gradient step on controller network (using policy gradient)
- Over time, controller learns to output good architectures!

- VERY EXPENSIVE!! Each gradient step on the controller requires training a batch of child models!
- Original paper trained on 800 GPUs for 28 days!
- Followup work has focused on efficient search

Zoph and Le, "Neural Architecture Search with Reinforcement Learning", ICLR 2017

Neural architecture search can be used to find efficient CNN architectures!

Zoph et al, "Learning Transferable Architectures for Scalable Image Recognition", CVPR 2018

CNN Architectures Summary

Early work (AlexNet -> ZFNet -> VGG) shows that bigger networks work better

GoogLeNet was one of the first to focus on efficiency (aggressive stem, 1x1 bottleneck convolutions, global avg pool instead of FC layers)

ResNet showed us how to train extremely deep networks – limited only by GPU memory! Started to show diminishing returns as networks got bigger

After ResNet: Efficient networks became central: how can we improve the accuracy without increasing the complexity?

Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet, etc

Neural Architecture Search promises to automate architecture design

Which Architecture should I use?

Don’t be a hero. For most problems you should use an off-the-shelf architecture; don’t try to design your own!

If you just care about accuracy, ResNet-50 or ResNet-101 are great choices

If you want an efficient network (real-time, run on mobile, etc) try MobileNets and ShuffleNets

Next Time: Hardware and Software
