Deep Learning: What’s All the Fuss About?

Garrison W. Cottrell
Gary's Unbelievable Research Unit (GURU)
The Perceptual Expertise Network
The Temporal Dynamics of Learning Center
Computer Science and Engineering Department, UCSD

2/5/20, PACS-2017

Or, a Shallow Introduction to Deep Learning!

Outline

! What is Deep Learning?
! Why is it interesting?
! What can it do?

What is Deep Learning?

! Deep Learning refers to Deep Neural Networks
– What's a neural network?
– What's a deep neural network?

What's a neural network?

! A neural network is a kind of computational model inspired by the brain

! It consists of “units” connected by weighted links

! Here's a simple neural network, called a perceptron:

Perceptrons: A bit of history

Frank Rosenblatt studied a simple version of a neural net called a perceptron:
– A single layer of processing
– Binary output
– Can compute simple things like (some) boolean functions (OR, AND, etc.)

Perceptrons: A bit of history

! Computes the weighted sum of its inputs (the net input), compares it to a threshold, and "fires" if the net is greater than or equal to the threshold.
! Why is this a "neural" network?
– Because we can think of the input units here as "neurons" that are spreading activation to the output unit via synaptic connections – like real neurons do.

The Perceptron Activation Rule

net = Σi Wi*Xi (the net input)
output = 1 if net ≥ θ (the threshold), 0 otherwise

This is called a binary threshold unit.

Quiz!!!

X1  X2  OR(X1,X2)
0   0   0
0   1   1
1   0   1
1   1   1

Assume FALSE == 0, TRUE == 1, so if X1 is false, it is 0. Can you come up with a set of weights and a threshold so that a two-input perceptron computes OR?

Quiz

X1  X2  AND(X1,X2)
0   0   0
0   1   0
1   0   0
1   1   1

Assume FALSE == 0, TRUE == 1. Can you come up with a set of weights and a threshold so that a two-input perceptron computes AND?

Quiz

X1  X2  XOR(X1,X2)
0   0   0
0   1   1
1   0   1
1   1   0

Assume FALSE == 0, TRUE == 1. Can you come up with a set of weights and a threshold so that a two-input perceptron computes XOR?
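Before peeking at the answers, a small brute-force search (illustrative code, not from the slides) over integer weights and thresholds confirms the point of the three quizzes: OR and AND have single-perceptron solutions, while XOR does not.

```python
from itertools import product

def perceptron(x, w, theta):
    # Binary threshold unit from the previous slide.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def solvable(truth_table, grid=(-2, -1, 0, 1, 2)):
    """Search a small grid of integer weights and thresholds for a
    two-input perceptron that reproduces the given truth table."""
    for w1, w2, theta in product(grid, repeat=3):
        if all(perceptron(x, (w1, w2), theta) == y
               for x, y in truth_table.items()):
            return True
    return False

OR  = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print(solvable(OR), solvable(AND), solvable(XOR))  # True True False
```

No grid, however large, would help for XOR: its positive and negative cases are not linearly separable, which is exactly the problem the later slides return to.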

Perceptrons

The goal was to make a neurally-inspired machine that could categorize inputs – and learn to do this from examples via supervised learning: the network is presented with many input-output examples, and the weights are adjusted by a learning rule:

Wi(t+1) = Wi(t) + α*(target - output)*Xi
(the target is given by the example; α is the learning rate)

This rule is called the delta rule, because the weights are changed according to the delta – the difference between the output and the target.

How the delta rule works

Wi(t+1) = Wi(t) + α*(target - output)*Xi

So, the network is presented with an input, it computes the output, and this is compared with the target.

Notice that if the output and the target are the same, nothing happens – there is no change to the weight.

If the output is 0 and the target is 1, and Xi is 1, then the weight is raised – which will tend to make the output 1 next time.

If the output is 1 and the target is 0, and Xi is 1, then the weight is lowered– which will tend to make the output 0 next time.
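The behavior described above can be run as a short training loop. Here it learns OR; the learning rate, fixed threshold, and epoch count are arbitrary illustrative choices, not values from the slides.

```python
# Delta-rule training of a two-input perceptron on OR.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.0, 0.0]
theta = 0.5   # fixed threshold (a bias could also be learned)
alpha = 0.1   # learning rate

def output(x):
    return 1 if w[0] * x[0] + w[1] * x[1] >= theta else 0

for epoch in range(20):
    for x, target in examples:
        delta = target - output(x)         # zero when already correct
        for i in range(2):
            # Wi(t+1) = Wi(t) + alpha * (target - output) * Xi
            w[i] += alpha * delta * x[i]

print([output(x) for x, _ in examples])  # matches the OR column: [0, 1, 1, 1]
```

Note that once every example is classified correctly, delta is 0 on every pattern and the weights stop changing, just as the slide says.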

Problems with perceptrons

! The learning rule comes with a great guarantee: anything a perceptron can compute, it can learn to compute.

! Problem: Lots of things were not computable, e.g., XOR (Minsky & Papert, 1969)

! Minsky & Papert said:
– If you had hidden units, you could compute any boolean function.
– But no learning rule exists for such multilayer networks, and we don't think one will ever be discovered.

Back propagation, 25 years later

XOR: The smallest "hard problem"

Notice that the network could not do XOR.

If there is another layer (here, just one unit!) between the input and the output, then the network can compute XOR. Without the middle unit, the rest of the network just computes OR.

OR is "right" in 3 out of 4 cases for XOR:

X1  X2  OR(X1,X2)  XOR(X1,X2)
0   0   0          0
0   1   1          1
1   0   1          1
1   1   1          0

The middle unit computes "AND" of the inputs and turns the output off just for the one exception to OR: the 1,1 case!

Multi-layer neural networks

Ok, we've established that:

1. The simplest neural networks have a single layer of processing.
2. By adding one more layer, networks can compute harder problems.
3. The AND unit can be thought of as a feature that is useful for solving the task.

Now, backpropagation learning generalizes the delta rule to multi-layer networks – it can be used to learn the AND feature!

Aside about perceptrons

! They didn’t have hidden units - but Rosenblatt assumed nonlinear preprocessing!

! Hidden units compute features of the input

! The nonlinear preprocessing is a way to choose features by hand.

! Support Vector Machines essentially do this in a principled way, followed by a (highly sophisticated) perceptron learning algorithm.

Enter Rumelhart, Hinton, & Williams (1985)

! (Re-)Discovered a learning rule for networks with hidden units.
! Works a lot like the perceptron algorithm:

– Randomly choose an input-output pattern
– Present the input, let activation propagate through the network
– Give the teaching signal
– Propagate the error back through the network (hence the name back propagation)
– Change the connection strengths according to the error

[Figure: INPUTS at the bottom, Hidden Units in the middle, OUTPUTS at the top; activation flows up and error flows back down.]

! The actual algorithm uses the chain rule of calculus to go downhill in an error measure with respect to the weights

! The hidden units must learn features that solve the problem

XOR: Back Propagation Learning

[Figure: a random network and a trained XOR network.]

Here, the hidden units learned AND and OR – two features that, when combined appropriately, can solve the problem.

XOR: Back Propagation Learning

But, depending on initial conditions, there are an infinite number of ways to do XOR – backprop can surprise you with innovative solutions.
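A minimal backprop run on XOR can be sketched in a few lines. This is illustrative only: the layer sizes, learning rate, and step count are arbitrary choices, and NumPy stands in for whatever code the original demos used.

```python
import numpy as np

# Backprop on XOR: 2 inputs -> 3 hidden sigmoid units -> 1 sigmoid output.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 3)); b1 = np.zeros(3)
W2 = rng.normal(0, 1, (3, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

def forward():
    h = sigmoid(X @ W1 + b1)           # hidden features (AND-like, OR-like...)
    return h, sigmoid(h @ W2 + b2)     # network output

h, y = forward()
mse0 = float(((y - T) ** 2).mean())    # error before training

for step in range(5000):
    h, y = forward()
    # Chain rule: push the error back through each layer.
    d2 = (y - T) * y * (1 - y)         # delta at the output
    d1 = (d2 @ W2.T) * h * (1 - h)     # delta at the hidden layer
    W2 -= 0.5 * h.T @ d2; b2 -= 0.5 * d2.sum(0)
    W1 -= 0.5 * X.T @ d1; b1 -= 0.5 * d1.sum(0)

h, y = forward()
mse = float(((y - T) ** 2).mean())     # error after training
print(mse0, "->", mse)                 # outputs move toward [0, 1, 1, 0]
```

As the slide notes, what the hidden units end up computing depends on the random initial weights; different seeds give different, equally valid solutions.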

Why is/was this wonderful?

1. Learns internal representations
2. Learns internal representations
3. Learns internal representations

! Generalizes to recurrent networks

Hinton's Family Trees example

! Idea: Learn to represent relationships between people that are encoded in a family tree:

Hinton's Family Trees example

! Idea 2: Learn distributed representations of concepts.
– The outputs are localist; the network learns features of these entities useful for solving the task.
– Input: localist people and localist relations. (Localist: one unit "ON" to represent each item.)

People hidden units: Hinton diagram

! What does unit 1 encode? Unit 2? Unit 6?

! When all three are on, these units pick out Christopher and Penelope; other combinations pick out other parts of the trees.

Relation units

! What does the lower middle one code?

Lessons

! The network learns features in the service of the task - i.e., it learns features on its own.

! This is useful if we don’t know what the features ought to be.

! The networks have been used in my lab for years to explain some human phenomena

Switch to Demo

! This demo is downloadable from my website under "Resources", about the middle of the page: "A matlab neural net demo with face processing."

Multi-layer neural networks

Why is this interesting?

Because now we can train neural networks to do very interesting tasks – like face recognition, object recognition, handwriting recognition, etc.

Thus there was a lot of excitement about neural nets when the backpropagation learning rule was (re-)discovered in 1985.

BUT, it turned out that training networks with more than three layers (one "hidden layer" between the input and the output) was difficult, so more powerful networks could not be trained.

Multi-layer neural networks

Why is this interesting?

In 2012, techniques were developed that allowed us to train networks with many layers.

This means that not only can we learn features of the input, but also features of the features of the input, and features of the features of the features of the input, and so on…

This allows very difficult problems to be solved.

Multi-layer neural networks

1. A very deep network:

[Figure: a stack of many layers mapping the input to outputs such as "Gary", "Male", "Old".]

Deep Learning Overview

! Multiple layers work to build an improved feature space.
– First layer learns 1st order features (e.g. edges…)
– 2nd layer learns higher order features (combinations of first layer features, combinations of edges, etc.)
– Etc.
! What do I mean by "edges"? Often neural networks are used for image recognition, and edges in images are places where the value of the pixels changes a lot.

! Here is an “edge” in the background of this slide

! A "neuron" with low-valued weights on the left and high-valued weights on the right would "fire" to this patch of image.
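That edge-detecting unit can be sketched numerically (toy numbers, not from the slide): negative weights on the left half, positive on the right, so the net input is large exactly for a dark-to-bright patch.

```python
import numpy as np

# A toy "edge neuron" over a 1x6 patch of pixels: low (negative)
# weights on the left, high (positive) weights on the right.
w = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

edge_patch = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # dark -> bright edge
flat_patch = np.array([0.5] * 6)                        # uniform, no edge

print(w @ edge_patch)  # 3.0 -> strong net input: the unit "fires"
print(w @ flat_patch)  # 0.0 -> no response to a uniform patch
```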

Deep Learning Tasks

Here are 100 features of 11x11 image patches learned by a neural network – they respond to oriented edges:

Deep Learning Tasks

NOTE: The "pixels" shown here are actually the weights of one "unit" of the network – they are real numbers, not just 0s and 1s.

First "breakthrough": Convolutional Networks (Yann LeCun)

! Basic idea: Each rectangle here is a set of units that all learn the same feature.
! Why? Because then the network can recognize the same feature in many places.
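Weight sharing can be illustrated by sliding one small filter across every position of an image (an illustrative sketch; the filter and image values are made up):

```python
import numpy as np

def correlate(img, filt):
    """Apply the SAME weights (filt) at every image position: this is
    the weight sharing that lets a convolutional layer detect one
    feature wherever it appears."""
    fh, fw = filt.shape
    h, w = img.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + fh, j:j + fw] * filt).sum()
    return out

# A dark-to-bright filter responds (with opposite signs) at both
# edge locations in this tiny 1x4 image.
filt = np.array([[-1.0, 1.0]])
img = np.array([[0.0, 1.0, 1.0, 0.0]])
print(correlate(img, filt))  # [[ 1.  0. -1.]]
```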

First "breakthrough": Convolutional Networks

! The second innovation is to have each unit connected to a subset of the previous layer – so as you go up, the units have larger and larger areas of the input (receptive fields) that they respond to – just like our brains

First "breakthrough": Convolutional Networks

! The third innovation is to subsample previous layers – for example, out of a 3x3 patch of units, only compute the maximum value, and pass it on.
! This is called pooling.
! DEMO: http://yann.lecun.com/exdb/lenet/index.html
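Pooling as just described – keep only the maximum of each non-overlapping 3x3 patch – can be sketched as (toy input, not from the slides):

```python
import numpy as np

def max_pool(img, k=3):
    """Subsample: keep only the maximum of each non-overlapping kxk patch."""
    h, w = img.shape
    return np.array([[img[i:i + k, j:j + k].max()
                      for j in range(0, w - k + 1, k)]
                     for i in range(0, h - k + 1, k)])

img = np.arange(36).reshape(6, 6)   # a 6x6 toy "image"
print(max_pool(img))                # 2x2 output: [[14 17] [32 35]]
```

The pooled map is 9x smaller, and small shifts of a feature within a patch no longer change the output, which is part of why pooling helps recognition.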

Now, very deep networks (Geoff Hinton)

! This is (half of) a network with eight layers of convolutions and pooling units – called "AlexNet" after Alex Krizhevsky, Hinton's student.
! Trained with 1.2 million images from ImageNet – to categorize them into 1000 categories.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

! 1.2 million training images
! 1000 categories (732–1300 training images per class)
! 50,000 test images
! Large variation in images
! Many fine-scale categories (120 dog breeds, not just "dog")

ILSVRC: Overall image variance

ILSVRC: Within-category variance

These are ALL Scottish Deerhounds! Note how hard it would be for you to learn these!

Why Deep Learning

[Chart: Large Scale Visual Recognition Challenge 2012 error rates (0–35% scale). "Standard" computer vision approaches sit near the top of the scale; the deep neural network (AlexNet) is far lower.]

Why Deep Learning

[Chart: ILSVRC error rate by year, 2010–2016, versus a human baseline (Andrej Karpathy, after three days of training!). Successive winners: AlexNet (2012); ZF-Net, a better-tuned AlexNet (2013); GoogLeNet, AKA "Inception" (2014); ResNet (2015); Inception-ResNet-v2 (2016).]

How did they do it?

Depth, depth, and more depth!

A Revolution in Depth

AlexNet, 8 layers (ILSVRC 2012):
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000

(Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.)

A Revolution in Depth

[Figure: AlexNet (8 layers, ILSVRC 2012), VGG (19 layers, ILSVRC 2014), and GoogLeNet (22 layers, ILSVRC 2014), drawn to scale, layer by layer.]

A Revolution in Depth

[Figure: AlexNet (8 layers, ILSVRC 2012), GoogLeNet (22 layers, ILSVRC 2014), and ResNet (152 layers, ILSVRC 2015), drawn to scale, layer by layer.]

A Revolution in Depth

[Figure: ResNet, 152 layers, drawn layer by layer – repeated 1x1/3x3/1x1 bottleneck blocks growing from 64 to 2048 channels, ending in average pooling and a 1000-way fully connected layer.]

A Revolution in Depth

ResNet’s object detection result on COCO

Demos

! pix2cat
! Quick, Draw!
! Teachable Machine
! Neural network machine translation
! Neural network learning Breakout
! Neural Style Transfer!
! OpenAI's text generator (GPT-2)

Understanding Ourselves

! AlexNet, GoogLeNet, and friends are currently the best models of the temporal lobe of the human brain.

! We may now begin to understand how the “Halle Berry neuron” came to be…

! The "Halle Berry" neuron…

Reinforcement Learning

! That’s cool, but we told the network what to do.

! We learn by trying things out and seeing what happens.

! This is how we discover things on our own.

! Can networks do that too?

Reinforcement Learning

! Yes!

! The simplest algorithm to explain for reinforcement learning is called “policy gradient” (see Andrej Karpathy’s blog)

! Here is how it learns to play Pong

Pong: Actions are UP, DOWN

A (very!) simple Pong Policy Network

This isn’t deep, but it illustrates the idea.

A (very!) simple Pong Policy Network

To play, we take in the current image – actually, the difference between the previous image and the current image to get movement – and then we flip a coin based on the probability computed at the output.

A (very!) simple Pong Policy Network

So, we sampled from the output probability and produced 1 or 0 (t=1 or t=0). We can then compute the gradient at every step.

The trick is, at the end, we use the sign of the reward (did we win or lose?) on the gradient – and apply it to every gradient we produced over the whole game. This way, when we win, we encourage the actions that led to a win (make them more probable), and discourage the actions that led to a loss (make them less probable).

A (very!) simple Pong Policy Network

This algorithm is called “policy gradient”. It works better in this situation than the Q-learning technique used in the original Atari player paper.

Atari games: The deep network is the policy

Instead of one output, we have a softmax over all possible actions.

Again, the network is the policy

Input: screens (4) – these are "s". Output: probabilities of actions given the state, P(a|s).

In summary, policy gradient is:
1. At each step of play, sample from the softmax distribution at the output: π(s) = (0.1, 0.02, 0.6, ..., 0.01). π is the network, mapping from states to probabilities of actions.
2. Treat the sample as the "teacher": t = (0, 0, 1, 0, ..., 0) for that state/action pair.
3. At the end, multiply (t - π(s)) by the sign of the reward.
4. Backpropagate the error and change the weights.
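These four steps can be sketched with a toy softmax policy. Everything here is a made-up illustration: 4 state features and 3 actions, a single linear layer standing in for the deep network, and the reward sign supplied by the caller.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.zeros((4, 3))  # toy linear policy: 4 state features -> 3 actions

def pi(s):
    """Softmax over the action scores: the policy P(a|s)."""
    z = s @ W
    e = np.exp(z - z.max())
    return e / e.sum()

def play_step(s):
    p = pi(s)
    a = rng.choice(3, p=p)      # 1. sample an action from pi(s)
    t = np.eye(3)[a]            # 2. treat the sample as the "teacher"
    return a, t - p             # error signal (t - pi(s)) for this step

def update(states, errors, reward_sign, alpha=0.1):
    # 3. multiply each step's (t - pi(s)) by the sign of the reward,
    # 4. and backpropagate (here, through one linear layer).
    for s, err in zip(states, errors):
        W[:] += alpha * reward_sign * np.outer(s, err)

s = rng.normal(size=4)
a, err = play_step(s)
update([s], [err], reward_sign=+1)  # a "win" makes action a more probable in s
```

After the update with reward sign +1, the sampled action's probability in that state rises above its old value; with sign -1 it would fall, which is the whole mechanism.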

Again, this will make the network increase the probabilities of all of the actions when it wins, and decrease the probabilities of all of the actions when it loses. Over time, good actions will become more likely.

Scores on 49 games

! It reaches human-level play (defined as 75% of an expert’s score) on 29 of the games.

! It meets or exceeds the expert’s score – often by a lot – on 23 of the games.

Go to movie: https://www.youtube.com/watch?v=EfGD2qveGdQ

AlphaGo

! Two years ago, AlphaGo beat a top European player

! Last year, AlphaGo beat Lee Sedol

! How does it work?

AlphaGo

1. Starts by supervised training on human expert moves.

2. Then, it plays itself over and over (policy gradient RL learning).
3. Add some more complexities that we don't have time for... and you get a Go player that beat the world champion.

Last Week: AlphaGo Zero

1. No training on human moves.
2. Only plays itself.

Recurrent Networks

! That’s cool, but these are just “FOOMP networks” – the activation comes in one end and FOOMP! it hits the output.

Recurrent Networks

! If we ever want to model thought, we need recurrence.

! We all have a simulation in our heads of each other, and ourselves (unless we lead the unexamined life...)

! This “Theory of Mind” has to be reflective.

Google Translate

! In 2016, Google replaced its phrase-based statistical translation system with a deep network trained end-to-end with no linguistic knowledge added.
– It translated between English and French, German, Spanish, Portuguese, Chinese, Japanese, Korean and Turkish.

! Today, Google Translate uses Google Neural Machine Translation for translating between English and thirty languages

Google Translate

! Recent development: “Zero-shot” translation: Translating between pairs of languages it had never been trained to translate between.

Recurrent Networks

! To translate between languages, it takes in the whole sentence – and then outputs the sentence in another language.

! Does it “understand” the sentence?

! Google DeepMind is working on systems with memory that can answer arbitrary questions about an image.

What does it take to make a sentient robot?

! The ability to perceive the world
! The ability to interact with the world to achieve goals
! And of course, the ability to learn through
– Trial and error
– Accessing and storing knowledge
! The ability to reflect on yourself and your surroundings:
– To think about what your goals are – and improve them
– To be aware of yourself
– To be able to think about your own thoughts

Conclusions

! We have now made significant steps towards "real AI"
! All of this has happened in just the last five years
! There will be many more advances in the years and decades to come
! We need to be aware of (and be concerned about!) the possibilities:
– The Singularity

The Singularity

! Kurzweil describes his law of accelerating returns, which predicts an exponential increase in technologies like computers, genetics, nanotechnology, robotics, and artificial intelligence. He says this will lead to a technological singularity in the year 2045, a point where progress is so rapid it outstrips humans' ability to comprehend it.

The Singularity

! Kurzweil predicts the technological advances will irreversibly transform people as they augment their minds and bodies with genetic alterations, nanotechnology, and artificial intelligence. Once the Singularity has been reached, Kurzweil says that machine intelligence will be infinitely more powerful than all human intelligence combined.

Conclusions

! We have now made significant steps towards "real AI"
! All of this has happened in just the last five years
! There will be many more advances in the years and decades to come
! We need to be aware of (and be concerned about!) the possibilities:
– The Singularity
– Autonomous Weapons

Conclusions

! We have now made significant steps towards "real AI"
! All of this has happened in just the last five years
! There will be many more advances in the years and decades to come
! We need to be aware of (and be concerned about!) the possibilities:
– The Singularity
– Autonomous Weapons
– Android rights ("The Measure of a Man", Star Trek: The Next Generation): Lieutenant Commander Data must fight for his right of self-determination in order not to be declared the property of Starfleet and be disassembled in the name of science.

To learn more: deeplearning.net

! http://deeplearning.net/

END

Questions?

Another example

! In the next example(s), I make two points:

– The perceptron algorithm is still useful!

– Representations learned in the service of the task can explain the “Visual Expertise Mystery”

A Face Processing System

[Diagram: Pixel Level (Retina) → Gabor Filtering → Perceptual Level (V1) → PCA → Object Level (IT) → Neural Net → Category Level (Happy, Sad, Afraid, Angry, Surprised, Disgusted).]

The Face Processing System

[Diagram: the same pipeline with identity outputs: Bob, Carol, Ted, Alice.]

The Face Processing System

[Diagram: the same pipeline with a Feature level added and mixed outputs: Bob, Carol, Ted, Cup, Can, Book.]

The Gabor Filter Layer

! Basic feature: the 2-D Gabor wavelet filter (Daugman, 1985)
– These model the processing in early visual areas
! The convolution magnitudes are subsampled in a 29x36 grid
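A minimal sketch of this stage, under my own assumptions about filter parameters (one orientation and scale shown; the model used a whole filter bank): build a complex 2-D Gabor wavelet, convolve it with the image via FFT (circular convolution, which is fine for a sketch), take the complex magnitude, and subsample on a coarse grid.

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=np.pi / 4, wavelength=6.0):
    """Complex 2-D Gabor wavelet: a Gaussian window times a sinusoid."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    xr = xs * np.cos(theta) + ys * np.sin(theta)  # rotated coordinate
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))
    return envelope * np.exp(1j * 2 * np.pi * xr / wavelength)

image = np.random.rand(64, 64)   # stand-in for a face image
k = gabor_kernel()
# FFT-based (circular) convolution; s=image.shape zero-pads the kernel.
response = np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(k, s=image.shape))
magnitude = np.abs(response)     # phase-invariant "complex cell" output

# The talk's model subsampled magnitudes on a 29x36 grid over the image;
# a simple stride stands in for that grid here.
subsampled = magnitude[::8, ::8]
print(subsampled.shape)  # (8, 8)
```

Taking the magnitude rather than the raw response is what makes the representation noninvertible, a point that matters later when the receptive fields are visualized.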

The Visual Expertise Mystery

! Why would a face area process BMWs?
– Behavioral & brain data
– Model of expertise
– Results

Are you a perceptual expert?

Take the expertise test!!!**

“Identify this object with the first name that comes to mind.”

**Courtesy of Jim Tanaka, University of Victoria

“Car” - Not an expert

“2002 BMW Series 7” - Expert!

“Bird” or “Blue Bird” - Not an expert

“Indigo Bunting” - Expert!

“Face” or “Man” - Not an expert

“George Dubya” - Expert! “Jerk” or “Megalomaniac” - Democrat!

How is an object to be named?

Animal - Superordinate Level

Bird - Basic Level (Rosch et al., 1971)

Indigo Bunting - Subordinate (Species) Level

Entry Point Recognition

Animal - Semantic analysis

Bird - Entry Point (visual analysis)

Indigo Bunting - Fine-grain visual analysis (the Downward Shift Hypothesis)

Dog and Bird Expert Study

• Each expert had at least 10 years experience in their respective domain of expertise.

• None of the participants were experts in both dogs and birds.

• Participants provided their own controls.

Object Verification Task (Tanaka & Taylor, 1991)

Superordinate: Animal? Plant?

Basic: Bird? Dog?

Subordinate: Robin? Sparrow?

(Response at each level: YES or NO)

Dog and bird experts recognize objects in their domain of expertise at subordinate levels.

[Figure: mean reaction time (msec) at the superordinate (Animal), basic (Bird/Dog), and subordinate (Robin/Beagle) levels, novice domain vs. expert domain. The expert domain shows a downward shift at the subordinate level.]

Is face recognition a general form of perceptual expertise?

George W. Bush, Indigo Bunting, 2002 BMW Series 7

Face experts recognize faces at the individual level of unique identity

[Figure: mean reaction time (msec) for objects vs. faces at the superordinate, basic, and subordinate levels; faces show a downward shift to the subordinate level (Tanaka, 2001).]

Event-related Potentials and Expertise

[Figure: N170 ERP component for face experts and object experts, novice domain vs. expert domain.]

Tanaka & Curran, 2001; see also Gauthier, Curran, Curby & Collins, 2003, Nature Neuro.

Bentin, Allison, Puce, Perez & McCarthy, 1996

Neuroimaging of face, bird and car experts

[Figure: fMRI contrasts (Cars-Objects, Birds-Objects) showing fusiform gyrus activation in car experts and bird experts, i.e., the “face experts” area (Gauthier et al., 2000).]

How to identify an expert?

Behavioral benchmarks of expertise • Downward shift in entry point recognition • Improved discrimination of novel exemplars from learned and related categories

Neurological benchmarks of expertise • Enhancement of N170 ERP brain component • Increased activation of fusiform gyrus

End of Tanaka Slides

• Kanwisher showed the FFA is specialized for faces

• But she forgot to control for what???

Greeble Experts (Gauthier et al., 1999)

! Subjects trained over many hours to recognize individual Greebles.
! Activation of the FFA increased for Greebles as the training proceeded.

The “visual expertise mystery”

! If the so-called “Fusiform Face Area” (FFA) is specialized for face processing, then why would it also be used for cars, birds, dogs, or Greebles?
! Our view: the FFA is an area associated with a process: fine-level discrimination of homogeneous categories.
! But the question remains: why would an area that presumably starts as a face area get recruited for these other visual tasks? Surely, they don’t share features, do they?

Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society

Solving the mystery with models

! Main idea:
– There are multiple visual areas that could compete to be the Greeble expert - “basic” level areas and the “expert” (FFA) area.
– The expert area must use features that distinguish similar-looking inputs -- that’s what makes it an expert.
– Perhaps these features will be useful for other fine-level discrimination tasks.
! We will create:
– Basic level models - trained to identify an object’s class
– Expert level models - trained to identify individual objects
– Then we will put them in a race to become Greeble experts.
– Then we can deconstruct the winner to see why it won.

Model Database

• A network that can differentiate faces, books, cups and cans is a “basic level network.”

• A network that can also differentiate individuals within ONE class (faces, cups, cans OR books) is an “expert.”

Model

! Pretrain two groups of neural networks on different tasks.
! Compare their abilities to learn a new individual Greeble classification task.

[Figure: expert networks output individual labels (can, cup, book, Bob, Ted, Carol, plus Greeble1-3); non-expert networks output class labels (can, cup, book, face, plus Greeble1-3); both share the same hidden-layer architecture.]

Expertise begets expertise

[Figure: amount of training required to become a Greeble expert vs. training time on the first task.]

! Learning to individuate cups, cans, books, or faces first leads to faster learning of Greebles (can’t try this with kids!!!).
! The more expertise, the faster the learning of the new task!
! Hence in a competition with the object area, the FFA would win.
! If our parents were cans, the FCA (Fusiform Can Area) would win.

Entry Level Shift: Subordinate RT decreases with training (rt = uncertainty of response = 1.0 - max(output))
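The RT proxy in that heading is simple enough to state directly. A one-line sketch (the example output vectors are my own, not from the talk):

```python
import numpy as np

# The network's "reaction time" proxy from the slide:
# rt = uncertainty of the response = 1.0 - max(output).
# A confident network (one output near 1) responds "fast" (small rt).

def reaction_time(outputs):
    return 1.0 - np.max(outputs)

confident = np.array([0.05, 0.90, 0.05])  # peaked outputs: small rt
unsure = np.array([0.40, 0.35, 0.25])     # flat outputs: large rt
print(reaction_time(confident) < reaction_time(unsure))  # True
```

This is what lets a network's training curves be plotted on the same axes as human reaction-time data in the next figure.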

[Figure: subordinate and basic RT as a function of number of training sessions, network data plotted alongside human data.]

How do experts learn the task?

! Expert level networks must be sensitive to within-class variation:
– Representations must amplify small differences
! Basic level networks must ignore within-class variation:
– Representations should reduce differences

Back propagation, 25 years 136 later Observing hidden layer representations

! Principal Components Analysis on hidden unit activations:
– PCA of hidden unit activations allows us to reduce the dimensionality (to 2) and plot representations.
– We can then observe how tightly clustered stimuli are in a low-dimensional subspace.
! We expect basic level networks to separate classes, but not individuals.
! We expect expert networks to separate classes and individuals.
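A minimal sketch of this analysis with made-up activations (the fake two-cluster data, layer sizes, and random seed are my own assumptions): project each stimulus's hidden-layer vector onto the first two principal components and inspect the clustering.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake hidden activations: 40 stimuli x 20 hidden units,
# drawn as two loose "classes" with different means.
hidden = np.vstack([rng.normal(0.0, 1.0, (20, 20)),
                    rng.normal(3.0, 1.0, (20, 20))])

# PCA via SVD of the mean-centered activation matrix.
centered = hidden - hidden.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T   # 2-D coordinates, one row per stimulus

print(coords.shape)  # (40, 2)
```

Plotting `coords` (e.g., with matplotlib) is what produces the cluster pictures on the following slides: in a basic-level network the classes collapse to tight blobs, while in an expert network individuals stay spread out.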

Back propagation, 25 years 137 later Subordinate level training magnifies small differences within object representations

[Figure: 2-D PCA projections of hidden-layer representations in the Face, Greeble, and Basic networks after 1, 80, and 1280 epochs of training.]

Back propagation, 25 years 138 later Greeble representations are spread out prior to Greeble Training

[Figure: 2-D PCA projections of Greeble representations in the Basic and Face networks prior to Greeble training.]

Back propagation, 25 years 139 later Variability Decreases Learning Time

[Figure: Greeble learning time vs. Greeble variance prior to learning Greebles (r = -0.834).]

Examining the Net’s Representations

! We want to visualize “receptive fields” in the network.
! But the Gabor magnitude representation is noninvertible.
! We can learn an approximate inverse mapping, however.
! We used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel.
! Then projecting each hidden unit’s weight vector into image space with the same mapping visualizes its “receptive field.”
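A minimal sketch of that visualization trick with made-up data (the array sizes, random features, and image dimensions are all my own assumptions): fit one least-squares map from the Gabor-PCA feature space back to pixel space, then push a hidden unit's weight vector through the same map.

```python
import numpy as np

rng = np.random.default_rng(1)
n_images, n_features, n_pixels = 200, 50, 64 * 64
features = rng.normal(size=(n_images, n_features))  # PCA-space inputs
pixels = rng.normal(size=(n_images, n_pixels))      # corresponding images

# Linear regression: for each pixel, the best linear combination of
# feature components (all pixels solved at once by lstsq).
inverse_map, *_ = np.linalg.lstsq(features, pixels, rcond=None)

# Project one hidden unit's incoming weights into image space.
hidden_weights = rng.normal(size=n_features)
receptive_field = (hidden_weights @ inverse_map).reshape(64, 64)
print(receptive_field.shape)  # (64, 64)
```

With real data, `features` and `pixels` would come from the training set, and `receptive_field` would be displayed as an image like the ones on the next slide.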

Two hidden unit receptive fields

[Figure: image-space receptive fields of hidden units 16 and 36, after training as a face expert and after further training on Greebles. NOTE: These are not face-specific!]

Controlling for the number of classes

! We obtained 13 classes from hemera.com:

! 10 of these are learned at the basic level.
! 10 faces, each with 8 expressions, make the expert task.
! 3 (lamps, ships, swords) are used for the novel expertise task.

Results: Pre-training

! New initial tasks of similar difficulty: In previous work, the basic level task was much easier.

! These are the learning curves for the 10 object classes and the 10 faces.

Results

! As before, experts still learned new expert level tasks faster

[Figure: number of epochs to learn swords after learning faces or objects, plotted against the number of training epochs on faces or objects.]

The Face and Object Processing System

[Figure: the full model. Gabor Filtering → PCA → hidden layers → two classifiers: an expert-level classifier (FFA: Bob, Sue, Jane, plus book, can, cup, face) and a basic-level classifier (LOC: book, can, cup, face). Levels: Pixel (Retina) → Perceptual (V1) → Gestalt → Hidden → Category.]

Backprop, 25 years later

! Backprop is important because it was the first relatively efficient method for learning internal representations

! Recent advances have made deeper networks possible

! This is important because we don’t know how the brain uses transformations to recognize objects across a wide array of variations (e.g., the Halle Berry neuron)

! E.g., the “Halle Berry” neuron…

END
