: new contributions!

Christian Wolf! Université de Lyon, INSA-Lyon! LIRIS UMR CNRS 5205!

March 10, 2016!

Data interpretation!


Friends and foes!

Where am I? Visual landmarks!

« What is a cat? »!

Children automatically learn to recognize objects from shapes and visual input, given examples.! Human learning is a mix of supervised and unsupervised learning!

Learning to predict!

{dog, cat, pokemon, helicopter, …}

{A. Turing, A. Einstein, S. Cooper, …}

{0, 1, … 24, 25, 26, …, 98, 99, …}

We would like to predict a value t from an observed input x.! Parameters are learned from training data.!

A biological neuron!

Devin K. Phillips!

Artificial neural networks! « Perceptron »!

(Slide background: a page from “Recognition of complex activities through unsupervised learning of structured attribute models”, April 30, 2015.)

y(\mathbf{x}, \mathbf{w}) = \sum_{i=0}^{D} w_i x_i

Input layer! Output layer!
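To make the perceptron formula above concrete, here is a minimal NumPy sketch (the weights and toy inputs are illustrative, not from the slides): it computes the weighted sum with x_0 clamped to 1 as the bias input and applies the perceptron's step nonlinearity.

```python
import numpy as np

def perceptron_output(x, w):
    """y(x, w) = sum_{i=0}^{D} w_i x_i, with x_0 clamped to 1 as the bias input."""
    x = np.concatenate(([1.0], x))        # prepend the bias input x_0 = 1
    activation = float(np.dot(w, x))      # weighted sum over all D+1 inputs
    return 1 if activation >= 0.0 else 0  # step nonlinearity of the classic perceptron

# Toy usage with hand-picked weights (purely illustrative): a logical AND of two inputs.
w = np.array([-1.5, 1.0, 1.0])            # [bias weight, w_1, w_2]
print(perceptron_output(np.array([0.0, 1.0]), w))  # -> 0
print(perceptron_output(np.array([1.0, 1.0]), w))  # -> 1
```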

Multi-layer perceptron (MLP)!

« Fully-connected » layers!

a_j^{(1)} = \sum_{i=0}^{D} w_{ji}^{(1)} x_i

y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)

Input layer! Hidden layer! Output layer!

[C. Bishop, Pattern Recognition and Machine Learning, 2006]
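A minimal sketch of the two-layer network function y_k(x, w) above, assuming tanh for the hidden nonlinearity h and a logistic sigmoid for σ; the layer sizes and random weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2):
    """y_k(x, w) = sigma( sum_j W2[k, j] * h( sum_i W1[j, i] * x_i ) )."""
    x = np.concatenate(([1.0], x))            # bias input x_0 = 1
    hidden = np.tanh(W1 @ x)                  # first layer of adaptive weights + nonlinearity h
    hidden = np.concatenate(([1.0], hidden))  # bias unit for the second layer
    return sigmoid(W2 @ hidden)               # second layer + output nonlinearity sigma

rng = np.random.default_rng(0)
D, M, K = 4, 3, 2                             # inputs, hidden units, outputs (illustrative)
W1 = rng.normal(size=(M, D + 1))              # first-layer weights w^(1)_ji, incl. bias column
W2 = rng.normal(size=(K, M + 1))              # second-layer weights w^(2)_kj, incl. bias column
print(mlp_forward(rng.normal(size=D), W1, W2))  # K outputs, each in (0, 1)
```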

Deep neural network!


Learning by gradient descent!

Given the loss function (here, the multiclass cross-entropy):

E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w})

t_{kn}: target (ground-truth) label! y_k: currently estimated label!

Iterative minimisation through gradient descent:

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)})

Can be blocked in a local minimum!

[C. Bishop, Pattern Recognition and Machine Learning, 2006]!
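A minimal sketch of the gradient-descent update w^(τ+1) = w^(τ) − η ∇E(w^(τ)); the quadratic loss and the learning rate are illustrative stand-ins for a real network loss.

```python
import numpy as np

def E(w):
    """Illustrative loss: a simple quadratic bowl with its minimum at (3, -2)."""
    return 0.5 * ((w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2)

def grad_E(w):
    """Gradient of the illustrative loss."""
    return np.array([w[0] - 3.0, w[1] + 2.0])

w = np.zeros(2)          # w^(0)
eta = 0.1                # learning rate
for step in range(200):  # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
    w = w - eta * grad_E(w)

print(w, E(w))           # close to [3, -2], loss close to 0
```

The on-line (stochastic) variant applies the same update with the gradient of a single training term E_n instead of the full sum.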

1960s! Alexey Ivakhnenko works on deep neural networks!

1986! G. Hinton proposes the backpropagation algorithm in its current form!

AI Winter

Vision and speech communities do not use neural networks! Flat structures dominate! Handcrafted features dominate (SIFT, HOG, LBP, etc.)!

2006! G. Hinton proposes deep belief networks! Layerwise unsupervised training! GPUs and clusters! Large datasets!

11/2011! Break-through of Microsoft in speech recognition with deep neural networks!

10/2012! G. Hinton’s group wins ImageNet / ILSVRC!

Rank  Name         Error rate  Description
1     U. Toronto   0.15315     Deep learning
2     U. Tokyo     0.26172     Handcrafted features and learning models. Bottleneck.
3     U. Oxford    0.26979     Handcrafted features and learning models. Bottleneck.
4     Xerox/INRIA  0.27058     Handcrafted features and learning models. Bottleneck.

Object recognition over 1,000,000 images and 1,000 categories (2 GPUs)

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS, 2012.


- 1,461,406 annotated images in ILSVRC 2010!
- 1,000 classes, from WordNet!
- Manual annotation and careful verification: crowd-sourcing with Amazon Mechanical Turk!

Fig. 4 (Russakovsky et al.): random selection of images in the ILSVRC detection validation set; the top 4 rows were taken from the ILSVRC2012 single-object localization validation set, the bottom 4 rows were collected from Flickr using scene-level queries.

03/2013! G. Hinton joins Google!

09/2013! Clarifai wins ImageNet / ILSVRC (PhD student at NYU)!

• ImageNet 2013 – image classification challenge
Classification task:! First 20 entries: deep learning!

Rank  Name    Error rate  Description
1     NYU     0.11197     Deep learning
2     NUS     0.12535     Deep learning
3     Oxford  0.13555     Deep learning

MSRA, IBM, Adobe, NEC, Clarifai, Berkeley, U. Tokyo, UCLA, UIUC, Toronto… the top 20 groups all used deep learning.

• ImageNet 2013 – object detection challenge
Detection task:!

Rank  Name          Mean Average Precision  Description
1     UvA-Euvision  0.22581                 Handcrafted features
2     NEC-MU        0.20895                 Handcrafted features
3     NYU           0.19400                 Deep learning

12/2013! Y. LeCun directs Facebook’s new AI lab!

01/2014! Google acquires DeepMind for 400M$!

05/2014! A. Ng directs Baidu’s new AI lab!

09/2014! Google wins ImageNet / ILSVRC!

• ImageNet 2014 – image classification challenge
Classification task:! First 20 entries: deep learning!

Rank  Name    Error rate  Description
1     Google  0.06656     Deep learning
2     Oxford  0.07325     Deep learning
3     MSRA    0.08062     Deep learning

• ImageNet 2014 – object detection challenge
Detection task:!

Rank  Name             Mean Average Precision  Description
1     Google           0.43933                 Deep learning
2     CUHK             0.40656                 Deep learning
3     Deep Insight     0.40452                 Deep learning
4     UvA-Euvision     0.35421                 Deep learning
5     Berkeley Vision  0.34521                 Deep learning

09/2014! Hinton NIPS paper mentions an internal Google dataset with 100 000 000 labelled images and 18 000 classes!

04/2015! Clarifai raises 10M$ in venture capital!

12/2015! Microsoft Research wins ImageNet 2015!

12/2015! Google introduces its deep learning and cloud-based « Vision API »!

12/2015! Creation of « OpenAI », a non-profit AI foundation. Director: Ilya Sutskever (worked with Hinton, Ng, Google)!

07/2015–03/2016! Public perception of deep learning increases; articles appear in « Le Monde » and « The NY Times »; portrait of Yann LeCun on « France 2 »!

02/2016! Y. LeCun is appointed to the Collège de France!

03/2016! DeepMind’s AI defeats one of the world’s best Go players with deep neural networks!

Why does it work?!

What do we feed into a neural network?!

Images have thousands … up to millions of pixels / values. Three possibilities:"

- Direct classification of the pixel input!
- Manual feature extraction! « Handcrafted »!
- Learning to extract features!

Learn to extract features! (Deep Learning)!

Network for feature extraction → Network for classification or regression

Challenge: learn interesting features (invariance, discriminative power) from a limited amount of training data.

What is wrong with fully connecting pixels?!

400×400 pixels, flattened into a vector of 160,000 values!
500 units in the first hidden layer!

500 × 160,000 = 80 million parameters for a single layer!

What is wrong with fully connecting pixels?!

Weights learned for one part of the image do not carry over / generalize for other parts of the image."

Wasteful and does not generalize:! No spatial invariance for images!
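A tiny worked example of the parameter count above, contrasting the fully connected layer of the previous slide with a convolutional layer that shares its weights across image positions; the 5×5 filter size is an illustrative assumption.

```python
# Fully connected: every one of the 400x400 = 160,000 pixels connects to each of 500 units.
pixels = 400 * 400
fc_params = 500 * pixels        # = 80,000,000 weights (biases left out for simplicity)

# Convolutional alternative (assumption): 500 feature maps, each with one shared 5x5 filter.
conv_params = 500 * (5 * 5)     # = 12,500 weights, independent of the image size

print(fc_params, conv_params)   # 80000000 12500
```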

Convolutional neural networks!

Figure: an example of a convolutional neural network taking a single-channel 2D image as input (filters with shared weights → feature maps → pooling → … → flattening → fully connected layers → output). In the general case, an arbitrary number of convolutional layers (or convolutional-pooling pairs) can be stacked on top of each other, combined with an arbitrary number of fully connected layers closer to the output of the network.

Introduced by Fukushima in 1980!
Refined by LeCun, Bottou, Bengio, Haffner in 1998!
• Parameter sharing in convolutional layers (see the sketch below)!
• End-to-end training of the full set of parameters!
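A minimal NumPy sketch of the two basic operations of such a network, a convolution with a shared filter followed by max pooling; the image, filter, and shapes are illustrative, and the loops are written for clarity rather than speed.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution with one shared filter (no padding, stride 1).
    As is usual in CNN implementations, the kernel is not flipped (cross-correlation)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """2x2 max pooling: keep the maximum of each non-overlapping 2x2 block."""
    H, W = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    f = fmap[:H, :W]
    return f.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

image = np.random.default_rng(0).normal(size=(8, 8))   # toy single-channel image
kernel = np.array([[1.0, 0.0, -1.0]] * 3)               # illustrative edge-like 3x3 filter
feature_map = np.maximum(conv2d(image, kernel), 0.0)    # shared-weight convolution + ReLU
print(max_pool2x2(feature_map).shape)                    # (3, 3)
```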

Still no full invariance: data augmentation!

- Add cropped and mirrored versions of each training image! - Increases invariance!
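A minimal sketch of this kind of augmentation, a random crop plus an optional horizontal mirror of one training image; the crop size is an illustrative choice.

```python
import numpy as np

def augment(image, crop=24, rng=np.random.default_rng()):
    """Return a randomly cropped and possibly mirrored copy of an (H, W, C) image."""
    H, W, _ = image.shape
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    patch = image[top:top + crop, left:left + crop]   # random crop
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                        # horizontal mirror
    return patch

image = np.zeros((32, 32, 3))                         # toy RGB training image
print(augment(image).shape)                           # (24, 24, 3)
```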

Why does it work so well?!

- Parameter sharing (convolutional layers)!

- Deep models!
- Massive amounts of training data!
  – Pascal VOC: 20 000 annotated images!
  – ImageNet: 1 500 000 annotated images!
- Computational power for training (GPUs!)!
- New regularization techniques (dropout etc.; see the sketch below)!
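As a minimal sketch of the dropout idea mentioned in the last bullet, hidden activations are multiplied by a random binary mask during training; the keep probability and the "inverted" rescaling are common choices shown here as an assumption, not the exact recipe of the cited systems.

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=np.random.default_rng(0)):
    """Randomly zero hidden units during training; rescale so the expected value is unchanged."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)   # "inverted dropout" scaling

hidden = np.ones(10)
print(dropout(hidden))   # roughly half the units zeroed, the remaining ones scaled to 2.0
```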

Deeper and deeper…!

2012: AlexNet, 8 layers, trained on 2 GPUs. New technique: dropout!

Figure 2 (Krizhevsky et al.): illustration of the AlexNet architecture, explicitly showing the delineation of responsibilities between the two GPUs, which communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the remaining layers is 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.

2014: GoogLeNet, 20 layers. New technique: intermediate supervision!

2015: Microsoft Research, 150 layers (!!). New technique: residual learning!

Visualization of learned network features!

- Select the strongest activations in the feature map!
- Backproject them into the lower layers!
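A minimal sketch of the first step above, locating the strongest activations in a stack of feature maps; the random tensor stands in for real network activations, and the actual backprojection through the deconvnet is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(64, 13, 13))     # (channels, height, width): toy activations

# Position of the single strongest activation over all maps, and the top 9 within that map.
c, y, x = np.unravel_index(np.argmax(feature_maps), feature_maps.shape)
top9 = np.argsort(feature_maps[c].ravel())[-9:]  # positions to backproject into lower layers
print((c, y, x), top9)
```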

Layer 1 … Layer 5

Fig. 2 (Zeiler and Fergus): visualization of features in a fully trained model. For layers 2–5, the top 9 activations in a random subset of feature maps across the validation data are projected down to pixel space using the deconvolutional network approach, together with the corresponding image patches. Note (i) the strong grouping within each feature map, (ii) the greater invariance at higher layers, and (iii) the exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs.

[Zeiler and Fergus, ECCV 2014]!


Features during training! A single strongest activation in the feature map, downprojected! Epochs 1, 2, 5, 10, 20, 30, 40, 64!

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Fig. 4 (Zeiler and Fergus): evolution of a randomly chosen subset of model features through training, shown at epochs [1, 2, 5, 10, 20, 30, 40, 64]. Each visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using the deconvnet approach.

[Zeiler and Fergus, ECCV 2014]!

Beyond classification:! Other applications!

Hand pose estimation / regression!

Work of Natalia Neverova, PhD student at LIRIS! With Graham W. Taylor, University of Guelph, Canada!

Training: big data!

Dataset 1: real data (NYU)! Dataset 2: synthetic data (ours)!

[Neverova, Wolf, Taylor, Nebout, arXiv 2015]!

Gesture recognition!

Figure: multi-modal architecture. Path V1 (intensity and depth video, right hand) and path V2 (intensity and depth video, left hand) each pass through convolutional layers (ConvC1/ConvC2, ConvD1/ConvD2) and max pooling into hidden layers (HLV1, HLV2); path M (mocap stream) goes through a pose feature extractor (HLM1–HLM3); path A (audio stream) feeds mel-frequency spectrograms through ConvA1 into HLA1/HLA2. All paths meet in a shared hidden layer (HLS) before the output layer.

Work of Natalia Neverova, PhD student at LIRIS! With Graham W. Taylor, University of Guelph, Canada!

Learning sequences!

Figure: a recurrent network unrolled over time (input layer and output layer at time steps t−2, t−1, t, with the hidden state carried from one step to the next).
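A minimal sketch of the recurrence pictured above: the same weights are applied at every time step and a hidden state is carried from t−1 to t; sizes and random weights are illustrative.

```python
import numpy as np

def rnn_forward(inputs, Wxh, Whh, Why):
    """Plain recurrent network: h_t = tanh(Wxh x_t + Whh h_{t-1}), y_t = Why h_t."""
    h = np.zeros(Whh.shape[0])
    outputs = []
    for x_t in inputs:                       # one step per element of the sequence
        h = np.tanh(Wxh @ x_t + Whh @ h)     # the same weights are reused at every time step
        outputs.append(Why @ h)
    return outputs

rng = np.random.default_rng(0)
D, H, K = 5, 8, 3                            # input, hidden and output sizes (illustrative)
Wxh, Whh, Why = rng.normal(size=(H, D)), rng.normal(size=(H, H)), rng.normal(size=(K, H))
sequence = [rng.normal(size=D) for _ in range(4)]
print(len(rnn_forward(sequence, Wxh, Whh, Why)))   # 4 output vectors, one per time step
```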

Leo Tolstoy’s “War and Peace”! Iteration 100:!

Iteration 500:!

Iteration 2000:!

[http://karpathy.github.io/2015/05/21/rnn-effectiveness/]!

Multimodal deep learning:! Image captioning!

[Karpathy et al., CVPR 2015]!

Image captioning!

[Karpathy et al., 2015]!

Entering PINs on smartphones is painful!

© 2005 Scott Adams, Inc. / Dist. by UFS, Inc.

Automatically authenticate smartphone users given their behavior (=interaction style). Shut off phone when theft is detected.!

Project « Abacus » (Google)!

• 1500 volunteers, 1500 Nexus 5 smartphones!
• Several months of natural daily usage, 27.6 TB of data!
• Multiple sensors: camera, touchscreen, GPS, Bluetooth, Wi-Fi, cell antenna, inertial sensors, magnetometer!
• This work: inertial sensors and magnetometer only, recorded at 200 Hz!
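To make the sampling rate concrete, a small sketch that slices a 200 Hz, 6-axis inertial stream into fixed 10-second windows (2,000 samples each); the toy data are illustrative, and the window length matches the windows shown on the evaluation slide below.

```python
import numpy as np

rate_hz = 200                     # inertial sensors and magnetometer sampled at 200 Hz
window_s = 10                     # 10-second windows
window_len = rate_hz * window_s   # 2000 samples per window

stream = np.zeros((60 * rate_hz, 6))              # one minute of toy 6-axis inertial data
n_windows = stream.shape[0] // window_len
windows = stream[:n_windows * window_len].reshape(n_windows, window_len, stream.shape[1])
print(windows.shape)                               # (6, 2000, 6)
```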

Work of Natalia Neverova, PhD student at LIRIS! With Graham W. Taylor, University of Guelph, Canada!

[Neverova, Wolf, Lacey, Fridman, Chandra, Barbello, Taylor, arXiv 2015]!

Dense clockwork recurrent networks!

Validation! Test!

(Figure: validation and test sequences split into 1-second and 10-second windows.)

[Neverova, Wolf, Lacey, Fridman, Chandra, Barbello, Taylor, arXiv 2015]!

Deep reinforcement learning?!

[Mnih et al., Nature 2015]!

Conclusion!

- Deep learning methods are general and work well with completely different kinds of data (text, sparse data).!
- They work best on large amounts of training data (big data!)!
- Hungry for computational resources (clusters, GPUs, clusters of GPUs, FPGAs…).!
- Require a certain expertise to train (“babysitting”).!
- Data augmentation is often important to increase invariance properties!
- Including invariance properties into the model is a goal and a challenge!
