Deep Learning: New Contributions
Christian Wolf, Université de Lyon, INSA-Lyon, LIRIS UMR CNRS 5205
March 10, 2016
Data interpretation
="?"
Friends and foes
Where am I? Visual landmarks
« What is a cat? »

Children automatically learn to recognize objects from shapes and visual input, given examples. Human learning is mixed supervised / unsupervised.

Learning to predict
{dog, cat, pokemon, helicopter, …}
{A. Turing, A. Einstein, S. Cooper, …}
{0, 1, … 24, 25, 26, …, 98, 99, …}
We would like to predict a value t, taken from a set such as the ones above, from an observed input x. The parameters of the predictor are learned from training data.
A biological neuron
(Image credit: Devin K. Phillips)
Artificial neural networks: « Perceptron »

y(x, w) = \sum_{i=0}^{D} w_i x_i

(Diagram: input layer → output layer.)
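Below is a minimal perceptron sketch in Python/NumPy, our own illustration rather than code from the talk; the function names and the mistake-driven training loop are assumptions of this example.

```python
import numpy as np

def perceptron_output(x, w):
    # y(x, w) = sum_{i=0}^{D} w_i x_i, with x[0] = 1 absorbing the bias.
    return np.dot(w, x)

def perceptron_train(X, t, epochs=10, lr=0.1):
    # Classic mistake-driven perceptron rule, targets t in {-1, +1}.
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias input x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            if np.sign(perceptron_output(x_n, w)) != t_n:
                w += lr * t_n * x_n            # update only on misclassified points
    return w
```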
Multi-layer perceptron (MLP): « fully-connected » layers

As discussed in Bishop (section 5.1), the bias parameters can be absorbed into the set of weight parameters by defining an additional input variable x_0 clamped at x_0 = 1, so that the first-layer activations take the form

a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i.    (5.8)

Absorbing the second-layer biases into the second-layer weights in the same way, the overall network function becomes

y_k(x, w) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right).    (5.9)

The network comprises two stages of processing, each of which resembles the perceptron model, and for this reason it is also known as the multilayer perceptron (MLP). A key difference is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step functions; the network function is therefore differentiable with respect to the parameters, a property that plays a central role in training. If the activation functions of all hidden units are linear, an equivalent network without hidden units can always be found, since the composition of successive linear transformations is itself a linear transformation. Terminology for counting layers varies in the literature; counting layers of adaptive weights, the network above is a two-layer network.

(Diagram: input layer → hidden layer → output layer.)
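As a minimal sketch of the forward pass in eq. (5.9), assuming tanh hidden units h(·) and logistic-sigmoid outputs σ(·); the code and names are our own illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2):
    # Eq. (5.9): y_k = sigma( sum_j W2[k, j] * h( sum_i W1[j, i] * x_i ) ),
    # with biases absorbed by clamping an extra input to 1 at each layer.
    x = np.concatenate(([1.0], x))    # x_0 = 1
    z = np.tanh(W1 @ x)               # hidden-unit activations h(.)
    z = np.concatenate(([1.0], z))    # z_0 = 1 for the second-layer bias
    return sigmoid(W2 @ z)            # output activations sigma(.)

# Shapes: W1 is (M, D+1) and W2 is (K, M+1)
# for D inputs, M hidden units and K outputs.
```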
Deep neural network
Network training
If we consider a training set of independent observations, the error function, given by the negative log-likelihood, is a cross-entropy error of the form

E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}    (5.21)

where y_n denotes y(x_n, w). Simard et al. (2003) found that using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization.

If we have K separate binary classifications to perform, we can use a network with K logistic-sigmoid outputs and binary class labels t_k ∈ {0, 1}, k = 1, ..., K. Assuming the class labels independent given the input vector, the conditional distribution of the targets is

p(t | x, w) = \prod_{k=1}^{K} y_k(x, w)^{t_k} [1 - y_k(x, w)]^{1 - t_k}    (5.22)

and taking the negative logarithm of the corresponding likelihood gives the error function

E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}.    (5.23)

Finally, for the standard multiclass problem in which each input is assigned to one of K mutually exclusive classes, the targets t_k ∈ {0, 1} have a 1-of-K coding and the outputs are interpreted as y_k(x, w) = p(t_k = 1 | x), leading to the error function

E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)    (5.24)

where t_{kn} is the target (ground-truth) label and y_k(x_n, w) the currently estimated label. The matching output-unit activation function is the softmax,

y_k(x, w) = \frac{\exp(a_k(x, w))}{\sum_j \exp(a_j(x, w))}    (5.25)

which satisfies 0 ≤ y_k ≤ 1 and \sum_k y_k = 1. The y_k are unchanged if a constant is added to all the a_k(x, w); this degeneracy can be removed with a regularization term. In each case, the derivative of the error with respect to an output-unit activation takes the same simple form, y_n − t_n (Bishop, eq. 5.18).

Learning by gradient descent: the error E(w) can be viewed as a surface sitting over weight space (Bishop, figure 5.5: w_A is a local minimum, w_B the global minimum, and at any point w_C the local gradient of the error surface is given by the vector ∇E). The simplest approach is iterative minimisation through small steps in the direction of the negative gradient,

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})    (5.41)

where the parameter η > 0 is the learning rate. The error is defined with respect to a training set, so each step requires processing the entire training set; techniques that use the whole data set at once are called batch methods. Gradient descent can be blocked in a local minimum.
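To make eqs. (5.21) and (5.41) concrete, here is an illustrative batch-gradient-descent sketch for a single-layer logistic model, for which the gradient of the cross-entropy error is X^T (y - t). This is our own example, not code from the talk:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(y, t):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }   (eq. 5.21)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient_descent(X, t, eta=0.1, steps=1000):
    # Batch update w <- w - eta * grad E(w)   (eq. 5.41).
    # For a logistic model y = sigmoid(X w), grad E(w) = X^T (y - t).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        y = sigmoid(X @ w)
        w -= eta * X.T @ (y - t)   # uses the whole training set: a batch method
    print("final cross-entropy:", cross_entropy(sigmoid(X @ w), t))
    return w

# The stochastic variant (eq. 5.43 below) instead updates with a single
# term grad E_n, i.e. one data point (x_n, t_n) at a time.
```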
Although such an approach might intuitively seem reasonable, in fact simple gradient descent turns out to be a poor algorithm, for reasons discussed in Bishop and Nabney (2008). For batch optimization there are more efficient methods, much more robust and much faster, such as conjugate gradients and quasi-Newton methods (Gill et al., 1981; Fletcher, 1987; Nocedal and Wright, 1999); unlike gradient descent, these algorithms decrease the error at each iteration unless the weight vector has reached a local or global minimum. To find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time from a different randomly chosen starting point, comparing performance on an independent validation set.

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets (Le Cun et al., 1989). Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point,

E(w) = \sum_{n=1}^{N} E_n(w).    (5.42)

On-line gradient descent, also known as sequential or stochastic gradient descent, makes an update to the weight vector based on one data point at a time:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n(w^{(\tau)}).    (5.43)

[C. Bishop, Pattern Recognition and Machine Learning, 2006]

1960s: Alexey Ivakhnenko works on deep neural networks.
1986: G. Hinton and colleagues (Rumelhart, Hinton and Williams) propose the backpropagation algorithm in its current form.
AI Winter
The vision and speech communities do not use neural networks; flat structures and handcrafted features (SIFT, HOG, LBP, etc.) dominate.
2006: G. Hinton proposes deep belief networks, trained with layerwise unsupervised training. Enabling factors: GPUs and clusters, large datasets.
11/2011: Breakthrough by Microsoft in speech recognition with deep neural networks.
10/2012: G. Hinton's group wins ImageNet / ILSVRC.

Rank  Name         Error rate  Description
1     U. Toronto   0.15315     Deep learning
2     U. Tokyo     0.26172     Hand-crafted features and learning models (bottleneck)
3     U. Oxford    0.26979     Hand-crafted features and learning models (bottleneck)
4     Xerox/INRIA  0.27058     Hand-crafted features and learning models (bottleneck)
Object recognition over 1,000,000 images and 1,000 categories (trained on 2 GPUs).
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS, 2012.
- 1,461,406 annotated images in ILSVRC 2010
- 1,000 classes, from WordNet
- Manual annotation and careful verification
- Crowd-sourcing with Amazon Mechanical Turk
Fig. 4 (Russakovsky et al.): Random selection of images in the ILSVRC detection validation set. The images in the top 4 rows were taken from the ILSVRC2012 single-object localization validation set, and the images in the bottom 4 rows were collected from Flickr using scene-level queries.
From Russakovsky et al., on the sources of detection training images: …tage of all the positive examples available. The second source (24%) is negative images which were part of the original ImageNet collection process but voted as negative: for example, some of the images were collected from Flickr and search engines for the ImageNet synset “animals” but during the manual verification step did not collect enough votes to be considered as containing an “animal.” These images were manually re-verified for the detection task to ensure that they did not in fact contain the target objects. The third source (13%) is images collected from Flickr specifically for the detection task. These images were added for ILSVRC2014 following the same protocol as the second type of images in the validation and test set. This was done to bring the training and testing distributions closer together.
03/2013: G. Hinton joins Google.
09/2013: Clarifai wins ImageNet / ILSVRC (a PhD student at NYU).

ImageNet 2013, image classification challenge. Classification task:

Rank  Name    Error rate  Description
1     NYU     0.11197     Deep learning
2     NUS     0.12535     Deep learning
3     Oxford  0.13555     Deep learning

First 20 entries: deep learning. MSRA, IBM, Adobe, NEC, Clarifai, Berkeley, U. Tokyo, UCLA, UIUC, Toronto, …: the top 20 groups all used deep learning.

ImageNet 2013, object detection challenge. Detection task:

Rank  Name          Mean Average Precision  Description
1     UvA-Euvision  0.22581                 Hand-crafted features
2     NEC-MU        0.20895                 Hand-crafted features
3     NYU           0.19400                 Deep learning

12/2013: Y. LeCun directs Facebook's new AI lab.
01/2014: Google acquires DeepMind for $400M.

05/2014: A. Ng directs Baidu's new AI lab.
09/2014: Google wins ImageNet / ILSVRC.

ImageNet 2014, image classification challenge. Classification task (first 20 entries: deep learning):

Rank  Name    Error rate  Description
1     Google  0.06656     Deep learning
2     Oxford  0.07325     Deep learning
3     MSRA    0.08062     Deep learning

ImageNet 2014, object detection challenge. Detection task:

Rank  Name             Mean Average Precision  Description
1     Google           0.43933                 Deep learning
2     CUHK             0.40656                 Deep learning
3     Deep Insight     0.40452                 Deep learning
4     UvA-Euvision     0.35421                 Deep learning
5     Berkeley Vision  0.34521                 Deep learning
09/2014: A NIPS paper from Hinton's group mentions an internal Google dataset with 100,000,000 labelled images and 18,000 classes.
04/2015: Clarifai raises $10M in venture capital.
12/2015: Microsoft Research wins ImageNet 2015.
12/2015: Google introduces its deep-learning-based, cloud-hosted « Vision API ».
12/2015: Creation of « OpenAI », a non-profit AI foundation. Director: Ilya Sutskever (who has worked with Hinton, Ng, and Google).
07/2015 to 03/2016: Public perception of deep learning increases; articles appear in « Le Monde » and « The NY Times »; portrait of Yann LeCun on « France 2 ».
02/2016: Y. LeCun is appointed to the Collège de France.
03/2016: DeepMind's AI defeats one of the world's best Go players using deep neural networks.

Why does it work?
What do we feed into a neural network?
Images have thousands, up to millions, of pixels / values. There are three possibilities:
1. Direct classification of the raw pixel input.
2. Manual feature extraction (« handcrafted » features), followed by classification.
3. Learning to extract features.
Learn to extract features (deep learning)

A network for feature extraction feeds a network for classification or regression.
Challenge: learn interesting features (invariance, discriminative power) from a limited amount of training data.
What is wrong with fully connecting pixels?
A 400×400-pixel image is flattened into a vector of 160,000 values. With 500 units in the first hidden layer, this already gives 500 × 160,000 = 80 million parameters for a single layer.
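The slide's parameter count as a quick check (numbers from the slide):

```python
pixels = 400 * 400       # image flattened into a vector of 160,000 values
hidden = 500             # units in the first hidden layer
print(pixels * hidden)   # 80,000,000 weights in this single fully-connected layer
```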
Weights learned for one part of the image do not carry over / generalize to other parts of the image. This is wasteful and does not generalize: there is no spatial invariance for images.
Convolutional neural networks
(Diagram: input → convolution filters with shared weights W → pooling → further convolution and pooling layers → flattening → output layer.)
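A minimal sketch of the two building blocks named in the diagram, convolution with shared weights and pooling; this is our own illustration (and, as is usual in CNN code, the "convolution" is implemented as cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # The same small kernel (the shared weights) is applied at every
    # image location; only the "valid" region is kept.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, s=2):
    # Non-overlapping s x s max pooling.
    h, w = x.shape
    x = x[:h - h % s, :w - w % s]
    return x.reshape(h // s, s, w // s, s).max(axis=(1, 3))
```

A 5×5 filter carries only 25 shared weights, reused at every position of the 400×400 image, in contrast to the 80 million weights of the fully-connected layer above.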