Deep Learning: New Contributions
Christian Wolf
Université de Lyon, INSA-Lyon
LIRIS UMR CNRS 5205
10 March 2016

Data interpretation

Friends and foes

Where am I? Visual landmarks

"What is a cat?"
Children automatically learn to recognize objects from shapes and visual input, given examples. Human learning is a mix of supervised and unsupervised learning.

Learning to predict
We would like to predict a value t from an observed input. The target may be a class label in {dog, cat, pokemon, helicopter, ...}, an identity in {A. Turing, A. Einstein, S. Cooper, ...}, or a number in {0, 1, ..., 24, 25, 26, ..., 98, 99, ...}. The parameters of the predictor are learned from training data.

A biological neuron
(Illustration: Devin K. Phillips)

Excerpt: "Recognition of complex activities through unsupervised learning of structured attribute models" (April 30, 2015)

Abstract. Complex human actions are modelled as graphs over basic attributes and their spatial and temporal relationships. The attribute dictionary as well as the graphical structure are learned automatically from training data.

1  The activity model
A set C = {c_1, c_2, ..., c_C} of C complex activities is considered, where each activity c_i ∈ C is modelled as a graph G_i = (V_i, E_i), and both nodes and edges of the graphs are valued. For convenience, in the following we will drop the index i and consider a single graph G for a given activity. The set of nodes V of the graph is indexed (here denoted by j) and corresponds to occurrences of basic attributes. Each node j is assigned a vector of 4 values {a_j, x_j, y_j, t_j}: the attribute type a_j and a triplet of spatio-temporal coordinates x_j, y_j and t_j. The edges define pairwise logical spatial or temporal relationships (before, after, overlaps, is included, near, ...).
Examples of graphs for the three activities Leaving a baggage unattended, Telephone conversation, and Handshaking between two people are shown in figure 1. Note that one node of a model graph can be matched with several consecutive attribute occurrences in the test video. For instance, when the model Telephone conversation (figure 1b) is matched, the node Person will in general be matched to multiple occurrences of a person in the video, for as long as the conversation lasts. Also, the graphs shown in the figure are only examples; the actual graphs are learned automatically and will in general not correspond to a graph designed by a human.
The attribute type variables a_j = k may take values k in an alphabet Λ = {1, ..., L}. These values can correspond to fixed (manually designed) types, as for instance Person, or to automatically learned attributes. Associated to each possible type k is a feature function φ_k(v, x, y, t; Θ_k) ∈ {0, 1} which evaluates whether the attribute is found (= 1) or not (= 0) in the spatio-temporal block of the video v centered on (x, y, t). The parameters (to be learned) of these functions are denoted by Θ_k. Each edge e_jk between two nodes j and k is assigned an edge label which may take values in an alphabet Υ (before, after, overlaps, is included, near, ...). There may be multiple edges between the same pair of nodes.

Artificial neural networks: the perceptron
A perceptron maps an input layer directly to an output layer through a weighted sum of the inputs:

    y(x, w) = \sum_{i=0}^{D} w_i x_i
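To make the perceptron equation concrete, here is a minimal NumPy sketch; it is not code from the slides, and the step-function decision rule and the example numbers are illustrative assumptions.

# Minimal sketch of the perceptron y(x, w) = sum_{i=0}^{D} w_i * x_i,
# with the bias absorbed by clamping x_0 = 1 (illustrative, not from the slides).
import numpy as np

def perceptron_output(x, w):
    """Weighted sum of the inputs; x[0] is assumed to be the clamped bias input (= 1)."""
    return np.dot(w, x)

def perceptron_predict(x, w):
    """Step-function nonlinearity on top of the weighted sum (binary decision)."""
    return 1 if perceptron_output(x, w) >= 0.0 else 0

# Example with D = 2 real inputs plus the bias input x_0 = 1 (values chosen arbitrarily).
x = np.array([1.0, 0.5, -1.2])   # [x_0, x_1, x_2]
w = np.array([0.1, 0.8, 0.3])    # [w_0, w_1, w_2]; in practice learned from training data
print(perceptron_predict(x, w))

With x_0 clamped to 1, the weight w_0 plays the role of the bias, which matches the convention used in the excerpts that follow.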
Multi-layer perceptron (MLP): "fully-connected" layers
Input layer, hidden layer, output layer.

Excerpt: Section 5.1, Feed-forward Network Functions (p. 229)
... notation for the two kinds of model. We shall see later how to give a probabilistic interpretation to a neural network.
As discussed in Section 3.1, the bias parameters in (5.2) can be absorbed into the set of weight parameters by defining an additional input variable x_0 whose value is clamped at x_0 = 1, so that (5.2) takes the form

    a_j^{(1)} = \sum_{i=0}^{D} w_{ji}^{(1)} x_i    (5.8)

We can similarly absorb the second-layer biases into the second-layer weights, so that the overall network function becomes

    y_k(x, w) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)    (5.9)

As can be seen from Figure 5.1, the neural network model comprises two stages of processing, each of which resembles the perceptron model of Section 4.1.7, and for this reason the neural network is also known as the multilayer perceptron, or MLP. A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities. This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in network training.
If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs, because information is lost in the dimensionality reduction at the hidden units. In Section 12.4.2 we show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.
The network architecture shown in Figure 5.1 is the most commonly used one in practice. However, it is easily generalized, for instance by considering additional layers of processing, each consisting of a weighted linear combination of the form (5.4) followed by an element-wise transformation using a nonlinear activation function. Note that there is some confusion in the literature regarding the terminology for counting the number of layers in such networks. Thus the network in Figure 5.1 may be described as a 3-layer network (which counts the number of layers of units and treats the inputs as units) or sometimes as a single-hidden-layer network (which counts the number of layers of hidden units). We recommend a terminology in which Figure 5.1 is called a two-layer network, because it is the number of layers of adaptive weights that is important for determining the network properties.
Another generalization of the network architecture is to include skip-layer connections, each of which is associated with a corresponding adaptive parameter.

Deep neural network
Stacking n such layers of adaptive weights gives a deep neural network.
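To make equation (5.9) concrete, here is a minimal NumPy sketch of the two-layer forward pass. It is not code from the slides or the excerpt; the tanh hidden nonlinearity, the sigmoid output, and the array sizes are assumptions chosen for illustration.

# Two-layer network function (5.9), with biases absorbed as in (5.8):
# y_k(x, w) = sigma( sum_j w_kj^(2) h( sum_i w_ji^(1) x_i ) ), h = tanh (an assumption).
import numpy as np

def sigmoid(a):
    # Logistic sigmoid used as the output-unit nonlinearity.
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2):
    """Forward pass of a two-layer network; x[0] = 1 and a clamped hidden unit absorb the biases."""
    a1 = W1 @ x                                  # first-layer activations a_j^(1), eq. (5.8)
    z = np.concatenate(([1.0], np.tanh(a1)))     # hidden units plus clamped bias unit z_0 = 1
    a2 = W2 @ z                                  # second-layer activations
    return sigmoid(a2)                           # outputs y_k in (0, 1)

# Example with D = 3 inputs, M = 4 hidden units, K = 2 outputs (illustrative sizes).
rng = np.random.default_rng(0)
x = np.concatenate(([1.0], rng.normal(size=3)))  # x_0 = 1 absorbs the first-layer bias
W1 = rng.normal(scale=0.1, size=(4, 4))          # shape (M, D + 1)
W2 = rng.normal(scale=0.1, size=(2, 5))          # shape (K, M + 1)
print(mlp_forward(x, W1, W2))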
Excerpt: Section 5.2, Network Training (p. 235)
If we consider a training set of independent observations, then the error function, which is given by the negative log likelihood, is then a cross-entropy error function of the form

    E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}    (5.21)

where y_n denotes y(x_n, w). Note that there is no analogue of the noise precision β because the target values are assumed to be correctly labelled. However, the model is easily extended to allow for labelling errors (Exercise 5.4). Simard et al. (2003) found that using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization.
If we have K separate binary classifications to perform, then we can use a network having K outputs, each of which has a logistic sigmoid activation function. Associated with each output is a binary class label t_k ∈ {0, 1}, where k = 1, ..., K. If we assume that the class labels are independent, given the input vector, then the conditional distribution of the targets is

    p(t \mid x, w) = \prod_{k=1}^{K} y_k(x, w)^{t_k} \, [1 - y_k(x, w)]^{1 - t_k}    (5.22)

Taking the negative logarithm of the corresponding likelihood function then gives the following error function (Exercise 5.5)

    E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}    (5.23)

where y_{nk} denotes y_k(x_n, w). Again, the derivative of the error function with respect to the activation for a particular output unit takes the form (5.18), just as in the regression case (Exercise 5.6).
It is interesting to contrast the neural network solution to this problem with the corresponding approach based on a linear classification model of the kind discussed in Chapter 4. Suppose that we are using a standard two-layer network of the kind shown in Figure 5.1. We see that the weight parameters in the first layer of the network are shared between the various outputs, whereas in the linear model each classification problem is solved independently. The first layer of the network can be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.
Finally, we consider the standard multiclass classification problem in which each input is assigned to one of K mutually exclusive classes. The binary target variables t_k ∈ {0, 1} have a 1-of-K coding scheme indicating the class, and the network outputs are interpreted as y_k(x, w) = p(t_k = 1 | x), leading to the following error function

    E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(x_n, w)

Learning by gradient descent
Given an error (loss) function, the network weights are adjusted by following its gradient.

Excerpt (p. 240)
... evaluations, each of which would require O(W) steps. Thus, the computational effort needed to find the minimum using such an approach would be O(W^3). Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of ∇E brings W items of information, we might hope to find the minimum of the function in O(W) gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only O(W) steps, and so the minimum can now be found in O(W^2) steps.
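As a concrete illustration of the cross-entropy error (5.21) and of learning by gradient descent with error backpropagation, here is a minimal NumPy sketch for a single-output two-layer network. The toy data, the tanh hidden units, the learning rate, and the number of iterations are assumptions made for the example, not anything prescribed by the slides or the excerpts.

# Batch gradient descent on the cross-entropy error (5.21) for a two-layer network,
# with gradients obtained by error backpropagation (illustrative sketch only).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Toy binary classification data: N points, D = 2 inputs, x_0 = 1 absorbs the bias.
N, D, M = 200, 2, 8
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # shape (N, D + 1)
t = (X[:, 1] * X[:, 2] > 0).astype(float)                    # XOR-like labels (assumed data)

W1 = rng.normal(scale=0.5, size=(M, D + 1))   # first-layer weights
w2 = rng.normal(scale=0.5, size=M + 1)        # second-layer weights (single output)
eta = 0.1                                     # learning rate (assumed)

for step in range(2000):
    # Forward pass, as in (5.9), with a clamped hidden bias unit.
    A1 = X @ W1.T                                # (N, M) first-layer activations
    H = np.tanh(A1)                              # hidden-unit outputs
    Z = np.hstack([np.ones((N, 1)), H])          # (N, M + 1) with z_0 = 1
    y = sigmoid(Z @ w2)                          # (N,) network outputs

    # Backpropagation: with a sigmoid output and cross-entropy error, the output "error" is y - t.
    delta2 = y - t
    grad_w2 = Z.T @ delta2                                     # dE/dw2
    delta1 = (1.0 - H ** 2) * np.outer(delta2, w2[1:])         # backpropagated through tanh
    grad_W1 = delta1.T @ X                                     # dE/dW1

    # Gradient-descent update of all adaptive weights (mean gradient over the batch).
    w2 -= eta * grad_w2 / N
    W1 -= eta * grad_W1 / N

# Report the final cross-entropy error (5.21) and training accuracy.
H = np.tanh(X @ W1.T)
y = sigmoid(np.hstack([np.ones((N, 1)), H]) @ w2)
E = -np.sum(t * np.log(y + 1e-12) + (1 - t) * np.log(1 - y + 1e-12))
print("final cross-entropy:", E, " accuracy:", np.mean((y > 0.5) == t))

Each backpropagation pass costs O(W) operations per data point, which is the efficiency argument made in the excerpt above.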