Lecture 7. Multilayer Perceptron. Backpropagation
COMP90051 Statistical Machine Learning, Semester 2, 2017
Lecturer: Andrey Kan
Copyright: University of Melbourne

This lecture
• Multilayer perceptron
  ∗ Model structure
  ∗ Universal approximation
  ∗ Training preliminaries
• Backpropagation
  ∗ Step-by-step derivation
  ∗ Notes on regularisation

Animals in the zoo
• The family of artificial neural networks (ANNs) includes feed-forward networks (perceptrons, multilayer perceptrons, convolutional neural networks) and recurrent neural networks
• Recurrent neural networks are not covered in this subject
• If time permits, we will cover autoencoders. An autoencoder is an ANN trained in a specific way
  ∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a recurrent neural network can be trained as an autoencoder

Multilayer Perceptron
Modelling non-linearity via function composition

Limitations of linear models
• Some functions are linearly separable, but many are not: AND and OR are, XOR is not
• Possible solution: composition, e.g., x_1 XOR x_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2)
• We are going to combine perceptrons…

Simplified graphical representation
(figure: a perceptron drawn as a single node with incoming weighted edges; photo of Sputnik, the first artificial Earth satellite)

Perceptron is a sort of building block for ANNs
• ANNs are not restricted to binary classification
• Nodes in an ANN can have various activation functions:
  ∗ Step function: f(s) = 1 if s ≥ 0, and 0 if s < 0
  ∗ Sign function: f(s) = 1 if s ≥ 0, and −1 if s < 0
  ∗ Logistic function: f(s) = 1 / (1 + e^{−s})
  ∗ Many others: tanh, rectifier, etc.

Feed-forward Artificial Neural Network
• Computation flows from the input layer, through the hidden layer, to the output layer (ANNs can have more than one hidden layer)
• The inputs x_1, …, x_m are the attributes, i.e., the components of a single training instance; a training dataset is a set of such instances
• The outputs are the predicted labels; ANNs naturally handle multidimensional output
  ∗ e.g., for handwritten digit recognition, each output node can represent the probability of a digit

ANN as a function composition
• Each hidden unit j computes u_j = g(s_j), where s_j = w_{0j} + Σ_{i=1}^{m} w_{ij} x_i
• The output unit computes z = h(r), where r = v_0 + Σ_{j=1}^{p} v_j u_j
• Note that z is a function composition (a function applied to the result of another function, etc.)
• Here g and h are activation functions. These can be either the same (e.g., both sigmoid) or different
• You can add a bias node x_0 = 1 to simplify the first equation to s_j = Σ_{i=0}^{m} w_{ij} x_i; similarly, you can add a bias node u_0 = 1 so that r = Σ_{j=0}^{p} v_j u_j
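To make the function composition concrete, here is a minimal NumPy sketch of the forward pass for a network with one hidden layer and a single output. The function name `forward`, the tanh hidden activation and the identity output activation are illustrative assumptions, not something prescribed by the slides.

```python
import numpy as np

def forward(x, W, v, g=np.tanh):
    """Forward pass of a one-hidden-layer network (a sketch, not course code).

    x : (m,) input features
    W : (m+1, p) input-to-hidden weights; row 0 plays the role of the bias node x_0 = 1
    v : (p+1,) hidden-to-output weights; entry 0 plays the role of the bias node u_0 = 1
    g : hidden activation function (tanh here, by assumption)
    """
    x_ext = np.concatenate(([1.0], x))   # add bias node x_0 = 1
    s = x_ext @ W                        # s_j = sum_{i=0}^{m} w_ij x_i
    u = np.concatenate(([1.0], g(s)))    # u_j = g(s_j), plus bias node u_0 = 1
    r = u @ v                            # r = sum_{j=0}^{p} v_j u_j
    z = r                                # identity output activation: z = h(r) = r
    return s, u, z

# Tiny usage example with m = 2 features and p = 3 hidden units
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((3, 3))
v = 0.1 * rng.standard_normal(4)
print(forward(np.array([0.5, -1.0]), W, v)[2])
```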
ANN in supervised learning
• ANNs can be naturally adapted to various supervised learning setups, such as univariate and multivariate regression, as well as binary and multilabel classification
• Univariate regression y = f(x), e.g., linear regression earlier in the course
• Multivariate regression y = f(x), predicting values for multiple continuous outcomes
• Binary classification, e.g., predicting whether a patient has type II diabetes
• Multivariate classification, e.g., handwritten digit recognition with labels "1", "2", etc.

The power of ANN as a non-linear model
• ANNs are capable of approximating various non-linear functions, e.g., z = x² and z = sin(x)
• For example, consider a network with a few hidden units whose activation functions are tanh
  (figure: tanh curve; plot by Geek3, Wikimedia Commons)
• In the approximation plots, blue points are the function values evaluated at different x, red lines are the predictions ẑ(x) from the ANN, and dashed lines are the outputs of the hidden units (adapted from Bishop 2007)
• Universal approximation theorem (Cybenko 1989): an ANN with a single hidden layer with a finite number of units, and mild assumptions on the activation function, can approximate continuous functions on compact subsets of R^n arbitrarily well

How to train your network?
• You know the drill: define the loss function and find parameters that minimise the loss on training data
• In the following, we are going to use stochastic gradient descent with a batch size of one. That is, we will process training examples one by one

Training setup: univariate regression
• In what follows we consider the univariate regression setup
• Moreover, we will use the identity output activation function: ẑ = h(r) = r = Σ_{j=0}^{p} v_j u_j
• This will simplify the description of backpropagation. In other settings, the training procedure is similar

Training setup: univariate regression (continued)
• How many parameters does this ANN have? With m input features, p hidden units, and bias nodes x_0 and u_0 (present, but not shown in the figure), there are (m + 1)p weights into the hidden layer and p + 1 weights into the output, i.e., (m + 1)p + p + 1 parameters in total

Loss function for ANN training
• In online training, we need to define the loss between a single training example {x, y} and the ANN's prediction ẑ = f(x, θ), where θ is a parameter vector comprised of all coefficients w_{ij} and v_j
• For regression we can use the good old squared error
  L = (1/2)(f(x, θ) − y)² = (1/2)(ẑ − y)²
  (the constant 1/2 is used for mathematical convenience, see later)
• Training means finding the minimum of L(θ) as a function of the parameter vector θ
  ∗ Fortunately L(θ) is a differentiable function
  ∗ Unfortunately there is no analytic solution in general

Stochastic gradient descent for ANN
• Choose an initial guess θ^(0), k = 0 (here θ is the set of all weights from all layers)
• For i from 1 to T (epochs):
  ∗ For j from 1 to N (training examples):
    ∗ Consider example {x_j, y_j}
    ∗ Update: θ^(k+1) = θ^(k) − η ∇L(θ^(k))
• Here L = (1/2)(ẑ_j − y_j)², so we need to compute the partial derivatives ∂L/∂v_j and ∂L/∂w_{ij}

Backpropagation = "backward propagation of errors"
Calculating the gradient of a loss function

Backpropagation: start with the chain rule
• Recall that the output ẑ of an ANN is a function composition, and hence L(θ) is also a composition:
  L = (1/2)(ẑ − y)² = (1/2)(h(r) − y)² = (1/2)(h(Σ_{j=0}^{p} v_j u_j) − y)² = (1/2)(h(Σ_{j=0}^{p} v_j g(s_j)) − y)² = …
• Backpropagation makes use of this fact by applying the chain rule for derivatives:
  ∗ ∂L/∂v_j = (∂L/∂ẑ)(∂ẑ/∂r)(∂r/∂v_j)
  ∗ ∂L/∂w_{ij} = (∂L/∂ẑ)(∂ẑ/∂r)(∂r/∂u_j)(∂u_j/∂s_j)(∂s_j/∂w_{ij})

Backpropagation: intermediate step
• Apply the chain rule:
  ∗ ∂L/∂v_j = (∂L/∂r)(∂r/∂v_j)
  ∗ ∂L/∂w_{ij} = (∂L/∂s_j)(∂s_j/∂w_{ij})
• Now define δ ≡ ∂L/∂r = (∂L/∂ẑ)(∂ẑ/∂r) and ε_j ≡ ∂L/∂s_j = (∂L/∂r)(∂r/∂u_j)(∂u_j/∂s_j)
• Here L = (1/2)(ẑ − y)² and ẑ = h(r) = r, thus δ = ẑ − y
• Here r = Σ_{j=0}^{p} v_j u_j and u_j = g(s_j), thus ε_j = δ v_j g'(s_j)

Backpropagation equations
• We have ∂L/∂v_j = δ (∂r/∂v_j) and ∂L/∂w_{ij} = ε_j (∂s_j/∂w_{ij}), where δ = ẑ − y and ε_j = δ v_j g'(s_j)
• Recall that r = Σ_{j=0}^{p} v_j u_j and s_j = Σ_{i=0}^{m} w_{ij} x_i, so ∂r/∂v_j = u_j and ∂s_j/∂w_{ij} = x_i
• So we have:
  ∗ ∂L/∂v_j = δ u_j = (ẑ − y) u_j
  ∗ ∂L/∂w_{ij} = ε_j x_i = δ v_j g'(s_j) x_i
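The backpropagation equations above translate almost line by line into code. Below is a minimal sketch of one stochastic gradient descent step on a single example, under the same assumptions as the earlier sketch (tanh hidden units, identity output, squared-error loss); the name `backprop_update` and the learning rate `eta` are illustrative.

```python
import numpy as np

def backprop_update(x, y, W, v, eta=0.01):
    """One SGD step on a single example {x, y} (a sketch, not course code)."""
    # Forward propagation: use current estimates of w_ij and v_j to get s_j, u_j, r, z
    x_ext = np.concatenate(([1.0], x))       # bias node x_0 = 1
    s = x_ext @ W                            # s_j = sum_i w_ij x_i
    u = np.concatenate(([1.0], np.tanh(s)))  # u_j = g(s_j), plus bias node u_0 = 1
    z = u @ v                                # identity output: z = r

    # Backward propagation of errors
    delta = z - y                                   # delta = dL/dr = z - y
    grad_v = delta * u                              # dL/dv_j = delta * u_j
    eps = delta * v[1:] * (1.0 - np.tanh(s) ** 2)   # eps_j = delta * v_j * g'(s_j), with g = tanh
    grad_W = np.outer(x_ext, eps)                   # dL/dw_ij = eps_j * x_i

    # Update: theta <- theta - eta * gradient
    return W - eta * grad_W, v - eta * grad_v
```

Running this update repeatedly over the training set, example by example, is exactly the SGD loop sketched on the earlier slide.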
Forward propagation
• Use the current estimates of w_{ij} and v_j
• Calculate s_j, u_j, r and ẑ

Backward propagation of errors
• δ = ẑ − y, so ∂L/∂v_j = δ u_j
• ε_j = δ v_j g'(s_j), so ∂L/∂w_{ij} = ε_j x_i

Some further notes on ANN training
• An ANN is a flexible model (recall the universal approximation theorem), but the flip side of this is over-parameterisation, hence a tendency to overfit
• Starting weights are usually small random values distributed around zero
• Implicit regularisation: early stopping
  ∗ With some activation functions, this shrinks the ANN towards a linear model (why?)
  (figure: tanh is approximately linear near zero; plot by Geek3, Wikimedia Commons)

Explicit regularisation
• Alternatively, explicit regularisation can be used, much like in ridge regression
• Instead of minimising the loss L, minimise the regularised objective L + λ(Σ_{i=0}^{m} Σ_{j=1}^{p} w_{ij}² + Σ_{j=0}^{p} v_j²)
• This simply adds 2λ w_{ij} and 2λ v_j terms to the partial derivatives (see the sketch after the recap below)
• With some activation functions this also shrinks the ANN towards a linear model

This lecture
• Multilayer perceptron
  ∗ Model structure
  ∗ Universal approximation
  ∗ Training preliminaries
• Backpropagation
  ∗ Step-by-step derivation
  ∗ Notes on regularisation
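As a closing illustration, here is how the explicit L2 penalty modifies the updates from the earlier sketch. For simplicity this version penalises all weights uniformly, including the bias weights, which is a small simplification of the sums on the regularisation slide; `lam` is an assumed name for λ.

```python
import numpy as np

def regularised_update(W, v, grad_W, grad_v, eta=0.01, lam=1e-3):
    """SGD step with the L2 penalty lambda * (sum w_ij^2 + sum v_j^2) (sketch only).

    The penalty simply adds 2*lambda*w_ij and 2*lambda*v_j to the partial
    derivatives grad_W and grad_v computed by backpropagation.
    """
    W_new = W - eta * (grad_W + 2.0 * lam * W)
    v_new = v - eta * (grad_v + 2.0 * lam * v)
    return W_new, v_new
```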