Lecture 7. Multilayer Perceptron. Backpropagation

COMP90051 Statistical Machine Learning, Semester 2, 2017
Lecturer: Andrey Kan
Copyright: University of Melbourne

This lecture
• Multilayer perceptron
  ∗ Model structure
  ∗ Universal approximation
  ∗ Training preliminaries
• Backpropagation
  ∗ Step-by-step derivation
  ∗ Notes on regularisation

Animals in the zoo
• Artificial Neural Networks (ANNs) comprise recurrent neural networks and feed-forward networks; feed-forward networks include perceptrons, multilayer perceptrons and convolutional neural networks
• Recurrent neural networks are not covered in this subject
• If time permits, we will cover autoencoders. An autoencoder is an ANN trained in a specific way
  ∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a recurrent neural network can be trained as an autoencoder
(art: OpenClipartVectors at pixabay.com (CC0))

Multilayer Perceptron
Modelling non-linearity via function composition

Limitations of linear models
• Some functions are linearly separable (e.g., AND, OR), but many are not (e.g., XOR)
• Possible solution: composition, e.g., XOR(x_1, x_2) = (x_1 OR x_2) AND not(x_1 AND x_2)
• We are going to combine perceptrons …

Simplified graphical representation
(illustration: Sputnik – first artificial Earth satellite)

The perceptron is sort of a building block for ANNs
• ANNs are not restricted to binary classification
• Nodes in an ANN can have various activation functions f(s):
  ∗ Step function: f(s) = 1 if s ≥ 0, and 0 if s < 0
  ∗ Sign function: f(s) = 1 if s ≥ 0, and −1 if s < 0
  ∗ Logistic function: f(s) = 1 / (1 + e^{−s})
  ∗ Many others: tanh, rectifier, etc.

Feed-forward Artificial Neural Network
• Inputs x_1, …, x_m are attributes; outputs z_1, …, z_q are predicted labels; computation flows from the input layer through the hidden layer to the output layer (ANNs can have more than one hidden layer)
• Note: the x_i here are components of a single training instance; a training dataset is a set of such instances
• ANNs naturally handle multidimensional output
  ∗ E.g., for handwritten digits recognition, each output node can represent the probability of a digit

ANN as a function composition
• Hidden units: u_j = g(r_j), where r_j = ∑_{i=0}^{m} x_i v_{ij}
• Outputs: z_k = h(s_k), where s_k = ∑_{j=0}^{p} u_j w_{jk}
• Note that each z_k is a function composition (a function applied to the result of another function, etc.)
• You can add a bias node x_0 = 1 to simplify the equations for r_j; similarly, a bias node u_0 = 1 simplifies the equations for s_k
• Here g and h are activation functions. These can be either the same (e.g., both sigmoid) or different

ANN in supervised learning
• ANNs can be naturally adapted to various supervised learning setups, such as univariate and multivariate regression, as well as binary and multilabel classification
• Univariate regression y = f(x)
  ∗ e.g., linear regression earlier in the course
• Multivariate regression y = f(x)
  ∗ predicting values for multiple continuous outcomes
• Binary classification
  ∗ e.g., predict whether a patient has type II diabetes
• Multivariate classification
  ∗ e.g., handwritten digits recognition with labels “1”, “2”, etc.

The power of ANN as a non-linear model
• ANNs are capable of approximating various non-linear functions, e.g., z(x) = x^2 and z(x) = sin(x)
• For example, consider the following network. In this example, the hidden unit activation functions are tanh
(plot by Geek3, Wikimedia Commons (CC3))
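As a minimal illustrative sketch (not part of the original slides), the following NumPy code computes the forward pass of a one-hidden-layer network of this kind, with tanh hidden units and an identity output as in the univariate regression setup used later in the lecture. The function name `forward`, the weight shapes and the random initial values are assumptions made for illustration only.

```python
import numpy as np

def forward(x, V, w, g=np.tanh):
    """Forward pass of a one-hidden-layer MLP for univariate regression.

    x : input vector of length m (without the bias entry)
    V : (m+1) x p hidden-layer weight matrix (row 0 holds the biases)
    w : output-layer weight vector of length p+1 (entry 0 is the bias)
    g : hidden activation function (tanh here, as in the slides)
    """
    x_b = np.concatenate(([1.0], x))   # add bias node x_0 = 1
    r = x_b @ V                        # r_j = sum_i x_i v_ij
    u = g(r)                           # u_j = g(r_j)
    u_b = np.concatenate(([1.0], u))   # add bias node u_0 = 1
    s = u_b @ w                        # s = sum_j u_j w_j
    z = s                              # identity output activation
    return r, u, s, z

# Tiny usage example: m = 1 input, p = 3 hidden units, random weights
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(2, 3))
w = rng.normal(scale=0.1, size=4)
print(forward(np.array([0.5]), V, w)[-1])
```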
The power of ANN as a non-linear model
• ANNs are capable of approximating various non-linear functions, e.g., z(x) = x^2 and z(x) = sin(x)
• Blue points are the function values evaluated at different x; red lines are the predictions z(x) from the ANN; dashed lines are the outputs of the hidden units (adapted from Bishop 2007)
• Universal approximation theorem (Cybenko 1989): an ANN with a hidden layer with a finite number of units, and mild assumptions on the activation function, can approximate continuous functions on compact subsets of ℝ^n arbitrarily well

How to train your dragon network?
• You know the drill: define the loss function and find parameters that minimise the loss on training data
• In the following, we are going to use stochastic gradient descent with a batch size of one. That is, we will process training examples one by one
(adapted from movie poster by Flickr user jdxyw (CC BY-SA 2.0))

Training setup: univariate regression
• In what follows we consider the univariate regression setup
• Moreover, we will use the identity output activation function z = h(s) = s, where s = ∑_{j=0}^{p} u_j w_j
• This will simplify the description of backpropagation. In other settings, the training procedure is similar

Training setup: univariate regression
• How many parameters does this ANN have? With m inputs, p hidden units, one output, and bias nodes x_0 and u_0 present (but not shown), there are (m + 1)p weights into the hidden layer and p + 1 weights into the output, i.e., p(m + 2) + 1 parameters in total
(art: OpenClipartVectors at pixabay.com (CC0))

Loss function for ANN training
• In online training, we need to define the loss between a single training example {x, y} and the ANN’s prediction ẑ = f(x, θ), where θ is a parameter vector comprised of all coefficients v_{ij} and w_j
• For regression we can use the good old squared error
  L = (1/2)(f(x, θ) − y)^2 = (1/2)(ẑ − y)^2
  (the constant is used for mathematical convenience, see later)
• Training means finding the minimum of L as a function of parameter vector θ
  ∗ Fortunately L(θ) is a differentiable function
  ∗ Unfortunately there is no analytic solution in general

Stochastic gradient descent for ANN
Choose initial guess θ^(0), k = 0
  (here θ is the set of all weights from all layers)
For i from 1 to T (epochs)
  For j from 1 to N (training examples)
    Consider example {x_j, y_j}
    Update: θ^(k+1) = θ^(k) − η ∇L(θ^(k)), then k = k + 1
Here L = (1/2)(z_j − y_j)^2, so we need to compute the partial derivatives ∂L/∂v_{ij} and ∂L/∂w_j

Backpropagation = “backward propagation of errors”
Calculating the gradient of a loss function

Backpropagation: start with the chain rule
• Recall that the output z of an ANN is a function composition, and hence L(z) is also a composition
  ∗ L = 0.5(z − y)^2 = 0.5(h(s) − y)^2 = 0.5(s − y)^2
      = 0.5(∑_{j=0}^{p} u_j w_j − y)^2 = 0.5(∑_{j=0}^{p} g(r_j) w_j − y)^2 = …
• Backpropagation makes use of this fact by applying the chain rule for derivatives
• ∂L/∂w_j = (∂L/∂z)(∂z/∂s)(∂s/∂w_j)
• ∂L/∂v_{ij} = (∂L/∂z)(∂z/∂s)(∂s/∂u_j)(∂u_j/∂r_j)(∂r_j/∂v_{ij})

Backpropagation: intermediate step
• Apply the chain rule and define
  δ ≡ ∂L/∂s = (∂L/∂z)(∂z/∂s)
  ε_j ≡ ∂L/∂r_j = (∂L/∂z)(∂z/∂s)(∂s/∂u_j)(∂u_j/∂r_j) = δ w_j (∂u_j/∂r_j)
• Here L = 0.5(z − y)^2 and z = s, thus δ = z − y
• Here u_j = g(r_j) and s = ∑_{j=0}^{p} u_j w_j, thus ε_j = δ w_j g′(r_j)

Backpropagation equations
• We have ∂L/∂w_j = δ (∂s/∂w_j) and ∂L/∂v_{ij} = ε_j (∂r_j/∂v_{ij}), where
  ∗ δ = z − y
  ∗ ε_j = δ w_j g′(r_j)
• Recall that s = ∑_{j=0}^{p} u_j w_j and r_j = ∑_{i=0}^{m} x_i v_{ij}
• So ∂s/∂w_j = u_j and ∂r_j/∂v_{ij} = x_i
• We have
  ∗ ∂L/∂w_j = δ u_j = (z − y) u_j
  ∗ ∂L/∂v_{ij} = ε_j x_i = (z − y) w_j g′(r_j) x_i
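The backpropagation equations above translate directly into code. Below is a minimal NumPy sketch (not from the slides) that computes ∂L/∂w_j and ∂L/∂v_{ij} for a single training example, assuming tanh hidden units (so g′(r) = 1 − tanh²(r)) and an identity output; the function name `backprop` and the array layout are illustrative assumptions.

```python
import numpy as np

def backprop(x, y, V, w):
    """Gradients of L = 0.5 * (z - y)^2 for one training example,
    following the backpropagation equations above. Returns (dL/dV, dL/dw)."""
    # Forward pass (same notation as the slides)
    x_b = np.concatenate(([1.0], x))   # bias node x_0 = 1
    r = x_b @ V                        # r_j = sum_i x_i v_ij
    u = np.tanh(r)                     # u_j = g(r_j), g = tanh
    u_b = np.concatenate(([1.0], u))   # bias node u_0 = 1
    z = u_b @ w                        # identity output: z = s

    # Backward pass
    delta = z - y                      # delta = dL/ds = z - y
    dL_dw = delta * u_b                # dL/dw_j = delta * u_j
    g_prime = 1.0 - u ** 2             # tanh'(r_j) = 1 - tanh(r_j)^2
    eps = delta * w[1:] * g_prime      # eps_j = delta * w_j * g'(r_j)
    dL_dV = np.outer(x_b, eps)         # dL/dv_ij = eps_j * x_i
    return dL_dV, dL_dw

# Example call with m = 1 input and p = 3 hidden units (random weights)
rng = np.random.default_rng(1)
grads = backprop(np.array([0.3]), 0.09,
                 rng.normal(scale=0.1, size=(2, 3)),
                 rng.normal(scale=0.1, size=4))
```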
Forward propagation
• Use current estimates of v_{ij} and w_j to calculate r_j, u_j, s and z
• Backpropagation equations (repeated)
  ∗ ∂L/∂w_j = δ u_j = (z − y) u_j
  ∗ ∂L/∂v_{ij} = ε_j x_i = (z − y) w_j g′(r_j) x_i

Backward propagation of errors
• δ = ∂L/∂s = (z − y)
• ε_j = ∂L/∂r_j = δ w_j g′(r_j)
• Backpropagation equations (repeated)
  ∗ ∂L/∂w_j = δ u_j = (z − y) u_j
  ∗ ∂L/∂v_{ij} = ε_j x_i = (z − y) w_j g′(r_j) x_i

Some further notes on ANN training
• An ANN is a flexible model (recall the universal approximation theorem), but the flip side of this is over-parameterisation, hence a tendency to overfitting
• Starting weights are usually small random values distributed around zero
• Implicit regularisation: early stopping
  ∗ With some activation functions, this shrinks the ANN towards a linear model (why?)
(plot by Geek3, Wikimedia Commons (CC3))

Explicit regularisation
• Alternatively, explicit regularisation can be used, much like in ridge regression
• Instead of minimising the loss L, minimise the regularised function
  L + λ (∑_{i=0}^{m} ∑_{j=1}^{p} v_{ij}^2 + ∑_{j=0}^{p} w_j^2)
• This will simply add 2λ v_{ij} and 2λ w_j terms to the partial derivatives
• With some activation functions this also shrinks the ANN towards a linear model

This lecture
• Multilayer perceptron
  ∗ Model structure
  ∗ Universal approximation
  ∗ Training preliminaries
• Backpropagation
  ∗ Step-by-step derivation
  ∗ Notes on regularisation
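To tie the SGD pseudocode and the explicit-regularisation slide together, here is an end-to-end sketch (not part of the slides): stochastic gradient descent with batch size one for the same one-hidden-layer network, where setting λ > 0 adds the 2λ v_{ij} and 2λ w_j terms to the gradients. The function name `train_sgd`, the hyperparameter values and the closing usage example are assumptions chosen purely for illustration.

```python
import numpy as np

def train_sgd(X, Y, p=3, epochs=200, eta=0.05, lam=0.0, seed=0):
    """SGD with batch size one for a one-hidden-layer MLP (tanh hidden
    units, identity output), with optional L2 penalty lam as in the
    explicit-regularisation slide. X: (N, m) inputs, Y: (N,) targets."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    # small random starting weights around zero, as suggested in the slides
    V = rng.normal(scale=0.1, size=(m + 1, p))
    w = rng.normal(scale=0.1, size=p + 1)
    for _ in range(epochs):
        for x, y in zip(X, Y):                     # examples one by one
            x_b = np.concatenate(([1.0], x))       # bias node x_0 = 1
            r = x_b @ V
            u = np.tanh(r)
            u_b = np.concatenate(([1.0], u))       # bias node u_0 = 1
            z = u_b @ w                            # prediction
            delta = z - y
            eps = delta * w[1:] * (1.0 - u ** 2)
            dV = np.outer(x_b, eps) + 2.0 * lam * V   # add 2*lam*v_ij
            dw = delta * u_b + 2.0 * lam * w          # add 2*lam*w_j
            V -= eta * dV
            w -= eta * dw
    return V, w

# Usage: fit z(x) = x^2 on a small grid, as in the approximation example
X = np.linspace(-1, 1, 50).reshape(-1, 1)
V, w = train_sgd(X, (X ** 2).ravel(), lam=1e-4)
```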
