
Lecture 7. Multilayer Perceptron and Backpropagation

COMP90051 Statistical Machine Learning

Semester 2, 2017 Lecturer: Andrey Kan

Copyright: University of Melbourne

This lecture

• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries
• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation

Animals in the zoo

[Diagram: a taxonomy of Artificial Neural Networks (ANNs), split into recurrent neural networks and feed-forward networks; feed-forward networks include multilayer perceptrons and convolutional neural networks. Art: OpenClipartVectors at pixabay.com (CC0)]

• Recurrent neural networks are not covered in this subject
• If time permits, we will cover autoencoders. An autoencoder is an ANN trained in a specific way.
∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a convolutional neural network can be trained as an autoencoder.

Multilayer Perceptron

Modelling non-linearity via function composition

Limitations of linear models

Some functions are linearly separable, but many are not

[Plots: the Boolean functions AND, OR and XOR of inputs $x_1$ and $x_2$; AND and OR are linearly separable, XOR is not]

Possible solution: composition. XOR$(x_1, x_2)$ = ($x_1$ OR $x_2$) AND NOT ($x_1$ AND $x_2$)

We are going to combine perceptrons, as sketched below …
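For concreteness, here is a minimal sketch of this composition with step-activation perceptrons; the particular weights and thresholds are illustrative choices, not taken from the lecture:

    def step(s):
        # step activation: 1 if s >= 0, else 0
        return 1 if s >= 0 else 0

    def perceptron(a, b, w1, w2, bias):
        # a single perceptron: step(w1*a + w2*b + bias)
        return step(w1 * a + w2 * b + bias)

    def xor(x1, x2):
        h_or = perceptron(x1, x2, 1.0, 1.0, -0.5)        # fires iff x1 OR x2
        h_nand = perceptron(x1, x2, -1.0, -1.0, 1.5)     # fires iff NOT (x1 AND x2)
        return perceptron(h_or, h_nand, 1.0, 1.0, -1.5)  # AND of the two hidden units

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor(x1, x2))   # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0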

Simplified graphical representation

Sputnik – first artificial Earth satellite

Perceptron is sort of a building block for ANN

• ANNs are not restricted to binary classification
• Nodes in an ANN can have various activation functions:
∗ Step function: $f(s) = 1$ if $s \ge 0$, and $0$ if $s < 0$
∗ Sign function: $f(s) = 1$ if $s \ge 0$, and $-1$ if $s < 0$
∗ Logistic function: $f(s) = \frac{1}{1 + e^{-s}}$
∗ Many others: $\tanh$, rectifier, etc.
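As a quick illustration, these activation functions take only a few lines of NumPy; the function names below are just illustrative:

    import numpy as np

    def step(s):
        # 1 if s >= 0, else 0
        return np.where(s >= 0, 1.0, 0.0)

    def sign(s):
        # 1 if s >= 0, else -1
        return np.where(s >= 0, 1.0, -1.0)

    def logistic(s):
        # 1 / (1 + e^{-s})
        return 1.0 / (1.0 + np.exp(-s))

    def rectifier(s):
        # also known as ReLU
        return np.maximum(0.0, s)

    print(logistic(np.array([0.0])))  # [0.5]; tanh is available directly as np.tanh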

Feed-forward Artificial Neural Network

[Diagram: an input layer $x_1, \dots, x_m$, a hidden layer $u_1, \dots, u_p$ and an output layer $z_1, \dots, z_q$; weights $v_{ij}$ connect inputs to hidden units and weights $w_{jk}$ connect hidden units to outputs; computation flows from inputs to outputs]

• $x_i$ are inputs, i.e., attributes. Note: here the $x_i$ are components of a single training instance $\boldsymbol{x}$; a training dataset is a set of instances
• $z_k$ are outputs, i.e., predicted labels. Note: ANNs naturally handle multidimensional output; e.g., for handwritten digits recognition, each output node can represent the probability of a digit
• ANNs can have more than one hidden layer

ANN as a function composition

[Diagram: the same network, with pre-activations $r_j$ and $s_k$ and activations $u_j$ and $z_k$ marked on the hidden and output nodes]

$z_k = h(s_k)$, where $s_k = w_{0k} + \sum_{j=1}^{p} u_j w_{jk}$

$u_j = g(r_j)$, where $r_j = v_{0j} + \sum_{i=1}^{m} x_i v_{ij}$

• Note that $z_k$ is a function composition (a function applied to the result of another function, etc.)
• Here $g$ and $h$ are activation functions. These can be either the same (e.g., both sigmoid) or different
• You can add a bias node $x_0 = 1$ to simplify equations: $r_j = \sum_{i=0}^{m} x_i v_{ij}$. Similarly, you can add a bias node $u_0 = 1$: $s_k = \sum_{j=0}^{p} u_j w_{jk}$
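To make the composition concrete, here is a minimal sketch of a forward pass through one hidden layer, assuming sigmoid activations for both $g$ and $h$ and keeping the bias terms as separate vectors (all names and the tiny example sizes are illustrative):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def forward(x, V, v0, W, w0):
        # x: (m,) one instance; V: (m, p) weights v_ij; v0: (p,) biases v_0j
        # W: (p, q) weights w_jk; w0: (q,) biases w_0k
        r = v0 + x @ V      # r_j = v_0j + sum_i x_i v_ij
        u = sigmoid(r)      # u_j = g(r_j)
        s = w0 + u @ W      # s_k = w_0k + sum_j u_j w_jk
        z = sigmoid(s)      # z_k = h(s_k)
        return z

    rng = np.random.default_rng(0)
    x = np.array([0.5, -1.0, 2.0])                        # m = 3 attributes
    V, v0 = rng.normal(size=(3, 4)), rng.normal(size=4)   # p = 4 hidden units
    W, w0 = rng.normal(size=(4, 2)), rng.normal(size=2)   # q = 2 outputs
    print(forward(x, V, v0, W, w0))                       # two outputs in (0, 1)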

ANN in supervised learning

• ANNs can be naturally adapted to various supervised learning setups, such as univariate and multivariate regression, as well as binary and multilabel classification
• Univariate regression $y = f(\boldsymbol{x})$
∗ e.g., linear regression seen earlier in the course
• Multivariate regression $\boldsymbol{y} = f(\boldsymbol{x})$
∗ predicting values for multiple continuous outcomes
• Binary classification
∗ e.g., predict whether a patient has type II diabetes
• Multivariate classification
∗ e.g., handwritten digits recognition with labels “1”, “2”, etc.

The power of ANN as a non-linear model

• ANNs are capable of approximating various non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$
• For example, consider the following network. In this example, hidden unit activation functions are $\tanh$

[Diagram: a network with a single input $x$, three hidden units $u_1, u_2, u_3$, and a single output $z$. Plot by Geek3, Wikimedia Commons (CC3)]

[Plots, adapted from Bishop 2007: blue points are the function values evaluated at different $x$; red lines are the predictions $z(x)$ from the ANN; dashed lines are the outputs of the hidden units]

• Universal approximation theorem (Cybenko 1989): an ANN with a hidden layer with a finite number of units, and mild assumptions on the activation function, can approximate continuous functions on compact subsets of $\mathbb{R}^n$ arbitrarily well

How to train your dragon network?

• You know the drill: define the loss function and find parameters that minimise the loss on training data

• In the following, we are going to use stochastic gradient descent with a batch size of one. That is, we will process training examples one by one

Adapted from Movie Poster from Flickr user jdxyw (CC BY-SA 2.0)

Training setup: univariate regression

[Diagram: inputs $x_1, \dots, x_m$, hidden units $u_1, \dots, u_p$ with input weights $v_{ij}$, and a single output $z$ with weights $w_j$]

• In what follows we consider the univariate regression setup
• Moreover, we will use the identity output activation function: $z = h(s) = s = \sum_{j=0}^{p} u_j w_j$
• This will simplify the description of backpropagation. In other settings, the training procedure is similar

Training setup: univariate regression

• How many parameters does this ANN have? Bias nodes $x_0$ and $u_0$ are present, but not shown in the diagram

Answer: $(m + 1)p + (p + 1) = mp + 2p + 1$

art: OpenClipartVectors at pixabay.com (CC0)
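As a quick check with illustrative numbers (not from the slides): with $m = 4$ input attributes and $p = 3$ hidden units there are $(4 + 1)\cdot 3 = 15$ input-to-hidden weights (including those from the bias node $x_0$) and $3 + 1 = 4$ hidden-to-output weights (including the one from $u_0$), i.e., $15 + 4 = 19 = 4\cdot 3 + 2\cdot 3 + 1$ parameters in total.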

Loss function for ANN training

• In online training, we need to define the loss between a single training example $\{\boldsymbol{x}, y\}$ and the ANN's prediction $\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) = z(\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a parameter vector comprised of all coefficients $v_{ij}$ and $w_j$
• For regression we can use good old squared error
$L = \frac{1}{2}\big(\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) - y\big)^2 = \frac{1}{2}(z - y)^2$
(the constant $\frac{1}{2}$ is used for mathematical convenience, see later)
• Training means finding the minimum of $L$ as a function of parameter vector $\boldsymbol{\theta}$
∗ Fortunately $L(\boldsymbol{\theta})$ is a differentiable function
∗ Unfortunately there is no analytic solution in general

Stochastic gradient descent for ANN

Choose initial guess $\boldsymbol{\theta}^{(0)}$, $k = 0$
(here $\boldsymbol{\theta}$ is a set of all weights from all layers)

For $i$ from 1 to $T$ (epochs)
    For $j$ from 1 to $N$ (training examples)
        Consider example $(\boldsymbol{x}_j, y_j)$
        Update: $\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \eta \nabla L(\boldsymbol{\theta}^{(i)})$

where $L = \frac{1}{2}(z_j - y_j)^2$, so we need to compute the partial derivatives $\frac{\partial L}{\partial v_{ij}}$ and $\frac{\partial L}{\partial w_j}$
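A minimal sketch of this loop in Python; `gradients` is a hypothetical helper standing in for the backpropagation computation derived in the next section, and the parameter values are illustrative:

    def sgd_train(X, y, theta, gradients, eta=0.01, epochs=10):
        # Online stochastic gradient descent: one update per training example.
        # X: (N, m) inputs; y: (N,) targets; theta: dict of weight arrays;
        # gradients(x, y, theta) -> dict of partial derivatives with the same keys.
        for _ in range(epochs):                  # i = 1..T
            for xj, yj in zip(X, y):             # j = 1..N, one example at a time
                grad = gradients(xj, yj, theta)  # dL/dtheta at this single example
                for name in theta:
                    theta[name] = theta[name] - eta * grad[name]  # theta <- theta - eta * grad
        return theta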

Backpropagation = “backward propagation of errors”

Calculating the gradient of a loss function

Backpropagation: start with the chain rule

• Recall that the output $z$ of an ANN is a function composition, and hence $L(z)$ is also a composition
∗ $L = 0.5(z - y)^2 = 0.5(h(s) - y)^2$
∗ $L = 0.5\left(\sum_{j=0}^{p} u_j w_j - y\right)^2 = 0.5\left(\sum_{j=0}^{p} g(r_j) w_j - y\right)^2 = \cdots$
• Backpropagation makes use of this fact by applying the chain rule for derivatives
• $\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial w_j}$
• $\frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}\frac{\partial r_j}{\partial v_{ij}}$

Backpropagation: intermediate step

• Apply the chain rule
$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial w_j}$
$\frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}\frac{\partial r_j}{\partial v_{ij}}$
• Now define
$\delta \equiv \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}$ and $\varepsilon_j \equiv \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}$
• Here $L = 0.5(z - y)^2$ and $z = s$. Thus $\delta = z - y$
• Here $z = s = \sum_{j=0}^{p} u_j w_j$ and $u_j = h(r_j)$. Thus $\varepsilon_j = \delta w_j h'(r_j)$

Backpropagation equations

• We have
∗ $\frac{\partial L}{\partial w_j} = \delta \frac{\partial s}{\partial w_j}$
∗ $\frac{\partial L}{\partial v_{ij}} = \varepsilon_j \frac{\partial r_j}{\partial v_{ij}}$
where
∗ $\delta = z - y$
∗ $\varepsilon_j = \delta w_j h'(r_j)$
• Recall that $s = \sum_{j=0}^{p} u_j w_j$ and $r_j = \sum_{i=0}^{m} x_i v_{ij}$
• So $\frac{\partial s}{\partial w_j} = u_j$ and $\frac{\partial r_j}{\partial v_{ij}} = x_i$
• We have
∗ $\frac{\partial L}{\partial w_j} = \delta u_j = (z - y)u_j$
∗ $\frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j h'(r_j)x_i$

Forward propagation

• Use current estimates of $v_{ij}$ and $w_j$
• Calculate $r_j$, $u_j$, $s$ and $z$

[Diagram: computation flows forward, from the inputs $x_i$ through the hidden units $u_j$ to the output $z$]

• Backpropagation equations
∗ $\frac{\partial L}{\partial w_j} = \delta u_j = (z - y)u_j$
∗ $\frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j h'(r_j)x_i$

Backward propagation of errors

$\delta = z - y$, $\quad \frac{\partial L}{\partial w_j} = \delta u_j$, $\quad \varepsilon_j = \delta w_j h'(r_j)$, $\quad \frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i$

[Diagram: the error $\delta$ is computed at the output and propagated backwards through the network, yielding the $\varepsilon_j$ at the hidden units]
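Putting forward propagation, the backward propagation of errors and the SGD update together: a minimal end-to-end sketch for the univariate regression setup with $\tanh$ hidden units and identity output, fitted here to $z = x^2$ as in the earlier example (learning rate, number of epochs and initialisation scale are illustrative guesses, not values from the lecture):

    import numpy as np

    def train_univariate(X, y, p=3, eta=0.05, epochs=2000, seed=0):
        # X: (N,) scalar inputs, y: (N,) targets; p tanh hidden units, identity output
        rng = np.random.default_rng(seed)
        m = 1
        V = rng.normal(scale=0.5, size=(m + 1, p))  # v_ij, row 0 holds weights from bias x_0 = 1
        w = rng.normal(scale=0.5, size=p + 1)       # w_j,  entry 0 is the weight from bias u_0 = 1
        for _ in range(epochs):
            for xj, yj in zip(X, y):
                x_aug = np.array([1.0, xj])          # prepend bias node x_0 = 1
                # Forward propagation
                r = x_aug @ V                        # r_j = sum_i x_i v_ij
                u = np.tanh(r)                       # u_j = h(r_j)
                u_aug = np.concatenate(([1.0], u))   # prepend bias node u_0 = 1
                z = u_aug @ w                        # z = s (identity output activation)
                # Backward propagation of errors
                delta = z - yj                       # delta = z - y
                eps = delta * w[1:] * (1.0 - u**2)   # eps_j = delta * w_j * h'(r_j), h' = 1 - tanh^2
                grad_w = delta * u_aug               # dL/dw_j  = delta * u_j
                grad_V = np.outer(x_aug, eps)        # dL/dv_ij = eps_j * x_i
                # SGD update
                w -= eta * grad_w
                V -= eta * grad_V
        return V, w

    # Fit z = x^2 on a small grid (illustrative; convergence depends on seed and step size)
    X = np.linspace(-1.0, 1.0, 21)
    V, w = train_univariate(X, X**2)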

Some further notes on ANN training

• ANN is a flexible model (recall the universal approximation theorem), but the flip side of this is over-parameterisation, hence a tendency to overfitting
• Starting weights are usually small random values distributed around zero
• Implicit regularisation: early stopping
∗ With some activation functions, this shrinks the ANN towards a linear model (why?)

Plot by Geek3, Wikimedia Commons (CC3)

Explicit regularisation

• Alternatively, explicit regularisation can be used, much like in ridge regression
• Instead of minimising the loss $L$, minimise the regularised function
$L + \lambda\left(\sum_{i=0}^{m}\sum_{j=1}^{p} v_{ij}^2 + \sum_{j=0}^{p} w_j^2\right)$
• This will simply add $2\lambda v_{ij}$ and $2\lambda w_j$ terms to the partial derivatives
• With some activation functions this also shrinks the ANN towards a linear model
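In code, this amounts to adjusting each partial derivative before the update step; a minimal helper with illustrative names, meant to slot into a sketch like the one above:

    def add_l2_penalty(grad_V, grad_w, V, w, lam):
        # gradient of lam * (sum_ij v_ij^2 + sum_j w_j^2): adds 2*lam*v_ij and 2*lam*w_j
        return grad_V + 2.0 * lam * V, grad_w + 2.0 * lam * w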

This lecture

• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries
• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation
