Multilayer Perceptron Backpropagation Summary
C. Chadoulos, P. Kaplanoglou, S. Papadopoulos, Prof. Ioannis Pitas
Aristotle University of Thessaloniki
[email protected] – www.aiia.csd.auth.gr
Version 2.8

Biological Neuron
⚫ Basic computational unit of the brain.
⚫ Main parts:
  • Dendrites → act as inputs.
  • Soma → main body of the neuron.
  • Axon → acts as output.
⚫ Neurons connect with other neurons via synapses.

Artificial Neurons
⚫ Artificial neurons are mathematical models loosely inspired by their biological counterparts.
⚫ Incoming signals through the axons serve as the input vector:
  $\mathbf{x} = [x_1, x_2, \dots, x_n]^T, \; x_i \in \mathbb{R}$.
⚫ The synaptic weights are grouped in a weight vector:
  $\mathbf{w} = [w_1, w_2, \dots, w_n]^T, \; w_i \in \mathbb{R}$.
⚫ Synaptic integration is modeled as the inner product:
  $z = \sum_{i=1}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}$.
[Figure: artificial neuron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, summation $z$ and threshold.]

Perceptron
[Figure: perceptron — dendrites (inputs) and synapses feeding the cell body with its activation function and threshold; the axon acts as the output.]
⚫ Simplest mathematical model of a neuron (Rosenblatt – McCulloch & Pitts).
⚫ Inputs are binary, $x_i \in \{0, 1\}$.
⚫ Produces a single binary output, $y_i \in \{0, 1\}$, through an activation function $f(\cdot)$.
⚫ The output $y_i$ signifies whether the neuron will fire or not.
⚫ Firing threshold: $\mathbf{w}^T\mathbf{x} \ge -b \;\Rightarrow\; \mathbf{w}^T\mathbf{x} + b \ge 0$.

MLP Architecture
⚫ Multilayer perceptrons (feed-forward neural networks) typically consist of $L + 1$ layers with $k_l$ neurons in each layer $l = 0, \dots, L$.
⚫ The first layer (input layer, $l = 0$) contains $n$ inputs, where $n$ is the dimensionality of the input sample vector.
⚫ The $L - 1$ hidden layers $l = 1, \dots, L - 1$ can contain any number of neurons.
⚫ The output layer $l = L$ typically contains either one neuron (especially when dealing with regression problems) or $m$ neurons, where $m$ is the number of classes.
⚫ Thus, neural networks can be described by the following characteristic numbers:
  1. Depth: the number of layers $L$ beyond the input layer.
  2. Width: the number of neurons per layer, $k_l$, $l = 1, \dots, L$.
  3. Output size: $k_L = 1$ or $k_L = m$.

MLP Forward Propagation
⚫ Following the same procedure for the $l$-th hidden layer, we get:
  $z_j^{(l)} = \sum_{i=1}^{k_{l-1}} w_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)}, \qquad a_j^{(l)} = f\big(z_j^{(l)}\big), \qquad j = 1, \dots, k_l.$
⚫ In compact form:
  $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = f\big(\mathbf{z}^{(l)}\big).$

MLP Activation Functions
⚫ The preceding (universal approximation) theorem verifies that MLPs are universal approximators, under the condition that the activation function $\phi(\cdot)$ of each neuron is continuous.
⚫ Continuous (and differentiable) activation functions:
  • Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, $\mathbb{R} \to (0, 1)$
  • $\mathrm{ReLU}(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$, $\mathbb{R} \to [0, +\infty)$
  • Arctangent: $f(x) = \tan^{-1}(x)$, $\mathbb{R} \to (-\pi/2, \pi/2)$
  • Hyperbolic tangent: $f(x) = \tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$, $\mathbb{R} \to (-1, +1)$

Softmax Layer
• It is the last layer in a neural network classifier.
• The response of neuron $i$ in the softmax layer $L$ is calculated with regard to the value of its activation function $a_i = f(z_i)$:
  $\hat{y}_i = g(a_i) = \dfrac{e^{a_i}}{\sum_{k=1}^{k_L} e^{a_k}} \in (0, 1), \qquad i = 1, \dots, k_L.$
• The responses of the softmax neurons sum up to one: $\sum_{i=1}^{k_L} \hat{y}_i = 1$.
• Better representation for mutually exclusive class labels.
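To make the softmax response concrete, here is a minimal NumPy sketch (the helper name `softmax` and the max-subtraction trick for numerical stability are our additions, not something shown on the slides):

```python
import numpy as np

def softmax(a):
    """Softmax responses for the output-layer activations a_i = f(z_i)."""
    # Subtracting the maximum does not change the result (softmax is
    # shift-invariant) but avoids overflow in exp() for large activations.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])   # example activations of the k_L output neurons
y_hat = softmax(a)
print(y_hat)                    # each response lies in (0, 1)
print(y_hat.sum())              # the responses sum to 1
```

Each response $\hat{y}_i$ can then be read as a score for one of the mutually exclusive class labels.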
MLP Example
• Example architecture with $L = 2$ layers (fully connected) and $n = m_0$ input features, $m_1$ neurons at the first (hidden) layer and $m_2$ output units:
  $a_j(l) = f_l\left( \sum_{i=1}^{m_{l-1}} w_{ij}(l)\, a_i(l-1) + w_{0j}(l) \right).$
[Figure: fully connected network — input layer $l = 0$ with axons $x_1, \dots, x_n$; hidden layer $l = 1$ with activations $a_1(1), \dots, a_{m_1}(1)$; output layer $l = 2$ with outputs $\hat{y}_1 = a_1(2), \dots, \hat{y}_{m_2} = a_{m_2}(2)$; weights $w_{ij}(l)$ and bias weights $w_{0j}(l)$.]

MLP Training
• Mean Square Error (MSE): $J(\boldsymbol{\theta}) = J(\mathbf{W}, \mathbf{b}) = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{\mathbf{y}}_i - \mathbf{y}_i \rVert^2$.
  • It is suitable for regression and classification.
• Categorical Cross Entropy error: $J_{CCE} = -\sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \log \hat{y}_{ij}$.
  • It is suitable for classifiers that use softmax output layers.
• Exponential loss: $J(\boldsymbol{\theta}) = \sum_{i=1}^{N} e^{-\beta\, \mathbf{y}_i^T \hat{\mathbf{y}}_i}$.
• Differentiation: $\frac{\partial J}{\partial \mathbf{W}} = \frac{\partial J}{\partial \mathbf{b}} = \mathbf{0}$ can provide the critical points:
  • minima, maxima, saddle points.
• Analytical computations of this kind are usually impossible, so we need to resort to numerical optimization methods.
• Iteratively search the parameter space for the optimal values. In each iteration, apply a small correction to the previous value:
  $\boldsymbol{\theta}(t + 1) = \boldsymbol{\theta}(t) + \Delta\boldsymbol{\theta}.$
• In the neural network case, jointly optimize $\mathbf{W}, \mathbf{b}$:
  $\mathbf{W}(t + 1) = \mathbf{W}(t) + \Delta\mathbf{W}, \qquad \mathbf{b}(t + 1) = \mathbf{b}(t) + \Delta\mathbf{b}.$
• What is an appropriate choice for the correction term?
[Figure: steepest descent on a function surface.]
• Gradient Descent, as applied to Artificial Neural Networks, updates the parameters of the $j$-th neuron in the $l$-th layer as:
  $\Delta w_{jk}^{(l)} = -\mu \dfrac{\partial J}{\partial w_{jk}^{(l)}}, \qquad \Delta b_j^{(l)} = -\mu \dfrac{\partial J}{\partial b_j^{(l)}}.$

Backpropagation
⚫ We can now state the four essential equations for the implementation of the backpropagation algorithm:
  (BP1) $\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} J \circ f'\big(\mathbf{z}^{(L)}\big)$
  (BP2) $\boldsymbol{\delta}^{(l)} = \big( (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \big) \circ f'\big(\mathbf{z}^{(l)}\big)$
  (BP3) $\dfrac{\partial J}{\partial b_j^{(l)}} = \delta_j^{(l)}$
  (BP4) $\dfrac{\partial J}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}$

Backpropagation Algorithm
Algorithm: Neural Network Training with Gradient Descent and Backpropagation
Input: learning rate $\mu$, dataset $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$
  Random initialization of network parameters $(\mathbf{W}, \mathbf{b})$
  Define cost function $J = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert^2$
  while optimization criteria not met do
    $J \leftarrow 0$
    Forward pass of the whole batch: $\hat{\mathbf{y}} = \mathbf{f}_{NN}(\mathbf{x}; \mathbf{W}, \mathbf{b})$
    $J = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert^2$
    $\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} J \circ \dfrac{df(\mathbf{z}^{(L)})}{d\mathbf{z}^{(L)}}$
    for $l = L - 1, \dots, 1$ do
      $\boldsymbol{\delta}^{(l)} = \big( (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \big) \circ \dfrac{df(\mathbf{z}^{(l)})}{d\mathbf{z}^{(l)}}$
      $\dfrac{\partial J}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}$
      $\dfrac{\partial J}{\partial b_j^{(l)}} = \delta_j^{(l)}$
    end
    $\mathbf{W} \leftarrow \mathbf{W} - \mu \nabla_{\mathbf{W}} J(\mathbf{W}, \mathbf{b})$
    $\mathbf{b} \leftarrow \mathbf{b} - \mu \nabla_{\mathbf{b}} J(\mathbf{W}, \mathbf{b})$
  end

Optimization Review
⚫ Batch Gradient Descent problems (continued):
  3) Ill conditioning: it takes into account only first-order information (the 1st derivative). The lack of information about the curvature of the cost function may lead to zig-zagging motions during optimization, resulting in very slow convergence.
  4) It may depend too strongly on the initialization of the parameters.
  5) The learning rate must be chosen very carefully.
⚫ Stochastic Gradient Descent (SGD):
  1. Passing the whole dataset through the network just for a small update of the parameters is inefficient.
  2. Most cost functions used in Machine Learning and Deep Learning applications can be decomposed into a sum over training samples. Instead of computing the gradient over the whole training set, we decompose the training set $\{(\mathbf{x}_i, \mathbf{y}_i),\; i = 1, \dots, N\}$ into $B$ mini-batches (possibly of the same cardinality $S$) and estimate $\nabla_{\boldsymbol{\theta}} J$ from $\nabla_{\boldsymbol{\theta}} J_b$, $b = 1, \dots, B$:
  $\frac{1}{S} \sum_{s=1}^{S} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_s} \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_i} = \nabla_{\boldsymbol{\theta}} J \;\Rightarrow\; \nabla_{\boldsymbol{\theta}} J \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_s}.$
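As a rough illustration of the mini-batch gradient estimate above, the following sketch runs mini-batch SGD on a toy least-squares problem (the linear model, the batch size $S = 32$ and all variable names are illustrative assumptions, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset {(x_i, y_i)}, i = 1..N, and a linear model y ≈ X @ theta
N, d, S = 1000, 5, 32                  # samples, features, mini-batch size
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
mu = 0.05                              # learning rate

for epoch in range(20):
    perm = rng.permutation(N)          # shuffle, then split into mini-batches
    for b in range(0, N, S):
        idx = perm[b:b + S]
        Xb, yb = X[idx], y[idx]
        # MSE gradient estimated on the current mini-batch only
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)
        theta -= mu * grad             # theta(t+1) = theta(t) - mu * grad
```

Each update uses only the $S$ samples of one mini-batch to approximate $\nabla_{\boldsymbol{\theta}} J$, which is exactly the estimate stated above.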
Optimization Review
• Adaptive Learning Rate Algorithms:
  • Up to this point, the learning rate was given as an input to the algorithm and was kept constant throughout the whole optimization procedure.
  • To overcome the difficulties of using a fixed rate, adaptive-rate algorithms assign a different learning rate to each separate model parameter.
• 1. AdaGrad algorithm:
  i. Initialize the gradient accumulation variable $\mathbf{r} = \mathbf{0}$.
  ii. Compute and accumulate the gradient: $\mathbf{g} \leftarrow \frac{1}{m} \nabla_{\boldsymbol{\theta}} \sum_i J\big(f(\mathbf{x}^{(i)}; \boldsymbol{\theta}), \mathbf{y}^{(i)}\big)$, $\quad \mathbf{r} \leftarrow \mathbf{r} + \mathbf{g} \circ \mathbf{g}$.
  iii. Compute the update $\Delta\boldsymbol{\theta} \leftarrow -\dfrac{\mu}{\varepsilon + \sqrt{\mathbf{r}}} \circ \mathbf{g}$ and apply it: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \Delta\boldsymbol{\theta}$.
  • Parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate.
  • $\circ$ denotes element-wise multiplication.

Generalization
• Good performance on the training set alone is not a satisfying indicator of a good model.
• We are interested in building models that perform well on data that has not been used in the training process. Bad generalization mainly occurs in two forms:
1. Overfitting:
  • Overfitting can be detected when there is a great discrepancy between the performance of the model on the training set and on any other set. It can happen due to excessive model complexity.
  • An example of overfitting would be trying to model a quadratic equation with a 10-th degree polynomial.
2. Underfitting:
  • On the other hand, underfitting occurs when a model cannot accurately capture the underlying structure of the data.
  • Underfitting can be detected by very low performance on the training set.
  • Example: trying to model a non-linear equation using a linear one.

Revisiting Activation Functions
• To overcome the shortcomings of the sigmoid activation function, the Rectified Linear Unit (ReLU) was introduced.
  • It is very easy to compute.
  • Due to its quasi-linear form it leads to faster convergence (it does not saturate at high values).
  • Drawback: units with zero derivative (negative inputs) will not activate at all.

Training on Large Scale Datasets
• Large number of training samples, in the magnitude of hundreds of thousands.
  • Problem: the dataset does not fit in memory.
  • Solution: use the mini-batch SGD method.
• Many classes, in the magnitude of hundreds up to one thousand.
  • Problem: it is difficult to converge using the MSE error.
  • Solution: use the Categorical Cross Entropy (CCE) loss on a Softmax output layer.

Towards Deep Learning
⚫ Increasing the network depth (number of layers) $L$ can result in negligible weight updates in the first layers, because the corresponding deltas become very small or vanish completely.
⚫ Problem: vanishing gradients.
⚫ Solution: replace the sigmoid with an activation function without an upper bound, like a rectifier (a.k.a. ramp function, ReLU).
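To see numerically why sigmoid saturation makes the deltas vanish while ReLU does not, here is a small stand-alone sketch (our own illustration; the chosen inputs and the depth of 20 layers are arbitrary examples, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # bounded by 0.25, near 0 for large |z|

def d_relu(z):
    return (z > 0).astype(float)    # exactly 1 for every positive input

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(d_sigmoid(z))   # roughly [0.0025, 0.105, 0.25, 0.105, 0.0025]
print(d_relu(z))      # [0., 0., 0., 1., 1.]

# Each backpropagated delta picks up one activation-derivative factor per
# layer. With sigmoids that factor is at most 0.25, so (ignoring the
# weights) the contribution shrinks at least as fast as 0.25**depth,
# while ReLU contributes a factor of exactly 1 on its active path.
depth = 20
print(0.25 ** depth)  # ~9e-13: the vanishing-gradient effect
```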