
Chapter 10: Artificial Neural Networks

Dr. Xudong Liu
Assistant Professor, School of Computing, University of North Florida

Monday, 9/30/2019

Overview

1 Artificial neuron: linear threshold unit (LTU)
2 Perceptrons
3 Multi-Layer Perceptron (MLP), Deep Neural Networks (DNN)
4 Learning ANNs: Error Backpropagation

Differential Calculus

In this course, I shall stick to Leibniz's notation.
The derivative of a function at an input point, when it exists, is the slope of the tangent line to the graph of the function.
  Let y = f(x) = x² + 2x + 1. Then f′(x) = dy/dx = 2x + 2.
A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.
  Let z = f(x, y) = x² + x·y + y². Then ∂z/∂x = 2x + y.
The chain rule is a formula for computing the derivative of the composition of two or more functions.
  Let z = f(y) and y = g(x). Then dz/dx = dz/dy · dy/dx = f′(y) · g′(x).
A nice thing about the sigmoid function f(x) = 1/(1 + e^(−x)): f′(x) = f(x) · (1 − f(x)).
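As a quick numerical sanity check of that last identity, here is a minimal Python sketch (the function names are mine, not from the slides):

import numpy as np

def sigmoid(x):
    # Logistic sigmoid f(x) = 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # centered finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))                # closed form f(x) * (1 - f(x))
print(np.max(np.abs(numeric - analytic)))               # tiny; the two agree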

Artificial Neuron

An artificial neuron, also called a linear threshold unit (LTU), by McCulloch and Pitts, 1943: given one or more numeric inputs, it produces a weighted sum of them, applies an activation function, and outputs the result. Common activation functions: the step function and the sigmoid function.

Linear Threshold Unit (LTU)

Below is an LTU with the step function as its activation function.
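Since the original figure is not reproduced here, here is a minimal sketch of such an LTU (the names step and ltu, and the convention step(w·x − θ), are mine):

import numpy as np

def step(z):
    # Heaviside step activation: 1 if z >= 0, else 0.
    return 1 if z >= 0 else 0

def ltu(x, w, theta):
    # Linear threshold unit: step applied to the weighted sum minus the threshold.
    return step(np.dot(w, x) - theta)

# Fires only when both binary inputs are 1 (weights 1, 1 and threshold 1.5).
print(ltu(np.array([1, 1]), np.array([1.0, 1.0]), 1.5))  # 1
print(ltu(np.array([1, 0]), np.array([1.0, 1.0]), 1.5))  # 0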

Perceptrons

A perceptron, by Rosenblatt in 1957, is composed of two layers of neurons: an input layer consisting of special pass-through neurons and an output layer of LTUs. The bias neuron is added so that each output LTU computes a full linear function, intercept included. Rosenblatt proved that, if the training examples are linearly separable, a perceptron can always be trained to correctly classify all of them.

Perceptrons

For instance, perceptrons can implement logical conjunction, disjunction, and negation. Consider a perceptron of one LTU whose activation is the step function applied to w1·x1 + w2·x2 − θ:
x1 ∧ x2: w1 = w2 = 1, θ = 2
x1 ∨ x2: w1 = w2 = 1, θ = 0.5
¬x1: w1 = −0.6, w2 = 0, θ = −0.5
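These three gates can be verified by brute force over all 0/1 inputs; a minimal sketch (the ltu helper is my own shorthand for one step-activated LTU):

import numpy as np

def ltu(x, w, theta):
    # One LTU: outputs 1 if w . x - theta >= 0, else 0.
    return 1 if np.dot(w, x) - theta >= 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2])
    conj = ltu(x, np.array([1.0, 1.0]), 2.0)    # x1 AND x2
    disj = ltu(x, np.array([1.0, 1.0]), 0.5)    # x1 OR x2
    neg  = ltu(x, np.array([-0.6, 0.0]), -0.5)  # NOT x1
    print((x1, x2), "AND:", conj, "OR:", disj, "NOT x1:", neg)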

Perceptrons

However, perceptrons cannot solve some trivially simple problems that are not linearly separable, such as the Exclusive OR (XOR) classification problem, as shown by Minsky and Papert in 1969. It turned out that stacking multiple perceptrons into layers overcomes this limitation and can solve such non-linear problems.

Multi-Layer Perceptrons

A Multi-Layer Perceptron (MLP) is composed of one pass-through input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs, called the output layer. Again, every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an MLP has two or more hidden layers, it is called a deep neural network (DNN).

Learning Multi-Layer Perceptrons: Error Backpropagation

Error backpropagation is so far the most successful learning algorithm for training MLPs ("Learning Internal Representations by Error Propagation", Rumelhart, Hinton and Williams, 1986).
Idea: we start with a fixed network, then repeatedly update the edge weights using v ← v + ∆v, where ∆v is a gradient-descent step on some error objective function.
Before learning, let's take a look at a few possible activation functions and their derivatives, sketched below.
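The slide's figure of candidate activation functions is not reproduced here; a minimal sketch of three common choices and their derivatives (this particular selection is mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)            # f'(z) = f(z)(1 - f(z))

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2  # f'(z) = 1 - f(z)^2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)  # 0 for z < 0, 1 for z > 0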

Error Backpropagation

Now, I explain the error backpropagation algorithm for the following MLP of d pass-through input neurons, one hidden layer of q LTUs, and an output layer of l LTUs. I will assume the activation function f at all l + q LTUs is the sigmoid function. Note the bias neuron is gone from the picture, as the bias is now embedded in each LTU's activation: yj = f(Σi wi·xi − θj).

Error Backpropagation

So our training set is {(x1, y1), ..., (xm, ym)}, where xi ∈ R^d and yi ∈ R^l. We want an MLP like the one below to fit this data set; namely, to compute the l·q + d·q weights and the l + q biases. To predict on a new example x, we feed it to the input neurons and collect the outputs; the predicted class could be the output yj with the highest score. If you want class probabilities, consider softmax. An MLP can be used for regression too, when there is only one output neuron.
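A minimal sketch of turning the l output scores into a predicted class, and optionally into class probabilities via softmax (softmax is the suggestion above, not part of the MLP itself; the function names are mine):

import numpy as np

def predict_class(scores):
    # Pick the output neuron with the highest score.
    return int(np.argmax(scores))

def softmax(scores):
    # Convert scores to probabilities; subtracting the max keeps the exponentials stable.
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([0.2, 1.3, 0.7])
print(predict_class(scores))  # 1
print(softmax(scores))        # probabilities summing to 1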

Error Backpropagation

I use θj to mean the bias in the j-th neuron in the output layer, and γh the bias in the h-th neuron in the hidden layer.
I use βj = Σ_{h=1..q} whj · bh to mean the input to the j-th neuron in the output layer, and αh = Σ_{i=1..d} vih · xi the input to the h-th neuron in the hidden layer; bh = f(αh − γh) is the output of the h-th hidden neuron.
Take a training example (x, y). I use ŷ = (ŷ1, ..., ŷl) to mean the predicted output, where each ŷj = f(βj − θj).
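A minimal forward-pass sketch in this notation (the array names V, gamma, W, theta mirror vih, γh, whj, θj; the shapes are my convention):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, gamma, W, theta):
    # x: (d,) input; V: (d, q) weights v_ih; gamma: (q,) biases gamma_h;
    # W: (q, l) weights w_hj; theta: (l,) biases theta_j.
    alpha = x @ V                  # alpha_h = sum_i v_ih * x_i
    b = sigmoid(alpha - gamma)     # hidden outputs b_h = f(alpha_h - gamma_h)
    beta = b @ W                   # beta_j = sum_h w_hj * b_h
    y_hat = sigmoid(beta - theta)  # predictions y_hat_j = f(beta_j - theta_j)
    return b, y_hat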

Error Backpropagation

Error function (objective function): E = (1/2) · Σ_{j=1..l} (ŷj − yj)².
This error function is a composite function of many parameters and it is differentiable, so we can compute the gradient-descent updates used to adjust the weights and biases.

The Error Backpropagation Algorithm

Given a training set D = {(x, y)} and a learning rate η, we want to learn the weights and biases in the MLP. The procedure is listed below, followed by a runnable sketch.

1 Initialize all the weights and biases in the MLP using random values from the interval [0, 1];
2 Repeat the following until some stopping criterion is met:
  1 For every (x, y) ∈ D, do
    1 Calculate ŷ;
    2 Calculate the gradient-descent updates ∆whj and ∆θj for the neurons in the output layer;
    3 Calculate the gradient-descent updates ∆vih and ∆γh for the neurons in the hidden layer;
    4 Update the weights whj and vih, and the biases θj and γh.
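A runnable sketch of the whole procedure under the assumptions above (sigmoid activations, one hidden layer of q LTUs, per-example updates; the bias updates ∆θj = −η·gj and ∆γh = −η·eh follow from the "similar" derivations on the next slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, Y, q, eta=0.1, epochs=1000, seed=0):
    # X: (m, d) training inputs; Y: (m, l) training targets.
    rng = np.random.default_rng(seed)
    d, l = X.shape[1], Y.shape[1]
    # Step 1: initialize all weights and biases with random values in [0, 1].
    V = rng.random((d, q))       # v_ih
    gamma = rng.random(q)        # gamma_h
    W = rng.random((q, l))       # w_hj
    theta = rng.random(l)        # theta_j
    # Step 2: repeat until a stopping criterion (here: a fixed number of epochs).
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # 2.1.1 Calculate y_hat (forward pass).
            b = sigmoid(x @ V - gamma)       # hidden outputs b_h
            y_hat = sigmoid(b @ W - theta)   # predictions y_hat_j
            # 2.1.2 Output-layer terms: g_j = (y_j - y_hat_j) * y_hat_j * (1 - y_hat_j).
            g = (y - y_hat) * y_hat * (1 - y_hat)
            # 2.1.3 Hidden-layer terms: e_h = b_h * (1 - b_h) * sum_j w_hj * g_j.
            e = b * (1 - b) * (W @ g)
            # 2.1.4 Apply the updates.
            W += eta * np.outer(b, g)        # Delta w_hj = eta * g_j * b_h
            theta += -eta * g                # Delta theta_j = -eta * g_j
            V += eta * np.outer(x, e)        # Delta v_ih = eta * e_h * x_i
            gamma += -eta * e                # Delta gamma_h = -eta * e_h
    return V, gamma, W, theta

Note that the hidden-layer terms e are computed with the old weights W, before W is updated; this is exactly the output-layer-first, hidden-layer-second schedule derived on the next slides.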

The Error Backpropagation Algorithm: ∆whj

1 ∆whj = −η · ∂E/∂whj
2 ∂E/∂whj = (∂E/∂ŷj) · (∂ŷj/∂βj) · (∂βj/∂whj)
3 We know ∂βj/∂whj = bh, for we have βj = Σ_{h=1..q} whj · bh
4 We know ∂E/∂ŷj = ŷj − yj, for we have E = (1/2) · Σ_{j=1..l} (ŷj − yj)²
5 We also know ∂ŷj/∂βj = f′(βj − θj) = ŷj · (1 − ŷj), for f is the sigmoid
6 Bullets 3, 4 and 5 together give ∆whj in bullet 1.
7 Computing ∆θj is similar.
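Spelled out (the combined expression is not written on the slide; gj anticipates the notation of the next slide, and ∆θj follows by the same chain rule):

% Combining bullets 3, 4, and 5 into bullet 1:
\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}}
              = -\eta\,(\hat{y}_j - y_j)\,\hat{y}_j(1 - \hat{y}_j)\, b_h
              = \eta\, g_j\, b_h,
\quad \text{where } g_j = (y_j - \hat{y}_j)\,\hat{y}_j(1 - \hat{y}_j).
% Since \hat{y}_j = f(\beta_j - \theta_j) and \partial(\beta_j - \theta_j)/\partial\theta_j = -1:
\Delta \theta_j = -\eta\, g_j.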

The Error Backpropagation Algorithm: ∆vih

1 ∆vih = −η · ∂E/∂vih
2 ∂E/∂vih = (∂E/∂bh) · (∂bh/∂αh) · (∂αh/∂vih)
3 (Long story short)
4 ∆vih = η · xi · bh(1 − bh) · Σ_{j=1..l} whj · gj, where gj = (yj − ŷj) · ŷj · (1 − ŷj)

5 So we apply the updates ∆whj and ∆θj for the output layer first, then ∆vih and ∆γh for the hidden layer; the output-layer error terms gj are reused when computing the hidden-layer updates.
6 This is why it is called backpropagation.
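For completeness (the slides leave ∆γh implicit; eh is my shorthand for the hidden-layer error term, and the result follows by the same chain rule since bh = f(αh − γh)):

% Hidden-layer error term (the factor multiplying \eta x_i in bullet 4):
e_h = b_h(1 - b_h)\sum_{j=1}^{l} w_{hj}\, g_j,
\qquad \Delta v_{ih} = \eta\, x_i\, e_h,
\qquad \Delta \gamma_h = -\eta\, e_h.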
