Neural Networks
Nicholas Ruozzi, University of Texas at Dallas

Handwritten Digit Recognition
• Given a collection of handwritten digits and their corresponding labels, we'd like to be able to correctly classify handwritten digits
  – A simple algorithmic technique can solve this problem with 95% accuracy
• This seems surprising; in fact, state-of-the-art methods can achieve near 99% accuracy (you've probably seen these in action if you've deposited a check recently)
Figure: digits from the MNIST data set

Neural Networks
• The basis of neural networks was developed in the 1940s-1960s
  – The idea was to build mathematical models that might "compute" in the same way that neurons in the brain do
  – As a result, neural networks are biologically inspired, though many of the algorithms developed for them are not biologically plausible
  – They perform surprisingly well on the handwritten digit recognition task

Neural Networks
• Neural networks consist of a collection of artificial neurons
• There are different types of neuron models that are commonly studied
  – The perceptron (one of the first studied)
  – The sigmoid neuron (one of the most common, but there are many more)
  – Rectified linear units
• A neural network is typically a directed graph consisting of a collection of neurons (the nodes in the graph), directed edges (each with an associated weight), and a collection of fixed binary inputs

The Perceptron
• A perceptron is an artificial neuron that takes a collection of binary inputs and produces a binary output
  – The output of the perceptron is determined by summing up the weighted inputs and thresholding the result: if the weighted sum is larger than the threshold, the output is one (and zero otherwise)
• For a perceptron with inputs $x_1, x_2, x_3$ and output $y$:

  $y = \begin{cases} 1 & w_1 x_1 + w_2 x_2 + w_3 x_3 > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$

• The weights can be both positive and negative
• Many simple decisions can be modeled using perceptrons

Perceptron for NOT
• Choose $w = -1$, threshold $= -0.5$

  $y = \begin{cases} 1 & -x > -0.5 \\ 0 & -x \le -0.5 \end{cases}$

Perceptron for OR
• Choose $w_1 = w_2 = 1$, threshold $= 0$

  $y = \begin{cases} 1 & x_1 + x_2 > 0 \\ 0 & x_1 + x_2 \le 0 \end{cases}$

Perceptron for AND
• Choose $w_1 = w_2 = 1$, threshold $= 1.5$

  $y = \begin{cases} 1 & x_1 + x_2 > 1.5 \\ 0 & x_1 + x_2 \le 1.5 \end{cases}$

Perceptron for XOR
• Need more than one perceptron! One such network computes $(x_1 \lor x_2) \land \lnot(x_1 \land x_2)$
• Weights for incoming edges are chosen as before
  – Networks of perceptrons can encode any circuit!
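To make the threshold rule concrete, here is a minimal sketch in Python/NumPy, assuming 0/1 inputs. The weights and thresholds are the ones chosen on the slides above, and the XOR network simply composes the OR, AND, and NOT perceptrons.

```python
import numpy as np

def perceptron(x, w, threshold):
    """Output 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    return 1 if np.dot(w, x) > threshold else 0

def NOT(x):
    return perceptron([x], [-1.0], -0.5)           # w = -1, threshold = -0.5

def OR(x1, x2):
    return perceptron([x1, x2], [1.0, 1.0], 0.0)   # w1 = w2 = 1, threshold = 0

def AND(x1, x2):
    return perceptron([x1, x2], [1.0, 1.0], 1.5)   # w1 = w2 = 1, threshold = 1.5

def XOR(x1, x2):
    # XOR needs more than one perceptron: (x1 OR x2) AND NOT(x1 AND x2).
    return AND(OR(x1, x2), NOT(AND(x1, x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", XOR(x1, x2))           # prints 0, 1, 1, 0
```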
Perceptrons
• Perceptrons are usually expressed in terms of a collection of input weights and a bias $b$ (which is the negative of the threshold)

  $y = \begin{cases} 1 & w_1 x_1 + w_2 x_2 + w_3 x_3 + b > 0 \\ 0 & \text{otherwise} \end{cases}$

• A single-node perceptron is just a linear classifier
  – This is actually where the "perceptron algorithm" comes from

Neural Networks
• Gluing a bunch of perceptrons together gives us a neural network
• In general, neural nets have a collection of binary inputs and a collection of binary outputs

Beyond Perceptrons
• Given a collection of input-output pairs, we'd like to learn the weights of the neural network so that we can correctly predict the output of an unseen input
  – We could try learning via gradient descent (e.g., by minimizing the Hamming loss)
• This approach doesn't work so well: small changes in the weights can cause dramatic changes in the output
• This is a consequence of the discontinuity of sharp thresholding (the same problem we saw with SVMs)

The Sigmoid Neuron
• A sigmoid neuron is an artificial neuron that takes a collection of inputs in the interval $[0,1]$ and produces an output in the interval $[0,1]$
  – The output is determined by summing up the weighted inputs plus the bias and applying the sigmoid function to the result

  $y = \sigma(w_1 x_1 + w_2 x_2 + w_3 x_3 + b)$, where $\sigma$ is the sigmoid function

The Sigmoid Function
• The sigmoid function is a continuous function that approximates a step function

  $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

Rectified Linear Units
• The sigmoid neuron approximates a step function as a smooth function
• The rectified linear unit is based on the hinge function $\max(0, x)$, which can be approximated by the smooth continuous function $\ln(1 + e^{x})$

Multilayer Neural Networks
• Figure: a multilayer network, from Neural Networks and Deep Learning by Michael Nielsen
• NO intra-layer connections

Neural Network for Digit Classification
• Figure: a digit-classification network, from Neural Networks and Deep Learning by Michael Nielsen
• Why 10 output neurons instead of 4?
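As a rough illustration of how sigmoid neurons are wired into a multilayer network, here is a small forward-pass sketch in Python/NumPy. The layer sizes (784 pixel inputs, 30 hidden units, 10 outputs) are a hypothetical choice in the spirit of the digit-classification figure, not something fixed by the slides, and the weights are random placeholders rather than learned values.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function: a smooth approximation of the 0/1 step function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass through a feedforward network of sigmoid neurons.

    weights[l] holds the matrix w^l (one row per neuron in layer l) and
    biases[l] the vector b^l; there are no intra-layer connections."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)      # a^l = sigma(w^l a^(l-1) + b^l)
    return a

# Hypothetical layer sizes for digit classification: 784 pixel inputs,
# 30 hidden neurons, and 10 outputs (one per digit class).
rng = np.random.default_rng(0)
sizes = [784, 30, 10]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.random(784)                 # stand-in for one flattened 28x28 image
print(forward(x, weights, biases))  # ten activations, each in (0, 1)
```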
Expressiveness of NNs
• Boolean functions
  – Every Boolean function can be represented by a network with a single hidden layer, possibly consisting of exponentially many hidden units
• Continuous functions
  – Every bounded continuous function can be approximated up to arbitrarily small error by a network with one hidden layer
  – Any function can be approximated to arbitrary accuracy by a network with two hidden layers

Training Neural Networks
• To do the learning, we first need to define a loss function to minimize

  $C(w, b) = \dfrac{1}{2M} \sum_m \left\lVert y^m - a(x^m, w, b) \right\rVert^2$

• The training data consists of input-output pairs $(x^1, y^1), \dots, (x^M, y^M)$
• $a(x^m, w, b)$ is the output of the neural network for the $m$th sample
• $w$ and $b$ are the weights and biases

Gradient of the Loss
• The derivative of the loss function is calculated as follows

  $\dfrac{\partial C(w, b)}{\partial w_k} = -\dfrac{1}{M} \sum_m \left( y^m - a(x^m, w, b) \right) \dfrac{\partial a(x^m, w, b)}{\partial w_k}$

  – To compute the derivative of $a$, use the chain rule and the derivative of the sigmoid function

    $\dfrac{d\sigma(z)}{dz} = \sigma(z) \cdot \left(1 - \sigma(z)\right)$

  – This gets complicated quickly with lots of layers of neurons

Stochastic Gradient Descent
• To make the training more practical, stochastic gradient descent is used instead of standard gradient descent
• Recall, the idea of stochastic gradient descent is to approximate the gradient of a sum by sampling a few indices and averaging

  $\nabla_x \sum_{i=1}^{n} f_i(x) \approx \dfrac{1}{K} \sum_{k=1}^{K} \nabla_x f_{i_k}(x)$

  here, for example, each $i_k$ is sampled uniformly at random from $\{1, \dots, n\}$

Computing the Gradient
• We'll compute the gradient for a single sample

  $C(w, b) = \dfrac{1}{2} \left\lVert y - a(x, w, b) \right\rVert^2$

• Some definitions:
  – $L$ is the number of layers
  – $a_j^l$ is the output of the $j$th neuron on the $l$th layer
  – $z_j^l$ is the input of the $j$th neuron on the $l$th layer

    $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$

  – $\delta_j^l$ is defined to be $\dfrac{\partial C}{\partial z_j^l}$

Computing the Gradient
• For the output layer, we have the following partial derivative

  $\dfrac{\partial C}{\partial z_j^L} = -\left( y_j - a_j^L \right) \dfrac{\partial a_j^L}{\partial z_j^L} = -\left( y_j - a_j^L \right) \dfrac{\partial \sigma(z_j^L)}{\partial z_j^L} = -\left( y_j - a_j^L \right) \sigma(z_j^L)\left(1 - \sigma(z_j^L)\right) = \delta_j^L$

• For simplicity, we will denote the vector of all such partials for each node in the $l$th layer as $\delta^l$

Computing the Gradient
• For the $L-1$ layer, we have the following partial derivative

  $\dfrac{\partial C}{\partial z_k^{L-1}} = \sum_j \left( a_j^L - y_j \right) \dfrac{\partial a_j^L}{\partial z_k^{L-1}}$
  $\quad = \sum_j \left( a_j^L - y_j \right) \dfrac{\partial \sigma(z_j^L)}{\partial z_k^{L-1}}$
  $\quad = \sum_j \left( a_j^L - y_j \right) \sigma(z_j^L)\left(1 - \sigma(z_j^L)\right) \dfrac{\partial z_j^L}{\partial z_k^{L-1}}$
  $\quad = \sum_j \left( a_j^L - y_j \right) \sigma(z_j^L)\left(1 - \sigma(z_j^L)\right) \dfrac{\partial \left( \sum_{k'} w_{jk'}^L a_{k'}^{L-1} + b_j^L \right)}{\partial z_k^{L-1}}$
  $\quad = \sum_j \left( a_j^L - y_j \right) \sigma(z_j^L)\left(1 - \sigma(z_j^L)\right) \sigma(z_k^{L-1})\left(1 - \sigma(z_k^{L-1})\right) w_{jk}^L$
  $\quad = (\delta^L)^T w_{*k}^L \, \sigma(z_k^{L-1})\left(1 - \sigma(z_k^{L-1})\right)$

Computing the Gradient
• We can think of $w^l$ as a matrix
• This allows us to write

  $\delta^{L-1} = (\delta^L)^T w^L \, \sigma(z^{L-1})\left(1 - \sigma(z^{L-1})\right)$

  where $\sigma(z^{L-1})$ is the vector whose $k$th component is $\sigma(z_k^{L-1})$ and the product with $\left(1 - \sigma(z^{L-1})\right)$ is taken componentwise
• Applying the same strategy, for $l < L$

  $\delta^l = (\delta^{l+1})^T w^{l+1} \, \sigma(z^l)\left(1 - \sigma(z^l)\right)$

Computing the Gradient
• Now, for the partial derivatives that we care about

  $\dfrac{\partial C}{\partial b_j^l} = \dfrac{\partial C}{\partial z_j^l} \cdot \dfrac{\partial z_j^l}{\partial b_j^l} = \delta_j^l$

  $\dfrac{\partial C}{\partial w_{jk}^l} = \dfrac{\partial C}{\partial z_j^l} \cdot \dfrac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l a_k^{l-1}$

• We can compute these derivatives one layer at a time! (See the gradient-check sketch below.)
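The layer-by-layer recursion above can be checked numerically. The sketch below is a rough Python/NumPy illustration under the quadratic single-sample loss used on the slides: it computes $\delta^L$ and $\delta^l$ as derived, then compares one analytic partial $\partial C / \partial w_{jk}^l = \delta_j^l a_k^{l-1}$ against a finite-difference estimate. The tiny layer sizes and random data are arbitrary choices made only for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass; returns the inputs z^l and outputs a^l of every layer."""
    a, zs, activations = x, [], [x]
    for w, b in zip(weights, biases):
        z = w @ a + b                    # z^l = w^l a^(l-1) + b^l
        a = sigmoid(z)                   # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

def deltas(y, zs, activations, weights):
    """delta^l = dC/dz^l for the single-sample loss C = 0.5 * ||y - a^L||^2."""
    d = -(y - activations[-1]) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))   # delta^L
    ds = [d]
    for l in range(len(weights) - 2, -1, -1):
        # delta^l = (w^(l+1))^T delta^(l+1), scaled componentwise by sigma'(z^l)
        d = (weights[l + 1].T @ d) * sigmoid(zs[l]) * (1 - sigmoid(zs[l]))
        ds.insert(0, d)
    return ds

def loss(x, y, weights, biases):
    _, activations = forward(x, weights, biases)
    return 0.5 * np.sum((y - activations[-1]) ** 2)

# A tiny network with arbitrary sizes, just to check the formulas.
rng = np.random.default_rng(1)
sizes = [4, 5, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
x, y = rng.random(4), rng.random(3)

zs, activations = forward(x, weights, biases)
ds = deltas(y, zs, activations, weights)

# Analytic partial dC/dw^1_{jk} = delta^1_j * a^0_k for one entry (j = 2, k = 1).
analytic = ds[0][2] * activations[0][1]

# Finite-difference estimate of the same partial derivative.
eps = 1e-6
w0 = weights[0][2, 1]
weights[0][2, 1] = w0 + eps
c_plus = loss(x, y, weights, biases)
weights[0][2, 1] = w0 - eps
c_minus = loss(x, y, weights, biases)
weights[0][2, 1] = w0
numeric = (c_plus - c_minus) / (2 * eps)

print(analytic, numeric)   # the two numbers should agree to several decimal places
```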
Backpropagation: Putting It All Together
• Compute the inputs/outputs for each layer by starting at the input layer and applying the sigmoid functions
• Compute $\delta^L$ for the output layer

  $\delta_j^L = -\left( y_j - a_j^L \right) \sigma(z_j^L)\left(1 - \sigma(z_j^L)\right)$

• Starting from $l = L - 1$ and working backwards, compute

  $\delta^l = (\delta^{l+1})^T w^{l+1} \, \sigma(z^l)\left(1 - \sigma(z^l)\right)$

• Perform gradient descent

  $b_j^l = b_j^l - \gamma \cdot \delta_j^l$
  $w_{jk}^l = w_{jk}^l - \gamma \cdot \delta_j^l a_k^{l-1}$

  (a full training-loop sketch appears at the end of this section)

Backpropagation
• Backpropagation converges to a local minimum (the loss is not convex in the weights and biases)
  – Like EM, we can just run it several times with different initializations
  – Training can take a very long time (even with stochastic gradient descent)
  – Prediction after learning is fast
  – Sometimes a momentum term $\alpha$ is included in the gradient update

    $w(t) = w(t-1) - \gamma \cdot \nabla_w C(t-1) + \alpha \left( -\gamma \cdot \nabla_w C(t-2) \right)$

Overfitting
• Figures illustrating overfitting

Neural Networks in Practice
• Many ways to improve weight learning in NNs
  – Use a regularizer! (better generalization)
  – Try other loss functions
  – Initialize the weights of the network more cleverly
    • Random initializations are likely to be far from optimal
  – etc.
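Putting the pieces together, here is a sketch of the full procedure in Python/NumPy: a forward pass, the backward $\delta$ recursion, and a stochastic gradient step with a momentum term. It trains a tiny sigmoid network on XOR, the example from earlier in the section. The layer sizes, learning rate $\gamma$, momentum $\alpha$, and iteration count are made-up values, and the momentum update uses the common running-velocity variant rather than exactly the two-step formula on the slide. Since backpropagation only finds a local minimum, a different random seed may be needed if training gets stuck.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """One forward/backward pass for a single sample under the quadratic loss.

    Returns the per-layer gradients dC/dw^l and dC/db^l."""
    # Forward pass: store z^l and a^l for every layer.
    a, zs, activations = x, [], [x]
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    # Output layer: delta^L = -(y - a^L) * sigma'(z^L).
    delta = -(y - activations[-1]) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    grads_w, grads_b = [None] * len(weights), [None] * len(biases)
    grads_w[-1] = np.outer(delta, activations[-2])   # dC/dw^L_{jk} = delta^L_j a^{L-1}_k
    grads_b[-1] = delta
    # Hidden layers, working backwards: delta^l = (w^(l+1))^T delta^(l+1) * sigma'(z^l).
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid(zs[l]) * (1 - sigmoid(zs[l]))
        grads_w[l] = np.outer(delta, activations[l])
        grads_b[l] = delta
    return grads_w, grads_b

# Train a small 2-3-1 sigmoid network on XOR with stochastic gradient descent.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
vel_w = [np.zeros_like(w) for w in weights]   # momentum "velocities"
vel_b = [np.zeros_like(b) for b in biases]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
gamma, alpha = 0.5, 0.9                        # made-up learning rate and momentum

for step in range(30000):
    i = rng.integers(len(X))                   # sample one training pair at random
    gw, gb = backprop(X[i], Y[i], weights, biases)
    for l in range(len(weights)):
        vel_w[l] = alpha * vel_w[l] - gamma * gw[l]
        vel_b[l] = alpha * vel_b[l] - gamma * gb[l]
        weights[l] += vel_w[l]
        biases[l] += vel_b[l]

for x, y in zip(X, Y):
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    print(x, y, "->", a)   # outputs should be near 0, 1, 1, 0 (rerun with a new seed if not)
```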