
Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Opening the Black Box (Part 2)

Overview

Artificial neural networks
Deep learning
Opening the black box (part 2)

2 Artificial Neural Networks

3 The basic idea

Very loosely based on how the human brain works: a collection of mathematical “neurons” is created and connected together, allowing them to send signals to each other. Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.

4 A brief history

https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html

5 A brief history

An electronic brain (1940s): since the dawn of computing, researchers have been thinking about the idea of an “intelligent”, perhaps even “conscious” machine. Alan Turing laid out several criteria to assess whether a machine could be said to be intelligent in Computing Machinery and Intelligence (the Turing test).

The perceptron (1950s): early work in machine learning was inspired by the (then) working theories on the human brain. Frank Rosenblatt kickstarts the field by introducing the “perceptron”: a simplified mathematical representation of a neuron. He was convinced that this would quickly lead to true AI.

6 A brief history

“The embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” – The New York Times, based on Rosenblatt’s statements on the perceptron

Into the AI winter (1960s-80s): Marvin Minsky, considered one of the fathers of AI, is less convinced. Along with Seymour Papert, Minsky wrote a book entitled Perceptrons that in effect ended the optimism around the perceptron: they showed that the perceptron was incapable of learning the simple exclusive-or (XOR) function.

7 A brief history

Backpropagation to the rescue (1980s-90s): interest returns to neural networks. Geoff Hinton shows that neural networks with many hidden layers (i.e. consisting of more than one perceptron) could be effectively trained by a relatively simple procedure, called “backpropagation”. Such networks have the ability to learn any function, a result known as the “universal approximation theorem”, and with that, neural networks were hot again. The idea of using multiple perceptrons in a layered fashion was not new, though it was unclear how exactly such networks could be “trained”. The backpropagation algorithm works by starting from a network’s “error” and “back-propagating” it throughout all its layers to adjust their parameters. This leads to some early successes: multi-layer perceptrons (MLP), and the first convolutional neural networks (CNNs) to recognize handwritten digits by Yann LeCun at AT&T (“LeNet”).

A second AI winter (1990s-early 2000s): the MLP approach didn’t scale well to larger problems, and computing power was lacking for larger networks. Meanwhile, by the 90s, the support vector machine (SVM) was rapidly taking center stage as the method of choice, and neural networks were left behind once again. The field shifts its angle to be a lot more theoretical in nature.

8 A brief history

Deep learning (early 2000s): around 2006, Hinton introduces “unsupervised pretraining” and “deep belief nets”. Train a simple 2-layer unsupervised model, freeze all its parameters, add a new layer on top and just train the parameters for the new layer. Keep adding and training layers until you have a “deep network”.

Deep learning unleashed (2010s): based on Hinton’s work, more and more research papers began to take form. In 2010, a large database known as “ImageNet”, containing millions of labeled images, was created and published by a research group at Stanford. This database was coupled with the annual Large Scale Visual Recognition Challenge (LSVRC), where contestants would build computer vision models. In the first two years of the contest, the top models had error rates of about 25%. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton entered a submission that would halve the error rate! It combined several critical components that would go on to become mainstays in deep learning models: the use of graphics processing units (GPUs) to train the model, a method to reduce overfitting known as dropout, and the rectified linear activation unit (ReLU). The network went on to become known as “AlexNet” and the paper describing it has been cited nearly 60000 times since it was published.

Deep learning all around (2020s): many innovations would follow after this result.

Appearance of large, high-quality labeled datasets
Massively parallel computing with GPUs, TPUs
New activation functions, improved architectures
Software support
New regularization techniques: dropout, batch normalization, data augmentation to protect against overfitting
New optimizers: from stochastic gradient descent (SGD) to RMSprop, ADAM and others
Focus returns to practice, experiments, empirical approaches

9 Foundations: the perceptron

A perceptron works over a number of numerical inputs: the input neurons or input units. Each neuron has an output: its “activation”. For the input neurons, the activation is simply given by the input data instance.

Every input unit’s output is connected to another neuron (the perceptron unit): inputs are multiplied with a weight and summed. A bias can also be added with an input unit that always outputs 1. The final output of the perceptron unit is equal to the result of an “activation function” over the weighted sum of the inputs.

(This can also be expressed as two operations: a combination/transfer function and an activation function)

10 Foundations: the perceptron

11 So, how to train it?

Feedforward is easy: just plug in the inputs and feed them through the network, obtaining an output. For simple statistical models (for example, linear regression), closed-form analytical formulas to determine the optimum parameter estimates exist. For nonlinear models, such as neural networks, the parameter estimates can be determined numerically using an iterative algorithm.

12 So, how to train it?

Say we use L = (y − ŷ)² to see how good our prediction is for an instance

This is our error or loss function

Weights for this simple perceptron can be updated using: w_{i,t+1} = w_{i,t} + η(y − ŷ)x_i

∂L/∂w_i = ∂(y − ŷ)²/∂w_i = [∂(y − ŷ)²/∂(y − ŷ)] × [∂(y − ŷ)/∂w_i] = 2(y − ŷ) × ∂(y − ŷ)/∂w_i = 2(y − ŷ)(−1) × ∂ŷ/∂w_i

Let o = w₁x₁ + w₂x₂ + b, then:

∂L/∂w_i = 2(y − ŷ)(−1) × ∂σ(o)/∂o × ∂o/∂w_i = 2(y − ŷ)(−1) × σ(o)(1 − σ(o)) × x_i

13 So, how to train it?

Gradient descent: w_{i,t+1} = w_{i,t} − η ∂L/∂w_i

η is the learning rate

∂L/∂w_i = 2(y − ŷ)(−1)σ(o)(1 − σ(o))x_i = −2(y − ŷ)σ(o)(1 − σ(o))x_i

Weights for this simple perceptron can hence be updated using: w_{i,t+1} = w_{i,t} + η(y − ŷ)x_i (the constant factor 2 and the positive σ(o)(1 − σ(o)) term are folded into the learning rate here, as in the classic delta rule)

14 So, how to train it?

We’re nudging the weights in order to minimize the error or “loss”

The learning rate η determines the “speed” of the convergence:

Higher: quicker towards the minimum, but risk of overshooting
Lower: slower towards the minimum, risk of getting trapped in local minima

15 So, how to train it?

The loss is a function of the weights given a piece of training data

Minimize the error using the gradient of the loss function. Gradient descent is the process of minimizing a function by following the gradients of the cost function. This involves knowing the form of the cost as well as its derivative, so that from a given point you know the gradient and can move downhill towards the minimum value.

(Stochastic) gradient descent

One iteration: one instance fed forward, weights are updated
One epoch: one full pass over all instances in the training set
To properly train our perceptron, we need to perform multiple passes over the training set: multiple epochs
Typically, the training set is reshuffled after every epoch

16 So, how to train it?

import math

# Toy data set: two numerical inputs per instance, binary target
X = [[0, 1], [1, 0], [2, 2], [3, 4], [4, 2], [5, 2], [4, 1], [5, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights = [0, 0, 0]  # Initialize the weights (first weight is the bias)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def predict(instance, weights):
    output = weights[0]
    for i in range(len(weights) - 1):
        output += weights[i + 1] * instance[i]
    return sigmoid(output)

def train(instance, weights, y_true, l_rate=0.01):
    prediction = predict(instance, weights)
    error = y_true - prediction
    weights[0] = weights[0] + l_rate * error
    for i in range(len(weights) - 1):
        weights[i + 1] = weights[i + 1] + l_rate * error * instance[i]
    return weights
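A minimal training loop over this data set might then look as follows (a sketch, not from the original slides; the epoch count mirrors the results shown next):

import random

def train_epochs(X, y, weights, n_epochs=2000, l_rate=0.01):
    for epoch in range(n_epochs):
        # One epoch = one full pass over the (reshuffled) training set
        indices = list(range(len(X)))
        random.shuffle(indices)
        for i in indices:
            weights = train(X[i], weights, y[i], l_rate)
    return weights

weights = train_epochs(X, y, weights)
for instance, y_true in zip(X, y):
    print(instance, y_true, '->', round(predict(instance, weights), 2))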

17 So, how to train it?

Without training, after 20 epochs, and after 2000 epochs:

Instance  y   ŷ (no training)  ŷ (20 epochs)  ŷ (2000 epochs)
[0, 1]    0   0.5              0.36           0.00
[1, 0]    0   0.5              0.57           0.10
[2, 2]    0   0.5              0.49           0.04
[3, 4]    0   0.5              0.42           0.01
[4, 2]    1   0.5              0.71           0.94
[5, 2]    1   0.5              0.79           0.99
[4, 1]    1   0.5              0.78           0.99
[5, 0]    1   0.5              0.89           0.99

18 However…

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

# Weights: [0.004697241052453581, -0.009743527387551375, -0.00476408160440969]

[0, 0] 0 -> 0.5011743081039458
[0, 1] 1 -> 0.4999832898620173
[1, 0] 1 -> 0.49873843109337934
[1, 1] 0 -> 0.4975474276853999

This is the XOR problem

19 So far, not very spectacular…

One neuron on its own is hardly a brain

Multilayer Perceptron (MLP): stack different neurons in layers

Input layer
Hidden layer(s)
Output layer
Connect all outputs with all inputs of the next layer (“fully connected” or “dense” architecture)

The question is now: how to train?

20 Backpropagation

We can’t use the same approach as we did for a single perceptron as we don’t know what the “true outcome” should be for the lower layers

This issue took quite some time to solve. Eventually, a method called “backpropagation” was devised to overcome this. There is some controversy regarding who should be credited with the discovery of backpropagation (e.g. see Griewank, A. (2012). Who invented the reverse mode of differentiation? https://www.math.uni-bielefeld.de/documenta/vol-ismp/52_griewank-andreas-b.pdf). It is basically the chain rule on first partial derivatives, applied in a step-by-step manner.

21 Backpropagation

Note that feedforward is still easy. We can also still compare the predicted output with the expected one, from which we can derive a loss value.

http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html


22 Backpropagation

The idea of backpropagation is to “back-propagate” the error through the network. Using this, we know how to shift the weights.

Using the chain rule of partial derivatives

http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html

Further information:

https://victorzhou.com/blog/intro-to-neural-networks/
http://www.emergentmind.com/neural-network
http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi (recommended!)
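To make the chain-rule mechanics concrete, here is a minimal numpy sketch (not from the slides) that backpropagates through a small 2-layer MLP with sigmoid activations on the XOR data from before:

import numpy as np

np.random.seed(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Small random initial weights; W1: 2 inputs -> 4 hidden, W2: 4 hidden -> 1 output
W1, b1 = np.random.randn(2, 4), np.zeros((1, 4))
W2, b2 = np.random.randn(4, 1), np.zeros((1, 1))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

lr = 0.5
for epoch in range(10000):
    # Feedforward
    h = sigmoid(X @ W1 + b1)      # hidden activations
    y_hat = sigmoid(h @ W2 + b2)  # output activations

    # Backpropagation: chain rule applied layer by layer (loss L = 0.5 * (y_hat - y)^2)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)  # dL/d(output pre-activation)
    d_hid = (d_out @ W2.T) * h * (1 - h)       # dL/d(hidden pre-activation)

    # Gradient descent updates
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0, keepdims=True)

print(y_hat.round(2))  # should approach [[0], [1], [1], [0]]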

23 Automatic differentiation

The discussion regarding backpropagation reveals that the error function and activation functions should be differentiable

But some are not fully differentiable: e.g. loss functions involving absolute errors, or the ReLU activation function. See e.g. Margossian, C. C., A Review of Automatic Differentiation and its Efficient Implementation, or Baydin et al., Automatic Differentiation in Machine Learning: a Survey.

TensorFlow and many other deep learning frameworks utilize automatic differentiation

Not the same as symbolic or numeric (algebraic) differentiation: based on inspection of the computational graph which links operations over higher-rank matrices together. Many other implementations for automatic differentiation exist:

JAX: Autograd and XLA: https://github.com/google/jax
https://github.com/HIPS/autograd
https://www.juliadiff.org/
https://github.com/denizyuret/AutoGrad.jl
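As a small illustrative sketch (assuming JAX is installed; the loss function here is made up for the example), automatic differentiation gives you gradients of plain Python functions without deriving them by hand:

import jax
import jax.numpy as jnp

def loss(w, x, y):
    y_hat = jax.nn.sigmoid(jnp.dot(x, w))  # perceptron-style prediction
    return (y - y_hat) ** 2

grad_loss = jax.grad(loss)  # gradient of the loss w.r.t. the first argument, w
print(grad_loss(jnp.array([0.1, -0.2]), jnp.array([1.0, 2.0]), 1.0))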

24 Activation functions

Logistic (sigmoid): f(x) = 1/(1 + e^(−x))

Output between 0 and 1

Hyperbolic tangent (tanh): f(x) = (e^x − e^(−x))/(e^x + e^(−x))

Output between -1 and 1

Rectified Linear Unit (ReLU): f(x) = max(0, x)

Output between 0 and +∞ Many modifications exist, very common method

Others:

Linear: f(x) = x
Exponential: f(x) = e^x
Radial Basis Function (RBF)

25 Activation functions

In older approaches, commonly used activation functions were symmetric and had asymptotic ranges

E.g. hyperbolic tangent and sigmoid functions

Why has ReLU become so popular? It looks like a very simple activation function…

The key is that it avoids the vanishing gradient problem in deeper networks. Other activation functions lead to vanishing or – the opposite – exploding gradients. It was found that such traditional activation functions limit training for deeper neural networks with more layers. ReLU and friends avoid this problem (constant gradient).
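A quick numeric illustration (a sketch, not from the slides): the sigmoid's derivative is at most 0.25, so multiplying such factors across many layers shrinks the backpropagated signal, while ReLU's derivative is exactly 1 for positive inputs:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 0.5
sig_grad = sigmoid(x) * (1 - sigmoid(x))  # ~0.235, always <= 0.25
relu_grad = 1.0                           # constant for x > 0

for n_layers in [1, 5, 10, 20]:
    # Rough upper bound on how the gradient signal scales with depth
    print(n_layers, sig_grad ** n_layers, relu_grad ** n_layers)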

26 The importance of initialization

Another way to solve this: better initialization

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.

Good starting values for the weights are essential for good training

Prevent layer activation outputs from exploding or vanishing. If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is able to do so at all.
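A minimal numpy sketch of these two schemes (following the cited papers; exact variants differ per implementation):

import numpy as np

def glorot_uniform(n_in, n_out):
    # Glorot & Bengio (2010): keep activation variance stable across layers
    limit = np.sqrt(6 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out):
    # He et al. (2015): variance scaled by fan-in, suited for ReLU activations
    return np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)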

An older approach is “preliminary training”

Use random starting weights (uniform or Gaussian) and train a few epochs. Use the best of the final values as the new starting values and continue with those.

27 Loss functions

Note: using more than one output neuron is possible as well (e.g. for a multi-class problem)

Note that the neurons in the hidden layers commonly use different activation functions than the output neurons. E.g. ReLU is common for the hidden layers; the activation function of the output layer depends on the task (regression, binary classification, multiclass). For multiclass, a “softmax” layer is added on top of the output neurons to normalize their outputs so they sum to 1.

For the perceptron model, a naïve error function was used

Many other error (or “loss”) functions exist as well…

For regression:

Mean squared error (MSE): E = (1/2) ∑_{i=1..N} (ŷ_i − y_i)²

For classification:

Cross entropy: E = − ∑_{i=1..N} ∑_{c=1..C} y_ic × ln(ŷ_ic)
Cross entropy for binary classification: E = − ∑_{i=1..N} (y_i × ln(ŷ_i) + (1 − y_i) × ln(1 − ŷ_i))
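As a sketch, the same losses in numpy (y_hat holding the predictions and y the, possibly one-hot encoded, targets):

import numpy as np

def mse(y, y_hat):
    return 0.5 * np.sum((y_hat - y) ** 2)

def cross_entropy(y, y_hat):
    # y and y_hat have shape (N, C); y is one-hot encoded
    return -np.sum(y * np.log(y_hat))

def binary_cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))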

Note that a great deal of research regarding new architectures consists of finding appropriate loss functions

28 (Stochastic) gradient descent

Normal gradient descent (“batch” gradient descent) presents all training instances to the network: one update of the weights follows, based on gradients averaged over the whole training set. Precise, but very time-consuming.

Stochastic gradient descent updates the weights after every instance. Quicker, but more sensitive to particular examples (looks like a “drunk walk” towards the minimum); you definitely need to shuffle the instances every epoch.

Most implementations hence use a “mini-batch” approach: shuffle the training set, present it in small batches, and update the weights after each mini-batch.
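Schematically, the mini-batch variant looks as follows (a sketch; update_fn is a hypothetical stand-in for whatever gradient-based weight update is used):

import numpy as np

def minibatch_sgd(X, y, weights, update_fn, batch_size=32, n_epochs=10):
    n = len(X)
    for epoch in range(n_epochs):
        idx = np.random.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # One weight update per mini-batch, based on averaged gradients
            weights = update_fn(weights, X[batch], y[batch])
    return weights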

29 Backpropagation alternatives

Most implementations will use backpropagation, though other approaches to train an artificial neural network exist as well

Advanced nonlinear optimization algorithms:

Hessian-based Newton methods
Conjugate gradient
Levenberg-Marquardt

Genetic algorithm based
Even no training at all, e.g. https://weightagnostic.github.io/ (well, not training the weights…)

“The invention of backpropagation immediately elicited an outcry from some neuroscientists, who said it could never work in real brains.” – https://www.quantamagazine.org/artificial-neural-nets-finally-yield-clues-to-how-brains-learn-20210218/

30 Optimizers

Even when using backpropagation, different optimization strategies exist other than stochastic gradient descent

Momentum
RMSprop
Adagrad
Adadelta
Eve
Adabound
… (see http://cs231n.github.io/neural-networks-3/)

A lot of research is being put in this field

See https://vis.ensmallen.org/ for examples

31 Learning rate tuning

Recall: the learning rate determines the “speed” of the convergence

Higher: quicker towards the minimum, but risk of overshooting
Lower: slower towards the minimum, risk of getting trapped in local minima
Adaptive learning rate: start high and decrease over time
Momentum based: also prevents overshooting

Finding a good initial learning rate and tuning it during training is critical!

Plot the loss over a small batch for different learning rates, and use the best one to continue training (see e.g. https://github.com/psklight/keras_one_cycle_clr)

32 Learning rate tuning

Smith, L. N. (2017). Cyclical learning rates for training neural networks. https://arxiv.org/abs/1506.01186 (cyclical learning rate)

Cyclical learning rate: bounce the learning rate back and forth during training. An intuitive understanding of why CLR works comes from considering the loss function topology: the difficulty in minimizing the loss arises from saddle points rather than poor local minima (!). Saddle points have small gradients that slow the learning process; however, increasing the learning rate allows more rapid traversal of saddle point plateaus.

33 Learning rate tuning

Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: learning rate, batch size, momentum, and weight decay. https://arxiv.org/abs/1803.09820 (one cycle policy)

34 Preventing overfitting

Continuous training will continue to lower the error on the training set, but will eventually lead to overfitting (memorizing the training data)

As such, validation is crucial (commonly with a validation split)

Early stopping: stop training when validation error has reached its minimum level

Regularization (penalizing large weights) is another approach, as larger weights generally are a sign that overfitting is occurring
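In Keras, both ideas can be expressed directly (a sketch; the data set names are assumed from the examples further on):

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

# Early stopping: halt training when the validation error stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Regularization: penalize large weights in a layer
layer = Dense(64, activation='relu', kernel_regularizer=l2(0.01))

# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])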

35 Preventing overfitting

Dropout is another method: at each training stage, individual nodes are “dropped out” of the net with a given probability, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

Improves training and reduces node interactions. Forces the network to learn alternative pathways, e.g. enforces redundancy, leading to better generalization.

Batch normalization has also become popular: one often normalizes the input layer by adjusting and scaling the activations; if the input layer benefits from this, why not do the same for the values in the hidden layers, which are changing all the time?

Batch normalization reduces the amount by which the hidden unit values shift around (covariate shift). It allows using higher learning rates, because batch normalization makes sure that no activation goes really extreme. It also reduces overfitting because it has a slight regularization effect: similar to dropout, it adds some noise to each hidden layer’s activations.
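In Keras, both are just extra layers (a sketch, mirroring the MNIST MLP shown later):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

model = Sequential([
    Dense(64, activation='relu', input_shape=(784,)),
    BatchNormalization(),  # normalize the hidden activations per mini-batch
    Dropout(0.5),          # randomly drop half of the units during training
    Dense(10, activation='softmax')
])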

36 Example

Let’s summarize a bit…

Using https://playground.tensorflow.org/

37 MLPs are already powerful

import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import mnist
from matplotlib import pyplot as plt
import numpy as np
from PIL import Image

(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)  # (60000, 28, 28)

plt.imshow(X_train[0], cmap='gray'); plt.show()
print(y_train[0])  # 5

X_train = X_train.astype('float32') / 255  # 60000 train samples
X_test = X_test.astype('float32') / 255    # 10000 test samples

num_classes = 10

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

What’s the / 255 for?

38 MLPs are already powerful

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.summary()

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

39 MLPs are already powerful

batch_size = 128
epochs = 20

model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=2,
    validation_data=(X_test, y_test)
)

score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Epoch 1/20
# - 3s - loss: 0.9510 - acc: 0.7003 - val_loss: 0.4914 - val_acc: 0.8681
# Epoch 2/20
# - 2s - loss: 0.4345 - acc: 0.8772 - val_loss: 0.3707 - val_acc: 0.8979
# ...
# Epoch 20/20
# - 2s - loss: 0.2511 - acc: 0.9295 - val_loss: 0.2701 - val_acc: 0.9262
# Test loss: 0.27007879534959794
# Test accuracy: 0.9262

40 MLPs are already powerful, but how do they learn?

Layer (type)         Output Shape   Param #
flatten_1 (Flatten)  (None, 784)    0
dense_1 (Dense)      (None, 8)      6280 = 784 × 8 + 8
dense_2 (Dense)      (None, 8)      72 = 8 × 8 + 8
dense_3 (Dense)      (None, 10)     90 = 8 × 10 + 10

Total params: 6,442

41 MLPs are already powerful, but how do they learn?

42 MLPs are already powerful, but how do they learn?

[[0 0 0.016035 *0.983951* 0 0.000015 0 0 0 0]]

# This is a three?

43 Deep Learning

44 Deep what?

The deep in deep learning isn’t a reference to any kind of deeper understanding achieved by the approach, but stands for the idea of successive layers of representations

Other appropriate names for the field could have been:

Layered representations learning
Hierarchical representations learning
Differentiable function learning

Modern deep learning often involves tens or even hundreds of successive layers of representations

Enabled by the rise in computational power

Main contributions follow from architecture and loss functions.

“The goal is to create algorithms that can take in very unstructured data, like images, audio waves or text blocks (things traditionally very hard for computers to process) and predict the properties of those inputs.” – Andrew Ng

45 We’ll take a look at the following types

Convolutional neural networks: now
Generative adversarial networks: now (briefly)
Reinforcement learning: now (briefly)

Embeddings and representational learning: when we discuss text mining
Recurrent neural networks: when we discuss text mining (briefly)
Graph convolutional networks and graph neural networks: when we discuss graph mining (briefly)

46 Convolutional neural networks (CNNs)

Our “deep” MLP does pretty well on a simple data set

Black-and-white images
Small
Only 10 classes

How about a data set with pictures of 1000 classes? (Cats, dogs, cars, boats, …)

Increase number of layers? Hidden units? Lots of weights to train!

47 Convolutional neural networks (CNNs)

In 2010, a large database known as “ImageNet” containing millions of labeled images was created and published by a research group at Stanford. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton entered a submission that would halve the error rate. This model combined several critical components. Probably the most important piece was the use of graphics processing units (GPUs) to train the model. They also introduced a method to reduce overfitting known as dropout, and used the rectified linear activation unit (ReLU). The network went on to become known as “AlexNet”.

And even before this:

First convolutional neural networks (CNNs) to recognize handwritten digits by Yann Lecun at AT&T Bell Labs (“LeNet”)

48 Convolutional neural networks (CNNs)

A series of convolutional and pooling layers, followed by fully connected layers and a softmax output layer

width × height × 3 input layer for colored images
Convolutional layers do most of the heavy lifting: they learn a number of “filters” (“kernels”) while retaining spatial topology, i.e. don’t fully connect everything
Pooling layers apply simple downsampling (i.e. downsizing the image)

49 Convolutional neural networks (CNNs)

Convolutional layer learns filters

Look at every w_f × h_f × d window and apply the filter: a spatially local weighted sum
Do this for every such window by moving it n pixels at a time (the “stride”)
Paddings can be defined for the edges of an image (“padding”)
Every window leads to one (2d) output
The same kernel is used for all positions in the image: parameter sharing!
A convolutional layer learns multiple filters (the “depth”)

50 Convolutional neural networks (CNNs)

In case there is more than one input channel: the kernel has depth n. E.g. in case of a color image, or for filters in later layers. The output is still 2-dimensional (1d CNN units exist as well).

51 Convolutional neural networks (CNNs)

Pooling layer applies downsampling (a resize)

Also works with a size and stride
Same depth as the layer before
“Max pooling” vs. “average pooling”
Global pooling: downsizes the feature map to a single value, aggressively summarizing the presence of a feature in an image. Oftentimes used as an alternative to a fully connected layer to transform from feature maps to an output prediction (see the shape arithmetic below).
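The resulting feature map sizes follow from a simple formula: for input width W, filter size F, padding P and stride S, the output width is (W − F + 2P)/S + 1 (and likewise for the height). A quick sketch:

def conv_output_size(w, f, p, s):
    # (W - F + 2P) / S + 1, per spatial dimension
    return (w - f + 2 * p) // s + 1

print(conv_output_size(28, 3, 0, 1))  # 26: 3x3 filter, no padding, stride 1
print(conv_output_size(28, 3, 1, 1))  # 28: "same" padding keeps the size
print(conv_output_size(26, 2, 0, 2))  # 13: 2x2 max pooling with stride 2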

52 Convolutional neural networks (CNNs)

http://scs.ryerson.ca/~aharley/vis/

model = Sequential()
model.add(Conv2D(16, (3, 3), padding='same', input_shape=(28, 28, 1), activation='relu'))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

53 Convolutional neural networks (CNNs)

Many variations have been developed

LeNet: the first successful applications of CNNs, developed by Yann LeCun in the 1990s
AlexNet: the first work that popularized CNNs in computer vision
ZF Net: a convolutional network from Matthew Zeiler and Rob Fergus; an improvement on AlexNet by tweaking the architecture hyperparameters
GoogLeNet: main contribution was the development of an “Inception Module” that dramatically reduced the number of parameters in the network
VGGNet: showed that the depth of the network is a critical component for good performance (140M parameters) – still used quite often for transfer learning
ResNet and ResNeXt: feature special skip connections and a heavy use of batch normalization; the architecture is also missing fully connected layers at the end of the network – still very popular
SqueezeNet: achieves AlexNet performance levels with 50x fewer parameters, leading to a very small model that is easy to deploy on e.g. smart devices

54 Convolutional neural networks (CNNs)

Best practices

The input layer (the image) should be divisible by 2 many times
Common numbers include 32 (e.g. the CIFAR-10 data set), 64, 96, 224, 384, and 512
Convolutional layers should use small filters (3x3 or at most 5x5), a stride of 1, and padding
A stride of 1 allows leaving all spatial down-sampling to the pooling layers, with convolutional layers only transforming the input volume depth-wise
If convolutional layers were not padded, the size of the volumes would reduce by a small amount after each convolution, and information at the borders would be “washed away” too quickly
If we would use strides greater than 1 or not zero-pad the input, we would also have to carefully keep track of the input volumes throughout the architecture
Some do argue for a larger convolutional layer at the beginning of the architecture (5x5 kernel size) with sufficient depth
Prefer a stack of small filters over one layer with a large receptive field: more small filters is better than one large one, and the effective receptive field over the image will be similar
Luo, W., Li, Y., Urtasun, R., & Zemel, R. (2016). Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems (pp. 4898-4906). https://distill.pub/2019/computing-receptive-fields/
The conventional paradigm of a linear list of layers has been challenged, e.g. ResNet includes skip connections for the residuals, comparable to boosting

55 Convolutional neural networks (CNNs)

Best practices

The pooling layers oversee downsampling the spatial dimensions of the input
A common setting is to use max pooling with 2x2 receptive fields and a stride of 2
Many dislike the pooling operation and think that we can get away without it: if necessary, reduce the size of the representation by occasionally using a larger convolutional stride
This also allows constructing FCNs (fully convolutional networks), which are more flexible as they do not impose the limitation of having to use a fixed input image size
The amount of memory can grow quickly with a CNN; since GPUs are often bottlenecked by memory, it may be necessary to compromise: use a black-and-white picture instead of a colored one, downsize the input layer, or use smaller mini-batches
Many CNN architectures perform convolutions in the lower layers and add a number of fully connected (dense) layers at the top, followed by a softmax layer
This bridges the convolutional structure with traditional neural network classifiers: it treats the convolutional layers as feature extractors, and the resulting features are classified in a traditional way
However, the fully connected layers are prone to overfitting; typically, dropout layers are added to resolve this issue
Another strategy is “global average pooling”: this replaces the fully connected layers and generates one feature map for each class of the classification task in the last layer

56 Data augmentation

Increase size of training set through transformations

Some popular augmentations are grayscaling, horizontal flips, vertical flips, random crops, color jitters, translations, rotations, cutouts, and much more. By applying just a couple of these transformations to your training data, you can easily double or triple the number of training examples (see the sketch below).

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator
https://github.com/aleju/imgaug

Forces the network to focus on important aspects
Prevents overfitting
Sometimes also applied at prediction time to stabilize predictions (test-time augmentation)
Tends to work very well
Open question: text, audio, …?
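With the ImageDataGenerator linked above, for example (a sketch; the parameter values are illustrative):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,       # random rotations
    width_shift_range=0.1,   # random horizontal translations
    height_shift_range=0.1,  # random vertical translations
    horizontal_flip=True,    # random horizontal flips
    zoom_range=0.1           # random zooms
)

# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=20)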

58 Transfer learning

A common misconception is that without huge amounts of data, you can’t create effective deep learning models

While data is a critical part, the idea of transfer learning has helped to lessen the data demands

Transfer learning is the process of taking a pre-trained model and “fine-tuning” it on your own dataset

The idea is that the pre-trained model will act as a feature extractor: you remove the last layer(s) of the network and replace them with your own classifier, and only retrain those weights while keeping the rest “frozen”. Or simply keep the network as is but only retrain the last layers. The lower layers of the network will detect features like edges and curves; rather than training the whole network from a random initialization of the weights, we can use the weights of the pre-trained model and focus on the layers that are higher up for training.

Also heavily applied in the textual domain!
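A minimal Keras sketch of this recipe (assuming a VGG16 base pre-trained on ImageNet and a hypothetical new task with 5 classes):

from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained feature extractor

model = Sequential([
    base,
    Flatten(),
    Dense(256, activation='relu'),
    Dense(5, activation='softmax')  # new classifier head for our 5 classes
])
# Only the new head's weights are trained; the frozen base acts as a feature extractor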

59 Transfer learning

https://teachablemachine.withgoogle.com/v1/

Train an image detector in your browser
How can this train so fast?

Uses a pretrained SqueezeNet on ImageNet (1000 classes)

Hence, we get an output vector of size 1000 for any image. When the user “trains” this system, it simply stores the output vectors for each image together with the given class label.

For a new image (to “predict”), look at k nearest neighbors based on distance to stored output vectors

We then see to which class (A, B, C) these neighbors belong, and simply derive the probabilities based on this set

Works surprisingly well, even for images that are not in ImageNet

The network might think of a “lemon” for pictures of a banana
When we give another banana picture to predict, the network will think “lemon” again
We simply translate this to the user as: “it’s a banana”
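Schematically (a sketch using scikit-learn; embed(), user_images, user_labels and new_image are hypothetical stand-ins, with embed() returning the pretrained network's 1000-dimensional output vector):

from sklearn.neighbors import KNeighborsClassifier

# "Training" = storing the output vectors together with the given labels
train_vectors = [embed(img) for img in user_images]
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_vectors, user_labels)

# "Predicting" = a nearest-neighbor lookup in the embedding space
print(knn.predict_proba([embed(new_image)]))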

60 Transfer learning

61 Other image tasks

Basic CNNs are easy to set up for image classification

Taking an input image and outputting a class number out of a set of categories

For object localization, the goal is not only to produce a class label but also a bounding box that describes where the object is in the picture

RCNN, Fast RCNN, Faster RCNN, MultiBox, Bayesian Optimization, Multi-region, RCNN Minus R, Image Windows

For object segmentation, the task is to output a class label as well as an outline of every object in the input image

Semantic Seg, Unconstrained Video, Shape Guided, Object Regions, Shape Sharing

62 Other image tasks

(BodyPix)

63 Other image tasks

A basic CNN setup can also be used to localize objects of interest by preprocessing the data appropriately. At prediction time, the model is queried for each slice over the image.

64 Other tasks

One-dimensional CNNs have been used for text and time series analysis as well
Capsule networks (Sabour, Frosst & Hinton) try to remove long-standing issues of the traditional CNN architecture
Standard CNNs focus heavily on small texture- and edge-based filters, but have difficulty with pose and overall composition
A smart idea, but it didn’t take off
Maybe there are smarter alternatives: Making Convolutional Networks Shift-Invariant Again: https://arxiv.org/pdf/1904.11486v2.pdf

65 Other tasks

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. (Compare “Gabor filters”, which focus on texture analysis.)

66 Other tasks

Image classification, segmentation, detection
Face recognition and classification (from proper to “bad” science)
Pose and gait detection
Business applications, e.g. in insurance (take a picture of your car to file a damage claim), fraud detection (forged signatures), etc.
Stylistic and artistic use cases, e.g. photo editing and processing

67 Style transfer

https://mspoweruser.com/popular-ios-app-prisma-coming-windows-10-month/

68 Deep dreaming

https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

“One way to visualize what goes on is to turn the network upside down and ask it to enhance an input image in such a way as to elicit a particular interpretation. Say you want to know what sort of image would result in “Banana”. Start with an image full of random noise, then gradually tweak the image towards what the neural net considers a banana. By itself, that doesn’t work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.”

69 Deep dreaming

70 One-shot learning

Deep neural networks are really good at learning from high dimensional data like images or spoken language, but only when they have huge amounts of labelled examples to train on

Humans, on the other hand, are capable of one-shot learning: take a human who’s never seen a tomato before, and show them a single picture of a tomato; they will probably be able to distinguish tomatoes from other fruits with astoundingly high precision.

Trivial to us, but not so much for a computer

1-nearest neighbor (take the nearest known sample based on Euclidean distance): very low accuracy, but still better than random
Hierarchical Bayesian Learning (Lake et al.): better results, but inputs modified or annotated
Naïve deep neural network approach: would horribly overfit
Transfer learning: works better, makes more sense
Siamese networks (Koch et al.): provide two images and train the network to predict whether they have the same category
At prediction time, the network can be used to compare a new image to each image in the support set and pick the best matching category based on this
We want an architecture that takes two inputs and outputs the probability of sharing the same class
Symmetry: p(x1, x2) = p(x2, x1) – which means we cannot just “stitch” both images together into one large image
Siamese network: shared parameters for identical convnets, then joined by a distance function

Also possible: zero-shot learning (no examples for some classes), student-teacher networks (alternative transfer learning approach)

71 Generative adversarial networks (GANs)

Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other

The generator network tries to fool the discriminator
The discriminator tries to spot fooling attempts

GANs were introduced in a paper by Ian Goodfellow et al. https://arxiv.org/abs/1406.2661

Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML” (by now he has changed his mind and self-supervised learning is the coolest thing):

“If intelligence is a cake, the bulk of the cake is self-supervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL)”

72 Generative adversarial networks (GANs)

73 Generative adversarial networks (GANs)

Traditional discriminative algorithms try to classify input data; that is, given the features of an instance, they predict the class to which that instance belongs, or P(Y = 1|X)
Discriminative algorithms map features to labels; they are concerned solely with that correlation
One way to think about generative algorithms is that they do the opposite: instead of predicting a label given certain features, they attempt to predict features given a certain label, or P(X|Y)
The question a generative algorithm tries to answer is: assuming this label, how likely are these features?
While discriminative models care about the relation between y and x, generative models care about “how you get x”
In other words: discriminative models learn the boundary between classes, generative models model the distribution of individual classes

74 Generative adversarial networks (GANs)

See example: https://poloclub.github.io/ganlab/

https://thisxdoesnotexist.com/
https://thispersondoesnotexist.com/
https://thisrentaldoesnotexist.com/
https://www.thiswaifudoesnotexist.net/
https://thisartworkdoesnotexist.com/
https://thiscatdoesnotexist.com/ (okay, maybe this one is less good)

We’re getting better at this: DCGAN, StyleGAN, …

“Deep fakes”:

https://www.theguardian.com/technology/2018/nov/12/deep-fakes-fake-news-truth
https://www.forbes.com/sites/johnbbrandon/2019/10/08/there-are-now-15000-deepfake-videos-on-social-media-yes-you-should-worry/

75 Reinforcement learning

Reinforcement learning allows to create AI agents that learn from the environment by interacting with it

Learns by trial and error
The environment exposes a state to the agent, with a number of possible actions the agent can perform
After each action, the agent receives feedback
The feedback consists of the reward and the next state of the environment

See example: http://projects.rajivshah.com/rldemo/

76 Q-learning

Given one run of the agent through an environment (one episode), we can easily calculate the total reward for that episode: R = r_1 + r_2 + … + r_n

The total future reward from time point t onward can be expressed as: R_t = r_t + r_{t+1} + r_{t+2} + … + r_n

Because the environment is stochastic, it is common to use discounted future reward instead:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^(n−t) r_n

R_t = r_t + γ(r_{t+1} + γ(r_{t+2} + …)) = r_t + γ R_{t+1}

If we set the discount factor γ = 0, our strategy will be short-sighted and we rely only on the immediate rewards
Balance between immediate and future rewards with e.g. γ = 0.9
If our environment is fully deterministic and the same actions always result in the same rewards: γ = 1

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward

77 Q-learning

Define a quality function Q(s, a) representing the maximum discounted future reward when performing action a in state s and continuing optimally from that point onwards: Q(s_t, a_t) = max R_{t+1}

But: how can we estimate this future reward?

We know just the current state and action, and not the actions and rewards coming after that
We can’t: Q is just a theoretical construct

If we could find an estimate for Q, we could determine a policy as follows: just pick the action with the highest Q value in a certain state:

π(s) = argmaxa Q(s, a)

Here π represents the policy: the rule how we choose an action in each state

78 Q-learning

Say we have one action: (s, a, r, s′)

Just like with discounted future rewards, we can express the Q-value of state s and action a in terms of the Q-value of the next state s’: Q(s, a) = r + γ × Q(s′, π(s′))

This is called the Bellman equation

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns

79 Q-learning

initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    select action a according to policy
    execute action a and obtain reward r and new state s'
    select action a' in s' according to policy (which might include an epsilon for "exploration")
    Q[s,a] = Q[s,a] + α(r + γ * Q[s',a'] - Q[s,a])
    move to new state s'
until termination

α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account
When α = 1, the update is exactly the same as the Bellman equation
The updating of the Q matrix is only an approximation, and in early stages of learning it may be completely wrong
However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value

80 Q-learning

What we need:

A transition matrix T, describing which state transitions are possible
Can be stochastic (harder)
Might not be fully initialized but determined whilst exploring
A reward matrix R, describing the rewards we get from state transitions
Might not be fully initialized but determined whilst exploring
A Q matrix, the “brain” of our agent
A policy
Alpha (learning rate), gamma (discount factor) and epsilon (exploration versus exploitation)

An example will be posted using Python in the background information
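In the meantime, here is a self-contained sketch on a toy deterministic chain (five states in a row, a reward of 1 only when reaching the rightmost state; all values are illustrative, and the max-based Q-learning update is used):

import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right; state 4 is the goal
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy policy (random tie-breaking while Q is still flat)
        if np.random.rand() < epsilon or Q[s].max() == Q[s].min():
            a = np.random.randint(n_actions)
        else:
            a = int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Bellman update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # Q-values for "right" should increase towards the goal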

81 Q-learning: example

http://mnemstudio.org/path-finding-q-learning-tutorial.htm

82 Q-learning: example

Reward matrix R

Indicates the possible actions from a certain state
And the reward per action
We could split this up into a transition and a reward matrix respectively

Q matrix

The “brain” of our agent
Initially all values are 0

83 Q-learning

“If we apply the same preprocessing to game screens as in the DeepMind paper – take the four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels – we would have 256^(84×84×4) ≈ 10^67970 possible game states”

This means 10^67970 rows in our Q-table, more than the number of atoms in the known universe
One could argue that many states never occur, so we could possibly represent the table as a sparse table containing only visited states
Even so, most of the states are very rarely visited and it would take a lifetime of the universe for the Q-table to converge
Ideally, we would also like to have a good guess for the Q-values of states we have never seen before

84 Deep Q-learning

This is the point where deep learning steps in

We could represent our Q-function with a neural network that takes the state and action as input and outputs the corresponding Q-value: “According to the network, which action leads to the highest payoff in a given state?”

Note that Q̂ in the “target” is a prediction by the network trained so far

85 Deep Q-learning

Estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network

It turns out that approximation of Q-values using non-linear functions is not very stable

Not easy to converge and takes a long time, almost a week on a single GPU

Hence, experience replay is applied. During gameplay all the experiences (s, a, r, s′) are stored in a replay memory

When training the network, random mini-batches from the replay memory are used instead of the most recent transition
This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum
Helps avoid the neural network overly adjusting its weights to the most recent state, which may affect the action output for other states
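A sketch of the replay memory idea:

import random
from collections import deque

replay_memory = deque(maxlen=100_000)  # drop the oldest experiences when full

def remember(s, a, r, s_next):
    replay_memory.append((s, a, r, s_next))

def sample_minibatch(batch_size=32):
    # Random sampling breaks the correlation between subsequent transitions
    return random.sample(replay_memory, batch_size)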

86 Deep Q-learning

Q-learning attempts to solve the credit assignment problem: it propagates rewards back in time until it reaches the crucial decision point which was the actual cause for the obtained reward

When a Q-table or Q-network is initialized randomly, its predictions are initially random as well. If we pick the action with the highest Q-value, the action will be random and the agent performs crude “exploration”. As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.

But this exploration is “greedy”: it settles for the first effective strategy it finds. We need a tradeoff between exploration and exploitation.

A simple and effective fix for the above problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value. In their system, DeepMind actually decreases ε over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.

87 Reinforcement learning

Deep Q-Learning (DQN): https://arxiv.org/abs/1312.5602
Used to play simple Atari games together with a CNN
Double DQN: https://arxiv.org/abs/1509.06461
The Q-learning algorithm is known to overestimate action values under certain conditions
Deep Deterministic Policy Gradient (DDPG): https://arxiv.org/abs/1509.02971
Asynchronous Advantage Actor-Critic (A3C): https://arxiv.org/abs/1602.01783
Continuous DQN (CDQN or NAF): https://arxiv.org/abs/1603.00748
Cross-Entropy Method (CEM): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf
Dueling network DQN (Dueling DQN): https://arxiv.org/abs/1511.06581
Deep SARSA: http://ieeexplore.ieee.org/document/7849837/
Meta-learning with reinforcement learning: https://pdfs.semanticscholar.org/0b46/4e82c52c6e0e63a05d1ba425a82e0ce9d66f.pdf
Using SGD: http://taichi.graphics/
Evolving agents with a genetic algorithm

88 Reinforcement learning

https://www.alexirpan.com/2018/02/14/rl-hard.html

89 Reinforcement learning

https://openai.com/blog/faulty-reward-functions/

90 Conclusions

Artificial neural networks are back

Powerful
But: require a lot of tuning and configuration, carry a risk of overfitting, and require huge amounts of samples for non-explored problems (although transfer learning helps)
So many architectures: what are best practices?
Black box!
The last 10%: deep learning gets you quickly to 90% of good results, but the last 10% is still very hard to reach

91 Software

92 Conclusions

“Don’t believe the short-term hype, but do believe in the long-term vision. It may take a while for AI to be deployed to its true potential—a potential the full extent of which no one has yet dared to dream—but AI is coming, and it will transform our world in a fantastic way.” – Francois Chollet

“Difference between machine learning and AI: if it is written in Python, it’s probably machine learning. If it is written in PowerPoint, it’s probably AI.” – Mat Velloso

93 Conclusions

Criterion | Traditional algorithms | Deep learning
Accuracy | Fair to good (on structured data) | Good to excellent
Training time | Short (seconds) to medium (hours) | Medium to (very) long (weeks)
Data requirements | Limited (a couple of hundred rows of “small” data) | High (many thousands of e.g. images, though “transfer learning” possible in some cases)
Feature engineering | Manual: trend features, windowing, aggregations, domain-specific approaches | Automatic, done “by the model”
Hyperparameters | Few to some (depending on the algorithm) | Many (architecture, number of hidden layers, activation functions, optimizer, …)
Interpretability | High (white-box models) to reasonable | Low (black-box model, though some explanations can be extracted)
Cost and operational efficiency | Low to reasonable | Reasonable to high (GPU, cloud, parallel computational requirements)

94 Opening The Black Box (Part 2)

95 Adversarial attacks: a motivating example

NIPS 2017

96 Adversarial learning: a motivating example

Su et al, 2017

97 Adversarial learning: a motivating example

Eykholt et al., 2018

98 Adversarial learning: a motivating example

Brown et al., 2018

99 Adversarial learning: a motivating example

https://arxiv.org/pdf/1911.07658.pdf https://github.com/Kayzaks/HackingNeuralNetworks

100 Opening the black box

How do you inspect something which has not thousands but easily millions of parameters?

101 Hinton diagrams

The size of the square indicates the size of the weight
The color of the square indicates the sign of the weight

Suitable for simple MLP models

102 Rule extraction

Decompositional rule extraction algorithms

Are closely intertwined with the internal workings of the neural network
Analyze weights, biases, and activation values

Pedagogical rule extraction algorithms

Consider the neural network as a black box
Use the neural network as an oracle to label and generate additional training observations

103 Decompositional rule extraction

Extract rules that describe the network outputs in terms of the discretized hidden unit activation values
Generate rules that describe the discretized hidden unit activation values in terms of the network inputs
Merge the two sets of rules to obtain a set of rules that relates the inputs and outputs of the network

104 Pedagogical rule extraction

Use the neural network to relabel the training data
Build a simple model (e.g. a decision tree) on the relabeled data
Use the neural network as an oracle to generate additional training data when the data becomes too partitioned (for example, less than S observations for deciding upon splits, with S a user-defined parameter)

105 Two-stage models

106 Layer activations

A straightforward visualization technique is to show the activations of the network during the forward pass. For ReLU CNN networks, the activations usually start out looking relatively blobby and dense, but as the training progresses they usually become sparser and more localized.

107 Convolutional filters

Another strategy is to visualize the weights rather than the activations. These are usually most interpretable on the first convolutional layer, which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network.

See: http://cs231n.github.io/understanding-cnn/

108 Feature visualization

However, since the weight matrices are generally small and hard to understand, a common strategy is to find an input through optimization (e.g. using SGD) which maximally “excites” a particular filter: “feature visualization”.

Mahendran, A., & Vedaldi, A. (2016). Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3), 233-255.
Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. Distill, 2(11), e7.

See: https://openai.com/blog/multimodal-neurons/

109 Maximally activating inputs

Similarly, we can take our given set of images, feed them through the network, and keep track of which images maximally activate (or “excite”) a neuron or set of neurons. We can then visualize those top-n images to get an understanding of what the neuron is looking for in its receptive field.

See: https://distill.pub/2019/activation-atlas/

110 Input occlusion

Suppose that a CNN classifies an image as a dog. How can we be certain that it’s actually picking up on the dog in the image, as opposed to some contextual cues from the background or some other miscellaneous object? One way of investigating which part of the image some classification prediction is coming from is by plotting the probability of the class of interest (e.g. the “dog” class) as a function of the position of an occluder object. We iterate over regions of the image, set a patch of the image to be all zero, and look at the probability of the class. We can visualize the probability as a 2-dimensional heat map (a sketch follows below the list).

Can also be used for simple MLP models:

Train the neural network
Prune the input where the input-to-hidden layer weights are closest to zero and retrain the network
If the predictive power increases (or stays the same), repeat the process
If not, reconnect the input and stop
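A sketch of the occlusion procedure described above (assuming a Keras-style model and a single image as a numpy array):

import numpy as np

def occlusion_map(model, image, target_class, patch=8):
    h, w = image.shape[:2]
    heatmap = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0  # zero out one patch
            prob = model.predict(occluded[np.newaxis])[0][target_class]
            heatmap[i // patch, j // patch] = prob  # low probability = important region
    return heatmap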

111 Saliency maps

An approach with a similar goal is saliency maps. Again based on determining the changes to a given input image which have the largest effect on the output, which can also be optimized through SGD. See https://github.com/raghakot/keras-vis/ for Keras implementations.

112 Dimensionality reduction

CNNs can be interpreted as gradually transforming the images into a representation in which the classes are separable by a linear classifier. We can get a rough idea about the topology of this space by embedding images into two dimensions, so that their low-dimensional representation approximately preserves the pairwise distances of their high-dimensional representation. To produce an embedding, we can take a set of images and use the CNN to extract the vector of outputs right before the final softmax (classifier) layer. Or, alternatively, a vector constructed using the mean activations of each filter in a convolutional layer. We can then plug these into t-SNE or UMAP and get a 2-dimensional vector for each image (see the sketch below).
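Schematically (a sketch with scikit-learn; model is assumed to be a trained Keras classifier, truncated right before its softmax layer):

from sklearn.manifold import TSNE
from tensorflow.keras.models import Model

# Truncate the trained model just before the final softmax layer
feature_extractor = Model(inputs=model.input, outputs=model.layers[-2].output)

features = feature_extractor.predict(X_test)           # one vector per image
coords = TSNE(n_components=2).fit_transform(features)  # 2d coordinates for plotting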

113 Others

Recall LIME, Shapley values, feature importance plots, partial dependence and individual conditional expectation plots

Model-agnostic techniques which can be used here as well
But: harder on text, imagery…

Data scientists are automating themselves… … though interpretability will stay crucial!

https://cloud.google.com/explainable-ai https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability https://www.ibm.com/watson/explainable-ai https://aws.amazon.com/sagemaker/

“Explainable AI is a set of tools and frameworks to help you develop interpretable and inclusive machine learning models and deploy them with confidence. With it, you can understand feature attributions in AutoML Tables and AI Platform and visually investigate model behavior using the What-If Tool. It also further simplifies model governance through continuous evaluation of models managed using AI Platform.”