Lecture 11 Supervised Learning Artificial Neural Networks

Marco Chiarandini

Department of Mathematics & Computer Science, University of Southern Denmark

Slides by Stuart Russell and Peter Norvig

Course Overview

Introduction: Artificial Intelligence; Intelligent Agents
Search: Uninformed Search; Heuristic Search
Uncertain knowledge and Reasoning: Probability and Bayesian approach; Bayesian Networks; Hidden Markov Chains; Kalman Filters
Learning: Supervised (Decision Trees, Neural Networks, Learning Bayesian Networks); Unsupervised (EM Algorithm)
Games and Adversarial Search: Minimax search and Alpha-beta pruning; Multiagent search
Knowledge representation and Reasoning: Propositional logic; First order logic; Inference; Planning

Outline

1. Neural Networks
   Feedforward Networks
   Single-layer perceptrons
   Multi-layer perceptrons

2. Other Methods and Issues

A neuron in a living biological system

[Figure: a biological neuron, showing the cell body (soma), nucleus, dendrites, axon, axonal arborization, synapses, and an axon from another cell.]

Signals are noisy "spike trains" of electrical potential.


In the brain: > 20 types of neurons with $10^{14}$ synapses

(compare with the world population of $7 \times 10^9$). Additionally, the brain is parallel and reorganizing, while computers are serial and static. The brain is fault tolerant: neurons can be destroyed.

5 Neural Networks Other Methods and Issues

Observations of neuroscience

Neuroscientists: view brains as a web of clues to the biological mechanisms of cognition.

Engineers: the brain is an example solution to the problem of cognitive computation.

Applications

supervised learning: regression and classification
associative memory
optimization: grammatical induction (aka grammatical inference), e.g. in natural language processing
noise filtering
simulation of biological brains

Artificial Neural Networks

“The neural network” does not exist. There are different paradigms for neural networks, how they are trained and where they are used.

Artificial Neuron

Each input is multiplied by a weighting factor.

Output is 1 if sum of weighted inputs exceeds the threshold value; 0 otherwise.

Network is programmed by adjusting weights using feedback from examples.

McCulloch–Pitts "unit" (1943)

Output is a function of weighted inputs:

$$a_i = g(in_i) = g\Big(\sum_j W_{j,i}\, a_j\Big)$$

[Figure: a unit with input links carrying activations $a_j$ weighted by $W_{j,i}$, a bias weight $W_{0,i}$ on the fixed input $a_0 = -1$, input function $in_i = \sum_j W_{j,i} a_j$, activation function $g$, and output $a_i = g(in_i)$ sent along the output links.]

A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do

Activation functions

Non-linear activation functions:

[Figure: two activation functions $g(in_i)$: (a) a step (threshold) function; (b) a sigmoid function.]

(a) is a step function or threshold function (mostly used in theoretical studies); (b) is a continuous function, e.g., the sigmoid function $1/(1 + e^{-x})$ (mostly used in practical applications).

Changing the bias weight $W_{0,i}$ moves the threshold location.

Implementing logical functions

Unit weights implementing the basic gates (with bias input $a_0 = -1$):

AND:  $W_0 = 1.5$,  $W_1 = 1$,  $W_2 = 1$
OR:   $W_0 = 0.5$,  $W_1 = 1$,  $W_2 = 1$
NOT:  $W_0 = -0.5$, $W_1 = -1$

McCulloch and Pitts: every (basic) Boolean function can be implemented (eventually by connecting a large number of units in networks, possibly recurrent, of arbitrary depth)
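As a concrete check of the weights above, here is a minimal R sketch (not from the slides) of a threshold unit with the fixed bias input $a_0 = -1$, instantiated as the AND, OR and NOT units from the table; the function names are illustrative only.

threshold.unit <- function(W0, W, x) {
    in.i <- sum(W * x) - W0      # weighted input, including the bias term W_0 * (-1)
    as.numeric(in.i > 0)         # step activation: 1 if the weighted input exceeds 0
}

AND <- function(x1, x2) threshold.unit( 1.5, c(1, 1), c(x1, x2))
OR  <- function(x1, x2) threshold.unit( 0.5, c(1, 1), c(x1, x2))
NOT <- function(x1)     threshold.unit(-0.5, c(-1),   c(x1))

## e.g. AND(1, 1) is 1, AND(1, 0) is 0, OR(1, 0) is 1, NOT(0) is 1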

Network structures

Architecture: the number of nodes, the interconnection structure, and the activation functions g, but not the weights.

Feed-forward networks: no cycles in the connection graph

single-layer perceptrons (no hidden layers)

multi-layer perceptrons (one or more hidden layers)

Feed-forward networks implement functions and have no internal state.

Recurrent networks:
– Hopfield networks have symmetric weights ($W_{i,j} = W_{j,i}$), $g(x) = \mathrm{sign}(x)$, $a_i \in \{1, 0\}$; associative memory
– recurrent neural nets have directed cycles with delays $\Rightarrow$ they have internal state (like flip-flops), can oscillate, etc.

Use

Neural Networks are used in classification and regression:

Boolean classification:
- value over 0.5: one class
- value below 0.5: other class

k-way classification:
- divide the single output into k portions, or
- use k separate output units

continuous output:
- identity activation function in the output unit

Single-layer NN (perceptrons)

[Figure: left, a single-layer perceptron network with input units connected directly to output units by weights $W_{j,i}$; right, the output of a two-input sigmoid perceptron plotted over $(x_1, x_2)$.]

Output units all operate separately: there are no shared weights. Adjusting the weights moves the location, orientation, and steepness of the cliff.

Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960). The output is 1 when:

$$\sum_j W_j x_j > 0 \quad \text{or} \quad \mathbf{W} \cdot \mathbf{x} > 0$$

Hence, it represents a linear separator in input space:
- a hyperplane in multidimensional space
- a line in 2 dimensions

Minsky & Papert (1969) pricked the neural network balloon

Perceptron learning

Learn by adjusting weights to reduce the error on the training set. The squared error for an example with input $\mathbf{x}$ and true output $y$ is

$$E = \frac{1}{2}\, Err^2 \equiv \frac{1}{2}\,\big(y - h_{\mathbf{W}}(\mathbf{x})\big)^2$$

Find local optima for the minimization of the function E(W) in the vector of variables W by gradient methods.

Note that the function E depends on the constant values x that are the inputs to the perceptron.

The function E depends on h, which is non-convex; hence the optimization problem cannot be solved just by solving $\nabla E(\mathbf{W}) = 0$.

Digression: Gradient methods

Gradient methods are iterative approaches:

- find a descent direction with respect to the objective function E
- move W in that direction by a step size

The descent direction can be computed by various methods, such as gradient (steepest) descent, the Newton–Raphson method, and others. The step size can be computed either exactly or loosely, by solving a line search problem.

Example: gradient descent

1. Set the iteration counter $t = 0$ and make an initial guess $\mathbf{W}_0$ for the minimum
2. Repeat:
3.   Compute a descent direction $\mathbf{p}_t = \nabla E(\mathbf{W}_t)$
4.   Choose $\alpha_t$ to minimize $f(\alpha) = E(\mathbf{W}_t - \alpha \mathbf{p}_t)$ over $\alpha \in \mathbb{R}_+$
5.   Update $\mathbf{W}_{t+1} = \mathbf{W}_t - \alpha_t \mathbf{p}_t$ and $t = t + 1$
6. Until $\|\nabla E(\mathbf{W}_t)\| <$ tolerance

Step 4 can be solved 'loosely' by taking a fixed, small enough value $\alpha > 0$.
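As an illustration of the 'loose' variant with a fixed step size, here is a minimal R sketch (not from the slides); the gradient function gradE is assumed to be supplied by the caller, and all names are illustrative only.

gradient.descent <- function(gradE, W0, alpha = 0.01, tol = 1e-6, maxit = 10000) {
    W <- W0
    for (t in 1:maxit) {
        p <- gradE(W)                     # descent direction p_t = gradient of E at W_t
        if (sqrt(sum(p^2)) < tol) break   # stop when the gradient norm falls below the tolerance
        W <- W - alpha * p                # fixed-step update W_{t+1} = W_t - alpha * p_t
    }
    W
}

## usage on the toy quadratic E(W) = ||W - c||^2 / 2, whose minimum is at c = (1, 2):
c.vec <- c(1, 2)
gradient.descent(function(W) W - c.vec, W0 = c(0, 0))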

Perceptron learning

In the specific case of the perceptron, the descent direction is computed by the gradient:

$$\frac{\partial E}{\partial W_j} = Err \cdot \frac{\partial Err}{\partial W_j} = Err \cdot \frac{\partial}{\partial W_j}\Big(y - g\big(\sum_{j=0}^{n} W_j x_j\big)\Big) = -Err \cdot g'(in) \cdot x_j$$

and the weight update rule (the perceptron learning rule) in step 5 becomes:

$$W_j^{t+1} = W_j^{t} + \alpha \cdot Err \cdot g'(in) \cdot x_j$$

For the threshold perceptron, $g'(in)$ is undefined: the original perceptron learning rule (Rosenblatt, 1957) simply omits $g'(in)$.
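A minimal R sketch (not from the slides) of one application of this update rule for a sigmoid perceptron, keeping the $g'(in)$ factor that the threshold version omits; the names are illustrative only.

g  <- function(x) 1 / (1 + exp(-x))   # sigmoid activation
gp <- function(x) g(x) * (1 - g(x))   # its derivative g'(x)

perceptron.update <- function(W, x, y, alpha = 0.1) {
    in.val <- sum(W * x)                   # in = sum_j W_j x_j (x includes the bias input)
    Err <- y - g(in.val)
    W + alpha * Err * gp(in.val) * x       # W_j <- W_j + alpha * Err * g'(in) * x_j
}

## e.g. one update for an example with bias input -1, inputs (5.4, 3.9), and target 1:
W <- runif(3)
W <- perceptron.update(W, c(-1, 5.4, 3.9), y = 1)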

Perceptron learning contd.

function Perceptron-Learning(examples, network) returns perceptron weights
   inputs: examples, a set of examples, each with input x = x_1, ..., x_n and output y
           network, a perceptron with weights W_j, j = 0, ..., n, and activation function g

   repeat
      for each e in examples do
         in ← Σ_{j=0}^{n} W_j x_j[e]
         Err ← y[e] − g(in)
         W_j ← W_j + α · Err · g'(in) · x_j[e]
      end
   until all examples correctly predicted or stopping criterion is reached
   return network

Perceptron learning rule converges to a consistent function for any linearly separable data set

Numerical Example

The (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables petal length and width, respectively, for 50 flowers from each of 2 species of iris. The species are “Iris setosa”, and “versicolor”.

> head(iris.data)
   Sepal.Length Sepal.Width    Species id
6           5.4         3.9     setosa -1
4           4.6         3.1     setosa -1
84          6.0         2.7 versicolor  1
31          4.8         3.1     setosa -1
77          6.8         2.8 versicolor  1
15          5.8         4.0     setosa -1

[Figure: "Petal Dimensions in Iris Blossoms": a scatter plot of the data, Length vs Width, with S = Setosa and V = Versicolor points.]

> sigma <- function(w, point) {
+     x <- c(point, 1)           # append the bias input
+     sign(w %*% x)              # threshold activation
+ }
> w.0 <- c(runif(1), runif(1), runif(1))
> w.t <- w.0
> for (j in 1:1000) {
+     i <- (j - 1) %% 50 + 1     # cycle through the examples
+     diff <- iris.data[i, 4] - sigma(w.t, c(iris.data[i, 1], iris.data[i, 2]))
+     w.t <- w.t + 0.2 * diff * c(iris.data[i, 1], iris.data[i, 2], 1)
+ }

[Figure: "Petal Dimensions in Iris Blossoms": the same scatter plot of Setosa (S) and Versicolor (V) points after running the perceptron updates above.]

Multilayer Feed-forward

[Figure: a feed-forward network with two input units (1, 2), two hidden units (3, 4), and one output unit (5), connected by weights $W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}, W_{3,5}, W_{4,5}$.]

Feed-forward network = a parametrized family of nonlinear functions:

$$a_5 = g(W_{3,5}\, a_3 + W_{4,5}\, a_4) = g\big(W_{3,5}\, g(W_{1,3}\, a_1 + W_{2,3}\, a_2) + W_{4,5}\, g(W_{1,4}\, a_1 + W_{2,4}\, a_2)\big)$$

Adjusting the weights changes the function: do learning this way!
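A minimal R sketch (not from the slides) of the forward pass of this 2-2-1 network with sigmoid activation; the weight names follow the figure, while the numeric values are arbitrary and for illustration only.

g <- function(x) 1 / (1 + exp(-x))

forward <- function(a1, a2, W) {
    a3 <- g(W["1,3"] * a1 + W["2,3"] * a2)          # hidden unit 3
    a4 <- g(W["1,4"] * a1 + W["2,4"] * a2)          # hidden unit 4
    unname(g(W["3,5"] * a3 + W["4,5"] * a4))        # output a5
}

W <- c("1,3" = 0.5, "2,4" = 0.1, "2,3" = -0.3, "1,4" = 0.8, "3,5" = 1.2, "4,5" = -0.7)
forward(1, 0, W)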

Neural Network with two layers

Expressiveness of MLPs

All continuous functions can be represented with 2 layers, all functions with 3 layers.

[Figure: two plots of $h_W(x_1, x_2)$: a ridge obtained from two opposite-facing soft thresholds (left) and a bump obtained from two perpendicular ridges (right).]

Combine two opposite-facing threshold functions to make a ridge.
Combine two perpendicular ridges to make a bump.
Add bumps of various sizes and locations to fit any surface.
The proof requires exponentially many hidden units.

Back-propagation Algorithm

Supervised learning method to train multilayer feedforward NNs with differentiable transfer functions.

Adjust the weights along the negative of the gradient of the performance function.

Forward-Backward pass.

Sequential or batch mode

Convergence time can vary exponentially with the number of inputs.

Avoid local minima by simulated annealing and other metaheuristics

Multilayer perceptrons

Layers are usually fully connected; numbers of hidden units typically chosen by hand

[Figure: a multilayer perceptron: input units $a_k$, connected by weights $W_{k,j}$ to hidden units $a_j$, connected by weights $W_{j,i}$ to output units $a_i$.]

Back-propagation learning

Output layer: same as for single-layer perceptron,

$$W_{j,i} \leftarrow W_{j,i} + \alpha \times a_j \times \Delta_i$$

where $\Delta_i = Err_i \times g'(in_i)$. Note: the general case has multiple output units, hence $\mathbf{Err} = (\mathbf{y} - h_{\mathbf{W}}(\mathbf{x}))$ is a vector of errors.

Hidden layer: back-propagate the error from the output layer:

$$\Delta_j = g'(in_j) \sum_i W_{j,i}\, \Delta_i \qquad \text{(sum over the multiple output units)}$$

The update rule for the weights in the hidden layer:

$$W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j .$$

(Most neuroscientists deny that back-propagation occurs in the brain)
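A minimal R sketch (not from the slides) of a single back-propagation update for a network with one hidden layer and sigmoid units; bias terms are omitted for brevity, Wkj is the input-to-hidden weight matrix, Wji the hidden-to-output one, and all names are illustrative only.

g  <- function(x) 1 / (1 + exp(-x))
gp <- function(x) g(x) * (1 - g(x))              # g'(in) for the sigmoid

backprop.update <- function(Wkj, Wji, x, y, alpha = 0.1) {
    in.j <- Wkj %*% x;   a.j <- g(in.j)          # hidden layer activations
    in.i <- Wji %*% a.j; a.i <- g(in.i)          # output layer activations
    Delta.i <- (y - a.i) * gp(in.i)              # Delta_i = Err_i * g'(in_i)
    Delta.j <- gp(in.j) * (t(Wji) %*% Delta.i)   # Delta_j = g'(in_j) * sum_i W_{j,i} Delta_i
    Wji <- Wji + alpha * Delta.i %*% t(a.j)      # W_{j,i} <- W_{j,i} + alpha * a_j * Delta_i
    Wkj <- Wkj + alpha * Delta.j %*% t(x)        # W_{k,j} <- W_{k,j} + alpha * a_k * Delta_j
    list(Wkj = Wkj, Wji = Wji)
}

## usage: 2 inputs, 3 hidden units, 1 output
Wkj <- matrix(runif(6, -0.1, 0.1), nrow = 3)
Wji <- matrix(runif(3, -0.1, 0.1), nrow = 1)
upd <- backprop.update(Wkj, Wji, x = c(0.2, 0.7), y = 1)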

Back-propagation derivation

The squared error on a single example is defined as

$$E = \frac{1}{2}\sum_i (y_i - a_i)^2 ,$$

where the sum is over the nodes in the output layer.

$$\begin{aligned}
\frac{\partial E}{\partial W_{j,i}} &= -(y_i - a_i)\frac{\partial a_i}{\partial W_{j,i}} = -(y_i - a_i)\frac{\partial g(in_i)}{\partial W_{j,i}} \\
&= -(y_i - a_i)\, g'(in_i)\,\frac{\partial in_i}{\partial W_{j,i}} = -(y_i - a_i)\, g'(in_i)\,\frac{\partial}{\partial W_{j,i}}\Big(\sum_j W_{j,i}\, a_j\Big) \\
&= -(y_i - a_i)\, g'(in_i)\, a_j = -a_j \Delta_i
\end{aligned}$$

Back-propagation derivation contd.

For the hidden layer:

$$\begin{aligned}
\frac{\partial E}{\partial W_{k,j}} &= -\sum_i (y_i - a_i)\frac{\partial a_i}{\partial W_{k,j}} = -\sum_i (y_i - a_i)\frac{\partial g(in_i)}{\partial W_{k,j}} \\
&= -\sum_i (y_i - a_i)\, g'(in_i)\,\frac{\partial in_i}{\partial W_{k,j}} = -\sum_i \Delta_i\, \frac{\partial}{\partial W_{k,j}}\Big(\sum_j W_{j,i}\, a_j\Big) \\
&= -\sum_i \Delta_i\, W_{j,i}\,\frac{\partial a_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\,\frac{\partial g(in_j)}{\partial W_{k,j}} \\
&= -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\frac{\partial in_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\frac{\partial}{\partial W_{k,j}}\Big(\sum_k W_{k,j}\, a_k\Big) \\
&= -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\, a_k = -a_k \Delta_j
\end{aligned}$$

Numerical Example

The (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: "setosa", "versicolor", and "virginica".

[Figure: pairwise scatter plots of Sepal.Length, Petal.Length, and Petal.Width for the three species setosa, versicolor, and virginica.]

Numerical Example

> library(nnet)   # provides class.ind and nnet
> samp <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
> Target <- class.ind(iris$Species)
> ir.nn <- nnet(Target ~ Sepal.Length * Petal.Length * Petal.Width, data = iris, subset = samp,
+     size = 2, rang = 0.1, decay = 5e-04, maxit = 200, trace = FALSE)
> test.cl <- function(true, pred) {
+     true <- max.col(true)
+     cres <- max.col(pred)
+     table(true, cres)
+ }
> # confusion matrix on the held-out examples (rows: true class, columns: predicted class)
> test.cl(Target[-samp, ], predict(ir.nn, iris[-samp, c(1, 3, 4)]))
    cres
true  1  2  3
   1 25  0  0
   2  0 22  3
   3  0  2 23

Learning structures

Besides the weights, the structure can also be learned:

Optimal brain damage: iteratively remove single edges or units if performance does not worsen after the weights are re-learned.

Tiling: iteratively add units and links and re-learn the weights.

Handwritten digit recognition

400–300–10 unit MLP = 1.6% error

LeNet: 768–192–30–10 unit MLP = 0.9% error http://yann.lecun.com/exdb/lenet/

Current best (kernel machines, vision algorithms) ≈ 0.6% error

Humans are at 0.2%–2.5% error

Another Practical Example

Directions of research in ANN

Representational capability assuming unlimited number of neurons (no training)

Numerical analysis or approximation-theoretic view: how many hidden units are necessary to achieve a certain approximation error? (no training) Results exist for a single hidden layer and for multiple hidden layers.

Sample complexity: how many samples are needed to characterize a certain unknown mapping.

Efficient learning: backpropagation has the curse of dimensionality problem

Approximation properties

NNs with 2 hidden layers and arbitrarily many nodes can approximate any real-valued function up to any desired accuracy, using continuous activation functions. However, the required number of hidden units may grow exponentially with the number of inputs: e.g., $2^n/n$ hidden units are needed to encode all Boolean functions of $n$ inputs.

However, the proofs are not constructive.

More interest in efficiency issues: NNs with small size and depth

Size-depth trade-off: more layers are more costly to simulate.

Outline

1. Neural Networks
   Feedforward Networks
   Single-layer perceptrons
   Multi-layer perceptrons

2. Other Methods and Issues

Training and Assessment

Use different data for different tasks:

Training and Test data: holdout cross validation

If little data: k-fold cross validation (see the sketch after this list)

Avoid peeking:

Weights learned on training data.

Parameters such as α and net topology compared on validation data

Final assessment on test data
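The sketch below (not from the slides) shows plain k-fold cross validation in R; the data frame data and the fit and score functions are assumed to be supplied by the caller and are illustrative only.

kfold.cv <- function(data, k, fit, score) {
    fold <- sample(rep(1:k, length.out = nrow(data)))   # random fold assignment
    errs <- numeric(k)
    for (i in 1:k) {
        train <- data[fold != i, ]
        test  <- data[fold == i, ]
        model <- fit(train)
        errs[i] <- score(model, test)    # e.g. misclassification rate on the held-out fold
    }
    mean(errs)                           # average held-out error over the k folds
}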

Ensemble Methods

Use a majority rule to predict among the K hypotheses learned (see the sketch below).

If the hypotheses are independent, this yields a considerable reduction of the misclassification error.

Boosting: weight adaptively the examples
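A minimal R sketch (not from the slides) of the majority rule: preds is a K x n matrix with one row per learned hypothesis and one column per example; all names are illustrative only.

majority.vote <- function(preds) {
    apply(preds, 2, function(votes) names(which.max(table(votes))))   # most frequent label per example
}

## e.g. three hypotheses voting on three examples:
preds <- rbind(c("a", "b", "b"),
               c("a", "b", "a"),
               c("b", "b", "a"))
majority.vote(preds)   # "a" "b" "a"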

Learning Theory

Probably approximately correct (PAC) learning

Vapnik–Chervonenkis (VC) dimensions provide information-theoretic bounds on the sample complexity of continuous function classes.

Summary

Supervised learning

Decision trees

Linear models

Neural Networks
- Perceptron learning rule: an algorithm for learning the weights in single-layer networks.
- Perceptrons: linear separators, insufficiently expressive.
- Multi-layer networks are sufficiently expressive.
- Many applications: speech, driving, handwriting, fraud detection, etc.

k nearest neighbor, non-parametric regression
