Artificial Intelligence: Representation and Problem Solving 15-381 January 16, 2007

Neural Networks

Topics

• decision boundaries
• linear discriminants
• gradient learning
• neural networks

The Iris dataset with decision tree boundaries

[Figure: the Iris dataset plotted as petal length (cm) vs. petal width (cm), with decision-tree boundaries overlaid]

The optimal decision boundary for C2 vs C3

• the optimal decision boundary is determined from the statistical distributions of the classes, p(petal length | C2) and p(petal length | C3) (a small numerical sketch follows below)
• optimal only if the model is correct!
• assigns a precise degree of uncertainty to the classification

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3), with the resulting decision boundary shown in the petal length (cm) vs. petal width (cm) plane]
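As a concrete sketch of this idea (not taken from the slides), one could fit a Gaussian to each class's petal lengths and locate where the two class-conditional densities cross. The sample arrays below are hypothetical stand-ins for the Iris measurements, and equal priors are assumed.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Hypothetical petal-length samples for classes C2 and C3 (stand-ins for Iris data).
petal_c2 = np.array([4.0, 4.5, 4.7, 3.9, 4.4, 4.6])
petal_c3 = np.array([5.5, 6.0, 5.8, 5.1, 6.3, 5.6])

# Model each class-conditional density p(petal length | C) as a Gaussian.
mu2, sd2 = petal_c2.mean(), petal_c2.std(ddof=1)
mu3, sd3 = petal_c3.mean(), petal_c3.std(ddof=1)

# With equal priors, the optimal boundary is where p(x | C2) = p(x | C3).
diff = lambda x: norm.pdf(x, mu2, sd2) - norm.pdf(x, mu3, sd3)
boundary = brentq(diff, mu2, mu3)   # root lies between the two class means
print(f"optimal boundary at petal length ~ {boundary:.2f} cm")
```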


Optimal decision boundary

[Figure: class-conditional densities p(petal length | C2), p(petal length | C3) and posterior probabilities p(C2 | petal length), p(C3 | petal length) as functions of petal length]


Can we do better?

• the only way is to use more information
• decision trees use both petal width and petal length

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3), and the data in the petal length (cm) vs. petal width (cm) plane]

Arbitrary decision boundaries would be more powerful

Decision boundaries could be non-linear.

[Figure: a non-linear decision boundary sketched in the petal length (cm) vs. petal width (cm) plane]


Defining a decision boundary

• consider just two classes
• we want points on one side of a line to be in class 1, otherwise in class 2
• 2D linear discriminant function:

  y = m^T x + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b

• This defines a plane over the 2D input, which leads to the decision rule:

  x \in class 1 if y \ge 0, x \in class 2 if y < 0.

• The decision boundary is y = m^T x + b = 0, or in terms of scalars (a small numeric sketch follows):

  m_1 x_1 + m_2 x_2 = -b \;\Rightarrow\; x_2 = -\frac{m_1 x_1 + b}{m_2}
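A small numeric sketch of this rule; the weights m and bias b below are illustrative, not fitted to the Iris data.

```python
import numpy as np

# Illustrative parameters of the 2D linear discriminant y = m^T x + b.
m = np.array([0.8, 1.5])   # weights on petal length (x1) and petal width (x2)
b = -5.0                   # bias

def classify(x):
    """Return 1 if y = m^T x + b >= 0, else 2."""
    y = m @ x + b
    return 1 if y >= 0 else 2

# The boundary itself: m1*x1 + m2*x2 = -b, i.e. x2 = -(m1*x1 + b)/m2.
x1 = np.linspace(1, 7, 5)
x2_boundary = -(m[0] * x1 + b) / m[1]

print(classify(np.array([5.0, 1.8])))                  # class of one example point
print(list(zip(x1.round(2), x2_boundary.round(2))))    # points on the boundary line
```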

Linear separability

• Two classes are linearly separable if they can be separated by a linear combination of the attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane

[Figure: the Iris data in the petal length (cm) vs. petal width (cm) plane; one pair of classes is labeled "linearly separable", another "not linearly separable"]


Diagramming the classifier as a “neural” network

• The feedforward neural network is specified by weights wi and bias b:

  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b

[Network diagram: output unit y, weights w_1, w_2, ..., w_M, bias b, input units x_1, x_2, ..., x_M]

• It can be written equivalently as

  y = w^T x = \sum_{i=0}^{M} w_i x_i

• where w_0 = b is the bias and x_0 is a “dummy” input that is always 1 (see the quick check below).

[Network diagram: output unit y, weights w_0, w_1, ..., w_M, inputs x_0 = 1, x_1, ..., x_M]
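A quick check, with illustrative values, that folding the bias into the weight vector via a dummy input x_0 = 1 leaves the output unchanged.

```python
import numpy as np

w = np.array([0.4, -0.2, 0.7])   # weights w1..wM (illustrative)
b = 0.1                          # bias
x = np.array([1.5, 2.0, -0.5])   # one input pattern

y_explicit = w @ x + b                       # y = w^T x + b
w_aug = np.concatenate(([b], w))             # set w0 = b
x_aug = np.concatenate(([1.0], x))           # dummy input x0 = 1
y_dummy = w_aug @ x_aug                      # y = w^T x with the dummy input

assert np.isclose(y_explicit, y_dummy)       # both forms give the same output
```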

Determining, i.e. learning, the optimal linear discriminant

• First we must define an objective function, i.e. the goal of learning.
• Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
• Objective: minimize the sum-squared error over all patterns x_n (a small sketch follows below):

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

• Note the notation: x_n denotes a pattern vector, x_n = \{x_1, \ldots, x_M\}_n.
• We can define the desired class as:

  c_n = \begin{cases} 0 & x_n \in \text{class 1} \\ 1 & x_n \in \text{class 2} \end{cases}
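Written directly from this formula, the objective is a one-liner; the data below are random placeholders with a dummy input x_0 = 1 for the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 2
X = rng.normal(size=(N, M + 1)); X[:, 0] = 1.0     # patterns as rows, dummy x0 = 1
c = rng.integers(0, 2, size=N).astype(float)       # class labels c_n in {0, 1}

def sum_squared_error(w, X, c):
    """E = 1/2 * sum_n (w^T x_n - c_n)^2"""
    return 0.5 * np.sum((X @ w - c) ** 2)

print(sum_squared_error(np.zeros(M + 1), X, c))    # error of the all-zero weights
```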


We’ve seen this before: curve fitting

[Figure: data points (x_n, t_n) generated as t = sin(2πx) + noise, with a fitted curve y(x_n, w)]

example from Bishop (2006), Pattern Recognition and Machine Learning

Neural networks compared to polynomial curve fitting

  y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

  E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2

For the linear network, M = 1 and there are multiple input dimensions.

[Figure: four panels of polynomial fits of different order M to the curve-fitting data; example from Bishop (2006), Pattern Recognition and Machine Learning]

General form of a linear network

• A linear neural network is simply a linear transformation of the input:

  y_j = \sum_{i=0}^{M} w_{i,j} x_i

• Or, in matrix-vector form: y = W x
• Multiple outputs correspond to multivariate regression.

[Network diagram: outputs y_1, ..., y_K; weights w_{ij}; inputs x_0 = 1 (bias), x_1, ..., x_M]

Training the network: Optimization by gradient descent

• We can adjust the weights incrementally to minimize the objective function.
• This is called gradient descent.
• Or gradient ascent if we're maximizing.
• The gradient descent rule for weight w_i is:

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

• Or in vector form:

  w^{t+1} = w^t - \epsilon \frac{\partial E}{\partial w}

• For gradient ascent, the sign of the gradient step changes.

[Figure: successive weight estimates w^1, w^2, w^3, w^4 descending a contour plot of the error surface]


Computing the gradient

• Idea: minimize the error by gradient descent.
• Take the derivative of the objective function with respect to the weights:

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

  \frac{\partial E}{\partial w_i}
    = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n)\, x_{i,n}
    = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_{i,n}

• And in vector form (implemented in the sketch below):

  \frac{\partial E}{\partial w} = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_n
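In code, the vectorized gradient is a single matrix product; the finite-difference comparison below is just a sanity check on synthetic placeholder data.

```python
import numpy as np

def grad_E(w, X, c):
    """dE/dw = sum_n (w^T x_n - c_n) x_n, for all patterns at once (X has one pattern per row)."""
    errors = X @ w - c      # shape (N,): w^T x_n - c_n
    return X.T @ errors     # shape (M+1,): sum over patterns of errors_n * x_n

# Sanity check against a numerical (finite-difference) gradient.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)); X[:, 0] = 1.0          # dummy x0 = 1 column
c = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)
E = lambda w: 0.5 * np.sum((X @ w - c) ** 2)
h = 1e-6
numeric = np.array([(E(w + h * np.eye(3)[i]) - E(w - h * np.eye(3)[i])) / (2 * h)
                    for i in range(3)])
assert np.allclose(numeric, grad_E(w, X, c), atol=1e-4)
```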

Simulation: learning the decision boundary

• Each iteration updates the weights using the gradient:

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i},
  \qquad
  \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_{i,n}

• Epsilon is a small value: \epsilon = 0.1/N
• Epsilon too large: learning diverges.
• Epsilon too small: convergence is slow.

[Figure, shown over a sequence of slides: the decision boundary after successive iterations in the petal length vs. petal width plane, and the learning curve of error vs. iteration]


[Figure: the learned decision boundary in the petal length vs. petal width plane, and the learning curve of error vs. iteration]

• Learning converges onto the solution that minimizes the error.
• For linear networks, this is guaranteed to converge to the minimum.
• It is also possible to derive a closed-form solution (covered later); a sketch of both is given below.
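Here is a sketch of the whole procedure described above: gradient descent with ε = 0.1/N, compared against the closed-form least-squares solution. The two-class data are synthetic stand-ins for the Iris measurements, and the inputs are standardized first (an extra practical step, not on the slides, that keeps the fixed step size well inside the stable range).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
# Synthetic stand-ins for the Iris C2/C3 measurements (petal length, petal width).
A = rng.normal([4.5, 1.4], 0.3, size=(N // 2, 2))   # class 1 cloud
B = rng.normal([5.8, 2.1], 0.3, size=(N // 2, 2))   # class 2 cloud
features = np.vstack([A, B])
# Standardize the inputs (practical extra, not on the slides) so that the
# fixed step size eps = 0.1/N stays well inside the stable range.
features = (features - features.mean(0)) / features.std(0)
X = np.hstack([np.ones((N, 1)), features])          # dummy x0 = 1 for the bias
c = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

w = np.zeros(3)
eps = 0.1 / N
errors = []
for t in range(2000):
    grad = X.T @ (X @ w - c)     # dE/dw = sum_n (w^T x_n - c_n) x_n
    w = w - eps * grad           # w_{t+1} = w_t - eps * dE/dw
    errors.append(0.5 * np.sum((X @ w - c) ** 2))

# Closed-form least-squares solution, for comparison.
w_closed, *_ = np.linalg.lstsq(X, c, rcond=None)
print("gradient descent:", w.round(4))
print("closed form     :", w_closed.round(4))
print("error: start %.2f -> end %.4f" % (errors[0], errors[-1]))
```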


Learning is slow when epsilon is too small

• Here, larger step sizes would converge more quickly to the minimum

[Figure: error surface E(w) with many small gradient steps slowly approaching the minimum]

Divergence when epsilon is too large

• If the step size is too large, learning can oscillate between different sides of the minimum

[Figure: error surface E(w) with large gradient steps oscillating across the minimum]


Multi-layer networks

• Can we extend our network to multiple layers? We have:

  y_j = \sum_i w_{i,j} x_i
  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i

• Or in matrix form:

  z = V y = V W x

• Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW (see the check below).
• It is not more powerful. How do we address this?
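A three-line check of this collapse with random matrices (shapes chosen arbitrarily).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # first layer:  3 inputs  -> 4 hidden units
V = rng.normal(size=(2, 4))   # second layer: 4 hidden  -> 2 outputs
x = rng.normal(size=3)

z_two_layer = V @ (W @ x)     # z = V y = V (W x)
U = V @ W                     # single equivalent weight matrix
z_one_layer = U @ x

assert np.allclose(z_two_layer, z_one_layer)   # the two-layer linear net collapses
```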

Non-linear neural networks

• Idea: introduce a non-linearity:

  y_j = f\left(\sum_i w_{i,j} x_i\right)

• Now, multiple layers are not equivalent:

  z_k = f\left(\sum_j v_{j,k} y_j\right) = f\left(\sum_j v_{j,k} f\left(\sum_i w_{i,j} x_i\right)\right)

• Common non-linearities (written out in the sketch below):
  - threshold: y = 0 for x < 0, y = 1 for x ≥ 0
  - sigmoid: y = \frac{1}{1 + \exp(-x)}

[Figure: plots of the threshold and sigmoid functions]
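The two non-linearities written out, plus a check that with a non-linearity in place the two-layer network no longer collapses into a single matrix (random weights, purely illustrative).

```python
import numpy as np

def threshold(x):
    """0 for x < 0, 1 for x >= 0."""
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    """y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# With a non-linearity between layers, the composition no longer collapses:
rng = np.random.default_rng(0)
W, V = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
z_nonlinear = sigmoid(V @ sigmoid(W @ x))
z_collapsed = sigmoid((V @ W) @ x)
print(np.allclose(z_nonlinear, z_collapsed))   # False in general
```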


Modeling logical operators

[Figure: three plots in the (x1, x2) plane showing the positive and negative cases of x1 AND x2, x1 OR x2, and x1 XOR x2]

• A one-layer binary-threshold network, y_j = f(\sum_i w_{i,j} x_i) with f the threshold function above, can implement the logical operators AND and OR, but not XOR.
• Why not? (A small sketch follows.)
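One concrete choice of weights that makes a single threshold unit compute AND and OR (many choices work), together with a brute-force check over a weight grid that no single unit reproduces XOR.

```python
import numpy as np

def threshold_unit(w0, w1, w2):
    """A single unit y = f(w0 + w1*x1 + w2*x2) with the 0/1 threshold f."""
    return lambda x1, x2: 1 if (w0 + w1 * x1 + w2 * x2) >= 0 else 0

AND = threshold_unit(-1.5, 1.0, 1.0)   # fires only when x1 + x2 >= 1.5
OR  = threshold_unit(-0.5, 1.0, 1.0)   # fires when x1 + x2 >= 0.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([AND(*x) for x in inputs])       # [0, 0, 0, 1]
print([OR(*x)  for x in inputs])       # [0, 1, 1, 1]

# Brute-force search over a grid of weights: no single threshold unit
# reproduces XOR, because XOR's classes are not linearly separable.
xor_target = [0, 1, 1, 0]
grid = np.linspace(-2, 2, 21)
found = any([threshold_unit(w0, w1, w2)(*x) for x in inputs] == xor_target
            for w0 in grid for w1 in grid for w2 in grid)
print("XOR representable by one threshold unit?", found)   # False
```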

Posterior odds interpretation of a sigmoid
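A sketch of the standard derivation behind this interpretation (a Bayes-rule manipulation, supplied here rather than recovered from the slide): for two classes, the posterior is exactly a sigmoid of the log posterior odds a.

```latex
p(C_1 \mid x)
  = \frac{p(x \mid C_1)\,P(C_1)}{p(x \mid C_1)\,P(C_1) + p(x \mid C_2)\,P(C_2)}
  = \frac{1}{1 + \exp(-a)} = \sigma(a),
\qquad
a = \ln\frac{p(x \mid C_1)\,P(C_1)}{p(x \mid C_2)\,P(C_2)}
```

So a network output passed through a sigmoid can be read as a posterior class probability when its pre-activation estimates the log posterior odds.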


The general classification/regression problem

• desired output: y = \{y_1, \ldots, y_K\}. For classification, y_i = 1 if x \in C_i (class i) and 0 otherwise; for regression, y is arbitrary.
• model: defined by M parameters \theta = \{\theta_1, \ldots, \theta_M\}, e.g. a decision tree or a multi-layer neural network.
• data: a set of T observations D = \{x_1, \ldots, x_T\}, each an N-dimensional vector x_i = \{x_1, \ldots, x_N\}_i (binary, discrete, or continuous).

Given the data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.

A general multi-layer neural network

• Error function is defined as before, where we use the target vector tn to define the desired output for network output yn.

  E = \frac{1}{2} \sum_{n=1}^{N} \left(y_n(x_n, W_{1:L}) - t_n\right)^2

• The “forward pass” computes the outputs at each layer (sketched in code below):

  y_j^l = f\left(\sum_i w_{i,j}^l\, y_i^{l-1}\right), \qquad l = \{1, \ldots, L\},

  where the input is x \equiv y^0 and the network output is y^L.

[Network diagram: input units x_0 = 1, x_1, ..., x_M; hidden layers with units y_0 = 1, y_1, ..., y_K; output units y_1, ..., y_K; weights w_{ij} between successive layers]
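A compact sketch of this forward pass for a stack of sigmoid layers. The layer sizes, weights, and input are arbitrary placeholders; each weight matrix carries an extra first column that multiplies a constant 1, playing the role of the bias unit y_0 = 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """Forward pass y^l = f(W^l y^(l-1)), l = 1..L, with y^0 = x.
    Each W^l has an extra first column multiplying a constant 1 (the bias unit)."""
    y = x
    activations = [y]
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))   # prepend the bias unit y_0 = 1
        activations.append(y)
    return activations                                # activations[-1] is the output y^L

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                                  # input, two hidden layers, output
weights = [0.5 * rng.normal(size=(sizes[l + 1], sizes[l] + 1))
           for l in range(len(sizes) - 1)]
x = rng.normal(size=sizes[0])
print(forward(x, weights)[-1])                        # network output y^L
```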


Deriving the gradient for a sigmoid neural network

• The mathematical procedure for training is gradient descent: the same as before, except the gradients are more complex to derive.
• New problem: local minima.

  E = \frac{1}{2} \sum_{n=1}^{N} \left(y_n(x_n, W_{1:L}) - t_n\right)^2

• Convenient fact for the sigmoid non-linearity:

  \frac{d\sigma(x)}{dx} = \frac{d}{dx}\,\frac{1}{1 + \exp(-x)} = \sigma(x)\,(1 - \sigma(x))

• The backward pass computes the gradients: back-propagation (a minimal sketch follows).

  W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W}
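A minimal back-propagation sketch for a two-layer sigmoid network under this squared-error objective, using the σ(x)(1 − σ(x)) identity. The sizes and data are placeholders, bias units are omitted for brevity, and the gradient is checked against a finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # one input pattern (biases omitted for brevity)
t = np.array([1.0])                 # target
W1 = 0.5 * rng.normal(size=(3, 4))  # input -> hidden weights
W2 = 0.5 * rng.normal(size=(1, 3))  # hidden -> output weights

def loss_and_grads(W1, W2, x, t):
    # forward pass
    h = sigmoid(W1 @ x)                     # hidden layer activations
    y = sigmoid(W2 @ h)                     # output layer activations
    E = 0.5 * np.sum((y - t) ** 2)
    # backward pass, using d(sigma)/dx = sigma * (1 - sigma)
    delta2 = (y - t) * y * (1 - y)          # error at the output pre-activation
    dW2 = np.outer(delta2, h)
    delta1 = (W2.T @ delta2) * h * (1 - h)  # error propagated to the hidden layer
    dW1 = np.outer(delta1, x)
    return E, dW1, dW2

E, dW1, dW2 = loss_and_grads(W1, W2, x, t)

# Finite-difference check of one weight's gradient.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss_and_grads(W1p, W2, x, t)[0] - loss_and_grads(W1m, W2, x, t)[0]) / (2 * eps)
print(np.isclose(numeric, dW1[0, 0]))        # True: backprop gradient matches

# One gradient-descent step: W <- W - eps_lr * dE/dW
eps_lr = 0.1
W1, W2 = W1 - eps_lr * dW1, W2 - eps_lr * dW2
```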

Applications: Driving (output is analog: steering direction)

• Real example: a network with 1 hidden layer (4 hidden units)
• Learns to drive on roads
• Demonstrated at highway speeds over 100s of miles

D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishing, 1993.

• Training data: real image input plus the corresponding steering angle
• Important: conditioning of the training data to generate new examples avoids overfitting


Hand-written digits: LeNet (real example)

• Takes as input an image of a handwritten digit
• Each pixel is an input unit
• Complex network with many layers
• Output is the digit class
• Tested on a large (50,000+) database of handwritten samples
• Runs in real time
• Used commercially


LeNet

Very low error rate (<< 1%)

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998. http://yann.lecun.com/exdb/lenet/


Object recognition

• LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004. • http://www.cs.nyu.edu/~yann/research/norb/


Summary

• Decision boundaries
  - Bayes optimal
  - linear discriminant
  - linear separability
• Classification vs. regression
• Optimization by gradient descent
• Degeneracy of a multi-layer linear network
• Non-linearities: threshold, sigmoid, others?
• Issues:
  - very general architecture, can solve many problems
  - large number of parameters: need to avoid overfitting
  - usually requires a large amount of data, or a special architecture
  - local minima, training can be slow, need to set the step size
