Artificial Intelligence: Representation and Problem Solving 15-381 January 16, 2007

Neural Networks

Topics

• decision boundaries
• linear discriminants
• gradient learning
• neural networks

The Iris dataset with decision tree boundaries

[Figure: the Iris dataset plotted as petal length (cm) vs. petal width (cm), with decision-tree boundaries overlaid]

The optimal decision boundary for C2 vs C3

• the optimal decision boundary is determined from the statistical distributions of the classes, p(petal length | C2) and p(petal length | C3) (a small numerical sketch follows below)
• optimal only if the model is correct!
• assigns a precise degree of uncertainty to the classification

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3), with the resulting decision boundary shown in the petal length (cm) vs. petal width (cm) plane]
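As a concrete sketch of this idea (not taken from the slides), one could fit a Gaussian to each class's petal lengths and locate where the two class-conditional densities cross. The sample arrays below are hypothetical stand-ins for the Iris measurements, and equal priors are assumed.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Hypothetical petal-length samples for classes C2 and C3 (stand-ins for Iris data).
petal_c2 = np.array([4.0, 4.5, 4.7, 3.9, 4.4, 4.6])
petal_c3 = np.array([5.5, 6.0, 5.8, 5.1, 6.3, 5.6])

# Model each class-conditional density p(petal length | C) as a Gaussian.
mu2, sd2 = petal_c2.mean(), petal_c2.std(ddof=1)
mu3, sd3 = petal_c3.mean(), petal_c3.std(ddof=1)

# With equal priors, the optimal boundary is where p(x | C2) = p(x | C3).
diff = lambda x: norm.pdf(x, mu2, sd2) - norm.pdf(x, mu3, sd3)
boundary = brentq(diff, mu2, mu3)   # root lies between the two class means
print(f"optimal boundary at petal length ~ {boundary:.2f} cm")
```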


Optimal decision boundary

[Figure: class-conditional densities p(petal length | C2), p(petal length | C3) and posterior probabilities p(C2 | petal length), p(C3 | petal length) as functions of petal length]


Can we do better?

• the only way is to use more information
• decision trees use both petal width and petal length

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3), and the data in the petal length (cm) vs. petal width (cm) plane]

Arbitrary decision boundaries would be more powerful

Decision boundaries could be non-linear.

[Figure: a non-linear decision boundary sketched in the petal length (cm) vs. petal width (cm) plane]


Defining a decision boundary

• consider just two classes
• we want points on one side of a line to be in class 1, otherwise in class 2
• 2D linear discriminant function:

  y = m^T x + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b

• This defines a plane over the 2D input, which leads to the decision rule:

  x \in class 1 if y \ge 0, x \in class 2 if y < 0.

• The decision boundary is y = m^T x + b = 0, or in terms of scalars (a small numeric sketch follows):

  m_1 x_1 + m_2 x_2 = -b \;\Rightarrow\; x_2 = -\frac{m_1 x_1 + b}{m_2}
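A small numeric sketch of this rule; the weights m and bias b below are illustrative, not fitted to the Iris data.

```python
import numpy as np

# Illustrative parameters of the 2D linear discriminant y = m^T x + b.
m = np.array([0.8, 1.5])   # weights on petal length (x1) and petal width (x2)
b = -5.0                   # bias

def classify(x):
    """Return 1 if y = m^T x + b >= 0, else 2."""
    y = m @ x + b
    return 1 if y >= 0 else 2

# The boundary itself: m1*x1 + m2*x2 = -b, i.e. x2 = -(m1*x1 + b)/m2.
x1 = np.linspace(1, 7, 5)
x2_boundary = -(m[0] * x1 + b) / m[1]

print(classify(np.array([5.0, 1.8])))                  # class of one example point
print(list(zip(x1.round(2), x2_boundary.round(2))))    # points on the boundary line
```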

Linear separability

• Two classes are linearly separable if they can be separated by a linear combination of the attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane

[Figure: the Iris data in the petal length (cm) vs. petal width (cm) plane; one pair of classes is labeled "linearly separable", another "not linearly separable"]


Diagramming the classifier as a “neural” network

• The feedforward neural network is specified by weights wi and bias b:

  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b

[Network diagram: output unit y, weights w_1, w_2, ..., w_M, bias b, input units x_1, x_2, ..., x_M]

• It can be written equivalently as

  y = w^T x = \sum_{i=0}^{M} w_i x_i

• where w_0 = b is the bias and x_0 is a “dummy” input that is always 1 (see the quick check below).

[Network diagram: output unit y, weights w_0, w_1, ..., w_M, inputs x_0 = 1, x_1, ..., x_M]
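A quick check, with illustrative values, that folding the bias into the weight vector via a dummy input x_0 = 1 leaves the output unchanged.

```python
import numpy as np

w = np.array([0.4, -0.2, 0.7])   # weights w1..wM (illustrative)
b = 0.1                          # bias
x = np.array([1.5, 2.0, -0.5])   # one input pattern

y_explicit = w @ x + b                       # y = w^T x + b
w_aug = np.concatenate(([b], w))             # set w0 = b
x_aug = np.concatenate(([1.0], x))           # dummy input x0 = 1
y_dummy = w_aug @ x_aug                      # y = w^T x with the dummy input

assert np.isclose(y_explicit, y_dummy)       # both forms give the same output
```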

Determining, i.e. learning, the optimal linear discriminant

• First we must define an objective function, i.e. the goal of learning.
• Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
• Objective: minimize the sum-squared error over all patterns x_n (a small sketch follows below):

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

• Note the notation: x_n denotes a pattern vector, x_n = \{x_1, \ldots, x_M\}_n.
• We can define the desired class as:

  c_n = \begin{cases} 0 & x_n \in \text{class 1} \\ 1 & x_n \in \text{class 2} \end{cases}
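Written directly from this formula, the objective is a one-liner; the data below are random placeholders with a dummy input x_0 = 1 for the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 2
X = rng.normal(size=(N, M + 1)); X[:, 0] = 1.0     # patterns as rows, dummy x0 = 1
c = rng.integers(0, 2, size=N).astype(float)       # class labels c_n in {0, 1}

def sum_squared_error(w, X, c):
    """E = 1/2 * sum_n (w^T x_n - c_n)^2"""
    return 0.5 * np.sum((X @ w - c) ** 2)

print(sum_squared_error(np.zeros(M + 1), X, c))    # error of the all-zero weights
```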


We’ve seen this before: curve fitting

[Figure: data points (x_n, t_n) generated as t = sin(2πx) + noise, with a fitted curve y(x_n, w)]

example from Bishop (2006), Pattern Recognition and Machine Learning

Neural networks compared to polynomial curve fitting

  y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

  E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2

For the linear network, M = 1 and there are multiple input dimensions.

[Figure: four panels of polynomial fits of different order M to the curve-fitting data; example from Bishop (2006), Pattern Recognition and Machine Learning]

General form of a linear network

• A linear neural network is simply a linear transformation of the input:

  y_j = \sum_{i=0}^{M} w_{i,j} x_i

• Or, in matrix-vector form: y = W x
• Multiple outputs correspond to multivariate regression.

[Network diagram: outputs y_1, ..., y_K; weights w_{ij}; inputs x_0 = 1 (bias), x_1, ..., x_M]

Training the network: Optimization by gradient descent

• We can adjust the weights incrementally to minimize the objective function.
• This is called gradient descent.
• Or gradient ascent if we're maximizing.
• The gradient descent rule for weight w_i is:

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

• Or in vector form:

  w^{t+1} = w^t - \epsilon \frac{\partial E}{\partial w}

• For gradient ascent, the sign of the gradient step changes.

[Figure: successive weight estimates w^1, w^2, w^3, w^4 descending a contour plot of the error surface]


Computing the gradient

• Idea: minimize the error by gradient descent.
• Take the derivative of the objective function with respect to the weights:

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

  \frac{\partial E}{\partial w_i}
    = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n)\, x_{i,n}
    = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_{i,n}

• And in vector form (implemented in the sketch below):

  \frac{\partial E}{\partial w} = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_n
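In code, the vectorized gradient is a single matrix product; the finite-difference comparison below is just a sanity check on synthetic placeholder data.

```python
import numpy as np

def grad_E(w, X, c):
    """dE/dw = sum_n (w^T x_n - c_n) x_n, for all patterns at once (X has one pattern per row)."""
    errors = X @ w - c      # shape (N,): w^T x_n - c_n
    return X.T @ errors     # shape (M+1,): sum over patterns of errors_n * x_n

# Sanity check against a numerical (finite-difference) gradient.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)); X[:, 0] = 1.0          # dummy x0 = 1 column
c = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)
E = lambda w: 0.5 * np.sum((X @ w - c) ** 2)
h = 1e-6
numeric = np.array([(E(w + h * np.eye(3)[i]) - E(w - h * np.eye(3)[i])) / (2 * h)
                    for i in range(3)])
assert np.allclose(numeric, grad_E(w, X, c), atol=1e-4)
```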

Simulation: learning the decision boundary

• Each iteration updates the weights using the gradient:

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i},
  \qquad
  \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_{i,n}

• Epsilon is a small value: \epsilon = 0.1/N
• Epsilon too large: learning diverges.
• Epsilon too small: convergence is slow.

[Figure, shown over a sequence of slides: the decision boundary after successive iterations in the petal length vs. petal width plane, and the learning curve of error vs. iteration]


[Figure: the learned decision boundary in the petal length vs. petal width plane, and the learning curve of error vs. iteration]

• Learning converges onto the solution that minimizes the error.
• For linear networks, this is guaranteed to converge to the minimum.
• It is also possible to derive a closed-form solution (covered later); a sketch of both is given below.
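Here is a sketch of the whole procedure described above: gradient descent with ε = 0.1/N, compared against the closed-form least-squares solution. The two-class data are synthetic stand-ins for the Iris measurements, and the inputs are standardized first (an extra practical step, not on the slides, that keeps the fixed step size well inside the stable range).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
# Synthetic stand-ins for the Iris C2/C3 measurements (petal length, petal width).
A = rng.normal([4.5, 1.4], 0.3, size=(N // 2, 2))   # class 1 cloud
B = rng.normal([5.8, 2.1], 0.3, size=(N // 2, 2))   # class 2 cloud
features = np.vstack([A, B])
# Standardize the inputs (practical extra, not on the slides) so that the
# fixed step size eps = 0.1/N stays well inside the stable range.
features = (features - features.mean(0)) / features.std(0)
X = np.hstack([np.ones((N, 1)), features])          # dummy x0 = 1 for the bias
c = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

w = np.zeros(3)
eps = 0.1 / N
errors = []
for t in range(2000):
    grad = X.T @ (X @ w - c)     # dE/dw = sum_n (w^T x_n - c_n) x_n
    w = w - eps * grad           # w_{t+1} = w_t - eps * dE/dw
    errors.append(0.5 * np.sum((X @ w - c) ** 2))

# Closed-form least-squares solution, for comparison.
w_closed, *_ = np.linalg.lstsq(X, c, rcond=None)
print("gradient descent:", w.round(4))
print("closed form     :", w_closed.round(4))
print("error: start %.2f -> end %.4f" % (errors[0], errors[-1]))
```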


Learning is slow when epsilon is too small

• Here, larger step sizes would converge more quickly to the minimum

[Figure: error surface E(w) with many small gradient steps slowly approaching the minimum]

Divergence when epsilon is too large

• If the step size is too large, learning can oscillate between different sides of the minimum

[Figure: error surface E(w) with large gradient steps oscillating across the minimum]


Multi-layer networks

• Can we extend our network to multiple layers? We have:

  y_j = \sum_i w_{i,j} x_i
  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i

• Or in matrix form:

  z = V y = V W x

• Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW (see the check below).
• It is not more powerful. How do we address this?
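A three-line check of this collapse with random matrices (shapes chosen arbitrarily).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # first layer:  3 inputs  -> 4 hidden units
V = rng.normal(size=(2, 4))   # second layer: 4 hidden  -> 2 outputs
x = rng.normal(size=3)

z_two_layer = V @ (W @ x)     # z = V y = V (W x)
U = V @ W                     # single equivalent weight matrix
z_one_layer = U @ x

assert np.allclose(z_two_layer, z_one_layer)   # the two-layer linear net collapses
```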

Non-linear neural networks

• Idea: introduce a non-linearity:

  y_j = f\left(\sum_i w_{i,j} x_i\right)

• Now, multiple layers are not equivalent:

  z_k = f\left(\sum_j v_{j,k} y_j\right) = f\left(\sum_j v_{j,k} f\left(\sum_i w_{i,j} x_i\right)\right)

• Common non-linearities (written out in the sketch below):
  - threshold: y = 0 for x < 0, y = 1 for x ≥ 0
  - sigmoid: y = \frac{1}{1 + \exp(-x)}

[Figure: plots of the threshold and sigmoid functions]
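The two non-linearities written out, plus a check that with a non-linearity in place the two-layer network no longer collapses into a single matrix (random weights, purely illustrative).

```python
import numpy as np

def threshold(x):
    """0 for x < 0, 1 for x >= 0."""
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    """y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# With a non-linearity between layers, the composition no longer collapses:
rng = np.random.default_rng(0)
W, V = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
z_nonlinear = sigmoid(V @ sigmoid(W @ x))
z_collapsed = sigmoid((V @ W) @ x)
print(np.allclose(z_nonlinear, z_collapsed))   # False in general
```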


Modeling logical operators

[Figure: three plots in the (x1, x2) plane showing the positive and negative cases of x1 AND x2, x1 OR x2, and x1 XOR x2]

• A one-layer binary-threshold network, y_j = f(\sum_i w_{i,j} x_i) with f the threshold function above, can implement the logical operators AND and OR, but not XOR.
• Why not? (A small sketch follows.)
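One concrete choice of weights that makes a single threshold unit compute AND and OR (many choices work), together with a brute-force check over a weight grid that no single unit reproduces XOR.

```python
import numpy as np

def threshold_unit(w0, w1, w2):
    """A single unit y = f(w0 + w1*x1 + w2*x2) with the 0/1 threshold f."""
    return lambda x1, x2: 1 if (w0 + w1 * x1 + w2 * x2) >= 0 else 0

AND = threshold_unit(-1.5, 1.0, 1.0)   # fires only when x1 + x2 >= 1.5
OR  = threshold_unit(-0.5, 1.0, 1.0)   # fires when x1 + x2 >= 0.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([AND(*x) for x in inputs])       # [0, 0, 0, 1]
print([OR(*x)  for x in inputs])       # [0, 1, 1, 1]

# Brute-force search over a grid of weights: no single threshold unit
# reproduces XOR, because XOR's classes are not linearly separable.
xor_target = [0, 1, 1, 0]
grid = np.linspace(-2, 2, 21)
found = any([threshold_unit(w0, w1, w2)(*x) for x in inputs] == xor_target
            for w0 in grid for w1 in grid for w2 in grid)
print("XOR representable by one threshold unit?", found)   # False
```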

Posterior odds interpretation of a sigmoid
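A sketch of the standard derivation behind this interpretation (a Bayes-rule manipulation, supplied here rather than recovered from the slide): for two classes, the posterior is exactly a sigmoid of the log posterior odds a.

```latex
p(C_1 \mid x)
  = \frac{p(x \mid C_1)\,P(C_1)}{p(x \mid C_1)\,P(C_1) + p(x \mid C_2)\,P(C_2)}
  = \frac{1}{1 + \exp(-a)} = \sigma(a),
\qquad
a = \ln\frac{p(x \mid C_1)\,P(C_1)}{p(x \mid C_2)\,P(C_2)}
```

So a network output passed through a sigmoid can be read as a posterior class probability when its pre-activation estimates the log posterior odds.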


The general classification/regression problem

• desired output: y = \{y_1, \ldots, y_K\}. For classification, y_i = 1 if x \in C_i (class i) and 0 otherwise; for regression, y is arbitrary.
• model: defined by M parameters \theta = \{\theta_1, \ldots, \theta_M\}, e.g. a decision tree or a multi-layer neural network.
• data: a set of T observations D = \{x_1, \ldots, x_T\}, each an N-dimensional vector x_i = \{x_1, \ldots, x_N\}_i (binary, discrete, or continuous).

Given the data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.

A general multi-layer neural network

• Error function is defined as before, where we use the target vector tn to define the desired output for network output yn.

  E = \frac{1}{2} \sum_{n=1}^{N} \left(y_n(x_n, W_{1:L}) - t_n\right)^2

• The “forward pass” computes the outputs at each layer (sketched in code below):

  y_j^l = f\left(\sum_i w_{i,j}^l\, y_i^{l-1}\right), \qquad l = \{1, \ldots, L\},

  where the input is x \equiv y^0 and the network output is y^L.

[Network diagram: input units x_0 = 1, x_1, ..., x_M; hidden layers with units y_0 = 1, y_1, ..., y_K; output units y_1, ..., y_K; weights w_{ij} between successive layers]
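A compact sketch of this forward pass for a stack of sigmoid layers. The layer sizes, weights, and input are arbitrary placeholders; each weight matrix carries an extra first column that multiplies a constant 1, playing the role of the bias unit y_0 = 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """Forward pass y^l = f(W^l y^(l-1)), l = 1..L, with y^0 = x.
    Each W^l has an extra first column multiplying a constant 1 (the bias unit)."""
    y = x
    activations = [y]
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))   # prepend the bias unit y_0 = 1
        activations.append(y)
    return activations                                # activations[-1] is the output y^L

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                                  # input, two hidden layers, output
weights = [0.5 * rng.normal(size=(sizes[l + 1], sizes[l] + 1))
           for l in range(len(sizes) - 1)]
x = rng.normal(size=sizes[0])
print(forward(x, weights)[-1])                        # network output y^L
```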


Deriving the gradient for a sigmoid neural network

• The mathematical procedure for training is gradient descent: the same as before, except the gradients are more complex to derive.
• New problem: local minima.

  E = \frac{1}{2} \sum_{n=1}^{N} \left(y_n(x_n, W_{1:L}) - t_n\right)^2

• Convenient fact for the sigmoid non-linearity:

  \frac{d\sigma(x)}{dx} = \frac{d}{dx}\,\frac{1}{1 + \exp(-x)} = \sigma(x)\,(1 - \sigma(x))

• The backward pass computes the gradients: back-propagation (a minimal sketch follows).

  W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W}
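A minimal back-propagation sketch for a two-layer sigmoid network under this squared-error objective, using the σ(x)(1 − σ(x)) identity. The sizes and data are placeholders, bias units are omitted for brevity, and the gradient is checked against a finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # one input pattern (biases omitted for brevity)
t = np.array([1.0])                 # target
W1 = 0.5 * rng.normal(size=(3, 4))  # input -> hidden weights
W2 = 0.5 * rng.normal(size=(1, 3))  # hidden -> output weights

def loss_and_grads(W1, W2, x, t):
    # forward pass
    h = sigmoid(W1 @ x)                     # hidden layer activations
    y = sigmoid(W2 @ h)                     # output layer activations
    E = 0.5 * np.sum((y - t) ** 2)
    # backward pass, using d(sigma)/dx = sigma * (1 - sigma)
    delta2 = (y - t) * y * (1 - y)          # error at the output pre-activation
    dW2 = np.outer(delta2, h)
    delta1 = (W2.T @ delta2) * h * (1 - h)  # error propagated to the hidden layer
    dW1 = np.outer(delta1, x)
    return E, dW1, dW2

E, dW1, dW2 = loss_and_grads(W1, W2, x, t)

# Finite-difference check of one weight's gradient.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss_and_grads(W1p, W2, x, t)[0] - loss_and_grads(W1m, W2, x, t)[0]) / (2 * eps)
print(np.isclose(numeric, dW1[0, 0]))        # True: backprop gradient matches

# One gradient-descent step: W <- W - eps_lr * dE/dW
eps_lr = 0.1
W1, W2 = W1 - eps_lr * dW1, W2 - eps_lr * dW2
```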

Applications: Driving (output is analog: steering direction)

• Real example: a network with 1 hidden layer (4 hidden units)
• Learns to drive on roads
• Demonstrated at highway speeds over 100s of miles

D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishing, 1993.

• Training data: real image input plus the corresponding steering angle
• Important: conditioning of the training data to generate new examples avoids overfitting


Hand-written digits: LeNet (real example)

• Takes as input an image of a handwritten digit
• Each pixel is an input unit
• Complex network with many layers
• Output is the digit class
• Tested on a large (50,000+) database of handwritten samples
• Runs in real time
• Used commercially


LeNet

Very low error rate (<< 1%)

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998. http://yann.lecun.com/exdb/lenet/


Object recognition

• LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004. • http://www.cs.nyu.edu/~yann/research/norb/


Summary

• Decision boundaries
  - Bayes optimal
  - linear discriminant
  - linear separability
• Classification vs. regression
• Optimization by gradient descent
• Degeneracy of a multi-layer linear network
• Non-linearities: threshold, sigmoid, others?
• Issues:
  - very general architecture, can solve many problems
  - large number of parameters: need to avoid overfitting
  - usually requires a large amount of data, or a special architecture
  - local minima, training can be slow, need to set the step size
