
Introduction to Artificial Neural Networks

Vuong Le with Hao Tang’s slides

Part of the content of these slides is from the Internet (possibly with modifications). The lecturer does not claim ownership of any of the content of these slides.

Outline

• Biological Inspirations
• Applications & Properties of ANN
• Perceptron
• Multi-Layer Perceptrons
• Error Back-propagation Algorithm
• Remarks on ANN

Biological Inspirations

• Humans perform complex tasks like vision, motor control, or language understanding very well
• One way to build intelligent machines is to try to imitate (the organizational principles of) the human brain

Biological / Artificial Neural Networks

• ANNs have been widely used in various domains for:
◦ approximation
◦ etc.

Perceptron

• A perceptron
◦ takes a vector of real-valued inputs
◦ calculates a linear combination of the inputs
◦ outputs +1 if the result is greater than some threshold and -1 (or 0) otherwise

Perceptron

• To simplify notation, assume an additional constant input $x_0 = 1$. We can then write the perceptron function as

$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$
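To make this concrete, here is a minimal NumPy sketch of the thresholded perceptron just described (the function and variable names are my own, not from the slides):

```python
import numpy as np

def perceptron_output(w, x):
    """Thresholded perceptron: +1 if w . [1, x] > 0, else -1.

    w -- weight vector of length n+1 (w[0] is the bias weight w0)
    x -- input vector of length n
    """
    x_aug = np.concatenate(([1.0], x))      # prepend the constant input x0 = 1
    return 1 if np.dot(w, x_aug) > 0 else -1

# Example: weights w0 = -0.5, w1 = 1, w2 = 1
w = np.array([-0.5, 1.0, 1.0])
print(perceptron_output(w, np.array([1.0, 0.0])))   # -> 1
print(perceptron_output(w, np.array([0.0, 0.0])))   # -> -1
```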

Representational Power of Perceptron

• The perceptron represents a decision surface (a hyperplane) in the n-dimensional space of instances

[Figure: points labeled + and - in the (x1, x2) plane, separated by a line: linearly separable data]

Boolean Functions

• A single perceptron can be used to represent many Boolean functions
◦ 1 (true); 0 (false)
• Perceptrons can represent all of the primitive Boolean functions
◦ AND, OR, NOT

Implementing AND

[Diagram: inputs x0 = 1, x1, x2 with weights w0 = -1.5, w1 = 1, w2 = 1 feeding a threshold unit ∑ that outputs o(x1, x2)]

$o(x_1, x_2) = \begin{cases} 1 & \text{if } -1.5 + x_1 + x_2 > 0 \\ 0 & \text{otherwise} \end{cases}$

Implementing OR

[Diagram: inputs x0 = 1, x1, x2 with weights w0 = -0.5, w1 = 1, w2 = 1 feeding a threshold unit ∑ that outputs o(x1, x2)]

$o(x_1, x_2) = \begin{cases} 1 & \text{if } -0.5 + x_1 + x_2 > 0 \\ 0 & \text{otherwise} \end{cases}$

Implementing NOT

[Diagram: inputs x0 = 1 and x1 with weights w0 = 0.5 and w1 = -1 feeding a threshold unit ∑ that outputs o(x1)]

$o(x_1) = \begin{cases} 1 & \text{if } 0.5 - x_1 > 0 \\ 0 & \text{otherwise} \end{cases}$
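A small sketch (not from the slides) that plugs the weights above into a 0/1 threshold unit and checks the three truth tables:

```python
def threshold_unit(w0, weights, inputs):
    """0/1 perceptron: 1 if w0 + sum(w_i * x_i) > 0, else 0."""
    s = w0 + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else 0

AND = lambda x1, x2: threshold_unit(-1.5, [1, 1], [x1, x2])
OR  = lambda x1, x2: threshold_unit(-0.5, [1, 1], [x1, x2])
NOT = lambda x1:     threshold_unit(0.5, [-1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
print("NOT 0 ->", NOT(0), " NOT 1 ->", NOT(1))
```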

The XOR Function

• Unfortunately, some Boolean functions cannot be represented by a single perceptron

[Diagram: a single perceptron with inputs x0 = 1, x1, x2 and weights w0, w1, w2 attempting to compute o(x1, x2) = XOR(x1, x2)]

For the perceptron to compute XOR, the weights would have to satisfy:

$w_0 + 0 \cdot w_1 + 0 \cdot w_2 \le 0$
$w_0 + 0 \cdot w_1 + 1 \cdot w_2 > 0$
$w_0 + 1 \cdot w_1 + 0 \cdot w_2 > 0$
$w_0 + 1 \cdot w_1 + 1 \cdot w_2 \le 0$

There is no assignment of values to w0, w1, and w2 that satisfies all of the above inequalities. XOR cannot be represented!

Linear Separability

[Figure: the Boolean AND function is linearly separable; the Boolean XOR function is not]

Representation Theorem: Perceptrons can only represent linearly separable functions. That is, the decision surface separating the output values has to be a plane. (Minsky & Papert, 1969)

Remarks on Perceptron

• Perceptrons can represent all the primitive Boolean functions
◦ AND, OR, and NOT
• Some Boolean functions cannot be represented by a single perceptron
◦ such as the XOR function
• Every Boolean function can be represented by some combination of
◦ AND, OR, and NOT
• We want networks of perceptrons…

Implementing XOR with a Multi-Layer Perceptron (MLP)

[Diagram: a two-layer network in which an OR unit and a NAND unit each take x1 and x2 as inputs, and an AND unit combines their outputs]

x1 XOR x2 = (x1 OR x2) AND (NOT(x1 AND x2))
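A sketch of this two-layer construction, with an OR unit and a NAND unit in the hidden layer and an AND unit at the output (the particular NAND weights are my own choice; any weights implementing NAND would do):

```python
def unit(w0, weights, inputs):
    """0/1 threshold unit: 1 if w0 + sum(w_i * x_i) > 0, else 0."""
    return 1 if w0 + sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

def xor(x1, x2):
    h_or   = unit(-0.5, [1, 1], [x1, x2])       # x1 OR x2
    h_nand = unit(1.5, [-1, -1], [x1, x2])      # NOT(x1 AND x2)
    return unit(-1.5, [1, 1], [h_or, h_nand])   # AND of the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))   # prints 0, 1, 1, 0
```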

Multi-Layer Perceptrons (MLP)

Representation Power of MLP

• Conjunction of piece-wise linear decision surfaces

Representation Power of ANN

• Boolean functions: Every Boolean function can be represented exactly by some network with two layers of units
• Continuous functions: Every bounded continuous function can be approximated with arbitrarily small error (under a finite norm) by a network with two layers of units
• Arbitrary functions: Any function can be approximated to arbitrary accuracy by a network with three layers of units

Neural Network Design for Non-Closed-Form Problems

[Diagram: training process: training data {(x_i, y_i)} is used to train the neural network; testing process: the trained network maps testing data x* to result y*]

Definition of Training Error

• Training error E: a function of the weight vector over the training data set D

$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

Here $o_d$ is the output of an unthresholded perceptron (a linear unit) on training example d, and $t_d$ is the corresponding target output.

• To reduce the error E, update the weight vector w in the direction of steepest descent along the error surface

$\nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_n} \right]$

$\vec{w} \leftarrow \vec{w} + (-\eta \nabla E(\vec{w}))$

Gradient Descent Weight Update Rule

$\vec{w} \leftarrow \vec{w} + (-\eta \nabla E(\vec{w}))$

$w_i \leftarrow w_i + \left(-\eta \frac{\partial E}{\partial w_i}\right), \qquad \frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id})$

Gradient Descent Search Algorithm

repeat
    ∆w ← 0
    for each training example x with target t(x):
        o(x) ← w · x
        for each wi: ∆wi ← ∆wi + η (t(x) − o(x)) xi
    for each wi: wi ← wi + ∆wi
until (termination condition)
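A NumPy sketch of this batch gradient-descent loop for a linear unit o(x) = w · x (the dataset, learning rate, and names are illustrative, not from the slides):

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=500):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2 with o = w . x.

    X -- (num_examples, num_features) inputs; the first column is the constant x0 = 1
    t -- (num_examples,) target outputs
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # unthresholded outputs o_d = w . x_d
        delta_w = eta * (t - o) @ X    # sum over d of eta * (t_d - o_d) * x_d
        w += delta_w
    return w

# Example: recover the linear target t = -1 + 2 * x1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = -1.0 + 2.0 * X[:, 1]
print(train_linear_unit(X, t))   # approaches [-1.  2.]
```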

Perceptron Learning Rule vs. Delta Rule

Perceptron learning rule:
$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta (t - o) x_i$

Delta rule:
$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta (t - o) x_i$

The perceptron learning rule uses the output of the threshold function (either -1 or +1) for learning. The delta rule uses the net (unthresholded) output, without further mapping into the output values -1 or +1.
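The contrast in code, for a single training example x with target t (a sketch with my own function names; both rules apply ∆wi = η(t − o)xi and differ only in how o is computed):

```python
import numpy as np

def perceptron_rule_step(w, x, t, eta=0.1):
    """Perceptron learning rule: o is the thresholded output (+1 or -1)."""
    o = 1 if np.dot(w, x) > 0 else -1
    return w + eta * (t - o) * x

def delta_rule_step(w, x, t, eta=0.1):
    """Delta rule: o is the unthresholded (net) output w . x."""
    o = np.dot(w, x)
    return w + eta * (t - o) * x

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])             # x[0] = 1 is the constant input
print(perceptron_rule_step(w, x, t=+1))   # [0.2 0.  0.2]
print(delta_rule_step(w, x, t=+1))        # [0.1 0.  0.1]
```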

Perceptron Learning Algorithm

• Guaranteed to converge within a finite time if the training data are linearly separable and η is sufficiently small
• If the data are not linearly separable, convergence is not assured

Feed-forward Networks

Definition of Error for MLP

$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_i \left(t_i^{(d)} - o_i^{(d)}\right)^2$

$\nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w^{(o)}_{11}}, \frac{\partial E}{\partial w^{(o)}_{12}}, \dots, \frac{\partial E}{\partial w^{(h)}_{11}}, \frac{\partial E}{\partial w^{(h)}_{12}}, \dots, \frac{\partial E}{\partial w^{(h)}_{ij}} \right]$

$\vec{w} \leftarrow \vec{w} + (-\eta \nabla E(\vec{w}))$

Output Layer's Weight Update

For a single training example d (the superscript (d) is omitted below), with $o_i = \Theta(\sigma_i) = \frac{1}{1 + e^{-\sigma_i}}$ and $\sigma_i = \sum_m w_{im} h_m$:

$\frac{\partial E}{\partial w_{ij}}
= \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_l (t_l - o_l)^2
= \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_l \Big(t_l - \Theta\big(\textstyle\sum_m w_{lm} h_m\big)\Big)^2
= \frac{\partial}{\partial w_{ij}} \frac{1}{2} \Big(t_i - \Theta\big(\textstyle\sum_m w_{im} h_m\big)\Big)^2$

$= \frac{\partial E}{\partial o_i} \cdot \frac{\partial o_i}{\partial \sigma_i} \cdot \frac{\partial \sigma_i}{\partial w_{ij}}
= \frac{\partial\, \frac{1}{2}(t_i - o_i)^2}{\partial o_i}
\cdot \frac{\partial}{\partial \sigma_i}\Big(\frac{1}{1 + e^{-\sigma_i}}\Big)
\cdot \frac{\partial}{\partial w_{ij}}\Big(\sum_m w_{im} h_m\Big)$

$= -(t_i - o_i)\, o_i (1 - o_i)\, h_j$

Hidden Layer's Weight Update

∂E  Error of hj ∑ wij i ∂wij  Distribute error to inputs proportional∝ to weights

• Similar to the output layer:

$\frac{\partial E}{\partial w_{jk}} = \sum_i \left[ -(t_i - o_i)\, o_i (1 - o_i)\, w_{ij} \right] h_j (1 - h_j)\, x_k$

Error Back-propagation

Error Back-propagation Algorithm

initialize all weights to small random numbers
repeat
    for each training example (x, t):
        for each hidden node:  hj ← Θ(∑k wjk xk)
        for each output node:  oi ← Θ(∑j wij hj)
        for each output node's weight:  ∂E/∂wij = −oi (1 − oi)(ti − oi) hj
        for each hidden node's weight:  ∂E/∂wjk = [∑i −oi (1 − oi)(ti − oi) wij] hj (1 − hj) xk
        for each output node's weight:  wij ← wij − η ∂E/∂wij
        for each hidden node's weight:  wjk ← wjk − η ∂E/∂wjk
until (termination condition)
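A compact NumPy sketch of this algorithm for a network with one hidden layer of sigmoid units, following the per-example updates above (layer sizes, learning rate, and the XOR demo are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=8, eta=0.5, epochs=5000, seed=0):
    """Stochastic gradient descent with error back-propagation.

    X -- (num_examples, n_in) inputs (include a constant 1 column as a bias input)
    T -- (num_examples, n_out) targets in [0, 1]
    """
    rng = np.random.default_rng(seed)
    W_h = rng.normal(0.0, 0.5, size=(n_hidden, X.shape[1]))   # hidden-layer weights w_jk
    W_o = rng.normal(0.0, 0.5, size=(T.shape[1], n_hidden))   # output-layer weights w_ij
    for _ in range(epochs):
        for x, t in zip(X, T):
            h = sigmoid(W_h @ x)                        # hidden activations h_j
            o = sigmoid(W_o @ h)                        # outputs o_i
            delta_o = (t - o) * o * (1 - o)             # output error terms
            delta_h = (W_o.T @ delta_o) * h * (1 - h)   # back-propagated hidden terms
            W_o += eta * np.outer(delta_o, h)           # w_ij <- w_ij - eta * dE/dw_ij
            W_h += eta * np.outer(delta_h, x)           # w_jk <- w_jk - eta * dE/dw_jk
    return W_h, W_o

# Example: learn XOR (the first input column is the constant bias input)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = backprop_train(X, T)
print(sigmoid(W_o @ sigmoid(W_h @ X.T)).T.round(2))   # should approach the XOR targets
```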

Generalization, Overfitting, etc.

• Artificial neural networks with a large number of weights tend to overfit the training data
• To increase generalization accuracy, use a validation set
◦ Find the optimal number of perceptrons
◦ Find the optimal number of training iterations
• Stop training when overfitting happens (see the early-stopping sketch below)
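As one way to act on the last point, a sketch of early stopping against a validation set; `train_one_epoch` and `validation_error` are hypothetical callables supplied by the caller, not functions defined on these slides:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop training once the validation error has not improved for `patience` epochs.

    train_one_epoch  -- callable performing one pass of weight updates over the training set
    validation_error -- callable returning the current error on a held-out validation set
    """
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch   # remember the best epoch so far
        elif epoch - best_epoch >= patience:
            break                               # validation error stopped improving
    return best_epoch, best_err
```

In practice one would also keep a copy of the weights from the best epoch and restore them after stopping.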