An Introduction to Neural Networks

Ben Kröse and Patrick van der Smagt

Eighth edition
November

© The University of Amsterdam. Permission is granted to distribute single copies of this book for non-commercial use, as long as it is distributed as a whole in its original form and the names of the authors and the University of Amsterdam are mentioned. Permission is also granted to use this book for non-commercial courses, provided the authors are notified of this beforehand.

The authors can be reached at:

Ben Kröse
Faculty of Mathematics & Computer Science
University of Amsterdam
Kruislaan, Amsterdam
THE NETHERLANDS
email: krose@fwi.uva.nl
URL: http://www.fwi.uva.nl/research/neuro

Patrick van der Smagt
Institute of Robotics and System Dynamics
German Aerospace Research Establishment
P.O. Box, Wessling
GERMANY
email: smagt@dlr.de
URL: http://www.op.dlr.de/FF-DR-RS

Contents

Preface

I  FUNDAMENTALS

  Introduction

  Fundamentals
    A framework for distributed representation
      Processing units
      Connections between units
      Activation and output rules
    Network topologies
    Training of artificial neural networks
      Paradigms of learning
      Modifying patterns of connectivity
    Notation and terminology
      Notation
      Terminology

II  THEORY

  Perceptron and Adaline
    Networks with threshold activation functions
    Perceptron learning rule and convergence theorem
      Example of the Perceptron learning rule
      Convergence theorem
    The original Perceptron
    The adaptive linear element (Adaline)
    Networks with linear activation functions: the delta rule
    Exclusive-OR problem
    Multi-layer perceptrons can do everything
    Conclusions

  Back-Propagation
    Multi-layer feed-forward networks
    The generalised delta rule
    Understanding back-propagation
    Working with back-propagation
    An example
    Other activation functions
    Deficiencies of back-propagation
    Advanced algorithms
    How good are multi-layer feed-forward networks?
      The effect of the number of learning samples
      The effect of the number of hidden units
    Applications

  Recurrent Networks
    The generalised delta-rule in recurrent networks
      The Jordan network
      The Elman network
      Back-propagation in fully recurrent networks
    The Hopfield network
      Description
      Hopfield network as associative memory
      Neurons with graded response
    Boltzmann machines

  Self-Organising Networks
    Competitive learning
      Clustering
      Vector quantisation
    Kohonen network
    Principal component networks
      Introduction
      Normalised Hebbian rule
      Principal component extractor
      More eigenvectors
    Adaptive resonance theory
      Background: Adaptive resonance theory
      ART: The simplified neural network model
      ART: The original model

  Reinforcement learning
    The critic
    The controller network
    Barto's approach: the ASE-ACE combination
      Associative search
      Adaptive critic
      The cart-pole system
    Reinforcement learning versus optimal control

III  APPLICATIONS

  Robot Control
    End-effector positioning
      Camera-robot coordination is function approximation
    Robot arm dynamics
    Mobile robots
      Model based navigation
      Sensor based control

  Vision
    Introduction
    Feed-forward types of networks
    Self-organising networks for image compression
      Back-propagation
      Linear networks
      Principal components as features
    The cognitron and neocognitron
      Description of the cells
      Structure of the cognitron
      Simulation results
    Relaxation types of networks
      Depth from stereo
      Image restoration and image segmentation
      Silicon retina

IV  IMPLEMENTATIONS

  General Purpose Hardware
    The Connection Machine
      Architecture
      Applicability to neural networks
    Systolic arrays

  Dedicated Neuro-Hardware
    General issues
      Connectivity constraints
      Analogue vs. digital
      Optics
      Learning vs. non-learning
    Implementation examples
      Carver Mead's silicon retina
      LEP's LNeuro chip

References

Index

List of Figures

  The basic components of an artificial neural network
  Various activation functions for a unit
  Single layer network with one output and two inputs
  Geometric representation of the discriminant function and the weights
  Discriminant function before and after weight update
  The Perceptron
  The Adaline
  Geometric representation of input space
  Solution of the XOR problem
  A multi-layer network with l layers of units
  The descent in weight space
  Example of function approximation with a feed-forward network
  The periodic function f(x) = sin(2x) sin(x) approximated with sine activation functions
  The periodic function f(x) = sin(2x) sin(x) approximated with sigmoid activation functions
  Slow decrease with conjugate gradient in non-quadratic systems
  Effect of the learning set size on the generalization
  Effect of the learning set size on the error rate
  Effect of the number of hidden units on the network performance
  Effect of the number of hidden units on the error rate
  The Jordan network
  The Elman network
  Training an Elman network to control an object
  Training a feed-forward network to control an object
  The auto-associator network
  A simple competitive learning network
  Example of clustering with normalised vectors
  Determining the winner in a competitive learning network
  Competitive learning for clustering data
  Vector quantisation tracks input density
  A network combining a vector quantisation layer with a feed-forward neural network; this network can be used to approximate functions, the input space being discretised in disjoint subspaces
  Gaussian neuron distance function
  A topology-conserving map converging
  The mapping of a two-dimensional input space on a one-dimensional Kohonen network
  Mexican hat
  Distribution of input samples
  The ART architecture
  The ART neural network
  An example ART run
  Reinforcement learning scheme
  Architecture of a reinforcement learning scheme with critic element
  The cart-pole system
  An exemplar robot manipulator
  Indirect learning system for robotics
  The system used for specialised learning
  A Kohonen network merging the output of two cameras
  The neural model proposed by Kawato et al.
  The neural network used by Kawato et al.
  The desired joint patterns; several joints have similar time patterns
  Schematic representation of the stored rooms, and the partial information which is available from a single sonar scan
  The structure of the network for the autonomous land vehicle
  Input image for the network
  Weights of the PCA network
  The basic structure of the cognitron
  Cognitron receptive regions
  Two learning iterations in the cognitron
  Feeding back activation values in the cognitron
  The Connection Machine system organisation
  Typical use of a systolic array
  The Warp system architecture
  Connections between M input and N output neurons
  Optical implementation of matrix multiplication
  The photo-receptor used by Mead
  The resistive layer (a) and, enlarged, a single node (b)
  The LNeuro chip

Preface

This manuscript attempts to provide the reader with an insight into artificial neural networks. Back then, the absence of any state-of-the-art textbook forced us into writing our own. However, in the meantime a number of worthwhile textbooks have been published which can be used for background and in-depth information. We are aware of the fact that, at times, this manuscript may prove to be too thorough or not thorough enough for a complete understanding of the material; therefore, further reading material can be found in some excellent textbooks such as (Hertz, Krogh, & Palmer), (Ritter, Martinetz, & Schulten), (Kohonen), (Anderson & Rosenfeld), (DARPA), (McClelland & Rumelhart), and (Rumelhart & McClelland).

Some of the material in this book, especially the parts on applications and implementations, contains timely material and thus may heavily change throughout the ages. The choice of describing robotics and vision as neural network applications coincides with the neural network research interests of the authors.

Much of the material in this book has been written with the help of others: Joris van Dam and Anuj Dev at the University of Amsterdam wrote and contributed to several chapters, and the basis of one chapter was formed by a report of Gerard Schram, also at the University of Amsterdam. Furthermore, we express our gratitude to those people out there in Net-Land who gave us feedback on this manuscript, especially Michiel van der Korst and Nicolas Maudit, who pointed out quite a few of our goof-ups. We owe them many kwartjes for their help.

The seventh edition is not drastically different from the sixth one; we corrected some typing errors, added some examples, and deleted some obscure parts of the text. In the eighth edition, symbols used in the text have been globally changed. Also, the chapter on recurrent networks has been (albeit marginally) updated. The index still requires an update, though.

Amsterdam / Oberpfaffenhofen, November

Patrick van der Smagt
Ben Kröse

Part I

FUNDAMENTALS

Introduction

A first wave of interest in neural networks (also known as 'connectionist models' or 'parallel distributed processing') emerged after the introduction of simplified neurons by McCulloch and Pitts in 1943 (McCulloch & Pitts). These neurons were presented as models of biological neurons and as conceptual components for circuits that could perform computational tasks.

When Minsky and Papert published their book in 1969 (Minsky & Papert), in which they showed the deficiencies of perceptron models, most neural network funding was redirected and researchers left the field. Only a few researchers continued their efforts, most notably Teuvo Kohonen, Stephen Grossberg, James Anderson, and Kunihiko Fukushima.

The interest in neural networks re-emerged only after some important theoretical results were attained in the early eighties (most notably the discovery of error back-propagation), and new hardware developments increased the processing capacities. This renewed interest is reflected in the number of scientists, the amounts of funding, the number of large conferences, and the number of journals associated with neural networks. Nowadays most universities have a neural networks group within their psychology, physics, computer science, or biology departments.

Artificial neural networks can be most adequately characterised as 'computational models' with particular properties, such as the ability to adapt or learn, to generalise, or to cluster or organise data, and whose operation is based on parallel processing. However, many of the above-mentioned properties can be attributed to existing (non-neural) models; the intriguing question is to which extent the neural approach proves to be better suited for certain applications than existing models. To date an unequivocal answer to this question has not been found.

Often parallels with biological systems are described. However, there is still so little known (even at the lowest cell level) about biological systems that the models we are using for our artificial neural systems seem to introduce an oversimplification of the biological models.

In this course we give an introduction to artificial neural networks. The point of view we take is that of a computer scientist. We are not concerned with the psychological implication of the networks, and we will at most occasionally refer to biological neural models. We consider neural networks as an alternative computational scheme rather than anything else.

These lecture notes start with a chapter in which a number of fundamental properties are discussed. The next chapter describes a number of classical approaches, as well as the discussion on their limitations which took place in the early sixties. The chapter on back-propagation continues with the description of attempts to overcome these limitations and introduces the back-propagation learning algorithm. The chapter on recurrent networks removes the restraint that there are no cycles in the network graph. Self-organising networks, which require no external teacher, are discussed next; then reinforcement learning is introduced. Two further chapters focus on applications of neural networks in the fields of robotics and image processing respectively. The final chapters discuss implementational aspects.


 Fundamentals

The artificial neural networks which we describe in this course are all variations on the parallel distributed processing (PDP) idea. The architecture of each network is based on very similar building blocks which perform the processing. In this chapter we first discuss these processing units and discuss different network topologies. Learning strategies, as a basis for an adaptive system, will be presented in the last section.

A framework for distributed representation

An artificial network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections.

A set of major aspects of a parallel distributed model can be distinguished (cf. Rumelhart and McClelland (McClelland & Rumelhart; Rumelhart & McClelland)):

- a set of processing units ('neurons', 'cells');
- a state of activation y_k for every unit, which is equivalent to the output of the unit;
- connections between the units; generally each connection is defined by a weight w_jk which determines the effect which the signal of unit j has on unit k;
- a propagation rule, which determines the effective input s_k of a unit from its external inputs;
- an activation function F_k, which determines the new level of activation based on the effective input s_k(t) and the current activation y_k(t) (i.e., the update);
- an external input (aka bias, offset) θ_k for each unit;
- a method for information gathering (the learning rule);
- an environment within which the system must operate, providing input signals and, if necessary, error signals.

The figure below illustrates these basics, some of which will be discussed in the next sections.

Processing units

Each unit performs a relatively simple job: receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time.

Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i), which receive data from outside the neural network; output units (indicated by an index o), which send data out of the neural network; and hidden units (indicated by an index h), whose input and output signals remain within the neural network.

[Figure: The basic components of an artificial neural network. The propagation rule used here is the standard weighted summation, s_k = Σ_j w_jk y_j + θ_k.]

During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.
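The difference between the two update regimes can be made concrete with a small sketch (Python with NumPy; the four-unit toy network, the sigmoid activation, and all names are our own illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy fully connected net of 4 units with random weights and zero biases.
W = rng.normal(size=(4, 4))
theta = np.zeros(4)

def f(s):
    """Sigmoid activation, used here only as a concrete example."""
    return 1.0 / (1.0 + np.exp(-s))

def update_synchronous(y):
    """All units compute their new activation from the same old state."""
    return f(W @ y + theta)

def update_asynchronous(y):
    """A single randomly chosen unit updates; the others keep their value."""
    y = y.copy()
    k = rng.integers(len(y))
    y[k] = f(W[k] @ y + theta[k])
    return y

y = np.full(4, 0.5)
print(update_synchronous(y))   # all four activations recomputed
print(update_asynchronous(y))  # exactly one activation recomputed
```

Under asynchronous updating, later units in a sweep would already see the new values of earlier ones, which is what gives the two regimes different dynamics.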

Connections between units

In most cases we assume that each unit provides an additive contribution to the input of the unit with which it is connected. The total input to unit k is simply the weighted sum of the separate outputs from each of the connected units plus a bias or offset term θ_k:

    s_k(t) = Σ_j w_jk(t) y_j(t) + θ_k(t).

The contribution for positive w_jk is considered as an excitation and for negative w_jk as inhibition. In some cases more complex rules for combining inputs are used, in which a distinction is made between excitatory and inhibitory inputs. We call units with this propagation rule 'sigma units'.

A different propagation rule, introduced by Feldman and Ballard (Feldman & Ballard), is known as the propagation rule for the sigma-pi unit:

    s_k(t) = Σ_j w_jk(t) Π_m y_jm(t) + θ_k(t).

Often, the y_jm are weighted before multiplication. Although these units are not frequently used, they have their value for gating of input, as well as implementation of lookup tables (Mel).
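The difference between the two propagation rules can be sketched in a few lines (Python; the function names and the grouping of the multiplied inputs are our own illustrative choices):

```python
import numpy as np

def sigma_input(w_k, y, theta_k):
    """Sigma unit: s_k = sum_j w_jk * y_j + theta_k."""
    return w_k @ y + theta_k

def sigma_pi_input(w_k, y_groups, theta_k):
    """Sigma-pi unit: s_k = sum_j w_jk * prod_m y_jm + theta_k.
    y_groups[j] holds the inputs y_j1..y_jm that are multiplied together."""
    products = np.array([np.prod(g) for g in y_groups])
    return w_k @ products + theta_k

w_k, theta_k = np.array([0.5, -1.0]), 0.1
print(sigma_input(w_k, np.array([1.0, 1.0]), theta_k))    # 0.5 - 1.0 + 0.1
# Gating: a zero in the second group switches that whole term off.
print(sigma_pi_input(w_k, [[1.0], [1.0, 0.0]], theta_k))  # 0.5 + 0.0 + 0.1
```

The second call illustrates the gating use mentioned above: one input of a product group acts as a switch for the others.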

Activation and output rules

We also need a rule which gives the effect of the total input on the activation of the unit. We need a function F_k which takes the total input s_k(t) and the current activation y_k(t) and produces a new value of the activation of the unit k:

    y_k(t+1) = F_k( y_k(t), s_k(t) ).


Often, the activation function is a non-decreasing function of the total input of the unit:

    y_k(t+1) = F_k( s_k(t) ) = F_k( Σ_j w_jk(t) y_j(t) + θ_k(t) ),

although activation functions are not restricted to non-decreasing functions. Generally, some sort of threshold function is used: a hard limiting threshold function (a sgn function), or a linear or semi-linear function, or a smoothly limiting threshold. For this smoothly limiting function often a sigmoid (S-shaped) function like

    y_k = F(s_k) = 1 / (1 + e^(-s_k))

is used. In some applications a hyperbolic tangent is used, yielding output values in the range [-1, +1].

[Figure: Various activation functions for a unit: sgn, semi-linear, sigmoid.]
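The three function classes can be sketched as follows (a minimal Python illustration; the exact saturation ranges of the hard-limiting and semi-linear variants are our own choices):

```python
import numpy as np

def sgn(s):
    """Hard-limiting threshold: +1 for positive input, -1 otherwise."""
    return np.where(s > 0, 1.0, -1.0)

def semi_linear(s):
    """Linear around zero, hard-limited to [-1, 1] outside."""
    return np.clip(s, -1.0, 1.0)

def sigmoid(s):
    """Smoothly limiting threshold y = 1/(1 + e^-s); outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

s = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(sgn(s))          # [-1. -1. -1.  1.  1.]
print(semi_linear(s))  # [-1.  -0.5  0.   0.5  1. ]
print(sigmoid(s))      # smoothly rising from about 0.007 to about 0.993
```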

In some cases, the output of a unit can be a stochastic function of the total input of the unit. In that case the activation is not deterministically determined by the neuron input, but the neuron input determines the probability p that a neuron gets a high activation value:

    p(y_k = 1) = 1 / (1 + e^(-s_k / T)),

in which T (cf. temperature) is a parameter which determines the slope of the probability function. This type of unit will be discussed more extensively in the chapter on recurrent networks.

In all networks we describe, we consider the output of a neuron to be identical to its activation level.
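The role of the temperature T can be illustrated with a short sketch (Python; the sampling procedure and all names are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def p_high(s_k, T):
    """Probability of a high activation: p = 1 / (1 + e^(-s_k / T))."""
    return 1.0 / (1.0 + np.exp(-s_k / T))

def stochastic_output(s_k, T):
    """Sample a binary activation for the given total input and temperature."""
    return 1.0 if rng.random() < p_high(s_k, T) else 0.0

# Low T makes the unit nearly deterministic; high T makes it nearly a coin flip.
for T in (0.1, 1.0, 10.0):
    mean = sum(stochastic_output(0.5, T) for _ in range(1000)) / 1000
    print(f"T={T}: p={p_high(0.5, T):.3f}, observed mean={mean:.3f}")
```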

Network topologies

In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data.

As for this pattern of connections, the main distinction we can make is between:

- Feed-forward networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple layers of units, but no feedback connections are present: that is, no connections extending from outputs of units to inputs of units in the same layer or previous layers.
- Recurrent networks, which do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network will evolve to a stable state in which these activations do not change anymore. In other applications, the change of the activation values of the output neurons is significant, such that the dynamical behaviour constitutes the output of the network (Pearlmutter).


Classical examples of feed-forward networks are the Perceptron and the Adaline, which will be discussed in the next chapter. Examples of recurrent networks have been presented by Anderson, Kohonen, and Hopfield, and will be discussed in the chapter on recurrent networks.

Training of artificial neural networks

A neural network has to be configured such that the application of a set of inputs produces (either directly, or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule.

Paradigms of learning

We can categorise the learning situations in two distinct sorts. These are:

- Supervised learning or Associative learning, in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the network (self-supervised).
- Unsupervised learning or Self-organisation, in which an output unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.

Modifying patterns of connectivity

Both learning paradigms discussed above result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (Hebb, 1949). The basic idea is that if two units j and k are active simultaneously, their interconnection must be strengthened. If j receives input from k, the simplest version of Hebbian learning prescribes to modify the weight w_jk with

    Δw_jk = γ y_j y_k,

where γ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit k but the difference between the actual and desired activation for adjusting the weights:

    Δw_jk = γ y_j (d_k - y_k),

in which d_k is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter.

Many variants (often very exotic ones) have been published over the last few years. In the next chapters some of these update rules will be discussed.
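Both rules fit in one line of code each. A minimal sketch (Python; the learning rate γ = 0.1 and the activation values below are arbitrary illustrative numbers):

```python
def hebb_update(w_jk, y_j, y_k, gamma=0.1):
    """Hebbian rule: delta w_jk = gamma * y_j * y_k."""
    return w_jk + gamma * y_j * y_k

def delta_update(w_jk, y_j, y_k, d_k, gamma=0.1):
    """Widrow-Hoff (delta) rule: delta w_jk = gamma * y_j * (d_k - y_k)."""
    return w_jk + gamma * y_j * (d_k - y_k)

w = 0.0
w = hebb_update(w, y_j=1.0, y_k=1.0)            # both units active: w grows to 0.1
w = delta_update(w, y_j=1.0, y_k=0.2, d_k=1.0)  # output below target: w grows again
print(w)  # approximately 0.1 + 0.1 * (1.0 - 0.2) = 0.18
```

Note the qualitative difference: the Hebb rule always strengthens co-active connections, while the delta rule stops changing the weight once the actual activation matches the desired one.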

Notation and terminology

Throughout the years, researchers from different disciplines have come up with a vast number of terms applicable in the field of neural networks. Our computer-scientist point of view enables us to adhere to a subset of the terminology which is less biologically inspired, yet still conflicts arise. Our conventions are discussed below.


Notation

We use the following notation in our formulae. Note that not all symbols are meaningful for all networks, and that in some cases subscripts or superscripts may be left out (e.g., p is often not necessary) or added (e.g., vectors can, contrariwise to the notation below, have indices) where necessary. Vectors are indicated with a bold non-slanted font:

j, k, ... : the unit j, k, ...;
i     : an input unit;
h     : a hidden unit;
o     : an output unit;
x^p   : the pth input pattern vector;
x^p_j : the jth element of the pth input pattern vector;
s^p   : the input to a set of neurons when input pattern vector p is clamped (i.e., presented to the network); often: the input of the network by clamping input pattern vector p;
d^p   : the desired output of the network when input pattern vector p was input to the network;
d^p_j : the jth element of the desired output of the network when input pattern vector p was input to the network;
y^p   : the activation values of the network when input pattern vector p was input to the network;
y^p_j : the activation values of element j of the network when input pattern vector p was input to the network;
W     : the matrix of connection weights;
w_j   : the weights of the connections which feed into unit j;
w_jk  : the weight of the connection from unit j to unit k;
F_j   : the activation function associated with unit j;
γ_jk  : the learning rate associated with weight w_jk;
θ     : the biases to the units;
θ_j   : the bias input to unit j;
U_j   : the threshold of unit j in F_j;
E^p   : the error in the output of the network when input pattern vector p is input;
E     : the energy of the network.

Terminology

Output vs. activation of a unit. Since there is no need to do otherwise, we consider the output and the activation value of a unit to be one and the same thing: that is, the output of each neuron equals its activation value.


Bias, offset, threshold. These terms all refer to a constant (i.e., independent of the network input, but adapted by the learning rule) term which is input to a unit. They may be used interchangeably, although the latter two terms are often envisaged as a property of the activation function. Furthermore, this external input is usually implemented (and can be written) as a weight from a unit with activation value 1.

Number of layers. In a feed-forward network, the inputs perform no computation and their layer is therefore not counted. Thus a network with one input layer, one hidden layer, and one output layer is referred to as a network with two layers. This convention is widely, though not yet universally, used.

Representation vs. learning. When using a neural network one has to distinguish two issues which influence the performance of the system. The first one is the representational power of the network; the second one is the learning algorithm.

The representational power of a neural network refers to the ability of a neural network to represent a desired function. Because a neural network is built from a set of standard functions, in most cases the network will only approximate the desired function, and even for an optimal set of weights the approximation error is not zero.

The second issue is the learning algorithm. Given that there exists a set of optimal weights in the network, is there a procedure to iteratively find this set of weights?

Part II

THEORY

 Perceptron and Adaline

This chapter describes single layer neural networks, including some of the classical approaches to the neural computing and learning problem. In the first part of this chapter we discuss the representational power of the single layer networks and their learning algorithms, and will give some examples of using the networks. In the second part we will discuss the representational limitations of single layer networks.

Two 'classical' models will be described in the first part of the chapter: the Perceptron, proposed by Rosenblatt (Rosenblatt) in the late 1950s, and the Adaline, presented in the early 1960s by Widrow and Hoff (Widrow & Hoff).

Networks with threshold activation functions

A single layer feed-forward network consists of one or more output neurons o, each of which is connected with a weighting factor w_io to all of the inputs i. In the simplest case the network has only two inputs and a single output, as sketched in the figure below (we leave the output index o out). The input of the neuron is the weighted sum of the inputs plus the bias term.

[Figure: Single layer network with one output y and two inputs x_1 and x_2, connected with weights w_1 and w_2.]

The output of the network is formed by the activation of the output neuron, which is some function of the input:

    y = F( Σ_i w_i x_i + θ ),

with the sum running over the two inputs. The activation function F can be linear, so that we have a linear network, or non-linear. In this section we consider the threshold (or Heaviside or sgn) function:

    F(s) = +1 if s > 0, and -1 otherwise.

The output of the network thus is either +1 or -1, depending on the input. The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes. If the total input is positive, the pattern will be assigned to class +1; if the total input is negative, the sample will be assigned to class -1. The separation between the two classes in this case is a straight line, given by the equation:

    w_1 x_1 + w_2 x_2 + θ = 0.

The single layer network represents a linear discriminant function.
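A concrete sketch of such a two-input threshold unit (Python; the weight values w = (1, 1) and θ = -1 are hypothetical, chosen only to give a visible decision boundary):

```python
import numpy as np

def classify(x, w, theta):
    """Threshold unit: class +1 if w.x + theta > 0, class -1 otherwise."""
    return 1 if w @ x + theta > 0 else -1

w = np.array([1.0, 1.0])
theta = -1.0
# The decision boundary is the line x1 + x2 - 1 = 0, i.e. x2 = -x1 + 1;
# the weight vector (1, 1) is perpendicular to that line.
print(classify(np.array([1.0, 1.0]), w, theta))  # 1 + 1 - 1 > 0  -> class +1
print(classify(np.array([0.0, 0.0]), w, theta))  # 0 + 0 - 1 < 0  -> class -1
```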

A geometrical representation of the linear threshold neural network is given in the figure below. The equation above can be written as

    x_2 = -(w_1 / w_2) x_1 - θ / w_2,

and we see that the weights determine the slope of the line and the bias determines the offset, i.e., how far the line is from the origin. Note that also the weights can be plotted in the input space: the weight vector is always perpendicular to the discriminant function.

[Figure: Geometric representation of the discriminant function and the weights.]

Now that we have shown the representational power of the single layer network with linear threshold units, we come to the second issue: how do we learn the weights and biases in the network? We will describe two learning methods for these types of networks: the 'perceptron' learning rule and the 'delta' or 'LMS' rule. Both methods are iterative procedures that adjust the weights. A learning sample is presented to the network. For each weight, the new value is computed by adding a correction to the old value. The threshold is updated in the same way:

    w_i(t+1) = w_i(t) + Δw_i(t),
    θ(t+1) = θ(t) + Δθ(t).

The learning problem can now be formulated as: how do we compute Δw_i(t) and Δθ(t) in order to classify the learning patterns correctly?

Perceptron learning rule and convergence theorem

Suppose we have a set of learning samples consisting of an input vector x and a desired output d(x). For a classification task the d(x) is usually +1 or -1. The perceptron learning rule is very simple and can be stated as follows:

1. Start with random weights for the connections.
2. Select an input vector x from the set of training samples.
3. If y ≠ d(x) (the perceptron gives an incorrect response), modify all connections w_i according to: Δw_i = d(x) x_i.
4. Go back to 2.

Note that the procedure is very similar to the Hebb rule; the only difference is that, when the network responds correctly, no connection weights are modified. Besides modifying the weights, we must also modify the threshold θ. This θ is considered as a connection w_0 between the output neuron and a 'dummy' predicate unit which is always on: x_0 = 1. Given the perceptron learning rule as stated above, this threshold is modified according to:

    Δθ = 0 if the perceptron responds correctly; Δθ = d(x) otherwise.
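The four steps above, together with the threshold update, translate almost literally into code. A sketch (Python; the toy data set, the fixed number of sweeps, and all names are our own choices):

```python
import numpy as np

def train_perceptron(samples, sweeps=25, seed=1):
    """Perceptron learning rule; theta is updated like a weight w0 from a
    dummy unit that is always on (x0 = 1)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=len(samples[0][0]))  # 1. start with random weights
    theta = rng.normal()
    for _ in range(sweeps):
        for x, d in samples:                 # 2. select an input vector
            y = 1 if w @ x + theta > 0 else -1
            if y != d:                       # 3. incorrect response:
                w = w + d * x                #    delta w_i   = d(x) * x_i
                theta = theta + d            #    delta theta = d(x)
    return w, theta                          # 4. (go back to 2)

# A linearly separable toy set: +1 above the line x2 = x1, -1 below it.
samples = [(np.array([0.0, 1.0]), 1), (np.array([1.0, 2.0]), 1),
           (np.array([1.0, 0.0]), -1), (np.array([2.0, 1.0]), -1)]
w, theta = train_perceptron(samples)
print([1 if w @ x + theta > 0 else -1 for x, _ in samples])  # matches the targets
```

Because the data are linearly separable, the convergence theorem below guarantees that only finitely many weight modifications occur; the remaining sweeps leave the weights unchanged.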

Example of the Perceptron learning rule

A perceptron is initialized with certain weights w_1, w_2, and threshold θ. The perceptron learning rule is used to learn a correct discriminant function for a number of samples, sketched in the figure below. The first sample A, with target value d(x) = +1, is presented to the network. From the network equation it can be calculated that the network output is +1, so no weights are adjusted. The same is the case for point B, with target value d(x) = -1: the network output is negative, so no change. When presenting point C, the network output will be -1, while the target value d(x) = +1. According to the perceptron learning rule, the weight changes are Δw_1 = x_1, Δw_2 = x_2, and Δθ = 1, where (x_1, x_2) are the coordinates of C. After this update, sample C is classified correctly. The figure shows the discriminant function before and after this weight update.

[Figure: Discriminant function before (1) and after (2) the weight update, with samples A, B, and C.]

Convergence theorem

For the p erceptron learning rule there exists a convergence theorem which states the following

Theorem If there exists a set of connection weights w which is able to perform the transfor

mation y d x theperceptron learning rule wil l converge to some solution which may or may

not be the same as w in a nite number of steps for any initial choice of the weights

Pro of Given the fact that the length of the vector w does not play a role because of the sgn

operation we take kw k Because w is a correct solution the value jw x j where

denotes dot or inner product wil l begreater than or there exists a such that jw x j

for al l inputs x Now dene cos w w kw k When according to the perceptron learning

Technically this need not to b e true for any w w x could in fact b e equal to for a w which yields no

misclassications lo ok at denition of F However another w can b e found for which the quantity will not b e

Thanks to Terry Regier Computer Science UC Berkeley

CHAPTER PERCEPTRON AND ADALINE

rule, connection weights are modified at a given input x, we know that Δw = d(x)x, and the weight after modification is w′ = w + Δw. From this it follows that:

    w′ · w* = w · w* + d(x) · w* · x
            = w · w* + sgn(w* · x) w* · x
            > w · w* + δ

    ‖w′‖² = ‖w + d(x)x‖²
          = w² + 2 d(x) w · x + x²
          < w² + x²        (because d(x) = −sgn[w · x] !)
          < w² + M.

After t modifications we have:

    w(t) · w* > w · w* + tδ
    ‖w(t)‖²  < w² + tM

such that

    cos α(t) = (w* · w(t)) / ‖w(t)‖
             > (w · w* + tδ) / √(w² + tM).

From this follows that lim_{t→∞} cos α(t) = lim_{t→∞} (δ/√M) √t = ∞, while by definition cos α ≤ 1! The conclusion is that there must be an upper limit t_max for t. The system modifies its connections only a limited number of times. In other words: after maximally t_max modifications of the weights the perceptron is correctly performing the mapping. t_max will be reached when cos α = 1. If we start with connections w = 0,

    t_max = M / δ².
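The finite bound t_max of the theorem can be observed empirically: on linearly separable data the number of weight modifications stays finite. A minimal sketch (the toy data set and epoch limit are illustrative assumptions, not from the text):

```python
def sgn(s):
    return 1 if s > 0 else -1

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def perceptron_train(X, d, max_epochs=100):
    """Perceptron learning rule: on every misclassified x do w <- w + d(x)*x.

    Returns the weights and the number of modifications, which the
    convergence theorem bounds by a finite t_max for separable data."""
    w = [0.0] * len(X[0])                 # start with connections w = 0
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, target in zip(X, d):
            if sgn(dot(w, x)) != target:  # misclassified pattern
                w = [wi + target * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:                   # no misclassifications left
            break
    return w, updates

# Linearly separable toy data; the last input is a constant 1 (bias).
X = [(1.0, 1.0, 1.0), (2.0, 0.5, 1.0), (-1.0, -1.5, 1.0), (-2.0, -0.5, 1.0)]
d = [1, 1, -1, -1]
w, t_used = perceptron_train(X, d)
```

For this data set a single modification already suffices; the point is only that the loop terminates with all patterns correctly classified.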

The original Perceptron

The Perceptron, proposed by Rosenblatt (Rosenblatt), is somewhat more complex than a single layer network with threshold activation functions. In its simplest form it consists of an N-element input layer ('retina') which feeds into a layer of M 'association', 'mask', or 'predicate' units φ_h, and a single output unit. The goal of the operation of the perceptron is to learn a given transformation d: {−1, 1}^N → {−1, 1} using learning samples with input x and corresponding output y = d(x). In the original definition, the activity of the predicate units can be any function φ_h of the input layer x, but the learning procedure only adjusts the connections to the output unit. The reason for this is that no recipe had been found to adjust the connections between x and φ_h. Depending on the functions φ_h, perceptrons can be grouped into different families. In (Minsky & Papert) a number of these families are described and properties of these families have been described. The output unit of a perceptron is a linear threshold element.

Rosenblatt (Rosenblatt) proved the remarkable theorem about perceptron learning, and in the early 60s perceptrons created a great deal of interest and optimism. The initial euphoria was replaced by disillusion after the publication of Minsky and Papert's Perceptrons in 1969 (Minsky & Papert). In this book they analysed the perceptron thoroughly and proved that there are severe restrictions on what perceptrons can represent.


Figure: The Perceptron (predicate units φ₁, …, φ_n feeding a single output unit Ω).

The adaptive linear element (Adaline)

An important generalisation of the perceptron training algorithm was presented by Widrow and Hoff as the 'least mean square' (LMS) learning procedure, also known as the delta rule. The main functional difference with the perceptron training rule is the way the output of the system is used in the learning rule. The perceptron learning rule uses the output of the threshold function (either −1 or +1) for learning. The delta-rule uses the net output without further mapping into output values −1 or +1.

The learning rule was applied to the 'adaptive linear element', also named Adaline, developed by Widrow and Hoff (Widrow & Hoff). In a simple physical implementation (see figure) this device consists of a set of controllable resistors connected to a circuit which can sum up currents caused by the input voltage signals. Usually the central block, the summer, is also followed by a quantiser which outputs either +1 or −1, depending on the polarity of the sum.

Figure: The Adaline (±1 input pattern switches with gains w₀ … w₃, a summer Σ, a quantiser producing a ±1 output, and an error signal formed against a reference switch).

Although the adaptive process is here exemplified in a case when there is only one output, it may be clear that a system with many parallel outputs is directly implementable by multiple units of the above kind.

If the input conductances are denoted by w_i, i = 0, 1, …, n, and the input and output signals

ADALINE first stood for ADAptive LInear NEuron, but when artificial neurons became less and less popular this acronym was changed to ADAptive LINear Element.


by x_i and y, respectively, then the output of the central block is defined to be

    y = Σ_{i=1}^{n} w_i x_i + θ,

where θ = w₀. The purpose of this device is to yield a given value y = d^p at its output when the set of values x_i^p, i = 1, 2, …, n, is applied at the inputs. The problem is to determine the coefficients w_i, i = 0, 1, …, n, in such a way that the input-output response is correct for a large number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error must be minimised, for instance in the sense of least squares. An adaptive operation means that there exists a mechanism by which the w_i can be adjusted, usually iteratively, to attain the correct values. For the Adaline, Widrow introduced the delta rule to adjust the weights. This rule will be discussed in the next section.

Networks with linear activation functions: the delta rule

For a single layer network with an output unit with a linear activation function, the output is simply given by

    y = Σ_j w_j x_j + θ.

Such a simple network is able to represent a linear relationship between the value of the output unit and the value of the input units. By thresholding the output value, a classifier can be constructed (such as Widrow's Adaline), but here we focus on the linear relationship and use the network for a function approximation task. In high dimensional input spaces the network represents a (hyper)plane and it will be clear that also multiple output units may be defined.

Suppose we want to train the network such that a hyperplane is fitted as well as possible to a set of training samples, consisting of input values x^p and desired (or target) output values d^p. For every given input sample, the output of the network differs from the target value d^p by d^p − y^p, where y^p is the actual output for this pattern. The delta-rule now uses a cost- or error-function based on these differences to adjust the weights.

The error function, as indicated by the name least mean square, is the summed squared error. That is, the total error E is defined to be

    E = Σ_p E^p = ½ Σ_p (d^p − y^p)²,

where the index p ranges over the set of input patterns and E^p represents the error on pattern p. The LMS procedure finds the values of all the weights that minimise the error function by a method called gradient descent. The idea is to make a change in the weight proportional to the negative of the derivative of the error as measured on the current pattern with respect to each weight:

    Δ_p w_j = −γ ∂E^p/∂w_j,

where γ is a constant of proportionality. The derivative is

    ∂E^p/∂w_j = (∂E^p/∂y^p)(∂y^p/∂w_j).

Because of the linear units,

    ∂y^p/∂w_j = x_j


and

    ∂E^p/∂y^p = −(d^p − y^p),

such that

    Δ_p w_j = γ δ^p x_j,

where δ^p = d^p − y^p is the difference between the target output and the actual output for pattern p.

The delta rule modifies weight appropriately for target and actual outputs of either polarity and for both continuous and binary input and output units. These characteristics have opened up a wealth of new applications.
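Written out, the procedure loops over patterns and applies Δ_p w_j = γ δ^p x_j directly. A minimal sketch (the data, the learning rate γ, and the use of a constant bias input are illustrative assumptions):

```python
def delta_rule_train(X, d, gamma=0.1, epochs=500):
    """LMS / delta rule: for each pattern p, change every weight by
    gamma * (d^p - y^p) * x_j (gradient descent on the squared error)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = sum(wj * xj for wj, xj in zip(w, x))   # linear output
            delta = target - y                         # delta^p = d^p - y^p
            w = [wj + gamma * delta * xj for wj, xj in zip(w, x)]
    return w

# Fit the linear relation d = 2*x1 - x2 + 0.5; the last input is a
# constant 1, so its weight plays the role of the bias theta.
X = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (1.0, 1.0, 1.0)]
d = [0.5, 2.5, -0.5, 1.5]
w = delta_rule_train(X, d)
```

Because an exact linear mapping exists for this toy data, the iteration converges to the generating coefficients (2, −1, 0.5); on inconsistent data it would instead settle at the least-squares solution.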

Exclusive-OR problem

In the previous sections we have discussed two learning algorithms for single layer networks, but we have not discussed the limitations on the representation of these networks.

    x₁   x₂   d
    −1   −1   −1
    −1    1    1
     1   −1    1
     1    1   −1

    Table: Exclusive-or truth table.

One of Minsky and Papert's most discouraging results shows that a single layer perceptron cannot represent a simple exclusive-or function. The table above shows the desired relationships between inputs and output units for this function.

In a simple network with two inputs and one output, as depicted in the figure, the net input is equal to:

    s = w₁x₁ + w₂x₂ + θ.

The output of the perceptron is zero when s is negative and equal to one when s is positive. In the figure a geometrical representation of the input domain is given. For a constant θ, the output of the perceptron is equal to one on one side of the dividing line, which is defined by

    w₁x₁ + w₂x₂ + θ = 0,

and equal to zero on the other side of this line.

Figure: Geometric representation of the input space for the AND, OR, and XOR functions. The corner points are (−1, 1), (1, 1), (−1, −1) and (1, −1); for AND and OR a separating line exists, but for XOR no single line can separate the two classes (marked '?').


To see that such a solution cannot be found, take a look at the figure. The input space consists of four points, and the two solid circles at (1, −1) and (−1, 1) cannot be separated by a straight line from the two open circles at (−1, −1) and (1, 1). The obvious question to ask is: How can this problem be overcome? Minsky and Papert prove in their book that for binary inputs, any transformation can be carried out by adding a layer of predicates which are connected to all inputs. The proof is given in the next section.

For the specific XOR problem we geometrically show that by introducing hidden units, thereby extending the network to a multi-layer perceptron, the problem can be solved. Fig. (a) demonstrates that the four input points are now embedded in a three-dimensional space defined by the two inputs plus the single hidden unit. These four points are now easily separated by

Figure: Solution of the XOR problem. a) The perceptron of the earlier figure with an extra hidden unit; the weights w_ij (next to the connecting lines) are 1, 1 and −2, and the thresholds θ_i (in the circles) are −0.5. With the indicated values this perceptron solves the XOR problem. b) This is accomplished by mapping the four input points onto four points in the extended space, ranging from (−1, −1, −1) to (1, 1, 1); clearly, separation (by a linear manifold) into the required groups is now possible.

a linear manifold (plane) into two groups, as desired. This simple example demonstrates that adding hidden units increases the class of problems that are soluble by feed-forward, perceptron-like networks. However, by this generalisation of the basic architecture we have also incurred a serious loss: we no longer have a learning rule to determine the optimal weights.
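One consistent reading of the numbers in the figure (hidden unit: input weights 1, 1 and threshold −0.5; output unit: input weights 1, 1, weight −2 from the hidden unit, threshold −0.5) can be verified directly:

```python
def sgn(s):
    return 1 if s > 0 else -1

def xor_net(x1, x2):
    """Two-input perceptron with one extra hidden unit (weights as read
    from the XOR figure), computing exclusive-or on {-1, 1} inputs."""
    h = sgn(1 * x1 + 1 * x2 - 0.5)             # hidden unit: fires only at (1, 1)
    return sgn(1 * x1 + 1 * x2 - 2 * h - 0.5)  # output unit

for x1 in (-1, 1):
    for x2 in (-1, 1):
        target = -1 if x1 == x2 else 1         # XOR truth table in {-1, 1}
        assert xor_net(x1, x2) == target
```

The hidden unit detects the (1, 1) corner; its −2 connection pulls the output back down there, which is exactly the folding of the input space shown in panel b) of the figure.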

Multi-layer perceptrons can do everything

In the previous section we showed that by adding an extra hidden unit, the XOR problem can be solved. For binary units, one can prove that this architecture is able to perform any transformation given the correct connections and weights. The most primitive proof is the following. For a given transformation y = d(x), we can divide the set of all possible input vectors into two classes:

    X⁺ = { x | d(x) = 1 }   and   X⁻ = { x | d(x) = −1 }.

Since there are N input units, the total number of possible input vectors x is 2^N. For every x^p ∈ X⁺ a hidden unit h can be reserved of which the activation y_h is 1 if and only if the specific pattern p is present at the input: we can choose its weights w_ih equal to the specific pattern x^p and the bias θ_h equal to 1 − N, such that

    y_h^p = sgn( Σ_i w_ih x_i^p + 1 − N )


is equal to 1 for x^p = w_h only. Similarly, the weights to the output neuron can be chosen such that the output is one as soon as one of the M predicate neurons is one:

    y_o^p = sgn( Σ_{h=1}^{M} y_h^p + M − 1 ).

This perceptron will give y_o = 1 only if x ∈ X⁺: it performs the desired mapping. The problem is the large number of predicate units, which is equal to the number of patterns in X⁺, which is maximally 2^N. Of course we can do the same trick for X⁻, and we will always take the minimal number of mask units, which is maximally 2^{N−1}. A more elegant proof is given in (Minsky & Papert), but the point is that for complex transformations the number of required units in the hidden layer is exponential in N.
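The construction can be made concrete for a small N. A minimal sketch (N = 3 and the target transformation are illustrative assumptions):

```python
from itertools import product

def sgn(s):
    return 1 if s > 0 else -1

def build_net(X_plus, N):
    """Reserve one hidden ('predicate') unit per pattern in X+.

    Hidden unit p has weights equal to the pattern x^p and bias 1 - N,
    so it fires (+1) only when exactly that pattern is present; the
    output fires as soon as any of the M predicate units fires."""
    M = len(X_plus)
    def net(x):
        hidden = [sgn(sum(wi * xi for wi, xi in zip(p, x)) + 1 - N)
                  for p in X_plus]
        return sgn(sum(hidden) + M - 1)
    return net

N = 3
d = lambda x: 1 if x.count(1) == 2 else -1       # arbitrary target function
X_plus = [x for x in product((-1, 1), repeat=N) if d(x) == 1]
net = build_net(X_plus, N)
assert all(net(x) == d(x) for x in product((-1, 1), repeat=N))
```

The check over all 2^N inputs succeeds for any choice of d, which is the point of the proof; the price is the number of hidden units, here |X⁺| out of a possible 2^N.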

Conclusions

In this chapter we presented single layer feed-forward networks for classification tasks and for function approximation tasks. The representational power of single layer feed-forward networks was discussed and two learning algorithms for finding the optimal weights were presented. The simple networks presented here have their advantages and disadvantages. The disadvantage is the limited representational power: only linear classifiers can be constructed, or, in case of function approximation, only linear functions can be represented. The advantage, however, is that because of the linearity of the system, the training algorithm will converge to the optimal solution. This is not the case anymore for nonlinear systems such as multiple layer networks, as we will see in the next chapter.


Back-Propagation

As we have seen in the previous chapter, a single-layer network has severe restrictions: the class of tasks that can be accomplished is very limited. In this chapter we will focus on feed-forward networks with layers of processing units.

Minsky and Papert (Minsky & Papert) showed in 1969 that a two layer feed-forward network can overcome many restrictions, but did not present a solution to the problem of how to adjust the weights from input to hidden units. An answer to this question was presented by Rumelhart, Hinton and Williams (Rumelhart, Hinton, & Williams), and similar solutions appeared to have been published earlier (Werbos; Parker; Cun).

The central idea behind this solution is that the errors for the units of the hidden layer are determined by back-propagating the errors of the units of the output layer. For this reason the method is often called the back-propagation learning rule. Back-propagation can also be considered as a generalisation of the delta rule for non-linear activation functions and multi-layer networks.

Multi-layer feed-forward networks

A feed-forward network has a layered structure. Each layer consists of units which receive their input from units from a layer directly below and send their output to units in a layer directly above the unit. There are no connections within a layer. The N_i inputs are fed into the first layer of N_{h,1} hidden units. The input units are merely 'fan-out' units; no processing takes place in these units. The activation of a hidden unit is a function F of the weighted inputs plus a bias, as given earlier. The output of the hidden units is distributed over the next layer of N_{h,2} hidden units, until the last layer of hidden units, of which the outputs are fed into a layer of N_o output units (see the figure below).

Although back-propagation can be applied to networks with any number of layers, just as for networks with binary units, it has been shown (Hornik, Stinchcombe, & White; Funahashi; Cybenko; Hartman, Keeler, & Kowalski) that only one layer of hidden units suffices to approximate any function with finitely many discontinuities to arbitrary precision, provided the activation functions of the hidden units are non-linear (the universal approximation theorem). In most applications a feed-forward network with a single layer of hidden units is used with a sigmoid activation function for the units.

The generalised delta rule

Since we are now using units with non-linear activation functions, we have to generalise the delta rule, which was presented in the previous chapter for linear functions, to the set of non-linear activation

Of course, when linear activation functions are used, a multi-layer network is not more powerful than a single-layer network.

CHAPTER BACKPROPAGATION

Figure: A multi-layer network with l layers of units (N_i input units, hidden layers of N_{h,1}, …, N_{h,l−1} units, and N_o output units).

functions. The activation is a differentiable function of the total input, given by

    y_k^p = F(s_k^p),

in which

    s_k^p = Σ_j w_jk y_j^p + θ_k.

To get the correct generalisation of the delta rule as presented in the previous chapter, we must set

    Δ_p w_jk = −γ ∂E^p/∂w_jk.

The error measure E^p is defined as the total quadratic error for pattern p at the output units:

    E^p = ½ Σ_{o=1}^{N_o} (d_o^p − y_o^p)²,

where d_o^p is the desired output for unit o when pattern p is clamped. We further set E = Σ_p E^p as the summed squared error. We can write

    ∂E^p/∂w_jk = (∂E^p/∂s_k^p)(∂s_k^p/∂w_jk).

By the equation for s_k^p we see that the second factor is

    ∂s_k^p/∂w_jk = y_j^p.

When we define

    δ_k^p = −∂E^p/∂s_k^p,

we will get an update rule which is equivalent to the delta rule as described in the previous chapter, resulting in a gradient descent on the error surface if we make the weight changes according to:

    Δ_p w_jk = γ δ_k^p y_j^p.

The trick is to figure out what δ_k^p should be for each unit k in the network. The interesting result, which we now derive, is that there is a simple recursive computation of these δ's which can be implemented by propagating error signals backward through the network.


To compute δ_k^p we apply the chain rule to write this partial derivative as the product of two factors, one factor reflecting the change in error as a function of the output of the unit, and one reflecting the change in the output as a function of changes in the input. Thus, we have

    δ_k^p = −∂E^p/∂s_k^p = −(∂E^p/∂y_k^p)(∂y_k^p/∂s_k^p).

Let us compute the second factor. By the equation for y_k^p we see that

    ∂y_k^p/∂s_k^p = F′(s_k^p),

which is simply the derivative of the squashing function F for the k-th unit, evaluated at the net input s_k^p to that unit. To compute the first factor, we consider two cases. First, assume that unit k is an output unit k = o of the network. In this case, it follows from the definition of E^p that

    ∂E^p/∂y_o^p = −(d_o^p − y_o^p),

which is the same result as we obtained with the standard delta rule. Substituting this and the expression for ∂y_k^p/∂s_k^p in the definition of δ_k^p, we get

    δ_o^p = (d_o^p − y_o^p) F′(s_o^p)

for any output unit o. Secondly, if k is not an output unit but a hidden unit k = h, we do not readily know the contribution of the unit to the output error of the network. However, the error measure can be written as a function of the net inputs from hidden to output layer, E^p = E^p(s_1^p, s_2^p, …, s_{N_o}^p), and we use the chain rule to write

    ∂E^p/∂y_h^p = Σ_{o=1}^{N_o} (∂E^p/∂s_o^p)(∂s_o^p/∂y_h^p)
                = Σ_{o=1}^{N_o} (∂E^p/∂s_o^p) ∂(Σ_j w_jo y_j^p)/∂y_h^p
                = Σ_{o=1}^{N_o} (∂E^p/∂s_o^p) w_ho
                = −Σ_{o=1}^{N_o} δ_o^p w_ho.

Substituting this in the definition of δ_k^p yields

    δ_h^p = F′(s_h^p) Σ_{o=1}^{N_o} δ_o^p w_ho.

The equations above give a recursive procedure for computing the δ's for all units in the network, which are then used to compute the weight changes according to the update rule. This procedure constitutes the generalised delta rule for a feed-forward network of non-linear units.

Understanding back-propagation

The equations derived in the previous section may be mathematically correct, but what do they actually mean? Is there a way of understanding back-propagation other than reciting the necessary equations?

The answer is, of course, yes. In fact, the whole back-propagation process is intuitively very clear. What happens in the above equations is the following. When a learning pattern is clamped, the activation values are propagated to the output units, and the actual network output is compared with the desired output values; we usually end up with an error in each of the output units. Let's call this error e_o for a particular output unit o. We have to bring e_o to zero.


The simplest method to do this is the greedy method: we strive to change the connections in the neural network in such a way that, next time around, the error e_o will be zero for this particular pattern. We know from the delta rule that, in order to reduce an error, we have to adapt its incoming weights according to

    Δw_ho = (d_o − y_o) y_h.

That's step one. But it alone is not enough: when we only apply this rule, the weights from input to hidden units are never changed, and we do not have the full representational power of the feed-forward network as promised by the universal approximation theorem. In order to adapt the weights from input to hidden units, we again want to apply the delta rule. In this case, however, we do not have a value for δ for the hidden units. This is solved by the chain rule, which does the following: distribute the error of an output unit o to all the hidden units that it is connected to, weighted by this connection. Differently put, a hidden unit h receives a delta from each output unit o equal to the delta of that output unit weighted with (multiplied by) the weight of the connection between those units. In symbols: δ_h = Σ_o δ_o w_ho. Well, not exactly: we forgot the activation function of the hidden unit; F′ has to be applied to the delta before the back-propagation process can continue.

Working with back-propagation

The application of the generalised delta rule thus involves two phases. During the first phase the input x is presented and propagated forward through the network to compute the output values y_o^p for each output unit. This output is compared with its desired value d_o, resulting in an error signal δ_o^p for each output unit. The second phase involves a backward pass through the network during which the error signal is passed to each unit in the network and appropriate weight changes are calculated.

Weight adjustments with sigmoid activation function. The results from the previous section can be summarised in three equations:

The weight of a connection is adjusted by an amount proportional to the product of an error signal δ on the unit k receiving the input and the output of the unit j sending this signal along the connection:

    Δ_p w_jk = γ δ_k^p y_j^p.

If the unit is an output unit, the error signal is given by

    δ_o^p = (d_o^p − y_o^p) F′(s_o^p).

Take as the activation function F the 'sigmoid' function as defined earlier:

    y^p = F(s^p) = 1 / (1 + e^{−s^p}).

In this case the derivative is equal to

    F′(s^p) = ∂/∂s^p [ 1 / (1 + e^{−s^p}) ]
            = (1 / (1 + e^{−s^p})²) e^{−s^p}
            = (1 / (1 + e^{−s^p})) (e^{−s^p} / (1 + e^{−s^p}))
            = y^p (1 − y^p),

such that the error signal for an output unit can be written as:

    δ_o^p = (d_o^p − y_o^p) y_o^p (1 − y_o^p).

The error signal for a hidden unit is determined recursively in terms of error signals of the units to which it directly connects and the weights of those connections. For the sigmoid activation function:

    δ_h^p = F′(s_h^p) Σ_{o=1}^{N_o} δ_o^p w_ho = y_h^p (1 − y_h^p) Σ_{o=1}^{N_o} δ_o^p w_ho.
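The three equations translate almost line for line into code. A minimal sketch with one hidden layer, trained here on the (linearly separable) OR function simply to keep the demonstration small; the network sizes, data, learning rate, and initialisation are illustrative assumptions:

```python
import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_step(x, d, W_ih, W_ho, gamma=0.5):
    """One forward and backward pass of the generalised delta rule.

    W_ih[j][h] and W_ho[h][o] each include a final bias row fed by a
    constant input of 1. Returns the (pre-update) output values."""
    xb = list(x) + [1.0]                                  # bias input
    y_h = [sigmoid(sum(xj * W_ih[j][h] for j, xj in enumerate(xb)))
           for h in range(len(W_ih[0]))]
    yb = y_h + [1.0]                                      # bias into output layer
    y_o = [sigmoid(sum(yh * W_ho[h][o] for h, yh in enumerate(yb)))
           for o in range(len(W_ho[0]))]
    # Output deltas: (d - y) y (1 - y).
    d_o = [(t - y) * y * (1.0 - y) for t, y in zip(d, y_o)]
    # Hidden deltas: y (1 - y) * sum_o delta_o w_ho (errors propagated back).
    d_h = [y_h[h] * (1.0 - y_h[h]) *
           sum(d_o[o] * W_ho[h][o] for o in range(len(d_o)))
           for h in range(len(y_h))]
    # Weight changes: Delta_p w_jk = gamma * delta_k * y_j.
    for h in range(len(yb)):
        for o in range(len(d_o)):
            W_ho[h][o] += gamma * d_o[o] * yb[h]
    for j in range(len(xb)):
        for h in range(len(d_h)):
            W_ih[j][h] += gamma * d_h[h] * xb[j]
    return y_o

random.seed(1)
W_ih = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
W_ho = [[random.uniform(-0.5, 0.5)] for _ in range(3)]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]        # the OR function
for _ in range(5000):
    for x, d in data:
        train_step(x, d, W_ih, W_ho)
```

After training, the outputs approach (but, being sigmoids, never exactly reach) the 0/1 targets.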

Learning rate and momentum. The learning procedure requires that the change in weight is proportional to ∂E^p/∂w. True gradient descent requires that infinitesimal steps are taken. The constant of proportionality is the learning rate γ. For practical purposes we choose a learning rate that is as large as possible without leading to oscillation. One way to avoid oscillation at large γ is to make the change in weight dependent on the past weight change by adding a momentum term:

    Δw_jk(t + 1) = γ δ_k^p y_j^p + α Δw_jk(t),

where t indexes the presentation number and α is a constant which determines the effect of the previous weight change.

The role of the momentum term is shown in the figure. When no momentum term is used, it takes a long time before the minimum has been reached with a low learning rate, whereas for high learning rates the minimum is never reached because of the oscillations. When adding the momentum term, the minimum will be reached faster.

Figure: The descent in weight space: a) for small learning rate, b) for large learning rate (note the oscillations), and c) with large learning rate and momentum term added.
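On a one-dimensional quadratic error surface the effect of the α Δw(t) term can be seen directly. A minimal sketch (the quadratic error and the values of γ and α are illustrative assumptions):

```python
def descend(grad, w0, gamma, alpha=0.0, steps=100):
    """Gradient descent with a momentum term:
    Delta w(t+1) = -gamma * dE/dw + alpha * Delta w(t)."""
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = -gamma * grad(w) + alpha * dw
        w += dw
    return w

grad = lambda w: 2.0 * w    # gradient of E(w) = w^2, minimum at w = 0
# Learning rate slightly too large: plain gradient descent oscillates
# with growing amplitude, while the momentum term damps the oscillation.
plain = descend(grad, w0=1.0, gamma=1.05)
with_momentum = descend(grad, w0=1.0, gamma=1.05, alpha=0.5)
```

With these values `plain` diverges (each step multiplies w by −1.1), while `with_momentum` spirals into the minimum, matching panels b) and c) of the figure.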

Learning per pattern. Although, theoretically, the back-propagation algorithm performs gradient descent on the total error only if the weights are adjusted after the full set of learning patterns has been presented, more often than not the learning rule is applied to each pattern separately, i.e., a pattern p is applied, E^p is calculated, and the weights are adapted (p = 1, 2, …, P). There exists empirical indication that this results in faster convergence. Care has to be taken, however, with the order in which the patterns are taught. For example, when using the same sequence over and over again, the network may become focused on the first few patterns. This problem can be overcome by using a permuted training method.

An example

A feed-forward network can be used to approximate a function from examples. Suppose we have a system (for example a chemical process or a financial market) of which we want to know the characteristics. The input of the system is given by the two-dimensional vector x and the output is given by the one-dimensional vector d. We want to estimate the relationship d = f(x) from examples {(x^p, d^p)}, as depicted in the figure (top left). A feed-forward network was


Figure: Example of function approximation with a feed-forward network. Top left: the original learning samples. Top right: the approximation with the network. Bottom left: the function which generated the learning samples. Bottom right: the error in the approximation.

programmed with two inputs, hidden units with sigmoid activation function, and an output unit with a linear activation function. (Check for yourself how the equation for the output error signal should be adapted for the linear instead of sigmoid activation function.) The network weights are initialised to small values and the network is trained with the back-propagation training rule described in the previous section. The relationship between x and d as represented by the network is shown in the figure (top right), while the function which generated the learning samples is given in the figure (bottom left). The approximation error is depicted in the figure (bottom right). We see that the error is higher at the edges of the region within which the learning samples were generated. The network is considerably better at interpolation than extrapolation.

Other activation functions

Although sigmoid functions are quite often used as activation functions, other functions can be used as well. In some cases this leads to a formula which is known from traditional function approximation theories.

For example, from Fourier analysis it is known that any periodic function can be written as an infinite sum of sine and cosine terms (Fourier series):

    f(x) = Σ_{n=0}^{∞} (a_n cos nx + b_n sin nx).


We can rewrite this as a summation of sine terms:

    f(x) = a₀ + Σ_{n=1}^{∞} c_n sin(nx + θ_n),

with c_n = √(a_n² + b_n²) and θ_n = arctan(a_n/b_n). This can be seen as a feed-forward network with a single input unit for x, a single output unit for f(x), and hidden units with an activation function F(s) = sin(s). The factor a₀ corresponds with the bias of the output unit, the factors c_n correspond with the weights from hidden to output unit, the phase factor θ_n corresponds with the bias term of the hidden units, and the factor n corresponds with the weights between the input and hidden layer.

The basic difference between the Fourier approach and the back-propagation approach is that in the Fourier approach the 'weights' between the input and the hidden units (these are the factors n) are fixed integer numbers which are analytically determined, whereas in the back-propagation approach these weights can take any value and are typically learned using a learning heuristic.
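The network reading of the sine expansion can be written out directly: hidden unit n computes sin(nx + θ_n) and the output unit forms a₀ + Σ_n c_n sin(nx + θ_n). A minimal sketch (the coefficients are illustrative assumptions, not fitted values):

```python
import math

def fourier_net(x, a0, units):
    """One-input feed-forward net with sine activation functions.

    units is a list of (n, theta_n, c_n): input-to-hidden weight n,
    hidden bias theta_n, and hidden-to-output weight c_n; a0 is the
    bias of the (linear) output unit."""
    return a0 + sum(c * math.sin(n * x + theta) for n, theta, c in units)

# The finite series 2 sin(x) + sin(2x) expressed as such a network.
units = [(1, 0.0, 2.0), (2, 0.0, 1.0)]
x = 0.7
assert abs(fourier_net(x, 0.0, units)
           - (2 * math.sin(x) + math.sin(2 * x))) < 1e-12
```

Replacing the fixed integer input weights n by free, learned parameters is exactly the step from a truncated Fourier series to a back-propagation network with sine activations.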

To illustrate the use of other activation functions, we have trained a feed-forward network with one output unit, four hidden units, and one input with ten patterns drawn from the function f(x) = sin(2x) sin(x). The result is depicted in the figure below. The same function (albeit with other learning points) is learned with a network with eight sigmoid hidden units (see the next figure). From the figures it is clear that it pays off to use as much knowledge of the problem at hand as possible.

    Figure: The periodic function f(x) = sin(2x) sin(x) approximated with sine activation functions. (Adapted from Dastani.)

Deficiencies of back-propagation

Despite the apparent success of the back-propagation learning algorithm, there are some aspects which make the algorithm not guaranteed to be universally useful. Most troublesome is the long training process. This can be a result of a non-optimum learning rate and momentum. A lot of advanced algorithms based on back-propagation learning have some optimised method to adapt this learning rate, as will be discussed in the next section. Outright training failures generally arise from two sources: network paralysis and local minima.

Network paralysis. As the network trains, the weights can be adjusted to very large values. The total input of a hidden unit or output unit can therefore reach very high (either positive or negative) values, and because of the sigmoid activation function the unit will have an activation very close to zero or very close to one. As is clear from the error-signal equations, the weight


Figure: The periodic function f(x) = sin(2x) sin(x) approximated with sigmoid activation functions. (Adapted from Dastani.)

adjustments, which are proportional to y_k^p (1 − y_k^p), will be close to zero, and the training process can come to a virtual standstill.

Local minima. The error surface of a complex network is full of hills and valleys. Because of the gradient descent, the network can get trapped in a local minimum when there is a much deeper minimum nearby. Probabilistic methods can help to avoid this trap, but they tend to be slow. Another suggested possibility is to increase the number of hidden units. Although this will work because of the higher dimensionality of the error space, and the chance to get trapped is smaller, it appears that there is some upper limit of the number of hidden units which, when exceeded, again results in the system being trapped in local minima.

Advanced algorithms

Many researchers have devised improvements of and extensions to the basic back-propagation algorithm described above. It is too early for a full evaluation: some of these techniques may prove to be fundamental, others may simply fade away. A few methods are discussed in this section.

Maybe the most obvious improvement is to replace the rather primitive steepest descent method with a direction set minimisation method, e.g., conjugate gradient minimisation. Note that minimisation along a direction u brings the function f at a place where its gradient is perpendicular to u (otherwise minimisation along u is not complete). Instead of following the gradient at every step, a set of n directions is constructed which are all conjugate to each other, such that minimisation along one of these directions u_j does not spoil the minimisation along one of the earlier directions u_i, i.e., the directions are non-interfering. Thus one minimisation in the direction of u_i suffices, such that n minimisations in a system with n degrees of freedom bring this system to a minimum (provided the system is quadratic). This is different from gradient descent, which directly minimises in the direction of the steepest descent (Press, Flannery, Teukolsky, & Vetterling).

Suppose the function to be minimised is approximated by its Taylor series:

    f(x) = f(p) + Σ_i (∂f/∂x_i)|_p x_i + ½ Σ_{i,j} (∂²f/∂x_i ∂x_j)|_p x_i x_j + ⋯
         ≈ ½ xᵀA x − bᵀx + c,


where T denotes transpose, and

    c ≡ f(p),   b ≡ −∇f|_p,   [A]_ij ≡ (∂²f/∂x_i ∂x_j)|_p.

A is a symmetric positive definite n × n matrix: the Hessian of f at p. The gradient of f is

    ∇f = Ax − b,

such that a change of x results in a change of the gradient as

    δ(∇f) = A(δx).

Now suppose f was minimised along a direction u_i to a point where the gradient g_{i+1} of f is perpendicular to u_i, i.e.,

    u_iᵀ g_{i+1} = 0,

and a new direction u_{i+1} is sought. In order to make sure that moving along u_{i+1} does not spoil minimisation along u_i, we require that the gradient of f remain perpendicular to u_i, i.e.,

    u_iᵀ g_{i+2} = 0,

otherwise we would once more have to minimise in a direction which has a component of u_i. Combining these two conditions, we get

    0 = u_iᵀ (g_{i+1} − g_{i+2}) = u_iᵀ δ(∇f) = u_iᵀ A u_{i+1}.

When this equation holds for two vectors u_i and u_{i+1}, they are said to be conjugate.

Now, starting at some point p₀, the first minimisation direction u₁ is taken equal to g₁ = −∇f(p₀), resulting in a new point p₁. For i ≥ 1, calculate the directions

    u_{i+1} = g_{i+1} + γ_i u_i,

where γ_i is chosen to make u_{i+1}ᵀ A u_i = 0 and the successive gradients perpendicular, i.e.,

    γ_i = (g_{i+1}ᵀ g_{i+1}) / (g_iᵀ g_i),   with g_k = −∇f|_{p_k} for all k.

Next, calculate p_{i+1} = p_i + λ_{i+1} u_{i+1}, where λ_{i+1} is chosen so as to minimise f(p_{i+1}).

It can be shown that the u's thus constructed are all mutually conjugate (e.g., see (Stoer & Bulirsch)). The process described above is known as the Fletcher-Reeves method, but there are many variants which work more or less the same (Hestenes & Stiefel; Polak; Powell).

Although only n iterations are needed for a quadratic system with n degrees of freedom, due to the fact that we are not minimising quadratic systems, as well as a result of round-off errors, the n directions have to be followed several times (see figure). Powell introduced some improvements to correct for behaviour in non-quadratic systems. The resulting cost is O(n), which is significantly better than the linear convergence of steepest descent.

A matrix A is called positive definite if ∀y ≠ 0: yᵀAy > 0.

This is not a trivial problem (see (Press et al.)). However, line minimisation methods with super-linear convergence exist (see the footnote below).

A method is said to converge linearly if E_{i+1} = cE_i with c < 1. Methods which converge with a higher power, i.e., E_{i+1} = c(E_i)^m with m > 1, are called super-linear.


Figure: Slow decrease with conjugate gradient in non-quadratic systems. The hills on the left are very steep, resulting in a large search vector u_i. When the quadratic portion is entered, the new search direction is constructed from the previous direction and the gradient, resulting in a spiraling minimisation. This problem can be overcome by detecting such spiraling minimisations and restarting the algorithm with u₁ = −∇f.

Some improvements on back-propagation have been presented based on an independent adaptive learning rate parameter for each weight.

Van den Boomgaard and Smeulders (Van den Boomgaard & Smeulders) show that for a feed-forward network without hidden units an incremental procedure to find the optimal weight matrix W needs an adjustment of the weights with

    ΔW(t + 1) = γ(t) [ d(t) − W(t) x(t) ] x^T(t),

in which γ is not a constant but a variable matrix which depends on the input vector. By using a priori knowledge about the input signal, the storage requirements for γ can be reduced.

Silva and Almeida (Silva & Almeida) also show the advantages of an independent step size for each weight in the network. In their algorithm the learning rate η_jk is adapted after every learning pattern:

    η_jk(t + 1) = u η_jk(t)    if ∂E(t)/∂w_jk and ∂E(t − 1)/∂w_jk have the same signs,
    η_jk(t + 1) = d η_jk(t)    if ∂E(t)/∂w_jk and ∂E(t − 1)/∂w_jk have opposite signs,

where u and d are positive constants with values slightly above and below unity, respectively. The idea is to decrease the learning rate in case of oscillations.
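The adaptation rule above can be sketched as follows. The factors u = 1.2 and d = 0.8 are illustrative choices slightly above and below unity, not values taken from the text.

```python
import numpy as np

def adapt_step_sizes(eta, grad, grad_prev, u=1.2, d=0.8):
    """Per-weight learning-rate adaptation in the style of Silva & Almeida.

    Rates grow by factor u (> 1) where successive gradient components
    agree in sign, and shrink by factor d (< 1) where they oscillate.
    """
    same_sign = grad * grad_prev > 0
    return np.where(same_sign, u * eta, d * eta)
```

After a few oscillating patterns, the rate for an oscillating weight decays geometrically while the rates of consistently-signed weights keep growing.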

How good are multi-layer feed-forward networks?

From the example shown in figure it is clear that the approximation of the network is not perfect. The resulting approximation error is influenced by:

1. The learning algorithm and number of iterations. This determines how well the error on the training set is minimised.


2. The number of learning samples. This determines how well the training samples represent the actual function.

3. The number of hidden units. This determines the 'expressive power' of the network. For 'smooth' functions only a few hidden units are needed; for wildly fluctuating functions more hidden units will be needed.

In the previous sections we discussed the learning rules, such as back-propagation and the other gradient-based learning algorithms, and the problem of finding the minimum error. In this section we particularly address the effect of the number of learning samples and the effect of the number of hidden units.

We first have to define an adequate error measure. All neural network training algorithms try to minimise the error of the set of learning samples which are available for training the network. The average error per learning sample is defined as the learning error rate:

    E_learning = (1 / P_learning) Σ_{p=1}^{P_learning} E^p,

in which E^p is the difference between the desired output value and the actual network output for the learning samples:

    E^p = (1/2) Σ_{o=1}^{N_o} (d_o^p − y_o^p)².

This is the error which is measurable during the training process.

It is obvious that the actual error of the network will differ from the error at the locations of the training samples. The difference between the desired output value and the actual network output should be integrated over the entire input domain to give a more realistic error measure. This integral can be estimated if we have a large set of samples: the test set. We now define the test error rate as the average error of the test set:

    E_test = (1 / P_test) Σ_{p=1}^{P_test} E^p.

In the following subsections we will see how these error measures depend on learning set size and number of hidden units.
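The two error rates share the same per-sample formula and differ only in which sample set they average over. A minimal sketch (our own helper, not code from the text):

```python
import numpy as np

def error_rate(targets, outputs):
    """Average error per sample: E = (1/P) sum_p (1/2) sum_o (d_o^p - y_o^p)^2.

    Apply to the learning set to get E_learning, to the test set for E_test.
    """
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    per_sample = 0.5 * ((targets - outputs) ** 2).sum(axis=1)
    return per_sample.mean()
```

A network that passes exactly through its learning samples has E_learning = 0, while E_test computed on held-out samples can still be large.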

The effect of the number of learning samples

A simple problem is used as example: a function y = f(x) has to be approximated with a feed-forward neural network. A neural network is created with an input, a number of hidden units with sigmoid activation function, and a linear output unit. Suppose we have only a small number of learning samples and the network is trained with these samples. Training is stopped when the error does not decrease anymore. The original (desired) function is shown in figure A as a dashed line. The learning samples and the approximation of the network are shown in the same figure. We see that in this case E_learning is small (the network output goes perfectly through the learning samples), but E_test is large: the test error of the network is large. The approximation obtained from a larger set of learning samples is shown in figure B. The E_learning is larger than in the case of fewer learning samples, but the E_test is smaller.

This experiment was carried out with other learning set sizes, where for each learning set size the experiment was repeated several times. The average learning and test error rates as a function of the learning set size are given in figure. Note that the learning error increases with an increasing learning set size, and the test error decreases with increasing learning set size. A low


[Figure: two panels, A and B, plotting y against x.]

Figure: Effect of the learning set size on the generalization. The dashed line gives the desired function, the learning samples are depicted as circles, and the approximation by the network is shown by the drawn line. a) a small learning set; b) a larger learning set.

learning error on the small learning set is no guarantee for a good network performance! With increasing number of learning samples the two error rates converge to the same value. This value depends on the representational power of the network: given the optimal weights, how good is the approximation? This error depends on the number of hidden units and the activation function. If the learning error rate does not converge to the test error rate, the learning procedure has not found a global minimum.

Figure: Effect of the learning set size on the error rate. The average learning error rate and the average test error rate as a function of the number of learning samples. [Figure: error rate versus number of learning samples, with a 'test set' curve and a 'learning set' curve.]

The effect of the number of hidden units

The same function as in the previous subsection is used, but now the number of hidden units is varied. The original (desired) function, learning samples, and network approximation are shown in figure A for a small number of hidden units and in figure B for a large number of hidden units. The effect visible in figure B is called overtraining. The network fits exactly with the learning samples, but because of the large number of hidden units the function which is actually represented by the network is far more wild than the original one. Particularly in the case of learning samples which contain a certain amount of noise (which all real-world data have), the network will 'fit the noise' of the learning samples instead of making a smooth approximation.


[Figure: two panels, A and B, plotting y against x.]

Figure: Effect of the number of hidden units on the network performance. The dashed line gives the desired function, the circles denote the learning samples, and the drawn line gives the approximation by the network. a) a small number of hidden units; b) a large number of hidden units.

This example shows that a large number of hidden units leads to a small error on the training set but not necessarily to a small error on the test set. Adding hidden units will always lead to a reduction of E_learning. However, adding hidden units will first lead to a reduction of E_test, but then lead to an increase of E_test. This effect is called the peaking effect. The average learning and test error rates as a function of the number of hidden units are given in figure.

Figure: The average learning error rate and the average test error rate as a function of the number of hidden units. [Figure: error rate versus number of hidden units, with a 'test set' curve and a 'learning set' curve.]

Applications

Back-propagation has been applied to a wide variety of research applications. Sejnowski and Rosenberg (Sejnowski & Rosenberg) produced a spectacular success with NETtalk, a system that converts printed English text into highly intelligible speech.

A feed-forward network with one layer of hidden units has been described by Gorman and Sejnowski (Gorman & Sejnowski) as a classification machine for sonar signals.

Another application of a multi-layer feed-forward network with a back-propagation training algorithm is to learn an unknown function between input and output signals from the presentation of examples. It is hoped that the network is able to generalise correctly, so that input

values which are not presented as learning patterns will result in correct output values. An example is the work of Josin (Josin), who used a two-layer feed-forward network with back-propagation learning to perform the inverse kinematic transform which is needed by a robot arm controller (see the chapter on robot control).

Recurrent Networks

The learning algorithms discussed in the previous chapter were applied to feed-forward networks: all data flows in a network in which no cycles are present.

But what happens when we introduce a cycle? For instance, we can connect a hidden unit with itself over a weighted connection, connect hidden units to input units, or even connect all units with each other. Although, as we know from the previous chapter, the approximational capabilities of such networks do not increase, we may obtain decreased complexity, network size, etc. to solve the same problem.

An important question we have to consider is the following: what do we want to learn in a recurrent network? After all, when one is considering a recurrent network, it is possible to continue propagating activation values ad infinitum, or until a stable point (attractor) is reached. As we will see in the sequel, there exist recurrent networks which are attractor based, i.e., the activation values in the network are repeatedly updated until a stable point is reached, after which the weights are adapted, but there are also recurrent networks where the learning rule is used after each propagation (where an activation value is traversed over each weight only once), while external inputs are included in each propagation. In such networks, the recurrent connections can be regarded as extra inputs to the network, the values of which are computed by the network itself.

In this chapter, recurrent extensions to the feed-forward network introduced in the previous chapters will be discussed, yet not to exhaustion. The theory of the dynamics of recurrent networks extends beyond the scope of a one-semester course on neural networks. Yet the basics of these networks will be discussed.

Subsequently some special recurrent networks will be discussed: the Hopfield network, which can be used for the representation of binary patterns; subsequently we touch upon Boltzmann machines, therewith introducing stochasticity in neural computation.

The generalised delta-rule in recurrent networks

The back-propagation learning rule, introduced in an earlier chapter, can be easily used for training patterns in recurrent networks. Before we will consider this general case, however, we will first describe networks where some of the hidden unit activation values are fed back to an extra set of input units (the Elman network), or where output values are fed back into hidden units (the Jordan network).

A typical application of such a network is the following. Suppose we have to construct a network that must generate a control command depending on an external input, which is a time series x(t), x(t − 1), x(t − 2), …. With a feed-forward network there are two possible approaches:

1. create inputs x_1, x_2, …, x_n, which constitute the last n values of the input vector. Thus a 'time window' of the input vector is input to the network;

2. create inputs x, x′, x″, …. Besides only inputting x(t), we also input its first, second, etc. derivatives. Naturally, computation of these derivatives is not a trivial task for higher-order derivatives.

The disadvantage is, of course, that the input dimensionality of the feed-forward network is multiplied with n, leading to a very large network, which is slow and difficult to train. The Jordan and Elman networks provide a solution to this problem. Due to the recurrent connections, a window of inputs need not be input anymore; instead, the network is supposed to learn the influence of the previous time steps itself.

The Jordan network

One of the earliest recurrent neural networks was the Jordan network (Jordan). An exemplar network is shown in figure.

Figure: The Jordan network. Output activation values are fed back to the input layer, to a set of extra neurons called the state units.

In the Jordan network, the activation values of the

output units are fed back into the input layer through a set of extra input units called the state units. There are as many state units as there are output units in the network. The connections between the output and state units have a fixed weight of +1; learning takes place only in the connections between input and hidden units as well as hidden and output units. Thus all the learning rules derived for the multi-layer perceptron can be used to train this network.

The Elman network

The Elman network was introduced by Elman (Elman). In this network a set of context units are introduced, which are extra input units whose activation values are fed back from the hidden units. Thus the network is very similar to the Jordan network, except that (1) the hidden units instead of the output units are fed back, and (2) the extra input units have no self-connections.

The schematic structure of this network is shown in figure.

Again the hidden units are connected to the context units with a fixed weight of value +1. Learning is done as follows:

1. the context units are set to 0; t = 1;


Figure: The Elman network. With this network, the hidden unit activation values are fed back to the input layer, to a set of extra neurons called the context units.

2. pattern x^t is clamped, the forward calculations are performed once;

3. the back-propagation learning rule is applied;

4. t ← t + 1; go to 2.

The context units at step t thus always have the activation value of the hidden units at step t − 1.
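One forward step of this scheme can be sketched as follows. This is an illustrative toy implementation, not code from the text: the layer sizes are arbitrary and tanh stands in for the sigmoid-type activation.

```python
import numpy as np

def elman_forward(x, context, W_in, W_ctx, W_out):
    """One forward pass of a minimal Elman-style network.

    The hidden activations are computed from the current input and the
    context units, which hold a copy of the previous hidden activations;
    the context is then overwritten for the next time step (the feedback
    connections have fixed weight +1).
    """
    hidden = np.tanh(W_in @ x + W_ctx @ context)  # squashing activation
    output = W_out @ hidden
    new_context = hidden.copy()  # context(t+1) = hidden(t)
    return output, new_context
```

Only W_in, W_ctx, and W_out would be trained (by ordinary back-propagation); the copy into the context units is fixed.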

Example

As we mentioned above, the Jordan and Elman networks can be used to train a network on reproducing time sequences. The idea of the recurrent connections is that the network is able to 'remember' the previous states of the input values. As an example, we trained an Elman network on controlling an object moving in one dimension. This object has to follow a pre-specified trajectory x_d. To control the object, forces F must be applied, since the object suffers from friction and perhaps other external forces.

To tackle this problem, we use an Elman net with inputs x and x_d, one output F, and three hidden units. The hidden units are connected to three context units. In total, five units feed into the hidden layer.

The results of training are shown in figure.

Figure: Training an Elman network to control an object. The solid line depicts the desired trajectory x_d; the dashed line the realised trajectory. The third line is the error.

The same test can be done with an ordinary feed-forward network with sliding window input. We tested this with a network with five inputs, four of which constituted the sliding window x_{t−3}, x_{t−2}, x_{t−1}, and x_t, and one the desired next position of the object. Results are shown in figure.

Figure: Training a feed-forward network to control an object. The solid line depicts the desired trajectory x_d; the dashed line the realised trajectory. The third line is the error.

The disappointing observation is that the results are actually better with the ordinary feed-forward network, which has the same complexity as the Elman network.

Back-propagation in fully recurrent networks

More complex schemes than the above are possible. For instance, independently of each other, Pineda (Pineda) and Almeida (Almeida) discovered that error back-propagation is in fact a special case of a more general gradient learning method which can be used for training attractor networks. However, also when a network does not reach a fixpoint, a learning method can be used: back-propagation through time (Pearlmutter). This learning method, the discussion of which extends beyond the scope of our course, can be used to train a multi-layer perceptron to follow trajectories in its activation values.

The Hopfield network

One of the earliest recurrent neural networks reported in the literature was the auto-associator, independently described by Anderson (Anderson) and Kohonen (Kohonen). It consists of a pool of neurons with connections between each unit i and j, i ≠ j (see figure). All connections are weighted.

In 1982, Hopfield (Hopfield) brought together several earlier ideas concerning these networks and presented a complete mathematical analysis based on Ising spin models (Amit, Gutfreund, & Sompolinsky). It is therefore that this network, which we will describe in this chapter, is generally referred to as the Hopfield network.

Description

The Hopfield network consists of a set of N interconnected neurons (figure) which update their activation values asynchronously and independently of other neurons. All neurons are both input and output neurons. The activation values are binary. Originally, Hopfield chose activation values of 1 and 0, but using values +1 and −1 presents some advantages discussed below. We will therefore adhere to the latter convention.


Figure: The auto-associator network. All neurons are both input and output neurons, i.e., a pattern is clamped, the network iterates to a stable state, and the output of the network consists of the new activation values of the neurons.

The state of the system is given by the activation values y = (y_k). The net input s_k(t) of a neuron k at cycle t is a weighted sum

    s_k(t) = Σ_{j ≠ k} y_j(t) w_jk + θ_k.

A simple threshold function (figure) is applied to the net input to obtain the new activation value y_k(t + 1) at time t + 1:

    y_k(t + 1) = +1        if s_k(t) > U_k,
    y_k(t + 1) = −1        if s_k(t) < U_k,
    y_k(t + 1) = y_k(t)    otherwise,

i.e., y_k(t + 1) = sgn(s_k(t)). For simplicity we henceforth choose U_k = 0, but this is of course not essential.

A neuron k in the Hopfield network is called stable at time t if, in accordance with the equations above,

    y_k(t) = sgn(s_k(t − 1)).

A state α is called stable if, when the network is in state α, all neurons are stable. A pattern x^p is called stable if, when x^p is clamped, all neurons are stable.

When the extra restriction w_jk = w_kj is made, the behaviour of the system can be described with an energy function

    E = −(1/2) Σ_k Σ_{j ≠ k} y_j y_k w_jk − Σ_k θ_k y_k.

Theorem. A recurrent network with connections w_jk = w_kj, in which the neurons are updated using the rule above, has stable limit points.

Proof. First note that the energy expressed in the equation above is bounded from below, since the y_k are bounded from below and the w_jk and θ_k are constant. Secondly, E is monotonically decreasing when state changes occur, because

    ΔE = −Δy_k ( Σ_{j ≠ k} y_j w_jk + θ_k )

is always negative when y_k changes according to the update rules.

k

Often these networks are describ ed using the symb ols used by Hopeld V for activation of unit k T for

k jk

the connection weightbetween units j and k andU for the external input of unit k We decided to stick to the

k

more general symb ols y w and

jk k k
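The update rule and the energy function can be sketched together; the theorem then shows up numerically as a non-increasing energy. This is an illustrative sketch (our own helper functions), assuming U_k = 0 and a symmetric weight matrix with zero diagonal.

```python
import numpy as np

def hopfield_step(y, W, theta):
    """Asynchronously update every unit once; returns the new +1/-1 state.

    A unit keeps its value when its net input is exactly zero
    (the 'otherwise' branch of the threshold rule).
    """
    y = y.copy()
    for k in range(len(y)):
        s = W[:, k] @ y + theta[k]  # net input s_k
        if s != 0:
            y[k] = 1 if s > 0 else -1
    return y

def energy(y, W, theta):
    """E = -1/2 sum_{j != k} y_j y_k w_jk - sum_k theta_k y_k (zero-diagonal W)."""
    return -0.5 * y @ W @ y - theta @ y
```

For a symmetric W the energy never increases under this update, and iteration reaches a state that no further update changes: a stable limit point.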


The advantage of a +1/−1 model over a 1/0 model then is symmetry of the states of the network. For, when some pattern x is stable, its inverse is stable, too, whereas in the 1/0 model this is not always true (as an example, the pattern 00⋯00 is always stable, but 11⋯11 need not be). Similarly, both a pattern and its inverse have the same energy in the +1/−1 model.

Removing the restriction of bidirectional connections, i.e., w_jk = w_kj, results in a system that is not guaranteed to settle to a stable state.

Hopfield network as associative memory

A primary application of the Hopfield network is an associative memory. In this case, the weights of the connections between the neurons have to be thus set that the states of the system corresponding with the patterns which are to be stored in the network are stable. These states can be seen as 'dips' in energy space. When the network is cued with a noisy or incomplete test pattern, it will render the incorrect or missing data by iterating to a stable state which is in some sense 'near' to the cued pattern.

The Hebb rule can be used to store P patterns:

    w_jk = Σ_{p=1}^{P} x_j^p x_k^p    if j ≠ k,
    w_jk = 0                          otherwise,

i.e., if x_j^p and x_k^p are equal, w_jk is increased, otherwise decreased by one (note that, in the original Hebb rule, weights only increase). It appears, however, that the network gets saturated very quickly, and that about 0.15N memories can be stored before recall errors become severe.
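Hebbian storage and pattern completion can be sketched as follows. This is an illustrative sketch with our own function names; for brevity, recall here updates all units synchronously, whereas the Hopfield network proper updates them asynchronously.

```python
import numpy as np

def hebb_store(patterns):
    """Hebbian weight matrix w_jk = sum_p x_j^p x_k^p, zero diagonal."""
    X = np.asarray(patterns, dtype=float)
    W = X.T @ X
    np.fill_diagonal(W, 0)
    return W

def recall(y, W, n_steps=10):
    """Iterate y <- sgn(W y) until the state stops changing."""
    y = np.asarray(y, dtype=float)
    for _ in range(n_steps):
        y_new = np.where(W @ y >= 0, 1.0, -1.0)
        if np.array_equal(y_new, y):
            break
        y = y_new
    return y
```

Cueing the network with a stored pattern in which one bit is flipped pulls the state back into the nearest 'dip': the stored pattern itself.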

There are two problems associated with storing too many patterns:

1. the stored patterns become unstable;

2. spurious stable states appear (i.e., stable states which do not correspond with stored patterns).

The first of these two problems can be solved by an algorithm proposed by Bruce et al. (Bruce, Canning, Forrest, Gardner, & Wallace):

Algorithm. Given a starting weight matrix W = [w_jk], for each pattern x^p to be stored and each element x_k^p in x^p define a correction ε_k such that

    ε_k = 0    if y_k is stable and x^p is clamped,
    ε_k = 1    otherwise.

Now modify w_jk by Δw_jk = y_j y_k (ε_j + ε_k) if j ≠ k. Repeat this procedure until all patterns are stable.

It appears that, in practice, this algorithm usually converges. There exist cases, however, where the algorithm remains oscillatory (try to find one)!

The second problem stated above can be alleviated by applying the Hebb rule in reverse to the spurious stable state, but with a low learning factor (Hopfield, Feinstein, & Palmer). Thus these patterns are weakly 'unstored' and will become unstable again.

Neurons with graded response

The network described in the previous section can be generalised by allowing continuous activation values. Here, the threshold activation function is replaced by a sigmoid. As before, this system can be proved to be stable when a symmetric weight matrix is used (Hopfield).


Hopfield networks for optimisation problems

An interesting application of the Hopfield network with graded response arises in a heuristic solution to the NP-complete travelling salesman problem (Garey & Johnson). In this problem, a path of minimal distance must be found between n cities, such that the begin- and end-points are the same.

Hopfield and Tank (Hopfield & Tank) use a network with n × n neurons. Each row in the matrix represents a city, whereas each column represents the position in the tour. When the network is settled, each row and each column should have one and only one active neuron, indicating a specific city occupying a specific position in the tour. The neurons are updated using the rule above with a sigmoid activation function between 0 and 1. The activation value y_{Xj} = 1 indicates that city X occupies the j-th place in the tour.

An energy function describing this problem can be set up as follows. To ensure a correct solution, the following energy must be minimised:

    E1 = (A/2) Σ_X Σ_j Σ_{k ≠ j} y_{Xj} y_{Xk}
       + (B/2) Σ_j Σ_X Σ_{Y ≠ X} y_{Xj} y_{Yj}
       + (C/2) ( Σ_X Σ_j y_{Xj} − n )²,

where A, B, and C are constants. The first and second terms in this equation are zero if and only if there is a maximum of one active neuron in each row and column, respectively. The last term is zero if and only if there are exactly n active neurons.

To minimise the distance of the tour, an extra term

    E2 = (D/2) Σ_X Σ_{Y ≠ X} Σ_j d_{XY} y_{Xj} ( y_{Y,j+1} + y_{Y,j−1} )

is added to the energy, where d_{XY} is the distance between cities X and Y, and D is a constant. For convenience, the subscripts are defined modulo n.

The weights are set as follows:

    w_{Xj,Yk} = −A δ_{XY} (1 − δ_{jk})               (inhibitory connections within each row)
              − B δ_{jk} (1 − δ_{XY})                (inhibitory connections within each column)
              − C                                    (global inhibition)
              − D d_{XY} ( δ_{k,j+1} + δ_{k,j−1} )   (data term),

where δ_{jk} = 1 if j = k and 0 otherwise. Finally, each neuron has an external bias input Cn.
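The constraint part of the energy can be checked directly: a valid tour matrix, with exactly one active neuron per row and per column, makes it vanish. A minimal sketch of E1 (our own function; the constants default to 1 for illustration):

```python
import numpy as np

def constraint_energy(y, A=1.0, B=1.0, C=1.0):
    """Constraint terms of the Hopfield-Tank energy for an n x n tour matrix y.

    Zero exactly when y has at most one active neuron per row, at most one
    per column, and n active neurons in total.
    """
    n = y.shape[0]
    rows = sum(y[X, j] * y[X, k] for X in range(n)
               for j in range(n) for k in range(n) if k != j)
    cols = sum(y[X, j] * y[Y, j] for j in range(n)
               for X in range(n) for Y in range(n) if Y != X)
    total = (y.sum() - n) ** 2
    return A / 2 * rows + B / 2 * cols + C / 2 * total
```

A permutation matrix (a valid tour) yields zero; any matrix violating the row, column, or total constraints yields a positive penalty.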

Discussion

Although this application is interesting from a theoretical point of view, the applicability is limited. Whereas Hopfield and Tank state that, in a ten-city tour, the network frequently converges to a valid solution, many of which are optimal, other reports show less encouraging results. For example, Wilson and Pawley (Wilson & Pawley) find that in only a small fraction of the runs a valid result is obtained, few of which lead to an optimal or near-optimal solution. The main problem is the lack of global information. Since, for an N-city problem, there are N! possible tours, each of which may be traversed in two directions as well as started in N points, the number of different tours is N!/2N. Differently put, the N-dimensional hypercube in which the solutions are situated is 2N-fold degenerate. The degenerate solutions occur evenly within the hypercube, such that all but one of the final 2N configurations are redundant. The competition between the degenerate tours often leads to solutions which are piecewise optimal, but globally inefficient.

Boltzmann machines

The Boltzmann machine, as first described by Ackley, Hinton, and Sejnowski in 1985 (Ackley, Hinton, & Sejnowski), is a neural network that can be seen as an extension of Hopfield networks to include hidden units, and with a stochastic instead of a deterministic update rule. The weights are still symmetric. The operation of the network is based on the physics principle of annealing. This is a process whereby a material is heated and then cooled very, very slowly to a freezing point. As a result, the crystal lattice will be highly ordered, without any impurities, such that the system is in a state of very low energy. In the Boltzmann machine this system is mimicked by changing the deterministic update rule into a stochastic update, in which a neuron becomes active with a probability p:

    p(y_k = +1) = 1 / ( 1 + e^(−ΔE_k / T) ),

where T is a parameter comparable with the (synthetic) temperature of the system. This stochastic activation function is not to be confused with neurons having a sigmoid deterministic activation function.
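The role of the temperature is easy to see numerically: a sketch of the acceptance probability above (our own helper), where ΔE_k is the energy gap for unit k.

```python
import numpy as np

def prob_active(delta_E, T):
    """p(y_k = +1) = 1 / (1 + exp(-delta_E / T))."""
    return 1.0 / (1.0 + np.exp(-delta_E / T))
```

At low T the rule approaches the hard threshold of the deterministic Hopfield update, while at high T the unit fires nearly at random (p close to 1/2) regardless of the energy gap.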

In accordance with a physical system obeying a Boltzmann distribution, the network will eventually reach 'thermal equilibrium' and the relative probability of two global states α and β will follow the Boltzmann distribution

    P_α / P_β = e^(−(E_α − E_β) / T),

where P_α is the probability of being in the α-th global state, and E_α is the energy of that state. Note that at thermal equilibrium the units still change state, but the probability of finding the network in any global state remains constant.

At low temperatures there is a strong bias in favour of states with low energy, but the time required to reach equilibrium may be long. At higher temperatures the bias is not so favourable, but equilibrium is reached faster. A good way to beat this trade-off is to start at a high temperature and gradually reduce it. At high temperatures, the network will ignore small energy differences and will rapidly approach equilibrium. In doing so, it will perform a search of the coarse overall structure of the space of global states, and will find a good minimum at that coarse level. As the temperature is lowered, it will begin to respond to smaller energy differences and will find one of the better minima within the coarse-scale minimum it discovered at high temperature.

As in multi-layer perceptrons, the Boltzmann machine consists of a non-empty set of visible units and a possibly empty set of hidden units. Here, however, the units are binary-valued and are updated stochastically and asynchronously. The simplicity of the Boltzmann distribution leads to a simple learning procedure, which adjusts the weights so as to use the hidden units in an optimal way (Ackley et al.). This algorithm works as follows.

First, the input and output vectors are clamped. The network is then annealed until it approaches thermal equilibrium at a temperature of 0. It then runs for a fixed time at equilibrium, and each connection measures the fraction of the time during which both the units it connects are active. This is repeated for all input-output pairs, so that each connection can measure ⟨y_j y_k⟩^clamped, the expected probability, averaged over all cases, that units j and k are simultaneously active at thermal equilibrium when the input and output vectors are clamped.


Similarly, ⟨y_j y_k⟩^free is measured when the output units are not clamped but determined by the network.

In order to determine optimal weights in the network, an error function must be determined. Now, the probability P^free(y^p) that the visible units are in state y^p when the system is running freely can be measured. Also, the desired probability P^clamped(y^p) that the visible units are in state y^p is determined by clamping the visible units and letting the network run. Now, if the weights in the network are correctly set, both probabilities are equal to each other, and the error E in the network must be 0. Otherwise, the error must have a positive value measuring the discrepancy between the network's internal mode and the environment. For this effect, the 'asymmetric divergence' or 'Kullback information' is used:

    E = Σ_p P^clamped(y^p) log [ P^clamped(y^p) / P^free(y^p) ].

Now, in order to minimise E using gradient descent, we must change the weights according to

    Δw_jk = −γ ∂E/∂w_jk.

It is not difficult to show that

    ∂E/∂w_jk = −(1/T) [ ⟨y_j y_k⟩^clamped − ⟨y_j y_k⟩^free ].

Therefore, each weight is updated by

    Δw_jk = γ [ ⟨y_j y_k⟩^clamped − ⟨y_j y_k⟩^free ].
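Given states sampled at thermal equilibrium in the clamped and free phases, the weight update above reduces to a difference of co-activation averages. A minimal sketch (our own helper; states are rows of 0/1 unit values):

```python
import numpy as np

def boltzmann_weight_update(clamped_states, free_states, gamma=0.1):
    """Delta w_jk = gamma * (<y_j y_k>_clamped - <y_j y_k>_free).

    Each argument is a (samples x units) array of binary states collected
    at equilibrium; the angle brackets become sample averages.
    """
    clamped = np.asarray(clamped_states, dtype=float)
    free = np.asarray(free_states, dtype=float)
    co_clamped = clamped.T @ clamped / len(clamped)  # <y_j y_k> clamped
    co_free = free.T @ free / len(free)              # <y_j y_k> free
    return gamma * (co_clamped - co_free)
```

Weights grow between units that co-occur more often in the clamped phase than in the free phase, and shrink in the opposite case.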


Self-Organising Networks

In the previous chapters we discussed a number of networks which were trained to perform a mapping F: ℜ^n → ℜ^m by presenting the network 'examples' (x^p, d^p), with d^p = F(x^p), of this mapping. However, problems exist where such training data, consisting of input and desired output pairs, are not available, but where the only information is provided by a set of input patterns x^p. In these cases, the relevant information has to be found within the (redundant) training samples x^p.

Some examples of such problems are:

- clustering: the input data may be grouped in 'clusters', and the data processing system has to find these inherent clusters in the input data. The output of the system should give the cluster label of the input pattern (discrete output);

- vector quantisation: this problem occurs when a continuous space has to be discretised. The input of the system is the n-dimensional vector x; the output is a discrete representation of the input space. The system has to find an optimal discretisation of the input space;

- dimensionality reduction: the input data are grouped in a subspace which has lower dimensionality than the dimensionality of the data. The system has to learn an optimal mapping, such that most of the variance in the input data is preserved in the output data;

- feature extraction: the system has to extract features from the input signal. This often means a dimensionality reduction as described above.

In this chapter we discuss a number of neuro-computational approaches for these kinds of problems. Training is done without the presence of an external teacher. The unsupervised weight adapting algorithms are usually based on some form of global competition between the neurons.

There are very many types of self-organising networks, applicable to a wide area of problems. One of the most basic schemes is competitive learning, as proposed by Rumelhart and Zipser (Rumelhart & Zipser). A very similar network, but with different emergent properties, is the topology-conserving map devised by Kohonen. Other self-organising networks are ART, proposed by Carpenter and Grossberg (Carpenter & Grossberg; Grossberg), and Fukushima's cognitron (Fukushima).

Competitive learning

Clustering

Competitive learning is a learning procedure that divides a set of input patterns into clusters that are inherent to the input data. A competitive learning network is provided only with input vectors x, and thus implements an unsupervised learning procedure. We will show its equivalence to a class of 'traditional' clustering algorithms shortly. Another important use of these networks is vector quantisation, as discussed later in this chapter.

Figure: A simple competitive learning network. Each of the four outputs o is connected to all inputs i with weights w_io.

An example of a competitive learning network is shown in figure. All output units o are connected to all input units i with weights w_io. When an input pattern x is presented, only a single output unit of the network (the winner) will be activated. In a correctly trained network, all x in one cluster will have the same winner. For the determination of the winner and the corresponding learning rule, two methods exist.

Winner selection: dot product

For the time being, we assume that both input vectors x and weight vectors w_o are normalised to unit length. Each output unit o calculates its activation value y_o according to the dot product of input and weight vector:

    y_o = \sum_i w_{io} x_i = w_o^T x.

In a next pass, output neuron k is selected with maximum activation:

    \forall o \neq k : \quad y_o \leq y_k.

Activations are reset such that y_k = 1 and y_{o \neq k} = 0. This is the competitive aspect of the network, and we refer to the output layer as the winner-take-all layer. The winner-take-all layer is usually implemented in software by simply selecting the output neuron with highest activation value. This function can also be performed by a neural network known as MAXNET (Lippmann). In MAXNET, all neurons o are connected to the other units o' with inhibitory links and to themselves with an excitatory link:

    w_{oo'} = \begin{cases} +1 & \text{if } o = o', \\ -\epsilon & \text{otherwise.} \end{cases}

It can be shown that this network converges to a situation where only the neuron with highest initial activation survives, whereas the activations of all other neurons converge to zero. From now on, we will simply assume a winner k is selected, without being concerned which algorithm is used.

Once the winner k has been selected, the weights are updated according to

    w_k(t+1) = \frac{ w_k(t) + \gamma\,( x(t) - w_k(t) ) }{ \| w_k(t) + \gamma\,( x(t) - w_k(t) ) \| },

where the divisor ensures that all weight vectors w are normalised. Note that only the weights of winner k are updated.
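As an illustration, the dot-product winner selection and the normalising weight update can be sketched in a few lines of Python (a minimal sketch; the function names and the value gamma = 0.1 are our own choices, not from the text):

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(vi * vi for vi in v))
    return [vi / n for vi in v]

def winner_dot(weights, x):
    """Return the index k of the unit with maximum activation y_o = w_o . x."""
    activations = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    return max(range(len(weights)), key=lambda o: activations[o])

def update_dot(weights, x, gamma=0.1):
    """Move the winning weight vector towards x; the explicit renormalisation
    plays the role of the divisor in the update rule above."""
    k = winner_dot(weights, x)
    moved = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k], x)]
    weights[k] = normalise(moved)
    return k
```

Repeated calls of `update_dot` with samples from one cluster rotate the winning unit-length weight vector towards that cluster.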

The weight update given above effectively rotates the weight vector w_o towards the input vector x. Each time an input x is presented, the weight vector closest to this input is selected and is subsequently rotated towards the input. Consequently, weight vectors are rotated towards those areas where many inputs appear: the clusters in the input. This procedure is visualised in the figure below.

Figure: Example of clustering in 3D with normalised vectors, which all lie on the unit sphere. The three weight vectors are rotated towards the centres of gravity of the three different input clusters.

Figure: Determining the winner in a competitive learning network. a. Three normalised vectors. b. The three vectors having the same directions as in a., but with different lengths. In a., vectors x and w_1 are nearest to each other, and their dot product x^T w_1 = |x| |w_1| \cos\alpha is larger than the dot product of x and w_2. In b., however, the pattern and weight vectors are not normalised, and in this case w_2 should be considered the winner when x is applied. However, the dot product x^T w_1 is still larger than x^T w_2.

Winner selection: Euclidean distance

Previously it was assumed that both inputs x and weight vectors w were normalised. Using the activation function given above gives a 'biologically plausible' solution. The figure above shows how the algorithm would fail if unnormalised vectors were to be used. Naturally, one would like to accommodate the algorithm for unnormalised input data. To this end, the winning neuron k is selected with its weight vector w_k closest to the input pattern x, using the Euclidean distance measure:

    k : \quad \| w_k - x \| \leq \| w_o - x \| \quad \forall o.

It is easily checked that this selection reduces to the dot-product rules given earlier if all vectors are normalised. The Euclidean distance norm is therefore a more general case of those equations. Instead of rotating the weight vector towards the input, as performed by the normalised update, the weight update must be changed to implement a shift towards the input:

    w_k(t+1) = w_k(t) + \gamma\,( x(t) - w_k(t) ).

Again, only the weights of the winner are updated.
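For unnormalised data, the Euclidean winner selection and the shift update can be sketched as follows (a minimal illustration; names and the learning rate are our own choices):

```python
def winner_euclidean(weights, x):
    """Select k such that ||w_k - x|| <= ||w_o - x|| for all o."""
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda o: dist2(weights[o]))

def shift_towards(weights, x, gamma=0.1):
    """Shift (rather than rotate) the winning weight vector towards x."""
    k = winner_euclidean(weights, x)
    weights[k] = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k], x)]
    return k
```

Note that no normalisation is needed anywhere: comparing squared distances selects the same winner as comparing distances.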

A point of attention in these recursive clustering techniques is the initialisation. Especially if the input vectors are drawn from a large or high-dimensional input space, it is not beyond imagination that a randomly initialised weight vector w_o will never be chosen as the winner, and will thus never be moved and never be used. Therefore, it is customary to initialise weight vectors to a set of input patterns {x} drawn from the input set at random. Another, more thorough approach that avoids these and other problems in competitive learning is called leaky learning. This is implemented by expanding the weight update given above with

    w_l(t+1) = w_l(t) + \gamma'\,( x(t) - w_l(t) ) \quad \forall l \neq k,

with \gamma' \ll \gamma the leaky learning rate. A somewhat similar method is known as frequency sensitive competitive learning (Ahalt, Krishnamurthy, Chen, & Melton). In this algorithm, each neuron records the number of times it is selected winner. The more often it wins, the less sensitive it becomes to competition. Conversely, neurons that consistently fail to win increase their chances of being selected winner.
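The leaky variant can be sketched by letting every losing unit take a much smaller step than the winner (a sketch; the rates gamma and gamma_leak are invented for the illustration):

```python
def leaky_update(weights, x, gamma=0.1, gamma_leak=0.005):
    """Winner takes a full step towards x; every other unit takes a small
    'leaky' step, so no unit can stay unused forever."""
    k = min(range(len(weights)),
            key=lambda o: sum((wi - xi) ** 2 for wi, xi in zip(weights[o], x)))
    for l in range(len(weights)):
        g = gamma if l == k else gamma_leak
        weights[l] = [wi + g * (xi - wi) for wi, xi in zip(weights[l], x)]
    return k
```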

Cost function

Earlier it was claimed that a competitive network performs a clustering process on the input data, i.e., input patterns are divided in disjoint clusters, such that similarities between input patterns in the same cluster are much bigger than similarities between inputs in different clusters. Similarity is measured by a distance function on the input vectors, as discussed before. A common criterion to measure the quality of a given clustering is the square error criterion, given by

    E = \frac{1}{2} \sum_p \| w_k - x^p \|^2,

where k is the winning neuron when input x^p is presented. The weights w are interpreted as cluster centres. It is not difficult to show that competitive learning indeed seeks to find a minimum for this square error by following the negative gradient of the error function:

Theorem. The error function for pattern x^p,

    E^p = \frac{1}{2} \sum_i ( w_{ki} - x_i^p )^2,

where k is the winning unit, is minimised by the competitive weight update rule.

Proof. As in the derivation of the delta rule, we calculate the effect of a weight change on the error function. So we have that

    \Delta_p w_{io} = -\gamma\, \frac{\partial E^p}{\partial w_{io}},

where \gamma is a constant of proportionality. Now we have to determine the partial derivative of E^p:

    \frac{\partial E^p}{\partial w_{io}} = \begin{cases} w_{io} - x_i^p & \text{if unit } o \text{ wins,} \\ 0 & \text{otherwise,} \end{cases}

such that

    \Delta_p w_{io} = \gamma\,( x_i^p - w_{io} ),

which is the competitive weight update written down for one element of w_o. Therefore, the square error criterion is minimised by repeated weight updates using this rule.

An almost identical process of moving cluster centres is used in a large family of conventional clustering algorithms, known as square error clustering methods, e.g., k-means, FORGY, ISODATA, CLUSTER.
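The descent property can be checked numerically; below is a minimal sketch in Python (the cluster data, initial weights, learning rate and epoch count are invented for the illustration):

```python
def winner(weights, x):
    """Index of the weight vector closest to x (Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda o: sum((wi - xi) ** 2 for wi, xi in zip(weights[o], x)))

def square_error(weights, data):
    """E = 1/2 sum_p ||w_k - x^p||^2, with k the winner for pattern x^p."""
    total = 0.0
    for x in data:
        k = winner(weights, x)
        total += 0.5 * sum((wi - xi) ** 2 for wi, xi in zip(weights[k], x))
    return total

def train(weights, data, gamma=0.1, epochs=50):
    """Repeated competitive updates: each pattern drags its winner towards it."""
    for _ in range(epochs):
        for x in data:
            k = winner(weights, x)
            weights[k] = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k], x)]

# two clusters, around (0,0) and (1,1)
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
weights = [[0.5, 0.4], [0.4, 0.5]]
e_before = square_error(weights, data)
train(weights, data)
e_after = square_error(weights, data)
```

After training, the square error has dropped and the two weight vectors have moved to the two cluster centres.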

Example

In the figure below, clusters of data points are depicted. A competitive learning network using Euclidean distance to select the winner was initialised with all weight vectors w_o = 0. The network was trained with a small learning rate, and the positions of the weights after a number of iterations are shown.

Figure: Competitive learning for clustering data. The positions of the weight vectors after a number of iterations are marked by 'o'.

Vector quantisation

Another important use of competitive learning networks is found in vector quantisation. A vector quantisation scheme divides the input space in a number of disjoint subspaces, and represents each input vector x by the label of the subspace it falls into (i.e., index k of the winning neuron). The difference with clustering is that we are not so much interested in finding clusters of similar data, but more in quantising the entire input space. The quantisation performed by the competitive learning network is said to 'track the input probability density function': the density of neurons (and thus subspaces) is highest in those areas where inputs are most likely to appear, whereas a more coarse quantisation is obtained in those areas where inputs are scarce. An example of tracking the input density is sketched in the figure below: vector quantisation through competitive learning results in a more fine-grained discretisation in those areas of the input space where most inputs have occurred in the past.

Figure: This figure visualises the tracking of the input density. The input patterns are drawn from a subspace, in which the weight vectors also lie. In the areas where inputs are scarce (the upper part of the figure), only few (in this case two) neurons are used to discretise the input space. Thus, the upper part of the input space is divided into two large, separate regions. The lower part, however, where many more inputs have occurred, is discretised by five neurons into five smaller subspaces.

In this way, competitive learning can be used in applications where data has to be compressed, such as telecommunication or storage. However, competitive learning has also been used in combination with supervised learning methods, and applied to function approximation problems or classification problems. We will describe two examples: the counterpropagation method and learning vector quantisation.

Counterpropagation

In a large number of applications, networks that perform vector quantisation are combined with another type of network, in order to perform function approximation. An example of such a network is given in the figure below. This network can approximate a function

    f : \mathbb{R}^n \to \mathbb{R}^m

by associating with each neuron o a function value [ w_{1o}, w_{2o}, \ldots, w_{mo} ]^T, which is somehow representative for the function values f(x) of inputs x represented by o. This way of approximating a function effectively implements a 'look-up table': an input x is assigned to a table entry k with \forall o \neq k : \| x - w_k \| \leq \| x - w_o \|, and the function value [ w_{1k}, w_{2k}, \ldots, w_{mk} ]^T in this table entry is taken as an approximation of f(x). A well-known example of such a network is the Counterpropagation network (Hecht-Nielsen).

Figure: A network combining a vector quantisation layer (weights w_ih) with a one-layer feed-forward neural network (weights w_ho). This network can be used to approximate functions from \mathbb{R}^n to \mathbb{R}^m; the input space \mathbb{R}^n is discretised in disjoint subspaces.

Depending on the application, one can choose to perform the vector quantisation before learning the function approximation, or one can choose to learn the quantisation and the approximation layer simultaneously. As an example of the latter, the network presented in the figure above can be supervisedly trained in the following way:

1. present the network with both input x and function value d = f(x);

2. perform the unsupervised quantisation step: for each weight vector, calculate the distance from the weight vector to the input pattern, and find winner k; update the weights w_ih with the shift rule given earlier;

3. perform the supervised approximation step:

    w_{ko}(t+1) = w_{ko}(t) + \gamma\,( d_o - w_{ko}(t) ).

This is simply the \delta-rule with y_o = \sum_h y_h w_{ho} = w_{ko} when k is the winning neuron, and the desired output is given by d = f(x).

If we define a function g(x, k) as

    g(x, k) = \begin{cases} 1 & \text{if } k \text{ is winner,} \\ 0 & \text{otherwise,} \end{cases}

it can be shown that this learning procedure converges to

    w_{ho} = \int_{\mathbb{R}^n} y_o\, g(x, h)\, p(x)\, dx,

i.e., each table entry converges to the mean function value over all inputs in the subspace represented by that table entry. As we have seen before, the quantisation scheme tracks the input probability density function, which results in a better approximation of the function in those areas where input is most likely to appear.
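The simultaneous training scheme above can be sketched as a single combined step per sample (a minimal sketch; names, the data and the learning rate are invented for the illustration):

```python
def counterprop_step(weights, table, x, d, gamma=0.1):
    """One combined training step: unsupervised quantisation of x, followed by
    a supervised delta-rule update of the winning table entry towards d."""
    k = min(range(len(weights)),
            key=lambda o: sum((wi - xi) ** 2 for wi, xi in zip(weights[o], x)))
    # unsupervised step: shift the winning quantisation vector towards x
    weights[k] = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k], x)]
    # supervised step: move table entry k towards the desired output d
    table[k] = [ti + gamma * (di - ti) for ti, di in zip(table[k], d)]
    return k

# approximate f(x) = x on [0,1] with a two-entry look-up table
weights = [[0.25], [0.75]]
table = [[0.0], [0.0]]
for _ in range(200):
    for x in (0.2, 0.3, 0.7, 0.8):
        counterprop_step(weights, table, [x], [x])
```

Each table entry converges to (roughly) the mean function value of the inputs in its subspace, as the convergence result above predicts.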

Not all functions are represented accurately by this combination of quantisation and approximation layers. E.g., a simple identity or combinations of sines and cosines are much better approximated by multilayer back-propagation networks, if the activation functions are chosen appropriately. However, if we expect our input to be (a subspace of) a high-dimensional input space \mathbb{R}^n, and we expect our function f to be discontinuous at numerous points, the combination of quantisation and approximation is not uncommon and probably very efficient. Of course, this combination extends much further than the presented single-layer competitive learning network combined with a single-layer feed-forward network. The latter could be replaced by a reinforcement learning procedure (see the chapter on reinforcement learning). The quantisation layer can be replaced by various other quantisation schemes, such as Kohonen networks (see below) or octree methods (Jansen, Smagt, & Groen). In fact, various modern statistical function approximation methods (CART, MARS: Breiman, Friedman, Olshen, & Stone; Friedman) are based on this very idea, extended with the possibility to have the approximation layer influence the quantisation layer (e.g., to obtain a better, or locally more fine-grained, quantisation). Recent research (Rosen, Goodwin, & Vidal) also investigates in this direction.

Learning Vector Quantisation

It is an unpleasant habit in neural network literature to also cover Learning Vector Quantisation (LVQ) methods in chapters on unsupervised clustering. Granted that these methods also perform a clustering or quantisation task and use similar learning rules, they are trained supervisedly, and perform discriminant analysis rather than unsupervised clustering. These networks attempt to define 'decision boundaries' in the input space, given a large set of exemplary decisions (the training set); each decision could, e.g., be a correct class label.

A rather large number of slightly different LVQ methods is appearing in recent literature. They are all based on the following basic algorithm:

1. with each output neuron o, a class label (or decision of some other kind) y_o is associated;

2. a learning sample consists of input vector x^p together with its correct class label y_o^p;

3. using distance measures between weight vectors w_o and input vector x^p, not only the winner k_1 is determined, but also the second best k_2:

    \| x^p - w_{k_1} \| < \| x^p - w_{k_2} \| < \| x^p - w_o \| \quad \forall o \neq k_1, k_2;

4. the labels y_{k_1} and y_{k_2} are compared with the desired label d^p. The weight update rule given earlier is used selectively, based on this comparison.

An example of the last step is given by the LVQ2 algorithm by Kohonen (Kohonen), using the following strategy:

    if y_{k_1} \neq d^p and d^p = y_{k_2},
    and x^p lies close to the decision boundary between w_{k_1} and w_{k_2},
    then w_{k_2}(t+1) = w_{k_2}(t) + \gamma\,( x^p - w_{k_2}(t) )
    and  w_{k_1}(t+1) = w_{k_1}(t) - \gamma\,( x^p - w_{k_1}(t) ),

i.e., w_{k_2} with the correct label is moved towards the input vector, while w_{k_1} with the incorrect label is moved away from it.
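One such update step can be sketched as follows (a sketch only: the 'window' test that restricts updates to inputs near the decision boundary is omitted, and all names are our own choices):

```python
def lvq2_step(weights, labels, x, d, gamma=0.1):
    """One LVQ2-style update: find winner k1 and runner-up k2, then update
    only if the winner carries the wrong label and the runner-up the right one.
    The boundary-window test of full LVQ2 is omitted in this sketch."""
    order = sorted(range(len(weights)),
                   key=lambda o: sum((wi - xi) ** 2 for wi, xi in zip(weights[o], x)))
    k1, k2 = order[0], order[1]
    if labels[k1] != d and labels[k2] == d:
        # correct unit attracted, incorrect winner repelled
        weights[k2] = [wi + gamma * (xi - wi) for wi, xi in zip(weights[k2], x)]
        weights[k1] = [wi - gamma * (xi - wi) for wi, xi in zip(weights[k1], x)]
    return k1
```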

The new LVQ algorithms that are emerging all use different implementations of these different steps, e.g., how to define class labels y_o, how many next-best winners are to be determined, how to adapt the number of output neurons, and how to selectively use the weight update rule.

Kohonen network

The Kohonen network (Kohonen) can be seen as an extension to the competitive learning network, although this is chronologically incorrect. Also, the Kohonen network has a different set of applications.

In the Kohonen network, the output units in S are ordered in some fashion, often in a two-dimensional grid or array, although this is application-dependent. The ordering, which is chosen by the user, determines which output neurons are neighbours.

Now, when learning patterns are presented to the network, the weights to the output units are thus adapted such that the order present in the input space \mathbb{R}^N is preserved in the output, i.e., the neurons in S. This means that learning patterns which are near to each other in the input space (where 'near' is determined by the distance measure used in finding the winning unit) must be mapped on output units which are also near to each other, i.e., the same or neighbouring units. Thus, if inputs are uniformly distributed in \mathbb{R}^N and the order must be preserved, the dimensionality of S must be at least N. The mapping, which represents a discretisation of the input space, is said to be topology preserving. However, if the inputs are restricted to a subspace of \mathbb{R}^N, a Kohonen network of lower dimensionality can be used. For example, data on a two-dimensional manifold in a high-dimensional input space can be mapped onto a two-dimensional Kohonen network, which can, for example, be used for visualisation of the data.

(Of course, variants have been designed which automatically generate the structure of the network (Martinetz & Schulten; Fritzke).)

N

Usually the learning patterns are random samples from At time t a sample x t is

generated and presented to the network Using the same formulas as in section the winning

unit k is determined Next the weights to this winning unit as well as its neighb ours are adapted

using the learning rule

w t w tgo k x t w t o S

o o o

Here g o k is a decreasing function of the griddistance between units o and k such that

g k k For example for g a Gaussian function can b e used such that in one dimension

see gure Due to this collective learning scheme input signals g o k exp o k

1 0.755 2 h(i,k) 0.55 0.2525 1 0 -2 0 -1 0 -1 1

2 -2

Figure Gaussian neuron distance function g In this case g is shown for atwodimensional

grid b ecause it lo oks nice

which are near to each other will be mapp ed on neighb ouring neurons Thus the top ology

inherently present in the input signals will be preserved in the mapping such as depicted in gure

Figure: A topology-conserving map converging. The weight vectors of a network with two inputs and output neurons arranged in a planar grid are shown at iterations 0, 200, 600 and 1900. A line in each figure connects each weight vector with those of its grid neighbours. The leftmost figure shows the initial weights; the rightmost shows the map when it is almost completely formed.

If the intrinsic dimensionality of S is less than N, the neurons in the network are 'folded' in the input space, such as depicted in the figure below.
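The collective learning rule can be sketched for a one-dimensional chain of units (a minimal sketch; the Gaussian width sigma and the learning rate are our own choices):

```python
import math

def kohonen_step(weights, x, gamma=0.5, sigma=1.0):
    """Update a one-dimensional Kohonen map: the winner k and its grid
    neighbours move towards x, weighted by a Gaussian g(o, k) with g(k, k) = 1."""
    k = min(range(len(weights)),
            key=lambda o: sum((wi - xi) ** 2 for wi, xi in zip(weights[o], x)))
    for o in range(len(weights)):
        g = math.exp(-((o - k) ** 2) / (2.0 * sigma ** 2))  # grid distance, not input distance
        weights[o] = [wi + gamma * g * (xi - wi) for wi, xi in zip(weights[o], x)]
    return k
```

Because g() depends on the grid distance |o - k| rather than on the input-space distance, neighbouring units are dragged along with the winner, which is what preserves the topology.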

Figure: The mapping of a two-dimensional input space on a one-dimensional Kohonen network.

The topology-conserving quality of this network has many counterparts in biological brains. The brain is organised in many places so that aspects of the sensory environment are represented in the form of two-dimensional maps. For example, in the visual system, there are several topographic mappings of visual space onto the surface of the visual cortex. There are organised mappings of the body surface onto the cortex in both motor and somatosensory areas, and tonotopic mappings of frequency in the auditory cortex. The use of topographic representations, where some important aspect of a sensory modality is related to the physical locations of the cells on a surface, is so common that it obviously serves an important information processing function.

It does not come as a surprise, therefore, that already many applications of the Kohonen topology-conserving maps have been devised. Kohonen himself has successfully used the network for phoneme recognition (Kohonen, Makisara, & Saramaki). Also, the network has been used to merge sensory data from different kinds of sensors, such as auditory and visual, 'looking' at the same scene (Gielen, Krommenhoek, & Gisbergen). Yet another application is in robotics, such as shown in a later section.

To explain the plausibility of a similar structure in biological networks, Kohonen remarks that the lateral inhibition between the neurons could be obtained via efferent connections between those neurons. In one dimension, those connection strengths form a 'Mexican hat' (see the figure below).

Figure: Mexican hat. Lateral interaction around the winning neuron as a function of distance: excitation to nearby neurons, inhibition to farther-off neurons.

Principal component networks

Introduction

The networks presented in the previous sections can be seen as (nonlinear) vector transformations which map an input vector to a number of binary output elements or neurons. The weights are adjusted in such a way that they could be considered as 'prototype' vectors (vectorial means) for the input patterns for which the competing neuron wins.

The self-organising transform described in this section rotates the input space in such a way that the values of the output neurons are as uncorrelated as possible, and the energy or variances of the patterns is mainly concentrated in a few output neurons. An example is shown in the figure below.

Figure: Distribution of input samples.

The two-dimensional samples (x_1, x_2) are plotted in the figure. It can easily be seen that x_1 and x_2 are related, such that if we know x_1 we can make a reasonable prediction of x_2, and vice versa, since the points are centered around the line x_1 = x_2. If we rotate the axes over \pi/4, we get the (e_1, e_2) axes as plotted in the figure. Here the conditional prediction has no use, because the points have uncorrelated coordinates. Another property of this rotation is that the variance or energy of the transformed patterns is maximised on a lower dimension. This can be intuitively verified by comparing the spreads (d_{x_1}, d_{x_2}) and (d_{e_1}, d_{e_2}) in the figures. After the rotation, the variance of the samples is large along the e_1 axis and small along the e_2 axis.

This transform is very closely related to the eigenvector transformation known from image processing, where the image has to be coded or transformed to a lower dimension and reconstructed again by another transform as well as possible (see the section on image processing applications).

The next section describes a learning rule which acts as a Hebbian learning rule, but which scales the vector length to unity. In the subsequent section we will see that a linear neuron with a normalised Hebbian learning rule acts as such a transform, extending the theory in the last section to multi-dimensional outputs.

Normalised Hebbian rule

The model considered here consists of one linear neuron with input weights w. The output y_o(t) of this neuron is given by the usual inner product of its weight w and the input vector x:

    y_o(t) = w(t)^T x(t).

As seen in the previous sections, all models are based on a kind of Hebbian learning. However, the basic Hebbian rule would make the weights grow uninhibitedly if there were correlation in the input patterns. This can be overcome by normalising the weight vector to a fixed length, typically 1, which leads to the following learning rule:

    w(t+1) = \frac{ w(t) + \gamma\, y(t)\, x(t) }{ L\big( w(t) + \gamma\, y(t)\, x(t) \big) },

where L(\cdot) indicates an operator which returns the vector length, and \gamma is a small learning parameter. Compare this learning rule with the normalised learning rule of competitive learning: there the delta rule was normalised; here the standard Hebb rule is.

Now the operator which computes the vector length, the norm of the vector, can be approximated by a Taylor expansion around \gamma = 0:

    L\big( w(t) + \gamma\, y(t)\, x(t) \big) = 1 + \gamma \left. \frac{\partial L}{\partial \gamma} \right|_{\gamma = 0} + O(\gamma^2).

When we substitute this expression for the vector length in the learning rule above, it resolves for small \gamma to

    w(t+1) = \big( w(t) + \gamma\, y(t)\, x(t) \big) \left( 1 - \gamma \left. \frac{\partial L}{\partial \gamma} \right|_{\gamma = 0} + O(\gamma^2) \right).

Since \left. \partial L / \partial \gamma \right|_{\gamma = 0} = y^2(t), discarding the higher-order terms of \gamma leads to

    w(t+1) = w(t) + \gamma\, y(t)\big( x(t) - y(t)\, w(t) \big),

which is called the 'Oja learning rule' (Oja). This learning rule thus modifies the weight in the usual Hebbian sense (the first product term is the Hebb rule y_o(t) x(t)), but normalises its weight vector directly by the second product term, -y_o(t)\, y_o(t)\, w(t). What exactly does this learning rule do with the weight vector?
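Before answering analytically, the Oja rule can be tried out numerically (a sketch; the sample set and learning rate are invented, and the samples are constructed so that their principal direction lies along (1,1)):

```python
def oja_step(w, x, gamma=0.01):
    """One Oja update: Hebb term gamma*y*x plus the normalising term
    -gamma*y^2*w, which keeps ||w|| close to 1."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + gamma * y * (xi - y * wi) for wi, xi in zip(w, x)]

# zero-mean samples whose variance is concentrated along the (1,1) direction
samples = [[1.0, 1.0], [-1.0, -1.0], [0.9, 1.1], [-1.1, -0.9]]
w = [1.0, 0.0]
for _ in range(1000):
    for x in samples:
        w = oja_step(w, x)
```

After training, w has (approximately) unit length and points along the direction of largest variance, anticipating the analysis in the next section.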

Principal component extractor

Remember probability theory? Consider an N-dimensional signal x(t) with

    mean \mu = E\{ x(t) \};
    correlation matrix R = E\{ x(t)\, x(t)^T \}.

In the following we assume the signal mean to be zero, so \mu = 0.

From the Oja learning rule, we see that the expectation of the weights equals

    E\{ w(t+1) \mid w(t) \} = w(t) + \gamma\,\big( R\, w(t) - ( w(t)^T R\, w(t) )\, w(t) \big),

which has a continuous counterpart

    \frac{d}{dt}\, w(t) = R\, w(t) - \big( w(t)^T R\, w(t) \big)\, w(t).

Theorem. Let the eigenvectors e_i of R be ordered with descending associated eigenvalues \lambda_i, such that \lambda_1 > \lambda_2 > \ldots > \lambda_N. With the differential equation above, the weights w(t) will converge to \pm e_1.

Proof. Since the eigenvectors of R span the N-dimensional space, the weight vector can be decomposed as

    w(t) = \sum_{i=1}^{N} \alpha_i(t)\, e_i.

Substituting this in the differential equation, and concluding the theorem, is left as an exercise.

(Footnote: remembering that 1/(1 + a) = 1 - a + O(a^2).)

More eigenvectors

In the previous section it was shown that a single neuron's weight converges to the eigenvector of the correlation matrix with maximum eigenvalue, i.e., the weight of the neuron is directed in the direction of highest energy or variance of the input patterns. Here, we tackle the question of how to find the remaining eigenvectors of the correlation matrix, given the first found eigenvector.

Consider the signal x, which can be decomposed into the basis of eigenvectors e_i of its correlation matrix R:

    x = \sum_{i=1}^{N} \alpha_i\, e_i.

If we now subtract the component in the direction of e_1, the direction in which the signal has the most energy, from the signal x:

    \tilde{x} = x - \alpha_1\, e_1,

we are sure that when we again decompose \tilde{x} into the eigenvector basis, the coefficient \alpha_1 = 0, simply because we just subtracted it. We call \tilde{x} the 'deflation' of x.

If now a second neuron is taught on this signal \tilde{x}, then its weights will lie in the direction of the remaining eigenvector with the highest eigenvalue. Since the deflation removed the component in the direction of the first eigenvector, the weight will converge to the remaining eigenvector with maximum eigenvalue. In the previous section, we ordered the eigenvalues in magnitude, so according to this definition, in the limit we will find e_2. We can continue this strategy and find all the N eigenvectors belonging to the signal x.

We can write the deflation in neural network terms, if we see that

    y_o = w^T x = \sum_{i=1}^{N} \alpha_i\, e_1^T e_i = \alpha_1,

since

    w = e_1.

So the deflated vector \tilde{x} equals

    \tilde{x} = x - y_o\, w.

The term subtracted from the input vector can be interpreted as a kind of back-projection or expectation. Compare this to ART, described in the next section.
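In code, the deflation is a one-line subtraction (a minimal sketch; the function name is our own, and the test below pretends the first neuron has already learned e_1 exactly):

```python
def deflate(x, w):
    """Remove from x its component along the learned (unit-length) weight
    vector w: x~ = x - y_o * w, with y_o = w^T x."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [xi - y * wi for xi, wi in zip(x, w)]
```

A second neuron trained with the Oja rule on `deflate(x, w1)` then extracts the second principal component, and so on.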

Adaptive resonance theory

The last unsupervised learning network we discuss differs from the previous networks in that it is recurrent: as with networks in the next chapter, the data is not only fed forward but also back, from output to input units.

Background: Adaptive resonance theory

Grossberg (Grossberg) introduced a model for explaining biological phenomena. The model has three crucial properties:

1. a normalisation of the total network activity. Biological systems are usually very adaptive to large changes in their environment; for example, the human eye can adapt itself to large variations in light intensities;

2. contrast enhancement of input patterns. The awareness of subtle differences in input patterns can mean a lot in terms of survival. Distinguishing a hiding panther from a resting one makes all the difference in the world! The mechanism used here is contrast enhancement;

3. short-term memory (STM) storage of the contrast-enhanced pattern. Before the input pattern can be decoded, it must be stored in the short-term memory. The long-term memory (LTM) implements an arousal mechanism (i.e., the classification), whereas the STM is used to cause gradual changes in the LTM.

The system consists of two layers, F1 and F2, which are connected to each other via the LTM (see the figure below). The input pattern is received at F1, whereas classification takes place in F2. As mentioned before, the input is not directly classified. First, a characterisation takes place by means of extracting features, giving rise to activation in the feature representation field. The expectations, residing in the LTM connections, translate the input pattern to a categorisation in the category representation field. The classification is compared to the expectation of the network, which resides in the LTM weights from F2 to F1. If there is a match, the expectations are strengthened; otherwise, the classification is rejected.

Figure: The ART architecture. The input feeds the feature representation field F1; STM activity patterns in F1 and in the category representation field F2 interact via the LTM connections.

ART1: The simplified neural network model

The ART1 simplified model consists of two layers of binary neurons (with values 0 and 1), called F1 (the comparison layer) and F2 (the recognition layer); see the figure below. Each neuron in F1 is connected to all neurons in F2 via the continuous-valued forward long term memory (LTM) W^f, and vice versa via the binary-valued backward LTM W^b. The other modules are gain 1 and gain 2 (G1 and G2), and a reset module.

Each neuron in the comparison layer receives three inputs: a component of the input pattern, a component of the feedback pattern, and a gain G1. A neuron outputs a 1 if and only if at least two of these three inputs are high: the 'two-thirds rule'.

The neurons in the recognition layer each compute the inner product of their incoming (continuous-valued) weights and the pattern sent over these connections. The winning neuron then inhibits all the other neurons via lateral inhibition.

Gain 2 is the logical 'or' of all the elements in the input pattern x.

Gain 1 equals gain 2, except when the feedback pattern from F2 contains any 1; then it is forced to zero.

Finally, the reset signal is sent to the active neuron in F2 if the input vector x and the output of F1 differ by more than some vigilance level.

Operation

The network starts by clamping the input at F1. Because the output of F2 is zero, G1 and G2 are both on, and the output of F1 matches its input.

Figure: The ART1 neural network. F2 contains M neurons and F1 contains N neurons; the layers are connected via the forward LTM W^f and the backward LTM W^b, with the gains G1 and G2 and a reset module acting on F1 and F2.

The pattern is sent to F2, and in F2 one neuron becomes active. This signal is then sent back over the backward LTM, which reproduces a binary pattern at F1. Gain 1 is inhibited, and only the neurons in F1 which receive a 'one' from both x and F2 remain active.

If there is a substantial mismatch between the two patterns, the reset signal will inhibit the neuron in F2, and the process is repeated.

Instead of following Carpenter and Grossberg's description of the system using differential equations, we use the notation employed by Lippmann (Lippmann):

1. Initialisation:

    w_{ji}^b(0) = 1,
    w_{ij}^f(0) = \frac{1}{1 + N},

where N is the number of neurons in F1, M the number of neurons in F2, 0 \leq i < N, and 0 \leq j < M. Also, choose the vigilance threshold \rho, 0 \leq \rho \leq 1.

2. Apply the new input pattern x.

3. Compute the activation values y of the neurons in F2:

    y_j = \sum_{i=1}^{N} w_{ij}^f(t)\, x_i.

4. Select the winning neuron k (0 \leq k < M).

5. Vigilance test: if

    \frac{ w_k^b(t) \cdot x }{ x \cdot x } > \rho,

where \cdot denotes inner product, go to step 7; else go to step 6. Note that w_k^b(t) \cdot x essentially is the inner product x^* \cdot x (with x^* the stored pattern), which will be large if x^* and x are near to each other.

6. Neuron k is disabled from further activity. Go to step 3.

7. Set for all l, 0 \leq l < N:

    w_{kl}^b(t+1) = w_{kl}^b(t)\, x_l,

    w_{lk}^f(t+1) = \frac{ w_{kl}^b(t)\, x_l }{ \frac{1}{2} + \sum_{i=1}^{N} w_{ki}^b(t)\, x_i }.

8. Re-enable all neurons in F2, and go to step 2.
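The eight steps above can be collected into a small simulation (a sketch of the simplified algorithm; the function name, the fixed pool of output units, and the assumption that a free unit is always available are our own choices):

```python
def art1(patterns, rho, n_out):
    """Cluster binary patterns with the simplified ART1 algorithm.
    Assumes every pattern contains at least one 1 and n_out units suffice."""
    N = len(patterns[0])
    bw = [[1] * N for _ in range(n_out)]              # backward LTM w^b, binary
    fw = [[1.0 / (1 + N)] * N for _ in range(n_out)]  # forward LTM w^f
    assignment = []
    for x in patterns:
        disabled = set()
        while True:
            # steps 3-4: F2 activations and winner selection
            k = max((j for j in range(n_out) if j not in disabled),
                    key=lambda j: sum(f * xi for f, xi in zip(fw[j], x)))
            # step 5: vigilance test  (w_k^b . x) / (x . x) > rho
            if sum(b * xi for b, xi in zip(bw[k], x)) / sum(x) > rho:
                # step 7: update backward and forward LTM of the winner
                bw[k] = [b * xi for b, xi in zip(bw[k], x)]
                norm = 0.5 + sum(bw[k])
                fw[k] = [b / norm for b in bw[k]]
                assignment.append(k)
                break
            disabled.add(k)                           # step 6
    return assignment
```

Because the backward weights can only lose ones (the update is a componentwise AND with x), a committed unit only ever narrows its stored pattern; new, dissimilar patterns recruit fresh units instead of corrupting old ones.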

The figure below shows exemplar behaviour of the network.

Figure: An example of the behaviour of the Carpenter-Grossberg network for letter patterns. The binary input patterns on the left were applied sequentially. On the right, the stored patterns (i.e., the weights of W^b for the first four output units) are shown.

ART1: The original model

In later work, Carpenter and Grossberg (Carpenter & Grossberg) present several neural network models to incorporate parts of the complete theory. We will only discuss the first model, ART1.

The network incorporates a follow-the-leader clustering algorithm (Hartigan). This algorithm tries to fit each new input pattern in an existing class. If no matching class can be found, i.e., the distance between the new pattern and all existing classes exceeds some threshold, a new class is created containing the new pattern.

The novelty in this approach is that the network is able to adapt to new incoming patterns, while the previous memory is not corrupted. In most neural networks, such as the back-propagation network, all patterns must be taught sequentially; the teaching of a new pattern might corrupt the weights for all previously learned patterns. By changing the structure of the network rather than the weights, ART1 overcomes this problem.

Normalisation

We will refer to a cell in F1 or F2 with k. Each cell k in F1 or F2 receives an input s_k and responds with an activation level y_k. In order to introduce normalisation in the model, we set I = Σ_k s_k and let the relative input intensity Θ_k = s_k / I.

So we have a model in which the change of the response y_k of an input at a certain cell k depends inhibitorily on all other inputs and the sensitivity of the cell, i.e., the surroundings of each cell have a negative influence on the cell, −y_k Σ_{l≠k} s_l. Furthermore, the response:


- has an excitatory response as far as the input at the cell is concerned, +B s_k;
- has an inhibitory response for normalisation, −y_k s_k;
- has a decay, −A y_k.

Here, A and B are constants. The differential equation for the neurons in F1 and F2 now is

dy_k/dt = −A y_k + (B − y_k) s_k − y_k Σ_{l≠k} s_l ,

with 0 ≤ y_k ≤ B because the inhibitory effect of an input can never exceed the excitatory input.

At equilibrium, when dy_k/dt = 0, and with I = Σ_k s_k, we have that

y_k (A + I) = B s_k .

Because of the definition of Θ_k = s_k / I we get

y_k = Θ_k · B I / (A + I) .

Therefore, at equilibrium y_k is proportional to Θ_k and, since

B I / (A + I) ≤ B ,

the total activity y_total = Σ_k y_k never exceeds B: it is normalised.
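The equilibrium solution can be checked numerically. The sketch below (an illustration, with arbitrary values for the constants A and B and for the inputs) computes y_k = Θ_k · BI/(A+I) and verifies that the total activity stays below B:

```python
# Numerical check of the equilibrium activation derived above:
# y_k = Theta_k * B*I / (A + I), so the total activity never exceeds B.
# The constants A, B and the inputs s are arbitrary example values.

def equilibrium_activation(s, A, B):
    I = sum(s)
    return [(s_k / I) * B * I / (A + I) for s_k in s]

s = [0.2, 0.5, 1.3]                       # arbitrary input intensities, I = 2.0
y = equilibrium_activation(s, A=1.0, B=2.0)

# the total activity equals B*I/(A+I), which is bounded by B:
assert sum(y) <= 2.0
# each y_k is proportional to its relative input intensity Theta_k:
assert abs(y[0] / s[0] - y[2] / s[2]) < 1e-12
```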

Contrast enhancement

In order to make F2 react better on differences in neuron values in F1 (or vice versa), contrast enhancement is applied: the contrasts between the neuronal values in a layer are amplified. One can show that the equation above does not suffice anymore. In order to enhance the contrasts, we chop off all the equal fractions (uniform parts) in F1 or F2. This can be done by adding an extra inhibitory input proportional to the inputs from the other cells (with a factor C):

dy_k/dt = −A y_k + (B − y_k) s_k − (y_k + C) Σ_{l≠k} s_l .

At equilibrium, when we set B = (n − 1)C, where n is the number of neurons, we have

y_k = ( n C I / (A + I) ) ( Θ_k − 1/n ) .

Now, when an input in which all the s_k are equal is given, then all the y_k are zero: the effect of C is enhancing differences. If we set B ≤ (n − 1)C, or C/(B + C) ≥ 1/n, then more of the input shall be chopped off.
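The effect of the contrast-enhanced equilibrium can likewise be illustrated (again with arbitrary values for A and C): a uniform input is suppressed completely, while deviations from the uniform level survive with their sign:

```python
# Sketch of the contrast-enhanced equilibrium derived above:
# y_k = (n*C*I/(A+I)) * (Theta_k - 1/n). Uniform inputs are chopped off
# entirely; only deviations from the mean relative intensity remain.
# A and C are arbitrary example constants.

def enhanced_activation(s, A, C):
    n, I = len(s), sum(s)
    return [(n * C * I / (A + I)) * (s_k / I - 1.0 / n) for s_k in s]

# a uniform input yields zero activation everywhere ...
assert all(abs(y) < 1e-12
           for y in enhanced_activation([0.5, 0.5, 0.5], A=1.0, C=1.0))

# ... while an above-average input becomes positive, a below-average
# input negative:
y = enhanced_activation([0.2, 0.5, 1.3], A=1.0, C=1.0)
assert y[2] > 0 > y[0]
```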

Discussion

The description of ART1 continues by defining the differential equations for the LTM. Instead of following Carpenter and Grossberg's description, we will revert to the simplified model as presented by Lippmann (Lippmann, 1987).


Reinforcement learning

In the previous chapters a number of supervised training methods have been described, in which the weight adjustments are calculated using a set of learning samples consisting of input and desired output values. However, such a set of learning examples is not always available. Often the only information is a scalar evaluation r, which indicates how well the neural network is performing. Reinforcement learning involves two subproblems. The first is that the reinforcement signal r is often delayed, since it is a result of network outputs in the past. This temporal credit assignment problem is solved by learning a "critic" network which represents a cost function J predicting future reinforcement. The second problem is to find a learning procedure which adapts the weights of the neural network such that a mapping is established which minimizes J. The two problems are discussed in the next paragraphs, respectively. The figure below shows a reinforcement-learning network interacting with a system.

The critic

The first problem is how to construct a critic which is able to evaluate system performance. If the objective of the network is to minimize a directly measurable quantity r, performance feedback is straightforward and a critic is not required. On the other hand, how is current behavior to be evaluated if the objective concerns future system performance? The performance may, for instance, be measured by the cumulative or future error. Most reinforcement learning methods (such as Barto, Sutton and Anderson (Barto, Sutton, & Anderson, 1983)) use the temporal difference (TD) algorithm (Sutton, 1988) to train the critic.

Suppose the immediate cost of the system at time step k is measured by r(x_k, u_k, k), as a function of system states x_k and control actions (network outputs) u_k. The immediate measure r is often called the external reinforcement signal, in contrast to the internal reinforcement signal of the figure. Define the performance measure J(x_k, u_k, k) of the system as a discounted

Figure Reinforcement learning scheme

CHAPTER REINFORCEMENT LEARNING

cumulative of future cost. The task of the critic is to predict the performance measure:

J(x_k, u_k, k) = Σ_{i=k..∞} γ^(i−k) r(x_i, u_i, i) ,

in which γ ∈ [0, 1] is a discount factor (usually chosen close to 1).

The relation between two successive predictions can easily be derived:

J(x_k, u_k, k) = r(x_k, u_k, k) + γ J(x_{k+1}, u_{k+1}, k+1) .

If the network is correctly trained, the relation between two successive network outputs Ĵ should be

Ĵ(x_k, u_k, k) = r(x_k, u_k, k) + γ Ĵ(x_{k+1}, u_{k+1}, k+1) .

If the network is not correctly trained, the temporal difference ε(k) between two successive predictions is used to adapt the critic network:

ε(k) = [ r(x_k, u_k, k) + γ Ĵ(x_{k+1}, u_{k+1}, k+1) ] − Ĵ(x_k, u_k, k) .

A learning rule for the weights of the critic network w_c(k), based on minimizing the temporal difference, can be derived:

Δw_c(k) = α ε(k) ∂Ĵ(x_k, u_k, k) / ∂w_c(k) ,

in which α is the learning rate.
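A minimal sketch of this critic update, assuming a table-lookup critic over discrete states (so that ∂Ĵ/∂w_c is simply 1 for the visited entry), could look as follows; all names and parameter values are illustrative:

```python
# Sketch of TD training of a critic. Here J_hat is a lookup table indexed by
# a discrete state (an assumption; the text allows any differentiable
# network). alpha is the learning rate, gamma the discount factor.

def td_update(J_hat, state, next_state, r, alpha=0.1, gamma=0.95):
    """One critic update; for a lookup table dJ/dw is 1 for the visited entry."""
    eps = r + gamma * J_hat[next_state] - J_hat[state]   # temporal difference
    J_hat[state] += alpha * eps                          # delta w_c = alpha * eps
    return eps

# a two-state chain: state 0 incurs cost 1 and moves to state 1, which is
# absorbing with cost 0, so the prediction for state 0 should approach 1:
J = [0.0, 0.0]
for _ in range(2000):
    td_update(J, 0, 1, r=1.0)
    td_update(J, 1, 1, r=0.0)
assert abs(J[0] - 1.0) < 1e-3 and abs(J[1]) < 1e-6
```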

The controller network

If the critic is capable of providing an immediate evaluation of performance, the controller network can be adapted such that the optimal relation between system states and control actions is found. Three approaches are distinguished.

In case of a finite set of actions U, all actions may virtually be executed. The action which decreases the performance criterion most is selected:

u_k = arg min_{u∈U} Ĵ(x_k, u, k) .

The RL-method with this controller is called Q-learning (Watkins & Dayan, 1992). The method approximates dynamic programming, which will be discussed in the next section.

If the performance measure J(x_k, u_k, k) is accurately predicted, then the gradient with respect to the controller command u_k can be calculated, assuming that the critic network is differentiable. If the measure is to be minimized, the weights of the controller w_r are adjusted in the direction of the negative gradient:

Δw_r(k) = −β ( ∂Ĵ(x_k, u_k, k) / ∂u(k) ) ( ∂u(k) / ∂w_r(k) ) ,

with β being the learning rate. Werbos has discussed some of these gradient-based algorithms in detail. Sofge and White (Sofge & White, 1992) applied one of the gradient-based methods to optimize a manufacturing process.


A direct approach to adapt the controller is to use the difference between the predicted and the true performance measure, as expressed by the temporal difference above. Suppose that the performance measure is to be minimized. If control actions result in negative differences, i.e., the true performance is better than was expected, then the controller has to be "rewarded". On the other hand, in case of a positive difference, the control action has to be "penalized". The idea is to explore the set of possible actions during learning and incorporate the beneficial ones into the controller. Learning in this way is related to trial-and-error learning studied by psychologists, in which behavior is selected according to its consequences.

Generally, the algorithms select actions probabilistically from a set of possible actions and update action probabilities on the basis of the evaluation feedback. Most of the algorithms are based on a look-up table representation of the mapping from system states to actions (Barto et al., 1983). Each table entry has to learn which control action is best when that entry is accessed. It may also be possible to use a parametric mapping from system states to action probabilities; Gullapalli (Gullapalli, 1990) adapted the weights of a single-layer network. In the next section, the approach of Barto et al. is described.

Barto's approach: the ASE-ACE combination

Barto, Sutton and Anderson (Barto et al., 1983) have formulated reinforcement learning as a learning strategy which does not need a set of examples provided by a "teacher". The system described by Barto explores the space of alternative input-output mappings and uses an evaluative feedback (reinforcement signal) on the consequences of the control signal (network output) on the environment. It has been shown that such reinforcement learning algorithms are implementing an on-line, incremental approximation to the dynamic programming method for optimal control, and are also called "heuristic" dynamic programming (Werbos).

The basic building blocks in the Barto network are an Associative Search Element (ASE), which uses a stochastic method to determine the correct relation between input and output, and an Adaptive Critic Element (ACE), which learns to give a correct prediction of future reward or punishment (see the figure below). The external reinforcement signal r can be generated by a special sensor (for example a collision sensor of a mobile robot) or be derived from the state vector. For example, in control applications where the state s of a system should remain in a certain part A of the control space, reinforcement is given by

r = 0 if s ∈ A,
r = −1 otherwise.

Associative search

In its most elementary form the ASE gives a binary output value y_o(t) ∈ {−1, +1} as a stochastic function of an input vector. The total input of the ASE is, similarly to the neuron presented earlier, the weighted sum of the inputs, with the exception that the bias input in this case is a stochastic variable N with mean-zero normal distribution:

s(t) = Σ_j w_Sj x_j(t) + N(t) .

The activation function F is a threshold such that

y_o(t) = +1 if s(t) > 0,
y_o(t) = −1 otherwise.



Figure Architecture of a reinforcement learning scheme with critic element

For updating the weights, a Hebbian type of learning rule is used. However, the update is weighted with the reinforcement signal r(t), and an "eligibility" e_j is defined instead of the product y_o(t) x_j(t) of input and output:

w_Sj(t+1) = w_Sj(t) + α r(t) e_j(t) ,

where α is a learning factor. The eligibility e_j is given by

e_j(t+1) = δ e_j(t) + (1 − δ) y_o(t) x_j(t) ,

with δ the decay rate of the eligibility. The eligibility is a sort of memory: e_j is high if the signals from the input state unit j and the output unit are correlated over some time.

Using r(t) in the weight update has the disadvantage that learning only takes place when there is an external reinforcement signal. Instead of r(t), usually a continuous internal reinforcement signal r̂(t), given by the ACE, is used.

Barto and Anandan (Barto & Anandan, 1985) proved convergence for the case of a single binary output unit and a set of linearly independent patterns x^p. In control applications, the input vector is the n-dimensional state vector s of the system. In order to obtain a linearly independent set of patterns x^p, often a "decoder" is used, which divides the range of each of the input variables s_i into a number of intervals. The aim is to divide the input (state) space into a number of disjunct subspaces (or "boxes", as called by Barto). The input vector can therefore only be in one subspace at a time. The decoder converts the input vector into a binary valued vector x, with only one element equal to one, indicating which subspace is currently visited. It has been shown (Kröse & Dam) that, instead of an a priori quantisation of the input space, a self-organising quantisation, based on methods described in this chapter, results in a better performance.
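Such a decoder can be sketched as follows (an illustration; the thresholds are arbitrary): each input variable is located in one of its intervals with a binary search, and the resulting "box" index is turned into a one-hot vector x:

```python
# Sketch of the 'decoder' described above: each input variable's range is
# divided into intervals by a list of thresholds, and the combination of
# intervals (a 'box') is mapped to a one-hot binary vector. The thresholds
# used below are arbitrary examples.
import bisect

def decode(state, thresholds):
    """Map a continuous state vector to a one-hot box-membership vector x."""
    index, n_boxes = 0, 1
    for s_i, t_i in zip(state, thresholds):
        # bisect finds which of the len(t_i)+1 intervals s_i falls into
        index = index * (len(t_i) + 1) + bisect.bisect(t_i, s_i)
        n_boxes *= len(t_i) + 1
    x = [0] * n_boxes
    x[index] = 1
    return x

# two variables, each split into 3 intervals -> 9 boxes, exactly one active:
x = decode([0.3, -0.7], thresholds=[[-0.5, 0.5], [-0.5, 0.5]])
assert sum(x) == 1 and len(x) == 9
```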

Adaptive critic

The Adaptive Critic Element (ACE, or "evaluation network") is basically the same as described above. An error signal is derived from the temporal difference of two successive predictions (in this case denoted by p) and is used for training the ACE:

r̂(t) = r(t) + γ p(t) − p(t−1) .


p(t) is implemented as a series of "weights" w_Cj to the ACE such that

p(t) = w_Ck

if the system is in state k at time t, denoted by x_k = 1. The function is learned by adjusting the w_Cj's according to a "delta-rule" with an error signal given by r̂(t):

Δw_Cj(t) = β r̂(t) h_j(t) .

β is the learning parameter, and h_j(t) indicates the "trace" of neuron x_j:

h_j(t+1) = λ h_j(t) + (1 − λ) x_j(t) .

This trace is a low-pass filter or momentum, through which the credit assigned to state j increases while state j is active and decays exponentially after the activity of j has expired.

If r̂(t) is positive, the action u of the system has resulted in a higher evaluation value, whereas a negative r̂(t) indicates a deterioration of the system. r̂(t) can be considered as an internal reinforcement signal.
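One combined ASE/ACE update step, following the equations above, might be sketched like this (all parameter values are illustrative, and the state vector x is assumed to be the one-hot decoder output):

```python
# Sketch of one ASE/ACE update: eligibility e_j gates the ASE weights, the
# trace h_j gates the ACE weights, and the internal reinforcement r_hat is
# formed from two successive predictions p. Parameter values are arbitrary.

def ase_ace_step(w_S, w_C, e, h, x, y_o, r, p_prev,
                 alpha=0.5, beta=0.1, gamma=0.95, delta=0.9, lam=0.8):
    p = sum(wc * xi for wc, xi in zip(w_C, x))        # p(t): w_Ck for active k
    r_hat = r + gamma * p - p_prev                    # internal reinforcement
    for j in range(len(x)):
        w_S[j] += alpha * r_hat * e[j]                # ASE: Hebbian, gated by r_hat
        w_C[j] += beta * r_hat * h[j]                 # ACE: delta rule on the trace
        e[j] = delta * e[j] + (1 - delta) * y_o * x[j]    # eligibility update
        h[j] = lam * h[j] + (1 - lam) * x[j]              # state-trace update
    return p

# one state active (one-hot x), output +1, no external reinforcement: only
# the eligibility and the trace of the visited state grow.
w_S, w_C, e, h = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
p = ase_ace_step(w_S, w_C, e, h, x=[1, 0], y_o=1, r=0.0, p_prev=0.0)
assert e[0] > 0 and h[0] > 0 and e[1] == 0 == h[1]
```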

The cart-pole system

An example of such a system is the cart-pole balancing system (see the figure below). Here, a dynamics controller must control the cart in such a way that the pole always stands up straight. The controller applies a "left" or "right" force F of fixed magnitude to the cart, which may change direction at discrete time intervals. The model has four state variables:

- x, the position of the cart on the track;
- θ, the angle of the pole with the vertical;
- ẋ, the cart velocity; and
- θ̇, the angle velocity of the pole.

Furthermore, a set of parameters specify the pole length and mass, cart mass, coefficients of friction between the cart and the track and at the hinge between the pole and the cart, the control force magnitude, and the force due to gravity. The state space is partitioned on the basis of the following quantisation thresholds:

- x: ±0.8, ±2.4 m;
- θ: 0, ±1, ±6, ±12°;
- ẋ: ±0.5 m/s;
- θ̇: ±50 °/s.

This yields 3 × 6 × 3 × 3 = 162 regions corresponding to all of the combinations of the intervals. The decoder output is a 162-dimensional vector. A negative reinforcement signal is provided when the state vector gets out of the admissible range: when |x| > 2.4 m or |θ| > 12°. The system has proved to solve the problem in a limited number of learning steps.


Figure: The cart-pole system.

Reinforcement learning versus optimal control

The objective of optimal control is to generate control actions in order to optimize a predefined performance measure. One technique to find such a sequence of control actions, which define an optimal control policy, is Dynamic Programming (DP). The method is based on the principle of optimality, formulated by Bellman (Bellman, 1957): "Whatever the initial system state, if the first control action is contained in an optimal control policy, then the remaining control actions must constitute an optimal control policy for the problem with as initial system state the state remaining from the first control action." The Bellman equations follow directly from the principle of optimality. Solving the equations backwards in time is called dynamic programming.

Assume that a performance measure J(x_k, u_k, k) = Σ_{i=k..N} r(x_i, u_i, i), with r being the immediate costs, is to be minimized. The minimum costs J_min of cost J can be derived by the Bellman equations of DP. The equations for the discrete case are (White & Jordan, 1992):

J_min(x_k, u_k, k) = min_{u∈U} [ J_min(x_{k+1}, u_{k+1}, k+1) + r(x_k, u_k, k) ] ,

J_min(x_N) = r(x_N) .

The strategy for finding the optimal control actions is solving the two equations above, from which u_k can be derived. This can be achieved backwards, starting at state x_N. The requirements are a bounded N, and a model which is assumed to be an exact representation of the system and the environment. The model has to provide the relation between successive system states resulting from system dynamics, control actions and disturbances. In practice, a solution can be derived only for a small N and simple systems. In order to deal with large or infinite N, the performance measure could be defined as a discounted sum of future costs, as expressed earlier for J.

Reinforcement learning provides a solution for the problem stated above without the use of a model of the system and environment. RL is therefore often called a "heuristic" dynamic programming technique (Barto, Sutton, & Watkins; Sutton, Barto, & Wilson, 1992; Werbos). The RL technique most directly related to DP is Q-learning (Watkins & Dayan, 1992). The basic idea in Q-learning is to estimate a function Q of states and actions, where Q is the minimum discounted sum of future costs J_min(x_k, u_k, k) (the name "Q-learning" comes from Watkins' notation). For convenience, the notation with J is continued here:

Ĵ_min(x_k, u_k, k) = r(x_k, u_k, k) + γ Ĵ_min(x_{k+1}, u_{k+1}, k+1) .

The optimal control rule can be expressed in terms of Ĵ by noting that an optimal control action for state x_k is any action u_k that minimizes Ĵ, as in the controller equation given earlier.

The estimate of minimum cost Ĵ_min is updated at time step k according to the equation above. The temporal difference ε(k) between the true and expected performance is again used:

ε(k) = [ r(x_k, u_k, k) + γ min_{u∈U} Ĵ(x_{k+1}, u, k+1) ] − Ĵ(x_k, u_k, k) .

Watkins has shown that the function converges under some pre-specified conditions to the true optimal Bellman equation (Watkins & Dayan, 1992):

- the critic is implemented as a look-up table;
- the learning parameter must converge to zero;
- all actions continue to be tried from all states.
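A tabular Q-learning sketch along these lines (illustrative only; a tiny deterministic two-state example, with costs to be minimized):

```python
# Tabular Q-learning sketch following the update above: Q[s][a] estimates
# the minimum discounted sum of future costs, and the temporal difference
# uses the minimum over the next state's actions. All values are arbitrary.

def q_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.9):
    eps = r + gamma * min(Q[s_next]) - Q[s][a]   # temporal difference eps(k)
    Q[s][a] += alpha * eps
    return eps

# state 0: action 0 stays in state 0 (cost 1), action 1 moves to the
# absorbing state 1 (cost 0); the minimum-cost action should be action 1.
Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):
    q_update(Q, 0, 0, r=1.0, s_next=0)
    q_update(Q, 0, 1, r=0.0, s_next=1)
assert Q[0][1] < Q[0][0]        # action 1 is recognised as cheaper
```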


Part III

APPLICATIONS

Robot Control

An important area of application of neural networks is in the field of robotics. Usually, these networks are designed to direct a manipulator, which is the most important form of the industrial robot, to grasp objects, based on sensor data. Other applications include the steering and path-planning of autonomous robot vehicles.

In robotics, the major task involves making movements dependent on sensor data. There are four related problems to be distinguished (Craig, 1989):

Forward kinematics. Kinematics is the science of motion which treats motion without regard to the forces which cause it. Within this science one studies the position, velocity, acceleration, and all higher order derivatives of the position variables. A very basic problem in the study of mechanical manipulation is that of forward kinematics. This is the static geometrical problem of computing the position and orientation of the end-effector ("hand") of the manipulator. Specifically, given a set of joint angles, the forward kinematic problem is to compute the position and orientation of the tool frame relative to the base frame (see the figure below).


Figure: An exemplar robot manipulator, with the tool frame at the end-effector and the base frame at its foot.

Inverse kinematics. This problem is posed as follows: given the position and orientation of the end-effector of the manipulator, calculate all possible sets of joint angles which could be used to attain this given position and orientation. This is a fundamental problem in the practical use of manipulators.

The inverse kinematic problem is not as simple as the forward one. Because the kinematic equations are nonlinear, their solution is not always easy or even possible in a closed form. Also, the questions of existence of a solution, and of multiple solutions, arise.

Solving this problem is a least requirement for most robot control systems.

CHAPTER ROBOT CONTROL

Dynamics. Dynamics is a field of study devoted to studying the forces required to cause motion. In order to accelerate a manipulator from rest, glide at a constant end-effector velocity, and finally decelerate to a stop, a complex set of torque functions must be applied by the joint actuators. In dynamics not only the geometrical properties (kinematics) are used, but also the physical properties of the robot are taken into account. Take for instance the weight (inertia) of the robot arm, which determines the force required to change the motion of the arm. The dynamics introduce two extra problems to the kinematic problems:

1. The robot arm has a "memory". Its response to a control signal depends also on its history (e.g., previous positions, speed, acceleration).

2. If a robot grabs an object, then the dynamics change but the kinematics don't. This is because the weight of the object has to be added to the weight of the arm (that's why robot arms are so heavy, making the relative weight change very small).

Trajectory generation. To move a manipulator from here to there in a smooth, controlled fashion, each joint must be moved via a smooth function of time. Exactly how to compute these motion functions is the problem of trajectory generation.

In the first section of this chapter we will discuss the problems associated with the positioning of the end-effector (in effect, representing the inverse kinematics in combination with sensory transformation). The next section discusses a network for controlling the dynamics of a robot arm. Finally, the last section describes neural networks for mobile robot control.

End-effector positioning

The final goal in robot manipulator control is often the positioning of the hand or end-effector, in order to be able to, e.g., pick up an object. With the accurate robot arms that are manufactured, this task is often relatively simple, involving the following steps:

1. determine the target coordinates relative to the base of the robot. Typically, when this position is not always the same, this is done with a number of fixed cameras or other sensors which observe the work scene: from the image frame determine the position of the object in that frame, and perform a predetermined coordinate transformation;

2. with a precise model of the robot (supplied by the manufacturer), calculate the joint angles to reach the target (i.e., the inverse kinematics). This is a relatively simple problem;

3. move the arm (dynamics control) and close the gripper.

The arm motion in point 3 is discussed in the section on robot arm dynamics. Gripper control is not a trivial matter at all, but we will not focus on that.

Involvement of neural networks. So if these parts are relatively simple to solve with a high accuracy, why involve neural networks? The reason is the applicability of robots. When "traditional" methods are used to control a robot arm, accurate models of the sensors and manipulators (in some cases with unknown parameters which have to be estimated from the system's behaviour, yet still with accurate models as starting point) are required, and the system must be calibrated. Also, systems which suffer from wear-and-tear (and which mechanical system doesn't?) need frequent recalibration or parameter determination. Finally, the development of more complex (adaptive!) control methods allows the design and use of more flexible (i.e., less rigid) robot systems, both on the sensory and motor side.


Camera-robot coordination is function approximation

The system we focus on in this section is a work floor, observed by fixed cameras, and a robot arm. The visual system must identify the target as well as determine the visual position of the end-effector.

The target position x^target, together with the visual position of the hand x^hand, are input to the neural controller N(·). This controller then generates a joint position θ for the robot:

θ = N(x^target, x^hand) .

We can compare the neurally generated θ with the optimal θ_0 generated by a fictitious perfect controller R(·):

θ_0 = R(x^target, x^hand) .

The task of learning is to make N generate an output "close enough" to θ_0.

There are two problems associated with teaching N(·):

1. generating learning samples which are in accordance with the equation for θ_0 above. This is not trivial, since in useful applications R(·) is an unknown function. Instead, a form of self-supervised or unsupervised learning is required. Some examples to solve this problem are given below;

2. constructing the mapping N(·) from the available learning samples. When the (usually randomly drawn) learning samples are available, a neural network uses these samples to represent the whole input space over which the robot is active. This is evidently a form of interpolation, but has the problem that the input space is of a high dimensionality and the samples are randomly distributed.

We will discuss three fundamentally different approaches to neural networks for robot end-effector positioning. In each of these approaches, a solution will be found for both the learning sample generation and the function representation.

Approach 1: Feed-forward networks

When using a feed-forward system for controlling the manipulator, a self-supervised learning system must be used.

One such system has been reported by Psaltis, Sideris and Yamamura (Psaltis, Sideris, & Yamamura, 1988). Here, the network, which is constrained to two-dimensional positioning of the robot arm, learns by experimentation. Three methods are proposed:

Indirect learning. In indirect learning, a Cartesian target point x in world coordinates is generated, e.g., by two cameras looking at an object. This target point is fed into the network, which generates an angle vector θ. The manipulator moves to position θ, and the cameras determine the new position x′ of the end-effector in world coordinates. This x′ again is input to the network, resulting in θ′. The network is then trained on the error ε_1 = θ − θ′ (see the figure below).

However, minimisation of ε_1 does not guarantee minimisation of the overall error ε = x − x′. For example, the network often settles at a "solution" that maps all x's to a single θ (i.e., the mapping I).

General learning. This method is basically very much like supervised learning, but here the plant input θ must be provided by the user. Thus the network can directly minimise |θ − θ′|. The success of this method depends on the interpolation capabilities of the network. Correct choice of θ may pose a problem.



Figure: Indirect learning system for robotics. In each cycle, the network is used in two different places: first in the forward step, then for feeding back the error.

Specialised learning. Keep in mind that the goal of the training of the network is to minimise the error at the output of the plant: ε = x − x′. We can also train the network by "back-propagating" this error through the plant (compare this with the back-propagation of the error discussed earlier). This method requires knowledge of the Jacobian matrix of the plant. A Jacobian matrix of a multidimensional function F is a matrix of partial derivatives of F, i.e., the multidimensional form of the derivative. For example, if we have Y = F(X), i.e.,

y_1 = f_1(x_1, x_2, ..., x_n),
y_2 = f_2(x_1, x_2, ..., x_n),
...
y_m = f_m(x_1, x_2, ..., x_n),

then

Δy_1 = (∂f_1/∂x_1) Δx_1 + (∂f_1/∂x_2) Δx_2 + ... + (∂f_1/∂x_n) Δx_n,
Δy_2 = (∂f_2/∂x_1) Δx_1 + (∂f_2/∂x_2) Δx_2 + ... + (∂f_2/∂x_n) Δx_n,
...
Δy_m = (∂f_m/∂x_1) Δx_1 + (∂f_m/∂x_2) Δx_2 + ... + (∂f_m/∂x_n) Δx_n,

or

ΔY = (∂F/∂X) ΔX .

This is also written as

ΔY = J(X) ΔX ,

where J is the Jacobian matrix of F. So the Jacobian matrix can be used to calculate the change in the function when its parameters change.

Now, in this case we have

J_ij = ∂P_i(θ) / ∂θ_j ,

where P_i(θ) is the ith element of the plant output for input θ. The learning rule applied here regards the plant as an additional and unmodifiable layer in the neural network. The


Figure: The system used for specialised learning.

total error ε = x − x′ is propagated back through the plant by calculating the δ_j as in the generalised delta rule:

δ_j = F′(s_j) Σ_i δ_i ∂P_i(θ)/∂θ_j ,

δ_i = x_i − x′_i ,

where i iterates over the outputs of the plant. When the plant is an unknown function, ∂P_i(θ)/∂θ_j can be approximated by

∂P_i(θ)/∂θ_j ≈ ( P_i(θ + h e_j) − P_i(θ) ) / h ,

where e_j is used to change the scalar θ_j into a vector. This approximate derivative can be measured by slightly changing the input to the plant and measuring the changes in the output.
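The scheme can be sketched as follows. The plant below is a hypothetical stand-in (an assumption), and the derivative factor F′(s_j) is omitted, as would hold for linear output units; only the finite-difference Jacobian and the back-propagated δ's are shown:

```python
# Sketch of 'back-propagating through the plant' with the finite-difference
# approximation above. plant() is a toy stand-in for the real, unknown
# manipulator; in practice each perturbed evaluation is a small test move.

def plant(theta):
    # hypothetical plant: maps two joint angles to end-effector coordinates
    return [2.0 * theta[0] + 0.5 * theta[1], -1.0 * theta[1]]

def plant_jacobian(theta, h=1e-5):
    """J[i][j] ~ (P_i(theta + h e_j) - P_i(theta)) / h."""
    P0 = plant(theta)
    return [[(plant([t + (h if j == jj else 0.0)
                     for jj, t in enumerate(theta)])[i] - P0[i]) / h
             for j in range(len(theta))]
            for i in range(len(P0))]

theta = [0.3, -0.2]
J = plant_jacobian(theta)
assert abs(J[0][0] - 2.0) < 1e-3 and abs(J[1][1] + 1.0) < 1e-3

# delta_j = sum_i delta_i * dP_i/dtheta_j, with delta_i = x_i - x'_i:
x_target, x_prime = [1.0, 0.0], plant(theta)
delta = [x_target[i] - x_prime[i] for i in range(2)]
delta_theta = [sum(delta[i] * J[i][j] for i in range(2)) for j in range(2)]
```

The resulting delta_theta is the error signal that reaches the controller network's output layer, exactly as if the plant were one more (unmodifiable) layer.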

A somewhat similar approach is taken by Kröse, Korst and Groen, and by Smagt and Kröse. Again, a two-layer feed-forward network is trained with back-propagation. However, instead of calculating a desired output vector, the input vector which should have invoked the current output vector is reconstructed, and back-propagation is applied to this new input vector and the existing output vector.

The configuration used consists of a monocular manipulator which has to grasp objects. Due to the fact that the camera is situated in the hand of the robot, the task is to move the hand such that the object is in the centre of the image and has some predetermined size. (In a later article, a biologically inspired system is proposed (Smagt, Kröse, & Groen) in which the visual flow-field is used to account for the monocularity of the system, such that the dimensions of the object need not be known anymore to the system.)

One step towards the target consists of the following operations:

1. measure the distance from the current position to the target position in camera domain, x^t;

2. use this distance, together with the current state of the robot, as input for the neural network. The network then generates a joint displacement vector Δθ;

3. send Δθ to the manipulator;

4. again measure the distance from the current position to the target position in camera domain, x^{t+1};

5. calculate the move made by the manipulator in visual domain, x^t − R x^{t+1}, where R is the rotation matrix of the second camera image with respect to the first camera image;

6. teach the learning pair (x^t − R x^{t+1}, Δθ) to the network.

This system has been shown to learn correct behaviour in only tens of iterations, and to be very adaptive to changes in the sensor or manipulator (Smagt & Kröse; Smagt, Groen, & Kröse).

By using a feed-forward network, the available learning samples are approximated by a single, smooth function consisting of a summation of sigmoid functions. As mentioned earlier, a feed-forward network with one layer of sigmoid units is capable of representing practically any function. But how are the optimal weights determined in finite time to obtain this optimal representation? Experiments have shown that, although a reasonable representation can be obtained in a short period of time, an accurate representation of the function that governs the learning samples is often not feasible or extremely difficult (Jansen et al.). The reason for this is the global character of the approximation obtained with a feed-forward network with sigmoid units: every weight in the network has a global effect on the final approximation that is obtained.

Building local representations is the obvious way out: every part of the network is responsible for a small subspace of the total input space. Thus accuracy is obtained locally ("Keep It Small & Simple"). This is typically obtained with a Kohonen network.

Approach 2: Topology conserving maps

Ritter, Martinetz and Schulten (Ritter, Martinetz, & Schulten, 1989) describe the use of a Kohonen-like network for robot control. We will only describe the kinematics part, since it is the most interesting and straightforward.

The system described by Ritter et al. consists of a robot manipulator with three degrees of freedom (orientation of the end-effector is not included) which has to grab objects in 3D space. The system is observed by two fixed cameras, which output the (x, y) coordinates of the object and the end-effector (see the figure below).

Figure A Kohonen network merging the output of two cameras

Each run consists of two movements. In the gross move, the observed location of the object x (a four-component vector) is input to the network. As with the Kohonen network, the neuron k with the highest activation value is selected as winner, because its weight vector w_k is nearest to x. The neurons, which are arranged in a 3-dimensional lattice, correspond in a one-to-one fashion with subregions of the 3D workspace of the robot, i.e., the neuronal lattice is a discrete representation of the workspace. With each neuron, a vector θ and a Jacobian matrix A are associated. During the gross move, θ_k is fed to the robot, which makes its move, resulting in retinal coordinates x_g of the end-effector. To correct for the discretisation of the working space, an additional move is


made, which depends on the distance between the neuron and the object in space, w_k − x; this small displacement in Cartesian space is translated to an angle change using the Jacobian A_k:

θ_final = θ_k + A_k ( x − w_k ) ,

which is a first-order Taylor expansion of θ_final. The final retinal coordinates of the end-effector after this fine move are in x_f.

Learning proceeds as follows: when an improved estimate (θ*, A*) has been found, the following adaptations are made for all neurons j:

w_j^new = w_j^old + γ(t) g_jk(t) ( x − w_j^old ) ,

A_j^new = A_j^old + γ′(t) g′_jk(t) ( A* − A_j^old ) .

If g_jk(t) = g′_jk(t) = δ_jk, this is similar to perceptron learning. Here, as with the Kohonen learning rule, a distance function is used such that g_jk(t) and g′_jk(t) are Gaussians depending on the distance between neurons j and k, with a maximum at j = k (cf. the neighbourhood function used in Kohonen learning).

An improved estimate (θ*, A*) is obtained as follows:

θ* = θ_k + A_k ( x − x_f ) ,

A* = A_k + A_k ( x − w_k − x_f + x_g ) ( x_f − x_g )^T / || x_f − x_g ||²
   = A_k + ( Δθ − A_k Δx ) Δx^T / || Δx ||² .

In eq the nal error x x in Cartesian space is translated to an error in joint space via

f

multiplication by A This error is then added to to constitute the improved estimate

k k

steep est descent minimisation of error

In eq x x x ie the change in retinal co ordinates of the endeector due to the

g

f

ne movement and A x w ie the related joint angles during ne movement Thus

k k

eq can b e recognised as an errorcorrection rule of the WidrowHo typ e for Jacobians A

It app ears that after iterations the system approaches correct b ehaviour and that after

learning steps no noteworthy deviation is present
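The fine move and the Widrow-Hoff-style Jacobian correction above can be sketched as follows. The linear "camera" map x = M @ theta, the vector sizes, and all numerical values are illustrative assumptions; only the update equations follow the text.

```python
import numpy as np

# A sketch of the fine move and the rank-one Jacobian update above. The
# linear "camera" x = M @ theta and all numbers are illustrative stand-ins;
# only the update equations follow the text.
rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))          # stand-in forward map: joints -> retina

def observe(theta):
    return M @ theta                     # retinal coordinates of the end-effector

x = rng.standard_normal(3)               # target, in camera coordinates
w_k = x + 0.1 * rng.standard_normal(3)   # winning neuron's weight vector
theta_k = rng.standard_normal(3)         # stored gross-move joint angles
A_k = rng.standard_normal((3, 3))        # stored Jacobian estimate

x_g = observe(theta_k)                   # gross move
d_theta = A_k @ (x - w_k)                # fine move (first-order Taylor term)
x_f = observe(theta_k + d_theta)         # retinal position after the fine move

# Improved estimates (theta, A)*:
theta_star = theta_k + A_k @ (x - x_f)
dx = x_f - x_g                           # change in retinal coordinates
A_star = A_k + np.outer(d_theta - A_k @ dx, dx) / (dx @ dx)

# After the rank-one correction, the Jacobian estimate exactly explains the
# observed fine move:
print(np.allclose(A_star @ dx, d_theta))  # True
```

The last line illustrates the Widrow-Hoff character of the update: the corrected A* maps the observed retinal change Δx exactly onto the joint-angle change Δθ that caused it.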

Robot arm dynamics

While end-effector positioning via sensor-robot coordination is an important problem to solve, the robot itself will not move without dynamic control of its limbs.

Again, accurate control with non-adaptive controllers is possible only when accurate models of the robot are available, and the robot is not too susceptible to wear-and-tear. This requirement has led to the current-day robots that are used in many factories. But the application of neural networks in this field changes these requirements.

One of the first neural networks which succeeded in doing dynamic control of a robot arm was presented by Kawato, Furukawa, and Suzuki (Kawato, Furukawa, & Suzuki, 1987). They describe a neural network which generates motor commands from a desired trajectory in joint angles. Their system does not include the trajectory generation or the transformation of visual coordinates to body coordinates.

The network is extremely simple. In fact, the system is a feed-forward network, but by carefully choosing the basis functions, the network can be restricted to one learning layer, such that finding the optimal weights is a trivial task. In this case, the basis functions are chosen such that the function that is approximated is a linear combination of those basis functions. This approach is similar to that presented earlier.


Dynamics model. The manipulator used consists of three joints, as the manipulator shown earlier but without the wrist joint. The desired trajectory θ_d(t), which is generated by another subsystem, is fed into the inverse-dynamics model (see the figure below). The error between θ_d(t) and the actual trajectory θ(t) is fed into the neural model.

[Figure: The neural model proposed by Kawato et al. The desired trajectory θ_d(t) enters the inverse-dynamics model, which outputs the torque T_i(t); the feedback gain K converts the error θ_d(t) − θ(t) into a feedback torque T_f(t); their sum T(t) drives the manipulator, which outputs θ(t).]

The neural model, which is shown in the figure below, consists of three perceptrons, each one feeding in one joint of the manipulator. The desired trajectory θ_d = (θ_d1, θ_d2, θ_d3), which is generated by another subsystem, is fed into 13 nonlinear subsystems. The resulting signals are weighted and summed, such that

    T_ik(t) = Σ_l w_lk x_lk,    (k = 1, 2, 3),

with

    x_l1 = f_l(θ_d1(t), θ_d2(t), θ_d3(t)),
    x_l2 = x_l3 = g_l(θ_d1(t), θ_d2(t), θ_d3(t)),

and f_l and g_l as in the table below.

Figure: The neural network used by Kawato et al. There are three neurons, one per joint in the robot arm. Each neuron feeds from the thirteen nonlinear subsystems (f_1, ..., f_13 and g_1, ..., g_13), whose weighted outputs are summed into the torques T_1(t), T_2(t), T_3(t). The upper neuron is connected to the rotary base joint; the other two neurons to joints 2 and 3.


[Table: Nonlinear transformations used in the Kawato model. Each f_l and g_l (l = 1, ..., 13) is a product of sines and cosines of the desired joint angles and of their time derivatives; the individual entries are not reproduced here.]

The feedback torque T_f(t) in the model consists of

    T_fk(t) = K_pk (θ_dk(t) − θ_k(t)) + K_vk dθ_k(t)/dt,    (k = 1, 2, 3),

    K_vk = 0 unless |θ_dk(t) − objective point| is small.

The feedback gains K_p and K_v were set to fixed, precomputed vectors.

Next, the weights adapt using the delta rule

    dw_ik/dt = x_ik T_fk = x_ik (T_k − T_ik),    (k = 1, 2, 3),

such that the feedback torque serves as the error signal for the feed-forward weights. A desired move pattern is shown in the figure below. After a period of learning, the feedback torques are nearly zero, such that the system has successfully learned the transformation. Although the applied patterns are very dedicated, training with a repetitive pattern sin(ω_k t), with ω_1 : ω_2 : ω_3 = 1 : √2 : √3, is also successful.

Figure: The desired joint pattern for joint 1: θ_1 varies between −π and π over a 30 s interval. Joints 2 and 3 have similar time patterns.
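The learning scheme above, a fixed set of nonlinear basis functions with a delta rule on the output weights driven by the feedback error, can be sketched as follows. The basis set, target weights, and learning rate are illustrative assumptions, not the thirteen subsystems of the original model.

```python
import numpy as np

# Sketch of the Kawato-style learning layer: the inverse model is a linear
# combination of fixed nonlinear basis functions of the desired trajectory,
# so a delta rule on the output weights suffices. The basis functions and
# "true" weights below are illustrative stand-ins.
rng = np.random.default_rng(0)

def basis(theta):
    # a few hypothetical nonlinear subsystems f_l(theta)
    return np.array([np.sin(theta), np.cos(theta),
                     np.sin(theta) * np.cos(theta), theta])

true_w = np.array([1.5, -0.7, 2.0, 0.3])     # unknown "real" dynamics
w = np.zeros(4)                              # learned weights
eta = 0.1

for _ in range(2000):
    theta_d = rng.uniform(-np.pi, np.pi)     # desired joint angle sample
    x = basis(theta_d)
    T_needed = true_w @ x                    # torque the plant actually needs
    T_i = w @ x                              # inverse-model (feedforward) torque
    T_f = T_needed - T_i                     # feedback torque = residual error
    w += eta * T_f * x                       # delta rule: dw/dt ~ x * T_f

print(np.abs(true_w - w).max())              # error shrinks toward zero
```

Because the target really is a linear combination of the basis functions, the feedback signal vanishes as the weights converge, mirroring the "feedback torques are nearly zero" observation in the text.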

The usefulness of neural algorithms is demonstrated by the fact that novel robot architectures, which no longer need a very rigid structure to simplify the controller, are now constructed. For example, several groups (Katayama & Kawato; Hesselroth, Sarkar, Smagt, & Schulten) report on work with a pneumatic musculo-skeletal robot arm, with rubber actuators replacing the DC motors. The very complex dynamics and environmental temperature dependency of this arm make the use of non-adaptive algorithms impossible, where neural networks succeed.


Mobile robots

In the previous sections, some applications of neural networks on robot arms were discussed. In this section we focus on mobile robots. Basically, the control of a robot arm and the control of a mobile robot are very similar: the hierarchical controller first plans a path, the path is transformed from the Cartesian (world) domain to the joint or wheel domain using the inverse kinematics of the system, and finally a dynamic controller takes care of the mapping from set-points in this domain to actuator signals. However, in practice, the problems with mobile robots occur more with path-planning and navigation than with the dynamics of the system. Two examples will be given.

Model based navigation

Jorgensen describes a neural approach for path-planning. Robot path-planning techniques can be divided into two categories. The first, called local planning, relies on information available from the current viewpoint of the robot. This planning is important, since it is able to deal with fast changes in the environment. Unfortunately, by itself, local data is generally not adequate, since occlusion in the line of sight can cause the robot to wander into dead-end corridors or choose non-optimal routes of travel. The second situation is called global path-planning, in which case the system uses global knowledge from a topographic map previously stored into memory. Although global planning permits optimal paths to be generated, it has its weakness: missing knowledge or incorrectly selected maps can invalidate a global path to an extent that it becomes useless. A possible third, anticipatory planning, combines both strategies: the local information is constantly used to give a best guess of what the global environment may contain.

Jorgensen investigates two issues associated with neural network applications in unstructured or changing environments. First, can neural networks be used in conjunction with direct sensor readings to associatively approximate global terrain features not observable from a single robot perspective? Secondly, is a neural network fast enough to be useful in path relaxation planning, where the robot is required to optimise motion and situation-sensitive constraints?

For the first problem, the system had to store a number of possible sensor maps of the environment. The robot was positioned in eight positions in each room, and sonar scans were made from each position. Based on these data, a map was made for each room. To be able to represent these maps in a neural network, each map was divided into grid elements, which could be projected onto the nodes of a neural network. The maps of the different rooms were stored in a Hopfield type of network. In the operational phase, the robot wanders around and enters an unknown room. It makes one scan with the sonar, which provides a partial representation of the room map (see the figure below); this pattern is clamped onto the network, which will regenerate the best-fitting pattern. With this information, a global path-planner can be used.

The results which are presented in the paper are not very encouraging. With a network of this size, the total number of weights is the number of neurons squared, which already costs a megabyte or more of storage if only one byte per weight is used. Also, the speed of the recall is low: Jorgensen mentions a recall time of more than two and a half hours on the IBM AT which is used on board of the robot.
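The quadratic storage cost can be sketched with a back-of-the-envelope calculation; the 32 × 32 grid below is a purely illustrative assumption, since the actual grid size is not given here.

```python
# Back-of-the-envelope storage cost of a fully connected Hopfield-type
# memory. The 32 x 32 grid is a purely illustrative assumption; the point
# is that weight storage grows with the square of the number of nodes.
n = 32 * 32                  # one node per grid element (hypothetical size)
weights = n * n              # fully connected network: n^2 weights
print(weights)               # 1048576 weights
print(weights / 2**20)       # 1.0 -> one mebibyte at one byte per weight
```

Doubling the grid resolution in each direction multiplies the number of nodes by 4 and the weight storage by 16, which is why recall on modest on-board hardware becomes so slow.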

Also, the use of a simulated annealing paradigm for path planning is not proving to be an effective approach: the large number of settling trials is far too slow for real time, when the same functions could be better served by the use of a potential field approach or distance transform.

Figure: Schematic representation of the stored rooms (A through E), and the partial information which is available from a single sonar scan.

Sensor based control

Very similar to the sensor-based control for the robot arm, as described in the previous sections, a mobile robot can be controlled directly using the sensor data. Such an application has been developed at Carnegie Mellon by Touretzky and Pomerleau. The goal of their network is to drive a vehicle along a winding road. The network receives two types of sensor inputs from the sensory system. One is a 30 × 32 pixel image from a camera mounted on the roof of the vehicle (see the figure below), where each pixel corresponds to an input unit of the network. The other input is an 8 × 32 pixel image from a laser range finder. The activation levels of units in the range finder's retina represent the distance to the corresponding objects.

Figure: The structure of the network for the autonomous land vehicle. A 30 × 32 video input retina and an 8 × 32 range finder input retina feed 29 hidden units, which in turn feed 45 output units coding the steering direction, from sharp left through straight ahead to sharp right.

The network was trained by presenting it samples with, as inputs, a wide variety of road images taken under different viewing angles and lighting conditions. Each image was presented a number of times while the weights were adjusted using the back-propagation principle. The authors claim that, once the network is trained, the vehicle can accurately drive along "... a path through a wooded area adjoining the Carnegie Mellon campus, under a variety of weather and lighting conditions." The speed is nearly twice as high as that of a non-neural algorithm running on the same vehicle.
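At shape level, the forward pass of such a network can be sketched as follows. The weights here are random placeholders; only the layer sizes follow the description above.

```python
import numpy as np

# A shape-level sketch of the autonomous-land-vehicle network described
# above: two input retinas are concatenated and passed through one hidden
# layer. Weights are random placeholders; only the sizes follow the text.
rng = np.random.default_rng(0)
video = rng.random((30, 32))        # camera retina
ranges = rng.random((8, 32))        # laser range-finder retina

x = np.concatenate([video.ravel(), ranges.ravel()])   # 960 + 256 = 1216 inputs
W1 = rng.standard_normal((29, x.size)) * 0.05         # 29 hidden units
W2 = rng.standard_normal((45, 29)) * 0.05             # 45 steering outputs

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
hidden = sigmoid(W1 @ x)
output = sigmoid(W2 @ hidden)

# Unit 0 ~ sharp left, the middle unit ~ straight ahead, unit 44 ~ sharp right.
steering_unit = int(np.argmax(output))
print(output.shape, steering_unit)
```

In the trained system, the index of the most active output unit is decoded into a steering command; here it is of course arbitrary, since the weights are random.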

Although these results show that neural approaches can be possible solutions for the sensor-based control problem, there still are serious shortcomings. In simulations in our own laboratory, we found that networks trained with examples which are provided by human operators are not always able to find a correct approximation of the human behaviour. This is the case if the human operator uses other information than the network's input to generate the steering signal. Also, the learning of, in particular, back-propagation networks is dependent on the sequence of samples and, for all supervised training methods, depends on the distribution of the training samples.

Vision

Introduction

In this chapter, we illustrate some applications of neural networks which deal with visual information processing. In the neural literature, we find roughly two types of problems: the modelling of biological vision systems, and the use of artificial neural networks for machine vision. We will focus on the latter.

The primary goal of machine vision is to obtain information about the environment by processing data from one or multiple two-dimensional arrays of intensity values ("images"), which are projections of this environment on the system. This information can be of different nature:

recognition: the classification of the input data into one of a number of possible classes;

geometric information about the environment, which is important for autonomous systems;

compression of the image for storage and transmission.

Often a distinction is made between low level (or early) vision, intermediate level vision, and high level vision. Typical low-level operations include image filtering, isolated feature detection, and consistency calculations. At a higher level, segmentation can be carried out, as well as the calculation of invariants. The high level vision modules organise and control the flow of information from these modules, and combine this information with high level knowledge for analysis.

Computer vision already has a long tradition of research, and many algorithms for image processing and pattern recognition have been developed. There appear to be two computational paradigms that are easily adapted to massive parallelism: local calculations and neighbourhood functions. Calculations that are strictly localised to one area of an image are obviously easy to compute in parallel; examples are filters and edge detectors in early vision. A cascade of these local calculations can be implemented in a feed-forward network.

The first section below describes feed-forward networks for vision. Then it is shown how back-propagation can be used for image compression, and that the PCA neuron is ideally suited for image compression. Finally, the last sections describe the cognitron for optical character recognition, and relaxation networks for calculating depth from stereo images.

Feed-forward types of networks

The early feed-forward networks, such as the perceptron and the adaline, were essentially designed to be visual pattern classifiers. In principle, a multi-layer feed-forward network is able to learn to classify all possible input patterns correctly, but an enormous number of connections is needed: for the perceptron, Minsky showed that many problems can only be solved if each hidden unit is connected to all inputs. The question is whether such systems can still be regarded as vision systems: no use is made of the spatial relationships in the input patterns, and the problem of classifying a set of "real world" images is the same as the problem of classifying a set of artificial random dot patterns, which are, according to Smeulders, no images. For that reason, most successful neural vision applications combine self-organising techniques with a feed-forward architecture, such as, for example, the neocognitron (Fukushima), described in a later section. The neocognitron performs the mapping from input data to output data by a layered structure in which, at each stage, increasingly complex features are extracted. The lower layers extract local features, such as a line at a particular orientation, and the higher layers aim to extract more global features.

Also, there is the problem of translation invariance: the system has to classify a pattern correctly independent of its location on the retina. However, a standard feed-forward network considers an input pattern which is translated as a totally new pattern. Several attempts have been described to overcome this problem, one of the more exotic ones by Widrow (Widrow, Winter, & Baxter) as a layered structure of adalines.

Self-organising networks for image compression

In image compression, one wants to reduce the number of bits required to store or transmit an image. We can either require a perfect reconstruction of the original, or we can accept a small deterioration of the image. The former is called a lossless coding and the latter a lossy coding. In this section, we will consider lossy coding of images with neural networks.

The basic idea behind compression is that an n-dimensional stochastic vector n, (part of) the image, is transformed into an m-dimensional stochastic vector

    m = T n.

After transmission or storage of this vector m̃, a discrete version of m, we can make a reconstruction of n by some sort of inverse transform T̃, so that the reconstructed signal equals

    ñ = T̃ m̃.

The error of the compression and reconstruction stage together can be given as

    E = ‖n − ñ‖².

There is a trade-off between the dimensionality of m and the error: as one decreases the dimensionality of m, the error increases, and vice versa; i.e., a better compression leads to a higher deterioration of the image. The basic problem of compression is finding T and T̃ such that the information in m is as compact as possible, with acceptable error. The definition of "acceptable" depends on the application area.

The cautious reader has already concluded that dimension reduction is in itself not enough to obtain a compression of the data. The main point is that some aspects of an image are more important for the reconstruction than others. For example, the mean grey level and, generally, the low frequency components of the image are very important, so we should code these features with high precision. Others, like high frequency components, are much less important, so these can be coarse-coded. So, when we reduce the dimension of the data, we are actually trying to concentrate the information of the data in a few numbers (the low frequency components) which can be coded with precision, while throwing the rest away (the high frequency components).

In this section, we will consider the coding of a complete grey-level image. It is a bit tedious to transform the whole image directly by the network: this requires a huge number of neurons. Because the statistical description over parts of the image is supposed to be stationary, we can break the image into small blocks, which are large enough to entail a local statistical description and small enough to be managed. These blocks can then be coded separately, stored, or transmitted, whereafter a reconstruction of the whole image can be made, based on these coded blocks.
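The blocking step described above can be sketched as follows; the image content, the sizes, and the identity "coder" are illustrative placeholders.

```python
import numpy as np

# Sketch of the blocking step: split an image into small blocks, code each
# block separately (here: identity, standing in for the transform T), and
# reassemble. The image and block sizes are illustrative.
img = np.arange(16 * 16, dtype=float).reshape(16, 16)
B = 4                                            # block size
blocks = [img[i:i+B, j:j+B].ravel()
          for i in range(0, 16, B) for j in range(0, 16, B)]

# ... each block vector n would be compressed as m = T @ n and later
# reconstructed; here the blocks are passed through unchanged ...

out = np.zeros_like(img)
for idx, blk in enumerate(blocks):               # reassemble from the blocks
    i, j = divmod(idx, 16 // B)
    out[i*B:(i+1)*B, j*B:(j+1)*B] = blk.reshape(B, B)

print(np.array_equal(out, img))                  # True: blocking is lossless
```

The blocking itself loses nothing; all compression error comes from the transform applied to each block vector.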

Back-propagation

The process above can be interpreted as a neural network with one hidden layer. The inputs to the network are the image blocks, and the desired outputs are the same blocks as presented on the input units. This type of network is called an auto-associator.

After training with a gradient search method, minimising the reconstruction error, the weights between the first and second layer can be seen as the coding matrix T, and those between the second and third as the reconstruction matrix T̃.

If the number of hidden units is smaller than the number of input (output) units, a compression is obtained: in other words, we are trying to squeeze the information through a smaller channel, namely the hidden layer.
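A minimal sketch of such an auto-associator follows, using a linear bottleneck network trained by per-sample gradient descent. The sizes, the synthetic "blocks", and the learning rate are illustrative assumptions.

```python
import numpy as np

# A minimal auto-associator sketch: a linear bottleneck network n -> m -> n
# trained to reproduce its input. The synthetic data stand in for image
# blocks; a linear net keeps the sketch short.
rng = np.random.default_rng(0)
n, m = 16, 4                             # input/output size and bottleneck size

# Synthetic "blocks" with strong low-dimensional structure (rank m).
basis = rng.standard_normal((n, m))
data = (basis @ rng.standard_normal((m, 500))).T

T = rng.standard_normal((m, n)) * 0.1    # coding matrix (first weight layer)
R = rng.standard_normal((n, m)) * 0.1    # reconstruction matrix (second layer)
eta = 0.005
for _ in range(600):
    for x in data[:50]:
        h = T @ x                        # code
        y = R @ h                        # reconstruction
        e = y - x                        # error to minimise: ||x - y||^2
        R -= eta * np.outer(e, h)        # gradient step, second layer
        T -= eta * np.outer(R.T @ e, x)  # gradient step, first layer

err = np.mean([np.sum((x - R @ (T @ x)) ** 2) for x in data[:50]])
print(err)                               # small: the bottleneck suffices here
```

Because the data here really live in an m-dimensional subspace, the bottleneck loses almost nothing; for real image blocks, the residual error is the price of compression.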

This network has been used for the recognition of human faces by Cottrell (Cottrell, Munro, & Zipser). He used an input and output layer on which the whole face was presented at once. The hidden layer, which consisted of far fewer units, was subsequently classified with another network by means of a delta rule. Is this complex network invariant to translations in the input?

Linear networks

It is known from statistics that the optimal transform from an n-dimensional to an m-dimensional stochastic vector, optimal in the sense that the reconstruction error contains the lowest energy possible, equals the concatenation of the first m eigenvectors of the correlation matrix R of n. So, if (e_1, e_2, ..., e_n) are the eigenvectors of R, ordered in decreasing corresponding eigenvalue, the transformation matrix is given as T = [e_1 e_2 ... e_m]^T.

Earlier, a linear neuron with a normalised Hebbian learning rule was shown to be able to learn the eigenvectors of the correlation matrix of the input patterns. The definition of the optimal transform given above suits exactly the PCA network we have described.

So we end up with an n → m → n network, where m is the desired number of hidden units, which is coupled to the total error. Since the eigenvalues are ordered in decreasing values, which are the outputs of the hidden units, the hidden units are ordered in importance for the reconstruction.

Sanger (Sanger) used this implementation for image compression. The test image is shown in the figure below; it is a grey-level image coded with 8 bits per pixel. After scanning the image four times, thus generating a large set of learning patterns (one per block), the weights of the network converge to those shown in the weight figure below.
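The optimal linear transform described above can be sketched directly: take the first m eigenvectors of the correlation matrix as the rows of T. The synthetic data below stand in for image block vectors.

```python
import numpy as np

# Sketch of the optimal linear transform: T's rows are the m leading
# eigenvectors of the correlation matrix of the inputs. Synthetic data
# (mostly 3-dimensional plus a little noise) stand in for image blocks.
rng = np.random.default_rng(0)
n, m = 16, 4
mix = rng.standard_normal((n, 3))                 # data concentrated in 3 dims
data = (mix @ rng.standard_normal((3, 400))).T \
       + 0.01 * rng.standard_normal((400, n))

R = (data.T @ data) / len(data)                   # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)              # eigh: ascending order
order = np.argsort(eigvals)[::-1]                 # re-sort descending
T = eigvecs[:, order[:m]].T                       # T = [e_1 ... e_m]^T

codes = data @ T.T                                # m-dimensional codes
recon = codes @ T                                 # reconstruction T^T T x
err = np.mean(np.sum((data - recon) ** 2, axis=1))
print(err)                                        # almost all energy retained
```

Since the first m eigenvectors carry the highest-variance directions, the ordering of the hidden units by eigenvalue is exactly the ordering by importance for the reconstruction mentioned in the text.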

Principal components as features

If parts of the image are very characteristic for the scene, like corners, lines, shades, etc., one speaks of features of the image. The extraction of features can make the image understanding task at a higher level much easier. If the image analysis is based on features, it is very important that the features are tolerant of noise, distortion, etc.

From an image compression viewpoint, it would be smart to code these features with as few bits as possible, just because the definition of features was that they occur frequently in the image.

So one can ask oneself whether the two described compression methods also extract features from the image. Indeed, this is true, and it can most easily be seen in the weight figure below. It might not be clear directly, but one can see that the weights have converged to:

Figure: Input image for the network. The image is divided into blocks which are fed to the network.

Figure: Weights of the PCA network: the final weights of the network trained on the test image. For each neuron, a rectangle is shown in which the grey level of each of the elements represents the value of the weight. Dark indicates a large weight, light a small weight. The neurons are ordered row-wise, neuron 0 through neuron 7; the last position in the arrangement is unused.

neuron 0: the mean grey level;

neurons 1 and 2: the first order gradients of the image;

neurons 3 through 7: second order derivatives of the image.

The features extracted by the principal component network are the gradients of the image.

The cognitron and neocognitron

Yet another type of unsupervised learning is found in the cognitron, introduced by Fukushima as early as 1975 (Fukushima, 1975). This network, with primary applications in pattern recognition, was improved at a later stage to incorporate scale, rotation, and translation invariance, resulting in the neocognitron (Fukushima), which we will not discuss here.

Description of the cells

Central in the cognitron is the type of neuron used. Whereas the Hebb synapse (unit k, say), which is used in the perceptron model, increases an incoming weight w_jk if and only if the incoming signal y_j is high and a control input is high, the synapse introduced by Fukushima increases (the absolute value of) its weight |w_jk| only if it has positive input y_j and a maximum activation value y_k = max(y_{k_1}, y_{k_2}, ..., y_{k_n}), where k_1, k_2, ..., k_n are all "neighbours" of k. Note that this learning scheme is competitive and unsupervised, and the same type of neuron has, at a later stage, been used in the competitive learning network, as well as in other unsupervised networks.

Fukushima distinguishes between excitatory inputs and inhibitory inputs. The output of an excitatory cell u is given by

    u(k) = F( (1 + e)/(1 + h) − 1 ) = F( (e − h)/(1 + h) ),

where e is the excitatory input from u-cells and h the inhibitory input from v-cells. The activation function is

    F(x) = x   if x ≥ 0,
           0   otherwise.

When the inhibitory input is small (h ≪ 1), u(k) can be approximated by u(k) = e − h, which agrees with the formula for a conventional linear threshold element (with a threshold of zero).

When both the excitatory and inhibitory inputs increase in proportion, i.e.,

    e = εx,    h = ηx

(ε and η constants), the output can be transformed into

    u(k) = ((ε − η) / 2η) (1 + tanh(½ log ηx)),

i.e., a squashing function.
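The two regimes of the cognitron cell can be sketched numerically; the constants ε and η below are arbitrary illustrative choices.

```python
import numpy as np

# Sketch of the cognitron cell output u = F((1+e)/(1+h) - 1) and the two
# regimes discussed above: h << 1 gives roughly e - h, and proportional
# inputs e = eps*x, h = eta*x give a tanh-shaped squashing curve.
def F(x):
    return np.where(x >= 0, x, 0.0)

def u(e, h):
    return F((1.0 + e) / (1.0 + h) - 1.0)

# Regime 1: small inhibition behaves like a linear threshold element.
print(u(0.5, 0.001))                   # close to e - h = 0.499

# Regime 2: e and h grow in proportion -> saturating output.
eps, eta = 2.0, 1.0
x = np.array([0.1, 1.0, 10.0, 100.0])
out = u(eps * x, eta * x)
closed_form = ((eps - eta) / (2 * eta)) * (1 + np.tanh(0.5 * np.log(eta * x)))
print(np.allclose(out, closed_form))   # True: the identity given in the text
```

The second check confirms the algebra: (ε − η)x/(1 + ηx) and the tanh form are the same function.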

Structure of the cognitron. The basic structure of the cognitron is depicted in the figure below.

[Figure: The basic structure of the cognitron. An excitatory cell u_l(n) in layer U_l receives modifiable connections a_l(v, n) from cells u_{l−1}(n + v) in the preceding layer U_{l−1}, and a connection b_l(n) from the inhibitory cell v_{l−1}(n), which in turn receives fixed connections c_{l−1}(v) from the same cells u_{l−1}(n + v).]

The cognitron has a multi-layered structure. The l-th layer U_l consists of excitatory neurons u_l(n) and inhibitory neurons v_l(n), where n = (n_x, n_y) is a two-dimensional location of the cell. (Here our notational system fails; we adhere to Fukushima's symbols.)


A cell u_l(n) receives inputs via modifiable connections a_l(v, n) from neurons u_{l−1}(n + v), and via connections b_l(n) from neurons v_{l−1}(n), where v is in the connectable area (cf. "area of attention") of the neuron. Furthermore, an inhibitory cell v_{l−1}(n) receives inputs via fixed excitatory connections c_{l−1}(v) from the neighbouring cells u_{l−1}(n + v), and yields an output equal to its weighted input:

    v_{l−1}(n) = Σ_v c_{l−1}(v) u_{l−1}(n + v),

where Σ_v c_{l−1}(v) = 1 and the weights c_{l−1}(v) are fixed.

It can be shown that the growing of the synapses (i.e., modification of the a and b weights) ensures that, if an excitatory neuron has a relatively large response, the excitatory synapses grow faster than the inhibitory synapses, and vice versa.

Receptive region

For each cell in the cascaded layers described above, a connectable area must be established. A connection scheme as in the figure below is used: a neuron in layer U_l connects to a small region in layer U_{l−1}.

[Figure: Cognitron receptive regions: each neuron in layers U_1, U_2, ... connects to a small region in the preceding layer U_0, U_1, ...]

If the connection region of a neuron is constant in all layers, a too large number of layers is needed to cover the whole input layer. On the other hand, increasing the region in later layers results in so much overlap that the output neurons have near-identical connectable areas, and thus all react similarly. This again can be prevented by increasing the size of the vicinity area in which neurons compete, but then only one neuron in the output layer will react to some input stimulus: this is in contradiction with the behaviour of biological brains.

A solution is to distribute the connections probabilistically, such that connections with a large deviation are less numerous.

Simulation results

In order to illustrate the working of the network, a simulation has been run with a four-layered network with a two-dimensional array of neurons in each layer. The network is trained with four learning patterns, consisting of a vertical, a horizontal, and two diagonal lines. The first figure below shows the activation levels in the layers in the first two learning iterations.

After a number of learning iterations, the learning is halted and the activation values of the neurons in the last layer are fed back to the input neurons; also, the maximum output neuron alone is fed back, and thus the input pattern is recognised (see the second figure below).

Figure: Two learning iterations in the cognitron. Four learning patterns (one in each row) are shown in iteration 1 (a.) and iteration 2 (b.). Each column in (a.) and (b.) shows one layer in the network. The activation level of each neuron is shown by a circle; a large circle means a high activation. In the first iteration (a.), a structure is already developing in the second layer of the network. In the second iteration, the second layer can distinguish between the four patterns.

Relaxation types of networks

As demonstrated by the Hopfield network, a relaxation process in a connectionist network can provide a powerful mechanism for solving some difficult optimisation problems. Many vision problems can be considered as optimisation problems, and are potential candidates for an implementation in a Hopfield-like network. A few examples that are found in the literature will be mentioned here.

Depth from stereo

By observing a scene with two cameras, one can retrieve depth information out of the images by finding the pairs of pixels in the images that belong to the same point of the scene. The calculation of the depth is relatively easy; finding the correspondences is the main problem. One solution is to find features such as corners and edges and match those, reducing the computational complexity of the matching. Marr (Marr) showed that the correspondence problem can be solved correctly when taking into account the physical constraints underlying the process. Three matching criteria were defined:

Compatibility: two descriptive elements can only match if they arise from the same physical marking: corners can only match with corners, blobs with blobs, etc.;

Uniqueness: almost always, a descriptive element from the left image corresponds to exactly one element in the right image, and vice versa;

Continuity: the disparity of the matches varies smoothly almost everywhere over the image.

Figure: Feeding back activation values in the cognitron. The four learning patterns are now successively applied to the network (row 1 of figures a-d). Next, the activation values of the neurons in the last layer are fed back to the input (row 2 of figures a-d). Finally, all the neurons except the most active one in the last layer are set to 0, and the resulting activation values are again fed back (row 3 of figures a-d). After only a few iterations, the network has shown itself to be rather robust.

Marr's "cooperative" algorithm (also a "non-cooperative" or local algorithm has been described (Marr)) is able to calculate the disparity map from which the depth can be reconstructed. This algorithm is some kind of neural network, consisting of neurons N(x, y, d), where neuron N(x, y, d) represents the hypothesis that pixel (x, y) in the left image corresponds with pixel (x + d, y) in the right image. The update function is

    N^{t+1}(x, y, d) = σ( Σ_{(x', y', d') ∈ S(x, y, d)} N^t(x', y', d') − ε Σ_{(x', y', d') ∈ O(x, y, d)} N^t(x', y', d') + N^0(x, y, d) ).

Here, ε is an inhibition constant and σ is a threshold function. S(x, y, d) is the local excitatory neighbourhood: candidate matches with the same disparity in a small region around (x, y), supporting the continuity criterion. O(x, y, d) is the local inhibitory neighbourhood: rival matches sharing a line of sight with (x, y, d), enforcing the uniqueness criterion.

The network is loaded with the cross-correlation of the images at first: N^0(x, y, d) = I_l(x, y) I_r(x + d, y), where I_l and I_r are the intensity matrices of the left and right image, respectively. This network state represents all possible matches of pixels. Then the set of possible matches is reduced by recursive application of the update function, until the state of the network is stable.

The algorithm converges in about ten iterations. The disparity of a pixel (x, y) is then displayed by the firing neuron in the set { N(r, s, d) | r = x, s = y }. In each of these sets there should be exactly one neuron firing, but if the algorithm could not compute the exact disparity, for instance at hidden contours, there may be zero or more than one neurons firing.
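A toy sketch of a cooperative update of this kind follows, on a single one-dimensional image row. The neighbourhood choices, sizes, and threshold are simplified assumptions, not Marr's exact sets.

```python
import numpy as np

# A toy cooperative stereo update in the spirit of the algorithm above, on a
# single image row. Same-disparity excitation and rival-disparity inhibition
# at the same pixel are simplified stand-ins for the S and O neighbourhoods.
rng = np.random.default_rng(0)
width, ndisp, true_d = 32, 5, 2

left = rng.random(width)
right = np.roll(left, true_d)                # right[i] = left[i - true_d]

# Initial support N0(x, d): how well left(x) matches right(x + d).
N = np.zeros((width, ndisp))
for d in range(ndisp):
    N[:, d] = 1.0 - np.abs(left - np.roll(right, -d))
N0 = N.copy()

eps, theta = 1.0, 2.0                        # inhibition constant, threshold
for _ in range(10):
    excite = np.zeros_like(N)
    for d in range(ndisp):                   # support from same-disparity neighbours
        excite[:, d] = np.convolve(N[:, d], np.ones(5), mode="same")
    inhibit = N.sum(axis=1, keepdims=True) - N   # rival disparities, same pixel
    N = ((excite - eps * inhibit + N0) > theta).astype(float)

disparity = N.argmax(axis=1)
print((disparity == true_d).mean())          # most pixels recover the true shift
```

As in the text, the network starts from the raw match scores and, after a handful of iterations, settles into a state where (nearly) one disparity hypothesis per pixel survives.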

Image restoration and image segmentation

The restoration of degraded images is a branch of digital picture processing, closely related to image segmentation and boundary finding. An analysis of the major applications and procedures may be found in (Rosenfeld & Kak). An algorithm which is based on the minimisation of an energy function, and which can very well be parallelised, is given by Geman and Geman (Geman & Geman). Their approach is based on stochastic modelling, in which image samples are considered to be generated by a random process that changes its statistical properties from region to region. The random process that generates the image samples is a two-dimensional analogue of a Markov process, called a Markov random field. Image segmentation is then considered as a statistical estimation problem, in which the system calculates the optimal estimate of the region boundaries for the input image. Simultaneous estimation of the region properties and boundary properties has to be performed, resulting in a set of nonlinear estimation equations that define the optimal estimate of the regions. The system must find the maximum a posteriori probability estimate of the image segmentation. Geman and Geman showed that the problem can be recast into the minimisation of an energy function, which, in turn, can be solved approximately by optimisation techniques such as simulated annealing. The interesting point is that simulated annealing can be implemented using a network with local connections, in which the network iterates into a global solution using these local operations.

Silicon retina

Mead and his co-workers (Mead) have developed an analogue VLSI vision preprocessing chip modelled after the retina. The design not only replicates many of the important functions of the first stages of retinal processing, but it does so by replicating, in a detailed way, both the structure and dynamics of the constituent biological units. The logarithmic compression from photon input to output signal is accomplished by analogue circuits, while, similarly, space and time averaging and temporal differentiation are accomplished by analogue processes and a resistive network, discussed further in the implementation chapters.


Part IV

IMPLEMENTATIONS

Implementation of neural networks can be divided into three categories:

software simulation;

(hardware) emulation;

hardware implementation.

The distinction between the former two categories is not clear-cut. We will use the term simulation to describe software packages which can run on a variety of host machines (e.g., PYGMALION, the Rochester Connectionist Simulator, NeuralWare, Nestor, etc.). Implementation of neural networks on general-purpose multi-processor machines, such as the Connection Machine, the Warp, transputers, etc., will be referred to as emulation. Hardware implementation will be reserved for neuro-chips and the like, which are specifically designed to run neural networks.

To evaluate and provide a taxonomy of the neural network simulators and emulators discussed, we will use the following descriptors (cf. the DARPA neural network study):

Equation type: many networks are defined by the type of equation describing their operation. For example, Grossberg's ART (cf. the ART section) is described by the differential equation

    dx_k/dt = -A x_k + (B - x_k) I_k - x_k Σ_{j≠k} I_j,

in which A x_k is a decay term, B I_k is an external input, x_k I_k is a normalisation term, and x_k Σ_{j≠k} I_j is a neighbour shut-off term for competition. Although differential equations are very powerful, they require a high degree of flexibility in the software and hardware, and are thus difficult to implement on special-purpose machines. Other types of equations are, e.g., difference equations, as used in the description of Kohonen's topological maps, and optimisation equations, as used in back-propagation networks.

Connection topology: the design of most general-purpose computers includes random access memory (RAM), such that each memory position can be accessed with uniform speed. Such designs always present a trade-off between size of memory and speed of access. The topology of neural networks can instead be matched in a hardware design with fast local interconnections instead of global access. Most networks are more or less local in their interconnections, and a global RAM is unnecessary.

Processing schema: although most artificial neural networks use a synchronous update, i.e., the output of the network depends on the previous state of the network, asynchronous update, in which components or blocks of components can be updated one by one, can be implemented much more efficiently. Also, continuous update is a possibility encountered in some implementations.

Synaptic transmission mode: most artificial neural networks have a transmission mode based on the neuronal activation values multiplied by synaptic weights. In these models, the propagation time from one neuron to another is neglected. On the other hand, biological neurons output a series of pulses in which the frequency determines the neuron output, such that propagation times are an essential part of the model. Currently, models arise which make use of temporal synaptic transmission (Murray; Tomlinson & Walker).

Table: A possible taxonomy.

The following chapters describe general-purpose hardware which can be used for neural network applications, and neuro-chips and other dedicated hardware.

The term emulation (see, e.g., Mallach for a good introduction to computer design) means running one computer to execute instructions specific to another computer. It is often used to provide the user with a machine which is seemingly compatible with earlier models.

General Purpose Hardware

Parallel computers (Almasi & Gottlieb) can be divided into several categories. One important aspect is the granularity of the parallelism. Broadly speaking, the granularity ranges from coarse-grain parallelism, typically up to ten processors, to fine-grain parallelism, up to thousands or millions of processors.

Both fine-grain and coarse-grain parallelism are in use for emulation of neural networks. The former model, in which one or more processors can be used for each neuron, and the latter correspond with different entries in the taxonomy table. We will discuss one model of each type of architecture: the extremely fine-grain Connection Machine, and coarse-grain systolic arrays, viz. the Warp computer. A more complete discussion should also include transputers, which are very popular nowadays due to their very high performance/price ratio (Group; Board; Eckmiller, Hartmann, & Hauske); in that case, yet another descriptor of the taxonomy table is most applicable.

Besides the granularity, the computers can be categorised by their operation. The most widely used categorisation is by Flynn (Flynn); see the table below. It distinguishes two types of parallel computers: SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple Data). The former type consists of a number of processors which execute the same instructions but on different data, whereas the latter has a separate program for each processor. Fine-grain computers are usually SIMD, while coarse-grain computers tend to be MIMD, again in correspondence with entries in the taxonomy table.

                               Number of Data Streams
                           single              multiple

  Number of     single     SISD                SIMD
  Instruction              (von Neumann)       (vector, array)
  Streams       multiple   MISD                MIMD
                           (pipeline)          (multiple micros)

Table: Flynn's classification.

The table below shows a comparison of several types of hardware for neural network simulation. The speed entry, measured in interconnects per second, is an important measure which is often used to compare neural network simulators. It measures the number of multiply-and-add operations that can be performed per second. However, the comparison is not entirely honest: it does not always include the time needed to fetch the data on which the operations are to be performed, and it may also ignore other functions required by some algorithms, such as the computation of a sigmoid function. Also, the speed is of course dependent on the algorithm used.
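The interconnects-per-second measure can be made concrete with a toy benchmark. The sketch below is illustrative only (the function names and network sizes are our own, not from the table); it counts multiply-and-add operations per wall-clock second and, as noted above, deliberately ignores sigmoid evaluation and data-fetch bookkeeping:

```python
import time

def forward_pass(weights, activations):
    """One multiply-and-add sweep: each output sums w * a over all inputs."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weights]

def interconnects_per_second(n_in, n_out, repeats=50):
    """Count multiply-and-add operations per wall-clock second, ignoring
    the sigmoid and data-fetch costs the text warns about."""
    weights = [[0.01 * (i + j) for j in range(n_in)] for i in range(n_out)]
    activations = [1.0] * n_in
    start = time.perf_counter()
    for _ in range(repeats):
        forward_pass(weights, activations)
    elapsed = time.perf_counter() - start
    return repeats * n_in * n_out / elapsed
```

Because such a benchmark omits the activation function and memory traffic, two simulators with equal interconnects-per-second figures can still differ substantially on a full training run.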


Table: Hardware machines for neural network simulation. The comparison covers, per machine, the word length, storage (in K interconnects), speed (in K interconnects per second), cost (in K$), and the speed/cost ratio, for: workstations and micro/mini computers (PC/AT, Sun, VAX, Symbolics); attached processors (ANZA, Transputer); bus-oriented systems (Mark III, IV, MX); massively parallel machines (CM-2 with 64K processors, Warp, Butterfly); and supercomputers (Cray X-MP).

The authors are well aware that the mentioned computer architectures are archaic: current computer architectures are several orders of magnitude faster. For instance, current-day Sun Sparc machines (e.g., an Ultra) benchmark orders of magnitude more dhrystones per second than the archaic Sun in the table, while prices of both machines (then vs. now) are approximately the same. Go figure. Nevertheless, the table gives an insight into the performance of different types of architectures.

The Connection Machine

Architecture

One of the most outstanding fine-grain SIMD parallel machines is Daniel Hillis' Connection Machine (Hillis; Corporation), originally developed at MIT and later built at Thinking Machines Corporation. The original model, the CM-1, consists of 64K (65,536) one-bit processors, divided up into four units of 16K processors each. The units are connected via a cross-bar switch (the nexus) to up to four front-end computers (see figure). The large number of extremely simple processors makes the machine a data-parallel computer, and it can be best envisaged as active memory.

Each processor chip contains 16 processors, a control unit, and a router. It is connected to a memory chip which contains 4K bits of memory per processor. Each processor consists of a one-bit ALU with three inputs and two outputs, and a set of registers. The control unit decodes incoming instructions broadcast by the front-end computers, which can be DEC VAXes or Symbolics Lisp machines. At any time, a processor may be either listening to the incoming instruction or not.

The router implements the communication algorithm: each router is connected to its nearest neighbours via a two-dimensional grid (the NEWS grid) for fast neighbour communication; also, the chips are connected via a Boolean 12-cube, i.e., chips i and j are connected if and only if |i - j| = 2^k for some integer k. Thus at most twelve hops are needed to deliver a message, and there are 4,096 routers connected by 24,576 bidirectional wires.

By slicing the memory of a processor, the CM can also implement virtual processors.

The CM-2 differs from the CM-1 in that it has 64K bits instead of 4K bits of memory per processor, and an improved I/O system.
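The Boolean n-cube wiring rule can be sketched in a few lines: chips i and j are neighbours exactly when their addresses differ in one bit, and routing resolves one differing address bit per hop. This is a hypothetical model of the routing metric, not Thinking Machines' actual router logic:

```python
def hypercube_neighbours(i, j):
    """Chips i and j are directly wired iff their addresses differ in
    exactly one bit, i.e. |i - j| = 2**k for some integer k."""
    return bin(i ^ j).count("1") == 1

def hops(i, j):
    """Messages are routed one differing address bit at a time, so the
    hop count equals the Hamming distance between chip addresses."""
    return bin(i ^ j).count("1")
```

On the 12-cube connecting 4,096 chips, the worst case is therefore twelve hops.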


[Figure: The Connection Machine system organisation. Four front-end computers (DEC VAX or Symbolics) connect through bus interfaces and the Nexus cross-bar to four sequencers, each driving a Connection Machine unit of 16,384 processors; the Connection Machine I/O system links to Data Vaults, a graphics display, and a network.]

Applicability to neural networks

There have been a few researchers trying to implement neural networks on the Connection Machine (Blelloch & Rosenberg; Singer). Even though the Connection Machine has a topology which matches the topology of most artificial neural networks very well, the relatively slow message passing system makes the machine not very useful as a general-purpose neural network simulator. It appears that the Connection Machine suffers from a dramatic decrease in throughput due to communication delays (Hummel). Furthermore, the cost/speed ratio (see table) is very bad compared to, e.g., a transputer board. As a result, the Connection Machine is not widely used for neural network simulation.

One possible implementation is given in (Blelloch & Rosenberg). Here, a back-propagation network is implemented by allocating one processor per unit, one per outgoing weight, and one per incoming weight. The processors are arranged such that each processor for a unit is immediately followed by the processors for its outgoing weights and preceded by those for its incoming weights. The feed-forward step is performed by first clamping the input units and next executing a copy-scan operation, moving those activation values to the next k processors (the outgoing weight processors). The weights then multiply themselves with the activation values and perform a send operation, in which the resulting values are sent to the processors allocated for incoming weights. A plus-scan then sums these values into the next layer of units in the network. The feedback step is executed similarly. Both the feed-forward and feedback steps can be interleaved and pipelined such that no layer is ever idle: for example, for the feed-forward step, a new pattern x^(p+1) is clamped on the input layer while the next layer is still computing on x^(p), etc. To prevent inefficient use of processors, one weight could also be represented by one processor.
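The copy-scan/plus-scan mapping can be imitated serially. In the sketch below (our own simplified layout: a dense weight matrix rather than one processor per weight), a copy-scan broadcasts each unit's activation to its outgoing-weight positions, the weight processors multiply in place, and a plus-scan sums the routed products per receiving unit:

```python
def feed_forward_layer(activations, weights):
    """One layer of the scan-based mapping, simulated serially.
    activations: outputs of the lower layer.
    weights[i][j]: weight from lower unit i to upper unit j.
    """
    n_upper = len(weights[0])
    # copy-scan: each unit broadcasts its activation to the processors
    # holding its outgoing weights
    broadcast = [a for a in activations for _ in range(n_upper)]
    flat_w = [w for row in weights for w in row]
    # each weight processor multiplies itself with the activation
    products = [w * a for w, a in zip(flat_w, broadcast)]
    # send + plus-scan: route product i->j to upper unit j and sum
    return [sum(products[i * n_upper + j] for i in range(len(activations)))
            for j in range(n_upper)]
```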


Systolic arrays

Systolic arrays (Kung & Leiserson) take advantage of laying out algorithms in two dimensions. The design favours compute-bound as opposed to I/O-bound operations. The name systolic is derived from the analogy of pumping blood through a heart and feeding data through a systolic array.

A typical use is depicted in the figure below. Here, two band matrices A and B are multiplied and added to C, resulting in an output C + AB. Essential in the design is the reuse of data elements, instead of referencing the memory each time an element is needed.

[Figure: Typical use of a systolic array. The band matrices A and B stream through a grid of systolic cells, each computing c + a*b, so that the elements of C + A*B emerge from the array.]

The Warp computer, developed at Carnegie Mellon University, has been used for simulating artificial neural networks (Pomerleau, Gusciora, Touretzky, & Kung); see the hardware table. It is a system with ten or more programmable one-dimensional systolic arrays. Two data streams, one of which is bidirectional, flow through the processors (see figure). To implement a matrix product W x, the W is not a stream as in the figure above, but is stored in the memory of the processors.

[Figure: The Warp system architecture. A Warp interface and host feed an X stream, a Y stream, and addresses through a chain of cells 1 ... n.]
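The behaviour of such a one-dimensional array computing y = W x can be imitated with a beat-by-beat simulation. In the sketch below (an idealised model, not the Warp's actual programming interface), each cell stores one column of W and its element of x, while partial sums of y travel through the pipeline, picking up one contribution per cell:

```python
def systolic_matvec(W, x):
    """Time-stepped simulation of a 1-D systolic array computing y = W x.
    Cell j stores column j of W and element x[j]; partial sums of y flow
    left-to-right through the cells, one pipeline shift per beat."""
    m, n = len(W), len(x)
    cells = [{"col": [W[i][j] for i in range(m)], "x": x[j]} for j in range(n)]
    pipeline = [None] * n   # pipeline[j] = (row index, partial sum) in cell j
    y = [None] * m
    for beat in range(m + n):
        # the sum leaving the last cell is a finished element of y
        out = pipeline[n - 1]
        if out is not None:
            y[out[0]] = out[1]
        # shift the pipeline and inject the next row's (empty) partial sum
        for j in range(n - 1, 0, -1):
            pipeline[j] = pipeline[j - 1]
        pipeline[0] = (beat, 0.0) if beat < m else None
        # every cell adds w[i][j] * x[j] to the sum it now holds
        for j, cell in enumerate(cells):
            if pipeline[j] is not None:
                i, acc = pipeline[j]
                pipeline[j] = (i, acc + cell["col"][i] * cell["x"])
    return y
```

The pipelining is what matters: in steady state one element of y is completed per beat, which is why such arrays favour compute-bound problems.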

Dedicated Neuro-Hardware

Recently, many neuro-chips have been designed and built. Although many techniques, such as digital and analogue electronics, optical computers, chemical implementation, and bio-chips, are investigated for implementing neuro-computers, only digital and analogue electronics, and to a lesser degree optical implementations, are at present feasible techniques. We will therefore concentrate on such implementations.

General issues

Connectivity constraints

Connectivity within a chip

A major problem with neuro-chips always is the connectivity. A single integrated circuit is, in current-day technology, planar with limited possibility for cross-over connections. This poses a problem: whereas connectivity to the nearest neighbour can be implemented without problems, connectivity to the second nearest neighbour results in a cross-over of four, which is already problematic. On the other hand, full connectivity between a set of input and output units can be easily attained when the input and output neurons are situated near two edges of the chip (see figure). Note that in this layout the number of neurons in the chip grows linearly with the size of the chip, whereas in the earlier layout the dependence is quadratic.

[Figure: Connections between M input and N output neurons, with the inputs along one edge of the chip and the outputs along another.]

Connectivity between chips

To build large or layered ANNs the neurochips have to be connected together When only

few neurons have to be connected together or the chips can be placed in subsequent rows in

feedforward typ es of networks this is no problem But in other cases when large numbers

CHAPTER DEDICATED NEUROHARDWARE

of neurons in one chip have to be connected to neurons in other chips there are a number of

problems

designing chip packages with a very large numb er of input or output leads

fanout of chips each chip can ordinarily only send signals two a small number of other

chips Ampliers are needed which are costly in p ower dissipation and chip area

wiring

A p ossible solution would b e using optical interconnections In this case an external light source

would reect light on one set of neurons whichwould reect part of this light using deformable

mirror spatial light mo dulator technology on to another set of neurons Also under development

are threedimensional integrated circuits

Analogue vs. digital

Due to the similarity between artificial and biological neural networks, analogue hardware seems a good choice for implementing artificial neural networks, resulting in cheaper implementations which operate at higher speed. On the other hand, digital approaches offer far greater flexibility and, not to be neglected, arbitrarily high accuracy. Also, digital chips can be designed without the need of very advanced knowledge of the circuitry, using CAD/CAM systems, whereas the design of analogue chips requires good theoretical knowledge of transistor physics as well as experience.

An advantage that analogue implementations have over digital neural networks is that they closely match the physical laws present in neural networks (cf. the equation-type descriptor of the taxonomy table). First of all, weights in a neural network can be coded by one single analogue element (e.g., a resistor), where several digital elements are needed. Secondly, very simple rules such as Kirchhoff's laws can be used to carry out the addition of input signals. As another example, Boltzmann machines (see the section on Boltzmann machines) can be easily implemented by amplifying the natural noise present in analogue devices.

Optics

As mentioned above, optics could be very well used to interconnect several layers of neurons. One can distinguish two approaches. One is to store weights in a planar transmissive or reflective device (e.g., a spatial light modulator) and use lenses and fixed holograms for interconnection. The figure below shows an implementation of optical matrix multiplication. When N is the linear size of the optical array divided by the wavelength of the light used, the array has capacity for N^2 weights, so it can fully connect N neurons with N neurons (Fahrat, Psaltis, Prata, & Paek).

A second approach uses volume holographic correlators, offering connectivity between two areas of N^2 neurons for a total of N^4 connections. A possible use of such volume holograms in an all-optical network would be to use the system for image completion (Abu-Mostafa & Psaltis). A number of images could be stored in the hologram. The input pattern is correlated with each of them, resulting in output patterns with a brightness varying with the

On the other hand, the opposite can be found when considering the size of the element, especially when high accuracy is needed. However, once artificial neural networks have outgrown rules like back-propagation, high accuracy might not be needed.

The Kirchhoff laws state that for two resistors R1 and R2, the total resistance in series can be calculated using R = R1 + R2, and in parallel using 1/R = 1/R1 + 1/R2 (Feynman, Leighton, & Sands).

Well, not exactly. Due to diffraction, the total number of independent connections that can be stored in an ideal medium is N^3, i.e., the volume of the hologram divided by the cube of the wavelength. So, in fact, N^(3/2) neurons can be connected with N^(3/2) neurons.


[Figure: Optical implementation of matrix multiplication. A laser per row illuminates a weight mask; a receptor per column collects the transmitted light.]

degree of correlation. The images are fed into a threshold device which will conduct the image with highest brightness better than the others. This enhancement can be repeated for several loops.
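The correlate-threshold-feedback loop can be sketched numerically. In the toy model below (our own simplification: correlation is an inner product, and the threshold device is modelled by squaring the brightness), the best-matching stored image comes to dominate the estimate after a few loops:

```python
def correlate(a, b):
    """Inner-product stand-in for the optical correlation."""
    return sum(x * y for x, y in zip(a, b))

def complete_image(stored, partial, loops=5):
    """Iterative completion: correlate the current estimate with every
    stored pattern, then feed back a brightness-weighted mix in which
    the best-matching pattern dominates more on every pass."""
    estimate = list(partial)
    for _ in range(loops):
        scores = [max(correlate(p, estimate), 0.0) for p in stored]
        total = sum(s ** 2 for s in scores) or 1.0
        # squaring the brightness plays the role of the threshold device
        estimate = [sum((s ** 2 / total) * p[i] for s, p in zip(scores, stored))
                    for i in range(len(partial))]
    return estimate
```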

Learning vs. non-learning

It is generally agreed that the major forte of neural networks is their ability to learn. Whereas a network with fixed, pre-computed weight values could have its merit in industrial applications, on-line adaptivity remains a design goal for most neural systems.

With respect to learning, we can distinguish between the following levels:

- fixed weights: the design of the network determines the weights. Examples are the retina and cochlea chips of Carver Mead's group, discussed below (cf. a ROM (Read-Only Memory) in computer design);
- pre-programmed weights: the weights in the network can be set only once, when the chip is installed. Many optical implementations fall in this category (cf. PROM (Programmable ROM));
- programmable weights: the weights can be set more than once by an external device (cf. EPROM (Erasable PROM) or EEPROM (Electrically Erasable PROM));
- on-site adapting weights: the learning mechanism is incorporated in the network (cf. RAM (Random Access Memory)).

Implementation examples

Carver Mead's silicon retina

The chips devised by Carver Mead's group at Caltech (Mead) are heavily inspired by biological neural networks. Mead attempts to build analogue neural chips which match biological neurons as closely as possible, including extremely low power consumption, fully analogue hardware, and operation in continuous time (cf. the taxonomy table). One example of such a chip is the Silicon Retina (Mead & Mahowald).


Retinal structure

The off-center retinal structure can be described as follows. Light is transduced to electrical signals by photo-receptors, which have a primary pathway through the triad synapses to the bipolar cells. The bipolar cells are connected to the retinal ganglion cells, which are the output cells of the retina. The horizontal cells, which are also connected via the triad synapses to the photo-receptors, are situated directly below the photo-receptors and have synapses connected to the axons leading to the bipolar cells.

The system can be described in terms of the triad synapses' three elements:

- the photo-receptor outputs the logarithm of the intensity of the light;
- the horizontal cells form a network which averages the photo-receptor output over space and time;
- the output of the bipolar cell is proportional to the difference between the photo-receptor output and the horizontal cell output.

The photo-receptor

The photo-receptor circuit outputs a voltage which is proportional to the logarithm of the intensity of the incoming light. There are two important consequences:

- several orders of magnitude of intensity can be handled in a moderate signal level range;
- the voltage difference between two points is proportional to the contrast ratio of their illuminance.
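Both consequences follow directly from a logarithmic response V = V0 + k log I. A minimal numeric sketch (v0 and k are illustrative circuit constants, not Mead's values):

```python
import math

def photoreceptor_voltage(intensity, v0=2.0, k=0.1):
    """Output voltage proportional to the log of light intensity
    (v0 and k are illustrative constants)."""
    return v0 + k * math.log(intensity)

def contrast_difference(i1, i2, k=0.1):
    """The voltage difference between two points depends only on the
    contrast ratio i1/i2, not on the absolute illumination level."""
    return photoreceptor_voltage(i1, k=k) - photoreceptor_voltage(i2, k=k)
```

A factor-of-two contrast gives the same voltage difference in moonlight as in sunlight, and a million-fold intensity range is compressed into a swing of only a few tenths of a volt.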

The photo-receptor can be implemented using a photo-detector, two FETs (Field Effect Transistors) connected in series, and one transistor (see figure). The lowest photo-current that can be handled corresponds with a moonlit scene.

[Figure: The photo-receptor used by Mead, and its output voltage Vout (between roughly 2.2 and 3.6 V on the measured curve) as a function of light intensity. To prevent current being drawn from the photo-receptor, the output is only connected to the gate of the transistor.]

A detailed description of the electronics involved is out of place here; however, we will provide figures where useful. See (Mead) for an in-depth study.


Horizontal resistive layer

Each photo-receptor is connected to its six neighbours via resistors, forming a hexagonal array. The voltage at every node in the network is a spatially weighted average of the photo-receptor inputs, such that farther-away inputs have less influence (see figure (a)).

[Figure: The resistive layer (a) and, enlarged, a single node (b), with the photo-receptor input, two amplifiers, and the ganglion output.]
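The spatially weighted averaging of a resistive network can be imitated by relaxation. The sketch below uses a one-dimensional chain instead of the chip's hexagonal array (an illustrative simplification); each node settles to a mix of its own input and its neighbours' voltages, so distant inputs have exponentially less influence:

```python
def resistive_average(inputs, coupling=0.4, iters=200):
    """Jacobi relaxation of a 1-D resistive chain: each node balances its
    own photo-receptor input against its neighbours' voltages."""
    v = list(inputs)
    n = len(v)
    for _ in range(iters):
        v = [(inputs[i]
              + coupling * (v[i - 1] if i > 0 else v[i])
              + coupling * (v[i + 1] if i < n - 1 else v[i]))
             / (1 + 2 * coupling)
             for i in range(n)]
    return v
```

A single bright input is smeared over its neighbours, with the influence falling off geometrically with distance, which is the smoothing the horizontal cells provide.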

Bipolar cell

The output of the bipolar cell is proportional to the difference between the photo-receptor output and the voltage of the horizontal resistive layer. The architecture is shown in figure (b). It consists of two elements: a wide-range amplifier which drives the resistive network towards the photo-receptor output, and an amplifier sensing the voltage difference between the photo-receptor output and the network potential.

Implementation

A chip was built containing an array of pixels. The output of every pixel can be accessed independently by providing the chip with the horizontal and vertical address of the pixel. The selectors can be run in two modes: static probe or serial access. In the first mode, a single row and column are addressed and the output of a single pixel is observed as a function of time. In the second mode, both vertical and horizontal shift registers are clocked to provide a serial scan of the processed image for display on a television display.

Performance

Several experiments show that the silicon retina performs similarly to the biological retina (Mead & Mahowald). Similarities are shown between sensitivity for intensities, time responses for a single output when flashes of light are input, and responses to contrast edges.

LEP's LNeuro chip

A radically different approach is the LNeuro chip, developed at the Laboratoires d'Electronique Philips (LEP) in France (Theeten, Duranton, Mauduit, & Sirat; Duranton & Sirat). Whereas most neuro-chips implement Hopfield networks (see the Hopfield section) or, in some cases, Kohonen networks (see the Kohonen section), due to the fact that these networks have local learning rules, these digital neuro-chips can be configured to incorporate any learning rule and network topology.

Architecture

The LNeuro chip, depicted in the figure below, consists of a multiply-and-add (or relaxation) part and a learning part, in which a number of learning processors operate in parallel. The weights w_ij are several bits long in the relaxation phase and a single bit in the learning phase.

[Figure: The LNeuro chip (for clarity, only four neurons are drawn). A learning control and learning register drive learning processors LP_j over a parallel loadable and readable synaptic memory; multipliers feed a tree of adders into an ALU and accumulator; neural state registers and a nonlinear function close the relaxation loop.]

Multiply-and-add

The multiply-and-add part in fact performs a matrix multiplication

    y_j(t+1) = F( Σ_k w_jk y_k(t) ).

The input activations y_k are kept in the neural state registers. For each neural state there are two registers. These can be used to implement synchronous or asynchronous update. In the former mode, the computed states of the neurons wait in registers until all states are known; then the whole register bank is written into the registers used for the calculations. In asynchronous mode, however, every new state is directly written into the register used for the next calculation.
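The role of the two register banks can be illustrated with threshold units: in synchronous mode, all new states are computed from the old bank, whereas in asynchronous mode each freshly written state is already visible to the units computed after it. This is a schematic model, not the LNeuro instruction set:

```python
def step(weights, state, update):
    """One relaxation sweep over +-1 threshold units.
    update='sync'  : all units read the old register bank;
    update='async' : each new state is written back immediately."""
    current = list(state)   # old register bank
    new = list(state)       # bank being written
    for k in range(len(state)):
        source = current if update == "sync" else new
        net = sum(w * s for w, s in zip(weights[k], source))
        new[k] = 1 if net >= 0 else -1
    return new
```

On the same starting state, the two modes can yield different trajectories, which is why the chip keeps both banks available.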

The arithmetical logical unit (ALU) has an external input to allow for accumulation of external partial products. This can be used to construct larger, structured, or higher-precision networks.

The neural states y_k are coded in one to eight bits, whereas either eight or sixteen bits can be used for the weights, which are kept in a RAM. In order to save silicon area, the multiplications w_jk y_j are serialised over the bits of y_j, replacing N eight-by-eight-bit parallel multipliers by N eight-bit AND gates. The partial products are saved and added in the tree of adders.
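Bit-serial multiplication replaces a full multiplier by an AND gate plus shifted additions. A sketch of one w_jk * y_j product as the chip would serialise it over the bits of y_j:

```python
def bit_serial_multiply(w, y, n_bits=8):
    """Multiply weight w by activation y one bit of y at a time: each beat,
    an AND gate passes w where bit b of y is set, and the shifted partial
    products are accumulated (the job of the tree of adders)."""
    acc = 0
    for b in range(n_bits):
        if (y >> b) & 1:      # the eight-bit AND gate
            acc += w << b     # shifted partial product into the adder tree
    return acc
```

The cost is one beat per activation bit, which is why coding the states y_k in fewer bits speeds up the relaxation.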


The computation time thus increases linearly with the number of neurons, instead of quadratically as in simulation on serial machines.

The activation function is, for reasons of flexibility, kept off-chip. The results of the weighted sum calculation go off-chip serially, i.e., bit by bit, and the result must be written back to the neural state registers.

Finally, a column of latches is included to temporarily store memory values, such that, during a multiply of the weight with several bits, the memory can be freely accessed. These latches in fact take part in the learning mechanism, described below.

Learning

The remaining parts of the chip are dedicated to the learning mechanism, which is designed to implement the Hebbian learning rule (Hebb)

    w_jk ← w_jk + η_k y_j,

where η_k is a scalar which only depends on the output neuron k. To simplify the circuitry, this rule is reduced to

    w_jk ← w_jk + η_k g(η_k, y_j),

where g(η_k, y_j) can have value -1, 0, or +1. In effect, the reduced rule either increments or decrements w_jk with η_k, or keeps w_jk unchanged. The full rule can thus be simulated by executing the reduced rule several times over the same set of weights.

The weights w_jk related to the output neuron k are all modified in parallel. A learning step proceeds as follows. Every learning processor LP_j (see figure) loads the weight w_jk from the synaptic memory, the η_k from the learning register, and the neural state y_j. Next, they all modify their weights in parallel using the reduced rule, and write the adapted weights back to the synaptic memory, also in parallel.
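A serial sketch of one such learning step follows (the choice of g below is illustrative; the chip's actual g is loaded from its learning functions). Repeating the increment/decrement step approximates the full rule w_jk ← w_jk + η_k y_j:

```python
def hebb_step(weights, y, k, eta_sign):
    """One LNeuro-style learning step for output neuron k: every weight
    w[j][k] is incremented, decremented, or left alone according to a
    ternary g in {-1, 0, +1} (g here is a simple illustrative choice)."""
    def g(y_j):
        if y_j == 0:
            return 0
        return eta_sign      # +1 or -1, as set by the learning register
    for j in range(len(weights)):
        weights[j][k] += g(y[j])
    return weights
```

With binary states, executing the step eta times over the same weights reproduces the full Hebbian increment η_k y_j.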


References

Abu-Mostafa, Y. A., & Psaltis, D. Optical neural computers. Scientific American.

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cognitive Science.

Ahalt, S. C., Krishnamurthy, A. K., Chen, P., & Melton, D. Competitive learning algorithms for vector quantization. Neural Networks.

Almasi, G. S., & Gottlieb, A. Highly Parallel Computing. The Benjamin/Cummings Publishing Company, Inc.

Almeida, L. B. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the First International Conference on Neural Networks. IEEE.

Amit, D. J., Gutfreund, H., & Sompolinsky, H. Spin-glass models of neural networks. Physical Review A.

Anderson, J. A. Neural models with cognitive implications. In D. LaBerge & S. J. Samuels (Eds.), Basic Processes in Reading Perception and Comprehension Models. Hillsdale, NJ: Erlbaum.

Anderson, J. A., & Rosenfeld, E. Neurocomputing: Foundations of Research. Cambridge, MA: The MIT Press.

Barto, A. G., & Anandan, P. Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics.

Barto, A. G., Sutton, R. S., & Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning problems. IEEE Transactions on Systems, Man and Cybernetics.

Barto, A. G., Sutton, R. S., & Watkins, C. Sequential decision problems and neural networks. In D. Touretzky (Ed.), Advances in Neural Information Processing II.

Bellman, R. Dynamic Programming. Princeton University Press.

Blelloch, G., & Rosenberg, C. R. Network learning on the Connection Machine. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence.

Board, J. A. B., Jr. Transputer Research and Applications: Proceedings of the Second Conference of the North American Transputer Users Group. Durham, NC: IOS Press.


Boomgaard, R. van den, & Smeulders, A. Self-learning image processing using a-priori knowledge of spatial patterns. In T. Kanade, F. C. A. Groen, & L. O. Hertzberger (Eds.), Proceedings of the IAS Conference. Elsevier Science Publishers.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. Classification and Regression Trees. Wadsworth and Brooks/Cole.

Bruce, A. D., Canning, A., Forrest, B., Gardner, E., & Wallace, D. J. Learning and memory properties in fully connected networks. In J. S. Denker (Ed.), AIP Conference Proceedings: Neural Networks for Computing.

Carpenter, G. A., & Grossberg, S. (a). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing.

Carpenter, G. A., & Grossberg, S. (b). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics.

Corporation, T. M. Connection Machine Model CM-2 Technical Summary (Tech. Rep.). Thinking Machines Corporation.

Cottrell, G. W., Munro, P., & Zipser, D. Image compression by back-propagation: a demonstration of extensional programming (Tech. Rep.). UCSD, Institute of Cognitive Sciences.

Craig, J. J. Introduction to Robotics. Addison-Wesley Publishing Company.

Cun, Y. L. Une procédure d'apprentissage pour réseau à seuil assymétrique. Proceedings of Cognitiva.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.

DARPA. DARPA Neural Network Study. AFCEA International Press.

Dastani, M. M. Functie-Benadering met Feed-Forward Netwerken. Unpublished master's thesis, Universiteit van Amsterdam, Faculteit Computer Systemen.

Duranton, M., & Sirat, J. A. Learning on VLSI: A general-purpose digital neurochip. In Proceedings of the Third International Joint Conference on Neural Networks.

Eckmiller, R., Hartmann, G., & Hauske, G. Parallel Processing in Neural Systems and Computers. Elsevier Science Publishers B.V.

Elman, J. L. Finding structure in time. Cognitive Science.

Fahrat, N. H., Psaltis, D., Prata, A., & Paek, E. Optical implementation of the Hopfield model. Applied Optics.

Feldman, J. A., & Ballard, D. H. Connectionist models and their properties. Cognitive Science.


Feynman, R. P., Leighton, R. B., & Sands, M. The Feynman Lectures on Physics. Reading, MA: Addison-Wesley Publishing Company.

Flynn, M. J. Some computer organizations and their effectiveness. IEEE Transactions on Computers.

Friedman, J. H. Multivariate adaptive regression splines. Annals of Statistics.

Fritzke, B. Let it grow: self-organizing feature maps with problem dependent cell structure. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Proceedings of the International Conference on Artificial Neural Networks. North-Holland/Elsevier Science Publishers.

Fukushima, K. Cognitron: A self-organizing multilayered neural network. Biological Cybernetics.

Fukushima, K. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks.

Funahashi, K.-I. On the approximate realization of continuous mappings by neural networks. Neural Networks.

Garey, M. R., & Johnson, D. S. Computers and Intractability. New York: W. H. Freeman.

Geman, S., & Geman, D. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Gielen, C., Krommenhoek, K., & Gisbergen, J. van. A procedure for self-organized sensor-fusion in topologically ordered maps. In T. Kanade, F. C. A. Groen, & L. O. Hertzberger (Eds.), Proceedings of the Second International Conference on Autonomous Systems. Elsevier Science Publishers.

Gorman, R. P., & Sejnowski, T. J. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks.

Grossberg, S. Adaptive pattern classification and universal recoding, I & II. Biological Cybernetics.

Group, O. U. Parallel Programming of Transputer Based Machines: Proceedings of the Occam User Group Technical Meeting. Grenoble, France: IOS Press.

Gullapalli, V. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks.

Hartigan, J. A. Clustering Algorithms. New York: John Wiley & Sons.

Hartman, E. J., Keeler, J. D., & Kowalski, J. M. Layered neural networks with Gaussian hidden units as universal approximations. Neural Computation.

Hebb, D. O. The Organization of Behaviour. New York: Wiley.

REFERENCES

HechtNielsen R Counterpropagation networks Neural Networks Cited

on p

Hertz J Krogh A Palmer R G Intro duction to the Theory of Neural Computation

Addison Wesley Cited on p

Hesselroth T Sarkar K Smagt P van der Schulten K Neural network control

of a pneumatic rob ot arm IEEE Transactions on Systems Man and Cyb ernetics

Cited on p

Hestenes M R Stiefel E Metho ds of conjugate gradients for solving linear systems

Nat Bur Standards J Res Cited on p

Hillis W D The Connection Machine The MIT Press Cited on p

Hopeld J J Neural networks and physical systems with emergent collective compu

tational abilities Pro ceedings of the National Academy of Sciences Cited

on pp

Hopeld J J Neurons with graded resp onse have collective computational prop erties

like those of twostate neurons Pro ceedings of the National Academy of Sciences

Citedonp

Hopeld J J Feinstein D I Palmer R G unlearning has a stabilizing eect in

collective memories Nature Cited on p

Hopeld J J Tank D W neural computation of decisions in optimization

problems Biological Cyb ernetics Cited on p

Hornik K Stinchcomb e M White H Multilayer feedforward networks are universal

approximators Neural Networks Cited on p

Hummel R A Personal Communication Cited on p

Jansen A Smagt PPvan der Gro en F C A Nested networks for rob ot control

In A F Murray Ed Neural Network Applications Kluwer Academic Publishers In

press Cited on pp

Jordan M I a Attractor dynamics and parallelism in a connectionist sequential machine

In Pro ceedings of the Eighth Annual Conference of the Cognitive Science So ciety pp

Hillsdale NJ Erlbaum Cited on p

Jordan M I b Serial OPrder A Parallel Distributed Pro cessing Approach Tech

Rep No San Diego La Jolla CA Institute for Cognitive Science University of

California Cited on p

Jorgensen C C Neural network representation of sensor graphs in autonomous rob ot

path planning In IEEE First International Conference on Neural Networks Vol IV pp

IEEE Cited on p

Josin G Neuralspace generalization of a top ological transformation Biological Cyb er

netics Cited on p

Katayama M Kawato M AParallelHierarchical Neural Network Mo del for Motor

Control of a MusculoSkeletal System Tech Rep Nos TRA ATR Auditory and

Visual Perception Research Lab oratories Cited on p

REFERENCES

Kawato M Furukawa K Suzuki R A hierarchical neuralnetwork mo del for control

and learning of voluntary movement Biological Cyb ernetics Citedonp

Kohonen T Asso ciative Memory A SystemTheoretical Approach SpringerVerlag

Cited on pp

Kohonen T Selforganized formation of top ologically correct feature maps Biological

Cyb ernetics Cited on p

Kohonen T SelfOrganization and Asso ciativeMemory Berlin SpringerVerlag Cited

on p

Kohonen T SelfOrganizing Maps Springer Cited on p

Kohonen T Makisara M Saramaki T Phonotopic mapsinsightful represen

tation of phonological features for sp eech recognition In Pro ceedings of the th IEEE

International Conference on Pattern Recognition DUNNO Cited on p

Krose B J A Dam J W M van Learning to avoid collisions A reinforcement

learning paradigm for mobile rob ot manipulation In Pro ceedings of IFACIFIPIMACS

International Symp osium on Articial Intelligence in RealTime Control p

Delft IFAC Laxenburg Cited on p

Krose B J A Korst M J van der Gro en F C A Learning strategies for a

vision based neural controller for a rob ot arm In O Kaynak Ed IEEE International

Workshop on IntelligentMotorControl pp IEEE Cited on p

Kung H T Leierson C E Systolic arrays for VLSI In Sparse Matrix Pro ceedings

Academic Press also in Algorithms for VLSI Pro cessor Arrays C Mead L

Conway eds AddisonWesley Cited on p

Lippmann R P An intro duction to computing with neural nets IEEE Transactions

on Acoustics Sp eech and Signal Pro cessing Cited on pp

Lippmann R P Review of neural networks for sp eech recognition Neural Computation

Cited on p

Mallach E G Emulator architecture Computer Cited on p

Marr D Vision San Francisco W H Freeman Cited on pp

Martinetz T Schulten K A neuralgas network learns top ologies In T Koho

nen K Makisara O Simula J Kangas Eds Pro ceedings of the International

Conference on Articial Neural Networks pp NorthHollandElsevier Science

Publishers Cited on p

McClelland J L Rumelhart D E Parallel Distributed Pro cessing Explorations in

the Microstructure of Cognition The MIT Press Cited on pp

McCullo ch W S Pitts W A logical calculus of the ideas immanent in nervous

activity Bulletin of Mathematical Biophysics Cited on p

Mead C Analog VLSI and Neural Systems Reading MA AddisonWesley Cited on

pp

Mead C A Mahowald M A A silicon mo del of early visual pro cessing Neural

Networks Cited on pp

REFERENCES

Mel B W Connectionist Rob ot Motion Planning San Diego CA Academic Press

Cited on p

Minsky M Pap ert S Perceptrons An Intro duction to Computational Geometry

The MIT Press Cited on pp

Murray A F Pulse arithmetic in VLSI neural networks IEEE Micro

Cited on p

Oja E A simplied neuron mo del as a principal comp onent analyzer Journal of

mathematical biology Cited on p

Parker D B LearningLogic Tech Rep Nos TR Cambridge MA Massachusetts

Institute of Technology Center for Computational Research in Economics and Manage

ment Science Cited on p

Pearlmutter B A Learning state space tra jectories in recurrent neural networks Neural

Computation Cited on p

Pearlmutter B A Dynamic Recurrent Neural Networks Tech Rep Nos CMUCS

Pittsburgh PA Scho ol of Computer Science Carnegie Mellon University

Cited on pp

Pineda F Generalization of backpropagation to recurrent neural networks Physical

Review Letters Cited on p

Polak E Computational Metho ds in Optimization New York Academic Press Cited

on p

Pomerleau D A Gusciora G L Touretzky D S Kung H T Neural network

simulation at Warp sp eed Howwe got million connections p er second In IEEE Second

International Conference on Neural Networks Vol II p DUNNO Cited on

p

Powell M J D Restart pro cedures for the conjugate gradient metho d Mathematical

Programming Cited on p

Press W H FlanneryBP Teukolsky S A Vetterling W T Numerical Recip es

The Art of Scientic Computing Cambridge Cambridge University Press Cited on

pp

Psaltis D Sideris A Yamamura A A A multilayer neural network controller

IEEE Control Systems Magazine Cited on p

Ritter H J Martinetz T M Schulten K J Top ologyconserving maps for learning

visuomotorco ordination Neural Networks Cited on p

Ritter H J Martinetz T M Schulten K J Neuronale Netze AddisonWesley

Publishing Company Cited on p

Rosen B E Go o dwin J M Vidal J J Pro cess control with adaptive range co ding

Biological Cyb ernetics Cited on p

Rosenblatt F Principles of Neuro dynamics New York Spartan Bo oks Cited on pp

REFERENCES

Rosenfeld A Kak A C Digital Picture Pro cessing New York Academic Press

Cited on p

Rumelhart D E Hinton G E Williams R J Learning representations by back

propagating errors Nature Cited on p

Rumelhart D E McClelland J L Parallel Distributed Pro cessing Explorations in

the Microstructure of Cognition The MIT Press Cited on pp

Rumelhart D E Zipser D Feature discovery by comp etitive learning Cognitive

Science Cited on p

Sanger T D Optimal unsup ervised learning in a singlelayer linear feedforward neural

network Neural Networks Cited on p

Sejnowski T J Rosenb erg C R NETtalk A Parallel Network that Learns to

Read Aloud Tech Rep Nos JHUEECS The John Hopkins University Electrical

Engineering and Computer Science Department Cited on p

Silva F M Almeida L B Sp eeding up backpropagation In R Eckmiller Ed

Advanced Neural Computers pp NorthHolland Cited on p

Singer A Implementations of Articial Neural Networks on the Connection Machine

Tech Rep Nos RL Cambridge MA Thinking Machines Corp oration Cited on

p

Smagt Pvan der Gro en F Krose B Rob ot HandEye Co ordination Using Neural

Networks Tech Rep Nos CS Department of Computer Systems University of

Amsterdam ftpable from archivecisohiostateedu Cited on p

Smagt P van der Krose B J A Gro en F C A Using timetocontact to guide

a rob ot manipulator In Pro ceedings of the IEEERSJ International Conference on

Intelligent Rob ots and Systems pp IEEE Cited on p

Smagt PPvan der Krose B J A A realtime learning neural rob ot controller In

T Kohonen K Makisara O Simula J Kangas Eds Pro ceedings of the Inter

national Conference on Articial Neural Networks pp NorthHollandElsevier

Science Publishers Cited on pp

Sofge D White D Applied learning optimal control for manufacturing In

D Sofge D White Eds Handb o ok of Intelligent Control Neural Fuzzy and Adaptive

Approaches Van Nostrand Reinhold New York In press Cited on p

Sto er J Bulirsch R Intro duction to Numerical Analysis New YorkHeidelb erg

Berlin SpringerVerlag Citedonp

Sutton R S Learning to predict by the metho ds of temp oral dierences Machine

Learning Cited on p

Sutton R S Barto A Wilson R Reinforcement learning is direct adaptive optimal

control IEEE Control Systems Cited on p

Theeten J B Duranton M Mauduit N Sirat J A The LNeuro chip A digital

VLSI with onchip learning mechanism In Pro ceedings of the International Conference on

Neural Networks Vol I pp DUNNO Cited on p

REFERENCES

Tomlinson M S Jr Walker D J DNNA A digital neural network architecture

In Pro ceedings of the International Conference on Neural Networks Vol I I pp

DUNNO Cited on p

Watkins C J C H Dayan P Qlearning Cited on

pp

Werb os P Approximate dynamic programming for realtime control and neural mo d

elling In D Sofge D White Eds Handb o ok of Intelligent Control Neural Fuzzy

and Adaptive Approaches Van Nostrand Reinhold New York Cited on pp

Werb os PJ Beyond Regression New To ols for Prediction and Analysis in the Behav

ioral Sciences Unpublished do ctoral dissertation Harvard University Cited on p

Werb os PW A menu for designs of reinforcement learning over time In W T M I I I

R S Sutton PJWerb os Eds Neural Networks for Control MIT PressBradford

Cited on p

White D Jordan M Optimal control a foundation for intelligent control In

D Sofge D White Eds Handb o ok of Intelligent Control Neural Fuzzy and Adaptive

Approaches Van Nostrand Reinhold New York Cited on p

Widrow B Ho M E Adaptive switching circuits In IRE WESCON

Convention Record pp DUNNO Cited on pp

Widrow B Winter R G Baxter R A Layered neural nets for pattern recognition

IEEE Transactions on Acoustics Sp eech and Signal Pro cessing Cited

on p

Wilson G V Pawley G S On the stability of the travelling salesman problem

algorithm of Hopfield and tank Biological Cyb ernetics Cited on p

Index

Symbols
  k-means clustering

A
  ACE
  activation function
    hard limiting
    Heaviside
    linear
    nonlinear
    semilinear
    sgn
    sigmoid
      derivative of
    threshold
  Adaline
  adaline
    vision
  adaptive critic element
  analogue implementation
  annealing
  ART
  ASE
    activation function
    bias
  associative learning
  associative memory
    instability of stored patterns
    spurious stable states
  associative search element
  asymmetric divergence
  asynchronous update
  auto-associator
    vision

B
  backpropagation
    advanced training algorithms
    conjugate gradient
    derivation of
    discovery of
    gradient descent
    implementation on Connection Machine
    learning by pattern
    learning rate
    local minima
    momentum
    network paralysis
    oscillation in
    understanding
    vision
  bias
  biochips
  bipolar cells
    retina
    silicon retina
  Boltzmann distribution
  Boltzmann machine

C
  Carnegie Mellon University
  CART
  cart-pole
  chemical implementation
  CLUSTER
  clustering
  coarse-grain parallelism
  coding
    lossless
    lossy
  cognitron
  competitive learning
    error function
    frequency sensitive
  conjugate directions
  conjugate gradient
  connectionist models
  Connection Machine
    architecture
    communication
    NEWS grid
    nexus
  connectivity
    constraints
    optical
  convergence
    steepest descent
  cooperative algorithm
  correlation matrix
  counterpropagation
  counterpropagation network

D
  DARPA neural network study
  decoder
  deflation
  delta rule
    generalised
  digital implementation
  dimensionality reduction
  discovery vs creation
  discriminant analysis
  distance measure
  dynamic programming
  dynamics
    in neural networks
    robotics

E
  EEPROM
  eigenvector transformation
  eligibility
  Elman network
  emulation
  energy
    Hopfield network
    travelling salesman problem
  EPROM
  error
    backpropagation
    competitive learning
    learning
    perceptron
    quadratic
    test
  error measure
  excitation
  external input
  eye

F
  face recognition
  feature extraction
  feedforward network
  FET
  fine-grain parallelism
  FORGY
  forward kinematics

G
  Gaussian
  generalised delta rule
  general learning
  gradient descent
  granularity of parallelism

H
  hard limiting activation function
  Heaviside
  Hebb rule
    normalised
  Hessian
  high-level vision
  holographic correlators
  Hopfield network
    as associative memory
    instability of stored patterns
    spurious stable states
    energy
    graded response neurons
    optimisation
    stable limit points
    stable neuron in
    stable pattern in
    stable state in
    stable storage algorithm
    stochastic update
    symmetry
    unlearning
  horizontal cells
    retina
    silicon retina

I
  image compression
    backpropagation
    PCA
    self-organising networks
  implementation
    analogue
    chemical
    connectivity constraints
    digital
    on Connection Machine
    optical
    silicon retina
  indirect learning
  information gathering
  inhibition
  instability of stored patterns
  intermediate-level vision
  inverse kinematics
  Ising spin model
  ISODATA

J
  Jacobian matrix
  Jordan network

K
  Kirchhoff laws
  KISS
  Kohonen network
    dimensional
    for robot control
  Kullback information

L
  leaky learning
  learning
    associative
    general
    indirect
    self-supervised
    specialised
    supervised
    unsupervised
  learning error
  learning rate
    backpropagation
  learning vector quantisation
  LEP
  linear activation function
  linear convergence
  linear discriminant function
  linear networks
    vision
  linear threshold element
  LNeuro
    activation function
    ALU
    learning
    RAM
  local minima
    backpropagation
  lookup table
  lossless coding
  lossy coding
  low-level vision
  LVQ

M
  Markov random field
  MARS
  mean vector
  MIMD
  MIT
  mobile robots
  momentum
  multilayer perceptron

N
  neocognitron
  Nestor
  NETtalk
  network paralysis
    backpropagation
  NeuralWare
  neurocomputers
  nexus
  non-cooperative algorithm
  normalisation
  notation

O
  octree methods
  offset
  Oja learning rule
  optical implementation
  optimisation
  oscillation in backpropagation
  output vs activation of a unit

P
  panther
    hiding
    resting
  parallel distributed processing
  parallelism
    coarse-grain
    fine-grain
  PCA
    image compression
  PDP
  Perceptron
  perceptron
    convergence theorem
    error
    learning rule
    threshold
    vision
  photoreceptor
    retina
    silicon retina
  positive definite
  Principal components
  PROM
  prototype vectors
  PYGMALION

R
  RAM
  recurrent networks
    Elman network
    Jordan network
  reinforcement learning
  relaxation
  representation
  representation vs learning
  resistor
  retina
    bipolar cells
    horizontal cells
    photoreceptor
    retinal ganglion
    structure
    triad synapses
  retinal ganglion
  robotics
    dynamics
    forward kinematics
    inverse kinematics
    trajectory generation
  Rochester Connectionist Simulator
  ROM

S
  self-organisation
  self-organising networks
    image compression
    vision
  self-supervised learning
  semilinear activation function
  sgn function
  sigma-pi unit
  sigma unit
  sigmoid activation function
    derivative of
  silicon retina
    bipolar cells
    horizontal cells
    implementation
    photoreceptor
  SIMD
  simulated annealing
  simulation
    taxonomy
  specialised learning
  spurious stable states
  stable limit points
  stable neuron
  stable pattern
  stable state
  stable storage algorithm
  steepest descent
    convergence
  stochastic update
  summed squared error
  supervised learning
  synchronous update
  systolic
  systolic arrays

T
  target
  temperature
  terminology
  test error
  Thinking Machines Corporation
  threshold
  topologies
  topology-conserving map
  training
  trajectory generation
  transistor
  transputer
  travelling salesman problem
    energy
  triad synapses

U
  understanding backpropagation
  universal approximation theorem
  unsupervised learning
  update of a unit
    asynchronous
    stochastic
    synchronous

V
  vector quantisation
  vision
    high-level
    intermediate-level
    low-level

W
  Warp
  Widrow-Hoff rule
  winner-take-all

X
  XOR problem