The Dynamic Pattern Selection Algorithm: Effective Training and Controlled Generalization of Neural Networks

Axel Roebel

To cite this version:

Axel Roebel. The Dynamic Pattern Selection Algorithm: Effective Training and Controlled Generalization of Backpropagation Neural Networks. [Research Report] Technical University of Berlin, Institut for Applied Computer Science. 1994. hal-02911738

HAL Id: hal-02911738 https://hal.archives-ouvertes.fr/hal-02911738 Submitted on 4 Aug 2020


The Dynamic Pattern Selection Algorithm

Effective Training and Controlled Generalization of Backpropagation Neural Networks

A. Röbel

Technische Universität Berlin
Institut für Angewandte Informatik, FG Informatik in Natur- und Ingenieurwissenschaften

March 1994

Abstract

In the following report the problem of selecting proper training sets for neural network time series prediction or function approximation is addressed. As a result of analyzing the relation between approximation and generalization, a new measure, the generalization factor, is introduced. Using this factor and cross validation, a new algorithm, the dynamic pattern selection, is developed.

Dynamically selecting the training patterns during training establishes the possibility of controlling the generalization properties of the neural net. As a consequence of the proposed selection criterion, the generalization error is limited to the training error. As an additional benefit, the practical problem of selecting a concise training set out of the known data is likewise solved.

By employing two time series prediction tasks, the results for dynamic pattern selection training and for fixed training sets are compared. The favorable properties of the dynamic pattern selection, namely lower computational expense and control of generalization, are demonstrated.

This report describes a revised version of the algorithm introduced in (Röbel).

Contents

Introduction
Approximation, Interpolation and Overfitting in the context of function theory
Choosing the training set
Online Cross Validation
Dynamic selection of training patterns
Experimental results
    Predicting the Henon model
    Predicting the Mackey-Glass model
Discussion
    Data requirements
    Noise
    Comparison with online training
Conclusion
Bibliography

Introduction

Since the formulation of the backpropagation algorithm by Rumelhart, Hinton and Williams, there has been a steadily growing interest in artificial neural networks. Due to some vague analogies between neural networks and biological nervous systems, it has been expected that successful applications of neural networks in fields like Classification, Pattern Recognition, Nonlinear Signal Processing or Control (all areas in which the known technical solutions remain far behind the performance of the biological systems) will be possible in the near future. Concerning the theoretical investigation of neural networks, there exist encouraging results supporting these expectations.

However, the experience concerning the practical generalization properties of neural networks has demonstrated that the widely used backpropagation algorithm does not always achieve the desired generalization precision. This is not surprising because, as a detailed analysis shows, the two tasks which should be solved during training, to represent and to generalize the training examples, are not well determined (Poggio and Girosi). Mathematically speaking, the conditions for good approximation and good interpolation are only partly related. As a matter of fact, the backpropagation algorithm only considers approximation errors, and an improved approximation will in general not be accompanied by a better interpolation. Therefore the interpolation obtained is strongly influenced by the random starting conditions of the optimization, and long training times often result in high quality approximations but insufficient interpolations. This widely known effect is often called overfitting.

There are two different strategies to prevent neural networks from overfitting. The first one is especially useful if the available data set is small. It is based on a heuristic argument which states that the simplest model will in general achieve the best generalization or interpolation. Following this argument, one may try to choose as simple a net structure as possible or state additional constraints on the weights to limit network complexity (Weigend et al.; Ji et al.). The latter are normally called Constraint Nets. However, due to the general foundation of this method, only weak heuristic arguments concerning the interpolation properties find their way into the optimization procedure. An adaptation to the special problem under investigation is only obtainable with great additional effort, and consequently better interpolation is restricted to special, well behaved problems. Moreover, the additional optimization expense leads to considerably increased training times.

If there are enough training samples, one may follow another strategy, which relies upon the chosen training set. If the training data is selected carefully, it will contain enough information to ensure that the optimal approximating network will have good interpolation properties too. Up to now it is a well known practice to achieve this by selecting very large training sets, which in general contain a lot of redundancy.

Following the latter strategy, a new method has been developed to ensure valid generalization. This algorithm, the dynamic pattern selection, is based on the batch training variant of the backpropagation algorithm and has been proven to be useful in applications with very large data sets. The training data is selected during the training phase, employing cross validation to revise the actual training set. The error of the net function is used to choose the pattern which should be added to the steadily growing training set (Röbel). The overhead for the selection procedure is small. Due to the initially small training set size, the dynamic pattern selection algorithm leads to more effective training than the standard algorithm. Practical experiments have shown that it outperforms current online training variants even in the case of very big and highly redundant data sets.

Plutowski and White have developed a similar algorithm which they call active selection of training sets. Their algorithm focuses mainly on reducing training set size without considering generalization effects, and it does not employ cross validation to continually assess the generalization obtained by the training set in use. In contrast to their algorithm, the dynamic pattern selection proposed here validates the training set by continually monitoring the generalization properties of the net.

In the following section the relations between approximation, interpolation and overfitting will be discussed against a background of function theory. Subsequently, the known heuristics concerning the number and distribution of training patterns are summarized, the current methods to choose the training sets for neural nets are described, and the dynamic pattern selection is established. Thereafter, two examples from the field of nonlinear signal processing are investigated to demonstrate the properties of the new algorithm. In the last section there will be a short discussion concerning data requirements, noise, and a comparison to online training methods.

The following explanations are based on the well known backpropagation algorithm as introduced by Rumelhart, Hinton and Williams. Descriptions of this algorithm are widespread in the literature and will not be repeated here.

Approximation, Interpolation and Overfitting in the context of function theory

There have been many publications proving that, under weak assumptions, simple feedforward networks with a finite number of neurons in one hidden layer are able to approximate arbitrarily closely all continuous mappings R^n -> R^m (Hecht-Nielsen; White). Concerning practical applications, however, these results are obviously of limited use, because they are not able to establish the required network complexity to achieve a certain approximation fidelity.

The conditions under which the stepwise improved approximation achieved with the backpropagation algorithm is accompanied by a decreasing interpolation error are not precisely known. To understand the basic relations, it is useful to analyze the optimization procedure against the background of function theory. The target function x -> y = f_t(x) is assumed to be smooth, that is, f_t is a member of C^∞, the set of functions with continuous derivatives of every order, and the domain X of f_t is assumed to be a compact manifold. In general there is only a limited set of members x ∈ X available for which the targets y = f_t(x) are known.

[Figure 1: A possible relation between the set of all representable net functions F_n, the sets of all approximating functions F_a(ε), and the sets of all interpolating functions F_i(ε) out of C^∞.]

All known pairs (x, y) form the set of available data

    D_a = { (x_0, y_0), (x_1, y_1), (x_2, y_2), ... }.

Given D_a, a neural network, and a real number ε ≥ 0, we may distinguish between three subsets of C^∞. First, there is the set of approximating functions F_a(ε), which is the set of functions f approximating the members of D_a to the given precision ε:

    sup_{x ∈ D_a} ||f_t(x) - f(x)|| ≤ ε.

Second, there is the set of interpolating functions F_i(ε) with distance

    sup_{x ∈ X} ||f_t(x) - f(x)|| ≤ ε

to f_t. The third set is the set of functions f_n representable by the neural net and is denoted as F_n. While the sets F_a(ε) depend on the set of available data D_a, the sets F_i(ε) are completely defined by the target function f_t. Using these terms, the target function may be specified as the single member of the set of interpolating functions F_i(0).

Figure 1 shows a possible relation between the three function sets defined above. Depicted is a special setting in that the target function f_t is a member of F_n, such that f_t might be represented by f_n without any error. The following statements do not rely on this and therefore remain valid in general.

Note that the backpropagation algorithm is generally used with a squared error function to measure the approximation quality. To compare approximation results achieved with different training sets, this measure has to be normalized using the number of elements contained in each training set. Employing this normalized squared error as a measure of distance in the two conditions above would result in more complicated relations between F_a(ε) and F_i(ε). Even then, however, the following statements remain valid in explaining the principal properties of the backpropagation algorithm.
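As an aside, a minimal sketch of this normalization (an illustration added here, not part of the report): the summed squared error is divided by the number of patterns, so that error values measured on training sets of different sizes become comparable.

```python
import numpy as np

def normalized_squared_error(targets, outputs):
    # Squared error normalized by the number of training patterns.
    targets, outputs = np.asarray(targets, float), np.asarray(outputs, float)
    return np.sum((targets - outputs) ** 2) / len(targets)
```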

With respect to the relation ⊆ it is possible to establish the ordering F_a(ε_i) ⊆ F_a(ε_j) on the set of all F_a(ε), with ε_i ≤ ε_j. A corresponding relation exists for the set of all F_i(ε). As mentioned above, the smallest set F_i(0) consists of one element only, in contrast to F_a(0), which may have infinite cardinality. Note that the set F_i(ε) is always a subset of F_a(ε). Concerning F_n and F_a(ε), however, it is impossible to find such a simple general relation. Presuming the network weights are bounded, a valid assumption for practical applications, there exists an ε_u such that F_n ⊆ F_a(ε) for every ε ≥ ε_u. On the other hand, there exists an ε_l such that F_n ∩ F_a(ε) is empty for all ε < ε_l. In the case of figure 1, for example, one finds ε_l = 0.

The objective of the backpropagation algorithm is to choose a network function f_n which is in F_a(ε_l).

The area in figure 1 marked by the dotted line gives the intersection F_an(ε_l) = F_a(ε_l) ∩ F_n, containing all solutions obtainable by gradient descent. It is not possible to ensure that gradient descent optimization will reach F_an(ε_l). Due to the specific error function, the constraints on the initial conditions and the selected training set, there might be no descending connection from the initial function to F_an(ε_l).

As a matter of fact, the generalization of the optimum set F_an(ε_l) is biased through the data contained in D_a. There exist data sets D_a for which the relation between F_a(ε) and F_i(ε), combined with the gradient descent procedure, will result in poor generalization behavior.

For many applications, especially for function approximation, it is sensible to demand that the generalization error be equal to or lower than the training error. Formally, f_n ∈ F_a(ε) should imply f_n ∈ F_i(ε). To be able to rank the generalization properties of f_n, it is sensible to define the generalization factor

    γ(f_n) = ε_i(f_n) / ε_a(f_n),

where ε_a(f_n) is the minimal ε such that f_n ∈ F_a(ε) and ε_i(f_n) is the minimal ε such that f_n ∈ F_i(ε). The generalization factor indicates the error made in optimizing on D_a instead of X. As a result we conclude that

    γ(f_n) ≤ 1

is a sensible condition for valid generalization.
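As an illustration, the following sketch estimates ε_a, ε_i and the resulting generalization factor for a toy one-dimensional fit. The choices are illustrative assumptions: the supremum over the domain X is approximated by a dense grid, and an ordinary polynomial fit stands in for the net function f_n.

```python
import numpy as np

# Illustrative only: estimate gamma = eps_i / eps_a for a toy 1-D problem.
def generalization_factor(f_target, f_net, x_train, x_domain):
    eps_a = np.max(np.abs(f_target(x_train) - f_net(x_train)))    # minimal eps with f_n in F_a(eps)
    eps_i = np.max(np.abs(f_target(x_domain) - f_net(x_domain)))  # minimal eps with f_n in F_i(eps)
    return eps_i / eps_a

rng = np.random.default_rng(0)
f_t = np.sin                                    # toy target function
x_train = rng.uniform(0.0, 2 * np.pi, 10)       # the available supporting points
x_domain = np.linspace(0.0, 2 * np.pi, 2000)    # dense stand-in for the compact domain X

coeffs = np.polyfit(x_train, f_t(x_train), deg=5)   # crude stand-in for the net function f_n
f_n = lambda x: np.polyval(coeffs, x)

print("generalization factor:", generalization_factor(f_t, f_n, x_train, x_domain))
```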

Choosing the training set

As a result of the previous section, it should be clear that the set D_a strongly influences the generalization properties of the solutions obtained by gradient descent. Clearly, a further selection of training data out of D_a will in general lead to an even worse situation: the decreasing number of supporting points leads to increasing sets F_a(ε) containing less information about the interesting sets F_i(ε).

Consequently, one might think it would be best to use all available data for training purposes. For many applications, however, this would be awkward due to the immense data sets available. In speech recognition or signal processing, D_a often contains many thousands of samples and a large amount of redundancy. Training on all of the samples can result in unnecessarily expensive computation. Moreover, the redundancy might not be equally distributed over the input space, thereby preventing optimal generalization properties. As a consequence, the question arises of how to select the proper training set to achieve optimal approximation and interpolation results.

Although a sufficiently dense distribution of training patterns on f_t is an important condition for successful training, there exist only some vague statements concerning this issue. Surely the suitable number of training patterns depends on the chosen network structure, the problem, and the required precision. The latter relation, often unjustifiably neglected, evidently stems from the fact that F_a has to contain more information about F_i to achieve an interpolation with higher precision.

A first hint towards the necessary number of training samples can be obtained by analyzing the number of free parameters of f_n, given by the number of network weights. Consequently, one of the first suppositions concerning the suitable number of training patterns, stated as a rule of thumb for simple linear networks by Widrow, demands that the number of training patterns should be ten times the number of free parameters. This rule has been utilized for nonlinear neural nets as well (Morgan and Boulard). Although there exists a unique solution for considerably fewer supporting points, the surplus of information will result in a lower generalization error.

The necessary number of training samples heavily depends on their distribution over the input set D_a. It is common to choose the training samples randomly out of D_a. This is thought to reproduce the density of the underlying distribution. If there is no further information available this may be sensible, but because the properties of f_t (maxima, minima, curvature) and of F_n are not involved, this will in general result in suboptimal training sets. One of the main advantages of random selection is its easy implementation. Moreover, the generalization properties of neural networks trained on randomly selected training sets may be investigated theoretically. In analyzing certain classes of networks and randomly selected training sets, Baum and Haussler, for example, obtained some coarse estimates of the relation between training set size and achievable generalization precision.

In practical experiments, considerably fewer training patterns than stated in the above mentioned investigations suffice to give good generalization (Morgan and Boulard). Aside from random selection, there are other approaches to obtain proper training sets by carefully choosing the training data out of the domain X. One might try, for example, to choose the distance between adjacent training patterns to be almost constant. This, however, depends on a meaningful method to measure distances in X. In signal processing applications it is sensible to choose the Euclidean distance. Some experimental results obtained by the author show that such an equal distance distribution of training samples leads to considerably better training and generalization properties than the random distribution. For other applications, however, it might be difficult to find a meaningful distance measure.

A more adept approach to selecting a suitable training set is to adapt the training set by dynamically selecting training patterns while training proceeds. Atlas, Cohn and Ladner have proposed an algorithm which, by investigation of the network state, decides which patterns are to be added to the training set. Although their algorithm shows better results than random selection of training sets, it requires expensive computations and is therefore difficult to use in practical applications.

As previously mentioned, another dynamic approach was established by Plutowski, Cottrell and White. They train with a specific training set until the error stalls and then search D_a for the element which possesses a gradient vector most similar to the average gradient of the entire set. This element is chosen to enlarge the training set. To prevent overfitting of the initially small training set, the initial network is rather small, with further hidden units added if the capabilities of the actual network to fit the growing training set are exhausted. This strategy leads to very small training sets. For high precision approximation, however, the selection of the proper training set is computationally very expensive, as the neural net is trained to the desired precision for all intermediate training sets.

Online Cross Validation

The dynamic selection of training patterns proposed in the following section uses a well known tool from the field of estimation theory called cross validation. Cross validation, described in detail by Stone, is often used to revise statistical models by applying them to test sets. In the field of neural computation, cross validation has been used to verify parameter settings (Finnoff et al.), the network structure, or the generalization properties. The last point is of great interest here and therefore will be explained further. As Hecht-Nielsen has proposed, the provided data set has to be divided into a training and a validation set, the latter not being used for training purposes. Applying the network function to the validation set, it is possible to estimate the generalization error of the net. This can already be done during the training phase. In the beginning of the optimization, the estimated generalization error will generally decrease with the training error. After some time the generalization error will reach a minimum and start to increase while the training error decreases further. This is interpreted as the beginning of overfitting, and Hecht-Nielsen suggests stopping training at this point.

Cross validation is a very flexible tool. Unfortunately, however, there exist two contrary demands concerning the necessary division of the data into training and validation sets. On the one hand, one would like to choose the test set as large as possible to achieve valid estimates of the generalization properties; on the other hand, all data contained in the test set cannot be used for training, and its information is lost to the training process. This is especially a problem in situations where the available amount of training data is very limited.
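A minimal sketch of this early-stopping use of cross validation follows; the network object and its train_one_epoch() and error() methods are placeholders, not an interface defined in the report. The data is split once into a training and a validation set, the validation error is monitored after every epoch, and the weights with the smallest validation error are kept.

```python
import copy
import numpy as np

def train_with_validation(net, data, n_epochs, val_fraction=0.2, seed=0):
    # Split the available data into a training and a validation set.
    idx = np.random.default_rng(seed).permutation(len(data))
    n_val = int(val_fraction * len(data))
    val_set = [data[i] for i in idx[:n_val]]
    train_set = [data[i] for i in idx[n_val:]]

    best_net, best_val_error = copy.deepcopy(net), float("inf")
    for _ in range(n_epochs):
        net.train_one_epoch(train_set)      # one backpropagation epoch (placeholder method)
        val_error = net.error(val_set)      # estimated generalization error (placeholder method)
        if val_error < best_val_error:      # keep the net with the smallest validation error
            best_val_error = val_error
            best_net = copy.deepcopy(net)
    return best_net, best_val_error
```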

Dynamic selection of training patterns

In the preceding discussion the principal relations between approximation and generalization have been clarified. The two results essential for the understanding of the proposed dynamic pattern selection are, in short form:

1. The number and the distribution of training patterns have an important influence on the resulting generalization properties of a neural network, but there exist only incomplete findings concerning practical solutions for selecting the training data.

2. Online cross validation is a useful tool to monitor the generalization properties of the network and can be applied during the training phase.

If one is willing to select the training patterns dynamically, two basic questions arise. Starting with an empty training set, the first question becomes: which pattern should be chosen? There are several possible answers. The easiest, excluding random selection, is to select the pattern which has the highest error contribution. Compared to other possibilities, for example the sophisticated ISB criterion proposed by Plutowski and White, the maximum error criterion is very easy to compute and has the advantage of being directly coupled to the generalization factor defined above. Therefore this criterion is used for the dynamic pattern selection algorithm (Röbel).

The second question, at which time the next pattern should be selected, turns out to be more tricky. There are two objectives: first, as stated by the condition derived above, the generalization factor ought to be less than one, and second, to prevent overfitting, the selection of new data should take place as early as necessary. The first objective may be achieved by estimating the generalization factor and inserting a new training pattern whenever it grows beyond one.

A number of experiments have shown that this straightforward strategy results in reasonable training sets, which in many cases lead to better results than comparably sized fixed training sets. However, when employing this criterion alone, the generalization factor tends to oscillate between one and a value considerably below.[1] Each selected pattern causes the generalization factor to decrease and reach a minimum. Then it slowly increases again, and due to the long time it takes the generalization factor to reach one, the selection of proper training sets for high precision training takes a long time. It would obviously be better to catch the generalization factor at its minimum and select a new training pattern just when it starts to increase. Following this, we have found the second objective to aim for: keep the generalization factor at its minimum.

[1] This is especially true for training to very small errors.

There is one problem left, which is to obtain useful estimates of the generalization factor and its tendency without extensive computational effort. We do not want to compute γ(f_n) for the whole data set D_a, but instead estimate the generalization factor by comparing the error function on the selected training set and a validation set. Following the method of cross validation described in the previous section, the available data D_a is divided into the subsets D_T and D_V. D_T contains all possible training patterns and is referred to as the training store; D_V, the validation store, contains all possible validation patterns. The actual training set D_t ⊆ D_T and the validation set D_v ⊆ D_V are selected from the respective stores.

The estimate of the generalization factor is obtained by selecting a random validation set D_v and computing

    γ_v = E(D_v) / E(D_t),

with E denoting the error function of the backpropagation optimization. To achieve comparable statistical properties of E(D_v) and E(D_t), one chooses |D_v| = |D_t|.[2] Having computed the generalization factor estimate, it now remains to use this value to estimate the tendency of the generalization factor. We compute the average ā and the standard deviation σ of the generalization factors from a fixed number M of preceding epochs.[3] To catch the increase of the generalization factor as early as possible, we choose the threshold for the generalization factor to be

    θ(n) = min( θ(n-1), ā(n) + σ(n) )

and select a new training pattern whenever

    γ_v(n) ≥ θ(n).

Here the argument n reflects the number of training epochs computed so far. Note that θ is monotonically decreasing. As the optimal threshold might increase, it is appropriate to allow a small increase of θ after the selection of a new training pattern. Therefore the threshold is re-initialized after each selection to

    θ(n) = min( 1, γ_v(n) )

and is fixed at this level for the number of epochs M used to calculate the statistical properties ā and σ. Note that, due to the selection criterion, in the case of a selection γ_v(n) is always above θ(n).

[2] |A| here denotes the cardinality of the set A.
[3] In the following experiments a fixed value of M is used.
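The selection-timing rule can be summarized in a few lines. The sketch below uses hypothetical names and an assumed value for M, and it simply freezes the threshold for M epochs after each selection, as described above.

```python
import numpy as np

class SelectionTimer:
    """Decides, once per epoch, whether a new training pattern should be selected."""

    def __init__(self, M=10):        # M: number of preceding epochs for the statistics (assumed value)
        self.M = M
        self.history = []            # recent generalization factor estimates gamma_v
        self.theta = 1.0             # initial threshold
        self.frozen = 0              # epochs for which theta stays fixed after a selection

    def step(self, error_train, error_val):
        gamma = error_val / error_train                   # gamma_v(n) = E(D_v) / E(D_t)
        self.history = (self.history + [gamma])[-self.M:]

        if self.frozen > 0:
            self.frozen -= 1
        elif len(self.history) == self.M:                 # theta(n) = min(theta(n-1), mean + std)
            self.theta = min(self.theta, float(np.mean(self.history) + np.std(self.history)))

        select = gamma >= self.theta                      # selection criterion gamma_v(n) >= theta(n)
        if select:                                        # re-initialize: theta(n) = min(1, gamma_v(n))
            self.theta = min(1.0, gamma)
            self.frozen = self.M
        return select
```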

Regarding the discussion of online cross validation above, there is a sensible extension of the algorithm. By calculating E(D_V) we are able to obtain a good, less fluctuating estimate of the generalization error of the actual net. This estimate may then be used to select the net which achieves the best generalization properties during training. Moreover, a further investigation of the relation between generalization error and training set size |D_t| helps to gain further insight into the reasons for bad training results.

At first, the growth of D_t will be accompanied by a decrease of the generalization error estimate. If f_t is not contained in F_n, or due to the actual state of the network is not achievable by gradient descent, there will be a certain minimal generalization error. After this limit has been reached, the further decreasing training error obtained by the backpropagation algorithm results in increasing generalization errors. The dynamic pattern selection algorithm prevents overfitting by frequently inserting new training patterns into D_t. As a result, the generalization error will fluctuate around the minimum value, with D_t slowly increasing. When this situation arises, the training process could be stopped, as the generalization will not improve further.

While this effect is due to the limited net complexity and may easily be prevented by choosing a larger network, there exists another situation, which results in a fast growth of D_t up to D_T accompanied by an increasing generalization error. This rapid growth is due to missing information in D_T compared to D_V. Therefore we are able to decide which generalization precision might be achieved with the available data D_a by monitoring the rate of selection and the generalization error.

Whenever the set of available data D_a is rather small, the partitioning of the data into training and validation sets can impede successful training. This is the case if the information contained in each of these sets is incomplete. For such situations there exists a modified version of the dynamic pattern selection algorithm. The modification consists in choosing both sets D_T and D_V to completely cover D_a. As a consequence, the assertions obtained by the validation tests are less reliable. However, the experiments which are partly presented in the following section showed that the selected training sets remain quite reasonable as long as they remain small. If D_t grows to more than half of D_a, one will in general not expect to obtain sound generalization estimates.

In the following, the dynamic pattern selection algorithm will be described more formally using a very general application example. The task to be learned consists in learning a target function f_t which is given, in accordance with the definition of D_a above, by a set of examples only. Formally, the error function E(D_a) has to be minimized, and the net function f_n ought to generalize to f_t outside the given examples.

The algorithm

Initialization:
The neural net weights are initialized with small random numbers, as is usual with the backpropagation algorithm. This initialization establishes a random net function f_n, which is used to select the member d_i = (x_i, y_i) out of D_T that shows the maximal error with respect to the error function E. The training set D_t is now set up containing just this maximal error element. The threshold θ of the generalization factor is set to one. The minimal generalization error E(D_V)_min is initialized using the random start error on D_V, which is E(D_V).

Training:
After each training epoch[4] one selects a random validation set D_v ⊆ D_V holding as many elements as D_t. Whenever

    E(D_v) ≥ θ(n) E(D_t),

D_t will be enlarged by adding d_i ∈ D_T \ D_t, the element from D_T which contributes most to E(D_T) and is not already a member of D_t. At each training epoch the threshold θ has to be updated as described by the threshold equations above. In the case that E(D_V) is smaller than E(D_V)_min, the actual net will be stored and E(D_V)_min will be updated. After checking the respective stopping criteria, training continues.

Break:
If the generalization error stalls although D_t is growing, or if any other stopping criterion matches, training is finished and the optimal net stored so far represents the result of the optimization.

[4] As long as the changes of the net function f_n remain small, it is possible to train several epochs without correcting the training set. Otherwise the adaptation of the training set stays too far behind the actual network state, and training and generalization error diverge.
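Putting the pieces together, a minimal sketch of the whole training loop might look as follows. The network object with its train_one_epoch(), error() and errors() methods is a placeholder, and SelectionTimer refers to the threshold logic sketched in the previous section; none of these names come from the report.

```python
import copy
import numpy as np

def dynamic_pattern_selection(net, D_T, D_V, n_epochs, timer, seed=0):
    rng = np.random.default_rng(seed)

    # Initialization: start with the single pattern of maximal error.
    errors_T = np.asarray(net.errors(D_T), dtype=float)   # per-pattern errors on the training store
    D_t = [int(np.argmax(errors_T))]                       # indices into D_T forming the training set
    best_net, best_E_V = copy.deepcopy(net), net.error(D_V)

    for _ in range(n_epochs):
        net.train_one_epoch([D_T[i] for i in D_t])         # one batch backpropagation epoch on D_t

        # Random validation set D_v with as many elements as D_t.
        D_v = [D_V[i] for i in rng.choice(len(D_V), size=min(len(D_t), len(D_V)), replace=False)]
        E_t = net.error([D_T[i] for i in D_t])
        E_v = net.error(D_v)

        # Enlarge D_t by the worst not-yet-selected pattern whenever the criterion fires.
        if timer.step(E_t, E_v) and len(D_t) < len(D_T):
            errors_T = np.asarray(net.errors(D_T), dtype=float)
            errors_T[D_t] = -np.inf                        # exclude patterns already in D_t
            D_t.append(int(np.argmax(errors_T)))

        # Cross validate on the full validation store and keep the best net.
        E_V = net.error(D_V)
        if E_V < best_E_V:
            best_E_V, best_net = E_V, copy.deepcopy(net)

    return best_net, D_t
```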

The procedure just described has several advantages:

- The training patterns will be inserted whenever and wherever the information contained in D_t fails to yield a regular convergence of the net function f_n to the target function f_t. The algorithm can be interpreted as a dynamic adaptation of F_a to F_i, whereby especially the critical regions are formed.

- The redundancy contained in D_t remains small, and the number of training patterns is related to the reached training and generalization errors.

- The interpolation is controlled using the full information contained in D_a without using all the data for training. Overfitting of the training patterns is suppressed.

- Analyzing the relation between |D_t| and |D_T| may give some hints about the redundancy contained in D_T and, moreover, about the validity of the achieved generalization.

- The slowly growing training set leads to a slowly increasing complexity of the trained task, which, as was shown by Jacobs, will often result in better training results.

Several additional remarks concerning the proposed algorithm are in order:

- Between two enlargements of D_t there should be at least one training cycle, to ensure that the additional information could have had some effect on the net function. This is ensured by the threshold re-initialization described above, which takes place after each training data selection. Because the data points in the neighbourhood of the latest added training pattern often show comparable errors, one might otherwise select a number of patterns in the same critical place. This would lead to groups of redundant patterns and thereby prevent a regular convergence.

- Because the estimation of the generalization error is a computationally expensive task for large sets D_V, and because the changes of the generalization error are usually very slow, especially at the end of the optimization procedure, one may choose to evaluate this estimate only every few training epochs. This will generally have a negligible effect on the achieved results.

- As the generalization is controlled by adapting the training set, this estimation of the generalization error can even be omitted altogether. The experiments have shown that the difference between the optimal and the final state of the optimization procedure is fairly small.

- The proposed algorithm results in a monotonically increasing training set D_t. There have been some experiments with decreasing training sets, removing the best pattern from D_t. In general this results in poorer performance. Even if the training process has adapted f_n to f_t in the neighbourhood of the selected patterns, it seems that they retain their importance in ensuring a regular interpolation.

Experimental results

In the following section the favorable properties of the dynamic pattern selection are demonstrated. The learning algorithm used here is an accelerated batch mode backpropagation algorithm with dynamically adapted learning rate and momentum (Salomon; Röbel). All neural nets are simple feedforward structures with sigmoidal activation functions in the hidden units and linear output units.

The tasks to solve stem from the field of nonlinear signal processing. Precisely speaking, two examples of continuous nonlinear system functions shall be represented by the neural nets. After successful training, the net function may be used as a predictor for the future behavior of the dynamical system. The underlying theory is beyond the scope of this article; a suitable introduction has been given by Lapedes and Farber (a; b).

Predicting the Henon model

In the first experiment the network is trained to predict the chaotic dynamics of the Henon model (Grassberger and Procaccia a). This two-dimensional model is given by the difference equations

    x_{n+1} = y_n + 1 - a x_n^2
    y_{n+1} = b x_n.

Choosing suitable values for the parameters a and b results in a solution showing chaotic behavior. A simple transformation of these equations gives the prediction function

    x_{n+2} = b x_n + 1 - a x_{n+1}^2,

which is to be approximated by the neural net.

The available data set

    D_a = { (x_n, x_{n+1}, x_{n+2}), (x_{n+1}, x_{n+2}, x_{n+3}), ... }

consists of vectors built from the solution of the difference equations for a fixed initialization (x_0, y_0); the first two components of each vector serve as inputs, the last as target. An additional independent validation set

    D_u = { (x_{n+333}, x_{n+334}, x_{n+335}), (x_{n+334}, x_{n+335}, x_{n+336}), ... }

has been built from the following vectors of the solution.
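For concreteness, the data generation can be sketched as follows; the parameter values a = 1.4, b = 0.3 and the initial condition are the standard chaotic choice and are assumptions, not values quoted from this report, as are the split sizes.

```python
import numpy as np

def henon_series(n_steps, a=1.4, b=0.3, x0=0.1, y0=0.1):
    """Iterate x_{n+1} = y_n + 1 - a x_n^2, y_{n+1} = b x_n."""
    x = np.empty(n_steps)
    xn, yn = x0, y0
    for n in range(n_steps):
        x[n] = xn
        xn, yn = yn + 1.0 - a * xn * xn, b * xn
    return x

def prediction_pairs(x):
    """Inputs (x_n, x_{n+1}) and targets x_{n+2} for the prediction task."""
    return np.stack([x[:-2], x[1:-1]], axis=1), x[2:]

x = henon_series(500)
D_a_inputs, D_a_targets = prediction_pairs(x[:335])   # available data (split size assumed)
D_u_inputs, D_u_targets = prediction_pairs(x[333:])   # independent validation set from the following vectors
```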

The neural net architecture used for this task is chosen to have an input layer with two units, one hidden layer with seven units, and an output layer with one unit. The initial weights are randomly chosen out of a small interval. It turns out that the prediction function is easily approximated by this fairly small neural net, and therefore this task is well suited to compare the training results of the dynamic pattern selection with the results obtained with fixed training sets of varying sizes. Moreover, by virtue of the low dimensionality of the problem, it is possible to get a visual impression of the distribution of the selected training samples on f_t.

The fixed training sets are subsets of D_a and are chosen to approximate, with varying density, a uniform covering of D_a. While in the case of fixed training sets each training run consists of a fixed number of epochs, the dynamic pattern selection runs were stopped after a smaller number of epochs; this results in similar average generalization errors. To obtain statistically valid assertions in comparing the results, a number of different initial weight settings have been used with each training algorithm.

The average training and generalization results are presented in table 1. The error bars are proportional to the variance of the results. In the upper part of this table the results using the different sized fixed training sets are shown; in the lower part one finds the results using both variants of the dynamic pattern selection. Here the first line represents the dynamic pattern selection with an independent validation set; for the training and validation repertoires the relations D_T = D_a and D_V = D_u are chosen. The second line shows the results for the modified version without a special validation set; in this case the data sets are chosen following the relation D_T = D_V = D_a. The number of training patterns in the fixed training sets and the average number of selected training patterns at the end of the training are listed in column one.

[Table 1: Predicting the Henon model. Comparison of the average training and generalization results using a neural net and different training sets. Columns: training set size, training error E(D_t), generalization error E(D_u), generalization factor γ_u, and number of forward/backward propagations; the upper rows list the fixed training sets, the last two rows the two variants of the dynamic pattern selection.]

Columns two and three contain the training and generalization results, measured by means of the normalized prediction error

    E(D) = sqrt( Σ_{x ∈ D} (f_t(x) - f_n(x))^2 / (|D| σ_D^2) ),

where σ_D represents the standard deviation of the target function measured on the data set D.
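In code, the normalized prediction error amounts to the root mean squared prediction error divided by the standard deviation of the targets on the same data set, as in the following sketch.

```python
import numpy as np

def normalized_prediction_error(targets, predictions):
    """E(D) = sqrt( sum (f_t(x) - f_n(x))^2 / (|D| * sigma_D^2) )."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rmse = np.sqrt(np.mean((targets - predictions) ** 2))
    return rmse / np.std(targets)
```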

All the errors contained in table 1 have been measured using the optimal weight set, which was determined during the training phase by cross validating the generalization error on D_V. Following the definition of the generalization factor, it has been estimated as

    γ_u = E(D_u) / E(D_t).

Column five shows the average computational expense needed to achieve the generalization results for the respective training set. It is estimated by the number of propagation passes through the net; here the simple approximation of equal computational effort for forward and backward propagations is made. The acceleration algorithm needs four additional forward propagations for each training pattern and each training epoch, so each training cycle consists of five forward and one backward propagation for each selected training pattern. For the dynamic selection there is an additional overhead of one forward propagation for each pattern in the randomly chosen validation set D_v. Moreover, each pattern selection results in one forward propagation for each member of D_T, in order to select the training pattern with the maximum error contribution.

The smallest of the fixed training sets contain just five and fifteen training patterns, and the net function f_n is obviously underdetermined. The examples are learned quickly and with high precision, but the generalization error minimum stalls at a very early state. All the other fixed training sets lead to comparable precisions. A remarkable result is the flat minimum of training and generalization error obtained with a training set of intermediate size. The increasing error for larger training sets is a result of poorly distributed training patterns: the full set D_T with all training patterns obviously does not provide an equally spaced distribution of the patterns, and the more patterns that have to be chosen out of this set, the harder it is to select a training set with a proper distribution in the input space.

Using dynamic pattern selection, fewer training epochs are sufficient to reach the same average generalization error as with the best fixed training set. Compared to the fixed training set with the minimal generalization error, the dynamic training sets contain a higher number of patterns. This is to ensure that the generalization factor stays below one, which is not achieved by any of the fixed training sets.

In comparing both variants of the dynamic selection algorithm, one finds that the training sets selected by the modified algorithm are significantly smaller. This is due to the less severe validation without an additional validation set. In the case of the Henon model the loss in model reliability does not affect the generalization properties, and therefore the training results are even improved. In general this is not the case; an example of different behavior will be found in the next experiment.

As mentioned earlier, all experiments are done with a fixed number of training cycles. Due to the varying training sets, this leads to varying computational costs for the different runs. For the fixed training sets the total number of forward and backward propagations increases with the set size from the order of thousands up to the order of millions. The total computational cost for the dynamic pattern selection is comparable to that of a medium sized fixed training set; however, compared to this fixed training set size, the dynamic pattern selection achieves considerably better generalization results. The smaller cost of the dynamic selection algorithm is a result of the initially small training sets, which compensate for the larger training sets in the final training phase.

In figure 2 the dependency between the generalization factor γ_u and the number of training epochs is shown. As can be seen, the number of training patterns considerably affects the generalization properties. As the fixed training sets are enlarged, the reliability of the generalization is increased, but with a higher number of training epochs and improved training error this reliability decreases. For very long training or small training sets, overfitting takes place and the generalization properties tend to become random.

The evolution of the generalization factor for the fixed training sets shows that there is a randomly varying training epoch beyond which the information contained in the training set does not ensure the desired generalization quality. This limit is moved towards smaller training errors if the training set is enlarged. Regarding the dynamic selection, it is shown that the generalization factor is controlled to stay well below one.[5]

[5] The initially higher variance of the generalization factor is a consequence of the learning rate adaptation. The learning rate is fairly high in the first few hundred epochs. This leads to a quickly varying generalization error, which nevertheless is controlled by inserting new training patterns.

[Figure 2: Generalization factor γ_u as a function of the number of training epochs for learning to predict the Henon model. The panels show three fixed training sets of different sizes and the dynamic selection.]

As was already mentioned, besides the size of the training set, the distribution of the training patterns has a considerable effect on the generalization results. A typical distribution obtained by the dynamic pattern selection is depicted in figure 3. The critical regions, such as maxima, minima, margins and parts of high curvature, are occupied. The distribution is not uniform but reflects the error distribution of the net function. This indicates the advantage of the dynamic selection in contrast to the fixed training sets, which must be selected based on general heuristics and without any knowledge of the intermediate network state.

[Figure 3: A typical distribution of training patterns on the prediction function of the Henon model, plotted over the inputs (x_n, x_{n+1}) and the output x_{n+2}. Shown are the available data and the training patterns selected by the dynamic pattern selection at the end of training.]

Predicting the Mackey-Glass model

After having shown the basic properties of the dynamic pattern selection by means of a fairly easy problem, the more complicated task of predicting the chaotic behavior of the Mackey-Glass model is investigated. Due to the time delay τ, the Mackey-Glass model

    dx(t)/dt = a x(t - τ) / (1 + x^10(t - τ)) - b x(t)

has infinite degrees of freedom. The stationary motion is, however, governed by a low-dimensional attractor (Farmer; Grassberger and Procaccia b). This is the reason why the Mackey-Glass model is often used as an emulation of real world systems, and predicting this model has been widely established as a kind of benchmark for testing predictors (Farmer and Sidorowich; Lapedes and Farber b; Crowder). As Lapedes and Farber (b) have shown, the prediction of the Mackey-Glass model with the chosen parameter set might be attained by using a neural net with six input units, two hidden layers with ten units each, and a linear output unit. This setting will be used here as a test for the dynamic pattern selection too.

Solving the differential equation has been done using a second order Runge-Kutta method, where the first steps of the solution were skipped to reach the steady state. The data set D_a is built from the following steps of the solution, with vectors v_i = (x_i, x_{i+6}, x_{i+12}, x_{i+18}, x_{i+24}, x_{i+30}, x_{i+36}). Following Lapedes and Farber, the vectors are constructed with a time delay of six steps, and the fixed training set contains vectors out of D_a which have been selected with a nearly uniform distribution in input space. The independent generalization test set contains vectors chosen randomly out of the following steps of the solution.
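The data generation can be sketched along the following lines; the parameter values (a = 0.2, b = 0.1, τ = 17), the step size, the length of the skipped transient and the constant initial history are common choices and are assumptions, not values quoted from this report.

```python
import numpy as np

def mackey_glass(n_steps, a=0.2, b=0.1, tau=17.0, dt=1.0, skip=1000):
    """Integrate dx/dt = a x(t-tau) / (1 + x(t-tau)^10) - b x(t) with a 2nd order Runge-Kutta scheme."""
    def deriv(x_t, x_delayed):
        return a * x_delayed / (1.0 + x_delayed ** 10) - b * x_t

    lag = int(round(tau / dt))
    x = np.full(skip + n_steps + lag, 1.2)          # constant history as initial condition (assumed)
    for n in range(lag, len(x) - 1):
        k1 = deriv(x[n], x[n - lag])
        x_mid = x[n] + 0.5 * dt * k1                # midpoint estimate
        x_del_mid = 0.5 * (x[n - lag] + x[n - lag + 1])
        x[n + 1] = x[n] + dt * deriv(x_mid, x_del_mid)
    return x[skip + lag:]                           # drop the transient towards the steady state

series = mackey_glass(2000)

# Prediction vectors with a time delay of six steps: inputs (x_i, x_{i+6}, ..., x_{i+30}), target x_{i+36}.
delays = np.arange(0, 37, 6)
vectors = np.stack([series[d: len(series) - 36 + d] for d in delays], axis=1)
inputs, targets = vectors[:, :6], vectors[:, 6]
```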

The fixed training set and each of the dynamic selection algorithms has been used for training with ten initial weight sets. Each run consists of a fixed number of cycles, and the average results are shown in table 2. The labeling is identical to that of table 1.

[Table 2: Predicting the Mackey-Glass model. Comparison of the average training and generalization results using a neural net and different training sets. Columns: training set size, training error E(D_t), generalization error E(D_u), generalization factor γ_u, and number of forward/backward propagations; the first row lists the fixed training set, the last two rows the two variants of the dynamic pattern selection.]

[Figure 4: Generalization factor γ_u as a function of the number of training epochs for learning to predict the Mackey-Glass model; left panel: fixed training set, right panel: dynamic selection.]

The fixed training sets give rise to a generalization factor considerably beyond one. As is depicted in figure 4, the generalization factor increases monotonically. In contrast to this, the generalization factor of the dynamic selection algorithm stays well below one, indicating a more reliable generalization, and the selected sets contain on average less than half the number of training patterns. With the generalization results being approximately equal, the computational expense of the dynamic selection algorithms is considerably lower. Note that there is no need for any initial investigation to determine the optimal training set.

Discussion

After having demonstrated the basic properties of the dynamic pattern selection, some further topics concerning possible extensions, data requirements and noise shall be discussed.

Obviously, the application of the dynamic pattern selection is limited to problems where generalization is possible; otherwise any selection of a training subset will fail to end up with a usable network performance. Up to now there have been no investigations with problems other than the approximation of continuous functions. To apply the dynamic pattern selection to classification problems with binary target values, the criterion for proper selection times probably has to be modified. In those cases it seems quite reasonable to expect the dynamic pattern selection to reveal clustering properties, and further investigation might lead to interesting results.

As a matter of fact, the dynamic pattern selection does not depend on a special error function. Therefore one might use the dynamic training sets in combination with other backpropagation extensions, for example constrained networks, thus getting the benefits of each of the methods.

Data requirements

As should be clear from the discussion above, an important precondition for the utilization of the dynamic pattern selection is a sufficient number of training patterns in the data set D_a. In fact, even with very small data sets the dynamic selection algorithm is a favorable choice, due to the increased control over the network results. Although the training set will cover the whole training store more or less quickly, one gains the possibility to estimate the achievable generalization by measuring the training error at the time when the training repertoire is exhausted. After all available training patterns have been selected, the different versions of the algorithm are equivalent to the standard backpropagation algorithm with or without cross validation. As a consequence of this limiting behavior, the dynamic pattern selection has the same data requirements as the standard backpropagation algorithm.

Noise

In the preceding discussion, one question has been left unasked that is of great interest for the practical virtue of the proposed algorithm: what happens if there is considerable noise in the data?

There is one important difference between dynamic and fixed training sets which might result in a somewhat higher sensitivity to noise in the dynamic case. If only some of the available training patterns are disturbed, the selection process will probably select most of these patterns, trying to improve the bad generalization results. Due to the concentrated training sets, the averaging between the selected samples is smaller, and consequently the training results are affected more by the additional noise than in the case of fixed training sets.

In many cases, however, the noise will be uniformly distributed over the data, and in these cases the dynamic training sets will yield averaging properties similar to the fixed sets. In many experiments the dynamic selection has been applied to training neural network predictors of real world signals which are disturbed by noise levels that are small compared to the training error (Röbel). In all cases the dynamic selection proved to be stable, with superior training results compared to the standard algorithm.

Comparison with online training

It has been argued that the dynamic pattern selection is well suited to processing very large data sets containing highly redundant data. In contrast to this, it is often assumed that in the case of redundant data the online training variant of backpropagation will yield superior results. Therefore we have compared the computational expenses for an up to date online training method, the Search-Then-Converge learning rate schedule as set forth by Darken and Moody, and the batch mode dynamic pattern selection algorithm. The neural networks have been trained to predict a real world piano signal consisting of a large number of samples, and the learning rate adaptation of the online training method has been optimized in a number of preceding runs. For the dynamic pattern selection, the automatic learning rate adaptation already mentioned in the previous sections has been employed.

Despite the preceding optimization necessary for the online training algorithm, and the fact that only a small number of patterns were selected by the dynamic selection, which results in a comparatively big overhead for the selection process, the overall expense for the batch mode dynamic pattern selection training has been considerably lower than the online training expense.[6]

As a consequence of this experiment, it follows that the proposed dynamic pattern selection combined with an accelerated batch mode training is the method of choice in all cases where the complete training set D_a is available at training time. Only in cases where the net has to be adapted to time varying situations during application should online training be used.

[6] I would like to thank Jens Ehrke for his support on the online training experiments.

Conclusion

Based on the fact that the generalization properties of neural networks are heavily determined by the training sets, an extension to the standard backpropagation algorithm, the dynamic pattern selection, has been introduced. The proposed algorithm has been tested on two problems from the area of nonlinear signal processing. Comparing the results to standard backpropagation training on optimized fixed training sets, it has been shown that the dynamic pattern selection algorithm achieves the same average generalization results with less computational expense and without any preceding investigation of the available data.

The dynamic pattern selection has especially proven to be useful for the very large and highly redundant data sets which are often used in signal processing applications. In these cases one should select a reasonable subset as the training and validation store; further selection will then be done automatically. As a special feature, the investigation of the dynamically selected training sets allows one to qualitatively estimate the data redundancy and, moreover, gives some hints on the reliability of the training results.

References

Atlas, L., Cohn, D., and Ladner, R. Training connectionist networks with queries and selective sampling. In D. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Baum, E. and Haussler, D. What size net gives valid generalization? Neural Computation.

Crowder, R. Predicting the Mackey-Glass time series with cascade-correlation learning. In Proceedings of the Connectionist Models Summer School.

Darken, C. and Moody, J. Note on learning rate schedules for stochastic optimization. In R. Lippmann, J. E. Moody, and D. Touretzky, editors, Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Farmer, J. D. and Sidorowich, J. J. Predicting chaotic dynamics. In Dynamic Patterns in Complex Systems. World Scientific.

Farmer, J. D. Chaotic attractors of an infinite-dimensional dynamical system. Physica D.

Finnoff, W., Hergert, F., and Zimmermann, H. Improving generalization performance by nonconvergent model selection methods. In I. Aleksander and J. Taylor, editors, Proceedings of the International Conference on Artificial Neural Networks (ICANN). Elsevier Science Publishers.

Grassberger, P. and Procaccia, I. (a) Estimation of the Kolmogorov entropy from a chaotic signal. Physical Review A.

Grassberger, P. and Procaccia, I. (b) Measuring the strangeness of strange attractors. Physica D.

Hecht-Nielsen, R. Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks. IEEE TAB Neural Network Committee, Washington DC.

Hecht-Nielsen, R. Neurocomputing. Addison-Wesley Publishing Company.

Jacobs, R. A. Initial experiments on constructing domains of expertise and hierarchies in connectionist systems. In Proceedings of the Connectionist Models Summer School.

Ji, C., Snapp, R., and Psaltis, D. Generalizing smoothness constraints from discrete samples. Neural Computation.

Lapedes, A. and Farber, R. (a) How neural nets work. In Neural Information Processing Systems.

Lapedes, A. and Farber, R. (b) Nonlinear signal processing using neural networks: Prediction and system modelling. Technical Report, Los Alamos National Laboratory.

Morgan, N. and Boulard, H. Generalization and parameter estimation in feedforward nets: Some experiments. In D. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Plutowski, M. and White, H. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks.

Plutowski, M., Cottrell, G., and White, H. Learning Mackey-Glass from 25 examples, plus or minus 2. Preprint.

Poggio, T. and Girosi, F. Networks for approximation and learning. Proceedings of the IEEE.

Röbel, A. Dynamic selection of training patterns for neural networks: A new method to control the generalization. Technical Report, Technical University of Berlin. In German.

Röbel, A. Neural models of nonlinear dynamical systems and their application to musical signals. PhD thesis, Technical University of Berlin.

Rumelhart, D., Hinton, G., and Williams, R. Learning internal representations by error propagation. In D. Rumelhart and J. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press.

Salomon, R. Verbesserung konnektionistischer Lernverfahren, die nach der Gradientenmethode arbeiten. PhD thesis, Technische Universität Berlin. In German.

Stone, M. Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society.

Weigend, A., Huberman, B., and Rumelhart, D. Predicting the future: a connectionist approach. International Journal of Neural Systems.

White, H. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks.

Widrow, B. ADALINE and MADALINE. In Proceedings of the IEEE 1st International Conference on Neural Networks.