Introduction to Radial Basis Function Networks

Mark J. L. Orr

Centre for Cognitive Science, University of Edinburgh,
Buccleuch Place, Edinburgh EH8 9LW, Scotland

April 1996

Abstract

This document is an introduction to radial basis function (RBF) networks, a type of artificial neural network for application to problems of supervised learning (e.g. regression, classification and time series prediction). It is now only available in PostScript; an older and now unsupported hypertext version may be available for a while longer.

The document was first published in 1996 along with a package of Matlab functions implementing the methods described. In 1999 a new document, Recent Advances in Radial Basis Function Networks, became available, with a second and improved version of the Matlab package.

mjo@anc.ed.ac.uk
www.anc.ed.ac.uk/~mjo/papers/intro.ps
www.anc.ed.ac.uk/~mjo/intro/intro.html
www.anc.ed.ac.uk/~mjo/software/rbf.zip
www.anc.ed.ac.uk/~mjo/papers/recad.ps
www.anc.ed.ac.uk/~mjo/software/rbf.zip

Contents

1 Introduction
2 Supervised Learning
  2.1 Nonparametric Regression
  2.2 Classification and Time Series Prediction
3 Linear Models
  3.1 Radial Functions
  3.2 Radial Basis Function Networks
4 Least Squares
  4.1 The Optimal Weight Vector
  4.2 The Projection Matrix
  4.3 Incremental Operations
  4.4 The Effective Number of Parameters
  4.5 Example
5 Model Selection Criteria
  5.1 Cross-Validation
  5.2 Generalised Cross-Validation
  5.3 Example
6 Ridge Regression
  6.1 Bias and Variance
  6.2 Optimising the Regularisation Parameter
  6.3 Local Ridge Regression
  6.4 Optimising the Regularisation Parameters
  6.5 Example
7 Forward Selection
  7.1 Orthogonal Least Squares
  7.2 Regularised Forward Selection
  7.3 Regularised Orthogonal Least Squares
  7.4 Example
A Appendices
  A.1 Notational Conventions
  A.2 Useful Properties of Matrices
  A.3 Radial Basis Functions
  A.4 The Optimal Weight Vector
  A.5 The Variance Matrix
  A.6 The Projection Matrix
  A.7 Incremental Operations
    A.7.1 Adding a new basis function
    A.7.2 Removing an old basis function
    A.7.3 Adding a new training pattern
    A.7.4 Removing an old training pattern
  A.8 The Effective Number of Parameters
  A.9 Leave-one-out Cross-validation
  A.10 A Re-Estimation Formula for the Global Parameter
  A.11 Optimal Values for the Local Parameters
  A.12 Forward Selection
  A.13 Orthogonal Least Squares
  A.14 Regularised Forward Selection
  A.15 Regularised Orthogonal Least Squares

1 Introduction

This document is an introduction to linear neural networks, particularly radial basis function (RBF) networks. The approach described places an emphasis on retaining, as much as possible, the linear character of RBF networks, despite the fact that for good generalisation there has to be some kind of nonlinear optimisation. The two main advantages of this approach are keeping the mathematics simple (it is just linear algebra) and keeping the computations relatively cheap (there is no optimisation by general purpose gradient descent algorithms).

Linear models have been studied in statistics for about 200 years and the theory is applicable to RBF networks, which are just one particular type of linear model. However, the fashion for neural networks, which started in the mid-1980s, has given rise to new names for concepts already familiar to statisticians. Table 1 gives some examples. Such terms are used interchangeably in this document.

    statistics               neural networks
    ----------------------   --------------------
    model                    network
    estimation               learning
    regression               supervised learning
    interpolation            generalisation
    observations             training set
    parameters               (synaptic) weights
    independent variables    inputs
    dependent variables      outputs
    ridge regression         weight decay

    Table 1: Equivalent terms in statistics and neural networks.

The document is structured as follows. We first outline supervised learning (section 2), the main application area for RBF networks, including the related areas of classification and time series prediction (section 2.2). We then describe linear models (section 3), including RBF networks (section 3.2). Least squares optimisation (section 4), including the effects of ridge regression, is then briefly reviewed, followed by model selection (section 5). After that we cover ridge regression (section 6) in more detail and lastly we look at forward selection (section 7) for building networks. Most of the mathematical details are put in an appendix (section A).

For alternative approaches see, for example, the work of Platt and associates and of Fritzke.

2 Supervised Learning

A ubiquitous problem in statistics, with applications in many areas, is to guess or estimate a function from some example input-output pairs with little or no knowledge of the form of the function. So common is the problem that it has different names in different disciplines (e.g. nonparametric regression, function approximation, system identification, inductive learning).

In neural network parlance, the problem is called supervised learning. The function is learned from the examples which a teacher supplies. The set of examples, or training set, contains elements which consist of paired values of the independent (input) variable and the dependent (output) variable. For example, the independent variable in the functional relation

$$ y = f(\mathbf{x}) $$

is $\mathbf{x}$ (a vector) and the dependent variable is $y$ (a scalar). The value of the variable $y$ depends, through the function $f$, on each of the components of the vector variable

$$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. $$

Note that we are using bold lower-case letters for vectors and italicised lower-case letters for scalars, including scalar valued functions like $f$ (see appendix A.1 on notational conventions).

The general case is where both the independent and dependent variables are vectors. This adds more mathematics but little extra insight to the special case of univariate output, so for simplicity we will confine our attention to the latter. Note, however, that multiple outputs can be treated in a special way in order to reduce redundancy.

The training set, in which there are $p$ pairs (indexed by $i$ running from 1 up to $p$), is represented by

$$ T = \{(\mathbf{x}_i, \hat{y}_i)\}_{i=1}^p. $$

The reason for the hat over the letter $y$ (another convention, see appendix A.1), indicating an estimate or uncertain value, is that the output values of the training set are usually assumed to be corrupted by noise. In other words, the correct value to pair with $\mathbf{x}_i$, namely $y_i$, is unknown. The training set only specifies $\hat{y}_i$, which is equal to $y_i$ plus a small amount of unknown noise.

In real applications the independent variable values in the training set are often also affected by noise. This type of noise is more difficult to model and we shall not attempt it. In any case, taking account of noise in the inputs is approximately equivalent to assuming noiseless inputs but with an increased amount of noise in the outputs.

2.1 Nonparametric Regression

There are two main subdivisions of regression problems in statistics: parametric and nonparametric. In parametric regression the form of the functional relationship between the dependent and independent variables is known but may contain parameters whose values are unknown and capable of being estimated from the training set. For example, fitting a straight line,

$$ f(x) = a\,x + b, $$

to a bunch of points $\{(x_i, \hat{y}_i)\}_{i=1}^p$ (see figure 1) is parametric regression, because the functional form of the dependence of $y$ on $x$ is given, even though the values of $a$ and $b$ are not. Typically, in any given parametric problem, the free parameters, as well as the dependent and independent variables, have meaningful interpretations, like "initial water level" or "rate of flow".

Figure 1: Fitting a straight line to a bunch of points is a kind of parametric regression where the form of the model is known.

The distinguishing feature of nonparametric regression is that there is no (or very little) a priori knowledge about the form of the true function which is being estimated. The function is still modelled using an equation containing free parameters, but in a way which allows the class of functions which the model can represent to be very broad. Typically this involves using many free parameters which have no physical meaning in relation to the problem. In parametric regression there is typically a small number of parameters and often they have physical interpretations.

Neural networks, including radial basis function networks, are nonparametric models and their weights (and other parameters) have no particular meaning in relation to the problems to which they are applied. Estimating values for the weights of a neural network (or the parameters of any nonparametric model) is never the primary goal in supervised learning. The primary goal is to estimate the underlying function (or at least to estimate its output at certain desired values of the input). On the other hand, the main goal of parametric regression can be, and often is, the estimation of parameter values because of their intrinsic meaning.

2.2 Classification and Time Series Prediction

In classification problems the goal is to assign previously unseen patterns to their respective classes based on previous examples from each class. Thus the output of the learning algorithm is one of a discrete set of possible classes rather than, as in nonparametric regression (section 2.1), the value of a continuous function. However, classification problems can be made to look like nonparametric regression if the outputs of the estimated function are interpreted as being proportional to the probability that the input belongs to the corresponding class.

The discrete class labels of the training set outputs are given numerical values by interpreting the $k$-th class label as a probability of 1 that the example belongs to the class and a probability of 0 that it belongs to any other class. Thus the training output values are vectors of length equal to the number of classes, containing a single one and otherwise zeros. After training, the network responds to a new pattern with continuous values in each component of the output vector, which can be interpreted as being proportional to class probability and used to generate a class assignment. Various types of algorithms have been compared on different classification problems in the literature.

Time series prediction is concerned with estimating the next value, and future values, of a sequence such as

$$ \dots,\; y_{t-2},\; y_{t-1},\; y_t,\; \dots $$

but usually not as an explicit function of time. Normally time series are modelled as autoregressive in nature, i.e. the outputs, suitably delayed, are also the inputs:

$$ y_t = f(y_{t-1},\, y_{t-2},\, y_{t-3},\, \dots). $$

To create the training set from the available historical sequence first requires the choice of how many and which delayed outputs affect the next output. This is one of a number of complications which make time series prediction a more difficult problem than straight regression or classification. Others include regime switching and asynchronous sampling.

3 Linear Models

A linear model for a function $y(\mathbf{x})$ takes the form

$$ f(\mathbf{x}) = \sum_{j=1}^m w_j\, h_j(\mathbf{x}). $$

The model $f$ is expressed as a linear combination of a set of $m$ fixed functions (often called basis functions, by analogy with the concept of a vector being composed of a linear combination of basis vectors). The choice of the letter $w$ for the coefficients of the linear combinations and the letter $h$ for the basis functions reflects our interest in neural networks, which have weights and hidden units.

The flexibility of $f$, its ability to fit many different functions, derives only from the freedom to choose different values for the weights. The basis functions and any parameters which they might contain are fixed. If this is not the case, if the basis functions can change during the learning process, then the model is nonlinear.

Any set of functions can be used as the basis set, although it helps, of course, if they are well behaved (differentiable). However, models containing only basis functions drawn from one particular class have a special interest. Classical statistics abounds with linear models whose basis functions are polynomials ($h_j(x) = x^j$ or variations on the theme). Combinations of sinusoidal waves,

$$ h_j(x) = \sin\!\left(\frac{2\pi j\, x}{m}\right), $$

are often used in signal processing applications. Logistic functions, of the sort

$$ h(\mathbf{x}) = \frac{1}{1 + \exp(\mathbf{b}^T\mathbf{x} - b_0)}, $$

are popular in artificial neural networks, particularly in multi-layer perceptrons (MLPs).

A familiar example, almost the simplest polynomial, is the straight line

$$ f(x) = a\,x + b, $$

which is a linear model whose two basis functions are

$$ h_1(x) = 1, \qquad h_2(x) = x, $$

and whose weights are $w_1 = b$ and $w_2 = a$. This is, of course, a very simple model and is not flexible enough to be used for supervised learning (section 2).

Linear models are simpler to analyse mathematically. In particular, if supervised learning problems (section 2) are solved by least squares (section 4) then it is possible to derive and solve a set of equations for the optimal weight values implied by the training set. The same does not apply for nonlinear models, such as MLPs, which require iterative numerical procedures for their optimisation.

3.1 Radial Functions

Radial functions are a special class of function. Their characteristic feature is that their response decreases (or increases) monotonically with distance from a central point. The centre, the distance scale, and the precise shape of the radial function are parameters of the model, all fixed if it is linear.

A typical radial function is the Gaussian which, in the case of a scalar input, is

$$ h(x) = \exp\!\left(-\frac{(x - c)^2}{r^2}\right). $$

Its parameters are its centre $c$ and its radius $r$. Figure 2 (left) illustrates a Gaussian RBF with centre $c = 0$ and radius $r = 1$.

A Gaussian RBF monotonically decreases with distance from the centre. In contrast, a multiquadric RBF which, in the case of scalar input, is

$$ h(x) = \frac{\sqrt{r^2 + (x - c)^2}}{r}, $$

monotonically increases with distance from the centre (see figure 2, right). Gaussian-like RBFs are local (they give a significant response only in a neighbourhood near the centre) and are more commonly used than multiquadric-type RBFs, which have a global response. They are also more biologically plausible because their response is finite.

Figure 2: Gaussian (left) and multiquadric (right) RBFs.

Formulae for other types of RBF functions and for multiple inputs are given in appendix A.3.
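As a concrete illustration of the two scalar radial functions above, here is a minimal sketch in Python/NumPy (not part of the original report, which is accompanied by a Matlab package instead); the centre and radius values are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_rbf(x, c=0.0, r=1.0):
    # h(x) = exp(-(x - c)^2 / r^2): peaks at the centre, decays with distance
    return np.exp(-((x - c) ** 2) / r ** 2)

def multiquadric_rbf(x, c=0.0, r=1.0):
    # h(x) = sqrt(r^2 + (x - c)^2) / r: equals 1 at the centre, grows with distance
    return np.sqrt(r ** 2 + (x - c) ** 2) / r

x = np.linspace(-2.0, 2.0, 9)
print(gaussian_rbf(x))       # local response, significant only near c = 0
print(multiquadric_rbf(x))   # global response, increasing away from c = 0
```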

3.2 Radial Basis Function Networks

Radial functions are simply a class of functions. In principle they could be employed in any sort of model (linear or nonlinear) and any sort of network (single-layer or multi-layer). However, since Broomhead and Lowe's seminal paper, radial basis function networks (RBF networks) have traditionally been associated with radial functions in a single-layer network such as shown in figure 3.

Figure 3: The traditional radial basis function network. Each of the $n$ components of the input vector $\mathbf{x}$ feeds forward to $m$ basis functions whose outputs are linearly combined with weights $\{w_j\}_{j=1}^m$ into the network output $f(\mathbf{x})$.

An RBF network is nonlinear if the basis functions can move or change size or if there is more than one hidden layer. Here we focus on single-layer networks with functions which are fixed in position and size. We do use nonlinear optimisation, but only for the regularisation parameter(s) in ridge regression (section 6) and the optimal subset of basis functions in forward selection (section 7). We also avoid the kind of expensive nonlinear gradient descent algorithms (such as the conjugate gradient and variable metric methods) that are employed in explicitly nonlinear networks. Keeping one foot firmly planted in the world of linear algebra makes analysis easier and computations quicker.
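The single-layer structure of figure 3 amounts to building a design matrix of basis-function outputs and taking a weighted sum. A small sketch (Python/NumPy, not from the original text; the centres, width and weights are made-up toy values):

```python
import numpy as np

def design_matrix(X, centres, r):
    """H[i, j] = exp(-||x_i - c_j||^2 / r^2), Gaussian basis functions."""
    # X: (p, n) inputs, centres: (m, n) basis function centres, r: scalar width
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / r ** 2)

def rbf_network(X, centres, r, w):
    """f(x) = sum_j w_j h_j(x), evaluated for each row of X."""
    return design_matrix(X, centres, r) @ w

# toy example: three hidden units on a 1-dimensional input
centres = np.array([[0.0], [0.5], [1.0]])
w = np.array([1.0, -2.0, 0.5])              # hidden-to-output weights
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
print(rbf_network(X, centres, r=0.3, w=w))
```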

4 Least Squares

When applied to supervised learning (section 2) with linear models (section 3), the least squares principle leads to a particularly easy optimisation problem. If the model is

$$ f(\mathbf{x}) = \sum_{j=1}^m w_j\, h_j(\mathbf{x}) $$

and the training set (section 2) is $\{(\mathbf{x}_i, \hat{y}_i)\}_{i=1}^p$, then the least squares recipe is to minimise the sum-squared-error

$$ S = \sum_{i=1}^p \big(\hat{y}_i - f(\mathbf{x}_i)\big)^2 $$

with respect to the weights of the model. If a weight penalty term is added to the sum-squared-error, as is the case with ridge regression (section 6), then the following cost function is minimised,

$$ C = \sum_{i=1}^p \big(\hat{y}_i - f(\mathbf{x}_i)\big)^2 + \sum_{j=1}^m \lambda_j\, w_j^2, $$

where the $\{\lambda_j\}_{j=1}^m$ are regularisation parameters.

4.1 The Optimal Weight Vector

We show in appendix A.4 that minimisation of the cost function leads to a set of $m$ simultaneous linear equations in the $m$ unknown weights, and how the linear equations can be written more conveniently as the matrix equation

$$ A\,\hat{\mathbf{w}} = \mathbf{H}^T \hat{\mathbf{y}}, $$

where $\mathbf{H}$, the design matrix, is

$$ \mathbf{H} = \begin{bmatrix}
h_1(\mathbf{x}_1) & h_2(\mathbf{x}_1) & \cdots & h_m(\mathbf{x}_1) \\
h_1(\mathbf{x}_2) & h_2(\mathbf{x}_2) & \cdots & h_m(\mathbf{x}_2) \\
\vdots & \vdots & & \vdots \\
h_1(\mathbf{x}_p) & h_2(\mathbf{x}_p) & \cdots & h_m(\mathbf{x}_p)
\end{bmatrix}, $$

$A^{-1}$, the variance matrix, is

$$ A^{-1} = \big(\mathbf{H}^T\mathbf{H} + \Lambda\big)^{-1}, $$

the elements of the matrix $\Lambda$ are all zero except for the regularisation parameters along its diagonal, and $\hat{\mathbf{y}} = (\hat{y}_1\; \hat{y}_2\; \dots\; \hat{y}_p)^T$ is the vector of training set outputs. The solution is the so-called normal equation,

$$ \hat{\mathbf{w}} = A^{-1}\mathbf{H}^T\hat{\mathbf{y}}, $$

and $\hat{\mathbf{w}} = (\hat{w}_1\; \hat{w}_2\; \dots\; \hat{w}_m)^T$ is the vector of weights which minimises the cost function. The use of these equations is illustrated with a simple example (section 4.5).

To save repeating the analysis with and without a weight penalty, the appendices and the rest of this section include its effects. However, any equation involving weight penalties can easily be converted back to ordinary least squares (without any penalty) simply by setting all the regularisation parameters to zero.
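In code the optimal weights come straight from the normal equation. A sketch (Python/NumPy, not part of the original text) with the regularisation parameters collapsed to a single value lam (lam = 0 recovers ordinary least squares):

```python
import numpy as np

def optimal_weights(H, y, lam=0.0):
    """Solve A w = H^T y with A = H^T H + lam * I (the normal equation)."""
    m = H.shape[1]
    A = H.T @ H + lam * np.eye(m)
    return np.linalg.solve(A, H.T @ y)    # preferable to forming A^{-1} explicitly

# straight-line model: h_1(x) = 1, h_2(x) = x
x = np.array([1.0, 2.0, 3.0])
H = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 1.8, 3.1])
print(optimal_weights(H, y))              # ordinary least squares
print(optimal_weights(H, y, lam=0.1))     # ridge-penalised weights
```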

4.2 The Projection Matrix

A useful matrix which often crops up in the analysis of linear networks is the projection matrix (appendix A.6),

$$ P = I_p - \mathbf{H}\,A^{-1}\mathbf{H}^T. $$

This square matrix projects vectors in $p$-dimensional space perpendicular to the $m$-dimensional subspace spanned by the model (though only in the case of ordinary least squares, i.e. no weight penalty). The training set output vector $\hat{\mathbf{y}}$ lives in $p$-dimensional space, since there are $p$ patterns (see figure 4). However, the model, being linear and possessing only $m$ degrees of freedom (the $m$ weights), can only reach points in an $m$-dimensional hyperplane, a subspace of $p$-dimensional space (assuming $m < p$). For example, if $p = 3$ and $m = 2$ the model can only reach points lying in some fixed plane and can never exactly match $\hat{\mathbf{y}}$ if it lies somewhere outside this plane. The least squares principle implies that the optimal model is the one with the minimum distance from $\hat{\mathbf{y}}$, i.e. the projection of $\hat{\mathbf{y}}$ onto the $m$-dimensional hyperplane. The mismatch, or error, between $\hat{\mathbf{y}}$ and the least squares model is the projection of $\hat{\mathbf{y}}$ perpendicular to the subspace, and this turns out to be $P\hat{\mathbf{y}}$.

The importance of the matrix $P$ is perhaps evident from the following two equations, proven in appendix A.6. At the optimal weight, the value of the cost function is

$$ \hat{C} = \hat{\mathbf{y}}^T P\, \hat{\mathbf{y}}, $$

and the sum-squared-error is

$$ \hat{S} = \hat{\mathbf{y}}^T P^2\, \hat{\mathbf{y}}. $$

Figure 4: The model spans a 2-dimensional plane, with basis vectors b1 and b2, in 3-dimensional space (which has an extra basis vector b3) and cannot match y-hat exactly. The least squares model is the projection of y-hat onto this plane and the error vector is P y-hat.

These two equations (the latter in particular) will be useful later, when we have to choose the best regularisation parameter(s) in ridge regression (section 6) or the best subset in forward selection (section 7).
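A sketch of these identities (Python/NumPy, toy data as before), checking that P y-hat is the residual vector and that the cost and sum-squared-error follow the two formulae above; with lam = 0 the matrix is idempotent, so the two values coincide.

```python
import numpy as np

def projection_matrix(H, lam=0.0):
    p, m = H.shape
    A = H.T @ H + lam * np.eye(m)
    return np.eye(p) - H @ np.linalg.solve(A, H.T)   # P = I_p - H A^{-1} H^T

x = np.array([1.0, 2.0, 3.0])
H = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 1.8, 3.1])

P = projection_matrix(H)
w = np.linalg.solve(H.T @ H, H.T @ y)
print(np.allclose(P @ y, y - H @ w))     # P y is the error (residual) vector
print(y @ P @ y, y @ P @ P @ y)          # cost and sum-squared-error (equal when lam = 0)
```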

4.3 Incremental Operations

A common problem in supervised learning is to determine the effect of adding a new basis function or taking one away, often as part of a process of trying to discover the optimal set of basis functions (see forward selection, section 7). Alternatively, one might also be interested in the effect of removing one of the training examples (perhaps in an attempt to clean outliers out of the data) or of including a new one (online learning).

These are incremental (or decremental) operations with respect to either the basis functions or the training set. Their effects can, of course, be calculated by retraining the network from scratch, the only route available when using nonlinear networks. However, in the case of linear networks such as radial basis function networks, simple analytical formulae can be derived to cover each situation. These are based on a couple of basic formulae (appendix A.2) for handling matrix inverses. Their use can produce significant savings in compute time compared to retraining from scratch (see table 2). The details are all in appendix A.7. Formulae are given for:

- adding a new basis function (appendix A.7.1),
- removing an old basis function (appendix A.7.2),
- adding a new training pattern (appendix A.7.3),
- removing an old training pattern (appendix A.7.4).

The equation in appendix A.7.2 for the change in the projection matrix caused by removing a basis function is particularly useful, as we shall see in later sections on ridge regression and forward selection.
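The appendix formulae themselves are not reproduced here, but the flavour of an incremental operation can be shown with the standard rank-one (Sherman-Morrison) update of the variance matrix when a new training pattern is added: appending a row h to the design matrix changes A into A + h h^T, whose inverse can be obtained without re-inverting from scratch. A sketch under that assumption:

```python
import numpy as np

def add_pattern(Ainv, h):
    """Sherman-Morrison update of A^{-1} when a row h^T is appended to H,
    so that A becomes A + h h^T."""
    Ah = Ainv @ h
    return Ainv - np.outer(Ah, Ah) / (1.0 + h @ Ah)

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 4))
Ainv = np.linalg.inv(H.T @ H)

h_new = rng.normal(size=4)                 # basis-function outputs for the new pattern
Ainv_updated = add_pattern(Ainv, h_new)
Ainv_scratch = np.linalg.inv(H.T @ H + np.outer(h_new, h_new))
print(np.allclose(Ainv_updated, Ainv_scratch))   # True: same result, far fewer operations
```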

The Eective Number of Parameters

p

Supp ose you are given a set of numb ers fx g randomly drawn from a Gaussian

i

i

distribution and you are asked to estimate its variance without b eing told its mean

The rst thing you would do is to estimate the mean

p

X

x x

i

p

i

and then use this estimate in the calculation of the variance

p

X

x x

i

p

i

But where has the factor p in this familiar formula come from Why not divide

by p the number of samples When a parameter such as x is estimated as ab ove

it is unavoidable that it ts some of the noise in the data The apparent variance

which you would have got by dividing by p will thus be an underestimate and

since one degree of freedom has b een used up in the calculation of the mean the

balance is restored by reducing the remaining degrees of freedom by one ie by

dividing by p

The same applies in sup ervised learning section It would be a mistake to

divide the sumsquaredtrainingerror bythenumb er of patterns in order to estimate

the noise variance since some degrees of freedom will have b een used up in tting

the mo del In a linear mo del section there are m weightsif there are p patterns

in the training set that leaves p m degrees of freedom

The variance estimate is thus

S

p m

where S is the sumsquarederror over the training set at the optimal weightvector

is called the unbiased estimate of variance equation

However things are not as simple when ridge regression is used Although there

are still m weights in the mo del what John Mo o dy calls the eective number of pa

rameters and David MacKay calls the number of goodparameter measurements

is less than m and dep ends on the size of the regularisation parameters The

simplest formula for this number is

p trace P

which is consistent with b oth Mo o dys and MacKays formulae as shown in app endix

A The unbiased estimate of variance b ecomes

S

p

In the absence of regularisation trace P p m and m as we would exp ect
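A short sketch (Python/NumPy, made-up data) of the effective number of parameters and the corresponding unbiased variance estimate for a ridge-penalised linear model:

```python
import numpy as np

def effective_parameters(H, lam):
    p, m = H.shape
    A = H.T @ H + lam * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)
    gamma = p - np.trace(P)                  # gamma = p - trace(P)
    return P, gamma

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 10))
y = rng.normal(size=50)

P, gamma = effective_parameters(H, lam=0.0)
print(round(gamma, 6))                       # equals m = 10 when there is no regularisation

P, gamma = effective_parameters(H, lam=5.0)
sigma2 = (y @ P @ P @ y) / (50 - gamma)      # unbiased estimate of variance
print(gamma, sigma2)
```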

4.5 Example

Almost the simplest linear model for scalar inputs is the straight line, given by

$$ f(x) = w_1 h_1(x) + w_2 h_2(x), $$

where the two basis functions are

$$ h_1(x) = 1, \qquad h_2(x) = x. $$

We stress this is unlikely ever to be used in a nonparametric context (section 2.1), as it obviously lacks the flexibility to fit a large range of different functions, but its simplicity lends itself to illustrating a point or two.

Assume we sample points from the line $y = x$ at the three points $x_1 = 1$, $x_2 = 2$ and $x_3 = 3$, generating the training set

$$ \{(1,\,1.1),\; (2,\,1.8),\; (3,\,3.1)\}. $$

Clearly our sampling is noisy, since the sampled output values do not correspond exactly with the expected values. Suppose, however, that the actual line from which we have sampled is unknown to us and that the only information we have is the three noisy samples. We estimate its coefficients (intercept $w_1$ and slope $w_2$) by solving the normal equation (section 4.1).

To do this we first set up the design matrix,

$$ \mathbf{H} = \begin{bmatrix} h_1(x_1) & h_2(x_1) \\ h_1(x_2) & h_2(x_2) \\ h_1(x_3) & h_2(x_3) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, $$

and the vector of training set (section 2) outputs, which is

$$ \hat{\mathbf{y}} = \begin{bmatrix} 1.1 \\ 1.8 \\ 3.1 \end{bmatrix}. $$

Next we compute the product of $\mathbf{H}$ with its own transpose, which is

$$ \mathbf{H}^T\mathbf{H} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}, $$

and then the inverse of this square matrix to obtain

$$ A^{-1} = (\mathbf{H}^T\mathbf{H})^{-1} = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix}. $$

You can check that this is correct by multiplying out $\mathbf{H}^T\mathbf{H}$ and its inverse to obtain the identity matrix. Using this in the normal equation for the optimal weight vector yields

$$ \hat{\mathbf{w}} = A^{-1}\mathbf{H}^T\hat{\mathbf{y}} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. $$

This means that the line with the least sum-squared-error with respect to the training set has slope 1 and intercept 0 (see figure 5).

Fortuitously, we have arrived at exactly the right answer, since the actual line sampled did indeed have slope 1 and intercept 0. The projection matrix (appendix A.6) is

$$ P = I_3 - \mathbf{H}A^{-1}\mathbf{H}^T = \frac{1}{6}\begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}, $$

and consequently the sum-squared-error at the optimal weight is

$$ \hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}} = 0.06. $$

Figure 5: The least squares line fit to the three input-output pairs (1, 1.1), (2, 1.8) and (3, 3.1).

Another way of computing the same figure is

$$ (\mathbf{H}\hat{\mathbf{w}} - \hat{\mathbf{y}})^T(\mathbf{H}\hat{\mathbf{w}} - \hat{\mathbf{y}}) = 0.06. $$

Suppose, in ignorance of the true function, we look at figure 5 and decide that the model is wrong, that it should have an extra term in $x^2$. The new model is then

$$ f(x) = w_1 h_1(x) + w_2 h_2(x) + w_3 h_3(x), $$

where

$$ h_3(x) = x^2. $$

The effect of the extra basis function is to add an extra column,

$$ \mathbf{h}_3 = \begin{bmatrix} h_3(x_1) \\ h_3(x_2) \\ h_3(x_3) \end{bmatrix} = \begin{bmatrix} 1 \\ 4 \\ 9 \end{bmatrix}, $$

to the design matrix. We could retrain the network from scratch, adding the extra column to $\mathbf{H}$ and recomputing $\mathbf{H}^T\mathbf{H}$, its inverse $A^{-1}$ and the new projection matrix. The cost in floating point operations is given in table 2. Alternatively, we could retrain as an incremental operation (section 4.3), calculating the new variance matrix and the new projection matrix from the update formulae of appendix A.7.1. This results in a reduction in the amount of computation (see table 2), a reduction which for larger values of $p$ and $m$ can be quite significant (see the table in appendix A.7).

Table 2: The cost in FLOPS of retraining from scratch and retraining by using an incremental operation, after adding the extra x^2 term to the model in the example (rows: the variance matrix A^{-1} and the projection matrix P).

Figure 6: The least squares quadratic fit to the three input-output pairs.

Figure 6 shows the fit made by the model with the extra basis function. The curve passes through each training point and there is consequently no training set error. The extra basis function has made the model flexible enough to absorb all the noise. If we did not already know that the target was a straight line, we might now be fooled into thinking that the new model was better than the previous one. This is a simple demonstration of the dilemma of supervised learning (section 2): if the model is too flexible it will fit the noise; if it is too inflexible it will miss the target. Somehow we have to tread a path between these two pitfalls, and this is the reason for employing model selection criteria (section 5), which we look at next.
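The arithmetic of this example is easy to reproduce. A sketch (Python/NumPy, not from the original text) fitting first the line and then the quadratic to the three samples:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.8, 3.1])

# linear model: H has columns h_1(x) = 1 and h_2(x) = x
H = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(H.T @ H, H.T @ y)
print(w)                                   # [0. 1.]: intercept 0, slope 1
print(np.sum((y - H @ w) ** 2))            # sum-squared-error 0.06

# add the extra basis function h_3(x) = x^2
H2 = np.column_stack([H, x ** 2])
w2 = np.linalg.solve(H2.T @ H2, H2.T @ y)
print(np.sum((y - H2 @ w2) ** 2))          # essentially zero: the quadratic absorbs all the noise
```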

5 Model Selection Criteria

The model selection criteria we shall cover are all estimates of prediction error, that is, estimates of how well the trained model will perform on future (unknown) inputs. The best model is the one whose estimated prediction error is least. Cross-validation (section 5.1) is the standard tool for measuring prediction error, but there are several others, exemplified by generalised cross-validation (section 5.2), which all involve various adjustments to the sum-squared-error over the training set.

The key elements in these criteria are the projection matrix and the effective number of parameters in the network. In ridge regression (section 6) the projection matrix, although we continue to call it that, is not exactly a projection, and the effective number of parameters is not equal to $m$, the number of weights. To be as general as possible we shall use ridge versions for both the projection matrix and the effective number of parameters. However, the ordinary least squares versions of the selection criteria can always be obtained simply by setting all the regularisation parameters to zero and remembering that the projection matrix is then idempotent ($P^2 = P$).

For reviews of model selection see the two articles by Hocking and the relevant chapter of Efron and Tibshirani's book.

5.1 Cross-Validation

If data is not scarce then the set of available input-output measurements can be divided into two parts, one part for training and one part for testing. In this way several different models, all trained on the training set, can be compared on the test set. This is the basic form of cross-validation.

A better method, which is intended to avoid the possible bias introduced by relying on any one particular division into test and train components, is to partition the original set in several different ways and to compute an average score over the different partitions. An extreme variant of this is to split the $p$ patterns into a training set of size $p-1$ and a test set of size 1 and average the squared error on the left-out pattern over the $p$ possible ways of obtaining such a partition. This is called leave-one-out (LOO) cross-validation. The advantage is that all the data can be used for training; none has to be held back in a separate test set.

The beauty of LOO for linear models, such as RBF networks (section 3.2), is that it can be calculated analytically (appendix A.9) as

$$ \hat{\sigma}^2_{\mathrm{LOO}} = \frac{1}{p}\, \hat{\mathbf{y}}^T P\, \big(\mathrm{diag}(P)\big)^{-2}\, P\, \hat{\mathbf{y}}. $$

In contrast, LOO is intractable to calculate for all but the smallest nonlinear models, as there is no alternative to training and testing $p$ times.
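The analytic LOO formula can be checked directly against an explicit leave-one-out loop. A sketch (Python/NumPy, ordinary least squares for simplicity, random toy data):

```python
import numpy as np

def loo_analytic(H, y):
    p, m = H.shape
    P = np.eye(p) - H @ np.linalg.solve(H.T @ H, H.T)
    e = P @ y
    return np.mean((e / np.diag(P)) ** 2)      # (1/p) y^T P diag(P)^-2 P y

def loo_explicit(H, y):
    p = len(y)
    errs = []
    for i in range(p):
        keep = np.arange(p) != i               # train on everything except pattern i
        w = np.linalg.solve(H[keep].T @ H[keep], H[keep].T @ y[keep])
        errs.append((y[i] - H[i] @ w) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(2)
H = rng.normal(size=(20, 3))
y = rng.normal(size=20)
print(np.isclose(loo_analytic(H, y), loo_explicit(H, y)))   # True
```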

5.2 Generalised Cross-Validation

The matrix $\mathrm{diag}(P)$ makes LOO slightly awkward to handle mathematically. Its cousin, generalised cross-validation (GCV), is more convenient and is

$$ \hat{\sigma}^2_{\mathrm{GCV}} = \frac{p\, \hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}}{\big(\mathrm{trace}(P)\big)^2}. $$

The similarity with leave-one-out cross-validation is apparent: just replace $\mathrm{diag}(P)$ in the equation for LOO with the average diagonal element times the identity matrix, $(\mathrm{trace}(P)/p)\, I_p$.

GCV is one of a number of criteria which all involve an adjustment to the average mean-squared-error over the training set, $\hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}/p$. There are several other criteria which share this form. The unbiased estimate of variance (UEV), which we met in a previous section (4.4), is

$$ \hat{\sigma}^2_{\mathrm{UEV}} = \frac{\hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}}{p - \gamma}, $$

where $\gamma$ is the effective number of parameters. A version of Mallows' $C_p$, and also a special case of Akaike's information criterion, is final prediction error (FPE),

$$ \hat{\sigma}^2_{\mathrm{FPE}} = \frac{p + \gamma}{p}\, \hat{\sigma}^2_{\mathrm{UEV}} = \frac{(p + \gamma)\, \hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}}{p\,(p - \gamma)}. $$

Schwarz's Bayesian information criterion (BIC) is

$$ \hat{\sigma}^2_{\mathrm{BIC}} = \frac{p + (\ln p - 1)\gamma}{p}\, \hat{\sigma}^2_{\mathrm{UEV}} = \frac{\big(p + (\ln p - 1)\gamma\big)\, \hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}}{p\,(p - \gamma)}. $$

Generalised cross-validation can also be written in terms of $\gamma$ instead of $\mathrm{trace}(P)$, using the equation for the effective number of parameters:

$$ \hat{\sigma}^2_{\mathrm{GCV}} = \frac{p\, \hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}}{(p - \gamma)^2}. $$

UEV, FPE, GCV and BIC are thus all of the form of a multiplicative factor times $\hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}/p$ and have a natural ordering,

$$ \mathrm{UEV} \le \mathrm{FPE} \le \mathrm{GCV} \le \mathrm{BIC}, $$

as is plain if the factors are expanded out in powers of $\gamma/p$:

$$ \frac{p}{p-\gamma} = 1 + \frac{\gamma}{p} + \frac{\gamma^2}{p^2} + \frac{\gamma^3}{p^3} + \cdots \quad (\mathrm{UEV}) $$

$$ \frac{p+\gamma}{p-\gamma} = 1 + \frac{2\gamma}{p} + \frac{2\gamma^2}{p^2} + \frac{2\gamma^3}{p^3} + \cdots \quad (\mathrm{FPE}) $$

$$ \frac{p^2}{(p-\gamma)^2} = 1 + \frac{2\gamma}{p} + \frac{3\gamma^2}{p^2} + \frac{4\gamma^3}{p^3} + \cdots \quad (\mathrm{GCV}) $$

$$ \frac{p + (\ln p - 1)\gamma}{p-\gamma} = 1 + \frac{(\ln p)\,\gamma}{p} + \frac{(\ln p)\,\gamma^2}{p^2} + \frac{(\ln p)\,\gamma^3}{p^3} + \cdots \quad (\mathrm{BIC}) $$
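A sketch (Python/NumPy, random toy data) computing the four criteria from P, y-hat and gamma for a given value of the regularisation parameter; the ordering above can be observed directly in the output.

```python
import numpy as np

def selection_criteria(H, y, lam):
    p, m = H.shape
    A = H.T @ H + lam * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)
    gamma = p - np.trace(P)
    sse = y @ P @ P @ y                                   # y^T P^2 y
    uev = sse / (p - gamma)
    fpe = (p + gamma) * sse / (p * (p - gamma))
    gcv = p * sse / (p - gamma) ** 2
    bic = (p + (np.log(p) - 1.0) * gamma) * sse / (p * (p - gamma))
    return uev, fpe, gcv, bic

rng = np.random.default_rng(3)
H = rng.normal(size=(40, 15))
y = rng.normal(size=40)
print(selection_criteria(H, y, lam=1.0))    # typically increasing from left to right
```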

5.3 Example

Figure 7: Training set (data), target function and fit (lambda = 1) for the example.

Figure 7 shows a set of $p$ training input-output points sampled from a smooth target function (a low-order polynomial in $x$ damped by $e^{-x^2}$) with added Gaussian noise. We shall fit this data with an RBF network (section 3.2) with $m = p$ Gaussian centres (section 3.1) coincident with the input points in the training set, all of the same width $r$. To avoid overfit we shall use standard ridge regression (section 6), controlled by a single regularisation parameter $\lambda$. The error predictions, as a function of $\lambda$, made by the various model selection criteria are shown in figure 8. Also shown is the mean-squared-error (MSE) between the fit and the target over an independent test set.

Figure 8: The various model selection criteria (LOO, UEV, FPE, GCV and BIC), plus mean-squared-error over a test set, as a function of the regularisation parameter lambda.

A feature of the figure is that all the criteria have a minimum which is close to the minimum of the MSE. This is why the model selection criteria are useful. When we do not have access to the true MSE, as in any practical problem, they are often able to select the best model (in this example, the best $\lambda$) by picking out the one with the lowest predicted error.

Two words of warning: they don't always work as effectively as in this example, and UEV is inferior to GCV as a selection criterion (and probably to the others as well).

6 Ridge Regression

Around the middle of the 20th century the Russian theoretician Andrei Tikhonov was working on the solution of ill-posed problems. These are mathematical problems for which no unique solution exists because, in effect, there is not enough information specified in the problem. It is necessary to supply extra information (or assumptions) and the mathematical technique Tikhonov developed for this is known as regularisation.

Tikhonov's work only became widely known in the West after the publication in 1977 of his book. Meanwhile, two American statisticians, Arthur Hoerl and Robert Kennard, published a paper in 1970 on ridge regression, a method for solving badly conditioned linear regression problems. Bad conditioning means numerical difficulties in performing the matrix inverse necessary to obtain the variance matrix. It is also a symptom of an ill-posed regression problem in Tikhonov's sense, and Hoerl and Kennard's method was in fact a crude form of regularisation, known now as zero-order regularisation.

In the 1980s, when neural networks became popular, weight decay was one of a number of techniques invented to help prune unimportant network connections. However, it was soon recognised that weight decay involves adding the same penalty term to the sum-squared-error as in ridge regression. Weight decay and ridge regression are equivalent.

While it is admittedly crude, I like ridge regression because it is mathematically and computationally convenient, and consequently other forms of regularisation are rather ignored here. If the reader is interested in higher-order regularisation I suggest looking at the literature for a general overview and, for a specific example, at second-order regularisation in RBF networks.

We next describe ridge regression from the perspective of bias and variance (section 6.1) and how it affects the equations for the optimal weight vector (appendix A.4), the variance matrix (appendix A.5) and the projection matrix (appendix A.6). A method to select a good value for the regularisation parameter, based on a re-estimation formula (section 6.2), is then presented. Next comes a generalisation of ridge regression which, if radial basis functions (section 3.1) are used, can be justly called local ridge regression (section 6.3). It involves multiple regularisation parameters, and we describe a method (section 6.4) for their optimisation. Finally, we illustrate with a simple example (section 6.5).

6.1 Bias and Variance

When the input is $\mathbf{x}$ the trained model predicts the output as $f(\mathbf{x})$. If we had many training sets (which we never do, but just suppose) and if we knew the true output $y(\mathbf{x})$, we could calculate the mean-squared-error as

$$ \mathrm{MSE} = \big\langle (y(\mathbf{x}) - f(\mathbf{x}))^2 \big\rangle, $$

where the expectation (averaging), indicated by $\langle\cdot\rangle$, is taken over the training sets. This score, which tells us how good the average prediction is, can be broken down into two components, namely

$$ \mathrm{MSE} = \big(y(\mathbf{x}) - \langle f(\mathbf{x})\rangle\big)^2 + \big\langle (f(\mathbf{x}) - \langle f(\mathbf{x})\rangle)^2 \big\rangle. $$

The first part is the bias and the second part is the variance.

If $\langle f(\mathbf{x})\rangle = y(\mathbf{x})$ for all $\mathbf{x}$ then the model is unbiased (the bias is zero). However, an unbiased model may still have a large mean-squared-error if it has a large variance. This will be the case if $f(\mathbf{x})$ is highly sensitive to the peculiarities (such as noise and the choice of sample points) of each particular training set, and it is this sensitivity which causes regression problems to be ill-posed in the Tikhonov sense. Often, however, the variance can be significantly reduced by deliberately introducing a small amount of bias, so that the net effect is a reduction in mean-squared-error.

Introducing bias is equivalent to restricting the range of functions for which a model can account. Typically this is achieved by removing degrees of freedom. Examples would be lowering the order of a polynomial or reducing the number of weights in a neural network. Ridge regression does not explicitly remove degrees of freedom but instead reduces the effective number of parameters (section 4.4). The resulting loss of flexibility makes the model less sensitive.

A convenient, if somewhat arbitrary, method of restricting the flexibility of linear models (section 3) is to augment the sum-squared-error with a term which penalises large weights,

$$ C = \sum_{i=1}^p \big(\hat{y}_i - f(\mathbf{x}_i)\big)^2 + \lambda \sum_{j=1}^m w_j^2. $$

This is ridge regression (weight decay), and the regularisation parameter $\lambda$ controls the balance between fitting the data and avoiding the penalty. A small value for $\lambda$ means the data can be fit tightly without causing a large penalty; a large value for $\lambda$ means a tight fit has to be sacrificed if it requires large weights. The bias introduced favours solutions involving small weights, and the effect is to smooth the output function, since large weights are usually required to produce a highly variable (rough) output function.

The optimal weight vector for the above cost function has already been dealt with (appendix A.4), as have the variance matrix (appendix A.5) and the projection matrix (appendix A.6). In summary,

$$ A^{-1} = \big(\mathbf{H}^T\mathbf{H} + \lambda I_m\big)^{-1}, $$
$$ \hat{\mathbf{w}} = A^{-1}\mathbf{H}^T\hat{\mathbf{y}}, $$
$$ P = I_p - \mathbf{H}A^{-1}\mathbf{H}^T. $$

6.2 Optimising the Regularisation Parameter

Some sort of model selection (section 5) must be used to choose a value for the regularisation parameter $\lambda$. The value chosen is the one associated with the lowest prediction error. But which method should be used to predict the error, and how is the optimal value found?

The answer to the first question is that nobody knows for sure. The popular choices are leave-one-out cross-validation, generalised cross-validation, final prediction error and the Bayesian information criterion. Then there are also bootstrap methods. Our approach will be to use the most convenient method, as long as there are no serious objections to it, and the most convenient method is generalised cross-validation (GCV). It leads to the simplest optimisation formulae, especially in local optimisation (section 6.4).

Since all the model selection criteria depend nonlinearly on $\lambda$, we need a method of nonlinear optimisation. We could use any of the standard techniques for this, such as the Newton method, and in fact that has been done. Alternatively, we can exploit the fact that when the derivative of the GCV error prediction is set to zero, the resulting equation can be manipulated so that $\lambda$ only appears on the left hand side (see appendix A.10):

$$ \lambda = \frac{\hat{\mathbf{y}}^T P^2 \hat{\mathbf{y}}\; \mathrm{trace}\big(A^{-1} - \lambda A^{-2}\big)}{\hat{\mathbf{w}}^T A^{-1} \hat{\mathbf{w}}\; \mathrm{trace}(P)}. $$

This is not a solution, it is a re-estimation formula, because the right hand side depends on $\lambda$ (explicitly, as well as implicitly through $A^{-1}$ and $P$). To use it, an initial value of $\lambda$ is chosen and used to calculate a value for the right hand side. This leads to a new estimate, and the process can be repeated until convergence.

6.3 Local Ridge Regression

Instead of treating all weights equally with the penalty term $\lambda\sum_{j=1}^m w_j^2$, we can treat them all separately and have a regularisation parameter associated with each basis function by using $\sum_{j=1}^m \lambda_j w_j^2$. The new cost function is

$$ C = \sum_{i=1}^p \big(\hat{y}_i - f(\mathbf{x}_i)\big)^2 + \sum_{j=1}^m \lambda_j w_j^2, $$

and is identical to the standard form if the regularisation parameters are all equal ($\lambda_j = \lambda$). Then the variance matrix (appendix A.5) is

$$ A^{-1} = \big(\mathbf{H}^T\mathbf{H} + \Lambda\big)^{-1}, $$

where $\Lambda$ is a diagonal matrix with the regularisation parameters $\{\lambda_j\}$ along its diagonal. As usual, the optimal weight vector is

$$ \hat{\mathbf{w}} = A^{-1}\mathbf{H}^T\hat{\mathbf{y}}, $$

and the projection matrix is

$$ P = I_p - \mathbf{H}A^{-1}\mathbf{H}^T. $$

These formulae are identical to standard ridge regression (section 6) except for the variance matrix, where $\lambda I_m$ has been replaced by $\Lambda$.

In general there is nothing local about this form of weight decay. However, if we confine ourselves to local basis functions, such as radial functions (section 3.1) (but not the multiquadric type, which are seldom used in practice), then the smoothness produced by this form of ridge regression is controlled in a local fashion by the individual regularisation parameters. That is why we call it local ridge regression, and it provides a mechanism to adapt smoothness to local conditions. Standard ridge regression, with just one parameter $\lambda$ to control the bias-variance trade-off, has difficulty with functions which have significantly different smoothness in different parts of the input space.

6.4 Optimising the Regularisation Parameters

To take advantage of the adaptive smoothing capability provided by local ridge regression requires the use of model selection criteria (section 5) to optimise the regularisation parameters. Information about how much smoothing to apply in different parts of the input space may be present in the data, and the model selection criteria can help to extract it.

The selection criteria depend mainly on the projection matrix $P$, and therefore we need to deduce its dependence on the individual regularisation parameters. The relevant relationship is one of the incremental operations (section 4.3). Adapting the notation somewhat, it is

$$ P = P_{-j} - \frac{P_{-j}\,\mathbf{h}_j \mathbf{h}_j^T\, P_{-j}}{\lambda_j + \mathbf{h}_j^T P_{-j} \mathbf{h}_j}, $$

where $P_{-j}$ is the projection matrix after the $j$-th basis function has been removed and $\mathbf{h}_j$ is the $j$-th column of the design matrix.

In contrast to the case of standard ridge regression (section 6.2), there is an analytic solution for the optimal value of $\lambda_j$ based on GCV minimisation; no re-estimation is necessary (see appendix A.11). The trouble is that there are $m-1$ other parameters to optimise, and each time one is optimised it changes the optimal value of each of the others. Optimising all the parameters together has to be done as a kind of re-estimation, doing one at a time and then repeating until they all converge.

When $\lambda_j = \infty$ the two projection matrices $P$ and $P_{-j}$ are equal. This means that if the optimal value of $\lambda_j$ is $\infty$ then the $j$-th basis function can be removed from the network. In practice, especially if the network is initially very flexible (high variance, low bias; see section 6.1), infinite optimal values are very common, and local ridge regression can be used as a method of pruning unnecessary hidden units.

The algorithm can get stuck in local minima, like any other nonlinear optimisation, depending on the initial settings. For this reason it is best in practice to give the algorithm a head start by using the results from other methods, rather than starting with random parameters. For example, standard ridge regression (section 6.2) can be used to find the best global parameter $\hat{\lambda}$, and the local algorithm can then start from

$$ \lambda_j = \hat{\lambda}, \qquad 1 \le j \le m. $$

Alternatively, forward selection (section 7) can be used to choose a subset $S$ of the original $m$ basis functions, in which case the local algorithm can start from

$$ \lambda_j = \begin{cases} 0 & \text{if } j \in S, \\ \infty & \text{otherwise.} \end{cases} $$

6.5 Example

Figure 9 shows four different fits (the solid red curves) to a training set of $p$ patterns randomly sampled from a sine wave between $x = 0$ and $x = 1$ with Gaussian noise added. The training set input-output pairs are shown in blue and the true target by dashed magenta curves. The model used is a radial basis function network (section 3.2) with $m = p$ Gaussian functions of width $r$, whose positions coincide with the training set input points. Each fit uses standard ridge regression (section 6) but with four different values of the regularisation parameter $\lambda$.

The first fit (top left) is for $\lambda = 10^{-10}$ and is too rough: high weights have not been penalised enough. The last fit (bottom right) is for $\lambda = 10^{5}$ and is too smooth: high weights have been penalised too much. The other two fits, for $\lambda = 10^{-5}$ (top right) and $\lambda = 1$ (bottom left), are just about right; there is not much to choose between them.

Figure 9: Four different RBF fits (solid red curves) to data (blue crosses) sampled from a sine wave (dashed magenta curve), for lambda = 1e-10, 1e-5, 1 and 1e5.

Figure 10: (a) the effective number of parameters gamma and (b) RMSE, as functions of lambda. The optimal value is shown with a star.

The variation of the effective number of parameters $\gamma$ (section 4.4) as a function of $\lambda$ is shown in figure 10(a). Clearly $\gamma$ decreases monotonically as $\lambda$ increases and the RBF network loses flexibility. Figure 10(b) shows root-mean-squared-error (RMSE) as a function of $\lambda$. RMSE was calculated using an array of noiseless samples of the target between $x = 0$ and $x = 1$. The figure indicates the best value (minimum RMSE) for the regularisation parameter.

In real applications, where the target is of course unknown, we do not, unfortunately, have access to RMSE. Then we must use one of the model selection criteria (section 5) to find parameters like $\lambda$. The solid red curve in figure 11 shows the variation of GCV over a range of $\lambda$ values. The re-estimation formula (section 6.2) based on GCV converges to the value at the global GCV minimum when started from a suitable initial value. Note, however, that had we started the re-estimation near the other, local, minimum of GCV, that minimum would have been the final resting place. The values of $\lambda$, $\gamma$ and RMSE at the optimum are marked with stars in the figures.

Figure 11: GCV as a function of lambda, with the optimal value shown with a star.

The values $\lambda_j = \hat{\lambda}$ were used to initialise a local ridge regression algorithm which continually sweeps through the regularisation parameters, optimising each in turn, until the GCV error prediction converges. This algorithm reduced the GCV error prediction below its initial value at the optimal global parameter. At the same time, a number of the regularisation parameters ended up with an optimal value of $\infty$, enabling the corresponding hidden units to be pruned from the network.

7 Forward Selection

We previously looked at ridge regression (section 6) as a means of controlling the balance between bias and variance (section 6.1) by varying the effective number of parameters (section 4.4) in a linear model (section 3) of fixed size.

An alternative strategy is to compare models made up of different subsets of basis functions drawn from the same fixed set of candidates. This is called subset selection in statistics. To find the best subset is usually intractable (there are $2^M - 1$ non-empty subsets in a set of size $M$), so heuristics must be used to search a small but hopefully interesting fraction of the space of all subsets. One of these heuristics is called forward selection, which starts with an empty subset to which is added one basis function at a time, the one which most reduces the sum-squared-error, until some chosen criterion, such as GCV (section 5.2), stops decreasing. Another is backward elimination, which starts with the full subset from which is removed one basis function at a time, the one which least increases the sum-squared-error, until, once again, the chosen criterion stops decreasing.

It is interesting to compare subset selection with the standard way of optimising neural networks. The latter involves the optimisation by gradient descent of a nonlinear sum-squared-error surface in a high-dimensional space defined by the network parameters. In RBF networks (section 3.2) the network parameters are the centres, sizes and hidden-to-output weights. In subset selection the optimisation algorithm searches in a discrete space of subsets of a set of hidden units with fixed centres and sizes, and tries to find the subset with the lowest prediction error (section 5). The hidden-to-output weights are not selected; they are slaved to the centres and sizes of the chosen subset. Forward selection is also, of course, a nonlinear type of algorithm, but it has the following advantages:

- There is no need to fix the number of hidden units in advance.
- The model selection criteria are tractable.
- The computational requirements are relatively low.

In forward selection each step involves growing the network by one basis function. Adding a new basis function is one of the incremental operations (section 4.3). The key equation (appendix A.12) is

$$ P_{m+1} = P_m - \frac{P_m\,\mathbf{f}_J \mathbf{f}_J^T\, P_m}{\mathbf{f}_J^T P_m \mathbf{f}_J}, $$

which expresses the relationship between $P_m$, the projection matrix for the $m$ hidden units in the current subset, and $P_{m+1}$, the succeeding projection matrix if the $J$-th member of the full set is added. The vectors $\{\mathbf{f}_J\}_{J=1}^M$ are the columns of the design matrix for the entire set of candidate basis functions,

$$ \mathbf{F} = \begin{bmatrix} \mathbf{f}_1 & \mathbf{f}_2 & \cdots & \mathbf{f}_M \end{bmatrix}, $$

of which there are $M$ (where $M > m$).

If the $J$-th basis function is chosen then $\mathbf{f}_J$ is appended to the last column of $\mathbf{H}_m$, the design matrix of the current subset. This column is renamed $\mathbf{h}_{m+1}$ and the new design matrix is $\mathbf{H}_{m+1}$. The choice of basis function can be based on finding the greatest decrease in sum-squared-error. From the update rule for the projection matrix and the equation for sum-squared-error,

$$ \hat{S}_m - \hat{S}_{m+1} = \frac{\big(\hat{\mathbf{y}}^T P_m \mathbf{f}_J\big)^2}{\mathbf{f}_J^T P_m \mathbf{f}_J} $$

(see appendix A.12). The maximum over $1 \le J \le M$ of this difference is used to find the best basis function to add to the current network, as sketched in the code below.

To decide when to stop adding further basis functions, one of the model selection criteria (section 5), such as GCV, is monitored. Although the training error $\hat{S}$ will never increase as extra functions are added, GCV will eventually stop decreasing and start to increase as overfit sets in. That is the point at which to cease adding to the network. See section 7.4 for an illustration.
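A minimal sketch of plain forward selection (Python/NumPy, not from the original text; the candidate set and data are toy values): each step adds the candidate that maximises the drop in sum-squared-error and the loop stops when GCV rises.

```python
import numpy as np

def forward_selection(F, y):
    """F: (p, M) candidate design matrix; returns the indices of the chosen subset."""
    p, M = F.shape
    P = np.eye(p)                                # projection matrix for the empty subset
    subset, best_gcv = [], np.inf
    for _ in range(M):
        num = (y @ P @ F) ** 2                   # (y^T P_m f_J)^2 for every candidate J
        den = np.einsum('ij,ij->j', F, P @ F)    # f_J^T P_m f_J
        den[subset] = np.inf                     # never re-select a chosen column
        J = int(np.argmax(num / den))
        f = F[:, J]
        P_new = P - np.outer(P @ f, P @ f) / (f @ P @ f)
        gamma = p - np.trace(P_new)
        gcv = p * (y @ P_new @ P_new @ y) / (p - gamma) ** 2
        if gcv >= best_gcv:                      # GCV has stopped decreasing: overfit begins
            break
        best_gcv, P, subset = gcv, P_new, subset + [J]
    return subset

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, 40)
F = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.3 ** 2)   # one candidate RBF per input
y = np.tanh(3 * X) + 0.1 * rng.normal(size=40)
print(forward_selection(F, y))
```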

7.1 Orthogonal Least Squares

Forward selection is a relatively fast algorithm, but it can be speeded up even further using a technique called orthogonal least squares. This is a Gram-Schmidt orthogonalisation process which ensures that each new column added to the design matrix of the growing subset is orthogonal to all previous columns. This simplifies the equation for the change in sum-squared-error and results in a more efficient algorithm.

Any matrix can be factored into the product of a matrix with orthogonal columns and a matrix which is upper triangular. In particular, the design matrix $\mathbf{H}_m \in \mathbb{R}^{p \times m}$ can be factored into

$$ \mathbf{H}_m = \tilde{\mathbf{H}}_m U_m, $$

where

$$ \tilde{\mathbf{H}}_m = \begin{bmatrix} \tilde{\mathbf{h}}_1 & \tilde{\mathbf{h}}_2 & \cdots & \tilde{\mathbf{h}}_m \end{bmatrix} \in \mathbb{R}^{p \times m} $$

has orthogonal columns ($\tilde{\mathbf{h}}_i^T \tilde{\mathbf{h}}_j = 0$, $i \ne j$) and $U_m \in \mathbb{R}^{m \times m}$ is upper triangular.

When considering whether to add the basis function corresponding to the $J$-th column $\mathbf{f}_J$ of the full design matrix, the projection of $\mathbf{f}_J$ in the space already spanned by the $m$ columns of the current design matrix is irrelevant. Only its projection perpendicular to this space, namely

$$ \tilde{\mathbf{f}}_J = \mathbf{f}_J - \sum_{j=1}^m \frac{\tilde{\mathbf{h}}_j^T \mathbf{f}_J}{\tilde{\mathbf{h}}_j^T \tilde{\mathbf{h}}_j}\, \tilde{\mathbf{h}}_j, $$

can contribute to a further reduction in the training error, and this reduction is

$$ \hat{S}_m - \hat{S}_{m+1} = \frac{\big(\hat{\mathbf{y}}^T \tilde{\mathbf{f}}_J\big)^2}{\tilde{\mathbf{f}}_J^T \tilde{\mathbf{f}}_J}, $$

as shown in appendix A.13. Computing this change in sum-squared-error requires of order $p$ floating point operations, compared with order $p^2$ for the unnormalised version. This is the basis of the increased efficiency of orthogonal least squares.

A small overhead is necessary to maintain the columns of the full design matrix orthogonal to the space spanned by the columns of the growing design matrix and to update the upper triangular matrix. After $\tilde{\mathbf{f}}_J$ is selected, the new orthogonalised full design matrix is

$$ \tilde{\mathbf{F}}_{m+1} = \tilde{\mathbf{F}}_m - \frac{\tilde{\mathbf{f}}_J\, \tilde{\mathbf{f}}_J^T \tilde{\mathbf{F}}_m}{\tilde{\mathbf{f}}_J^T \tilde{\mathbf{f}}_J}, $$

and the upper triangular matrix is updated to

$$ U_{m+1} = \begin{bmatrix} U_m & (\tilde{\mathbf{H}}_m^T \tilde{\mathbf{H}}_m)^{-1} \tilde{\mathbf{H}}_m^T \mathbf{f}_J \\ \mathbf{0}^T & 1 \end{bmatrix}. $$

Initially $U_1 = 1$ and $\tilde{\mathbf{F}}_1 = \mathbf{F}$. The orthogonalised optimal weight vector,

$$ \tilde{\mathbf{w}}_m = (\tilde{\mathbf{H}}_m^T \tilde{\mathbf{H}}_m)^{-1} \tilde{\mathbf{H}}_m^T \hat{\mathbf{y}}, $$

and the unorthogonalised optimal weight are then related by

$$ \hat{\mathbf{w}}_m = U_m^{-1} \tilde{\mathbf{w}}_m $$

(see appendix A.13).
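A sketch of the orthogonalisation step (Python/NumPy, not from the original text; toy data, and the upper triangular bookkeeping is omitted for brevity): the candidate columns are kept orthogonal to the growing subset, so the error reduction for each candidate is a cheap one-dimensional computation.

```python
import numpy as np

def ols_forward_selection(F, y, n_select):
    """Greedy selection of n_select columns of F using orthogonal least squares."""
    Ft = F.astype(float).copy()          # candidate columns, orthogonalised on the fly
    chosen = []
    for _ in range(n_select):
        norms = np.einsum('ij,ij->j', Ft, Ft)
        norms[chosen] = np.inf                        # exclude already-chosen columns
        scores = (y @ Ft) ** 2 / norms                # (y^T f~_J)^2 / (f~_J^T f~_J)
        J = int(np.argmax(scores))
        f = Ft[:, J].copy()
        chosen.append(J)
        # keep the remaining candidates orthogonal to the space spanned so far
        Ft -= np.outer(f, (f @ Ft) / (f @ f))
    return chosen

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, 30)
F = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.3 ** 2)
y = np.sin(4 * X) + 0.1 * rng.normal(size=30)
print(ols_forward_selection(F, y, n_select=5))
```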

7.2 Regularised Forward Selection

Forward selection and standard ridge regression (section 6) can be combined, leading to a modest improvement in performance. In this case the key equation involves the regularisation parameter $\lambda$:

$$ P_{m+1} = P_m - \frac{P_m\,\mathbf{f}_J \mathbf{f}_J^T\, P_m}{\lambda + \mathbf{f}_J^T P_m \mathbf{f}_J}. $$

The search for the maximum decrease in the sum-squared-error is then based on

$$ \hat{S}_m - \hat{S}_{m+1} = \frac{2\,\hat{\mathbf{y}}^T P_m^2 \mathbf{f}_J\; \hat{\mathbf{y}}^T P_m \mathbf{f}_J}{\lambda + \mathbf{f}_J^T P_m \mathbf{f}_J} - \frac{\big(\hat{\mathbf{y}}^T P_m \mathbf{f}_J\big)^2\; \mathbf{f}_J^T P_m^2 \mathbf{f}_J}{\big(\lambda + \mathbf{f}_J^T P_m \mathbf{f}_J\big)^2} $$

(see appendix A.14). Alternatively, it is possible to search for the maximum decrease in the cost function,

$$ \hat{C}_m - \hat{C}_{m+1} = \frac{\big(\hat{\mathbf{y}}^T P_m \mathbf{f}_J\big)^2}{\lambda + \mathbf{f}_J^T P_m \mathbf{f}_J} $$

(see appendix A.14).

The advantage of ridge regression is that the regularisation parameter can be optimised (section 6.2) in between the addition of new basis functions. Since new additions will cause a change in the optimal value anyway, there is little point in running the re-estimation formula to convergence. Good practical results have been obtained by performing only a single iteration of the re-estimation formula after each new selection.

7.3 Regularised Orthogonal Least Squares

Orthogonal least squares (section 7.1) works because, after factoring $\mathbf{H}_m$, the orthogonalised variance matrix is diagonal:

$$ \tilde{A}_m^{-1} = (\tilde{\mathbf{H}}_m^T \tilde{\mathbf{H}}_m)^{-1} = \mathrm{diag}\big(\tilde{\mathbf{h}}_1^T\tilde{\mathbf{h}}_1,\; \tilde{\mathbf{h}}_2^T\tilde{\mathbf{h}}_2,\; \dots,\; \tilde{\mathbf{h}}_m^T\tilde{\mathbf{h}}_m\big)^{-1} = U_m A_m^{-1} U_m^T. $$

$\tilde{A}_m^{-1}$ is simple to compute, while the inverse of the upper triangular matrix is not required to calculate the change in sum-squared-error. However, when standard ridge regression (section 6) is used,

$$ A_m^{-1} = \big(\mathbf{H}_m^T\mathbf{H}_m + \lambda I_m\big)^{-1} = \big(U_m^T \tilde{\mathbf{H}}_m^T \tilde{\mathbf{H}}_m U_m + \lambda I_m\big)^{-1}, $$

and this expression cannot be simplified any further. A solution to this problem is to slightly alter the nature of the regularisation so that

$$ A_m^{-1} = \big(\mathbf{H}_m^T\mathbf{H}_m + \lambda\, U_m^T U_m\big)^{-1}, $$

in which case

$$ \tilde{A}_m^{-1} = U_m A_m^{-1} U_m^T = \big(\tilde{\mathbf{H}}_m^T\tilde{\mathbf{H}}_m + \lambda I_m\big)^{-1} = \mathrm{diag}\big(\tilde{\mathbf{h}}_1^T\tilde{\mathbf{h}}_1 + \lambda,\; \dots,\; \tilde{\mathbf{h}}_m^T\tilde{\mathbf{h}}_m + \lambda\big)^{-1}, $$

which is again diagonal. This means that, instead of the normal cost function for standard ridge regression (section 6), the cost is actually

$$ C = (\hat{\mathbf{y}} - \mathbf{H}_m\mathbf{w})^T(\hat{\mathbf{y}} - \mathbf{H}_m\mathbf{w}) + \lambda\, \mathbf{w}^T U_m^T U_m \mathbf{w}. $$

Then the change in sum-squared-error is

$$ \hat{S}_m - \hat{S}_{m+1} = \frac{\big(\tilde{\mathbf{f}}_J^T\tilde{\mathbf{f}}_J + 2\lambda\big)\,\big(\hat{\mathbf{y}}^T\tilde{\mathbf{f}}_J\big)^2}{\big(\tilde{\mathbf{f}}_J^T\tilde{\mathbf{f}}_J + \lambda\big)^2}. $$

Alternatively, searching for the best basis function to add could be made on the basis of the change in cost, which is simply

$$ \hat{C}_m - \hat{C}_{m+1} = \frac{\big(\hat{\mathbf{y}}^T\tilde{\mathbf{f}}_J\big)^2}{\tilde{\mathbf{f}}_J^T\tilde{\mathbf{f}}_J + \lambda}. $$

For details see appendix A.15.

7.4 Example

Figure 12 shows a set of $p$ training examples sampled, with Gaussian noise, from the logistic function

$$ y(x) = \frac{1 - e^{-x}}{1 + e^{-x}}. $$

Figure 12: Training set, target function and forward selection fit.

This data is to be modelled by forward selection from a set of $M$ Gaussian radial functions (section 3.1) coincident with the training set inputs and of fixed width $r$. Figure 13 shows the sum-squared-error, generalised cross-validation and the mean-squared-error between the fit and target (based on an independent test set) as new basis functions are added one at a time.

The training set error monotonically decreases as more basis functions are added, but the GCV error prediction eventually starts to increase, at the point marked with a star. This is the signal to stop adding basis functions, which in this example happily coincides with the minimum test set error. The fit with these basis functions is shown in figure 12. Further additions only serve to make GCV and MSE worse. Eventually, after enough additions, the variance matrix becomes ill-conditioned and numerical calculations are unreliable.

Figure 13: The progress of SSE, GCV and MSE as new basis functions are added in plain vanilla forward selection. The minimum GCV and corresponding MSE are marked with stars. After a certain number of selections the variance matrix becomes badly conditioned.

Figure 14: The progress of SSE, GCV and MSE as new basis functions are added in regularised forward selection.

The results of using regularised forward selection (section 7.2) are shown in figure 14. The sum-squared-error no longer decreases monotonically because $\lambda$ is re-estimated after each selection. Also, regularisation prevents the variance matrix becoming ill-conditioned, even after all the candidate basis functions are in the subset.

In this example the two methods chose similar (but not identical) subsets and ended with very close test set errors. However, on average, over a large number of similar training sets, the regularised version performs slightly better than the plain vanilla version.

A Appendices

A.1 Notational Conventions

Scalars are represented by italicised letters such as $y$ or $\lambda$. Vectors are represented by bold lower case letters such as $\mathbf{x}$ and $\mathbf{w}$. The first component of $\mathbf{x}$, a vector, is $x_1$, a scalar. Vectors are single-column matrices, so, for example, if $\mathbf{x}$ has $n$ components then

$$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. $$

Matrices, such as $\mathbf{H}$, are represented by bold capital letters. If the rows of $\mathbf{H}$ are indexed by $i$ and the columns by $j$, then the entry in the $i$-th row and $j$-th column of $\mathbf{H}$ is $H_{ij}$. If $\mathbf{H}$ has $p$ rows and $m$ columns, $\mathbf{H} \in \mathbb{R}^{p \times m}$, then

$$ \mathbf{H} = \begin{bmatrix}
H_{11} & H_{12} & \cdots & H_{1m} \\
H_{21} & H_{22} & \cdots & H_{2m} \\
\vdots & \vdots & & \vdots \\
H_{p1} & H_{p2} & \cdots & H_{pm}
\end{bmatrix}. $$

The transpose of a matrix (the rows and columns swapped) is represented by $\mathbf{H}^T$. So, for example, if $\mathbf{G} = \mathbf{H}^T$ then

$$ G_{ji} = H_{ij}. $$

The transpose of a vector is a single-row matrix, $\mathbf{x}^T = (x_1\; x_2\; \dots\; x_n)$. It follows that $\mathbf{x} = (x_1\; x_2\; \dots\; x_n)^T$, which is a useful way of writing vectors on a single line. Also, the common operation of vector dot-product can be written as the transpose of one vector multiplied by the other vector. So instead of writing $\mathbf{x}\cdot\mathbf{y}$ we can just write $\mathbf{x}^T\mathbf{y}$.

The identity matrix, written $I$, is a square matrix (one with equal numbers of rows and columns) with diagonal entries of 1 and zero elsewhere. The dimension is written as a subscript, as in $I_m$, which is the identity matrix of size $m$ ($m$ rows and $m$ columns).

The inverse of a square matrix $A$ is written $A^{-1}$. If $A$ has $m$ rows and $m$ columns then

$$ A^{-1}A = AA^{-1} = I_m, $$

which defines the inverse operation.

Estimated or uncertain values are distinguished by the use of the hat symbol. For example, $\hat{\lambda}$ is an estimated value for $\lambda$ and $\hat{\mathbf{w}}$ is an estimate of $\mathbf{w}$.

A.2 Useful Properties of Matrices

Some useful definitions and properties of matrices are given below and illustrated with some of the matrices and vectors commonly appearing in the main text.

The elements of a diagonal matrix are zero off the diagonal. An example is the matrix of regularisation parameters appearing in the equation for the optimal weight in local ridge regression,

$$ \Lambda = \mathrm{diag}(\lambda_1,\, \lambda_2,\, \dots,\, \lambda_m). $$

Non-square matrices can also be diagonal: if $\mathbf{M}$ is any matrix, then it is diagonal if $M_{ij} = 0$ for $i \ne j$.

A symmetric matrix is identical to its own transpose. Examples are the variance matrix $A^{-1} \in \mathbb{R}^{m \times m}$ and the projection matrix $P \in \mathbb{R}^{p \times p}$. Any square diagonal matrix, such as the identity matrix, is necessarily symmetric.

The inverse of an orthogonal matrix is its own transpose. If $\mathbf{V}$ is orthogonal then

$$ \mathbf{V}^T\mathbf{V} = \mathbf{V}\mathbf{V}^T = I_m. $$

Any matrix can be decomposed into the product of two orthogonal matrices and a diagonal one. This is called singular value decomposition (SVD). For example, the design matrix $\mathbf{H} \in \mathbb{R}^{p \times m}$ decomposes into

$$ \mathbf{H} = \mathbf{U}\,\Sigma\,\mathbf{V}^T, $$

where $\mathbf{U} \in \mathbb{R}^{p \times p}$ and $\mathbf{V} \in \mathbb{R}^{m \times m}$ are orthogonal and $\Sigma \in \mathbb{R}^{p \times m}$ is diagonal.

The trace of a square matrix is the sum of its diagonal elements. The trace of the product of a sequence of matrices is unaffected by rotation of the order. For example,

$$ \mathrm{trace}\big(\mathbf{H}A^{-1}\mathbf{H}^T\big) = \mathrm{trace}\big(A^{-1}\mathbf{H}^T\mathbf{H}\big). $$

The transpose of a product is equal to the product of the individual transposes in reverse order. For example,

$$ \big(A^{-1}\mathbf{H}^T\hat{\mathbf{y}}\big)^T = \hat{\mathbf{y}}^T\mathbf{H}A^{-1} $$

($A^{-1}$ is symmetric, so $(A^{-1})^T = A^{-1}$).

The inverse of a product of square matrices is equal to the product of the individual inverses in reverse order. For example,

$$ \big(\Sigma\,\mathbf{V}^T\big)^{-1} = \big(\mathbf{V}^T\big)^{-1}\Sigma^{-1} = \mathbf{V}\,\Sigma^{-1} $$

($\mathbf{V}$ is orthogonal, so $\mathbf{V}^{-1} = \mathbf{V}^T$ and $(\mathbf{V}^T)^{-1} = \mathbf{V}$).

A.3 Radial Basis Functions

The most general formula for any radial basis function (RBF) is
$$
h(\mathbf{x}) = \phi\!\left((\mathbf{x} - \mathbf{c})^\top\mathbf{R}^{-1}(\mathbf{x} - \mathbf{c})\right),
$$
where $\phi$ is the function used (Gaussian, multiquadric, etc.), $\mathbf{c}$ is the centre and $\mathbf{R}$ is the metric. The term $(\mathbf{x} - \mathbf{c})^\top\mathbf{R}^{-1}(\mathbf{x} - \mathbf{c})$ is the distance between the input $\mathbf{x}$ and the centre $\mathbf{c}$ in the metric defined by $\mathbf{R}$. There are several common types of function used, for example the Gaussian, $\phi(z) = e^{-z}$, the multiquadric, $\phi(z) = (1 + z)^{1/2}$, the inverse multiquadric, $\phi(z) = (1 + z)^{-1/2}$, and the Cauchy, $\phi(z) = (1 + z)^{-1}$.

Often the metric is Euclidean. In this case $\mathbf{R} = r^2\,\mathbf{I}$ for some scalar radius $r$ and the above equation simplifies to
$$
h(\mathbf{x}) = \phi\!\left(\frac{(\mathbf{x} - \mathbf{c})^\top(\mathbf{x} - \mathbf{c})}{r^2}\right).
$$
A further simplification is a one-dimensional input space, in which case we have
$$
h(x) = \phi\!\left(\frac{(x - c)^2}{r^2}\right).
$$
This function is illustrated in the figure below for the above RBF types.

Figure: Gaussian (green), multiquadric (magenta), inverse multiquadric (red) and Cauchy (cyan) RBFs, all of unit radius and all centred at the origin.
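To make the four profiles concrete, here is a small NumPy sketch I have added (it is not part of the original text or its accompanying Matlab package; the function name and arguments are my own choices):

```python
import numpy as np

# The four radial functions phi(z) discussed above, applied to the
# squared, scaled distance z = (x - c)^2 / r^2 in one dimension.
def rbf(x, c=0.0, r=1.0, kind="gaussian"):
    z = (x - c) ** 2 / r ** 2
    if kind == "gaussian":
        return np.exp(-z)
    if kind == "multiquadric":
        return np.sqrt(1.0 + z)
    if kind == "inverse_multiquadric":
        return 1.0 / np.sqrt(1.0 + z)
    if kind == "cauchy":
        return 1.0 / (1.0 + z)
    raise ValueError(kind)

x = np.linspace(-3, 3, 7)
print(rbf(x, kind="gaussian"))   # peaks at x = c, falls off with |x - c|
```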

A.4 The Optimal Weight Vector

As is well known from elementary calculus, to find an extremum of a function you

1. differentiate the function with respect to the free variables,
2. equate the results with zero, and
3. solve the resulting equations.

In the case of least squares applied to supervised learning with a linear model, the function to be minimised is the sum-squared-error
$$
S = \sum_{i=1}^{p}\left(y_i - f(\mathbf{x}_i)\right)^2,
$$
where
$$
f(\mathbf{x}) = \sum_{j=1}^{m} w_j\, h_j(\mathbf{x})
$$
and the free variables are the weights $\{w_j\}_{j=1}^{m}$.

To avoid repeating two very similar analyses, we shall find the minimum not of $S$ but of the cost function
$$
C = \sum_{i=1}^{p}\left(y_i - f(\mathbf{x}_i)\right)^2 + \sum_{j=1}^{m}\lambda_j\, w_j^2
$$
used in ridge regression. This includes an additional weight penalty term controlled by the values of the non-negative regularisation parameters $\{\lambda_j\}_{j=1}^{m}$. To get back to ordinary least squares, without any weight penalty, is simply a matter of setting all the regularisation parameters to zero.

So let us carry out this optimisation for the $j$-th weight. First, differentiating the cost function,
$$
\frac{\partial C}{\partial w_j}
= 2\sum_{i=1}^{p}\left(f(\mathbf{x}_i) - y_i\right)\frac{\partial f}{\partial w_j}(\mathbf{x}_i) + 2\,\lambda_j\, w_j.
$$
We now need the derivative of the model which, because the model is linear, is particularly simple:
$$
\frac{\partial f}{\partial w_j}(\mathbf{x}_i) = h_j(\mathbf{x}_i).
$$
Substituting this into the derivative of the cost function and equating the result to zero leads to the equation
$$
\sum_{i=1}^{p} f(\mathbf{x}_i)\, h_j(\mathbf{x}_i) + \lambda_j\, w_j = \sum_{i=1}^{p} y_i\, h_j(\mathbf{x}_i).
$$
There are $m$ such equations, for $1 \le j \le m$, each representing one constraint on the solution. Since there are exactly as many constraints as there are unknowns, the system of equations has, except under certain pathological conditions, a unique solution.

To find that unique solution we employ the language of matrices and vectors: linear algebra. These are invaluable for the representation and analysis of systems of linear equations like the one above, which can be rewritten in vector notation (see appendix A.1) as
$$
\mathbf{h}_j^\top\mathbf{f} + \lambda_j\,\hat{w}_j = \mathbf{h}_j^\top\mathbf{y},
$$
where
$$
\mathbf{h}_j = \begin{bmatrix} h_j(\mathbf{x}_1) \\ h_j(\mathbf{x}_2) \\ \vdots \\ h_j(\mathbf{x}_p) \end{bmatrix},
\qquad
\mathbf{f} = \begin{bmatrix} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_p) \end{bmatrix},
\qquad
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix}.
$$
Since there is one of these equations (each relating one scalar quantity to another) for each value of $j$ from 1 up to $m$, we can stack them, one on top of another, to create a relation between two vector quantities:
$$
\begin{bmatrix}
\mathbf{h}_1^\top\mathbf{f} + \lambda_1\,\hat{w}_1 \\
\mathbf{h}_2^\top\mathbf{f} + \lambda_2\,\hat{w}_2 \\
\vdots \\
\mathbf{h}_m^\top\mathbf{f} + \lambda_m\,\hat{w}_m
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{h}_1^\top\mathbf{y} \\
\mathbf{h}_2^\top\mathbf{y} \\
\vdots \\
\mathbf{h}_m^\top\mathbf{y}
\end{bmatrix}.
$$
However, using the laws of matrix multiplication, this is just equivalent to
$$
\mathbf{H}^\top\mathbf{f} + \boldsymbol{\Lambda}\hat{\mathbf{w}} = \mathbf{H}^\top\mathbf{y},
$$
where
$$
\boldsymbol{\Lambda} = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_m
\end{bmatrix},
$$
and where $\mathbf{H}$, which is called the design matrix, has the vectors $\{\mathbf{h}_j\}_{j=1}^{m}$ as its columns,
$$
\mathbf{H} = \left[\, \mathbf{h}_1 \;\; \mathbf{h}_2 \;\; \cdots \;\; \mathbf{h}_m \,\right],
$$
and has $p$ rows, one for each pattern in the training set. Written out in full it is
$$
\mathbf{H} = \begin{bmatrix}
h_1(\mathbf{x}_1) & h_2(\mathbf{x}_1) & \cdots & h_m(\mathbf{x}_1) \\
h_1(\mathbf{x}_2) & h_2(\mathbf{x}_2) & \cdots & h_m(\mathbf{x}_2) \\
\vdots & \vdots & & \vdots \\
h_1(\mathbf{x}_p) & h_2(\mathbf{x}_p) & \cdots & h_m(\mathbf{x}_p)
\end{bmatrix},
$$
which is where the design matrix equation in the main text comes from.

The vector $\mathbf{f}$ can be decomposed into the product of two terms, the design matrix and the weight vector, since each of its components is a dot-product between two $m$-dimensional vectors. For example, the $i$-th component of $\mathbf{f}$, when the weights are at their optimal values, is
$$
f_i = f(\mathbf{x}_i) = \sum_{j=1}^{m}\hat{w}_j\, h_j(\mathbf{x}_i) = \mathbf{h}_i^\top\hat{\mathbf{w}},
$$
where
$$
\mathbf{h}_i = \begin{bmatrix} h_1(\mathbf{x}_i) \\ h_2(\mathbf{x}_i) \\ \vdots \\ h_m(\mathbf{x}_i) \end{bmatrix}.
$$
Note that while $\mathbf{h}_j$ is one of the columns of $\mathbf{H}$, $\mathbf{h}_i^\top$ is one of its rows. $\mathbf{f}$ is the result of stacking the $\{f_i\}_{i=1}^{p}$ one on top of the other, or
$$
\mathbf{f} = \begin{bmatrix}
\mathbf{h}_1^\top\hat{\mathbf{w}} \\
\mathbf{h}_2^\top\hat{\mathbf{w}} \\
\vdots \\
\mathbf{h}_p^\top\hat{\mathbf{w}}
\end{bmatrix}
= \mathbf{H}\hat{\mathbf{w}}.
$$
Finally, substituting this expression for $\mathbf{f}$ into the stacked equation above gives
$$
\mathbf{H}^\top\mathbf{y}
= \mathbf{H}^\top\mathbf{f} + \boldsymbol{\Lambda}\hat{\mathbf{w}}
= \mathbf{H}^\top\mathbf{H}\hat{\mathbf{w}} + \boldsymbol{\Lambda}\hat{\mathbf{w}}
= \left(\mathbf{H}^\top\mathbf{H} + \boldsymbol{\Lambda}\right)\hat{\mathbf{w}},
$$
the solution to which is
$$
\hat{\mathbf{w}} = \left(\mathbf{H}^\top\mathbf{H} + \boldsymbol{\Lambda}\right)^{-1}\mathbf{H}^\top\mathbf{y},
$$
which is where the optimal weight equation for local ridge regression in the main text comes from.

The latter equation is the most general form of the normal equation which we deal with here. There are two special cases. In standard ridge regression $\lambda_j = \lambda$ for all $1 \le j \le m$, so
$$
\hat{\mathbf{w}} = \left(\mathbf{H}^\top\mathbf{H} + \lambda\,\mathbf{I}_m\right)^{-1}\mathbf{H}^\top\mathbf{y},
$$
which is where the standard ridge regression weight equation comes from. Ordinary least squares, where there is no weight penalty, is obtained by setting all the regularisation parameters to zero, so
$$
\hat{\mathbf{w}} = \left(\mathbf{H}^\top\mathbf{H}\right)^{-1}\mathbf{H}^\top\mathbf{y}.
$$
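As an added illustration (not code from the original package; names and test data are arbitrary), the normal equation above can be solved directly in a few lines of NumPy, covering all three cases:

```python
import numpy as np

def ridge_weights(H, y, lam):
    """Solve (H^T H + Lambda) w = H^T y.

    lam may be a scalar (standard ridge regression), a vector of
    per-weight regularisation parameters (local ridge regression),
    or zero (ordinary least squares).  Solving the linear system is
    preferred to forming the inverse explicitly.
    """
    m = H.shape[1]
    Lam = np.diag(np.broadcast_to(lam, m).astype(float))
    A = H.T @ H + Lam
    return np.linalg.solve(A, H.T @ y)

# toy usage with a random design matrix
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 5))
y = rng.normal(size=50)
w_ols   = ridge_weights(H, y, 0.0)                  # ordinary least squares
w_ridge = ridge_weights(H, y, 0.1)                  # standard ridge
w_local = ridge_weights(H, y, np.arange(5) * 0.1)   # local ridge
```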

A.5 The Variance Matrix

What is the variance of the weight vector $\hat{\mathbf{w}}$? In other words, since the weights have been calculated upon the basis of a measurement $\mathbf{y}$ of a stochastic variable, what is the corresponding uncertainty in $\hat{\mathbf{w}}$?

The answer to this question depends on the nature and size of the uncertainty in $\mathbf{y}$ and the relationship between $\hat{\mathbf{w}}$ and $\mathbf{y}$, considered as random variables. In our case we assume that the noise affecting $\mathbf{y}$ is normal and independently, identically distributed,
$$
\left\langle (\mathbf{y} - \bar{\mathbf{y}})(\mathbf{y} - \bar{\mathbf{y}})^\top \right\rangle = \sigma^2\,\mathbf{I}_p,
$$
where $\sigma$ is the standard deviation of the noise, $\bar{\mathbf{y}}$ the mean value of $\mathbf{y}$, and the expectation (averaging) is taken over training sets. Since we consider only linear networks, the relationship between $\hat{\mathbf{w}}$ and $\mathbf{y}$ is given by
$$
\hat{\mathbf{w}} = \mathbf{A}^{-1}\mathbf{H}^\top\mathbf{y},
$$
where $\mathbf{A} = \mathbf{H}^\top\mathbf{H} + \boldsymbol{\Lambda}$ (see appendix A.4). Then the expected (mean) value of $\hat{\mathbf{w}}$ is
$$
\bar{\mathbf{w}} = \left\langle \mathbf{A}^{-1}\mathbf{H}^\top\mathbf{y} \right\rangle
= \mathbf{A}^{-1}\mathbf{H}^\top\langle\mathbf{y}\rangle
= \mathbf{A}^{-1}\mathbf{H}^\top\bar{\mathbf{y}},
$$
so the variance $\mathbf{W}$ is
$$
\begin{aligned}
\mathbf{W} &= \left\langle (\hat{\mathbf{w}} - \bar{\mathbf{w}})(\hat{\mathbf{w}} - \bar{\mathbf{w}})^\top \right\rangle \\
&= \mathbf{A}^{-1}\mathbf{H}^\top\left\langle (\mathbf{y} - \bar{\mathbf{y}})(\mathbf{y} - \bar{\mathbf{y}})^\top \right\rangle\mathbf{H}\mathbf{A}^{-1} \\
&= \sigma^2\,\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{H}\mathbf{A}^{-1}.
\end{aligned}
$$
In ordinary least squares $\mathbf{H}^\top\mathbf{H} = \mathbf{A}$, so $\mathbf{W} = \sigma^2\mathbf{A}^{-1}$. We shall refer to the matrix $\mathbf{A}^{-1}$ as the variance matrix because of its close link to the variance of $\hat{\mathbf{w}}$ in ordinary least squares. For standard ridge regression, when $\mathbf{H}^\top\mathbf{H} = \mathbf{A} - \lambda\,\mathbf{I}_m$,
$$
\mathbf{W} = \sigma^2\,\mathbf{A}^{-1}\left(\mathbf{A} - \lambda\,\mathbf{I}_m\right)\mathbf{A}^{-1}
= \sigma^2\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right).
$$
An important point is that if the training set has been used to estimate the regularisation parameters, as in ridge regression, or to choose the basis functions, as in forward selection, then $\mathbf{A}$, as well as $\mathbf{y}$, is a stochastic variable and there is no longer a simple linear relationship between uncertainty in $\hat{\mathbf{w}}$ and uncertainty in $\mathbf{y}$. In this case it is probably necessary to resort to bootstrap techniques to estimate $\mathbf{W}$.

A.6 The Projection Matrix

The prediction of the output at any of the training set inputs by the linear model, using the optimal weight vector of appendix A.4, is
$$
\hat{f}(\mathbf{x}_i) = \sum_{j=1}^{m}\hat{w}_j\, h_j(\mathbf{x}_i) = \mathbf{h}_i^\top\hat{\mathbf{w}},
$$
where $\mathbf{h}_i^\top$ is the $i$-th row of the design matrix. If we stack these predictions into a vector $\hat{\mathbf{f}}$ we get
$$
\hat{\mathbf{f}} = \begin{bmatrix}
\mathbf{h}_1^\top\hat{\mathbf{w}} \\
\mathbf{h}_2^\top\hat{\mathbf{w}} \\
\vdots \\
\mathbf{h}_p^\top\hat{\mathbf{w}}
\end{bmatrix}
= \mathbf{H}\hat{\mathbf{w}}
= \mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{y}.
$$
Therefore the error vector, between the predictions of the network from the training set inputs and the actual outputs observed in the training set, is
$$
\mathbf{y} - \hat{\mathbf{f}}
= \mathbf{y} - \mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{y}
= \left(\mathbf{I}_p - \mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top\right)\mathbf{y}
= \mathbf{P}\mathbf{y},
$$
where
$$
\mathbf{P} = \mathbf{I}_p - \mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top.
$$
As explained in the main text, $\mathbf{P}\mathbf{y}$ is the projection of $\mathbf{y}$ perpendicular to the space spanned by the model and represents the error between the observed outputs and the least squares prediction, at least in the absence of any weight penalty. The sum-squared-error at the weight which minimises the cost function can be conveniently written in terms of $\mathbf{P}$ and $\mathbf{y}$ by
$$
\hat{S}
= \left(\mathbf{y} - \hat{\mathbf{f}}\right)^\top\left(\mathbf{y} - \hat{\mathbf{f}}\right)
= \mathbf{y}^\top\mathbf{P}^\top\mathbf{P}\mathbf{y}
= \mathbf{y}^\top\mathbf{P}^2\mathbf{y},
$$
the last step following because $\mathbf{P}$ is symmetric ($\mathbf{P}^\top = \mathbf{P}$). Also, the cost function itself is
$$
\begin{aligned}
\hat{C} &= \left(\mathbf{H}\hat{\mathbf{w}} - \mathbf{y}\right)^\top\left(\mathbf{H}\hat{\mathbf{w}} - \mathbf{y}\right) + \hat{\mathbf{w}}^\top\boldsymbol{\Lambda}\hat{\mathbf{w}} \\
&= \mathbf{y}^\top\left(\mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top - \mathbf{I}_p\right)\left(\mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top - \mathbf{I}_p\right)\mathbf{y}
+ \mathbf{y}^\top\mathbf{H}\mathbf{A}^{-1}\boldsymbol{\Lambda}\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{y} \\
&= \mathbf{y}^\top\mathbf{P}^2\mathbf{y} + \mathbf{y}^\top\mathbf{H}\mathbf{A}^{-1}\boldsymbol{\Lambda}\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{y}.
\end{aligned}
$$
However, because
$$
\mathbf{H}\mathbf{A}^{-1}\boldsymbol{\Lambda}\mathbf{A}^{-1}\mathbf{H}^\top
= \mathbf{H}\mathbf{A}^{-1}\left(\mathbf{A} - \mathbf{H}^\top\mathbf{H}\right)\mathbf{A}^{-1}\mathbf{H}^\top
= \mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top - \left(\mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top\right)^2
= \mathbf{P} - \mathbf{P}^2,
$$
the expression for the minimum cost simplifies to
$$
\hat{C} = \mathbf{y}^\top\mathbf{P}^2\mathbf{y} + \mathbf{y}^\top\left(\mathbf{P} - \mathbf{P}^2\right)\mathbf{y}
= \mathbf{y}^\top\mathbf{P}\mathbf{y}.
$$
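The following NumPy fragment, added here as a sanity check rather than taken from the original package, confirms numerically that $\hat{S} = \mathbf{y}^\top\mathbf{P}^2\mathbf{y}$ and $\hat{C} = \mathbf{y}^\top\mathbf{P}\mathbf{y}$ at the optimal weights (test data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, lam = 40, 6, 0.5
H = rng.normal(size=(p, m))
y = rng.normal(size=p)

A = H.T @ H + lam * np.eye(m)                  # A = H'H + Lambda
P = np.eye(p) - H @ np.linalg.solve(A, H.T)    # P = I - H A^{-1} H'
w = np.linalg.solve(A, H.T @ y)                # optimal weights

S_direct = np.sum((y - H @ w) ** 2)            # sum-squared-error
C_direct = S_direct + lam * np.sum(w ** 2)     # cost function

print(np.isclose(S_direct, y @ P @ P @ y))     # S-hat = y' P^2 y
print(np.isclose(C_direct, y @ P @ y))         # C-hat = y' P   y
```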

A.7 Incremental Operations

If a matrix $\mathbf{A}$ is square, of size $m$, and none of its columns (rows) are linear combinations of its other columns (rows), then there exists a unique matrix $\mathbf{A}^{-1}$, called the inverse of $\mathbf{A}$, which satisfies
$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_m.
$$
The following are two useful lemmas for matrix inversion. They find frequent use whenever the design matrix is partitioned, as, for example, in any of the incremental operations of the main text.

The small rank adjustment is
$$
\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{X}\mathbf{R}\mathbf{Y}^\top,
$$
where the inverse of $\mathbf{A} \in \mathbb{R}^{m \times m}$ is known, $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{m \times r}$ are known ($m > r$), $\mathbf{R} \in \mathbb{R}^{r \times r}$ is known, and the inverse of $\tilde{\mathbf{A}}$ is sought. Instead of recomputing the inverse from scratch, the formula
$$
\tilde{\mathbf{A}}^{-1}
= \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{X}\left(\mathbf{Y}^\top\mathbf{A}^{-1}\mathbf{X} + \mathbf{R}^{-1}\right)^{-1}\mathbf{Y}^\top\mathbf{A}^{-1}
$$
may reach the same answer more efficiently, since it involves numerically inverting an $r$-sized matrix instead of a larger $m$-sized one (the cost of one matrix inverse is roughly proportional to the cube of its size).

The inverse of a partitioned matrix
$$
\tilde{\mathbf{A}} = \begin{bmatrix} \mathbf{A}_1 & \mathbf{A}_2 \\ \mathbf{A}_2^\top & \mathbf{A}_3 \end{bmatrix}
$$
can be expressed as
$$
\tilde{\mathbf{A}}^{-1} =
\begin{bmatrix}
\left(\mathbf{A}_1 - \mathbf{A}_2\mathbf{A}_3^{-1}\mathbf{A}_2^\top\right)^{-1} &
-\left(\mathbf{A}_1 - \mathbf{A}_2\mathbf{A}_3^{-1}\mathbf{A}_2^\top\right)^{-1}\mathbf{A}_2\mathbf{A}_3^{-1} \\[1ex]
-\left(\mathbf{A}_3 - \mathbf{A}_2^\top\mathbf{A}_1^{-1}\mathbf{A}_2\right)^{-1}\mathbf{A}_2^\top\mathbf{A}_1^{-1} &
\left(\mathbf{A}_3 - \mathbf{A}_2^\top\mathbf{A}_1^{-1}\mathbf{A}_2\right)^{-1}
\end{bmatrix},
$$
assuming that all the relevant inverses exist. Alternatively, using the small rank adjustment formula above and defining $\boldsymbol{\Delta} = \mathbf{A}_3 - \mathbf{A}_2^\top\mathbf{A}_1^{-1}\mathbf{A}_2$, the same inverse can be written
$$
\tilde{\mathbf{A}}^{-1} =
\begin{bmatrix}
\mathbf{A}_1^{-1} + \mathbf{A}_1^{-1}\mathbf{A}_2\boldsymbol{\Delta}^{-1}\mathbf{A}_2^\top\mathbf{A}_1^{-1} &
-\mathbf{A}_1^{-1}\mathbf{A}_2\boldsymbol{\Delta}^{-1} \\[1ex]
-\boldsymbol{\Delta}^{-1}\mathbf{A}_2^\top\mathbf{A}_1^{-1} &
\boldsymbol{\Delta}^{-1}
\end{bmatrix}.
$$

The next four subsections apply these formulae to working out the effects of incremental operations on linear models applied to supervised learning, by considering what happens to the variance matrix (appendix A.5) and the projection matrix (appendix A.6). We employ the general ridge regression formulae for both the variance matrix and the projection matrix; ordinary least squares is the special case where all the regularisation parameters are zero.

Retraining a linear network with an incremental operation can lead to considerable compute time savings over the alternative of retraining from scratch, which involves constructing the new design matrix, multiplying it with itself, adding the regulariser (if there is one), taking the inverse to obtain the variance matrix (appendix A.5) and recomputing the projection matrix (appendix A.6). The table below gives the approximate number of multiplications (ignoring first order terms) required to compute $\mathbf{P}$.

    operation                        completely retrain        use operation
    add a new basis function         m^3 + p m^2 + p^2 m       p^2
    remove an old basis function     m^3 + p m^2 + p^2 m       p^2
    add a new pattern                m^3 + p m^2 + p^2 m       m^2 + p m + p^2
    remove an old pattern            m^3 + p m^2 + p^2 m       m^2 + p m + p^2

Table: The cost, in multiplications, of calculating the projection matrix by retraining from scratch compared with using the appropriate incremental operation.
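As an added numerical check (not part of the original package; the matrix sizes are arbitrary), the small rank adjustment formula can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 8, 2
A = rng.normal(size=(m, m)) + m * np.eye(m)   # a well-conditioned base matrix
X = rng.normal(size=(m, r))
Y = rng.normal(size=(m, r))
R = rng.normal(size=(r, r)) + r * np.eye(r)

A_inv = np.linalg.inv(A)
# small rank adjustment: (A + X R Y')^{-1} using only an r x r inverse
middle = np.linalg.inv(Y.T @ A_inv @ X + np.linalg.inv(R))
A_tilde_inv = A_inv - A_inv @ X @ middle @ Y.T @ A_inv

print(np.allclose(A_tilde_inv, np.linalg.inv(A + X @ R @ Y.T)))   # True
```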

A.7.1 Adding a new basis function

Adding a new basis function (the $(m+1)$-th) to a linear model which already has $m$ basis functions, and applying it to a training set with $p$ patterns, has the effect of adding an extra column to the design matrix. If the old design matrix is $\mathbf{H}_m$ and the new basis function is $h_{m+1}(\mathbf{x})$, then the new design matrix is
$$
\mathbf{H}_{m+1} = \left[\, \mathbf{H}_m \;\; \mathbf{h}_{m+1} \,\right],
$$
where
$$
\mathbf{h}_{m+1} = \begin{bmatrix} h_{m+1}(\mathbf{x}_1) \\ h_{m+1}(\mathbf{x}_2) \\ \vdots \\ h_{m+1}(\mathbf{x}_p) \end{bmatrix}.
$$
The new variance matrix (appendix A.5) is $\mathbf{A}_{m+1}^{-1}$, where
$$
\begin{aligned}
\mathbf{A}_{m+1} &= \mathbf{H}_{m+1}^\top\mathbf{H}_{m+1} + \boldsymbol{\Lambda}_{m+1} \\
&= \begin{bmatrix} \mathbf{H}_m^\top \\ \mathbf{h}_{m+1}^\top \end{bmatrix}
\left[\, \mathbf{H}_m \;\; \mathbf{h}_{m+1} \,\right]
+ \begin{bmatrix} \boldsymbol{\Lambda}_m & \mathbf{0} \\ \mathbf{0}^\top & \lambda_{m+1} \end{bmatrix} \\
&= \begin{bmatrix}
\mathbf{A}_m & \mathbf{H}_m^\top\mathbf{h}_{m+1} \\
\mathbf{h}_{m+1}^\top\mathbf{H}_m & \lambda_{m+1} + \mathbf{h}_{m+1}^\top\mathbf{h}_{m+1}
\end{bmatrix},
\end{aligned}
$$
and where $\mathbf{A}_m = \mathbf{H}_m^\top\mathbf{H}_m + \boldsymbol{\Lambda}_m$. Applying the formula for the inverse of a partitioned matrix (appendix A.7) yields
$$
\mathbf{A}_{m+1}^{-1} =
\begin{bmatrix} \mathbf{A}_m^{-1} & \mathbf{0} \\ \mathbf{0}^\top & 0 \end{bmatrix}
+ \frac{1}{\lambda_{m+1} + \mathbf{h}_{m+1}^\top\mathbf{P}_m\mathbf{h}_{m+1}}
\begin{bmatrix} \mathbf{A}_m^{-1}\mathbf{H}_m^\top\mathbf{h}_{m+1} \\ -1 \end{bmatrix}
\begin{bmatrix} \mathbf{h}_{m+1}^\top\mathbf{H}_m\mathbf{A}_m^{-1} & -1 \end{bmatrix},
$$
where $\mathbf{P}_m = \mathbf{I}_p - \mathbf{H}_m\mathbf{A}_m^{-1}\mathbf{H}_m^\top$ is the projection matrix (appendix A.6) for the original network with $m$ basis functions. The above result for $\mathbf{A}_{m+1}^{-1}$ can be used to calculate the new projection matrix after the $(m+1)$-th basis function has been added, which is
$$
\begin{aligned}
\mathbf{P}_{m+1} &= \mathbf{I}_p - \mathbf{H}_{m+1}\mathbf{A}_{m+1}^{-1}\mathbf{H}_{m+1}^\top \\
&= \mathbf{P}_m - \frac{\mathbf{P}_m\mathbf{h}_{m+1}\mathbf{h}_{m+1}^\top\mathbf{P}_m}{\lambda_{m+1} + \mathbf{h}_{m+1}^\top\mathbf{P}_m\mathbf{h}_{m+1}}.
\end{aligned}
$$
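A minimal sketch of this update, added by way of illustration (assuming, for the numerical check, ordinary least squares with $\lambda_{m+1} = 0$; all names are my own):

```python
import numpy as np

def add_basis_function(P_m, h_new, lam_new=0.0):
    """P_{m+1} = P_m - P_m h h' P_m / (lam + h' P_m h)."""
    Ph = P_m @ h_new
    return P_m - np.outer(Ph, Ph) / (lam_new + h_new @ Ph)

def projection(H):
    """From-scratch projection matrix, no regularisation."""
    return np.eye(len(H)) - H @ np.linalg.solve(H.T @ H, H.T)

rng = np.random.default_rng(3)
p = 30
H_old = rng.normal(size=(p, 4))
h_new = rng.normal(size=p)
H_new = np.column_stack([H_old, h_new])

print(np.allclose(add_basis_function(projection(H_old), h_new),
                  projection(H_new)))   # True (here with lam_new = 0)
```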

A.7.2 Removing an old basis function

The line of reasoning is similar here to appendix A.7.1, except we are interested in knowing the new projection matrix $\mathbf{P}_{m-1}$ after the $j$-th basis function ($1 \le j \le m$) has been removed from a network with $m$ basis functions and for which we know the old projection matrix $\mathbf{P}_m$. The main difference is that any column can be removed from $\mathbf{H}_m$, whereas in the previous section the new column was inserted in a fixed position, right at the end of $\mathbf{H}_{m+1}$, just after the $m$-th column.

As noted in the main text, the projection matrix (appendix A.6) is invariant to a permutation of the columns of the design matrix, which is why it does not matter in which position a new column is inserted: putting it at the end is as good a choice as any other. If we want to remove a column, say the $j$-th one, we can shuffle it to the end position without altering $\mathbf{P}_m$ and apply the update formula of appendix A.7.1, but with $\mathbf{P}_{m-1}$ in place of $\mathbf{P}_m$, $\mathbf{P}_m$ in place of $\mathbf{P}_{m+1}$, $\mathbf{h}_j$ in place of $\mathbf{h}_{m+1}$ and $\lambda_j$ in place of $\lambda_{m+1}$. The result is
$$
\mathbf{P}_m = \mathbf{P}_{m-1} - \frac{\mathbf{P}_{m-1}\mathbf{h}_j\mathbf{h}_j^\top\mathbf{P}_{m-1}}{\lambda_j + \mathbf{h}_j^\top\mathbf{P}_{m-1}\mathbf{h}_j}.
$$
This equation gives the old (known) $\mathbf{P}_m$ in terms of the new (unknown) $\mathbf{P}_{m-1}$, and not the other way around, which would be more useful. If $\lambda_j \ne 0$ (and only in that case, as far as I can see) it is possible to reverse the relationship to arrive at
$$
\mathbf{P}_{m-1} = \mathbf{P}_m + \frac{\mathbf{P}_m\mathbf{h}_j\mathbf{h}_j^\top\mathbf{P}_m}{\lambda_j - \mathbf{h}_j^\top\mathbf{P}_m\mathbf{h}_j}
$$
by first post- and then pre-multiplying by $\mathbf{h}_j$ to obtain expressions for $\mathbf{P}_{m-1}\mathbf{h}_j$ and $\mathbf{h}_j^\top\mathbf{P}_{m-1}\mathbf{h}_j$ in terms of $\mathbf{P}_m$. However, beware of round-off errors when using this equation in a computer program if $\lambda_j$ is small.

A.7.3 Adding a new training pattern

As we have seen above, in the case of incremental adjustments to the basis functions, the projection matrix (appendix A.6) keeps the same size ($p \times p$) while the variance matrix (appendix A.5) either shrinks or expands by one row and one column. If a single example is added to (removed from) the training set, then it is the projection matrix which expands (contracts) and the variance matrix which maintains the same size.

If a single new example $(\mathbf{x}_{p+1}, y_{p+1})$ is added to the training set, the design matrix acquires a new row,
$$
\mathbf{h}_{p+1}^\top = \left[\, h_1(\mathbf{x}_{p+1}) \;\; h_2(\mathbf{x}_{p+1}) \;\; \cdots \;\; h_m(\mathbf{x}_{p+1}) \,\right].
$$
The order in which the examples are listed does not matter but, for convenience, we shall insert the new row in the last position, just after row $p$. The new design matrix is then
$$
\mathbf{H}_{p+1} = \begin{bmatrix} \mathbf{H}_p \\ \mathbf{h}_{p+1}^\top \end{bmatrix},
$$
and the new variance matrix (appendix A.5) is $\mathbf{A}_{p+1}^{-1}$, where
$$
\mathbf{A}_{p+1}
= \mathbf{H}_{p+1}^\top\mathbf{H}_{p+1} + \boldsymbol{\Lambda}
= \left[\, \mathbf{H}_p^\top \;\; \mathbf{h}_{p+1} \,\right]
\begin{bmatrix} \mathbf{H}_p \\ \mathbf{h}_{p+1}^\top \end{bmatrix} + \boldsymbol{\Lambda}
= \mathbf{A}_p + \mathbf{h}_{p+1}\mathbf{h}_{p+1}^\top,
$$
and this is true with or without ridge regression. Applying the formula for a small rank adjustment (appendix A.7) with $\mathbf{X} = \mathbf{Y} = \mathbf{h}_{p+1}$ and $\mathbf{R} = 1$ then gives
$$
\mathbf{A}_{p+1}^{-1}
= \mathbf{A}_p^{-1} - \frac{\mathbf{A}_p^{-1}\mathbf{h}_{p+1}\mathbf{h}_{p+1}^\top\mathbf{A}_p^{-1}}{1 + \mathbf{h}_{p+1}^\top\mathbf{A}_p^{-1}\mathbf{h}_{p+1}}.
$$
The new projection matrix is
$$
\begin{aligned}
\mathbf{P}_{p+1}
&= \mathbf{I}_{p+1} - \begin{bmatrix} \mathbf{H}_p \\ \mathbf{h}_{p+1}^\top \end{bmatrix}
\mathbf{A}_{p+1}^{-1}
\left[\, \mathbf{H}_p^\top \;\; \mathbf{h}_{p+1} \,\right] \\
&= \begin{bmatrix} \mathbf{P}_p & \mathbf{0} \\ \mathbf{0}^\top & 0 \end{bmatrix}
+ \frac{1}{1 + \mathbf{h}_{p+1}^\top\mathbf{A}_p^{-1}\mathbf{h}_{p+1}}
\begin{bmatrix} \mathbf{H}_p\mathbf{A}_p^{-1}\mathbf{h}_{p+1} \\ -1 \end{bmatrix}
\begin{bmatrix} \mathbf{h}_{p+1}^\top\mathbf{A}_p^{-1}\mathbf{H}_p^\top & -1 \end{bmatrix}.
\end{aligned}
$$
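The corresponding Sherman–Morrison style update of the variance matrix can be sketched and checked as follows (an added illustration, not code from the original package; test data are arbitrary):

```python
import numpy as np

def add_pattern(A_inv, h_new):
    """(A + h h')^{-1} = A^{-1} - A^{-1} h h' A^{-1} / (1 + h' A^{-1} h)."""
    Ah = A_inv @ h_new
    return A_inv - np.outer(Ah, Ah) / (1.0 + h_new @ Ah)

rng = np.random.default_rng(4)
m, lam = 5, 0.3
H = rng.normal(size=(20, m))
h_new = rng.normal(size=m)

A_inv_old = np.linalg.inv(H.T @ H + lam * np.eye(m))
A_inv_new = add_pattern(A_inv_old, h_new)

H_plus = np.vstack([H, h_new])
print(np.allclose(A_inv_new,
                  np.linalg.inv(H_plus.T @ H_plus + lam * np.eye(m))))   # True
```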

A.7.4 Removing an old training pattern

The line of reasoning is similar here to appendix A.7.3, except we are interested in knowing the new variance matrix $\mathbf{A}_{p-1}^{-1}$ after the $i$-th pattern ($1 \le i \le p$) has been removed from a network trained with $p$ patterns and for which we know the old variance matrix $\mathbf{A}_p^{-1}$. The main difference is that any row can be removed from $\mathbf{H}_p$, whereas in the last section the new row was inserted in a fixed position, right at the end of $\mathbf{H}_{p+1}$, just after the $p$-th row.

As noted before, however, the ordering of examples does not matter, so the variance matrix (appendix A.5) is invariant to a permutation of the rows of the design matrix. Consequently, the relation between $\mathbf{A}_{p-1}^{-1}$ and $\mathbf{A}_p^{-1}$ is obtained from the update formula of appendix A.7.3 by a change of subscripts. If the removed row is $\mathbf{h}_i^\top$ then
$$
\mathbf{A}_p^{-1}
= \mathbf{A}_{p-1}^{-1} - \frac{\mathbf{A}_{p-1}^{-1}\mathbf{h}_i\mathbf{h}_i^\top\mathbf{A}_{p-1}^{-1}}{1 + \mathbf{h}_i^\top\mathbf{A}_{p-1}^{-1}\mathbf{h}_i}.
$$
This equation gives the old (known) $\mathbf{A}_p^{-1}$ in terms of the new (unknown) $\mathbf{A}_{p-1}^{-1}$, and not the other way around, which would be more useful. However, it is easy to invert the relationship by first deriving from it that
$$
\mathbf{A}_p^{-1}\mathbf{h}_i = \frac{\mathbf{A}_{p-1}^{-1}\mathbf{h}_i}{1 + \mathbf{h}_i^\top\mathbf{A}_{p-1}^{-1}\mathbf{h}_i}
\qquad\text{and}\qquad
\mathbf{h}_i^\top\mathbf{A}_p^{-1}\mathbf{h}_i = \frac{\mathbf{h}_i^\top\mathbf{A}_{p-1}^{-1}\mathbf{h}_i}{1 + \mathbf{h}_i^\top\mathbf{A}_{p-1}^{-1}\mathbf{h}_i},
$$
from which it follows that
$$
\mathbf{h}_i^\top\mathbf{A}_{p-1}^{-1}\mathbf{h}_i = \frac{\mathbf{h}_i^\top\mathbf{A}_p^{-1}\mathbf{h}_i}{1 - \mathbf{h}_i^\top\mathbf{A}_p^{-1}\mathbf{h}_i}
\qquad\text{and}\qquad
\mathbf{A}_{p-1}^{-1}\mathbf{h}_i = \frac{\mathbf{A}_p^{-1}\mathbf{h}_i}{1 - \mathbf{h}_i^\top\mathbf{A}_p^{-1}\mathbf{h}_i}.
$$
Substituting these in the first equation above gives
$$
\mathbf{A}_{p-1}^{-1}
= \mathbf{A}_p^{-1} + \frac{\mathbf{A}_p^{-1}\mathbf{h}_i\mathbf{h}_i^\top\mathbf{A}_p^{-1}}{1 - \mathbf{h}_i^\top\mathbf{A}_p^{-1}\mathbf{h}_i}.
$$

A.8 The Effective Number of Parameters

In the case of ordinary least squares (no regularisation) the variance matrix (appendix A.5) is
$$
\mathbf{A}^{-1} = \left(\mathbf{H}^\top\mathbf{H}\right)^{-1},
$$
so the effective number of parameters is
$$
\begin{aligned}
\gamma &= p - \operatorname{trace}\!\left(\mathbf{I}_p - \mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top\right) \\
&= \operatorname{trace}\!\left(\mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top\right) \\
&= \operatorname{trace}\!\left(\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{H}\right) \\
&= \operatorname{trace}\!\left(\left(\mathbf{H}^\top\mathbf{H}\right)^{-1}\mathbf{H}^\top\mathbf{H}\right) \\
&= \operatorname{trace}\!\left(\mathbf{I}_m\right) = m,
\end{aligned}
$$
so, no surprise there: without regularisation there are $m$ parameters, just the number of weights in the network. In standard ridge regression,
$$
\mathbf{A}^{-1} = \left(\mathbf{H}^\top\mathbf{H} + \lambda\,\mathbf{I}_m\right)^{-1},
$$
so the effective number of parameters is
$$
\begin{aligned}
\gamma &= \operatorname{trace}\!\left(\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{H}\right) \\
&= \operatorname{trace}\!\left(\mathbf{A}^{-1}\left(\mathbf{A} - \lambda\,\mathbf{I}_m\right)\right) \\
&= \operatorname{trace}\!\left(\mathbf{I}_m - \lambda\,\mathbf{A}^{-1}\right) \\
&= m - \lambda\operatorname{trace}\!\left(\mathbf{A}^{-1}\right).
\end{aligned}
$$
This is the same as MacKay's equation in his paper on Bayesian interpolation. If the eigenvalues of the matrix $\mathbf{H}^\top\mathbf{H}$ are $\{\mu_j\}_{j=1}^{m}$ then
$$
\begin{aligned}
\gamma &= m - \lambda\operatorname{trace}\!\left(\mathbf{A}^{-1}\right) \\
&= m - \sum_{j=1}^{m}\frac{\lambda}{\mu_j + \lambda} \\
&= \sum_{j=1}^{m}\frac{\mu_j}{\mu_j + \lambda},
\end{aligned}
$$
which is the same as Moody's equation for the effective number of parameters.
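An added numerical illustration of the three equivalent expressions for $\gamma$ under standard ridge regression (the test data are arbitrary and the names are my own):

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, lam = 60, 8, 2.0
H = rng.normal(size=(p, m))

A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
P = np.eye(p) - H @ A_inv @ H.T

gamma_trace = p - np.trace(P)              # gamma = p - trace(P)
gamma_ridge = m - lam * np.trace(A_inv)    # MacKay's form
mu = np.linalg.eigvalsh(H.T @ H)           # eigenvalues of H'H
gamma_eig = np.sum(mu / (mu + lam))        # Moody's form

print(np.allclose([gamma_trace, gamma_ridge], gamma_eig))   # True
```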

A.9 Leave-one-out Cross-validation

We prove the formula used in the main text for the leave-one-out (LOO) cross-validation error prediction. Let $\hat{f}_{-i}(\mathbf{x}_i)$ be the prediction of the model for the $i$-th pattern in the training set after it has been trained on the $p-1$ other patterns. Then leave-one-out cross-validation predicts the error variance
$$
\hat{\sigma}^2_{\mathrm{LOO}} = \frac{1}{p}\sum_{i=1}^{p}\left(y_i - \hat{f}_{-i}(\mathbf{x}_i)\right)^2.
$$
Let $\mathbf{H}_{-i}$ be the design matrix of the reduced training set, $\mathbf{A}_{-i}^{-1}$ its variance matrix and $\mathbf{y}_{-i}$ its output vector, so that the optimal weight is
$$
\hat{\mathbf{w}}_{-i} = \mathbf{A}_{-i}^{-1}\mathbf{H}_{-i}^\top\mathbf{y}_{-i},
$$
and therefore the prediction error for the $i$-th pattern is
$$
y_i - \hat{f}_{-i}(\mathbf{x}_i)
= y_i - \mathbf{h}_i^\top\hat{\mathbf{w}}_{-i}
= y_i - \mathbf{h}_i^\top\mathbf{A}_{-i}^{-1}\mathbf{H}_{-i}^\top\mathbf{y}_{-i},
$$
where $\mathbf{h}_i^\top = [\,h_1(\mathbf{x}_i)\;\;h_2(\mathbf{x}_i)\;\;\cdots\;\;h_m(\mathbf{x}_i)\,]$.

$\mathbf{H}_{-i}$ and $\mathbf{y}_{-i}$ are both obtained by removing the $i$-th rows from their counterparts for the full training set, $\mathbf{H}$ and $\mathbf{y}$. Consequently
$$
\mathbf{H}_{-i}^\top\mathbf{y}_{-i} = \mathbf{H}^\top\mathbf{y} - y_i\,\mathbf{h}_i,
$$
where $\mathbf{h}_i^\top$ is the row removed from the matrix $\mathbf{H}$ and $y_i$ is the component taken from the vector $\mathbf{y}$. In addition, $\mathbf{A}_{-i}^{-1}$ is related to its counterpart $\mathbf{A}^{-1}$ by the formula of appendix A.7.4 for removing a training example, namely
$$
\mathbf{A}_{-i}^{-1}
= \mathbf{A}^{-1} + \frac{\mathbf{A}^{-1}\mathbf{h}_i\mathbf{h}_i^\top\mathbf{A}^{-1}}{1 - \mathbf{h}_i^\top\mathbf{A}^{-1}\mathbf{h}_i}.
$$
From this it is easy to verify that
$$
\mathbf{A}_{-i}^{-1}\mathbf{h}_i = \frac{\mathbf{A}^{-1}\mathbf{h}_i}{1 - \mathbf{h}_i^\top\mathbf{A}^{-1}\mathbf{h}_i}.
$$
Substituting this expression for $\mathbf{A}_{-i}^{-1}\mathbf{h}_i$, and the one for $\mathbf{H}_{-i}^\top\mathbf{y}_{-i}$, into the formula for the prediction error gives
$$
y_i - \hat{f}_{-i}(\mathbf{x}_i)
= \frac{y_i - \mathbf{h}_i^\top\mathbf{A}^{-1}\mathbf{H}^\top\mathbf{y}}{1 - \mathbf{h}_i^\top\mathbf{A}^{-1}\mathbf{h}_i}.
$$
The numerator of this ratio is the $i$-th component of the vector $\mathbf{P}\mathbf{y}$ and the denominator is the $i$-th component of the diagonal of $\mathbf{P}$, where $\mathbf{P}$ is the projection matrix (appendix A.6). Therefore the full vector of prediction errors is
$$
\begin{bmatrix}
y_1 - \hat{f}_{-1}(\mathbf{x}_1) \\
y_2 - \hat{f}_{-2}(\mathbf{x}_2) \\
\vdots \\
y_p - \hat{f}_{-p}(\mathbf{x}_p)
\end{bmatrix}
= \left(\operatorname{diag}(\mathbf{P})\right)^{-1}\mathbf{P}\mathbf{y}.
$$
The matrix $\operatorname{diag}(\mathbf{P})$ is the same size and has the same diagonal as $\mathbf{P}$ but is zero off the diagonal. $\hat{\sigma}^2_{\mathrm{LOO}}$ is the mean of the square of the $p$ prediction errors, i.e. the dot product of the above vector of errors with itself divided by $p$, and so we finally arrive at
$$
\hat{\sigma}^2_{\mathrm{LOO}} = \frac{\mathbf{y}^\top\mathbf{P}\left(\operatorname{diag}(\mathbf{P})\right)^{-2}\mathbf{P}\mathbf{y}}{p},
$$
which is the formula we set out to prove.
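The following added sketch checks this closed-form LOO estimate against the brute-force computation of retraining the network $p$ times (arbitrary test data, standard ridge regression):

```python
import numpy as np

rng = np.random.default_rng(6)
p, m, lam = 25, 4, 0.2
H = rng.normal(size=(p, m))
y = rng.normal(size=p)

A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
P = np.eye(p) - H @ A_inv @ H.T

# closed-form LOO error:  y' P diag(P)^{-2} P y / p
Py = P @ y
loo_fast = Py @ (Py / np.diag(P) ** 2) / p

# brute-force LOO: retrain p times, leaving one pattern out each time
errs = []
for i in range(p):
    keep = np.arange(p) != i
    w_i = np.linalg.solve(H[keep].T @ H[keep] + lam * np.eye(m),
                          H[keep].T @ y[keep])
    errs.append(y[i] - H[i] @ w_i)

print(np.isclose(loo_fast, np.mean(np.array(errs) ** 2)))   # True
```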

A.10 A Re-Estimation Formula for the Global Parameter

The GCV error prediction is
$$
\hat{\sigma}^2_{\mathrm{GCV}} = \frac{p\,\mathbf{y}^\top\mathbf{P}^2\mathbf{y}}{\left(\operatorname{trace}(\mathbf{P})\right)^2},
$$
where $\mathbf{P}$ is the projection matrix (appendix A.6) for standard ridge regression. To find the minimum of this error as a function of the regularisation parameter $\lambda$ we differentiate it with respect to $\lambda$ and set the result equal to zero. To perform the differentiation we will have to differentiate the variance matrix $\mathbf{A}^{-1}$, upon which $\mathbf{P}$ depends. To do this, assume that the matrix $\mathbf{H}$ has the singular value decomposition (appendix A.2)
$$
\mathbf{H} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top,
$$
where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal and $\boldsymbol{\Sigma}$ is
$$
\boldsymbol{\Sigma} = \begin{bmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_m \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}.
$$
The $\{\sigma_j\}_{j=1}^{m}$ are the singular values of $\mathbf{H}$ and $\{\sigma_j^2\}_{j=1}^{m}$ are the eigenvalues of $\mathbf{H}^\top\mathbf{H}$. We assume $\mathbf{H}$ has more rows than columns ($p > m$), though the same results hold for the opposite case too. It then follows that
$$
\mathbf{H}^\top\mathbf{H} = \mathbf{V}\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\mathbf{V}^\top,
$$
and therefore that
$$
\begin{aligned}
\mathbf{A}^{-1} &= \left(\mathbf{V}\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\mathbf{V}^\top + \lambda\,\mathbf{I}_m\right)^{-1} \\
&= \left(\mathbf{V}\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\mathbf{V}^\top + \lambda\,\mathbf{V}\mathbf{V}^\top\right)^{-1} \\
&= \left(\mathbf{V}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)\mathbf{V}^\top\right)^{-1} \\
&= \mathbf{V}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)^{-1}\mathbf{V}^\top,
\end{aligned}
$$
where we have made use of some useful matrix properties (appendix A.2). The inverse is easy because the matrix is diagonal,
$$
\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)^{-1}
= \begin{bmatrix}
\frac{1}{\sigma_1^2 + \lambda} & 0 & \cdots & 0 \\
0 & \frac{1}{\sigma_2^2 + \lambda} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\sigma_m^2 + \lambda}
\end{bmatrix},
$$
and is the only part of $\mathbf{A}^{-1}$ which depends on $\lambda$. Its derivative is also straightforward,
$$
\frac{\partial}{\partial\lambda}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)^{-1}
= -\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)^{-2},
$$
and consequently
$$
\begin{aligned}
\frac{\partial\mathbf{A}^{-1}}{\partial\lambda}
&= \mathbf{V}\,\frac{\partial}{\partial\lambda}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)^{-1}\mathbf{V}^\top \\
&= -\mathbf{V}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)^{-1}\mathbf{V}^\top\,
\mathbf{V}\left(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma} + \lambda\,\mathbf{I}_m\right)^{-1}\mathbf{V}^\top \\
&= -\mathbf{A}^{-2}.
\end{aligned}
$$
At this point the reader may think that all this complication just goes to show that matrices can be treated like scalars and that the derivative of $\mathbf{A}^{-1}$ is obviously
$$
\frac{\partial\mathbf{A}^{-1}}{\partial\lambda} = -\mathbf{A}^{-2}\,\frac{\partial\mathbf{A}}{\partial\lambda} = -\mathbf{A}^{-2},
$$
since $\partial\mathbf{A}/\partial\lambda = \mathbf{I}_m$. However, I don't think this is true in general. The reader might like to try differentiating the variance matrix for local ridge regression with respect to $\lambda_j$ if there is any doubt about this point.

Now that we know the derivative of $\mathbf{A}^{-1}$ we can proceed to the derivative of the projection matrix and the other derivatives we need to differentiate the GCV error prediction:
$$
\frac{\partial\mathbf{P}}{\partial\lambda}
= \frac{\partial}{\partial\lambda}\left(\mathbf{I}_p - \mathbf{H}\mathbf{A}^{-1}\mathbf{H}^\top\right)
= -\mathbf{H}\,\frac{\partial\mathbf{A}^{-1}}{\partial\lambda}\,\mathbf{H}^\top
= \mathbf{H}\mathbf{A}^{-2}\mathbf{H}^\top.
$$
Similarly,
$$
\begin{aligned}
\frac{\partial\operatorname{trace}(\mathbf{P})}{\partial\lambda}
&= \operatorname{trace}\!\left(\mathbf{H}\mathbf{A}^{-2}\mathbf{H}^\top\right) \\
&= \operatorname{trace}\!\left(\mathbf{A}^{-2}\mathbf{H}^\top\mathbf{H}\right) \\
&= \operatorname{trace}\!\left(\mathbf{A}^{-2}\left(\mathbf{A} - \lambda\,\mathbf{I}_m\right)\right) \\
&= \operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right),
\end{aligned}
$$
and also
$$
\begin{aligned}
\frac{\partial\left(\mathbf{y}^\top\mathbf{P}^2\mathbf{y}\right)}{\partial\lambda}
&= 2\,\mathbf{y}^\top\mathbf{P}\,\frac{\partial\mathbf{P}}{\partial\lambda}\,\mathbf{y} \\
&= 2\,\mathbf{y}^\top\mathbf{P}\mathbf{H}\mathbf{A}^{-2}\mathbf{H}^\top\mathbf{y} \\
&= 2\,\lambda\,\mathbf{y}^\top\mathbf{H}\mathbf{A}^{-3}\mathbf{H}^\top\mathbf{y} \\
&= 2\,\lambda\,\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}.
\end{aligned}
$$
The ingredients for differentiating the GCV error prediction are now all in place:
$$
\begin{aligned}
\frac{\partial\hat{\sigma}^2_{\mathrm{GCV}}}{\partial\lambda}
&= \frac{p}{\left(\operatorname{trace}(\mathbf{P})\right)^2}\,\frac{\partial\left(\mathbf{y}^\top\mathbf{P}^2\mathbf{y}\right)}{\partial\lambda}
- \frac{2\,p\,\mathbf{y}^\top\mathbf{P}^2\mathbf{y}}{\left(\operatorname{trace}(\mathbf{P})\right)^3}\,\frac{\partial\operatorname{trace}(\mathbf{P})}{\partial\lambda} \\
&= \frac{2\,p}{\left(\operatorname{trace}(\mathbf{P})\right)^3}
\left[\lambda\,\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}\,\operatorname{trace}(\mathbf{P})
- \mathbf{y}^\top\mathbf{P}^2\mathbf{y}\,\operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right)\right].
\end{aligned}
$$
The last step is to equate this to zero, which finally leaves us with
$$
\lambda\,\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}\,\operatorname{trace}(\mathbf{P})
= \mathbf{y}^\top\mathbf{P}^2\mathbf{y}\,\operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right)
$$
at the optimal value of the regularisation parameter. Isolating $\lambda$ on the left hand side leads to the re-estimation formula that we set out to prove, namely
$$
\lambda = \frac{\mathbf{y}^\top\mathbf{P}^2\mathbf{y}\,\operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right)}{\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}\,\operatorname{trace}(\mathbf{P})}.
$$

The other GCV-like model selection criteria lead to similar formulae. Using $p - \gamma$ in place of $\operatorname{trace}(\mathbf{P})$, where $\gamma$ is the effective number of parameters (appendix A.8), we get
$$
\begin{aligned}
\lambda_{\mathrm{UEV}} &= \frac{\mathbf{y}^\top\mathbf{P}^2\mathbf{y}\,\operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right)}{2\,\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}\,(p - \gamma)}, \\[1ex]
\lambda_{\mathrm{FPE}} &= \frac{p\,\mathbf{y}^\top\mathbf{P}^2\mathbf{y}\,\operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right)}{\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}\,(p - \gamma)(p + \gamma)}, \\[1ex]
\lambda_{\mathrm{GCV}} &= \frac{\mathbf{y}^\top\mathbf{P}^2\mathbf{y}\,\operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right)}{\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}\,(p - \gamma)}, \\[1ex]
\lambda_{\mathrm{BIC}} &= \frac{p\,\ln(p)\,\mathbf{y}^\top\mathbf{P}^2\mathbf{y}\,\operatorname{trace}\!\left(\mathbf{A}^{-1} - \lambda\,\mathbf{A}^{-2}\right)}{2\,\hat{\mathbf{w}}^\top\mathbf{A}^{-1}\hat{\mathbf{w}}\,(p - \gamma)\left(p + (\ln(p) - 1)\,\gamma\right)}.
\end{aligned}
$$
Leave-one-out cross-validation (appendix A.9) involves the awkward term $\operatorname{diag}(\mathbf{P})$ and I have not managed to find a re-estimation formula for it.
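A minimal sketch, which I have added, of how the GCV re-estimation formula might be iterated to a fixed point (this is an illustration, not the original Matlab implementation; the starting value and iteration count are arbitrary):

```python
import numpy as np

def reestimate_lambda(H, y, lam0=1.0, n_iter=20):
    """Fixed-point iteration of the GCV re-estimation formula:
    lam <- y'P^2y * trace(A^{-1} - lam A^{-2}) / (w'A^{-1}w * trace(P))."""
    p, m = H.shape
    lam = lam0
    for _ in range(n_iter):
        A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        P = np.eye(p) - H @ A_inv @ H.T
        w = A_inv @ H.T @ y
        num = (y @ P @ P @ y) * np.trace(A_inv - lam * A_inv @ A_inv)
        den = (w @ A_inv @ w) * np.trace(P)
        lam = num / den
    return lam

rng = np.random.default_rng(7)
H = rng.normal(size=(100, 10))
y = H @ rng.normal(size=10) + 0.5 * rng.normal(size=100)
print(reestimate_lambda(H, y))   # settles near a local optimum of GCV
```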

A.11 Optimal Values for the Local Parameters

We take as our starting point the formula of appendix A.7.2 for removing a basis function,
$$
\mathbf{P}_{-j} = \mathbf{P} + \frac{\mathbf{P}\mathbf{h}_j\mathbf{h}_j^\top\mathbf{P}}{\nu_j},
\qquad
\nu_j = \lambda_j - \mathbf{h}_j^\top\mathbf{P}\mathbf{h}_j,
$$
where $\mathbf{P}_{-j}$ is the projection matrix after the $j$-th basis function has been removed and $\mathbf{h}_j$ is the $j$-th column of the design matrix. Actually, we are not going to use this equation to remove any basis functions (unless $\hat{\lambda}_j$ turns out to be $\infty$; see below); we simply employ it to make the dependence of $\mathbf{P}$ on $\lambda_j$ explicit. Turning the relation around (appendix A.7.1) gives
$$
\mathbf{P} = \mathbf{P}_{-j} - \frac{\mathbf{P}_{-j}\mathbf{h}_j\mathbf{h}_j^\top\mathbf{P}_{-j}}{\tilde{\nu}_j},
\qquad
\tilde{\nu}_j = \lambda_j + \mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j,
$$
in which only $\tilde{\nu}_j$ depends on $\lambda_j$.

The GCV criterion which we wish to minimise is
$$
\hat{\sigma}^2_{\mathrm{GCV}} = \frac{p\,\mathbf{y}^\top\mathbf{P}^2\mathbf{y}}{\left(\operatorname{trace}(\mathbf{P})\right)^2}.
$$
As will shortly become apparent, this function is a rational polynomial of order 2 in $\lambda_j$ and fairly easy to analyse (i.e. to find its minimum for $\lambda_j \ge 0$). In principle any of the other model selection criteria could be used instead of GCV, but they lead to higher-order rational polynomials which are not so simple. That is why, as mentioned earlier, GCV is so nice.

Substituting $\mathbf{P}$ in terms of $\mathbf{P}_{-j}$ into GCV gives
$$
\hat{\sigma}^2_{\mathrm{GCV}} = \frac{p\left(a\,\tilde{\nu}_j^2 - 2\,b\,\tilde{\nu}_j + c\right)}{\left(\gamma\,\tilde{\nu}_j - \delta\right)^2},
$$
where
$$
\begin{aligned}
a &= \mathbf{y}^\top\mathbf{P}_{-j}^2\mathbf{y}, \\
b &= \left(\mathbf{y}^\top\mathbf{P}_{-j}^2\mathbf{h}_j\right)\left(\mathbf{y}^\top\mathbf{P}_{-j}\mathbf{h}_j\right), \\
c &= \left(\mathbf{h}_j^\top\mathbf{P}_{-j}^2\mathbf{h}_j\right)\left(\mathbf{y}^\top\mathbf{P}_{-j}\mathbf{h}_j\right)^2, \\
\gamma &= \operatorname{trace}\!\left(\mathbf{P}_{-j}\right), \\
\delta &= \mathbf{h}_j^\top\mathbf{P}_{-j}^2\mathbf{h}_j.
\end{aligned}
$$
This is how GCV depends on $\lambda_j$ when all the other regularisation parameters are held constant: a rational polynomial of order 2. It has a pole (becomes infinite) at $\lambda_j = \delta/\gamma - \mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j$, which never occurs at a positive value of $\lambda_j$, since it is always true that
$$
\frac{\mathbf{h}_j^\top\mathbf{P}_{-j}^2\mathbf{h}_j}{\mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j} \le \operatorname{trace}\!\left(\mathbf{P}_{-j}\right)
$$
(the proof is left as an exercise for the reader). The single minimum can be calculated by differentiating with respect to $\lambda_j$:
$$
\frac{\partial\hat{\sigma}^2_{\mathrm{GCV}}}{\partial\lambda_j}
= \frac{2\,p\left(a\,\tilde{\nu}_j - b\right)}{\left(\gamma\,\tilde{\nu}_j - \delta\right)^2}
- \frac{2\,p\,\gamma\left(a\,\tilde{\nu}_j^2 - 2\,b\,\tilde{\nu}_j + c\right)}{\left(\gamma\,\tilde{\nu}_j - \delta\right)^3}
= \frac{2\,p\left[\left(b\,\gamma - a\,\delta\right)\tilde{\nu}_j - \left(c\,\gamma - b\,\delta\right)\right]}{\left(\gamma\,\tilde{\nu}_j - \delta\right)^3}.
$$
If $b\,\gamma = a\,\delta$ this derivative will only attain the value zero at $\lambda_j = \pm\infty$. Since we are only interested in $\lambda_j \ge 0$, the solution of interest in that case is $\lambda_j = \infty$. Assuming that $b\,\gamma \ne a\,\delta$, when the derivative is set equal to zero the solution is
$$
\tilde{\nu}_j = \frac{c\,\gamma - b\,\delta}{b\,\gamma - a\,\delta},
$$
or
$$
\lambda_j = \frac{c\,\gamma - b\,\delta}{b\,\gamma - a\,\delta} - \mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j,
$$
and this can be either positive or negative. If it is positive, we are done: we have found the minimum error prediction for $\lambda_j \ge 0$. If it is negative, there are two cases. Firstly, if
$$
\frac{b}{a} < \frac{\delta}{\gamma},
$$
the pole occurs to the right of the minimum. As $\lambda_j$ goes from $-\infty$ to $\infty$, the derivative of the error prediction goes through the sequence
$$
\{\text{negative},\ \text{minimum},\ \text{positive},\ \text{pole},\ \text{negative}\}
$$
(see panel (a) of the figure below). Therefore the error prediction is falling as $\lambda_j$ passes through zero (after the pole) and keeps on decreasing thereafter. The minimum error for $\lambda_j \ge 0$ must be at $\lambda_j = \infty$. Alternatively, if
$$
\frac{b}{a} > \frac{\delta}{\gamma},
$$
the pole occurs to the left of the minimum. As $\lambda_j$ goes from $-\infty$ to $\infty$, the derivative of the error prediction goes through the sequence
$$
\{\text{positive},\ \text{pole},\ \text{negative},\ \text{minimum},\ \text{positive}\}
$$
(see panel (b) of the figure below). Therefore the error prediction is rising as $\lambda_j$ passes through zero (after the minimum) and keeps on increasing thereafter. The minimum error for $\lambda_j \ge 0$ must be at $\lambda_j = 0$. In summary,
$$
\hat{\lambda}_j =
\begin{cases}
\infty & \text{if } a\,\delta = b\,\gamma, \\[0.5ex]
\infty & \text{if } \dfrac{c\,\gamma - b\,\delta}{b\,\gamma - a\,\delta} < \mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j \text{ and } a\,\delta > b\,\gamma, \\[1.5ex]
0 & \text{if } \dfrac{c\,\gamma - b\,\delta}{b\,\gamma - a\,\delta} < \mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j \text{ and } a\,\delta < b\,\gamma, \\[1.5ex]
\dfrac{c\,\gamma - b\,\delta}{b\,\gamma - a\,\delta} - \mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j & \text{if } \dfrac{c\,\gamma - b\,\delta}{b\,\gamma - a\,\delta} \ge \mathbf{h}_j^\top\mathbf{P}_{-j}\mathbf{h}_j.
\end{cases}
$$


Figure: When the global minimum is negative, either (a) $b/a < \delta/\gamma$ and the minimum for non-negative $\lambda_j$ is $\lambda_j = \infty$, or (b) $b/a > \delta/\gamma$ and the minimum for non-negative $\lambda_j$ is $\lambda_j = 0$.

Each individual optimisation is guaranteed to reduce (or at least not increase) the GCV error prediction. All $m$ regularisation parameters are optimised, one after the other and, if necessary, repeatedly, until GCV has stopped decreasing or is only decreasing by a very small amount. Naturally, the order in which the parameters are optimised, as well as their initial values, will affect which local minimum is reached.

The computational requirements of optimisation in local ridge regression are heavier than standard ridge regression or forward selection, since $\mathbf{P}$ changes and has to be updated after each individual optimisation. Some increase in efficiency can be gained by updating only the quantities needed to calculate the five coefficients $a$, $b$, $c$, $\gamma$ and $\delta$, such as $\mathbf{P}\mathbf{y}$ and $\mathbf{P}\mathbf{H}$, rather than $\mathbf{P}$ itself.

A.12 Forward Selection

To prove the formula in the main text for the change in sum-squared-error in forward selection, the expression for the sum-squared-error under standard ridge regression (appendix A.6),
$$
\hat{S}_m = \mathbf{y}^\top\mathbf{P}_m^2\mathbf{y}, \qquad
\hat{S}_{m+1} = \mathbf{y}^\top\mathbf{P}_{m+1}^2\mathbf{y},
$$
is combined with the update formula for the projection matrix (appendix A.7.1). In the absence of any regularisation, $\mathbf{P}_m$ and $\mathbf{P}_{m+1}$ are true projection matrices and so are idempotent ($\mathbf{P}^2 = \mathbf{P}$). Therefore
$$
\hat{S}_m = \mathbf{y}^\top\mathbf{P}_m\mathbf{y}, \qquad
\hat{S}_{m+1} = \mathbf{y}^\top\mathbf{P}_{m+1}\mathbf{y},
$$
and
$$
\hat{S}_m - \hat{S}_{m+1}
= \mathbf{y}^\top\left(\mathbf{P}_m - \mathbf{P}_{m+1}\right)\mathbf{y}
= \mathbf{y}^\top\frac{\mathbf{P}_m\mathbf{f}_J\mathbf{f}_J^\top\mathbf{P}_m}{\mathbf{f}_J^\top\mathbf{P}_m\mathbf{f}_J}\mathbf{y}
= \frac{\left(\mathbf{y}^\top\mathbf{P}_m\mathbf{f}_J\right)^2}{\mathbf{f}_J^\top\mathbf{P}_m\mathbf{f}_J}.
$$
This calculation requires of order $p^2$ floating point operations and has to be done for all $M$ candidates to find the maximum. Appendix A.13 describes a way of reducing this cost by a factor of $p$, the number of patterns in the training set.
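One greedy step of plain forward selection could be sketched as follows (an added illustration; the candidate matrix `F` and all names are my own, not part of the original package):

```python
import numpy as np

def best_candidate(P, F, y):
    """Score every candidate column f_J of the full design matrix F by
    the drop in sum-squared-error, (y' P f_J)^2 / (f_J' P f_J)."""
    PF = P @ F                                     # P f_J for every candidate
    scores = (y @ PF) ** 2 / np.einsum('ij,ij->j', F, PF)
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(8)
p, M = 40, 12
F = rng.normal(size=(p, M))      # all candidate basis functions as columns
y = rng.normal(size=p)
P = np.eye(p)                    # empty model: P_0 = I_p

J, scores = best_candidate(P, F, y)
f = F[:, J]
P = P - np.outer(P @ f, P @ f) / (f @ P @ f)   # update P after accepting f_J
```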

A.13 Orthogonal Least Squares

First we show that the factored form of the design matrix, $\mathbf{H}_m = \tilde{\mathbf{H}}_m\mathbf{U}_m$, results in a simple expression for the projection matrix, and then that this leads to an efficient method of computing the change in sum-squared-error due to a new basis function.

Substituting the factored form into the projection matrix gives
$$
\begin{aligned}
\mathbf{P}_m
&= \mathbf{I}_p - \tilde{\mathbf{H}}_m\mathbf{U}_m\left(\left(\tilde{\mathbf{H}}_m\mathbf{U}_m\right)^\top\tilde{\mathbf{H}}_m\mathbf{U}_m\right)^{-1}\left(\tilde{\mathbf{H}}_m\mathbf{U}_m\right)^\top \\
&= \mathbf{I}_p - \tilde{\mathbf{H}}_m\left(\tilde{\mathbf{H}}_m^\top\tilde{\mathbf{H}}_m\right)^{-1}\tilde{\mathbf{H}}_m^\top \\
&= \mathbf{I}_p - \sum_{j=1}^{m}\frac{\tilde{\mathbf{h}}_j\tilde{\mathbf{h}}_j^\top}{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}.
\end{aligned}
$$
The first step is due to one of the useful properties of matrices (appendix A.2) and the last step follows because the $\{\tilde{\mathbf{h}}_j\}_{j=1}^{m}$, the columns of $\tilde{\mathbf{H}}_m$, are mutually orthogonal.

The $(m+1)$-th projection matrix, with an extra column $\tilde{\mathbf{h}}_{m+1}$ in the design matrix, just adds another term to the sum. If the new column is from the full design matrix, i.e. if $\tilde{\mathbf{h}}_{m+1} = \tilde{\mathbf{f}}_J$, then
$$
\mathbf{P}_{m+1}
= \mathbf{I}_p - \sum_{j=1}^{m}\frac{\tilde{\mathbf{h}}_j\tilde{\mathbf{h}}_j^\top}{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}
- \frac{\tilde{\mathbf{f}}_J\tilde{\mathbf{f}}_J^\top}{\tilde{\mathbf{f}}_J^\top\tilde{\mathbf{f}}_J},
$$
and consequently, in the absence of regularisation ($\mathbf{P}^2 = \mathbf{P}$),
$$
\hat{S}_m - \hat{S}_{m+1}
= \mathbf{y}^\top\left(\mathbf{P}_m - \mathbf{P}_{m+1}\right)\mathbf{y}
= \mathbf{y}^\top\frac{\tilde{\mathbf{f}}_J\tilde{\mathbf{f}}_J^\top}{\tilde{\mathbf{f}}_J^\top\tilde{\mathbf{f}}_J}\mathbf{y}
= \frac{\left(\mathbf{y}^\top\tilde{\mathbf{f}}_J\right)^2}{\tilde{\mathbf{f}}_J^\top\tilde{\mathbf{f}}_J}.
$$
This is much faster to compute than the corresponding equation without orthogonal least squares. The orthogonalised weight vector is
$$
\tilde{\mathbf{w}}_m
= \left(\tilde{\mathbf{H}}_m^\top\tilde{\mathbf{H}}_m\right)^{-1}\tilde{\mathbf{H}}_m^\top\mathbf{y}
= \begin{bmatrix}
\frac{1}{\tilde{\mathbf{h}}_1^\top\tilde{\mathbf{h}}_1} & 0 & \cdots & 0 \\
0 & \frac{1}{\tilde{\mathbf{h}}_2^\top\tilde{\mathbf{h}}_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\tilde{\mathbf{h}}_m^\top\tilde{\mathbf{h}}_m}
\end{bmatrix}
\begin{bmatrix}
\tilde{\mathbf{h}}_1^\top \\
\tilde{\mathbf{h}}_2^\top \\
\vdots \\
\tilde{\mathbf{h}}_m^\top
\end{bmatrix}\mathbf{y},
$$
and its $j$-th component is therefore
$$
\tilde{w}_{m,j} = \frac{\mathbf{y}^\top\tilde{\mathbf{h}}_j}{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}.
$$
It is related to the unnormalised weight vector by
$$
\begin{aligned}
\hat{\mathbf{w}}_m
&= \left(\mathbf{H}_m^\top\mathbf{H}_m\right)^{-1}\mathbf{H}_m^\top\mathbf{y} \\
&= \mathbf{U}_m^{-1}\left(\tilde{\mathbf{H}}_m^\top\tilde{\mathbf{H}}_m\right)^{-1}\left(\mathbf{U}_m^\top\right)^{-1}\mathbf{U}_m^\top\tilde{\mathbf{H}}_m^\top\mathbf{y} \\
&= \mathbf{U}_m^{-1}\left(\tilde{\mathbf{H}}_m^\top\tilde{\mathbf{H}}_m\right)^{-1}\tilde{\mathbf{H}}_m^\top\mathbf{y} \\
&= \mathbf{U}_m^{-1}\tilde{\mathbf{w}}_m.
\end{aligned}
$$
For the purposes of model selection,
$$
\operatorname{trace}\!\left(\mathbf{P}_m\right)
= \operatorname{trace}\!\left(\mathbf{I}_p\right)
- \sum_{j=1}^{m}\frac{\operatorname{trace}\!\left(\tilde{\mathbf{h}}_j\tilde{\mathbf{h}}_j^\top\right)}{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}
= p - \sum_{j=1}^{m}\frac{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}
= p - m,
$$
and so, for example, GCV is
$$
\hat{\sigma}^2_{\mathrm{GCV}}
= \frac{p\,\mathbf{y}^\top\mathbf{P}_m^2\mathbf{y}}{\left(\operatorname{trace}(\mathbf{P}_m)\right)^2}
= \frac{p}{\left(p - m\right)^2}\,\mathbf{y}^\top\!\left(\mathbf{I}_p - \sum_{j=1}^{m}\frac{\tilde{\mathbf{h}}_j\tilde{\mathbf{h}}_j^\top}{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}\right)\!\mathbf{y}
= \frac{p}{\left(p - m\right)^2}\left(\mathbf{y}^\top\mathbf{y} - \sum_{j=1}^{m}\frac{\left(\mathbf{y}^\top\tilde{\mathbf{h}}_j\right)^2}{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}\right).
$$
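A compact, added sketch of the orthogonalisation and greedy selection described above (not the original Matlab code; all names and the number of selections are arbitrary):

```python
import numpy as np

def orthogonalise(F, selected):
    """Orthogonalise the candidate columns of F against the already
    selected (and mutually orthogonal) columns in `selected`."""
    F_tilde = F.copy()
    for h in selected:
        F_tilde -= np.outer(h, h @ F_tilde) / (h @ h)
    return F_tilde

rng = np.random.default_rng(9)
p, M = 50, 10
F = rng.normal(size=(p, M))      # full design matrix of candidates
y = rng.normal(size=p)

selected, chosen = [], []        # orthogonalised columns and their indices
for _ in range(3):               # pick three basis functions greedily
    F_tilde = orthogonalise(F, selected)
    drops = (y @ F_tilde) ** 2 / np.sum(F_tilde ** 2, axis=0)
    drops[chosen] = -np.inf      # never re-select a column already in the model
    J = int(np.argmax(drops))    # biggest decrease (y' f~)^2 / (f~' f~)
    chosen.append(J)
    selected.append(F_tilde[:, J])
```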

A.14 Regularised Forward Selection

To prove the formula for calculating the change in sum-squared-error with standard ridge regression, we combine the expression for the sum-squared-error (appendix A.6),
$$
\hat{S}_m = \mathbf{y}^\top\mathbf{P}_m^2\mathbf{y}, \qquad
\hat{S}_{m+1} = \mathbf{y}^\top\mathbf{P}_{m+1}^2\mathbf{y},
$$
with the update formula for the projection matrix (appendix A.7.1),
$$
\mathbf{P}_{m+1} = \mathbf{P}_m - \frac{\mathbf{P}_m\mathbf{f}_J\mathbf{f}_J^\top\mathbf{P}_m}{\lambda + \mathbf{f}_J^\top\mathbf{P}_m\mathbf{f}_J},
$$
which gives
$$
\hat{S}_m - \hat{S}_{m+1}
= \mathbf{y}^\top\left(\mathbf{P}_m^2 - \mathbf{P}_{m+1}^2\right)\mathbf{y}
= \frac{2\,\mathbf{y}^\top\mathbf{P}_m^2\mathbf{f}_J\;\mathbf{y}^\top\mathbf{P}_m\mathbf{f}_J\left(\lambda + \mathbf{f}_J^\top\mathbf{P}_m\mathbf{f}_J\right)
- \left(\mathbf{y}^\top\mathbf{P}_m\mathbf{f}_J\right)^2\mathbf{f}_J^\top\mathbf{P}_m^2\mathbf{f}_J}
{\left(\lambda + \mathbf{f}_J^\top\mathbf{P}_m\mathbf{f}_J\right)^2}.
$$
An alternative is to seek to maximise the decrease in the cost function (appendix A.6), which is
$$
\hat{C}_m - \hat{C}_{m+1}
= \mathbf{y}^\top\left(\mathbf{P}_m - \mathbf{P}_{m+1}\right)\mathbf{y}
= \mathbf{y}^\top\frac{\mathbf{P}_m\mathbf{f}_J\mathbf{f}_J^\top\mathbf{P}_m}{\lambda + \mathbf{f}_J^\top\mathbf{P}_m\mathbf{f}_J}\mathbf{y}
= \frac{\left(\mathbf{y}^\top\mathbf{P}_m\mathbf{f}_J\right)^2}{\lambda + \mathbf{f}_J^\top\mathbf{P}_m\mathbf{f}_J}.
$$

A.15 Regularised Orthogonal Least Squares

If computational speed is important then orthogonal least squares (appendix A.13) can be adapted to standard ridge regression although, as explained in the main text, a change of cost function is involved. This appendix gives some more of the details.

The equations are very similar to orthogonal least squares (appendix A.13), except for the extra $\lambda$ term and the fact that the projection matrix is no longer idempotent, so when we calculate the change in sum-squared-error we cannot use the simplification $\mathbf{P}^2 = \mathbf{P}$. Writing $\mathbf{P}_m$ as a sum we get
$$
\begin{aligned}
\mathbf{P}_m
&= \mathbf{I}_p - \tilde{\mathbf{H}}_m\left(\tilde{\mathbf{H}}_m^\top\tilde{\mathbf{H}}_m + \lambda\,\mathbf{I}_m\right)^{-1}\tilde{\mathbf{H}}_m^\top \\
&= \mathbf{I}_p - \tilde{\mathbf{H}}_m
\begin{bmatrix}
\frac{1}{\lambda + \tilde{\mathbf{h}}_1^\top\tilde{\mathbf{h}}_1} & 0 & \cdots & 0 \\
0 & \frac{1}{\lambda + \tilde{\mathbf{h}}_2^\top\tilde{\mathbf{h}}_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\lambda + \tilde{\mathbf{h}}_m^\top\tilde{\mathbf{h}}_m}
\end{bmatrix}
\tilde{\mathbf{H}}_m^\top \\
&= \mathbf{I}_p - \sum_{j=1}^{m}\frac{\tilde{\mathbf{h}}_j\tilde{\mathbf{h}}_j^\top}{\lambda + \tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}.
\end{aligned}
$$
The $(m+1)$-th projection matrix adds another term to this sum. The change in sum-squared-error due to the addition of $\tilde{\mathbf{f}}_J$ from the full design matrix (i.e. if $\tilde{\mathbf{h}}_{m+1} = \tilde{\mathbf{f}}_J$) is then
$$
\hat{S}_m - \hat{S}_{m+1}
= \mathbf{y}^\top\left(\mathbf{P}_m^2 - \mathbf{P}_{m+1}^2\right)\mathbf{y}
= \frac{\left(\mathbf{y}^\top\tilde{\mathbf{f}}_J\right)^2\left(\tilde{\mathbf{f}}_J^\top\tilde{\mathbf{f}}_J + 2\,\lambda\right)}{\left(\lambda + \tilde{\mathbf{f}}_J^\top\tilde{\mathbf{f}}_J\right)^2}.
$$
An alternative search criterion, the change in the modified cost function, is
$$
\hat{C}_m - \hat{C}_{m+1} = \frac{\left(\tilde{\mathbf{f}}_J^\top\mathbf{y}\right)^2}{\lambda + \tilde{\mathbf{f}}_J^\top\tilde{\mathbf{f}}_J}.
$$
The $j$-th component of the orthogonalised weight vector is
$$
\tilde{w}_{m,j} = \frac{\mathbf{y}^\top\tilde{\mathbf{h}}_j}{\lambda + \tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j},
$$
and it is related to the unnormalised weight vector by the same equation we had previously, namely
$$
\hat{\mathbf{w}}_m = \mathbf{U}_m^{-1}\tilde{\mathbf{w}}_m.
$$
For the purposes of model selection,
$$
\operatorname{trace}\!\left(\mathbf{P}_m\right)
= \operatorname{trace}\!\left(\mathbf{I}_p\right)
- \sum_{j=1}^{m}\frac{\operatorname{trace}\!\left(\tilde{\mathbf{h}}_j\tilde{\mathbf{h}}_j^\top\right)}{\lambda + \tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}
= p - \sum_{j=1}^{m}\frac{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}{\lambda + \tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}
= p - \gamma_m,
$$
where $\gamma_m$ is the effective number of parameters (appendix A.8), in this case
$$
\gamma_m = \sum_{j=1}^{m}\frac{\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}{\lambda + \tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j}.
$$
Then GCV is
$$
\hat{\sigma}^2_{\mathrm{GCV}}
= \frac{p\,\mathbf{y}^\top\mathbf{P}_m^2\mathbf{y}}{\left(\operatorname{trace}(\mathbf{P}_m)\right)^2}
= \frac{p}{\left(p - \gamma_m\right)^2}
\left(\mathbf{y}^\top\mathbf{y} - \sum_{j=1}^{m}\frac{\left(\mathbf{y}^\top\tilde{\mathbf{h}}_j\right)^2\left(\tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j + 2\,\lambda\right)}{\left(\lambda + \tilde{\mathbf{h}}_j^\top\tilde{\mathbf{h}}_j\right)^2}\right).
$$

References

D.M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics.

L. Breiman and J. Friedman. Predicting multivariate responses in multiple linear regression. Technical report, Department of Statistics, University of California, Berkeley.

D.S. Broomhead and D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems.

S. Chen, C.F.N. Cowan and P.M. Grant. Orthogonal least squares learning for radial basis function networks. IEEE Transactions on Neural Networks.

S. Geman, E. Bienenstock and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation.

G.H. Golub, M. Heath and G. Wahba. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics.

C. Gu and G. Wahba. Minimising GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM Journal on Scientific and Statistical Computing.

J. Hertz, A. Krogh and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

R.R. Hocking. The analysis and selection of variables in linear regression. Biometrics.

R.R. Hocking. Developments in linear regression methodology (with discussion). Technometrics.

A.E. Hoerl and R.W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics.

R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, UK.

A.S. Weigend, M. Mangeas and A.N. Srivastava. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. International Journal of Neural Systems.

B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall.

B. Fritzke. Growing cell structures: a self-organizing network for unsupervised and supervised learning. Neural Networks.

C. Bishop. Improving the generalisation properties of radial basis function neural networks. Neural Computation.

D.J.C. MacKay. Bayesian interpolation. Neural Computation.

J.E. Moody. The effective number of parameters: an analysis of generalisation and regularisation in nonlinear learning systems. In J.E. Moody, S.J. Hanson and R.P. Lippmann, editors, Neural Information Processing Systems. Morgan Kaufmann, San Mateo, CA.

M.J.L. Orr. Local smoothing of radial basis function networks (long version). In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan.

M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation.

V. Kadirkamanathan and M. Niranjan. A function estimation approach to sequential learning with neural networks. Neural Computation.

C. Mallows. Some comments on C_p. Technometrics.

D. Michie, D.J. Spiegelhalter and C.C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, London.

J. Platt. A resource-allocating network for function interpolation. Neural Computation.

W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery. Numerical Recipes in C. Cambridge University Press, Cambridge, UK, second edition.

J.O. Rawlings. Applied Regression Analysis. Wadsworth & Brooks/Cole, Pacific Grove, CA.

W.S. Sarle. Neural networks and statistical models. In Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics.

A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-Posed Problems. Winston, Washington.