
Statistical Active Learning in Multilayer Perceptrons

Kenji Fukumizu
Brain Science Institute, RIKEN
Hirosawa 2-1, Wako, Saitama 351-0198, Japan
Tel: +81-48-467-9664, Fax: +81-48-467-9693
E-mail: [email protected]

Abstract -- This paper proposes new methods of generating input locations actively in gathering training data, aiming at solving problems special to multilayer perceptrons. One of the problems is that the optimum input locations, which are calculated deterministically, sometimes result in badly distributed data and cause local minima in back-propagation training. Two probabilistic active learning methods, which utilize the statistical variance of locations, are proposed to solve this problem. One is parametric active learning and the other is multi-point-search active learning. Another serious problem in applying active learning to multilayer perceptrons is the singularity of a Fisher information matrix, whose regularity is assumed in many methods, including the proposed ones. A technique of pruning redundant hidden units is proposed to keep the regularity of a Fisher information matrix, which makes active learning applicable to multilayer perceptrons. The effectiveness of the proposed methods is demonstrated through computer simulations on simple artificial problems and a real-world problem in color conversion.

Keywords -- Active learning, Multilayer perceptron, Fisher information matrix, Pruning.

I. Introduction

WHEN we train a learning machine like a feedforward neural network to estimate the true input-output relation of a target system, we must prepare input vectors, observe the corresponding output vectors, and pair them to make training data. It is well known that we can improve the ability of a learning machine by designing the input of the training data. Such methods of selecting the location of input vectors have long been studied under the names of experimental design ([1]), response surface methodology ([2]), active learning ([3],[4]), and query construction ([5]). They are especially important when collecting data is very expensive.

This paper discusses statistical active learning methods for the multilayer perceptron (MLP) model. We consider learning of a network as statistical estimation of a regression function. The accuracy of the estimation is often evaluated using the generalization error, which is the mean square error between the true function and its estimate. In this paper, the objective of active learning is to reduce the generalization error. Using statistical asymptotic theory, we can derive a criterion on which input locations are effective in minimizing the generalization error ([6]).

The main purpose of this paper is to solve problems related to special properties of multilayer networks. One problem is that a learning rule like error back-propagation cannot always achieve the global minimum of the training error, while many statistical active learning or experimental design methods assume its availability. We see that learning with the optimal data which are calculated deterministically is trapped by local minima more often than passive learning. To overcome this problem, we propose probabilistic methods, which generate an input point with some deviation from the optimal location.

Another problem is caused by the singularity of a Fisher information matrix. Many statistical active learning methods assume the regularity of a Fisher information matrix ([1],[4],[6]), which plays an important role in the asymptotic behavior of the least square error estimator ([7],[8],[9],[10],[11]). The Fisher information matrix of an MLP, however, can be singular if the network has redundant hidden units. Since active learning methods usually require that the prepared model include the true function, the number of hidden units must be large enough to realize it with high accuracy. Thus, the model tends to be redundant, especially in active learning. To solve this problem, we propose active learning with hidden unit pruning, based on the regularity condition of a Fisher information matrix of an MLP ([12]). The method removes redundant hidden units to keep the regularity of a Fisher information matrix, and makes active learning methods applicable to the MLP model.

This paper is organized as follows. In Section II, we give basic definitions and terminology, and describe an active learning criterion. In Section III, we propose two novel active learning methods based on the probabilistic optimality of training data. In Section IV, we explain a problem concerning the singularity of a Fisher information matrix, and propose a pruning technique. Section V demonstrates the effectiveness of the proposed methods through an application to a real-world problem, and Section VI concludes this paper.

K. Fukumizu is with the Brain Science Institute, RIKEN, Saitama, Japan. E-mail: [email protected]


II. Active learning in statistical learning

A. Basic definitions and terminology

First, we give the basic definitions and terminology on which our active learning methods are based.

We discuss the three-layer perceptron model defined by

f_i(x; \theta) = \sum_{j=1}^{H} w_{ij} \, s\Bigl( \sum_{k=1}^{L} u_{jk} x_k + \zeta_j \Bigr) + \eta_i, \qquad (1 \le i \le M), \qquad (1)

where \theta = (w_{11}, \ldots, w_{MH}, \eta_1, \ldots, \eta_M, u_{11}, \ldots, u_{HL}, \zeta_1, \ldots, \zeta_H) represents the weights and biases, and s(t) = 1/(1 + e^{-t}) is the sigmoidal function.

We assume that the target system to be estimated by a network is a function f(x), and that the output of the system is observed with additive Gaussian noise. Then, an output datum y follows

y = f(x) + Z, \qquad (2)

where Z is a random vector with zero mean and scalar covariance \sigma^2 I_M. To obtain a set of training data D = \{ (x^{(\nu)}, y^{(\nu)}) \mid \nu = 1, \ldots, N \}, we prepare input vectors X_N = \{ x^{(\nu)} \}, feed them to the target system, and observe the output vectors \{ y^{(\nu)} \} subject to eq. (2). The problem of active learning is how to prepare X_N.

When a set of training data D is given, we employ the least square error (LSE) estimator θ̂, that is,

\hat{\theta} = \arg\min_{\theta} \sum_{\nu=1}^{N} \| y^{(\nu)} - f(x^{(\nu)}; \theta) \|^2. \qquad (3)

Unlike linear models, whose experimental design has been extensively studied in the field of statistics ([1]), the solution of eq. (3) cannot be calculated rigorously in the case of neural networks. An iterative learning rule like error back-propagation is needed to obtain an approximation of θ̂. To derive an active learning criterion, however, we assume the availability of θ̂. A problem related to this assumption is discussed later.

We use the generalization error to evaluate the ability of a trained network. For the definition, we introduce the environmental probability Q, which gives independent input vectors in the actual environment where a trained network is to be located. In system identification, for example, Q represents the distribution of the input vectors given to the system. The generalization error is defined by

E_{gen} = \int \| f(x; \hat{\theta}) - f(x) \|^2 \, dQ(x), \qquad (4)

which is the mean square error between the true function and its estimate. The purpose of our active learning methods is to reduce the expectation of the generalization error E[E_gen]. The expectation E[·] is taken with respect to the training data, as θ̂ is a random vector depending on the statistical training data D.

If the input vectors {x^(ν)} are independent samples from the environmental distribution Q, such learning is called passive. Active learning is, of course, expected to be superior to passive learning with respect to the generalization error. When the number of training data is sufficiently large, and if the true function is included in the model, statistical asymptotic theory tells us that E[E_gen] of passive learning is approximately (σ²/N) S, where S is the dimension of θ ([8],[10]).
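For concreteness, the following Python sketch (an illustration added here, not part of the original experiments) puts eqs. (1)-(4) into code for a network with one input and one output unit: the three-layer perceptron of eq. (1), a plain gradient-descent approximation of the LSE estimator of eq. (3) in place of back-propagation, and a Monte Carlo estimate of the generalization error of eq. (4). The target function, Q = N(0, 1), the network size, and all step sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 2                                          # hidden units (illustrative)

def mlp(x, theta):
    """Eq. (1) for one input and one output unit: f(x) = sum_j w_j s(u_j x + zeta_j) + eta."""
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    s = 1.0 / (1.0 + np.exp(-(np.outer(x, u) + zeta)))   # hidden-unit outputs
    return s @ w + eta

def lse_fit(X, Y, steps=5000, lr=0.1):
    """Gradient-descent approximation of the LSE estimator of eq. (3)
    (a stand-in for back-propagation; gradients by central differences)."""
    theta = rng.normal(scale=0.5, size=3*H + 1)
    eps = 1e-5
    sse = lambda t: np.sum((Y - mlp(X, t))**2)
    for _ in range(steps):
        g = np.array([(sse(theta + eps*e) - sse(theta - eps*e)) / (2*eps)
                      for e in np.eye(theta.size)])
        theta -= lr * g / len(X)
    return theta

# Illustrative target realizable by the model, observed with noise as in eq. (2).
f_true = lambda x: 1.0 / (1.0 + np.exp(-x))
X = rng.normal(size=30)                        # passive inputs drawn from Q = N(0, 1)
Y = f_true(X) + 0.1 * rng.normal(size=30)
theta_hat = lse_fit(X, Y)

# Monte Carlo estimate of the generalization error of eq. (4).
xq = rng.normal(size=10000)
E_gen = np.mean((mlp(xq, theta_hat) - f_true(xq))**2)
print(f"estimated generalization error: {E_gen:.2e}")
```

The Monte Carlo average over samples from Q plays the role of the integral in eq. (4); the same device is reused below when the criterion itself has to be evaluated.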


B. Criterion of statistical active learning

Because our principle of learning is to minimize the expectation of the generalization error, in order to construct an active learning method we must evaluate how E[E_gen] depends on X_N. There are, in general, several kinds of methods to estimate the generalization error. One is to use statistical asymptotic theory ([7]), and another is to use resampling techniques like the bootstrap ([13]) or cross-validation ([14]). The concept of structural risk minimization (SRM, [15]) developed by Vapnik also gives a solid basis for discussing generalization problems. In this paper, we employ a method based on the asymptotic theory. The resampling techniques, which estimate the generalization error using given training data, are not suitable for active learning, in which we have to know how the generalization error depends on an input point before the data is actually generated. We do not adopt the SRM principle either, because it is based on a worst-case bound, unlike our objective of minimizing the expectation of the generalization error.

For the approximation of eq. (4), we assume that the true function f(x) is completely included in the model and f(x; θ_o) = f(x). This assumption is not rigorously satisfied in practical problems. In general, the expectation of the generalization error can be decomposed as

E[E_{gen}] = E\Bigl[ \int \| f(x; \hat{\theta}) - f(x; \theta_o) \|^2 \, dQ(x) \Bigr] + \int \| f(x; \theta_o) - f(x) \|^2 \, dQ(x), \qquad (5)

where θ_o is the parameter that gives \min_{\theta} \int \| f(x; \theta) - f(x) \|^2 \, dQ(x). The first and second terms in eq. (5) are called the variance and the bias of the model, respectively. Moody ([16]), for example, discusses the generalization error in a framework of nonparametric regression which allows the model bias. However, it is very difficult to describe explicitly the dependence of E[E_gen] on X_N if the model bias exists. Therefore, we assume that the bias of the model is small enough to be neglected, and that active learning is supposed to reduce the variance term. In Section IV, we discuss how to solve the problem of the model bias.

Similarly to Cohn's discussion ([4]), application of the asymptotic theory ([9],[10]), or local linearization under the bias-free assumption, shows

E[E_{gen}] \approx \sigma^2 \, \mathrm{Tr}\bigl[ I(\theta_o) \, J(\theta_o; X_N)^{-1} \bigr], \qquad (6)

where the matrices I(θ) and J(θ; X_N) are defined by

I(\theta) = \int I(x; \theta) \, dQ(x), \qquad (7)

J(\theta; X_N) = \sum_{\nu=1}^{N} I(x^{(\nu)}; \theta), \qquad (8)

I_{ab}(x; \theta) = \Bigl( \frac{\partial f(x; \theta)}{\partial \theta_a} \Bigr)^{T} \frac{\partial f(x; \theta)}{\partial \theta_b}. \qquad (9)

The matrices I(θ) and J(θ; X_N) are called Fisher information matrices or asymptotic covariance matrices. Note that the matrix I(θ) is averaged with the environmental probability Q, while J(θ; X_N) is calculated using the empirical data X_N. Replacing the unknown parameter θ_o with its current estimate θ̂, we adopt the following as the criterion of active learning:

\mathrm{Tr}\bigl[ I(\hat{\theta}) \, J(\hat{\theta}; X_N)^{-1} \bigr]. \qquad (10)

This criterion is equivalent to Q-optimality ([1]) if the model is linear. Thus, our active learning criterion for neural networks is a nonlinear extension of Q-optimality. In the rest of this paper, we discuss special problems caused by the nonlinearity of neural networks. Similar criteria are derived by MacKay ([3]) and Cohn ([4]). However, their criteria are based on the error at one point in order to avoid the integral calculation. We perform the numerical integral calculation to keep the principle of minimizing the generalization error.

C. Problem of deterministic active learning

In this subsection, we explain that a simple implementation of the above active learning criterion has a problem. We employ sequential active learning, which is commonly used in experimental design ([1]), because we should update θ̂ in eq. (10) to obtain a more accurate estimate each time a new training datum is added. Design of the next input point, observation of the response, and estimation of θ are iteratively performed in sequential learning (Fig. 1).

Fig. 1. Sequential active learning.

The simplest sequential active learning method is described as follows ([1]). When we have n-1 training data D_{n-1} and the corresponding LSE estimator θ̂_{n-1}, we select the next input x^(n) according to

x^{(n)} = \arg\min_{x} \mathrm{Tr}\bigl[ I(\hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x\})^{-1} \bigr]. \qquad (11)

We call this deterministic active learning, because the location of the next input is selected deterministically.

In the case of neural networks, this method does not necessarily work well. Training of a neural network does not always give the correct LSE estimator, because of local minima and plateaus. The above method tends to generate training data that are trapped by local minima more easily. We explain the reason briefly. It is known that the optimal data that minimize the left-hand side of eq. (6) can be approximated by a data set on a fixed number of input locations, because any Fisher information matrix at θ_o can be approximately realized using a data set on S(S+1)/2 + 1 points ([1], Theorem 2.1.2). Therefore, it is very likely that the same input positions are repeatedly selected in deterministic active learning. Obviously, such a training data set makes the convergence of back-propagation much more difficult.

We illustrate this influence with a simple experiment using an MLP network with 2 input, 2 hidden, and 1 output unit. The target function is also defined by a parameter in this model (Fig. 2). The normal distribution N(0, 16 I_2) is used for Q, where N(m, Σ) means the normal distribution with mean m and variance-covariance matrix Σ. Fig. 3 shows the average of the generalization errors over 50 trials with different initial training data sets. The result of deterministic active learning is inferior to that of passive learning after 60 data. We find that the parameter sometimes does not approach θ_o because of the excessive localization of the training data, which is shown clearly in Fig. 4.

Fig. 2. The true function of the 2-2-1 MLP model.

Fig. 3. Deterministic active learning (average of generalization errors over 50 trials vs. number of training data, compared with passive learning).

Fig. 4. Distributions of input data (deterministic active learning and passive learning).
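To make the quantities of eqs. (7)-(11) concrete, the sketch below (again an illustration under assumed sizes and an assumed Q, not the code used for the experiments) computes the pointwise matrix I(x; θ) of eq. (9) from the parameter gradient, the empirical matrix J(θ; X_N) of eq. (8), a Monte Carlo approximation of I(θ) of eq. (7), and the selection rule of eq. (11) evaluated over a finite candidate grid; the true rule minimizes over all x, so the grid is a simplification.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 2                                          # hidden units (illustrative)

def mlp(x, theta):
    """Scalar-output three-layer perceptron of eq. (1) with one input unit."""
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    """Eq. (9): I(x; theta) = (df/dtheta)(df/dtheta)^T, gradient by central differences."""
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

def J_emp(theta, X):
    """Eq. (8): J(theta; X_N) = sum_nu I(x^(nu); theta)."""
    return sum(fisher_point(x, theta) for x in X)

def I_env(theta, n_mc=2000):
    """Eq. (7): I(theta) = E_Q[I(x; theta)], Monte Carlo under an assumed Q = N(0, 1)."""
    return sum(fisher_point(x, theta) for x in rng.normal(size=n_mc)) / n_mc

def next_input_deterministic(theta_hat, X_n, grid):
    """Eq. (11): among candidate locations, pick the one minimizing
    Tr[ I(theta_hat) J(theta_hat; X_n + {x})^{-1} ]."""
    I_mat = I_env(theta_hat)                   # does not depend on the candidate
    J_base = J_emp(theta_hat, X_n)             # adding x contributes one rank-one term
    crit = lambda x: np.trace(np.linalg.solve(J_base + fisher_point(x, theta_hat), I_mat))
    return min(grid, key=crit)

# Toy usage with an arbitrary current estimate and 15 existing inputs.
theta_hat = rng.normal(scale=0.5, size=3*H + 1)
X_n = rng.normal(size=15)
x_next = next_input_deterministic(theta_hat, X_n, np.linspace(-4.0, 4.0, 81))
print("next input selected by eq. (11):", x_next)
```

Because any Fisher information matrix can be matched by reusing at most S(S+1)/2 + 1 locations, a loop built around such a selection rule tends to propose the same few points again and again, which is exactly the excessive localization discussed above.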


III. Probabilistic active learning

A. Probabilistic active learning methods

We propose two probabilistic active learning methods. One is parametric active learning, which utilizes a parametric probability family to generate a new input point. This is a slight refinement of the method proposed by Fukumizu ([17]). The other is multi-point-search active learning, which generates a finite number of input points as candidates and selects the best one. In both methods, we introduce randomness, which is expected to solve the problem of excessive localization.

A.1 Parametric active learning

Instead of optimizing a point x in eq. (10), we introduce a parametric family of density functions {r(x; v)} for generating x, and try to optimize the density. A possible choice of {r(x; v)} is a normal mixture model defined by

r(x; v) = \sum_{k=1}^{K} \frac{c_k}{(2\pi \sigma_k^2)^{L/2}} \exp\Bigl( -\frac{\|x - m_k\|^2}{2\sigma_k^2} \Bigr), \qquad (12)

where \sum_{k=1}^{K} c_k = 1, c_k \ge 0 (k = 1, \ldots, K), and v = (c_1, m_1, \sigma_1, \ldots, c_K, m_K, \sigma_K) is a variable parameter vector. Since a normal mixture converges to a point distribution if σ_k goes to zero, we should restrict the value of σ_k to [A, ∞) for a positive A.

We optimize the density by finding the best v to minimize

\mathrm{Tr}\Bigl[ I(\hat{\theta}_{n-1}) \bigl( J(\hat{\theta}_{n-1}; X_{n-1}) + J(\hat{\theta}_{n-1}; r_v) \bigr)^{-1} \Bigr], \qquad (13)

where

J(\theta; r_v) = \int I(x; \theta) \, r(x; v) \, dx. \qquad (14)

The algorithm is described as follows.

[PARAMETRIC ACTIVE LEARNING]
1. Prepare an initial set of training data D_{N_0}.
2. Calculate the initial estimator θ̂_{N_0} with respect to D_{N_0}.
3. Prepare an initial parameter v_{N_0}.
4. n := N_0 + 1.
5. Find v_n that minimizes the objective of eq. (13), using a numerical optimization method.
6. Generate an input datum x^(n) from r(x; v_n).
7. Feed x^(n) to the target system. Observe a response y^(n).
8. Set D_n := D_{n-1} ∪ {(x^(n), y^(n))}.
9. Calculate the LSE estimator θ̂_n with respect to D_n.
10. n := n + 1.
11. If n > N, then END; otherwise go to 5.

Although the selected data are optimal only in a probabilistic sense at best, they are distributed over the input space more widely than those of deterministic active learning. We can expect this to prevent the excessive localization of the training data. However, this method needs the integral calculation of J(θ̂_{n-1}; r_v) in each iteration of the numerical optimization. The calculation cost is very expensive.
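The following sketch illustrates one pass of steps 5 and 6 for a one-dimensional input: the objective of eq. (13) is estimated by Monte Carlo with the mixture density of eq. (12) and minimized with a generic optimizer. The reparameterization of v (softmax weights, a softplus lower bound A on σ_k), the use of scipy's Nelder-Mead routine, and all sizes are illustrative assumptions; the paper does not prescribe a particular optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
H, K, A = 2, 4, 0.3          # hidden units, mixture components, lower bound on sigma_k

def mlp(x, theta):
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    """Eq. (9) for a scalar output, gradient by central differences."""
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

def unpack(v):
    """Free parameters -> (c_k, m_k, sigma_k) with sum_k c_k = 1 and sigma_k >= A."""
    c = np.exp(v[:K] - v[:K].max()); c /= c.sum()
    m = v[K:2*K]
    sigma = A + np.logaddexp(0.0, v[2*K:3*K])          # softplus keeps sigma_k in [A, inf)
    return c, m, sigma

z = rng.normal(size=16)                                 # fixed draws -> smooth MC objective

def objective(v, theta, J_data, I_env):
    """Eq. (13): Tr[ I(theta) (J(theta; X) + J(theta; r_v))^{-1} ], where eq. (14) is
    estimated as J(theta; r_v) = sum_k c_k E_{x ~ N(m_k, sigma_k^2)}[ I(x; theta) ]."""
    c, m, sigma = unpack(v)
    J_r = sum(ck * np.mean([fisher_point(mk + sk * zz, theta) for zz in z], axis=0)
              for ck, mk, sk in zip(c, m, sigma))
    return np.trace(np.linalg.solve(J_data + J_r, I_env))

# One step of the parametric method for toy values of the current estimate and data.
theta_hat = rng.normal(scale=0.5, size=3*H + 1)
X_n = rng.normal(size=20)
J_data = sum(fisher_point(x, theta_hat) for x in X_n)                  # eq. (8)
xq = rng.normal(size=1000)                                             # Q = N(0, 1) assumed
I_env = sum(fisher_point(x, theta_hat) for x in xq) / len(xq)          # eq. (7)

res = minimize(objective, np.zeros(3*K), args=(theta_hat, J_data, I_env),
               method="Nelder-Mead", options={"maxiter": 100})         # modest budget
c, m, sigma = unpack(res.x)
k = int(rng.choice(K, p=c))
x_new = rng.normal(m[k], sigma[k])          # step 6: draw the next input from r(x; v_n)
print("next input drawn from the optimized mixture:", x_new)
```

Using a fixed set of standard normal draws z for every objective evaluation (common random numbers) keeps the Monte Carlo estimate of eq. (14) a smooth function of v, which makes the derivative-free search better behaved; it does not change the method itself.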


A.2 Multi-point-search active learning

We consider a method in which multiple candidates for the next point are generated, and the one that minimizes

\mathrm{Tr}\bigl[ I(\hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x\})^{-1} \bigr]

is selected. If the number of candidates increases with the number of training data, the best candidate converges to the true optimal location. This method is more random in the early stage of the training, and it comes to generate the optimal data gradually. It aims at avoiding local minima when the number of training data is small. Learning in the early stage is especially important, because it is very difficult to converge to θ_o if data are generated based on a wrong estimate of θ. If we generate random candidates subject to Q, the learning moves from passive to active.

The algorithm is described as follows. The number of candidates for the nth training datum, K_n, is an increasing function of n.

[MULTI-POINT-SEARCH ACTIVE LEARNING]
1. Prepare an initial set of training data D_{N_0}.
2. Calculate the initial estimator θ̂_{N_0} with respect to D_{N_0}.
3. n := N_0 + 1.
4. Generate K_n input data x_{<1>}, ..., x_{<K_n>}. Choose x^(n) according to
   x^{(n)} = \arg\min_{x_{<k>}} \mathrm{Tr}\bigl[ I(\hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x_{<k>}\})^{-1} \bigr].
5. Feed x^(n) to the target system. Observe a response y^(n).
6. Set D_n := D_{n-1} ∪ {(x^(n), y^(n))}.
7. Calculate the LSE estimator θ̂_n with respect to D_n.
8. n := n + 1.
9. If n > N, then END; otherwise go to 4.

This method does not require numerical minimization. This remarkably reduces the computational cost, which is often a problem of active learning methods.
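A single step of the method is easy to state in code. The sketch below is an illustration with assumed sizes; Q = N(0, 1) and the candidate schedule K_n = ⌈10√(n-10)⌉ used for method B in the experiments of Section III.C are taken as assumptions here.

```python
import numpy as np

rng = np.random.default_rng(3)
H = 2

def mlp(x, theta):
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    """Eq. (9) for a scalar output, gradient by central differences."""
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

def multi_point_search_step(theta_hat, X_n, K_n, sample_q, n_mc=1000):
    """Step 4: draw K_n candidates from Q and keep the one minimizing
    Tr[ I(theta_hat) J(theta_hat; X_n + {x})^{-1} ]."""
    I_env = sum(fisher_point(x, theta_hat) for x in sample_q(n_mc)) / n_mc   # eq. (7)
    J_base = sum(fisher_point(x, theta_hat) for x in X_n)                    # eq. (8)
    candidates = sample_q(K_n)
    scores = [np.trace(np.linalg.solve(J_base + fisher_point(x, theta_hat), I_env))
              for x in candidates]
    return candidates[int(np.argmin(scores))]

# Toy usage: candidates drawn from an assumed Q = N(0, 1).
sample_q = lambda k: rng.normal(size=k)
theta_hat = rng.normal(scale=0.5, size=3*H + 1)
X_n = rng.normal(size=20)
n = len(X_n) + 1
K_n = int(np.ceil(10 * np.sqrt(n - 10)))
x_next = multi_point_search_step(theta_hat, X_n, K_n, sample_q)
print("selected candidate:", x_next)
```

Note that I(θ̂) and J(θ̂; X_{n-1}) are computed once per step and each candidate only contributes one additional rank-one term, so the cost grows mildly with K_n.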

B. Comparison with other active learning methods

Other active learning methods which use Fisher information matrices have been proposed. We briefly review them and compare them with ours.

The most famous criterion of experimental design is D-optimality ([1]), which selects input data that maximize det J(θ_o; X_N). It is known ([18]) that under some conditions D-optimality is equivalent to the minimax criterion that selects the input data X_N according to

\min_{X_N} \max_{x} E\bigl[ \| f(x; \hat{\theta}) - f(x; \theta_o) \|^2 \bigr]. \qquad (15)

In a sequential implementation of D-optimality ([1]), the selected point attains the maximum of the expected variance of the estimation, defined by

V(x) = E\bigl[ \| y - f(x; \hat{\theta}) \|^2 \bigr]. \qquad (16)

Kindermann et al. ([19]) propose an active learning method for neural networks based on this criterion. They use the bootstrap to estimate V(x). The computational cost of this method is very expensive, since we have to perform both the bootstrap and the numerical optimization of the input point.

These criteria are clearly different from ours in that they do not minimize the generalization error. Which criterion should be applied depends on the purpose of learning.

Cohn ([4]) proposes a method that uses reference points to avoid the integral calculation of I(θ). He uses a random reference point x_r, and selects the next point that minimizes

\mathrm{Tr}\bigl[ I(x_r; \hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x\})^{-1} \bigr]. \qquad (17)

Although our methods look similar to this one, they are essentially different in that the objective of our methods is to minimize the generalization error. Note that the above criterion is different from eq. (10) even if x_r is taken from Q, because the minimization and the integral are not interchangeable. However, we can expect that this method also has the effect of avoiding localization of input points through the variation of the reference points.

C. Experimental results on active learning methods

We show simple experimental results to compare the performance and properties of the active learning methods.

The first experiment is a very simple one to see the basic properties of various active learning methods. We use the MLP model with 1 input, 1 hidden, and 1 output unit. The target function is given by

f(x) = s(x), \qquad (18)

which is realized by the model. The total number of training data is 100. The initial 10 data are given passively, subject to Q = N(0, 1). The standard deviation of the noise added to the output is 0.1. We compare the following five methods:
A. parametric active learning
B. multi-point-search active learning
C. maximum variance point ([19])
D. usage of reference points ([4])
E. passive learning
In method A, we use a mixture model of four normal distributions. The candidates in method B are generated by Q, and K_n = ⌈10√(n-10)⌉. In method C, we use 20 bootstrap samples in estimating V(x). In method D, the probability used to generate the reference points is Q.

Fig. 5 shows the average of the generalization errors for 50 sets of training data. Active learning with methods A, B, and D outperforms passive learning. The proposed methods, A and B, show good performance in the generalization error. Interestingly, method D is as good as A and B, though its criterion does not precisely minimize the generalization error.


Method C also shows effectiveness for a small number of training data, while its final result is worse than that of passive learning. This is reasonable because the criterion of method C is different from minimization of the generalization error. In fact, many training data are selected very far from the high-density region of Q.

Next, we apply the active learning methods to see their performance on a slightly more complicated problem, which is the same as the one in Section II.C (Fig. 2). We omit method C in this simulation, since it is computationally very expensive and we know from the previous experiment that its performance in the generalization error is not so high. Fig. 6 shows the average of the generalization errors for 50 data sets. In this case, the multi-point-search method shows the best performance. Although parametric active learning still shows much better performance than passive learning, its effect is not as remarkable as that of the multi-point-search method. One reason is that the density model r(x; v), which is the mixture of 4 normal distributions, is not sufficient to express the optimal density. In fact, as we can see in Fig. 7, the distribution of the input data in the parametric method is more concentrated around the center than in the multi-point-search method. Although method D shows effectiveness in the early stage of learning, it is worse than passive learning after the number of data becomes large. This seems natural because its criterion is not equivalent to the generalization error.

Fig. 5. Comparison of active learning methods (1-1-1): average of generalization errors (50 trials) vs. number of training data.

Fig. 6. Comparison of active learning methods (2-2-1): average of generalization errors (50 trials) vs. number of training data.

In both simulations, our probabilistic active learning methods show a significant reduction of the generalization error. The multi-point-search method shows almost the best performance in both simulations. In the parametric method, we have to choose a density family r(x; v) carefully, as it has an essential influence on the performance. It is also a disadvantage of the parametric method that the deviation of the data from the optimal position remains even after the training has converged successfully. On the other hand, in the multi-point-search method, we have only to choose the number of candidates at each sampling. It automatically increases the optimality of the selected locations. Cohn's method also shows effectiveness in the generalization error in spite of the difference of the criterion. However, in the latter simulation, the effect becomes comparatively small as the number of data increases.

IV. Model selection in active learning

A. Model mismatch problem in active learning

In the previous sections, we assume that the true function is completely included in the model or can be approximated by the model with high accuracy. This assumption is too strong in actual problems. On the other hand, it is easy to see that active learning does not work if the model has a large bias. The data set given by active learning is far from optimal, for example, if we estimate a quadratic function by a model while believing that the true function is linear.

Model selection is, then, especially important in active learning. It is known that an MLP can approximate any continuous function on a compact set with arbitrary accuracy ([20],[21],[22]). Therefore, a network with a sufficiently large number of hidden units can almost realize the true function. A network with many hidden units, however, causes a critical problem in active learning, in addition to the increase of the generalization error caused by surplus parameters. It is proved that the Fisher information matrix at the true parameter is singular if and only if the model has surplus hidden units to realize the true function ([12]). Even if the true function cannot be realized perfectly, the Fisher information of a network with almost redundant hidden units is very close to a singular one, which makes an algorithm using the inverse matrix numerically unstable. We should establish a method of keeping the Fisher information matrix non-singular during learning. We describe a solution to this problem in this section. This method was first introduced in Fukumizu ([17]), and we give its full description here.
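To see the numerical side of this problem, the following small check (an illustration with arbitrary weights, not an experiment from the paper) builds networks in which two hidden units have nearly identical input weights and biases, and prints the condition number of the empirical Fisher information matrix J(θ; X_N); it grows without bound as the two units coincide, so the inverse used in eq. (10) becomes unreliable.

```python
import numpy as np

rng = np.random.default_rng(4)
H = 3

def mlp(x, theta):
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

X = rng.normal(size=100)                 # inputs drawn from an assumed Q = N(0, 1)
for delta in [1.0, 1e-1, 1e-2, 1e-3]:
    # Hidden units 1 and 2 differ only by delta in (u, zeta); as delta -> 0 they become
    # redundant and the Fisher information matrix approaches a singular matrix.
    theta = np.array([0.7, -0.4, 0.9,            # w
                      1.2, 1.2 + delta, -0.8,    # u
                      0.3, 0.3 + delta, 0.5,     # zeta
                      0.1])                      # eta
    J = sum(fisher_point(x, theta) for x in X)
    print(f"delta = {delta:7.0e}   cond(J) = {np.linalg.cond(J):.2e}")
```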


Fig. 7. Distributions of input data (parametric method and multi-point-search method).

B. Pruning for regularity of a Fisher information matrix

Our pruning technique is based on the following theorem.

Theorem 1 ([12]): The Fisher information matrix of a three-layer perceptron at a parameter θ = (w_{11}, ..., w_{MH}, η_1, ..., η_M, u_{11}, ..., u_{HL}, ζ_1, ..., ζ_H) is singular if and only if one of the following three conditions holds:
(1) there exists j such that u_j := (u_{j1}, ..., u_{jL})^T = 0;
(2) there exists j such that w_j := (w_{1j}, ..., w_{Mj})^T = 0;
(3) there exist different j_1 and j_2 such that (u_{j_1}, ζ_{j_1}) = ±(u_{j_2}, ζ_{j_2}).

According to this theorem, we can keep the Fisher information of a network non-singular by checking the above three conditions, which indicate the existence of redundant hidden units, and by pruning such units if there are any. The parameter should be modified in cases (1) and (3) to keep the function unchanged when the redundant hidden unit is removed. The following is the pruning procedure. Note that we use the relation s(-t) = 1 - s(t) in the derivation of (D). We write H_j for the jth hidden unit.

[Pruning procedure]
(A) If u_j = 0, then
  [P1] eliminate H_j and set η_i ↦ η_i + w_{ij} s(ζ_j) for all i.
(B) If w_j = 0, then
  [P2] eliminate H_j.
(C) If (u_{j_1}, ζ_{j_1}) = (u_{j_2}, ζ_{j_2}), then
  [P3] eliminate H_{j_2} and set w_{ij_1} ↦ w_{ij_1} + w_{ij_2} for all i.
(D) If (u_{j_1}, ζ_{j_1}) = -(u_{j_2}, ζ_{j_2}), then
  [P4] eliminate H_{j_2} and set w_{ij_1} ↦ w_{ij_1} - w_{ij_2} and η_i ↦ η_i + w_{ij_2} for all i.

In most problems, there is little possibility that a Fisher information matrix is perfectly singular. However, we should remove almost redundant hidden units to ensure the stability of the inverse. At the same time, necessary hidden units should not be removed, because that results in an increase of the model bias. We must establish a criterion to determine when hidden units should be removed.

We eliminate a hidden unit if the inequality

\int \| f(x; \tilde{\theta}) - f(x; \hat{\theta}) \|^2 \, dQ < \frac{A}{N} \qquad (19)

is satisfied for the LSE estimator θ̂ and a pruned estimator θ̃ derived from [P1]-[P4]. The constant A is a positive number. If there is no redundant hidden unit, it is known that the LSE estimator approaches θ_o at the rate N^{-1/2}. The asymptotic behavior in the presence of redundant hidden units is very complicated and still an open problem. Therefore, we put the assumption

E\Bigl[ \int \| f(x; \hat{\theta}) - f(x; \theta_o) \|^2 \, dQ(x) \Bigr] = O(N^{-1}), \qquad (20)

and use eq. (19) as a heuristic.

We employ a pruning procedure during the batch back-propagation algorithm, in which one training example in a fixed data set is used for one update of the parameter. The condition of eq. (19) is checked for every candidate of a pruned estimator θ̃ once in T updates. Eq. (19) can be satisfied during the training if the optimal parameter θ_o is located within a distance of order N^{-1/2} from the parameter set that realizes networks with redundant hidden units. Therefore, the following pruning algorithm is expected to eliminate only almost redundant hidden units. The conditions in (a)-(d) are derived by calculating eq. (19). In the following, we write ŝ_j for s(û_j^T x + ζ̂_j), the output of the jth hidden unit at the current estimate.


[BP with hidden unit pruning]
1. t := 1.
2. Update θ̂ with respect to (x^{(t mod N)}, y^{(t mod N)}) using the back-propagation rule.
3. If t mod T = 0, then execute the following four sub-procedures:
  (a) If \| \hat{w}_j \|^2 \int (\hat{s}_j - s(\hat{\zeta}_j))^2 \, dQ(x) < A/N, then execute [P1].
  (b) If \| \hat{w}_j \|^2 \int \hat{s}_j^2 \, dQ(x) < A/N, then execute [P2].
  (c) If \| \hat{w}_{j_2} \|^2 \int (\hat{s}_{j_1} - \hat{s}_{j_2})^2 \, dQ(x) < A/N, then execute [P3].
  (d) If \| \hat{w}_{j_2} \|^2 \int (1 - \hat{s}_{j_1} - \hat{s}_{j_2})^2 \, dQ(x) < A/N, then execute [P4].
4. t := t + 1.
5. If t > t_{MAX}, then END. Otherwise go to 2.
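The integrals over Q in conditions (a)-(d) can be estimated with the same Monte Carlo average used for I(θ). The sketch below (scalar input and output, an assumed Q, and arbitrary constants A and N) evaluates the four inequalities and reports which pruning operations eq. (19) would license for a toy network containing one nearly duplicated and one nearly dead hidden unit.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def pruning_checks(w, u, zeta, xq, A, N):
    """Monte Carlo versions of conditions (a)-(d): return the pruning operations whose
    left-hand side falls below A/N (xq are samples from Q; scalar output, so ||w_j|| = |w_j|)."""
    H = len(w)
    thr = A / N
    s = sigmoid(np.outer(xq, u) + zeta)          # hidden activations s_j(x), shape (MC, H)
    ops = []
    for j in range(H):
        if w[j]**2 * np.mean((s[:, j] - sigmoid(zeta[j]))**2) < thr:
            ops.append(("P1", j))                # (a): applying [P1] to unit j barely changes f
        if w[j]**2 * np.mean(s[:, j]**2) < thr:
            ops.append(("P2", j))                # (b): removing unit j outright barely changes f
    for j1 in range(H):
        for j2 in range(H):
            if j1 == j2:
                continue
            if w[j2]**2 * np.mean((s[:, j1] - s[:, j2])**2) < thr:
                ops.append(("P3", j1, j2))       # (c): unit j2 nearly duplicates unit j1
            if w[j2]**2 * np.mean((1.0 - s[:, j1] - s[:, j2])**2) < thr:
                ops.append(("P4", j1, j2))       # (d): unit j2 nearly mirrors unit j1, s(-t) = 1 - s(t)
    return ops

# Toy network: unit 1 nearly duplicates unit 0, unit 2 has a nearly zero output weight.
w    = np.array([0.8, 0.3, 0.01])
u    = np.array([1.5, 1.5001, -0.7])
zeta = np.array([0.2, 0.2001, 0.4])
xq   = rng.normal(size=5000)                     # samples from an assumed Q = N(0, 1)
print(pruning_checks(w, u, zeta, xq, A=1.0, N=300))
```

Several operations may be licensed at once (a near-zero output weight makes every removal of that unit cheap); in the training loop above, the checks are applied as they are encountered and the remaining units are re-examined at the next multiple of T.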

A positive constant A and a natural number T control the likelihood of pruning. The constant A should be sufficiently large so that the inverse of a Fisher information matrix can be calculated stably. On the other hand, A should be small enough for the pruning procedure not to eliminate necessary hidden units. The optimization of these values is also very difficult, because it requires knowing the exact asymptotic behavior of the estimator in the presence of redundant hidden units. Therefore, we decide them heuristically in this paper.

C. Active learning with hidden unit pruning

The pruning procedure keeps the information matrix nonsingular and makes the active learning methods applicable to an MLP even if we first prepare a surplus number of hidden units. The modification of the active learning methods is simple. We have only to use the BP with hidden unit pruning instead of the usual back-propagation.

We demonstrate the effect of the modified methods through experiments in which the true function is not included in the MLP model. We use the MLP model with 4 input units, 7 hidden units, and 1 output unit. The true function is given by

f(x) = \mathrm{erf}(x_1),

where x = (x_1, x_2, x_3, x_4) and erf(t) is the error function defined by

\mathrm{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-x^2} \, dx.

The graph of the error function resembles that of the sigmoidal function, while they never coincide under any affine transform. The model therefore has many almost redundant hidden units in this case. We set Q = N(0, 9 I_4). The final number of training data is 300. The standard deviation of the noise added to the output is 0.01. In the parametric active learning method, a mixture model of 5 normal distributions is used for r(x; v). To save computational cost, we generate 5 data at one time. When new data are generated, all data are presented 50000 times cyclically in the BP training, and the pruning conditions are checked every 100 updates of the parameters from the 40000th through the 50000th cycle. In the multi-point-search active learning, n candidates are generated for selecting the nth training datum. In this case, all the data are presented 5000 times cyclically each time a new datum is added, and the pruning condition is checked every 100 updates from the 4000th through the 5000th cycle.

Fig. 8 shows the average of the generalization errors for 10 sets of training data. We find that the active learning methods reduce the error remarkably, though the bias-free assumption is not satisfied. Fig. 9 shows a typical learning curve of the multi-point-search method and the transition of the number of hidden units during the learning. We see the elimination of the redundant hidden units. The generalization error is reduced both by the effect of pruning and by that of active learning.

Fig. 8. Active/Passive learning, f(x) = erf(x_1): average of generalization errors (10 trials) vs. number of training data for the parametric, multi-point-search, and passive methods.

Fig. 9. A typical learning curve of multi-point-search active learning: generalization error and number of hidden units vs. number of training data.

V. Application to a color conversion problem

We apply our active learning methods with the pruning technique to a color conversion problem, which is found in many color reproduction systems using CMY (cyan, magenta, and yellow) ink. The problem is to simulate a specific color reproduction system, such as a color printer, which produces a color print for a given CMY input signal. The print result can be physically measured and represented in a color system like RGB.


It is very important to know the function from CMY to RGB of a specific system in order to achieve accurate color reproduction. It is known that the function from CMY to RGB can be theoretically given by the Neugebauer equations ([23]). However, a practical system like color xerography has a very complicated mechanism, and the theoretical equations cannot predict the actual result with high accuracy. The neural network approach is one of the methods used to approximate the non-linear relation of the color conversion ([24]). Moreover, since the precise measurement of color is costly, active learning is a promising way to simulate the system.

In this paper, to demonstrate the effectiveness of our active learning methods, we estimate the relation given by the Neugebauer equations instead of using a real color reproduction system. It is known that the Neugebauer equations approximate the real system well in the case of offset printing. Then, if we can verify the effectiveness in estimating the theoretical equations, we can also expect effectiveness in approximating a real system.

The model which we use has 3 input and 3 output units. The initial number of hidden units is 8. For the parameters of the Neugebauer equations, we use the relation in [25]. We add an independent Gaussian noise N(0, 10^{-4} I_3) to the output of the Neugebauer equations to simulate measurement noise. Since we have no meaningful reason to assume a special environmental distribution, we use the uniform distribution on [0, 1]^3 for Q. After the initial training with 30 examples, which are given passively, we train a network by the parametric and the multi-point-search active learning methods, collecting training samples one by one up to 150. In the parametric method, we use a mixture model of 8 normal distributions restricted to [0, 1]^3. In the multi-point-search method, the number of candidates is K_n = ⌈10√(n-30)⌉.

Fig. 10 shows the average of the generalization errors for 30 trials with different initial training examples. We can see that the results of the active learning methods remarkably outperform the result of passive learning. If we evaluate their effect by l_para(i)/l_passive(i), where l_para(i) and l_passive(i) are the values of the graphs at the ith datum for parametric active learning and passive learning, respectively, the average of l_para(i)/l_passive(i) for i = 50, ..., 100 is 0.80, and the average for i = 50, ..., 150 is 0.85. The effect of the multi-point-search method evaluated in the same manner is 0.61 for i = 50, ..., 100, and 0.67 for i = 50, ..., 150. These results clearly show the effectiveness of our methods in this real-world application. The multi-point-search method shows a better result than the parametric method throughout the training, and the advantage of the latter method over passive learning is smaller at the final stage of learning. One of the reasons for this is the probabilistic aspect described in Section III: in particular, it is difficult to find a suitable density family on a compact input space. This problem always arises if the input space is bounded, and is a disadvantage of the parametric method.

Fig. 10. Active learning of a color conversion problem: average of generalization errors (30 trials) vs. number of training data for the parametric, multi-point-search, and passive methods.

VI. Conclusion

We discussed statistical active learning methods for the purpose of applying them to the multilayer perceptron model. We explained the problem of local minima in active learning of neural networks, and proposed two probabilistic active learning methods to prevent local minima. This problem does not appear in linear models, in which the least square error estimator can be obtained directly.

We also explained the importance of model selection, especially in active learning. The derivation of many active learning methods requires that the model include the true function. This is essential to the effect of active learning, while we cannot assure it in many practical applications. On the other hand, too many hidden units make the active learning methods inapplicable because of the singularity of Fisher information matrices. To solve this problem, we proposed an active learning method with pruning to keep the Fisher information nonsingular, based on the theorem that clarifies the singularity condition of a Fisher information matrix of a three-layer perceptron. Experimental results showed that active learning with pruning eliminated surplus hidden units and had a remarkable effect in reducing the generalization error.

Acknowledgments

The author would like to express gratitude to Dr. Sumio Watanabe for his encouragement and helpful comments.


References

[1] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
[2] R. H. Myers, A. I. Khuri, and W. H. Carter, Jr. Response surface methodology: 1966-1988. Technometrics, 31(2):137-157, 1989.
[3] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):305-318, 1992.
[4] D. A. Cohn. Neural network exploration using optimal experiment design. In J. Cowan et al., editors, Advances in Neural Information Processing Systems 6, pages 679-686, San Mateo, 1994. Morgan Kaufmann.
[5] P. Sollich. Query construction, entropy and generalization in neural network models. Physical Review E, 49:4637-4651, 1994.
[6] K. Fukumizu and S. Watanabe. Error estimation and learning data arrangement for neural networks. In Proceedings of the IEEE International Conference on Neural Networks, volume 2, pages 777-780, June 1994.
[7] H. Cramer. Mathematical Methods of Statistics, pages 497-506. Princeton University Press, Princeton, NJ, 1946.
[8] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automatic Control, 19(6):716-723, 1974.
[9] H. White. Learning in artificial neural networks: a statistical perspective. Neural Computation, 1:425-464, 1989.
[10] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion -- determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5(6):865-872, 1994.
[11] S. Watanabe and K. Fukumizu. Probabilistic design of layered neural networks based on their unified framework. IEEE Trans. Neural Networks, 6(3), 1995.
[12] K. Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871-879, 1996.
[13] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.
[14] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Royal Statist. Soc., 36:111-133, 1974.
[15] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[16] J. Moody. The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems.
[17] K. Fukumizu. Active learning in multilayer perceptrons. In D. S. Touretzky et al., editors, Advances in Neural Information Processing Systems 8, pages 295-301, Cambridge, 1996. MIT Press.
[18] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363-366, 1960.
[19] J. Kindermann, G. Paass, and F. Weber. Query construction for neural networks using the bootstrap. In Proceedings of the International Conference on Artificial Neural Networks 95, pages 135-140, 1995.
[20] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.
[21] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183-192, 1989.
[22] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.
[23] H. E. J. Neugebauer. Die theoretischen Grundlagen des Mehrfarbenbuchdrucks. Zeitschrift für wissenschaftliche Photographie, Photophysik und Photochemie, 36(4):73-89, 1937.
[24] T. Iga, Y. Arai, and S. Usui. Trend of a present color management technology in the industry. In Proceedings of the 5th International Conference on Neural Information Processing (ICONIP'98), pages 40-43, 1998.
[25] J. A. C. Yule. Principles of Color Reproduction, Appendix E. John Wiley & Sons, New York, 1967.