TNN A011 REV 1
Statistical Active Learning in Multilayer
Perceptrons
Kenji Fukumizu
Brain Science Institute, RIKEN
Hirosawa 2-1, Wako
Saitama 351-0198 Japan
Tel: +81-48-467-9664
Fax: +81-48-467-9693
E-mail: [email protected]
Abstract—This paper proposes new methods of generating input locations actively in gathering training data, aiming at solving problems special to multilayer perceptrons. One of the problems is that the optimum input locations which are calculated deterministically sometimes result in badly-distributed data and cause local minima in back-propagation training. Two probabilistic active learning methods, which utilize the statistical variance of locations, are proposed to solve this problem. One is parametric active learning and the other is multi-point-search active learning. Another serious problem in applying active learning to multilayer perceptrons is the singularity of a Fisher information matrix, whose regularity is assumed in many methods including the proposed ones. A technique of pruning redundant hidden units is proposed to keep the regularity of a Fisher information matrix, which makes active learning applicable to multilayer perceptrons. The effectiveness of the proposed methods is demonstrated through computer simulations on simple artificial problems and a real-world problem in color conversion.

Keywords—Active learning, Multilayer perceptron, Fisher information matrix, Pruning.

I. Introduction

When we train a learning machine like a feedforward neural network to estimate the true input-output relation of a target system, we must prepare input vectors, observe the corresponding output vectors, and pair them to make training data. It is well known that we can improve the ability of a learning machine by designing the input of the training data. Such methods of selecting the location of input vectors have long been studied under the names of experimental design ([1]), response surface methodology ([2]), active learning ([3],[4]), and query construction ([5]). They are especially important when collecting data is very expensive.

This paper discusses statistical active learning methods for the multilayer perceptron (MLP) model. We consider learning of a network as statistical estimation of a regression function. The accuracy of the estimation is often evaluated using the generalization error, which is the mean square error between the true function and its estimate. In this paper, the objective of active learning is to reduce the generalization error. Using statistical asymptotic theory, we can derive a criterion for which input locations are effective in minimizing the generalization error ([6]).

The main purpose of this paper is to solve problems related to special properties of multilayer networks. One problem is that a learning rule like error back-propagation cannot always achieve the global minimum of the training error, while many statistical active learning or experimental design methods assume its availability. We see that learning with optimal data that are calculated deterministically is trapped by local minima more often than passive learning. To overcome this problem, we propose probabilistic methods, which generate input data with some deviation from the optimal location.

Another problem is caused by the singularity of a Fisher information matrix. Many statistical active learning methods assume the regularity of a Fisher information matrix ([1],[4],[6]), which plays an important role in the asymptotic behavior of the least square error estimator ([7],[8],[9],[10],[11]). The Fisher information matrix of an MLP, however, can be singular if the network has redundant hidden units. Since active learning methods usually require that the prepared model include the true function, the number of hidden units must be large enough to realize it with high accuracy. Thus, the model tends to be redundant, especially in active learning. To solve this problem, we propose active learning with hidden unit pruning, based on the regularity condition of the Fisher information matrix of an MLP ([12]). The method removes redundant hidden units to keep the regularity of the Fisher information matrix, and makes active learning methods applicable to the MLP model.

This paper is organized as follows. In Section II, we give basic definitions and terminology, and describe an active learning criterion. In Section III, we propose two novel active learning methods based on the probabilistic optimality of training data. In Section IV, we explain a problem concerning the singularity of a Fisher information matrix, and propose a pruning technique. Section V demonstrates the effectiveness of the proposed methods through an application to a real-world problem, and Section VI concludes this paper.

K. Fukumizu is with the Brain Science Institute, RIKEN, Saitama, Japan. E-mail: [email protected]
II. Active learning in statistical learning

A. Basic definitions and terminology

First, we give the basic definitions and terminology on which our active learning methods are based.

We discuss the three-layer perceptron model defined by

  f_i(x; \theta) = \sum_{j=1}^{H} w_{ij}\, s\Big( \sum_{k=1}^{L} u_{jk} x_k + \zeta_j \Big) + \eta_i, \quad (1 \le i \le M),   (1)

where θ = (w_11, ..., w_MH, η_1, ..., η_M, u_11, ..., u_HL, ζ_1, ..., ζ_H) represents the weights and biases, and s(t) = 1/(1 + e^{-t}) is the sigmoidal function.

We assume that the target system to be estimated by a network is a function f(x), and that the output of the system is observed with additive Gaussian noise. An output value y then follows

  y = f(x) + Z,   (2)

where Z is a random vector with zero mean and scalar covariance matrix σ² I_M. To obtain a set of training data D = {(x^(ν), y^(ν)) | ν = 1, ..., N}, we prepare input vectors X_N = {x^(ν)}, feed them to the target system, and observe the output vectors {y^(ν)} subject to eq.(2). The problem of active learning is how to prepare X_N.

When a set of training data D is given, we employ the least square error (LSE) estimator θ̂, that is,

  \hat{\theta} = \arg\min_{\theta} \sum_{\nu=1}^{N} \| y^{(\nu)} - f(x^{(\nu)}; \theta) \|^2.   (3)

Unlike linear models, whose experimental design has been extensively studied in the field of statistics ([1]), the solution of eq.(3) cannot be rigorously calculated in the case of neural networks. An iterative learning rule like error back-propagation is needed to obtain an approximation of θ̂. To derive an active learning criterion, however, we assume the availability of θ̂. A problem related to this assumption is discussed later.

We use the generalization error to evaluate the ability of a trained network. For the definition, we introduce the environmental probability Q, which gives independent input vectors in the actual environment where a trained network is to be located. In system identification, for example, Q represents the distribution of input vectors which are given to the system. The generalization error is defined by

  E_{gen} = \int \| f(x; \hat{\theta}) - f(x) \|^2 \, dQ(x),   (4)

which is the mean square error between the true function and its estimate. The purpose of our active learning methods is to reduce the expectation of the generalization error E[E_gen]. The expectation E[·] is taken with respect to the training data, as θ̂ is a random vector depending on the statistical training data D.

If the input vectors {x^(ν)} are independent samples from the environmental distribution Q, such learning is called passive. Active learning is, of course, expected to be superior to passive learning with respect to the generalization error. When the number of training data is sufficiently large, and if the true function is included in the model, statistical asymptotic theory tells us that E[E_gen] of passive learning is approximately σ²S/N, where S is the dimension of θ ([8],[10]).

B. Criterion of statistical active learning

Because our principle of learning is to minimize the expectation of the generalization error, in order to construct an active learning method we must evaluate how E[E_gen] depends on X_N. There are, in general, several kinds of methods to estimate the generalization error. One is to use statistical asymptotic theory ([7]), and another is to use resampling techniques like the bootstrap ([13]) or cross-validation ([14]). The concept of structural risk minimization (SRM, [15]) developed by Vapnik also gives a solid basis for discussing generalization problems. In this paper, we employ a method based on the asymptotic theory. The resampling techniques, which estimate the generalization error using given training data, are not suitable for active learning, in which we have to know how the generalization error depends on an input point before the data are actually generated. We do not adopt the SRM principle either, because it is based on a worst-case bound, unlike our objective of minimizing the expectation of the generalization error.

For the approximation of eq.(4), we assume that the true function f(x) is completely included in the model, i.e., f(x; θ_o) = f(x). This assumption is not rigorously satisfied in practical problems. In general, the expectation of the generalization error can be decomposed as

  E[E_{gen}] = E\Big[ \int \| f(x; \hat{\theta}) - f(x; \theta_o) \|^2 \, dQ(x) \Big] + \int \| f(x; \theta_o) - f(x) \|^2 \, dQ(x),   (5)

where θ_o is the parameter that gives min_θ ∫ ||f(x; θ) − f(x)||² dQ(x). The first and second terms in eq.(5) are called the variance and the bias of the model, respectively. Moody ([16]), for example, discusses the generalization error in the framework of nonparametric regression, which allows for model bias. However, it is very difficult to describe explicitly the dependence of E[E_gen] on X_N if the model bias exists. Therefore, we assume that the bias of the model is small enough to be neglected, and that active learning is supposed to reduce the variance term. In Section IV, we discuss how to solve the problem of the model bias.

Similar to Cohn's discussion ([4]), application of the asymptotic theory ([9],[10]) or local linearization under the bias-free assumption shows

  E[E_{gen}] \simeq \frac{\sigma^2}{N} \mathrm{Tr}\Big[ I(\theta_o)\, J(\theta_o; X_N)^{-1} \Big],   (6)
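The criterion of eq.(6) can be made concrete numerically. The following is a minimal numpy sketch, not the paper's code: it assumes a single-output network, uses central finite differences in place of analytic derivatives, and approximates the Q-averaged matrix I(θ) by a Monte-Carlo average over samples drawn from Q; all function names are my own.

```python
import numpy as np

def mlp(theta, x, H):
    """Three-layer perceptron of eq.(1), specialized to one output unit (M = 1)."""
    L = x.shape[-1]
    w = theta[:H]                               # hidden-to-output weights w_j
    eta = theta[H]                              # output bias
    u = theta[H + 1: H + 1 + H * L].reshape(H, L)   # input-to-hidden weights u_jk
    zeta = theta[H + 1 + H * L:]                # hidden-unit biases
    s = 1.0 / (1.0 + np.exp(-(u @ x + zeta)))   # sigmoidal hidden activations
    return w @ s + eta

def grad_theta(theta, x, H, eps=1e-6):
    """Gradient of f(x; theta) w.r.t. theta by central differences."""
    g = np.zeros_like(theta)
    for a in range(theta.size):
        e = np.zeros_like(theta)
        e[a] = eps
        g[a] = (mlp(theta + e, x, H) - mlp(theta - e, x, H)) / (2 * eps)
    return g

def fisher_unit(theta, x, H):
    # eq.(9): I_ab(x; theta) = (df/dtheta_a)(df/dtheta_b), scalar-output case
    g = grad_theta(theta, x, H)
    return np.outer(g, g)

def J_emp(theta, X, H):
    # eq.(8): empirical Fisher information matrix over the input set X
    return np.mean([fisher_unit(theta, x, H) for x in X], axis=0)

def criterion(theta, X, XQ, sigma2, H):
    # eq.(6): (sigma^2 / N) Tr[ I(theta) J(theta; X_N)^{-1} ];
    # I(theta) of eq.(7) is estimated by averaging over samples XQ drawn from Q
    I_mat = J_emp(theta, XQ, H)
    J_mat = J_emp(theta, X, H)
    return sigma2 / len(X) * np.trace(I_mat @ np.linalg.inv(J_mat))
```

As a sanity check on the construction: when X_N itself is an i.i.d. sample from Q, J ≈ I, so Tr[I J^{-1}] ≈ S and the criterion reduces to σ²S/N, the passive-learning approximation quoted above.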
where the matrices I(θ) and J(θ; X_N) are defined by

  I(\theta) = \int I(x; \theta) \, dQ(x),   (7)

  J(\theta; X_N) = \frac{1}{N} \sum_{\nu=1}^{N} I(x^{(\nu)}; \theta),   (8)

  I_{ab}(x; \theta) = \Big( \frac{\partial f(x; \theta)}{\partial \theta_a} \Big)^{T} \frac{\partial f(x; \theta)}{\partial \theta_b}.   (9)

The matrices I(θ) and J(θ; X_N) are called Fisher information matrices or asymptotic covariance matrices. Note that the matrix I(θ) is averaged with the environmental probability Q, while J(θ; X_N) is calculated using the empirical data X_N. Replacing the unknown parameter θ_o with its current estimate θ̂, we adopt the following as the criterion of active learning;

  \mathrm{Tr}\Big[ I(\hat{\theta})\, J(\hat{\theta}; X_N)^{-1} \Big] \to \min.   (10)

Fig. 1. Sequential active learning

In sequential active learning (Fig. 1), given the estimate θ̂_{n−1} obtained from the first n−1 training data X_{n−1}, we select the next input x^(n) according to

  x^{(n)} = \arg\min_{x} \mathrm{Tr}\Big[ I(\hat{\theta}_{n-1})\, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x\})^{-1} \Big].   (11)

We call this deterministic active learning, because the location of the next input is selected deterministically.

In the case of neural networks, this method does not necessarily work well. Training of a neural network does not always give the correct LSE estimator, because of local minima and plateaus. The above method tends to generate training data that are trapped by local minima more easily. We explain the reason briefly. It is known that the optimal data that minimize the left-hand side of eq.(6) can be approximated by a data set on a fixed number of input locations, because any Fisher information matrix at θ_o can be approximately realized using a data set on S(S+1)/2 + 1 points ([1], Theorem 2.1.2). Therefore, it is very likely that the same input positions are repeatedly selected in deterministic active learning. Obviously, such a training data set makes the convergence of back-propagation much more difficult.

We illustrate this influence with a simple experiment using an MLP network with 2 input, 2 hidden, and 1 output unit. The target function is also defined by a parameter in this model (Fig. 2). The normal distribution N(0, 16 I_2) is used for Q, where N(m, Σ) means the normal distribution with m as its mean and Σ as its variance-covariance matrix. Fig. 3 shows the average of the generalization errors over 50 trials, changing the initial training data set. The result of deterministic active learning is inferior to that of passive learning after 60 data. We find that the parameter sometimes does
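The repeated-selection behavior of the deterministic rule of eq.(11) can be observed in a small sketch. This is not the paper's experiment: to stay self-contained, it uses a quadratic model that is linear in its parameters (so the Fisher matrix of eq.(9) is just an outer product of basis functions) and a finite candidate grid standing in for Q; greedily applying eq.(11) then concentrates the selected inputs on a handful of support locations.

```python
import numpy as np

def fisher(grad_f, theta, x):
    # eq.(9) for a scalar-output model: outer product of the parameter gradient
    g = grad_f(theta, x)
    return np.outer(g, g)

def next_input(grad_f, theta_hat, X, candidates):
    """Deterministic rule of eq.(11): among the candidates, pick the input x
    minimizing Tr[ I(theta_hat) J(theta_hat; X u {x})^{-1} ]."""
    # Monte-Carlo stand-in for the Q-averaged matrix I(theta) of eq.(7)
    I_mat = np.mean([fisher(grad_f, theta_hat, c) for c in candidates], axis=0)
    best_x, best_val = None, np.inf
    for x in candidates:
        # eq.(8) on the augmented input set X u {x}
        J = np.mean([fisher(grad_f, theta_hat, p) for p in list(X) + [x]], axis=0)
        val = np.trace(I_mat @ np.linalg.inv(J))
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Illustrative model: f(x; theta) = theta_0 + theta_1 x + theta_2 x^2,
# whose gradient [1, x, x^2] does not depend on theta.
grad_poly = lambda theta, x: np.array([1.0, x, x * x])
```

Running `next_input` repeatedly from an initial design such as {−1, 0, 1} on a grid of candidates in [−1, 1] keeps choosing from the same few locations, which is exactly the clustering that, for an MLP trained by back-propagation, produces the badly-distributed data discussed above.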