
Statistical Active Learning in Multilayer Perceptrons

Kenji Fukumizu
Brain Science Institute, RIKEN
Hirosawa 2-1, Wako, Saitama 351-0198, Japan
Tel: +81-48-467-9664, Fax: +81-48-467-9693
E-mail: [email protected]

Abstract -- This paper proposes new methods of generating input locations actively in gathering training data, aiming at solving problems special to multilayer perceptrons. One of the problems is that the optimum input locations, which are calculated deterministically, sometimes result in badly distributed data and cause local minima in back-propagation training. Two probabilistic active learning methods, which utilize the statistical variance of locations, are proposed to solve this problem. One is parametric active learning and the other is multi-point-search active learning. Another serious problem in applying active learning to multilayer perceptrons is the singularity of a Fisher information matrix, whose regularity is assumed in many methods, including the proposed ones. A technique of pruning redundant hidden units is proposed to keep the regularity of a Fisher information matrix, which makes active learning applicable to multilayer perceptrons. The effectiveness of the proposed methods is demonstrated through computer simulations on simple artificial problems and a real-world problem in color conversion.

Keywords -- Active learning, Multilayer perceptron, Fisher information matrix, Pruning.

I. Introduction

WHEN we train a learning machine like a feedforward neural network to estimate the true input-output relation of a target system, we must prepare input vectors, observe the corresponding output vectors, and pair them to make training data. It is well known that we can improve the ability of a learning machine by designing the input of the training data. Such methods of selecting the location of input vectors have long been studied under the names of experimental design ([1]), response surface methodology ([2]), active learning ([3],[4]), and query construction ([5]). They are especially important when collecting data is very expensive.

This paper discusses statistical active learning methods for the multilayer perceptron (MLP) model. We consider learning of a network as statistical estimation of a regression function. The accuracy of the estimation is often evaluated using the generalization error, which is the mean square error between the true function and its estimate. In this paper, the objective of active learning is to reduce the generalization error. Using statistical asymptotic theory, we can derive a criterion on which input locations are effective in minimizing the generalization error ([6]).

The main purpose of this paper is to solve problems related to special properties of multilayer networks. One problem is that a learning rule like error back-propagation cannot always achieve the global minimum of the training error, while many statistical active learning or experimental design methods assume its availability. We see that learning with the optimal data which are calculated deterministically is trapped by local minima more often than passive learning. To overcome this problem, we propose probabilistic methods, which generate an input point with some deviation from the optimal location.

Another problem is caused by the singularity of a Fisher information matrix. Many statistical active learning methods assume the regularity of a Fisher information matrix ([1],[4],[6]), which plays an important role in the asymptotic behavior of the least square error estimator ([7],[8],[9],[10],[11]). The Fisher information matrix of an MLP, however, can be singular if the network has redundant hidden units. Since active learning methods usually require that the prepared model include the true function, the number of hidden units must be large enough to realize it with high accuracy. Thus, the model tends to be redundant, especially in active learning. To solve this problem, we propose active learning with hidden unit pruning, based on the regularity condition of a Fisher information matrix of an MLP ([12]). The method removes redundant hidden units to keep the regularity of a Fisher information matrix, and makes active learning methods applicable to the MLP model.

This paper is organized as follows. In Section II, we give basic definitions and terminology, and describe an active learning criterion. In Section III, we propose two novel active learning methods based on the probabilistic optimality of training data. In Section IV, we explain a problem concerning the singularity of a Fisher information matrix, and propose a pruning technique. Section V demonstrates the effectiveness of the proposed methods through an application to a real-world problem, and Section VI concludes this paper.

K. Fukumizu is with the Brain Science Institute, RIKEN, Saitama, Japan. E-mail: [email protected]


II. Active learning in statistical learning

A. Basic definitions and terminology

First, we give the basic definitions and terminology on which our active learning methods are based.

We discuss the three-layer perceptron model defined by

f_i(x; \theta) = \sum_{j=1}^{H} w_{ij} \, s\Bigl( \sum_{k=1}^{L} u_{jk} x_k + \zeta_j \Bigr) + \eta_i, \qquad (1 \le i \le M), \qquad (1)

where \theta = (w_{11}, \ldots, w_{MH}, \eta_1, \ldots, \eta_M, u_{11}, \ldots, u_{HL}, \zeta_1, \ldots, \zeta_H) represents the weights and biases, and s(t) = 1/(1 + e^{-t}) is the sigmoidal function.

We assume that the target system to be estimated by a network is a function f(x), and that the output of the system is observed with additive Gaussian noise. Then, an output datum y follows

y = f(x) + Z, \qquad (2)

where Z is a random vector with zero mean and scalar covariance \sigma^2 I_M. To obtain a set of training data D = \{ (x^{(\nu)}, y^{(\nu)}) \mid \nu = 1, \ldots, N \}, we prepare input vectors X_N = \{ x^{(\nu)} \}, feed them to the target system, and observe the output vectors \{ y^{(\nu)} \} subject to eq. (2). The problem of active learning is how to prepare X_N.

When a set of training data D is given, we employ the least square error (LSE) estimator θ̂, that is,

\hat{\theta} = \arg\min_{\theta} \sum_{\nu=1}^{N} \| y^{(\nu)} - f(x^{(\nu)}; \theta) \|^2. \qquad (3)

Unlike linear models, whose experimental design has been extensively studied in the field of statistics ([1]), the solution of eq. (3) cannot be calculated rigorously in the case of neural networks. An iterative learning rule like error back-propagation is needed to obtain an approximation of θ̂. To derive an active learning criterion, however, we assume the availability of θ̂. A problem related to this assumption is discussed later.

We use the generalization error to evaluate the ability of a trained network. For the definition, we introduce the environmental probability Q, which gives independent input vectors in the actual environment where a trained network is to be located. In system identification, for example, Q represents the distribution of the input vectors given to the system. The generalization error is defined by

E_{gen} = \int \| f(x; \hat{\theta}) - f(x) \|^2 \, dQ(x), \qquad (4)

which is the mean square error between the true function and its estimate. The purpose of our active learning methods is to reduce the expectation of the generalization error E[E_gen]. The expectation E[·] is taken with respect to the training data, as θ̂ is a random vector depending on the statistical training data D.

If the input vectors {x^(ν)} are independent samples from the environmental distribution Q, such learning is called passive. Active learning is, of course, expected to be superior to passive learning with respect to the generalization error. When the number of training data is sufficiently large, and if the true function is included in the model, statistical asymptotic theory tells us that E[E_gen] of passive learning is approximately (σ²/N) S, where S is the dimension of θ ([8],[10]).
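For concreteness, the following Python sketch (an illustration added here, not part of the original experiments) puts eqs. (1)-(4) into code for a network with one input and one output unit: the three-layer perceptron of eq. (1), a plain gradient-descent approximation of the LSE estimator of eq. (3) in place of back-propagation, and a Monte Carlo estimate of the generalization error of eq. (4). The target function, Q = N(0, 1), the network size, and all step sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 2                                          # hidden units (illustrative)

def mlp(x, theta):
    """Eq. (1) for one input and one output unit: f(x) = sum_j w_j s(u_j x + zeta_j) + eta."""
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    s = 1.0 / (1.0 + np.exp(-(np.outer(x, u) + zeta)))   # hidden-unit outputs
    return s @ w + eta

def lse_fit(X, Y, steps=5000, lr=0.1):
    """Gradient-descent approximation of the LSE estimator of eq. (3)
    (a stand-in for back-propagation; gradients by central differences)."""
    theta = rng.normal(scale=0.5, size=3*H + 1)
    eps = 1e-5
    sse = lambda t: np.sum((Y - mlp(X, t))**2)
    for _ in range(steps):
        g = np.array([(sse(theta + eps*e) - sse(theta - eps*e)) / (2*eps)
                      for e in np.eye(theta.size)])
        theta -= lr * g / len(X)
    return theta

# Illustrative target realizable by the model, observed with noise as in eq. (2).
f_true = lambda x: 1.0 / (1.0 + np.exp(-x))
X = rng.normal(size=30)                        # passive inputs drawn from Q = N(0, 1)
Y = f_true(X) + 0.1 * rng.normal(size=30)
theta_hat = lse_fit(X, Y)

# Monte Carlo estimate of the generalization error of eq. (4).
xq = rng.normal(size=10000)
E_gen = np.mean((mlp(xq, theta_hat) - f_true(xq))**2)
print(f"estimated generalization error: {E_gen:.2e}")
```

The Monte Carlo average over samples from Q plays the role of the integral in eq. (4); the same device is reused below when the criterion itself has to be evaluated.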


B. Criterion of statistical active learning

Because our principle of learning is to minimize the expectation of the generalization error, in order to construct an active learning method we must evaluate how E[E_gen] depends on X_N. There are, in general, several kinds of methods to estimate the generalization error. One is to use statistical asymptotic theory ([7]), and another is to use resampling techniques like the bootstrap ([13]) or cross-validation ([14]). The concept of structural risk minimization (SRM, [15]) developed by Vapnik also gives a solid basis for discussing generalization problems. In this paper, we employ a method based on the asymptotic theory. The resampling techniques, which estimate the generalization error using given training data, are not suitable for active learning, in which we have to know how the generalization error depends on an input point before the data is actually generated. We do not adopt the SRM principle either, because it is based on a worst-case bound, unlike our objective of minimizing the expectation of the generalization error.

For the approximation of eq. (4), we assume that the true function f(x) is completely included in the model and f(x; θ_o) = f(x). This assumption is not rigorously satisfied in practical problems. In general, the expectation of the generalization error can be decomposed as

E[E_{gen}] = E\Bigl[ \int \| f(x; \hat{\theta}) - f(x; \theta_o) \|^2 \, dQ(x) \Bigr] + \int \| f(x; \theta_o) - f(x) \|^2 \, dQ(x), \qquad (5)

where θ_o is the parameter that gives \min_{\theta} \int \| f(x; \theta) - f(x) \|^2 \, dQ(x). The first and second terms in eq. (5) are called the variance and the bias of the model, respectively. Moody ([16]), for example, discusses the generalization error in a framework of nonparametric regression which allows the model bias. However, it is very difficult to describe explicitly the dependence of E[E_gen] on X_N if the model bias exists. Therefore, we assume that the bias of the model is small enough to be neglected, and that active learning is supposed to reduce the variance term. In Section IV, we discuss how to solve the problem of the model bias.

Similarly to Cohn's discussion ([4]), application of the asymptotic theory ([9],[10]), or local linearization under the bias-free assumption, shows

E[E_{gen}] \approx \sigma^2 \, \mathrm{Tr}\bigl[ I(\theta_o) \, J(\theta_o; X_N)^{-1} \bigr], \qquad (6)

where the matrices I(θ) and J(θ; X_N) are defined by

I(\theta) = \int I(x; \theta) \, dQ(x), \qquad (7)

J(\theta; X_N) = \sum_{\nu=1}^{N} I(x^{(\nu)}; \theta), \qquad (8)

I_{ab}(x; \theta) = \Bigl( \frac{\partial f(x; \theta)}{\partial \theta_a} \Bigr)^{T} \frac{\partial f(x; \theta)}{\partial \theta_b}. \qquad (9)

The matrices I(θ) and J(θ; X_N) are called Fisher information matrices or asymptotic covariance matrices. Note that the matrix I(θ) is averaged with the environmental probability Q, while J(θ; X_N) is calculated using the empirical data X_N. Replacing the unknown parameter θ_o with its current estimate θ̂, we adopt the following as the criterion of active learning:

\mathrm{Tr}\bigl[ I(\hat{\theta}) \, J(\hat{\theta}; X_N)^{-1} \bigr]. \qquad (10)

This criterion is equivalent to Q-optimality ([1]) if the model is linear. Thus, our active learning criterion for neural networks is a nonlinear extension of Q-optimality. In the rest of this paper, we discuss special problems caused by the nonlinearity of neural networks. Similar criteria are derived by MacKay ([3]) and Cohn ([4]). However, their criteria are based on the error at one point in order to avoid the integral calculation. We perform the numerical integral calculation to keep the principle of minimizing the generalization error.

C. Problem of deterministic active learning

In this subsection, we explain that a simple implementation of the above active learning criterion has a problem. We employ sequential active learning, which is commonly used in experimental design ([1]), because we should update θ̂ in eq. (10) to obtain a more accurate estimate each time a new training datum is added. Design of the next input point, observation of the response, and estimation of θ are iteratively performed in sequential learning (Fig. 1).

Fig. 1. Sequential active learning.

The simplest sequential active learning method is described as follows ([1]). When we have n-1 training data D_{n-1} and the corresponding LSE estimator θ̂_{n-1}, we select the next input x^(n) according to

x^{(n)} = \arg\min_{x} \mathrm{Tr}\bigl[ I(\hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x\})^{-1} \bigr]. \qquad (11)

We call this deterministic active learning, because the location of the next input is selected deterministically.

In the case of neural networks, this method does not necessarily work well. Training of a neural network does not always give the correct LSE estimator, because of local minima and plateaus. The above method tends to generate training data that are trapped by local minima more easily. We explain the reason briefly. It is known that the optimal data that minimize the left-hand side of eq. (6) can be approximated by a data set on a fixed number of input locations, because any Fisher information matrix at θ_o can be approximately realized using a data set on S(S+1)/2 + 1 points ([1], Theorem 2.1.2). Therefore, it is very likely that the same input positions are repeatedly selected in deterministic active learning. Obviously, such a training data set makes the convergence of back-propagation much more difficult.

We illustrate this influence with a simple experiment using an MLP network with 2 input, 2 hidden, and 1 output unit. The target function is also defined by a parameter in this model (Fig. 2). The normal distribution N(0, 16 I_2) is used for Q, where N(m, Σ) means the normal distribution with mean m and variance-covariance matrix Σ. Fig. 3 shows the average of the generalization errors over 50 trials with different initial training data sets. The result of deterministic active learning is inferior to that of passive learning after 60 data. We find that the parameter sometimes does not approach θ_o because of the excessive localization of the training data, which is shown clearly in Fig. 4.

Fig. 2. The true function of the 2-2-1 MLP model.

Fig. 3. Deterministic active learning (average of generalization errors over 50 trials vs. number of training data, compared with passive learning).

Fig. 4. Distributions of input data (deterministic active learning and passive learning).
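To make the quantities of eqs. (7)-(11) concrete, the sketch below (again an illustration under assumed sizes and an assumed Q, not the code used for the experiments) computes the pointwise matrix I(x; θ) of eq. (9) from the parameter gradient, the empirical matrix J(θ; X_N) of eq. (8), a Monte Carlo approximation of I(θ) of eq. (7), and the selection rule of eq. (11) evaluated over a finite candidate grid; the true rule minimizes over all x, so the grid is a simplification.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 2                                          # hidden units (illustrative)

def mlp(x, theta):
    """Scalar-output three-layer perceptron of eq. (1) with one input unit."""
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    """Eq. (9): I(x; theta) = (df/dtheta)(df/dtheta)^T, gradient by central differences."""
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

def J_emp(theta, X):
    """Eq. (8): J(theta; X_N) = sum_nu I(x^(nu); theta)."""
    return sum(fisher_point(x, theta) for x in X)

def I_env(theta, n_mc=2000):
    """Eq. (7): I(theta) = E_Q[I(x; theta)], Monte Carlo under an assumed Q = N(0, 1)."""
    return sum(fisher_point(x, theta) for x in rng.normal(size=n_mc)) / n_mc

def next_input_deterministic(theta_hat, X_n, grid):
    """Eq. (11): among candidate locations, pick the one minimizing
    Tr[ I(theta_hat) J(theta_hat; X_n + {x})^{-1} ]."""
    I_mat = I_env(theta_hat)                   # does not depend on the candidate
    J_base = J_emp(theta_hat, X_n)             # adding x contributes one rank-one term
    crit = lambda x: np.trace(np.linalg.solve(J_base + fisher_point(x, theta_hat), I_mat))
    return min(grid, key=crit)

# Toy usage with an arbitrary current estimate and 15 existing inputs.
theta_hat = rng.normal(scale=0.5, size=3*H + 1)
X_n = rng.normal(size=15)
x_next = next_input_deterministic(theta_hat, X_n, np.linspace(-4.0, 4.0, 81))
print("next input selected by eq. (11):", x_next)
```

Because any Fisher information matrix can be matched by reusing at most S(S+1)/2 + 1 locations, a loop built around such a selection rule tends to propose the same few points again and again, which is exactly the excessive localization discussed above.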


III. Probabilistic active learning

A. Probabilistic active learning methods

We propose two probabilistic active learning methods. One is parametric active learning, which utilizes a parametric probability family to generate a new input point. This is a slight refinement of the method proposed by Fukumizu ([17]). The other is multi-point-search active learning, which generates a finite number of input points as candidates and selects the best one. In both methods, we introduce randomness, which is expected to solve the problem of excessive localization.

A.1 Parametric active learning

Instead of optimizing a point x in eq. (10), we introduce a parametric family of density functions {r(x; v)} for generating x, and try to optimize the density. A possible choice of {r(x; v)} is a normal mixture model defined by

r(x; v) = \sum_{k=1}^{K} \frac{c_k}{(2\pi \sigma_k^2)^{L/2}} \exp\Bigl( -\frac{\|x - m_k\|^2}{2\sigma_k^2} \Bigr), \qquad (12)

where \sum_{k=1}^{K} c_k = 1, c_k \ge 0 (k = 1, \ldots, K), and v = (c_1, m_1, \sigma_1, \ldots, c_K, m_K, \sigma_K) is a variable parameter vector. Since a normal mixture converges to a point distribution if σ_k goes to zero, we should restrict the value of σ_k to [A, ∞) for a positive A.

We optimize the density by finding the best v to minimize

\mathrm{Tr}\Bigl[ I(\hat{\theta}_{n-1}) \bigl( J(\hat{\theta}_{n-1}; X_{n-1}) + J(\hat{\theta}_{n-1}; r_v) \bigr)^{-1} \Bigr], \qquad (13)

where

J(\theta; r_v) = \int I(x; \theta) \, r(x; v) \, dx. \qquad (14)

The algorithm is described as follows.

[PARAMETRIC ACTIVE LEARNING]
1. Prepare an initial set of training data D_{N_0}.
2. Calculate the initial estimator θ̂_{N_0} with respect to D_{N_0}.
3. Prepare an initial parameter v_{N_0}.
4. n := N_0 + 1.
5. Find v_n that minimizes the objective of eq. (13), using a numerical optimization method.
6. Generate an input datum x^(n) from r(x; v_n).
7. Feed x^(n) to the target system. Observe a response y^(n).
8. Set D_n := D_{n-1} ∪ {(x^(n), y^(n))}.
9. Calculate the LSE estimator θ̂_n with respect to D_n.
10. n := n + 1.
11. If n > N, then END; otherwise go to 5.

Although the selected data are optimal only in a probabilistic sense at best, they are distributed over the input space more widely than those of deterministic active learning. We can expect this to prevent the excessive localization of the training data. However, this method needs the integral calculation of J(θ̂_{n-1}; r_v) in each iteration of the numerical optimization. The calculation cost is very expensive.
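The following sketch illustrates one pass of steps 5 and 6 for a one-dimensional input: the objective of eq. (13) is estimated by Monte Carlo with the mixture density of eq. (12) and minimized with a generic optimizer. The reparameterization of v (softmax weights, a softplus lower bound A on σ_k), the use of scipy's Nelder-Mead routine, and all sizes are illustrative assumptions; the paper does not prescribe a particular optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
H, K, A = 2, 4, 0.3          # hidden units, mixture components, lower bound on sigma_k

def mlp(x, theta):
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    """Eq. (9) for a scalar output, gradient by central differences."""
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

def unpack(v):
    """Free parameters -> (c_k, m_k, sigma_k) with sum_k c_k = 1 and sigma_k >= A."""
    c = np.exp(v[:K] - v[:K].max()); c /= c.sum()
    m = v[K:2*K]
    sigma = A + np.logaddexp(0.0, v[2*K:3*K])          # softplus keeps sigma_k in [A, inf)
    return c, m, sigma

z = rng.normal(size=16)                                 # fixed draws -> smooth MC objective

def objective(v, theta, J_data, I_env):
    """Eq. (13): Tr[ I(theta) (J(theta; X) + J(theta; r_v))^{-1} ], where eq. (14) is
    estimated as J(theta; r_v) = sum_k c_k E_{x ~ N(m_k, sigma_k^2)}[ I(x; theta) ]."""
    c, m, sigma = unpack(v)
    J_r = sum(ck * np.mean([fisher_point(mk + sk * zz, theta) for zz in z], axis=0)
              for ck, mk, sk in zip(c, m, sigma))
    return np.trace(np.linalg.solve(J_data + J_r, I_env))

# One step of the parametric method for toy values of the current estimate and data.
theta_hat = rng.normal(scale=0.5, size=3*H + 1)
X_n = rng.normal(size=20)
J_data = sum(fisher_point(x, theta_hat) for x in X_n)                  # eq. (8)
xq = rng.normal(size=1000)                                             # Q = N(0, 1) assumed
I_env = sum(fisher_point(x, theta_hat) for x in xq) / len(xq)          # eq. (7)

res = minimize(objective, np.zeros(3*K), args=(theta_hat, J_data, I_env),
               method="Nelder-Mead", options={"maxiter": 100})         # modest budget
c, m, sigma = unpack(res.x)
k = int(rng.choice(K, p=c))
x_new = rng.normal(m[k], sigma[k])          # step 6: draw the next input from r(x; v_n)
print("next input drawn from the optimized mixture:", x_new)
```

Using a fixed set of standard normal draws z for every objective evaluation (common random numbers) keeps the Monte Carlo estimate of eq. (14) a smooth function of v, which makes the derivative-free search better behaved; it does not change the method itself.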


A.2 Multi-point-search active learning

We consider a method in which multiple candidates for the next point are generated, and the one that minimizes

\mathrm{Tr}\bigl[ I(\hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x\})^{-1} \bigr]

is selected. If the number of candidates increases with the number of training data, the best candidate converges to the true optimal location. This method is more random in the early stage of the training, and it comes to generate the optimal data gradually. It aims at avoiding local minima when the number of training data is small. Learning in the early stage is especially important, because it is very difficult to converge to θ_o if data are generated based on a wrong estimate of θ. If we generate random candidates subject to Q, the learning moves from passive to active.

The algorithm is described as follows. The number of candidates for the nth training datum, K_n, is an increasing function of n.

[MULTI-POINT-SEARCH ACTIVE LEARNING]
1. Prepare an initial set of training data D_{N_0}.
2. Calculate the initial estimator θ̂_{N_0} with respect to D_{N_0}.
3. n := N_0 + 1.
4. Generate K_n input data x_{<1>}, ..., x_{<K_n>}. Choose x^(n) according to
   x^{(n)} = \arg\min_{x_{<k>}} \mathrm{Tr}\bigl[ I(\hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x_{<k>}\})^{-1} \bigr].
5. Feed x^(n) to the target system. Observe a response y^(n).
6. Set D_n := D_{n-1} ∪ {(x^(n), y^(n))}.
7. Calculate the LSE estimator θ̂_n with respect to D_n.
8. n := n + 1.
9. If n > N, then END; otherwise go to 4.

This method does not require numerical minimization. This remarkably reduces the computational cost, which is often a problem of active learning methods.
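A single step of the method is easy to state in code. The sketch below is an illustration with assumed sizes; Q = N(0, 1) and the candidate schedule K_n = ⌈10√(n-10)⌉ used for method B in the experiments of Section III.C are taken as assumptions here.

```python
import numpy as np

rng = np.random.default_rng(3)
H = 2

def mlp(x, theta):
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    """Eq. (9) for a scalar output, gradient by central differences."""
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

def multi_point_search_step(theta_hat, X_n, K_n, sample_q, n_mc=1000):
    """Step 4: draw K_n candidates from Q and keep the one minimizing
    Tr[ I(theta_hat) J(theta_hat; X_n + {x})^{-1} ]."""
    I_env = sum(fisher_point(x, theta_hat) for x in sample_q(n_mc)) / n_mc   # eq. (7)
    J_base = sum(fisher_point(x, theta_hat) for x in X_n)                    # eq. (8)
    candidates = sample_q(K_n)
    scores = [np.trace(np.linalg.solve(J_base + fisher_point(x, theta_hat), I_env))
              for x in candidates]
    return candidates[int(np.argmin(scores))]

# Toy usage: candidates drawn from an assumed Q = N(0, 1).
sample_q = lambda k: rng.normal(size=k)
theta_hat = rng.normal(scale=0.5, size=3*H + 1)
X_n = rng.normal(size=20)
n = len(X_n) + 1
K_n = int(np.ceil(10 * np.sqrt(n - 10)))
x_next = multi_point_search_step(theta_hat, X_n, K_n, sample_q)
print("selected candidate:", x_next)
```

Note that I(θ̂) and J(θ̂; X_{n-1}) are computed once per step and each candidate only contributes one additional rank-one term, so the cost grows mildly with K_n.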

B. Comparison with other active learning methods

Other active learning methods which use Fisher information matrices have been proposed. We briefly review them and compare them with ours.

The most famous criterion of experimental design is D-optimality ([1]), which selects input data that maximize det J(θ_o; X_N). It is known ([18]) that under some conditions D-optimality is equivalent to the minimax criterion that selects the input data X_N according to

\min_{X_N} \max_{x} E\bigl[ \| f(x; \hat{\theta}) - f(x; \theta_o) \|^2 \bigr]. \qquad (15)

In a sequential implementation of D-optimality ([1]), the selected point attains the maximum of the expected variance of the estimation, defined by

V(x) = E\bigl[ \| y - f(x; \hat{\theta}) \|^2 \bigr]. \qquad (16)

Kindermann et al. ([19]) propose an active learning method for neural networks based on this criterion. They use the bootstrap to estimate V(x). The computational cost of this method is very expensive, since we have to perform both the bootstrap and the numerical optimization of the input point.

These criteria are clearly different from ours in that they do not minimize the generalization error. Which criterion should be applied depends on the purpose of learning.

Cohn ([4]) proposes a method that uses reference points to avoid the integral calculation of I(θ). He uses a random reference point x_r, and selects the next point that minimizes

\mathrm{Tr}\bigl[ I(x_r; \hat{\theta}_{n-1}) \, J(\hat{\theta}_{n-1}; X_{n-1} \cup \{x\})^{-1} \bigr]. \qquad (17)

Although our methods look similar to this one, they are essentially different in that the objective of our methods is to minimize the generalization error. Note that the above criterion is different from eq. (10) even if x_r is taken from Q, because the minimization and the integral are not interchangeable. However, we can expect that this method also has the effect of avoiding localization of input points through the variation of the reference points.

C. Experimental results on active learning methods

We show simple experimental results to compare the performance and properties of the active learning methods.

The first experiment is a very simple one to see the basic properties of various active learning methods. We use the MLP model with 1 input, 1 hidden, and 1 output unit. The target function is given by

f(x) = s(x), \qquad (18)

which is realized by the model. The total number of training data is 100. The initial 10 data are given passively, subject to Q = N(0, 1). The standard deviation of the noise added to the output is 0.1. We compare the following five methods:
A. parametric active learning
B. multi-point-search active learning
C. maximum variance point ([19])
D. usage of reference points ([4])
E. passive learning
In method A, we use a mixture model of four normal distributions. The candidates in method B are generated by Q, and K_n = ⌈10√(n-10)⌉. In method C, we use 20 bootstrap samples in estimating V(x). In method D, the probability used to generate the reference points is Q.

Fig. 5 shows the average of the generalization errors for 50 sets of training data. Active learning with methods A, B, and D outperforms passive learning. The proposed methods, A and B, show good performance in the generalization error. Interestingly, method D is as good as A and B, though its criterion does not precisely minimize the generalization error.


Method C also shows effectiveness for a small number of training data, while its final result is worse than that of passive learning. This is reasonable because the criterion of method C is different from minimization of the generalization error. In fact, many training data are selected very far from the high-density region of Q.

Next, we apply the active learning methods to see their performance on a slightly more complicated problem, which is the same as the one in Section II.C (Fig. 2). We omit method C in this simulation, since it is computationally very expensive and we know from the previous experiment that its performance in the generalization error is not so high. Fig. 6 shows the average of the generalization errors for 50 data sets. In this case, the multi-point-search method shows the best performance. Although parametric active learning still shows much better performance than passive learning, its effect is not as remarkable as that of the multi-point-search method. One reason is that the density model r(x; v), which is the mixture of 4 normal distributions, is not sufficient to express the optimal density. In fact, as we can see in Fig. 7, the distribution of the input data in the parametric method is more concentrated around the center than in the multi-point-search method. Although method D shows effectiveness in the early stage of learning, it is worse than passive learning after the number of data becomes large. This seems natural because its criterion is not equivalent to the generalization error.

Fig. 5. Comparison of active learning methods (1-1-1): average of generalization errors (50 trials) vs. number of training data.

Fig. 6. Comparison of active learning methods (2-2-1): average of generalization errors (50 trials) vs. number of training data.

In both simulations, our probabilistic active learning methods show a significant reduction of the generalization error. The multi-point-search method shows almost the best performance in both simulations. In the parametric method, we have to choose a density family r(x; v) carefully, as it has an essential influence on the performance. It is also a disadvantage of the parametric method that the deviation of the data from the optimal position remains even after the training has converged successfully. On the other hand, in the multi-point-search method, we have only to choose the number of candidates at each sampling. It automatically increases the optimality of the selected locations. Cohn's method also shows effectiveness in the generalization error in spite of the difference of the criterion. However, in the latter simulation, the effect becomes comparatively small as the number of data increases.

IV. Model selection in active learning

A. Model mismatch problem in active learning

In the previous sections, we assume that the true function is completely included in the model or can be approximated by the model with high accuracy. This assumption is too strong in actual problems. On the other hand, it is easy to see that active learning does not work if the model has a large bias. The data set given by active learning is far from optimal, for example, if we estimate a quadratic function by a model while believing that the true function is linear.

Model selection is, then, especially important in active learning. It is known that an MLP can approximate any continuous function on a compact set with arbitrary accuracy ([20],[21],[22]). Therefore, a network with a sufficiently large number of hidden units can almost realize the true function. A network with many hidden units, however, causes a critical problem in active learning, in addition to the increase of the generalization error caused by surplus parameters. It is proved that the Fisher information matrix at the true parameter is singular if and only if the model has surplus hidden units to realize the true function ([12]). Even if the true function cannot be realized perfectly, the Fisher information of a network with almost redundant hidden units is very close to a singular one, which makes an algorithm using the inverse matrix numerically unstable. We should establish a method of keeping the Fisher information matrix non-singular during learning. We describe a solution to this problem in this section. This method was first introduced in Fukumizu ([17]), and we give its full description here.
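To see the numerical side of this problem, the following small check (an illustration with arbitrary weights, not an experiment from the paper) builds networks in which two hidden units have nearly identical input weights and biases, and prints the condition number of the empirical Fisher information matrix J(θ; X_N); it grows without bound as the two units coincide, so the inverse used in eq. (10) becomes unreliable.

```python
import numpy as np

rng = np.random.default_rng(4)
H = 3

def mlp(x, theta):
    w, u, zeta, eta = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.dot(w, 1.0 / (1.0 + np.exp(-(u * x + zeta)))) + eta

def fisher_point(x, theta, eps=1e-5):
    g = np.array([(mlp(x, theta + eps*e) - mlp(x, theta - eps*e)) / (2*eps)
                  for e in np.eye(theta.size)])
    return np.outer(g, g)

X = rng.normal(size=100)                 # inputs drawn from an assumed Q = N(0, 1)
for delta in [1.0, 1e-1, 1e-2, 1e-3]:
    # Hidden units 1 and 2 differ only by delta in (u, zeta); as delta -> 0 they become
    # redundant and the Fisher information matrix approaches a singular matrix.
    theta = np.array([0.7, -0.4, 0.9,            # w
                      1.2, 1.2 + delta, -0.8,    # u
                      0.3, 0.3 + delta, 0.5,     # zeta
                      0.1])                      # eta
    J = sum(fisher_point(x, theta) for x in X)
    print(f"delta = {delta:7.0e}   cond(J) = {np.linalg.cond(J):.2e}")
```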


Fig. 7. Distributions of input data (parametric method and multi-point-search method).

B. Pruning for regularity of a Fisher information matrix

Our pruning technique is based on the following theorem.

Theorem 1 ([12]): The Fisher information matrix of a three-layer perceptron at a parameter θ = (w_{11}, ..., w_{MH}, η_1, ..., η_M, u_{11}, ..., u_{HL}, ζ_1, ..., ζ_H) is singular if and only if one of the following three conditions holds:
(1) there exists j such that u_j := (u_{j1}, ..., u_{jL})^T = 0;
(2) there exists j such that w_j := (w_{1j}, ..., w_{Mj})^T = 0;
(3) there exist different j_1 and j_2 such that (u_{j_1}, ζ_{j_1}) = ±(u_{j_2}, ζ_{j_2}).

According to this theorem, we can keep the Fisher information of a network non-singular by checking the above three conditions, which indicate the existence of redundant hidden units, and by pruning such units if there are any. The parameter should be modified in cases (1) and (3) to keep the function unchanged when the redundant hidden unit is removed. The following is the pruning procedure. Note that we use the relation s(-t) = 1 - s(t) in the derivation of (D). We write H_j for the jth hidden unit.

[Pruning procedure]
(A) If u_j = 0, then
  [P1] eliminate H_j and set η_i ↦ η_i + w_{ij} s(ζ_j) for all i.
(B) If w_j = 0, then
  [P2] eliminate H_j.
(C) If (u_{j_1}, ζ_{j_1}) = (u_{j_2}, ζ_{j_2}), then
  [P3] eliminate H_{j_2} and set w_{ij_1} ↦ w_{ij_1} + w_{ij_2} for all i.
(D) If (u_{j_1}, ζ_{j_1}) = -(u_{j_2}, ζ_{j_2}), then
  [P4] eliminate H_{j_2} and set w_{ij_1} ↦ w_{ij_1} - w_{ij_2} and η_i ↦ η_i + w_{ij_2} for all i.

In most problems, there is little possibility that a Fisher information matrix is perfectly singular. However, we should remove almost redundant hidden units to ensure the stability of the inverse. At the same time, necessary hidden units should not be removed, because that results in an increase of the model bias. We must establish a criterion to determine when hidden units should be removed.

We eliminate a hidden unit if the inequality

\int \| f(x; \tilde{\theta}) - f(x; \hat{\theta}) \|^2 \, dQ < \frac{A}{N} \qquad (19)

is satisfied for the LSE estimator θ̂ and a pruned estimator θ̃ derived from [P1]-[P4]. The constant A is a positive number. If there is no redundant hidden unit, it is known that the LSE estimator approaches θ_o at the rate N^{-1/2}. The asymptotic behavior in the presence of redundant hidden units is very complicated and still an open problem. Therefore, we put the assumption

E\Bigl[ \int \| f(x; \hat{\theta}) - f(x; \theta_o) \|^2 \, dQ(x) \Bigr] = O(N^{-1}), \qquad (20)

and use eq. (19) as a heuristic.

We employ a pruning procedure during the batch back-propagation algorithm, in which one training example in a fixed data set is used for one update of the parameter. The condition of eq. (19) is checked for every candidate of a pruned estimator θ̃ once in T updates. Eq. (19) can be satisfied during the training if the optimal parameter θ_o is located within a distance of order N^{-1/2} from the parameter set that realizes networks with redundant hidden units. Therefore, the following pruning algorithm is expected to eliminate only almost redundant hidden units. The conditions in (a)-(d) are derived by calculating eq. (19). In the following, we write ŝ_j for s(û_j^T x + ζ̂_j), the output of the jth hidden unit at the current estimate.


[BP with hidden unit pruning]
1. t := 1.
2. Update θ̂ with respect to (x^{(t mod N)}, y^{(t mod N)}) using the back-propagation rule.
3. If t mod T = 0, then execute the following four sub-procedures:
  (a) If \| \hat{w}_j \|^2 \int (\hat{s}_j - s(\hat{\zeta}_j))^2 \, dQ(x) < A/N, then execute [P1].
  (b) If \| \hat{w}_j \|^2 \int \hat{s}_j^2 \, dQ(x) < A/N, then execute [P2].
  (c) If \| \hat{w}_{j_2} \|^2 \int (\hat{s}_{j_1} - \hat{s}_{j_2})^2 \, dQ(x) < A/N, then execute [P3].
  (d) If \| \hat{w}_{j_2} \|^2 \int (1 - \hat{s}_{j_1} - \hat{s}_{j_2})^2 \, dQ(x) < A/N, then execute [P4].
4. t := t + 1.
5. If t > t_{MAX}, then END. Otherwise go to 2.
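The integrals over Q in conditions (a)-(d) can be estimated with the same Monte Carlo average used for I(θ). The sketch below (scalar input and output, an assumed Q, and arbitrary constants A and N) evaluates the four inequalities and reports which pruning operations eq. (19) would license for a toy network containing one nearly duplicated and one nearly dead hidden unit.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def pruning_checks(w, u, zeta, xq, A, N):
    """Monte Carlo versions of conditions (a)-(d): return the pruning operations whose
    left-hand side falls below A/N (xq are samples from Q; scalar output, so ||w_j|| = |w_j|)."""
    H = len(w)
    thr = A / N
    s = sigmoid(np.outer(xq, u) + zeta)          # hidden activations s_j(x), shape (MC, H)
    ops = []
    for j in range(H):
        if w[j]**2 * np.mean((s[:, j] - sigmoid(zeta[j]))**2) < thr:
            ops.append(("P1", j))                # (a): applying [P1] to unit j barely changes f
        if w[j]**2 * np.mean(s[:, j]**2) < thr:
            ops.append(("P2", j))                # (b): removing unit j outright barely changes f
    for j1 in range(H):
        for j2 in range(H):
            if j1 == j2:
                continue
            if w[j2]**2 * np.mean((s[:, j1] - s[:, j2])**2) < thr:
                ops.append(("P3", j1, j2))       # (c): unit j2 nearly duplicates unit j1
            if w[j2]**2 * np.mean((1.0 - s[:, j1] - s[:, j2])**2) < thr:
                ops.append(("P4", j1, j2))       # (d): unit j2 nearly mirrors unit j1, s(-t) = 1 - s(t)
    return ops

# Toy network: unit 1 nearly duplicates unit 0, unit 2 has a nearly zero output weight.
w    = np.array([0.8, 0.3, 0.01])
u    = np.array([1.5, 1.5001, -0.7])
zeta = np.array([0.2, 0.2001, 0.4])
xq   = rng.normal(size=5000)                     # samples from an assumed Q = N(0, 1)
print(pruning_checks(w, u, zeta, xq, A=1.0, N=300))
```

Several operations may be licensed at once (a near-zero output weight makes every removal of that unit cheap); in the training loop above, the checks are applied as they are encountered and the remaining units are re-examined at the next multiple of T.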

A positive constant A and a natural number T control the likelihood of pruning. The constant A should be sufficiently large so that the inverse of a Fisher information matrix can be calculated stably. On the other hand, A should be small enough for the pruning procedure not to eliminate necessary hidden units. The optimization of these values is also very difficult, because it requires knowing the exact asymptotic behavior of the estimator in the presence of redundant hidden units. Therefore, we decide them heuristically in this paper.

C. Active learning with hidden unit pruning

The pruning procedure keeps the information matrix nonsingular and makes the active learning methods applicable to an MLP even if we first prepare a surplus number of hidden units. The modification of the active learning methods is simple. We have only to use the BP with hidden unit pruning instead of the usual back-propagation.

We demonstrate the effect of the modified methods through experiments in which the true function is not included in the MLP model. We use the MLP model with 4 input units, 7 hidden units, and 1 output unit. The true function is given by

f(x) = \mathrm{erf}(x_1),

where x = (x_1, x_2, x_3, x_4) and erf(t) is the error function defined by

\mathrm{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-x^2} \, dx.

The graph of the error function resembles that of the sigmoidal function, while they never coincide under any affine transform. The model therefore has many almost redundant hidden units in this case. We set Q = N(0, 9 I_4). The final number of training data is 300. The standard deviation of the noise added to the output is 0.01. In the parametric active learning method, a mixture model of 5 normal distributions is used for r(x; v). To save computational cost, we generate 5 data at one time. When new data are generated, all data are presented 50000 times cyclically in the BP training, and the pruning conditions are checked every 100 updates of the parameters from the 40000th through the 50000th cycle. In the multi-point-search active learning, n candidates are generated for selecting the nth training datum. In this case, all the data are presented 5000 times cyclically each time a new datum is added, and the pruning condition is checked every 100 updates from the 4000th through the 5000th cycle.

Fig. 8 shows the average of the generalization errors for 10 sets of training data. We find that the active learning methods reduce the error remarkably, though the bias-free assumption is not satisfied. Fig. 9 shows a typical learning curve of the multi-point-search method and the transition of the number of hidden units during the learning. We see the elimination of the redundant hidden units. The generalization error is reduced both by the effect of pruning and by that of active learning.

Fig. 8. Active/Passive learning, f(x) = erf(x_1): average of generalization errors (10 trials) vs. number of training data for the parametric, multi-point-search, and passive methods.

Fig. 9. A typical learning curve of multi-point-search active learning: generalization error and number of hidden units vs. number of training data.

V. Application to a color conversion problem

We apply our active learning methods with the pruning technique to a color conversion problem, which is found in many color reproduction systems using CMY (cyan, magenta, and yellow) ink. The problem is to simulate a specific color reproduction system, such as a color printer, which produces a color print for a given CMY input signal. The print result can be physically measured and represented in a color system like RGB.


It is very important to know the function from CMY to RGB of a specific system in order to achieve accurate color reproduction. It is known that the function from CMY to RGB can be theoretically given by the Neugebauer equations ([23]). However, a practical system like color xerography has a very complicated mechanism, and the theoretical equations cannot predict the actual result with high accuracy. The neural network approach is one of the methods used to approximate the non-linear relation of the color conversion ([24]). Moreover, since the precise measurement of color is costly, active learning is a promising way to simulate the system.

In this paper, to demonstrate the effectiveness of our active learning methods, we estimate the relation given by the Neugebauer equations instead of using a real color reproduction system. It is known that the Neugebauer equations approximate the real system well in the case of offset printing. Then, if we can verify the effectiveness in estimating the theoretical equations, we can also expect effectiveness in approximating a real system.

The model which we use has 3 input and 3 output units. The initial number of hidden units is 8. For the parameters of the Neugebauer equations, we use the relation in [25]. We add an independent Gaussian noise N(0, 10^{-4} I_3) to the output of the Neugebauer equations to simulate measurement noise. Since we have no meaningful reason to assume a special environmental distribution, we use the uniform distribution on [0, 1]^3 for Q. After the initial training with 30 examples, which are given passively, we train a network by the parametric and the multi-point-search active learning methods, collecting training samples one by one up to 150. In the parametric method, we use a mixture model of 8 normal distributions restricted to [0, 1]^3. In the multi-point-search method, the number of candidates is K_n = ⌈10√(n-30)⌉.

Fig. 10 shows the average of the generalization errors for 30 trials with different initial training examples. We can see that the results of the active learning methods remarkably outperform the result of passive learning. If we evaluate their effect by l_para(i)/l_passive(i), where l_para(i) and l_passive(i) are the values of the graphs at the ith datum for parametric active learning and passive learning, respectively, the average of l_para(i)/l_passive(i) for i = 50, ..., 100 is 0.80, and the average for i = 50, ..., 150 is 0.85. The effect of the multi-point-search method evaluated in the same manner is 0.61 for i = 50, ..., 100, and 0.67 for i = 50, ..., 150. These results clearly show the effectiveness of our methods in this real-world application. The multi-point-search method shows a better result than the parametric method throughout the training, and the advantage of the latter method over passive learning is smaller at the final stage of learning. One of the reasons for this is the probabilistic aspect described in Section III: in particular, it is difficult to find a suitable density family on a compact input space. This problem always arises if the input space is bounded, and is a disadvantage of the parametric method.

Fig. 10. Active learning of a color conversion problem: average of generalization errors (30 trials) vs. number of training data for the parametric, multi-point-search, and passive methods.

VI. Conclusion

We discussed statistical active learning methods for the purpose of applying them to the multilayer perceptron model. We explained the problem of local minima in active learning of neural networks, and proposed two probabilistic active learning methods to prevent local minima. This problem does not appear in linear models, in which the least square error estimator can be obtained directly.

We also explained the importance of model selection, especially in active learning. The derivation of many active learning methods requires that the model include the true function. This is essential to the effect of active learning, while we cannot assure it in many practical applications. On the other hand, too many hidden units make the active learning methods inapplicable because of the singularity of Fisher information matrices. To solve this problem, we proposed an active learning method with pruning to keep the Fisher information nonsingular, based on the theorem that clarifies the singularity condition of a Fisher information matrix of a three-layer perceptron. Experimental results showed that active learning with pruning eliminated surplus hidden units and had a remarkable effect in reducing the generalization error.

Acknowledgments

The author would like to express gratitude to Dr. Sumio Watanabe for his encouragement and helpful comments.


References

[1] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
[2] R. H. Myers, A. I. Khuri, and W. H. Carter, Jr. Response surface methodology: 1966-1988. Technometrics, 31(2):137-157, 1989.
[3] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):305-318, 1992.
[4] D. A. Cohn. Neural network exploration using optimal experiment design. In J. Cowan et al., editors, Advances in Neural Information Processing Systems 6, pages 679-686, San Mateo, 1994. Morgan Kaufmann.
[5] P. Sollich. Query construction, entropy and generalization in neural network models. Physical Review E, 49:4637-4651, 1994.
[6] K. Fukumizu and S. Watanabe. Error estimation and learning data arrangement for neural networks. In Proceedings of the IEEE International Conference on Neural Networks, volume 2, pages 777-780, June 1994.
[7] H. Cramer. Mathematical Methods of Statistics, pages 497-506. Princeton University Press, Princeton, NJ, 1946.
[8] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automatic Control, 19(6):716-723, 1974.
[9] H. White. Learning in artificial neural networks: a statistical perspective. Neural Computation, 1:425-464, 1989.
[10] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion -- determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5(6):865-872, 1994.
[11] S. Watanabe and K. Fukumizu. Probabilistic design of layered neural networks based on their unified framework. IEEE Trans. Neural Networks, 6(3), 1995.
[12] K. Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871-879, 1996.
[13] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.
[14] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Royal Statist. Soc., 36:111-133, 1974.
[15] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[16] J. Moody. The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems.
[17] K. Fukumizu. Active learning in multilayer perceptrons. In D. S. Touretzky et al., editors, Advances in Neural Information Processing Systems 8, pages 295-301, Cambridge, 1996. MIT Press.
[18] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363-366, 1960.
[19] J. Kindermann, G. Paass, and F. Weber. Query construction for neural networks using the bootstrap. In Proceedings of the International Conference on Artificial Neural Networks 95, pages 135-140, 1995.
[20] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.
[21] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183-192, 1989.
[22] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.
[23] H. E. J. Neugebauer. Die theoretischen Grundlagen des Mehrfarbenbuchdrucks. Zeitschrift für wissenschaftliche Photographie, Photophysik und Photochemie, 36(4):73-89, 1937.
[24] T. Iga, Y. Arai, and S. Usui. Trend of a present color management technology in the industry. In Proceedings of the 5th International Conference on Neural Information Processing (ICONIP'98), pages 40-43, 1998.
[25] J. A. C. Yule. Principles of Color Reproduction, Appendix E. John Wiley & Sons, New York, 1967.