Analysis of Ensemble Learning using Simple Perceptrons based on Online Learning Theory

Seiji Miyoshi, Department of Electronic Engineering, Kobe City College of Tech., Kobe, 651-2194 Japan. E-mail: [email protected]
Kazuyuki Hara, Department of Electronics and Information Engineering, Tokyo Metropolitan College of Tech., Tokyo, 140-0011 Japan
Masato Okada, RIKEN Brain Science Institute, Wako, Saitama, 351-0198 Japan; Intelligent Cooperation and Control, PRESTO, JST, Wako, Saitama, 351-0198 Japan

Abstract— Ensemble learning of K nonlinear perceptrons, which determine their outputs by sign functions, is discussed within the framework of online learning and statistical mechanics. This paper shows that the ensemble generalization error can be calculated by using two order parameters, that is, the similarity between a teacher and a student, and the similarity among students. The differential equations that describe the dynamical behaviors of these parameters are derived analytically in the cases of Hebbian, perceptron and AdaTron learning. These three rules show different characteristics in their affinity for ensemble learning, that is, "maintaining variety among students." The results show that AdaTron learning is superior to the other two rules.

I. INTRODUCTION

Ensemble learning has recently attracted the attention of many researchers [1], [2], [3], [4], [5], [6]. Ensemble learning means to combine many rules or learning machines (students in the following) that perform poorly. Theoretical studies analyzing the generalization performance by using statistical mechanics [7], [8] have been performed vigorously [4], [5], [6].

Hara and Okada [4] theoretically analyzed the case in which the students are linear perceptrons. Their analysis was performed with statistical mechanics, focusing on the fact that the output of a new perceptron, whose connection weight is equivalent to the mean of those of the students, is identical to the mean of the outputs of the students. Krogh and Sollich [5] analyzed ensemble learning of linear perceptrons with noise within the framework of batch learning.

On the other hand, Hebbian learning, perceptron learning and AdaTron learning are well known as learning rules for a nonlinear perceptron, which decides its output by a sign function [9], [10], [11], [12]. Urbanczik [6] analyzed ensemble learning of nonlinear perceptrons that decide their outputs by sign functions for a large K limit within the framework of online learning [13]. Though Urbanczik discussed ensemble learning of nonlinear perceptrons within the framework of online learning, he treated only the case in which the number K of students is large enough. Determining the differences among ensemble learning with Hebbian learning, perceptron learning and AdaTron learning (three typical learning rules) is a very attractive problem, but it is one that has never been analyzed to the best of our knowledge.

Based on the past studies, we discuss ensemble learning of K nonlinear perceptrons, which decide their outputs by sign functions, within the framework of online learning and finite K [14], [15]. First, we show that the ensemble generalization error of K students can be calculated by using two order parameters: one is the similarity between a teacher and a student, the other is the similarity among students. Next, we derive differential equations that describe the dynamical behaviors of these order parameters in the case of general learning rules. After that, we derive concrete differential equations for three well-known learning rules: Hebbian learning, perceptron learning and AdaTron learning. We calculate the ensemble generalization errors by using the results obtained through solving these equations numerically. Two methods are treated to decide an ensemble output. One is the majority vote of the students, and the other is the output of a new perceptron whose connection weight equals the mean of those of the students. As a result, we show that these three learning rules have different properties with respect to their affinity for ensemble learning, and that AdaTron learning, which is known to have the best asymptotic property [9], [10], [11], [12], is the best among the three learning rules within the framework of ensemble learning.

II. MODEL

Each student treated in this paper is a perceptron that decides its output by a sign function. An ensemble of K students is considered. The connection weights of the students are J_1, ..., J_K, where J_k = (J_k1, ..., J_kN), k = 1, 2, ..., K, and the input x = (x_1, ..., x_N) are N dimensional vectors. Each component x_i of x is assumed to be an independent random variable that obeys the Gaussian distribution N(0, 1/N). Each component J0_ki of J0_k, that is, the initial value of J_k, is assumed to be generated independently according to the Gaussian distribution N(0, 1). Each student's output is sgn(u_1 l_1), sgn(u_2 l_2), ..., sgn(u_K l_K), where

  sgn(u_k l_k) = { +1, u_k l_k >= 0
                   -1, u_k l_k < 0 }                                  (1)

  u_k l_k = J_k . x                                                   (2)

Here, l_k denotes the length of student J_k. This is one of the order parameters treated in this paper and will be described in detail later. In this paper, u_k is called a normalized internal potential of a student.

The teacher is also a perceptron that decides its output by a sign function. The teacher's connection weight is B. In this paper, B = (B_1, ..., B_N) is assumed to be fixed, where B is also an N dimensional vector. Each component B_i is assumed to be generated independently according to the Gaussian distribution N(0, 1). The teacher's output is sgn(v), where

  v = B . x                                                           (3)

Here, v represents an internal potential of the teacher. For simplicity, the connection weight of a student and that of the teacher are simply called student and teacher, respectively.

In this paper the thermodynamic limit N -> infinity is also treated. Therefore,

  ||x|| = 1,  ||B|| = sqrt(N),  ||J0_k|| = sqrt(N)                    (4)

where ||.|| denotes a vector norm. Generally, the norm ||J_k|| of a student changes as the time step proceeds. Therefore, the ratio l_k of the norm to sqrt(N) is considered and is called a length of student J_k. That is,

  ||J_k|| = l_k sqrt(N)                                               (5)

where l_k is one of the order parameters treated in this paper.
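The normalization of Eqs. (4)-(5) is easy to check numerically: for components drawn from N(0, 1), the length l_k concentrates at 1 for large N. A small self-contained check (sizes and names are our own, not from the paper):

```python
import math
import random

random.seed(1)
N = 100_000
# Initial student weights with components ~ N(0, 1), as in Sec. II.
J0 = [random.gauss(0.0, 1.0) for _ in range(N)]
# Length l_k of Eq. (5): the norm divided by sqrt(N).
l0 = math.sqrt(sum(j * j for j in J0)) / math.sqrt(N)
# By Eq. (4), l0 -> 1 in the thermodynamic limit N -> infinity.
```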

The common input x is presented to the teacher and all students in the same order. Each student compares its output and the output of the teacher for the input x. Each student's connection weight is corrected to increase the probability that the student's output agrees with that of the teacher. This procedure is called learning, and a method of learning is called a learning rule, of which Hebbian learning, perceptron learning and AdaTron learning are well-known examples [9], [10], [11], [12]. Within the framework of online learning, the only information that can be used for correction, other than that regarding the student itself, is the input x and the output of the teacher for that input. Therefore, the update can be expressed as follows:

  J_k^{m+1} = J_k^m + f_k^m x^m                                       (6)

  f_k^m = f( sgn(v^m), u_k^m )                                        (7)

where m denotes the time step, and f is a function determined by the learning rule.
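The generic update of Eqs. (6)-(7) can be sketched as follows. The concrete forms of f for the three rules appear later as Eqs. (26), (29) and (36); this snippet is a minimal sketch with our own function names, and it uses the convention Theta(0) = 1 of Eq. (10).

```python
def f_update(rule, sgn_v, u):
    """Update amplitude f(sgn(v), u_k) of Eq. (7) for the three rules
    treated later (Eqs. (26), (29), (36))."""
    mistaken = u * sgn_v <= 0              # Theta(-u v) = 1
    if rule == "hebbian":
        return float(sgn_v)                # f = sgn(v)
    if rule == "perceptron":
        return float(sgn_v) if mistaken else 0.0   # f = sgn(v) Theta(-u v)
    if rule == "adatron":
        return -u if mistaken else 0.0             # f = -u Theta(-u v)
    raise ValueError(rule)

def online_step(J, x, sgn_v, l, rule):
    """One online step, Eq. (6): J_k <- J_k + f x, with u_k from Eq. (2)."""
    u = sum(Ji * xi for Ji, xi in zip(J, x)) / l
    f = f_update(rule, sgn_v, u)
    return [Ji + f * xi for Ji, xi in zip(J, x)]
```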

III. ENSEMBLE GENERALIZATION ERROR

One purpose of statistical learning theory is to theoretically obtain the generalization error. In this paper, two methods are treated to determine an ensemble output. One is the majority vote of the K students, which means the ensemble output is decided to be +1 if the number of students whose outputs are +1 exceeds the number of students whose outputs are -1, and -1 in the opposite case. Another method for deciding an ensemble output is adopting the output of a new perceptron whose connection weight is the mean of the weights of the K students. This method is simply called the weight mean in this paper. We use

  eps = Theta( -sgn(B . x) sgn( sum_{k=1}^K sgn(J_k . x) ) )          (8)

and

  eps = Theta( -sgn(B . x) sgn( sum_{k=1}^K J_k . x ) )               (9)

as the error eps for the majority vote and the weight mean, respectively. Here, eps, x and J_k denote eps^m, x^m and J_k^m, respectively; the superscripts m, which represent time steps, are omitted for simplicity. Theta(.) is the step function defined as

  Theta(z) = { 1, z >= 0
               0, z < 0 }                                             (10)

In both cases, eps = 0 if the ensemble output agrees with that of the teacher and eps = 1 otherwise. The generalization error eps_g is defined as the average of the error eps over the probability distribution p(x) of the input x. The generalization error eps_g can be regarded as the probability that the ensemble output disagrees with that of the teacher for a new input x. In the case of the majority vote, using Eqs. (2), (3) and (8), we obtain

  eps = Theta( -sgn(v) sgn( sum_{k=1}^K sgn(u_k l_k) ) )              (11)

In the case of the weight mean, using Eqs. (2), (3) and (9), we obtain

  eps = Theta( -sgn(v) sgn( sum_{k=1}^K u_k l_k ) )                   (12)

That is, the error eps can be described as eps({u_k}, v) by using the normalized internal potentials u_k of the students and the internal potential v of the teacher in both cases. Therefore, the generalization error eps_g can also be described as

  eps_g = int dx p(x) eps = int prod_{k=1}^K du_k dv p(u_1, ..., u_K, v) eps({u_k}, v)   (13)

by using the probability distribution p(u_1, ..., u_K, v) of u_k and v.

As the thermodynamic limit N -> infinity is also considered in this paper, u_k and v obey a multiple Gaussian distribution based on the central limit theorem. As the input x and J_k have no correlation with each other within the framework of online learning, from Eq. (2) the mean and the variance of u_k are 0 and 1, respectively. In the same manner, since the input x and B have no correlation with each other, from Eq. (3) the mean and the variance of v are 0 and 1, respectively. From these, all diagonal components of the covariance matrix Sigma of p(u_1, ..., u_K, v) equal unity.

Let us discuss the direction cosines between connection weights as a preparation for obtaining the non-diagonal components. First, R_k is defined as the direction cosine between the teacher B and a student J_k. That is,

  R_k = (B . J_k) / ( ||B|| ||J_k|| ) = (1/(l_k N)) sum_{i=1}^N B_i J_ki   (14)

When the teacher B and a student J_k have no correlation, R_k = 0, and R_k = 1 when the directions of B and J_k agree. Therefore, R_k is called the similarity between teacher and student in the following. Furthermore, R_k is the second order parameter treated in this paper. Next, q_kk' is defined as the direction cosine between a student J_k and another student J_k'. That is,

  q_kk' = (J_k . J_k') / ( ||J_k|| ||J_k'|| ) = (1/(l_k l_k' N)) sum_{i=1}^N J_ki J_k'i   (15)

where k != k'. When a student J_k and another student J_k' have no correlation, q_kk' = 0, and q_kk' = 1 when the directions of J_k and J_k' agree. Therefore, q_kk' is called the similarity among students in the following, and q_kk' is the third order parameter treated in this paper.

The covariance between the internal potential v of the teacher and the normalized internal potential u_k of a student equals the similarity R_k between the teacher and the student as follows:

  <v u_k> = < ( sum_{i=1}^N B_i x_i ) (1/l_k) ( sum_{i'=1}^N J_ki' x_i' ) > = R_k   (16)

where <.> denotes the average. In the same manner, the covariance between the normalized internal potential u_k of a student and the normalized internal potential u_k' of another student equals the similarity q_kk' among students:

  <u_k u_k'> = < (1/l_k) ( sum_i J_ki x_i ) (1/l_k') ( sum_i' J_k'i' x_i' ) > = q_kk'   (17)

Therefore, Eq. (13) can be rewritten as

  eps_g = int prod_{k=1}^K du_k dv p(u_1, ..., u_K, v) eps({u_k}, v)   (18)

  p(u_1, ..., u_K, v) = (2 pi)^{-(K+1)/2} |Sigma|^{-1/2} exp( -(1/2) w^T Sigma^{-1} w ),  w = (u_1, ..., u_K, v)^T   (19)

  Sigma = ( 1      q_12   ...  q_1K   R_1
            q_21   1      ...  q_2K   R_2
            ...    ...    ...  ...    ...
            q_K1   q_K2   ...  1      R_K
            R_1    R_2    ...  R_K    1  )                             (20)

As a result, the generalization error eps_g can be calculated if all similarities R_k and q_kk' are obtained. Let us thus discuss the differential equations that describe the dynamical behaviors of these order parameters. In this paper, the norms of input, teacher and student are set as in Eq. (4); the influence of the input can be replaced with the average over the distribution of inputs (sample average) in a large N limit. This idea is called self-averaging in statistical mechanics. Differential equations regarding l_k and R_k for general learning rules have been obtained based on self-averaging as follows [9]:

  dl_k/dt = <f_k u_k> + <f_k^2> / (2 l_k)                             (21)

  dR_k/dt = ( <f_k v> - <f_k u_k> R_k ) / l_k - R_k <f_k^2> / (2 l_k^2)   (22)

where <.> stands for the sample average.

Next, let us derive the differential equation regarding q_kk' for a general learning rule. Considering a student J_k and another student J_k', and rewriting the update J_k^{m+1} = J_k^m + f_k^m x^m as l_k -> l_k + dl_k, l_k' -> l_k' + dl_k', q_kk' -> q_kk' + dq_kk' with dt = 1/N, the differential equation regarding q_kk' is obtained as follows:

  dq_kk'/dt = ( <f_k u_k'> - <f_k u_k> q_kk' ) / l_k
            + ( <f_k' u_k> - <f_k' u_k'> q_kk' ) / l_k'
            + <f_k f_k'> / (l_k l_k')
            - q_kk' ( <f_k^2> / (2 l_k^2) + <f_k'^2> / (2 l_k'^2) )   (23)

from Eqs. (6), (15), (21) and self-averaging.
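The reduction of Eq. (13) to the Gaussian average of Eqs. (18)-(20) can be sanity-checked by direct sampling. For a single student (K = 1), the covariance matrix (20) is 2 x 2 with off-diagonal entry R, and the average reproduces the classical single-perceptron result eps_g = (1/pi) arccos R. The short check below is our own construction, not code from the paper:

```python
import math
import random

def eg_single_student(R, n=200_000, seed=2):
    """Sample (u, v) from the K = 1 case of Eqs. (19)-(20): zero-mean,
    unit-variance Gaussians with covariance <u v> = R, then average the
    error of Eq. (11), which for one student is Theta(-sgn(v) sgn(u))."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - R * R)
    bad = 0
    for _ in range(n):
        v = rng.gauss(0, 1)
        u = R * v + s * rng.gauss(0, 1)   # gives <u v> = R, <u^2> = 1
        bad += (u >= 0) != (v >= 0)
    return bad / n

eg_mc = eg_single_student(0.5)
eg_exact = math.acos(0.5) / math.pi   # classical result, equals 1/3 here
```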

IV. ANALYTICAL RESULTS

A. Conditions of analytical calculations

The similarities R_k and q_kk' increase and approach unity as learning proceeds, and R_k and q_kk' cannot remain irrelevant to each other. For example, when R_k = R_k' = 1, q_kk' cannot be q_kk' != 1, since the teacher B, the student J_k and the other student J_k' then have the same direction. Thus, R_k and q_kk' are under a certain restraint relationship with each other. When q_kk' is relatively small compared with R_k, the variety among students is further maintained and the effect of the ensemble can be considered large. On the contrary, after q_kk' becomes unity, a student J_k and another student J_k' are the same and there is no merit in combining them.

Let us explain these considerations intuitively by using Figure 1, and let us assume that learning starts from the condition that the connection weights of the students have no correlation. When learning has proceeded to some degree, the K connection weight vectors J_k of the students must distribute at the same distance from the connection weight vector B of the teacher, as shown in Figure 1. Figure 1(a) shows the case in which the students are unlike each other; in other words, the variety among students is large, that is, q is small. In this case, the mean vector (1/K) sum_{k=1}^K J_k of the connection weights of the students can closely approximate the connection weight vector B of the teacher. Thus, a combination of students in some sense can approximate the teacher better than each student can do alone. In this case, the effect of ensemble learning is strong. On the contrary, Figure 1(b) shows the case in which the students are similar to each other; in other words, the variety among students is small, that is, q is large. In this case, the significance of combining three students is small. Therefore, the effect of ensemble learning is small when q is large, as in Figure 1(b).

Fig. 1. Effect of ensemble learning. (a) Students J_1, J_2, J_3 are diverse around the teacher B; (b) students are similar to each other.
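The intuition of Fig. 1(a) can be illustrated numerically: averaging students that share the same similarity R to the teacher but have independent orthogonal parts yields a vector noticeably closer to the teacher than any single student. This toy construction (names, sizes, and the value R = 0.6 are our own choices, not from the paper) is a sketch, not the paper's simulation:

```python
import math
import random

def cos_sim(a, b):
    """Direction cosine between two vectors, as in Eqs. (14)-(15)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

random.seed(3)
N, K, R = 2000, 3, 0.6
B = [random.gauss(0, 1) for _ in range(N)]
nB = math.sqrt(sum(b * b for b in B))
# K students at (roughly) the same similarity R to the teacher, with
# independent orthogonal parts: the "large variety" case of Fig. 1(a).
students = []
for _ in range(K):
    noise = [random.gauss(0, 1) for _ in range(N)]
    students.append([R * math.sqrt(N) * b / nB + math.sqrt(1 - R * R) * e
                     for b, e in zip(B, noise)])
mean_J = [sum(col) / K for col in zip(*students)]
sim_single = sum(cos_sim(J, B) for J in students) / K  # close to R
sim_mean = cos_sim(mean_J, B)                          # noticeably above R
```

Averaging shrinks the independent noise parts by roughly 1/sqrt(K) while the common signal part survives, which is exactly why the mean vector in Fig. 1(a) approximates the teacher well.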

Thus, the relationship between R_k and q_kk' is essential to know in ensemble learning. This relationship for linear perceptrons has already been analyzed quantitatively in a very clear form [4]. Here, we analytically investigate the relationship between R_k and q_kk' with respect to the three learning rules for nonlinear perceptrons in the following.

As described above, in this paper each component of the initial value J0_k of a student and of the teacher B is generated independently according to the Gaussian distribution N(0, 1), and the thermodynamic limit N -> infinity is considered. Therefore, all B and J0_k are orthogonal to each other. That is,

  R0_k = 0,  q0_kk' = 0                                               (24)

From Eq. (24) and the symmetry of students, we can write

  <f_k u_k'> = <f_k' u_k>                                             (25)

in Eq. (23). From Eq. (24) and the symmetry among students, we omit the subscripts from the order parameters l_k, R_k and q_kk' in Eqs. (21)-(23) and write them as l, R and q. In the following sections, we discuss concretely the five sample averages <f_k u_k>, <f_k v>, <f_k^2>, <f_k u_k'> and <f_k f_k'>, which are necessary to solve Eqs. (21)-(23) with respect to the typical learning rules under the conditions given in Eqs. (24)-(25).

B. Hebbian learning

The update procedure for Hebbian learning is

  f = sgn(v)                                                          (26)

Using this expression, <f_k u_k>, <f_k v> and <f_k^2> in the case of Hebbian learning can be obtained analytically as follows [9], [16]:

  <f_k u_k> = sqrt(2/pi) R,  <f_k v> = sqrt(2/pi),  <f_k^2> = 1       (27)

In this section, <f_k u_k'> and <f_k f_k'> are derived. Since Eq. (26) is independent of u_k, we obtain

  <f_k u_k'> = sqrt(2/pi) R,  <f_k f_k'> = 1                          (28)

Figure 2 shows a comparison between the analytical results regarding the dynamical behaviors of R and q, obtained by solving Eqs. (21)-(25) and (27)-(28) numerically, and computer simulation (N = 10^5). They closely agree with each other. That is, the derived theory explains the computer simulation quantitatively. Figure 2 shows that q rises more rapidly than R in Hebbian learning; in other words, q is relatively large when compared with R, meaning the variety among students disappears rapidly in Hebbian learning.
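The Hebbian flow of R and q can be reproduced by integrating Eqs. (21)-(23) with the averages of Eqs. (27)-(28). The Euler sketch below is our own (step size, horizon and the initial length l(0) = 1 from Eq. (4) are our choices); it shows the behavior reported for Fig. 2, namely q running ahead of R:

```python
import math

def hebbian_flow(t_end=5.0, dt=1e-3):
    """Euler integration of Eqs. (21)-(23) with the Hebbian averages of
    Eqs. (27)-(28): <f u> = <f u'> = sqrt(2/pi) R, <f v> = sqrt(2/pi),
    <f^2> = <f f'> = 1."""
    c = math.sqrt(2.0 / math.pi)
    l, R, q = 1.0, 0.0, 0.0   # initial conditions, Eqs. (4) and (24)
    t = 0.0
    while t < t_end:
        dl = c * R + 1.0 / (2.0 * l)                             # Eq. (21)
        dR = c * (1.0 - R * R) / l - R / (2.0 * l * l)           # Eq. (22)
        dq = 2.0 * c * R * (1.0 - q) / l + (1.0 - q) / (l * l)   # Eq. (23)
        l, R, q = l + dt * dl, R + dt * dR, q + dt * dq
        t += dt
    return l, R, q

l5, R5, q5 = hebbian_flow()   # at t = 5: q has outrun R, as in Fig. 2
```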

C. Perceptron learning

The update procedure for perceptron learning is

  f = sgn(v) Theta(-u v)                                              (29)

Using this expression, <f_k u_k>, <f_k v> and <f_k^2> in the case of perceptron learning can be obtained analytically as follows [9], [16]:

  <f_k u_k> = (R - 1) / sqrt(2 pi),  <f_k v> = (1 - R) / sqrt(2 pi)   (30)

  <f_k^2> = (1/pi) tan^{-1}( sqrt(1 - R^2) / R )                      (31)

In this section, <f_k u_k'> and <f_k f_k'> are derived. Using Eq. (29), <f_k u_k'> and <f_k f_k'> in the case of perceptron learning are obtained analytically as follows:

  <f_k u_k'> = int du_k du_k' dv p_3(u_k, u_k', v) sgn(v) Theta(-u_k v) u_k'
             = (R - q) / sqrt(2 pi)                                   (32)

  <f_k f_k'> = int du_k du_k' dv p_3(u_k, u_k', v) Theta(-u_k v) Theta(-u_k' v)
             = 2 int_0^inf Dv int_{-inf}^{-Rv/sqrt(1-R^2)} Dx H(z)    (33)

where

  z = ( (q - R^2) x + R sqrt(1 - R^2) v ) / sqrt( (1 - q)(1 + q - 2R^2) )   (34)

and the definitions of H(u) and Dx are

  H(u) = int_u^inf Dx,  Dx = (dx / sqrt(2 pi)) exp(-x^2/2)            (35)

Here, p_3(u_k, u_k', v) denotes the joint Gaussian distribution of u_k, u_k' and v, that is, Eq. (19) for K = 2.

Figure 3 shows a comparison between the analytical results regarding the dynamical behaviors of R and q, obtained by solving Eqs. (21)-(25) and (30)-(33) numerically, and computer simulation (N = 10^5). They closely agree with each other. That is, the derived theory explains the computer simulation quantitatively. Figure 3 shows that q is smaller than R in the early period of learning, which means that perceptron learning maintains the variety among students for a longer time than Hebbian learning.

D. AdaTron learning

The update procedure for AdaTron learning is

  f = -u Theta(-u v)                                                  (36)

Using this expression, <f_k u_k>, <f_k v> and <f_k^2> in the case of AdaTron learning can be obtained analytically as follows [9], [16]:

  <f_k u_k> = -2 int_0^inf Du u^2 H( Ru / sqrt(1 - R^2) )             (37)

  <f_k v> = (1/pi) ( sqrt(1 - R^2) - R cot^{-1}( R / sqrt(1 - R^2) ) )   (38)

  <f_k^2> = -<f_k u_k> = (1/pi) ( cot^{-1}( R / sqrt(1 - R^2) ) - R sqrt(1 - R^2) )   (39)
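The closed forms of Eqs. (38)-(39) can be verified by sampling f = -u Theta(-u v) directly over correlated Gaussian potentials. This Monte-Carlo check is our own sketch (sample size and seed are arbitrary choices), using cot^{-1}(c) = atan(1/c) for c > 0:

```python
import math
import random

def adatron_averages_mc(R, n=400_000, seed=4):
    """Monte-Carlo estimates of <f v> and <f^2> for AdaTron learning,
    Eq. (36), with (u, v) unit-variance Gaussians of correlation R."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - R * R)
    fv = f2 = 0.0
    for _ in range(n):
        v = rng.gauss(0, 1)
        u = R * v + s * rng.gauss(0, 1)
        if u * v <= 0:        # Theta(-u v) = 1
            f = -u
            fv += f * v
            f2 += f * f
    return fv / n, f2 / n

R = 0.5
fv_mc, f2_mc = adatron_averages_mc(R)
c = R / math.sqrt(1 - R * R)
fv_th = (math.sqrt(1 - R * R) - R * math.atan(1.0 / c)) / math.pi  # Eq. (38)
f2_th = (math.atan(1.0 / c) - R * math.sqrt(1 - R * R)) / math.pi  # Eq. (39)
```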

Fig. 2. Dynamical behaviors of R and q in Hebbian learning. Theory and simulation (N = 10^5); horizontal axis: time t = m/N.
Fig. 3. Dynamical behaviors of R and q in perceptron learning. Theory and simulation (N = 10^5); horizontal axis: time t = m/N.
Fig. 4. Dynamical behaviors of R and q in AdaTron learning. Theory and simulation (N = 10^5); horizontal axis: time t = m/N.

In this section, <f_k u_k'> and <f_k f_k'> are derived. Using Eq. (36), <f_k u_k'> and <f_k f_k'> in the case of AdaTron learning are obtained analytically as Eqs. (40) and (41), where the definitions of z, H(u) and Dx are those of Eqs. (34) and (35), respectively.

Figure 4 shows a comparison between the analytical results regarding the dynamical behaviors of R and q, obtained by solving Eqs. (21)-(25) and (38)-(41) numerically, and computer simulation (N = 10^5). They closely agree with each other. That is, the derived theory explains the computer simulation quantitatively. Figure 4 shows that q is relatively small when compared with R, more so than in the cases of Hebbian learning and perceptron learning. This means that AdaTron learning maintains the variety among students best out of these three learning rules.

V. DISCUSSION

The results in the previous section showed that AdaTron learning maintains the variety among students best out of the three learning rules. Thus, AdaTron learning is expected to be the best suited for ensemble learning. To confirm this prediction, we have obtained numerical ensemble generalization errors eps_g in the case of K = 3 by using R and q for the three learning rules, that is, Figures 2-4, and Eqs. (18)-(20). Figures 5-7 show the results. In these figures, MV and WM indicate the majority vote and the weight mean, respectively. Numerical integrations of Eq. (18) in the theoretical calculations have been executed by using the six-point closed Newton-Cotes formula. In the computer simulations, N = 10^5, and the ensemble generalization errors have been obtained through tests using fresh random inputs at each time step. In each figure, the result of the theoretical calculation for K = 1 is also shown to clarify the effect of the ensemble.

These three figures show that the ensemble generalization errors obtained by the theoretical calculations explain the computer simulations quantitatively. Though the generalization errors of the three learning rules are all improved by increasing K from 1 to 3, the degree of improvement is small in Hebbian learning and large in AdaTron learning. That is, the effect of the ensemble in AdaTron learning is the largest, as predicted above, due to the relationship between R and q. AdaTron learning originally features the fastest asymptotic characteristic of the three learning rules [9]. However, it has the disadvantage that learning is slow at the beginning; that is, its generalization error is larger than for the other two learning rules in the early period. This paper shows that AdaTron learning has a good affinity with ensemble learning in regard to "the variety among students," and that the disadvantage of the early period can be improved by combining it with ensemble learning.

From the perspective of the difference between the majority vote and the weight mean, Figures 5-7 show that the improvement by the weight mean is larger than that by the majority vote for all three learning rules. The improvement in the generalization error by averaging the connection weights of various students can be understood intuitively, because the mean of the students is close to the teacher in Figure 1(a). The reason why the improvement by the majority vote is smaller than that by the weight mean is considered to be that the variety among students cannot be utilized as effectively by the majority vote as by the weight mean. However, the majority vote can determine an ensemble output using only the outputs of the students, and is easy to implement. It is, therefore, significant that the effect of an ensemble in the case of the majority vote has been analyzed quantitatively.

Figure 8 shows the results of computer simulations where N = 10^3 and K = 1, 3, 11, 31, run until t = 10^4 in order to investigate the asymptotic behaviors of the generalization errors. The asymptotic behavior of the generalization error of AdaTron learning in the case of the number K of students at unity is O(t^{-1}) [9], [12]. The asymptotic order of the generalization error in the case of ensemble learning is considered equal to that of K = 1, since the properties of K = 3, 11, 31 are parallel to those of K = 1 in Figure 8. This figure also shows that the effect of ensemble learning on AdaTron learning is maintained asymptotically. The improvement in the generalization error tends to be saturated, since the difference between the generalization error at K = 11 and that at K = 31 on a log scale is very small.

  

½ ½

Ô

½·Õ

¾

¾

¼ ¼ ¼ ¼

Ê ÜÜ Ù Ù Ù Ú Ô ´Ù Ù Úµ¢´ Ù Ú µÙ Ù ½ Ê ¾Õ Ú

   ¿     

 (40)

ÊÚ



Ô

¼

¾

½ Ê



¼ ¼ ¼ ¼ ¼

Ú Ù Ù Ù Ù Ô ´Ù Ù Úµ¢´ Ù Ú µ¢´ Ù Ú µ

      ¿    

×

 

¡

 

½ ½

¾ ¾

¾

½·Õ ¾Ê ´½ Õ µ

´½ · Õ µ´½ Ê µ

¾ ¾

ÜÜ À ´ Þ µ Ê ·¾´Õ Ê µ Ú

¿

ÊÚ

½ Õ

¾

¾

Ô

¼

¾ ´½ Ê µ

¾

½ Ê

¡

   

½ ½ ½ ½

¾

¾Ê ½·Õ Ê

¾ ¾

Ô

Ú Ú

ÜÜÀ ´Þ µ·¾Ê ÜÀ ´ Þ µ

Ú Ú (41)

¾

ÊÚ ÊÚ

½ Ê

Ô Ô

¼ ¼

¾ ¾

½ Ê ½ Ê

Fig. 5. Dynamical behaviors of the ensemble generalization error eps_g in Hebbian learning. K = 1 (theory); K = 3, MV and WM (theory and simulation); horizontal axis: time t = m/N.
Fig. 6. Dynamical behaviors of the ensemble generalization error eps_g in perceptron learning. Legend as in Fig. 5.
Fig. 7. Dynamical behaviors of the ensemble generalization error eps_g in AdaTron learning. Legend as in Fig. 5.
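The MV-versus-WM gap discussed above can be reproduced by sampling Eqs. (18)-(20) directly. The sketch below is our own construction, not the paper's code; it assumes uniform similarities with q >= R^2, so that the correlated potentials can be built from a shared part and private parts instead of a full Cholesky factorization, and the values R = 0.7, q = 0.55 are arbitrary illustrative choices:

```python
import math
import random

def ensemble_error(R, q, K, mode, n=300_000, seed=5):
    """Sampling estimate of Eq. (18) for uniform R and q (q >= R^2):
    'mv' applies the majority vote of Eq. (11), 'wm' the weight mean
    of Eq. (12) with equal student lengths."""
    rng = random.Random(seed)
    a, b = math.sqrt(q - R * R), math.sqrt(1.0 - q)
    bad = 0
    for _ in range(n):
        v = rng.gauss(0, 1)
        s = rng.gauss(0, 1)   # part shared by all students
        u = [R * v + a * s + b * rng.gauss(0, 1) for _ in range(K)]
        # By construction: <u_k v> = R, <u_k u_k'> = q, <u_k^2> = 1.
        out = (sum(1 if x >= 0 else -1 for x in u) if mode == "mv"
               else sum(u))
        bad += (out >= 0) != (v >= 0)
    return bad / n

R, q = 0.7, 0.55
e_mv = ensemble_error(R, q, 3, "mv")
e_wm = ensemble_error(R, q, 3, "wm")
e_single = math.acos(R) / math.pi   # K = 1 baseline
```

With variety present (q noticeably below 1), both ensembles beat the single student and the weight mean beats the majority vote, consistent with Figs. 5-7.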

Fig. 8. Asymptotic behavior of the generalization error of the majority vote in AdaTron learning (log-log axes; time t = m/N up to 10^4). Computer simulations (K = 1, 3, 11, 31), except for the solid line (K = 1, theory).

VI. CONCLUSION

This paper has discussed ensemble learning of K nonlinear perceptrons, which determine their outputs by sign functions, within the framework of online learning and statistical mechanics. We have shown that the ensemble generalization error can be calculated by using two order parameters, that is, the similarity between the teacher and a student, and the similarity among students. The differential equations that describe the dynamical behaviors of these order parameters have been derived for the case of general learning rules. The concrete forms of these differential equations have been derived analytically in the cases of three well-known rules: Hebbian learning, perceptron learning and AdaTron learning. As a result, these three rules have different characteristics in their affinity for ensemble learning, that is, "maintaining variety among students." The results show that AdaTron learning is superior to the other two rules with respect to that affinity.

ACKNOWLEDGMENT

This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan, with Grants-in-Aid for Scientific Research 13780313, 14084212, 14580438 and 15500151.

REFERENCES

[1] Y. Freund and R. E. Schapire, Journal of Japanese Society for Artificial Intelligence, 14(5), 771 (1999) (in Japanese, translation by N. Abe).
[2] L. Breiman, Machine Learning, 26(2), 123 (1996).
[3] Y. Freund and R. E. Schapire, Journal of Comp. and Sys. Sci., 55(1), 119 (1997).
[4] K. Hara and M. Okada, Proc. Fifth Workshop on Information-Based Induction Sciences, 113 (2002) (in Japanese).
[5] A. Krogh and P. Sollich, Phys. Rev. E 55(1), 811 (1997).
[6] R. Urbanczik, Phys. Rev. E 62(1), 1448 (2000).
[7] J. A. Hertz, A. Krogh and R. G. Palmer, Introduction to the Theory of Neural Computation (Addison-Wesley, Redwood City, CA, 1991).
[8] M. Opper and W. Kinzel, in Physics of Neural Networks III, edited by E. Domany, J. L. van Hemmen and K. Schulten (Springer, Berlin, 1995).
[9] H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction (Oxford University Press, Oxford, 2001).
[10] J. K. Anlauf and M. Biehl, Europhys. Lett. 10(7), 687 (1989).
[11] M. Biehl and P. Riegler, Europhys. Lett. 28(7), 525 (1994).
[12] J. Inoue and H. Nishimori, Phys. Rev. E 55(4), 4544 (1997).
[13] D. Saad (ed.), On-line Learning in Neural Networks (Cambridge University Press, Cambridge, 1998).
[14] S. Miyoshi, K. Hara and M. Okada, IEICE Technical Report 103(228), 13 (2003) (in Japanese).
[15] S. Miyoshi, K. Hara and M. Okada, Proc. Annual Conf. of Japanese Neural Network Society, 104 (2003) (in Japanese).
[16] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, Cambridge, 2001).