
Mathematical Analysis of Evolutionary Algorithms for Optimization

Heinz Mühlenbein, Thilo Mahnig
GMD – Schloss Birlinghoven, 53754 Sankt Augustin, Germany
[email protected], [email protected]

Abstract

Simulating evolution as seen in nature has been identified as one of the key paradigms for the new decade. Today evolutionary algorithms have been successfully used in a number of applications. These include discrete and continuous optimization problems, the synthesis of neural networks, the synthesis of computer programs from examples (also called genetic programming) and even evolvable hardware. But in all application areas problems have been encountered where evolutionary algorithms perform badly. In this survey we concentrate on the analysis of evolutionary algorithms for optimization. We present a mathematical theory based on probability distributions. It gives the reasons why evolutionary algorithms can solve many difficult multi-modal functions and why they fail on seemingly simple ones. The theory also leads to new sophisticated algorithms for which convergence is shown.

1 Introduction

We first introduce the most popular algorithm, the simple genetic algorithm. This algorithm has many degrees of freedom, especially in the recombination scheme used. We show that all genetic algorithms behave very similarly if recombination is done without selection a sufficient number of times before the next selection step. We correct the classical schema analysis of genetic algorithms and show why the usual schema folklore is mathematically wrong. We approximate genetic algorithms by a conceptual algorithm. This algorithm we call the Univariate Marginal Distribution Algorithm (UMDA); it is analyzed in Section 3. We compute the difference equations for the univariate marginal distributions under the assumption of proportionate selection. This equation has been proposed in population genetics by Sewall Wright as early as 1937 [Wri70]. This is an independent confirmation of our claim that UMDA approximates any genetic algorithm. Using Wright's equation we show that UMDA solves a continuous optimization problem. The function to be optimized is given by the average fitness of the population.

Proportionate selection is far too weak for optimization. This has been recognized very early in the breeding of livestock. Artificial selection as done by breeders is a much better model for optimization than natural selection modelled by proportionate selection. Unfortunately an exact mathematical analysis of efficient artificial selection schemes seems impossible. Therefore breeders have developed an approximate theory, using the concepts of regression of offspring to parent, heritability and response to selection. This theory is discussed in Section 4. At the end of the section numerical results are shown which demonstrate the strength and the weakness of UMDA as a numerical optimization method. UMDA optimizes some difficult optimization problems very efficiently, but it fails on some simple problems. For these problems higher-order marginal distributions are necessary which capture the nonlinear dependencies between the variables.

In Section 5.2 UMDA is extended to the Factorized Distribution Algorithm (FDA). We prove convergence of the algorithm to the global optima if Boltzmann selection is used. The theory of factorization connects FDA with the theory of graphical models and Bayesian networks. We derive a new adaptive Boltzmann selection schedule SDS using ideas from the science of breeding. In Section 6.1 we use results from the theory of Bayesian networks for the Learning Factorized Distribution Algorithm (LFDA), which learns a factorization from the data. We make a preliminary comparison between the efficiency of FDA and LFDA.

In Section 7 we describe the system dynamics approach to optimization. The difference equations obtained for UMDA are iterated until convergence. Thus the continuous optimization problem is mathematically solved without using a population of points at all. We present numerical results for three different system dynamics equations: Wright's equation, the diversified replicator equation and a modified version of Wright's equation which converges faster.

In the final section we classify the different evolutionary computation methods presented. The classification criterion is whether a microscopic or a macroscopic model is used for selection and/or recombination.

2 Analysis of the Simple Genetic Algorithm

In this section we investigate the standard genetic algorithm, also called the Simple Genetic Algorithm (SGA). The algorithm is described by Holland [Hol92] and Goldberg [Gol89]. It consists of

• fitness proportionate selection
• recombination/crossover
• mutation

Here we will analyze selection and recombination only. Mutation is considered to be a background operator. It can be analyzed by known techniques from stochastics [MSV94, Müh97].
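The three components can be made concrete in a few lines of code. The following is a minimal illustrative sketch (ours, not code from the paper; the population size, crossover probability pc and mutation rate pm are arbitrary choices) of one SGA run with fitness proportionate selection, one-point crossover and bitwise mutation:

```python
import random

def sga(fitness, n, pop_size=50, generations=40, pc=0.9, pm=0.01, seed=0):
    """Minimal Simple Genetic Algorithm: proportionate selection,
    one-point crossover, bitwise mutation. Fitness must be positive."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        f = [fitness(x) for x in pop]
        total = sum(f)

        def select():  # roulette-wheel (fitness proportionate) selection
            r, acc = rng.uniform(0, total), 0.0
            for x, fx in zip(pop, f):
                acc += fx
                if acc >= r:
                    return x[:]
            return pop[-1][:]

        nxt = []
        while len(nxt) < pop_size:
            a, b = select(), select()
            if rng.random() < pc:  # one-point crossover
                k = rng.randrange(1, n)
                a, b = a[:k] + b[k:], b[:k] + a[k:]
            for child in (a, b):
                for i in range(n):  # bitwise mutation
                    if rng.random() < pm:
                        child[i] = 1 - child[i]
                nxt.append(child)
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

best = sga(lambda x: 1 + sum(x), n=20)  # OneMax, shifted to keep f(x) > 0
```

Note how progress under proportionate selection stalls as the population nears the optimum; this is exactly the weakness of proportionate selection taken up in Section 4.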

There have been many claims concerning the optimization power of genetic algorithms. Most of them are based on a rather qualitative application of the schema theorem. We will show the shortcomings of this approach. Our analysis is based on techniques used in population genetics. The analysis reveals that an exact mathematical analysis of genetic algorithms is possible for small problems only. For a binary problem of size n the exact analysis needs the computation of 2^n equations. But we propose an approximation often used in population genetics. The approximation assumes that the gene frequencies are in linkage equilibrium. The main result is that any genetic algorithm can be approximated by an algorithm using only the univariate marginal gene frequencies.

2.1 Definitions

Let x = (x_1, ..., x_n) denote a binary vector. For notational simplicity we restrict the discussion to binary variables x_i ∈ {0, 1}. We use the following conventions: capital letters X_i denote variables, small letters x_i assignments.

Definition 2.1. Let a positive fitness function f on the binary vectors be given. We consider the optimization problem

    x_opt = argmax_x f(x).    (2.1)

We will use f(x) as the fitness function for the genetic algorithm. We will investigate two widely used recombination/crossover schemes.

Definition 2.2. Let two strings x and y be given. In one-point crossover the string z is created by randomly choosing a crossover point k and setting z_i = x_i for i ≤ k and z_i = y_i for i ≥ k + 1. In uniform crossover, z_i is randomly chosen with equal probability from {x_i, y_i}.

Definition 2.3. Let p(x, t) denote the probability of x in the population at generation t. Then

    p_i(x_i, t) = Σ_{x | X_i = x_i} p(x, t)

defines a univariate marginal distribution.

We often write p_i(x_i) if just one generation is discussed. In this notation the average fitness of the population and the variance are given by

    f̄(t) = Σ_x p(x, t) f(x),
    σ²(t) = Σ_x p(x, t) (f(x) − f̄(t))².

The response to selection R(t) is defined by

    R(t) = f̄(t+1) − f̄(t).    (2.2)

2.2 Proportionate Selection

Proportionate selection changes the probabilities according to

    p^s(x, t) = p(x, t) · f(x) / f̄(t).    (2.3)

Lemma 2.1. For proportionate selection the response is given by

    R(t) = σ²(t) / f̄(t).    (2.4)

Proof: We have

    f̄(t+1) = Σ_x p(x, t) f(x)² / f̄(t) = (σ²(t) + f̄(t)²) / f̄(t) = f̄(t) + σ²(t) / f̄(t).    (2.5)

With proportionate selection the average fitness never decreases. This is true for every rational selection scheme.

2.3 Recombination

For the analysis of recombination we introduce a special distribution.

Definition 2.4. Robbins' proportions are given by the distribution

    π(x) = ∏_{i=1}^n p_i(x_i).    (2.6)

A population in Robbins' proportions is also said to be in linkage equilibrium. Geiringer [Gei44] has shown that all reasonable recombination schemes lead to the same limit distribution.

Theorem 2.1 (Geiringer). Recombination does not change the univariate marginal frequencies, i.e. p_i(x_i, t+1) = p_i(x_i, t). The limit distribution of any complete recombination scheme is Robbins' proportions π(x).

Complete recombination means that for each pair of loci the probability of an exchange of genes by recombination is greater than zero. Convergence to the limit distribution is very fast. We have to mention an important fact: in a finite population linkage equilibrium cannot be exactly achieved. Take the uniform distribution as an example. Here linkage equilibrium is given by p(x) = 2^(−n). This value can only be obtained if the size of the population is substantially larger than 2^n! For a population of size N = 1000 the minimum deviation from Robbins' proportions is already reached after about four generations; afterwards the deviation slowly increases again due to stochastic fluctuations caused by genetic drift. Ultimately the population will consist of one genotype only. Genetic drift has been analyzed by Asoh and Mühlenbein [AM94b]. It will not be considered here.
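Lemma 2.1 and equation (2.3) are easy to verify numerically. The following short sketch (our illustration; the 3-bit fitness table is an arbitrary choice) applies one step of proportionate selection to an explicit distribution over all 2^3 strings and checks that the response equals σ²(t)/f̄(t):

```python
from itertools import product

def select_proportionate(p, f):
    """Eq. (2.3): p^s(x) = p(x) f(x) / favg."""
    favg = sum(p[x] * f[x] for x in p)
    return {x: p[x] * f[x] / favg for x in p}, favg

# arbitrary positive fitness table on n = 3 bits, uniform initial distribution
f = {x: 1 + sum(x) ** 2 for x in product((0, 1), repeat=3)}
p = {x: 1 / 8 for x in f}

ps, favg = select_proportionate(p, f)
variance = sum(p[x] * (f[x] - favg) ** 2 for x in p)
response = sum(ps[x] * f[x] for x in ps) - favg   # favg(t+1) - favg(t)

# Lemma 2.1: response == variance / average fitness
check = abs(response - variance / favg)
```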

2.4 Selection and Recombination

We have shown that the average fitness never decreases after selection and that any complete recombination scheme moves the genetic population to Robbins' proportions. Now the question arises: what happens if recombination is applied after selection? The answer is very difficult; the problem still puzzles population genetics [Nag92]. Formally the difference equations can easily be written. Let a recombination distribution R be given; R_{x,y→z} denotes the probability that x and y produce z after recombination. Then

    p(z, t+1) = Σ_{x,y} p^s(x, t) p^s(y, t) R_{x,y→z},    (2.7)

where p^s(x, t) denotes the probability of string x after selection. For n loci the recombination distribution consists of 2^n × 2^n parameters. Recently Christiansen and Feldman [CF98] have written a survey about the mathematics of selection and recombination from the viewpoint of population genetics. A new technique to obtain the equations has been developed by Vose [Vos99]. In both frameworks one needs a computer program to compute the equations for a given fitness function.

A mathematical analysis of the properties of n-locus systems is difficult. For a problem of size n we have 2^n equations. Furthermore the equations depend on the recombination operator used! If the gene frequencies remain in linkage equilibrium, then only n equations are needed for the marginal frequencies. Thus the crucial question is: does the optimization process get worse because of this simplification? The answer is no. We provide evidence for this statement by citing a theorem from [Müh97]. It shows that the univariate marginal frequencies are the same for all recombination schemes if applied to the same distribution p(x, t).

Theorem 2.2. For any complete recombination/crossover scheme used after proportionate selection the univariate marginal frequencies are determined by

    p_i(x_i, t+1) = Σ_{x | X_i = x_i} p(x, t) f(x) / f̄(t).    (2.8)

Proof: After selection the univariate marginal frequencies are given by

    p_i^s(x_i, t) = Σ_{x | X_i = x_i} p^s(x, t) = Σ_{x | X_i = x_i} p(x, t) f(x) / f̄(t).

Now the selected individuals are randomly paired. Therefore p_i(x_i, t+1) = p_i^s(x_i, t).

2.5 Schema Analysis Demystified

Theorem 2.2 can be formulated in the terms of Holland's schema theory [Hol92]. Let H_i(x_i) = (∗, ..., ∗, x_i, ∗, ..., ∗) be a first-order schema at locus i. This schema includes all strings where the gene at locus i is fixed to x_i. The univariate marginal frequency p_i(x_i, t) is obviously identical to the frequency of schema H_i(x_i). The fitness of the schema at generation t is given by

    f(H_i(x_i), t) = (1 / p_i(x_i, t)) Σ_{x | X_i = x_i} p(x, t) f(x).    (2.9)

From Theorem 2.2 we obtain:

Corollary 2.1 (First-order schema theorem). For a genetic algorithm with proportionate selection using any complete recombination the frequency of first-order schemata changes according to

    p_i(x_i, t+1) = p_i(x_i, t) · f(H_i(x_i), t) / f̄(t).    (2.10)

We now extend the analysis to general schemata.

Definition 2.5. Let s = (i_1, ..., i_k) with 1 ≤ i_1 < ... < i_k ≤ n. Then x_s denotes the subvector of x defined by the indices in s. The probability of the schema H(s) obtained by fixing x_s is defined by

    p(H(s), t) = Σ_{x ∈ H(s)} p(x, t).    (2.11)

The summation is done with the values of x_s fixed. Thus the probability of a schema is just the corresponding marginal distribution p(x_s, t). If s consists of a single index i only, we have a univariate marginal distribution.

SGA uses fitness proportionate selection, i.e. the probability of x being selected is given by

    p^s(x, t) = p(x, t) f(x) / f̄(t),    (2.12)

where f̄(t) = Σ_x p(x, t) f(x) is the average fitness of the population. Let us now assume that we have an algorithm which generates new points according to the distribution of the selected points, i.e.

    p(x, t+1) = p(x, t) f(x) / f̄(t).    (2.13)

We will later address the problem which probability distribution SGA really generates.

Definition 2.6. The fitness of schema H(s) at generation t is defined by

    f(H(s), t) = (1 / p(H(s), t)) Σ_{x ∈ H(s)} p(x, t) f(x).    (2.14)

Theorem 2.3 (Schema Theorem). The probability of schema H(s) evolves as

    p(H(s), t+1) = p(H(s), t) · f(H(s), t) / f̄(t).    (2.15)

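Theorem 2.2 and Corollary 2.1 can be checked mechanically: the marginal obtained by summing the selected distribution equals p_i(x_i,t) · f(H_i(x_i),t)/f̄(t). A small sketch (ours; the fitness function and starting distribution are arbitrary choices):

```python
from itertools import product

space = list(product((0, 1), repeat=3))
f = {x: 1 + x[0] + 2 * x[1] * x[2] for x in space}            # arbitrary fitness
weights = {(0, 0, 0): 0.3, (0, 1, 1): 0.2, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
p = {x: weights.get(x, 0.0) for x in space}                   # p(x, t)
favg = sum(p[x] * f[x] for x in space)

def marginal_next(i, xi):
    """Theorem 2.2, eq. (2.8): p_i(x_i, t+1)."""
    return sum(p[x] * f[x] for x in space if x[i] == xi) / favg

def schema_fitness(i, xi):
    """Eq. (2.9): average fitness of the first-order schema H_i(x_i)."""
    pi = sum(p[x] for x in space if x[i] == xi)
    return sum(p[x] * f[x] for x in space if x[i] == xi) / pi

# Corollary 2.1: p_i(x_i,t+1) = p_i(x_i,t) * f(H_i(x_i),t) / favg
errors = [
    abs(marginal_next(i, xi)
        - sum(p[x] for x in space if x[i] == xi) * schema_fitness(i, xi) / favg)
    for i in range(3) for xi in (0, 1)
]
```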
Holland ([Hol92], Theorem 6.2.3) computed for SGA an inequality

    p(H(s), t+1) ≥ (1 − δ) · p(H(s), t) · f(H(s), t) / f̄(t),    (2.16)

where δ is a small factor. The inequality complicates the application. But a careful investigation of Holland's analysis [Hol92] reveals that the application is much simpler if equation (2.14) is used instead of the inequality. Thus equation (2.14) is obviously the ideal starting point for Holland's schema analysis.

2.6 Schema Analysis Folklore

The mathematical difficulty of using the inequality (2.16) to estimate the distribution of schemata lies in the fact that the fitness f(H(s), t) of a schema depends on p(x, t), i.e. the distribution of the genotypes in the population. This is a defining fact of Darwinian natural selection: the fitness is always relative to the current population. To cite a proverb: the one-eyed is king among the blind.

Thus an application of the inequality (2.16) is not possible without computing p(x, t). Goldberg [Gol89] circumvented this problem by assuming

    f(H(s), t) = (1 + c) · f̄(t),  c > 0.    (2.17)

With this assumption we estimate p(H(s), t) ≈ (1 + c)^t · p(H(s), 0). But the assumption can never be fulfilled for all t: when approaching an optimum, the fitness of all schemata in the population will be only ±ε away from the average fitness. Here proportionate selection gets into difficulties.

The typical folklore which arose from the schema analysis is nicely summarized by Ballard. He is not biased towards or against genetic algorithms; he just cites the commonly used arguments ([Bal97], p. 270):

• Short schemata have a high probability of surviving the genetic operations.
• Focusing on short schemata that compete shows that, over the short run, the fittest are increasing at an exponential rate.
• Ergo, if all of the assumptions hold (we cannot tell whether they do, but we suspect they do), GAs are optimal.

We will not investigate the optimality argument, because we will show that the basic conclusion of exponentially increasing schemata does not hold.

2.7 Correct Schema Analysis

It turns out that equation (2.13) for proportionate selection admits an analytical solution.

Theorem 2.4 (Convergence). The distribution p(x, t) for proportionate selection is given by

    p(x, t) = p(x, 0) f(x)^t / Σ_y p(y, 0) f(y)^t.    (2.18)

Let M be the set of global optima; then

    lim_{t→∞} p(x, t) = 1/|M| if x ∈ M, and 0 else.    (2.19)

Proof: The proof is by induction. The assumption is fulfilled for t = 1. Then

    p(x, t+1) = p(x, t) f(x) / f̄(t)
              = [p(x, 0) f(x)^t / Σ_y p(y, 0) f(y)^t] · f(x) · [Σ_y p(y, 0) f(y)^t / Σ_y p(y, 0) f(y)^{t+1}]
              = p(x, 0) f(x)^{t+1} / Σ_y p(y, 0) f(y)^{t+1}.

Let x_max ∈ M and f_max = f(x_max). Then for x ∉ M

    p(x, t) / p(x_max, t) = [p(x, 0) / p(x_max, 0)] · (f(x) / f_max)^t → 0.

QED.

This shows that our conceptual algorithm is ideal in the sense that it converges to the set of global optima.

2.8 Application of the Schema Equation

By using equation (2.18) we can make a correct schema analysis: we simply compute the probabilities of all schemata. We discuss the interesting case of a deceptive function. We take the 3-bit deceptive function defined by

    f(x) = 0.9 − 0.1 (x_1 + x_2 + x_3) − 0.7 (x_1 x_2 + x_2 x_3 + x_1 x_3) + 2.5 x_1 x_2 x_3,

so that strings with 0, 1, 2 and 3 one-bits have fitness 0.9, 0.8, 0.0 and 1.0, respectively. The function is called deceptive because the global optimum (1, 1, 1) is isolated, whereas the local optimum (0, 0, 0) is surrounded by strings of high fitness. We now look at the behavior of some schemata.

Definition 2.7. A schema is called optimal if its defining string is contained in an optimal string.

In our example H(x_1 = 1) and H(x_1 = x_2 = 1) are optimal schemata. They are displayed in Figure 1. We see that the probability of the optimal schema H(x_1 = 1) decreases for about 8 generations, then it increases fairly slowly. This behavior is contrary to the simple interpretation of the evolution of schemata. Schema H(x_1 = x_2 = 1) decreases even dramatically at the first generation. Then its probability is almost identical to the probability of the optimum (1, 1, 1).
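The closed form (2.18) makes the schema curves of Figure 1 reproducible in a few lines. The sketch below (ours) starts from a uniform initial distribution, so the exact generation counts differ from the figure, but the qualitative behavior is the same: the optimal first-order schema loses probability at first, and the distribution finally concentrates on the global optimum (1,1,1):

```python
from itertools import product

def deceptive3(x):
    # 3-bit deceptive function of Section 2.8: values 0.9, 0.8, 0.0, 1.0
    # for strings with 0, 1, 2 and 3 one-bits respectively
    return [0.9, 0.8, 0.0, 1.0][sum(x)]

space = list(product((0, 1), repeat=3))
p0 = {x: 1 / 8 for x in space}            # uniform initial distribution

def p_t(t):
    """Eq. (2.18): p(x,t) = p(x,0) f(x)^t / sum_y p(y,0) f(y)^t."""
    z = sum(p0[y] * deceptive3(y) ** t for y in space)
    return {x: p0[x] * deceptive3(x) ** t / z for x in space}

def schema_prob(dist):
    return sum(q for x, q in dist.items() if x[0] == 1)   # p(H(X1=1), t)

dips = schema_prob(p_t(1)) < schema_prob(p_t(0))   # optimal schema decreases first
limit = p_t(100)[(1, 1, 1)]                        # eq. (2.19): mass on the optimum
```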

[Figure 1: Evolution of some schemata for the 3-bit deceptive function. Plotted over t = 0, ..., 20 are p(H(X_1=1),t), p(H(X_1=0),t), p(H(X_1=X_2=1),t) and p((1,1,1),t).]

Remark: Even the probability of optimal schemata can decrease for a long time. Whether the optimal schemata increase depends on the probability distribution. Any argument about an exponential increase of optimal schemata is mathematically wrong.

3 The Univariate Marginal Distribution Algorithm

The univariate marginal distribution algorithm UMDA generates new points according to p(x, t+1) = ∏_{i=1}^n p_i^s(x_i, t). Thus UMDA keeps the gene frequencies in linkage equilibrium. This makes a mathematical analysis possible. We derive a difference equation for proportionate selection. This equation has already been proposed by Sewall Wright in 1937 [Wri70]. Wright's equation shows that UMDA is trying to solve a continuous optimization problem. The continuous function to be optimized is the average fitness W(p) of the population; the variables are the univariate marginal distributions. In a fundamental theorem we show the relation between the attractors of the continuous problem and the local optima of the fitness function f(x).

3.1 Definition of UMDA

Instead of performing recombination a number of times in order to converge to linkage equilibrium, one can achieve this in one step by gene pool recombination [MV96]. In gene pool recombination a new string is computed by randomly taking for each locus i a gene from the distribution of the selected parents. This means that gene x_i occurs with probability p_i^s(x_i, t) in the next population, where p_i^s(x_i, t) is the distribution of x_i in the selected parents. New strings x are generated according to the distribution

    p(x, t+1) = ∏_{i=1}^n p_i^s(x_i, t).    (3.1)

One can simplify the algorithm still more by directly computing the univariate marginal frequencies from the data. Then equation (3.1) can be used to generate new strings. This method is used by UMDA.

UMDA
• STEP 0: Set t ⇐ 1. Generate N points randomly.
• STEP 1: Select M ≤ N points according to a selection method. Compute the marginal frequencies p_i^s(x_i, t) of the selected set.
• STEP 2: Generate N new points according to the distribution p(x, t+1) = ∏_{i=1}^n p_i^s(x_i, t). Set t ⇐ t + 1.
• STEP 3: If termination criteria are not met, go to STEP 1.

For proportionate selection we need the average fitness f̄(t) of the population. We consider f̄(t) as a function which depends on the marginal frequencies p_i(x_i, t). To emphasize this dependency we write

    W(p_1(0), p_1(1), ..., p_n(0), p_n(1)) = f̄(t).    (3.2)

W formally depends on 2n parameters. p_i(x_i = 1) and p_i(x_i = 0) are considered as two independent parameters despite the constraint p_i(x_i = 0) = 1 − p_i(x_i = 1). We abbreviate p_i := p_i(x_i = 1). If we insert 1 − p_i for p_i(x_i = 0), we obtain W(p_1, ..., p_n), which depends on n parameters. Now we can formulate the main theorem.

Theorem 3.1. For infinite populations and proportionate selection the difference equations for the gene frequencies used by UMDA are given by

    p_i(x_i, t+1) = p_i(x_i, t) · f_i(x_i, t) / f̄(t),    (3.3)

where f_i(x_i, t) = Σ_{x | X_i = x_i} f(x) ∏_{j ≠ i} p_j(x_j, t). The equation can also be written as

    p_i(t+1) = p_i(t) + p_i(t)(1 − p_i(t)) · (1 / f̄(t)) · ∂W/∂p_i.    (3.4)

The response is approximately given by

    R(t) ≈ V_A(t) / f̄(t) + higher-order terms,    (3.5)

where

    V_A(t) = Σ_i [ p_i(1, t)(f_i(1, t) − f̄(t))² + p_i(0, t)(f_i(0, t) − f̄(t))² ]
           = Σ_i p_i(t)(1 − p_i(t)) (∂W/∂p_i)²    (3.6)

is called the additive genetic variance. Furthermore the average fitness never decreases.

Proof: Equation (3.3) has been proven in [Müh97]. We have to prove equation (3.4). Note that

    p_i(t+1) − p_i(t) = p_i(t) · (f_i(1, t) − f̄(t)) / f̄(t).

From f̄(t) = p_i(t) f_i(1, t) + (1 − p_i(t)) f_i(0, t) we obtain, after some tedious manipulations,

    ∂W/∂p_i = f_i(1, t) − f_i(0, t)

and therefore

    f_i(1, t) − f̄(t) = (1 − p_i(t)) (f_i(1, t) − f_i(0, t)) = (1 − p_i(t)) ∂W/∂p_i.

Inserting this into the difference equation gives equation (3.4). Equation (3.5) is just the beginning of a multi-dimensional Taylor expansion. The first term follows from

    Σ_i ∂W/∂p_i · (p_i(t+1) − p_i(t)) = (1 / f̄(t)) Σ_i p_i(t)(1 − p_i(t)) (∂W/∂p_i)² = V_A(t) / f̄(t).

The above equations completely describe the dynamics of UMDA with proportionate selection. Mathematically UMDA performs gradient ascent in the landscape defined by W(p).

Equation (3.4) is especially suited for the theoretical analysis. It is called Wright's equation because it has been proposed by Wright in 1937. Wright's [Wri70] remarks are still valid today:

    "The appearance of this formula is deceptively simple. Its use in conjunction with other components is not such a gross oversimplification in principle as has sometimes been alleged ... Obviously calculations can be made only from rather simple models, involving only a few loci or simple patterns of interaction among many similarly behaving loci. ... Apart from application to simple systems, the greatest significance of the general formula is that its form brings out properties of systems that would not be apparent otherwise."

The restricted applicability lies in the following fact: in general the difference equations need the evaluation of 2^n terms. The computational complexity can be drastically reduced if the fitness function has a special form.

Example 3.1. OneMax: f(x) = Σ_{i=1}^n x_i. Obviously we have

    W(p) = Σ_i p_i(1),  ∂W/∂p_i(1) = 1.

This gives the difference equation

    Δp_i(1) = p_i(1)(1 − p_i(1)) / Σ_j p_j(1),    (3.7)

which is nothing else than Wright's equation. This equation has been approximately solved in [MM99a].

This example shows that the expressions for W and its derivatives can be surprisingly simple. W(p) can be obtained from f(x) by exchanging x_i with p_i(1). But the formal derivative of W(p) cannot be obtained from the simplified expression. We will investigate the computation of W and its gradient in the following section.

3.2 Computing the Average Fitness

Wright is also the originator of the landscape metaphor now popular in evolutionary computation and population genetics. Unfortunately Wright used two quite different definitions for the landscape, apparently without realizing the fundamental distinction between them. The first landscape describes the relation between the genotypes and their fitness, while the second describes the relation between the allele frequencies in a population and its fitness. The first definition is just the fitness function used in evolutionary computation, the second one is the average fitness W(p). The second definition is much more useful, because it lends itself to a quantitative description of the evolutionary process, i.e. Wright's equation.

For notational simplicity we only derive the relation between f(x) and W(p) for binary alleles. Let α = (α_1, ..., α_n) with α_i ∈ {0, 1} be a multi-index. We define, with 0^0 := 1,

    x^α := ∏_i x_i^{α_i}.

Lemma 3.1. W(p) is an extension of f(x) to the unit cube [0, 1]^n. There exist two representations for W(p). These are given by

    W(p) = f(0, ..., 0)(1 − p_1) ⋯ (1 − p_n) + ⋯ + f(1, ..., 1) p_1 ⋯ p_n
         = Σ_α a_α p^α,

where the a_α are the coefficients of the expansion f(x) = Σ_α a_α x^α.

The above lemma can rigorously be proven by Möbius inversion. If the order of the function (the largest number of variables interacting in a term of the expansion) is bounded by a constant independent of n, W(p) can be computed in polynomial time. The expansion can also be used to compute the gradient of W, which is needed for Wright's equation. It is given by

    ∂W/∂p_i = Σ_{α | α_i = 1} a_α ∏_{j ≠ i} p_j^{α_j}.    (3.8)

We will now characterize the attractors of UMDA. Let [0, 1]^n denote the unit cube; its corners are the points p with p_i ∈ {0, 1}.

Theorem 3.2. The stable attractors of Wright's equation are at the corners of [0, 1]^n, i.e. p_i ∈ {0, 1}. In the interior there are only saddle points or local minima, where grad W(p) = 0. The attractors are local maxima of f(x) according to one-bit changes. Wright's equation solves the continuous optimization problem argmax W(p) in [0, 1]^n by gradient ascent.

Proof: W is linear in each p_i, therefore it cannot have any local maxima in the interior. Points with grad W(p) = 0 are unstable fixpoints of UMDA. We next show that boundary points which are not local maxima of f(x) cannot be attractors. We prove the conjecture indirectly. Without loss of generality, let the boundary point be p = (1, ..., 1). We now consider an arbitrary neighbor, i.e. x* = (0, 1, ..., 1). The two points are connected at the boundary by the line p(λ) = (1 − λ, 1, ..., 1) with λ ∈ [0, 1]. We know that W is linear in the parameters p_i. Because W(p(0)) = f(1, ..., 1) and W(p(1)) = f(0, 1, ..., 1), we have

    W(p(λ)) = f(1, ..., 1) + λ · (f(0, 1, ..., 1) − f(1, ..., 1)).    (3.9)

If f(0, 1, ..., 1) > f(1, ..., 1), then (1, ..., 1) cannot be an attractor of UMDA: the mean fitness increases with λ.

The extension of the above lemma to multiple alleles and multivariate distributions is straightforward, but the notation becomes difficult.

4 The Science of Breeding

Fitness proportionate selection is the undisputed selection method in population genetics. It is considered to be a model for natural selection. But for proportionate selection the following problem arises: when the population approaches an optimum, selection gets weaker and weaker, because the fitness values become similar. Therefore breeders of livestock use other selection methods. These are called artificial selection. For large populations they mainly apply truncation selection. It works as follows: a truncation threshold τ is fixed. Then the τ·N best individuals are selected as parents for the next generation. These parents are then randomly mated.

The science of breeding is the domain of quantitative genetics. The theory is based on macroscopic variables. Because an exact mathematical analysis is impossible, many statistical techniques are used. In fact, the concepts of regression, correlation, heritability and decomposition of variance have been developed and applied in quantitative genetics for the first time.

4.1 Single Trait Theory

For a single trait the theory can be easily summarized. Starting with the fitness distribution, the selection differential S(t) is introduced. It is the difference between the average of the selected parents and the average of the population:

    S(t) = f̄^s(t) − f̄(t).    (4.1)

Similarly the response R(t) is defined by

    R(t) = f̄(t+1) − f̄(t).    (4.2)

Next a linear regression is done:

    R(t) = b(t) · S(t).    (4.3)

b(t) is called realized heritability. The most difficult part of applying the theory is to predict b(t). The first estimate uses the regression of offspring to parent. Let f_i, f_j be the phenotypic values of parents i and j; then f̃_{ij} = (f_i + f_j)/2 is called the mid-parent value. Let the stochastic variable F̃ denote the mid-parent value.

Theorem 4.1. Let (f_1, ..., f_N) be the population at generation t, where f_i denotes the phenotypic value of individual i. Assume that an offspring generation is created by random mating, without selection. If the regression equation

    f_{ij} = f̄(t) + b_{PO}(t) · (f̃_{ij} − f̄(t)) + ε_{ij}    (4.4)

with E(ε_{ij}) = 0 is valid, where f_{ij} is the fitness value of the offspring of i and j, then

    b(t) = b_{PO}(t).    (4.5)

Proof: From the regression equation we obtain for the expected averages

    f̄_O(t) = f̄(t) + b_{PO}(t) · (E(F̃) − f̄(t)).

Because the offspring generation is created by random mating without selection, the expected average fitness remains constant, E(f̄(t+1)) = f̄(t). Let us now select a subset as parents. The parents will be randomly mated, producing the offspring generation. If the subset is large enough, we may still use the regression equation and obtain for the averages

    f̄(t+1) = f̄(t) + b_{PO}(t) · (f̄^s(t) − f̄(t)) = f̄(t) + b_{PO}(t) · S(t).

Here f̄(t+1) is the average fitness of the offspring generation produced by the selected parents. Subtracting the above equations we obtain R(t) = b_{PO}(t) · S(t). This proves b(t) = b_{PO}(t).

The importance of regression for estimating the heritability was discovered by Galton and Pearson at the end of the 19th century. They computed the regression coefficient rather intuitively by scatter diagrams of mid-parent and offspring. The problem of computing a good regression coefficient is mathematically solved by the theorem of Gauss and Markov. We just cite the theorem; the proof can be found in any textbook on statistics [Rao73].

Theorem 4.2. A good estimate for the regression coefficient of mid-parent and offspring is given by

    b_{PO}(t) = cov(F̃(t), F_O(t)) / var(F̃(t)).    (4.6)

The covariance of two stochastic variables X and Y is defined by

    cov(X, Y) = E[(X − E(X)) · (Y − E(Y))],

where E denotes the average and var the variance. Closely related to the regression coefficient is the correlation coefficient r(X, Y), given by

    cov(X, Y) = r(X, Y) · (var(X) · var(Y))^{1/2}.

The concept of covariance is restricted to parents producing offspring. It cannot be used for UMDA. Here the analysis of variance helps. We will decompose the fitness value f(x) recursively into an additive part and interaction parts. We recall the definition of conditional probability.

Definition 4.1. Let p(x) denote the probability of x. Then the conditional probability of x given y is defined by

    p(x | y) = p(x, y) / p(y).    (4.7)

The proof of the next theorem can be found in [AM94a].

Theorem 4.3. Let the population be in linkage equilibrium, i.e.

    p(x) = ∏_{i=1}^n p_i(x_i).    (4.8)

Then the variance of the population is given by

    σ² = V_1 + V_2 + ⋯ + V_n,    (4.9)

where V_k denotes the variance contributed by the interactions of order k. The covariance of mid-parent and offspring can be computed from

    cov(F̃, F_O) = (1/2) V_1 + (1/4) V_2 + ⋯ + (1/2^n) V_n.    (4.10)

We now compare the estimates for heritability. For proportionate selection we have from Theorem 3.1

    R_{UMDA}(t) = V_A(t) / f̄(t) + higher-order terms.

For two-parent recombination (TPR), Mühlenbein (1997) has shown for 2 loci

    R_{TPR}(t) = 2 · cov(F̃(t), F_O(t)) / f̄(t) + higher-order terms.

If the population is in linkage equilibrium we can use the covariance decomposition (4.10) and write

    R_{TPR}(t) = V_1(t) / f̄(t) + V_2(t) / (2 f̄(t)) + higher-order terms.

Thus the first term of the expansion is identical to the UMDA term (V_1 = V_A). This shows again the similarity between two-parent recombination and the UMDA method. Breeders usually use the expression b(t) ≈ V_A(t)/σ²(t) as an estimate. It is called heritability in the narrow sense [Fal81]. But note that the variance decomposition seems to be true only for Robbins' proportions.

The selection differential is not suited for mathematical analysis. For truncation selection it can be approximated by

    S(t) ≈ I · σ(t),    (4.11)

where I is called the selection intensity. Combining the two equations we obtain the famous equation for the response to selection:

    R(t) = I · b(t) · σ(t).    (4.12)

These equations are discussed in depth in [Müh97]. The theory of breeding uses macroscopic variables, the average and the variance of the population. But we have derived only one equation, the response to selection equation. We need a second equation connecting the average fitness and the variance in order to be able to compute the time evolution of both. There have been many attempts in population genetics to find a second equation. But all proposed equations assume that the variance of the population continuously decreases. This is not the case for arbitrary fitness functions. Recently Prügel-Bennett and Shapiro [PBS97] have independently proposed to use moments for describing genetic algorithms. They apply methods of statistical physics to derive equations for higher moments for special fitness functions.

Results for tournament selection and analytical solutions for linear functions can be found in [MM00]. We next present numerical results for some popular fitness functions.

Results for tournament selection and analytical solutions 10 for linear functions can be found in [MM00]. We next present numerical results for some popular fitness 5 functions. 0 0 0.2 0.4 0.6 0.8 1 4.2 Numerical Results for UMDA p This section solves the problem put forward by Mitchell et al. Figure 3: Results with normal and strong selection. [MHF94]: to understand the class of problems for which ge-

netic algorithms are most suited, and in particular, for which



´Ôµ

they will outperform other search algorithm. The famous stops at the local maximum for . It is approximately

¼ ¼¼½

Royal-Road function is analyzed in [MM00]. .For is able to converge to the ½ optimum . It does so by even going downhill!

35 This example confirms in a nutshell our theory.

Saw(36,4,0.85)

´Üµ transforms the original fitness landscape defined by into

30 

´Ôµ

a fitness landscape defined by . This transformation

´Üµ 25 smoothes the rugged fitness landscape . might find the global optimum, if there is a tendency towards the 20 global optimum. This example shows that UMDA can solve difficult multi- Value 15 modal optimization problems. It is obvious that any search

10 method using a single search point like the ´½·½µ-algorithm needs an almost exponential number of function evaluations. 5 We next show how the science of breeding can be used for

0 controlling . 0 5 10 15 20 25 30 35 Number of 1-Bits 4.3 Numerical Investigations of the Science of Breeding

Figure 2: Definition of Saw(36,4,0.85) The application of the science of breeding needs the compu-

´µ ´µ

tation of the average fitness , the variance and the

´µ

Equation 3.4 shows that UMDA performs a gradient as- additive genetic variance . The first two terms are stan-

´ µ  dard statistical terms. The computation of needs 

cent in the landscape given by . This helps our search for

´ µ  and  . The computation of the first term only poses some functions best suited for UMDA. We take the landscape

as a spectacular example. The definition of the function can difficulties. It can be approximated by

µ

be extrapolated from Figure 2. In Saw´ , denotes

Æ



 

µ ½ ´ ´Üµ



¾

the number of bits and the from one peak to the

´ ½µ ´Üµ 

 

´ ½µ ´ µ

   

 ½

next. The highest peak is multiplied by (with ), the

Ü ½

 ½

¾ ¿

second highest by , then and so on. The landscape is (4.13)



Ü ½ very rugged. In order to get from one local optimum to an-

are those values in the population which contain  . 

other one, one has to cross a deep valley. Linear functions are the ideal case for the theory. The her-

´Ôµ

´µ

But again the transformed landscape is fairly itability is 1 and the additive genetic variance is identical

´Üµ

smooth. An example is shown in Figure 3. Whereas to the variance. We skip this trivial case and start with a mul-

É

½ Ü

´Ôµ

´Üµ  ´½ µ

has 5 isolated peaks, has three plateaus, a local peak tiplicative fitness function . For a



´µ ´µ and the global peak. We will use with truncation multiplicative function we also have . selection. We have not been able to derive precise analytical Figure 4 confirms the theoretical results from Section 2

expressions. In Figure 3 the results are displayed. (VA and Var are multiplied by 10 in this figure). Additive

 ¼¼

In the simulation two truncation thresholds, and genetic variance is almost identical to the variance and the

 ¼¼½  ¼¼ , have been used. For the probability heritability is 1. The function is highly nonlinear of order , 1.2 good candidate for a search distribution used for optimiza- tion. The problem lies in the efficient computation of the 1 Boltzmann distribution. The theory presented in this section unifies simulated annealing and population based algorithms 0.8 with the general theory of estimating distributions. x 0.6 5.1 Boltzmann selection 0.4 Avg. f R(t)/S(t) Closely related to the Boltzmann distribution is Boltzmann VA 0.2 Var selection. An early study about this selection method can be found in [dlMT93]. 0 0 1 2 3 4 5 6 7 Definition 5.2. Given a distribution and a selection param-

Gen eter , Boltzmann selection calculates the distribution of the selected points according to

Figure 4: Heritability and Variance for a multiplicative func-

­ ´Üµ

´Üµ

 ¼½  ¿¾

tion (variance and VA multiplied by 10); , ×

È

´Üµ

(5.2)

­ ´Ý µ

´Ý µ Ý

This allows us to define the (Boltzmann Estimated but nevertheless it is easy to optimize. The function has also Distribution Algorithm). been investigated by Rattray and Shapiro [RS99]. But their calculations are very difficult. BEDA – Boltzmann Estimated Distribution Algorithm
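The reweighting of Definition 5.2 can be sketched in a few lines over an explicitly enumerated distribution (illustrative only; function and variable names are ours):

```python
import math

def boltzmann_select(p, f, gamma):
    """Reweight a distribution p by exp(gamma * f) and renormalize (Equation 5.2)."""
    w = [pi * math.exp(gamma * fi) for pi, fi in zip(p, f)]
    z = sum(w)
    return [wi / z for wi in w]

# Uniform start over four points with fitness values 0..3.
p0 = [0.25, 0.25, 0.25, 0.25]
fit = [0.0, 1.0, 2.0, 3.0]
ps = boltzmann_select(p0, fit, gamma=1.0)
```

Note that composing two selections adds the exponents, which is the mechanism behind the accumulated inverse temperature in the convergence theorem below.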

We have shown that UMDA can optimize difficult multi-modal functions, thus explaining the success of genetic algorithms in optimization. We have also shown that UMDA can easily be deceived by simple functions called deceptive functions. These functions need more complex search distributions. This problem is investigated next.

5 Graphical Models and Optimization

The simple product distribution of UMDA cannot capture dependencies between variables. If these dependencies are necessary to find the global optimum, UMDA and simple genetic algorithms fail. We take an extreme case as example, the needle in a haystack problem. The fitness function is everywhere one, except for a single x where it is 10. All n values have to be set in the right order to obtain the optimum. Of course, there exists no clever search method for this problem. But there is a path of increasing complexity from simple functions to the needle in a haystack. For complex problems we need a complex search distribution. A good candidate for a search distribution for optimization is the Boltzmann distribution.

Definition 5.1. For a weight function g(x) ≥ 0 define the weighted Boltzmann distribution of a function f(x) as

    p_{beta,g}(x) = g(x) e^{beta·f(x)} / Σ_y g(y) e^{beta·f(y)}    (5.1)

where Z_g(beta) = Σ_y g(y) e^{beta·f(y)} is the partition function. To simplify the notation, g and/or beta can be omitted. p_0 is the distribution for beta = 0.

The Boltzmann distribution concentrates the search around good fitness values.

BEDA – Boltzmann Estimated Distribution Algorithm

• STEP 0: t ← 0. Generate N points according to the uniform distribution p(x, 0) with beta(0) = 0.

• STEP 1: With a given Δbeta(t) > 0, let

    p^s(x, t) = p(x, t) e^{Δbeta(t)·f(x)} / Σ_y p(y, t) e^{Δbeta(t)·f(y)}

• STEP 2: Generate N new points according to the distribution p(x, t+1) = p^s(x, t).

• STEP 3: t ← t + 1.

• STEP 4: If stopping criterion not met, go to STEP 1.

BEDA is a conceptional algorithm, because the calculation of the distribution requires computing a sum with exponentially many terms. The following convergence theorem is easily proven.

Theorem 5.1 (Convergence). Let Δbeta(t) ≥ 0 be an annealing schedule, i.e. for every t increase the inverse temperature by Δbeta(t). Then for t ≥ 1 the distribution at time t is given by

    p(x, t) = p_0(x) e^{beta(t)·f(x)} / Σ_y p_0(y) e^{beta(t)·f(y)}    (5.3)

with the inverse temperature

    beta(t) = Σ_{τ=0}^{t−1} Δbeta(τ)    (5.4)

Let M be the set of global optima. If beta(t) → ∞, then

    lim_{t→∞} p(x, t) = 1/|M| if x ∈ M, 0 else    (5.5)
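Because BEDA is conceptional, it can be executed exactly only for tiny n by enumerating all 2^n strings. The following sketch does so for a small needle-in-a-haystack instance (names and instance are ours):

```python
import itertools
import math

def beda(fitness, n, dbeta, steps):
    """Conceptional BEDA: exact Boltzmann reweighting over all 2^n strings."""
    xs = list(itertools.product([0, 1], repeat=n))
    p = {x: 1.0 / 2 ** n for x in xs}                 # STEP 0: uniform, beta(0) = 0
    for _ in range(steps):                            # STEPs 1-3
        w = {x: p[x] * math.exp(dbeta * fitness(x)) for x in xs}
        z = sum(w.values())
        p = {x: w[x] / z for x in xs}
    return p

# Needle in a haystack: fitness 1 everywhere except one string with fitness 10.
needle = (1, 0, 1, 1)

def fitness(x):
    return 10.0 if x == needle else 1.0

p = beda(fitness, n=4, dbeta=0.5, steps=20)
```

With beta(t) = 10 after 20 steps, essentially all probability mass sits on the needle, as Theorem 5.1 predicts.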

Proof: Let m ∈ M be a point with optimal fitness and x ∉ M a point with f(x) < f(m). Then

    p(x, t) / p(m, t) = (p_0(x) / p_0(m)) · e^{beta(t)·(f(x) − f(m))}

As beta(t) → ∞, p(x, t)/p(m, t) converges (exponentially fast) to 0. Because f(m) = f(m') for all m, m' ∈ M, the limit distribution is the uniform distribution on the set of optima.

We next transform BEDA into a practical algorithm. This means the reduction of the parameters of the distribution and the computation of an adaptive schedule.

5.2 Factorization of the distribution

In this section we describe a method for computing a factorization of the probability, given an additive decomposition of the function.

Definition 5.3. Let s_1, ..., s_m be index sets, s_i ⊆ {1, ..., n}. Let f_{s_i} be functions depending only on the variables x_j with j ∈ s_i. These variables we denote as x_{s_i}. Then

    f(x) = Σ_{i=1}^m f_{s_i}(x_{s_i})    (5.6)

is an additive decomposition of the fitness function f.

Under the conditions of the factorization theorem stated below, the Boltzmann distribution factorizes into conditional distributions of the residuals b_i given the separators c_i (these sets are obtained from the s_i in the following definition):

    p(x) = ∏_{i=1}^m p(x_{b_i} | x_{c_i})    (5.12)

The constraint defined by Equation (5.11) is called the running intersection property [Lau96].

FDA – Factorized Distribution Algorithm

• STEP 0: Calculate b_i and c_i from the decomposition of the function.

• STEP 1: Generate an initial population with N individuals.

• STEP 2: Select M individuals using Boltzmann selection.

• STEP 3: Estimate the conditional probabilities p(x_{b_i} | x_{c_i}, t) from the selected points.

• STEP 4: Generate new points according to p(x, t+1) = ∏_{i=1}^m p(x_{b_i} | x_{c_i}, t).

• STEP 5: t ← t + 1. If stopping criterion not reached, go to STEP 2.

With the help of the factorization theorem, we can turn the conceptional algorithm BEDA into FDA, the Factorized Distribution Algorithm. As the factorized distribution is identical to the Boltzmann distribution if the conditions of the factorization theorem are fulfilled, the convergence proof of BEDA also applies to FDA.
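The sets b_i and c_i used in STEP 0 of FDA, and the running intersection property (5.11), can be computed mechanically. A sketch (helper names are ours):

```python
def factorization_sets(s):
    """Histories d_i, residuals b_i and separators c_i from the index sets s_1..s_m."""
    d, b, c = [set()], [], []
    for si in s:
        b.append(set(si) - d[-1])      # b_i: variables new in s_i
        c.append(set(si) & d[-1])      # c_i: overlap with the history
        d.append(d[-1] | set(si))
    return d[1:], b, c

def running_intersection(s):
    """Check the conditions of the factorization theorem: all b_i non-empty,
    and every separator c_i contained in some earlier s_j (Equation 5.11)."""
    d, b, c = factorization_sets(s)
    if not all(b):
        return False
    return all(any(ci <= set(sj) for sj in s[:i]) for i, ci in enumerate(c) if i > 0)

# Chain-like decomposition (Example 5.1) for n = 5: s_i = {i-1, i}.
chain = [{1, 2}, {2, 3}, {3, 4}, {4, 5}]
```

The chain satisfies the running intersection property, so the factorization theorem applies to it directly.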

Definition 5.4. Given s_1, ..., s_m, we define for i = 1, ..., m the sets d_i, b_i and c_i:

    d_i = ∪_{j=1}^i s_j,   b_i = s_i \ d_{i−1},   c_i = s_i ∩ d_{i−1}    (5.7)

We set d_0 = ∅.

In the theory of decomposable graphs, the d_i are called histories, the b_i residuals and the c_i separators [Lau96]. We recall the following definition.

Definition 5.5. The conditional probability p(x|y) is defined as

    p(x|y) = p(x, y) / p(y)    (5.8)

In [MMO99], we have shown the following theorem.

Theorem 5.2 (Factorization Theorem). Let p(x) be a Boltzmann distribution with

    p(x) = e^{beta·f(x)} / Z(beta)    (5.9)

and let f(x) = Σ_{i=1}^m f_{s_i}(x) be an additive decomposition. If

    b_i ≠ ∅ for all i = 1, ..., m;  d_m = {1, ..., n}    (5.10)

    for all i ≥ 2 there exists j < i such that c_i ⊆ s_j    (5.11)

then the Boltzmann distribution factorizes as p(x) = ∏_{i=1}^m p(x_{b_i} | x_{c_i}), Equation (5.12).

Not every additive decomposition leads to a factorization using the factorization theorem. In these cases, more sophisticated methods have to be used. But FDA can also be used with an approximate factorization. We discuss a simple example.

Example 5.1. Functions with a chain-like interaction can also be factorized:

    f(x) = Σ_{i=2}^n f_i(x_{i−1}, x_i)    (5.13)

Here the factorization is

    p(x) = p(x_1) ∏_{i=2}^n p(x_i | x_{i−1})    (5.14)

FDA can be used with any selection scheme, but then the convergence proof is no longer valid. We think that Boltzmann selection is an essential part in using the FDA. In order to obtain a practical algorithm, we still have to solve two problems: to find a good annealing schedule for Boltzmann selection and to determine a reasonable sample size (population size). These two problems will be investigated next.

5.3 A new annealing schedule for the Boltzmann distribution

Boltzmann selection needs an annealing schedule. But if we anneal too fast, the approximation of the Boltzmann distribution due to the sampling error can be very bad. For an extreme case, if the annealing step Δbeta is very large, the second generation should already consist only of the global maxima.

We can now derive an adaptive annealing schedule. The variance (and the higher moments) can be estimated from the generated points. As long as the approximation is valid, one can choose a desired increase in the average fitness and set Δbeta(t) accordingly. So we can set

    Δbeta(t) = (W(t+1) − W(t)) / M_2(beta(t))    (5.22)

where W(t+1) − W(t) is the desired increase of the average fitness and M_2 the variance.
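The chain factorization (5.14) of Example 5.1 above can be sampled variable by variable. A sketch (the conditional probability tables are invented for illustration):

```python
import random

def sample_chain(p1, cond, rng):
    """Sample x from p(x) = p(x_1) * prod_{i>=2} p(x_i | x_{i-1})  (Equation 5.14)."""
    x = [1 if rng.random() < p1 else 0]
    for table in cond:                 # table[prev] = P(x_i = 1 | x_{i-1} = prev)
        x.append(1 if rng.random() < table[x[-1]] else 0)
    return x

rng = random.Random(0)
# Strong coupling: each bit copies its predecessor with probability 0.9.
cond = [{0: 0.1, 1: 0.9}] * 3
samples = [sample_chain(0.5, cond, rng) for _ in range(2000)]
agree = sum(s[i] == s[i + 1] for s in samples for i in range(3)) / 6000.0
```

The empirical agreement rate of adjacent bits recovers the 0.9 coupling, which a product distribution over single bits could not represent.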

5.3.1 Taylor expansion of the average fitness

In order to determine an adaptive annealing schedule, we will make a Taylor expansion of the average fitness of the Boltzmann distribution.

Definition 5.6. The average fitness of a fitness function f and a distribution p is

    W(p) = Σ_x f(x) p(x)    (5.15)

For the Boltzmann distribution, we use the abbreviation W(beta) = W(p_beta).

As truncation selection has proven to be a robust and efficient selection scheme, we can try to approximate the behaviour of this method.

Definition 5.7. The standard deviation schedule (SDS) is defined by

    Δbeta(t) = c / √(M_2(beta(t)))    (5.23)

with a constant c. Note that this annealing schedule cannot be used for simulated annealing, as the estimation of the variance of the distribution requires samples that are independently drawn. But the samples generated by simulated annealing are not independent.
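The SDS rule can be estimated directly from the sampled fitness values. A minimal sketch (the constant c is the schedule parameter; names are ours):

```python
import math

def sds_delta_beta(sample_fitness, c=1.0):
    """SDS (Definition 5.7): Delta beta proportional to 1 / standard deviation."""
    n = len(sample_fitness)
    mean = sum(sample_fitness) / n
    var = sum((fv - mean) ** 2 for fv in sample_fitness) / n
    return c / math.sqrt(var)

# A widely spread sample anneals slowly, a concentrated one anneals fast.
slow = sds_delta_beta([0.0, 10.0, 20.0, 30.0])
fast = sds_delta_beta([9.0, 10.0, 10.0, 11.0])
```

This is the desired behaviour: selection pressure grows automatically as the population concentrates around good fitness values.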

Theorem 5.3. The average fitness of the Boltzmann distribution has the following expansion in gamma:

    W(beta + gamma) = W(beta) + Σ_{i=1}^∞ (gamma^i / i!) W^{(i)}(beta)    (5.16)

where the derivatives W^{(i)}(beta) can be expressed through the centred moments

    M_i(beta) = Σ_x p_beta(x) (f(x) − W(beta))^i    (5.17)

by the recursion

    M_{i+1}(beta) = M_i'(beta) + i · M_2(beta) · M_{i−1}(beta)  for i ≥ 1    (5.18)

In particular W'(beta) = M_2(beta), the variance. The moments can be calculated using the derivatives of the partition function.

Proof: The i-th derivative of the partition function obeys for i ≥ 0:

    Z^{(i)}(beta) = Σ_x f(x)^i e^{beta·f(x)}    (5.19)

Thus the moments for i ≥ 1 can be calculated as

    W_i(beta) = Z^{(i)}(beta) / Z(beta) = Σ_x f(x)^i p_beta(x)    (5.20)

and thus

    W_i'(beta) = W_{i+1}(beta) − W_1(beta) W_i(beta)    (5.21)

Direct evaluation of the derivatives of W leads to complicated expressions. The proof is rather technical by induction. We omit it here.

We next turn to the fixation problem in finite populations.

5.4 Finite Populations

In finite populations convergence of BEDA or FDA can only be probabilistic. Since UMDA is a simple FDA, it is sufficient to discuss FDA. This section is extracted from [MM99b].

Definition 5.8. Let epsilon > 0 be given. Let Conv(N) denote the probability that FDA with a population size of N converges to the optima. Then the critical population size is defined as

    N* = min { N | Conv(N) ≥ 1 − epsilon }    (5.24)

If FDA with a finite population does not converge to an optimum, then at least one gene is fixed to a wrong value. The probability of fixation is reduced if the population size is increased. We obviously have for FDA

    Conv(N_1) ≤ Conv(N_2) for N_1 ≤ N_2

The critical question is: how many sample points are necessary to reasonably approximate the distribution used by FDA? A general estimate from Vapnik [Vap98] can be a guideline. One should use a sample size which is about 20 times larger than the number of free parameters.

We discuss the problem with a special function called Int. Int(x) gives the value of the binary representation:

    Int(x) = Σ_{i=1}^n x_i 2^{n−i}    (5.25)

The fitness distribution of this function is not normally distributed. The function has 2^n different fitness values. We show the cumulative fixation probability in Table 1 for Int(16).
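The Int function of Equation (5.25) is trivial to implement; the sketch below (names ours) also verifies the claim that it takes 2^n distinct values:

```python
import itertools

def int_value(x):
    """Int(x): the value of the binary representation (Equation 5.25),
    with x[0] the most significant bit."""
    v = 0
    for bit in x:
        v = 2 * v + bit
    return v

# Int takes 2^n distinct values, so its fitness distribution is far from normal.
distinct = len({int_value(x) for x in itertools.product([0, 1], repeat=6)})
```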

    t    truncation (1)   truncation (2)   Boltzmann Δbeta=0.01   SDS
    1    0.0              0.0              0.0885                 0.0
    2    0.0025           0.0095           0.1110                 0.0
    3    0.0165           0.0205           0.1275                 0.0
    4    0.0355           0.0325           0.1375                 0.002
    5    0.0575           0.0490           0.1455                 0.002
    6    0.0695           0.0630           0.1510                 0.008
    7    0.0740           0.0715           0.1555                 0.018
    8    0.0740           0.0780           0.1565                 0.030
    9    0.0740           0.0806           0.1575                 0.036
    14                                                            0.084

Table 1: Cumulative fixation probability for Int(16). Truncation selection vs. Boltzmann selection with Δbeta = 0.01 and Boltzmann SDS; N denotes the size of the population.

The fixation probability is larger for stronger selection. For a given truncation selection the maximum fixation probability is at generation 1 for very small N. For larger values of N the fixation probability increases until a maximum is reached and then decreases again. This behaviour has been observed for many fitness distributions.

For truncation selection with tau = 0.2 we have a fixation probability of about 0.074. A larger N reduces the fixation probability. But this advantage is set off by the larger number of generations needed to converge. The problem of an optimal population size for truncation selection is investigated in [MM99b]. Boltzmann selection with Δbeta = 0.01 is still very strong for the fitness distribution given by Int(16). For N = 500 the largest fixation probability is still at the first generation. Therefore the critical population size for Boltzmann selection with Δbeta = 0.01 is very high. In comparison, the adaptive Boltzmann schedule SDS has a total fixation probability of 0.084 for a population size of N = 100. This is almost as small as truncation selection.

This example shows that Boltzmann selection in finite populations critically depends on a good annealing schedule. Normally we run FDA with truncation selection. This selection method is a good compromise. But Boltzmann selection with SDS schedule is of comparable performance.

5.5 Bayesian Networks, Population Size and Bayesian Prior

In order to derive the results of this section we will use a normalized representation of the distribution.

Theorem 5.4 (Bayesian Factorization). Each probability can be factored into

    p(x) = p(x_1) ∏_{i=2}^n p(x_i | pa_i)    (5.26)

Proof: By definition of conditional probabilities we have

    p(x) = p(x_1) ∏_{i=2}^n p(x_i | x_1, ..., x_{i−1})    (5.27)

Let pa_i ⊆ {x_1, ..., x_{i−1}}. If x_i and {x_1, ..., x_{i−1}} \ pa_i are conditionally independent given pa_i, we can simplify p(x_i | x_1, ..., x_{i−1}) = p(x_i | pa_i).

The pa_i are called the parents of variable x_i. This factorization can be represented by a directed graph. In the context of graphical models the graph and the conditional probabilities are called a Bayesian network [Jor99, Fre98]. It is obvious that the factorization used in Theorem 5.2 can be easily transformed into a Bayesian factorization.

Usually the empirical probabilities are computed by the maximum likelihood estimator. For M samples with m instances of x_i = 1 the estimate is p̃(x_i) = m/M. For m = M we obtain p̃(x_i) = 1 and for m = 0 we obtain p̃(x_i) = 0. This leads to our gene fixation problem, because both values are attractors. The fixation problem is reduced if p̃(x_i) is restricted to an interval p_min ≤ p̃(x_i) ≤ 1 − p_min. This is exactly what results from Bayesian estimation. The estimate p̃(x_i) is the expected value of the posterior distribution after applying Bayes' formula to a prior distribution and the given data.
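The Bayesian (posterior-mean) estimate discussed here, and the resulting probability of still generating a unique optimum, can be checked numerically. A sketch with the values used in the text (helper names are ours):

```python
def bayes_estimate(m, M, r=1.0):
    """Posterior-mean estimate for a binary marginal: (m + r) / (M + 2r)."""
    return (m + r) / (M + 2.0 * r)

# With population size M equal to the problem size n, the unique optimum
# is still generated with probability of roughly exp(-1).
M = n = 32
p_min = bayes_estimate(0, M)           # 1 / (M + 2) for the uniform prior r = 1
prob_opt = (1.0 - p_min) ** n
```

Even a bit that never occurs in the sample keeps probability p_min > 0, which is exactly what prevents premature fixation.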

Estimates for the necessary size of the population can also be found in [HCPGM99]. But they use a weaker performance definition. The goal is to have a certain percentage of the bits of the optimum in the final population. Furthermore their result is only valid for fitness functions which are approximately normally distributed.

For binary variables the estimate

    p̃(x_i) = (m + r) / (M + 2r)    (5.28)

is used with 0 < r ≤ 1, where r is derived from a Bayesian prior. r = 1 is the result of the uniform Bayesian prior. The larger r, the more the estimates tend towards 1/2. The reader interested in a derivation of this estimate in the context of Bayesian networks is referred to [Jor99]. The danger of fixation is thus reduced by a technique very popular in Bayesian statistics.

How can we determine an appropriate value of r for our application? The uniform prior r = 1 gives for m = 0 the value p_min = 1/(M + 2). If M is small, then p_min might be so large that we generate the optima with a very small probability only. This means we perform more a random search instead of converging to the optima. This consideration leads to a constraint: 1 − p_min should be so large that the optima are still generated with high probability. We now heuristically derive a minimal M under the assumption that the optimum is unique. To simplify the formulas we require Prob(x_opt) ≥ 1/3. This means that the optimum string x_opt should be generated with more than 30% probability at equilibrium. This is large enough to observe equilibrium and convergence. Let us first investigate the univariate factorization p(x) = ∏_i p(x_i). For r = 1 the largest probability is p_max = (M + 1)/(M + 2) = 1 − p_min. The largest probability to generate the optimum is given by

    Prob(x_opt) ≤ (1 − 1/(M + 2))^n ≈ e^{−n/(M+2)}

If M = n^{1−α} with α > 0, then Prob(x_opt) becomes arbitrarily small for large n. For M = n we obtain Prob(x_opt) ≈ e^{−1}. This gives the following guideline, which actually is a lower bound of the population size.

Rule of Thumb: For UMDA the size of the population should be at least equal to the size of the problem, if a Bayesian prior of r = 1 is used.

Bayesian priors are also defined for conditional distributions. The above heuristic derivation can also be used for general Bayesian factorizations. The Bayesian estimator for binary variables is

    p̃(x_i | pa_i) = (m(x_i, pa_i) + r) / (m(pa_i) + 2r)

where m(pa_i) is the number of occurrences of the configuration pa_i. We make the assumption that in the best case the optimum constitutes 25% of the population. This gives m(pa_i) = M/4. For r = 1 we compute as before

    Prob(x_opt) ≤ (1 − 1/(M/4 + 2))^n ≈ e^{−4n/(M+8)}

If we set M = 4n we obtain Prob(x_opt) ≈ e^{−1}. Thus we obtain a lower bound for the population size:

Rule of Thumb: For FDA using a factorization with many conditional distributions and a Bayesian prior of r = 1, the size of the population should be about four times the size of the problem.

These rules of thumb have been heuristically derived. They have to be confirmed by numerical studies. Our estimate is a crude lower bound. There exist more general estimates. We just cite Vapnik [Vap98]. In order to approximate a distribution with a reasonable accuracy, he proposes to use a sample size which is about 20 times larger than the number of free parameters of the distribution. For UMDA this gives M = 20n, i.e. 20 times our estimate.

We demonstrate the importance of using a Bayesian prior by an example. It is a deceptive function of order 4 and problem size 32. Our convergence theorem gives convergence of FDA with Boltzmann selection and an exact factorization. The exact factorization consists of marginal distributions of size 4. We compare in Figure 5 FDA with SDS Boltzmann selection and truncation selection without Bayesian prior. We also show a run with SDS Boltzmann selection and Bayesian prior.

The simulation was started at p = 0.1, i.e. near the local optimum x = 0. Nevertheless, FDA converges to the global optimum at p = 1. It is interesting to note that at first FDA moves into the direction of the local optimum. At the very last moment the direction of the curve is dramatically changed. SDS Boltzmann selection behaves almost identically to truncation selection with threshold tau = 0.3. But both methods need a huge population size in order to converge to the optimum. In this example it is N = 20000. If a prior of r = 1 is used the population size can be reduced to N = 200. With this prior the curve changes direction earlier. Because of the prior the univariate marginal probabilities never reach 0 or 1. In this example p stops slightly below 1.

Figure 5: Average fitness W for FDA for Decep(32,4); population size N = 20000 without prior and N = 200 with prior r = 1.

Let us now summarize the results: Because FDA uses finite samples of points to estimate the conditional probabilities, convergence to the optimum will depend on the size of the samples (the population size). FDA has experimentally proven to be very successful on a number of functions where standard genetic algorithms fail to find the global optimum. In [MM99b], the scaling behaviour for various test functions has been studied. If a Bayesian prior is used the influence of the population size is reduced. There is a tradeoff: if no prior is used, then convergence is fast, but a large population size might be needed; if a prior is used, the population size can be much smaller, but the number of generations until convergence increases. We have not yet enough numerical results, therefore we just conjecture: FDA with a finite population of size N, SDS Boltzmann selection, Bayesian prior, and a Bayesian factorization where the number of parents is restricted by a bound independent of n, will converge to the optimum in polynomial time with high probability.

The convergence result carries over to the constrained Boltzmann distribution introduced in the next section. Let M be the set of global optima. If beta(t) → ∞, then

    lim_{t→∞} p(x, t) = 1/|M| if x ∈ M, 0 else    (5.34)

The proof is almost identical to the proof of Theorem 5.1. The factorization theorem needs an analytical description of the function. But it is also possible to determine the factorization from the data sampled. This is described in Section 6.

5.6 Constraint Optimization Problems

An advantage of FDA compared to genetic algorithms is that it can handle optimization problems with constraints. Mendelian recombination or crossover in genetic algorithms often creates points which violate the constraints. If the structure of the constraints and the structure of the ADF are compatible, then FDA will generate only legal points.

Definition 5.9. A constraint optimization problem is defined by

    f(x) = Σ_{i=1}^m f_{s_i}(x_{s_i})    (5.29)

    C_j(x_{u_j}),  j = 1, ..., l    (5.30)

C_j stands for the j-th constraint function; the u_j are sets of variables. The constraints are locally defined. Thus they can be used to test which marginal probabilities are 0. This is technically somewhat complicated, but straightforward. For instance, if we have the constraint x_1 + x_2 ≤ 1, then obviously p(x_1 = 1, x_2 = 1) = 0. Thus the constraints are mapped to marginal distributions: if C_j(x_{u_j}) is violated then we set p(x_{u_j}) = 0.

We can now factorize p(x) as before. But we can also factorize the graph defined by the C_j(x_{u_j}). Our theory can handle the two cases: the factorization of the constraints is contained in the factorization of the function, or the factorization of the function is contained in the factorization of the constraints.

Let Ω_feas be the set of feasible solutions. Then the Boltzmann distribution on Ω_feas is defined as

    p(x) = e^{beta·f(x)} / Σ_{y ∈ Ω_feas} e^{beta·f(y)},  x ∈ Ω_feas    (5.31)

Then the following convergence theorem holds.

Theorem 5.5 (Convergence). Let 1) the initial population be feasible. Let 2) the factorization of the constraints and the factorization of the function be contained in the factorization used by FDA. Let 3) Δbeta(t) ≥ 0 be an annealing schedule. Then the distribution at time t is given by

    p(x, t) = p_0(x) e^{beta(t)·f(x)} / Σ_{y ∈ Ω_feas} p_0(y) e^{beta(t)·f(y)}    (5.32)

with the inverse temperature

    beta(t) = Σ_{τ=0}^{t−1} Δbeta(τ)    (5.33)

If beta(t) → ∞, the limit distribution is the uniform distribution on the set of global feasible optima, Equation (5.34).

6 Computing a Bayesian Network from Data

The FDA factorization is based on the decomposition of the fitness function. This has two drawbacks: first, the structure of the function has to be known. Second, for a given instance of the fitness function, the structure might not give the smallest factorization possible. In other words: complex structures are not necessarily connected to corresponding complex dependency structures for a given fitness function. The actual dependencies depend on the actual function values. This problem can be circumvented by computing the dependency structure from the data.

Computing the structure of a Bayesian network from data is called learning. Learning gives an answer to the question: given a population of selected points, what is a good Bayesian factorization fitting the data? The most difficult part of the problem is to define a quality measure, also called a scoring measure. A Bayesian network with more arcs fits the data better than one with less arcs. Therefore a scoring measure should give the best score to the minimal Bayesian network which fits the data. It is outside the scope of this paper to discuss this problem in more detail. The interested reader is referred to the two papers by Heckerman and Friedman et al. in [Jor99].

For Bayesian networks two quality measures are most frequently used: the Bayes Dirichlet (BDe) score and the Minimal Description Length (MDL) score. We concentrate on the MDL principle. This principle is motivated by universal coding. Suppose we are given a set D of instances, which we would like to store. Naturally, we would like to conserve space and save a compressed version of D. One way of compressing the data is to find a suitable model for D that the encoder can use to produce a compact version of D. In order to recover D we must also store the model used by the encoder to compress D. Thus the total description length is defined as the sum of the length of the compressed version of D and the length of the description of the model. The MDL principle postulates that the optimal model is the one that minimizes the total description length.

6.1 LFDA - Learning a Bayesian Factorization

In the context of learning Bayesian networks, the model is a Bayesian network B describing a probability distribution over the instances appearing in the data.
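The constrained Boltzmann distribution of Equation (5.31) can be written down directly for an enumerable feasible set. A minimal sketch using the toy constraint x_1 + x_2 ≤ 1 from the text (helper names are ours):

```python
import itertools
import math

def constrained_boltzmann(f, feasible, n, beta):
    """Boltzmann distribution restricted to the feasible set (Equation 5.31)."""
    xs = [x for x in itertools.product([0, 1], repeat=n) if feasible(x)]
    w = {x: math.exp(beta * f(x)) for x in xs}
    z = sum(w.values())
    return {x: w[x] / z for x in xs}

# Constraint x_1 + x_2 <= 1; fitness counts the one-bits.
p = constrained_boltzmann(sum, lambda x: x[0] + x[1] <= 1, n=3, beta=2.0)
infeasible_mass = sum(q for x, q in p.items() if x[0] == 1 and x[1] == 1)
```

Infeasible points carry zero probability by construction, which is exactly the mapping of violated constraints to vanishing marginals described above.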

Several authors have approximately computed the MDL score. Let M denote the size of the data set. Then MDL is approximately given by

    MDL(B, D) = −ld(P(B)) + M · H(B, D) + 1/2 · PA_B · ld(M)    (6.1)

with ld(x) = log_2(x). P(B) denotes the prior probability of network B, and PA_B = Σ_i 2^{|pa_i|} gives the total number of probabilities to compute. H(B, D) is defined by

    H(B, D) = − Σ_{i=1}^n Σ_{pa_i} Σ_{x_i} (m(x_i, pa_i)/M) · ld(m(x_i, pa_i)/m(pa_i))    (6.2)

where m(x_i, pa_i) denotes the number of occurrences of x_i given configuration pa_i, and m(pa_i) the number of occurrences of the configuration pa_i. If pa_i = ∅, then m(pa_i) is set to the number of occurrences of x_i in D.

The formula has an interpretation which can be easily understood. If no prior information is available, P(B) is identical for all possible networks. For minimizing, this term can be left out. 1/2 · PA_B · ld(M) is the length required to code the parameters of the model with precision 1/M. Normally one would need PA_B · ld(M) bits to encode the parameters. However, the central limit theorem says that these frequencies are roughly normally distributed with a standard deviation of order 1/√M. Hence, the bits beyond this precision are not very useful and can be left out; 1/2 · ld(M) bits per parameter suffice. M · H(B, D) has two interpretations. First, it is identical to the negative log-likelihood −ld(P(D|B)) of the maximum likelihood estimates. Thus we arrive at the following principle:

    Choose the model which maximizes ld(P(D|B)) − 1/2 · PA_B · ld(M).

The second interpretation arises from the observation that H(B, D) is the conditional entropy of the network structure B, defined by the pa_i, and the data D.

The above principle is appealing, because it has no parameter to be tuned. But the formula has been derived under many simplifications. In practice, one needs more control about the quality vs. complexity tradeoff. Therefore we use a weight factor alpha. Our measure is defined by

    BIC(B, D, alpha) = −M · H(B, D) − alpha · PA_B · ld(M)    (6.3)

This measure with alpha = 0.5 has been first derived by Schwarz [Sch78] as Bayesian Information Criterion. Therefore we abbreviate our measure as BIC(alpha).

To compute a network B* which maximizes BIC(alpha) requires a search through the space of all Bayesian networks. Such a search is more expensive than to search for the optima of the function. Therefore the following greedy algorithm has been used. k_max is the maximum number of incoming edges allowed.

• STEP 0: Start with an arc-less network.

• STEP 1: Add the arc (x_i → x_j) which gives the maximum increase of BIC(alpha), if |pa_j| ≤ k_max and adding the arc does not introduce a cycle.

• STEP 2: Stop if no arc is found.

Checking whether an arc would introduce a cycle can easily be done by maintaining for each node a list of parents and ancestors, i.e. parents of parents etc. An arc (x_i → x_j) then introduces a cycle if x_j is an ancestor of x_i.

The BOA algorithm of Pelikan [PGCP00] uses the BDe score. This measure has the following drawback: it is more sensitive to coincidental correlations implied by the data than the MDL measure. As a consequence, the BDe measure will prefer network structures with more arcs over simpler networks [Bou94].

Given the BIC score we have several options to extend FDA to LFDA, which learns a factorization. Due to limitations of space we can only show results of an algorithm which computes a Bayesian network at each generation using the above greedy algorithm with BIC(0.5). LFDA and FDA should behave fairly similar, if LFDA computes factorizations which are in probability terms very similar to the FDA factorization. FDA uses the same factorization for all generations, whereas LFDA computes a new factorization at each step which depends on the given data.

We have applied LFDA to many problems [MM99b]. The results are encouraging. The numerical results indicate that control of the weight factor alpha can substantially reduce the amount of computation. For Bayesian networks we have not yet experimented with control strategies. We have intensively studied the problem in the context of neural networks [ZOM97].
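The BIC measure (6.3) can be computed directly from counts over binary data. A sketch (data layout and helper names are ours; the example verifies that a true dependency x_0 → x_1 improves the score over the arc-less network):

```python
import math
from collections import Counter

def bic(data, parents, alpha=0.5):
    """BIC(B, D, alpha) = -M*H(B, D) - alpha*PA_B*ld(M) for binary data (Eq. 6.3).
    data: list of bit tuples; parents[i]: tuple of parent indices of x_i."""
    M = len(data)
    h = 0.0
    pa_total = 0
    for i in range(len(data[0])):
        pa = parents[i]
        pa_total += 2 ** len(pa)                 # PA_B contribution of node i
        joint = Counter((tuple(row[j] for j in pa), row[i]) for row in data)
        marg = Counter(tuple(row[j] for j in pa) for row in data)
        for (cfg, _), m in joint.items():        # conditional entropy H(B, D)
            h -= m / M * math.log2(m / marg[cfg])
    return -M * h - alpha * pa_total * math.log2(M)

# x_1 copies x_0 exactly; the arc x_0 -> x_1 should score better than no arc.
data = [(0, 0), (0, 0), (1, 1), (1, 1)] * 8
no_arc = bic(data, {0: (), 1: ()})
with_arc = bic(data, {0: (), 1: (0,)})
```

The penalty term keeps the score from always preferring denser networks: the arc is only rewarded because it removes one full bit of conditional entropy per sample.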

£ To compute a network which maximizes In this section we investigate the relation between Wright’s requires a search through the space of all Bayesian networks. equation and a popular equation called replicator equation. Such a search is more expensive than to search for the optima Replicator dynamics is a standard model in evolutionary bi- of the function. Therefore the following greedy algorithm

ology to describe the dynamics of growth and decay of a num-

has been used. ÑÜ is the maximum number of incoming

 ½ ¾ ber of species under selection. Let be a set

edges allowed.

of species,  the frequency of species in a fixed popula-

tion of size . Then the replicator equation is defined on a

È

Æ´  µ

ÑÜ

×

   ½ ¼   ½  simplex 

¯ STEP 0: Start with an arc-less network.
We discuss a few examples which are connected with our theory.

7.1 The Replicator Equation

In this section we investigate the relation between Wright's equation and a popular equation called the replicator equation. Replicator dynamics is a standard model in evolutionary biology to describe the dynamics of growth and decay of a number of species under selection. Let $\{s_1, \dots, s_N\}$ be a set of species and $q_i$ the frequency of species $s_i$ in a fixed population. The replicator equation is then defined on the simplex $S_N = \{q \mid q_i \ge 0,\ \sum_{i=1}^N q_i = 1\}$:

$$\dot q_i = q_i \big( g_i(q) - \bar g(q) \big), \qquad \bar g(q) = \sum_{j=1}^N q_j\, g_j(q), \qquad i = 1, \dots, N \tag{7.1}$$

$g_i$ gives the fitness of species $s_i$ in relation to the others. The replicator equation is discussed in detail in [HS98]. For the replicator equation a maximum principle can be shown.

Theorem 7.1. If there exists a potential $V(q)$ with $g_i = \partial V / \partial q_i$, then $\dot V(q) \ge 0$, i.e. the potential increases using the replicator dynamics.

If we want to apply the replicator equation to a binary optimization problem of size $n$, we have to set $N = 2^n$. Thus the number of species is exponential in the size of the problem, and the replicator equation can be used for small problems only.

Voigt [Voi89] had the idea to generalize the replicator equation by introducing continuous variables $0 \le p_i(x_i) \le 1$ with $\sum_{x_i} p_i(x_i) = 1$. Thus the $p_i(x_i)$ can be interpreted as univariate probabilities. Voigt [Voi89] proposed the following discrete equation.

Definition 7.1. The Discrete Diversified Replicator Equation DDRP is given by

$$p_i(x_i)(t+1) = p_i(x_i)(t) + p_i(x_i)(t)\, \frac{g_i(x_i, p) - \bar g_i(p)}{\bar g_i(p)}, \qquad \bar g_i(p) = \sum_{x_i} p_i(x_i)\, g_i(x_i, p)$$

The name Discrete Diversified Replicator Equation was not a good choice. The DDRP is more similar to Wright's equation than to the replicator equation. This is the content of the next theorem.

Theorem 7.2. If the average fitness $\bar W(p)$ is used as potential, then Wright's equation and the Discrete Diversified Replicator Equation are identical.

We recently discovered that Baum and Eagon [BE67] have proven a discrete maximum principle for certain instances of the DDRP.

Theorem 7.3 (Baum–Eagon). Let $P(p)$ be a polynomial with nonnegative coefficients, homogeneous of degree $d$ in its variables $p_i(x_i)$, with $p_i(x_i) \ge 0$ and $\sum_{x_i} p_i(x_i) = 1$. Let $p(t+1)$ be the point given by

$$p_i(x_i)(t+1) = \frac{p_i(x_i)\, \partial P / \partial p_i(x_i)}{\sum_{x_i} p_i(x_i)\, \partial P / \partial p_i(x_i)} \tag{7.2}$$

The derivatives are taken at $p(t)$. Then $P(p(t+1)) > P(p(t))$ unless $p(t+1) = p(t)$.

Equation 7.2 is exactly the DDRP with a potential $P$. Thus the DDRP could be called the Baum–Eagon equation. From the above theorem the discrete maximum principle for Wright's equation follows by setting $P = \bar W$ and $d = n$: the potential is then the average fitness, which is homogeneous of degree $n$ in the variables $p_i(x_i)$.
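The maximum principle of Theorem 7.3 can be checked numerically. The sketch below is our own minimal code, not the paper's: it applies the iteration (7.2) to the average fitness $\bar W(p)$ of the nonnegative test fitness $f(x) = \sum_i x_i + 1$ with $n = 4$ independent binary marginals, and the potential must increase at every step.

```python
# Numerical check of the Baum-Eagon iteration (7.2), our own minimal
# illustration with the test fitness f(x) = sum(x) + 1 and n = 4.
from itertools import product

n = 4

def f(x):
    return sum(x) + 1                  # nonnegative fitness (our choice)

def prob(p, x):
    """p(x) = prod_i p_i(x_i) for marginals p[i] = P(X_i = 1)."""
    r = 1.0
    for pi, xi in zip(p, x):
        r *= pi if xi else 1.0 - pi
    return r

def avg_fitness(p):
    """W(p) = sum_x f(x) p(x), a degree-n polynomial in the p_i(x_i)."""
    return sum(f(x) * prob(p, x) for x in product((0, 1), repeat=n))

def d_avg(p, i, xi):
    """dW / d p_i(x_i): sum over states with x_i fixed, factor i left out."""
    tot = 0.0
    for x in product((0, 1), repeat=n):
        if x[i] != xi:
            continue
        r = float(f(x))
        for j in range(n):
            if j != i:
                r *= p[j] if x[j] else 1.0 - p[j]
        tot += r
    return tot

def baum_eagon_step(p):
    """Equation (7.2): p_i(x_i) <- p_i(x_i) dW/dp_i(x_i) / normalizer."""
    new = []
    for i, pi in enumerate(p):
        g1, g0 = d_avg(p, i, 1), d_avg(p, i, 0)
        z = pi * g1 + (1.0 - pi) * g0   # normalizer of (7.2)
        new.append(pi * g1 / z)
    return new

p = [0.3, 0.6, 0.5, 0.2]
w = [avg_fitness(p)]
for _ in range(50):
    p = baum_eagon_step(p)
    w.append(avg_fitness(p))
# w is monotonically non-decreasing and approaches the optimum f = 5
```

For this unimodal fitness the marginals are driven towards 1, so the potential approaches the global optimum $f(1,\dots,1) = 5$, exactly as the theorem guarantees for the interior of the simplex.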

7.2 Some System Dynamics Equations for Optimization

The theorem of Baum–Eagon shows that both Wright's equation and the DDRP maximize a potential. This means that both equations can be used for maximization. But there is a problem: both equations are deterministic. For difficult optimization problems there exists a large number of attractors, each with a corresponding attractor region. If the iteration starts at a point within an attractor region, it will converge to the corresponding attractor at the boundary. But if the iteration starts at points which lie at the boundary of two or more attractor regions, i.e. on the separatrix, the iteration will be confined to the separatrix. The dynamics cannot decide for one of the attractors.

UMDA with a finite population does not have a sharp boundary between attractor regions. We model this behavior by introducing randomness. The new value $p_i'(t+1)$ is randomly chosen from the interval

$$(1 - \epsilon)\, p_i(t+1) \;\le\; p_i'(t+1) \;\le\; (1 + \epsilon)\, p_i(t+1)$$

where $p_i(t+1)$ is determined by the deterministic equation. $\epsilon \ge 0$ is a small number; for $\epsilon = 0$ we obtain the deterministic equation. In order to use the difference equation optimally, we do not allow the boundary values $p_i = 0$ or $p_i = 1$. We use $p_{min}$ and $1 - p_{min}$ instead.

A second extension concerns the determination of the solution. All dynamic equations presented use variables which can be interpreted as probabilities. Thus instead of waiting until the dynamic system converges to some boundary point, we terminate the iteration at a suitable time and generate a set of solutions: given the values $p_i(x_i)(t)$, we generate points according to the distribution $p(x) = \prod_{i=1}^n p_i(x_i)$.

We can now formulate a family of optimization algorithms based on difference equations (DIFFOPT).

DIFFOPT
• STEP 0: Set $t \Leftarrow 0$ and choose the initial values $p_i(0)$. Input $\epsilon$ and $p_{min}$.
• STEP 1: Compute $p_i(t+1)$ according to a dynamic difference equation. If $p_i(t+1) < p_{min}$ then set $p_i(t+1) = p_{min}$; if $p_i(t+1) > 1 - p_{min}$ then set $p_i(t+1) = 1 - p_{min}$.
• STEP 2: Compute $p_i'(t+1)$ randomly in the interval $(1-\epsilon)\,p_i(t+1) \le p_i'(t+1) \le (1+\epsilon)\,p_i(t+1)$. Set $t \Leftarrow t + 1$.
• STEP 3: If termination criteria are not met, go to STEP 1.
• STEP 4: Generate solutions according to $p(x) = \prod_{i=1}^n p_i(x_i)$ and compute their fitness values.

DIFFOPT is not restricted to Wright's equation or the DDRP. The numerical efficiency of DIFFOPT needs additional study.
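A minimal sketch of DIFFOPT, assuming Wright's equation $p_i \leftarrow p_i + p_i(1-p_i)\,(\partial\bar W/\partial p_i)/\bar W$ as the dynamic difference equation of STEP 1, on the linear fitness $f(x) = \sum_i w_i x_i$. The weights and the values of $\epsilon$ and $p_{min}$ are our illustrative choices:

```python
# Sketch of DIFFOPT (our own illustration) with Wright's equation as the
# dynamic difference equation, on the linear fitness f(x) = sum_i w_i x_i.
import random
random.seed(1)

w = [1.0, 2.0, 3.0, 4.0, 5.0]          # illustrative weights
n, eps, p_min = len(w), 0.02, 0.02

def avg_fitness(p):
    return sum(wi * pi for wi, pi in zip(w, p))   # W(p) for linear f

def wright_step(p):
    """Wright's equation: dW/dp_i = w_i for the linear fitness."""
    W = avg_fitness(p)
    return [pi + pi * (1.0 - pi) * wi / W for pi, wi in zip(p, w)]

# STEP 0
p = [0.5] * n
for _ in range(200):
    # STEP 1: deterministic update, kept away from the boundaries 0 and 1
    p = [min(max(pi, p_min), 1.0 - p_min) for pi in wright_step(p)]
    # STEP 2: diversify randomly within (1 - eps) p_i .. (1 + eps) p_i
    p = [min(pi * random.uniform(1.0 - eps, 1.0 + eps), 1.0 - p_min)
         for pi in p]
# STEP 4: generate solutions from p(x) = prod_i p_i(x_i), keep the best
samples = [[1 if random.random() < pi else 0 for pi in p]
           for _ in range(100)]
best = max(samples, key=lambda x: sum(wi * xi for wi, xi in zip(w, x)))
```

For this unimodal fitness the marginals approach $1 - p_{min}$ and the sampling of STEP 4 recovers the optimum $x = (1, \dots, 1)$ almost surely.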

8 Three Royal Roads to Optimization

In this section we try to classify the different approaches presented. Population search methods are based on at least two components – selection and reproduction with variation. In our research we have transformed genetic algorithms into a family of algorithms using search distributions instead of recombination/mutation of strings. The simplest algorithm of this family is the univariate marginal distribution algorithm UMDA.

Wright's equation describes the behavior of UMDA with an infinite population and proportionate selection. The equation shows that UMDA does not primarily optimize the fitness function $f(x)$, but the average fitness $\bar W(p)$ of the population, which depends on the continuous marginal frequencies $p_i(x_i)$. Thus the important landscape for population search is not the landscape defined by the fitness function $f(x)$, but the landscape defined by $\bar W(p)$.

The two components of population based search methods — selection and reproduction with variation — can work on a microscopic (individual) or a macroscopic (population) level. The level can be different for selection and reproduction. It is possible to classify the different approaches according to the level on which the components work. The following table shows three classes of evolutionary algorithms, each with a representative member.

Algorithm           Selection     Reproduction
Genetic Algorithm   microscopic   microscopic
UMDA                microscopic   macroscopic
System Dynamics     macroscopic   macroscopic

A genetic algorithm uses a population of individuals. Selection and recombination are done by manipulating individual strings. UMDA uses marginal distributions to create individuals; these are macroscopic variables. Selection is done on a population of individuals, as genetic algorithms do. In the system dynamics approach selection is modeled by a specific dynamic difference equation for macroscopic variables. We believe that a fourth class — macroscopic selection and microscopic reproduction — makes no sense.

Each of the approaches has its specific pros and cons. Genetic algorithms are very flexible, but the standard recombination operator has limited capabilities. UMDA can use any kind of selection method which is also used by genetic algorithms. UMDA can be extended to an algorithm which uses a more complex factorization of the distribution; this is done by the factorized distribution algorithm FDA. Selection is very difficult to model on a macroscopic level. Wright's equations are valid for proportionate selection only; other selection schemes lead to very complicated system dynamics equations.

Thus for proportionate selection and gene pool recombination all methods will behave similarly. But each of the methods allows extensions which cannot be modeled with an approach using a different level. Mathematically especially interesting is the extension of FDA to FDA with the adaptive Boltzmann annealing schedule SDS. For this algorithm convergence for a large class of discrete optimization problems has been shown.

8.1 Boltzmann Selection and the Replicator Equation

Wright's equation transforms the discrete optimization problem into a continuous one. Thus mathematically we can try to optimize $\bar W(p)$ instead of $f(x)$. For UMDA with Boltzmann selection we even have a closed solution for the probability $p(x,t)$. It is given by

$$p(x,t) = \frac{p(x,0)\, e^{\beta(t) f(x)}}{\sum_y p(y,0)\, e^{\beta(t) f(y)}} \tag{8.1}$$

If we differentiate this equation we obtain after some computation

$$\frac{\partial p(x,t)}{\partial t} = \dot\beta(t)\, p(x,t) \Big( f(x) - \sum_y p(y,t)\, f(y) \Big) \tag{8.2}$$

For $\dot\beta(t) \equiv 1$ we obtain a special case of the replicator equation 7.1; we just have to set $g_x(p) \equiv f(x)$.

Theorem 8.1. The dynamics of Boltzmann selection with $\dot\beta(t) \equiv 1$ is given by the replicator equation.

From the convergence theorem 5.1 we know that the global optima are the only stable attractors of the replicator equation. Thus the replicator equation is an ideal starting point for the system dynamics approach to optimization discussed in Section 7. Unfortunately the replicator equation consists of $2^n$ different equations for a problem of size $n$. Thus we are led to the same problem encountered when analyzing the Boltzmann distribution: we have to factorize the probability $p(x)$ if we want to use the equation numerically.

Example 8.1. $f(x) = \sum_{i=1}^n x_i$. In this case the factorization $p(x,t) = \prod_{i=1}^n p_i(x_i,t)$ is valid. By summation we obtain from equation 8.2 after some manipulation

$$\dot p_i = \dot\beta\, p_i (1 - p_i) \tag{8.3}$$

For $\dot\beta(t) \equiv 1$ this is just Wright's equation without the denominator $\bar W$.

If we extend this equation we obtain another proposal for the system dynamics approach to optimization

$$\dot p_i = \dot\beta\, p_i (1 - p_i)\, \frac{\partial \bar W}{\partial p_i} \tag{8.4}$$

This equation needs further numerical studies. The speed of convergence can be controlled by the annealing schedule $\dot\beta(t) > 0$. With a suitably chosen schedule (8.5) an interesting alternative to Wright's equation is obtained. Further numerical studies are needed.
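The relation between the closed solution (8.1) and the differential equation (8.2) can be cross-checked numerically: Euler-integrating (8.2) with $\dot\beta(t) \equiv 1$ should reproduce the Boltzmann distribution (8.1) at $\beta = t$. A small sketch with our own test fitness:

```python
# Cross-check of (8.1) and (8.2), our own minimal illustration: Euler
# integration of the replicator-type equation (8.2) with beta'(t) = 1
# must reproduce the closed Boltzmann solution (8.1) with beta(t) = t.
import math
from itertools import product

states = list(product((0, 1), repeat=3))

def f(x):
    return x[0] + 2 * x[1] * x[2]          # arbitrary test fitness (ours)

p0 = {x: 1.0 / len(states) for x in states}    # uniform p(x, 0)

def boltzmann(t):
    """Closed solution (8.1): p(x,t) = p(x,0) e^{t f(x)} / normalizer."""
    z = sum(p0[y] * math.exp(t * f(y)) for y in states)
    return {x: p0[x] * math.exp(t * f(x)) / z for x in states}

# Euler integration of (8.2): dp(x)/dt = p(x) (f(x) - mean fitness)
dt, T = 1e-4, 1.0
p = dict(p0)
for _ in range(int(T / dt)):
    mean = sum(p[y] * f(y) for y in states)
    p = {x: p[x] + dt * p[x] * (f(x) - mean) for x in states}

ref = boltzmann(T)                          # closed-form reference at t = T
```

Note that the increments of (8.2) sum to zero, so the integration preserves the normalization of $p$ exactly; for larger $\beta$ the step size must shrink, since $p(x,t)$ concentrates on the optima exponentially fast.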

9 Conclusion and Outlook

This chapter describes a complete mathematical analysis of evolutionary methods for optimization. The optimization problem is defined by a fitness function with a given set of variables. Part of the theory consists of an adaptation of classical population genetics and the science of breeding to optimization problems. The theory is extended to general population based search methods by introducing search distributions instead of doing recombination of strings. This theory can also be used for continuous variables, a mixture of continuous and discrete variables, as well as constrained optimization problems. The theory combines learning and optimization into a common framework based on graphical models.

We have presented three approaches to optimization. We believe that the optimization methods based on search distributions (UMDA, FDA, LFDA) have the greatest optimization power. The dynamic equations derived for UMDA with proportionate selection are fairly simple. For FDA with truncation or tournament selection and with conditional marginal distributions, the dynamic equations can become very complicated. FDA with the Boltzmann selection schedule SDS is an extension of simulated annealing to a population of points. It shares with simulated annealing the convergence property, but convergence is much faster.

Ultimately our theory leads to a synthesis problem: finding a good factorization for a search distribution defined by a finite sample. This is a central problem in probability theory. One approach to this problem uses Bayesian networks, for which numerically efficient algorithms have been developed. Our algorithm LFDA computes a Bayesian network by minimizing the Bayesian Information Criterion. The computational effort of both FDA and LFDA is substantially higher than that of UMDA. Thus UMDA should be the first algorithm to be tried on a practical problem; next, FDA with a multi-factorization should be applied.

Our theory is defined for optimization problems which are defined by quantitative variables. The optimization problem can be defined by a cost function or a complex process to be simulated. The theory is not applicable if either the optimization problem is qualitatively defined or the problem solving method is non-numeric. A popular example of a non-numeric problem solving method is genetic programming. In genetic programming we try to find a program which solves the problem rather than an optimal solution. Understanding these kinds of problem solving methods will be a challenge for the new decade.

Theoretical biology faces the same problem. The most successful biological model is the foundation of classical population genetics. It is based on Mendel's laws and a simple model of Darwinian selection. The model has also been used in this paper; it is purely quantitative. But this model is too great a simplification of natural systems: the organism is not represented in the model. As such the model has led to Ultra-Darwinism with the concept of selfish genes. What is urgently needed is a model of the organism. This model has to incorporate all three aspects of organisms – structure, function, and development, or being, acting, and becoming. Karl Popper has formulated it the following way: All life is problem solving. Gene frequencies cannot explain the many problem solving capabilities found in nature.

Bibliography

[AM94a] H. Asoh and H. Mühlenbein. Estimating the heritability by decomposing the genetic variance. In Y. Davidor, H.-P. Schwefel, and R. Männer, editors, Parallel Problem Solving from Nature, Lecture Notes in Computer Science 866, pages 98–107. Springer-Verlag, 1994.

[AM94b] H. Asoh and H. Mühlenbein. On the mean convergence time of evolutionary algorithms without selection and mutation. In Y. Davidor, H.-P. Schwefel, and R. Männer, editors, Parallel Problem Solving from Nature, Lecture Notes in Computer Science 866, pages 88–97. Springer-Verlag, 1994.

[Bal97] D.H. Ballard. An Introduction to Natural Computation. MIT Press, Cambridge, 1997.

[BE67] L.E. Baum and J.A. Eagon. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Am. Math. Soc., 73:360–363, 1967.

[Bou94] R.R. Bouckaert. Properties of Bayesian network learning algorithms. In R. Lopez de Mantaras and D. Poole, editors, Proc. Tenth Conference on Uncertainty in Artificial Intelligence, pages 102–109, San Francisco, 1994. Morgan Kaufmann.

[CF98] F.B. Christiansen and M.W. Feldman. Algorithms, genetics and populations: The schemata theorem revisited. Complexity, 3:57–64, 1998.

[dlMT93] M. de la Maza and B. Tidor. An analysis of selection procedures with particular attention paid to proportional and Boltzmann selection. In S. Forrest, editor, Proc. of the Fifth Int. Conf. on Genetic Algorithms, pages 124–131, San Mateo, CA, 1993. Morgan Kaufmann.

[Fal81] D.S. Falconer. Introduction to Quantitative Genetics. Longman, London, 1981.

[Fre98] B.J. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, Cambridge, 1998.

[Gei44] H. Geiringer. On the probability theory of linkage in Mendelian heredity. Annals of Math. Stat., 15:25–57, 1944.

[Gol89] D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, 1989.

[HCPGM99] G. Harik, E. Cantu-Paz, D.E. Goldberg, and B.L. Miller. The gambler's ruin problem, genetic algorithms, and the sizing of populations. Evolutionary Computation, 7:231–255, 1999.

[Hol92] J.H. Holland. Adaptation in Natural and Artificial Systems. Univ. of Michigan Press, Ann Arbor, 1975/1992.

[HS98] J. Hofbauer and K. Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, 1998.

[Jor99] M.I. Jordan. Learning in Graphical Models. MIT Press, Cambridge, 1999.

[Lau96] St. L. Lauritzen. Graphical Models. Clarendon Press, Oxford, 1996.

[MHF94] M. Mitchell, J.H. Holland, and St. Forrest. When will a genetic algorithm outperform hill climbing? Advances in Neural Information Processing Systems, 6:51–58, 1994.

[MM99a] H. Mühlenbein and Th. Mahnig. Convergence theory and applications of the factorized distribution algorithm. Journal of Computing and Information Technology, 7:19–32, 1999.

[MM99b] H. Mühlenbein and Th. Mahnig. FDA – a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7(4):353–376, 1999.

[MM00] H. Mühlenbein and Th. Mahnig. Evolutionary algorithms: From recombination to search distributions. In L. Kallel, B. Naudts, and A. Rogers, editors, Theoretical Aspects of Evolutionary Computing, Natural Computing, pages 137–176. Springer Verlag, 2000.

[MMO99] H. Mühlenbein, Th. Mahnig, and A. Rodriguez Ochoa. Schemata, distributions and graphical models in evolutionary optimization. Journal of Heuristics, 5:215–247, 1999.

[MSV94] H. Mühlenbein and D. Schlierkamp-Voosen. The science of breeding and its application to the breeder genetic algorithm. Evolutionary Computation, 1:335–360, 1994.

[Müh97] H. Mühlenbein. The equation for the response to selection and its use for prediction. Evolutionary Computation, 5(3):303–346, 1997.

[MV96] H. Mühlenbein and H.-M. Voigt. Gene pool recombination in genetic algorithms. In J.P. Kelly and I.H. Osman, editors, Metaheuristics: Theory and Applications, pages 53–62, Norwell, 1996. Kluwer Academic Publishers.

[Nag92] T. Nagylaki. Introduction to Theoretical Population Genetics. Springer, Berlin, 1992.

[PBS97] A. Prügel-Bennett and J.L. Shapiro. An analysis of a genetic algorithm for simple random Ising systems. Physica D, 104:75–114, 1997.

[PGCP00] M. Pelikan, D.E. Goldberg, and E. Cantu-Paz. Linkage problem, distribution estimation, and Bayesian networks. Evolutionary Computation, 8:311–341, 2000.

[Rao73] C.R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1973.

[RS99] L.M. Rattray and J.L. Shapiro. Cumulant dynamics in a finite population. Theoretical Population Biology, 1999. To be published.

[Sch78] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 7:461–464, 1978.

[Vap98] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[Voi89] H.-M. Voigt. Evolution and Optimization. Akademie-Verlag, 1989.

[Vos99] M. Vose. The Simple Genetic Algorithm: Foundations and Theory. MIT Press, Cambridge, 1999.

[Wri70] S. Wright. Random drift and the shifting balance theory of evolution. In K. Kojima, editor, Mathematical Topics in Population Genetics, pages 1–31, Berlin, 1970. Springer Verlag.

[ZOM97] Byoung-Tak Zhang, Peter Ohm, and Heinz Mühlenbein. Evolutionary induction of sparse neural trees. Evolutionary Computation, 5:213–236, 1997.