Application of an Evolutionary Algorithm to Multivariate Optimal Allocation in Stratified Sample Designs

APPLICATION OF AN EVOLUTIONARY ALGORITHM TO MULTIVARIATE OPTIMAL ALLOCATION IN STRATIFIED SAMPLE DESIGNS Charles D. Day Internal Revenue Service, Statistics of Income Division KEY WORDS: Genetic Algorithm, Stochastic Search INTRODUCTION made is the choice of a variable or variables on Biological evolution can be viewed as a process which to stratify. Since it is rare to conduct a of optimizing a species to (or increasing its survey with only one item of interest, the fitness for) its environment. Evolutionary stratification variable or variables are chosen (or Algorithms (EAs), sometimes called Genetic constructed) to have a strong correlation with as Algorithms after their most common variant, many items of interest as possible. Methods for adopt biological evolution as a model for construction of optimal stratum boundaries (with computation. These algorithms are used most the goal of improving the precision of estimates) often for finding approximate solutions to have been proposed by Dalenius and Hodges [2], computationally intractable optimization Singh [3], Lavallée and Hidiroglou [4], and problems. In this paper, an evolutionary Sweet and Sigman [5]. algorithm is applied to the problem of multivariate optimal allocation in stratified Once stratum boundaries have been defined, and sample designs. a maximum sample size or total cost determined, it is straightforward to determine the number of The work reported on in this paper focuses on sample units to allocate to each stratum if the the design of an EA for solving the multivariate allocation is done on a single variable [6]. The optimal allocation problem and an investigation problem becomes more difficult if the allocation of the performance of that algorithm on a simple, is done on multiple variables. A number of well-known example. One of the most attractive approaches have been use to find good features of EAs is the flexibility of their “fitness” approximations to the optimum allocation [7-11]. (or objective) functions. Many characteristics This paper proposes the use of another method. can be optimized simultaneously. Future research will explore how an EA might be used EVOLUTIONARY ALGORITHMS to find the optimal strata boundaries and the As described in the introduction, evolutionary optimal allocation of sample units to strata algorithms adopt biological evolution as a model simultaneously, with the goal of producing a for computing. While there are a number of better result than doing so in serial fashion using canonical variants of evolutionary algorithms, it standard methods is common for practitioners to adapt features of two or more variants to develop algorithms Stratified sample designs are employed for specific to the solution of their problems. several reasons. These include: 1) to increase the precision of estimates for the whole population In general, evolutionary algorithms start with a for one or more key data items being collected in “population.” Each individual in the population the survey; 2) to obtain more precise estimates consists of one candidate solution for the for interesting domains; 3) to allow the use of problem the EA is trying to solve. Borrowing different sampling, nonresponse adjustment, terminology from biology, each variable in a editing, or estimation methods for domains with solution is referred to as a gene, the value for differing characteristics affecting the choice of each gene is called an allele, and the structure of method, and 4) to facilitate administration of the the whole solution is referred to as a genome. survey [1]. This paper focuses on the first two These candidate solutions are usually generated reasons. at random from the space (or a well-chosen subspace) of all possible solutions. For example, Once stratified sampling has been chosen, it is if an EA were designed to find the rational roots necessary to determine how to divide the of a quadratic equation, the solutions might be population into strata, and how to allocate the represented by a vector in Q2 (a vector of two sample to those strata. One decision that must be floating point numbers). The genome would be a vector of two floating pointing numbers, each The second reproductive operator is mutation. of the two variables would be a gene, and the As one might suspect, it consists in changing the value assigned to each variable an allele. The value of one of the genes with some probability. representation of candidate solutions is an Similarly to selection pressure, if the mutation important factor in the success of an EA; rate (the probability of a mutation) is high, the therefore, representations must be chosen with EA will be expected to more fully explore the care. solution space, if it is lower, convergence is expected to occur more quickly. The “fitness” of each individual is then evaluated; that is, the value of the objective Following reproduction, each child’s fitness is function of the optimization problem being assessed. Children are allowed to survive into solved is determined for each individual. Note the next generation (where they become the that the objective function can be as complicated initial population) in proportion to their fitness. as a simulation for flow of a gas or liquid The earlier comments about selection pressure through a manifold or as simple as a single apply to survival selection as well as they do to polynomial, so long as it is possible to rank the reproductive selection. candidate solutions on their fitnesses. This process continues, with the children Next, pairs (or n-tuples, should the practitioner becoming the next generation’s parents, until wish) of individuals are selected to “reproduce.” some convergence criterion is reached, or a This selection is done proportionate to the maximum number of generations is reached. individuals’ fitness. How the fitnesses are One problem with EAs as described to this point weighted in determining the probability of an is that the best solution may be lost; that is, the individual’s selection to reproduce is one of, as solution with the overall highest (if maximizing) Kenneth DeJong calls them, the “knobs” that one or lowest (if minimizing) value of the objective has to turn in tuning an EA for optimal function may disappear as the algorithm moves performance. If fitter individuals are given a from generation to generation, never to be seen great deal higher probability of selection than again. To address this problem, practitioners those that are less fit, then the EA is expected to usually employ “elitism,” allowing the k highest converge more quickly to an answer, but at valued members of the current population to greater risk of finding a local, rather than the survive into the next generation. global, optimum. The less “selection pressure” is applied, the more fully the EA is allowed to THE MULTIVARIATE OPTIMAL explore the solution space, at the cost of slower ALLOCATION PROBLEM convergence and at the risk of not converging at In stratified sampling, the problem arises of how all. This trade-off is referred to as “exploitation many sample units to allocate to each stratum. If versus exploration,” and a well-designed EA the survey practitioner wishes only to make as must balance the two competing goals so that precise as possible an estimate for one variable progress is made toward convergence without given a fixed cost, or find the minimum cost the EA getting stuck in a local optimum. design to achieve a target variance, this problem has a well-known solution [12]: During reproduction, two operations can be used to produce “children” (the next “generation” of N h S h / c h candidate solutions). One consists of taking one n h = n part of one of the individuals selected to ∑ ( N h S h / c h ) reproduce and appending it to the complementary part of the individual it was where nh is the number of sample units allocated paired with during selection. This is referred to to stratum h, Nh is the number of population units as “crossover” in the EA literature, and is in stratum h, ch is the cost per unit in stratum h, analogous to recombination in biological Sh is the population standard deviation for the reproduction. Given possible constraints on the variable of interest in stratum h, and n is the total structure of solutions, the design of crossover sample size. (Sh is usually estimated from frame operators can become quite creative. The desire information or earlier samples.) If a target for simpler or more effective crossover operators variance is fixed and cost is to be minimized, can also impact the representation of solutions. then: ( Wh Sh ch ) Wh Sh / ch ∑∑ n = 2 DESIGN OF AN EA TO SOLVE THE V + (1/ N)∑Wh Sh MULTIVARIATE OPTIMAL where W = Nh/N. If cost is fixed and variance is ALLOCATION PROBLEM to be minimized then: When designing an EA (or any other optimization algorithm), it is important to (C − c ) N S / c incorporate any special features of the problem n = 0 ∑ h h h . to be solved. The multivariate optimal allocation problem has two features that should be ∑ N h Sh ch accounted for in the design. First, it is really two While it is rarely the case that a survey is problems, minimize a function of variances for a conducted to find the value of only one variable, fixed cost or minimize cost subject to fixed this formula is still broadly useful, since an variance targets. A good solution will allow the allocation that is optimal for one variable may be statistician to choose which of these approaches near-optimal for variables that are strongly to follow. correlated with it. If, however, precise estimates of several variables are needed, and those Second, any solution that results in enough of the variables are not all highly correlated with each available budget (total cost) being left over to other, it is desirable to have a method to find a allocate another unit in any stratum is sub- good compromise allocation that will give optimal; that is, any optimal solution must use adequate precision for all of the variables of the entire budget.

Load more