METHODS OF SELECTING SAMPLES IN MULTIPLE SURVEYS TO REDUCE RESPONDENT BURDEN

Charles R. Perry, Jameson C. Burt and William C. Iwig, National Agricultural Statistics Service
Charles Perry, USDA/NASS, Room 305, 3251 Old Lee Highway, Fairfax, VA 22030

KEY WORDS: Respondent burden, multiple surveys, stratified design

Summary

The National Agricultural Statistics Service (NASS) surveys the United States population of farm operators numerous times each year. The list components of these surveys are conducted using independent designs, each stratified differently. By chance, NASS samples some farm operators in multiple surveys, producing a respondent burden concern. Two methods are proposed that reduce this type of respondent burden. The first method uses linear integer programming to minimize the expected respondent burden. The second method samples by any current sampling scheme and then, within classes of similar farm operations, minimizes the number of times that NASS samples a farm operation for several surveys. The second method reduces the number of times that a respondent is contacted twice or more within a survey year by about 70 percent. The first method will reduce this type of burden even further.

Introduction

The National Agricultural Statistics Service (NASS) surveys the United States population of farm operators numerous times each year. Some surveys are conducted quarterly, others are conducted monthly and still others are conducted annually. Each major survey uses a list dominant multiple frame design and an area frame component that accounts for that part of the population not on the list frame. The list frame components of these surveys constitute a set of independent surveys, each using a stratified design with different strata definitions. With the current procedures some individual farm operators are sampled for numerous surveys while other farm operators with similar design characteristics are hardly sampled at all. Within the list frame component, two methods are proposed that reduce this type of respondent burden.

Historically, NASS has attempted to reduce respondent burden and also reduce variance. In 1979, Tortora and Crank considered sampling with probability inversely proportional to burden. Noting a simultaneous gain in variance with a reduction in burden, NASS chose not to sample with probability inversely proportional to burden. NASS has reduced burden on the area frame component of its surveys. There, a farmer sampled on one survey might be exempt from another survey, or farmers not key to that survey might be sampled less intensely.

Statistical agencies in other countries have also approached respondent burden. For example, the Netherlands Central Bureau of Statistics does some co-ordinated or collocated sampling, ingeniously conditioning samples for one survey on previous surveys (de Ree, 1983).

Formal Description of Methods I and II

Method I is formally described by four basic tasks.

(a) Cross-classify the population by the stratifications used in the individual surveys. This produces the coarsest stratification of the population that is a substratification of each individual stratification.

(b) Proportionally allocate each of the individual stratified samples to the substrata. Use random assignment between substrata where necessary.

(c) Apply integer linear programming within each substratum to assign the samples to the labels of units belonging to the substratum so that the respondent burden is minimized.

(d) Randomize the labels to the units of the substratum. The final assignment within each substratum is a simple random sample with respect to each of the proportionally allocated samples.

Method II is formally described by four basic tasks.

(a) Using an equal probability of selection technique within a stratum, select independent stratified samples for each survey. Notice that the equal probability of selection criterion permits efficient zonal sampling techniques on each survey within strata. Currently, within strata the samples are selected systematically with records essentially in random order.

(b) Substratify the population by cross-classifying the individual farm units according to the stratifications used in the individual surveys.

(c) Randomly reassign within each substratum the samples associated with units having excess respondent burden to units having less respondent burden.

(d) Iterate the reassignment process until it minimizes the number of times that NASS samples a farm operator for several surveys in the substratum.

For both methods, define respondent burden by an index that represents the comparative burden on each individual sampling unit in the population. Each survey considered is assigned a burden value. When a sampling unit is selected for multiple surveys, the burden index may be additive or some other functional form dependent on the individual survey burden values. Consequently, each sampling configuration is assigned a unique respondent burden index.

For any reasonable respondent burden index, the first method minimizes the expected respondent burden. This follows easily from the following observations, where it is assumed that for each of the original surveys an equal probability of selection mechanism (epsm) is used within strata.

First, from the independence of the original sample designs, it follows that for each individual unit the expected burden from the original stratified samples is equal to the expected respondent burden using proportional allocation followed by epsm sampling within substrata. Since the respondent burden over any population is the sum of the respondent burden on the individuals of the population, the equality holds for the entire population or any subpopulation including the substrata. That is, the expected respondent burden over any arbitrary substratum for the proportionally allocated samples is equal to the expected respondent burden of the original stratified sample allocations over the substratum. Originally, these allocations are random to each substratum, constrained only so that the substratum sample sizes sum to their stratum sample size.

Second, for the first method the respondent burden is minimized over each substratum by the linear programming step.

Regarding variance reduction, this means that if the original sample was selected using simple random sampling within each stratum, then the first method reduces respondent burden without any offsetting increase in variance, since proportional allocation is at least as efficient as simple random sampling. However, the first method would be less efficient for variance than zonal sampling unrestricted by burden. But the second method, by reallocating some zonal sampling units to reduce respondent burden, may only slightly increase variance over no reallocation and then only when zonal sampling is effective.

A Simple Simulation of Method I

Method I reduces respondent burden in the following simulation of two surveys. Survey I samples $n = 20$ from $N = 110$. Survey II also samples $n = 20$ from $N = 110$, though each of its strata has either a larger or smaller population size ($N_{\cdot 1} = 40$ and $N_{\cdot 2} = 70$) than the corresponding strata of survey I ($N_{1\cdot} = 30$ and $N_{2\cdot} = 80$). Here, the first subscript corresponds to the first survey, with its strata 1 and 2. Similarly, the second subscript corresponds to the second survey. For example, $N_{21} = 30$ corresponds to the size of the population in stratum 2 of survey I and in stratum 1 of survey II, while $\bar{n}^{(1)}_{21} = 3.75$ corresponds to the proportional allocation of survey I's stratum 2 sample, $n^{(1)}_{2\cdot} = 10$, to the population in both stratum 2 of survey I and stratum 1 of survey II.

Survey I       Survey II, Stratum 1           Survey II, Stratum 2           Survey I margin
Stratum 1      $N_{11} = 10$                  $N_{12} = 20$                  $N_{1\cdot} = 30$
               $\bar{n}^{(1)}_{11} = 3.33$    $\bar{n}^{(1)}_{12} = 6.67$    $n^{(1)}_{1\cdot} = 10$
               $\bar{n}^{(2)}_{11} = 2.5$     $\bar{n}^{(2)}_{12} = 2.86$
Stratum 2      $N_{21} = 30$                  $N_{22} = 50$                  $N_{2\cdot} = 80$
               $\bar{n}^{(1)}_{21} = 3.75$    $\bar{n}^{(1)}_{22} = 6.25$    $n^{(1)}_{2\cdot} = 10$
               $\bar{n}^{(2)}_{21} = 7.5$     $\bar{n}^{(2)}_{22} = 7.14$
Survey II      $N_{\cdot 1} = 40$             $N_{\cdot 2} = 70$
margin         $n^{(2)}_{\cdot 1} = 10$       $n^{(2)}_{\cdot 2} = 10$

With two surveys, at most we will sample a respondent twice. For the above two surveys, without any proportional allocation, we simulated two independent stratified simple random samples 3 million times. These simulated samples produced, on average, 3.6 double hits for the whole population of 110 potential respondents, and four percent of the simulations produced 7 or more double hits. With the proportional allocations indicated in the diagram for Method I, the population exceeds the total sample for both surveys in each substratum, so no sampling unit needs to be selected for both surveys. The high respondent burdens of independent sampling are reduced to 0 double hits with Method I!
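
The average of about 3.6 double hits reported for the two-survey simulation can be cross-checked without simulating. The sketch below assumes simple random sampling without replacement within each stratum and independence between the two surveys, so a unit in substratum $(i, j)$ is selected for both surveys with probability $f^{(1)}_i f^{(2)}_j$. It also recomputes the proportional allocations of the example. All function and variable names are illustrative and do not come from the paper.

```python
from fractions import Fraction
from math import ceil

# Substratum sizes N[i][j]: stratum i of survey I crossed with stratum j of survey II.
N = [[10, 20],
     [30, 50]]
n1 = [10, 10]  # survey I sample size in each survey I stratum
n2 = [10, 10]  # survey II sample size in each survey II stratum

row_tot = [sum(row) for row in N]         # survey I stratum sizes: 30, 80
col_tot = [sum(col) for col in zip(*N)]   # survey II stratum sizes: 40, 70


def expected_double_hits():
    """Expected number of units drawn by both surveys under independent sampling."""
    total = Fraction(0)
    for i in range(2):
        for j in range(2):
            f1 = Fraction(n1[i], row_tot[i])  # survey I sampling fraction
            f2 = Fraction(n2[j], col_tot[j])  # survey II sampling fraction
            total += N[i][j] * f1 * f2
    return total


def proportional_allocations():
    """Proportional allocation of each stratum sample to the four substrata."""
    return {
        (i + 1, j + 1): (Fraction(N[i][j] * n1[i], row_tot[i]),
                         Fraction(N[i][j] * n2[j], col_tot[j]))
        for i in range(2) for j in range(2)
    }


print(float(expected_double_hits()))  # 405/112, roughly 3.616
for cell, (a1, a2) in proportional_allocations().items():
    # Even if both allocations are rounded up, the substratum is large enough
    # that no unit has to serve both surveys.
    print(cell, float(a1), float(a2), ceil(a1) + ceil(a2) <= N[cell[0] - 1][cell[1] - 1])
```

The exact expectation, $405/112 \approx 3.62$, agrees with the simulated average of 3.6 double hits, and the last column confirms that every substratum can hold both allocated samples on disjoint units.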

Operational Description

Basic Notation

Let $U = \{u_i\}_{i=1}^{N}$ be a finite population of size $N$. Suppose that $U$ is surveyed on $K$ occasions and that on each occasion a different independent stratified design is used. For these $K$ stratified designs, denote the survey occasion by $k = 1, 2, \ldots, K$ and let us use the following notation.

$H^{(k)}$: the number of strata for design $k$,
$U^{(k)}_h$: the units (the set of them) in stratum $h$ for design $k$,
$N^{(k)}_h$: the size of stratum $h$ for design $k$,
$n^{(k)}_h$: the sample size in stratum $h$ for design $k$,
$f^{(k)}_h = n^{(k)}_h / N^{(k)}_h$: the sampling fraction in stratum $h$ for design $k$,
$n^{(k)} = \sum_{h=1}^{H^{(k)}} n^{(k)}_h$: the overall sample size for design $k$, and
$N = N^{(k)} = \sum_{h=1}^{H^{(k)}} N^{(k)}_h$: the overall population size.

Remark

Requiring the population to be exactly the same for each survey may seem rather restrictive. However it is not, since, for each survey, one can easily introduce an extra stratum that contains the units not covered by that survey. Obviously the sample sizes associated with the extra noncovered strata are taken to be zero. This permits one to apply either Method I or Method II over years.

Warning: In multiyear applications, care must be taken to ensure that no information from the sample data is used to update any of the frames being considered. Failure to do so can lead to biased estimates. These are the same restrictions that apply to the permanent random number techniques discussed by Ohlsson (1993).

Method I

Using this notation for Method I, we next describe a sequence of simple data manipulation steps that can be used operationally to perform tasks (a) through (d) of the formal description for each of the $K$ surveys.

Suppose that each unit, $u_i$, of the population $U$ has been stratified for each of the $K$ surveys. Further suppose that this information has been entered into a file containing $N$ records, so that the $i$th record contains the stratification information for unit $i$. To be definitive, assume that the variable $S(k)$ denotes the stratum classification code for survey $k$ and that $S(k : i)$ denotes the value of the stratum classification code for unit $u_i$.

For each survey $k$ ($k = 1, 2, \ldots, K$) perform the following sequence of operations.

(a) Sort the data file by the variables $S(k), \ldots, S(K), S(1), \ldots, S(k-1)$. This will hierarchically arrange the records of the population, first by the stratification of survey $k$, then by the stratification of survey $k+1$ within the stratification of survey $k$, then by the stratification of survey $k+2$ within the stratification of survey $k+1$, ..., by the stratification of survey $K$ within the stratification of survey $K-1$, then by the stratification of survey 1 within the stratification of survey $K$, ..., then by the stratification of survey $k-1$ within the stratification of survey $k-2$. In terms of the substrata formed by the cross-classification, the records of the population are arranged sequentially after sorting as

$$U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{1,1,\ldots,1,1,\ldots,1,1},$$
$$U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{1,1,\ldots,1,1,\ldots,1,2},$$
$$\vdots$$
$$U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{1,1,\ldots,1,1,\ldots,1,H^{(k-1)}},$$
$$U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{1,1,\ldots,1,1,\ldots,2,1},$$
$$\vdots$$
$$U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{H^{(k)},H^{(k+1)},\ldots,H^{(K)},H^{(1)},\ldots,H^{(k-2)},H^{(k-1)}-1},$$
$$U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{H^{(k)},H^{(k+1)},\ldots,H^{(K)},H^{(1)},\ldots,H^{(k-2)},H^{(k-1)}},$$

where

$$U^{(k,\ldots,K,1,\ldots,k-1)}_{h_k,\ldots,h_K,h_1,\ldots,h_{k-1}} = U^{(k)}_{h_k} \cap \cdots \cap U^{(K)}_{h_K} \cap U^{(1)}_{h_1} \cap \cdots \cap U^{(k-1)}_{h_{k-1}}$$
$$= U^{(1)}_{h_1} \cap \cdots \cap U^{(k-1)}_{h_{k-1}} \cap U^{(k)}_{h_k} \cap \cdots \cap U^{(K)}_{h_K}$$
$$= U^{(1,2,\ldots,k-1,k,k+1,\ldots,K)}_{h_1,h_2,\ldots,h_{k-1},h_k,h_{k+1},\ldots,h_K}.$$
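
The sort in step (a) amounts to rotating each record's tuple of stratum codes so that survey $k$ comes first, sorting on the rotated tuple, and then reading off runs of identical codes as substrata. A minimal sketch follows, assuming the file is held in memory as a dictionary from a unit identifier to its codes $(S(1), \ldots, S(K))$. The function names and the toy records are hypothetical.

```python
from itertools import groupby


def sort_key_for_survey(codes, k):
    """Rotate (S(1),...,S(K)) into (S(k),...,S(K),S(1),...,S(k-1)); k is 1-based."""
    return tuple(codes[k - 1:]) + tuple(codes[:k - 1])


def substrata_in_sorted_order(records, k):
    """Return [(codes in natural order, [unit ids])] in the sort order for survey k.

    records maps a unit id to its tuple of stratum codes, one code per survey.
    """
    ordered = sorted(records.items(),
                     key=lambda item: sort_key_for_survey(item[1], k))
    # After sorting, units sharing all K codes are adjacent, so each run of
    # equal code tuples is one substratum of the cross-classification.
    return [(codes, [uid for uid, _ in group])
            for codes, group in groupby(ordered, key=lambda item: item[1])]


# Toy file with K = 2 surveys and five units.
records = {"u1": (1, 2), "u2": (2, 1), "u3": (1, 1), "u4": (2, 1), "u5": (1, 2)}
for codes, units in substrata_in_sorted_order(records, k=2):
    print(codes, units)
```

For $k = 2$ the substrata come out ordered by the survey II stratum first, then by the survey I stratum within it, while each substratum is still labelled by its codes in natural order.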

Both the size and sequential arrangement of the substrata of stratum $h$ for survey $k$ are displayed schematically as

$$N^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h,1,\ldots,1,1,\ldots,1,1}$$
$$N^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h,1,\ldots,1,1,\ldots,1,2}$$
$$\vdots$$
$$N^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h,h_{k+1},\ldots,h_K,h_1,\ldots,h_{k-2},h_{k-1}}$$
$$\vdots$$
$$N^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h,H^{(k+1)},\ldots,H^{(K)},H^{(1)},\ldots,H^{(k-2)},H^{(k-1)}-1}$$
$$N^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h,H^{(k+1)},\ldots,H^{(K)},H^{(1)},\ldots,H^{(k-2)},H^{(k-1)}}$$

where $N^{(k,\ldots,K,1,\ldots,k-1)}_{h_k,\ldots,h_K,h_1,\ldots,h_{k-1}}$ denotes the number of units in $U^{(k,\ldots,K,1,\ldots,k-1)}_{h_k,\ldots,h_K,h_1,\ldots,h_{k-1}}$.

(b) To randomly proportion the sample $n^{(k)}_h$ for stratum $h$ of survey $k$ to the substrata of stratum $h$:

(1) Divide the length of stratum $h$ for survey $k$, $N^{(k)}_h$, into a sequence of $n^{(k)}_h$ subintervals of integer length that differ in length by at most 1. Do this by forming $n^{(k)}_h$ as yet unpopulated subintervals, each with the length $\lfloor N^{(k)}_h / n^{(k)}_h \rfloor$, leaving $N^{(k)}_h - \lfloor N^{(k)}_h / n^{(k)}_h \rfloor \, n^{(k)}_h$ imaginary population units to be assigned. Randomly distribute these remaining imaginary units (without replacement) to the $n^{(k)}_h$ subintervals. Now populate these subintervals by randomly selecting a starting unit from the $N^{(k)}_h$ units. This starting unit begins the first subinterval, with its size randomly determined as above, $\lfloor N^{(k)}_h / n^{(k)}_h \rfloor$ or $\lfloor N^{(k)}_h / n^{(k)}_h \rfloor + 1$. Sequentially continue to populate the above subintervals, wrapping around to the first unit for one of the subintervals. This method of forming subintervals will let us keep the same probability of selection $n^{(k)}_h / N^{(k)}_h$ for each unit in that subinterval. It does not choose a sample.

(2) Randomly select an integer from each subinterval [while this integer corresponds to a population unit, it is not used here to select that population unit; for that, see (d) below]. The number of these random integers falling in the interval corresponding to $N^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h_k,h_{k+1},\ldots,h_K,h_1,\ldots,h_{k-2},h_{k-1}}$ in the sequential ordering is the size of the randomly proportioned sample for survey $k$ to be drawn from the substratum population $U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h_k,h_{k+1},\ldots,h_K,h_1,\ldots,h_{k-2},h_{k-1}}$. Denote this sample size for the substratum by

$$m^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h_k,h_{k+1},\ldots,h_K,h_1,\ldots,h_{k-2},h_{k-1}} \quad \text{or} \quad m^{(k)}_{h_1,\ldots,h_k,h_{k+1},\ldots,h_K},$$

where the subscripts in the last expression are understood to be in natural order.

Repeating steps (a) and (b) above for each of the $K$ surveys, we have randomly proportioned the $K$ original stratified sample sizes to the substrata.
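
Step (b) can be sketched for a single stratum as follows. This is one reading of the procedure, under the assumptions that the stratum's units occupy positions $0, \ldots, N^{(k)}_h - 1$ in the sorted order of step (a), that exactly one position is drawn from each subinterval, and that $1 \le n^{(k)}_h \le N^{(k)}_h$. The function name and argument layout are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_proportion(substratum_sizes, n, rng):
    """Randomly proportion a stratum sample of size n to the stratum's substrata.

    substratum_sizes lists the substratum sizes in the sorted order of step (a).
    Returns one sample size per substratum; the sizes sum to n.
    """
    N = sum(substratum_sizes)
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")
    base, extra = divmod(N, n)
    lengths = [base] * n                        # n subintervals of length floor(N/n)
    for j in rng.sample(range(n), extra):       # hand out the leftover units,
        lengths[j] += 1                         # at most one per subinterval
    start = rng.randrange(N)                    # random starting unit
    upper = list(accumulate(substratum_sizes))  # cumulative substratum boundaries
    counts = [0] * len(substratum_sizes)
    offset = 0
    for length in lengths:
        # One random position inside this subinterval, wrapping past the end.
        pos = (start + offset + rng.randrange(length)) % N
        counts[bisect_right(upper, pos)] += 1   # substratum containing pos
        offset += length
    return counts


# Survey I, stratum 1 of the simulation example: substrata of sizes 10 and 20, n = 10.
print(random_proportion([10, 20], 10, random.Random(1)))
```

Because the subintervals partition the stratum and each contributes one distinct position, no substratum can receive more sample than it has units, which is the property $m^{(k)} \le M$ used in step (c).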

(c) Next we describe how to use integer linear programming to assign within a substratum the above proportioned samples to the substratum unit labels, not specific population units yet. We do this so that the respondent burden is minimized for an arbitrary positive linear respondent burden function (index). Suppose that $m^{(1)}, m^{(2)}, \ldots, m^{(K)}$ samples have been randomly proportioned to a substratum of size $M$. Clearly the random proportioning procedure described above ensures that $m^{(k)} \le M$ for $k = 1, 2, \ldots, K$. Moreover, if the size of the total sample $m = m^{(1)} + m^{(2)} + \cdots + m^{(K)}$ randomly proportioned to the substratum is less than or equal to $M$, then any positive linear respondent burden index is minimized by selecting the total sample $m$ by simple random sampling (SRS) without replacement (WOR), where the first $m^{(1)}$ units selected are associated with survey I, the second $m^{(2)}$ units selected are associated with survey II, etc.

If the size of the total sample $m = m^{(1)} + m^{(2)} + \cdots + m^{(K)}$ is greater than $M$, then linear integer programming can be used to find an assignment of the total sample to the (unspecified) labels of the substratum that minimizes the respondent burden. Reiterating, we are working with labels here, so we are considering the burden of an arbitrary unit in the substratum, not the population units themselves, though we will use the natural terminology "population unit."

When assigning samples from $K$ surveys to the population units, there are $2^K$ possible ways of assigning the samples to any one population unit. These possible assignments can be represented by the $2^K$ $K$-dimensional vectors, call them assignment configurations,

$$\vec{v}_1 = (0, 0, 0, \ldots, 0)', \quad \vec{v}_2 = (1, 0, 0, \ldots, 0)', \quad \vec{v}_3 = (0, 1, 0, \ldots, 0)', \quad \ldots,$$
$$\vec{v}_{K+1} = (0, 0, 0, \ldots, 1)', \quad \vec{v}_{K+2} = (1, 1, 0, \ldots, 0)', \quad \vec{v}_{K+3} = (1, 0, 1, \ldots, 0)', \quad \ldots,$$
$$\vec{v}_{2^K-1} = (0, 1, 1, \ldots, 1)', \quad \vec{v}_{2^K} = (1, 1, 1, \ldots, 1)',$$

where component $k$ of the vector is 1 if the unit is sampled for the $k$th survey and 0 otherwise. Now we must determine the number $x_1$ of the population units to assign the configuration $\vec{v}_1$, the number $x_2$ to assign the configuration $\vec{v}_2$, ..., the number $x_{2^K}$ to assign the configuration $\vec{v}_{2^K}$.

Suppose the $i$th assignment configuration, represented by the $i$th assignment configuration vector $\vec{v}_i$, produces a respondent burden of $b_i \ge 0$. Then the problem of assigning the $m^{(1)}, m^{(2)}, \ldots, m^{(K)}$ samples to the $M$ (unspecified) unit labels such that the total respondent burden over the substratum is minimized is equivalent to minimizing the linear objective function (respondent burden index)

$$f(x_1, x_2, \ldots, x_{2^K}) = b_1 x_1 + b_2 x_2 + \cdots + b_{2^K} x_{2^K} = \vec{b}\,\vec{x}^{T}$$

subject to the $K + 1$ linear constraints

$$\vec{v}_1 x_1 + \vec{v}_2 x_2 + \cdots + \vec{v}_{2^K} x_{2^K} = \vec{m} = (m^{(1)}, m^{(2)}, \ldots, m^{(K)})' \quad (K \text{ constraints}),$$
$$x_1 + x_2 + \cdots + x_{2^K} = M,$$

where $x_1, x_2, \ldots, x_{2^K}$ are non-negative integers.

Since $\vec{v}_{K+2}, \ldots, \vec{v}_{2^K}$ can each be written as a nonnegative integer combination of $\vec{v}_2, \ldots, \vec{v}_{K+1}$ and since $m^{(k)} \le M$ for each $k$, it is easy to see that

$$\vec{v}_2 x_2 + \vec{v}_3 x_3 + \cdots + \vec{v}_{2^K} x_{2^K} = (m^{(1)}, m^{(2)}, \ldots, m^{(K)})'$$

has a solution over the nonnegative integers, say $x_2, \ldots, x_{2^K}$. Setting

$$x_1 = M - x_2 - x_3 - \cdots - x_{2^K}$$

then provides a feasible solution to the integer linear programming problem. So there exists a solution and hence there exists an optimal solution.

(d) Finally, select specific sampling units $u_i$ from the population. Consider a specific substratum and treat other substrata similarly. From the results of (c) above, we now randomly choose $x_2$ farmers from the $M$ substratum farmers for the configuration $\vec{v}_2$, randomly choose $x_3$ farmers for the configuration $\vec{v}_3$, ..., randomly choose $x_{2^K}$ farmers for the configuration $\vec{v}_{2^K}$. This sample of farmers reduces burden, yet within each stratum of each survey, this approach selects farmers with equal probability. Note that this sample is not a type of systematic sample; the randomness in (b)-(2) reveals this.
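
For a small substratum the integer program in (c) can be solved by plain enumeration, which makes the objective and the $K + 1$ constraints concrete. The sketch below swaps a brute-force search in for an integer linear programming solver, so it is only practical for small $K$ and $M$. The burden function used in the example (squared number of hits) is an assumed index, not one taken from the paper, and all names are hypothetical. The second function illustrates step (d) by attaching the solved configuration counts to randomly ordered units.

```python
import random
from itertools import product


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def solve_substratum(m, M, burden):
    """Minimise sum(b_i * x_i) subject to the K survey constraints and sum(x) = M.

    m is (m(1),...,m(K)); burden maps a 0/1 configuration tuple to b_i >= 0.
    Returns (minimum burden, {configuration: number of units}).
    """
    K = len(m)
    configs = list(product((0, 1), repeat=K))   # all 2^K assignment configurations
    b = [burden(v) for v in configs]
    best = None
    for x in compositions(M, len(configs)):     # enforces x_1 + ... + x_{2^K} = M
        meets_samples = all(
            sum(v[k] * xi for v, xi in zip(configs, x)) == m[k] for k in range(K))
        if meets_samples:
            cost = sum(bi * xi for bi, xi in zip(b, x))
            if best is None or cost < best[0]:
                best = (cost, dict(zip(configs, x)))
    return best


def assign_units(units, x, rng):
    """Step (d): give each configuration to the solved number of random units."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment, pos = {}, 0
    for config, count in x.items():
        for u in shuffled[pos:pos + count]:
            assignment[u] = config
        pos += count
    return assignment


# Two surveys needing 4 and 3 units from a substratum of M = 5 units (m > M).
cost, x = solve_substratum(m=(4, 3), M=5, burden=lambda v: sum(v) ** 2)
print(cost, x)
print(assign_units(["a", "b", "c", "d", "e"], x, random.Random(0)))
```

In this example at least two units must serve both surveys because $m^{(1)} + m^{(2)} - M = 2$, and the search returns exactly two such units.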

Method II

In Method II, a sample is selected by some preferred technique. That sample might be selected by some equal probability of selection technique using zonal sampling to reduce variance, e.g., by Chromy's procedure (Chromy, 1981). Method II largely retains that sample, but alters it to reduce burden. Thus Method II alters the sample by redistributing it within the substrata.

Since Method II is no more complicated than Method I and has many similarities to it, the following description is brief.

(a) Within each stratum of each of the $K$ surveys, independently select a sample with equal probability.

(b) Cross-classify the population as in (a) of Method I. This not only cross-classifies the population, it also cross-classifies the sample chosen in (a) of Method II. From the $N^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h_k,h_{k+1},\ldots,h_K,h_1,\ldots,h_{k-2},h_{k-1}}$ units in the substratum population $U^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h_k,h_{k+1},\ldots,h_K,h_1,\ldots,h_{k-2},h_{k-1}}$, denote the number sampled by $m^{(k,k+1,\ldots,K,1,\ldots,k-2,k-1)}_{h_k,h_{k+1},\ldots,h_K,h_1,\ldots,h_{k-2},h_{k-1}}$. This subsample size will not be changed, but it will be distributed among the substratum's population in (c) below.

(c) Within a substratum, reassign or swap some of the surveys associated with a sampling unit having excess respondent burden to a sampling unit having less respondent burden. If the respondent burden index is linear, then only one survey for one sampling unit need be reassigned to reduce burden, for example, when we measure respondent burden by the number of times we hit a farmer with a survey. Then we would move one survey from the farmer who got 4 hits to the farmer who got 0 hits, or to the farmer who got 2 hits if no farmer got 0 or 1 hit.

If the respondent burden is non-linear, then sometimes more than one survey must be reassigned to reduce burden. And when respondent burden is non-linear, then sometimes three sampling units (not two) must swap to ever reduce burden.

(d) Repeat (c) above until no reassignments can be made. Then respondent burden has been minimized.

With this method, one might want to retain most of the original sample selection for the first survey but not necessarily for the other surveys. Then, in (c), try to reassign other surveys before reassigning the first survey.

Sequential application of Method II is justified since each survey uses equal probability of selection in each stratum, which implies that all units of a substratum have the same selection probability for any given assignment configuration.
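
Steps (c) and (d) of Method II can be sketched as a simple rebalancing loop within one substratum. The version below assumes that burden is measured by the number of hits and that a move from the most-hit unit to the least-hit unit counts as an improvement whenever their hit counts differ by two or more. It is deterministic for illustration, whereas the paper's reassignment is random. Moving the highest-numbered eligible survey is one way to try other surveys before the first survey. The function name and the toy substratum are hypothetical.

```python
def reduce_burden(assign):
    """Even out survey hits among the units of ONE substratum.

    assign maps a unit id to the set of survey numbers it was sampled for.
    Surveys move one at a time, so each survey keeps its substratum sample size.
    The dictionary is modified in place and returned.
    """
    while True:
        heavy = max(assign, key=lambda u: len(assign[u]))  # most-hit unit
        light = min(assign, key=lambda u: len(assign[u]))  # least-hit unit
        if len(assign[heavy]) - len(assign[light]) < 2:
            return assign                                   # nothing left to gain
        # Surveys the heavy unit has and the light unit lacks; this set is
        # non-empty because the heavy unit holds strictly more surveys.
        movable = assign[heavy] - assign[light]
        survey = max(movable)        # leave survey 1 in place for as long as possible
        assign[heavy].remove(survey)
        assign[light].add(survey)


# Six units in one substratum; unit A starts with four hits.
sub = {"A": {1, 2, 3, 4}, "B": set(), "C": {2}, "D": {1, 2}, "E": set(), "F": set()}
print(reduce_burden(sub))
```

Each move lowers the sum of squared hit counts, so the loop terminates. In the toy substratum the number of units hit twice or more drops from two to one.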

Some NASS Examples

NASS administers many surveys with a large number of strata. For example, the Farm Costs and Returns Survey (FCRS/COPS) may have 18 strata, the Agriculture Survey may have 17 strata, and the Labor Survey may have 8 strata. This many strata over many surveys brings skepticism to any use of Methods I or II. One would expect many combinations of strata to contain but one individual, even for three surveys. Methods I and II could never reduce burden on such a sparsely populated (one individual) combination of strata.

Fortunately, most stratum combinations are empty while other combinations are well populated. Indeed, not only are many substratum combinations empty, many survey sampling combinations are empty. In some initial testing over nine major surveys, only 58 of the $2^9 = 512$ possible survey combinations occurred in Kansas and only 62 in Arkansas based on 1991 data. This fortuitous limitation on survey combinations gives some optimism that many combinations of strata will be well populated. A look at the number of population units selected for multiple surveys provides further optimism (see Table 1). No burden exceeds five surveys: no sampling unit was selected for more than five surveys, indicating that the possible number of substrata with only one unit is limited somewhat.

Table 1: Number of Survey Hits over Nine Surveys in 1991

Hits    Arkansas Frequency    Kansas Frequency
0       3491                  21474
1       21125                 40900
2       6136                  8638
3       846                   938
4       60                    74
5       1                     7

There is some optimal combination of surveys to consider when reducing respondent burden by either Methods I or II. More surveys result in too few farmers being classified for any of the many substrata combinations. Fewer surveys prevent Methods I and II from reducing any large burdens on some farmers, e.g., when NASS surveys one farmer for five different surveys.

In 1991, for the four major surveys (FCRS, Labor, Quarterly AG, and Cattle/Sheep), NASS initially sampled the following numbers of farmers.

Survey          Arkansas    Iowa    Kansas
FCRS            666         1836    1356
Labor           576         728     440
Quarterly AG    4442        6477    5881
Cattle/Sheep    1727        5507    3204

Method II reduced burden by about 70 percent over the three states Arkansas, Iowa and Kansas in 1991 and 1992. Table 2 below summarizes these reductions of burden. Since the NASS samples were essentially random within strata, a huge reduction can be made in burden with no cost (increase) in variance.

Table 2: Reduction in Multiple Sample Selections Using Method II for the FCRS, Labor, Quarterly AG, and Cattle/Sheep Surveys

Number of       1991                                  1992
Selections      Current  Method II  % Reduction       Current  Method II  % Reduction
4               0        0          –                 6        4          33
3               159      50         69                112      28         75
2               2620     782        70                2371     749        68
Total           2779     832        70                2489     781        69
Arkansas        733      205        72                735      124        83
Iowa            1105     252        77                801      204        75
Kansas          941      375        60                953      453        52

Acknowledgements

The authors would like to express their appreciation to George Hanuschak for assistance in initiating and supporting this project early on. They would also like to thank Ron Bosecker and Jim Davies for continued support. State Statistical Offices are quite concerned about individual farmer respondent burden and want the Agency to pursue methods to reduce it while maintaining an acceptable level of statistical integrity. This report is the first such statistical attempt in recent years in the Agency. Special thanks go to State Statisticians T. J. Byran, Ben Klugh, Dave Frank and Howard Holden for providing their substantial insights into the definition and potential relief of individual farmer respondent burden.

References

Chromy, James R. (1981). "Variance Estimators for a Sequential Sample Selection Procedure," Current Topics in Survey Sampling, D. Krewski, R. Platek, and J.N.K. Rao (Eds). Academic.

de Ree, S.J.M. (1983). A System of Co-ordinated Sampling to Spread Response Burden of Enterprises. Netherlands Central Bureau of Statistics.

Ohlsson, Esbjörn (1993). "Coordination of Samples Using Permanent Random Numbers," ASA International Conference on Establishment Survey Proceedings.

Tortora, Robert D. and Crank, Keith N. (1978). The Use of Unequal Probability Sampling to Reduce Respondent Burden. USDA/National Agricultural Research Service.