
Statistical Aspects of a Census

Carol C. House

______

This paper focuses on the statistical aspects of a census. It addresses issues such as coverage, classification, sampling, non-sampling error, post collection processing, weighting and disclosure avoidance. The intent of the paper is to demonstrate that most (if not all) of the statistical issues that are important in conducting a survey are equally germane to conducting a census.

KEY WORDS: census, coverage, non-response, error, frames, imputation, disclosure

______

1 INTRODUCTION¹

In this paper the author will provide a basic overview of the statistical aspects of planning, conducting and publishing data from a census. The intent of the paper is to demonstrate that most (if not all) of the statistical issues that are important in conducting a survey are equally germane to conducting a census.

In order to establish the scope for this paper, we begin by reviewing some basic definitions. Webster's New Collegiate Dictionary defines a “census” to be “a count of the population and a property evaluation in early Rome”. Although this is particularly appropriate to quote at the CAESAR conference, we will want to utilize a broader definition. The International Statistical Institute (ISI) in its Dictionary of Statistical Terms defines a census to be “the complete enumeration of a population or group at a point in time with respect to well-defined characteristics”. This definition is more usable. We now look at the term “statistics” to further focus the paper. Again from ISI we find that statistics is the “numerical data relating to an aggregate of individuals; the science of collecting, analyzing and interpreting such data.” Together these definitions render a focus for this paper: those issues germane to the science and/or methodology of collecting, analyzing and interpreting data through what is intended to be a complete enumeration of a population at a point in time with respect to well-defined characteristics. Further, because of the nature of the CAESAR conference, this paper will direct its discussion to agricultural censuses. Important issues include the (sampling) frame, sampling methodology, non-sampling error, processing, weighting, modeling, disclosure avoidance, and data dissemination. This paper touches on each of these issues as appropriate to the paper’s focus on censuses of agriculture.

¹ This paper was presented at the Conference on Agricultural and Environmental Statistical Applications in Rome (CAESAR), June 5-7, 2001. Carol House is with the National Agricultural Statistics Service, Research and Development Division, and is the Division Director.

2 FRAME

Whether conducting a survey or a census, a core component of
methodology is the sampling frame. The frame usually consists of a listing of population units, but alternatively it might be a structure from which clusters of units can be delineated. For agricultural censuses, the frame is likely to be a business register or a farm register. Alternatively it might be a listing of villages from which individual farm units can be delineated during data collection. The use of an area frame is a third common alternative. Often more than a single frame is used for a census. Papers presented at the Agricultural Statistics 2000 conference highlight the diversity of sampling frames used for agricultural censuses (Sward et al.; Kiregyera).

There are three basic statistical concerns associated with sampling frames: coverage, classification and duplication. These concerns are equally relevant whether the frame will be used for a census or sampled for a survey.

2.1 Coverage

Coverage deals with how well the frame fully delineates all population units. The statistician’s goal should be to maximize coverage of the frame and to provide measures of under-coverage. For agricultural censuses, coverage often differs by size of farming operation. Larger farms are covered more completely, and smaller farms less so. Complete coverage of smaller farms is highly problematic, and statistical organizations have used different strategies to deal with this coverage problem.

The Australian Bureau of Statistics (Sward et al., 1998) intentionally excludes smaller farms from its business register and census of agriculture. It focuses instead on production agriculture, and maintains that its business register has good coverage for that target population. Statistics Canada (Lim et al., 2000) has dropped the use of an area frame as part of its census of agriculture, and is conducting research on using various sources of administrative data to improve coverage of its farm register. Kiregyera (1998) reports that a typical agriculture census in Africa will completely enumerate larger operations (identified on some listing), but does not attempt to enumerate completely the smaller operations because of the resources required to do so. Instead, a sample is selected from a frame of villages or land areas, and small farms are delineated within the sampled areas for enumeration. In the United States, the farm register used for the 1997 Census of Agriculture covered 86.3% of all farms, but 96.4% of farms with gross value of sales over $10,000 and 99.5% of the total value of agricultural products. The U.S. uses a separate area sampling frame to measure under-coverage of its farm register, and has published global measures of coverage. It is investigating methodology to model under-coverage as part of the 2002 census and potentially publish more detailed measures of that coverage.

2.2 Classification

A second basic concern with a sampling frame is whether frame units are accurately classified. The primary
classification is whether the unit is, in fact, a member of the target population, and thus should be represented on the frame. For example, in the U.S. there is an official definition of a farm: operations that sold $1,000 or more of agricultural products during the target year, or would normally sell that much. The first part of the definition is fairly straightforward, but the second causes considerable difficulty with classification.

Classification is further complicated when a population unit is linked with, or owned by, another business entity. This is an ongoing problem for all business registers. The statistician’s goal is to employ reasonable, standardized classification algorithms that are consistent with potential uses of the census data. For example, a large farming operation may be a part of a larger, vertically integrated enterprise which may have holdings under semi-autonomous management in several dispersed geographic areas. Should each geographically dispersed establishment be considered a farm, or should the enterprise be considered a single farm and placed only once on the sampling frame? Another example is when large conglomerates contract with small, independent farmers to raise livestock. The larger firm (contractor) places immature animals with the contractee who raises the animals. The contractor maintains ownership of the livestock, supplies feed and other input expenses, then removes and markets the mature animals. Which is the farm: the contractor, the contractee, or both?

2.3 Duplication

A third basic concern with a sampling frame is duplication. There needs to be a one-to-one correspondence between population units and frame units. Duplication occurs when a population unit is represented by more than one frame unit. Similar to misclassification, duplication is an ongoing concern with all business registers. Software is available to match a list against itself to search for potential duplication. This process may eliminate much of the duplication prior to data collection. Often it is important in a census or survey to add questions to the data collection instrument that will assist in a post-collection evaluation of duplication. In its 1997 Census of Agriculture, the U.S. conducted a separate “classification error study” in conjunction with the census. For this study, a sample of census respondents was re-contacted to examine potential misclassification and duplication, and to estimate levels of both.

3 SAMPLING

When one initially thinks of a census or complete enumeration, statistical sampling may not seem relevant. However, in the implementation of agricultural censuses throughout the world, a substantial amount of sampling has been employed. David (1998) presents a strong rationale for extensive use of sampling for agricultural censuses, citing specifically those conducted in Nepal and the Philippines. The reader is encouraged to review his paper for more details. This paper does not attempt an
intensive discussion of different sampling techniques, but identifies some of the major areas where sampling has been (or can be) employed.

Reducing costs is a major reason that statistical organizations have employed sampling in their census processes. We have already discussed how agricultural censuses in Africa, Nepal, and the Philippines have used sampling extensively for smaller farms. Sampling may also be used in quality control and assessment procedures. Examples include: conducting a sample survey of census non-respondents to assist in non-response adjustment; or conducting a specialized follow-up survey of census respondents to more carefully examine potential duplication and classification errors. The U.S. uses a sample survey based on an area frame to conduct a coverage evaluation of its farm register and census. It may be advantageous in a large collection of data to sub-divide the population and use somewhat different questionnaires or collection methodologies on each group. Here again is a role for sampling. For example, in order to reduce overall respondent burden some organizations prepare both aggregated and detailed versions of a census questionnaire and use statistical sampling to assign questionnaire versions to the frame units. Alternatively sampling may facilitate efforts to evaluate the effect of incentives, to use pre-census letters as response inducements, or to examine response rates by different modes of data collection.

4 NON-SAMPLING ERROR

Collection of data generates sampling and non-sampling errors. We have already discussed situations in which sampling, and thus sampling error, may be relevant in census data collection. Non-sampling errors are always present, and generally can be expected to increase as the number of contacts and the complexity of questions increase. Since censuses generally have many contacts and fairly involved data collection instruments, one can expect them to generate a fairly high level of non-sampling error. In fact, David (1998) uses expected higher levels of non-sampling error in his rationale for avoiding complete enumeration in censuses of agriculture.

“… [a census produces] higher non-sampling error which is not necessarily less than the total error in a sample enumeration. What is not said often enough is that, on account of their sizes, complete enumeration CA’s [censuses of agriculture] use different, less expensive and less accurate data collection methods than those employed in the intercensal surveys.”

Two categories of non-sampling error are response error and error due to non-response.

4.1 Response Error

The literature (Groves; Lyberg et al.) is fairly rich in discussions of various components of this type of error. Self-enumeration methods can be more
susceptible to certain kinds of response errors, which could be mitigated if interviewer collection were employed. Censuses, because of their large size, are often carried out through self-enumeration procedures. The Office for National Statistics in Britain (Eldridge et al., 2000) has begun to employ cognitive interviewing techniques for establishment surveys much the same as they have traditionally employed for household surveys. They conclude that the “… use of focus groups and in-depth interviews to explore the meaning of terms and to gain insight into the backgrounds and perspectives of potential respondents can be very valuable …” They further conclude regarding self-administered collection that “…layout, graphics, instructions, definitions, routing etc. need testing.” Kiregyera (1998) additionally focuses readers’ attention on particular difficulties that are encountered when collecting data from farmers in developing countries. These include the “failure of holders to provide accurate estimates of crop area and production … attributed to many causes including lack of knowledge about the size of fields and standard measurement units, or unwillingness to report correctly for a number of reasons (e.g. taboos, fear of taxation, etc.).”

The statistician’s role is fourfold: to understand the “total error” profile of the census, to develop data collection instruments and procedures that minimize total error, to identify and correct errors during post collection processing, and to provide, to the extent reasonable, measures of the important components of error.

4.2 Non-Response

The statistician’s role in addressing non-response is very similar to his/her role in addressing response error: to understand the reasons for non-response, to develop data collection procedures that will maximize response, to provide measures of non-response error, and to impute or otherwise adjust for those errors.

Organizations employ a variety of strategies to maximize response. These include publicity, pre-collection contacts, and incentives. Some switch data collection modes between waves of collection to achieve higher response rates. Others are developing procedures that allow them to target non-response follow-up to those establishments which are most likely to significantly impact the estimates (McKenzie, 2000).

A simple method for adjusting for unit non-response in sample surveys is to modify the sampling weights so that respondent weights are increased to account for non-respondents. The assumption in this process is that the respondents and non-respondents have similar characteristics. Most often, the re-weighting is done within strata to strengthen the basis for this assumption. A parallel process can be used for censuses. Weight groups can be developed so that population units within groups are expected to be similar in relationship to important data items. All respondents in a weight group may be given a positive weight, or donor
respondents may be identified to receive a positive weight. Weight adjustment for item non-response, although possible, quickly becomes complex as it creates a different weight for each item.

Imputation is widely used to address missing data, particularly that due to item non-response. Entire record imputation is also an appropriate method of addressing unit non-response. Manual imputation of missing data is a fairly widespread practice in data collection activities. Many survey organizations have been moving toward more automated imputation methods because of concerns about consistency and costs associated with manual imputation, and to improve the ability to measure the impact of imputation. Automating processes like imputation is particularly important for censuses because of the volume of records that must be processed.

Yost et al. (2000) identify five categories of automated imputations: i) deterministic imputation, where only one correct value exists (such as the missing sum at the bottom of a column of numbers); ii) model-based imputation, the use of averages, ratios, regression estimates, etc. to impute a value; iii) deck imputation, where a donor questionnaire is used to supply the missing value; iv) mixed imputation, where more than one method is used; and v) the use of expert systems. Many systems make imputations based on a specified hierarchy of methods. Each item on the questionnaire is resolved according to its own hierarchy of approaches, the next being automatically tried when the previous method has failed. A nearest neighbor approach based on spatial “nearness” may make more sense for a census, where there is a greater density of responses, than it would in a more sparsely distributed sample survey.

5 POST COLLECTION PROCESSING

Post collection processing involves a variety of different activities, several of which (imputation, weighting, etc.) are discussed in other sections of this paper. Here we will briefly address editing and analysis of data. Because of the volume of information associated with a census data collection, it becomes very important to automate as many of these edit and analysis processes as possible. Atkinson and House (2001) address this issue and provide several guiding principles that the National Agricultural Statistics Service is using in building an edit and analysis system for use on the 2002 Census of Agriculture: a) automate as much as possible, minimizing required manual intervention; b) adopt a “less is more” philosophy to editing, creating a leaner edit that focuses on critical data problems; and c) identify problems as early as possible.

Editing and analysis must include the ability to examine individual records for consistency and completeness. This is often referred to as “micro” editing or “input” editing. Consistent with the guiding principles discussed above, the Australian Bureau of Statistics has implemented the use of significance criteria in input editing of agricultural data (Farwell and Raine, 2000). They
contend that “… obtaining a corrected value through clerical action is expensive (particularly if respondent re-contact is involved) and the effort is wasted if the resulting actions have only a minor effect on estimates.” They have developed a theoretical framework for this approach.

Editing and analysis must also include the ability to perform macro-level analysis or output editing. These processes examine trends for important subpopulations, compare geographical regions, look at data distributions and search for outliers. DesJardins and Winkler (2000) discuss the importance of using graphical techniques to explore data and conduct outlier and inlier analysis. Atkinson and House concur with these conclusions and further discuss the importance of having the macro-analysis tool integrated effectively with tools for user-defined ad-hoc queries.

6 WEIGHTING

When one initially thinks of a census, one thinks of tallying up numbers from a complete enumeration, and publishing that information in a variety of cross tabulations that add to the total. This paper has already discussed a variety of situations in which weighting may be a part of a census process. In this section we focus on the interaction between weighting and the rounding of data values.

Many of the important data items collected in an agricultural census are intrinsically “integral” numbers, making sense only in whole increments (e.g. the number of farms, number of farmers, number of hogs, etc.). For these data, desirable characteristics of the census tabulation are to have integer values at all published levels of disaggregation, and to have those cells sum appropriately to aggregated totals.

The existence of non-integer weights creates non-integer weighted data items. Rounding each of the multiple cell totals creates the situation that they may not add to rounded aggregate totals. This issue can be addressed in one of several ways. In the U.S., the census of agriculture has traditionally employed the technique of rounding weights to integers, and then using these integerized weights. An alternative would be to retain the non-integer weights and round the weighted data to integers. A recent evaluation of census data in the U.S. (Scholetzky, 2000) showed that totals produced using the rounded weighted data values were more precise than the totals produced using the integerized weights, except for the demographic characteristics, number of farms, and ratio per farm estimates. A drawback to using rounded weighted data values is the complexity these procedures add to storing and processing information.

7 MODELING

Modeling can be effective within a census process by improving estimates of small geographic areas and rare subpopulations. Small area statistics is perhaps one of the most important products from a census. However, a number of factors may impact the census’ ability to produce high quality
statistics at fairly disaggregate levels. The highly skewed distribution of data, which is intrinsic to the structure of modern farming, creates estimation difficulties. For example, many larger operations have production units which cross the political or geographic boundaries used in publication. If data are collected for the large operation and published as if the “whole” farm is contained within a single geographic area, the result will be an over-estimate of agricultural production within that area and a corresponding under-estimate within surrounding areas. Mathematical models may be used effectively to prorate the operation totals to appropriate geographic areas.

Census processes for measuring and adjusting non-response, misclassification, and coverage may produce acceptable aggregate estimates while being inadequate for use at the more disaggregate publication levels. Statistical modeling and smoothing methodology may be used to smooth the measures so that they produce more reasonable disaggregate measures. For example, for the 1997 Census of Agriculture the U.S. provided measures of frame coverage at the state level for farm counts for major subpopulations. It is evaluating several smoothing techniques that, if successful, may allow the 2002 census release to include coverage estimates at the county level instead of just state level, and for production data as well as farm counts.

Although a census may be designed to collect all information from all population units, there are many cases in which circumstances and efficiencies require that census data not stand alone. We have already discussed methodologies in which a separate survey may be used to adjust census numbers for non-response, misclassification and/or coverage. Sometimes sources of administrative data are mixed with census data to reduce respondent burden or data collection costs. Most often the administrative data must be modeled to make them more applicable to the census data elements. Alternatively, some census collection procedures utilize a “long” and “short” version of the questionnaire so that not all respondents are asked every question. Combining the data from these questionnaire versions may also require some type of modeling.

8 DISCLOSURE AVOIDANCE

The use of disclosure avoidance methodology is critically important in preparing census and survey data for publication. Disclosure avoidance can be very complex for agricultural census publications because of the scope, complexity and size of these undertakings. Disclosure avoidance is made more difficult by the highly skewed nature of the farm population. Data from large, or highly specialized, farming operations are hard to disguise, especially when publishing totals disaggregated to small geographic areas.

Disclosure avoidance is typically accomplished through the suppression of data cells at publication. A primary suppression occurs when a cell in a publication table requires suppressing
because the data for the cell violate some rule or rules defined by the statistical agency. Typical rules include:

a) threshold rule: the total number of respondents is less than some specified number, e.g. the cell may be suppressed if it had fewer than 20 positive responses.
b) (n,k) rule: a small number of respondents constitute a large percentage of the cell’s value, for example a (2,60) rule would say to suppress if 2 or fewer responses made up 60 percent or more of the cell’s value.
c) p-percent rule: if a reported value for any respondent can be estimated within some specified percentage.

Secondary suppression occurs when a cell becomes a disclosure risk from actions taken during the primary suppression routines. These additional cells must be chosen in a way that provides adequate protection to the primary cell and at the same time makes the value of the cell mathematically underivable.

Zayatz et al. (2000) have discussed alternatives to cell suppression. They propose a methodology that adds “noise” to record level data. The approach does not attempt to add noise to each publication cell, but uses a random assignment of multipliers to control the effect of the noise on different types of cells. This results in the noise having the greatest impact on sensitive cells, with little impact on cells that do not require suppression.

9 DISSEMINATION

Data products from a census are typically extensive volumes of interconnected tables. The Internet, CD-ROM, and other technical tools now provide statistical agencies with exciting options for dissemination of dense pools of information. This paper will discuss several opportunities to provide high quality data products.

The first component of a quality dissemination system is metadata, or data about the data. Dippo (2000) expounds on the importance of providing metadata to users of statistical products and on the components of quality metadata.

“Powerful tools like databases and the Internet have vastly increased communication and sharing of data among rapidly growing circles of users of many different categories. This development has highlighted the importance of metadata, since easily available data without appropriate metadata could sometimes be more harmful than beneficial.”

“Metadata descriptions go beyond the pure form and contents of data. Metadata are also used to describe administrative facts about data, like who created them, and when. Such metadata may facilitate efficient searching and locating of data. Other types of metadata describe the processes behind the data, how the data were collected and processed, before they were communicated or stored in a database. An operational description of the data collection
process behind the data (including e.g. questions asked to respondents) is often more useful than an abstract definition of the “ideal” concept behind the data.”

The Internet has become a focal point for the spread of information. Web users expect: to have sufficient guidance on use; to be able to find information quickly, even if they do not know precisely what they are looking for; to understand the database organization and naming conventions; and to be able to easily retrieve information once it is found. This implies the need, at a minimum, for high quality web design, searchable databases, and easy to use print and download mechanisms. The next step is to provide tools such as interactive graphical analysis with drill-down capabilities and fully functional interactive query systems. Graphs, charts and tables would be linked, and users could switch between these different representations of information. Finally, there would be links between the census information and databases and websites containing information on agriculture, rural development, and economics.

10 SUMMARY

Conducting a census involves a number of highly complex statistical processes. One must begin with a quality sampling frame, in which errors due to under-coverage, misclassification and duplication are minimized. There may be opportunities in which statistical sampling will help bring efficiency to the data collection or facilitate quality control measurements. Non-sampling errors will be present, and the design must deal effectively with both response and non-response errors. Post collection processing should allow both micro and macro analysis. Census processing will probably involve weighting and some type of modeling. The dissemination processes should prevent disclosure of respondent data while providing useful access by data users.

REFERENCES

Atkinson, D., House, C. (2001) A Generalized Edit and Analysis System for Agricultural Data, Proceedings of the Conference on Agricultural and Environmental Statistical Applications in Rome, International Statistical Institute. The Netherlands.

David, I. (1998) Sampling Strategy for Agricultural Censuses and Surveys in Developing Countries, Proceedings of Agricultural Statistics 2000, 83-95. International Statistical Institute. The Netherlands.

DesJardins, D., Winkler, W. (2000) Design of Inlier and Outlier Edits for Business Surveys, Proceedings of the 2nd International Conference on Establishment Surveys, 547-556, American Statistical Association, Washington.

Dippo, C. (2000) The Role of Metadata in Statistics, Proceedings of the 2nd International Conference on Establishment Surveys, 909-918, American Statistical Association, Washington.

Eldridge, J., Martin, J., White, A. (2000) The Use of Cognitive Methods to Improve Establishment Surveys in Britain, Proceedings of the 2nd International Conference on Establishment Surveys, 307-316, American Statistical Association, Washington.

Farwell, K., Raine, M. (2000) Some Current Approaches to Editing in the ABS, Proceedings of the 2nd International Conference on Establishment Surveys, 529-538, American Statistical Association, Washington.

Groves, R. (1989) Survey Errors and Survey Costs, John Wiley & Sons, New York.

International Statistical Institute (1990) A Dictionary of Statistical Terms, Published for the International Statistical Institute by Longman Scientific & Technical. Essex CM20 2JE England.

Kiregyera, B. (1998) Experiences with Census of Agriculture in Africa, Proceedings of Agricultural Statistics 2000, 71-82, International Statistical Institute. The Netherlands.

Lim, A., Miller, M., Morabito, J. (2000) Research Into Improving Frame Coverage for Agricultural Surveys at Statistics Canada, Proceedings of the 2nd International Conference on Establishment Surveys, 131-136, American Statistical Association, Washington.

Lyberg, L., Biemer, P., Collins, M., De Leeuw, E., Dippo, C., Schwarz, N., Trewin, D., Eds. (1997) Survey Measurement and Process Quality, John Wiley & Sons, Inc. New York.

McKenzie, R. (2000) A Framework For Priority Contact of Non Respondents, Proceedings of the 2nd International Conference on Establishment Surveys, 473-482, American Statistical Association, Washington.

Scholetzky, W. (2000) Evaluation of Integer Weighting for the 1997 Census of Agriculture, RD Research Report Number RD-00-01, National Agricultural Statistics Service. U.S. Department of Agriculture. Washington.

Sward, G., Hefferman, G., and Mackay, A. (1998) Experience with Annual Censuses of Agriculture, Proceedings of Agricultural Statistics 2000, 59-70, International Statistical Institute. The Netherlands.

Webster's New Collegiate Dictionary (1977) G. & C. Merriam Company, Springfield, MA.

Yost, M., Atkinson, D., Miller, J., Parsons, J., Pense, R., Swaim, N. (2000) Developing a State of the Art Editing, Imputation and Analysis System for the 2002 Agricultural Census and Beyond, an unpublished staff report, National Agricultural Statistics Service. U.S. Department of Agriculture. Washington.

Zayatz, L., Evans, T., Slanta, J. (2000) Using Noise for Disclosure Limitation of Establishment Tabular Data, Proceedings of the 2nd International Conference on Establishment Surveys, 877-886, American Statistical Association, Washington.
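
As an illustration of the primary suppression rules listed in Section 8, the following Python sketch flags a publication cell under a threshold rule, an (n,k) rule and a p-percent rule. It is a minimal sketch and not any agency's production procedure. The function names, the default parameters (20 respondents, a (2,60) rule, p = 10) and the sample cells are assumptions chosen for illustration. The p-percent test uses one common formulation, which the paper does not spell out: the cell is treated as sensitive when the contributions other than the two largest total less than p percent of the largest contribution.

```python
from typing import Sequence


def threshold_rule(values: Sequence[float], min_respondents: int = 20) -> bool:
    """Flag the cell when it has fewer positive responses than the minimum."""
    positive = [v for v in values if v > 0]
    return len(positive) < min_respondents


def nk_rule(values: Sequence[float], n: int = 2, k: float = 60.0) -> bool:
    """Flag the cell when the n largest responses make up k percent or more."""
    total = sum(values)
    if total <= 0:
        return False
    top_n = sum(sorted(values, reverse=True)[:n])
    return 100.0 * top_n / total >= k


def p_percent_rule(values: Sequence[float], p: float = 10.0) -> bool:
    """Flag the cell when the largest response could be estimated too closely.

    Assumed formulation: the second-largest respondent subtracts its own
    value from the cell total, so its estimation error for the largest
    respondent equals the sum of all remaining responses.
    """
    ordered = sorted(values, reverse=True)
    if not ordered or ordered[0] <= 0:
        return False
    remainder = sum(ordered[2:])
    return remainder < (p / 100.0) * ordered[0]


def needs_primary_suppression(values: Sequence[float],
                              min_respondents: int = 20,
                              n: int = 2, k: float = 60.0,
                              p: float = 10.0) -> bool:
    """A cell is a primary suppression if it violates any of the rules."""
    return (threshold_rule(values, min_respondents)
            or nk_rule(values, n, k)
            or p_percent_rule(values, p))


# Hypothetical cell: one large operation plus 24 small ones.
dominated_cell = [900.0] + [10.0] * 24
# Hypothetical cell: 25 operations of equal size.
even_cell = [100.0] * 25

print(needs_primary_suppression(dominated_cell))  # True, fails the (2,60) rule
print(needs_primary_suppression(even_cell))       # False, passes all three rules
```

The sketch covers primary suppression only. Choosing secondary (complementary) suppressions so that a primary cell cannot be derived from table totals is a separate and harder problem that is not attempted here.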

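
The weight-group adjustment for unit non-response described in Section 4.2 and the rounding issue described in Section 6 can be illustrated together. The Python sketch below is a simplified illustration under stated assumptions: the record layout (`group`, `county`, `responded`, `farms`), the function names and the sample data are hypothetical; every frame unit starts with a weight of 1, as in a complete enumeration; and each respondent in a weight group receives the same adjusted weight (frame units in the group divided by respondents in the group). It does not reproduce the procedures of any particular statistical agency.

```python
from collections import defaultdict


def nonresponse_adjusted_weights(units):
    """Return one weight per unit: frame count / respondent count within
    the unit's weight group for respondents, and 0.0 for non-respondents."""
    frame_count = defaultdict(int)
    respondent_count = defaultdict(int)
    for unit in units:
        frame_count[unit["group"]] += 1
        if unit["responded"]:
            respondent_count[unit["group"]] += 1

    weights = []
    for unit in units:
        if unit["responded"]:
            group = unit["group"]
            weights.append(frame_count[group] / respondent_count[group])
        else:
            weights.append(0.0)
    return weights


def weighted_cell_totals(units, weights, item, cell_key):
    """Sum weight * item over respondents, separately for each cell."""
    totals = defaultdict(float)
    for unit, weight in zip(units, weights):
        if weight > 0:
            totals[unit[cell_key]] += weight * unit[item]
    return dict(totals)


# Hypothetical weight group: 13 frame units, of which 10 responded
# (two respondents in each of five counties) and 3 did not.
units = [
    {"group": "small", "county": county, "responded": True, "farms": 1}
    for county in ("A", "B", "C", "D", "E")
    for _ in range(2)
]
units += [
    {"group": "small", "county": "A", "responded": False, "farms": None}
    for _ in range(3)
]

weights = nonresponse_adjusted_weights(units)  # 13 / 10 = 1.3 per respondent
county_totals = weighted_cell_totals(units, weights, "farms", "county")

# Round each published cell and, separately, the overall total.
rounded_counties = {c: round(t) for c, t in county_totals.items()}
rounded_overall = round(sum(county_totals.values()))

print(rounded_counties)
print(sum(rounded_counties.values()), rounded_overall)
```

In this example each county's weighted farm count is 2.6, which rounds to 3, so the rounded county cells sum to 15 farms while the rounded overall total is 13. This is the kind of non-additivity that Section 6 describes when non-integer weights are combined with cell-by-cell rounding.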