Papers presented at the ICES-III, June 18-21, 2007, Montreal, Quebec, Canada

DATA VALIDATION FOR THE 2006 CENSUS OF AGRICULTURE

Charlie Arcaro
Statistics Canada, R.H. Coats Building 17th floor, Ottawa, Ontario, K1A 0T6

ABSTRACT

The Census of Agriculture (CEAG) is held every five years and collects information on farming practices, commodities and finances. This census is usually conducted at the same time as the Census of Population and is used in redesigning the major Agricultural surveys and in updating the Farm Register. Data Validation is a major part of the data quality assurance evaluation and is the most expensive component of the CEAG process. The validation process is done at both the macro and micro levels to ensure quality at the major geographical and small area levels as well as to ensure good sampling frames for the major survey collections. As users increasingly scrutinize CEAG data at finer geographical levels, these data are examined even more closely in the Validation Process. There is also a need to maximize the efficient use of validation resources during this long process. This paper provides an outline of the 2006 Data Validation process and presents some new methods that addressed the issues of resource management and improved data quality for lower levels of geography.

Keywords: Data Validation, Referential Sources, Imputation, Census of Agriculture

1. INTRODUCTION

The Census of Agriculture (CEAG) provides a five-yearly snapshot of Canadian Agriculture and collects a wide variety of farm commodity, financial and operator information. The 2006 CEAG was collected in conjunction with the 2006 Canadian Census of Population to lower survey under-coverage and to locate missing CEAG farms. Between 2001 and 2006, the number of census farms and farm operators in Canada continued their long-term decline. The 2006 CEAG counted 229,373 farms, which was down 17,550 (7.1%) from 2001. It also accounted for 327,060 farm operators, a decline of 19,140 (5.5%) from 2001.

The CEAG is also used in redesigning many of the major Agricultural surveys at Statistics Canada. Most of these surveys use CEAG data in constructing their sampling frames and use auxiliary information from it as part of their estimation process. CEAG data is the main source of new farms as well as of updates to corporate, operator and geographical information for existing farms on the Farm Register (FR).

2. CEAG PROCESS

As with most surveys, the 2006 CEAG process begins with the collection of data. Most of the forms were either dropped off by the enumerator and mailed back by the respondent or completed via the Internet. Once the forms came back, their data were scanned and checked for intelligent character recognition (ICR) errors. There was also extensive follow-up of non-responding farms, as well as micro-level data editing and assignment of headquarter geography to each farm. Data imputation follows these steps and is in turn followed by data validation. Once validation is complete, the results are sent for certification and finally to output and dissemination.

Table 2.1 2006 CEAG Process

1. Collection
2. Data Scanning
3. Editing, Matching, Follow up
4. Imputation
5. Data Validation
6. Output & Dissemination

3. DATA VALIDATION

3.1 Introduction

Data Validation involves analyzing and correcting CEAG data by using a series of both aggregate and record level tables. These tables allow us to compare the number of farms, totals and structural changes from the last CEAG. We also compare to recent surveys (Referential Sources) and see if there are any major discrepancies between them. Farms that contribute significantly to the provincial totals are also scrutinized, as are the impacts of imputation and validation on our final data.

As shown in Table 2.1, Data Validation is performed after the imputation stage of the CEAG process but just before the final output & dissemination. Data Validation is also the costliest part of the CEAG post-collection process and involves analyzing and possibly changing data that looks “questionable”. This analysis is not only done for the current CEAG but also compares record and aggregate level data with other collections such as previous CEAGs and other agricultural surveys. For the 2006 CEAG, the validation process was done for each province separately, beginning in September 2006 with the Atlantic provinces (New Brunswick, Nova Scotia, Prince Edward Island and Newfoundland) and ending in February 2007 with the province of Alberta.

A series of macro and micro-level tools are used to analyze the data. Aggregate data is analyzed at various geographic levels, from the province level down to the Census Consolidated Subdivision (CCS), of which there are approximately 2135 across Canada. This lower level of data evaluation allows us to determine the quality of small area data, which is an important part of the validation process. Data at the finer geographical levels are expected to come under greater scrutiny from data users than ever before.

Senior validators prepare extensive plans for all variables to be analyzed, based upon their field knowledge and their expectations, before the Validation Process begins. These validators are responsible for a number of variables on each questionnaire and are usually the subject matter experts in that field. Their tasks also include the supervision of junior validators and helping them with their tasks.

For the 2006 CEAG, efforts were made to make more efficient use of financial and human resources. The results from the new strategies developed in determining farms to be validated will be given at the end of this paper.

3.2 Validation Tools

A series of Data Validation tools is used to create a list of reports which are then analyzed by the respective validation teams. These tools are located on the CEAG Central Processing System (CPS), which also keeps track of all validation reports generated. These reports provide information at both the farm and aggregate record level, including comparisons against the previous CEAG and other Referential Sources. It is important to note that any tables reproduced in this paper are either fictitious or from the 2004 CEAG Test.

3.2.1 Comparison Tables

Comparison Tables provide numbers that measure agricultural activity within a specific province and break them down to finer levels of geography. The tables provide data for the 2006 CEAG and 2001 CEAG as well as the percentage of change for each variable. Comparison Tables are used to explain and justify any dramatic changes that make the data “questionable”. In general, we assume that the numbers won’t change dramatically from one CEAG to the next; however, we can expect some commodities to alter significantly, and the Comparison Tables allow us to track and evaluate these differences.

3.2.2 Impact of Processing Tables

Impact of Processing (IP) Tables allow Data Validators to assess the impact of changes made to CEAG data by the imputation and validation processes. These tables are only available at the provincial level and allow validators to locate any potential problems with both the automated and manual imputation systems. IP Tables also allow validators to monitor the impact of their own work on the final data to be released. There are two tables produced in an IP report:

1. Certification/Summary Report. This report lists all the report variables and provides data similar to Comparison Tables as well as the effects of imputation and validation on the 2006 CEAG data.

2. Detail Report. This report provides a detailed breakdown of the changes in value for each variable in the summary report. This not only measures the effects of imputation and validation on the 2006 CEAG data but also shows which of these changes were positive and which were negative.

3.2.3 Top Contributor Tables

Top Contributor (TC) Tables are used to display record level farms that have the largest values for the main variable being considered. The tables also contain the farms’ impacts upon the provincial estimate and their pre-imputed values. The values are usually displayed in descending order; however, an option is available to display the records with the smallest non-zero values in ascending order (bottom contributors). TC tables are only available at the provincial level.

Similar to the IP tables, Top Contributor Tables are presented in two forms:

1. Summary Report, which contains one record per row in order by rank.

2. Detailed Report, which is the same as a Summary Report with the addition of administrative data.

By default, the top 100 farms for a given variable are investigated in the validation process, with the added functionality to investigate Bottom Contributors. Bottom Contributors generally affect greenhouse variables and mushrooms, as these variables must be reported in square feet or square meters.

Each report has two parts: a summary list of all operations, with a few variables, and an expanded section that provides more information. Match Reports allow access to farm records in order to identify those operations that account for significant differences for specific variables.

3.2.5 Distribution Tables

Distribution Tables (DT) present data for specific variables sorted into meaningful classes (e.g., distributions by farm type). Depending on the table, they can provide frequency counts, percent changes and other valuable information. Most Distribution Tables compare current and past census data. These reports are the Data Validation tool of choice for the analysis of tick-box variables but are nevertheless useful for some numeric variables as well.

There are three main DT variations:

1. Operator Tables. These tables cover all operators at detailed levels of geography. These tables include information such as number of operators resident on their operations, age and sex distributions, hours per
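
As a rough illustration of the kind of output a Distribution Table provides (frequency counts by class for the current and previous census, with the percent change between them), the following minimal Python sketch may be helpful. It is not the CPS implementation: the function name `distribution_table`, the farm-type labels and the counts are all hypothetical, and the percent change is assumed to be the usual (current - previous) / previous x 100.

```python
from collections import Counter


def distribution_table(current, previous):
    """Build rows of (class, previous count, current count, percent change).

    `current` and `previous` are lists holding one class label per farm
    (hypothetical inputs, for illustration only).
    """
    cur_counts = Counter(current)
    prev_counts = Counter(previous)
    rows = []
    # Include every class seen in either census, in alphabetical order.
    for cls in sorted(set(cur_counts) | set(prev_counts)):
        prev_n = prev_counts[cls]  # Counter returns 0 for a missing class
        cur_n = cur_counts[cls]
        # Percent change is undefined when the class was empty last time.
        if prev_n == 0:
            pct_change = None
        else:
            pct_change = round(100.0 * (cur_n - prev_n) / prev_n, 1)
        rows.append((cls, prev_n, cur_n, pct_change))
    return rows


# Fictitious farm-type data: one label per farm.
farm_types_2001 = ["dairy", "dairy", "grain", "grain", "grain", "grain"]
farm_types_2006 = ["dairy", "grain", "grain", "grain", "poultry"]

table = distribution_table(farm_types_2006, farm_types_2001)
for row in table:
    print(row)
```

A class that appears in only one census (here the fictitious "poultry" class) is kept in the table with an undefined percent change, since such appearances or disappearances are exactly the kind of "questionable" change a validator would want to see.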