Regression Estimation for Survey Samples
Survey Methodology, June 2002, Vol. 28, No. 1, pp. 5-23
Statistics Canada, Catalogue No. 12-001-XIE, ISSN 1492-0921

Wayne A. Fuller 1

Abstract

Regression and regression related procedures have become common in survey estimation. We review the basic properties of regression estimators, discuss implementation of regression estimation, and investigate variance estimation for regression estimators. The role of models in constructing regression estimators and the use of regression in nonresponse adjustment are explored.
Key Words: Auxiliary information; Calibration; Least squares; Design consistency; Linear prediction.

1. Wayne Fuller, Emeritus Distinguished Professor, Iowa State University, 221 Snedecor Hall, Ames, IA 50011-1210, U.S.A.

1. Introduction

Design and estimation in survey sampling involve the use of information about the study population to construct efficient procedures. While design and estimation are intimately related, with estimators depending on the design, the two topics are often treated somewhat separately in the survey sampling literature. We follow tradition first studying estimation treating the design as given. The estimation task is to combine the available information about the population, with the sample data to produce good representations of characteristics of interest.

Regression estimation is one of the important procedures that use population information or information from a larger sample, to construct estimators with good efficiency. The information, sometimes called auxiliary information, may have been used in the design or may not have been available at the design stage. In surveys of the human population, the information often comes from official sources such as the national census. Similar sources may provide information for other types of surveys. For example, in a survey of land use the total surface area, the area owned by the national government, and the area in permanent water bodies may be available from national data archives.

Three distinct situations can be identified with respect to the nature of the auxiliary information that is available. In the first, the values of the auxiliary vector x are known for each element in the population at the time of sample selection. In this case the auxiliary variable can be used in designing the sample selection procedure.

In the second situation all values of the vector x are known, but a particular value cannot be associated with a particular element until the sample is observed. In this case, the auxiliary information cannot be used in design, but a wide range of estimation options are available once the observations are available. For example, the population census may give the age-sex distribution of the population, but a list of individuals and their characteristics is not available to non-governmental institutions selecting samples.

In the third situation, only the population mean of x is known, or known for a large sample. In this case, the auxiliary information cannot be used in design and the estimation options are limited. For example the U.S. Department of Agriculture might release an estimate of the total number of animals of a particular type on farms on a particular date. Our discussion concentrates on this situation.

Two estimation situations can also be identified. In one, a single variable and a parameter, or a very small number of parameters, is under consideration. The analyst is willing to invest a great deal of effort in the analysis, has a well formulated population model, and is prepared to support the estimation procedure on the basis of the reasonableness of the model. In the second situation, a large number of analyses of a large number of variables is anticipated. No single model is judged adequate for all variables. The prototypical example of the second situation is the case in which a data set is prepared by the survey sampler to be analyzed by others. Because the person preparing the data set does not have knowledge of the analysis variables, emphasis is placed on the use of estimators that can be defended with minimal recourse to models.

Regression estimators fall in the class of linear estimators. Linear estimators have a particular advantage in survey sampling because once the weights are calculated they are appropriate for any analysis variable. Several properties of estimators will be examined in our discussion. Given a model, we accept the classical goal of minimizing the mean square error in a class of estimators. That class may be the class of linear estimators that are unbiased under the model, but the class may be further restricted.

Estimators that are scale and location invariant can be used in general settings. Mickey (1959) suggested that the term regression estimator be restricted to linear estimators that are location and scale invariant. While we may not adhere strictly to this definition, we support the distinction between estimators that are location and scale invariant and those that are not. We consider location invariance to be important for sampling designs where the unit of interest for analysis is also the sampling unit. For cluster and two stage designs in which weights are constructed for primary sampling unit totals, location invariance is less important.

Models play an important role in the construction of regression estimators. It is desirable that the estimators retain good properties if the model specification is not exact. Therefore properties conditional on the realized finite population, as well as properties under the model, are important.

Linear estimators that reproduce the known means of the auxiliary variables are said to be calibrated. This is a desirable property in that, for example, the marginals of tables with an auxiliary variable as an analysis variable agree with known totals. If the auxiliary variable is of no analytic interest, then calibration is less important.

2. Background

A great deal of research was conducted in the 1970's and 1980's on the general nature of the regression estimator in survey samples and on the degree to which the model prediction approach can be reconciled with the design perspective. Fuller (1973, 1975) gave the large sample properties of a vector of regression coefficients computed from a survey sample. Isaki (1970) studied regression estimators and the results were published in expanded versions in Isaki and Fuller (1982) and Fuller and Isaki (1981). It was shown that a regression estimator constructed under a model is design consistent for the population mean if the model contains certain variables. Cassel, Särndal and Wretman (1976) considered both model and design principles in estimator construction and suggested the term "generalized regression estimator" for design consistent estimators of the total of the form

$$\hat{T}_{y,\mathrm{GREG}} = \hat{T}_{y,\mathrm{HT}} + (T_{x,N} - \hat{T}_{x,\mathrm{HT}})\,\hat{\beta},$$

where $\hat{T}_{y,\mathrm{HT}}$ and $\hat{T}_{x,\mathrm{HT}}$ are the Horvitz-Thompson estimators of the totals of y and x, respectively, $T_{x,N}$ is the known population total of x and $\hat{\beta}$ is an estimated regression coefficient.
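As an illustration only, the generalized regression estimator of the total can be sketched in a few lines of Python for a scalar auxiliary variable. Several things here are assumptions that do not come from the text: the function name greg_total, the simulated population, the simple random sample without replacement (so every inclusion probability is n/N), and the particular choice of the estimated coefficient as a design-weighted least squares slope through the origin. The sketch also writes the estimator as a weighted sum of the sample y values, which shows the two properties discussed in the Introduction: the weights do not depend on y, and they reproduce the known total of x (calibration).

```python
import random


def greg_total(y, x, pi, tx_pop):
    """GREG estimate of the y total for a scalar auxiliary variable x.

    y, x   : sample values
    pi     : inclusion probabilities of the sampled elements
    tx_pop : known population total of x
    Returns (estimate, weights), where estimate == sum(w_i * y_i).
    """
    d = [1.0 / p for p in pi]                         # Horvitz-Thompson weights
    ty_ht = sum(di * yi for di, yi in zip(d, y))      # HT estimator of the y total
    tx_ht = sum(di * xi for di, xi in zip(d, x))      # HT estimator of the x total
    sxx = sum(di * xi * xi for di, xi in zip(d, x))
    sxy = sum(di * xi * yi for di, xi, yi in zip(d, x, y))
    beta = sxy / sxx   # ASSUMED coefficient: weighted slope through the origin
    estimate = ty_ht + (tx_pop - tx_ht) * beta
    # The same estimator expressed as a linear estimator; w does not involve y.
    w = [di * (1.0 + (tx_pop - tx_ht) * xi / sxx) for di, xi in zip(d, x)]
    return estimate, w


# Hypothetical population and sample, for illustration only.
random.seed(1)
N, n = 1000, 50
x_pop = [random.uniform(1.0, 10.0) for _ in range(N)]
y_pop = [3.0 * xi + random.gauss(0.0, 1.0) for xi in x_pop]
sample = random.sample(range(N), n)                   # SRS without replacement
x_s = [x_pop[i] for i in sample]
y_s = [y_pop[i] for i in sample]
pi = [n / N] * n

estimate, w = greg_total(y_s, x_s, pi, sum(x_pop))
calibrated_x_total = sum(wi * xi for wi, xi in zip(w, x_s))
weighted_y_total = sum(wi * yi for wi, yi in zip(w, y_s))
print(estimate, sum(y_pop))
```

Under these assumptions, calibrated_x_total agrees with the known total of x up to rounding, and weighted_y_total agrees with the estimate. The same list w could therefore be applied to any other analysis variable observed on the sample.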