Sample Design for Small Area Estimation Wilford B
University of Wollongong
Research Online
University of Wollongong Thesis Collection

2011

Sample design for small area estimation
Wilford B. Molefe
University of Wollongong

Recommended Citation
Molefe, Wilford B., Sample design for small area estimation, Doctor of Philosophy thesis, Center for Statistical and Survey Methodology, University of Wollongong, 2011. http://ro.uow.edu.au/theses/3495

Research Online is the open access institutional repository for the University of Wollongong. For further information contact Manager Repository Services: [email protected].

Sample Design for Small Area Estimation

A thesis submitted in fulfilment of the requirements for the award of the degree of Doctor of Philosophy from The University Of Wollongong

by

Wilford B. Molefe
MSc.(Sheffield), B.A.(UB)

Center for Statistical and Survey Methodology
Wollongong NSW 2522, Australia
2011

Dedicated to
My Beloved Wife Rosinah
My Daughter Nametso
My Son Masego

Abstract

Sample surveys have long been used as cost-effective means for data collection. Such data are used to provide suitable statistics not only for the population targeted by the survey but also for a variety of subpopulations, often called domains or areas. Sampling designs and in particular sample sizes are chosen in practice so as to provide reliable estimates for aggregates of the small areas such as large geographical regions or broad demographic groups. Budget and other constraints usually prevent the allocation of sufficiently large samples to each of the sub-domains or small areas to provide reliable estimates using traditional techniques.

This thesis will develop approaches for sample design to support small area estimation. Sample designs for small areas can be classified into two major categories:

• when it is feasible to select sample units in every small area;
• when only a subset of small areas can be surveyed.
The first case will be represented by stratified sampling where strata are small areas. The second case will be represented by two-stage sampling where clusters are small areas and are selected either by equal probability (simple two-stage) or unequal probability sampling (general two-stage).

In each case, the aim is to find the best sample design for a combination of the anticipated mean squared errors of small area composite estimates and an overall estimator of the population mean, subject to a cost constraint.

This thesis develops sample designs which minimize or reduce this objective function, using analytical expressions for the optimal design, asymptotic approximations to the optimal design, optimal designs within restricted families of designs (such as power allocations), numerical optimization and ad-hoc approaches.

Power allocation with the exponent chosen by numerical optimization is found to be a nearly-optimum strategy with appealing properties when all small areas can be selected in the sample. When only a subset of small areas can be selected, a two-stage unequal probability design is found to perform well, with cluster sizes given by the classical optimal cluster size. The optimal selection probabilities are a complex function of the cluster population sizes which is derived analytically. When the only priority is small area estimation, the optimal design is to select the largest clusters with certainty, and to select none of the remaining clusters.

In the case where it is feasible to select sample in every small area, analytical and approximate analytical optimal designs are developed. While optimal designs minimize an objective function, they have undesirable practical properties. Simpler designs, including the adjusted power allocation with the exponent chosen by numerical optimization, are nearly as effective.

Certification

I, Wilford B. Molefe, declare that this thesis is wholly my own work unless otherwise referenced or acknowledged below. The document has not been submitted for qualifications at any other academic institution.

Wilford B. Molefe
December 2011

© 2011 Wilford B. Molefe
All Rights Reserved

Acknowledgements

First of all, I express my sincere appreciation to my supervisor Associate Professor Robert Graham Clark, for his insights, wise direction and good advice, constant encouragement, and for his support in so many aspects. He is such a wonderful advisor: not only knowledgeable, enthusiastic and confident, but also caring towards his students. He is such a great mentor and it has been such a pleasure to work under his supervision.

My sincere gratitude to my co-supervisor Professor Raymond L. Chambers and Professor David Steel for their invaluable comments and suggestions on all of my work.

Thanks to my friends at the University of Wollongong for sharing many enjoyable and challenging experiences. Thank you Carolyn Silveri, Kerrie Gamble, Anne Harper, Joell Hall, Dallas Burnes and Anica Damcevski and the University of Wollongong Office of Research for providing administrative support.

I am so grateful to have a supportive family: my wife, Rosinah, my daughter, Nametso, and my son, Masego. They've had to endure a diminished standard of living these past four years. They have been patient, understanding and supportive. They have given me the moral momentum to complete this project. To them, I dedicate this work. I feel blissful with your love. To my beloved Rosinah, I still have a long way to go and I want to go with you.
Abbreviations

BLUP - Best Linear Unbiased Predictor
CD - Census District
CV - Coefficient of Variation
EA - Enumeration Area
EB - Empirical Bayes
EBLUP - Empirical Best Linear Unbiased Predictor
GLMMs - Generalized Linear Mixed Models
GREG - Generalized Regression
HB - Hierarchical Bayes
LGA - Local Government Area
MSE - Mean Squared Error
NSI - National Statistical Institute
PPS - Probability Proportional to Size
PSU(s) - Primary Sampling Unit(s)
PPSWR - Probability Proportional to Size With Replacement
PPSWOR - Probability Proportional to Size Without Replacement
RSE - Relative Standard Error
Rel. var. - Relative Variance
SAE - Small Area Estimation
SRS - Simple Random Sampling
SRSWOR - Simple Random Sampling Without Replacement
SRSWR - Simple Random Sampling With Replacement
SSU(s) - Second Stage Unit(s)
UPWR - Unequal Probability Sampling With Replacement
UPS - Unequal Probability Sampling
WOR - Without Replacement
WR - With Replacement

Contents

1 Introduction
  1.1 Survey Design and Small Area Statistics
  1.2 Theoretical Framework
  1.3 Scope Of The Thesis
  1.4 Sample Designs That Will Be Considered
    1.4.1 Stratified Simple Random Sampling Design Where Small Areas Are Strata
    1.4.2 Simple Two-Stage Design Where Small Areas Are Clusters
    1.4.3 Unequal Probability Two-Stage Design Where Small Areas Are Clusters
  1.5 Structure Of The Thesis

2 Literature Review
  2.1 Preliminaries and Notation
  2.2 The Model-Assisted Framework
    2.2.1 Design-based and Model-based Expectations
  2.3 Sample Designs for Estimating Totals and Means
    2.3.1 Stratified Simple Random Sampling
    2.3.2 Two Stage Sampling Using SRSWOR of Clusters
    2.3.3 Two-Stage Sampling Using PPSWOR of Clusters
  2.4 Overview of Approaches to Small Area Estimation
    2.4.1 Introduction
    2.4.2 Design-based Small Area Estimation
    2.4.3 Model-based Small Area Estimation
  2.5 Sample Design for Domains and Small Areas
  2.6 Intracluster Correlation Coefficient

3 Stratified Simple Random Sampling Designs Where Small Areas Are Strata
  3.1 Introduction
  3.2 Model-Assisted Small Area Estimation
  3.3 Optimal Power Allocation
  3.4 Optimal Design When The Only Priority Is Small Area Estimation (G = 0)
  3.5 Optimal Design When G > 0
  3.6 Approximate Analytical Optimal Design Based on Small Intraclass Correlation
  3.7 Alternative Approximate Analytical Optimal Design
  3.8 Other Designs
  3.9 Numerical Evaluation
  3.10 Summary of Chapter 3

4 Simple Two-Stage Designs Where Small Areas Are Clusters
  4.1 Introduction
  4.2 Model-Assisted Small Area Estimation
  4.3 Area-Only Simple Two-Stage Optimal Design
  4.4 Other Designs
  4.5 Numerical Evaluation
  4.6 Summary of Chapter 4

5 Unequal Probability Two-Stage Designs Where Small Areas Are Clusters
  5.1 Introduction
  5.2 Model-Assisted Small Area Estimation
  5.3 Area-Only Two-Stage Optimal Design
  5.4 Compromise Two-Stage Optimal Design
  5.5 Other Designs
    5.5.1 Classical Optimal Design
    5.5.2 Power design with proportional sampling within clusters
    5.5.3 Partial coverage with proportional sampling within clusters
    5.5.4 Partial coverage with constant sample size within clusters
    5.5.5 Optimal Power Design
    5.5.6 Adjusted Optimal Power Design
    5.5.7 Numerical Example
  5.6 Numerical Evaluation
  5.7 Summary of Chapter 5

6 Sensitivity Analysis
  6.1 Introduction
  6.2 Switzerland Canton Data
    6.2.1 Stratified Designs
    6.2.2 Simple Two-Stage Designs
    6.2.3 General Two-Stage Designs
  6.3 Botswana District Data
    6.3.1 Introduction
    6.3.2 Stratified Designs
    6.3.3 Simple Two-Stage Designs
    6.3.4 General Two-Stage Designs
  6.4 Summary of Chapter 6

7 Conclusions
  7.1 Summary and Conclusion
  7.2 Further Research

A Proof of Result (3.3), Theorems 3.4.1, 3.6.1 and 3.7.1 in Chapter 3
  A.1 Result (3.3)
  A.2 Proof of Theorem 3.4.1
  A.3 Proof of Theorem 3.6.1
  A.4 Proof of Theorem 3.7.1