Survey Methodology and Sampling Techniques
15-18 April 2008
European Statistical Training Program
Contents
1. General information
2. Basic concepts of survey sampling; the Horvitz-Thompson estimation strategy; simple random sampling
3. Stratified sampling; cluster sampling; multistage sampling
4. Ratio estimators; post-stratification
5. Regression estimators; calibration
6. Introduction to the problem and treatment of non-response
ESTP course programme
Survey Methodology and Sampling Techniques (4-day course)
Course leader • Eric Lesage, INSEE
Trainers • Guillaume Chauvet, INSEE • Eric Lesage, INSEE • Jean-Pierre Renfer, SFSO • Paul-André Salamin, SFSO
Course Programme
Day 1
Basic concepts of survey sampling; the Horvitz-Thompson estimation strategy; simple random sampling
09:00 – 09:30 Welcome and introduction (Eric Lesage, course leader)
09:30 – 10:30 Lesson (J.-P. Renfer, P.-A. Salamin)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (J.-P. Renfer, P.-A. Salamin)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (J.-P. Renfer, P.-A. Salamin)
15:15 – 15:30 Coffee break
15:30 – 17:00 Lesson (J.-P. Renfer, P.-A. Salamin)
17:00 – Welcome reception
Day 2
Stratified sampling; cluster sampling; multi-stage sampling
09:00 – 10:30 Lesson (G. Chauvet, E. Lesage)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (G. Chauvet, E. Lesage)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (G. Chauvet, E. Lesage)
15:15 – 15:30 Coffee break
15:30 – 17:00 Lesson (G. Chauvet, E. Lesage)
Day 3
Ratio estimator; post-stratification; regression estimator
09:00 – 10:30 Lesson (J.-P. Renfer, P.-A. Salamin)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (J.-P. Renfer, P.-A. Salamin)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (G. Chauvet, E. Lesage)
15:15 – 15:30 Coffee break
15:30 – 17:00 Lesson (G. Chauvet, E. Lesage)
19:00 – Course dinner
Day 4
Calibration; introduction to the problem and treatment of non-response
09:00 – 10:30 Lesson (G. Chauvet, E. Lesage)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (G. Chauvet, E. Lesage)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (J.-P. Renfer, P.-A. Salamin)
15:15 – 15:30 Coffee break
15:30 – 16:00 Conclusion, evaluation
SURVEY METHODOLOGY AND SAMPLING TECHNIQUES (an introduction to survey sampling)
COURSE LEADER Eric Lesage (National Institute of Statistics – INSEE, France)
OBJECTIVE(S) To familiarize the participants with the fundamental principles and the main methods of survey sampling. Emphasis is given to their applications in existing surveys.
TRAINING METHODS The course is based on lectures and practical exercises. For most of the exercises, computers and the SAS Enterprise Guide software are used.
TARGET GROUP Staff using sample survey techniques in the production of statistics.
ENTRY QUALIFICATIONS • University degree or equivalent education and training level • Basic knowledge of statistics • Sound command of English
EXPECTED OUTPUT Basic understanding of the fundamental principles and the main methods of survey sampling.
CONTENTS • Basic Concepts of Survey Sampling • Simple Random Sampling • Use of auxiliary information • Stratified, Cluster and Multi-Stage Sampling • Ratio and Regression Estimators • Post-stratification and Calibration • Introduction to the problem, the effects and the treatment of non-response
TRAINER(S)/LECTURER(S) • Eric LESAGE (INSEE, France) • Guillaume CHAUVET (INSEE, France) • Jean-Pierre RENFER (Swiss Federal Statistical Office - OFS)
REQUIRED READING None
SUGGESTED READING Basic introduction to sampling theory
REQUIRED PREPARATION None
REQUIRED EQUIPMENT Hand-held calculator
PRACTICAL INFORMATION
WHEN 15-18.04.2008
DURATION 4 days
WHERE Bruz, France
ORGANISER ADETEF
APPLICATION VIA NATIONAL CONTACT POINT Deadline: 04.02.2008
Federal Department of Home Affairs FDHA Federal Statistical Office FSO
Survey Methodology and Sampling Techniques Basic Concepts of Survey Sampling, Horvitz-Thompson and Simple Random Sampling
Paul-André Salamin, Jean-Pierre Renfer, Statistical Methods Unit, Federal Statistical Office
European Statistical Training Program 15 - 18 April 2008
ESTP/Survey methodology © SFSO
Contents
• Basic Concepts of Survey Sampling
• Census and Sample Surveys
• Global Framework of Survey Sampling
• Sampling and Non-sampling Errors
• Horvitz-Thompson Strategy
• Simple Random Sampling
• H-T Estimators
• Variance Estimation
• Confidence Interval
• Relation Between Sample Size and Variance
Census and sample surveys
Information collection for a population U.
Census: whole population U is observed.
Sample: observation for a subset s of the population U.
[Figure: a census observes the whole population U; a sample survey observes a subset s of U.]
Global Framework of Survey Sampling
From the demand for a particular statistic to the results.
[Diagram: the population, the sample and the data, linked by sampling design / sample selection, survey design / data collection, and estimation; the resulting estimator targets the population characteristic.]
Population and Sampling Frame
We are interested in a specific finite population U = {1, .., k, .., N} of size N. We call the elements k ∈ U the units.
In practice, we use a sampling frame which is a list of the sampling units.
The sampling frame is thus the list of the units used to obtain access to information for the finite population of interest.
The sampling frame is constructed with census data or registers.
Required properties:
• The units can be identified (identifier, name).
• The units can be found (e.g. mail address).
• Every element is present only once (no duplicates).
• No element is listed that is not in the population (no overcoverage).
• Every element of the population is present (no undercoverage).
• The frame is valid for a well-defined reference period.
Desirable properties:
• The frame contains additional information for each unit (variables for the sampling design, the estimation, domain identification).
Variable of interest or study variable (unknown): $y$. Value of the variable $y$ for unit $k \in U$: $y_k$.
Auxiliary variables (known): $x_1, \ldots, x_q$. Value of the variable $x_q$ for unit $k \in U$: $x_{qk}$.
Data structure:
U    x_1      ..   x_q      y_1      ..   y_p
1    x_{11}   ..   x_{1q}   y_{11}   ..   y_{1p}
..   ..       ..   ..       ..       ..   ..
k    x_{k1}   ..   x_{kq}   y_{k1}   ..   y_{kp}
..   ..       ..   ..       ..       ..   ..
N    x_{N1}   ..   x_{Nq}   y_{N1}   ..   y_{Np}
Population U of size N = 10.

k    x1   x2   y1     y2   y3
1    1    23   122    21   5
2    2    14   354    13   5
3    2    56   156    35   6
4    1    24   465    65   4
5    3    67   3243   45   3
6    3    2    789    35   1
7    3    35   443    64   2
8    2    23   23     24   3
9    1    19   973    45   4
10   3    76   993    64   1
Characteristic
Population characteristic or parameter: a function of the study variable values $y_k$, $k \in U$, i.e. $\theta = \theta(y_1, \ldots, y_N)$.

Quantitative variables:
• Total: $Y = \sum_{k \in U} y_k$
• Mean: $\bar{Y} = \left(\sum_{k \in U} y_k\right)/N = Y/N$

Qualitative variables with values $a = 1, \ldots, A$:
• Total number: $N_a = \sum_{k \in U} y_k$
• Proportion: $p_a = N_a/N = \left(\sum_{k \in U} y_k\right)/N = \bar{Y}$

where $y_k = 1$ if unit $k$ is in category $a$, and 0 otherwise.
Characteristics in Domains
A specific subpopulation of U, or domain, is denoted $U_d$, where $U_d \subset U$.

• Size: $N_d = |U_d| = \sum_{k \in U} z_{dk}$
• Total: $Y_d = \sum_{k \in U_d} y_k = \sum_{k \in U} y_k z_{dk}$
• Mean: $\bar{Y}_d = \left(\sum_{k \in U_d} y_k\right)/N_d = \left(\sum_{k \in U} y_k z_{dk}\right) / \left(\sum_{k \in U} z_{dk}\right)$
• Proportion: $p_{da} = N_{da}/N_d = \left(\sum_{k \in U} y_{ak} z_{dk}\right) / \left(\sum_{k \in U} z_{dk}\right)$

with $z_{dk} = 1$ if $k \in U_d$ and 0 otherwise, and $y_{ak} = 1$ if $k$ is in category $a$ and 0 otherwise.
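The totals, means and domain characteristics defined so far can be checked by hand on the N = 10 example population. The following is a minimal Python sketch, not part of the course material (which uses SAS Enterprise Guide). It uses the y1 and x1 columns of the example table, and it assumes, purely for illustration, that the domain $U_d$ is the set of units with x1 = 1.

```python
# Columns x1 and y1 of the N = 10 example population, in unit order k = 1..10.
x1 = [1, 2, 2, 1, 3, 3, 3, 2, 1, 3]
y1 = [122, 354, 156, 465, 3243, 789, 443, 23, 973, 993]

N = len(y1)
Y = sum(y1)        # population total
Y_bar = Y / N      # population mean

# Assumption for illustration: the domain U_d is the set of units with x1 == 1.
z = [1 if v == 1 else 0 for v in x1]            # domain indicator z_dk
N_d = sum(z)                                     # domain size
Y_d = sum(yk * zk for yk, zk in zip(y1, z))      # domain total
Y_bar_d = Y_d / N_d                              # domain mean

print(N, Y, Y_bar)
print(N_d, Y_d, Y_bar_d)
```

Writing the domain total as a sum over the whole population of $y_k z_{dk}$ is what makes the indicator notation convenient: no separate loop over the domain is needed.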
Other Characteristics
Variability, dispersion of y in U.
• Variance: $S^2 = S_y^2 = \frac{1}{N-1} \sum_{k \in U} (y_k - \bar{Y})^2$
• Standard deviation: $S = S_y = \sqrt{S^2}$
• Coefficient of variation: $CV = CV_y = S/\bar{Y}$
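As an illustration of these dispersion measures, the sketch below computes $S^2$, $S$ and $CV$ for the y1 column of the N = 10 example population. It is a plain Python illustration under that assumption, not the course's own software.

```python
import math

# Column y1 of the N = 10 example population.
y1 = [122, 354, 156, 465, 3243, 789, 443, 23, 973, 993]

N = len(y1)
Y_bar = sum(y1) / N                                  # population mean
S2 = sum((yk - Y_bar) ** 2 for yk in y1) / (N - 1)   # variance, divisor N - 1
S = math.sqrt(S2)                                    # standard deviation
CV = S / Y_bar                                       # coefficient of variation

print(S2, S, CV)
```

The single large value 3243 dominates the sum of squares, which is why the coefficient of variation comes out well above 1 for this variable.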
Sample
A sample s is a subset of the population U.
The sample size is denoted n, with n ≤ N.
In practice, a sample s is a subset of the available sampling frame.
In this course: a sample s is a probability sample. It satisfies certain conditions (see below).
The sample s is the gross sample.
Sample s of size n = 6 in U of size N = 10.

k    x1   x2   sample   y1   y2   y3
1    1    23   1        .    .    .
2    2    14   0        .    .    .
3    2    56   1        .    .    .
4    1    24   1        .    .    .
5    3    67   1        .    .    .
6    3    2    0        .    .    .
7    3    35   1        .    .    .
8    2    23   0        .    .    .
9    1    19   1        .    .    .
10   3    76   0        .    .    .
Data
The set of respondents or response set r is a subset of the sample s.
The size of the response set is m ≤ n ≤ N.
The response set r is the net sample.
k    x1   x2   sample   resp   y1     y2   y3
1    1    23   1        1      122    21   5
2    2    14   0        .      .      .    .
3    2    56   1        1      156    35   6
4    1    24   1        0      .      .    .
5    3    67   1        1      3243   45   3
6    3    2    0        .      .      .    .
7    3    35   1        0      .      .    .
8    2    23   0        .      .      .    .
9    1    19   1        1      973    45   4
10   3    76   0        .      .      .    .

k    x1   x2   sample   resp   y1     y2   y3
1    1    23   1        1      122    21   5
3    2    56   1        1      156    35   6
5    3    67   1        1      3243   45   3
9    1    19   1        1      973    45   4
Sampling Design and Sample Selection
In probability sampling:
• We can define the set of samples $S = \{s_1, s_2, \ldots, s_M\}$ that can be obtained with the sampling procedure.
• A known probability of selection $p(s)$ is associated with each $s \in S$.
• Each element in U has a non-zero probability of being selected.
The sampling design is represented by the function p(.) such that p(s) gives us the probability of selecting s under the scheme in use.
Sample selection is carried out by a series of randomized experiments. Various schemes are available.
Example: selection of n units in N where each sample s has the same selection probability.
• generate independently for each unit in U a random number uniformly distributed between 0 and 1
• sort the list by random number
• take the first n units of the sorted list
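The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `srs_by_sorting`, the unit labels 1..N and the seed are assumptions made for the example, not part of the course material.

```python
import random

def srs_by_sorting(N, n, rng):
    """Select n of the units 1..N by sorting on independent uniform random numbers."""
    u = {k: rng.random() for k in range(1, N + 1)}   # one uniform number per unit
    ordered = sorted(u, key=u.get)                    # sort the units by their random number
    return sorted(ordered[:n])                        # keep the first n units of the sorted list

rng = random.Random(2008)    # arbitrary seed, used only to make the run reproducible
sample = srs_by_sorting(N=10, n=6, rng=rng)
print(sample)
```

Because the random numbers are independent and identically distributed, every ordering of the units is equally likely, so every subset of size n has the same chance of ending up in the first n positions.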
Survey Design and Data Collection
Planning and operations.
Survey design: procedure (CATI, CAPI, mail, e-mails, internet), questionnaire, pretesting, reference period, contact (households, individuals, companies).
Data collection: sending, call-backs.
Data processing: scan, manual, quality control, coding, editing, imputation.
Note: extremely important steps, not developed in this course.
Estimator and Estimation
A characteristic $\theta(y_k, k \in U)$ is estimated by $\hat{\theta}(y_k, k \in s)$.
An estimator is a function of $y_k$ for $k \in s$.
An estimate is the result of the calculation of the estimator for a specific sample s.
If s is a probability sample, $\hat{\theta}(y_k, k \in s)$ is a random variable for which we can compute the expected value and the variance.
Sampling and Non-sampling Errors
Sampling error: results from taking a sample instead of the whole population.
• sampling variance: $\mathrm{var}(\hat{\theta})$
• estimation bias: $E(\hat{\theta}) - \theta$
Non-sampling errors: all other errors.
• errors due to the quality of the frame (coverage, timeliness, etc.)
• errors due to non-response (unit and item)
• measurement errors
• processing errors
Horvitz-Thompson Strategy
A strategy is the choice of both a sampling design $p(s)$ and an estimator $\hat{\theta}(y_k, k \in s)$.
[Diagram: the population, the sample and the data, linked by sampling design / sample selection, survey design / data collection, and estimation; the resulting estimator targets the population characteristic.]
Good strategy: $p(s)$ and $\hat{\theta}(y_k, k \in s)$ such that $\hat{\theta}(y_k, k \in s)$ has low variance and small bias.
The Horvitz-Thompson estimator (or π-estimator) of a total $Y = \sum_{k \in U} y_k$ is defined as:

$$\hat{Y} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} w_k y_k$$

where

$$\pi_k = \Pr(k \in s) = \sum_{s \ni k} p(s)$$

is the selection or inclusion probability, and

$$w_k = 1/\pi_k$$

is the sampling weight, for $k \in s$.
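A minimal Python sketch of the estimator follows. The helper name `ht_total` is hypothetical. The data are the y1 values of the six sampled units (k = 1, 3, 4, 5, 7, 9) in the N = 10 example; the sketch assumes, for illustration only, that all six units respond and that the design has equal inclusion probabilities $\pi_k = 6/10$.

```python
def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson estimate of a total: sum of w_k * y_k with w_k = 1 / pi_k."""
    weights = [1.0 / pi_k for pi_k in pi_sample]
    return sum(w_k * y_k for w_k, y_k in zip(weights, y_sample))

# y1 values of the sampled units k = 1, 3, 4, 5, 7, 9 (full response assumed).
y_s = [122, 156, 465, 3243, 443, 973]
# Assumed equal inclusion probabilities n / N = 6 / 10.
pi_s = [6 / 10] * len(y_s)

Y_hat = ht_total(y_s, pi_s)
print(Y_hat)
```

With equal probabilities each sampled unit simply "represents" N/n = 10/6 population units; with unequal probabilities the same function applies, only the list of $\pi_k$ changes.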
The Horvitz-Thompson estimator is unbiased: $E(\hat{Y}) = Y$.

The variance of the estimator is given by:

$$\mathrm{var}(\hat{Y}) = \sum_{k,\ell \in U} (\pi_{k\ell} - \pi_k \pi_\ell) \, \frac{y_k}{\pi_k} \frac{y_\ell}{\pi_\ell}$$

where

$$\pi_k = \Pr(k \in s) = \sum_{s \ni k} p(s)$$

and

$$\pi_{k\ell} = \Pr(k, \ell \in s) = \sum_{s \ni k,\ell} p(s)$$
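The H-T variance is estimated by:

$$\widehat{\mathrm{var}}(\hat{Y}) = \sum_{k,\ell \in s} \frac{\pi_{k\ell} - \pi_k \pi_\ell}{\pi_{k\ell}} \, \frac{y_k}{\pi_k} \frac{y_\ell}{\pi_\ell}$$

If $\pi_{k\ell} > 0$ for all $k, \ell \in U$, then $\widehat{\mathrm{var}}(\hat{Y})$ is an unbiased estimator of $\mathrm{var}(\hat{Y})$. Instability may however occur.

"Wings" notation:

$$\check{y}_k = y_k/\pi_k, \qquad \Delta_{k\ell} = \pi_{k\ell} - \pi_k \pi_\ell, \qquad \check{\Delta}_{k\ell} = \Delta_{k\ell}/\pi_{k\ell}$$

$$\widehat{\mathrm{var}}(\hat{Y}) = \sum_{k,\ell \in s} \check{\Delta}_{k\ell} \, \check{y}_k \check{y}_\ell$$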
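For a very small population, these properties can be checked by listing every possible sample. The sketch below is an illustration under stated assumptions: the study values `y` are made up, and the design gives equal probability to every sample of size 2 from a population of size 4. It computes $\pi_k$ and $\pi_{k\ell}$ from $p(s)$, then compares the expectation of $\hat{Y}$ with $Y$, the variance formula with the directly computed variance, and the expectation of the variance estimator with the variance.

```python
from itertools import combinations

y = {1: 3.0, 2: 5.0, 3: 8.0, 4: 12.0}            # hypothetical study values
U = sorted(y)
samples = list(combinations(U, 2))                # all samples of size n = 2
p = {s: 1.0 / len(samples) for s in samples}      # assumed design: equal p(s)

def pi(k, l=None):
    """Inclusion probability of k, or joint inclusion probability of k and l."""
    l = k if l is None else l
    return sum(ps for s, ps in p.items() if k in s and l in s)

def ht(s):
    """Horvitz-Thompson estimate of the total from sample s."""
    return sum(y[k] / pi(k) for k in s)

def var_hat(s):
    """Variance estimator: double sum over the sampled units (k = l included)."""
    return sum((pi(k, l) - pi(k) * pi(l)) / pi(k, l) * y[k] / pi(k) * y[l] / pi(l)
               for k in s for l in s)

Y = sum(y.values())
expected_ht = sum(ps * ht(s) for s, ps in p.items())
var_direct = sum(ps * (ht(s) - Y) ** 2 for s, ps in p.items())
var_formula = sum((pi(k, l) - pi(k) * pi(l)) * y[k] / pi(k) * y[l] / pi(l)
                  for k in U for l in U)
expected_var_hat = sum(ps * var_hat(s) for s, ps in p.items())

print(expected_ht, Y)
print(var_direct, var_formula, expected_var_hat)
```

Full enumeration is only feasible for toy populations, but it makes the meaning of "unbiased" concrete: the average is taken over all samples the design can produce, weighted by $p(s)$.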
Sample Membership Indicator
Suppose $p(s)$ has been fixed.

Sample membership indicator:

$$I_k = I_k(s) = \begin{cases} 1 & \text{if } k \in s \\ 0 & \text{otherwise} \end{cases}$$

First order inclusion probability: $\pi_k = \Pr(I_k = 1)$.

Second order inclusion probability: $\pi_{k\ell} = \Pr(I_k = 1 \text{ and } I_\ell = 1)$.

Notes:
• $I_k$ is a random variable
• $\pi_{kk} = \Pr(I_k^2 = 1) = \Pr(I_k = 1) = \pi_k$
• $n = \sum_{k \in U} I_k(s)$
For a given p(s), one can prove:
• Expectation: $E(I_k) = \pi_k$
• Variance: $\mathrm{var}(I_k) = \pi_k (1 - \pi_k)$
• Covariance: $C(I_k, I_\ell) = \pi_{k\ell} - \pi_k \pi_\ell = \Delta_{k\ell}$

Note: if $k = \ell$ then $C(I_k, I_k) = \mathrm{var}(I_k)$.
Note: $\sum_k \pi_k = \sum_k E(I_k) = E\left(\sum_k I_k\right) = E(n)$.
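These moments can be verified numerically on a toy design. The sketch below assumes a hypothetical design on U = {1, 2, 3} with four possible samples of different sizes and unequal probabilities; the probabilities are invented for the example. Expectations are computed exactly by summing over the samples.

```python
# Hypothetical design on U = {1, 2, 3}: four possible samples, unequal p(s), random size.
p = {(1,): 0.1, (1, 2): 0.4, (2, 3): 0.3, (1, 2, 3): 0.2}
U = [1, 2, 3]

def I(k, s):
    """Sample membership indicator of unit k for sample s."""
    return 1 if k in s else 0

def E(f):
    """Expectation of f(s) under the design p."""
    return sum(ps * f(s) for s, ps in p.items())

pi = {k: E(lambda s, k=k: I(k, s)) for k in U}                       # E(I_k)
var_I = {k: E(lambda s, k=k: (I(k, s) - pi[k]) ** 2) for k in U}     # var(I_k)
pi_12 = E(lambda s: I(1, s) * I(2, s))                               # joint inclusion
cov_12 = E(lambda s: (I(1, s) - pi[1]) * (I(2, s) - pi[2]))          # C(I_1, I_2)
expected_n = E(len)                                                  # E(n)

print(pi, var_I)
print(pi_12, cov_12, expected_n)
```

Here the sample size is random, so the sum of the inclusion probabilities equals the expected sample size rather than a fixed n.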
Variability in Sampling
The values $y_k$, $k \in U$, are not random.
Sample selection is the random part.
The sampling design $p(\cdot)$ is a probability on the set of samples.
The indicator $I_k = I_k(s)$ is a random variable.
The selection probability $\pi_k = E(I_k)$ is determined by the sampling design.
The estimator $\hat{\theta}(y_k, k \in s)$ is a random variable.
Fixed Size Sampling Design
A sampling design p(s) may lead to a fixed or random sample size n.
Two examples with $\pi_k = n/N$, $k \in U$ (equal probability sampling designs).

Example 1 (fixed size):
• generate a random number $u_k$ in ]0, 1[ for each unit $k$ in U
• sort the list by the random number
• take the first n units of the sorted list.

Example 2 (Bernoulli, random size):
• generate a random number $u_k$ in ]0, 1[ for each unit $k$ in U
• take all the units with $u_k < n/N$.
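The difference between the two examples shows up as soon as each scheme is run many times. The following Python sketch is illustrative: the function names, the seed and the choice N = 10, n = 6 are assumptions. It records the set of sample sizes produced by each scheme over 200 repetitions.

```python
import random

def fixed_size_sample(N, n, rng):
    """Example 1: sort the units by a uniform random number and keep the first n."""
    u = [(rng.random(), k) for k in range(1, N + 1)]
    return sorted(k for _, k in sorted(u)[:n])

def bernoulli_sample(N, n, rng):
    """Example 2: keep each unit independently when its random number is below n / N."""
    return [k for k in range(1, N + 1) if rng.random() < n / N]

rng = random.Random(1)   # arbitrary seed, used only for reproducibility
fixed_sizes = {len(fixed_size_sample(10, 6, rng)) for _ in range(200)}
bern_sizes = {len(bernoulli_sample(10, 6, rng)) for _ in range(200)}

print(fixed_sizes)   # a single size
print(bern_sizes)    # several sizes
```

Both schemes give every unit the same inclusion probability n/N, but only the first guarantees the sample size; under the Bernoulli scheme n/N fixes the expected size only.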
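If $p(s)$ is a fixed size design, the general variance

$$\mathrm{var}(\hat{Y}) = \sum_{k,\ell \in U} \Delta_{k\ell} \, \check{y}_k \check{y}_\ell$$

may also be written as (Yates, Grundy and Sen)

$$\mathrm{var}(\hat{Y}) = -\frac{1}{2} \sum_{k \in U} \sum_{\ell \in U} \Delta_{k\ell} \, (\check{y}_k - \check{y}_\ell)^2$$

with the unbiased estimator, provided that $\pi_{k\ell} > 0$ for all $k, \ell \in U$,

$$\widehat{\mathrm{var}}(\hat{Y}) = -\frac{1}{2} \sum_{k \in s} \sum_{\ell \in s} \check{\Delta}_{k\ell} \, (\check{y}_k - \check{y}_\ell)^2$$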
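The equivalence of the two variance expressions for a fixed size design can be checked by enumeration. The sketch below assumes a hypothetical design with unequal sample probabilities over all samples of size 2 from a population of size 4, and made-up study values. It evaluates the general form, the Yates-Grundy-Sen form, and the expectation of the Yates-Grundy-Sen estimator.

```python
from itertools import combinations

y = {1: 3.0, 2: 5.0, 3: 8.0, 4: 12.0}             # hypothetical study values
U = sorted(y)
samples = list(combinations(U, 2))                 # every sample has size n = 2
probs = [0.1, 0.2, 0.1, 0.2, 0.3, 0.1]             # hypothetical p(s), summing to 1
p = dict(zip(samples, probs))

def pi(k, l):
    """Joint inclusion probability of k and l; pi(k, k) is the first order probability."""
    return sum(ps for s, ps in p.items() if k in s and l in s)

def delta(k, l):
    return pi(k, l) - pi(k, k) * pi(l, l)

def yc(k):
    """Expanded value y_k / pi_k."""
    return y[k] / pi(k, k)

var_general = sum(delta(k, l) * yc(k) * yc(l) for k in U for l in U)
var_ygs = -0.5 * sum(delta(k, l) * (yc(k) - yc(l)) ** 2 for k in U for l in U)

def var_hat_ygs(s):
    """Yates-Grundy-Sen variance estimator computed from sample s."""
    return -0.5 * sum(delta(k, l) / pi(k, l) * (yc(k) - yc(l)) ** 2
                      for k in s for l in s)

expected_var_hat = sum(ps * var_hat_ygs(s) for s, ps in p.items())
print(var_general, var_ygs, expected_var_hat)
```

The identity relies on the sample size being fixed, which makes each row of the $\Delta_{k\ell}$ matrix sum to zero; it does not hold in general for random size designs such as Bernoulli sampling.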
Simple Random Sampling
In simple random sampling without replacement (SRS), every possible subset of n elements from a population U of N units has the same probability of being selected as the sample.
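A small enumeration makes this definition concrete. The sketch below assumes N = 5 and n = 2 for illustration: it lists all subsets of size n, assigns each the same probability $1/\binom{N}{n}$, and recovers the inclusion probabilities by summing $p(s)$ over the samples containing each unit.

```python
from itertools import combinations
from math import comb

N, n = 5, 2                                   # illustrative sizes
U = range(1, N + 1)
samples = list(combinations(U, n))            # all subsets of size n
p_s = 1 / comb(N, n)                          # same probability for every sample

# Inclusion probability of each unit: sum of p(s) over the samples containing it.
pi = {k: sum(p_s for s in samples if k in s) for k in U}

print(len(samples), p_s)
print(pi)
```

Each unit belongs to $\binom{N-1}{n-1}$ of the $\binom{N}{n}$ samples, which gives $\pi_k = n/N$, the equal probability case mentioned for fixed size designs.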