
Survey Methodology and Techniques

15-18 April 2008

European Statistical Training Program

Contents

1. General information
2. Basic concepts of survey sampling
   The Horvitz-Thompson estimation strategy
   Simple random sampling
3. Stratified sampling
   Multistage sampling
4. Ratio estimators
   Post-stratification
5. Regression estimators
   Calibration
6. Introduction to the problem and treatment of non-response

ESTP course programme

Survey Methodology and Sampling Techniques (4-day course)

Course leader • Eric Lesage, INSEE

Trainers • Guillaume Chauvet, INSEE • Eric Lesage, INSEE • Jean-Pierre Renfer, SFSO • Paul-André Salamin, SFSO

Course Programme

Day 1

Basic concepts of survey sampling
The Horvitz-Thompson estimation strategy
Simple random sampling

09:00 – 09:30 Welcome and introduction (Eric Lesage, course leader)
09:30 – 10:30 Lesson (J.-P. Renfer, P.-A. Salamin)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (J.-P. Renfer, P.-A. Salamin)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (J.-P. Renfer, P.-A. Salamin)
15:15 – 15:30 Coffee break
15:30 – 17:00 Lesson (J.-P. Renfer, P.-A. Salamin)
17:00 - Welcome reception

Day 2

Stratified sampling
Cluster sampling
Multi-stage sampling

09:00 – 10:30 Lesson (G. Chauvet, E. Lesage)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (G. Chauvet, E. Lesage)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (G. Chauvet, E. Lesage)
15:15 – 15:30 Coffee break
15:30 – 17:00 Lesson (G. Chauvet, E. Lesage)

Day 3

Ratio estimator
Post-stratification
Regression estimator

09:00 – 10:30 Lesson (J.-P. Renfer, P.-A. Salamin)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (J.-P. Renfer, P.-A. Salamin)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (G. Chauvet, E. Lesage)
15:15 – 15:30 Coffee break
15:30 – 17:00 Lesson (G. Chauvet, E. Lesage)
19 - Course dinner

Day 4

Calibration
Introduction to the problem and treatment of non-response

09:00 – 10:30 Lesson (G. Chauvet, E. Lesage)
10:30 – 10:45 Coffee break
10:45 – 12:30 Lesson (G. Chauvet, E. Lesage)
12:30 – 13:30 Lunch
13:30 – 15:15 Lesson (J.-P. Renfer, P.-A. Salamin)
15:15 – 15:30 Coffee break
15:30 – 16:00 Conclusion, evaluation

SURVEY METHODOLOGY AND SAMPLING TECHNIQUES (an introduction to survey sampling)

COURSE LEADER Eric Lesage (National Institute of Statistics and Economic Studies – INSEE, France)

OBJECTIVE(S) To familiarize the participants with the fundamental principles and the main methods of survey sampling. Emphasis is given to their applications in existing surveys.

TRAINING METHODS The course is based on lectures and practical exercises. For most of the exercises, computers and the SAS Enterprise Guide software are used.

TARGET GROUP Staff using survey techniques in the production of statistics.

ENTRY QUALIFICATIONS • University degree or equivalent education and training level • Basic knowledge of statistics • Sound command of English.

EXPECTED OUTPUT Basic understanding of the fundamental principles and the main methods of survey sampling.

CONTENTS
• Basic Concepts of Survey Sampling
• Simple Random Sampling
• Use of auxiliary information
• Stratified, Cluster and Multi-Stage Sampling
• Ratio and Regression Estimators
• Post-stratification and Calibration
• Introduction to the problem, the effects and the treatment of non-response

TRAINER(S)/LECTURER(S) • Eric LESAGE (INSEE, France) • Guillaume CHAUVET (INSEE, France) • Jean-Pierre RENFER (Swiss Federal Statistical Office - OFS)

REQUIRED READING None

SUGGESTED READING Basic introduction to sampling theory

REQUIRED PREPARATION None

REQUIRED EQUIPMENT Hand-held calculator

PRACTICAL INFORMATION

WHEN: 15-18.04.2008
DURATION: 4 days
WHERE: Bruz, France
ORGANISER: ADETEF
APPLICATION VIA NATIONAL CONTACT POINT: Deadline 04.02.2008

Federal Department of Home Affairs FDHA Federal Statistical Office FSO

Survey Methodology and Sampling Techniques Basic Concepts of Survey Sampling, Horvitz-Thompson and Simple Random Sampling

Paul-André Salamin, Jean-Pierre Renfer
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008

ESTP/Survey methodology c SFSO 1

Contents
• Basic Concepts of Survey Sampling
• Census and Sample Surveys
• Global Framework of Survey Sampling
• Sampling and Non-sampling Errors
• Horvitz-Thompson Strategy
• Simple Random Sampling
• H-T Estimators
• Estimation
• Confidence Interval
• Relation Between Sample Size and Variance

Census and sample surveys

Information collection for a population U.

Census: whole population U is observed.

Sample: observation for a subset s of the population U.

[Figure: in a census the whole population U is observed; in a sample survey only a subset s of U is observed.]

Global Framework of Survey Sampling

From the demand for a particular statistic to the results.

[Diagram: Population → Sample (sampling design, sample selection) → Data (survey design) → Estimator (estimation) of the population Characteristic.]

Population and Sampling Frame

We are interested in a specific finite population U = {1, .., k, .., N} of size N. We call the elements k ∈ U the units.

In practice, we use a sampling frame which is a list of the sampling units.

The sampling frame is thus the list of the units used to obtain access to information for the finite population of interest.


The sampling frame is constructed with census data or registers.

Required properties:

• The units can be identified (identifier, name).
• The units can be found (e.g. mail address).
• Every element is present only once (no duplicates).
• No element outside the population is listed (no overcoverage).
• Every element of the population is present (no undercoverage).
• The frame is valid for a well-defined reference period.

Desirable properties:

• The frame contains additional information for each unit (variables for the sampling design, the estimation, domain identification).

Variable of interest or study variable (unknown): $y$.
Value of the variable $y$ for unit $k \in U$: $y_k$.
Auxiliary variables (known): $x_1, .., x_q$.
Value of the variable $x_q$ for unit $k \in U$: $x_{qk}$.

Data structure:

U    x1    ..   xq    y1    ..   yp
1    x11   ..   x1q   y11   ..   y1p
..   ..    ..   ..    ..    ..   ..
k    xk1   ..   xkq   yk1   ..   ykp
..   ..    ..   ..    ..    ..   ..
N    xN1   ..   xNq   yN1   ..   yNp

Population U of size N = 10.

k    x1   x2   y1     y2   y3
1    1    23   122    21   5
2    2    14   354    13   5
3    2    56   156    35   6
4    1    24   465    65   4
5    3    67   3243   45   3
6    3    2    789    35   1
7    3    35   443    64   2
8    2    23   23     24   3
9    1    19   973    45   4
10   3    76   993    64   1

Characteristic

Population characteristic or parameter = function of the values $y_k$, $k \in U$, of the study variable, i.e. $\theta = \theta(y_1, .., y_N)$.

Quantitative variables:
• Total $Y = \sum_{k \in U} y_k$
• Mean $\bar{Y} = (\sum_{k \in U} y_k)/N = Y/N$

Qualitative variables with values $a = 1, .., A$:
• Total number $N_a = \sum_{k \in U} y_k$
• Proportion $p_a = N_a/N = (\sum_{k \in U} y_k)/N = \bar{Y}$

where $y_k = 1$ if $k$ is in $a$, and 0 otherwise.

Characteristics in Domains

A specific subpopulation of U, or domain, is denoted $U_d$, where $U_d \subset U$.

• Size: $N_d = |U_d| = \sum_{k \in U} z_{dk}$
• Total: $Y_d = \sum_{k \in U_d} y_k = \sum_{k \in U} y_k z_{dk}$
• Mean: $\bar{Y}_d = (\sum_{k \in U_d} y_k)/N_d = (\sum_{k \in U} y_k z_{dk})/(\sum_{k \in U} z_{dk})$
• Proportion: $p_{da} = N_{da}/N_d = (\sum_{k \in U} y_{ak} z_{dk})/(\sum_{k \in U} z_{dk})$

with $z_{dk} = 1$ if $k \in U_d$ and 0 otherwise, and $y_{ak} = 1$ if $k \in a$ and 0 otherwise.

Other Characteristics

Variability, dispersion of $y$ in U.

• Variance: $S^2 = S_y^2 = \frac{1}{N-1} \sum_{k \in U} (y_k - \bar{Y})^2$
• Standard deviation: $S = S_y = \sqrt{S^2}$
• Coefficient of variation: $CV = CV_y = S/\bar{Y}$
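The characteristics defined so far can be computed directly when the whole population is known. The following is a minimal sketch in Python (the course itself uses SAS), applied to the variable y1 of the N = 10 example population; the function names and the choice of the domain "x1 = 3" are assumptions made for illustration only.

```python
from math import sqrt

# Study variable y1 and auxiliary variable x1 of the N = 10 example population.
y1 = [122, 354, 156, 465, 3243, 789, 443, 23, 973, 993]
x1 = [1, 2, 2, 1, 3, 3, 3, 2, 1, 3]


def population_characteristics(y):
    """Return the total, mean, variance S^2, standard deviation S and CV of y."""
    N = len(y)
    total = sum(y)
    mean = total / N
    s2 = sum((yk - mean) ** 2 for yk in y) / (N - 1)
    s = sqrt(s2)
    return {"total": total, "mean": mean, "S2": s2, "S": s, "CV": s / mean}


def domain_characteristics(y, z):
    """Return size, total and mean of the domain flagged by the 0/1 indicator z."""
    size = sum(z)
    total = sum(yk * zk for yk, zk in zip(y, z))
    return {"size": size, "total": total, "mean": total / size}


pop = population_characteristics(y1)

# Hypothetical domain: all units with x1 equal to 3.
z = [1 if value == 3 else 0 for value in x1]
dom = domain_characteristics(y1, z)

print(pop)
print(dom)
```

The domain quantities are obtained from the full population list through the indicator z, exactly as in the formulas with $z_{dk}$.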

Sample

A sample s is a subset of the population U.

The sample size is denoted n, with n ≤ N.

In practice, a sample s is a subset of the available sampling frame.

In this course: a sample s is a probability sample. It satisfies certain conditions (see below).

The sample s is the gross sample.


Sample s of size n = 6 in U of size N = 10.

k    x1   x2   sample   y1   y2   y3
1    1    23   1        .    .    .
2    2    14   0        .    .    .
3    2    56   1        .    .    .
4    1    24   1        .    .    .
5    3    67   1        .    .    .
6    3    2    0        .    .    .
7    3    35   1        .    .    .
8    2    23   0        .    .    .
9    1    19   1        .    .    .
10   3    76   0        .    .    .

Data

The set of respondents or response set r is a subset of the sample s.

The size of the response set is m ≤ n ≤ N.

The response set r is the net sample.


Sample and response indicators for the whole population:

k    x1   x2   sample   resp   y1     y2   y3
1    1    23   1        1      122    21   5
2    2    14   0        .      .      .    .
3    2    56   1        1      156    35   6
4    1    24   1        0      .      .    .
5    3    67   1        1      3243   45   3
6    3    2    0        .      .      .    .
7    3    35   1        0      .      .    .
8    2    23   0        .      .      .    .
9    1    19   1        1      973    45   4
10   3    76   0        .      .      .    .

Response set r (net sample):

k    x1   x2   sample   resp   y1     y2   y3
1    1    23   1        1      122    21   5
3    2    56   1        1      156    35   6
5    3    67   1        1      3243   45   3
9    1    19   1        1      973    45   4

Sampling Design and Sample Selection

In probability sampling:

• We can define the set of samples $S = \{s_1, s_2, ..., s_M\}$ that can be obtained with the sampling procedure.

• A known probability of selection $p(s)$ is associated with each $s \in S$.

• Each element in U has a non-zero probability of being selected.

The sampling design is represented by the function $p(.)$ such that $p(s)$ gives us the probability of selecting $s$ under the scheme in use.

Sample selection is carried out by a series of randomized experiments. Various schemes are available.

Example: selection of n units in N where each sample s has the same selection probability.

• generate independently for each unit in U a random number uniformly distributed between 0 and 1

• sort the list by random number

• take the first n units of the sorted list
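The three selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `srs_by_sorting`, the optional `numbers` argument and the five random numbers in the usage example are made up for this sketch, and Python's generator draws from [0, 1[ rather than from an open interval, which makes no practical difference here.

```python
import random


def srs_by_sorting(units, n, numbers=None, rng=None):
    """Select n units: attach a uniform random number to each unit,
    sort the list by that number and keep the first n units."""
    if numbers is None:
        rng = rng or random.Random()
        numbers = [rng.random() for _ in units]
    ranked = sorted(zip(numbers, units))      # sort the list by random number
    return sorted(unit for _, unit in ranked[:n])  # first n units, listed in unit order


# Made-up random numbers for units 1..5, so that the result can be followed by hand.
example_numbers = [0.52, 0.11, 0.87, 0.35, 0.64]
sample = srs_by_sorting([1, 2, 3, 4, 5], n=3, numbers=example_numbers)
print(sample)  # units 2, 4 and 1 carry the three smallest numbers

# Without explicit numbers, a fresh set is generated for every call.
other_sample = srs_by_sorting(list(range(1, 11)), n=6, rng=random.Random(12345))
```

Passing the random numbers explicitly makes the selection reproducible, which is convenient when a selection has to be documented or checked by hand.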

Survey Design and Data Collection

Planning and operations.

Survey design: procedure (CATI, CAPI, mail, e-mails, internet), questionnaire, pretesting, reference period, contact (households, individuals, companies).

Data collection: sending, call-backs.

Data processing: scan, manual, , coding, editing, imputation.

Note: extremely important steps, not developed in this course.


Estimator and Estimation

A characteristic $\theta(y_k, k \in U)$ is estimated by $\hat{\theta}(y_k, k \in s)$.

An estimator is a function of $y_k$ for $k \in s$.

An estimate is the result of the calculation of the estimator for a specific sample $s$.

If $s$ is a probability sample, $\hat{\theta}(y_k, k \in s)$ is a random variable for which we can compute the expected value and the variance.

Sampling and Non-sampling Errors

Sampling error: results from taking a sample instead of the whole population.

• sampling variance: $var(\hat{\theta})$
• estimation bias: $E(\hat{\theta}) - \theta$

Non-sampling errors: all other errors.

• errors due to the quality of the frame (coverage, timeliness, etc.)
• errors due to non-response (unit and item)
• measurement errors
• processing errors

Horvitz-Thompson Strategy

A strategy is the choice of both a sampling design $p(s)$ and an estimator $\hat{\theta}(y_k, k \in s)$.

[Diagram: Population → Sample (sampling design, sample selection) → Data (survey design, data collection) → Estimator (estimation) of the population Characteristic.]

Good strategy: $p(s)$ and $\hat{\theta}(y_k, k \in s)$ such that $\hat{\theta}(y_k, k \in s)$ has low variance and small bias.

The Horvitz-Thompson estimator (or π-estimator) of a total $Y = \sum_{k \in U} y_k$ is defined as:

$$\hat{Y} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} w_k y_k$$

where

$$\pi_k = \Pr(k \in s) = \sum_{s \ni k} p(s)$$

is the selection or inclusion probability, and

$$w_k = 1/\pi_k$$

is the sampling weight, for $k \in s$.
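As a small illustration of the definition, the sketch below computes the Horvitz-Thompson estimate of a total from the sampled values and their inclusion probabilities. The function name `ht_total` and the three sampled units with their probabilities are made-up assumptions, not course data.

```python
def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson estimate of a total: sum of w_k * y_k with w_k = 1 / pi_k."""
    weights = [1.0 / pi_k for pi_k in pi_sample]
    return sum(w_k * y_k for w_k, y_k in zip(weights, y_sample))


# Made-up sample of three units selected with unequal inclusion probabilities.
y_s = [10.0, 20.0, 30.0]
pi_s = [0.5, 0.25, 0.1]

estimate = ht_total(y_s, pi_s)  # 10/0.5 + 20/0.25 + 30/0.1
print(estimate)
```

A unit selected with a small inclusion probability receives a large weight: it "represents" many units of the population.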

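The Horvitz-Thompson estimator is unbiased: $E(\hat{Y}) = Y$.

The variance of the estimator is given by:

$$var(\hat{Y}) = \sum_{k,\ell \in U} (\pi_{k\ell} - \pi_k \pi_\ell) \left(\frac{y_k}{\pi_k}\right) \left(\frac{y_\ell}{\pi_\ell}\right)$$

where

$$\pi_k = \Pr(k \in s) = \sum_{s \ni k} p(s) \quad \text{and} \quad \pi_{k\ell} = \Pr(k, \ell \in s) = \sum_{s \ni k,\ell} p(s)$$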

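The H-T variance is estimated by:

$$\widehat{var}(\hat{Y}) = \sum_{k,\ell \in s} \frac{\pi_{k\ell} - \pi_k \pi_\ell}{\pi_{k\ell}} \left(\frac{y_k}{\pi_k}\right) \left(\frac{y_\ell}{\pi_\ell}\right)$$

If $\pi_{k\ell} > 0$ for all $k, \ell \in U$ then $\widehat{var}(\hat{Y})$ is an unbiased estimator of $var(\hat{Y})$. Instability may however occur.

"Wings" notation:

$$\check{y}_k = y_k/\pi_k$$

$$\Delta_{k\ell} = \pi_{k\ell} - \pi_k \pi_\ell, \qquad \check{\Delta}_{k\ell} = \Delta_{k\ell}/\pi_{k\ell}$$

$$\widehat{var}(\hat{Y}) = \sum_{k,\ell \in s} \check{\Delta}_{k\ell} \, \check{y}_k \, \check{y}_\ell$$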
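The double sum of the variance estimator translates into two nested loops over the sample. The sketch below is one possible implementation, not the course's SAS procedure; the function name, the matrix layout `pi_joint[i][j]` and the numerical values are assumptions. The example uses the inclusion probabilities of simple random sampling without replacement, $\pi_k = n/N$ and $\pi_{k\ell} = n(n-1)/(N(N-1))$, so that the result can be compared with the closed-form SRS variance estimator $N^2 (1 - n/N) s^2 / n$.

```python
def ht_variance_estimate(y_s, pi_s, pi_joint):
    """Estimate var(Y_hat) by the double sum over the sample.

    pi_joint[i][j] is the joint inclusion probability of sample units i and j;
    on the diagonal pi_joint[i][i] must equal pi_s[i]."""
    n = len(y_s)
    y_check = [y / p for y, p in zip(y_s, pi_s)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            delta_check = (pi_joint[i][j] - pi_s[i] * pi_s[j]) / pi_joint[i][j]
            total += delta_check * y_check[i] * y_check[j]
    return total


# Illustration: a sample of n = 4 values from a population of N = 5 units (SRS).
N, n = 5, 4
y_s = [12.0, 24.0, 15.0, 30.0]
pi_s = [n / N] * n
pi_kl = n * (n - 1) / (N * (N - 1))
pi_joint = [[pi_s[i] if i == j else pi_kl for j in range(n)] for i in range(n)]

v_general = ht_variance_estimate(y_s, pi_s, pi_joint)

# Closed-form SRS variance estimator, for comparison.
mean = sum(y_s) / n
s2 = sum((y - mean) ** 2 for y in y_s) / (n - 1)
v_srs = N ** 2 * (1 - n / N) * s2 / n
print(v_general, v_srs)
```

Both computations give the same value, which illustrates that the SRS formula is a special case of the general estimator.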

Sample Membership Indicator

Suppose $p(s)$ has been fixed.

Sample membership indicator: $I_k = I_k(s) = 1$ if $k \in s$, and 0 otherwise.

First order inclusion probability: $\pi_k = \Pr(I_k = 1)$.

Second order inclusion probability: $\pi_{k\ell} = \Pr(I_k = 1 \ \& \ I_\ell = 1)$.

Notes:

• $I_k$: random variable
• $\pi_{kk} = \Pr(I_k^2 = 1) = \Pr(I_k = 1) = \pi_k$
• $n = \sum_{U} I_k(s)$

For a given p(s), one can prove:

• Expectation: $E(I_k) = \pi_k$

• Variance: $var(I_k) = \pi_k (1 - \pi_k)$

• Covariance: $C(I_k, I_\ell) = \pi_{k\ell} - \pi_k \pi_\ell = \Delta_{k\ell}$

Note: If $k = \ell$ then $C(I_k, I_k) = var(I_k)$.

Note: $\sum_k \pi_k = \sum_k E(I_k) = E(\sum_k I_k) = E(n)$.
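These properties of the indicators are what make the Horvitz-Thompson estimator unbiased and give its variance. The derivation below is a short sketch of the standard argument, assuming $\pi_k > 0$ for all $k \in U$; it writes the sum over the sample as a sum over the population with the help of $I_k$.

```latex
\hat{Y} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in U} I_k \, \frac{y_k}{\pi_k}

E(\hat{Y}) = \sum_{k \in U} E(I_k) \, \frac{y_k}{\pi_k}
           = \sum_{k \in U} \pi_k \, \frac{y_k}{\pi_k}
           = \sum_{k \in U} y_k = Y

var(\hat{Y}) = \sum_{k \in U} \sum_{\ell \in U} C(I_k, I_\ell) \, \frac{y_k}{\pi_k} \, \frac{y_\ell}{\pi_\ell}
             = \sum_{k, \ell \in U} (\pi_{k\ell} - \pi_k \pi_\ell) \, \frac{y_k}{\pi_k} \, \frac{y_\ell}{\pi_\ell}
```

Only the indicators $I_k$ are random in these expressions; the values $y_k$ and the probabilities $\pi_k$ are fixed numbers.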

Variability in Sampling

The values yk , k ∈ U, are not random.

Sample selection is the random part.

The sampling design p(.) is a probability on the set of samples.

The indicator $I_k = I_k(s)$ is a random variable.

The selection probability $\pi_k = E(I_k)$ is determined by the sampling design.

The estimator $\hat{\theta}(y_k, k \in s)$ is a random variable.

Fixed Size Sampling Design

A sampling design p(s) may lead to a fixed or random sample size n.

Two examples with $\pi_k = n/N$, $k \in U$ (equal probability sampling designs).

Example 1 (fixed size):
• generate a random number $\varepsilon_k$ in ]0, 1[ for each unit in U
• sort the list by the random number
• take the first n units of the sorted list.

Example 2 (Bernoulli, random size):
• generate a random number $\varepsilon_k$ in ]0, 1[ for each unit in U
• take all the units with $\varepsilon_k < n/N$.
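The difference between the two examples shows up as soon as the selection is repeated. The sketch below implements the Bernoulli scheme of Example 2 and records the realised sample sizes; the function name, the population of 100 units, the expected size of 20, the seed and the number of repetitions are all arbitrary assumptions chosen for illustration.

```python
import random


def bernoulli_sample(units, n_expected, rng):
    """Keep each unit independently when its uniform number is below n/N."""
    N = len(units)
    return [u for u in units if rng.random() < n_expected / N]


rng = random.Random(2008)          # arbitrary seed, for reproducibility only
units = list(range(1, 101))        # hypothetical population of N = 100 units

# Repeat the selection many times and record the realised sample sizes.
sizes = [len(bernoulli_sample(units, 20, rng)) for _ in range(2000)]
mean_size = sum(sizes) / len(sizes)

print(min(sizes), max(sizes), mean_size)
```

The realised size varies from one selection to the next, while its average stays close to $\sum_k \pi_k = 20$, in line with $E(n) = \sum_k \pi_k$.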

If $p(s)$ is a fixed size design, the general variance

$$var(\hat{Y}) = \sum_{k,\ell \in U} \Delta_{k\ell} \, \check{y}_k \, \check{y}_\ell$$

may also be written as (Yates, Grundy and Sen)

$$var(\hat{Y}) = -\frac{1}{2} \sum\sum_{U} \Delta_{k\ell} (\check{y}_k - \check{y}_\ell)^2$$

with the unbiased estimator, provided that $\pi_{k\ell} > 0$ for all $k, \ell \in U$,

$$\widehat{var}(\hat{Y}) = -\frac{1}{2} \sum\sum_{s} \check{\Delta}_{k\ell} (\check{y}_k - \check{y}_\ell)^2$$

Simple Random Sampling

In simple random sampling without replacement (SRS), every possible subset of n elements from a population U of N units has the same probability of being selected as the sample.

There are $\binom{N}{n} = \frac{N!}{n!(N-n)!}$ possible samples.

Therefore: $p(s) = 1/\binom{N}{n}$ if $s$ has n units, and $p(s) = 0$ otherwise.

Simple random sampling is the most basic form of probability sampling.

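Inclusion probabilities:

$$\pi_k = \sum_{s \ni k} p(s) = \binom{N-1}{n-1} \Big/ \binom{N}{n} = \frac{n}{N}, \quad k = 1, .., N.$$

$$\pi_{k\ell} = \sum_{s \ni k,\ell} p(s) = \binom{N-2}{n-2} \Big/ \binom{N}{n} = \frac{n}{N} \, \frac{n-1}{N-1}, \quad k \neq \ell = 1, .., N.$$

Sampling fraction:

$$f = n/N$$

Note: in simple random sampling with replacement, each draw selects unit $k$ with probability $1/N$, and two given draws select $k$ and $\ell$ with probability $1/N^2$. The definition of a sample is somewhat different as a unit may be selected more than once.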
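For a small population the inclusion probabilities can be checked by listing all possible samples. The sketch below does this for N = 5 and n = 3, values chosen only to keep the list short; the function name `enumerate_srs` is an assumption of this sketch.

```python
from itertools import combinations


def enumerate_srs(N, n):
    """All samples of size n from the units 1..N; under SRS each has probability 1/C(N, n)."""
    return list(combinations(range(1, N + 1), n))


N, n = 5, 3
samples = enumerate_srs(N, n)
p = 1 / len(samples)  # p(s), identical for every sample of size n

# First order inclusion probability of unit 1: sum of p(s) over samples containing it.
pi_1 = sum(p for s in samples if 1 in s)

# Second order inclusion probability of units 1 and 2.
pi_12 = sum(p for s in samples if 1 in s and 2 in s)

print(len(samples), pi_1, pi_12)
print(n / N, n * (n - 1) / (N * (N - 1)))
```

The enumerated values agree with $n/N$ and $\frac{n}{N}\frac{n-1}{N-1}$; by symmetry the same holds for every unit and every pair of units.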

H-T Estimator (SRS): Population

H-T estimator of the total $Y = \sum_{k \in U} y_k$:

$$\hat{Y} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} w_k y_k = (N/n) \sum_{k \in s} y_k$$

H-T estimator of the mean $\bar{Y} = Y/N$:

$$\hat{\bar{Y}} = \hat{Y}/N = \sum_{k \in s} y_k / n =: \bar{y}_s$$

H-T estimator of a proportion $p_a = N_a/N$:

$$\hat{p}_a = (1/N) \sum_{k \in s} (N/n) \, y_{ak} = \sum_{k \in s} y_{ak}/n = n_a/n$$

where $y_{ak} = 1$ if $k \in a$, and 0 otherwise.

H-T Estimator (SRS): Domain

Let $U_d \subset U$ be a specific domain or subset.

Indicator variable: $z_{dk} = 1$ if $k \in U_d$, and 0 otherwise.

                True value                                  Estimator
Size            $N_d = \sum_{k \in U} z_{dk}$               $\hat{N}_d = \sum_{k \in s} (N/n) z_{dk}$
Total           $Y_d = \sum_{k \in U} y_k z_{dk}$           $\hat{Y}_d = \sum_{k \in s} (N/n) y_k z_{dk}$
Mean            $\bar{Y}_d = Y_d/N_d$                       $\hat{\bar{Y}}_d = \hat{Y}_d/\hat{N}_d$
Size of a       $N_{ad} = \sum_{k \in U} y_{ak} z_{dk}$     $\hat{N}_{ad} = \sum_{k \in s} (N/n) y_{ak} z_{dk}$
Prop. in a      $p_{da} = N_{ad}/N_d$                       $\hat{p}_{da} = \hat{N}_{ad}/\hat{N}_d$

where $y_{ak} = 1$ if $k \in a$, and 0 otherwise.
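The domain estimators of the table can be sketched as follows. The function name and the data (four sampled values, a domain indicator, N = 20) are made up for illustration; they are not taken from the course material.

```python
def srs_domain_estimates(y_s, z_s, N):
    """Domain size, total and mean estimated from an SRS of size n = len(y_s).

    z_s[i] is 1 when sample unit i belongs to the domain and 0 otherwise."""
    n = len(y_s)
    w = N / n                                   # SRS sampling weight
    size_hat = sum(w * z for z in z_s)
    total_hat = sum(w * y * z for y, z in zip(y_s, z_s))
    return {"size": size_hat, "total": total_hat, "mean": total_hat / size_hat}


# Made-up SRS of n = 4 units from N = 20; three sampled units fall in the domain.
y_s = [100.0, 200.0, 300.0, 400.0]
z_s = [1, 0, 1, 1]
dom_hat = srs_domain_estimates(y_s, z_s, N=20)
print(dom_hat)
```

Note that the estimated domain mean is a ratio of two estimators: the weight N/n cancels, and the result is simply the mean of the sampled units that fall in the domain.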

Variance in SRS Sampling

Variability: sample s.

One can show that the variance of the H-T estimator of the mean $\hat{\bar{Y}} = \bar{y}_s$ is:

$$var(\bar{y}_s) = \left(1 - \frac{n}{N}\right) \frac{1}{n} S^2 = (1 - f) \frac{1}{n} S^2$$

with

$$S^2 = \frac{1}{N-1} \sum_{k \in U} (y_k - \bar{Y})^2$$

and $f = n/N$.
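The variance formula can be checked numerically on a small population by enumerating all possible samples, computing the sample mean of each, and taking the variance of these means over the sampling design. The five population values below are made up for this sketch.

```python
from itertools import combinations

# Made-up population of N = 5 values and sample size n = 3.
y = [3.0, 7.0, 8.0, 12.0, 20.0]
N, n = len(y), 3

pop_mean = sum(y) / N
S2 = sum((v - pop_mean) ** 2 for v in y) / (N - 1)

# Sample mean of each of the C(5, 3) = 10 equally likely samples.
means = [sum(s) / n for s in combinations(y, n)]

# Expectation and variance of the sample mean over the sampling design.
expected_mean = sum(means) / len(means)
design_var = sum((m - expected_mean) ** 2 for m in means) / len(means)

formula_var = (1 - n / N) * S2 / n
print(expected_mean, pop_mean)
print(design_var, formula_var)
```

The expectation of the sample mean equals the population mean (unbiasedness), and the variance over all samples coincides with $(1 - f) S^2 / n$.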

Simple random sample of size n selected in a population of size N.

$$var(\bar{y}_s) = \left(1 - \frac{n}{N}\right) \frac{1}{n} S^2 = (1 - f) \frac{1}{n} S^2$$

• What is the precision in case of a census?

• Is a sample of size n = 1 000 selected in a population of size N = 50 000 more precise than a sample of size n = 1 000 selected in a population of size N = 5 000 000?

• Is a sample of size n = 1 000 selected in a homogeneous population more precise than a sample of the same size selected in a non-homogeneous population?

• Which parameters may be controlled by the statistician?

Coefficient of variation of the estimated mean $\bar{y}_s$:

$$CV(\bar{y}_s) = \sqrt{var(\bar{y}_s)} / E(\bar{y}_s) = \sqrt{var(\bar{y}_s)} / \bar{Y}$$

Variance of the estimator of the total $\hat{Y}$:

$$var(\hat{Y}) = N^2 \, var(\bar{y}_s) = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} S^2$$

Variance of the estimator of the proportion $p_a$:

$$var(\hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N-1}\right) p_a (1 - p_a)$$

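Estimator of the Variance in SRS Sampling

$$\widehat{var}(\bar{y}_s) = \left(1 - \frac{n}{N}\right) \frac{1}{n} s^2 \quad \text{with} \quad s^2 = \frac{1}{n-1} \sum_{k \in s} (y_k - \bar{y}_s)^2$$

$$\widehat{CV}(\bar{y}_s) = \sqrt{\widehat{var}(\bar{y}_s)} \, / \, \bar{y}_s$$

$$\widehat{var}(\hat{Y}) = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} s^2$$

$$\widehat{var}(\hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{n}{n-1}\right) \hat{p}_a (1 - \hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{\hat{p}_a (1 - \hat{p}_a)}{n-1}$$

The variance estimators above are unbiased.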
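In practice only the sample is available, and the estimators above are computed from it. The following sketch collects them in one function; the function name, the dictionary keys and the sample values (n = 4 out of N = 20) are assumptions made for illustration.

```python
from math import sqrt


def srs_estimates(y_s, N):
    """Mean, total, estimated variances and CV from a simple random sample."""
    n = len(y_s)
    mean = sum(y_s) / n
    s2 = sum((y - mean) ** 2 for y in y_s) / (n - 1)
    var_mean = (1 - n / N) * s2 / n            # estimated variance of the mean
    return {
        "mean": mean,
        "total": N * mean,
        "var_mean": var_mean,
        "var_total": N ** 2 * var_mean,        # estimated variance of the total
        "cv": sqrt(var_mean) / mean,           # same CV for the mean and the total
    }


# Made-up sample of n = 4 values from a population of N = 20 units.
est = srs_estimates([4.0, 6.0, 8.0, 10.0], N=20)
print(est)
```

The estimated coefficient of variation is the same for the mean and for the total, because the factor N cancels in the ratio.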

Confidence Interval

Let $\hat{Y}$ be the estimator of the unknown total $Y$.

A confidence interval for $Y$ at the approximate level $1 - \alpha$ is computed as:

$$\left[\hat{Y} - z_{1-\alpha/2} \sqrt{\widehat{var}(\hat{Y})}, \ \hat{Y} + z_{1-\alpha/2} \sqrt{\widehat{var}(\hat{Y})}\right]$$

where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the $N(0, 1)$ distribution.

Usually $\alpha = 5\%$ ($z_{1-\alpha/2} = 1.96$) or $\alpha = 1\%$ ($z_{1-\alpha/2} = 2.58$).

Note: if $S$ is estimated by $s$, we usually use Student's t with $n - 1$ degrees of freedom.
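A minimal sketch of the interval computation is given below. It uses the normal quantile from Python's standard library; the function name and the numerical inputs (an estimated total of 140 with estimated variance 1600/3) are made up for illustration, and the Student's t variant mentioned in the note is not covered.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(estimate, var_estimate, alpha=0.05):
    """Approximate (1 - alpha) confidence interval based on the normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 5%
    half_width = z * sqrt(var_estimate)
    return estimate - half_width, estimate + half_width


# Made-up estimated total and estimated variance.
low, high = confidence_interval(140.0, 1600.0 / 3.0)
low99, high99 = confidence_interval(140.0, 1600.0 / 3.0, alpha=0.01)
print(low, high)
print(low99, high99)
```

Lowering $\alpha$ from 5% to 1% widens the interval, since the quantile grows from about 1.96 to about 2.58.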

The confidence interval will contain the unknown total $Y$ for approximately $100(1 - \alpha)\%$ of repeated samples $s$ drawn with the design $p(s)$ if:

• the distribution of $\hat{Y}$ is approximately $N(Y, var(\hat{Y}))$, and
• there exists a consistent variance estimator $\widehat{var}(\hat{Y})$ of $var(\hat{Y})$.

The same procedure is applied for the other finite population parameters such as means or proportions.

Relation Between Sample Size and Variance

Sample $s$ of size n selected in the population U of size N by simple random sampling (SRS).

(1) We estimate the mean $\bar{Y}$ by $\bar{y}_s$.

Relation between the size n and the precision $CV(\bar{y}_s)$:

$$CV^2(\bar{y}_s) = \frac{var(\bar{y}_s)}{\bar{Y}^2} = \left(1 - \frac{n}{N}\right) \frac{1}{n} \frac{S^2}{\bar{Y}^2} = \left(1 - \frac{n}{N}\right) \frac{1}{n} CV_y^2$$

$$n = \left(\frac{CV^2(\bar{y}_s)}{CV_y^2} + \frac{1}{N}\right)^{-1}$$

If $(1 - \frac{n}{N}) \approx 1$ then $n \approx CV_y^2 / CV^2(\bar{y}_s) = n_0$.

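(2) We estimate the proportion $p_a$ by $\hat{p}_a$.

Relation between the size n and the precision $var(\hat{p}_a)$:

$$var(\hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N-1}\right) p_a (1 - p_a)$$

$$n = \left(\frac{N-1}{N} \, \frac{var(\hat{p}_a)}{p_a (1 - p_a)} + \frac{1}{N}\right)^{-1} \approx \left(\frac{var(\hat{p}_a)}{p_a (1 - p_a)} + \frac{1}{N}\right)^{-1}$$

If $(1 - \frac{n}{N}) \approx 1$ then $n \approx p_a (1 - p_a) / var(\hat{p}_a) = n_0$.

Note: $p_a (1 - p_a) \leq 0.25$. Therefore: $n \leq 1/(4 \cdot var(\hat{p}_a))$.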
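The relation for proportions can be turned into a small sample size calculator. In the sketch below the target variance is derived from a desired half-width of the confidence interval, using half-width $= z_{1-\alpha/2} \sqrt{var(\hat{p}_a)}$; the function name, its parameters and the example values (p = 0.5, half-width 3 percentage points, N = 20 000) are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_proportion(p, margin, N, alpha=0.05):
    """Sample size for estimating a proportion p with a confidence interval
    of half-width `margin` under SRS; returns (n0, n), where n0 ignores the
    finite population correction."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    target_var = (margin / z) ** 2              # required var(p_hat)
    n0 = p * (1 - p) / target_var
    n = 1 / ((N - 1) / N * target_var / (p * (1 - p)) + 1 / N)
    return ceil(n0), ceil(n)


# Hypothetical requirement: p around 0.5, half-width 0.03, population of 20 000.
n0, n = sample_size_proportion(p=0.5, margin=0.03, N=20000)
print(n0, n)
```

Taking p = 0.5 is the conservative choice, since $p_a(1 - p_a)$ is then at its maximum of 0.25; the finite population correction reduces the required size only slightly unless n is a noticeable fraction of N.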

References

• Särndal, C.-E., Swensson, B., and Wretman, J. (1997). Model Assisted Survey Sampling. Springer Series in Statistics. Chapters 1-2.

• Lohr, S.L. (1999). Sampling: Design and Analysis. Duxbury Press. Chapters 1-2.

• Cochran, W.G. (1977). Sampling Techniques. John Wiley & Sons, Inc. Chapters 1-4.

Survey methodology and sampling techniques Exercises

Basic concepts of survey sampling, the Horvitz-Thompson estimators and Simple Random Sampling

Jean-Pierre Renfer, Paul-André Salamin
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008

Contents

Exercise 1: Population and samples
Exercise 2: Horvitz-Thompson estimators
Exercise 3: Variance estimation
Exercise 4: Opinion poll
Exercise 5: Rented flats data
SAS-Code

Exercise 1: Population and samples

The population is given by N = 5 elements U = {1, 2, 3, 4, 5} with the values {12, 24, 15, 10, 30} for the variable y.

1. Calculate the total $Y$, the mean $\bar{Y}$, the variance $S^2$, the standard deviation $S$ and the coefficient of variation $CV$ of the population.

2. How many different samples of size 4 could be drawn from U?

3. Enumerate all possible samples of size 4. In how many samples is each element k = 1,...,N included?

Exercise 2: Horvitz-Thompson estimators

Consider the population of exercise 1. Four sets of random numbers, $\varepsilon_{1,k}, ..., \varepsilon_{4,k}$, were independently and uniformly generated (Unif(0, 1)) for the five elements $k \in U$, cf. Table 1.

Table 1: Four sets of five uniformly generated random numbers.

k    ε1,k    ε2,k    ε3,k    ε4,k
1    0.640   0.461   0.421   0.209
2    0.094   0.346   0.870   0.348
3    0.337   0.214   0.310   0.003
4    0.755   0.408   0.774   0.039
5    0.027   0.004   0.947   0.656

1. Choose one set of random numbers and draw a SRS of size n = 4 using the algorithm introduced in the course.

2. Estimate the mean $\bar{y}_s$ and the total $\hat{Y}$ for the population with the drawn sample.

Exercise 3: Variance estimation

1. Variance of the Horvitz-Thompson estimator: Estimate the variances and coefficients of variation of the estimated mean $\bar{y}_s$ and total $\hat{Y}$ you calculated in exercise 2.

2. (∗) Variance of the estimator of the proportion: Show that under SRS

$$var(\hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N-1}\right) p_a (1 - p_a).$$

Hint: use the formula $var(\bar{y}_s) = \left(1 - \frac{n}{N}\right) \frac{1}{n} S^2$ with the indicator variable $y_{ak} = 1$ if $k \in a$, and 0 otherwise, and expand $S^2$.

Exercise 4: Opinion poll

You are interested in estimating the proportion of people who will vote for a particular politician in a population of 10 million people. A simple random sample of size 100 was drawn. 20% of the sampled people intended to vote for the politician, 60% did not, and 20% did not have any opinion on the matter.

1. Estimate the standard deviations and the 95% confidence intervals for the estimated proportions of favorable and unfavorable votes. Which proportion is estimated with a better precision?

2. Calculate the sample size needed if the favorable votes should be estimated with a maximal deviation of ±1% and α = 5%, i.e. $CI(\hat{p}_a) = [\hat{p}_a \pm 1\%]$.

3. Same as 2. but for the unfavorable votes.

4. Comment on 2. and 3.

5. Same as 2. but if 20% were favorable and 80% unfavorable votes (0% no opinion).

Exercise 5: Rented flats data

The administrative authorities of a small town want to conduct a survey to estimate the mean rent of its flats. The authorities have a list of all flats with the number of rooms, the availability of a balcony and the surface of the flat.

List of variables of the rented flats data 'flats_samp':
ID: flat number in the register
ROOMS: number of rooms (1, 2,..., 6)
SURFACE: surface in m²
RENT: net monthly rent in €
BALCONY: existence of a balcony (BALCONY=1) or no balcony (BALCONY=0)
RESP: potential response behaviour (RESP=1: response, RESP=0: nonresponse) - not used in this exercise.

You are asked to execute the given SAS-code in these exercises. Therefore, you do not have to program in SAS but should adapt the code where asked. In this exercise we assume full response.

1. Assumption: All data, even the rent, are known for the whole population. First look at the data from the population: comment on the descriptive statistics calculated with SAS.

2. Assumption: Only the list of flats (ID) is known for the entire population. The other variables are known only for the samples drawn in the exercise. Based on the sample we aim to estimate the population characteristics:

(a) Select a simple random sample samp1 of 30 sampling units with the provided seed and another SRS, samp2, of the same size but with your own seed.
(b) Estimate the monetary mass of the rents and the mean rent, their variances, coefficients of variation and the respective confidence intervals for the mean and the total for both samples. Is it possible that the true value calculated in question 1 is outside of the confidence intervals? Explain.
(c) Estimate for both samples the proportion of flats with balcony.
(d) Report your estimates for samp2 in the file of the trainer.
(e) What does it mean to estimate the mean rent for the flats with and without balcony separately (use the terminology of the course)?

SAS-Code

/* Exercise 5: Rented flats data */

/* 1. Descriptive statistics. */
title 'Empirical distribution of the balconies and the rooms';
proc freq data=rents.flats_samp;
  tables balcony rooms;
run;

title 'Moments, quantiles and of the surface and the rent';
proc univariate data=rents.flats_samp;
  var surface rent; ;
  /* store the population characteristics in a data set */
  output out=sasuser.population_charact
    n=N
    mean=mean_surface mean=mean_rent
    sum=sum_surface sum=sum_rent
    std=std_surface std=std_rent
    var=var_surface var=var_rent
    cv=cv_surface cv=cv_rent;
run;

goptions reset=symbol axis;
title 'Scatterplot of the surface and the rent';
proc gplot data=rents.flats_samp;
  plot surface*rent;
run; quit;

goptions reset=axis;
symbol interpol=box co=blue bwidth=6 cv=green value=dot height=0.5;
axis1 label=none value=(t=1 '');

title 'Box-plot of the surface';
proc gplot data=rents.flats_samp;
  plot surface*dummy/haxis=axis1;
run; quit;

title 'Box-plot of the rent';
proc gplot data=rents.flats_samp;
  plot rent*dummy/haxis=axis1;
run; quit;

title 'Moments and quantiles of the surface by rooms and balcony';
proc means data=rents.flats_samp mean n;
  class rooms balcony;
  var surface;
  ways 1;   /* one-way classifications only: by rooms, then by balcony */
run;

title 'Moments and quantiles of the rent by rooms and balcony';
proc means data=rents.flats_samp mean n;
  class rooms balcony;
  var rent;
  ways 1;
run;

axis2 offset=(10,10) major=(n=2) minor=none;

title 'Box-plot of the surface by balcony';
proc gplot data=rents.flats_samp;
  plot surface*balcony/haxis=axis2;
run; quit;

title 'Box-plot of the rent by balcony';
proc gplot data=rents.flats_samp;
  plot rent*balcony/haxis=axis2;
run; quit;

axis2 offset=(10,10) major=(n=6) minor=none;

title 'Box-plot of the surface by rooms';
proc gplot data=rents.flats_samp;
  plot surface*rooms/haxis=axis2;
run; quit;

title 'Box-plot of the rent by rooms';
proc gplot data=rents.flats_samp;
  plot rent*rooms/haxis=axis2;
run; quit;

goptions reset=axis;

/* 2.(a) Selection of simple random samples. */
proc sort data=rents.flats_samp out=sasuser.frame;
  by id;
run;

/* Sample with a given seed (53437) */
title 'SRS sample of the rents data with n=30, seed=53437';
proc surveyselect data=sasuser.frame stats method=srs n=30
  seed=53437
  out=sasuser.samp1 (label="sample 1");
run;

/* what does it look like? */
proc print data=sasuser.samp1;
run;

/* Sample with your own seed */
proc sort data=rents.flats_samp out=sasuser.frame;
  by id;
run;

title 'SRS sample of the rents data with n=30, seed=[XXXX]'; /* <- choose your seed */
proc surveyselect data=sasuser.frame stats method=srs n=30
  seed=[XXXX] /* <- choose your seed */
  out=sasuser.samp2 (label='sample 2');
run;

/* what does it look like? */
proc print data=sasuser.samp2;
run;

/* 2.(b) Estimations of the mean and total rent with the Horvitz-Thompson estimators. */

/* Sample 1: */
title 'Estimation based on samp1: total and mean';
proc surveymeans data=sasuser.samp1 total=151
  sum std varsum cvsum clsum mean stderr var cv clm;
  var rent;
  weight SamplingWeight;
  ods output statistics=sasuser.tot_mean_samp1;
run;

/* Sample 2: */
title 'Estimation based on samp2: total and mean';
proc surveymeans data=sasuser.samp2 total=151
  sum std varsum cvsum clsum mean stderr var cv clm;
  var rent;
  weight SamplingWeight;
  ods output statistics=sasuser.tot_mean_samp2;
run;

/* 2.(c) Estimations of the proportion of balconies with the Horvitz-Thompson estimators. */

/* Sample 1: */
title 'Estimation based on samp1: proportion of balconies';
proc surveymeans data=sasuser.samp1 total=151 mean var stderr clm;
  class balcony;
  var balcony;
  weight SamplingWeight;
  ods output statistics=sasuser.balcony_prop_samp1;
run;

/* Sample 2: */
title 'Estimation based on samp2: proportion of balconies';
proc surveymeans data=sasuser.samp2 total=151 mean var stderr clm;
  class balcony;
  var balcony;
  weight SamplingWeight;
  ods output statistics=sasuser.balcony_prop_samp2;
run;
title2 '';

/* 2.(e) Estimations of the mean rent by balcony with the Horvitz-Thompson estimators. */

/* Sample 1: */
title 'Estimation based on samp1: mean rent by BALCONY';
proc surveymeans data=sasuser.samp1 total=151 mean var stderr cv clm;
  var rent;
  domain balcony;
  weight SamplingWeight;
  ods output domain=sasuser.balcony_dom_samp1;
run;

/* Sample 2: */
title 'Estimation based on samp2: mean rent by BALCONY';
proc surveymeans data=sasuser.samp2 total=151 mean var stderr cv clm;
  var rent;
  domain balcony;
  weight SamplingWeight;
  ods output domain=sasuser.balcony_dom_samp2;
run;

Survey methodology and sampling techniques Answers to the Exercises

Basic concepts of survey sampling, the Horvitz-Thompson estimators and Simple Random Sampling

Jean-Pierre Renfer, Paul-André Salamin
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008


Exercise 1: Population and samples

1. $N = 5$, $Y = 91$, $\bar{Y} = 18.2$, $S^2 = 72.2$, $S = 8.5$, $CV = 46.7\%$.

2. $\binom{N}{n} = \frac{N!}{n!(N-n)!} = \binom{5}{4} = 5$.

3. There are $\binom{N-1}{n-1} = \binom{4}{3} = 4$ samples that include element $k \in U$.

List of samples of size 4

i    s_i
1    {1,2,3,4}
2    {1,2,3,5}
3    {1,2,4,5}
4    {1,3,4,5}
5    {2,3,4,5}

Exercise 2: Horvitz-Thompson estimators

1. Selection algorithm for selecting a SRS of size n from the population U:

(a) generate a random number $\varepsilon_k$ in ]0,1[ for each unit k in U
(b) sort the list by the random number
(c) take the first n units of the sorted list

Compare with the table below for the selected samples.

2. Formulas used:
mean: $\bar{y}_s = \sum_{k \in s} y_k / n$
total: $\hat{Y} = N \bar{y}_s = \sum_{k \in s} (N/n) y_k$

random number set ℓ   i   s_i           mean    total
1                     2   {1,2,3,5}     20.3    101.3
2                     5   {2,3,4,5}     19.8    98.8
3                     1   {1,2,3,4}     15.3    76.3
4                     1   {1,2,3,4}     15.3    76.3
-                     3   {1,2,4,5}     19.0    95.0
-                     4   {1,3,4,5}     16.8    83.8
population                {1,2,3,4,5}   18.20   91.00

Different samples result in different estimates. However, none of the results are "very far" from the population values $Y = 91$ and $\bar{Y} = 18.2$.

Exercise 3: Variance estimation

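1. Variance of the Horvitz-Thompson estimator

Note that

• $\widehat{var}(\hat{Y}) = N^2\, \widehat{var}(\bar{y}_s) = 25 \cdot \widehat{var}(\bar{y}_s) = 25 \left(1 - \frac{n}{N}\right) \frac{1}{n(n-1)} \sum_{k \in s} (y_k - \bar{y}_s)^2$

• $\widehat{CV}(\hat{Y}) = \frac{\sqrt{\widehat{var}(\hat{Y})}}{\hat{Y}} = \frac{\sqrt{N^2\, \widehat{var}(\bar{y}_s)}}{N \bar{y}_s} = \frac{\sqrt{\widehat{var}(\bar{y}_s)}}{\bar{y}_s} = \frac{\widehat{std}(\bar{y}_s)}{\bar{y}_s} = \widehat{CV}(\bar{y}_s)$.

random number set $\ell$    i    s_i            $\bar{y}_s$    $\hat{Y}$    $\widehat{var}(\bar{y}_s)$    $\widehat{var}(\hat{Y})$    $\widehat{CV}(\hat{Y})$ [%]
1                           2    {1,2,3,5}      20.3           101.3        3.4                           85.3                        9.1
2                           5    {2,3,4,5}      19.8           98.8         4.0                           100.3                       10.1
3                           1    {1,2,3,4}      15.3           76.3         1.9                           47.8                        9.1
4                           1    {1,2,3,4}      15.3           76.3         1.9                           47.8                        9.1
-                           3    {1,2,4,5}      19.0           95.0         4.6                           115.0                       11.3
-                           4    {1,3,4,5}      16.8           83.8         4.1                           102.8                       12.1
population                       {1,2,3,4,5}    18.2           91.0         0.0                           0.0                         0.0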
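The variance and CV estimates in the table can be reproduced with a short Python sketch. The individual y-values are not listed in this section; the dictionary below holds the values implied by the tabulated sample means (an inference that should be checked against the exercise sheet), and the function name `srs_estimates` is an assumption.

```python
from math import sqrt

# y-values implied by the tabulated sample means (inferred, not quoted from the text)
y = {1: 12, 2: 24, 3: 15, 4: 10, 5: 30}
N = len(y)

def srs_estimates(sample):
    """Return mean, total, var(mean), var(total) and CV estimates for an SRS."""
    n = len(sample)
    values = [y[k] for k in sample]
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)   # sample variance
    var_mean = (1 - n / N) * s2 / n                        # with finite population correction
    var_total = N ** 2 * var_mean
    cv = sqrt(var_mean) / mean                             # same CV for mean and total
    return mean, N * mean, var_mean, var_total, cv

for sample in ([1, 2, 3, 5], [2, 3, 4, 5], [1, 2, 3, 4], [1, 2, 4, 5], [1, 3, 4, 5]):
    mean, total, var_mean, var_total, cv = srs_estimates(sample)
    print(sample, round(mean, 2), round(total, 2),
          round(var_mean, 2), round(var_total, 2), round(100 * cv, 1))
```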

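2. Variance of the estimator of the proportion

To show: $var(\hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \frac{N}{N-1}\, p_a (1 - p_a)$.

With $y_{ak} = 1$ if k has modality a and 0 otherwise, we have: $var(\hat{p}_a) = var(\bar{y}_a) = \left(1 - \frac{n}{N}\right) \frac{1}{n} S^2$. Furthermore,

$S^2 = \frac{1}{N-1} \sum_{k=1}^{N} (y_{ak} - \bar{Y}_a)^2$
$\quad = \frac{1}{N-1} \sum_{k=1}^{N} (y_{ak}^2 - 2 y_{ak} \bar{Y}_a + \bar{Y}_a^2)$
$\quad = \frac{1}{N-1} \left[ \sum_{k=1}^{N} y_{ak}^2 - 2 \bar{Y}_a \sum_{k=1}^{N} y_{ak} + \sum_{k=1}^{N} \bar{Y}_a^2 \right]$
$\quad = \frac{1}{N-1} \left[ \sum_{k=1}^{N} y_{ak}^2 - 2 N \bar{Y}_a^2 + N \bar{Y}_a^2 \right]$
$\quad = \frac{1}{N-1} \left[ \sum_{k=1}^{N} y_{ak}^2 - N \bar{Y}_a^2 \right]$
$\quad = \frac{1}{N-1} \left[ N \bar{Y}_a - N \bar{Y}_a^2 \right]$ (since $y_{ak}^2 = y_{ak}$)
$\quad = \frac{N}{N-1} (p_a - p_a^2)$

Hence, $var(\hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left[ \frac{N}{N-1}\, p_a (1 - p_a) \right]$.

Note that an unbiased estimator $\widehat{var}(\hat{p}_a)$ of $var(\hat{p}_a)$ is produced by using the result above and estimating $S^2$ by $s^2 = \frac{1}{n-1} \sum_{k \in s} (y_{ak} - \bar{y}_s)^2 = \frac{n}{n-1}\, \hat{p}_a (1 - \hat{p}_a)$. This leads to the formula seen in the course:

$\widehat{var}(\hat{p}_a) = \left(1 - \frac{n}{N}\right) \frac{\hat{p}_a (1 - \hat{p}_a)}{n - 1}$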


Exercise 4: Opinion poll

1. Let A be the subset of the population with a favorable opinion, B the subset of unfavorable opinions and C the subset of the rest of the population. n = 100.

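• Favorable opinion, $\hat{p}_A = 0.2$
  $CI(\hat{p}_A) = \left[ \hat{p}_A \pm z_{1-\alpha/2} \sqrt{\widehat{var}(\hat{p}_A)} \right] = \left[ \hat{p}_A \pm z_{1-\alpha/2}\, \widehat{std}(\hat{p}_A) \right]$
  $\widehat{var}(\hat{p}_A) = \left(1 - \frac{n}{N}\right) \frac{1}{n-1}\, \hat{p}_A (1 - \hat{p}_A) \approx \frac{1}{n}\, \hat{p}_A (1 - \hat{p}_A) = 0.01 \cdot 0.2 \cdot 0.8 = 0.0016$
  Standard deviation, favorable opinions: $\widehat{std}(\hat{p}_A) \approx 0.04$
  $CI(\hat{p}_A) \approx [0.2 \pm 2 \cdot 0.04] = [0.12; 0.28]$

• Unfavorable opinion, $\hat{p}_B = 0.6$
  $\widehat{var}(\hat{p}_B) \approx 0.01 \cdot 0.6 \cdot 0.4 = 2.4 \cdot 10^{-3}$
  Standard deviation, unfavorable opinions: $\widehat{std}(\hat{p}_B) \approx 0.05$
  $CI(\hat{p}_B) \approx [0.6 \pm 2 \cdot 0.05] = [0.50; 0.70]$

The estimate of the proportion of favorable opinions is more precise than the estimate of the proportion of unfavorable opinions.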

2. Favorable opinions: a narrower confidence interval corresponds to a lower variance. Therefore, we are looking for the estimated standard deviation such that
$2 \cdot \widehat{std}(\hat{p}_A) = 1\% \Rightarrow \widehat{std}(\hat{p}_A)^2 = 2.5 \cdot 10^{-5}$
Hence, neglecting the fpc: $n' \approx \frac{\hat{p}_A (1 - \hat{p}_A)}{\widehat{std}(\hat{p}_A)^2} = 6\,400$

3. Unfavorable opinions, neglecting the fpc: $n' \approx \frac{\hat{p}_B (1 - \hat{p}_B)}{\widehat{std}(\hat{p}_B)^2} = 9\,600$

4. The proportion of favorable opinions can be estimated with better precision than the proportion of unfavorable opinions. The reason is that the proportion of unfavorable opinions is closer to 50%, which is the proportion with the worst precision for a fixed sample size. Therefore, to achieve the same precision, a larger sample is needed for estimating the unfavorable opinions than for estimating the favorable opinions.

5. $\hat{p}_A = 0.2$, $\hat{p}_B = 0.8$
The standard deviation of $\hat{p}_B$ is equal to the standard deviation of $\hat{p}_A$ because $\hat{p}_B = 1 - \hat{p}_A$. Therefore, the confidence interval of $\hat{p}_B$ is $CI(\hat{p}_B) \approx [0.8 \pm 2 \cdot 0.04] = [0.72; 0.88]$.
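The sample sizes in questions 2 and 3 follow from solving $z \sqrt{p(1-p)/n} = $ half-width for n. A minimal Python sketch, assuming $z = 2$ as in the answers and neglecting the fpc (the function name `required_sample_size` is hypothetical):

```python
def required_sample_size(p, half_width, z=2.0):
    """Sample size n such that z * sqrt(p * (1 - p) / n) equals half_width (fpc neglected)."""
    target_std = half_width / z          # required standard deviation of the estimator
    return p * (1 - p) / target_std ** 2

n_favorable = round(required_sample_size(0.2, 0.01))
n_unfavorable = round(required_sample_size(0.6, 0.01))
print(n_favorable, n_unfavorable)
```

The required size is largest at p = 0.5, which is why the unfavorable proportion (0.6) needs more observations than the favorable one (0.2).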


Exercise 5: Rented flats data

1. Descriptive statistics

There are 151 flats, but only 6 with 6 rooms in the entire population. The SURFACE and RENT variables both have extreme values; RENT has more extreme ones than SURFACE. About 56% of the flats have a balcony. The mean surface and the mean rent are higher for flats with a balcony, and both increase with the number of rooms. However, the box plots show that the differences are probably not significant.


Empirical distribution of the balconies and the rooms

The FREQ Procedure

BALCONY    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0          67           44.37      67                      44.37
1          84           55.63      151                     100

ROOMS      Frequency    Percent    Cumulative Frequency    Cumulative Percent
1          29           19.21      29                      19.21
2          28           18.54      57                      37.75
3          33           21.85      90                      59.6
4          31           20.53      121                     80.13
5          24           15.89      145                     96.03
6          6            3.97       151                     100

Moments, quantiles and histograms of the surface and the rent

The UNIVARIATE Procedure Variable: SURFACE (SURFACE)

Moments
N                  151           Sum Weights         151
Mean               85.9403974    Sum Observations    12977
Std Deviation      37.2136591    Variance            1384.85642
Skewness           0.94429548    Kurtosis            2.05302437
Uncorrected SS     1322977       Corrected SS        207728.464
Coeff Variation    43.3017071    Std Error Mean      3.02840463

Basic Statistical Measures
Location              Variability
Mean      85.9404     Std Deviation          37.21366
Median    81          Variance               1385
Mode      100         Range                  232
                      Interquartile Range    45

Quantiles (Definition 5)
Quantile      Estimate
100% Max      250
99%           200
95%           145
90%           134
75% Q3        105
50% Median    81
25% Q1        60
10%           40
5%            35
1%            28
0% Min        18

Extreme Observations
Lowest            Highest
Value    Obs      Value    Obs
18       23       174      125
28       14       180      140
32       21       180      151
34       19       200      137
34       1        250      148

Moments, quantiles and histograms of the surface and the rent

The UNIVARIATE Procedure Variable: RENT (RENT)

Moments
N                  151           Sum Weights         151
Mean               723.794702    Sum Observations    109293
Std Deviation      410.660838    Variance            168642.324
Skewness           2.3940431     Kurtosis            8.92872207
Uncorrected SS     104402043     Corrected SS        25296348.6
Coeff Variation    56.7371987    Std Error Mean      33.4191051

Basic Statistical Measures
Location              Variability
Mean      723.7947    Std Deviation          410.66084
Median    641         Variance               168642
Mode      544         Range                  2782
                      Interquartile Range    468

Note: The mode displayed is the smallest of 2 modes with a count of 4.

Quantiles (Definition 5)
Quantile      Estimate
100% Max      2953
99%           2644
95%           1331
90%           1106
75% Q3        925
50% Median    641
25% Q1        457
10%           353
5%            306
1%            176
0% Min        171

Extreme Observations
Lowest            Highest
Value    Obs      Value    Obs
171      22       1706     125
176      3        1956     142
194      1        2234     150
272      19       2644     148
297      28       2953     151

Moments and quantiles of the surface and of the rent, by rooms and balcony

The MEANS Procedure

Analysis Variables: SURFACE and RENT, by BALCONY

BALCONY    N Obs    N     Mean SURFACE    Mean RENT
0          67       67    66.5820896      580.791045
1          84       84    101.3809524     837.857143

Analysis Variables: SURFACE and RENT, by ROOMS

ROOMS    N Obs    N     Mean SURFACE    Mean RENT
1        29       29    42.2758621      377.896552
2        28       28    63.0357143      569.535714
3        33       33    82.5454545      623.939394
4        31       31    98.9677419      797.032258
5        24       24    134.2083333     1105.96
6        6        6     162.1666667     1757.67


2. Estimation

(a) Two SRS of size 30: see the SAS output below. As expected, different seeds result in different samples.

(b) Different samples lead to different estimates. All estimated confidence intervals based on samp1 cover the true values. At the level $1 - \alpha$, the confidence interval will contain the unknown total $Y$ for approximately $100(1 - \alpha)\%$ of repeated samples $s$ drawn with the design $p(s)$ if
• the sampling distribution of $\hat{Y}$ is approximately $N(Y, var(\hat{Y}))$, and
• there is a consistent variance estimator $\widehat{var}(\hat{Y})$ of $var(\hat{Y})$.
However, in a real survey it is unknown whether the true value is covered. See the SAS output below.

(c) Estimation of a proportion: cf. the SAS output below. The true values are covered by the confidence intervals.


2.(a) SRS sample of the rents data with n=30, seed=53437

The SURVEYSELECT Procedure

Selection Method Simple Random Sampling

Input Data Set           FRAME
Random Number Seed       53437
Sample Size              30
Selection Probability    0.198675
Sampling Weight          5.033333
Output Data Set          SAMP1

SRS sample of the rents data with n=30, seed=53437

Obs   Row   ID    ROOMS   SURFACE   RENT   BALCONY   RESP   SelectionProb
1     58    426   3       79        399    0         1      0.1987
2     122   434   5       129       650    1         1      0.1987
3     95    440   4       100       647    1         1      0.1987
4     5     444   1       60        319    0         0      0.1987
5     97    446   4       89        420    1         1      0.1987
6     98    447   4       89        423    1         0      0.1987
7     6     459   1       35        322    0         1      0.1987
8     8     468   1       43        300    0         1      0.1987
9     146   473   6       135       978    1         0      0.1987
10    102   477   4       99        839    0         1      0.1987
11    39    479   2       100       484    1         1      0.1987
12    148   486   6       250       2644   1         1      0.1987
13    11    487   1       37        363    0         1      0.1987
14    106   491   4       125       406    1         0      0.1987
15    41    494   2       40        378    0         1      0.1987
16    73    495   3       80        874    1         1      0.1987
17    132   502   5       143       1200   1         0      0.1987
18    133   503   5       120       1216   1         0      0.1987
19    76    509   3       79        651    0         1      0.1987
20    78    511   3       105       967    1         1      0.1987
21    21    529   1       32        571    0         1      0.1987
22    48    532   2       70        575    0         1      0.1987
23    82    535   3       80        575    1         1      0.1987
24    114   538   4       85        913    0         0      0.1987
25    52    550   2       60        797    0         0      0.1987
26    85    551   3       87        656    0         1      0.1987
27    116   552   4       90        944    0         0      0.1987
28    28    571   1       35        297    0         1      0.1987
29    89    573   3       89        788    1         0      0.1987
30    144   575   5       121       925    1         0      0.1987


SRS sample of the rents data with n=30, seed=2007

The SURVEYSELECT Procedure

Selection Method Simple Random Sampling

Input Data Set           FRAME
Random Number Seed       2007
Sample Size              30
Selection Probability    0.198675
Sampling Weight          5.033333
Output Data Set          SAMP2

2.(a) SRS sample of the rents data with n=30, seed=2007

Obs   Row   ID    ROOMS   SURFACE   RENT   BALCONY   RESP   SelectionProb
1     30    427   2       50        306    0         1      0.1987
2     3     438   1       50        176    1         1      0.1987
3     60    443   3       79        353    0         0      0.1987
4     62    451   3       97        685    0         1      0.1987
5     66    461   3       75        544    0         1      0.1987
6     128   467   5       115       1028   1         0      0.1987
7     9     469   1       65        544    1         1      0.1987
8     70    475   3       75        325    0         1      0.1987
9     102   477   4       99        839    0         1      0.1987
10    71    480   3       100       613    0         1      0.1987
11    103   481   4       90        544    1         1      0.1987
12    147   485   6       146       766    1         0      0.1987
13    72    490   3       79        763    1         1      0.1987
14    131   492   5       134       872    1         1      0.1987
15    14    498   1       28        406    0         1      0.1987
16    132   502   5       143       1200   1         0      0.1987
17    150   504   6       150       2234   1         0      0.1987
18    75    508   3       75        525    0         0      0.1987
19    77    510   3       75        531    0         0      0.1987
20    18    521   1       40        368    0         1      0.1987
21    79    523   3       88        457    1         1      0.1987
22    136   526   5       145       1250   1         0      0.1987
23    19    527   1       34        272    1         1      0.1987
24    20    528   1       45        331    0         1      0.1987
25    22    541   1       40        171    1         1      0.1987
26    25    547   1       45        484    1         1      0.1987
27    51    549   2       60        669    0         1      0.1987
28    139   556   5       125       1078   1         1      0.1987
29    141   558   5       140       1575   1         0      0.1987
30    53    560   2       58        944    0         1      0.1987


2.(b) Estimation based on samp1: total and mean

The SURVEYMEANS Procedure

Data Summary
Number of Observations    30
Sum of Weights            151

Statistics
Variable    Mean    Std Error of Mean    Var of Mean    95% CL for Mean    Coeff of Variation
RENT        717     73.85                5453.85        566    868         0.1029

Statistics
Variable    Sum       Std Dev     Var of Sum      95% CL for Sum     Coeff of Variation for Sum
RENT        108322    11151.00    124353328.00    85515    131130    0.1029

2.(b) Estimation based on samp2: total and mean

The SURVEYMEANS Procedure

Data Summary
Number of Observations    30
Sum of Weights            151

Statistics
Variable    Mean    Std Error of Mean    Var of Mean    95% CL for Mean    Coeff of Variation
RENT        695     72.91                5315.58        546    844         0.1049

Statistics
Variable    Sum       Std Dev     Var of Sum      95% CL for Sum     Coeff of Variation for Sum
RENT        104960    11009.00    121200588.00    82444    127476    0.1049

2.(c) Estimation based on samp1: proportion of balconies

The SURVEYMEANS Procedure

Data Summary
Number of Observations    30
Sum of Weights            151

Class Level Information
Class Variable    Label      Levels    Values
BALCONY           BALCONY    2         0 1

Statistics
Variable    Level    Label      Mean    Std Error of Mean    Var of Mean    95% CL for Mean
BALCONY     0        BALCONY    0.50    0.0831               0.0069         0.33    0.67
            1        BALCONY    0.50    0.0831               0.0069         0.33    0.67

2.(c) Estimation based on samp2: proportion of balconies

The SURVEYMEANS Procedure

Data Summary
Number of Observations    30
Sum of Weights            151

Class Level Information
Class Variable    Label      Levels    Values
BALCONY           BALCONY    2         0 1

Statistics
Variable    Level    Label      Mean    Std Error of Mean    Var of Mean    95% CL for Mean
BALCONY     0        BALCONY    0.47    0.0829               0.0069         0.30    0.64
            1        BALCONY    0.53    0.0829               0.0069         0.36    0.70


2. Estimation (continued)

(d) Estimation in domains.

Estimation based on samp1: mean rent by BALCONY

The SURVEYMEANS Procedure

Data Summary
Number of Observations    30
Sum of Weights            151

Statistics
Variable    Mean    Std Error of Mean    Var of Mean    95% CL for Mean    Coeff of Variation
RENT        717     73.85                5453.85        566    868         0.1029

Domain Analysis: BALCONY
BALCONY    Variable    Mean    Std Error of Mean    Var of Mean    95% CL for Mean    Coeff of Variation
0          RENT        555     53.36                2847.49        446    664         0.0962
1          RENT        880     126.70               16052.00       621    1139        0.1440

Estimation based on samp2: mean rent by BALCONY

The SURVEYMEANS Procedure

Data Summary
Number of Observations    30
Sum of Weights            151

Statistics
Variable    Mean    Std Error of Mean    Var of Mean    95% CL for Mean    Coeff of Variation
RENT        695     72.91                5315.58        546    844         0.1049

Domain Analysis: BALCONY
BALCONY    Variable    Mean    Std Error of Mean    Var of Mean    95% CL for Mean    Coeff of Variation
0          RENT        531     46.99                2207.98        435    627         0.0884
1          RENT        838     121.32               14718.00       590    1086        0.1447


Summary

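Population, Y = RENT
N              151
$\bar{Y}$      724
Y              109'293
$S^2$          168'642
S              411
$CV_Y$ [%]     56.7

Y = RENT                  samp1                 samp2
n                         30                    30
seed                      53437                 2007
$\bar{y}_s$               717                   695
$\hat{Y}$                 108'322               104'960
$\widehat{CV}$ [%]        10.29                 10.49
CI($\bar{y}_s$)           [566, 868]            [546, 844]
CI($\hat{Y}$)             [85'515, 131'130]     [82'444, 127'476]

proportion of balconies   samp1                 samp2
BALCONY=0                 0.50                  0.47
BALCONY=1                 0.50                  0.53
CI(BALCONY=0)             [0.33, 0.67]          [0.30, 0.64]
CI(BALCONY=1)             [0.33, 0.67]          [0.36, 0.70]

mean rent by BALCONY      BALCONY    samp1            samp2
$\bar{y}_s$               0          555              531
CI($\bar{y}_s$)           0          [446, 664]       [435, 627]
$\bar{y}_s$               1          880              838
CI($\bar{y}_s$)           1          [621, 1'139]     [590, 1'086]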

Survey Methodology and Sampling Techniques
Stratified sampling

Guillaume Chauvet, Eric Lesage

Institut national de la statistique et des études économiques

15 - 18 April 2008

This presentation is based on teaching documents by the CEPE (Insee)

ESTP/Survey methodology c INSEE 1

Learning outcomes

You will know:

• what stratified sampling is,
• why we use stratified sampling,
• how to calculate the sample sizes in each stratum (sample allocation),
• how to choose the strata,
• how to draw a stratified sample with SAS.

Outline

Principles and notation
Estimation of a total
  Estimation and precision
  Stratified sample with SRS in each stratum
Sample allocation between strata
  Proportional allocation
  Optimum allocation
  Alternative allocation
Construction of strata
SAS procedure: syntax and example

Principles and notation

Auxiliary information

We often have auxiliary information (supplementary information) that enables a better sampling design than SRS. For example:

• the gender in a social survey,
• the firm size in a business survey.

If the variable of interest has different mean values in the subpopulations, then stratified sampling will produce more precise estimates.

Stratified sampling: what for?

• a more representative sample (balanced),
• known precision on the subpopulations (domains of estimation),
• convenient for the field operations,
• more precise for the whole population (or lower cost).

Stratified sampling: what is it? 1/2

• we partition the population U into H non-overlapping subpopulations called "strata", denoted $U_1, ..., U_H$,
• H sub-samples $s_1, s_2, ..., s_H$ are drawn independently in the H strata (with a simple random sampling design, for example).

Remark 1: A good stratification of the population is obtained when the within-stratum variances $S_h^2$ are small.

Stratified sampling: what is it? 2/2

[Figure: the population U is partitioned into five strata U1, ..., U5, and a sub-sample s1, ..., s5 is drawn independently within each stratum.]

Notation 1/3

                                 Population                              Sample
Size of stratum h                $N_h$                                   $n_h$
Size of population/sample        $N = \sum_{h=1}^{H} N_h$                $n = \sum_{h=1}^{H} n_h$
Stratum total of Y               $Y_h = \sum_{k \in U_h} y_k$            $y_h = \sum_{k \in s_h} y_k$

Notation 2/3

                                 Population                              Sample
Population/sample total          $Y = \sum_{h=1}^{H} Y_h$                $y = \sum_{h=1}^{H} y_h$
Stratum mean                     $\bar{Y}_h = Y_h / N_h$                 $\bar{y}_h = y_h / n_h$

Notation 3/3

                                 Population                                                        Sample
Population/sample mean           $\bar{Y} = \sum_h \frac{N_h}{N} \bar{Y}_h$                        $\bar{y} = \sum_h \frac{n_h}{n} \bar{y}_h$
Stratum variance                 $S_h^2 = \frac{1}{N_h - 1} \sum_{k \in U_h} (y_k - \bar{Y}_h)^2$  $s_h^2 = \frac{1}{n_h - 1} \sum_{k \in s_h} (y_k - \bar{y}_h)^2$
Population/sample variance       $S^2 = \frac{1}{N - 1} \sum_{k \in U} (y_k - \bar{Y})^2$          $s^2 = \frac{1}{n - 1} \sum_{k \in s} (y_k - \bar{y})^2$

Analysis of variance decomposition

Overall variance = Within-strata variance + Between-strata variance

$S_Y^2 = \sum_{h=1}^{H} \frac{N_h - 1}{N - 1} S_h^2 + \sum_{h=1}^{H} \frac{N_h}{N - 1} (\bar{Y}_h - \bar{Y})^2$
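The decomposition can be checked numerically. The sketch below uses the five-unit population from the example at the end of these slides (values 100, 130, 200 in one stratum and 540, 570 in the other); the helper name `variance` is an assumption.

```python
def variance(values):
    """Variance with divisor (number of values - 1), as in the S^2 notation."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Two strata of the five-unit example population
strata = {1: [100, 130, 200], 2: [540, 570]}
all_values = [v for values in strata.values() for v in values]
N = len(all_values)
overall_mean = sum(all_values) / N

# Within-strata term: sum of (N_h - 1)/(N - 1) * S_h^2
within = sum((len(v) - 1) / (N - 1) * variance(v) for v in strata.values())
# Between-strata term: sum of N_h/(N - 1) * (stratum mean - overall mean)^2
between = sum(len(v) / (N - 1) * (sum(v) / len(v) - overall_mean) ** 2
              for v in strata.values())

print(variance(all_values), within, between)
```

In this population almost all of the variance is between strata, which is the situation in which stratification pays off most.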


Estimation of a total


Unbiased estimator (Horvitz-Thompson estimator) of the population total Y:

$\hat{Y}_{str} = \sum_{h=1}^{H} \hat{Y}_{\pi h} \neq N \bar{y}$

Remark 2: the stratified estimator is not necessarily equal to the SRS total estimator $N \bar{y}$.

Precision

The variance of the stratified estimator is:

$Var(\hat{Y}_{str}) = Var\left( \sum_{h=1}^{H} \hat{Y}_{\pi h} \right) = \sum_{h=1}^{H} Var(\hat{Y}_{\pi h})$

because the H sub-samples are drawn independently. The estimated variance of the stratified estimator is:

$\widehat{Var}(\hat{Y}_{str}) = \sum_{h=1}^{H} \widehat{Var}(\hat{Y}_{\pi h})$

Estimation

With simple random sampling in each stratum, we have:

$\hat{Y}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} N_h \left( \frac{1}{n_h} \sum_{k \in s_h} y_k \right) = \sum_{h=1}^{H} \sum_{k \in s_h} \frac{N_h}{n_h}\, y_k$

For each element in $s_h$, the weight is $\frac{N_h}{n_h} = \frac{1}{f_h}$. So, unless all sampling fractions $f_h$ are equal, we have unequal probability sampling.

Precision 1/2

With simple random sampling in each stratum, we have:

$Var(\hat{Y}_{str}) = N^2 \sum_{h=1}^{H} \left( \frac{N_h}{N} \right)^2 (1 - f_h) \frac{S_h^2}{n_h}$

The precision of the stratified estimator depends only on the variability of the variable within the strata ⇒ stratification is efficient if the $S_h^2$ are small.

Precision 2/2

Variance estimator:

$\widehat{Var}(\hat{Y}_{str}) = N^2 \sum_{h=1}^{H} \left( \frac{N_h}{N} \right)^2 (1 - f_h) \frac{s_h^2}{n_h}$

This is an unbiased estimator.
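The estimator of the total and its variance estimator can be computed stratum by stratum, as in the following Python sketch. The data, the stratum labels and the function name `stratified_total` are hypothetical and serve only to illustrate the two formulas.

```python
def stratified_total(samples, sizes):
    """Stratified estimate of a total and its estimated variance (SRS in each stratum).

    samples: {stratum: list of sampled y-values}, sizes: {stratum: N_h}.
    """
    N = sum(sizes.values())
    estimate = 0.0
    var_hat = 0.0
    for h, values in samples.items():
        n_h, N_h = len(values), sizes[h]
        mean_h = sum(values) / n_h
        s2_h = sum((v - mean_h) ** 2 for v in values) / (n_h - 1)   # needs n_h >= 2
        estimate += N_h * mean_h
        var_hat += N ** 2 * (N_h / N) ** 2 * (1 - n_h / N_h) * s2_h / n_h
    return estimate, var_hat

# Hypothetical example: two strata of sizes 30 and 10
samples = {"small": [10, 12, 14], "large": [100, 120]}
sizes = {"small": 30, "large": 10}
estimate, var_hat = stratified_total(samples, sizes)
print(estimate, var_hat)
```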

Sample allocation between strata

The context: we suppose that the size n of the sample is given and that the strata have been defined.

Allocation problem: we must calculate the fixed sub-sample sizes $n_h$, for h = 1, ..., H. The solution will depend on the main objective:

• to obtain maximum precision for one variable,
• to obtain maximum precision for more than one variable,
• to obtain a desired precision in each stratum for one variable.

Proportional allocation: definition

The sampling fraction is the same in all strata:

$f_h = \frac{n_h}{N_h} = \frac{n}{N} = f$

Remark 3: $\frac{N_h}{N} n$ is not always an integer. So $n_h = \mathrm{round}\left( \frac{N_h}{N} n \right)$, and $f_h \approx f$.
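A small Python sketch of the rounding in Remark 3, using the stratum sizes by number of rooms from the rented flats data (N = 151) and an overall sample size of n = 30. The function name `proportional_allocation` is an assumption.

```python
def proportional_allocation(stratum_sizes, n):
    """Return n_h = round(N_h * n / N) for each stratum."""
    N = sum(stratum_sizes.values())
    return {h: round(N_h * n / N) for h, N_h in stratum_sizes.items()}

rooms = {1: 29, 2: 28, 3: 33, 4: 31, 5: 24, 6: 6}   # N_h by number of rooms
allocation = proportional_allocation(rooms, n=30)
print(allocation, sum(allocation.values()))
```

With these sizes the rounded allocations add up to 31 rather than 30, so in practice one stratum size has to be adjusted by hand after rounding.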


Properties 1/2

• The inclusion probabilities are all equal (to f).

• The stratified estimator is:

  $\hat{Y}_{prop} = \sum_{h=1}^{H} \hat{Y}_{\pi h} = \sum_{h=1}^{H} N_h \bar{y}_h = N \sum_{h=1}^{H} \frac{n_h}{n} \bar{y}_h = N \bar{y}$

  (identical to the SRS case)

• This allocation gives a self-weighting sample. It is not necessary to know the stratum membership of the sampled units to compute the estimate.

Properties 2/2

$Var(\hat{Y}_{prop}) = N^2\, \frac{1 - f}{n} \sum_{h=1}^{H} \frac{N_h}{N} S_h^2 \approx N^2\, \frac{1 - f}{n}\, S^2_{Within}$

• Hence, up to this approximation, for any variable Y: $V(\hat{Y}_{prop}) \leq V(\hat{Y}_{SRS})$

• If $S^2_{Between}$ accounts for a significant proportion of $S^2$, stratified sampling with proportional allocation yields a substantially smaller variance than simple random sampling.

Optimum allocation

Objective: to estimate the population total of a study variable with minimum variance, for a fixed cost of the survey. We suppose that the total cost of the survey can be expressed as:

$C = \sum_{h=1}^{H} n_h c_h + C_0$

where $c_h$ is the cost of surveying one element in stratum $U_h$, and $C_0$ is a fixed overhead cost.

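Optimization problem

Minimize $Var(\hat{Y}_{str})$ subject to the constraint $\sum_{h=1}^{H} n_h c_h + C_0 = C$, where

$Var(\hat{Y}_{str}) = N^2 \sum_{h=1}^{H} \left( \frac{N_h}{N} \right)^2 (1 - f_h) \frac{S_h^2}{n_h}$

The solution is: $n_h = \frac{N_h S_h}{\sqrt{c_h}} \cdot \frac{C - C_0}{\sum_{h=1}^{H} \sqrt{c_h}\, N_h S_h}$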

Neyman allocation

If we assume that all the stratum costs $c_h$ are equal, we obtain

$n_h = n\, \frac{N_h S_h}{\sum_{h=1}^{H} N_h S_h}$

which is called the Neyman allocation.

Remark 4: The calculation of the optimal $n_h$ requires that the stratum standard deviations $S_h$ are known!
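Both allocation rules are easy to evaluate once the $N_h$, $S_h$ and $c_h$ are available. The Python sketch below uses hypothetical stratum sizes, standard deviations and unit costs; the function names are assumptions. The results are not rounded, so they still need to be converted to integers.

```python
from math import sqrt

def optimum_allocation(N_h, S_h, c_h, variable_budget):
    """n_h = (N_h S_h / sqrt(c_h)) * (C - C0) / sum(sqrt(c_h) N_h S_h); variable_budget is C - C0."""
    denom = sum(sqrt(c) * N * S for N, S, c in zip(N_h, S_h, c_h))
    return [N * S / sqrt(c) * variable_budget / denom for N, S, c in zip(N_h, S_h, c_h)]

def neyman_allocation(N_h, S_h, n):
    """n_h = n * N_h S_h / sum(N_h S_h): the equal-cost special case."""
    denom = sum(N * S for N, S in zip(N_h, S_h))
    return [n * N * S / denom for N, S in zip(N_h, S_h)]

# Hypothetical strata: many small units with low variability, few large units with high variability
N_h, S_h, c_h = [1000, 200], [10, 50], [1, 4]
opt = optimum_allocation(N_h, S_h, c_h, variable_budget=1000)
ney = neyman_allocation(N_h, S_h, n=500)
print(opt, ney)
```

The optimum allocation spends exactly the variable budget, and strata that are expensive to survey receive fewer units than under the Neyman allocation.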


It is important to use the Neyman allocation when the size of the statistical units differs greatly between strata, for example in a business survey where the strata are based on the size of the firms. Why? Because the variance of quantitative variables will be higher in the strata of large firms, so it is important to over-sample these strata.

Caution: an allocation that is optimal for one variable can be worse than SRS, especially for other variables of interest. So keep in mind that proportional allocation is often a good compromise.

Alternative allocation

Objective: to obtain the same precision in each stratum for one variable, for example to compare the stratum means.

The stratum means $\bar{Y}_h$ are estimated by the stratum sample means $\bar{y}_h$, whose variance is, if the sampling fraction is negligible: $V(\bar{y}_h) = \frac{S_h^2}{n_h}$. In this case, $n_h$ is proportional to the variance $S_h^2$.

If the variances $S_h^2$ are almost equal, one will choose the same number of elements in each stratum.

Construction of strata


Auxiliary information

Objective: to obtain small values of $S_h^2$.

Question: which variable(s) should be selected to define the strata? Select the variable(s) that are highly correlated with the variable of interest, i.e. that make it possible to define subpopulations that are:

• homogeneous (within sense): the elements within one stratum have similar Y-values,
• heterogeneous (between sense): the mean values of Y are very different from one stratum to another.

Remarks

Remark 5: the best way to construct the strata would be to use the distribution of Y itself! The next best is to use the distribution of an X variable highly correlated with Y.

Problem: a stratification can be efficient for one study variable, but not for the others.

How many strata?

In theory: as many strata as possible. In practice, there is a point of diminishing returns when the number of strata increases:

• the gain from stratification becomes negligible,
• the cost of the survey may increase,
• we can obtain $n_h = 1$, or even 0, due to non-response (we need $n_h \geq 2$ in order to estimate the variance).

We often use a qualitative variable that gives a variance decomposition of Y or X with a small within-strata variance and a large between-strata variance.

SAS Procedure syntax and example

Proc SURVEYSELECT

PROC SURVEYSELECT DATA=sasuser.data METHOD=SRS
     SAMPSIZE=sasuser.allocation SEED=2007 OUT=sasuser.sample;
  STRATA strate;
RUN;

Example

Simple random sampling: n = 2

Estimated total for each possible sample (unit label, with its y-value in parentheses):

                   Second draw
First draw         3 (130)    4 (200)    7 (540)    8 (570)
1 (100)            575        750        1600       1675
3 (130)                       825        1675       1750
4 (200)                                  1850       1925
7 (540)                                             2775

Example

Simple random sampling

Population total: 1540
Range of the estimates: 575 to 2775
Number of samples: 10
Standard error of the estimator: 626
Mean of the estimates: 1540

Example

Stratified sampling: $n_1 = 1$ and $n_2 = 1$

Estimated total for each possible sample (unit label, with its y-value in parentheses):

                   Second draw (stratum 2)
First draw         7 (540)    8 (570)
1 (100)            1380       1440
3 (130)            1470       1530
4 (200)            1680       1740

Example

Stratified sampling

Population total: 1540
Range of the estimates: 1380 to 1740
Number of samples: 6
Standard error of the estimator: 129
Mean of the estimates: 1540
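Because the population has only five units, both sampling distributions can be enumerated exactly. The Python sketch below lists every possible sample under each design and summarises the estimates; the standard error is taken here as the square root of the mean squared deviation of the estimates from their mean, all samples being equally likely. The helper name `summarise` is an assumption.

```python
from itertools import combinations, product
from math import sqrt

y = {1: 100, 3: 130, 4: 200, 7: 540, 8: 570}   # unit label -> y-value
N = len(y)

def summarise(estimates):
    """Return (number of samples, min, max, mean, standard error) of the estimates."""
    mean = sum(estimates) / len(estimates)
    se = sqrt(sum((e - mean) ** 2 for e in estimates) / len(estimates))
    return len(estimates), min(estimates), max(estimates), mean, se

# SRS of size 2: estimate = N * sample mean
srs = [N * (y[a] + y[b]) / 2 for a, b in combinations(y, 2)]

# Stratified design: one unit from {1, 3, 4} (N_1 = 3) and one from {7, 8} (N_2 = 2)
strat = [3 * y[a] + 2 * y[b] for a, b in product([1, 3, 4], [7, 8])]

print(summarise(srs))
print(summarise(strat))
```

Both designs are unbiased for the total, but the stratified estimates are far less dispersed because the two strata separate the small and the large values.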


Survey methodology and sampling techniques Exercises

Stratified sampling

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

1 Exercise 1
  1.1 Simple random sampling
  1.2 Proportional allocation
  1.3 Sample allocation strategies

2 Exercise 2: Strata choice
  2.1 Analysis of the variance decomposition
  2.2 Proportional allocation
  2.3 Quality of the stratification based on the variable ROOMS
  2.4 Number of apartments estimations

1 Exercise 1

In order to estimate the average salary in a population of 200,000 employees, a sample of 1,000 employees is drawn according to different sampling designs. The salary is denoted y.

In all the calculations the sampling fractions will be neglected.

1.1 Simple random sampling

A simple random sample is drawn, which gives the following results:

$\bar{y} = 5{,}380$, $s^2 = 2{,}100^2$

Give the value of the estimate and calculate the estimated variance.

1.2 Proportional allocation

The distribution of the population by gender is assumed to be known: 150,000 males and 50,000 females. A stratified sample is drawn with proportional allocation, which gives the following sample variances:

$s_M^2 = 1{,}500^2$, $s_F^2 = 1{,}100^2$

Calculate the estimated variance of the stratified estimator.

What sample size would have been necessary if simple random sampling had been used?

1.3 Sample allocation strategies

Starting from the sample variances given in question 1.2:

• what sample allocation would give the best precision for estimating the average salary?

• what sample allocation would allow the average salary of each sex to be estimated with the same precision?

2 Exercise 2 : Strata choice

We will use the database on accommodation. RENT is the rent; ROOMS is the number of rooms in the accommodation.

Remark: In this exercise, we suppose that y = RENT is known for each apartment of the sampling frame. In practice, this information is known only for a sample, or we know another variable correlated with RENT.

2.1 Analysis of the variance decomposition

We use the variable ROOMS as a stratification variable. Calculate the "within strata" and "between strata" variances of RENT. You can use the SAS procedure PROC MEANS or the following table:

ROOMS    $N_h$    $\bar{Y}_h$    $D_h$
1        29       377.90         112.47
2        28       569.54         176.43
3        33       623.94         184.43
4        31       797.03         218.27
5        24       1 105.96       321.86
6        6        1 757.67       964.51

2.2 Proportional allocation

Suppose we use a proportional allocation. What will the variance reduction (saving in %) be in comparison with SRS?

2.3 Quality of the stratification based on the variable ROOMS

Is this stratification good or not? Can we speak of a size effect? If so, would you use a proportional allocation or an optimal allocation?

2.4 Number of apartments estimations

What would be the precision of the estimates of the number of apartments $N_h$ in each stratum?

Survey methodology and sampling techniques Answers to the Exercises

Stratified sampling

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

1 Exercise 1
  1.1 A simple random sampling is used
  1.2 A stratified sampling is used, with proportional allocation
  1.3 Two other allocations
    1.3.1 Same precision in the 2 strata
    1.3.2 Neyman allocation

2 Exercise 2: Strata choice
  2.1 Analysis of the variance decomposition
  2.2 Proportional allocation
  2.3 Quality of the stratification based on the variable ROOMS
  2.4 Number of apartments estimations

1 Exercise 1

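1.1 A simple random sampling is used

π-estimator: $\hat{\bar{Y}}_\pi = \bar{y} = 5{,}380$

Variance estimator: $\widehat{Var}(\hat{\bar{Y}}_\pi) = \left(1 - \frac{n}{N}\right) \frac{s^2}{n} \approx \frac{2{,}100^2}{1{,}000} = 4{,}410$

Confidence interval: $CI_{95\%} = [5{,}247; 5{,}513]$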

1.2 A stratified sampling is used, with proportional allocation.

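π-stratified estimator:

$\hat{\bar{Y}}_{str} = \frac{N_m}{N} \bar{y}_m + \frac{N_f}{N} \bar{y}_f$

Variance estimator of the stratified estimator:

$\widehat{Var}(\hat{\bar{Y}}_{str}) = \left( \frac{N_m}{N} \right)^2 \left(1 - \frac{n_m}{N_m}\right) \frac{s_m^2}{n_m} + \left( \frac{N_f}{N} \right)^2 \left(1 - \frac{n_f}{N_f}\right) \frac{s_f^2}{n_f} = 1{,}990$

Size of the equivalent simple random sample:

$n \approx \frac{s^2}{\widehat{Var}(\hat{\bar{Y}}_\pi)} = \frac{(2{,}100)^2}{1{,}990} = 2{,}216$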

1.3 Two other allocations

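1.3.1 Same precision in the 2 strata

$\left(1 - \frac{n_m}{N_m}\right) \frac{s_m^2}{n_m} = \left(1 - \frac{n_f}{N_f}\right) \frac{s_f^2}{n_f}$, hence $\frac{s_m^2}{n_m} \approx \frac{s_f^2}{n_f}$

And $n_m + n_f = n = 1{,}000$. Hence: $n_m = 651$ and $n_f = 349$.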

1.3.2 Neyman allocation

$n_m = n\, \frac{N_m s_m}{N_f s_f + N_m s_m}$

Hence: $n_m = 804$ and $n_f = 196$.
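The figures of Exercise 1 can be re-derived with a few lines of Python. This is only a check of the arithmetic, assuming the sampling fractions are neglected, a factor of 2 for the 95% confidence interval, and standard deviations of 2,100, 1,500 and 1,100 as used in the numerical answers; the variable names are hypothetical.

```python
from math import sqrt

N, n = 200_000, 1_000
N_m, N_f = 150_000, 50_000
y_bar = 5_380
s, s_m, s_f = 2_100, 1_500, 1_100        # standard deviations used in the answers

# 1.1 Simple random sampling (fpc neglected)
var_srs = s ** 2 / n
ci = (y_bar - 2 * sqrt(var_srs), y_bar + 2 * sqrt(var_srs))

# 1.2 Proportional allocation: n_m = 750, n_f = 250
n_m, n_f = n * N_m // N, n * N_f // N
var_str = (N_m / N) ** 2 * s_m ** 2 / n_m + (N_f / N) ** 2 * s_f ** 2 / n_f
n_equivalent = s ** 2 / var_str          # SRS size giving the same variance

# 1.3.2 Neyman allocation
neyman_m = n * N_m * s_m / (N_m * s_m + N_f * s_f)
neyman_f = n - neyman_m

print(var_srs, ci, var_str, round(n_equivalent), round(neyman_m), round(neyman_f))
```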


2 Exercise 2 : Strata choice

2.1 Analysis of the variance decomposition

ROOMS    $N_h$    $\bar{Y}_h$    $D_h$    $\sigma_h^2$    $CV_h$    $Within_h$    $Between_h$
1        29       378            111      12 213          29.2%     2 346         23 131
2        28       570            173      30 017          30.4%     5 566         4 442
3        33       624            182      32 985          29.1%     7 209         2 194
4        31       797            215      46 104          26.9%     9 465         1 109
5        24       1 106          315      99 279          28.5%     15 779        23 368
6        6        1 758          880      775 238         50.1%     30 804        42 756
Total                                                               71 169        96 999
Share                                                               42%           58%

Overall variance = Within-strata variance + Between-strata variance
168 168 = 71 169 + 96 999

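2.2 Proportional allocation

$Var(\hat{Y}_{prop}) = N^2\, \frac{1 - f}{n} \sum_{h=1}^{H} \frac{N_h}{N} S_h^2 \approx N^2\, \frac{1 - f}{n}\, S^2_{Within} = 42\%\; Var(\hat{Y}_{SRS})$

Hence the variance is reduced by 58%, which corresponds to a saving of $1 - \sqrt{0.42} \approx 35\%$ in terms of standard error.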
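The within and between components can be recomputed from the table given in the exercise statement. The Python sketch below assumes that the tabulated $D_h$ are stratum standard deviations computed with divisor $N_h - 1$, and it uses the population-level decomposition with divisor N; results may differ slightly from the tabulated figures, which appear to rest on rounded intermediate values.

```python
from math import sqrt

N_h = [29, 28, 33, 31, 24, 6]
mean_h = [377.90, 569.54, 623.94, 797.03, 1105.96, 1757.67]
D_h = [112.47, 176.43, 184.43, 218.27, 321.86, 964.51]   # assumed divisor N_h - 1

N = sum(N_h)
overall_mean = sum(Nh * m for Nh, m in zip(N_h, mean_h)) / N

# Within: sum of (N_h / N) * sigma_h^2, with sigma_h^2 = D_h^2 * (N_h - 1) / N_h
within = sum((Nh - 1) / N * d ** 2 for Nh, d in zip(N_h, D_h))
# Between: sum of (N_h / N) * (stratum mean - overall mean)^2
between = sum(Nh / N * (m - overall_mean) ** 2 for Nh, m in zip(N_h, mean_h))

share_within = within / (within + between)
se_saving = 1 - sqrt(share_within)        # relative reduction of the standard error
print(round(within), round(between), round(share_within, 3), round(se_saving, 3))
```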

2.3 Quality of the stratification based on the variable ROOMS

This stratification is good for estimating the rent or any variable correlated with it. There is a size effect: the rent for a large apartment is higher than the rent for a small one. The table also shows that the standard deviation in each stratum is roughly proportional to the mean rent (the coefficient of variation $CV_h$ is nearly constant, except in the last stratum). In this case it is appropriate to use an optimal allocation, so that the strata of big units are over-sampled. The Neyman allocation is:

$$\frac{n_h}{n} = \frac{N_h S_h}{\sum_{h=1}^{H} N_h S_h}$$

2.4 Number of apartments estimations

$\hat{N}_{l,str} = N_l$ whichever allocation you use, because:

$$\hat{N}_{l,str} = \sum_{h=1}^{H} N_h \bar{z}_h = N_l \times 1 = N_l$$

with $z_k = 1_{(k \in U_l)}$, so that the sample mean $\bar{z}_h$ equals 1 in stratum l and 0 in the other strata.

Survey Methodology and Sampling Techniques Multi-stage sampling

Guillaume Chauvet, Eric Lesage

Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

This presentation is based on teaching documents by the CEPE (Insee)


Introduction

Cluster Sampling
• Definition
• Estimation of a total
• Simple random cluster sampling
• French labour-force survey

Two-stage sampling
• Definition
• Estimation of a total
• SRS at each stage
• Another self-weighting design
• Drawing of the Master Sample in 1999

Introduction


Two-stage sampling

Principle

• Partition population U into M parts, called Primary Sampling Units (PSUs); units in U are called Secondary Sampling Units (SSUs).

• Draw a sample of PSUs.

• In each selected PSU, viewed as a population, draw a sample of SSUs. The union of the SSUs drawn in each selected PSU gives the resulting sample.

Two-stage sampling

Justification

• To reduce the survey costs, if the population units are scattered over a wide area (so that the units selected by direct sampling would be scattered too)

• Case of no sampling frame, or high cost of producing a sampling frame

Two-stage sampling

Remarks

Sampling designs may differ from one stage to another, and from one PSU to another.

Multistage sampling consists of three or more stages of sampling.

A Simple Random Sample (SRS) of the same size will usually be more accurate.

Cluster Sampling

Population $U = \{1, \ldots, k, \ldots, N\}$ partitioned into M subpopulations, called clusters $U_1, \ldots, U_i, \ldots, U_M$.

Set of clusters $U_g = \{1, \ldots, i, \ldots, M\}$.

$N_i$ = number of population elements in the $i$-th cluster $U_i$.

$$N = \sum_{i \in U_g} N_i$$

Principle of cluster sampling

Definition

1. A sample $s_g$ of clusters is drawn from $U_g$ according to the sampling design $p(\cdot)$.

2. Every population element in the selected clusters is observed.
→ resulting sample $s = \bigcup_{i \in s_g} U_i$, with size $n_s = \sum_{i \in s_g} N_i$.

Remark: even if $p(\cdot)$ is a fixed-size design, $n_s$ will in general not be fixed, because the cluster sizes $N_i$ may vary.

3. Inclusion probabilities (resulting from design p): $\pi_i$, $\pi_{ij}$, and $\Delta_{ij} = \pi_{ij} - \pi_i \pi_j$.

Estimation of a total

The study variable y is defined over U.

$Y_i$ is the total of y in the cluster $U_i$.

$$Y = \sum_{i \in U_g} Y_i$$

Theorem: the Horvitz-Thompson estimator $\hat{Y}_\pi = \sum_{i \in s_g} \frac{Y_i}{\pi_i}$ is design-unbiased for Y.

Estimation of a total

If $p(\cdot)$ is a fixed-size design:

$$V(\hat{Y}_\pi) = -\frac{1}{2} \sum_{i \in U_g} \sum_{j \in U_g} \Delta_{ij} \left(\frac{Y_i}{\pi_i} - \frac{Y_j}{\pi_j}\right)^2$$

$$\hat{V}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{i \in s_g} \sum_{j \in s_g} \frac{\Delta_{ij}}{\pi_{ij}} \left(\frac{Y_i}{\pi_i} - \frac{Y_j}{\pi_j}\right)^2$$

Comments

• If we can choose $\pi_i$ approximately proportional to the cluster totals $Y_i$, then $V(\hat{Y}_\pi) \simeq 0$: cluster sampling will be highly efficient.

• If there is little variation among the cluster means $\bar{Y}_i = Y_i / N_i$, and if the cluster sizes are known at the planning stage, one can choose $\pi_i$ approximately proportional to the $N_i$: $V(\hat{Y}_\pi) \simeq 0$.

• An equal probability cluster sampling design (i.e. the $\pi_i$ are all equal) is often a poor choice when the clusters are of different sizes (unless the $\bar{Y}_i$ are roughly proportional to $1/N_i$).

Simple random cluster sampling

We draw m clusters among M, according to a simple random sampling design.

Then $\hat{Y}_{SRSg} = M \bar{y}_g$, where

$$\bar{y}_g = \frac{1}{m} \sum_{i \in s_g} Y_i = \text{mean of the cluster totals in } s_g$$

The variance equals

$$V(\hat{Y}_{SRSg}) = M^2 \left(1 - \frac{m}{M}\right) \frac{S_g^2}{m}$$

where

$S_g^2 = \frac{1}{M-1} \sum_{i \in U_g} (Y_i - \bar{Y}_g)^2$ → variance of the cluster totals in $U_g$

$\bar{Y}_g = \frac{1}{M} \sum_{i \in U_g} Y_i$ → mean of the cluster totals in $U_g$

An unbiased variance estimator is provided by

$$\hat{V}(\hat{Y}_{SRSg}) = M^2 \left(1 - \frac{m}{M}\right) \frac{s_g^2}{m}$$

where

$s_g^2 = \frac{1}{m-1} \sum_{i \in s_g} (Y_i - \bar{y}_g)^2$ → variance of the cluster totals in $s_g$

$\bar{y}_g = \frac{1}{m} \sum_{i \in s_g} Y_i$ → mean of the cluster totals in $s_g$
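The estimator and its variance estimator for simple random cluster sampling can be sketched in Python as follows. The function name `srs_cluster_total` and the four cluster totals are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import mean, variance

def srs_cluster_total(cluster_totals, M):
    """Estimate Y from the totals Y_i of m clusters drawn by SRS among M."""
    m = len(cluster_totals)
    y_bar_g = mean(cluster_totals)     # mean of the sampled cluster totals
    s2_g = variance(cluster_totals)    # sample variance, divisor m - 1
    estimate = M * y_bar_g
    var_hat = M**2 * (1 - m / M) * s2_g / m
    return estimate, var_hat

# Hypothetical data: 4 clusters observed out of M = 20.
est, var_hat = srs_cluster_total([12.0, 15.0, 9.0, 14.0], M=20)
```

Only the cluster totals enter the computation, because every element of a selected cluster is observed.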


Comparison with direct simple random sampling

Let $S_i^2 = \frac{1}{N_i - 1} \sum_{k \in U_i} (y_k - \bar{Y}_i)^2$ be the variance of y within $U_i$.

Analysis of variance decomposition

$$S^2 = S^2_{within} + S^2_{between}$$

with

$$S^2_{within} \simeq \sum_{i \in U_g} \frac{N_i}{N} S_i^2, \qquad S^2_{between} = \sum_{i \in U_g} \frac{N_i}{N} (\bar{Y}_i - \bar{Y})^2$$

We define the homogeneity coefficient $\rho = 1 - \frac{S^2_{within}}{S^2}$.

Comparison with direct simple random sampling

Interpretation

• ρ large: low variation of y within every cluster → high degree of homogeneity (elements in the same cluster are similar)

• ρ small: large variation of y within every cluster → low degree of homogeneity (elements in the same cluster are dissimilar)

Notation

$\bar{N} = N / M$ = average number of elements per cluster

Cov = covariance between $N_i$ and $N_i \bar{Y}_i^2$

Comparison with a direct simple random sampling of elements

Then

$$\frac{V(\hat{Y}_{SRSg})}{V(\hat{Y}_{SRS})} \simeq 1 + (\bar{N} - 1)\rho + Cov$$

where $V(\hat{Y}_{SRS})$ is the variance of the estimator obtained with a SRS of size $n = m\bar{N}$ (expected size of s).

Case 1: Suppose that all cluster sizes are equal:

$$\forall i, \; N_i = \bar{N} \Rightarrow Cov = 0$$

Hence:

$$\frac{V(\hat{Y}_{SRSg})}{V(\hat{Y}_{SRS})} = 1 + (\bar{N} - 1)\rho$$

Many clusters encountered in practice are formed by "nearby" elements, and because such elements tend to resemble each other more or less, it is likely that ρ > 0.

Example: ρ weakly positive, $\rho = 0.08$ and $\bar{N} = 300$ ⇒ $\frac{V(\hat{Y}_{SRSg})}{V(\hat{Y}_{SRS})} \approx 25$

Large average cluster size ⇒ large loss of efficiency.
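The numerical example can be checked with a one-line Python function. The name `cluster_design_effect` is illustrative, and the formula is the equal-cluster-size case (Cov = 0) only.

```python
def cluster_design_effect(rho, avg_cluster_size):
    """Approximate variance ratio cluster/SRS for equal-sized clusters (Cov = 0)."""
    return 1 + (avg_cluster_size - 1) * rho

deff = cluster_design_effect(rho=0.08, avg_cluster_size=300)
```

Here `deff` is 1 + 299 × 0.08 = 24.92, which the slide rounds to 25.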


Case 2 : Suppose that the clusters vary in size.

Very often Cov > 0 : the variance increase due to selection of clusters may be worse than in case 1.

The cluster sampling is likely to be inefficient in many situations, especially if the clusters are homogeneous and/or of unequal sizes.

However, from a cost-efficiency point of view, it may have advantages, since it is often much cheaper to survey clusters of elements than to survey the geographically scattered sample that may arise from a SRS.

When will we perform accurate cluster sampling?

• If ρ is low → low degree of homogeneity within clusters (i.e. high degree of homogeneity between clusters)

• Small clusters (i.e. many clusters)

• Clusters of similar sizes

• Maximum number of drawn clusters

How to improve the efficiency of cluster sampling?
→ use stratified cluster sampling, where clusters are stratified on a measure of size.

French labour-force survey

Rotating sample of areas

• 6 rotating areas for each district, each one being assigned to an interviewer for 6 consecutive quarters

• an area consists of one cluster of 20 households

• 2 554 areas are drawn each quarter (54 000 households)

• Design effect: $\frac{V(\hat{Y}_{SRSg})}{V(\hat{Y}_{SRS})} \simeq 4$

Principle of rotating areas


Rotating areas in a district


Two-stage sampling

Population $U = \{1, \ldots, k, \ldots, N\}$ partitioned into M primary sampling units (PSUs) $U_1, \ldots, U_i, \ldots, U_M$.

Set of PSUs $U_I = \{1, \ldots, i, \ldots, M\}$.

$N_i$ = number of population elements in the $i$-th PSU $U_i$.

$$N = \sum_{i \in U_I} N_i$$

Definition

1. A sample $s_I$ of PSUs is drawn from $U_I$ according to the sampling design $p_I(\cdot)$.

2. For every $i \in s_I$, a sample $s_i$ of elements (called secondary sampling units, SSUs) is drawn from the PSU $U_i$ according to the design $p_i(\cdot)$.

The $p_i(\cdot)$ are independent: subsampling in a given PSU is carried out independently of subsampling in any other PSU.

→ resulting sample $s = \bigcup_{i \in s_I} s_i$, with (random) size $n_s = \sum_{i \in s_I} n_{s_i}$, where $n_{s_i}$ is the size of $s_i$.

Principle of two-stage sampling


Inclusion probabilities

• For PSU $U_i$ to be selected in $s_I$: $\pi_{Ii}$ (induced by the sampling design $p_I$),

• For element k (of PSU $U_i$ selected in $s_I$) to be selected in $s_i$: $\pi_{k|i}$ (induced by the sampling design $p_i$),

• For element k (of PSU $U_i$) to be selected in s: $\pi_k = \pi_{Ii} \pi_{k|i}$

Estimation of a total

The study variable y is defined over U.

$Y_i$ is the total of y in the PSU $U_i$.

$$Y = \sum_{i \in U_I} Y_i$$

Horvitz-Thompson estimator

$$\hat{Y}_\pi = \sum_{i \in s_I} \frac{\hat{Y}_{i\pi}}{\pi_{Ii}} \quad \text{with} \quad \hat{Y}_{i\pi} = \sum_{k \in s_i} \frac{y_k}{\pi_{k|i}}$$

$$V(\hat{Y}_\pi) = V_{PSU} + V_{SSU}$$

SRS at each stage

First stage : A sample of m PSUs is drawn by SRS among the M PSUs of population UI .

Second stage : In each selected PSU i, a sample of ni SSUs is selected by SRS among the Ni SSUs of PSU i.

Horvitz-Thompson estimator

$$\hat{Y}_\pi = \frac{M}{m} \sum_{i \in s_I} \left(\frac{N_i}{n_i} \sum_{k \in s_i} y_k\right) = \frac{M}{m} \sum_{i \in s_I} N_i \bar{y}_i$$

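Variance of the Horvitz-Thompson estimator

$$V(\hat{Y}_\pi) = M^2 \left(1 - \frac{m}{M}\right) \frac{S_I^2}{m} + \frac{M}{m} \sum_{i \in U_I} N_i^2 \left(1 - \frac{n_i}{N_i}\right) \frac{S_i^2}{n_i}$$

where $S_I^2 = \frac{1}{M-1} \sum_{i \in U_I} \left(Y_i - \frac{Y}{M}\right)^2$ is the variance of the totals $Y_i$ of PSUs

and $S_i^2 = \frac{1}{N_i - 1} \sum_{k \in U_i} \left(y_k - \frac{Y_i}{N_i}\right)^2$ is the variance of the variable y in PSU $U_i$.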

Variance estimator of the HT estimator

$$\hat{V}(\hat{Y}_\pi) = M^2 \left(1 - \frac{m}{M}\right) \frac{s_I^2}{m} + \frac{M}{m} \sum_{i \in s_I} N_i^2 \left(1 - \frac{n_i}{N_i}\right) \frac{s_i^2}{n_i}$$

where $s_I^2 = \frac{1}{m-1} \sum_{i \in s_I} \left(\hat{Y}_{i\pi} - \frac{\hat{Y}_\pi}{M}\right)^2$

and $s_i^2 = \frac{1}{n_i - 1} \sum_{k \in s_i} \left(y_k - \frac{\hat{Y}_{i\pi}}{N_i}\right)^2$

Comments

• The first component of the variance reflects the variability of the totals BETWEEN PSUs.

• The second component of the variance reflects the variability of variable y WITHIN PSUs.

The first component is very often preponderant.

The size m of the PSU sample $s_I$ plays a role in both components, while the sizes $n_i$ of the SSU samples $s_i$ appear only in the second component.

Consequently, in order to improve the precision :

• it is better to draw many PSUs,

• it is better to have PSUs with comparable sizes and similar means. This can be obtained with a preliminary stratification of the PSUs by size (or some measure of size), so that PSUs of comparable size are grouped together.

Conclusion: "good" PSUs should be similar to one another (and internally heterogeneous).
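The estimator and variance estimator for SRS at each stage can be sketched in Python as below, which also makes the between-PSU and within-PSU components explicit. The function name `two_stage_srs` and the small data set are hypothetical; the sample variances use the usual divisors m − 1 and n_i − 1.

```python
from statistics import mean, variance

def two_stage_srs(psu_samples, psu_sizes, M):
    """HT total and variance estimate for SRS at both stages.

    psu_samples: list of lists, the y values observed in each selected PSU
    psu_sizes:   list of N_i for the selected PSUs
    M:           number of PSUs in the population
    """
    m = len(psu_samples)
    # Estimated PSU totals N_i * ybar_i
    totals = [N_i * mean(ys) for ys, N_i in zip(psu_samples, psu_sizes)]
    estimate = M / m * sum(totals)
    # Between-PSU component, based on the estimated PSU totals
    between = M**2 * (1 - m / M) * variance(totals) / m
    # Within-PSU component, based on the element-level sample variances
    within = M / m * sum(
        N_i**2 * (1 - len(ys) / N_i) * variance(ys) / len(ys)
        for ys, N_i in zip(psu_samples, psu_sizes)
    )
    return estimate, between, within

# Hypothetical data: 2 PSUs drawn out of M = 10, of sizes 10 and 30.
est, v_between, v_within = two_stage_srs(
    [[2.0, 4.0], [1.0, 3.0, 5.0]], psu_sizes=[10, 30], M=10
)
```

In this toy example the between-PSU component (72 000) dominates the within-PSU component (5 800), in line with the comment that the first component is very often preponderant.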


Special case : constant sampling rates at 2nd stage

$\frac{n_i}{N_i} = f_2$ for each PSU $U_i$. We denote $f_1 = \frac{m}{M}$. Then:

$$\hat{Y}_\pi = \frac{1}{f_1} \sum_{i \in s_I} \frac{1}{f_2} \sum_{k \in s_i} y_k = \frac{1}{f_1 f_2} \sum_{k \in s} y_k$$

Every element in the sample has the same weight $\frac{1}{f_1 f_2}$.

$$V(\hat{Y}_\pi) = M^2 (1 - f_1) \frac{S_I^2}{m} + \frac{1 - f_2}{f_1 f_2} \sum_{i \in U_I} N_i S_i^2$$

and

$$\hat{V}(\hat{Y}_\pi) = M^2 (1 - f_1) \frac{s_I^2}{m} + \frac{1 - f_2}{f_1 f_2} \sum_{i \in s_I} N_i s_i^2$$

If $f_1$ is small, the variance estimator will usually be close to the PSU variance:

$$\hat{V}(\hat{Y}_\pi) \simeq M^2 (1 - f_1) \frac{s_I^2}{m}$$

If in addition all the PSUs have the same size $N_i = \bar{N} = N/M$, then:

• $n_i = \bar{n} = f_2 \bar{N}$

• the whole sample size is fixed: $n = m\bar{n} = m f_2 \bar{N} = f_1 f_2 N$

• $\hat{Y}_\pi = N \frac{1}{n} \sum_{k \in s} y_k = N \bar{y}$

The sampling design is said to be self-weighting. All individuals share the same weight, and the sample size is fixed.

Another self-weighting design

1st stage: we draw a sample $s_I$ of m PSUs among the M PSUs, according to a probability-proportional-to-size ($N_i$) sampling design.

2nd stage: for $i \in s_I$, we draw a sample $s_i$ of $\bar{n}$ SSUs among the $N_i$ SSUs, according to a simple random sampling design.

Inclusion probability for $k \in U_i$ in s:

$$\pi_k = \pi_{Ii} \pi_{k|i} = \frac{m N_i}{N} \cdot \frac{\bar{n}}{N_i} = \frac{m \bar{n}}{N} = \text{constant}$$

Sample size: $m\bar{n}$ (fixed). This is also a self-weighting design.

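Horvitz-Thompson estimator: $\hat{Y}_\pi = \frac{N}{m\bar{n}} \sum_{k \in s} y_k$.

Variance of the Horvitz-Thompson estimator

$$V(\hat{Y}_\pi) = -\frac{1}{2} \frac{N^2}{m^2} \sum_{i \in U_I} \sum_{j \in U_I} \Delta_{Iij} (\bar{Y}_i - \bar{Y}_j)^2 + \frac{N}{m\bar{n}} \sum_{i \in U_I} N_i \left(1 - \frac{\bar{n}}{N_i}\right) S_i^2$$

Variance estimator

$$\hat{V}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{i \in s_I} \sum_{j \in s_I} \frac{\Delta_{Iij}}{\pi_{Iij}} \left(\frac{\hat{Y}_{i\pi}}{\pi_{Ii}} - \frac{\hat{Y}_{j\pi}}{\pi_{Ij}}\right)^2 + \frac{N}{m\bar{n}} \sum_{i \in s_I} N_i \left(1 - \frac{\bar{n}}{N_i}\right) s_i^2$$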

French Master Sample

• Two steps
→ Drawing of the Master Sample once and for all between two censuses (to be renewed)
→ Drawing in the Master Sample for a household survey

• Drawing of the Master Sample
→ compromise between high accuracy and low costs
→ stock of 2 Mo households close to a network of interviewers

• Drawing in the Master Sample
→ A separate drawing for each survey
→ Used from 2001 to 2009

Drawing of the Master Sample

• First stage of two-stage sampling

• PSUs
→ A set of municipalities in rural areas, an urban unit in urban areas

• Drawing of the 350 PSUs of the Master Sample
→ stratified by French region and by rural/urban area
→ probabilities proportional to the number of households.

PSUs drawn in the 1999 Master Sample


PSUs drawn in Ile de France


PSUs drawn in Brittany


Drawing of the Master Sample

• SSUs
→ a district, or a set of districts,
→ defined only for medium-sized urban units and larger.

• Drawing of the SSUs
→ with equal probabilities,
→ systematic sampling.

The household samples are drawn so that the sampling design is self-weighting.

A household cannot be surveyed more than once.

Drawing of SSUs


Bibliography

Ardilly, P. (2006), Les Techniques de Sondage, Technip, Paris.

Särndal, C.-E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, Springer, New York.

Sautory, O. (2007), Les sondages à plusieurs degrés, support de cours CEPE, Insee.

Survey methodology and sampling techniques Exercises

Cluster Sampling and Two Stage Sampling

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

Exercise 1 (Ardilly and Tillé, 2003)

Exercise 2

Exercise 1 (Ardilly and Tillé, 2003)

A survey is performed on a sample of 90 clusters of 40 households each. The clusters are selected by simple random sampling with the sampling rate f = 1/300. To improve the accuracy of the estimation, a statistician proposes to halve the cluster size and to double the number of selected clusters. Estimate the gain in accuracy.

For a proportion estimate $\hat{p} = 0.1$, the current survey gives a 95% confidence interval CI = [0.1 ± 0.014]. Compute the confidence interval obtained for the same proportion with the new sampling design (the sampling rates are assumed to be negligible).

Exercise 2

A sample of n = 1 200 households is selected by two-stage sampling with a = 60 PSUs (of equal size) selected with equal probabilities.

The second-stage sampling is also performed with equal probabilities, and with the same sampling rate in each PSU drawn; the sampling rates are assumed to be negligible.

The variance of the estimator of a proportion p = 0.40 has been estimated from the sample data and equals 0.0004284.

1. Give an estimate of the design effect (DEFF) for p, and of the homogeneity coefficient.

2. Assume that the survey is performed with similar principles, but with 900 households in 60 PSUs.

Give an estimate of the variance for estimating p.

3. The cost model : C = 47 n + 300 a is introduced.

Give optimal values for n and a to produce the best estimator of p for a global budget of :

• C = 38 500,

• C = 77 000.


Survey methodology and sampling techniques Answers to the Exercises

Cluster Sampling and Two Stage Sampling

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

Solution of Exercise 1

Solution of Exercise 2

Solution of Exercise 1

1. Let $\hat{Y}$ be the Horvitz-Thompson estimator of the total Y obtained with cluster sampling. Then we have

$$Var(\hat{Y}) = N^2 (1 - f) \frac{S_y^2}{m\bar{n}} [1 + \rho(\bar{n} - 1)] \qquad (1)$$

where

• m is the number of clusters drawn (and f = m/M),

• $\bar{n}$ is the number of households drawn in each cluster (number of units in the cluster for cluster sampling),

• ρ is the homogeneity coefficient,

• $S_y^2$ is the dispersion of variable y in the whole population U.

Note that equation (1) holds exactly because all clusters are of the same size.

Let $var_1$ be the variance with $m_1 = 90$ clusters selected among $M_1 = m_1/f = 27\,000$, and $var_2$ the variance with $m_2 = 180$ clusters selected among $M_2 = m_2/f = 54\,000$. Since $m\bar{n} = 3\,600$ in both designs, equation (1) gives

$$\frac{var_2}{var_1} = \frac{1 + \rho(20 - 1)}{1 + \rho(40 - 1)} = \frac{1 + 19\rho}{1 + 39\rho} < 1$$

Note that ρ is assumed to be the same for both sets of clusters. Although this assumption does not hold exactly (ρ should decrease as the cluster size increases), it holds approximately in practice.

2. Estimation of a proportion is just a particular case of mean estimation, where the variable of interest y is an indicator. Then we have

$$S_y^2 = \frac{N}{N - 1} P(1 - P) \simeq P(1 - P)$$

which is estimated by replacing P with its estimator $\hat{P}$. Then, assuming that $1 - f \simeq 1$, we have

$$\widehat{var}_1 = \frac{\hat{P}(1 - \hat{P})}{m\bar{n}} [1 + \rho(\bar{n} - 1)] = \left(\frac{0.014}{2}\right)^2$$

which gives $\rho \simeq 0.0246$.

Then

$$\widehat{var}_2 = \widehat{var}_1 \, \frac{1 + 19\rho}{1 + 39\rho} \simeq 3.7 \times 10^{-5}$$

and the new confidence interval is

$$\left[\hat{P} \pm 2\sqrt{\widehat{var}_2}\right] = [0.1 \pm 0.012]$$
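The arithmetic of this solution can be replayed with a few lines of Python. The variable names are illustrative; the sketch takes the reported half-width 0.014 as two standard errors and neglects the sampling rate, as the solution does.

```python
from math import sqrt

p_hat = 0.1
m1, n1 = 90, 40          # current design: clusters drawn, households per cluster
half_width = 0.014       # reported 95% half-width, taken as 2 standard errors

var1 = (half_width / 2) ** 2
srs_var = p_hat * (1 - p_hat) / (m1 * n1)    # SRS variance, sampling rate neglected
rho = (var1 / srs_var - 1) / (n1 - 1)        # homogeneity coefficient

n2 = 20                  # halved cluster size, same total sample size
var2 = var1 * (1 + rho * (n2 - 1)) / (1 + rho * (n1 - 1))
new_half_width = 2 * sqrt(var2)
```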


Solution of Exercise 2

1. In case of simple random sampling, we have

$$Var_{SRS}(\hat{P}) = (1 - f) \frac{S_y^2}{n} \simeq \frac{P(1 - P)}{n}.$$

Here, n = 1 200 and P = 0.4, so that

$$\widehat{Var}_{SRS}(\hat{P}) \simeq 2 \times 10^{-4}.$$

The design effect equals

$$DEFF = \frac{0.0004284}{0.0002} = 2.142$$

and we also have $DEFF = 1 + \rho(\bar{n} - 1)$, where $\bar{n} = n/a = 20$ is the number of households drawn in each PSU. We then get

$$\rho \simeq 0.06.$$

2. Let $var_2$ be the variance obtained for $\hat{P}$ with this new sampling design, and $var_{SRS,2}$ the variance obtained with a simple random sample of the same size n = 900. With $\bar{n} = 900/60 = 15$, we have

$$\frac{var_2}{var_{SRS,2}} = 1 + \rho(\bar{n} - 1) \simeq 1.84$$

and

$$Var_{SRS,2}(\hat{P}) \simeq \frac{P(1 - P)}{n} \simeq 2.7 \times 10^{-4}$$

so that $var_2 \simeq 4.9 \times 10^{-4}$.

3. By using the same formula, the variance for two-stage sampling equals

$$\frac{P(1 - P)}{n} \left[1 + \rho\left(\frac{n}{a} - 1\right)\right].$$

Optimal values are obtained with a Lagrangian technique. The Lagrangian equals

$$\frac{P(1 - P)}{n} \left[1 + \rho\left(\frac{n}{a} - 1\right)\right] + \lambda(47n + 300a).$$

Differentiating with respect to n gives

$$-\frac{P(1 - P)}{n^2}(1 - \rho) + 47\lambda = 0$$

and differentiating with respect to a gives

$$-\frac{P(1 - P)}{a^2}\rho + 300\lambda = 0$$

so that

$$\frac{1}{47} \frac{P(1 - P)}{n^2}(1 - \rho) = \frac{1}{300} \frac{P(1 - P)}{a^2}\rho$$

and n/a = 10.

For C = 38 500, we get C = 470 a + 300 a = 770 a, so that a = 50 and n = 500.

For C = 77 000, we get a = 100 and n = 1 000.
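The three parts of this solution can be reproduced numerically with the Python sketch below. The helper `optimal_design` and its parameter names are hypothetical; it solves the two first-order conditions for the ratio n/a without rounding it to 10, so the results agree with the solution only after rounding.

```python
from math import sqrt

p = 0.40
n, a = 1200, 60
var_design = 0.0004284

# Part 1: design effect and homogeneity coefficient
var_srs = p * (1 - p) / n
deff = var_design / var_srs
rho = (deff - 1) / (n / a - 1)

# Part 2: same design with 900 households in 60 PSUs
n2 = 900
var2 = p * (1 - p) / n2 * (1 + rho * (n2 / a - 1))

# Part 3: the first-order conditions give (n/a)^2 = 300 (1 - rho) / (47 rho)
def optimal_design(budget, rho, cost_household=47, cost_psu=300):
    ratio = sqrt(cost_psu * (1 - rho) / (cost_household * rho))
    psus = budget / (cost_household * ratio + cost_psu)
    return psus, ratio * psus       # (number of PSUs a, number of households n)

a_small, n_small = optimal_design(38500, rho)
a_large, n_large = optimal_design(77000, rho)
```

Doubling the budget doubles both a and n, because the optimal ratio n/a depends only on ρ and on the two unit costs.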


Survey Methodology and Sampling Techniques Ratio Estimator and Post-stratified Estimator

Paul-André Salamin, Jean-Pierre Renfer
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008


Contents

Use of Auxiliary Information During the Estimation

Ratio Estimator
• Definition
• Combined and Separate Ratio
• Ratio Estimator for SRS and ST
• Bias and Variance

Post-stratified Estimator
• Definition
• Post-Stratification for SRS and Relation to Stratification
• Relation to the Ratio Estimator
• Bias and Variance

Auxiliary Information

Population U.

U    x_1   ..  x_q    y_1   ..  y_p
1    x_11  ..  x_1q   y_11  ..  y_1p
..   ..        ..     ..        ..
k    x_k1  ..  x_kq   y_k1  ..  y_kp
..   ..        ..     ..        ..
N    x_N1  ..  x_Nq   y_N1  ..  y_Np

Aim: estimation of $\theta = f(y_{k1}, .., y_{kp}; k \in U)$.

Variables of interest (unknown): $y_1, .., y_p$.

Auxiliary variables that assist in the estimation of θ: $x_1, .., x_q$.

Use of an auxiliary variable x at the design stage:

• stratification variable (e.g. area, economic activity, age group)

• cluster variables for two-stage or multi-stage sampling

• measure of size for pps sampling

• information collected for two-phase sampling

Required: $x_k$ known for all $k \in U$.

Sample $s \subset U$ to get an estimation of $\theta = f(y_1, .., y_p)$.

U      x_1   ..  x_q    y_1   ..  y_p    sample
1      x_11  ..  x_1q   y_11  ..  y_1p   0
..     ..        ..     ..        ..     0
..     ..        ..     ..        ..     1
k      x_k1  ..  x_kq   y_k1  ..  y_kp   1
..     ..        ..     ..        ..     1
..     ..        ..     ..        ..     0
N      x_N1  ..  x_Nq   y_N1  ..  y_Np   0
Total  X_1   ..  X_q                     n

Use of auxiliary variables at the estimation stage:

• ratio estimation: total $X = \sum_U x_k$ and $x_k$ for $k \in s$

• post-stratification: counts $N_h$ for post-strata h = 1, .., H and indicators for post-strata for $k \in s$ (1 if $k \in h$, 0 otherwise)

• regression: $X_1 = \sum_U x_{1k}$, .., $X_q = \sum_U x_{qk}$ and $x_{1k}, .., x_{qk}$ for $k \in s$

• calibration: general method that includes the methods above

Aim: better precision for the estimator, consistency with known totals.

Ratio Estimator

We aim at estimating the total $Y = \sum_{k \in U} y_k$ in U.

A sample s is selected in U by using p(s).

Data is collected for yk for k ∈ s.

Suppose now that we have an auxiliary variable x with the following known information:

• $x_k \geq 0$, $k \in s$

• $X = \sum_U x_k$ (population total of x).

Example: x continuous variable

k    x    sample  y
1    23   1       122
2    14   0       .
3    56   1       156
4    24   1       465
5    67   1       3243
6    2    0       .
7    35   1       443
8    23   0       .
9    19   1       973
10   76   0       .

Total $X = \sum_{k \in U} x_k = 339$

Values $x_k$ known for $k \in U$, therefore also known for $k \in s$.

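The population total Y can be written as:

$$Y = X \cdot \frac{Y}{X} = X \cdot R,$$

where X is known and $R = \frac{Y}{X} = \frac{\bar{Y}}{\bar{X}}$.

R can be estimated by: $\hat{R} = \frac{\hat{Y}}{\hat{X}} = \frac{\hat{\bar{Y}}}{\hat{\bar{X}}}$

Therefore, we define the ratio estimator $\hat{Y}_R$ of Y:

$$\hat{Y}_R = X \cdot \frac{\hat{Y}}{\hat{X}} = X \cdot \frac{\hat{\bar{Y}}}{\hat{\bar{X}}}$$

Same for the ratio estimator $\hat{\bar{Y}}_R$ of $\bar{Y}$: $\hat{\bar{Y}}_R = \bar{X} \cdot \frac{\hat{\bar{Y}}}{\hat{\bar{X}}}$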

[Figure: illustration of the ratio estimator $\hat{\bar{Y}}_R$ in the (x, y) plane. The line through the origin has slope $\hat{R} = \hat{Y}/\hat{X}$.]

Combined and Separate Ratio

Combined ratio, using $X = \sum_U x_k$:

$$\hat{Y}_R^{(comb)} = X \cdot \hat{Y}/\hat{X} = X \cdot \hat{R}$$

Separate ratio, using a partition $U = U_1 \cup .. \cup U_g \cup .. \cup U_G$ and $X_g = \sum_{U_g} x_k$ for g = 1, .., G:

$$\hat{Y}_R^{(sep)} = \sum_g X_g \cdot \hat{Y}_g/\hat{X}_g = \sum_g X_g \cdot \hat{R}_g$$

Ratio Estimator for SRS and ST (combined)

Ratio estimator for the total Y if s is a SRS of size n with the design weight $d_k = N/n$:

$$\hat{Y}_R = X \cdot \frac{\sum_{k \in s} d_k y_k}{\sum_{k \in s} d_k x_k} = X \cdot \frac{(N/n) \sum_{k \in s} y_k}{(N/n) \sum_{k \in s} x_k} = X \cdot \frac{\sum_{k \in s} y_k}{\sum_{k \in s} x_k}$$

Ratio estimator for the total Y if s is a ST of size n, with $d_k = N_h/n_h$ if $k \in s_h$:

$$\hat{Y}_R = X \cdot \frac{\sum_h \sum_{k \in s_h} d_k y_k}{\sum_h \sum_{k \in s_h} d_k x_k} = X \cdot \frac{\sum_h \frac{N_h}{n_h} \sum_{k \in s_h} y_k}{\sum_h \frac{N_h}{n_h} \sum_{k \in s_h} x_k}$$

Notation as a Weight Adjustment

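$$\hat{Y}_R = X \cdot \frac{\hat{Y}}{\hat{X}} = \frac{X}{\hat{X}} \sum_{k \in s} d_k y_k = \sum_k \frac{X}{\hat{X}} d_k y_k = \sum_k g_k d_k y_k = \sum_k w_k y_k$$

Design or sampling weight, selection with p(s): $d_k$.

Weight adjustment or g-weight: $g_k = X/\hat{X} = \frac{X}{\sum_{\ell \in s} d_\ell x_\ell}$.

Final weight: $w_k = d_k g_k$.

Note: $g_k$ depends on s. $\sum_k w_k x_k = X$ (calibration).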
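To make the weight adjustment concrete, the Python sketch below applies it to the ten-unit example table (x continuous variable) given earlier. It assumes that the six sampled units were drawn by SRS, which the example itself does not state, so the design weights N/n are an assumption; the variable names are illustrative.

```python
# Sampled units (k = 1, 3, 4, 5, 7, 9) from the example table: x and y values.
x_s = [23, 56, 24, 67, 35, 19]
y_s = [122, 156, 465, 3243, 443, 973]
N, X = 10, 339                  # population size and known total of x

n = len(x_s)
d = [N / n] * n                 # SRS design weights (assumed design)
X_hat = sum(dk * xk for dk, xk in zip(d, x_s))
g = X / X_hat                   # same g-weight for every sampled unit
w = [dk * g for dk in d]        # final weights w_k = d_k * g_k

Y_ht = sum(dk * yk for dk, yk in zip(d, y_s))       # Horvitz-Thompson estimate
Y_ratio = sum(wk * yk for wk, yk in zip(w, y_s))    # ratio estimate
calibrated_X = sum(wk * xk for wk, xk in zip(w, x_s))
```

The final weights reproduce the known total X exactly (`calibrated_X` equals 339), which is the calibration property noted above.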

Bias

$\hat{Y}_R$ is slightly biased.

The bias is of order 1/n. It can be large for small n.

If the sample is of fixed size:

$$bias = E(\hat{Y}_R) - Y \approx Y \left(\frac{var(\hat{X})}{X^2} - \frac{cov(\hat{X}, \hat{Y})}{XY}\right)$$

For SRS:

$$bias \approx Y \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{S_x^2}{\bar{X}^2} - \frac{S_{xy}}{\bar{X}\bar{Y}}\right)$$

$var(\hat{X}) = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} S_x^2$, with $S_x^2 = \frac{1}{N-1} \sum_U (x_k - \bar{X})^2$

$cov(\hat{X}, \hat{Y}) = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} S_{xy}$, with $S_{xy} = \frac{1}{N-1} \sum_U (x_k - \bar{X})(y_k - \bar{Y})$

If s is a SRS, the bias equals zero if and only if

$$\frac{S_x^2}{\bar{X}^2} = \frac{S_{xy}}{\bar{X}\bar{Y}} \quad \text{or} \quad \frac{S_x^2}{S_{xy}} = \frac{\bar{X}}{\bar{Y}}$$

Let the model be $y_k = a + b x_k + \varepsilon_k$, $k \in U$.

The parameters are: $b = S_{xy}/S_x^2$ and $a = \bar{Y} - b\bar{X}$.

Therefore: the bias equals zero if and only if $b = \bar{Y}/\bar{X}$, i.e. if the intercept a equals zero.

Variance

The variance of $\hat{Y}_R = X \cdot \hat{Y}/\hat{X}$ cannot be calculated easily:

$$var(\hat{Y}_R) = X^2 \, var(\hat{Y}/\hat{X}) = \; ?$$

We use a linear approximation of $\hat{Y}_R$ to get the approximate variance:

$$var(\hat{Y}_R) \approx \sum_{k, \ell \in U} \Delta_{k\ell} \frac{e_k}{\pi_k} \frac{e_\ell}{\pi_\ell}$$

where $e_k = y_k - R x_k$ (residual).

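In the case of a SRS, we have, for large n:

$$var(\hat{Y}_R) \approx N^2 \left(1 - \frac{n}{N}\right) \frac{S_e^2}{n}$$

where $e_k = y_k - R x_k$.

It is estimated by:

$$\widehat{var}(\hat{Y}_R) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_{\hat{e}}^2}{n}$$

where $\hat{e}_k = y_k - \hat{R} x_k$.

Note: $S_e^2 = \frac{\sum_U (e_k - \bar{e})^2}{N-1} = \frac{\sum_U e_k^2}{N-1} = S_y^2 + R^2 S_x^2 - 2R\rho_{xy} S_x S_y$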
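A possible implementation of the SRS variance estimator through the residuals is sketched below in Python. The function name `ratio_total_srs` is hypothetical, and the data are the six sampled units of the earlier example table, used here only to exercise the formula (with n = 6 the large-n approximation is of course not reliable).

```python
from statistics import variance

def ratio_total_srs(y_s, x_s, N, X):
    """Ratio estimate of the total Y and its estimated variance under SRS."""
    n = len(y_s)
    R_hat = sum(y_s) / sum(x_s)
    # Residuals e_k = y_k - R_hat * x_k; they sum to zero by construction
    residuals = [yk - R_hat * xk for yk, xk in zip(y_s, x_s)]
    s2_e = variance(residuals)          # divisor n - 1
    var_hat = N**2 * (1 - n / N) * s2_e / n
    return X * R_hat, var_hat

x_s = [23, 56, 24, 67, 35, 19]
y_s = [122, 156, 465, 3243, 443, 973]
est, var_hat = ratio_total_srs(y_s, x_s, N=10, X=339)
```

Computing the residual variance directly avoids having to estimate $S_y$, $S_x$ and $\rho_{xy}$ separately; both routes give the same number.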

Comparison with the H-T Estimator

If s is a SRS of size n (large):

$$MSE(\hat{Y}) = var(\hat{Y}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_y^2}{n}$$

$$MSE(\hat{Y}_R) = var(\hat{Y}_R) + bias^2(\hat{Y}_R) \quad (\text{order } 1/n + \text{order } 1/n^2)$$

$$\approx var(\hat{Y}_R) \approx N^2 \left(1 - \frac{n}{N}\right) \frac{S_e^2}{n}$$

Note: $S_e^2 \leq S_y^2$ if and only if $\rho_{xy} = \frac{S_{xy}}{S_x S_y} \geq \frac{1}{2} \frac{CV_x}{CV_y}$.

Using the model $y_k = a + b x_k + \varepsilon_k$, the condition is: $a \leq \frac{1}{2} \bar{Y}$

Concluding Remarks about the Ratio Estimator

Auxiliary information such as X can improve considerably the precision of the estimator of interest (variance).

Unlike Horvitz-Thompson estimators, ratio estimators are usually slightly biased.

The bias is small if the correlation between x and y is large.

Separate ratio estimators usually lead to smaller variance estimates but also to larger bias.

Post-stratified Estimator

We aim at estimating the total $Y = \sum_{k \in U} y_k$ in U.

A sample s is selected in U by using p(s).

Data is collected for yk for k ∈ s.

Suppose now that we have the following auxiliary information for a given partition U = U1 ∪ .. ∪ Uh ∪ .. ∪ UH of the population U.

• $N_h$, h = 1, .., H (population counts in the post-strata)

• $z_{hk} = 1$ if $k \in U_h$, 0 otherwise, for $k \in s$, h = 1, .., H.

Example:

k    x   z1  z2  z3  sample  y
1    1   1   0   0   1       122
2    2   0   1   0   0       .
3    2   0   1   0   1       156
4    1   1   0   0   1       465
5    3   0   0   1   1       3243
6    3   0   0   1   0       .
7    3   0   0   1   1       443
8    2   0   1   0   0       .
9    1   1   0   0   1       973
10   3   0   0   1   0       .

Partition of the population in post-strata $U_h$, h = 1, .., H: $U_1 = \{k \mid x_k = 1\}$, $U_2 = \{k \mid x_k = 2\}$, $U_3 = \{k \mid x_k = 3\}$.

Counts $N_h$: $N_1 = 3$, $N_2 = 3$, and $N_3 = 4$.

Values $x_k$ known for $k \in U$, therefore also known for $k \in s$.

Partition: $U = U_1 \cup .. \cup U_h \cup .. \cup U_H$, with $N_h$ the number of observations in $U_h$, h = 1, .., H.

The population total Y may be expressed as:

$$Y = \sum_h Y_h = \sum_h N_h \bar{Y}_h$$

$N_h$ is known for h = 1, .., H.

The mean $\bar{Y}_h$ is estimated with $k \in s$.

We define the post-stratified estimator $\hat{Y}_{post}$ of Y:

$$\hat{Y}_{post} = \sum_h N_h \hat{\bar{Y}}_h = \sum_h N_h \frac{\hat{Y}_h}{\hat{N}_h}$$

Same for the post-stratified estimator $\hat{\bar{Y}}_{post}$ of $\bar{Y}$:

$$\hat{\bar{Y}}_{post} = \sum_h \frac{N_h}{N} \hat{\bar{Y}}_h = \sum_h \frac{N_h}{N} \frac{\hat{Y}_h}{\hat{N}_h}$$

Notes:

• $n_h$, the number of observations in $s \cap U_h$, is a random number,

• $\hat{\bar{Y}}_h$ is not defined if $n_h = 0$.

Post-Stratification for SRS and Relation to Stratification

If s is a SRS sample of size n with a post-stratification:

$$\hat{Y}_{post} = \sum_h N_h \hat{\bar{Y}}_h = .. = \sum_h N_h \bar{y}_{s_h} = \sum_h N_h \frac{\sum_{s_h} y_k}{n_h}$$

where $n_h$ is a random value, $E(n_h) = n \frac{N_h}{N}$, and $\hat{N}_h = N \frac{n_h}{n}$.

If s is a ST sample of size n with the allocation $n_h$, h = 1, .., H:

$$\hat{Y} = \sum_h N_h \hat{\bar{Y}}_h = \sum_h N_h \bar{y}_{s_h} = \sum_h N_h \frac{\sum_{s_h} y_k}{n_h}$$

where $n_h$ is a constant value.

Notation as a Weight Adjustment

General case:

$$\hat{Y}_{post} = \sum_h N_h \hat{\bar{Y}}_h = \sum_h \frac{N_h}{\hat{N}_h} \sum_{k \in s_h} d_k y_k = \sum_k g_k d_k y_k = \sum_k w_k y_k$$

Design or sampling weight, selection with p(s): $d_k$.

Weight adjustment or g-weight: $g_k = \frac{N_h}{\hat{N}_h} = \frac{N_h}{\sum_{\ell \in s_h} d_\ell}$ if $k \in s_h$.

Final weight: $w_k = d_k g_k$.

Notes: $g_k$ depends on s, $\hat{N}_h = n_h N/n$ for SRS, and $\sum_{s_h} w_k = N_h$ (calibration).
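The post-stratification weights can be illustrated with the ten-unit example table given earlier (post-strata defined by x = 1, 2, 3). The Python sketch below assumes a SRS of the six sampled units, which is an assumption about the example rather than something it states; the variable names are illustrative.

```python
from collections import defaultdict

# Sampled units from the example table: (post-stratum x_k, y_k).
sample = [(1, 122), (2, 156), (1, 465), (3, 3243), (3, 443), (1, 973)]
N_h = {1: 3, 2: 3, 3: 4}        # known post-stratum counts
N = sum(N_h.values())
d = N / len(sample)             # SRS design weight (assumed design)

# Estimated post-stratum sizes: sum of the design weights in each post-stratum
N_hat = defaultdict(float)
for h, _ in sample:
    N_hat[h] += d

# Final weights w_k = d_k * g_k with g_k = N_h / N_hat_h
weights = [d * N_h[h] / N_hat[h] for h, _ in sample]
Y_post = sum(w * y for w, (_, y) in zip(weights, sample))

# Calibration check: the final weights add up to N_h in each post-stratum
weight_sums = defaultdict(float)
for w, (h, _) in zip(weights, sample):
    weight_sums[h] += w
```

Under these assumptions the final weights are 1, 3 and 2 in post-strata 1, 2 and 3, that is $N_h/n_h$, and the post-stratified estimate equals $\sum_h N_h \bar{y}_{s_h}$.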

Relation to the Ratio Estimator

Separate ratio estimator in G groups: $\hat{Y}_R = \sum_g \frac{X_g}{\hat{X}_g} \hat{Y}_g$

Post-stratified estimator in H post-strata: $\hat{Y}_{post} = \sum_h \frac{N_h}{\hat{N}_h} \hat{Y}_h$

Both estimators are identical if $x_k = 1$ for all k, i.e. $X_g = N_g$ and $\hat{X}_g = \sum_{k \in s_g} d_k x_k = \hat{N}_g$.

Bias

• $\hat{Y}_{post}$ is approximately unbiased conditional on $n_h$, h = 1, .., H.

• $\hat{Y}_{post}$ is unbiased conditional on $n_h$, h = 1, .., H, if $n_h \geq 1$ for all post-strata h.

• $\hat{Y}_{post}$ is also approximately unbiased unconditionally.

→ the bias is small if n is large and $n_h \geq 1$ for all h = 1, .., H.

Variance

For s a SRS of size n with a post-stratification based on $N_h$, h = 1, .., H, we have the approximate variance:

$$var(\hat{Y}_{post}) \approx N^2 \left[\left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_h \frac{N_h}{N} S_h^2 + \left(1 - \frac{n}{N}\right) \frac{1}{n^2} \sum_h \frac{N - N_h}{N} S_h^2\right]$$

First part: variance for a ST with proportional allocation $n_h = n N_h/N$.

Second part: variability due to the random size $n_h$. Small in comparison with the first part (order $1/n^2$).

If $n$ is large, we can estimate the variance as:

$$\widehat{\mathrm{var}}(\hat Y_{post}) = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_h \frac{N_h}{N} s_h^2 = \left(1 - \frac{n}{N}\right) \frac{N}{n} \sum_h N_h s_h^2$$

where

$$s_h^2 = \frac{1}{n_h - 1} \sum_{k \in s_h} (y_k - \bar y_{s_h})^2$$

i.e. the H-T variance for ST with proportional allocation.

Concluding Remarks about Post-Stratification

Auxiliary information such as Nh for h = 1, .., H can improve the precision of the estimator of interest.

Characteristics of good strata are also applicable to good post-strata (homogeneous groups, not too small).

"Post-stratification for an SRS" and "ST with proportional allocation" give similar results if $n$ is large enough.

Problems occur e.g. if some post-strata are small or even empty.

References

• Särndal, C.-E., Swensson, B., and Wretman, J. (1997) Model assisted survey sampling, Springer series in statistics. Chapters 6-7.

• Lohr, S.L. (1999) Sampling: design and analysis, Duxbury Press. Chapters 3-4.

• Cochran, W.G. (1977) Sampling techniques, John Wiley & Sons, Inc. Chapters 5A and 6.

Survey methodology and sampling techniques Exercises

Ratio estimators and post-stratification

Jean-Pierre Renfer, Paul-André Salamin
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008

Contents

Exercise 1: Ratio estimator
Exercise 2: Post-stratification

Exercise 1: Ratio estimator

With the aim of estimating the total $Y = \sum_U y_k$, an SRS $s$ was drawn. The data were collected for $y_k$, $k \in s$, and for the auxiliary variable $x$, with $x_k \ge 0$, $k \in s$. As the population total $X = \sum_U x_k$ is known, the ratio estimator $\hat Y_R = X(\hat Y/\hat X)$ may be used to estimate $Y$.

1. Compute the ratio estimator of the mean rent using the surface as auxiliary variable, and compute an estimator of its variance, the confidence interval with α = 5% and the coefficient of variation. Compare with the H-T estimator.

Hint: Use the information on the mean surface $\bar X = 86$ m² and the population size $N = 151$ from the population, and the following information from the sample samp1: sample size $n = 30$, mean surface $\hat{\bar X} = 90$ m², mean rent $\hat{\bar Y} = 717$ €, standard deviation of the rent $s_y = 451.9$ €, standard deviation of the surface $s_x = 43.6$ m² and the coefficient of correlation of the rent and the surface $\hat\rho_{xy} = 0.84$.

2. (∗) The residual $e_k = y_k - R x_k$, with $R = Y/X$, is used to estimate the variance with a linear approximation of $\hat Y_R$. We have seen in the lecture that

$$\mathrm{var}(\hat Y_R) \approx N^2 \left(1 - \frac{n}{N}\right) \frac{S_e^2}{n}$$

Show that

$$N^2 \left(1 - \frac{n}{N}\right) \frac{S_e^2}{n} = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(S_y^2 + R^2 S_x^2 - 2 R \rho_{xy} S_x S_y\right)$$

To do so, show that

(a) $S_e^2 = \frac{1}{N-1} \sum_U (e_k - \bar e)^2 = \frac{1}{N-1} \sum_U e_k^2$, i.e. show that $\bar e = 0$,

(b) $\frac{1}{N-1} \sum_U e_k^2 = \frac{1}{N-1} \sum_U (y_k - R x_k)^2 = S_y^2 + R^2 S_x^2 - 2 R \rho_{xy} S_x S_y$.

Exercise 2: Post-stratification

1. Explain the characteristics of domains, strata and post-strata as well as that of a ratio estimator and a post-stratified estimator under SRS.

2. Compute the post-stratified estimator based on the sample samp1 with the known population size for flats without balcony, N0 = 67, and for flats with balcony N1 = 84. Compute

(a) the mean rent $\hat{\bar Y}_{post}$,

(b) the variance $\widehat{\mathrm{var}}(\hat{\bar Y}_{post})$,

(c) the coefficient of variation $\widehat{CV}(\hat{\bar Y}_{post})$,

(d) the confidence interval (α = 5%).

Remember: flats without balcony, $h = 0$, $\bar y_{s_h} = 555$ €, $s_h = 235.0$ €; flats with balcony, $h = 1$, $\bar y_{s_h} = 880$ €, $s_h = 557.9$ €; and the mean rent in the population $\bar Y = 724$ €.

3. Compare the H-T estimator under SRS, the ratio estimator from exercise 1 and the post-stratified estimator of the mean rent based on samp1.

4. (∗) Show that under SRS $\hat Y_{post} = \sum_h N_h \bar y_{s_h}$, using $\hat Y_{post} = \sum_h N_h \hat{\bar Y}_h$ and the definition of the post-stratified estimator of the mean.


Survey methodology and sampling techniques Answers to the Exercises

Ratio estimators and post-stratification

Jean-Pierre Renfer, Paul-André Salamin
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008

Contents

Exercise 1: Ratio estimator
Exercise 2: Post-stratification

Exercise 1: Ratio estimator

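1. $\hat{\bar Y}_R = \bar X\,\frac{\hat{\bar Y}}{\hat{\bar X}} = 685$ €, with $\hat R = \frac{\hat{\bar Y}}{\hat{\bar X}} = 8.0$.

$$\widehat{\mathrm{var}}(\hat{\bar Y}_R) = \frac{1}{N^2}\,\widehat{\mathrm{var}}(\hat Y_R) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(s_y^2 + \hat R^2 s_x^2 - 2 \hat R \hat\rho_{xy} s_x s_y\right) = 1633.6$$

$$\widehat{\mathrm{std}}(\hat{\bar Y}_R) = \sqrt{\widehat{\mathrm{var}}(\hat{\bar Y}_R)} = 40.4$$

Confidence interval: $[\hat{\bar Y}_R \pm 1.96 \cdot \widehat{\mathrm{std}}(\hat{\bar Y}_R)] = [606, 764]$

$\widehat{CV}(\hat{\bar Y}_R) = 5.9\%$. H-T estimator: $\bar y_s = 717$, $\widehat{CV}(\bar y_s) = 10.3\%$. Hence, the ratio estimator with the surface as auxiliary information is more precise than the H-T estimator.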

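2. (a) $\bar e = \frac{1}{N} \sum_{k \in U} (y_k - R x_k) = \frac{1}{N} \sum_{k \in U} y_k - R\,\frac{1}{N} \sum_{k \in U} x_k = \bar Y - \frac{\bar Y}{\bar X}\,\bar X = 0$ □

(b)

$$\begin{aligned}
S_e^2 &= \frac{1}{N-1} \sum_U (y_k - R x_k)^2 \\
&= \frac{1}{N-1} \sum_U \left(y_k - \bar Y + \bar Y - R x_k\right)^2 \\
&= \frac{1}{N-1} \sum_U \left((y_k - \bar Y) + (R \bar X - R x_k)\right)^2 \\
&= \frac{1}{N-1} \sum_U \left((y_k - \bar Y) - R (x_k - \bar X)\right)^2 \\
&= \frac{1}{N-1} \sum_U \left[(y_k - \bar Y)^2 + R^2 (x_k - \bar X)^2 - 2 R (x_k - \bar X)(y_k - \bar Y)\right] \\
&= \frac{1}{N-1} \left[(N-1) S_y^2 + (N-1) R^2 S_x^2 - 2 (N-1) R \rho_{xy} S_x S_y\right] \\
&= S_y^2 + R^2 S_x^2 - 2 R \rho_{xy} S_x S_y \quad \square
\end{aligned}$$

where the correlation is $\rho_{xy} = \frac{\sum_{k \in U} (x_k - \bar X)(y_k - \bar Y)}{(N-1) S_x S_y}$.

Note that the population variance $S_e^2$ is estimated by the sample variance $s_e^2 = \frac{1}{n-1} \sum_{k \in s} (y_k - \hat R x_k)^2$; therefore, the estimated variance of $\hat Y_R$ may be expressed as

$$\begin{aligned}
\widehat{\mathrm{var}}(\hat Y_R) &= N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n}\,\frac{1}{n-1} \sum_{k \in s} (y_k - \hat R x_k)^2 \\
&= N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} \left(s_y^2 + \hat R^2 s_x^2 - 2 \hat R \hat\rho_{xy} s_x s_y\right)
\end{aligned}$$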

There are several equivalent expressions for the variance estimator of the ratio estimator. Refer to the literature mentioned in the course for further expressions.
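The figures in item 1 can be recomputed from the summary statistics given in the hint. The following Python sketch is only a check under the assumption that the linearised variance formula above is applied with the unrounded $\hat R$; the variable names are illustrative.

```python
import math

# Summary statistics from the hint of Exercise 1 (sample samp1)
N, n = 151, 30
X_bar = 86.0                  # known population mean surface [m2]
x_bar, y_bar = 90.0, 717.0    # sample means: surface [m2], rent [EUR]
s_x, s_y, rho = 43.6, 451.9, 0.84

f = n / N
R_hat = y_bar / x_bar
Y_R = X_bar * R_hat           # ratio estimator of the mean rent

# Linearised variance estimator of the ratio estimator of the mean
var_R = (1 - f) / n * (s_y**2 + R_hat**2 * s_x**2 - 2 * R_hat * rho * s_x * s_y)
std_R = math.sqrt(var_R)
ci_R = (Y_R - 1.96 * std_R, Y_R + 1.96 * std_R)
cv_R = std_R / Y_R

# H-T estimator (sample mean) for comparison
std_HT = math.sqrt((1 - f) / n) * s_y
cv_HT = std_HT / y_bar

print(f"ratio estimator {Y_R:.0f}, variance {var_R:.1f}, std {std_R:.1f}, CV {cv_R:.1%}")
print(f"CI: [{ci_R[0]:.0f}, {ci_R[1]:.0f}]")
print(f"H-T estimator {y_bar:.0f}, CV {cv_HT:.1%}")
```

Up to rounding, this should agree with the values quoted in the answer (685, 1633.6, 40.4, 5.9% and 10.3%).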


Exercise 2: Post-stratification

1. Characteristics of domains, strata and post-strata, as well as of ratio estimators and post-stratified estimators.

• Domains, strata and post-strata

In stratification and post-stratification, auxiliary information is used for estimation, whereas no auxiliary information is used for estimation in domains.

Aim:
  domain: analysis only
  strata: reduce variance; objectives of the design (control of the size of sub-populations in the sample); optimal allocation to reduce the sample size for a given precision
  post-strata: reduce variance; treatment of non-response

When used:
  domain: sometimes during design, but often too many cells for stratification
  strata: design
  post-strata: estimation

• Ratio and post-stratified estimators

The post-stratified estimator is a special case of the ratio estimator where $x_{hk} = 1$ for $k \in U_h$ and 0 otherwise. The post-strata are based on the categories of categorical variables, whereas continuous variables are usually used to build the ratio estimator.

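2. Post-stratified estimator for the mean rent: $\hat{\bar Y}_{post} = \frac{1}{N} \sum_h N_h \bar y_{s_h} = 736$.

$$\widehat{\mathrm{var}}(\hat{\bar Y}_{post}) \approx \frac{1}{N^2} \left(1 - \frac{n}{N}\right) \frac{N}{n} \sum_h N_h s_h^2 = 5279, \quad \text{with } s_h^2 = \frac{1}{n_h - 1} \sum_{k \in s_h} (y_k - \bar y_{s_h})^2.$$

$$\widehat{\mathrm{std}}(\hat{\bar Y}_{post}) = \sqrt{\widehat{\mathrm{var}}(\hat{\bar Y}_{post})} = 72.7.$$

$\widehat{CV}(\hat{\bar Y}_{post}) = \widehat{\mathrm{std}}(\hat{\bar Y}_{post}) / \hat{\bar Y}_{post} = 9.9\%$.

$CI(\hat{\bar Y}_{post}) \approx [\hat{\bar Y}_{post} \pm 1.96\,\widehat{\mathrm{std}}(\hat{\bar Y}_{post})] = [593, 878]$

                        H-T         Ratio       Post-stratified
Auxiliary information   none        X           N_h
Mean rent [€]           717         685         736
CV [%]                  10.3        5.9         9.9
CI                      [566,868]   [606,764]   [593,878]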

3. The post-stratified estimation is less precise than the ratio estimation but slightly more precise than the H-T estimation. The ratio estimation shows a substantial gain in precision compared with the H-T and post-stratified estimation.

4.

$$\begin{aligned}
\hat Y_{post} &= \sum_h N_h \hat{\bar Y}_h = \sum_h N_h \frac{1}{\hat N_h} \hat Y_h \\
&= \sum_h N_h \frac{n}{N n_h} \hat Y_h = \sum_h N_h \frac{n}{N n_h}\,\frac{N}{n} \sum_{s_h} y_k \\
&= \sum_h N_h \bar y_{s_h}
\end{aligned}$$
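The numerical answers of item 2 can be checked with a few lines of Python. This is a sketch that simply plugs the figures given in the exercise into the large-sample variance estimator; the variable names are illustrative.

```python
import math

N, n = 151, 30
N_h = {0: 67, 1: 84}             # flats without / with balcony
ybar_h = {0: 555.0, 1: 880.0}    # post-stratum sample means [EUR]
s_h = {0: 235.0, 1: 557.9}       # post-stratum sample standard deviations [EUR]

# Post-stratified estimator of the mean
mean_post = sum(N_h[h] * ybar_h[h] for h in N_h) / N

# (1/N^2) (1 - n/N) (N/n) sum_h N_h s_h^2
var_post = (1 - n / N) / (N * n) * sum(N_h[h] * s_h[h] ** 2 for h in N_h)
std_post = math.sqrt(var_post)
cv_post = std_post / mean_post
ci_post = (mean_post - 1.96 * std_post, mean_post + 1.96 * std_post)

print(f"mean {mean_post:.0f}, variance {var_post:.0f}, std {std_post:.1f}, CV {cv_post:.1%}")
print(f"CI: [{ci_post[0]:.0f}, {ci_post[1]:.0f}]")
```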


Survey Methodology and Sampling Techniques
Regression estimator

Guillaume Chauvet, Eric Lesage

Institut national de la statistique et des études économiques

15 - 18 April 2008

This presentation is based on teaching documents by the CEPE (Insee)


Learning outcomes

You will know:

• what a difference estimator is,

• what a regression estimator is,

• how to calculate the parameters of the linear model,

• how to calculate the regression estimator and its variance estimator.

Outline

Introduction

Difference estimation

Regression estimation
  Definition
  Other expressions of the regression estimator
  Expectation and variance of the regression estimator
  Simple Random Sampling without replacement
  Computing

Introduction

The goal of the regression estimator is to obtain a more precise estimator than the Horvitz-Thompson one. Moreover, it estimates the totals of the auxiliary variables exactly (the known totals are reproduced).

Requirements:

• auxiliary information is needed,

• we are after the collection phase (the auxiliary variables are used at the estimation stage).

Notation

Population: $U = \{1, \dots, k, \dots, N\}$.
Variable of interest (unknown): $y$.
Auxiliary variables: $x_1, \dots, x_j, \dots, x_J$, known on $U$.

$$x_k = (x_{1k}, \dots, x_{jk}, \dots, x_{Jk})', \qquad X = \sum_{k \in U} x_k = (X_1, \dots, X_j, \dots, X_J)', \qquad X_j = \sum_{k \in U} x_{jk}$$

Weights of the units: $c_k$ ($c_k = 1$ in general).

Model assisted

It is assumed that an approximate linear relation exists between $y_k$ and the auxiliary variables:

$$y_k = \sum_{j=1}^{J} b_j x_{jk} + E_k$$

Difference estimation

In this simple case, the coefficients $b_j$ are assumed to be known (previous survey data, expert estimates, or $b_j = 1$, ...). We have:

$$y_k = y_k^0 + (y_k - y_k^0) = \sum_{j=1}^{J} b_j x_{jk} + \delta_k = b' x_k + \delta_k$$

with $y_k^0 = b' x_k$, $b = (b_1, \dots, b_J)'$, and $\delta_k = y_k - y_k^0$ assumed to be small compared to $b' x_k$. Hence:

$$Y = \sum_{k \in U} y_k = \sum_{k \in U} y_k^0 + \sum_{k \in U} \delta_k$$

• $\sum_{k \in U} y_k^0 = \sum_{k \in U} b' x_k = b' X = \sum_{j=1}^{J} b_j X_j$ is known,

• $\sum_{k \in U} \delta_k$ can be estimated unbiasedly by the Horvitz-Thompson estimator:

$$\begin{aligned}
\sum_{k \in s} \frac{\delta_k}{\pi_k} &= \sum_{k \in s} \frac{y_k}{\pi_k} - \sum_{k \in s} \frac{b' x_k}{\pi_k} = \hat Y_\pi - \sum_{j=1}^{J} b_j \sum_{k \in s} \frac{x_{jk}}{\pi_k} \\
&= \hat Y_\pi - \sum_{j=1}^{J} b_j \hat X_{j\pi} = \hat Y_\pi - b' \hat X_\pi
\end{aligned}$$

Difference estimator

First expression:

$$\hat Y_{dif} = b' X + \left(\hat Y_\pi - b' \hat X_\pi\right)$$

Second expression:

$$\hat Y_{dif} = \hat Y_\pi + b' \left(X - \hat X_\pi\right)$$

$\hat Y_{dif}$ is an unbiased estimator of $Y$; its variance $\mathrm{Var}(\hat Y_{dif})$ is the variance of $\sum_{k \in s} \delta_k / \pi_k$.

The difference estimator is thus worthwhile if the deviations $\delta_k$ are small, which happens if the coefficients $b_j$ are such that the linear approximation is "correct". Otherwise, if the $b_j$ are not well chosen, the variance of $\hat Y_{dif}$ can be larger than the variance of $\hat Y_\pi$!

Remark: in order to compute $\hat Y_{dif}$, it is sufficient to know the values of the $x_j$ on the sample and their totals on the population.
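A minimal Python sketch of the second expression of the difference estimator follows. The function names `ht_total` and `difference_estimator`, the small data set, the total `X` and the coefficient `b` are all invented for illustration; as the remark says, only the sample values of $x$ and their population totals are needed.

```python
def ht_total(values, pi):
    """Horvitz-Thompson estimator of a total: sum of v_k / pi_k over the sample."""
    return sum(v / p for v, p in zip(values, pi))

def difference_estimator(y, x, pi, b, X):
    """Y_dif = Y_pi + b'(X - X_pi), for J auxiliary variables.

    y:  sampled values y_k
    x:  list of rows x_k = (x_1k, ..., x_Jk)
    pi: inclusion probabilities pi_k
    b:  coefficients b_j, assumed known
    X:  known population totals X_j
    """
    Y_pi = ht_total(y, pi)
    X_pi = [ht_total([row[j] for row in x], pi) for j in range(len(b))]
    return Y_pi + sum(bj * (Xj - Xpj) for bj, Xj, Xpj in zip(b, X, X_pi))

# Hypothetical SRS of n = 4 from N = 20, one auxiliary variable, b = 2 assumed known
N, n = 20, 4
pi = [n / N] * n
x = [[3.0], [5.0], [4.0], [8.0]]
y = [6.5, 9.0, 8.5, 16.0]
X = [90.0]    # invented population total of x
b = [2.0]

Y_pi = ht_total(y, pi)
Y_dif = difference_estimator(y, x, pi, b, X)
print(Y_pi, Y_dif)
```

Here the sample over-represents large values of $x$ (the H-T estimate of $X$ is above the known total), so the difference estimator corrects the H-T estimate of $Y$ downwards.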

Regression estimation

This time, the $b_j$ are no longer assumed to be known, so they must be estimated. The idea is to use the linear regression parameters estimated by the least squares method (ordinary least squares when $c_k = 1$). We want the $\tilde b_j$ which minimize the quantity:

$$\sum_{k \in U} c_k \left(y_k - \sum_{j=1}^{J} b_j x_{jk}\right)^2$$

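We get on $U$:

$$\tilde b = \left(\sum_{k \in U} c_k x_k x_k'\right)^{-1} \left(\sum_{k \in U} c_k x_k y_k\right) = T^{-1} \theta$$

where $T$ is a $(J, J)$ matrix and $\theta$ a $(J, 1)$ vector.

This can be estimated on $s$ by:

$$\hat b = \left(\sum_{k \in s} c_k \frac{x_k x_k'}{\pi_k}\right)^{-1} \left(\sum_{k \in s} c_k \frac{x_k y_k}{\pi_k}\right) = \hat T^{-1} \hat\theta$$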

Computing $\hat b$

You can use the SAS procedure SURVEYREG to calculate $\hat b$ and its variance.

Remark: it is also acceptable to estimate $\tilde b$ by the unweighted version:

$$\check b = \left(\sum_{k \in s} c_k x_k x_k'\right)^{-1} \left(\sum_{k \in s} c_k x_k y_k\right)$$

The regression estimator is obtained by replacing $b$ by $\hat b$ in the difference estimator expression.

Regression estimator:

$$\hat Y_{reg} = \hat b' X + \left(\hat Y_\pi - \hat b' \hat X_\pi\right)$$

Or:

$$\hat Y_{reg} = \hat Y_\pi + \hat b' \left(X - \hat X_\pi\right)$$

Remark: $\hat Y_{reg} = \hat b' X$ when the constant variable equal to 1 is used among the auxiliary variables.

Advantages

If the regression estimator is used to estimate the total of an auxiliary variable $X_j$, then we have:

$$\hat X_{j,reg} = X_j$$

More generally, if there exists an exact linear relationship between $y$ and the $x_j$ on $U$, then:

$$\hat Y_{reg} = Y$$

Drawback

If the variable of study $y$ is such that $\mathrm{Var}(\hat Y_\pi) = 0$, then $\mathrm{Var}(\hat Y_{reg})$ is not necessarily equal to zero (strata sizes for example).

Regression weight

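Using a regression estimator can be seen as building a new set of weights:

$$\hat Y_{reg} = \sum_{k \in s} g_{s,k} \frac{y_k}{\pi_k} = \sum_{k \in s} w_{s,k} y_k$$

with

$$g_{s,k} = 1 + c_k \left(X - \hat X_\pi\right)' \hat T^{-1} x_k$$

$\hat Y_{reg}$ is a weighted sum of the $y_k$, where the weights $w_{s,k} = g_{s,k}/\pi_k$ depend on the sample $s$.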
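As a sketch of how these weights can be computed, the Python code below treats the special case $x_k = (1, x_k)'$ with $c_k = 1$, where $\hat T$ is a 2x2 matrix that can be inverted by hand. The function name `regression_weights` is hypothetical; the data are the five sample values used in Exercise 1 of the regression exercises ($N = 11$, $\bar X = 5$), under SRS.

```python
def regression_weights(x, pi, X_total, N):
    """g-weights and final weights for x_k = (1, x_k)' with c_k = 1.

    g_k = 1 + (X - X_pi)' T_hat^{-1} x_k, with T_hat = sum_s x_k x_k' / pi_k.
    """
    d = [1.0 / p for p in pi]
    # Entries of the symmetric 2x2 matrix T_hat
    t00 = sum(d)
    t01 = sum(dk * xk for dk, xk in zip(d, x))
    t11 = sum(dk * xk * xk for dk, xk in zip(d, x))
    det = t00 * t11 - t01 * t01
    # Gaps between the known totals (N, X) and their H-T estimates
    gap0 = N - t00
    gap1 = X_total - t01
    # lam = T_hat^{-1} (X - X_pi), using the explicit 2x2 inverse
    lam0 = (t11 * gap0 - t01 * gap1) / det
    lam1 = (-t01 * gap0 + t00 * gap1) / det
    g = [1.0 + lam0 + lam1 * xk for xk in x]
    w = [gk * dk for gk, dk in zip(g, d)]
    return g, w

N, n = 11, 5
x = [1, 2, 3, 5, 9]
y = [1, 4, 9, 25, 81]
pi = [n / N] * n

g, w = regression_weights(x, pi, X_total=5 * N, N=N)
Y_reg = sum(wk * yk for wk, yk in zip(w, y))
print(g)
print(sum(w), sum(wk * xk for wk, xk in zip(w, x)), Y_reg / N)
```

The weights reproduce the two known totals ($N$ and $X$) exactly, and the weighted mean of $y$ is the regression estimate of $\bar Y$.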

The weights $w_{s,k}$ do not depend on the variable of interest $y$. They are computed once and for all, when the sample is selected. They are calibrated on the known totals: these totals are estimated without bias and with a variance equal to zero.

The weights $w_{s,k}$ can be negative...

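Second expression

We introduce the term $\tilde b' \left(X - \hat X_\pi\right)$ in order to obtain the following decomposition:

$$\hat Y_{reg} = \tilde b' X + \underbrace{\left(\hat Y_\pi - \tilde b' \hat X_\pi\right)}_{o\left(1/\sqrt{n}\right)} + \underbrace{\left(\hat b - \tilde b\right)' \left(X - \hat X_\pi\right)}_{o\left(1/n\right)}$$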

Taylor linearisation technique

When $n$ is large:

$$\hat Y_{reg} \approx \tilde b' X + \left(\hat Y_\pi - \tilde b' \hat X_\pi\right)$$

(i.e. the term $\left(\hat b - \tilde b\right)' \left(X - \hat X_\pi\right)$ is neglected).

Expectation

$\hat Y_{reg}$ is approximately unbiased.

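Variance

$$\mathrm{Var}\left(\hat Y_{reg}\right) = \mathrm{Var}\left(\sum_{k \in s} g_{s,k} \frac{E_k}{\pi_k}\right)$$

An approximation of the variance of $\hat Y_{reg}$ is given by:

$$\mathrm{Var}\left(\hat Y_{reg}\right) \approx \mathrm{Var}\left(\sum_{k \in s} \frac{E_k}{\pi_k}\right)$$

where $E_k = y_k - \tilde b' x_k$ is the residual of the regression in $U$.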

Estimated variance of $\hat Y_{reg}$

First formula

$$\widehat{\mathrm{Var}}_1\left(\hat Y_{reg}\right) = \widehat{\mathrm{Var}}\left(\sum_{k \in s} g_{s,k} \frac{e_k}{\pi_k}\right) = \sum_{k, l \in s} \frac{\Delta_{kl}}{\pi_{kl}} \left(g_{s,k} \frac{e_k}{\pi_k}\right) \left(g_{s,l} \frac{e_l}{\pi_l}\right)$$

with $e_k = y_k - \hat b' x_k$ the residual of the regression in $s$.

Second formula

$$\widehat{\mathrm{Var}}_2\left(\hat Y_{reg}\right) = \widehat{\mathrm{Var}}\left(\sum_{k \in s} \frac{e_k}{\pi_k}\right) = \sum_{k, l \in s} \frac{\Delta_{kl}}{\pi_{kl}}\,\frac{e_k}{\pi_k}\,\frac{e_l}{\pi_l}$$

The first formula differs from the second one by the presence of $g_{s,k}$. Both formulas are valid in principle for large sample sizes, and the first one is generally better.

Case of a single auxiliary variable with a constant term

Simple Random Sampling without replacement

The model: $y_k = a + b x_k + E_k$, with $c_k = 1$.

Parameters on $U$:

$$\tilde b = \frac{S_{xy}}{S_x^2}, \qquad \tilde a = \bar Y - \tilde b \bar X$$

Parameters on $s$:

$$\hat b = \frac{s_{xy}}{s_x^2} = \frac{\sum_{k \in s} (x_k - \bar x)(y_k - \bar y)}{\sum_{k \in s} (x_k - \bar x)^2}$$

= slope of the regression line of $y_k$ on $x_k$ in the sample.

$$\hat a = \bar y - \hat b \bar x$$

Regression estimator

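$$\hat Y_{reg} = \hat a N + \hat b X$$

$$\hat Y_{reg} = N \left[\bar y + \hat b \left(\bar X - \bar x\right)\right]$$

The coefficients $g_{s,k}$ can be written:

$$g_{s,k} = 1 + \frac{\left(\bar X - \bar x\right)(x_k - \bar x)}{s_x^2}$$

where $s_x^2$ is taken here with divisor $n$, i.e. $s_x^2 = \frac{1}{n} \sum_{k \in s} (x_k - \bar x)^2$ (with divisor $n - 1$ the expression holds only up to the factor $n/(n-1)$).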

Estimated variances of $\hat Y_{reg}$

First formula

$$\widehat{\mathrm{Var}}_1\left(\hat Y_{reg}\right) = N^2\,\frac{1 - f}{n}\,\frac{1}{n - 1} \sum_{k \in s} \left(g_{s,k} e_k\right)^2$$

with $\pi_k = \frac{n}{N} = f$.

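Second formula

$$\widehat{\mathrm{Var}}_2\left(\hat Y_{reg}\right) = N^2\,\frac{1 - f}{n}\,\frac{1}{n - 1} \sum_{k \in s} e_k^2$$

with $e_k = y_k - \hat a - \hat b x_k$.

$$\widehat{\mathrm{Var}}_2\left(\hat Y_{reg}\right) = N^2\,\frac{1 - f}{n} \left(1 - r_{xy}^2\right) s_y^2$$

with $r_{xy}$ the sample correlation coefficient of $x$ and $y$.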
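The equality of the two expressions of the second formula can be checked with a short derivation. It assumes $\hat a = \bar y - \hat b \bar x$, $\hat b = s_{xy}/s_x^2$, $r_{xy} = s_{xy}/(s_x s_y)$, and sample variances and covariance all computed with the same divisor $n - 1$:

```latex
\begin{aligned}
\sum_{k\in s} e_k^2
  &= \sum_{k\in s}\bigl[(y_k-\bar y)-\hat b\,(x_k-\bar x)\bigr]^2 \\
  &= (n-1)\bigl(s_y^2 - 2\,\hat b\, s_{xy} + \hat b^2 s_x^2\bigr) \\
  &= (n-1)\Bigl(s_y^2 - \frac{s_{xy}^2}{s_x^2}\Bigr)
   = (n-1)\,\bigl(1-r_{xy}^2\bigr)\,s_y^2 .
\end{aligned}
```

Dividing by $n - 1$ and multiplying by $N^2 (1 - f)/n$ gives the expression in terms of $r_{xy}$.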


Comparison with the SRS precision

The approximate variance of $\hat Y_{reg}$ is:

$$\mathrm{Var}\left(\hat Y_{reg}\right) \approx N^2\,\frac{1 - f}{n}\,S_E^2$$

where $S_E^2$ is the variance in $U$ of the variable defined by $E_k = y_k - \tilde b x_k$.

$$\mathrm{Var}\left(\hat Y_{reg}\right) \approx N^2\,\frac{1 - f}{n}\,S_y^2 \left(1 - \rho_{xy}^2\right)$$

where $\rho_{xy} = \frac{S_{xy}}{S_x S_y}$ is the coefficient of correlation between $x$ and $y$ in $U$.

Hence:

$$\frac{\mathrm{Var}\left(\hat Y_{reg}\right)}{\mathrm{Var}\left(\hat Y_\pi\right)} \approx 1 - \rho_{xy}^2$$

For large sample sizes, the regression estimator $\hat Y_{reg}$ will be more efficient than the Horvitz-Thompson one when the correlation between $x$ and $y$ is large (in absolute value).

Comparison with the ratio estimator

If we use the approximate variances, we have:

$$\mathrm{Var}\left(\hat Y_{reg}\right) \le \mathrm{Var}\left(\hat Y_R\right)$$

with equality when:

$$\frac{\bar Y}{\bar X} = \frac{S_{xy}}{S_x^2} = \tilde b$$

$\hat Y_{reg}$ is better than $\hat Y_R$ when the regression line of $y$ on $x$ in $U$ does not pass through the origin (result valid for large sample sizes).

SAS

In order to estimate the parameters $b_j$, you can use the SAS procedures SURVEYREG, GLM or REG. But you will see in the section on calibration that the regression estimator is nothing other than a calibration estimator associated with a specific distance function. So you can obtain the regression weights by using the SAS macro CALMAR...

Survey methodology and sampling techniques Exercises

Regression estimator

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

1 Exercise 1

2 Exercise 2
2.1 First regression estimator
2.2 Second regression estimator

1 Exercise 1

We want to estimate the mean $\bar Y$ of the study variable $y$. The size of the population $U$ is $N = 11$. We have an auxiliary variable $x$ observed on the sample and we know that $\bar X = 5$.

We draw an SRS of size $n = 5$ and obtain the following results:

x_k      y_k
1        1
2        4
3        9
5        25
9        81
x̄ = 4    ȳ = 24

Which estimators would you use? Calculate these estimators and their confidence intervals.

We give the following information:

$$\sum_{k \in s} (x_k - \bar x)^2 = 40, \qquad \sum_{k \in s} (y_k - \bar y)^2 = 4{,}404,$$

$$\sum_{k \in s} (x_k - \bar x)(y_k - \bar y) = 410, \qquad \sum_{k \in s} (y_k - 6 x_k)^2 = 924.$$

(Note: in fact the relation between $y$ and $x$ is $y_k = x_k^2$. What do you think about the regression estimator in this case?)

2 Exercise 2

We consider an agricultural region with $N = 2{,}010$ farms. We are looking for the mean $\bar Y$ of the cultivated surface (in wheat). The sample size is $n = 100$. We use simple random sampling.

2.1 First regression estimator

We know the mean $\bar X = 118$ of the cultivated surface (all crops). On the sample we have: $\bar x = 132$, $\bar y = 29$, $s_x^2 = 7{,}619$, $s_y^2 = 620$ and $s_{xy} = 1{,}453$.

We consider the linear regression: $y_k = a + b x_k + E_k$. Calculate the parameters $\hat a$ and $\hat b$. Calculate the regression estimator of $\bar Y$.

2.2 Second regression estimator

This time, we will use the auxiliary variable $x$ in a different way. We divide the farms into 2 strata: the farms with less than 160 acres ($h = 1$) and the farms with more than 160 acres ($h = 2$).

Stratum 1            Stratum 2
N_1 = 1,580          N_2 = 430
ȳ_1 = 19             ȳ_2 = 52
s²_{y1} = 312        s²_{y2} = 922

We note $z_k = 1_{k \in U_1}$.

We consider the linear regression: $y_k = c + d z_k + E_k$

Calculate the parameters $\hat c$ and $\hat d$. Calculate the regression estimator of $\bar Y$.

Do you recognize which estimator we obtain?


Survey methodology and sampling techniques Answers to the Exercises

Regression estimator

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

1 Exercise 1
1.1 Horvitz-Thompson Estimator
1.2 Ratio Estimator
1.3 Regression Estimator

2 Exercise 2
2.1 First regression estimator
2.2 Second regression estimator

1 Exercise 1

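1.1 Horvitz-Thompson Estimator

π-estimator: $\hat{\bar Y}_\pi = \bar y = 24$

Variance estimator: $\widehat{\mathrm{Var}}\left(\hat{\bar Y}_\pi\right) = \left(1 - \frac{n}{N}\right) \frac{s_y^2}{n} = 120.1$

Confidence interval: $CI_{95\%} = [2.5; 45.5]$

1.2 Ratio Estimator

Ratio estimator: $\hat{\bar Y}_R = \frac{\bar y}{\bar x}\,\bar X = 30$

Variance estimator: $\widehat{\mathrm{Var}}\left(\hat{\bar Y}_R\right) = \frac{1 - \frac{n}{N}}{n}\,\frac{1}{n - 1} \sum_{k \in s} \left(y_k - \frac{\bar y}{\bar x} x_k\right)^2 = 25.2$

Confidence interval: $CI_{95\%} = [20.2; 39.8]$

1.3 Regression Estimator

$$\hat{\bar Y}_{reg} = \bar y + \hat b \left(\bar X - \bar x\right)$$

with:

$$\hat b = \frac{s_{xy}}{s_x^2} = \frac{\sum_{k \in s} (x_k - \bar x)(y_k - \bar y)}{\sum_{k \in s} (x_k - \bar x)^2}, \qquad \hat a = \bar y - \hat b \bar x$$

Regression estimator: $\hat{\bar Y}_{reg} = 34.3$

Variance estimator: $\widehat{\mathrm{Var}}\left(\hat{\bar Y}_{reg}\right) = \frac{1 - \frac{n}{N}}{n}\,\frac{1}{n - 1} \sum_{k \in s} \left(y_k - \hat b x_k - \hat a\right)^2 = 5.5$

Confidence interval: $CI_{95\%} = [29.7; 38.8]$

Remark: on $U$, $\bar Y = 35$, $\tilde b = 10$, $\tilde a = -15$, $r^2 = 0.93$.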
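The three estimates and their variance estimators can be recomputed directly from the five sample values. The Python sketch below uses hypothetical helper names (`srs_var_of_mean`, `ci95`) and applies the same SRS variance formula to three different residuals.

```python
import math

N, n, X_bar = 11, 5, 5.0
x = [1, 2, 3, 5, 9]
y = [1, 4, 9, 25, 81]

x_bar, y_bar = sum(x) / n, sum(y) / n
f = n / N

def srs_var_of_mean(residuals):
    """(1 - f)/n * 1/(n - 1) * sum of squared residuals (SRS without replacement)."""
    return (1 - f) / n * sum(e * e for e in residuals) / (n - 1)

def ci95(est, var):
    """Normal-approximation 95% confidence interval."""
    half = 1.96 * math.sqrt(var)
    return (est - half, est + half)

# 1.1 Horvitz-Thompson: residuals are deviations from the sample mean
ht = y_bar
var_ht = srs_var_of_mean([yk - y_bar for yk in y])

# 1.2 Ratio estimator: residuals y_k - (y_bar / x_bar) x_k
r_hat = y_bar / x_bar
ratio = r_hat * X_bar
var_ratio = srs_var_of_mean([yk - r_hat * xk for xk, yk in zip(x, y)])

# 1.3 Regression estimator: residuals y_k - a_hat - b_hat x_k
sxx = sum((xk - x_bar) ** 2 for xk in x)
sxy = sum((xk - x_bar) * (yk - y_bar) for xk, yk in zip(x, y))
b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar
reg = y_bar + b_hat * (X_bar - x_bar)
var_reg = srs_var_of_mean([yk - a_hat - b_hat * xk for xk, yk in zip(x, y)])

for name, est, var in [("H-T", ht, var_ht), ("ratio", ratio, var_ratio),
                       ("regression", reg, var_reg)]:
    lo, hi = ci95(est, var)
    print(f"{name:10s} estimate {est:6.2f}  variance {var:6.1f}  CI95 [{lo:.1f}; {hi:.1f}]")
```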


2 Exercise 2

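2.1 First regression estimator

$$\hat b = \frac{s_{xy}}{s_x^2} = 0.19, \qquad \hat a = \bar y - \hat b \bar x = 3.83, \qquad \hat{\bar Y}_{reg1} = \hat a + \hat b \bar X = 26.3$$

$$\widehat{\mathrm{Var}}\left(\hat{\bar Y}_{reg1}\right) = (1 - f)\,\frac{s_y^2}{n} \left(1 - \hat\rho_{xy}^2\right) = 0.55\,(1 - f)\,\frac{s_y^2}{n}$$

The gain in precision (in terms of standard deviation) in comparison to an SRS is $1 - \sqrt{0.55} \approx 25\%$.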

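2.2 Second regression estimator

$$s_{zy} = \frac{n_1}{n} \left(\bar y_1 - \bar y\right), \qquad \bar z = \frac{n_1}{n}, \qquad \bar Z = \frac{N_1}{N}, \qquad s_z^2 = \frac{n_1 n_2}{n^2}.$$

$$\hat d = \frac{s_{zy}}{s_z^2} = \bar y_1 - \bar y_2, \qquad \hat c = \bar y - \hat d \bar z = \bar y_2$$

$$\hat{\bar Y}_{reg2} = \hat c + \hat d \bar Z = \bar y_2 + \left(\bar y_1 - \bar y_2\right) \frac{N_1}{N} = \frac{N_1}{N} \bar y_1 + \frac{N_2}{N} \bar y_2 = 26.06$$

This is the post-stratified estimator! It is interesting to see that the variable $x$ can be used in two different ways.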
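Both answers can be verified numerically from the summary statistics of the exercise. The Python sketch below (variable names are illustrative) also checks that the regression on the stratum indicator gives the same value as the post-stratified mean.

```python
N, n = 2010, 100

# 2.1 Regression on the total cultivated surface x
X_bar, x_bar, y_bar = 118.0, 132.0, 29.0
s2_x, s2_y, s_xy = 7619.0, 620.0, 1453.0

b_hat = s_xy / s2_x
a_hat = y_bar - b_hat * x_bar
y_reg1 = a_hat + b_hat * X_bar
one_minus_r2 = 1 - s_xy**2 / (s2_x * s2_y)    # variance ratio relative to SRS
var_reg1 = (1 - n / N) * s2_y / n * one_minus_r2

# 2.2 Regression on the stratum indicator z_k = 1 if k is in U_1
N_1, N_2 = 1580, 430
ybar_1, ybar_2 = 19.0, 52.0
d_hat = ybar_1 - ybar_2     # slope
c_hat = ybar_2              # intercept
y_reg2 = c_hat + d_hat * N_1 / N
y_post = (N_1 * ybar_1 + N_2 * ybar_2) / N    # post-stratified mean

print(f"b = {b_hat:.2f}, a = {a_hat:.2f}, Y_reg1 = {y_reg1:.1f}, 1 - r^2 = {one_minus_r2:.2f}")
print(f"Y_reg2 = {y_reg2:.2f}, post-stratified mean = {y_post:.2f}")
```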


Survey Methodology and Sampling Techniques Calibration techniques

Guillaume Chauvet, Eric Lesage

Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

This presentation is based on teaching documents by the CEPE (Insee)

Introduction
Principles of Calibration Methods
The problem
Theoretical Solution
Usual distance functions
Properties of calibrated estimators
Comparison of calibration methods
The CALMAR SAS macro
Parameters for Input SAS tables
Parameters for the calibration method
Parameters for Output SAS tables
A short example

Introduction

Adjusting a sample means giving the sample units appropriate weights, so that the estimates are consistent with the auxiliary information known over the whole population.

Examples of auxiliary information:

• distribution of individuals by sex, age or socio-professional category,

• global income for firms in a branch of industry.

Adjusting = Reweighting

Calibration techniques consist in reweighting the sample units through a modification of the sampling weights, so that

• the estimators of totals for quantitative variables exactly match the true totals over the whole population,

• the estimated counts for the categories of qualitative variables exactly match the true counts.

Reweighting → gain in accuracy.

For example, the SAS macro CALMAR2 performs calibration and generalized calibration (see Deville, Särndal and Sautory 1993, Le Guennec and Sautory 2002).

Raking Ratio on two categorical variables

x: socio-professional category; y: age.

x \ y          15-24 years   25-34 years   35-44 years   Margins
Farmers                                                  N_{1+}
Independent                  N_{ij}                      N_{i+}
...                                                      N_{I+}
Margins        N_{+1}        N_{+j}        N_{+J}

The sample is calibrated on the margins of auxiliary variables; the total numbers Ni+,N+j are assumed to be known and used as auxiliary information. This particular calibration technique is called Raking Ratio.

The calibration is not performed on the cross numbers Nij, which may be partially unknown.

More generally, we will speak of Raking Ratio whenever the calibration is performed on the counts of any set of categorical variables.

Calibration equations: an example

Assume that a sample of 6 units is drawn by SRS from a population $U$ of size 60. $x$ and $y$ are two categorical variables, each with two categories, denoted by $X_1, X_2$ and $Y_1, Y_2$ respectively. The corresponding counts are known over $U$.

Sample data

Individual   x     y     Weight d_k
1            X1    Y1    10
2            X1    Y2    10
3            X1    Y2    10
4            X2    Y1    10
5            X2    Y1    10
6            X2    Y2    10

                    Y1    Y2    Estimated margins   Exact margins
X1                  10    20    30                  20
X2                  20    10    30                  40
Estimated margins   30    30    60
Exact margins       40    20

We look for new weights $w_k$ that satisfy the calibration equations

$$\sum_{k:\,x_k = X_1} w_k = N_{1+} = 20, \qquad \sum_{k:\,x_k = X_2} w_k = N_{2+} = 40,$$

$$\sum_{k:\,y_k = Y_1} w_k = N_{+1} = 40, \qquad \sum_{k:\,y_k = Y_2} w_k = N_{+2} = 20.$$
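One way to obtain weights satisfying these four equations is the Raking Ratio procedure (iterative proportional fitting), which rescales the weights to each set of margins in turn. The Python sketch below applies it to the six sample units above; the function name `rake`, the tolerance and the iteration limit are choices made for the illustration, not part of the slides.

```python
def rake(x_cat, y_cat, d, row_margins, col_margins, tol=1e-10, max_iter=1000):
    """Raking ratio (iterative proportional fitting) on two categorical variables."""
    w = list(d)
    for _ in range(max_iter):
        # Rescale the weights to the margins of x, then to those of y
        for cats, margins in ((x_cat, row_margins), (y_cat, col_margins)):
            totals = {c: 0.0 for c in margins}
            for c, wk in zip(cats, w):
                totals[c] += wk
            w = [wk * margins[c] / totals[c] for c, wk in zip(cats, w)]
        # After the y step the y margins hold exactly; check the x margins
        totals = {c: 0.0 for c in row_margins}
        for c, wk in zip(x_cat, w):
            totals[c] += wk
        if max(abs(totals[c] - row_margins[c]) for c in row_margins) < tol:
            break
    return w

x_cat = ["X1", "X1", "X1", "X2", "X2", "X2"]
y_cat = ["Y1", "Y2", "Y2", "Y1", "Y1", "Y2"]
d = [10.0] * 6

w = rake(x_cat, y_cat, d, {"X1": 20, "X2": 40}, {"Y1": 40, "Y2": 20})
print([round(wk, 3) for wk in w])
```

Units sharing the same pair of categories receive the same final weight, and all weights stay positive, which is a known property of this method.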

Principles of Calibration Methods

The problem

Population $U = \{1, \dots, N\}$ in which a sample $s$ of size $n$ is drawn, with inclusion probability $\pi_k = P(k \in s)$ for unit $k$. The interest variable is denoted by $y$. An estimate of the total

$$Y = \sum_{k \in U} y_k$$

is given by the direct Horvitz-Thompson estimator

$$\hat Y_\pi = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in s} d_k y_k$$

where $d_k = 1/\pi_k$ is the sampling weight for unit $k$.

The issue

Auxiliary information

Assume that, for $J$ auxiliary variables $x_1, \dots, x_J$, the respective totals $X_j = \sum_{k \in U} x_{jk}$ over the whole population are known.

If the auxiliary information is related to categorical variables, the counts for each category of these variables are known (that is, the totals of the dummy variables associated with these categories).

The goal is to make use of this auxiliary information to improve on the Horvitz-Thompson estimator. We seek a new estimator of $Y$, denoted by

$$\hat Y_w = \sum_{k \in s} w_k y_k$$

where the new weights $w_k$ have to:

• be close to the design weights $d_k = 1/\pi_k$,

• satisfy the calibration equations $\sum_{k \in s} w_k x_{jk} = X_j$, $j = 1, \dots, J$.

Theoretical solution

We choose a distance function $G$ such that $G(w_k/d_k)$ measures the distance between the initial weight $d_k$ and the final weight $w_k$. We further assume that

• $G(1) = 0$,

• $G$ is non-negative and convex (the further $w_k/d_k$ is from 1, the higher $G(w_k/d_k)$).

The weights $w_k$ are given by the solution of the optimization problem

$$\min_{w_k} \sum_{k \in s} d_k\,G(w_k/d_k) \quad \text{subject to} \quad \sum_{k \in s} w_k x_k = X$$

with $x_k = (x_{1k}, \dots, x_{Jk})'$ and $X = (X_1, \dots, X_J)'$.

Solution

The Lagrangian is

$$\mathcal{L} = \sum_{k \in s} d_k\,G(w_k/d_k) - \lambda' \left(\sum_{k \in s} w_k x_k - X\right)$$

where $\lambda = (\lambda_1, \dots, \lambda_J)'$ is a vector of Lagrange multipliers.

The first-order conditions give $G'(w_k/d_k) = x_k' \lambda$, hence

$$w_k = d_k\,F(x_k' \lambda)$$

where $F$ is the inverse function of the derivative $G'$ of $G$.

$\lambda$ may be computed by solving the non-linear system of calibration equations

$$\sum_{k \in s} d_k\,F(x_k' \lambda)\,x_k = X.$$

This system may be solved by the iterative Newton-Raphson method.

Convergence is obtained when

$$\max_{k \in s} \left| \frac{w_k^{(i+1)}}{d_k} - \frac{w_k^{(i)}}{d_k} \right| < \varepsilon$$

Usual distance functions

We write $G(r)$ (where $r = w_k/d_k$) and $F(u)$ (where $u = x_k' \lambda$).

Linear method

$$G(r) = \frac{1}{2} (r - 1)^2, \qquad F(u) = 1 + u.$$

Convergence is obtained at the 2nd step of the Newton algorithm and

$$\hat Y_w = \sum_{k \in s} w_k y_k = \hat Y_\pi + \sum_{j=1}^{J} \hat b_j \left(X_j - \hat X_{j\pi}\right)$$

where $\hat b_1, \dots, \hat b_J$ are the coefficients of a (weighted) regression of $y$ on the auxiliary variables $x_1, \dots, x_J$ over the sample $s$.

⇒ $\hat Y_w$ is the generalized regression estimator of the total $Y$.
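For the linear method the calibration equations are linear in $\lambda$, so the weights have a closed form: $w_k = d_k (1 + x_k'\lambda)$ with $\lambda = \hat T^{-1}(X - \hat X_\pi)$ and $\hat T = \sum_s d_k x_k x_k'$. The Python sketch below applies this to the 6-unit example, coding the margins with the three dummy variables $(1_{X_1}, 1_{X_2}, 1_{Y_1})$; dropping the $Y_2$ dummy is a choice made here to avoid a singular system, and the helper names `solve` and `linear_calibration` are hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting (small systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                factor = M[r][i] / M[i][i]
                M[r] = [mr - factor * mi for mr, mi in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def linear_calibration(x, d, X):
    """Linear method: w_k = d_k (1 + x_k' lam), with lam = T^{-1} (X - X_pi)."""
    J = len(X)
    T = [[sum(dk * xk[i] * xk[j] for dk, xk in zip(d, x)) for j in range(J)]
         for i in range(J)]
    X_pi = [sum(dk * xk[j] for dk, xk in zip(d, x)) for j in range(J)]
    lam = solve(T, [Xj - Xpj for Xj, Xpj in zip(X, X_pi)])
    return [dk * (1 + sum(l * v for l, v in zip(lam, xk))) for dk, xk in zip(d, x)]

# Dummy coding (X1, X2, Y1) of the six sample units
x = [(1, 0, 1), (1, 0, 0), (1, 0, 0), (0, 1, 1), (0, 1, 1), (0, 1, 0)]
d = [10.0] * 6
X = [20.0, 40.0, 40.0]   # known counts N_1+, N_2+, N_+1

w = linear_calibration(x, d, X)
print([round(wk, 6) for wk in w])
```

Because the $Y_2$ count is the population size minus the $Y_1$ count, the fourth calibration equation is satisfied automatically.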

The raking ratio method

$$G(r) = r \log(r) - r + 1, \qquad F(u) = \exp(u).$$

In the case of calibration on categorical variables only, it may be shown that this distance function gives results similar to those of the Raking Ratio method presented earlier (which is also called Iterative Proportional Fitting).

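The "logit" method

$$G(r) = \begin{cases} \dfrac{1}{A} \left[ (r - L) \log\dfrac{r - L}{1 - L} + (U - r) \log\dfrac{U - r}{U - 1} \right] & \text{if } L < r < U \\[2ex] \infty & \text{otherwise} \end{cases}$$

$$F(u) = \frac{L (U - 1) + U (1 - L) \exp(A u)}{U - 1 + (1 - L) \exp(A u)} \in [L, U], \qquad \text{with } A = \frac{U - L}{(1 - L)(U - 1)}$$

This is a bounded raking ratio method: the weight ratios $w_k/d_k$ lie between $L$ ($< 1$) and $U$ ($> 1$).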

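The bounded linear method

$$G(r) = \begin{cases} \dfrac{1}{2} (r - 1)^2 & \text{if } L < r < U \\[1ex] \infty & \text{otherwise} \end{cases}$$

$$F(u) = 1 + u \in [L, U]$$

(values of $1 + u$ outside the interval are truncated to the bounds). This is also a bounded method: the weight ratios $w_k/d_k$ lie between $L$ ($< 1$) and $U$ ($> 1$).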
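The four calibration functions $F$ can be written compactly in Python, as in the sketch below. The constant $A = (U - L)/((1 - L)(U - 1))$ in the logit function and the truncation used for the bounded linear method are stated here as assumptions, following the usual presentation of these methods; the function names are hypothetical.

```python
import math

def F_linear(u):
    """Linear method: F(u) = 1 + u (weights may become negative)."""
    return 1.0 + u

def F_raking(u):
    """Raking ratio method: F(u) = exp(u) (weights stay positive)."""
    return math.exp(u)

def F_logit(u, L, U):
    """Logit method, bounded between L and U; A is assumed as in the lead-in."""
    A = (U - L) / ((1.0 - L) * (U - 1.0))
    e = math.exp(A * u)
    return (L * (U - 1.0) + U * (1.0 - L) * e) / (U - 1.0 + (1.0 - L) * e)

def F_bounded_linear(u, L, U):
    """Linear method truncated to the interval [L, U]."""
    return min(U, max(L, 1.0 + u))

L, U = 0.5, 2.0
for u in (-5.0, -1.0, 0.0, 0.3, 5.0):
    print(u, F_linear(u), F_raking(u), F_logit(u, L, U), F_bounded_linear(u, L, U))
```

All four functions equal 1 at $u = 0$, so the calibrated weights coincide with the design weights when the H-T estimates already match the known totals.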


Properties of calibrated estimators

Expectation

For any distance function, the calibrated estimator Yˆw is approximately design unbiased.

Variance

For any distance function, the variance is similar to that of the generalized regression estimator. This variance is determined by the residuals of a regression of the interest variable $y$ on the auxiliary variables $x_1, \dots, x_J$.

More precisely, recall that the variance of the Horvitz-Thompson estimator $\hat Y_\pi$ is

$$\sum_{k \in U} \sum_{l \in U} \Delta_{kl}\,(d_k y_k)(d_l y_l)$$

where $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$, $\pi_{kl}$ is the second-order inclusion probability, and $d_k$ is the sampling weight.

The variance of the calibrated estimator $\hat Y_w$ approximately equals

$$\sum_{k \in U} \sum_{l \in U} \Delta_{kl}\,(d_k E_k)(d_l E_l)$$

where $E_k = y_k - x_k' B$ is the residual of the regression of $y$ on the variables $x_1, \dots, x_J$ over the population $U$.

Variance estimation

Two variance estimators may be alternatively used

X X ∆kl Vˆ1(Yˆw) = (gkdkek)(gldlel) πkl k∈s l∈s

X X ∆kl Vˆ2(Yˆw) = (dkek)(dlel) πkl k∈s l∈s 0 ˆ where ek = yk − xk B is the residual of the weighted regression (with weights dk) of y on variables x1, . . . , xp over the sample s, and gk = wk/dk.

Standard software that estimates the variance of total estimators $\hat{Y}_\pi$ can also be used to estimate the variance of calibrated estimators $\hat{Y}_w$, as follows:

• On the sample s, perform the weighted regression (with weights $d_k$) of the variable y on the auxiliary variables $x_1, \dots, x_p$,

• Take the residuals $e_k$ of this regression and compute $g_k e_k$, where $g_k = w_k / d_k$,

• Run the software with the $y_k$ replaced by the $g_k e_k$.
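The three steps above can be sketched in a few lines. This is only an illustration under simplifying assumptions: a single auxiliary variable plus an intercept, so that the weighted regression has a closed form. The function name and the data are hypothetical.

```python
def linearized_variable(y, x, d, w):
    """Return g_k * e_k for every sample unit.

    y: study variable, x: one auxiliary variable (an intercept is added),
    d: design weights d_k, w: calibrated weights w_k.
    """
    sum_d = sum(d)
    # Weighted means of x and y (weights d_k).
    mean_x = sum(dk * xk for dk, xk in zip(d, x)) / sum_d
    mean_y = sum(dk * yk for dk, yk in zip(d, y)) / sum_d
    # Weighted least squares slope and intercept.
    sxx = sum(dk * (xk - mean_x) ** 2 for dk, xk in zip(d, x))
    sxy = sum(dk * (xk - mean_x) * (yk - mean_y)
              for dk, xk, yk in zip(d, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals e_k multiplied by the ratios g_k = w_k / d_k.
    return [(wk / dk) * (yk - intercept - slope * xk)
            for yk, xk, dk, wk in zip(y, x, d, w)]


if __name__ == "__main__":
    y = [1.0, 4.0, 2.0, 7.0]
    x = [1.0, 2.0, 3.0, 4.0]
    d = [10.0, 20.0, 10.0, 5.0]   # design weights (illustrative)
    w = [12.0, 18.0, 11.0, 6.0]   # calibrated weights (illustrative)
    print(linearized_variable(y, x, d, w))
```

The returned values are what would be passed to the variance software in place of the $y_k$. With several auxiliary variables the same idea applies, with a general weighted least squares fit.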

Comparison of calibration methods

Linear method

• The fastest method (only two iterations),

• May lead to negative weights,

• Weights are not bounded.

Raking ratio method

• Always leads to non-negative weights,

• No upper bound for the calibrated weights (they are usually greater than the weights given by the linear method).

Logit method and bounded linear method

• Give lower and upper bounds L and U for the ratios $w_k / d_k$.

Note that not every value L < 1 (respectively U > 1) can be used. There is a maximal value $L_{max} < 1$ (respectively a minimal value $U_{min} > 1$), to be found by successive trials.

Several criteria may be used to select a method:

• Lower dispersion of the weights,

• Smaller range of the weights,

• Distribution of the weights.

The CALMAR SAS macro

Parameters for input SAS tables

DATAMEN = name of the SAS table with the sample data

• Observations: sample units,

• Variables: calibration variables, identifying variable, initial weight.

MARMEN = name of the SAS table with the auxiliary information

• Observations: calibration variables,

• Variables: variable name, number of categories, associated margins.

POIDS = variable

Numerical variable holding the initial weights of the sample units.

PONDQK = variable

Numerical variable holding weights for the sample units that are not related to POIDS.

IDENT = variable

Identifying variable for the sample units.

PCT = OUI or NON (yes or no)

If PCT=OUI, the margins of the categorical variables in the table MARMEN are given in percent.

EFFTOT = value

Total number of units in the population (required if PCT=OUI).

Parameters for the calibration method

M = 1, 2, 3 or 4

Distance function:

• 1: Linear method

• 2: Raking ratio method

• 3: Logit method

• 4: Bounded linear method

LO = value

Lower bound for the ratios of weights (required if M=3 or 4).

UP = value

Upper bound for the ratios of weights (required if M=3 or 4).

SEUIL = value

Threshold for stopping the Newton algorithm (optional).

MAXITER = integer value

Maximum number of iterations of the Newton algorithm (optional).

Parameters for output SAS tables

DATAPOI = name of the SAS table with the final weights

• Observations: sample units that were not deleted,

• Variables: identifying variable, final weight.

MISAJOUR = OUI or NON

Specifies how the output table is handled:

• If MISAJOUR=OUI, the variable with the calibrated weights and the identifying variable are added to the table,

• If MISAJOUR=NON, a new SAS table is created with the calibrated weights and the identifying variable. The former SAS table is removed.

POIDSFIN = variable

Name of the variable holding the calibrated weights.

LABELPOI = label

Label to be given to the variable holding the final weights.

OBSELI = OUI or NON

If OBSELI=OUI, creates the SAS table OBSELI containing, for each unit removed from the original sample, the identifying variable, the calibration variables and the initial weight.

A short example


Bibliography

Deville, J.-C., Särndal, C.-E., and Sautory, O. (1993). Generalized raking procedures in survey sampling, Journal of the American Statistical Association, 88, 423, 1013-1020.

Sautory, O. (2007). Les méthodes de calage, support de cours CEPE, Insee.

Sautory, O., and Le Guennec, J. (2003). La macro CALMAR2 : Redressement d'un échantillon par calage sur marges, document de travail, Insee.

Federal Department of Home Affairs FDHA Federal Statistical Office FSO

Survey methodology and sampling techniques Exercises

Calibration

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

Exercise 1

Exercise 1

A survey is carried out among couples. Using the Horvitz-Thompson estimator (individual weights: $d_k = 1/\pi_k$) gives the following results:

• Estimated number of couples without any worker: $\hat{N}_0 = 3\,000$

• Estimated number of couples with only one worker: $\hat{N}_1 = 6\,000$

• Estimated number of couples with two workers: $\hat{N}_2 = 2\,000$

The following auxiliary information is available:

• N = 10 000, number of couples in the population,

• Z = 12 000, number of workers in the population.

The goal is to improve on the Horvitz-Thompson estimator by calibrating on N and Z.

1. Identify the two variables involved in the calibration.

Let $\lambda = (a, b)'$ be the vector of Lagrange multipliers, and $z_k$ the number of workers in couple k. Give the ratio of weights as a function of a, b and $z_k$. Write the two calibration equations for a general function F, and show that these equations depend on the initial weights only through $\hat{N}_0$, $\hat{N}_1$ and $\hat{N}_2$.

2. We choose the linear method (regression estimation). Give the solution of the calibration equations and the values of the ratios of weights.

3. We choose the raking ratio method. Give the calibration equations (with $\alpha = e^a$ and $\beta = e^b$).

4. The solutions of these calibration equations are $\alpha = 0.456$ and $\beta = 1.921$. Give the values of the ratios of weights.

Now, the goal is to estimate the number of couples with children. A Horvitz-Thompson estimation gives:

• $\hat{Y}_0 = 2\,000$, number of couples without any worker, with children,

• $\hat{Y}_1 = 2\,000$, number of "one worker" couples with children,

• $\hat{Y}_2 = 500$, number of "two workers" couples with children.

5. Give an estimate of the number of couples with children using: the Horvitz-Thompson estimator, the calibrated estimator (linear method), the calibrated estimator (raking ratio method).

Federal Department of Home Affairs FDHA Federal Statistical Office FSO

Survey methodology and sampling techniques Answers to the Exercises

Calibration

Guillaume Chauvet, Eric Lesage
Institut National de la Statistique et des Études Économiques

European Statistical Training Program 15 - 18 April 2008

Contents

Solution of Exercise 1

Solution of Exercise 1

1. Calibration variables

The calibration is performed on the totals N = number of couples and Z = number of workers. For couple k, the calibration variables take the values $x_k = 1$ (for every k) and $z_k$ = number of workers in couple k.

Ratio of weights for couple k:

\[
\frac{w_k}{d_k} = F(a x_k + b z_k) = F(a + b z_k).
\]

The calibration equations are:

\[
\sum_{k \in S} d_k F(a x_k + b z_k)\, x_k = N
\quad \text{and} \quad
\sum_{k \in S} d_k F(a x_k + b z_k)\, z_k = Z
\]

or:

\[
\sum_{k \in S} d_k F(a + b z_k) = 10\,000
\quad \text{and} \quad
\sum_{k \in S} d_k F(a + b z_k)\, z_k = 12\,000
\]

$x_k$ and $z_k$ are numerical variables. Yet $z_k$ takes only three values: 0, 1, 2. Let $S_0$ be the sub-sample of couples without any worker, $S_1$ the sub-sample of couples with only one worker, and $S_2$ the sub-sample of couples with two workers. The calibration equations may be written:

\[
\sum_{k \in S_0} d_k F(a) + \sum_{k \in S_1} d_k F(a + b) + \sum_{k \in S_2} d_k F(a + 2b) = 10\,000
\]

\[
\sum_{k \in S_1} d_k F(a + b) + 2 \sum_{k \in S_2} d_k F(a + 2b) = 12\,000
\]

which gives:

\[
\hat{N}_0 F(a) + \hat{N}_1 F(a + b) + \hat{N}_2 F(a + 2b) = 10\,000
\]

\[
\hat{N}_1 F(a + b) + 2 \hat{N}_2 F(a + 2b) = 12\,000
\]

since $\sum_{k \in S_0} d_k = \hat{N}_0$, and so on (the sum of the initial weights in a domain equals the Horvitz-Thompson estimator of the domain size). Replacing the estimators by their values and dividing by 1 000, we get:

\[
3F(a) + 6F(a + b) + 2F(a + 2b) = 10
\]

\[
6F(a + b) + 4F(a + 2b) = 12
\]

2. The linear method gives F(u) = u + 1. The calibration equations become:

\[
3(a + 1) + 6(a + b + 1) + 2(a + 2b + 1) = 10
\]

\[
6(a + b + 1) + 4(a + 2b + 1) = 12
\]

or:

\[
\begin{cases}
11a + 10b + 1 = 0 \\
10a + 14b - 2 = 0
\end{cases}
\]

We get:

\[
a = -\frac{17}{27} \simeq -0.63 \quad \text{and} \quad b = \frac{16}{27} \simeq 0.593.
\]

Ratio of weights: $w_k / d_k = F(a + b z_k) = 1 + a + b z_k$. The ratios of weights are the same for all units that share the same values of the calibration variables. Here the first calibration variable ($x_k$) is constant, so $g_k = w_k / d_k$ is a function of $z_k$ only.

z_k   g_k
0     a + 1 ≃ 0.37
1     a + b + 1 ≃ 0.96
2     a + 2b + 1 ≃ 1.55

3. The raking ratio method gives $F(u) = e^u$.

The calibration equations become:

\[
3e^{a} + 6e^{a+b} + 2e^{a+2b} = 10
\]

\[
6e^{a+b} + 4e^{a+2b} = 12
\]

With $\alpha = e^a$ and $\beta = e^b$, we have:

\[
3\alpha + 6\alpha\beta + 2\alpha\beta^2 = 10
\]

\[
6\alpha\beta + 4\alpha\beta^2 = 12
\]

4. Ratios of weights: $g_k = w_k / d_k = e^{a + b z_k} = \alpha \beta^{z_k}$.

z_k   g_k
0     0.456
1     0.876
2     1.682

5. Horvitz-Thompson estimator: $\hat{Y} = \hat{Y}_0 + \hat{Y}_1 + \hat{Y}_2 = 4\,500$.

Calibrated estimator (writing $g_0$, $g_1$, $g_2$ for the ratio of weights in $S_0$, $S_1$, $S_2$):

\[
\begin{aligned}
\hat{Y}_{cal} = \sum_{k \in S} w_k y_k
&= \sum_{k \in S_0} w_k y_k + \sum_{k \in S_1} w_k y_k + \sum_{k \in S_2} w_k y_k \\
&= \sum_{k \in S_0} g_0 d_k y_k + \sum_{k \in S_1} g_1 d_k y_k + \sum_{k \in S_2} g_2 d_k y_k \\
&= g_0 \sum_{k \in S_0} d_k y_k + g_1 \sum_{k \in S_1} d_k y_k + g_2 \sum_{k \in S_2} d_k y_k \\
&= g_0 \hat{Y}_0 + g_1 \hat{Y}_1 + g_2 \hat{Y}_2
\end{aligned}
\]

With the linear method:

\[
\hat{Y}_{cal} = 0.37 \times 2\,000 + 0.96 \times 2\,000 + 1.55 \times 500 = 3\,435.
\]

With the raking ratio method:

\[
\hat{Y}_{cal} = 0.456 \times 2\,000 + 0.876 \times 2\,000 + 1.682 \times 500 = 3\,505.
\]
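The two calibration systems of this exercise can also be solved numerically. The sketch below applies Newton iterations to the two equations in (a, b) for a given function F. It is a hypothetical helper written for this exercise, not the CALMAR implementation. Because it keeps the ratios of weights unrounded, the estimates it prints (about 3 444 for the linear method and 3 509 for the raking ratio method) differ slightly from the values above, which use ratios rounded to two or three decimals.

```python
import math


def solve_calibration(n_hat, z, totals, F, dF, tol=1e-9, max_iter=50):
    """Newton iterations for the two calibration equations of the exercise.

    n_hat[j]: Horvitz-Thompson count of couples with z[j] workers,
    totals: (N, Z), F and dF: calibration function and its derivative.
    Returns the Lagrange multipliers (a, b).
    """
    a = b = 0.0
    for _ in range(max_iter):
        # Residuals of the two equations and entries of the Jacobian.
        r1, r2 = -totals[0], -totals[1]
        j11 = j12 = j22 = 0.0
        for n_j, z_j in zip(n_hat, z):
            u = a + b * z_j
            r1 += n_j * F(u)
            r2 += n_j * F(u) * z_j
            j11 += n_j * dF(u)
            j12 += n_j * dF(u) * z_j
            j22 += n_j * dF(u) * z_j ** 2
        if max(abs(r1), abs(r2)) < tol:
            break
        # Newton step: solve the 2x2 linear system by Cramer's rule.
        det = j11 * j22 - j12 ** 2
        a -= (j22 * r1 - j12 * r2) / det
        b -= (j11 * r2 - j12 * r1) / det
    return a, b


n_hat = [3000.0, 6000.0, 2000.0]   # N0, N1, N2 estimates
z = [0, 1, 2]                      # number of workers in each group
y_hat = [2000.0, 2000.0, 500.0]    # Y0, Y1, Y2 estimates
totals = (10000.0, 12000.0)        # known totals N and Z

methods = {
    "linear": (lambda u: 1.0 + u, lambda u: 1.0),
    "raking": (math.exp, math.exp),
}
results = {}
for name, (F, dF) in methods.items():
    a, b = solve_calibration(n_hat, z, totals, F, dF)
    g = [F(a + b * z_j) for z_j in z]
    estimate = sum(g_j * y_j for g_j, y_j in zip(g, y_hat))
    results[name] = (a, b, g, estimate)
    print(name, [round(g_j, 3) for g_j in g], round(estimate))
```

For the linear method a single Newton step already gives the exact solution a = -17/27, b = 16/27, since the equations are linear in (a, b).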


Federal Department of Home Affairs FDHA Federal Statistical Office FSO

Survey Methodology and Sampling Techniques Introduction to the Problem and Treatment of Non-response

Paul-André Salamin, Jean-Pierre Renfer
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008

Contents

Definition

Bias

Measures Against Non-response

Reweighting

Imputation

Variance Estimation

Definition

Non-response: failure to obtain a measurement on one or more study variables for one or more elements selected for the survey.

ID   y1, y2, y3               Type
1    all three observed       Complete response
2    none observed            Unit non-response
3    two of three observed    Item non-response

Note:

• Non-response is present in most surveys.

• The definition of non-response types is essential.

• Non-response rates have to be defined carefully.

• It is a major problem, as non-response usually introduces selection bias.

• Effective measures against non-response (prevention) are important.

• Special extrapolation procedures are needed (mainly weighting adjustment and imputation).

Bias in Results Based only on Respondents

Illustration in a simple case.

Population $U = U_r \cup U_{nr}$ (respondents and non-respondents) of size $N = N_r + N_{nr}$.

Population mean: $\bar{Y} = (N_r \bar{Y}_r + N_{nr} \bar{Y}_{nr}) / N$

Let $\hat{\bar{Y}}_r$ be an unbiased estimator of the respondent mean $\bar{Y}_r$. Then

\[
\text{Bias} = E(\hat{\bar{Y}}_r) - \bar{Y} = \bar{Y}_r - \bar{Y} = \left(1 - \frac{N_r}{N}\right)(\bar{Y}_r - \bar{Y}_{nr})
\]

Note: the variance estimate may also be biased.
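A small numerical illustration of the bias formula: the bias grows with the non-response rate and with the difference between respondents and non-respondents. The function name and the figures are invented for the illustration.

```python
def respondent_mean_bias(n_resp, n_nonresp, mean_resp, mean_nonresp):
    """Bias of the respondent mean as an estimator of the population mean.

    Implements (1 - N_r / N) * (Ybar_r - Ybar_nr).
    """
    n_total = n_resp + n_nonresp
    return (1.0 - n_resp / n_total) * (mean_resp - mean_nonresp)


if __name__ == "__main__":
    # Hypothetical population: 700 respondents with mean 50,
    # 300 non-respondents with mean 40.
    bias = respondent_mean_bias(700, 300, 50.0, 40.0)
    population_mean = (700 * 50.0 + 300 * 40.0) / 1000
    print(bias, 50.0 - population_mean)
```

In this example the population mean is 47, so the respondent mean of 50 overestimates it by 3, which is 30% (the non-response rate) of the difference of 10 between the two groups.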

Measures Against Non-response

The quality of survey data is largely determined at the design stage:

• Survey content.

• Time of survey.

• Interviewers (training) and survey introduction (letter, first contact).

• Data collection method (mail, telephone/CATI, face-to-face/CAPI).

• Questionnaire design.

• Respondent burden (coordination, etc.).

• Incentives (small gift) and disincentives.

• Follow-up (callbacks).

Methods

Reweighting for unit non-response:

• Modelling of the response probability.

• Use of auxiliary information.

• Adjustment of the respondents' weights.

Imputation for item non-response:

• Assignment of values to missing items.

• Use of values available for respondents or of auxiliary information.

Model for the Non-response

Population → Sample → Respondents
U → s → r, where the first step is governed by the inclusion probabilities $\pi_k$ and the second by the response probabilities $\theta_{s,k}$.

$\pi_k = \sum_{s \ni k} p(s) = \Pr(k \in s)$: known.

$\theta_{s,k} = \Pr(k \text{ responds} \mid s \text{ selected})$: unknown.

If $\theta_{s,k}$ were known, Y could be estimated without bias by:

\[
\hat{Y} = \sum_{k \in r} \frac{1}{\pi_k} \frac{1}{\theta_{s,k}}\, y_k = \sum_{k \in r} d_k g_k y_k
\]

But $\theta_{s,k}$ has to be estimated:

\[
\hat{Y} = \sum_{k \in r} \frac{1}{\pi_k} \frac{1}{\hat{\theta}_{s,k}}\, y_k = \sum_{k \in r} d_k \hat{g}_k y_k
\]

Design weight: $d_k = 1/\pi_k$.

g-weight: $\hat{g}_k = 1/\hat{\theta}_{s,k}$.

Note: a bias may occur if the response behavior is not well modelled.

Mechanisms for Non-response

Let $y_k$ be the variable of interest, known for k ∈ r. Let $x_k$ be a vector of auxiliary information, known for k ∈ s.

1. Missing completely at random: $\theta_k$ does not depend on $x_k$, $y_k$ or s. No bias, but a loss of precision.

2. Missing at random given covariates, or ignorable non-response: $\theta_k$ depends on $x_k$ but not on $y_k$. $\theta_k$ can be modelled successfully.

3. Non-ignorable non-response: $\theta_k$ depends on $y_k$ and cannot be fully explained by $x_k$.

In practice, ignorable non-response is assumed.

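Errors Caused by Sampling and Non-response

Design: p(s).
Unknown response mechanism: q(r|s).
Target population parameter: Y.
Non-response estimator (after adjustment): $\hat{Y}_{nr}$.
Full-response estimator: $\hat{Y}$.

Decomposition into sampling error and non-response error:

Error of $\hat{Y}_{nr}$:
\[
\hat{Y}_{nr} - Y = (\hat{Y} - Y) + (\hat{Y}_{nr} - \hat{Y})
\]

Bias of $\hat{Y}_{nr}$:
\[
B_{pq}(\hat{Y}_{nr}) = [E_p(\hat{Y}) - Y] + [E_{pq}(\hat{Y}_{nr}) - E_p(\hat{Y})]
\]

Variance of $\hat{Y}_{nr}$ (this form assumes $E_q(\hat{Y}_{nr} \mid s) = \hat{Y}$, that is, no conditional non-response bias):
\[
\mathrm{var}_{pq}(\hat{Y}_{nr}) = \mathrm{var}_p(\hat{Y}) + E_p\,\mathrm{var}_q(\hat{Y}_{nr} \mid s)
\]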

Reweighting

Calculation of an estimated $\hat{g}_k = 1/\hat{\theta}_k$.

Successive adjustments $\hat{g}_k = g_{1,k} \cdot g_{2,k}$:

1. adjustment for non-response based on a non-response model ($g_{1,k}$), using information about the sample,

2. possible adjustment by post-stratification or calibration on reference values ($g_{2,k}$), using information about the population.

or

A global "all in one" adjustment $\hat{g}_k$ with adequate reference variables at the population and/or sample level (calibration approach).

Adjustment for Non-response

Response homogeneity group model. The sample s of size n is partitioned into H groups (cells) $s_h$, h = 1, ..., H.

All elements in $s_h$ are assumed to have the same response probability $\theta_h$.

Let $n_h$ be the size of $s_h$, $r_h$ the set of respondents in $s_h$, and $m_h$ the number of respondents in h. We define $\hat{\theta}_k = m_h / n_h$ if k ∈ $s_h$. Therefore:

\[
\hat{Y} = \sum_{k \in r} d_k \hat{g}_k y_k
= \sum_h \sum_{k \in r_h} \frac{1}{\pi_k} \frac{1}{\hat{\theta}_k}\, y_k
= \sum_h \frac{n_h}{m_h} \sum_{k \in r_h} d_k y_k
\]

If s is a SRS of size n:

\[
\hat{Y} = \sum_{k \in r} d_k \hat{g}_k y_k = \frac{N}{n} \sum_h \frac{n_h}{m_h} \sum_{k \in r_h} y_k
\]

If s is a stratified sample whose strata are identical to the non-response groups:

\[
\hat{Y} = \sum_{k \in r} d_k \hat{g}_k y_k
= \sum_h \frac{N_h}{n_h} \frac{n_h}{m_h} \sum_{k \in r_h} y_k
= \sum_h \frac{N_h}{m_h} \sum_{k \in r_h} y_k
\]
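The response homogeneity group estimator translates directly into code. The sketch below assumes that each sample unit is described by a tuple (group, design weight, observed value or None); this data layout and the function name are hypothetical.

```python
from collections import defaultdict


def rhg_total(units):
    """Estimate a total under the response homogeneity group model.

    units: list of (group, d_k, y_k) for all sample units, with y_k set to
    None for non-respondents. Uses theta_hat = m_h / n_h within each group.
    """
    n_h = defaultdict(int)   # sample size per group
    m_h = defaultdict(int)   # number of respondents per group
    for group, _, y in units:
        n_h[group] += 1
        if y is not None:
            m_h[group] += 1
    total = 0.0
    for group, d, y in units:
        if y is not None:
            g_hat = n_h[group] / m_h[group]   # g-weight 1 / theta_hat
            total += d * g_hat * y
    return total


if __name__ == "__main__":
    # SRS of n = 10 from N = 100, so d_k = 10 for every unit (made-up data).
    sample = [("A", 10.0, 1.0), ("A", 10.0, 3.0),
              ("A", 10.0, None), ("A", 10.0, None),
              ("B", 10.0, 2.0), ("B", 10.0, 2.0), ("B", 10.0, 5.0),
              ("B", 10.0, None), ("B", 10.0, None), ("B", 10.0, None)]
    print(rhg_total(sample))
```

With these data the result agrees with the SRS formula: (100/10) × [(4/2) × 4 + (6/3) × 9] = 260.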

Construction of weighting groups:

• Response behavior should be homogeneous within groups, and response rates should vary between groups.

• The important variables can be determined, for example, using logistic regression or classification trees.

• The groups should not be too small (stability).

Notes:

• $\hat{\theta}_k = m_h / n_h$ may be replaced by other expressions, such as a ratio between sums of weights (unequal probability sampling).

• Direct estimation of $\theta_k$ using a model such as a logistic regression may be unstable.

Imputation

Treatment of item non-response.

• Deductive imputation (rules).

• Donor imputation, such as nearest neighbor or last value carried forward (cells, selection algorithm).

• Model-based imputation, such as cell mean or regression (auxiliary variables, model).

\[
y_k^{*} =
\begin{cases}
y_k & \text{if } k \in r_y \\
\hat{y}_k & \text{if } k \in s \setminus r_y
\end{cases}
\]

Remarks about Imputation

• Imputation creates a complete data set.

• Bias occurs if the non-response mechanism is not completely known.

• The distribution of variables and the relationships between variables may be distorted.

• Imputation methods have to be determined and tested.

• Flags should be defined so that the data analyst can distinguish between original and imputed values.

• The impact of the imputation must be evaluated.
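As an illustration of cell mean imputation with flags, the following sketch fills each missing value with the mean of the observed values in the same cell and records which values were imputed. The record layout, the field names and the data are hypothetical.

```python
from collections import defaultdict


def impute_cell_mean(records, cell_key, target):
    """Cell mean imputation with an 'imputed' flag.

    records: list of dicts, missing values of `target` are None.
    Returns new records. Raises ValueError if a cell has no observed value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        if rec[target] is not None:
            sums[rec[cell_key]] += rec[target]
            counts[rec[cell_key]] += 1
    completed = []
    for rec in records:
        new_rec = dict(rec)
        new_rec["imputed"] = rec[target] is None   # flag for the analyst
        if new_rec["imputed"]:
            cell = rec[cell_key]
            if counts[cell] == 0:
                raise ValueError(f"no observed value in cell {cell!r}")
            new_rec[target] = sums[cell] / counts[cell]
        completed.append(new_rec)
    return completed


if __name__ == "__main__":
    data = [{"id": 1, "cell": "a", "y": 10.0},
            {"id": 2, "cell": "a", "y": 20.0},
            {"id": 3, "cell": "a", "y": None},
            {"id": 4, "cell": "b", "y": 5.0},
            {"id": 5, "cell": "b", "y": None}]
    for row in impute_cell_mean(data, "cell", "y"):
        print(row)
```

The explicit error for an empty cell reflects the remark that groups should not be too small: a cell without any observed value cannot supply a mean.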

Variance Estimation (Reweighting)

Framework: calibration approach with unit non-response. The variance $\mathrm{var}(\hat{Y})$ of $\hat{Y}$ may be estimated by:

\[
\widehat{\mathrm{var}}(\hat{Y}) = \hat{V}_{SAM} + \hat{V}_{NR}
\]

where:

$\hat{V}_{SAM} = f(d_k, d_{k\ell}, g_k, e_k)$ for k, ℓ ∈ r

$\hat{V}_{NR} = f(d_k, g_k, e_k)$ for k ∈ r

and $e_k = f(d_k, g_k, y_k, x_k)$ is the residual from the calibration approach, $d_k = 1/\pi_k$ and $d_{k\ell} = 1/\pi_{k\ell}$. Here f denotes a generic function of its arguments.

Stratified sample with a non-response adjustment in homogeneous groups identical to the strata h = 1, ..., H.

\[
\hat{V}_{SAM} \approx \sum_h N_h^2 \left( \frac{1}{n_h} - \frac{1}{N_h} \right) S^2_{y, r_h}
\]

\[
\hat{V}_{NR} \approx \sum_h N_h^2 \left( \frac{1}{m_h} - \frac{1}{n_h} \right) S^2_{y, r_h}
\]

with $S^2_{y, r_h} = \sum_{k \in r_h} (y_k - \bar{y}_{r_h})^2 / (m_h - 1)$.

Approximation: $n_h/(n_h - 1) \approx 1$ and $m_h/(m_h - 1) \approx 1$.

Sizes: $N_h$ = population size in h, $n_h$ = sample size in h and $m_h$ = number of respondents in h.
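The two approximate variance components for the stratified case can be computed as follows. This is a minimal sketch of the formulas above: the input layout (one tuple per stratum) and the function name are hypothetical, and each stratum is assumed to have at least two respondents.

```python
def rhg_variance_components(strata):
    """Approximate sampling and non-response variance components.

    strata: list of (N_h, n_h, y_resp), where y_resp is the list of values
    observed for the m_h respondents of stratum h (m_h >= 2).
    Returns (V_SAM, V_NR).
    """
    v_sam = 0.0
    v_nr = 0.0
    for pop_size, sample_size, y_resp in strata:
        m = len(y_resp)
        mean = sum(y_resp) / m
        s2 = sum((v - mean) ** 2 for v in y_resp) / (m - 1)
        v_sam += pop_size ** 2 * (1.0 / sample_size - 1.0 / pop_size) * s2
        v_nr += pop_size ** 2 * (1.0 / m - 1.0 / sample_size) * s2
    return v_sam, v_nr


if __name__ == "__main__":
    # One stratum: N_h = 100, n_h = 10, three respondents (made-up values).
    v_sam, v_nr = rhg_variance_components([(100, 10, [1.0, 2.0, 3.0])])
    print(v_sam, v_nr, v_sam + v_nr)
```

The sum of the two components equals $\sum_h N_h^2 (1/m_h - 1/N_h) S^2_{y,r_h}$, the variance of a stratified sample of $m_h$ units per stratum.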

Variance Estimation (Imputation)

Decomposition into sampling and imputation variances.

or

Adjustment of the jackknife and bootstrap methods.

or

Estimation via multiple imputation:

• imputation m ≥ 2 times for each missing value,

• estimation of the variance due to the imputation using the m data sets.
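For multiple imputation, the m completed data sets are usually combined with Rubin's rules. The slides do not name a combining rule, so using Rubin's rules here is an assumption. The sketch takes the m point estimates and their estimated variances and returns the combined estimate with its total variance, that is, the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
def combine_multiple_imputations(estimates, variances):
    """Combine results from m >= 2 imputed data sets (Rubin's rules).

    estimates: point estimate from each completed data set,
    variances: estimated variance of each of these estimates.
    Returns (combined estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    # Between-imputation variance: extra variability due to imputation.
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


if __name__ == "__main__":
    # Three imputed data sets with made-up estimates and variances.
    print(combine_multiple_imputations([10.0, 12.0, 14.0], [4.0, 4.0, 4.0]))
```

Note that m here is the number of imputations, not the number of respondents $m_h$ used in the reweighting formulas.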

References

• Lohr, S.L. (1999) Sampling: design and analysis, Duxbury Press. Chapter 8.

• Särndal, C.-E., Swensson, B., and Wretman, J. (1997) Model assisted survey sampling, Springer series in statistics. Chapter 15.

• Särndal, C.-E. and Lundström, S. (2005) Estimation in surveys with non-response. John Wiley & Sons.

• Schafer, J.L. (2000) Analysis of Incomplete Multivariate Data. Chapman and Hall/CRC. New York.

EDIMBUS project: http://edimbus.istat.it/

Federal Department of Home Affairs FDHA Federal Statistical Office FSO

Survey methodology and sampling techniques Exercises

Introduction to the Problem and Treatment of Non-response

Jean-Pierre Renfer, Paul-André Salamin
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008

Contents

Exercise 1: Unit and item non-response

Exercise 1: Unit and item non-response

Not all tenants of the 30 flats in samp1, drawn from the population of 151 flats, answered the questionnaire. 11 tenants refused to answer at all, but the number of rooms of their flats is known. 4 of the 19 respondents did not fill in their rent amount.

1. Determine homogeneity groups and impute the cell mean rent for the item non-responses.

2. A SRS of size 6 is drawn from the 15 tenants with a missing rent. Their data are now known:

ID2  ROOMS  BALCONY  SURFACE  RENT
4    1      0        32       571
12   3      1        80       874
15   4      0        99       477
25   4      1        98       423
28   5      1        120      1'216
29   5      1        121      925

Conclusions?

Sample data (samp1); blank cells are missing values:

ID2  ROOMS  RESPONSE  BALCONY  SURFACE  RENT
1    1      1         0        35       322
2    1      1         0        43       300
3    1      1         0        37       363
4    1      1         0        32
5    1      1         0        35       297
6    2      1         0        40       378
7    2      1         0        70       575
8    2      1         1        100      484
9    3      1         0        79       399
10   3      1         0        79       651
11   3      1         0        87       656
12   3      1         1        80
13   3      1         1        105      967
14   3      1         1        80       575
15   4      1         0        99
16   4      1         1        100      647
17   4      1         1        89       420
18   5      1         1        129
19   6      1         1        250      2'644
20   1      0
21   2      0
22   3      0
23   4      0
24   4      0
25   4      0
26   4      0
27   5      0
28   5      0
29   5      0
30   6      0

Federal Department of Home Affairs FDHA Federal Statistical Office FSO

Survey methodology and sampling techniques Answers to the Exercises

Introduction to the Problem and Treatment of Non-response

Jean-Pierre Renfer, Paul-André Salamin
Statistical Methods Unit, Federal Statistical Office

European Statistical Training Program 15 - 18 April 2008

Contents

Exercise 1: Unit and item non-response

Exercise 1: Unit and item non-response

1. The variables BALCONY and ROOMS are used below to create homogeneity groups.

Imputed rent, by auxiliary variable(s) used to form the groups:

     Aux. variables      Imputed rent by auxiliary variable
ID2  ROOMS  BALCONY      BALCONY  ROOMS  BALCONY × ROOMS
4    1      0            555      362    362
12   3      1            880      701    801
15   4      0            555      656    899
18   5      1            880      998    998

Different imputation methods usually lead to different imputed values. For the observation ID2=4, the imputed cell mean is the same whether the homogeneous groups are built with BALCONY and ROOMS or with ROOMS only. This is because there are no one-room flats with a balcony among the respondents. The same holds for observation ID2=18, because all five-room flats among the respondents have a balcony. Such a situation might indicate a non-response mechanism that is not missing completely at random (NMCAR).

Results by imputation method:

                                    No          Cell mean imputation in homogeneity groups   Regression
                                    imputation  BALCONY   ROOMS   BALCONY × ROOMS            imputation
m                                   15          19        19      19                         19
$\bar{y}_r$                         645         660       652     670                        652
$\widehat{std}(\bar{y}_r)$          142.6       111.5     112.5   113.3                      116.2
$\widehat{CV}(\bar{y}_r)$ (in %)    22.1        16.9      17.2    16.9                       17.8

2. The new information could be used to build a new imputation model, assuming that the non-respondents form a homogeneous group. The validity of the model must always be checked, and the model must rely on sufficient data for consistent estimation of its parameters. Imputation based on such a model may reduce the NMCAR non-response bias considerably. In our case, there are not enough data to build a new imputation model. But the new data may be added to the respondents, which also reduces the bias. However, the problem of non-response remains.

     Aux. variables      Imputed rent by auxiliary variable        Non-response
ID2  ROOMS  BALCONY      BALCONY  ROOMS  BALCONY × ROOMS           survey
4    1      0            555      362    362                       571
12   3      1            880      701    801                       874
15   4      0            555      656    899                       477

Differences between the imputed values and the values collected by the non-response survey are often large, regardless of the imputation method used. The method with the smallest differences was cell mean imputation with the variable BALCONY used to form the homogeneous groups. However, these conclusions are based on very few observations in this example.