Chapter 9 Cluster Sampling
It is one of the basic assumptions in any sampling procedure that the population can be divided into a finite number of distinct and identifiable units, called sampling units. The smallest units into which the population can be divided are called elements of the population. The groups of such elements are called clusters.
In many practical situations and many types of populations, a list of elements is not available and so the use of an element as a sampling unit is not feasible. The method of cluster sampling or area sampling can be used in such situations.
In cluster sampling - divide the whole population into clusters according to some well-defined rule. - Treat the clusters as sampling units. - Choose a sample of clusters according to some procedure. - Carry out a complete enumeration of the selected clusters, i.e., collect information on all the sampling units available in selected clusters.
Area sampling In case, the entire area containing the populations is subdivided into smaller area segments and each element in the population is associated with one and only one such area segment, the procedure is called as area sampling.
Examples: In a city, the list of all the individual persons staying in the houses may be difficult to obtain or even maybe not available but a list of all the houses in the city may be available. So every individual person will be treated as sampling unit and every house will be a cluster. The list of all the agricultural farms in a village or a district may not be easily available but the list of village or districts are generally available. In this case, every farm in sampling unit and every village or district is the cluster.
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 1
Moreover, it is easier, faster, cheaper and convenient to collect information on clusters rather than on sampling units.
In both the examples, draw a sample of clusters from houses/villages and then collect the observations on all the sampling units available in the selected clusters.
Conditions under which the cluster sampling is used: Cluster sampling is preferred when (i) No reliable listing of elements is available, and it is expensive to prepare it. (ii) Even if the list of elements is available, the location or identification of the units may be difficult. (iii) A necessary condition for the validity of this procedure is that every unit of the population under study must correspond to one and only one unit of the cluster so that the total number of sampling units in the frame may cover all the units of the population under study without any omission or duplication. When this condition is not satisfied, bias is introduced.
Open segment and closed segment: It is not necessary that all the elements associated with an area segment need be located physically within its boundaries. For example, in the study of farms, the different fields of the same farm need not lie within the same area segment. Such a segment is called an open segment.
In a closed segment, the sum of the characteristic under study, i.e., area, livestock etc. for all the elements associated with the segment will account for all the area, livestock etc. within the segment.
Construction of clusters: The clusters are constructed such that the sampling units are heterogeneous within the clusters and homogeneous among the clusters. The reason for this will become clear later. This is opposite to the construction of the strata in the stratified sampling.
There are two options to construct the clusters – equal size and unequal size. We discuss the estimation of population means and its variance in both the cases.
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 2
Case of equal clusters Suppose the population is divided into N clusters and each cluster is of size M . Select a sample of n clusters from N clusters by the method of SRS, generally WOR. So total population size = NM total sample size = nM .
Let
th th yij : Value of the characteristic under study for the value of j element (1,2,...,)j M in the i cluster (iN 1,2,..., ).
M 1 th yyiij mean per element of i cluster . M j1
Population (NM units)
Cluster Cluster Cluster M units M units … … … M units Population
N clusters
N Clusters
Cluster Cluster Cluster M units M units … … … M units Sample
n clusters
n Clusters
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 3
Estimation of population mean: First select n clusters from N clusters by SRSWOR. Based on n clusters, find the mean of each cluster separately based on all the units in every cluster. So we have the cluster means as yy12, ,..., yn . Consider the mean of all such cluster means as an estimator of population mean as 1 n yycl i . n i1 Bias: 1 n Ey()cl Ey () i n i1 1 n Y (since SRS is used) n i1 Y .
Thus ycl is an unbiased estimator of Y . Variance:
The variance of ycl can be derived on the same lines as deriving the variance of sample mean in SRSWOR.
The only difference is that in SRSWOR, the sampling units are yy12, ,..., yn whereas in case of ycl , the sampling units are yy, ,..., y . 12 n Nn Nn Note that is case of SRSWOR,Var (y )SVar22 and (y ) s , Nn Nn
2 Var() ycl E ( y cl Y )
Nn 2 Sb Nn N 221 where SyYbi() which is the mean sum of square between the cluster means in the N 1 i1 population. Estimate of variance: Using the philosophy of estimate of variance in case of SRSWOR again, we can find Nn Var()y s2 clNn b
n 221 where sbicl()yy is the mean sum of squares between cluster means in the sample . n 1 i1
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 4
Comparison with SRS : If an equivalent sample of nM units were to be selected from the population of NM units by SRSWOR, the variance of the mean per element would be NM nM S 2 Var ( y ) . nM NM nM fS2 . nM NM Nn-122 where f and SyY (ij ) . NNM1 ij11 Nn Also Var ( y ) S 2 clNn b f S 2 . n b Consider NM 22 (1)()NM S yij Y ij11
NM 2 ()()yyij i yY i ij11
NM NM 22 ()yyij i () yY i ij11 ij 11 22 NM(1)(1) Swb MN S where
N 221 SSwi is the mean sum of squares within clusters in the population N i1 M 221 th Siiji()yy is the mean sum of squares for the i cluster. M 1 j1 The efficiency of cluster sampling over SRSWOR is Var() y E nM Var() ycl S 2 2 MSb 2 1(1)NM Sw 2 (1).N (1)NM M Sb
2 2 Thus the relative efficiency increases when Sw is large and Sb is small. So cluster sampling will be efficient if clusters are so formed that the variation the between cluster means is as small as possible while variation within the clusters is as large as possible.
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 5
Efficiency in terms of intra class correlation The intra class correlation between the elements within a cluster is given by
Ey()()ij Y y ik Y 1 2 ;1 Ey()ij Y M 1 1 NM M ()()yijY y ik Y MN(1) M ijkj11()1 NM 1 2 ()yYij MN ij11 1 NM M ()()y Y y Y MN(1) M ij ik ijkj11()1 MN 1 2 S MN NM M ()()yYyYij ik ijkj11()1 . (1)(1)MN M S 2 Consider
2 NNM 2 1 ()yYiij ( y Y ) iij111M NM11 MM ()yY2 ()() yYyY 22ij ij ik ij11MM jkj 1()1 NM M N NM 22 2 ()()yYyYij ik M () yY i () yY ij ijkj11()1 i 1 ij 11 or
22 2 2 (MNMSMNSNMS 1)( 1) ( 1)b ( 1) (1)MN or SMS221( 1). b MN2 (1)
The variance of ycl now becomes Nn Var() y S 2 clNn b
NnMNS1 2 1(M 1). Nn N1 M 2 MN1 N n For large NNN,1,1,1andso MN N 1 S 2 Var() y 1( M 1). cl nM
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 6
The variance of the sample mean under SRSWOR for large N is S 2 Var() y . nM nM The relative efficiency for large N is now given by Var() y E nM Var() ycl S 2 nM S 2 1(M 1) nM 11 ;1. 1(MM 1) 1
If M 1 then E 1, i.e., SRS and cluster sampling are equally efficient. Each cluster will consist of one unit, i.e., SRS. If M 1, then cluster sampling is more efficient when E 1 or (1)0M
or 0. If 0, then E 1, i.e., there is no error which means that the units in each cluster are arranged randomly. So sample is heterogeneous. In practice, is usually positive and decreases as M increases but the rate of decrease in is much lower in comparison to the rate of increase in M. The situation that 0 is possible when the nearby units are grouped together to form cluster and which are completely enumerated. There are situations when 0.
Estimation of relative efficiency: The relative efficiency of cluster sampling relative to an equivalent SRSWOR is obtained as S 2 E 2 . MSb
2 2 An estimator of E can be obtained by substituting the estimates of S and Sb . 1 n Since yycl i is the mean of n means yi from a population of N means yii , 1,2,..., N which are n i1 drawn by SRSWOR, so from the theory of SRSWOR, Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 7
n 221 Es()bic E ( y y ) n i1 N 1 2 ()yYi N 1 i1 2 Sb .
2 2 Thus sb is an unbiased estimator of Sb .
n 221 2 Since swi S is the mean of n mean sum of squares Si drawn from the population of N mean sums n i1
2 of squares Sii , 1,2,..., N , so it follows from the theory of SRSWOR that
nn nN 22211 11 2 E()sEwi S ES () i S i nnii11 nN ii 11 N 1 2 Si N i1 2 Sw .
2 2 Thus sw is an unbiased estimator of Sw . Consider
NM 221 SyY (ij ) MN 1 ij11
NM 2 2 or (MN 1) S ( yij y i ) ( y i Y ) ij11 NM 22 ()()yyij i yY i ij11 N 22 (1)(1)MSMNSib i1 22 NM(1)(1). Swb MN S An unbiased estimator of S 2 can be obtained as 1 SNMsMNsˆ 222(1)(1). MN 1 wb So Nn Var ( y ) s2 clNn b NnS ˆ 2 Var ( y ) nM Nn M n 221 wheresyybicl ( ) . n 1 i1 Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 8
S 2 An estimate of efficiency E 2 is MSb
22 ˆ NM(1)(1) swb MN s E 2 . MNM(1) sb
If N is large so that M (1)NMN and MNMN1, then
2 11M Sw E 2 M MMSb and its estimate is
2 ˆ 11M sw E 2 . M MMsb
Estimation of a proportion in case of equal cluster Now, we consider the problem of estimation of the proportion of units in the population having a specified attribute on the basis of a sample of clusters. Let this proportion be P .
th Suppose that a sample of n clusters is drawn from N clusters by SRSWOR. Defining yij 1 if the j unit
th in the i cluster belongs to the specified category (i.e. possessing the given attribute) and yij 0 otherwise, we find that
yPii , 1 N YPP i , N i1 MPQ S 2 ii, i (1)M N MPQ ii S 2 i1 , w NM(1) NMPQ S 2 , NM 1)
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 9
N 221 SPPbi(), N 1 i1 N 1 22 PNPi N 1 i1
NN 1 2 PPii(1 ) PNP i (1)N ii11 1 N NPQ PQii, (1)N i1
th where Pi is the proportion of elements in the i cluster, belonging to the specified category and
QPiNii1 , 1,2,..., and QP1. Then, using the result that ycl is an unbiased estimator of Y , we find that
n ˆ 1 Pcl P i n i1 is an unbiased estimator of P and N NPQ PQ ()Nn ii Var() Pˆ i1 . cl Nn(1) N ˆ This variance of Pcl can be expressed as NnPQ Var() Pˆ [1( M 1)], cl NnM1 where the value of can be obtained from
22 M (1)NSNSbw 222 2 and (1)(1)(1)MNSNMSMNS wb (1)(1)M MN S 22 2 by substituting SSbw, and S in , we obtain
N PQ M 1 ii 1 i1 . (1)M NPQ ˆ The variance of Pcl can be estimated unbiasedly by Nn Var() Pˆ s2 clnN b n Nn 1 ˆ 2 ()PPicl nN(1) n i1 n Nn ˆ ˆ nPcl Q cl PQ i i Nn(1) n i1 Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 10
ˆ ˆ where QIPcl cl . The efficiency of cluster sampling relative to SRSWOR is given by MN(1)1 E (1)1(1)MN M (1)NNPQ . NM 1 N NPQ PQii i1 1 If N is large, then E . M An estimator of the total number of elements belonging to a specified category is obtained by multiplying ˆ ˆ Pcl by NM , i.e. by NMPcl . The expressions of variance and its estimator are obtained by multiplying the corresponding expressions for Pˆ by NM22. cl
Case of unequal clusters: In practice, the equal size of clusters are available only when planned. For example, in a screw manufacturing company, the packets of screws can be prepared such that every packet contains same number of screws. In real applications, it is hard to get clusters of equal size. For example, the villages with equal areas are difficult to find, the districts with same number of persons are difficult to find, the number of members in a household may not be same in each household in a given area.
th Let there be N clusters and M i be the size of i cluster, let
N MM0 i i1 1 N MM i N i1
Mi 1 th yyiij : mean of i cluster M j1 i 1 N Mi Yy ij M 0 ij11 N M i yi i1 M 0 N 1 M i yi NMi1 Suppose that n clusters are selected with SRSWOR and all the elements in these selected clusters are surveyed. Assume that M i ’s (iN 1,2,..., ) are known. Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 11
Population
Cluster Cluster Cluster M1 M2 … … … M Population N units units units N clusters
N Clusters
Cluster Cluster Cluster M1 M2 … … … Mn Sample units units units n clusters
n Clusters
Based on this scheme, several estimators can be obtained to estimate the population mean. We consider four type of such estimators.
1. Mean of cluster means: Consider the simple arithmetic mean of the cluster means as 1 n yyci n i1 1 N Eyci y N i1 N M i YY(where yi ). i1 M 0
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 12
The bias of y is c
Bias ycc E y Y
NN 1 M i yyii NMii110
NN 1 M 0 Myii y i MN0 ii11 NN M y N ii 1 ii11 Myii MN0 i1 1 N (MMyi )( i Y ) M 0 i1 N 1 Smy M 0
Bias yc 0 if M iiand y are uncorrelated . The mean squared error is
2 MSE ycc Var y Bias y c 2 Nn22 N1 SSbmy Nn M 0 where N 221 SyYbi ( ) N 1 i1 1 N SMMyYmy ( i )( i ). N 1 i1
An estimate of Var yc is
Nn 2 Var ycb s Nn
n 2 1 2 where sbicyy. n 1 i1
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 13
2. Weighted mean of cluster means Consider the arithmetic mean based on cluster total as
n * 1 yMycii nM i1 n * 11 E()yEyMcii ( ) nMi1 n 1 N M iiy nM0 i1 1 N Mi yij M 0 ij11 Y .
* * Thus yc is an unbiased estimator of Y . The variance of yc and its estimate are given by
n * 1 M i Var()yci Var y nMi1 Nn S *2 Nn b
Nn 2 Var() y** s cbNn where
N 2 *2 1 M i Sbiy Y NM1 i1 n 2 *21 M i * syybic nM1 i1 *2 *2 Es()bb S .
* Note that the expressions of variance of yc and its estimate can be derived using directly the theory of SRSWOR as follows:
n M i * 1 Let zyyii,then c zz i . Mni1
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 14
Since SRSWOR is followed, so
n *2Nn 1 Var() yci Var () z ( z Y ) Nn N 1 i1
N 2 Nn 1 M i yYi Nn N1 i1 M Nn S *2 . Nn b Since
n *21 2 Es()bi E ( z z ) n 1 i1 n 2 1 M i * Eyyic nM1 i1 N 2 1 M i yYi NM1 i1 *2 Sb So an unbiased estimator of variance can be easily derived.
3. Estimator based on ratio method of estimation Consider the weighted mean of the cluster means as
n M iiy ** i1 yc n M i i1
It is easy to see that this estimator is a biased estimator of the population mean. Before deriving its bias and mean squared error, we note that this estimator can be derived using the philosophy of ratio method of estimation. To see this, consider the study variable U i and auxiliary variable Vi as
Myii Ui M M Vii 1,2,..., N i M
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 15
N
N M i 11i1 VV i 1 NNMi1 1 n uu i n i1 1 n vv i . n i1
The ratio estimator based on U and V is u YVˆ R v n ui i1 n vi i1 n M y ii M i1 n M i i1 M n M iiy i1 n . M i i1
** Since the ratio estimator is biased, so yc is also a biased estimator. The approximate bias and mean
** squared errors of yc can be derived directly by using the bias and MSE of ratio estimator. So using the results from the ratio method of estimation, the bias up to second order of approximation is given as follows
2 ** Nn SSvuv Bias ( yc ) 2 U Nn V UV
Nn 2 Suv SUv Nn U 11NN where UUiii My NNMii11
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 16
N 221 SVVvi() N 1 i1
N 2 1 M i 1 NM1 i1 1 N SUUVVuv()() i i N 1 i1 NN 11Myii M i Myii 1 NMNMM1 ii11 U 1 N RUuv My i i . VNMi1
** The MSE of yc up to second order of approximation can be obtained as follows:
**Nn 2 2 2 MSE() ycuvuv S R S 2 RS Nn
NN2 2 11Myii where SMuiiy NMNM1 ii11 Alternatively,
N 2 ** Nn 1 MSE() yciuvi U R V Nn N 1 i1 2 NN Nn 11Myii M i Myii Nn N1 ii11 M NM M 2 N N 2 Myii Nn 1 M ii1 yi . Nn N1 i1 M NM An estimator of MSE can be obtained as
n 2 **Nn 1 M i ** 2 MSE() ycic ( y y ). Nn n1 i1 M
** The estimator yc is biased but consistent.
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 17
4. Estimator based on unbiased ratio type estimation
1 n 1 Mi Since yyci (where yyiij ) is a biased estimator of population mean and n i1 M i i1
N 1 Bias()ycmyS M 0 N 1 Smy NM Since SRSWOR is used, so 11nn smy()(),Mmyy i i c m M i nn1 ii11 is an unbiased estimator of 1 N SMMyYmy()(), i i N 1 i1 i.e., E()sSmy my . So it follow that N 1 E()yYcmy Es ( ) NM
N 1 or EycmysY. NM So
** N 1 yycc s my NM is an unbiased estimator of the population mean Y .
This estimator is based on unbiased ratio type estimator. This can be obtained by replacing the study M M variable (earlier y ) by i y and auxiliary variable (earlier x ) by i . The exact variance of this i M i i M estimate is complicated and does not reduces to a simple form. The approximate variance upto first order of approximation is
2 NN ** 11M i Var ycciii y Y)(). y M M nN(1) ii11 M NM
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 18
A consistent estimate of this variance is 2 n M 11nnM i ** ii1 Var yccicii y y) y M . nn(1) ii11 M nM n ** ** The variance of ycc will be smaller than that of yc (based on the ratio method of estimation) provided the N N Myii Mi 1 1 regression coefficient of on is nearer to yi than to Myii. M M N i1 M 0 i1
Comparison between SRS and cluster sampling:
n In case of unequal clusters, M i is a random variable such that i1 n E MnMi . i1 Now if a sample of size nM is drawn from a population of size NM , then the variance of corresponding sample mean based on SRSWOR is NM nM S 2 Var() y SRS NM nM
NnS 2 . Nn M This variance can be compared with any of the four proposed estimators. For example, in case of
n * 1 yMycii nM i1 Nn Var() y**2 S cbNn N 2 Nn 1 M i yi Y . Nn N1 i1 M
** The relative efficiency of yc relative to SRS based sample mean Var() y E SRS Var()y* c S 2 *2 . MSb
* *2 For Var() ycSRS Var ( y ), the variance between the clusters ()Sb should be less. So the clusters should be formed in such a way that the variation between them is as small as possible.
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 19
Sampling with replacement and unequal probabilities (PPSWR) In many practical situations, the cluster total for the study variable is likely to be positively correlated with the number of units in the cluster. In this situation, it is advantageous to select the clusters with probability proportional to the number of units in the cluster instead of with equal probability, or to stratify the clusters according to their sizes and then to draw a SRSWOR of clusters from each of the stratum. We consider here the case where clusters are selected with probability proportional to the number of units in the cluster and with replacement.
Suppose that n clusters are selected with ppswr, the size being the number of units in the cluster. Here P is the probability of selection assigned to the ith cluster which is given by i
MMii Pi ,1,2,...,.iN MNM0 Consider the following estimator of the population mean:
ˆ 1 n Yyci . n i1 Then this estimator can be expressed as
ˆ 1 N Yycii n i1
th where i denotes the number of times the i cluster occurs in the sample. The random variables
12, ,..., N follow a multinomial probability distribution with E()nP , Var () nP (1) P ii iii Cov(,ij ) nPP ij , i j . Hence,
ˆ 1 N E()YEcii () y n i1 1 N nPii y n i1 N M i yi i1 NM
N Mi yij ij11 Y . NM
ˆ Thus Yc is an unbiased estimator of Y .
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 20
ˆ We now derive the variance of Yc . ˆ 1 N From Yycii , n i1 1 NN Var() Yˆ Var ()y 2 Cov (, )yy ciiijij2 n iij1 1 NN PPyPPyy(1 ) 2 2 iiiijij n iij1 2 NN 1 2 Pyii Py ii n2 iij1 N 1 2 Py Y 2 ii n i1 N 1 2 Myii(). Y nNM i1 ˆ An unbiased estimator of the variance of Yc is
n ˆˆ1 2 Var() Ycic ( y Y ) nn(1) i1 which can be seen to satisfy the unbiasedness property as follows: Consider
n 1 ˆ 2 EyY()ic nn(1) i1 n 1 22ˆ EynY()ic nn(1) i1 n 1 22ˆ E iiynVarYnY() c nn(1) i1 where E()iinP , Var () iii nP (1),(, P Cov ijij ) nPP , i j
nNN 111ˆ 2222 E ()yYic nPyn iii PyY ii () nY nn(1)iii111 nn (1) n NN 1122 2 Pyii() Y Py ii () Y (1)nn ii11 NN 1122 Pyii() Y Py ii () Y (1)nn ii11 N 1 2 Pyii() Y (1)n i1 ˆ Var(). Yc
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur Page 21