<<

STRATIFIED CLUSTER UNDER MULTIPLICATIVE MODEL FOR QUANTITATIVE SENSITIVE QUESTION SURVEY

Pu Xiangke, Gao Ge, Fan Yubo and Wang Mian

SUMMARY

Advanced sampling methods and corresponding formulas are total probability formulas. The performance of the sampling seldom applied to quantitative sensitive question survey. This methods and formulas in survey of cheating on exams on Dus- paper develops a series of formulas for parameter estimation hu Lake Campus of Soochow University, China, are provided. in cluster sampling and stratified cluster sampling under mul- The reliability of the survey methods and formulas for quanti- tiplicative model on the basis of classic sampling theories and tative sensitive question survey is found to be high.

Introduction sponse error, randomized re- sitive question survey. However, more, formulas for simple ran- sponse models have been devel- this method is vulnerable to dom sampling may also be So-called sensitive questions oped for collecting sensitive because the ran- wrongly used when a relatively are some private issues such as information after the pioneering domness of the selection may complicated sampling procedure acquired immune deficiency work of Warner (1965). The result in a that doesn’t is conducted in sensitive ques- syndrome (AIDS), drug addic- multiplicative model was de- reflect the makeup of the popu- tion survey and evaluation of tion, gambling, prostitution, signed for quantitative sensitive lation and it is cumbersome reliability of sampling methods driving while intoxicated, per- question survey, with advan- when sampling from a large and corresponding formulas sonal income, tax evasion, pre- tages of simple design, small target population. Stratified under randomized response marital sex, venereal diseases, sample size and small sampling sampling is superior to simple models for sensitive question homosexual tendencies, and so error (Wang, 2003; Raghunath random sampling in reducing survey has seldom been set. on. It is difficult to get honest and Georg, 2006). sampling error and cluster sam- In this paper, designs for answers by asking direct ques- Simple random sampling has pling is less costly, but these cluster sampling and stratified tions on sensitive issues in sur- been considered a useful sam- methods are not so popular be- cluster sampling under multipli- vey research (Tourangeau and pling method under multiplica- cause they are not too simple cative model and corresponding Yan, 2007). To reduce the re- tive model for quantitative sen- (Ding and Gao, 2008). Further- formulas for parameter estima-

Keywords / Multiplicative Model / Parameter Estimation / Quantitative Sensitive Question / Stratified Cluster Sampling / Received: 09/01/2010. Modified: 09/05/2011. Accepted: 09/07/2011.

Pu Xiangke. Ph.D. candidate in Gao Ge. M.Sc. in , P.R.China. e-mail: gaoge@ Wang Mian. M.Sc. in Epidemiol- and Health Sta- Soochow University, China. suda.edu.cn ogy and Health Statistics. So- tistics, Soochow University, Professor, Soochow University, Fan Yubo. Ph.D. candidate in ochow University, China. China. Technologist-in-charge, China. Address: Department of Epidemiology and Health Sta- Institute of Hepatology, Third Health Statistics, School of tistics. Soochow University, People’s Hospital, Changzhou, Radiation Medicine and Public China. China. e-mail: [email protected] Health, Suzhou, 215123,

NOV 2012, VOL. 37 Nº 11 0378-1844/12/11/833-05 $ 3.00/0 833 MUESTREO POR CONGLOMERADOS ESTRATIFICADOS BAJO UN MODELO MULTIPLICATIVO PARA PREGUNTAS SENSIBLES EN ENCUESTAS CUANTITATIVAS Pu Xiangke, Gao Ge, Fan Yubo y Wang Mian RESUMEN

Los métodos de muestreo avanzados y las correspondientes total. Se ilustra el desempeño de los métodos de muestreo y fórmulas son raramente aplicados en preguntas sensibles de las fórmulas en una encuesta de engaños en exámenes en el encuestas cuantitativas. Este trabajo desarrolla una serie de Campus Dushu Lake de la Universidad de Soochow, China. La fórmulas para parámetros de estimación en muestreo de con- confiabilidad de los métodos de encuesta y las fórmulas para glomerados estratificados bajo un modelo multiplicativo basa- encuestas cuantitativas de preguntas sensibles resulta ser alta. das en teorías clásicas de muestreo y fórmulas de probabilidad

AMOSTRAGEM ESTRATIFICADA POR CONGLOMERADOS SOB UM MODELO MULTIPLICATIVO PARA PERGUNTAS SENSÍVEIS EM PESQUISAS QUANTITATIVAS Pu Xiangke, Gao Ge, Fan Yubo e Wang Mian RESUMO

Os métodos avançados de amostragem e as corresponden- probabilidade total. Ilustra-se o desempenho dos métodos de tes fórmulas são raramente aplicados em perguntas sensíveis amostragem e as fórmulas em uma pesquisa de enganos em de pesquisas quantitativas. Este trabalho desenvolve uma sé- exames no Campus Dushu Lake da Universidade de Soochow, rie de fórmulas para estimação de parâmetros em amostragem China. A confiabilidade dos métodos de pesquisa e as fórmulas de conglomerados estratificados sob um modelo multiplicativo para pesquisas de perguntas sensitivas quantitativas resultam baseadas em teorias clássicas de amostragem e fórmulas de ser alta.

tion are provided. These com- eral clusters (primary units), the ith cluster contains Mi of the subunits in each clus- plex sampling methods have and each cluster is composed subunits. Then, n clusters are ter, and been employed in survey of of secondary units. Some drawn randomly from the cheating on Dushu Lake Cam- clusters are randomly selected population. is the sampling ratio. pus of Soochow University, from the population. Then, Estimation of the population China, and may be applicable the multiplicative model and its for to a large population. (above) is applied to all the quantitative sensitive question When Mi=M, secondary units of selected survey. Suppose the mean of

Methods for Quantitative clusters for the quantitative the response values in the ith Sensitive Question Survey sensitive question survey. (4) cluster is µi, yi denotes the sum of values in the ith clus- Multiplicative model Stratified cluster sampling ter and the population mean is where f=nM / NM=n/N is the under the multiplicative µ. By Cochran (1977) and sampling ratio. In the multiplicative model model Wang and Gao (2006) the es- (Wang, 2003), a randomized timator of the population Calculation of µi and yi. Sup- device is designed to ran- Before cluster sampling, mean can be stated as pose µi denotes the mean of domly generate an integer the population is divided variables with sensitive charac- between 0 and 9. In the de- into different strata by char- ter in the ith cluster, µiz de- vice, ten balls of identical acters. Then, cluster sam- notes the mean of numerical size are respectively labeled pling is applied to each stra- values of the answers in the by the ten integers. By a ran- tum. In each stratum, differ- (1) ith cluster, and µy denotes the domization process, each re- ent clusters (primary units) and when Mi=M, mean of all the random num- spondent takes a ball labeled are also composed of sec- bers in the randomized device. by an integer and multiplies ondary units. And then, the Then, from characteristics of the integer by the numerical multiplicative model is em- (Wang et al., 2006), value of his response to the ployed to those secondary (2) µ = µ µ , i= 1,2,...,n (5) quantitative sensitive question units of selected clusters for Also following Cochran iz i y so as to get a final result. the quantitative sensitive (1977), the of the vari- µ = µ /µ , i= 1,2,...,n (6) question survey. ance of can be obtained as i iz y Cluster sampling under the multiplicative model on Deduction of Formulas and yi = Miµi, i= 1,2,...,n quantitative sensitive questions Formulas for cluster Formulas for cluster sampling It is convenient to use a (3) cluster sampling method. The Suppose the population is where is the mean Suppose the population is population is divided into sev- divided into N clusters, and composed of L strata, and

834 NOV 2012, VOL. 37 Nº 11 the hth stratum contains Nh tum according to the number Table I clusters (primary units), and of subunits. The mean of times of cheating of students the ith cluster contains Mih As samples of each stratum in 38 classes in last two semesters secondary units. The popula- are independent, from Eq. 11, by twice repeated stratified cluster tion contains N secondary its variance is sampling under multiplicative model units and nh clusters are ran- domly drawn from the hth (12) Serial Serial stratum. numbers of numbers According to Eqs. 8 and undergraduate µi1 µ'i1 of graduate µi2 µ'i2 Estimation of the population 10, the estimator of V( ) classes classes mean and its variance of the is v( ). And from Eq. 12, 1 0.8085 0.9787 1 1.3990 1.3788 hth stratum. From Eq. 1, the v( ) will be the estimator of 2 1.9222 1.7333 2 2.3308 2.6925 estimator of the population V( ). 3 1.2278 1.4722 3 2.3457 2.2370 mean (µh) of the hth stratum 4 1.5934 1.5177 4 0.8322 1.1111 can be obtained as Calculation of µih and yih 5 1.0505 0.9748 5 0.9660 0.9669 6 0.6333 0.4858 6 0.8133 0.9622

Let µih denote the mean of 7 1.1358 0.9926 7 0.9235 0.8889 h= 1,2,…, L variables with sensitive char- 8 1.0922 1.0118 8 0.7488 0.9220 (7) acter in the ith cluster of the 9 0.9827 0.8494 9 0.5383 0.5333 10 1.0431 1.1066 10 0.8009 0.7824 hth stratum, µiz denote the and from Eq. 3, we can get average value of all the an- 11 1.1376 1.4233 11 0.7565 0.7488 the estimator of the variance swers in the ith cluster of 12 0.7727 0.6010 12 0.7546 0.7593 of : the hth stratum, and µ be 13 1.2691 1.1012 13 0.7565 0.7234 y 14 0.8636 0.8232 14 0.4921 0.4815 15 0.9082 0.9034 15 0.9975 0.9728 16 1.0000 0.8485 16 0.8931 0.8637 17 0.1401 0.0821 17 0.7677 0.6717 18 0.9312 0.9206 18 0.9333 0.8404 where is the 19 0.8360 0.7196 the average value of all the 20 0.4840 0.5284 random numbers in the ran- mean of subunits of each domized device. Then, from cluster in the hth stratum, and characteristics of means tum are approximately the Survey on the mean of (Wang et al., 2006; Su, same size. Twenty clusters cheating times in each class , which is the 2007), we can get containing 1080 students µ = µ µ (13) were randomly drawn from Through twice repeated sur- ihZ ih Y undergraduates and 18 clus- veys on cheating on exams in ters that contain 818 students the last two semesters by strat- sampling ratio in the hth stra- Thus, were drawn ified cluster sampling under the tum. from graduates, multiplicative model and using When M =M , from Eq. 2, ih h there being Eq. 13, it is also possible to get we can also get the estimator and we obtain yih = Mih µih. 1898 students in the selected (Table I) the mean of times of of the population mean: 38 clusters. Each student was cheating on exams of under- repeatedly surveyed twice at graduates in 20 classes in the different times, and 3796 sur- first survey (µi1 (i= 1, 2, …, veys were conducted in total. 20)), the mean of times of All the were cheating on exams of under- and from Eq. 4, the estimator Survey of Cheating on recovered and the passing graduates in 20 classes in the of the variance of can be Exams on Dushu Lake rate of the questionnaires second survey (µ'i1 (i= 1, 2, obtained as Campus of Soochow was 100%. A bank was ….20)), the mean of times of University established IN Excel2003 and cheating on exams of graduates all the data was analyzed by in 18 classes in the first survey Let the stu- SAS9.13. (µi2 (i= 1, 2,…, 18)), and the dents on Du- In the randomized device, mean of times of cheating on Estimation of the population shu Lake Campus of Sooch- there were 10 balls of iden- exams of graduates in 18 class- mean and its variance. Ac- ow University be the target tical size in a bag and the es in the second survey (µ'i2 (i= cording to Cochran (1977), population, which is divided balls were respectively 1, 2,…, 18). the estimator of the popula- into two strata. Define under- tagged by integers from 0 tion mean is graduates as the first stratum to 9. Each chosen student Estimation of the population which contains 9689 students was asked to select a ball mean and variance of and graduates as the second from the bag to get the cor- cheating times in each stratum which contains 1890 responding integer, multi- stratum (11) students, and we could get plied it by the number to times of his cheating on Population mean of cheating where is the exams in the last two se- times in each stratum. By the Let every class be a clus- mesters, and wrote down first survey on undergraduates relative size of the hth stra- ter, and clusters in each stra- the final result. and Eq. 9, we can get the es-

NOV 2012, VOL. 37 Nº 11 835 timator of the population of students on Dushu Lake es by using SAS9.13 and have provided several sam- mean of cheating times of Campus as showed that the coincidence pling methods. But so far, undergraduates in last two of results researches of sampling de- semesters: b e t w e e n sign have been limited to two repeat simple random sampling, st ratif ied and studies on the evaluation Thus, the 95% confidence cluster samplings was high of the reliability and validity interval of the population (the Spearman rank correla- of sample survey on sensi- By the first survey on mean of cheating times of tion coefficient rs= 0.90311, tive questions are rare graduates and Eq. 9, we can students on Dushu Lake P<0.0001, not normal distri- (Wang and Gao, 2008; Gao also get the estimator of the Campus of Soochow Univer- bution), which also indicate and Fan, 2008). population mean of cheating sity is given by that the reliability of our In this paper, formulas times of graduates in last s u r v e y for parameter estimation in two semesters: m e t h o d s cluster sampling and strati- and its for- fied cluster sampling under mulas was quite high. multiplicative model were Reliability evaluation deduced and quantitative Discussion data about sensitive issues Variance of the population Correlative analysis was could be easily acquired. mean of cheating times in applied to the data of two Sensitive question survey Under the multiplicative each stratum. By the first repeated surveys under the is very popular and impor- model, cluster sampling and survey on undergraduates multiplicative model from 20 tant in social and medical stratified cluster sampling and Eq. 10, we can get the undergraduate classes by us- research, especially for the were successfully applied to estimator of the variance of ing SAS9.13. The Shapiro- prevention of AIDS. the survey of cheating on the population mean of Wilks’ W test was applied Through the introduction Dushu Lake Campus of So- cheating times of under- to µi1 and µ'i1, and the cor- period and the growth peri- ochow University. And in graduates in last two se- responding W values were od of AIDS, China is now the evaluation of the test- mesters: 0.951357 and 0.966911 re- facing the threat of an retest reliability, the coinci- spectively; the P values were AIDS outbreak. To prevent dence of results between 0.3882 and 0.6888, respec- the disease accurate data two repeated surveys was about it are needed, al- high, showing that the sur- though many people may vey methods and statistical be unwilling to give formulas are highly reliable. honest answers to such a For cluster and stratified tively, and were normally sensitive question. Survey cluster sampling, the sample By the first survey on distributed. This analysis methods and formulas for size is generally large and graduates and Eq. 10, we showed that the coincidence parameter estimation pro- the sample mean usually can also get the estimator between the results of the posed in this article will be follows the normal distribu- of the variance of the popu- two repeat cluster samplings helpful to get reliable data tion. With the formulas lation mean of cheating in the first stratum was high to prevent sexually trans- herein deduced, one can get times of graduates in last (coefficient of product-mo- mitted diseases and improve of population two semesters: ment correlation r= 0.9351, public health. means and their , P<0.0001). Rank correlation Randomized response estimate the intervals of analysis was applied to the models have been widely those population means and, used to make people coop- further, compare the mean erative in sensitive question of each stratum by using survey. Recently, a meta- either t-test, Z test, analysis analysis was applied to 38 of variance or rank test. data of two repeated surveys relevant papers published The study of cluster sam- Population mean and under the multiplicative from 1965 to 2000 and pling and stratified cluster variance of cheating times of model from 18 graduate showed that the application sampling methods and rele- students on Dushu Lake classes by using SAS9.13 of randomized response vant formulas under multi- Campus and showed that the coinci- models has a significant ad- plicative model will likely dence between results of two vantage of accuracy and reli- be helpful for survey of From Eq. 11 the estimator repeat cluster samplings in ability compared with other quantitative sensitive issues. of the population mean of the second stratum was high traditional survey methods cheating times of students on (the Spearman rank correla- (Lensvelt-Mulders et al., Acknowledgements Dushu Lake Campus is tion coefficient rs= 0.83634, 2005). And the multiplica- P<0.0001, not tive model is generally supe- The authors thank Yong- normal distribu- rior to the previously used zhong Wang, Shuangrong tion). Finally, models of randomizing ques- Hang, Min Chen and Hon- tions for quantitative data gyu Shen and the Third Peo- And from Eqs. 10 and 12, analysis was applied to the (Pollock and Bek, 1976; Yan ple’s Hospital of Changzhou, we can get the estimator of data of two repeated surveys and Nie, 2005). As to the for their encouragement, and the variance of the popula- under the multiplicative sampling design for sensitive the referees and editors for tion mean of cheating times model from all the 38 class- question survey, their suggestions and help.

836 NOV 2012, VOL. 37 Nº 11 This work was supported by Ding Y, Gao G (2008) Health Raghunath A, Georg D (2006) Medical Publishing House. grants from the National Statistics. Science Press. Randomized response tech- Beijing, China. 442-450 pp. Beijing, China. 13-20 pp. niques for complex survey de- Wang M, Gao G (2008). Cluster Natural Science Foundation signs. Stat. Pap. 48: 131-141. of China (Nº 81273188, to Gao G, Fan Y (2008) The research Sampling and its Application of stratified cluster sampling Su L (2007) Advanced Mathe- on Quantitative Sensitive Gao Ge), the Preventive on simmons model for sensi- matical Statistics. Peking Questions. Chin. J. Health Medicine Research Project tive question survey. Chin. J. University Press. Beijing, Stat. 25: 586-589. of Jiangsu Province Health Stat. 25: 562-569. China. 3 pp. Wang Y, Sui S, Wang A (2006) (Y2012072, to Pu Xiangke) Lensvelt-Mulders GJLM, Hox JJ, Tourangeau R, Yan T (2007) and and the Applied Basic Re- van der Heijden PG.M, Maas Sensitive questions in sur- Engineering Data Analysis by CJM (2005) Meta-analysis of veys. Psychol. Bull. 133: MATLAB. Tsinghua University search Program of Chang- 859-883. Press. Beijing, China. 9-11 pp. zhou (CJ20112013, to Pu randomized response re- search, thirty-five years of Wang J, Gao G (2006) The esti- Warner S L (1965) Randomized Xiangke). validation. Sociol. Meth. Res. mation of sampling size in response: A survey technique 33: 319-348. multi-stage sampling and its for eliminating answer bias. J. References Pollock KH, Bek Yuksel (1976) application in medical sur- Am. Stat. Assoc. 60: 63-69. A comparison of three ran- vey. Appl. Math. Comput. Yan Z, Nie Z (2005) An alterna- Cochran WG. (1977) Sampling domized response models for 178: 239-249. tive technique of sensitive Techniques. 3rd ed. Wiley. quantitative data. J. Am. Wang J (2003) Practical Medical questions survey. Or Trans. New York; USA. 288-289 pp. Stat. Assoc. 71: 884-886. Research Methods. People’s 9: 30-34.

NOV 2012, VOL. 37 Nº 11 837