A Framework for the Meta-Analysis of Survey Data

SSC – JSM 2010 A Framework for the Meta-Analysis of Survey Data Karla Fox, Statistics Canada Abstract This research presents a framework for the meta-analysis of complex survey data using the sample design and a superpopulation model to quantitatively summarize different surveys that represent the same underlying population. Additionally, it will compare the classical design based methods using a superpopulation model approach. Keywords: Meta-analysis, observational data, surveys 1. Introduction The recent increase in straightforward access to data has lead to a proliferation of analysis that groups or pools multiple study results. This ready access, combined with user-friendly computer software, has lead researchers to try to borrow strength from the many different observational and non-randomized studies in an attempt to either improve precision or address research questions for which the original studies were not designed. Researchers have started to employ many different techniques including meta-analysis; however, the analysis is often done without reference to a generalized framework or a systematic review and often without an understanding of the methodological differences between surveys and experiments. Although meta-analysis was first used in 1976 for the combination of educational studies, it is now common in the medical literature (Glass). This integration of studies has expanded from combining randomized control trials and other experimental studies to blending information from epidemiological quasi-experimental studies (cohort and case-control), and diagnostic tests and bioassays and now survey data. This paper begins with an overview of the differences in the randomization processes in experiments and surveys. Next, we describe model and design based inference and estimation in surveys, followed by an overview of fixed and random effect models in meta-analysis. We then describe the simulations used to illustrate meta-analysis in the survey context, highlighting issues with convergence, inference and small sample properties. 2. Randomization: Allocation compared to Selection It is important to address the differences of the randomization frameworks in experimental design, and that of survey design. What is random in experimental data is the assignment of individuals. With experimental data, we assign a treatment(s) and a control to individuals in a random fashion to create two (or more) probabilistically equivalent groups. Altman (1985) points out, however, that randomization does not guarantee that the treatment groups are comparable with respect to their baseline characteristics, and suggests that comparability should be assessed prior to analysis. We are thus using randomization to reduce confounding and to ensure that the results of the experiment are internally valid. When we combine experiments we make several assumptions about the treatment effect in the different experiments, based on the fact that we are combing estimates that have been generated from a process with random allocation. We assume that the treatment effect follows a distribution and we either assume that the effects we are combining are identical and independent from this distribution or that the differences between the treatment effects from the different 527 SSC – JSM 2010 experiments can be modeled in an analogous fashion to random effect ANOVA models. We make no assumptions on the population of individuals. On the other hand, in the case of classical design-based survey methods, what is random is the selection of individuals. The purpose of random selection is to generalize to the entire finite population from which we are drawing our sample. Finite population parameters describe the finite population U, from which we draw our sample. Estimation is then about characteristics of this population. These quantities always portray the population at a given time point (the time of sampling), and are purely descriptive in nature. A finite population parameter has the property that in a census (with no non-response and no measurement error) its value can be established exactly. If the researcher was able to collect information from the entire population he would do so. However, since the researcher rarely can take a census of his population of interest and probability sampling is done to produce estimates of the finite population parameter of a sufficient precision at a minimum cost. In probability sampling we select samples that satisfy the following conditions (Sarndal 2003): 1. We can define a distinct set of samples, S1, S2, …, Sv, which the sampling procedure is capable of selecting if applied to a specific population. This means we can say precisely which units belong to S1, to S2, and so on. 2. Each possible sample Si has assigned to it a known probability of selection P(Si), and the probabilities of the possible samples sums to 1. 3. This procedure gives every unit a known non-zero probability of selection. 4. We select one of the Si by a random process in which each sample has its appropriate probability of selection P(Si). 3. The Model and the Design Though classical sampling theory is concerned with inference for finite population parameters we can extend inference to a superpopulation. Superpopulation modeling occurs when the researcher specifies or assumes a random mechanism that generated the finite population (Hartley and Sielken, 1975). We now have what is called model-based as opposed to design-based inference, where the model is the statistical conceptualization of a superpopulation. For example if we assume that the vector y of responses in a survey are a realization of a random Y (Y ,Y ,...,Y ) variable 1 2 N , then the finite population vector y is itself a sample from a hypothetical superpopulation model, ξ. Here then the model-builder is not interested in the finite population U at particular time, but the causal system described in the model. So in the case where we have a census we still would not be estimating these model parameters exactly. Figure 1 illustrates the relationship of the superpopulation with the finite population with the sample. If we let be the parameter from a superpopulation model, then we can let N be the ˆ F ( F ) estimator from the entire finite population and then is the estimator for the finite population based on the sample. Figure 1 ˆ F ( F ) θ θ N 528 SSC – JSM 2010 We now note that we can approach estimation in either a finite sampling or a superpopulation framework. If we cross the estimation approach used with the level of inference we have the following table (Hartley and Sielken, 1975): Estimation Framework Sampling from a fixed or Two-stage sampling from a finite population superpopulation Parameter of interest Finite population Classical finite population Superpopulation theory for the finite parameters or descriptive sampling population parameters ˆ F ( F ) ˆ F ( S ) Infinite superpopulation Infeasible Inference about superpopulation parameters or model parameters from two-stage sampling parameters ˆ S (S ) For clarity the following notation will be used to represent the level of inference and the ˆ S (S ) methodology used to estimate a particular parameter. denotes an estimator of the parameter of the superpopulation where the first superscript S denotes that the parameter of interest is from the superpopulation the second enclosed in parenthesis (S) denotes the two-stage superpopulation framework used to estimate the parameter. Hartely and Seilken and later Sarndal refer to this as ˆ F (S ) case 3. Likewise is the finite population parameter estimate under a two-stage ˆ F (F ) superpopulation framework (case 2 Hartley Seilken) and is the finite population parameter estimate under classical finite population sampling (case 1 Hartley Sielken). As the estimation of a superpopulation parameter under finite population sampling is not possible the superscript S(F) is not used. As it is not logical to combine the estimates from different studies that are inferentially different or calculated under different estimation methods (without accounting for the differences) we only consider cases where we are combining like estimators. 3.1 Combining estimates of the finite population parameter using sampling from a fixed ˆ F (F ) population - With this first type of inference the simplest approach is to use design-based estimation such as a ˆ F (F ) Horvitz-Thompson estimator . Here inference is based on repeated sampling of the population of the same design. We may have an analytical model, but we do not use it in any way. Survey methodologists most frequently encounter this case. ˆ F (F ) When we consider combining estimates of from different surveys we can see that several cases commonly discussed in the literature fall under this category. Pooling or cumulating cases as suggested by Kish (1999) is a method of taking different estimates of a finite population 529 SSC – JSM 2010 characteristic and combining them. Kish uses the term combining when he estimates what he describes as multipopulation or mulitdomains, and cumulating or pooling for periodic or rolling samples. In both these cases we need not invoke a superpopulation and inference is for the ‘finite’ population of interest. Repeated sampling and panel samples of the same population are ˆ F (F ) also a method to combine (Cochran 1977). Wu (2004) suggests a method to combine information from different surveys using an empirical likelihood method similar to the approach of Zieschang (1990) and Rensessn and Nieuwenbroek

Load more