Some SAS-SUDAAN Comparisons
Total Page:16
File Type:pdf, Size:1020Kb
NESUG 17 Analysis Variance Estimation With Complex Surveys: Some SAS-SUDAAN Comparisons Xiuhua Chen and Paul Gorrell Social & Scientific Systems, Inc. Abstract In version 8, SAS® introduced new procedures such as SURVEYMEANS and SURVEYREG for analyzing data from complex surveys (SURVEYFREQ and SURVEYLOGISTIC are part of V9). This paper compares the SAS and SUDAAN® implementation of the Taylor linearization method for variance estimation. Although both SAS and SUDAAN make use of this method (currently it's the only method available in SAS), there are important differences, such as assumptions concerning missing values (nonresponse) and subgroup analysis. Focusing on SURVEYMEANS, we will use concrete examples from two national healthcare databases (the Medical Expenditure Panel Survey and the HCUP Nationwide Inpatient Sample) to clarify these differences. Along the way we'll illustrate the syntax and keywords used in the SAS survey analysis PROCs (especially the STRATA, CLUSTER and DOMAIN statements). This paper should be useful for anyone analyzing data from complex surveys, especially SAS programmers who use SUDAAN for variance estimation. Introduction In our work we use SAS for a wide variety of tasks: restructuring data files, editing variables, statistical analysis, table and report generation, etc. Until recently we did not use SAS for variance estimation with complex survey data because SAS statistical procedures did not have the functionality for incorporating the design properties of complex surveys into the analysis, i.e. they assumed the design to be a simple random sample of an infinite population. In our experience this assumption generally leads to an underestimation of variance. To correctly estimate variance for complex surveys we have generally used SAS-callable SUDAAN whenever we needed to, for example, compute standard errors. More recently we have begun to test the SAS survey PROCs and compare their output with that generated by SUDAAN. This paper reports on some of those comparisons. We focus on SURVEYMEANS but many of the basic points of this paper apply to the other survey PROCs as well. This paper is not a general introduction to the SAS survey analysis PROCs (or even to SURVEYMEANS) and it will not answer all the questions you may have about using SAS for analyzing complex surveys, or how SAS compares to SUDAAN for different types of analyses. But we hope that reading this paper will put you in a better position to know what questions to ask when you first need to use a SAS survey procedure, or allow you to point someone in the right direction if they stop you in the hallway and ask why SAS and SUDAAN don't always produce the same standard errors. For a paper on variance estimation, this paper has a notable lack of statistical formulae. This is because we have targeted SAS programmers rather than statisticians. We have taken a 'functional' approach to discussing the SAS-SUDAAN differences we discuss here. Both the SAS and SUDAAN documentation contain excellent discussions of the statistical details which are built into their respective procedures. Variance Estimation with Complex Surveys A complex survey design is any design that is not a simple random sample (where each person in the target population has an equal chance of being selected in the survey). Complex designs involve dividing the population into groups and sampling the groups themselves (cluster sampling) and/or sampling from the groups (stratified sampling). 1 NESUG 17 Analysis When using SUDAAN to analyze data from complex surveys you have two general options for variance estimation. You can choose either Taylor linearization or a replication method. We won't discuss the replication methods (e.g. jackknife and balanced repeated measures) because they are currently unavailable in SAS. The Taylor method is used in both SAS and SUDAAN so this method will be assumed in the comparisons discussed in this paper. SAS and SUDAAN also have in common a default assumption of a with-replacement design. One important point to emphasize when analyzing data from complex surveys is that you should resist the normally efficient practice of subsetting to just those records that will be used in the analysis. This subverts the process of giving the procedures all the necessary survey design information with respect to strata and clusters. That is, if you are interested in a particular analysis of city-dwellers within a nationally-representative survey and subset to just this subpopulation before your analysis, you may be deleting information about the full design properties of the survey. You are more or less likely to do this depending on the criteria for subsetting. For example, subsetting to males from a typical national survey population is unlikely to exclude all records in a particular PSU and remove it from the analysis (and if it did you'd really want to know why!). Below we discuss the options in SAS and SUDAAN for analyzing subpopulations (or, domains) within a sample population. In this paper we will compare SAS and SUDAAN in four different situations: (1) a. The analysis includes all survey respondents and all respondents have non-missing values for the analysis variable. b. A subgroup analysis requires the use of the SURVEYMEANS DOMAIN statement or the SUDAAN SUBGROUP statement. c. The analysis includes all respondents but some respondents have missing values for the analysis variable causing some strata to have only one PSU (cluster). d. The analysis includes all respondents but some respondents have missing values for the analysis variable causing some PSUs (clusters) to have missing values (but all strata have 2 or more PSUs). In the comparisons we show below a survey weight variable is always used. When the WEIGHT statement is used both SAS and SUDAAN exclude records with nonpositive weights from the analysis. The Household Component of the Medical Expenditure Panel Survey (MEPS HC) The MEPS HC is a nationally representative survey of the U.S. civilian noninstitutionalized population. It collects medical expenditure data as well as information on demographic characteristics, access to health care, health insurance coverage, as well as income and employment data. MEPS is cosponsored by the Agency for Healthcare Research and Quality (AHRQ) and the National Center for Health Statistics (NCHS). For the comparisons reported here we used the MEPS 2001 Full Year Consolidated Data File (HC-060). This is a public use file available for download from the MEPS web site (http://www.meps.ahrq.gov). See also the MEPS Factsheet "Computing Standard Errors for MEPS Estimates", also available from the MEPS web site. The MEPS is not a simple random sample, its design includes stratification, clustering, multiple stages of selection, as well as disproportionate sampling. The MEPS public use files (such as HC-060) include variables for generating weighted national estimates and for use of the Taylor method for variance estimation. These variables are: person-level weight (PERWT01F on HC-060); stratum (VARSTR01 on HC-060); and cluster/psu (VARPSU01 on HC-060). 2 NESUG 17 Analysis PROC SURVEYMEANS (SAS) In this section we will first review the basic syntax of SURVEYMEANS. We will then generate some estimates and standard deviations and errors using the 2001 MEPS full year file (HC-060). Note that we used SAS 8.2 for the comparisons reported here. Initial tests with 9.1.2 reveal no differences. SYNTAX The following illustrates the relevant syntax and statements for PROC SURVEYMEANS: (2) proc surveymeans data= hc060 sum std mean stderr; strata VARSTR01 / list; cluster VARPSU01; weight PERWT01F; var TOTEXP01; run; The options selected on the PROC SURVEYMEANS statement are SUM, STD (standard deviation of the sum), MEAN, and STDERR (standard error of the mean). The STRATA statement (you can also use "STRATUM" when you have one variable and you're fussy about your Latin) is where you list the strata variables if you have a stratified sample. The LIST option causes SAS to generate a Stratum Information table which lists the variables and sampling rates for each stratum, as well as the number of records and clusters for each stratum and analysis variable. This can be very handy information. For example, for the MEPS HC-060 file the Stratum Information table shows that each stratum has at least 2 PSUs (clusters). The head of this table shows the following information. (3) Number of Strata 145 Number of Clusters 557 Number of Observations 33556 Number of Observations Used 32122 Number of Obs with Nonpositive Weights 1434 Sum of Weights 284247327 The 1,434 records with nonpositive weights are excluded prior to analysis. The N value that appears in the LST output will be 32,122 (number of observations used). For TOTEXP01 the NMISS value will be zero because all records have a non-missing value for this analysis variable. The CLUSTER statement is where (you guessed it!) you list the cluster variables. For MEPS (as with all stratified samples), clusters are nested within strata (we'll see this explicitly when the NEST statement in SUDAAN is discussed). This may be obvious but it's worth pointing out that SAS will not tell you the sample design of the data you're working with. You need to know the design of the survey that produced the data and how those design properties are to be specified in the SAS (or, SUDAAN) survey procedures. The Federal agencies responsible for distributing public use files for data such as the MEPS, HCUP or the National Health Interview Survey (NHIS) do an excellent job of providing documentation of survey design and data file properties. The WEIGHT statement is self-explanatory. Records with nonpositive or missing weight values are excluded from the analysis. The VAR statement lists the analysis variables. Here TOTEXP01 is the 2001 MEPS total expenditure variable.