NESUG 17 Analysis

Variance Estimation With Complex Surveys: Some SAS-SUDAAN Comparisons

Xiuhua Chen and Paul Gorrell, Social & Scientific Systems, Inc.

Abstract

In version 8, SAS® introduced new procedures such as SURVEYMEANS and SURVEYREG for analyzing data from complex surveys (SURVEYFREQ and SURVEYLOGISTIC are part of V9). This paper compares the SAS and SUDAAN® implementation of the Taylor linearization method for variance estimation. Although both SAS and SUDAAN make use of this method (currently it's the only method available in SAS), there are important differences, such as assumptions concerning missing values (nonresponse) and subgroup analysis. Focusing on SURVEYMEANS, we will use concrete examples from two national healthcare databases (the Medical Expenditure Panel Survey and the HCUP Nationwide Inpatient Sample) to clarify these differences. Along the way we'll illustrate the syntax and keywords used in the SAS survey analysis PROCs (especially the STRATA, CLUSTER and DOMAIN statements). This paper should be useful for anyone analyzing data from complex surveys, especially SAS programmers who use SUDAAN for variance estimation.

Introduction

In our work we use SAS for a wide variety of tasks: restructuring data files, editing variables, statistical analysis, table and report generation, etc. Until recently we did not use SAS for variance estimation with complex survey data because SAS statistical procedures did not have the functionality for incorporating the design properties of complex surveys into the analysis, i.e. they assumed the design to be a simple random sample of an infinite population. In our experience this assumption generally leads to an underestimation of variance. To correctly estimate variance for complex surveys we have generally used SAS-callable SUDAAN whenever we needed to, for example, compute standard errors. More recently we have begun to test the SAS survey PROCs and compare their output with that generated by SUDAAN. This paper reports on some of those comparisons. We focus on SURVEYMEANS but many of the basic points of this paper apply to the other survey PROCs as well.

This paper is not a general introduction to the SAS survey analysis PROCs (or even to SURVEYMEANS) and it will not answer all the questions you may have about using SAS for analyzing complex surveys, or how SAS compares to SUDAAN for different types of analyses. But we hope that reading this paper will put you in a better position to know what questions to ask when you first need to use a SAS survey procedure, or allow you to point someone in the right direction if they stop you in the hallway and ask why SAS and SUDAAN don't always produce the same standard errors.

For a paper on variance estimation, this paper has a notable lack of statistical formulae. This is because we have targeted SAS programmers rather than statisticians. We have taken a 'functional' approach to the SAS-SUDAAN differences discussed here. Both the SAS and SUDAAN documentation contain excellent discussions of the statistical details which are built into their respective procedures.

Variance Estimation with Complex Surveys

A complex survey design is any design that is not a simple random sample (where each person in the target population has an equal chance of being selected in the survey). Complex designs involve dividing the population into groups and sampling the groups themselves (cluster sampling) and/or sampling from within the groups (stratified sampling).


When using SUDAAN to analyze data from complex surveys you have two general options for variance estimation. You can choose either Taylor linearization or a replication method. We won't discuss the replication methods (e.g. jackknife and balanced repeated replication) because they are currently unavailable in SAS. The Taylor method is used in both SAS and SUDAAN so this method will be assumed in the comparisons discussed in this paper. SAS and SUDAAN also have in common a default assumption of a with-replacement design.

One important point to emphasize when analyzing data from complex surveys is that you should resist the normally efficient practice of subsetting to just those records that will be used in the analysis. This subverts the process of giving the procedures all the necessary survey design information with respect to strata and clusters. That is, if you are interested in a particular analysis of city-dwellers within a nationally-representative survey and subset to just this subpopulation before your analysis, you may be deleting information about the full design properties of the survey. You are more or less likely to do this depending on the criteria for subsetting. For example, subsetting to males from a typical national survey population is unlikely to exclude all records in a particular PSU and remove it from the analysis (and if it did you'd really want to know why!). Below we discuss the options in SAS and SUDAAN for analyzing subpopulations (or, domains) within a sample population.
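One way to check whether a planned subset would drop design information is to count the distinct stratum/PSU pairs before and after subsetting. The following is only a minimal sketch. It assumes a data set named hc060 containing the MEPS HC-060 design variables VARSTR01 and VARPSU01 described later in this paper, and the subset condition SEX = 1 is purely illustrative.

```sas
/* Compare the number of stratum/PSU pairs in the full file
   with the number that would remain after a subset.       */
proc sql;
  select count(*) as n_psu_full
    from (select distinct VARSTR01, VARPSU01 from hc060);

  select count(*) as n_psu_subset
    from (select distinct VARSTR01, VARPSU01
            from hc060
            where SEX = 1);   /* illustrative subset only */
quit;
```

If the two counts differ, the subset has removed whole PSUs, and a domain (subgroup) analysis on the full file is the safer route.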

In this paper we will compare SAS and SUDAAN in four different situations:

(1)
a. The analysis includes all survey respondents and all respondents have non-missing values for the analysis variable.
b. A subgroup analysis requires the use of the SURVEYMEANS DOMAIN statement or the SUDAAN SUBGROUP statement.
c. The analysis includes all respondents but some respondents have missing values for the analysis variable causing some strata to have only one PSU (cluster).
d. The analysis includes all respondents but some respondents have missing values for the analysis variable causing some PSUs (clusters) to have missing values (but all strata have 2 or more PSUs).

In the comparisons we show below a survey weight variable is always used. When the WEIGHT statement is used both SAS and SUDAAN exclude records with nonpositive weights from the analysis.

The Household Component of the Medical Expenditure Panel Survey (MEPS HC)

The MEPS HC is a nationally representative survey of the U.S. civilian noninstitutionalized population. It collects medical expenditure data as well as information on demographic characteristics, access to health care, health insurance coverage, income and employment. MEPS is cosponsored by the Agency for Healthcare Research and Quality (AHRQ) and the National Center for Health Statistics (NCHS). For the comparisons reported here we used the MEPS 2001 Full Year Consolidated Data File (HC-060). This is a public use file available for download from the MEPS web site (http://www.meps.ahrq.gov). See also the MEPS Factsheet "Computing Standard Errors for MEPS Estimates", also available from the MEPS web site.

The MEPS is not a simple random sample; its design includes stratification, clustering, multiple stages of selection, and disproportionate sampling. The MEPS public use files (such as HC-060) include variables for generating weighted national estimates and for use of the Taylor method for variance estimation. These variables are: person-level weight (PERWT01F on HC-060); stratum (VARSTR01 on HC-060); and cluster/PSU (VARPSU01 on HC-060).


PROC SURVEYMEANS (SAS)

In this section we will first review the basic syntax of SURVEYMEANS. We will then generate some estimates and standard deviations and errors using the 2001 MEPS full year file (HC-060). Note that we used SAS 8.2 for the comparisons reported here. Initial tests with 9.1.2 reveal no differences.

SYNTAX

The following illustrates the relevant syntax and statements for PROC SURVEYMEANS:

(2)
proc surveymeans data=hc060 sum std mean stderr;
   strata VARSTR01 / list;
   cluster VARPSU01;
   weight PERWT01F;
   var TOTEXP01;
run;

The options selected on the PROC SURVEYMEANS statement are SUM, STD (standard deviation of the sum), MEAN, and STDERR (standard error of the mean). The STRATA statement (you can also use "STRATUM" when you have one variable and you're fussy about your Latin) is where you list the strata variables if you have a stratified sample. The LIST option causes SAS to generate a Stratum Information table which lists the variables and sampling rates for each stratum, as well as the number of records and clusters for each stratum and analysis variable. This can be very handy information. For example, for the MEPS HC-060 file the Stratum Information table shows that each stratum has at least 2 PSUs (clusters). The head of this table shows the following information.

(3)
Number of Strata                               145
Number of Clusters                             557
Number of Observations                       33556
Number of Observations Used                  32122
Number of Obs with Nonpositive Weights        1434
Sum of Weights                           284247327

The 1,434 records with nonpositive weights are excluded prior to analysis. The N value that appears in the LST output will be 32,122 (number of observations used). For TOTEXP01 the NMISS value will be zero because all records have a non-missing value for this analysis variable.

The CLUSTER statement is where (you guessed it!) you list the cluster variables. For MEPS (as with all stratified samples), clusters are nested within strata (we'll see this explicitly when the NEST statement in SUDAAN is discussed).

This may be obvious but it's worth pointing out that SAS will not tell you the sample design of the data you're working with. You need to know the design of the survey that produced the data and how those design properties are to be specified in the SAS (or, SUDAAN) survey procedures. The Federal agencies responsible for distributing public use files for data such as the MEPS, HCUP or the National Health Interview Survey (NHIS) do an excellent job of providing documentation of survey design and data file properties.

The WEIGHT statement is self-explanatory. Records with nonpositive or missing weight values are excluded from the analysis.

The VAR statement lists the analysis variables. Here TOTEXP01 is the 2001 MEPS total expenditure variable. The HC-060 file is a person-level file with 33,556 records. All records have a non-missing value for TOTEXP01. This is consistent with the situation described in (1a): all survey respondents are in the analytic population and all have a non-missing value for the analysis variable.

When we run (2) SAS generates the following output statistics (rounded values are given). The full output listing is given in the Appendix.

(4) SUM: 726,354,166,775 STD: 20,005,091,265 MEAN: 2,555 STDERR: 55

If we run (2) but omit the STRATA and CLUSTER statements, SAS generates the same SUM and MEAN, but the standard deviation and standard error are those given in the parentheses in (5).

(5) SUM: 726,354,166,775 STD: 20,005,091,265 (14,073,048,947) MEAN: 2,555 STDERR: 55 (49)

This comparison is consistent with our general experience that variance estimation where a simple random sample is assumed underestimates standard deviations and standard errors for complex samples.

PROC DESCRIPT (SUDAAN)

To illustrate the basic syntax of SUDAAN's PROC DESCRIPT we'll run the same analysis we did in the last section with SURVEYMEANS. We used SAS-callable SUDAAN 8.0.1 for these comparisons. The program code is shown in (6).

(6)
proc descript data=HC060;
   nest VARSTR01 VARPSU01;
   weight PERWT01F;
   var TOTEXP01;
run;

The NEST statement lists the same variables that are on the SURVEYMEANS STRATA and CLUSTER statements. The order of the variables is important because, in the case of a survey such as MEPS, you are indicating which variable has stratum and which variable has cluster information. As noted earlier, clusters are nested within strata so the strata variable comes before the cluster (PSU) variable. Also, unlike SAS, the input data set must be sorted by the variables appearing on the NEST statement.
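Because SUDAAN expects the input sorted by the NEST variables, a sort step typically precedes the PROC DESCRIPT call. A minimal sketch, assuming the input data set is named hc060 as in the examples here:

```sas
/* Sort by stratum, then PSU, to match the order on the NEST statement. */
proc sort data=hc060;
  by VARSTR01 VARPSU01;
run;
```

The BY order mirrors the NEST order: stratum first, then the PSU nested within it.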

If you run (6), SUDAAN will generate the same total, standard deviation, mean and standard error that SAS does (although the keywords differ). That is, when the input data represent all the strata and clusters in the survey design, and all the analysis variables have non-missing values, SAS and SUDAAN (using the Taylor method) will generate the same standard deviations and standard errors.

DOMAIN (Subgroup) Analysis

In this section we show the use of the SURVEYMEANS DOMAIN statement and its counterpart in SUDAAN, the SUBGROUP statement. We will compute the mean and standard error again for the MEPS TOTEXP01 variable, but here we will also look at males and females as subgroups. For this example we will also show values to six decimal places to indicate the extent to which SAS and SUDAAN produce identical output under these conditions, i.e. all respondents included and no missing values for the analysis variable.

(7)
proc surveymeans data=hc060 mean stderr;
   strata VARSTR01;
   cluster VARPSU01;
   weight PERWT01F;
   var TOTEXP01;
   domain SEX;
run;

The DOMAIN statement shows the categorical variable SEX. The output statistics are shown in (8).

(8)

Data Summary

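Number of Strata                               145
Number of Clusters                             557
Number of Observations                       33556
Number of Observations Used                  32122
Number of Obs with Nonpositive Weights        1434
Sum of Weights                           284247327

Statistics

Variable        N   N Miss          Mean   Std Error of Mean
------------------------------------------------------------
TOTEXP01    32122        0   2555.359713           54.631583
------------------------------------------------------------

Domain Analysis: SEX

SEX      Variable        N   N Miss          Mean   Std Error of Mean
---------------------------------------------------------------------
FEMALE   TOTEXP01    16753        0   2816.937064           62.743712
MALE     TOTEXP01    15369        0   2280.601782           81.814411
---------------------------------------------------------------------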

The SUDAAN code for generating this output is given in (9). This generates the output in (10).

(9)
proc descript data=hc060;
   nest VARSTR01 VARPSU01;
   weight PERWT01F;
   var TOTEXP01;
   subgroup SEX;
   levels 2;
   print MEAN SEMEAN / MEANFMT=F12.6 SEMEANFMT=F12.6;
run;


(10)

Number of observations read    : 32122     Weighted count : 284247327
Number of observations skipped : 1434
(WEIGHT variable nonpositive)
Denominator degrees of freedom : 412

Variance Estimation Method: Taylor Series (WR)
by: Variable, SEX.

Variable: TOTAL HEALTH CARE EXP 01

                            SEX
            Total           1             2
-------------------------------------------------
Mean        2555.359713   2280.601782   2816.937064
SE Mean       54.631583     81.814411     62.743712
-------------------------------------------------

The SUBGROUP statement, like the DOMAIN statement, lists a categorical variable for subgroup analysis. The LEVELS statement tells SUDAAN the number of levels for the categorical variable.

Comparing (8) and (10) shows that SAS and SUDAAN generate the same means and standard errors for this input. SUDAAN's SEX levels 1 and 2 correspond to the MALE and FEMALE rows of the SAS output.

In the next section the first example shows a similar subgroup analysis, but one where SAS and SUDAAN produce different output.

SAS and SUDAAN: Similarities and Differences

DOMAIN (Subgroup) Analysis With Nonpositive Values for the Subgroup Variable

On the MEPS 2001 HC-060 file the MSA variable roughly indicates a rural-urban location distinction (coded as "MSA" or "NON-MSA"). In addition to records with these values, there are 270 records with "-1" values. This is a survey indicator coded as "INAPPLICABLE". These nonpositive values are treated differently in SAS and SUDAAN analysis PROCs.

Most programmers who work with both SAS and SUDAAN know that "0" categorical variable values need to be recoded to some positive value for SUDAAN or they will be excluded from the analysis. A common recoding is for a 0/1 flag variable on a SAS data set to be recoded to 2/1 for use in SAS-callable SUDAAN. But it is equally important to keep in mind other nonpositive values for variables, e.g. the "-1" values for MSA on the MEPS file. The following example illustrates this difference. On the MEPS file a value of "0" indicates "NON-MSA" and "1" indicates "MSA". For the SUDAAN run we've recoded the "0" values to "2" but left the "-1" values as is.
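The recode described above might be sketched as follows. The paper does not show its own recode step, so this is an assumption about how it could be done; hc060_sudaan is a hypothetical data set name (the SUDAAN example in this paper reads hc060).

```sas
/* hc060_sudaan is a hypothetical output data set name. */
data hc060_sudaan;
  set hc060;
  /* NON-MSA: recode 0 to 2 so SUDAAN keeps these records */
  if MSA = 0 then MSA = 2;
  /* MSA = -1 (INAPPLICABLE) is deliberately left unchanged */
run;
```

Leaving the "-1" values untouched is what allows the example to show how SUDAAN treats them.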

The SURVEYMEANS code is shown in (11), with the SUDAAN code in (12).

(11)
proc surveymeans data=hc060 mean stderr;
   strata VARSTR01;
   cluster VARPSU01;
   weight PERWT01F;
   var TOTEXP01;
   domain MSA;
run;


(12)
proc descript data=hc060;
   nest VARSTR01 VARPSU01;
   weight PERWT01F;
   var TOTEXP01;
   subgroup MSA;
   levels 3;
   print MEAN SEMEAN / MEANFMT=F12.6 SEMEANFMT=F12.6;
run;

The basic difference between SAS and SUDAAN here is that SAS will compute an overall mean regardless of the values of the subgroup variable whereas SUDAAN will compute the overall mean just for those records with positive values for the subgroup variable. The SAS and SUDAAN output are shown in (13) and (14) respectively.

(13)

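Variable        N   N Miss          Mean   Std Error of Mean
------------------------------------------------------------
TOTEXP01    32122        0   2555.359713           54.631583
------------------------------------------------------------

Domain Analysis: MSA

MSA   Variable        N   N Miss          Mean   Std Error of Mean
------------------------------------------------------------------
 -1   TOTEXP01      258        0         13677         1793.486511
  0   TOTEXP01     6616        0   2859.678431          126.854976
  1   TOTEXP01    25248        0   2373.814472           55.089462
------------------------------------------------------------------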

(14)

Variance Estimation Method: Taylor Series (WR)
by: Variable, MSA.

Variable: TOTAL HEALTH CARE EXP 01

                                   MSA
            Total           1             2             3
---------------------------------------------------------------
Mean        2463.532013   2373.814472   2859.678431   .
SE Mean       51.146406     55.089462    126.854976   .
---------------------------------------------------------------

Here the SAS and SUDAAN outputs differ because the records with "-1" values for MSA are included in the SAS analysis but excluded from the SUDAAN analysis. Note that the missing values in the SUDAAN output for a non-existent value of "3" for MSA are due to specifying 3 on the LEVELS statement. If 2 were specified the same output would result, but without the "3" column.

Next we discuss two situations where SAS and SUDAAN produce different results, not because of differences regarding valid variable values, but because of different underlying assumptions concerning missing values.


Missing Values for the Analysis Variable (one PSU in some strata)

In this section we will first look at an example where records with missing values for the analysis variable create a situation where some strata have only one PSU (cluster) per stratum. We create this example by modifying the MEPS example from the previous section so we can run an analysis just for total workers' compensation expenses (the MEPS variable TOTWCP01). Further, for this analysis we are interested in computing the mean only for those persons with a workers' compensation expense, i.e. zero values are set to missing.
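Setting the zero values to missing could be done with a short DATA step along these lines. This is a sketch rather than the code actually used for the paper; hc060_wc is a hypothetical data set name, and the example that follows reads hc060, so the recode may equally have been applied to that data set directly.

```sas
/* hc060_wc is a hypothetical output data set name. */
data hc060_wc;
  set hc060;
  /* Persons with no workers' compensation expense drop out of the mean */
  if TOTWCP01 = 0 then TOTWCP01 = .;
run;
```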

(15)
proc surveymeans data=hc060 sum std mean stderr;
   strata VARSTR01 / list;
   cluster VARPSU01;
   weight PERWT01F;
   var TOTWCP01;
run;

Running (15) generates the LOG note in (16) and the output statistics in (17):

(16) Only one cluster in a stratum for variable TOTWCP01. The variance in that stratum is estimated by zero.

(17) SUM: 12,312,110,556 STD: 1,355,766,128 MEAN: 1,645 STDERR: 173

The stratum information table would also show which of the strata have only one cluster (PSU). The NMISS value on the output statistics would be 31,292 and the N would be 830. NMISS indicates the number of records with a positive weight that have a missing value for the analysis variable TOTWCP01.
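As an alternative to scanning the stratum information table, the affected strata can be listed with a query. The sketch below assumes the zero-to-missing recode of TOTWCP01 has already been applied to hc060, and that only positively weighted records with a non-missing analysis value contribute clusters, which is our reading of how the procedure counts them.

```sas
/* List strata with fewer than 2 PSUs among the records that
   actually contribute to the TOTWCP01 analysis.             */
proc sql;
  select VARSTR01,
         count(distinct VARPSU01) as n_psu
    from hc060
    where PERWT01F > 0 and TOTWCP01 is not missing
    group by VARSTR01
    having count(distinct VARPSU01) < 2;
quit;
```

Strata with no contributing records at all would not appear in this listing.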

If we run the comparable analysis in SUDAAN, as in (18), we generate the output in the second column in (19).

(18)
proc descript data=hc060;
   nest VARSTR01 VARPSU01;
   weight PERWT01F;
   var TOTWCP01;
   print TOTAL SETOTAL MEAN SEMEAN;
run;

(19)
              SAS Output     SUDAAN Output

SUM:      12,312,110,556    12,312,110,556
STD:       1,355,766,128     1,543,443,593
MEAN:              1,645             1,645
STDERR:              173               192

The difference in the SAS and SUDAAN standard deviations and standard errors is due to how each system handles situations in which there is only one PSU in a stratum. As is shown in (16), for SAS the variance in that stratum "is estimated by zero". In contrast SUDAAN will estimate the variance for that stratum using the difference between that stratum's value and the overall mean for the population.


Missing Values for the Analysis Variable (missing values at the PSU level)

Now we'll look at an example from the Nationwide Inpatient Sample (NIS) to illustrate SAS-SUDAAN differences in situations where missing values for the analysis variable cause some PSUs within strata to be missing. First we'll sketch some of the basic properties of the NIS.

The HCUP Nationwide Inpatient Sample (NIS)

The NIS is a database of hospital inpatient stays. It is part of the Healthcare Cost and Utilization Project (HCUP), sponsored by AHRQ, which develops databases and related software tools through a Federal-State-Industry partnership. The 2002 NIS has records for more than 7 million stays for 995 U.S. hospitals. The NIS includes data on diagnoses, procedures, discharges, patient demographics, payment source, total charges, length of stay, hospital characteristics, etc. See "Calculating Nationwide Inpatient Sample Variances" available from the HCUP section of the AHRQ Web site.

The 2002 NIS hospital universe consists of all U.S. hospitals open during any part of 2002 and defined as community hospitals by the American Hospital Association's Annual Survey of Hospitals. The NIS is a stratified sample of hospitals. Within each stratum a systematic random sample of hospitals is drawn. The NIS sample resembles a stratified single-stage cluster sample with hospitals as the clusters. The stratum variable is NIS_STRATUM; the cluster variable is HOSPID; the weight variable is DISCWT. There are 995 clusters nested within 60 strata.

Computing the Mean and Standard Error for 2002 NIS Discharges

The 2002 NIS core file has approximately 7.8 million records. Of these a little over 190,000 have missing values for the variable TOTCHG (total charges). When these 190,000 records are excluded from the analysis, 10 hospitals (clusters) are removed. The syntax of PROC SURVEYMEANS to compute the mean and standard error for TOTCHG is given in (20). The output statistics are given in (21).
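The hospitals that drop out can be identified by looking for clusters with no non-missing TOTCHG value. This is a minimal sketch, assuming the core file is the data set nis2002c used in the examples here:

```sas
/* Hospitals (clusters) in which every record has a missing TOTCHG.
   count(TOTCHG) counts only non-missing values.                   */
proc sql;
  select NIS_STRATUM, HOSPID, count(*) as n_records
    from nis2002c
    group by NIS_STRATUM, HOSPID
    having count(TOTCHG) = 0;
quit;
```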

(20)
proc surveymeans data=nis2002c nobs nmiss mean stderr;
   strata NIS_STRATUM / list;
   cluster HOSPID;
   weight DISCWT;
   var TOTCHG;
run;

(21) MEAN: 17,263 STDERR: 429

The SUDAAN PROC DESCRIPT code is shown in (22), with both the SAS and SUDAAN mean and standard error in (23). We show the standard errors here to 6 decimal places for comparison with the output shown in (26) below.

(22)
proc descript data=nis2002c;
   nest NIS_STRATUM HOSPID;
   weight DISCWT;
   var TOTCHG;
run;


(23)
            SAS Output    SUDAAN Output

MEAN:           17,263           17,263
STDERR:     428.581301       430.592652

The reason for the difference in standard error values is the different assumptions made by SAS and SUDAAN concerning the treatment of missing values (see SAS FAQ # 1813 available online at SAS Technical Support). SAS survey analysis procedures assume that missing values for analysis variables are missing completely at random and delete them before running the analysis. In contrast the SUDAAN default analysis assumes that these values are not missing completely at random and runs a domain analysis focused only on those records with non-missing values, i.e. the subset (or, domain) of respondents for the analysis variable. Note that this is a distinct situation from what we saw in the previous section. Here (as the stratum information table would show) all strata have at least 2 clusters.

You can easily demonstrate that the difference in standard errors is due to the missing values for the analysis variable (TOTCHG) by recoding the missing values to zero (or some other value) and re-running the SAS and SUDAAN code. Of course the mean and standard error will be different than shown above, but now SAS and SUDAAN will produce identical output. Clearly this new output is meaningless analytically, but it does pinpoint the SAS-SUDAAN difference.
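The recode for this demonstration could look like the following sketch; nis2002c_zero is a hypothetical data set name, and the result is intended only for the diagnostic rerun, not for analysis.

```sas
/* nis2002c_zero is a hypothetical output data set name.
   missing() also catches special missing values such as .A or .C. */
data nis2002c_zero;
  set nis2002c;
  if missing(TOTCHG) then TOTCHG = 0;
run;
```

Pointing both the SURVEYMEANS and DESCRIPT runs at this data set removes the missing values from the picture, so any remaining difference would have to come from elsewhere.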

We can simulate the SUDAAN approach in SURVEYMEANS by using the DOMAIN statement to explicitly distinguish missing and non-missing TOTCHG values. First we need to create a categorical variable INDICATOR (see also the SURVEYMEANS documentation for analyzing data with missing values). Then we can run SURVEYMEANS with the code in (25).

(24)
data nis2002cv2;
   set nis2002c;
   if TOTCHG not in (., .A, .C) then INDICATOR = 'NON-MISSING';
   else INDICATOR = 'MISSING';
run;

(25)
proc surveymeans data=nis2002cv2;
   strata NIS_STRATUM;
   cluster HOSPID;
   weight DISCWT;
   var TOTCHG;
   domain INDICATOR;
run;

This will produce the output in (26).


(26)

Data Summary

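Number of Strata                  60
Number of Clusters               995
Number of Observations       7853982
Sum of Weights            37804053.9

Statistics

Variable          N   N Miss    Mean   Std Error of Mean
--------------------------------------------------------
TOTCHG      7661881   192101   17263          428.581301
--------------------------------------------------------

Domain Analysis: INDICATOR

INDICATOR     Variable          N   N Miss    Mean   Std Error of Mean
----------------------------------------------------------------------
MISSING       TOTCHG            0   192101       .                   .
NON-MISSING   TOTCHG      7661881        0   17263          430.592652
----------------------------------------------------------------------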

The standard error in the first statistics row is the one SURVEYMEANS computes for the mean under the assumption that missing values are missing completely at random. The domain analysis section shows the standard error produced by the 'indicator' approach. The value of 430.592652 is exactly what is produced by SUDAAN's PROC DESCRIPT (see (23)). Although intuitively the numeric difference between the two values is not great, it is important to understand that (i) by default SAS and SUDAAN build different assumptions into the computation of standard errors in cases where there are cluster-level missing values, and (ii) that SURVEYMEANS, through the use of the DOMAIN statement, offers a way to incorporate into the analysis the non-default assumption that missing values are not missing completely at random.

Summary

In this paper we have compared how SAS and SUDAAN estimate variance for totals and means in four situations:

(27)
a. The analysis includes all survey respondents and all respondents have non-missing values for the analysis variable (the analysis of total expenditures in MEPS).
b. Subgroup analyses.
c. The analysis includes all respondents but some respondents have missing values for the analysis variable causing some strata to have only one PSU (the analysis of workers' compensation expenditures in MEPS).
d. The analysis includes all respondents but some respondents have missing values for the analysis variable causing some PSUs to have missing values, but all strata have 2 or more PSUs (the analysis of total charges in NIS).

When the use of the Taylor method for variance estimation in SAS and SUDAAN is compared for the first type of analysis, we see that the two software packages produce identical results. For situations where the interaction of the input data properties and the selected analysis produce strata with only one PSU (cluster), SAS and SUDAAN produce different results. This is due to the different methods of variance estimation in SAS and SUDAAN for these cases. SAS will estimate variance by zero and SUDAAN, for each stratum with only one PSU, will use the difference between that stratum's value and the overall population mean.

Another SAS-SUDAAN difference was shown for analyses where missing values for the analysis variable occur at the cluster level. This difference is due to different assumptions concerning whether missing values are missing completely at random. SAS assumes complete randomness and deletes missing values prior to analysis. SUDAAN, on the other hand, implicitly adopts a domain analysis of missing and non-missing value groups for the analysis variable. The SUDAAN approach can be re-created in SAS by using the DOMAIN statement in SURVEYMEANS with the appropriate categorical variable for distinguishing the two groups.

References

"Calculating Nationwide Inpatient Sample Variance," available at: http://www.hcup-us.ahrq.gov/db/nation/nis/reports/NISVariance2001_Revised%20031904.pdf.

"Computing Standard Errors for MEPS Estimates," available at: http://www.meps.ahrq.gov/FactSheets/FS_StandardErrors.HTM.

"Overview of Survey Data Analysis Using SAS Software". Tony An. Proceedings of NESUG 16 (2003).

"What assumptions are made about the treatment of missing values (nonresponse)? How are missing values handled," available at http://support.sas.com/faq/018/FAQ01813.html.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

SUDAAN® is a registered trademark of Research Triangle Institute.

Acknowledgements

We want to thank Fred Rohde for his excellent brownbag presentation at Social & Scientific Systems on the Taylor method of variance estimation with complex surveys. We have benefited from discussions with Zhengyi Fang, Devi Katikineni, Shiqiang Li and Lita Manuel about variance estimation using SAS and SUDAAN. We'd also like to thank Fred Rohde and Bryan Sayer for comments on a draft of this paper.

Contact Information

Xiuhua Chen
Paul Gorrell
Social & Scientific Systems, Inc.
8757 Georgia Avenue
Silver Spring, MD 20910
[email protected] [email protected]


Appendix

PROC SURVEYMEANS Output for the Program Code in Example (2)

The SURVEYMEANS Procedure

Data Summary

Number of Strata                               145
Number of Clusters                             557
Number of Observations                       33556
Number of Observations Used                  32122
Number of Obs with Nonpositive Weights        1434
Sum of Weights                           284247327

Stratum Information

         VARIANCE
         ESTIMATION
Stratum  STRATUM
  Index  - 2001      N Obs  Variable      N  Clusters
-----------------------------------------------------
      1           1    556  TOTEXP01    528        15
      2           2     86  TOTEXP01     85         2
      3           3    159  TOTEXP01    158         2
      4           4    179  TOTEXP01    171         2
      5           5    220  TOTEXP01    214         2
      6           6    123  TOTEXP01    114         2
      7           7    140  TOTEXP01    133         2
      8           8    158  TOTEXP01    146         2
      9           9    168  TOTEXP01    159         3
     10          10     64  TOTEXP01     63         2
     11          11    326  TOTEXP01    316         2
     12          12     97  TOTEXP01     95         2
     13          13     85  TOTEXP01     81         2
     14          14    171  TOTEXP01    161         2
     15          15     62  TOTEXP01     62         2
     16          16    116  TOTEXP01    109         2
     17          17    294  TOTEXP01    285         2
     18          18     84  TOTEXP01     83         2
     19          19    148  TOTEXP01    145         2
     20          20    136  TOTEXP01    129         4
     21          21    455  TOTEXP01    447         2
     22          22    148  TOTEXP01    140         2
     23          23     75  TOTEXP01     72         2
     24          24    117  TOTEXP01    113         2
     25          25    237  TOTEXP01    232         2
     26          26    125  TOTEXP01    121         2
     27          27     70  TOTEXP01     66         2
     28          28     90  TOTEXP01     88         2
     29          29    143  TOTEXP01    138         2
     30          30    228  TOTEXP01    216         2
     31          31   2033  TOTEXP01   1961        59
     32          32    232  TOTEXP01    228         9
     33          33    164  TOTEXP01    161         2
     34          34    157  TOTEXP01    148         2
     35          35    444  TOTEXP01    422         2
     36          36    274  TOTEXP01    262         2
     37          37     69  TOTEXP01     69         2
     38          38     95  TOTEXP01     92         2
     39          39    309  TOTEXP01    300         2
     40          40    626  TOTEXP01    602         2
     41          41     90  TOTEXP01     88         2
     42          42    200  TOTEXP01    198         2
     43          43    155  TOTEXP01    149         2
     44          44    887  TOTEXP01    853        29
     45          45     95  TOTEXP01     92         2
     46          46    167  TOTEXP01    158         2
     47          47    114  TOTEXP01    110         2
     48          48    122  TOTEXP01    119         2
     49          49    110  TOTEXP01    107         2
     50          50    122  TOTEXP01    117         2
     51          51    166  TOTEXP01    156         2
     52          52    143  TOTEXP01    140         2
     53          53     81  TOTEXP01     78         2
     54          54    554  TOTEXP01    526         2
     55          55    102  TOTEXP01    102         2
     56          56    732  TOTEXP01    677        18
     57          57    968  TOTEXP01    937        22
     58          58    490  TOTEXP01    472         2
     59          59     93  TOTEXP01     91         2
     60          60    543  TOTEXP01    512        15
     61          61    164  TOTEXP01    157         2
     62          62    233  TOTEXP01    219        10
     63          63    341  TOTEXP01    326         2
     64          64     63  TOTEXP01     61         2
     65          65    385  TOTEXP01    371         2
     66          66     75  TOTEXP01     66         2
     67          67    143  TOTEXP01    136         2
     68          68    144  TOTEXP01    132         2
     69          69     98  TOTEXP01     98         2
     70          70     80  TOTEXP01     73         2
     71          71     53  TOTEXP01     49         2
     72          72    116  TOTEXP01    105         2
     73          73    136  TOTEXP01    130         2
     74          74    158  TOTEXP01    154         2
     75          75    348  TOTEXP01    344         2
     76          76     76  TOTEXP01     76         2
     77          77    122  TOTEXP01    113         2
     78          78     76  TOTEXP01     76         2
     79          79    111  TOTEXP01    109         2
     80          80     92  TOTEXP01     91         2
     81          81    229  TOTEXP01    222         2
     82          82    226  TOTEXP01    221         2
     83          83    144  TOTEXP01    138         2
     84          84    271  TOTEXP01    254         2
     85          85    107  TOTEXP01    105         2
     86          86    259  TOTEXP01    248         2
     87          87     73  TOTEXP01     73         2
     88          88    289  TOTEXP01    275         2
     89          89     77  TOTEXP01     77         2
     90          90    402  TOTEXP01    386         2
     91          91     82  TOTEXP01     80         2
     92          92    123  TOTEXP01    119         2
     93          93    169  TOTEXP01    162         2
     94          94    192  TOTEXP01    186         2
     95          95    173  TOTEXP01    170         2
     96          96    301  TOTEXP01    293         2
     97          97    188  TOTEXP01    180         2
     98          98    624  TOTEXP01    606        19
     99          99    131  TOTEXP01    125         2
    100         100     86  TOTEXP01     82         2
    101         101    624  TOTEXP01    581        16
    102         102    190  TOTEXP01    173         3
    103         103     65  TOTEXP01     62         2
    104         104    115  TOTEXP01    115         2
    105         105    135  TOTEXP01    122         2
    106         106     90  TOTEXP01     87         2
    107         107    275  TOTEXP01    261         2
    108         108     81  TOTEXP01     81         2
    109         109    117  TOTEXP01    113         2
    110         110    316  TOTEXP01    308         2
    111         111    304  TOTEXP01    297         2
    112         112    136  TOTEXP01    124         2
    113         113     95  TOTEXP01     94         2
    114         114     98  TOTEXP01     92         2
    115         115    162  TOTEXP01    150         2
    116         116     94  TOTEXP01     82         2
    117         117    100  TOTEXP01     90         2
    118         118    350  TOTEXP01    319         2
    119         119    135  TOTEXP01    131         2
    120         120    243  TOTEXP01    229         2
    121         121    106  TOTEXP01    101         2
    122         122     79  TOTEXP01     76         2
    123         123    360  TOTEXP01    331         2
    124         124     81  TOTEXP01     78         2
    125         125    107  TOTEXP01     97         2
    126         126    157  TOTEXP01    149         2
    127         127    283  TOTEXP01    259         9
    128         128    373  TOTEXP01    354         2
    129         129    173  TOTEXP01    166         2
    130         130    146  TOTEXP01    145         2
    131         131     73  TOTEXP01     73         2
    132         132    183  TOTEXP01    179         2
    133         133    120  TOTEXP01    116         2
    134         134    759  TOTEXP01    720        16
    135         135    104  TOTEXP01    102         2
    136         136    106  TOTEXP01    100         2
    137         137   2617  TOTEXP01   2513        52
    138         138    148  TOTEXP01    142         2
    139         139    139  TOTEXP01    132         2
    140         140     64  TOTEXP01     60         2
    141         141     90  TOTEXP01     89         2
    142         142    164  TOTEXP01    155         2
    143         143    145  TOTEXP01    132         2
    144         144     99  TOTEXP01     90         2
    145         145    598  TOTEXP01    567         2
-----------------------------------------------------

Data Summary

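Number of Observations                       33556
Number of Observations Used                  32122
Number of Obs with Nonpositive Weights        1434
Sum of Weights                           284247327

Statistics

Variable        N   N Miss          Mean   Std Error of Mean            Sum       Std Dev
-----------------------------------------------------------------------------------------
TOTEXP01    32122        0   2555.359713           54.631583   726354166775   20005091265
-----------------------------------------------------------------------------------------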
