CONTRIBUTED RESEARCH ARTICLES 37

Transitioning to : Replicating SAS, , and SUDAAN Analysis Techniques in Health Policy Data by Anthony Damico recommended code examples for selected propri- etary statistical packages with those of the survey Abstract: Statistical, data manipulation, and package available in R, using data from the Medical presentation tools make R an ideal integrated Expenditure Panel Survey–Household Component package for research in the fields of health pol- (MEPS-HC).2 icy and healthcare management and evaluation. However, the technical documentation accom- panying most data sets used by researchers in Medical Expenditure Panel Survey– these fields does not include syntax examples for Household Component analysts to make the transition from another sta- • Nationally representative (non-institutionalized) tistical package to R. This paper describes the sample of health services utilization in the steps required to import health policy data into United States R, to prepare that data for analysis using the two most common complex survey variance calcu- • Administered by the Agency for Healthcare lation techniques, and to produce the principal Research and Quality (AHRQ) set of statistical estimates sought by health pol- • Panel survey design allows tracking of all indi- icy researchers. Using data from the Medical Ex- viduals and families for two calendar years penditure Panel Survey Household Component (MEPS-HC), this paper outlines complex survey • Person-level and event-level annual data avail- data analysis techniques in R, with side-by-side able since 1996 comparisons to the SAS, Stata, and SUDAAN statistical software packages. MEPS-HC data provides a useful basis for cross- package comparisons because variances can be es- timated using both Taylor-series linearization and Introduction replicate weighting methods, unlike most complex sample survey data sets which recommend only one Health-care researchers use survey data to evaluate or the other. Presenting a step-by-step methodology the state of the market and to inform the creation and with the MEPS-HC data set allows more convenient implementation of new policies. America’s contin- generalization to other data sets. uing discussion about the health-care system high- lights the unique role of population-generalizable data as a gauge for estimating how proposed leg- Data preparation islation will affect specific demographic groups and the $2.2 trillion health-care industry.1 Despite pow- AHRQ provides all downloadable MEPS files in both erful statistical analysis, data manipulation, and ASCII (.txt) and SAS transport (.xpt and .ssp) for- presentation capabilities, R has not become widely mats, along with programming statements for SAS adopted as a statistical package in the fields of health and SPSS. The foreign package imports SAS trans- policy or health-care management and evaluation. port files with minimal effort: User guidelines and other technical documents for > library(foreign) government-funded and publicly available data sets > h105 <- read.xport("h105.ssp") rarely provide appropriate code examples to R users. The objective of this paper is to describe the steps re- Reading larger data sets with the read.xport quired to import health policy data into R, to prepare function may overburden system memory.3 If a that data for analysis using the two most common memory error occurs, SAS transport files can be con- complex survey variance calculation techniques, and verted to tab-delimited (.txt) files using the free SAS to produce the principal set of statistical estimates System Viewer® and then alternatively imported sought by health policy researchers. with the read.table function: This article compares and standard errors produced by technical documentation- > h105 <- read.table("h105.dat", header=TRUE ) 1http://www.cms.hhs.gov/NationalHealthExpendData/downloads/highlights.pdf 2http://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files.jsp 3http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 38 CONTRIBUTED RESEARCH ARTICLES

Should memory errors continue to occur due to a > #combine ex (1) & vg health (2) large number of variables in the data set, database- > # into the same response option backed survey design objects allow R to access a > meps06[, "RTHLTH53"] <- meps06[ ,"RTHLTH53"] - 1 table stored within a relational database using ei- > meps06[meps06[ ,"RTHLTH53"] == 0, "RTHLTH53"] <- ther DBI or ODBC.4 This alternative does not con- + 1 serve memory in data sets that are large due to > #recode negatives to missing (NA) record count. The table is accessed directly by the > meps06[meps06[ , "RTHLTH53"] < 0, "RTHLTH53"] <- svydesign function (discussed later), through the + NA addition of a dbname argument to specify the loca- tion and a dbtype argument to specify the database driver. The addition of missing values to any variable After successful import, unnecessary variables within a data frame requires that subsequent anal- should be eliminated from the data frame to in- yses of the variable include the na.rm = TRUE argu- crease processing speed. The select argument to ment in order to appropriately exclude those obser- the subset command acts as the SAS KEEP option vations. in a DATA step or the Stata keep command, by drop- At the conclusion of all data transformations, all ping all unspecified fields. Distinct subpopulation variables that one wishes to display in weighted col- data frames could also be created with the subset umn frequencies should be converted to a character command; however, observations should not be ex- string in a new variable. The rationale behind this cluded until the design object has been created, in conversion is discussed in the Analysis section. order to preserve the entire complex sample design.5 > #char vars for region & health stat. > #limit to specified fields > meps06 <- transform(meps06, > meps06 <- subset(h105, + cRTHLTH53 = as.character(RTHLTH53)) + select = c(DUPERSID, PANEL, PERWT06F, VARSTR, > meps06 <- transform(meps06, + VARPSU, AGE06X, TOTEXP06, REGION06, RTHLTH53)) + cREGION06 = as.character(REGION06)) > #clear memory of full data frame > rm(h105)

All other recoding and re-categorization should Survey design for Taylor-series lin- occur at this stage. The example below creates a cat- earization egorical variable based on the age of the individual using the cut function to delineate the appropriate The estimation weight (PERWT06F), stratum intervals. (VARSTR), and primary unit (VARPSU) fields make up the MEPS survey design and allow > #recode linear to categorical > # below 25 ; 25-34 ; 35-44 ; 45-54 ; for the Taylor-series linearization method of - > # 55-64 ; 65-74 ; 75-84 ; 85+ dard error computation. The survey package can > meps06[, "AGECAT"] <- cut(meps06[, "AGE06X"], then be used to create a complex sample survey de- + br = c(-1, 24 + 10*0:6, Inf), labels = 1:8) sign object for each data set. After creation, the com- plex survey design object replaces the R data frame Note that the labels argument can directly pro- as the target of all analysis functions. duce character strings rather than numeric cate- gories. This example creates the same categories as > library(survey) above, with character labels. > #create main tsl survey design obj. > #alternate character labels > meps.tsl.dsgn <- svydesign(id = ~ VARPSU, > meps06[, "cAGECAT"] <- cut(meps06[, "AGE06X"], + strata = ~ VARSTR, weights = ~ PERWT06F, + br = c(-1, 24+10*0:6, Inf), + data = meps06, nest = TRUE) + labels = c('Below 25', '25-34', '35-44', + '45-54', '55-64', '65-74', '75-84', '85+')) Survey design objects store (rather than point to) Once all needed variables have been created, in- data sets at the time of design object creation; any applicable or invalid values should be converted to modifications of the meps06 data frame in subsequent missing. This example combines the ’Excellent’ and steps require the above re-assignment of the design ’Very good’ self-reported health status responses, objects in order to see those changes in subsequent then assigns incomplete records as missing values.6 analysis commands. 4http://faculty.washington.edu/tlumley/survey/svy-dbi.html 5http://faculty.washington.edu/tlumley/survey/example-domain.html 6Original Self-Reported Health Status Values in MEPS: (1) Excellent; (2) Very good; (3) Good; (4) Fair; (5) Poor; (-9) Not Ascertained; (-8) Don’t Know; (-7) Refused; (-1) Inapplicable.

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 39

Taylor-series linearization cross- > brr <- read.xport("h36b06.ssp") package comparisons > #merge half-sample with main file > meps.brr <- merge(meps06, brr, Many health policy data sets with complex sample + by.x = c("DUPERSID", "PANEL"), designs use the Taylor-series linearization method + by.y = c("DUPERSID", "PANEL"), for computation. After the appropri- + all=FALSE) ate survey design object has been created, point esti- mates can be obtained. Note that if additional variables need to be re- moved from a data frame, the select argument cap- > #total adult expendure & se tures all fields in between two fields separated by a > svymean(~ TOTEXP06, colon - a shortcut particularly valuable for maintain- + subset(meps.tsl.dsgn, AGE06X > 17)) ing 64 replicate weight variables. > #total nonelderly adult expenditure > meps.brr.sub <- subset(meps.brr, > svymean(~ TOTEXP06 , + select = c(DUPERSID, PANEL, PERWT06F, + subset(meps.tsl.dsgn, AGE06X > 17 + VARSTR, VARPSU, AGE06X, AGECAT, + & AGE06X < 65)) + TOTEXP06, REGION06, cREGION06, + RTHLTH53, cRTHLTH5, BRR1:BRR64)) As seen in Table1, R accurately replicates the es- timates derived by each of the other statistical pack- The survey replicate weight function ages, SAS, Stata, and SUDAAN. The programming svrepdesign allows the creation of a replicate code for each of these other three packages matches weight-based survey design object. The repweights the AHRQ recommendations found in their data argument requires the identification of the 64 repli- documentation. 7 cate weight fields in the data frame, obtained in Since policy research and evaluations often dic- the code below by isolating all fields containing tate that adults and children be studied separately, the string "BRR" with the grep function. Half- this paper will chiefly analyze the data set after re- sample replication techniques generally rely on the stricting the sample to individuals aged 18 or above. replicate weights equation BRRXwt = BRRX * 2 * Although used throughout this paper, note that the PERWT06F, where X is 1:64 and each BRRX field is subset design object embedded in the functions a binary variable containing an equal number of ze- above could be re-created as a pointer. roes and ones that are randomly distributed across the records in the data set. If the series of replicate > x <- subset(meps.tsl.dsgn, AGE06X > 17) weights has already been computed to BRRXwt, the combined.weights argument should be set to TRUE. The downloadable MEPS BRR file contains only bi- Data preparation and survey design nary BRRX variables, indicating that R must take the for replicate weighting extra step of computing the BRRXwt weights.9 > #create meps brr survey design obj. Note: Analysts solely interested in Taylor-series lineariza- > meps.brr.dsgn <-svrepdesign(data = meps.brr, tion can skip this section. Standard errors in MEPS- + repweights = HC can be calculated without the Balanced Repeated + meps.brr[ ,grep("^BRR", colnames(meps.brr))], Replication (BRR) technique; however, these survey + type = "BRR", combined.weights = FALSE, design steps can be generalized to other health pol- + weights = meps.brr$PERWT06F) icy data sets that recommend replicate weighting, including the Consumer Expenditure Survey (CE) and the Survey of Income and Program Participation Replicate weighting cross-package (SIPP). Replicate weighting designs provide the dual comparisons advantages of allowing quantile point estimate stan- dard errors to be calculated with relative ease and of Note: Analysts solely interested in Taylor-series lineariza- maintaining accurate variances when a subpopula- tion can skip this section. Data sets used for health tion under examination has been so restricted as to policy research often require the calculation of stan- 8 contain a stratum with only one PSU. The code be- dard errors using replicate weights. Using the same low imports the MEPS-HC Replicates file and joins svymean function, the point estimate can be obtained that file with the adult file and the non-elderly adult again by simply changing the survey design object. file. > #brr method: total $ mean & se > #load the half-sample data > svymean(~ TOTEXP06, > library(foreign) + subset(meps.brr.dsgn, AGE06X > 17)) 7http://www.meps.ahrq.gov/mepsweb/survey_comp/standard_errors.jsp 8http://faculty.washington.edu/tlumley/survey/exmample-lonely.html 9http://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-036BRR

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 40 CONTRIBUTED RESEARCH ARTICLES

Table 1: Comparison of Taylor-Series Linearization Standard Error Calculation Methods

R SAS Stata SUDAAN

All Code svymean(˜TOTEXP06, proc surveymeans svyset proc descript Adults subset( data=meps.adult; [pweight=PERWT06F], data="adult" meps.tsl.dsgn, stratum varstr; strata(VARSTR) filetype=sasxport AGE06X > 17)) cluster varpsu; psu(VARPSU) svy: design=wr weight perwt06f; mean TOTEXP06 notsorted; nest var totexp06; varstr varpsu; weight perwt06f; var totexp06; Mean 4009.1 4009.1 4009.1 4009.1

SE 82.4 82.4 82.4 82.4

Non- Code svymean(˜TOTEXP06, proc surveymeans svyset proc descript Elderly subset( data=meps.adult; [pweight=PERWT06F], data = "adult" Adults meps.tsl.dsgn, where age06x < 65; strata(VARSTR) filetype=sasxport AGE06X > 17 & stratum varstr; psu(VARPSU) svy: design=wr AGE06X < 65)) cluster varpsu; mean TOTEXP06 if notsorted; nest weight perwt06f; AGE06X < 65 varstr varpsu; var totexp06; weight perwt06f; var totexp06; subpopn age06x < 65; Mean 3155.4 3155.4 3155.4 3155.4

SE 76.2 76.2 76.2 76.2

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 41

As seen in Table2, the survey design object cre- > # (ex. poor health 3.26%) ated by R matches the Stata variance estimation pre- > a <- svymean(~ cRTHLTH53, cisely where the mean-squared error (MSE) option + subset(meps.tsl.dsgn, AGE06X > 17), has not been selected; this estimate is derived from + na.rm = TRUE) the sum of squares of the replicate estimates centered > a at the mean of those estimates. Stata’s variance esti- Although this paper focuses on data manipula- mation using the MSE option and SUDAAN’s built- tion and analysis, R includes powerful formatting in replicate weighting algorithm both center on the and presentation tools. Individuals interested in con- actual point estimate rather than the mean of the verting simple tables — like the one produced by the replicate estimates, resulting in a slightly larger con- code below — into more advanced graphics should fidence interval. The choice between these two tech- research the R packages graphics, lattice, grid, grDe- niques does not have any theoretical considerations, vices, as well as LAT X converters like Hmisc and and the results are almost always similar. Though E xtable. marginally different, standard errors computed with R are equally valid as those computed with SU- > #label & format health status DAAN or with Stata’s MSE method. Two statistical > ftable(a, tests conducted using the standard errors from the + list(c("Excellent / Very good", different packages will almost always produce a nu- + "Good", "Fair", "Poor"))) merically different but substantively compatible re- sult. Variables not converted to character before the Comparison code for SAS was not included in creation of the design object can be hastily converted Table 2 because the base package requires the cal- using the as.character function within the svymean culation of replicate weighting estimates with user- command. Also notice the absence of the na.rm ar- written code. gument, since missing values were not assigned for any cases of the REGION06 variable.

# % adults living in census regions Analysis # (ex. northeastern 18.64%) > svymean(~ as.character(REGION06), The core functions included in R’s survey pack- + subset(meps.tsl.dsgn, AGE06X > 17)) age allow the rapid estimation and cross-tabulation of means, distributions, frequencies, and quantiles. Each of the discussed operations also produces stan- Weighted frequencies and totals dard errors, except for the quantile calculations using The svytotal and svytable functions can be used to Taylor-series linearization design objects. calculate both weighted aggregate values and popu- lation estimates.

Unweighted frequencies > #total 2006 health expenditures Before conducting analysis on the complex survey > #all non-institutionalized adults design object, simple unweighted frequencies can be > # ($892 billion) > svytotal(~ TOTEXP06, produced using the xtabs function. Unweighted fre- + subset(meps.tsl.dsgn, AGE06X > 17)) quencies only assure that sufficient cell sizes exist, meaning that the complex design does not need to > #num adults in census regions be accounted for. Therefore, this function refers to > # (ex. northeastern 41.47 million) the source data set, rather than the design object. > svytable(~ REGION06, + subset(meps.tsl.dsgn, AGE06X > 17)) > #calculate unweighted cell size > xtabs( ~ RTHLTH53 + REGION06, > #alternate method + subset(meps06, AGE06X > 7)) > svytotal(~ as.character(REGION06), + subset(meps.tsl.dsgn, AGE06X > 17)) Distributions Medians and quantiles As previously shown, the svymean function calcu- lates mean values for any numeric variable. When The svyquantile function calculates each of the per- given character variables, however, svymean com- centiles specified in the quantile argument. For ex- putes percentages of the total. Notice the presence ample, a value of 0.5 generates the median. How- of the na.rm = TRUE argument to eliminate missing ever, any number of quantiles can be produced with values from the results. by specifying a vector of values.

> #percent distribution w/ svymean > #median 2006 health expenditure > #determine % adults in each state > svyquantile(~ TOTEXP06,

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 42 CONTRIBUTED RESEARCH ARTICLES

Table 2: Comparison of Balanced Repeated Replication Standard Error Calculation Methods

R Stata Stata (MSE) SUDAAN

All Code svymean(˜TOTEXP06, svyset svyset proc descript Adults subset( [pweight=PERWT06F], [pweight=PERWT06F], data = "adult" meps.brr.dsgn, brrweight(BRR*x) brrweight(BRR*x) filetype=sasxport AGE06X > 17)) vce(brr) svy: vce(brr) mse svy: design=brr mean TOTEXP06 mean TOTEXP06 notsorted; repwgt BRR1 - BRR64 ; weight perwt06f; var totexp06; Mean 4009.1 4009.1 4009.1 4009.1

SE 83.0 83.0 83.6 83.6

Non- Code svymean(˜TOTEXP06, svyset svyset proc descript Elderly subset( [pweight=PERWT06F], [pweight=PERWT06F], data = "adult" Adults meps.brr.dsgn, brrweight(BRR*x) brrweight(BRR*x) filetype=sasxport AGE06X > 17 & vce(brr) svy: vce(brr) mse svy: design=brr AGE06X < 65)) mean TOTEXP06 if mean TOTEXP06 if notsorted; repwgt AGE06X < 65 AGE06X < 65 BRR1 - BRR64 ; weight perwt06f; var totexp06; subpopn age06x < 65; Mean 3155.4 3155.4 3155.4 3155.4

SE 77.0 77.0 77.5 77.5

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 43

+ subset(meps.tsl.dsgn, AGE06X > 17), 0.5) + subset(meps.tsl.dsgn, AGE06X > 17), svymean)

> #calculate expenditure percentiles > #stratified by health status > svyquantile(~ TOTEXP06, > # (ex. northeast poor hlth $18,038) + subset(meps.tsl.dsgn, AGE06X > 17), > svyby(~ TOTEXP06, ~ REGION06 + RTHLTH53, + c(0.1, 0.25, 0.5, 0.75, 0.9)) + subset(meps.tsl.dsgn, AGE06X > 17), svymean)

Although there are a number of other methods > #top quartile by health status for constructing an interval estimate, the Taylor- > # (ex. poor health $19,130) series linearization design object will not produce > svyby(~ TOTEXP06, ~ RTHLTH53, standard errors in svyquantile since the CDF is not + design = subset(meps.tsl.dsgn, AGE06X > 17), differentiable.10 Median standard errors can be ob- + FUN = svyquantile, quantiles = 0.75, ci = TRUE) tained normally if a replicate weighting design object exists for the data. Crosstabulation of distributions > #brr method: median health $ > svyquantile(~ TOTEXP06, In addition to the straightforward distributions pre- + subset(meps.brr.dsgn, AGE06X > 17), 0.5) sented using the svymean function, the svyby func- tion allows row and column percents to be calculated If replicate weights do not exist for the data set, using any desired factor variable(s). a jackknife replication design can be created by con- verting the Taylor-series linearization design object. > #dist of hlth stat by census region > svyby(~ cREGION06, ~cRTHLTH53, > #convert tsl to jackknife design + subset(meps.tsl.dsgn, AGE06X > 17), svymean) > meps.jackknife.dsgn <- + as.svrepdesign(meps.tsl.dsgn, "auto") > #dist of census region by hlth stat > svyby(~ cRTHLTH53, ~cREGION06, > #jackknife method: median health $ + subset(meps.tsl.dsgn, AGE06X > 17), svymean, > svyquantile(~ TOTEXP06, + na.rm = TRUE) + subset(meps.jackknife.dsgn, AGE06X > 17), 0.5) Unlike row and column percents, table percents Although this jackknife technique might produce do not require the svyby command and can be calcu- a ballpark standard error statistic, recommended lated using the interaction function. This approach quantile standard error calculations can vary de- is principally useful when one wishes to examine a pending on the data set and analysis method. Some sub-subpopulation in relation to the entire popula- textbooks simply estimate median standard error as 11 tion, rather than in relation to its parent subpopula- 25% larger than the mean standard error. There- tion. fore, technical documentation and/or data support staff should be consulted before relying on this sur- > #total % in each hlth stat & region vey design conversion. > svymean(~ interaction(cRTHLTH53, cREGION06), + subset(meps.tsl.dsgn, AGE06X > 17), Crosstabulation of totals, means, and + na.rm = TRUE) quantiles > #repeat above with weighted counts The svyby command repeats a specified function, it- > svytotal(~ interaction(cRTHLTH53, cREGION06), + subset(meps.tsl.dsgn, AGE06X > 17), erating across the set of distinct values in another fac- + na.rm = TRUE) tor variable also within the data frame. This func- tion is comparable to lapply, which runs across in- dividual elements contained in an external vector. Statistical tests Rather than running the indicated function on the en- tire data set, svyby analyzes discrete subpopulations. Though most individual estimates are more easily compared using a general difference formula, the > #health spending by census region svyttest function can be used to conduct quick two- > # (ex. northeast $173 billion) > svyby(~ TOTEXP06, ~ REGION06, sample t-tests on any binary value that can be gen- + subset(meps.tsl.dsgn, AGE06X > 17), svytotal) erated in the data set. For example, to assess statisti- cal differences in health spending between adults in > #regional mean spending on health a particular census region of the United States and > # (ex. northeast $4,171) adults living in all other regions, one can rapidly cre- > svyby(~ TOTEXP06, ~ REGION06, ate a binary variable to test for spending differences. 10http://faculty.washington.edu/tlumley/survey/html/svyquantile.html 11http://davidmlane.com/hyperstat/A106993.html

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 44 CONTRIBUTED RESEARCH ARTICLES

> #spending in northeast vs. others proven to work on MEPS can be generalized to the > svyttest(TOTEXP06 ~ (cREGION06 == 1), majority of available health policy data sets. + subset(meps.tsl.dsgn, AGE06X > 17 )) As a secondary goal behind contributing to the health policy analyst’s toolkit, this article has laid > #spending in midwest vs. others the groundwork for health-care survey administra- > svyttest(TOTEXP06 ~ (cREGION06 == 2), + subset(meps.tsl.dsgn, AGE06X > 17 )) tion organizations to follow the lead of the National Center for Health Statistics by including instructions Using this method, any Boolean expression can and example code for R in future versions of their be tested for differences between groups. The ex- technical documentation.12 amples below show that adults aged 45 to 54 do not have significantly higher or lower spending than the set of all other adults, while pre-retirees (aged Version notes 55 to 64r) do spend significantly more than all other adults. The calculations in this paper were performed with R version 2.9.0, using the R survey package ver- > #test 45-54 vs. all others sion 3.16. Statistical package comparisons were > svyttest(TOTEXP06 ~ (AGECAT == 4), calculated using SAS version 9.1.3 Service Pack 2, + subset(meps.tsl.dsgn, AGE06X > 17)) Stata/MP version 9.2, and SUDAAN release 10.0.0. > #test 55-64 vs. all others > svyttest(TOTEXP06 ~ (cAGECAT == '55-64'), + subset(meps.tsl.dsgn, AGE06X > 17 )) Acknowledgements

I gratefully acknowledge Dr. Thomas Lumley, Pro- Conclusion fessor of Biostatistics at the University of Washing- ton, Seattle and author of the R survey package for This article has outlined the capacity to prepare and assistance with the replication design methodology, accurately analyze health policy data with complex and Dr. Roger Peng, Professor of Biostatistics at survey designs using R. By highlighting the neces- Johns Hopkins School of Public Health for code edit- sary steps to transition from the proprietary statis- ing. tical package code provided in the technical docu- mentation of the MEPS-HC data, this article equips Anthony Damico R users with the conversion knowledge required to Statistical Analyst conduct important research on health policy data Kaiser Family Foundation sets. The data manipulation and analysis techniques [email protected]

12ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2008/srvydesc.pdf

The R Journal Vol. 1/2, December 2009 ISSN 2073-4859