Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health

CONTRIBUTED RESEARCH ARTICLES 37 Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data by Anthony Damico recommended code examples for selected propri- etary statistical packages with those of the survey Abstract: Statistical, data manipulation, and package available in R, using data from the Medical presentation tools make R an ideal integrated Expenditure Panel Survey–Household Component package for research in the fields of health pol- (MEPS-HC).2 icy and healthcare management and evaluation. However, the technical documentation accom- panying most data sets used by researchers in Medical Expenditure Panel Survey– these fields does not include syntax examples for Household Component analysts to make the transition from another sta- • Nationally representative (non-institutionalized) tistical package to R. This paper describes the sample of health services utilization in the steps required to import health policy data into United States R, to prepare that data for analysis using the two most common complex survey variance calcu- • Administered by the Agency for Healthcare lation techniques, and to produce the principal Research and Quality (AHRQ) set of statistical estimates sought by health pol- • Panel survey design allows tracking of all indi- icy researchers. Using data from the Medical Ex- viduals and families for two calendar years penditure Panel Survey Household Component (MEPS-HC), this paper outlines complex survey • Person-level and event-level annual data avail- data analysis techniques in R, with side-by-side able since 1996 comparisons to the SAS, Stata, and SUDAAN statistical software packages. MEPS-HC data provides a useful basis for cross- package comparisons because variances can be es- timated using both Taylor-series linearization and Introduction replicate weighting methods, unlike most complex sample survey data sets which recommend only one Health-care researchers use survey data to evaluate or the other. Presenting a step-by-step methodology the state of the market and to inform the creation and with the MEPS-HC data set allows more convenient implementation of new policies. America’s contin- generalization to other data sets. uing discussion about the health-care system highlights the unique role of population-generalizable data as a gauge for estimating how proposed leg- Data preparation islation will affect specific demographic groups and the $2.2 trillion health-care industry.1 Despite pow- AHRQ provides all downloadable MEPS files in both erful statistical analysis, data manipulation, and ASCII (.txt) and SAS transport (.xpt and .ssp) for- presentation capabilities, R has not become widely mats, along with programming statements for SAS adopted as a statistical package in the fields of health and SPSS. The foreign package imports SAS trans- policy or health-care management and evaluation. port files with minimal effort: User guidelines and other technical documents for > library(foreign) government-funded and publicly available data sets > h105 <- read.xport("h105.ssp") rarely provide appropriate code examples to R users. The objective of this paper is to describe the steps re- Reading larger data sets with the read.xport quired to import health policy data into R, to prepare function may overburden system memory.3 If a that data for analysis using the two most common memory error occurs, SAS transport files can be con- complex survey variance calculation techniques, and verted to tab-delimited (.txt) files using the free SAS to produce the principal set of statistical estimates System Viewer® and then alternatively imported sought by health policy researchers. with the read.table function: This article compares means and standard errors produced by technical documentation- > h105 <- read.table("h105.dat", header=TRUE ) 1http://www.cms.hhs.gov/NationalHealthExpendData/downloads/highlights.pdf 2http://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files.jsp 3http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 38 CONTRIBUTED RESEARCH ARTICLES Should memory errors continue to occur due to a > #combine ex (1) & vg health (2) large number of variables in the data set, database- > # into the same response option backed survey design objects allow R to access a > meps06[, "RTHLTH53"] <- meps06[ ,"RTHLTH53"] - 1 table stored within a relational database using ei- > meps06[meps06[ ,"RTHLTH53"] == 0, "RTHLTH53"] <- ther DBI or ODBC.4 This alternative does not con- + 1 serve memory in data sets that are large due to > #recode negatives to missing (NA) record count. The table is accessed directly by the > meps06[meps06[ , "RTHLTH53"] < 0, "RTHLTH53"] <- svydesign function (discussed later), through the + NA addition of a dbname argument to specify the loca- tion and a dbtype argument to specify the database driver. The addition of missing values to any variable After successful import, unnecessary variables within a data frame requires that subsequent anal- should be eliminated from the data frame to in- yses of the variable include the na.rm = TRUE argu- crease processing speed. The select argument to ment in order to appropriately exclude those obser- the subset command acts as the SAS KEEP option vations. in a DATA step or the Stata keep command, by drop- At the conclusion of all data transformations, all ping all unspecified fields. Distinct subpopulation variables that one wishes to display in weighted col- data frames could also be created with the subset umn frequencies should be converted to a character command; however, observations should not be ex- string in a new variable. The rationale behind this cluded until the design object has been created, in conversion is discussed in the Analysis section. order to preserve the entire complex sample design.5 > #char vars for region & health stat. > #limit to specified fields > meps06 <- transform(meps06, > meps06 <- subset(h105, + cRTHLTH53 = as.character(RTHLTH53)) + select = c(DUPERSID, PANEL, PERWT06F, VARSTR, > meps06 <- transform(meps06, + VARPSU, AGE06X, TOTEXP06, REGION06, RTHLTH53)) + cREGION06 = as.character(REGION06)) > #clear memory of full data frame > rm(h105) All other recoding and re-categorization should Survey design for Taylor-series lin- occur at this stage. The example below creates a cat- earization egorical variable based on the age of the individual using the cut function to delineate the appropriate The estimation weight (PERWT06F), stratum intervals. (VARSTR), and primary sampling unit (VARPSU) fields make up the MEPS survey design and allow > #recode linear to categorical > # below 25 ; 25-34 ; 35-44 ; 45-54 ; for the Taylor-series linearization method of stan- > # 55-64 ; 65-74 ; 75-84 ; 85+ dard error computation. The survey package can > meps06[, "AGECAT"] <- cut(meps06[, "AGE06X"], then be used to create a complex sample survey de- + br = c(-1, 24 + 10*0:6, Inf), labels = 1:8) sign object for each data set. After creation, the complex survey design object replaces the R data frame Note that the labels argument can directly pro- as the target of all analysis functions. duce character strings rather than numeric categories. This example creates the same categories as > library(survey) above, with character labels. > #create main tsl survey design obj. > #alternate character labels > meps.tsl.dsgn <- svydesign(id = ~ VARPSU, > meps06[, "cAGECAT"] <- cut(meps06[, "AGE06X"], + strata = ~ VARSTR, weights = ~ PERWT06F, + br = c(-1, 24+10*0:6, Inf), + data = meps06, nest = TRUE) + labels = c('Below 25', '25-34', '35-44', + '45-54', '55-64', '65-74', '75-84', '85+')) Survey design objects store (rather than point to) Once all needed variables have been created, in- data sets at the time of design object creation; any applicable or invalid values should be converted to modifications of the meps06 data frame in subsequent missing. This example combines the ’Excellent’ and steps require the above re-assignment of the design ’Very good’ self-reported health status responses, objects in order to see those changes in subsequent then assigns incomplete records as missing values.6 analysis commands. 4http://faculty.washington.edu/tlumley/survey/svy-dbi.html 5http://faculty.washington.edu/tlumley/survey/example-domain.html 6Original Self-Reported Health Status Values in MEPS: (1) Excellent; (2) Very good; (3) Good; (4) Fair; (5) Poor; (-9) Not Ascertained; (-8) Don’t Know; (-7) Refused; (-1) Inapplicable. The R Journal Vol. 1/2, December 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 39 Taylor-series linearization cross- > brr <- read.xport("h36b06.ssp") package comparisons > #merge half-sample with main file > meps.brr <- merge(meps06, brr, Many health policy data sets with complex sample + by.x = c("DUPERSID", "PANEL"), designs use the Taylor-series linearization method + by.y = c("DUPERSID", "PANEL"), for standard error computation. After the appropri- + all=FALSE) ate survey design object has been created, point estimates can be obtained. Note that if additional variables need to be re- moved from a data frame, the select argument cap- > #total adult expendure mean & se tures all fields in between two fields separated by a > svymean(~ TOTEXP06, colon - a shortcut particularly valuable for maintain- + subset(meps.tsl.dsgn, AGE06X > 17)) ing 64 replicate weight variables. > #total nonelderly adult expenditure > meps.brr.sub <- subset(meps.brr, > svymean(~ TOTEXP06 , + select = c(DUPERSID, PANEL, PERWT06F, + subset(meps.tsl.dsgn,

Load more