1988: Sudaan: a Comprehensive Software Package For
SUDAAN: A COMPREHENSIVE SOFTWARE PACKAGE FOR SURVEY DATA ANALYSIS

Lisa M. LaVange, Babubhai V. Shah, Beth G. Barnwell and Joyce F. Killinger
Lisa M. LaVange, Research Triangle Institute

KEYWORDS: variance estimation, user interface, C language

ABSTRACT

The development of a survey data analysis software package that is flexible enough to meet the needs of survey statisticians, as well as research scientists not necessarily schooled in survey sampling, is underway at the Research Triangle Institute. The system was designed to provide significant enhancements to RTI's existing software package, SUDAAN. The new SUDAAN system is written in the C language and provides greater efficiency and portability. The software development strategy consists of developing and debugging stand-alone statistical procedures on an IBM PC with expanded capacity. The procedures are then implemented in two additional computing environments, VAX/VMS and IBM/OS. Variance estimation options and data analysis techniques not previously available in the RTI software package are among the many modifications incorporated into the new SUDAAN system. The purpose of this paper is to describe the design and development of this system.

INTRODUCTION

Scientists at the Research Triangle Institute have been involved in designing, developing, and maintaining software systems for the analysis of survey data for the past 17 years. Recently, RTI has been under contract to the Public Health Service (PHS) to develop a comprehensive software package that meets the needs of statistical analysts at the National Center for Health Statistics (NCHS) and PHS. In particular, this package must be flexible enough to handle most of the statistical designs used by these agencies. For this purpose, we have embarked on the task of developing a system that incorporates many of the features of RTI's existing survey data analysis system but also includes significant enhancements. The purpose of this paper is to describe in detail the design and development of this comprehensive software package. The immediate objective of the SUDAAN system design was to develop a series of procedures capable of performing statistical data analysis for complex sample surveys. The ultimate objective was to have a system that could aid in the design and evaluation of new statistical techniques.

RTI's existing software system consists of a series of procedures that produce statistical analyses in the form of weighted cross-tabulations, generalized ratio estimators, and linear and logistic regressions. All of the procedures were constructed using a unified set of FORTRAN subroutines that execute in the SAS computing environment on an IBM mainframe. In the course of developing a new system, we went through the process of rethinking our current design. Four goals emerged:

• Portability
• Reliability/numerical accuracy
• Computational efficiency
• Ease of modification/enhancement.

The system design for the comprehensive survey data analysis software, SUDAAN, evolved with these goals in mind. The first goal is being met by writing the new software entirely in the C programming language and testing simultaneously on several systems. Compilers for the C language are available on a variety of operating systems. The initial programs are currently executing under the TURBO C compiler on an IBM personal computer, the SVC compiler on a 68020, and under the UNIX operating system on a GOULD computer. Versions are also being tested on the VAX/VMS and IBM/VS.

The second goal is being met by building checks into the system to assure that definitions of objects are meaningful and computations are feasible. We have implemented numerically stable procedures for the accumulation of sums of squares and cross products and for matrix operations, including matrix inversion.

For achieving the third goal of improved efficiency, special care has been taken in the design to isolate tasks that are repeated numerous times, such as a numerical calculation on each variable of each record. These sections of code are optimized to the extent possible without using assembly language. Some of the optimizing algorithms include processing of sparse matrices and vectors, replacement of multiplication by table lookup with addition, elimination of repetitive (in the loop) "if" tests by constructing separate do loops, and many others.

An important product of the SUDAAN design effort has been the creation of a high-level statistical programming language. The SUDAAN language has been developed to provide the statistician/programmer with easy access to the software package, thus accomplishing the fourth goal listed above. Statistical procedures are currently being written in the SUDAAN language. Once they are finalized, some parts may be converted to C programs to maximize computing efficiency.

The SUDAAN design allows for four types of user interface as depicted in Figure I. The general user will be able to access the system through user-friendly procedures. Ultimately the procedures will be available through a statistical system, such as SAS, but the current design consists of stand-alone procedures that interface with a standard ASCII file format. The more sophisticated user can write SUDAAN programs for computations not available in the procedure library. This access feature is particularly well suited for testing modifications to existing procedures, such as the calculation of an alternative test statistic. A C programmer can access the system directly through C callable functions.

The key features of the SUDAAN system are described in detail in the following sections.

RTI SURVEY DATA ANALYSIS PROCEDURES

BACKGROUND

The collaboration of statisticians and
computer scientists at RTI, beginning in 1971 (e.g., Folsom, Bayless, and Shah, 1971) and continuing today, has resulted in a series of programs for survey data analysis that reflects state-of-the-art methodology in this field. All of RTI's analysis packages employ the Taylor series or Delta method of variance estimation for complex sample survey designs (Folsom, 1974). Each program currently marketed is designed to run within the SAS framework.

RTI is currently marketing the following procedures for general use:

• SESUDAAN
• RATIOEST
• RTIFREQS
• SURREGR
• RTILOGIT.

The SESUDAAN (Shah, 1981) procedure produces weighted estimates of means, totals, and proportions and their standard errors. RATIOEST (Shah, 1982) calculates weighted frequency distributions and percentages for cross-tabulations.

RTI's survey regression package, SURREGR (Holt, 1977), provides estimates of linear regression coefficients, their estimated variance-covariance matrix, and tests of hypotheses concerning the coefficients that are appropriate for a complex sample design. Logistic regression capabilities became available with the development of RTILOGIT (Shah et al., 1984). This procedure runs in conjunction with SAS PROC LOGIST and produces pseudo maximum likelihood estimates of the parameters and tests of hypotheses about these parameters.

New System Procedures

The SUDAAN system design consists of two phases of software development, each resulting in a series of statistical procedures. The Phase I procedures are primarily concerned with descriptive data analysis, including cross-tabulations, generalized ratio estimation, and quantile estimation. The Phase II design, currently in the planning stages, will incorporate several analytical procedures, including

quantities are used to produce Yates-Grundy-Sen variance estimates for this design option. The three variance options cover a wide range of sample designs including sampling with probabilities proportional to size at the first stage. Unequal probabilities of selection at a subsequent stage can be handled by forming pseudo strata at that stage since probabilities of selection can vary from stratum to stratum. In addition to these variance options, the user may request that the variance-covariance matrix for estimates within a table be computed for all procedures allowing for cross-tabulations. Estimates of variance-covariance matrices were previously available only in the SURREGR and RTILOGIT procedures.

Phase I software consists of three procedures: CROSSTAB, DESCRIPT, and RATIO. The CROSSTAB procedure is the RTIFREQS equivalent in the new SUDAAN system. In addition to computing weighted frequency and percentage distributions, Chi-square tests of independence for a two-way contingency table are produced. The tests are computed using a Wald statistic according to methods advocated by Koch, Freeman, and Freeman (1976). During Phase II, the Rao-Scott Satterthwaite corrected test statistic (Thomas and Rao, 1984 and 1985) will be computed in addition to the Wald statistic for goodness of fit tests. This statistic has been shown by Thomas and Rao to have the best characteristics with respect to both power and significance level when compared with other statistics correcting for the liberalness of the Wald test.

The DESCRIPT procedure computes estimates of means, totals, proportions, geometric means, quantiles, and their sampling errors. Estimates can be requested for the subdomains of user-specified cross-tabulations. Options are also available to the user for standardized and/or post-stratified means and proportions and their sampling errors. The standardizing and/or post-stratifying variables and their distributions are required as input.

RTI has recently completed a simulation study comparing two methods for estimating the vari-