A Comparative Evaluation of Selected Statistical Software for Computing Various Categorical Analyses
Nancy McDermott and Cynthia White
Social Science Computing Cooperative
University of Wisconsin - Madison

Introduction

This paper is a comparative evaluation of statistical software for computing various categorical analyses including logistic regression, multinomial logits, and loglinear analysis. The following statistical packages were included in the evaluation: The SAS System (version 6.09), STATA (version 3.1), SPSS (version 4.0), GLIM (version 3.n), and LIMDEP (version 6.0). Large data sets were selected for analysis. The code for the analyses is presented for each of the software packages. Important and unique features of the analyses are noted. Following the output, performance comparisons on a Sparc 10/512 MP running UNIX with 128Mb memory are provided. The UNIX time command was used to compare the performances of the statistical packages. The paper also makes some recommendations on the appropriate package to use in certain situations.

Logistic Regression

The data set used for the logistic regression is from the 1980 World Fertility Survey in the Cote d'Ivoire. The data represent responses to interviews of a stratified random sample of women ages 15-50 in the Cote d'Ivoire. A total of 5764 women were interviewed. However, for this analysis, only 4165 married, self-reporting fecund women were included in the analysis.

The dependent variable, WANTYES, was a woman's response when asked whether or not she wanted more children. The independent variables were AGE, number of children ever born (NUMLIV), percent of children who died (PERMORT), EDUCATION, RELIGION, and urbanization (CITY). RELIGION and CITY were categorical variables with three levels each.

Logistic Regression in the SAS System

The SAS System has several procedures which can carry out logistic regression including LOGISTIC, GENMOD, CATMOD, and NLIN. The LOGISTIC procedure was used for this discussion.

The following SAS code requests the logistic regression with WANTYES as the dependent variable. Among the regressors are RELIGD2, RELIGD3, CITYD2, and CITYD3 which are dummy variables created from the RELIGION and CITY categorical variables. Note that RELIGD1 and CITYD1 were not included in the model so that it would not be overdetermined.

    proc logistic descending;
      model wantyes=age numliv educ permort religd2
            religd3 cityd2 cityd3
            / risklimits lackfit ctable;

Unlike other statistical packages, by default PROC LOGISTIC models the probability that the event equals zero. To change this to model the probability that the event equals one as in other packages, specify the DESCENDING option on the PROC LOGISTIC statement. If you do not add the DESCENDING option, your parameter estimates may be opposite in sign of what you may get from other statistical packages.

The RISKLIMITS option on the MODEL statement requests confidence intervals for the conditional odds ratio. 95% confidence intervals are computed by default. The LACKFIT option on the MODEL statement requests the Hosmer-Lemeshow Goodness-of-Fit Test. Only one other statistical package (STATA) provided output for these two statistics.

Logistic Regression in STATA

The following STATA code requests the logistic regression:

    logistic wantyes age numliv educ permort religd2 religd3 cityd2 cityd3
    lfit, group(10)
    logit

The LFIT command requests the Hosmer-Lemeshow test for Goodness-of-Fit. Unlike the SAS System, STATA allows you to specify the number of groups to construct for the Hosmer-Lemeshow Goodness-of-Fit test. The SAS System uses approximately 10 groups when constructing the test. STATA reported a Hosmer-Lemeshow Goodness-of-Fit test of 8.42 for this example while the SAS System reported 9.865. The reason for the difference is unknown because both packages constructed the same number of groups after ordering on the predicted probabilities. You get the underlying coefficients for the odds ratios by typing LOGIT without arguments after the LOGISTIC command.

Logistic Regression in SPSS

SPSS has several commands which can carry out logistic regression including LOGISTIC REGRESSION, LOGLINEAR, HILOGLIN, and NLR. The LOGISTIC REGRESSION command was used for this discussion. Note that it is not necessary to create the dummy variables for the two categorical variables (RELIGION and CITY) because SPSS's LOGISTIC REGRESSION procedure generates them automatically. Only one other package (GLIM) offered this feature. The following SPSS code requests the logistic regression:

    logistic regression wantyes with age numliv educ
      permort religion city
      /external
      /categorical=religion city
      /contrast(religion)=simple(1)
      /contrast(city)=simple(1)

The /EXTERNAL subcommand was used to conserve memory. The /CATEGORICAL subcommand was included to declare the categorical variables. The two /CONTRAST subcommands were used to set the reference category to one so as to make the coefficients comparable to the output from the other statistical packages. There are four other types of contrasts available in SPSS: deviations from the overall effect, difference or reverse Helmert contrasts, Helmert contrasts, and polynomial contrasts.

The LOGISTIC REGRESSION command computes a Goodness-of-Fit test by default. However, this statistic is not appropriate for data like these because there are very few observations at each observed level of the covariate. Hence, the Goodness-of-Fit statistic does not have an approximate chi-squared distribution. A test similar to the Hosmer-Lemeshow Goodness-of-Fit tests computed by the SAS System and STATA would be much more appropriate for the data in this example. SPSS does not compute this test, however.

Logistic Regression in GLIM

GLIM fits generalized linear models, as defined by Nelder and Wedderburn (1972), which include logistic regression. You need to specify a binomial distribution function with a logit link function. The following commands fit the logistic model:

    $factor relig 3 city 3 $
    $yvar wantyes$
    $calc n=1$
    $error binomial n$
    $link g$
    $fit age+numliv+educ+permort+relig+city$
    $display e$

The FACTOR command defines which variables are categorical. Only one other package (SPSS) can generate the dummy variables automatically. The YVAR command specifies the dependent variable. You must set n equal to 1 with the CALC command because there is only one measurement on each person. The ERROR specification is binomial with the total number of observations on each person equal to 1. The LINK G command specifies that a logit link function will be used for the fit. The FIT command fits the model specified.

Finally, the DISPLAY directive instructs GLIM to display the results of the model fit. In this case, DISPLAY E instructs GLIM to display the parameter estimates and their standard errors, including extrinsically aliased parameters. GLIM does not provide a test similar to the Hosmer-Lemeshow Goodness-of-Fit tests computed by the SAS System and STATA.

Logistic Regression in LIMDEP

The easiest way to carry out a simple logistic regression in LIMDEP is with the LOGIT command. The LOGIT command carries out both binomial and multinomial logit models.

    logit; lhs=wantyes;
      rhs=one,age,numliv,educ,permort,religd2,religd3,
          cityd2,cityd3 $

LIMDEP does not provide a test similar to the Hosmer-Lemeshow Goodness-of-Fit tests computed by the SAS System and STATA.

Performance Comparisons for Logistic Regression

Each program for each package was run 10 times. The average time in seconds spent in execution of the program (not real time) is shown in the table below. Times could not be reported for SPSS because the time command did not accurately report these for SPSS.

Only a basic logistic regression was specified for each package. In order to make a fair comparison, the Hosmer-Lemeshow Goodness-of-Fit test was not specified because this test can be computationally intensive and thus inflate the times for the two packages that can compute the test.

    Package    Time in Seconds
    SAS          1.50 (1)
    STATA        3.11 (2)
    GLIM         6.09 (3)
    LIMDEP      31.22 (4)

The number in parentheses represents the package's relative rank for performance.

Recommendations for Logistic Regression

All of the statistical packages considered provided a simple procedure for computing a logistic regression. Unless you need a particular option or CPU time is a factor, it may be more convenient just to use the package with which you are most familiar. SAS's LOGISTIC procedure provided the most options, especially in the areas of criteria for assessing the fit of the model and rank correlation between the observed response and the predicted probabilities. If you only wanted a Goodness-of-Fit test for assessing fit, either the SAS System or STATA would be a good choice. Both packages compute the Hosmer-Lemeshow test. Also, the SAS System and STATA were the only packages that computed odds ratios with confidence intervals. SPSS computed the odds ratio but without confidence intervals.

SPSS provided most of the options that the SAS System and STATA did. One nice feature offered by SPSS that only one other package (GLIM) offered was the automatic construction of dummy variables for categorical independent variables. In addition, it provides five different types of contrasts for the categorical variables used for interpreting the coefficients in different ways.

The SAS System and STATA ranked highest in execution time. STATA's good performance was not unexpected because STATA puts all the data in memory instead of using swap space. Although this method of execution can put a huge drain on a machine's memory when a large job is executing, it usually means the package will execute jobs very quickly. What was unexpected was that the SAS System actually performed better than STATA for the larger analysis. The SAS System does make use of swap space instead of putting everything in memory. Both the SAS System and STATA appear to be good choices when CPU time is a factor. Because of slow performance, you may want to avoid LIMDEP for large problems.

Multinomial Logit Regression

SAS, STATA, and LIMDEP were the only packages compared for the multinomial logit runs.

Following is the MLOGIT command to fit the one-way model:

    mlogit occup x1-x31
    mlogit, rrr