<<

The Stata Journal Volume 12 Number 2 2012

A Stata Press publication
StataCorp LP, College Station, Texas

The Stata Journal

Editors
H. Joseph Newton, Department of Statistics, Texas A&M University, College Station, Texas 77843; 979-845-8817; fax 979-845-6077; [email protected]
Nicholas J. Cox, Department of Geography, Durham University, South Road, Durham DH1 3LE, UK; [email protected]

Associate Editors
Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, Tübingen University, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, Univ. of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, London School of Economics and Political Science
Ulrich Kohler, WZB, Berlin
Frauke Kreuter, University of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, University of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt University, Edinburgh
Jeroen Weesie, Utrecht University
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager: Lisa Gilmore
Stata Press Copy Editor: Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository papers that link the use of Stata commands or programs to associated principles, such as those that will serve as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go “beyond the Stata manual” in explaining key features or uses of Stata that are of interest to intermediate or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could be of interest or usefulness to researchers, especially in fields that are of practical importance but are not often included in texts or other journals, such as the use of Stata in managing datasets, especially large datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata with topics such as extended examples of techniques and interpretation of results, simulations of statistical concepts, and overviews of subject areas.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com

The Stata Journal is indexed and abstracted in the following:

• CompuMath Citation Index
• Current Contents/Social and Behavioral Sciences®
• RePEc: Research Papers in Economics
• Science Citation Index Expanded (also known as SciSearch®)
• Scopus™
• Social Sciences Citation Index®

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions. This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.

Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting files understand that such use is made without warranty of any kind, by either the Stata Journal, the author, or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote free communication among Stata users.

The Stata Journal, electronic version (ISSN 1536-8734), is a publication of Stata Press. Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.

Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

Subscriptions mailed to U.S. and Canadian addresses:
1-year subscription                          $ 79
2-year subscription                          $155
3-year subscription                          $225
3-year subscription (electronic only)        $210
1-year student subscription                  $ 48
1-year university library subscription       $ 99
2-year university library subscription       $195
3-year university library subscription       $289
1-year institutional subscription            $225
2-year institutional subscription            $445
3-year institutional subscription            $650

Subscriptions mailed to other countries:
1-year subscription                          $115
2-year subscription                          $225
3-year subscription                          $329
3-year subscription (electronic only)        $210
1-year student subscription                  $ 79
1-year university library subscription       $135
2-year university library subscription       $265
3-year university library subscription       $395
1-year institutional subscription            $259
2-year institutional subscription            $510
3-year institutional subscription            $750

Back issues of the Stata Journal may be ordered online at http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may be ordered online. http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA. Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX 77845, USA, or emailed to [email protected].

Volume 12 Number 2 2012 The Stata Journal

Articles and Columns 169

A robust instrumental-variables estimator . . . . . . . . . . . . . R. Desbordes and V. Verardi   169
What hypotheses do “nonparametric” two-group tests actually test? . . . . . . . R. M. Conroy   182
From resultssets to resultstables in Stata . . . . . . . . . . . . . . . . . . . . . . . R. B. Newson   191
Menu-driven X-12-ARIMA seasonal adjustment in Stata . . . . . . . . . Q. Wang and N. Wu   214
Faster estimation of a discrete-time proportional hazards model with gamma frailty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. G. Farnworth   242
Threshold regression for time-to-event analysis: The stthreg package . . . . . . . . . . T. Xiao, G. A. Whitmore, X. He, and M.-L. T. Lee   257
Fitting nonparametric mixed logit models via expectation-maximization algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. Pacifico   284
The S-estimator of multivariate location and scatter in Stata . . . V. Verardi and A. McCathie   299
Using the margins command to estimate and interpret adjusted predictions and marginal effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R. Williams   308
Speaking Stata: Transforming the time axis . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox   332

Notes and Comments 342

Stata tip 108: On adding and constraining . . . . . . . . . . . . . . . . . . . . . . . M. L. Buis   342
Stata tip 109: How to combine variables with missing values . . . . . . . . P. A. Lachenbruch   345
Stata tip 110: How to get the optimal k-means cluster solution . . . . . . . . . . . A. Makles   347

Software Updates 352

The Stata Journal (2012) 12, Number 2, pp. 169–181

A robust instrumental-variables estimator

Rodolphe Desbordes
University of Strathclyde
Glasgow, UK
[email protected]

Vincenzo Verardi
University of Namur (Centre for Research in the Economics of Development) and
Université Libre de Bruxelles (European Center for Advanced Research in Economics and Statistics and Center for Knowledge Economics)
Namur, Belgium
[email protected]

Abstract. The classical instrumental-variables estimator is extremely sensitive to the presence of outliers in the sample. This is a concern because outliers can strongly distort the estimated effect of a given regressor on the dependent variable. Although outlier diagnostics exist, they frequently fail to detect atypical observations because they are themselves based on nonrobust (to outliers) estimators. Furthermore, they do not take into account the combined influence of outliers in the first and second stages of the instrumental-variables estimator. In this article, we present a robust instrumental-variables estimator, initially proposed by Cohen Freue, Ortiz-Molina, and Zamar (2011, Working paper: http://www.stat.ubc.ca/~ruben/website/cv/cohen-zamar.pdf), that we have programmed in Stata and made available via the robivreg command. We have improved on their estimator in two different ways. First, we use a weighting scheme that makes our estimator more efficient and allows the computation of the usual identification and overidentifying restrictions tests. Second, we implement a generalized Hausman test for the presence of outliers.

Keywords: st0252, robivreg, multivariate outliers, robustness, S-estimator, instrumental variables

1 Theory

Assume a model given by

    y = Xθ + ε    (1)

where y is the n × 1 vector containing the value of the dependent variable, X is the n × p matrix containing the values for the p regressors (constant included), and ε is the vector of the error term. Vector θ of size p × 1 contains the unknown regression parameters and needs to be estimated. On the basis of the estimated parameter θ, it is then possible to fit the dependent variable by ŷ = Xθ and estimate the residual vector

r = y − ŷ. In the case of the ordinary least-squares (LS) method, the vector of estimated parameters is

    θ_LS = argmin_θ r'r

The solution to this minimization leads to the well-known formula

    θ_LS = (X'X)^{-1} X'y = (nΣ_XX)^{-1} (nΣ_Xy)

which is simply, after centering the data, the product of the inverse of the p × p covariance matrix of the explanatory variables, Σ_XX, and the p × 1 vector of the covariances of the explanatory variables and the dependent variable, Σ_Xy (the n's simplify).¹

The unbiasedness and consistency of the LS estimates crucially depend on the absence of correlation between X and ε. When this assumption is violated, instrumental-variables (IV) estimators are generally used. The logic underlying this approach is to find some variables, known as instruments, that are strongly correlated with the troublesome explanatory variables, known as endogenous variables, but independent of the error term. This is equivalent to estimating the relationship between the response variable and the covariates by using only the part of the variability of the endogenous covariates that is uncorrelated with the error term.

More precisely, define Z as the n × m matrix (where m ≥ p) containing the instruments. The IV estimator (generally called two-stage least squares when m > p) can be conceptualized as a two-stage estimator. In the first stage, each endogenous variable is regressed on the instruments and on the variables in X that are not correlated with the error term. In the second stage, the predicted value for each variable is then fit (denoted X̂ here). In this way, each variable is purged of the correlation with the error term. Exogenous explanatory variables are used as their own instruments. These new variables are then replaced in (1), and the model is fit by LS. The final estimator is (again centering the data and recalculating the intercept term)

    θ_IV = (Σ_XZ Σ_ZZ^{-1} Σ_ZX)^{-1} Σ_XZ Σ_ZZ^{-1} Σ_Zy    (2)

where Σ_XZ is the covariance matrix of the original right-hand-side variables and the instruments, Σ_ZZ is the covariance matrix of the instruments, and Σ_Zy is the vector of covariances of the instruments with the dependent variable.
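The following minimal sketch (not from the original article) illustrates this two-stage logic with hypothetical variable names y, x_endog, x_exog, z1, and z2. It is for intuition only: the second-stage standard errors from this manual route are not valid, which is why one-step commands such as ivregress 2sls (or robivreg, presented below) are used in practice.

    * First stage: regress the endogenous variable on the excluded
    * instruments and the exogenous regressors, and save its fitted values.
    regress x_endog z1 z2 x_exog
    predict double x_endog_hat, xb
    * Second stage: replace the endogenous variable by its fitted values.
    regress y x_endog_hat x_exog
    * Equivalent one-step estimator with correct standard errors.
    ivregress 2sls y x_exog (x_endog = z1 z2)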

A drawback of the IV method is that if outliers are present, all the estimated covariances are distorted, even asymptotically. Cohen Freue, Ortiz-Molina, and Zamar (2011) therefore suggest replacing classical estimated covariance matrices in (2) with some robust counterparts that withstand the contamination. These could be minimum covariance determinant scatter matrices as presented in Verardi and Dehon (2010) or S-estimators of location and scatter as described by Verardi and McCathie (2012). We use the latter, and the superscript S is used to indicate it.

1. The constant term has to be recalculated.

The robust IV estimator can therefore be written as

    θ^S_RIV = (Σ^S_XZ (Σ^S_ZZ)^{-1} Σ^S_ZX)^{-1} Σ^S_XZ (Σ^S_ZZ)^{-1} Σ^S_Zy

As shown by Cohen Freue, Ortiz-Molina, and Zamar (2011), this estimator inherits the consistency properties of the underlying multivariate S-estimator and remains consistent even when the distribution of the carriers is not elliptical or symmetrical. They also demonstrate that under certain regularity conditions, this estimator is asymptotically normal, regression and carrier equivariant. Finally, they provide a simple formula for its asymptotic . An alternative estimator that would allow a substantial gain in efficiency is

    θ^W_RIV = (Σ^W_XZ (Σ^W_ZZ)^{-1} Σ^W_ZX)^{-1} Σ^W_XZ (Σ^W_ZZ)^{-1} Σ^W_Zy

where W stands for weights. First estimated are the robust covariance matrix Σ^S_XZy and the robust Mahalanobis distances—that is,

    d_i = sqrt{ (M_i − μ_M) Σ_M^{-1} (M_i − μ_M)' }

where M = (X, Z, y), Σ_M = Σ^S_XZy is the scatter matrix of the explanatory variables, and μ_M is the location vector. Outliers are then identified as the observations that have a robust Mahalanobis distance d_i larger than the square root of χ²_(p+m+1,q), where q is a confidence level (for example, 99%), given that Mahalanobis distances are distributed as the square root of a chi-squared with degrees of freedom equal to the length of vector μ_M. Finally, observations that are associated with a d_i larger than the cutoff point are downweighted, and the classical covariance matrix is estimated. The weighting that we adopt is simply to award a weight of 1 to observations associated with a d_i smaller than the cutoff value and to award a weight of 0 otherwise.

The advantage of this last estimator is that standard overidentification, underidentification, and weak instruments tests can easily be obtained, because this weighting scheme amounts to running a standard IV estimation on a sample free of outliers and the asymptotic variance of the estimator is also readily available. We use the user-written ivreg2 command (Baum, Schaffer, and Stillman 2007) to compute the final estimates; the reported tests and standard errors are those provided by this command.²

Finally, a substantial gain in efficiency with respect to the standard robust IV estimator proposed by Cohen Freue, Ortiz-Molina, and Zamar (2011) can be attained. We illustrate this efficiency gain by running 1,000 simulations using a setup similar to that of Cohen Freue, Ortiz-Molina, and Zamar (2011) but with no outliers: 1,000 observations for five random variables (x, u, v, w, Z) drawn from a multivariate normal distribution with μ = (0, 0, 0, 0, 0) and covariance

2. The robivreg command is not a full wrapper for the ivreg2 command. However, a sample free of outliers can easily be obtained by using the generate(varname) option that we describe in the next section.

        ⎛ 1    0    0    0.5  0 ⎞
        ⎜ 0    0.3  0.2  0    0 ⎟
    Σ = ⎜ 0    0.2  0.3  0    0 ⎟
        ⎜ 0.5  0    0    1    0 ⎟
        ⎝ 0    0    0    0    1 ⎠

The data-generating process is Y = 1 + 2x + Z + u, where x is measured with error and only variable X = x + v is assumed observable. To remedy this endogeneity bias, X is instrumented by Z. For this setup, the simulated efficiency of the two estimators is 46.7% for the raw θ^S_RIV estimator and 95.5% for the reweighted θ^W_RIV estimator. The efficiency is calculated as follows: Assume θ^S_RIV (θ^W_RIV) is asymptotically normal with covariance matrix V, and assume V_0 is the asymptotic covariance matrix of the classical θ_IV estimator. The efficiency of θ^S_RIV (θ^W_RIV) is calculated as eff(θ_RIV) = λ_1(V^{-1} V_0), where λ_1(E) denotes the largest eigenvalue of the matrix E, and V_0 and V are the simulated covariances.
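The 0/1 weighting rule itself is straightforward to express in Stata. The sketch below is an illustration under stated assumptions, not the internal code of robivreg: it assumes that a variable d already holds robust Mahalanobis distances (for example, from an S-estimator of location and scatter) and that local macros p and m hold the numbers of regressors (constant included) and instruments.

    local q = 0.99                                   // confidence level, as in cutoff(0.99)
    scalar cut = sqrt(invchi2(`p' + `m' + 1, `q'))   // square root of the chi-squared quantile
    generate byte w = (d <= cut)                     // weight 1 if not flagged as an outlier
    * A standard IV estimation is then run on the observations with w == 1.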

2 The robivreg command

2.1 Syntax

The robivreg command implements an IV estimator robust to outliers.

    robivreg depvar [varlist1] (varlist2 = instlist) [if] [in] [, first robust
        cluster(varname) generate(varname) raw cutoff(#) mcd graph
        label(varname) test nreps(#) nodots]

where depvar is the dependent variable, varlist1 contains the exogenous regressors, varlist2 contains the endogenous regressors, and instlist contains the excluded instruments.

2.2 Options

first reports various first-stage results and identification statistics. May not be used with raw.

robust produces standard errors and statistics that are robust to arbitrary heteroskedasticity.

cluster(varname) produces standard errors and statistics that are robust to both arbitrary heteroskedasticity and intragroup correlation, where varname identifies the group.

generate(varname) generates a dummy variable named varname, which takes the value of 1 for observations that are flagged as outliers.

raw specifies that Cohen Freue, Ortiz-Molina, and Zamar's estimator (2011) should be returned. Note that the standard errors reported are different from the ones that they proposed because these are robust to heteroskedasticity and asymmetry. The asymptotic variance of the raw estimator is described in Verardi and Croux (2009).

cutoff(#) allows the user to change the value above which an individual is considered to be an outlier. The default is cutoff(0.99).

mcd specifies that a minimum covariance determinant estimator of location and scatter be used to estimate the robust covariance matrices. By default, an S-estimator of location and scatter is used.

graph generates a graphic in which outliers are identified according to their type and labeled using the variable varname. Vertical lines identify vertical outliers (observations with a large residual), and the horizontal line identifies leverage points.

label(varname) labels the outliers as varname. label() only has an effect if specified with graph.

test specifies to report a test for the presence of outliers in the sample. To test for the appropriateness of a robust IV procedure relative to the classical IV estimator, we rely on the W statistic proposed by Dehon, Gassner, and Verardi (2009) and Desbordes and Verardi (2011), where

    W = (θ_IV − θ^S_RIV)' { Var(θ_IV) + Var(θ^S_RIV) − 2 Cov(θ_IV, θ^S_RIV) }^{-1} (θ_IV − θ^S_RIV)

Bearing in mind that this statistic is asymptotically distributed as a χ²_p, where p is the number of covariates, it is possible to set an upper bound above which the estimated parameters can be considered to be statistically different and hence the robust IV estimator should be preferred to the standard IV estimator. When the cluster() option is specified, a cluster–bootstrap is used to calculate the W statistic.

nreps(#) specifies the number of bootstrap replicates performed when the test and cluster() options are both specified. The default is nreps(50).

nodots suppresses the dots.
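As an aside on the test option, the reported p-value is just the upper tail of the chi-squared distribution with p degrees of freedom evaluated at W, so it can be checked by hand; the numbers below are purely illustrative.

    display chi2tail(2, 10.53)    // about 0.005 for W = 10.53 with 2 covariates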

3 Empirical example

In a seminal article, Romer (1993) convincingly shows that more open economies tend to have lower inflation rates. Worried that a simultaneity bias may affect the estimates, he instruments the trade openness variable—the share of imports in gross domestic product—by the logarithm of a country's land area.

From a pedagogical perspective, it is useful to start with the dependent variable (which is the average annual inflation rates since 1973), in levels, as in example 16.6 of Wooldridge (2009, 558).

. use http://fmwww.bc.edu/ec-p/data/wooldridge/OPENNESS
. merge 1:1 _n using http://www.rodolphedesbordes.com/web_documents/names.dta

    Result                           # of obs.
    not matched                              0
    matched                                114  (_merge==3)

. ivregress 2sls inf (opendec = lland)

Instrumental variables (2SLS) regression          Number of obs  =       114
                                                  Wald chi2(1)   =      5.73
                                                  Prob > chi2    =    0.0167
                                                  R-squared      =    0.0316
                                                  Root MSE       =    23.511

         inf       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
     opendec   -33.28737    13.91101   -2.39   0.017    -60.55245   -6.022284
       _cons    29.60664    5.608412    5.28   0.000     18.61435    40.59893

Instrumented:  opendec
Instruments:   lland

The coefficient on opendec is significant at the 5% level and suggests that a country with a 50% import share had an average inflation rate about 8.3 percentage points lower than a country with a 25% import share. We may be worried that outliers distort these estimates. For instance, it is well known that countries in Latin America have experienced extremely high inflation rates in the 1980s. Hence, we refit the model with the robivreg command.

. robivreg inf (opendec = lland), test graph label(countryname)
(sum of wgt is 8.3000e+01)

IV (2SLS) estimation

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                  Number of obs  =        83
                                                  F(  1,    81)  =      2.50
                                                  Prob > F       =    0.1181
Total (centered) SS   =  1322.080301              Centered R2    =    0.0514
Total (uncentered) SS =  10844.0201               Uncentered R2  =    0.8844
Residual SS           =  1254.073829              Root MSE       =     3.935

inf Coef. Std. Err. t P>|t| [95% Conf. Interval]

     opendec   -12.07379    7.642656   -1.58   0.118    -27.28027      3.1327
       _cons    14.60224    2.500814    5.84   0.000     9.626404    19.57808

Underidentification test (Anderson canon. corr. LM statistic): 16.073 Chi-sq(1) P-val = 0.0001

Weak identification test (Cragg-Donald Wald F statistic):          19.453
Stock-Yogo weak ID test critical values:  10% maximal IV size      16.38
                                          15% maximal IV size       8.96
                                          20% maximal IV size       6.66
                                          25% maximal IV size       5.53
Source: Stock-Yogo (2005). Reproduced by permission.

Sargan statistic (overidentification test of all instruments): 0.000 (equation exactly identified)

Instrumented:         opendec
Excluded instruments: lland

H0: Outliers do not distort 2SLS classical estimation

chi2(2)=10.53 Prob > chi2 = .005

Once the influence of outliers is downweighted, the value of the coefficient on opendec becomes much smaller and loses statistical significance. Our test for outliers, requested using the option test, confirms that outliers distort the original estimates enough that robustness should be favored at the expense of efficiency.

The outliers can be easily identified using the graph option. We facilitate the identification of each type of outlier by setting vertical and horizontal cutoff points in the reported graph. The vertical cutoff points are 2.25 and −2.25. If the residuals were normally distributed, values above or below these cutoff points would be strongly atypical because they would be 2.25 standard deviations away from the mean (which is 0 by construction), with a probability of occurrence of 0.025. The reported residuals are said to be robust and standardized because the residuals are based on a robust-to-outliers estimation and have been divided by the standard deviation of the residuals associated with nonoutlying observations. In line with our downweighting scheme, the horizontal cutoff point is, by default, the square root of χ²_(p+m+1,0.99). Vertical outliers are observations above or below the vertical lines, while leverage points are to the right of the horizontal line.

[Figure 1 here: scatterplot of robust standardised residuals (vertical axis) against robust Mahalanobis distances (horizontal axis); labeled countries include Bolivia, Argentina, Israel, Brazil, Chile, Peru, Congo Dem. Rep., Nicaragua, Uruguay, Singapore, and Lesotho.]

Figure 1. Identification of outliers when inf is used

Romer (1993) was fully aware that his results could be sensitive to outliers. This is why he decided to use as a dependent variable the log of average inflation.

. ivregress 2sls linf (opendec = lland)

Instrumental variables (2SLS) regression          Number of obs  =       114
                                                  Wald chi2(1)   =     11.06
                                                  Prob > chi2    =    0.0009
                                                  R-squared      =    0.1028
                                                  Root MSE       =    .66881

        linf       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
     opendec   -1.315804    .3957235   -3.33   0.001    -2.091408   -.5401999
       _cons     2.98983    .1595413   18.74   0.000     2.677135    3.302525

Instrumented:  opendec
Instruments:   lland

The coefficient on opendec is now significant at the 1% level and suggests, using the Duan smearing estimate, that a country with a 50% import share had an average inflation rate about 5.4 percentage points lower than a country with a 25% import share. However, even though taking the log of average inflation has certainly reduced the influence of extreme values of the dependent variable, outliers may still be an issue. Hence, we refit the model again with the robivreg command.

. robivreg linf (opendec = lland), test graph label(countryname)
(sum of wgt is 8.9000e+01)

IV (2SLS) estimation

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                  Number of obs  =        89
                                                  F(  1,    87)  =      3.03
                                                  Prob > F       =    0.0851
Total (centered) SS   =  18.9763507               Centered R2    =    0.1479
Total (uncentered) SS =  533.2560195              Uncentered R2  =    0.9697
Residual SS           =  16.17036311              Root MSE       =     .4311

linf Coef. Std. Err. t P>|t| [95% Conf. Interval]

     opendec   -1.225348    .7034456   -1.74   0.085    -2.623523    .1728262
       _cons    2.796497    .2300043   12.16   0.000     2.339339    3.253656

Underidentification test (Anderson canon. corr. LM statistic): 20.963 Chi-sq(1) P-val = 0.0000

Weak identification test (Cragg-Donald Wald F statistic):          26.806
Stock-Yogo weak ID test critical values:  10% maximal IV size      16.38
                                          15% maximal IV size       8.96
                                          20% maximal IV size       6.66
                                          25% maximal IV size       5.53
Source: Stock-Yogo (2005). Reproduced by permission.

Sargan statistic (overidentification test of all instruments): 0.000 (equation exactly identified)

Instrumented:         opendec
Excluded instruments: lland

H0: Outliers do not distort 2SLS classical estimation

chi2(2) = 4.86     Prob > chi2 = .088

In that case, the magnitude of the coefficient is preserved, but its statistical significance sharply decreases. Once again, we can identify outliers by using the graph option.

[Figure 2 here: scatterplot of robust standardised residuals (vertical axis) against robust Mahalanobis distances (horizontal axis); labeled countries include Bolivia, Lesotho, and Singapore.]

Figure 2. Identification of outliers when ln(inf) is used

In figure 2, we can see that taking the log of inf has been insufficient to deal with all outliers in the dependent variable because Bolivia remains an outlier. Furthermore, Romer was right to be worried that Lesotho or Singapore may “have an excessive influence on the results” (Romer 1993, 877). The remoteness of these two observations from the rest of the data led to an inflation of the total sample variation in trade openness, resulting in undersized standard errors and spuriously high statistical significance.

For the final example, we illustrate the use of the test option with the cluster() option. The clustering variable is idcode.

. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. keep if _n<1501
(27034 observations deleted)
. robivreg ln_w age not_smsa (tenure = union south), cluster(idcode) test
(sum of wgt is 8.4700e+02)

IV (2SLS) estimation

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on idcode

Number of clusters (idcode) = 209                 Number of obs  =       847
                                                  F(  3,   208)  =      5.16
                                                  Prob > F       =    0.0018
Total (centered) SS   =  126.4871762              Centered R2    =   -0.0831
Total (uncentered) SS =  2988.447827              Uncentered R2  =    0.9542
Residual SS           =  136.9920776              Root MSE       =     .4031

Robust ln_wage Coef. Std. Err. t P>|t| [95% Conf. Interval]

      tenure    .1260126    .0439573    2.87   0.005     .0393536    .2126715
         age    .0029424    .0032327    0.91   0.364    -.0034306    .0093155
    not_smsa   -.2569617    .0921363   -2.79   0.006    -.4386024    -.075321
       _cons    1.434409    .1009936   14.20   0.000     1.235307    1.633511

Underidentification test (Kleibergen-Paap rk LM statistic): 16.263 Chi-sq(2) P-val = 0.0003

Weak identification test (Cragg-Donald Wald F statistic):          27.689
                         (Kleibergen-Paap rk Wald F statistic):    11.305
Stock-Yogo weak ID test critical values:  10% maximal IV size      19.93
                                          15% maximal IV size      11.59
                                          20% maximal IV size       8.75
                                          25% maximal IV size       7.25
Source: Stock-Yogo (2005). Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.

Hansen J statistic (overidentification test of all instruments): 0.010 Chi-sq(1) P-val = 0.9207

Instrumented:         tenure
Included instruments: age not_smsa
Excluded instruments: union south

Test with clustered errors

bootstrap replicates (50)
1 2 3 4 5 ...... 50

H0: Outliers do not distort 2SLS classical estimation

chi2(4) = 2.86     Prob > chi2 = .582

The robust–cluster variance estimator has been used to estimate the standard errors, and as previously explained, a cluster–bootstrap procedure (resampling is done from clusters with replacement to account for the correlations of observations within cluster) has been used to calculate the W statistic of the outlier test.
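The following sketch shows the general shape of such a cluster–bootstrap loop; it is a schematic illustration of the resampling idea under stated assumptions, not the internal code of robivreg, and the refitting step is left as a comment.

    forvalues b = 1/50 {
        preserve
        bsample, cluster(idcode)     // draw whole idcode clusters with replacement
        * ... refit the classical and robust IV estimators on this replicate
        *     and store the difference between the two coefficient vectors ...
        restore
    }
    * The variance of the stored differences across replicates is then used
    * in place of the analytic variance when forming the W statistic.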

4 Conclusion

The robivreg command implements an IV estimator robust to outliers and allows their identification. In addition, a generalized Hausman test provides the means to evaluate whether the gain in robustness outweighs the loss in efficiency and thus justifies the use of a robust IV estimator.

5 References

Baum, C. F., M. E. Schaffer, and S. Stillman. 2007. Enhanced routines for instrumental variables/generalized method of moments estimation and testing. Stata Journal 7: 465–506.

Cohen Freue, G. V., H. Ortiz-Molina, and R. H. Zamar. 2011. A natural robustification of the ordinary instrumental variables estimator. Working paper. http://www.stat.ubc.ca/~ruben/website/cv/cohen-zamar.pdf.

Dehon, C., M. Gassner, and V. Verardi. 2009. Extending the Hausman test to check for the presence of outliers. ECARES Working Paper 2011-036, Université Libre de Bruxelles. http://ideas.repec.org/p/eca/wpaper/2013-102578.html.

Desbordes, R., and V. Verardi. 2011. The positive causal impact of foreign direct investment on productivity: A not so typical relationship. Discussion Paper No. 11-06, University of Strathclyde Business School, Department of Economics. http://www.strath.ac.uk/media/departments/economics/researchdiscussionpapers/2011/11-06 Final.pdf.

Romer, D. 1993. Openness and inflation: Theory and evidence. Quarterly Journal of Economics 108: 869–903.

Verardi, V., and C. Croux. 2009. Robust regression in Stata. Stata Journal 9: 439–453.

Verardi, V., and C. Dehon. 2010. Multivariate outlier detection in Stata. Stata Journal 10: 259–266.

Verardi, V., and A. McCathie. 2012. The S-estimator of multivariate location and scatter in Stata. Stata Journal 12: 299–307.

Wooldridge, J. M. 2009. Introductory Econometrics: A Modern Approach. 4th ed. Mason, OH: South-Western.

About the authors

Rodolphe Desbordes is a senior lecturer in economics at the University of Strathclyde in Glasgow, UK. His research interests include applied econometrics, international economics, and economic growth.

Vincenzo Verardi is a research fellow of the Belgian National Science Foundation. He is a professor at the Faculté Notre Dame de la Paix of Namur and at the Université Libre de Bruxelles. His research interests include applied econometrics and development economics.

The Stata Journal (2012) 12, Number 2, pp. 182–190

What hypotheses do “nonparametric” two-group tests actually test?

Rónan M. Conroy
Royal College of Surgeons in Ireland
Dublin, Ireland
[email protected]

Abstract. In this article, I discuss measures of effect size for two-group comparisons where data are not appropriately analyzed by least-squares methods. The Mann–Whitney test calculates a statistic that is a very useful measure of effect size, particularly suited to situations in which differences are measured on scales that either are ordinal or use arbitrary scale units. Both the difference in medians and the median difference between groups are also useful measures of effect size.

Keywords: st0253, ranksum, Wilcoxon rank-sum test, Mann–Whitney statistic, Hodges–Lehmann median shift, effect size, qreg

1 Introduction

It is a common fallacy that the Mann–Whitney test, more properly known as the Wilcoxon rank-sum test and also known as the Mann–Whitney–Wilcoxon test, is a test for equality of medians. Many of its users are probably unaware that the test calculates a useful parameter (and therefore should not be called “nonparametric”) that is often of more practical interest than the difference between two means. I will use an extreme case to illustrate the tests available to compare two groups and, in particular, the procedures that examine differences in medians. I will use a dataset that is deliberately constructed so that the medians of two groups are equal but with distributions skewed in opposite directions. Although this is an extreme case, you should bear in mind that differences between two groups in the shape of underlying distributions will have consequences in the same direction, albeit smaller than the ones illustrated here.


The dataset is as follows:

. list

     +---------------+
     | group   value |
     |---------------|
  1. |     0       5 |
  2. |     0       5 |
  3. |     0       5 |
  4. |     0       5 |
  5. |     0       5 |
     |---------------|
  6. |     0       5 |
  7. |     0       7 |
  8. |     0       8 |
  9. |     0       9 |
 10. |     0      10 |
     |---------------|
 11. |     1       1 |
 12. |     1       2 |
 13. |     1       3 |
 14. |     1       4 |
 15. |     1       5 |
     |---------------|
 16. |     1       5 |
 17. |     1       5 |
 18. |     1       5 |
 19. |     1       5 |
 20. |     1       5 |
     +---------------+

2 The Mann–Whitney test

Both groups have a median of 5, but group 0 has no values less than 5 and group 1 has no values greater than 5. We can confirm that the medians are the same by using the table command:

. table group, contents(p50 value)

group med(value)

        0            5
        1            5

Next we run the Wilcoxon rank-sum test (Mann–Whitney test):

. ranksum value, by(group)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+---------------------------------
           0 |       10         137         105
           1 |       10          73         105
-------------+---------------------------------
    combined |       20         210         210

unadjusted variance      175.00
adjustment for ties      -37.63
                       --------
adjusted variance        137.37

Ho: value(group==0) = value(group==1)
             z =   2.730
    Prob > |z| =  0.0063

The test gives a highly significant difference between the two groups. Clearly, the test cannot be testing the hypothesis of equal medians, so what hypothesis does it test? We can see the answer by adding the porder option.

. ranksum value, by(group) porder

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+---------------------------------
           0 |       10         137         105
           1 |       10          73         105
-------------+---------------------------------
    combined |       20         210         210

unadjusted variance      175.00
adjustment for ties      -37.63
                       --------
adjusted variance        137.37

Ho: value(group==0) = value(group==1)
             z =   2.730
    Prob > |z| =  0.0063
P{value(group==0) > value(group==1)} = 0.820

3 The Mann–Whitney statistic: A useful measure of effect size

The last line of the output states that the probability of an observation in group 0 having a true value that is higher than an observation in group 1 is 82%. In reality, the limitations of measurement scales will often produce cases where the two values are tied. So the parameter is calculated on the basis of the percentage of cases in which a random observation from group 0 is higher than a random observation from group 1, plus half the probability that the values are tied (on the rationale that if the values are tied, the true value is greater in group 1 in half the randomly selected pairs and greater in group 2 in the other half of them).
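This definition can be verified by brute force on the 20-observation example: form all 10 × 10 pairs of one group-0 value and one group-1 value, score each pair 1 if the group-0 value is larger and 0.5 if the two are tied, and average. The sketch below is not part of the original article; it assumes the example dataset is in memory.

    preserve
    keep if group == 0
    rename value value0
    keep value0
    tempfile g0
    save `g0'
    restore, preserve
    keep if group == 1
    rename value value1
    keep value1
    cross using `g0'                       // all 100 cross-group pairs
    generate double win = (value0 > value1) + 0.5*(value0 == value1)
    summarize win                          // mean = .82, as reported by ranksum
    restore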

This parameter forms the basis of the Mann–Whitney test, a parameter that is a very useful measure of effect size in many situations. A researcher will frequently be faced with a measurement scale that either is not interval in nature (such as a five- point ordered scale) or has no naturally defined underlying measurement units. Typical examples of the latter are scales to measure moods, attitudes, aptitudes, and quality of life. In such cases, presenting mean differences between groups is uninformative. The Mann–Whitney statistic, on the other hand, is highly informative. It tells us the likelihood that a member of one group will score higher than a member of the other group (with the caveat above about the interpretation of tied values). In the analysis of controlled treatment trials, this measure is equivalent to the probability that a person assigned to the treatment group will have a better outcome than a person assigned to the control group. Using this can overcome the problem of many outcome scales used in assessing treatments being measured in arbitrary units. The statistic has a long history of being rediscovered and, consequently, goes under a variety of names. The history dates back to the original articles in which Wilcoxon (1945, 1950) described the test. He failed to give the test a name and failed to spec- ify the hypothesis it tested. The importance of the article by Mann and Whitney (1947) is that they made the hypothesis explicit: their article is entitled “On a test of whether one of two random variables is stochastically larger than the other”. How- ever, the statistic they actually calculated in the article, U, is not the probability that one variable is larger than the other. Birnbaum (1956) pointed out that transforming U by dividing it by its maximum value resulted in a useful measure, but he failed to name the measure. In 1976, Herrnstein proposed the transformation, but the article (Herrnstein, Loveland, and Cable 1976) appeared in an animal psychology journal, and his proposal to assign the letter rho to the transformed value was extremely unoriginal. Perhaps as a result, the literature is now replete with statistics that are nothing other than Mann–Whitney statistics under other names. Bross (1958), setting out the calculation and use of ridit statistics, noted that the mean ridit score for a group was the probability that an observation from that group would be higher than an observation from a reference population. Harrell’s C statistic, which is a measure of the difference between two survival distributions, is a special case of the Mann–Whitney statistic, and indeed, in the absence of censored data, it reduces to the Mann–Whitney statistic (Koziol and Jia 2009). Likewise, the tendency to refer to the Mann–Whitney statistic as the area under the receiver operator characteristic curve is common in literature evaluating diagnostic and screening tests in medicine, and is extremely unhelpful. The name entirely obscures what the test actually tells us, which is the probability that a person with the disorder or condition will score higher on the test than a person without it. The area under the receiver operator characteristic curve has been proposed as a measure of effect size in clinical trials (Brumback, Pepe, and Alonzo 2006), which would extend the bafflement to a new population of readers. As a measure of effect size, the Mann–Whitney statistic has been renamed not once, but repeatedly, and with willful obstinacy. 
McGraw and Wong (1992) proposed what they called “a common language effect size” that was none other than the Mann–Whitney statistic. The measure was generalized by Vargha and Delaney (2000), who in the process renamed it the measure of stochastic superiority. It has since been further generalized by Ruscio (2008), who renamed the statistic A. He points out that it is insensitive to base rates and more robust to several other factors (for example, extreme scores and nonlinear transformations), in addition to its excellent generalizability across contexts. Finally, Acion et al. (2006) rebranded it the “Probabilistic index”; they advocated it as an intuitive nonparametric approach to measuring the size of treatment effects. Stata users can use the ranksum command to calculate the Mann–Whitney statistic. More usefully, Newson’s (1998) package somersd provides confidence intervals. Newson named the statistic “Harrell’s C”.

. somersd group value, transf(c) tdist
Somers' D with variable: group
Transformation: Harrell's c
Valid observations: 20
Degrees of freedom: 19

Symmetric 95% CI for Harrell's c

             |              Jackknife
       group |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       value |        .18    .0711805    2.53   0.020     .0310175    .3289825

Note that the output of somersd will have to be manipulated in this case to provide the statistic comparing the higher and lower groups. It reports the probability that an observation in the first group will be higher than an observation in the second group. A simple way to overcome this is to reverse the group codes by using Cox’s (2003) vreverse, one of several user-written commands to reverse variables (like all user- written commands, it may be located and installed within Stata by using the findit command).

. vreverse group, generate(r_group)
note: group not labeled
. somersd r_group value, transf(c) tdist
Somers' D with variable: r_group
Transformation: Harrell's c
Valid observations: 20
Degrees of freedom: 19

Symmetric 95% CI for Harrell's c

             |              Jackknife
     r_group |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       value |        .82    .0711805   11.52   0.000     .6710175    .9689825

This accords with the earlier output from ranksum but has the added advantage of presenting us with a useful confidence interval.

4 How and when to test medians

The Mann–Whitney test is a test for equality of medians only under the very strong assumption that both of the two distributions are symmetrical about their respective medians or, in the case of asymmetric distributions, that the distributions are of the same shape but differ in location. Thus the common belief that the test compares medians is true only under some implausible circumstances. Nevertheless, there are times when a researcher explicitly wishes to test for a dif- ference in medians. For example, mean length of hospital stay is generally skewed and often badly affected by small numbers of very long admissions, so the 50th percentile of the stay distribution may be of more interest than the mean. A researcher might therefore be tempted to test for equality of medians with the median test. However, once again, we need to be cautious:

. median value, by(group)

Median test

 Greater than |         group
  the median  |         0          1 |     Total
--------------+----------------------+----------
           no |         6         10 |        16
          yes |         4          0 |         4
--------------+----------------------+----------
        Total |        10         10 |        20

          Pearson chi2(1) =   5.0000   Pr = 0.025
 Continuity corrected:
          Pearson chi2(1) =   2.8125   Pr = 0.094

The median test does not actually test for equality of medians. Instead, it tests a likely consequence of drawing two samples from populations with equal medians: that a similar proportion of observations in each group will be above and below the grand median of the data. And as we can see, the test provides at least some support for the idea that the medians are different. On the other hand, quantile regression does test for the equality of medians. It is the direct equivalent of the t test for medians, because the mean minimizes squared error, whereas the median minimizes absolute error:

. qreg value group
Iteration  1:  WLS sum of weighted deviations =  27.481539

Iteration  1: sum of abs. weighted deviations =          30
Iteration  2: sum of abs. weighted deviations =          26
Iteration  3: sum of abs. weighted deviations =          24

Median regression                                    Number of obs =        20
  Raw sum of deviations          24 (about 5)
  Min sum of deviations          24                  Pseudo R2     =    0.0000

       value |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+------------------------------------------------------------------
       group |          0    1.221911    0.00   1.000     -2.56714     2.56714
       _cons |          5    .8640215    5.79   0.000     3.184758    6.815242

This accords with what we know of the data: the medians are identical and the difference in the medians is 0. Quantile regression is also more powerful in that it can be extended to other quantiles of interest and it can be adjusted for covariates. Note that the difference in medians between two groups does not correspond to the median difference between individuals. To confirm this, we can use Newson’s (2006) cendif command:

. cendif value, by(group)

Y-variable: value
Grouped by: group
Group numbers:
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         10       50.00       50.00
          1 |         10       50.00      100.00
------------+-----------------------------------
      Total |         20      100.00

Transformation: Fisher's z
95% confidence interval(s) for percentile difference(s)
between values of value in first and second groups:

     Percent    Pctl_Dif     Minimum     Maximum
          50           2           0           4

The median difference between members of group 0 and members of group 1 is 2, with a 95% confidence interval from 0 to 4. This measure, usually called the Hodges–Lehmann median difference (Hodges and Lehmann 1963), is, as Newson (2002) points out, a special case of Theil's (1950a; 1950b; 1950c) median slope. The researcher therefore has to relate the statistical procedure to the purpose of the study: computing the difference in medians tests for a difference between two conditions. In the case of hospital stay, the effect of a new treatment on length of stay could be tested by comparing the median stay in two groups, one of whom received the new treatment and the other acting as control. However, when the interest is the impact on the individual, the difference in medians between groups is misleading. Here the expected effect of treatment on the individual is best measured by the median difference between patients in the treatment and control groups.

5 Conclusion

The Mann–Whitney test is based on a parameter that is of considerable interest as a measure of effect size, especially in situations in which outcomes are measured on scales that either are ordinal or have arbitrary measurement units. Unfortunately, it has often been misrepresented as a test for the equality of medians, which it is not. The Mann–Whitney statistic has also been subject to many rediscoveries and rebrandings over the years. Those who wish to test the equality of medians between two groups should avoid the Mann–Whitney test and should consider qreg as a more powerful and versatile alternative to the median test. It is important, however, to distinguish between the difference between the medians of two groups, which measures the effect of a policy or condition on the location of the distribution of the outcome of interest, and the median difference between individuals in the two groups (the Hodges–Lehmann median difference), which is a measure of the expected benefit to the individual associated with being a member of the superior group.

6 References

Acion, L., J. J. Peterson, S. Temple, and S. Arndt. 2006. Probabilistic index: An intuitive non-parametric approach to measuring the size of treatment effects. Statistics in Medicine 25: 591–602.

Birnbaum, Z. W. 1956. On a use of the Mann-Whitney statistic. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman, 13–17. Berkeley, CA: University of California Press.

Bross, I. D. J. 1958. How to use ridit analysis. Biometrics 14: 18–38.

Brumback, L. C., M. S. Pepe, and T. A. Alonzo. 2006. Using the ROC curve for gauging treatment effect in clinical trials. Statistics in Medicine 25: 575–590.

Cox, N. J. 2003. vreverse: Stata module to reverse existing categorical variable. Statistical Software Components S434402, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s434402.html.

Herrnstein, R. J., D. H. Loveland, and C. Cable. 1976. Natural concepts in pigeons. Journal of Experimental Psychology: Animal Behavior Processes 2: 285–302.

Hodges, J. L., Jr., and E. L. Lehmann. 1963. Estimates of location based on rank tests. Annals of Mathematical Statistics 34: 598–611.

Koziol, J. A., and Z. Jia. 2009. The concordance index C and the Mann-Whitney parameter Pr(X>Y) with randomly censored data. Biometrical Journal 51: 467–474.

Mann, H. B., and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18: 50–60.

McGraw, K. O., and S. P. Wong. 1992. A common language effect size statistic. Psy- chological Bulletin 111: 361–365.

Newson, R. 1998. somersd: Stata module to calculate Kendall’s tau-a, Somers’ D and median differences. Statistical Software Components S336401, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s336401.html.

———. 2002. Parameters behind “nonparametric” statistics: Kendall’s tau, Somers’ D and median differences. Stata Journal 2: 45–64.

———. 2006. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497–520.

Ruscio, J. 2008. A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods 13: 19–30.

Theil, H. 1950a. A rank invariant method of linear and polynomial , I. Proceedings of the Koninklijke Nederlandse Akademie Wetenschappen, Series A – Mathematical Sciences 53: 386–392.

———. 1950b. A rank invariant method of linear and polynomial regression analysis, II. Proceedings of the Koninklijke Nederlandse Akademie Wetenschappen, Series A – Mathematical Sciences 53: 521–525.

———. 1950c. A rank invariant method of linear and polynomial regression analysis, III. Proceedings of the Koninklijke Nederlandse Akademie Wetenschappen, Series A – Mathematical Sciences 53: 1397–1412.

Vargha, A., and H. D. Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25: 101–132.

Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1: 80–83.

———. 1950. Some rapid approximate statistical procedures. Annals of the New York Academy of Sciences 52: 808–814.

About the author

Rónan Conroy is a biostatistician at the Royal College of Surgeons in Dublin, Ireland.

The Stata Journal (2012) 12, Number 2, pp. 191–213

From resultssets to resultstables in Stata

Roger B. Newson
National Heart and Lung Institute
Imperial College London
London, UK
[email protected]

Abstract. The listtab package supersedes the listtex package. It inputs a list of variables and outputs them as a table in one of several formats, including TEX, LATEX, HTML, Microsoft Rich Text Format, or possibly future XML-based formats. It works with a team of four other packages: sdecode, chardef, xrewide,and ingap, which are downloaded from the Statistical Software Components archive. The sdecode package is an extension of decode; it allows the user to add prefixes or suffixes and to output exponents as TEX, HTML, or Rich Text Format superscripts. It is called by another package, bmjcip, to convert estimates, confidence limits, and p-values from numeric to string. The chardef package is an extension of char define; it allows the user to define a characteristic for a list of variables, adding prefixes or suffixes. The xrewide package is an extension of reshape wide; it allows the user to store additional results in variable characteristics or local macros. The ingap package inserts gap observations into a dataset to form gap rows when the dataset is converted to a table. Together, these packages form a general toolkit to convert Stata datasets (or resultssets) to tables, complete with row labels and column labels. Examples are demonstrated using LATEX and also using the rtfutil package to create Rich Text Format documents. This article uses data from the Avon Longitudinal Study of Parents and Children, based at Bristol University, UK. Keywords: st0254, listtab, listtex, sdecode, chardef, xrewide, ingap, bmjcip, rtfutil, keyby, addinby, table, resultsset, TEX, LATEX, HTML, XML, RTF

1 Introduction

Statisticians and other scientists routinely produce tables of descriptive and inferential statistics. Unsurprisingly, Stata users have produced several add-on packages to input the various Stata data and output structures and to output tables of statistics in a variety of formats, such as TEX, LATEX, Hypertext Markup Language (HTML), Extensible Markup Language (XML), Microsoft Rich Text Format (RTF), or delimited text tables for menu-driven manual incorporation into word processor documents. The packages for this purpose that are most frequently downloaded from the Statis- tical Software Components (SSC) archive are Roy Wada’s outreg2, Ben Jann’s estout, John Luke Gallup’s outreg, and Ian Watson’s tabout. Some very inspiring presenta- tions were given at the 2009 UK Stata Users Group meeting by Ben Jann (2009), who gave a tutorial on estout, and by Adam Jacobs (2009), who described some of his own programs to create XML-based documents for Open Document Format packages.


My own method for table production was based initially on Stata’s listtex package, which has since been superseded by listtab for users of Stata 10 or higher. The latest versions of both packages are still downloadable from SSC. Both of these packages input a list of variables from the Stata dataset in memory and output a text table (in the Stata log or in a file) with one row per observation. The table is derived by outputting the values of these variables, preceded by a begin() string option, separated by a delimiter() string option, and suffixed by an end() string option. Optionally, the user may specify a list of string header rows to be output before the observation rows by using the headlines() option or a list of string footer rows to be output after the observation rows by using the footlines() option. This general format includes, as special cases, most table formats known to computer science and probably a few yet to be invented. To save the user the trouble of specifying a long list of options, there is an rstyle() option to specify one of a collection of row style names, such as rstyle(tabular) for LATEX tabular environments (Lamport 1994). A row style is defined as a vector of four string values—namely, a begin() string, a delimiter() string, an end() string, and a missnum() string—to be output in the place of numeric missing values. All of these options default to the empty string for listtab. However, the delimiter() option defaults to the ampersand (&)forlisttex, because I was thinking of TEX tables as the norm when I wrote listtex.Themissnum() option is probably not often used with either package, because numeric variables should usually be converted to string variables before being output to tables. The listtab or listtex method is simple and versatile, if only the user already has a dataset in memory with one observation per table row. Fortunately, StataCorp and other Stata users have already developed a variety of programs, such as official Stata’s statsby and the SSC package parmest, that could be used to produce such datasets (or resultssets). Some of these programs, and their use with listtex, are described in Newson (2003). Since that article was written, resultsset processing has developed further and additional Stata packages have become available. In this article, I aim to give an update of techniques for converting resultssets (and other datasets) to tables (or resultstables). The methods used can be summarized as “decode, label, and list” (in the simplest cases) or as “decode, label, reshape, insert gaps, and list” (in the most complicated cases).
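To make the row-style mechanics described above concrete, here is a minimal sketch of a listtab call that writes a LATEX tabular environment. The variable names (rowlab, est, ci) and the output file name are hypothetical, and the exact option spellings should be checked against the listtab help file.

    listtab rowlab est ci using mytable.tex, replace rstyle(tabular) ///
        headlines("\begin{tabular}{lcc}" "Parameter & Estimate & 95\% CI \\") ///
        footlines("\end{tabular}")

The headlines() and footlines() strings supply the environment's opening and closing rows, while rstyle(tabular) supplies the begin, delimiter, and end strings for each observation row.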

1.1 Example resultsset: Genotype frequencies in ALSPAC

The resultsset that we will use to demonstrate these techniques was produced using data from the Avon Longitudinal Study of Parents and Children (ALSPAC). For further information about this birth , refer to http://www.alspac.bris.ac.uk (the study website). Analyses of some of these data are discussed in Shaheen et al. (2010) and Henderson et al. (2010).

In the ALSPAC birth cohort study, 14,060 children born in and around Bristol, UK, in 1991 and 1992 have been observed through progression to adulthood. Subsets of these children, and of their mothers, gave blood or other tissue samples, which were used to measure their genotypes with respect to 24 biallelic polymorphisms, using assay methods developed by molecular geneticists. An allele is a version of a gene, and a biallelic polymorphism is a set of two possible alleles for a gene. The genotype of a study subject with respect to a polymorphism is the total set of alleles present in the two copies of the human genome present in each cell of that study subject; one copy comes from the subject's mother's egg, while the other copy comes from the subject's father's sperm. Genotypes with respect to a biallelic polymorphism may have values of the form GG, Gg, or gg, where G is the first allele and g is the second allele. Before measuring associations of the two alleles of a biallelic polymorphism with a disease, geneticists usually measure the frequencies of the three possible genotypes in the sample of subjects. An important population parameter of this distribution is the homozygote or heterozygote ratio, defined as sqrt(P_GG P_gg)/P_Gg, where P_GG, P_Gg, and P_gg are the proportions of subjects in the population with genotypes GG, Gg, and gg, respectively. The sample estimate of this ratio is sqrt(N_GG N_gg)/N_Gg, and the variance of its log is 1/(4N_GG) + 1/(4N_gg) + 1/N_Gg, where N_GG, N_Gg, and N_gg are the numbers of subjects in the sample with genotypes GG, Gg, and gg, respectively. If the geometric mean homozygote or heterozygote ratio is 0.5, as would be the case under random mating (where the maternal egg allele and the paternal sperm allele are statistically independent), then the distribution of the genotypes is said to be in Hardy–Weinberg equilibrium. The alternative possibilities are a tendency to inbreeding (where the ratio is greater than 0.5, indicating a tendency for mothers and fathers to contribute the same allele) and a tendency to outbreeding (where the ratio is less than 0.5, indicating a tendency for mothers and fathers to contribute different alleles). Justifications of these formulas can be found in Newson (2008a).

The example resultsset to be used in this article contains one observation per polymorphism per DNA source (“Child” or “Maternal”) for each of 24 polymorphisms assayed. The resultsset also contains data on the number of subjects (children or mothers) whose genotypes were assessed for the polymorphism, and the numbers and percentages with each genotype. This is together with estimates and confidence intervals for the geometric mean homozygote or heterozygote ratios for 24 biallelic polymorphisms, and p-values for testing the hypothesis of Hardy–Weinberg equilibrium.

The observations in the resultsset are sorted (and uniquely identified) by a key of three value-labeled numeric variables: source (indicating DNA source), polygp (indicating a polymorphism group corresponding to a particular protein specified by a particular gene), and poly (indicating the polymorphism). The resultsset was created using the SSC packages xcontract, dsconcat, and factmerg, together with the parmcip command of the SSC package parmest. The methods used to create this resultsset are discussed in Newson (2008b), which also gives a plot of some of the confidence intervals.
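As a small worked illustration of these formulas (with made-up genotype counts, not ALSPAC data), the ratio and a 95% confidence interval can be computed directly in Stata:

    * Hypothetical counts for genotypes GG, Gg, and gg.
    scalar N_GG = 400
    scalar N_Gg = 800
    scalar N_gg = 300
    scalar ratio  = sqrt(N_GG*N_gg)/N_Gg
    scalar se_log = sqrt(1/(4*N_GG) + 1/(4*N_gg) + 1/N_Gg)   // SE of the log ratio
    display "ratio = " ratio
    display "95% CI: " ratio*exp(-1.96*se_log) " to " ratio*exp(1.96*se_log)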
Methods for production of resultssets are discussed more generally in Newson (2003) and also in many more recent references accessible by hypertext from the online help for the parmest package.

In this article, I will demonstrate how to convert these results into tables in LaTeX (which Stata programmers frequently like to use) and in RTF (which produces Microsoft-ready documents, which their colleagues frequently demand). In section 2, I explain how a table can be viewed as a special case of a Stata dataset. In sections 3–5, I demonstrate increasingly advanced examples of table production in LaTeX. In section 6, I describe the creation of tables in RTF documents. Finally, in section 7, I outline possible further extensions of the methods presented in this article. The example resultsset, together with the Stata do-files used to output the tables, can be downloaded (using Stata) as online supplementary material to this article.

2 Tables as Stata datasets

To create a table using listtab, we must first have a suitable Stata dataset in the memory, with one observation per table row. In general, a well-formed Stata dataset should be designed to have one observation per thing and data on attributes of things. For instance, the example resultsset has one observation per combination of DNA source (“Child” or “Maternal”) and polymorphism, data on genotype frequencies for that polymorphism in that DNA source, and data on statistics derived from those frequencies. Specifically, a well-formed Stata dataset should be a table, as defined by the relational model of Date (1986). This model (in its essentials) assumes that a table (or dataset) is a mathematical function whose domain is the set of existing combinations of key variable values (specifying the things) and whose codomain is the set of all possible combinations of non-key variable values (specifying the attributes of things). For a Stata dataset to conform to this model, it should be keyed (that is to say, its observations should be sorted and uniquely identified) by the values of a list of key variables. In the case of the example resultsset, these key variables are source, polygp, and poly.

Not all Stata datasets conform to the relational model. Some are not sorted or are sorted by variables that do not uniquely identify the observations. Similarly, not all keyed Stata datasets are ideal for input to listtab. I would argue (from experience) that a Stata dataset input to listtab should be keyed, with key variables identifying the table rows in order of appearance on the page, possibly preceded in the key by key variables identifying the pages of a multipage table. The dataset should also have the following features:

1. The variable list input to listtab should contain only string variables.

2. The first variable in the list should be a row label variable whose value, in each observation, is determined from the values of the key variables in that observation.

3. The variables input to listtab should have one or more characteristics (see [P] char) containing the corresponding column labels for the output table.

The first of these features is a good idea because numeric variables nearly always have to be reformatted before inclusion in tables in ways that are not provided by any official Stata format. The other two features are a good idea because tables in documents have row labels and column labels, both of which can contain justification, font specification, or other formatting information.
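As a minimal sketch of these conventions (a hypothetical fragment using auto.dta and official commands only; the sdecode and chardef packages described below offer more control), a listtab-ready dataset might be prepared as follows:

* Minimal sketch of a listtab-ready dataset (hypothetical fragment)
sysuse auto, clear
keep make mpg
sort make                              // make keys the dataset and doubles as the row label
tostring mpg, replace                  // feature 1: only string variables will be listed
char make[varname] "Make and Model"    // feature 3: column labels stored in a characteristic
char mpg[varname] "Mileage (mpg)"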

2.1 Making resultstables by breaking resultssets

A typical resultsset produced by parmest, contract, collapse, or statsby contains mostly numeric variables containing frequencies or sample statistics. Formatting resultssets for input to listtab is usually a destructive process, because these numeric variables are converted irreversibly to string variables. The process is even more destructive when the dataset is also subset, reshaped, or provided with gap rows to make the table more readable. It is therefore usually a good idea to program this conversion process as a sequence of commands in a do-file and to enclose these commands between a preserve statement and a restore statement (see [P] preserve).

Stata programmers are frequently urged to avoid the use of preserve and restore, because they involve file processing and therefore consume computing resources. Unfortunately, in Stata versions 1 to 12, the user may only have access to one dataset in memory at a time and therefore frequently has no choice but to use preserve and restore. Fortunately, if a resultsset is small enough to be converted to a table or even to a multipage document of multipage tables, then this resource intensity is not likely to be an important issue. I usually follow a policy of writing a do-file to produce a keyed resultsset on disk, and then writing a second do-file to input this resultsset and to produce the tables (and plots), which frequently have to be revised iteratively before they have exactly the right look and feel.
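The working pattern just described might be sketched as a do-file skeleton along the following lines (the filenames and variable names are hypothetical):

* Hypothetical skeleton: a keyed resultsset on disk is converted, destructively
* but temporarily, into a table between preserve and restore
use myresults.dta, clear                 // keyed resultsset made by an earlier do-file
preserve
* ... subset, decode, reshape, and insert gap rows here ...
listtab rowlabel stat1 stat2 using mytable.tex, replace rstyle(tabular)
restore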

2.2 Enforcing the relational model with keyby and addinby

Two useful packages to enforce and use the relational model in Stata are keyby and addinby, both downloadable from SSC. The command keyby is an extension of sort (see [D] sort); it sorts the dataset by a list of variables and then checks that the dataset is keyed by these variables. The keyby package also contains the command keybygen, which sorts the data by a list of variables and adds an additional variable containing the preexisting order of an observation within its by-group to complete the key. Both keyby and keybygen reorder the key variables to be the first in the dataset unless the user specifies a noorder option.

The addinby package is an extension of merge (see [D] merge); it adds extra variables to the dataset in memory from a second dataset by using a list of existing variables as a foreign key, which is assumed to be the primary key of the second dataset. Both keyby and addinby fail if the list of variables specified does not key the dataset that it is supposed to key. They also fail if these key variables contain missing values (unless the user specifies a missing option). Both commands restore the original dataset in the event of failure (unless the user specifies a fast option).

The keyby and addinby packages are frequently used to produce a keyed resultsset to be stored on disk before that resultsset is subset, decoded, reshaped, or supplied with gap rows for input to listtab.
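For instance, with the example resultsset in memory, the two commands might be used along the following lines (polyinfo.dta is a hypothetical second dataset keyed by polygp and poly):

* Hypothetical sketch of keyby and addinby, assuming the behavior described above
keyby source polygp poly                  // sort and check that these variables key the dataset
addinby polygp poly using polyinfo.dta    // add variables from a dataset keyed by (polygp, poly)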

3 Simple tables: Decode, label, list

Our first listtab examples are the creation of tables 1 and 2. Table 1, a minimal example, is a modified version of table 1 of Newson (2003), with one row for each Volkswagen car model in auto.dta. Table 2, a more typical example, has one row per polymorphism in the example resultsset, labeled by the polymorphism group and the name of the individual polymorphism; data on the total number of child subjects; data on the numbers of subjects with genotypes GG, Gg, and gg (where G is the first allele and g is the second allele); and the estimates, confidence limits, and p-values for the geometric mean homozygote or heterozygote ratios.

Table 1. Volkswagen cars in auto.dta

Make and Model   Mileage (mpg)   Weight (lbs.)   Price
VW Dasher        23              2,160           $7,140
VW Diesel        41              2,040           $5,397
VW Rabbit        25              1,930           $4,697
VW Scirocco      25              1,990           $6,850

Table 2. Child genotype frequencies and homozygote or heterozygote ratios

Polymorphism (by group)   N      GG     Gg     gg     Ratio (95% CI)      P
AHR, rs2066853            8704   6811   1788   105    0.47 (0.43, 0.53)   .31
CYP1A1, rs1048943         8764   8205   544    15     0.64 (0.49, 0.84)   .062
CYP1A1, rs4646903         8587   6899   1567   121    0.58 (0.53, 0.65)   .0033
CYP2A6, rs1801272         8666   8203   454    9      0.60 (0.43, 0.84)   .3
CYP2A6, rs28399433        8691   7716   936    39     0.59 (0.49, 0.69)   .067
GCL, rs17883901           9389   7925   1401   63     0.50 (0.44, 0.58)   .9
GPX, rs713041             8611   2733   4168   1710   0.52 (0.50, 0.54)   .093
GSTM1, GSTM1 CNV          8399   585    3163   4651   0.52 (0.49, 0.55)   .14
GSTP1, rs947894           8692   3695   3937   1060   0.50 (0.48, 0.53)   .82
GSTT1, GSTT1 CNV          7591   2448   3653   1490   0.52 (0.50, 0.55)   .056
IGF, rs35767              9383   6579   2555   249    0.50 (0.47, 0.54)   .96
IGF, rs2854744            8489   2456   4202   1831   0.50 (0.48, 0.53)   .67
LTA, rs909253             8664   3397   4089   1178   0.49 (0.47, 0.51)   .34
MIF, rs755622             9500   6454   2757   289    0.50 (0.46, 0.53)   .79
MTNR1B, rs10830963        8584   4570   3357   657    0.52 (0.49, 0.54)   .24
Nrf2, rs1806649           8606   4889   3199   518    0.50 (0.47, 0.53)   .86
Nrf2, rs1962142           8763   7154   1528   81     0.50 (0.44, 0.56)   .95
Nrf2, rs2364723           8725   4047   3760   918    0.51 (0.49, 0.54)   .31
Nrf2, rs6706649           8731   6696   1920   115    0.46 (0.41, 0.51)   .086
Nrf2, rs6721961           9419   7534   1775   110    0.51 (0.46, 0.57)   .64
Nrf2, rs6726395           8692   2580   4323   1789   0.50 (0.48, 0.52)   .78
Nrf2, rs10183914          8671   3617   3975   1079   0.50 (0.47, 0.52)   .8
TNF, rs1800629            8685   5679   2699   307    0.49 (0.46, 0.52)   .53
UGB, rs3741240            8671   3602   4031   1038   0.48 (0.46, 0.50)   .079

In both of these cases, we start with a dataset (auto.dta for table 1 and the example resultsset for table 2) and simply decode, label, and list the variables we want to tabulate. The tools used to do this are sdecode (to decode the numbers and label the rows), chardef (to label the columns), and listtab (including the listtab_vars command) to output the table as a LaTeX tabular environment to the Stata log or the Results window or an output file, from which it was cut and pasted into the LaTeX version of this article.

3.1 Numeric to string conversion using sdecode

The decoding and row labeling are done by the sdecode package, which is a “super” version of decode (see [D] decode). It has the companion package sencode, a “super” version of encode (see [D] encode). Both packages are downloadable from SSC. The sencode package converts a string variable to a value-labeled numeric variable and is used extensively to produce graphs, because a string variable must usually be converted to numeric before it can be used as an axis variable on a graph. By contrast, the sdecode package converts a numeric variable (value-labeled or unlabeled) to a string variable and is used extensively in producing tables. Numeric variables in tables are usually formatted to a specified number of decimal places or significant figures and often are formatted to scientific notation with superscripts. They may have added parentheses (if they are confidence limits) or stars (if they are p-values), and they may also have additional prefixes or suffixes indicating justification (left, right, or center) or fonts (bold, italic, or nonstandard size). To cater for all these possibilities, sdecode has some extra options that are not available for decode:

1. The user can include a generate() option to specify that a new string output variable will be generated and can include a replace option to specify that the string output variable will replace the numeric input variable, inheriting its position in the dataset and its characteristics (see [P] char).

2. Unlabeled numeric values can be converted to string by using formats, possibly specified with a format() option, as with the tostring command (see [D] destring).

3. The esub() option allows the user to convert format-generated string values containing embedded e- or e+ substrings, indicating scientific notation, to a number of exponentiation formats. These include esub(texsuper) (converting exponents to TeX superscripts), esub(htmlsuper) (converting exponents to HTML superscripts), and esub(rtfsuper) (converting exponents to RTF superscripts).

4. The ftrim option allows the user to trim format-generated string values, removing spaces on the left and on the right.

5. The xmlsub option replaces output substrings &, <, and > with the XML escape sequences &amp;, &lt;, and &gt;, respectively.

6. The prefix() and suffix() options allow the user to add a prefix and a suffix, respectively, to the output string values.

These extra options allow numeric variables to be converted to string equivalents suited to a wide variety of document formats. However, to make life easier for users, sdecode has been extended further by the provision of the bmjcip package, also downloadable from SSC, which is a front end for sdecode. bmjcip inputs up to four numeric variables, which it interprets as a p-value, an estimate with a p-value, an estimate with two confidence limits, or an estimate with two confidence limits and a p-value, depending on whether the user supplies one, two, three, or four input variables, respectively. It decodes these variables, replacing them with their decoded string versions, prefixed and suffixed appropriately.
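For example, a p-value variable might be decoded along the following lines (a hypothetical fragment; the variable name and format are chosen only for illustration):

* Hypothetical sketch of the sdecode options described above
sdecode pvalue, replace format(%8.1e) esub(texsuper) ftrim prefix("$") suffix("$")

The same prefixes, suffixes, and exponent substitutions could equally be produced through the bmjcip front end, as in the worked examples below.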

3.2 Characteristic assignment using chardef

As stated previously, users are advised to store the column labels for tables output by listtab in characteristics assigned to the corresponding variables (see [P] char). To reduce the number and complexity of char define statements required, the chardef package, downloadable from SSC, has been provided to semiautomate this assignment. chardef assigns a user-specified characteristic for a list of variables, assigning a list of corresponding values that may be prefixed or suffixed using the prefix() option or the suffix() option. The characteristics may be cleared using the charundef command of the chardef package.

The aim of this mass production (and mass destruction) of variable characteristics is to enable the user to create header lines in local macros to be input to the headlines() option of listtab. The listtab package includes the command listtab_vars, which inputs a list of variables (with a row style) that will later be output using listtab and outputs a header line to a local macro. This header line contains a list of the variable names, the variable labels, the values of a named variable characteristic, or another attribute of the variables, separated by the delimiter() string of the row style, and prefixed and suffixed by the begin() and end() strings of the row style, respectively. This local macro can then be input to listtab as part of the headlines() string list option, to provide one or more rows of column labels in the output table.

3.3 The code to create table 1

Having introduced the packages to be used, I can now present the Stata code that combines them to produce table 1. If we assume that auto.dta is in the memory, then this code (and its output to the Stata log) is as follows:

. preserve
. keep if word(make,1) == "VW"
(70 observations deleted)
. sdecode mpg, replace
. sdecode weight, replace
. sdecode price, replace prefix("\\$")
. chardef make mpg weight price, char(varname) prefix("\textsl{") suffix("}")
>   values("Make and Model" "Mileage (mpg)" "Weight (lbs.)" "Price")
. listtab_vars make mpg weight price, substitute(char varname) rstyle(tabular)
>   local(h1)
. listtab make mpg weight price, type rstyle(tabular)
>   headlines("\begin{tabular}{rrrr}" "\hline" "`h1'" "\hline")
>   footlines("\hline" "\end{tabular}")
\begin{tabular}{rrrr}
\hline
\textsl{Make and Model}&\textsl{Mileage (mpg)}&\textsl{Weight (lbs.)}&\textsl{Price}\\
\hline
VW Dasher&23&2,160&\$7,140\\
VW Diesel&41&2,040&\$5,397\\
VW Rabbit&25&1,930&\$4,697\\
VW Scirocco&25&1,990&\$6,850\\
\hline
\end{tabular}
. restore

We start by using preserve to save a copy of the original auto.dta, which in this case is not a resultsset. We then discard the data on all non-Volkswagen car models by using keep, and we use sdecode to convert the numeric variables mpg, weight, and price to string variables, prefixed in the case of price by a LaTeX dollar sign. (Note that we do not need to create a row label variable because the existing variable make will play that role and the role of a key for the dataset.) We then specify column labels by using chardef, which assigns the characteristic varname of the variables make, mpg, weight, and price with a list of values specified by the values() option, prefixed by \textsl{ and suffixed by }, so the column labels will be slanted in a LaTeX environment.

Then we use the listtab_vars command to combine the varname characteristics of the variables make, mpg, weight, and price into an output local macro h1 containing a header row for the table. We do this by using the option rstyle(tabular) to specify that the header row be in the LaTeX tabular row style and by using the option substitute(char varname) to specify that the cells in the header row be taken from the variable characteristic varname instead of from the variable names. After this, we use listtab with a headlines() option to specify table headlines including the local macro h1 and a footlines() option to specify table footlines. The listtab command outputs the variables make, mpg, weight, and price to the Stata log in the form of some alien-looking output, which readers may recognize as a LaTeX tabular environment, ready to be cut and pasted into a LaTeX document to produce table 1. Finally, the restore command restores the original auto.dta.

3.4 The code to create table 2

As a less trivial example, I can now present the Stata code to produce table 2. If we assume that the example resultsset is in the memory, then this code (and its output to the Stata log) is as follows:

. preserve
. keep if source==1
(24 observations deleted)
. keep source polygp poly N _freq* homhet min* max* p
. format homhet min* max* %8.2f
. bmjcip homhet min* max* p, esub(texsuper) prefix($) suffix($)
. foreach Y of var N _freq* {
  2. sdecode `Y', replace esub(texsuper) prefix($) suffix($)
  3. }
. sdecode polygp, generate(S_1)
. sdecode poly, generate(S_2)
. generate row=S_1+", "+S_2
. chardef row N _freq* homhet min* max* p,
>   char(varname) prefix("\textsl{") suffix("}")
>   values("Polymorphism (by group)" N GG Gg gg Ratio "(95\%" "CI)" P)
. listtab_vars row N _freq* homhet min* max* p, rstyle(tabular)
>   substitute(char varname) local(h1)

. listtab row N _freq* homhet min* max* p using texdemo1_2.tex,
>   replace rstyle(tabular)
>   headlines("\begin{tabular}{lrrrrrrrl}" "\hline" "`h1'" "\hline")
>   footlines("\hline" "\end{tabular}")
. restore

We start by using preserve to save a copy of the original resultsset (with data on genotype frequencies in children and in mothers). We keep only the subset source==1, because the variable source is 1 for child results and 2 for maternal results. We then reset the formats of the homozygote or heterozygote ratios and their confidence limits to %8.2f (implying two decimal places), and we use bmjcip to decode the ratios, their confidence limits, and their p-values from numeric variables to string variables. We then use a foreach loop to decode the variables N, _freq0, _freq1, and _freq2 to string variables. For all of these decoded variables, we use the options prefix($) and suffix($) to output the numbers in LaTeX math mode; this was not really necessary but might have been necessary if there had been any exponentiated numbers (such as very small p-values or astronomical sample numbers).

We label the table rows, using sdecode to decode the key variables polygp and poly to new variables S_1 and S_2, respectively, and combining their values to produce a new row label variable row. We label the table columns—which are the variable row and the decoded statistical variables—using chardef, which assigns the characteristic varname to contain slanted table header cells, in the way earlier demonstrated in the creation of table 1. We can now use listtab_vars to input the table column variables and output a header row to the local macro h1, again in the way demonstrated in the creation of table 1.

Having created this header row, we can use listtab, which again uses the local macro h1 as part of its headlines() option. The listtab command creates a LaTeX tabular environment, even more alien-looking than the one for table 1, and (fortunately) does not type it to the Stata log but writes it to the file texdemo1_2.tex. This file may be cut and pasted into a LaTeX document or included using \input to produce table 2. Finally, the restore command restores the original resultsset, with the original numeric statistical variables and observations for child and maternal results, ready to be used to produce further resultstables.

4 Parallel tables: Decode, label, reshape, list

In our next example, we demonstrate the production of table 3. Unlike table 2, table 3 does not include the individual genotype frequencies or the p-values. However, it does include sample numbers, homozygote or heterozygote ratios, and confidence limits both for child genotypes and for maternal genotypes. The child and maternal results are displayed side by side in the same table rows. This can be done by using the official Stata reshape command (see [D] reshape) or the SSC package xrewide.

Table 3. Child and maternal homozygote or heterozygote ratios

                      Child:                      Maternal:
Polymorphism          N      Ratio (95% CI)       N      Ratio (95% CI)

AHR, rs2066853        8704   0.47 (0.43, 0.53)    8100   0.57 (0.52, 0.63)
CYP1A1, rs1048943     8764   0.64 (0.49, 0.84)    8130   0.54 (0.38, 0.75)
CYP1A1, rs4646903     8587   0.58 (0.53, 0.65)    7990   0.51 (0.45, 0.57)
CYP2A6, rs1801272     8666   0.60 (0.43, 0.84)    8088   0.61 (0.44, 0.84)
CYP2A6, rs28399433    8691   0.59 (0.49, 0.69)    8055   0.49 (0.41, 0.60)
GCL, rs17883901       9389   0.50 (0.44, 0.58)    5644   0.47 (0.39, 0.56)
GPX, rs713041         8611   0.52 (0.50, 0.54)    8015   0.49 (0.47, 0.51)
GSTM1, GSTM1 CNV      8399   0.52 (0.49, 0.55)    7497   0.51 (0.49, 0.55)
GSTP1, rs947894       8692   0.50 (0.48, 0.53)    8000   0.48 (0.46, 0.50)
GSTT1, GSTT1 CNV      7591   0.52 (0.50, 0.55)    6674   0.51 (0.49, 0.54)
IGF, rs35767          9383   0.50 (0.47, 0.54)    7434   0.53 (0.49, 0.57)
IGF, rs2854744        8489   0.50 (0.48, 0.53)    6215   0.50 (0.47, 0.52)
LTA, rs909253         8664   0.49 (0.47, 0.51)    8073   0.49 (0.46, 0.51)
MIF, rs755622         9500   0.50 (0.46, 0.53)    7397   0.51 (0.47, 0.55)
MTNR1B, rs10830963    8584   0.52 (0.49, 0.54)    7987   0.49 (0.47, 0.52)
Nrf2, rs1806649       8606   0.50 (0.47, 0.53)    8095   0.52 (0.49, 0.55)
Nrf2, rs1962142       8763   0.50 (0.44, 0.56)    8119   0.49 (0.43, 0.56)
Nrf2, rs2364723       8725   0.51 (0.49, 0.54)    8093   0.50 (0.47, 0.52)
Nrf2, rs6706649       8731   0.46 (0.41, 0.51)    8071   0.51 (0.46, 0.56)
Nrf2, rs6721961       9419   0.51 (0.46, 0.57)    5630   0.50 (0.43, 0.57)
Nrf2, rs6726395       8692   0.50 (0.48, 0.52)    8061   0.51 (0.49, 0.53)
Nrf2, rs10183914      8671   0.50 (0.47, 0.52)    8037   0.52 (0.49, 0.54)
TNF, rs1800629        8685   0.49 (0.46, 0.52)    8114   0.51 (0.48, 0.55)
UGB, rs3741240        8671   0.48 (0.46, 0.50)    7952   0.51 (0.49, 0.54)

4.1 Reshaping tables using reshape and xrewide

The reshape command reshapes a dataset in memory from a long form to a wide form (using its reshape wide syntax) or from a wide form to a long form (using its reshape long syntax). If we think of tables as Stata datasets, then we can think of using reshape wide to convert two versions of table 2 (the child and maternal versions) into table 3. Alternatively, we can think of using reshape long to convert table 2 into a table form similar to some of those created by the esttab package (Jann 2007), in which the sample numbers, ratios, confidence limits, and p-values are stacked vertically. (This might be done using an integer-valued j() variable named stat, with value labels like “N”, “Ratio”, “95% CI”, and “P”; a minimal sketch of this stacked layout is given at the end of this subsection.)

The package xrewide is an extended version of reshape wide that stores additional results in variable characteristics or local macros. These additional results can be useful if the dataset is output to a table by using listtab. To store these results, xrewide has the following options, in addition to those available with reshape wide:

1. The cjvalue() and cjlabel() options specify the names of variable characteristics to be assigned to the generated reshaped output variables and to contain the corresponding values and value labels, respectively, of the j() variable.

2. The vjvalue() and vjlabel() options specify the names of subsets of the input variables to be reshaped, whose corresponding output variables will be assigned the variable characteristics named by cjvalue() and cjlabel(), respectively. (By default, all output variables are assigned these characteristics.)

3. The pjvalue() and sjvalue() options specify a prefix and a suffix, respectively, to be added to the variable characteristic named by cjvalue(). The pjlabel() and sjlabel() options specify a prefix and a suffix, respectively, for the cjlabel() variable characteristic.

4. The xmlsub option specifies that XML escape sequence substitutions be performed for the characters &, <, and > in the cjlabel() variable characteristic.

5. Finally, the lxjk() and lxkj() options specify the names of local macros, to be assigned with full lists of the generated reshaped output variables. In the case of the lxjk() macro, these lists are sorted primarily by the corresponding j() value and secondarily by the corresponding input variable; in the case of the lxkj() macro, these lists are sorted primarily by the corresponding input variable and secondarily by the corresponding j() value.

These extra saved results, stored in variable characteristics or local macros, can then be passed to listtab and listtab_vars to save the user the trouble of manually typing column header rows or variable lists. The saved results can save a lot of work if a colleague suddenly decides that three columns of confidence intervals were wanted instead of two or that the columns need to be relabeled.
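As a sketch of the long, esttab-like layout mentioned at the start of this subsection (a hypothetical fragment: the combined ci variable and the value1–value4 names are invented for illustration, and the statistical variables are assumed to have been decoded to strings already), official reshape long could stack the statistics vertically:

* Hypothetical sketch: stack N, ratio, confidence interval, and p-value vertically
rename (N homhet ci p) (value1 value2 value3 value4)
reshape long value, i(polygp poly) j(stat)
label define stat 1 "N" 2 "Ratio" 3 "95% CI" 4 "P"
label values stat stat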

4.2 The code to create table 3

This code and its Stata log output are as follows:

. preserve
. keep source polygp poly N homhet min* max*
. format homhet min* max* %8.2f
. bmjcip homhet min* max*, esub(texsuper) prefix($) suffix($)
. sdecode N, replace esub(texsuper) prefix($) suffix($)
. sdecode polygp, generate(S_1)
. sdecode poly, generate(S_2)
. generate row=S_1+", "+S_2
. chardef row N homhet min* max*,
>   char(varname) prefix("\textsl{") suffix("}")
>   values("Polymorphism" N Ratio "(95\%" "CI)")
. chardef row N homhet min* max*, char(justify) values(l r r r r)

. xrewide N homhet min* max*, i(polygp poly) j(source)
>   cjlabel(varname2) vjlabel(N) pjlabel("\textsl{") sjlabel(":}")
>   lxjk(nonrowvars)
Data                               long   ->   wide

Number of obs.                       48   ->      24
Number of variables                  10   ->      13
j variable (2 values)            source   ->   (dropped)
xij variables:
                                      N   ->   N1 N2
                                 homhet   ->   homhet1 homhet2
                                  min95   ->   min951 min952
                                  max95   ->   max951 max952

. listtab_vars row `nonrowvars', rstyle(tabular) substitute(char varname)
>   local(h1)
. listtab_vars row `nonrowvars', rstyle(tabular)
>   substitute(char varname2) local(h2)
. listtab_vars row `nonrowvars', substitute(char justify) local(just)
. listtab row `nonrowvars' using texdemo1_3.tex, replace rstyle(tabular)
>   headlines("\begin{tabular}{`just'}" "\hline" "`h2'" "`h1'" "\hline")
>   footlines("\hline" "\end{tabular}")
. restore

We start as before by preserving the input resultsset, keeping only the key variables and the variables to be tabulated, and formatting the estimates and confidence limits to two decimal places. We then use bmjcip as before to decode the estimates and confidence limits (this time without p-values). We use sdecode to create the row label variable row as before. Note that, at this point, the dataset is keyed by the variables source, polygp, and poly, and the two observations for each polymorphism with different values of source (containing child and maternal statistics, respectively) will have the same value of row. We then use chardef to assign values to two variable characteristics, whose names are varname and justify, containing, respectively, the header cell and the justification for each variable in the table (l for left-justified, r for right-justified).

We can then use xrewide to reshape the dataset to a wide format, keyed by the variables polygp and poly. The wide format also has two sets of statistical variables, the first containing the sample numbers, estimates, and confidence limits for child genotypes, and the second containing the sample numbers, estimates, and confidence limits for maternal genotypes. Note that the characteristics varname and justify, assigned to the input statistical variables to be reshaped, are inherited automatically by the corresponding generated reshaped output variables. The output variables N1 and N2 also have the second characteristic varname2, as specified by the cjlabel() option, containing the value labels of the j() variable source, prefixed and suffixed by the pjlabel() and sjlabel() options, to be slanted and terminated with colons in LaTeX. The option lxjk(nonrowvars) assigns the local macro nonrowvars with the variable list

N1 homhet1 min951 max951 N2 homhet2 min952 max952

containing the reshaped statistical variables in the order in which they are to be output. We then use listtab_vars with the option rstyle(tabular) to create header rows in local macros h1 and h2 containing the variable characteristics varname and varname2, respectively. We use listtab_vars with no rstyle() option to create an unprefixed, unsuffixed, and unseparated list of justify characteristics in the local macro just, with the value lrrrrrrrr indicating that the first column of table 3 will be left-justified and the other columns of table 3 will be right-justified.

The listtab command then creates a LaTeX tabular environment and writes it to the file texdemo1_3.tex, where it can be copied and pasted or \input into a table in a LaTeX document, which appears as table 3. Finally, we restore the original resultsset to the memory.

5 Gapped tables: Decode, label, reshape, insert gaps, list

Table 3 can be criticized because too much horizontal space is allocated in the first column to repetitive displays of the polymorphism groups (particularly the group “Nrf2”) so that a small font has to be used. Table 4 might be viewed as an improvement on table 3, because each polymorphism group starts with a gap row in which the polymorphism group name is displayed in bold type and left-justified, so it does not need to be repeated in the following rows. The first column therefore needs less horizontal space, allowing the other columns to have more. The gap rows make the table less dense and more clear, at the price of using more vertical space. To insert gap rows into tables, we use the ingap package.

Table 4. Child and maternal homozygote or heterozygote ratios (with gaps)

                      Child:                      Maternal:
Polymorphism          N      Ratio (95% CI)       N      Ratio (95% CI)
  (by group)

AHR
  rs2066853           8704   0.47 (0.43, 0.53)    8100   0.57 (0.52, 0.63)
CYP1A1
  rs1048943           8764   0.64 (0.49, 0.84)    8130   0.54 (0.38, 0.75)
  rs4646903           8587   0.58 (0.53, 0.65)    7990   0.51 (0.45, 0.57)
CYP2A6
  rs1801272           8666   0.60 (0.43, 0.84)    8088   0.61 (0.44, 0.84)
  rs28399433          8691   0.59 (0.49, 0.69)    8055   0.49 (0.41, 0.60)
GCL
  rs17883901          9389   0.50 (0.44, 0.58)    5644   0.47 (0.39, 0.56)
GPX
  rs713041            8611   0.52 (0.50, 0.54)    8015   0.49 (0.47, 0.51)
GSTM1
  GSTM1 CNV           8399   0.52 (0.49, 0.55)    7497   0.51 (0.49, 0.55)
GSTP1
  rs947894            8692   0.50 (0.48, 0.53)    8000   0.48 (0.46, 0.50)
GSTT1
  GSTT1 CNV           7591   0.52 (0.50, 0.55)    6674   0.51 (0.49, 0.54)
IGF
  rs35767             9383   0.50 (0.47, 0.54)    7434   0.53 (0.49, 0.57)
  rs2854744           8489   0.50 (0.48, 0.53)    6215   0.50 (0.47, 0.52)
LTA
  rs909253            8664   0.49 (0.47, 0.51)    8073   0.49 (0.46, 0.51)
MIF
  rs755622            9500   0.50 (0.46, 0.53)    7397   0.51 (0.47, 0.55)
MTNR1B
  rs10830963          8584   0.52 (0.49, 0.54)    7987   0.49 (0.47, 0.52)
Nrf2
  rs1806649           8606   0.50 (0.47, 0.53)    8095   0.52 (0.49, 0.55)
  rs1962142           8763   0.50 (0.44, 0.56)    8119   0.49 (0.43, 0.56)
  rs2364723           8725   0.51 (0.49, 0.54)    8093   0.50 (0.47, 0.52)
  rs6706649           8731   0.46 (0.41, 0.51)    8071   0.51 (0.46, 0.56)
  rs6721961           9419   0.51 (0.46, 0.57)    5630   0.50 (0.43, 0.57)
  rs6726395           8692   0.50 (0.48, 0.52)    8061   0.51 (0.49, 0.53)
  rs10183914          8671   0.50 (0.47, 0.52)    8037   0.52 (0.49, 0.54)
TNF
  rs1800629           8685   0.49 (0.46, 0.52)    8114   0.51 (0.48, 0.55)
UGB
  rs3741240           8671   0.48 (0.46, 0.50)    7952   0.51 (0.49, 0.54)

5.1 Inserting gap rows by using ingap

The ingap package is downloadable from SSC and was described briefly in Newson (2003). It is a comprehensive package for inserting user-selected numbers of gap observations into user-selected positions in a dataset or by-group, with user-selected values assigned to the table row variable. However, usually most users will probably only want to insert one gap observation at the beginning of each of a number of by-groups and to set the row variable in each of these new observations to be equal to a preexisting gap-row string variable, whose value will be constant within each by-group. The syntax for doing this is

bysort keyvarlist1 gaprowvarname (keyvarlist2): ingap, neworder(newvar)
    rowlabel(varname) grexpression(gap row label expression)

where keyvarlist1 is a list of key variables defining the by-groups that will each have one gap observation added at the beginning, gaprowvarname is the name of the preexisting gap-row string variable, keyvarlist2 is a list of key variables defining the order of preexisting observations within the by-group, varname is the name of the preexisting string row label variable, and newvar is the name of a new variable to be generated by ingap containing in each observation the sequential order of that observation within the by-group once the gap observations have been added.

The dataset is usually keyed by the variables keyvarlist1 and keyvarlist2 before the execution of ingap and is keyed by keyvarlist1, gaprowvarname, and newvar after the execution of ingap. Therefore, the variable newvar defined in neworder() will be 1 for the gap observation in each by-group defined by keyvarlist1 and will have integer values starting from 2 in the remaining observations in each by-group. The gap-row variable specified by gaprowvarname is defined, usually using sdecode, as a function of the variables in keyvarlist1 and will therefore be constant within each by-group. For instance, in our example resultsset, once the decoding, labeling, and reshaping have been done and a gap-row variable gaprow has been defined as a function of the polymorphism group variable polygp, the command

. bysort polygp gaprow (poly): ingap, rowlabel(row) grexpression(gaprow)
>   neworder(rowseq)

will insert a gap row at the beginning of each by-group defined by polygp, will set the value of the string variable row in that gap row to be equal to the value of gaprow in that by-group, and will create a new integer-valued variable rowseq equal to 1 in the gap observation and to integers starting at 2 in the other observations in the by-group in the ascending order defined by the variable poly. The dataset will then be keyed by polygp, gaprow, and rowseq.
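As a toy illustration of this syntax outside the example resultsset (a hypothetical sketch using auto.dta), a gap row labeled “Domestic” or “Foreign” could be inserted before the cars of each origin:

* Toy sketch of ingap, assuming the syntax described above
sysuse auto, clear
keep make foreign mpg
sdecode foreign, generate(gaprow)     // gap-row label: "Domestic" or "Foreign"
generate row = make                   // row label for the ordinary rows
bysort foreign gaprow (make): ingap, rowlabel(row) grexpression(gaprow) neworder(rowseq)
* the dataset is now keyed by foreign, gaprow, and rowseq, ready for listtab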

5.2 The code to create table 4

The code to do this in the example resultsset and its Stata log is as follows:

. preserve
. keep source polygp poly N homhet min* max*
. format homhet min* max* %8.2f
. bmjcip homhet min* max*, esub(texsuper) prefix($) suffix($)
. sdecode N, replace esub(texsuper) prefix($) suffix($)
. sdecode poly, generate(row) prefix("\hfill ")
. chardef row N homhet min* max*,
>   char(varname) prefix("\textsl{") suffix("}")
>   values("Polymorphism (by group)" N Ratio "(95\%" "CI)")
. chardef row N homhet min* max*, char(justify) values("p{1in}" r r r r)
. xrewide N homhet min* max*, i(polygp poly) j(source)
>   cjlabel(varname2) vjlabel(N) pjlabel("\textsl{") sjlabel(":}")
>   lxjk(nonrowvars)
Data                               long   ->   wide

Number of obs.                       48   ->      24
Number of variables                   8   ->      11
j variable (2 values)            source   ->   (dropped)
xij variables:
                                      N   ->   N1 N2
                                 homhet   ->   homhet1 homhet2
                                  min95   ->   min951 min952
                                  max95   ->   max951 max952

. sdecode polygp, generate(gaprow) prefix("\textbf{") suffix("}\hfill")
. bysort polygp gaprow (poly): ingap, rowlabel(row) grexpression(gaprow)
>   neworder(rowseq)
. listtab_vars row `nonrowvars', rstyle(tabular) substitute(char varname) local(h1)
. listtab_vars row `nonrowvars', rstyle(tabular) substitute(char varname2)
>   local(h2)
. listtab_vars row `nonrowvars', substitute(char justify) local(just)
. listtab row `nonrowvars' using texdemo1_4.tex, replace rstyle(tabular)
>   headlines("\begin{tabular}{`just'}" "\hline" "`h2'" "`h1'" "\hline")
>   footlines("\hline" "\end{tabular}")
. restore

Once again, we start by preserving the original resultsset and proceed as we did for table 3 up to xrewide. Here though, the row label variable row now depends only on poly and not on polygp. The characteristic justify, for this row variable, is set to p{1in}, implying that this row variable will be neither left-justified nor right-justified but will have a fixed width of 1 inch in the table, with justification depending on the value of the variable row. In the original variable row, before the addition of the gap observations, the values are all prefixed with \hfill, which causes the values to be right-justified.

Once xrewide has been executed, producing a wide dataset keyed by polygp and poly as before, we use sdecode to produce the gap-row variable gaprow, containing the decoded value of polygp, prefixed by \textbf{ and suffixed by }\hfill, which implies a bold font and left-justification. We then execute ingap, adding the gap observations. After that, we can use listtab_vars three times as before, use listtab once as before (outputting the tabular environment to the file texdemo1_4.tex), and restore the original resultsset. The tabular environment can then be pasted or \input into a LaTeX document to form table 4.

6 Creating RTF documents using rtfutil

listtab can create tables in many formats other than LaTeX. In particular, it can be used with the SSC package rtfutil to create tables in Microsoft RTF, described in plain language by Burke (2003).

RTF tables are less simple to produce than LaTeX tables for two reasons:

1. It is not usually a good idea to cut and paste an RTF table into an RTF document by using a text editor (although it can be done). This is because RTF documents (especially those produced using Microsoft Word) may contain a lot of incomprehensible code, making it difficult to guess where the table should be pasted. If it is pasted in the wrong place, then Microsoft Word will often fail uninformatively when it tries to open the document.

2. There is no unique RTF row style, defined by begin(), delimiter(), end(), and missnum() options for listtab. This is because in RTF tables, the number of cells per row, their widths, and other formatting features are defined separately for each row as part of the begin() string or the end() string.

The first of these problems is solved using the handle() option of listtab, which allows listtab to output to an existing open file with a file handle as created by the file command of Stata (see [P] file). The rtfutil package has the command rtfopen, which opens an RTF file by using file open and outputs the beginning of an RTF document, and the command rtfclose, which outputs the end of an RTF document and closes the file by using file close. The listtab commands to create tables can then be inserted between the rtfopen and rtfclose commands.

The second of these problems is solved using the rtfrstyle command of rtfutil, which creates RTF row styles. It inputs a list of variables to be output by listtab, with their column widths and other formatting information, and it outputs to a list of local macros specified in a local() option, which will contain the begin(), delimiter(), end(), and possibly missnum() strings that will later be input to listtab. Column widths are specified using the cwidths() option of rtfrstyle and are expressed in twips (1440 twips = 1 inch), which are the units used internally by RTF documents.

It is a good idea for all commands between an rtfopen command and an rtfclose command to be enclosed in a capture noisily block. This ensures that if any of these commands fails, then the RTF file will automatically be closed and will be available for inspection by the user.
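Putting these pieces together, the overall shape of an RTF-producing do-file might be sketched as follows (filenames and variables are hypothetical; the full worked example is given in the next section):

* Hypothetical skeleton for writing one RTF table, using the rtfutil commands described above
tempname hh
rtfopen `hh' using "mytables.rtf", replace
capture noisily {
    rtfrstyle row stat1 stat2, cwidths(1500 1000) local(b d e)
    listtab row stat1 stat2, handle(`hh') begin(`b') delimiter(`d') end(`e')
}
rtfclose `hh'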

6.1 The code to create an RTF version of table 4

In this example, we start with the example resultsset and create an RTF document rtfdemo1.rtf containing an RTF version of table 4. The code is enclosed in a document-generating block starting with a tempname command to create a file handle name, followed by an rtfopen command, followed by the beginning of a capture noisily block, and ending with an rtfclose command preceded by the end of the capture noisily block.

. #delim ;
. /* Beginning of document-generating block */ ;
. tempname rtfb1;
. rtfopen `rtfb1' using "rtfdemo1.rtf", replace
>   margins(1000 1000 1000 1000) template(fnmono1);
. capture noisily {;
. /* Create table */;
. preserve;
. keep source polygp poly N homhet min* max*;
. format homhet min* max* %8.2f;
. bmjcip homhet min* max*, esub(rtfsuper) pref("\qr{") suff("}");
. sdecode N, replace esub(rtfsuper) pref("\qr{") suff("}");
. sdecode poly, generate(row) pref("\qr{") suff("}");
. chardef row N homhet min* max*, char(varname) pref("\qr{\i ") suff("}")
>   val("Polymorphism (by group)" N Ratio "(95%" "CI)");
. xrewide N homhet min* max*, i(polygp poly row) j(source)
>   cjlab(varname2) vjlab(N) pjlab("\ql{\i ") sjlab(" DNA:}") lxjk(nonrow);
Data                               long   ->   wide

Number of obs.                       48   ->      24
Number of variables                   8   ->      11
j variable (2 values)            source   ->   (dropped)
xij variables:
                                      N   ->   N1 N2
                                 homhet   ->   homhet1 homhet2
                                  min95   ->   min951 min952
                                  max95   ->   max951 max952

. sdecode polygp, generate(gaprow) pref("\ql{\b ") suff(":}");
. bysort polygp gaprow (poly): ingap, row(row) grex(gaprow) neword(rowseq);
. rtfrstyle row `nonrow', cwidths(1500 1000) local(b d e);
. listtab_vars row `nonrow', sub(char varname) b(`b') d(`d') e(`e') local(h1);
. listtab_vars row `nonrow', sub(char varname2) b(`b') d(`d') e(`e') local(h2);
. file write `rtfb1'
>   "{\pard\b"
>   " Geometric mean homozygote/heterozygote ratios"
>   " for biallelic polymorphisms in child and maternal DNA"
>   "\par}"
>   _n;

. listtab row `nonrow', handle(`rtfb1') b(`b') d(`d') e(`e')
>   head("`h2'" "`h1'");
. restore;
. };
. rtfclose `rtfb1';
. /* End of document-generating block */;

Inside the capture noisily block, the commands up to ingap are mostly similar to the corresponding commands to create table 4 except that the prefixes and suffixes are RTF prefixes and suffixes. The prefix–suffix pair \ql{ and } specifies left-justification, the prefix–suffix pair \qr{ and } specifies right-justification, and both pairs can be used with or without the added \i or \b appended to the prefix to specify italic or bold font, respectively.

The rtfrstyle command inputs the variables to be output by listtab, together with the option cwidths(1500 1000) to specify that the first column have a width of 1500 twips and that all other columns have widths of 1000 twips, and with the local() option to specify that the begin(), delimiter(), and end() strings be stored in the local macros b, d, and e, respectively. These local macros are used by listtab_vars to create the header rows, stored in the local macros h1 and h2, and by listtab to output the table to the RTF file after the RTF table heading has been output using a file write command. The RTF file produced, with the name rtfdemo1.rtf, is part of the supplemental online material for this article and can be downloaded using Stata.

7 Possible extensions

The team of packages listtab, sdecode, chardef, xrewide, and ingap can be used with document formats other than TeX and RTF, such as HTML and possibly XML-based formats yet to be invented. The packages are good at enforcing the nested block structure commonly used in such formats, because each prefix() option of sdecode or chardef can have its suffix() option, each pjlabel() option of xrewide can have its sjlabel() option, each begin() option of listtab can have its end() option, and each headlines() option of listtab can have its footlines() option. Likewise, each rtfopen command (followed by the beginning of a capture noisily block) can have its rtfclose command (preceded by the end of the same capture noisily block).

It is possible to produce parallel column groups nested in larger parallel column groups by using multiple calls to xrewide. Similarly, it is possible to produce gap-prefixed by-groups nested in larger gap-prefixed by-groups by using multiple calls to ingap. The smaller by-groups might be preceded by gap-row labels in a smaller bold font, while the larger by-groups might be preceded by gap-row labels in a larger bold font.

It is also possible to produce multipage tables in a document by putting a listtab command with an if qualifier inside a program loop. Alternatively, we can put the whole set of commands from preserve to restore inside a program loop and include a keep if command to select the subset of observations for a page (a sketch of this approach appears below).

It should also be possible to write user-friendly front-end packages by using the low-level programming toolkit described here to produce standard documents, in particular formats containing standard-format tables. So far, I have not done this, because my colleagues and collaborators seem to prefer highly customized RTF resultstables to standard-format RTF resultstables.
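For instance, a multipage table with one page per DNA source might be sketched as follows (a hypothetical fragment, assuming that the decoding and row-labeling steps of section 3.4 have already been applied without first dropping the maternal observations):

* Hypothetical sketch: one LaTeX tabular per DNA source, written to separate files
forvalues s = 1/2 {
    preserve
    keep if source == `s'
    listtab row N homhet min95 max95 p using page`s'.tex, replace rstyle(tabular)
    restore
}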

Finally, a Stata program that inputs a resultsset to output resultstables to TeX, RTF, and other documents can also output resultsplots of the same results to the same documents, produced using Stata graphics commands (see [G-2] graph). In particular, the rtfutil package includes an rtflink command to include graphical output files, such as Encapsulated PostScript files produced using graph export, as linked objects in an RTF document. The user may then convert these linked objects to embedded objects by using Microsoft Word. I find that this possibility is a major advantage of a resultsset-based solution over the purely table-based solutions used by estout, outreg, outreg2, and tabout.

8 Acknowledgments

I would like to thank Ben Jann of the Eidgenössische Technische Hochschule in Zurich, Switzerland, and Adam Jacobs of Dianthus Medical Limited in London, UK, for their presentations at the 2009 UK Stata Users Group meeting, which inspired a lot of the packages and updates presented in this article. I would also like to thank Philippe Van Kerm of CEPS/INSTEAD in Luxembourg for suggesting that a handle() option for listtex (predecessor to listtab) would be a good idea; Ian White of the MRC Biostatistics Unit in Cambridge, UK, for suggesting that something like the listtab_vars command of listtab would be a good idea; and Lindsey Peters of the United Kingdom Health Protection Agency for some very helpful feedback about RTF table rows output with the rtfutil package. I would also like to thank Nicholas J. Cox of Durham University, UK, for coining the term “resultsset” to describe a Stata dataset of results.

I would also like to thank our collaborators in the ALSPAC Study Team (Institute of Child Health, University of Bristol, Bristol, UK) for allowing the use of their data in this article. In particular, I would like to thank my ALSPAC collaborators John Holloway, John Henderson, and Seif Shaheen for helping me to define the gene labels in tables 2, 3, and 4 of this article. The whole ALSPAC Study Team comprises interviewers, computer technicians, laboratory technicians, clerical workers, research scientists, volunteers, and managers who continue to make the study possible. The ALSPAC study could not have been undertaken without the cooperation and support of the mothers and midwives who took part or without the financial support of the Medical Research Council, the Department of Health, the Department of the Environment, the Wellcome Trust, and other funders. The ALSPAC study is part of the World Health Organization–initiated

European Longitudinal Study of Pregnancy and Childhood. My own work at Imperial College London is financed by the United Kingdom Department of Health.

9 References

Burke, S. M. 2003. RTF Pocket Guide. Sebastopol, CA: O’Reilly.

Date, C. J. 1986. An Introduction to Database Systems: Volume 1. 4th ed. Reading, MA: Addison–Wesley.

Henderson, A. J., R. B. Newson, M. Rose-Zerilli, S. M. Ring, J. W. Holloway, and S. O. Shaheen. 2010. Maternal Nrf2 and glutathione-S-transferase polymorphisms do not modify associations of prenatal tobacco smoke exposure with asthma and lung function in school-aged children. Thorax 65: 897–902.

Jacobs, A. 2009. Improving the output capabilities of Stata with Open Document Format XML. UK Stata Users Group meeting proceedings. http://ideas.repec.org/p/boc/usug09/06.html.

Jann, B. 2007. Making regression tables simplified. Stata Journal 7: 227–244.

———. 2009. Recent developments in output processing. UK Stata Users Group meeting proceedings. http://ideas.repec.org/p/boc/usug09/18.html.

Lamport, L. 1994. LaTeX: A Document Preparation System. 2nd ed. Reading, MA: Addison–Wesley.

Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. Stata Journal 3: 245–269.

Newson, R. B. 2008a. Asymptotic distributions of linear combinations of logs of multinomial parameter estimates. http://www.imperial.ac.uk/nhli/r.newson/miscdocs/linclog1.pdf.

———. 2008b. parmest and extensions. UK Stata Users Group meeting proceedings. http://ideas.repec.org/p/boc/usug08/07.html.

Shaheen, S. O., R. B. Newson, S. M. Ring, M. J. Rose-Zerilli, J. W. Holloway, and A. J. Henderson. 2010. Prenatal and infant acetaminophen exposure, antioxidant gene polymorphisms, and childhood asthma. Journal of Allergy and Clinical Immunology 126: 1141–1148.

About the author

Roger B. Newson is a lecturer in medical statistics at the National Heart and Lung Institute, Imperial College London, UK, working principally in asthma research. He has written several Stata packages that are frequently used together and that define a dialect of the Stata language. These packages can all be installed as a group in versions compatible with Stata 9, 10, 11, or 12 by using Stata do-files (one for each Stata version), which users can download from http://www.imperial.ac.uk/nhli/r.newson/stata.htm.

The Stata Journal (2012) 12, Number 2, pp. 214–241

Menu-driven X-12-ARIMA seasonal adjustment in Stata

Qunyong Wang
Institute of Statistics and Econometrics
Nankai University
Tianjin, China
[email protected]

Na Wu
School of Economics
Tianjin University of Finance and Economics
Tianjin, China

Abstract. The X-12-ARIMA software of the U.S. Census Bureau is one of the most popular methods for seasonal adjustment; the program x12a.exe is widely used around the world. Some software also provides X-12-ARIMA seasonal adjustments by using x12a.exe as a plug-in or externally. In this article, we illustrate a menu-driven X-12-ARIMA seasonal-adjustment method in Stata. Specifically, the main utilities include how to specify the input file and run the program, how to make a diagnostics table, how to import data, and how to make graphs.

Keywords: st0255, sax12, sax12diag, sax12im, sax12del, seasonal adjustment, X-12-ARIMA, menu-driven

1 Introduction

Seasonal data are widely used in time-series analysis, usually at a quarterly or monthly frequency. Some seasonal data have significant seasonal fluctuations, such as data on retail sales, travel, and electricity usage. Seasonality brings many difficulties to model specification, estimation, and inference. In most developed countries, researchers can directly use the seasonally adjusted data issued by the statistics bureau. Unfortunately, this statistical work is unavailable in most developing countries. Their departments of statistics either do not provide seasonally adjusted data or are still in the early stages of doing so. In these cases, researchers must adjust the seasonal data themselves.

Several methods of seasonal adjustment are available, among which X-12-ARIMA of the U.S. Census Bureau and TRAMO-SEATS of the Bank of Spain are the most popular. The programs for both of them are free of charge. The statistics bureaus in many countries either use them directly or make some modifications based on them. The programs can also be run as a plug-in or externally with some statistical software, such as EViews, OxMetrics, and Rats. The U.S. Census Bureau also provides a Windows interface for X-12-ARIMA version 0.3, namely, Win X-12, which can be downloaded from http://www.census.gov/srd/www/winx12/winx12_down.html. In fact, we borrow many ideas from Win X-12 to design our dialog box in Stata.

© 2012 StataCorp LP   st0255

Stata is good at dealing with data, file reading and writing, making graphs, and more, but it currently does not provide X-12-ARIMA seasonal adjustment. In this article, we describe a menu- and command-driven X-12-ARIMA seasonal adjustment with sax12, a Stata interface for the X-12-ARIMA software provided by the U.S. Census Bureau. We use the shell command to run the DOS program x12a.exe.1 Our program is specifically designed for x12a.exe version 0.3 and may not work for previous or future versions. In this article, we introduce several utilities, such as how to generate the specification file, how to import the adjusted series, and how to make graphs.

We first outline the background of seasonal adjustment and then describe the design of the menus and dialog boxes. We illustrate the process using two examples. The first is a seasonal adjustment of a single series, and the second is a seasonal adjustment of multiple series simultaneously.

Several benefits are derived from the menu-driven commands. First, seasonal adjustment has so many options that consulting the reference manual is difficult, let alone trying to remember all the options. The dialog box makes the options easier to use. Second, some options are usually in conflict with others. The warning messages behind the dialog box save you much time by quickly identifying these conflicts. All the dialog boxes have the corresponding command line syntax. The full syntax with a description for each option can be obtained by typing help sax12, help sax12im, help sax12diag, and help sax12del.
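For reference, x12a.exe itself is driven by a specification (.spc) file; outside sax12, it could be run from Stata along the following lines (a sketch that assumes a specification file myseries.spc already exists in the current working directory, alongside x12a.exe):

* Hypothetical sketch: run the X-12-ARIMA executable directly on a specification file
shell x12a myseries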

2 Outline of the X-12-ARIMA seasonal adjustment

The X-12-ARIMA seasonal-adjustment program is an enhanced version of the X-11 Variant of the Census Method II (Shiskin, Young, and Musgrave 1967). We outline the framework of X-12-ARIMA in this section. Please refer to U.S. Census Bureau (2011) for more details. X-12-ARIMA includes two modules: regARIMA (linear regression model with ARIMA time-series errors) and X-11. Three stages are needed to complete the seasonal adjustment: model building, seasonal adjustment, and diagnostic checking.

2.1 Stage I: Regression with ARIMA errors (regARIMA)

In the first stage, regARIMA performs prior adjustment for various effects (such as trading-day effects, seasonal effects, moving holiday effects, and outliers) and forecasts or backcasts of the time series. The general regARIMA model can be written as

φ(B) Φ(B^s) (1 − B)^d (1 − B^s)^D ( y_t − Σ_i β_i x_it ) = θ(B) Θ(B^s) u_t

where B is the backshift operator (B y_t = y_{t−1}); s is the seasonal period; φ(B) = 1 − φ_1 B − ··· − φ_p B^p is the regular autoregressive operator; Φ(B^s) = 1 − Φ_1 B^s − ··· − Φ_P B^{Ps} is the seasonal autoregressive operator; θ(B) = 1 − θ_1 B − ··· − θ_q B^q is the regular moving-average operator; Θ(B^s) = 1 − Θ_1 B^s − ··· − Θ_Q B^{Qs} is the seasonal moving-average operator; the u_t are independent and identically distributed (i.i.d.) with mean 0 and variance σ^2; and d and D are the regular differencing order and seasonal differencing order, respectively.

1. The program is downloadable from http://www.census.gov/srd/www/x12a/x12downv03_pc.html and is free. Our program is limited to the Windows PC version of X-12-ARIMA and is not supported on Linux. Make sure x12a.exe is in your current working directory.

y_t is the dependent variable to adjust. y_t is the original series or its prior-adjusted series, or some type of transformation, including log, square root, inverse, logistic, and Box–Cox power transformations. The transformations are listed in table 1, and the prior adjustments are listed in table 2.

Table 1. Transformations of the original series (X-12-ARIMA Seasonal Adjustment dialog box: Prior tab)

Type          Formula                 Range for y_t    Option in sax12
log           log(y_t)                y_t > 0          transfunc(log) or transpower(0)
square root   1/4 + 2(√y_t − 1)       y_t ≥ 0          transfunc(sqrt) or transpower(0.5)
inverse       2 − 1/y_t               y_t ≠ 0          transfunc(inverse) or transpower(-1)
logistic      log(y_t/(1 − y_t))      0 < y_t < 1
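For instance, the square-root transformation in table 1 could be reproduced by hand as follows (a sketch; y is a hypothetical series variable):

* Hypothetical check of the square-root transformation listed in table 1
generate double y_sqrt = 1/4 + 2*(sqrt(y) - 1)    // defined for y >= 0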

Predefined prior adjustments include length-of-month (or length-of-quarter) and leap-year effects. Length-of-month (or length-of-quarter) adjustment means that each observation of a monthly series is divided by the corresponding length of month (or length of quarter for a quarterly series) and then is rescaled by the average length of month (or quarter). You can define the modes of your own prior-adjustment variables. The mode specifies whether the factors are percentages or ratios, are to be divided out of the series, or are values to be subtracted from the original series. Ratios or percentage factors can only be used with log-transformed data, and subtracted factors can only be used with no transformation. You can also specify whether the prior-adjustment factors are permanent (removed from both the original series and the seasonally adjusted series) or temporary (removed from the original series but not from the seasonally adjusted series).

Table 2. Prior adjustment of the original series (X-12-ARIMA Seasonal Adjustment dialog box: Prior tab)

Predefined           Option in sax12     User defined    Option in sax12

length of month      prioradj(lom)       variables       priorvar(varlist)
length of quarter    prioradj(loq)       mode            priormode(string)
leap year            prioradj(lpyear)    type            priortype(string)

x_it are regression variables that include the predefined variables (such as trading-day effects), the automatically added variables (such as outliers), and the user-defined variables. So in the regARIMA model, the trading-day, holiday, outlier, and other regression effects can be fit and used to adjust the original series prior to seasonal adjustment. The options for the regression variables in sax12 are listed in table 3. The predefined variables are specified in the same way as those in table 4.1 of U.S. Census Bureau (2011). For example, the length of month and seasonal dummy variables can be specified as regpre(lom seasonal). The user-defined variables are assumed to be of type user and already centered. Of course, you can change these options. For example, we add two holiday variables (mb and ma) in the regression model. We can specify these as reguser(mb ma) regusertype(holiday holiday). The significance of different types of variables can be tested using the regic(string) option. For example, we specify regic(td1nolpyear user) to test the significance of the working-day effect and the user-defined variables. The outliers in the model will be discussed shortly.

Table 3. Regression variables in regARIMA model (X-12-ARIMA Seasonal Adjustment dialog box: Regression tab)

Predefined               Option in sax12     User defined    Option in sax12

length of month, etc.    regpre(string)      variables       reguser(varlist)
                                             types           regusertype(string)
                                             center          regusercent(string)
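As a hedged illustration of how these options combine on the command line (the variable sales is a placeholder, and the option values are taken from the examples in the text rather than from a tested run):

    sax12 sales, satype(single) regpre(lom seasonal) reguser(mb ma)   ///
        regusertype(holiday holiday) regic(td1nolpyear user)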

X-12-ARIMA provides capabilities of identification, estimation, and diagnostic checking. Identification of the ARIMA model for the regression errors can be carried out based on sample autocorrelations and partial autocorrelations. You can specify the ARIMA model by hand or let the program automatically select the optimal model from among a set of models. X-12-ARIMA has an automatic model-selection procedure based largely on the automatic model-identification procedure of TRAMO (Gómez and Maravall 2001). X-12-ARIMA will select the optimal model given the maximum difference order and ARMA order, or only given the maximum ARMA order at a fixed difference order. You can also let X-12-ARIMA automatically select the optimal ARIMA model from among a set of models stored in one file.

Once a regARIMA model has been specified, X-12-ARIMA estimates its parameters by maximum likelihood using an iterated generalized least-squares algorithm. Diagnostic checking involves examining the residuals from the fitted model for signs of model inadequacy, including outlier detection and the Ljung–Box Q test. The options of the ARIMA model allowed in sax12 are listed in table 4.

Table 4. ARIMA specifications in regARIMA model (X-12-ARIMA Seasonal Adjustment dialog box: ARIMA tab)

ARIMA model                                     Option in sax12       Example

user defined:                                   ammodel(string)       ammodel((0,1,1)(0,1,1))
automatic selection given maximum order:        ammaxlag(numlist)     ammaxlag(2 2)
                                                ammaxdiff(numlist)    ammaxdiff(2 2)
                                                amfixdiff(numlist)    amfixdiff(2 1)
automatic selection in models stored in file:   amfile(filename)      amfile(d:/x12a/mymodel.mdl)
common option:
  forecast                                      ammaxlead(integer)    ammaxlead(12)
  backcast                                      ammaxback(integer)    ammaxback(12)
  confidence level                              amlevel(real)         amlevel(95)
  sample                                        amspan(string)        amspan(1996.1, 2003.3)
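As a hedged sketch of the two main routes in table 4 (the variable name retail anticipates the example in section 7.1; the specific orders are illustrative):

    * fix the ARIMA model by hand to the airline model, with 12 forecasts
    sax12 retail, satype(single) ammodel((0,1,1)(0,1,1)) ammaxlead(12)

    * or let X-12-ARIMA search up to ARMA order (3 1) and difference order (2 1)
    sax12 retail, satype(single) ammaxlag(3 1) ammaxdiff(2 1) ammaxlead(12)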

An important aspect of the diagnostic checking of time-series models is outlier detection. X-12-ARIMA's approach to outlier detection is based on that of Chang and Tiao (1983) and Chang, Tiao, and Chen (1988), with extensions and modifications as discussed in Bell (1983) and Otto and Bell (1990). Three types of outliers are searched for: additive outliers (AO), temporary change outliers (TC), and level shifts (LS). In brief, this approach involves computing t statistics for the significance of each outlier type at each time point, searching through these t statistics for significant outlier(s), and adding the corresponding AO, LS, or TC regression variable(s) to the model.

X-12-ARIMA provides two variations on this general theme. The addone method provides full-model reestimation after each single outlier is added to the model, while the addall method refits the model only after a set of detected outliers is added to the model. During outlier detection, a robust estimate of the residual standard deviation— 1.48 times the median absolute deviation of the residuals—is used. The default critical value is determined by the number of observations in the interval searched for outliers.
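The robust scale estimate used during outlier detection can be reproduced outside x12a.exe with core Stata commands; this is a minimal sketch only, and the residual variable resid is a hypothetical name for the regARIMA residuals.

    quietly summarize resid, detail
    gen double absdev = abs(resid - r(p50))        // absolute deviations from the median
    quietly summarize absdev, detail
    display "robust residual std. dev. = " 1.48 * r(p50)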

When a model contains two or more LSs, including those obtained from outlier de- tection as well as any specified in the regression specification, X-12-ARIMA will optionally produce t statistics for testing null hypotheses that each run of two, three, etc., succes- sive LSs actually cancels to form a temporary LS. Two successive LSs cancel to form a temporary LS if the effect of one offsets the effect of the other, which implies that the sum of the two corresponding regression parameters is zero. Similarly, three successive LSs cancel to a temporary LS if the sum of their three regression parameters is zero, and so on. The options of outliers are listed in table 5. Q. Wang and N. Wu 219

Table 5. Outlier variables in regARIMA model (X-12-ARIMA Seasonal Adjustment dialog box: Outlier tab)

Automatically detected   Option in sax12      User defined    Option in sax12

type                     outauto(string)      AO outliers     outao(string)
critical values          outcrit(string)      LS outliers     outls(string)
span                     outspan(string)      TC outliers     outtc(string)
LS run                   outlsrun(integer)
method                   outmethod(string)

2.2 Stage II: Seasonal adjustment (X-11)

The original series is adjusted using the trading day, holiday, outlier, and other regression effects derived from the regression coefficients. Then the adjusted series (O) is decomposed into three basic components: trend cycle (C), seasonal (S), and irregular (I). X-12-ARIMA provides four different decomposition modes: multiplicative (the default), additive, pseudo-additive, and log-additive. The four modes are listed in table 6.

Table 6. Modes of seasonal adjustment and their models (X-12-ARIMA Seasonal Ad- justment dialog box: Adjustment tab)

Mode              Model for O            Model for SA        Option in sax12

multiplicative    O = C × S × I          SA = C × I          x11mode(mult)
additive          O = C + S + I          SA = C + I          x11mode(add)
pseudo-additive   O = C × (S + I − 1)    SA = C × I          x11mode(pseudoadd)
log-additive      log(O) = C + S + I     SA = exp(C + I)     x11mode(logadd)

Note: 1) The table is extracted from table 7.44 in U.S. Census Bureau (2011, 193). 2) SA denotes a seasonally adjusted series.

X-12-ARIMA uses a seasonal moving average (filter) to estimate the seasonal factor. An n×m moving average means that an n-term simple average is taken of a sequence of consecutive m-term simple averages. Table 7.43 of U.S. Census Bureau (2011) lists these options. If no selection is made, X-12-ARIMA will select a seasonal filter automatically. The Henderson moving average is used to estimate the final trend cycle. Any odd number greater than 1 and less than or equal to 101 can be specified. If no selection is made, the program will select a trend moving average based on statistical characteristics of the data. For monthly series, a 9-, 13-, or 23-term Henderson moving average will be selected. For quarterly series, the program will choose either a 5- or 7-term Henderson moving average.
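To make the n×m idea concrete, the following sketch computes a 3×3 seasonal moving average for a monthly tsset series. Here y stands in for the values that the seasonal filter actually smooths, so the variable name and the choice of a 3×3 filter are illustrative only, not a reproduction of X-12-ARIMA's code.

    gen double s3   = (L12.y + y + F12.y) / 3      // 3-term average of the same month in adjacent years
    gen double s3x3 = (L12.s3 + s3 + F12.s3) / 3   // 3-term average of those 3-term averages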

The options allowed by sax12 are listed in table 7. By default, a series is adjusted for both seasonal and holiday effects. The holiday-adjustment factors derived from the program can be kept in the final seasonally adjusted series by using the x11hol option. The final seasonally adjusted series will contain the effects of outliers and user-defined regressors. These effects can be removed using the x11final() option. The lower and upper sigma limits used to downweight extreme irregular values in the internal seasonal-adjustment iterations are specified using the x11sig() option.

Table 7. Options of seasonal adjustment (X-12-ARIMA Seasonal Adjustment dialog box: Adjustment tab)

Option            Option in sax12     Example

trend filter      x11trend(string)    x11trend(23)
seasonal filter   x11seas(string)     x11seas(s3x5)
factor removed    x11final(string)    x11final(ao user)
holiday kept      x11hol
sigma limits      x11sig(string)      x11sig(2,3)

If an aggregate time series is a sum (or other composite) of component series that are seasonally adjusted, then the sum of the adjusted component series provides a seasonal adjustment of the aggregate series that is called the indirect adjustment. This adjust- ment is usually different from the direct adjustment that is obtained by applying the seasonal-adjustment program to the aggregate (or composite) series. The indirect ad- justment is usually appropriate when the component series have very different seasonal patterns.

X-12-ARIMA makes three types of adjustment. Type I adjustment is for a single series based on a single specification file, which specifies the entire adjustment process and has the extension .spc. Type II adjustment is for multiseries using a data metafile with the extension .dta,2 which is a list of data files to be run using the same .spc file. Type III adjustment is for multiseries using a metafile with the extension .mta, which is a list of specification files. Table 8 lists the options in sax12 for each type of adjustment.

2. Though the data metafile has the same extension as the Stata data file, they have completely different formats.

Table 8. Direct and indirect seasonal adjustment (X-12-ARIMA Seasonal Adjustment dialog box: Main tab)

Types      Option in sax12      Specification file(s)    Prefix of output files

type I     satype(single)       inpref(filename)         outpref(file prefix)
           comptype(string)
type II    satype(dta)          inpref(filename)         outpref(file prefix)
           dtafile(filename)
type III   satype(mta)          mtaspc(filenames)        outpref(file prefix)
           compsa compsadir
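As a hedged command-line sketch of a type II run built from the options in table 8 and the default file names described in section 3.1 (the three variable names are placeholders, and the exact combination of options has not been verified against a run):

    sax12 sales1 sales2 sales3, satype(dta) inpref(mvss.spc) dtafile(mvss.dta)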

2.3 Stage III: Diagnostics of seasonal adjustment

X-12-ARIMA contains several diagnostics for modeling, model selection, and adjustment stability. Spectral plots of the original series, the regARIMA residuals, the final seasonal adjustment, and the final irregular component help you check whether there still re- mains seasonal or trading-day variation. The sliding spans analysis and history revision analysis are two important stability diagnostics. The sliding spans diagnostics compare seasonal adjustments from overlapping spans of given series. The revision history di- agnostics compare the concurrent and final adjustments. Table 9 lists the options of diagnostics in sax12.

Table 9. Diagnostics of seasonal adjustment (X-12-ARIMA Seasonal Adjustment dialog box: Others tab)

Option                       Option in sax12

sliding span analysis        sliding
history revision analysis    history
check .spc file              justspc

The X-12-ARIMA software contains limits on the maximum length of series, maximum number of regression variables in a model, etc. For example, the maximum length of an input series is 600, and the maximum number of regression variables in a model is 80. The limits in X-12-ARIMA are listed in table 2.2 of U.S. Census Bureau (2011). These limits are set at values believed to be sufficiently large for the great majority of applications, without being so large as to cause memory problems or to significantly lengthen program execution times. The sax12 command will check these limits before executing x12a.exe.

3 Design of menus and dialog boxes for sax12

Menu-driven commands generate and execute the required files for x12a.exe. Some jargon used here, such as group box and check box, can be found in [P] program.For details about the X-12-ARIMA computations and terminology, please refer to U.S. Census Bureau (2011).

3.1 Main

The dialog box shown in figure 1 selects the adjustment type. Different types have different options.

Figure 1. Screenshot of the Main tab (type I) in the X-12-ARIMA Seasonal Adjustment dialog box

The first bullet, labeled Single series, is selected to calculate a type I adjustment. If this is selected, you must also select the variable to adjust. If you want to perform composite adjustment, choose the Composite type. You can save the new input file in the field for Spc file to save (the same name as the variable if left blank) by pressing the Save as button. You can also name the output file in the fields for Prefix of output to save (the same name as the variable if left blank). By default, both the input file and the output file are named after the variable.

The second bullet, labeled Multiple series using a single specification file, is selected to calculate a type II adjustment with the options shown in figure 2. If this is selected, you must also select the variables to adjust. By default, the program saves the new input file as mvss.spc—which means “multiple variables using single specification file”—saves the output files named after the variables, and saves the data metafile as mvss.dta. Of course, you can save these files by editing the fields Spc file to save (mvss.spc if left blank) and Dta file to save (mvss.dta if left blank) or by pressing the Save as button.

Figure 2. Screenshot of the Main tab (type II) in the X-12-ARIMA Seasonal Adjust- ment dialog box

The last bullet, labeled Multiple series using multiple specification files, is selected to calculate a type III adjustment with the options shown in figure 3. If this is selected, you must also select the input files by pressing the Browse button next to the Select spc files field. By default, the program saves the input metafile as mvms.mta, which stands for “multiple variables for multiple specification files”. If you want to perform the composite adjustment, click on the check box for Composite adjustment. If you also want to perform the direct adjustment for the aggregate series, click on Direct seasonal adjustment for composite series and input the name of the specification file in the Spc file to save (mvms.spc if left blank) field. The program saves the input file as mvms.spc by default. This produces an indirect seasonal adjustment of the composite series as well as a direct adjustment.

Figure 3. Screenshot of the Main tab (type III) in the X-12-ARIMA Seasonal Adjust- ment dialog box

3.2 Prior

The dialog box shown in figure 4 specifies the data properties, the transformation method, and the prior adjustment. Data properties include Flow or stock and Frequency (monthly or quarterly). The dialog box will determine the frequency automatically if there are data in the memory. Trading days and Easter have different regression variables for flow series than for stock series. The predefined variables on this dialog box and the Regression tab of the X-12-ARIMA Seasonal Adjustment dialog box will change according to these choices.

Data transformation includes Auto (log or none), Log, Square Root, Inverse, Logistic, and Box-Cox Power. The Auto (log or none) option lets X-12-ARIMA automatically decide whether the data should be log-transformed using Akaike's information criterion (AIC) test. If you choose Box-Cox Power, then include the transformation parameter in the edit field.

In the Prior Adjustment group box, you can select the predefined variables in the Predefined Variable combo box or choose the user-defined variables in the User-defined variables group box. For the user-defined variables, you can also select your modes and types. Mode specifies the way in which the user-defined prior-adjustment factors will be applied to the time series. For example, mode=diff means that the prior adjustments are to be subtracted from the original series. Type specifies whether the prior-adjustment factors are permanent or temporary.

Figure 4. Screenshot of the Prior tab in the X-12-ARIMA Seasonal Adjustment dialog box

3.3 Regression

In the Regression tab of the X-12-ARIMA Seasonal Adjustment dialog box, shown in figure 5, you can choose the predefined variables and user-defined variables. The available predefined variables are determined by the data properties and transformation. For the user-defined variables, you can also choose their types and centering method. If Types is left blank, all variables are assumed to be of the type user. Center type includes already centered, subtract overall mean, and subtract mean by season. By default, the program assumes the variables have already been centered. In the AIC test group box, you can choose the types of variables to test based on AIC, and the insignificant variables will be dropped automatically from the regression.

Figure 5. Screenshot of the Regression tab in the X-12-ARIMA Seasonal Adjustment dialog box

3.4 Outlier

In the Outlier tab of the X-12-ARIMA Seasonal Adjustment dialog box, shown in figure 6, you can let the program automatically identify the outliers in the Automatically identified outliers group box. By default, the program searches for AO and LS outliers. You can adjust the criteria in the Options group box. Critical values sets the value to which the absolute values of the outlier t statistics are compared to detect outliers. The default critical value is determined by the number of observations in the interval searched for outliers. You can also specify the critical values manually by selecting user defined. The values are set for AO, LS, and TC sequentially and separated by commas, as in 3.5, 4.0, 4.0 or 3.5, , 4.0. Blank means the default value. If only one value is specified, it applies to all types of outliers. Outlier span specifies the start and end dates of a span that should be searched for outliers, as in 1995.1, 2008.12 or 1995.1, . Blank means the default date. LS run specifies the maximum length of a run of LSs that can form a temporary LS. The value must lie between 0 and 5. The default value is 0, which means no temporary LS t statistics are computed. Identify method includes two methods to successively add detected outliers to the model: addone (the default) and addall.

You can also input the outliers by hand in the User-defined outliers group box in the respective field for each outlier type. A tool tip will pop up when you mouse over a field, and it will give an example to fill in.
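The same settings can be supplied on the command line. The following is a hedged sketch in which the option spellings follow table 5, the variable retail anticipates the example of section 7.1, and the critical values and span are illustrative only:

    sax12 retail, satype(single) outauto(ao ls tc) outmethod(addone)   ///
        outcrit(3.5, 4.0, 4.0) outspan(1995.1, 2008.12) outao(ao2003.5)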

Figure 6. Screenshot of the Outlier tab in the X-12-ARIMA Seasonal Adjustment dialog box

3.5 ARIMA

The ARIMA tab of the X-12-ARIMA Seasonal Adjustment dialog box, shown in fig- ure 7, specifies the ARIMA model. You can specify the ARIMA model directly by clicking on Specify by hand (non-seasonal, seasonal). The model should be specified with the format (p,d,q)(P,D,Q), such as (0,1,1)(1,1,0). The former part is the regular ARIMA model, and the latter part is the seasonal ARIMA model. You can also let the program automatically select the optimal model through two ways. The first is through Automatic selection given maximum lag and difference. You should choose the Max lag of ARMA part for both the regular and the seasonal ARMA model. The difference order can be specified to the maximum order by clicking on Max difference, or it can be fixed at a specified order by clicking on Fixed difference. The second is through clicking on Automatic selection in models stored in file and then clicking on the Browse button to choose the optimal model from among several models listed in the stored file. The default file extension is .mdl. 228 Menu-driven X-12-ARIMA seasonal adjustment in Stata

You can restrict the estimation sample in the field following the Sample option or can change the default forecast options in the Forecast options group box.

Figure 7. Screenshot of the ARIMA tab in the X-12-ARIMA Seasonal Adjustment dialog box

3.6 Adjustment

In the Adjustment tab of the X-12-ARIMA Seasonal Adjustment dialog box, shown in figure 8, you can specify the X-11 seasonal adjustment. The X11 mode specifies the seasonal-adjustment mode. The default mode (multiplicative) is disabled because the default transformation is set to Auto (log or none) on the Prior tab of the X-12-ARIMA Seasonal Adjustment dialog box.

Extreme limits specifies the lower and upper limits used to downweight extreme irregular values in the internal seasonal-adjustment iterations. The default values are set to 1.5 for the lower limit and 2.5 for the upper limit. If you specify the values manually, then you should select user defined and input the values in the appropriate fields. Valid list values are any real numbers greater than zero with the lower-limit value less than the upper-limit value and the two values separated by a comma. Blank denotes the default value. Valid specifications include 1.8,2.8 and 1.8, . More examples can be viewed by clicking on note and examples.

Trend filter specifies that the Henderson moving average should be used to estimate the final trend cycle. Any odd number greater than 1 and less than or equal to 101 can be specified. By default, the program will select a trend moving average based on statistical characteristics of the data. For monthly series, a 9-, 13-, or 23-term Henderson moving average will be selected. For quarterly series, the program will choose either a 5- or 7-term Henderson moving average. Seasonal filter specifies that a seasonal moving average will be used to estimate the seasonal factors. By default, X-12-ARIMA will choose the final seasonal filter automatically.

If some outliers are to be excluded from the adjusted series, choose the types in the field for Exclude the outliers out of the seasonally adjusted series. If you want to include the holiday effect in the adjusted series, click on Keep holiday effect in the seasonally adjusted series. If you do not want to perform seasonal adjustment, just click on No adjustment. In this case, the program will just fit the regARIMA model.

Figure 8. Screenshot of the Adjustment tab in the X-12-ARIMA Seasonal Adjustment dialog box

3.7 Stability and other options

The Others tab of the X-12-ARIMA Seasonal Adjustment dialog box, shown in figure 9, includes two kinds of stability analysis: Sliding span analysis and Revision history analysis. If you just want to view the input file before adjustment, click on Just generate spc file and check it but not perform seasonal adjustment. 230 Menu-driven X-12-ARIMA seasonal adjustment in Stata

If the input files have already been generated, you can perform seasonal adjustment directly by clicking on Directly perform seasonal adjustment based on already defined spc (dta, mta) file(s) and select the corresponding files in the Options group box. Note that if you choose this option, all other specifications will be omitted.

Figure 9. Screenshot of the Others tab in the X-12-ARIMA Seasonal Adjustment dialog box

4 Design of menus and dialog boxes for sax12im

When X-12-ARIMA is run, a number of output files can be created. These generally have the same name (or base name) as the specification file or metafile, unless an alternate output name is specified, but have extensions based on the file type. The main output is written to the file filename.out. Run-time errors are stored in the file filename.err, and a log file filename.log is also generated.

sax12im is used to import the series generated by sax12, or more precisely, by X-12-ARIMA. Input the following command to open the dialog box as shown in figure 10.

. db sax12im

Figure 10. Screenshot of the dialog box to import data

First, select the results in the field for Select X-12-ARIMA results. After pressing the Browse button, all the results with extension .out are listed. You can select multiple results. Results in the same directory or different directories are both allowed. This will be convenient if the results of different specifications are stored in different directories.

In the field for Select the series or insert the suffix, you can select the series to import. We list the most often used series. With a specific variable selected, the corresponding suffix will appear in the edit fields. By default, the imported series are as follows: seasonal factor (d10), seasonally adjusted series (d11), trend-cycle component (d12), and irregular component (d13). If you want to import more series, you can insert the suffix directly.

By default, the program imports the series as variables, and this is the usual case. For some other cases, it is appropriate to import the data as matrices, such as the spectral density and the autocorrelation function. The option import as matrix provides this utility.

If there are no data in memory, you should select the frequency (monthly or quarterly) by clicking on specify the frequency if no data in memory. The new date variable will be named sdate automatically. If you want to replace the data in memory, select the clear the data in memory option. Note that if you select this option, you must also choose the frequency because there are no data after the memory is cleared.

The program will not import those variables already existing in memory. In this case, you can update the observations by clicking update the obs using dataset on disk if variables already exist. For example, suppose you have adjusted a variable (such as gdp) using a subsample from January 1992 to December 2008, and you have imported the variables gdp.d10 and gdp.d11. Then you make a new adjustment using the sample from January 1992 to December 2009. Unless you specify the update option, sax12im will not import the newly generated series gdp.d10 and gdp.d11 because they already exist.

The names of all the variables needed to import are stored in r(varlist). The names of the newly generated variables are stored in r(varnew), and the names of the already existing variables are stored in r(varexist).
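A minimal sketch of how these saved results might be used after an import (the file and suffixes follow the first example; the describe step is a suggestion, and r(varnew) may be empty if nothing new was created):

    sax12im retail.out, ext(d10 d11)
    return list                      // shows r(varlist), r(varnew), r(varexist)
    describe `r(varnew)'             // inspect only the newly created variables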

5 Design of menus and dialog boxes for sax12diag

X-12-ARIMA automatically stores the most important diagnostics in a file, which will have the same path and filename as the main output but with the extension .udg. The diagnostics summary file is an ASCII database file. Within the diagnostic file, each diagnostic has a unique key to access its value. We extract the most important information into a table. The diagnostics table consists of five parts: general information, X-11, regARIMA, outliers, and stability analysis. Input the following command to open the dialog box as shown in figure 11.

. db sax12diag

Figure 11. Screenshot of the dialog box to make diagnostics tables

First, select the diagnostics files in the Select X-12-ARIMA diagnostics files field. After pressing the Browse button, all the files with extension .udg are listed. You can select multiple files. Use the check box for no print if you want to suppress the output on the screen. You can save the diagnostics table in another file by clicking on save the table in file and using the Browse button to select your file.

6 Design of menus and dialog boxes for sax12del

Many files are generated in the X-12-ARIMA seasonal adjustment. You may wish to delete these files when you obtain satisfactory results. We designed a dialog box as shown in figure 12 to fulfill this task. The sax12del command automatically identifies and deletes all the files of the seasonal adjustment.

. db sax12del

Figure 12. Screenshot of the dialog box to delete files

First, select the diagnostics files in the Select X-12-ARIMA results field. After pressing the Browse button, all the files with extension .out are listed. Only one file can be selected at a time. You can choose which files are to be deleted via Select the extensions to delete (leave empty to delete all files); likewise, you can choose which files are to be kept via Select the extensions to keep. The default is to delete all files.

7 Example

Next we use two examples to illustrate how to make a seasonal adjustment using our programs. The first example contains monthly data for one series, and the second example contains quarterly data for nine series.

7.1 Example: Adjustment for a single series

The retail.dta file contains the monthly data of the total retail sales of consumer goods in China (retail) from January 1990 (1990m1) to December 2009 (2009m12).

regARIMA model specification

We import the data into memory and make a trend plot. In this example, we assume the data are stored in the directory d:\sam.

. cd d:\sam
d:\sam

. use retail, clear

. tsset
        time variable:  mdate, 1990m1 to 2011m12
                delta:  1 month

. des

Contains data from retail.dta
  obs:           264
 vars:             5                          11 Jan 2011 14:39
 size:         5,808

              storage   display    value
variable name   type    format     label      variable label

mdate           int     %tm
retail          double  %10.0g                total retail sales of consumer
                                                goods in China
sprb            float   %9.0g                 1st variable of Spring Festival
sprm            float   %9.0g                 2nd variable of Spring Festival
spra            float   %9.0g                 3rd variable of Spring Festival

Sorted by:  mdate

. tsline retail

Note that the available data for the three variables sprb, sprm, and spra span from January 1990 (1990m1) to December 2011 (2011m12). This overage is necessary to forecast in the ARIMA model. The default number of forecasts is 12; the allowed maximum number of forecasts is 24 for our data unless we extend the observations of sprb, sprm, and spra. It is evident that the magnitude of the seasonal fluctuations is proportional to the level. So we prefer to log-transform and use the multiplicative-adjustment mode.

We include the trading day, leap year, and one AO outlier at May 2003 because of SARS in the regression. We let the program automatically detect all other outliers of type AO, LS,andTC; automatically select the optimal ARIMA model given maximum difference order of (2 1) and maximum ARMA lag of (3 1); and automatically deter- mine whether to transform the variable though we prefer the log-transformation. Also included are three variables denoting the Spring Festival moving holiday, sprb, sprm, and spra. The Spring Festival is the biggest holiday in China. We use the three-stage method to generate the three variables.3

3. The genhol.exe program of the U.S. Census Bureau is a special tool to generate the variables for a moving holiday. This program is freely available from http://www.census.gov/srd/www/genhol/.

Specification through the dialog box

In the Main tab of the X-12-ARIMA Seasonal Adjustment dialog box, select the variable retail for the type I seasonal adjustment. In the Regression tab, select trading day with leap year in the Predefined variables group box. Choose sprb, sprm, and spra in the User-defined variables group box. Set the Types to holiday because the three variables belong to the same type. In the Outlier tab, click on TC (temporary change) so that all three types of outliers are automatically detected. In the User-defined outliers group box, input ao2003.5 or ao2003.may in the field for AO outliers. In the ARIMA tab, select Automatic selection given maximum lag and difference, and change the Max lag of ARMA part from (2 1) (the default case) to (3 1). In the Others tab, click on Sliding span analysis and Revision history analysis. All other options remain unchanged. Press OK to execute the command. The command is as follows:

. sax12 retail, satype(single) transfunc(auto) regpre(const td)
> reguser(sprb sprm spra) regusertype(holiday) outao(ao2003.5)
> outauto(ao ls tc) outlsrun(0) ammaxlag(3 1) ammaxdiff(2 1) ammaxlead(12)
> x11seas(x11default) sliding history

The adjustment results are stored in the retail.* files with different extensions. The main output file, retail.out, will pop up in the Viewer window. We can check the adequacy and stability of the adjustment through the diagnostics table that will be illustrated in the second example.

Import the adjusted series and make graphs

Next we illustrate how to import data and make graphs through two examples: making a seasonal plot of the seasonal factor and making a spectrum plot of the irregular component. First, open the dialog box. Then select the X-12-ARIMA result retail.out, and replace the default suffix (d10 d11 d12 d13) with d10. Press Submit to import the seasonal factor as a new variable. The command is

. sax12im retail.out, ext(d10)
Variable(s) (retail_d10) imported

Then we clear the suffix and select spectrum of irregular component (sp2). Next click on import as matrix. Press OK to import the data as a matrix with the name retail_sp2. The command is

. sax12im retail.out, ext(sp2) noftvar

You can view the matrix by typing the command matlist retail_sp2.

We use the cycleplot4 command of Cox (2006) to make the seasonal plot of the seasonal factor, as shown in figure 13.

. gen year = year(dofm(mdate))

. gen month = month(dofm(mdate))

. cycleplot retail_d10 month year, summary(mean) lpattern(dash) xtitle("")


Figure 13. Seasonal factor plot by season

We use matplot to make the spectrum plot of the irregular component. We add the vertical lines of two trading day peaks at (0.348, 0.432), and we add six seasonal peaks at (1/12, 2/12,...,6/12). The command is as follows:

. _matplot retail_sp2, columns(3,2)
> xline(0.348 0.432, lpattern(dash) lcolor(brown) lwidth(thick))
> xline(0.08333 0.16667 0.25 0.33333 0.41667 0.5, lpattern(dash) lcolor(red)
> lwidth(medium)) connect(direct) msize(small) mlabp(0) mlabs(zero) ysize(3)
> xsize(5) ytitle("Density") xtitle("Frequency")

(output omitted )

7.2 Example: Adjustment for multiseries based on multispecification

We use the aggregate quarterly gross domestic product of China from the first quarter of 1992 (1992q1) to the fourth quarter of 2009 (2009q4) as an example to illustrate the multiseries adjustment.

4. You can download and install the package by typing the command ssc install cycleplot.

Generate a separate spc file for each series

There are nine component series, and for convenience, we create the input files for the nine components by using the same specifications. Assume the data are stored in the directory d:\saq.

. cd d:\saq
d:\saq

. use gdpcn, clear
(constant price based on 2005; unit: hundred million yuan)

. tsset
        time variable:  qdate, 1992q1 to 2009q4
                delta:  1 quarter

. foreach v of varlist agri - other {
  2. sax12 `v', satype(single) comptype(add) transfunc(auto)
>    regpre(const) outauto(ao ls) outlsrun(2) ammaxlag(2 1) ammaxdiff(1 1)
>    ammaxlead(12) x11seas(x11default) sliding history justspc noview
  3. }

Nine input files will be created: agri.spc, const.spc, indus.spc, trans.spc, retail.spc, service.spc, finance.spc, house.spc, and other.spc. Note that if you create the files through the X-12-ARIMA Seasonal Adjustment dialog box, do not forget to choose the Composite type on the Main tab.

Specification through the dialog box

Now we adjust the series based on the nine input files and we make composite adjust- ments. On the Main tab of the X-12-ARIMA Seasonal Adjustment dialog box, select Multiple series using multiple specification files and then select the nine input files in your directory. Click on Composite adjustment and Direct seasonal adjustment for com- posite series to make both direct and indirect adjustments for the aggregate series. Save the input file as gdp.spc by typing that into the field for Spc file to save (mvms.spc if left blank), and save the metafile as gdp.mta by typing that into the field for Mta file to save (mvms.mta if left blank). Next we make specifications for direct adjustment of the composite series. In the Prior tab, select quarterly for the data frequency and select the Log transformation. In the Regression tab, select leap year in the Predefined variables group box. In the Others tab, click on Sliding span analysis and Revision history analysis. All other options remain unchanged. Press OK to get the results. The command is as follows:

. sax12, satype(mta)
> mtaspc("agri const finance house indus other retail service trans")
> mtafile(gdp.mta) compsa inpref(gdp.spc) compsadir transfunc(log)
> regpre(const lpyear) outauto(ao ls) outlsrun(0) x11mode(mult)
> x11seas(x11default) sliding history

View the diagnostics table of multiple adjustments

We view the diagnostics table to check the adequacy of the model and the sufficiency of the adjustment or to compare the different adjustments. We illustrate the usage of the diagnostics table just through the composite series. Select the diagnostics files gdp.udg, and save the diagnostics table in the file mydiag.txt. The command and output are as follows:

. sax12diag gdp.udg using mydiag.txt

                              gdp.udg          gdp.udg(ind)

General information:
 Series name                  gdp              gdp
 Frequency                    4                4
 Sample                       1992.1-2009.4    1992.1-2009.4
 Transformation               Log(y)
 Adjustment mode              multiplicative   indirect
 Seasonal peak                rsd              indirr
 Trading day peak             none             none
 Peak of TD in adj            no
 Peak of Seas. in adj         no
 Peak of TD in res            t1               no
 Peak of Seas. in res         s1               s1
 Peak of TD in irr            no               no
 Peak of Seas. in irr         no               s1
 Peak of TD in ori            no
 Peak of Seas. in ori         s1

X11:
 M1                           0.046            0.006
 M2                           0.070            0.000
 M3                           0.000            0.000
 M4                           0.696            0.476
 M5                           0.512            0.200
 M6                           0.044            1.302
 M7                           0.155            0.085
 M8                           0.248            0.226
 M9                           0.203            0.180
 M10                          0.207            0.211
 M11                          0.207            0.211
 Q                            0.20             0.25
 Q without M2                 0.22             0.28
 Seasonality F test(%)        193.587  0.00    1366.726  0.00
 Seasonality KW test(%)       61.386  0.00     63.945  0.00
 Moving seas. test(%)         0.773  71.38     4.228  0.00

regARIMA model:
 Model span                   1992.1-2009.4
 Model                        (000)
 Num. of regressors           5
 Num. of sig. AC              2
 Num. of sig. PAC             2
 Normal                       0.8140  0.1494  2.4021
 AICC                         1461.3847

Outliers:
 Outlier span                 1992.1-2009.4
 Total outliers               3
 Auto detected outliers       3
 AO critical                  3.732 *
 LS critical                  3.732 *
 TC critical
 Outliers:                    LS1995.4(7.73) LS2001.4(7.16) LS2005.4(7.17)

Stability analysis:
 Revision span                2007.1-2009.3
 Span num., length, start     4 32 1 1999
 unstable SF                  8 36 22.222
 unstable MM change in SA     13 35 37.143
 unstable YY change in SA     2 32 6.250
 Ave. abs. perc. rev.         4.255275

Description of the diagnostics

The tables contain the diagnostics information of the composite series for both the direct and the indirect adjustments. We import the adjusted series by using the sax12im command described above. For example, the following command imports the seasonal factor, the seasonally adjusted series, the irregular series, and the trend-cycle series of both the direct and the indirect seasonal adjustment.

. sax12im gdp.out, ext(d10 d11 d12 d13 isf isa itn iir)
Variable(s) (gdp_d10 gdp_d11 gdp_d12 gdp_d13 gdp_isf gdp_isa gdp_itn gdp_iir)
> imported

More than 200 files are generated in this example. The following command deletes all the files except those with extensions d10, d11, d12, d13, and spc.

. foreach s in "agri const finance house indus other retail service trans gdp" {
  2. sax12del `s', keep(d10 d11 d12 d13 spc)
  3. }

. dir gdp.*
   2.2k   1/06/12  0:38  gdp.d10
   2.2k   1/06/12  0:38  gdp.d11
   2.2k   1/06/12  0:38  gdp.d12
   2.2k   1/06/12  0:38  gdp.d13
   0.7k   1/06/12  0:38  gdp.spc

8 Conclusion

The X-12-ARIMA software of the U.S. Census Bureau is one of the most popular methods for seasonal adjustment, and the program x12a.exe is widely used around the world. In this article, we illustrated the menu-driven X-12-ARIMA seasonal-adjustment method in Stata. Specifically, the main utilities include how to specify the input file and run the program and how to import the adjusted results into Stata as variables or matrices. Because of the strong graphics utilities in Stata, the special tools for producing graphs, like Win X-12, are not provided in our article. This may be an extension in the future based on our work.

9 Acknowledgments

We thank H. Joseph Newton (Stata Journal editor) for providing advice and encouragement. We received professional and meticulous comments from an anonymous referee, Jian Yang, Liuling Li, and Hongmei Zhao, which substantially helped the revision of this article. Qunyong Wang acknowledges Project “Signal Extraction Theory and Applications of Robust Seasonal Adjustment” (Grant No. 71101075) supported by NSFC.

10 References

Bell, W. R. 1983. A computer program for detecting outliers in time series. Proceedings of the American Statistical Association (Business and Economic Statistics Section): 634–639.

Chang, I., and G. C. Tiao. 1983. Estimation of time series parameters in the presence of outliers. Technical Report 8, Statistics Research Center, University of Chicago.

Chang, I., G. C. Tiao, and C. Chen. 1988. Estimation of time series parameters in the presence of outliers. Technometrics 30: 193–204.

Cox, N. J. 2006. Speaking Stata: Graphs for all seasons. Stata Journal 6: 397–419.

Gómez, V., and A. Maravall. 2001. Automatic modeling methods for univariate series. In A Course in Time Series Analysis, ed. D. Peña, G. C. Tiao, and R. S. Tsay, 171–201. New York: Wiley.

Otto, M. C., and W. R. Bell. 1990. Two issues in time series outlier detection using indicator variables. Proceedings of the American Statistical Association (Business and Economic Statistics Section): 182–187.

Shiskin, J., A. H. Young, and J. C. Musgrave. 1967. The X-11 variant of the Census Method II seasonal adjustment program. Technical Paper 15, U.S. Department of Commerce, Bureau of the Census.

U.S. Census Bureau. 2011. X-12-ARIMA Reference Manual (Version 0.3). http://www.census.gov/ts/x12a/v03/x12adocV03.pdf.

About the authors

Qunyong Wang earned his PhD from Nankai University and now works at the University's Institute of Statistics and Econometrics. This article was written as part of his involvement with the Stata research program at the Center for Experimental Education of Economics at Nankai University.

Na Wu received her PhD from Nankai University. Now she works at the School of Economics at Tianjin University of Finance and Economics. Her expertise lies in international economics and the financial market.

The Stata Journal (2012) 12, Number 2, pp. 242–256

Faster estimation of a discrete-time proportional hazards model with gamma frailty

Michael G. Farnworth
University of New Brunswick
Fredericton, New Brunswick, Canada
[email protected]

Abstract. Fitting a complementary log-log model that accounts for gamma-distributed unobserved heterogeneity often takes a significant amount of time. This is in part because numerical derivatives are used to approximate the gradient vector and Hessian matrix. The main contribution of this article is the use of Mata and a gf2 evaluator to express the gradient vector and Hessian matrix. Gradient vector expression allows one to use a few different options and postestimation commands. Furthermore, expression of the gradient vector and Hessian matrix increases the speed at which a likelihood function is maximized. In this article, I present a complementary log-log model, show how the gamma distribution has been incorporated, and point out why the gradient vector and Hessian matrix can be expressed. I then discuss the speed at which a maximum is achieved, and I apply sampling weights that require an expression of the gradient vector. I introduce a new command for fitting this model. To demonstrate how this model can be applied, I will examine information on when young males first try marijuana.

Keywords: st0256, pgmhazgf2, hazard model, discrete duration analysis, complementary log-log model, gamma distribution, unobserved heterogeneity

1 Introduction

In the duration literature, survival times are often grouped into months, years, or some other width. Information on widely banded survival times is common in the social sciences, and in the discrete duration literature, a few different functional forms have been used (Jenkins 1995, 134). One option is a logistic model that corresponds to a proportional odds model. Alternatively, a probit model can be fit. A third option is the complementary log-log proposed by Prentice and Gloeckler (1978). Jenkins (2008) provides an introduction to survival analysis that pays particular attention to these discrete-time models, and Sueyoshi (1995) shows that each of them is nested in a class of continuous-time hazard models. When applying duration analysis, unobserved heterogeneity may be an important concern. In response, one can assume that heterogeneity across individuals is normally distributed with a mean of 0, and the xtlogit or xtcloglog command may be applied. Alternatively, one can fit a complementary log-log model under which unobserved heterogeneity is accounted for with a multinomial distribution that has a particular number

of mass points (Heckman and Singer 1984). To fit this model, Jenkins (2004a) provides a program called hshaz.1

On the topic of unobserved heterogeneity, Abbring and van den Berg (2007) find that in a large class of hazard models with proportional unobserved heterogeneity, the distribution of the heterogeneity among survivors converges to a gamma distribution. Meyer (1990) incorporates this distribution into a complementary log-log functional form, and hence this is called a Prentice–Gloeckler–Meyer (pgm) hazard model. This model can be fit when there are no left-truncated survival times and when each individual experiences a single spell. Ondrich and Rhody (1999) extend this method to examine multiple spells that an individual may experience. To examine single spells that are not left-truncated, Jenkins (2004b) provides a command called pgmhaz8.2 This is a d0 maximum likelihood program that is publicly available and runs under Stata 8.0 or higher. This program approximates the gradient vector and Hessian matrix with numerical derivatives, so maximization can take a significant amount of time.

To demonstrate the speed of the new pgmhazgf2 command that uses an expression of the gradient vector and Hessian matrix, I examine information on when young males first try marijuana. This sample is from a 2006–2007 Canadian provincial Youth Smoking Survey of New Brunswick students in grades 6 through 12 (grade 5 students in some schools) conducted by the University of New Brunswick and the Université de Moncton in collaboration with the Centre for Behavioural Research and Program Evaluation at the University of Waterloo. The failure-indicator variable is based on the survey question “Have you ever tried marijuana or cannabis? (a joint, pot, weed, hash, ...)”. If the answer is yes, individuals are also asked “How old were you when you first used marijuana or cannabis?” This briefly demonstrates how the estimation can be applied, and hence I just examine the sample of males.

Incorporating gradient and Hessian expressions within a maximum likelihood program increases the speed at which a maximum is arrived at and allows one to apply sampling pweights as well as postestimation features such as robust variance estimation, clustering, and stratification. This maximum likelihood command also supports estimation with survey data (svy:). After estimation, the predict command with a score option can be used.

2Model

Under a complementary log-log model, the hazard of individual i first trying marijuana at any point within age-level j is

\[
h(\text{age}_{ij}) = 1 - \exp\left[ \left\{ -\int_{\text{beginning of age}_j}^{\text{end of age}_j} \lambda_0(t)\,dt \right\} \times \exp(x_{ij} \times \beta) \right]
\]

1. This can be installed in Stata by typing ssc install hshaz. This program along with the associated help file is available at http://ideas.repec.org/c/boc/bocode/s444601.html.

2. This can be installed in Stata by typing ssc install pgmhaz8. This program along with the associated help file is available at http://ideas.repec.org/c/boc/bocode/s438501.html.

The term in curly brackets is minus one times a cumulative hazard that has an unspecified baseline over continuous time within an age level. The cumulated baseline is multiplied by an exponentiated scalar equal to a row vector of covariates multiplied by a column vector of parameters. Each covariate is one general measurement of what takes place within age-level j. Under this model, the exponential base raised to the power of one parameter is the cumulated hazard ratio associated with an age level. When the same proportional hazard model holds at each point in time, an exponentiated parameter is also a hazard ratio. This is a common interpretation of complementary log-log parameters.

As mentioned above, Meyer (1990) incorporates gamma-distributed unobserved heterogeneity across individuals into this complementary log-log model. After choosing a particular functional form for the baseline hazard across discrete-time periods at risk and incorporating gamma-distributed unobserved heterogeneity, the survival rate to the end of age-level j among individuals who report the same covariates is

\[
\left\{ 1 + \sigma^2 \times \exp(x_{tc,i} \times \beta_{tc}) \times \exp(x_{tv,i} \times \beta_{tv})' \times J_i \right\}^{-1/\sigma^2}
\]

Gamma distributions are typically expressed in terms of location along with shape and scale parameters. With location equal to zero, the shape parameter equal to 1/σ², and the scale parameter equal to σ², unobserved heterogeneity has a mean of 1 and a variance of σ². The variable x_{tc,i} is a row vector of time-constant covariates, and β_{tc} is the associated column vector of parameters. The term x_{tv,i} is a matrix that has a row for each of individual i's time intervals examined, and each column is a covariate that may vary over time. This matrix is multiplied by a column vector of parameters, and the exponential base is applied to each row. Finally, the transposition of this column vector is multiplied by a column vector of ones (J_i). This survival rate is used to construct a likelihood function.

To express the gradient vector and Hessian matrix, the first derivative of survival with respect to the gamma variance parameter is calculated by applying the natural logarithm to the survival rate. Furthermore, each parameter in β_{tv} is treated as a scalar within the likelihood program. Overall, the gradient and Hessian expressions can make the maximization process faster and easier to achieve, the gradient and Hessian may be more accurate, and postestimation commands that require an analytic gradient vector can be used.
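To make the role of the logarithm explicit, write m_i for the scalar exp(x_{tc,i} × β_{tc}) × exp(x_{tv,i} × β_{tv})' × J_i in the survival rate above. The following is a sketch in that notation only; the parameterization γ = ln σ² matches the log-gamma variance reported later, but the expression is my derivation rather than a quotation of the pgmhazgf2 code.

\[
S_i = \left(1 + \sigma^2 m_i\right)^{-1/\sigma^2}, \qquad
\ln S_i = -\frac{1}{\sigma^2}\,\ln\!\left(1 + \sigma^2 m_i\right),
\]
\[
\frac{\partial \ln S_i}{\partial \gamma}
  = \frac{1}{\sigma^2}\,\ln\!\left(1 + \sigma^2 m_i\right)
  - \frac{m_i}{1 + \sigma^2 m_i}, \qquad \gamma = \ln \sigma^2 .
\]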

3 Estimation, computer time, and parameter estimate differences

To examine the speed at which the likelihood function is maximized, I examine a sample of 6,247 individuals who are associated with 34,756 age-level time intervals. The covariates are listed in table 5, and complementary log-log parameter estimates along with a log-gamma variance of minus one are used as maximum likelihood starting values.

The likelihood function is maximized with Stata 11.1. As mentioned above, a Mata gf2 maximum likelihood program that requires gradient vector and Hessian matrix expressions is used. With a Dell 32-bit computer with Windows Vista,3 fitting a standard complementary log-log model and then maximizing the likelihood function with the gf2 program takes approximately two minutes. With this same computer and sample, the pgmhaz8 command takes approximately 21.5 minutes to maximize. When the likelihood function is maximized with gf1 rather than gf2, the Hessian matrix expression is not used and estimation takes approximately 39 minutes. Maximization with gf0 involves just the likelihood function, does not use the gradient or Hessian expressions, and takes approximately 6.5 hours. With these different methods and the rounding that takes place, each set of estimates will be slightly different. In particular, the difference between the pgmhaz8 and gf2 log likelihood levels is 4e–09, and the mreldif() function is used to compare the parameter vectors and variance–covariance matrices. For the parameter vectors, the largest relative difference is 9e–06, and for the variance–covariance matrix it is 4e–07. All of these comparisons are made without sampling weights.

When pweights are applied, gf2 maximization takes a few more seconds, and according to the mreldif() function, some of the parameter estimate and variance–covariance changes are rather large. In particular, the school district 2 parameter estimate goes from 0.05 without weights to −0.34 with weights, and the log-gamma variance parameter estimate goes from 0.70 without weights to 0.40 with weights. For the variance–covariance matrix, the largest relative difference is 0.06, and this is for the variance of the school district 12 parameter estimate.
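A minimal sketch of how such comparisons can be made with mreldif(), assuming the two sets of estimation results have been stored under the hypothetical names pgmhaz8 and gf2 (for example, with estimates store):

    estimates restore pgmhaz8
    matrix b1 = e(b)
    matrix V1 = e(V)
    estimates restore gf2
    display "relative difference in b: " mreldif(b1, e(b))
    display "relative difference in V: " mreldif(V1, e(V))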

4 Failure variable and covariates

The sample examined is from a 2006–2007 Canadian provincial Youth Smoking Survey of New Brunswick students. Self-reported student information is examined. The dataset is one cross-section that identifies individuals who have tried marijuana and, if so, the age at which this first took place. The first discrete-time period is at 9 years old, and the last possible full discrete-time period in the sample is at 17 years old. When people are surveyed, they will have experienced their age level for less than 12 months. Birth dates as well as survey dates are not reported. For these reasons, only fully experienced age levels are examined. Individuals who have never tried marijuana are right-censored at the start of the age level during which they are surveyed. The sample potential time periods for each person are [start of 9 years old, end of 9 years old], (end of 9 years old, end of 10 years old], (end of 10 years old, end of 11 years old], ..., (end of 16 years old, end of 17 years old]. The failure-indicator variable is equal to 0 for all age levels that an individual survives. For those who report failure at 9 years old or younger, the failure-indicator variable is equal to 1 at 9 years old. Among those who fail at older age levels, this variable is equal to 1 at the age level during which the individual tries marijuana for the first time.

3. This computer contains an Intel Core 2 E8400 processor at 3.00 GHz with 4.00 GB of RAM.
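A hedged sketch of how the person-period file just described could be built from the cross-section; all variable names here (id, everused, agefirstused, ageatsurvey) are hypothetical stand-ins for the survey items, and the rules follow the censoring description above.

    * last fully observed age level: the failure age (but not below 9) for users,
    * and the age before the survey age for right-censored never-users
    gen int lastage = cond(everused, max(agefirstused, 9), ageatsurvey - 1)
    keep if lastage >= 9
    gen int nrec = lastage - 9 + 1          // one record per age level from 9 to lastage
    expand nrec
    bysort id: gen int age = 8 + _n
    gen byte fail = everused & age == lastage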

The short form for the natural logarithm of the gamma variance is ln-varg. Reported covariates that do not change as a person grows older are the following:

aboriginal: Equal to 1 if First Nations, Métis, or Inuit and equal to 0 if not an Aboriginal or if there is no response.

parents smoke: Equal to 1 if at least one parent, step-parent, or guardian smokes and equal to 0 otherwise. This is based on current information in the sample and is assumed to be associated with factors that are constant over time for a child.

age relative to other students in a grade at school:

most common age in grade = reference category: Equal to 1 if a person's age is the most common one that people in his grade level report and equal to 0 otherwise. The sample does not report birth or survey dates. When people were surveyed, most of them were in a grade level equal to their age minus five; in this situation, this variable is equal to 1. When a person's age level minus five is not equal to his grade, this variable is equal to 0. This time-constant covariate is constructed under the assumption that students do not fail or skip any grades.

young in grade: Equal to 1 if a person is younger than most other students in his grade and equal to 0 otherwise.

old in grade: Equal to 1 if a person's age is older than most other students in his grade and equal to 0 otherwise.

school district 1 = reference category, 2, 3, . . . , 14: There are 14 school districts in the province of New Brunswick, and an indicator variable is constructed for each one. For confidentiality reasons, each district is arbitrarily assigned a number.

Reported covariates that can or must change as a person grows older are the follow- ing:

9 years old = reference category, 10, 11, . . . , 17: Each of these indicator vari- ables identifies an age level.

1997–1998, 1998–1999, 1999–2000, ..., 2005–2006 = reference category: Each of these indicator variables identifies a calendar year.

first time consuming five or more drinks of alcohol on one occasion:

never 5+ drinks at previous age or younger = reference category: Equal to 1 when a person has never consumed five or more drinks of alcohol on one occasion or first experienced this during the age level under consideration and equal to 0 otherwise. M. G. Farnworth 247

first 5+ drinks at previous age: Equal to 1 when a person consumed five or more drinks of alcohol on one occasion for the first time during the previous age level and equal to 0 otherwise. first 5+ drinks before previous age: Equal to 1 when a person consumed five or more drinks of alcohol on one occasion for the first time two or more age levels ago and equal to 0 otherwise.

first time trying a cigarette:

never tried cigarette at previous age or younger = reference category: Equal to 1 if a person has never tried a cigarette (even a few puffs) or first experienced this during the age level under consideration and equal to 0 otherwise.

first cigarette at previous age: Equal to 1 when a person tried a cigarette (even a few puffs) for the first time during the previous age level and equal to 0 otherwise.

first cigarette before previous age: Equal to 1 when a person has tried a cigarette (even a few puffs) for the first time two or more age levels ago and equal to 0 otherwise.

Each individual sampled responded to one of five potential modules of questions. One module surveyed physical activity, one surveyed healthy eating, and the other three focused on the topic of smoking. Certain schools in the province of New Brunswick were surveyed, and for each school the survey was conducted in the primary language of the majority of students, which was either French or English. Two of the modules on smoking report the information required for this article, and the associated sampling weight for each individual is used.

5 Results

The sample provides information on 7,702 males; each person is surveyed just once. After dropping individuals who were born outside of Canada or who are right-censored and do not report their age, 6,372 individuals remain. For each age level, table 1 reports the number of individuals who had not failed or become right-censored during younger age levels, and hence they all had some chance of reporting failure. This table also reports the calendar year range associated with each age level examined. Finally, the number of uncensored individuals who survived into an age level is summed over the age levels for a total of 35,429.

Table 1. Sample size for each age level

Age level   Associated calendar year range   Number who survived to and were not right-censored before the beginning of the age level
 9          1997–2005                         6,372
10          1998–2006                         6,230
11          1999–2006                         6,148
12          2000–2006                         5,587
13          2001–2006                         4,779
14          2002–2006                         3,383
15          2003–2006                         1,952
16          2004–2006                           836
17          2005–2006                           142
Sum over age levels:                         35,429
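Assuming the person-age file sketched earlier, the risk-set counts in table 1 are simply the number of person-age rows observed at each age level; a minimal check:

* Hedged sketch: one row per person per age level at risk, so the table 1
* counts are the frequencies of the age variable.
tabulate age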

Table 2 reports complementary log-log parameter estimates arrived at with just age covariates and sampling weights. Throughout this section, a single star indicates a p-value of less than 5%, two stars indicate less than 1%, and three stars indicate less than 0.1%. Apart from the counts in tables 1 and 3, all the findings were arrived at with sampling weights. The age parameter estimates in table 2 indicate that some individuals claim to have first tried marijuana at 9 years old or younger; the conditional likelihood of this first happening at age 10 or 11 is slightly lower. This could be due to some people claiming to have tried marijuana at age 9 or younger when in fact this may not be true. After 10 years old, the conditional likelihood of first trying marijuana consistently rises until age 17, which is the final age level examined. With these estimates, the survival rate to the end of age 17 can be predicted with

nlcom 1000*exp(-exp(_b[_cons])* ///
    (1+exp(_b[age10])+exp(_b[age11])+exp(_b[age12])+exp(_b[age13]) ///
    +exp(_b[age14])+exp(_b[age15])+exp(_b[age16])+exp(_b[age17])))

which is equal to 364 individuals per thousand with a standard error of 22.

Table 2. Descriptive cloglog, males

Parameter       Estimate     z
constant        −3.842***  −41.41
9 years old     reference category
10 years old    −0.726***   −4.54
11 years old    −0.311*     −2.08
12 years old     0.645***    5.32
13 years old     1.368***   12.52
14 years old     1.920***   17.95
15 years old     2.215***   20.23
16 years old     2.289***   18.07
17 years old     2.584***   12.20
Note: These estimates are arrived at with just age covariates.
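A hedged sketch of the kind of command that could produce estimates like those in table 2, assuming the person-age file above, the failure indicator fmari_d, age dummies age10–age17 (the constant absorbs the 9-years-old reference category), and a sampling-weight variable wt (the weight name is hypothetical):

cloglog fmari_d age10 age11 age12 age13 age14 age15 age16 age17 ///
    [pweight = wt], vce(cluster id)      // cluster on the person identifier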

The next estimates are arrived at with both age and calendar year covariates. The sample was collected during a school year that goes from September to June. Given the absence of birth dates, the sample identifies one age level that takes place within an unknown part of the September to June time period. For each school year in the sample, table 3 reports the associated age range along with the number of individuals who have some chance of reporting failure. An important thing to note is that during the year of 1997–1998, there is only information on 9-year-olds. This means that a 1997–1998 covariate could be associated with a factor that is unique to being 9 years old during this calendar year. It is also the case that at 17 years old, all individuals are experiencing the year of 2005–2006. Similar issues arise with wider time spans, and hence the age covariates could be associated with year covariates and vice versa. Table 4 reports these complementary log-log parameter estimates.

Table 3. Age levels associated with each calendar time range

School year calendar range      Age levels individuals reported info on within the calendar range   Number who survived to and were not right-censored before the beginning of the age levels that take place within each calendar range
1 September 1997–1 July 1998    9        389
1 September 1998–1 July 1999    9–10     1,491
1 September 1999–1 July 2000    9–11     2,735
1 September 2000–1 July 2001    9–12     3,994
1 September 2001–1 July 2002    9–13     5,061
1 September 2002–1 July 2003    9–14     5,530
1 September 2003–1 July 2004    9–15     5,767
1 September 2004–1 July 2005    9–16     5,458
1 September 2005–1 July 2006    10–17    5,004
Sum over each calendar range:            35,429

Table 4. Descriptive cloglog, males

Parameter       Estimate     z          Parameter    Estimate     z
constant        −4.922***  −31.25       1997–1998     2.059***    7.53
9 years old     reference category      1998–1999     1.417***    6.18
10 years old    −0.431*     −2.54       1999–2000     1.428***    7.83
11 years old     0.266       1.64       2000–2001     0.976***    6.17
12 years old     1.447***    9.54       2001–2002     0.551***    4.47
13 years old     2.310***   15.16       2002–2003     0.297**     2.97
14 years old     2.946***   18.89       2003–2004     0.102       1.15
15 years old     3.290***   20.31       2004–2005    −0.017      −0.23
16 years old     3.373***   18.99       2005–2006    reference category
17 years old     3.663***   14.85
Note: These estimates are arrived at with just age and calendar time covariates.

With these estimates, the survival rate to the end of age 17 among those who were 17 in 2005–2006 can be predicted with

nlcom 1000*exp(-exp(_b[_cons])*(exp(_b[y1997_98]) ///
    +exp(_b[age10]+_b[y1998_99])+exp(_b[age11]+_b[y1999_00]) ///
    +exp(_b[age12]+_b[y2000_01])+exp(_b[age13]+_b[y2001_02]) ///
    +exp(_b[age14]+_b[y2002_03])+exp(_b[age15]+_b[y2003_04]) ///
    +exp(_b[age16]+_b[y2004_05])+exp(_b[age17])))

which is equal to 295 per thousand with a standard error of 20.

Finally, the log-gamma variance parameter and all the covariates are included. All the findings thus far are based on 6,372 individuals. Because of missing covariate information, a few observations have been deleted; hence, what follows is based on 6,247 males. Among these males, the number of uncensored individuals who survived into an age level, summed over age levels, is 35,429. The log-gamma variance estimate is 0.40, which implies a gamma variance of 1.49. The Stata short form for this parameter estimate is _b[ln_gvar:_cons]. The command

testnl exp(_b[ln_gvar:_cons]) = 0

along with the resulting p-value divided by two for a one-tailed test conveys statistical significance at 0.1%. To interpret this variance estimate, consider a hazard ratio between two individuals who are the same except that the unobserved term takes on a particular value for the individual in the numerator and is at the normalized level of 1 for the other person. The distribution of this hazard ratio can be plotted with the following command:

twoway function y = gammaden(1/exp(_b[ln_gvar:_cons]),exp(_b[ln_gvar:_cons]),0,x)

Furthermore, the percentage of individuals predicted to have a level of the unobserved heterogeneity term at 0.5 or less can be identified with

display gammap(1/exp(_b[ln_gvar:_cons]),0.5/exp(_b[ln_gvar:_cons]))

These commands are used to plot the density function in figure 1. The log-gamma estimate implies that 47% of individuals have a degree of unobserved heterogeneity that is half the mean of 1 or less, and 66% have a degree of unobserved heterogeneity of 1 or less. In contrast, the variance estimate implies that 15% have a degree of unobserved heterogeneity that is more than twice the mean. This suggests that there are a small number of people who for reasons not reported are highly likely to try marijuana for the first time.
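The percentages quoted above can be recovered directly from the gamma variance estimate; a minimal sketch, plugging in the reported value of 1.49:

scalar v = 1.49                      // gamma variance estimate reported above
display gammap(1/v, 0.5/v)           // share at or below half the mean (about 0.47)
display gammap(1/v, 1/v)             // share at or below the mean of 1 (about 0.66)
display 1 - gammap(1/v, 2/v)         // share more than twice the mean (about 0.15)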


Figure 1. Gamma variance estimate

For each covariate, the hazard ratio associated with a 1-unit increase in one variable is reported in table 5. A hazard ratio of 1 implies that the associated parameter is equal to 0, and hence the z statistic associated with the null hypothesis that the parameter is equal to 0 is also reported. The intercept estimate is −5.03, and the associated z statistic is −23.36.

Table 5. pgmhaz males

Time constant                Hazard ratio   z†
aboriginal                       1.92***    3.67
not aboriginal                   reference category
parents smoke                    1.94***    7.19
parents don't smoke              reference category
most common age in grade         reference category
young in grade                   1.17       0.67
old in grade                     1.31**     2.72
school district 1                reference category
school district 2                0.71      −1.36
school district 3                0.83      −1.07
school district 4                0.37***   −4.03
school district 5                1.20       1.45
school district 6                0.81      −1.67
school district 7                1.34       1.33
school district 8                0.88      −0.83
school district 9                0.70      −1.88
school district 10               0.90      −0.63
school district 11               0.66**    −2.67
school district 12               0.21***   −3.91
school district 13               1.24       1.14

Time varying                 Hazard ratio   z†
9 years old                      reference category
10 years old                     0.46***   −4.47
11 years old                     0.88      −0.75
12 years old                     2.79***    6.12
13 years old                     6.33***    9.99
14 years old                    12.83***   11.79
15 years old                    20.92***   11.50
16 years old                    27.01***    9.81
17 years old                    33.55***    7.86
1997–1998                        4.15***    4.24
1998–1999                        2.92***    4.03
1999–2000                        3.04***    4.89
2000–2001                        2.01***    3.48
2001–2002                        1.42*      2.14
2002–2003                        1.12       0.84
2003–2004                        1.01       0.13
2004–2005                        0.94      −0.72
2005–2006                        reference category

Time varying                                         Hazard ratio   z†
never 5+ drinks at previous age or younger               reference category
first 5+ drinks at previous age                          4.40***    10.69
first 5+ drinks before previous age                      5.48***     7.10
never tried cigarette at previous age or younger         reference category
first cigarette at previous age                          5.77***    12.51
first cigarette before previous age                      5.60***    10.37

† Each z statistic is based on the null hypothesis that the associated parameter is equal to 0, which implies a hazard ratio of 1.

For reasons discussed above, each time-varying covariate parameter is a scalar within the likelihood program. Hence, the age 10 parameter estimate is called _b[age10:_cons] and the 1997–1998 parameter estimate is called _b[y1997_98:_cons]. The age and calendar year parameter estimates along with all other variables set equal to 0 imply the following survival rate to the end of 17 years old among those who were 9 years old in 1997–1998:

nlcom 1000*(1+exp(_b[ln_gvar:_cons]+_b[fmari_d:_cons]) /// *(exp(_b[y1997_98:_cons]) /// +exp(_b[age10:_cons]+_b[y1998_99:_cons]) /// +exp(_b[age11:_cons]+_b[y1999_00:_cons]) /// +exp(_b[age12:_cons]+_b[y2000_01:_cons]) /// +exp(_b[age13:_cons]+_b[y2001_02:_cons]) /// +exp(_b[age14:_cons]+_b[y2002_03:_cons]) /// +exp(_b[age15:_cons]+_b[y2003_04:_cons]) /// +exp(_b[age16:_cons]+_b[y2004_05:_cons]) /// +exp(_b[age17:_cons])))^(-exp(-_b[ln_gvar:_cons]))

This survival rate is 600 per thousand with a standard error of 31. The reference categories other than age and calendar time are the following: not aboriginal, no parent smokes, age level is the most common in grade at school, attends a school in district 1, have not consumed five or more drinks on one occasion for the first time during the previous age or earlier, and have not tried a cigarette for the first time during the previous age or earlier. Each of the time-constant covariates is separately changed from 0 to 1, and table 6 reports the associated survival rates to the end of being 17 years old. In this table, the school district survival rates are listed from highest to lowest. Among all the time constant variables, the survival rate falls to the lowest level when the covariate indicating that a parent smokes goes from 0 to 1; changing the aboriginal covariate from 0 to 1 has a similar effect.

Table 6. Survival rate to the end of 17 years old per thousand

                       Survival rate   Std. error
reference categories        600            31
aboriginal                  458            44
parents smoke               457            31
young in grade              566            59
old in grade                541            30
school district 12          867            44
school district 4           790            41
school district 11          687            33
school district 9           675            43
school district 2           671            58
school district 6           644            29
school district 3           640            40
school district 8           626            35
school district 10          622            39
school district 5           560            31
school district 13          553            44
school district 7           536            53
The reference categories are not aboriginal, parents don't smoke, most common age level in grade, school district 1, never 5+ drinks at previous age or younger, and never tried cigarette at previous age or younger.
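As a hedged sketch of how one entry in table 6 could be reproduced, the nlcom expression above can be reused with the relevant time-constant coefficient added to the linear index; the scalar name psmoke for the parents-smoke parameter is hypothetical, since the actual name inside the author's likelihood program is not shown here.

nlcom 1000*(1+exp(_b[ln_gvar:_cons]+_b[fmari_d:_cons]+_b[psmoke:_cons]) ///
    *(exp(_b[y1997_98:_cons]) ///
    +exp(_b[age10:_cons]+_b[y1998_99:_cons]) ///
    +exp(_b[age11:_cons]+_b[y1999_00:_cons]) ///
    +exp(_b[age12:_cons]+_b[y2000_01:_cons]) ///
    +exp(_b[age13:_cons]+_b[y2001_02:_cons]) ///
    +exp(_b[age14:_cons]+_b[y2002_03:_cons]) ///
    +exp(_b[age15:_cons]+_b[y2003_04:_cons]) ///
    +exp(_b[age16:_cons]+_b[y2004_05:_cons]) ///
    +exp(_b[age17:_cons])))^(-exp(-_b[ln_gvar:_cons]))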

The remaining results involve previous alcohol and cigarette consumption. When individuals consume five or more drinks of alcohol on one occasion for the first time at age 16, the survival rate to the end of 17 years old falls from 600 per thousand to 453 per thousand (standard error = 42). When this volume of drinks is first consumed at age 15, the survival rate falls to 367 per thousand (standard error = 38), and when such drinking takes place at age 14, it falls to 322 per thousand (standard error = 33). Similarly, when individuals first try a cigarette at age 16, the survival rate to the end of 17 years old falls from 600 per thousand to 415 per thousand (standard error = 43); when first trying a cigarette at age 15, the survival rate falls to 348 per thousand (standard error = 33); and with a first cigarette at 14, it falls to 308 per thousand (standard error = 29). These previous choices appear to have the strongest relationship with the hazard of trying marijuana for the first time. In the future, it will be valuable to examine what motivates people to make each of these choices and to compare choices among males and females.

6 Summary

Expressing the gradient vector and Hessian matrix within a computer program increases the speed at which a complementary log-log model with gamma-distributed unobserved heterogeneity can be fit. Furthermore, one can use estimation options and postestimation commands that require the expression of a gradient vector. The findings suggest that there are a small number of individuals who are very likely to first try marijuana because of unobserved factors. The findings also suggest that if a young male has not yet tried marijuana, the hazard of such initiation is rather high after he first consumes a large amount of alcohol on one occasion or tries a cigarette for the first time. In the future, it may be informative to examine the most common order of marijuana, cigarette, and alcohol initiation, and the time between each of these choices among both males and females.

7 Acknowledgments

This research was in part made possible and supported by the Health & Education Research Group at the University of New Brunswick. The coexecutive director, Dr. Bill Morrison, was particularly patient and helpful. Financial support was also received from the Canadian Institute for Health Research. While developing this computer program and when writing generally about the duration literature, advice from professor Stephen P. Jenkins was essential, and his comments were very helpful. Isabel Canette, who has a PhD in statistics/data analysis and is a senior statistician with Stata, was also very helpful in answering questions I had while working on this. I would also like to thank professors Ted McDonald, Mehmet Dalkir, and David Haardt for their advice and encouragement.

8 References

Abbring, J. H., and G. J. van den Berg. 2007. The unobserved heterogeneity distribution in duration analysis. Biometrika 94: 87–99.

Heckman, J., and B. Singer. 1984. A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 52: 271–320.

Jenkins, S. P. 1995. Easy estimation methods for discrete-time duration models. Oxford Bulletin of Economics and Statistics 57: 129–136.

———. 2004a. hshaz: Stata module to estimate discrete time (grouped data) proportional hazards models. Statistical Software Components S444601, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s444601.html.

———. 2004b. pgmhaz8: Stata module to estimate discrete time (grouped data) proportional hazards models. Statistical Software Components S438501, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s438501.html.

———. 2008. Survival analysis with Stata. University of Essex module EC968. http://www.iser.essex.ac.uk/resources/survival-analysis-with-stata-module-ec968.

Meyer, B. D. 1990. Unemployment insurance and unemployment spells. Econometrica 58: 757–782.

Ondrich, J., and S. E. Rhody. 1999. Multiple spells in the Prentice–Gloeckler–Meyer likelihood with unobserved heterogeneity. Economics Letters 63: 139–144.

Prentice, R. L., and L. A. Gloeckler. 1978. Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34: 57–67.

Sueyoshi, G. T. 1995. A class of binary response models for grouped duration data. Journal of Applied Econometrics 10: 411–431.

About the author

Michael G. Farnworth is an associate professor in the Department of Economics at the University of New Brunswick.

The Stata Journal (2012) 12, Number 2, pp. 257–283

Threshold regression for time-to-event analysis: The stthreg package

Tao Xiao, The Ohio State University, Columbus, OH, [email protected]
G. A. Whitmore, McGill University, Montreal, Canada, [email protected]
Xin He, University of Maryland, College Park, MD, [email protected]
Mei-Ling Ting Lee, University of Maryland, College Park, MD, [email protected]

Abstract. In this article, we introduce the stthreg package of Stata commands to fit the threshold regression model, which is based on the first hitting time of a boundary by the sample path of a Wiener diffusion process and is well suited to applications involving time-to-event and survival data. The threshold regression model serves as an important alternative to the Cox proportional hazards model. The four commands that comprise this package for the threshold regression model are the model-fitting command stthreg, the postestimation command trhr for hazard-ratio calculation, the postestimation command trpredict for prediction, and the model diagnostics command sttrkm. These commands can also be used to implement an extended threshold regression model that accommodates applications where a cure rate exists.

Keywords: st0257, stthreg, trhr, trpredict, sttrkm, bootstrap, Cox proportional hazards regression, cure rate, first hitting time, hazard ratios, model diagnostics, survival analysis, threshold regression, time-to-event data, Wiener diffusion process

1 Introduction to threshold regression

Threshold regression is a methodology to analyze time-to-event data. For a review of this methodology, see Lee and Whitmore (2006) and Lee and Whitmore (2010). A unique feature of threshold regression is that the event time is the first time an underlying stochastic process hits a boundary threshold. In the context of survival data, for example, the event can be death, and the time of death is the time when his/her latent health status first decreases to a boundary at zero. With the stthreg command, a Wiener process Y(t) is used to model the latent health status process. An event is observed when Y(t) reaches 0 for the first time. Three parameters of the Wiener process are involved: μ, y0, and σ². Parameter μ, called the drift of the Wiener process, is the mean change per unit of time in the level of the sample path. The sample path approaches the threshold if μ < 0. Parameter y0 is the initial value of the process and is taken as positive. Parameter σ² represents

the variance per unit of time of the process (Lee and Whitmore 2006). The first hitting time (FHT) of a Wiener process with μ, y0, and σ² has an inverse Gaussian distribution with the probability density function

\[ f\left(t \mid \mu, \sigma^2, y_0\right) = \frac{y_0}{\sqrt{2\pi\sigma^2 t^3}} \exp\left\{-\frac{(y_0 + \mu t)^2}{2\sigma^2 t}\right\} \tag{1} \]

where −∞ < μ < ∞, σ² > 0, and y0 > 0. The probability density function is proper if μ ≤ 0. The cumulative distribution function of the FHT is

\[ F\left(t \mid \mu, \sigma^2, y_0\right) = \Phi\left\{-\frac{y_0 + \mu t}{\sqrt{\sigma^2 t}}\right\} + \exp\left(-\frac{2 y_0 \mu}{\sigma^2}\right)\Phi\left\{\frac{\mu t - y_0}{\sqrt{\sigma^2 t}}\right\} \tag{2} \]

where Φ(·) is the cumulative distribution function of the standard normal distribution. Note that if μ > 0, the Wiener process may never hit the boundary at zero, and hence there is a probability that the FHT is ∞; specifically, P(FHT = ∞) = 1 − exp(−2y0μ/σ²) (Cox and Miller 1965). Because the health status process is usually latent (that is, unobserved), an arbitrary unit can be used to measure such a process. Hence the variance parameter σ² of the process is set to 1 in the stthreg command to fix the measurement unit of the process. Then we can regress the other two process parameters, y0 and μ, on the covariate data. We assume that μ and ln(y0) are linear in regression coefficients. Suppose that the covariate vector is Z′ = (1, Z1, . . . , Zk), where Z1, . . . , Zk are covariates and the leading 1 in Z allows for a constant term in the regression relationship. Then ln(y0) and μ can be linked to the covariates with the following regression forms:

\[ \ln(y_0) = \gamma_0 + \gamma_1 Z_1 + \cdots + \gamma_k Z_k = Z'\gamma \tag{3} \]

\[ \mu = \beta_0 + \beta_1 Z_1 + \cdots + \beta_k Z_k = Z'\beta \tag{4} \]

Vectors γ in (3) and β in (4) represent regression coefficients for ln(y0) and μ, respectively, with γ′ = (γ0, . . . , γk) and β′ = (β0, . . . , βk). Note that researchers can set some elements in γ or β to zero if they feel the corresponding covariates are not important in predicting ln(y0) or μ. For example, if covariate Z1 in the vector Z′ is considered not important to predict ln(y0), we can remove the Z1 term in (3) by setting γ1 to zero. In the remaining sections of this article, we detail how to use our software package to implement threshold regression in Stata. In section 2, we compare threshold regression with Cox proportional hazards regression by running three of the four commands (that is, sttrkm, stthreg, and trhr), together with some existing Stata commands, on leukemia.dta. In sections 3 through 6, we introduce the uses of these four commands, respectively. We use melanoma.dta in these four sections to illustrate how to use these four commands. In section 7, we introduce these four commands for an extended threshold regression model that we refer to as the threshold regression cure-rate model, and we demonstrate their application to kidney.dta.

2 Comparison of threshold regression with Cox proportional hazards regression

The Cox regression model, also called proportional hazards regression, has played a key role in the area of time-to-event data analysis for many years (Cox 1972). The Cox model assumes that covariates alter hazard functions in a proportional manner. Threshold regression does not assume proportional hazards (PH) and can be used as an alternative to the Cox model, especially when the PH assumption of the Cox model is violated. The connections between threshold regression and Cox regression have been studied in Lee and Whitmore (2010), where it is shown that Cox regression is, for most purposes, a special case of threshold regression. In this section, we use a leukemia remission study dataset (Garrett 1997) from the Stata website to compare these two models when the PH assumption of the Cox model is violated. Below we obtain this leukemia remission dataset from the Stata website and declare it to be survival-time data.

. webuse leukemia
(Leukemia Remission Study)

. describe

Contains data from http://www.stata-press.com/data/r12/leukemia.dta
  obs:            42                          Leukemia Remission Study
 vars:             8                          23 Mar 2011 10:39
 size:           336

              storage  display    value
variable name   type   format     label      variable label

weeks           byte   %8.0g                 Weeks in Remission
relapse         byte   %8.0g      yesno      Relapse
treatment1      byte   %8.0g      trt1lbl    Treatment I
treatment2      byte   %8.0g      trt2lbl    Treatment II
wbc3cat         byte   %9.0g      wbclbl     White Blood Cell Count
wbc1            byte   %8.0g                 wbc3cat==Normal
wbc2            byte   %8.0g                 wbc3cat==Moderate
wbc3            byte   %8.0g                 wbc3cat==High

Sorted by: weeks

. stset weeks, failure(relapse)

     failure event:  relapse != 0 & relapse < .
obs. time interval:  (0, weeks]
 exit on or before:  failure

         42  total obs.
          0  exclusions

         42  obs. remaining, representing
         30  failures in single record/single failure data
        541  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        35

The dataset consists of 42 patients who were monitored to see if they relapsed (relapse: 1 = yes, 0 = no) and how long (in weeks) they remained in remission (weeks). These 42 patients received two different treatments. For the first treatment, 21 patients received a new experimental drug (drug A), and the other 21 received a standard drug (treatment1: 1 = drug A, 0 = standard). For the second treatment, 20 patients received a different drug (drug B), and the other 22 received a standard drug (treatment2: 1 = drug B, 0 = standard). White blood cell count, a strong indicator of the presence of leukemia, was recorded in three categories (wbc3cat: 1 = normal, 2 = moderate, 3 = high).

As demonstrated in [ST] stcox PH-assumption tests, the variable treatment2 violates the PH assumption. We focus on this variable below to demonstrate the advantages of the threshold regression model over the Cox model when the PH assumption is violated. First, we use our new model diagnostic command sttrkm to demonstrate the better fit of the threshold regression model over the Cox model for the leukemia remission data. To address this comparison, we also use three existing diagnostic commands provided by Stata: stphplot, sts, and stcoxkm. Figures 1 through 4 are generated by the following four commands sequentially:

. stphplot, by(treatment2) noshow title("Log-Log Plot") plot2opts(lpattern(-))
. sts, by(treatment2) noshow
. stcoxkm, by(treatment2) noshow title("Cox Predicted vs. Observed")
>     pred1opts(lpattern(dash)) pred2opts(lpattern(longdash))
. sttrkm, lny0(treatment2) mu(treatment2) noshow
>     title("TR Predicted vs. Observed")

In figure 1, the curves of −ln{−ln(survival)} versus ln(analysis time) for both the drug B group and the standard group are plotted in a log-log plot. If the plotted lines are reasonably parallel, the PH assumption has not been violated. Because the two curves corresponding to the two groups cross each other in figure 1, the PH assumption is clearly violated for treatment2. The PH violation is also suggested by figure 2, where the Kaplan–Meier survival curves for the two groups also cross each other. In figure 3, we use the diagnostic command stcoxkm for the Cox model to overlay the Kaplan–Meier survival curves and the Cox predicted curves for treatment2. The Kaplan–Meier curves and Cox predicted curves are not close to each other, suggesting a poor fit of the Cox model. Obviously, this poor fit results from a violation of the PH assumption for treatment2. The threshold regression model based on a Wiener process, however, does not assume proportional hazards. In figure 4, we use our diagnostic command sttrkm for the threshold regression model to overlay the Kaplan–Meier survival curves and the threshold regression predicted curves for treatment2. The Kaplan–Meier curves and the threshold regression predicted curves match very well.


Figure 1. Log-log plot by the treatment2 variable for the leukemia data

Figure 2. Kaplan–Meier plot by the treatment2 variable for the leukemia data


Figure 3. Cox predicted plot versus Kaplan–Meier plot by the treatment2 variable for the leukemia data

Figure 4. Threshold regression predicted plot versus Kaplan–Meier plot by the treatment2 variable for the leukemia data

Further, let us use another two new commands, stthreg and trhr, to fit the threshold regression model to the leukemia remission data and calculate the hazard ratios at week 4, week 9, and week 14. In addition, the trhr command calculates pointwise bootstrap confidence intervals for the hazard ratios with its ci option (before the trhr command, we set the random seed to 1 for the bootstrap procedure). It also plots the estimated hazard function of the drug B group against that of the standard drug group by its graph(hz) option. This plot is shown in figure 5.

. xi: stthreg, lny0(i.treatment2) mu(i.treatment2)
i.treatment2      _Itreatment_0-1     (naturally coded; _Itreatment_0 omitted)
         failure _d:  relapse
   analysis time _t:  weeks
initial:       log likelihood = -140.16289
alternative:   log likelihood = -173.94938
rescale:       log likelihood = -136.69022
rescale eq:    log likelihood = -116.95156
Iteration 0:   log likelihood = -116.95156
Iteration 1:   log likelihood = -110.07546
Iteration 2:   log likelihood = -104.66173
Iteration 3:   log likelihood = -104.64227
Iteration 4:   log likelihood = -104.64227
ml model estimated; type -ml display- to display results
Threshold Regression Estimates                    Number of obs   =         42
                                                  Wald chi2(2)    =      27.39
Log likelihood = -104.64227                       Prob > chi2     =     0.0000

                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
lny0
_Itreatmen~1    -1.273925   .2442516    -5.22   0.000      -1.75265   -.7952011
       _cons     2.009787    .170641    11.78   0.000      1.675336    2.344237
mu
_Itreatmen~1      .588838   .1535616     3.83   0.000      .2878628    .8898132
       _cons    -.5886176   .1340773    -4.39   0.000     -.8514042   -.3258309

. set seed 1
. trhr, var(treatment2) timevalue(4 9 14) ci graph(hz)
(running stthreg on estimation sample)
Bootstrap replications (2000)
  (output omitted )
Hazard Ratio for Scenario , at Time = 4

          var   Hazard Ratio   [95% Percentile C.I.]
_Itreatment_1   5.9588129935   1.8366861   147.06873

(running stthreg on estimation sample)
Bootstrap replications (2000)
  (output omitted )
Hazard Ratio for Scenario , at Time = 9

          var   Hazard Ratio   [95% Percentile C.I.]
_Itreatment_1   .38571350099   .17518535   .84617594

(running stthreg on estimation sample)
Bootstrap replications (2000)
  (output omitted )
Hazard Ratio for Scenario , at Time = 14

          var   Hazard Ratio   [95% Percentile C.I.]
_Itreatment_1   .19011874598   .07150129   .44462474


Figure 5. Estimated hazard functions by the treatment2 variable for the leukemia data by the threshold regression model

Figure 6. Estimated hazard ratio values over time by the treatment2 variable for the leukemia data by the threshold regression model

It is clear that the hazard ratios calculated at weeks 4, 9, and 14 are quite different. At week 4, the estimated hazard ratio of the drug B group versus the standard drug group is 5.96 (the 95% bootstrap CI of the hazard ratio excludes 1 on the left); while at week 9, the estimated hazard ratio drops to 0.38 (the 95% bootstrap CI of hazard ratio excludes 1 on the right); the hazard ratio keeps decreasing to 0.19 at week 14 (the 95% bootstrap CI of the hazard ratio excludes 1 on the right). Obviously, the change of hazard ratio over time is huge, and the change is captured by the threshold regression model. In figure 5, it is clear that the hazard function curves of the two groups cross over time, and that is why the hazard ratio of the drug B group versus the standard drug group changes from 5.96 (a value greater than 1) at week 4 to 0.19 (a value less than 1) at week 14. We can use the graph(hr) option of the trhr command to depict the hazard ratios for different time points to illustrate the changing pattern of the hazard ratio over time. For example, the following command can be used to plot the estimated hazard-ratio values in a specified time span from week 4 to week 30 with a reference line for a hazard ratio of 1. The generated plot is shown in figure 6.

. trhr, var(treatment2) timevalue(4(1)30) graph(hr) graphopt(title(Hazard Ratio)
>     ytitle(Estimated Hazard Ratio) legend(on) yline(1,lpattern(shortdash))
>     ymtick(1) ymlabel(1))

On the other hand, because of the proportional-hazards assumption, the Cox model can only be used to estimate a constant hazard ratio across the whole time span for the drug B group versus the standard drug group. The following command fits the Cox model to the leukemia remission data. The resulting hazard ratio is a constant 0.75 with the 95% CI including 1. This is obviously a misleading outcome for these data where the PH assumption is violated.

. xi: stcox i.treatment2
i.treatment2      _Itreatment_0-1     (naturally coded; _Itreatment_0 omitted)
         failure _d:  relapse
   analysis time _t:  weeks
Iteration 0:   log likelihood = -93.98505
Iteration 1:   log likelihood = -93.71683
Iteration 2:   log likelihood = -93.716786
Refining estimates:
Iteration 0:   log likelihood = -93.716786
Cox regression -- Breslow method for ties
No. of subjects =           42                    Number of obs   =         42
No. of failures =           30
Time at risk    =          541
                                                  LR chi2(1)      =       0.54
Log likelihood  =   -93.716786                    Prob > chi2     =     0.4639

          _t   Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
_Itreatmen~1     .7462828   .3001652    -0.73   0.467     .3392646    1.641604

In the following sections, we detail the use of the four new threshold regression commands one by one with examples.

3 stthreg: The command to fit the threshold regression model

3.1 Introduction to stthreg

stthreg fits the threshold regression model introduced in section 1. Maximum likelihood estimation is used to estimate the regression coefficients in vectors γ and β. A subject i in the sample dataset who is observed to die contributes the FHT probability density $f(t^{(i)} \mid \mu^{(i)}, y_0^{(i)})$ to the sample likelihood function, where $t^{(i)}$ is the observed time of death of the subject. A subject j in the sample dataset who lives to the end of the study contributes the survival probability $1 - F(t^{(j)} \mid \mu^{(j)}, y_0^{(j)})$ to the sample likelihood function, where $t^{(j)}$ is the right-censored survival time of the subject. Among the n subjects in the sample, subjects with observed death times are indexed 1 to $n_1$ and subjects with right-censored observations are indexed $n_1 + 1$ to n. Then the log-likelihood function is written as

\[ \ln L(\beta, \gamma) = \sum_{i=1}^{n_1} \ln f\left(t^{(i)} \mid \mu^{(i)}, y_0^{(i)}\right) + \sum_{j=n_1+1}^{n} \ln\left\{1 - F\left(t^{(j)} \mid \mu^{(j)}, y_0^{(j)}\right)\right\} \tag{5} \]

The maximum-likelihood estimation routine with the lf method (Gould et al. 2010) is incorporated in the stthreg command to find the maximum likelihood estimates of the regression coefficients in the threshold regression model. The convergence speed is fairly fast.
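To make the likelihood in (5) concrete, here is a hedged sketch of what an lf-method evaluator for this model could look like with σ² fixed at 1; it is not the shipped stthreg code, and it assumes the failure indicator is passed as the dependent variable and that the data have been stset (so _t is available).

program define trllf_sketch
    version 12
    args lnf xb_lny0 xb_mu
    tempvar y0 f F
    quietly {
        generate double `y0' = exp(`xb_lny0')
        // FHT (inverse Gaussian) density and cdf with sigma^2 = 1
        generate double `f' = `y0'/sqrt(2*_pi*_t^3)*exp(-(`y0' + `xb_mu'*_t)^2/(2*_t))
        generate double `F' = normal(-(`y0' + `xb_mu'*_t)/sqrt(_t)) + ///
            exp(-2*`y0'*`xb_mu')*normal((`xb_mu'*_t - `y0')/sqrt(_t))
        // death: log density; right-censored: log survival
        replace `lnf' = cond($ML_y1 == 1, ln(`f'), ln(1 - `F'))
    }
end

* Possible use (covariate lists illustrative only):
* . ml model lf trllf_sketch (lny0: _d = sex thick age) (mu: sex thick age)
* . ml maximize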

3.2 Syntax of stthreg

stthreg [if] [in], lny0(varlist) mu(varlist) [noconstant cure lgtp(varlist) level(#) init(string) nolog maximize options]

3.3 Options lny0(varlist) specifies independent variables that will be used in the linear regres- sion function for ln y0 in the threshold regression model, as in (3). For example, if Z1,...,Zk in (3) are used, then this lny0(varlist) option should be written as lny0(Z1 ... Zk). Note that the intercept coefficient γ0 in (3) will be automati- cally added in the linear regression function for ln y0, and this intercept coefficient will correspond to the cons term in the lny0 section of the output regression coef- ficient table. The estimation results for the other regression coefficients in the linear regression function for ln y0 correspond to the independent variables in the lny0 section of the output regression coefficient table. If no independent variable is listed in lny0(), then only the intercept coefficient will be used in the regression function for ln y0. In this case, the estimated value of ln y0 is equal to the estimated value of this intercept coefficient, as can be easily seen in (3). lny0() is required. mu(varlist) specifies independent variables that will be used in the linear regression function for μ in the threshold regression model, as in (4). For example, if Z1,...,Zl in (4) are used, then this mu(varlist) option should be written as mu(Z1 ... Zl). Note that the intercept coefficient β0 in (4) will be automatically added in the linear regression function for μ, and this intercept coefficient will correspond to the cons term in the mu section of the output regression coefficient table. The estimation results for the other regression coefficients in the linear regression function for μ will appear after the names of the corresponding independent variables in the mu section of the output regression coefficient table. If no independent variable is listed in mu(), then only the intercept coefficient will be used in the regression function for μ.In this case, the estimated value of μ is merely equal to the estimated value of this intercept coefficient, as can be easily seen in (4). mu() is required. noconstant specifies that no intercept is included in the linear regression functions for ln y0 and μ. 266 Threshold regression cure specifies that the model to be fit is a threshold regression cure-rate model. See section 7 for details about this model. lgtp(varlist) specifies independent variables that will be used in the linear regression function for lgtp() in the threshold regression cure-rate model. This lgtp() option can be used only when the cure option is used. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level. init(string) specifies initialization values of the regression coefficients in maximum likelihood iterations. The syntax is the same as ml init, which is the command to set initial values in the maximum likelihood estimation routine of Stata. nolog specifies that the iteration log of the log likelihood not be displayed.   maximize options: iterate(#), no log, trace, tolerance(#), ltolerance(#), nrtolerance(#),andnonrtolerance; see [R] maximize. These options are sel- dom used.

3.4 Saved results of stthreg

stthreg saves the following in e():

Scalars
  e(N)            number of observations
  e(df_m)         model degrees of freedom
  e(ll)           log likelihood
  e(chi2)         χ2
  e(p)            significance

Macros
  e(cmd)          stthreg
  e(depvar)       name of dependent variable
  e(chi2type)     Wald or LR; type of model chi-squared test
  e(vce)          oim
  e(opt)          type of optimization
  e(ml_method)    type of ml method
  e(technique)    maximization technique
  e(crittype)     optimization criterion
  e(properties)   b V
  e(predict)      program used to implement predict

Matrices
  e(b)            coefficient vector
  e(V)            variance–covariance matrix of the estimators

Functions
  e(sample)       marks estimation sample

3.5 Melanoma dataset

We use melanoma.dta to illustrate how to use the stthreg package of commands. melanoma.dta is originally from Drzewiecki and Andersen (1982). It contains observations retrieved from case records for 205 melanoma patients. These 205 patients, who were all in clinical stage I, were treated at the Plastic Surgery Unit in Odense from 1964 to 1973. Follow-up was terminated on January 1, 1978. Some patients died during the study period, but others were still alive at the end of the study and thus were considered to be right-censored observations in our analysis. Eight variables are included in this dataset to illustrate the use of the stthreg command: survtime (on-study time in days), status (0 = censored, 1 = death), sex (0 = female, 1 = male), ici (degree of inflammatory cell infiltrate), ecells (epithelioid cell type), ulcerat (ulceration), thick (tumor thickness), and age (in years).

. use melanoma, clear

. describe

Contains data from melanoma.dta
  obs:           205
 vars:             8                          27 Apr 2008 14:21
 size:         6,560

              storage  display    value
variable name   type   format     label      variable label

sex             float  %9.0g      sex
survtime        float  %9.0g                 survival time
status          float  %9.0g      status     status at end of follow-up
ici             float  %9.0g      ici        degree of ici
ecells          float  %9.0g      ecells     epitheloid cells
ulcerat         float  %9.0g      ulcerat    ulceration
thick           float  %9.0g                 tumor thickness (1/100 mm)
age             float  %9.0g                 age in years

Sorted by:

. stset survtime, failure(status)

     failure event:  status != 0 & status < .
obs. time interval:  (0, survtime]
 exit on or before:  failure

        205  total obs.
          0  exclusions

        205  obs. remaining, representing
         57  failures in single record/single failure data
     441324  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =      5565

3.6 Example of stthreg

Before using stthreg, you must stset the data as we have already done in section 3.5. Below we will run the stthreg command on melanoma.dta. Each of the two parameters of the underlying Wiener process, ln y0 and μ, is linked to the covariates thick, age, and sex by the lny0() and mu() options. Note that both ln y0 and μ are linked to the same covariates in this example, while they can be linked to different covariates in other contexts.

Deciding which covariates influence ln y0 and μ requires subject matter knowledge and does not adhere to a simple general rule. Covariates linked to ln y0 are those that influence the initial health state, such as baseline characteristics. Covariates linked to μ often include baseline characteristics but also covariates (such as treatment effects) that affect the drift in health over time. See Lee and Whitmore (2010) for relevant discussion of this issue. The stthreg command to fit this threshold regression model is given below:

. xi: stthreg, lny0(sex ecells thick age i.ulcerat)
>     mu(sex ecells thick age i.ulcerat)
i.ulcerat         _Iulcerat_0-1       (naturally coded; _Iulcerat_0 omitted)
         failure _d:  status
   analysis time _t:  survtime
initial:       log likelihood = -1243.6028
alternative:   log likelihood = -9621.3647
rescale:       log likelihood = -1081.5328
rescale eq:    log likelihood = -573.47172
Iteration 0:   log likelihood = -573.47172
Iteration 1:   log likelihood = -538.3778
Iteration 2:   log likelihood = -531.08833
Iteration 3:   log likelihood = -530.83209
Iteration 4:   log likelihood = -530.83114
Iteration 5:   log likelihood = -530.83114
ml model estimated; type -ml display- to display results
Threshold Regression Estimates                    Number of obs   =        205
                                                  Wald chi2(10)   =      64.94
Log likelihood = -530.83114                       Prob > chi2     =     0.0000

                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
lny0
         sex     .1070979   .1724429     0.62   0.535     -.2308839    .4450797
      ecells     .3300398    .194412     1.70   0.090     -.0510008    .7110804
       thick    -.0737792   .0333807    -2.21   0.027     -.1392043   -.0083542
         age     .0077415   .0042245     1.83   0.067     -.0005384    .0160215
 _Iulcerat_1    -.5143228   .2165787    -2.37   0.018     -.9388093   -.0898363
       _cons     3.803575   .3145534    12.09   0.000      3.187061    4.420088

mu
         sex    -.0120426   .0073686    -1.63   0.102     -.0264849    .0023996
      ecells    -.0224773   .0083094    -2.71   0.007     -.0387633   -.0061912
       thick     .0005888   .0012388     0.48   0.635     -.0018392    .0030167
         age     -.000579   .0002101    -2.76   0.006     -.0009908   -.0001672
 _Iulcerat_1    -.0000191   .0083654    -0.00   0.998      -.016415    .0163768
       _cons     .0462283   .0143098     3.23   0.001      .0181816    .0742751

4 trhr: The postestimation command to calculate hazard ratios

4.1 Introduction to trhr

trhr is a postestimation command to calculate hazard ratios based on the threshold regression model fit by stthreg in advance. That is, suppose that a set of predictor variables {Z1, . . . , Zk−1, G} has been used to predict ln y0 and μ in the threshold regression model by using stthreg, where G is a categorical variable. We can then use trhr to estimate the hazard ratio of level G = g over the reference level G = 0 at given values of other predictors, Z1 = z1, . . . , Zk−1 = zk−1, and a given value of time t = t0. We state as follows how such a hazard ratio is calculated. First, we need to estimate the hazard value for level G = g at Z1 = z1, . . . , Zk−1 = zk−1 and time t = t0. Set

\[ (z^g)' = (1, z_1, \ldots, z_{k-1}, g) \tag{6} \]

Then ln y0 and μ can be estimated for the given $(z^g)'$ in (6) as

\[ \widehat{\ln y_0^g} = (z^g)'\widehat{\gamma} \tag{7} \]

\[ \widehat{\mu}^g = (z^g)'\widehat{\beta} \tag{8} \]

where $\widehat{\gamma}$ and $\widehat{\beta}$ are vectors of regression coefficients already estimated by stthreg (see section 3.1). Furthermore, the estimates of the density, survival, and hazard functions at time t0 are given as

\[ \widehat{f}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right) \tag{9} \]

\[ \widehat{S}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right) = 1 - \widehat{F}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right) \tag{10} \]

\[ \widehat{h}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right) = \frac{\widehat{f}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right)}{\widehat{S}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right)} = \frac{\widehat{f}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right)}{1 - \widehat{F}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right)} \tag{11} \]

where f(·) and F(·) are given in (1) and (2), with σ² set to 1, and y0 and μ replaced by their estimates in level g. If we change the nonreference level g to the reference level 0 in (6), we can obtain $\widehat{f}(t_0 \mid \widehat{\mu}^0, \widehat{y}_0^0)$, $\widehat{S}(t_0 \mid \widehat{\mu}^0, \widehat{y}_0^0)$, and $\widehat{h}(t_0 \mid \widehat{\mu}^0, \widehat{y}_0^0)$ by the same means. The hazard ratio of level G = g over level G = 0 at Z1 = z1, . . . , Zk−1 = zk−1 and time t = t0 is therefore

\[ \text{Hazard Ratio} = \frac{\widehat{h}\left(t_0 \mid \widehat{\mu}^g, \widehat{y}_0^g\right)}{\widehat{h}\left(t_0 \mid \widehat{\mu}^0, \widehat{y}_0^0\right)} \tag{12} \]

Using the formulas above, trhr estimates the hazard ratios for a categorical variable with three options: var(), timevalue(), and scenario(). The var() option specifies the name of the categorical variable for which the hazard ratios are to be calculated. Note that a prerequisite for the execution of trhr is that the categorical variable specified in the var() option needs to have been expanded into a dummy-variable set by the xi command (see [R] xi), which can be placed ahead of the previous stthreg command. The timevalue() option specifies the desired time values at which the hazard ratios are to be calculated. And the scenario() option specifies the values of all predictors except G. A setting of these values is referred to as a scenario. The calculated hazard ratios are with reference to the specified scenario. We do not need to specify a level value for the categorical G in the scenario() option because all nonreference levels g of G are enumerated in calculating hazard ratios relative to the reference level 0. The reference level or the nonreference levels are decided by the xi command, and that is why xi is needed before trhr is run. By default, the lowest level is used as the reference level by xi. Following the notation from (6) to (12), trhr calculates the hazard ratio of level G = g over level G = 0 with the following command: trhr, var(G) scenario(Z1 = z1 ... Zk−1 = zk−1) timevalue(t0). If the threshold regression model has categorical predictors other than G being fit by stthreg and if the user has also expanded some of these categorical predictors to sets of dummy variables by using "i.", then the user needs to provide the values of all the created dummy variables instead of the values of the corresponding original categorical variables in the scenario() option when calculating the hazard ratios for G. That is because trhr will use these dummy variables, instead of the original categorical variables, as predictors in calculations. In the scenario() option, the order of presentation of the predictors does not matter, and the terms in this option are separated by blanks. If G is the only predictor in the model specified by stthreg (G also needs to be expanded by xi in this case), then the scenario() option is not necessary because there is no predictor that needs to be specified for the scenario value (the levels of G will still be enumerated to calculate the hazard ratios for G in this case). Note that if trhr is used after a threshold regression cure-rate model is fit by stthreg, all the calculations by trhr then automatically correspond to the threshold regression cure-rate model. See section 7 for details about the threshold regression cure-rate model.
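As a hedged illustration of (6) through (12) with σ² = 1, the following sketch computes a hazard ratio at a single time point by hand for two purely illustrative parameter pairs; in practice these values would come from the stthreg estimates rather than being set directly.

scalar t0   = 10
scalar mu_g = -0.30        // illustrative drift and y0 for level g
scalar y0_g = exp(1.5)
scalar mu_0 = -0.60        // illustrative drift and y0 for the reference level
scalar y0_0 = exp(2.0)
scalar f_g = y0_g/sqrt(2*_pi*t0^3)*exp(-(y0_g + mu_g*t0)^2/(2*t0))
scalar F_g = normal(-(y0_g + mu_g*t0)/sqrt(t0)) + ///
    exp(-2*y0_g*mu_g)*normal((mu_g*t0 - y0_g)/sqrt(t0))
scalar f_0 = y0_0/sqrt(2*_pi*t0^3)*exp(-(y0_0 + mu_0*t0)^2/(2*t0))
scalar F_0 = normal(-(y0_0 + mu_0*t0)/sqrt(t0)) + ///
    exp(-2*y0_0*mu_0)*normal((mu_0*t0 - y0_0)/sqrt(t0))
display "hazard ratio at t0 = " t0 ": " (f_g/(1 - F_g))/(f_0/(1 - F_0))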

4.2 Syntax of trhr

trhr, var(varname) timevalue(numlist) [scenario(string) graph(hz | sv | ds | hr) graphopt(string) ci bootstrap(#) level(#) keep prefix(string)]

4.3 Options

var(varname) specifies the name of the categorical variable for the calculation of hazard ratios. var() is required.

timevalue(numlist) specifies the desired values of time at which the hazard ratios are to be calculated (for the use of numlist, see [U] 11.1.8 numlist). timevalue() is required.

scenario(string) must be used if any covariates other than the categorical variable specified in var() are also included in the threshold regression model fit by the stthreg command. The scenario() option is to assign values to those covariates by equations, one by one and separated by a blank space. The scenario() option is to set up a scenario at which hazard ratios are to be calculated.

graph(hz | sv | ds | hr) specifies the type of curves to be generated. The graph(hz) option is to plot hazard function curves, the graph(sv) option is to plot survival function curves, the graph(ds) option is to plot density function curves, and the graph(hr) option is to plot a hazard-ratio tendency plot over the specified time points.

graphopt(string) specifies the graph settings (titles, legends, etc.). If this option is used, the default graph settings in the trhr command will be ignored, and the styles (such as titles, legends, and labels) of the output graphs completely depend on the options provided in graphopt(). Because the line command in Stata is used in the trhr command to plot curves, any options suitable for the line command can be used in the graphopt() option to produce desired graphic styles.

ci specifies to output the bootstrap percentile confidence interval of the estimated hazard-ratio value. By default, 2,000 bootstrap replications will be performed and the confidence level is 95%.

bootstrap(#) specifies the number of bootstrap replications. The default is bootstrap(2000).

level(#) sets the confidence level of the bootstrap percentile confidence interval of the hazard-ratio value; the default is level(95).

keep saves the estimated hazard-ratio values of specified time points in the current dataset. The specified time values are saved under a new variable with the name hr_t; the hazard-ratio values are saved under a new variable name that begins with hr and ends with the name of the corresponding dummy variable. If the ci option is used, then the lower and upper limit values of the confidence intervals will also be saved under two new variable names that begin with hrll and hrul, respectively, and end with the name of the corresponding dummy variable.

prefix(string) specifies a string that will be attached to the head of the names of the new variables generated by the keep option.

4.4 Example of trhr

The following command estimates the hazard-ratio value for the two levels of the ulcerat variable at time 4,000 for a subject with these features: sex=0, ecells=0, thick=12, and age=40.

. set seed 1
. trhr, var(ulcerat) timevalue(4000) scenario(sex=0 ecells=0 thick=12 age=40) ci
(running stthreg on estimation sample)
Bootstrap replications (2000)
1 2 3 4 5
...... 50
...... 100
  (output omitted )
...... 1950
...... 2000
Hazard Ratio for Scenario sex=0 ecells=0 thick=12 age=40, at Time = 4000

          var   Hazard Ratio   [95% Percentile C.I.]
  _Iulcerat_1   1.1206067911   .05079897   46.630562

5 trpredict: The postestimation command for predictions

5.1 Introduction to trpredict

After a threshold regression model is fit by stthreg, we can use the command trpredict to predict the initial health status value y0, the drift value of the health process μ, the probability density function of the survival time f(t|μ, y0), the survival function S(t|μ, y0), and the hazard function h(t|μ, y0) for a specified scenario, as we did in calculating the hazard ratios in section 4.1, where hazard function values at a specified time point are predicted and are used to calculate the corresponding hazard-ratio values. To specify the scenario values, use the scenario() option. Note that in the scenario() option, we need to provide the scenario values of all predictors (including dummy-variable predictors) specified in the previous stthreg command. This is a little bit different from the scenario() option in the trhr command, where we do not need to provide the scenario values for the dummy variables expanded from the categorical variable G for which the hazard ratios are calculated. Again the reason is that the program will automatically enumerate all levels of G. Also we can predict those quantities for the scenario values and on-study time values corresponding to each subject in the dataset (this use is similar to the predict command in Stata). Following equations (7) through (11), we can see that when the elements of z in (7) and (8) are set to be the corresponding covariate values of a subject, and when t in (9), (10), and (11) is set to the on-study time of this subject, we can calculate the predicted values ln y0 (and hence y0), μ, f(t|μ, y0), S(t|μ, y0), and h(t|μ, y0) for this subject by using the regression coefficient estimates of γ and β. Then by applying this procedure to each subject in the dataset we obtain those predicted values for each subject. Note that after running the trpredict command, the following six variables will be added to the current dataset: lny0, y0, mu, f, S, and h. These six variables, respectively, correspond to ln y0, y0, μ, f(t|μ, y0), S(t|μ, y0), and h(t|μ, y0). The option prefix(string) is used to add a specified prefix string to the default names of those six variables. Also note that if the trpredict command is run after fitting a threshold regression cure-rate model by using the cure option in stthreg, then two more variables, lgtp and p, in addition to the six variables above, will be added to the current dataset. These two variables correspond to the estimate of the logit of p and the estimate of p, where p is the susceptibility rate. See section 7 for details of the threshold regression cure-rate model.
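For intuition, here is a hedged sketch of how the linear predictions that trpredict exponentiates could be recomputed by hand after the model fit in section 3.6, assuming the equation names lny0 and mu shown in that output:

* Hedged sketch: rebuild the linear indexes from the saved coefficients of the
* section 3.6 model; trpredict performs the analogous computation internally.
generate double xb_lny0 = _b[lny0:_cons] + _b[lny0:sex]*sex + _b[lny0:ecells]*ecells ///
    + _b[lny0:thick]*thick + _b[lny0:age]*age + _b[lny0:_Iulcerat_1]*_Iulcerat_1
generate double xb_mu = _b[mu:_cons] + _b[mu:sex]*sex + _b[mu:ecells]*ecells ///
    + _b[mu:thick]*thick + _b[mu:age]*age + _b[mu:_Iulcerat_1]*_Iulcerat_1
generate double y0_hat = exp(xb_lny0)     // estimated initial health status y0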

5.2 Syntax of trpredict

trpredict [, prefix(string) scenario(string) replace]

5.3 Options

prefix(string) specifies a prefix string that will be attached to the names of the six new variables generated by the trpredict command. For example, if prefix(abc_) is used, then the names of the six variables will be abc_lny0, abc_y0, abc_mu, abc_f, abc_S, and abc_h.

scenario(string) specifies a scenario for which the predictions are calculated by using the standard threshold regression model or the threshold regression cure-rate model fit previously. Note that specifying a scenario means specifying the values of the predictors in the standard threshold regression model or the threshold regression cure-rate model fit previously.

replace is used to force replacement of the variables (in the current dataset) with the same names as the six (or eight) new prediction variables without warning.

5.4 Example of trpredict

Below we first use two trpredict commands to predict the six quantities {ln y0, y0, μ, f(t|μ, y0), S(t|μ, y0), and h(t|μ, y0)} for two different scenarios (scenario 1: _Iulcerat_1=0, ecells=0, thick=12, age=40, and sex=0; scenario 2: _Iulcerat_1=1, ecells=0, thick=12, age=30, and sex=1). Then we use the line commands to overlay the predicted survival curves to compare these two scenarios (see figure 7).

. trpredict, scenario(_Iulcerat_1=0 ecells=0 thick=12 age=40 sex=0)
>     prefix(Scenario1_) replace
. trpredict, scenario(_Iulcerat_1=1 ecells=0 thick=12 age=30 sex=1)
>     prefix(Scenario2_) replace
. twoway line Scenario1_S _t || line Scenario2_S _t,
>     title("Scenario Comparison") ytitle("Predicted Survival Function")
>     xtitle("Analysis Time (Days)")
>     text(.85 3500 "_Iulcerat_1=0, ecells=0, thick=12, age=40, sex=0")
>     text(.6 3500 "_Iulcerat_1=1, ecells=0, thick=12, age=30, sex=1")


Figure 7. Predicted survival functions for two different scenarios by threshold regression

6 sttrkm: The model diagnostic command

6.1 Introduction to sttrkm

For a categorical independent variable, the Kaplan–Meier nonparametric survival curve can be plotted for each categorical level of this variable. If we include such a categorical independent variable as the only predictor in the threshold regression, survival curves for each level can also be predicted parametrically by the threshold regression model (see section 5.1). The command sttrkm overlays these two types of curves and provides a graphic goodness-of-fit diagnosis of the threshold regression model. The closer the Kaplan–Meier nonparametric survival curves are to the predicted curves, the better the threshold regression model fits the data with this variable. A counterpart command for the Cox model is stcoxkm. Note that unlike trhr and trpredict, the sttrkm command is not a postestimation command, and hence the estimation command stthreg is not required before the sttrkm command. However, you must stset your data before using sttrkm.

6.2 Syntax of sttrkm

sttrkm [if] [, lny0(varname) mu(varname) cure lgtp(varname) noshow separate
    obsopts(sttrkm plot options) obs#opts(sttrkm plot options)
    predopts(sttrkm plot options) pred#opts(sttrkm plot options)
    addplot(plot) twoway options byopts(byopts)]

6.3 Options

lny0(varname) specifies the categorical predictor that will be used in the linear regression function for ln y0 in the threshold regression model. Note that either lny0() or mu() (see below) can be omitted if you do not want to use this categorical predictor for either ln y0 or μ in the threshold regression model. However, if both lny0() and mu() are used, the categorical predictor specified in these two options must be the same.

mu(varname) specifies the categorical predictor that will be used in the linear regression function for μ in the threshold regression model.

cure is used to diagnose the goodness of fit of a threshold regression cure-rate model. See section 7 for details of this model.

lgtp(varname) specifies the categorical predictor that will be used in the linear regression function for lgtp() in the threshold regression cure-rate model. This lgtp() option can be used only when the cure option is used. Note that when cure is used, lny0(), mu(), or lgtp() can be omitted if you do not want to use this categorical predictor for ln y0, μ, or lgtp() in the threshold regression cure-rate model. However, if any of these three options is used, the categorical predictor specified in these options must be the same.

noshow specifies to not show st setting information.

separate specifies to draw separate plots for predicted and observed curves.

obsopts(sttrkm plot options) affects rendition of the observed curve.

obs#opts(sttrkm plot options) affects rendition of the #th observed curve; not allowed with separate.

predopts(sttrkm plot options) affects rendition of the predicted curve.

pred#opts(sttrkm plot options) affects rendition of the #th predicted curve; not allowed with separate.

addplot(plot) specifies other plots to add to the generated graph.

twoway options are any options documented in [G-3] twoway options.

byopts(byopts) specifies how subgraphs are combined, labeled, etc.

6.4 Example of sttrkm

Below we use the sttrkm command to overlay the threshold regression predicted survival curves with the Kaplan–Meier observed survival curves for each level of the ici variable. We use the separate option of the sttrkm command to separate the plots for different levels of ici. The generated graph is shown in figure 8.

. sttrkm, lny0(ici) mu(ici) noshow separate


Figure 8. Threshold regression predicted plot versus Kaplan–Meier plot by the ici variable for the melanoma data

7 Threshold regression cure-rate model

7.1 Introduction to threshold regression cure-rate model

As mentioned in section 1, the Wiener process has probability P(FHT = ∞) = 1 − exp(−2y0μ/σ²) that it will never hit the boundary at zero if it is drifting away from the boundary, that is, if μ > 0. We now denote this probability by 1 − p0. When hitting the boundary represents a medical failure, p0 is sometimes called the susceptibility rate and 1 − p0 the nonsusceptibility rate or cure rate. The former is the proportion of the population that would eventually experience the medical failure if given enough time, while the latter is the proportion that is cured and will never experience the medical failure.
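As a quick numerical illustration with made-up parameter values (not taken from any of the datasets used in this article): for y0 = 2, μ = 0.5, and σ² = 1, the susceptibility rate is p0 = exp(−2 × 2 × 0.5) and the cure rate is its complement.

. display exp(-2*2*0.5)        // p0 is about 0.135
. display 1 - exp(-2*2*0.5)    // cure rate 1 - p0 is about 0.865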

The preceding susceptibility rate p0 is determined by the parameter values of the Wiener model when μ > 0. Research populations, however, often have proportions of susceptible and nonsusceptible individuals that do not correspond to p0 and 1 − p0.

In this common situation, it is better to let the susceptibility rate be a free parameter that is independently linked to covariates in the threshold regression model. We will use p to denote this mathematically independent susceptibility rate and use 1 − p to denote the corresponding cure rate. We therefore enrich the threshold regression model with an additional parameter p. Specifically, we replace the probability density function f(t|μ, σ², y0) in (1) by pf(t|μ, σ², y0) and the cumulative distribution function F(t|μ, σ², y0) in (2) by pF(t|μ, σ², y0). Note that the probability density function integrates to p in this case and therefore is improper. We link the log-odds or logit of p to a linear combination of covariates, as follows:

\[
\operatorname{logit}(p) = \log\!\left(\frac{p}{1-p}\right) = \lambda_0 + \lambda_1 Z_1 + \cdots + \lambda_k Z_k = \mathbf{Z}\boldsymbol{\lambda}
\]

We continue to set the variance parameter σ² of the Wiener process to 1. Hence the log-likelihood function for this enriched threshold regression model is the following extension of (5):

\[
\ln L(\boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\lambda}) = \sum_{i=1}^{n_1} \ln\left\{ p^{(i)} f\!\left(t^{(i)} \mid \mu^{(i)}, y_0^{(i)}\right)\right\} + \sum_{j=n_1+1}^{n} \ln\left\{ 1 - p^{(j)} F\!\left(t^{(j)} \mid \mu^{(j)}, y_0^{(j)}\right)\right\}
\]

We call this enriched threshold regression model the threshold regression cure-rate model. The stthreg package can also be used for the threshold regression cure-rate model, and the commands are summarized as follows:

• stthreg: The cure option is used to fit a threshold regression cure-rate model; when the cure option is used, the lgtp() option is enabled and can be used to specify independent variables that will be used in the linear regression function for lgtp in the threshold regression cure-rate model.

• trhr: When used after stthreg that fits a threshold regression cure-rate model, all the results produced by trhr will correspond to the threshold regression cure-rate model.

• trpredict: When used after stthreg that fits a threshold regression cure-rate model, all the results produced by trpredict will correspond to the threshold regression cure-rate model.

• sttrkm: The cure option is used to diagnose the goodness of fit of a threshold regression cure-rate model; when the cure option is used, the lgtp() option is enabled and can be used to specify the categorical predictor that will be used in the linear regression function for lgtp in the threshold regression cure-rate model.

7.2 Kidney dataset

In the next section, we use kidney.dta to illustrate how to use the four commands for the threshold regression cure-rate model. The kidney dialysis dataset (kidney.dta) is taken from Nahman et al. (1992) and analyzed by Klein and Moeschberger (2003). The dataset considers the time to first exit-site infection (in months) in patients with renal insufficiency. Two groups of patients are compared: patients who used a surgically placed catheter and patients who used a percutaneously placed catheter. There are three variables in the dataset—time, infection, and group. The variable time records the on-study time; infection indicates whether the time is an event time (infection = 1) or a right-censored time (infection = 0) for each observation; and group indicates which group the observation is in (1 = surgical group, 2 = percutaneous group).

. use kidney, clear
. describe
Contains data from kidney.dta
  obs:           119
 vars:             3                          27 Jan 2008 03:49
 size:           714

               storage   display    value
variable name    type     format     label      variable label

time            float    %9.0g                  Time to infection (months)
infection       byte     %8.0g                  Infection indicator (0=no, 1=yes)
group           byte     %8.0g      group       Catheter placement (1=surgically,
                                                  2=percutaneously)

Sorted by:
. stset time, failure(infection)
     failure event:  infection != 0 & infection < .
obs. time interval:  (0, time]
 exit on or before:  failure

      119  total obs.
        0  exclusions

      119  obs. remaining, representing
       26  failures in single record/single failure data
   1092.5  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =      28.5

7.3 Examples for the threshold regression cure-rate model

The following four commands generate figures 9 to 12 for kidney.dta.

. sts, by(group) noshow
. stcoxkm, by(group) noshow separate title("Cox Predicted vs. Observed")
. sttrkm, lny0(group) mu(group) noshow separate
> title("TR Predicted vs. Observed")
. sttrkm, lny0(group) mu(group) lgtp(group) cure noshow separate
> title("TRC Predicted vs. Observed")


Figure 9. Kaplan–Meier plot by the group variable for the kidney data

Figure 10. Cox predicted plot versus Kaplan–Meier plot by the group variable for the kidney data


Figure 11. Threshold regression predicted plot versus Kaplan–Meier plot by the group variable for the kidney data

Figure 12. Threshold regression cure-rate model predicted plot versus Kaplan–Meier plot by the group variable for the kidney data

Figure 9 shows the Kaplan–Meier survival curves for the two groups. The two survival curves can be seen to cross each other. Before a time point around month 8, the estimated survival probability (infection-free probability) of the surgical group is higher than that of the percutaneous group; after that time point, however, the estimated survival probability of the surgical group falls away relative to that of the percutaneous group. The crossing survival curves suggest that the proportional-hazards assumption does not hold. The violation of the PH assumption can also be diagnosed from a graph generated by the stphplot command (graph not shown).

In figure 10, we use the stcoxkm command to overlay the Cox predicted survival curves and the Kaplan–Meier survival curves. As expected, the curves do not match well. In figure 11, we use the sttrkm command to overlay the threshold regression predicted survival curves and the Kaplan–Meier survival curves. Clearly, compared with the Cox predicted curves (figure 10), the threshold regression predicted curves (figure 11) match the Kaplan–Meier curves better. This application of sttrkm produces a positive estimate for μ of 0.542 for the percutaneous group (the regression output is not shown). The susceptibility rate p0 for this group is estimated to be 0.219, and hence the estimated cure rate is 1 − 0.219 = 0.781. The threshold regression predicted curve is slightly miscalibrated for the percutaneous group in this application of standard threshold regression.

To improve the fit, we switch to the threshold regression cure-rate model. We use the cure and lgtp(group) options inside the sttrkm command to overlay the threshold regression cure-rate model predicted curves and the Kaplan–Meier curves in figure 12. There is a noticeable improvement in fit for the percutaneous group when compared with figure 11. The output for this last threshold regression analysis appears momentarily. The susceptibility rate p, when estimated as a free parameter, is 0.244 for the percutaneous group. Twice the difference of the maximum log-likelihood values for the standard and cure-rate models gives 2[−114.4747 − (−116.4909)] = 4.032. Comparison of this test statistic with a χ²₁ distribution gives a p-value of 0.045, which suggests that p0 and p differ at the 0.05 significance level.

Next we illustrate how to use the stthreg and trhr commands for the threshold regression cure-rate model. In stthreg, we specify the cure option to fit a threshold regression cure-rate model. In addition to the lny0() and mu() options, we also use the lgtp() option to incorporate the group independent variable in the logit link function for the lgtp() parameter in this threshold regression cure-rate model. The trhr command is used after the stthreg command to calculate the hazard-ratio values at month 2 and month 20, based on the threshold regression cure-rate model fit by stthreg.
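Before turning to that output, note that the likelihood-ratio comparison above is easy to reproduce interactively; the two lines below simply plug in the maximized log likelihoods quoted in the text and use Stata's chi2tail() function.

. display 2*(-114.4747 - (-116.4909))                 // test statistic, about 4.032
. display chi2tail(1, 2*(-114.4747 - (-116.4909)))    // p-value, about 0.045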

. xi: stthreg, lny0(i.group) mu(i.group) lgtp(i.group) cure
i.group           _Igroup_1-2         (naturally coded; _Igroup_1 omitted)
         failure _d:  infection
   analysis time _t:  time
initial:       log likelihood = -141.26094
alternative:   log likelihood =   -154.434
rescale:       log likelihood = -137.01023
rescale eq:    log likelihood = -130.91466
Iteration 0:   log likelihood = -130.91466  (not concave)
Iteration 1:   log likelihood = -124.57115
Iteration 2:   log likelihood = -116.61429

Iteration 3:   log likelihood = -114.87662
Iteration 4:   log likelihood =  -114.6138
Iteration 5:   log likelihood = -114.51506
Iteration 6:   log likelihood = -114.48821
Iteration 7:   log likelihood = -114.47805
Iteration 8:   log likelihood = -114.47546
Iteration 9:   log likelihood = -114.47485
Iteration 10:  log likelihood = -114.47474
Iteration 11:  log likelihood = -114.47471
Iteration 12:  log likelihood =  -114.4747
ml model estimated; type -ml display- to display results
Threshold Regression Cure Rate Model Estimates
                                                  Number of obs   =        119
                                                  Wald chi2(3)    =      30.34
Log likelihood = -114.4747                        Prob > chi2     =     0.0000

Coef. Std. Err. z P>|z| [95% Conf. Interval]

lny0
   _Igroup_2   -1.297609   .2357168    -5.50   0.000    -1.759606    -.835613
       _cons    1.411293   .1434588     9.84   0.000     1.130119    1.692467

mu
   _Igroup_2    .0958833   .3969795     0.24   0.809    -.6821821    .8739488
       _cons    -.095869   .0764876    -1.25   0.210    -.2457819     .054044

lgtp
   _Igroup_2   -14.90171   958.0936    -0.02   0.988    -1892.731    1862.927
       _cons    13.76995   958.0933     0.01   0.989    -1864.058    1891.598
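As an aside, the group-specific susceptibility rates discussed in the text can be recovered from the lgtp coefficients above with Stata's invlogit() function; the quick check below uses the point estimates exactly as displayed.

. display invlogit(13.76995)                // surgical group: essentially 1
. display invlogit(13.76995 - 14.90171)     // percutaneous group: about 0.244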

. trhr, var(group) timevalue(2 20)
Hazard Ratio for Scenario , at Time = 2

var Hazard Ratio

_Igroup_2 2.4673714479

Hazard Ratio for Scenario , at Time = 20

var Hazard Ratio

_Igroup_2 .04444226172

Finally, we use two trpredict commands to predict the eight quantities [ln y0, y0, μ, lgtp, p, f(t|μ, y0), S(t|μ, y0), and h(t|μ, y0)] by this threshold regression cure-rate model for the two scenarios: group=surgical and group=percutaneous. For example, the estimates for y0 for the surgical group and the percutaneous group are 4.101 and 1.120, respectively. The estimates indicate that patients in the surgical group have a much higher initial health status than those in the percutaneous group. The estimates for the susceptibility rate p for the surgical group and the percutaneous group in the research population are 1 and 0.244, respectively. Hence the estimates for the cure rate 1 − p for these two groups in the research population are 0 and 0.756, respectively.

. trpredict, scenario(_Igroup_2=0) prefix(Surgical_) replace
. trpredict, scenario(_Igroup_2=1) prefix(Percutaneous_) replace

8 Conclusion

In this article, we introduced a package to implement threshold regression in Stata. Threshold regression is a newly developed methodology in the area of survival-data analysis. Applications of threshold regression can be carried out easily with the help of our package.

9 Acknowledgment

This project is supported in part by CDC/NIOSH Grant RO1 OH008649.

10 References

Cox, D. R. 1972. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187–220.

Cox, D. R., and H. D. Miller. 1965. The Theory of Stochastic Processes. Boca Raton, FL: Chapman & Hall/CRC.

Drzewiecki, K. T., and P. K. Andersen. 1982. Survival with malignant melanoma: A regression analysis of prognostic factors. Cancer 49: 2414–2419.

Garrett, J. M. 1997. sbe14: Odds ratios and confidence intervals for models with effect modification. Stata Technical Bulletin 36: 15–22. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 104–114. College Station, TX: Stata Press.

Gould, W., J. Pitblado, and B. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College Station, TX: Stata Press.

Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. New York: Springer.

Lee, M.-L. T., and G. A. Whitmore. 2006. Threshold regression for survival analysis: Modeling event times by a stochastic process reaching a boundary. Statistical Science 21: 501–513.

———. 2010. Proportional hazards and threshold regression: Their theoretical and practical connections. Lifetime Data Analysis 16: 196–214.

Nahman, N. S., Jr., D. F. Middendorf, W. H. Bay, R. McElligott, S. Powell, and J. Anderson. 1992. Modification of the percutaneous approach to peritoneal dialysis catheter placement under peritoneoscopic visualization: Clinical results in 78 patients. Journal of the American Society of Nephrology 3: 103–107.

About the authors

Tao Xiao is a PhD candidate in the College of Public Health at The Ohio State University. His current research interests include the first-hitting-time models for survival analysis, the Bayesian methodology related to such models, and statistical computing.

G. A. Whitmore is an emeritus professor at McGill University in Montreal and an associate investigator at the Ottawa Hospital Research Institute. His current research interests include statistical modeling and analysis of event histories, survival data, and risk processes, with an emphasis on applications in health care and medicine. Recent publications include studies of birth weight and chronic obstructive pulmonary disease.

Xin He is an assistant professor in the Department of Epidemiology and Biostatistics at the University of Maryland in College Park. His current research interests include longitudinal data analysis, survival analysis, nonparametric and semiparametric methods, as well as applications in clinical trials, epidemiology, and other studies related to public health.

Dr. Mei-Ling Ting Lee is a professor in and chair of the Department of Epidemiology and Biostatistics and is director of the Biostatistics and Risk Assessment Center at the University of Maryland. She has published in a broad spectrum of research areas, including statistical methods for genomic studies; threshold regression models for risk assessments, with applications in cancer, environmental research, and occupational exposure; and rank-based nonparametric tests for clustered data. Dr. Lee has published a book entitled Analysis of Microarray Gene Expression Data and has coedited two other books. She is the founding editor and editor-in-chief of the international journal Lifetime Data Analysis, which specializes in modeling time-to-event data. The journal is currently publishing its eighteenth volume.

The Stata Journal (2012) 12, Number 2, pp. 284–298

Fitting nonparametric mixed logit models via expectation-maximization algorithm

Daniele Pacifico
Italian Department of the Treasury
Rome, Italy
daniele.pacifi[email protected]

Abstract. In this article, I provide an illustrative, step-by-step implementation of the expectation-maximization algorithm for the nonparametric estimation of mixed logit models. In particular, the proposed routine allows users to fit straightforwardly latent-class logit models with an increasing number of mass points so as to approximate the unobserved structure of the mixing distribution.

Keywords: st0258, latent classes, expectation-maximization algorithm, nonparametric mixed logit

1 Introduction

In this article, I develop a nonparametric method suggested by Train (2008) for the estimation of mixing distributions in conditional logit models, and I show how to implement it in practice. Specifically, the proposed code allows users to estimate a discrete mixing distribution with points and shares as parameters. This constitutes a typical latent-class logit model, where the underlying mixing distribution can be fit nonparametrically by setting a sufficiently large number of mass points.

Clearly, latent-class models can be fit through standard optimization methods, such as Newton–Raphson or Berndt–Hall–Hall–Hausman. However, this becomes difficult when the number of classes increases, because the higher the number of parameters the more difficult the inversion of the empirical Hessian matrix, with the possibility of singularity at some iterations. In such situations, the proposed expectation-maximization (EM) recursion could help, because it implies the repeated evaluation of a function that is easier to maximize. Moreover, EM algorithms have proved to be particularly stable, and—under conditions given by Boyles (1983) and Wu (1983)—they always climb uphill until convergence to a local maximum.

The article is structured as follows: section 2 presents the mixed logit model with both continuous and discrete mixing distributions; section 3 shows how a mixed logit model with discrete mixing distributions can be fit via EM algorithm; section 4 presents an illustrative, step-by-step implementation of the EM recursion; section 5 contains an empirical application based on accessible data; and section 6 concludes.

2 The mixed logit model

Assume there are N agents who face J alternatives on T choice occasions. Agents choose the alternative that maximizes their utility in each choice occasion. The random utility of agent i from choosing alternative j in period t is defined as follows:

\[
U_{ijt} = \boldsymbol{\beta}_i' \mathbf{x}_{ijt} + \varepsilon_{ijt}
\]

where x_ijt is a vector of alternative-specific attributes and ε_ijt is a stochastic term that is assumed to be an independent and identically distributed extreme value. Importantly, each β_i is assumed to be random with the unconditional density f(β | ϑ), where ϑ collects the parameters that characterize the distribution.

Conditional on knowing β_i, the probability of the observed sequence of choices for agent i is given by the traditional McFadden's (1974) choice model

\[
\Pr{}_i(\boldsymbol{\beta}_i) = \prod_{t=1}^{T} \prod_{j=1}^{J} \left\{ \frac{\exp(\boldsymbol{\beta}_i' \mathbf{x}_{ijt})}{\sum_{k=1}^{J} \exp(\boldsymbol{\beta}_i' \mathbf{x}_{ikt})} \right\}^{d_{ijt}}
\]

where d_ijt is a dummy that selects the chosen alternative in each choice occasion. However, because β_i is unknown, the conditional probability of the sequence of observed choices has to be evaluated for any possible value of β_i. Hence, assuming that f(β | ϑ) has a continuous distribution, the unconditional probability becomes

\[
\Pr{}_i(\vartheta) = \int \prod_{t=1}^{T} \prod_{j=1}^{J} \left\{ \frac{\exp(\boldsymbol{\beta}' \mathbf{x}_{ijt})}{\sum_{k=1}^{J} \exp(\boldsymbol{\beta}' \mathbf{x}_{ikt})} \right\}^{d_{ijt}} f(\boldsymbol{\beta} \mid \vartheta)\, d\boldsymbol{\beta} \tag{1}
\]

This probability defines a mixed logit model with continuous mixing distributions, and typically, the related log-likelihood function is fit with simulation methods.¹ If the distribution of each β_i is discrete, the probability in (1) becomes

\[
\Pr{}_i(\vartheta) = \sum_{c=1}^{C} \pi_c \prod_{t=1}^{T} \prod_{j=1}^{J} \left\{ \frac{\exp(\boldsymbol{\beta}_c' \mathbf{x}_{ijt})}{\sum_{k=1}^{J} \exp(\boldsymbol{\beta}_c' \mathbf{x}_{ikt})} \right\}^{d_{ijt}} \tag{2}
\]

where π_c = f(β_c | ϑ) represents the share of the population with coefficients β_c. Equation (2) is a typical latent-class logit model. Nevertheless, here we follow McFadden and Train (2000) and define it as a mixed logit model with discrete mixing distributions, to emphasize the similarities with the continuous-mixture logit model of (1).

The estimation of latent-class models is usually based on standard gradient-based methods. However, these methods often fail to achieve convergence when the number of latent classes becomes high. In this case, an EM algorithm could help, because it requires the repeated maximization of a function that is far simpler than the log likelihood derived from (2).

1. See Train (2003). In Stata, continuous mixed logit models can be fit with simulated maximum likelihood through the command mixlogit, written by Hole (2007).

3 An EM algorithm for the estimation of mixed logit models with discrete mixing distributions

EM algorithms were initially proposed in the literature to deal with missing-data problems. Nevertheless, they turned out to be an efficient method to fit latent-class models, where the missing information consists of the class share probabilities. Nowadays, they are widely used in many economic fields where the assumption that people can be grouped in classes with different unobserved preference heterogeneity is reasonable.

The recursion is known as "E-M" because it consists of two steps, namely, an "expectation" and a "maximization". As explained in Train (2008), the term to be maximized is the expectation of the missing-data log likelihood—that is, the joint density of the observed choice and the missing information—whilst the expectation is over the distribution of the missing information, conditional on the density of the data and the previous parameter estimates.

Consider the conditional logit model with discrete mixing distributions that we outlined in the previous section. Following (2), the log likelihood is defined as

\[
LL = \sum_{i=1}^{N} \ln \sum_{c=1}^{C} \pi_c \prod_{t=1}^{T} \prod_{j=1}^{J} \left\{ \frac{\exp(\boldsymbol{\beta}_c' \mathbf{x}_{ijt})}{\sum_{k=1}^{J} \exp(\boldsymbol{\beta}_c' \mathbf{x}_{ikt})} \right\}^{d_{ijt}}
\]

which can be maximized by means of standard, gradient-based optimization methods. However, the same log likelihood can be also maximized by repeatedly updating the following recursion

\[
\eta^{s+1} = \operatorname*{argmax}_{\eta} \sum_i \sum_c C_i(\eta^s) \ln\left[ \pi_c \prod_t \prod_j \left\{ \frac{\exp(\boldsymbol{\beta}_c' \mathbf{x}_{ijt})}{\sum_{k=1}^{J} \exp(\boldsymbol{\beta}_c' \mathbf{x}_{ikt})} \right\}^{d_{ijt}} \right] = \operatorname*{argmax}_{\eta} \sum_i \sum_c C_i(\eta^s) \ln(L_i \mid \mathrm{class}_i = c) \tag{3}
\]

where η is a vector that contains the whole set of parameters to be estimated—that is, those that enter the probability of the observed choice plus those that may define the class shares—L_i is the missing-data likelihood function, and C_i(η^s) is the conditional probability that household i belongs to class c, which depends on the density of the data and the previous value of the parameters.

This conditional probability—C_i(η^s)—is the key feature of the EM recursion and can be computed by means of the Bayes' theorem:

\[
C_i(\eta^s) = \frac{L_i \mid \mathrm{class}_i = c}{\sum_{c=1}^{C} L_i \mid \mathrm{class}_i = c} \tag{4}
\]
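As a small worked illustration of (4) with made-up numbers: suppose C = 2 with current shares π₁ = π₂ = 0.5, and suppose that for a given agent the product of choice probabilities over the T choice occasions equals 0.02 under β₁ and 0.01 under β₂. By (5), L_i | class_i = c multiplies each product by the corresponding share, so

\[
C_i(\eta^s) = \frac{0.5 \times 0.02}{0.5 \times 0.02 + 0.5 \times 0.01} = \frac{2}{3}
\]

for class 1 and 1/3 for class 2.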

Now, given that

\[
\ln(L_i \mid \mathrm{class}_i = c) = \ln \pi_c + \ln \prod_{t=1}^{T} \prod_{j=1}^{J} \left\{ \frac{\exp(\boldsymbol{\beta}_c' \mathbf{x}_{ijt})}{\sum_{k=1}^{J} \exp(\boldsymbol{\beta}_c' \mathbf{x}_{ikt})} \right\}^{d_{ijt}} \tag{5}
\]

the recursion in (3) can be split into the following steps:

1. Form the contribution to the likelihood (L_i | class_i = c) as defined in (3) for each class;²

2. Form the individual-specific conditional probabilities of class membership as in (4);

3. Following (5), update the sets of β_c, c = 1, 2, ..., C, by maximizing—for each class—a conditional logit model with weighted observations, with weights given by the conditional probabilities of class membership:

\[
\boldsymbol{\beta}_c^{s+1} = \operatorname*{argmax}_{\boldsymbol{\beta}} \sum_{i=1}^{N} C_i(\eta^s) \ln \prod_{t=1}^{T} \prod_{j=1}^{J} \left\{ \frac{\exp(\boldsymbol{\beta}_c' \mathbf{x}_{ijt})}{\sum_{k=1}^{J} \exp(\boldsymbol{\beta}_c' \mathbf{x}_{ikt})} \right\}^{d_{ijt}}
\]

4. Maximize the other part of the log likelihood in (3), and get the updated vector of class shares:

\[
\boldsymbol{\pi}^{s+1} = \operatorname*{argmax}_{\boldsymbol{\pi}} \sum_{i=1}^{N} \sum_{c=1}^{C} C_i(\eta^s) \ln \pi_c
\]

• If the class share probabilities depend on a vector of demographics—z_i—the corresponding parameters are updated as

\[
\boldsymbol{\alpha}^{s+1} = \operatorname*{argmax}_{\boldsymbol{\alpha}} \sum_{i=1}^{N} \sum_{c=1}^{C} C_i(\eta^s) \ln \frac{\exp(\boldsymbol{\alpha}_c' \mathbf{z}_i)}{\sum_c \exp(\boldsymbol{\alpha}_c' \mathbf{z}_i)}, \qquad \boldsymbol{\alpha}_C = 0 \tag{6}
\]

This is a grouped-data log likelihood, where we have used a logit specification to constrain the estimated class share probabilities into the right range.³ The updated class share probabilities—π_c, c = 1, 2, ..., C—are then computed as

\[
\pi_c^{s+1} = \frac{\exp(\boldsymbol{\alpha}_c^{s+1\prime} \mathbf{z}_i)}{\sum_c \exp(\boldsymbol{\alpha}_c^{s+1\prime} \mathbf{z}_i)}, \qquad c = 1, 2, \ldots, C
\]

which in turn allows updating the conditional probabilities of class membership by means of the new vectors β_c^{s+1} and π_c^{s+1}, c = 1, 2, ..., C.

• If the class share probabilities do not depend on demographics, the empirical maximization of the function in (6) can be avoided, because its analytical solution would be given by

\[
\pi_c^{s+1} = \frac{\sum_i C_i(\eta^s)}{\sum_i \sum_c C_i(\eta^s)}, \qquad c = 1, \ldots, C \tag{7}
\]

2. For the first iteration, starting values have to be used for the densities that enter the model. Note that these starting values must be different in every class; otherwise, the recursion estimates the same set of parameters for all the classes.

3. Differently from the β_c's, the vectors α_c, c = 1, 2, ..., C, are jointly estimated. This is necessary to ensure that Σ_c π_c = 1.

where the updated conditional probabilities—C_i(η^{s+1})—are computed by means of the updated values of β_c, c = 1, 2, ..., C, and the previous values of the class shares.

5. Once the conditional probabilities of class membership have been updated—either in models with or without covariates on z_i—the recursion can start again from step 3 until convergence.

4 A Stata routine for the EM estimation of mixed logit models with discrete mixing distributions

In this section, I show how the EM algorithm outlined above can be coded into Stata. I propose a detailed step-by-step procedure that can be easily implemented in a simple do-file and that works with accessible data from Huber and Train (2001) on household's choice of electricity supplier. Note that this is the same dataset used by Hole (2007) for an application of his Stata command mixlogit, which performs the estimation of parametric mixed logit models via simulated maximum likelihood. The data contain information on 100 residential electricity customers who were asked up to 12 choice experiments.⁴ In each experiment, the customer was asked which of the four suppliers he or she would prefer among four hypothetical electricity suppliers, and the following characteristics of each offer were stated:

• The price of the contract (in cents per kWh) whenever the supplier offers a contract with a fixed rate (price)

• The length of contract that the supplier offered, expressed in years (contract)

• Whether the supplier is a local company (local)

• Whether the supplier is a well-known company (wknown)

• Whether the supplier offers a time-of-day rate instead of a fixed rate (tod)

• Whether the supplier offers a seasonal rate instead of a fixed rate (seasonal)

Each customer is identified by the variable pid. For each customer, the variable gid identifies a given choice occasion, while the dummy variable y identifies the stated choice in each choice occasion. Because no individual-level variables are included in this database, I also simulate a customer-specific variable—_x1—to show how to fit a model with demographics in the probability of class membership. Hence the data setup required for fitting the model is as follows:

4. Because some customers stopped before answering all 12 experiments, there is a total of 1,195 choice occasions in the sample.

. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta
. set seed 1234567890
. by pid, sort: egen _x1=sum(round(rnormal(0.5),1))
. list in 1/12, sepby(gid)

        y   price   contract   local   wknown   tod   seasonal   gid   pid   _x1

  1.    0       7          5       0        1     0          0     1     1    16
  2.    0       9          1       1        0     0          0     1     1    16
  3.    0       0          0       0        0     0          1     1     1    16
  4.    1       0          5       0        1     1          0     1     1    16

  5.    0       7          0       0        1     0          0     2     1    16
  6.    0       9          5       0        1     0          0     2     1    16
  7.    1       0          1       1        0     1          0     2     1    16
  8.    0       0          5       0        0     0          1     2     1    16

  9.    0       9          5       0        0     0          0     3     1    16
 10.    0       7          1       0        1     0          0     3     1    16
 11.    0       0          0       0        1     1          0     3     1    16
 12.    1       0          0       1        0     0          1     3     1    16

Section 4.1 presents the steps for fitting a model with covariates on the class share probabilities. Alternatively, the optimization of the algorithm when only a constant term is included among the class probabilities is presented in section 4.2.

4.1 A model with covariates on the class share probabilities

To present a flexible routine, we will work with global variables so that the code can be easily adapted to other databases. The dependent variable is called $depvar; the list of covariates that enter the probability of the observed choice is called $varlist; the list of variables that enter the grouped-data log likelihood is called $varlist2; the variable that identifies the panel dimension—that is, the choice makers—is called $id; and the variable that defines the choice situations for each choice maker is called $group. We also define the number of latent classes, $nclasses, and the number of maximum iterations, $niter.

. **(1) Set the estimation framework**
. global depvar "y"
. global varlist "price contract local wknown tod seasonal"
. global varlist2 "_x1"
. global id "pid"
. global group "gid"
. global nclasses "2"
. global niter "50"

Starting values are computed by randomly splitting the sample into C different subsamples—one for each class—and fitting a separate clogit for each of them.⁵ After each clogit estimation, we use the command predict to obtain the probability of every alternative in each class, and we store them in the variables _l1, _l2, ..., _lC. As for the starting values for the probability of class membership, we simply define equal shares, that is, 1/C:

. **(2) Split the sample**
. by $id, sort: generate double _p=runiform() if _n==_N
. by $id, sort: egen double _pr=sum(_p)
. local prop 1/$nclasses
. generate double _ss=1 if _pr<=`prop´
. forvalues s=2/$nclasses {
  2.   replace _ss=`s´ if _pr>(`s´-1)*`prop´ & _pr<=`s´*`prop´
  3. }
. **(3) Get starting values for both the beta coefficients and the class shares**
. forvalues s=1/$nclasses {
  2.   generate double _prob`s´=1/$nclasses
  3.   clogit $depvar $varlist if _ss==`s´, group($group) technique(nr dfp)
  4.   predict double _l`s´
  5. }

In what follows, the steps to calculate the conditional probabilities of (4) from these starting values are presented. First, for each latent class, we multiply the variables _l1, _l2, ..., _lC by the dummy variable that identifies the observed choice in each choice situation, that is, $depvar. Note that this allows storing the probabilities of the observed choice for each class. Second, for each latent class, we multiply the probabilities of the observed choices in each choice situation to obtain the probability of the agent's sequence of choices:⁶

. **(4) Compute the probability of the agent´s sequence of choices**
. forvalues s=1/$nclasses {
  2.   generate double _kbb`s´=_l`s´*$depvar
  3.   recode _kbb`s´ 0=.
  4.   by $id, sort: egen double _kbbb`s´=prod(_kbb`s´)
  5.   by $id, sort: replace _kbbb`s´=. if _n!=_N
  6. }

Third, we construct the denominator of (4), that is, the unconditional choice probability for each choice maker. This is done by computing a weighted average of the probabilities of the agent's sequence of choices in each class, with weights given by the class shares, that is, the variables _prob1, _prob2, ..., _probC:

5. If the same starting values were used for all the classes, the EM algorithm would perform the same computations for each class and return the same results at each iteration.

6. This last step is done through the user-written command gprod. Type findit gprod and install the package dm71.

. **(5) Compute the choice probability**
. generate double _den=_prob1*_kbbb1
. forvalues s=2/$nclasses {
  2.   replace _den=_den+_prob`s´*_kbbb`s´
  3. }

Finally, we compute the ratios defined in (4) and rearrange them to create individual-level variables. These are the conditional probabilities of class membership as defined in the previous section:

. **(6) Compute the conditional probabilities of class membership**
. forvalues s=1/$nclasses {
  2.   generate double _h`s´=(_prob`s´*_kbbb`s´)/_den
  3.   by $id, sort: egen double _H`s´=sum(_h`s´)
  4. }

Before starting the loop that iterates the EM recursion until convergence, we need to specify a Stata ml command that performs the estimation of the grouped-data model defined in (6):7

. **(7) Provide Stata with the ML command for the grouped-data model**
. by $group, sort: generate _alt=sum(1)
. summarize _alt
. by $id, sort: generate double _id1=1 if _n<=r(mean)
. generate _con=1
. program logit_lf
. args lnf a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
. tempvar denom
. generate double `denom´=1
. forvalues c=2/$nclasses {
  2.   replace `denom´=`denom´+exp(`a`c´´)
  3. }
. replace `lnf´=_H1*ln(1/`denom´) if $depvar==1 & _id1==1
. forvalues c=2/$nclasses {
  2.   replace `lnf´=`lnf´+_H`c´*ln(exp(`a`c´´)/`denom´) if $depvar==1 & _id1==1
  3. }
. replace `lnf´=0 if `lnf´==.
. end

Here it is worth noting that following (6), we set to 0 one vector of parameters for identification purposes. We now present the loop that repeats the steps above until convergence:

. local i=1
. while `i´<= $niter {

7. The ml command presented in the following lines allows for up to 20 latent classes. However, the routine can be easily modified to account for a different number of classes.

To start, fit again the C clogit models—one for each class—using the conditional probabilities of class membership as weights. Then recompute the probability of the chosen alternative in each choice occasion to update the probabilities of the agent's sequence of choices:

. **(8) Update the probability of the agent´s sequence of choices**
. capture drop _l* _kbbb*
. forvalues s=1/$nclasses {
  2.   clogit $depvar $varlist [iw=_H`s´], group($group) technique(nr dfp)
  3.   predict double _l`s´
  4.   replace _kbb`s´=_l`s´*$depvar
  5.   recode _kbb`s´ 0=.
  6.   by $id, sort: egen double _kbbb`s´=prod(_kbb`s´)
  7.   by $id, sort: replace _kbbb`s´=. if _n!=_N
  8. }

Now launch the ml model for the estimation of the grouped-data log likelihood, and use the active estimates to update the class share probabilities:

. **(9) Update the class share probabilities:
. global variables="($varlist2)"
. forvalues s=3/$nclasses {
  2.   global variables="$variables ($varlist2)"
  3. }
. ml model lf logit_lf $variables, max
. capture drop _denom _a*
. generate double _denom= 1
. forvalues c=2/$nclasses {
  2.   local k = `c´ - 1
  3.   predict double _a`c´, eq(eq`k´)
  4.   replace _denom=_denom+exp(_a`c´)
  5. }
. replace _prob1=1/_denom
. forvalues c=2/$nclasses {
  2.   replace _prob`c´=exp(_a`c´)/(_denom)
  3. }

Once the class share probabilities have been updated, the next step requires updating both the choice probability (that is, the variable _den) and the conditional probabilities of class membership (that is, the variables _H*):

. **(10) Update the choice probability**
. replace _den=_prob1*_kbbb1
. forvalues s=2/$nclasses {
  2.   replace _den=_den+_prob`s´*_kbbb`s´
  3. }
. **(11) Update the conditional probabilities of class membership**
. drop _H*
. forvalues s=1/$nclasses {
  2.   replace _h`s´=(_prob`s´*_kbbb`s´)/_den
  3.   by $id, sort: egen double _H`s´=sum(_h`s´)
  4. }

Now that both the parameters and the conditional probabilities have been updated, the routine computes the new log likelihood and restarts the loop until convergence.

As Train (2008) points out, convergence of EM algorithms for nonparametric estimation is still controversial and constitutes an area of future research. Here we follow Train (2008) and Weeks and Lange (1989) and define convergence when the percentage variation in the log-likelihood function from one iteration to the other is sufficiently small. In what follows, for example, the routine automatically stops the internal loop before the selected number of iterations, provided that the log likelihood has not changed more than 0.0001% over the last five iterations:⁸

. **(12) Update the log likelihood**
. capture drop _sumll
. egen _sumll=sum(ln(_den))
. **(13) Check for convergence**
. sum _sumll
. global z=r(mean)
. local _sl`i´=$z
. if `i´>=6 {
.   local a=`i´-5
.   if (-(`_sl`i´´ - `_sl`a´´)/`_sl`a´´<= 0.0001) {
.     continue, break
.   }
. }
. **(14) If not converged, display the log likelihood and restart the loop**
. display as green "Iteration " `i´ ": log likelihood = " as yellow $z
. local i=`i´ +1
. }

When convergence is achieved, results can be displayed by typing

. forvalues s=1/$nclasses {
  2.   clogit $depvar $varlist [iw=_H`s´], group($group)
  3. }

4.2 A model without covariates on the class share probabilities

If the model does not include demographics on the class share probabilities, the maximization of the grouped-data log likelihood can be avoided, because its solution can be provided analytically by following (7). This is important because the maximization of the grouped-data model slows down the overall estimation process, which could become time-consuming when the number of latent classes is high. In this case, the EM algorithm outlined in the previous section can be optimized by simply substituting the ninth step of the loop with the following lines:

8. Such a small value is advisable because—as discussed in Dempster, Laird, and Rubin (1977)—EM algorithms may move very slowly toward the maximum.

**(9-bis) Update class share probabilities by means of (7)**
. forvalues s=1/$nclasses {
  2.   capture drop _nums`s´
  3.   egen double _nums`s´=sum(_h`s´)
  4. }
. capture drop _dens
. generate double _dens=_nums1
. forvalues s=2/$nclasses {
  2.   replace _dens=_dens+_nums`s´
  3. }
. forvalues s=1/$nclasses {
  2.   replace _prob`s´=_nums`s´/_dens
  3. }

Subsequently, the loop continues as before from step 10 till the end.

5 Application

For our application, we use the data illustrated in the previous section. To begin with, a typical conditional logit model is fit with the clogit command:

. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
. clogit y price contract local wknown tod seasonal, group(gid)
Iteration 0:   log likelihood = -1379.3159
  (output omitted )
Iteration 4:   log likelihood = -1356.3867
Conditional (fixed-effects) logistic regression   Number of obs   =       4780
                                                  LR chi2(6)      =     600.47
                                                  Prob > chi2     =     0.0000
Log likelihood = -1356.3867                       Pseudo R2       =     0.1812

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

       price   -.6354853   .0439523   -14.46   0.000    -.7216302   -.5493403
    contract     -.13964   .0161887    -8.63   0.000    -.1713693   -.1079107
       local    1.430578   .0963826    14.84   0.000     1.241672    1.619485
      wknown    1.054535    .086482    12.19   0.000     .8850338    1.224037
         tod   -5.698954   .3494016   -16.31   0.000    -6.383769    -5.01414
    seasonal   -5.899944     .35485   -16.63   0.000    -6.595437   -5.204451

From the results above, we can see that—on average—customers prefer lower prices, shorter contract lengths, fixed-rate plans, and a local and well-known company. Now we use the command mixlogit from Hole (2007) to fit a parametric mixed logit model with independent, normally distributed coefficients:9

9. The model is fit via simulated maximum likelihood with 300 Halton draws. D. Pacifico 295

. mixlogit y, id(pid) group(gid) rand(price contract local wknown tod seasonal)
> nrep(300)
Iteration 0:   log likelihood = -1249.8219  (not concave)
  (output omitted )
Iteration 7:   log likelihood = -1101.6085
Mixed logit model                                 Number of obs   =       4780
                                                  LR chi2(6)      =     509.56
Log likelihood = -1101.6085                       Prob > chi2     =     0.0000

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

Mean
       price   -1.004329   .0721185   -13.93   0.000    -1.145679   -.8629798
    contract   -.2274985    .047386    -4.80   0.000    -.3203735   -.1346236
       local    2.208746   .2439681     9.05   0.000     1.730578    2.686915
      wknown    1.656329   .1707167     9.70   0.000      1.32173    1.990927
         tod   -9.364151   .5858618   -15.98   0.000    -10.51242   -8.215883
    seasonal   -9.496181   .5792009   -16.40   0.000    -10.63139   -8.360968

SD
       price    .2151655   .0311095     6.92   0.000      .154192    .2761389
    contract     .384136    .044778     8.58   0.000     .2963728    .4718992
       local    1.788806   .2370063     7.55   0.000     1.324282      2.25333
      wknown    1.185838   .1731652     6.85   0.000     .8464401    1.525235
         tod      1.6553   .2094545     7.90   0.000     1.244777    2.065824
    seasonal   -1.119371   .2836182    -3.95   0.000    -1.675252   -.5634893

As can be seen from the maximized log likelihood, we can reject the conditional logit specification in favor of a random coefficient model. Moreover, the magnitude of the coefficients is significantly different when compared with the estimates from clogit.10
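The comparison can be reproduced from the two maximized log likelihoods reported above; the first display below returns the same LR χ²(6) statistic shown in the mixlogit header.

. display 2*(-1101.6085 - (-1356.3867))   // about 509.56
. display chi2tail(6, 509.56)             // the associated p-value is essentially zero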

We now fit a nonparametric mixed logit model by means of the EM algorithm outlined in the previous section. Following Greene and Hensher (2003) and Train (2008), the choice of the appropriate number of classes is made by means of some information criteria. Here we opt for the Bayesian information criterion and the consistent Akaike information criterion (CAIC), which penalize more heavily those models with a large number of parameters.

10. This is an indication of the bias produced by the irrelevant alternatives property of standard conditional logit models. See Bhat (2000) for this point.
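For reference, the two criteria shown in the next table are computed as BIC = −2 lnL + k ln N and CAIC = −2 lnL + k(ln N + 1), where k is the number of parameters; taking N as the 100 choice makers reproduces the reported values. A quick check for the one-class (conditional logit) row:

. display 2*1356.3867 + 6*ln(100)         // BIC, about 2740.40
. display 2*1356.3867 + 6*(ln(100) + 1)   // CAIC, about 2746.40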

The next table shows—for an increasing number of latent classes—the maximized log likelihood, the number of parameters, the Bayesian information criterion, and the CAIC:

Classes Log Likelihood N.param. CAIC BIC

Cl.1     -1356.39    6    2746.40    2740.40
Cl.2     -1211.35   13    2495.57    2482.57
Cl.3     -1118.23   20    2348.57    2328.57
Cl.4     -1085.30   27    2321.95    2294.95
Cl.5*    -1040.49   34    2271.55*   2237.55
Cl.6     -1028.56   41    2286.93    2245.93
Cl.7     -1006.37   48    2281.79    2233.79
Cl.8*     -990.24   55    2288.76    2233.76*
Cl.9      -983.64   62    2314.80    2252.80
Cl.10     -978.10   69    2342.96    2273.96
Cl.11     -963.10   76    2352.19    2276.19

From these results, we can infer that the five-class model is optimal according to the CAIC, whilst the Bayesian criterion points to a model with eight latent classes. The next table summarizes the fit coefficients for a model with eight latent classes:

Variable Class1 Class2 Class3 Class4 Class5

price      -0.910   -0.737   -0.488    -2.110   -0.642
contract   -0.438    0.218   -0.592    -0.662    0.096
local       0.370    2.416    0.782     0.717    2.186
wknown      0.369    2.840    0.710     0.241    1.207
tod        -8.257   -6.690   -4.132   -14.191   -3.836
seasonal   -6.440   -7.213   -6.560   -17.207   -4.052

Prob 0.120 0.097 0.091 0.070 0.096

Variable Class6 Class7 Class8 W.Average

price       -1.208    -1.533   -0.082   -0.946
contract    -0.198    -0.409   -0.156   -0.269
local        6.578     0.621    4.937    2.369
wknown       5.103     0.930    3.444    1.918
tod        -14.847   -16.007   -1.088   -9.003
seasonal   -15.334   -14.818   -1.060   -9.058

Prob 0.111 0.236 0.178 -

Results show that the parameters are rather different from class to class, meaning that there is extensive unobserved heterogeneity in the choice behavior. Interestingly, the weighted averages of the parameters are close to the corresponding values obtained from the parametric estimation with mixlogit.

As for the EM recursion, the estimation of the eight-class model took 49 iterations before reaching convergence, that is, about one minute on our standard-issue PC with a dual-core processor and 4 GB RAM. However, the routine is already close to the maximum at the 11th iteration, that is, after less than 10 seconds.

This is a common feature of EM algorithms, and it actually suggests another application of the EM procedure proposed in this contribution. In fact, the routine could also be used to obtain good starting values for the estimation of latent-class models with gradient-based algorithms. This could be particularly useful either to speed up the estimation process or to avoid convergence problems when fitting models with a high number of latent classes.

6 Acknowledgments

I am grateful to Kenneth Train for his helpful comments on the implementation of the EM algorithm. Thanks also to an anonymous referee for relevant comments and suggestions on how to improve both this article and the estimation routine. Thanks to Hong il Yoo for his careful check on the Stata code.

7 References

Bhat, C. R. 2000. Automatic modeling methods for univariate series. In Handbook of Transport Modelling, ed. D. A. Hensher and K. J. Button, 71–90. Oxford: Elsevier.

Boyles, R. A. 1983. On the convergence of the EM algorithm. Journal of the Royal Statistical Society, Series B 45: 47–50.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39: 1–38.

Greene, W. H., and D. A. Hensher. 2003. A latent class model for discrete choice analysis: Contrasts with mixed logit. Transportation Research, Part B 37: 681–698.

Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401.

Huber, J., and K. Train. 2001. On the similarity of classical and Bayesian estimates of individual mean partworths. Marketing Letters 12: 259–269.

McFadden, D. 1974. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, ed. P. Zarembka, 105–142. New York: Academic Press.

McFadden, D., and K. Train. 2000. Mixed MNL models for discrete response. Journal of Applied Econometrics 15: 447–470.

Train, K. E. 2003. Discrete Choice Methods with Simulation. Cambridge: Cambridge University Press.

———. 2008. EM algorithms for nonparametric estimation of mixing distributions. Journal of Choice Modelling 1: 40–69.

Weeks, D. E., and K. Lange. 1989. Trials, tribulations, and triumphs of the EM algorithm in pedigree analysis. Journal of Mathematics Applied in Medicine & Biology 6: 209–232.

Wu, C. F. J. 1983. On the convergence properties of the EM algorithm. Annals of Statistics 11: 95–103.

About the author

Daniele Pacifico works with the Italian Department of the Treasury in Rome, Italy, and is a member of the Centre for the Analysis of Public Policies in Modena, Italy.

The Stata Journal (2012) 12, Number 2, pp. 299–307

The S-estimator of multivariate location and scatter in Stata

Vincenzo Verardi
University of Namur (FUNDP)
Center for Research in the Economics of Development (CRED)
Namur, Belgium
and
Université Libre de Bruxelles (ULB)
European Center for Advanced Research in Economics and Statistics (ECARES)
Center for Knowledge Economics (CKE)
Bruxelles, Belgium
[email protected]

Alice McCathie
Université Libre de Bruxelles (ULB)
European Center for Advanced Research in Economics and Statistics (ECARES)
Bruxelles, Belgium
[email protected]

Abstract. In this article, we introduce a new Stata command, smultiv, that implements the S-estimator of multivariate location and scatter. Using simulated data, we show that smultiv outperforms mcd, an alternative robust estimator. Finally, we use smultiv to perform robust principal component analysis and least-squares regression on a real dataset.

Keywords: st0259, smultiv, S-estimator, robustness, outlier, robust principal component analysis, robust regression

1 Introduction

Robust methods for parameter estimation are rapidly becoming essential elements of the statistician's toolbox. A key parameter in most data analysis techniques is the parameter of dispersion, for which Stata currently offers methods of robust estimation. In Verardi and Dehon (2010), the authors propose such a method: the minimum covariance determinant (MCD) estimator. This estimator is based on the notion of generalized variance (Wilks 1932), a one-dimensional assessment of the multivariate spread, measured by the determinant of the sample covariance matrix. The MCD estimator looks for the 50% subsample with the smallest generalized variance. This subsample is assumed free of outliers and thus can be used to compute robust estimates of location and scatter.

Although the MCD estimator is appealing from a theoretical point of view, its practical implementation is problematic. Its estimation requires us to compute the determinant of the sample covariance matrix for all subsamples that contain 50% of the initial data. Given that this is unfeasible, the code for the MCD estimator uses the p-subset algorithm (Rousseeuw and Leroy 1987) that allows for only a reduced number of subsets to be considered. Unfortunately, this algorithm leads to relatively unstable estimation results, especially for small datasets. Another weakness of the MCD estimator is its low Gaussian efficiency. In light of these shortcomings, we present in this article a Stata command that implements an alternative estimator of location and scatter in a multivariate setting, the S-estimator. The S-estimator is based on the minimization of the sum of a function of the deviations, where the choice of function determines the robustness properties of the estimator.

This article is structured as follows. First, we present the S-estimator in both a univariate and a multivariate setting. We then show that this estimator outperforms MCD in terms of both the stability of the results and its Gaussian efficiency. Finally, we use S-estimates of location and scatter to perform robust principal component analysis (PCA) and robust linear regression.

2 S-estimator of location and scatter

In this section, we introduce the S-estimator of location and scatter. Let us begin by recalling that a location parameter is a measure of the centrality of a distribution. To determine this parameter, one must look for the point in the distribution around which the dispersion of observations is the smallest. In the univariate case, if we choose the variance as a measure of this dispersion, the problem can be formalized as

\[
\widehat{\theta} = \operatorname*{arg\,min}_{\theta} \sigma^2, \quad \text{where } \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \theta)^2,
\]

which can be rewritten as

\[
1 = \frac{1}{n}\sum_{i=1}^{n} \left( \frac{x_i - \theta}{\sigma} \right)^2
\]

This optimization problem consists of the minimization of the sum of the squared distances between each point and the center of the distribution, subject to an equality constraint.

In a multivariate setting, the distance measure to the center of the data cloud is called the Mahalanobis distance and is defined as MD_i = √{(X_i − θ) Σ⁻¹ (X_i − θ)′}, where X is a matrix of covariates, Σ is the scatter matrix, and θ is the location vector. To obtain a multivariate estimator of location, we minimize a univariate measure of the dispersion of the data cloud—the determinant of the scatter matrix, det(Σ). This is subject to a constraint similar to the one in the univariate case, that equalizes the sum of the squared (Mahalanobis) distances to the number of degrees of freedom, p (recall that the squared Mahalanobis distances follow a χ²_p). The problem can be formally stated as

\[
(\widehat{\theta}, \widehat{\Sigma}) = \operatorname*{arg\,min}_{\theta,\,\Sigma} \det(\Sigma), \quad \text{such that } p = \frac{1}{n}\sum_{i=1}^{n} (X_i - \theta)\,\Sigma^{-1}(X_i - \theta)'
\]

Because the distances enter the above equation squared, it is obvious that observations lying far away from the center of the distribution will have a large influence on its value. Thus in the presence of outlying values, the estimated θ may not reflect the actual centrality of a large part of the dataset. The first step toward robustifying this estimator is to consider replacing the square function with an alternative function we call ρ. This function ρ should be nondecreasing in positive values of the argument but less increasing in these values than the square function we wish to replace. Given such a function ρ, the problem becomes

\[
(\widehat{\theta}, \widehat{\Sigma}) = \operatorname*{arg\,min}_{\theta,\,\Sigma} \det(\Sigma), \quad \text{such that } b = \frac{1}{n}\sum_{i=1}^{n} \rho\left\{ \sqrt{(X_i - \theta)\,\Sigma^{-1}(X_i - \theta)'} \right\}
\]

where b = E{ρ(u)}¹ to guarantee Gaussian consistency of the estimator. The parameter θ for which we can find the smallest det(Σ) satisfying this equality is called an M-estimator of location, where Σ is a multivariate M-estimator of dispersion. If these parameters are estimated simultaneously, they are called S-estimators. From this point on, we will consider only S-estimators because they are highly resistant to outliers for an appropriately chosen function ρ.

The quality of the S-estimator depends on the function ρ. This function should be chosen such that it maximizes both the robustness of the estimator and its Gaussian efficiency. Although there exist a variety of candidate functions for ρ, here we will consider only the Tukey biweight function, which is known to possess good robustness properties. This function is defined as

\[
\rho(\mathrm{MD}_i) =
\begin{cases}
1 - \left\{ 1 - \left( \dfrac{\mathrm{MD}_i}{k} \right)^2 \right\}^3 & \text{if } |\mathrm{MD}_i| \le k \\[2ex]
1 & \text{if } |\mathrm{MD}_i| > k
\end{cases}
\]
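Given a variable holding the Mahalanobis distances, this ρ is straightforward to evaluate in Stata. The line below is a sketch in which md is a placeholder variable name and k is set to the value used later in the text for p = 3.

. generate double rho = cond(abs(md) <= 2.707, 1 - (1 - (md/2.707)^2)^3, 1)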

The breakdown point of the estimator (that is, the maximum contamination the estimator can withstand) is determined by the tuning constant k. One first selects a cutoff value c on the univariate scale as the number of standard deviations from the mean beyond which ρ′(MD_i) becomes zero (Campbell, Lopuhaä, and Rousseeuw 1998). This constant c can then be converted to a cutoff k for the Mahalanobis distances, whose squares are on a chi-squared scale, using the Wilson–Hilferty (1931) transformation (Campbell 1984):

$$k = \sqrt{p}\left\{c\sqrt{\frac{2}{9p}} + 1 - \frac{2}{9p}\right\}^{3/2}$$

The S-estimator has a breakdown point of 50% when c = 1.548, which for p = 3 implies k = 2.707. Figure 1 depicts the Tukey biweight function ρ and its derivative ρ′.
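As a quick check, the tuning constant can be reproduced in Stata from c and p. The lines below are a sketch that assumes the Wilson–Hilferty-based formula as reconstructed above; they are not part of the original article.

local c = 1.548
local p = 3
local k = sqrt(`p') * (`c'*sqrt(2/(9*`p')) + 1 - 2/(9*`p'))^(3/2)
display "k = " `k'    // close to 2.707 for c = 1.548 and p = 3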

1. Under the assumption that u follows a standard normal distribution.

Figure 1. Tukey biweight

The Gaussian consistency parameter (E{ρ(u)} if we assume a p-variate standard normal distribution) is

$$b = \frac{p}{2}\,\chi^2_{p+2}(k^2) - \frac{p(p+2)}{2k^2}\,\chi^2_{p+4}(k^2) + \frac{p(p+2)(p+4)}{6k^4}\,\chi^2_{p+6}(k^2) + \frac{k^2}{6}\left\{1 - \chi^2_{p}(k^2)\right\}$$

where $\chi^2_{q}(\cdot)$ denotes the cumulative distribution function of a chi-squared variable with q degrees of freedom.
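Given the formula as reconstructed above, the constant can be evaluated in Stata with the chi2() cumulative distribution function. The sketch below is not from the article; it plugs in p = 3 and the k implied by c = 1.548.

local p  = 3
local k2 = 2.707^2                  // k squared, that is, the cutoff on the chi-squared scale
local b  = `p'/2*chi2(`p'+2, `k2')                            ///
         - `p'*(`p'+2)/(2*`k2')*chi2(`p'+4, `k2')             ///
         + `p'*(`p'+2)*(`p'+4)/(6*`k2'^2)*chi2(`p'+6, `k2')   ///
         + `k2'/6*(1 - chi2(`p', `k2'))
display "b = " `b'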

3 Relative stability of MCD and S-estimation

Both MCD and S-estimators give robust estimates of multivariate location and scatter. However, implementation of MCD is more problematic because the fast MCD algorithm² that performs the estimation yields results that are both unstable over multiple replications and sensitive to the chosen starting value. This is not a weakness of the S-estimator, whose estimation procedure is based on an iterative reweighting algorithm; see Campbell, Lopuhaä, and Rousseeuw (1998).

To illustrate the comparative performances of both estimators, we create a dataset of 1,000 observations with values drawn from three independent N(0, 1) random variables and replace 10% of the x1 values by an arbitrarily large number (100 in this example). We then compute the MCD and S-estimator of scatter of the data and repeat the procedure 1,000 times. We report in table 1 the estimates obtained for the elements of the main diagonal of the scatter matrix. The stability of the estimates refers to the proportion of appearances made by the most frequently observed value for each estimate.
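A single replication of this experiment can be sketched as follows. This is an illustration, not the authors' simulation code: the option and variable names follow the smultiv syntax documented in section 5, the mcd call mirrors the syntax used in section 4, and out, dist, out2, and dist2 are arbitrary names.

clear
set obs 1000
drawnorm z1 z2 z3                       // three independent N(0,1) variables
replace z1 = 100 in 1/100               // contaminate 10% of the first variable
smultiv z1 z2 z3, generate(out dist)    // S-estimates of location and scatter
matrix list e(S)                        // diagonal elements are the quantities summarized in table 1
mcd z1 z2 z3, generate(out2 dist2)      // MCD counterpart for comparison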

2. Developed by Rousseeuw and van Driessen (1999) and implemented in Stata by Verardi and Croux (2009, 2010).

Table 1. MCD and S-estimator of scale

                 S-estimator                 MCD-estimator
              σ₁²     σ₂²     σ₃²         σ₁²     σ₂²     σ₃²
minimum      1.049   1.171   1.118       0.761   0.874   0.815
maximum      1.051   1.171   1.120       0.972   1.016   1.050
stability    99.6%   99.6%   99.6%        0%      0%      0%

We see here that the range of values observed is much smaller with the S-estimator than with MCD. In fact, the difference between the minimum and maximum S-estimates is virtually zero (between 0 and 0.002), whereas with MCD this difference is small but not negligible (between 0.142 and 0.235). Perhaps the most striking result is that the most frequently observed S-estimate occurred in 99.6% of the trials. This is in sharp contrast with the MCD estimates, none of which appeared more than once.

4 The multivariate S-estimator of location and scatter in other statistical applications

Many statistical applications require the prior estimation of a scatter matrix. In this section, we demonstrate how to use the S-estimator of location and scatter to 1) identify outliers in a dataset, 2) obtain robust estimates of linear regression parameters, and 3) perform robust PCA. Again we create a dataset of 1,000 observations, generated here from the following data-generating process: y = x1 + x2 + x3 + ε. We then contaminate the data by replacing 10% of the observations in x1 by draws from a N(100, 1). In Stata language, this is done as follows:

set seed 123
set obs 1000
matrix C0=(1,0.3,0.2 \ 0.3,1,0.4 \ 0.2,0.4,1)
drawnorm x1 x2 x3, corr(C0)
gen e=invnormal(runiform())
gen y=x1+x2+x3+e
gen bx1=x1
replace x1=invnormal(runiform())+100 in 1/100

To identify the outliers present in our dataset, we need to estimate the robust Mahalanobis distance for each observation in the sample. Computing these distances requires robust estimates of both the location and the scale parameters of our sample. We thus run the smultiv command to obtain S-estimators of these parameters and request that the outliers be flagged and the robust Mahalanobis distances be reported. This is done in Stata by typing

smultiv x*, generate(robust_outlier robust_distance)
summarize robust_distance in 101/l
summarize robust_distance in 1/100

Observations are identified as outliers if their robust Mahalanobis distance is greater than the square root of the 95th percentile of a χ²_p distribution. (We know that if our data are Gaussian, the squared Mahalanobis distances will follow a χ²_p distribution, where p is equal to the number of variables in our dataset.) Our results show that the robust distances of the outliers range from 96.82 to 102.85 with an average of 99.58. These values are consistent with our contamination scheme. The average distance for observations in the noncontaminated dataset is 1.493, which is well below the critical value of √(χ²₃,₀.₉₅) = 2.80.
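The cutoff used here can be obtained directly in Stata (a one-line check, not part of the original text):

display sqrt(invchi2(3, 0.95))    // approximately 2.80, the cutoff on the distance scale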

Next we look at PCA. There are two different methods for performing robust PCA. One can run the pca command on a cleaned-up version of the initial sample after removing the outlying values identified. Alternatively, one can use the pcamat command in Stata to extract the eigenvalues and eigenvectors of the robust correlation matrix estimated.

In our study, we perform four PCAs. The first pca command is run on the noncontaminated sample and serves as our benchmark (PCA clean). We then perform the pca command on the contaminated sample (PCA). Next we run the pca command on the cleaned-up sample obtained by removing the outliers identified previously (PCA cleaned). Finally, we run the pcamat command on the robust correlation matrix (PCA mat). These commands are typed in Stata as follows:

pca bx1 x2 x3
pca x*
pca x* if robust_outlier==0
matrix C=e(C)
pcamat C, n(1000)

For each PCA performed, we report in table 2 the first principal component obtained, its corresponding eigenvalue, and the proportion of variance it explains. We see clearly that both robust PCA methods yield results similar to those obtained from the standard PCA performed on the uncontaminated sample, whereas the classical PCA run on the contaminated dataset leads to uninformative results. A similar robustification of the factor model can be performed either by downweighting the outliers or by calling on the factor command.

Table 2. Principal component analysis

              PCA clean      PCA       PCA cleaned    PCA mat
x1             0.5339      −0.0225       0.5243        0.5243
x2             0.6313       0.7078       0.6331        0.6331
x3             0.5626       0.7060       0.5695        0.5695
eigenvalue     1.63         1.38         1.62          1.62
               (54%)        (46%)        (54%)         (54%)

S-estimation can also be used to perform robust linear regression analysis. Consider the following regression model: y = α + Xβ + ε, where y is the dependent variable, α is a constant, X is the matrix of covariates, and ε is the error term vector. Least-squares estimation gives us the parameter estimates

$$\widehat{\beta} = \Sigma_{xx}^{-1}\Sigma_{xy} = (X'X)^{-1}X'Y \qquad \text{and} \qquad \widehat{\alpha} = \mu_y - \mu_x\widehat{\beta}$$

(where μ_x is a vector and μ_y is a scalar). To obtain robust estimators for β and α, one can replace μ and Σ in the previous expressions by their robust S-estimation counterparts μ^S and Σ^S. Robust estimates of the regression parameters are thus

$$\widehat{\beta}^S = \left(\Sigma_{xx}^{S}\right)^{-1}\Sigma_{xy}^{S} \qquad \text{and} \qquad \widehat{\alpha}^S = \mu_y^S - \mu_x^S\widehat{\beta}^S$$

(a sketch of this plug-in computation appears after the data description below).

We conclude this section by performing regression analysis on a real dataset that contains information about 120 U.S. universities.³ The dependent variable is score, a weighted average of five of the six factors used by Shanghai University to compute their Academic Ranking of World Universities. We consider five explanatory variables: cost (logarithm of the out-of-state cost per year), enrol (proportion of admitted students that choose to enroll), math (SAT math score first quartile), undergrad (logarithm of the total undergraduate population), and private (a dummy variable that indicates whether a university is private).
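The plug-in computation can be sketched from smultiv's saved results e(mu) and e(S), which are documented in section 5.3. The commands below are an illustration only: the variable names are hypothetical, and the partitioning assumes that the dependent variable was listed first in the smultiv call and that the saved matrices follow that ordering.

* hypothetical variables: y is the response, x1-x3 the covariates
smultiv y x1 x2 x3, generate(out dist)
matrix mu  = e(mu)                      // robust location vector
matrix S   = e(S)                       // robust scatter matrix
matrix Sxx = S[2..., 2...]              // scatter of the covariates
matrix Sxy = S[2..., 1]                 // covariance of the covariates with y
matrix b   = invsym(Sxx)*Sxy            // robust slope estimates
matrix a   = mu[1,1] - mu[1,2...]*b     // robust intercept
matrix list b
matrix list a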

We begin by running a basic ordinary least-squares (OLS) regression on the complete sample. We then perform S-estimation using the command smultiv and flag the outliers. We run an OLS regression on the subsample obtained by removing the outliers identified by smultiv. Finally, we repeat this two-step procedure with the mcd command (Verardi and Croux 2010) to identify the outliers in our sample. This is done in Stata by typing

use http://homepages.ulb.ac.be/~vverardi/unidata, clear
/* Classic OLS */
regress score cost enrol math undergrad private
/* Removal of outliers identified through S-estimation */
smultiv score cost enrol math undergrad, ///
    generate(robust_outlier robust_distance) dummies(private)
regress score cost enrol math undergrad private if robust_outlier==0
/* Removal of outliers identified through MCD estimation */
mcd score cost enrol math undergrad, generate(mcd_outlier mcd_distance)
regress score cost enrol math undergrad private if mcd_outlier==0

The results in table 3 show that although the OLS regression performed on the whole sample identifies all five explanatory variables as statistically significant, both robust regressions reveal that this is in fact not the case for the variable enrol.

3. Data for 2007, available at http://www.usuniversities.ca/.

Table 3. Linear regression analysis

             cost       enrol      math      undergrad    private       _cons
regress     27.690*     4.484*    84.321*      3.731*     −15.469*    −845.546*
            (5.96)     (0.57)    (12.55)      (1.87)       (3.61)      (78.85)

smultiv     25.080*    −0.506     65.753*      9.659*      −8.802*    −747.318*
            (6.38)     (1.08)    (12.11)      (1.93)       (3.95)      (68.41)

mcd         20.246*    −0.620     72.041*      7.446*     −11.986*    −714.984*
            (6.26)     (1.02)    (11.77)      (1.64)       (3.16)      (69.97)

5 The smultiv command

5.1 Syntax

The general syntax for the command is

smultiv varlist [if] [in], generate(varname1 varname2) [dummies(varlist) nreps(#)]

5.2 Options

generate(varname1 varname2) creates two new variables: varname1, a dummy variable that flags the outliers, and varname2, a continuous variable that reports the robust Mahalanobis distances. The user must specify the name for each of the variables generated. These variables cannot be generated separately.

dummies(varlist) specifies the dummy variables in the dataset.

nreps(#) specifies the number of replications.

5.3 Saved results

smultiv saves the following in e():

Matrices
  e(mu)    location vector
  e(S)     scatter matrix
  e(C)     correlation matrix

6 Conclusion

In this article, we proposed a new Stata command to implement the S-estimator of location and scatter in a multivariate setting. Our simulated data examples show that this estimator yields more stable results than its closest competitor, the MCD estimator. Furthermore, the estimates obtained can be used to perform other statistical data analysis techniques, such as PCA and linear regression, and obtain robust results.

7 Acknowledgment

Vincenzo Verardi is an associate researcher at the FNRS and gratefully acknowledges their financial support.

8 References

Campbell, N. A. 1984. Mixture models and atypical values. Mathematical Geology 16: 465–477.

Campbell, N. A., H. P. Lopuhaä, and P. J. Rousseeuw. 1998. On the calculation of a robust S-estimator of a covariance matrix. Statistics in Medicine 17: 2685–2695.

Rousseeuw, P. J., and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: Wiley.

Rousseeuw, P. J., and K. van Driessen. 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics 41: 212–223.

Verardi, V., and C. Croux. 2009. Robust regression in Stata. Stata Journal 9: 439–453.

———. 2010. Software update: st0173_1: Robust regression in Stata. Stata Journal 10: 313.

Verardi, V., and C. Dehon. 2010. Multivariate outlier detection in Stata. Stata Journal 10: 259–266.

Wilks, S. S. 1932. Certain generalizations in the analysis of variance. Biometrika 24: 471–494.

Wilson, E. B., and M. M. Hilferty. 1931. The distribution of chi-square. Proceedings of the National Academy of Sciences of the United States of America 17: 684–688.

About the authors

Vincenzo Verardi is a research fellow of the Belgian National Science Foundation (FNRS). He is a professor at the Faculté Notre Dame de la Paix of Namur and at the Université Libre de Bruxelles. His research interests include applied econometrics and development economics.

Alice McCathie is a PhD candidate in economics at the Université Libre de Bruxelles. Her main research interests are the economics of higher education and discrete choice modeling.

The Stata Journal (2012) 12, Number 2, pp. 308–331

Using the margins command to estimate and interpret adjusted predictions and marginal effects

Richard Williams Department of Sociology University of Notre Dame Notre Dame, IN [email protected]

Abstract. Many researchers and journals place a strong emphasis on the sign and statistical significance of effects—but often there is very little emphasis on the substantive and practical significance of the findings. As Long and Freese (2006, Regression Models for Categorical Dependent Variables Using Stata [Stata Press]) show, results can often be made more tangible by computing predicted or expected values for hypothetical or prototypical cases. Stata 11 introduced new tools for making such calculations—factor variables and the margins command. These can do most of the things that were previously done by Stata’s own adjust and mfx commands, and much more. Unfortunately, the complexity of the margins syntax, the daunting 50-page reference manual entry that describes it, and a lack of understanding about what margins offers over older commands that have been widely used for years may have dissuaded some researchers from examining how the margins command could benefit them. In this article, therefore, I explain what adjusted predictions and marginal effects are, and how they can contribute to the interpretation of results. I further explain why older commands, like adjust and mfx, can often produce incorrect results, and how factor variables and the margins command can avoid these errors. The relative merits of different methods for setting representative values for variables in the model (marginal effects at the means, average marginal effects, and marginal effects at representative values) are considered. I show how the marginsplot command (introduced in Stata 12) provides a graphical and often much easier means for presenting and understanding the results from margins, and explain why margins does not present marginal effects for interaction terms.

Keywords: st0260, margins, marginsplot, adjusted predictions, marginal effects

1 Introduction

Many researchers and journals place a strong emphasis on the sign and statistical significance of effects—but often there is very little emphasis on the substantive and practical significance of the findings. As Long and Freese (2006) show, results can often be made more tangible by computing predicted or expected values for hypothetical or prototypical cases.

For example, if we want to get a practical feel for the impact of gender in a logistic regression model, we might compare the predicted probabilities of success for a man and woman who both have low, average, or high values on the other variables in the model. Such predictions are sometimes referred to as margins, predictive margins, or (Stata’s preferred terminology) adjusted predictions. Another useful aid to interpretation is marginal effects, which can succinctly show, for example, how the adjusted predictions for blacks differ from the adjusted predictions for whites.

Stata 11 introduced new tools for making such calculations—factor variables and the margins command. These can do most of the things that were previously done by Stata’s own adjust and mfx commands, and much more. Unfortunately, the complexity of the margins syntax, the daunting 50-page reference manual entry that describes it, and a lack of understanding about what margins offers over older commands that have been widely used for years may have dissuaded some researchers from examining how the margins command could benefit them.

In this article, therefore, I illustrate and explain some of the most critical features and advantages of the margins command. I explain what adjusted predictions and marginal effects are, and how they can aid interpretation. I show how margins can replicate analyses done by older commands like adjust but can do so more easily. I demonstrate how, thanks to its support of factor variables that were introduced in Stata 11, margins can avoid mistakes made by earlier commands and provide a superior means for dealing with interdependent variables (for example, X and X²; X1, X2, and X1 × X2; and multiple dummies created from a single categorical variable). I illustrate the different strategies for defining “typical” cases and how margins can estimate them: marginal effects at the means (MEMs), average marginal effects (AMEs), and marginal effects at representative values (MERs); I also show some of the pros and cons of each approach. The output from margins can sometimes be overwhelming; I therefore show how the marginsplot command, introduced in Stata 12, provides an easy and convenient way of generating graphical results that can be much more understandable. Finally, I explain why, unlike older commands, margins does not report marginal effects for interaction terms and why it would be nonsensical to do so.

2 Data

We use nhanes2f.dta¹ (Second National Health and Nutrition Examination Survey), available from the StataCorp website. The examples examine how demographic variables are related to whether a person has diabetes.² We begin by retrieving the data, extracting the nonmissing cases we want, and then computing variables we will need later.

1. These data were collected in the 1980s. Rates of diabetes in the United States are much higher now.
2. To simplify the discussion and to facilitate our comparison of old and new commands, we do not use the sampling weights that come with the data. However, margins can handle those weights correctly.

. webuse nhanes2f, clear . keep if !missing(diabetes, black, female, age, age2, agegrp) (2 observations deleted) . label variable age2 "age squared" . describe diabetes black female age age2 agegrp storage display value variable name type format label variable label

diabetes byte %9.0g diabetes, 1=yes, 0=no black byte %8.0g 1 if race=black, 0 otherwise female byte %8.0g 1=female, 0=male age byte %9.0g age in years age2 float %9.0g age squared agegrp byte %8.0g agegrp Age groups 1-6 . * Compute the variables we will need . tab1 agegrp, generate(agegrp) -> tabulation of agegrp Age groups 1-6 Freq. Percent Cum.

age20-29 2,320 22.45 22.45 age30-39 1,620 15.67 38.12 age40-49 1,269 12.28 50.40 age50-59 1,289 12.47 62.87 age60-69 2,852 27.60 90.47 age 70+ 985 9.53 100.00

Total 10,335 100.00 . generate femage = female*age . label variable femage "female * age interaction" . summarize diabetes black female age age2 femage, separator(6) Variable Obs Mean Std. Dev. Min Max

diabetes 10335 .0482825 .214373 0 1 black 10335 .1050798 .3066711 0 1 female 10335 .5250121 .4993982 0 1 age 10335 47.56584 17.21752 20 74 age2 10335 2558.924 1616.804 400 5476 femage 10335 25.05031 26.91168 0 74

The observations in the sample range in age from 20 to 74, with an average age of 47.57. Slightly over half the sample (52.5%) is female and 10.5% is black.3 Less than 5% of the respondents have diabetes, but as we will see, the likelihood of having diabetes differs by race, gender, and age. Note that the mean of femage (female × age) is about half the mean of age. This reflects the fact that men have a score of 0 on femage while for women, femage = age.

3. Less than two percent of the sample is coded Other on race, and their rates of diabetes are identical to whites. We therefore combine whites and others in the analysis.

3 Adjusted predictions for a basic model

We first fit a relatively uncomplicated model.

. * Basic model . logit diabetes black female age, nolog Logistic regression Number of obs = 10335 LR chi2(3) = 374.17 Prob > chi2 = 0.0000 Log likelihood = -1811.9828 Pseudo R2 = 0.0936

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

black .7179046 .1268061 5.66 0.000 .4693691 .96644 female .1545569 .0942982 1.64 0.101 -.0302642 .3393779 age .0594654 .0037333 15.93 0.000 .0521484 .0667825 _cons -6.405437 .2372224 -27.00 0.000 -6.870384 -5.94049

According to the model, on an “all other things equal” basis, blacks are more likely to have diabetes than are whites, women are more likely to have diabetes than are men, and the probability of having diabetes increases with age. (The effect of being female is not significant in this model, but it will be significant in other models we test.) The coefficients tell us how the log odds of having diabetes are affected by each variable (for example, the log odds of a black having diabetes are 0.718 greater than the log odds for an otherwise-identical white). But because most people do not think in terms of log odds, many would find it more helpful if they could see how the probability of having diabetes was affected by each variable. For example, the positive and highly significant coefficient for age tells us that getting older is bad for one’s health. This is hardly surprising, but just how bad is it? For most people, the coefficient for age of 0.059 has little intuitive or practical appeal.

Adjusted predictions can make these results more tangible. With adjusted predictions, you specify values for each of the independent variables in the model and then compute the probability of the event occurring for an individual who has those values. To illustrate, we will use the adjust command to compute the probability that an “average” 20-year-old will have diabetes and compare it to the probability that an “average” 70-year-old will.

. adjust age = 20 black female, pr

Dependent variable: diabetes Equation: diabetes Command: logit Covariates set to mean: black = .10507983, female = .52501209 Covariate set to value: age = 20

All pr

.006308

Key: pr = Probability . adjust age = 70 black female, pr

Dependent variable: diabetes Equation: diabetes Command: logit Covariates set to mean: black = .10507983, female = .52501209 Covariate set to value: age = 70

All pr

.110438

Key: pr = Probability

The results show that an “average” 20-year-old has less than a 1% chance of having diabetes, while an otherwise-comparable 70-year-old has an 11% chance. Most people will find such results much more tangible and meaningful than the original coefficient for age. But what does “average” mean? In this case, we used the common, but not universal, practice of using the mean values for the other independent variables (female, black) that are in the model; for example, the value of female is set to 0.525, while the value for black is fixed at 0.105. Later, when discussing marginal effects, I show other options for defining “average”. With margins, it is even easier to get these results, and more. We use the at() option to fix a variable at a specific value or set of values. The atmeans option tells margins to fix all other variables at their means. (Unlike adjust, this is not the default for margins.) If we wanted to see how the probability of having diabetes for average individuals differs across age groups, we could do something like this:4

4. The vsquish option suppresses blank lines between terms. An even more compact display can be obtained by using the noatlegend option, which suppresses the display of the values that variables were fixed at. However, be careful using noatlegend because not having that information may make output harder to interpret.

. margins, at(age=(20 30 40 50 60 70)) atmeans vsquish Adjusted predictions Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() 1._at : black = .1050798 (mean) female = .5250121 (mean) age = 20 2._at : black = .1050798 (mean) female = .5250121 (mean) age = 30 3._at : black = .1050798 (mean) female = .5250121 (mean) age = 40 4._at : black = .1050798 (mean) female = .5250121 (mean) age = 50 5._at : black = .1050798 (mean) female = .5250121 (mean) age = 60 6._at : black = .1050798 (mean) female = .5250121 (mean) age = 70

Delta-method Margin Std. Err. z P>|z| [95% Conf. Interval]

_at 1 .0063084 .0009888 6.38 0.000 .0043703 .0082465 2 .0113751 .0013794 8.25 0.000 .0086715 .0140786 3 .0204274 .0017892 11.42 0.000 .0169206 .0239342 4 .0364184 .0021437 16.99 0.000 .0322167 .04062 5 .0641081 .0028498 22.50 0.000 .0585226 .0696935 6 .1104379 .005868 18.82 0.000 .0989369 .121939

According to these results, an average 70-year-old (who is again 0.105 black and 0.525 female) is almost 18 times as likely to have diabetes as an average 20-year-old (11.04% compared with 0.63%). Further, we see that there is a large increase in the predicted probability of diabetes between ages 50 and 60 and an even bigger jump between 60 and 70.

4 Factor variables

Suppose, instead, we wanted to compare the average female with the average male, and the average black with the average nonblack. We could give the commands

. margins, at(black = (0 1)) atmeans
. margins, at(female = (0 1)) atmeans

Using factor variables, introduced in Stata 11, can make things easier. We need to rerun the logit command first.

. logit diabetes i.black i.female age, nolog Logistic regression Number of obs = 10335 LR chi2(3) = 374.17 Prob > chi2 = 0.0000 Log likelihood = -1811.9828 Pseudo R2 = 0.0936

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.black .7179046 .1268061 5.66 0.000 .4693691 .96644 1.female .1545569 .0942982 1.64 0.101 -.0302642 .3393779 age .0594654 .0037333 15.93 0.000 .0521484 .0667825 _cons -6.405437 .2372224 -27.00 0.000 -6.870384 -5.94049

. margins black female, atmeans Adjusted predictions Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() at : 0.black = .8949202 (mean) 1.black = .1050798 (mean) 0.female = .4749879 (mean) 1.female = .5250121 (mean) age = 47.56584 (mean)

Delta-method Margin Std. Err. z P>|z| [95% Conf. Interval]

black 0 .0294328 .0020089 14.65 0.000 .0254955 .0333702 1 .0585321 .0067984 8.61 0.000 .0452076 .0718566

female 0 .0292703 .0024257 12.07 0.000 .024516 .0340245 1 .0339962 .0025912 13.12 0.000 .0289175 .0390748

The i. notation tells Stata that black and female are categorical variables rather than continuous. As the Stata 12 User’s Guide (StataCorp 2011) explains in section 11.4.3.1, “i.group is called a factor variable, although more correctly, we should say that group is a categorical variable to which factor-variable operators have been applied.... When you type i.group, it forms the indicators for the unique values of group.” In other words, Stata, in effect, creates dummy variables coded 0 or 1 from the categorical variable. In this case, of course, black and female are already coded 0 or 1—but margins and other postestimation commands still like you to use the i. notation so they know the variable is categorical (rather than, say, being a continuous variable that just happens to only have the values of 0 or 1 in this sample). But if, say, we had the variable race coded 1 = white and 2 = black, then the new variable would be coded 0 = white and 1 = black.

Or if the variable religion was coded 1 = Catholic, 2 = Protestant, 3 = Jewish, and 4 = Other, then saying i.religion would cause Stata to create three 0 or 1 dummies. By default, the first category (in this case, Catholic) is the reference category, but we can easily change that; for example, ib2.religion would make Protestant the reference category, or ib(last).religion would make the last category, Other, the reference.

Factor variables can also be used to include squared terms and interaction terms in models. For example,

. logit diabetes i.black i.female age c.age#c.age, nolog
. logit diabetes i.black i.female age i.female#c.age, nolog

The # (pronounced cross) operator is used for interactions and product terms. The use of # implies the i. prefix; that is, unless you indicate otherwise, Stata will assume that the variables on both sides of the # operator are categorical and will compute interaction terms accordingly. Hence, we use the c. notation to override the default and tell Stata that age is a continuous variable. So, c.age#c.age tells Stata to include age² in the model; we do not want or need to compute the variable separately. Similarly, i.female#c.age produces the female × age interaction term. Stata also offers a ## notation, called factorial cross. It can save some typing and provide an alternative parameterization of the results (see the sketch below). At first glance, the use of factor variables might seem like a minor convenience at best: they save you the trouble of computing dummy variables and interaction terms beforehand. However, the advantages of factor variables become much more apparent when used in conjunction with the margins command.
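For instance, because ## expands to the main effects plus the interaction, the two commands below fit the same model in different ways (a brief illustration, not from the original text):

logit diabetes i.black i.female age i.female#c.age, nolog
logit diabetes i.black i.female##c.age, nolog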

5 Adjusted predictions when there are interdependencies among variables

Sometimes the value of one variable or variables perfectly determines the value of another. For example, if a model includes both X and X², then if X = 10, X² must equal 100. Or if X1 = 0, then the interaction X1 × X2 must also equal 0; or if X1 = 1, then the interaction X1 × X2 must equal X2. If multiple dummies have been created from the same categorical variable (for example, black, white, and other have been created from the variable race), then if black = 1, the other race dummies must equal 0.

Older Stata commands generally do not recognize such interdependencies between variables. This can lead to incorrect results when computing adjusted predictions. Factor variables and the margins command can avoid these errors. Following are some examples.

5.1 Squared terms

Suppose, for example, that our model includes an age² term (that is, the age2 variable), and we want to see what the predicted value is for a 70-year-old who has average (mean) values on the other variables in the model. We can do the following:

. * Squared term in model . logit diabetes black female age age2, nolog Logistic regression Number of obs = 10335 LR chi2(4) = 381.03 Prob > chi2 = 0.0000 Log likelihood = -1808.5522 Pseudo R2 = 0.0953

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

black .7207406 .1266509 5.69 0.000 .4725093 .9689718 female .1566863 .0942032 1.66 0.096 -.0279486 .3413212 age .1324622 .0291223 4.55 0.000 .0753836 .1895408 age2 -.0007031 .0002753 -2.55 0.011 -.0012428 -.0001635 _cons -8.14958 .7455986 -10.93 0.000 -9.610926 -6.688233

. adjust age = 70 black female age2, pr

Dependent variable: diabetes Equation: diabetes Command: logit Covariates set to mean: black = .10507983, female = .52501209, age2 = 2558.9238 Covariate set to value: age = 70

All pr

.373211

Key: pr = Probability

adjust yields a predicted probability of 37.3%. That is pretty grim compared with our earlier estimate of 11%! The problem is that age2 (which has a negative effect) is not being handled correctly. Because the adjust command does not know that age2 is a function of age, it simply uses the mean of age2, which is 2558.92. But for a 70-year-old, age2 = 4900. R. Williams 317

If we instead use factor variables and the margins command, the correct results are easily obtained.

. logit diabetes i.black i.female age c.age#c.age, nolog Logistic regression Number of obs = 10335 LR chi2(4) = 381.03 Prob > chi2 = 0.0000 Log likelihood = -1808.5522 Pseudo R2 = 0.0953

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.black .7207406 .1266509 5.69 0.000 .4725093 .9689718 1.female .1566863 .0942032 1.66 0.096 -.0279486 .3413212 age .1324622 .0291223 4.55 0.000 .0753836 .1895408

c.age#c.age -.0007031 .0002753 -2.55 0.011 -.0012428 -.0001635

_cons -8.14958 .7455986 -10.93 0.000 -9.610926 -6.688233

. margins, at(age = 70) atmeans Adjusted predictions Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() at : 0.black = .8949202 (mean) 1.black = .1050798 (mean) 0.female = .4749879 (mean) 1.female = .5250121 (mean) age = 70

Delta-method Margin Std. Err. z P>|z| [95% Conf. Interval]

_cons .1029814 .0063178 16.30 0.000 .0905988 .115364

By using factor-variable notation, we let the margins command know that if age = 70, then age² = 4900, and it hence computes the predicted values correctly.
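The same number can be verified by hand from the coefficients above; the short sketch below (not in the original article) plugs the means shown in the margins legend, together with age = 70 and its square, into the inverse-logit function.

quietly logit diabetes i.black i.female age c.age#c.age, nolog
local xb = _b[_cons] + _b[1.black]*.1050798 + _b[1.female]*.5250121 ///
         + _b[age]*70 + _b[c.age#c.age]*70^2
display invlogit(`xb')    // about .103, matching the margins output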

5.2 Interaction terms

Now suppose we have an interaction term in our model, for example, female × age (femage). We want to compute the predicted probability of diabetes for a male who has average values on the other variables. We might do something like this:

. * Interaction term . logit diabetes black female age femage, nolog Logistic regression Number of obs = 10335 LR chi2(4) = 380.85 Prob > chi2 = 0.0000 Log likelihood = -1808.6405 Pseudo R2 = 0.0953

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

black .7112782 .1268575 5.61 0.000 .4626421 .9599144 female 1.358331 .4851999 2.80 0.005 .4073562 2.309305 age .0715351 .0063037 11.35 0.000 .05918 .0838902 femage -.0199143 .0078292 -2.54 0.011 -.0352593 -.0045693 _cons -7.140004 .3961599 -18.02 0.000 -7.916463 -6.363545

. adjust female = 0 black age femage, pr

Dependent variable: diabetes Equation: diabetes Command: logit Covariates set to mean: black = .10507983, age = 47.565844, femage = 25.050314 Covariate set to value: female = 0

All pr

.015345

Key: pr = Probability

Note that the adjust command is using femage = 25.05, which, as we saw earlier, is the mean value of femage in the sample. But that is obviously wrong: if female = 0, then femage also = 0.

Now let’s use factor variables and the margins command:

. logit diabetes i.black i.female age i.female#c.age, nolog Logistic regression Number of obs = 10335 LR chi2(4) = 380.85 Prob > chi2 = 0.0000 Log likelihood = -1808.6405 Pseudo R2 = 0.0953

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.black .7112782 .1268575 5.61 0.000 .4626421 .9599144 1.female 1.358331 .4851999 2.80 0.005 .4073562 2.309305 age .0715351 .0063037 11.35 0.000 .05918 .0838902

female#c.age 1 -.0199143 .0078292 -2.54 0.011 -.0352593 -.0045693

_cons -7.140004 .3961599 -18.02 0.000 -7.916463 -6.363545

. margins female black female#black, atmeans Adjusted predictions Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() at : 0.black = .8949202 (mean) 1.black = .1050798 (mean) 0.female = .4749879 (mean) 1.female = .5250121 (mean) age = 47.56584 (mean)

Delta-method Margin Std. Err. z P>|z| [95% Conf. Interval]

female 0 .0250225 .0027872 8.98 0.000 .0195597 .0304854 1 .0372713 .0029632 12.58 0.000 .0314635 .0430791

black 0 .0287052 .0020278 14.16 0.000 .0247307 .0326797 1 .0567715 .0067009 8.47 0.000 .0436379 .0699051

female#black 00 .0232624 .0026348 8.83 0.000 .0180983 .0284265 01 .0462606 .0068486 6.75 0.000 .0328376 .0596835 10 .0346803 .0028544 12.15 0.000 .0290857 .0402748 11 .0681786 .0083774 8.14 0.000 .0517592 .084598

This tells us that the average male (who is 0.105 black and 47.57 years old) has a predicted 2.5% chance of having diabetes. The average female (also 0.105 black and 47.57 years old) has a 3.7% chance. If we fail to take into account the fact that femage is a function of female and age, we underestimate the likelihood that men will have diabetes; that is, if we do it wrong, we estimate that the average male has a 1.5% probability of having diabetes when the correct estimate is 2.5%.

We also asked for information pertaining to race. This shows that the average white (who is 0.525 female and 47.57 years old) has a 2.9% chance of having diabetes, while for the average black the figure is nearly twice as high at 5.7%. The female#black notation on the margins command does not mean that an interaction term for race and gender has been added to the model. Rather, it simply causes the adjusted predictions for each combination of race and gender (based on the model that was fit) to be included in the output.

5.3 Multiple dummies

One other sort of interdependency not handled well by older commands is when multiple dummies are computed from a single categorical variable. For example, suppose we do not have the continuous variable age and instead have to use the categorical agegrp variables. We want to estimate the probability that the average person aged 70 or above has diabetes:

. * Multiple dummies . logit diabetes black female agegrp2 agegrp3 agegrp4 agegrp5 agegrp6, nolog Logistic regression Number of obs = 10335 LR chi2(7) = 368.98 Prob > chi2 = 0.0000 Log likelihood = -1814.575 Pseudo R2 = 0.0923

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

black .7250941 .1265946 5.73 0.000 .4769733 .9732148 female .1578264 .0941559 1.68 0.094 -.0267158 .3423686 agegrp2 .7139572 .3397881 2.10 0.036 .0479847 1.37993 agegrp3 1.685402 .3031107 5.56 0.000 1.091316 2.279488 agegrp4 2.223236 .2862673 7.77 0.000 1.662162 2.784309 agegrp5 2.674737 .2680303 9.98 0.000 2.149407 3.200066 agegrp6 2.999892 .2783041 10.78 0.000 2.454426 3.545358 _cons -5.242579 .2658865 -19.72 0.000 -5.763707 -4.721451

. adjust agegrp6 = 1 black female agegrp2 agegrp3 agegrp4 agegrp5, pr

Dependent variable: diabetes Equation: diabetes Command: logit Covariates set to mean: black = .10507983, female = .52501209, agegrp2 = .15674891, agegrp3 = .12278665, agegrp4 = .12472182, agegrp5 = .27595549 Covariate set to value: agegrp6 = 1

All pr

.320956

Key: pr = Probability R. Williams 321

According to the adjust command, the average person (average meaning 0.105 black and 0.525 female) who is age 70 or above has a 32.1% chance of having diabetes— far higher than our earlier estimates of around 10 or 11%. Being old may be bad for your health, but it is not that bad! As the logit results show, each older age group is more likely to have diabetes than the youngest age group. But each person only belongs to one age group; that is, if you have a score of 1 on agegrp6, you have to have a score of 0 on all the other age-group variables. adjust, on the other hand, is using the mean values for all the other age dummies (rather than 0), which causes the probability of having diabetes for somebody aged 70 or above to be greatly overestimated. Factor variables and margins again provide an easy means of doing things correctly.

. logit diabetes i.black i.female i.agegrp, nolog Logistic regression Number of obs = 10335 LR chi2(7) = 368.98 Prob > chi2 = 0.0000 Log likelihood = -1814.575 Pseudo R2 = 0.0923

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.black .7250941 .1265946 5.73 0.000 .4769733 .9732148 1.female .1578264 .0941559 1.68 0.094 -.0267158 .3423686

agegrp 2 .7139572 .3397881 2.10 0.036 .0479847 1.37993 3 1.685402 .3031107 5.56 0.000 1.091316 2.279488 4 2.223236 .2862673 7.77 0.000 1.662162 2.784309 5 2.674737 .2680303 9.98 0.000 2.149407 3.200066 6 2.999892 .2783041 10.78 0.000 2.454426 3.545358

_cons -5.242579 .2658865 -19.72 0.000 -5.763707 -4.721451 322 Using the margins command

. margins female black agegrp, atmeans Adjusted predictions Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() at : 0.black = .8949202 (mean) 1.black = .1050798 (mean) 0.female = .4749879 (mean) 1.female = .5250121 (mean) 1.agegrp = .2244799 (mean) 2.agegrp = .1567489 (mean) 3.agegrp = .1227866 (mean) 4.agegrp = .1247218 (mean) 5.agegrp = .2759555 (mean) 6.agegrp = .0953072 (mean)

Delta-method Margin Std. Err. z P>|z| [95% Conf. Interval]

female 0 .0280253 .0025121 11.16 0.000 .0231016 .0329489 1 .03266 .0027212 12.00 0.000 .0273266 .0379935

black 0 .0282075 .0021515 13.11 0.000 .0239906 .0324244 1 .0565477 .006821 8.29 0.000 .0431787 .0699166

agegrp 1 .0061598 .0015891 3.88 0.000 .0030453 .0092744 2 .0124985 .002717 4.60 0.000 .0071733 .0178238 3 .0323541 .0049292 6.56 0.000 .0226932 .0420151 4 .0541518 .0062521 8.66 0.000 .041898 .0664056 5 .082505 .0051629 15.98 0.000 .0723859 .092624 6 .1106978 .009985 11.09 0.000 .0911276 .130268

Similarly to our earlier results, the probability of having diabetes is much greater for an otherwise-average person aged 70 or above than it is for a similar person in his or her 20s. We also got the predicted values for average females, males, blacks, and whites. While these numbers are similar to before, the average person is no longer 47.57 years old. Rather, the average person now has a score of 0.224 on agegrp1, 0.157 on agegrp2, and so on.

To sum up, for many purposes both older and newer Stata commands like adjust and margins will work well, but margins is usually easier to use and more flexible. When variables are interdependent, for example, when the value of one or more variables completely determines the value of another, the margins command is clearly superior. You can try to include options with older commands to take into account the interdependencies, but it is generally easier (and probably less error-prone) if you use the new margins command instead.

6 Marginal effects

Marginal effects are another popular means by which the effects of variables in nonlinear models can be made more intuitively meaningful. As Cameron and Trivedi (2010, 343) note, “A marginal effect (ME), or partial effect, most often measures the effect on the conditional mean of y of a change in one of the regressors, say, x_j. In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, this is no longer the case, leading to remarkably many different methods for calculating MEs.”

Marginal effects for categorical independent variables are especially easy to understand.⁵ The ME for categorical variables shows how P(Y = 1) changes as the categorical variable changes from 0 to 1, after controlling in some way for the other variables in the model. With a dichotomous independent variable, the ME is the difference in the adjusted predictions for the two groups, for example, for blacks and whites.

There are different ways of controlling for the other variables in the model. Older Stata commands (for example, adjust and mfx) generally default to using the means for variables whose values have not been otherwise specified, that is, they estimate marginal effects at the means (MEMs). Presumably, the mean reflects the “average” or “typical” person on the variable. However, at least two other approaches are also possible with the margins command: average marginal effects (AMEs) and marginal effects at representative values (MERs). We now illustrate each of these approaches, with each building off of the following basic model.

. * Back to basic model . logit diabetes i.black i.female age, nolog Logistic regression Number of obs = 10335 LR chi2(3) = 374.17 Prob > chi2 = 0.0000 Log likelihood = -1811.9828 Pseudo R2 = 0.0936

diabetes Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.black .7179046 .1268061 5.66 0.000 .4693691 .96644 1.female .1545569 .0942982 1.64 0.101 -.0302642 .3393779 age .0594654 .0037333 15.93 0.000 .0521484 .0667825 _cons -6.405437 .2372224 -27.00 0.000 -6.870384 -5.94049

6.1 MEMs

MEMs are easily estimated with the margins command. The dydx() option tells margins which variables to compute MEs for. The atmeans option tells margins to use the mean values for other variables when computing the ME for a variable. For the same reasons as given before, it is important to use factor-variable notation so that Stata recognizes any interdependencies between variables. It is also important because MEs are computed differently for discrete and continuous independent variables.

5. See Cameron and Trivedi (2010) for a discussion of marginal effects for continuous variables.

. * MEMs - Marginal Effects at the Means . margins, dydx(black female) atmeans Conditional marginal effects Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() dy/dx w.r.t. : 1.black 1.female at : 0.black = .8949202 (mean) 1.black = .1050798 (mean) 0.female = .4749879 (mean) 1.female = .5250121 (mean) age = 47.56584 (mean)

Delta-method dy/dx Std. Err. z P>|z| [95% Conf. Interval]

1.black .0290993 .0066198 4.40 0.000 .0161246 .0420739 1.female .0047259 .0028785 1.64 0.101 -.0009158 .0103677

Note: dy/dx for factor levels is the discrete change from the base level.

The results tell us that if you had two otherwise-average individuals, one white, one black, the black’s probability of having diabetes would be 2.9 percentage points higher. And what do we mean by “average”? With MEMs, average is defined as having the mean value for the other independent variables in the model, that is, 47.57 years old, 10.5% black, and 52.5% female.

MEMs are easy to explain. With the atmeans option, we fix some variable values (for example, black = 1), compute the mean values for the other variables, and then use the fixed and mean values to compute predicted probabilities. The predicted values show us how the average female compares with the average male, where average is defined as having mean values on the other variables in the model.
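As with the AME replication shown later, the MEM for black can be reproduced by hand. The sketch below is not in the original article; it evaluates the inverse logit at the means of female and age reported above, with black set first to 0 and then to 1.

quietly logit diabetes i.black i.female age, nolog
local xb0 = _b[_cons] + _b[1.female]*.5250121 + _b[age]*47.56584    // black = 0
local xb1 = `xb0' + _b[1.black]                                     // black = 1
display "MEM for black = " invlogit(`xb1') - invlogit(`xb0')        // about .029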

MEMs have been widely used. Indeed, for a long time, MEMs were the only option with Stata, because that is all the old mfx command supported. But many do not like MEMs. While there are people who are 47.57 years old, there is nobody who is 10.5% black or 52.5% female. Further, the means are only one of many possible sets of values that could be used—and a set of values that no real person could actually have seems troublesome. For these and other reasons, many researchers prefer AMEs, which I describe next.

6.2 AMEs

Rather than use the means when computing predicted values, some argue it is best to use the actual observed values for the variables whose values are not otherwise fixed (which is the default asobserved option for the margins command). With atmeans, we fix the values of some variables (for example, black = 1) and then use the means for the other variables to compute predicted probabilities. With asobserved, we again fix the values for some variables, but for the other variables we use the observed values for each case. We then compute a predicted probability for each case with the fixed and observed values of variables, and then we average the predicted values. R. Williams 325

. * AMEs - Average Marginal Effects . margins, dydx(black female) Average marginal effects Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() dy/dx w.r.t. : 1.black 1.female

Delta-method dy/dx Std. Err. z P>|z| [95% Conf. Interval]

1.black .0400922 .0087055 4.61 0.000 .0230297 .0571547 1.female .0067987 .0041282 1.65 0.100 -.0012924 .0148898

Note: dy/dx for factor levels is the discrete change from the base level.

Intuitively, the AME for being black is computed as follows:

• Go to the first case. Treat that person as though he or she were white, regardless of what the person’s race actually is. Leave all other independent variable values as is. Compute the probability that this person (if he or she were white) would have diabetes.

• Now do the same thing but this time treating the person as though he or she were black.

• The difference in the two probabilities just computed is the ME for that case.

• Repeat the process for every case in the sample.

• Compute the average of all the MEs you have computed. This gives you the AME for being black.

If the margins command did not exist, it would be fairly straightforward to do the same computations using other Stata commands. Indeed, doing so can yield additional insights of interest.

. * Replicate AME for black without using margins
. clonevar xblack = black
. quietly logit diabetes i.xblack i.female age, nolog
. replace xblack = 0
(1086 real changes made)
. predict adjpredwhite
(option pr assumed; Pr(diabetes))
. replace xblack = 1
(10335 real changes made)
. predict adjpredblack
(option pr assumed; Pr(diabetes))
. generate meblack = adjpredblack - adjpredwhite

. summarize adjpredwhite adjpredblack meblack Variable Obs Mean Std. Dev. Min Max

adjpredwhite 10335 .0443248 .0362422 .005399 .1358214 adjpredblack 10335 .084417 .0663927 .0110063 .2436938 meblack 10335 .0400922 .0301892 .0056073 .1078724

With AMEs, you are in effect comparing two hypothetical populations—one all white, one all black—that have the exact same values on the other independent variables in the model. The logic is similar to that of a study, where subjects have identical values on every independent variable except one. Because the only difference between these two populations is their races, race must be the cause of the difference in their probabilities of having diabetes. Many people like the fact that all the data are being used, not just the means, and feel that this leads to superior estimates. Many, perhaps most, authors seem to prefer AMEs over MEMs (for example, Bartus [2005] and Cameron and Trivedi [2010]). Others, however, are not convinced that treating men as though they are women and women as though they are men really is a better way of computing MEs.

The biggest problem with both of the last two approaches, however, may be that they only produce a single estimate of the ME. No matter how “average” is defined, averages can obscure differences in effects across cases. In reality, the effect that variables like race have on the probability of success varies with the characteristics of the person; for example, racial differences could be much greater for older people than for younger. Indeed, in the example above, the summary statistics showed that while the AME for being black was 0.04, the ME for individual cases ranged between 0.006 and 0.108; that is, at the individual level, the largest ME for being black was almost 20 times as large as the smallest.

For these and other reasons, MERs will often be preferable to either of the alternatives already discussed.

6.3 MERs

With MERs, you choose ranges of values for one or more independent variables and then see how the MEs differ across that range. MERs can be intuitively meaningful, while showing how the effects of variables vary by other characteristics of the individual. The use of the at() option makes this possible.

. * Section 6.3: MERs - Marginal Effects at Representative Values . quietly logit diabetes i.black i.female age, nolog . margins, dydx(black female) at(age=(20 30 40 50 60 70)) vsquish Average marginal effects Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() dy/dx w.r.t. : 1.black 1.female 1._at : age = 20 2._at : age = 30 3._at : age = 40 4._at : age = 50 5._at : age = 60 6._at : age = 70

Delta-method dy/dx Std. Err. z P>|z| [95% Conf. Interval]

1.black _at 1 .0060899 .0016303 3.74 0.000 .0028946 .0092852 2 .0108784 .0027129 4.01 0.000 .0055612 .0161956 3 .0192101 .0045185 4.25 0.000 .0103541 .0280662 4 .0332459 .0074944 4.44 0.000 .018557 .0479347 5 .0555816 .0121843 4.56 0.000 .0317008 .0794625 6 .0877803 .0187859 4.67 0.000 .0509606 .1245999

1.female _at 1 .0009933 .0006215 1.60 0.110 -.0002248 .0022114 2 .00178 .0010993 1.62 0.105 -.0003746 .0039345 3 .003161 .0019339 1.63 0.102 -.0006294 .0069514 4 .0055253 .0033615 1.64 0.100 -.001063 .0121137 5 .0093981 .0057063 1.65 0.100 -.001786 .0205821 6 .0152754 .0092827 1.65 0.100 -.0029184 .0334692

Note: dy/dx for factor levels is the discrete change from the base level.

Earlier, the AME for being black was 4%; that is, on average blacks’ probability of having diabetes is four percentage points higher than it is for whites. But when we estimate MEs for different ages, we see that the effect of being black differs greatly by age. It is less than one percentage point for 20-year-olds and almost nine percentage points for those aged 70. This makes sense, because the probability of diabetes differs greatly by age; it would be unreasonable to expect every white to be four percentage points less likely to get diabetes than every black regardless of age. Similarly, while the AME for gender was only 0.6%, at different ages the effect is much smaller or much higher than that.

In a large model, it may be cumbersome to specify representative values for every variable, but you can do so for those of greatest interest. The atmeans or asobserved options can then be used to set the values of the other variables in the model.

7 Graphic displays of margins results: The marginsplot command

The output from the margins command can be very difficult to read. Because of space constraints, numbers are used to label categories rather than value labels. The marginsplot command introduced in Stata 12 makes it easy to create a visual display of results. Here are two simple examples:

. quietly logit diabetes i.black i.female age, nolog
. quietly margins, dydx(black female) at(age=(20 30 40 50 60 70)) vsquish
. marginsplot, noci

(Figure: “Average Marginal Effects” of 1.black and 1.female plotted against age in years; the vertical axis shows the effects on Pr(Diabetes).)

The graph makes it clear how the differences between blacks and whites, and between men and women, increase with age. Here is a slightly more complicated example that illustrates how marginsplot can also be used with adjusted predictions.

. * Plot of adjusted predictions
. quietly logit diabetes i.black i.female age i.female#c.age, nolog
. quietly margins female#black, at(age=(20 30 40 50 60 70))
. marginsplot, noci

(Figure: “Adjusted Predictions of female#black” plotted against age in years for each combination of female and black; the vertical axis shows Pr(Diabetes).)

The differences between blacks and whites and men and women are again clear. This model also included an interaction term for gender and age. The graphic shows that up until about age 70, women are more likely to get diabetes than their same-race male counterparts, but after that men are slightly more likely.

8 Marginal effects for interaction terms

People often ask what the ME of an interaction term is. Stata’s margins command replies: there is not one. You just have the MEs of the component terms. The value of the interaction term cannot change independently of the values of the component terms, so you cannot estimate a separate effect for the interaction. The older mfx command will report MEs for interaction terms, but the numbers it gives are wrong because mfx is not aware of the interdependencies between the interaction term itself and the variables used to compute the interaction term. 330 Using the margins command

. quietly logit diabetes i.black i.female age i.female#c.age, nolog . margins, dydx(*) Average marginal effects Number of obs = 10335 Model VCE : OIM Expression : Pr(diabetes), predict() dy/dx w.r.t. : 1.black 1.female age

Delta-method dy/dx Std. Err. z P>|z| [95% Conf. Interval]

1.black .0396176 .0086693 4.57 0.000 .022626 .0566092 1.female .0067791 .0041302 1.64 0.101 -.001316 .0148743 age .0026632 .0001904 13.99 0.000 .0022901 .0030364

Note: dy/dx for factor levels is the discrete change from the base level.

9 Other points

margins would also give the wrong answers if you did not use factor variables. You should use margins because older commands, like adjust and mfx, do not support the use of factor variables. margins supports the use of the svy: prefix with svyset data. Some older commands do not.

margins is, unfortunately, more difficult to use with multiple-outcome commands like ologit or mlogit. You have to specify a different margins command for each possible outcome of the dependent variable (see the sketch below). But this is also true of many older commands. It is my hope that future versions of margins will overcome this limitation. The ability to compute adjusted predictions and MEs for individual cases would also be a welcome addition to margins. Finally, both margins and marginsplot include numerous other options that can be used to further refine the analysis and the presentation of results.
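As an illustration of that pattern, the sketch below runs margins once per outcome level after an ordered logit; the ordered outcome y and the covariates x1 and x2 are hypothetical placeholders, not variables from this article's data.

ologit y x1 x2
margins, dydx(*) predict(outcome(1))
margins, dydx(*) predict(outcome(2))
margins, dydx(*) predict(outcome(3))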

10 Conclusion

Adjusted predictions and marginal effects can make the results from many analyses much more intuitive and easier to interpret. The margins command offers a generally superior alternative to the adjust and mfx commands that preceded it. It can estimate the same models and can generally do so more easily. Interdependencies between variables are easily handled, and the user has a choice between the atmeans and asobserved options.

The relative merits of atmeans versus asobserved continue to be debated. Clearly, many prefer the asobserved approach. They would rather compare hypothetical populations that have values that real people actually do have than compare hypothetical persons with mean values on variables that no real person could ever have. But however “typical” or “average” is defined, any approach that only looks at “typical” values is going to miss variability in effects across cases. Presenting MERs can make results easier to interpret and provide a better feel for how the effects of variables differ across cases.

11 Acknowledgment

Sarah Mustillo offered numerous helpful comments on earlier versions of this article.


About the author

Richard Williams is an associate professor and a former chairman of the Department of Sociology at the University of Notre Dame. His teaching and research interests include Methods and Statistics, Demography, and Urban Sociology. His work has appeared in the American Sociological Review, Social Forces, Stata Journal, Social Problems, Demography, Sociology of Education, Journal of Urban Affairs, Cityscape, Journal of Marriage and the Family, and Sociological Methods and Research. His recent research, which has been funded by grants from the Department of Housing and Urban Development and the National Science Foundation, focuses on the causes and consequences of inequality in American home ownership. He is a frequent contributor to Statalist.

The Stata Journal (2012) 12, Number 2, pp. 332–341

Speaking Stata: Transforming the time axis

Nicholas J. Cox
Department of Geography
Durham University
Durham City, UK
[email protected]

Abstract. The time variable is most commonly plotted precisely as recorded in graphs showing change over time. However, if the most interesting part of the graph is very crowded, then transforming the time axis to give that part more space is worth consideration. In this column, I discuss logarithmic scales, square root scales, and scale breaks as possible solutions.

Keywords: gr0052, axis scale, graphics, logarithm, scale break, square root, time, transformation

1 Introduction

Using transformed scales on graphs is a technique familiar to data analysts. Most commonly in Stata, axis scales on a twoway graph may be logarithmic. The options yscale(log) and xscale(log) provide wired-in support. In principle, other transformed scales are hardly more difficult: just calculate the transformation before graphing and, ideally, ensure that the axis labels remain easy to interpret. For example, a logit scale can be useful for proportions within (0, 1) or percentages within (0, 100). For that example and more general comments, see Cox (2008).

In the case of graphs plotting change over time, the time axis usually shows the time variable unchanged. In most sciences, time is plotted on the horizontal or x axis. In the earth sciences, time is sometimes plotted on the vertical or y axis, essentially because older materials are often found at greater depths. In either case, however, it is common to use a transformed scale for whatever response is plotted versus time. For example, with a logarithmic response scale, periods of exponential growth or decline would be shown by straight line segments.

In this column, I focus on something more unusual: transforming the time axis. The argument is in the first instance pragmatic. Sometimes data span an extended time range, and the amount of information or the pace of change is far greater at one end of the time range. If so, expanding the crowded part of the graph and shrinking the other part of the graph may help readers visualize the detail of change. If data come from cosmology or particle physics, blowing up the early part of a time range might be a good idea. If data come from the social sciences, especially economic or demographic history, blowing up the later part might be called for. In many examples, very recent growth has been explosive, but the precise details remain of interest. The same recommendation applies, but for different reasons, to many datasets from the earth sciences. Here information from earlier periods tends to be much sparser and indeed in most instances has long since been buried, blasted, or eroded away.

2 Logarithmic scale?

A logarithmic scale for time would show earlier times—all later than some origin, itself conventionally zero—more expansively. xscale(log) should suffice, assuming again that time is shown on the x axis. Conversely, a logarithmic scale for times, all those shown being before some origin, would do the same for later times. In the second case, you would need to calculate the logarithm of (origin − time) and devise intelligible labels. However, a major limitation in both cases is that the origin is not plottable because the logarithm of zero is indeterminate. A fix for this is to ensure that the origin is not a data point but is beyond the range of the data. Otherwise put, if time 0 is a data point, the origin must be offset. Karsten (1923, chap. 41) called plotting in terms of the logarithm of (offset + latest − time) “retrospective logarithmic projection” and gave economic examples. This common need for an offset is at best awkward and at worst leads to arbitrary choices that are difficult to justify or explain; a small sketch at the end of this section makes the point concrete.

It remains true that thinking logarithmically is highly desirable for thinking broadly about the ranges of time scales (and sizes, speeds, and numerosity) to be found in science. See, for example, the engaging essay “Life on log time” in Glashow (1991) or the more extended discussions in Schneider (2009).

Although using a logarithmic scale for either variable or both variables is likely to strike readers as an elementary and obvious technique, its history is in several ways surprisingly short (Funkhouser 1937, 359–361). Lalanne (1846) introduced double logarithmic scales for calculation problems: Hankins (1999, 71) reproduces his graph. In economics, Jevons (1863, 1884) plotted data for responses on a logarithmic scale versus time; for discussion, see Morgan (1990), Klein (1997), and Stigler (1999). Nevertheless, adoption of the idea was slow, and it was even resisted as obscuring the data. Expository articles urging the merits of the technique, often called semilogarithmic or ratio charts, were still appearing many years after its introduction (Fisher 1917; Field 1917; Griffin and Bowden 1963; Burke 1964).
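To make the offset problem concrete, here is a minimal sketch of such a retrospective logarithmic scale for a variable year whose latest value is 2010, as in the world population example below. The offset of 10 years is an arbitrary assumption and is exactly the kind of choice that is hard to justify.

. * sketch: the offset of 10 years is an arbitrary assumption
. generate logtime = log(10 + 2010 - year)

Axis labels in the original years can then be built with the same looping technique used for the square root scale in the next section.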

3 Square root scale?

In practice in data analysis, however, logarithmic transformations are likely to be too strong unless change on very different time scales is being considered. A weaker transformation that can work well is the square root, as recently suggested by Wills (2012, 171–172). The square root also has the simple advantage that the root of 0 is also 0 and is thus quite plottable.

The earliest example of a square root scale known to me is by Moseley (1914) in a classic physics problem that led to predictions of as yet undiscovered elements. Heilbron (1974, 100) reproduces his graph. Karsten (1923, 485) plotted the square root of response versus time in an example in which a logarithmic scale was clearly too severe. Fisher and Mather (1943) plotted frequency counts on a square root scale. Their idea was taken up in much more detail by Mosteller and Tukey (1949), and Fisher (1944, 1950) added corresponding material to the ninth and eleventh editions of his text Statistical Methods for Research Workers.

There are contexts ranging from physics to finance in which the square root of time appears naturally, or at least mathematically, in discussions, but we leave that to one side. In the example to follow, the justification for using square roots is, once more, just pragmatic.

Consider some data from Haywood (2011) on world population over the last 8,000 years, available with the media for this journal issue. As is customary, the cautions and caveats associated with such estimates should be flagged. Comprehensive censuses are relatively recent historically, and even long national experience in census-taking is no guarantee of population counts that all experts regard as accurate. Over most of human history, we have only a variety of wild guesses, and competent workers in the field often only agree if they are using each other's estimates. Let us look first at some standard plots.

. use haywood
. twoway connected pop year, msymbol(Oh)
. twoway connected pop year, msymbol(Oh) yscale(log)
>     ylabel(20 50 200 500 2000 5000)

Figure 1. World population over the last 8,000 years (data from Haywood [2011])

Figure 2. World population over the last 8,000 years; the vertical scale is logarithmic, so periods showing exponential change (constant growth rate) would plot as linear segments (data from Haywood [2011])

Figure 1 draws attention to the explosive nature of population growth over the last few centuries. It has a simple and powerful message, but detail is hard to discern. Figure 2 adds a logarithmic scale and shows a main pattern of accelerating growth rates with strong hints of other fluctuations, reflecting extended periods of higher death rates. For both graphs, using twoway connected, a hybrid of scatter and line, has a natural purpose of emphasizing the precise dates of data points. With more data or data regularly spaced in time, using line is likely to be a better choice.

That detail aside, figure 1 and figure 2 are standard graphs that are both useful, but neither deals with the problem of crowding of data for later times. Using a square root scale for time before the present requires two steps before we can create a graph. First, calculate the new time scale.

. generate time = sqrt(2010 - year)

The latest year for data is 2010. Clearly, other datasets would lead to different constants. Second, produce better axis labels. The key here is that any text you like can be shown as text in axis labels. In practice, we need to work out what labels we want to see and then where they would be put in terms of the new variable, here time. This can take some experimentation, but it is best done by looping over the labels we want and working out the option call that is needed.

. foreach y of num -6000(2000)2000 500 1000 1500 {
.     local call `call' `=sqrt(2010 - `y')' "`y'"
. }

Here the result is put into a local macro. Every time around the loop, we calculate the square root of 2010 minus some year and add that value and the text to be shown on the time axis—namely, the year itself—as extra content in the local macro. So the first time around, the year is −6000. We use Stata to calculate sqrt(2000 - (-6000)) on the fly and insert the result in the macro. The special syntax `=expression' instructs Stata to calculate the result of the expression and use that result immediately. If you do the calculation for −6000 yourself with display, you will find that the result is given as 89.442719. On-the-fly calculation gives more decimal places, far more than are really needed for graphical purposes, but no harm is done thereby. The calculation

89.44271909999159 "-6000" is added to the macro, which is created empty. The next time around, the macro will contain

89.44271909999159 "-6000" 77.45966692414834 "-4000" and so on. N. J. Cox 337
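If negative dates are not wanted (a point returned to at the end of this section), the same kind of loop can build text labels instead. This is an untested sketch of one possibility, using compound double quotes so that labels may contain spaces; the label wording and the split into two loops are choices of convenience, not part of the original example.

. * sketch: label wording and the split into two loops are assumptions
. local call
. foreach y of num -6000(2000)-2000 {
.     local call `call' `=sqrt(2010 - `y')' `"`=abs(`y')' BCE"'
. }
. foreach y of num 0(500)2000 {
.     local call `call' `=sqrt(2010 - `y')' "`y'"
. }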

If you do this yourself, make sure to blank out the local macro by typing

. local call

before revising any earlier guess. Otherwise, earlier insertions in the macro will just be carried forward.

Now the graph shown as figure 3 is possible. Reversing the scale is needed because the time variable increases going backward, at least if you want to show the direction of time in the usual manner.

. twoway connected pop time, msymbol(Oh) xscale(reverse) xlabel(`call')
>     xtitle(year)

Figure 3. World population over the last 8,000 years; the horizontal axis represents time before 2010 on a square root scale (data from Haywood [2011])

You can see the advantage of automating the production of the option call: you would not prefer to do all the little calculations one by one with a flurry of copy-and-paste to transfer results to the graph command.

Nothing stops the use of yscale(log) too, although the standard result that exponential change shows as linear segments no longer holds. My own preference is to stop at transforming the time scale. As usual, smaller graphical details remain open to choice. The axis title of “year” might seem redundant. A method other than negative dates might be preferred for showing years BCE. The y-axis labels could be rotated to horizontal.

4 Scale break?

Another popular solution to the problem of crowding for early or later times is to use a scale break so that the time series is divided into two or more panels with different scales. Scale breaks themselves divide statistically minded people. Some argue for them as a simple practical solution, so long as it is made explicit, and some argue against them as awkward, artificial, and usually unnecessary, because transformation is typically a better solution. Good discussions are given by Cleveland (1994) and Wilkinson (2005). For more discussion of scale breaks in Stata graphics, see Cox and Merryman (2006).

Stata does not support scale breaks directly, perhaps mostly because a graph with a scale break is not easily made consistent with the general philosophy of allowing graphs to be superimposed or combined with a common axis. But you can devise your own scale breaks by producing separate graphs before combining them again.

. twoway connected pop year if year < 0, msymbol(Oh) yscale(r(10 7000) log)
>     ylabel(10 20 50 100 200 500 1000 2000 5000, angle(h))
>     saving(part1, replace) xlabel(-6000(1000)0) xtitle("")
. twoway connected pop year if year > 0, ms(Oh) ysc(r(10 7000) log off)
>     yla(10 20 50 100 200 500 1000 2000 5000, ang(h))
>     saving(part2, replace) xla(250(250)2000) xtitle("")
. graph combine "part1" "part2", imargin(small) b1title(year, size(small))

The result is shown in figure 4. It is likely to qualify as many users’ best bet for this kind of problem, if only because it is easier to explain than other solutions.


Figure 4. World population over the last 8,000 years; scale break at change from BCE to AD (data from Haywood [2011])

Let us focus on some of the small details in these commands that might otherwise seem puzzling.

1. I tend to include the replace suboption of saving() even the first time around, because in practice I need several iterations before all my choices seem right. Nevertheless, do check that you are not losing a valuable graph from some previous project.

2. graph combine has a ycommon option. But with yscale(log) too, I get better results by being explicit in yscale() in both graph commands that the y axes have the same limits.

3. Why spell out the ylabel() call for the second graph if the labels are omitted by yscale(off)? The reason is that the label locations determine horizontal grid lines. (Naturally, if you did not want those grid lines, you should omit them.)

4. The title “year” should certainly not be given twice, so we blank it out in the individual graph commands and put it back (once) in the graph combine command.

5. The first panel ends up a bit smaller in this example as a side-effect of omitting the y axis and its labels from the second panel. That seems about right, but the relative sizes of the panels could be further controlled by using the fxsize() option with the initial graph commands, as sketched after this list. Cox and Merryman (2006) give an example.
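As a sketch of point 5, the fxsize() calls might look like the following; the percentages 60 and 40 are arbitrary assumptions, and the other options shown in the commands above are omitted here for brevity.

. * sketch: fxsize() percentages are arbitrary assumptions; other options omitted
. twoway connected pop year if year < 0, msymbol(Oh) fxsize(60)
>     saving(part1, replace)
. twoway connected pop year if year > 0, msymbol(Oh) fxsize(40)
>     saving(part2, replace)
. graph combine "part1" "part2", imargin(small) b1title(year, size(small))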

In principle, two or more scale breaks are possible, giving three or more panels. You just need to produce the individual graph panels as separate graphs and then combine them. But the trade-offs are clear. Each panel must be big enough to make clear what the different scales are, but without overloading the time axis with many different labels or ticks. For that and other reasons, even a graph with two scale breaks and three panels is likely to be too complicated for most purposes.

5 Conclusions

For showing change over time graphically, transformation almost always means transforming the response. Logarithmic scales are by far the most widely used example and need little publicity. In this column, I focused on a less common but intriguing possibility: transforming time. Although logarithmic scales have been used on occasion for time, they can be awkward at best. In contrast, a square root scale for time is easy to implement and can work well. Using a scale break to divide a graph into separate panels is also a possibility.

As always, when users have choices, they also have to make decisions on what to do. The question is likely to depend on specific circumstances of dataset, analysis, and audience, as well as researchers' general attitudes. In choosing any of the graphs here for a presentation, what the audience would find easiest to understand—or in some cases, what they would find most stimulating to consider—would be the first question to consider.

6 Acknowledgment

Stephen M. Stigler provided invaluable details for references to successive editions of R. A. Fisher’s text.

7 References

Burke, T. 1964. Semi-logarithmic graphs in geography: A pertinent addendum. Professional Geographer 16(1): 19–21.

Cleveland, W. S. 1994. The Elements of Graphing Data. Rev. ed. Summit, NJ: Hobart.

Cox, N. J. 2008. Stata tip 59: Plotting on any transformed scale. Stata Journal 8: 142–145.

Cox, N. J., and S. Merryman. 2006. Stata FAQ: How can I show scale breaks on graphs? http://www.stata.com/support/faqs/graphics/scbreak.html.

Field, J. A. 1917. Some advantages of the logarithmic scale in statistical diagrams. Journal of Political Economy 25: 805–841.

Fisher, I. 1917. The “ratio” chart for plotting statistics. Publications of the American Statistical Association 15: 577–601.

Fisher, R. A. 1944. Statistical Methods for Research Workers. 9th ed. Edinburgh: Oliver & Boyd.

———. 1950. Statistical Methods for Research Workers. 11th ed. Edinburgh: Oliver & Boyd.

Fisher, R. A., and K. Mather. 1943. The inheritance of style length in Lythrum salicaria. Annals of Eugenics 12: 1–23.

Funkhouser, H. G. 1937. Historical development of the graphical representation of statistical data. Osiris 3: 269–404.

Glashow, S. L. 1991. The Charm of Physics. New York: American Institute of Physics.

Griffin, D. W., and L. W. Bowden. 1963. Semi-logarithmic graphs in geography. Professional Geographer 15(5): 19–23.

Hankins, T. L. 1999. Blood, dirt, and nomograms: A particular history of graphs. Isis 90: 50–80.

Haywood, J. 2011. The New Atlas of World History: Global Events at a Glance. London: Thames & Hudson.

Heilbron, J. L. 1974. H. G. J. Moseley: The Life and Letters of an English Physicist, 1887–1915. Berkeley: University of California Press.

Jevons, W. S. 1863. A Serious Fall in the Value of Gold Ascertained, and Its Social Effects Set Forth. London: Edward Stanford.

———. 1884. Investigations in Currency and Finance. London: Macmillan.

Karsten, K. G. 1923. Charts and Graphs: An Introduction to Graphic Methods in the Control and Analysis of Statistics. New York: Prentice Hall.

Klein, J. L. 1997. Statistical Visions in Time: A History of Time Series Analysis 1662–1938. Cambridge: Cambridge University Press.

Lalanne, L. 1846. Mémoire sur les tables graphiques et sur la géométrie anamorphique appliquée à diverses questions qui se rattachent à l'art de l'ingénieur. Annales des Ponts et Chaussées 11: 1–69.

Morgan, M. S. 1990. The History of Econometric Ideas. Cambridge: Cambridge University Press.

Moseley, H. G. J. 1914. The high-frequency spectra of the elements. Part II. Philosophical Magazine, 6th ser., 27: 703–713.

Mosteller, F., and J. W. Tukey. 1949. The uses and usefulness of binomial probability paper. Journal of the American Statistical Association 44: 174–212.

Schneider, D. C. 2009. Quantitative Ecology. 2nd ed. London: Academic Press.

Stigler, S. M. 1999. Statistics on the Table: The History of Statistical Concepts and Methods. Cambridge, MA: Harvard University Press.

Wilkinson, L. 2005. The Grammar of Graphics. 2nd ed. New York: Springer.

Wills, G. 2012. Visualizing Time: Designing Graphical Representations for Statistical Data. New York: Springer.

About the author

Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks, postings, FAQs, and programs to the Stata user community. He has also coauthored 15 commands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor of the Stata Journal.

The Stata Journal (2012) 12, Number 2, pp. 342–344

Stata tip 108: On adding and constraining

Maarten L. Buis
Department of Sociology
Tübingen University
Tübingen, Germany
[email protected]

In many estimation commands, the constraint command (see [R] constraint) can impose linear constraints. The most common of these is the constraint that two or more regression coefficients are equal. A sometimes useful characteristic of models with that constraint is that they are equivalent to a model that includes the sum of the variables that are constrained. Consider the relevant part of a regression equation:

\[ \beta_1 x_1 + \beta_2 x_2 \]

If we constrain the effects of x1 and x2 to be equal, then we can replace β1 and β2 with β:

\[ \beta x_1 + \beta x_2 = \beta(x_1 + x_2) \]

One situation where this characteristic can be useful occurs when you have created a variable by adding several variables and you wonder whether that was a good idea. In the example below, there are three variables on the degree of trust a respondent has in the executive, legislative, and judicial branches of the U.S. federal government: confed, conlegis, and conjudge, respectively. These can take the values 0 (hardly any trust), 1 (only some trust), or 2 (a great deal of trust). I think that these three variables say something about the trust in the federal government, and I created a single variable that captures that, congov, which I use to predict whether a respondent voted for Barack Obama in the 2008 U.S. presidential election. This results in model sum1.

If I want to check whether adding these three confidence measures was a good idea, I can use the fact that adding variables is equivalent to constraining their effects to be equal. So you can operationalize the rather vague idea “adding these variables is a good idea” to the testable statement “the effects of these three variables are the same”. As a check, I first fit a model that constrains the effects to be equal. This is model constr1, and as expected, the resulting coefficients, standard errors, and log likelihood are exactly the same. I then fit a model with the three confidence variables without constraint, unconstr1. The resulting coefficients are very different from one another: the effects do not even have the same sign.1 A likelihood-ratio test also rejects the hypothesis that these variables have the same effect on voting for Obama. So adding the sum of the three confidence measures was not a good idea in this case.

1. These are odds ratios, so the sign is determined by whether the ratio is larger or smaller than 1.


. use gss10
(extract from the 2010 General Social Survey)
. generate byte congov = confed + conlegis + conjudge
(463 missing values generated)
. quietly logit obama congov, or nolog
. estimates store sum1
. constraint 1 confed = conlegis
. constraint 2 confed = conjudge
. quietly logit obama confed conlegis conjudge, or constraint(1 2) nolog
. estimates store constr1
. quietly logit obama confed conlegis conjudge, or nolog
. estimates store unconstr1
. estimates table sum1 constr1 unconstr1, stats(ll N) eform b(%9.3g) se(%9.3g)
>     stfmt(%9.4g)

----------------------------------------------------
    Variable |   sum1        constr1      unconstr1
-------------+--------------------------------------
      congov |    1.62
             |     .11
      confed |                  1.62          3.47
             |                   .11          .576
    conlegis |                  1.62          1.69
             |                   .11          .305
    conjudge |                  1.62          .674
             |                   .11          .107
       _cons |    .461          .461          .689
             |   .0833         .0833          .134
-------------+--------------------------------------
          ll |  -347.8        -347.8        -324.9
           N |     557           557           557
----------------------------------------------------
                                        legend: b/se

. lrtest constr1 unconstr1
Likelihood-ratio test                                 LR chi2(2)  =     45.77
(Assumption: constr1 nested in unconstr1)             Prob > chi2 =    0.0000

Another situation where this characteristic can be useful occurs when you have two or more ordinal or categorical variables that you want to combine. Consider the example below. In that example, I want to treat education as an ordinal variable, and I want to see the effect of “family educational background” on the educational attainment of the children. I think of family educational background as some sort of sum of the father's and mother's educations, but how do I create a sum of two ordinal variables? That is hard, but it is easy to consider the equivalent model that constrains the effects of father's education to be equal to the effects of mother's education.

In this example, the effects of mother's and father's educations are fairly similar, and the test of the hypothesis that they are equal cannot be rejected (compare unconstr2 with constr2). It also shows that constraining effects to be the same is equivalent to adding the sums of the indicator variables (compare constr2 with sum2).

. quietly ologit degree i.madeg i.padeg, or nolog
. estimates store unconstr2
. constraint 1 1.madeg = 1.padeg
. constraint 2 2.madeg = 2.padeg
. quietly ologit degree i.madeg i.padeg, or constraint(1 2) nolog
. estimates store constr2
. generate byte p_hs = 1.madeg + 1.padeg
(329 missing values generated)
. generate byte p_mths = 2.madeg + 2.padeg
(329 missing values generated)
. quietly ologit degree p_hs p_mths, or nolog
. estimates store sum2
. estimates table unconstr2 constr2 sum2, stats(ll N) eform b(%9.3g) se(%9.3g)
>     stfmt(%9.4g) keep(degree:)

----------------------------------------------------
    Variable |  unconstr2      constr2        sum2
-------------+--------------------------------------
       madeg |
          1  |     2.5           2.17
             |    .449           .199
          2  |    4.89           5.51
             |    1.26           .691
             |
       padeg |
          1  |    1.88           2.17
             |    .322           .199
          2  |    6.08           5.51
             |    1.46           .691
             |
        p_hs |                                2.17
             |                                .199
      p_mths |                                5.51
             |                                .691
-------------+--------------------------------------
          ll |  -799.1         -800.5        -800.5
           N |     972            972           972
----------------------------------------------------
                                        legend: b/se

. lrtest unconstr2 constr2
Likelihood-ratio test                                 LR chi2(2)  =      2.79
(Assumption: constr2 nested in unconstr2)             Prob > chi2 =    0.2484

The Stata Journal (2012) 12, Number 2, pp. 345–346

Stata tip 109: How to combine variables with missing values

Peter A. Lachenbruch
Oregon State University (retired)
Corvallis, OR
[email protected]

A common problem in data management is combining two or more variables with missing values to get a single variable with as many nonmissing values as possible. Typically, this problem arises when what should be the same variable has been named differently in different datasets. Stata's treatment of missing values means that the combination needs a little care, although there are several quite easy solutions.

To make this situation concrete, consider a recent experience of mine. Two datasets were created. A variable resp_ill was created in one dataset, and the corresponding variable moxResp was created in the other. After appending the data with the append command, there were missing values on resp_ill if the data came from the second dataset, and there were missing values on moxResp if the data came from the first dataset. The total number of observations is 782. Here is a short listing of some of the observations:

. list if inrange(_n, 1, 5) | inrange(_n, 340, 344), separator(0)

       +--------------------+
       | resp_ill   moxResp |
       |--------------------|
    1. |       no         . |
    2. |       no         . |
    3. |       no         . |
    4. |       no         . |
    5. |       no         . |
  340. |        .        no |
  341. |        .        no |
  342. |        .        no |
  343. |        .        no |
  344. |        .       yes |
       +--------------------+

The variables are indicators of whether the patient had a respiratory illness. They have been labeled as “no” for 0 and “yes” for 1.

It is easy to see or to say that the variables should be renamed (with the rename command) for consistency. I could have done that in one of the original datasets and then repeated the append command. In practice, it is often quicker to fix the problem in the current dataset. Sometimes, maintaining dataset integrity is sufficiently important to rule out the rename solution altogether.

Experimentation will confirm that adding the two variables is not the answer, because the missing values are propagated: for Stata, nonmissing plus missing is still missing. Doing that with a replace of one variable rather than a generate would be an even worse idea because any original nonmissing values would be lost in the variable that was replaced. What we want is to ignore the missings, and there are several ways to do that. One explicit solution is

. generate newind = min(resp_ill, moxResp)

The pairwise minimum will be whichever nonmissing value is present. What is less intuitive is that the function max() will work in the same way because it follows the same principle of ignoring missings to the extent possible, just as, say, summarize returns the maximum nonmissing value and not missings. So even though Stata has a general rule that missings count higher than nonmissings, in the case of the max() function, that rule is trumped by the principle that missings are ignored as far as possible. Another solution is

. generate newind = cond(missing(resp_ill), moxResp, resp_ill)

In this particular case, it was known that the missings in one variable corresponded to the nonmissings in the other. In other circumstances, it would be prudent to check whether there were observations with nonmissing values on both variables, say, by typing

. list if var1 < . & var2 < .
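An equivalent quick check that scales well to large datasets could use count or assert; this is only a sketch, not part of the original workflow.

. * sketch: equivalent checks for observations nonmissing on both variables
. count if !missing(resp_ill) & !missing(moxResp)
. assert missing(resp_ill) | missing(moxResp)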

Another solution that you may know is to use egen's rowtotal() function, which by default ignores missings in forming a row total and so gives a single nonmissing result for one missing and one nonmissing argument. Using egen can be convenient, especially if you have more than two variables to combine, but it does impose an extra overhead because there is more code for Stata to interpret. For one-off tasks and small- or moderate-size datasets, you would have to strain to detect that overhead, but for repeated data management and large datasets, using min() or max() is preferable.

The Stata Journal (2012) 12, Number 2, pp. 347–351

Stata tip 110: How to get the optimal k-means cluster solution

Anna Makles
Schumpeter School of Business and Economics
University of Wuppertal
Wuppertal, Germany
[email protected]

The k-means cluster algorithm is a well-known partitional clustering method but is also widely used as an iterative or exploratory clustering method within unsupervised learning procedures (Hastie, Tibshirani, and Friedman 2009, chap. 14). When the number of clusters is unknown, several k-means solutions with different numbers of groups k (k = 1, ..., K) are computed and compared. To detect the clustering with the optimal number of groups k* from the set of K solutions, we typically use a scree plot and search for a kink in the curve generated from the within sum of squares (WSS) or its logarithm [log(WSS)] for all cluster solutions. Another criterion for detecting the optimal number of clusters is the η² coefficient, which is quite similar to the R², or the proportional reduction of error (PRE) coefficient (Schwarz 2008, 72):

\[ \eta_k^2 = 1 - \frac{\mathrm{WSS}(k)}{\mathrm{WSS}(1)} = 1 - \frac{\mathrm{WSS}(k)}{\mathrm{TSS}} \qquad \forall\; k \in K \]

\[ \mathrm{PRE}_k = \frac{\mathrm{WSS}(k-1) - \mathrm{WSS}(k)}{\mathrm{WSS}(k-1)} \qquad \forall\; k \ge 2 \]

Here WSS(k) [WSS(k − 1)] is the WSS for cluster solution k (k − 1), and WSS(1) is the WSS for cluster solution k = 1, that is, for the nonclustered data. $\eta_k^2$ measures the proportional reduction of the WSS for each cluster solution k compared with the total sum of squares (TSS). In contrast, $\mathrm{PRE}_k$ illustrates the proportional reduction of the WSS for cluster solution k compared with the previous solution with k − 1 clusters.

Because the cluster kmeans command does not store any results in e(), we must use the same trick as in the cluster stop ado-file for hierarchical clustering to gather the information on the WSS for different cluster solutions. The following example uses 20 different cluster solutions, k = 1, ..., 20, and physed.dta, which measures different characteristics of 80 students and is discussed in [MV] cluster kmeans and kmedians. The dataset is available at

. use http://www.stata-press.com/data/r12/physed

After the variables flexibility, speed, and strength are standardized by typing

. local list1 "flex speed strength"
. foreach v of varlist `list1' {
  2.     egen z_`v' = std(`v')
  3. }

we calculate 20 cluster solutions with random starting points and store the results in name(clname):

. local list2 "z_flex z_speed z_strength"
. forvalues k = 1(1)20 {
  2.     cluster kmeans `list2', k(`k') start(random(123)) name(cs`k')
  3. }

To gather the WSS of each cluster solution cs`k', we calculate an ANOVA using the anova command, where cs`k' is the cluster variable. anova stores the residual sum of squares for the chosen variable within the defined groups in cs`k' in e(rss), which is exactly the same as the variable's sum of squares within the clusters. To collect the information on all cluster solutions, we generate a 20 × 5 matrix to store the WSS, its logarithm, and both coefficients for every cluster solution k.

. * WSS matrix
. matrix WSS = J(20,5,.)
. matrix colnames WSS = k WSS log(WSS) eta-squared PRE
. * WSS for each clustering
. forvalues k = 1(1)20 {
  2.     scalar ws`k' = 0
  3.     foreach v of varlist `list2' {
  4.         quietly anova `v' cs`k'
  5.         scalar ws`k' = ws`k' + e(rss)
  6.     }
  7.     matrix WSS[`k', 1] = `k'
  8.     matrix WSS[`k', 2] = ws`k'
  9.     matrix WSS[`k', 3] = log(ws`k')
 10.     matrix WSS[`k', 4] = 1 - ws`k'/WSS[1,2]
 11.     matrix WSS[`k', 5] = (WSS[`k'-1,2] - ws`k')/WSS[`k'-1,2]
 12. }

Finally, we use the columns of the output matrix WSS and the _matplot command to produce plots of the calculated statistics.

. matrix list WSS

WSS[20,5]
             k        WSS   log(WSS)  eta-squared        PRE
 r1          1        237  5.4680601            0          .
 r2          2  89.351871  4.4925822    .62298789  .62298789
 r3          3  56.208349  4.0290653    .76283397   .3709326
 r4          4  16.471059  2.8016049    .93050186  .70696419
 r5          5  13.823239  2.6263512     .9416741  .16075591
 r6          6  12.737676  2.5445642    .94625453  .07853172
  (output omitted)

. local squared = char(178)
. _matplot WSS, columns(2 1) connect(l) xlabel(#10) name(plot1, replace) nodraw
>     noname
. _matplot WSS, columns(3 1) connect(l) xlabel(#10) name(plot2, replace) nodraw
>     noname
. _matplot WSS, columns(4 1) connect(l) xlabel(#10) name(plot3, replace) nodraw
>     noname ytitle({&eta}`squared')

. _matplot WSS, columns(5 1) connect(l) xlabel(#10) name(plot4, replace) nodraw
>     noname
(1 points have missing coordinates)
. graph combine plot1 plot2 plot3 plot4, name(plot1to4, replace)

The results indicate clustering with k = 4 to be the optimal solution. At k = 4, there is a kink in the WSS and log(WSS), respectively. $\eta_4^2$ points to a reduction of the WSS by 93% and $\mathrm{PRE}_4$ to a reduction of about 71% compared with the k = 3 solution. However, the reduction in WSS is negligible for k > 4.
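As an arithmetic check against the matrix listed above (values rounded),

\[ \eta_4^2 = 1 - \frac{16.47}{237} \approx 0.93, \qquad \mathrm{PRE}_4 = \frac{56.21 - 16.47}{56.21} \approx 0.71 \]

which match the eta-squared and PRE entries in row r4.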

Figure 1. WSS, log(WSS), η², and PRE for all K cluster solutions

In figure 2, we see a scatterplot matrix of the standardized variables for the four-cluster solution, which indicates the four distinct groups of students.

. graph matrix z_flex z_speed z_strength, msym(i) mlab(cs4) mlabpos(0)
>     name(matrixplot, replace)


Figure 2. Scatterplot matrix of the standardized variables for the four-cluster solution

Although the results seem quite clear, this is not always the case. The results of a traditional k-means algorithm always depend on the chosen initialization (that is, the initial cluster centers) and, of course, the data. Figure 3 again shows results for physed.dta but for 50 different starting points. Here our optimal solution with four clusters occurs 37 times (75%). Ten (20%) results point to the five-cluster solution as the optimal number of groups. Hence, depending on the initialization, natural clusters may be divided into subgroups, or sometimes no kink is even visible. The best way to evaluate the chosen solution is therefore to repeat the clustering several times with different starting points and then compare the different solutions as done here.

Figure 3. Fifty different WSS, log(WSS), η², and PRE curves for K = 20
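A sketch of how such a comparison might be automated follows, assuming the standardized z_* variables created above; restricting attention to three seeds and collecting only the WSS are simplifications made for brevity, not part of the original analysis.

. * sketch: the seeds 1-3 and the matrix layout are illustrative assumptions
. local vars "z_flex z_speed z_strength"
. matrix WSSmany = J(20, 3, .)
. forvalues s = 1/3 {
.     forvalues k = 1(1)20 {
.         cluster kmeans `vars', k(`k') start(random(`s')) name(r`s'_`k')
.         scalar w = 0
.         foreach v of varlist `vars' {
.             quietly anova `v' r`s'_`k'
.             scalar w = w + e(rss)
.         }
.         matrix WSSmany[`k', `s'] = w
.     }
. }
. matrix list WSSmany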

References

Hastie, T., R. J. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

Schwarz, A. 2008. Lokale Scoring-Modelle. Lohmar, Germany: Eul Verlag.

The Stata Journal (2012) 12, Number 2, p. 352

Software Updates

dm0042_1: Speaking Stata: Distinct observations. N. J. Cox and G. M. Longton. Stata Journal 8: 557–568.

Options have been added to restrict output to variables with a minimum or maximum of distinct values. Thus max(10) restricts output to variables with 10 or fewer distinct values.

st0087_1: Boosted regression (boosting): An introductory tutorial and a Stata plugin. M. Schonlau. Stata Journal 5: 330–354.

A new plugin is supplied that supports the 64-bit architecture under Windows. Previously, only 32-bit Windows was supported. In addition to the R² value computed for the test data, the output also includes an R² value for the training data for comparison. Some memory leaks were fixed, although those leaks did not previously affect numeric results.

st0150_4: A Stata package for the estimation of the dose–response function through adjustment for the generalized propensity score. M. Bia and A. Mattei. Stata Journal 10: 692; Stata Journal 9: 652; Stata Journal 8: 594; Stata Journal 8: 354–373.

There are two changes in this update. An error has been corrected that affected the estimation of the treatment effect function when log transformation was used through the options t_transf() and delta() together. Weights are now properly included in the estimation of the generalized propensity score.

st0243_1: Causal mediation analysis. R. Hicks and D. Tingley. Stata Journal 11: 605–619.

The updated files correct an error in the code for the medsens command. Previously, the command had hardcoded the number of independent variables. Running the command with fewer than three independent variables in the mediate equation and fewer than four variables in the outcome equation returned a conformability error. The update also allows for an interaction between the mediator and treatment variables in the medeff command. Finally, the update changes how weights are handled to allow the command to be used more easily with different flavors of Stata. If no weights were specified, a matrix of 1s was previously created, but Stata's limit on matrix sizes would be an issue if that was exceeded by the number of observations. The vector of 1s is now created as a temporary variable if no weight variable is specified, and thus that limit no longer is a constraint.
