The Journal Volume 18 Number 1 2018


A Stata Press publication StataCorp LLC College Station, Texas

The Stata Journal

Editors
H. Joseph Newton, Department of Statistics, Texas A&M University, College Station, Texas, [email protected]
Nicholas J. Cox, Department of Geography, Durham University, Durham, UK, [email protected]

Associate Editors
Christopher F. Baum, Boston College; Nathaniel Beck, New York University; Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy; Maarten L. Buis, University of Konstanz, Germany; A. Colin Cameron, University of California–Davis; Mario A. Cleves, University of Arkansas for Medical Sciences; Michael Crowther, University of Leicester, UK; William D. Dupont, Vanderbilt University; Philip Ender, University of California–Los Angeles; James Hardin, University of South Carolina; Ben Jann, University of Bern, Switzerland; Stephen Jenkins, London School of Economics and Political Science; Ulrich Kohler, University of Potsdam, Germany; Frauke Kreuter, Univ. of Maryland–College Park; Peter A. Lachenbruch, Oregon State University; Stanley Lemeshow, Ohio State University; J. Scott Long, Indiana University; Roger Newson, Imperial College, London; Austin Nichols, Abt Associates, Washington, DC; Marcello Pagano, Harvard School of Public Health; Sophia Rabe-Hesketh, Univ. of California–Berkeley; J. Patrick Royston, MRC CTU at UCL, London, UK; Mark E. Schaffer, Heriot-Watt Univ., Edinburgh; Philippe Van Kerm, LISER, Luxembourg; Vincenzo Verardi, Université Libre de Bruxelles, Belgium; Ian White, MRC CTU at UCL, London, UK; Richard A. Williams, University of Notre Dame; Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager: Lisa Gilmore
Stata Press Copy Editors: Adam Crawley, David Culwell, and Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository papers that link the use of Stata commands or programs to associated principles, such as those that will serve as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go “beyond the Stata manual” in explaining key features or uses of Stata that are of interest to intermediate or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could be of interest or usefulness to researchers, especially in fields that are of practical importance but are not often included in texts or other journals, such as the use of Stata in managing datasets, especially large datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata with topics such as extended examples of techniques and interpretation of results, simulations of statistical concepts, and overviews of subject areas.

The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behavioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch), Scopus, and Social Sciences Citation Index.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com

Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone 979-696-4600 or 800-782-8272, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

                                          U.S. and Canada    Elsewhere
Printed & electronic
1-year subscription                              $124          $154
2-year subscription                              $224          $284
3-year subscription                              $310          $400
1-year student subscription                      $ 89          $119
1-year institutional subscription                $375          $405
2-year institutional subscription                $679          $739
3-year institutional subscription                $935          $1,025

Electronic only
1-year subscription                              $ 89          $ 89
2-year subscription                              $162          $162
3-year subscription                              $229          $229
1-year student subscription                      $ 62          $ 62

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX 77845, USA, or emailed to [email protected].

Copyright © 2018 by StataCorp LLC

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp LLC. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions. This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber. Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting files understand that such use is made without warranty of any kind, by either the Stata Journal, the author, or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote free communication among Stata users. The Stata Journal, electronic version (ISSN 1536-8734), is a publication of Stata Press. Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC.

Articles and Columns                                                           1

Announcement of the Stata Journal Editors’ Prize 2018 . . . . . . 1
Power and sample-size analysis for the Royston–Parmar combined test in clinical trials with a time-to-event outcome . . . . . . P. Royston 3
Unit-root tests based on forward and reverse Dickey–Fuller regressions . . . . . . J. Otero and C. F. Baum 22
validscale: A command to validate measurement scales . . . . . . B. Perrot, E. Bataille, and J.-B. Hardouin 29
A command for fitting mixture regression models for bounded dependent variables using the beta distribution . . . . . . L. A. Gray and M. Hernández Alava 51
Testing for serial correlation in fixed-effects panel models . . . . . . J. Wursten 76
ldagibbs: A command for topic modeling in Stata using latent Dirichlet allocation . . . . . . C. Schwarz 101
Exploring marginal treatment effects: Flexible estimation using Stata . . . . . . M. E. Andresen 118
Fitting and interpreting correlated random-coefficient models using Stata . . . . . . O. Barriga Cabanillas, J. D. Michler, A. Michuda, and E. Tjernström 159
heckroccurve: ROC curves for selected samples . . . . . . J. A. Cook and A. Rajbhandari 174
Panel unit-root tests for heteroskedastic panels . . . . . . H. Herwartz, S. Maxand, F. H. C. Raters, and Y. M. Walle 184
Some commands to help produce Rich Text Files from Stata . . . . . . M. S. Gillman 197
Standard-error correction in two-stage optimization models: A quasi–maximum likelihood estimation approach . . . . . . F. Rios-Avila and G. Canavire-Bacarreza 206
Assessing the health economic agreement of different data sources . . . . . . D. Gallacher and F. Achana 223
Manipulation testing based on density discontinuity . . . . . . M. D. Cattaneo, M. Jansson, and X. Ma 234
Speaking Stata: Logarithmic binning and labeling . . . . . . N. J. Cox 262

Notes and Comments                                                           287

Stata tip 129: Efficiently processing textual data with Stata’s new Unicode features . . . . . . A. Koplenig 287

Software Updates                                                             290

The Stata Journal (2018) 18, Number 1, pp. 1–2

Announcement of the Stata Journal Editors’ Prize 2018

The editors of the Stata Journal are pleased to invite nominations for their 2018 prize in accordance with the following rules. Nominations should be sent as private email to [email protected] by July 31, 2018.

1. The Stata Journal Editors’ Prize is awarded annually to one or more authors of a specified paper or papers published in the Stata Journal in the previous three years.

2. The prize will consist of a framed certificate and an honorarium of U.S. $1,000, courtesy of the publisher of the Stata Journal. The prize may be awarded in person at a Stata Conference or Stata Users Group meeting of the recipient’s or recipients’ choice or as otherwise arranged.

3. Nominations for the prize in a given year will be requested in the Stata Journal in the first issue of each year and simultaneously through announcements on the Stata Journal website and on Statalist. Nominations should be sent to the editors by private email to [email protected] by July 31 in that year. The recipient(s) will be announced in the Stata Journal in the last issue of each year and simultaneously through announcements on the Stata Journal website and on Statalist.

4. Nominations should name the author(s) and one or more papers published in the Stata Journal in the previous three years and explain why the work concerned is worthy of the prize. The precise time limits will be the annual volumes of the Stata Journal, so that, for example, the prize for 2018 will be for work published in the annual volumes for 2015, 2016, or 2017. The rationale might include originality, depth, elegance, or unifying power of work; usefulness in cracking key problems or allowing important new methodologies to be widely implemented; and clarity or expository excellence of the work. Comments on the excellence of the software will also be appropriate when software was published with the paper(s). Nominations might include evidence of citations or downloads or of impact either within or outside the community of Stata users. These suggestions are indicative rather than exclusive, and any special or unusual merits of the work concerned may naturally be mentioned. Nominations may also mention, when relevant, any body of linked work published in other journals or previously in the Stata Journal or Stata Technical Bulletin. Work on any or all of statistical analysis, data management, statistical graphics, and Stata or Mata programming may be nominated.


5. Nominations will be considered confidential both before and after award of the prize. Neither anonymous nor public nominations will be accepted. Authors may not nominate themselves and so doing will exclude those authors from consideration. The editors of the Stata Journal may not be nominated. Employees of StataCorp may not be nominated. Such exclusions apply to any person with such status at any time between January 1 of the year in question and the announcement of the prize. The associate editors of the Stata Journal may be nominated.

6. The recipient(s) of the award will be selected by the editors of the Stata Journal, who reserve the right to take advice in confidence from appropriate persons, subject to such persons not having been nominated themselves in the same year. The editors’ decision is final and not open to discussion.

Previous awards of the Prize were to David Roodman (2012); Erik Thorlund Parner and Per Kragh Andersen (2013); Roger Newson (2014); Richard Williams (2015); Patrick Royston (2016); and Ben Jann (2017). For full details, please see Stata Journal 12: 571–574 (2012); Stata Journal 13: 669–671 (2013); Stata Journal 14: 703–707 (2014); Stata Journal 15: 901–904 (2015); Stata Journal 16: 815–825 (2016); and Stata Journal 17: 781–785 (2017).

H. Joseph Newton and Nicholas J. Cox
Editors, Stata Journal

The Stata Journal (2018) 18, Number 1, pp. 3–21

Power and sample-size analysis for the Royston–Parmar combined test in clinical trials with a time-to-event outcome

Patrick Royston
MRC Clinical Trials Unit
University College London
London, UK
[email protected]

Abstract. Randomized controlled trials with a time-to-event outcome are usually designed and analyzed assuming proportional hazards (PH) of the treatment effect. The sample-size calculation is based on a log-rank test or the nearly identical Cox test, henceforth called the Cox/log-rank test. Nonproportional hazards (non-PH) has become more common in trials and is recognized as a potential threat to interpreting the trial treatment effect and the power of the log-rank test—hence to the success of the trial. To address the issue, in 2016, Royston and Parmar (BMC Medical Research Methodology 16: 16) proposed a “combined test” of the global null hypothesis of identical survival curves in each trial arm. The Cox/log-rank test is combined with a new test derived from the maximal standardized difference in restricted mean survival time (RMST) between the trial arms. The test statistic is based on evaluations of the between-arm difference in RMST over several preselected time points. The combined test involves the minimum p-value across the Cox/log-rank and RMST-based tests, appropriately standardized to have the correct distribution under the global null hypothesis. In this article, I introduce a new command, power ct, that uses simulation to implement power and sample-size calculations for the combined test. power ct supports designs with PH or non-PH of the treatment effect. I provide examples in which the power of the combined test is compared with that of the Cox/log-rank test under PH and non-PH scenarios. I conclude by offering guidance for sample-size calculations in time-to-event trials to allow for possible non-PH.

Keywords: st0510, power ct, randomized controlled trial, time-to-event outcome, restricted mean survival time, log-rank test, Cox test, combined test, treatment effect, hypothesis testing, flexible parametric model

1 Introduction

In randomized controlled trials with a time-to-event outcome, nonproportional hazards (non-PH) is increasingly recognized as a potentially important issue. In one investigation, statistically significant non-PH was found in about a quarter of cancer trials (Trinquart et al. 2016). Nonstatistically significant non-PH, still of potential practical importance, is likely to occur in a greater proportion of trials, particularly as trial sample sizes get larger.


Concerns about using the hazard ratio (HR) as a summary measure and as the basis of the Cox/log-rank test of the treatment effect in such trials include poor interpretability and possible loss of power. The difference (or ratio) in restricted mean survival time (RMST) between treatment groups is gaining popularity as a summary measure and as the basis of a possible test of a treatment effect (for example, Dehbi, Royston, and Hackshaw [2017]). RMST at some time point (t* > 0) is the integral of the survival function at t*, that is, the “area under the survival curve” from 0 to t*. It is interpreted as the mean of the survival-time distribution truncated at t*. The difference, ΔRMST, defined as RMST in a research arm minus RMST in the control arm, is the integrated difference between the survival functions, equal to the (signed) area between the survival curves up to t*. Details and an implementation of RMST and ΔRMST in the community-contributed commands strmst and strmst2 may be found in Royston (2015) and Cronin, Tian, and Uno (2016), respectively.

Briefly, in a two-arm trial, consider testing the “global” null hypothesis H0: S0(t) = S1(t) for any t > 0, where Sj(t) is the survival function in the jth group (j = 0, 1) and j = 0 denotes the control group. Royston and Parmar (2016) proposed a test of H0 based on the separation of the integrated survival curves. It involves evaluating the maximal chi-squared statistic Cmax = max Z² over several time points, where Z = ΔRMST/SE(ΔRMST) is the standardized difference in RMST at a given time point. Arguing pragmatically, Royston and Parmar (2016) determined Cmax over 10 equally spaced values of time t* between the 30th and 100th centiles of the failure times in the dataset. Starting with Cmax, they developed an approach to testing H0 that they called the “combined test”. The p-value for the combined test is the smaller of the p-value for the Cox/log-rank test and the multiplicity-corrected p-value for Cmax. In simulation studies, the combined test was shown to be more powerful than the Cox/log-rank test when an “early” effect of treatment was present and only slightly less powerful under PH. The combined test, as implemented in the community-contributed command stctest, is described in Royston (2017b). To my knowledge, power and sample size for the combined test are not computable in closed form. The purpose of this article is to present a command, power ct, for exploring the power and sample size for the combined test. The command power ct uses simulation of possibly censored survival times to estimate power or sample size based on a given trial design. A related helper command, stcapture (Royston 2017a), available on Statistical Software Components, outputs survival functions and time-dependent HRs estimated from a dataset in memory. Results from stcapture may be fed directly to power ct in the form of stored macros. This facilitates exploration of the operating characteristics of the combined and Cox/log-rank tests under realistic patterns of survival and HR functions.
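In compact notation (a restatement of the definitions above; the multiplicity correction for Cmax is described in Royston and Parmar [2016] and is not reproduced here), with ΔRMST(t*) = ∫ from 0 to t* of {S1(u) − S0(u)} du, the ingredients of the combined test at each candidate time point t* are

    Z(t*) = ΔRMST(t*) / SE{ΔRMST(t*)},    Cmax = max over the 10 grid values of t* of Z(t*)²,

and the combined-test p-value is p_CT = min{p_Cox/log-rank, p̃_Cmax}, where p̃_Cmax denotes the multiplicity-corrected p-value for Cmax.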

This article proceeds as follows: In section 2, I describe estimation of RMST and my simulation-based approach to estimate power and sample size for the combined test. In section 3, I introduce the new command power ct, which implements the power and sample-size computations for trial designs with staggered patient entry and defined timelines for patient accrual and follow-up. In section 4, I provide examples under different scenarios of PH and non-PH of the treatment effect. In section 5, I offer broad suggestions on how to approach sample-size calculations in such trials. In section 6, I conclude with a brief discussion.

2 Methods

2.1 Estimation of RMST

Estimation of RMST for a sample of time-to-event data at a point t* > 0 in analysis time requires determining the area under the survival curve from 0 to t*. Two methods are available in Stata: Method 1 involves jackknife estimation of pseudovalues (Andersen, Hansen, and Klein 2004), which is equivalent to integrating the Kaplan–Meier curve. Method 2 involves integrating smooth survival curves predicted from flexible parametric survival models (Royston and Parmar 2002; Lambert and Royston 2009; Royston and Lambert 2011), also known as Royston–Parmar models. These methods are described briefly by Royston (2017b) and are implemented in commands stctest ps and stctest rp, respectively.
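As a concrete illustration of the quantity underlying method 1, the following minimal sketch approximates RMST at t* = 5 in a single arm by integrating the Kaplan–Meier estimate directly. It is not the stctest implementation; it assumes the data in memory are already stset and restricted to one arm (for example, via a hypothetical treatment indicator arm).

* Minimal sketch (not the stctest implementation): RMST at t* = 5 for one arm,
* obtained by integrating the Kaplan-Meier estimate, which is a step function,
* so the integral is a sum of rectangle areas. Data are assumed already stset.
sts generate km = s             // Kaplan-Meier estimate S(t) at each record's _t
keep _t km
duplicates drop
drop if _t > 5                  // restrict the time axis to [0, t*]
sort _t
* rectangle area: S at the start of each interval times the interval width
generate double area = km[_n-1]*(_t - _t[_n-1]) if _n > 1
quietly summarize area
* first rectangle has S(0) = 1; last rectangle runs from the final _t up to t*
display "RMST(5) approx. = " _t[1] + r(sum) + km[_N]*(5 - _t[_N])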

2.2 Simulation to estimate power of the combined test

A challenging preliminary task is to devise an approach to simulation that allows a range of survival and time-dependent HR functions to be studied. For convenience and generality, I adopt a somewhat simplified version of the ART system of trial design (Barthel, Royston, and Babiker 2005; Barthel et al. 2006). The ART system computes power or sample size for the Cox/log-rank test by exact calculation not requiring simulation. Trial calendar time is divided into M contiguous periods of equal length. A period is a convenient duration such as a year, quarter, or month, depending on the context. Staggered entry of patients into the trial is assumed to occur at a steady (uniform) rate within each period, while potentially varying between periods. Typically, in real trials, patient recruitment starts slowly and speeds up over calendar time. The survival distribution in the control arm, S0(t), is defined by its values at the end of each period of analysis time. The instantaneous event rate (hazard) is assumed constant throughout each period. This defines a piecewise exponential distribution with piecewise constant hazards.
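A hedged sketch of how such piecewise exponential event times can be drawn by inverting the cumulative hazard appears below; the period hazards are illustrative only (roughly those implied by the first six periods of built-in control-arm survival function #1, listed in section 3.2), and power ct's own simulation code may differ.

* Hedged sketch: draw event times from a piecewise exponential distribution
* with unit-length periods and constant hazard lam_j within period j, by
* inverting the cumulative hazard. Hazards below are illustrative only.
clear
set obs 1000
set seed 12345
local lam 0.27 0.39 0.42 0.43 0.32 0.21
generate double target = -ln(runiform())    // required cumulative hazard -ln(U)
generate double t = .
local H = 0                                 // cumulative hazard at period start
local j = 0
foreach l of local lam {
    local ++j
    * the event falls in period `j' if H <= target < H + lam_j
    replace t = `j' - 1 + (target - `H')/`l' if missing(t) & target < `H' + `l'
    local H = `H' + `l'
}
* observations with t still missing survive beyond the last period listed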

Patient accrual and follow-up are assumed to take place over K1 and K2 calendar periods, respectively, with K1 + K2 = M. For example, for a trial with M = 10 periods each of length 1 year and with accrual for K1 = 6 years, follow-up of all patients recruited by staggered entry over 6 years would continue for a further K2 = 4 years, after which one would analyze the trial outcome data.

The survival distribution in the research arm, S1(t), is specified with HRs for periods 1, 2, ... of analysis time applied to S0(t) via the cumulative hazard function. Proportional hazards would require the HRs to be equal across periods.
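The conversion from S0(t) and period-specific HRs to S1(t) can be checked directly. The following hedged sketch reproduces the non-PH "early effect" columns of table 1 (section 4.1) from built-in survival function #1 and HR pattern #1, both listed in section 3.2; it is an illustration of the relationship, not power ct's internal code.

* Hedged sketch: convert a control-arm survival function and period-specific
* HRs into the research-arm survival function via the cumulative hazard.
* Values reproduce the non-PH "early effect" columns of table 1.
clear
input double s0 double hr
0.765 0.522
0.516 0.642
0.340 0.722
0.221 0.892
0.161 1.193
0.130 1.571
0.112 1.967
0.100 2.288
0.090 2.478
0.082 2.627
end
generate double H0  = -ln(s0)                          // control cumulative hazard
generate double dH0 = H0 - cond(_n == 1, 0, H0[_n-1])  // hazard increment per period
generate double H1  = sum(hr*dH0)                      // apply the period HRs
generate double s1  = exp(-H1)                         // research-arm survival
list s0 hr s1, noobs sep(0)                            // s1 matches S1(t) in table 1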

The power of the combined test is estimated by simulation within the above framework. A suitably large number of datasets with the predefined piecewise exponential distributions in the control and research arms are simulated using the command power ct, which is described in detail in section 3. The number of simulations in which the combined test is significant at a given level α is counted. The power of the test [with a binomial-based confidence interval (CI)] is the corresponding proportion.

The software also presents the power of the Cox/log-rank test as calculated by ART. The simulation-based power of the Cox/log-rank test is provided in addition as a “reality check”.

2.3 A note on Monte Carlo error

With any simulation scheme, estimated quantities are not exact but come with a degree of “Monte Carlo error”, reflecting uncertainty due to randomness. In power ct, Monte Carlo error attaches to the estimated power or sample size. For power, the binomial method computes the reported CIs. For sample size, I use the delta method to create a normal-based CI, as described in section 2.4. The simulate() and ciwidth() options govern the number of simulation replicates used by power ct. If ciwidth() is specified, then after some simple algebra, the corresponding required value of simulate(), rounded to the nearest integer, is found to be

    simulate() = round{power() × [1 − power()] × [zz/ciwidth()]²}

where power() is the target power, zz = −2Φ⁻¹{(100 − level())/200}, where Φ⁻¹() is the inverse standard normal distribution function, and level() is the desired confidence level (by default, 95%). For example, if power() = 0.9, ciwidth() = 0.02, and level() = 95, then zz = 3.9199 and simulate() = 3457. If power() = 0.8 and the other parameters are unchanged, then simulate() = 6146.
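The quoted values are easy to verify interactively; a minimal check, using only the formula above:

* Quick check of the simulate() formula for ciwidth() = 0.02 and level() = 95
scalar zz = -2*invnormal((100 - 95)/200)
display zz                                    // 3.9199...
display round(0.9*(1 - 0.9)*(zz/0.02)^2)      // 3457, as quoted for power() = 0.9
display round(0.8*(1 - 0.8)*(zz/0.02)^2)      // 6146, as quoted for power() = 0.8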

2.4 Sample-size calculation for the combined test

power ct provides a simulation-based method of estimating the sample size for the combined test to achieve a given power at a given significance level. The user must suggest at least three plausible values for the sample size in option n(). Write ω ∈ (0, 1) for power and n for the total sample size. Using simulation, the program estimates the power of the combined test at the supplied sample sizes. The several powers and sample sizes are assumed to follow the relation

    Φ⁻¹(ω) = b0 + b1√n    (1)

Functional form (1) is suggested by equation (1) of Royston et al. (2011). Parameters b0 and b1 may be estimated by ordinary least squares. The required sample size, say, nest, for the target power ω0 is determined by inversion and back-transformation of (1), giving nest = {[Φ⁻¹(ω0) − b0]/b1}². A delta-method, normal-based CI for nest may be found by using nlcom, for example, nlcom ((invnormal(`omega0') - _b[_cons])/_b[sqrtn])^2.
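A hedged sketch of this interpolation step, carried out outside power ct with the three (n, power) pairs reported for example 1 in section 4.2 (power ct performs the equivalent computation internally):

* Hedged sketch of the interpolation in (1): regress the normal equivalent
* deviate of estimated power on sqrt(n), then invert for target power 0.9.
* The (n, power) pairs are those reported for example 1 in section 4.2.
clear
input n power
600 0.8814
650 0.9024
700 0.9220
end
generate double ned   = invnormal(power)    // normal equivalent deviate of power
generate double sqrtn = sqrt(n)
regress ned sqrtn
* back-calculate n for power 0.9 with a delta-method CI; the point estimate is
* close to the 643 that power_ct reports in section 4.2
nlcom ((invnormal(0.9) - _b[_cons])/_b[sqrtn])^2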

Finally, power ct checks the power for nest for agreement with ω0 by performing an additional round of simulation with sample size nest.

To demonstrate the linearity between Φ⁻¹(ω) and √n in an example, I applied power ct to estimate power in sets of 2,000 simulated trials for a design with non-PH across a range of sample sizes between 500 and 1,500. The HRs over the 8 equal time periods of the design were 0.6, 0.6, 1.0, 1.0, 1.2, 1.5, 1.5, and 1.5. The survival probabilities were 0.90, 0.71, 0.60, 0.52, 0.44, 0.38, 0.33, and 0.28. The results are shown in figure 1.


Figure 1. Relationship between normal equivalent deviate of power, Φ⁻¹(ω), and square root of sample size in sets of 2,000 simulations of a trial design with non-PH

As can be seen, the relationship between the transformed power and transformed sample size is closely linear across a wide range of powers between about 0.56 and 0.98.
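For reference, a hedged sketch of how the design behind figure 1 might be supplied to power ct with explicit hr() and survival() numlists; the accrual pattern and random-number seed are not stated in the text, so the command defaults are assumed here.

* Hedged sketch: one sample size from the figure-1 design, with explicit
* hr() and survival() numlists; defaults are assumed where the text is silent.
power_ct, alpha(.05) n(1000)                              ///
    hr(0.6 0.6 1.0 1.0 1.2 1.5 1.5 1.5)                   ///
    survival(0.90 0.71 0.60 0.52 0.44 0.38 0.33 0.28)     ///
    nperiod(8) simulate(2000)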

The parameters b0 and b1 are estimated as −2.345 and 0.115, respectively. For power ω0 = 0.9, the required n0 is 1,000 with 95% CI [989, 1012], rounded to the nearest integer. For comparison, using the same parameters, the Cox/log-rank test requires 4,789 patients to achieve power 0.9.

3 The power ct command

The syntax of power ct is as follows:

power_ct, alpha(#) aratio(#) at(numlist) ciwidth(#) graphopts(string) hr(numlist | #i) level(#) median(#) n(#) n(numlist) nperiod(#) onesided(direction) p0(#) plot[(fn)] power(#) recruit(#) recwt(numlist) saving(fn2[, replace]) simulate(#) survival(numlist | #i) timer tscale(#)

To enable all the features of power ct, you must install packages containing up-to-date versions of the community-contributed commands artsurv (Royston and Barthel 2010), stpm2 (Andersson and Lambert 2012), stpmean (Overgaard, Andersen, and Parner 2015), and stctest (Royston 2017b). The do-file power ct install.do is provided for convenience as part of the installation of power ct. It contains the following commands:

. ssc install art, replace
. ssc install stpm2, replace
. quietly net sj 15-3 st0202_1
. net install st0202_1, replace
. quietly net sj 17-2 st0479
. net install st0479, replace

The replace option ensures that out-of-date installed versions of these programs (if any) are smoothly replaced with the most recent versions.

3.1 Description

power ct has two roles for testing for a generalized treatment effect in a two-arm, parallel group clinical trial with a time-to-event outcome.

• Role 1: To explore the power or sample size for the Cox/log-rank test under proportional hazards (PH) or non-PH. This role uses features of the ART package (Barthel, Royston, and Babiker 2005; Barthel et al. 2006).

• Role 2: To use simulation to evaluate the power or sample size for the combined test of Royston and Parmar (2016) under PH or non-PH.

The combined test combines an unweighted Cox/log-rank test, implemented through stcox, with a statistic derived from the maximal squared standardized between-arm difference in time-dependent RMST. See Royston and Parmar (2016) for methodological details and Royston (2017b) for a description of stctest, an implementation of the combined test. Note that power ct is an immediate command. It does not use a dataset in memory or disturb data in memory.

3.2 Options

alpha(#) defines the significance level for testing for a difference between treatments. The default is alpha(0.05), and tests are two sided. See also onesided().

aratio(#) defines the allocation ratio, where # equals the number of patients allocated to the research arm for each patient allocated to the control arm. For example, aratio(0.5) means one research-arm patient for every two control-arm patients. The default is aratio(1), meaning equal allocation.

at(numlist) defines time points such that numlist lists the periods corresponding to the values of the control-arm survival function in survival(). By default, the periods are assumed to be 1, 2, ... up to the number of elements in survival().

ciwidth(#) defines the desired width of the CI on power when working with simulated data using simulate().

graphopts(string) are options of graph, twoway that may be applied to enhance the appearance of the plot produced by the plot option.

hr(numlist | #i) defines HRs. Conventionally, it is assumed that HRs < 1 indicate treatment benefit compared with control, and vice versa for HRs > 1. There are two possible syntaxes:

Syntax 1 is hr(numlist), where numlist defines the HRs to be applied to the control-arm survival function during each period. If only the first k (k < nperiod()) values are supplied, the final value is applied to the remaining periods.

Syntax 2 is hr(#i) and uses built-in HR pattern number i, where # denotes the “hash” character and i is a positive integer. The program supplies the HRs in each of 10 periods. When the plot() option is used to plot survival curves, an impression is given of the relationship between a non-PH HR pattern and the resulting population survival curves in the control and research arms. The five 10-period HR patterns currently implemented may be described as follows:

#1: early positive effect reversing direction in the long term
#2: large late effect
#3: large early effect reversing direction, then disappearing
#4: fairly small early effect slowly reversing direction
#5: early effect with crossing survival curves

The five HR patterns currently implemented are as follows:

#1: 0.522 0.642 0.722 0.892 1.193 1.571 1.967 2.288 2.478 2.627
#2: 1.0 1.0 0.7 0.5 0.5 0.5 0.5 0.5 0.5 0.5

#3: 0.3 0.5 1.0 1.4 1.6 1.7 1.0 1.0 1.0 1.0
#4: 0.894 0.701 0.768 0.875 1.013 1.185 1.385 1.594 1.775 1.894
#5: 0.5 0.5 0.5 0.7 1.0 1.6 2.0 2.0 2.0 2.0

By default, hr(#i) (syntax 2) assumes built-in survival-function number i. However, if you provide values in survival() or you specify survival(#j) where j ≠ i, the survival(#j) values are used instead. The values of hr() and survival() that are actually used can be inspected after running power ct by typing return list and looking at the stored quantities (r() macros) r(hr), r(survival0), and r(survival1).

level(#) sets the confidence level for CIs to #. The default # is c(level), initially 95%. See also help set level.

median(#) specifies the median survival time in the control arm. The survival probability in the control arm at period 1, s0, is calculated as e^(−ln(2)/#). Options survival() and at() are posted internally as survival(s0) and at(1), respectively, and may not be used.

n(#) specifies the sample size at which the power of the Cox/log-rank and combined tests is to be determined. By default, sample size for power given in power(#) is determined according to the control-arm survival probabilities in survival(), HRs in hr(), number of periods in nperiod(), and accrual time in recruit(). See also hr() and n(numlist).

n(numlist) estimates the sample sizes for the combined test to achieve power given by power(). power ct first uses simulation (see simulate()) to estimate the power of the combined test for each sample size in numlist, of which there must be at least three. The values in numlist are user-supplied “educated guesses” at the required sample size. power ct then regresses the normal equivalent deviates of the estimated power values on the square roots of the corresponding sample sizes, enabling back-calculation of the sample size for the combined test corresponding to the required power. A normal-based CI for the sample size, computed by the delta method, is presented. Finally, the power with the estimated sample size is determined by a further round of simulation. The CI on the reported power should usually enclose the target power specified in power(). More precise estimates of the sample size may be obtained by increasing simulate(), reducing ciwidth(), or increasing the length of numlist.

nperiod(#) specifies the number of “periods” at the end points of which survival probabilities and HRs apply. Periods are integer numbers of time intervals whose lengths are determined by the reciprocal of # in the tscale() option. Note that the number of periods of follow-up time for each patient is calculated as nperiod() minus recruit(). Hence, recruit() cannot exceed nperiod(). The default is nperiod(10).

onesided(direction) makes all significance tests one sided. If direction = +, the direction is toward RMST in the research arm exceeding that in the control arm, and HR < 1. If direction = -, the direction is toward RMST in the research arm being lower than that in the control arm, and HR > 1. The default is direction unspecified, meaning two-sided tests.

p0(#) defines the fraction of patients recruited at time 0. Such patients are followed up from the start of period 1 and through all subsequent periods. The default is p0(0).

plot and plot(fn) plot the estimated population survival functions against analysis time using the time-scale factor tscale(). If plot(fn) is specified, the plotted values are stored to a file called fn.dta; otherwise, no data are stored.
power(#), when n() is omitted, defines the power at which to determine sample size for the Cox/log-rank test. When n(numlist) is specified, power ct estimates the sample size for the combined test to have power #. Note that power() and n(#) cannot be specified together.

recruit(#) defines the number of periods of calendar time over which patients accrue to the trial. By default, accrual is assumed to occur at a uniform rate; see recwt() for how to specify varying recruitment rates. The default is recruit(5).

recwt(numlist) defines accrual weights in each period of patient recruitment. The weights must be a constant multiple of the proportions of patients recruited in each period up to recruit(). Values in numlist must be positive real numbers. The number of values in numlist must equal the number defined by recruit(). The default is recwt(1), meaning a constant accrual rate across periods.

saving(fn2[, replace]) saves simulation estimates at each replicate to a file named fn2.dta.

simulate(#) controls the number of simulations to be performed. The default is simulate(0), meaning no simulations are done.

survival(numlist | #i) defines survival probabilities in the control arm. Unless hr(#i) is specified (see hr()), survival() is required. In syntax 1, survival(numlist) defines the survival probability at the end of each period. In syntax 2, survival(#i), where # denotes the “hash” character and i is a positive integer, the program assumes nperiod(10) and supplies the control-arm survival probability in each of 10 periods according to built-in function i. The six built-in survival functions are as follows:

#1: 0.765 0.516 0.340 0.221 0.161 0.130 0.112 0.100 0.090 0.082
#2: 0.765 0.516 0.340 0.221 0.161 0.130 0.112 0.100 0.090 0.082
#3: 0.765 0.516 0.340 0.221 0.161 0.130 0.112 0.100 0.090 0.082
#4: 0.500 0.265 0.114 0.065 0.046 0.037 0.032 0.029 0.027 0.025
#5: 0.984 0.923 0.773 0.644 0.549 0.471 0.424 0.396 0.377 0.363
#6: 0.538 0.333 0.248 0.204 0.178 0.160 0.146 0.136 0.127 0.119

Currently, functions 1, 2, and 3 are identical.

timer activates a minute timer for simulation runs. Simulations can become lengthy. Elapsed times from timer with a smaller number of simulations in simulate() can be scaled up to indicate how long a production run will need. (Technical note: timer uses timers 99 and 100.)

tscale(#) defines the scale factor between analysis-time units and “periods”, where one unit of analysis time equals # periods in length. Note that # may be 1, < 1, or > 1, but it is often 1 or > 1 to “magnify” analysis time and give greater detail of the survival function and HRs. The default is tscale(1). Example 1: if analysis time is in years and tscale(2) is specified, each period is one half a unit of analysis time (that is, six months) in length. Example 2: if analysis time is in years and nperiod(12) tscale(4) is specified, each period is 3 months; survival probabilities are estimated at 1/4, 2/4, ..., 12/4 = 3 years.

4 Examples

4.1 Overview

In the examples given below, I consider computing the sample size for the combined test under three generic scenarios: a) PH, b) early treatment effect, and c) late treatment effect. The last two terms are explained in context in the following sections. I estimated the baseline survival function, S0(t), in the GOG111 trial in advanced ovarian cancer (McGuire et al. 1996). Table 1 shows S0(t) and S1(t), the survival function in the research arm, computed with three time-related patterns of HR.

Table 1. Survival functions for three patterns of a time-dependent HR. See also figure 2.

Period (yr)   S0(t)    PH              Non-PH early     Non-PH late
                       HR     S1(t)    HR      S1(t)    HR      S1(t)
 1            0.765    0.750  0.818    0.522   0.870    1.000   0.765
 2            0.516    0.750  0.609    0.642   0.675    1.000   0.516
 3            0.340    0.750  0.445    0.722   0.500    0.700   0.385
 4            0.221    0.750  0.322    0.892   0.340    0.500   0.311
 5            0.161    0.750  0.254    1.193   0.233    0.500   0.265
 6            0.130    0.750  0.217    1.571   0.167    0.500   0.238
 7            0.112    0.750  0.194    1.967   0.124    0.500   0.221
 8            0.100    0.750  0.178    2.288   0.096    0.500   0.209
 9            0.090    0.750  0.164    2.478   0.074    0.500   0.198
10            0.082    0.750  0.153    2.627   0.058    0.500   0.189

The three population survival functions corresponding to the HR functions in table 1 are shown graphically in figure 2.

(a) PH with HR = 0.75 (b) Non−PH, early effect 1.0 1.0

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0.0 0.0 0 2 4 6 8 10 0 2 4 6 8 10 (c) Non−PH, late effect 1.0

0.8 Survival probability 0.6

0.4

0.2

0.0 0 2 4 6 8 10 Years of follow−up

Figure 2. Population survival curves corresponding to PH with HR = 0.75 and non-PH with two different specifications of the time-dependent HR (see table 1). The control-arm survival curve S0(t) (solid line), also given in table 1, is identical in each panel. The research-arm survival curves are shown by dashed lines.

4.2 Example 1. Sample size under PH

The first port of call for a sample-size calculation with time-to-event data in Stata is usually to assume PH of the treatment effect and work with the Cox/log-rank test. A flexible system to facilitate such exploration is the ART package (Barthel, Royston, and Babiker 2005). The main options of ART (specifically, options of the community-contributed program artsurv) are available in simplified form in power ct. If the simulate() option of power ct is not specified, power ct calls artsurv to calculate the sample size or power according to the Cox/log-rank test but not the combined test.

Previous work suggests that under PH and other things being equal, the combined test requires some 5 to 10% more patients than the Cox/log-rank test to provide a given power (Royston and Parmar 2016). I exemplify the use of power ct in this context. Suppose the sample size is required to achieve power of 0.9 at a two-sided significance level of 0.05 to detect HR = 0.75 under PH. For illustration, I use the first built-in baseline survival function in power ct as provided by the option survival(#1). The leftmost four columns of table 1 show the corresponding population survival functions, S0(t) and S1(t), at the end of each one-year period up to a maximum of M = 10 years. Suppose that patient accrual takes place at a uniform rate over 5 of the 10 periods. The sample size for the Cox/log-rank test is calculated without requiring simulation:

. power_ct, alpha(.05) power(0.9) hr(0.75) survival(#1) nperiod(10) recruit(5)

ART sample size calculation for Cox/logrank test
HR:          PH with HR = .75
Alpha:       .05
Power:       0.9000
Events:      509
Sample size: 599

A sample of n = 599 patients experiencing 509 events is required for the Cox/log-rank test to have power 0.9. I now check the power of the combined test under the same scenario with 599 patients, using simulation of 5,000 trial datasets. I first set the random-number seed arbitrarily to 115 to ensure reproducibility of the simulation results later if needed:

. set seed 115
. power_ct, alpha(.05) n(599) hr(0.75) survival(#1) nperiod(10) recruit(5)
> simulate(5000)

ART power calculation for Cox/logrank test
HR:          PH with HR = .75
Alpha:       .05
Sample size: 599
Events:      509
Power:       0.9002

Estimating power of combined test with sample size 599
(....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%)

Simul. n Power_CT [95% Conf. Int.] Power_Cox [95% Conf. Int.] Mean HR

5000 599 0.8776 0.8682, 0.8866 0.8972 0.8884, 0.9055 0.7490

The power of the combined test is estimated to be 0.878 [95% CI 0.868 to 0.887], about 0.022 lower than for the Cox/log-rank test. What sample size is needed for the combined test to achieve power 0.9? To determine this, the n() option must specify at least three sample sizes with ranges intended to include the correct value. Because the Cox/log-rank test is optimally powerful under PH, more than 599 patients are needed for the combined test to obtain sufficient power. I take 600, 650, and 700 as a reasonable candidate range. If this set does not cover the required power, I can adjust the sample sizes and repeat the procedure. power ct performs the simulations and interpolates (transformed) sample size and (transformed) power values, as described in section 2.4, to obtain a sample size estimate for power 0.9. It also provides a 95% CI for the estimated power. Finally, power ct again uses simulation to evaluate the power with the final sample size: P. Royston 15

. set seed 117
. power_ct, alpha(.05) power(.9) n(600 650 700) hr(0.75) survival(#1)
> nperiod(10) recruit(5) simulate(5000)

Estimating power of combined test in 3 sets of 5000 replicates ...

    n     power    SE(power)

  600    0.8814    0.0046
  650    0.9024    0.0042
  700    0.9220    0.0038

Sample size for power .9 = 643, 95% CI (640,646)

ART power calculation for Cox/logrank test
HR:          PH with HR = .75
Alpha:       .05
Sample size: 643
Events:      547
Power:       0.9192

Estimating power of combined test with sample size 643
(....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%)

Simul. n Power_CT [95% Conf. Int.] Power_Cox [95% Conf. Int.] Mean HR

5000 643 0.8956 0.8868, 0.9039 0.9154 0.9073, 0.9230 0.7506

The sample size for the combined test is 643, some 7.3% higher than the 599 required by the Cox/log-rank test. Although the power of 0.896 for the combined test does not exactly equal the requested 0.9, the value of 0.9 lies within the 95% CI for 0.896 of [0.887, 0.904].

4.3 Example 2. Sample size under non-PH: Early effect

I now describe an investigation of sample size for the combined test under a specified time-dependent pattern of non-PH. The HR values are given in the fifth column of table 1. The pattern represents an “early” effect of treatment featuring crossing hazard functions, with HR < 1 for the first four periods and HR > 1 subsequently. Royston and Parmar (2016) demonstrated that the Cox/log-rank test may have severely reduced power in this situation.

I specify power 0.9 at significance level 0.05. I cannot rely on the ART calculation to inform the sample size needed for the combined test. Instead, I instruct power ct to cover a wide range of sample sizes. As a “sighting shot”, I choose n = 200, 500, and 1,000. Initially, to save computer time, I use a relatively small number of simulations (500):

. set seed 119
. power_ct, alpha(.05) power(.9) n(200 500 1000) hr(#1) survival(#1)
> nperiod(10) recruit(5) simulate(500)

Estimating power of combined test in 3 sets of 500 replicates ...

    n     power    SE(power)

  200    0.6240    0.0217
  500    0.9500    0.0097
 1000    1.0000    0.0000

Sample size for power .9 = 405, 95% CI (405,405)

ART power calculation for Cox/logrank test
HR:          .522 .642 .722 .892 1.193 1.571 1.967 2.288 2.478 2.627
Alpha:       .05
Sample size: 405
Events:      359
Power:       0.6878

Estimating power of combined test with sample size 405
(....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%)

Simul. n Power_CT [95% Conf. Int.] Power_Cox [95% Conf. Int.] Mean HR

500 405 0.8880 0.8570, 0.9143 0.6320 0.5880, 0.6744 0.7803

It appears that n = 405 is likely to be close to the right answer, enabling the range to be narrowed. I now choose n = 350, 400, and 450 and rerun power ct with a larger number of simulations (5,000):

. set seed 121
. power_ct, alpha(.05) power(.9) n(350 400 450) hr(#1) survival(#1)
> nperiod(10) recruit(5) simulate(5000)

Estimating power of combined test in 3 sets of 5000 replicates ...

    n     power    SE(power)

  350    0.8706    0.0047
  400    0.9140    0.0040
  450    0.9426    0.0033

Sample size for power .9 = 383, 95% CI (381,384)

ART power calculation for Cox/logrank test
HR:          .522 .642 .722 .892 1.193 1.571 1.967 2.288 2.478 2.627
Alpha:       .05
Sample size: 383
Events:      339
Power:       0.6636

Estimating power of combined test with sample size 383
(....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%)

Simul. n Power_CT [95% Conf. Int.] Power_Cox [95% Conf. Int.] Mean HR

5000 383 0.9022 0.8936, 0.9103 0.6708 0.6576, 0.6838 0.7682 P. Royston 17

The revised sample size conferring power 0.902 [95% CI 0.894 to 0.910] is 383. Notably, the power of the Cox/log-rank test in this setting is 0.671, which is much smaller than that of the combined test.

As a sensitivity analysis, suppose the same sample size of 383 and the same HRs that define the early effect for the first 4 years are kept, but the subsequent HRs are reduced to 1.0. This is still an early effect lasting four years, but now the treatment effect does not actually “go into reverse” (that is, exhibit crossing hazards) after four years, as it does with hr(#1). What is the power of the Cox/log-rank and combined tests now?

. local hr1 0.522 0.642 0.722 0.892 1 1 1 1 1 1
. set seed 123
. power_ct, alpha(.05) n(383) hr(`hr1') survival(#1) nperiod(10) recruit(5)
> simulate(5000)

ART power calculation for Cox/logrank test
HR:          .522 .642 .722 .892 1 1 1 1 1 1
Alpha:       .05
Sample size: 383
Events:      330
Power:       0.8619

Estimating power of combined test with sample size 383
(....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%)

Simul. n Power_CT [95% Conf. Int.] Power_Cox [95% Conf. Int.] Mean HR

5000 383 0.9246 0.9169, 0.9318 0.8600 0.8501, 0.8695 0.7142

The power of the Cox/log-rank test has increased markedly from 0.671 to 0.860. However, it is still somewhat lower than the power of the combined test, which has increased from 0.902 to 0.925. The example suggests that when an early treatment effect is present, the power of the combined test is superior and quite robust to variations in the HR in the later part of follow-up. The power of the Cox/log-rank test is sensitive to such variations. Typically, relatively few patients who are still event-free contribute data in the late phase.

4.4 Example 3. Sample size under non-PH: Late effect

A second type of non-PH pattern that may be seen, for example, in screening or prevention trials, is the late treatment effect. Here the HR may be close to 1 in the early follow-up phase and decrease later, signifying a late-onset treatment effect. The corresponding survival curves coincide in the early period and separate later. In many trials, most of the events occur in the early follow-up phase. Obtaining sufficient power in such trials can be a challenge. Earlier simulation work (Royston and Parmar 2016) suggested that with a late effect, the power of the combined test is not far short of that of the Cox/log-rank test. To obtain a sample-size estimate for the combined test, I take an approach similar to the PH case. I obtain a rough indication of the required sample size from the Cox/log-rank test, then refine it for the combined test using the n(numlist) option of power ct. As an example, I use hypothetical time-dependent HR pattern #2 as provided in power ct through the option hr(#2) (see table 1), together with survival(#1) as before:

. power_ct, alpha(.05) power(.9) hr(#2) survival(#1) nperiod(10) recruit(5)

ART sample size calculation for Cox/logrank test
HR:          1 1 .7 .5 .5 .5 .5 .5 .5 .5
Alpha:       .05
Power:       0.9000
Events:      811
Sample size: 971

. local n = r(N)

I need n = 971 patients. Note that power ct stores the required sample size (971) in r(N), which I have stored in local macro `n' for further use below. Based on this, I guess a (generous) range of, say, minus 10% to plus 15% of 971, that is, 874 and 1,117, intended to cover power 0.9 for the combined test.

. local n1 = round(`n' - .10*`n')
. local n2 = round(`n' + .15*`n')
. set seed 125
. power_ct, alpha(.05) power(.9) n(`n1' `n' `n2') hr(#2) survival(#1)
> nperiod(10) recruit(5) simulate(5000)

Estimating power of combined test in 3 sets of 5000 replicates ...

    n     power    SE(power)

  874    0.8446    0.0051
  971    0.8718    0.0047
 1117    0.9192    0.0039

Sample size for power .9 = 1048, 95% CI (1021,1074)

ART power calculation for Cox/logrank test
HR:          1 1 .7 .5 .5 .5 .5 .5 .5 .5
Alpha:       .05
Sample size: 1048
Events:      876
Power:       0.9206

Estimating power of combined test with sample size 1048
(....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%)

Simul. n Power_CT [95% Conf. Int.] Power_Cox [95% Conf. Int.] Mean HR

5000 1048 0.8990 0.8903, 0.9072 0.9208 0.9130, 0.9281 0.7965

Sample size n = 1048 is indicated for the combined test. The corresponding power is 0.899 [95% CI 0.890 to 0.907]. The sample size is 7.9% larger than the 971 needed for the Cox/log-rank test.

5 Sample-size calculation: General recommendations

Based on examples and on previous experience with the combined test (Royston and Parmar 2016), I give tentative recommendations for trial sample-size calculation below. I describe the approach for the most popular power values of 0.9 and 0.8. In principle, any desired power may be targeted.

1. As described in section 4.2, power the trial for the combined test under PH. Power and sample-size assessments under non-PH for various patterns of time-dependent HR may be viewed as sensitivity analyses. They may be informed by subject-matter considerations, for example, hypotheses about the likely modes of action of the treatment regimens under comparison.

2. To achieve power 0.9 for the combined test under PH, initially use the ART methodology implemented in power ct to compute the sample size for the Cox/log-rank test with power 0.92. Simulation is unnecessary. If the target power is 0.8, do the same calculation for Cox/log-rank power 0.83. Call the resulting sample size nPH.

3. To obtain nest, the estimated sample size, run power ct for power 0.9 or 0.8 with three sample sizes covering a sensible range, for example, nPH, 0.9nPH, 1.1nPH. The choice of number of simulations may be guided by specifying ciwidth() instead of simulate(). For example, ciwidth(0.02) would give a CI width of about ±0.01 for the estimated power conditional on nest. The value of nest proposed for the combined test should be close to nPH. A sketch of this workflow appears after this list.
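A hedged sketch of recommendations 2 and 3 for a target power of 0.9, reusing the PH design of example 1 (survival(#1), HR = 0.75); the random-number seed and other details are left to the user, and the particular option choices are illustrative rather than prescriptive.

* Step 2: size the Cox/log-rank test at power 0.92 (no simulation required)
power_ct, alpha(.05) power(0.92) hr(0.75) survival(#1) nperiod(10) recruit(5)
local nPH = r(N)
* Step 3: bracket nPH and let power_ct interpolate the combined-test sample
* size, with ciwidth() given instead of simulate(), as suggested above
local nlo = round(0.9*`nPH')
local nhi = round(1.1*`nPH')
power_ct, alpha(.05) power(0.9) n(`nlo' `nPH' `nhi') hr(0.75) survival(#1) ///
    nperiod(10) recruit(5) ciwidth(0.02)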

6 Discussion

It is apparent that the power of the Cox/log-rank test is vulnerable to scenarios where a treatment effect in the early follow-up phase disappears or reverses later on. This is the phenomenon of crossing hazard functions. An extreme case of crossing hazards is when the survival curves also cross (see, for example, figure 2A of Mok et al. [2009]). Because of its sensitivity to “gaps” between Kaplan–Meier curves, that is, local features rather than the particular pattern of consistent separation implied by PH, the combined test may be able to detect a statistically and clinically significant early treatment effect when the Cox/log-rank test fails to. Therefore, using the combined test in such cases can enhance power.

Because the Cox/log-rank test is optimally powerful under PH, a requirement for the combined test to provide a given power under PH inevitably causes an increase in sample size. The increase is akin to an “insurance premium” to cope with possible failures of the PH assumption (Royston and Parmar 2016). In practice, the premium is modest, less than a 10% increase in sample size in all the instances I have investigated. The benefit of the combined test is that it is more robust to failure of the PH assumption than the Cox/log-rank test in some situations.

Note that the sample-size calculation for the combined test under PH is sufficiently well defined to be fully specifiable in the trial protocol. The combined test may be applied to the final trial dataset without the need for any data-driven modifications to the analysis strategy. Such prespecification is a key requirement of good clinical practice when designing and running a trial.

7 Acknowledgment

I thank Dr. Tim Morris for helpful comments on the manuscript.

8 References

Andersen, P. K., M. G. Hansen, and J. P. Klein. 2004. Regression analysis of restricted mean survival time based on pseudo-observations. Lifetime Data Analysis 10: 335–350.

Andersson, T. M.-L., and P. C. Lambert. 2012. Fitting and modeling cure in population-based cancer studies within the framework of flexible parametric survival models. Stata Journal 12: 623–638.

Barthel, F. M.-S., A. Babiker, P. Royston, and M. K. B. Parmar. 2006. Evaluation of sample size and power for multi-arm survival trials allowing for non-uniform accrual, non-proportional hazards, loss to follow-up and cross-over. Statistics in Medicine 25: 2521–2542.

Barthel, F. M.-S., P. Royston, and A. Babiker. 2005. A menu-driven facility for complex sample size calculation in randomized controlled trials with a survival or a binary outcome: Update. Stata Journal 5: 123–129.

Cronin, A., L. Tian, and H. Uno. 2016. strmst2 and strmst2pw: New commands to compare survival curves using the restricted mean survival time. Stata Journal 16: 702–716.

Dehbi, H.-M., P. Royston, and A. Hackshaw. 2017. Life expectancy difference and life expectancy ratio: Two measures of treatment effects in randomised trials with non-proportional hazards. British Medical Journal 357: j2250.

Lambert, P. C., and P. Royston. 2009. Further development of flexible parametric models for survival analysis. Stata Journal 9: 265–290.

McGuire, W. P., W. J. Hoskins, M. F. Brady, P. R. Kucera, E. E. Partridge, K. Y. Look, D. L. Clarke-Pearson, and M. Davidson. 1996. Cyclophosphamide and cisplatin compared with paclitaxel and cisplatin in patients with stage III and stage IV ovarian cancer. New England Journal of Medicine 334: 1–6.

Mok, T. S., Y.-L. Wu, S. Thongprasert, C.-H. Yang, D.-T. Chu, N. Saijo, P. Sunpaweravong, B. Han, B. Margono, Y. Ichinose, Y. Nishiwaki, Y. Ohe, J.-J. Yang, B. Chewaskulyong, H. Jiang, E. L. Duffield, C. L. Watkins, A. A. Armour, and M. Fukuoka. 2009. Gefitinib or carboplatin–paclitaxel in pulmonary adenocarcinoma. New England Journal of Medicine 361: 947–957.

Overgaard, M., P. K. Andersen, and E. T. Parner. 2015. Regression analysis of censored data using pseudo-observations: An update. Stata Journal 15: 809–821.

Royston, P. 2015. Estimating the treatment effect in a clinical trial using difference in restricted mean survival time. Stata Journal 15: 1098–1117.

———. 2017a. stcapture: Stata module to estimate survival functions and hazard ratios. Statistical Software Components S458312, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s458312.html.

———. 2017b. A combined test for a generalized treatment effect in clinical trials with a time-to-event outcome. Stata Journal 17: 405–421.

Royston, P., and F. M.-S. Barthel. 2010. Projection of power and events in clinical trials with a time-to-event outcome. Stata Journal 10: 386–394.

Royston, P., F. M.-S. Barthel, M. K. B. Parmar, B. Choodari-Oskooei, and V. Isham. 2011. Designs for clinical trials with time-to-event outcomes based on stopping guidelines for lack of benefit. Trials 12: 81.

Royston, P., and P. C. Lambert. 2011. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. College Station, TX: Stata Press.

Royston, P., and M. K. B. Parmar. 2002. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine 21: 2175–2197.

———. 2016. Augmenting the logrank test in the design of clinical trials in which non-proportional hazards of the treatment effect may be anticipated. BMC Medical Research Methodology 16: 16.

Trinquart, L., J. Jacot, S. C. Conner, and R. Porcher. 2016. Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. Journal of Clinical Oncology 34: 1813–1819.

About the author

Patrick Royston is a medical statistician with 40 years of experience and a strong interest in biostatistical methods and in statistical computing and algorithms. He works largely on methodological issues in the design and analysis of clinical trials and observational studies. He is currently focusing on alternative outcome measures and tests of treatment effects in trials with a time-to-event outcome, on parametric modeling of survival data, and on novel clinical trial designs.

The Stata Journal (2018) 18, Number 1, pp. 22–28

Unit-root tests based on forward and reverse Dickey–Fuller regressions

Jesús Otero                      Christopher F. Baum
Universidad del Rosario          Boston College
Bogotá, Colombia                 Chestnut Hill, MA
[email protected]                  [email protected]

Abstract. In this article, we present the command adfmaxur, which computes the Leybourne (1995, Oxford Bulletin of Economics and Statistics 57: 559–571) unit-root statistic for different numbers of observations and the number of lags of the dependent variable in the test regressions. The latter can be either specified by the user or endogenously determined. We illustrate the use of adfmaxur with an empirical example. Keywords: st0511, adfmaxur, unit-root test, critical values, lag length, p-values

1 Introduction

During the last three decades or so, testing for the presence of a unit root has become a key step in the empirical analysis of time series in economics and finance. Among the several tests that have been proposed in the literature over the years, perhaps the most commonly applied is the augmented Dickey–Fuller (ADF) test proposed by Said and Dickey (1984), which follows the work of Dickey and Fuller (1979). This test has had long-lasting appeal because it is regression based; thus, it is straightforward to calculate. Despite this clear advantage, a common criticism is that the ADF test exhibits poor power properties as documented, for example, by DeJong et al. (1992). To improve the power of the test, Leybourne (1995) recommends an approach based on the maximum ADF t statistic that results from estimating Dickey–Fuller type regressions using forward and reverse realizations of the data. According to Leybourne (1995), the resulting statistic, commonly referred to as the ADFmax test, is not only easy to compute but also exhibits greater power, so it is more likely to reject a false unit-root hypothesis than the standard ADF test.

To implement the ADFmax test, Leybourne (1995) tabulates critical values for the two model specifications that are mainly used in practical empirical work: a model with an intercept and a model with an intercept and trend. In both cases, the tabulated critical values are based on 20,000 replications for T = 25, 50, 100, 200, and 400 observations. However, these critical values do not account for their possible dependence on the number of lags of the dependent variable that may be needed to account for serial correlation, nor do they account for the information criterion used to determine the optimal number of lags. To improve inference, Otero and Smith (2012) performed an extensive set of Monte Carlo simulation experiments that were summarized through response surface regressions, from which critical values of the Leybourne (1995) tests were subsequently calculated for different specifications of deterministic components, number of observations, and lag order, where the last can be determined either exogenously by the user or endogenously through a data-dependent procedure. For any given ADFmax statistic, Otero and Smith (2012) provided an Excel spreadsheet to calculate critical values at the 1%, 5%, and 10% significance levels as well as the associated p-value of the statistic.1

In this article, we present the command adfmaxur, which computes the ADFmax test statistic jointly with the 1%, 5%, and 10% significance levels and associated p-value. The command allows for the presence of stochastic processes with nonzero mean and nonzero trend and for the lag order to be determined either exogenously or endogenously using a data-dependent procedure.

The article is organized as follows: Section 2 provides an overview of the ADFmax unit-root test. Section 3 describes the command adfmaxur. Section 4 illustrates the use of adfmaxur with an empirical example. Section 5 concludes.

2 The Leybourne ADFmax unit-root test

Leybourne (1995) proposes a unit-root test that involves ordinary least-squares estimation of ADF-style regressions based on the forward and reverse realizations of the time series of interest. Thus, letting y_t indicate the forward realization of a time series, consider an ADF-type regression for the model that includes a constant and a trend as deterministic components,

    Δy_t = α + γt + β y_{t−1} + Σ_{j=1}^{p} δ_j Δy_{t−j} + ε_t        (1)

where Δ is the first-difference operator, p is the number of lags of the dependent variable that are included as additional regressors to allow for residual serial correlation, and t = 1, ..., T indexes the time observations. In this setting, the unit-root null hypothesis is H0: β = 0, against the alternative that the series of interest is stationary around a linear trend term; that is, H1: β < 0. Let ADFf denote the regression t statistic for β = 0 in (1).2

Now, let us consider the reverse realization of y_t, which is given by z_t = y_{T−t+1}. Thus, z_1 = y_T, z_2 = y_{T−1}, and so on until z_T = y_1. The corresponding ADF-type regression applied to z_t is

    Δz_t = α* + γ*t + β* z_{t−1} + Σ_{j=1}^{p} δ*_j Δz_{t−j} + ε*_t

1. See http://www2.warwick.ac.uk/fac/soc/economics/staff/academic/jeremysmith/research.
2. Dickey and Fuller (1981) construct the likelihood-ratio statistics to test the joint null hypotheses that the true model is a random walk with and without drift, which in terms of (1) are represented as H0: γ = β = 0 and H0: α = γ = β = 0, respectively. The distributions under the null hypothesis of these additional statistics were not tabulated by Leybourne (1995) and therefore are not considered in this article.

where we are interested in ADFr, the regression t statistic for β* = 0. Finally, the ADFmax statistic is simply given by the maximum of ADFf and ADFr.
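Because the statistic is just the larger of two standard ADF t statistics, it can be illustrated by hand. The lines below are a minimal sketch, not the adfmaxur implementation, under the stated assumptions: a series y already tsset on an evenly spaced, gap-free time variable t, a constant and trend in both regressions, and p = 2 lags.

    * By-hand illustration of ADFmax (not the adfmaxur code); assumes a series
    * y already tsset on an evenly spaced, gap-free time variable t
    local p 2

    * Forward ADF regression (1): constant (default), trend t, p lagged differences
    regress D.y L.y L(1/`p')D.y t
    local ADFf = _b[L.y]/_se[L.y]

    * Reverse realization z_t = y_{T-t+1} and the analogous regression
    sort t
    generate z = y[_N - _n + 1]
    regress D.z L.z L(1/`p')D.z t
    local ADFr = _b[L.z]/_se[L.z]

    display "ADFf = " `ADFf' "   ADFr = " `ADFr' "   ADFmax = " max(`ADFf',`ADFr')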

3 The adfmaxur command

The command adfmaxur calculates the ADFmax test statistic, the statistic's associated 1%, 5%, and 10% finite-sample critical values, and the statistic's approximate p-value. The estimation of critical values and approximate p-value permits different combinations of the number of observations, T, and lags of the dependent variable in the test regression, p, where the latter can be either specified by the user or optimally selected using a data-dependent procedure.3

3.1 Syntax

    adfmaxur varname [if] [in] [, noprint maxlag(integer) trend]

Note that varname may not contain gaps. varname may contain time-series operators. adfmaxur may be applied to one unit of a panel.

3.2 Options

noprint specifies that the results be returned but not printed.

maxlag(integer) sets the number of lags to be included in the test regression to account for residual serial correlation. By default, adfmaxur sets the number of lags following Schwert (1989), with the formula maxlag() = int{12(T/100)^0.25}, where T is the total number of observations (a worked example of this rule follows this list of options). In either case, the number of lags appears in the row labeled FIXED of the output table. In addition, adfmaxur determines the optimal number of lags according to the Akaike and Schwarz information criteria, denoted AIC and SIC, respectively, and also following the general-to-specific (GTS) algorithm advocated by Hall (1994) and Ng and Perron (1995). The idea of the GTS algorithm is to start by setting an upper bound on p, denoted pmax, estimating (1) with p = pmax, and testing the statistical significance of δ_pmax. If this coefficient is statistically significant, using, for example, significance levels of 5% (referred to as GTS0.05) or 10% (referred to as GTS0.10), one selects p = pmax. Otherwise, the order of the estimated autoregression in (1) is reduced by one until the coefficient on the last included lag is found to be statistically different from zero.

trend specifies the modeling of intercepts and trends. By default, adfmaxur assumes varname is a nonzero mean stochastic process, so a constant is included in the test regression. If, on the other hand, the trend option is specified, varname is assumed to be a nonzero trend stochastic process, in which case a constant and a trend are included in the test regression.

3. The optimally selected number of lags yields the global minimizer of an information criterion; see, for example, Ng and Perron (2005).
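As promised under maxlag(), the Schwert default can be evaluated directly. This is only a hedged illustration of the published rule; adfmaxur applies it internally when maxlag() is omitted. For the T = 494 monthly observations used in section 4:

    * Schwert (1989) rule quoted above, evaluated for T = 494; it gives 17 lags
    local T 494
    display int(12*(`T'/100)^0.25)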

3.3 Stored results

adfmaxur stores the following results in r():

Scalars
  r(N)         number of observations in the test regression
  r(minp)      first time period used in the test regression
  r(maxp)      last time period used in the test regression

Macros
  r(varname)   variable name
  r(treat)     either constant or constant and trend
  r(tsfmt)     time-series format of the time variable

Matrices
  r(results)   results matrix, 5 × 6

The rows of the results matrix indicate which method of lag-length selection was used: FIXED (lag selected by the user, or using Schwert's formula); AIC; SIC; GTS05; or GTS10. The columns of the results matrix contain, for each method, the number of lags used; the ADFmax statistic; its p-value; and the critical values at 1%, 5%, and 10%, respectively.
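For postestimation work, the stored matrix can be copied into a named matrix and indexed. The sketch below is hedged: it relies on the row/column layout just described and assumes the Alabama unemployment series ALUR from section 4 is in memory.

    * Hedged sketch: access the stored results (rows FIXED, AIC, SIC, GTS05,
    * GTS10; columns lags, statistic, p-value, 1%, 5%, 10% critical values)
    quietly adfmaxur ALUR, maxlag(12)
    matrix R = r(results)
    matrix list R
    display "AIC-based ADFmax = " R[2,2] " with p-value " R[2,3]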

4 Empirical application

Examining the time-series properties of the unemployment rate is important to understand the way this variable responds to shocks. On the one hand, under the natural rate hypothesis, shocks to unemployment have a temporary effect because unemployment will return to its natural rate in the long run. On the other hand, under the hysteresis hypothesis, the same shocks will have a permanent effect because of rigidities in the labor market. This section illustrates the use of adfmaxur to assess these two hypotheses. To this end, we use monthly seasonally adjusted data on unemployment rates for the 48 contiguous U.S. states. The data, which are freely available from the Federal Reserve Economic Data of the Federal Reserve Bank of St. Louis, cover the period between 1976m1 and 2017m2, for a total of T = 494 time observations for each state.4

4. The task of downloading the time series from the Federal Reserve Economic Data database was greatly simplified using the command freduse; see Drukker (2006).

We begin by loading the dataset and declaring it as a time series:

. use usurates
. tsset date, monthly
        time variable:  date, 1976m1 to 2017m2
                delta:  1 month

We would like to test whether the unemployment rate in each state contains a unit root against the alternative that it is a stationary process. For practical purposes, visual inspection of the time plot of the variable of interest often provides useful guidelines as to whether a linear trend term should be included in the test regressions. For our purposes, given that each unemployment series has a nonzero mean but no trending behavior, the relevant ADFmax statistic is that based on a test regression that includes a constant but not a trend, which is the default for adfmaxur. Setting maxlag() to p = 12, the results of applying adfmaxur to the unemployment rate in the state of Alabama, denoted ALUR, are

. adfmaxur ALUR, maxlag(12)
Leybourne (1995) test results for 1977m2 - 2017m2
Variable name: ALUR
Ho: Unit root   Ha: Stationarity
Model includes constant

Criteria Lags ADFmax stat. p-value 1% cv 5% cv 10% cv

FIXED       12    -2.165    0.088    -3.006    -2.412    -2.105
AIC          9    -2.234    0.085    -3.103    -2.475    -2.153
SIC          6    -2.673    0.027    -3.039    -2.430    -2.118
GTS05        9    -2.234    0.086    -3.114    -2.483    -2.161
GTS10        9    -2.234    0.088    -3.133    -2.498    -2.172

adfmaxur runs in a fraction of a second, and the maxlag() option, if not specified, is determined following Schwert (1989), as it is in the official dfgls command. Table 1 summarizes the results of applying adfmaxur to the first 20 unemployment rates that are included in usurates.dta (a sketch of how such a summary can be assembled follows the table). Interestingly, the results reveal that, in some cases, inference may depend upon the procedure used to select the augmentation order of the test regressions. For example, in the cases of Kansas (KSUR) and Massachusetts (MAUR), we fail to reject the unit-root null at the 5% significance level when using FIXED but not with the other criteria. In contrast, in the case of Idaho (IDUR), the unit-root null is rejected (at the same significance level) using FIXED but not with the other criteria. In summary, the output reported by adfmaxur can be viewed as serving the purpose of assessing the sensitivity of the results to the selection of the method for lag determination in the test regressions.

Table 1. Application of the ADFmax test to unemployment rates in selected U.S. states

State      FIXED             AIC               SIC               GTS05             GTS10
           Stat.   p.val.    Stat.   p.val.    Stat.   p.val.    Stat.   p.val.    Stat.   p.val.

ALUR    −2.165  0.088   −2.234  0.085   −2.673  0.027   −2.234  0.086   −2.234  0.088
ARUR    −1.403  0.336   −1.631  0.252   −1.399  0.343   −1.179  0.458   −1.469  0.324
AZUR    −2.057  0.110   −2.057  0.121   −2.736  0.023   −2.057  0.122   −2.057  0.125
CAUR    −2.906  0.013   −2.851  0.020   −3.005  0.011   −3.152  0.009   −2.851  0.021
COUR    −2.615  0.030   −2.581  0.039   −2.581  0.035   −2.581  0.040   −2.581  0.041
CTUR    −2.406  0.051   −2.152  0.100   −1.723  0.210   −1.723  0.221   −2.406  0.061
DEUR    −2.251  0.073   −2.185  0.094   −2.185  0.087   −2.290  0.077   −2.251  0.085
FLUR    −2.561  0.034   −3.050  0.012   −2.693  0.026   −3.050  0.012   −3.050  0.013
GAUR    −1.992  0.126   −1.959  0.145   −2.314  0.065   −1.799  0.195   −1.959  0.150
IAUR    −1.810  0.177   −2.010  0.132   −1.831  0.174   −2.010  0.134   −1.830  0.187
IDUR    −2.623  0.029   −2.278  0.077   −1.983  0.131   −1.983  0.141   −2.278  0.081
ILUR    −2.061  0.110   −2.085  0.114   −2.975  0.012   −2.085  0.116   −2.085  0.119
INUR    −2.134  0.094   −2.338  0.068   −2.157  0.092   −2.173  0.098   −2.338  0.071
KSUR    −2.379  0.054   −2.853  0.020   −3.109  0.008   −3.109  0.010   −3.109  0.011
KYUR    −2.199  0.082   −2.077  0.116   −2.073  0.110   −2.077  0.118   −2.077  0.120
LAUR    −1.815  0.176   −1.763  0.204   −2.180  0.088   −1.763  0.207   −1.763  0.209
MAUR    −2.268  0.070   −2.494  0.048   −3.467  0.003   −2.818  0.022   −2.494  0.050
MDUR    −2.924  0.013   −3.040  0.012   −3.040  0.010   −3.040  0.012   −3.040  0.013
MEUR    −2.372  0.055   −2.367  0.064   −2.367  0.058   −2.367  0.065   −2.367  0.067
MIUR    −2.015  0.120   −2.066  0.119   −2.077  0.109   −2.247  0.084   −2.247  0.086
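The following is a hedged sketch of how a summary like table 1 can be assembled: loop over the state series (picked up here through the UR suffix, an assumption about the dataset's naming) and read the FIXED-row statistic and p-value from r(results), relying on the matrix layout described in section 3.3.

    * Collect the FIXED-lag ADFmax statistic and p-value for each state series;
    * column positions in r(results) are as documented in section 3.3
    foreach v of varlist *UR {
        quietly adfmaxur `v', maxlag(12)
        matrix R = r(results)
        display %-6s "`v'" %9.3f R[1,2] %9.3f R[1,3]
    }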

5 Concluding remarks

In this article, we presented the command adfmaxur, which calculates the Leybourne (1995) ADFmax unit-root test, the test’s 1%, 5%, and 10% finite-sample critical values, and the test’s approximate p-value. The critical values and approximate p-value are estimated as a function of the number of observations, T , lags of the dependent variable in the test regressions, p, and also allow for the lag length to be determined either exogenously by the user or endogenously using a data-dependent procedure.

6 Acknowledgments

Jesús Otero acknowledges financial support received from the Universidad del Rosario through Fondo de Investigaciones de la Universidad del Rosario. We are also grateful to an anonymous referee for useful comments and suggestions. The usual disclaimer applies.

7 References

DeJong, D. N., J. C. Nankervis, N. E. Savin, and C. H. Whiteman. 1992. The power problems of unit root tests in time series with autoregressive errors. Journal of Econometrics 53: 323–343.

Dickey, D. A., and W. A. Fuller. 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74: 427–431.

. 1981. Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica 49: 1057–1072.

Drukker, D. M. 2006. Importing Federal Reserve economic data. Stata Journal 6: 384–386.

Hall, A. 1994. Testing for a unit root in time series with pretest data-based model selection. Journal of Business and Economic Statistics 12: 461–470.

Leybourne, S. J. 1995. Testing for unit roots using forward and reverse Dickey–Fuller regressions. Oxford Bulletin of Economics and Statistics 57: 559–571.

Ng, S., and P. Perron. 1995. Unit root tests in ARMA models with data-dependent methods for the selection of the truncation lag. Journal of the American Statistical Association 90: 268–281.

. 2005. A note on the selection of time series models. Oxford Bulletin of Economics and Statistics 67: 115–134.

Otero, J., and J. Smith. 2012. Response surface models for the Leybourne unit root tests and lag order dependence. Computational Statistics 27: 473–486.

Said, S. E., and D. A. Dickey. 1984. Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika 71: 599–607.

Schwert, G. W. 1989. Tests for unit roots: A Monte Carlo investigation. Journal of Business and Economic Statistics 7: 147–159.

About the authors

Jesús Otero is a professor of economics at the Universidad del Rosario in Colombia. He received his PhD in economics from Warwick University in the United Kingdom. His field of study is applied time-series econometrics.

Christopher F. Baum is a professor of economics and social work at Boston College. He is an associate editor of the Stata Journal and the Journal of Statistical Software. Baum founded and manages the Boston College Statistical Software Components (SSC) archive at RePEc (http://repec.org). His recent research has addressed issues in social epidemiology and the progress of refugees in developed economies.

The Stata Journal (2018) 18, Number 1, pp. 29–50

validscale: A command to validate measurement scales

Bastien Perrot University of Nantes, University of Tours, INSERM SPHERE U1246 “methodS in Patient-centered outcomes and HEalth ResEarch” Nantes, France [email protected]

Emmanuelle Bataille University of Nantes, University of Tours, INSERM SPHERE U1246 “methodS in Patient-centered outcomes and HEalth ResEarch” Nantes, France [email protected]

Jean-Benoit Hardouin University of Nantes, University of Tours, INSERM SPHERE U1246 “methodS in Patient-centered outcomes and HEalth ResEarch” Nantes, France [email protected]

Abstract. Subjective measurement scales are used to measure nonobservable respondent characteristics in several fields such as clinical research, educational sciences, or psychology. To be useful, the scores resulting from the questionnaire must be validated; that is, they must demonstrate the psychometric properties of validity, reliability, and sensitivity. In this article, we present the validscale command, which carries out the statistical analyses required to validate a subjective measurement scale. We have also developed a dialog box, and validscale will soon be implemented online with Numerics by Stata.
Keywords: st0512, validscale, subjective measurement scales, score, psychometrics, validity, reliability

1 Introduction

Several domains require measurement scales to measure concepts for which no technical instrument provides a direct measure. This is the case, for example, in education sciences, when we want to measure the ability of students; in social and human sciences, when we want to measure characteristics of individuals like personality traits or behaviors; and in health, when we want to measure perceived health, quality of life, pain, or mental disorders.

Generally, these concepts are measured using a questionnaire composed of items. Questionnaires can be unidimensional (they measure only one concept) or multidimensional (they measure several concepts), so they can lead to one or more measures of the concept or concepts of interest. Historically, the model of measurement is defined in the framework of classical test theory (CTT). Although other theories of measurement coexist today, such as item response theory (IRT) and Rasch measurement theory (RMT), CTT continues to be widely used in several domains (psychology, health, etc.) to validate scales. The success of CTT can be explained by the simplicity of obtaining the measures: in this framework, the measure of each concept can be obtained using scores computed as a combination (for example, sum or mean) of responses to the items.

To construct a valid and reliable questionnaire, one must demonstrate its psychometric properties. Validity is the degree to which an instrument measures the concept or concepts of interest accurately. Reliability is the degree to which an instrument measures the concept or concepts of interest consistently. Validity and reliability are assessed by checking their respective facets. Validity is composed of content validity, construct validity, and criterion validity. Reliability is composed of internal consistency, test-retest reliability, and scalability.

Content validity should be evaluated by experts. In this step, the experts define the number of concepts measured by the scale, the definition of each of these concepts, and the assumed relationships between the concepts and the questionnaire items. This facet of validity is based on qualitative methods, but all the other facets can be assessed using statistical analyses to confirm the experts' opinions. However, there is currently no statistical software package that performs all of these analyses in a user-friendly way. The objective of the validscale command is to perform the necessary analyses to validate a measurement scale in the framework of CTT.

2 Psychometric properties assessed by validscale

The concepts described below are based on the terminology used by Fayers and Machin (2007).

Construct validity:

Convergent and divergent validity test whether the items of the questionnaire measure the constructs they are designed to measure. Convergent validity tests whether an item is sufficiently correlated with the score computed from the items of its own dimension. Divergent validity tests whether an item is poorly correlated with the scores computed in the other dimensions (for multidimensional scales).

Structural validity tests the dimensional structure of the questionnaire (the number of dimensions).

Criterion validity:

Concurrent validity is assessed by comparing the score or scores of the questionnaire with previously validated instruments or a gold standard measuring approximately the same concept or concepts.

Known-groups validity tests whether the scores differ between known groups in a predictable way.

Reliability:

Internal consistency reflects whether the items of a dimension measure a single concept and whether the dimension is composed of enough items.

Reproducibility refers to whether item responses and scores remain stable when the individual's state is stable.

3 The validscale command

3.1 Description

The validscale command computes elements to assess structural validity, convergent and divergent validity, concurrent validity, reproducibility, known-groups validity, internal consistency, and scalability. The command is intended to be used with questionnaires that comprise dichotomous items (that is, two response categories) or polytomous items (that is, more than two response categories, for example, a Likert-type scale). The user supplies the items used to compute the scores; the second required element, the partition() option, specifies how the items are allocated to the different dimensions of the questionnaire.

3.2 Syntax

    validscale varlist [if] [in], partition(numlist) [scorename(string)
        scores(varlist) categories(numlist) impute(method) noround
        compscore(method) descitems graphs cfa cfamethod(method) cfasb
        cfastand cfanocovdim cfacov(string) cfarmsea(#) cfacfi(#) cfaor
        convdiv tconvdiv(#) convdivboxplots alpha(#) delta(#) h(#)
        hjmin(#) repet(varlist) kappa ickappa(#) scores2(varlist)
        kgv(varlist) kgvboxplots kgvgroupboxplots conc(varlist) tconc(#)]

varlist contains the variables (items) used to compute the scores. The first items of varlist compose the first dimension, the following items define the second dimension, and so on.

validscale requires the commands delta (Hardouin 2007a), detect2 (included with the software for this article), imputeitems (Hardouin 2007b), mi twoway (Hamel 2014), kapci (Reichenheim 2004), loevh (Hardouin 2004), and lstrfun (Blanchette 2010).

3.3 Options

partition(numlist) defines the number of items in each dimension. The number of elements in this list indicates the number of dimensions. partition() is required.

scorename(string) defines the names of the dimensions. By default, the dimensions are named Dim1, Dim2, ... unless scores(varlist) is used.

scores(varlist) selects scores already computed in the dataset. varlist must contain as many variables as there are dimensions in the questionnaire. scores(varlist) and scorename(string) cannot be used together. This option is useful when the scores result from combinations of items that are more complex than the combinations available in the compscore() option (unweighted sum, unweighted mean, or standardization between 0 and 100). In that case, the scores must be computed prior to using validscale.

categories(numlist) specifies the minimum and maximum possible values for item response categories. If all the items have the same response categories, the user may specify these two values in numlist. If the item response categories differ from one dimension to another, the user must define the minimum and maximum item responses for each dimension, so the number of elements in numlist must be equal to twice the number of dimensions. Alternatively, the user may specify the minimum and maximum response categories for each item; in this case, the number of elements in numlist must be equal to twice the number of items. By default, the observed minimum and maximum values are assumed to be the minimum and maximum for each item.

impute(method) imputes missing item responses with person-mean substitution (pms) or the two-way imputation method (mi). Both methods are applied within each dimension. When pms is specified, missing data are imputed only if the number of missing values in the dimension is less than half the number of items in the dimension.

noround specifies that imputed values not be rounded. By default, imputed values are rounded to the nearest whole number. If impute() is absent, then noround is ignored.

compscore(method) defines the method used to compute the scores. method may be mean (the default), sum, or stand (scores rescaled from 0 to 100); a small by-hand illustration of the three methods follows this list of options. compscore() is ignored if scores() is used.

descitems displays a descriptive analysis of the items: the missing-data rate per item and the distribution of item responses. It also computes, for each item, the Cronbach's alpha obtained by omitting that item from its dimension. Moreover, the option computes Loevinger's Hj coefficients and the number of nonsignificant Hjk coefficients. See Hardouin, Bonnaud-Antignac, and Sébille (2011) for details about Loevinger's coefficients.

graphs displays graphs for the descriptive analysis of items and dimensions. It provides histograms of the scores, a biplot of the scores, and a graph showing the correlations between the items.

cfa performs a confirmatory factor analysis (CFA) using the sem command. It displays the parameter estimates and several goodness-of-fit indices.

cfamethod(method) specifies the method used to estimate the parameters. method may be ml (maximum likelihood), mlmv (ml with missing values), or adf (asymptotic distribution free).

cfasb produces Satorra–Bentler-adjusted goodness-of-fit indices by using the vce(sbentler) option of sem.
cfastand displays standardized coefficients for the CFA.

cfanocovdim asserts that the latent variables are not correlated.

cfacov(string) adds covariances between measurement errors. cfacov(item1*item2) estimates the covariance between the errors of item1 and item2. To specify more than one covariance, use cfacov(item1*item2 item3*item4).

cfarmsea(#) automatically adds the covariances between measurement errors found with the estat mindices command until the root mean square error of approximation (RMSEA) of the model is less than #. More precisely, the "basic" model (without covariances between measurement errors) is fit; then we add the covariance corresponding to the greatest modification index. The model is then fit again with this extra parameter, and so on. The option adds only covariances between measurement errors within a dimension and can be combined with cfacov(). The specified value # may not be reached if all possible within-dimension measurement-error covariances have already been added.

cfacfi(#) automatically adds the covariances between measurement errors found with the estat mindices command until the comparative fit index (CFI) of the model is greater than #. The algorithm is the same as for cfarmsea(): the "basic" model is fit, the covariance with the greatest modification index is added, the model is refit, and so on. The option adds only covariances between measurement errors within a dimension and can be combined with cfacov(). The specified value # may not be reached if all possible within-dimension measurement-error covariances have already been added.

cfaor is useful when both cfarmsea() and cfacfi() are used. By default, covariances between measurement errors are added, and the model is fit until both the RMSEA and CFI criteria are met. If cfaor is used, the estimations stop when one of the two criteria is met.

convdiv assesses convergent and divergent validities. The option displays the matrix of correlations between items and rest scores (that is, the scores computed after omitting the item being examined). If scores(varlist) is used, then the correlation coefficients are computed between the items and the scores in varlist.

tconvdiv(#) defines a threshold for highlighting some values. # is a real number between 0 and 1. The default is tconvdiv(0.4). The correlation between an item and its own score is displayed in red if it is less than #. Moreover, if an item has a smaller correlation coefficient with the score of its own dimension than with another score, that coefficient is displayed in red.

convdivboxplots displays boxplots for assessing convergent and divergent validities. The boxes represent the correlation coefficients between the items of a given dimension and all the scores. Thus, the box of correlation coefficients between the items of a given dimension and the corresponding score should be higher than the other boxes. There are as many boxplots as dimensions.

alpha(#) defines a threshold for Cronbach's alpha. # is a real number between 0 and 1. The default is alpha(0.7). Alpha coefficients less than # are displayed in red.

delta(#) defines a threshold for Ferguson's delta coefficient (see delta). Delta coefficients are computed only if compscore(sum) is used and scores() is not used. # is a real number between 0 and 1. The default is delta(0.9). Delta coefficients less than # are displayed in red.

h(#) defines a threshold for Loevinger's H coefficients (see loevh). # is a real number between 0 and 1. The default is h(0.3). Loevinger's coefficients less than # are displayed in red.

hjmin(#) defines a threshold for Loevinger's Hj coefficients. The displayed value is the minimal Hj coefficient for an item in the dimension (see loevh). # is a real number between 0 and 1. The default is hjmin(0.3). If the minimal Loevinger's Hj coefficient is less than #, then it is displayed in red, and the corresponding item is displayed.

repet(varlist) assesses the reproducibility of the scores by giving in varlist the variables corresponding to the responses at time 2 (in the same order as at time 1). Scores are computed according to the partition() option. Intraclass correlation coefficients (ICC) for the scores and their 95% confidence intervals are computed with Stata's icc command.

kappa computes the kappa statistic for the items with Stata's kap command.

ickappa(#) computes confidence intervals for the kappa statistics using kapci. # is the number of replications for the bootstrap used to estimate the confidence intervals if the items are polytomous. If they are dichotomous, an analytical method is used. See kapci for more details about the calculation of confidence intervals for kappa coefficients. If the kappa option is absent, then the ickappa() option is ignored.

scores2(varlist) selects the score variables at time 2 from the dataset.

kgv(varlist) assesses known-groups validity according to the grouping variables defined in varlist.
The option performs an analysis of variance (ANOVA) that compares the scores between groups of individuals defined by the variables in varlist. A p-value based on a Kruskal–Wallis test is also given.

kgvboxplots draws boxplots of the scores split into groups of individuals.

kgvgroupboxplots groups all the boxplots into one graph. If kgvboxplots is absent, then kgvgroupboxplots is ignored.

conc(varlist) assesses concurrent validity with the variables specified in varlist. These variables are scores from one or several other scales in the dataset.

tconc(#) defines a threshold for the correlation coefficients between the computed scores and those from the other scales defined in varlist. Correlation coefficients greater than # in absolute value (the default is tconc(0.4)) are displayed in bold.
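As announced under compscore(), the lines below are a hedged, by-hand illustration of the three score computations for one dimension (HA, items ioc1-ioc4 coded 1-5). They mirror, but are not, the command's internal code, and the exact 0-100 rescaling formula used by stand is an assumption.

    * By-hand versions of the compscore() methods for HA (ioc1-ioc4, coded 1-5);
    * not validscale's own code, and the 0-100 rescaling formula is an assumption
    egen HA_sum   = rowtotal(ioc1-ioc4)
    egen HA_mean  = rowmean(ioc1-ioc4)
    generate HA_stand = 100*(HA_sum - 4*1)/(4*5 - 4*1)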

4 Output

4.1 Data used in the examples

The data used in the output are a sample of responses to the French version of the Impact Of Cancer version 2 questionnaire (Crespi et al. 2008). This questionnaire measures the impact of cancer on survivors' lives. It consists of four positive-impact subscales, named altruism and empathy (AE), health awareness (HA), meaning of cancer (MOC), and positive self-evaluation (PSE), and four negative-impact subscales, named appearance concerns (AC), body change concerns (BCC), life interferences (LI), and worry (W). There is also one subscale for employment and relationship concerns (not used here). The questionnaire is composed of 37 items (ioc1–ioc37) scored on a 5-point scale from 1 ("strongly disagree") to 5 ("strongly agree"). They are grouped into eight dimensions, and the scores result from the sum of the responses to the corresponding items. The questionnaire was answered by 371 patients.

4.2 Examples

Minimal output

. use data_ioc
. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) compscore(sum)
Items used to compute the scores
HA  : ioc1 ioc2 ioc3 ioc4
PSE : ioc5 ioc6 ioc7 ioc8
W   : ioc9 ioc10 ioc11 ioc12 ioc13 ioc14 ioc15
BCC : ioc16 ioc17 ioc18
AC  : ioc19 ioc20 ioc21
AE  : ioc22 ioc23 ioc24 ioc25
LI  : ioc26 ioc27 ioc28 ioc29 ioc30 ioc31 ioc32
MOC : ioc33 ioc34 ioc35 ioc36 ioc37
Number of observations: 371

Reliability

          n   alpha   delta     H    Hj_min
HA      369    0.67    0.94   0.35   0.25 (item ioc1)
PSE     368    0.69    0.96   0.39   0.30
W       369    0.90    0.99   0.62   0.59
BCC     369    0.79    0.97   0.61   0.58
AC      369    0.81    0.97   0.62   0.60
AE      368    0.71    0.94   0.43   0.34
LI      367    0.81    0.97   0.42   0.29 (item ioc26)
MOC     363    0.83    0.97   0.53   0.38

This is the minimal output produced by validscale. In this example, we see that all the Cronbach's alpha coefficients are acceptable according to the threshold specified with alpha(#) (0.7 by default), except for scales HA and PSE (0.67 and 0.69, respectively). Loevinger's H coefficients for the 8 scales are ≥ 0.3, which indicates good scalability. However, the Hj coefficient of item ioc1 in scale HA is < 0.3 (0.25), which means that item ioc1 might not be consistent with scale HA. The Hj coefficient of item ioc26 in scale LI is also < 0.3. For the other subscales, no item has an Hj coefficient < 0.3.

By default, Cronbach's alpha coefficients < 0.7 and Loevinger's H coefficients < 0.3 are displayed in red. These thresholds can be changed with the options alpha(#) and h(#), respectively.

The descitems option

. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) compscore(sum) descitems
Items used to compute the scores
  (output omitted )

Description of items

        Missing    N          Response categories            Alpha    Hj    # of
                           1      2      3      4      5     -item          NS Hjk

ioc1 3.77% 357 10.08% 12.61% 24.65% 33.05% 19.61% 0.71 0.25 0 ioc2 1.08% 367 3.00% 8.72% 10.90% 39.78% 37.60% 0.52 0.42 0 ioc3 2.16% 363 2.48% 5.79% 11.02% 44.63% 36.09% 0.53 0.43 0 ioc4 2.43% 362 3.31% 8.56% 18.51% 43.09% 26.52% 0.62 0.33 0 ------ioc5 2.96% 360 9.44% 15.28% 22.78% 28.06% 24.44% 0.70 0.30 0 ioc6 2.96% 360 10.28% 15.28% 24.17% 33.61% 16.67% 0.54 0.47 0 ioc7 2.43% 362 4.97% 8.01% 22.10% 42.27% 22.65% 0.67 0.34 0 ioc8 2.16% 363 14.60% 19.83% 33.06% 20.66% 11.85% 0.58 0.44 0 ------ioc9 2.43% 362 15.47% 22.65% 14.64% 28.18% 19.06% 0.89 0.63 0 ioc10 3.23% 359 33.43% 27.58% 20.89% 12.26% 5.85% 0.90 0.59 0 ioc11 1.89% 364 5.49% 9.62% 13.74% 42.03% 29.12% 0.89 0.61 0 ioc12 3.23% 359 8.64% 18.94% 19.22% 37.05% 16.16% 0.89 0.63 0 ioc13 3.23% 359 13.65% 24.79% 18.11% 30.36% 13.09% 0.88 0.66 0 ioc14 1.62% 365 12.05% 26.30% 14.25% 28.49% 18.90% 0.89 0.60 0 ioc15 1.08% 367 6.81% 19.62% 18.26% 39.78% 15.53% 0.89 0.64 0 ------ioc16 1.35% 366 10.11% 21.86% 12.57% 30.33% 25.14% 0.67 0.62 0 ioc17 1.62% 365 9.86% 22.19% 13.97% 32.33% 21.64% 0.69 0.61 0 ioc18 1.08% 367 20.98% 34.88% 12.53% 21.80% 9.81% 0.78 0.58 0 ------ioc19 3.23% 359 14.76% 27.58% 19.50% 24.79% 13.37% 0.75 0.61 0 ioc20 2.70% 361 27.15% 29.36% 19.67% 14.68% 9.14% 0.69 0.66 0 ioc21 1.35% 366 20.22% 19.13% 16.39% 26.50% 17.76% 0.77 0.60 0 ------ioc22 1.08% 367 6.27% 12.26% 20.71% 45.50% 15.26% 0.71 0.34 0 ioc23 0.81% 368 1.63% 2.45% 5.71% 50.54% 39.67% 0.68 0.40 0 ioc24 1.89% 364 2.75% 8.52% 25.00% 42.03% 21.70% 0.55 0.52 0 ioc25 2.43% 362 5.25% 16.57% 33.98% 29.28% 14.92% 0.63 0.44 0 ------ioc26 2.16% 363 28.37% 38.84% 16.53% 9.64% 6.61% 0.82 0.29 0 ioc27 2.70% 361 28.53% 37.40% 11.08% 15.79% 7.20% 0.78 0.46 0 ioc28 1.89% 364 40.11% 35.71% 10.99% 7.97% 5.22% 0.78 0.48 0 ioc29 1.62% 365 33.42% 34.52% 11.23% 13.42% 7.40% 0.78 0.43 0 ioc30 1.35% 366 27.60% 31.15% 12.57% 19.13% 9.56% 0.79 0.43 0 ioc31 1.35% 366 36.34% 35.79% 10.93% 12.02% 4.92% 0.78 0.47 0 ioc32 2.96% 360 14.17% 17.22% 17.78% 33.89% 16.94% 0.81 0.35 0 ------ioc33 2.70% 361 8.59% 24.38% 23.27% 31.30% 12.47% 0.84 0.38 0 ioc34 2.96% 360 6.94% 26.39% 30.83% 24.72% 11.11% 0.78 0.54 0 ioc35 4.04% 356 13.48% 26.40% 28.65% 22.47% 8.99% 0.78 0.56 0 ioc36 3.50% 358 20.95% 32.68% 24.86% 15.64% 5.87% 0.79 0.54 0 ioc37 3.23% 359 18.66% 31.20% 22.84% 19.22% 8.36% 0.76 0.60 0 ------(output omitted ) 38 validscale: A command to validate measurement scales

The column Alpha-item shows the Cronbach's alpha coefficient of a scale when the item under consideration is removed. For example, removing ioc1 from the first subscale (HA) would increase the alpha coefficient for this subscale from 0.67 (see the previous table) to 0.71, whereas removing ioc2 would decrease the alpha coefficient from 0.67 to 0.52. The column Hj contains the Loevinger's coefficients of item-scale consistency. We note that the first Hj coefficient of the column (0.25) corresponds to the value for item ioc1 in scale HA displayed in the previous table.

The last column indicates the number of Hjk coefficients (Loevinger's coefficients between the item under consideration and each of the other items of the dimension) that are not significantly different from zero. An Hjk coefficient that is not significantly different from zero indicates a problem of item-scale consistency.
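The Alpha-item column can be cross-checked with official Stata. The following is a hedged sketch for the HA dimension; any small differences from the table above would come from validscale's handling of missing data.

    * Cross-check of the alpha-without-item figures for HA using official alpha
    alpha ioc1-ioc4, item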

The graph option

. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) graph
  (output omitted )

[Figure 1 here: eight histograms, one per score (HA, PSE, W, BCC, AC, AE, LI, MOC), with percent on the vertical axis]

Figure 1. Histograms of scores


Figure 2. Correlations between scores


Figure 3. Correlations between items

The graph option produces histograms of the scores (figure 1), a biplot of the scores (figure 2), and a biplot of the items (figure 3).1 The histograms allow one to examine the distribution of the scores, particularly to check normality and to identify potential floor or ceiling effects (that is, a high proportion of subjects scoring the minimum score or maximum score, respectively). Figures 2 and 3 are produced with the biplot command. In figure 2, the cosine of the angle between arrows approximates the correlation between the scores. This allows one to graphically assess the correlation between the scores. For example, appearance concerns (AC), body change concerns (BCC), and life interferences (LI) seem highly correlated with each other but uncorrelated with the meaning of cancer (MOC). In figure 3, the cosine of the angle between arrows approximates the correlation between the items. Moreover, in figure 3, items within the same dimension are represented in the same color. Thus, we expect items of the same color to be close to each other. In this example, we note that item ioc5, although theoretically grouped with items ioc6, ioc7, and ioc8, seems more correlated with ioc2 and ioc3.

1. Only variables are represented in the biplots.

The cfa option

. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) cfa cfacov(ioc1*ioc3)
  (output omitted )

Confirmatory factor analysis

Warning: some items have less than 7 response categories. If multivariate
  normality assumption does not hold, maximum likelihood estimation might not
  be appropriate. Consider using cfasb in order to apply Satorra-Bentler
  adjustment or using cfamethod(adf).
Covariances between errors added: e.ioc1*e.ioc3
Number of used individuals: 292

  Item   Dimension   Factor    Standard   Intercept   Standard   Error      Variance of
                     loading   error                  error      variance   dimension

ioc1 HA 1.00 . 3.36 0.07 1.33 0.16 ioc2 HA 2.05 0.46 3.95 0.06 0.45 ioc3 HA 1.53 0.31 4.01 0.06 0.55 ioc4 HA 1.47 0.34 3.77 0.06 0.68 ioc5 PSE 1.00 . 3.42 0.07 1.32 0.32 ioc6 PSE 1.56 0.24 3.27 0.07 0.69 ioc7 PSE 1.15 0.20 3.70 0.06 0.66 ioc8 PSE 1.37 0.22 2.91 0.07 0.80 ioc9 W 1.00 . 3.06 0.08 0.62 1.19 ioc10 W 0.77 0.06 2.26 0.07 0.75 ioc11 W 0.74 0.06 3.77 0.07 0.65 ioc12 W 0.88 0.06 3.27 0.07 0.52 ioc13 W 0.98 0.06 3.01 0.07 0.45 ioc14 W 0.89 0.06 3.12 0.08 0.75 ioc15 W 0.82 0.06 3.33 0.07 0.49 ioc16 BCC 1.00 . 3.30 0.08 0.73 1.08 ioc17 BCC 0.92 0.07 3.30 0.07 0.68 ioc18 BCC 0.90 0.08 2.62 0.08 0.76 ioc19 AC 1.00 . 2.90 0.07 0.56 1.07 ioc20 AC 1.01 0.08 2.45 0.07 0.51 ioc21 AC 0.92 0.08 2.91 0.08 1.07 B. Perrot, E. Bataille, and J.-B. Hardouin 41

ioc22 AE 1.00 . 3.48 0.06 0.89 0.25 ioc23 AE 0.85 0.14 4.22 0.05 0.44 ioc24 AE 1.62 0.23 3.65 0.06 0.32 ioc25 AE 1.53 0.22 3.26 0.06 0.51 ioc26 LI 1.00 . 2.21 0.07 1.04 0.27 ioc27 LI 1.70 0.23 2.26 0.07 0.63 ioc28 LI 1.55 0.21 2.02 0.07 0.63 ioc29 LI 1.58 0.22 2.21 0.07 0.82 ioc30 LI 1.83 0.26 2.47 0.08 0.83 ioc31 LI 1.65 0.23 2.08 0.07 0.56 ioc32 LI 1.33 0.21 3.17 0.08 1.21 ioc33 MOC 1.00 . 3.06 0.07 1.09 0.29 ioc34 MOC 1.39 0.19 3.02 0.06 0.61 ioc35 MOC 1.66 0.22 2.84 0.07 0.52 ioc36 MOC 1.58 0.21 2.46 0.07 0.57 ioc37 MOC 2.01 0.26 2.61 0.07 0.29

Covariances between dimensions:
        HA     PSE     W      BCC     AC     AE     LI     MOC
HA     0.16     .      .       .       .      .      .      .
PSE    0.11   0.32     .       .       .      .      .      .
W      0.11   0.04   1.19      .       .      .      .      .
BCC    0.22   0.07   0.14    1.08      .      .      .      .
AC     0.16  -0.02   0.65   -0.06    1.07     .      .      .
AE     0.08   0.04   0.46    0.65   -0.04   0.25     .      .
LI     0.10   0.17   0.09    0.06    0.06  -0.07   0.27     .
MOC    0.06   0.01   0.38    0.39    0.29   0.02   0.10   0.29

Goodness of fit:

     chi2    df   chi2/df        RMSEA [90% CI]         SRMR     NFI
  1103.86   600       1.8   0.054 [0.049 ; 0.059]      0.074   0.796
  (p-value = 0.000)

      RNI    CFI     IFI     MCI
    0.894  0.894   0.895   0.421

The cfa option uses the official Stata command sem to perform a CFA. The option cfacov(ioc1*ioc3) introduces a covariance between the errors of ioc1 and ioc3. Goodness-of-fit indices computed by Stata or taken from Gadelrab (2010) are given below the tables of estimates.

In this example, the RMSEA is < 0.06, which indicates acceptable fit. However, the CFI is only 0.89. To improve the CFI, we could allow covariances between the measurement errors of further item pairs by using cfacov() or cfacfi() (see the option descriptions in section 3.3). A warning is displayed because the items have only five response categories. In that case, we could use cfasb to apply the Satorra–Bentler adjustment after maximum likelihood estimation.
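For readers who want to see the underlying model, the following is a hedged sketch of the kind of sem specification the cfa option wraps, shown only for the first two dimensions to keep it short. The error covariance mirrors cfacov(ioc1*ioc3); the exact specification validscale builds internally may differ.

    * Two-dimension CFA sketch with one error covariance, fit by maximum likelihood
    sem (HA  -> ioc1 ioc2 ioc3 ioc4)   ///
        (PSE -> ioc5 ioc6 ioc7 ioc8),  ///
        covariance(e.ioc1*e.ioc3) method(ml)
    estat gof, stats(all)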

The convdiv option

. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) convdiv convdivboxplot
Items used to compute the scores
  (output omitted )

Correlation matrix

HA PSE W BCC AC AE LI MOC

ioc1 0.266 0.171 0.319 0.278 0.262 0.208 0.243 0.093 ioc2 0.535 0.343 0.382 0.206 0.102 0.363 0.202 0.122 ioc3 0.536 0.346 0.306 0.265 0.220 0.258 0.196 0.106 ioc4 0.403 0.274 0.229 0.064 0.002 0.332 0.132 0.266 ------ioc5 0.359 0.362 0.308 0.172 0.107 0.242 0.174 0.126 ioc6 0.296 0.609 0.083 0.013 0.084 0.391 0.077 0.286 ioc7 0.316 0.418 0.102 -0.007 0.052 0.423 -0.011 0.382 ioc8 0.157 0.546 0.024 0.002 0.077 0.321 0.046 0.253 ------ioc9 0.364 0.181 0.743 0.359 0.280 0.184 0.530 -0.024 ioc10 0.216 0.072 0.643 0.364 0.258 0.057 0.466 -0.114 ioc11 0.394 0.045 0.666 0.335 0.156 0.166 0.408 -0.010 ioc12 0.411 0.163 0.734 0.433 0.327 0.230 0.476 -0.020 ioc13 0.318 0.165 0.792 0.389 0.323 0.180 0.524 -0.027 ioc14 0.390 0.191 0.698 0.378 0.272 0.186 0.425 0.062 ioc15 0.427 0.145 0.741 0.424 0.285 0.185 0.431 0.011 ------ioc16 0.268 0.023 0.421 0.673 0.352 0.174 0.546 0.046 ioc17 0.237 0.057 0.344 0.650 0.477 0.138 0.474 -0.033 ioc18 0.233 0.076 0.451 0.565 0.451 0.049 0.509 -0.068 ------ioc19 0.176 0.042 0.330 0.508 0.647 0.146 0.376 -0.056 ioc20 0.160 0.072 0.327 0.454 0.701 0.143 0.414 -0.109 ioc21 0.184 0.147 0.258 0.310 0.620 0.184 0.338 -0.025 ------ioc22 0.271 0.310 0.225 0.137 0.170 0.389 0.216 0.142 ioc23 0.373 0.321 0.222 0.123 0.083 0.435 0.157 0.158 ioc24 0.287 0.391 0.119 0.041 0.083 0.643 0.016 0.271 ioc25 0.287 0.394 0.057 0.127 0.185 0.527 0.053 0.292 ------ioc26 0.149 0.098 0.235 0.324 0.214 0.171 0.386 0.107 ioc27 0.277 0.162 0.552 0.413 0.316 0.167 0.635 -0.041 ioc28 0.190 -0.000 0.398 0.352 0.323 0.035 0.648 -0.080 ioc29 0.171 0.062 0.344 0.391 0.241 0.048 0.583 0.038 ioc30 0.220 0.147 0.583 0.414 0.309 0.146 0.573 -0.024 ioc31 0.182 0.014 0.396 0.472 0.348 0.048 0.630 -0.053 ioc32 0.201 0.095 0.333 0.558 0.365 0.130 0.416 0.063 ------ioc33 0.253 0.266 0.184 0.117 0.044 0.280 0.210 0.433 ioc34 0.231 0.322 0.053 0.010 0.004 0.199 0.026 0.652 ioc35 0.082 0.250 -0.144 -0.124 -0.181 0.175 -0.118 0.673 ioc36 0.117 0.277 -0.053 -0.012 -0.044 0.225 -0.002 0.637 ioc37 0.111 0.312 -0.115 -0.007 -0.047 0.235 -0.046 0.748 ------Convergent validity: 33/37 items (89.2%) have a correlation coefficient with the score of their own dimension greater than 0.400 B. Perrot, E. Bataille, and J.-B. Hardouin 43

Divergent validity: 33/37 items (89.2%) have a correlation coefficient with the score of their own dimension greater than those computed with other scores.


Figure 4. Correlations between items of HA and scores


Figure 5. Correlations between items of PSE and scores

convdiv assesses convergent and divergent (discriminant) validities through the examination of a correlation matrix. The elements of this matrix are the correlation coefficients between items and rest scores.

Elements on the diagonal (correlations between an item and the rest score of its dimension) are displayed in bold font in Stata. On the diagonal, values less than tconvdiv(#) (0.4 by default) are displayed in red, indicating a lack of convergent validity. For each row, off-diagonal values greater than the value on the diagonal are displayed in red, indicating a lack of divergent validity. In this example, the value of the correlation coefficient between ioc1 and HA would be displayed in red because 0.266 < 0.4. The values of the correlation coefficients between ioc1 and W and between ioc1 and BCC would also be displayed in red because 0.319 > 0.266 and 0.278 > 0.266, respectively.

The convdivboxplot option produces boxplots representing the values of the correlation matrix. In this example, because there are eight subscales, eight graphs composed of eight boxes each are generated. Figures 4 and 5 correspond to two of these graphs (the six remaining graphs are not shown). In figure 4, we expect the first boxplot to be the "highest" in the graph because it represents the correlations between HA and its own items. We also expect these correlations to be ≥ 0.4. In figure 5, we expect the second boxplot to be higher than the others because it represents the correlations between PSE and its own items.
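One cell of the matrix can be reproduced by hand. The following is a hedged sketch for ioc1 and the rest score of its own dimension; the figure should be close to the 0.266 reported above, up to the command's exact treatment of missing data.

    * Correlation between ioc1 and the rest score of its dimension (HA without ioc1)
    egen HA_rest1 = rowtotal(ioc2 ioc3 ioc4)
    pwcorr ioc1 HA_rest1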

The repet() option

. set seed 1234
. foreach v of varlist ioc1-ioc37 {
  2.     generate `v'_2 = round(rnormal(`v',0.5))
  3. }
  (output omitted )
. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) repet(ioc1_2-ioc37_2) kappa ickappa(500)
  (output omitted )

Reproducibility

Dimension n Item Kappa 95% CI for Kappa ICC 95% CI for ICC (bootstrapped)

HA 368 ioc1 0.59 [ 0.52 ; 0.66] 0.96 [ 0.96 ; 0.97] ioc2 0.53 [ 0.46 ; 0.59] ioc3 0.52 [ 0.45 ; 0.58] ioc4 0.57 [ 0.50 ; 0.64] PSE 367 ioc5 0.62 [ 0.46 ; 0.59] 0.97 [ 0.97 ; 0.98] ioc6 0.60 [ 0.54 ; 0.66] ioc7 0.57 [ 0.51 ; 0.64] ioc8 0.62 [ 0.57 ; 0.69] W 366 ioc9 0.62 [ 0.45 ; 0.58] 0.99 [ 0.98 ; 0.99] ioc10 0.56 [ 0.51 ; 0.63] ioc11 0.53 [ 0.46 ; 0.59] ioc12 0.59 [ 0.53 ; 0.65] ioc13 0.61 [ 0.55 ; 0.67] ioc14 0.56 [ 0.51 ; 0.62] ioc15 0.56 [ 0.49 ; 0.61] BCC 369 ioc16 0.57 [ 0.50 ; 0.64] 0.98 [ 0.97 ; 0.98] ioc17 0.56 [ 0.49 ; 0.61] ioc18 0.60 [ 0.54 ; 0.66] AC 366 ioc19 0.57 [ 0.55 ; 0.67] 0.98 [ 0.97 ; 0.98] ioc20 0.54 [ 0.48 ; 0.61] ioc21 0.55 [ 0.49 ; 0.61] AE 368 ioc22 0.54 [ 0.54 ; 0.66] 0.96 [ 0.96 ; 0.97] ioc23 0.52 [ 0.46 ; 0.59] ioc24 0.59 [ 0.52 ; 0.65] ioc25 0.58 [ 0.51 ; 0.64] LI 366 ioc26 0.63 [ 0.51 ; 0.64] 0.98 [ 0.98 ; 0.99] ioc27 0.58 [ 0.52 ; 0.64] ioc28 0.56 [ 0.50 ; 0.63] ioc29 0.60 [ 0.55 ; 0.67] ioc30 0.61 [ 0.54 ; 0.66] ioc31 0.56 [ 0.48 ; 0.61] ioc32 0.57 [ 0.51 ; 0.63] MOC 361 ioc33 0.62 [ 0.57 ; 0.69] 0.98 [ 0.97 ; 0.98] ioc34 0.57 [ 0.52 ; 0.64] ioc35 0.57 [ 0.51 ; 0.63] ioc36 0.63 [ 0.56 ; 0.69] ioc37 0.59 [ 0.53 ; 0.65] 46 validscale: A command to validate measurement scales

The repet() option computes ICCs and their corresponding confidence intervals with the Stata command icc to assess the reproducibility of the scores; varlist is the list of items measured at time 2. The kappa option computes kappa statistics with the Stata command kap to assess the reproducibility of the items. Computation of confidence intervals for the kappa coefficients is possible with the ickappa() option, which is based on the community-contributed command kapci. If the scores() option is used, one would probably want to also use the scores2() option to indicate the variables corresponding to the scores computed at time 2. In that case, the ICCs are based on these scores rather than on a combination of items as defined by compscore(). Because there was only one time of measurement for this questionnaire, responses to items ioc1-ioc37 at the second time of measurement have been simulated for the sake of this example.

Kappas > 0.6 would indicate good reproducibility of items, and ICCs > 0.75 would indicate good reproducibility of scores.
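A hedged by-hand version of these two ingredients is sketched below for one item and one score. HA and HA_2 are assumed to be score variables built by the user for the two occasions (for example, with rowtotal, as in the earlier sketch); the reshape is there only because icc expects long data.

    * Agreement of one item across the two occasions
    kap ioc1 ioc1_2

    * ICC of the HA score across occasions (HA and HA_2 assumed already created)
    preserve
    rename HA HA_1
    generate long id = _n
    reshape long HA_, i(id) j(occasion)
    icc HA_ id occasion
    restore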

The kgv() option

. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) kgv(chemo) kgvboxplot kgvgroup
  (output omitted )

Known-groups validity

        chemo         mean     sd    p-value
HA      0 (n=106)     3.71    0.76   0.101 (KW: 0.060)
        1 (n=245)     3.85    0.76
PSE     0 (n=105)     3.20    0.85   0.042 (KW: 0.029)
        1 (n=245)     3.40    0.85
W       0 (n=105)     3.10    0.90   0.535 (KW: 0.471)
        1 (n=244)     3.17    1.01
BCC     0 (n=105)     2.87    1.13   0.009 (KW: 0.011)
        1 (n=247)     3.20    1.06
AC      0 (n=105)     2.58    1.10   0.011 (KW: 0.014)
        1 (n=245)     2.91    1.13
AE      0 (n=104)     3.62    0.65   0.187 (KW: 0.095)
        1 (n=247)     3.73    0.74
LI      0 (n=104)     2.29    0.80   0.157 (KW: 0.215)
        1 (n=246)     2.42    0.85
MOC     0 (n=103)     2.76    0.83   0.213 (KW: 0.190)
        1 (n=242)     2.90    0.93


Figure 6. Known-groups validity: chemotherapy/no chemotherapy

The kgv() option assesses the known-groups validity of the questionnaire. In this example, an ANOVA is performed to compare the scores between women who had chemotherapy and those who had not.

A Kruskal–Wallis test is also performed in case the ANOVA assumptions are not met. We can see that respondents who received chemotherapy scored higher than the others on the BCC (body change concerns) dimension (3.20 versus 2.87, p = 0.009). kgvboxplots produces boxplots of the scores by the group variable; kgvgroupboxplots groups the boxplots into a single graph (figure 6).
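The two tests behind this output can also be run directly. The following is a hedged sketch for one score, assuming a user-computed BCC score variable (for example, built with rowtotal):

    * ANOVA and Kruskal-Wallis comparison of the BCC score across chemo groups
    oneway BCC chemo, tabulate
    kwallis BCC, by(chemo)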

The conc() option

. validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)
>     scorename(HA PSE W BCC AC AE LI MOC) conc(sf12mcs sf12pcs)
  (output omitted )

Concurrent validity

        sf12mcs   sf12pcs
HA        -0.17     -0.14
PSE       -0.04     -0.10
W         -0.44     -0.21
BCC       -0.48     -0.44
AC        -0.26     -0.15
AE        -0.16     -0.07
LI        -0.49     -0.42
MOC        0.12     -0.00

In this example, the conc() option assesses concurrent validity by examining the correlations between the scores of the Impact Of Cancer version 2 and the SF-12 (Ware, Kosinski, and Keller 1996). Correlation coefficients ≤ −0.40 or ≥ 0.40 are displayed in bold font in Stata. This threshold can be changed with tconc(#). For instance, the worry (W) score is (negatively) correlated with the mental component summary score (sf12mcs) of the SF-12 (ρ = −0.44) but only moderately correlated with the physical component summary score (sf12pcs) of the SF-12 (ρ = −0.21).
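The same check can be done directly with pwcorr; a hedged sketch, assuming user-computed score variables for the negative-impact dimensions:

    * Correlations between selected scores and the SF-12 summary scores
    pwcorr W BCC LI sf12mcs sf12pcs, obs sig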

4.3 Complex syntax of validscale

An example of a complex syntax with most of the available options is given below.

    validscale ioc1-ioc37, partition(4 4 7 3 3 4 7 5)                       ///
        scorename(HA PSE W BCC AC AE LI MOC) categories(1 5) impute(pms)    ///
        noround compscore(sum) descitems graphs cfa cfamethod(ml) cfastand  ///
        cfacov(ioc1*ioc3 ioc2*ioc4) convdiv tconvdiv(0.4) convdivboxplots   ///
        alpha(0.7) delta(0.9) h(0.3) hjmin(0.3) repet(ioc1_2-ioc37_2)       ///
        kappa ickappa(500) kgv(chemo radio) kgvboxplots kgvgroupboxplots    ///
        conc(sf12mcs sf12pcs) tconc(0.4)

5 Implementation of validscale

5.1 Dialog box

A dialog box is available if you want to use validscale more intuitively. The db validscale command displays the dialog box shown in figure 7 (only the first tab is shown here).

Figure 7. Dialog box for validscale

5.2 Online implementation with Numerics by Stata

validscale will soon be implemented as a command in an online application called the PRO-online project, which is dedicated to the analysis of patient-reported outcomes. The PRO-online project aims to offer students and researchers a way to perform analyses in the field of classical test theory or item response theory on their own data in a user-friendly way, without complex handling of the data. Analyses are performed with Numerics by Stata.

6 Conclusion

validscale is a command that performs analyses to assess the psychometric properties of subjective measurement scales in the framework of CTT in a user-friendly way. It provides information on structural validity, convergent and discriminant validities, reproducibility, known-groups validity, internal consistency, and scalability.

Other theories of measurement coexist in psychometrics, particularly IRT and RMT. In these two theories, a latent construct is defined from the relationships between items. Stata provides several commands for IRT (through the irt suite). RMT can be used in Stata through the community-contributed command raschtest (Hardouin 2007c) for dichotomous items or pcmodel (Hamel et al. 2016) for polytomous items. Finally, validscale can be used in a user-friendly way with a dialog box and will soon be implemented in an online application with Numerics by Stata.

7 References

Blanchette, D. 2010. lstrfun: Stata module to modify long local macros. Statistical Software Components S457169, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457169.html.

Crespi, C. M., P. A. Ganz, L. Petersen, A. Castillo, and B. Caan. 2008. Refinement and psychometric evaluation of the impact of cancer scale. Journal of the National Cancer Institute 100: 1530–1541.

Fayers, P., and D. Machin. 2007. Quality of Life: The Assessment, Analysis and Interpretation of Patient-Reported Outcomes. 2nd ed. Chichester, UK: Wiley.

Gadelrab, H. 2010. Evaluating the Fit of Structural Equation Models: Sensitivity to Specification Error and Descriptive Goodness-of-Fit Indices. Saarbrücken, Germany: Lambert Academic Publishing.

Hamel, J.-F. 2014. mi twoway: Stata module for computing scores on questionnaires containing missing item responses. Statistical Software Components S457901, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457901.html.

Hamel, J.-F., V. Sébille, G. Challet-Bouju, and J.-B. Hardouin. 2016. Partial credit model: Estimations and tests of fit with pcmodel. Stata Journal 16: 464–481.

Hardouin, J.-B. 2004. loevh: Stata module to compute Guttman errors and Loevinger H coefficients. Statistical Software Components S439401, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s439401.html.

———. 2007a. delta: Stata module to compute the Delta index of scale discrimination. Statistical Software Components S456851, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s456851.html.

———. 2007b. imputeitems: Stata module to impute missing data of binary items. Statistical Software Components S456807, Department of Economics, Boston College. https://econpapers.repec.org/software/bocbocode/s456807.htm.

———. 2007c. Rasch analysis: Estimation and tests with raschtest. Stata Journal 7: 22–44.

Hardouin, J.-B., A. Bonnaud-Antignac, and V. Sébille. 2011. Nonparametric item response theory using Stata. Stata Journal 11: 30–51.

Reichenheim, M. E. 2004. Confidence intervals for the kappa statistic. Stata Journal 4: 421–428.

Ware, J. E., Jr., M. Kosinski, and S. D. Keller. 1996. A 12-item short-form health survey: Construction of scales and preliminary tests of reliability and validity. Medical Care 34: 220–233.

About the authors

Bastien Perrot is a PhD student in biostatistics at the SPHERE U1246 “methodS in Patient-centered outcomes and HEalth ResEarch” unit at the University of Nantes, France. Part of his research deals with developing and validating measurement scales with applications in health.

Emmanuelle Bataille is a biostatistician at the SPHERE U1246 “methodS in Patient-centered outcomes and HEalth ResEarch” unit at the University of Nantes, France.

Jean-Benoit Hardouin is a teacher–researcher in biostatistics at the SPHERE U1246 “methodS in Patient-centered outcomes and HEalth ResEarch” unit at the University of Nantes, France. His research deals with psychometrics and especially with applying item response theory in clinical trials.

The Stata Journal (2018) 18, Number 1, pp. 51–75

A command for fitting mixture regression models for bounded dependent variables using the beta distribution

Laura A. Gray
School of Health and Related Research
Health Economics and Decision Science
University of Sheffield
Sheffield, UK
laura.gray@sheffield.ac.uk

Mónica Hernández Alava
School of Health and Related Research
Health Economics and Decision Science
University of Sheffield
Sheffield, UK
monica.hernandez@sheffield.ac.uk

Abstract. In this article, we describe the betamix command, which fits mixture regression models for dependent variables bounded in an interval. The model is a generalization of the truncated inflated beta regression model introduced in Pereira, Botter, and Sandoval (2012, Communications in Statistics—Theory and Methods 41: 907–919) and the mixture beta regression model in Verkuilen and Smithson (2012, Journal of Educational and Behavioral Statistics 37: 82–113) for variables with truncated supports at either the top or the bottom of the distribution. betamix accepts dependent variables defined in any range that are then transformed to the interval (0, 1) before estimation.

Keywords: st0513, betamix, truncated inflated beta mixture, beta regression, mixture model, cross-sectional data, mapping

1 Introduction

Continuous response variables that are bounded at both ends arise in many areas. Dependent variables measuring proportions, ratios, and rates are common in the empirical literature. They are often limited to the open unit interval (0, 1), but in many cases, values at both boundaries are not only possible but appear with high frequency. Applications where the variables are bounded in alternative intervals linearly transform the dependent variable to the (0, 1) interval. Some examples include modeling the rates of employee participation in pension plans (Papke and Wooldridge 1996), the percentage of women on municipal councils or executive committees (De Paola, Scoppa, and Lombardo 2010), an index measuring central bank independence (Berggren, Daunfeldt, and Hellström 2014), the proportion of a firm's total capital accounted for by its long-term debt (Cook, Kieschnick, and McCullough 2008), quality-adjusted life years (Basu and Manca 2012), and the score of reading accuracy (Smithson and Verkuilen 2006).

Modeling variables bounded at both ends presents several problems. The usual linear regression model is not appropriate for bounded dependent variables, because the predictions of the model can lie outside the boundary limits.

A common solution is to transform the dependent variable so that it takes values in the real line and then use standard regression models on the transformed dependent variable. However, this approach has an important limitation in that it ignores that the moments of the distribution of a bounded variable are related; as the mean response moves toward a boundary value, the variance and skewness of the variable will tend to decrease and increase, respectively. Fractional response models (Papke and Wooldridge 1996, 2008) and models based on the beta distribution have been suggested as alternatives (Paolino 2001; Kieschnick and McCullough 2003; Ferrari and Cribari-Neto 2004; Smithson and Verkuilen 2006).

Fractional response models (Papke and Wooldridge 1996, 2008) assume that the dependent variable takes values in the unit interval [0, 1]. Papke and Wooldridge (1996) specified a functional form for the conditional mean of the dependent variable and proposed the use of a quasilikelihood procedure to estimate the parameters. These models are very useful if the main interest is in the conditional mean of the dependent variable and, if the conditional mean is correctly specified, the parameter estimates are consistent.

The standard beta regression model (Paolino 2001; Ferrari and Cribari-Neto 2004; Smithson and Verkuilen 2006) assumes that the dependent variable is continuous in the open unit interval (0, 1). Unlike the fractional response model described above, the beta regression model assumes a distribution for the dependent variable conditional on the covariates, and its parameters are estimated using maximum likelihood. A drawback of this model is that distributional misspecification leads to inconsistent parameter estimates. However, it is more suitable if the interest is in the whole distribution. This model has been generalized to allow for values at either boundary or both boundaries by adding a degenerate distribution with probability masses at the boundary values (Cook, Kieschnick, and McCullough 2008; Ospina and Ferrari 2010, 2012b; Basu and Manca 2012). Pereira, Botter, and Sandoval (2012, 2013) extend the framework to model variables such as the ratio of the unemployment benefit to the maximum benefit. This ratio can take the value of zero (if the person is not eligible) or any real number in the interval (τ, 1), where τ is the minimum benefit. The ratio is also likely to have positive probabilities at the values τ and at 1. This model was termed the truncated inflated beta distribution. The model is a mixture of the beta distribution in the interval (τ, 1) and the trinomial distribution with probability masses at 0, τ, and 1.

A related strand of the literature extends the standard beta regression model to allow for mixtures of C components of beta regressions. This extension is helpful when the distribution of the dependent variable presents characteristics that cannot be captured by a single beta distribution, such as multimodality. Allowing for mixtures can help overcome misspecification problems in the conditional distribution of the dependent variable. Some examples of mixtures of beta distributions are found in Ji et al. (2005), Verkuilen and Smithson (2012), and Kent et al. (2015). In Gray, Hernández Alava, and Wailoo (Forthcoming), we combine these extensions into a single model to address the problems of modeling health-related quality of life (HRQoL) in health economics.1

1. An alternative model in this context is the adjusted limited dependent variable mixture model, which can be fit using the community-contributed aldvmm command discussed in Hernández Alava and Wailoo (2015).

In this article, we present the command betamix, which can be used to fit mixture regression models for dependent variables that are bounded in an interval and can have truncated supports either at the top or at the bottom of the distribution. It extends one of the parameterizations of the community-contributed commands betafit and zoib and the Stata command betareg in several directions.2 First, it generalizes them to mixtures of beta distributions, allowing the model to capture multimodality. Second, it allows the user to model response variables that have a gap between one of the boundaries and the continuous part of the distribution. Third, it can deal with positive probabilities at either boundary or both boundaries and at the truncation point. Fourth, there is no need to manually transform response variables defined in intervals other than (0, 1), because betamix will transform the dependent variable using the supplied options. This article is organized as follows: section 2 gives a brief overview of the model; section 3 describes the betamix syntax and options, including the syntax for predict; section 4 illustrates the syntax of the command and the interpretation of the model using a fictional dataset; and section 5 concludes.

2 A general beta mixture regression model

There are two possible parameterizations of the beta distribution bounded in the interval (0, 1). The most common one uses two shape parameters (Johnson, Kotz, and Balakrishnan 1995). An alternative parameterization presented in Ferrari and Cribari-Neto (2004) defines the model in terms of its mean, μ, and a precision parameter, φ. In this parameterization, the mean and the variance of y are given by

\[
E(y) = \mu, \qquad a < \mu < b, \qquad \mathrm{Var}(y) = \frac{(\mu - a)(b - \mu)}{1 + \phi}
\]

The variance of y is a function of μ (the mean of y) and decreases as the precision parameter φ increases. The density of the variable y can then be written as

\[
f(y; \mu, \phi, a, b) = \frac{\Gamma(\phi)\,(y - a)^{\left(\frac{\mu - a}{b - a}\right)\phi - 1}\,(b - y)^{\left(\frac{b - \mu}{b - a}\right)\phi - 1}}{\Gamma\!\left(\frac{\mu - a}{b - a}\,\phi\right)\Gamma\!\left(\frac{b - \mu}{b - a}\,\phi\right)(b - a)^{\phi - 1}}, \qquad y \in (a, b)
\tag{1}
\]

where Γ(·) is the gamma function (see Pereira, Botter, and Sandoval [2012]). The transformed variable $y_T = (y - a)/(b - a)$ follows a beta distribution on the standard unit interval (0, 1) with mean $(\mu - a)/(b - a)$ and precision parameter $\phi$.
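As a quick check of (1), setting $a = 0$ and $b = 1$ recovers the beta regression density of Ferrari and Cribari-Neto (2004),

\[
f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\{(1 - \mu)\phi\}}\, y^{\mu\phi - 1}(1 - y)^{(1 - \mu)\phi - 1}, \qquad y \in (0, 1)
\]

so (1) corresponds to a standard beta distribution with shape parameters $\{(\mu - a)/(b - a)\}\phi$ and $\{(b - \mu)/(b - a)\}\phi$, rescaled to the interval $(a, b)$.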

2. The community-contributed commands betamix, betafit (Buis, Cox, and Jenkins 2003), and zoib (Buis 2010) use a logit and log link for the conditional mean and the conditional scale, respectively. The Stata command betareg allows for a number of different links for both.

Beta distributions are convenient in modeling because they can display a variety of shapes depending on the values of their two parameters, μ and φ. They are symmetric if μ = (a + b)/2 and asymmetric for any other value of μ. They can also be bell-, J-, and U-shaped. Figure 1 plots several beta probability densities for alternative combinations of μ and φ for a variable defined in the (0, 1) interval. At a value of μ = 0.5 and small values of φ, the beta distribution is U-shaped; if μ = 0.5 and φ = 2, the distribution becomes the uniform distribution; and as φ increases, the variance decreases, and the distribution becomes more concentrated around its mean.
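The uniform case provides a simple numerical check of the variance expression above: with $a = 0$, $b = 1$, $\mu = 0.5$, and $\phi = 2$,

\[
\mathrm{Var}(y) = \frac{(0.5 - 0)(1 - 0.5)}{1 + 2} = \frac{1}{12}
\]

which is indeed the variance of the uniform distribution on (0, 1).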


Figure 1. Probability density of the beta distribution for alternative combinations of μ and φ

Given a sample $y_1, y_2, \ldots, y_n$ of independent random variables, each following the probability density in (1), the beta regression model can be obtained by assuming that a function $v(\cdot)$ of the mean of $y_i$ can be written as a linear combination of the set of covariates in the vector $z_i$,

\[
v\!\left(\frac{\mu_i - a}{b - a}\right) = z_i'\beta
\]

where β is a vector of parameters and $v(\cdot)$ is a strictly monotonic and twice differentiable link function that maps the open interval (0, 1) into $\mathbb{R}$. This parameterization simplifies the interpretation of β. A number of link functions can be used, but the logit link is commonly found in applications because the coefficients of the regression can be interpreted as log odds. Using the logit link, we can write the mean of $y_i$ as

\[
\mu_i(z_i; \beta) = a + (b - a)\,\frac{\exp(z_i'\beta)}{1 + \exp(z_i'\beta)}
\]

Pereira, Botter, and Sandoval (2012, 2013) present a more general beta regression model for variables defined at zero and in the interval [τ, 1], which they named the “truncated inflated beta regression model”. It is a mixture model of a multinomial distribution (with probability masses at 0, τ, and 1) and a beta distribution defined in the open interval (τ, 1). In Gray, Hernández Alava, and Wailoo (Forthcoming), we extend the framework to the case where the second part can be a mixture of C components of beta distributions, incorporating the beta mixture described in Verkuilen and Smithson (2012).

Mixtures of beta distributions can display a number of distributional shapes. Figure 2 shows two examples. The left panel plots a 50:50 mixture of two beta distributions, both with the same relatively high precision parameter φ = 50 but with very different means, μ = 0.05 and μ = 0.80. This mixture displays the usual bimodal shape. The right panel also plots a 50:50 mixture, but the means of the two components are closer together (μ = 0.60 and μ = 0.75), and the precision parameters are different (φ = 10 and φ = 100, respectively). This mixture density is asymmetric with a bump on the left tail. Both densities show characteristics that cannot be captured with a single beta distribution.
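A plot such as the left panel of figure 2 can be reproduced with standard Stata tools; the following sketch (not part of betamix) uses betaden(), whose shape arguments relate to the mean–precision parameterization through α = μφ and β = (1 − μ)φ:

. * 50:50 mixture of two beta densities with mu1 = 0.05, mu2 = 0.80, phi1 = phi2 = 50
. twoway function y = 0.5*betaden(0.05*50, 0.95*50, x)
>     + 0.5*betaden(0.80*50, 0.20*50, x), range(0 1)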


Figure 2. Mixture densities of beta distributions

Let us assume that the response variable $y_i$ is defined at the point $a$ and in the interval $[\tau, b]$ with $a < \tau < b$. Conditional on the covariate vectors $x_{i1}$, $x_{i2}$, and $x_{i3}$, the distribution of $y_i$ combines probability masses at $a$, $\tau$, and $b$ with a continuous density on $(\tau, b)$,

\[
g(y_i \mid x_{i1}, x_{i2}, x_{i3}) =
\begin{cases}
P(y_i = k \mid x_{i3}) & \text{if } y_i = k,\; k \in \{a, \tau, b\} \\[4pt]
\Bigl\{1 - \sum_{s = a, \tau, b} P(y_i = s \mid x_{i3})\Bigr\}\, h(y_i \mid x_{i1}, x_{i2}) & \text{if } y_i \in (\tau, b)
\end{cases}
\tag{2}
\]

The probabilities $P(y_i \mid x_{i3})$ are derived from a multinomial logit model

\[
P(y_i = k \mid x_{i3}) = \frac{\exp(x_{i3}'\gamma_k)}{1 + \sum_{s = a, \tau, b} \exp(x_{i3}'\gamma_s)} \qquad \text{for } k = a, \tau, b
\tag{3}
\]

where $x_{i3}$ is a column vector of variables that affect the probability of a boundary value of the response variable and $\gamma_k$ is the vector of corresponding coefficients. The probability density function $h(\cdot)$ is a mixture of $C$ components of beta distributions with means $\mu_{ci}(x_{i1}; \beta_c)$ and precision parameters $\phi_c$, where $c = 1, \ldots, C$,

\[
h(y_i \mid x_{i1}, x_{i2}) = \sum_{c=1}^{C} P(c \mid x_{i2})\, f\{y_i; \mu_{ci}(x_{i1}; \beta_c), \phi_c, \tau, b\}
\tag{4}
\]

where $f(\cdot)$ is the beta density defined in (1). A multinomial logit model for the probability of latent class membership is assumed to be

\[
P(c \mid x_{i2}) = \frac{\exp(x_{i2}'\delta_c)}{\sum_{j=1}^{C} \exp(x_{i2}'\delta_j)}
\tag{5}
\]

where $x_{i2}$ is a vector of variables that affect the probability of component membership, $\delta_c$ is the vector of corresponding coefficients, and $C$ is the number of classes used in the analysis. One set of coefficients $\delta_c$ is normalized to zero for identification. In the intercept-only model, the probabilities of component membership are constant for all individuals.

Using (2), (3), (4), and (5), we can write the log likelihood of the sample $y_1, y_2, \ldots, y_n$ as

\[
\begin{aligned}
\ln l(\gamma, \beta, \delta, \phi) ={}& \sum_{i:\,y_i = a} \ln P(y_i = a \mid x_{i3}, \gamma) + \sum_{i:\,y_i = \tau} \ln P(y_i = \tau \mid x_{i3}, \gamma) \\
&+ \sum_{i:\,y_i = b} \ln P(y_i = b \mid x_{i3}, \gamma)
 + \sum_{i:\,y_i \in (\tau, b)} \ln\!\left\{1 - \sum_{s = a, \tau, b} P(y_i = s \mid x_{i3}, \gamma)\right\} \\
&+ \sum_{i:\,y_i \in (\tau, b)} \ln\!\left[\sum_{c=1}^{C} P(c \mid x_{i2})\, f\{y_i; \mu_{ci}(x_{i1}; \beta_c), \phi_c, \tau, b\}\right]
\end{aligned}
\]

where $i = 1, \ldots, n$.

The command betamix described in the section below can fit models with and without truncation and models where the truncation is either at the bottom or at the top of the interval range. The simplest model it can fit is a beta regression model using the alternative parameterization described above. This model can already be fit in Stata using betareg or the community-contributed command betafit. In addition, the command betamix can fit a finite mixture model using a beta distribution. If there are boundary values, betamix warns the user and adds a small amount of noise (1e–6) to the boundary values after the response variable has been transformed to the interval [0, 1] as in Basu and Manca (2012). This solution is not satisfactory when there are many observations at the boundary values. Provided there is theoretical justification, the user can request that betamix add a second part to the model to add probability masses at any combination of the boundary points and the truncation value (if there is one).3

The description of the model above assumes constant precision parameters, but betamix allows the precision parameters to depend on covariates using a log link such that

\[
\ln(\phi_i) = x_{i4}'\alpha
\]

These models tend to be more difficult to fit and require good starting values. A good procedure to follow here is to start by fitting a model with constant precision as a stepping stone for the full model (Verkuilen and Smithson 2012).

We recommend that the reader become familiar with the idiosyncrasies of fitting mixture models (McLachlan and Peel 2000) before attempting to fit one. In particular, it is important to emphasize that mixture models are known to have multiple optima, and it is important to search for a global solution. Determining the number of components in a mixture is also not straightforward, and the analyst must exercise judgment in determining the appropriate number of components. Likelihood-ratio tests cannot be used to compare models with different numbers of components, because doing so involves testing at the edge of the parameter space. The Bayesian information criterion (BIC) has been proposed as a useful indicator of the appropriate number of components, but other approaches also exist.

Exercise caution when using maximum likelihood estimation in small samples. Bias is usually not a problem in large samples, but in small samples, bias-corrected procedures are needed. It has been shown (Ospina, Cribari-Neto, and Vasconcellos 2006, 2011; Kosmidis and Firth 2010; Ospina and Ferrari 2012a) that for beta regressions, the biases of the regression parameters tend to be small, but larger biases are found for the precision parameter. In addition, the standard errors of the parameters are systematically underestimated, leading to an exaggeration of the parameters' significance. Ospina, Cribari-Neto, and Vasconcellos (2006) and Kosmidis and Firth (2010) use the same dataset (sample size n = 32) to compare the effect of different adjustment procedures on the 12 estimated parameters and their standard errors. The present version of betamix does not implement any bias-correction procedures for small samples.

3. The online supplementary material in Smithson and Verkuilen (2006) discusses alternative methods and recommends checking the sensitivity of the estimated parameters to different procedures.

3 Command syntax

3.1 betamix

Syntax

    betamix depvar [if] [in] [weight] [, muvar(varlist) phivar(varlist)
        ncomponents(#) probabilities(varlist) pmass(numlist) pmvar(varlist)
        lbound(#) ubound(#) trun(trun) tbound(#) constraints(constraints)
        vce(vcetype) level(#) maximize_options search(spec) repeat(#)]

Description

betamix is a community-contributed command that fits a generalized beta regression model using the truncated inflated beta model of Pereira, Botter, and Sandoval (2013) and the mixture beta regression model in Verkuilen and Smithson (2012).

Options

muvar(varlist) specifies a set of variables to be included in the mean of the beta regression mixtures. The default is a constant mean.

phivar(varlist) specifies a set of variables to be included in the precision of the beta regression mixtures. The default is a constant precision parameter.

ncomponents(#) specifies the number of mixture components. # should be an integer. The default is ncomponents(1).

probabilities(varlist) specifies a set of variables used to model the probability of component membership. The probabilities are specified using a multinomial logit parameterization. The default is constant probabilities.

pmass(numlist) specifies a list of exactly three number indicators (top inside bottom) showing the presence and position of the probability masses. For example, pmass(1 0 0) specifies a probability mass at b, the top limit of the dependent variable only; pmass(0 0 0) specifies no probability masses (a beta mixture regression model); pmass(1 0 1) specifies a model with probability masses at both limits of the dependent variable but no probability mass at the truncation point. The default is no probability masses at any point. Note that pmass() requires a list of exactly three numbers, even if the model has no truncation.

pmvar(varlist) specifies the variables used in the inflation part of the model. The model allows for a different set of variables to be used in this part of the model, but in most cases, it is reasonable for the same set of variables to appear in both the inflation model and the mixture of beta regressions. The default is constant probabilities.

lbound(#) specifies the user-supplied lower limit of the dependent variable. The default is lbound(0). Use this option if the dependent variable is limited in the interval (a, b). In this case, the upper bound of the interval also needs to be supplied using the option below.

ubound(#) specifies the user-supplied upper limit of the dependent variable. The default is ubound(1). Use this option if the dependent variable is limited in the interval (a, b). In this case, the lower bound of the interval also needs to be supplied using the option above.

trun(trun) determines whether there is truncation in the model and, if so, whether it is at the bottom or the top end. trun may be none, top, or bottom. Use none if no truncation is required; use top if the truncation (gap) is at the top [that is, the dependent variable is defined only in the interval (lbound(), tbound()) and the value ubound()]; use bottom if the truncation (gap) is at the bottom [that is, the dependent variable is defined only at the value lbound() and in the interval (tbound(), ubound())]. The default is trun(none).

tbound(#) specifies the user-supplied truncation value. If a truncation value is specified, then the option trun() must be specified as top or bottom.

constraints(constraints); see [R] estimation options.

vce(vcetype) specifies how to estimate the variance–covariance matrix corresponding to the parameter estimates. vcetype may be oim, opg, robust, or cluster clustvar. The default is vce(oim). The current version of betamix does not allow bootstrap or jackknife estimators; see [R] vce_option.

level(#); see [R] estimation options.

maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), gtolerance(#), nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize.
search(spec) specifies whether to use ml's initial search algorithm. spec may be on or off. The default is search(on).

repeat(#) specifies the number of random attempts to be made to find a better initial-value vector. This option is used in conjunction with search(on). The default is repeat(100).

The likelihood functions of mixture models have multiple optima. The options difficult, trace, search(spec), and from(init_specs) are especially useful when the default option does not achieve convergence.

3.2 predict

Syntax

    predict [type] newvar [if] [in] [, outcome(outcome)]

    predict [type] {stub* | newvar1 ... newvarq} [if] [in], scores

Description

Stata’s standard predict command can be used following betamix to obtain predicted values using the first syntax as well as the equation-level scores using the second syntax.

Options

outcome(outcome) specifies the predictions to be stored. There are two options for outcome: y or all. The default is outcome(y), which stores only the dependent variable prediction in newvar. Use all to also obtain the predicted conditional means, precision parameters, and probabilities for each component in the mixture. These are stored as newvar_mu1, newvar_mu2, ..., newvar_phi1, newvar_phi2, ..., and newvar_p1, newvar_p2, ..., respectively. If an inflation model is specified, all also stores the predicted probabilities of the multinomial logit part in newvar_lb, newvar_ub, and newvar_tb corresponding to the predicted probabilities of the lower bound, upper bound, and truncation bound. The probability of an observation belonging to the beta mixture part of the model can be calculated as 1 − newvar_lb − newvar_ub − newvar_tb.

scores calculates equation-level score variables.
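As a hypothetical illustration (the stub yhat is arbitrary, and which of the _lb, _ub, and _tb variables are created depends on the pmass() specification), the probability of belonging to the continuous part could be recovered after a model with masses at all three points as follows:

. predict yhat, outcome(all)
. generate pbeta = 1 - yhat_lb - yhat_ub - yhat_tb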

4 The betamix command in practice

This section illustrates the use of the betamix command using a fictional dataset (betamix_example_data.dta) that can be downloaded when installing the command. Economic evaluation is used by many decision makers around the world to inform healthcare funding decisions by comparing benefits and costs. EQ-5D-3L (EuroQol Group 1990) is often used to construct measures of health benefits and is thus central to those decisions. The EQ-5D-3L instrument describes health using five different dimensions (mobility, self-care, usual activities, pain and discomfort, anxiety and depression), each with three possible levels (no problems, some problems, extreme problems). In total, EQ-5D-3L can describe 243 different health states. Separate studies have assigned country-specific values to each health state. A value of 1 represents perfect health, a value of 0 represents a health state considered equivalent to being dead, and negative values represent health states worse than death.4 Therefore, EQ-5D-3L is a limited dependent variable with a lower limit equal to the value of the worst health state and an upper value of 1. In some cases, data on EQ-5D-3L have not been collected, and there is the need to predict what EQ-5D-3L would have been based on other available measures.

In this example, we are interested in estimating HRQoL measured by EQ-5D-3L5 as a function of a number of covariates. The data have 5,000 observations and are described below:

. use betamix_example_data

. describe

Contains data from betamix_example_data.dta
  obs:         5,000
 vars:             5                          23 May 2017 15:27
 size:        90,000

                storage   display    value
variable name   type      format     label        variable label

pain            float     %9.0g                   VAS pain scale [0,1]
haq_disability  float     %9.0g                   Health Assessment Questionnaire
eq5d_3l         double    %10.0g                  EQ-5D-3L utility
gender          byte      %8.0g      gender_lbl   Gender (dummy variable)
age             byte      %8.0g                   Age in years

Sorted by: eq5d_3l

There are four additional variables in the dataset: the health assessment questionnaire (HAQ) disability index, the pain score, age, and gender. The first two variables are collected in the HAQ questionnaire typically used in studies on rheumatoid arthritis (Fries, Spitz, and Young 1982). The HAQ disability index measures physical functionality ranging from 0 to 3, with 0 indicating no or mild difficulties and 3 indicating very severe disability. The HAQ questionnaire also includes a rating scale for pain severity.6 Summary statistics for the variables included in the dataset are shown below.

. summarize

    Variable          Obs        Mean    Std. Dev.       Min        Max

    pain            5,000    .0343357     .0239183         0    .1331617
    haq_disabi~y    5,000    1.447728      .450958         0           3
    eq5d_3l         5,000     .682108     .2246177      -.429           1
    gender          5,000       .3954     .4889853         0           1
    age             5,000     61.9178     11.00144        13          97

The average age in the dataset is 62 years, and 40% of the individuals are male. Figure 3 shows a histogram of the dependent variable EQ-5D-3L. The histogram presents a number of distributional characteristics that will need addressing when modeling.

4. For example, health states associated with extreme pain are often valued below zero in general population studies.
5. In this example, we use the UK valuation (Dolan et al. 1995).
6. A horizontal visual analog scale going from no pain to severe pain is used.

EQ-5D-3L (UK valuation) is bounded between −0.594 and 1. Both of these values are possible. In the present dataset, there are no observations at the lower limit of −0.594. There is a pile of observations at the upper boundary of 1 (full health) and a gap between this mass at 1 and the previous value. This gap is not a sample issue but a property of EQ-5D-3L. There is a theoretical gap between 1 and the next feasible value. The size of the gap is country specific, and in the UK case, there are no values between 1 and 0.883, creating a gap of 0.117. This gap is large relative to the total length of the EQ-5D-3L interval (1.594). The distribution is also multimodal, and conditioning on variables is usually not enough to capture this aspect of the distribution. These idiosyncrasies have been previously reported (see, for example, Hernández Alava, Wailoo, and Ara [2012]).

Figure 3. Distribution of EQ-5D-3L

The dataset will be used in two examples in sections 4.1 and 4.2: The first example will analyze the subsample of observations that are not in full health and show how to fit a mixture of beta regressions. The second example will use the full sample and, building on section 4.1, will show how to estimate an inflated truncated mixture of beta regressions.

4.1 Example 1: A mixture of beta regressions

For the purpose of this example and to show a simple version of the command, we ignore the observations at full health. In this subsample, the distribution of EQ-5D-3L has a theoretical lower boundary of −0.594 and an upper boundary of 0.883 (highest EQ-5D-3L value below full health). There are no observations at the lower boundary, but there are four observations at the upper boundary.7 We start by estimating a beta regression. Building on those results, we then estimate a mixture of beta regressions and compare the results.

We create local macros with the theoretical boundary values of the dependent variable as follows:

7. Note that the boundary values of the beta distribution should be the theoretical values, which do not always coincide with the boundaries of the observed data in a sample.

. local a = -0.594
. local b = 0.883

As a first step, we fit a beta regression model conditioning on age, gender, HAQ disability index, and pain. This model can already be fit using the betareg command by transforming the dependent variable and changing the observations at the boundaries by a small amount. Because there are only observations on the upper boundary, this model can be fit as follows:

. generate double eq5d_3l_t = (eq5d_3l-`a')/(`b'-`a') if eq5d_3l < 1
(539 missing values generated)

. replace eq5d_3l_t = eq5d_3l_t - 1e-6 if eq5d_3l_t==1
(4 real changes made)

. betareg eq5d_3l_t i.gender age haq_disability pain
  (output omitted)

Using betamix, we can estimate the same beta regression as follows:

. betamix eq5d_3l if eq5d_3l < 1, muvar(i.gender age haq_disability pain)
>     lbound(`a') ubound(`b')
Warning. Some observations are on the upper boundary but no probability mass.
A value of 1 is not supported by the beta distribution. -1e-6 will be added
to those observations.
initial:       log likelihood = 3633.5637
  (output omitted)
Iteration 5:   log likelihood = 5507.4545

1 component Beta Mixture Model              Number of obs   =      4,461
                                            Wald chi2(4)    =    4854.70
Log likelihood = 5507.4545                  Prob > chi2     =     0.0000

eq5d_3l                 Coef.   Std. Err.        z   P>|z|    [95% Conf. Interval]

C1_mu
  gender
    Male            -.1906903    .0175407   -10.87   0.000   -.2250695   -.1563112
  age                .0007741    .0007811     0.99   0.322   -.0007568    .0023051
  haq_disability    -.9759785    .0247947   -39.36   0.000   -1.024575   -.9273818
  pain              -15.17255    .4339471   -34.96   0.000   -16.02307   -14.32203
  _cons              3.826647    .0620419    61.68   0.000    3.705047    3.948247

C1_lnphi
  _cons              2.974057     .021377   139.12   0.000    2.932158    3.015955

C1_phi               19.57115    .4183728                     18.76809    20.40857

. matrix param = e(b)
. estimates store beta1lc

The command issues a warning: there are four observations with values of the dependent variable at the upper bound. Those values will be changed (on the transformed variable) by a small amount.

All variables, with the exception of age, appear significant at conventional significance levels. Higher levels of disability (as measured by the HAQ disability index) and pain are associated with lower levels of HRQoL (EQ-5D-3L), as expected. On average, males have lower levels of EQ-5D-3L in this sample. The fitted model assumes a constant precision parameter φ. Because a log link is used to ensure that the precision parameter is positive, the value of the untransformed parameter φ = 19.57 is also shown at the bottom of the output table. Although the direction of the effect can be found directly from the estimates, to find the magnitude of the effects and to interpret the estimates, it is helpful to use margins. Here margins is used following estimation to find the predicted EQ-5D-3L for the estimation sample. The mean predicted EQ-5D-3L is 0.6366, close to the mean in the estimation sample (0.6437).

. margins

Predictive margins                          Number of obs   =      4,461
Model VCE    : OIM

Expression   : Prediction, predict()

                          Delta-method
               Margin   Std. Err.        z   P>|z|    [95% Conf. Interval]

  _cons      .6366395   .0017476   364.29   0.000    .6332143    .6400648

. summarize eq5d_3l if e(sample)==1

    Variable          Obs        Mean    Std. Dev.       Min        Max

    eq5d_3l         4,461    .6436987     .2070316     -.429       .883

In some cases, it is plausible that the variance of the distribution depends on observed covariates through their effect on the precision parameter φ. Models where the precision parameter is a function of covariates are more difficult to fit, and it is always recommended to start by fitting a model with constant variance and use it as a stepping stone.8

We now show how the command betamix can be used to fit a more general model using mixtures of beta regressions. As always, when one fits mixture models, it is important that searches be carried out to ensure convergence to a global maximum (see section 2). The repeat() option can be used to increase the number of random attempts to find better starting values. Alternatively, if more control over the search strategy is required, the model can be optimized for a small number of iterations using a large number of different starting values. After this step, only the most promising trials (those with the highest likelihoods) are fully optimized, and the model with the highest likelihood is chosen.9

8. The accompanying do-file shows an example of how to specify and fit a model where φ is a function of a covariate.
9. The accompanying example do-file provides examples of searching procedures and the repeat() option.
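In the spirit of the do-file mentioned in footnote 8 (a sketch only; the accompanying file may differ), a model in which the precision depends on the HAQ disability index could be specified by adding phivar():

. betamix eq5d_3l if eq5d_3l < 1, muvar(i.gender age haq_disability pain)
>     phivar(haq_disability) lbound(`a') ubound(`b')
  (output omitted)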

We now fit a two-component beta regression mixture model. With the option from(), we use the estimated parameters from the beta regression estimated earlier as initial parameter values to start the optimization.

. betamix eq5d_3l if eq5d_3l < 1, muvar(i.gender age haq_disability pain)
>     lbound(`a') ubound(`b') ncomponents(2) from(param)
Warning. Some observations are on the upper boundary but no probability mass.
A value of 1 is not supported by the beta distribution. -1e-6 will be added
to those observations.
initial:       log likelihood = 3953.8933
  (output omitted)
Iteration 9:   log likelihood = 6620.1281

2 component Beta Mixture Model              Number of obs   =      4,461
                                            Wald chi2(8)    =    6125.70
Log likelihood = 6620.1281                  Prob > chi2     =     0.0000

eq5d_3l                 Coef.   Std. Err.        z   P>|z|    [95% Conf. Interval]

C1_mu
  gender
    Male            -.1939933    .0140936   -13.76   0.000   -.2216163   -.1663703
  age                .0017032    .0006329     2.69   0.007    .0004628    .0029436
  haq_disability    -.8513914    .0195477   -43.55   0.000   -.8897041   -.8130786
  pain               -5.37046    .3753207   -14.31   0.000   -6.106075   -4.634845
  _cons              3.463807    .0491347    70.50   0.000    3.367505    3.560109

C1_lnphi
  _cons              4.399757    .0322272   136.52   0.000    4.336593    4.462922

C2_mu
  gender
    Male            -.1192919    .0418525    -2.85   0.004   -.2013214   -.0372625
  age               -.0013071    .0018824    -0.69   0.487   -.0049965    .0023823
  haq_disability    -1.077175    .0588744   -18.30   0.000   -1.192567   -.9617834
  pain              -25.83155    .9151444   -28.23   0.000    -27.6252    -24.0379
  _cons              4.215672    .1446903    29.14   0.000    3.932084     4.49926

C2_lnphi
  _cons              2.657526    .0476911    55.72   0.000    2.564053    2.750999

Prob_C1
  _cons              .9734091    .0585439    16.63   0.000    .8586651    1.088153

C1_phi               81.43111    2.624296                     76.44666    86.74056
C2_phi               14.26096    .6801207                     12.98836    15.65826
pi1                  .7257985    .0116511                     .7023817    .7480338
pi2                  .2742015    .0116511                     .2519662    .2976183

. matrix param2=e(b)
. estimates store beta2lc

The output now gives the parameter estimates for the two components of the model and for the multinomial logit10 that determines component membership. Note that in this model, the probability of class membership is constant, but it can be allowed to vary across individuals by including variables in the multinomial logit model using the probabilities() option of betamix. Because the probability of belonging to each component is constant, the bottom of the output table shows these probabilities (pi1 and pi2) in an interpretable metric along with the precision parameters of both components (C1_phi and C2_phi).

10. In this case, the model reduces to a logit model because there are only two components.
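As a hypothetical variant (not estimated in this article), component membership could be allowed to depend on age by adding the probabilities() option:

. betamix eq5d_3l if eq5d_3l < 1, muvar(i.gender age haq_disability pain)
>     lbound(`a') ubound(`b') ncomponents(2) probabilities(age) repeat(500)
  (output omitted)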

In both components of the mixture, the HAQ disability index and pain score are negatively associated with average EQ-5D-3L, and being male is associated with lower average EQ-5D-3L, just as they were in the beta regression model. Age is significant in the first component, where being older has a positive influence on EQ-5D-3L. Component 1 is the dominant component, with a probability (pi1) of 0.73. It is useful to use the predict command after estimation to visualize the two different components of the mixture.

. predict yhat2, outcome(all)

. summarize yhat2* if e(sample)==1

    Variable          Obs        Mean    Std. Dev.        Min        Max

    yhat2_mu1       4,461    .6954906     .0731658   .2635273   .8368744
    yhat2_phi1      4,461    81.43111            0   81.43111   81.43111
    yhat2_p1        4,461    .7257985            0   .7257985   .7257985
    yhat2_mu2       4,461    .5599013     .2167423  -.3759323    .853509
    yhat2_phi2      4,461    14.26096            0   14.26096   14.26096

    yhat2_p2        4,461    .2742015            0   .2742015   .2742015
    yhat2           4,461    .6583118     .1101933   .0881865   .8409589

In the estimation sample, the first component is estimated to have a mean EQ-5D-3L of 0.6955 and a precision parameter φ = 81.43, whereas the second component has a slightly lower mean of 0.5599 and a more dispersed distribution (φ = 14.26). Figure 4 plots the probability density of the mixture at this average together with each of the two individual components. The mixture probability density is more similar to the dominant component, but it has heavier tails.


Figure 4. Probability densities of the two-component beta mixture and each individual component

As noted earlier, margins can be used to help interpret the model. For example, margins is used below to find the average EQ-5D-3L for groups of patients with different levels of pain.

. margins, at(pain=(0(0.05)0.2))

Predictive margins                          Number of obs   =      4,461
Model VCE    : OIM

Expression   : Prediction, predict()
1._at        : pain            =           0
2._at        : pain            =         .05
3._at        : pain            =          .1
4._at        : pain            =         .15
5._at        : pain            =          .2

                          Delta-method
               Margin   Std. Err.        z   P>|z|    [95% Conf. Interval]

  _at
   1          .736477   .0021127   348.60   0.000    .7323362    .7406177
   2          .641518   .0022414   286.22   0.000     .637125     .645911
   3         .4912133   .0078346    62.70   0.000    .4758577    .5065689
   4         .3413173   .0122964    27.76   0.000    .3172167    .3654178
   5         .2366953   .0151763    15.60   0.000    .2069503    .2664404

The plot of the predictive margins is shown in figure 5. Patients who report no pain have the highest predicted EQ-5D-3L. Predicted EQ-5D-3L decreases as reported pain increases.


Figure 5. Predictive margins with 95% confidence interval

The model with a mixture of two components can be compared with the beta regression model using information criteria. In particular, BIC has been shown to give a good indication of the number of components in a mixture. The mixture model has the lower Akaike information criterion (AIC) and BIC (see below), indicating that it fits the data better.

. estimates stats _all

Akaike's information criterion and Bayesian information criterion

       Model        Obs   ll(null)   ll(model)    df         AIC         BIC

     beta1lc      4,461          .    5507.454     6   -11002.91   -10964.49
     beta2lc      4,461          .    6620.128    13   -13214.26   -13131.02

Note: N=Obs used in calculating BIC; see [R] BIC note.

Figure 6 compares the probability densities of the mixture of two components and the beta regression model (calculated at the average). The mixture distribution has a larger peak at a higher value and flatter tails than the single-component beta regression model.


Figure 6. Probability densities of the two-component beta mixture versus the beta regression

Models with a higher number of components should also be fit and compared to determine the best-fitting model.
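For instance, a three-component model for this subsample could be fit and added to the comparison (a sketch; results are not reported in this article):

. betamix eq5d_3l if eq5d_3l < 1, muvar(i.gender age haq_disability pain)
>     lbound(`a') ubound(`b') ncomponents(3) from(param2) repeat(500)
  (output omitted)
. estimates store beta3lc
. estimates stats beta1lc beta2lc beta3lc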

4.2 Example 2: An inflated truncated mixture of beta regressions

In this example, we use the full dataset, which now includes the mass of values at the upper boundary of full health. Including these values creates a gap between the upper boundary and the previous feasible value (0.883) (see figure 3).11

We first fit an inflated truncated beta regression model. The new upper bound is now 1 (ubound(1)), and the model includes a truncation at the top of the distribution with no density between 0.883 and 1 (tbound(0.883) trun(top)). The lower bound remains the same (lbound(-0.594)). We allow the same variables to enter the beta regression (muvar(i.gender age haq_disability pain)) and the inflation part of the model (pmvar(i.gender age haq_disability pain)). There is inflation at the upper bound but no inflation at the truncation point or at the lower bound (pmass(1 0 0)). In this example, it is not necessary to include an inflation at the truncation value because there are only four observations with an EQ-5D-3L of 0.883. As in the previous example, there are no observations at the lower bound. The syntax to fit this model is reproduced below:

11. For examples of more complex models than those estimated here in the area of health economics, see Gray, Hernández Alava, and Wailoo (Forthcoming).

. betamix eq5d_3l, muvar(i.gender age haq_disability pain) lbound(-0.594)
>     ubound(1) tbound(0.883) pmass(1 0 0)
>     pmvar(i.gender age haq_disability pain) trun(top) from(param)

Note that this model is not the same as the Heckman selection model. It is equivalent to a two-part model, fit jointly under conditional independence between the two parts of the model. The output of this model is reproduced below:

Warning. Some observations are on the truncated boundary but no probability mass.
A value of 1 is not supported by the beta distribution. -1e-6 will be added
to those observations.
Fitting full beta mixture model
initial:       log likelihood = 2041.7186
  (output omitted)
Iteration 5:   log likelihood = 4507.4772

1 component Beta Mixture Model with inflation     Number of obs   =      5,000
                                                  Wald chi2(8)    =    5477.05
Log likelihood = 4507.4772                        Prob > chi2     =     0.0000

eq5d_3l                 Coef.   Std. Err.        z   P>|z|    [95% Conf. Interval]

C1_mu
  gender
    Male            -.1906903    .0175407   -10.87   0.000   -.2250695   -.1563112
  age                .0007741    .0007811     0.99   0.322   -.0007568    .0023051
  haq_disability    -.9759784    .0247947   -39.36   0.000   -1.024575   -.9273818
  pain              -15.17255    .4339471   -34.96   0.000   -16.02307   -14.32203
  _cons              3.826647    .0620419    61.68   0.000    3.705047    3.948247

C1_lnphi
  _cons              2.974056     .021377   139.12   0.000    2.932158    3.015955

PM_ub
  gender
    Male            -.4074599    .1191221    -3.42   0.001   -.6409349   -.1739848
  age               -.0006401    .0052235    -0.12   0.902   -.0108779    .0095978
  haq_disability    -2.729829    .1830281   -14.91   0.000   -3.088558   -2.371101
  pain              -81.13075    5.093264   -15.93   0.000   -91.11337   -71.14814
  _cons              2.735824    .3788839     7.22   0.000    1.993225    3.478423

C1_phi               19.57115    .4183728                     18.76809    20.40856

. estimates store infbeta1LC
. matrix param2=e(b)
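The two-part equivalence noted above can be seen from the log likelihood in section 2: no parameters are shared between the point-mass terms and the beta terms, so it separates additively as

\[
\ln l(\gamma, \beta, \delta, \phi) = \ln l_{\mathrm{PM}}(\gamma) + \ln l_{\mathrm{beta}}(\beta, \delta, \phi)
\]

and the two blocks can be maximized separately, which is why the beta regression coefficients in the output above coincide with those of section 4.1.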

The output table now has an additional equation, PM_ub, corresponding to the inflation part of the model at perfect health. Men are less likely than women to be in perfect health. The likelihood of being in perfect health decreases as age, pain, and the HAQ disability index increase. Note that, because this is a two-part model, the parameter estimates of the beta regression part of the model are the same as the parameters of the beta regression model fit in section 4.1.

Using the estimated parameters to initialize the algorithms, we also estimated the inflated truncated beta mixtures of two and three components. We also attempted to estimate a four-component mixture, but convergence was a problem. Results for the three-component inflated truncated beta mixture model are presented below:

. betamix eq5d_3l, muvar(i.gender age haq_disability pain) ncomponents(3)
>     lbound(-0.594) ubound(1) tbound(0.883) pmass(1 0 0)
>     pmvar(i.gender age haq_disability pain) trun(top) repeat(500)
Warning. Some observations are on the truncated boundary but no probability mass.
A value of 1 is not supported by the beta distribution. -1e-6 will be added
to those observations.
Fitting part 1: multinomial logit model
Fitting full beta mixture model
initial:       log likelihood = -977.04385
  (output omitted)
Iteration 22:  log likelihood = 5669.5447

3 component Beta Mixture Model with inflation     Number of obs   =      5,000
                                                  Wald chi2(16)   =    6338.24
Log likelihood = 5669.5447                        Prob > chi2     =     0.0000

eq5d_3l                 Coef.   Std. Err.        z   P>|z|    [95% Conf. Interval]

C1_mu
  gender
    Male            -.1423077    .0506814    -2.81   0.005   -.2416414   -.0429739
  age                .0010803    .0021846     0.49   0.621   -.0032015    .0053621
  haq_disability    -1.108207    .0702167   -15.78   0.000   -1.245829   -.9705848
  pain              -28.63576    1.197209   -23.92   0.000   -30.98225   -26.28927
  _cons              4.399697    .1652882    26.62   0.000    4.075738    4.723656

C1_lnphi
  _cons              2.926432    .0667624    43.83   0.000     2.79558    3.057284

C2_mu
  gender
    Male            -.1943955     .013987   -13.90   0.000   -.2218095   -.1669814
  age                .0015787    .0006276     2.52   0.012    .0003486    .0028088
  haq_disability    -.8512454    .0193374   -44.02   0.000    -.889146   -.8133448
  pain              -5.181831    .3752138   -13.81   0.000   -5.917236   -4.446425
  _cons              3.455749    .0487475    70.89   0.000    3.360206    3.551292

C2_lnphi
  _cons              4.366229    .0321085   135.98   0.000    4.303298    4.429161

C3_mu
  gender
    Male             -.109573    .0435234    -2.52   0.012   -.1948773   -.0242687
  age                -.004191    .0019958    -2.10   0.036   -.0081028   -.0002792
  haq_disability    -.4509987    .0664523    -6.79   0.000   -.5812428   -.3207545
  pain              -5.012307     .879802    -5.70   0.000   -6.736687   -3.287927
  _cons              1.332298    .1896052     7.03   0.000    .9606781    1.703917

C3_lnphi
  _cons              4.978456    .1803428    27.61   0.000    4.624991    5.331922

Prob_C1
  _cons              1.912517    .1730229    11.05   0.000    1.573398    2.251635

Prob_C2
  _cons              3.115297    .1364868    22.82   0.000    2.847788    3.382806

PM_ub
  gender
    Male            -.4074599    .1191221    -3.42   0.001    -.640935   -.1739848
  age               -.0006401    .0052235    -0.12   0.902   -.0108779    .0095978
  haq_disability     -2.72983    .1830281   -14.91   0.000   -3.088558   -2.371101
  pain              -81.13075    5.093264   -15.93   0.000   -91.11337   -71.14814
  _cons              2.735824    .3788839     7.22   0.000    1.993226    3.478423

C1_phi               18.66093    1.245848                     16.37212     21.2697
C2_phi               78.74615     2.52842                     73.94325    83.86102
C3_phi                 145.25    26.19479                     102.0018    206.8351
pi1                  .2233604    .0138256                     .1974356     .251622
pi2                  .7436474    .0125776                     .7182305    .7675138
pi3                  .0329922    .0045296                     .0241143      .04187

. estimates store infbeta3LC

Both the AIC and BIC are lowest for the three-component truncated inflated beta mixture model, suggesting it has the best model fit among the three models.

. estimates stats infbeta1LC infbeta2LC infbeta3LC

Akaike's information criterion and Bayesian information criterion

       Model        Obs   ll(null)   ll(model)    df         AIC         BIC

  infbeta1LC      5,000          .    4507.477    11   -8992.954   -8921.265
  infbeta2LC      5,000          .    5620.151    18    -11204.3   -11086.99
  infbeta3LC      5,000          .    5669.545    25   -11289.09   -11126.16

Note: N=Obs used in calculating BIC; see [R] BIC note.

We again use predict to help visualize the components:

. predict yhat3, outcome(all)

. summarize yhat3*

    Variable          Obs        Mean    Std. Dev.        Min        Max

    yhat3_mu1       5,000    .6141761     .2183959  -.3875231   .8636973
    yhat3_phi1      5,000    18.66093            0   18.66093   18.66093
    yhat3_p1        5,000    .2233604            0   .2233604   .2233604
    yhat3_mu2       5,000    .7041177     .0750912   .2646665   .8360863
    yhat3_phi2      5,000    78.74615            0   78.74615   78.74615

    yhat3_p2        5,000    .7436474            0   .7436474   .7436474
    yhat3_mu3       5,000    .2181699     .1072245  -.1579969   .4943131
    yhat3_phi3      5,000      145.25            0     145.25     145.25
    yhat3_p3        5,000    .0329922            0   .0329922   .0329922
    yhat3_ub        5,000       .1078     .1867089   9.17e-07   .9079351

    yhat3           5,000    .6919097     .1326988   .1050494   .9841948

In this sample, we see two components toward the top of the distribution with means 0.70 and 0.61 and a third component lower down in the distribution with mean 0.22. This third component has a much lower probability than the other two components (0.03) and appears to be capturing the mode at lower values of EQ-5D-3L, seen in figure 3. Figure 7 plots the probability density of the three-component mixture and each of the components separately. It is clear in this case that this third component, although small, is important in modeling this dataset and helps handle the relatively small number of individuals with EQ-5D-3L values close to the value of death.

Figure 7. Probability densities of the three-component truncated inflated beta mixture model and each individual component (for values below full health)

5 Concluding remarks

In this article, we described the community-contributed betamix command, which fits mixture regression models for bounded dependent variables using the beta distribution. The betamix command generalizes the Stata command betareg and the community-contributed commands betafit and zoib. betamix allows the model to capture multimodality and other distributional shapes. It can accommodate response variables that have a gap between one of the boundary values and the continuous part of the distribution and can model positive probabilities at the boundaries and the truncation value. There is no need to manually transform variables bounded in the interval (a, b) to the (0, 1) interval, because the command will take care of the transformation.

It is important to start fitting less complex models and slowly build them up; otherwise, convergence problems are likely. The likelihood functions of mixture models are known to have multiple optima. It is important to thoroughly search around the parameter space to avoid local solutions.

6 Acknowledgments

We thank the editor and an anonymous referee for very helpful comments and suggestions. Mónica Hernández Alava acknowledges support from the Medical Research Council under grant MR/L022575/1. The views expressed in this article, as well as any errors or omissions, are solely those of the authors.

7 References

Basu, A., and A. Manca. 2012. Regression estimators for generic health-related quality of life and quality-adjusted life years. Medical Decision Making 32: 56–69.

Berggren, N., S.-O. Daunfeldt, and J. Hellström. 2014. Social trust and central-bank independence. European Journal of Political Economy 34: 425–439.

Buis, M. L. 2010. zoib: Stata module to fit a zero-one inflated beta distribution by maximum likelihood. Statistical Software Components S457156, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457156.html.

Buis, M. L., N. J. Cox, and S. P. Jenkins. 2003. betafit: Stata module to fit a two-parameter beta distribution. Statistical Software Components S435303, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s435303.html.

Cook, D. O., R. Kieschnick, and B. D. McCullough. 2008. Regression analysis of proportions in finance with self selection. Journal of Empirical Finance 15: 860–867.

De Paola, M., V. Scoppa, and R. Lombardo. 2010. Can gender quotas break down negative stereotypes? Evidence from changes in electoral rules. Journal of Public Economics 94: 344–353.

Dolan, P., C. Gudex, P. Kind, and A. Williams. 1995. A social tariff for EuroQol: Results from a UK general population survey. Discussion Paper No. 138, University of York, Centre for Health Economics. http://www.york.ac.uk/che/pdf/DP138.pdf.

EuroQol Group. 1990. EuroQol—A new facility for the measurement of health-related quality of life. Health Policy 16: 199–208.

Ferrari, S., and F. Cribari-Neto. 2004. Beta regression for modelling rates and proportions. Journal of Applied Statistics 31: 799–815.

Fries, J. F., P. W. Spitz, and D. Y. Young. 1982. The dimensions of health outcomes: The health assessment questionnaire, disability and pain scales. Journal of Rheumatology 9: 789–793.

Gray, L. A., M. Hernández Alava, and A. J. Wailoo. Forthcoming. Development of methods for the mapping of utilities using mixture models: Mapping the AQLQ-S to EQ-5D-5L and HUI3 in patients with asthma. Value in Health.

Hernández Alava, M., and A. Wailoo. 2015. Fitting adjusted limited dependent variable mixture models to EQ-5D. Stata Journal 15: 737–750.

Hernández Alava, M., A. J. Wailoo, and R. Ara. 2012. Tails from the peak district: Adjusted limited dependent variable mixture models of EQ-5D questionnaire health state utility values. Value in Health 15: 550–561.

Ji, Y., C. Wu, P. Liu, J. Wang, and K. R. Coombes. 2005. Applications of beta-mixture models in bioinformatics. Bioinformatics 21: 2118–2122.

Johnson, N. L., S. Kotz, and N. Balakrishnan. 1995. Continuous Univariate Distributions, vol. 2. 2nd ed. New York: Wiley.

Kent, S., A. Gray, I. Schlackow, C. Jenkinson, and E. McIntosh. 2015. Mapping from the Parkinson’s disease questionnaire PDQ-39 to the generic EuroQol EQ-5D-3L. Medical Decision Making 35: 902–911.

Kieschnick, R., and B. D. McCullough. 2003. Regression analysis of variates observed on (0, 1): Percentages, proportions and fractions. Statistical Modelling 3: 193–213.

Kosmidis, I., and D. Firth. 2010. A generic algorithm for reducing bias in parametric estimation. Electronic Journal of Statistics 4: 1097–1112.

McLachlan, G., and D. Peel. 2000. Finite Mixture Models. New York: Wiley.

Ospina, R., F. Cribari-Neto, and K. L. P. Vasconcellos. 2006. Improved point and interval estimation for a beta regression model. Computational Statistics and Data Analysis 51: 960–981.

———. 2011. Erratum to “Improved point and interval estimation for a beta regression model” [Comput. Statist. Data Anal. 51 (2006) 960–981]. Computational Statistics and Data Analysis 55: 2445.

Ospina, R., and S. L. P. Ferrari. 2010. Inflated beta distributions. Statistical Papers 51: 111–126.

———. 2012a. On bias correction in a class of inflated beta regression models. International Journal of Statistics and Probability 1: 269–282.

———. 2012b. A general class of zero-or-one inflated beta regression models. Computational Statistics and Data Analysis 56: 1609–1623.

Paolino, P. 2001. Maximum likelihood estimation of models with beta-distributed dependent variables. Political Analysis 9: 325–346.

Papke, L. E., and J. M. Wooldridge. 1996. Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics 11: 619–632.

———. 2008. Panel data methods for fractional response variables with an application to test pass rates. Journal of Econometrics 145: 121–133.

Pereira, G. H. A., D. A. Botter, and M. C. Sandoval. 2012. The truncated inflated beta distribution. Communications in Statistics—Theory and Methods 41: 907–919.

———. 2013. A regression model for special proportions. Statistical Modelling 13: 125–151.

Smithson, M., and J. Verkuilen. 2006. A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods 11: 54–71.

Verkuilen, J., and M. Smithson. 2012. Mixed and mixture regression models for continuous bounded responses using the beta distribution. Journal of Educational and Behavioral Statistics 37: 82–113.

About the authors

Laura A. Gray is a research associate in health econometrics in the Health Economics and Decision Science section in ScHARR, University of Sheffield, UK. Her main research interest is microeconometrics, particularly in relation to health behaviors and obesity.

Mónica Hernández Alava is a reader in the Health Economics and Decision Science section in ScHARR, University of Sheffield, UK. Her main research interest is microeconometrics, and recent areas of application include the analysis of health state utility data, disease progression, and the dynamics of child development.

The Stata Journal (2018) 18, Number 1, pp. 76–100

Testing for serial correlation in fixed-effects panel models

Jesse Wursten
Faculty of Economics and Business
KU Leuven
Leuven, Belgium
[email protected]

Abstract. Current serial correlation tests for panel models are cumbersome to use, not suited for fixed-effects models, or limited to first-order autocorrelation. To fill this gap, I implement three recently developed tests.

Keywords: st0514, xtqptest, xthrtest, xtistest, serial correlation, panel time series, fixed effects, higher-order serial correlation

1 Introduction

The issue of serial correlation in panel models has been largely ignored in recent decades (Inoue and Solon 2006); the concern is generally dismissed by using robust or clustered standard errors. However, as shown in Pesaran and Smith (1995), serial correlation can lead to inconsistent estimates in dynamic panels, and omitting or including time trends can dramatically change parameter estimates (for example, Allegretto, Dube, and Reich [2010]; Solon [1984]). Serial correlation tests can help to identify which model is most credible from a statistical perspective, complementing arguments based on theory.

The two main commands currently implemented in Stata, xtserial and abar, are limited in their usability. xtserial implements the Wooldridge–Drukker (WD) test (Drukker 2003; Wooldridge 2010), which is limited to first-order autocorrelation1 and assumes a constant variance over time. Additionally, it is cumbersome to use when working with many or large models. Using Monte Carlo simulations, I show in section 6 that the tests described in this article have considerably higher power than the WD test. The abar command, which implements the test developed by Arellano and Bond (1991) and was written for Stata by Roodman (2009), can test for any order of serial correlation but is not appropriate for fixed-effects regressions. In its current implementation, abar cannot be used after xtreg.

Instead, I propose using tests developed by Inoue and Solon (2006) and Born and Breitung (2016), which I have implemented as xtistest, xtqptest, and xthrtest. They are residual-based tests, which makes them fast and easy to use. Moreover, both xtistest and xtqptest can test for serial correlation up to any order. xthrtest is suited to detect first-order correlation only but relaxes the constant variance assumption.

1. More precisely, it can detect differences only between first- and second-order correlation.


By making serial correlation testing more accessible, I hope it will be used more frequently to test model specifications and refine the empirical framework. Often, the barrier of manually implementing such tests prevents theoretical advances from being used in applied research (Pullenayegum et al. 2016; Choodari-Oskooei and Morris 2016).

The remainder of this article is structured as follows. Section 2 describes the econometrics behind the tests, focusing on what they do and how they work. Section 3 discusses the syntax of the implemented commands. Section 4 demonstrates each test with a worked example. Section 5 provides some short guidelines on when to use which test. The relative strengths and shortcomings are detailed further using Monte Carlo evidence in section 6, where we also investigate the tests' performance when their assumptions are not met. Section 7 concludes.

2 Econometrics

2.1 xtistest: Inoue–Solon test for serial correlation

xtistest implements the portmanteau test for serial correlation introduced by Inoue and Solon (2006). More specifically, it tests whether any off-diagonal element of the autocovariance matrix $E(\epsilon_i \epsilon_i')$ is nonzero, where $\epsilon_i = [\epsilon_{i1}, \epsilon_{i2}, \ldots, \epsilon_{iT}]'$ is a vector of error terms. Thus, the unrestricted alternative hypothesis (Ha) is the presence of "some" serial correlation, without restricting the order at which it might occur, much like the ubiquitous Ljung–Box test used in time series.

The drawback of this method is that the number of elements in the autocovariance matrix increases quadratically with the length of the panel (T), and so does the number of implicit hypotheses tested. Therefore, the test requires the number of panel units (N) to be large relative to the number of time periods. Figure 1, panel A illustrates this graphically for a T = 5 panel, where the axes represent the time period and the lines represent the corresponding autocovariances tested. Different patterns indicate the covariances are used to test a different implicit hypothesis. In the current case, each covariance corresponds to a separate hypothesis. In total, there are $(T-1) \times (T-2)/2 = 6$ hypotheses.2 Under the null hypothesis of no serial correlation, the IS statistic converges to a $\chi^2$ distribution with $(T-1) \times (T-2)/2 = 6$ degrees of freedom as N goes to infinity. We refer to the original article for an exposition of how the test statistic is constructed concretely.3

2. Because of the fixed-effects framework, the covariances have to be demeaned first, which leads to a singular covariance matrix. To overcome this problem, Inoue and Solon (2006) drop one row and column from the covariance matrix. In the visual example (and in the code), I drop the last column to retain as much data as possible. The choice of column does not affect the test statistic ex ante.
3. Alternatively, Born and Breitung (2016) provide a more applied explanation.

(Figure 1 comprises five panels, each plotting time periods 1–5 on both axes: panel A, unrestricted IS test (xtistest, lags(all)); panel B, restricted IS test (xtistest, lags(1)); panel C, Q(2) test (xtqptest, lags(2)); panel D, LM(2) test (xtqptest, order(2)); and panel E, HR test (xthrtest).)

Figure 1. Visual representation of the various serial correlation tests. Each line represents the demeaned covariance between residuals at two different time periods. For example, a line between (1, 1) and (1, 3) refers to the covariance between t = 1 and t = 3. Likewise, (3, 3) to (3, 4) refers to the covariance between residuals at time t = 3 and t = 4. Different line patterns indicate that the covariances are used to test a different underlying hypothesis. The b's and f's relate to backward- and forward-demeaned residuals (see section 2.3).

As in the Ljung–Box test, the lag lengths considered in the IS test can be restricted to keep the number of hypotheses in check. Under the restricted H0 of no autocorrelation up to lag p, the modified IS(p) statistic is $\chi^2\{pT - p(p+1)/2\}$ distributed, growing only linearly in T. This is visually illustrated in panel B of figure 1 for p = 1. Only autocovariances of the first order are considered, leading to $pT - p(p+1)/2 = 1 \times 5 - 1 = 4$ underlying hypotheses to evaluate.

2.2 xtqptest

The xtqptest command calculates the two bias-corrected test statistics introduced by Born and Breitung (2016), referred to as LM(k) and Q(p). The first tests for autocorrelation of order k, whereas the second looks for autocorrelation up to order p.

Lagrange multiplier (LM) test of serial correlation at order k

The LM(k) statistic is fairly straightforward and comes down to a heteroskedasticity- and autocorrelation-robust t test of H0: $\varsigma = -1/(T-1)$, with $\varsigma$ as the coefficient on the kth-order demeaned residuals in (1). The $e_{it}$ residuals include the fixed effects; that is, they are produced in Stata by predict, ue.

\[
e_{it} - \bar{e}_i = \varsigma\,(e_{i,t-k} - \bar{e}_i) + \epsilon_{it} \tag{1}
\]

This leads to the asymptotically equivalent test statistic LM(k) defined by (2) and (3), which under the null hypothesis of no serial correlation at order k has a standard normal limiting distribution and is calculated by xtqptest, order(k).

\[
z_{k,i} = \sum_{t=k+1}^{T} \left\{ (e_{it} - \bar{e}_i)(e_{i,t-k} - \bar{e}_i) + \frac{1}{T-1}\,(e_{i,t-k} - \bar{e}_i)^2 \right\} \tag{2}
\]

\[
LM(k) = \frac{\sum_{i=1}^{N} z_{k,i}}{\sqrt{\sum_{i=1}^{N} z_{k,i}^2 - \frac{1}{N}\left(\sum_{i=1}^{N} z_{k,i}\right)^2}} \tag{3}
\]
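As a rough illustration of the pieces of (2) and (3), LM(1) could be computed by hand from the 'ue' residuals as sketched below. This is my own sketch, not the code used by xtqptest, and it assumes a balanced panel that has already been xtset, with hypothetical variables y, x, id, and year.

quietly xtreg y x, fe
predict double res, ue                         // residuals including the fixed effect
sort id year
by id: egen double resbar = mean(res)          // individual means, e-bar_i
quietly summarize year
local T = r(max) - r(min) + 1                  // panel length T (balanced panel assumed)
by id: generate double zterm = (res - resbar)*(res[_n-1] - resbar) ///
        + (res[_n-1] - resbar)^2/(`T' - 1) if _n > 1
by id: egen double z = total(zterm)            // z_{1,i} as in (2)
preserve
collapse (firstnm) z, by(id)
quietly summarize z
local num = r(sum)                             // numerator of (3)
generate double z2 = z^2
quietly summarize z2
display "LM(1) = " `num' / sqrt(r(sum) - (`num')^2/_N)
restore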

Figure 1, panel D illustrates this visually for k = 2. There is now just one underlying hypothesis as the information from all autocovariances of lag length 2 is pooled. Note the contrast with the IS test, where they were considered separately.

Q test of serial correlation up to order p

The Q(p) statistic, which tests for autocorrelation up to order p, is more complicated because of finite-sample bias induced by correlation among the variously lagged demeaned residuals. Born and Breitung (2016) eliminate this bias by transforming the residuals as per (4).

\[
L_k e_i = \begin{pmatrix} 0_k \\ e_{i1} - \bar{e}_i \\ \vdots \\ e_{i,T-k} - \bar{e}_i \end{pmatrix} + \frac{T-k}{T^2 - T}\, e_i \tag{4}
\]

The first vector on the right-hand side represents the lagged demeaned residuals ($0_k$ is a column vector of k zeros). The second term, which involves the residuals including the fixed effects, removes the bias. It converges to zero as T increases and hence mainly plays a role in shorter panels. Ordinary least-squares estimates of the coefficients in (5) converge to zero for all T and $N \to \infty$ under the null hypothesis of no serial correlation.

\[
M = I_T - \frac{1}{T}\, i_T i_T'
\]

\[
M e_i = \phi_1 L_1 e_i + \phi_2 L_2 e_i + \cdots + \phi_p L_p e_i + v_i \tag{5}
\]

The matrix M demeans the fixed-effect residuals ($I_T$ is the identity matrix, $i_T$ a column vector of ones). Figure 1, panel C visualizes this test for p set to 2. As observations are pooled per lag length, there are two implicit hypotheses to test, $\phi_1 = 0$ and $\phi_2 = 0$. The xtqptest, lags(p) command calculates the asymptotically equivalent $Q'(p)$ statistic, defined in (6). Under the null of no serial correlation up to order p, it follows a $\chi^2$ distribution with p degrees of freedom.

\[
Z_i = (L_1 e_i, \ldots, L_p e_i)
\]
\[
A_i = e_i' M Z_i
\]
\[
Q'_p = \left(\sum_{i=1}^{N} A_i\right) \left\{ \sum_{i=1}^{N} A_i' A_i - \frac{1}{N}\left(\sum_{i=1}^{N} A_i\right)'\left(\sum_{i=1}^{N} A_i\right) \right\}^{-1} \left(\sum_{i=1}^{N} A_i\right)' \tag{6}
\]

2.3 xthrtest

xthrtest implements the heteroskedasticity-robust test statistic, also introduced in Born and Breitung (2016). It is based on the forward- and backward-transformed residuals $e_{it}^f$ and $e_{it}^b$ defined in (7) and (8), respectively.

\[
e_{it}^f = e_{it} - \frac{1}{T-t+1}\,(e_{it} + \cdots + e_{iT}) \tag{7}
\]
\[
e_{it}^b = e_{it} - \frac{1}{t}\,(e_{i1} + \cdots + e_{it}) \tag{8}
\]

The statistic can then be seen as a heteroskedasticity- and autocorrelation-robust t test on the ψ coefficient in (9).

\[
e_{it}^f = \psi\, e_{i,t-1}^b + \omega_{it}, \qquad t \in \{3, 4, \ldots, T-1\} \tag{9}
\]

Subtracting, respectively, the forward- and backward-looking mean in (7) and (8) ensures the test is robust to time-varying variances. Under the null of no first-order autocorrelation, it has a standard normal limiting distribution. Figure 1, panel E presents a visual representation of the heteroskedasticity-robust (HR) test. The dots have been replaced by b's and f's to highlight that the statistic is not based on the same demeaning process as the other tests.

3 Syntax

xtistest [varlist] [if] [in] [, lags(integer | all) original]

xtqptest [varlist] [if] [in] [, lags(integer) order(integer) force]

xthrtest [varlist] [if] [in] [, force]

The three commands share a similar syntax, which calculates their respective test statistics and p-values for the variables specified in varlist. Alternatively, they can be used as xtreg postestimation commands by omitting the varlist.

xtistest has two options. The first, lags(), indicates the maximum number of lags to check for autocorrelation. All lags up to this maximum are checked. Specifying lags(all) leads to the unrestricted IS test, which checks for serial correlation of any order (see section 2.1). The test defaults to a maximum lag length of two (lags(2)).4 By default, the command uses the faster Born and Breitung (2016) implementation, but the initial Inoue and Solon (2006) method can be requested by using the original option. The results should be identical.

xtqptest takes the mutually exclusive lags() and order() options. The first calculates the Q(p) test statistic for serial correlation up to order p, whereas the second refers to the LM(k) test for serial correlation of order k. For example, if there is first- but not second-order autocorrelation, the test should reject the null if lags(2) was specified but not when the user opted for order(2). The test requires varlist to be residuals including the fixed effect ('ue' residuals in predict terminology). There is some internal machinery to check whether you actually supplied 'ue' residuals, but it is not infallible (in both directions). Specifying force skips this check.5

xthrtest can detect only first-order serial correlation and thus does not ask for any lag specification. The force option again bypasses the check described above.
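As a quick orientation, a minimal usage sketch follows; it assumes a panel dataset that has already been xtset, with hypothetical variable names y and x, and uses only the options documented above.

quietly xtreg y x, fe
xtqptest, lags(2)        // Q(2): serial correlation up to order 2
xtqptest, order(1)       // LM(1): serial correlation at order 1
xthrtest                 // HR: first-order test, robust to time-varying variance
xtistest, lags(all)      // unrestricted IS test (requires N much larger than T)
* Or supply the 'ue' residuals explicitly instead of using the postestimation form:
predict double ue_res, ue
xtqptest ue_res, lags(2)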

4. The choice for lag length two is arbitrary. The alternative—making the unrestricted test the default—is unattractive because in very large datasets, calculating this statistic can take a significant amount of time.
5. Note that as T increases, the difference between using the standard residuals and the 'ue' residuals converges to zero.

4 Example

This section serves two purposes. First, it shows how the commands work in practice and how to interpret their output. Second, it highlights that coefficients can vary strongly depending on the modeling assumptions. Serial correlation measures by themselves cannot tell us which estimates are more credible; nonetheless, they are one of the elements that can guide this decision. For example, ceteris paribus, one would put more stock in a "correctly" specified model than one that exhibits clear statistical issues.

The example is based on a publicly available dataset of UK firms' yearly employment, wages, capital, and output between 1976 and 1984.6 The same data were used in Arellano and Bond (1991) to illustrate the famous Arellano–Bond estimator and introduce their serial correlation test, implemented in Stata as abar (see section 1). As a first step, we regress the level of log employment (n) on the logs of wages (w), capital (k), and industry output (ys). We include year and firm fixed effects.

. webuse abdata
. xtreg n w k ys yr*, vce(cluster id) fe
note: yr1984 omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      1,031
Group variable: id                              Number of groups  =        140
R-sq:                                           Obs per group:
     within  = 0.6321                                         min =          7
     between = 0.8481                                         avg =        7.4
     overall = 0.8349                                         max =          9
                                                F(11,139)         =      48.94
corr(u_i, Xb) = 0.5930                          Prob > F          =     0.0000
                                (Std. Err. adjusted for 140 clusters in id)

                       Robust
      n       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

      w   -.2968768   .1262997   -2.35   0.020    -.5465938   -.0471597
      k    .5475598    .050709   10.80   0.000     .4472991    .6478204
     ys    .2648254   .1529614    1.73   0.086    -.0376065    .5672574
            (output omitted)

We find elasticities of −0.3, 0.55, and 0.26 for earnings, capital, and industrial output, respectively. However, our simple equation in levels might be misspecified. Perhaps there are strong trends in firm employment. If these are correlated with any of our independent variables, this would bias their coefficient estimates. One way to check whether this could be an issue is to look for serial correlation in the residuals. Let's start with first-order autocorrelation, using the Q(1) test.

6. The dataset is loaded in the first line of the example.

. xtqptest, lags(1)
Bias-corrected Born and Breitung (2016) Q(p)-test as postestimation
Panelvar: id
Timevar:  year
p (lags): 1

        Variable    Q(p)-stat   p-value     N   maxT     balance?

 Post Estimation        65.17     0.000   140      9   unbalanced

Notes: Under H0, Q(p) ~ chi2(p)
       H0: No serial correlation up to order p.
       Ha: Some serial correlation up to order p.

The output of xtqptest, lags(1) lists the specific test performed, the structure of your panel, and the number of lags used (p = 1). Because no varlist was specified, the command predicted the fixed-effect residuals for us, based on the last regression. The test strongly rejects the null hypothesis, indicating the residuals are serially correlated. Next, we look at the output from the other tests [LM(1), IS(1), and HR].

. xtqptest, order(1)
Bias-corrected Born and Breitung (2016) LM(k)-test as postestimation
Panelvar: id
Timevar:  year
k (order): 1

        Variable   LM(k)-stat   p-value     N   maxT     balance?

 Post Estimation         8.05     0.000   140      9   unbalanced

Notes: Under H0, LM(k) ~ N(0,1)
       H0: No serial correlation of order k.
       Ha: Some serial correlation of order k.

. xtistest, lags(1)
Inoue and Solon (2006) IS-test as postestimation
Panelvar: id
Timevar:  year
p (lags): 1

        Variable      IS-stat   p-value     N   maxT     balance?

 Post Estimation        62.08     0.000   140      9   unbalanced

Notes: Under H0, IS ~ chi2(p*T-p(p+1)/2)
       H0: No auto-correlation up to order 1.
       Ha: Auto-correlation up to order 1.

. xthrtest
Heteroskedasticity-robust Born and Breitung (2016) HR-test as postestimation
Panelvar: id
Timevar:  year

        Variable      HR-stat   p-value     N   maxT     balance?

 Post Estimation         1.31     0.190   140      9   unbalanced

Notes: Under H0, HR ~ N(0,1)
       H0: No first-order serial correlation.
       Ha: Some first-order serial correlation.

The output is organized in the same way as before. Both the LM and IS tests strongly suggest first-order serial correlation is present in the residuals. Rather surprisingly, the HR test suggests the residuals are not correlated at the first order. If there are good reasons to assume the variance changed significantly over the sample, then a conflicting result from the HR test should not be discarded easily.7 In this case, however, there is no reason to assume the variance would change over time. Additionally, the other tests firmly rejected the null, with very extreme statistics. Combined, this makes the presence of first-order serial correlation more plausible than its absence.

The remainder of this section is summarized in table 1, where I present coefficient estimates for various specifications and their respective serial correlation results.8 The numbers in parentheses represent the p-values corresponding to the coefficient estimates and serial correlation tests. Column (1) recapitulates the discussion above.

Table 1. Summary of various specifications

                           (1)            (2)            (3)            (4)
                        Levels         Trends    Differences    Differences
                                                                  with Lags
Coefficient estimates
w                  −0.30 (0.02)   −0.39 (0.02)   −0.49 (0.00)   −0.43 (0.11)
k                   0.55 (0.00)    0.41 (0.00)    0.33 (0.00)    0.52 (0.00)
ys                  0.26 (0.09)    0.43 (0.09)    0.60 (0.00)    0.52 (0.14)
Serial correlation tests
Q(1)               65.17 (0.00)   13.57 (0.00)    4.85 (0.03)    0.39 (0.53)
LM(1)               8.05 (0.00)    3.73 (0.00)    2.21 (0.03)    0.73 (0.47)
IS(1)              62.08 (0.00)   36.54 (0.00)   25.39 (0.00)    5.98 (0.31)
HR(1)               1.31 (0.19)    4.77 (0.00)    1.72 (0.09)    1.38 (0.17)

Q(2)               73.51 (0.00)   42.42 (0.00)    6.31 (0.04)    7.37 (0.03)
LM(2)               3.89 (0.00)   −6.47 (0.00)   −1.33 (0.18)   −2.21 (0.03)
IS(2)              72.63 (0.00)   56.46 (0.00)   27.74 (0.01)   13.29 (0.15)

IS(all)            77.89 (0.00)   69.63 (0.00)   36.31 (0.13)   16.02 (0.38)

The numbers in parentheses represent p-values.
(1): xtreg n w k ys yr*, vce(cluster id) fe
(2): xtreg n w k ys yr* id#c.year, vce(cluster id) fe
(3): xtreg D.n D.(w k ys) i.year, vce(cluster id) fe
(4): xtreg D.n DL(0/2).(w k ys) i.year, vce(cluster id) fe

In column (2), I refit the model with firm-specific trends. The coefficient estimates have changed considerably. The wage elasticity has increased (in absolute value) by 25%, and the elasticity with respect to industry output is almost twice as large as before. The residuals still appear to be serially correlated. The four statistics above the dashed line all point toward the presence of first-order correlation. The Q(2) and IS(2) tests indicate there is serial correlation up to the second order as well; the LM(2) test suggests there is correlation of the second order (it does not say anything about first-order correlation, unlike the previous two).

7. An example would be stock data, which can be stable for years and suddenly enter a turbulent phase. Figure 3 in the appendix provides an example of a time series whose variance changes over time. 8. I limit this to second-order serial correlation for the sake of brevity. J. Wursten 85

Finally, the unrestricted IS test [IS(all)] tests for serial correlation of any order, without limiting the lag length at which it might occur. In this sense, it is less arbitrary than the previous tests, which require the researcher to indicate a cutoff point. Unfortunately, it can be used only in panels that are much wider than they are long and is less powerful than the restricted tests (see section 6).9

An alternative option is to fit the model in differences [column (3)]. Again, the coefficients have changed. The elasticity of earnings on employment is now almost −0.5, and the coefficient on industrial output has increased by another 37%. The serial correlation statistics look a lot better. The unrestricted IS statistic at the bottom of the table no longer rejects the null of no serial correlation (at any order). The other tests show more mixed signals, with p-values skirting on both edges of the common 0.05 rejection threshold. Still, we cannot say with confidence that our estimation is free of serial correlation.

Finally, in column (4) we add two lags of the independent variables.10 I do not introduce lags of the dependent variable for simplicity, because such (short) dynamic panels cannot be estimated consistently with standard fixed-effects regressions.11 The values presented in the top half of the table refer to the sum of the coefficient estimates and are in line with the previous results, although only the coefficient on capital remains significant. The residuals seem fairly free of autocorrelation. The tests above the dashed line all point to a lack of first-order serial correlation. The presence of correlation at the second order is less clear, with the Q(2) and LM(2) tests still rejecting the null at 5%, unlike the IS(2) test. The unrestricted IS test points to a complete absence of serial correlation.

Is this last set of estimates more credible than those presented in columns (1)–(3)? Serial correlation statistics by themselves cannot answer that question. However, they are definitely one indicator that these coefficients are less vulnerable to model misspecification than the others.

5 Usage guidelines

The various tests are frequently in disagreement with one another, as can be seen in the example in section 4. Below I provide a few guidelines that might help to reach an informed conclusion in such situations. In section 6, I provide Monte Carlo evidence to support the rules of thumb presented here. Note that all three tests assume that the idiosyncratic errors are uncorrelated with the regressors.12 This means that including predetermined or endogenous regressors might lead to unreliable test statistics, although the Monte Carlo evidence suggests this is not always the case.

9. As a rule of thumb, the unrestricted IS test requires that N > T^2/2, where N is the number of panel units and T is the number of time periods. The equivalent requirement for the restricted IS test is N > p × T, where p is the order up to which to test. Section 2 provides more details.
10. The choice of the number of lags is based on the serial correlation tests. In other cases, theory is likely to be the prime guideline. Here we have only a short panel of yearly data, so there are few cues from theory to guide the lag selection (does capital lead to changes in employment with one year's delay? two years'? five years'?).
11. The biggest contribution of the earlier mentioned Arellano and Bond (1991) article is that it introduces a generalized method of moments estimator that can do so.

• Use xtistest

– if the data contain gaps (the IS test accommodates them without requiring further action).
– if you are concerned about higher-order serial correlation, common when individual time trends should be included but are not.
– only if the panel dimension (N) is significantly larger than the time dimension (T).13

• Use xthrtest

– if you suspect the variance changes significantly over time, for example, when studying stock indexes that can be stable for years and turn volatile when a shock happens.
– only if you are just concerned about first-order serial correlation.

• xtqptest is the most reliable in all other situations.

– The power of the Q(p) test (the lags() option) increases with N and T, making it suitable for situations where either N or T or both are small. It also performs best in the Monte Carlo simulations (see section 6).
– The Q(p) test can be complemented with the LM(k) test (the order() option) to identify at which order the serial correlation is present. For example, if the Q(2) test for serial correlation up to the second order is near the edge of rejection or acceptance territory but the other tests cannot reject the null, then it can be interesting to use the LM(1) and LM(2) tests to check for, respectively, first- and second-order autocorrelation separately (see the short sketch after this list).
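A minimal sketch of that follow-up workflow, in postestimation form and with hypothetical variables y and x in an xtset panel:

quietly xtreg y x, fe
xtqptest, lags(2)        // serial correlation up to order 2
xtqptest, order(1)       // is it coming from the first order ...
xtqptest, order(2)       // ... or from the second order?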

6 Monte Carlo

In this section, we examine the finite sample performance of the four tests presented in this article and compare them with the existing WD (xtserial) and Arellano–Bond (abar) commands. The tests are evaluated under seven different scenarios, inspired by issues the practitioner might face. In each scenario, we investigate whether the tests detect serial correlation if it is present (their “power”) and accept the null of no serial correlation if there is indeed none (their “size”).

12. For xtqptest and xthrtest, the actual assumption made is a bit more involved, but for all practical purposes, it comes down to the same thing.
13. As a rule of thumb, N should be larger than pT, where N is the number of panel units, T the time length, and p the order up to which you want to test for serial correlation (the lags() option).

6.1 Setup

The basic framework defining the panel is described in (10)–(13). Size results are based on $u_{it} = f(\epsilon_{it})$, where $\epsilon_{it}$ follows a standard normal distribution. Power results are based on an AR(2) process, such that $u_{it} = f(v_{it}, \epsilon_{it})$, where $v_{it} = 0.1 \times v_{i,t-1} + 0.05 \times v_{i,t-2} + \epsilon_{it}$.14 Both size and power might depend on the dimension of the panel. To that end, we set the number of panel units N to 20, 50, or 500 and the time length T to 7 or 50. This way, we capture both typical macro panels (for example, N = 20, T = 50) and representative micro panels (N = 500, T = 7) as well as some intermediate forms.

\[
y_{it} = c_i + x_{it}\beta + u_{it} \tag{10}
\]
\[
c_i \sim N(5, 10) \tag{11}
\]
\[
x_{it} \sim N(0, 1) \tag{12}
\]
\[
\beta = 1 \tag{13}
\]
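To make the data-generating process concrete, a rough sketch of the baseline size experiment (10)–(13) for N = 500 and T = 7 follows; it assumes the commands from this article are installed and reads N(5, 10) as mean 5 with standard deviation sqrt(10), which is an assumption about the intended parameterization.

clear
set seed 12345
set obs 500                               // N = 500 panel units
generate id = _n
generate double c = rnormal(5, sqrt(10))  // fixed effect c_i (see assumption above)
expand 7                                  // T = 7 periods per unit
bysort id: generate year = _n
xtset id year
generate double x = rnormal()             // x_it ~ N(0, 1)
generate double u = rnormal()             // size experiment: no serial correlation
generate double y = c + x + u             // beta = 1
quietly xtreg y x, fe
xtqptest, lags(2)                         // should reject in roughly 5% of replications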

Table 2 provides a summary of the various scenarios. In the baseline scenario [1], the panel is balanced, and the variance is constant over time. We apply the fixed-effects estimator to obtain the residuals. In scenario [2], the error terms are multiplied by $h_t = e^{0.2t}$, causing the variance to explode over time.15 Time-varying volatility is particularly common in the financial literature. In scenario [3], the variance is constant once again, but now the panel is unbalanced, with individual panels missing up to a fifth of their first and last observations (uniformly distributed).

Table 2. Scenarios used in Monte Carlo simulations

                                 Extension
[1] Constant variance            Baseline
[2] Temporal heteroskedasticity  u_it = ε_it × h_t and h_t = e^(0.2t)
[3] Unbalanced panel             T_i ≠ T_j (but no gaps)
[4] No fixed effects             c_i = c_j = 0
[5] Heterogeneous slopes         β_i ~ U[0, 2]
[6] Dynamic panel                y_it = 0.5 y_{i,t−1} + 0.25 y_{i,t−2} + c_i + x_it β + u_it
[7] Cross-sectional dependence   u_it = λ_i × v_t + ε_it and λ_i ~ U[−1, 3] and v_t ~ N(0, 1)

In scenario [4], we set all fixed effects to zero. Such a situation might occur when working with first-differenced data or in a panel regression without an error component. Scenario [5] imposes panel-specific β_i coefficients (uniformly distributed over [0, 2]). We fit this heterogeneous slope model using the mean-group estimator (Pesaran and Smith 1995).

14. In the appendix, I show power results for an MA(1) process.
15. Figures 2 and 3 in the appendix show example time series with constant and time-varying variances.

The final two scenarios put the serial correlation statistics to the test when their assumptions are not met. In scenario [6], $y_{it}$ itself follows an AR(2) process with $y_{it} = 0.5 y_{i,t-1} + 0.25 y_{i,t-2} + c_i + x_{it}\beta + u_{it}$. Fixed-effects estimation no longer consistently estimates each coefficient (Nickell 1981), one of the requirements of the serial correlation tests. Scenario [7] introduces cross-sectional dependence into the error structure by introducing a common factor with panel-specific factor loadings [see (14)–(16)].

\[
u_{it} = \lambda_i \times v_t + \epsilon_{it} \tag{14}
\]
\[
\lambda_i \sim U[-1, 3] \tag{15}
\]
\[
v_t \sim N(0, 1) \tag{16}
\]

This cross-sectional correlation violates the assumption of independently distributed errors. Comovement across panel units is a growing concern, especially in the macroeconomic literature (for example, Eberhardt, Helmers, and Strauss [2013]; Chudik, Pesaran, and Tosetti [2011]; Allegretto, Dube, and Reich [2010]). Note that in this specification, fixed-effects estimation is still consistent because the error term remains uncorrelated with the regressors (Eberhardt, Banerjee, and Reade 2010).

6.2 Results

Results for scenarios [1]–[3] are presented in tables 3–4. The LM(1), HR, and WD tests look for first-order correlation only. The LM(2) test looks just for second-order correlation. Three tests—IS(2), Q(2), and second-order Arellano–Bond [AB(2)]—look for serial correlation up to the second order. Finally, I also test for serial correlation up to the fourth order with Q(4) to investigate whether the power of the Q(p) test declines if you allow for longer serial correlation than is actually present in the data.

Table 3. Monte Carlo results, scenario [1]–[3], size

                            T = 7                      T = 50
                 N = 20   N = 50   N = 500   N = 20   N = 50   N = 500
No serial correlation                                     Ideal: 0.05
[1] Constant variance
- IS(2)            0.00     0.04      0.05        .        .      0.04
- Q(2)             0.11     0.06      0.05     0.09     0.07      0.05
- Q(4)             0.20     0.09      0.05     0.15     0.09      0.05
- LM(1)            0.08     0.05      0.05     0.06     0.06      0.05
- LM(2)            0.07     0.06      0.04     0.07     0.07      0.06
- HR               0.08     0.06      0.05     0.06     0.06      0.05
- AB(2)            0.43     0.79      1.00     0.09     0.19      0.88
- WD               0.07     0.06      0.06     0.05     0.06      0.05

[2] Time-varying variance
- IS(2)            0.00     0.39      1.00        .        .      1.00
- Q(2)             0.09     0.06      0.08     0.10     0.06      0.07
- Q(4)             0.18     0.08      0.10     0.17     0.10      0.14
- LM(1)            0.07     0.06      0.09     0.08     0.06      0.05
- LM(2)            0.07     0.07      0.23     0.06     0.05      0.05
- HR               0.06     0.06      0.05     0.07     0.05      0.05
- AB(2)            0.42     0.80      1.00     0.39     0.40      0.70
- WD               0.09     0.14      0.82     0.13     0.27      0.99

[3] Unbalanced panel
- IS(2)            0.00     0.05      0.06        .        .      0.04
- Q(2)             0.10     0.06      0.04     0.09     0.07      0.06
- Q(4)             0.18     0.09      0.04     0.15     0.10      0.06
- LM(1)            0.07     0.06      0.05     0.06     0.06      0.05
- LM(2)            0.07     0.05      0.04     0.08     0.06      0.05
- HR               0.07     0.05      0.05     0.07     0.06      0.05
- AB(2)            0.50     0.86      1.00     0.10     0.21      0.93
- WD               0.07     0.06      0.05     0.06     0.06      0.05

Simulations performed with 2,000 replications.

Table 4. Monte Carlo results, scenario [1]–[3], power

                            T = 7                      T = 50
                 N = 20   N = 50   N = 500   N = 20   N = 50   N = 500
AR(2) (α1 = 0.10, α2 = 0.05)                              Ideal: 1.00
[1] Constant variance
- IS(2)            0.00     0.08      0.77        .        .      1.00
- Q(2)             0.14     0.18      0.96     0.85     1.00      1.00
- Q(4)             0.21     0.18      0.91     0.83     1.00      1.00
- LM(1)            0.12     0.20      0.94     0.85     1.00      1.00
- LM(2)            0.07     0.07      0.07     0.40     0.77      1.00
- HR               0.09     0.06      0.23     0.73     0.99      1.00
- AB(2)            0.19     0.36      1.00     0.69     0.97      1.00
- WD               0.09     0.10      0.37     0.19     0.35      1.00

[2] Time-varying variance
- IS(2)            0.00     0.54      1.00        .        .      1.00
- Q(2)             0.13     0.20      0.98     0.34     0.66      1.00
- Q(4)             0.20     0.20      0.96     0.37     0.58      1.00
- LM(1)            0.11     0.14      0.82     0.32     0.61      1.00
- LM(2)            0.07     0.07      0.11     0.14     0.25      0.98
- HR               0.08     0.07      0.22     0.18     0.33      1.00
- AB(2)            0.21     0.38      1.00     0.64     0.82      1.00
- WD               0.07     0.07      0.22     0.06     0.09      0.43

[3] Unbalanced panel
- IS(2)            0.00     0.09      0.61        .        .      1.00
- Q(2)             0.12     0.14      0.85     0.76     0.99      1.00
- Q(4)             0.20     0.15      0.76     0.73     0.98      1.00
- LM(1)            0.10     0.15      0.83     0.75     0.99      1.00
- LM(2)            0.07     0.06      0.05     0.33     0.65      1.00
- HR               0.09     0.06      0.13     0.59     0.93      1.00
- AB(2)            0.27     0.56      1.00     0.53     0.88      1.00
- WD               0.08     0.09      0.31     0.16     0.29      0.99

Simulations performed with 2,000 replications.

Do the tests falsely reject the null (no serial correlation) when there actually is no serial correlation (table 3)?16 In the baseline scenario [1], with a constant variance and a balanced panel, we find that all tests are decently sized, apart from the Arellano–Bond test, which is not suited for fixed-effects models. Note that the IS(2) test is not valid for two of the dimension combinations. When we introduce temporal heteroskedasticity in scenario [2], both the IS test and the WD test become oversized, worsening as the sample gets larger. The HR test honors its name and stays very close to the optimal size of 0.05 regardless of N and T. The other tests perform worse than in the constant variance case, but the distortions remain relatively minor overall. Moving to an unbalanced panel in scenario [3], we find no significant differences with the balanced case.

As for the power results (do they reject the null when it is indeed false?) in table 4, we find that in the baseline scenario [1], the Q tests have noticeably more power than the standard WD test. The LM(2) and HR tests have fairly little power when T = 7, though they improve considerably as the panel gets longer.17 Introducing time variation in the variance [2] leads to relatively similar results. The WD test loses even more power, and the Q tests still perform best overall. Surprisingly, the HR test, which was developed for this scenario, has fairly little power, although it does improve as the sample gets bigger in both dimensions. As with the size results, the balance of the sample [3] does not have a large impact on the power of the tests. The minor reductions in power across the board can be attributed to the reduced effective time length of the individual panels (between 0.6 × T and T).

Results for the more challenging scenarios [4]–[7] are presented in tables 5–6. Given the bad performance of the Arellano–Bond test, we omit it in these tables to reduce clutter.

16. As is common in the literature, we reject at p = 0.05. Thus, a "correctly sized test" will reject the null hypothesis 5% of the time when it is true.
17. It is unclear what to make of the Arellano–Bond power results, given that it was so oversized in these scenarios.

Table 5. Monte Carlo results, scenario [4]–[7], size

                            T = 7                      T = 50
                 N = 20   N = 50   N = 500   N = 20   N = 50   N = 500
No serial correlation                                     Ideal: 0.05
[4] No fixed effects
- IS(2)            0.00     0.49      1.00        .        .      0.28
- Q(2)             0.10     0.07      0.06     0.10     0.07      0.05
- Q(4)             0.18     0.10      0.06     0.16     0.09      0.05
- LM(1)            0.07     0.05      0.06     0.07     0.06      0.05
- LM(2)            0.06     0.06      0.05     0.07     0.06      0.05
- HR               0.07     0.06      0.05     0.08     0.05      0.06
- WD               0.07     0.05      0.04     0.05     0.05      0.05

[5] Heterogeneous slopes
- IS(2)            0.00     0.04      0.06        .        .      0.03
- Q(2)             0.09     0.07      0.06     0.09     0.07      0.05
- Q(4)             0.17     0.10      0.05     0.16     0.09      0.06
- LM(1)            0.07     0.05      0.05     0.07     0.06      0.05
- LM(2)            0.06     0.05      0.05     0.07     0.06      0.05
- HR               0.06     0.05      0.05     0.07     0.05      0.06
- WD               0.07     0.05      0.05     0.06     0.05      0.05

[6] Dynamic panel
- IS(2)            0.02     0.07      0.68        .        .      0.07
- Q(2)             0.08     0.09      0.90     0.05     0.07      0.61
- Q(4)             0.08     0.05      0.79     0.13     0.10      0.69
- LM(1)            0.05     0.09      0.90     0.03     0.03      0.36
- LM(2)            0.06     0.05      0.09     0.06     0.08      0.49
- HR               0.05     0.04      0.04     0.03     0.02      0.06
- WD               0.97     1.00      1.00     1.00     1.00      1.00

[7] Cross-sectional dependence
- IS(2)            0.00     0.90      1.00        .        .      1.00
- Q(2)             0.72     0.88      0.98     0.79     0.93      0.99
- Q(4)             0.83     0.96      1.00     0.90     0.99      1.00
- LM(1)            0.58     0.72      0.91     0.64     0.76      0.92
- LM(2)            0.57     0.71      0.91     0.62     0.77      0.93
- HR               0.55     0.72      0.91     0.63     0.77      0.92
- WD               0.58     0.73      0.91     0.64     0.77      0.94

Simulations performed with 2,000 replications.

Table 6. Monte Carlo results, scenario [4]–[7], power

                            T = 7                      T = 50
                 N = 20   N = 50   N = 500   N = 20   N = 50   N = 500
AR(2) (α1 = 0.10, α2 = 0.05)                              Ideal: 1.00
[4] No fixed effects
- IS(2)            0.00     0.80      1.00        .        .      1.00
- Q(2)             0.13     0.19      0.96     0.87     1.00      1.00
- Q(4)             0.22     0.19      0.91     0.82     0.99      1.00
- LM(1)            0.13     0.20      0.94     0.86     1.00      1.00
- LM(2)            0.06     0.06      0.07     0.41     0.76      1.00
- HR               0.07     0.07      0.25     0.73     0.97      1.00
- WD               0.08     0.10      0.39     0.20     0.36      1.00

[5] Heterogeneous slopes
- IS(2)            0.00     0.07      0.57        .        .      1.00
- Q(2)             0.11     0.15      0.86     0.86     1.00      1.00
- Q(4)             0.20     0.15      0.76     0.81     0.99      1.00
- LM(1)            0.11     0.16      0.83     0.85     0.99      1.00
- LM(2)            0.07     0.06      0.07     0.40     0.74      1.00
- HR               0.07     0.06      0.18     0.72     0.97      1.00
- WD               0.08     0.10      0.39     0.20     0.36      1.00

[6] Dynamic panel
- IS(2)            0.02     0.11      0.98        .        .      1.00
- Q(2)             0.10     0.19      1.00     0.69     0.99      1.00
- Q(4)             0.09     0.10      0.99     0.63     0.96      1.00
- LM(1)            0.08     0.18      1.00     0.53     0.94      1.00
- LM(2)            0.06     0.04      0.08     0.51     0.89      1.00
- HR               0.05     0.04      0.04     0.35     0.74      1.00
- WD               0.99     1.00      1.00     1.00     1.00      1.00

[7] Cross-sectional dependence
- IS(2)            0.00     0.93      1.00        .        .      1.00
- Q(2)             0.73     0.91      0.99     0.79     0.92      0.99
- Q(4)             0.84     0.98      1.00     0.92     0.99      1.00
- LM(1)            0.59     0.72      0.92     0.59     0.75      0.93
- LM(2)            0.58     0.74      0.91     0.62     0.78      0.93
- HR               0.58     0.73      0.92     0.60     0.78      0.92
- WD               0.62     0.76      0.92     0.66     0.78      0.93

Simulations performed with 2,000 replications.

Regarding table 5, in scenario [4], we set the fixed effects to zero. The IS test becomes highly oversized, rejecting the null of no serial correlation far more often than it should (28–100% of the time rather than the ideal 5%). The other statistics are unaffected. Panel-specific coefficients [5] do not appear to trouble any of the tests. On the other hand, including lags of the dependent variable [6] is detrimental to the size of all tests considered, apart from the HR statistic. Performance remains acceptable in smaller panels but gets progressively worse as N and T increase. The WD test completely breaks down.18 Introducing cross-sectional dependence [7] leads to even worse size results. Every test is now majorly oversized. To my knowledge, there is no serial correlation test that is robust to cross-sectional dependence.19

Finally, we move to the power results in table 6. As with the sizes in the previous table, setting the fixed effects to zero [4] does not affect the power of any of the tests, except for the IS test. The introduction of heterogeneous slopes [5] has no discernible impact on the power of the test statistics. Turning to the dynamic panel scenario [6], we find that only the small-sample results are meaningful, given that all the tests became oversized as the sample grew in either dimension. We also find that the power of the tests holds in these small samples. The values presented for the cross-sectional dependence case [7] are entirely meaningless, given that none of the tests "accepted" the null hypothesis of no serial correlation even when it was true.

All in all, we find that, at least in the common scenarios we considered, the Q test is the most robust overall. It has decent size and is generally more powerful than the alternatives, especially compared with the standard WD test.

7 Conclusion

I believe that serial correlation tests can help researchers understand their data and correctly specify their empirical models. Currently, however, serial correlation testing is far from accessible with panel data. Thus, serial correlation testing is often neglected altogether or postponed until the very end of the research cycle. In contrast, the three tests I introduced are easy to use and interpret, making it feasible to test for serial correlation in earlier, exploratory phases of research. Moreover, I provided Monte Carlo evidence indicating they have considerably more power in finite samples than the current generation of tests.

18. In a certain sense, these bad results are inevitable. The bias induced by the lagged dependent variables mainly affects the autoregressive (AR) coefficients. These incorrectly estimated AR parameters in turn lead to serial correlation in the residuals (even if the true errors are serially uncorrelated), which is then detected by the test statistics. In table 9 in the appendix, we introduce endogeneity by correlating the regressor x_it with the error term as an alternative specification. Then, the tests are oversized for small samples but approach 5% as the sample gets longer.
19. Table 9 in the appendix shows results for the cross-sectional dependence scenario when we fit the model with the common correlated effects pooled (CCEP) estimator (Pesaran 2006) instead of standard fixed effects. The CCEP estimator filters out the common trends by including the cross-sectional mean of all variables as regressors with panel unit-specific coefficients. We find decent-size results for all tests in that case.

8 References

Allegretto, S., A. Dube, and M. Reich. 2010. Do minimum wages really reduce teen employment? Accounting for heterogeneity and selectivity in state panel data. Institute for Research on Labor and Employment, Working Paper Series from Institute of Industrial Relations, UC Berkeley. https://econpapers.repec.org/paper/cdlindrel/qt7jq2q3j8.htm.

Arellano, M., and S. Bond. 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58: 277–297.

Born, B., and J. Breitung. 2016. Testing for serial correlation in fixed-effects panel data models. Econometric Reviews 35: 1290–1316.

Choodari-Oskooei, B., and T. P. Morris. 2016. Quantifying the uptake of user-written commands over time. Stata Journal 16: 88–95.

Chudik, A., M. H. Pesaran, and E. Tosetti. 2011. Weak and strong cross-section dependence and estimation of large panels. Econometrics Journal 14: C45–C90.

Drukker, D. M. 2003. Testing for serial correlation in linear panel-data models. Stata Journal 3: 168–177.

Eberhardt, M., A. Banerjee, and J. J. Reade. 2010. Panel estimation for worriers. Economics Series Working Papers 514, University of Oxford, Department of Economics. https://econpapers.repec.org/paper/oxfwpaper/514.htm.

Eberhardt, M., C. Helmers, and H. Strauss. 2013. Do spillovers matter when estimating private returns to R&D? Review of Economics and Statistics 95: 436–448.

Inoue, A., and G. Solon. 2006. A portmanteau test for serially correlated errors in fixed effects models. Econometric Theory 22: 835–851.

Nickell, S. 1981. Biases in dynamic models with fixed effects. Econometrica 49: 1417–1426.

Pesaran, M. H. 2006. Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74: 967–1012.

Pesaran, M. H., and R. P. Smith. 1995. Estimating long-run relationships from dynamic heterogeneous panels. Journal of Econometrics 68: 79–113.

Pullenayegum, E. M., R. W. Platt, M. Barwick, B. M. Feldman, M. Offringa, and L. Thabane. 2016. Knowledge translation in biostatistics: A survey of current practices, preferences, and barriers to the dissemination and uptake of new statistical methods. Statistics in Medicine 35: 805–818.

Roodman, D. 2009. How to do xtabond2: An introduction to difference and system GMM in Stata. Stata Journal 9: 86–136.

Solon, G. 1984. The effects of unemployment insurance eligibility rules on job quitting behavior. Journal of Human Resources 19: 118–126.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

About the author

Jesse Wursten is a PhD student at KU Leuven.

A Appendix

A.1 Example time series

Figures 2–3 illustrate the difference between residuals with a constant variance and those whose variance increases exponentially over time. The top panel shows an example time series without serial correlation, the bottom panel one with mild first- and second-order autocorrelation.

(Figure 2 shows two panels of a simulated series against t: "No serial correlation" (e_no) and "First- and second-order autocorrelation, a1 = 0.1, a2 = 0.05" (e_ar).)

Figure 2. Constant variance, with and without serial correlation

(Figure 3 shows two panels of a simulated series against t: "No serial correlation, heteroskedastic process" (e_no_hetero) and "First- and second-order autocorrelation, heteroskedastic process, a1 = 0.1, a2 = 0.05" (e_ar_hetero).)

Figure 3. Time-varying (“heteroskedastic”) variance, with and without serial correlation

A.2 Monte Carlo evidence: Moving averages

Tables 7–8 present power results for errors following a first-order moving-average [MA(1)] process, with θ1 = 0.10. There are no major differences compared with the AR(2) results, with the Q test performing best overall. We include the unrestricted IS test in these tables. Overall, its performance is very similar to that of the IS(2) test, with a mild reduction in power, but note that in many cases, it cannot be estimated because the dimension of the test is too large.

Table 7. Monte Carlo results, MA(1)-process, scenario [1]–[3], power

                            T = 7                      T = 50
                 N = 20   N = 50   N = 500   N = 20   N = 50   N = 500
MA(1) (θ1 = 0.10)                                         Ideal: 1.00
[1] Constant variance
- IS(2)            0.00     0.08      0.89        .        .      1.00
- IS(all)          0.00     0.07      0.80        .        .         .
- Q(2)             0.16     0.25      0.99     0.81     1.00      1.00
- Q(4)             0.23     0.23      0.97     0.78     0.99      1.00
- LM(1)            0.14     0.30      0.99     0.86     1.00      1.00
- LM(2)            0.08     0.11      0.53     0.07     0.07      0.11
- HR               0.10     0.17      0.89     0.81     0.99      1.00
- AB(2)            0.13     0.29      1.00     0.66     0.97      1.00
- WD               0.14     0.21      0.95     0.59     0.95      1.00

[2] Time-varying variance
- IS(2)            0.00     0.52      1.00        .        .      1.00
- IS(all)          0.00     0.35      1.00        .        .         .
- Q(2)             0.15     0.26      0.99     0.30     0.54      1.00
- Q(4)             0.24     0.23      0.98     0.34     0.48      1.00
- LM(1)            0.12     0.23      0.97     0.32     0.61      1.00
- LM(2)            0.11     0.17      0.90     0.07     0.06      0.06
- HR               0.10     0.16      0.86     0.26     0.51      1.00
- AB(2)            0.15     0.31      1.00     0.61     0.81      1.00
- WD               0.07     0.06      0.11     0.07     0.07      0.16

[3] Unbalanced panel
- IS(2)            0.01     0.08      0.82        .        .      1.00
- IS(all)          0.01     0.08      0.72        .        .         .
- Q(2)             0.13     0.20      0.96     0.72     0.98      1.00
- Q(4)             0.21     0.21      0.92     0.68     0.96      1.00
- LM(1)            0.12     0.24      0.98     0.79     0.99      1.00
- LM(2)            0.09     0.11      0.59     0.07     0.07      0.11
- HR               0.09     0.12      0.73     0.70     0.98      1.00
- AB(2)            0.20     0.46      1.00     0.50     0.91      1.00
- WD               0.12     0.18      0.89     0.50     0.90      1.00

Simulations performed with 2,000 replications.

Table 8. Monte Carlo results, MA(1)-process, scenario [4]–[7], power

                            T = 7                      T = 50
                 N = 20   N = 50   N = 500   N = 20   N = 50   N = 500
MA(1) (θ1 = 0.10)                                         Ideal: 1.00
[4] No fixed effects
- IS(2)            0.00     0.73      1.00        .        .      1.00
- IS(all)          0.00     0.63      1.00        .        .         .
- Q(2)             0.15     0.24      0.99     0.80     0.99      1.00
- Q(4)             0.23     0.22      0.97     0.77     0.98      1.00
- LM(1)            0.15     0.30      0.99     0.86     1.00      1.00
- LM(2)            0.09     0.11      0.53     0.07     0.06      0.10
- HR               0.10     0.17      0.88     0.80     0.99      1.00
- WD               0.13     0.22      0.95     0.59     0.95      1.00

[5] Heterogeneous slopes
- IS(2)            0.00     0.06      0.71        .        .      1.00
- IS(all)          0.00     0.05      0.58        .        .         .
- Q(2)             0.12     0.18      0.94     0.79     0.99      1.00
- Q(4)             0.21     0.18      0.88     0.75     0.98      1.00
- LM(1)            0.11     0.22      0.96     0.85     1.00      1.00
- LM(2)            0.07     0.09      0.40     0.07     0.07      0.10
- HR               0.08     0.13      0.73     0.80     0.99      1.00
- WD               0.13     0.22      0.95     0.59     0.95      1.00

[6] Dynamic panel
- IS(2)            0.02     0.10      0.98        .        .      1.00
- IS(all)          0.02     0.10      0.92        .        .         .
- Q(2)             0.10     0.20      1.00     0.37     0.83      1.00
- Q(4)             0.09     0.12      1.00     0.39     0.74      1.00
- LM(1)            0.09     0.26      1.00     0.48     0.91      1.00
- LM(2)            0.07     0.11      0.55     0.08     0.11      0.73
- HR               0.05     0.07      0.28     0.35     0.76      1.00
- WD               1.00     1.00      1.00     1.00     1.00      1.00

[7] Cross-sectional dependence
- IS(2)            0.00     0.93      1.00        .        .      1.00
- IS(all)          0.00     0.81      1.00        .        .         .
- Q(2)             0.73     0.89      0.99     0.79     0.93      0.99
- Q(4)             0.84     0.97      1.00     0.92     0.99      1.00
- LM(1)            0.56     0.74      0.91     0.62     0.77      0.93
- LM(2)            0.59     0.73      0.91     0.60     0.77      0.93
- HR               0.55     0.72      0.92     0.63     0.76      0.93
- WD               0.62     0.76      0.92     0.67     0.79      0.94

Simulations performed with 2,000 replications.

A.3 Monte Carlo evidence: Endogenous regressor and cross-sectional dependence

If we introduce a bias in the coefficient of the regressor x_it by correlating it with the error term, we find that in longer panels, the size still remains acceptable (top half of table 9). This is in contrast with the lagged dependent variable approach in table 5. The good performance of the WD test is due to a lack of serial correlation in the endogenous regressor and may not extend to more realistic situations.

Table 9. Monte Carlo results, endogenous regressor and CDP, size

                            T = 7                      T = 50
                 N = 20   N = 50   N = 500   N = 20   N = 50   N = 500
No serial correlation                                     Ideal: 0.05
Endogenous regressor
- IS(2)            0.00     1.00      1.00        .        .      0.76
- Q(2)             0.43     0.61      0.94     0.13     0.09      0.13
- Q(4)             0.73     0.85      1.00     0.21     0.12      0.16
- LM(1)            0.28     0.41      0.76     0.09     0.08      0.11
- LM(2)            0.28     0.39      0.76     0.09     0.07      0.10
- HR               0.27     0.39      0.76     0.10     0.07      0.11
- WD               0.07     0.06      0.05     0.05     0.05      0.05

Cross-sectional dependence (estimated with the CCEP estimator)
- IS(2)            0.00     0.04      0.05        .        .      0.03
- Q(2)             0.10     0.07      0.05     0.09     0.06      0.05
- Q(4)             0.22     0.10      0.06     0.17     0.09      0.05
- LM(1)            0.07     0.05      0.05     0.07     0.06      0.05
- LM(2)            0.07     0.05      0.05     0.07     0.05      0.05
- HR               0.07     0.06      0.04     0.07     0.06      0.05

Simulations performed with 2,000 replications.

In the bottom half of table 9, we see that if we estimate our cross-sectionally dependent panel with the common correlated effects pooled (CCEP) estimator (Pesaran 2006), sizes of all tests return to normal. The CCEP estimator adds the cross-sectional averages of each variable (dependent and independent) as regressors, with cross-section-specific coefficients. This filters out the cross-sectional dependence if N is sufficiently large.

The Stata Journal (2018) 18, Number 1, pp. 101–117

ldagibbs: A command for topic modeling in Stata using latent Dirichlet allocation

Carlo Schwarz
University of Warwick
Coventry, UK
[email protected]

Abstract. In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.

Keywords: st0515, ldagibbs, machine learning, latent Dirichlet allocation, Gibbs sampling, topic model, text analysis

1 Introduction

Text data are a potentially rich source of information for researchers. For example, many datasets consist of large collections of unclassified text data. These text collections, which are usually referred to as “corpora”, can consist of many thousands of individual documents with millions of words. Hence, a text corpus often cannot be used for statistical analysis without automated text analysis or a machine-learning algorithm. In this article, I therefore introduce the ldagibbs command for latent Dirichlet allocation (LDA), which was developed by Blei, Ng, and Jordan (2003). LDA is the most popular topic model and allows for the automatic clustering of any kind of text documents into a user-chosen number of clusters of similar content, usually referred to as “topics”.

LDA relies only on text data; therefore, researchers can apply LDA for nearly all tasks of text clustering. Additionally, LDA can cluster corpora independently of the number of documents and the language of the texts. LDA's topic clustering has already proven to be reliable for many different text corpora. For example, LDA is capable of grouping together medical studies (Wu et al. 2012), statements from the transcripts of the Federal Reserve's Open Market Committee (Hansen, McMahon, and Prat Forthcoming), and journal articles (Griffiths and Steyvers 2004).

In its clustering, LDA uses a probabilistic model of the text data. This model exploits the fact that when talking about the same topic, authors tend to use similar words. Hence, in texts about the same topic, similar words tend to co-occur. LDA uses the co-occurrences of words to describe each topic as a probability distribution over words and to describe each document as a probability distribution over topics.


The ldagibbs command conveniently runs LDA in Stata. Its output enables a comparison of similarities between documents and topics, making it easy to analyze their previously hidden relationships.

ldagibbs extends the text analysis capabilities of Stata, which so far make it possible to compare string similarities based on their Levenshtein edit distances with the strdist command (Barker 2012), by making it possible to compare the similarities between entire documents. While screening (Belotti and Depalo 2010) already allows one to search texts for individual keywords, and txttool (Williams and Williams 2014) allows one to create a bag-of-words representation of text data, ldagibbs provides a new option for large-scale document-content analysis. Thereby, ldagibbs makes previously unused text data usable for research.

2 LDA

LDA consists of two parts: The first is the probabilistic model, which describes the text data as a likelihood function. The second is an approximate inference algorithm, because simply maximizing this likelihood function is computationally unfeasible.

2.1 Probabilistic model

The probabilistic model of LDA assumes that each document d of the D documents in the corpus can be described as a probabilistic mixture of T topics. These probabilities are contained in a document topic vector $\theta_d$ of length T. The output of LDA is a $D \times T$ matrix $\theta$ containing $P(t_t \mid d_d)$, which is the probability of document d belonging to topic t:

\[
\theta = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_D \end{pmatrix}
       = \begin{pmatrix} P(t_1 \mid d_1) & \cdots & P(t_T \mid d_1) \\ \vdots & \ddots & \vdots \\ P(t_1 \mid d_D) & \cdots & P(t_T \mid d_D) \end{pmatrix}
\]

Each topic $t \in T$ is defined by a probabilistic distribution over the vocabulary (the set of unique words in all documents) of size V. A topic, therefore, describes how likely it is to observe a word conditional on a topic. For example, documents regarding physics might have a high probability for words such as "particle", "quantum", and "boson", while the most frequent words in documents regarding medicine could be "clinical", "virus", and "infection". The word-probability vectors of each topic are contained in a matrix $\phi$ of dimensions $V \times T$:

\[
\phi = (\phi_1, \ldots, \phi_T)
     = \begin{pmatrix} P(w_1 \mid t_1) & \cdots & P(w_1 \mid t_T) \\ \vdots & \ddots & \vdots \\ P(w_V \mid t_1) & \cdots & P(w_V \mid t_T) \end{pmatrix}
\]

The probabilities $P(w_v \mid t_t)$ in the $\phi_t$ vectors describe how likely it is to observe word v from the vocabulary, conditional on topic t. In this way, the $\phi_t$ vectors allow one to judge the content of each topic, because LDA does not provide any additional topic labels.

Given the parameters θ and φ, LDA’s probabilistic model assumes that the text in the corpus is generated by the following process:

1. Draw a word-probability distribution $\phi \sim \text{Dir}(\beta)$.

2. For each document d in the corpus:

a. Draw topic proportions θd ∼ Dir(α).

b. For each of the Nd words wd in d:

i. Draw a topic assignment $z_{d,n} \sim \text{Mult}(\theta_d)$.
ii. Draw a word $w_{d,n}$ from $p(w_{d,n} \mid z_{d,n}, \phi)$.

In this model, α and β are hyperparameters, which are necessary for the Gibbs sampling process. Given the probabilistic model, the overall likelihood of the corpus with respect to the model parameters is ⎧ ⎫ (D ⎨(Nd ⎬ θ | |θ | φ P ( d α) ⎩ P (zd,n d) P (wd,n zd,n, )⎭ (1) d=1 n=1 zd,n

$P(\theta_d \mid \alpha)$ in (1) is the likelihood of observing the topic distribution $\theta_d$ of document d, conditional on α. $P(z_{d,n} \mid \theta_d)$ is the likelihood of the individual topic assignment $z_{d,n}$ of word n in document d, conditional on the topic distribution of the document. Finally, $P(w_{d,n} \mid z_{d,n}, \phi)$ is the probability to observe a specific word, conditional on the topic assignment of the word and the word probabilities of the given topic, which are contained in $\phi$. By calculating the sum over all possible topic assignments ($\sum_{z_{d,n}}$), the product over all $N_d$ words in a document ($\prod_{n=1}^{N_d}$), and the product over all documents in the corpus ($\prod_{d=1}^{D}$), we obtain the likelihood of observing the words in the documents.

The LDA classification is based on finding the optimal topic assignment $z_{d,n}$ for each word in each document and the optimal word probabilities $\phi$ for each topic that maximize this likelihood. Maximizing this likelihood would require summing over all possible topic assignments for all words in all documents, which is computationally unfeasible. Therefore, alternative methods to approximate the likelihood function have been developed. The ldagibbs command uses the Gibbs sampling method developed by Griffiths and Steyvers (2004).

2.2 Approximate inference using Gibbs sampling

Gibbs sampling is a Markov chain Monte Carlo algorithm, which is based on repeatedly drawing new samples conditional on all other data. The WinBUGS command (Thompson, Palmer, and Moreno 2006) provides a general framework for the implementation of the Gibbs sampler in Stata.

In the specific case of LDA, the Gibbs sampler relies on iteratively updating the topic assignment of words, conditional on the topic assignments of all other words. Because Gibbs sampling is a Bayesian technique, it requires priors for the values of the hyperparameters α and β. These initial parameter values should lie inside the unit interval. The prior for α is usually chosen based on the number of topics T, while the prior for β depends on the size of the vocabulary. The higher the number of topics or the larger the vocabulary, the smaller the priors for α and β should be. As a heuristic for the choice of priors, Griffiths and Steyvers (2004) suggest α = 50/T (only applicable if T > 50) and β = 0.1. In general, the choice of the priors should not influence the outcome of the sampling process.

As the first step, ldagibbs splits the documents into individual words, so-called word tokens. These word tokens are then randomly assigned to one of the T topics with equal probabilities. This provides an initial assignment of words, and thereby documents, to topics for the sampling process. Afterward, ldagibbs samples new topic assignments for each of the word tokens. The probability of a word token being assigned to topic t based on the probabilistic model is

$$P(z_{d,n} = t \mid w_{d,n}, \phi) \;\propto\; P(w_{d,n} \mid z_{d,n} = t, \phi) \times P(z_{d,n} = t)$$

The Gibbs sampler uses the topic assignments of all other tokens to obtain approximate values for P(wd,n|zd,n = t, φ) and P(zd,n = t). P(wd,n|zd,n = t, φ) is given by the number of words identical to wd,n assigned to topic t divided by the total number of words assigned to that topic. For example, if a specific word is assigned to topic t some x times, then P(wd,n|zd,n = t, φ) = (x + β)/(N^t + Vβ), where N^t is the total number of words in the corpus assigned to topic t, and β is the chosen prior for β. Similarly, P(zd,n = t) is given by the fraction of words in a document currently assigned to topic t plus the prior for α. In each iteration, ldagibbs samples new topic assignments for all words in the corpus.

By running the Gibbs sampler for a burn-in of several hundred iterations, the Markov chain converges toward the maximum of the likelihood function. In the example of Griffiths and Steyvers (2004), the sampler already reaches convergence after 150 iterations. To analyze the convergence of the Gibbs sampler, ldagibbs provides an option called likelihood. When the likelihood option is specified, the Gibbs sampler reports the log likelihood and the change of the log likelihood of the model every 50 iterations. The log likelihood should initially increase with the number of iterations of the Gibbs sampler until only small changes occur.1

After the burn-in period, samples are taken from the Gibbs sampler to obtain an approximation of the parameter values of θd and φ. To guarantee the statistical independence of these samples, additional iterations of the sampling process should be run in between the samples.
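Combining these two factors gives, up to normalization, the full conditional that the collapsed Gibbs sampler of Griffiths and Steyvers (2004) draws from. As a sketch, writing n_t^w for the number of tokens of word w currently assigned to topic t, n_t for the total number of tokens assigned to topic t, and n_t^d for the number of tokens in document d assigned to topic t (all counts excluding the token currently being resampled):

$$P(z_{d,n} = t \mid w_{d,n} = w, \mathbf{z}_{-(d,n)}, \mathbf{w}) \;\propto\; \frac{n_t^{w} + \beta}{n_t + V\beta} \times \frac{n_t^{d} + \alpha}{N_d - 1 + T\alpha}$$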

1. The log likelihood will be negative in nearly all cases. Furthermore, it is possible that the likelihood falls in some iterations of the Gibbs sampler, especially because the log likelihood is calculated using only the current state of the Gibbs sampler.

As in most Gibbs sampling applications, the Gibbs sampler may converge toward a local instead of the global maximum of the likelihood function. Users can experiment with different random seeds to compare different sampling results. Often, these local maxima result in similar topics and document clusters.

3 Implementation

This section describes how to use the ldagibbs command to fit the topic model to text data. For ldagibbs, each observation in the dataset should represent one of the documents the user wants to cluster. Long documents can also be split into paragraphs to cluster each paragraph separately. In both cases, the text strings for the clustering should be contained in one variable. It is advisable to clean nonalphanumerical characters from these text strings because they can potentially interfere with the clustering.

In theory, LDA can be applied to any type of text data independently of the length of the individual texts. LDA works well for abstracts of scientific articles or when the individual text documents contain at least 50–100 words. The number of words required for easily interpretable topics varies with the type of document and the size of the vocabulary. When the individual text documents are very short, LDA might generate less meaningful topics. For example, LDA does not perform as well when applied to tweets (Hong and Davison 2010; Zhao et al. 2011), because in these cases, LDA has too little word co-occurrence information. This problem can be overcome either by combining short documents (Quan et al. 2015) or by using an alternative topic model (for example, see Rosen-Zvi et al. [2004], Yan et al. [2013], or Kim and Shim [2014]).

The ldagibbs command requires only the variable containing the text strings as an input. The individual options of the command allow one to modify the behavior of the Gibbs sampler and specify names for the output variables. For convenience, ldagibbs also includes some basic text cleaning capabilities to remove stopwords and short words from the data. For more advanced text cleaning options, see the txttool command (Williams and Williams 2014).

3.1 Syntax

ldagibbs varname [, topics(integer) burnin_iter(integer) alpha(real) beta(real) samples(integer) sampling_iter(integer) seed(integer) likelihood min_char(integer) stopwords(string) name_new_var(string) normalize mat_save path(string)]

3.2 Options

Gibbs sampler options

topics(integer) specifies the number of topics LDA should create. The default is topics(10).

burnin_iter(integer) specifies how many iterations the Gibbs sampler should run as a burn-in period. The default is burnin_iter(500).

alpha(real) sets the prior for the topic-probability distribution. The value for alpha() should be between 0 and 1. As a heuristic, users can use α = 50/T (only applicable if T > 50). The default is alpha(0.25).

beta(real) sets the prior for the word-probability distribution. The value for beta() should be between 0 and 1. The default is beta(0.1).

samples(integer) specifies how many samples the algorithm should collect after the burn-in period. To obtain robust results, at least 10 samples should be taken. The default is samples(10).

sampling_iter(integer) specifies how many iterations the Gibbs sampler should ignore between the individual samples. Running additional iterations of the Gibbs sampler guarantees the statistical independence of the samples. The default is sampling_iter(50).

seed(integer) sets the seed for the random-number generator to guarantee the reproducibility of the results. The default is seed(0).

likelihood specifies that the Gibbs sampler calculate and report the log likelihood of the LDA model every 50 iterations. This option allows one to analyze the convergence of the Gibbs sampler but slows down the sampling process.

Text cleaning options

min_char(integer) allows the removal of short words from the texts. Words with fewer characters than integer will be excluded from the sampling algorithm. The default is min_char(0).

stopwords(string) specifies a list of words to exclude from the Gibbs sampler. Usually, highly frequent words, such as "I" or "you", are removed from the text because these words do not help classify the documents. Predefined stopword lists for different languages are available online.

Output options

name_new_var(string) specifies the name of the output variables created by ldagibbs. These variables contain the topic assignments for each document. string should be unique in the dataset. The default is name_new_var("topic_prob"), such that the names of the new variables will be topic_prob1, topic_prob2, ..., topic_probT, where T is the total number of topics.

normalize specifies that ldagibbs normalize the raw topic assignment counts to probabilities. By default, ldagibbs returns the raw topic assignment counts. You almost always want to specify this option.

mat_save specifies to save the word-probability matrix. This matrix defines the most frequent words in each topic. By default, the matrix will not be saved.

path(string) sets the path where the word-probability matrix is to be saved.

3.3 Output

ldagibbs generates T new variables that describe the topics of each document. If normalize is specified, each variable contains the probability of a document belonging to topic t ∈ T. Without normalize, the variable contains the number of words in the document being assigned to topic t plus α × s, where s is the number of samples drawn (samples(s)).

The content of each topic can be analyzed using the word-probability vectors. To save these vectors, the mat_save option must be specified. ldagibbs then saves a Mata matrix file with the name word_prob in the folder specified in the path() option.2 The file contains the word probabilities for each of the T topics. To analyze the word-probability vectors, ldagibbs also provides the wprobimport command, which imports the stored word-probability data into a Stata dataset. Sorting the word_prob data in descending order by one of the variables will ensure the most frequent words in each topic are displayed.
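If normalize was not specified at estimation time, the stored values can still be turned into shares after the fact. A minimal sketch, assuming the default variable stub topic_prob, T = 5 topics, and that ldagibbs's normalization amounts to a simple row normalization of these raw values:

* convert raw topic assignment counts to probabilities (assumed equivalent to normalize)
egen double row_total = rowtotal(topic_prob1-topic_prob5)
forvalues i = 1/5 {
    replace topic_prob`i' = topic_prob`i' / row_total
}
drop row_total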

4 Examples

Example 1: Clustering laws

The first example dataset contains the title and the summary of 1,000 bills from the U.S. Congress. After opening it and combining the title and the summary for each bill, nonalphanumerical characters are removed from the strings. This provides the raw strings for LDA.

. use example_data_laws
. *combine title and summary
. generate text_strings = title + " " + summary
. *remove nonalphanumerical characters
. replace text_strings = subinstr( text_strings, ".", " ", .)
(613 real changes made)

2. Storing the data as a Mata matrix is the most efficient solution because the Stata matrix size is limited; thus, storing the results as a Stata dataset could potentially be memory intensive.

. replace text_strings = subinstr( text_strings, "!", " ", .)
(0 real changes made)
. replace text_strings = subinstr( text_strings, "?", " ", .)
(0 real changes made)
. replace text_strings = subinstr( text_strings, ":", " ", .)
(142 real changes made)
. replace text_strings = subinstr( text_strings, ";", " ", .)
(142 real changes made)
. replace text_strings = subinstr( text_strings, ",", " ", .)
(361 real changes made)
. replace text_strings = subinstr( text_strings, "(", " ", .)
(189 real changes made)
. replace text_strings = subinstr( text_strings, ")", " ", .)
(189 real changes made)
. replace text_strings = stritrim(text_strings)
(413 real changes made)
. list text_strings if _n<=10

text_strings

 1. Social Security and Medicare Lock-Box Act of 2001 Requires any Feder..
 2. Economic Growth and Tax Relief Act of 2001 Prohibits minimum bracket..
 3. Energy Policy Act of 2002 Sec 206 Grants FERC jurisdiction over an e..
 4. Reserved for the speaker
 5. Marriage Penalty and Family Tax Relief Act of 2001 Repeals provision..

 6. CARE Act of 2002 Sec 103 Sets forth a rule for determining the amoun..
 7. Death Tax Elimination Act of 2001 Title III Unified Credit Replaced ..
 8. Reserved for the speaker
 9. Railroad Retirement and Survivors´ Improvement Act of 2001 Sec 102 R..
10. Financial Contract Netting Improvement Act of 2001

The LDA Gibbs sampling process is started using the ldagibbs command. For the small example dataset, LDA takes a few minutes to execute. Because the time needed for sampling increases in proportion to the number of words in all documents, running LDA for large datasets with tens of thousands of observations can take several hours. The sampler reports each time 50 iterations have finished and each time a new sample is collected. This provides an estimate of how long the entire sampling process will take.

. global stopwords "a able about across after all almost also am among an and
> any are as at be because been but by can cannot could dear did do does either
> else ever every for from get got had has have he her hers him his how however
> i if in into is it its just least let like likely may me might most must my
> neither no nor not of off often on only or other our own rather said say says
> she should since so some than that the their them then there these they this
> tis to too twas us wants was we were what when where which while who whom why
> will with would yet you your"

. ldagibbs text_strings, topics(5) alpha(0.25) beta(0.1) seed(3) burnin_iter(1000)
> samples(10) sampling_iter(50) min_char(5) normalize mat_save
> stopwords("$stopwords")
*************************************
********* Parameters for LDA *********
*************************************
Number of Topics: 5
Prior Alpha: .25
Prior Beta: .1
Number of Iterations: 1000
Number of Samples: 10
Iterations between Samples: 50
Seed: 3
Minimal Word Length: 5
*************************************

*************************************
******** Preparing Documents ********
*************************************

Processing Document: 500
Processing Document: 1000
*************************************
**** Initializing Gibbs Sampler *****
*************************************

*************************************
******* Running Gibbs Sampler *******
*************************************

Iteration: 50
Iteration: 100
 (output omitted )
Iteration: 1000
Sample: 2
 (output omitted )
Iteration: 50

After LDA is finished, one can sort the documents by the assigned topics. Even in the small example dataset, LDA accurately classifies relevant bills (for example, bills about educational policy) into one topic.

. *list example topics
. forvalues i=1/5 {
  2. gsort -topic_prob`i'
  3. list title topic_prob`i' if _n<=5, str(45)
  4. }

title                                               topic_p~1
1. MD-CARE Act                                      .97647059
2. Public Education Reinvestment, Reinvention, a..  .97429429
3. National Science Education Act                   .97394844
4. Character Counts for the 21st Century Act        .96168582
5. Excellence and Accountability in Education Act   .95501449

(output omitted )

As a final component of analysis, one can load the information from the word-probability matrix by using the wprobimport command. After importing the data, the most frequent words in each topic are listed. This allows the user to assign labels to the topics if required. It is worth noting again that the individual φt vectors for each topic describe how likely it is to see a word conditional on a topic. Hence, the sum of all word probabilities within each individual φt vector is 1, but the word probabilities across topics will not add up to 1.

. wprobimport using "word_prob"
. *list most frequent words in each topic
. forvalues i=1/5 {
  2. gsort -word_prob`i'
  3. list words word_prob`i' if _n<=5
  4. }

words        word_pr~1
1. education   .02226475
2. secretary   .01650924
3. programs    .01549032
4. program     .01342494
5. requires    .01320464

words        word_pr~2
1. amend       .02118311
2. health      .01655658
3. revenue     .0138505
4. internal    .01347223
5. certain     .01341403

words        word_pr~3
1. energy      .02905464
2. national    .01797795
3. secretary   .01628051
4. federal     .01409986
5. program     .0095775

words        word_pr~4
1. bankruptcy  .01477839
2. property    .0123517
3. debtor      .01002773
4. court       .00991218
5. certain     .00885933

words        word_pr~5
1. federal     .02166268
2. states      .0160524
3. united      .01468338
4. state       .01199903
5. protection  .00836174
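Because each φt vector is a probability distribution over the vocabulary, each imported word-probability column should sum to (approximately) one. A quick check, assuming the five word_prob variables created by wprobimport above are still in memory:

* verify that the word probabilities within each topic sum to one
tabstat word_prob1-word_prob5, statistics(sum)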

Example 2: Clustering economic articles

The second example uses titles and abstracts from 10,000 economic articles published between 2010 and 2015. The data are prepared similarly to the first example. Afterward, the Gibbs sampler is started, this time specifying the likelihood option. The reported likelihood numbers indicate that the Gibbs sampler converges after around 400 iterations, because afterward the likelihood changes only marginally.

. ldagibbs text_strings, topics(15) alpha(0.25) beta(0.1) seed(6) burnin_iter(600)
> samples(10) sampling_iter(50) min_char(5) normalize likelihood
> stopwords("$stopwords")
*************************************
********* Parameters for LDA *********
*************************************
 (output omitted )
*************************************
******** Preparing Documents ********
*************************************

Processing Document: 500
 (output omitted )
Processing Document: 10000

*************************************
**** Initializing Gibbs Sampler *****
*************************************

*************************************
******* Running Gibbs Sampler *******
*************************************

Iteration: 50   Log-Likelihood (Change): -5207989(443079)
Iteration: 100  Log-Likelihood (Change): -5188673(19316)
Iteration: 150  Log-Likelihood (Change): -5183687(4986)
Iteration: 200  Log-Likelihood (Change): -5182548(1139)
Iteration: 250  Log-Likelihood (Change): -5181958(589)
Iteration: 300  Log-Likelihood (Change): -5181721(237)
Iteration: 350  Log-Likelihood (Change): -5180363(1358)
Iteration: 400  Log-Likelihood (Change): -5180061(302)
Iteration: 450  Log-Likelihood (Change): -5180449(-389)
Iteration: 500  Log-Likelihood (Change): -5180211(238)
Iteration: 550  Log-Likelihood (Change): -5179952(259)
Iteration: 600  Log-Likelihood (Change): -5179954(-2)

 (output omitted )

As a next step, one can generate indicator variables for the topic with the largest topic share in each article and plot the average topic shares over time.

. egen max_prob = rowmax(topic_prob*)
. replace max_prob = round(max_prob,0.0000001)
(5,742 real changes made)
. forvalues i=1/15 {
  2. gsort -topic_prob`i'
  3. display as error `i'
  4. list title if _n<5
  5. replace topic_prob`i' = float(round(topic_prob`i',0.0000001))
  6. generate topic_`i' = (max_prob==topic_prob`i')
  7. }
1

title

1. Estimating panel data models in the presence of endogeneity and se..
2. Residual based tests for cointegration in dependent panels
3. Estimation in Partially Linear Single-Index Panel Data Models With..
4. Semiparametric GMM estimation of spatial autoregressive models

(10,000 real changes made)
2

title

1. Corporate governance when founders are directors
2. Corporate governance and capital allocations of diversified firms
3. Shareholder governance, bondholder governance, and managerial risk..
4. Staggered boards, corporate opacity and firm value

(10,000 real changes made)
 (output omitted )
. preserve
. collapse (mean) topic_prob*, by(pub_year)

. twoway line topic_prob1 pub_year || line topic_prob2 pub_year ||
> line topic_prob4 pub_year || line topic_prob8 pub_year ||
> line topic_prob12 pub_year, legend(label(1 "Econometrics")
> label(2 "Corporate governance") label(3 "Game theory")
> label(4 "Monetary economics") label(5 "Labor economy") )
> xtitle("Year") ytitle("Average topic share") graphregion(color(white))

Figure 1. Topic trends over time

As we can see in figure 1, the econometrics topic3 has by far the largest average share of the five plotted topics. Moreover, after a drop in 2013, the econometrics topic becomes more prominent in the corpus again in 2014 and 2015. One can complement this analysis by fitting a regression model using the number of citations and the previously created indicator variables.

3. The topic labels in figure 1 were given based on the titles of the five articles with the largest share in each of the topics.

. regress citations topic_1-topic_15 c.topic_1#i.pub_year i.pub_year

      Source        SS          df        MS       Number of obs  =   10,000
                                                   F(25, 9974)    =    42.90
       Model    1011163.27       25   40446.5307   Prob > F       =   0.0000
    Residual    9402745.73    9,974    942.72566   R-squared      =   0.0971
                                                   Adj R-squared  =   0.0948
       Total      10413909    9,999   1041.49505   Root MSE       =   30.704

   citations       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

     topic_1   -7.830208   2.784147    -2.81   0.005     -13.2877   -2.372718
     topic_2   -6.149039   1.611277    -3.82   0.000    -9.307467    -2.99061
     topic_3    6.123092   1.599216     3.83   0.000     2.988307    9.257878
     topic_4   -7.347089   1.611442    -4.56   0.000    -10.50584   -4.188337
     topic_5    6.878272   1.444997     4.76   0.000     4.045787    9.710758
     topic_6   -.6757591   1.510611    -0.45   0.655    -3.636861    2.285343
     topic_7    4.887668   1.644834     2.97   0.003     1.663462    8.111874
     topic_8    8.545773   1.513232     5.65   0.000     5.579532    11.51201
     topic_9    8.915616   1.487744     5.99   0.000     5.999337     11.8319
    topic_10   -2.372908   1.653648    -1.43   0.151    -5.614392    .8685759
    topic_11   -.6464624   1.771297    -0.36   0.715    -4.118562    2.825637
    topic_12    1.921868   1.686994     1.14   0.255     -1.38498    5.228716
    topic_13   -4.580828   1.508604    -3.04   0.002    -7.537996    -1.62366
    topic_14   -8.313427   1.651837    -5.03   0.000    -11.55136   -5.075493
    topic_15   -1.756457   1.824063    -0.96   0.336    -5.331989    1.819075

   pub_year#
   c.topic_1
        2011    1.082498   3.622517     0.30   0.765    -6.018367    8.183363
        2012    1.344766   3.654425     0.37   0.713    -5.818646    8.508177
        2013     7.94672   3.783296     2.10   0.036     .5306967    15.36274
        2014    5.019593   3.673452     1.37   0.172    -2.181115     12.2203
        2015    6.619671   3.539678     1.87   0.061    -.3188111    13.55815

    pub_year
        2011   -4.859705   1.150878    -4.22   0.000    -7.115658   -2.603751
        2012   -9.511767   1.164812    -8.17   0.000    -11.79503     -7.2285
        2013   -16.19891   1.137882   -14.24   0.000    -18.42938   -13.96843
        2014   -20.13599   1.131391   -17.80   0.000    -22.35375   -17.91824
        2015   -24.88938   1.132622   -21.98   0.000    -27.10955   -22.66921

       _cons    31.07525   1.215859    25.56   0.000     28.69192    33.45858

The regression results allow for a comparison of the citations of articles in different topics over time. In this way, LDA provides a way to identify trending topics and analyze the shifting importance of research fields.
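To make this comparison concrete, one could, for example, trace out the citation premium of the econometrics topic across publication years with margins. A sketch, assuming the regression above is still the active estimation results:

* marginal effect of topic_1 on citations at each publication year
margins pub_year, dydx(topic_1)

This combines the main effect of topic_1 with the pub_year#c.topic_1 interaction terms and reports one marginal effect per year.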

5 Acknowledgments

I thank Sascha O. Becker and Fabian Waldinger for their valuable suggestions on the draft of this article. Further thanks go to the editor and one anonymous referee, whose comments helped me to further improve the article.

6 References

Barker, M. 2012. strdist: Stata module to calculate the Levenshtein distance, or edit distance, between strings. Statistical Software Components S457547, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457547.html.

Belotti, F., and D. Depalo. 2010. Translation from narrative text to standard codes variables with Stata. Stata Journal 10: 458–481.

Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022.

Griffiths, T. L., and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101: 5228–5235.

Hansen, S., M. McMahon, and A. Prat. Forthcoming. Transparency and deliberation within the FOMC: A computational linguistics approach. Quarterly Journal of Economics.

Hong, L., and B. D. Davison. 2010. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, 80–88. New York: ACM.

Kim, Y., and K. Shim. 2014. TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation. Information Systems 42: 59–77.

Quan, X., C. Kit, Y. Ge, and S. J. Pan. 2015. Short and sparse text topic modeling via self-aggregation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2270–2276. Buenos Aires, Argentina: AAAI Press.

Rosen-Zvi, M., T. Griffiths, M. Steyvers, and P. Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, 487–494. Banff, Canada: AUAI Press.

Thompson, J., T. Palmer, and S. Moreno. 2006. Bayesian analysis in Stata with WinBUGS. Stata Journal 6: 530–549.

Williams, U., and S. P. Williams. 2014. txttool: Utilities for text analysis in Stata. Stata Journal 14: 817–829.

Wu, Y., M. Liu, W. J. Zheng, Z. Zhao, and H. Xu. 2012. Ranking gene–drug relationships in biomedical literature using latent Dirichlet allocation. Pacific Symposium on Biocomputing 422–433.

Yan, X., J. Guo, Y. Lan, and X. Cheng. 2013. A biterm topic model for short texts. In Proceedings of the 22nd International World Wide Web Conference, 1445–1456. New York: ACM.

Zhao, W. X., J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. 2011. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval Research, 338–349. Berlin: ECIR.

About the author

Carlo Schwarz is a PhD student at the University of Warwick and the Centre for Competitive Advantage in the Global Economy (CAGE) in the UK. His research interests are in the fields of applied microeconomics and political economy, with a focus on text analysis and machine learning.

The Stata Journal (2018) 18, Number 1, pp. 118–158

Exploring marginal treatment effects: Flexible estimation using Stata

Martin Eckhoff Andresen
Statistics Norway
Oslo, Norway
martin.eckhoff[email protected]

Abstract. In settings that exhibit selection on both levels and gains, marginal treatment effects (MTE) allow us to go beyond local average treatment effects and estimate the whole distribution of effects. In this article, I survey the theory behind MTE and introduce the package mtefe, which uses several estimation methods to fit MTE models. This package provides important improvements and flexibility over existing packages such as margte (Brave and Walstrum, 2014, Stata Journal 14: 191–217) and calculates various treatment-effect parameters based on the results. I illustrate the use of the package with examples.

Keywords: st0516, mtefe, margte, heterogeneity, marginal treatment effects, instrumental variables

1 Introduction

Well-known instrumental variables (IVs) methods solve problems of selection on levels, estimating local average treatment effects (LATEs) for instrument compliers even with nonrandom selection into treatment. However, in the more reasonable case with selection into treatment based both on levels and gains, LATEs may represent average treatment effects (ATEs) for very particular subpopulations, and we learn little about the distribution of treatment effects in the population at large.

Marginal treatment effects (MTEs) allow us to go beyond LATEs in settings that exhibit this sort of selection. MTEs are the ATEs for people with either a particular resistance to treatment or at a particular margin of indifference. Thus, MTEs capture heterogeneity in the treatment effect along the unobserved dimension we call resistance to treatment. This is precisely what generates selection on unobserved gains: people who choose treatment because they have a particularly low resistance might have different gains than those with high resistance. Usually at the cost of stronger assumptions than required under standard IVs, we can estimate the full distribution of (marginal) treatment effects and back out the parameters of interest. These include ATEs, ATEs on the treated (ATT) and untreated (ATUT), and policy-relevant treatment effects (PRTEs) from a hypothetical policy that shifts the propensity to choose treatment.

The most common way of estimating MTEs is via the method of local IVs (Heckman and Vytlacil 2007). Alternatively, MTEs can be estimated using the separate approach (Heckman and Vytlacil 2007; Brinch, Mogstad, and Wiswall 2017) or, in the baseline case with joint normal errors, maximum likelihood. When we estimate MTEs using the separate approach or maximum likelihood, similarities to selection models are apparent; we are basically fitting a traditional selection model. MTEs thus represent old ideas in new wrapping, but they still provide substantial contributions to the interpretation and estimation of models in settings with essential heterogeneity. Unfortunately, commands for flexibly estimating MTEs and using their output are not easily available in popular software such as Stata. Brave and Walstrum (2014) document the package margte, which estimates MTEs, but this package has some important limitations.

In this article, I introduce the package mtefe, which fits parametric and semiparametric MTE models. These include the joint normal, polynomial, and semiparametric models like Brave and Walstrum (2014); mtefe additionally adds the option of spline functions in the MTE for increased flexibility. It can fit all models using either local IVs or the separate approach as well as maximum likelihood for the joint normal model, while the estimation method in margte depends on the model. Furthermore, mtefe allows for fixed effects using Stata's categorical variables, which is important to isolate exogenous variation in many applications and provides gains in computational speed over generating dummies manually. mtefe supports frequency and probability weights, which is important to obtain population estimates in many datasets.

In addition, mtefe exploits the full potential of MTEs by calculating treatment-effect parameters as weighted averages of the MTE curve, shedding light on why, for example, the LATE differs from the ATE. Other improvements include calculation of analytic standard errors (that admittedly ignore the uncertainty in the propensity-score estimation), large improvements in computational speed when fitting semiparametric models, and reestimation of the propensity score when bootstrapping for more appropriate inference.

Although quickly becoming a part of the toolbox of the applied econometrician, applied work using MTEs often imposes restrictive parametric assumptions. The Monte Carlo simulations in appendix B illustrate how conclusions might be sensitive to these choices, at least under the particular data-generating processes used. This illustrates the importance of correct functional form and highlights that researchers should probe the sensitivity of their results to functional form assumptions when using MTEs, as well as base the choice of functional form on detailed knowledge about the case at hand: what might constitute the unobservables and sound economic arguments for the nature of the relationship between these unobservables and the outcome.

This article proceeds as follows: Section 2 reviews the theory behind MTEs, identifying assumptions, estimation methods, and derivation of treatment parameter weights. Section 3 presents the mtefe command, its most important options, and examples of its use. Section 4 concludes. Appendix A details the estimation algorithm of mtefe, and appendix B contains the Monte Carlo simulations.

2 Estimation of MTEs

MTEs are based on the generalized Roy model:

$$Y_j = \mu_j(X) + U_j \quad \text{for } j = 0, 1 \tag{1}$$

$$Y = DY_1 + (1 - D)Y_0 \tag{2}$$

$$D = \mathbb{1}\{\mu_D(Z) > V\}, \quad \text{where } Z = (X, Z_-) \tag{3}$$

Y1 and Y0 are the potential outcomes in the treated and untreated state, such as log wages with and without a college degree. They are both modeled as functions of observables X, which may contain fixed effects. Equation (3) is the selection equation and can be interpreted as a latent index because 1{·} is the indicator function. It is a reduced-form way of modeling selection into treatment as a function of observables X and instruments Z− that affect the probability of treatment but not the potential outcomes.

The unobservable V in the choice equation is a negative shock to the latent index determining treatment. It is often interpreted as unobserved resistance to or negative preference for treatment. As long as the unobservable V has a continuous distribution, we can rewrite the selection equation as P(Z) > UD, where UD represents the quantiles of V and P(Z) represents the propensity score. UD, by construction, has a uniform distribution in the population.1

Identification of this model requires the following assumption:

Assumption 1: Conditional independence
(U0, U1, V) ⊥ Z− | X

The model described by (1)–(3), together with Assumption 1, implies and is implied by the standard assumptions in Imbens and Angrist (1994) necessary to interpret an IV as a LATE. Vytlacil (2002) shows how the standard IV assumptions of relevance, exclusion, and monotonicity2 are equivalent to some representation of a choice equation as in (3). Thus, the model described above is no more restrictive than the LATE model used in standard IV analysis.

In principle, it is possible to estimate MTEs with no further assumptions. However, this requires full support of the propensity scores in both treated and untreated samples for all values of X. In practice, this is rarely feasible, as shown, for example, in Carneiro, Heckman, and Vytlacil (2011). Instead, most applied articles (Carneiro, Heckman, and Vytlacil 2011; Maestas, Mullen, and Strand 2013; Carneiro and Lee 2009; Cornelissen et al. Forthcoming; Felfe and Lalive 2014) proceed by imposing a stronger assumption:

1. This normalization also ensures that UD is uniform within cells of X. For details, see Mogstad and Torgovitsky (2017) and Matzkin (2007).
2. Or rather, uniformity, because it is a condition across people, not across realizations of Z; this assumption requires that Pr(D = 1|Z = z) ≥ Pr(D = 1|Z = z′) or the other way around for all people, but it does not require full monotonicity in the classical sense.

Assumption 2: Separability
E(Uj | V, X) = E(Uj | V)

This can be a strong assumption and must be carefully evaluated by the researcher for each application. Nonetheless, it is clearly less restrictive than, for example, a joint normal distribution of (U0, U1, V) as assumed by traditional selection models. This article, along with the applied MTE literature3 and the mtefe command, proceeds by imposing this stronger assumption, as well as working with linear versions of μj(X) = Xβj and μD(Z) = γZ.4

This assumption has two important consequences: First, the MTE is identified over the common support, unconditional on X. Second, the MTEs are additively separable in UD and X. This implies that the shape of the MTE is independent of X; the intercept, but not the slope, is a function of X. Using this model, we find that the returns to treatment are simply the difference between the outcomes in the treated and untreated states. MTEs were introduced by Björklund and Moffitt (1987), later generalized by Heckman and Vytlacil (1999, 2001, 2005, 2007), and with the above assumption and index structure, can be defined as

$$\mathrm{MTE}(x, u) \equiv E(Y_1 - Y_0 \mid X = x, U_D = u) = \underbrace{x(\beta_1 - \beta_0)}_{\text{heterogeneity in observables}} + \underbrace{E(U_1 - U_0 \mid U_D = u)}_{k(u):\ \text{heterogeneity in unobservables}}$$

They measure average gains in outcomes for people with particular values of X and the unobserved resistance to treatment UD. Alternatively, the MTE can be interpreted as the mean return to treatment for individuals at a particular margin of indifference. The above expression shows how the separability assumption allows us to separate the treatment effect into one part varying with observables and another part varying across the unobserved resistance to treatment.

MTEs are closely related to LATEs. A LATE is the average effect of treatment for people who are shifted into (or out of) treatment when the instrument is exogenously shifted from z to z′. In the above choice model, these people have UD in the interval (P(z), P(z′)). Note how, when z′ − z is infinitesimally small, so that P(z′) = P(z), the LATE converges to the MTE. An MTE is thus a limit form of LATE (Heckman and Vytlacil 2007).

3. Some articles apply a stronger full independence assumption: (U0, U1, V) ⊥ X, Z−. This implies the separability in Assumption 2 but is stricter than necessary. In particular, the separability assumption places no restrictions on the dependence between V and X. See Mogstad and Torgovitsky (2017, in particular sec. 6) for a review and a detailed discussion of the differences between these assumptions.
4. Note, however, that you are free to specify fixed effects, so that a fully saturated model may be specified.

2.1 Estimation using local IV

One way of estimating MTEs is using the method of local IV developed by Heckman and Vytlacil (1999, 2001, 2005). This method identifies the MTE as the derivative of the conditional expectation of Y with respect to the propensity score.

First, consider the intuition for why the derivative identifies the MTE using figure 1a. Abstracting from covariates, the dotted line represents the expected Y for each value of the propensity score, which can be estimated from data given a propensity-score model. When p increases, the probability of getting the treatment increases. If the treatment effect is constant, E(Y|p) is linear in p. In contrast, under essential heterogeneity, the increase in p has the additional effect that the expectations of the error terms change, resulting in nonlinearities in E(Y|p). Local IVs therefore identify the MTEs from nonlinearities in the expectation of Y given p.

Now, consider two particular instrument values z and z′, generating propensity scores P(z) = 0.4 and P(z′) = 0.8, respectively. This shift of the instrument induces people with unobserved resistance P(z) < UD ≤ P(z′) into treatment.

Figure 1. Identification of MTEs: (a) Local IV, plotting the MTE (left axis) and E(Y|p) (right axis); (b) Separate approach, plotting E(Y1|u), E(Y0|u), and the MTE.

Formally, using p = P(Z), the index structure, and the separability assumption, we can show that

$$E\{Y \mid X = x, P(Z) = p\} = E\{Y_0 + D(Y_1 - Y_0) \mid X = x, P(Z) = p\} = x\beta_0 + x(\beta_1 - \beta_0)p + \underbrace{pE(U_1 - U_0 \mid U_D \le p)}_{K(p)} \tag{4}$$

where we can normalize E(Uj) = 0 as long as X includes an intercept. K(p) is a nonlinear function of p that captures heterogeneity along the unobservable resistance to treatment UD; if people with different resistances to treatment have different expectations of the error terms, the MTE will be nonlinear. The K(p) notation follows Brinch, Mogstad, and Wiswall (2017). Taking the derivative of this expression with respect to p and evaluating it at u, we get the MTE

$$\frac{\partial E\{Y \mid X = x, P(Z) = p\}}{\partial p}\bigg|_{p=u} = x(\beta_1 - \beta_0) + \frac{\partial\{pE(U_1 - U_0 \mid U_D \le p)\}}{\partial p}\bigg|_{p=u}$$

$$\mathrm{MTE}(x, u) = x(\beta_1 - \beta_0) + \underbrace{E(U_1 - U_0 \mid U_D = u)}_{k(u)}$$

where k(u) = E(U1 − U0 | UD = u).

The above suggests the following estimation procedure: We start by identifying the selection into treatment based on (3), using a probability model, such as probit or logit; the linear probability model; or even a semiparametric binary choice model. With this estimate of P(Z) in hand, what remains is to make an assumption about the unknown function K(p) = pE(U1 − U0 | UD ≤ p),5 estimate the conditional expectation of Y from (4), and form its derivative to get the MTE. The different functional form assumptions commonly made on K(p) are summarized in table 1. Alternatively, if we are unwilling to make parametric assumptions, the MTE can be estimated using semiparametric methods as summarized in table 2 by estimating (4) as a partially linear model.

5. Or usually directly on k(u), and then we exploit that $K(p) = \int_0^p k(u)\,du$.
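To fix ideas, here is a minimal by-hand version of the local IV procedure just described, for a hypothetical outcome y, treatment d, single covariate x, and instrument z, using a quadratic k(u) so that K(p) is a third-order polynomial in p; mtefe automates these steps and adds fixed effects, weights, treatment-effect parameters, and appropriate standard errors:

* first stage: propensity score from a probit model
probit d x z
predict double phat, pr
generate double phat2 = phat^2
generate double phat3 = phat^3
* estimate E(Y|X,p) from (4): x, x*p, and a cubic in p (the linear term in p
* absorbs both the intercept difference and the linear part of K(p))
regress y x c.x#c.phat phat phat2 phat3
* MTE(x,u) is the derivative of the fitted conditional expectation with respect
* to p, evaluated at p = u:
*   MTE(x,u) = x*_b[c.x#c.phat] + _b[phat] + 2*_b[phat2]*u + 3*_b[phat3]*u^2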

2.2 Estimation using the separate approach

Alternatively, as suggested by Heckman and Vytlacil (2007), MTEs can be estimated using the separate approach. This has the benefit of estimating all the parameters of both the potential outcomes so that we can plot these over the distribution of UD.

First, consider the intuition using figure 1b. Abstracting from covariates, the solid line depicts the true E(Y1|UD = u), the expected value of the outcome in the treated state for a given value of u. Likewise, the dashed line depicts the expected value of the outcome in the untreated state. Next, imagine comparing two particular values of the instrument, z versus z′. Individuals with Z = z have a predicted propensity score of P(z) = 0.4, while individuals with z′ have P(z′) = 0.8. We will have both treated and untreated individuals in both these groups. For untreated individuals with Z = z, we know that their resistance UD > P(z) = 0.4. Therefore, the average outcome among these people informs us of the average of the dashed line between 0.4 and 1, depicted by the long dash dot line. In contrast, the treated individuals with Z = z have resistance UD ≤ P(z) and therefore inform us about the average of Y1 for people with UD between 0 and 0.4, depicted by the short dash dot line. Likewise, the people with Z = z′ inform us about the average of Y1 from 0 to 0.8 (dash dot line) and the average of Y0 from 0.8 to 1 (long dash line). With a continuous instrument, we will have propensity scores all over the (0, 1) interval, and we can identify nonlinear functions in u, but this illustration shows that a binary instrument identifies a linear MTE model. We need only four averages from the groups composed of the treated and untreated individuals with the instrument switched on and off (Brinch, Mogstad, and Wiswall 2017).

In practice, this means specifying some function for the conditional expectations of the error terms and then estimating the conditional expectations of Y1 and Y0 in the sample of treated and untreated separately:

$$E(Y_1 \mid X = x, D = 1) = x\beta_1 + E(U_1 \mid U_D \le p) = x\beta_1 + K_1(p) \tag{5}$$

$$E(Y_0 \mid X = x, D = 0) = x\beta_0 + E(U_0 \mid U_D > p) = x\beta_0 + K_0(p) \tag{6}$$

I follow the notation in Brinch, Mogstad, and Wiswall (2017) and control selection via the control functions Kj(p). Note that this is the specification we are using when fitting selection models; depending on the specification of Kj(p), this amounts to, for example, Heckman selection or a semiparametric selection model.

To estimate the MTE, we estimate the conditional expectation of Y in the sample of treated and untreated separately using the regression

$$Y_j = X\beta_j + K_j(p) + \varepsilon_j$$

Based on the assumptions on the unknown functions kj(u), we can infer the functional form of the control function Kj(p) as summarized in table 1. Alternatively, when using semiparametric methods, estimate (5)–(6) using partially linear models as summarized in table 2. In the implementation, this is estimated in a stacked regression to allow for some coefficients to be restricted to be the same in the treated and untreated state. After estimating Kj(p), we can construct the kj(u) functions and find the MTE estimate from

$$\mathrm{MTE}(x, u) = E(Y_1 \mid X = x, U_D = u) - E(Y_0 \mid X = x, U_D = u) = x(\beta_1 - \beta_0) + k_1(u) - k_0(u)$$

where kj(u) = E(Uj | UD = u).

Table 1. Parametric MTE models (expressions for K(p), K1(p), K0(p), k(u), k1(u), k0(u), and the implied MTE(x, u) under the joint normal, polynomial, and polynomial-with-splines assumptions; the table note explains that these are conditional expectations under different assumptions for the joint distribution of the error terms)

Table 2. Semiparametric MTE models (steps in the estimation of semiparametric MTE models using local IV, via the double residual regression of Robinson (1988), and using the separate approach)
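For the joint normal case, the separate approach boils down to a Heckman-style control-function regression in each sample. A minimal sketch with hypothetical variables y, d, x, and z (mtefe implements this, including the stacked regression with restricted coefficients and proper inference):

probit d x z
predict double phat, pr
* control functions implied by joint normality of (U0, U1, V)
generate double K1 = -normalden(invnormal(phat)) / phat        // shape of E(U1 | UD <= p)
generate double K0 =  normalden(invnormal(phat)) / (1 - phat)  // shape of E(U0 | UD > p)
regress y x K1 if d == 1   // recovers beta1; coefficient on K1 estimates rho1*sigma1
regress y x K0 if d == 0   // recovers beta0; coefficient on K0 estimates rho0*sigma0
* MTE(x,u) = x*(beta1 - beta0) + (rho1*sigma1 - rho0*sigma0)*invnormal(u)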

2.3 Estimation using maximum likelihood

In the case where we assume joint normality of (U0,U1,V), the MTE can be estimated using maximum likelihood like a Heckman selection model. It is straightforward to infer the log-likelihood function. For details, see appendix A.4 or Lokshin and Sajaia (2004).
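As a hedged sketch of what that log-likelihood function looks like, written in the form familiar from the endogenous switching regression literature (with η_ji = y_i − x_iβ_j, V standardized to N(0,1), and ρ_j the correlation between U_j and V); the exact parameterization used by mtefe is the one documented in appendix A.4:

$$\ln L = \sum_{i} D_i\left[\ln\phi\!\left(\frac{\eta_{1i}}{\sigma_1}\right) - \ln\sigma_1 + \ln\Phi\!\left(\frac{\gamma z_i - \rho_1\eta_{1i}/\sigma_1}{\sqrt{1-\rho_1^2}}\right)\right] + (1-D_i)\left[\ln\phi\!\left(\frac{\eta_{0i}}{\sigma_0}\right) - \ln\sigma_0 + \ln\left\{1 - \Phi\!\left(\frac{\gamma z_i - \rho_0\eta_{0i}/\sigma_0}{\sqrt{1-\rho_0^2}}\right)\right\}\right]$$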

2.4 Treatment-effect parameters and weights

MTEs unify most common treatment-effect parameters and allow us to recover the parameters from the MTEs. Given that we have estimated MTE(x, u), we need only estimates of the distribution of UD given x in a particular population to obtain estimates of the ATE for that population.

In practice, common treatment-effect parameters can be expressed as some weighted average of a particular MTE curve (Heckman and Vytlacil 2007). For the ATE conditional on the event A = a that defines the population relevant for the parameter of interest and X = x, we have

$$T_a(x) = E(Y_1 - Y_0 \mid A = a, X = x) = \int_0^1 \mathrm{MTE}(x, u_D)\,\omega_a(x, u_D)\,du_D$$

where ωa(x, u) = fUD|A=a,X=x(u), and f is the conditional density of UD. Because of the additive separability given by Assumption 2 and the linear forms of μj(X), this simplifies to

$$T_a(x) = x(\beta_1 - \beta_0) + \int_0^1 k(u_D)\,\omega_a(u_D)\,du_D$$

where ωa(u) = fUD|A=a(u).

In principle, we can calculate the treatment-effect parameters Ta(x) for any value of x, but we are usually interested in the unconditional parameter in the population. In general, we need to integrate over all values of X,6 but because of the additive separability implied by Assumption 2, we can estimate the average x in the population of interest separately and calculate the unconditional treatment-effect parameters as

$$T_a = E(Y_1 - Y_0 \mid A = a) = x_a(\beta_1 - \beta_0) + \int_0^1 k(u_D)\,\omega_a(u_D)\,du_D$$

where ωa(u) = fUD|A=a(u) and xa = E(X|A = a). We can estimate xa using the weighted average (1/N) Σi κi^a xi, where κi^a is an estimate of the relative probability that event a happens to person i, {P(A = a|Z = zi)}/{P(A = a)}. In practice, we can estimate both the weighted average xa and ωa(u) by using sample analogs from data on p, Z, and D as summarized in table 3.

6. Note, however, that in cases where the propensity score model is misspecified, the weighted average of the conditional weights may differ from the unconditional weights. See, for example, Carneiro, Heckman, and Vytlacil (2011) and Carneiro, Lokshin, and Umapathi (2017).

Table 3. Unconditional treatment-effect parameters and weights (definitions, weighted averages of X, and weights ωa(u) for the ATE, ATT, ATUT, LATE, linear IV/2SLS, PRTE, and the three MPRTEs; the table note describes the weights for common treatment effects)

The ATT is the average effect of treatment for the subpopulation that chooses treatment. When one calculates the weighted average of the X in this population, this parameter will weight people with high propensity scores more, precisely because they have a higher probability of choosing treatment. Likewise, the weights ωATT will weight points at the lower end of the UD distribution higher because a larger share of the population at these values of UD will choose treatment: they have lower resistance.

In contrast, the ATUT weights individuals with low propensity scores higher when calculating the weighted average of the X. This is precisely because these people, all else the same, have a higher probability of being untreated. ωATUT weights high values of UD more because these people have high resistance.
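To make these two weighting schemes concrete, recall the expressions from Heckman and Vytlacil (2007) that table 3 collects, where F_P denotes the distribution of the propensity score:

$$\omega_{\mathrm{ATT}}(u) = \frac{\Pr\{P(Z) > u\}}{E\{P(Z)\}} = \frac{1 - F_P(u)}{E(p)}, \qquad \omega_{\mathrm{ATUT}}(u) = \frac{\Pr\{P(Z) \le u\}}{E\{1 - P(Z)\}} = \frac{F_P(u)}{1 - E(p)}$$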

A LATE is the average effect of treatment for people who are shifted into (or out of) treatment when the instrument is shifted from z to z′. These are people with UD in the interval (P(z), P(z′)). To be in the complier group, an individual must have resistance between P(z) and P(z′). Note how, when z′ − z is infinitesimally small, the LATE converges to the MTE, and an MTE is thus a limit form of LATE (Heckman and Vytlacil 2007).

Similarly to the way we can estimate the probability of being a complier in a traditional IV analysis, we can estimate the weights of the linear IV or two-stage least squares (2SLS) parameter. With a continuous instrument, IV is a weighted average over all possible LATEs composed of all z–z′ pairs (Angrist and Imbens 1995). Heckman and Vytlacil (2007) show the derivation of these weights; see also section A.6 in the appendix and Cornelissen et al. (2016). In sum, the IV parameter uses a weighted average of X, in which people with large positive or negative values of υ, a measure of how much the instrument affects propensity scores for each individual, are given more weight. These are precisely the people who have higher probabilities of having their treatment status determined by the instrument. Likewise, values of UD where people have υ above the average, and thus are more likely to be compliers, get a higher weight ωIV(u).

Assuming policy invariance (see Heckman and Vytlacil [2007] for a formal definition), we can calculate the PRTE for a counterfactual policy that manipulates propensity scores. This is the expected treatment effect for the people who are shifted into treatment by the new policy relative to the baseline. If the policy is a shift in the instruments themselves, it is natural to use the estimated first stage to evaluate the shift in propensity scores, but propensity scores could be manipulated directly as well. Note that if the policy is a particular set of instrument values z′, and the baseline is another set z, the PRTE and LATE are the same. In practice, the PRTE parameter weights the treatment effect of people who are affected more strongly by the alternative policy relative to the baseline.

Lastly, we can use the estimated MTEs to calculate marginal policy-relevant treatment effects (MPRTEs), which can be interpreted as average effects of making marginal shifts to the propensity scores. MPRTEs are fundamentally easier to identify than PRTEs (Carneiro, Heckman, and Vytlacil 2010), particularly because they do not require full support; marginal changes to propensity scores will not drive the scores outside the common support.

Carneiro, Heckman, and Vytlacil (2011) suggest three ways to define distance to the margin. The first MPRTE, labeled MPRTE1 in table 3, defines the distance in terms of the difference between the index γZ and the resistance V and corresponds to a marginal change in a variable entering the first stage, such as an instrument. MPRTE2 defines the margin as having propensity scores close to the normalized resistance UD and corresponds to a policy that would increase all propensity scores by a small amount. MPRTE3 defines marginal as the relative distance between the propensity score and UD and corresponds to a policy that increases all propensity scores by a small fraction. What we can see from the expressions in table 3 is that an absolute shift in propensity scores uses the observed density of the propensity scores as the weight distribution, while a relative shift will place more weight on the upper part of the UD distribution, precisely because a relative shift increases the propensity scores of people with high initial propensity scores more.

3 The mtefe package

3.1 Syntax

As outlined in the introduction, mtefe contains many improvements over earlier MTE packages such as margte. The basic syntax of the mtefe command mimics that of Stata's ivregress and other IV estimators, while the syntax for many of its options resembles the same options in margte. All independent variable lists also accept Stata's categorical variables syntax i.varname. fweights and pweights are supported.

mtefe depvar [indepvars] (depvar_t = varlist_iv) [if] [in] [weight] [, polynomial(#) splines(numlist) semiparametric restricted(varlist) link(string) separate mlikelihood trimsupport(#) fullsupport prte(varname) bootreps(#) norepeat1 vce(vcetype) level(#) degree(#) ybwidth(#) xbwidth(#) gridpoints(#) kernel(string) first second noplot omit(varlist) savefirst(string) savepropensity(newvar) savekp saveweights(string) mtexs(matrixlist)]

3.2 Options

The most important options that determine which model is fit and how are the following:

polynomial(#) specifies polynomial MTE models of degree #. In contrast to margte, to ensure consistency between estimation procedures, this is the degree of k(u) and in turn the MTE, not the degree of K(p). Although most restrictive, to ensure consistency with margte, the default if polynomial() is not specified is the joint normal model.

splines(numlist) adds second-order and higher splines to k(u) at the points specified by numlist. Use this only for polynomial models of degree ≥ 2. All points in numlist must be in the interval (0, 1).

semiparametric fits K(p) or Kj(p) using semiparametric methods as described in table 2. This amounts to the semiparametric model if polynomial() is not specified or the semiparametric polynomial otherwise.

restricted(varlist) specifies that variables in varlist be included in both the first and second stages but are restricted to have the same effect in the two states.

link(string) specifies the first-stage model. string may be probit, logit, or lpm. The default is link(probit).

separate fits the model using the separate approach rather than local IVs.

mlikelihood fits the model using maximum likelihood rather than local IVs. This is appropriate only for the joint normal model.

The command allows many other options, for which I refer the reader to the help file. The mtefe package additionally contains the command mtefeplot, which plots one or more MTE plots (optionally including treatment parameter weights) based on stored or saved MTE estimates from mtefe. The command mtefe_gendata generates the data in the following examples and Monte Carlo simulations.

By default, mtefe reports analytical standard errors for the coefficients, treatment-effect parameters, and MTEs. These ignore the uncertainty in the estimation of the propensity scores by treating these as fixed in the second stage of the estimation, as well as the means of X and the treatment-effect parameter weights that are used for estimating treatment effects. For matching, Abadie and Imbens (2016) show that ignoring the uncertainty in the propensity score increases the standard errors for the ATE, while the impact on other treatment-effect parameters is ambiguous. To the best of my knowledge, we do not know how this omission affects the standard errors in MTE applications, so careful researchers should bootstrap the standard errors using the bootreps() option, which reestimates the propensity scores, the mean of X, and the treatment-effect parameter weights for each bootstrap repetition.

3.3 Example output from mtefe

To illustrate the use of mtefe and for the Monte Carlo simulations in appendix B, let's imagine the following problem: We are interested in the monetary returns to a college education. Unfortunately, college education is endogenous, for example, because higher-ability people do better in the labor market and have more education. Furthermore, people choose education based partly on knowledge about their own gains from college. Thus, the problem exhibits selection on both levels and gains.

Consider distance to college, thought to be a cost shifter for college education. Although traditional in the returns-to-education literature, there are reasons to doubt the exclusion restriction for this instrument (Carneiro and Heckman 2002). To fix ideas, suppose that the average distance to college in a district, a measure of rurality, is correlated with average outcomes in the labor market. For example, more rural labor markets could provide worse average employment opportunities, particularly for college jobs. However, if these differences work at the district level, the instrument is valid conditional on district fixed effects that control for the average variation in distance to college. Thus, the remaining within-district variation in distance to college is a valid instrument for college attendance.

To implement this thought experiment, I draw the average labor market quality for college and noncollege jobs and the average distance to college for each district from a joint normal distribution. The observed distance to college is equal to this district-level average plus some random normal variation, so the within-district variation in distance is a valid instrument. In addition, I generate the error terms U0, U1, and V from either a joint normal or a polynomial error structure, where the three errors are correlated and thus generate selection on both levels and gains. Controls X include experience, uniformly distributed on (0, 30), and its square. These affect both the selection equation and the outcomes in the two states. The full data-generating process is described in appendix B.

. set seed 1234567
. mtefe_gendata, obs(10000) districts(10)
. mtefe lwage exp exp2 i.district (col=distCol)

Parametric normal MTE model
Observations      : 10000
Treatment model   : Probit
Estimation method : Local IV

       lwage        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

beta0
         exp     .0278374     .006707     4.15   0.000     .0146903    .0409844
        exp2    -.0004643    .0002097    -2.21   0.027    -.0008753   -.0000534
    district
           2    -.3713221     .061335    -6.05   0.000    -.4915511   -.2510932
           3    -.0308218      .06261    -0.49   0.623      -.15355    .0919065
           4     .6884756    .0866584     7.94   0.000     .5186077    .8583435
           5     .0262556    .0642906     0.41   0.683    -.0997671    .1522782
           6     .4946678    .0633903     7.80   0.000       .37041    .6189255
           7    -.2666547    .0613036    -4.35   0.000     -.386822   -.1464873
           8     .2437004    .0580991     4.19   0.000     .1298145    .3575864
           9      .115046    .0598625     1.92   0.055    -.0022965    .2323885
          10      .027915    .0602922     0.46   0.643    -.0902699    .1460999
       _cons     3.105918    .0659863    47.07   0.000     2.976572    3.235265

beta1-beta0
         exp    -.0231842    .0100974    -2.30   0.022    -.0429771   -.0033914
        exp2     .0006458    .0003238     1.99   0.046     .0000111    .0012804
    district
           2     .0529143    .1010226     0.52   0.600    -.1451104     .250939
           3     .0531692    .1038045     0.51   0.609    -.1503086     .256647
           4    -.0521345    .1184164    -0.44   0.660    -.2842546    .1799855
           5     .1319906    .1028638     1.28   0.199    -.0696432    .3336243
           6     .0141403    .1050543     0.13   0.893    -.1917872    .2200679
           7     .0021696    .1019711     0.02   0.983    -.1977144    .2020536
           8    -.4022728    .1008029    -3.99   0.000    -.5998667   -.2046788
           9    -.2616856    .1026346    -2.55   0.011    -.4628702   -.0605009
          10    -.2654738    .1029734    -2.58   0.010    -.4673225    -.063625
       _cons     .5889556    .0967044     6.09   0.000     .3993954    .7785157

k
       mills    -.5492868    .0595307    -9.23   0.000     -.665979   -.4325946

effects
         ate     .3627266    .0239355    15.15   0.000     .3158082     .409645
         att     .6166828    .0388462    15.87   0.000     .5405364    .6928291
        atut     .0735471     .036977     1.99   0.047     .0010647    .1460295
        late      .340083    .0238033    14.29   0.000     .2934237    .3867424
      mprte1     .3567228    .0249318    14.31   0.000     .3078514    .4055941
      mprte2     .3133149    .0239501    13.08   0.000     .2663677     .360262
      mprte3    -.0568992    .0483759    -1.18   0.240    -.1517257    .0379273

Test of observable heterogeneity, p-value 0.0000
Test of essential heterogeneity, p-value  0.0000

Note: Analytical standard errors ignore the facts that the propensity score, the mean of X, and the treatment effect parameter weights are estimated objects when calculating standard errors. Consider using bootreps() to bootstrap the standard errors.

Using the data-generating process outlined above, the code uses the commands mtefe and mtefe gendata to generate data with a normal error structure and to fit the joint normal MTE model using local IV. mtefe first reports the estimated coefficients β0, β1 − β0, and ρ1 − ρ0; then it reports the treatment-effect parameters as shown in the output. Lastly, mtefe reports the p-values for two statistical tests: a joint test of the β1 − β0, which can be interpreted as a test of whether the treatment effect differs across X, and a test of essential heterogeneity. The latter is a joint test of all coefficients in k(u).7

7. Or, in the case of semiparametric models, a test of whether all MTEs are the same.

Based on the output from mtefe, we can straightforwardly evaluate the impact of covariates. Average differences in outcomes across covariates can be interpreted directly from the β0, just like a regular control variable. For instance, the coefficient for experience in the first panel of the output table indicates that one more year of experience translates into approximately 2.8% higher wages, although the effect is nonlinear, and we cannot say that it is the extra experience that causes the higher wages without strong exogeneity assumptions on X.

Similarly, the β1 − β0 can be interpreted as differences in treatment effects across covariate values, just like an interaction between treatment status and a covariate in an ordinary least-squares regression. The coefficient on experience in the second panel of the output table thus indicates that a person with one more year of experience has 2.3% lower gains from college, but again, we cannot give this a causal interpretation, and we must account for the nonlinear effect.

In addition, mtefe plots the MTE curve with associated confidence intervals, as well as the density of the estimated propensity scores separately for the treated and untreated individuals, so that the researcher can evaluate the common support. Examples of these plots are found in figures 2a and 2b. We see that the estimated MTE in this example is downward sloping, with relatively high treatment effects above 1 at the beginning of the UD distribution, eventually declining to negative effects below −0.5 at the right end of the distribution. This implies an ATE of around 0.36, and the downward sloping pattern is consistent with positive selection on unobservable gains, as predicted by a simple Roy model.

Alternatively, we can use the polynomial MTE model with the separate estimation approach or the semiparametric model to relax the joint normal assumption. mtefe fits these models if you specify the polynomial(2) or the semiparametric option,8 respectively. When one fits MTE models, it is useful to plot several MTE curves in the same figure. This can be done using mtefeplot, specifying the names of the saved or stored MTE estimates. An example of this plot is provided in figure 2c for the normal, polynomial, and semiparametric MTE models. MTEs are downward sloping and relatively similar in all three specifications.

As discussed in section 2.4, mtefe also estimates the treatment parameter weights and the parameters themselves. One way of illustrating why, for example, the LATE9 differs from the ATE is by depicting the weights that LATEs put on different parts of the X and UD distributions. This can be done using mtefeplot with the late option. The resulting plot is found in figure 2e. Because the MTE curve at the average of X and the MTE curve for compliers practically overlap, it does not seem like the people induced to enroll because of the instrument have different values of X—this is hardly surprising given the data-generating process. Instead, the weight distribution reveals

8. For semiparametric models, the option gridpoints() will greatly improve computational speed by performing the first local linear regression at # points equally spaced over the support of p rather than at each and every observed value of p in the population.
9. The term is used somewhat ambiguously here to refer to the linear IV estimate as a weighted average over all possible LATEs from all combinations of two values of the instrument, in line with much of the literature using continuous instruments.

that compliers have a much higher probability of having unobserved resistance in the middle part of the distribution. These people have MTEs slightly below the average, so when taking a weighted average over the MTE curve for compliers, we find that the LATE is lower than the ATE. The compliers to the instrument are people with slightly below-average gains from college.

Furthermore, we can exploit the fact that, when using the separate estimation procedure, we identify both k1(u) and k0(u). After first fitting the polynomial model using the separate option to use the separate approach rather than local IV, we can plot the resulting potential outcomes using the separate option of mtefeplot. Because the MTE is simply the difference between Y1 and Y0, we can investigate whether the downward sloping trend in the MTE is generated by an upward sloping Y0, a downward sloping Y1, or a combination. From figure 2d, we see that Y0 is relatively flat, while Y1 is clearly downward sloping. This indicates that people who have low resistance to treatment do much better than their high-resistance counterparts with a college degree, but do relatively similarly without one. Therefore, these people have higher effects of treatment.

As a last example, consider a hypothetical policy that mandates a maximum distance of 40 miles to the closest college. To estimate the effect of such a policy, we predict the propensity scores from the probit model using the adjusted distance to college:

. qui probit col distCol exp exp2 i.district
. generate temp=distCol
. replace distCol=40 if distCol>40
. predict double p_policy
. replace distCol=temp
. mtefe lwage exp exp2 i.district (col=distCol), pol(2) prte(p_policy)
  (output omitted)

The results of this exercise are depicted in figure 2f. First, note how the MTE curve for the policy compliers practically overlaps the MTE curve at the mean; policy compliers do not seem to have X-values that give them different gains from college than the average. Next, notice the distribution of weights; policy compliers come exclusively from the lower part of the UD distribution. This is not surprising, given that the people most affected by the reform are people with low propensity scores (driven by high distance to college) before the reform. To be affected by the reform, these people must be untreated under the baseline and thus have UD's above those low propensity scores. Compared with the people with higher propensity scores, the average UD for these people is relatively low, generating high weights on the lower part of the UD distribution. Thus, the potential policy compliers have low UD's and, consequently, high treatment effects. Therefore, the expected gain from the reform is larger than the ATE.

. mtefe lwage exp exp2 i.district (col=distCol)
. estimates store normal
. mtefe lwage exp exp2 i.district (col=distCol), polynomial(2) separate
. estimates store polynomial
. mtefe lwage exp exp2 i.district (col=distCol), semiparametric gridpoints(100)
. estimates store semipar
. mtefeplot normal polynomial semipar, memory
>     names("Normal" "Polynomial" "Semiparametric")
. mtefeplot polynomial, separate memory
. mtefeplot polynomial, late memory

(a) Common support plot, probit          (b) Estimated MTE curve, normal
(c) MTE estimates, different models      (d) MTEs and potential outcomes, polynomial
(e) MTE for compliers with LATE weights  (f) Estimated effects of a hypothetical policy

Figure 2. Example figures from mtefe and mtefeplot

Lastly, we might be interested in the pattern of selection on observable gains. Is it the case that covariates that positively impact treatment choices also positively impact the gains from treatment? It is if γ × (β1 − β0) > 0, which happens either if both coefficients are positive or if both are negative. As an example, there is positive selection on the covariate experience in the example reported in figure 2; more experience is associated both with less college (not shown) and with lower treatment effects.

3.4 Postestimation using mtefe

Because mtefe saves results in e(), estimates are readily available for postestimation. As an example, expected treatment effects can be predicted for every individual with characteristics X, D, and p, using that

E(Y1 − Y0 | X, D, p) = x(β1 − β0) + D E(U1 − U0 | UD ≤ p) + (1 − D) E(U1 − U0 | UD > p)
                     = x(β1 − β0) + {(D − p)/[p(1 − p)]} K(p)

where I use the fact that both U1 and U0 are normalized to have mean zero. Because all these objects are estimated by mtefe, we can predict treatment effects for each individual. Use the options savekp and savepropensity() to save the propensity scores and the relevant variables of the K(p) function, then estimate expected treatment effects from the expression above. The resulting variable contains the expected treatment effect given treatment status, propensity scores, and X for each individual. Summarizing this predicted treatment effect among the treated and untreated separately closely matches the ATT and the ATUT, respectively.
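As a schematic illustration of this prediction (not code produced by mtefe), suppose the propensity score has been saved with savepropensity(phat) and that the user has constructed two further variables from the saved results: xb_diff, holding the linear index x(β1 − β0), and Kp, holding the estimated K(p̂) term for the chosen model. Both names are placeholders rather than variables created by mtefe:

. generate double te_hat = xb_diff + (col - phat)/(phat*(1 - phat))*Kp
. mean te_hat if col==1    // compare with the ATT
. mean te_hat if col==0    // compare with the ATUT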

This procedure highlights the relationship between selection models and MTE—but note how both the ATT and ATUT could be recovered without specifying Kj(p), only the expected difference K(p). When we use the separate approach, potential outcomes can be predicted directly because both K(p) and K0(p) are estimated:

E(Y1 | X, D, p) = xβ1 + {(D − p)/(1 − p)} K1(p)
E(Y0 | X, D, p) = xβ0 + {(p − D)/p} K0(p)

where K1(p) = {K(p) − (1 − p)K0(p)}/p.

These expressions can be calculated by predicting Kj(p) and constructing the expected value of the outcome for each individual from the expressions above. The difference between these predicted outcomes closely matches the predicted treatment effect from above and the treatment-effect parameters calculated by mtefe as weighted averages over the appropriate MTE curves.

4 Conclusion

MTEs are increasingly becoming a part of the toolbox of the applied econometrician when the problem at hand exhibits selection on both levels and unobserved gains. In contrast to traditional linear IV analysis, MTEs uncover the distribution of treatment effects rather than a LATE, which is often of little interest. In practice, this typically comes at the cost of stricter assumptions.

In this article, I outlined the MTE framework, its relationship to traditional selection models, and three different methods for fitting these models. The documented package, mtefe, contains many improvements over existing packages. Among these are support for fixed effects; estimation using local IV, the separate approach, and maximum likelihood; a larger number of parametric and semiparametric MTE models; support for weights; speed improvements when running semiparametric models; more appropriate bootstrap inference; and improved graphical output. In addition, mtefe calculates common treatment-effect parameters such as LATEs, the ATT and ATUT, and PRTEs for a user-specified shift in propensity scores to allow the user to exploit the potential of the MTE framework in understanding treatment-effect heterogeneity.

The Monte Carlo simulations detailed in appendix B show that MTE estimation can be sensitive to functional form specifications and that wrong functional form assumptions may result in excessive rejection rates. Note that the two data-generating processes used and the functional form choices made in these simulations are very restrictive, and it is perhaps not surprising that a second-order polynomial cannot approximate a normal very well, or the other way around. Nonetheless, they illustrate two things: First, applied researchers should base the choice of functional form on detailed knowledge about the case at hand—what might plausibly constitute the unobservable dimension and economic arguments for how these factors might affect the outcome. Second, researchers should strive to probe the robustness of their results to the functional form choices, using less restrictive models to guide the choice of specification.

5 Acknowledgments

The author wishes to thank Edwin Leuven, Christian Brinch, Thomas Cornelissen, Martin Huber, Katrine Vetlesen Løken, Simen Markussen, Scott Brave, Thomas Walstrum, Yiannis Spanos, Jonathan Norris, and Magne Mogstad for advice, comments, help, and testing at various stages of this project. Brave and Walstrum also deserve many thanks for the package margte, which clearly inspired mtefe. All errors remain my own. While carrying out this research, I have been associated with the Centre for the Study of Equality, Social Organization, and Performance at the Department of Economics at the University of Oslo. The Centre for the Study of Equality, Social Organization, and Performance is supported by the Research Council of Norway through its Centres of Excellence funding scheme, project number 179552.

6 References

Abadie, A., and G. W. Imbens. 2016. Matching on the estimated propensity score. Econometrica 84: 781–807.

Angrist, J. D., and G. W. Imbens. 1995. Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association 90: 431–442.

Björklund, A., and R. Moffitt. 1987. The estimation of wage gains and welfare gains in self-selection models. Review of Economics and Statistics 69: 42–49.

Brave, S., and T. Walstrum. 2014. Estimating marginal treatment effects using parametric and semiparametric methods. Stata Journal 14: 191–217.

Brinch, C. N., M. Mogstad, and M. Wiswall. 2017. Beyond LATE with a discrete instrument. Journal of Political Economy 125: 985–1039.

Carneiro, P., and J. J. Heckman. 2002. The evidence on credit constraints in post-secondary schooling. Economic Journal 112: 705–734.

Carneiro, P., J. J. Heckman, and E. J. Vytlacil. 2010. Evaluating marginal policy changes and the average effect of treatment for individuals at the margin. Econometrica 78: 377–394.

. 2011. Estimating marginal returns to education. American Economic Review 101: 2754–2781.

Carneiro, P., and S. Lee. 2009. Estimating distributions of potential outcomes using local instrumental variables with an application to changes in college enrollment and wage inequality. Journal of Econometrics 149: 191–208.

Carneiro, P., M. Lokshin, and N. Umapathi. 2017. Average and marginal returns to upper secondary schooling in Indonesia. Journal of Applied Econometrics 32: 16–36.

Cornelissen, T., C. Dustmann, A. Raute, and U. Schönberg. 2016. From LATE to MTE: Alternative methods for the evaluation of policy interventions. Labour Economics 41: 47–60.

. Forthcoming. Who benefits from universal child care? Estimating marginal returns to early child care attendance. Journal of Political Economy .

Felfe, C., and R. Lalive. 2014. Does early child care help or hurt children's development? Discussion Paper No. 8484, Institute for the Study of Labor (IZA). http://ftp.iza.org/dp8484.pdf.

Heckman, J. J., S. Urzua, and E. J. Vytlacil. 2006a. Web supplement to understanding instrumental variables in models with essential heterogeneity: Estimation of treatment effects under essential heterogeneity. http://jenni.uchicago.edu/underiv/documentation 2006 03 20.pdf.

. 2006b. Understanding instrumental variables in models with essential heterogeneity. Review of Economics and Statistics 88: 389–432.

Heckman, J. J., and E. J. Vytlacil. 1999. Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences 96: 4730–4734.

. 2001. Local instrumental variables. In Nonlinear Statistical Modeling: Proceedings of the Thirteenth International Symposium in Economic Theory and Econometrics: Essays in Honor of Takeshi Amemiya, ed. C. Hsiao, K. Morimune, and J. L. Powell, 1–46. New York: Cambridge University Press.

. 2005. Structural equations, treatment effects, and econometric policy evaluation. Econometrica 73: 669–738.

. 2007. Econometric evaluation of social programs, part II: Using the marginal treatment effect to organize alternative econometric estimators to evaluate social programs, and to forecast their effects in new environments. In Handbook of Econometrics, vol. 6B, ed. J. J. Heckman and E. E. Leamer, 4875–5143. Amsterdam: Elsevier.

Imbens, G. W., and J. D. Angrist. 1994. Identification and estimation of local average treatment effects. Econometrica 62: 467–475.

Klein, R. W., and R. H. Spady. 1993. An efficient semiparametric estimator for binary response models. Econometrica 61: 387–421.

Lokshin, M., and Z. Sajaia. 2004. Maximum likelihood estimation of endogenous switching regression models. Stata Journal 4: 282–289.

Maestas, N., K. J. Mullen, and A. Strand. 2013. Does disability insurance receipt discourage work? Using examiner assignment to estimate causal effects of SSDI receipt. American Economic Review 103: 1797–1829.

Matzkin, R. L. 2007. Nonparametric identification. In Handbook of Econometrics, vol. 6B, ed. J. J. Heckman and E. E. Leamer, 5307–5368. Amsterdam: Elsevier.

Mogstad, M., and A. Torgovitsky. 2017. Identification and extrapolation with instrumental variables. Unpublished manuscript.

Robinson, P. M. 1988. Root-n-consistent semiparametric regression. Econometrica 56: 931–954.

Vytlacil, E. 2002. Independence, monotonicity, and latent index models: An equivalence result. Econometrica 70: 331–341.

About the author

Martin Eckhoff Andresen is a researcher at Statistics Norway and received his PhD from the University of Oslo in 2017. His research interests are applied econometrics, economics of education, and labor economics.

A Estimation algorithms

The mtefe package is inspired by Heckman, Urzua, and Vytlacil (2006a,b) and Brave and Walstrum (2014). The following sections describe the main steps of the estimation using this program:

A.1 First stage and common support

1. Estimate the first stage using a linear probability model, a probit model, or a logit model of D on Z. In principle, this step could be made more flexible by using, for example, a semiparametric single-index model such as that of Klein and Spady (1993), which is not currently implemented in mtefe.

2. Predict the propensity score, p̂.

   a. If using the linear probability model, manually adjust propensity scores below 0 to 0 and above 1 to 1.

3. Construct the common support matrix:

   a. If trimming the sample using trimsupport():

      i. Estimate the density of the propensity scores separately in the two samples.
      ii. Remove points of support with the lowest densities until the specified share of each sample has been removed.
      iii. Construct the common support as the points of overlapping support between the two samples.
      iv. Remove observations with estimated propensity scores outside the common support from the estimation sample.
      v. Fit the baseline propensity score model on the trimmed sample again.

   b. If not trimming:

      i. If using a parametric model, use the full support in 0.01 intervals from 0.01 to 0.99.
      ii. If using a semiparametric model, use all points of overlapping support in the treated and untreated samples, from 0.01 to 0.99 in intervals of 0.01.

4. Plot the distribution of propensity scores in the treated and untreated samples, including the trimming limits if used, to visualize the common support. (Steps 1, 2, and 4 are sketched in Stata after this list.)

5. Compute weights for the ATT, ATUT, LATE, MPRTEs, and, if specified, PRTE parameter weights as described in section 2.4.
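As a rough illustration of steps 1, 2, and 4 (this is not the internal mtefe code), a probit first stage and a common-support plot for the example data of section 3.3 could be produced as follows:

. probit col distCol exp exp2 i.district
. predict double phat, pr
. twoway (kdensity phat if col==1) (kdensity phat if col==0), legend(order(1 "Treated" 2 "Untreated")) xtitle("Propensity score")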

A.2 Local IV estimation

1. Construct K(p), which depends on your choice of parametric or semiparametric model—see tables 1 and 2.

2. Estimate the conditional expectation of Y given X and p as

Y = Xβ0 + X(β1 − β0)p + K(p) + ε

3. From these estimates, recover k(u) = K′(u).

4. Construct the MTE as the derivative of the conditional expectation:

   MTE(x, u) = x(β1 − β0) + k(u)

An illustrative Stata sketch of these steps follows.
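The following stripped-down sketch mimics these steps for a degree-1 k(u), so that K(p) is quadratic, on the example data. mtefe performs the equivalent computations internally, and the linear p term here absorbs the constant of x(β1 − β0) together with the first coefficient of K(p):

. probit col distCol exp exp2 i.district
. predict double phat, pr
. regress lwage c.exp c.exp2 i.district c.phat#(c.exp c.exp2 i.district) c.phat c.phat#c.phat
. * k(u) = dK(u)/du evaluated at u: _b[phat] + 2*_b[c.phat#c.phat]*u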

A.3 Separate approach

1. Construct K1(p) and K0(p), depending on your specification; see tables 1 and 2.

2. Estimate the conditional mean of Y from the stacked regression:

Y = Xβ0 + D X(β1 − β0) + KD(p) + ε,   where KD(p) = K1(p) if D = 1 and KD(p) = K0(p) if D = 0

3. From these estimates, recover the kj (u) functions.

4. Construct the estimates of the potential outcomes and the MTE as

   Ŷj(x, u) = xβ̂j + k̂j(u)
   MTE(x, u) = Ŷ1(x, u) − Ŷ0(x, u)

A.4 Maximum likelihood estimation

Relevant only for the joint normal model, the mlikelihood option implements the maximum likelihood estimator described in Lokshin and Sajaia (2004). The individual log-likelihood contribution is

ℓi = Di [ ln{F(η1i)} + ln{f(U1i/σ1)/σ1} ] + (1 − Di) [ ln{F(−η0i)} + ln{f(U0i/σ0)/σ0} ]

where ηji = (γZi + ρj Uji/σj) / sqrt(1 − ρj²)

and f is the standard normal density (F is the corresponding distribution function). This log likelihood can be maximized to give the coefficients γ, β0, β1, σ0, σ1, ρ0, and ρ1. These parameter estimates can be used to construct the MTE and treatment-effect parameters as detailed in table 1.

A.5 Standard errors

When using parametric models, mtefe by default calculates standard errors for the MTEs and the treatment-effect parameters from the estimated variance of β1 − β0 and the parameters of k(u). This ignores the fact that the propensity score and the means of X themselves are estimated objects. Little is known of the effect of this omission, although Abadie and Imbens (2016) show that ignoring the fact that the propensity scores are estimated when matching increases the standard errors of the ATE, but the effect on other parameters is ambiguous. Alternatively, and preferably, standard errors should be estimated using the bootstrap. Implement this using the bootreps() option, or alternatively, use vce(cluster clustvar) if cluster bootstrap is appropriate. This procedure reestimates the propensity score, the mean of X, and the treatment-effect parameter weights for each bootstrap replication and so accounts for the uncertainty from the first stage unless the option norepeat1 is specified.

A.6 IV weights

To estimate the IV weights, we need measures of the impact of the instrument on each individual conditional on X. We are looking at partitioning the linear first-stage regression

D = XβD + γZ− + ε

1. Remove the impact of covariates on D and Z− by regressing them separately on X and saving the residuals, D̃ and Z̃−.

2. Regress the residualized treatment D̃ on Z̃− (without a constant) to recover the first-stage estimate of the impact of the instrument on treatment conditional on controls, by the Frisch–Waugh–Lovell theorem.

3. The predicted values from this regression, υ̂, contain the individual impact of the instrument on treatment conditional on controls. Note how we can recover the first-stage estimate from cov(D, υ̂), the reduced-form estimate from cov(Y, υ̂), and the traditional 2SLS estimate from cov(Y, υ̂)/cov(D, υ̂). (Steps 1–3 are sketched in Stata at the end of this section.)

4. Compute the weights

   κ̂LATE,i = (υ̂i − ῡ)(Di − D̄) / cov(D, υ̂)

   where ῡ and D̄ are the sample means of υ̂ and D.

Note how κ̂LATE weights treated individuals with positive υ̂ and untreated individuals with negative υ̂ more—these are individuals who are more strongly affected by the instruments and so are more likely to be compliers.

5. Compute

   ω̂LATE(us) = {E(υ̂ | p > us) − E(υ̂)} × P(p > us) / cov(D, υ̂)

by replacing expectations and probabilities with sample analogs. This puts more weight on the values of UD above which people have higher υ̂—precisely the people who are more affected by the instruments and so are more likely to be compliers.

This procedure recovers the weights from Cornelissen et al. (2016).
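To make the residualization in steps 1–3 and the κ weights in step 4 concrete, the following sketch uses only standard Stata commands on the example data; it is illustrative and not the code mtefe runs internally:

. regress col exp exp2 i.district
. predict double Dtil, residuals
. regress distCol exp exp2 i.district
. predict double Ztil, residuals
. regress Dtil Ztil, noconstant
. predict double vhat, xb
. correlate lwage vhat, covariance
. matrix C = r(C)
. scalar cov_Yv = C[1,2]
. correlate col vhat, covariance
. matrix C = r(C)
. scalar cov_Dv = C[1,2]
. display "2SLS estimate = " cov_Yv/cov_Dv
. summarize vhat, meanonly
. scalar vbar = r(mean)
. summarize col, meanonly
. scalar Dbar = r(mean)
. generate double kappa = (vhat - vbar)*(col - Dbar)/cov_Dv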

B Monte Carlo simulations

To investigate the properties of the different MTE estimators and models implemented by the mtefe package, I simulate data from a data-generating process based on the example described above and detailed in section B.1. For each of the data-generating processes (joint normal or polynomial error structure), I draw 10,000 observations for each of 500 repetitions, randomly allocating each observation to one of 10 districts. The first-stage model that determines selection into treatment is either a linear probability or probit model. For each repetition, I calculate the difference between the estimated and the true parameter, the estimated analytic and bootstrapped standard errors, and the rejection rates for the true parameter. I fit all models using the separate approach and local IVs to compare any differences.
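For readers who want to replicate a simplified version of these experiments, one repetition can be wrapped in a program and passed to simulate. The snippet below is a sketch under the assumption that the treatment-effect parameters sit in e(b) under the effects equation, as the coefficient table in section 3.3 suggests; this should be verified against the help file before use:

program define onerep, rclass
    clear
    mtefe_gendata, obs(10000) districts(10)
    mtefe lwage exp exp2 i.district (col=distCol), polynomial(2)
    return scalar ate = _b[effects:ate]
end
simulate ate=r(ate), reps(100) seed(12345): onerep
summarize ate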

B.1 Data-generating process

This algorithm determines the data-generating process implemented by mtefe gendata that is part of the mtefe package and used in the Monte Carlo simulations and examples in this article. This data-generating process generates selection on both levels and gains as well as a continuous instrument that is valid only conditional on fixed effects.

1. In a sample of N individuals, randomly allocate each to one of G districts with equal probability.

2. Draw average labor market quality Πj in the college and noncollege labor markets and average distance to college in each district, AvgDist, from a joint normal distribution:

(Π0, Π1, AvgDist) ∼ N(0, Σf),   with Σf (lower triangle) =
     0.1
     0.05   0.1
    −0.05  −0.1   10

In the simulations, I draw these once rather than repeat for each simulation to have a true value to compare the estimated coefficients with, but by default, mtefe gendata draws these for each replication unless you specify the parameters. M. E. Andresen 145

3. Generate individual distance to college distCol as AvgDist plus N(40, 10).

4. Draw exp from U(0, 30) and construct exp2 as its square.

5. Construct the error terms U0, U1, and V from one of two error structures (sketched in Stata after this list):

Normal: Draw randomly from a joint normal model:

(U0, U1, V) ∼ N(0, Σ),   with Σ (lower triangle) =
     0.5
     0.3   0.5
    −0.1  −0.5   1

Polynomial: Generate Uj as second-order polynomials of UD with mean zero:

a. Draw UD ∼ U(0, 1) and generate V = F−1(UD).
b. Generate Uj = Σ_{l=1}^{2} πjl [UD^l − 1/(l + 1)] + ε, where ε ∼ N(0, 0.2):

i. using π11 = 0.5, π12 = −0.1
ii. and π01 = 2, π02 = −1.

6. Construct the potential outcomes as

Yj = Xβj + Πj + Uj

where X = (exp, exp2, 1), β0 = (0.025, −0.0004, 3.2), and β1 = (0.01, 0, 3.6).

7. Determine treatment using one of two binary choice models:

Probit

D = 1[γZ > V]

Z = (distCol, exp, exp2, 1),   γ = (−0.125, −0.08, 0.002, 5.59)

LPM

D = 1[γZ > UD]

Z = (distCol, exp, exp2, 1),   γ = (−0.015, −0.01, 0.00025, 1.17375)

8. Determine the observed outcome lwage as DY1 + (1 − D)Y0.
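To make step 5 concrete, the sketch below draws both error structures with standard Stata commands. It is illustrative rather than the mtefe_gendata implementation; it reads N(0, 0.2) as a standard deviation of 0.2 because rnormal() takes the standard deviation as its second argument, and the polynomial draws are suffixed with p only so that both structures fit in one sketch:

. set obs 10000
. matrix Sigma = (0.5, 0.3, -0.1 \ 0.3, 0.5, -0.5 \ -0.1, -0.5, 1)
. drawnorm U0 U1 V, cov(Sigma) double
. generate double UD = runiform()
. generate double Vp = invnormal(UD)
. generate double U1p = 0.5*(UD - 1/2) - 0.1*(UD^2 - 1/3) + rnormal(0, 0.2)
. generate double U0p = 2*(UD - 1/2) - (UD^2 - 1/3) + rnormal(0, 0.2)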

B.2 A well-specified baseline

To establish a baseline, I first investigate the case where both the first stage and the polynomial model for k(u) are correctly specified. The results from these exercises are displayed in table 4. Unsurprisingly, and in line with theory and Monte Carlo evidence from Brave and Walstrum (2014), mtefe does a good job at estimating both the coefficients of the outcome equations and the MTEs when both the first stage and the parametric model are correctly specified. The differences between the estimated and true coefficients center at 0 and have low standard deviations. The estimated analytical standard errors come close to the standard deviations of the coefficients. Furthermore, even though the analytical standard errors ignore the fact that the propensity score itself is an estimated object, they fall very close to the bootstrapped standard errors that account for this by reestimating the propensity scores for each bootstrap replication. Note that bootstrapped standard errors are not always higher than analytic standard errors. Rejection rates vary somewhat from parameter to parameter but lie around 0.05, as they should.

Table 4. Monte Carlo estimates: A well-specified baseline [table not reproduced in this extraction; Monte Carlo experiments as described in appendix B, with 500 replications, each with a sample size of 10,000 in 10 districts; first stage is probit; true values from the normal (top) and polynomial (bottom) data-generating processes; standard deviations in parentheses; bootstrapped standard errors are underlined, otherwise analytic]

Comparing the results from local IV estimates with the results from the separate approach reveals one important difference; across the model specifications, estimates seem to be more precise using the separate approach. Both the standard deviations of the coefficients and the estimated standard errors are lower than the estimates from local IV, as illustrated for a range of parameters in figure 3. This is surprising, given that more parameters are estimated when using the separate approach than with local IV. I have no good explanation for this result, but it warrants more research.

(a) Normal model    (b) Polynomial model

Note: Kernel density plots of estimated coefficients using local IV (solid) and the separate approach (dashed) for selected parameters of the MTE models. Well-specified baseline model. 500 Monte Carlo simulations from the normal (a) and polynomial (b) models.

Figure 3. Efficiency of the two estimation methods

B.3 Misspecification of k(u)

The above experiment focused on the case where both the first-stage model and the parametric model for k(u) were correctly specified. To illustrate how the estimators do when one of the two is misspecified, I keep the assumption that the first-stage model is correctly specified but use a misspecified MTE model. In table 5, I generate data from the normal model and estimate the MTE using a second-order polynomial model, and vice versa. Both models do a relatively good job at estimating the coefficients of the outcome equations, and rejection rates for the true β0 and β1 − β0 coefficients are close to 0.05, as they should be. Note, however, that the bootstrap now outperforms the analytic standard errors, yielding rejection rates closer to 5% than the analytic standard errors in most cases.

The MTE estimates, and in turn the ATE, are severely overrejected. Bootstrapping the standard errors and accounting for the uncertainty in the estimates of p produce lower rejection rates, but these are still above 5%. The problem seems to be driven by inconsistent estimates rather than too-low standard errors. The result is large rates of overrejection when the functional form is misspecified; average rejection rates are far above 0.05 for a 5% test. The polynomial model seems to do a better job at approximating the normal model than the other way around, as is apparent from the lower rejection rates. This illustrates the importance of using the correct parametric model and working with flexible specifications. In this case, a higher-order polynomial or a polynomial with splines could have generated a better fit to the underlying model.

Table 5. Monte Carlo estimates: Misspecification of k(u) [table not reproduced in this extraction; Monte Carlo experiments as described in appendix B, with 500 replications, each with a sample size of 10,000 in 10 districts; standard deviations in parentheses; bootstrapped standard errors are underlined, otherwise analytic; starred coefficients reported for k(u) rather than differences]

B.4 Misspecification of P(Z)

Next, I move on to investigate the specification of the first stage. To focus on the problem of misspecification of the first stage, assume that the parametric model for k(u) is correct and second-order polynomial. Instead, I let the functional form of the selection equation be either linear or probit when generating data, and I use the other functional form when fitting the model. The results from this exercise are provided in table 6.Theβ parameters are some- what consistently estimated with rejection rates around 0.05, but there is larger variation across parameters than for the previous experiments. The MTEs, ATE, and parameters of the k(u) function are severely overrejected. Again, this does not seem to be driven by underestimation of standard errors if we compare them with the standard deviation of the estimated coefficients. Rather, it seems to be the coefficients themselves that do not center at 0. As an alternative misspecification, consider the case where the functional form is cor- rect but where the experience controls are not included in either the first- or the second- stage estimation. This should affect the precision of the coefficients in the propensity score model, but because the omitted regressors are orthogonal to everything else, it should not bias the estimates of other parameters except the constant. The results from this exercise are found in table 7, and fortunately, we see that the results are not sensitive to this sort of omission of variables as long as the exclusion restriction holds. In conclusion, specification of the propensity score generally receives relatively little attention from articles, but it turns out to be important. Careful researchers should evaluate the robustness of their MTE models using various and flexible first-stage models. M. E. Andresen 155 r 0.060 0.058 0.028 0.044 0.050 0.084 0.044 0.048 0.024 0.048 0.046 0.070 LPM Probit Separate diff s.e. 0.000 0.014 0.100 0.001 0.020 0.030 0.000 0.020 0.044 0.000 0.000 0.048 0.000 0.014 0.050 0.021 0.027 0.172 0.000 0.000 0.104 − − (0.015) 0.016 (0.016) 0.016 (0.016) 0.016 (0.032) 0.033 (0.018) 0.020 (0.020) 0.020 (0.019) 0.020 (0.040) 0.043 − − − − − Continued on next page r 0.052 0.040 0.050 0.130 0.056 0.050 0.052 0.100 ) ( Z LPM P Probit Local IV diff s.e. 0.003 0.091 0.008 0.000 0.000 0.052 0.002 0.047 0.028 0.007 0.092 0.094 0.000 0.020 0.048 0.067 0.070 0.142 0.000 0.005 0.058 0.000 0.001 0.100 (0.005) 0.005 0.042(0.000) 0.000 0.050(0.041) (0.002) 0.041 0.002 (0.050) (0.000) 0.052 0.000 (0.037) 0.038 (0.068) 0.071 (0.008) 0.009 0.058(0.000) 0.000 0.048(0.078) (0.002) 0.077 0.002 (0.105) (0.000) 0.110 0.000 (0.071) 0.071 (0.114) 0.120 − − − − − − r 0.050 0.050 0.068 0.060 0.046 0.060 0.040 0.040 0.056 0.052 0.052 0.074 LPM Probit Separate diff s.e. 0.000 0.016 0.084 0.001 0.046 0.016 0.001 0.023 0.062 0.001 0.016 0.110 0.000 0.023 0.048 0.010 0.017 0.116 0.000 0.015 0.126 0.004 0.045 0.080 0.000 0.014 0.106 0.000 0.000 0.088 0.000 0.000 0.056 0.000 0.002 0.038 0.001 0.009 0.056 0.000 0.002 0.044 − − − − − (0.002) 0.002 (0.000) 0.000 (0.019) 0.019 (0.019) 0.019 (0.019) 0.019 (0.020) 0.021 (0.002) 0.002 (0.000) 0.000 (0.023) 0.023 (0.024) 0.023 (0.024) 0.023 (0.024) 0.025 − − − r 0.070 0.070 0.128 0.664 0.044 0.050 0.094 0.076 0.176 0.742 0.050 0.048 LPM Probit Local IV Table 6. Monte Carlo estimates: Misspecification of diff s.e. 
0.007 0.050 0.036 0.049 0.050 0.148 0.001 0.023 0.054 0.002 0.091 0.024 0.071 0.026 0.736 0.006 0.047 0.038 0.013 0.024 0.092 0.099 0.123 0.102 0.025 0.038 0.120 0.000 0.000 0.102 0.003 0.005 0.092 0.006 0.029 0.056 − − (0.029) 0.029 (0.045) 0.046 (0.028) 0.028 (0.000) 0.000 − (0.056) 0.054 − (0.044) 0.045 − (0.003) 0.003 − (0.045) 0.046 (0.030) 0.030 (0.030) 0.031 (0.000) 0.000 (0.005) 0.005 0.51 0.002 0.027 0.048 0.07 0.35 0.4 3.2 0.004 0.015 0.025 0.001 0.003 0.102 0.000 0.002 0.092 − − − 0.0004 0.000 0.000 0.086 0.000 0.000 0.042 − − Procedure First-stage model 2 2 exp exp True first-stage model exp exp constant constant

3.district6.district 0.34 0.34 0.024 0.028 0.132 3.district 6.district 1.00 0.143 0.051 0.776

10.district 10.district Coefficient Truth

0 0 1 β β − β 156 Exploring marginal treatment effects: Flexible estimation using Stata

oe ot al xeiet ecie nsection in described experiments Carlo Monte Note: MTE k(u) tnaderr r nelnd tews nltc 0 elctos ahwt apesz f1,0 n1 districts. 10 in 10,000 of size sample a with each replications, 500 polynomial. is analytic. structure otherwise Error underlined, are errors standard MTE( MTE( MTE( MTE( MTE( MTE( MTE( offiin rt i s.e. diff Truth Coefficient π π ATE( 12 11 refis-tg model first-stage True − − x, x, x, x, x, x, x, π π x 5 .90020020160040070480160000240030020.180 0.026 0.042 0.090 0.033 0.029 0.254 0.758 0.080 0.019 0.106 0.050 0.498 0.900 0.017 0.024 0.20 0.034 0.074 0 . 95) 0.116 0.022 0.19 0.012 0 . 75) 0.49 0 . 25) 0.74 0 . 05) )02 .9 .2 .0 .3 .2 1.000 0.020 0.132 1.000 0.026 0.19 0.198 0 . 9) 0.29 0 . 5) 0.67 0 . 1) is-tg model First-stage 02 01 )0.37 Procedure − .6(.1)0.016 (0.015) 0.36 0.9 . .1 .1 .0 .6 .9 1.000 0.291 1.466 1.000 0.411 2.612 1.5 009 0.049 (0.049) − 0.027 (0.026) 0.056 (0.054) − 041 0.428 (0.411) 005 0.065 (0.065) − 0.021 (0.021) 0.024 (0.023) 0.073 (0.070) − 038 0.415 (0.398) − − .5 .5 0.776 0.056 0.150 1.000 0.051 0.249 .5 .7 0.944 0.073 0.250 1.000 0.067 0.361 .8 .1 1.000 0.410 2.488 .0 .1 .5 .1 .1 .7 .9 .9 .8 .2 .3 0.134 0.032 0.024 0.188 0.092 0.099 0.274 0.011 0.014 0.056 0.015 0.005 oa IV Local Probit LPM B tnaddvain nprnhss eut ae nbootstrapped on based Results parentheses. in deviations Standard . 0.972 0.866 0.934 1.000 0.084 0.998 1.000 1.000 1.000 0.046 r − − − − 005 0.046 (0.045) 0.036 (0.036) 0.018 (0.018) 0.019 (0.017) 0.022 (0.022) 0.045 (0.045) 0.057 (0.056) 0.278 (0.261) 0.297 (0.282) 003 0.014 (0.013) − .8 .4 .8 .6 .3 .9 .2 .9 0.046 0.092 0.029 0.090 0.232 0.167 0.180 0.087 0.580 0.084 0.041 0.086 0.276 0.215 0.290 0.774 0.038 0.110 .4 .5 .2 .3 .9 .9 .4 .1 0.046 0.113 0.043 0.098 0.295 0.230 0.820 0.052 0.145 0.184 0.107 0.105 0.274 0.275 0.368 0.926 0.049 0.173 .3 .8 .0 .5 .6 .0 .1 .8 0.136 0.480 0.412 0.202 1.463 1.654 1.000 0.288 1.435 i s.e. diff Separate Probit LPM 0.880 0.668 0.768 1.000 0.366 0.680 0.852 1.000 1.000 0.168 r − − 023 0.292 (0.273) 0.229 (0.213) 0.086 (0.079) 0.039 (0.038) 0.087 (0.086) 0.227 (0.224) 0.290 (0.286) 1.501 (1.446) 1.514 (1.474) 000 0.094 (0.090) .0 .5 0.236 1.459 1.808 .3 .3 0.174 0.037 0.036 i s.e. diff oa IV Local Probit LPM 0.102 0.086 0.042 0.136 0.204 0.234 0.232 0.180 0.206 0.158 r − − − 017 0.111 (0.107) 0.091 (0.087) 0.046 (0.044) 0.020 (0.020) 0.055 (0.052) 0.108 (0.104) 0.130 (0.126) 0.506 (0.494) 0.554 (0.541) 006 0.038 (0.036) .8 .9 0.176 0.496 0.481 .1 .1 0.126 0.018 0.010 .0 .4 0.050 0.046 0.001 i s.e. diff Separate Probit LPM 0.050 0.050 0.046 0.084 0.072 0.094 0.100 0.120 0.122 0.072 r M. E. Andresen 157 r 0.052 0.054 0.066 1.000 0.048 0.042 0.064 1.000 0.076 0.074 Probit Probit Separate diff s.e. 0.001 0.016 0.108 0.000 0.017 0.096 0.147 0.264 0.058 0.002 0.016 0.124 0.107 0.020 1.000 − − − (0.020) 0.020 (0.019) 0.020 (0.020) 0.020 (0.017) 0.017 (0.024) 0.024 (0.023) 0.024 (0.025) 0.024 (0.020) 0.020 (0.260) 0.271 (0.238) 0.251 − − ) Continued on next page ( Z r P 0.058 0.056 0.056 1.000 0.058 0.042 0.050 0.896 0.050 0.044 Probit Probit Local IV diff s.e. 0.001 0.025 0.076 0.000 0.026 0.072 0.038 0.457 0.048 0.002 0.025 0.064 0.104 0.034 0.872 − − − (0.028) 0.027 (0.028) 0.028 (0.027) 0.026 (0.023) 0.022 (0.043) 0.042 (0.049) 0.050 (0.043) 0.041 (0.033) 0.032 (0.471) 0.476 (0.455) 0.465 − − r 0.056 0.056 0.056 1.000 0.066 0.050 0.038 0.784 0.058 0.050 LPM LPM Separate diff s.e. 
0.000 0.015 0.088 0.202 0.488 0.064 0.001 0.022 0.036 0.002 0.045 0.030 0.002 0.024 0.058 0.113 0.037 0.842 0.001 0.022 0.060 0.002 0.045 0.042 0.001 0.024 0.042 − − − (0.018) 0.018 (0.018) 0.018 (0.018) 0.018 (0.031) 0.032 (0.022) 0.022 (0.023) 0.022 (0.021) 0.022 (0.039) 0.042 (0.532) 0.567 (0.490) 0.519 − − r 0.036 0.066 0.046 0.992 0.038 0.076 0.042 0.206 0.050 0.060 LPM LPM Local IV 0.003 0.051 0.0180.005 0.049 0.096 0.000 0.015 0.088 0.109 1.167 0.064 0.001 0.050 0.020 0.000 0.015 0.096 0.111 0.100 0.166 0.258 0.055 0.994 0.259 0.025 1.000 0.254 0.021 1.000 0.255 0.014 1.000 (0.053) 0.055 − (0.044) 0.045 − (1.183) 1.196 − (0.075) 0.079 − (0.041) 0.042 (0.058) 0.055 (0.121) 0.115 (1.191) 1.211 (0.089) 0.094 − (0.083) 0.085 Table 7. Monte Carlo estimates: Omissions in the specification of 1.5 0.106 1.162 0.058 0.197 0.501 0.086 0.028 0.458 0.054 0.136 0.269 0.074 0.51 0.07 0.002 0.098 0.008 0.35 0.005 0.098 0.016 0.9 0.4 3.2 − − − − Procedure 01 02 π π First-stage model − − True first-stage model 11 12

constant constant 3.district6.district 0.34 0.34 3.district 6.district 1.00 0.009 0.099 0.110 0.000 0.022 0.052 0.002 0.046 0.058 0.001 0.025 0.034

π π 10.district 10.district Coefficient Truth diff s.e.

) ( 0 0 1 β β − β u k 158 Exploring marginal treatment effects: Flexible estimation using Stata

oe ot al xeiet ecie nsection in described experiments Carlo Monte Note: MTE ihasml ieo 000i 0dsrcs ro tutr spolynomial. each is replications, structure 500 Error analytic. districts. otherwise 10 underlined, in are 10,000 errors of standard size bootstrapped sample on a based with Results parentheses. in MTE( MTE( MTE( MTE( MTE( MTE( MTE( offiin rt i s.e. diff Truth Coefficient ATE( refis-tg model first-stage True x, x, x, x, x, x, x, x 5 0.20 0 . 95) 0.19 0 . 75) 0.49 0 . 25) 0.74 0 . 05) )02 .0 .3 .8 .0 .2 .9 .0 .2 .7 .0 .1 0.076 0.017 0.009 0.070 0.028 0.003 0.098 0.020 0.006 0.086 0.035 0.19 0.005 0 . 9) 0.29 0 . 5) 0.67 0 . 1) is-tg model First-stage )0.37 Procedure .6(.6)0.073 (0.069) 0.36 − 0.037 (0.039) 0.184 (0.178) − 014 0.186 (0.184) − 024 0.236 (0.234) − 0.074 (0.073) − 0.075 (0.072) − 0.234 (0.227) − .1 .9 0.050 0.191 0.014 0.068 0.172 0.012 .0 .7 0.050 0.071 0.004 .1 .4 0.054 0.241 0.019 0.030 0.079 0.003 0.068 0.068 0.001 0.062 0.220 0.016 oa IV Local LPM LPM B 0.058 0.056 0.044 0.066 0.048 0.036 0.038 0.044 hr x n exp and exp where , r − − − 010 0.112 (0.110) 0.092 (0.090) 0.046 (0.046) 0.022 (0.022) 0.055 (0.052) 0.108 (0.102) 0.132 (0.123) 005 0.038 (0.035) − − − − .2 .9 0.054 0.093 0.028 0.080 0.055 0.001 0.100 0.086 0.025 .1 .3 .7 .0 .1 0.060 0.015 0.000 0.074 0.033 0.011 .0 .4 .2 .0 .2 .7 .0 .1 0.112 0.018 0.054 0.003 0.114 0.037 0.070 0.050 0.022 0.046 0.003 0.008 0.082 0.124 0.074 0.042 0.000 0.006 0.100 0.107 0.033 i s.e. diff Separate 2 LPM LPM r mte rmtemdl tnaddeviations Standard model. the from omitted are 0.058 0.056 0.052 0.074 0.048 0.038 0.040 0.040 r − 003 0.073 (0.073) 0.055 (0.055) 0.021 (0.023) 0.030 (0.029) 0.024 (0.024) 0.060 (0.061) 0.079 (0.079) 006 0.016 (0.016) − − .0 .6 0.026 0.060 0.007 .0 .7 0.022 0.079 0.009 0.064 0.023 0.002 i s.e. diff oa IV Local Probit Probit 0.042 0.056 0.082 0.062 0.054 0.054 0.054 0.056 r − − − 006 0.045 (0.046) 0.036 (0.037) 0.019 (0.021) 0.017 (0.016) 0.022 (0.022) 0.044 (0.043) 0.054 (0.054) 003 0.013 (0.013) − − − .1 .3 0.106 0.037 0.010 .1 .4 0.064 0.040 0.018 .0 .1 0.108 0.011 0.003 .1 .4 0.092 0.048 0.015 .2 .5 0.060 0.051 0.025 0.072 0.019 0.002 i s.e. diff Separate Probit Probit 0.096 0.100 0.076 0.092 0.050 0.058 0.054 0.054 r The Stata Journal (2018) 18, Number 1, pp. 159–173 Fitting and interpreting correlated random-coefficient models using Stata

Oscar Barriga Cabanillas, University of California, Davis, CA. [email protected]
Jeffrey D. Michler, University of Saskatchewan, Saskatoon, Canada. [email protected]
Aleksandr Michuda, University of California, Davis, CA. [email protected]

Emilia Tjernström, University of Wisconsin, Madison, WI. [email protected]

Abstract. In this article, we introduce the community-contributed command randcoef, which fits the correlated random-effects and correlated random-coefficient models discussed in Suri (2011, Econometrica 79: 159–209). While this approach has been around for a decade, its use has been limited by the computationally intensive nature of the estimation procedure that relies on the optimal minimum distance estimator. randcoef can accommodate up to five rounds of panel data and offers several options, including alternative weight matrices for estimation and inclusion of additional endogenous regressors. We also present postestimation analysis using sample data to facilitate understanding and interpretation of results.

Keywords: st0517, randcoef, correlated random effects, correlated random coefficients, technology adoption, heterogeneity

1 Introduction

Random-coefficient models have frequently been used to help explain heterogeneity in returns to human capital (Heckman and Vytlacil 1998) or the adoption of a new technology (Suri 2011). At its most basic, the random-coefficient model can be written as

yi = α0i + α1ihi (1)

The outcome variable (earnings in Heckman and Vytlacil [1998], maize yields in Suri [2011]) is a function of the rate of return, α1i, to a choice variable hi (for example, investment in human capital or technology adoption). The rate of return varies by individual, as does the intercept term. Assume α0i = α0 + ε0i and α1i = α1 + ε1i, where ε0i and ε1i are zero in expectation. We can then rewrite (1) as

yi = α0 + α1hi + (ε0i + ε1ihi)     (2)

The difficulty in identifying the model arises when α1i, the rate of return for the individual, is correlated with the individual's choice of hi. In the case of schooling, this would

be the case if one's return to education influences one's choice to invest in education. In the case of technology adoption, the classic example is when one's return to adoption influences one's choice to adopt the new technology. The literature has produced several alternative approaches to identifying the correlated random-coefficient (CRC) model in (2). Both Heckman and Vytlacil (1998) and Wooldridge (2003) use instrumental variables. More recently, Suri (2011) developed a generalization of the Chamberlain (1984) fixed-effects (FE) method and applied it to the CRC model. In this article, we present the randcoef command, which enables a user to fit Suri's CRC model. The approach, which is structural in nature, uses a set of reduced-form parameters to recover the structural parameters of interest via the optimal minimum distance (OMD) estimator. randcoef allows for the use of equally weighted minimum distance (EWMD) or diagonally weighted minimum distance (DWMD) instead of the inverse of the variance–covariance matrix of the reduced-form estimates. Additionally, it allows for the inclusion of exogenous covariates as controls and can also accommodate the inclusion of an additional endogenous covariate.1

The structural approach to identifying the CRC model has several advantages over the instrumental variables approach: First, it does not require the researcher to choose an instrument and defend the exclusion restriction. Rather, identification relies on the linear projection of an individual's rate of return onto his or her complete history of adoption, similar to the correlated random-effects (CRE) method pioneered by Mundlak (1978) and Chamberlain (1984). Second, the approach allows for a test of the importance of the individual's rate of return (comparative advantage in Suri's terminology) in the adoption decision. Third, the approach allows researchers to recover the distribution of the rate of return for postestimation analysis, which we discuss using the practical example of seed adoption decisions among a sample of farmers in Ethiopia.

2 The CRE and CRC models

This section explains the math behind the calculations of the CRE and CRC models, as well as the CRC model with an additional endogenous covariate. Two-period versions of these models are developed in Suri (2011). To fix these ideas, we outline the calculations of a three-period model. The methodology does not depend on the number of periods in the data but rather on the specifics of the structural equations, reduced-form equations, and variance–covariance matrix, which do vary depending on the number of periods. Though we discuss only a three-period model, randcoef can accommodate anywhere between two and five periods of panel data.

1. In the case of the hybrid maize adoption story in Suri (2011), this additional endogenous variable is the choice to apply fertilizer in addition to the choice to adopt hybrid maize.

2.1 Three-period CRE model

We start with a simple three-period, no-covariates CRE model, for which the data-generating process is given by

yit = ζ + βhit + αi + uit     (3)

where yit is the variable of interest for individual i at time t. The outcome of interest is a function of a binary indicator of adoption (hit), an indicator for the individual (αi), and an idiosyncratic error term uit ∼ N(0, σu²). We assume strict exogeneity of the error term.

In an FE model, the unique individual indicators are simply dummy variables. In the CRE model, we replace the FE with their linear projections upon the history of the individual’s adoption behavior

αi = λ0 + λ1hi1 + λ2hi2 + λ3hi3 + νi     (4)

where hit for t = 1, 2, 3 is an indicator that equals 1 if individual i adopts at time t and the λ's are the projection coefficients. By the nature of the projection, νi is uncorrelated with hit (Suri 2011). Substituting (4) into (3), we get

yit = ζ + βhit + λ0 + λ1hi1 + λ2hi2 + λ3hi3 + νi + uit

Let εit = νi + uit, where εit is strictly exogenous. For each time period, we have

yi1 = (ζ + λ0) + (β + λ1)hi1 + λ2hi2 + λ3hi3 + εi1

yi2 = (ζ + λ0) + λ1hi1 + (β + λ2)hi2 + λ3hi3 + εi2

yi3 = (ζ + λ0) + λ1hi1 + λ2hi2 + (β + λ3)hi3 + εi3

These are the structural equations for each period. Note that we cannot identify the variable of interest (β) by estimating these equations. Instead, we estimate the following reduced-form equations:

yi1 = ζ1 + γ1hi1 + γ2hi2 + γ3hi3 + ni1 (5)

yi2 = ζ2 + γ4hi1 + γ5hi2 + γ6hi3 + ni2 (6)

yi3 = ζ3 + γ7hi1 + γ8hi2 + γ9hi3 + ni3 (7)

We can use the nine reduced-form parameters to calculate the four structural parameters. This requires placing the following restrictions on the parameters:

γ1 = (β + λ1)    γ4 = λ1          γ7 = λ1
γ2 = λ2          γ5 = (β + λ2)    γ8 = λ2
γ3 = λ3          γ6 = λ3          γ9 = (β + λ3)

We estimate (5)–(7) as a set of seemingly unrelated regressions (SUR) and then preserve the nine reduced-form parameters in a vector π[9×1]. We can also preserve the variance–covariance matrices from the three regressions in a large symmetric block matrix V[9×9].
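As a rough illustration of this step only, the reduced forms in (5)–(7) can be fit jointly with sureg and the estimates collected afterward; the variable names y1–y3 and h1–h3 below are hypothetical placeholders and not part of the randcoef syntax:

* minimal sketch of the SUR stage (hypothetical variable names)
sureg (y1 h1 h2 h3) (y2 h1 h2 h3) (y3 h1 h2 h3)
matrix b = e(b)    // stacked reduced-form coefficients (constants included)
matrix V = e(V)    // their full variance-covariance matrix
* the nine slope entries of b form pi; the matching rows and columns
* of V form the 9 x 9 block matrix used in the minimum distance step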

The restrictions on the γ’s can be expressed as π = Hδ,whereH[9×4] is a restriction matrix embodying the nine restrictions on γ,andδ[4×1] is a vector of our four structural parameters.

The OMD function is

min_δ (π − Hδ)′ V^−1 (π − Hδ)

Solving for δ, we get

δ = (H′V^−1H)^−1 H′V^−1 π

which is the OMD estimator. The key to getting the OMD to produce the correct estimates requires ensuring that the restriction matrix is correctly specified. In the three-period CRE case, π = (γ1, ..., γ9)′, δ = (β, λ1, λ2, λ3)′, and

     (1 1 0 0)
     (0 0 1 0)
     (0 0 0 1)
     (0 1 0 0)
H =  (1 0 1 0)
     (0 0 0 1)
     (0 1 0 0)
     (0 0 1 0)
     (1 0 0 1)

which is just π = Hδ. Alternatively, one could replace the inverse of the variance–covariance matrix (V^−1) with either the identity matrix (I) to generate the EWMD estimator or the inverse of the variance–covariance matrix with all off-diagonals set to zero (diag[V^−1]) to generate the DWMD estimator. As Altonji and Segal (1996) discuss, OMD may generate significant bias in small samples, leading some researchers to prefer EWMD. However, as Blundell, Pistaferri, and Preston (2008) point out, EWMD does not allow for heteroskedasticity, which can be addressed by using DWMD.

Finally, note that the above allows us to calculate the OMD coefficients but not the structural variance–covariance matrix, which requires taking derivatives of the restrictions. In the CRE model, this is trivial because all the derivatives are equal to one.
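To make the algebra concrete, the following Mata lines sketch the minimum distance step for given π, V, and H; all names here are hypothetical, and randcoef performs this computation internally:

mata:
    // pi: 9 x 1 vector of reduced-form slopes, V: 9 x 9 block matrix,
    // H: 9 x 4 restriction matrix as displayed above
    W     = invsym(V)                          // OMD weighting matrix
    delta = invsym(H' * W * H) * H' * W * pi   // structural estimates
    // EWMD: W = I(rows(V));  DWMD: W = diag(diagonal(invsym(V)))
end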

2.2 Three-period CRC model

The three-period CRC model is methodologically similar to the CRE model, but the specifics of the structural equations, reduced-form equations, and restrictions differ. The data-generating process is given by

yit = ζ + βhit + θi + φθihit + τi + uit     (8)

where φ is the coefficient on an individual's rate of return or comparative advantage in adoption (θi), τi is an individual's absolute advantage (equivalent to an FE), and all other terms are as previously defined. To estimate (8), we must eliminate the dependence of θi on hit. To do this, Suri (2011) and Chamberlain (1984) replace θi with its linear projection onto the individual's full history of adoption:

θi = λ0 + λ1hi1 + λ2hi2 + λ3hi3 + λ4hi1hi2 + λ5hi1hi3 + λ6hi2hi3 + λ7hi1hi2hi3 + νi (9)

Note that we must include the history of interactions, because while the projection error νi is uncorrelated with each individual history, it is not necessarily uncorrelated with the product of the histories. Substituting (9) into (8) and writing out each time period's function gives

yi1 = (ζ + λ0) + {β + φλ0 + λ1(1 + φ)}hi1 + λ2hi2 + λ3hi3 + {φλ2 + λ4(1 + φ)}hi1hi2
      + {φλ3 + λ5(1 + φ)}hi1hi3 + λ6hi2hi3 + {φλ6 + λ7(1 + φ)}hi1hi2hi3 + (νi + φνihi1 + ui1)

yi2 = (ζ + λ0) + λ1hi1 + {β + φλ0 + λ2(1 + φ)}hi2 + λ3hi3 + {φλ1 + λ4(1 + φ)}hi1hi2
      + λ5hi1hi3 + {φλ3 + λ6(1 + φ)}hi2hi3 + {φλ5 + λ7(1 + φ)}hi1hi2hi3 + (νi + φνihi2 + ui2)

yi3 = (ζ + λ0) + λ1hi1 + λ2hi2 + {β + φλ0 + λ3(1 + φ)}hi3 + {φλ1 + λ5(1 + φ)}hi1hi3
      + λ4hi1hi2 + {φλ2 + λ6(1 + φ)}hi2hi3 + {φλ4 + λ7(1 + φ)}hi1hi2hi3 + (νi + φνihi3 + ui3)

These are the structural yield equations for each period. From these, we can estimate the following reduced-form equations:

yi1 = ζ1 + γ1hi1 + γ2hi2 + γ3hi3 + γ4hi1hi2 + γ5hi1hi3 + γ6hi2hi3

+ γ7hi1hi2hi3 + ni1 (10)

yi2 = ζ2 + γ8hi1 + γ9hi2 + γ10hi3 + γ11hi1hi2 + γ12hi1hi3 + γ13hi2hi3

+ γ14hi1hi2hi3 + ni2 (11)

yi3 = ζ3 + γ15hi1 + γ16hi2 + γ17hi3 + γ18hi1hi2 + γ19hi1hi3 + γ20hi2hi3

+ γ21hi1hi2hi3 + ni3 (12)

These equations give 21 reduced-form coefficients (γ1–γ21), from which we can estimate 10 structural parameters (β, φ, λ0–λ7). Note that if we normalize the θi's to be mean zero, we can effectively eliminate λ0 and must estimate only nine structural parameters.

The necessary restrictions to identify the structural parameters are

γ1 = {β + φλ0 + λ1(1 + φ)}     γ12 = λ5
γ2 = λ2                        γ13 = {φλ3 + λ6(1 + φ)}
γ3 = λ3                        γ14 = {φλ5 + λ7(1 + φ)}
γ4 = {φλ2 + λ4(1 + φ)}         γ15 = λ1
γ5 = {φλ3 + λ5(1 + φ)}         γ16 = λ2
γ6 = λ6                        γ17 = {β + φλ0 + λ3(1 + φ)}
γ7 = {φλ6 + λ7(1 + φ)}         γ18 = λ4
γ8 = λ1                        γ19 = {φλ1 + λ5(1 + φ)}
γ9 = {β + φλ0 + λ2(1 + φ)}     γ20 = {φλ2 + λ6(1 + φ)}
γ10 = λ3                       γ21 = {φλ4 + λ7(1 + φ)}
γ11 = {φλ1 + λ4(1 + φ)}

Similar to before, we estimate (10)–(12) as SUR and preserve the 21 reduced-form parameters in a vector π[21×1], the variance–covariance matrices in a large symmetric block matrix V[21×21], the restrictions as H[21×9], and the structural parameters as δ[9×1]. We now must take derivatives of the elements in Hδ with respect to each of the structural parameters. This gives us the 63 derivatives, which we preserve in the structural variance–covariance matrix to facilitate calculation of standard errors.

2.3 Three-period CRC model with endogenous covariates

Following Suri (2011), we expand the three-period CRC model to allow for an additional endogenous covariate. This requires that we account not only for the whole history of adoption but also for those histories interacted with the endogenous variable. We start by including the endogenous covariate, fit, in the data-generating process:

yit = ζ + βhit + ρfit + θi + φθihit + τi + uit (13)

We can write out the linear projection of the θi’s on the history of the individual’s adoption behavior of both endogenous technologies:

θi = λ0 + λ1hi1 + λ2hi2 + λ3hi3 + λ4hi1hi2 + λ5hi1hi3 + λ6hi2hi3 + λ7hi1hi2hi3
   + λ8fi1 + λ9fi2 + λ10fi3 + λ11hi1fi1 + λ12hi2fi1 + λ13hi3fi1 + λ14hi1hi2fi1
   + λ15hi1hi3fi1 + λ16hi2hi3fi1 + λ17hi1hi2hi3fi1 + λ18hi1fi2 + λ19hi2fi2
   + λ20hi3fi2 + λ21hi1hi2fi2 + λ22hi1hi3fi2 + λ23hi2hi3fi2 + λ24hi1hi2hi3fi2
   + λ25hi1fi3 + λ26hi2fi3 + λ27hi3fi3 + λ28hi1hi2fi3 + λ29hi1hi3fi3
   + λ30hi2hi3fi3 + λ31hi1hi2hi3fi3 + νi

Substituting the above into (13), we can write out the structural equations for each time period.2 From these equations, we can estimate the following reduced-form equations:

yi1 = ζ1 + γ1hi1 + γ2hi2 + γ3hi3 + γ4hi1hi2 + γ5hi1hi3 + γ6hi2hi3 + γ7hi1hi2hi3
    + γ8fi1 + γ9fi2 + γ10fi3 + γ11hi1fi1 + γ12hi2fi1 + γ13hi3fi1 + γ14hi1hi2fi1
    + γ15hi1hi3fi1 + γ16hi2hi3fi1 + γ17hi1hi2hi3fi1 + γ18hi1fi2 + γ19hi2fi2
    + γ20hi3fi2 + γ21hi1hi2fi2 + γ22hi1hi3fi2 + γ23hi2hi3fi2 + γ24hi1hi2hi3fi2
    + γ25hi1fi3 + γ26hi2fi3 + γ27hi3fi3 + γ28hi1hi2fi3 + γ29hi1hi3fi3
    + γ30hi2hi3fi3 + γ31hi1hi2hi3fi3 + (νi + φνihi1 + ui1)     (10)

yi2 = ζ2 + γ32hi1 + γ33hi2 + γ34hi3 + γ35hi1hi2 + γ36hi1hi3 + γ37hi2hi3
    + γ38hi1hi2hi3 + γ39fi1 + γ40fi2 + γ41fi3 + γ42hi1fi1 + γ43hi2fi1 + γ44hi3fi1
    + γ45hi1hi2fi1 + γ46hi1hi3fi1 + γ47hi2hi3fi1 + γ48hi1hi2hi3fi1 + γ49hi1fi2
    + γ50hi2fi2 + γ51hi3fi2 + γ52hi1hi2fi2 + γ53hi1hi3fi2 + γ54hi2hi3fi2
    + γ55hi1hi2hi3fi2 + γ56hi1fi3 + γ57hi2fi3 + γ58hi3fi3 + γ59hi1hi2fi3
    + γ60hi1hi3fi3 + γ61hi2hi3fi3 + γ62hi1hi2hi3fi3 + (νi + φνihi2 + ui2)     (11)

yi3 = ζ3 + γ63hi1 + γ64hi2 + γ65hi3 + γ66hi1hi2 + γ67hi1hi3 + γ68hi2hi3
    + γ69hi1hi2hi3 + γ70fi1 + γ71fi2 + γ72fi3 + γ73hi1fi1 + γ74hi2fi1 + γ75hi3fi1
    + γ76hi1hi2fi1 + γ77hi1hi3fi1 + γ78hi2hi3fi1 + γ79hi1hi2hi3fi1 + γ80hi1fi2
    + γ81hi2fi2 + γ82hi3fi2 + γ83hi1hi2fi2 + γ84hi1hi3fi2 + γ85hi2hi3fi2
    + γ86hi1hi2hi3fi2 + γ87hi1fi3 + γ88hi2fi3 + γ89hi3fi3 + γ90hi1hi2fi3
    + γ91hi1hi3fi3 + γ92hi2hi3fi3 + γ93hi1hi2hi3fi3 + (νi + φνihi3 + ui3)     (12)

2. Note that the logic of identification of the second endogenous variable does not precisely follow the logic of identification of the adoption decision. In Suri (2011), identification of the adoption decision requires controlling for the full history of adoption, including all of its own interaction terms. However, identification of the second endogenous variable relies only on the second variable's interaction with the full history of adoption and not the second variable's own interaction terms (that is, fi1fi2).

These equations give 93 reduced-form coefficients (γ1–γ93), from which we can estimate 34 structural parameters (λ1–λ31, ρ, β, φ). The restrictions are the following:

γ1 = {λ1(1 + φ) + β + φλ0}      γ32 = λ1                         γ63 = λ1
γ2 = λ2                         γ33 = {λ2(1 + φ) + β + φλ0}      γ64 = λ2
γ3 = λ3                         γ34 = λ3                         γ65 = {λ3(1 + φ) + β + φλ0}
γ4 = {λ4(1 + φ) + φλ2}          γ35 = {λ4(1 + φ) + φλ1}          γ66 = λ4
γ5 = {λ5(1 + φ) + φλ3}          γ36 = λ5                         γ67 = {λ5(1 + φ) + φλ1}
γ6 = λ6                         γ37 = {λ6(1 + φ) + φλ3}          γ68 = {λ6(1 + φ) + φλ2}
γ7 = {λ7(1 + φ) + φλ6}          γ38 = {λ7(1 + φ) + φλ5}          γ69 = {λ7(1 + φ) + φλ4}
γ8 = {λ8 + ρ}                   γ39 = λ8                         γ70 = λ8
γ9 = λ9                         γ40 = {λ9 + ρ}                   γ71 = λ9
γ10 = λ10                       γ41 = λ10                        γ72 = {λ10 + ρ}
γ11 = {λ11(1 + φ) + φλ8}        γ42 = λ11                        γ73 = λ11
γ12 = λ12                       γ43 = {λ12(1 + φ) + φλ8}         γ74 = λ12
γ13 = λ13                       γ44 = λ13                        γ75 = {λ13(1 + φ) + φλ8}
γ14 = {λ14(1 + φ) + φλ12}       γ45 = {λ14(1 + φ) + φλ11}        γ76 = λ14
γ15 = {λ15(1 + φ) + φλ13}       γ46 = λ15                        γ77 = {λ15(1 + φ) + φλ11}
γ16 = λ16                       γ47 = {λ16(1 + φ) + φλ13}        γ78 = {λ16(1 + φ) + φλ12}
γ17 = {λ17(1 + φ) + φλ16}       γ48 = {λ17(1 + φ) + φλ15}        γ79 = {λ17(1 + φ) + φλ14}
γ18 = {λ18(1 + φ) + φλ9}        γ49 = λ18                        γ80 = λ18
γ19 = λ19                       γ50 = {λ19(1 + φ) + φλ9}         γ81 = λ19
γ20 = λ20                       γ51 = λ20                        γ82 = {λ20(1 + φ) + φλ9}
γ21 = {λ21(1 + φ) + φλ23}       γ52 = {λ21(1 + φ) + φλ18}        γ83 = λ21
γ22 = {λ22(1 + φ) + φλ20}       γ53 = λ22                        γ84 = {λ22(1 + φ) + φλ18}
γ23 = λ23                       γ54 = {λ23(1 + φ) + φλ20}        γ85 = {λ23(1 + φ) + φλ19}
γ24 = {λ24(1 + φ) + φλ23}       γ55 = {λ24(1 + φ) + φλ22}        γ86 = {λ24(1 + φ) + φλ21}
γ25 = {λ25(1 + φ) + φλ10}       γ56 = λ25                        γ87 = λ25
γ26 = λ26                       γ57 = {λ26(1 + φ) + φλ10}        γ88 = λ26
γ27 = λ27                       γ58 = λ27                        γ89 = {λ27(1 + φ) + φλ10}
γ28 = {λ28(1 + φ) + φλ26}       γ59 = {λ28(1 + φ) + φλ25}        γ90 = λ28
γ29 = {λ29(1 + φ) + φλ27}       γ60 = λ29                        γ91 = {λ29(1 + φ) + φλ25}
γ30 = λ30                       γ61 = {λ30(1 + φ) + φλ27}        γ92 = {λ30(1 + φ) + φλ26}
γ31 = {λ31(1 + φ) + φλ30}       γ62 = {λ31(1 + φ) + φλ29}        γ93 = {λ31(1 + φ) + φλ28}

As before, we estimate (14)–(16) and preserve the 93 reduced-form parameters in a vector π[93×1], the variance–covariance matrices in a large symmetric block matrix V[93×93], the restrictions as H[93×34], and the structural parameters as δ[34×1]. To calculate standard errors, we take derivatives of the elements in Hδ with respect to each of the structural parameters. As one can see, the complexity of the problem grows exponentially. The addition of each new period requires the estimation of only one additional structural parameter in the CRE model but 2^T − 1 parameters in the CRC models. Table 1 summarizes the number of structural and reduced-form parameters that must be estimated in each model for up to five time periods. We believe that it is this computational intensity that has limited the usefulness of Suri's approach to fitting CRC models. Of the 420-plus articles that cited Suri (2011) as of 2018, only 2 (both working articles) have actually implemented her method of estimation. The randcoef command is designed to lower this hurdle and allow for a broader application of the estimation procedure.
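A small counting step, added here for clarity, shows where the exponential growth comes from: with a binary adoption indicator, the CRC projection includes one λ for every nonempty subset of the T period indicators, so

\[
\sum_{k=1}^{T}\binom{T}{k} = 2^{T}-1
\quad\Longrightarrow\quad
(2^{T}-1) + 2 \text{ structural parameters } (\lambda\text{'s plus } \beta \text{ and } \phi),
\]

which matches the CRC column of table 1 (for example, 7 + 2 = 9 when T = 3 and 15 + 2 = 17 when T = 4).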

Table 1. Number of parameters estimated in each model

Period   Parameter    CRE           CRC            CRC endogenous
2        Structural   2 + 1 = 3     3 + 2 = 5      (3 × 3 + 2) + 3 = 14
         Reduced      2 × 2 = 4     3 × 2 = 6      11 × 2 = 22
3        Structural   3 + 1 = 4     7 + 2 = 9      (7 × 4 + 3) + 3 = 34
         Reduced      3 × 3 = 9     7 × 3 = 21     31 × 3 = 93
4        Structural   4 + 1 = 5     15 + 2 = 17    (15 × 5 + 4) + 3 = 82
         Reduced      4 × 4 = 16    15 × 4 = 60    79 × 4 = 316
5        Structural   5 + 1 = 6     31 + 2 = 33    (31 × 6 + 5) + 3 = 194
         Reduced      5 × 5 = 25    31 × 5 = 155   191 × 5 = 955

3 Data considerations

Including the full history of adoption (the adoption decision in each year plus all interactions) in the projection of θi allows for the assumption that there is no correlation between νi and the decision to adopt (Chamberlain 1984; Suri 2011). However, identification of φ in Suri's CRC model when adoption is binary requires that all possible adoption histories be observed in the data. If this is not the case, then the model will suffer from multicollinearity. This is a result both of how Suri constructed the projection of θi and of measuring adoption as a binary variable. For relatively small datasets, or as the number of time periods in a panel grows, there will be a higher probability that one of the adoption histories is not observed in the data. This is not an estimation or coding problem in the command but rather an inherent limitation of the method. To illustrate this, let's assume we have two time periods in the panel. The projection of θi onto adoption will be

θi = λ0 + λ1hi1 + λ2hi2 + λ3hi1hi2 + νi (17)

In this case, there will be four potential adoption histories: 1) those that never adopt (never adopters), 2) those that always adopt (always adopters), 3) those that adopted in period two but not in period one (adopters), and 4) those that adopted in period one but disadopted in period two (disadopters). If adoption corresponds to a 1 and nonadoption to a 0, then each history corresponds to a particular tuple of choices. For the two-period case, always adopters would be represented by the tuple (1, 1), never adopters by the tuple (0, 0), adopters by the tuple (0, 1), and disadopters by the tuple (1, 0).

Table 2 summarizes the histories as they correspond to the variables in the projection. Note that the use of the intercept (λ0) in (17) captures the never adopters, so they are not included in the table.

Table 2. Adoption histories

          hi1   hi2   hi1hi2
(1, 1)     1     1      1
(1, 0)     1     0      0
(0, 1)     0     1      0

From table 2, we can see that if any one of the histories is not observed in the data (equivalent to a row disappearing from the table), then at least two of the independent variables become collinear by construction. For instance, if there were no adopters in the data, then the last row would be missing. In this case, hi2 and hi1hi2 would be perfectly collinear. As such, estimation of Suri's CRC model requires a large enough dataset to ensure all adoption histories are present or, alternatively, must measure adoption as continuous.
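A quick, informal way to check this condition before fitting the model is to tabulate the observed histories; the variable names h1 and h2 below are hypothetical placeholders for the wide-format adoption indicators:

* count how many adoption histories actually appear in the data
egen history = group(h1 h2), label
tabulate history, missing
* fewer than 2^T distinct groups (including never adopters) signals
* a missing history and hence collinearity among the terms in (17)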

4 Fitting the CRE and CRC models using Stata

4.1 Description

The randcoef command fits the CRE (default) and CRC models.3 The command follows the same general syntax as the sureg command. However, it does not allow for equation naming, because the independent variables are the same in all regressions. The command's syntax requires the order of both the outcome and choice variables to be listed chronologically. randcoef uses the number of dependent variables to know how many time periods the CRE or CRC model contains and then generates the necessary interaction terms and calculates the appropriate restriction matrix. If users want to change the order of the variables, they can use the matrix() option accordingly.4

4.2 Syntax

The syntax to fit the CRE model is

    randcoef (depvar1 depvar2 depvar3 ...) [if], choice(indepvar1 indepvar2 indepvar3 ...)
        method(CRE) controls(varlist) showreg matrix(string)

3. Note that randcoef requires that the tuple package be installed.
4. Applies only for the CRE method.

The syntax to fit the CRC model is

    randcoef (depvar1 depvar2 depvar3 ...) [if], choice(indepvar1 indepvar2 indepvar3 ...)
        method(CRC) controls(varlist) showreg endogenous(varlist) keep weighting(string)

4.3 Options

choice(indepvar1 indepvar2 indepvar3 ...) specifies the variables of interest organized in the same order as the dependent variables. In the case of the CRC model, interactions should not be included, because they are created automatically by the program. These variables must be defined as dummies. choice() is required.

method(string) determines the method to be used: CRE (default) or CRC.

controls(varlist) allows for the addition of exogenous covariates as controls to the underlying sureg regression.

showreg allows the user to see the SUR output. By default, the program does not show the SUR output for faster computation and to avoid displaying unnecessary information.

matrix(string) allows the user to specify the restriction matrix (for changing the order of coefficients). Note that the restriction's order is important, because it must match the order in which the parameters of interest are input in the SUR. This option is allowed only if the CRE method is specified. The restriction matrix for the CRC method is preprogrammed automatically following that outlined in Suri (2011) depending on the number of dependent variables.

endogenous(varlist) allows for endogenous variables to be added to the CRC estimation. This must be a dummy variable. The interactions needed between the choice variables and the endogenous variables are automatically created.

keep allows the user to preserve the interactions between the choice variables and the endogenous ones (in the case that the endogenous option is used) in the command line after the estimation has taken place. If keep is specified, running the command several times requires dropping those variables.

weighting(string) specifies the weighting matrix to be used in estimation. This option is allowed only if the CRC method is specified. string may be one of the following:

omd (default) specifies OMD, which uses the inverse of the variance–covariance matrix of the reduced-form estimates.

ewmd specifies EWMD, which uses the identity matrix.

dwmd specifies DWMD, which uses the OMD matrix but with zeros on the off-diagonals.
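For reference, a call combining several of these options might look as follows; the variable names are hypothetical placeholders rather than part of any shipped dataset:

* hypothetical three-round CRC fit with diagonally weighted minimum distance
randcoef y1 y2 y3, choice(h1 h2 h3) method(CRC) weighting(dwmd) showreg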

5 Interpreting output

As an example of fitting and interpreting the CRC model, we use household-level adoption of improved chickpea in Ethiopia.5 The dependent variables are income per capita in each of the three survey rounds (lnp08, lnp10, lnp14), while the choice variables are adoption of improved chickpea in each period (icp08, icp10, icp14). We also include a set of exogenous covariates as controls (`controls1´ ...).6 Note that panels in a "long" format must be reshaped as "wide" prior to estimation. Output from the CRC method using the OMD estimator is

. randcoef lnp08 lnp10 lnp14, choice(icp08 icp10 icp14)
>     controls(`controls1´ `controls2´ `controls3´) method(CRC) keep
RUNNING MODEL WITH OMD WEIGHTING MATRIX
Equations used in sureg: lnp08 lnp10 lnp14 = icp08 icp10 icp14
>     __00000C __00000D __00000E __00000F `controls1´ `controls2´ `controls3´
The model used is : method: CRC
The variables of interest are : icp08 icp10 icp14 __00000C __00000D __00000E
>     __00000F `controls1´ `controls2´ `controls3´
Minimun Distance Estimator is being calculated
(output omitted)
With corresponding Parameters matrix:

             Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

   l1     .1902277    .1583402   1.20    0.230     -.1201135    .5005689
   l2     .3902405    .1481763   2.63    0.008      .0998202    .6806608
   l3     .1208645    .1233118   0.98    0.327     -.1208222    .3625512
   l4    -.6114457    .3197189  -1.91    0.056     -1.238083    .0151918
   l5    -.0115502    .1943865  -0.06    0.953     -.3925407    .3694403
   l6    -.2721276      .14402  -1.89    0.059     -.5544015    .0101463
   l7     .4747094    .3116379   1.52    0.128     -.1360896    1.085508
    b     .2158545    .0746901   2.89    0.004      .0694647    .3622443
  phi     .9196741    1.500435   0.61    0.540     -2.021124    3.860472

In this case, the interactions for the choice variable are automatically created. Given that keep is used, these interactions are available after the estimation for further analysis. Note that the magnitudes of the structural parameters from (9) are λ1 = 0.190, λ2 = 0.390, λ3 = 0.121, λ4 = −0.611, λ5 = −0.012, λ6 = −0.272, and λ7 = 0.475. The aggregate return to improved chickpea adoption is β = 0.216, while φ = 0.920 is the coefficient on the individual's rate of return or comparative advantage in adoption. In this example, the aggregate returns to adoption are significant, but the comparative advantage term is not. One would interpret this as a lack of significant differences in comparative advantage in this economy. Returns to improved chickpea adoption, at least in this sample, are homogeneous, not heterogeneous.

5. For a description of the data, see Verkaart et al. (2017).
6. Because our interest lies in the coefficient estimates for the structural parameters, and in the interests of parsimony, we have listed only the locals for the control variables. A complete list of the control variables and variable definitions is contained in the accompanying data and do-file, which also allows for replication of results.

Despite a lack of significant differences in the returns to adoption based on household unobservables, we can still explore heterogeneity within the population by predicting the θ̂ term for a given adoption history. We can recover θ̂ using (9) and our structural OMD estimates. This requires generating variables equal to the eight mass points for each of the θ̂'s.

. generate theta1 = l0 + _b[l1]*0 + _b[l2]*0 + _b[l3]*0 + _b[l4]*0 + _b[l5]*0 > + _b[l6]*0 + _b[l7]*0 . label variable theta1 "never adopt" . generate theta2 = l0 + _b[l1]*1 + _b[l2]*1 + _b[l3]*1 + _b[l4]*1 + _b[l5]*1 > + _b[l6]*1 + _b[l7]*1 . label variable theta2 "always adopt" . generate theta3 = l0 + _b[l1]*0 + _b[l2]*1 + _b[l3]*1 + _b[l4]*0 + _b[l5]*0 > + _b[l6]*1 + _b[l7]*0 . label variable theta3 "early adopters" . generate theta4 = l0 + _b[l1]*0 + _b[l2]*0 + _b[l3]*1 + _b[l4]*0 + _b[l5]*0 > + _b[l6]*0 + _b[l7]*0 . label variable theta4 "late adopters" . generate theta5 = l0 + _b[l1]*1 + _b[l2]*0 + _b[l3]*0 + _b[l4]*0 + _b[l5]*0 > + _b[l6]*0 + _b[l7]*0 . label variable theta5 "early disadopters" . generate theta6 = l0 + _b[l1]*1 + _b[l2]*1 + _b[l3]*0 + _b[l4]*1 + _b[l5]*0 > + _b[l6]*0 + _b[l7]*0 . label variable theta6 "late disadopters" . generate theta7 = l0 + _b[l1]*1 + _b[l2]*0 + _b[l3]*1 + _b[l4]*0 + _b[l5]*1 > + _b[l6]*0 + _b[l7]*0 . label variable theta7 "mixed adopters" . generate theta8 = l0 + _b[l1]*0 + _b[l2]*1 + _b[l3]*0 + _b[l4]*0 + _b[l5]*0 > + _b[l6]*0 + _b[l7]*0 . label variable theta8 "mixed disadopters"

Given that each history is binary, and given that we observe at least one household in each history, the projection is fully saturated. Once we have recovered the θ̂'s, we can predict the average returns for a given adoption history. This involves calculating β̂ + φ̂θ̂i, where β̂ is the average return to improved varieties and each i is a specific adoption history, such as generate r1 = _b[b] + _b[phi]*theta1. The results can be viewed as the counterfactual returns for nonadopting households using weighted averages of all possible returns. Again, because the histories are binary and the projection is fully saturated, the process produces eight mass points, which we graph in figure 1.7
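Doing this for all eight histories can be written as a short loop; this sketch assumes, as in the authors' example above, that theta1–theta8 exist in memory and that the structural estimates remain accessible through _b[b] and _b[phi]:

forvalues j = 1/8 {
    generate r`j' = _b[b] + _b[phi]*theta`j'
}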

7. Complete code for generating figure 1 is available as part of the Stata code associated with this article.

[Figure 1. Distribution of returns to adoption. Bar chart of the predicted returns for each adoption history: always adopter, mixed disadopter, early adopter, late disadopter, late adopter, early disadopter, mixed adopter, never adopter.]

As the bars in figure 1 reveal, returns to adoption appear heterogeneous, at least for some adoption histories. What the coefficient on φ tells us is that ultimately these differences are not significant. Because φ is not significant, the weighted average across these groups would yield β, the aggregate returns to adoption. If φ were significant, we would expect larger differences across the bars or differences between groups based on adoption history. See Suri (2011) for graphs in which returns differ significantly and consistently across groups.

6 Conclusions

Correlated random-coefficient models can be used to understand a variety of activities in which the return on the activity to the individual is correlated with the individual's choice to participate. Several methods exist to consistently fit these models, but most rely on an instrumental variables approach. While the structural approach first developed in Suri (2006) has been around for a decade, its use has been limited by the computationally intensive nature of the estimation procedure. randcoef allows users to fit both CRE and CRC models and provides them with a variety of estimation options. Additionally, it allows for estimation using up to five rounds of data and can accommodate an additional endogenous regressor. We hope randcoef will lower the hurdle for implementing this approach and thereby add another tool for fitting CRC models.

7 References

Altonji, J. G., and L. M. Segal. 1996. Small-sample bias in GMM estimation of covariance structures. Journal of Business and Economic Statistics 14: 353–366.

Blundell, R., L. Pistaferri, and I. Preston. 2008. Consumption inequality and partial insurance. American Economic Review 98: 1887–1921.

Chamberlain, G. 1984. Panel data. In Handbook of Econometrics, vol. 2, ed. Z. Griliches and M. D. Intriligator, 1247–1318. Amsterdam: Elsevier.

Heckman, J., and E. Vytlacil. 1998. Instrumental variables methods for the correlated random coefficient model: Estimating the average rate of return to schooling when the return is correlated with schooling. Journal of Human Resources 33: 974–987.

Mundlak, Y. 1978. On the pooling of time series and cross section data. Econometrica 46: 69–85.

Suri, T. 2006. Selection and comparative advantage in technology adoption. Discussion Paper 944, Economic Growth Center, Yale University. http://www.econ.yale.edu/ growth pdf /cdp944.pdf.

———. 2011. Selection and comparative advantage in technology adoption. Econometrica 79: 159–209.

Verkaart, S., B. G. Munyua, K. Mausch, and J. D. Michler. 2017. Welfare impacts of improved chickpea adoption: A pathway for rural development in Ethiopia? Food Policy 66: 50–61.

Wooldridge, J. M. 2003. Further results on instrumental variables estimation of average treatment effects in the correlated random coefficient model. Economics Letters 79: 185–191.

About the authors

Oscar Barriga Cabanillas is a PhD student in the Department of Agricultural and Resource Economics at the University of California, Davis.

Jeffrey D. Michler is an assistant professor in the Department of Agricultural and Resource Economics at the University of Saskatchewan.

Aleksandr Michuda is a PhD student in the Department of Agricultural and Resource Economics at the University of California, Davis.

Emilia Tjernström is an assistant professor in the Robert M. La Follette School of Public Affairs and the Department of Agricultural and Applied Economics at the University of Wisconsin–Madison.

The Stata Journal (2018) 18, Number 1, pp. 174–183

heckroccurve: ROC curves for selected samples

Jonathan A. Cook, Public Company Accounting Oversight Board (PCAOB), Washington, DC. [email protected]

Ashish Rajbhandari, Investment Strategy Group, The Vanguard Group, Malvern, PA. ashish [email protected]

Abstract. Receiver operating characteristic (ROC) curves can be misleading when they are constructed with selected samples. In this article, we describe heckroccurve, which implements a recently developed procedure for plotting ROC curves with selected samples. The command provides an estimate of the area under the ROC curve and a graphical display of the curve. A variety of plot options are available, including the ability to add confidence bands to the plot.

Keywords: st0518, heckroccurve, receiver operating characteristic curves, ROC curves, classifier evaluation, sample-selection bias

1 Introduction

Receiver operating characteristic (ROC) curves are widely used in many fields to measure the performance of ratings. An advantage of ROC curves over metrics like accuracy (defined as the portion of cases correctly predicted) is that ROC curves provide the full range of tradeoffs between true positives and false negatives. Despite their widespread use, the effects of sample selection on ROC curves were not explored until recently. Sample selection is common in many areas. Consider a medical test administered only to patients who are referred by their physicians. We want to know how well the test correctly diagnoses illness, but we observe test results only for referred patients. A different but related problem arises in commercial banking. The Basel Accords require banks to estimate the probability of default for their loans. To assess the predictive performance of their probability of default models, banks could construct a ROC curve with the sample of loan applicants that were granted loans. Hand and Adams (2014) and Kraft, Kroisandt, and Müller (2014) appear to have been the first to discuss selection bias for ROC curves. Cook (2017) presents a procedure to plot a ROC curve that is a consistent estimate of the ROC curve that would be obtained with a random sample. The heckroccurve command implements Cook's procedure and provides confidence intervals for the area under the curve and confidence bands for the ROC curve.


There are many existing Stata commands for plotting ROC curves, including rocreg, roctab, and roccomp, but none of these commands correct for the effects of sample selection. The syntax of heckroccurve was kept close to existing Stata commands for sample-selection problems, that is, heckman, heckprobit, and heckoprobit. Like heckprobit and heckoprobit, heckroccurve is based on assumptions similar to those of Heckman (1976). The output from heckroccurve was designed to be similar to that of Stata's built-in commands for ROC curves. The next section describes the procedure performed by heckroccurve. Sections 3 and 4 provide the command's syntax and examples: The first example in section 4 illustrates the syntax. The second example in section 4 shows how selection can affect ROC curves. The dataset used for this second example is provided with heckroccurve. Section 5 concludes.

2 ROC curves for selected samples

We assume that each observation belongs to one of two classes (for example, positive and negative). Our task is to evaluate how well our ordinal rating predicts class. Given a threshold, we could predict that all observations with a rating value above the threshold are positive, and all observations below the threshold are negative. To see how well the rating with the threshold predicts class, we define sensitivity and specificity as

Sensitivity = TP/P,  and     (1)
Specificity = TN/N     (2)

where the confusion matrix in table 1 defines true positives (TP), true negatives (TN), positives (P), and negatives (N).

Table 1. Confusion matrix

                                     truth
                         positive              negative
prediction  positive     True Positives (TP)   False Positives (FP)
            negative     False Negatives (FN)  True Negatives (TN)
total                    Positives (P)         Negatives (N)

ROC curves, which plot sensitivity as a function of specificity for all possible thresholds, illustrate a rating's tradeoff between true positives and false negatives. A higher value of sensitivity for a given value of specificity indicates better performance. The area under the ROC curve (AUC) is a common metric for evaluating a rating's performance. If the rating has no connection to the true class, the expected AUC would be 0.5. An excellent introduction to ROC curves is provided by Fawcett (2006).
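As a small illustration of definitions (1) and (2), the empirical rates at a single cutoff can be computed by counting; the variable names outcome and rating below are hypothetical:

* empirical sensitivity and specificity of rating at cutoff 0.5
local c = 0.5
quietly count if rating >  `c' & outcome == 1
local TP = r(N)
quietly count if outcome == 1
local P = r(N)
quietly count if rating <= `c' & outcome == 0
local TN = r(N)
quietly count if outcome == 0
local N = r(N)
display "Sensitivity = " `TP'/`P' "   Specificity = " `TN'/`N'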

2.1 Notation and setup

We denote the rating’s output as ai for each observation i. The unobserved propensity to be a positive case is denoted as pi. The true outcome is  ∗ positive if pi >p outcomei = negative otherwise

∗ where p is the threshold for an instance to be a positive case. We assume that pi follows a standard normal distribution. The modeler never observes pi, only outcomei.Fora given threshold c, we can give probabilistic definitions of sensitivity and specificity:

Sensitivity = Prob(ai > c | pi > p*),  and     (3)
Specificity = Prob(ai ≤ c | pi ≤ p*)     (4)

Evaluating (1) and (2) with the sample at hand provides estimates of these probabilities. The selection rule is

Selected if bi ≡ δ′Xi + γai + εi > s, and Not selected otherwise     (5)

where s is a constant, Xi is a vector of variables, and εi is a standard normal random variable. The parameter δ is a vector of coefficients, and γ indicates the degree to which the rating was incorporated into the selection process. These parameters can be estimated from a probit regression of selection on X and a, as sketched at the end of this subsection. If the vector X does not contain a constant, then the intercept from the probit regression would provide an estimate of −s. The procedure that we describe here does not require an estimate of s. We denote sensitivity and specificity conditional on selection as

Sensitivity | Selection = Prob(ai > c | pi > p*, bi > s),  and     (6)
Specificity | Selection = Prob(ai ≤ c | pi ≤ p*, bi > s)     (7)

When data are chosen according to (5), the values in (1) and (2) provide estimates of (6) and (7) instead of (3) and (4). It is possible that the ROC curve implied by (6) and (7) differs greatly from the curve implied by (3) and (4).
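A minimal sketch of the selection probit mentioned above, with hypothetical variable names (selected, x1, x2, and rating):

* probit of the selection indicator on X and the rating's output
probit selected x1 x2 rating
* the coefficient on rating estimates gamma; with no constant in X,
* the negative of the estimated intercept corresponds to s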

2.2 Procedure for creating ROC curves with selected samples

Cook’s (2017) procedure for creating ROC curves with selected samples infers the predic- tive power of the classifier (taking selection into consideration), then draws the implied J. A. Cook and A. Rajbhandari 177

ROC curve. After standardizing the classifier's output, the likelihood for the data can be expressed as

L = ∏i Φ2{δ′Xi + γai − s, −(β0 + β1ai); ρεp}^1(outcomei=positive)
       × Φ2(δ′Xi + γai − s, β0 + β1ai; −ρεp)^1(outcomei=negative)
       × Φ{−(δ′Xi + γai − s)}^1(outcomei=NA)     (8)

where 1(·) is the indicator function,

β0 ≡ p*/√(1 − ρap²),  and
β1 ≡ −ρap/√(1 − ρap²)

This likelihood function contains two correlations: ρap and ρεp. The correlation between the rating's output and pi, denoted ρap, is crucial for determining the strength of the classifier. The likelihood also contains ρεp, which is the correlation between the unobserved component of the selection rule (that is, εi) and pi. The likelihood function in (8) is the same likelihood derived by Van de Ven and Van Praag (1981), which heckprobit maximizes. To take advantage of heckprobit's many built-in features, heckroccurve's maximum likelihood estimation is performed by calling heckprobit. Estimates of p* and ρap are found by applying the appropriate transformations to the estimates of β0 and β1. To draw the ROC implied by the estimates of p* and ρap (denoted here as p̂* and ρ̂ap), we begin with a set of cutoffs with a sufficiently large range (heckroccurve uses −4 to 4). For each cutoff c ∈ [−4, 4], we find the corresponding value of sensitivity as

Prob(ai > c | pi > p*) ≈ {1 − Φ(p̂*)}^−1 ∫c^∞ φ(a) [1 − Φ{(p̂* − ρ̂ap a)/√(1 − ρ̂ap²)}] da

and specificity as

Prob(ai ≤ c | pi ≤ p*) ≈ Φ(p̂*)^−1 ∫−∞^c φ(a) Φ{(p̂* − ρ̂ap a)/√(1 − ρ̂ap²)} da
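The inversion from (β0, β1) back to p* and ρap follows directly from the two definitions above; a sketch in Stata, treating b0 and b1 as hypothetical scalars holding the stored estimates:

* invert beta0 = p*/sqrt(1 - rho^2) and beta1 = -rho/sqrt(1 - rho^2)
scalar rho_ap = -b1/sqrt(1 + b1^2)
scalar pstar  =  b0/sqrt(1 + b1^2)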

Confidence intervals and bands are obtained from confidence intervals for the maximum likelihood estimates of β0 and β1 and the functional invariance property of maximum likelihood.

3 The heckroccurve command

3.1 Syntax

    heckroccurve refvar classvar [if] [in] [weight], select([depvar_s =] varlist_s)
        [collinear table level(#) noci cbands noempirical nograph norefline
        irocopts(cline options) erocopts(cline options) rlopts(cline options)
        cbands(cline options) twoway options vce(vcetype) robust maximize options]

3.2 Options

select([depvar_s =] varlist_s) specifies the selection equation, dependent and independent variables, and whether to have a constant term and offset variable. select() is required.

collinear keeps collinear variables.

table displays the raw data in a 2 × k contingency table.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.8 Specifying the width of confidence intervals.

noci does not display confidence intervals for the inferred AUC.

cbands displays confidence bands for the inferred ROC curve.

noempirical does not include the empirical ROC curve in the plot.

nograph suppresses graphical output.

norefline does not include a reference line in the plot.

irocopts(cline options) affects rendition of the inferred ROC curve; see [G-3] cline options.

erocopts(cline options) affects rendition of the empirical ROC curve; see [G-3] cline options.

rlopts(cline options) affects rendition of the reference line; see [G-3] cline options.

cbands(cline options) affects rendition of the confidence bands.

twoway options are any of the options documented in [G-3] twoway options, excluding by().

vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option.

robust is the synonym for vce(robust).

maximize options: difficult, technique(algorithm spec), iterate(#), tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, from(init specs); see [R] maximize. These options are seldom used.

4 Examples

Example 1: Illustration of syntax

Our first example illustrates the command’s syntax. We begin by loading Mroz’s (1987) well-known dataset on women’s wages and creating a binary variable:

. use http://fmwww.bc.edu/ec-p/data/wooldridge/mroz
. * Creating a binary variable to demonstrate procedure
. generate high_wage = 0 if inlf
(325 missing values generated)
. replace high_wage = 1 if wage > 2.37 & inlf
(311 real changes made)

If we want to see how years of education, educ, predicts "high wage", we can type the syntax that follows. Note that inlf is an indicator variable for whether a woman is in the labor force. For women not in the labor force, their wage is not observed.

. heckroccurve high_wage educ, select(inlf = educ kidslt6 kidsge6 nwifeinc)
Estimating inferred ROC curve...

         Empirical    Inferred          Inferred AUC
          ROC area    ROC area     95% Conf. Interval

            0.6472      0.6606      0.5782    0.7310

A more common use for ROC curves is constructing them after estimating a probit or logit. Here we provide an example of calling heckroccurve after a logit.

. quietly logit high_wage educ age exper if inlf
. predict predicted_xb, xb
. heckroccurve high_wage predicted_xb,
>     select(inlf = predicted_xb educ kidslt6 kidsge6 nwifeinc)
Estimating inferred ROC curve...

         Empirical    Inferred          Inferred AUC
          ROC area    ROC area     95% Conf. Interval

            0.7211      0.7329      0.6377    0.8044

Note that we used the fitted values option xb rather than predicted probabilities, because Cook's (2017) assumption that the classifier's output is normally distributed is more likely to hold for the fitted values. The following syntax illustrates the command's plot options:

. heckroccurve high_wage predicted_xb,
>     select(inlf = predicted_xb educ kidslt6 kidsge6 nwifeinc)
>     noempirical cbands irocopts(lcolor(black) lwidth(medthick))
>     rlopts(lcolor(gray))
Estimating inferred ROC curve...

         Empirical    Inferred          Inferred AUC
          ROC area    ROC area     95% Conf. Interval

            0.7211      0.7329      0.6377    0.8044

[Figure 1. Plot created using heckroccurve's plot options. Inferred ROC curve with 95% confidence bands and reference line; sensitivity is plotted against 1 − specificity.]

Example 2: Correcting bias in ROC curves

This example uses a dataset that contains outcomes, two ratings, a selection indicator, and an independent variable x. We first compare the performance of the two ratings with the full dataset, and then we remove outcomes for the nonselected data:

. sysuse heckroccurve_example, clear
. * Compare ratings using full dataset
. roccomp outcome rating_a rating_b
                              ROC                    Asymptotic Normal
                   Obs       Area     Std. Err.    [95% Conf. Interval]

   rating_a      1,000     0.8815       0.0103      0.86120    0.90171
   rating_b      1,000     0.7557       0.0151      0.72616    0.78515

Ho: area(rating_a) = area(rating_b)
    chi2(1) = 46.92     Prob>chi2 = 0.0000

While the outcome is observed for all 1,000 observations, this dataset contains a variable selected that is equal to 1 for half the observations and 0 for the rest. We set the outcome to missing when selected equals 0 to see the effect of selection on the classifiers' ROC curves:

. * Remove nonselected outcomes and compare ratings again
. replace outcome=. if !selected
(500 real changes made, 500 to missing)
. roccomp outcome rating_a rating_b
                              ROC                    Asymptotic Normal
                   Obs       Area     Std. Err.    [95% Conf. Interval]

   rating_a        500     0.7493       0.0238      0.70273    0.79593
   rating_b        500     0.7767       0.0256      0.72650    0.82685

Ho: area(rating_a) = area(rating_b)
    chi2(1) = 0.63     Prob>chi2 = 0.4262

Rating A performs better than rating B with the full dataset, but with the selected sample, performance is similar. Notice that the effect of selection differs for the two ratings. The AUC for rating A has decreased from 0.8815 to 0.7493, while the AUC for rating B is much less affected by selection. Calling heckroccurve allows us to recover the AUCs that are obtained with the full sample. Figure 2 provides the graphical output from the syntax below:

. heckroccurve outcome rating_a, select(x rating_a rating_b) cbands
Estimating inferred ROC curve...

         Empirical    Inferred          Inferred AUC
          ROC area    ROC area     95% Conf. Interval

            0.7493      0.8804      0.8253    0.9149

. heckroccurve outcome rating_b, select(x rating_a rating_b) cbands
Estimating inferred ROC curve...

         Empirical    Inferred          Inferred AUC
          ROC area    ROC area     95% Conf. Interval

            0.7767      0.7781      0.7283    0.8192

[Figure 2. Plots from heckroccurve show an example of the bias that can result from constructing ROC curves with a selected sample. Panel (a) Rating A and panel (b) Rating B: inferred ROC curves with 95% confidence bands, sensitivity plotted against 1 − specificity. Selection caused the ROC curve for rating A to cave in but left the ROC curve for rating B largely unaffected.]

The inferred AUC for rating A is 0.8804. This is quite close to the value of 0.8815 that we obtained with the full dataset. The confidence interval for the inferred AUC does not contain the AUC that is obtained when only the selected sample is used (that is, 0.7493).

5 Discussion and conclusion

Hand and Adams (2014) suggest an alternative approach for comparing ROC curves that are constructed with selected samples. Realizing that truncation of a rating leads to biased ROC curves, Hand and Adams explore the effects of reducing the data so that both ratings are truncated to a similar extent. The goal of truncating both ratings is to create ROC curves that are biased to a similar extent for both ratings and thus better facilitate a comparison of the ratings. While an advantage of this procedure is that it does not make any parametric assumptions, it does not remove the bias induced by selection. The procedure performed by heckroccurve provides a consistent estimate when the assumptions are met.

heckroccurve implements Cook’s (2017) procedure for plotting ROC curves with selected samples and provides the AUC along with a confidence interval. The com- mand’s maximum likelihood estimation is performed by calling heckprobit.Thereare situations for which heckprobit will fail to converge. Changing the specification for the selection equation may allow for convergence. The inferred ROC curve is based on parametric assumptions (just as heckman and heckprobit’s estimations are based on parametric assumptions). Cook (2017) provides an example with wine data for which distributional assumptions are not met, yet the procedure recovers the AUC that is ob- tained with the full sample. While this one example is encouraging, the performance of the procedure when its distributional assumptions do not hold has not been thoroughly explored. J. A. Cook and A. Rajbhandari 183 6 References

Cook, J. A. 2017. ROC curves and nonrandom data. Pattern Recognition Letters 85: 35–41.

Fawcett, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27: 861–874.

Hand, D. J., and N. M. Adams. 2014. Selection bias in credit scorecard evaluation. Journal of the Operational Research Society 65: 408–415.

Heckman, J. J. 1976. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5: 475–492.

Kraft, H., G. Kroisandt, and M. Müller. 2014. Redesigning ratings: Assessing the discriminatory power of credit scores under censoring. Journal of Credit Risk 10: 71–94.

Mroz, T. A. 1987. The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica 55: 765–799.

Van de Ven, W. P. M. M., and B. M. S. Van Praag. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229–252.

About the authors

Jonathan A. Cook is a financial economist at the PCAOB. The PCAOB, as a matter of policy, disclaims responsibility for any private publication or statement by any of its Economic Research Fellows and employees. The views expressed in this article are the views of the author and do not necessarily reflect the views of the Board, individual Board members, or staff of the PCAOB.

Ashish Rajbhandari is a quantitative investment analyst at The Vanguard Group. His research interests are applied econometrics and macroeconomics. The views expressed in this article are those of the author and do not necessarily reflect those of The Vanguard Group or its staff.

The Stata Journal (2018) 18, Number 1, pp. 184–196

Panel unit-root tests for heteroskedastic panels

Helmut Herwartz, University of Goettingen, Goettingen, Germany
Simone Maxand, University of Goettingen, Goettingen, Germany
Fabian H. C. Raters, University of Goettingen, Goettingen, Germany
[email protected]

Yabibal M. Walle, University of Goettingen, Goettingen, Germany

Abstract. In this article, we describe the command xtpurt, which implements the heteroskedasticity-robust panel unit-root tests suggested in Herwartz and Siedenburg (2008, Computational Statistics and Data Analysis 53: 137–150), Demetrescu and Hanck (2012a, Economics Letters 117: 10–13), and, recently, Herwartz, Maxand, and Walle (2017, Center for European, Governance and Economic Development Research Discussion Papers 314). While the former two tests are robust to time-varying volatility when the data contain only an intercept, the latter test is unique because it is asymptotically pivotal for trending heteroskedastic panels. Moreover, xtpurt incorporates lag-order selection, prewhitening, and detrending procedures to account for serial correlation and trending data.

Keywords: st0519, xtpurt, xtunitroot, panel unit-root tests, nonstationary volatility, cross-sectional dependence, inflation

1 Introduction

In the last two decades, there has been a surge of interest in using panel unit-root tests (PURTs) in applied macroeconometric research. One obvious reason for this increasing attention is that PURTs—by using the cross-sectional dimension—are more powerful than their univariate counterparts. However, the validity of most PURTs, including those in Levin, Lin, and Chu (2002) and Breitung and Das (2005), relies on the assumption of homoskedasticity. Hence, the tests could provide misleading inference under heteroskedasticity (Herwartz, Siedenburg, and Walle 2016). This presents a significant limitation to the applicability of PURTs because time-varying volatility is a typical characteristic of most macroeconomic and financial time series. For instance, volatilities of several macroeconomic series had been declining since the 1980s up to the recent global financial crisis, and the phenomenon has come to be known as the "Great Moderation" (Stock and Watson 2002).

Consequently, the recent literature has suggested a few PURTs that remain pivotal under time-varying volatility. By using the Cauchy estimator, where the sign instead of the actual value of the lagged level series enters the test regression, Demetrescu and Hanck (2012a,b) suggest PURTs that are robust to variance breaks. Herwartz, Siedenburg, and Walle (2016) prove that the Cauchy instrumenting is not necessary

for the test in Herwartz and Siedenburg (2008), which is a non-Cauchy version of the test in Demetrescu and Hanck (2012a), to remain pivotal under heteroskedasticity. The heteroskedasticity-robust PURT of Westerlund (2014) uses the information embodied in group-specific variances. While these tests are heteroskedasticity robust in cases where the data feature an intercept term, they are not robust to volatility shifts if the series contain a linear trend. To our knowledge, the test in Herwartz, Maxand, and Walle (2017) is unique in avoiding complicated resampling techniques used to define PURTs in trending heteroskedastic panels. This is an important property because several macroeconomic time series, including gross domestic product, money supply, and commodity prices, contain a linear trend (Westerlund 2015). Here we present the new command xtpurt, which implements the heteroskedasticity-robust PURTs suggested in Herwartz and Siedenburg (2008), Demetrescu and Hanck (2012a), and Herwartz, Maxand, and Walle (2017).

2 Heteroskedasticity-robust PURTs

2.1 The first-order panel autoregression

A first-order panel autoregression with and without a linear trend can be specified as (Pesaran 2007)

yt = (1 − ρ)μ + ρyt−1 + et     (1)

yt = μ + (1 − ρ)δt + ρyt−1 + et,   t = 1, ..., T     (2)

where y_t = (y_{1t}, ..., y_{Nt})', y_{t-1} = (y_{1,t-1}, ..., y_{N,t-1})', and e_t = (e_{1t}, ..., e_{Nt})' are N × 1 vectors, and e_t is heterogeneously distributed with mean zero and covariance Ω_t. Panel-specific intercepts and trends are stacked in the vectors μ = (μ_1, ..., μ_N)' and δ = (δ_1, ..., δ_N)', respectively. Specification (1) has been used to formalize a panel unit-root null hypothesis (H0: ρ = 1) of a driftless random walk against an alternative hypothesis (H1: ρ < 1) of a stationary process with individual specific intercept terms. Similarly, the panel unit-root null hypothesis from (2) represents the case of distinguishing between a random walk with drift against a trend stationary process. Further assumptions on Ω_t are necessary to ensure asymptotic normality of all considered test statistics. We allow for weak forms of cross-sectional dependence, that is, a positive definite matrix Ω_t with bounded eigenvalues. In addition, the error terms u_t = Ω_t^{-1/2} e_t are assumed to have finite moments up to order eight.1

1. Note that the assumption of finite eighth-order moments is required only for the test in Herwartz, Maxand, and Walle (2017). This assumption enables us to derive asymptotic properties of the test statistic from the theory of near-epoch dependent sequences. For convergence of the other tests, assuming finiteness of moments up to order four is sufficient.

2.2 Heteroskedasticity-robust tests

The White-type test

Using a White-type covariance estimator, Herwartz and Siedenburg (2008) propose a PURT that is robust to cross-sectional dependence. Setting μ = 0 in (1) and μ = δ = 0 in (2), we obtain the test statistic

$$ t_{HS} = \frac{\sum_{t=1}^{T} y_{t-1}' \Delta y_t}{\sqrt{\sum_{t=1}^{T} y_{t-1}' \hat{e}_t \hat{e}_t' y_{t-1}}} \;\xrightarrow{d}\; N(0,1), \qquad \hat{e}_t = \Delta y_t \qquad (3) $$

Herwartz, Siedenburg, and Walle (2016) show that t_HS is also robust to heteroskedasticity.

The White-type Cauchy test

The heteroskedasticity-robust test in Demetrescu and Hanck (2012a) uses the "Cauchy" estimator, which applies the sign function for the lagged level series as

$$ t_{DH} = \frac{\sum_{t=1}^{T} \operatorname{sgn}(y_{t-1})' \Delta y_t}{\sqrt{\sum_{t=1}^{T} \operatorname{sgn}(y_{t-1})' \hat{e}_t \hat{e}_t' \operatorname{sgn}(y_{t-1})}} \;\xrightarrow{d}\; N(0,1) \qquad (4) $$

where sgn(·) denotes the sign function.

Both t_HS and t_DH can be applied to the model in (1) even if the error term is heteroskedastic. A nonzero intercept under the alternative hypothesis (that is, μ ≠ 0) can be removed by subtracting the first observation from the level series (Breitung 2000). Applying (3) and (4) to trending data generated from the model in (2) requires that the error terms be homoskedastic.

The heteroskedasticity-robust test for trending panels

Noting that existing heteroskedasticity-robust tests do not remain pivotal in cases where the data feature panel-specific linear trends, that is, when μ ≠ 0 under H0 and when δ ≠ 0 under H1 in (2), Herwartz, Maxand, and Walle (2017) propose a new test that allows trending panels to be heteroskedastic. Their test is a modified version of t_HS that is adjusted for biases arising from heteroskedastic detrended data. More precisely, the test statistic in Herwartz, Maxand, and Walle (2017) involves detrending the lagged level as in Chang (2002),

$$ \tilde{y}_{t-1} = y_{t-1} + \frac{2}{t-1}\sum_{j=1}^{t-1} y_j - \frac{6}{t(t-1)}\sum_{j=1}^{t-1} j\, y_j $$

with centering the first-differences as

$$ \Delta y_t^{*} = \Delta y_t - \frac{1}{T-1}\sum_{t=2}^{T} \Delta y_t $$

Then, the numerator of t_HS can be written as

$$ \tilde{y}_{t-1}' \Delta y_t^{*} = \sum_{i=1}^{t-1} a_{i,t-1}\, e_i' e_t - \frac{1}{T}\sum_{i=1}^{t-1}\sum_{k=2}^{T} a_{i,t-1}\, e_i' e_k \qquad (5) $$

where the weights a_{i,t-1} are given by

$$ a_{i,t-1} = 1 + \frac{2}{t-1}(t-i) - 3\left\{1 - \frac{(i-1)i}{(t-1)t}\right\} \qquad (6) $$

Herwartz, Maxand, and Walle (2017) note that the expression in (5) has a nonzero expectation if there is heteroskedasticity under the panel unit-root null hypothesis. The theoretical counterpart of their proposed test is given by

$$ \tau = \frac{\sum_{t=2}^{T} \tilde{y}_{t-1}' \Delta y_t^{*} - \sum_{t=2}^{T} \nu_t}{\sqrt{E\left\{\left(\sum_{t=2}^{T} \tilde{y}_{t-1}' \Delta y_t^{*}\right)^{2}\right\} - \left(\sum_{t=2}^{T} \nu_t\right)^{2}}} \qquad (7) $$

where ν_t = E(ỹ_{t-1}' Δy_t^*). The variance of the numerator in (7) reads as

$$ s^{2} := E\left\{\left(\sum_{t=2}^{T}\tilde{y}_{t-1}'\Delta y_t^{*} - \sum_{t=2}^{T}\nu_t\right)^{2}\right\} = \zeta_1 - \zeta_2 + \zeta_3 + \zeta_4 + \zeta_5 - \left(\sum_{t=2}^{T}\nu_t\right)^{2} $$

where the components ζ_1, ..., ζ_5 correspond to the following quantities,

$$ \zeta_1 = 2\sum_{i=1}^{T-1}\sum_{j=i+1}^{T-1}\sum_{s=i+1}^{T}\sum_{t=j+1}^{T} a_{i,s-1}\, a_{j,t-1}\,\{\operatorname{tr}(\Omega_i\Omega_j) + \operatorname{tr}(\Omega_i)\operatorname{tr}(\Omega_j)\} $$

$$ \zeta_2 = 2\sum_{i=1}^{T-1}\sum_{s=i+1}^{T}\sum_{t=i+1}^{T} \tilde{a}_{i,s-1}\, a_{i,t-1}\,\operatorname{tr}(\Omega_i\Omega_s) $$

$$ \zeta_3 = \sum_{i=1}^{T-1}\sum_{t=i+1}^{T} \tilde{a}_{i,t-1}^{2}\,\operatorname{tr}(\Omega_i\Omega_t) $$

$$ \zeta_4 = \sum_{i=1}^{T-1}\sum_{t=i+1}^{T} \bar{a}_{i,t-1}^{2}\left[E\left\{(e_i'e_i)^{2}\right\} + \sum_{j=1,\, j\neq i,t}^{T}\operatorname{tr}(\Omega_i\Omega_j)\right] $$

$$ \zeta_5 = 2\sum_{i=1}^{T-2}\sum_{s=i+1}^{T-1}\sum_{t=s+1}^{T} \bar{a}_{i,t-1}\,\bar{a}_{i,s-1}\left[E\left\{(e_i'e_i)^{2}\right\} + \sum_{j=1,\, j\neq i,t,s}^{T}\operatorname{tr}(\Omega_i\Omega_j)\right] $$

and \tilde{a}_{i,t-1} = \{1 - (1/T)\} a_{i,t-1} and \bar{a}_{i,t-1} = (1/T) a_{i,t-1} with coefficients a_{i,t-1} defined in (6). Similarly,

$$ \nu_t = E\left(\tilde{y}_{t-1}' \Delta y_t^{*}\right) = -\sum_{i=1}^{t-1} \bar{a}_{i,t-1}\operatorname{tr}(\Omega_i) $$

The empirical test is then the version of τ in (7) using estimated components; that is,

$$ t_{HMW} = \hat{\tau} = \frac{\sum_{t=2}^{T}\left(\tilde{y}_{t-1}'\Delta y_t^{*} - \hat{\nu}_t\right)}{\hat{s}} \qquad (8) $$

where estimators of ν_t and the variance components are based on the estimation of the traces of the covariance matrices Ω_i. More precisely, denoting the vector of centered first-differences by ê_i = Δy_i^*, we estimate the components tr(Ω_i), tr(Ω_iΩ_j), and E{(e_i'e_i)^2} by ê_i'ê_i, ê_i'ê_j ê_i'ê_j, and (ê_i'ê_i)^2, respectively. Accordingly, ν̂_t is given by

$$ \hat{\nu}_t = -\sum_{i=1}^{t-1} \bar{a}_{i,t-1}\,\hat{e}_i'\hat{e}_i $$

and the variance is estimated as

$$ \hat{s}^{2} = \hat{\zeta}_1 - \hat{\zeta}_2 + \hat{\zeta}_3 + \hat{\zeta}_4 + \hat{\zeta}_5 $$

where

$$ \hat{\zeta}_1 = 2\sum_{i=1}^{T-1}\sum_{j=i+1}^{T-1}\sum_{s=i+1}^{T}\sum_{t=j+1}^{T} a_{i,s-1}\, a_{j,t-1}\,(\hat{e}_i'\hat{e}_j)^{2} $$

$$ \hat{\zeta}_2 = 2\sum_{i=1}^{T-1}\sum_{s=i+1}^{T}\sum_{t=i+1}^{T} \tilde{a}_{i,s-1}\, a_{i,t-1}\,(\hat{e}_i'\hat{e}_s)^{2} $$

$$ \hat{\zeta}_3 = \sum_{i=1}^{T-1}\sum_{t=i+1}^{T} \tilde{a}_{i,t-1}^{2}\,(\hat{e}_i'\hat{e}_t)^{2} $$

$$ \hat{\zeta}_4 = \sum_{i=1}^{T-1}\sum_{t=i+1}^{T} \bar{a}_{i,t-1}^{2}\left\{\sum_{j=1,\, j\neq i,t}^{T}(\hat{e}_i'\hat{e}_j)^{2}\right\} $$

$$ \hat{\zeta}_5 = 2\sum_{i=1}^{T-2}\sum_{s=i+1}^{T-1}\sum_{t=s+1}^{T} \bar{a}_{i,t-1}\,\bar{a}_{i,s-1}\left\{\sum_{j=1,\, j\neq i,t,s}^{T}(\hat{e}_i'\hat{e}_j)^{2}\right\} $$

Herwartz, Maxand, and Walle (2017) show that under the panel unit-root null hypothesis and the condition of finite moments up to order eight (stated in section 2.1), asymptotic normality of the test statistic in (8) follows by application of a central limit theorem for near-epoch dependent sequences.

3 The xtpurt command

When implementing the PURTs for heteroskedastic panels, we pay special attention to conforming to the built-in Stata command xtunitroot. Our command xtpurt is called on panel data in long format; that is, the dataset contains at least a time variable, a panel variable, and the data variable to be assessed.

xtpurt varname [if] [in] [, test(testname) trend noconstant lagsel(method) maxlags(#) lagsexport(newvar) lags(lagorders)]

xtpurt requires a balanced panel dataset in long format declared explicitly to be panel data—see help xtset. Here varname specifies the variable name in the working space to be assessed for a unit root. Optionally, if and in restrict the observations under investigation. However, missing observations are not allowed. Strictly speaking, the first-order differences of the declared time variable (the time variable "delta") must be constant.
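For instance, with the panel and time variables country and date (the names used in the application in section 3.4), the required declaration could be a simple

. xtset country date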

3.1 Options

Here we describe the options in detail. Regarding the specification of the deterministic components, table 1 helps to clarify the lag-order selection and the data handling in terms of the prewhitening and detrending procedures.

test(testname) specifies the testing method to be applied. The three tests described in section 1 are supported: Herwartz and Siedenburg (2008), Demetrescu and Hanck (2012a), and Herwartz, Maxand, and Walle (2017). All approaches can be called jointly, returning their results in a single output table. However, the test by Herwartz, Maxand, and Walle (2017) requires that the option trend (see below) be set. Accordingly, testname can be hs, dh, hmw, or all. The default is test(hs); test(hmw) if trend is specified.

trend adds a deterministic trend to the regression model. This option affects the lag-order selection, prewhitening, and detrending procedures. Table 1 provides an overview.

noconstant removes the constant in the regression model. This option cannot be combined with trend and might be rarely used in practice. It affects the lag-order selection, prewhitening, and detrending procedures. Table 1 provides an overview.

lagsel(method) sets the selection method for determining the lag order. Internally, the command uses the built-in Stata command varsoc. The user can choose between the Akaike, Schwarz–Bayesian, or Hannan–Quinn information criterion. Accordingly, method can be aic, bic, or hqic. The default is lagsel(bic).

maxlags(#) specifies the maximum lag order when determining the lag orders per panel for the prewhitening procedure by means of a selection criterion—see the option lagsel(method). See the option lags(lagorders) for setting fixed lag orders. The default is maxlags(4).

lagsexport(newvar) generates a new integer variable, newvar, in the workspace containing the lag orders per panel, which were used in the prewhitening procedure. This variable might then be used for further analysis; it can be edited or set as input in the option lags(lagorders). By default, lagsexport() is not set, and no variable will be created.

lags(lagorders) allows the specification of fixed lag orders for the prewhitening procedure. Either lagorders is a scalar integer that can be used to set one common fixed lag length for all panels, or the user can refer to a variable containing fixed lag orders per panel. This variable's format is the same as the output's format of option lagsexport(newvar). Hence, any user-preferred lag-order selection method might be applied beforehand, or prewhitening can be disabled by setting lagorders to 0 (see the example following table 1). By default, lags() is not set, and xtpurt applies a lag-order selection criterion per panel—see the options lagsel(method) and maxlags(#).

Table 1. Procedure specification details by choice of deterministics. The command xtpurt proceeds in three steps: 1) determine the lag orders, 2) apply the prewhitening procedure, and 3) detrend the data. For each choice of deterministics—none (noconstant), constant (the default), and constant and trend—the table lists the lag-selection regression, the prewhitening regression, and the detrending scheme that are applied. While hs and dh are appropriate only for the case with a constant, hmw requires the trend specification and remains valid under heteroskedasticity in the data.
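As a brief illustration of how these options interact (a hypothetical sketch; y stands for any variable in an xtset panel, and the variable name mylags is arbitrary), one could first let xtpurt select the lag orders by the Akaike criterion and export them, and then reuse the exported variable as fixed lag orders in a second call:

. xtpurt y, test(hmw) trend lagsel(aic) maxlags(8) lagsexport(mylags)
. xtpurt y, test(hmw) trend lags(mylags)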

3.2 Remarks

It might be important to know that the sample will be rebalanced after prewhitening the panels with their preset or determined lag orders. While Number of periods indicates the number of observations per panel used for prewhitening, the specification After rebalancing refers to the number of recent observations that are actually used for (detrending and) testing. The display Prewhitening specifies whether the lag order used for prewhitening was detected automatically (AIC, BIC, or HQIC; see lagsel(method) and maxlags(#)) or was set manually by the user via lags(lagorders) (FIX). Lag orders indicates the smallest (min) and largest (max) lag orders used per panel for the prewhitening procedure. These values are determined as the minimum and maximum of the stored lag vector r(lags), which is explained in the following section.

3.3 Stored results

xtpurt stores the following in r():

Scalars
  r(N)          number of observations
  r(N_g)        number of panels
  r(N_t)        number of time periods
  r(N_r)        number of time periods after rebalancing
  r(t_hs)       Herwartz and Siedenburg (HS) test statistic
  r(t_hs_p)     p-value for HS test statistic
  r(t_dh)       Demetrescu and Hanck (DH) test statistic
  r(t_dh_p)     p-value for DH test statistic
  r(t_hmw)      Herwartz, Maxand, and Walle (HMW) test statistic
  r(t_hmw_p)    p-value for HMW test statistic

Macros
  r(test)       hs, dh, hmw, or all
  r(determ)     noconstant, constant, or trend
  r(lagsel)     aic, bic, hqic, or fix
  r(warnings)   caution message as string, if warning occurred

Matrices
  r(lags)       vector containing the lag orders per panel used for the prewhitening procedure
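These results can be inspected like those of any r-class command; a minimal sketch, anticipating the inflation example of section 3.4, is

. xtpurt dlprices, test(all) trend maxlags(9)
. return list
. display "HMW p-value: " r(t_hmw_p)
. matrix list r(lags)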

3.4 Empirical application

As an empirical illustration, we examine the stationarity of the log consumer price index (base year 2010, lprices) and its first-order differences (that is, inflation, dlprices) of 21 countries2 using quarterly Organisation for Economic Co-operation and Development (OECD) data from 1980/Q1 until 2010/Q4 (prices). Whether inflation is stationary or contains a unit root has important implications for macroeconomic theory. For instance, while a stationary inflation is consistent with the rational expectations hypothesis, an I(1) inflation is in line with an accelerationist hypothesis (Lee and Wu 2001). Moreover, it has an implication for the so-called Fisher hypothesis, which postulates a stable relationship between nominal interest rates and the expected rate of inflation. Given that nominal interest rates often contain a unit root, the validity of the Fisher hypothesis rests on the assumption that inflation is nonstationary and cointegrated with nominal interest rates. Further discussions on the theoretical implications and concise reviews of the empirical evidence can be found in Culver and Papell (1997), Lee and Wu (2001), and Cook (2009), among others.

2. The countries under investigation are Australia, Austria, Belgium, Canada, Denmark, Finland, France, Germany, Italy, Japan, Korea, Luxembourg, Netherlands, New Zealand, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom, and the United States.

In the following, we first apply the three heteroskedasticity-robust tests tHS, tDH, and tHMW to the log-prices (lprices) series, assuming that the panels are trending (trend).

. use xtpurtdata

. xtset
       panel variable:  country (strongly balanced)
        time variable:  date, 1980q1 to 2010q4
                delta:  1 quarter

. generate lprices = log(prices)

. generate dlprices = D.lprices
(21 missing values generated)

. xtpurt lprices, test(all) trend maxlags(9)

All methods unit-root test for lprices

Ho: Panels contain unit roots          Number of panels  =  21
Ha: Panels are stationary              Number of periods = 124
                                       After rebalancing = 114
Constant:    Included                  Prewhitening:  BIC
Time trend:  Included                  Lag orders:  min=2 max=9

        Name     Statistic     p-value

        t_hs        0.4208      0.6631
        t_dh        0.8313      0.7971
        t_hmw      -0.0153      0.4939

Caution: t_hs and t_dh are not robust to heteroskedasticity with trend

All three tests suggest that a panel unit root in consumer prices cannot be rejected at conventional levels. This result is not unexpected, because the debate in the literature has been whether these prices could be described as an I(1) or I(2) process.

Consequently, we subject the log-differences dlprices to panel unit-root testing, and results are reported below.

. xtpurt dlprices, test(all) maxlags(9)

All methods unit-root test for dlprices

Ho: Panels contain unit roots          Number of panels  =  21
Ha: Panels are stationary              Number of periods = 123
                                       After rebalancing = 114
Constant:    Included                  Prewhitening:  BIC
Time trend:  Not included              Lag orders:  min=1 max=8

        Name     Statistic     p-value

        t_hs       -0.6620      0.2540
        t_dh        0.1493      0.5593

. xtpurt dlprices, test(all) trend maxlags(9)

All methods unit-root test for dlprices

Ho: Panels contain unit roots          Number of panels  =  21
Ha: Panels are stationary              Number of periods = 123
                                       After rebalancing = 114
Constant:    Included                  Prewhitening:  BIC
Time trend:  Included                  Lag orders:  min=1 max=8

        Name     Statistic     p-value

        t_hs       -1.4105      0.0792
        t_dh       -0.2907      0.3856
        t_hmw      -1.8139      0.0348

Caution: t_hs and t_dh are not robust to heteroskedasticity with trend

Stationarity of the inflation series dlprices is examined both with and without a linear trend. Because tHMW requires the trend assumption, this test is applied only in the second case. Results for tHS and tDH, however, depend on whether the series are centered by subtracting the first observation (only intercept case) or they are detrended. In both cases, results for tHS and tDH imply that inflation has a unit root, while tHMW suggests rejecting this hypothesis at the 5% significance level. Because it is highly likely that the series have nonzero linear trends and are heteroskedastic, tHMW is more reliable. Hence, we conclude that inflation is (trend) stationary for the considered OECD economies, at least during the period under study.

4 Summary

In this article, we proposed a new command, xtpurt, for implementing the three heteroskedasticity-robust PURTs suggested in Herwartz and Siedenburg (2008), Demetrescu and Hanck (2012a), and Herwartz, Maxand, and Walle (2017). Given that previously available PURTs are not robust to heteroskedasticity in trending panels and that heteroskedasticity is a common feature of most macroeconomic data, these tests should be valuable to applied researchers. Although all tests are heteroskedasticity robust when the data contain an intercept but not a linear trend, the test in Herwartz, Maxand, and Walle (2017) remains pivotal even with trending heteroskedastic panels. The command xtpurt is flexible and incorporates a prewhitening procedure to account for serial correlation.

5 References

Breitung, J. 2000. The local power of some unit root tests for panel data. In Advances in Econometrics: Vol. 15—Nonstationary Panels, Panel Cointegration, and Dynamic Panels, ed. B. H. Baltagi, 161–178. New York: Elsevier.

Breitung, J., and S. Das. 2005. Panel unit root tests under cross-sectional dependence. Statistica Neerlandica 59: 414–433.

Chang, Y. 2002. Nonlinear IV unit root tests in panels with cross-sectional dependency. Journal of Econometrics 110: 261–292.

Cook, S. 2009. A re-examination of the stationarity of inflation. Journal of Applied Econometrics 24: 1047–1053.

Culver, S. E., and D. H. Papell. 1997. Is there a unit root in the inflation rate? Evidence from sequential break and panel data models. Journal of Applied Econometrics 12: 435–444.

Demetrescu, M., and C. Hanck. 2012a. A simple nonstationary-volatility robust panel unit root test. Economics Letters 117: 10–13.

. 2012b. Unit root testing in heteroscedastic panels using the Cauchy estimator. Journal of Business and Economic Statistics 30: 256–264.

Herwartz, H., S. Maxand, and Y. Walle. 2017. Heteroskedasticity-robust unit root test- ing for trending panels. Center for European, Governance and Economic Development Research Discussion Papers 314, University of Goettingen, Department of Economics. http://econpapers.repec.org/paper/zbwcegedp/314.htm.

Herwartz, H., and F. Siedenburg. 2008. Homogenous panel unit root tests under cross sectional dependence: Finite sample modifications and the wild bootstrap. Computational Statistics and Data Analysis 53: 137–150.

Herwartz, H., F. Siedenburg, and Y. M. Walle. 2016. Heteroskedasticity robust panel unit root testing under variance breaks in pooled regressions. Econometric Reviews 35: 727–750.

Lee, H.-Y., and J.-L. Wu. 2001. Mean reversion of inflation rates: Evidence from 13 OECD countries. Journal of Macroeconomics 23: 477–487.

Levin, A., C.-F. Lin, and C.-S. J. Chu. 2002. Unit root tests in panel data: Asymptotic and finite-sample properties. Journal of Econometrics 108: 1–24.

Pesaran, M. H. 2007. A simple panel unit root test in the presence of cross-section dependence. Journal of Applied Econometrics 22: 265–312.

Stock, J. H., and M. W. Watson. 2002. Has the business cycle changed and why? NBER Macroeconomics Annual 17: 159–218.

Westerlund, J. 2014. Heteroscedasticity robust panel unit root tests. Journal of Business & Economic Statistics 32: 112–135.

. 2015. The effect of recursive detrending on panel unit root tests. Journal of Econometrics 185: 453–467.

About the authors

Helmut Herwartz is a full professor and chairholder for econometrics at the Department of Economics at the University of Goettingen, Germany.

Simone Maxand is a PhD candidate in econometrics at the Department of Economics at the University of Goettingen, Germany.

Fabian H. C. Raters is a PhD candidate in econometrics at the Department of Economics at the University of Goettingen, Germany.

Yabibal M. Walle is a postdoctoral researcher in econometrics at the Department of Economics at the University of Goettingen, Germany.

The Stata Journal (2018) 18, Number 1, pp. 197–205

Some commands to help produce Rich Text Files from Stata

Matthew S. Gillman King’s College London Faculty of Life Sciences & Medicine School of Cancer and Pharmaceutical Sciences London, UK [email protected]

Abstract. Producing Rich Text Format (RTF) files from Stata can be difficult and somewhat cryptic. In this article, I introduce commands to simplify this process; one builds a table row by row, another inserts a PNG image file into an RTF document, and the others start and finish the RTF document.
Keywords: pr0068, initiatertf, insertimagertf, addrowrtf, endrtf, Rich Text Format, RTF, automation, PNG, tables, fonts

1 Introduction

Stata users sometimes wish to generate documents in Rich Text Format (RTF). Such documents are not tied to one particular operating system but instead are platform independent. They can be opened (and edited) with word-processing software such as Microsoft Word and Libre Office.

One paradigm in which RTF documents may be required is if a statistician has an analysis that they repeat many times with varying parameters, data, or both. It would be tedious (not to say error-prone) if, each time the analysis was rerun, the user had to copy the results from Stata and paste them into the document he or she was working on. It would be far more convenient (and reproducible) if the statistician used a do-file that consistently performed the same analysis and wrote the results to the document automatically.

The RTF specification was developed by Microsoft, and the final version (1.9.1) is available for download (Microsoft Corporation 2008). A key resource is the useful book written by Sean M. Burke (2003), which summarizes the main features of the specification.


Writing RTF documents is somewhat cryptic. A Hello world example might be the following (excluding the line numbers on the left-hand side):
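1  {\rtf1\ansi\deff1
2  {\fonttbl
3  {\f0\froman Arial;}
4  {\f1\froman Times New Roman;}
5  }
6  {\pard
7  \f1\fs30 Hello, world!
8  \par}
9  }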

If the preceding content (excluding line numbers) is saved in a text file with the name hello.rtf, it should be viewable in standard word-processing packages (for example, Microsoft Word, Libre Office, etc.). Note that such a package will display the contents in RTF format; if one wishes to view and debug code like that above, one must use an ASCII editor such as Notepad or vi.

Here the first line opens the document and sets the default font to f1 in the (forthcoming) font table. The second line declares the font table. Lines 3 and 4 define the fonts to be used for fonts f0 and f1 (there can be more than this). In this specific example, line 3 is superfluous because font f0 is not used in the document. The closing brace on line 5 ends the font table. The {\pard command on line 6 opens a new paragraph. Line 7 contains the text for this paragraph, using font f1 from the table (f0 could be used instead, if desired) and size 15 pt. (Note that RTF requires double this number to be specified as the fs [font size] value, in this case 15*2=30.) The \par} on line 8 closes the paragraph. Finally, the closing brace on line 9 completes the RTF document.

To insert a table, things get more complicated, especially if one wishes to have visible borders around the table. I wrote an ado-file to simplify this process (see section 2).

One can also add PNG images into an RTF document; I wrote a command to facilitate this (see section 3). Hopefully, these commands will add to those already produced by Roger Newson (2009), author of rtfutil, and others.

Of course, we assume that the RTF document is produced by Stata. Hence, the lines above might be placed in a do-file as file write statements:

// This is helloworldtest.do
global RTFout = "hello.rtf"
file open handle1 using "$RTFout", write replace
file write handle1 "{\rtf1\ansi\deff1" _n
file write handle1 "{\fonttbl" _n
file write handle1 "{\f0\froman Arial;}" _n
file write handle1 "{\f1\froman Times New Roman;}" _n
file write handle1 "}" _n
file write handle1 "{\pard" _n
file write handle1 "\f1\fs30 Hello, world!" _n
file write handle1 "\par}" _n
file write handle1 "}"
file close handle1

Newlines (_n) are not strictly necessary, but they aid in the debugging of RTF file syntax.

It is tedious to set up the font table each time one wishes to create an RTF document. Accordingly, I wrote the initiatertf command and wrote the endrtf command to finalize the document. When we use these helper commands, the above becomes

global RTFout = "hello.rtf"
file open handle1 using "$RTFout", write replace
initiatertf handle1

file write handle1 "{\pard" _n
file write handle1 "\f1\fs30 Hello, world!" _n
file write handle1 "\par}" _n

endrtf handle1
file close handle1

initiatertf has options to specify the fonts used in the document's font table, to set the size of the margins (in mm), and to set the width of the entire page (again in mm).

2 Adding rows to a table

To write a simple 2 × 2 table (without borders) to an RTF file, one might include

// Row 1
file write handle1 "{" _n "\trowd \trgaph180" _n "\cellx1440\cellx2880" _n
file write handle1 "\pard\intbl {Add.}\cell" _n "\pard\intbl {some.}\cell"
file write handle1 "\row" _n "}"

// Row 2
file write handle1 "{" _n "\trowd \trgaph180" _n "\cellx1440\cellx2880" _n
file write handle1 "\pard\intbl {data.}\cell" _n "\pard\intbl {here.}\cell" _n
file write handle1 "\row" _n "}"

Here is the above code fragment in context, using the helper commands already introduced:

// This is twobytwo.do
global RTFout = "myfirsttable.rtf"
file open handle1 using "$RTFout", write replace
initiatertf handle1

// Add a paragraph of text:
file write handle1 "{\pard" _n
file write handle1 "\f1\fs30 Here is a table:\line" _n
file write handle1 "\par}" _n

// Opening brace to enclose the table (for safety)
file write handle1 "{" _n

// Row 1
file write handle1 "{" _n "\trowd \trgaph180" _n "\cellx1440\cellx2880" _n
file write handle1 "\pard\intbl {Add.}\cell" _n "\pard\intbl {some.}\cell"
file write handle1 "\row" _n "}"

// Row 2
file write handle1 "{" _n "\trowd \trgaph180" _n "\cellx1440\cellx2880" _n
file write handle1 "\pard\intbl {data.}\cell" _n "\pard\intbl {here.}\cell" _n
file write hand1 "\row" _n "}"

// Closing brace to enclose the table (for safety)
file write handle1 "}" _n

endrtf handle1 // End of document
file close handle1

Note that RTF does not require a "table object" to be declared but builds up a table one row at a time. The Stata code above would result in content looking like the following:

Here is a table:

Add.    some.
data.   here.

Note that although the Stata newline character (_n) can be used to generate a new line in the RTF file generated by Stata, adding a blank line within the RTF document as it is displayed by a word processor requires use of \line (see bold above). These can be chained together: \line\line\line.

Readers will note the increasingly cryptic nature of the RTF text. In this example based on one in Burke (2003), the internal margins of the cells are 180 twips.1 That is what the \trgaph180 sets up. Starting from the left, the first cell ends at 1,440 twips (1 inch) from the margin and the second at 2,880 twips (2 inches). Each row is declared to begin with \trowd and end with \row. This example follows the safety recommendation in Burke (2003) to wrap the text of each cell, each table row, and indeed the entire table in braces {}.

1. 1 twip = 1/20 of a typographical point = 1/1440 inch or ∼ 1/57 mm.

The situation is even more complicated when one wishes to add borders to a table. However, rather than showing the full details of this, I introduce addrowrtf, which is very simple to use; a table similar to the one above but with borders (the default) would be declared in the do-file as

file write handle1 "{" _n      // Opening brace to enclose table.
addrowrtf handle1 Add. some.   // Row 1
addrowrtf handle1 data. here.  // Row 2
file write handle1 "}" _n      // Closing brace to enclose table.

Notice that here the only parameters passed to addrowrtf are the file handle and the text to be entered in the cells. This example will produce a table looking something like this:

Add.    some.
data.   here.

The addrowrtf command has a number of options that can be used. Here no options were specified, so the default table font size of 10 points was used. Options are described in the help file and include the ability to specify the cell width, whether borders are drawn, the font size, the font number, and relative cell widths. In fact, addrowrtf will dynamically calculate the number of columns based on the number of arguments supplied. There is no requirement for successive rows to have the same number of cells (columns) as the preceding rows. An example might be the following,

// Note the quotes to force "my data" to appear in one cell:
addrowrtf handle1 this is "my data" 26

// Need a space in quotes for a blank cell:
addrowrtf handle1 more data " "
addrowrtf handle1 a b c d e f g h i j k l m n o p q r s t u v w x y z
file write handle1 "}" _n

which produces this:

this    is      my data    26
more    data
a   b   c   d   e

Here the rows are positioned using the default offset and cell widths. Using these, the third row expands beyond the margin of the page. Options such as tablewidth() and internal() may be used to prevent this.

A suitable do-file might look like this:

// File diffrowlengths.do
program drop _all
global outputRTF = "Table with different row lengths.rtf"
file open handle1 using "$outputRTF", write replace
initiatertf handle1 , fonts ("Calibri, Edwardian Script ITC, Arial")

addrowrtf handle1 this is "my data" 26
addrowrtf handle1 more data " "
addrowrtf handle1 a b c d e f g h i j k l m n o p q r s t u v w x y z, ///
    tablewidth(100) internal(0.5)
file write handle1 "}" _n

endrtf handle1
file close _all

The appearance is thus changed:

this    is      my data    26
more    data
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z

For full details of addrowrtf’s options, refer to the help file.

3 Inserting a PNG image

One can dynamically create a graph in Stata, save it to a PNG file, and then insert the latter into an RTF document file. It is tricky to do this because Stata must open the PNG file, read a single byte at a time, and convert each byte into hexadecimal format before writing it to the RTF file. Readers will be pleased to know I wrote the insertimagertf command for just such an eventuality. Like addrowrtf, it uses a file handle. Nichols (2010) has written png2rtf, which is similar, although it requires a filename to be provided and has more options; insertimagertf is a simpler tool.
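A typical call (a sketch, assuming the graph has already been exported to a file named myplot.png) is simply

insertimagertf handle1 "myplot.png"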

Note that the only two arguments supplied are the RTF file handle (as before) and the name of the PNG file written by Stata previously. A simple design philosophy was adopted because a user can produce the final formatted version in the word processor to his or her own preference. This command has a sole option, leftoffset(), that specifies (in mm) the distance of the left-hand side of the image from the left-hand page margin. Negative values are allowed to position it to the left of this margin. This command builds upon rtfutil's rtflink program, except that rtflink inserts a link to an image, and insertimagertf inserts a copy of the actual image. Full details are given in the insertimagertf help file. If the required image file is missing, this command will gracefully fail without corrupting the output RTF document.

4 Example: Putting it all together

Here is a complete example that brings together the thoughts in the previous sections. The document is produced by the Stata code that follows.

Notice the table set up by referencing the auto mean results in the following code:

// This is alltogether.do
clear all
sysuse auto, clear
set graphics on
scatter mpg weight, by(foreign)

global pngfile = "myplot.png"
graph export "$pngfile", replace // Save graphics file to disk.

// Set up output file.
global outputRTF = "all together.rtf" // The name of the new RTF file
file open handle1 using "$outputRTF", write replace

initiatertf handle1

file write handle1 "{\pard" // Start new paragraph
file write handle1 "\f1\fs30 Here is a {\i beautiful} picture... \line\line"
file write handle1 "\par}" // Close paragraph

insertimagertf handle1 "$pngfile" , leftoffset(0) // Adds graph into document.
rm "$pngfile" // Delete file (if desired).

/////////////////////////////////////////////////////////

file write handle1 "{\pard\f1\fs20 \line\line And now for a sample data " ///
    "table. \line\line\par}"
file write handle1 "{" _n // Start of table
tabstat mpg rep78 price, save

// This is a semiautomated way of obtaining the values desired:
matrix mydata = r(StatTotal)
local cols = colsof(mydata) // Number of columns
forvalues colnum = 1/`cols' {
    global d`colnum' : display %10.3f mydata[1, `colnum']
}
// Or you can do it manually, like this (no formatting here):
// global d1 = mydata[1,1]
// global d2 = mydata[1,2]
// global d3 = mydata[1,3]

addrowrtf handle1 item mpg rep78 "Price in USD"
addrowrtf handle1 Mean $d1 $d2 $d3
file write handle1 "}" _n // End of table

////////////////////////////////////////////////////////

file write handle1 "{\pard\f0\fs24 \line\line That's {\ul\b\scaps all}, " ///
    "folks!\line\line\par}"

endrtf handle1
file close handle1
clear all

5 Known issues

As stated previously, both commands have deliberately been designed to be as easy to use as possible, given the complexities of RTF usage. Although there are options to control output, the final RTF document will be viewed in a word processor and can be edited to the user’s taste therein. With addrowrtf, a truly long list of arguments (cell values) for a single row may cause the table row to extend past the right-hand edge of the page in some cases. It may be possible to keep such a long table row within the page margins by judicious use of options.

6 Conclusion

Using Stata to produce RTF files can be tricky. I introduced two commands that will hopefully aid users who wish to automatically generate RTF documents from Stata.

7 Acknowledgments

This work was supported by Cancer Research UK C8162/A16892. I wish to thank Stata Technical Support, especially Derek Wagner, who assisted with the conversion from byte values to hexadecimal values in the insertimagertf command. I also thank Dr. Francesca Pesola and Professor Peter Sasieni, both formerly of Queen Mary University of London and now of King's College London, for their helpful comments and suggestions. The suggestions of an anonymous reviewer greatly improved the final submission.

8 References

Burke, S. M. 2003. RTF Pocket Guide. Sebastopol, CA: O’Reilly.

Microsoft Corporation. 2008. Word 2007: Rich Text Format (RTF) Specification, version 1.9.1. https://www.microsoft.com/en-gb/download/details.aspx?id=10725.

Newson, R. 2009. rtfutil: Stata module to provide utilities for writing Rich Text Format (RTF) files. Statistical Software Components S457080, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457080.html.

Nichols, A. 2010. png2rtf: Stata module to include PNG graphics in RTF documents. Statistical Software Components S457192, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s457192.html.

About the author

Matthew S. Gillman is a statistical programmer at King's College London. This work was undertaken at Queen Mary University of London.

The Stata Journal (2018) 18, Number 1, pp. 206–222

Standard-error correction in two-stage optimization models: A quasi–maximum likelihood estimation approach

Fernando Rios-Avila                      Gustavo Canavire-Bacarreza
Levy Economics Institute                 School of Economics and Finance
Annandale-on-Hudson, NY                  Universidad EAFIT
[email protected]                         Medellín, Colombia
                                         gcanavir@eafit.edu.co

Abstract. Following Wooldridge (2014, Journal of Econometrics 182: 226–234), we discuss and implement in Stata an efficient maximum-likelihood approach to the estimation of corrected standard errors of two-stage optimization models. Specifically, we compare the robustness and efficiency of the proposed method with routines already implemented in Stata to deal with selection and endogeneity problems. This strategy is an alternative to the use of bootstrap methods and has the advantage that it can be easily applied for the estimation of two-stage optimization models for which already built-in programs are not yet available. It could be of particular use for addressing endogeneity in a nonlinear framework.
Keywords: st0520, maximum likelihood estimation, nonlinear models, endogeneity, two-step models, standard errors

1 Introduction

Selection and endogeneity are common problems in applied econometrics. One common approach to deal with these problems is the use of two-stage estimations, where results from one model, such as the predicted residuals, are used in a second model to obtain coefficient estimates that are unbiased and consistent in the presence of endogeneity or selection. However, to draw appropriate statistical inferences from these models, one has to be careful when estimating the standard errors (SEs), given the probabilistic nature of the first-stage estimates that are introduced in the second model.

Although the properties and formulation of the correct SEs associated with two-stage optimization models have been available in the literature for decades (Murphy and Topel 1985; Newey and McFadden 1994; White 1982), outside of already packaged commands, practical implementation is not yet standard, particularly for the estimation of nonlinear models (Terza 2016). Despite the efforts of many authors to provide strategies and programming codes to facilitate the implementation of standard-error corrections in Stata (Hardin 2002; Hole 2006; Terza 2016), most applied researchers either implement bootstrap methods at best or ignore the problem by reporting the uncorrected errors at worst (Terza 2016).


In this article, we suggest a different, easier approach to implement the estimation of corrected SEs of two-stage optimization models, based on Wooldridge (2014). The method relies on the maximum likelihood (ML) estimation available in Stata, which is used to estimate a joint function that characterizes the data-generating process of the first-stage and second-stage systems under the assumption of conditional independent distributions. The main use of this strategy is to facilitate the estimation of endogenous nonlinear models using a control function approach such as the ones described in Terza, Basu, and Rathouz (2008) with corrected SEs. The rest of the article is organized as follows: Section 2 reviews the framework of two-stage optimization models and characterization of the estimation of the correct SEs. Section 3 establishes the general framework of the estimation based on ML using the Stata command ml and the basic program structure that defines the objective function. Section 4 provides examples of the estimation compared with existing methods in Stata using Monte Carlo simulations. Section 5 concludes.

2 Two-stage estimations

Systems in which results from one estimation are used in a second model are common in the applied literature. The most prominent examples are the estimation of the two-stage Heckman selection models (Heckman 1979), two-stage least squares (2SLS) for the treatment of endogeneity in linear models, and the control-function approach for the robust estimation of nonlinear models with endogenous variables (Terza, Basu, and Rathouz 2008; Wooldridge 2014). Although these types of models can be fit jointly, two-step procedures are often easier to estimate because they require fewer restrictions on the joint distribution of the data-generating functions (Greene 2012). While easy to implement, the main drawback of two-stage models has been that the SEs estimated from the second stage alone are incorrect because they ignore the measurement error that carries over from using the predictions of one model in the next model. Hardin (2002) and Hole (2006) provide guidance for implementing the variance estimator suggested by Murphy and Topel (1985) when including the predictions of a first-stage model into a second model. More recently, Terza (2016) suggests an additional simplification of the estimation of SEs in two-stage models, emphasizing the application of the two-stage residual inclusion approach and the handling of endogeneity in nonlinear models. The original model described in Hardin (2002) is characterized as follows,

$$ E(y_1 \mid x_1, \theta_1) \qquad (1) $$

$$ E\{y_2 \mid x_2, \theta_2, E(y_1 \mid x_1, \theta_1)\} \qquad (2) $$

where in (1) we model the conditional mean of the endogenous variable y1 as a function of exogenous variables x1, and in (2) we model the conditional mean of the variable y2 as a function of exogenous variables x2 and some form of the predicted values of the first model.

According to Hardin (2002), two approaches can be used to fit these models. The first approach is a full-information ML, where we start by specifying a joint distribution f(y1, y2|x1, x2, θ1, θ2), which is then maximized jointly. A second approach is the estimation of a limited-information ML, which is a two-step procedure. Because the first model depends only on the parameter θ1, it can be consistently fit. The estimated parameters of the first model can be directly included in the second model, which can be fit using a conditional log-likelihood function:

$$ \max_{\theta_2} L_2(\theta_2 \mid \hat{\theta}_1) = \sum \ln f\left\{y_2 \mid x_2, \theta_2, g(x_1, \hat{\theta}_1)\right\} $$

In this framework, Hardin (2002) and Hole (2006) indicate that while the raw variance obtained from the second-stage model is incorrect, it can be easily corrected using the Murphy–Topel approach, which states that

$$ \hat{V}(\hat{\theta}_2) = \hat{V}^{*}(\hat{\theta}_2) + \hat{V}^{*}(\hat{\theta}_2)\left\{\hat{C}\hat{V}_1\hat{C}' + \hat{R}\hat{V}_1\hat{C}' - \hat{C}\hat{V}_1\hat{R}'\right\}\hat{V}^{*}(\hat{\theta}_2) $$

where V̂*(θ̂2) is the incorrect estimation of the variance obtained from the second-stage maximization problem and V̂1 is the correct estimation of the variance of the first-stage estimate. Ĉ and R̂ are the matrix products defined as

$$ \hat{C} = E\left(\frac{\partial L_2}{\partial \theta_2}\,\frac{\partial L_2}{\partial \theta_1'}\right) \qquad\text{and}\qquad \hat{R} = E\left(\frac{\partial L_2}{\partial \theta_2}\,\frac{\partial L_1}{\partial \theta_1'}\right) $$

which Hardin (2002) indicates can be traced back to represent the sandwich estimate of the variance given two ML functions L1(θ1) and L2(θ2|θ1) (see Hardin [2002] for details). Terza (2016) proposes a simplified method to address the correction of SEs in two-stage procedures that could be applied for the estimation of endogenous models in the framework of Terza, Basu, and Rathouz (2008) and the two-stage residual inclusion (2SRI) approach, using as an example a model where the outcome and endogenous variables are nonlinear functions of the exogenous variables.

3 Quasi–ML approach

According to Wooldridge (2014), the model described above can also be fit using quasi–limited-information ML. As described in the article, for cases in which the models are linear (2SLS), and under the assumption that the errors of the models distribute normal and independent, the joint maximum log-likelihood function can be written as

$$ L(\theta_1, \theta_2) = L_1(\theta_1) + L_2(\theta_2 \mid \theta_1) $$

In other words, if we assume the distribution function of the parameter θ1 and the conditional distribution of θ2|θ1 are independent from each other, the full system of equations can be estimated using the ML approach. The estimation of this model would not require additional adjustments on the estimation of the SEs, because the measurement error from θ̂1 is already accounted for in the model.

As the article argues, the estimation of this type of model can be done using readily available statistical software. Further, inferences can be drawn by reporting the White (1982) type of sandwich variance estimations to account for misspecified likelihood functions. Furthermore, Wooldridge (2014) indicates that the estimation of this joint quasi–maximum likelihood estimation (QMLE) can also be applied to a larger set of ML models that belong to the linear exponential family because they remain consistent even when the density function is partially misspecified (Cameron and Trivedi 2005). This suggests that a range of models such as the models described in Terza, Basu, and Rathouz (2008) and Wooldridge (2015) can also be fit consistently using quasi–ML functions.

3.1 Estimation of QMLE in Stata

The estimation of models using Stata ML functions is simple but requires some programming and knowledge of the appropriate density functions that the researcher assumes determine the data-generating process of the data in hand or some criteria of an objective function that needs to be optimized. In this section, we provide the general setup of how a program can be written to specify the objective function in the framework of a two-stage optimization model. For the specification of this program, we will use the lf estimation method within the ML environment in Stata. This is the simplest method for ml estimation commands, which requires only specifying an objective function that will be maximized.1 As described in Cameron and Trivedi (2010), the lf method can be used for the special case where the objective function is an m estimator—namely, where the objective function is the sum or average of some function q(·) over the sample of observations. As the authors indicate, this method can be applied for any function q(·), but robust SEs need to be reported if the objective function is not a likelihood-based function. Say that we are interested in jointly fitting two linear models, with normal and independently distributed errors:

$$ y_{1i} = \beta_{10} + \beta_{11}x_{1i} + \beta_{12}x_{2i} + \beta_{13}x_{3i} + u_{1i} = X_{1i}'\beta_1 + u_{1i}, \quad\text{with } u_{1i} \sim N(0,\sigma_1) \qquad (3) $$

$$ y_{2i} = \beta_{20} + \beta_{21}x_{1i} + \beta_{23}x_{3i} + \beta_{24}x_{4i} + u_{2i} = X_{2i}'\beta_2 + u_{2i}, \quad\text{with } u_{2i} \sim N(0,\sigma_2) \qquad (4) $$

If we were to estimate this function using ML, the log-likelihood function of the full system of equations would be written as

$$ L_i = L_i^1 + L_i^2 = \ln\phi(y_{1i}, X_{1i}'\beta_1, \sigma_1) + \ln\phi(y_{2i}, X_{2i}'\beta_2, \sigma_2) \qquad (5) $$

where φ(y_{ki}, X_{ki}'β_k, σ_k) is the normal density function, given the conditional mean X_{ki}'β_k and standard deviation σ_k. To be able to fit this model using ml, we need to create a program that defines the objective function to be maximized as a function of parameters that need to be estimated. In this simple model, we need to estimate four parameters: the conditional mean of y1 (X_{1i}'β_1), the conditional mean of y2 (X_{2i}'β_2), and their respective standard deviations σ1 and σ2. This can be done as follows:

1. Note that ml also allows for the specification of the analytical first and second derivatives of the objective function, which provides improvements in the speed of the estimations. This feature, however, is beyond the scope of this article.

program myols
        args lnf xb1 lns1 xb2 lns2
        quietly {
                tempvar lnf1 lnf2
                * Ordinary least squares (OLS) ML component L2
                generate double `lnf2'=ln(normalden($ML_y2,`xb2',exp(`lns2')))
                * OLS ML component L1
                generate double `lnf1'=ln(normalden($ML_y1,`xb1',exp(`lns1')))
                replace `lnf'=`lnf1'+`lnf2'
        }
end

Thus, we begin by declaring the internal variable (args) that stores the values of the objective function, lnf, followed by the parameters that will be used to maximize the objective function. In this example, the four parameters that are estimated are the conditional means (xb1 and xb2) and the standard deviations (exp(lns1) and exp(lns2)). Note that xb1 and xb2 can be linear combinations of explanatory variables. Also, because the standard deviations must be positive, they are not directly estimated, but their natural logs, lns1 and lns2, are estimated instead. Two auxiliary variables are created as temporary variables lnf1 and lnf2 to store the log likelihood corresponding to the first and the second equation. They are functions of the parameters that need to be estimated and the dependent variables in the model that are called using the macros $ML_y1 and $ML_y2, which represent the first and second dependent variables in the model. Finally, both likelihood functions are combined and stored in lnf, which resembles the expression in (5). Once the program is stored in memory, the estimation of the model can be done using the following command:

ml model lf myols (xb1:y1=x1 x2 x3) (lns1:) (xb2:y2=x1 x3 x4) (lns2:), ///
        maximize vce(robust)

The first part, ml model lf, indicates that the model will be fit using the lf option of ML, which implies that the gradients and Hessians are estimated numerically. Right after, we specify the name of the program we just created, which declares the objective function, in this case the log likelihood associated with the two-equation system. Next, in the same order as they were presented in the program, we declare the information that will be used for the estimation of conditional means and standard deviations. In the order they are written, all variables that are to the left of = are considered dependent variables and are internally used in the program as $ML_y1, $ML_y2, etc. All variables to the right of = are used as explanatory variables that enter in the estimation of the parameters as linear combinations. In this example, the first and second dependent variables are y1 and y2; after each variable, we specify the explanatory variables that are used for the conditional means. We assume homoskedastic errors, so no explanatory variables are written for lns1 or lns2. After the comma, we indicate we want to maximize the objective function and ask to report robust SEs.
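Because the result is a standard ml fit, the usual postestimation tools apply; for instance, one might redisplay the results or, as noted in footnote 3 below, obtain marginal effects (a sketch; the choice of covariate in dydx() is arbitrary):

. ml display
. margins, dydx(x1) vce(unconditional)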

Imagine now that the models described in (3) and (4) are no longer independent. Let us assume that the outcome y1i is also a function of y2i and that both outcomes depend on some unobserved component vi so that the system of equations specified before becomes

$$ y_{1i} = \beta_{10} + \beta_{11}x_{1i} + \beta_{12}x_{2i} + \beta_{13}x_{3i} + \beta_{14}y_{2i} + u_{1i} + v_i, \quad\text{with } u_{1i} \sim N(0,\sigma_1) \qquad (6) $$

$$ y_{2i} = \beta_{20} + \beta_{21}x_{1i} + \beta_{23}x_{3i} + \beta_{24}x_{4i} + u_{2i} + v_i, \quad\text{with } u_{2i} \sim N(0,\sigma_2) \qquad (7) $$

This is a common case of endogeneity between y1i and y2i caused by the presence of unobserved heterogeneity vi. Under this structure, standard estimation methods will no longer provide consistent estimates for (6). One solution to deal with endogeneity is the use of 2SLS. This requires fitting a model where the endogenous covariate y2i is modeled2 as a function of all exogenous variables in (6) and (7), and the predicted outcome ŷ2i is used in the main equation instead of the actual value y2i. In other words, this implies the estimation of the following models:

$$ y_{2i} = \alpha_{20} + \alpha_{21}x_{1i} + \alpha_{22}x_{2i} + \alpha_{23}x_{3i} + \alpha_{24}x_{4i} + \epsilon_{2i} $$

$$ y_{1i} = \alpha_{10} + \alpha_{11}x_{1i} + \alpha_{12}x_{2i} + \alpha_{13}x_{3i} + g\,\hat{y}_{2i} + \epsilon_{1i} $$
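For this linear case, the same logic is available through Stata's built-in instrumental-variables estimator; with the variables above, a comparable call would be (a sketch)

. ivregress 2sls y1 x1 x2 x3 (y2 = x4), vce(robust)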

Instead of using this two-step procedure manually, the program created above can be adapted to implement this process so that this model can be fit using the ml process. This requires specifying an additional coefficient that needs to be estimated. In this example, we call this link g. The program and the maximization command then are modified as follows:

program myols2
        args lnf xb1 g lns1 xb2 lns2
        quietly {
                tempvar lnf1 lnf2
                generate double `lnf2'=ln(normalden($ML_y2,`xb2',exp(`lns2')))
                generate double `lnf1'=ln(normalden($ML_y1,`xb1'+`g'*`xb2',exp(`lns1')))
                replace `lnf'=`lnf1'+`lnf2'
        }
end

Notice that in this example, the log-likelihood function of the second equation, lnf2, is the same as before. The log-likelihood function of the first equation, lnf1, is modified so that, in addition to the linear combination of the exogenous variables xb1, it includes the conditional mean from the first equation, xb2, multiplied by the coefficient g. For the estimation of the model, we also need to modify the command so it specifies, in the same order as in the program, that an additional coefficient needs to be estimated.

ml model lf myols2 (xb1:y1=x1 x2 x3) (g:) (s1:) (xb2:y2=x1 x2 x3 x4) (s2:), ///
        maximize vce(robust)

Notice that the endogenous variable y2 is not included in the list of explanatory variables in the main model, because we are using the predicted values instead.

2. Note that for the identification of the model, an exogenous covariate or instrumental variable that is correlated with the endogenous variable y2i and uncorrelated with the final outcome y1i is needed.

The above code can be easily extended to allow for a system of multiple equations. For example, if we were to have a model with two endogenous variables, their interaction on the main outcome equation can be accounted for by including an additional link parameter from the first-stage to the second-stage equations:

program myols3
        args lnf xb1 g2 g3 lns1 xb2 lns2 xb3 lns3
        quietly {
                tempvar lnf1 lnf2 lnf3
                generate double `lnf2'=ln(normalden($ML_y2,`xb2',exp(`lns2')))
                generate double `lnf3'=ln(normalden($ML_y3,`xb3',exp(`lns3')))
                generate double `lnf1'= ///
                        ln(normalden($ML_y1,`xb1'+`g2'*`xb2'+`g3'*`xb3',exp(`lns1')))
                replace `lnf'=`lnf1'+`lnf2'+`lnf3'
        }
end

ml model lf myols3 (xb1:y1=x1 x2) (g2:) (g3:) (lns1:) (xb2:y2=x1 x2 x3 x4) ///
        (lns2:) (xb3:y3=x1 x2 x3 x4) (lns3:), maximize vce(robust)

In this case, we have two endogenous variables, y2 and y3, that are identified using instruments x3 and x4. The predicted values of their corresponding models, xb2 and xb3, are included in the outcome equation multiplied by the link parameters g2 and g3. This simple setup can be adapted to implement other strategies that involve two-step estimation modeling.3 In the next section, we present three examples of how to use this setup to fit a Heckman selection model, a linear model with endogenous covariates, and a nonlinear model (probit) with endogenous covariates, comparing the results with Stata built-in procedures based on Monte Carlo simulations.

4 Applications

4.1 Simulated data structure

In this section, we apply the strategy described above to fit models with selection and endogeneity problems in the framework of linear and nonlinear outcomes and compare the results to already built-in Stata programs. For the applications that follow, we use Monte Carlo simulations to assess the consistency of the estimated SEs of the proposed method. Unless otherwise noted in each subsection, the simulated data are generated using the following data-generating process:

3. An additional advantage of this setup is that after the model is fit, one can obtain marginal effects by using the margins command with the option vce(unconditional). We thank an anonymous reviewer who pointed out this feature.

The errors u1 and u2 are assumed to be distributed as a bivariate normal distribution, with mean zero, standard deviation 1, and a correlation of 0.7:

$$ \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim B\left\{\mu_u = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma_u = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix}\right\} $$

Four exogenous explanatory variables, [x1, x2, x3, x4], are jointly distributed as a multivariate normal distribution, also with mean zero and a variance–covariance matrix Σx:

$$ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} \sim MV\left\{\mu_x = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},\ \Sigma_x = \begin{pmatrix} 1 & 0.3 & 0.3 & -0.1 \\ 0.3 & 1 & 0.35 & 0.1 \\ 0.3 & 0.35 & 1 & -0.15 \\ -0.1 & 0.1 & -0.15 & 1 \end{pmatrix}\right\} $$

Finally, we consider three exogenous instrumental variables, z1, z2, and z3, that are assumed to be distributed as jointly normal with mean zero and variance–covariance matrix Σz:

$$ \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix} \sim MV\left\{\mu_z = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\ \Sigma_z = \begin{pmatrix} 1 & 0.5 & -0.5 \\ 0.5 & 1 & -0.8 \\ -0.5 & -0.8 & 1 \end{pmatrix}\right\} $$

Additional details on the data-generating process and the programs used for the estimation of the models can be found in the companion program to this article.
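A minimal sketch of how one replication of this design could be generated in Stata (the sample size and seed are arbitrary; the covariance matrices are those given above):

clear
set seed 12345
set obs 1000
matrix Su = (1, 0.7 \ 0.7, 1)
drawnorm u1 u2, cov(Su)
matrix Sx = (1, 0.3, 0.3, -0.1 \ 0.3, 1, 0.35, 0.1 \ 0.3, 0.35, 1, -0.15 \ -0.1, 0.1, -0.15, 1)
drawnorm x1 x2 x3 x4, cov(Sx)
matrix Sz = (1, 0.5, -0.5 \ 0.5, 1, -0.8 \ -0.5, -0.8, 1)
drawnorm z1 z2 z3, cov(Sz)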

4.2 Heckman selection model. Two-stage procedure

As indicated before, a common problem in empirical analysis is selection bias. This problem arises because the outcome of interest is not observed for everyone because of some unobserved nonrandom component. According to the standard Heckman (1979) approach, if the probability of observing the outcome can be modeled as a probit based on observed characteristics, including some variable or instrument that is strongly related to the probability of selection but not correlated with the outcome, then unbiased estimates of the parameters can be obtained using a two-step procedure, also known as a heckit procedure. This procedure involves the estimation of a probit model in the first stage, using a set of explanatory variables (Zi) that may include variables not included in the main model (Xi). The parameters of this first stage are used to estimate a selection term, or inverse Mills ratio, which is included in the outcome equation. Least-squares estimates from this equation provide unbiased and consistent estimates of the parameters of interest:

\[
P(y_{2i} = 1) = \Phi(Z_i'\theta)
\]

where $\Phi$ is the normal cumulative distribution function, and $y_{2i} = 1$ if $y^*_{2i} > 0$ and $y_{2i} = 0$ otherwise, and

\[
y_{1i} = X_i'\beta + \lambda \frac{\phi(Z_i'\theta)}{\Phi(Z_i'\theta)} + e_i
\]

where $\phi$ is the normal density function and $y_{1i}$ is observed only if $y_{2i} = 1$.

This two-step procedure can be implemented in the framework of the QMLE method, where the corresponding log-likelihood function could be written as

\[
L_i = L_i^{\,selection} + L_i^{\,outcome}
\]
\[
L_i = y_{2i}\,\ln\{\Phi(Z_i'\theta)\} + (1 - y_{2i})\,\ln\{1 - \Phi(Z_i'\theta)\} + y_{2i}\,\ln\phi\left( y_{1i},\; X_i'\beta + \lambda \frac{\phi(Z_i'\theta)}{\Phi(Z_i'\theta)},\; \sigma_e \right)
\]

The program associated with the estimation of this model can then be written as

program myheckman
        args lnf xb1 g lns1 zg1
        quietly {
                tempvar lnf1 lnf2
                * Selection equation
                generate double `lnf1'=ln(normal(`zg1')) if $ML_y2==1
                replace `lnf1'=ln(1-normal(`zg1')) if $ML_y2==0
                * Outcome equation
                generate double `lnf2'= ///
                        ln(normalden($ML_y1,`xb1'+`g'*normalden(`zg1')/normal(`zg1'),exp(`lns1')))
                replace `lnf2'=0 if $ML_y2==0
                * Adding both log likelihoods
                replace `lnf'=`lnf1'+`lnf2'
        }
end

For the simulation exercise, we use the simulated data described above, with the data-generating process for the dependent variables y1 and y2 defined as follows:

Selection:
\[
y^*_{2i} = 1 + x_{1i} - x_{2i} + x_{3i} + u_{2i}, \qquad y_{2i} = 0 \ \text{if} \ y^*_{2i} \le 0 \quad \text{and} \quad y_{2i} = 1 \ \text{if} \ y^*_{2i} > 0
\]

Outcome:
\[
y_{1i} = 2 + x_{1i} + 0.5 \times x_{2i} - x_{4i} + u_{1i} \qquad \text{if} \ y^*_{2i} > 0
\]
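Continuing the illustrative sketch above, the selection and outcome variables could be generated as follows (the variable names follow the equations; this is not the authors' companion code).

* Selection and outcome for the Heckman exercise (sketch)
generate y2star = 1 + x1 - x2 + x3 + u2
generate y2 = (y2star > 0)
generate y1 = 2 + x1 + 0.5*x2 - x4 + u1 if y2star > 0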

In this case, the estimation of the model can be done using the following command:4

ml model lf myheckman (xb1:y1=x1 x2 x4) (g:) (lns1:) (xb2:y2=x1 x2 x3), ///
        maximize missing
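For reference, the built-in benchmarks summarized in table 1 can be obtained along the following lines; this is a hedged sketch using the simulated variable names, with robust SEs requested where the command permits them.

* Built-in benchmarks (sketch)
heckman y1 x1 x2 x4, select(y2 = x1 x2 x3) vce(robust)    // heckman ml
heckman y1 x1 x2 x4, select(y2 = x1 x2 x3) twostep        // heckman twostep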

For the simulation, we draw 5,000 samples that are each of size 1,000. In table 1, we provide the summary statistics of the point estimates and the estimated SEs from the Stata built-in command heckman and the QMLE Heckman proposed here, as well as the standard deviations of the estimated coefficients from the Monte Carlo simulation exercise.

4. Notice that in the estimation command, we add the option missing to allow for missing observations (because of selection) to be kept in the estimation model.

Table 1. Simulation results: Selection problem

                              Heckman ML            Heckman two-step       Heckman two-step QMLE
                        Coefficients  Simulated   Coefficients  Simulated   Coefficients  Simulated
                                      standard                  standard                  standard
                                      deviations                deviations                deviations

  Beta x1                    1.000      0.048          1.000      0.054          1.001      0.054
  SE beta x1                 0.048      0.003          0.054      0.003          0.054      0.004
  Beta x2                    0.501      0.042          0.501      0.044          0.501      0.045
  SE beta x2                 0.042      0.002          0.044      0.002          0.045      0.003
  Beta x4                   -1.000      0.036         -1.000      0.037         -1.000      0.037
  SE beta x4                 0.035      0.002          0.036      0.001          0.036      0.002
  Constant                   2.000      0.049          2.001      0.060          2.000      0.060
  SE constant                0.049      0.005          0.061      0.003          0.061      0.004
  Selection term/g           0.700      0.083          0.698      0.130          0.702      0.131
  SE selection term/g        0.082      0.015          0.129      0.007          0.130      0.010

Note: The summary corresponds to 5,000 random draws of size 1,000 as described above. All commands were executed using robust SEs whenever possible. Only the outcome variables results are shown.

As expected from asymptotic theory, the point estimates for all three methods converge to the true parameters. Comparing the average estimated SEs for the covariates x1, x2, and x4 and the constant with the SEs from the simulation, we can conclude that they are estimated without bias in our proposed method. In regard to the selection term, g, we observe that the average estimated SEs converge to the ones obtained from the simulation. While the method proposed here has a performance similar, in terms of its asymptotic properties, to the Stata heckman twostep estimation, it is worth noting that the heckman ml estimation is more efficient, showing smaller SEs for the equivalent selection term compared with the other two procedures,5 as well as for the SEs of the other coefficients.

5. The Heckman ML model does not directly estimate a selection term as in the two-step procedure. However, an equivalent parameter is obtained as the product of the estimated correlation between the selection and outcome errors and the standard deviation of the error in the outcome model.
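As footnote 5 indicates, the selection term implied by the ML fit is the product of the estimated correlation and the outcome-error standard deviation. heckman already reports this quantity as lambda; the sketch below simply makes the product explicit using the ancillary parameters stored by heckman (athrho and lnsigma).

* Equivalent selection term after -heckman- fit by ML (sketch)
nlcom (lambda: tanh([athrho]_cons) * exp([lnsigma]_cons))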

4.3 Linear regression model with endogeneity

A model that is typically used for addressing endogeneity problems when the outcome and the endogenous variables are continuous is the 2SLS procedure. This estimation involves obtaining linear predictions of the endogenous variables using all exogenous variables, which are then used instead of the original variables in the outcome equation (two-stage predictor substitution). Alternatively, one can obtain the predicted residuals of the first-stage models and add them to the main outcome model to control for the possible endogeneity. This strategy is also known as a control-function approach or two-stage residual inclusion (2SRI) approach and is known to perform better when the outcome model is nonlinear (Terza, Basu, and Rathouz 2008; Wooldridge 2014, 2015). As indicated by Wooldridge (2014), using the 2SRI approach in an ML estimation framework is equivalent to using a limited-information ML approach.

Under the assumption that the errors in all the equations are normally and independently distributed, the ML function of the system of equations could be written as follows,

\[
L_i = L_i^{X} + L_i^{Y}
\]
\[
L_i = \ln\phi(X_i^{z},\; Z_i'\gamma,\; \sigma_x) + \ln\phi\{\,y_i,\; X_i^{x\prime}\beta_x + (Z_i'\gamma)\beta_z,\; \sigma_y\,\}
\]

where $X_i^{x}$ stands for the set of all exogenous variables that determine y, $X_i^{z}$ is an endogenous variable in the model, and $Z_i$ is the set of all exogenous variables, including the instrumental variables, which determine $X_i^{z}$. If we assume a model with two endogenous variables, the program associated with the estimation of this model can then be written as

program myivreg2sls
        args lnf xb1 g2 g3 lns1 zg1 lns2 zg2 lns3
        quietly {
                tempvar lnf1 lnf2 lnf3
                * First-stage equations for the endogenous variables
                generate double `lnf2'=ln(normalden($ML_y2,`zg1',exp(`lns2')))
                generate double `lnf3'=ln(normalden($ML_y3,`zg2',exp(`lns3')))
                * Outcome equation using the first-stage linear predictions
                generate double `lnf1'= ///
                        ln(normalden($ML_y1,`xb1'+`g2'*`zg1'+`g3'*`zg2',exp(`lns1')))
                replace `lnf'=`lnf1'+`lnf2'+`lnf3'
        }
end

If instead we use a 2SRI approach, the likelihood function and program would be modified as

\[
L_i = \ln\phi(X_i^{z},\; Z_i'\gamma,\; \sigma_x) + \ln\phi\{\,y_i,\; X_i^{x\prime}\beta_x + X_i^{z\prime}\beta_z + (X_i^{z} - Z_i'\gamma)\theta,\; \sigma_y\,\}
\]

with the corresponding QMLE program:

program myivreg2sri
        args lnf xb1 g2 g3 lns1 zg1 lns2 zg2 lns3
        quietly {
                tempvar lnf1 lnf2 lnf3
                * First-stage equations for the endogenous variables
                generate double `lnf2'=ln(normalden($ML_y2,`zg1',exp(`lns2')))
                generate double `lnf3'=ln(normalden($ML_y3,`zg2',exp(`lns3')))
                * Outcome equation including the first-stage residuals (2SRI)
                generate double `lnf1'=ln(normalden($ML_y1,`xb1'+`g2'*($ML_y2-`zg1') ///
                        +`g3'*($ML_y3-`zg2'),exp(`lns1')))
                replace `lnf'=`lnf1'+`lnf2'+`lnf3'
        }
end

In addition, because the outcome and endogenous models can be written assuming an additive error, we can drop the assumption of normality and specify the objective function in the program as if we were fitting the model using ordinary least squares

(OLS) or nonlinear least squares (NLS). This simplifies the model because the standard deviations do not need to be estimated (lns1, lns2, lns3). However, for appropriate statistical inference, the model needs to be fit using the vce(robust) option for the estimation of the SEs:

program myivreg2sls2
        args lnf xb1 g2 g3 zg1 zg2
        quietly {
                tempvar lnf1 lnf2 lnf3
                * Negative squared residuals: maximizing their sum is equivalent
                * to least squares for each equation
                generate double `lnf2'=-($ML_y2-`zg1')^2
                generate double `lnf3'=-($ML_y3-`zg2')^2
                generate double `lnf1'=-($ML_y1-(`xb1'+`g2'*`zg1'+`g3'*`zg2'))^2
                replace `lnf'=`lnf1'+`lnf2'+`lnf3'
        }
end

For the simulation exercise, we assume a model where the outcome of interest y1 is a function of two exogenous variables, x1 and x2, and two endogenous variables, x3* and x4*, which are defined as follows:

\[
x^*_{3i} = x_{3i} + 0.4 \times z_{1i} + 0.6 \times z_{2i} + u_{2i}
\]
\[
x^*_{4i} = x_{4i} - 0.5 \times z_{1i} + 0.5 \times z_{3i} + u_{2i}
\]

At the same time, the outcome variable y1 is defined as

\[
y_{1i} = 1 + 0.5 \times x_{1i} - 0.5 \times x_{2i} + 0.5 \times x^*_{3i} - 0.5 \times x^*_{4i} + u_{1i}
\]
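A corresponding sketch of this part of the data-generating process, continuing the illustrative code above (the names x3s and x4s match the variables used in the estimation commands below):

* Endogenous regressors and continuous outcome (sketch)
capture drop y1
generate x3s = x3 + 0.4*z1 + 0.6*z2 + u2
generate x4s = x4 - 0.5*z1 + 0.5*z3 + u2
generate y1 = 1 + 0.5*x1 - 0.5*x2 + 0.5*x3s - 0.5*x4s + u1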

Using the programs presented above, we can fit the model using the following commands:

ml model lf myivreg2sls (xb1:y1=x1 x2) (g1:) (g2:) (lns1:) ///
        (zg1:x3s=x1 x2 z1 z2 z3) (lns3:) (zg2:x4s=x1 x2 z1 z2 z3) (lns4:), ///
        maximize vce(robust)

ml model lf myivreg2sri (xb1:y1=x1 x2 x3s x4s) (g1:) (g2:) (lns1:) ///
        (zg1:x3s=x1 x2 z1 z2 z3) (lns3:) (zg2:x4s=x1 x2 z1 z2 z3) (lns4:), ///
        maximize vce(robust)

ml model lf myivreg2sls2 (xb1:y1=x1 x2) (g1:) (g2:) ///
        (zg1:x3s=x1 x2 z1 z2 z3) (zg2:x4s=x1 x2 z1 z2 z3), maximize vce(robust)
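For comparison with the upper panel of table 2, the built-in estimators can be requested roughly as follows (a sketch using the simulated variable names; robust SEs where available):

* Built-in benchmarks (sketch)
ivregress 2sls y1 x1 x2 (x3s x4s = z1 z2 z3), vce(robust)
ivregress liml y1 x1 x2 (x3s x4s = z1 z2 z3), vce(robust)
ivregress gmm  y1 x1 x2 (x3s x4s = z1 z2 z3)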

Regarding the estimation of the QMLE and quasi–OLS models, note that for the 2SRI, the endogenous variables are directly introduced in the set of explanatory variables in the outcome equation, which is not the case for the standard 2SLS. The summary of the results after the Monte Carlo simulations is presented in table 2.

Table 2. Linear model with endogenous covariates

                      ivregress 2sls         ivregress liml         ivregress gmm
                 Coefficient  Simulated  Coefficient  Simulated  Coefficient  Simulated
                              standard               standard               standard
                              deviations             deviations             deviations

  Beta x1            0.500      0.035        0.501      0.036        0.500      0.035
  SE beta x1         0.036      0.006        0.037      0.009        0.036      0.006
  Beta x2           -0.501      0.053       -0.495      0.057       -0.501      0.053
  SE beta x2         0.052      0.016        0.056      0.030        0.052      0.016
  Beta x3            0.502      0.096        0.489      0.107        0.502      0.096
  SE beta x3         0.094      0.037        0.103      0.071        0.094      0.037
  Beta x4           -0.499      0.097       -0.512      0.108       -0.499      0.097
  SE beta x4         0.095      0.037        0.104      0.070        0.095      0.037
  Constant           0.999      0.032        0.999      0.033        0.999      0.032
  SE constant        0.033      0.005        0.034      0.007        0.033      0.005

                        QMLE-2SLS              QMLE-2SRI              QNLS-2SLS
                 Coefficient  Simulated  Coefficient  Simulated  Coefficient  Simulated
                              SE                      SE                      SE

  Beta x1            0.500      0.035        0.500      0.035        0.500      0.035
  SE beta x1         0.036      0.007        0.036      0.007        0.036      0.007
  Beta x2           -0.500      0.054       -0.500      0.054       -0.500      0.054
  SE beta x2         0.054      0.022        0.054      0.022        0.054      0.022
  Beta x3            0.501      0.100        0.502      0.099        0.502      0.099
  SE beta x3         0.099      0.053        0.098      0.051        0.098      0.051
  Beta x4           -0.500      0.101       -0.500      0.100       -0.500      0.100
  SE beta x4         0.099      0.051        0.099      0.049        0.099      0.049
  Constant           0.999      0.033        0.999      0.032        0.999      0.032
  SE constant        0.033      0.006        0.033      0.006        0.033      0.006

Note: The summary corresponds to 5,000 random draws of size 1,000 as described above. All commands were executed using robust SEs whenever possible. Only the coefficients of the main model are shown. QNLS-2SLS reports the results fitting the model using ml where the objective function minimizes the sum of squared errors of all the models.

In the upper section of table 2, we provide the results obtained from using the Stata command ivregress with the three methodologies that it allows (2sls, liml, and gmm). As expected, the estimated coefficients and SEs obtained from these strategies converge to the true values. Nevertheless, we observe that the results based on the limited-information ML approach provide somewhat larger SEs compared with the other two alternatives, with estimated coefficients that are the farthest from the true coefficients.

In the lower section, we provide the estimates from the QMLE estimation, using the standard 2SLS approach and the 2SRI approach. We also provide the summary for the estimations based on the 2SLS approach but drop the assumption of normality. Based on the simulation, the results from the three strategies are almost identical, with the coefficient estimates converging to the true values and SEs converging to the standard deviations obtained from the simulation. Compared with what we observe using the built-in commands, we observe only a small efficiency advantage from using either the ivregress 2sls or ivregress gmm approach.

4.4 Probit model with continuous endogenous covariates

According to Terza, Basu, and Rathouz (2008), a consistent estimation of models with endogeneity in the context of nonlinear outcome functions can be obtained using the two-stage residual approach, also known as the control-function approach (Wooldridge 2015). A special case of nonlinear outcomes is when the dependent variable can be specified using a probit model with continuous and endogenous covariates. The default approach for the estimation of such models in Stata is to use full-information ML. However, a control-function approach can also be used to obtain two-step estimates as described in Newey (1987) and Rivers and Vuong (1988).

According to this process, the first step is to estimate an OLS regression of the endogenous variable against all exogenous variables in the model and use this estimate to obtain the predicted residual. Next, the residuals are included in the main outcome model, which can be fit using a simple probit model. In this sense, the likelihood function corresponding to this model could be written as

\[
L_i = L_i^{OLS} + L_i^{Probit}
\]
\[
L_i = \ln\phi(X_i^{z},\; Z_i'\gamma,\; \sigma_x) + 1(y_i = 1) \times \ln P(y_i = 1) + 1(y_i = 0) \times \ln\{1 - P(y_i = 1)\}
\]

where $P(y_i = 1) = \Phi\{X_i^{x\prime}\beta_x + (X_i^{z} - Z_i'\gamma)\theta\}$, $X_i^{x}$ is the set of exogenous variables, and $X_i^{z}$ is the endogenous covariate. Similar to the previous subsection, if we assume the model has two endogenous covariates, the program that defines the log-likelihood function can be written as follows:

program myivprobit1
        args lnf xb1 g1 g2 zg1 lns1 zg2 lns2
        quietly {
                tempvar lnf1 lnf2 lnf3 mresid
                * First stage OLS
                generate double `lnf1'=ln(normalden($ML_y2,`zg1',exp(`lns1')))
                generate double `lnf2'=ln(normalden($ML_y3,`zg2',exp(`lns2')))
                * Estimated residuals scaled by the link parameters
                generate double `mresid'=`g1'*($ML_y2-`zg1')+`g2'*($ML_y3-`zg2')
                * Probit model
                generate double `lnf3'=ln(1-normal(`xb1'+`mresid')) if $ML_y1==0
                replace `lnf3'=ln(normal(`xb1'+`mresid')) if $ML_y1==1
                replace `lnf'=`lnf1'+`lnf2'+`lnf3'
        }
end
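For intuition, the two-step procedure that this likelihood encodes can also be written with standard commands, as in the sketch below; note that the naive SEs from this manual second step are not corrected, which is precisely the problem addressed in this article.

* Manual two-stage residual inclusion (sketch; second-step SEs uncorrected)
regress x3s x1 x2 z1 z2 z3
predict double r3, residuals
regress x4s x1 x2 z1 z2 z3
predict double r4, residuals
probit y1 x1 x2 x3s x4s r3 r4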

For the simulation exercise, we will use the data-generating process and definitions described in the previous section with one modification. The outcome variable will be defined as follows:

\[
y^*_{i} = 1 + 0.5 \times x_{1i} - 0.5 \times x_{2i} + 0.5 \times x^*_{3i} - 0.5 \times x^*_{4i} + u_{1i}
\]
\[
y_i = \begin{cases} 1 & \text{if } y^*_i > 0 \\ 0 & \text{if } y^*_i \le 0 \end{cases}
\]
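In terms of the illustrative data code above, this modification amounts to the following sketch (the binary outcome replaces the continuous y1 of the previous subsection):

* Binary outcome for the probit exercise (sketch)
capture drop y1
generate ystar = 1 + 0.5*x1 - 0.5*x2 + 0.5*x3s - 0.5*x4s + u1
generate y1 = (ystar > 0)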

In this case, the model can be fit using the following command:

ml model lf myivprobit1 (xb1:y1=x1 x2 x3s x4s) (g1:) (g2:) ///
        (zg1:x3s=x1 x2 z1 z2 z3) (lns1:) (zg2:x4s=x1 x2 z1 z2 z3) (lns2:), ///
        maximize vce(robust)

In table 3, we provide estimates from the Monte Carlo simulation using 5,000 draws with a sample size of 1,000 observations each, comparing the results from Stata's built-in ivprobit command using the twostep and mle options with those from the proposed method.
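For reference, those built-in fits can be requested along the following lines (a sketch with the simulated variable names):

* Built-in benchmarks for the probit model with endogenous covariates (sketch)
ivprobit y1 x1 x2 (x3s x4s = z1 z2 z3), vce(robust)      // ivprobit mle
ivprobit y1 x1 x2 (x3s x4s = z1 z2 z3), twostep          // ivprobit twostep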

Table 3. Probit model with endogenous covariates

                     ivprobit mle            ivprobit twostep
                 Coefficient  Simulated   Coefficient  Simulated
                              SE                       SE

  Beta x1            0.497      0.100         0.634      0.079
  SE beta x1         0.105      0.043         0.081      0.009
  Beta x2           -0.505      0.160        -0.635      0.111
  SE beta x2         0.165      0.062         0.113      0.024
  Beta x3            0.516      0.234         0.639      0.195
  SE beta x3         0.238      0.088         0.195      0.057
  Beta x4           -0.479      0.079        -0.629      0.200
  SE beta x4         0.074      0.057         0.196      0.060
  Constant           0.994      0.190         1.266      0.089
  SE constant        0.198      0.103         0.089      0.008

                        QMLE               QMLE with adjustment
                 Coefficient  Simulated   Coefficient  Simulated
                              SE                       SE

  Beta x1            0.636      0.079         0.499      0.099
  SE beta x1         0.081      0.012         0.104      0.038
  Beta x2           -0.631      0.115        -0.508      0.158
  SE beta x2         0.117      0.048         0.163      0.044
  Beta x3            0.628      0.205         0.521      0.232
  SE beta x3         0.207      0.109         0.235      0.066
  Beta x4           -0.642      0.210        -0.478      0.078
  SE beta x4         0.209      0.104         0.074      0.084
  Constant           1.268      0.090         0.998      0.188
  SE constant        0.089      0.009         0.196      0.089

Note: The summary corresponds to 5,000 random draws of size 1,000 as described above. All commands were executed using robust SEs whenever possible. Only the coefficients of the main model are shown.

Based on the results from the simulation, we can draw some conclusions. On one hand, the corrected SEs of the coefficients from the ivprobit command are almost identical to the ones obtained using our approach. On the other hand, while ivprobit twostep and the simple implementation of our approach provide the same results, they cannot be directly compared with the coefficients from the ML approach because the former are estimates of the true components only up to a scale. In the last two columns of table 3, we provide additional estimates using an adjustment in the spirit of Newey (1987). After the adjustment, the two-step–QMLE method produces results almost identical to the ivprobit mle approach. The main difference among the estimations is that the ivprobit command produces SEs that seem to be estimated with more precision compared with our method.

5 Summary

In the past decade, a few articles have offered various strategies and codes to correct the estimation of SEs in the framework of two-stage optimization models, including Hardin (2002), Hole (2006), and, more recently, Terza (2016). While these strategies are potentially as simple to implement as bootstrap methods, bootstrap methods are usually preferred in the empirical literature because they do not require further manipulation to obtain the corrected estimations.

In this article, we provided an alternative strategy for the estimation of two-step models using a joint quasilikelihood function as described in Wooldridge (2014). While this method also requires some level of programming, the strategy described here provides an intuitive framework for the specification of the underlying likelihood function that needs to be estimated. The baseline programs provided can be easily modified to allow for a large set of two-stage optimization models, which would allow the consistent estimation of linear and nonlinear models in the presence of endogeneity or sample selection as proposed in Terza, Basu, and Rathouz (2008) and Terza (2016).

Using Monte Carlo simulations, we compared the results obtained from the proposed strategy with the results from built-in Stata commands in the presence of selection bias, linear models with endogeneity, and a nonlinear model (probit) with endogeneity, showing that performance on the estimation of point estimates and SEs is comparable. With the latest release of Stata, a new set of commands, extended regressions, was introduced to address problems regarding endogeneity, sample selection, and endogenous treatments in linear, probit, and censored regression models, adding to the already substantial set of routines that handle these types of problems in empirical analysis.

While the methodology proposed here has been used to fit models that are already "solved" in the literature, its usefulness goes beyond the standard models. The use of this estimation method may facilitate the estimation of two-stage nonlinear models in the spirit of Terza, Basu, and Rathouz (2008) and Wooldridge (2014, 2015), for which canned estimation procedures are not yet available.

6 References

Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press.

———. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press.

Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.

Hardin, J. W. 2002. The robust variance estimator for two-stage models. Stata Journal 2: 253–266.

Heckman, J. J. 1979. Sample selection bias as a specification error. Econometrica 47: 153–161.

Hole, A. R. 2006. Calculating Murphy–Topel variance estimates in Stata: A simplified procedure. Stata Journal 6: 521–529.

Murphy, K. M., and R. H. Topel. 1985. Estimation and inference in two-step econometric models. Journal of Business and Economic Statistics 3: 370–379.

Newey, W. K. 1987. Efficient estimation of limited dependent variable models with endogenous explanatory variables. Journal of Econometrics 36: 231–250.

Newey, W. K., and D. McFadden. 1994. Large sample estimation and hypothesis testing. In Handbook of Econometrics, vol. 4, ed. R. F. Engle and D. L. McFadden, 2111–2245. Amsterdam: Elsevier.

Rivers, D., and Q. H. Vuong. 1988. Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics 39: 347–366.

Terza, J. V. 2016. Simpler standard errors for two-stage optimization estimators. Stata Journal 16: 368–385.

Terza, J. V., A. Basu, and P. J. Rathouz. 2008. Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. Journal of Health Economics 27: 531–543.

White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25.

Wooldridge, J. M. 2014. Quasi-maximum likelihood estimation and testing for nonlinear models with endogenous explanatory variables. Journal of Econometrics 182: 226–234.

———. 2015. Control function methods in applied econometrics. Journal of Human Resources 50: 420–445.

About the authors

Fernando Rios-Avila is a research scholar at Levy Economics Institute of Bard College under the Distribution of Income and Wealth program. His research interests include applied econometrics, labor economics, and poverty and inequality.

Gustavo Canavire-Bacarreza is the Director of the Center for Research on Economics and Finance (CIEF) and a professor in the Department of Economics at Universidad EAFIT in Medellín, Colombia. His main interests are applied econometrics, impact evaluation, labor economics, and development.

The Stata Journal (2018) 18, Number 1, pp. 223–233

Assessing the health economic agreement of different data sources

Daniel Gallacher
Warwick Clinical Trials Unit
University of Warwick
Coventry, UK
[email protected]

Felix Achana
Warwick Clinical Trials Unit
University of Warwick
Coventry, UK
[email protected]

Abstract. In this article, we present a simple-to-use framework for assessing the agreement of cost-effectiveness endpoints generated from different sources of data. The aim of this package is to enable the rapid assessment of routine data for use in cost-effectiveness analyses. By quantifying the comparability of routine data with “gold-standard” trial data, we inform decisions on the suitability of routine data for cost-effectiveness analysis. The rapid identification of informative routine data will increase the opportunity for economic analyses and potentially reduce the cost and burden of collecting patient-reported data in clinical trials.

Keywords: st0521, heabs, heapbs, health economic agreement, cost-effectiveness analysis

1 Introduction

With healthcare budgets under increasing scrutiny, the economic analysis of clinical decision making is of growing importance. This extends to the collection of evidence (Petrou and Gray 2011b,a). There is increasing pressure to gather information and make cost-effective decisions. Health economists routinely use data generated from clinical trials to assess the cost effectiveness of interventions. However, it can take months or years for patient follow-up to be completed. This delay, alongside the cost and burden placed on patients to complete lengthy study questionnaires, provides opportunities to consider alternative approaches.

There is increasing interest in the use of routine data to inform clinical decision making because of its potential for identifying cost-effective solutions rapidly and inexpensively (Gates et al. 2017; Raftery, Roderick, and Stevens 2005). However, the utility of routine data for this purpose remains uncertain. This issue can be informed by identifying the level of agreement between routine data and a “gold standard” such as existing trial data. While questions remain over what is an acceptable level of agreement, this article introduces a simple-to-use tool that quantifies the agreement between final economic endpoints generated using alternative sources of cost-effectiveness data. The routines implemented within this tool are suitable for use in a wide variety of decision-making contexts. For example, they can be used to compare and validate routine data for use in trial-based economic evaluations when alternative sources of information on costs and effects are available for trial participants.


Achana et al. (2018) introduced a methodological framework for the assessment of agreement using Lin's (1989) concordance correlation coefficient (CCC), the difference in incremental net benefit (INB) estimates, the probability of miscoverage (PMC), and the probability of cost effectiveness (PCE). Building on this work, we present generalizable commands allowing these analyses to be performed with relative ease. The commands are designed for use when data similar to those generated by a typical two-arm clinical trial are available.

Using individual patient data, the commands assist with the calculation of the incremental cost-effectiveness ratio (ICER) and INB. Briefly, the ICER is the ratio of incremental costs (that is, the difference in mean costs between the treatment and control interventions) to incremental effectiveness (that is, the corresponding difference in mean effectiveness between treatment and control interventions),

\[
\text{ICER} = \frac{\text{Cost}_A - \text{Cost}_B}{\text{Effect}_A - \text{Effect}_B}
\]

where Cost_A and Effect_A represent the means of the cost and effect in treatment A, and Cost_B and Effect_B represent the equivalent in treatment group B (where B is the control intervention). The ICER is normally the main summary measure of interest in most economic evaluations. However, as a ratio statistic, ICERs can be problematic to work with mathematically (Glick et al. 2007). The INB transforms comparisons of cost and effect instead to a linear scale and is given by

\[
\text{INB} = \lambda(\text{Effect}_A - \text{Effect}_B) - (\text{Cost}_A - \text{Cost}_B)
\]

where λ is the cost-effectiveness (or willingness-to-pay) threshold, that is, the maximum amount a decision maker is willing to pay per unit of effectiveness gained. INBs can be framed in net monetary terms (as given by the equation above) or net health terms, which are equivalent (see, for example, Glick et al. [2007] for further discussion of cost-effectiveness ratios and INBs).
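As a purely illustrative calculation (hypothetical numbers, not taken from the example below): with λ = 20,000 per unit of effectiveness, an incremental effect of 0.1, and an incremental cost of 1,500, the ICER is 1,500/0.1 = 15,000, which lies below λ, and the INB is 20,000 × 0.1 − 1,500 = 500 > 0; both summaries point to the treatment being cost effective at that threshold.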

In the applications that follow, the INB is used as the statistic for assessing agreement because of the mathematical convenience of working on a linear scale. If two sources of data are available, Lin’s CCC of the two INB estimates will be calculated alongside the difference between the two INB estimates with a 95% confidence interval. Compatible with bootstrapping, the commands allow the calculation of the PMC and the PCE,and they produce a simple plot assessing the cost effectiveness. The current version of the package allows only for assessment of two datasets where the comparison involves analysis of individual participant data on costs and effects (binary or continuous measures) such as that from simple two-arm randomized con- trolled trial data. Future development will focus on extending the routines to allow for i) randomized controlled trials with more complex designs (cluster-randomized, multiple-treatment comparisons, etc.); ii) inclusion of adjustment covariates in a re- gression so that routines can be applied to comparisons involving nonrandomized study designs; and iii) greater customization of the graphical output. D. Gallacher and F. Achana 225 2 The commands 2.1 Description

The heabs command calculates the ICER and INB for up to two pairs of cost-effectiveness data. The command is flexible; if just one pair of cost and effect variables is entered, then only the ICER and INB will be calculated. However, if two sets of cost and effect variables are entered, then the command calculates the ICER and INB for both data sources and performs a simple comparison of the INB scores, calculating the CCC and the difference in INB estimates with its confidence interval. In addition to cost and effect data, the command also requires the user to input the willingness-to-pay threshold and whether a higher score in the effect variable is beneficial (for example, quality of life) or detrimental to the health of the individual (for example, mortality). The command also requires the treatment indicator variable to be encoded as 0 and 1. heabs stores a range of calculated values, allowing for simple use alongside Stata's built-in bootstrap command.

Lin (1989) introduced his CCC as a means of quantifying agreement between two measures. It follows traditional correlation coefficients, with a score of 1 suggesting perfect agreement, −1 suggesting perfect inverse agreement, and 0 suggesting no agreement. McBride (2005) suggests that moderate agreement could be identified if the lower 95% confidence interval of the CCC estimate was above 0.90. However, Cicchetti (2001) suggests a score as low as 0.4 can be taken as a measure of fair agreement. With such a wide range of views, we recommend that the user decide his or her own suitable CCC threshold for determining agreement.

Similarly, the difference in INB has no firm interpretation in terms of its assessment of agreement. The standard errors (SEs) will often be large, reflecting the wide range of costs associated with a typical health economic analysis. Hence, it is unlikely that a significant difference will be found between the two sources of data. Again, we recommend that the user establish an acceptable level of difference between the two sources of data and examine the results of the bootstrap accordingly.

The heapbs command calculates the PMC and PCE for a set of cost-effectiveness data that have been generated through the heabs command combined with the bootstrap prefix. It can also produce a scatterplot of the cost-effectiveness data by fitting a confidence ellipse around them. The command is flexible, calculating only the scores specified by the user.

Achana et al. (2018) recommend implementing the PMC by taking the INB from the routine data source and the confidence intervals of the gold-standard trial data. This yields the probability that the confidence intervals do not contain the INB estimate, allowing for assessment of the agreement between the sets of data.

The PCE, while not a direct assessment of agreement, allows the user to assess what percentage of the bootstrapped INB estimates for a dataset are positive (that is, cost effective). A comparison of the PCE for both sets of cost-effectiveness data could yield further insight into the agreement of the two sources.

Finally, the optionally generated graph allows for the visual interpretation of a set of cost-effectiveness data through a scatterplot, complete with a 95% confidence ellipse. The ellipse is generated using an incorporated version of the command ellip (Alexandersson 2004). This command calculates the confidence ellipse assuming the costs and effects are elliptically distributed, drawing the ellipse with a twoway line. The fitted ellipse should contain roughly 95% of the scattered points.

2.2 Syntax

heabs cost1 effect1 [cost2 effect2], intervention(varname) response(string) [w2p(#)]

cost1 and effect1 represent the cost and effect variables obtained from the first dataset. cost2 and effect2 are the variables from the second dataset, which are required if a comparison is to be performed. If the second pair of variables is not provided, the command performs a simple routine cost-effectiveness analysis. All of these variables must be numeric without any missing data.

heapbs [, lci(varname) uci(varname) ref(#) inb(varname) draw cost(varname) effect(varname) twoway options]

2.3 Options

intervention(varname) specifies the variable that indicates which treatment arm individuals are in. It requires that 0 and 1 be used to distinguish between the two treatment arms. intervention() is required.

response(string) specifies whether a higher effect score is positive (bene) or negative (detr). response() is required.

w2p(#) is the willingness-to-pay threshold, which is used in the calculation of the INB. This reflects how much the decision maker is willing to pay per unit of effect. The default is w2p(0).

lci(varname) specifies the variable containing the bootstrapped estimates of the lower 95% confidence interval of the INB. This option is needed for PMC calculation.

uci(varname) specifies the variable containing the bootstrapped estimates of the upper 95% confidence interval of the INB. This option is needed for PMC calculation.

ref(#) is the reference INB to be used in the PMC. Achana et al. (2018) suggest using the INB from one dataset as the reference and comparing it with the 95% confidence intervals of the other dataset.

inb(varname) specifies the variable containing the bootstrapped estimates of the INB. This option is needed for PCE calculation.

draw specifies if the user would like to generate a plot of the cost-effectiveness bootstrapped data with a confidence ellipse.

cost(varname) specifies the variable containing the bootstrapped cost estimates. This option is needed to draw the plot.

effect(varname) specifies the variable containing the bootstrapped effect estimates. This option is needed to draw the plot.

twoway options allow the control of title, legend, axis, and ellipse settings. See the ellip command for further details (Alexandersson 2004).

2.4 Stored results

heabs stores the following in r():

Scalars
    r(cost1)     incremental cost from the first set of cost-effectiveness data
    r(outcome1)  incremental effect from the first set of cost-effectiveness data
    r(cost2)     incremental cost from the second set of cost-effectiveness data
    r(outcome2)  incremental effect from the second set of cost-effectiveness data
    r(NB1)       INB from the first set of data
    r(seNB1)     SE of the INB from the first set of data
    r(loCINB1)   lower 95% confidence interval of the INB from the first set of data
    r(upCINB1)   upper 95% confidence interval of the INB from the first set of data
    r(NB2)       INB from the second set of data
    r(seNB2)     SE of the INB from the second set of data
    r(loCINB2)   lower 95% confidence interval of the INB from the second set of data
    r(upCINB2)   upper 95% confidence interval of the INB from the second set of data
    r(diffNB)    difference in the INB estimates of the two sets of data
    r(ICER1)     ICER from the first set of data
    r(ICER2)     ICER from the second set of data
    r(cccNB)     CCC estimate of the two sources of data
    r(zcccNB)    z score of CCC estimate; can be used in hypothesis testing

heapbs stores the following in r():

Scalars
    r(pce)  PCE
    r(pmc)  PMC

3 Example

An example of the commands is shown through manipulation of the built-in bpwide.dta dataset, which can be reproduced using the do-file included in the package. Blood pressure is the outcome of interest, with a higher score associated with poorer health. Gender is recoded as the intervention indicator, and artificial cost data are created based loosely on the blood pressure data. The "before" variable is treated as the first gold-standard trial dataset, and the "after" variable is treated as the comparator. The full list of changes is shown through the commands below:

. sysuse bpwide
(fictional blood-pressure data)
. set seed 123
. _strip_labels _all
. drop agegrp
. generate cost1 = (200-bp_before)*rnormal(50,10)
. generate cost2 = (250-bp_after)*rnormal(50,10)
. rename bp_before bp1
. rename bp_after bp2
. rename sex intervention

Once the dataset is prepared, the functions can be applied as follows. With no second source of data indicated, the command performs a simple routine cost-effectiveness analysis. The output displays a summary of the data, followed by estimates of the ICER, INB, and INB SE.

. heabs cost1 bp1, intervention(intervention) response(detr) w2p(10)
Summary:
                   Int 0        Int 1
  N                   60           60
  Min Cost      98.84935    667.84313
  Max Cost     3700.2055    3592.1422
  Min Effect         140          138
  Max Effect         185          185

               Cost     Effect   Inc Cost   Inc Effect     ICER

  Int 0    2049.930    159.267    262.202        5.633   46.545
  Int 1    2312.132    153.633

                      INB     INB SE

  INB Results    -205.869    118.961

If a second set of data is added, then a comparison is performed with the routine analysis. Here the estimates for the first dataset are unchanged from above, but the corresponding estimates for the second data source are also displayed. In addition, the CCC estimate and the difference of the INB estimates are shown. Note that if a willingness-to-pay threshold of £10 per unit decrease in blood pressure is used, both sources of data agree that the intervention is not cost effective, yielding negative INB scores. The CCC suggests very weak agreement between the two sources of data, and we can see the difference in the INB estimates is 110.85 units. However, the ICERs are very similar, with less than two units' difference.

. heabs cost1 bp1 cost2 bp2, intervention(intervention) response(detr) w2p(10)
Summary:
                        Data 1                     Data 2
                  Int 0       Int 1          Int 0       Int 1
  N                  60          60             60          60
  Min Cost     98.84935   667.84313      2947.6459   2947.6459
  Max Cost    3700.2055   3592.1422      7765.9072   7765.9072
  Min Effect        140         138            125         127
  Max Effect        185         185            185         178

              ICER        INB     INB SE      CCC    Diff INB

  DATA 1    46.545   -205.869    118.961    0.075    -110.853
  DATA 2    48.083   -316.722    197.863

The relationship between the CCC estimate and the INB estimates can be shown by changing the willingness-to-pay threshold. Below, the threshold is increased to £100, and while the ICERs and INB estimates suggest that the treatment is now cost effective, the CCC is negative, and the difference between the INB estimates has increased. A negative CCC implies that the datasets are closer to drawing opposite conclusions than perfect agreement. While this appears strange given the apparent agreement of the INB estimates and ICERs, an investigation of the costs and blood pressure scores explains why. In figure 1, we see that while the data are paired, there is very little correlation between the two sources of data for both the costs and blood pressure scores. It just so happens that the populations agree. It is reasonable to expect a higher correlation between real paired individual-level data.

. heabs cost1 bp1 cost2 bp2, intervention(intervention) response(detr) w2p(100)
  (output omitted)

              ICER        INB     INB SE       CCC    Diff INB

  DATA 1    46.545    301.131    123.958    -0.041     130.647
  DATA 2    48.083    431.778    208.976

Figure 1. Scatterplots showing the relationship between the costs and blood pressure scores of the sources of data

The command can be implemented simply within the bootstrap prefix of Stata, as demonstrated below. The key variables used for the heapbs command are shown; however, additional outputs can be added.

. bootstrap cost1=r(cost1) cost2=r(cost2) effect1=r(outcome1)
>     effect2=r(outcome2) NB1=r(NB1) NB2=r(NB2) NB1Lo=r(loCINB1) NB1Up=r(upCINB1)
>     NB2Lo=r(loCINB2) NB2Up=r(upCINB2), saving(dummybpbs, replace) reps(100)
>     seed(24): heabs cost1 bp1 cost2 bp2, w2p(10) intervention(intervention)
>     response(detr)
  (output omitted)

               Observed   Bootstrap                      Normal-based
                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

  cost1        262.2022    136.2642   1.92    0.054    -4.870768    529.2751
  cost2        399.8885      203.18   1.97    0.049     1.663059    798.1139
  effect1      5.633333    1.824368   3.09    0.002     2.057638    9.209029
  effect2      8.316667    2.338167   3.56    0.000     3.733945    12.89939
  NB1         -205.8688    122.2152  -1.68    0.092    -445.4063    33.66862
  NB2         -316.7218    190.0119  -1.67    0.096    -689.1382    55.69465
  NB1Lo       -439.0318    120.8765  -3.63    0.000    -675.9454   -202.1182
  NB1Up        27.29416    124.7917   0.22    0.827     -217.293    271.8813
  NB2Lo       -704.5338    186.0972  -3.79    0.000    -1069.278     -339.79
  NB2Up        71.09025    197.8324   0.36    0.719    -316.6542    458.8347

Once a bootstrapped dataset has been created, stored, and loaded into Stata, the heapbs command can be applied. Here the second dataset is treated as our routine dataset, and the first dataset is treated as the gold-standard data. The text output from the command shows that the PMC is 21%, meaning that the INB estimate from the second source of data fails to appear within the 95% confidence interval from the bootstrapped dataset for the first source of data 21% of the time. The PCE estimate suggests that for the willingness-to-pay threshold selected during the bootstrap run, the drug is cost effective only 4% of the time.

. use dummybpbs.dta, clear
(bootstrap: heabs)
. heapbs, lci(NB1Lo) uci(NB1Up) ref(-316.722) inb(NB2) draw cost(cost2)
>     effect(effect2) graphregion(color(white))
Probability of Miscoverage = 21%
Probability of Cost Effectiveness = 4%
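The PMC reported above can also be checked by hand from the bootstrapped dataset; the following is only a sketch, where −316.722 is the reference INB passed to ref() and NB1Lo and NB1Up are the bootstrapped confidence limits saved earlier.

. count if -316.722 < NB1Lo | -316.722 > NB1Up
. display "PMC = " 100*r(N)/_N "%"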

Figure 2 shows the graphical output. Here the scatterplot for the second source of cost-effectiveness data shows the treatment to be more effective and more expensive in the majority of bootstrap runs, with a 95% confidence ellipse and mean values clearly indicated.

[Figure 2 here: scatterplot of the bootstrapped cost and effect estimates with the 95% confidence ellipse and mean marked; boundary constant = 6.1784]

Figure 2. Plot of cost-effectiveness data for second source of data with 95% confidence ellipse

4 Conclusion

The heabs and heapbs commands described and demonstrated in this article are simple tools to aid with the evaluation of individual-level cost-effectiveness data. They also give users the opportunity to compare two sources of cost-effectiveness data with the aim of enabling more efficient clinical trial designs in the future. The flexibility of the commands allows users to calculate only the values they require and to customize the graphical output accordingly.

5 Acknowledgments

This work built on ideas from Felix Achana and Stavros Petrou (Warwick Clinical Trials Unit). Daniel Gallacher is funded by the National Institute of Health Research (Research Methods Fellowship). In this article, we present independent research funded by the National Institute for Health Research. The views expressed are those of the authors and not necessarily those of the NHS, the National Institute for Health Research, or the Department of Health.

6 References

Achana, F., S. Petrou, K. Khan, A. Gaye, and N. Modi. 2018. A methodological framework for assessing agreement between cost-effectiveness outcomes estimated using alternative sources of data on treatment costs and effects for trial-based economic evaluations. European Journal of Health Economics 19: 75–86.

Alexandersson, A. 2004. Graphing confidence ellipses: An update of ellip for Stata 8. Stata Journal 4: 242–256.

Cicchetti, D. V. 2001. Methodological commentary: The precision of reliability and validity estimates re-visited: Distinguishing between clinical and statistical significance of sample size requirements. Journal of Clinical and Experimental Neuropsychology 23: 695–700.

Gates, S., R. Lall, T. Quinn, C. D. Deakin, M. W. Cooke, J. Horton, S. E. Lamb, A. M. Slowther, M. Woollard, A. Carson, M. Smyth, K. Wilson, G. Parcell, A. Rosser, R. Whitfield, A. Williams, R. Jones, H. Pocock, N. Brock, J. J. Black, J. Wright, K. Han, G. Shaw, L. Blair, J. Marti, C. Hulme, C. McCabe, S. Nikolova, Z. Ferreira, and G. D. Perkins. 2017. Prehospital randomised assessment of a mechanical compression device in out-of-hospital cardiac arrest (PARAMEDIC): A pragmatic, cluster randomised trial and economic evaluation. Health Technology Assessment 21: 1–176.

Glick, H. A., J. A. Doshi, S. S. Sonnad, and D. Polsky. 2007. Economic Evaluation in Clinical Trials. Oxford: Oxford University Press.

Lin, L. I.-K. 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics 45: 255–268.

McBride, G. B. 2005. A proposal for strength-of-agreement criteria for Lin's concordance correlation coefficient. NIWA Client Report HAM2005-062, National Institute of Water and Atmospheric Research. https://www.medcalc.org/download/pdf/McBride2005.pdf.

Petrou, S., and A. Gray. 2011a. Economic evaluation alongside randomised controlled trials: Design, conduct, analysis, and reporting. British Medical Journal 342: d1548.

———. 2011b. Economic evaluation using decision analytical modelling: Design, conduct, analysis, and reporting. British Medical Journal 342: d1766.

Raftery, J., P. Roderick, and A. Stevens. 2005. Potential use of routine databases in health technology assessment. Health Technology Assessment 9: 1–92.

About the authors

Daniel Gallacher is a research associate at the Warwick Clinical Trials Unit. His work includes providing statistical support on clinical trials and assisting with health technology appraisals.

Felix Achana is a research fellow at the Warwick Clinical Trials Unit. He provides health economic support on clinical trials and has an interest in economic evaluations on health technology appraisals.

The Stata Journal (2018) 18, Number 1, pp. 234–261

Manipulation testing based on density discontinuity

Matias D. Cattaneo
University of Michigan
Ann Arbor, MI
[email protected]

Michael Jansson
University of California at Berkeley and CREATES
Berkeley, CA
[email protected]

Xinwei Ma
University of Michigan
Ann Arbor, MI
[email protected]

Abstract. In this article, we introduce two community-contributed commands, rddensity and rdbwdensity, that implement automatic manipulation tests based on density discontinuity and are constructed using the results for local-polynomial density estimators in Cattaneo, Jansson, and Ma (2017b, Simple local polynomial density estimators, Working paper, University of Michigan). These new tests exhibit better size properties (and more power under additional assumptions) than other conventional approaches currently available in the literature. The first command, rddensity, implements manipulation tests based on a novel local-polynomial density estimation technique that avoids prebinning of the data (improving size properties) and allows for restrictions on other features of the model (improving power properties). The second command, rdbwdensity, implements several bandwidth selectors specifically tailored for the manipulation tests discussed herein. We also provide a companion R package with the same syntax and capabilities as rddensity and rdbwdensity.

Keywords: st0522, rddensity, rdbwdensity, falsification test, manipulation test, regression discontinuity

1 Introduction

McCrary (2008) introduced the idea of manipulation testing in the context of regression discontinuity (RD) designs. Consider a setting where each unit in a random sample from a large population is assigned to one of two groups depending on whether one of their observed covariates exceeds a known cutoff. In this context, the two possible groups are generically referred to as control and treatment groups. The observed variable, determining group assignment, is generically referred to as the score, index, or running variable. The key idea behind manipulation testing in this context is that in the absence of systematic manipulation of the unit's index around the cutoff, the density of units should be continuous near this cutoff value. Thus, a manipulation test seeks to formally determine whether there is evidence of a discontinuity in the density of units at the

known cutoff. Presence of such evidence is usually interpreted as empirical evidence of self-selection or nonrandom sorting of units into control and treatment status.

Manipulation testing is useful for falsification of RD designs: see Cattaneo and Escanciano (2017) for an edited volume with a recent overview of the RD literature; see Cattaneo, Titiunik, and Vazquez-Bare (2017c) for a practical introduction to RD designs with a comparison between leading empirical methods; see Calonico, Cattaneo, and Titiunik (2015a) for a discussion of graphical presentation and falsification of RD designs; and see references therein for other related topics. In addition to providing a formal statistical check for RD designs, a manipulation test can be used substantively whenever the empirical goal is to test for self-selection or endogenous sorting of units exposed to a known hard threshold-crossing assignment rule. Thus, flexible data-driven implementations of manipulation tests with good size and power properties are potentially very useful for empirical work in economics and related social sciences.

To implement a manipulation test, the researcher must estimate the density of units near the cutoff to conduct a hypothesis test about whether the density is discontinuous. Three distinct manipulation tests have been proposed in the literature. First, McCrary (2008) introduced a test based on the nonparametric local-polynomial density estimator of Cheng, Jianqing, and Marron (1997), which requires prebinning of the data and hence introduces additional tuning parameters. Second, Otsu, Xu, and Matsushita (2014) proposed an empirical likelihood method using boundary-corrected kernels. Third, Cattaneo, Jansson, and Ma (2017b) developed a set of manipulation tests based on a novel local-polynomial density estimator, which does not require prebinning of the data and is constructed in an intuitive way based on easy-to-interpret kernel functions. The latter procedures are shown to also provide demonstrable improvements in both size and power under appropriate assumptions and relative to the other approaches currently available in the literature. Finally, Frandsen (2017) recently proposed a manipulation test in the context of RD designs with a discrete running variable.

In this article, we discuss data-driven implementations of manipulation tests following the results in Cattaneo, Jansson, and Ma (2017b). We introduce two commands that together give several manipulation test implementations, which depend on i) the restrictions imposed in the underlying data-generating process, ii) the method for bias correction, iii) the bandwidth selection approach, and iv) the method to estimate standard errors (SEs), among many other alternatives. Specifically, our command rddensity implements two distinct manipulation tests given a choice of bandwidth and SE estimator: one test is constructed using the basic Wald statistic, which requires undersmoothing, while the other test uses robust bias-correction (Calonico, Cattaneo, and Farrell Forthcoming) to obtain valid statistical inference. This command also allows for both unrestricted and restricted models, where the cumulative distribution function and higher-order derivatives are assumed to be equal for both groups (thereby increasing the power of the test). These methods can also be used under additional assumptions in the context of RD designs with a discrete running variable.
We review the main aspects of these different approaches to manipulation testing in section 2 below, but we refer the reader to Cattaneo, Jansson, and Ma (2017b) and its supplemental appendix for most of the technical and theoretical discussion.

The command rddensity also offers a plot of the manipulation test. To implement this plot, a density estimate must be constructed not only at the cutoff point but also at nearby evaluation points, which may also be affected by boundary bias. Thus, to construct this plot in a principled way, rddensity uses the package lpdensity, which implements local-polynomial–based density estimation methods. This density estimation package must be installed to construct the density plot; see Cattaneo, Jansson, and Ma (2017a) for further details. If lpdensity is not installed, then rddensity issues an error message when trying to construct a manipulation test plot.

To complement the command rddensity, and because in empirical applications researchers often want to select the bandwidth entering the manipulation test in a data-driven and automatic way, we introduce the companion command rdbwdensity, which provides several bandwidth selection methods based on asymptotic mean squared error (AMSE) minimization. Our implementation accounts for whether the unrestricted or the restricted model is used for inference, and it also allows for both different bandwidths on either side of the cutoff (whenever possible) and a common bandwidth for both sides. By default, rddensity uses the companion command rdbwdensity to estimate the bandwidth(s) whenever the user does not provide a specific choice, thereby giving fully automatic and data-driven inference procedures implemented by rddensity.

The commands rddensity and rdbwdensity complement the recently introduced Stata community-contributed commands and R functions rdrobust, rdbwselect, and rdplot, which are useful for graphical presentation, estimation, and inference in RD designs and use nonparametric local-polynomial techniques. For an introduction to the latter commands, see Calonico, Cattaneo, and Titiunik (2014a, 2015b) and Calonico et al. (2017). Together, the five commands offer a complete toolkit for empirical work using RD designs. In addition, see Cattaneo, Titiunik, and Vazquez-Bare (2016) for Stata commands and R functions (rdrandinf, rdwinselect, rdsensitivity, rdrbounds) implementing randomization-based inference methods for RD designs under a local randomization assumption.

The rest of this article is organized as follows. In section 2, we provide a brief review of the methods implemented in our two commands. In sections 3 and 4, we describe the syntax of rddensity and rdbwdensity, respectively. In section 5, we illustrate some of the functionalities of our commands using real data from Cattaneo, Frandsen, and Titiunik (2015). In section 6, we report results from a small-scale simulation study, tailored to investigate the performance of our testing procedure. We also provide a companion R package with the same functionalities and syntax.
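As a quick orientation before the methods overview, typical calls look roughly like the following. This is only a hedged sketch: X stands for a hypothetical running variable with cutoff 0, and the density plot additionally requires the lpdensity package mentioned above.

. rddensity X, c(0) plot
. rdbwdensity X, c(0)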

The latest version of this software, as well as other related software for RD designs, can be found at https://sites.google.com/site/rdpackages/.

2 Methods overview

This section offers a brief overview of the methods implemented in our commands rddensity and rdbwdensity. We closely follow the results in Cattaneo, Jansson, and Ma (2017b), including those in their supplemental appendix. Regularity conditions and most of the technical details are not discussed here to conserve space and ease the exposition.

2.1 Setup and notation

We assume that $\{X_1, X_2, \ldots, X_n\}$ is a random sample of size n from the random variable X with the cumulative distribution function (c.d.f.) and probability density function given by $F(x)$ and $f(x)$, respectively. The random variable $X_i$ denotes the score, index, or running variable of unit i in the sample. Each unit is assigned to control or treatment depending on whether its observed index exceeds a known cutoff denoted by $\bar{x}$. That is, treatment assignment is given by

\[
\text{unit } i \text{ assigned to control group if } X_i < \bar{x}, \qquad \text{unit } i \text{ assigned to treatment group if } X_i \ge \bar{x}
\]

where the cutoff point $\bar{x}$ is known and, of course, we assume enough observations for each group are available (for example, $f(x) > 0$ near $\bar{x}$ and the sample is large enough). In the specific case of RD designs, manipulation testing can be used for sharp RD designs (where treatment assignment and treatment status coincide) and for fuzzy RD designs (where treatment assignment and treatment status differ). In the latter case, of course, the test applies to the intention-to-treat mechanism because units can select into treatment or control status beyond the hard-thresholding rule for treatment assignment (that is, assigned to control group if $X_i < \bar{x}$ and assigned to treatment group if $X_i \ge \bar{x}$).

A manipulation test in this context is a hypothesis test on the continuity of the density $f(\cdot)$ at the cutoff point $\bar{x}$. Formally, we are interested in the testing problem

    H_0 : \lim_{x \uparrow \bar{x}} f(x) = \lim_{x \downarrow \bar{x}} f(x)
    \quad \text{versus} \quad
    H_a : \lim_{x \uparrow \bar{x}} f(x) \neq \lim_{x \downarrow \bar{x}} f(x)

To construct a test statistic for this hypothesis testing problem, we follow Cattaneo, Jansson, and Ma (2017b) and estimate the density f(x) using a local-polynomial density estimator based on the c.d.f. of the observed sample. This estimator has several interesting properties, including the fact that it does not require prebinning of the data and is quite intuitive in its implementation (for example, simple second-order kernels can be used). Importantly, this estimator also permits incorporating restrictions on the c.d.f. and higher-order derivatives of the density, leading to new manipulation tests with better power properties in applications. For an introduction to conventional local-polynomial techniques, see, for example, Fan and Gijbels (1996).

The manipulation test statistics implemented in rddensity take the form

    T_p(h) = \frac{\hat{f}_{+,p}(h) - \hat{f}_{-,p}(h)}{\hat{V}_p(h)},
    \qquad
    \hat{V}_p^2(h) = \hat{V}\{\hat{f}_{+,p}(h) - \hat{f}_{-,p}(h)\}

where T_p(h) is asymptotically distributed as N(0, 1) under appropriate assumptions, and the notation V̂{·} is meant to denote some plug-in consistent estimator of the population quantity V{·}. The parameter h is the bandwidth(s) used to localize the estimation and inference procedures near the cutoff point x̄. The statistics may be constructed in several different ways, as we discuss in more detail below. In particular, given a choice of bandwidth(s), two main ingredients to construct the test statistic T_p(h) are i) the local-polynomial density estimators f̂_{+,p}(h) and f̂_{-,p}(h), and ii) the corresponding SE estimator V̂_p(h). These estimators also depend on the choice of polynomial order p, the choice of kernel function K(·), and the restrictions imposed in the model, among other possibilities. The SE formulas V̂_p(h) could be based on either an asymptotic plug-in or a jackknife approach, and their specific form will depend on whether additional restrictions are imposed on the model.

A crucial ingredient is, of course, the choice of bandwidth h, which determines which observations near the cutoff x̄ are used for estimation and inference. This choice can either be specified by the user or be estimated using the available data. Our commands allow, when possible, for different bandwidth choices on either side of the cutoff x̄. A common bandwidth on both sides of the cutoff is always possible.

In the following two subsections, we discuss the alternatives for estimation and inference: i) unrestricted inference, ii) restricted inference, iii) SE estimation in both cases, and iv) bandwidth selection in both cases. In closing this section, we also offer a brief review of the different data-driven inference methods implemented in our Stata (and R) commands.

2.2 Unrestricted testing

In unrestricted testing, the manipulation test becomes a standard two-sample problem where the estimators f̂_{+,p}(h) and f̂_{-,p}(h) are unrelated. Thus, the SE formula reduces to

    \hat{V}_p^2(h) = \hat{V}\{\hat{f}_{+,p}(h) - \hat{f}_{-,p}(h)\}
                   = \hat{V}\{\hat{f}_{+,p}(h)\} + \hat{V}\{\hat{f}_{-,p}(h)\}

To be more concrete, the density estimators take the form

    \hat{f}_{-,p}(h) = e_1' \hat{\beta}_{-,p}(h)
    \quad \text{and} \quad
    \hat{f}_{+,p}(h) = e_1' \hat{\beta}_{+,p}(h)

with

    \hat{\beta}_{-,p}(h) = \arg\min_{\beta \in \mathbb{R}^{p+1}}
    \sum_{i=1}^{n} 1(X_i < \bar{x}) \left\{ \hat{F}(X_i) - r_p(X_i - \bar{x})'\beta \right\}^2 K_h(X_i - \bar{x})

    \hat{\beta}_{+,p}(h) = \arg\min_{\beta \in \mathbb{R}^{p+1}}
    \sum_{i=1}^{n} 1(X_i \geq \bar{x}) \left\{ \hat{F}(X_i) - r_p(X_i - \bar{x})'\beta \right\}^2 K_h(X_i - \bar{x})

where F̂(X_i) denotes the (leave-one-out) c.d.f. estimator for the full sample, r_p(x) = (1, x, ..., x^p)', e_1 = (0, 1, 0, 0, ..., 0)' ∈ R^{p+1} is the second unit vector, K_h(u) = K(u/h)/h with K(·) being a kernel function, h is a positive bandwidth sequence, and 1(·) denotes the indicator function. Following the results in Cattaneo, Jansson, and Ma (2017b), it can be shown that β̂_{-,p}(h) and β̂_{+,p}(h) roughly approximate, respectively,

    [F_-, f_-, (1/2!) f_-^{(1)}, \ldots, \{1/(p+1)!\} f_-^{(p)}]
    \quad \text{and} \quad
    [F_+, f_+, (1/2!) f_+^{(1)}, \ldots, \{1/(p+1)!\} f_+^{(p)}]

where we use the notation

    f_-^{(s)} = \lim_{x \uparrow \bar{x}} \frac{\partial^s}{\partial x^s} f(x)
    \quad \text{and} \quad
    f_+^{(s)} = \lim_{x \downarrow \bar{x}} \frac{\partial^s}{\partial x^s} f(x),
    \qquad s = 1, 2, \ldots, p

Thus, the second elements of β̂_{-,p}(h) and β̂_{+,p}(h) are used to construct consistent, boundary-corrected density estimators at the cutoff point x̄ entering the numerator of the manipulation test statistic. Notice that this estimation approach avoids prebinning of the data and may be constructed using simple and easy-to-interpret kernels K(·), such as the uniform or triangular kernels.

Consistency, asymptotic normality, and moment approximations for the estimators f̂_{-,p}(h) and f̂_{+,p}(h) are derived in Cattaneo, Jansson, and Ma (2017b), where these results are also used to study the asymptotic properties of the unrestricted manipulation tests implemented in our main command, rddensity. We discuss the exact implementations at the end of this section.
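To make the construction above concrete, the following rough sketch computes the left-side estimate "by hand" for a hypothetical running variable x, using order p = 2, a triangular kernel, and an arbitrary cutoff and bandwidth chosen only for illustration. It uses the plain empirical c.d.f. rather than the leave-one-out estimator described above, so it is a conceptual approximation of the idea, not the packaged rddensity implementation.

    * Rough conceptual sketch (not the rddensity implementation): left-side
    * density estimate at a cutoff, order p = 2, triangular kernel.
    * The variable x and the values of c and h below are hypothetical.
    local c = 0
    local h = 0.5
    sort x
    generate double Fhat = _n/_N                 // empirical c.d.f. (not leave-one-out)
    generate double u    = (x - `c')/`h'
    generate double w    = (1 - abs(u))/`h' if abs(u) < 1 & x < `c'  // K_h weights, left side
    generate double z1   = x - `c'
    generate double z2   = (x - `c')^2
    regress Fhat z1 z2 [aweight = w] if w < .    // weighted local quadratic fit of the c.d.f.
    display as text "f-hat_{-,2}(h) = " as result _b[z1]  // slope coefficient estimates f at the cutoff

The slope on the linear term plays the role of the second element of the minimizer, which is exactly why e_1 above selects that coordinate as the boundary-corrected density estimate.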

2.3 Restricted testing

The approach described above treats manipulation testing as a traditional two-sample problem, where the left and right approximations to the density f(x) at the cutoff x̄ are done independently. However, in the context of manipulation testing, it may be argued that the c.d.f. F(x) and higher-order derivatives f^{(s)}(x), s ≥ 1, are equal for both groups at the cutoff even when f_- ≠ f_+ (that is, H_0 is false). In this case, researchers may wish to incorporate these restrictions to construct a more powerful testing procedure.

A restricted manipulation test is constructed by solving the above weighted (local) least-squares problem with the additional restrictions ensuring that all but the second element in β are equal in both groups. It follows that this restricted problem can be represented as a single regression problem,

    \hat{\beta}_p^R(h) = \arg\min_{\beta \in \mathbb{R}^{p+2}}
    \sum_{i=1}^{n} \left\{ \hat{F}(X_i) - r_p^R(X_i - \bar{x})'\beta \right\}^2 K_h(X_i - \bar{x})

where r_p^R(x) = \{1,\; x \cdot 1(x < 0),\; x \cdot 1(x \geq 0),\; x^2,\; x^3,\; \ldots,\; x^p\}', so that the restricted density estimators are the second and third elements of β̂_p^R(h), with e_2 = (0, 1, 0, 0, ..., 0)' ∈ R^{p+2} denoting the second unit vector and e_3 = (0, 0, 1, 0, 0, ..., 0)' ∈ R^{p+2} denoting the third unit vector. Furthermore, the SE formula is different because now cross-restrictions are incorporated in the estimation procedure, leading to a different asymptotic variance for {f̂^R_{-,p}(h), f̂^R_{+,p}(h)} and, consequently, for f̂^R_{+,p}(h) − f̂^R_{-,p}(h) as well. This is exactly the source of the power gains of a restricted manipulation test relative to an unrestricted one. In this case, the SE formula satisfies

    \hat{V}_p^2(h) = \hat{V}\{\hat{f}^R_{+,p}(h) - \hat{f}^R_{-,p}(h)\}
                   \neq \hat{V}\{\hat{f}^R_{+,p}(h)\} + \hat{V}\{\hat{f}^R_{-,p}(h)\}

Formal asymptotic properties for restricted local-polynomial density estimation and inference are also discussed in Cattaneo, Jansson, and Ma (2017b), where these results are then used to propose an asymptotically valid restricted manipulation test. We also discuss the exact implementations at the end of this section.

2.4 Standard errors

As mentioned above, the asymptotic variance entering the denominator of the manipulation test statistic T_p(h) will be different depending on whether an unrestricted or a restricted model is used. Our commands rddensity and rdbwdensity allow for both cases. In addition, for each of these cases, the commands provide two distinct consistent SE estimators: i) a plug-in estimator based on the asymptotic variance of the numerator of T_p(h), and ii) a jackknife estimator based on the leading term of an expansion of the asymptotic variance of the numerator of T_p(h).

The plug-in estimator is faster because it essentially requires no additional estimation beyond the quantities entering the numerator of T_p(h), but it relies on asymptotic approximations. On the other hand, the jackknife estimator is slower because it requires additional estimation and looping over the data, but according to simulation evidence in Cattaneo, Jansson, and Ma (2017b), it appears to provide a more accurate approximation to the finite-sample variability of the numerator of T_p(h) in both cases (unrestricted model and restricted model). Therefore, our implementations use the jackknife SE estimator by default, but we also offer the plug-in estimator for cases involving relatively large sample sizes.

2.5 Bandwidth selection

rddensity requires the specification of bandwidths for estimation; otherwise, it uses rdbwdensity to construct data-driven bandwidth choices specifically tailored for the manipulation tests discussed in this article. In this subsection, we briefly outline the data-driven implementations provided in rdbwdensity for automatic bandwidth selection.

In the unrestricted model, the user has the option to specify two distinct bandwidths: h_l for left estimation and h_r for right estimation. Of course, one such choice may be equal bandwidths: h = h_l = h_r. In the restricted model, however, only a common bandwidth h can be specified because estimation is done jointly by construction. For each of these cases, whenever possible, Cattaneo, Jansson, and Ma (2017b) develop three distinct approaches to select the bandwidth(s) using the mean squared error (MSE) criterion function, generically denoted by MSE(θ̂) = E{(θ̂ − θ)²}, where θ̂ denotes some estimator and θ denotes its target estimand.

For the specific context considered in this article, Cattaneo, Jansson, and Ma (2017b) develop valid asymptotic expansions of several MSE criterion functions. These results can be briefly summarized as

    \mathrm{MSE}\{\hat{\theta}(h)\} \approx \mathrm{AMSE}\{\hat{\theta}(h)\}

where

    \mathrm{AMSE}\{\hat{\theta}(h)\}
    = \left\{ h^{p+1} B_p(\theta) \right\}^2
    + \left\{ h^{p+2} B_{p+1}(\theta) \right\}^2
    + \frac{1}{nh} V_p(\theta)

with, for the unrestricted model,

    \hat{\theta}(h) \text{ representing one of }
    \{\hat{f}_{-,p}(h);\; \hat{f}_{+,p}(h);\; \hat{f}_{+,p}(h) - \hat{f}_{-,p}(h);\; \hat{f}_{+,p}(h) + \hat{f}_{-,p}(h)\}

and with, for the restricted model,

    \hat{\theta}(h) \text{ representing one of }
    \{\hat{f}^R_{+,p}(h) - \hat{f}^R_{-,p}(h);\; \hat{f}^R_{+,p}(h) + \hat{f}^R_{-,p}(h)\}

and, of course, with

    \theta \text{ representing one of } \{f_-;\; f_+;\; f_+ - f_-;\; f_+ + f_-\}

as appropriate according to the choice of θ̂(h). Crucially, for each combination of estimator θ̂(h) and estimand θ, the corresponding bias constants {B_p(θ), B_{p+1}(θ)} and variance constant V_p(θ) are different. In all cases, as is usual in nonparametric problems, these constants involve features of both the data-generating process and the nonparametric estimator.

Given a choice of estimator and estimand, and provided that preliminary estimates of the leading asymptotic constants in the associated MSE expansion are available, it is straightforward to construct a plug-in bandwidth selector. In our implementations, we consider the five alternative plug-in rules for bandwidth selection mentioned above, which use simple rules of thumb to approximate the leading constants in the MSE expansions. Technical and methodological details underlying these choices are given in Cattaneo, Jansson, and Ma (2017b) and not reproduced here to conserve space.

Specifically, rdbwdensity allows for the following alternative bandwidth selectors. We do not introduce additional notation to reflect the estimation of the leading AMSE constants only to ease the exposition, but our implementations are automatic because they rely on preliminary rule-of-thumb estimators to approximate those constants.

• Unrestricted model with different bandwidths: when no restrictions are imposed on the model and (h_l, h_r) are allowed to be different, the bandwidths are chosen to minimize the AMSE of the density estimators separately, that is,

    \hat{h}_{l,p} = \arg\min_{h>0} \mathrm{AMSE}\{\hat{f}_{-,p}(h)\}
    \quad \text{and} \quad
    \hat{h}_{r,p} = \arg\min_{h>0} \mathrm{AMSE}\{\hat{f}_{+,p}(h)\}

• Unrestricted model with equal bandwidth: when no restrictions are imposed on the model but (h_l, h_r) are forced to be equal, the common bandwidth may be chosen in two distinct ways:

  1. Difference of densities:

    \hat{h}_{\mathrm{diff},p} = \arg\min_{h>0} \mathrm{AMSE}\{\hat{f}_{+,p}(h) - \hat{f}_{-,p}(h)\}

  2. Sum of densities:

    \hat{h}_{\mathrm{sum},p} = \arg\min_{h>0} \mathrm{AMSE}\{\hat{f}_{+,p}(h) + \hat{f}_{-,p}(h)\}

• Restricted model: when the restrictions are imposed on the model, then h = h_l = h_r by construction. Also in this case, the common bandwidth may be chosen in two distinct ways:

  1. Difference of densities:

    \hat{h}^R_{\mathrm{diff},p} = \arg\min_{h>0} \mathrm{AMSE}\{\hat{f}^R_{+,p}(h) - \hat{f}^R_{-,p}(h)\}

  2. Sum of densities:

    \hat{h}^R_{\mathrm{sum},p} = \arg\min_{h>0} \mathrm{AMSE}\{\hat{f}^R_{+,p}(h) + \hat{f}^R_{-,p}(h)\}

All the bandwidth selectors above have closed-form solutions, and their specific decay rates (as a function of the sample size) depend on the specific choices; a sketch of such a closed form is given after the list below. This is an important point because B_p(θ) or B_{p+1}(θ) may be 0 depending on the choice of θ and p. For example, if θ = f_+ − f_-, then B_p(θ) ∝ f_+^{(p)} − f_-^{(p)} when p = 2, but B_p(θ) ∝ f_+^{(p)} + f_-^{(p)} when p = 3, which implies that B_p(θ) = 0 when p = 2 under the plausible assumption that f_+^{(p)} = f_-^{(p)}. Following the discussion in Cattaneo, Jansson, and Ma (2017b), we also implement two simple "regularization" approaches that avoid these possible degeneracies:

• Unrestricted model with different bandwidths:

    \hat{h}_{l,\mathrm{comb},p} = \mathrm{median}\{\hat{h}_{l,p},\; \hat{h}_{\mathrm{diff},p},\; \hat{h}_{\mathrm{sum},p}\}
    \qquad
    \hat{h}_{r,\mathrm{comb},p} = \mathrm{median}\{\hat{h}_{r,p},\; \hat{h}_{\mathrm{diff},p},\; \hat{h}_{\mathrm{sum},p}\}

• Unrestricted model with equal bandwidth:

    \hat{h}_{\mathrm{comb},p} = \min\{\hat{h}_{\mathrm{diff},p},\; \hat{h}_{\mathrm{sum},p}\}

• Restricted model:

    \hat{h}^R_{\mathrm{comb},p} = \min\{\hat{h}^R_{\mathrm{diff},p},\; \hat{h}^R_{\mathrm{sum},p}\}
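As a back-of-the-envelope sketch of the closed forms referenced above, keep only the leading bias term of the AMSE and assume B_p(θ) ≠ 0; this is meant only as an illustration of where the closed-form bandwidth and its decay rate come from, not as the exact expression implemented in rdbwdensity.

\[
\mathrm{AMSE}_0\{\hat{\theta}(h)\} = h^{2(p+1)} B_p(\theta)^2 + \frac{1}{nh} V_p(\theta)
\]
\[
\frac{d}{dh}\,\mathrm{AMSE}_0 = 2(p+1)\, h^{2p+1} B_p(\theta)^2 - \frac{1}{n h^2} V_p(\theta) = 0
\quad\Longrightarrow\quad
h^{*} = \left\{ \frac{V_p(\theta)}{2(p+1)\, B_p(\theta)^2\, n} \right\}^{1/(2p+3)}
\]

In particular, h* ∝ n^{-1/(2p+3)}; if instead B_p(θ) = 0, applying the same steps to the h^{p+2} bias term gives a bandwidth decaying at the slower rate n^{-1/(2p+5)}, which is precisely the degeneracy that the median and minimum "regularization" rules above guard against.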

2.6 Overview of methods

We have discussed how the density point estimators and corresponding SEs are constructed to form the test statistic T_p(h). As mentioned above, these estimators depend on whether the unrestricted or the restricted model is considered. In addition, the bandwidth h = (h_l, h_r) may be chosen in different ways, including both cases where the two bandwidths are assumed equal and cases where they are allowed to be different. In this section, we close the discussion of manipulation testing by briefly addressing the issue of critical value (or quantile) choice to form the testing procedure.

As mentioned in passing before, we rely on large-sample approximations to justify a result of the form T_p(h) being approximately distributed as N(0, 1), provided that an appropriate choice of bandwidth h and polynomial order p is used. Specifically, the command rddensity implements three distinct methods for inference. Each of these methods takes a different approach to handle the potential presence of a first-order bias in the statistic T_p(h) when a large bandwidth h is used (for example, when any of the MSE-optimal bandwidth choices discussed above are used).

To briefly discuss the three alternative manipulation tests, we let h_{MSE,p} denote any of the bandwidth choices given above when pth-order local-polynomial density estimators are used. The following discussion applies to all cases, whether they are unrestricted or restricted models with equal or unequal bandwidths chosen by any of the methods mentioned before. Let α ∈ (0, 1), and let χ²₁(α) denote the αth quantile of a chi-squared distribution with 1 degree of freedom. The three methods for inference implemented in rddensity are as follows:

• Robust bias-correction approach. This approach uses ideas in Calonico, Cattaneo, and Titiunik (2014b) and Calonico, Cattaneo, and Farrell (Forthcoming), which are based on analytic bias correction coupled with variance adjustments, leading to testing procedures with improved distributional properties in finite samples. In particular, letting q ≥ p + 1, the manipulation test takes the form

    \text{An } \alpha\text{-level test rejects } H_0 \text{ iff } T_q^2(h_{\mathrm{MSE},p}) > \chi^2_1(1 - \alpha)

In words, the construction of T_q²(h_{MSE,p}) ensures an asymptotically valid distributional approximation because q ≥ p + 1. Thus, in this case, the possible first-order bias of the statistic T_p²(h_{MSE,p}) (note the change from q to p in the subindex) is removed by using a higher-order polynomial in the estimation of the densities and, of course, adjusting the SE formulas accordingly. This approach to manipulation testing is the default option in rddensity because it is always theoretically justified and leads to power improvements asymptotically. For example, see the simulations presented in section 6.

• Small bias approach/undersmoothing approach. A second approach to inference would be to ignore the impact of a possibly first-order bias in the distributional approximation for the statistic T_p²(h_{MSE,p}). In some specific cases, such an approach may be justified [for example, if B_p(θ) happens to be exactly 0 for a choice

of p and θ] and the bandwidth was chosen in an ad hoc manner, but this cannot be justified in general. Alternatively, this approach is valid if the user uses an undersmoothed bandwidth choice relative to the MSE-optimal bandwidth h_{MSE,p}. For completeness, rddensity also implements a conventional Wald test without bias correction. To describe this procedure, suppose h is the bandwidth used. Then the command implements the conventional testing procedure

    \text{Reject } H_0 \text{ iff } T_p^2(h) > \chi^2_1(1 - \alpha)

which delivers a valid α-level test only when the smoothing bias in approximating f(x) at the cutoff point x̄ is indeed small. In practice, this requires choosing an undersmoothed bandwidth; for example, an ad hoc choice is h = s · h_{MSE,p} for some user-chosen scale value s ∈ (0, 1). This testing procedure is also reported when the option all is specified.
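As a concrete illustration of the undersmoothing idea, the sketch below scales data-driven bandwidths by s = 0.5 before passing them back to rddensity. It uses the running variable margin from the empirical illustration in section 5 and assumes, as in the combined example given there, that the first and second rows of the matrix e(h) returned by rdbwdensity hold the left and right bandwidths; the scale value is arbitrary.

    * Hedged sketch: ad hoc undersmoothing with s = 0.5 (illustration only)
    quietly rdbwdensity margin
    matrix h = e(h)
    local hl = 0.5 * h[1,1]            // undersmoothed left bandwidth
    local hr = 0.5 * h[2,1]            // undersmoothed right bandwidth
    rddensity margin, h(`hl' `hr') all // conventional and robust statistics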

3 The rddensity command

This section describes the syntax of the command rddensity, which implements the manipulation tests for a choice of bandwidth(s).

3.1 Syntax

rddensity runvar [if] [in] [, c(cutoff) p(pvalue) q(qvalue) fitselect(fitmethod) kernel(kernelfn) h(hlvalue hrvalue) bwselect(bwmethod) vce(vcemethod) all plot plot_range(xmin xmax) plot_n(nl nr) plot_grid(gridmethod) genvars(varname) level(#) graph_options(graphopts)]

runvar is the running variable (also known as the score or index variable).

3.2 Options

c(cutoff) specifies the RD cutoff. The default is c(0).

p(pvalue) specifies the order of the local polynomial used to construct the density point estimators. The default is p(2) (local quadratic approximation).

q(qvalue) specifies the order of the local polynomial used to construct the bias-corrected density point estimators. The default is q(p(#)+1) (local cubic approximation for the default p(2)).

fitselect(fitmethod) specifies whether restrictions should be imposed. fitmethod may be one of the following:

   unrestricted for density estimation without any restrictions (two-sample, unrestricted inference). This is the default option.

   restricted for density estimation assuming equal c.d.f. and higher-order derivatives.

kernel(kernelfn) is the kernel function K(·) used to construct the local-polynomial estimator(s). kernelfn may be triangular, uniform, or epanechnikov. The default is kernel(triangular).

h(hlvalue hrvalue) are the values of the main bandwidths h_l and h_r, respectively. If only one value is specified, then h_l = h_r is used. If not specified, they are computed by the companion command rdbwdensity. If two bandwidths are specified, the first bandwidth is used for the data below the cutoff, and the second bandwidth is used for the data above the cutoff.

bwselect(bwmethod) specifies the bandwidth selection procedure to be used. bwmethod may be one of the following:

   each specifies bandwidth selection based on the MSE of each density separately, that is, ĥ_{l,p} and ĥ_{r,p}. This is available only when the unrestricted model is used.

   diff specifies bandwidth selection based on the MSE of the difference of densities, that is, ĥ_{diff,p}.

   sum specifies bandwidth selection based on the MSE of the sum of densities, that is, ĥ_{sum,p}.

   comb specifies bandwidth selection as a combination of the alternatives above, that is, either (ĥ_{l,comb,p}, ĥ_{r,comb,p}), ĥ_{comb,p}, or ĥ^R_{comb,p}, depending on the model chosen. This is the default option.

vce(vcemethod) specifies the procedure used to compute the variance–covariance matrix estimator. vcemethod may be one of the following:

plugin for asymptotic plug-in SEs.

jackknife for jackknife SEs. This is the default option.

all reports two different manipulation tests (given choices fitselect(fitmethod) and bwselect(bwmethod)): the conventional test statistic (not valid when using the MSE-optimal bandwidth choice) and the robust bias-corrected statistic (the default option).

plot plots the density around the cutoff (this feature depends on the companion package lpdensity). Note that additional estimation (computing time) is needed.

plot_range(xmin xmax) specifies lower (xmin) and upper (xmax) endpoints for the manipulation test plot. By default, it is three bandwidths around the cutoff.

plot_n(nl nr) specifies the number of evaluation points below (nl) and above (nr) the cutoff to be used for the manipulation test plot. The default is plot_n(10 10).

plot_grid(gridtype) specifies the location of evaluation points, whether they are evenly spaced (es) or quantile spaced (qs), to be used for the manipulation test plot. The default is plot_grid(es).

genvars(varname) specifies that new variables should be generated to store estimation results for plotting. See the help file for details.

level(#) specifies the level of confidence intervals for the manipulation test plot. # must be between 0 and 100. The default is level(95).

graph_options(graphopts) are graph options passed to the plot command.
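For concreteness, the call below combines several of the plotting options just described. It is only an illustrative sketch: it uses the running variable margin from the empirical illustration in section 5, and the plotting window, number of evaluation points, and confidence level are arbitrary choices rather than recommendations.

    * Illustrative sketch of the plotting options (values chosen arbitrarily)
    rddensity margin, plot plot_range(-50 50) plot_n(25 25) plot_grid(qs) ///
        level(90) graph_options(xtitle("Margin of victory") legend(off))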

3.3 Description

rddensity provides an implementation of manipulation tests using local-polynomial density estimators. The user must specify the running variable. This command permits fully data-driven inference by using the companion command rdbwdensity, which may also be used as a standalone command.

4 The rdbwdensity command

This section describes the syntax of the command rdbwdensity. This command implements the different bandwidth selection procedures for manipulation tests based on the local-polynomial density estimators discussed above.

4.1 Syntax

rdbwdensity runvar [if] [in] [, c(cutoff) p(pvalue) fitselect(fitmethod) kernel(kernelfn) vce(vcemethod)]

runvar is the running variable (also known as the score or index variable).

4.2 Options

c(cutoff) specifies the RD cutoff. The default is c(0).

p(pvalue) specifies the order of the local polynomial used to construct the point estimator. The default is p(2) (local quadratic approximation).

fitselect(fitmethod) specifies the model used for estimation and inference. fitmethod may be one of the following:

   unrestricted for density estimation without any restrictions (two-sample, unrestricted inference). This is the default option.

   restricted for density estimation assuming equal c.d.f. and higher-order derivatives.

kernel(kernelfn) specifies the kernel function K(·) that is used to construct the local-polynomial estimator(s). kernelfn may be triangular, uniform, or epanechnikov. The default is kernel(triangular).

vce(vcemethod) specifies the procedure used to compute the variance–covariance matrix estimator. vcemethod may be

plugin for asymptotic plug-in SEs.

jackknife for jackknife SEs. This is the default option.

4.3 Description

rdbwdensity implements several bandwidth selection procedures specifically tailored for manipulation testing based on the local-polynomial density estimators implemented in rddensity. The user must specify the running variable.

5 Illustration of methods

We illustrate our community-contributed commands using the same dataset already used in Calonico et al. (2017) and Cattaneo, Titiunik, and Vazquez-Bare (2016), where related RD commands are introduced and discussed. This facilitates comparison across the Stata and R packages available for analysis and interpretation of RD designs; the package introduced herein discusses a falsification approach to RD designs via manipulation testing, while the other packages discuss graphical presentation and falsification, estimation, and inference.

rddensity_senate.dta contains the running variable from a larger dataset constructed and studied in Cattaneo, Frandsen, and Titiunik (2015), focusing on party advantages in U.S. Senate elections for the period 1914–2010. Thus, the unit of observation is a state in the United States. In this section, we focus on the running variable used to analyze the RD effect of the Democratic party winning a U.S. Senate seat on the vote share obtained in the following election for that same seat. This empirical illustration is analogous to the one presented by McCrary (2008) for U.S. House elections. The variable of interest is margin, which ranges from −100 to 100 and records the Democratic party's margin of victory in the statewide election for a given U.S. Senate seat, defined as the vote share of the Democratic party minus the vote share of its strongest opponent. This corresponds to the index, score, or running variable. By construction, the cutoff is x̄ = 0.

First, we load the dataset and present summary statistics:

. use rddensity_senate

. summarize margin

    Variable         Obs        Mean    Std. Dev.       Min        Max

      margin       1,390    7.171159     34.32488      -100        100

The dataset has a total of 1,390 observations, with an average Democratic party margin of victory of about 7 percentage points. We now conduct a manipulation test using the command rddensity with its default options.

. rddensity margin
Computing data-driven bandwidth selectors.
RD Manipulation Test using local polynomial density estimation.

     Cutoff c = 0         Left of c   Right of c        Number of obs  =         1390
                                                        Model          = unrestricted
Number of obs                  640          750         BW method      =         comb
Eff. Number of obs             408          460         Kernel         =   triangular
Order est. (p)                   2            2         VCE method     =    jackknife
Order bias (q)                   3            3
BW est. (h)                 19.841       27.119

Running variable: margin.

     Method             T          P>|T|

     Robust          -0.8753      0.3814

The output contains a variety of useful information. First, the upper-left panel gives basic summary statistics on the data being used, separately for control (X_i < x̄) and treatment (X_i ≥ x̄) units. This panel also reports the value of the bandwidth(s) chosen. Second, the upper-right panel includes general information regarding the overall sample size and implementation choices of the manipulation test. Finally, the lower panel reports the results from implementing the manipulation test. In this first execution, the test statistic is constructed using a q = 3 polynomial, with different bandwidths chosen for an unrestricted model with polynomial order p = 2. Specifically, the bandwidth choices are (ĥ_{l,comb,p}, ĥ_{r,comb,p}) = (19.841, 27.119), leading to effective sample sizes of N_- = 408 and N_+ = 460 for the control and treatment groups, respectively. The final manipulation test is T_q(ĥ_{l,comb,p}, ĥ_{r,comb,p}) = −0.8753, with a p-value of 0.3814. Therefore, in this application, there is no statistical evidence of systematic manipulation of the running variable.

rddensity also offers an automatic plot of the manipulation test. This plot is implemented using the package lpdensity for local-polynomial–based density estimation in Stata and R. If the user does not have this package installed, rddensity will issue an error and will request the user to install it. (In R, the package lpdensity is declared as a dependency, so no error ever occurs.) The following command installs the package lpdensity in Stata:

. net install lpdensity,
>     from(https://sites.google.com/site/nppackages/lpdensity/stata) replace

To obtain the default manipulation test plot, you just need to add the plot option to rddensity, as follows:

. rddensity margin, plot
Computing data-driven bandwidth selectors.
RD Manipulation Test using local polynomial density estimation.

     Cutoff c = 0         Left of c   Right of c        Number of obs  =         1390
                                                        Model          = unrestricted
Number of obs                  640          750         BW method      =         comb
Eff. Number of obs             408          460         Kernel         =   triangular
Order est. (p)                   2            2         VCE method     =    jackknife
Order bias (q)                   3            3
BW est. (h)                 19.841       27.119

Running variable: margin.

     Method             T          P>|T|

     Robust          -0.8753      0.3814

The resulting default plot is given in figure 1:

[Figure: rddensity plot (p=2, q=3); estimated density of margin with point estimates and 95% confidence intervals.]

Figure 1. Manipulation test plot (default options)

The basic manipulation test plot can be improved using user-specified options. For example, the following command changes the plot’s legends and general appearance. The resulting plot is given in figure 2:

. rddensity margin, plot
>     graph_options(graphregion(color(white))
>     xtitle("Margin of victory") ytitle("Density") legend(off))
Computing data-driven bandwidth selectors.
RD Manipulation Test using local polynomial density estimation.

     Cutoff c = 0         Left of c   Right of c        Number of obs  =         1390
                                                        Model          = unrestricted
Number of obs                  640          750         BW method      =         comb
Eff. Number of obs             408          460         Kernel         =   triangular
Order est. (p)                   2            2         VCE method     =    jackknife
Order bias (q)                   3            3
BW est. (h)                 19.841       27.119

Running variable: margin.

     Method             T          P>|T|

     Robust          -0.8753      0.3814

[Figure: estimated density of the margin of victory with point estimates and confidence intervals, custom axis titles, and no legend.]

Figure 2. Manipulation test plot (with user options)

To further illustrate some capabilities of rddensity, we consider a few additional runs. We can obtain two distinct statistics, conventional and bias-corrected, by including the option all. This gives the following output:

. rddensity margin, all
Computing data-driven bandwidth selectors.
RD Manipulation Test using local polynomial density estimation.

     Cutoff c = 0         Left of c   Right of c        Number of obs  =         1390
                                                        Model          = unrestricted
Number of obs                  640          750         BW method      =         comb
Eff. Number of obs             408          460         Kernel         =   triangular
Order est. (p)                   2            2         VCE method     =    jackknife
Order bias (q)                   3            3
BW est. (h)                 19.841       27.119

Running variable: margin.

     Method             T          P>|T|

     Conventional    -1.6506      0.0988
     Robust          -0.8753      0.3814

This second output still uses all the default options, but now it reports two test statistics. The first statistic, labeled Conventional, will exhibit asymptotic bias and hence will overreject the null hypothesis of no manipulation when the MSE-optimal or another large bandwidth is used. This is confirmed in the simulation study reported in section 6. The second statistic, labeled Robust, implements inference based on robust bias correction and is the default and recommended option for implementing a manipulation test.

The following output showcases other features of rddensity. Here we conduct a manipulation test using the restricted model and plug-in SEs:

. rddensity margin, fitselect(restricted) vce(plugin)
Computing data-driven bandwidth selectors.
RD Manipulation Test using local polynomial density estimation.

     Cutoff c = 0         Left of c   Right of c        Number of obs  =         1390
                                                        Model          =   restricted
Number of obs                  640          750         BW method      =         comb
Eff. Number of obs             396          362         Kernel         =   triangular
Order est. (p)                   2            2         VCE method     =       plugin
Order bias (q)                   3            3
BW est. (h)                 18.753       18.753

Running variable: margin.

     Method             T          P>|T|

     Robust          -1.4768      0.1397

In this case, because the restricted model is being used, a common bandwidth for both control and treatment units is selected. This value is ĥ^R_{comb,p} = 18.753 with the choice p = 2, and the plug-in SE estimator is now used (instead of the jackknife method, as used before). This empirical finding shows that we continue to not reject the null hypothesis of no manipulation (the p-value is 0.1397), even when the restricted model is used, which provides further empirical evidence in favor of the validity of the RD design in this application.

To close this section, we report output from the companion command rdbwdensity. This command was used all along implicitly by rddensity, but here we use it as a standalone command to illustrate some of its features. The default output for the empirical application is as follows:

. rdbwdensity margin
Bandwidth selection for manipulation testing.

     Cutoff c = 0.000      Left of c   Right of c       Number of obs  =         1390
                                                        Model          = unrestricted
Number of obs                   640          750        Kernel         =   triangular
Min Running var.           -100.000        0.011        VCE method     =    jackknife
Max Running var.             -0.079      100.000
Order loc. poly. (p)              2            2

Running variable: margin.

     Target                   Bandwidth     Variance       Bias^2

     left density                19.841        0.109        0.000
     right density               27.569        0.085        0.000
     difference densities        27.119        0.194        0.000
     sum densities               19.531        0.194        0.000

The output of rdbwdensity closely mimics the output from rddensity. Notice that the defaults are all the same [for example, x̄ = 0, p = 2, K(·) = triangular, and the unrestricted model]. The main results are reported in the lower panel, where now the output includes different estimated bandwidth choices. These choices depend on the MSE criterion function (and the model considered, as discussed previously): the first row (labeled left density) reports ĥ_{l,p}, the second row (labeled right density) reports ĥ_{r,p}, the third row (labeled difference densities) reports ĥ_{diff,p}, and the fourth row (labeled sum densities) reports ĥ_{sum,p}. Of course, ĥ_{comb,p} may be easily constructed using the above information.

A similar set of results may also be obtained for the restricted model. For this case, the command is rdbwdensity margin, fitselect(restricted), but we do not report these results here to conserve space.
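For instance, one way to construct ĥ_{comb,p} = min(ĥ_{diff,p}, ĥ_{sum,p}) from the stored results is sketched below; it assumes that the rows of e(h) are stacked in the same order as the displayed output (left, right, difference, sum), consistent with how the combined example later in this section reads the left and right entries.

    * Hedged sketch: build the combined (equal) bandwidth from e(h),
    * assuming rows are ordered as in the displayed output
    quietly rdbwdensity margin
    matrix h = e(h)
    local hcomb = min(h[3,1], h[4,1])   // min of difference and sum bandwidths
    display "h_comb = " `hcomb'
    rddensity margin, h(`hcomb')        // a single value imposes equal bandwidths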

Finally, we briefly illustrate how the two commands can be combined:

. quietly rdbwdensity margin
. matrix h = e(h)
. local hr = h[2,1]
. rddensity margin, h(10 `hr')
RD Manipulation Test using local polynomial density estimation.

     Cutoff c = 0         Left of c   Right of c        Number of obs  =         1390
                                                        Model          = unrestricted
Number of obs                  640          750         BW method      =       manual
Eff. Number of obs             251          464         Kernel         =   triangular
Order est. (p)                   2            2         VCE method     =    jackknife
Order bias (q)                   3            3
BW est. (h)                 10.000       27.569

Running variable: margin.

     Method             T          P>|T|

     Robust          -1.0331      0.3016

First, rdbwdensity is used to estimate the bandwidths quietly, but then the left bandwidth is set manually (h_l = 10) while the right bandwidth is estimated (ĥ_r = 27.569) when executing rddensity.

The companion replication file (rddensity_illustration.do) includes the syntax of all the examples discussed above, as well as additional examples not included here to conserve space. These examples are

1. rddensity margin, kernel(uniform)
   Manipulation testing using a uniform kernel.

2. rddensity margin, bwselect(diff)
   Manipulation testing with bandwidth selection based on the MSE of the difference of densities.

3. rddensity margin, h(10 15)
   Manipulation testing using user-chosen bandwidths h_l = 10 and h_r = 15.

4. rddensity margin, p(2) q(4)
   Manipulation testing using p = 2 and q = 4.

5. rddensity margin, c(5) all
   Manipulation testing at cutoff x̄ = 5 with all statistics.

6. rdbwdensity margin, p(3) fitselect(restricted)
   Bandwidth selection for the restricted model with p = 3.

7. rdbwdensity margin, kernel(uniform) vce(jackknife)
   Bandwidth selection with a uniform kernel and jackknife SEs.

6 Simulation study

This section reports the numerical findings from a simulation study aimed at illustrating the finite-sample performance of our new manipulation test.

We consider several implementations of the manipulation test T_p(h), which vary according to the choice of polynomial order (p = 2 or p = 3) and the choice of bandwidth. We analyze the performance of the inference procedure using a grid of bandwidths around the MSE-optimal choice, as well as a data-driven implementation of this optimal bandwidth choice, allowing for both equal and different bandwidth choices on either side of the threshold. We also investigate the performance of robust bias correction because we report both the naïve testing procedure without bias correction and its robust bias-corrected version. For implementation, in all cases, we also consider both asymptotic plug-in and jackknife variance estimation.

We consider the data-generating process given by X_i ∼ (3/5)·T(5), where T(k) denotes a Student's t distribution with k degrees of freedom, and a cutoff point x̄ = 0.5, which induces an asymmetric distribution around the cutoff. We use a sample size n = 1000, and all simulations were based on 2,000 replications with a triangular kernel to implement our density estimators.
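A minimal sketch of one draw from this design is given below. It assumes the reading X_i = (3/5)·t_5 for the scaling of the Student's t variate, relies on Stata's rt() random-number function, and runs a single test rather than the full Monte Carlo exercise, so it is meant only to convey the setup.

    * Minimal sketch of one simulated dataset and one manipulation test
    * (assumes the DGP reading X_i = (3/5)*t_5 and cutoff 0.5)
    clear
    set obs 1000
    set seed 20180101
    generate double x = (3/5) * rt(5)
    rddensity x, c(0.5) all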

6.1 Point estimation and empirical size

In this section, we report simulation results concerning the point-estimation properties of the underlying density estimator at the boundary point x̄ as well as the average rejection rate (empirical size) of our proposed manipulation test. All the numerical results are given in table 1.

Table 1. Density estimation and manipulation test (empirical size)

              Bandwidth        Density estimators                                         Plug-in SE        Jackknife SE
              left    right    bias-   bias+   bias     sd      bias/sd   mse            mean    size      mean    size

Grid h-, h+
  0.5x        0.538   0.538    0.014  -0.001  -0.015   0.092    0.163     0.816          0.091   0.050     0.091   0.050
  0.6x        0.645   0.645    0.021  -0.003  -0.024   0.084    0.292     0.721          0.083   0.056     0.083   0.059
  0.7x        0.753   0.753    0.031  -0.005  -0.036   0.078    0.460     0.699          0.078   0.077     0.077   0.078
  0.8x        0.861   0.861    0.042  -0.007  -0.049   0.073    0.666     0.735          0.073   0.107     0.072   0.112
  0.9x        0.968   0.968    0.054  -0.010  -0.064   0.069    0.921     0.832          0.069   0.143     0.068   0.152
  hMSE,p      1.076   1.076    0.067  -0.013  -0.080   0.065    1.223     1.000          0.066   0.208     0.064   0.232
  1.1x        1.183   1.183    0.081  -0.016  -0.097   0.062    1.567     1.248          0.064   0.312     0.061   0.345
  1.2x        1.291   1.291    0.095  -0.019  -0.114   0.059    1.939     1.565          0.061   0.468     0.059   0.497
  1.3x        1.399   1.399    0.109  -0.023  -0.132   0.056    2.333     1.940          0.059   0.609     0.056   0.646
  1.4x        1.506   1.506    0.121  -0.027  -0.148   0.054    2.737     2.363          0.057   0.747     0.054   0.785
  1.5x        1.614   1.614    0.133  -0.031  -0.165   0.052    3.144     2.820          0.056   0.856     0.053   0.876

Est. ĥ-, ĥ+
  Tp(ĥp)      0.618   0.710    0.023  -0.004  -0.027   0.089    0.303     0.827          0.083   0.084     0.083   0.091
  Tp(ĥp-1)    0.254   0.218    0.014  -0.001  -0.015   0.142    0.103     1.928          0.141   0.054     0.143   0.050
  Tp+1(ĥp)    0.618   0.710    0.007   0.005  -0.002   0.130    0.017     1.593          0.130   0.052     0.131   0.046

Est. ĥ- = ĥ+
  Tp(ĥp)      0.596   0.596    0.021  -0.002  -0.023   0.094    0.248     0.880          0.088   0.080     0.087   0.085
  Tp(ĥp-1)    0.187   0.187    0.009   0.000  -0.008   0.159    0.053     2.395          0.157   0.059     0.160   0.052
  Tp+1(ĥp)    0.596   0.596    0.005   0.005  -0.001   0.137    0.005     1.776          0.136   0.056     0.137   0.052

(a) p = 2

              Bandwidth        Density estimators                                         Plug-in SE        Jackknife SE
              left    right    bias-   bias+   bias     sd      bias/sd   mse            mean    size      mean    size

Grid h-, h+
  0.5x        0.604   0.604    0.003   0.003   0.000   0.133    0.001     1.948          0.134   0.048     0.135   0.046
  0.6x        0.725   0.725    0.001   0.004   0.004   0.121    0.030     1.607          0.122   0.045     0.124   0.047
  0.7x        0.845   0.845   -0.004   0.005   0.009   0.112    0.082     1.378          0.113   0.052     0.114   0.048
  0.8x        0.966   0.966   -0.007   0.007   0.013   0.105    0.128     1.219          0.106   0.048     0.107   0.044
  0.9x        1.087   1.087   -0.007   0.008   0.015   0.099    0.153     1.103          0.100   0.050     0.101   0.046
  hMSE,p      1.208   1.208   -0.006   0.009   0.015   0.094    0.156     1.000          0.095   0.052     0.095   0.049
  1.1x        1.328   1.328   -0.003   0.009   0.012   0.090    0.139     0.903          0.090   0.056     0.091   0.054
  1.2x        1.449   1.449    0.002   0.010   0.008   0.086    0.094     0.816          0.087   0.052     0.087   0.048
  1.3x        1.570   1.570    0.008   0.010   0.002   0.083    0.020     0.746          0.084   0.048     0.083   0.053
  1.4x        1.691   1.691    0.016   0.010  -0.007   0.080    0.082     0.697          0.081   0.050     0.080   0.050
  1.5x        1.812   1.812    0.026   0.009  -0.016   0.077    0.211     0.676          0.079   0.048     0.077   0.053

Est. ĥ-, ĥ+
  Tp(ĥp)      1.190   1.237   -0.006   0.008   0.014   0.098    0.140     1.077          0.095   0.051     0.097   0.050
  Tp(ĥp-1)    0.610   0.705    0.010   0.005  -0.005   0.129    0.040     1.830          0.131   0.048     0.132   0.046
  Tp+1(ĥp)    1.190   1.237    0.007   0.006  -0.001   0.129    0.007     1.818          0.134   0.044     0.135   0.040

Est. ĥ- = ĥ+
  Tp(ĥp)      1.119   1.119   -0.006   0.008   0.014   0.100    0.136     1.124          0.099   0.048     0.100   0.051
  Tp(ĥp-1)    0.590   0.590    0.009   0.001  -0.007   0.136    0.055     2.043          0.137   0.050     0.138   0.052
  Tp+1(ĥp)    1.119   1.119    0.008   0.006  -0.002   0.134    0.016     1.960          0.140   0.043     0.140   0.038

(b) p = 3

Notes: i) Columns "Bandwidth": bandwidths for the left and right density estimators. ii) Columns "Density estimators": bias of the left and right density estimators, and bias, standard deviation, standardized bias, and MSE of the difference of the density estimators. iii) Columns "Plug-in SE": average of the plug-in SE ("mean") and empirical size of the corresponding manipulation test T_p(h) ("size"). iv) Columns "Jackknife SE": average of the jackknife SE ("mean") and empirical size of the corresponding manipulation test T_p(h) ("size"). v) ĥ_p denotes the estimated MSE-optimal bandwidth for the pth-order density estimator.

We first describe the format of table 1. Starting with the columns, each table reports the following:

i) the bandwidths used to construct the density estimators on the left and on the right of the cutoff x̄ (columns under the label "Bandwidth"); ii) the average bias of the two density estimators and the difference thereof, simulation variability, standardized bias of the difference of density estimators, and simulation MSE of the difference of density estimators at the cutoff x̄ (columns under the label "Density estimators");

iii) the average of the asymptotic plug-in SE estimator and the empirical size of the associated feasible manipulation test (columns under the label “Plug-in SE”); and

iv) the average of the jackknife SE estimator and the empirical size of the associated feasible manipulation test (columns under the label “Jackknife SE”).

Continuing with the rows, and to better understand the role of the bandwidth choice hn on the finite sample performance of the density and SE estimators and of the ma- nipulation test, table 1 reports:

i) a grid of fixed bandwidths constructed around the theoretical infeasible MSE-optimal bandwidth (rows under the label "Grid");

ii) estimated bandwidths, which are allowed to differ on the two sides of the cutoff, obtained as ĥ_{l,comb,p} and ĥ_{r,comb,p} (rows under the label "Est. ĥ−, ĥ+"); and

iii) estimated bandwidths, which are required to be the same on the two sides of the cutoff, obtained as the smaller of ĥ_{diff,p} and ĥ_{sum,p} (rows under the label "Est. ĥ− = ĥ+").

The simulation results allow us to explore the finite-sample performance of the different ingredients entering the manipulation test (density, SE, and bandwidth estimators) and of the test itself (and, by implication, the quality of the Gaussian distributional approximation). Given the large amount of information, we offer only a summary of the main results we observe from the Monte Carlo experiment.

We first look at nonrandom bandwidths, which help us separate the finite-sample performance of the main theoretical results in the article from the impact of bandwidth estimation. From the grid of bandwidths around the MSE-optimal h_{MSE,p}, we find that i) the simulation variability of the difference of the density estimators (the column labeled "sd") is approximated very well by the jackknife SE estimator and reasonably well by the asymptotic plug-in SE estimator (compare with the columns labeled "mean"), and ii) the manipulation test exhibits some empirical size distortion when using the MSE-optimal bandwidth h_{MSE,p}, as expected, but exhibits excellent empirical size when undersmoothing this bandwidth choice. These numerical findings indicate that our main theoretical results concerning bias, variance, and distributional approximations, as well as the consistency of the proposed SE formulas, are borne out in the Monte Carlo experiment.

We now explore the impact of bandwidth selection on estimation and inference in the context of the manipulation test. We focus on the last six rows of table 1. When looking at the different bandwidth estimators, we find that i) our bandwidth estimator ĥ_p tends to deliver smaller values than h_{MSE,p} on average, a finding that actually helps control the empirical size of the manipulation test, and ii) the robust bias-correction approach [that is, inference based on T_p(ĥ_{p-1})] delivers manipulation tests with very good empirical size properties. Because the robust bias-correction approach is theoretically justified and valid for all sample sizes, we recommend it as the best alternative for applications.

To summarize, based on the simulation evidence we obtained, we recommend for empirical work the manipulation test based on the feasible statistic T_p(ĥ_{p-1}) constructed using a pth-order local-polynomial density estimator, an MSE-optimal bandwidth choice coupled with robust bias correction, and the corresponding jackknife SE estimator. This testing procedure exhibited close-to-correct empirical size across all designs we considered and performed as well as (if not better than) all the alternatives we explored. These numerical findings agree with our main theoretical results.

6.2 Empirical power

To complement the numerical results presented above, we also investigate the power of the manipulation test constructed using the local-polynomial density estimator. We continue to use the same data-generating process, but now we scale the data so that the true density satisfies

f_+/f_- ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5} at the cutoff x̄ = 0.5. Recall that the sample size is n = 1000 and that all simulations are based on 2,000 replications. The results are given in table 2.

Table 2. Manipulation test (empirical power)

              Bandwidth        Rejection rate: H0 : f+/f- = 1 vs. Ha : value below
              left    right    0.5    0.6    0.7    0.8    0.9    1      1.1    1.2    1.3    1.4    1.5

Plug-in SE
  Tp(hMSE,p)  1.076   1.076    1.000  0.990  0.922  0.734  0.454  0.212  0.080  0.036  0.060  0.126  0.233
 Est. ĥ-, ĥ+
  Tp(ĥp)      0.615   0.712    0.888  0.715  0.498  0.284  0.154  0.072  0.064  0.078  0.134  0.218  0.330
  Tp(ĥp-1)    0.254   0.215    0.478  0.316  0.202  0.122  0.074  0.056  0.049  0.060  0.072  0.094  0.130
  Tp+1(ĥp)    0.615   0.712    0.474  0.313  0.191  0.112  0.074  0.053  0.050  0.065  0.100  0.143  0.187
 Est. ĥ- = ĥ+
  Tp(ĥp)      0.594   0.594    0.865  0.686  0.456  0.266  0.138  0.070  0.060  0.074  0.122  0.190  0.285
  Tp(ĥp-1)    0.185   0.185    0.353  0.239  0.151  0.099  0.066  0.050  0.047  0.060  0.079  0.100  0.131
  Tp+1(ĥp)    0.594   0.594    0.458  0.301  0.188  0.114  0.072  0.057  0.055  0.068  0.088  0.117  0.159

Jackknife SE
  Tp(hMSE,p)  1.076   1.076    1.000  0.990  0.926  0.747  0.480  0.232  0.090  0.045  0.071  0.148  0.265
 Est. ĥ-, ĥ+
  Tp(ĥp)      0.615   0.712    0.866  0.700  0.490  0.288  0.155  0.076  0.067  0.082  0.136  0.231  0.352
  Tp(ĥp-1)    0.254   0.215    0.459  0.309  0.196  0.114  0.063  0.047  0.044  0.055  0.070  0.098  0.139
  Tp+1(ĥp)    0.615   0.712    0.450  0.305  0.196  0.118  0.068  0.051  0.047  0.062  0.104  0.143  0.200
 Est. ĥ- = ĥ+
  Tp(ĥp)      0.594   0.594    0.844  0.670  0.450  0.266  0.142  0.073  0.063  0.076  0.123  0.202  0.310
  Tp(ĥp-1)    0.185   0.185    0.345  0.230  0.148  0.097  0.057  0.043  0.044  0.054  0.072  0.100  0.132
  Tp+1(ĥp)    0.594   0.594    0.434  0.292  0.188  0.114  0.074  0.052  0.050  0.066  0.088  0.126  0.170

(a) p = 2

              Bandwidth        Rejection rate: H0 : f+/f- = 1 vs. Ha : value below
              left    right    0.5    0.6    0.7    0.8    0.9    1      1.1    1.2    1.3    1.4    1.5

Plug-in SE
  Tp(hMSE,p)  1.208   1.208    0.701  0.456  0.236  0.110  0.056  0.057  0.086  0.143  0.218  0.311  0.412
 Est. ĥ-, ĥ+
  Tp(ĥp)      1.205   1.259    0.680  0.455  0.264  0.138  0.076  0.057  0.090  0.141  0.210  0.306  0.409
  Tp(ĥp-1)    0.617   0.712    0.472  0.291  0.168  0.098  0.064  0.050  0.055  0.078  0.108  0.152  0.190
  Tp+1(ĥp)    1.205   1.259    0.478  0.280  0.149  0.078  0.051  0.050  0.060  0.076  0.106  0.137  0.179
 Est. ĥ- = ĥ+
  Tp(ĥp)      1.134   1.134    0.658  0.442  0.240  0.124  0.066  0.052  0.084  0.122  0.191  0.272  0.368
  Tp(ĥp-1)    0.594   0.594    0.447  0.284  0.166  0.102  0.068  0.051  0.056  0.070  0.092  0.122  0.162
  Tp+1(ĥp)    1.134   1.134    0.450  0.263  0.148  0.075  0.052  0.052  0.056  0.070  0.098  0.129  0.162

Jackknife SE
  Tp(hMSE,p)  1.208   1.208    0.646  0.424  0.228  0.107  0.057  0.054  0.083  0.150  0.236  0.336  0.446
 Est. ĥ-, ĥ+
  Tp(ĥp)      1.205   1.259    0.624  0.430  0.256  0.134  0.074  0.053  0.086  0.142  0.232  0.336  0.448
  Tp(ĥp-1)    0.617   0.712    0.444  0.286  0.174  0.102  0.063  0.041  0.054  0.072  0.112  0.152  0.208
  Tp+1(ĥp)    1.205   1.259    0.444  0.274  0.152  0.080  0.047  0.040  0.055  0.080  0.110  0.154  0.204
 Est. ĥ- = ĥ+
  Tp(ĥp)      1.134   1.134    0.600  0.415  0.236  0.123  0.064  0.049  0.078  0.128  0.210  0.300  0.403
  Tp(ĥp-1)    0.594   0.594    0.434  0.275  0.170  0.100  0.064  0.045  0.048  0.068  0.092  0.130  0.177
  Tp+1(ĥp)    1.134   1.134    0.420  0.267  0.148  0.078  0.047  0.042  0.053  0.072  0.101  0.140  0.184

(b) p = 3

Notes: i) Columns "Bandwidth": bandwidths for the left and right density estimators under the null hypothesis f_+/f_- = 1. For other cases, the bandwidth for estimating f_+ is adjusted proportionally to f_+/f_-. ii) ĥ_p: estimated MSE-optimal bandwidth for the pth-order density estimator.

We first describe the format of this table. The two main columns report the following:

i) the bandwidths used to construct the density estimators on the left and on the right of the cutoff x̄ (columns under the label "Bandwidth"); and ii) the density discontinuity and the corresponding rejection rates (columns under the label "Rejection rate"). Note that the column f_+/f_- = 1 corresponds to the null hypothesis being true, and the rejection rate under that column is just the empirical size of the test.

Continuing with the rows of table 2, we examine the empirical power with the infeasible MSE-optimal bandwidth as well as the estimated bandwidths, with either the plug-in or the jackknife SE used.

Based on the simulation evidence obtained, we find that using the infeasible MSE-optimal bandwidth leads to size distortion, and the power curve does not attain its minimum when the null hypothesis is true (for example, in the p = 2 case in table 2, the minimum rejection rate occurs when f_+/f_- = 1.2), which again confirms the finding that, without bias correction, the manipulation test will not only be inconsistent but also lose power.

For the different bandwidth estimators, we again find that the robust bias-correction approach delivers manipulation tests with very good empirical size properties, because with either method the power curve achieves its minimum when the null hypothesis is true. Also as expected, bias correction leads to some power loss compared with the T_p(ĥ_p) case.

7 Conclusion

In this article, we discussed the implementation of nonparametric manipulation tests using local-polynomial density estimators, which are useful for falsification of RD designs and for empirical research analyzing whether units are self-selected into a particular group or treatment status. We introduced two commands: rddensity and rdbwdensity. These commands use ideas from Cattaneo, Jansson, and Ma (2017b). In particular, the first command implements several nonparametric manipulation tests, while the second command provides an array of bandwidth selection methods. Companion R functions are also available from the authors. The latest version of this and related software for RD designs can be found at https://sites.google.com/site/rdpackages/.

8 Acknowledgments

We thank Sebastian Calonico, David Drukker, Rocio Titiunik, and Gonzalo Vazquez-Bare for useful comments that improved this manuscript as well as our implementations. The first author gratefully acknowledges financial support from the National Science Foundation through grants SES-1357561 and SES-1459931. The second author gratefully acknowledges financial support from the National Science Foundation through grant SES-1459967 and the research support of CREATES (funded by the Danish National Research Foundation under grant no. DNRF78).

9 References

Calonico, S., M. D. Cattaneo, and M. H. Farrell. Forthcoming. On the effect of bias estimation on coverage accuracy in nonparametric inference. Journal of the American Statistical Association.

Calonico, S., M. D. Cattaneo, M. H. Farrell, and R. Titiunik. 2017. rdrobust: Software for regression-discontinuity designs. Stata Journal 17: 372–404.

Calonico, S., M. D. Cattaneo, and R. Titiunik. 2014a. Robust data-driven inference in the regression-discontinuity design. Stata Journal 14: 909–946.

———. 2014b. Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82: 2295–2326.

———. 2015a. Optimal data-driven regression discontinuity plots. Journal of the American Statistical Association 110: 1753–1769.

———. 2015b. rdrobust: An R package for robust nonparametric inference in regression-discontinuity designs. R Journal 7: 38–51.

Cattaneo, M. D., and J. C. Escanciano, eds. 2017. Advances in Econometrics: Vol. 38—Regression Discontinuity Designs: Theory and Applications. Bingley, UK: Emerald.

Cattaneo, M. D., B. R. Frandsen, and R. Titiunik. 2015. Randomization inference in the regression discontinuity design: An application to party advantages in the U.S. Senate. Journal of Causal Inference 3: 1–24.

Cattaneo, M. D., M. Jansson, and X. Ma. 2017a. lpdensity: Local polynomial density estimation and inference. Working paper, University of Michigan.

———. 2017b. Simple local polynomial density estimators. Working paper, University of Michigan.

Cattaneo, M. D., R. Titiunik, and G. Vazquez-Bare. 2016. Inference in regression discontinuity designs under local randomization. Stata Journal 16: 331–367.

———. 2017c. Comparing inference approaches for RD designs: A reexamination of the effect of Head Start on child mortality. Journal of Policy Analysis and Management 36: 643–681.

Cheng, M.-Y., J. Fan, and J. S. Marron. 1997. On automatic boundary corrections. Annals of Statistics 25: 1691–1708.

Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. New York: Chapman & Hall/CRC.

Frandsen, B. 2017. Party bias in union representation elections: Testing for manipulation in the regression discontinuity design when the running variable is discrete. In Advances in Econometrics: Vol. 38—Regression Discontinuity Designs: Theory and Applications, ed. M. D. Cattaneo and J. C. Escanciano, 29–72. Bingley, UK: Emerald.

McCrary, J. 2008. Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics 142: 698–714.

Otsu, T., K.-L. Xu, and Y. Matsushita. 2014. Estimation and inference of discontinuity in density. Journal of Business and Economic Statistics 31: 507–524.

About the authors

Matias D. Cattaneo is a professor of economics and a professor of statistics at the University of Michigan.

Michael Jansson is the Edward G. and Nancy S. Jordan Family Professor of Economics at the University of California, Berkeley.

Xinwei Ma is a PhD candidate in economics at the University of Michigan.

The Stata Journal (2018) 18, Number 1, pp. 262–286

Speaking Stata: Logarithmic binning and labeling

Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
[email protected]

Abstract. Histograms on logarithmic scale cannot be produced by an option like xscale(log). You need first to transform the variable concerned with a logarithm function. That raises small choices: how to select bin start, bin width, and informative axis labels and titles? Problems and solutions are discussed here in detail. In contrast, for logarithmic scales on other graphs, options xscale(log) and yscale(log) may do most of what you want. But there is usually still scope for "nicer" axis labels than are given by default and indeed scope for differing tastes on what "nice" means. This column introduces the niceloglabels command for helping (even automating) label choice. Historical notes and references are sprinkled throughout.

Keywords: gr0072, niceloglabels, logarithms, axis scales, axis labels, binning, histograms, quantile plots, transformations, graphics

1 Introduction

Here is a common problem. You look at a histogram of a highly positively skewed variable and realize that you would be better off with a graph showing magnitudes on a logarithmic scale. Typically, that means equal widths of histogram bins (classes or intervals) on that logarithmic scale.

Here is the same problem in another guise. Often, you know in advance to move straight to logarithmic scale. Economists and others frequently work with income or wealth on such a scale. Many young children know the words millionaires and billionaires and so have taken the first steps to appreciating that several orders of magnitude (powers of ten) separate the very rich from the very poor. Examples in scholarly books aimed also at general readers include graphs in Atkinson (2015), Milanovic (2016), and Pinker (2018). Another fundamental variable best thought of on logarithmic scale is plant height (Moles et al. 2009). The range in height from short herbs through shrubs and on to the tallest trees is at least 4 orders of magnitude. So your design of a histogram starts knowing that.


The nub of a Stata solution for histograms on logarithmic scale is to use a logarithmic function to generate a transformed variable and then to show a histogram of that new variable. In practice, you should want to go further and improve the small details of axis labels and titles. This column focuses first on the choices needed and how to make them easily. Section 2 explains such binning. It is assumed that you are comfortable with the idea of logarithms. If you are not so comfortable, you perhaps should read section 5 first, which assumes rather less. Alternatively, you understand the idea well, but your students or colleagues might find more explanation helpful.

Section 3 extends the discussion to labeling using logarithmic scales on other kinds of graphs, such as scatter or line graphs. The focus is on using "nice" labels. Official Stata graph commands default to using labels equally spaced on the original scale, which is often not what you want. A new helper program, niceloglabels, is published with this column. Section 4 gives a formal statement of its syntax.

The column does not quite extend to underlining how similar ideas could be used for nonlinear scales other than logarithmic. Such extensions have already been discussed in previous columns (Cox 2008, 2012). Similarly, the question of when and why histograms should or should not be used is not tackled directly. A subversive minor theme, however, is that quantile plots are generally a good thing.

Various historical details are sprinkled capriciously throughout. Here is the first. The Oxford English Dictionary, consulted online, has its first example of bin in this sense as mentioned by the statistically minded geologist William Christian Krumbein (1902–1979), writing on particle size data from sedimentology: "In setting up a histogram, we are in effect setting up a series of separate 'bins', each of which contains a certain percent of the grains" (Krumbein 1934, 68).
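As a brief preview of the nub just described, the two lines below sketch the basic move for a hypothetical positive variable y; the variable name and the base-10 choice are illustrative only, and the finer points of bin start, bin width, and labeling are the subject of the rest of the column.

    * Minimal sketch: histogram on a logarithmic scale via a transformed variable
    generate double log10y = log10(y)   // y is a hypothetical positive variable
    histogram log10y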

2 Histogram binning

2.1 Axis scale option is not the solution

At this point, I should underline that the option xscale(log)—or if your bins are horizontal, the option yscale(log)—is not the answer to histogram binning on logarithmic scale. It is often the answer for other needs of logarithmic scales, as I will discuss in section 3. So why not here?

Either axis scale option takes the histogram that you would otherwise have, from histogram or twoway histogram, and warps the magnitude axis logarithmically. The bin boundaries shown are what would have been shown otherwise. The graph command does not redo the calculation and give you equal-width bins on a logarithmic scale. Worse than that, areas of bars no longer have the interpretation of showing the probability distribution geometrically. The key idea behind a histogram, that bar area encodes bin frequency (or some equivalent), is no longer honored—or rather not honored consistently, insofar as the scale varies within the histogram. Using this option to achieve log scale on the magnitude axis is therefore not a solution. What to do instead is a longer story to which we now turn.

2.2 Sandbox: data on island areas

We use two datasets publicly available as sandboxes for play. Figure 1 shows the areas of islands above 2,500 km2 using data from Wikipedia. A data file is posted in the website directory associated with this issue.

. set scheme sj

. use island_areas
(https://en.wikipedia.org/wiki/List_of_islands_by_area 21 Sept 2017)

The histogram syntax is a bare default, apart from pulling the x-axis title downward a little, given the power 2 inside the variable label.

. histogram area, xscale(titlegap(*5))
(bin=13, start=2535, width=163712.69)

Figure 1. Histogram of areas of those islands above 2,500 km2. A highly skewed distribution is evident.

The histogram illustrates a glaring problem. It conveys clearly that the distribution is highly skewed. Apart from that, it delivers very little detail. Almost all the islands fall within the leftmost bin. We could tinker with the possible choices, say, by showing more bins, but it is predictable that most of the space on the graph would still be wasted.

With an outcome or response variable that is always positive, measured on a continuous scale, and highly positively skewed, statistical experience should indicate that we try a logarithmic scale.

Before we do that, we should back up and say more about the data. The lower cutoff of 2,500 km2 arises because the Wikipedia article disclaims completeness of listing below that island size. The data are also questionable at the upper end. You will recall the elementary definition that an island is an area of land surrounded by water. You will also recall that what are conventionally called continents are typically listed separately in reference works. So we are missing Africa, Asia, and Europe (combined); North and South America (combined); Antarctica; and Australia. Those are four (not seven!) very large islands if we respect the criterion of being surrounded by water. We could digress further and discuss the geographical, geological, and geopolitical senses of the idea of continents. Ambrose Bierce’s jest (16 April 1881; see Bierce [2002, 19]) unerringly singled out the main absurdity:

Australia, n. A country lying in the South Sea, whose industrial and commercial development has been unspeakably retarded by an unfortunate dispute among geographers as to whether it is a continent or an island.

For more discussion, see the scholarly and incisive analysis by Lewis and Wigen (1997). At its simplest, the idea of a continent was negative, say, that of land not known to be an island at one time. Now that the world is mapped fairly completely, all conventional continents are known to be islands or parts of islands.

We can summarize for our purposes by noting that the distribution in figure 1 is doubly incomplete, because data are omitted beyond both lower and upper limits. Thus, the problem of showing the distribution is even worse than the graph implies. Firing up summarize, detail shows that the ratio of maximum and minimum of the data in hand is about 840. It would be much larger with more data at lower and upper ends.
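In case the calculation behind that ratio is not obvious, here is a minimal sketch (an illustration added here, not in the original column) using the saved results of summarize; it assumes island_areas is still in memory:

. summarize area, detail
. display r(max)/r(min)

The ratio r(max)/r(min) is what the text quotes as about 840.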

2.3 Logarithmic functions

Stata (and Mata too) offers two logarithmic functions. log(), and synonymously ln(), offers natural (hyperbolic or Napierian) logarithms to base e ≈ 2.71828. If you do not know e, there is more in section 5 (including some helpful references).

I note in passing an often quoted remark by a famous mathematician (Paul Richard Halmos [1916–2006]) that the notation ln is “a textbook vulgarization” (Halmos 1985, 271). That is a natural (indeed) attitude for mathematicians to take, but applied people should find it useful as a flag to themselves and others that they are not using logarithms to base 10. Oddly, although many notations and abbreviations were tried over several centuries, log (lowercase l, no stop or period) for natural logarithm did not become standard notation until the 20th century (Cajori 1929).

The ln notation is often attributed to Stringham (1893), following Cajori, but is older yet. Velleman (2013) found an earlier example in Steinhauser (1875).¹

log10() offers logarithms to base 10. If you ever forget whether log() means to base e or to base 10, the quickest check is to use display with known answers. display can be abbreviated di.

. display log10(10)
1

. display log(10)
2.3025851

. display ln(10)
2.3025851

If you have need of logarithms to other bases, my guess is that you know that already and know how to calculate them. What is the base-2 logarithm of 8 (mental check: 2³ = 8)?

. display log10(8)/log10(2)
3

From the first example, and the definition of logarithms discussed in section 5, it can be said that natural logarithms are just logarithms to base 10 multiplied by ln 10. The other way round, logarithms to base 10 are natural logarithms multiplied by log10(e) = log10{exp(1)} or divided by ln 10. Graphs with logarithms to different bases look the same; it is just the numbers in axis labels that will differ. So, for many purposes, it does not much matter which base you use, so long as you are consistent and remember which.

For most of my statistical and scientific work, I reach for ln(). My reasoning is that this relates more directly to any context with mathematical reasoning using natural logarithms or exponentials. That said, for the problem in view of choosing bins on logarithmic scale, I recommend logarithms to base 10. People in most scientific or practical fields are more likely to relate to powers of 10 than to powers of e (or even 2).

In their excellent books, Cleveland (1985, 1993, 1994) and Robbins (2005, 2013) encourage the use of logarithmic axis labels using powers of 2. But how many people would prefer 1,024 to 1,000 or 1,048,576 to 1 million in the graphs they read? For how many, except possibly computer scientists or experts in combinatorics, does 2¹⁰ or 2²⁰ have more impact and meaning than 10³ or 10⁶? Whatever the answers, a new command to be discussed later does offer some support to those wishing to use powers of 2.
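As a quick numerical check of that conversion (an illustration added here, not in the original column), display confirms that dividing a natural logarithm by ln 10 recovers the base-10 logarithm:

. display log10(200)
2.30103

. display ln(200)/ln(10)
2.30103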

1. David Velleman is a brother of the statistician Paul F. Velleman.

2.4 Digression: A binning scheme using base 3

Base 3 may seem surprising as a basis for binning, but consider this intriguing proposal from the statistical ecologist C. B. Williams (1953; 1964, 9; 1970, 24).² His work was also discussed in a previous column (Cox 2005). Williams’s main focus is on counted data such as the number of insects caught in a light trap or the number of words in sentences of text, so the data are just counts ≥ 1. Williams suggests using bins that are 1; 2 3 4; 5 6 7 8 9 10 11 12 13; and so on. Hence, bin midpoints run 1, 3, 9, ..., so the kth bin has midpoint 3^(k−1). The bin widths (here the number of distinct values) are the same as the midpoints.

Constructing these bins in Mata is engaging. One method is to start with the first bin, for which lower, middle, and upper values are identically 1; then, we add further lower limits by adding 1 to the previous upper limit and further upper limits by adding 3^(k−1) to the previous upper limit.

. mata
------------------------------------------------- mata (type end to exit) --
: bin = (1,1,1)

: for(k = 2; k <= 7; k++) bin = bin \ (bin[k-1, 3] + 1, 3^(k-1),
>         bin[k-1, 3] + 3^(k-1))

: bin
          1      2      3
    +----------------------+
  1 |     1      1      1  |
  2 |     2      3      4  |
  3 |     5      9     13  |
  4 |    14     27     40  |
  5 |    41     81    121  |
  6 |   122    243    364  |
  7 |   365    729   1093  |
    +----------------------+

: end

Alternatively, the lower and upper limits can be given directly as (3^(k−1) + 1)/2 and (3^k − 1)/2. Compare entries https://oeis.org/A007051 and https://oeis.org/A003462 in the On-line Encyclopedia of Integer Sequences, or sequences M1458 and M3463 in Sloane and Plouffe (1995).

These limits are not quite geometric progressions, so rounding log3 of the data does not yield those bins. Rather, binning for a variable y is given directly by k = ⌈log3(2y + 1)⌉, or in Stata terms,

. generate k = ceil(log(2 * y + 1)/log(3))
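As a spot check (added here for illustration, not part of the original column), y = 3 should fall in the second bin (2 3 4) and y = 5 in the third (5 through 13), and the formula agrees:

. display ceil(log(2*3 + 1)/log(3))
2

. display ceil(log(2*5 + 1)/log(3))
3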

2. Williams (1889–1981) and I went to the same high school and have one university in common. We did not overlap. 268 Speaking Stata: Logarithmic binning and labeling

2.5 Island areas

Let us go back to work on our island areas. The first step is to get a log base-10 version of area into a new variable.

. clonevar log10_area = area

. replace log10_area = log10(area)
variable log10_area was long now double
(180 real changes made)

What I just did is a twist on the more obvious

. generate log10_area = log10(area)

Indeed, I am already looking ahead because I know how I will use this new variable. Some small details thus need explanation.

Why is clonevar used first? The reasoning is that with that command any variable label will be copied over from the original variable. That saves the labor of looking up the variable label and retyping it, or working out how to copy it in some other way. Variable labels can be long and they can be fiddly (mentioning units of measurement, or whatever), so any gain on this front can be welcome.

The variable label no longer matches the contents of the variable. The label says area in square km, but the contents are the base-10 logarithm of that area. So I am playing a little dangerously. That is good reason for making the variable name as informative as possible, in this case log10_area. I will need to ensure that what readers (me, anybody else) see on the histogram is consistent: text and numbers should all match.

Let us look at a rudimentary histogram for this log transformed variable (figure 2):

. histogram log10_area, xscale(titlegap(*5))
(bin=13, start=3.403978, width=.22496652)

Figure 2. Histogram of areas of those islands above 2,500 km2 using a logarithmic scale. A reduction in skewness is evident.

That shows progress. The distribution is still skewed, but we are getting a clearer picture in the histogram. A refinement here (and quite often when magnitudes are truncated below) is that we might prefer that binning start at (the logarithm base 10 of) 2,500 km2. We already know that display will give us this number:

. display log10(2500)
3.39794

So we could retype (or, more smartly, copy and paste) that number into the histogram command. But there is an even better way. We can ask Stata to do the calculation on the fly and then use the result, all in one. An example is easier to understand than an explanation:

. histogram log10_area, xscale(titlegap(*5)) start(`=log10(2500)')
(bin=13, start=3.39794, width=.22543098)

To save space, we will not show the graph here; it differs only slightly from figure 2.

Do you get the trick here? One piece of syntax that may be new to you here is the punctuation surrounding an expression to be evaluated. The syntax form `=exp' is an instruction to evaluate the expression exp and insert the result in its place. So Stata goes off on the side and works out the expression, here just log10(2500). Then, the histogram command sees and uses the result. A tiny bonus, not usually important in practice but welcome in principle, is that you get more precision than display shows you by default. It turns out that in this example, 3.397940008672 is the result used. The extra digits make no discernible graphical difference, but using the maximum precision possible can be important for reproducibility in other nongraphical problems.
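An equivalent route, sketched here as an aside (not in the original column), is to store the result in a local macro first and then refer to it wherever it is needed:

. local start = log10(2500)
. histogram log10_area, xscale(titlegap(*5)) start(`start')

Either way, the value passed along is the computed result rather than a hand-copied approximation.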

We have arranged that the bins start where they should, given the lower cutoff. The next task is to work a bit at the axis labels. We know that 10⁴ is 10,000, 10⁵ is 100,000, and 10⁶ is 1 million. Those seem obvious label choices. I would want 2,500 to be a label, too, and we can use the same trick as before to get that shown as a label. We show the resulting graph as figure 3.

. histogram log10_area, xscale(titlegap(*5)) start(`=log10(2500)')
>     xlabel(`=log10(2500)' "2500" 4 "10000" 5 "100000" 6 "1000000")
(bin=13, start=3.39794, width=.22543098)

Figure 3. Improved histogram on logarithmic scale with better labeling on the horizontal axis

The effect of the syntax should seem simple, even if any detail in the xlabel() call is new to you. We say where the axis labels should go (at 3.39794, or so, and at 4 5 6), and, crucially, we say what text should appear at those positions. Now the axis labels make sense to the reader as areas in square km and match the axis title, inherited automatically from the variable label we arranged earlier.

We are most of the way there, but there is just one more fix that is usually needed. The densities shown are still calculated with respect to log10_area. Getting axis labels to show areas in the original units has not changed that. Almost always, density calculated with respect to a variable not even shown directly is not what you want to see. I would usually reach for one of three options: frequency, fraction, or percent. For exploratory work on one-off datasets, frequency is often simplest and best. For comparing with other work, fraction or percent might seem better. Figure 4 shows where we are now.

. histogram log10_area, xscale(titlegap(*5)) start(`=log10(2500)')
>     xlabel(`=log10(2500)' "2500" 4 "10000" 5 "100000" 6 "1000000") frequency
(bin=13, start=3.39794, width=.22543098)

Figure 4. Histogram with better magnitude axis labeling

Small tweaks are always possible to any graph. Here are some possibilities; a sketch combining several of them follows the list. Naturally, my own tastes weigh heavily in the suggestions.

More labels. Some people might want more axis labels, say, at 25,000 and 250,000 too. You already have the tricks for doing that.

Text for big numbers. At some point, depending partly on taste and partly on whether you run out of space, you might prefer to change the text for the biggest numbers to, say, "1 million" or "1 m".

Change of units. Nothing except possibly taste or convention rules out a change of units, say, to 1,000 km2, so that the text shown ranges from 2.5 to 1,000.

Horizontal labels. The y-axis labels would look better aligned horizontally. Horizontal text is much easier to read, so long as we do not lose too much space thereby.

More or fewer bins. Some people might want more or fewer bins. It may be easier to play with bin(), specifying number of bins, than with width(), which (remember!) must be expressed on the logarithmic scale.

Align bin bounds and labels. Some find it a little disconcerting whenever axis labels do not line up with bin boundaries. My advice is not to think that this is a problem. We will shortly see an example where it is easy to get.

Colors. Bar colors could be more attractive. Here in the Stata Journal, we are using scheme sj, but you need not make the same choice.

More explanation. For presentation or publication, we would want to add a good graph title or caption, but I will stop short of addressing what or how.
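As promised above, here is one possible combination of these tweaks (a sketch, not a graph shown in the original): more labels at 25,000 and 250,000, shorter text for the largest value, horizontal y-axis labels, and a lighter bar color. It assumes the island-area variable from above is still in memory, and the option names follow those already used in this column; tastes will differ.

. histogram log10_area, xscale(titlegap(*5)) start(`=log10(2500)') frequency
>     xlabel(`=log10(2500)' "2500" 4 "10000" `=log10(25000)' "25000"
>     5 "100000" `=log10(250000)' "250000" 6 "1 m")
>     ylabel(, angle(horizontal)) bfcolor(gs15)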

What the data are telling us is a good question, which we will also not pursue. If you know about Pareto distributions, for example, you will know that taking logs of a Pareto distribution gives you an exponential distribution, but let that be a throwaway remark.

2.6 Sandbox: Data on country populations

Figure 5 shows the populations of countries, again using data from Wikipedia. A data file is posted in the website directory associated with this issue. The data are as they come: see the extensive set of comments on Wikipedia on territories whose status is in dispute in some sense. We have skipped the pretense of discovering that a logarithmic scale is the right way to think here. We should know that in advance. That is precisely why the example is included.

. use country_populations, clear
(https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population)

. describe

Contains data from country_populations.dta
  obs:           240                    https://en.wikipedia.org/wiki/List_of_
                                          countries_and_dependencies_by_population
 vars:             4                    5 Oct 2017 13:58
 size:        28,080                    (_dta has notes)

              storage   display    value
variable name   type     format     label      variable label
------------------------------------------------------------------------------
name            str54    %54s
population      double   %10.0g                Population
date            str18    %18s
source          str37    %37s
------------------------------------------------------------------------------

Sorted by:

. clonevar log10_pop = population

. replace log10_pop = log10(log10_pop)
(240 real changes made)

. histogram log10_pop
(bin=15, start=1.7558749, width=.492403)

Figure 5. Distribution of country populations as reported on Wikipedia, 22 September 2017. Histogram shows populations on logarithmic scale. The histogram is a good start but needs improvement in several details.

The default—fortuitously but fortunately—illustrates a point small enough not to worry about but large enough to explain. If we tweak the start of binning to 1.5 and the bin width to 0.5, then some of the bin boundaries, namely, the integers 2 to 9, will be “nice” round numbers, as I hope you agree. There will be more on what defines niceness in the next section. That makes informative labeling quite easy. Much scope exists for tinkering with the mix of numbers and abbreviations (m for millions?) or for choosing fewer labels (surely not more?). I will flag (among other cosmetic changes) the tricks of extended tick length and pulling down the axis title. Figure 6 is the improved version.

. histogram log10_pop, start(1.5) width(0.5) xlabel(2 "100" 3 "1000" 4 "10000"
>     5 "100000" 6 "1 m" 7 "10 m" 8 "100 m" 9 "1000 m", tlength(*2))
>     xscale(titlegap(*5)) frequency bfcolor(gs15) ylabel(, angle(horizontal))
(bin=16, start=1.5, width=.5)

Figure 6. Distribution of country populations as reported on Wikipedia, 22 September 2017. Histogram shows populations on logarithmic scale. Note choices of bin start, bin width, tick length, and title position as indicated by the accompanying syntax.

You can experiment, as I did, with other bin widths, such as 1 or 0.25. As usual, there is a tradeoff between detail and simplicity. I stopped with the graph you see in figure 6.
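Such experiments are one-line changes. For instance (variants of my own, not graphs shown in the original), keeping the same start and switching only the width gives coarser or finer views:

. histogram log10_pop, start(1.5) width(1) frequency
. histogram log10_pop, start(1.5) width(0.25) frequency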

3 Labeling

3.1 Default labeling can be awkward

The main strategy for logarithmic binning for histograms was to transform to a new variable, yet also to arrange that readers see axis labels and titles in the original units. Those original units are usually easier to think about. Logarithmic scales that are conventionally accepted and routinely used (say, decibels, pH for acidity, Richter scale for earthquake magnitude, stellar magnitude) do not need such transformations. Occasionally, readers do need to be reminded that on logarithm base-10 scales, a step of 1 corresponds to a factor of 10 and so the difference (really, the multiplier) can be a big deal.

Just using xscale(log) or yscale(log) is much easier when it is the right answer. Whenever that is done, we still need to think about axis labels. The same kinds of questions arise with many kinds of graphs, ranging from scatterplots and line graphs to more exotic or specialized kinds, but we will not need to look at many examples to think about the issue. In fact, we will keep going with the minor theme of univariate distributions.

A quantile plot using the official command quantile shows the problem that can arise. In passing, I will mention that qplot (Cox 1999, 2005, 2016) is more versatile, but we do not need it here. Figure 7 shows the country populations we have been looking at. Log scale warps the reference line for a uniform distribution that appears by default, so we blank it out as a distraction using no line color.

. quantile population, yscale(log) rlopts(lcolor(none))

Figure 7. Quantile plot for country populations. The vertical axis labels on logarithmic scale are squeezed up and unreadable.

The problem is unreadable labels at one end of the scale. An obvious solution is to reach in and spell out a better choice of labels manually. Experienced users will often have done that. The contribution of this section is to offer a helper program that suggests some nice choices given the range of the variable (or, alternatively, a specified minimum and maximum).
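For the record, the manual fix might look something like this (a sketch with hand-picked labels, not taken from the original; with yscale(log), the label positions are given on the original scale):

. quantile population, yscale(log) rlopts(lcolor(none))
>     ylabel(100 10000 1000000 "1 m" 100000000 "100 m", angle(horizontal))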

3.2 Nicer labels can be specified

“Nice numbers” for axis labels have been discussed in several places. Heckbert (1990) commendably tried to make the problem, or its solution, as simple as possible. Ideas similar if not identical often underlie programs, both in Stata and elsewhere (for example, Hardin [1995]). In contrast, and equally commendably, Wilkinson (2005, 95–97) and Talbot, Lin, and Hanrahan (2010) have spelled out how fully automated choice entails a delicate tradeoff between several desiderata, chiefly simplicity, coverage, granularity, and legibility. Talbot, Lin, and Hanrahan (2010) give a full survey of the problem. The aim here is more modest, merely to make suggestions given users’ preferences on style.

In this small literature, little attention has been paid to logarithmic scales. It is an easy start that, for example, integer powers of any base, such as

..., 1, 10, 100, 1,000, ...   or   ..., 1, 2, 4, 8, ...

would be equally spaced on logarithmic scale. In what follows, these are called styles 1 and 2. The main issue is what to do if you want more labels, especially with the first set of choices. This is where the new command niceloglabels provides support.

Various styles of logarithmic labels go beyond integer powers with evident preferences for using nice numbers as labels, even at the cost of sacrificing equal intervals for slightly varying intervals. Thus two styles common in various literatures are

..., 1, 2, 5, 10, 20, 50, 100, ...

..., 1, 3, 10, 30, 100, ...

So multiplication steps are alternately 2× and 2.5× in the first set and alternately 3× and (10/3)× in the second set. In what follows, these are called styles 125 and 13. The choice between these styles can be based partly on taste and partly on the fact that 125 typically offers more labels than 13 for the same relative range. What they have in common is approximately equal multiplicative steps combined with single significant figures. Recall that significant figures are what is left after setting aside leading and trailing zeros. Thus, 10⁹ has one significant figure and 0.12 has two.

A different style runs

..., 1, 4, 7, 10, 40, 70, 100, ...

with the idea of equal additive steps within each power of 10, given that 4 − 1 = 7 − 4 = 10 − 7 = 3. This is less common in my experience, but for an example, see Dupont (2009, 270). Finally, I will mention

..., 1, 5, 10, 50, 100, ...

with a similar motivation but showing a preference for 1 and 5 as significant figures. This is called style 15.

If you use niceloglabels, you need to do a little work.

1. You need to specify the variable concerned or else a minimum and maximum defining a range.

2. You need to choose your idea of “nice” given a small catalog of available styles. As in everyday life, tastes can differ, so you need to specify yours.

3. You need to name a local macro to hold results.

Consider the variable pop shown in figure 7. Here are some examples of using niceloglabels to suggest labels. First, we try the pure powers of 10 implied by style 1 for that variable. The output, echoed to the Results window, shows what many will have realized immediately: there are far too many zeros for a congenial display.

. niceloglabels pop, style(1) local(yla)
100 1000 10000 100000 1000000 10000000 100000000 1000000000

At that point, we can reach for an option to specify powers using the syntax defined in help text. Typing this kind of detail is especially tedious and error-prone, quite apart from the possibility that people need scripts to automate graph production.

. niceloglabels pop, style(1) local(yla) powers
100 "10{sup:2}" 1000 "10{sup:3}" 10000 "10{sup:4}" 100000 "10{sup:5}"
> 1000000 "10{sup:6}" 10000000 "10{sup:7}" 100000000 "10{sup:8}"
> 1000000000 "10{sup:9}"

In each case, the text displayed is echoed to a local macro. The name of that macro should be specified in a graph option. There is absolutely no need to retype the syntax or even copy and paste it. As this example shows, we can specify other details at the same time.

. quantile pop, yscale(log) ylabel(`yla', angle(horizontal))
>     rlopts(lcolor(none))

Figure 8 shows the result.


Figure 8. Quantile plot for country populations. The vertical axis labels on a logarithmic scale were produced using niceloglabels and are congenially spaced and readable. Compare figure 7.

4 The niceloglabels command

4.1 Syntax diagram

    niceloglabels varname [if] [in], local(macname) style(style) [powers]

    niceloglabels #1 #2, local(macname) style(style) [powers]

4.2 Description

niceloglabels suggests axis labels that would look nice on a graph using a logarithmic scale. It can help when you choose yscale(log) or xscale(log), or both, and wish to show nicer labels than the default. Results are put in a local macro for later use.

4.3 Remarks

There are two syntaxes. In the first, the name of a numeric variable must be given. The values selected must all be positive. In the second, two numeric values are given, which will be interpreted as indicating minimum and maximum of an axis range. Those two values can be given in any order, but as before, values must both be positive.

“Nice” is a little hard to define but easier to recognize. For example, it is a bonus if labels are exactly or approximately equally spaced on a logarithmic scale (or conversely, on the original scale), and it is a bonus if numbers are powers of 10 or 2 multiplied by small integers. Users must specify their preferred style, one from the following list:

1    means powers of 10 such as ..., 0.1, 1, 10, 100, 1,000, ...
13   means cycling such as ..., 0.3, 1, 3, 10, 30, 100, 300, ...
15   means cycling such as ..., 0.5, 1, 5, 10, 50, 100, 500, ...
125  means cycling such as ..., 0.1, 0.2, 0.5, 1, 2, 5, 10, ...
147  means cycling such as ..., 0.1, 0.4, 0.7, 1, 4, 7, 10, ...
2    means powers of 2 such as ..., 1, 2, 4, 8, 16, ...

When the ratio of maximum (max) and minimum (min) is an order of magnitude (power of 10) or less, none of these styles will suggest more than a few labels. In that case, you are almost certainly better off with labels equally spaced on the original scale, which is what Stata gives you anyway.

To make this concrete, here are the numbers of labels suggested when the minimum is 10 = 1e1 = 10¹ and the power of the maximum is as tabulated in rows. Thus, the first row is for min = 10 and max = 100 = 1e2 = 10², for which the labels suggested, for styles 2 147 125 15 13 1, are, respectively, 16 32 64; 10 40 70 100; 10 20 50 100; 10 50 100; 10 30 100; 10 100; hence, the numbers of labels are 3 4 4 3 3 2.

                            style
                  2   147   125    15    13     1
             2    3     4     4     3     3     2
             3    7     7     7     5     5     3
   power     4   10    10    10     7     7     4
     of      5   13    13    13     9     9     5
    max      6   17    16    16    11    11     6
             7   20    19    19    13    13     7
             8   23    22    22    15    15     8
             9   27    25    25    17    17     9

Powers of 2 make most sense in practice when the amounts to be shown are small positive integers or the problem has some combinatorial flavor, or both.

niceloglabels is conservative in that it typically will not suggest labels outside the data range. You could add such labels on the fly in your calls to later graphics commands. Technical hint: The small print behind “typically” is a fudging of minimum and maximum as a workaround for precision problems.

For an example of 147, see Dupont (2009, 270). Note the suggestion by Cleveland (1985, 39; 1994, 39) of 3–10 labels on any axis.
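To illustrate the second syntax described above (an example of my own, not one from the original; output not shown), you could ask for style 125 suggestions spanning roughly the island-area range used earlier and then pass the macro to xlabel(), just as yla was passed to ylabel() before:

. niceloglabels 2500 2200000, style(125) local(xla)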

4.4 Options

local(macname) inserts the specification of labels in local macro macname within the calling program’s space. Hence, that macro will be accessible after niceloglabels has finished. This is helpful for later use with graph or other graphics commands. local() is required.

Anyone new to the idea and use of local macros should study the examples in the help carefully. niceloglabels creates a local macro, which is a kind of bag holding the text to be inserted in a graph command. The local macro is referred to in that graph command using the punctuation `' around the macro name. Note that the opening (left) single quote and the closing (right) single quote are different. Other single quotation marks will not work. Do not be troubled by the closing single quote (’) appearing as upright in many fonts.

style(style) specifies a style for axis labels. style() is required. See section 4.3.

powers specifies that labels be specified using syntax interpreted by graph as superscripts and ready to be used within a ylabel() or xlabel() option call. Thus, if the labels were 100, 1,000, and 10,000, the output would be

100 "10{sup:2}" 1000 "10{sup:3}" 10000 "10{sup:4}"

5 Tutorial: Logarithms as powers

Many friendly accounts of logarithms can be found in a great variety of styles. Abbott (1942) is clear and concise and only slightly old fashioned. Gullberg (1997) combines simple explanations with historical and biographical details. Axler (2009 or any later edition) is informatively discursive with many examples of applications. Maor (1994) is entertaining and enthusiastic on the larger history and mathematical context but also a little erratic (see the review in Blank [2001]).

Let’s start by focusing on power notation. The 2 and 3 within 10² and 10³ are powers or exponents. In the simplest case with such powers that are positive integers (whole numbers), this notation is a way to express repeated multiplication:

10 = 10¹,   100 = 10 × 10 = 10²,   1000 = 10 × 10 × 10 = 10³

The power or exponent (in this case 1, 2, 3) is the number of times the base (in this case 10) appears in the product. Multiplication corresponds to adding powers, for example,

100 × 1000 = 10² × 10³ = 10⁵ = 100000

and division therefore to subtracting powers, for example,

1000/100 = 10³/10² = 10¹ = 10

Therefore, 10²/10² = 10⁰ = 1, and if we continue dividing by 10,

10⁰/10¹ = 10⁻¹ = 0.1,   10⁻¹/10¹ = 10⁻² = 0.01, etc.

That extension gives a meaning to powers that are zero or negative integers.

We can extend this to intermediate positive and negative powers. If 100 = 10² and 1000 = 10³, then we look for x in 200 = 10^x, which is between 2 and 3 because 200 is between 100 and 1,000. In this broadening, x need not be an integer or whole number: it could be a real number with a fractional part. If y = 10^x, then we call x the logarithm or log of y to the base 10. That is, x = log10 y. So

log10 100000 = 5,   log10 100 = 2,   log10 0.01 = −2

We use a calculator or computer function to get

log10 200 = 2.30103

or other logarithms with fractional parts. Logarithms to the base 10 are occasionally called common or Briggsian logarithms (after the English mathematician Henry Briggs, 1561–1630).

The general definition is that if y = c^x, then x is the logarithm or power of y to the base c, which must be positive. Two frequently used bases apart from 10 are 2 and e ≈ 2.71828. Powers of 2 are met in binary arithmetic, and also in sedimentology, where particle size is often reported on the phi scale (φ), which is −log2 of particle size in mm. The number e has many special properties, particularly those encountered in differential and integral calculus, such as that the derivative with respect to x of e^x is also e^x. Logarithms to the base e are often called natural or Napierian logarithms, after the Scottish mathematician John Napier (1550–1617), and denoted log_e or ln.

If we have logarithms to two different bases a and b, then

log_a y = log_a b × log_b y

log_a b is always a constant: for example, log10 e ≈ 0.43429. We have two rules already, that multiplying corresponds to adding logarithms and dividing to subtracting them. In general,

log(y1 × y2) = log y1 + log y2   and   log(y1/y2) = log y1 − log y2

so that

log y² = log(y × y) = log y + log y = 2 log y

log y³ = log(y × y × y) = 3 log y

and in fact

log y^k = k log y, for any value of k
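A quick numerical check of these rules in Stata (an illustration added here, not in the original column):

. display log10(20) + log10(50)
3

. display log10(4^3)
1.80618

. display 3*log10(4)
1.80618

The first result confirms that log10 20 + log10 50 = log10 1000 = 3; the other two confirm that log y^k = k log y.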

Sometimes, we want to reverse the process and return to the original scale, which is called antilogging:

antilog(log y) = y

The appropriate function will be c^x or c^(log y) for base c, for example, 10^x, 2^x, e^x for bases 10, 2, e.

If x² = y, then x is the square root of y, denoted √y, and √y × √y = y¹, so the power or logarithm of the square root is given by

log(√y) + log(√y) = 2 log(√y) = 1

and is therefore 1/2, so that √y = y^(1/2). Similarly, if x³ = y, then x is the cube root of y, or ∛y = y^(1/3).

Many processes are physically multiplicative (growth or decline depends on the amount present, such as growth under compound interest or radioactive decay) and are rendered mathematically additive by representation on logarithmic scales. Many variables vary over several orders of magnitude (which means powers of 10) but are compressed toward the lower end of the scale. Taking logs squeezes high values together relative to low values: hence, this reduces positive skewness of single variables and crowding near the origin of scatterplots, which are both common in data analysis. Another reason for taking logarithms is that power functions y = ax^b are converted to straight lines because

log y = log a + log x^b = log a + b log x

which is a straight line on a plot of log y against log x. Hence, many curved relationships (convex or concave) can be straightened.

For any readers particularly interested in the history of ideas, I now add further notes and references. Kaunzer (1994) gives a scholarly and thorough outline of the history. The term logarithm, as a portmanteau creation based on Greek roots logos (here best translated as ratio) and arithmos or number, was introduced by John Napier. Napier’s life and works have received much attention (Napier 1834; Knott 1915; Gladstone-Millar 2003; Havil 2014). Incidentally, Havil (2014, ix) lists 12 variant spellings of his family name, which helps to explain why variants such as “Naperian” persist to the present.

Some key points from the longer history deserve emphasis. The ideas of powers and exponents as notation and as easing multiplication, division, and rooting go back many centuries. Napier’s major contribution was practical provision of tables of logarithms, yet his tables were at most implicitly of logarithms to base 1/e: calling natural logarithms Napierian is thus a little stretch, or even a large one. Honors to Napier for introducing logarithms as calculation aids should be shared to some degree with Henry Briggs; the Swiss Jost Bürgi (1552–1632); and indeed several others who produced fuller tables until well into the twentieth century.

However, these major practical advances did not mean that the ideas of logarithms as merely exponents, the scope for many different bases, and ideas of logarithmic and exponential functions were born in modern form just like that. They grew erratically and unevenly but were well formed and well explained in the work of the great Swiss mathematician Leonhard Euler (1707–1783). His expository works (notably Euler [1748, 1770]) remain highly accessible to modern readers, so look out for versions in your first language (for example, Euler [1822, 1988, 1990]). Dunham (1999) gives a splendid introduction to Euler’s mathematical work, while Calinger (2016) is a scholarly but highly readable biography.

6 Conclusion

In statistical computing (and in anything worthwhile?), details matter in getting the best results. Seeing or knowing that a logarithmic scale would be a good idea is, with a little experience, an easy first step in designing a graph. Binning for histograms and getting nice labels on graph axes can be trickier. This column has tried to provide some clear paths through the swamps of syntax and the thickets of taste and technique.

7 References

Abbott, P. 1942. Teach Yourself Algebra. London: English Universities Press.

Atkinson, A. B. 2015. Inequality: What can be done? Cambridge, MA: Harvard University Press.

Axler, S. 2009. Precalculus: A Prelude to Calculus. Hoboken, NJ: Wiley.

Bierce, A. 2002. The Unabridged Devil’s Dictionary. Athens, GA: University of Georgia Press.

Blank, B. 2001. Reviewed Works: A History of Pi by Petr Beckmann; The Joy of Pi by David Blatner; The Nothing That Is by Robert Kaplan; The Story of a Number by Eli Maor; An Imaginary Tale by Paul Nahin; Zero: The Biography of a Dangerous Idea by Charles Seife. College Mathematics Journal 32: 155–160.

Cajori, F. 1929. A History of Mathematical Notations. Volume II: Notation Mainly in Higher Mathematics. Chicago: Open Court.

Calinger, R. S. 2016. Leonhard Euler: Mathematical Genius in the Enlightenment. Princeton, NJ: Princeton University Press.

Cleveland, W. S. 1985. The Elements of Graphing Data. Monterey, CA: Wadsworth.

. 1993. Visualizing Data. Summit, NJ: Hobart.

. 1994. The Elements of Graphing Data. Rev. ed. Summit, NJ: Hobart.

Cox, N. J. 1999. gr42: Quantile plots, generalized. Stata Technical Bulletin 51: 16–18. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 113–116. College Station, TX: Stata Press.

. 2005. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460.

. 2008. Stata tip 59: Plotting on any transformed scale. Stata Journal 8: 142–145.

. 2012. Speaking Stata: Transforming the time axis. Stata Journal 12: 332–341.

. 2016. Software Updates: gr42 7: Quantile plots, generalized. Stata Journal 16: 813–814.

Dunham, W. 1999. Euler: The Master of Us All. Washington, DC: Mathematical Association of America.

Dupont, W. D. 2009. Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. 2nd ed. Cambridge: Cambridge University Press.

Euler, L. 1748. Introductio in Analysin Infinitorum. Lausanne: Apud Marcum-Michaelem Bousquet et Socios.

. 1770. Vollständige Anleitung zur Algebra. St. Petersburg: Kaiserliche Academie der Wissenschaften.

. 1822. Elements of Algebra. London: Longman.

. 1988. Introduction to the Analysis of the Infinite: Book I. New York: Springer.

. 1990. Introduction to the Analysis of the Infinite: Book II. New York: Springer.

Gladstone-Millar, L. 2003. John Napier: Logarithm John. Edinburgh: National Museums of Scotland Publishing.

Gullberg, J. 1997. Mathematics: From the Birth of Numbers. New York: W. W. Norton.

Halmos, P. R. 1985. I Want to be a Mathematician: An Automathography. New York: Springer.

Hardin, J. W. 1995. dm28: Calculate nice numbers for labeling or drawing grid lines. Stata Technical Bulletin 25: 2–3. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 19–20. College Station, TX: Stata Press.

Havil, J. 2014. John Napier: Life, Logarithms, and Legacy. Princeton, NJ: Princeton University Press.

Heckbert, P. S. 1990. Nice numbers for graph labels. In Graphics Gems, ed. A. S. Glassner, 61–63. Cambridge, MA: Academic Press.

Kaunzer, W. 1994. Logarithms. In Companion Encyclopedia of the History and Philosophy of the Mathematical Sciences, ed. I. Grattan-Guinness, 210–228. London: Routledge.

Knott, C. G., ed. 1915. Napier Tercentenary Memorial Volume. London: Longmans, Green for Royal Society of Edinburgh.

Krumbein, W. C. 1934. Size frequency distributions of sediments. Journal of Sedimentary Petrology 4: 65–77.

Lewis, M. W., and K. E. Wigen. 1997. The Myth of Continents: A Critique of Metageography. Berkeley and Los Angeles, CA: University of California Press.

Maor, E. 1994. e: The Story of a Number. Princeton, NJ: Princeton University Press.

Milanovic, B. 2016. Global Inequality: A New Approach for the Age of Globalization. Cambridge, MA: Harvard University Press.

Moles, A. T., D. I. Warton, L. Warman, N. G. Swenson, S. W. Laffan, A. E. Zanne, A. Pitman, F. A. Hemmings, and M. R. Leishman. 2009. Global patterns in plant height. Journal of Ecology 97: 923–932.

Napier, M. 1834. Memoirs of John Napier of Merchiston, His Lineage, Life and Times, With a History of the Invention of Logarithms. Edinburgh: William Blackwood.

Pinker, S. 2018. Enlightenment Now: The Case for Reason, Science, Humanism, and Progress. New York: Viking.

Robbins, N. B. 2005. Creating More Effective Graphs. Hoboken, NJ: Wiley.

. 2013. Creating More Effective Graphs. Wayne, NJ: Chart House.

Sloane, N. J. A., and S. Plouffe. 1995. The Encyclopedia of Integer Sequences. San Diego: Academic Press.

Steinhauser, A. 1875. Lehrbuch der Mathematik für höhere Gewerbeschulen. Algebra. Wien: Carl Gerold’s Sohn.

Stringham, I. 1893. Uniplanar Algebra. Being Part I of a Propædeutic to the Higher Mathematical Analysis. San Francisco: Berkeley Press.

Talbot, J., S. Lin, and P. Hanrahan. 2010. An extension of Wilkinson’s algorithm for positioning tick labels on axes. IEEE Transactions on Visualization and Computer Graphics 16: 1036–1043.

Velleman, D. 2013. How did the notation “ln” for “log base e” become so pervasive? https://math.stackexchange.com/questions/1694/how-did-the-notation-ln-for-log-base-e-become-so-pervasive/428560#428560.

Wilkinson, L. 2005. The Grammar of Graphics. 2nd ed. New York: Springer.

Williams, C. B. 1953. The relative abundance of different species in a wild animal population. Journal of Animal Ecology 22: 14–31.

. 1964. Patterns in the Balance of Nature and Related Problems in Quantitative Ecology. London: Academic Press.

. 1970. Style and Vocabulary: Numerical Studies. London: Charles Griffin.

About the author

Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks, postings, FAQs, and programs to the Stata user community. He has also coauthored 16 commands in official Stata. He was an author of several inserts in the Stata Technical Bulletin and is an editor of the Stata Journal. His “Speaking Stata” articles on graphics from 2004 to 2013 have been collected as Speaking Stata Graphics (2014, College Station, TX: Stata Press).

The Stata Journal (2018) 18, Number 1, pp. 287–289

Stata tip 129: Efficiently processing textual data with Stata’s new Unicode features

Alexander Koplenig
Department of Lexical Studies
Institute for the German language (IDS)
Mannheim, Germany
[email protected]

Prior to Stata 13 and especially Stata 14, Stata’s abilities to process natural language data were limited because of the string length limit of 244 characters and the lack of Unicode support. To extract basic descriptive information from unformatted text data (for example, word frequency information), one needed to rely on workarounds such as Benoit’s (2003) wordscores implementation. With Stata 14, this situation has changed.

To demonstrate why the new string-processing capabilities of Stata are highly relevant and useful for anyone who deals with natural language data, let us consider, for example, that we want to extract the five most frequent words of the English Universal Declaration of Human Rights. We can do this by first downloading the text file from http://www.unicode.org/udhr/ using the copy command. Then, we can use the new string functions ustrwordcount() and ustrword() to produce language-specific Unicode words that are based on word-boundary rules or dictionaries for languages that do not use spaces between words (for example, for Thai, see below):

. copy "http://unicode.org/udhr/d/udhr_eng.txt" udhr.raw, replace

. clear

. local words=ustrwordcount(fileread("udhr.raw"))

. set obs `words'
number of observations (_N) was 0, now 1,963

. generate word=ustrword(fileread("udhr.raw"),_n)

. contract word

. gsort -_freq

. list in 1/5

     +----------------+
     | word     _freq |
     |----------------|
  1. | the        120 |
  2. | and        106 |
  3. | ,           95 |
  4. | of          93 |
  5. | to          83 |
     +----------------+

Note that Stata automatically separates punctuation tokens from actual word tokens. In many situations, this is convenient because it makes (effortful) cleaning procedures unnecessary.


In a similar vein, it is easy to extract frequency statistics for n-grams, that is, sequences of n consecutive word tokens. Let us say we want to extract the five most frequent word pairs (that is, 2-grams) from the data above. We can do this by generating a word identifier that records the position of each word in the text. The resulting file consisting of all first words of each 2-gram is temporarily stored and then merged with all second words:

. generate long wordidentifier=_n

. rename word word1

. tempfile TEMP

. save `TEMP', replace
(note: file F:\ST_0j000002.tmp not found)
file F:\ST_0j000002.tmp saved

. replace wordidentifier=wordidentifier-1
(1,963 real changes made)

. rename word1 word2

. merge 1:1 wordidentifier using `TEMP', keep(3)

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                             1,962  (_merge==3)
    -----------------------------------------

. contract word1 word2

. gsort -_freq

. order word1 word2 _freq

. list in 1/5

     +---------------------------+
     | word1     word2     _freq |
     |---------------------------|
  1. | .         Article      30 |
  2. | right     to           28 |
  3. | the       right        28 |
  4. | has       the          25 |
  5. | of        the          23 |
     +---------------------------+

Note that another possibility to extract the most frequent n-grams and corresponding (absolute or relative) frequencies would be to use the groups command written by Cox (2017), instead of using the contract command before sorting and listing.

As written above, for languages that do not use spaces between words, using the functions ustrwordcount() and ustrword() has the additional advantage that Stata takes care of the word segmentation. For example, if we want to extract the five most frequent words of the Thai Universal Declaration of Human Rights, we just download the Thai text file. Interestingly, we do not have to change the loc argument in the ustrwordcount() and ustrword() functions. This is because Stata uses the ICU tokenizer, which automatically switches to dictionary-based rules when it identifies particular Unicode script input.

While the ICU documentation (http://userguide.icu-project.org/boundaryanalysis) lists dictionary-based support for Japanese, Khmer, Chinese, and Thai, one can find that other languages, such as Burmese (language-specific ISO 639-3 code: mya), Lao (lao), or Tibetan (bod), are also supported by using the corresponding ISO code in the copy command above (udhr_ISO.txt).
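For concreteness, a sketch of the Thai case (commands of my own, not a listing from the original tip) simply repeats the earlier steps with the Thai file, assuming it follows the same udhr_ISO.txt naming pattern (tha being the ISO 639-3 code for Thai); the Thai output itself is not reproduced here:

. copy "http://unicode.org/udhr/d/udhr_tha.txt" udhr_tha.raw, replace
. clear
. local words=ustrwordcount(fileread("udhr_tha.raw"))
. set obs `words'
. generate word=ustrword(fileread("udhr_tha.raw"),_n)
. contract word
. gsort -_freq
. list in 1/5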

1 References

Benoit, K. 2003. Wordscores: Software for coding political texts. http://www.tcd.ie/Political_Science/wordscores/software.html.

Cox, N. J. 2017. Speaking Stata: Tables as lists: The groups command. Stata Journal 17: 760–773.

The Stata Journal (2018) 18, Number 1, pp. 290–291

Software Updates

dm0078_3: newspell: Easy management of complex spell data. H. Kröger. Stata Journal 17: 515–516; 16: 244; 15: 155–172. In previous versions of the program, the newspell combine command combined overlapping spells of the same spell type into a new type of spell. This has been fixed. Now the program only combines overlapping spells of different spell types into a new type of spell.

st0279_2: Modeling underdispersed count data with generalized Poisson regression. T. Harris, Z. Yang, and J. W. Hardin. Stata Journal 15: 605–606; 12: 736–747. All commands were updated to use the modern free-parameter notation in Stata 15.

st0336_1: Regression models for count data based on the negative binomial(p) distribution. J. W. Hardin and J. M. Hilbe. Stata Journal 14: 280–291. You could previously specify the generic eform option to obtain exponentiated coefficients. The nbregp command now accepts the irr option, so the exponentiated coefficients are properly labeled as incidence-rate ratios. nbregp has also been updated to use the modern free-parameter notation in Stata 15.

st0337_1: Estimation and testing of binomial and beta-binomial regression models with and without zero inflation. J. W. Hardin and J. M. Hilbe. Stata Journal 14: 292–303. The vuong option has been removed from zib and zibbin. The estimation commands were updated to use the modern free-parameter notation in Stata 15. The pr1() option has been removed from predict following zib and zibbin.

st0351_1: Modeling count data with generalized distributions. T. Harris, J. M. Hilbe, and J. W. Hardin. Stata Journal 14: 562–579. gbin, nbregf, and nbregw were updated to use the modern free-parameter notation in Stata 15. nbregw has been updated so that you no longer obtain starting values by fitting a constant-only model when the specified model was a constant-only model.

st0378_1: Regression models for count data from truncated distributions. J. W. Hardin and J. M. Hilbe. Stata Journal 15: 226–246. Updated to use the modern free-parameter notation in Stata 15.

st0493_1: Response surface models for OLS and GLS detrending-based unit-root tests in nonlinear ESTAR models. J. Otero and J. Smith. Stata Journal 17: 704–722. The current version of the two commands includes a guard against rounding values in alphas.

st0496_1: Speaking Stata: Tables as lists: The groups command. N. J. Cox. Stata Journal 17: 760–773. groups would exit with an error message if weights were specified. This has been corrected.

st0508_1: Response surface models for the Elliott, Rothenberg, and Stock unit-root test. J. Otero and C. F. Baum. Stata Journal 17: 985–1002. The current version of the command includes a guard against rounding values in alphas.