Randomization Does Not Justify Logistic Regression

Statistical Science 2008, Vol. 23, No. 2, 237–249
DOI: 10.1214/08-STS262
© Institute of Mathematical Statistics, 2008
arXiv:0808.3914v1 [stat.ME] 28 Aug 2008

David A. Freedman

Abstract. The logit model is often used to analyze experimental data. However, randomization does not justify the model, so the usual estimators can be inconsistent. A consistent estimator is proposed. Neyman's non-parametric setup is used as a benchmark. In this setup, each subject has two potential responses, one if treated and the other if untreated; only one of the two responses can be observed. Beside the mathematics, there are simulation results, a brief review of the literature, and some recommendations for practice.

Key words and phrases: Models, randomization, logistic regression, logit, average predicted probability.

David A. Freedman is Professor, Department of Statistics, University of California, Berkeley, California 94720-3860, USA e-mail: [email protected].

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2008, Vol. 23, No. 2, 237–249. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

The logit model is often fitted to experimental data. As explained below, randomization does not justify the assumptions behind the model. Thus, the conventional estimator of log odds is difficult to interpret; an alternative will be suggested. Neyman's setup is used to define parameters and prove results. (Grammatical niceties apart, the terms "logit model" and "logistic regression" are used interchangeably.)

After explaining the models and estimators, we present simulations to illustrate the findings. A brief review of the literature describes the history and current usage. Some practical recommendations are derived from the theory. Analytic proofs are sketched at the end of the paper.

2. NEYMAN

There is a study population with $n$ subjects indexed by $i = 1, \ldots, n$. Fix $\pi_T$ with $0 < \pi_T < 1$. Choose $n\pi_T$ subjects at random and assign them to the treatment condition. The remaining $n\pi_C$ subjects are assigned to a control condition, where $\pi_C = 1 - \pi_T$. According to Neyman (1923), each subject has two responses: $Y_i^T$ if assigned to treatment, and $Y_i^C$ if assigned to control. The responses are 1 or 0, where 1 is "success" and 0 is "failure." Responses are fixed, that is, not random.

If $i$ is assigned to treatment ($T$), then $Y_i^T$ is observed. Conversely, if $i$ is assigned to control ($C$), then $Y_i^C$ is observed. Either one of the responses may be observed, but not both. Thus, responses are subject-level parameters. Even so, responses are estimable (see Section 9). Each subject has a covariate $Z_i$, unaffected by assignment; $Z_i$ is observable. In this setup, the only stochastic element is the randomization: conditional on the assignment variable $X_i$, the observed response $Y_i = X_i Y_i^T + (1 - X_i) Y_i^C$ is deterministic.

Population-level ITT (intention-to-treat) parameters are defined by taking averages over all subjects in the study population:

\[
\alpha^T = \frac{1}{n} \sum_i Y_i^T, \qquad \alpha^C = \frac{1}{n} \sum_i Y_i^C. \tag{1}
\]

For example, $\alpha^T$ is the fraction of successes if all subjects are assigned to $T$; similarly for $\alpha^C$. A parameter of considerable interest is the differential log odds of success,

\[
\Delta = \log \frac{\alpha^T}{1 - \alpha^T} - \log \frac{\alpha^C}{1 - \alpha^C}. \tag{2}
\]

The logit model is all about log odds (more on this below). The parameter $\Delta$ defined by (2) may therefore be what investigators think is estimated by running logistic regressions on experimental data, although that idea is seldom explicit.

The Intention-to-Treat Principle

The intention-to-treat principle, which goes back to Hill (1961, page 259), is to make comparisons based on treatment assigned rather than treatment received. Such comparisons take full advantage of the randomization, thereby avoiding biases due to self-selection. For example, the unbiased estimators for the parameters in (1) are the fraction of successes in the treatment group and the control group, respectively. Below, these will be called ITT estimators. ITT estimators measure the effect of assignment rather than treatment. With crossover, the distinction matters. For additional discussion, see Freedman (2006a).

3. THE LOGIT MODEL

To set up the logit model, we consider a study population of $n$ subjects, indexed by $i = 1, \ldots, n$. Each subject has three observable random variables: $Y_i$, $X_i$, $Z_i$. Here, $Y_i$ is the response, which is 0 or 1. The primary interest is the "effect" of $X_i$ on $Y_i$, and $Z_i$ is a covariate.

For our purposes, the best way to formulate the model involves a latent (unobservable) random variable $U_i$ for each subject. These are assumed to be independent across subjects, with a common logistic distribution: for $-\infty < u < \infty$,

\[
P(U_i < u) = \frac{\exp(u)}{1 + \exp(u)}, \tag{3}
\]

where $\exp(u) = e^u$. The model assumes that $X$ and $Z$ are exogenous, that is, independent of $U$. More formally, $\{X_i, Z_i : i = 1, \ldots, n\}$ is assumed to be independent of $\{U_i : i = 1, \ldots, n\}$. Finally, the model assumes that $Y_i = 1$ if

\[
\beta_1 + \beta_2 X_i + \beta_3 Z_i + U_i > 0;
\]

else, $Y_i = 0$.

Given $X$ and $Z$, it follows that responses are independent across subjects, the conditional probability that $Y_i = 1$ being $p(\beta, X_i, Z_i)$, where

\[
p(\beta, x, z) = \frac{\exp(\beta_1 + \beta_2 x + \beta_3 z)}{1 + \exp(\beta_1 + \beta_2 x + \beta_3 z)}. \tag{4}
\]

(To verify this, check first that $-U_i$ is distributed like $+U_i$.) The parameter vector $\beta = (\beta_1, \beta_2, \beta_3)$ is usually estimated by maximum likelihood. We denote the MLE by $\hat\beta$.

Interpreting the Coefficients in the Model

In the case of primary interest, $X_i$ is 1 or 0. Consider the log odds $\lambda_i^T$ of success when $X_i = 1$, as well as the log odds $\lambda_i^C$ when $X_i = 0$. In view of (4),

\[
\begin{aligned}
\lambda_i^T &= \log \frac{p(\beta, 1, Z_i)}{1 - p(\beta, 1, Z_i)} = \beta_1 + \beta_2 + \beta_3 Z_i, \\
\lambda_i^C &= \log \frac{p(\beta, 0, Z_i)}{1 - p(\beta, 0, Z_i)} = \beta_1 + \beta_3 Z_i.
\end{aligned} \tag{5}
\]

In particular, $\lambda_i^T - \lambda_i^C = \beta_2$ for all $i$, whatever the value of $Z_i$ may be. Thus, according to the model, $X_i = 1$ adds $\beta_2$ to the log odds of success.

Application to Experimental Data

To apply the model to experimental data, define $X_i = 1$ if $i$ is assigned to $T$, while $X_i = 0$ if $i$ is assigned to $C$. Notice that the model is not justified by randomization. Why would the logit specification be correct rather than the probit, or anything else? What justifies the choice of covariates? Why are they exogenous? If the model is wrong, what is $\hat\beta_2$ supposed to be estimating? The last rhetorical question may have an answer: the parameter $\Delta$ in (2) seems like a natural choice, as indicated above.

More technically, from Neyman's perspective, given the assignment variables $\{X_i\}$, the responses are deterministic: $Y_i = Y_i^T$ if $X_i = 1$, while $Y_i = Y_i^C$ if $X_i = 0$. The logit model, on the other hand, views the responses $\{Y_i\}$ as random, with a specified distribution, given the assignment variables and covariates. The contrast is therefore between two styles of inference.

• Randomization provides a known distribution for the assignment variables; statistical inferences are based on this distribution.
• Modeling assumes a distribution for the latent variables; statistical inferences are based on that assumption. Furthermore, model-based inferences are conditional on the assignment variables and covariates.

A similar contrast will be found in other areas too, including sample surveys. See Koch and Gillings (2005) for a review and pointers to the literature.

What if the Logit Model is Right?

Suppose the model is right, and there is a causal interpretation. We can intervene and set $X_i$ to 1 without changing the $Z$'s or $U$'s, so $Y_i = 1$ if and only if $\beta_1 + \beta_2 + \beta_3 Z_i + U_i > 0$. Similarly, we can set $X_i$ to 0 without changing anything else, and then $Y_i = 1$ if and only if $\beta_1 + \beta_3 Z_i + U_i > 0$. Notice that $\beta_2$ appears when $X_i$ is set to 1, but disappears when $X_i$ is set to 0.

On this basis, for each subject, whatever the value of $Z_i$ may be, setting $X_i$ to 1 rather than 0 adds $\beta_2$ to the log odds of success. If the model is right, $\beta_2$ is a very useful parameter, which is well estimated by the MLE provided $n$ is large. For additional detail on causal modeling and estimation, see Freedman (2005).

Even if the model is right and $n$ is large, $\beta_2$ differs from $\Delta$ in (2). For instance, $\alpha^T$ will be nearly equal to $\frac{1}{n} \sum_{i=1}^n p(\beta, 1, Z_i)$.

From Neyman to Logits

How could we get from Neyman to the logit model? To begin with, we would allow $Y_i^T$ and $Y_i^C$ to be 0–1 valued random variables; the $Z_i$ can be random too. To define the parameters in (1) and (2), we would replace $Y_i^T$ and $Y_i^C$ by their expectations. None of this is problematic, and the Neyman model is now extremely general and flexible. Randomization makes the assignment variables $\{X_i\}$ independent of the potential responses $\{Y_i^T, Y_i^C\}$.

To get the logit model, however, we would need to specialize this setup considerably, assuming the existence of IID logistic random variables $U_i$, independent of the covariates $Z_i$, with

\[
\begin{aligned}
Y_i^T &= 1 \text{ if and only if } \beta_1 + \beta_2 + \beta_3 Z_i + U_i > 0, \\
Y_i^C &= 1 \text{ if and only if}
\end{aligned} \tag{9}
\]
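Two claims made above can be checked numerically: that β2 differs from ∆ in (2) even when the logit model is right, and that the ITT estimators are unbiased for the parameters in (1) under Neyman's setup. The following is a minimal sketch, not the paper's own simulation. The coefficient values, the Gaussian covariates, the success rates 0.5 and 0.3, the sample sizes, and every function name are hypothetical choices made for illustration only.

```python
import math
import random


def p(beta, x, z):
    """Success probability from equation (4)."""
    b1, b2, b3 = beta
    return 1.0 / (1.0 + math.exp(-(b1 + b2 * x + b3 * z)))


def log_odds(a):
    """Log odds of a probability a, with 0 < a < 1."""
    return math.log(a / (1.0 - a))


def differential_log_odds(alpha_t, alpha_c):
    """Delta from equation (2)."""
    return log_odds(alpha_t) - log_odds(alpha_c)


def itt_estimates(y_t, y_c, n_treated, rng):
    """One randomization: observed success fractions in T and in C."""
    n = len(y_t)
    treated = set(rng.sample(range(n), n_treated))
    succ_t = sum(y_t[i] for i in treated)
    succ_c = sum(y_c[i] for i in range(n) if i not in treated)
    return succ_t / n_treated, succ_c / (n - n_treated)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Part 1: suppose the logit model is right (hypothetical beta and covariates).
# For large n, alpha^T is nearly the average of p(beta, 1, Z_i); likewise alpha^C.
beta = (-1.0, 1.0, 2.0)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
alpha_t_model = sum(p(beta, 1, zi) for zi in z) / len(z)
alpha_c_model = sum(p(beta, 0, zi) for zi in z) / len(z)
delta_model = differential_log_odds(alpha_t_model, alpha_c_model)

# Part 2: Neyman's setup. Potential responses are generated once and then
# held fixed; only the assignment is random (hypothetical rates 0.5 and 0.3).
n = 200
y_t = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
y_c = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
alpha_t = sum(y_t) / n
alpha_c = sum(y_c) / n
draws = [itt_estimates(y_t, y_c, n // 2, rng) for _ in range(2000)]
mean_est_t = sum(d[0] for d in draws) / len(draws)
mean_est_c = sum(d[1] for d in draws) / len(draws)

print(f"beta_2 = {beta[1]:.3f}, Delta under the model = {delta_model:.3f}")
print(f"alpha_T = {alpha_t:.3f}, mean ITT estimate = {mean_est_t:.3f}")
print(f"alpha_C = {alpha_c:.3f}, mean ITT estimate = {mean_est_c:.3f}")
```

With these hypothetical values, the printed ∆ should come out smaller than β2, because averaging success probabilities over a varying covariate pulls the population-level log odds toward zero. The averaged ITT estimates should sit close to the fixed α^T and α^C, since the only source of randomness in Part 2 is the assignment.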
