INFORMATION THEORY MAKES LOGISTIC REGRESSION SPECIAL

Ernest S. Shtatland, PhD
Mary B. Barton, MD, MPP
Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA

ABSTRACT

This paper is a continuation of our previous presentations at NESUG, 1997 and SUGI, 1998 ([1, 2]). Our first aim is to provide a theoretical justification for using logistic regression (vs. probit, gompit, or angular regression models, for example). The consensus is that logistic regression is a very powerful, convenient, and flexible statistical tool; however, it is completely empirical. Information theory can guide our interpretation of logistic regression as a whole, and of its coefficients; through this interpretation we will demonstrate how logistic regression is special, and unlike the other regression models mentioned above. A similar approach will be used to interpret Bayes' formula in terms of information.

Our second goal is to propose a test of significance that in the case of small samples is superior to the conventional Chi-Square test. This is important because, in addition to the unreliability of Chi-Square, small sample sizes are typical and unavoidable in many fields, including medical and health services research. The proposed test can also be interpreted in terms of information theory.

LOGISTIC REGRESSION AND INFORMATION

To model the relationship between a dichotomous outcome variable (YES vs. NO, DEAD vs. ALIVE, SELL vs. NOT SELL, etc.) and a set of explanatory variables, we have a fairly wide "menu" of alternatives, such as logistic, probit, gompit, and angular regression models. In [3] one can find a larger list of seven alternatives. Only two of them, probit and logit, have received significant attention. According to [3], p. 79, even probit and logit (not to mention the other possible nonlinear specifications) are arbitrary. See also [4], p. 388, on the arbitrariness of the logit models: "The logit transformation is similar to the probit but on biological grounds is more arbitrary." In many sources the opinion has been expressed that logistic regression is a very powerful, convenient, and flexible statistical tool which is nevertheless completely empirical, with no theoretical justification ([5], p. 164; [6], p. 1724). To the best of our knowledge, [6] is the first and only work that provides some theoretical justification for logistic models. However, this justification is given on deterministic grounds - in terms of general systems theory.

We will show that logistic regression is special and unlike the other regression models mentioned above by justifying it in statistical terms (within information theory) rather than on deterministic grounds as in [6]. This is more natural because logistic analysis is first and foremost a statistical tool.

A typical logistic regression model is of the form

    log(P / (1 - P)) = b0 + b1 X1 + b2 X2 + ... + bk Xk     (1)

where P is the probability of the event of interest, b0, b1, b2, ..., bk are the parameters of the model, X1, X2, ..., Xk are the explanatory variables, and log is the natural logarithm. In this form the logistic model with the logit link looks truly arbitrary, with no advantages over any of the other models discussed above.
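As an illustration of (1) (this sketch is ours, not part of the original paper; the coefficient values and data are invented), the following Python fragment generates event probabilities from a logistic model and verifies that the log odds log(P / (1 - P)) recover the linear predictor exactly:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters b0, b1, b2 and explanatory variables X1, X2
    b = np.array([-1.0, 0.8, -0.5])              # b0, b1, b2
    X = rng.normal(size=(5, 2))                  # five observations, two covariates

    linear_predictor = b[0] + X @ b[1:]          # b0 + b1*X1 + b2*X2
    P = 1.0 / (1.0 + np.exp(-linear_predictor))  # event probabilities under the model

    # The logit transform recovers the linear predictor exactly, as in (1)
    logit = np.log(P / (1.0 - P))
    print(np.allclose(logit, linear_predictor))  # True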
As is well known, for any random event E we have two numbers: the event's probability P(E) and its information I(E) - the information contained in the message that E occurred. These quantities are connected according to the formula

    I(E) = - log P(E)     (2)

Usually log in (2) is the binary logarithm, and in this case information is measured in bits. Of course, other bases of the logarithm can be used, and as a result the information units vary. Information is as fundamental a concept as probability, and there are cases (in particular, in physics and engineering) in which information is even more convenient and natural than probability. Perhaps logistic regression is one of these cases.

Taking into consideration the definition of information (2), it is easy to see that the left side of (1) is the difference in information content between the event of interest E and the nonevent NE. The appearance of the information difference (ID) between E and NE seems logical because logistic regression can be treated as a variant of discriminant analysis when the assumption of normality is not justified (see, for example, [7], p. 232; [8], pp. 19-20, 34-36; or [9], pp. 355-356). That is why the information difference ID could also be called the discriminant information difference. This interpretability in terms of information is a unique property of logistic regression and constitutes the advantage of the logistic model over probit, gompit, and other similar models.

We will also give a new interpretation to the coefficients b0, b1, b2, ..., bk in (1). Usually they are interpreted as logarithms of odds ratios. According to [8], p. 41, "This fact concerning the interpretability of the coefficients is the fundamental reason why logistic regression has proven such a powerful analytic tool for epidemiologic research." And further, see [8], p. 47: "This relationship between the logistic regression coefficient and the odds ratio provides the foundation for our interpretation of all logistic regression results." But the question arises whether odds ratios themselves are a solid foundation for this interpretation. According to the same authors ([8], p. 42), "The interpretation given for the odds ratio is based on the fact that in many instances it approximates a quantity called the relative risk." Other authors are not so optimistic about this approximation and about the value of odds ratios. The opinions vary from "Odds ratios are hard to comprehend directly" ([10], p. 989), to "odds ratio is hard to interpret clinically" ([11], p. 1233), to Miettinen's opinion that the odds ratio is epidemiologically "unintelligible". Also, according to Altman ([9], p. 271): "The odds ratio is approximately the same as the relative risk if the outcome of interest is rare. For common events, however, they can be quite different, so it is best to think of the odds ratio as a measure in its own right." Altman's opinion that "it is best to think of the odds ratio as a measure in its own right" comes especially to the point. We think that logistic regression coefficients also need a new interpretation in their own right, and that this interpretation should be done in terms of information. It is easy to show that the coefficients b1, b2, ..., bk have the meaning of the change in the discriminant information difference (ID) as the corresponding explanatory variable gets a one-unit increase (with statistical adjustment for the other variables). For example, if X1 is a dichotomous explanatory variable with values 0 and 1, then

    b1 = ID(E vs. NE | X1 = 1) - ID(E vs. NE | X1 = 0)     (3)

Thus, equation (1) can be treated as the decomposition of the discriminant information difference between the event E and the nonevent NE into the sum of contributions of the explanatory variables. This decomposition in terms of information is linear, unlike the original logistic model, which is nonlinear in terms of probabilities.
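To make (2) and (3) concrete, here is a small Python sketch of ours (the helper names and the probabilities are illustrative, and we work in bits, i.e., base-2 logarithms, rather than the natural logarithm of (1)):

    import math

    def information(p):
        """I(E) = -log2 P(E): information, in bits, of an event with probability p (formula (2))."""
        return -math.log2(p)

    def info_difference(p_event):
        """ID = I(NE) - I(E) = log2(P / (1 - P)): the information difference carried by the logit."""
        return information(1.0 - p_event) - information(p_event)

    # Hypothetical event probabilities for a dichotomous covariate X1 (values 0 and 1)
    p_given_x0 = 0.20
    p_given_x1 = 0.50

    id_x0 = info_difference(p_given_x0)  # -2 bits: nonevent four times as likely
    id_x1 = info_difference(p_given_x1)  #  0 bits: even odds

    # Per (3), the coefficient of X1 (on the bit scale) is the change in ID
    b1 = id_x1 - id_x0
    print(id_x0, id_x1, b1)              # -2.0 0.0 2.0

On this scale a coefficient reads as "bits of discriminant information gained per one-unit increase in the covariate," which is exactly the additive interpretation proposed above.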
INFORMATION AND BAYES' FORMULA

It is well known how important Bayes' formula is for modifying disease probabilities based on diagnostic test results (see, for example, [12] and [13]). The most popular and intuitive variant of Bayes' formula used in the medical literature is the odds-likelihood ratio form:

    posterior odds in favor of disease =
        prior odds in favor of disease * likelihood ratio     (4)

or

    P(D | R) / P(ND | R) =
        (P(D) / P(ND)) * (P(R | D) / P(R | ND))     (4')

where D stands for disease, ND for nondisease, and R for a test result. Even this "odds-instead-of-probabilities" form is not intuitive enough. The first problem here is the problem with odds themselves: unlike risks, odds are difficult to understand; they are fairly easy to visualize when they are greater than one, but are less easily grasped when the value is less than one ([10], pp. 989-990). The second problem with odds is that, although they are related to risk, the relation is not straightforward - the two characteristics become increasingly different in the upper part of the scale. The third problem is related to the fact that formula (4') is inherently multiplicative, while human thinking grasps additive relationships more easily. This is the reason why researchers working with the more conventional forms of Bayes' formula like (4) or (4') sometimes use special nomograms and tables to calculate the posterior odds and probabilities ([13], pp. 124-126).

Taking the logarithms of both sides of (4') (in this case the binary logarithm is more appropriate), we arrive at the following relationship between information quantities:

    ID(D, ND | R) = ID(D, ND) + ID(R | D, ND)     (5)

where ID(D, ND | R) is the posterior information difference between disease D and nondisease ND given the result R of the test, ID(D, ND) is the corresponding prior information difference, and ID(R | D, ND) is the difference in information content of the test result between the disease and nondisease cases. In other words, (5) could be reformulated as follows:

    Discrimination information difference between disease and nondisease after the test =
        Discrimination information difference between disease and nondisease before the test +
        Information contained in the test result about the disease/nondisease dilemma.

Thus the information variant of Bayes' formula (5) conveys literally what we always imply when working with the conventional Bayes' theorem: increase and balance of information. A numerical sketch of (5) is given below.

The same information viewpoint applies to the significance test mentioned in the abstract: the likelihood ratio statistic 2(log L(b) - log L(0)) can be treated as a sample estimate of the difference between the information characteristics of the 'null' model (b = 0) and the model under consideration. The most natural interpretation of this difference is the information gain, IG, that we have obtained in moving from the simplest 'null' model to the fitted model ([17], pp. 163-173). As a result we have

    IG * (N / K) = 2(log L(b) - log L(0)) / K     (8)

where N is the sample size and K is the number of explanatory variables, so that IG = 2(log L(b) - log L(0)) / N is the information gain per observation.
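As a numerical sketch of the additive Bayes formula (5) (ours, not the paper's; the prevalence, sensitivity, and specificity are invented for illustration), the following Python fragment computes the posterior both multiplicatively via (4') and additively via (5) and confirms they agree:

    import math

    log2 = math.log2

    # Hypothetical diagnostic setting
    p_disease = 0.10      # prior probability of disease, P(D)
    sensitivity = 0.90    # P(R | D): probability of a positive result given disease
    specificity = 0.80    # P(not R | ND), so P(R | ND) = 0.20

    p_r_given_d = sensitivity
    p_r_given_nd = 1.0 - specificity

    # Prior information difference ID(D, ND) = log2 of the prior odds
    prior_id = log2(p_disease / (1.0 - p_disease))

    # Information in the positive result: ID(R | D, ND) = log2 of the likelihood ratio
    test_id = log2(p_r_given_d / p_r_given_nd)

    # Posterior odds via Bayes' formula (4'), then the posterior ID
    posterior_odds = (p_disease / (1.0 - p_disease)) * (p_r_given_d / p_r_given_nd)
    posterior_id = log2(posterior_odds)

    # Formula (5): posterior ID = prior ID + test information (additive, in bits)
    print(round(prior_id, 3), round(test_id, 3), round(posterior_id, 3))
    print(math.isclose(posterior_id, prior_id + test_id))  # True

Here a positive result contributes about 2.17 bits in favor of disease, and this contribution is the same whatever the prior - precisely the additivity that the multiplicative form (4') obscures.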
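Finally, a sketch of the information gain behind (8) (again ours, with invented data, and assuming that the 'null' model is the intercept-only model, that K = 1 explanatory variable, and that IG is measured in nats). With a single dichotomous covariate the logistic maximum likelihood fit reproduces the group proportions, so both log likelihoods can be computed directly:

    import math

    # Made-up outcomes y grouped by a dichotomous covariate x (0 or 1)
    y_x0 = [0, 0, 0, 1, 0, 0, 1, 0]   # 2 events out of 8 when x = 0
    y_x1 = [1, 1, 0, 1, 1, 0, 1, 1]   # 6 events out of 8 when x = 1
    y = y_x0 + y_x1
    N = len(y)

    def bernoulli_loglik(ys, p):
        """Sum of log P(y) under a Bernoulli(p) model."""
        return sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi in ys)

    # Null model: a single overall proportion (intercept only)
    p_null = sum(y) / N
    loglik_null = bernoulli_loglik(y, p_null)

    # Fitted model: with one dichotomous x, the logistic MLE fits each group's proportion
    loglik_fit = (bernoulli_loglik(y_x0, sum(y_x0) / len(y_x0)) +
                  bernoulli_loglik(y_x1, sum(y_x1) / len(y_x1)))

    # Likelihood ratio statistic and the per-observation information gain
    lr = 2.0 * (loglik_fit - loglik_null)   # 2(log L(b) - log L(0)), chi-square with K = 1 df
    ig = lr / N                             # information gain per observation, in nats
    K = 1
    print(round(lr, 3), round(ig, 4), round(ig * N / K, 3))  # last value equals lr / K, as in (8)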