Journal of Epidemiology and Community Health, 1988, 42, 311-315
J Epidemiol Community Health: first published as 10.1136/jech.42.4.311 on 1 December 1988.

Review article: Research methods in epidemiology, I

Uses and abuses of multivariate methods in epidemiology

STEPHEN J W EVANS

From the Department of Clinical Epidemiology, The London Hospital Medical College, Turner Street, London E1

Uses

The general medical literature is replete with examples of so called "multivariate methods". These include multiple regression, logistic regression, discriminant analysis, "Cox regression" and log linear models. They are different types of generalisation of the method of simple linear regression, the object being to obtain a description of a biological phenomenon, using an equation. This in its simplest form is $y = a + bx$; $y$ is the outcome (dependent) variable and $x$ is a predictor (independent) or explanatory variable. The complex calculations which are required when using these methods are easily carried out using computers and the computer software for them is widely available.

Multiple regression is the generalisation to more than one predictor variable; eg, $y = a + b_1 x_1 + b_2 x_2$. This equation should be as simple as possible while still providing an adequate description, so that extra terms are only added if they provide an important gain. The $x$ variables which are not of particular interest in themselves are often referred to as covariates. If $y$ is blood pressure and $x_1$, $x_2$ are height and weight, then the equation will provide a much better description than either equation for blood pressure involving height or weight alone. Blood pressure is more dependent on how overweight (for their height) someone is, than on their weight alone. The relationship of blood pressure with height may be nearly non-existent ($b_1$ could be positive, negative or zero) but if weight is in the equation the value of $b_1$, the regression coefficient for height, will very definitely be negative (that is, for a given weight, the taller someone is, the lower their blood pressure). This fact that the value of the regression coefficient changes when other variables are in the equation illustrates both the major use and a major abuse of multivariate methods.

All multivariate methods have this use that allows for the effect of a particular variable to be adjusted for the effect of other measured variables. The classical problem of confounding can sometimes have a solution in the use of multivariate methods. For example, to compare the blood pressure in two different racial groups whose height and weight differ, then multiple regression may be used. This will allow the comparison between the groups to be made, uncomplicated by any differences between the groups in height or weight. Any confounding effect of these variables has been removed. In addition, multiple regression may be used to see if the relationship between height, weight and blood pressure is the same for each racial group.[1]

The model as described is a linear model in two senses. The first is that the effect on blood pressure of a unit change in, say, height is the same at all heights; a straight line relationship is assumed. A curvilinear relationship could be obtained by including as a further predictor some particular $x$ variable (such as height) as both $x$ and $x^2$. Part of the art (for it remains an art) of fitting equations to data is in choosing appropriate types of curve to fit the data. The second sense is that even an equation with both an $x$ and $x^2$ in it such as

$y = a + b_1 x_1 + b_2 x_2 + b_3 x_1^2$

is still described as a linear function in that there is addition of the terms rather than multiplication of them.
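The way the height coefficient changes once weight enters the equation can be illustrated with a small numerical sketch. The data below are synthetic and noise-free, invented purely for illustration (they are not taken from the article or from any study), and the helper name `fit` is hypothetical. The sketch assumes a population in which weight rises with height, and blood pressure rises with weight but falls with height at a given weight.

```python
import numpy as np

# Synthetic, noise-free data invented for illustration (not from the article).
heights = np.repeat([150.0, 160.0, 170.0, 180.0, 190.0], 2)  # cm
offsets = np.tile([-10.0, 10.0], 5)  # kg below/above the typical weight for that height
weights = 0.9 * heights - 80.0 + offsets  # kg; weight rises with height

# Assumed relationship: more weight raises blood pressure and, for a given
# weight, more height lowers it.
bp = 160.0 + 0.8 * weights - 0.5 * heights


def fit(predictors, outcome):
    """Least-squares fit of outcome = a + b1*x1 + ...; returns [a, b1, ...]."""
    design = np.column_stack([np.ones(len(outcome))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


# Height as the only predictor: y = a + b1*x1
a_simple, b_height_alone = fit([heights], bp)

# Height and weight together: y = a + b1*x1 + b2*x2
a_multi, b_height_adj, b_weight = fit([heights, weights], bp)

print(f"height alone:         b1 = {b_height_alone:+.2f}")
print(f"height, given weight: b1 = {b_height_adj:+.2f}, b2 = {b_weight:+.2f}")
```

With these invented numbers the coefficient for height is slightly positive when height is the only predictor and becomes negative once weight is in the equation, which is the pattern the text describes. Real data would include random variation, so the estimates would carry standard errors rather than being recovered exactly.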
The word linear is here used in a technical mathematical sense, but this is the sense of the word when used in a general statistical context. Some statisticians reserve the word "multivariate" for analysis of multiple outcome variables (several different $y$ variates) and use "multivariable" for multiple regression and the other methods discussed here, and it may be useful to distinguish them.

Logistic regression is a form of multiple regression where the outcome studied is not a continuous measure like blood pressure, but is instead a binary variable (yes/no, dead/alive, etc). The obvious utility of such a method in medical research is clear. The observations of a binary variable may be coded as 0 and 1, for example 0 = alive, 1 = dead. Although any particular observation will always be 0 or 1, we can talk about $P$, the probability of death for an individual or a group of people. The range of $P$ is from 0 to 1 while $P/(1-P)$, the odds of death, has a range from 0 to infinity. The logarithm of this ratio, the log odds or logit($P$), has a range from $-\infty$ to $+\infty$, making it suitable for use in a regression equation. Studies of the Framingham type[2] have used logistic regression to examine many risk factors in predicting death from coronary heart disease. Measures of morbidity such as respiratory symptoms and illnesses (coded as present/absent) can also be used.[3] The odds ratio which is obtained from the logistic model provides a convenient and comprehensible summary of the effect of explanatory variables. Recent applications of this method to the analysis of case-control studies have been particularly useful. The outcome variable here is whether an individual is a case or a control. The odds ratios obtained from the logistic regression may, in the situation of a rare disease, be interpreted as relative risks.[4] Care must be taken in the analysis since different forms of logistic regression should be used depending on whether the study design was matched or not.

Discriminant analysis and logistic regression are closely related and in many instances may be used interchangeably, their differences being in the technical form of the equations used. Discriminant analysis is classically used to derive an equation which will be the best one to predict membership of one of two groups. Group membership may be treated in this case as a binary outcome variable $Y$ (coded as 0 for one group and 1 for the other group). The observed average of $Y$ for a particular combination of $x$ values can be seen as the probability of being in the second group. The discriminant function then becomes essentially identical to ordinary multiple regression. When $Y$ is in the range 0.3 to 0.7, logit($Y$) and $Y$ are approximately linearly related and so logistic regression and discriminant analysis give similar results. It should be noted that using ordinary regression can give predicted values of $Y$ outside the range 0 to 1, which is not possible with logistic regression. The technique then allows for discrimination between the groups. There are detailed technical discussions of the issues of which method to use with two groups.[5] As a general rule, with two groups and where some of the $x$ variables are categorical rather than continuous, or where the probability $Y$ can be outside the range 0.3 to 0.7, logistic regression (sometimes called logistic discrimination in that context) is to be preferred.[6]

Discriminant analysis may also be used to examine a set of equations which use several predictor variables to classify a variable on a nominal scale (unordered categories) having more than two categories. For example, in a study to distinguish three diagnostic groups such as motor neurone disease, stroke and multiple sclerosis, the variables used to distinguish between the groups can themselves be binary or continuous, and a nominal variable with $n$ categories can be re-expressed as $n-1$ binary variables. A scoring system can then be used to classify cases on the basis of symptoms etc.[7]

Cox regression or proportional hazards regression is a form of logistic regression suitable for survival analysis where the time to occurrence (survival time) of an event (usually death) is of primary importance. It has the advantage of being able to deal with varying amounts of follow up in a natural way and has provided a very important tool in the study of survival time. Since it is a type of regression analysis then all the issues of ordinary multiple regression can apply here. Explanatory variables, whether of primary interest or potential confounders, may be incorporated in the model. They are assumed to have an effect which multiplies the death rate by a certain amount (the equivalent of the slope, or regression coefficient in ordinary multiple regression). This effect may be estimated together with its standard error which allows for calculation of confidence intervals and statistical significance testing (p values) in the usual way. The "Log-Rank" test is a very similar, though less general, technique. The Cox model also has applications in the analysis of case-control studies, though this is a very specialised area. The results from such an analysis may lead to an estimate of relative risk, with standard errors.

Log-linear models are methods for dealing with both unordered and ordered categorical data. They are a categorical analogue of correlation in the way that logistic regression is an analogue of regression. They are mainly used for looking at the inter-relationships of several variables without necessarily concentrating on a single outcome variable. The method is applied to