
J Epidemiol Community Health: first published as 10.1136/jech.42.4.311 on 1 December 1988.

Journal of Epidemiology and Community Health, 1988, 42, 311-315

Review article: Research methods in epidemiology, I

Uses and abuses of multivariate methods in epidemiology

STEPHEN J W EVANS

From the Department of Clinical Epidemiology, The London Hospital Medical College, Turner Street, London E1

Uses

The general medical literature is replete with examples of so called "multivariate methods". These include multiple regression, logistic regression, discriminant analysis, "Cox regression" and log linear models. They are different types of generalisation of the method of simple regression, the object being to obtain a description of a biological phenomenon, using an equation. This in its simplest form is $y = a + bx$; $y$ is the outcome (dependent) variable and $x$ is a predictor (independent) or explanatory variable. The complex calculations which are required when using these methods are easily carried out using computers and the computer software for them is widely available.

Multiple regression is the generalisation to more than one predictor variable; eg, $y = a + b_1x_1 + b_2x_2$. This equation should be as simple as possible while still providing an adequate description, so that extra terms are only added if they provide an important gain. The x variables which are not of particular interest in themselves are often referred to as covariates. If $y$ is blood pressure and $x_1$, $x_2$ are height and weight, then the equation will provide a much better description than either equation for blood pressure involving height or weight alone. Blood pressure is more dependent on how overweight (for their height) someone is, than on their weight alone. The relationship of blood pressure with height may be nearly non-existent ($b_1$ could be positive, negative or zero) but if weight is in the equation the value of $b_1$, the regression coefficient for height, will very definitely be negative (that is, for a given weight, the taller someone is, the lower their blood pressure). This fact that the value of the regression coefficient changes when other variables are in the equation illustrates both the major use and a major abuse of multivariate methods.

All multivariate methods have this use that allows for the effect of a particular variable to be adjusted for the effect of other measured variables. The classical problem of confounding can sometimes have a solution in the use of multivariate methods. For example, to compare the blood pressure in two different racial groups whose height and weight differ, multiple regression may be used. This will allow the comparison between the groups to be made, uncomplicated by any differences between the groups in height or weight. Any confounding effect of these variables has been removed. In addition, multiple regression may be used to see if the relationship between height, weight and blood pressure is the same for each racial group.¹

The model as described is a linear model in two senses. The first is that the effect on blood pressure of a unit change in, say, height is the same at all heights; a straight line relationship is assumed. A curvilinear relationship could be obtained by including some particular x variable (such as height) as both $x$ and $x^2$, the squared term acting as a further predictor. Part of the art (for it remains an art) of fitting equations to data is in choosing appropriate types of curve to fit the data. The second sense is that even an equation with both an $x$ and $x^2$ in it such as $y = a + b_1x_1 + b_2x_2 + b_3x_1^2$ is still described as a linear model in that there is addition of the terms rather than multiplication of them.
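The behaviour described above, where the coefficient for height changes once weight enters the equation and where a squared term still gives a linear model, can be sketched with a small least-squares fit. This is only an illustration: the data are invented and noise-free, the coefficients (100, -0.3, 0.8 and 0.01) are assumptions chosen for the example, and `fit_linear` is a hypothetical helper built on NumPy rather than anything used in the article.

```python
import numpy as np


def fit_linear(predictors, y):
    """Least-squares fit of y = a + b_1*x_1 + b_2*x_2 + ...; returns [a, b_1, b_2, ...]."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Invented data: every combination of 5 heights and 3 weight deviations.
height = np.repeat([150.0, 160.0, 170.0, 180.0, 190.0], 3)   # cm
deviation = np.tile([-10.0, 0.0, 10.0], 5)                    # kg above/below typical weight
weight = 0.9 * (height - 100.0) + deviation                   # taller people tend to be heavier

# Assumed "true" equation: pressure rises with weight and falls with height at a given weight.
bp = 100.0 - 0.3 * height + 0.8 * weight

height_only = fit_linear([height], bp)                # y = a + b1*height
height_and_weight = fit_linear([height, weight], bp)  # y = a + b1*height + b2*weight

# A squared term is just one more added column, so the model is still "linear".
x1 = height - 170.0                                   # centred height, for numerical stability
bp_curved = bp + 0.01 * x1**2
curved = fit_linear([x1, weight, x1**2], bp_curved)   # y = a + b1*x1 + b2*x2 + b3*x1^2

print("b1, height alone:        ", round(height_only[1], 3))
print("b1, adjusted for weight: ", round(height_and_weight[1], 3))
print("b3, squared height term: ", round(curved[3], 3))
```

With these invented numbers the height coefficient is about +0.42 when height is fitted alone and -0.3 once weight is in the equation, which is the change of sign the text describes.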

The word linear is here used in a technical mathematical sense, but this is the sense of the word when used in a general statistical context. Some reserve the word "multivariate" for analysis of multiple outcome variables (several different y variates) and use "multivariable" for multiple regression and the other methods discussed here, and it may be useful to distinguish them.

Logistic regression is a form of multiple regression where the outcome studied is not a continuous measure like blood pressure, but is instead a binary variable (yes/no, dead/alive, etc). The obvious utility of such a method is clear. The observations of a binary variable may be coded as 0 and 1, for example 0 = alive, 1 = dead. Although any particular observation will always be 0 or 1, we can talk about $P$, the probability of death for an individual or a group of people. The range of $P$ is from 0 to 1 while $P/(1-P)$, the odds of death, has a range from 0 to infinity. The logarithm of this ratio, the log odds or logit($P$), has a range from $-\infty$ to $+\infty$, making it suitable for use in a regression equation. Studies of the Framingham type² have used logistic regression to examine many factors in predicting death from coronary heart disease. Measures of morbidity such as respiratory symptoms and illnesses (coded as present/absent) can also be used.³ The odds ratio which is obtained from the logistic model provides a convenient and comprehensible summary of the effect of explanatory variables. Recent applications of this method to the analysis of case-control studies have been particularly useful. The outcome variable here is whether an individual is a case or a control. The odds ratios obtained from the logistic regression may, in the situation of a rare disease, be interpreted as relative risks.⁴ Care must be taken in the analysis since different forms of logistic regression should be used depending on whether the study design was matched or not.

Discriminant analysis and logistic regression are closely related and in many instances may be used interchangeably, their differences being in the technical form of the equations used. Discriminant analysis is classically used to derive an equation which will be the best one to predict membership of one of two groups. Group membership may be treated in this case as a binary outcome variable $Y$ (coded as 0 for one group and 1 for the other group). The observed mean of $Y$ for a particular combination of x values can be seen as the probability of being in the second group. The discriminant function then becomes essentially identical to ordinary multiple regression. When $Y$ is in the range 0.3 to 0.7, logit($Y$) and $Y$ are approximately linearly related and so logistic regression and discriminant analysis give similar results. It should be noted that using ordinary regression can give predicted values of $Y$ outside the range 0 to 1, which is not possible with logistic regression. The technique then allows for discrimination between the groups. There are detailed technical discussions of the issues of which method to use with two groups.⁵ As a general rule, with two groups and where some of the x variables are categorical rather than continuous, or where the outcome $Y$ can be outside the range 0.3 to 0.7, logistic regression (sometimes called logistic discrimination in that context) is to be preferred.⁶

Discriminant analysis may also be used to examine a set of equations which use several predictor variables to classify a variable on a nominal scale (unordered categories) having more than two categories. For example, in a study to distinguish three diagnostic groups such as motor neurone disease, stroke and multiple sclerosis, the variables used to distinguish between the groups can themselves be binary or continuous, and a nominal variable with $n$ categories can be re-expressed as $n-1$ binary variables. A scoring system can then be used to classify cases on the basis of symptoms etc.⁷

Cox Regression or proportional hazards regression is a form of logistic regression suitable for survival analysis where the time to occurrence (survival time) of an event (usually death) is of primary importance. It has the advantage of being able to deal with varying amounts of follow up in a natural way and has provided a very important tool in the study of survival time. Since it is a type of regression analysis, all the issues of ordinary multiple regression can apply here. Explanatory variables, whether of primary interest or potential confounders, may be incorporated in the model. They are assumed to have an effect which multiplies the death rate by a certain amount (the equivalent of the slope, or regression coefficient in ordinary multiple regression). This effect may be estimated together with its standard error, which allows for calculation of confidence intervals and testing (p values) in the usual way. The "Log-Rank" test is a very similar, though less general, technique. The Cox model also has applications in the analysis of case-control studies, though this is a very specialised area. The results from such an analysis may lead to an estimate of relative risk, with standard errors.

Log-linear models are methods for dealing with both unordered and ordered categorical data. They are a categorical analogue of correlation in the way that logistic regression is an analogue of regression. They are mainly used for looking at the inter-relationships of several variables without necessarily concentrating on a single outcome variable. The method is applied to multi-dimensional tables and ought to be considered in place of standardising mortality rates in many situations.
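The quantities used by logistic regression above (the probability P, the odds P/(1-P) and the logit) and the remark that logit(Y) and Y are approximately linearly related between 0.3 and 0.7 can be checked numerically. A minimal sketch under stated assumptions: the straight line 4(P - 0.5), the tangent to the logit at P = 0.5, is our choice of linear comparison, and the coefficients `a` and `b` are hypothetical values, not estimates from any study.

```python
import math


def odds(p):
    """Odds P / (1 - P): runs from 0 towards infinity as P goes from 0 to 1."""
    return p / (1.0 - p)


def logit(p):
    """Log odds: runs from -infinity to +infinity."""
    return math.log(odds(p))


def tangent_line(p):
    """Straight line touching logit(P) at P = 0.5 (slope 4); an assumed comparison line."""
    return 4.0 * (p - 0.5)


# Largest gap between the logit and the straight line for P from 0.30 to 0.70.
middle = [0.30 + 0.01 * i for i in range(41)]
gap_middle = max(abs(logit(p) - tangent_line(p)) for p in middle)

# The same gap for a probability well outside that range.
gap_extreme = abs(logit(0.1) - tangent_line(0.1))


def probability(a, b, x):
    """P implied by the logistic equation logit(P) = a + b*x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


# Hypothetical coefficients: a one-unit rise in x multiplies the odds by exp(b).
a, b, x = -3.0, 0.7, 2.0
odds_ratio = odds(probability(a, b, x + 1.0)) / odds(probability(a, b, x))

print("max gap for 0.3 <= P <= 0.7:", round(gap_middle, 3))
print("gap at P = 0.1:             ", round(gap_extreme, 3))
print("odds ratio per unit of x:   ", round(odds_ratio, 3), "vs exp(b) =", round(math.exp(b), 3))
```

The gap stays below 0.05 inside the 0.3 to 0.7 range but is about 0.6 at P = 0.1, which is why ordinary regression and discriminant analysis only agree with logistic regression when the outcome probabilities are moderate.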

These five methods detailed above can all be put together as examples of "Generalised Linear Models" and have an overall structure which is similar. There are differences between the methods in the type of equation used but they have an underlying commonality of approach.

For all of the methods their appropriateness may be seen to depend on three things: (1) The variability (called by statisticians the error term) which remains in the data when the equation has been used is specified and allowed for correctly; (2) The form of the equation relating the variables to each other is correct (this relates strongly to the assumptions made in using the equation); and (3) The relevant variables have been measured over a range of medical interest.

Abuses

ERROR

In elementary statistics much emphasis is placed on this factor. In regression (or t tests which can be seen as a form of regression) the consequence is that everyone is worried about whether the distribution of the error is Gaussian (Normal). With multivariate methods a rather more restrictive assumption of multivariate normality is made and this is rarer to achieve as well as harder to test for. In practice most methods are reasonably robust to departures from "normality", but occasionally a single, or a few, observation(s) distort the distribution. These "awkward" observations are more likely to be present if many variables are measured and analysed. Recent statistical attention has been paid to methods of detecting influential observations which may or may not negate the usual distributional assumptions but which affect the magnitude and statistical significance of the effect of a variable. The better computer programs provide good facilities for testing assumptions and examining the effect of influential observations which increase in their adverse effects in multiple regression. Results from multiple regression analysis should not be published without account being taken of problems in this area.

FORM OF THE EQUATION

Epidemiologists have become aware of the need to allow for different forms of equation being used to describe the same set of data, such as multiplicative or additive models. The fact that a multiplicative model ($y = a x_1^{b_1} x_2^{b_2}$) provides a "statistically significant" equation does not demonstrate unequivocally that an additive model is inappropriate. Similarly (and equivalently in some cases) to assume that all variables are independent of each other in their effects (that is, there are no interactions) may also be naive. While it may be parsimonious to use a linear additive model with no interactions and sophisticated to use a non-linear model with interactions and a complex error structure, it is more important to be aware of limitations of equations in describing complex reality. The ability of computers to provide good graphical displays has only just begun to be realised and is at a very elementary stage in multivariate analysis. The temptation to use a new method because it is there rather than checking whether it is appropriate has been particularly prevalent in survival analysis. As noted above the "Cox model" is called "proportional hazards": that is, it assumes that the instantaneous death rate (hazard) is multiplied by a certain factor for each variable, but this assumption may not be true and checking it, again using graph plots, is vital.

As yet, no formal mechanism exists for reaching the correct conclusions regarding error terms or the form of equations. Various attempts to do so automatically may provide some aid but they can be misleading. The analysis requires time and care and is best done by those experienced in handling real data. Some guidelines have been suggested in the context of Cox regression⁸ but, while useful for all multivariable problems, they do not guarantee that the correct answer is obtained.

RELEVANCE

Even experienced statistical analysts may have weaknesses in this area. The story is told of the statistician who presented the results of a multiple regression analysis of a chemical process by saying that water had no effect on yield. The chemists fell about laughing because the concentration of water was controlled to be less than 1 part in a million and varied over a small range below this; water in substantial amounts would have had an explosive effect on yield! Even if a relevant variable is included it must neither be swamped by measurement error nor have an inadequate range of measurement over a biologically relevant "dose" range.

The effect of unmeasured variables is unpredictable. Their absence can lead to totally wrong conclusions with regard to the effect of measured variables, and when contradictory results are found in different studies it may well be that none have measured an important variable. The art of epidemiology partly consists of thinking of such variables and of ways to measure them.

Multivariate methods can be very helpful to the epidemiologist and clinical researcher but like any powerful tool they can be most dangerous when used by those of "a little learning".
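The warning under ERROR that a single observation can distort a regression can be illustrated with six invented data points. This is a minimal sketch: the numbers are made up, and leverage (the diagonal of the hat matrix) is used here as one common way of flagging a potentially influential point, not necessarily the diagnostic offered by the programs the article discusses. `slope_and_leverage` is a hypothetical helper.

```python
import numpy as np


def slope_and_leverage(x, y):
    """Fit y = a + b*x by least squares; return the slope b and each point's leverage."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    hat = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix; its diagonal is the leverage
    return coef[1], np.diag(hat)


# Five unremarkable observations plus one "awkward" observation far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.1, 1.9, 2.0, 2.1, 1.9, 10.0])

slope_all, leverage = slope_and_leverage(x, y)
slope_without_last, _ = slope_and_leverage(x[:-1], y[:-1])

print("slope using all six points:    ", round(slope_all, 3))
print("slope without the sixth point: ", round(slope_without_last, 3))
print("leverage of each point:        ", np.round(leverage, 3))
```

Here the fitted slope is about 0.45 with the sixth point and about -0.02 without it, so the apparent relationship rests almost entirely on one observation; its leverage, close to 1, is what flags it for inspection.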

COMPUTER PROGRAMS

The range of computer programs is very wide indeed and these comments only refer to some of the most well known ones.

Multiple regression - Virtually all statistical computer programs provide for ordinary multiple regression. Many of the better ones, including BMDP, SAS and SPSS (on both mainframe and micro computers) provide useful facilities for checking assumptions and drawing attention to influential observations. They provide various methods to build up a model by adding or deleting variables from the equation according to the statistical significance of that variable. These cut down the amount of computer time and number of attempts to find an acceptable model but they do not necessarily obtain the most sensible model automatically.⁸

A particularly useful program in trying different models is BMDP-9R which fits all possible models for a given set of variables and suggests a "best" model according to a statistical criterion. This is a compromise between a model which includes all possible variables, which will always be the "best" model in terms of apparently explaining most of the variation, and the simplest model, which will include only the single most significant predictor. There is a "penalty" for adding variables so that they are only included if they offer a substantial gain. In spite of the appeal of such a method there is no guarantee of the final "best" model being the "real" model in a scientific sense. BMDP-9R is my best buy program for multiple regression for epidemiologists.

Logistic regression - BMDP-LR and GLIM (on mainframe and micro) have always provided good facilities for logistic regression, with the former having clearer output and a stepping facility, while the latter has great flexibility but output which is only comprehensible to the expert. SAS on mainframe, and in the latest (6.03) version on micro computers, also provides facilities for doing logistic regression with output and flexibility that is intermediate between BMDP and GLIM. SPSS-X on the mainframe (not SPSS-PC) does have a facility for logistic regression but is not as good as the others. A more recent package with good facilities for logistic regression (in both matched and unmatched designs) is EGRET from Seattle. This is my best buy for epidemiologists, though BMDP and SAS are also good.

Discriminant analysis - This technique is less likely to be used by epidemiologists but if it is, BMDP, SAS and SPSS all provide for it, with BMDP-7M (best buy) being particularly useful as it allows for a "jack-knife" classification. In other words, in deciding the success of a model it classifies each case using an equation derived from all the other cases but excluding this one. This leads to a less biased assessment of the success rate of the classification. A better approach, which is applicable to all the multivariate methods (as in fact is jack-knifing), is to use part (half, typically) of the data available as a "training" set to derive the equation and the other part as a "test" set to see how successful the derived equation is.

Cox regression - BMDP, SAS (not in PC version) and EGRET all provide good facilities for doing proportional hazards regression, with EGRET and BMDP-2L providing joint best buys. SPSS-X does also provide some survival analysis on the mainframe but this is weak and non-existent on the PC version at the time of writing (August 1988). GLIM is of course able to do it but only with difficulty!

Log-linear models - All the packages provide a form of log-linear modelling, with BMDP-4F being easiest to use and GLIM giving the greatest flexibility. EGRET provides very good analysis of several 2x2 tables with Mantel-Haenszel statistics, and for that type of analysis is a best buy, but BMDP is more general.

Summary - EGRET and SPSS/PC have the most "user friendly" interface, though EGRET is not at all good at data management and transformations, which are quite good in SPSS/PC. SAS/PC is even better in this area but is slightly more tedious to use, though it is more similar to the mainframe version than is SPSS/PC. BMDP is virtually identical on micro and mainframe, which is both a strength and also a major disadvantage in that it is very user unfriendly. This is a pity since it provides such a wide range of good facilities. GLIM can do virtually everything but requires quite a bit of work "on the side" to provide usable output.

References

1 Silman AJ, Evans SJW, Loysen E. Blood pressure and migration: a study of Bengali immigrants in East London. J Epidemiol Community Health 1987; 41: 152-5.
2 Leaverton PE, Sorlie JC, Kleinman JC, et al. Representativeness of the Framingham risk model for coronary heart disease mortality: a comparison with a national cohort study. J Chron Dis 1987; 40: 775-84.
3 Somerville SM, Rona RJ, Chinn S. Passive smoking and respiratory conditions in primary school children. J Epidemiol Community Health 1988; 42: 105-10.
4 Notani PN. Role of alcohol in cancers of the upper alimentary tract: use of models in risk assessment. J Epidemiol Community Health 1988; 42: 187-92.
5 Titterington DM, Murray GD, Murray LS, et al. Comparison of discrimination techniques applied to a complex data set of head injured patients. J Roy Statistical Soc A 1981; 144: 145-75.

6 Alexander FE, Roberts MM, Huggins A, Muir B. Use of risk factors to allocate schedules for breast cancer screening. J Epidemiol Community Health 1988; 42: 193-9.
7 Li TM, Day SJ, Alberman E, Swash M. Differential diagnosis of motorneurone disease from other neurological conditions. Lancet 1986; ii: 731-3.
8 Lew RA, Day CL, Harrist TJ, Wood WC, Mihm MC. Multivariate analysis. Some guidelines for physicians. JAMA 1983; 249: 641-3.

Accepted for publication August 1988

Notice of Meetings

International Conference on Ionising Radiation and Cancer Epidemiology, University of Birmingham, 12-13 July 1989. Further particulars from: Dr Tom Sorahan, Department of Social Medicine, University of Birmingham, Edgbaston, Birmingham B15 2TJ.

There will be an International Conference on Community Nursing organised by the Netherlands Institute of Primary Care on 16-17 March 1989. The conference will take place at the Casino Den Bosch in 's-Hertogenbosch (The Netherlands). Further information from: NIVEL, Mrs A Kerkstra PhD, PO Box 1568, 3500 BN Utrecht, The Netherlands.

The XIIth International Conference on Preventive and [...] will take place on August 13-16, 1986, in Montreal. The theme will be "[...] in Health - The Scientific Challenge for the 21st Century". The conference chairman is Dr Josef Vobecky. For further information write to the Conference Secretariat, The Columbia/Kenness Team, 1010 Ste. Catherine St. West, Suite 645, Montreal, Quebec H3B 1G7, Canada.

The 1st European Conference on Health Economics will take place on September 19-21 1989 in Barcelona. The theme will be "Incentives in Health Systems." Abstracts are required by February 1. For further information write to Dr Guillem Lopez Casasnovas, Asociacion de Economia de la Salud, Plaça Catalunya 9, 4a planta, 08002 Barcelona, Spain.