<<

48 Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017

STATISTICS FOR RESEARCHERS Principles of Regression NJ Gogtay, SP Deshpande, UM Thatte

Introduction relationship between one or more the “dependent” variable. In the independent variables and the second example, since CAD is hile correlation analysis helps dependent variable. A correlation associated with both diabetes and Win identifying associations analysis with a scatter and hypertension, CAD would become or relationships between two a regression line1 is however a the dependent variable and diabetes variables,1 the regression technique prerequisite to regression and and hypertension would be the two or regression analysis is used to both analyses are often carried out independent variables. If you will “model” this relationship so as together. notice, diabetes in one example is to be able to predict what will a dependent variable and in the happen in a real-world setting. Let Dependent and other, an independent variable. us understand this with something Independent Variables Thus, what is a dependent and that happens in daily life. As what is an independent variable children, we are often told by our An important first step before needs to be defined à priori by the parents that education is a vital carrying out a regression analysis researcher before carrying out the part of our lives and a “good” is to understand the concept regression analysis. The choice education will help us get “good of independent and dependent of the variables would in turn be jobs” and thus “good wages”! If variables. An independent variable, defined by the research question we were to explore the relationship as the name suggests is a “stand and the hypothesis being explored. between education and wages, alone” variable, and one that Independent variables are often we could think up two questions remains unaffected by other called predictor variables or – a) Does a relationship exist variables that are measured in a exogenous variables and dependent between education and wages? And study. The “dependent” variable is variables are called prognostic or a related but more interesting and the one that is usually of interest to endogenous variables. important question b) for every extra the researcher and alters in response year spent in education, do wages to change/s in the independent variable. Types of Regression commensurately increase [and if Let us understand this concept Essentially in medical research, yes, by how much?]? The latter with the following two research there are three common types of question is moving into the realm of questions – 1) “As age increases, regression analyses that are used . Regression attempts to does the risk of developing diabetes viz., linear, and answer these and similar questions as measured by HbA1C or blood Cox regression. These are chosen regarding relationships between sugar levels increase? Or b) “Given depending on the type of variables variables. that a patient has both diabetes and that we are dealing with (Table 1). hypertension, what is his risk of Cox regression is a special type of Correlation Versus developing coronary artery disease regression analysis that is applied Regression- [CAD]”? In the former example, to survival or “time to event “ we are exploring the relationship Understanding the and will be discussed in detail in between age and diabetes and in the next article in the series. Distinction the latter, the relationship between can be simple The difference between two variables with a third variable– i.e., diabetes and hypertension linear or multiple linear regression correlation analysis and regression while Logistic regression could be lies in the fact that the former focuses with CAD. In the first example, age would be the independent Polynomial in certain cases (Table on the strength and direction of 1). the relationship between two or variable while diabetes, which is more variables without making any dependent on age would become The type of regression analysis assumptions about one variable being independent and the other dependent [see below], but regression analysis Department of Clinical Pharmacology, Seth GS Medical College and KEM Hospital, Mumbai, Maharashtra Received: 07.01.2017; Accepted: 15.01.2017 assumes a dependence or causal Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017 49

Table 1: Types of regression duration of surgery, extent of Type of regression Dependent variable Independent variable Relationship between blood loss and Hb levels at and its nature and its nature variables day 7, it is intuitive that all Simple linear One, continuous, One, continuous, Linear are correlated amongst each normally distributed normally distributed other and therefore will lead to Multiple linear One, continuous Two or more, may Linear erroneous results. What should be continuous or be taken as the independent is only the duration of Logistic One, binary Two or more, may Need not be linear be continuous or surgery. categorical • The sample size should be Polynomial (logistic) Non-binary Two or more, may Need not be linear adequate [multinomial] be continuous or categorical If the dependent variable is Cox or proportional Time to an event Two or more, may Is rarely linear non-binary and has more than hazards regression be continuous or two possibilities, we use the categorical multinomial or polynomial to be used in a given situation is be quantitative or qualitative. logistic regression. primarily driven by the following Both the independent variables three metrics here could be expressed either Step Wise Regression • Number and nature of as continuous data (blood is an independent variable/s pressure or HbA1C values) automated tool that can be used or qualitative data (presence • Number and nature of the in the exploratory stages of model or absence of diabetes as dependent variable/s building to identify a useful subset defined by the ADA 2016 or of predictor variables. The process • Shape of the regression line hypertension as defined by JNC systematically adds the most 3,4 A. Linear regression: Linear VIII). A linear relationship significant variable or removes regression is the most basic should exist between the the least significant variable during and commonly used regression dependent and independent each step. The idea behind using technique and is of two types variables. such automated tools is to maximize viz. simple and multiple B. Logistic regression: This type the power of prediction with a regression. You can use Simple of regression analysis is used minimum number of independent linear regression when there when the dependent variable is variables. is a single dependent and a binary in nature. For example, if single independent variable the outcome of interest is death Steps in Conducting a (e.g. for the research question in a cancer study, any patient Regression Analysis described above “As age in the study can have only one increases, does the risk of of two possible outcomes- dead Regression analysis is done in developing diabetes increase or alive. The impact of one or 3 steps: as measured by HbA1C or more predictor variables on this 1. Analyzing the correlation blood sugar levels?”. Both the binary variable is assessed. The [strength and directionality of variables must be continuous predictor variables can be either the data] 2 (quantitative data ) and the line quantitative or qualitative. 2. Fitting the regression or least describing the relationship is a Unlike linear regression, this squares line, and straight line (linear). type of regression does not Multiple linear regression on require a linear relationship 3. Evaluating the validity and the other hand can be used between the predictor and usefulness of the model. when we have one continuous dependent variables. Step 1: This has been described in 1 dependent variable and two or For logistic regression to be the article on correlation analysis more independent variables, meaningful, the following Step 2: Fitting the regression line for example when we want to criteria must be met/satisfied Conventionally, in mathematics, answer the second question • The independent variables the equation of a line is given by mentioned above, “Given that must not be correlated the formula “y = mx+c”, where, a patient has both diabetes and amongst each other e.g. if m is the “slope” or gradient of hypertension, what is his risk the dependent variable is the line and “c” is where the line of developing coronary artery presence (or absence) of post- “cuts” the y-axis, also called as disease (CAD)”. Importantly, operative wound infection and the “intercept”, (Figure 1). In the independent variables could the independent variables are regression, this equation is given by 50 Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017

Regression Equa�on Table 2: Age and Blood pressure in

women (n=30)

Sr. No. Age (years) Systolic blood HbA1C (%) HbA1C (Y) = B0 + B1*age (X). pressure (mm Hg) 1 22 131 Y-axis B1 2 23 128

B0 3 24 116 4 27 106 5 28 114

6 29 123 X - axis Age (years) 7 30 117 8 32 122 9 35 99 Fig. 1: Regression analysis exploring the relationship between Age and HbA1C 10 35 121 A and regression line of age 11 40 147 12 41 139 versus systolic blood pressure y = 0.889x + 94.61 13 41 171 2 = 0.292 14 46 137 200 15 47 111 16 48 115 150 17 49 133 18 49 128 100 19 50 183 SBP 20 51 130 50 21 51 133 22 51 144 23 52 128 0 24 54 105 0 10 20 30 40 50 60 70 80 25 56 145 26 57 141 Age 27 58 153 Fig. 2: Scatter plot and regression Line 28 59 157 29 63 155 30 67 176 y=B0 +B1x which are the notations of the model commonly used by . Validation techniques can is affected by the choice of the Here, B0 represents the intercept, be broadly divided into two – independent variables and the 1 while B1 represents the slope. In our numerical and graphical. An presence of and hence example of relationship between easy numerical technique is to researchers need to be careful age and HbA1C levels, the equation look at the value of R2. You will while using R2 alone for validation. would be given by HbA1C levels recollect that the coefficient of Graphical analysis of residuals is a 1 (y) = B0 + B1*age (x) (Figure 2). correlation (r) is squared to obtain graphical technique for validation

The intercept B0 is that value of Y the Coefficient of Determination or that uses graphs to visually inspect or the dependent variable (HbA1C, R2. This is the value that is used for the data to assess its robustness. in this case), when the value of the predicting the extent of variability predictor variable is zero (age). in the dependent variable that can Utility of Regression In reality age can never be zero, be explained by the independent Analysis so this is the value of HbA1C that variable. In our example (Figure is present regardless of age. The 1), the R2 is 0.29 or 29% indicating The example given below highlights the utility of regression equation y = B0 + B1x classically that only 29% of the variability in describes . SBP can be explained by age. If the analysis. Table 2 has values of For Multiple linear regression, the value of R2 is high, then we could systolic blood pressure [SBP] for 30 women with age being presented equation would be y = B0 + B1X1 + assume that the model has a high in an ascending order. Age here B2X2 + B3X3+……………………+BkXk predictive value. The value of 29% depending upon the number of indicates that 71% of the variability would be the independent variables and SBP, the dependent variable. predictor variables [X1, X2 and still remains unaccounted for and so on] and k is the kth predictor thus model may not have a great We first make a scatter plot variable. predictive value. It is also useful and eye ball the data and then 2 Step 3: Evaluating the validity to remember that the value of R Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017 51 subsequently generate the predictor [X] variables [when there variables rather than temperature regression line and the regression are multiple predictor variables] were better at . They equation (Figure 1). and all regression coefficients [B1, concluded that this data could The regression equation here is B2…..Bk] are tested together. (b) be used for better allocation of y = 0.889x + 94.61 Testing a few X variables- here we personnel for the management of can choose a few select X variables OR ED services. [these are the ones the researcher Prediction SBP = 94.61 + 0.889 x age may think are maximally relevant] 7 The given data set if you will alone to be tested for their impact Brazer et al used logistic notice does not contain the age on the Y variable and finally (c) regression to predict risk factors for 60 or the corresponding SBP. Say A test for individual significance colorectal cancer in a community we as asking the question, what where a single X variable is tested practice where they studied 461 would be the SBP of a 60-year old for its impact on the Y variable. consecutive patients undergoing woman? We would “fit” 60 into the colonoscopy. Of these, 129 regression equation and calculate Applications of Regression had adenomatous polyps (pre- it as Analysis cancerous) and 34 had colorectal cancer. They randomly chose 292 SBP = 94.61 + 0.889 x 60 or 148 There are three major uses of patients and evaluated the impact mm Hg. regression analysis – attributing of several independent variables in It is important to note here [cause and effect a model that looked at prediction that multiple lines can be drawn relationship], forecasting and of occurrence of colorectal cancer. through the data set and hence prediction. These are explained Five variables were identified we need to define the criterion for with three examples. to be predictive- the patient’s drawing the line. The method that Causality age, sex, hematocrit, fecal occult is most commonly used is called the blood test result and indication Palmer KT and colleagues “ methods”. for colonoscopy. When this model assessed working aged people When we apply this equation was applied to the remainder of from the general population in to the population for making a the 169 patients, it was found to the United Kingdom to estimate prediction, we would really not be be a reliable indicator of risk of the risks of occupational exposure able to predict either the systolic colorectal neoplasia. to noise on self reported hearing blood pressure perfectly. Hence, difficulties and tinnitus using a we need to taken into account an Understanding what validated . The study “error” or “deviation” that is likely Confounders are and the showed that, in both sexes, after to occur when this equation is used. adjustment [see below] for age, the Concept of “Adjustment” Thus, the equation would read as risk of severe hearing difficulty and below In study by Palmer and persistent tinnitus rose with years colleagues presented earlier,6 noise spent in a noisy job indicative of a the relationship between being cause -effect relationship.5 y = B0 + B1x + e in a noisy job and the extent of Forecasting hearing difficulty and tinnitus linear component Efficient management of patient was assessed. Intuitively, we do Simply put, this could be flow in emergency departments know that noise is not the only interpreted as (EDs) is a very important issue for reason why hearing difficulty can Response = signal + noise hospital administrators. Marcilio occur. This may be related to stress, I and colleagues6 studied diverse gender, tiredness, older age and Or in medical research this models in an attempt to forecast many more factors. A confounder would be the daily number of patients is defined as that variable that is Outcome = prediction + deviation seeking emergency department “hidden” or “lurking” and was [ED] services in a general hospital Testing for Significance not accounted for or thought of in Sao Paolo, Brazil, using calendar initially and impacts the outcome Once, a regression equation is variables and ambient temperature being studied. Confounders [to generated, the next logical step reading as the independent the extent possible] need to be is to “test for significance” and variables. They found that the identified before the start of the generate a p value. There are three number of ED visits was study and addressed during different ways in which this can be 389 [166-613] with a seasonal analysis by a process called as done (a) Carrying out an overall distribution with the highest “adjustment”. This is a statistical test of significance where all the patient volumes seen on Mondays technique to eliminate the influence and lowest on weekends. Calendar of one or more confounders on 52 Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017 the treatment effect. Simply put, Conclusion 2. Deshpande S, Gogtay NJ, Thatte UM. Data types. J Assoc Phy Ind 2016; 64:64-65. it can be understood as a process In summary, regression analysis of “statistical correction” that is is a statistical tool that helps 3. Chamberlain JJ, Rhinehart AS, Shaefer CE, done once the data is gathered. Neuman A. Diagnosis and management evaluate relationships between of diabetes:synopsis of the 2016 American In regression analysis, once a a dependent variable and one or Diabetes Assoication Standards of Medical significant association between more independent or predictor Care in Diabetes. Ann Intern Med 2016; the independent variable/s and variables. More specifically, it helps 164:542-552. dependent variable has been us understand how the dependent 4. Hernandez-Vila E. A Review of the JNC found, it is important to see if variable changes with changes in 8 Blood Pressure Guideline. Texas Heart this significance still persists the independent variable and thus Institute Journal 2015; 42:226-228. after potential confounders have finds its application in forecasting 5. Palmer KT, Griffin MJ, Syddall HE, Davis A, Pannett B, Coggon D. Occupational been adjusted for. In the study by and predicting. The technique Palmer, age was identified as a exposure to noise and the attributable must however be used with clear burden of hearing difficulties in Great potential confounder a priori [since understanding of the assumptions Britain. Occup Environ Med 2002; 59:634-9. we know the hearing worsens with in each type of regression analysis, 6. Marcilio I, Hajat S, Gouveia N. Forecasting age] and adjusted for in the final their limitations and the potential daily emergency department visits analysis. Post adjustment, in both error that can occur when models using calendar variables and Ambient sexes, the relationship between are applied to a larger population. temperature readings. Academic Emergency years spent in a noisy job and Medicine 2013; 20:769–777. severe hearing difficulty continued References 7. Brazer SR, Pancotto FS, Long TT 3rd, Harrell to remain significant [relative to FE Jr, Lee KL, Tyor MP, Pryor DB. Using 1. Gogtay NJ, Deshpande S, Thatte UM. ordinal logistic regression to estimate the those in non-noisy jobs]. Principles of Correlation analysis. J Assoc likelihood of colorectal neoplasia. J Clin Phy Ind 2017; 65:78-81. Epidemiol 1991; 44:1263-70.