Testing Hypotheses for Differences Between Linear Regression Lines
United States Department of Agriculture, Forest Service, Southern Research Station
e-Research Note SRS–17, May 2009

Stanley J. Zarnoch

Abstract

Five hypotheses are identified for testing differences between simple linear regression lines. The distinctions between these hypotheses are based on a priori assumptions and illustrated with full and reduced models. The contrast approach is presented as an easy and complete method for testing for overall differences between the regressions and for making pairwise comparisons. Use of the Bonferroni adjustment to ensure a desired experimentwise type I error rate is emphasized. SAS software is used to illustrate application of these concepts to an artificial simulated dataset. The SAS code is provided for each of the five hypotheses and for the contrasts for the general test and all possible specific tests.

Keywords: Contrasts, dummy variables, F-test, intercept, SAS, slope, test of conditional error.

Introduction

The general linear model (GLM) is often used to test for differences between means, as for example when the analysis of variance is used in data analysis for designed studies. However, the GLM has a much wider application and can also be used to compare regression lines. The GLM allows for the use of both qualitative and quantitative predictor variables. Quantitative variables are continuous values such as diameter at breast height, tree height, age, or temperature. However, many predictor variables in the natural resources field are qualitative. Examples of qualitative variables include tree class (dominant, codominant); species (loblolly, slash, longleaf); and cover type (clearcut, shelterwood, seed tree).

Fraedrich and others (1994) compared linear regressions of slash pine cone specific gravity and moisture content on collection date between families of slash pine grown in a seed orchard.
Multiple comparison tests were applied to the slopes to examine specific differences between the families. Murthy and others (1997) compared the intercepts and slopes of light-saturated net photosynthesis response functions over time for loblolly pine grown under various combinations of moisture, nutrient, and CO2 treatments.

Researchers who compare regression lines and the corresponding model parameters are often not fully aware of the specific hypothesis being tested. Similarly, problems may arise in the formulation of the correct model or hypothesis test in the statistical software package being used by the researcher. Most statistical references, such as Draper and Smith (1981) and Milliken and Johnson (1984), formulate the problem as a test of conditional error based on full and reduced models and specify an F-test upon which rejection of the hypothesis is based. In this approach, the populations for the regressions are considered classes of an independent variable, dummy variables are then defined, and a regression is developed with the dummy variables and their interactions with the independent variable.

Author: Mathematical Statistician, U.S. Forest Service, Southern Research Station, Asheville, NC 28804.

The exact form of the regression model used in testing depends on the specific hypothesis under consideration. There are five alternative hypotheses to be considered in this kind of work, each based on a specific formulation of the regression model (table 1). The researcher must be careful to select the hypothesis that is pertinent to the question under study. The objectives of this paper are to: (1) present the underlying statistical methodology for developing the full and reduced models for testing simple linear regressions,
(2) define the five hypotheses for testing differences between regression lines, (3) present three alternative methods for performing the analysis while controlling experimentwise type I error by means of the Bonferroni approach, and (4) provide SAS code that shows in detail how these five hypotheses are formulated.

Full and Reduced Models

The methodology for testing hypotheses for differences between simple linear regressions will be illustrated assuming r = 3 regression lines are to be compared, with extensions of r to higher or lower dimensions relatively straightforward. Suppose a researcher is interested in comparing regression lines between three treatments (r = 3). The general model can be defined as

  yi = α0 + α1 xi + β1 Z1i + β2 Z2i + γ1 Z1i xi + γ2 Z2i xi + εi   (1)

where

  yi = the ith observation for the dependent variable y, i = 1, 2, 3, ..., n,
  xi = the ith observation for the independent variable x, i = 1, 2, 3, ..., n,
  n = number of (x, y) paired observations,
  Z1i = 1 if yi and xi are from treatment 1, 0 if yi and xi are not from treatment 1,
  Z2i = 1 if yi and xi are from treatment 2, 0 if yi and xi are not from treatment 2, and
  εi = the residual error for observation i (assumed normal with homogeneous variance for all regressions, where cov(εi, εj) = 0 for i ≠ j).

Table 1—The five hypotheses used for testing differences between regression lines (all hypotheses assume the usual regression assumptions of normality, independence, and homogeneity of variances)

1. All slopes are equal
   Assumptions: all intercepts are equal but unknown
   H0: γ1 = γ2 = 0;  H1: at least one of γ1, γ2 ≠ 0
   Full model: yi = α0 + α1 xi + γ1 Z1i xi + γ2 Z2i xi + εi
   Reduced model: yi = α0 + α1 xi + εi

2. All slopes are equal
   Assumptions: all intercepts are not necessarily equal but unknown
   H0: γ1 = γ2 = 0;  H1: at least one of γ1, γ2 ≠ 0
   Full model: yi = α0 + α1 xi + β1 Z1i + β2 Z2i + γ1 Z1i xi + γ2 Z2i xi + εi
   Reduced model: yi = α0 + α1 xi + β1 Z1i + β2 Z2i + εi

3. All intercepts are equal
   Assumptions: all slopes are equal but unknown
   H0: β1 = β2 = 0;  H1: at least one of β1, β2 ≠ 0
   Full model: yi = α0 + α1 xi + β1 Z1i + β2 Z2i + εi
   Reduced model: yi = α0 + α1 xi + εi

4. All intercepts are equal
   Assumptions: all slopes are not necessarily equal but unknown
   H0: β1 = β2 = 0;  H1: at least one of β1, β2 ≠ 0
   Full model: yi = α0 + α1 xi + β1 Z1i + β2 Z2i + γ1 Z1i xi + γ2 Z2i xi + εi
   Reduced model: yi = α0 + α1 xi + γ1 Z1i xi + γ2 Z2i xi + εi

5. All intercepts are equal and all slopes are equal
   Assumptions: none
   H0: β1 = β2 = γ1 = γ2 = 0;  H1: at least one of β1, β2, γ1, γ2 ≠ 0
   Full model: yi = α0 + α1 xi + β1 Z1i + β2 Z2i + γ1 Z1i xi + γ2 Z2i xi + εi
   Reduced model: yi = α0 + α1 xi + εi

The regression parameters to be estimated from the data are α0, α1, β1, β2, γ1, and γ2. The dummy variables are Z1i and Z2i. Equation 1 defines all three treatment regression lines in terms of the third treatment regression line, which serves as a baseline for the other treatment regressions.

The parameters α0 and α1 are known as the intercept and slope parameters, respectively, and correspond to the third treatment regression. The parameters β1 and β2 correspond to deviations from the intercept (α0) for treatments 1 and 2, respectively. Similarly, the parameters γ1 and γ2 correspond to deviations from the slope (α1) for treatments 1 and 2, respectively. Thus, for treatment = 3, Z1i = 0 and Z2i = 0, and the regression equation is given as

  yi = α0 + α1 xi + εi   (2)

For treatment = 1, Z1i = 1 and Z2i = 0, and the regression equation is given as

  yi = α0 + α1 xi + β1 + γ1 xi + εi = (α0 + β1) + (α1 + γ1) xi + εi   (3)

where (α0 + β1) and (α1 + γ1) correspond to the intercept and slope for treatment = 1. Thus, β1 and γ1 are the deviations (increase or decrease) in the intercept and slope, respectively, for treatment = 1 as compared to treatment = 3. Similarly, treatment = 2 is defined where Z1i = 0 and Z2i = 1, yielding

  yi = α0 + α1 xi + β2 + γ2 xi + εi = (α0 + β2) + (α1 + γ2) xi + εi   (4)

where (α0 + β2) and (α1 + γ2) correspond to the intercept and slope for treatment = 2.

Hypotheses

The exact methodology for comparing regression lines depends on the specific hypothesis that the researcher is testing. The hypotheses under consideration in this paper can be developed from equation 1 by comparing full and reduced models. The hypotheses and the a priori assumptions about the intercepts or slopes, or both, define the full model. All appropriate parameters and associated dummy variables must be in the full model. Depending on the assumptions, some parameters of model (1) may not be in the full model. In contrast, the reduced model is formed by allowing the parameters in question to take the values specified in the null hypothesis H0. Table 1 lists the five hypotheses to be tested, the corresponding a priori assumptions, and the full and reduced models used to test the hypotheses under consideration. A more thorough review of the hypotheses to be tested is now presented.

Hypothesis 1: all slopes are equal

Assumptions: all intercepts are equal and all are unknown (implies a single common intercept (α0) and, thus, β1 = β2 = 0)

Full model: yi = α0 + α1 xi + γ1 Z1i xi + γ2 Z2i xi + εi
Reduced model: yi = α0 + α1 xi + εi

Statistical hypothesis: H0: γ1 = γ2 = 0; H1: at least one of γ1, γ2 ≠ 0

Example: To investigate the effect of three fertilizers (qualitative variable) on the total biomass growth (yi) of young loblolly pine, 1-year-old, uniform, nursery-grown seedlings were randomly assigned to the three fertilizers. Each month (xi), for a year, a tree was removed from each
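The dummy-variable coding of model (1) maps directly onto a least-squares design matrix. The sketch below is an illustration in Python/NumPy rather than the paper's SAS code, and every numeric value in it is simulated purely for the example: it builds Z1 and Z2, fits model (1) by ordinary least squares, and recovers each treatment's intercept and slope from the baseline parameters as in equations (2) and (3).

```python
import numpy as np

rng = np.random.default_rng(42)

# Artificial data (illustrative values only): r = 3 treatments,
# n_per observations each, measured over the same x values.
n_per = 20
x = np.tile(np.linspace(1.0, 12.0, n_per), 3)
trt = np.repeat([1, 2, 3], n_per)

# True lines used to simulate y; treatment 3 is the baseline.
intercepts = {1: 2.0, 2: 3.5, 3: 1.0}
slopes = {1: 0.8, 2: 0.8, 3: 1.5}
y = (np.array([intercepts[t] for t in trt])
     + np.array([slopes[t] for t in trt]) * x
     + rng.normal(scale=0.3, size=x.size))

# Dummy variables exactly as defined for model (1).
z1 = (trt == 1).astype(float)
z2 = (trt == 2).astype(float)

# Design matrix for model (1):
# y = a0 + a1*x + b1*Z1 + b2*Z2 + g1*Z1*x + g2*Z2*x + e
X = np.column_stack([np.ones_like(x), x, z1, z2, z1 * x, z2 * x])
a0, a1, b1, b2, g1, g2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Equation (2): treatment 3 line is (a0, a1).
# Equation (3): treatment 1 line is (a0 + b1, a1 + g1).
print("treatment 3 line:", round(a0, 2), round(a1, 2))
print("treatment 1 line:", round(a0 + b1, 2), round(a1 + g1, 2))
```

Because treatment 3 serves as the baseline, b1 and g1 estimate how far treatment 1's intercept and slope deviate from it, which is precisely what the hypotheses in table 1 examine.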
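For any row of table 1, the test of conditional error comes down to one computation: fit the full and reduced models and compare F = [(SSE_reduced − SSE_full)/q] / [SSE_full/(n − p)] against an F critical value with q and n − p degrees of freedom, where q is the number of parameters set to zero under H0 and p is the number of parameters in the full model. The following is a minimal NumPy sketch for Hypothesis 1 (common intercept assumed, equality of slopes tested), again on simulated data rather than the paper's dataset:

```python
import numpy as np

def sse(X, y):
    """Error (residual) sum of squares from an ordinary least-squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(1)
n_per = 15
x = np.tile(np.linspace(0.0, 10.0, n_per), 3)
trt = np.repeat([1, 2, 3], n_per)
z1 = (trt == 1).astype(float)
z2 = (trt == 2).astype(float)

# Simulated data: one common intercept (2.0) but unequal slopes,
# so the H0 of Hypothesis 1 is false by construction.
true_slope = np.where(trt == 1, 0.5, np.where(trt == 2, 1.0, 1.5))
y = 2.0 + true_slope * x + rng.normal(scale=0.4, size=x.size)

# Hypothesis 1 models from table 1:
# full:    y = a0 + a1*x + g1*Z1*x + g2*Z2*x + e
# reduced: y = a0 + a1*x + e          (g1 = g2 = 0 under H0)
X_full = np.column_stack([np.ones_like(x), x, z1 * x, z2 * x])
X_red = np.column_stack([np.ones_like(x), x])

sse_full, sse_red = sse(X_full, y), sse(X_red, y)
df_full = x.size - X_full.shape[1]      # n - p = 45 - 4 = 41
q = X_full.shape[1] - X_red.shape[1]    # 2 parameters (g1, g2) under test
F = ((sse_red - sse_full) / q) / (sse_full / df_full)
print(f"F({q}, {df_full}) = {F:.1f}")
```

A large F leads to rejecting H0: γ1 = γ2 = 0, i.e., concluding that at least one slope differs. The p-value would come from the F(q, n − p) distribution (for example, scipy.stats.f.sf); it is omitted here to keep the sketch dependency-free.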