Simple Linear Regression
4 Slope-intercept equation for a line 1 Regression equation—an equation that describes the y mx b average relationship between a response (dependent) 2 1 and an explanatory (independent) variable. 0 1 y 8 y 3 (4,6) slope 1.5 x 2 6 (2,3) y 6 3 3 4
2 x 4 2 2
2 4 6 8 10 x 1 2
Deterministic Model 0 2 1 9 A model that defines an exact relationship F C 32 5 between variables. 6 9 t i
Example: y 1.5 x 2 e 7 h n e r h a
There is no allowance for error in the prediction of y 8 F for a given x. 4 4 2 0 -10 0 10 20 30 40 50 Celsius
3 4
1 Probabilistic Model 5 A model that accounts for random error. 1 y 1.5x
Includes both a deterministic component and a random 0 error component. 1
y 1. 5 x random error y
This model hypothesizes a probabilistic relationship 5 between y and x. 0 0 2 4 6 8 10 x 5 6
Probabilistic Model—General Form First-Order (Straight Line) Probabilistic Model y = Deterministic component + Random y 0 1x where y = Dependent variable component x = Independent variable
where y is the “variable of interest”. 0 = population y-intercept of the line—the point at which the line intersects or cuts through the y-axis
Assume that the mean value of the random error is 1 = population slope of the line—the amount of increase (or zero the mean value of y, E(y), equals the decrease) in the deterministic deterministic component of the model component of y for every 1-unit increase (or decrease) in x. = random error component
7 8
2 First-Order (Straight Line) Probabilistic Model Model Development and will generally be unknown. 0 and 1 are population parameters. They will only 0 1 be known if the population of all (x, y) measurements The process of developing a model, estimating model are available. parameters, and using the model can be summarized in these 5-steps: 1. Hypothesize the deterministic component of 0 and 1, along with a specific value of the independent variable x determine the mean value of the model that relates the mean, E(y) to the the dependent variable y. independent variable x
E y 0 1x 2. Use sample data to estimate unknown model parameters ˆ ˆ find estimates: 0 or b0 ,1 or b1 9 10
Model Development (continued) Example: Reaction time versus drug percentage 3. Specify the probability distribution of the random error term and estimate the SD of this distribution Amount of Drug Reaction Time Subject (%) (seconds) ~ N 0, will revisit this later x y 1 1 1 2 2 1 4. Statistically evaluate the usefulness of the model 3 3 2 5. Use model for prediction, estimation or other 4 4 2 purposes 5 5 4
11 12
3 Example: Reaction time versus drug percentage Example: Reaction time versus drug percentage 0 0 . . 4 4 2 2 . . 3 3 ) ) . . c c 3 3 e e . . s s 2 2 ( (
e e m m i i 5 5 . . T T
1 1 n n o o i i t t c c 7 7 . . a a 0 0 e e R R 2 2 . . 0 0 - - 0 0 . . 1 1 - - 0 1 2 3 4 5 0 1 2 3 4 5 Drug (%) Drug (%) 13 14
Example: Reaction time versus drug percentage Least Squares Line Errors of prediction---vertical differences between Also called regression line, or the least squares the observed and the predicted values of y prediction equation 2 x y y 1 x y y y y Method to find this line is called the method of least 1 1 0 (1-0) = 1 1 squares 2 1 1 (1-1) = 0 0 For our example, we have a sample of n = 5 pairs of 3 2 2 (2-2) = 0 0 (x, y) values. The fitted line that we will calculate is 4 2 3 (2-3) = -1 1 written as ˆy b0 b1 x 5 4 4 (4-4) = 0 0 Sum of Sum of squared ˆy is an estimator of the mean value of y, E y ; errors = 0 errors (SSE) = 2
b0 and b 1 are estimators of 0 and 1
15 16
4 Least Squares Line (continued) Formulas for the Least Squares Estimates SS SD Define the sum of squares of the deviations of the y xy y Slope: b1 or b1 r values about their predicted values for all n data points SSxx SDx as: 2 n n 2 2 sxy = SSxy xi x yi y sxx = SSxx xi x ˆ SSE yi y yi b0 b1 xi 2 xi yi x i1 i1 2 i xi yi x n i n We want to find b and b to make the SSE a 0 1 y x minimum---termed least squares estimates y-intercept: b y b x i b i 0 1 n 1 n
ˆy b0 b1 x is called the least squares line n = sample size
17 18
LS Calculations for Drug/Reaction Example LS Line for Drug/Reaction Example 0
2 . xi yi xi xi yi 4 7 ˆy .1 .7 x 1 1 1 1 b1 0. 7 2 .
10 3 2 1 4 2 ) . c 3 e .
3 2 9 6 10 15 s 2 ( b0 .7 5 5 e
4 2 16 8 m i 5 . T
2 .7 3 1 5 4 25 20 n o i t
2 c 2 2. 1 .1 7 . x 15 y 10 x 55 x y 37 a i i i i i 0 e R
2 2 .
15 10 15 0 - SS 37 37 30 7 SS 55 55 45 10 xy 5 xx 5 0 . 1 - 0 1 2 3 4 5 Drug (%) 19 20
5 LS Calculations for Drug/Reaction Example Least Squares Line—Interpretation of ˆy .1 .7 x 2 x y ˆy .1 .7 x y ˆy y ˆy Estimated intercept is negative 1 1 .6 (1-.6) = .4 .16 that the estimated mean reaction time is equal to 2 1 1.3 (1-1.3) = -.3 .09 -0.1 seconds when the amount of drug is 0%. 3 2 2.0 (2-2.0) = 0 .00 4 2 2.7 (2-2.7) = -.7 .49 What does this mean since negative reaction times are not possible? 5 4 3.4 (4-3.4) = .6 .36 Sum of Sum of squared errors = 0 errors (SSE) = 1.10 Model parameters should be interpreted only within the sampled range of the independent variable. The LS line has a sum of errors = 0, but SSE = 1.1 < 2.0 for visual model
21 22
Least Squares Line—Interpretation of ˆy .1 .7 x Coefficient of Determination The slope of 0.7 implies that for every unit increase of A measure of the contribution of x in predicting y x, the mean value of y is estimated to increase by 0.7 Assuming that x provides no information for the prediction of y, units. the best prediction for the value of y is y 0 . 4
2 y y . ˆ
In the context of the problem: 3 ) . c 3 e . s 2 For every 1% increase in the amount of drug in the (
e m i 5 . bloodstream, the mean reaction time is estimated to T
1 n o i increase by 0.7 seconds over the sampled range of drug t c 7 . a 0 e amounts from 1% to 5%. R 2 2 . SS y y 0 yy i
- 0 . 1 - 0 1 2 3 4 5 Drug (%) 23 24
6 Coefficient of Determination (continued) Coefficient of Determination (continued) 0
. 2 4 2 SS y y ˆ yy i --total sample variation around mean SSE yi yi 2 . i 3 )
. 2 c
3 --unexplained sample variability after e ˆ . SSE yi yi s
2 (
fitting e m i 5 . T
1
n SS SSE --explained sample variability attributable to
o yy i t
c linear relationship 7 . a 0 e R SSyy SSE explained 2 . proportion of total sample 0 - SS total yy variability explained by the 0 . 1 - linear relationship 0 1 2 3 4 5 Drug (%) 25 26
Coefficient of Determination (continued)
SS SSE SSE r 2 yy 1 Unexplained variability SSyy SSyy } In simple linear regression r 2 is computed as the square of the correlation coefficient, r
0 r 2 1
Interpretation— r 2 .75 means that the sum of squared deviations of the y values about their predicted values has been reduced by 75% by the use ˆy, instead of y, to predict y of the least squares equation. 27
7