10.3 Modelling Linear Rela Onships with Randomness Present

Home , Simple linear regression

This is “Modelling Linear Rela�onships with Randomness Present”, sec�on 10.3 from the book Beginning Sta�s�cs (v. 1.0). For details on it (including licensing), click here.

For more informa�on on the source of this book, or why it is available for free, please see the project's home page. You can browse or download addi�onal books there. Has this book helped you? Consider passing it on:

Help Crea�ve Commons Help a Public School Crea�ve Commons supports free culture DonorsChoose.org helps people like you help from music to educa�on. Their licenses teachers fund their classroom projects, from helped make this book available to you. art supplies to books to calculators.

Previous Sec�on Table of Contents Next Sec�on

10.3 Modelling Linear Rela�onships with Randomness Present

LEARNING OBJECTIVE

1. To learn the framework in which the sta�s�cal analysis of the linear rela�onship between two variables x and y will be done.

In this chapter we are dealing with a population for which we can associate to each element two measurements, x and y. We are interested in situations in which the value of x can be used to draw conclusions about the value of y, such as predicting the resale value y of a residential house based on its size x. Since the relationship between x and y is not deterministic, statistical procedures must be applied. For any statistical procedures, given in this book or elsewhere, the associated formulas are valid only under speciﬁc assumptions. The set of assumptions in simple linear regression are a mathematical description of the relationship between x and y. Such a set of assumptions is known as a model.

For each ﬁxed value of x a sub-population of the full population is determined, such as the collection of all houses with 2,100 square feet of living space. For each element of that sub-population there is a measurement y, such as the value of any 2,100-square-foot house. Let 퐸(푦) denote the mean of all the y-values for each particular value of x. 퐸(푦) can change from x-value to x-value, such as the mean value of all 2,100-square-foot houses, the (diﬀerent) mean value for all 2,500-square foot-houses, and so on.

Our ﬁrst assumption is that the relationship between x and the mean of the y-values in the sub-population determined by x is linear. This means that there exist numbers 훽1 and 훽0 such that

퐸(푦) = 훽1푥 + 훽0

This linear relationship is the reason for the word “linear” in “simple linear regression” below. (The word “simple” means that y depends on only one other variable and not two or more.)

Our next assumption is that for each value of x the y-values scatter about the mean 퐸(푦) according to a normal distribution centered at 퐸(푦) and with a standard deviation σ that is the same for every value of x. This is the same as saying that there exists a normally distributed random variable ε with mean 0 and standard deviation σ so that the relationship between x and y in the whole population is

푦 = 훽1푥 + 훽0 + 휀

Our last assumption is that the random deviations associated with diﬀerent observations are independent. In summary, the model is:

Simple Linear Regression Model

For each point (푥, 푦) in data set the y-value is an independent observation of

푦 = 훽1푥 + 훽0 + 휀

where 훽1 and 훽0 are ﬁxed parameters and ε is a normally distributed random variable with mean 0 and an unknown standard deviation σ.

The line with equation 푦 = 훽1푥 + 훽0 is called the population regression line.

Figure 10.5 "The Simple Linear Model Concept" illustrates the model. The 2 2 symbols 푁�휇, 휎 � denote a normal distribution with mean μ and variance 휎 , hence standard deviation σ.

Figure 10.5 The Simple Linear Model Concept

It is conceptually important to view the model as a sum of two parts: 푦 = 훽1푥 + 훽0 + 휀

1. Deterministic Part. The ﬁrst part 훽1푥 + 훽0 is the equation that describes the trend in y as x increases. The line that we seem to see when we look at the

scatter diagram is an approximation of the line 푦 = 훽1푥 + 훽0 . There is nothing random in this part, and therefore it is called the deterministic part of the model. 2. Random Part. The second part ε is a random variable, often called the error term or the noise. This part explains why the actual observed values of y are not exactly on but ﬂuctuate near a line. Information about this term is important since only when one knows how much noise there is in the data can one know how trustworthy the detected trend is.

There are three parameters in this model: 훽0, 훽1, and σ. Each has an important interpretation, particularly 훽1 and σ. The slope parameter 훽1 represents the expected change in y brought about by a unit increase in x. The standard deviation σ represents the magnitude of the noise in the data.

There are procedures for checking the validity of the three assumptions, but for us it will be suﬃcient to visually verify the linear trend in the data. If the data set is large then the points in the scatter diagram will form a band about an apparent straight line. The normality of ε with a constant standard deviation corresponds graphically to the band being of roughly constant width, and with most points concentrated near the middle of the band.

Fortunately, the three assumptions do not need to hold exactly in order for the procedures and analysis developed in this chapter to be useful. KEY TAKEAWAY

Sta�s�cal procedures are valid only when certain assump�ons are valid. The assump�ons underlying the analyses done in this chapter are graphically summarized in Figure 10.5 "The Simple Linear Model Concept".

EXERCISES

1. State the three assump�ons that are the basis for the Simple Linear Regression Model.

2. The Simple Linear Regression Model is summarized by the equa�on

푦 = 훽1푥 + 훽0 + 휀 Iden�fy the determinis�c part and the random part.

3. Is the number 훽1 in the equa�on 푦 = 훽1푥 + 훽0 a sta�s�c or a popula�on parameter? Explain.

4. Is the number σ in the Simple Linear Regression Model a sta�s�c or a popula�on parameter? Explain.

5. Describe what to look for in a sca�er diagram in order to check that the assump�ons of the Simple Linear Regression Model are true.

6. True or false: the assump�ons of the Simple Linear Regression Model must hold exactly in order for the procedures and analysis developed in this chapter to be useful.

ANSWERS

1. a. The mean of y is linearly related to x.

b. For each given x, y is a normal random variable with mean 훽1푥 + 훽0 and standard devia�on σ. c. All the observa�ons of y in the sample are independent.

3. 훽1 is a popula�on parameter. 5. A linear trend.

Previous Sec�on Table of Contents Next Sec�on