Notes Chapter 7 (BVD)

Bivariate Data: Relationships between two (or more) variables.

The response variable measures an outcome of a study. The explanatory variable attempts to explain the observed outcomes. When we gather data, we may have in mind which variable is which, but there need not be an explanatory and a response variable if our data do not suggest “causation”.

Variables can be both categorical, one categorical and one quantitative, or both quantitative.

The most effective way to display a relationship between two quantitative variables is a scatterplot. It shows the relationship between two quantitative variables measured on the same individuals. Each individual in the data appears as the point in the plot fixed by both values (an ordered pair).

Always plot the explanatory variable (if there is one) on the horizontal axis (the x-axis) of a scatterplot.

To interpret a scatterplot, look first for an overall pattern. This should include:

1) direction (positive, negative)

2) form (linear, exponential, quadratic)

3) strength (correlation, r)

4) deviations from the pattern (outliers)

Remember, an outlier in any graph of data is an individual observation that falls outside the overall pattern of the graph.

Categorical variables can be added to scatterplots by changing the symbols in the plot.

Visual inspection is often not a good judge of how strong a linear relationship is. Changing the plotting scales or the amount of white space around a cloud of points can be deceptive. So we use a numerical measure, correlation (r), to supplement our graph.
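A small sketch of why a number helps here: changing the units on an axis changes how the plot looks, but it does not change r. The snippet computes r as the sum of the products of the standardized x and y values, divided by n − 1. The height and weight numbers are made up for illustration, and the helper names (`mean`, `sample_sd`, `correlation`) are our own.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1).
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    # Sum of products of standardized values, divided by n - 1.
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Made-up data: heights in inches, weights in pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 120, 140, 155, 160, 180]

r_original = correlation(heights_in, weights_lb)

# Rescale the x variable to centimeters: the plot would look different,
# but the correlation should stay the same.
heights_cm = [2.54 * h for h in heights_in]
r_rescaled = correlation(heights_cm, weights_lb)

print(round(r_original, 4), round(r_rescaled, 4))
```

The two printed values should agree, because standardizing removes the units before the products are formed.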

Correlation (r) measures the strength and direction of the linear relationship.

Formula: $r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

Facts about Correlation:

1) positive r – positive association; negative r – negative association

2) r falls between –1 and 1 inclusive.

3) r values close to –1 or 1 indicate that the points lie close to a straight line.

4) r values close to 0 indicate a weak linear relationship.

5) r values of –1 or 1 indicate a perfect linear relationship.

6) correlation only measures the strength in linear relationships (not curves).

7) correlation can be strongly affected by extreme values (outliers).

Notes Chapter 8

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. Regression (unlike correlation) requires that we have an explanatory and a response variable.

The least-squares regression line (LSRL) is a mathematical model for the data. The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Equation: $\hat{y} = b_0 + b_1 x$

With slope: $b_1 = r \frac{s_y}{s_x}$

And intercept: $b_0 = \bar{y} - b_1 \bar{x}$

*We write ŷ (“y hat”) in the equation of the regression line to emphasize that the line gives a predicted response ŷ for any x.
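The slope and intercept formulas can be tried out on a tiny data set. This is only a sketch: the five (x, y) pairs are made up, and the function names are our own. The slope is r times s_y / s_x, and the intercept is y-bar minus the slope times x-bar.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def lsrl(xs, ys):
    # Slope: b1 = r * (s_y / s_x); intercept: b0 = y-bar - b1 * x-bar.
    b1 = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = lsrl(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```

For these numbers the slope works out to 0.6 and the intercept to 2.2, so the line passes through the point (x-bar, y-bar) = (3, 4), as the intercept formula guarantees.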

The coefficient of determination, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

When we explain r², we say: ___% of the variability in ___(y) can be explained by the linear regression of ___(y) on ___(x).
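To see that squaring r really gives the "fraction explained", the sketch below computes it two ways on made-up data: once as r squared, and once as 1 minus (variation left over in the residuals) / (total variation in y). It uses the shortcut r = Sxy / sqrt(Sxx · Syy), which is algebraically the same as the standardized-value formula. The variable names are our own.

```python
from math import sqrt

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)

# Least-squares line.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Variation in y NOT explained by the line (sum of squared residuals).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - sse / syy

print(round(r ** 2, 4), round(explained_fraction, 4))
print(f"{100 * explained_fraction:.0f}% of the variability in y can be "
      "explained by the linear regression of y on x.")
```

Both printed fractions should match, which is what lets us read r² as a percentage of variability explained.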

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,

Residual = observed y – predicted y, or Residual = y – ŷ

The mean of the least-squares residuals is always zero.
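A short sketch of why this holds, using only the LSRL intercept formula (intercept = y-bar minus slope times x-bar), with n the number of observations:

```latex
\begin{aligned}
\sum_{i=1}^{n} (y_i - \hat{y}_i)
  &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) \\
  &= n\bar{y} - n b_0 - n b_1 \bar{x} \\
  &= n\left(\bar{y} - b_1 \bar{x} - b_0\right) = 0
  \qquad \text{because } b_0 = \bar{y} - b_1 \bar{x}.
\end{aligned}
```

Since the residuals sum to zero, their mean is zero as well.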

A residual plot plots the residuals on the vertical axis against the explanatory variable on the horizontal axis. Such a plot magnifies the residuals and makes patterns easier to see.

A residual plot shows a good linear fit when the points are randomly scattered about y = 0 with no obvious pattern.

To create a residual plot on the calculator:

1) You must have done a linear regression with the data you wish to use.

2) From the Stat-Plot, Plot # menu, choose scatterplot and leave the x-list with the x values.

3) Change the y-list to “RESID”, chosen from the list menu.

4) Zoom – 9
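For comparison, here is a rough sketch of what a bad residual pattern looks like in numbers rather than on the calculator. The data are made up to be deliberately curved (y = x²), and `fit_line` is our own helper. A straight line is fitted anyway, and the residuals are listed in order of x.

```python
def fit_line(xs, ys):
    # Least-squares slope and intercept.
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Deliberately curved, made-up data.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x}: residual = {e:+.1f}")
```

The residuals start positive, dip negative in the middle, and end positive again. That U shape is an obvious pattern, so a line is not a good model for these data even though the residuals still average to zero.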

Notes Chapter 9

In scatterplots we can have points that are outliers or influential points or both.

An outlier is an observation that lies outside the overall pattern of the other observations in a scatterplot. An observation can be an outlier in the x direction, the y direction, or in both directions.

An observation is influential if removing it would markedly change the position of the regression line. Points that are outliers in the x direction are often influential. Your book calls these leverage points.
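One way to check for influence is to fit the line twice, with and without the suspect point, and compare. The sketch below does this on made-up data in which the last point is an outlier in the x direction; `fit_line` is our own helper.

```python
def fit_line(xs, ys):
    # Least-squares slope and intercept.
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data: the first five points lie close to a line with slope
# near 2; the last point sits far to the right and well below it.
xs = [1, 2, 3, 4, 5, 15]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 5.0]

b0_all, b1_all = fit_line(xs, ys)
b0_trim, b1_trim = fit_line(xs[:-1], ys[:-1])

print(f"with the point:    y-hat = {b0_all:.2f} + {b1_all:.2f} x")
print(f"without the point: y-hat = {b0_trim:.2f} + {b1_trim:.2f} x")
```

Removing the single far-right point changes the slope from nearly flat to roughly 2, which is what "markedly change the position of the regression line" means in practice.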

Extrapolation is the use of a regression line (or curve) for prediction outside the range of values of the explanatory variable x that was used to obtain the line. Such predictions cannot be trusted.
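A quick sketch of how extrapolation goes wrong. The ages and heights below are made up (a child measured from age 2 to age 8), and `fit_line` is our own helper. The line fits well inside that age range, but using it far outside the range gives a silly answer.

```python
def fit_line(xs, ys):
    # Least-squares slope and intercept.
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data: age in years, height in inches.
ages = [2, 3, 4, 5, 6, 7, 8]
heights = [34, 37, 40, 43, 45, 48, 50]

b0, b1 = fit_line(ages, heights)

predicted_at_5 = b0 + b1 * 5     # inside the observed ages: reasonable
predicted_at_30 = b0 + b1 * 30   # far outside: extrapolation

print(f"age 5:  {predicted_at_5:.1f} inches")
print(f"age 30: {predicted_at_30:.1f} inches")
```

The age-30 prediction comes out to more than 100 inches (over 8 feet), because the line knows nothing about growth stopping after the ages we actually observed.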

A lurking variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied. Lurking variables can suggest a relationship when there isn’t one or can hide a relationship that exists. With observational data, as opposed to data from a well-designed experiment, there is no way to be sure that a lurking variable is not the cause of any apparent association.

Association vs. Causation

A strong association between two variables is NOT enough to draw conclusions about cause & effect. Strong association between two variables x and y can reflect:

A) Causation – Change in x causes change in y

B) Common response – Both x and y are responding to some other unobserved factor

C) Confounding – the effect on y of the explanatory variable x is hopelessly mixed up with the effects on y of other variables.

Chart to help you understand this concept (to be written in class):

Data with no apparent linear relationship can also be examined in two ways to see if a relationship still exists:

1) Check to see if breaking the data down into subsets or groups makes a difference.

2) If the data is curved in some way and not linear, a relationship still exists. We will explore that in the next chapter.
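The first check can be sketched with a deliberately extreme, made-up example: two groups measured on the same x values, one with y rising and one with y falling. The group labels and the `correlation` helper are our own; r is computed with the shortcut Sxy / sqrt(Sxx · Syy), which is equivalent to the standardized-value formula.

```python
from math import sqrt

def correlation(xs, ys):
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: (x, y) pairs for two hypothetical groups.
group_a = [(1, 5), (2, 6), (3, 7), (4, 8)]   # y rises with x
group_b = [(1, 8), (2, 7), (3, 6), (4, 5)]   # y falls with x

def r_for(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return correlation(xs, ys)

r_all = r_for(group_a + group_b)
r_a = r_for(group_a)
r_b = r_for(group_b)

print(f"all data: r = {r_all:.2f}")
print(f"group A:  r = {r_a:.2f}")
print(f"group B:  r = {r_b:.2f}")
```

Taken together the points show no linear relationship at all, yet each group on its own is perfectly linear. Real data are rarely this clean, but the same idea applies: plot the groups with different symbols before deciding that no relationship exists.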

On the following page you will find some data sets we may be using in class.

Wine Consumption and Heart Disease

Alcohol from wine in liters of alcohol per person and heart disease death rate per 100,000 (data from different countries)

Alcohol      2.5   3.9   2.9   2.4   2.9   0.8   9.1   0.8   0.7
Death Rate   211   167   131   191   220   297    71   211   300

Alcohol      7.9   1.8   1.9   0.8   6.5   1.6   5.8   1.3   1.2
Death Rate   107   167   266   227    86   207   115   285   199
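As a way to check calculator work on this data set, here is a minimal Python sketch that computes r, the LSRL, and r² for the 18 countries listed above, treating alcohol as x and death rate as y. The variable names are our own, and r uses the shortcut Sxy / sqrt(Sxx · Syy).

```python
from math import sqrt

# Wine data from the table above (18 countries).
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7,
           7.9, 1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2]
death_rate = [211, 167, 131, 191, 220, 297, 71, 211, 300,
              107, 167, 266, 227, 86, 207, 115, 285, 199]

n = len(alcohol)
x_bar = sum(alcohol) / n
y_bar = sum(death_rate) / n

sxx = sum((x - x_bar) ** 2 for x in alcohol)
syy = sum((y - y_bar) ** 2 for y in death_rate)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(alcohol, death_rate))

r = sxy / sqrt(sxx * syy)
b1 = sxy / sxx            # slope
b0 = y_bar - b1 * x_bar   # intercept

print(f"r = {r:.3f}")
print(f"death rate-hat = {b0:.1f} + ({b1:.2f}) * alcohol")
print(f"{100 * r ** 2:.0f}% of the variability in death rate can be "
      "explained by the linear regression of death rate on alcohol.")
```

The association should come out negative and fairly strong. Keep in mind that these are observational data from different countries, so lurking variables cannot be ruled out and the correlation alone does not show that wine prevents heart disease.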

Flights out of Baltimore, MD

Destination         Distance (miles)   Airfare (US $)
Atlanta             576                178
Boston              370                138
Chicago             612                94
Dallas/Fort Worth   1216               278
Detroit             409                158
Denver              1502               258
Miami               946                198
New Orleans         998                188
New York            189                98
Orlando             787                179
Pittsburgh          210                138
St. Louis           737                98
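The same ideas can be tried on the flight data. This sketch (our own variable and helper names) fits the LSRL with distance as x and airfare as y, then lists each destination's residual so we can see which fares sit farthest above or below the line.

```python
def fit_line(xs, ys):
    # Least-squares slope and intercept.
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# (destination, distance in miles, airfare in US $) from the table above.
flights = [
    ("Atlanta", 576, 178), ("Boston", 370, 138), ("Chicago", 612, 94),
    ("Dallas/Fort Worth", 1216, 278), ("Detroit", 409, 158),
    ("Denver", 1502, 258), ("Miami", 946, 198), ("New Orleans", 998, 188),
    ("New York", 189, 98), ("Orlando", 787, 179), ("Pittsburgh", 210, 138),
    ("St. Louis", 737, 98),
]

distances = [dist for _, dist, _ in flights]
fares = [fare for _, _, fare in flights]
b0, b1 = fit_line(distances, fares)

# Residual = observed fare - predicted fare.
residuals = {city: fare - (b0 + b1 * dist) for city, dist, fare in flights}

print(f"airfare-hat = {b0:.2f} + {b1:.4f} * distance")
for city, e in sorted(residuals.items(), key=lambda item: item[1]):
    print(f"{city:18s} {e:+7.1f}")

lowest = min(residuals, key=residuals.get)
highest = max(residuals, key=residuals.get)
```

A large negative residual means the fare is cheaper than the line predicts for that distance, and a large positive residual means it is more expensive. Plotting these residuals against distance gives the residual plot described in Chapter 8.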