Correlation Coefficient

April 2, 2002

Exercises: N&S Ch 8: #3, 9, 10; Rice Ch 14: #1, 2, 33, 35 (use R for 2, 33, and 35)

Key terms: correlation coefficient, regression, residual, normal quantile plot, response variable, explanatory variable, measurement error, model misfit, replicate measurements

Approximate linear relationships

Linear regression explores the possibility of a linear relationship $y \approx bx + a$ between two measured quantities. The data set consists of a collection of measured pairs

$(x_1, y_1), \dots, (x_n, y_n)$. To explore the relationship between the variables $x$ and $y$, first plot the data.

If the plot suggests a linear relationship, estimate the slope b and the intercept a .

Minimize

$$H(a,b) = \sum_{i=1}^{n} \bigl(y_i - (b x_i + a)\bigr)^2 .$$

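Differentiating gives

$$\frac{\partial}{\partial a} H(a,b) = -\sum_{i=1}^{n} 2\bigl(y_i - (b x_i + a)\bigr) = -2n\,(\bar{y} - b\bar{x} - a)$$

$$\frac{\partial}{\partial b} H(a,b) = -\sum_{i=1}^{n} 2\bigl(y_i - (b x_i + a)\bigr)\,x_i = -2n\,(\overline{xy} - b\,\overline{x^2} - a\bar{x})$$

where an overbar denotes a sample average, for example $\overline{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i$. Setting both derivatives to zero gives

$$\hat{a} = \bar{y} - \hat{b}\,\bar{x}, \qquad \hat{b} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} .$$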
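The closed-form estimates can be checked numerically. The following is a minimal sketch in plain Python; the function names `mean` and `fit_line` and the four data points are made up for illustration and are not part of the course material.

```python
def mean(values):
    """Sample average of a list of numbers."""
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Return (a_hat, b_hat) minimizing sum((y - (b*x + a))**2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    xy_bar = mean([x * y for x, y in zip(xs, ys)])
    x2_bar = mean([x * x for x in xs])
    b_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Made-up points lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a_hat, b_hat = fit_line(xs, ys)
print(a_hat, b_hat)  # intercept and slope of the fitted line
```

Because the points lie exactly on a line, the fit should recover the intercept 1 and the slope 2 up to rounding.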

Residuals

The fitted value is the value of $y$ predicted by the estimated linear relationship, that is, $\hat{y}_i = \hat{a} + \hat{b} x_i$. The residual is the difference between the observed value and the fitted value:

$$\hat{e}_i = y_i - (\hat{a} + \hat{b} x_i) = y_i - \hat{y}_i .$$
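As a small illustration, the sketch below computes fitted values and residuals for a made-up data set. It also checks two consequences of setting the derivatives of the least squares criterion to zero: the residuals sum to zero and are orthogonal to the $x_i$. The helper `fit_line` and the data are assumptions of this example.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n
    x2_bar = sum(x * x for x in xs) / n
    b_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    return y_bar - b_hat * x_bar, b_hat


# Made-up, roughly linear data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a_hat, b_hat = fit_line(xs, ys)
fitted = [a_hat + b_hat * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Both sums vanish up to floating-point rounding.
sum_res = sum(residuals)
sum_res_x = sum(e * x for e, x in zip(residuals, xs))
print(residuals)
print(sum_res, sum_res_x)
```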

Correlation

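The correlation coefficient is defined by

$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{SD(x)\,SD(y)} = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i - \bar{x}}{SD(x)} \cdot \frac{y_i - \bar{y}}{SD(y)}$$

where

$$SD(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} .$$

Note that $SD(x)$ is not the sample standard deviation for $x_1, \dots, x_n$, which divides by $n-1$ rather than $n$. Note that $r$ would be the slope of the regression line if we replaced $x_i$ and $y_i$ by the standardized variables

$$x_i' = \frac{x_i - \bar{x}}{SD(x)} \quad \text{and} \quad y_i' = \frac{y_i - \bar{y}}{SD(y)} .$$

Note that $SD(x') = SD(y') = 1$ and $\overline{x'} = \overline{y'} = 0$.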
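The claim that $r$ is the regression slope for the standardized variables can be verified directly. This is a minimal sketch with made-up data; the function names are hypothetical, and `sd` divides by $n$ to match the definition of $SD(x)$ used here.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation with divisor n (not n - 1)."""
    m = mean(values)
    return math.sqrt(mean([(v - m) ** 2 for v in values]))


def standardize(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Average product of the standardized variables."""
    return mean([zx * zy for zx, zy in zip(standardize(xs), standardize(ys))])


def slope(xs, ys):
    """Least squares slope b_hat."""
    xy_bar = mean([x * y for x, y in zip(xs, ys)])
    x2_bar = mean([x * x for x in xs])
    return (xy_bar - mean(xs) * mean(ys)) / (x2_bar - mean(xs) ** 2)


# Made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(xs, ys)
slope_standardized = slope(standardize(xs), standardize(ys))
print(r, slope_standardized)  # the two values agree up to rounding
```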

Measures of variation

The “total sum of squares” $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is a measure of how much variability there is in the $y_i$. It decomposes as

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .$$

The first term on the right measures the “explained variation” while the second term measures the “unexplained variation”. Regression programs frequently calculate

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

as a measure (expressed as a percentage) of the amount of variation in the data accounted for by the model. ($r^2$ is the square of the correlation coefficient and is sometimes referred to as the coefficient of determination.)
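The decomposition and the identity between the explained fraction and the squared correlation can be checked on a small example. The sketch below uses made-up data and hypothetical helper names.

```python
import math


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Least squares intercept and slope for y ~ a + b*x."""
    xy_bar = mean([x * y for x, y in zip(xs, ys)])
    x2_bar = mean([x * x for x in xs])
    b_hat = (xy_bar - mean(xs) * mean(ys)) / (x2_bar - mean(xs) ** 2)
    return mean(ys) - b_hat * mean(xs), b_hat


# Made-up data with visible scatter around a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.5, 3.2, 2.8, 5.1, 4.9, 7.0]

a_hat, b_hat = fit_line(xs, ys)
y_bar = mean(ys)
fitted = [a_hat + b_hat * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)
explained = sum((f - y_bar) ** 2 for f in fitted)
unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))

# Correlation coefficient computed from its definition (divisor n).
sd_x = math.sqrt(mean([(x - mean(xs)) ** 2 for x in xs]))
sd_y = math.sqrt(mean([(y - y_bar) ** 2 for y in ys]))
r = (mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * y_bar) / (sd_x * sd_y)

r_squared = explained / total
print(total, explained + unexplained)
print(r_squared, r ** 2)
```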

Standard statistical model

Suppose

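$$y_i = a + b x_i + \varepsilon_i$$

where the $\varepsilon_i$ are independent with mean zero and variance $\sigma^2$. Then $E[\hat{a}] = a$ and $E[\hat{b}] = b$, that is, $\hat{a}$ and $\hat{b}$ are unbiased estimators of $a$ and $b$. In addition

$$\mathrm{Var}(\hat{a}) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

and

$$\mathrm{Var}(\hat{b}) = \frac{n \sigma^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} .$$

It also follows that $\hat{a}$ and $\hat{b}$ are approximately normal.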
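A small simulation gives a rough check of the unbiasedness and variance claims. This is only a sketch: the true values of $a$, $b$, and $\sigma$, the design points, the seed, and the number of repetitions are all arbitrary choices made for this example, and the errors are drawn as Gaussian for convenience.

```python
import random


def fit_line(xs, ys):
    """Least squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return (sum_y - b_hat * sum_x) / n, b_hat


random.seed(0)
a, b, sigma = 1.0, 2.0, 1.0           # hypothetical true parameters
xs = [float(i) for i in range(10)]    # hypothetical fixed design points
n = len(xs)

# Theoretical variances from the formulas in the text.
denom = n * sum(x * x for x in xs) - sum(xs) ** 2
var_a_theory = sigma ** 2 * sum(x * x for x in xs) / denom
var_b_theory = n * sigma ** 2 / denom

a_hats, b_hats = [], []
for _ in range(5000):
    ys = [a + b * x + random.gauss(0.0, sigma) for x in xs]
    a_hat, b_hat = fit_line(xs, ys)
    a_hats.append(a_hat)
    b_hats.append(b_hat)

mean_a = sum(a_hats) / len(a_hats)
mean_b = sum(b_hats) / len(b_hats)
var_a_emp = sum((v - mean_a) ** 2 for v in a_hats) / len(a_hats)
var_b_emp = sum((v - mean_b) ** 2 for v in b_hats) / len(b_hats)

print(mean_a, mean_b)             # should be close to a and b
print(var_a_emp, var_a_theory)    # empirical vs. theoretical variance
print(var_b_emp, var_b_theory)
```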

Normal model

Assuming that the $\varepsilon_i$ are normally distributed and independent, the $y_i$ are independent normal random variables. What are the maximum likelihood estimators for $a$ and $b$?

Let $s_{\hat{a}}$ be the standard deviation of $\hat{a}$ with $\sigma^2$ replaced by

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2},$$

and similarly for $s_{\hat{b}}$. Then

$$T_a = \frac{\hat{a} - a}{s_{\hat{a}}} \quad \text{and} \quad T_b = \frac{\hat{b} - b}{s_{\hat{b}}}$$

have t-distributions with $n - 2$ degrees of freedom. (If the $\varepsilon_i$ are iid but not normal, then $T_a$ and $T_b$ should still be approximately normal for large enough $n$.) Consequently, $T_a$ and $T_b$ are useful test statistics for testing hypotheses about the coefficients $a$ and $b$.
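The two test statistics can be computed with a few lines of plain Python. In this sketch the function name `t_statistics`, the hypothesized values `a0` and `b0`, and the data are assumptions made for illustration; comparing the result with a t quantile on $n - 2$ degrees of freedom is left to a table or a statistics package.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return (sum_y - b_hat * sum_x) / n, b_hat


def t_statistics(xs, ys, a0=0.0, b0=0.0):
    """Return (T_a, T_b) for the hypotheses a = a0 and b = b0."""
    n = len(xs)
    a_hat, b_hat = fit_line(xs, ys)
    rss = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)                      # estimate of sigma^2
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum(xs) ** 2
    s_a = math.sqrt(s2 * sum_x2 / denom)    # estimated SD of a_hat
    s_b = math.sqrt(n * s2 / denom)         # estimated SD of b_hat
    return (a_hat - a0) / s_a, (b_hat - b0) / s_b


# Made-up data with slope near 2 and intercept near 0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
t_a, t_b = t_statistics(xs, ys)
print(t_a, t_b)
```

For these made-up data, $T_b$ for the hypothesis $b = 0$ is large while $T_a$ for $a = 0$ is small, which matches how the points were chosen.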

Calibration problems

In many scientific areas instruments are used that measure quantities of interest indirectly. For example, the quantity of interest in Chapter 8 is the density of the snow pack, but the instrument actually measures gamma photon counts. The higher the density of the snow between the gamma ray detector and the source, the lower the detected counts. The calibration problem is to find a functional relationship between the measured quantity and the quantity of interest, in this case, an amplified version of the gamma photon count and the density.

Snow Gauge Calibration:

Snow gauge

Variable    Description
density     Density of polyethylene blocks in grams per cubic cm
gain        Amplified version of the gamma photon count

A model

The probability that a photon travels from the source to the detector without being scattered depends on how much “stuff” is in between. Roughly, over a short distance, if the amount of stuff, as indicated by the density of the material, is doubled, then the probability that the photon is scattered is doubled. Let $Z$ be the distance the photon travels before being scattered. Then we should have

$$P\{Z \le z + \Delta z \mid Z > z\} \approx \alpha x \,\Delta z ,$$

where $x$ is the density of the material and $\alpha$ is a constant of proportionality. It follows that $Z$ should be exponentially distributed with parameter $\alpha x$. The expected photon count at the detector will be proportional to the probability that a photon travels the distance $d$ between the source and the detector without being absorbed or scattered. Consequently, the basic model for the measurements becomes

$$G = C e^{-\alpha x d} = C e^{-\beta x}, \qquad \beta = \alpha d .$$

Variability in the local density of the snow suggests (as in the radon example) that $G$ might be lognormally distributed, and hence a natural model for the measurements would be

$$Y = \log G = a - \beta x + E ,$$

where $E$ is assumed to be normally distributed. More generally, we will just assume that $E$ has mean zero and finite variance.
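To see why taking logarithms turns the calibration into a linear regression, the sketch below generates noise-free gains from an exponential model of the form $G = C e^{-\beta x}$ with $\beta > 0$ and regresses $\log G$ on density. The symbols `C` and `beta`, their values, and the density grid are hypothetical choices for this example; they are not the snow gauge data.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return (sum_y - b_hat * sum_x) / n, b_hat


C, beta = 400.0, 5.0   # hypothetical constants, chosen only for illustration
densities = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# Noise-free gains from the exponential model (E = 0 for every measurement).
gains = [C * math.exp(-beta * x) for x in densities]
log_gains = [math.log(g) for g in gains]

a_hat, b_hat = fit_line(densities, log_gains)
print(a_hat, math.log(C))   # intercept estimates log C
print(b_hat, -beta)         # slope estimates -beta
```

With real measurements the error term $E$ is not zero, so the fitted intercept and slope would only approximate $\log C$ and $-\beta$.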

Design of the calibration experiment

The calibration experiment consists of making several measurements at each of a selection of densities. The choice of the densities and the number of replications for the measurements for each density is the experimental design for the calibration.

Why replication?

In a regression model, the difference between the observed value $y_i$ and the fitted value $\hat{y}_i$ comes from two sources, measurement error and model misfit. Replication allows one to estimate the variance of the measurement error separately from the prediction error due to model misfit. In particular, if there are $k$ replications $y_{i1}, \dots, y_{ik}$ at each of $m$ values $x_1, \dots, x_m$ of the explanatory variable, and $\bar{y}_i$ denotes the mean of the replicates at $x_i$, the residual sum of squares can be written as

$$\sum_{i=1}^{m} \sum_{j=1}^{k} (y_{ij} - \hat{y}_i)^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} (y_{ij} - \bar{y}_i)^2 + k \sum_{i=1}^{m} (\bar{y}_i - \hat{y}_i)^2 .$$

If the linear model is correct, the left side has expectation $(n - 2)\sigma^2 = (mk - 2)\sigma^2$. The first term on the right has expectation $m(k-1)\sigma^2$ and the second term has expectation $(m - 2)\sigma^2$.

A test statistic

Large values of

$$F = \frac{k \sum_{i=1}^{m} (\bar{y}_i - \hat{y}_i)^2 / (m - 2)}{\sum_{i=1}^{m} \sum_{j=1}^{k} (y_{ij} - \bar{y}_i)^2 / \bigl(m(k-1)\bigr)}$$

would indicate poor fit. How large is large? (Note that the expectations of the numerator and the denominator are both $\sigma^2$.) We can formulate a formal hypothesis test. If we take as the null hypothesis that the responses are given by

$$Y_{ij} = a + b x_i + E_{ij}$$

where the $E_{ij}$ are iid mean zero normal random variables, then under the null hypothesis, $F$ has an F-distribution with $m - 2$ and $m(k-1)$ degrees of freedom.
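The lack-of-fit statistic is straightforward to compute once the replicates are grouped by value of the explanatory variable. The following is a sketch under the assumption of an equal number $k$ of replicates at each of $m$ values; the function name `lack_of_fit` and both data sets are made up, one roughly linear and one clearly curved.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return (sum_y - b_hat * sum_x) / n, b_hat


def lack_of_fit(x_levels, replicates):
    """Return (F, lack-of-fit SS, pure-error SS, residual SS)."""
    m, k = len(x_levels), len(replicates[0])
    # Fit the line to all m*k observations.
    xs = [x for x, reps in zip(x_levels, replicates) for _ in reps]
    ys = [y for reps in replicates for y in reps]
    a_hat, b_hat = fit_line(xs, ys)
    fitted = [a_hat + b_hat * x for x in x_levels]
    group_means = [sum(reps) / k for reps in replicates]
    pure = sum((y - gm) ** 2
               for reps, gm in zip(replicates, group_means) for y in reps)
    lack = k * sum((gm - f) ** 2 for gm, f in zip(group_means, fitted))
    rss = sum((y - f) ** 2 for reps, f in zip(replicates, fitted) for y in reps)
    f_stat = (lack / (m - 2)) / (pure / (m * (k - 1)))
    return f_stat, lack, pure, rss


x_levels = [1.0, 2.0, 3.0, 4.0]
linear_data = [[1.1, 0.9], [2.2, 1.8], [2.9, 3.1], [4.1, 3.9]]      # made up
curved_data = [[1.1, 0.9], [4.2, 3.8], [8.9, 9.1], [16.1, 15.9]]    # made up

f_linear, _, _, _ = lack_of_fit(x_levels, linear_data)
f_curved, lack_c, pure_c, rss_c = lack_of_fit(x_levels, curved_data)
print(f_linear, f_curved)  # the curved data give a much larger F
```

The returned sums of squares also allow a direct check that the residual sum of squares splits into the pure-error and lack-of-fit pieces.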

Estimate of density from observed value of gain

If we use a regression equation to predict the result of a new measurement, there are two sources of error, the error in the model, and the measurement error for the new measurement. Viewing the prediction as a random variable (it depends on $\hat{a}$ and $\hat{b}$), we can derive a prediction interval in a manner analogous to a confidence interval using the fact that

$$\mathrm{Var}(y - \hat{y}) = \sigma^2 \left( 1 + \frac{1}{m} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right) .$$

Assuming the standard model with Gaussian errors, $y - \hat{y}$ will be approximately normal with mean zero.
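A rough prediction interval can be sketched from the variance formula by replacing $\sigma^2$ with the residual estimate $s^2$. Several choices here are assumptions of this example rather than part of the notes: the function name `prediction_interval`, the made-up data, and the multiplier 1.96, which relies on the normal approximation (for small samples a t quantile with $n - 2$ degrees of freedom would be the more careful choice). The number of observations plays the role of $m$ in the formula.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return (sum_y - b_hat * sum_x) / n, b_hat


def prediction_interval(xs, ys, x_new, z=1.96):
    """Return (lower, predicted, upper) for a new measurement at x_new."""
    n_obs = len(xs)
    a_hat, b_hat = fit_line(xs, ys)
    x_bar = sum(xs) / n_obs
    sxx = sum((x - x_bar) ** 2 for x in xs)
    rss = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n_obs - 2)   # stands in for the unknown sigma^2
    var_pred = s2 * (1.0 + 1.0 / n_obs + (x_new - x_bar) ** 2 / sxx)
    y_hat = a_hat + b_hat * x_new
    half_width = z * math.sqrt(var_pred)
    return y_hat - half_width, y_hat, y_hat + half_width


# Made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

near = prediction_interval(xs, ys, 3.0)    # at the mean of the x values
far = prediction_interval(xs, ys, 10.0)    # well outside the observed range
print(near)
print(far)
```

The interval is narrowest at $\bar{x}$ and widens as the new $x$ moves away from it, reflecting the last term of the variance formula.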

State Public Expenditures (From Data and Story Library, Carnegie Mellon University)

A linear regression of state and local per capita public expenditures with the percent of population in metropolitan areas (MET) would be insignificant. Actually, the relation is U-shaped. A second order (MET-squared) term is needed to capture this element. In doing so it is best to re-express MET as deviations from the mean MET and to square the deviations to form MET-squared, or MET2. The linear coefficient of MET will then represent the slope of expenditures at mean MET rather than at zero percent metropolitanization. An influential observation (Nevada) is still present as is the differential for Western states (see State Spending and Ability to Pay). Various stepwise and best subset regressions will then produce sets of significant predictor variables accounting for around 70 percent of total variation in state per capita expenditures.