Chapter 15 – Correlation and Regression

Now, we measure both an X ('independent' or 'predictor') and a Y ('dependent') variable – both RV's – for each person or subject in a study. In correlation analysis, we view both RV's as varying together and wish to quantify the linear association between them. More specifically, we assume that the couples (x, y) have a Bivariate Normal distribution, which entails five population parameters: $\mu_x$, $\sigma_x$, $\mu_y$, $\sigma_y$, and $\rho$ (the last one is the 'population correlation coefficient'). In regression analysis, on the other hand, conditional on X = x, we assess the linear association by fitting the expected line

$$E(Y \mid X = x) = \mu_{y|x} = \alpha + \beta x \qquad (*)$$

Here, we need to estimate the intercept ($\alpha$) and the slope ($\beta$). We also need to estimate the standard error, $\sigma_{y|x}$, associated with the expected line (*); this linear model requires that $\sigma_{y|x}$ does not vary with x.

Scatter Plots – The first thing to do with bivariate data is to plot them, with the dependent variable on the y-axis and the independent variable on the x-axis. In the handout (p. 1) is the scatterplot of the n = 8 points plotting y = height (cm) versus x = age (months); note here the upsloping linear association. On p. 2, we see the scatterplot for the n = 38 points of Example 15.2 (p. 410), plotting y = mileage (mpg) versus x = car weight (pounds). In the latter plot, there is a downsloping (somewhat linear) relationship. In Figure 15.3 on p. 410, there does not seem to be any linear association.

The Correlation Coefficient – We estimate the above population correlation coefficient ($\rho$) with its sample counterpart (r), given by:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$
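As an aside (not part of the text), here is a minimal Python sketch checking that the three forms of r above agree numerically; the data are the n = 8 age/height pairs of Example 15.1, tabulated later in these notes.

```python
import math

x = [24, 48, 60, 96, 63, 39, 63, 39]        # age (months)
y = [87, 101, 120, 159, 135, 104, 126, 96]  # height (cm)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

# Form 1: deviation cross-products over (n-1) * s_x * s_y
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r1 = sxy / ((n - 1) * s_x * s_y)

# Form 2: deviation cross-products over the root of the deviation squares
r2 = sxy / math.sqrt(sxx * syy)

# Form 3: the computational ("machine") formula
r3 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / math.sqrt(sxx * syy)

print(r1, r2, r3)  # all three agree: r ≈ 0.9727 for these data
```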

The calculation of r by hand is best done via a table such as Table 15.3 on p. 412, and coincides with the plots in Figures 15.5 and 15.6. Note that the correlation coefficient (r) is positive when the trend is upward, negative when it's downward, and always between –1 and 1.

LSAT – Law School GPA Example. Here n = 200 (students), r = 0.796:

[Scatterplot of GPA (y-axis, roughly 3.00 to 4.25) vs LSAT (x-axis, roughly 550 to 750)]

Note the elliptical pattern in the data. From this plot, we can 'eyeball' that $\bar{x} \approx 660$, $\bar{y} \approx 3.4$, $s_x \approx 40$, $s_y \approx 0.25$, and $r \approx 0.80$. Indeed, we have:

Descriptive Statistics: LSAT, GPA

Variable   N     Mean     StDev    Minimum   Median   Maximum
LSAT       200   650.21   41.11    560.25    649.50   752.99
GPA        200   3.3967   0.2544   2.8659    3.3971   4.1082

Correlations: LSAT, GPA Pearson correlation of LSAT and GPA = 0.796

Whereas correlation analysis treats X and Y equally, regression treats Y as the dependent variable – one that depends on X.

Simple Linear Regression – Another way to express equation (*) is (conditional on observing x):

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n,$$

where the $\varepsilon_i$ are called the error terms. By assumption, we have that for all values of i = 1…n, $\varepsilon_i \sim N(0, \sigma_e^2)$; hence, $y_i \sim N(\alpha + \beta x_i,\, \sigma_e^2)$. From our discussion above, we realize that this is really a conditional statement (conditional upon X = x_i). Our next goal is to estimate and test hypotheses related to the parameters $\alpha$, $\beta$, and $\sigma_e^2$; we do so using the principle of least squares. Our objective function to minimize is the error sum of squares (SSE) function:

$$SSE(\alpha, \beta) = \sum \varepsilon_i^2 = \sum (y_i - \alpha - \beta x_i)^2 \qquad (**)$$

Differentiating (**) with respect to ('wrt') $\alpha$ and $\beta$, then setting to zero and solving, will give us the LSE's (least squares estimates), 'a' and 'b'. Differentiating wrt $\alpha$, setting to zero, and solving gives:

$$\bar{y} = a + b\bar{x} \qquad (NE1)$$

This expression, called the first 'normal equation' (NE), implies that the regression line goes through the point of averages. The second NE is obtained by differentiating wrt $\beta$ and setting to zero:

$$\sum x_i y_i = a \sum x_i + b \sum x_i^2 \qquad (NE2)$$

Simultaneous solution of NE1 and NE2 gives the slope LSE formula:

$$b = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r\,\frac{s_y}{s_x}$$

Once this is obtained, we use (NE1) to find the intercept LSE (a).
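To make the derivation concrete – again a sketch of mine, not the text's – the two normal equations can be solved directly as a 2×2 linear system (numpy is assumed here), and the result matches the closed-form slope $b = r\,s_y/s_x$; the data are the Example 15.1 pairs used below.

```python
import numpy as np

x = np.array([24, 48, 60, 96, 63, 39, 63, 39], dtype=float)
y = np.array([87, 101, 120, 159, 135, 104, 126, 96], dtype=float)
n = len(x)

# NE1 (times n):  sum(y)   = n*a      + b*sum(x)
# NE2:            sum(x*y) = a*sum(x) + b*sum(x**2)
A = np.array([[n,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)

# Closed-form slope via the correlation coefficient
r = np.corrcoef(x, y)[0, 1]
b_closed = r * y.std(ddof=1) / x.std(ddof=1)

print(a, b)                         # intercept and slope from the NE's
print(b_closed)                     # same slope from b = r * s_y / s_x
print(y.mean(), a + b * x.mean())   # NE1: the line passes through (x̄, ȳ)
```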

The predicted values of the $y_i$ are $\hat{y}_i = a + b x_i$, and the residuals are then the differences between the actual and the predicted values: $e_i = y_i - \hat{y}_i$. The residual sum of squares (the text and Minitab mistakenly call this SSE) is $SSE = \sum e_i^2$. It is very important, then, to note the difference between errors and residuals: residuals are the observed differences between the actual and predicted y's, whereas errors are never observed – in much the same way that the estimated slope (b) is observed, calculated, and used to estimate the true slope ($\beta$), which is never observed (unless the whole population is sampled). Finally, our estimate of $\sigma_e^2$ is the mean square of the residuals,

$$s_e^2 = SSE/(n-2)$$

The square root of this term – which estimates $\sigma_e$ – is called the RMS of the residuals on p. 423.

Example 15.1 continued – We now calculate our estimates (LSE's for $\alpha$ and $\beta$, and $s_e^2$ for $\sigma_e^2$) by hand. Note $\bar{x} = 54$, $\bar{y} = 116$, n = 8. Summing down the columns, we get $\sum(x_i - \bar{x})(y_i - \bar{y}) = 3531$ and $\sum(x_i - \bar{x})^2 = 3348$, so b = 3531/3348 = 1.05466, a = 116 – 1.05466×54 = 59.0484, and the fitted line is $\hat{y}_i = 59.0484 + 1.05466\,x_i$, giving us the last two columns.

 i    x    y   x–x̄   y–ȳ   (x–x̄)(y–ȳ)   (x–x̄)²      ŷ        e
 1   24   87   –30   –29         870       900    84.36     2.64
 2   48  101    –6   –15          90        36   109.67    –8.67
 3   60  120     6     4          24        36   122.33    –2.33
 4   96  159    42    43        1806      1764   160.30    –1.30
 5   63  135     9    19         171        81   125.49     9.51
 6   39  104   –15   –12         180       225   100.18     3.82
 7   63  126     9    10          90        81   125.49     0.51
 8   39   96   –15   –20         300       225   100.18    –4.18

Here, $SSE = \sum e_i^2 = 211.997$, so $s_e^2 = 211.997/6 = 35.333 = 5.944^2$.
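The arithmetic above is easy to verify by machine. The following sketch (mine, not the text's; numpy assumed) reproduces b, a, the fitted values, the residuals, SSE, and $s_e^2$:

```python
import numpy as np

x = np.array([24, 48, 60, 96, 63, 39, 63, 39], dtype=float)
y = np.array([87, 101, 120, 159, 135, 104, 126, 96], dtype=float)
n = len(x)

sxy = np.sum((x - x.mean()) * (y - y.mean()))   # 3531
sxx = np.sum((x - x.mean()) ** 2)               # 3348
b = sxy / sxx                                   # 1.05466
a = y.mean() - b * x.mean()                     # 59.0484

y_hat = a + b * x          # predicted values (the ŷ column)
e = y - y_hat              # residuals (the e column)
sse = np.sum(e ** 2)       # 211.997
s_e2 = sse / (n - 2)       # 35.333
print(b, a, sse, s_e2, np.sqrt(s_e2))  # s_e ≈ 5.944
```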

Predictions – The text makes a good point on pp. 428–9 (for this example): when a prediction of y = height is made with no x = age information, our prediction is $\bar{y} = 116$ cm, with SE of $s_y = 23.71$ cm. On the other hand, with the information that x = 60 months,

the (regression) predicted value is $\hat{y} = 122.33$ cm, with SE of $s_e = 5.944$. Note the huge drop in the SE!

As we can see from the handout, Minitab reports the SSE under 'Residual Error SS', $s_e^2$ (often called the MSE, for 'mean square error') under 'Residual Error MS', and $s_e$ under 'S'; it gives the LSE's (along with some t-tests and p-values), and it also gives an 'Analysis of Variance' or ANOVA table. At the bottom of this table is the 'Total SS' – more correctly called the corrected total SS – which in general equals $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$. The Lemma on p. 429 shows that $SSE = (1 - r^2)\,SST$. Usually, 'r²' is expressed as 'R²', so that

$$R^2 = \frac{SSReg}{SST},$$

where SSReg (Regression SS) = SST – SSE.
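As a quick numerical check of the Lemma and of the R² identity – again a sketch using the Example 15.1 data, not part of the text:

```python
import numpy as np

x = np.array([24, 48, 60, 96, 63, 39, 63, 39], dtype=float)
y = np.array([87, 101, 120, 159, 135, 104, 126, 96], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - (a + b * x)

sst = np.sum((y - y.mean()) ** 2)   # 3936.0 (corrected total SS)
sse = np.sum(e ** 2)                # 212.0
ssreg = sst - sse                   # 3724.0
r = np.corrcoef(x, y)[0, 1]

print(ssreg / sst)                  # R^2 ≈ 0.94614
print((1 - r ** 2) * sst, sse)      # the Lemma: these two agree
```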

In the above example (and handout), note that 3724.0/3936.0 = 0.94614, which corresponds with the reported value of R². The usual interpretation of R² is 'the proportion of the total variability that is explained by the regression line'. Finally, Theorem 15.3 establishes that

$$s_e = s_y \sqrt{\frac{n-1}{n-2}}\,\sqrt{1 - r^2} \;\approx\; s_y \sqrt{1 - r^2} \quad \text{(for large n).}$$

In the above example (where n = 8 is not large), we have

$$5.944 = 23.713\,\sqrt{\tfrac{7}{6}}\,\sqrt{1 - 0.9727^2},$$

which is correct. Read 15.7 to understand 'the regression paradox'.

Hypothesis Testing (section 15.8) – Paramount is the test of whether the linear regression line is flat ($\beta = 0$), or equivalently that there is zero correlation between X and Y ($\rho = 0$); these parameters are related by the equation:

$$\beta = \rho\,\frac{\sigma_y}{\sigma_x}$$

It follows that a hypothesis of zero correlation can be tested by testing for zero slope. Remembering that the LSE for $\beta$ is b, for this latter test we need to find the SE associated with b, $SE_b$. On p. 435 (bottom), it is shown that b is a linear combination of the (independent) $y_i$'s, from which Theorems 15.4 and 15.5 result. The first shows that b is an unbiased estimate of $\beta$. The latter theorem shows that

$$Var(b) = \frac{\sigma_e^2}{\sum (x_i - \bar{x})^2}$$

Since we don't know $\sigma_e^2$, we'll use $s_e^2$ in its place:

$$SE_b = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{b}{r}\sqrt{\frac{1 - r^2}{n - 2}}$$
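As a check on the hand computation that follows, here is a sketch (numpy and scipy assumed; scipy's t.ppf supplies the quantile $t_{0.975,\,n-2}$) computing $SE_b$ both ways and building the 95% CI for the slope:

```python
import numpy as np
from scipy import stats

x = np.array([24, 48, 60, 96, 63, 39, 63, 39], dtype=float)
y = np.array([87, 101, 120, 159, 135, 104, 126, 96], dtype=float)
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
a = y.mean() - b * x.mean()
e = y - (a + b * x)
s_e = np.sqrt(np.sum(e ** 2) / (n - 2))
r = np.corrcoef(x, y)[0, 1]

se_b1 = s_e / np.sqrt(sxx)                          # s_e / sqrt(Σ(x-x̄)²)
se_b2 = (b / r) * np.sqrt((1 - r ** 2) / (n - 2))   # equivalent form
t = stats.t.ppf(0.975, df=n - 2)                    # 2.4469 for df = 6
print(se_b1, se_b2)                                 # both ≈ 0.10273
print(b - t * se_b1, b + t * se_b1)                 # (0.8033, 1.3061)
```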

In the above example, $SE_b = (1.0547/0.9727) \times \sqrt{0.05385/6} = 0.10273$. Now, we have all the ingredients to find a CI for the true slope ($\beta$) or to test hypotheses related to it. Since the degrees of freedom associated with it are (n – 2), if we want the 95% CI for $\beta$, the relevant t-value for this example is $t_{0.975} = 2.4469$, so the CI is:

$$1.0547 \pm 2.4469 \times 0.10273 = 1.0547 \pm 0.2514, \text{ or } (0.8033,\, 1.3061)$$

We're 95% confident that the true slope lies between these values.

Beware: correlation is not causation! (Section 15.9)
