UNIVERSITY OF TORONTO SCARBOROUGH
Department of Computer and Mathematical Sciences
Final Exam, December 18, 2013

STAC32 Applications of Statistical Methods

Duration: 3 hours

Last name:
First name:
Student number:

Aids allowed:
- My lecture overheads
- My long-form lecture notes
- Any notes that you have taken in this course
- Your marked assignments
- My R book
- The course SAS text
- Non-programmable, non-communicating calculator

This exam has 24 numbered pages, including this page. Please check to see that you have all the pages.

Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker's attention to where the rest of the answer may be found.

The maximum marks available for each question are shown below.

Question   Max   Mark        Question   Max   Mark
    1        5                   8        6
    2        5                   9        8
    3       10                  10        6
    4        6                  11       12
    5        8                  12        7
    6        5                  13        8
    7        5                  14        9
                              Total     100

1. A small class has two even smaller lecture sections, labelled a and b. The marks on the final exam are shown below, and are saved in the file marks.txt. Write SAS code to do the following: read the data from the file, display the whole data set, calculate the mean and SD of final exam marks for each section, and make side-by-side boxplots of the marks arranged by section.

student section mark
1       a       78
2       a       75
3       b       66
4       b       81
5       a       74
6       b       91
7       a       83
8       b       59

2. In R:

(a) Discuss the difference between read.table and read.csv. You might find it helpful to make up examples of data where you would use each.

(b) Suppose you had never heard of read.csv. How could you use read.table to read in a .csv file?

3. The data below shows the nine sun-orbiting objects that were formerly known as planets (since 2006, Pluto is no longer considered a planet).
Shown are the mean distance from the sun (in million miles) and the length of the planet's year (in multiples of the earth's year):

R> planets=read.table("planets.txt",header=T)
R> planets
   Planet SunDist   Year
1 Mercury      36   0.24
2   Venus      67   0.61
3   Earth      93   1.00
4    Mars     142   1.88
5 Jupiter     484  11.86
6  Saturn     887  29.46
7  Uranus    1784  84.07
8 Neptune    2796 164.82
9   Pluto    3707 247.68

Our aim is to predict the length of a planet's year from its distance from the sun. Analysis 1 is shown below.

R> attach(planets)
R> plot(Year~SunDist)

[Figure: scatterplot of Year (vertical axis, 0 to 250) against SunDist (horizontal axis, 0 to 3000)]

R> year1.lm=lm(Year~SunDist)
R> summary(year1.lm)

Call:
lm(formula = Year ~ SunDist)

Residuals:
    Min      1Q  Median      3Q     Max
-20.082  -7.396   4.959   8.586  17.947

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.351926   6.076862  -2.033   0.0816 .
SunDist       0.065305   0.003589  18.194 3.75e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 13.76 on 7 degrees of freedom
Multiple R-squared: 0.9793, Adjusted R-squared: 0.9763
F-statistic: 331 on 1 and 7 DF, p-value: 3.749e-07

R> r=resid(year1.lm)
R> f=fitted(year1.lm)
R> plot(r~f)

[Figure: plot of residuals r (vertical axis, -20 to 10) against fitted values f (horizontal axis, 0 to 200)]

(a) Do you see any problems with Analysis 1? Explain briefly.

(b) Astronomy suggests that there should be a power law at work here. That is, if y is year length and x is distance from the sun, the relationship should be of the form y = ax^b, where a and b are constants. One way of fitting this would be via a Box-Cox transformation. Another way is to take logarithms of both sides, resulting in log y = log a + b log x. This is done in Analysis 2 below:

R> log.year=log(Year)
R> log.sundist=log(SunDist)
R> year2.lm=lm(log.year~log.sundist)
R> summary(year2.lm)

Call:
lm(formula = log.year ~ log.sundist)

Residuals:
      Min        1Q    Median        3Q       Max
-0.011321 -0.001499  0.001741  0.003661  0.004669

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.797111   0.006751   -1007   <2e-16 ***
log.sundist  1.499222   0.001088    1378   <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.005312 on 7 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.899e+06 on 1 and 7 DF, p-value: < 2.2e-16

R> pred1=predict(year2.lm)
R> pred2=exp(pred1)
R> pct.error=(Year-pred2)/Year*100
R> cbind(planets,pred=pred2,pct.error)
   Planet SunDist   Year        pred   pct.error
1 Mercury      36   0.24   0.2405993 -0.24968900
2   Venus      67   0.61   0.6105802 -0.09511447
3   Earth      93   1.00   0.9982609  0.17390744
4    Mars     142   1.88   1.8828212 -0.15006310
5 Jupiter     484  11.86  11.8366830  0.19660221
6  Saturn     887  29.46  29.3523314  0.36547401
7  Uranus    1784  84.07  83.6783677  0.46584076
8 Neptune    2796 164.82 164.1250081  0.42166724
9   Pluto    3707 247.68 250.4998880 -1.13852068

R> r=resid(year2.lm)
R> f=fitted(year2.lm)
R> plot(r~f)

[Figure: plot of residuals r (vertical axis, -0.010 to 0.005) against fitted values f (horizontal axis, -1 to 5)]

Based on this output, what are the values of a and b in the power law? (You may need a calculator for this.)

(c) Look at the predictions from Analysis 2. Why did I use exp in my code?

(d) In the predictions from Analysis 2, do you think the predictions of year length are good or bad overall? Explain briefly. Are there any planets for which the prediction accuracy is unusually poor?

(e) Which planet is at the bottom right of the residual plot? How do you know?

(f) What next action would you recommend, before fitting another model? Explain briefly. (You might wish to re-read my preamble to this question.)

4. The Toronto Star conducted a study in 2009 of men's and women's ability to pull into a downtown parking space. The focus of this study was on accuracy, not speed, so the response variable is the distance from the curb, in inches. Some analysis is shown below.

SAS> data parking;
SAS> infile 'parking.txt' expandtabs;
SAS> input gender $ distance;
SAS>
SAS> proc means;
SAS> class gender;
SAS> var distance;

The MEANS Procedure

Analysis Variable : distance

            N
gender    Obs     N          Mean       Std Dev       Minimum       Maximum
---------------------------------------------------------------------------
Female     47    47     9.3085106     5.3258529     2.0000000    25.0000000
Male       46    46    11.1413043     7.7729324     0.5000000    48.0000000
---------------------------------------------------------------------------

SAS> proc boxplot;
SAS> plot distance*gender / boxstyle=schematic;

[Figure: side-by-side boxplots of distance (vertical axis, 0 to 50) by gender (Male, Female)]

(a) The newspaper reported that females are more accurate parkers than males, by an average of almost two inches. Do you think we should trust this conclusion? Explain briefly why or why not.

(b) A two-sample t-test was also run, as shown:

SAS> proc ttest side=l;
SAS> var distance;
SAS> class gender;

The TTEST Procedure

Variable: distance

gender          N       Mean    Std Dev    Std Err    Minimum    Maximum
Female         47     9.3085     5.3259     0.7769     2.0000    25.0000
Male           46    11.1413     7.7729     1.1461     0.5000    48.0000
Diff (1-2)           -1.8328     6.6495     1.3791

gender        Method              Mean      95% CL Mean      Std Dev    95% CL Std Dev
Female                          9.3085    7.7448  10.8722     5.3259    4.4257  6.6892
Male                           11.1413    8.8330  13.4496     7.7729    6.4472  9.7902
Diff (1-2)    Pooled           -1.8328    -Infty   0.4590     6.6495    5.8079  7.7785
Diff (1-2)    Satterthwaite    -1.8328    -Infty   0.4714

Method           Variances        DF    t Value    Pr < t
Pooled           Equal            91      -1.33    0.0936
Satterthwaite    Unequal      79.446      -1.32    0.0947

Equality of Variances

Method      Num DF    Den DF    F Value    Pr > F
Folded F        45        46       2.13    0.0120

Is this test supporting the idea that females are more accurate parkers, on average, than males? Explain briefly.

5. For the data in the previous question, your instructor decided to carry out a randomization test based on medians.
The code is shown below (he had to switch to R):

R> park.dist=read.table("parking.txt",header=F)
R> names(park.dist)=c("gender","distance")
R> attach(park.dist)
R>
R> # section A of code
R>
R> obs.med=aggregate(distance~gender,data=park.dist,median)
R> obs.med
  gender distance
1 Female      8.5
2   Male     10.0
R> obs.diff=obs.med[2,2]-obs.med[1,2]
R> obs.diff
[1] 1.5
R> # section B of code
R>
R> nsim=1000
R> diff=numeric(nsim)
R> n=length(gender)
R>
R> for (i in 1:nsim)
R> {
R>   shuf=sample(gender,n)
R>   sim.med=aggregate(distance~shuf,data=park.dist,median)
R>   sim.diff=sim.med[2,2]-sim.med[1,2]
R>   diff[i]=sim.diff
R> }
R> # section C of code
R>
R> hist(diff)
R> abline(v=obs.diff,col="red")

[Figure: "Histogram of diff", showing Frequency (vertical axis, 0 to 350) against diff (horizontal axis, -4 to 4)]

R> # section D of code
R>
R> table(diff>=obs.diff)

FALSE  TRUE
  849   151

(a) What does Section A of the code do?

(b) What does section B of the code do? Pay particular attention to the four lines of code within the loop.
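
For readers who want to try the shuffle-and-recompute pattern on their own, here is a minimal self-contained sketch in R. It is not the instructor's code and does not use the parking data: the data frame `toy`, the helper `median_diff`, and the simulated values are all hypothetical, made up purely for illustration.

```r
# Hypothetical data, NOT the parking data from the exam:
# two groups of 10 simulated values each.
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b"), each = 10),
  value = c(rnorm(10, mean = 10, sd = 2), rnorm(10, mean = 12, sd = 2))
)

# Hypothetical helper: difference in group medians
# (second group minus first, groups in alphabetical order).
median_diff <- function(value, group) {
  meds <- tapply(value, group, median)
  unname(meds[2] - meds[1])
}

# Observed difference with the real group labels.
obs_diff <- median_diff(toy$value, toy$group)

# Shuffle the labels many times, recomputing the difference each time.
nsim <- 1000
sim_diff <- numeric(nsim)
for (i in seq_len(nsim)) {
  shuffled <- sample(toy$group)   # random permutation; group sizes are kept
  sim_diff[i] <- median_diff(toy$value, shuffled)
}

# Proportion of shuffled differences at least as large as the observed one.
p_value <- mean(sim_diff >= obs_diff)
p_value
```

This sketch uses `tapply` rather than `aggregate` only to keep the helper short; either would give the two group medians.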