UNIVERSITY OF TORONTO SCARBOROUGH
Department of Computer and Mathematical Sciences
Final Exam, December 18, 2013

STAC32 Applications of Statistical Methods
Duration: 3 hours

Aids allowed:

- My lecture overheads
- My long-form lecture notes
- Any notes that you have taken in this course
- Your marked assignments
- My R book
- The course SAS text
- Non-programmable, non-communicating calculator

This exam has 45 numbered pages, including this page. Please check to see that you have all the pages. Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker's attention to where the rest of the answer may be found.

The maximum marks available for each question are shown below. [Marks table omitted; Total 100.]

1. A small class has two even smaller lecture sections, labelled a and b. The marks on the final exam are shown below, and are saved in the file marks.txt. Write SAS code to do the following: read the data from the file, display the whole data set, calculate the mean and SD of final exam marks for each section, and make side-by-side boxplots of the marks arranged by section.

student section mark
1 a 78
2 a 75
3 b 66
4 b 81
5 a 74
6 b 91
7 a 83
8 b 59

This, or something like it:

SAS> data marks;
SAS> infile 'marks.txt' firstobs=2;
SAS> input student section $ mark;
SAS>
SAS> proc print;
SAS>
SAS> proc means;
SAS> var mark;
SAS> class section;
SAS>
SAS> proc sort;
SAS> by section;
SAS>
SAS> proc boxplot;
SAS> plot mark*section / boxstyle=schematic;

Obs  student  section  mark
  1        1        a    78
  2        2        a    75
  3        3        b    66
  4        4        b    81
  5        5        a    74
  6        6        b    91
  7        7        a    83
  8        8        b    59

The MEANS Procedure
Analysis Variable : mark

           N
section  Obs  N        Mean     Std Dev     Minimum     Maximum
-----------------------------------------------------------------
a          4  4  77.5000000   4.0414519  74.0000000  83.0000000
b          4  4  74.2500000  14.4539499  59.0000000  91.0000000
-----------------------------------------------------------------

[Side-by-side boxplots of mark by section (a and b), with mark on the vertical axis running from about 50 to 100.]

You need: firstobs=2 to skip the first line of the data file; both a var and a class within proc means, to supply the variable to average and the groups to average over; the proc sort to put all the section a students together and all the section b students together (or else you'll get about six boxplots, not just 2); and the boxstyle (or, if you must, boxtype, which is literally wrong but works) to allow any outliers to show up.

2. In R:

(a) Discuss the difference between read.table and read.csv. You might find it helpful to make up examples of data where you would use each.

read.table is for reading data separated by whitespace (spaces or tabs: R is not picky). Something like this, which is called a.txt:

a 1
b 2
c 3

R> aa=read.table("a.txt",header=F)
R> aa
  V1 V2
1  a  1
2  b  2
3  c  3

read.csv is used for reading data from a .csv file, where the values are separated by commas. This might have been the result of saving a spreadsheet to read into R, but doesn't have to be. Here's b.csv:

first name,1
second name,2
third,3

and to read it in:

R> bb=read.csv("b.csv",header=F)
R> bb
           V1 V2
1  first name  1
2 second name  2
3       third  3

(b) Suppose you had never heard of read.csv. How could you use read.table to read in a .csv file?

The key is the sep argument to read.table. You can use it for reading values separated by anything:

R> bb=read.table("b.csv",sep=",",header=F)
R> bb
           V1 V2
1  first name  1
2 second name  2
3       third  3

3. The data below shows the nine sun-orbiting objects that were formerly known as planets (since 2006, Pluto is no longer considered a planet). Shown are the mean distance from the sun (in million miles) and the length of the planet's year (in multiples of the earth's year):

R> planets=read.table("planets.txt",header=T) R> planets

   Planet SunDist   Year
1 Mercury      36   0.24
2   Venus      67   0.61
3   Earth      93   1.00
4    Mars     142   1.88
5 Jupiter     484  11.86
6  Saturn     887  29.46
7  Uranus    1784  84.07
8 Neptune    2796 164.82
9   Pluto    3707 247.68

Our aim is to predict the length of a planet's year from its distance from the sun. Analysis 1 is shown below.

R> attach(planets) R> plot(Year~SunDist)

[Scatterplot of Year against SunDist: the points rise with a somewhat curved trend.]

R> year1.lm=lm(Year~SunDist) R> summary(year1.lm)

Call: lm(formula = Year ~ SunDist)

Residuals:
    Min      1Q  Median      3Q     Max
-20.082  -7.396   4.959   8.586  17.947

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.351926   6.076862  -2.033   0.0816 .
SunDist       0.065305   0.003589  18.194 3.75e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.76 on 7 degrees of freedom
Multiple R-squared: 0.9793, Adjusted R-squared: 0.9763
F-statistic: 331 on 1 and 7 DF, p-value: 3.749e-07

R> r=resid(year1.lm) R> f=fitted(year1.lm) R> plot(r~f)

[Residual plot for year1.lm: residuals r against fitted values f, showing a clear curved pattern.]

(a) Do you see any problems with Analysis 1? Explain briefly.

The relationship on the scatterplot looks a bit curved. This is accentuated on the residual plot: there is a clear curved pattern, which means that the relationship is a curve, not a straight line.

(b) Astronomy suggests that there should be a power law at work here. That is, if y is year length and x is distance from the sun, the relationship should be of the form y = ax^b, where a and b are constants. One way of fitting this would be via a Box-Cox transformation. Another way is to take logarithms of both sides, resulting in log y = log a + b log x. This is done in Analysis 2 below:

R> log.year=log(Year)
R> log.sundist=log(SunDist)
R> year2.lm=lm(log.year~log.sundist)
R> summary(year2.lm)

Call:
lm(formula = log.year ~ log.sundist)

Residuals:
      Min        1Q    Median        3Q       Max
-0.011321 -0.001499  0.001741  0.003661  0.004669

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.797111   0.006751   -1007   <2e-16 ***
log.sundist  1.499222   0.001088    1378   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.005312 on 7 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.899e+06 on 1 and 7 DF, p-value: < 2.2e-16

R> pred1=predict(year2.lm)
R> pred2=exp(pred1)
R> pct.error=(Year-pred2)/Year*100
R> cbind(planets,pred=pred2,pct.error)

   Planet SunDist   Year        pred   pct.error
1 Mercury      36   0.24   0.2405993 -0.24968900
2   Venus      67   0.61   0.6105802 -0.09511447
3   Earth      93   1.00   0.9982609  0.17390744
4    Mars     142   1.88   1.8828212 -0.15006310
5 Jupiter     484  11.86  11.8366830  0.19660221
6  Saturn     887  29.46  29.3523314  0.36547401
7  Uranus    1784  84.07  83.6783677  0.46584076
8 Neptune    2796 164.82 164.1250081  0.42166724
9   Pluto    3707 247.68 250.4998880 -1.13852068

R> r=resid(year2.lm)
R> f=fitted(year2.lm)
R> plot(r~f)

[Residual plot for year2.lm: residuals r against fitted values f. The points trend gently uphill from left to right, except for one point at the bottom right whose residual is much more negative than the rest.]

Based on this output, what are the values of a and b in the power law? (You may need a calculator for this.)

If a power law holds, the logs of the two variables will have a linear relationship. As shown above, log y = log a + b log x, so the intercept of the linear trend is log a and the slope is b. Thus, using coef to get the coefficients:

R> cc=coef(year2.lm)
R> cc
(Intercept) log.sundist
  -6.797111    1.499222

a is

R> exp(cc[1])
(Intercept)
0.001116997

and b is

R> cc[2]
log.sundist
   1.499222
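The question also mentions a Box-Cox transformation as another route. That wasn't required, but here is a minimal sketch of what it might look like, assuming the MASS package is available (my code, not part of the marked solution). Note that boxcox only considers power transformations of the response, so wherever its peak lands suggests a power (or a log, if the peak is near zero) to try for Year alone; it is not quite the same thing as the log-log fit above, which transforms both variables.

R> library(MASS)                         # boxcox lives in MASS
R> bc=boxcox(Year~SunDist,data=planets)  # plots profile log-likelihood against lambda
R> bc$x[which.max(bc$y)]                 # lambda (power) with the highest log-likelihood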

(c) Look at the predictions from Analysis 2. Why did I use exp in my code?

I had to undo the log transformation, since the linear regression actually predicted the log of year length. In R, log takes natural (base e) logs, and the inverse of that is exp, or e^y.

(d) In the predictions from Analysis 2, do you think the predictions of year length are good or bad overall? Explain briefly. Are there any planets for which the prediction accuracy is unusually poor?

I think they are very good: they are off by less than 1%, with the exception of Pluto, which is off by about 3 years (a bit more than 1%). (Percent error is the right measure, given that we have taken logs.)

(e) Which planet is at the bottom right of the residual plot? How do you know?

Pluto. It has the most negative residual, and the largest fitted value (taking logs won't change its fitted value being the largest).

(f) What next action would you recommend, before fitting another model? Explain briefly. (You might wish to re-read my preamble to this question.)

The least accurate prediction was for Pluto, and in the preamble to the question I said that Pluto wasn't considered a planet any more. This suggests that the power-law relationship doesn't apply so well to Pluto, and we would be justified in taking Pluto out and trying again. I'd expect the relationship without Pluto to be extraordinarily good.

I was curious, so I tried it to see. All those -9s are for taking out Pluto. Maybe making a data frame without the 9th row and using that would have been easier (a sketch of that appears at the end of this question).

R> year3.lm=lm(log.year[-9]~log.sundist[-9])
R> summary(year3.lm)

Call:
lm(formula = log.year[-9] ~ log.sundist[-9])

Residuals:
       Min         1Q     Median         3Q        Max
-0.0017421 -0.0006138 -0.0001039  0.0005050  0.0021572

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     -6.8045877  0.0017000   -4003   <2e-16 ***
log.sundist[-9]  1.5007791  0.0002879    5213   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.00123 on 6 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.717e+07 on 1 and 6 DF, p-value: < 2.2e-16

The power looks like 1.5, which astronomy suggests is what it should be (this is Kepler's third law). Note that the power (slope) is estimated even more accurately than when we included Pluto.

R> pred1=predict(year3.lm)
R> pred2=exp(pred1)
R> pct.error=(Year[-9]-pred2)/Year[-9]*100
R> cbind(planets[-9,],pred=pred2,pct.error)

   Planet SunDist   Year        pred    pct.error
1 Mercury      36   0.24   0.2401438 -0.059905874
2   Venus      67   0.61   0.6100142 -0.002331083
3   Earth      93   1.00   0.9978451  0.215490997
4    Mars     142   1.88   1.8832779 -0.174358083
5 Jupiter     484  11.86  11.8621892 -0.018458874
6  Saturn     887  29.46  29.4433481  0.056523739
7  Uranus    1784  84.07  84.0292461  0.048476153
8 Neptune    2796 164.82 164.9286006 -0.065890418

The largest percentage error is only 0.2%. This power law is working really very well for the genuine planets.

R> r=resid(year3.lm)
R> f=fitted(year3.lm)
R> plot(r~f)

[Residual plot for year3.lm: residuals r against fitted values f, with the eight remaining points scattered with no obvious pattern.]

And now the residual plot looks pretty much random (as much as it can with only 8 points). In case you were concerned about the second residual plot, where the points went uphill until the end: this is a typical residual plot when you have an outlier, and it indicates that the outlier is a cause for concern. (I didn't ask you about that, because I didn't want to confuse the issue.)
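Here, as promised above, is a sketch of the "data frame without the 9th row" approach. The names planets2 and year4.lm are my own, and this is an illustration rather than part of the marked solution; it should give the same fitted power law as year3.lm:

R> planets2=planets[-9,]                             # the planets data frame with Pluto (row 9) removed
R> year4.lm=lm(log(Year)~log(SunDist),data=planets2)
R> coef(year4.lm)                                    # the slope should again be about 1.5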

4. The Toronto Star conducted a study in 2009 of men's and women's ability to pull into a downtown parking space. The focus of this study was on accuracy, not speed, so the response variable is the distance from the curb, in inches. Some analysis is shown below.

SAS> data parking;
SAS> infile 'parking.txt' expandtabs;
SAS> input gender $ distance;
SAS>
SAS> proc means;
SAS> class gender;
SAS> var distance;

The MEANS Procedure
Analysis Variable : distance

          N
gender  Obs   N        Mean    Std Dev    Minimum     Maximum
----------------------------------------------------------------
Female   47  47   9.3085106  5.3258529  2.0000000  25.0000000
Male     46  46  11.1413043  7.7729324  0.5000000  48.0000000
----------------------------------------------------------------

SAS> proc boxplot; SAS> plot distance*gender / boxstyle=schematic;

[Side-by-side boxplots of distance by gender (Male and Female), with distance on the vertical axis from 0 to 50. The Male boxplot shows a clear high outlier near 50.]

(a) The newspaper reported that females are more accurate parkers than males, by an average of almost two inches. Do you think we should trust this conclusion? Explain briefly why or why not.

This is a difference of means, and is correct from the output shown. But the question is whether we should trust the mean. The male distances have a clear outlier, suggesting that we should not trust the mean. If you look at the boxplots, the male median is also higher than the female median, but not by as much.

If you want to, you can also take the angle that these may not be random samples of all downtown Toronto parkers (or whatever you think the population is), and that we should be cautious of generalizing for that reason. I'm cool with that.

(b) A two-sample t-test was also run, as shown:

SAS> proc ttest side=l;
SAS> var distance;
SAS> class gender;

The TTEST Procedure
Variable: distance

gender       N     Mean  Std Dev  Std Err  Minimum  Maximum
Female      47   9.3085   5.3259   0.7769   2.0000  25.0000
Male        46  11.1413   7.7729   1.1461   0.5000  48.0000
Diff (1-2)      -1.8328   6.6495   1.3791

gender      Method            Mean    95% CL Mean      Std Dev  95% CL Std Dev
Female                      9.3085   7.7448 10.8722     5.3259  4.4257  6.6892
Male                       11.1413   8.8330 13.4496     7.7729  6.4472  9.7902
Diff (1-2)  Pooled         -1.8328   -Infty  0.4590     6.6495  5.8079  7.7785
Diff (1-2)  Satterthwaite  -1.8328   -Infty  0.4714

Method         Variances      DF  t Value  Pr < t
Pooled         Equal          91    -1.33  0.0936
Satterthwaite  Unequal    79.446    -1.32  0.0947

Equality of Variances
Method    Num DF  Den DF  F Value  Pr > F
Folded F      45      46     2.13  0.0120

Is this test supporting the idea that females are more accurate parkers, on average, than males? Explain briefly.

No, it isn't. It's correctly one-sided (females are first, and we're trying to show that the mean distance is lower for females, hence the side=l), but the P-value is bigger than 0.09 (Satterthwaite is better, but it really doesn't make a difference), so a null hypothesis of equal mean distances would not be rejected.
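For comparison (my addition, not part of the question), here is a sketch of the same one-sided Welch (Satterthwaite) test in R, assuming parking.txt has the two columns gender and distance as the SAS code above suggests. Because Female comes before Male alphabetically, alternative="less" tests whether the Female mean distance is the smaller one; the P-value should come out close to the Satterthwaite 0.0947 above.

R> park.dist=read.table("parking.txt",header=F)
R> names(park.dist)=c("gender","distance")
R> t.test(distance~gender,data=park.dist,alternative="less")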

5. For the data in the previous question, your instructor decided to carry out a randomization test based on medians. The code is shown below (he had to switch to R):

R> park.dist=read.table("parking.txt",header=F)
R> names(park.dist)=c("gender","distance")
R> attach(park.dist)
R>
R> # section A of code
R>
R> obs.med=aggregate(distance~gender,data=park.dist,median)
R> obs.med

  gender distance
1 Female      8.5
2   Male     10.0

R> obs.diff=obs.med[2,2]-obs.med[1,2] R> obs.diff

[1] 1.5

R> # section B of code
R>
R> nsim=1000
R> diff=numeric(nsim)
R> n=length(gender)
R>
R> for (i in 1:nsim)
R> {
R>   shuf=sample(gender,n)
R>   sim.med=aggregate(distance~shuf,data=park.dist,median)
R>   sim.diff=sim.med[2,2]-sim.med[1,2]
R>   diff[i]=sim.diff
R> }

R> # section C of code
R>
R> hist(diff)
R> abline(v=obs.diff,col="red")

[Histogram of diff: roughly symmetric and centred at 0, running from about -4 to 4, with a vertical red line at the observed difference of 1.5.]

R> # section D of code
R>
R> table(diff>=obs.diff)

FALSE  TRUE
  849   151

(a) What does Section A of the code do?

This calculates the median parking distances that were actually observed for males and females (10 and 8.5 inches respectively), and notes that in the data the male median is 1.5 inches larger than the female median.

(b) What does section B of the code do? Pay particular attention to the four lines of code within the loop.

This is the actual randomization. The bit before the loop is initialization. Inside the loop, we shuffle (randomly permute) the gender labels, so we randomly re-associate males and females with the observed parking distances. This is done to get a sense of how the median parking distances, or particularly the differences between them, might vary if there is actually no difference between males and females. If this is the case, and this is the core of the randomization idea, we can shuffle the male and female labels as we wish. Having done the shuffling, we calculate the median male and female distances according to the shuffled labels, take the male-minus-female difference, and store that in our vector of randomization results.

(c) What does the output from section C of the code suggest?

The observed median difference (the red line) is larger than average compared to the randomization distribution (which has its centre, unsurprisingly, at zero). But the observed difference is by no means unusually large; there is still an appreciable fraction of the randomization distribution out beyond (as extreme as or more extreme than) the observed 1.5.

(d) What is the P-value from the randomization test? How does this compare to the P-value from the t-test? Which test would you prefer to believe and why?

The randomization P-value comes from Section D: I counted how many of the values from the randomization distribution are 1.5 or larger (151 of them) or less (849 of them). The randomization P-value is therefore 0.151.

This is a bit larger (and therefore a bit less significant) than the P-value from the t-test, which was 0.0947. But there was an outlier in the male parking distances, which should make us doubt whether comparing means was smart. I would prefer to trust the randomization test, so the P-value of 0.151 is the one I would believe. Either way, though, there isn't any evidence of a difference in parking accuracy between males and females.
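As an aside (my addition), the randomization P-value can also be pulled out directly as the proportion of shuffled differences at least as large as the observed one:

R> mean(diff>=obs.diff)   # proportion of the 1000 shuffled differences that are 1.5 or more; 0.151 here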

6. Write an R function that, when called with two vectors x and y (you may assume these to be of the same length) will calculate and return the slope of the regression line for predicting y from x. Note that coef(fit) takes a regression fit stored in fit and returns the intercept and slope(s).

This:

R> slope=function(x,y)
R> {
R>   y.lm=lm(y~x)
R>   cc=coef(y.lm)
R>   cc[2]
R> }

To test it:

R> x=1:4
R> y=c(4,4,5,8)
R> slope(x,y)
  x
1.3

and we can check this via (hope you can live with the nested brackets this time):

R> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4
 0.7 -0.6 -0.9  0.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.0000     1.3134   1.523    0.267
x             1.3000     0.4796   2.711    0.113

Residual standard error: 1.072 on 2 degrees of freedom
Multiple R-squared: 0.786, Adjusted R-squared: 0.6791
F-statistic: 7.348 on 1 and 2 DF, p-value: 0.1134

The slope is indeed 1.3.

7. The file calcio.txt contains some names and dates of birth of some people, like this:

Gianluigi Buffon      28jan1978
Rubinho               04aug1982
Leonardo Citti        14jul1995
Marco Storari         07jan1977
Andrea Barzagli       08may1981
Leonardo Bonucci      01may1987
Martin Caceres        07apr1987
Giorgio Chiellini     14aug1984
Paolo de Ceglie       17sep1986
Stephan Lichtsteiner  16jan1984
Marco Motta           14may1986
Angelo Ogbonna        23may1988
Federico Peluso       20jan1984
Kwadwo Asamoah        09dec1988
Ouasim Bouy           11jun1993
Mauricio Isla         12jun1988
Claudio Marchisio     19jan1986
Simone Padoin         18mar1984
Simone Pepe           30aug1983
Andrea Pirlo          19may1979
Paul Pogba            15mar1993
Arturo Vidal          22may1987
Sebastian Giovinco    26jan1987
Fernando Llorente     26feb1985
Fabio Quagliarella    31jan1983
Carlos Tevez          05feb1984
Mirko Vucinic         01oct1983

(a) Write code to read these names and birthdates into SAS, reading in the whole of the names and keeping the dates as dates. (The names are all 22 characters long, including the spaces at the end.)

This requires informats for the two variables, saying how they are to be read in. The names are 22 characters of text, so the informat for these is $22., and the format of the dates is known to SAS as date9.:

SAS> data players;
SAS> infile 'calcio.txt';
SAS> input name $22. birthdate date9.;

(b) How would you display these people, along with their birthdates, in the format 18/12/2013, that is, with date as 2 digits, month as 2 digits, and year as 4 digits? This needs to be 2 lines of code.

We need to provide a format for the birthdates. The SAS format matching the specifications is called ddmmyy10.. This goes as a second line on proc print.

SAS> proc print;
SAS> format birthdate ddmmyy10.;

Obs  name                  birthdate
  1  Gianluigi Buffon      28/01/1978
  2  Rubinho               04/08/1982
  3  Leonardo Citti        14/07/1995
  4  Marco Storari         07/01/1977
  5  Andrea Barzagli       08/05/1981
  6  Leonardo Bonucci      01/05/1987
  7  Martin Caceres        07/04/1987
  8  Giorgio Chiellini     14/08/1984
  9  Paolo de Ceglie       17/09/1986
 10  Stephan Lichtsteiner  16/01/1984
 11  Marco Motta           14/05/1986
 12  Angelo Ogbonna        23/05/1988
 13  Federico Peluso       20/01/1984
 14  Kwadwo Asamoah        09/12/1988
 15  Ouasim Bouy           11/06/1993
 16  Mauricio Isla         12/06/1988
 17  Claudio Marchisio     19/01/1986
 18  Simone Padoin         18/03/1984
 19  Simone Pepe           30/08/1983
 20  Andrea Pirlo          19/05/1979
 21  Paul Pogba            15/03/1993
 22  Arturo Vidal          22/05/1987
 23  Sebastian Giovinco    26/01/1987
 24  Fernando Llorente     26/02/1985
 25  Fabio Quagliarella    31/01/1983
 26  Carlos Tevez          05/02/1984
 27  Mirko Vucinic         01/10/1983

(c) Zero-point bonus: who are these people?

The filename calcio.txt gives a clue, since calcio is the Italian word for soccer. These are all players for the Italian soccer team Juventus.

8. You have an explanatory variable x and a response variable y. Give R code to do all of these:

(a) make a scatterplot, with a title Scatterplot of y and x in blue
(b) add the regression line, coloured red
(c) add a lowess curve, dashed, coloured purple

This should take you about four lines of R code altogether.

Let's make up some data to illustrate:

R> x=1:10
R> y=c(10,9,11,10,11,12,13,16,18,21)

We need to fit the regression line (you don't need to display it):

R> y.lm=lm(y~x)

(this is one line of code so far), and the plot, to the specifications given:

R> plot(y~x,main="scatterplot of y and x",col.main="blue")
R> abline(y.lm,col="red")
R> lines(lowess(y~x),col="purple",lty="dashed")

[Scatterplot titled "scatterplot of y and x": the points rise with a curved trend, with the red regression line and the dashed purple lowess curve visibly different from each other.]

(That's three more lines of code, for a total of 4.) You see that my data have a curved trend, so that the lowess curve and the line are visibly different. A quadratic relationship (including x squared) would fit better than a linear one.

9. As in the last question, suppose that we have an explanatory variable x and a response variable y. This time, we also have a categorical variable group with two different values. (You can assume that group is text.) Assume that the data have already been read in correctly into a data set xyg (so that you don't need to provide a data step).

(a) Give SAS code to draw a scatterplot of y against x, with the regression line drawn on the scatterplot as a solid line, and the points and line drawn in red. Use the default plotting symbol.

Again, we need some data. Let's re-use what we had last time, and add a grouping variable. My groups are called alpha and beta, for no particularly good reason.

SAS> data xyg;
SAS> input x y group $;
SAS> cards;
SAS> 1 10 alpha
SAS> 2 9 alpha
SAS> 3 11 alpha
SAS> 4 10 alpha
SAS> 5 11 beta
SAS> 6 12 alpha
SAS> 7 13 beta
SAS> 8 16 beta
SAS> 9 18 beta
SAS> 10 21 beta
SAS> ;
SAS>
SAS> proc print;

Obs   x   y  group
  1   1  10  alpha
  2   2   9  alpha
  3   3  11  alpha
  4   4  10  alpha
  5   5  11  beta
  6   6  12  alpha
  7   7  13  beta
  8   8  16  beta
  9   9  18  beta
 10  10  21  beta

Now to make the plot. This requires a symbol1 statement specifying what we want to plot, like this:

SAS> symbol1 c=red i=rl l=1 v=;

Specifying the plotting symbol as v= like this uses the default. Or, since the default is a +, you can say v=plus. Either is good. c= specifies the colour of the points, i= specifies what to join them with (rl is regression line), and l= is the type of line, 1 being solid and 2 and up being various flavours of dashed line. Using l= without specifying a number also works, since the solid line is the default. Did it work?

SAS> proc gplot;
SAS> plot y*x;

[Scatterplot of y against x from proc gplot, with the points drawn in red and the regression line through them.]

This appears to be a greyscale red, so yes it did. This particular data set seems to be crying out for a parabola, which you can ask for with i=rq.

(b) Now make a scatterplot of y against x, with different plotting symbols in different colours for each group. Assume that you have two groups. Add a regression line for each group, solid for the first group and dashed for the second.

This requires two symbol definitions. Following the specifications above leads us to this:

SAS> symbol1 c=black i=rl l=1 v=plus;
SAS> symbol2 c=red i=rl l=2 v=x;

You can use any v= symbols you like, as long as they're different, and you can use any two colours you like, ditto. l= must be 1 for the first group and something bigger than 1 for the second, and you must have i=rl in both cases. (Having i=rl on a symbol line specifies a regression line for that group. Of course, if you have only one group, that's the same as a regression line for all the data.) Now for the plot. Don't forget the =group thing to specify the groups!

SAS> proc gplot;
SAS> plot y*x=group;

[Scatterplot of y against x by group, with a legend showing the two groups (alpha and beta), each group having its own plotting symbol, colour and regression line.]

My group 1 was black (darker) with plusses, and my group 2 was red (lighter) with x's. The legend shows this.

(c) Create a scatterplot of y and x with no lines or different colours/symbols, but with each point labelled with the group that it belongs to, the label below and to the right of the point it belongs to (hint: 9 instead of 3). What SAS code would make this happen?

This means creating that crazy data set first, as follows:

SAS> data mytext;
SAS> retain xsys ysys '2' position '9';
SAS> set xyg;
SAS> x=x;
SAS> y=y;
SAS> text=group;

My hint 9 was meant to point you to the right value for position. My data set is called xyg, as I mentioned earlier. Now we plot this with an annotate. I also need to redefine symbol1 to not plot a line any more. I've set it back to the defaults:

SAS> symbol1 c= i= l= v=;
SAS>
SAS> proc gplot;

SAS> plot y*x / annotate=mytext;

[Scatterplot of y against x with each point labelled with its group (alpha or beta), the label placed below and to the right of the point.]

Some of my labels went off the edge of the graph, but it is not our business here to fiddle with that.

10. Below are three histograms. What would a normal quantile plot of each data set look like? Sketch the normal quantile plot in each case, using the vertical y axis as the data or sample quantile scale and the x axis as the expected or theoretical quantile scale. Your plots need to be only detailed enough to show the important features: that is, show a rough scatter of points as might come from qqnorm and a line that would come from qqline.

(a) R> hist(y1)

[Histogram of y1: right-skewed, with most values below 5 and a long tail stretching out to about 15.]

The actual normal quantile plot is shown below. The right skew shows up as a curve. The data are bunched up at the bottom and spread out at the top, so the curve is concave upwards (or opens upwards, if you like that terminology better).

R> qqnorm(y1)
R> qqline(y1)

[Normal quantile plot of y1: the points form an upward-opening curve rather than following the line, bending up sharply at the right.]

(b) R> hist(y2)

[Histogram of y2: a tall peak in the middle with a few extreme values in both tails, stretching from about -100 to 50.]

This has outliers at both ends, but the middle is apparently normal-looking. So the normal quantile plot should have the middle of the data more or less on the line, but the extreme values should be below the line on the left and above it on the right, since the data are too spread out at both ends to be normal:

R> qqnorm(y2)
R> qqline(y2)

[Normal quantile plot of y2: the middle of the points follows the line closely, but the extreme points fall below the line on the left and above it on the right.]

(c) R> hist(y3)

[Histogram of y3: roughly symmetric and bell-shaped, centred near 20, with values from about 5 to 35.]

This one is about as normal as you could wish for. The normal quantile plot should basically follow the line, with maybe a few minor wiggles.

R> qqnorm(y3)
R> qqline(y3)

[Normal quantile plot of y3: the points follow the line closely, with only minor wiggles.]
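If you want to practise reading these plots, you can simulate data with each of the three shapes and draw both plots yourself. This is my illustration only; the distributions chosen (exponential for right skew, t with 2 df for outliers in both tails, normal for the well-behaved case) and the names z1, z2, z3 are assumptions, not the distributions actually used to make y1, y2 and y3.

R> par(mfrow=c(3,2))                 # three rows: histogram beside normal quantile plot
R> z1=rexp(100)                      # right-skewed
R> hist(z1); qqnorm(z1); qqline(z1)
R> z2=rt(100,df=2)                   # long tails: outliers at both ends
R> hist(z2); qqnorm(z2); qqline(z2)
R> z3=rnorm(100)                     # close to normal
R> hist(z3); qqnorm(z3); qqline(z3)
R> par(mfrow=c(1,1))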

11. The data set 'crime' contains observations on crime for the 50 US states plus DC. The variables of interest to us are:

- crime: violent crimes per 100,000 people
- murder: murders per 1,000,000 people
- pctmetro: percent of population living in metropolitan areas (cities)
- pctwhite: percent of population that is white
- pcths: percent of population with high school education or above
- poverty: percent of population living under the poverty line
- single: percent of population that are single parents.

Our aim is to predict the violent crime rate from the other variables.

(a) What is special about the data set 'crime' that would distinguish it from a data set called crime without the quotes? What is the advantage to using such a special data set?

The data set 'crime' is a permanent data set. This means that, having created it, I never need another data step in order to use it: I can just put data='crime' on the end of any proc line and it will be available to use. A regular data set, such as the imaginary one called crime without the quotes, has to be created in every run of SAS using a data step. Once I close SAS, it's gone and has to be re-created.

(b) Here is a multiple regression that uses all the other variables:

SAS> proc reg data='crime';
SAS> model crime=murder--single;

The REG Procedure
Model: MODEL1
Dependent Variable: crime

Number of Observations Read 51
Number of Observations Used 51

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square  F Value  Pr > F
Model              6    8707075    1451179    62.51  <.0001
Error             44    1021400      23214
Corrected Total   50    9728475

Root MSE        152.36021   R-Square  0.8950
Dependent Mean  612.84314   Adj R-Sq  0.8807
Coeff Var        24.86121

Parameter Estimates
                   Parameter    Standard
Variable    DF      Estimate       Error  t Value  Pr > |t|
Intercept    1   -1143.79091   584.99927    -1.96    0.0569
murder       1      19.33154     4.44424     4.35    <.0001
pctmetro     1       6.62183     1.11851     5.92    <.0001
pctwhite     1      -0.69833     2.50485    -0.28    0.7817
pcths        1       4.79128     6.67813     0.72    0.4769
poverty      1      15.00684     9.72218     1.54    0.1299
single       1      54.85188    21.30379     2.57    0.0135

What does the P-value under Pr > F tell you?

This P-value, which is less than 0.0001, tells me that the regression as a whole is significant: one or more of the six explanatory variables helps to predict the violent crime rate.

(c) What does the P-value of 0.1299 in the Pr > |t| column tell you? (I'm looking for a careful answer.) Does this mean that the correlation between crime and poverty must be low?

This is testing whether poverty adds anything to the predictive power of the regression, over and above the other variables. Since this P-value is greater than 0.05, we learn that poverty can be removed from this regression. This doesn't say anything about whether poverty is absolutely important in predicting violent crime rate, or whether it's useful by itself. It just says that in the context of this model, it has nothing to add and can be removed.

This last point says that the correlation between crime and poverty does not have to be low, and indeed it is pretty high:

SAS> proc corr data='crime';
SAS> var crime--single;

The CORR Procedure
7 Variables: crime murder pctmetro pctwhite pcths poverty single

Simple Statistics
Variable    N       Mean    Std Dev        Sum   Minimum    Maximum
crime      51  612.84314  441.10032      31255  82.00000       2922
murder     51    8.72745   10.71758  445.10000   1.60000   78.50000
pctmetro   51   67.39020   21.95713       3437  24.00000  100.00000
pctwhite   51   84.11569   13.25839       4290  31.80000   98.50000
pcths      51   76.22353    5.59209       3887  64.30000   86.60000
poverty    51   14.25882    4.58424  727.20000   8.00000   26.40000
single     51   11.32549    2.12149  577.60000   8.40000   22.10000

Pearson Correlation Coefficients, N = 51
Prob > |r| under H0: Rho=0

              crime    murder  pctmetro  pctwhite     pcths   poverty    single
crime       1.00000   0.88620   0.54404  -0.67718  -0.25605   0.50951   0.83887
                       <.0001    <.0001    <.0001    0.0697    0.0001    <.0001
murder      0.88620   1.00000   0.31611  -0.70619  -0.28607   0.56587   0.85891
             <.0001              0.0238    <.0001    0.0418    <.0001    <.0001
pctmetro    0.54404   0.31611   1.00000  -0.33722  -0.00398  -0.06054   0.25981
             <.0001    0.0238              0.0155    0.9779    0.6730    0.0656
pctwhite   -0.67718  -0.70619  -0.33722   1.00000   0.33855  -0.38929  -0.65644
             <.0001    <.0001    0.0155              0.0151    0.0048    <.0001
pcths      -0.25605  -0.28607  -0.00398   0.33855   1.00000  -0.74394  -0.21978
             0.0697    0.0418    0.9779    0.0151              <.0001    0.1212
poverty     0.50951   0.56587  -0.06054  -0.38929  -0.74394   1.00000   0.54859
             0.0001    <.0001    0.6730    0.0048    <.0001              <.0001
single      0.83887   0.85891   0.25981  -0.65644  -0.21978   0.54859   1.00000
             <.0001    <.0001    0.0656    <.0001    0.1212    <.0001

The correlation between crime and poverty is 0.51, not small at all. But crime has a higher correlation with other variables (e.g. murder) which also have a high correlation with poverty.

(d) I fitted a second regression as shown:

SAS> proc reg data='crime';
SAS> model crime=murder pctmetro single;

The REG Procedure
Model: MODEL1
Dependent Variable: crime

Number of Observations Read 51
Number of Observations Used 51

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square  F Value  Pr > F
Model              3    8637307    2879102   124.01  <.0001
Error             47    1091168      23216
Corrected Total   50    9728475

Root MSE        152.36908   R-Square  0.8878
Dependent Mean  612.84314   Adj R-Sq  0.8807
Coeff Var        24.86266

Parameter Estimates
                 Parameter    Standard
Variable   DF     Estimate       Error  t Value  Pr > |t|
Intercept   1   -707.56131   208.72715    -3.39    0.0014
murder      1     21.66295     3.99715     5.42    <.0001
pctmetro    1      5.97097     1.03472     5.77    <.0001
single      1     64.36428    19.83899     3.24    0.0022

What is the slope coefficient for single? What does it mean? (Careful: there is a key phrase that I'm looking for.)

It is 64.36. This is saying that, on average, if the percent of single parents goes up by 1 (percentage point), the violent crime rate will go up by 64.36, other things being equal. This last phrase is the important one.

(e) What will happen if I take single out of the regression now?

The regression will fit noticeably worse, because the P-value for single, 0.0022, is small, and therefore single should not be taken out. You can express this in terms of R-squared, but you have to say that R-squared will go down by a lot, not just that it will go down. (R-squared always goes down when you take a variable out, even a worthless one.) Another way of saying this is that adjusted R-squared will go down, which will happen only if you take out a variable that you shouldn't have.

(f) I did the following:

SAS> data preds;
SAS> input sid state $ crime murder pctmetro pctwhite pcths poverty single;
SAS> cards;
SAS> . . . 10 70 70 70 20 15
SAS> ;
SAS>
SAS> data newdata;
SAS> set preds 'crime';
SAS>
SAS> proc print data=newdata (obs=10);

Obs  sid  state  crime  murder  pctmetro  pctwhite  pcths  poverty  single
  1    .             .    10.0      70.0      70.0   70.0     20.0    15.0
  2    1  ak       761     9.0      41.8      75.2   86.6      9.1    14.3
  3    2  al       780    11.6      67.4      73.5   66.9     17.4    11.5
  4    3  ar       593    10.2      44.7      82.9   66.3     20.0    10.7
  5    4  az       715     8.6      84.7      88.6   78.7     15.4    12.1
  6    5  ca      1078    13.1      96.7      79.3   76.2     18.2    12.5
  7    6  co       567     5.8      81.8      92.5   84.4      9.9    12.1
  8    7  ct       456     6.3      95.7      89.0   79.2      8.5    10.1
  9    8  de       686     5.0      82.7      79.4   77.5     10.2    11.4
 10    9  fl      1206     8.9      93.0      83.5   74.4     17.8    10.6

SAS> proc reg;
SAS> model crime=murder pctmetro single;
SAS> output out=crimout p=pred;

The REG Procedure
Model: MODEL1
Dependent Variable: crime

Number of Observations Read 52
Number of Observations Used 51
Number of Observations with Missing Values 1

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square  F Value  Pr > F
Model              3    8637307    2879102   124.01  <.0001
Error             47    1091168      23216
Corrected Total   50    9728475

Root MSE        152.36908   R-Square  0.8878
Dependent Mean  612.84314   Adj R-Sq  0.8807
Coeff Var        24.86266

Parameter Estimates
                 Parameter    Standard
Variable   DF     Estimate       Error  t Value  Pr > |t|
Intercept   1   -707.56131   208.72715    -3.39    0.0014
murder      1     21.66295     3.99715     5.42    <.0001
pctmetro    1      5.97097     1.03472     5.77    <.0001
single      1     64.36428    19.83899     3.24    0.0022

SAS> proc print data=crimout (obs=1);
SAS> var crime--pred;

Obs  crime  murder  pctmetro  pctwhite  pcths  poverty  single     pred
  1      .      10        70        70     70       20      15  892.501

Describe what I did in words. What does the last line tell me?

Let's take this in steps:

- I created a new data set with all the same variables as 'crime', but only one row of values. Out of the variables we've been using, crime (the response) is missing, but we have values for all the explanatory variables. (I chose these values to be pointing towards an increased violent-crime rate.)
- Next I made another new data set by gluing together the one we just created, with the missing crime figure, and the original 'crime' data.
- Then I print out the data set I just created, or at least the first 10 rows of it. You see the invented data plus the first 9 rows of the original data set.
- Next I fit the same regression as in (d), but this time I create an output data set and put the predicted values in it. (You'll notice one Observation with Missing Values, which is the one I added.) Since the response variable for the extra observation was missing, the rest of the output is exactly as in (d).
- Last, I print out the first line of the output data set. This is the extra data that I added, but I have gained a prediction for it in pred.

The value at the end of the last line, 892.5, is the predicted violent crime rate for a state with the given values for the explanatory variables. This is indeed a higher-than-average violent crime rate. The key that I was doing a prediction comes from the fact that I re-did the regression, creating an output data set, and what went into that output data set was p=, meaning I was saving predicted values. You can trace it back to see what values I was predicting for.
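As a quick sanity check (my addition), the 892.5 can be reproduced by plugging the new values into the fitted equation from (d):

R> -707.56131 + 21.66295*10 + 5.97097*70 + 64.36428*15

which works out to about 892.5, matching pred in the last line of the output.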

12. The Law of Large Numbers says that if you take a large sample from any population [1], the sample mean will almost certainly be close to the population mean. The "almost certainly" works better, the larger the sample gets.

[1] Subject to the same stuff as for the Central Limit Theorem, namely a well-defined population mean and SD.

By way of example, let's think of a normal distribution with mean 10 and SD 3. Here's a typical random sample of 15 values from this distribution:

R> rnorm(15,10,3)


 [1] 14.865602  7.760958  9.193208  7.901395 10.639714 12.126906  6.765013
 [8] 12.373931 10.012141 13.287639  5.033575  6.379377 13.806247 12.515180
[15]  7.761681

Let's define close to be within 1, that is, between 9 and 11. Let's investigate this by simulation with R. Suppose that we take a sample size of 15. Your simulation needs to take 1000 random samples from this normal distribution, compute the sample mean for each one, save it somewhere, and then, when you have all 1000 sample means, count up how many of them are between 9 and 11.

(a) What code would you use to run the above simulation? You'll probably need a loop, with some suitable initialization before.

Something like this. I saved my sample size in a variable, for ease of changing later, but you don't have to.

R> nsim=1000
R> sample.size=15
R> sample.mean=numeric(nsim)
R> for (i in 1:nsim)
R> {
R>   my.sample=rnorm(sample.size,10,3)
R>   sample.mean[i]=mean(my.sample)
R> }
R> table(sample.mean<9)

FALSE  TRUE
  901    99

R> table(sample.mean>11)

FALSE  TRUE
  900   100

We haven't done a splitting-into-3 before, so the easiest way for you is probably using table twice like this: count the values less than 9 and the values greater than 11, add them up, and subtract from 1000. That gives 1000 - 99 - 100 = 801 values between 9 and 11.

Or you can use cut. This turns the sample means into categories: 0 to 9, 9 to 11, and 11 to 20. The table then counts those:

R> table(cut(sample.mean,c(0,9,11,20)))

  (0,9]  (9,11] (11,20]
     99     801     100

Likewise, 801 values in between. (This is strictly greater than 9 and less-than-or-equal-to 11, but there aren't any means exactly equal to 9 or 11.)

(b) I ran my code for samples of size 15, with the results shown below. I hid my code. The output shows the number of simulated sample means between 0 and 9, between 9 and 11, and between 11 and 20.

  (0,9]  (9,11] (11,20]
     89     811     100

Here is the same thing for samples of size 30:

  (0,9]  (9,11] (11,20]
     33     933      34

and for samples of size 50:

  (0,9]  (9,11] (11,20]
     11     971      18

Does this output suggest that the Law of Large Numbers works here? Explain briefly.

811 of the sample means from samples of size 15 are close to 10, that is, between 9 and 11. For samples of size 30, 933 of them are close, and for samples of size 50, 971 of them are close, in each case out of 1000. As the sample size gets bigger, the simulations show that the sample mean is more likely to be close to the population mean of 10. This is exactly what the Law of Large Numbers says.

In case you're wondering about the difference between 801 and 811: I did a second simulation for n = 15. The first one was in my answer to (a), and the second one was for the question in (b). It was easier to just do it twice. [2]

[2] I suppose I could have used set.seed to get the exact same random numbers.

Because we're using a normal population, it's also possible to do these calculations exactly rather than by simulation, because we know that the sample mean has a normal distribution (exactly) with mean 10 and SD 3/sqrt(n). (If the population were not normal, we couldn't do this, or at least not nearly so easily.) Thus, for a sample of size 15, this proportion of the sample means should be between 9 and 11:

R> pnorm(11,10,3/sqrt(15))-pnorm(9,10,3/sqrt(15))
[1] 0.8032944

For samples of size 30, we get this:

R> pnorm(11,10,3/sqrt(30))-pnorm(9,10,3/sqrt(30))
[1] 0.9321108

and for samples of size 50:

R> pnorm(11,10,3/sqrt(50))-pnorm(9,10,3/sqrt(50))
[1] 0.9815779

Our simulations deviate from these a bit, which is because we took only 1000 random samples each time; 10,000 would have been better. But the overall picture we got agrees with the theory.

13. A study was made of sea ice across Hudson Bay. At each of 36 numbered locations for each of the years 1971 to 2011, the date was recorded on which, for the first time in the year, the ice coverage was less than 50% at that location. This was called the breakup date, and the date was recorded as the number of days after January 1. Here is how the data were read into SAS, along with some of the data:

SAS> data breakup;
SAS> infile 'breakup.csv' dlm=',' firstobs=2;
SAS> input year loc1-loc36;
SAS>
SAS> proc print data=breakup (obs=10);
SAS> var year loc1-loc6;

Obs  year  loc1  loc2  loc3  loc4  loc5  loc6
  1  1971   187   187   201   215   201   201
  2  1972   182   189   205   210   217   217
  3  1973   205   191   191   212   198   205
  4  1974   197   197   190   197   204   218
  5  1975   189   182   168   210   217   210
  6  1976   194   194   187   222   208   194
  7  1977   165   179   186   200   207   200
  8  1978   163   184   177   191   198   184
  9  1979   176   162   204   204   190   204
 10  1980   160   160   167   209   209   209

I also read the data into R, as follows:


R> breakup=read.csv("breakup.csv",header=T)
R> breakup[1:10,1:7]

      X  X1  X2  X3  X4  X5  X6
1  1971 187 187 201 215 201 201
2  1972 182 189 205 210 217 217
3  1973 205 191 191 212 198 205
4  1974 197 197 190 197 204 218
5  1975 189 182 168 210 217 210
6  1976 194 194 187 222 208 194
7  1977 165 179 186 200 207 200
8  1978 163 184 177 191 198 184
9  1979 176 162 204 204 190 204
10 1980 160 160 167 209 209 209

Note that the year is denoted by X and the breakup date at each location is denoted by X followed by the location number.

(a) In SAS, what code would you use to plot the breakup dates for locations 1, 4 and 6 against year, all on the same axes? Each location should be represented by a different colour and symbol, and the points should be joined by lines of different line types (that is, a different line type for each location). Add a legend at the top right, so that the reader can distinguish the locations.

Defining symbols is the way to go, one for each location, and so three in all, for example:

SAS> symbol1 c=black v=star i=join l=1;
SAS> symbol2 c=red v=plus i=join l=2;
SAS> symbol3 c=blue v=x i=join l=3;

and a legend:

SAS> legend1 label=none position=(top right inside) mode=share;

Your choice of colours, plotting symbols and line types. As long as SAS knows about them, you're good. Then the plot. This is multiple plots with an overlay, like the orange trees:

SAS> proc gplot;
SAS> plot loc1*year loc4*year loc6*year / overlay legend=legend1;

[Plot of loc1, loc4 and loc6 against year, the three locations overlaid with different symbols and line types and a legend at the top right; the breakup dates run from about 150 to 240 days.]

This is rather cluttered, but that wasn't the point here! (My colours, as usual, came out greyscale; seeing different colours would have made things easier.)

(b) Using R, plot the breakup dates for locations 1, 4 and 6 against year on the same plot. Use different colours and plotting symbols, and join the breakup dates for each location with lines of different types. Add a legend at the top right, showing the colours, line types and plotting symbols. Make the y axis go from 150 to 240. (This is essentially the same graph as in the previous part, only this time using R.) What code would you use?

Again, your choice of colours, line types, plotting symbols. Anything that R understands is good. It seems to be easiest to refer to plotting characters by number. The strategy for multiple series in R is to start by plotting an empty graph, and then to add the series to it one by one using points or lines (either will work here, since you are plotting both points and lines). Whichever you use, you need type="b" to get both points and lines.

R> attach(breakup)
R> plot(X1~X,type="n",ylim=c(150,240))
R> lines(X1~X,lty="solid",col="black",pch=2,type="b")
R> lines(X4~X,lty="dashed",col="red",pch=3,type="b")
R> lines(X6~X,lty="dotted",col="blue",pch=4,type="b")

R> locations=c("X1","X4","X6")
R> colours=c("black","red","blue")
R> linetypes=c("solid","dashed","dotted")
R> legend("topright",legend=locations,col=colours,pch=2:4,lty=linetypes)

[The same plot drawn in R: the three series X1, X4 and X6 plotted against X (year), as points joined by lines, with the y axis running from 150 to 240 and a legend at the top right.]

I made vectors out of my locations, colours and line types, to make my legend command less cumbersome. Come to think of it, I could have defined these first and then extracted the appropriate things from them when I did my lines. (In fact, in that case, I could even have used a loop.) Don't know why my triangles have lines across them.

14. Suppose we have a data file like this, called names.txt. We are going to read the data into SAS. I'm happy with SAS's default of reading only the first 8 characters of the names.

id name group y
1 Abimae a 116
2 Abhishek b 192
3 Andrew c 170
4 Beili d 185
5 Chao a 123
6 Daniel b 199
7 Gary c 109
8 Gemini d 120
9 Guanyu a 177
10 Guilherme b 117
11 HongYi c 134
12 Jagjot d 194
13 Jason a 194
14 Jingzhu b 110
15 John c 101
16 Jonathan d 146
17 Man a 115
18 Manci b 190
19 Mary c 166
20 Mung d 200
21 Neel a 110
22 Pingchuan b 186
23 Rochelle c 142
24 Sankeetha d 189
25 Shao-Yun a 135
26 Shibvonne b 117
27 Tian c 119
28 Tsz d 190
29 Tuoyu a 175
30 Vinit b 186
31 Xiao c 183
32 Yang d 191
33 Yang a 179
34 Ye b 153
35 Yi c 176
36 Yiteng d 101
37 Yuan a 152
38 Zi b 185
39 Zongsheng c 121

What data steps would you use to do the tasks below?

(a) Get rid of the variable id but read in all the other variables.

I'm not bothered about the precise details on infile. As long as there's a drop somewhere, that's good. I know you can do the rest of it.

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> drop id;
SAS>
SAS> proc print;

Obs name group y
1 Abimae a 116
2 Abhishek b 192
3 Andrew c 170
4 Beili d 185
5 Chao a 123
6 Daniel b 199
7 Gary c 109
8 Gemini d 120
9 Guanyu a 177
10 Guilherm b 117
11 HongYi c 134
12 Jagjot d 194
13 Jason a 194
14 Jingzhu b 110
15 John c 101
16 Jonathan d 146
17 Kelly d 150
18 Man a 115
19 Manci b 190
20 Mary c 166
21 Mung d 200
22 Neel a 110
23 Pingchua b 186
24 Rochelle c 142
25 Sankeeth d 189
26 Shao-Yun a 135
27 Shibvonn b 117
28 Tian c 119
29 Tsz d 190
30 Tuoyu a 175
31 Vinit b 186
32 Xiao c 183
33 Yang d 191
34 Yang a 179
35 Ye b 153
36 Yi c 176
37 Yiteng d 101
38 Yuan a 152
39 Zi b 185
40 Zongshen c 121

(b) Put only the variables name and y in your SAS data set. (Just show me the changes from the previous, here and below.)

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> keep name y;
SAS>
SAS> proc print;

Obs name y
1 Abimae 116
2 Abhishek 192
3 Andrew 170
4 Beili 185
5 Chao 123
6 Daniel 199
7 Gary 109
8 Gemini 120
9 Guanyu 177
10 Guilherm 117
11 HongYi 134
12 Jagjot 194
13 Jason 194
14 Jingzhu 110
15 John 101
16 Jonathan 146
17 Kelly 150
18 Man 115
19 Manci 190
20 Mary 166
21 Mung 200
22 Neel 110
23 Pingchua 186
24 Rochelle 142
25 Sankeeth 189
26 Shao-Yun 135
27 Shibvonn 117
28 Tian 119
29 Tsz 190
30 Tuoyu 175
31 Vinit 186
32 Xiao 183
33 Yang 191
34 Yang 179
35 Ye 153
36 Yi 176
37 Yiteng 101
38 Yuan 152
39 Zi 185
40 Zongshen 121

Change the drop to a keep. You can do either of these the other way around, as long as you make sure to keep what you want to keep and drop what you want to drop. (My way is the shortest, though.)

(c) Keep only the people in group b.

The keep in the question was deliberately to mislead you! keep is for variables; this is about observations, for which you need if:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> if group='b';
SAS>
SAS> proc print;

Obs id name group y
1 2 Abhishek b 192
2 6 Daniel b 199
3 10 Guilherm b 117
4 14 Jingzhu b 110
5 18 Manci b 190
6 22 Pingchua b 186
7 26 Shibvonn b 117
8 30 Vinit b 186
9 34 Ye b 153
10 38 Zi b 185

(d) Omit the people whose value of y is less than 120.

A delete on the end of the if this time:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> if y<120 then delete;
SAS>
SAS> proc print;

Obs id name group y
1 2 Abhishek b 192
2 3 Andrew c 170
3 4 Beili d 185
4 5 Chao a 123
5 6 Daniel b 199
6 8 Gemini d 120
7 9 Guanyu a 177
8 11 HongYi c 134
9 12 Jagjot d 194
10 13 Jason a 194
11 16 Jonathan d 146
12 40 Kelly d 150
13 18 Manci b 190
14 19 Mary c 166
15 20 Mung d 200
16 22 Pingchua b 186
17 23 Rochelle c 142
18 24 Sankeeth d 189
19 25 Shao-Yun a 135
20 28 Tsz d 190
21 29 Tuoyu a 175
22 30 Vinit b 186
23 31 Xiao c 183
24 32 Yang d 191
25 33 Yang a 179
26 34 Ye b 153
27 35 Yi c 176
28 37 Yuan a 152
29 38 Zi b 185
30 39 Zongshen c 121

(e) Keep only the people in group b who have y values greater than 170.

This one has a hidden "and" in it: the people we want have to be in group b and have a y value greater than 170. Thus:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> if group='b' & y>170;
SAS>
SAS> proc print;

Obs id name group y
1 2 Abhishek b 192
2 6 Daniel b 199
3 18 Manci b 190
4 22 Pingchua b 186
5 30 Vinit b 186
6 38 Zi b 185

(f) Put only the names and groups of the people with id less than or equal to 20 into the data set.

A combo this time: we are selecting variables (via keep) and observations:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> keep name group;
SAS> if id<=20;
SAS>
SAS> proc print;

Obs name group
1 Abimae a
2 Abhishek b
3 Andrew c
4 Beili d
5 Chao a
6 Daniel b
7 Gary c
8 Gemini d
9 Guanyu a
10 Guilherm b
11 HongYi c
12 Jagjot d
13 Jason a
14 Jingzhu b
15 John c
16 Jonathan d
17 Man a
18 Manci b
19 Mary c
20 Mung d
