UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Final Exam, December 18, 2013

STAC32 Applications of Statistical Methods Duration: 3 hours Last name First name Student number

Aids allowed:
- My lecture overheads
- My long-form lecture notes
- Any notes that you have taken in this course
- Your marked assignments
- My R book
- The course SAS text
- Non-programmable, non-communicating calculator

This exam has 24 numbered pages, including this page. Please check to see that you have all the pages. Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker's attention to where the rest of the answer may be found.

The maximum marks available for each question are shown below.

Question  Max Mark    Question  Max Mark
   1          5           8         6
   2          5           9         8
   3         10          10         6
   4          6          11        12
   5          8          12         7
   6          5          13         8
   7          5          14         9
                        Total    100

1. A small class has two even smaller lecture sections, labelled a and b. The marks on the final exam are shown below, and are saved in the file marks.txt. Write SAS code to do the following: read the data from the file, display the whole data set, calculate the mean and SD of final exam marks for each section, and make side-by-side boxplots of the marks arranged by section.

student section mark 1 a 78 2 a 75 3 b 66 4 b 81 5 a 74 6 b 91 7 a 83 8 b 59

2. In R:

(a) Discuss the difference between read.table and read.csv. You might find it helpful to make up examples of data where you would use each.

(b) Suppose you had never heard of read.csv. How could you use read.table to read in a .csv file?

3. The data below shows the nine sun-orbiting objects that were formerly known as planets (since 2006, Pluto is no longer considered a planet). Shown are the mean distance from the sun (in million miles) and the length of the planet's year (in multiples of the earth's year):

R> planets=read.table("planets.txt",header=T) R> planets

Planet SunDist Year 1 Mercury 36 0.24 2 Venus 67 0.61 3 Earth 93 1.00 4 Mars 142 1.88 5 Jupiter 484 11.86 6 Saturn 887 29.46 7 Uranus 1784 84.07 8 Neptune 2796 164.82 9 Pluto 3707 247.68

Our aim is to predict the length of a planet's year from its distance from the sun. Analysis 1 is shown below.

R> attach(planets) R> plot(Year~SunDist)

[Scatterplot of Year (vertical axis, 0 to 250) against SunDist (horizontal axis, 0 to about 3700): the points curve upward, rising faster at larger distances.]

R> year1.lm=lm(Year~SunDist) R> summary(year1.lm)

Call: lm(formula = Year ~ SunDist)

Residuals:
     Min       1Q   Median       3Q      Max
 -20.082   -7.396    4.959    8.586   17.947

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.351926   6.076862  -2.033   0.0816 .
SunDist       0.065305   0.003589  18.194 3.75e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.76 on 7 degrees of freedom
Multiple R-squared: 0.9793, Adjusted R-squared: 0.9763
F-statistic: 331 on 1 and 7 DF, p-value: 3.749e-07

R> r=resid(year1.lm) R> f=fitted(year1.lm) R> plot(r~f)

[Residual plot for Analysis 1: r (vertical axis, about -20 to 18) against fitted values f (0 to about 200); the residuals show a clearly curved pattern.]

(a) Do you see any problems with Analysis 1? Explain briefly.

(b) Astronomy suggests that there should be a power law at work here. That is, if y is year length and x is distance from the sun, the relationship should be of the form y = ax^b, where a and b are constants. One way of fitting this would be via a Box-Cox transformation. Another way is to take logarithms of both sides, resulting in log y = log a + b log x. This is done in Analysis 2 below:

R> log.year=log(Year)
R> log.sundist=log(SunDist)
R> year2.lm=lm(log.year~log.sundist)
R> summary(year2.lm)

Call: lm(formula = log.year ~ log.sundist)

Residuals:
      Min        1Q    Median        3Q       Max
-0.011321 -0.001499  0.001741  0.003661  0.004669

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.797111   0.006751   -1007   <2e-16 ***
log.sundist  1.499222   0.001088    1378   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.005312 on 7 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.899e+06 on 1 and 7 DF, p-value: < 2.2e-16

R> pred1=predict(year2.lm)
R> pred2=exp(pred1)
R> pct.error=(Year-pred2)/Year*100
R> cbind(planets,pred=pred2,pct.error)

   Planet SunDist   Year        pred   pct.error
1 Mercury      36   0.24   0.2405993 -0.24968900
2   Venus      67   0.61   0.6105802 -0.09511447
3   Earth      93   1.00   0.9982609  0.17390744
4    Mars     142   1.88   1.8828212 -0.15006310
5 Jupiter     484  11.86  11.8366830  0.19660221
6  Saturn     887  29.46  29.3523314  0.36547401
7  Uranus    1784  84.07  83.6783677  0.46584076
8 Neptune    2796 164.82 164.1250081  0.42166724
9   Pluto    3707 247.68 250.4998880 -1.13852068

R> r=resid(year2.lm)
R> f=fitted(year2.lm)
R> plot(r~f)

[Residual plot for Analysis 2: r (vertical axis, -0.010 to 0.005) against fitted values f (-1 to 5), with one isolated point at the bottom right.]

Based on this output, what are the values of a and b in the power law? (You may need a calculator for this.)
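For orientation, the arithmetic here can be sketched as follows (an editorial illustration in Python rather than the course's R; any calculator does the same thing). The only facts used are the intercept and slope printed in the Analysis 2 output above, and the identity that exponentiating log y = log a + b log x gives back y = ax^b:

```python
import math

# Coefficients printed in the Analysis 2 summary above
intercept = -6.797111   # estimate of log a
slope = 1.499222        # estimate of b

a = math.exp(intercept)  # undo the log to recover a
b = slope
print(a, b)  # a is roughly 0.0011, b is roughly 1.5
```

Note that b comes out very close to 3/2, which is Kepler's third law.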

(c) Look at the predictions from Analysis 2. Why did I use exp in my code?

(d) In the predictions from Analysis 2, do you think the predictions of year length are good or bad overall? Explain briefly. Are there any planets for which the prediction accuracy is unusually poor?

(e) Which planet is at the bottom right of the residual plot? How do you know?

(f) What next action would you recommend, before fitting another model? Explain briefly. (You might wish to re-read my preamble to this question.)

4. The Toronto Star conducted a study in 2009 of men's and women's ability to pull into a downtown parking space. The focus of this study was on accuracy, not speed, so the response variable is the distance from the curb, in inches. Some analysis is shown below.

SAS> data parking; SAS> infile 'parking.txt' expandtabs; SAS> input gender $ distance; SAS> SAS> proc means; SAS> class gender; SAS> var distance;

The MEANS Procedure
Analysis Variable : distance

gender  N Obs   N        Mean     Std Dev    Minimum     Maximum
Female     47  47   9.3085106   5.3258529  2.0000000  25.0000000
Male       46  46  11.1413043   7.7729324  0.5000000  48.0000000

SAS> proc boxplot; SAS> plot distance*gender / boxstyle=schematic;

[Side-by-side boxplots of distance by gender (Male and Female); the distance axis runs from 0 to 50.]

(a) The newspaper reported that females are more accurate parkers than males, by an average of almost two inches. Do you think we should trust this conclusion? Explain briefly why or why not.

(b) A two-sample t-test was also run, as shown:

SAS> proc ttest side=l;
SAS> var distance;
SAS> class gender;

The TTEST Procedure
Variable: distance

gender      N     Mean  Std Dev  Std Err  Minimum  Maximum
Female     47   9.3085   5.3259   0.7769   2.0000  25.0000
Male       46  11.1413   7.7729   1.1461   0.5000  48.0000
Diff (1-2)     -1.8328   6.6495   1.3791

gender     Method            Mean  95% CL Mean      Std Dev  95% CL Std Dev
Female                     9.3085  7.7448 10.8722    5.3259  4.4257 6.6892
Male                      11.1413  8.8330 13.4496    7.7729  6.4472 9.7902
Diff (1-2) Pooled         -1.8328  -Infty  0.4590    6.6495  5.8079 7.7785
Diff (1-2) Satterthwaite  -1.8328  -Infty  0.4714

Method         Variances      DF  t Value  Pr < t
Pooled         Equal          91    -1.33  0.0936
Satterthwaite  Unequal    79.446    -1.32  0.0947

Equality of Variances
Method    Num DF  Den DF  F Value  Pr > F
Folded F      45      46     2.13  0.0120

Does this test support the idea that females are more accurate parkers, on average, than males? Explain briefly.
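As a numerical cross-check of the Satterthwaite line, the unpooled standard error and t statistic can be recomputed from the group summaries in the TTEST output (sketched in Python purely as an editorial illustration; the exam itself uses SAS):

```python
import math

# Group summaries from the TTEST output above
mean_f, se_f = 9.3085, 0.7769   # Female: mean and standard error
mean_m, se_m = 11.1413, 1.1461  # Male: mean and standard error

diff = mean_f - mean_m                  # observed difference, -1.8328
se_diff = math.sqrt(se_f**2 + se_m**2)  # unpooled (Welch/Satterthwaite) SE
t_stat = diff / se_diff
print(round(t_stat, 2))  # -1.32, matching the Satterthwaite row
```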

5. For the data in the previous question, your instructor decided to carry out a randomization test based on medians. The code is shown below (he had to switch to R):

R> park.dist=read.table("parking.txt",header=F) R> names(park.dist)=c("gender","distance") R> attach(park.dist) R> R> # section A of code R> R> obs.med=aggregate(distance~gender,data=park.dist,median) R> obs.med

gender distance 1 Female 8.5 2 Male 10.0

R> obs.diff=obs.med[2,2]-obs.med[1,2] R> obs.diff

[1] 1.5

R> # section B of code R> R> nsim=1000 R> diff=numeric(nsim) R> n=length(gender) R> R> for (i in 1:nsim) R> { R> shuf=sample(gender,n) R> sim.med=aggregate(distance~shuf,data=park.dist,median) R> sim.diff=sim.med[2,2]-sim.med[1,2] R> diff[i]=sim.diff R> }

R> # section C of code R> R> hist(diff) R> abline(v=obs.diff,col="red")

[Histogram of diff: values from about -4 to 4, peaked near 0, frequency axis up to 350; a red vertical line marks the observed difference of 1.5.]

R> # section D of code R> R> table(diff>=obs.diff)

FALSE TRUE 849 151

(a) What does Section A of the code do?

(b) What does section B of the code do? Pay particular attention to the four lines of code within the loop.

(c) What does the output from section C of the code suggest?

(d) What is the P-value from the randomization test? How does this compare to the P-value from the t-test? Which test would you prefer to believe and why?
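For reference, the tail proportion implied by the table(diff>=obs.diff) output above is simple arithmetic (a Python sketch for illustration only; the course code is in R):

```python
# Counts from the table(diff >= obs.diff) output above
false_count = 849  # simulated differences smaller than the observed 1.5
true_count = 151   # simulated differences at least as large as 1.5

# One-sided randomization P-value: proportion of simulated differences
# at least as extreme as the one observed
p_value = true_count / (true_count + false_count)
print(p_value)  # 0.151
```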

6. Write an R function that, when called with two vectors x and y (you may assume these to be of the same length) will calculate and return the slope of the regression line for predicting y from x. Note that coef(fit) takes a regression fit stored in fit and returns the intercept and slope(s).
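For orientation, the quantity such a function returns is the least-squares slope cov(x, y)/var(x). A minimal illustration of that formula (in Python, purely editorially; the exam of course expects an R function built on lm and coef):

```python
def slope(x, y):
    """Least-squares slope for predicting y from x: Sxy / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

print(slope([1, 2, 3], [2, 4, 6]))  # 2.0, since y is exactly 2x
```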

7. The file calcio.txt contains some names and dates of birth of some people, like this, but all in one column. (Ouasim Bouy and the others in the right-hand column are actually below the left-hand column, in other words.)

Gianluigi Buffon 28jan1978 | Ouasim Bouy 11jun1993 Rubinho 04aug1982 | 12jun1988 Leonardo Citti 14jul1995 | 19jan1986 Marco Storari 07jan1977 | Simone Padoin 18mar1984 08may1981 | 30aug1983 01may1987 | 19may1979 Martin Caceres 07apr1987 | Paul Pogba 15mar1993 14aug1984 | Arturo Vidal 22may1987 17sep1986 | 26jan1987 16jan1984 | Fernando Llorente 26feb1985 14may1986 | 31jan1983 Angelo Ogbonna 23may1988 | Carlos Tevez 05feb1984 Federico Peluso 20jan1984 | Mirko Vucinic 01oct1983 Kwadwo Asamoah 09dec1988

(a) Write code to read these names and birthdates into SAS, reading in the whole of the names and keeping the dates as dates. (The names are all 22 characters long, including the spaces at the end.)

(b) How would you display these people, along with their birthdates, in the format 18/12/2013, that is, with the day as 2 digits, month as 2 digits, and year as 4 digits? This needs to be 2 lines of code.

(c) Zero-point bonus: who are these people?

8. You have an explanatory variable x and a response variable y. Give R code to do all of these:

(a) make a scatterplot, with a title Scatterplot of y and x in blue
(b) add the regression line, coloured red
(c) add a lowess curve, dashed, coloured purple

This should take you about four lines of R code altogether.

9. As in the last question, suppose that we have an explanatory variable x and a response variable y. This time, we also have a categorical variable group with two different values. (You can assume that group is text.) Assume that the data have just been read in correctly into a data set xyg (so that you don't need to provide a data step, and the data set xyg will be used by default).

(a) Give SAS code to draw a scatterplot of y against x, with the regression line drawn on the scatterplot as a solid line, and the points and line drawn in red. Use the default plotting symbol.

(b) Now make a scatterplot of y against x, with different plotting symbols in different colours for each group. Assume that you have two groups. Add a regression line for each group, solid for the first group and dashed for the second.

(c) Create a scatterplot of y and x with no lines or different colours/symbols, but with each point labelled with the group that it belongs to, the label below and to the right of the point it belongs to (hint: 9 instead of 3). What SAS code would make this happen?

10. Below are three histograms. What would a normal quantile plot of each data set look like? Sketch the normal quantile plot in each case, using the vertical y axis as the data or sample quantile scale and the x axis as the expected or theoretical quantile scale. Your plots need to be only detailed enough to show the important features: that is, show a rough scatter of points as might come from qqnorm and a line that would come from qqline.

(a) R> hist(y1)

[Histogram of y1: values ranging from 0 to 15; frequency axis up to 30.]

(b) R> hist(y2)

[Histogram of y2: values ranging from about -100 to 50; frequency axis up to 50.]

(c) R> hist(y3)

[Histogram of y3: values ranging from 5 to 35; frequency axis up to 35.]

11. The data set 'crime' contains observations on crime for the 50 US states plus DC. The variables of interest to us are:

- crime: violent crimes per 100,000 people
- murder: murders per 1,000,000 people
- pctmetro: percent of population living in metropolitan areas (cities)
- pctwhite: percent of population that is white
- pcths: percent of population with high school education or above
- poverty: percent of population living under the poverty line
- single: percent of population that are single parents

Our aim is to predict the violent crime rate from the other variables.

(a) What is special about the data set 'crime' that would distinguish it from a data set called crime without the quotes? What is the advantage to using such a special data set?

(b) Here is a multiple regression that uses all the other variables:

SAS> proc reg data='crime';
SAS> model crime=murder--single;

The REG Procedure
Model: MODEL1
Dependent Variable: crime
Number of Observations Read 51
Number of Observations Used 51

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             6         8707075      1451179    62.51  <.0001
Error            44         1021400        23214
Corrected Total  50         9728475

Root MSE 152.36021  R-Square 0.8950
Dependent Mean 612.84314  Adj R-Sq 0.8807
Coeff Var 24.86121

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1         -1143.79091       584.99927    -1.96    0.0569
murder      1            19.33154         4.44424     4.35    <.0001
pctmetro    1             6.62183         1.11851     5.92    <.0001
pctwhite    1            -0.69833         2.50485    -0.28    0.7817
pcths       1             4.79128         6.67813     0.72    0.4769
poverty     1            15.00684         9.72218     1.54    0.1299
single      1            54.85188        21.30379     2.57    0.0135

What does the P-value under Pr > F tell you?

(c) What does the P-value of 0.1299 in the Pr > |t| column tell you? (I'm looking for a careful answer.) Does this mean that the correlation between crime and poverty must be low?

(d) I fitted a second regression as shown:

SAS> proc reg data='crime';
SAS> model crime=murder pctmetro single;

The REG Procedure
Model: MODEL1
Dependent Variable: crime
Number of Observations Read 51
Number of Observations Used 51

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3         8637307      2879102   124.01  <.0001
Error            47         1091168        23216
Corrected Total  50         9728475

Root MSE 152.36908  R-Square 0.8878
Dependent Mean 612.84314  Adj R-Sq 0.8807
Coeff Var 24.86266

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1          -707.56131       208.72715    -3.39    0.0014
murder      1            21.66295         3.99715     5.42    <.0001
pctmetro    1             5.97097         1.03472     5.77    <.0001
single      1            64.36428        19.83899     3.24    0.0022

What is the slope coefficient for single? What does it mean? (Careful: there is a key phrase that I'm looking for.)

(e) What will happen if I take single out of the regression now?

(f) I did the following:

SAS> data preds;
SAS> input sid state $ crime murder pctmetro pctwhite pcths poverty single;
SAS> cards;
SAS> . . . 10 70 70 70 20 15
SAS> ;
SAS>
SAS> data newdata;
SAS> set preds 'crime';
SAS>
SAS> proc print data=newdata (obs=10);

Obs  sid  state  crime  murder  pctmetro  pctwhite  pcths  poverty  single
  1    .            .     10.0      70.0      70.0   70.0     20.0    15.0
  2    1  ak      761      9.0      41.8      75.2   86.6      9.1    14.3
  3    2  al      780     11.6      67.4      73.5   66.9     17.4    11.5
  4    3  ar      593     10.2      44.7      82.9   66.3     20.0    10.7
  5    4  az      715      8.6      84.7      88.6   78.7     15.4    12.1
  6    5  ca     1078     13.1      96.7      79.3   76.2     18.2    12.5
  7    6  co      567      5.8      81.8      92.5   84.4      9.9    12.1
  8    7  ct      456      6.3      95.7      89.0   79.2      8.5    10.1
  9    8  de      686      5.0      82.7      79.4   77.5     10.2    11.4
 10    9  fl     1206      8.9      93.0      83.5   74.4     17.8    10.6

(continues over)

SAS> proc reg;
SAS> model crime=murder pctmetro single;
SAS> output out=crimout p=pred;

The REG Procedure
Model: MODEL1
Dependent Variable: crime
Number of Observations Read 52
Number of Observations Used 51
Number of Observations with Missing Values 1

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3         8637307      2879102   124.01  <.0001
Error            47         1091168        23216
Corrected Total  50         9728475

Root MSE 152.36908  R-Square 0.8878
Dependent Mean 612.84314  Adj R-Sq 0.8807
Coeff Var 24.86266

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1          -707.56131       208.72715    -3.39    0.0014
murder      1            21.66295         3.99715     5.42    <.0001
pctmetro    1             5.97097         1.03472     5.77    <.0001
single      1            64.36428        19.83899     3.24    0.0022

SAS> proc print data=crimout (obs=1);
SAS> var crime--pred;

Obs  crime  murder  pctmetro  pctwhite  pcths  poverty  single     pred
  1      .      10        70        70     70       20      15  892.501

Describe what I did in words. What does the last line tell me?

12. The Law of Large Numbers says that if you take a large sample from any[1] population, the sample mean will almost certainly be close to the population mean. The "almost certainly" works better, the larger the sample gets. By way of example, let's think of a normal distribution with mean 10 and SD 3. Here's a typical random sample of 15 values from this distribution:

R> rnorm(15,10,3)

[1] 14.865602 7.760958 9.193208 7.901395 10.639714 12.126906 6.765013 [8] 12.373931 10.012141 13.287639 5.033575 6.379377 13.806247 12.515180 [15] 7.761681

Let's define "close" to be within 1, that is, between 9 and 11. Let's investigate this by simulation with R. Suppose that we take a sample size of 15. Your simulation needs to take 1000 random samples from this normal distribution, compute the sample mean for each one, save it somewhere, and then, when you have all 1000 sample means, count up how many of them are between 9 and 11.

(a) What code would you use to run the above simulation? You'll probably need a loop, with some suitable initialization before.
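A simulation of this general shape looks as follows (sketched here in Python with the standard library, purely as an editorial illustration; the exam expects R code along the same lines):

```python
import random

nsim = 1000  # number of simulated samples
n = 15       # sample size

means = []
for _ in range(nsim):
    # draw a sample of size 15 from a normal with mean 10, SD 3,
    # and save its sample mean
    sample = [random.gauss(10, 3) for _ in range(n)]
    means.append(sum(sample) / n)

# count how many of the 1000 sample means are between 9 and 11
close = sum(1 for m in means if 9 < m <= 11)
print(close)  # most of the means land in (9, 11] even at n = 15
```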

(b) I ran my code for samples of size 15, with the results shown below. I hid my code. The output shows the number of simulated sample means between 0 and 9, between 9 and 11, and between 11 and 20.

 (0,9] (9,11] (11,20]
    99    801     100

Here is the same thing for samples of size 30:

 (0,9] (9,11] (11,20]
    27    936      37

and for samples of size 50:

 (0,9] (9,11] (11,20]
     7    981      12

Does this output suggest that the Law of Large Numbers works here? Explain briefly.

[1] Subject to the same stuff as for the Central Limit Theorem, namely a well-defined population mean and SD.

13. A study was made of sea ice across Hudson Bay. At each of 36 numbered locations for each of the years 1971 to 2011, the date was recorded on which, for the first time in the year, the ice coverage was less than 50% at that location. This was called the breakup date, and the date was recorded as the number of days after January 1. Here is how the data were read into SAS, along with some of the data:

SAS> data breakup; SAS> infile 'breakup.csv' dlm=',' firstobs=2; SAS> input year loc1-loc36; SAS> SAS> proc print data=breakup (obs=10); SAS> var year loc1-loc6;

Obs year loc1 loc2 loc3 loc4 loc5 loc6 1 1971 187 187 201 215 201 201 2 1972 182 189 205 210 217 217 3 1973 205 191 191 212 198 205 4 1974 197 197 190 197 204 218 5 1975 189 182 168 210 217 210 6 1976 194 194 187 222 208 194 7 1977 165 179 186 200 207 200 8 1978 163 184 177 191 198 184 9 1979 176 162 204 204 190 204 10 1980 160 160 167 209 209 209

I also read the data into R, as follows:

R> breakup=read.csv("breakup.csv",header=T) R> breakup[1:10,1:7]

X X1 X2 X3 X4 X5 X6 1 1971 187 187 201 215 201 201 2 1972 182 189 205 210 217 217 3 1973 205 191 191 212 198 205 4 1974 197 197 190 197 204 218 5 1975 189 182 168 210 217 210 6 1976 194 194 187 222 208 194 7 1977 165 179 186 200 207 200 8 1978 163 184 177 191 198 184 9 1979 176 162 204 204 190 204 10 1980 160 160 167 209 209 209

Note that the year is denoted by X and the breakup date at each location is denoted by X followed by the location number. Questions are overleaf.

(a) In SAS, what code would you use to plot the breakup dates for locations 1, 4 and 6 against year, all on the same axes? Each location should be represented by a different colour and symbol, and the points should be joined by lines of different line types (that is, a different line type for each location). Add a legend at the top right, so that the reader can distinguish the locations.

(b) Using R, plot the breakup dates for locations 1, 4 and 6 against year on the same plot. Use different colours and plotting symbols, and join the breakup dates for each location with lines of different types. Add a legend at the top right, showing the colours, line types and plotting symbols. Make the y axis go from 150 to 240. (This is essentially the same graph as in the previous part, only this time using R.) What code would you use?

14. Suppose we have a data file like this, called names.txt. We are going to read the data into SAS. I'm happy with SAS's default of reading only the first 8 characters of the names.

id name group y 1 Abimae a 116 2 Abhishek b 192 3 Andrew c 170 4 Beili d 185 5 Chao a 123 6 Daniel b 199 7 Gary c 109 8 Gemini d 120 9 Guanyu a 177 10 Guilherme b 117 11 HongYi c 134 12 Jagjot d 194 13 Jason a 194 14 Jingzhu b 110 15 John c 101 16 Jonathan d 146 17 Man a 115 18 Manci b 190 19 Mary c 166 20 Mung d 200 21 Neel a 110 22 Pingchuan b 186 23 Rochelle c 142 24 Sankeetha d 189 25 Shao-Yun a 135 26 Shibvonne b 117 27 Tian c 119 28 Tsz d 190 29 Tuoyu a 175 30 Vinit b 186 31 Xiao c 183 32 Yang d 191 33 Yang a 179 34 Ye b 153 35 Yi c 176 36 Yiteng d 101 37 Yuan a 152 38 Zi b 185 39 Zongsheng c 121

Some questions are overleaf.

What data steps would you use to do the tasks below?

(a) Get rid of the variable id but read in all the other variables.

(b) Put only the variables name and y in your SAS data set. (Just show me the changes from the previous, here and below.)

(c) Keep only the people in group b.

(d) Omit the people whose value of y is less than 120.

(e) Keep only the people in group b who have y values greater than 170.

(f) Put only the names and groups of the people with id less than or equal to 20 into the data set.

THE END. Merry Christmas!
