UNIVERSITY OF TORONTO SCARBOROUGH
Department of Computer and Mathematical Sciences
Final Exam, December 18, 2013

STAC32 Applications of Statistical Methods
Duration: 3 hours

Aids allowed:

- My lecture overheads
- My long-form lecture notes
- Any notes that you have taken in this course
- Your marked assignments
- My R book
- The course SAS text
- Non-programmable, non-communicating calculator

This exam has 45 numbered pages, including this page. Please check to see that you have all the pages. Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker's attention to where the rest of the answer may be found.

The maximum marks available for each question are shown below. [Marks table omitted; Total 100.]

1. A small class has two even smaller lecture sections, labelled a and b. The marks on the final exam are shown below, and are saved in the file marks.txt. Write SAS code to do the following: read the data from the file, display the whole data set, calculate the mean and SD of final exam marks for each section, and make side-by-side boxplots of the marks arranged by section.

student section mark
1 a 78
2 a 75
3 b 66
4 b 81
5 a 74
6 b 91
7 a 83
8 b 59

This, or something like it:

SAS> data marks;
SAS> infile 'marks.txt' firstobs=2;
SAS> input student section $ mark;
SAS>
SAS> proc print;
SAS>
SAS> proc means;
SAS> var mark;
SAS> class section;
SAS>
SAS> proc sort;
SAS> by section;
SAS>
SAS> proc boxplot;
SAS> plot mark*section / boxstyle=schematic;

Obs  student  section  mark
  1        1        a    78
  2        2        a    75
  3        3        b    66
  4        4        b    81
  5        5        a    74
  6        6        b    91
  7        7        a    83
  8        8        b    59

The MEANS Procedure
Analysis Variable : mark

           N
section  Obs  N        Mean     Std Dev     Minimum     Maximum
-----------------------------------------------------------------
a          4  4  77.5000000   4.0414519  74.0000000  83.0000000
b          4  4  74.2500000  14.4539499  59.0000000  91.0000000
-----------------------------------------------------------------

[Side-by-side boxplots of mark by section (a and b), with mark on the vertical axis running from about 50 to 100.]

You need: firstobs=2 to skip the first line of the data file; both a var and a class within proc means, to supply the variable to average and the groups to average over; the proc sort to put all the section a students together and all the section b students together (or else you'll get about six boxplots, not just 2); and the boxstyle (or, if you must, boxtype, which is literally wrong but works) to allow any outliers to show up.

2. In R:

(a) Discuss the difference between read.table and read.csv. You might find it helpful to make up examples of data where you would use each.

read.table is for reading data separated by whitespace (spaces or tabs: R is not picky). Something like this, which is called a.txt:

a 1
b 2
c 3

R> aa=read.table("a.txt",header=F)
R> aa
  V1 V2
1  a  1
2  b  2
3  c  3

read.csv is used for reading data from a .csv file, where the values are separated by commas. This might have been the result of saving a spreadsheet to read into R, but doesn't have to be. Here's b.csv:

first name,1
second name,2
third,3

and to read it in:

R> bb=read.csv("b.csv",header=F)
R> bb
           V1 V2
1  first name  1
2 second name  2
3       third  3

(b) Suppose you had never heard of read.csv. How could you use read.table to read in a .csv file?

The key is the sep argument to read.table. You can use it for reading values separated by anything:

R> bb=read.table("b.csv",sep=",",header=F)
R> bb
           V1 V2
1  first name  1
2 second name  2
3       third  3

3. The data below shows the nine sun-orbiting objects that were formerly known as planets (since 2006, Pluto is no longer considered a planet). Shown are the mean distance from the sun (in million miles) and the length of the planet's year (in multiples of the earth's year):

R> planets=read.table("planets.txt",header=T) R> planets

   Planet SunDist   Year
1 Mercury      36   0.24
2   Venus      67   0.61
3   Earth      93   1.00
4    Mars     142   1.88
5 Jupiter     484  11.86
6  Saturn     887  29.46
7  Uranus    1784  84.07
8 Neptune    2796 164.82
9   Pluto    3707 247.68

Our aim is to predict the length of a planet's year from its distance from the sun. Analysis 1 is shown below.

R> attach(planets) R> plot(Year~SunDist)

[Scatterplot of Year against SunDist: the points rise with a somewhat curved trend.]

R> year1.lm=lm(Year~SunDist) R> summary(year1.lm)

Call: lm(formula = Year ~ SunDist)

Residuals:
    Min      1Q  Median      3Q     Max
-20.082  -7.396   4.959   8.586  17.947

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.351926   6.076862  -2.033   0.0816 .
SunDist       0.065305   0.003589  18.194 3.75e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.76 on 7 degrees of freedom
Multiple R-squared: 0.9793, Adjusted R-squared: 0.9763
F-statistic: 331 on 1 and 7 DF, p-value: 3.749e-07

R> r=resid(year1.lm) R> f=fitted(year1.lm) R> plot(r~f)

[Residual plot for year1.lm: residuals r against fitted values f, showing a clear curved pattern.]

(a) Do you see any problems with Analysis 1? Explain briefly.

The relationship on the scatterplot looks a bit curved. This is accentuated on the residual plot: there is a clear curved pattern, which means that the relationship is a curve, not a straight line.

(b) Astronomy suggests that there should be a power law at work here. That is, if y is year length and x is distance from the sun, the relationship should be of the form y = ax^b, where a and b are constants. One way of fitting this would be via a Box-Cox transformation. Another way is to take logarithms of both sides, resulting in log y = log a + b log x. This is done in Analysis 2 below:

R> log.year=log(Year)
R> log.sundist=log(SunDist)
R> year2.lm=lm(log.year~log.sundist)
R> summary(year2.lm)

Call:
lm(formula = log.year ~ log.sundist)

Residuals:
      Min        1Q    Median        3Q       Max
-0.011321 -0.001499  0.001741  0.003661  0.004669

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.797111   0.006751   -1007   <2e-16 ***
log.sundist  1.499222   0.001088    1378   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.005312 on 7 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.899e+06 on 1 and 7 DF, p-value: < 2.2e-16

R> pred1=predict(year2.lm)
R> pred2=exp(pred1)
R> pct.error=(Year-pred2)/Year*100
R> cbind(planets,pred=pred2,pct.error)

   Planet SunDist   Year        pred   pct.error
1 Mercury      36   0.24   0.2405993 -0.24968900
2   Venus      67   0.61   0.6105802 -0.09511447
3   Earth      93   1.00   0.9982609  0.17390744
4    Mars     142   1.88   1.8828212 -0.15006310
5 Jupiter     484  11.86  11.8366830  0.19660221
6  Saturn     887  29.46  29.3523314  0.36547401
7  Uranus    1784  84.07  83.6783677  0.46584076
8 Neptune    2796 164.82 164.1250081  0.42166724
9   Pluto    3707 247.68 250.4998880 -1.13852068

R> r=resid(year2.lm)
R> f=fitted(year2.lm)
R> plot(r~f)

[Residual plot for year2.lm: residuals r against fitted values f. The points trend gently uphill from left to right, except for one point at the bottom right whose residual is much more negative than the rest.]

Based on this output, what are the values of a and b in the power law? (You may need a calculator for this.)

If a power law holds, the logs of the two variables will have a linear relationship. As shown above, log y = log a + b log x, so the intercept of the linear trend is log a and the slope is b. Thus, using coef to get the coefficients:

R> cc=coef(year2.lm)
R> cc
(Intercept) log.sundist
  -6.797111    1.499222

a is

R> exp(cc[1])
(Intercept)
0.001116997

and b is

R> cc[2]
log.sundist
   1.499222
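The question also mentions a Box-Cox transformation as another route. That wasn't required, but here is a minimal sketch of what it might look like, assuming the MASS package is available (my code, not part of the marked solution). Note that boxcox only considers power transformations of the response, so wherever its peak lands suggests a power (or a log, if the peak is near zero) to try for Year alone; it is not quite the same thing as the log-log fit above, which transforms both variables.

R> library(MASS)                         # boxcox lives in MASS
R> bc=boxcox(Year~SunDist,data=planets)  # plots profile log-likelihood against lambda
R> bc$x[which.max(bc$y)]                 # lambda (power) with the highest log-likelihood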

(c) Look at the predictions from Analysis 2. Why did I use exp in my code?

I had to undo the log transformation, since the linear regression actually predicted the log of year length. In R, log takes natural (base e) logs, and the inverse of that is exp, or e^y.

(d) In the predictions from Analysis 2, do you think the predictions of year length are good or bad overall? Explain briefly. Are there any planets for which the prediction accuracy is unusually poor?

I think they are very good: they are off by less than 1%, with the exception of Pluto, which is off by about 3 years (a bit more than 1%). (Percent error is the right measure, given that we have taken logs.)

(e) Which planet is at the bottom right of the residual plot? How do you know?

Pluto. It has the most negative residual, and the largest fitted value (taking logs won't change its fitted value being the largest).

(f) What next action would you recommend, before fitting another model? Explain briefly. (You might wish to re-read my preamble to this question.)

The least accurate prediction was for Pluto, and in the preamble to the question I said that Pluto wasn't considered a planet any more. This suggests that the power-law relationship doesn't apply so well to Pluto, and we would be justified in taking Pluto out and trying again. I'd expect the relationship without Pluto to be extraordinarily good.

I was curious, so I tried it to see. All those -9s are for taking out Pluto. Maybe making a data frame without the 9th row and using that would have been easier (a sketch of that appears at the end of this question).

R> year3.lm=lm(log.year[-9]~log.sundist[-9])
R> summary(year3.lm)

Call:
lm(formula = log.year[-9] ~ log.sundist[-9])

Residuals:
       Min         1Q     Median         3Q        Max
-0.0017421 -0.0006138 -0.0001039  0.0005050  0.0021572

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     -6.8045877  0.0017000   -4003   <2e-16 ***
log.sundist[-9]  1.5007791  0.0002879    5213   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.00123 on 6 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.717e+07 on 1 and 6 DF, p-value: < 2.2e-16

The power looks like 1.5, which astronomy suggests is what it should be (this is Kepler's third law). Note that the power (slope) is estimated even more accurately than when we included Pluto.

R> pred1=predict(year3.lm)
R> pred2=exp(pred1)
R> pct.error=(Year[-9]-pred2)/Year[-9]*100
R> cbind(planets[-9,],pred=pred2,pct.error)

   Planet SunDist   Year        pred    pct.error
1 Mercury      36   0.24   0.2401438 -0.059905874
2   Venus      67   0.61   0.6100142 -0.002331083
3   Earth      93   1.00   0.9978451  0.215490997
4    Mars     142   1.88   1.8832779 -0.174358083
5 Jupiter     484  11.86  11.8621892 -0.018458874
6  Saturn     887  29.46  29.4433481  0.056523739
7  Uranus    1784  84.07  84.0292461  0.048476153
8 Neptune    2796 164.82 164.9286006 -0.065890418

The largest percentage error is only 0.2%. This power law is working really very well for the genuine planets.

R> r=resid(year3.lm)
R> f=fitted(year3.lm)
R> plot(r~f)

[Residual plot for year3.lm: residuals r against fitted values f, with the eight remaining points scattered with no obvious pattern.]

And now the residual plot looks pretty much random (as much as it can with only 8 points). In case you were concerned about the second residual plot, where the points went uphill until the end: this is a typical residual plot when you have an outlier, and it indicates that the outlier is a cause for concern. (I didn't ask you about that, because I didn't want to confuse the issue.)
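Here, as promised above, is a sketch of the "data frame without the 9th row" approach. The names planets2 and year4.lm are my own, and this is an illustration rather than part of the marked solution; it should give the same fitted power law as year3.lm:

R> planets2=planets[-9,]                             # the planets data frame with Pluto (row 9) removed
R> year4.lm=lm(log(Year)~log(SunDist),data=planets2)
R> coef(year4.lm)                                    # the slope should again be about 1.5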

4. The Toronto Star conducted a study in 2009 of men's and women's ability to pull into a downtown parking space. The focus of this study was on accuracy, not speed, so the response variable is the distance from the curb, in inches. Some analysis is shown below.

SAS> data parking;
SAS> infile 'parking.txt' expandtabs;
SAS> input gender $ distance;
SAS>
SAS> proc means;
SAS> class gender;
SAS> var distance;

The MEANS Procedure
Analysis Variable : distance

          N
gender  Obs   N        Mean    Std Dev    Minimum     Maximum
----------------------------------------------------------------
Female   47  47   9.3085106  5.3258529  2.0000000  25.0000000
Male     46  46  11.1413043  7.7729324  0.5000000  48.0000000
----------------------------------------------------------------

SAS> proc boxplot; SAS> plot distance*gender / boxstyle=schematic;

[Side-by-side boxplots of distance by gender (Male and Female), with distance on the vertical axis from 0 to 50. The Male boxplot shows a clear high outlier near 50.]

(a) The newspaper reported that females are more accurate parkers than males, by an average of almost two inches. Do you think we should trust this conclusion? Explain briefly why or why not.

This is a difference of means, and is correct from the output shown. But the question is whether we should trust the mean. The male distances have a clear outlier, suggesting that we should not trust the mean. If you look at the boxplots, the male median is also higher than the female median, but not by as much.

If you want to, you can also take the angle that these may not be random samples of all downtown Toronto parkers (or whatever you think the population is), and that we should be cautious of generalizing for that reason. I'm cool with that.

(b) A two-sample t-test was also run, as shown:

SAS> proc ttest side=l;
SAS> var distance;
SAS> class gender;

The TTEST Procedure
Variable: distance

gender       N     Mean  Std Dev  Std Err  Minimum  Maximum
Female      47   9.3085   5.3259   0.7769   2.0000  25.0000
Male        46  11.1413   7.7729   1.1461   0.5000  48.0000
Diff (1-2)      -1.8328   6.6495   1.3791

gender      Method            Mean    95% CL Mean      Std Dev  95% CL Std Dev
Female                      9.3085   7.7448 10.8722     5.3259  4.4257  6.6892
Male                       11.1413   8.8330 13.4496     7.7729  6.4472  9.7902
Diff (1-2)  Pooled         -1.8328   -Infty  0.4590     6.6495  5.8079  7.7785
Diff (1-2)  Satterthwaite  -1.8328   -Infty  0.4714

Method         Variances      DF  t Value  Pr < t
Pooled         Equal          91    -1.33  0.0936
Satterthwaite  Unequal    79.446    -1.32  0.0947

Equality of Variances
Method    Num DF  Den DF  F Value  Pr > F
Folded F      45      46     2.13  0.0120

Is this test supporting the idea that females are more accurate parkers, on average, than males? Explain briefly.

No, it isn't. It's correctly one-sided (females are first, and we're trying to show that the mean distance is lower for females, hence the side=l), but the P-value is bigger than 0.09 (Satterthwaite is better, but it really doesn't make a difference), so a null hypothesis of equal mean distances would not be rejected.
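For comparison (my addition, not part of the question), here is a sketch of the same one-sided Welch (Satterthwaite) test in R, assuming parking.txt has the two columns gender and distance as the SAS code above suggests. Because Female comes before Male alphabetically, alternative="less" tests whether the Female mean distance is the smaller one; the P-value should come out close to the Satterthwaite 0.0947 above.

R> park.dist=read.table("parking.txt",header=F)
R> names(park.dist)=c("gender","distance")
R> t.test(distance~gender,data=park.dist,alternative="less")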

5. For the data in the previous question, your instructor decided to carry out a randomization test based on medians. The code is shown below (he had to switch to R):

R> park.dist=read.table("parking.txt",header=F)
R> names(park.dist)=c("gender","distance")
R> attach(park.dist)
R>
R> # section A of code
R>
R> obs.med=aggregate(distance~gender,data=park.dist,median)
R> obs.med

  gender distance
1 Female      8.5
2   Male     10.0

R> obs.diff=obs.med[2,2]-obs.med[1,2] R> obs.diff

[1] 1.5

R> # section B of code
R>
R> nsim=1000
R> diff=numeric(nsim)
R> n=length(gender)
R>
R> for (i in 1:nsim)
R> {
R>   shuf=sample(gender,n)
R>   sim.med=aggregate(distance~shuf,data=park.dist,median)
R>   sim.diff=sim.med[2,2]-sim.med[1,2]
R>   diff[i]=sim.diff
R> }

R> # section C of code
R>
R> hist(diff)
R> abline(v=obs.diff,col="red")

[Histogram of diff: roughly symmetric and centred at 0, running from about -4 to 4, with a vertical red line at the observed difference of 1.5.]

R> # section D of code
R>
R> table(diff>=obs.diff)

FALSE  TRUE
  849   151

(a) What does Section A of the code do?

This calculates the median parking distances that were actually observed for males and females (10 and 8.5 inches respectively), and notes that in the data the male median is 1.5 inches larger than the female median.

(b) What does section B of the code do? Pay particular attention to the four lines of code within the loop.

This is the actual randomization. The bit before the loop is initialization. Inside the loop, we shuffle (randomly permute) the gender labels, so we randomly re-associate males and females with the observed parking distances. This is done to get a sense of how the median parking distances, or particularly the differences between them, might vary if there is actually no difference between males and females. If this is the case, and this is the core of the randomization idea, we can shuffle the male and female labels as we wish. Having done the shuffling, we calculate the median male and female distances according to the shuffled labels, take the male-minus-female difference, and store that in our vector of randomization results.

(c) What does the output from section C of the code suggest?

The observed median difference (the red line) is larger than average compared to the randomization distribution (which has its centre, unsurprisingly, at zero). But the observed difference is by no means unusually large; there is still an appreciable fraction of the randomization distribution out beyond (as extreme as or more extreme than) the observed 1.5.

(d) What is the P-value from the randomization test? How does this compare to the P-value from the t-test? Which test would you prefer to believe and why?

The randomization P-value comes from Section D: I counted how many of the values from the randomization distribution are 1.5 or larger (151 of them) or less (849 of them). The randomization P-value is therefore 0.151.

This is a bit larger (and therefore a bit less significant) than the P-value from the t-test, which was 0.0947. But there was an outlier in the male parking distances, which should make us doubt whether comparing means was smart. I would prefer to trust the randomization test, so the P-value of 0.151 is the one I would believe. Either way, though, there isn't any evidence of a difference in parking accuracy between males and females.
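As an aside (my addition), the randomization P-value can also be pulled out directly as the proportion of shuffled differences at least as large as the observed one:

R> mean(diff>=obs.diff)   # proportion of the 1000 shuffled differences that are 1.5 or more; 0.151 here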

6. Write an R function that, when called with two vectors x and y (you may assume these to be of the same length) will calculate and return the slope of the regression line for predicting y from x. Note that coef(fit) takes a regression fit stored in fit and returns the intercept and slope(s).

This:

R> slope=function(x,y)
R> {
R>   y.lm=lm(y~x)
R>   cc=coef(y.lm)
R>   cc[2]
R> }

To test it:

R> x=1:4
R> y=c(4,4,5,8)
R> slope(x,y)
  x
1.3

and we can check this via (hope you can live with the nested brackets this time):

R> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4
 0.7 -0.6 -0.9  0.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.0000     1.3134   1.523    0.267
x             1.3000     0.4796   2.711    0.113

Residual standard error: 1.072 on 2 degrees of freedom
Multiple R-squared: 0.786, Adjusted R-squared: 0.6791
F-statistic: 7.348 on 1 and 2 DF, p-value: 0.1134

The slope is indeed 1.3.

7. The file calcio.txt contains some names and dates of birth of some people, like this:

Gianluigi Buffon      28jan1978
Rubinho               04aug1982
Leonardo Citti        14jul1995
Marco Storari         07jan1977
Andrea Barzagli       08may1981
Leonardo Bonucci      01may1987
Martin Caceres        07apr1987
Giorgio Chiellini     14aug1984
Paolo de Ceglie       17sep1986
Stephan Lichtsteiner  16jan1984
Marco Motta           14may1986
Angelo Ogbonna        23may1988
Federico Peluso       20jan1984
Kwadwo Asamoah        09dec1988
Ouasim Bouy           11jun1993
Mauricio Isla         12jun1988
Claudio Marchisio     19jan1986
Simone Padoin         18mar1984
Simone Pepe           30aug1983
Andrea Pirlo          19may1979
Paul Pogba            15mar1993
Arturo Vidal          22may1987
Sebastian Giovinco    26jan1987
Fernando Llorente     26feb1985
Fabio Quagliarella    31jan1983
Carlos Tevez          05feb1984
Mirko Vucinic         01oct1983

(a) Write code to read these names and birthdates into SAS, reading in the whole of the names and keeping the dates as dates. (The names are all 22 characters long, including the spaces at the end.)

This requires informats for the two variables, saying how they are to be read in. The names are 22 characters of text, so the informat for these is $22., and the format of the dates is known to SAS as date9.:

SAS> data players;
SAS> infile 'calcio.txt';
SAS> input name $22. birthdate date9.;

(b) How would you display these people, along with their birthdates, in the format 18/12/2013, that is, with date as 2 digits, month as 2 digits, and year as 4 digits? This needs to be 2 lines of code.

We need to provide a format for the birthdates. The SAS format matching the specifications is called ddmmyy10.. This goes as a second line on proc print.

SAS> proc print;
SAS> format birthdate ddmmyy10.;

Obs  name                  birthdate
  1  Gianluigi Buffon      28/01/1978
  2  Rubinho               04/08/1982
  3  Leonardo Citti        14/07/1995
  4  Marco Storari         07/01/1977
  5  Andrea Barzagli       08/05/1981
  6  Leonardo Bonucci      01/05/1987
  7  Martin Caceres        07/04/1987
  8  Giorgio Chiellini     14/08/1984
  9  Paolo de Ceglie       17/09/1986
 10  Stephan Lichtsteiner  16/01/1984
 11  Marco Motta           14/05/1986
 12  Angelo Ogbonna        23/05/1988
 13  Federico Peluso       20/01/1984
 14  Kwadwo Asamoah        09/12/1988
 15  Ouasim Bouy           11/06/1993
 16  Mauricio Isla         12/06/1988
 17  Claudio Marchisio     19/01/1986
 18  Simone Padoin         18/03/1984
 19  Simone Pepe           30/08/1983
 20  Andrea Pirlo          19/05/1979
 21  Paul Pogba            15/03/1993
 22  Arturo Vidal          22/05/1987
 23  Sebastian Giovinco    26/01/1987
 24  Fernando Llorente     26/02/1985
 25  Fabio Quagliarella    31/01/1983
 26  Carlos Tevez          05/02/1984
 27  Mirko Vucinic         01/10/1983

(c) Zero-point bonus: who are these people?

The filename calcio.txt gives a clue, since calcio is the Italian word for soccer. These are all players for the Italian soccer team Juventus.

8. You have an explanatory variable x and a response variable y. Give R code to do all of these:

(a) make a scatterplot, with a title Scatterplot of y and x in blue
(b) add the regression line, coloured red
(c) add a lowess curve, dashed, coloured purple

This should take you about four lines of R code altogether.

Let's make up some data to illustrate:

R> x=1:10
R> y=c(10,9,11,10,11,12,13,16,18,21)

We need to fit the regression line (you don't need to display it):

R> y.lm=lm(y~x)

(this is one line of code so far), and the plot, to the specifications given:

R> plot(y~x,main="scatterplot of y and x",col.main="blue")
R> abline(y.lm,col="red")
R> lines(lowess(y~x),col="purple",lty="dashed")

[Scatterplot titled "scatterplot of y and x": the points rise with a curved trend, with the red regression line and the dashed purple lowess curve visibly different from each other.]

(That's three more lines of code, for a total of 4.) You see that my data have a curved trend, so that the lowess curve and the line are visibly different. A quadratic relationship (including x squared) would fit better than a linear one.

9. As in the last question, suppose that we have an explanatory variable x and a response variable y. This time, we also have a categorical variable group with two different values. (You can assume that group is text.) Assume that the data have already been read in correctly into a data set xyg (so that you don't need to provide a data step).

(a) Give SAS code to draw a scatterplot of y against x, with the regression line drawn on the scatterplot as a solid line, and the points and line drawn in red. Use the default plotting symbol.

Again, we need some data. Let's re-use what we had last time, and add a grouping variable. My groups are called alpha and beta, for no particularly good reason.

SAS> data xyg;
SAS> input x y group $;
SAS> cards;
SAS> 1 10 alpha
SAS> 2 9 alpha
SAS> 3 11 alpha
SAS> 4 10 alpha
SAS> 5 11 beta
SAS> 6 12 alpha
SAS> 7 13 beta
SAS> 8 16 beta
SAS> 9 18 beta
SAS> 10 21 beta
SAS> ;
SAS>
SAS> proc print;

Obs   x   y  group
  1   1  10  alpha
  2   2   9  alpha
  3   3  11  alpha
  4   4  10  alpha
  5   5  11  beta
  6   6  12  alpha
  7   7  13  beta
  8   8  16  beta
  9   9  18  beta
 10  10  21  beta

Now to make the plot. This requires a symbol1 statement specifying what we want to plot, like this:

SAS> symbol1 c=red i=rl l=1 v=;

Specifying the plotting symbol as v= like this uses the default. Or, since the default is a +, you can say v=plus. Either is good. c= specifies the colour of the points, i= specifies what to join them with (rl is regression line), and l= is the type of line, 1 being solid and 2 and up being various flavours of dashed line. Using l= without specifying a number also works, since the solid line is the default. Did it work?

SAS> proc gplot;
SAS> plot y*x;

[Scatterplot of y against x from proc gplot, with the points drawn in red and the regression line through them.]

This appears to be a greyscale red, so yes it did. This particular data set seems to be crying out for a parabola, which you can ask for with i=rq.

(b) Now make a scatterplot of y against x, with different plotting symbols in different colours for each group. Assume that you have two groups. Add a regression line for each group, solid for the first group and dashed for the second.

This requires two symbol definitions. Following the specifications above leads us to this:

SAS> symbol1 c=black i=rl l=1 v=plus;
SAS> symbol2 c=red i=rl l=2 v=x;

You can use any v= symbols you like, as long as they're different, and you can use any two colours you like, ditto. l= must be 1 for the first group and something bigger than 1 for the second, and you must have i=rl in both cases. (Having i=rl on a symbol line specifies a regression line for that group. Of course, if you have only one group, that's the same as a regression line for all the data.) Now for the plot. Don't forget the =group thing to specify the groups!

SAS> proc gplot;
SAS> plot y*x=group;

[Scatterplot of y against x by group, with a legend showing the two groups (alpha and beta), each group having its own plotting symbol, colour and regression line.]

My group 1 was black (darker) with plusses, and my group 2 was red (lighter) with x's. The legend shows this.

(c) Create a scatterplot of y and x with no lines or different colours/symbols, but with each point labelled with the group that it belongs to, the label below and to the right of the point it belongs to (hint: 9 instead of 3). What SAS code would make this happen?

This means creating that crazy data set first, as follows:

SAS> data mytext;
SAS> retain xsys ysys '2' position '9';
SAS> set xyg;
SAS> x=x;
SAS> y=y;
SAS> text=group;

My hint 9 was meant to point you to the right value for position. My data set is called xyg, as I mentioned earlier. Now we plot this with an annotate. I also need to redefine symbol1 to not plot a line any more. I've set it back to the defaults:

SAS> symbol1 c= i= l= v=;
SAS>
SAS> proc gplot;

SAS> plot y*x / annotate=mytext;

[Scatterplot of y against x with each point labelled with its group (alpha or beta), the label placed below and to the right of the point.]

Some of my labels went off the edge of the graph, but it is not our business here to fiddle with that.

10. Below are three histograms. What would a normal quantile plot of each data set look like? Sketch the normal quantile plot in each case, using the vertical y axis as the data or sample quantile scale and the x axis as the expected or theoretical quantile scale. Your plots need to be only detailed enough to show the important features: that is, show a rough scatter of points as might come from qqnorm and a line that would come from qqline.

(a) R> hist(y1)

[Histogram of y1: right-skewed, with most values below 5 and a long tail stretching out to about 15.]

The actual normal quantile plot is shown below. The right skew shows up as a curve. The data are bunched up at the bottom and spread out at the top, so the curve is concave upwards (or opens upwards, if you like that terminology better).

R> qqnorm(y1)
R> qqline(y1)

[Normal quantile plot of y1: the points form an upward-opening curve rather than following the line, bending up sharply at the right.]

(b) R> hist(y2)

[Histogram of y2: a tall peak in the middle with a few extreme values in both tails, stretching from about -100 to 50.]

This has outliers at both ends, but the middle is apparently normal-looking. So the normal quantile plot should have the middle of the data more or less on the line, but the extreme values should be below the line on the left and above it on the right, since the data are too spread out at both ends to be normal:

R> qqnorm(y2)
R> qqline(y2)

[Normal quantile plot of y2: the middle of the points follows the line closely, but the extreme points fall below the line on the left and above it on the right.]

(c) R> hist(y3)

[Histogram of y3: roughly symmetric and bell-shaped, centred near 20, with values from about 5 to 35.]

This one is about as normal as you could wish for. The normal quantile plot should basically follow the line, with maybe a few minor wiggles.

R> qqnorm(y3)
R> qqline(y3)

[Normal quantile plot of y3: the points follow the line closely, with only minor wiggles.]
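If you want to practise reading these plots, you can simulate data with each of the three shapes and draw both plots yourself. This is my illustration only; the distributions chosen (exponential for right skew, t with 2 df for outliers in both tails, normal for the well-behaved case) and the names z1, z2, z3 are assumptions, not the distributions actually used to make y1, y2 and y3.

R> par(mfrow=c(3,2))                 # three rows: histogram beside normal quantile plot
R> z1=rexp(100)                      # right-skewed
R> hist(z1); qqnorm(z1); qqline(z1)
R> z2=rt(100,df=2)                   # long tails: outliers at both ends
R> hist(z2); qqnorm(z2); qqline(z2)
R> z3=rnorm(100)                     # close to normal
R> hist(z3); qqnorm(z3); qqline(z3)
R> par(mfrow=c(1,1))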

11. The data set 'crime' contains observations on crime for the 50 US states plus DC. The variables of interest to us are:

- crime: violent crimes per 100,000 people
- murder: murders per 1,000,000 people
- pctmetro: percent of population living in metropolitan areas (cities)
- pctwhite: percent of population that is white
- pcths: percent of population with high school education or above
- poverty: percent of population living under the poverty line
- single: percent of population that are single parents.

Our aim is to predict the violent crime rate from the other variables.

(a) What is special about the data set 'crime' that would distinguish it from a data set called crime without the quotes? What is the advantage to using such a special data set?

The data set 'crime' is a permanent data set. This means that, having created it, I never need another data step in order to use it: I can just put data='crime' on the end of any proc line and it will be available to use. A regular data set, such as the imaginary one called crime without the quotes, has to be created in every run of SAS using a data step. Once I close SAS, it's gone and has to be re-created.

(b) Here is a multiple regression that uses all the other variables:

SAS> proc reg data='crime';
SAS> model crime=murder--single;

The REG Procedure
Model: MODEL1
Dependent Variable: crime

Number of Observations Read 51
Number of Observations Used 51

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square  F Value  Pr > F
Model              6    8707075    1451179    62.51  <.0001
Error             44    1021400      23214
Corrected Total   50    9728475

Root MSE        152.36021   R-Square  0.8950
Dependent Mean  612.84314   Adj R-Sq  0.8807
Coeff Var        24.86121

Parameter Estimates
                   Parameter    Standard
Variable    DF      Estimate       Error  t Value  Pr > |t|
Intercept    1   -1143.79091   584.99927    -1.96    0.0569
murder       1      19.33154     4.44424     4.35    <.0001
pctmetro     1       6.62183     1.11851     5.92    <.0001
pctwhite     1      -0.69833     2.50485    -0.28    0.7817
pcths        1       4.79128     6.67813     0.72    0.4769
poverty      1      15.00684     9.72218     1.54    0.1299
single       1      54.85188    21.30379     2.57    0.0135

What does the P-value under Pr > F tell you?

This P-value, which is less than 0.0001, tells me that the regression as a whole is significant: one or more of the six explanatory variables helps to predict the violent crime rate.

(c) What does the P-value of 0.1299 in the Pr > |t| column tell you? (I'm looking for a careful answer.) Does this mean that the correlation between crime and poverty must be low?

This is testing whether poverty adds anything to the predictive power of the regression, over and above the other variables. Since this P-value is greater than 0.05, we learn that poverty can be removed from this regression. This doesn't say anything about whether poverty is absolutely important in predicting violent crime rate, or whether it's useful by itself. It just says that in the context of this model, it has nothing to add and can be removed.

This last point says that the correlation between crime and poverty does not have to be low, and indeed it is pretty high:

SAS> proc corr data='crime';
SAS> var crime--single;

The CORR Procedure
7 Variables: crime murder pctmetro pctwhite pcths poverty single

Simple Statistics
Variable    N       Mean    Std Dev        Sum   Minimum    Maximum
crime      51  612.84314  441.10032      31255  82.00000       2922
murder     51    8.72745   10.71758  445.10000   1.60000   78.50000
pctmetro   51   67.39020   21.95713       3437  24.00000  100.00000
pctwhite   51   84.11569   13.25839       4290  31.80000   98.50000
pcths      51   76.22353    5.59209       3887  64.30000   86.60000
poverty    51   14.25882    4.58424  727.20000   8.00000   26.40000
single     51   11.32549    2.12149  577.60000   8.40000   22.10000

Pearson Correlation Coefficients, N = 51
Prob > |r| under H0: Rho=0

              crime    murder  pctmetro  pctwhite     pcths   poverty    single
crime       1.00000   0.88620   0.54404  -0.67718  -0.25605   0.50951   0.83887
                       <.0001    <.0001    <.0001    0.0697    0.0001    <.0001
murder      0.88620   1.00000   0.31611  -0.70619  -0.28607   0.56587   0.85891
             <.0001              0.0238    <.0001    0.0418    <.0001    <.0001
pctmetro    0.54404   0.31611   1.00000  -0.33722  -0.00398  -0.06054   0.25981
             <.0001    0.0238              0.0155    0.9779    0.6730    0.0656
pctwhite   -0.67718  -0.70619  -0.33722   1.00000   0.33855  -0.38929  -0.65644
             <.0001    <.0001    0.0155              0.0151    0.0048    <.0001
pcths      -0.25605  -0.28607  -0.00398   0.33855   1.00000  -0.74394  -0.21978
             0.0697    0.0418    0.9779    0.0151              <.0001    0.1212
poverty     0.50951   0.56587  -0.06054  -0.38929  -0.74394   1.00000   0.54859
             0.0001    <.0001    0.6730    0.0048    <.0001              <.0001
single      0.83887   0.85891   0.25981  -0.65644  -0.21978   0.54859   1.00000
             <.0001    <.0001    0.0656    <.0001    0.1212    <.0001

The correlation between crime and poverty is 0.51, not small at all. But crime has a higher correlation with other variables (e.g. murder) which also have a high correlation with poverty.

(d) I fitted a second regression as shown:

SAS> proc reg data='crime';
SAS> model crime=murder pctmetro single;

The REG Procedure
Model: MODEL1
Dependent Variable: crime

Number of Observations Read 51
Number of Observations Used 51

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square  F Value  Pr > F
Model              3    8637307    2879102   124.01  <.0001
Error             47    1091168      23216
Corrected Total   50    9728475

Root MSE        152.36908   R-Square  0.8878
Dependent Mean  612.84314   Adj R-Sq  0.8807
Coeff Var        24.86266

Parameter Estimates
                 Parameter    Standard
Variable   DF     Estimate       Error  t Value  Pr > |t|
Intercept   1   -707.56131   208.72715    -3.39    0.0014
murder      1     21.66295     3.99715     5.42    <.0001
pctmetro    1      5.97097     1.03472     5.77    <.0001
single      1     64.36428    19.83899     3.24    0.0022

What is the slope coefficient for single? What does it mean? (Careful: there is a key phrase that I'm looking for.)

It is 64.36. This is saying that, on average, if the percent of single parents goes up by 1 (percentage point), the violent crime rate will go up by 64.36, other things being equal. This last phrase is the important one.

(e) What will happen if I take single out of the regression now?

The regression will fit noticeably worse, because the P-value for single, 0.0022, is small, and therefore single should not be taken out. You can express this in terms of R-squared, but you have to say that R-squared will go down by a lot, not just that it will go down. (R-squared always goes down when you take a variable out, even a worthless one.) Another way of saying this is that adjusted R-squared will go down, which will happen only if you take out a variable that you shouldn't have.

(f) I did the following:

SAS> data preds;
SAS> input sid state $ crime murder pctmetro pctwhite pcths poverty single;
SAS> cards;
SAS> . . . 10 70 70 70 20 15
SAS> ;
SAS>
SAS> data newdata;
SAS> set preds 'crime';
SAS>
SAS> proc print data=newdata (obs=10);

Obs  sid  state  crime  murder  pctmetro  pctwhite  pcths  poverty  single
  1    .             .    10.0      70.0      70.0   70.0     20.0    15.0
  2    1  ak       761     9.0      41.8      75.2   86.6      9.1    14.3
  3    2  al       780    11.6      67.4      73.5   66.9     17.4    11.5
  4    3  ar       593    10.2      44.7      82.9   66.3     20.0    10.7
  5    4  az       715     8.6      84.7      88.6   78.7     15.4    12.1
  6    5  ca      1078    13.1      96.7      79.3   76.2     18.2    12.5
  7    6  co       567     5.8      81.8      92.5   84.4      9.9    12.1
  8    7  ct       456     6.3      95.7      89.0   79.2      8.5    10.1
  9    8  de       686     5.0      82.7      79.4   77.5     10.2    11.4
 10    9  fl      1206     8.9      93.0      83.5   74.4     17.8    10.6

SAS> proc reg;
SAS> model crime=murder pctmetro single;
SAS> output out=crimout p=pred;

The REG Procedure
Model: MODEL1
Dependent Variable: crime

Number of Observations Read 52
Number of Observations Used 51
Number of Observations with Missing Values 1

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square  F Value  Pr > F
Model              3    8637307    2879102   124.01  <.0001
Error             47    1091168      23216
Corrected Total   50    9728475

Root MSE        152.36908   R-Square  0.8878
Dependent Mean  612.84314   Adj R-Sq  0.8807
Coeff Var        24.86266

Parameter Estimates
                 Parameter    Standard
Variable   DF     Estimate       Error  t Value  Pr > |t|
Intercept   1   -707.56131   208.72715    -3.39    0.0014
murder      1     21.66295     3.99715     5.42    <.0001
pctmetro    1      5.97097     1.03472     5.77    <.0001
single      1     64.36428    19.83899     3.24    0.0022

SAS> proc print data=crimout (obs=1);
SAS> var crime--pred;

Obs  crime  murder  pctmetro  pctwhite  pcths  poverty  single     pred
  1      .      10        70        70     70       20      15  892.501

Describe what I did in words. What does the last line tell me?

Let's take this in steps:

- I created a new data set with all the same variables as 'crime', but only one row of values. Out of the variables we've been using, crime (the response) is missing, but we have values for all the explanatory variables. (I chose these values to be pointing towards an increased violent-crime rate.)
- Next I made another new data set by gluing together the one we just created, with the missing crime figure, and the original 'crime' data.
- Then I print out the data set I just created, or at least the first 10 rows of it. You see the invented data plus the first 9 rows of the original data set.
- Next I fit the same regression as in (d), but this time I create an output data set and put the predicted values in it. (You'll notice one Observation with Missing Values, which is the one I added.) Since the response variable for the extra observation was missing, the rest of the output is exactly as in (d).
- Last, I print out the first line of the output data set. This is the extra data that I added, but I have gained a prediction for it in pred.

The value at the end of the last line, 892.5, is the predicted violent crime rate for a state with the given values for the explanatory variables. This is indeed a higher-than-average violent crime rate. The key that I was doing a prediction comes from the fact that I re-did the regression, creating an output data set, and what went into that output data set was p=, meaning I was saving predicted values. You can trace it back to see what values I was predicting for.
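As a quick sanity check (my addition), the 892.5 can be reproduced by plugging the new values into the fitted equation from (d):

R> -707.56131 + 21.66295*10 + 5.97097*70 + 64.36428*15

which works out to about 892.5, matching pred in the last line of the output.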

12. The Law of Large Numbers says that if you take a large sample from any population [1], the sample mean will almost certainly be close to the population mean. The "almost certainly" works better, the larger the sample gets.

[1] Subject to the same stuff as for the Central Limit Theorem, namely a well-defined population mean and SD.

By way of example, let's think of a normal distribution with mean 10 and SD 3. Here's a typical random sample of 15 values from this distribution:

R> rnorm(15,10,3)


 [1] 14.865602  7.760958  9.193208  7.901395 10.639714 12.126906  6.765013
 [8] 12.373931 10.012141 13.287639  5.033575  6.379377 13.806247 12.515180
[15]  7.761681

Let's define close to be within 1, that is, between 9 and 11. Let's investigate this by simulation with R. Suppose that we take a sample size of 15. Your simulation needs to take 1000 random samples from this normal distribution, compute the sample mean for each one, save it somewhere, and then, when you have all 1000 sample means, count up how many of them are between 9 and 11.

(a) What code would you use to run the above simulation? You'll probably need a loop, with some suitable initialization before.

Something like this. I saved my sample size in a variable, for ease of changing later, but you don't have to.

R> nsim=1000
R> sample.size=15
R> sample.mean=numeric(nsim)
R> for (i in 1:nsim)
R> {
R>   my.sample=rnorm(sample.size,10,3)
R>   sample.mean[i]=mean(my.sample)
R> }
R> table(sample.mean<9)

FALSE  TRUE
  901    99

R> table(sample.mean>11)

FALSE  TRUE
  900   100

We haven't done a splitting-into-3 before, so the easiest way for you is probably using table twice like this: count the values less than 9 and the values greater than 11, add them up, and subtract from 1000. That gives 1000 - 99 - 100 = 801 values between 9 and 11.

Or you can use cut. This turns the sample means into categories: 0 to 9, 9 to 11, and 11 to 20. The table then counts those:

R> table(cut(sample.mean,c(0,9,11,20)))

  (0,9]  (9,11] (11,20]
     99     801     100

Likewise, 801 values in between. (This is strictly greater than 9 and less-than-or-equal-to 11, but there aren't any means exactly equal to 9 or 11.)

(b) I ran my code for samples of size 15, with the results shown below. I hid my code. The output shows the number of simulated sample means between 0 and 9, between 9 and 11, and between 11 and 20.

  (0,9]  (9,11] (11,20]
     89     811     100

Here is the same thing for samples of size 30:

  (0,9]  (9,11] (11,20]
     33     933      34

and for samples of size 50:

  (0,9]  (9,11] (11,20]
     11     971      18

Does this output suggest that the Law of Large Numbers works here? Explain briefly.

811 of the sample means from samples of size 15 are close to 10, that is, between 9 and 11. For samples of size 30, 933 of them are close, and for samples of size 50, 971 of them are close, in each case out of 1000. As the sample size gets bigger, the simulations show that the sample mean is more likely to be close to the population mean of 10. This is exactly what the Law of Large Numbers says.

In case you're wondering about the difference between 801 and 811: I did a second simulation for n = 15. The first one was in my answer to (a), and the second one was for the question in (b). It was easier to just do it twice. [2]

[2] I suppose I could have used set.seed to get the exact same random numbers.

Because we're using a normal population, it's also possible to do these calculations exactly rather than by simulation, because we know that the sample mean has a normal distribution (exactly) with mean 10 and SD 3/sqrt(n). (If the population were not normal, we couldn't do this, or at least not nearly so easily.) Thus, for a sample of size 15, this proportion of the sample means should be between 9 and 11:

R> pnorm(11,10,3/sqrt(15))-pnorm(9,10,3/sqrt(15))
[1] 0.8032944

For samples of size 30, we get this:

R> pnorm(11,10,3/sqrt(30))-pnorm(9,10,3/sqrt(30))
[1] 0.9321108

and for samples of size 50:

R> pnorm(11,10,3/sqrt(50))-pnorm(9,10,3/sqrt(50))
[1] 0.9815779

Our simulations deviate from these a bit, which is because we took only 1000 random samples each time; 10,000 would have been better. But the overall picture we got agrees with the theory.

13. A study was made of sea ice across Hudson Bay. At each of 36 numbered locations for each of the years 1971 to 2011, the date was recorded on which, for the first time in the year, the ice coverage was less than 50% at that location. This was called the breakup date, and the date was recorded as the number of days after January 1. Here is how the data were read into SAS, along with some of the data:

SAS> data breakup;
SAS> infile 'breakup.csv' dlm=',' firstobs=2;
SAS> input year loc1-loc36;
SAS>
SAS> proc print data=breakup (obs=10);
SAS> var year loc1-loc6;

Obs  year  loc1  loc2  loc3  loc4  loc5  loc6
  1  1971   187   187   201   215   201   201
  2  1972   182   189   205   210   217   217
  3  1973   205   191   191   212   198   205
  4  1974   197   197   190   197   204   218
  5  1975   189   182   168   210   217   210
  6  1976   194   194   187   222   208   194
  7  1977   165   179   186   200   207   200
  8  1978   163   184   177   191   198   184
  9  1979   176   162   204   204   190   204
 10  1980   160   160   167   209   209   209

I also read the data into R, as follows:


R> breakup=read.csv("breakup.csv",header=T)
R> breakup[1:10,1:7]

      X  X1  X2  X3  X4  X5  X6
1  1971 187 187 201 215 201 201
2  1972 182 189 205 210 217 217
3  1973 205 191 191 212 198 205
4  1974 197 197 190 197 204 218
5  1975 189 182 168 210 217 210
6  1976 194 194 187 222 208 194
7  1977 165 179 186 200 207 200
8  1978 163 184 177 191 198 184
9  1979 176 162 204 204 190 204
10 1980 160 160 167 209 209 209

Note that the year is denoted by X and the breakup date at each location is denoted by X followed by the location number.

(a) In SAS, what code would you use to plot the breakup dates for locations 1, 4 and 6 against year, all on the same axes? Each location should be represented by a different colour and symbol, and the points should be joined by lines of different line types (that is, a different line type for each location). Add a legend at the top right, so that the reader can distinguish the locations.

Defining symbols is the way to go, one for each location, and so three in all, for example:

SAS> symbol1 c=black v=star i=join l=1;
SAS> symbol2 c=red v=plus i=join l=2;
SAS> symbol3 c=blue v=x i=join l=3;

and a legend:

SAS> legend1 label=none position=(top right inside) mode=share;

Your choice of colours, plotting symbols and line types. As long as SAS knows about them, you're good. Then the plot. This is multiple plots with an overlay, like the orange trees:

SAS> proc gplot;
SAS> plot loc1*year loc4*year loc6*year / overlay legend=legend1;

[Plot of loc1, loc4 and loc6 against year, the three locations overlaid with different symbols and line types and a legend at the top right; the breakup dates run from about 150 to 240 days.]

This is rather cluttered, but that wasn't the point here! (My colours, as usual, came out greyscale; seeing different colours would have made things easier.)

(b) Using R, plot the breakup dates for locations 1, 4 and 6 against year on the same plot. Use different colours and plotting symbols, and join the breakup dates for each location with lines of different types. Add a legend at the top right, showing the colours, line types and plotting symbols. Make the y axis go from 150 to 240. (This is essentially the same graph as in the previous part, only this time using R.) What code would you use?

Again, your choice of colours, line types, plotting symbols. Anything that R understands is good. It seems to be easiest to refer to plotting characters by number. The strategy for multiple series in R is to start by plotting an empty graph, and then to add the series to it one by one using points or lines (either will work here, since you are plotting both points and lines). Whichever you use, you need type="b" to get both points and lines.

R> attach(breakup)
R> plot(X1~X,type="n",ylim=c(150,240))
R> lines(X1~X,lty="solid",col="black",pch=2,type="b")
R> lines(X4~X,lty="dashed",col="red",pch=3,type="b")
R> lines(X6~X,lty="dotted",col="blue",pch=4,type="b")

R> locations=c("X1","X4","X6")
R> colours=c("black","red","blue")
R> linetypes=c("solid","dashed","dotted")
R> legend("topright",legend=locations,col=colours,pch=2:4,lty=linetypes)

[The same plot drawn in R: the three series X1, X4 and X6 plotted against X (year), as points joined by lines, with the y axis running from 150 to 240 and a legend at the top right.]

I made vectors out of my locations, colours and line types, to make my legend command less cumbersome. Come to think of it, I could have defined these first and then extracted the appropriate things from them when I did my lines. (In fact, in that case, I could even have used a loop.) Don't know why my triangles have lines across them.

14. Suppose we have a data file like this, called names.txt. We are going to read the data into SAS. I'm happy with SAS's default of reading only the first 8 characters of the names.

id name group y
1 Abimae a 116
2 Abhishek b 192
3 Andrew c 170
4 Beili d 185
5 Chao a 123
6 Daniel b 199
7 Gary c 109
8 Gemini d 120
9 Guanyu a 177
10 Guilherme b 117
11 HongYi c 134
12 Jagjot d 194
13 Jason a 194
14 Jingzhu b 110
15 John c 101
16 Jonathan d 146
17 Man a 115
18 Manci b 190
19 Mary c 166
20 Mung d 200
21 Neel a 110
22 Pingchuan b 186
23 Rochelle c 142
24 Sankeetha d 189
25 Shao-Yun a 135
26 Shibvonne b 117
27 Tian c 119
28 Tsz d 190
29 Tuoyu a 175
30 Vinit b 186
31 Xiao c 183
32 Yang d 191
33 Yang a 179
34 Ye b 153
35 Yi c 176
36 Yiteng d 101
37 Yuan a 152
38 Zi b 185
39 Zongsheng c 121

What data steps would you use to do the tasks below?

(a) Get rid of the variable id but read in all the other variables.

I'm not bothered about the precise details on infile. As long as there's a drop somewhere, that's good. I know you can do the rest of it.

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> drop id;
SAS>
SAS> proc print;

Obs name group y
1 Abimae a 116
2 Abhishek b 192
3 Andrew c 170
4 Beili d 185
5 Chao a 123
6 Daniel b 199
7 Gary c 109
8 Gemini d 120
9 Guanyu a 177
10 Guilherm b 117
11 HongYi c 134
12 Jagjot d 194
13 Jason a 194
14 Jingzhu b 110
15 John c 101
16 Jonathan d 146
17 Kelly d 150
18 Man a 115
19 Manci b 190
20 Mary c 166
21 Mung d 200
22 Neel a 110
23 Pingchua b 186
24 Rochelle c 142
25 Sankeeth d 189
26 Shao-Yun a 135
27 Shibvonn b 117
28 Tian c 119
29 Tsz d 190
30 Tuoyu a 175
31 Vinit b 186
32 Xiao c 183
33 Yang d 191
34 Yang a 179
35 Ye b 153
36 Yi c 176
37 Yiteng d 101
38 Yuan a 152
39 Zi b 185
40 Zongshen c 121

(b) Put only the variables name and y in your SAS data set. (Just show me the changes from the previous, here and below.)

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> keep name y;
SAS>
SAS> proc print;

Obs name y
1 Abimae 116
2 Abhishek 192
3 Andrew 170
4 Beili 185
5 Chao 123
6 Daniel 199
7 Gary 109
8 Gemini 120
9 Guanyu 177
10 Guilherm 117
11 HongYi 134
12 Jagjot 194
13 Jason 194
14 Jingzhu 110
15 John 101
16 Jonathan 146
17 Kelly 150
18 Man 115
19 Manci 190
20 Mary 166
21 Mung 200
22 Neel 110
23 Pingchua 186
24 Rochelle 142
25 Sankeeth 189
26 Shao-Yun 135
27 Shibvonn 117
28 Tian 119
29 Tsz 190
30 Tuoyu 175
31 Vinit 186
32 Xiao 183
33 Yang 191
34 Yang 179
35 Ye 153
36 Yi 176
37 Yiteng 101
38 Yuan 152
39 Zi 185
40 Zongshen 121

Change the drop to a keep. You can do either of these the other way around, as long as you make sure to keep what you want to keep and drop what you want to drop. (My way is the shortest, though.)

(c) Keep only the people in group b.

The keep in the question was deliberately to mislead you! keep is for variables; this is about observations, for which you need if:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> if group='b';
SAS>
SAS> proc print;

Obs id name group y
1 2 Abhishek b 192
2 6 Daniel b 199
3 10 Guilherm b 117
4 14 Jingzhu b 110
5 18 Manci b 190
6 22 Pingchua b 186
7 26 Shibvonn b 117
8 30 Vinit b 186
9 34 Ye b 153
10 38 Zi b 185

(d) Omit the people whose value of y is less than 120.

A delete on the end of the if this time:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> if y<120 then delete;
SAS>
SAS> proc print;

Obs id name group y
1 2 Abhishek b 192
2 3 Andrew c 170
3 4 Beili d 185
4 5 Chao a 123
5 6 Daniel b 199
6 8 Gemini d 120
7 9 Guanyu a 177
8 11 HongYi c 134
9 12 Jagjot d 194
10 13 Jason a 194
11 16 Jonathan d 146
12 40 Kelly d 150
13 18 Manci b 190
14 19 Mary c 166
15 20 Mung d 200
16 22 Pingchua b 186
17 23 Rochelle c 142
18 24 Sankeeth d 189
19 25 Shao-Yun a 135
20 28 Tsz d 190
21 29 Tuoyu a 175
22 30 Vinit b 186
23 31 Xiao c 183
24 32 Yang d 191
25 33 Yang a 179
26 34 Ye b 153
27 35 Yi c 176
28 37 Yuan a 152
29 38 Zi b 185
30 39 Zongshen c 121

(e) Keep only the people in group b who have y values greater than 170.

This one has a hidden "and" in it: the people we want have to be in group b and have a y value greater than 170. Thus:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> if group='b' & y>170;
SAS>
SAS> proc print;

Obs id name group y
1 2 Abhishek b 192
2 6 Daniel b 199
3 18 Manci b 190
4 22 Pingchua b 186
5 30 Vinit b 186
6 38 Zi b 185

(f) Put only the names and groups of the people with id less than or equal to 20 into the data set.

A combo this time: we are selecting variables (via keep) and observations:

SAS> data names;
SAS> infile 'names.txt' firstobs=2 expandtabs;
SAS> input id name $ group $ y;
SAS> keep name group;
SAS> if id<=20;
SAS>
SAS> proc print;

Obs name group
1 Abimae a
2 Abhishek b
3 Andrew c
4 Beili d
5 Chao a
6 Daniel b
7 Gary c
8 Gemini d
9 Guanyu a
10 Guilherm b
11 HongYi c
12 Jagjot d
13 Jason a
14 Jingzhu b
15 John c
16 Jonathan d
17 Man a
18 Manci b
19 Mary c
20 Mung d
