In-Class Exercise
PROBLEM SET 3 - STATS 210    Due: July 29, 2004

1. A group of researchers was studying lifetime lead exposure and IQ in 7-year-old children. They classified 100 children as having “high,” “medium,” or “low” exposure based on the lead concentrations in their teeth (using baby teeth that had fallen out). They then measured each child's IQ on a standard, age-appropriate IQ test, whose scores are known to be normally distributed in the population. The following multiple linear regression equation resulted:

IQ = 100 + (-2)*(1 if exposure=medium, 0 if low or high) + (-10)*(1 if exposure=high, 0 if low or medium)

Regression coefficients:
$\hat{\beta}_1 = -2$; $\mathrm{s.e.}(\hat{\beta}_1) = 0.5$
$\hat{\beta}_2 = -10$; $\mathrm{s.e.}(\hat{\beta}_2) = 0.8$

a. What is the p-value for the test of the null hypothesis that $\beta_1 = 0$? That $\beta_2 = 0$?

Note: An estimated regression coefficient is a statistic (like a mean or a difference in means), and thus has a sampling distribution. The sampling distribution of an estimated regression coefficient is a t-distribution:

$\hat{\beta} \sim T_{n-k}(\beta, \mathrm{s.e.}(\hat{\beta}))$

where n is the sample size; k is the number of estimated coefficients in the model, including the intercept; $\beta$ is the “true” slope between the predictor and the outcome; and the standard error of $\hat{\beta}$ represents the variability that we expect to see in estimates of $\beta$ based on samples of size n.

Correspondingly, the null distribution of $\hat{\beta}$ is

$\hat{\beta} \sim T_{n-k}(0, \mathrm{s.e.}(\hat{\beta}))$

(a slope of 0 indicates no relationship between the predictor and the outcome).

b. Briefly, give your interpretation of the results.

2. The table below gives the winning times in the men’s Olympic marathon since the beginning of the modern Olympics.
The data are also available electronically at the following URL: http://www.infoplease.com/ipsa/A0115025.html and from the class website: www.stanford.edu/~kcobb/stats210 -> problem set three data

Year  Winner                       Time
1896  Spiridon Louis, GRE          2:58:50
1900  Michel Théato, FRA           2:59:45
1904  Thomas Hicks, USA            3:28:53
1906  Billy Sherring, CAN          2:51:23.6
1908  Johnny Hayes, USA*           2:55:18.4
1912  Kenneth McArthur, S. Afr.    2:36:54.8
1920  Hannes Kolehmainen, FIN      2:32:35.8
1924  Albin Stenroos, FIN          2:41:22.6
1928  Boughèra El Ouafi, FRA       2:32:57.0
1932  Juan Carlos Zabala, ARG      2:31:36.0
1936  Sohn Kee-Chung, JPN†         2:29:19.2
1948  Delfo Cabrera, ARG           2:34:51.6
1952  Emil Zátopek, CZE            2:23:03.2
1956  Alain Mimoun, FRA            2:25:00.0
1960  Abebe Bikila, ETH            2:15:16.2
1964  Abebe Bikila, ETH            2:12:11.2
1968  Mamo Wolde, ETH              2:20:26.4
1972  Frank Shorter, USA           2:12:19.8
1976  Waldemar Cierpinski, E. Ger  2:09:55.0
1980  Waldemar Cierpinski, E. Ger  2:11:03.0
1984  Carlos Lopes, POR            2:09:21.0
1988  Gelindo Bordin, ITA          2:10:32
1992  Hwang Young-Cho, S. Kor      2:13:23
1996  Josia Thugwane, RSA          2:12:36
2000  Gezahenge Abera, ETH         2:10:11

a) Input the data into a SAS dataset. To get SAS to recognize the winning time as a SAS time variable, follow your time variable name with the informat “time10.” in the input statement. E.g.:

    input wintime time10.;

b) Plot time vs. year (time on the y-axis and year on the x-axis). To help you practice formatting graphs, I’ve provided you with the SAS code. Feel free to experiment with the code to change the plotting symbol, labels, titles, colors, etc.
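The provided plotting code expects a dataset named marathon with numeric variables year and time. As a minimal sketch of how part (a) might be set up first, the data step below types in only the first three rows and leaves out the winner's name. The colon modifier in front of the informat and the choice to name the variable time (rather than wintime, as in the example above) are assumptions made here so that the sketch lines up with the plotting code; they are not part of the assignment.

```sas
/* Sketch only: three rows shown, winner name omitted (assumption). */
data marathon;
    /* The colon modifier reads each value up to the next blank,      */
    /* then applies the time10. informat, so times of different       */
    /* widths (2:58:50 vs. 2:51:23.6) are read correctly.             */
    input year time :time10.;
    format time time10.;     /* display stored seconds as h:mm:ss */
    datalines;
1896 2:58:50
1900 2:59:45
1904 3:28:53
;
run;

/* Check that the times were read as SAS time values. */
proc print data=marathon;
run;
```

SAS stores a time value as a number of seconds, which is why the vertical axis in the plotting code runs from 7000 to 13000.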
    /* Vertical axis: winning time in seconds, label rotated 90 degrees */
    axis1 order=(7000 to 13000 by 1000)
          label=(height=4pct font='Times New Roman' angle=90);
    /* Horizontal axis: year of competition */
    axis2 order=(1890 to 2000 by 10)
          label=(height=4pct font='Times New Roman');
    /* Plot each point as a blue circle */
    symbol1 v=circle c=blue h=1 w=1;

    proc gplot data=marathon;
        title 'Graph for problem 2, part (b)';
        format time time10.;
        label time='Winning time';
        plot time*year / vaxis=axis1 haxis=axis2 vminor=1 hminor=1;
    run; quit;

c) Fit a linear regression model to examine the relationship between year of competition and winning time. Write out the resulting model. Interpret.

d) Examine the residuals (observed value minus predicted value). Do any years appear to be outliers?

e) Based on your model in (c), predict the winning marathon time in the year 2050. Does this make sense? Why or why not? What’s the problem with using linear regression for these data?

f) Add continent to your model as a predictor of winning time (group Asia and Europe together and North and South America together). Write out your final model and give a brief interpretation.

3. The following data were collected to examine possible associations between being breast-fed as an infant and being overweight or obese as an adult. Researchers asked 1486 mothers of 18-year-old men about the total duration that they had breast-fed their sons and then took weight and height measurements of the sons. The data are presented below:

Anthropometry and body composition of 18-year-old men according to total duration of breast-feeding. Values are number (percent) or mean (SD).

                               Duration of total breast-feeding (months)
Outcomes                       <1          1-6         6-12        ≥12
Number overweight (%)          69 (14.1)   130 (24.7)  50 (33.3)   44 (13.7)
Number obese (%)               47 (9.6)    69 (13.1)   31 (20.7)   27 (8.4)
BMI, kg/m^2, mean (SD)         22.5 (3.8)  22.7 (3.6)  22.9 (3.2)  23.0 (3.5)
Total number of participants   489         526         150         321

a. Which statistical test would you use to determine whether or not breast-feeding duration is related to overweight at 18?

b. Which statistical test would you use to determine whether or not breast-feeding duration is related to obesity at 18?

c. Which statistical test would you use to determine whether or not breast-feeding duration is related to BMI at 18?

d. Perform any one of the above statistical tests and draw a conclusion.
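
For part (d), summary counts like those in the table can be entered directly rather than as one row per participant. The sketch below shows one possible setup, assuming the chosen test is a chi-square test of independence between breast-feeding duration and overweight. The dataset name, the variable names, and the label "12+" for the last duration column are made up for illustration. The "no" counts are not in the table; they are obtained by subtracting the overweight counts from the column totals (for example, 489 - 69 = 420).

```sas
/* Sketch only: names and category labels are illustrative assumptions. */
data bf_overweight;
    length duration $5 overweight $3;
    input duration $ overweight $ count;
    datalines;
<1    yes  69
<1    no   420
1-6   yes  130
1-6   no   396
6-12  yes  50
6-12  no   100
12+   yes  44
12+   no   277
;
run;

proc freq data=bf_overweight order=data;
    weight count;                                 /* each row stands for 'count' participants */
    tables duration*overweight / chisq expected;  /* chi-square test plus expected cell counts */
run;
```

The same layout would work for obesity by swapping in the obese counts; the BMI question in part (c) involves means rather than counts and calls for a different kind of test.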