STT 825 F 01 Homework #3 Due Friday, October 12

1. (13 points) We took a simple random sample of 30 stores from a list of 290 stores. Data is given below: yi = sales, xi = number of employees in the ith store. NOTE: tx = 2668.

y  73 57 73 41 42 40 52 28 82 58 56 68 66 75 72 51 39 45 39 36 77 36 55 50 73 64 33 67 49 82
x  17 11 14 07 05 03 10 05 17 09 06 15 15 17 17 05 04 04 07 03 18 03 09 05 15 13 04 16 07 15

a. Get the descriptive statistics for the sample, including the sample correlation. (Do this on computer.)

Descriptive Statistics: x, y

Variable   N    Mean   Median   TrMean   StDev   SE Mean
y         30   55.97    55.50    55.92   15.90      2.90
x         30   9.867    9.000    9.808   5.322     0.972

Correlations: x, y

Pearson correlation of x and y = 0.928

b. Compute the unbiased estimator of mean sales, and report its standard error.

Sample mean for y = 55.97, SE = \((1 - 30/290)^{1/2}\,(15.90)/(30)^{1/2} = (.947)(15.90)/(30)^{1/2} = 2.75\)

c. i. Compute the ratio estimator of the mean sales per store.

\(\hat{B} = 55.97/9.867 = 5.67\), \(\hat{y}_R = 5.67\,(2668/290) = 52.19\) (carrying the unrounded \(\hat{B}\))

ii. Compute the standard error of your estimator in (i) in two ways:

I. Use the \(e_i\)'s and find \(s_e\).

Descriptive Statistics: eirat

Variable   N    Mean   Median   TrMean   StDev   SE Mean
eirat     30    0.02    -0.52     0.17   16.52      3.02

Thus, SE(\(\hat{y}_R\)) = \((.9470)(16.52)/(30)^{1/2} = 2.86\)
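The residual-based calculation above can be cross-checked with a short script. This is only a sketch: it uses the 30 (x, y) pairs from the problem statement, the same SE expression as the solution, and variable names chosen here for illustration. Because it carries the unrounded ratio, its results may differ from the hand calculation in the last digit.

```python
import math
import statistics

# Sample data copied from the problem statement (n = 30 stores).
y = [73, 57, 73, 41, 42, 40, 52, 28, 82, 58, 56, 68, 66, 75, 72,
     51, 39, 45, 39, 36, 77, 36, 55, 50, 73, 64, 33, 67, 49, 82]
x = [17, 11, 14, 7, 5, 3, 10, 5, 17, 9, 6, 15, 15, 17, 17,
     5, 4, 4, 7, 3, 18, 3, 9, 5, 15, 13, 4, 16, 7, 15]

N, n, t_x = 290, len(y), 2668   # population size, sample size, known total of x
x_bar_U = t_x / N               # population mean number of employees

B_hat = statistics.mean(y) / statistics.mean(x)   # ratio of the sample means
y_ratio = B_hat * x_bar_U                         # ratio estimate of mean sales

# Residuals about the line through the origin: e_i = y_i - B_hat * x_i.
e = [yi - B_hat * xi for xi, yi in zip(x, y)]
s_e = statistics.stdev(e)                         # sample SD, divisor n - 1

# Same form as the solution: sqrt(1 - n/N) * s_e / sqrt(n).
se_ratio = math.sqrt(1 - n / N) * s_e / math.sqrt(n)

print(f"B_hat={B_hat:.3f}  y_ratio={y_ratio:.2f}  s_e={s_e:.2f}  SE={se_ratio:.2f}")
```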

II. Use the formula (given in text problem 13, page 91) relating SE to the sample statistics.

\(s_e^2 = s_y^2 - 2\hat{B}\, r\, s_x s_y + \hat{B}^2 s_x^2 = (15.90)^2 - 2(5.67)(.928)(5.322)(15.90) + (5.67)^2 (5.322)^2 = (16.52)^2\),

which agrees with the eirat StDev. Thus, SE(\(\hat{y}_R\)) = \((.9470)(16.52)/(30)^{1/2} = 2.86\)

d. Run a simple linear regression on the data.

The regression equation is y = 28.6191 + 2.77171 x

S = 6.02422 R-Sq = 86.1 % R-Sq(adj) = 85.6 %

Analysis of Variance

Source       DF        SS        MS         F       P
Regression    1   6310.81   6310.81   173.894   0.000
Error        28   1016.15     36.29
Total        29   7326.97
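As a cross-check on the Minitab output, the least-squares fit and the regression estimator of mean sales can be reproduced from the raw data. The sketch below uses only the standard library. The variable names are illustrative, and the standard error uses s_e^2 = SSE/(n - 1), which is the same quantity as (n-2) MSE/(n-1).

```python
import math
import statistics

# Sample data copied from the problem statement (n = 30 stores).
y = [73, 57, 73, 41, 42, 40, 52, 28, 82, 58, 56, 68, 66, 75, 72,
     51, 39, 45, 39, 36, 77, 36, 55, 50, 73, 64, 33, 67, 49, 82]
x = [17, 11, 14, 7, 5, 3, 10, 5, 17, 9, 6, 15, 15, 17, 17,
     5, 4, 4, 7, 3, 18, 3, 9, 5, 15, 13, 4, 16, 7, 15]

N, n, t_x = 290, len(y), 2668
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
x_bar_U = t_x / N                                  # known population mean of x

# Least-squares slope and intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Regression estimator of mean sales: y_bar + b1 * (x_bar_U - x_bar).
y_reg = y_bar + b1 * (x_bar_U - x_bar)

# Residual sum of squares, MSE, and s_e^2 = SSE / (n - 1).
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
s_e2 = sse / (n - 1)
se_reg = math.sqrt((1 - n / N) * s_e2 / n)

print(f"y = {b0:.4f} + {b1:.5f} x,  MSE = {mse:.2f}")
print(f"regression estimate = {y_reg:.2f},  SE = {se_reg:.3f}")
```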

i. Report the regression function.

y = 28.6191 + 2.77171x (rounding off is o.k.)

ii. Find the regression estimator of the mean sales.

55.97 + (2.77)(9.2 - 9.867) = 54.12

iii. Use the regression output to find \(s_e^2\) for the regression estimator, and then compute the standard error of your estimator in (ii).

Descriptive Statistics: eireg

Variable   N    Mean   Median   TrMean   StDev   SE Mean
eireg     30   -0.00    -0.72     0.04    5.92      1.08

The StDev, 5.92, is \(s_e\). Alternatively, obtain \(s_e^2 = (n-2)\,\mathrm{MSE}/(n-1) = (28)(36.29)/(29) = 35.04\), which gives \(s_e = 5.92\).

Thus, SE for the regression estimator is \((.947)(5.92)/(30)^{1/2} = 1.024\).

e. Which estimator (unbiased, ratio, regression) would you recommend?

The regression estimator does very well: its standard error (1.024) is the smallest, compared with 2.75 for the unbiased estimator and 2.86 for the ratio estimator. The ratio estimator is the worst because the data (x, y) don't fit a line through the origin very well.

f. In the sample, we recorded the number of employees at each store. If we didn't know such information before sampling, could we stratify by number of employees at a store? Why or why not?

No, you have to know which stratum a store is in and the stratum sizes before you sample.

------

2. (3 points) We took a simple random sample of 300 students from a population of 500 students and measured the amount spent on textbooks and the student's college. Summary statistics are given below:

                                        Number   Mean amount spent   Standard deviation
Sample of all students                     300             $387.20               $67.40
Social science students in the sample       72             $301.15               $61.20

a. Estimate the mean amount spent on textbooks for social science students and report its standard error.

mean = $301.15, SE = \((1 - n/N)^{1/2}\,\mathrm{sd}/(n_d)^{1/2} = (.6325)(61.2)/(72)^{1/2} = 4.56\), where \(n_d = 72\) is the number of social science students in the sample.

b. Is the number of students in social science in the sample random or fixed? random

c. Is the group of social science students a domain or a stratum? Domain

------

3. (5 points) Consider a simplified version of problem #3, Exam 1. Suppose we have 50 flats of up to 6 plants each.

a. A simple random sample of 4 labels will be taken from labels {1,2,…,50}. Suppose that Flat #50 was too damaged to use, but all the others were fine to use. If Flat #50 was selected, the technician would just take Flat #49 instead. Show this is NOT an SRS of {1,2,…,49} by giving two units that have unequal chances of being in the sample.

#49 has a greater chance of being in the sample than any of #1, …, #48. At stage 1 (the first draw), the chance of selecting #49 is 2/50, since either label 49 or label 50 yields Flat #49, while the chance of selecting, say, #1 is 1/50.

b. Now assume no damaged flats, and suppose we sampled every 10th flat in the list of 50 (giving a sample of size 5), but started the selection at random.

i. Show this is not an SRS by finding a sample which is possible under SRS but not possible under this scheme.

Any non-systematic sample of size 5, such as {1,2,3,4,5}.

ii. Find the sampling weight of Unit #1.

1/P(Unit #1 is selected) = 50/5 = 10.

c. Suppose we decided to sample plants rather than flats. Select a simple random sample of 3 flats, then select 2 plants at random from each selected flat. (Assume no damaged flats and at least 2 plants per flat.)

i. Show this is not an SRS by finding a sample which is possible under SRS but not possible under this scheme.

Any sample containing more than 2 plants from one flat is possible under SRS but impossible under this scheme.

ii. Is this Stratified Random Sampling? NO

d. Again, we will sample plants. Suppose we select 2 plants at random from each of the 50 flats (thus getting a sample of 100 plants). What kind of sampling is this? (Assume no damaged flats and at least 2 plants per flat.) [SRS, Stratified Random Sampling, Systematic Sampling, none of these]

------

4. (7 points) Do text problem 3a (Do not do 3b), Chapter 4. Use stratum sizes 5713, 1272, 1288, 5072, which are computed by Area/.039.

Data Display

Row   capNh   nh   ybar_h   vhat_h    that_h   Vt_hat_h     t_hat     v_hat
  1    5713    4     0.44    0.068   2513.72     554464   18180.5   5830802
  2    1272    6     1.17    0.042   1488.24      11272
  3    1288    3     3.92    2.146   5048.96    1183934
  4    5072    5     1.80    0.794   9129.60    4081132

Estimate the total number of bushels of clams as 18,180.5, SE = \((5830802)^{1/2} = 2414.7\)

Also answer the following questions:

b. What are the sampling units? tows

c. What is the sampling weight for a tow in stratum #4? \(N_4/n_4 = 5072/5 = 1014.4\)

d. Compute a 95% t-confidence interval for the total yield.

\(18180.5 \pm t_{.025;14}\,(2414.7)\), which is \(18180.5 \pm (2.145)(2414.7)\), which is \(18180.5 \pm 5179.5\)
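The stratified total, its standard error, and the interval half-width can be recomputed from the stratum summaries in the Data Display. This is a sketch under one assumption: the vhat_h column is taken to hold the stratum sample variances s_h^2, which reproduces the Vt_hat_h column through (1 - nh/Nh) Nh^2 s_h^2 / nh. The critical value 2.145 is copied from the solution rather than computed.

```python
import math

# (N_h, n_h, ybar_h, s2_h) per stratum, copied from the Data Display.
# Assumption: the column labelled vhat_h is the sample variance s_h^2.
strata = [
    (5713, 4, 0.44, 0.068),
    (1272, 6, 1.17, 0.042),
    (1288, 3, 3.92, 2.146),
    (5072, 5, 1.80, 0.794),
]

# Estimated total: sum of N_h * ybar_h.
t_hat = sum(N_h * ybar for N_h, _, ybar, _ in strata)

# Estimated variance: sum of (1 - n_h/N_h) * N_h^2 * s_h^2 / n_h.
v_hat = sum((1 - n_h / N_h) * N_h**2 * s2 / n_h for N_h, n_h, _, s2 in strata)
se = math.sqrt(v_hat)

# Degrees of freedom used in the solution: total sample size minus number of strata.
df = sum(n_h for _, n_h, _, _ in strata) - len(strata)
t_crit = 2.145                      # t_{.025;14}, taken from the solution
half_width = t_crit * se

print(f"t_hat={t_hat:.1f}  v_hat={v_hat:.0f}  SE={se:.1f}  df={df}")
print(f"95% CI: {t_hat - half_width:.1f} to {t_hat + half_width:.1f}")
```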

e. Comment on use of the t in part (d). The total sample size is only 18 and the stratum sample sizes are quite small, so the validity of the confidence level reported in (d) is questionable.

------

5. (8 points) We have a population of 500 accounts, stratified by balance of the account:

Stratum 1: balance $0 up to $500
Stratum 2: balance $500 up to $2000
Stratum 3: balance $2000 and up.

Yesterday's fees were measured on all accounts, and population characteristics are

Stratum      Number of accounts   Mean fee    Standard deviation
1                           150      $2.05                 $2.07
2                           300      $3.93                 $2.75
3                            50      $8.22                 $4.03
population                  500   (a) below                $3.21

a. i. Find the mean fee for all 500 accounts. = wtd average of the stratum means = 3.789 or 3.80 is o.k. (this is a population mean)

ii. Find the total fees for all 500 accounts. 500 x your answer in (i) = 500 (3.789) = 1894.5 or 1900 is o.k.

b. i. Give proportional allocation of a sample of n=100 accounts.

h   Nh/N            nh
1   150/500 = .3    30
2   300/500 = .6    60
3    50/500 = .1    10

ii. Compute \(V[\hat{t}_{str}]\).

h   \((1 - n_h/N_h)\,N_h^2 S_h^2/n_h\)
1   2570.6
2   9075
3   3248.2

sum = \(V[\hat{t}_{str}]\) = 14,894 (there may be some rounding error)

iii. Give the sampling weight for an observation from Stratum #1. N1/n1 = 150/30 = 5

c. i. For a simple random sample of size 100, compute \(V[\hat{t}]\).

\((1 - 100/500)\,(500)^2 (3.21)^2/100 = 20,608.2\)

ii. Give the sampling weight for an observation from Stratum #1. N/n = 500/100 = 5
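The calculations in this problem can be sketched in a few lines. This covers the weighted mean, the proportional allocation, and the two variances of the estimated total. The inputs are the rounded values from the population table, so the weighted mean computed this way need not match the 3.789 quoted in (a) exactly. The variable names are illustrative.

```python
# (N_h, mean fee, S_h) per stratum, copied from the population table.
strata = {1: (150, 2.05, 2.07), 2: (300, 3.93, 2.75), 3: (50, 8.22, 4.03)}
N = sum(N_h for N_h, _, _ in strata.values())    # 500 accounts
n = 100                                          # total sample size
S_pop = 3.21                                     # population SD of the fees

# Population mean as the N_h-weighted average of the stratum means.
mean_fee = sum(N_h * mu for N_h, mu, _ in strata.values()) / N

# Proportional allocation: n_h = n * N_h / N.
alloc = {h: round(n * N_h / N) for h, (N_h, _, _) in strata.items()}

# Variance of the stratified total: sum of (1 - n_h/N_h) * N_h^2 * S_h^2 / n_h.
v_str = sum((1 - alloc[h] / N_h) * N_h**2 * S_h**2 / alloc[h]
            for h, (N_h, _, S_h) in strata.items())

# Variance of the SRS total: (1 - n/N) * N^2 * S^2 / n.
v_srs = (1 - n / N) * N**2 * S_pop**2 / n

print(f"mean fee = {mean_fee:.3f}, allocation = {alloc}")
print(f"V[t_str] = {v_str:.0f}, V[t_srs] = {v_srs:.1f}, ratio = {v_str / v_srs:.2f}")
```

A ratio below 1 indicates that, for these population values, the proportionally allocated stratified design gives a more precise estimate of the total than an SRS of the same size.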

------

6. (12 points) Do text problem 10, Chapter 4.

a. Convert the data back to the "raw" form: Stratum 1: observations 0,1,1,3,5,5,7. Likewise for the other strata. Then compute the descriptive statistics by stratum.

Stratum   nh    Mean      sh     that_h   Vhat[that_h]
1          7   3.143   2.610    320.586         9429.9
2         19   2.105   2.865    652.55         38944.6
3         13   1.231   2.088    267.127        14845.9
4         11   0.455   0.934     80.99          2357.4
Total                          1321.25        65,577.8

Estimated total is 1321.25 (or 1321), SE of the estimated total is \((65,577.8)^{1/2} = 256.1\)

b. SRS: estimated total = 1436.46, SE of estimated total = 296.2. Both the estimated total and its SE are lower for the stratified sample (better precision).

c. This uses proportions.

Stratum   phat_h          Nh phat_h   Vhat[phat_h]
1         1/7 = .143          14.57       .0003037
2         10/19 = .526       163.16       .0019185
3         9/13 = .692        150.23       .0012066
4         8/11 = .727        129.45       .0009053
Total                       457.415       .0043342

The estimated proportion = 457.415/807 = .57, its SE = \((.0043342)^{1/2} = .0658\)

d. Yes (comparing (a) and (b)). Also, the number of publications seems to be affected by area.

------

7. (2 points) Refer to text problem 15, Chapter 4. We did part (a) in class. Answer part (b).

b. Selection Bias:
+ There may be otter dens in strips with buildings (these strips were omitted from the sampled population).
+ Searching only within 110 meters of the shore may miss dens farther back.
(may be others)

Measurement Bias:
+ Dens may be difficult to find (missed in the count).
+ Weather may affect ability to count dens.
(may be others)

No, it is doubtful that all selection and measurement bias can be avoided.