Sociology 208 - STATISTICS for SOCIOLOGISTS - Fall 1996

University of North Carolina at Chapel Hill

Professor François Nielsen

Assignment 4 - Wednesday 23 October DUE Monday 4 November

Value of p for Problem 19: p =

This assignment covers statistical sampling (Chapter 9), point estimation and the sampling distribution of the sample mean (Chapter 10), and interval estimation (confidence intervals) for the population mean (Chapter 11).

From Neter, Wasserman, and Whitmore:

 1. 9.2 p. 251 (finite or infinite populations)
 2. 9.10 p. 252 (sampling and nonsampling error)
 3. 9.32 p. 255 (jam jars, diagnostics for infinite population) Do this one on the computer.
 4. 9.36 p. 256 (assembly plants, statistic versus parameter)
 5. 10.7 p. 282 (aptitude scores, sampling distribution)
 6. 10.13 p. 282 (n large, sample mean above or below μ)
 7. 10.15 p. 282 (sampling fraction)
 8. 10.18 p. 283 (aptitude scores, sampling distribution)
 9. 10.21 p. 283 (form of sampling distribution)
10. 10.28 p. 284 (anticipated travel expenditures, unbiasedness)
11. 10.31 p. 284 (efficiency)
12. 10.35 p. 285 (rainfall, bias, efficiency, mean squared error)
13. 10.42 p. 286 Optional (sampling distribution of Bernoulli trials) Hint: derive the probability of the few different values that Xbar can have.
14. 11.3 p. 309 (trade association, large sample)
15. 11.9 p. 310 (nutrition study, small sample; answer is in back of book)
16. 11.11 p. 310 (teams, small sample)
17. 11.17 p. 311 (prescriptions, planning n)
18. 11.22 p. 312 (nutrition study, lower confidence limit)

BY COMPUTER

19. This problem is a replication for the sample proportion of the Monte Carlo experiments of Neter, Wasserman, and Whitmore for the sample mean. NW&W use Monte Carlo experimentation in the accounts receivable example (pp. 262-265) to show the behavior of the sampling distribution of the sample mean as a function of the sample size and illustrate the central limit theorem, and to show the behavior of the confidence interval for the mean (pp. 292-293). In this project, you will use similar techniques to investigate the sampling distribution of the sample proportion p^ as a function of sample size and as a function of the population value of p, and the behavior of the confidence interval for p. It will give you a feel for the way in which computer simulation is used for research in quantitative methodology. You will need a formatted floppy disk for the files that you create. Jim Kirby can help you work with the computer, and will make copies of the program PHAT.CMD available.

The program PHAT.CMD listed below simulates the drawing of 600 samples for each of the sample sizes n = 5, 20, and 100 from a population in which elements have a certain characteristic with probability p. With the three values of n, the total number of samples is 3x600=1,800. For each sample the program calculates the sample proportion p^ (phat5, phat20, phat100), the estimated standard error of p^ (sep5, sep20, sep100), the 90 percent confidence interval for p based on the normal approximation (represented by the bounds L90, U90, which are discarded after each use by the program), and whether the true value of p (which is of course known) is contained in this confidence interval (out5, out20, and out100, where out = 1 if p is outside the calculated confidence interval, out = 0 otherwise). Your task is to run the program using the value of p that has been assigned to you and your own personal seed to initialize the random number generator of Systat. You need only modify the statement rseed=313, replacing 313 by any integer you choose between 1 and 30000, and the statement let p=0.5, replacing 0.5 by the value assigned to you (if it is different from 0.5).
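The calculation that the program performs on each sample can be illustrated with a small worked example. The sketch below is written in Python rather than Systat, and the function name sample_statistics and the sample of 2 successes out of 5 are hypothetical choices made only for illustration; the formulas and the 1.645 multiplier are the ones used in PHAT.CMD.

```python
import math

Z90 = 1.645  # multiplier used in PHAT.CMD for a 90 percent interval


def sample_statistics(successes, n, p):
    """Return (phat, sep, lower, upper, out) for one sample.

    successes: number of sampled elements having the characteristic
    n:         sample size
    p:         true population proportion (known in a simulation)
    """
    phat = successes / n                      # sample proportion
    sep = math.sqrt(phat * (1 - phat) / n)    # estimated standard error of phat
    lower = phat - Z90 * sep                  # plays the role of L90
    upper = phat + Z90 * sep                  # plays the role of U90
    out = 1 if (p < lower or p > upper) else 0
    return phat, sep, lower, upper, out


# Hypothetical sample: 2 of 5 elements have the characteristic, true p = 0.5
phat, sep, lower, upper, out = sample_statistics(2, 5, 0.5)
print(round(phat, 3), round(sep, 3), round(lower, 3), round(upper, 3), out)
# prints: 0.4 0.219 0.04 0.76 0
```

Note that when a sample contains no successes or only successes, the estimated standard error is zero and the interval collapses to a single point, so out = 1 unless p equals that point. With n = 5 such samples are not rare, which is worth keeping in mind when interpreting out5.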

To do the simulation, run Systat for Windows. From the Systat main window, click on Window/Notepad/File/Open Text File Open. Modify the text of the program as explained in the previous paragraph: edit the rseed = ... line; the let p = 0.5 line; finally, replace the file name in the save phat line as appropriate, e.g. with save 'a:myphat.sys' to save the file to your diskette in drive a:. When you are done editing the program, save it with File/Save Text File, and click on the upper left corner of the window to exit the notepad. Then to run the program, from Systat's main window click on File/submit file Submit. You should see the sentence 'Working on case...' appear in the main window. Wait until the 600 cases are created. After running the program successfully, you should have produced a file phat.sys with 600 observations each with p^ estimates and related statistics for samples of sizes 5, 20, and 100, respectively. The size of phat.sys is about 50K.

Do the following analyses with the data in PHAT.SYS and answer the questions based on the results:

a. Calculate basic statistics (mean, standard deviation, etc.) for all variables in the file.

b. Use the output to fill in the blank table appended to this assignment. In the first row, for each sample size, fill in the true value of p, the one you have used in the simulation. In the second row report the mean of p^ for each sample size (5, 20, 100), and comment on how close it is to the true proportion p. In the third row, enter the theoretical value of the standard error of p^, using the appropriate formula based on the value of p and the sample size n. In the fourth row of the table, enter the standard deviation of p^ for each sample size (i.e., the standard deviation of phat5, phat20, phat100). This is an estimate of the standard deviation of the sampling distribution of p^. How does the standard deviation of p^ behave as a function of n? How close are the estimated and the theoretical standard errors of p^ for each sample size? In the fifth row enter the mean of the estimated standard error of p^ (i.e., the mean of sep5, sep20, sep100), which is another estimate of the standard error of the sampling distribution of p^. How does it compare to the theoretical value of the standard error in row 3?

c. Look on the screen at graphs depicting the distribution of p^ for each sample size, and print a hard copy of the graph for the distribution of p^ with n=100 only (to save paper). (You can use histograms or other types of graphs; bar charts and dit plots work in this case too; if you use histograms, you can use the smooth option to superimpose a normal curve on the histogram, if you wish.) Comment on the shape of the distribution of p^ as a function of sample size. Does this comparison support the central limit theorem?

d. In row 6 of the summary table enter the means of out5, out20, and out100. Each mean is an estimate of the proportion of calculated confidence intervals around p^ that do not include the true value of p. How does the proportion of invalid confidence intervals vary as a function of n? Relate the values of the mean of out to whether the experimental conditions (values of n and p) satisfy the rule of thumb for the normal approximation to the sampling distribution of the sample proportion (see NW&W p. 368).
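For row 3 of the table, the theoretical standard error of p^ is sqrt(p(1-p)/n). The Python sketch below evaluates it for the three sample sizes. It also checks a rule of thumb for the normal approximation, but note that the assignment does not reproduce the rule given on NW&W p. 368: the version used here (both np and n(1-p) at least 5) is a commonly cited form and is only an assumption, as are the function names.

```python
import math


def theoretical_se(p, n):
    """Standard deviation of the sampling distribution of p^: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def normal_approx_ok(p, n, threshold=5):
    """ASSUMED rule of thumb: n*p and n*(1-p) both at least `threshold`.

    Check NW&W p. 368 for the rule actually intended in the assignment.
    """
    return n * p >= threshold and n * (1 - p) >= threshold


p = 0.5  # replace with the value of p assigned to you
for n in (5, 20, 100):
    print(n, round(theoretical_se(p, n), 4), normal_approx_ok(p, n))
```

With p = 0.5 the theoretical standard errors are about 0.2236, 0.1118, and 0.05, so each fourfold or fivefold increase in n shrinks the standard error by the square root of that factor.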

LISTING OF PHAT.CMD

new
save phat
rseed=313
REM replace 313 by any integer between 1 and 30000
repeat 600

let p=0.5
let casenum=case
print 'Working on case' casenum

REM sample of size 5
let n=5
let phat5=0
for i=1 to n
   if urn <= p then let phat5 = phat5+1
next
let phat5 = phat5/n
let sep5 = sqr(phat5*(1-phat5)/n)
let L90 = phat5 - 1.645*sep5
let U90 = phat5 + 1.645*sep5
let out5 = 0
if p < L90 or p > U90 then let out5 = 1

REM sample of size 20
let n=20
let phat20=0
for i=1 to n
   if urn <= p then let phat20 = phat20+1
next
let phat20 = phat20/n
let sep20 = sqr(phat20*(1-phat20)/n)
let L90 = phat20 - 1.645*sep20
let U90 = phat20 + 1.645*sep20
let out20 = 0
if p < L90 or p > U90 then let out20 = 1

REM sample of size 100
let n=100
let phat100=0
for i=1 to n
   if urn <= p then let phat100 = phat100+1
next
let phat100 = phat100/n
let sep100 = sqr(phat100*(1-phat100)/n)
let L90 = phat100 - 1.645*sep100
let U90 = phat100 + 1.645*sep100
let out100 = 0
if p < L90 or p > U90 then let out100 = 1

drop L90, U90, casenum, n
run

Summary of Monte Carlo Simulation of Interval Estimation of a Population Proportion

                               n=5         n=20        n=100
                          ┌───────────┬────────────┬────────────┐
1. True p                 │           │            │            │
                          ├───────────┼────────────┼────────────┤
2. Mean p^                │           │            │            │
                          ├───────────┼────────────┼────────────┤
3. Theoretical σ{p^}      │           │            │            │
                          ├───────────┼────────────┼────────────┤
4. SD of p^               │           │            │            │
                          ├───────────┼────────────┼────────────┤
5. Mean estimated SE{p^}  │           │            │            │
                          ├───────────┼────────────┼────────────┤
6. % outside CI for p     │           │            │            │
                          └───────────┴────────────┴────────────┘
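As a cross-check on the Systat output, the six rows of the table can be approximated with an independent simulation. The following is a rough Python sketch, not a translation of PHAT.CMD: the function name summarize and the dictionary keys are hypothetical, Python's random number generator differs from Systat's (so the numbers will not match a Systat run even with the same seed), and row 6 is expressed as a percentage (100 times the mean of out) on the assumption that this is what the row label intends.

```python
import math
import random
import statistics


def summarize(n, p, rng, reps=600):
    """Simulate `reps` samples of size n and return the six table entries."""
    phats, seps, outs = [], [], []
    for _ in range(reps):
        # Count elements with the characteristic, as `if urn <= p` does
        phat = sum(rng.random() <= p for _ in range(n)) / n
        sep = math.sqrt(phat * (1 - phat) / n)
        outside = p < phat - 1.645 * sep or p > phat + 1.645 * sep
        phats.append(phat)
        seps.append(sep)
        outs.append(1 if outside else 0)
    return {
        "true_p": p,                                   # row 1
        "mean_phat": statistics.mean(phats),           # row 2
        "theoretical_se": math.sqrt(p * (1 - p) / n),  # row 3
        "sd_phat": statistics.stdev(phats),            # row 4
        "mean_sep": statistics.mean(seps),             # row 5
        "pct_outside": 100 * statistics.mean(outs),    # row 6 (assumed percent)
    }


rng = random.Random(313)  # plays the role of rseed=313
for n in (5, 20, 100):
    print(n, summarize(n, 0.5, rng))
```

If the normal approximation holds, row 6 should be near 10 percent for a 90 percent interval; large departures at small n are the pattern part d asks you to explain.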
