Sociology 6Z03 Topic 16: Statistical Inference for Proportions
John Fox
McMaster University
Fall 2016
John Fox (McMaster University) Soc 6Z03: Inference for Proportions Fall 2016 1 / 39
Outline: Statistical Inference for Proportions
Introduction
A Single Proportion
Two Proportions from Independent Samples
Introduction
Statistical inference for population proportions follows a very similar pattern to inference for population means. In this lecture we’ll learn how to construct confidence intervals and perform hypothesis tests for an individual proportion and for the difference between two proportions from independent random samples. In the next lecture, we’ll learn how to test for differences among proportions from several independent samples, and to test more generally for relationships in contingency tables. Because many social processes result in dichotomous (two-category) outcomes, statistical inference for proportions is used frequently in practice.
Introduction Preliminary Examples: A Single Proportion
In a poll taken five days before the October 30, 1995 Quebec sovereignty referendum, 53.5 percent of 959 decided voters said that they planned to vote “yes” (i.e., for separation) in the referendum. Putting aside the possibility of nonsampling errors, can we conclude that more than half of decided voters in the population planned to vote “yes” five days before the referendum?
Introduction Preliminary Examples: Two Proportions
In an independent poll of voters by the same polling firm taken roughly two months before the referendum, 45.1 percent of 822 decided voters said that they planned to vote “yes.” The difference in sample proportions is .535 − .451 = .084. That is, support for the “yes” option was 8.4 percentage points higher in the later poll. Are we justified in concluding that “yes” support increased in the population over the approximately two-month period between the two polls?
Introduction Preliminary Examples: Several Proportions
Three different polling firms conducted polls of voters within five days of the referendum:
- In a poll conducted by the Reid organization, 52.4 percent of 865 decided voters said that they planned to vote “yes”;
- In a poll conducted by the SOM organization, 53.5 percent of 959 decided voters said that they planned to vote “yes”;
- In a poll conducted by the firm of Léger and Léger, 53.1 percent of 884 decided voters said that they planned to vote “yes.”
Are these differences too large to ascribe to chance (suggesting different nonsampling errors among the polling firms), or could the differences easily have been the product of sampling variation?
A Single Proportion Notation
A population proportion is represented by $p$ (note that a Greek letter is not used). The corresponding sample proportion is
$$\hat{p} = \frac{\text{number of successes in the sample}}{\text{number of observations in the sample}}$$
(read “p-hat”), where a “success” is one of two possible outcomes, such as planning to vote “yes” as opposed to “no” in the referendum.
A Single Proportion The Sampling Distribution of Sample Proportions
How does the sample proportion $\hat{p}$ behave as an estimator of the population proportion $p$? The count of successes $X$ follows a binomial distribution, and because the sample proportion is just $\hat{p} = X/n$, its distribution follows directly:
$$P\left(\hat{p} = \frac{k}{n}\right) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$
The mean of the sampling distribution of $\hat{p}$ is exactly $p$. The standard deviation of the sampling distribution of $\hat{p}$ depends upon $p$, and in an SRS is approximately equal to
$$\sqrt{\frac{p(1 - p)}{n}}$$
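The two facts above can be checked numerically. The following is a minimal sketch, not part of the original lecture: it builds the exact distribution of $\hat{p}$ from the binomial formula and compares its mean and standard deviation with $p$ and $\sqrt{p(1-p)/n}$. The values `n = 50` and `p = 0.3` and the helper name `phat_distribution` are arbitrary choices made for illustration.

```python
from math import comb, sqrt

def phat_distribution(n, p):
    """Return {k/n: P(p_hat = k/n)} for a binomial count X ~ Bin(n, p)."""
    return {k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

n, p = 50, 0.3  # hypothetical values, chosen only for illustration
dist = phat_distribution(n, p)

# Mean and variance of p_hat computed directly from its exact distribution
mean = sum(value * prob for value, prob in dist.items())
variance = sum((value - mean) ** 2 * prob for value, prob in dist.items())

print(f"mean of p_hat = {mean:.4f}")             # compare with p
print(f"SD of p_hat   = {sqrt(variance):.4f}")   # compare with the formula below
print(f"formula SD    = {sqrt(p * (1 - p) / n):.4f}")
```

Because this sketch draws from an idealized binomial model (sampling with replacement, in effect), the formula matches exactly here; in a real SRS from a finite population it is an approximation.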
A Single Proportion The Sampling Distribution of Sample Proportions
Thought Question
TRUE or FALSE: The sample proportion $\hat{p}$ is an unbiased estimator of the population proportion $p$.
A. TRUE
B. FALSE
C. I don’t know
A Single Proportion The Sampling Distribution of Sample Proportions
Thought Question
TRUE or FALSE: As the sample size $n$ grows, the standard deviation of the sample proportion, $SD(\hat{p}) = \sqrt{p(1 - p)/n}$, gets larger.
A. TRUE
B. FALSE
C. I don’t know
A Single Proportion The Sampling Distribution of Sample Proportions: Normal Approximation
In a relatively large simple random sample from a much larger population, it is more convenient to use the normal approximation to the binomial distribution:
The sampling distribution of $\hat{p}$ is approximately normal. The approximation gets better as the sample size $n$ grows, and is better when the population proportion $p$ is close to .5 than when it is close to 0 or 1. Here are two useful rules:
1. The normal distribution is a good-enough approximation to the sampling distribution of $\hat{p}$ when the sample size is sufficiently large (see below).
2. The standard deviation of $\hat{p}$ is well-enough approximated by $\sqrt{p(1 - p)/n}$ when the population is at least 10 times larger than the sample. This condition is almost always met.
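To see how the quality of the normal approximation depends on $n$ and $p$, the sketch below compares the exact binomial cumulative distribution of $\hat{p}$ with the normal curve and reports the largest discrepancy. It is an illustration under assumed values (the sample sizes and proportions are hypothetical, and `max_cdf_error` is a name introduced here), and it uses no continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

def max_cdf_error(n, p):
    """Largest gap between the exact CDF of p_hat and its normal approximation."""
    approx = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
    cumulative, worst = 0.0, 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)  # exact P(p_hat <= k/n)
        worst = max(worst, abs(cumulative - approx.cdf(k / n)))
    return worst

# Hypothetical settings chosen only to illustrate the two claims in the text
for n, p in [(100, 0.5), (400, 0.5), (100, 0.05)]:
    print(f"n = {n:3d}, p = {p:.2f}: max error = {max_cdf_error(n, p):.3f}")
```

The error should shrink when $n$ rises from 100 to 400 at $p = .5$, and should be larger at $p = .05$ than at $p = .5$ for the same $n$.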
A Single Proportion The Sampling Distribution of Sample Proportions
These results provide a basis for performing statistical inference about proportions, but they cannot be applied directly, because, without knowledge of the population proportion $p$, we cannot calculate the standard deviation of $\hat{p}$. We will proceed a little differently for confidence intervals and hypothesis tests.
A Single Proportion Confidence Interval for p
We need to replace the standard deviation of $\hat{p}$ with an estimate, the standard error of $\hat{p}$. To do this, we’ll substitute $\hat{p}$ for $p$ in the formula for the standard deviation of $\hat{p}$:
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Then, the confidence interval for $p$ has the form
$$\hat{p} \pm z^* \, SE(\hat{p})$$
where $z^*$ is the critical value from the standard normal distribution corresponding to the desired level of confidence.
A Single Proportion Confidence Interval for p
Thought Question
What is the value of $z^*$ for a 95 percent confidence interval?
A. 1.0
B. 1.65
C. 1.96
D. 2.58
E. I don’t know
A Single Proportion Confidence Interval for p: Example
To illustrate, in a poll conducted five days before the Quebec referendum by the SOM organization, 513 of 959 decided voters reported an intention to vote “yes” in the referendum. On the basis of this result, let us construct a 95 percent confidence interval for the population proportion $p$ intending to vote “yes.” Here,
$$\hat{p} = \frac{513}{959} = .535$$
and
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{.535(1 - .535)}{959}} = 0.0161$$
The 95 percent confidence interval is
$$\hat{p} \pm z^* \, SE(\hat{p}) = .535 \pm 1.96 \times 0.0161 = .535 \pm .032 = (.503, .567)$$
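The same calculation can be sketched in a few lines of Python. This is an illustrative helper, not code from the lecture: the function name `proportion_ci` is made up here, and it works from the raw counts (513 of 959) rather than the rounded $\hat{p} = .535$, so its results agree with the hand calculation only to within rounding.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample z confidence interval for a single proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)                        # standard error of p_hat
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z_star * se
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(513, 959)
print(f"95% CI for p: ({low:.4f}, {high:.4f})")
```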
A Single Proportion Hypothesis Test for p
The null hypothesis $H_0$: $p = p_0$ specifies a value $p_0$ for the unknown population proportion $p$.
Because the test is calculated assuming the truth of $H_0$, we can use $p_0$ in place of $p$ to calculate the standard deviation of $\hat{p}$. That is, under $H_0$,
$$SD(\hat{p}) = \sqrt{\frac{p_0(1 - p_0)}{n}}$$
and we can base the hypothesis test on the test statistic
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$
If the null hypothesis is true, then $z$ will approximately follow the standard normal distribution.
A Single Proportion Hypothesis Test for p: Alternative Hypothesis
To find the P-value for the null hypothesis $H_0$, we need to specify an alternative hypothesis $H_a$. As in the case of statistical inference for the mean, the alternative hypothesis can be nondirectional, $H_a$: $p \neq p_0$, or directional, either $H_a$: $p > p_0$ or $H_a$: $p < p_0$.
A Single Proportion Hypothesis Test for p: Example
Using the data from the October 25 SOM poll we can test the null hypothesis $H_0$: $p = .5$ against the alternative hypothesis that the “yes” side was ahead in the referendum campaign, $H_a$: $p > .5$. The test statistic is
$$z = \frac{.535 - .5}{\sqrt{\dfrac{.5(1 - .5)}{959}}} = 2.17$$
The one-sided P-value for $z = 2.17$ is $P = .015$ (see the graph on the next slide), so we have reasonably strong evidence that the “yes” side was leading.
A Single Proportion Hypothesis Test for p: Example
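A minimal sketch of this test in Python follows. The function name `one_proportion_z_test` and its `alternative` argument are illustrative choices, not an API from the lecture. Because it uses the unrounded $\hat{p} = 513/959$, it gives a $z$ slightly below the 2.17 obtained from the rounded value .535.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alternative="greater"):
    """z test of H0: p = p0, with the SD of p_hat computed under H0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "greater":        # Ha: p > p0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":         # Ha: p < p0
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":    # Ha: p != p0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value

z, p_value = one_proportion_z_test(513, 959, p0=0.5)
print(f"z = {z:.2f}, one-sided P = {p_value:.4f}")
```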
[Figure: standard normal distribution. The area to the left of z = 2.17 is .9850, so the one-sided P-value is the area to the right, P = 1 − .9850 = .0150.]
A Single Proportion Hypothesis Test for p: Example
Thought Question
Had we specified the nondirectional alternative hypothesis $H_a$: $p \neq .5$ (i.e., that one side or the other was in the lead), then the P-value would have been:
A. the same as the one-sided P-value, .015.
B. half as large as the one-sided P-value, .015/2 = .0075.
C. twice as large as the one-sided P-value, P = 2 × .015 = .03.
D. the complement of the one-sided P-value, P = 1 − .015 = .985.
E. I don’t know.
A Single Proportion Assumptions for z Intervals and Tests
For statistical inference for $p$ based on the normal distribution to be approximately valid, the following conditions must be met:
- The data must be a simple random sample from the population of interest.
- The population must be at least 10 times as large as the sample.
- For a confidence interval, we need at least 15 successes and at least 15 failures. Moore describes an adjustment to normal-distribution confidence intervals for proportions that makes them more accurate in small samples.
- For a hypothesis test of $H_0$: $p = p_0$, we need $np_0 \geq 10$ and $n(1 - p_0) \geq 10$.
A Single Proportion Assumptions for z Intervals and Tests: Example
For the Quebec referendum polling data, for example, the sample size $n = 959$ is much less than 1/10 of the population of Quebec voters, but the SOM poll was probably not based on a simple random sample. To know whether this invalidates the confidence interval and hypothesis test that we calculated, we would have to know more about the sampling procedure that was employed. Here, for the confidence interval, we had 513 “yes” and 446 “no” responses, both far more than 15. For the hypothesis test, $np_0 = n(1 - p_0) = 959(.5) = 479.5 \gg 10$.
A Single Proportion Selecting the Sample Size
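The countable conditions can be checked mechanically, as in the sketch below. The function names are invented for illustration, and the population size of 5,000,000 is a placeholder assumption, not a figure given in the lecture. The code cannot verify the most important condition, that the data come from a simple random sample.

```python
def ci_conditions_ok(successes, n, population_size, minimum=15):
    """Check the size conditions for a z confidence interval for p."""
    failures = n - successes
    return (population_size >= 10 * n
            and successes >= minimum
            and failures >= minimum)

def z_test_conditions_ok(n, p0, population_size, minimum=10):
    """Check the size conditions for a z test of H0: p = p0."""
    return (population_size >= 10 * n
            and n * p0 >= minimum
            and n * (1 - p0) >= minimum)

ASSUMED_POPULATION = 5_000_000  # placeholder assumption, not from the source

print(ci_conditions_ok(513, 959, ASSUMED_POPULATION))      # 513 yes, 446 no
print(z_test_conditions_ok(959, 0.5, ASSUMED_POPULATION))  # n * p0 = 479.5
```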
The margin of error for a confidence interval for $p$ is
$$m = z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
where $z^*$ is the critical standard-normal value corresponding to the desired level of confidence, and $\hat{p}$ is the sample proportion.
We can solve for $n$ in terms of $z^*$, $\hat{p}$, and $m$ to determine the sample size required for a desired margin of error $m$. An obstacle to applying this result is that in advance of collecting data, we don’t know what $\hat{p}$ will be.
We can, however, use a guessed value $p^*$ in place of $\hat{p}$, producing the formula
$$n = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)$$
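For readers who want the intermediate algebra, the sample-size formula follows from the margin-of-error formula in three steps, with the guessed value $p^*$ standing in for $\hat{p}$:

```latex
\begin{align*}
m &= z^* \sqrt{\frac{p^*(1 - p^*)}{n}} \\
m^2 &= (z^*)^2 \, \frac{p^*(1 - p^*)}{n} \\
n &= \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)
\end{align*}
```

In practice the result is rounded up to the next whole number, since a sample size must be an integer.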
A Single Proportion Selecting the Sample Size
Where does the guessed value $p^*$ come from? We could base the guess on previous research (e.g., an earlier poll), or upon our knowledge of the process under study. Alternatively, we can use the “conservative” value $p^* = .5$, which maximizes $p^*(1 - p^*)$ and therefore gives the largest margin of error for a given $n$ (and so the largest required sample size). In fact, as long as $p^*$ is between about .3 and .7, it doesn’t much matter what the precise value is:

p*     p*(1 − p*)
.3     .3(1 − .3) = .21
.5     .5(1 − .5) = .25
.7     .7(1 − .7) = .21
A Single Proportion Selecting the Sample Size: Example
Suppose that a pollster wanted to select a simple random sample of Quebec voters that would permit a margin of error of .01 (i.e., 1 percentage point) for a 95-percent confidence interval. Taking $p^* = .5$, the necessary sample size is
$$n = \left(\frac{1.96}{.01}\right)^2 \times .5(1 - .5) = 9{,}604$$
that is, about $n = 10{,}000$.
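The sample-size calculation is easy to script. The sketch below is illustrative (the function name `required_sample_size` is an assumption of this example); it rounds up to a whole number and guards against floating-point noise, which could otherwise push a result such as 9,604 up to 9,605.

```python
from math import ceil

def required_sample_size(margin, z_star=1.96, p_guess=0.5):
    """Smallest n giving a margin of error of at most `margin`."""
    raw = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    # Round away tiny floating-point error before taking the ceiling
    return ceil(round(raw, 8))

print(required_sample_size(0.01))               # conservative guess p* = .5
print(required_sample_size(0.01, p_guess=0.3))  # a less conservative guess
```

With $p^* = .3$ the required sample is smaller than with $p^* = .5$, which is why .5 is called the conservative choice.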
Two Proportions from Independent Samples
Notation for comparing proportions from independent samples drawn from two populations is similar to the notation that we employed for means:
Population   Population Proportion   Sample Proportion   Sample Size
1            $p_1$                   $\hat{p}_1$         $n_1$
2            $p_2$                   $\hat{p}_2$         $n_2$

We want to use the difference in sample proportions, $\hat{p}_1 - \hat{p}_2$, to draw inferences about the difference in population proportions $p_1 - p_2$. As usual, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ provides the basis for statistical inference.
Two Proportions from Independent Samples Sampling Distribution of the Difference in Proportions: Key Facts
The sample difference $\hat{p}_1 - \hat{p}_2$ is an unbiased estimator of the population difference $p_1 - p_2$. That is, $p_1 - p_2$ is the mean of the sampling distribution of $\hat{p}_1 - \hat{p}_2$.
The variance of $\hat{p}_1 - \hat{p}_2$ is the sum of the variances of $\hat{p}_1$ and $\hat{p}_2$:
$$\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}$$
The standard deviation of $\hat{p}_1 - \hat{p}_2$ is therefore
$$SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
Important Point
It is the variances of $\hat{p}_1$ and $\hat{p}_2$ that add, not the standard deviations.
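The point about variances can be verified exactly for small samples by enumerating every possible pair of success counts. This is a sketch under assumed values: the sample sizes and proportions below are hypothetical, and the helper names are introduced only for this example.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_mean_sd_of_difference(n1, p1, n2, p2):
    """Exact mean and SD of p_hat1 - p_hat2 for two independent samples."""
    mean, second_moment = 0.0, 0.0
    for k1 in range(n1 + 1):
        for k2 in range(n2 + 1):
            prob = binom_pmf(k1, n1, p1) * binom_pmf(k2, n2, p2)  # independence
            diff = k1 / n1 - k2 / n2
            mean += diff * prob
            second_moment += diff**2 * prob
    return mean, sqrt(second_moment - mean**2)

n1, p1, n2, p2 = 20, 0.6, 30, 0.4  # hypothetical values for illustration
mean, sd = exact_mean_sd_of_difference(n1, p1, n2, p2)
sd_from_variances = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
sum_of_sds = sqrt(p1 * (1 - p1) / n1) + sqrt(p2 * (1 - p2) / n2)

print(f"exact mean = {mean:.4f}, exact SD = {sd:.4f}")
print(f"sqrt of summed variances = {sd_from_variances:.4f}")
print(f"sum of the two SDs       = {sum_of_sds:.4f}")  # too large
```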
Two Proportions from Independent Samples Sampling Distribution of the Difference in Proportions: Key Facts
When the sample sizes $n_1$ and $n_2$ are sufficiently large, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately a normal distribution; that is, writing the mean first and the standard deviation second,
$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\right)$$
Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions
As before, we cannot apply the sampling distribution of $\hat{p}_1 - \hat{p}_2$ directly because the standard deviation of $\hat{p}_1 - \hat{p}_2$ depends upon the population proportions $p_1$ and $p_2$, which are unknown. The solution is to calculate the standard error of $\hat{p}_1 - \hat{p}_2$ by substituting $\hat{p}_1$ for $p_1$ and $\hat{p}_2$ for $p_2$:
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions
Then the confidence interval for $p_1 - p_2$ is
$$(\hat{p}_1 - \hat{p}_2) \pm z^* \, SE(\hat{p}_1 - \hat{p}_2)$$
which has the usual form
$$\text{estimate} \pm z^* \, SE(\text{estimate})$$
and where $z^*$ is the critical value from the standard normal distribution corresponding to the level of confidence.
Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions: Example
We’ll use this procedure to construct a confidence interval for the change in support for Quebec sovereignty between the independent September 12 and October 25 polls conducted by the SOM polling firm. The relevant data are as follows:
Population   Population Description     Sample Proportion     Sample Size
1            Quebec Voters, Oct. 25     $\hat{p}_1 = .535$    $n_1 = 959$
2            Quebec Voters, Sept. 12    $\hat{p}_2 = .451$    $n_2 = 822$
Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions: Example
Then, the standard error of $\hat{p}_1 - \hat{p}_2$ is
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{.535(1 - .535)}{959} + \frac{.451(1 - .451)}{822}} = .0237$$
and the 95 percent confidence interval for $p_1 - p_2$ is
$$(.535 - .451) \pm 1.96 \times .0237 = .084 \pm .046 = (.038, .130)$$
which is quite wide (but excludes 0).
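A sketch of the same interval in Python is given below. The function name `two_proportion_ci` is illustrative, and the counts 513 of 959 and 371 of 822 are the ones reported for the two SOM polls in the hypothesis-test example. Since the code uses unrounded proportions, its endpoints match the hand calculation only to within rounding.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Large-sample z interval for p1 - p2 from independent samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Unpooled standard error: the two estimated variances add
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se

low, high = two_proportion_ci(513, 959, 371, 822)
print(f"95% CI for p1 - p2: ({low:.4f}, {high:.4f})")
```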
Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions
As in the case of a single proportion, estimating the standard deviation of $\hat{p}_1 - \hat{p}_2$ is a bit different for a hypothesis test. The usual null hypothesis for a difference of proportions is the hypothesis of no difference: that is, $H_0$: $p_1 - p_2 = 0$, or equivalently, $H_0$: $p_1 = p_2$. The alternative hypothesis can be nondirectional or directional, depending upon our expectation for the difference between $p_1$ and $p_2$.
Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions
Because the null hypothesis specifies that $p_1$ and $p_2$ are the same, we can do better by combining the data from the two samples to estimate this common population proportion (let us call it $p$) to get the standard error of $\hat{p}_1 - \hat{p}_2$, rather than using the separate estimates $\hat{p}_1$ and $\hat{p}_2$. The pooled estimate of $p$ is
$$\hat{p} = \frac{\text{number of successes in both samples combined}}{\text{number of observations in both samples combined}}$$
Then, the standard error of $\hat{p}_1 - \hat{p}_2$ is
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions
Finally, the test statistic for the null hypothesis $H_0$: $p_1 - p_2 = 0$ is
$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE(\hat{p}_1 - \hat{p}_2)}$$
which approximately follows a standard normal distribution if the hypothesis is true.
Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions: Example
For the referendum polling data, 513 of 959 respondents ($\hat{p}_1 = .535$) reported an intention to vote “yes” in the October 25 poll, while 371 of 822 respondents ($\hat{p}_2 = .451$) reported an intention to vote “yes” in the September 12 poll.
Let us test the null hypothesis of no change in voting intentions, $H_0$: $p_1 - p_2 = 0$, against the alternative hypothesis that there has been some change, $H_a$: $p_1 - p_2 \neq 0$.
Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions: Example
The pooled estimate of $p$ is
$$\hat{p} = \frac{513 + 371}{959 + 822} = .496$$
The standard error of $\hat{p}_1 - \hat{p}_2$ is
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{.496(1 - .496)\left(\frac{1}{959} + \frac{1}{822}\right)} = .0238$$
The test statistic for $H_0$: $p_1 - p_2 = 0$ is
$$z = \frac{.535 - .451}{.0238} = 3.53$$
for which the P-value is about 2 × .0002 = .0004. The evidence for change is therefore very strong.
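The pooled test can be sketched as follows. The function name `two_proportion_z_test` is an illustrative choice rather than anything defined in the lecture. Working from unrounded proportions, the sketch gives a $z$ slightly below the 3.53 obtained by hand from the rounded values.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: p1 = p2 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # estimate of the common p under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z_test(513, 959, 371, 822)
print(f"z = {z:.2f}, two-sided P = {p_value:.4f}")
```

Halving the two-sided P-value gives the one-sided P-value for the alternative that support rose, provided the observed difference is in the predicted direction, as it is here.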
Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions: Example
Thought Question
Had we performed a one-tail test predicting a rise in support for sovereignty, then what P-value would we have obtained?
A. P = .0004, the same as the two-sided P-value.
B. P = .0004/2 = .0002, half the size of the two-sided P-value.
C. P = 2 × .0004 = .0008, twice the size of the two-sided P-value.
D. P = 1 − .0004 = .9996, the complement of the two-sided P-value.
E. I don’t know.
Two Proportions from Independent Samples Assumptions for z Tests and Intervals
For the normal approximation to the sampling distribution of $\hat{p}_1 - \hat{p}_2$ to be adequate, the following conditions must hold:
- The two samples should be independent simple random samples from their respective populations.
- Each population should be at least 10 times as large as the corresponding sample.
- For a confidence interval, we should have at least 10 successes and 10 failures in each sample. Again, Moore describes an adjustment that produces more accurate results in small samples.
- For a hypothesis test, we should have at least 5 successes and 5 failures in each sample.
Note that (with the exception of having SRSs) these conditions are easily met for the example.