
Sociology 6Z03 Topic 16: Statistical Inference for Proportions

John Fox

McMaster University

Fall 2016

John Fox (McMaster University) Soc 6Z03: Inference for Proportions Fall 2016 1 / 39

Outline: Statistical Inference for Proportions

Introduction
A Single Proportion
Two Proportions from Independent Samples

Introduction

Statistical inference for population proportions follows a very similar pattern to inference for population means. In this lecture we’ll learn how to construct confidence intervals and perform hypothesis tests for an individual proportion and for the difference between two proportions from independent random samples. In the next lecture, we’ll learn how to test for differences among proportions from several independent samples, and to test more generally for relationships in contingency tables. Because many social processes result in dichotomous (two-category) outcomes, statistical inference for proportions is used frequently in practice.


Introduction Preliminary Examples: A Single Proportion

In a poll taken five days before the October 30, 1995 Quebec sovereignty referendum, 53.5 percent of 959 decided voters said that they planned to vote “yes” (i.e., for separation) in the referendum. Putting aside the possibility of nonsampling errors, can we conclude that more than half of decided voters in the population planned to vote “yes” five days before the referendum?

Introduction Preliminary Examples: Two Proportions

In an independent poll of voters by the same polling firm taken roughly two months before the referendum, 45.1 percent of 822 decided voters said that they planned to vote “yes.” The difference in sample proportions is .535 − .451 = .084. That is, support for the “yes” option was 8.4 percentage points higher in the later poll. Are we justified in concluding that “yes” support increased in the population over the approximately two-month period between the two polls?


Introduction Preliminary Examples: Several Proportions

Three different polling firms conducted polls of voters within five days of the referendum:

In a poll conducted by the Reid organization, 52.4 percent of 865 decided voters said that they planned to vote “yes”;
In a poll conducted by the SOM organization, 53.5 percent of 959 decided voters said that they planned to vote “yes”;
In a poll conducted by the firm of Léger and Léger, 53.1 percent of 884 decided voters said that they planned to vote “yes.”

Are these differences too large to ascribe to chance (suggesting different nonsampling errors among the polling firms), or could the differences easily have been the product of sampling variation?

A Single Proportion Notation

A population proportion is represented by $p$ (note that a Greek letter is not used). The corresponding sample proportion is

$$\hat{p} = \frac{\text{number of successes in the sample}}{\text{number of observations in the sample}}$$

(read “p-hat”), where a “success” is one of two possible outcomes, such as planning to vote “yes” as opposed to “no” in the referendum.


A Single Proportion The Sampling Distribution of Sample Proportions

How does the sample proportion $\hat{p}$ behave as an estimator of the population proportion $p$? The count of successes $X$ follows a binomial distribution, and because the sample proportion is just $\hat{p} = X/n$, its distribution follows directly:

$$P\left(\hat{p} = \frac{k}{n}\right) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$

The mean of the sampling distribution of $\hat{p}$ is exactly $p$. The standard deviation of the sampling distribution of $\hat{p}$ depends upon $p$, and in an SRS is approximately equal to

$$\sqrt{\frac{p(1 - p)}{n}}$$
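These two facts can be checked numerically. The following is a minimal sketch, not part of the lecture: it builds the exact binomial sampling distribution of the sample proportion and compares its mean and standard deviation with the formulas. The values n = 50 and p = .3 are arbitrary illustrative choices, and the function name is hypothetical. Under the pure binomial model the standard-deviation formula holds exactly; the word "approximately" on the slide presumably reflects sampling without replacement from a finite population.

```python
from math import comb, sqrt

def sampling_distribution(n, p):
    """Exact distribution of p-hat = X/n when X ~ Binomial(n, p)."""
    return {k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

n, p = 50, 0.3  # illustrative values, not taken from the lecture
dist = sampling_distribution(n, p)

# Mean and standard deviation computed directly from the distribution
mean = sum(value * prob for value, prob in dist.items())
sd = sqrt(sum((value - mean) ** 2 * prob for value, prob in dist.items()))

print(f"mean of p-hat = {mean:.4f} (p = {p})")
print(f"SD of p-hat   = {sd:.4f} (formula: {sqrt(p * (1 - p) / n):.4f})")
```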

A Single Proportion The Sampling Distribution of Sample Proportions

Thought Question
TRUE or FALSE: The sample proportion $\hat{p}$ is an unbiased estimator of the population proportion $p$.
A TRUE.
B FALSE.
C I don’t know.


A Single Proportion The Sampling Distribution of Sample Proportions

Thought Question
TRUE or FALSE: As the sample size $n$ grows, the standard deviation of the sample proportion, $\mathrm{SD}(\hat{p}) = \sqrt{p(1 - p)/n}$, gets larger.
A TRUE.
B FALSE.
C I don’t know.

A Single Proportion The Sampling Distribution of Sample Proportions: Normal Approximation

In a relatively large simple random sample from a much larger population, it is more convenient to use the normal approximation to the binomial distribution:

The sampling distribution of $\hat{p}$ is approximately normal. The approximation gets better as the sample size $n$ grows, and is better when the population proportion $p$ is close to .5 than when it is close to 0 or 1. Here are two useful rules:

1 The normal distribution is a good-enough approximation to the sampling distribution of $\hat{p}$ when the sample size is sufficiently large (see below).
2 The standard deviation of $\hat{p}$ is well-enough approximated by $\sqrt{p(1 - p)/n}$ when the population is at least 10 times larger than the sample. This condition is almost always met.
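The quality of the normal approximation can be explored with a short computation. This is an illustrative sketch rather than anything from the lecture: the helper name `max_cdf_gap` and the particular values of n and p are assumptions. It measures the largest gap between the exact binomial cumulative distribution of the sample proportion and the approximating normal distribution, evaluated at the attainable values k/n and without a continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

def max_cdf_gap(n, p):
    """Largest gap, over the attainable values k/n, between the exact CDF
    of p-hat and the CDF of the normal approximation N(p, sqrt(p(1-p)/n))."""
    normal = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
    exact_cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        exact_cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        worst = max(worst, abs(exact_cdf - normal.cdf(k / n)))
    return worst

# Effect of sample size, holding p = .5 fixed
gap_n25 = max_cdf_gap(25, 0.5)
gap_n400 = max_cdf_gap(400, 0.5)

# Effect of p, holding n = 100 fixed
gap_p50 = max_cdf_gap(100, 0.5)
gap_p05 = max_cdf_gap(100, 0.05)

print(f"n = 25:  {gap_n25:.3f}   n = 400:  {gap_n400:.3f}")
print(f"p = .5:  {gap_p50:.3f}   p = .05:  {gap_p05:.3f}")
```

In this sketch the gap should shrink as n grows and should be smaller for p = .5 than for p = .05, in line with the two statements above.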


A Single Proportion The Sampling Distribution of Sample Proportions

These results provide a basis for performing statistical inference about proportions, but they cannot be applied directly, because, without knowledge of the population proportion $p$, we cannot calculate the standard deviation of $\hat{p}$. We will proceed a little differently for confidence intervals and hypothesis tests.

A Single Proportion Confidence Interval for p

We need to replace the standard deviation of $\hat{p}$ with an estimate, the standard error of $\hat{p}$. To do this, we’ll substitute $\hat{p}$ for $p$ in the formula for the standard deviation of $\hat{p}$:

$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Then, the confidence interval for $p$ has the form

$$\hat{p} \pm z^* \, \mathrm{SE}(\hat{p})$$

where $z^*$ is the critical value from the standard normal distribution corresponding to the desired level of confidence.


A Single Proportion Confidence Interval for p

Thought Question
What is the value of $z^*$ for a 95 percent confidence interval?
A 1.0
B 1.65
C 1.96
D 2.58
E I don’t know.

A Single Proportion Confidence Interval for p: Example

To illustrate, in a poll conducted five days before the Quebec referendum by the SOM organization, 513 of 959 decided voters reported an intention to vote “yes” in the referendum. On the basis of this result, let us construct a 95 percent confidence interval for the population proportion $p$ intending to vote “yes.” Here,

$$\hat{p} = \frac{513}{959} = .535$$

and

$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{.535(1 - .535)}{959}} = 0.0161$$

The 95 percent confidence interval is

$$\hat{p} \pm z^* \, \mathrm{SE}(\hat{p}) = .535 \pm 1.96 \times 0.0161 = .535 \pm .032 = (.503, .567)$$
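The same calculation can be scripted. The sketch below is one possible implementation, with a hypothetical function name; it uses the counts from the SOM poll and obtains the critical value from the standard normal quantile function in Python's standard library rather than from a table. Because it works with unrounded intermediate values, the endpoints may differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Large-sample z confidence interval for a single proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)              # standard error of p-hat
    z_star = NormalDist().inv_cdf((1 + level) / 2)  # about 1.96 for 95 percent
    margin = z_star * se
    return p_hat, se, (p_hat - margin, p_hat + margin)

# SOM poll: 513 "yes" among 959 decided voters
p_hat, se, (lower, upper) = proportion_ci(513, 959)
print(f"p-hat = {p_hat:.3f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```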


A Single Proportion Hypothesis Test for p

The null hypothesis $H_0$: $p = p_0$ specifies a value $p_0$ for the unknown population proportion $p$.

Because the test is calculated assuming the truth of $H_0$, we can use $p_0$ in place of $p$ to calculate the standard deviation of $\hat{p}$. That is, under $H_0$,

$$\mathrm{SD}(\hat{p}) = \sqrt{\frac{p_0(1 - p_0)}{n}}$$

and we can base the hypothesis test on the test statistic

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$

If the null hypothesis is true, then $z$ will follow the standard normal distribution.

A Single Proportion Hypothesis Test for p: Alternative Hypothesis

To find the P-value for the null hypothesis $H_0$, we need to specify an alternative hypothesis $H_a$. As in the case of statistical inference for the mean, the alternative hypothesis can be nondirectional, $H_a$: $p \ne p_0$, or directional: either $H_a$: $p > p_0$ or $H_a$: $p < p_0$.


A Single Proportion Hypothesis Test for p: Example

Using the data from the October 25 SOM poll we can test the null hypothesis $H_0$: $p = .5$ against the alternative hypothesis that the “yes” side was ahead in the referendum campaign, $H_a$: $p > .5$. The test statistic is

$$z = \frac{.535 - .5}{\sqrt{\dfrac{.5(1 - .5)}{959}}} = 2.17$$

The one-sided P-value for $z = 2.17$ is $P = .015$ (see the graph on the next slide), so we have reasonably strong evidence that the “yes” side was leading.
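As a check on the arithmetic, here is a small sketch of the test in code. The function name is hypothetical, and the rounded sample proportion .535 is used so that the result matches the hand calculation; the upper-tail area comes from the standard normal CDF in Python's standard library.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(p_hat, p0, n):
    """z statistic for H0: p = p0, using p0 in the standard deviation."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z = one_proportion_z(0.535, 0.5, 959)  # rounded p-hat, as in the example
p_upper = 1 - NormalDist().cdf(z)      # one-sided P-value for Ha: p > .5
print(f"z = {z:.2f}, one-sided P = {p_upper:.3f}")
```

Using the unrounded proportion 513/959 instead of .535 gives a slightly smaller z, which is a reminder that rounding intermediate values can shift the last reported digit.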

A Single Proportion Hypothesis Test for p: Example

[Figure: standard normal distribution. The area to the left of z = 2.17 is .9850, so the upper-tail area is P = 1 − .9850 = 0.0150.]


A Single Proportion Hypothesis Test for p: Example

Thought Question

Had we specified the nondirectional alternative hypothesis $H_a$: $p \ne .5$ (i.e., that one side or the other was in the lead), then the P-value would have been:
A the same as the one-sided P-value, .015.
B half as large as the one-sided P-value, .015/2 = .0075.
C twice as large as the one-sided P-value, P = 2 × .015 = .03.
D the complement of the one-sided P-value, P = 1 − .015 = .985.
E I don’t know.

A Single Proportion Assumptions for z Intervals and Tests

For statistical inference for $p$ based on the normal distribution to be approximately valid, the following conditions must be met:

The data must be a simple random sample from the population of interest.
The population must be at least 10 times as large as the sample.
For a confidence interval, we need at least 15 successes and at least 15 failures. Moore describes an adjustment to normal-distribution confidence intervals for proportions that makes them more accurate in small samples.
For a hypothesis test of $H_0$: $p = p_0$, we need $np_0 \ge 10$ and $n(1 - p_0) \ge 10$.


A Single Proportion Assumptions for z Intervals and Tests: Example

For the Quebec referendum polling data, for example, the sample size $n = 959$ is much less than 1/10 of the population of Quebec voters, but the SOM poll was probably not based on a simple random sample. To know whether this invalidates the confidence interval and hypothesis test that we calculated, we would have to know more about the sampling procedure that was employed. Here, for the confidence interval, we had 513 yes and 446 no, both much more than 15. For the hypothesis test, $np_0 = n(1 - p_0) = 959(.5) = 479.5 \gg 10$.

A Single Proportion Selecting the Sample Size

The margin of error for a confidence interval for $p$ is

$$m = z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where $z^*$ is the critical standard-normal value corresponding to the desired level of confidence, and $\hat{p}$ is the sample proportion. We can solve for $n$ in terms of $z^*$, $\hat{p}$, and $m$ to determine the sample size required for a desired margin of error $m$. An obstacle to applying this result is that in advance of collecting data, we don’t know what $\hat{p}$ will be. We can, however, use a guessed value $p^*$ in place of $\hat{p}$, producing the formula

$$n = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)$$
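The intermediate algebra is not shown on the slide. A sketch of the steps, with the guessed value $p^*$ substituted for the sample proportion throughout:

$$\begin{aligned}
m &= z^* \sqrt{\frac{p^*(1 - p^*)}{n}} \\
m^2 &= (z^*)^2 \, \frac{p^*(1 - p^*)}{n} \\
n &= \frac{(z^*)^2 \, p^*(1 - p^*)}{m^2} = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)
\end{aligned}$$

Because $n$ varies with $1/m^2$, halving the desired margin of error requires roughly four times the sample size.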


A Single Proportion Selecting the Sample Size

Where does the guessed value $p^*$ come from? We could base the guess on previous research (e.g., an earlier poll), or upon our knowledge of the process under study. Alternatively, we can use the “conservative” value $p^* = .5$, which produces the largest margin of error. In fact, as long as $p^*$ is between about .3 and .7, it doesn’t much matter what the precise value is:

$p^*(1 - p^*)$
$.3(1 - .3) = .21$
$.5(1 - .5) = .25$
$.7(1 - .7) = .21$

A Single Proportion Selecting the Sample Size: Example

Suppose that a pollster wanted to select a simple random sample of Quebec voters that would permit a margin of error of .01 (i.e., 1 percent) for a 95-percent confidence interval. Taking $p^* = .5$, the necessary sample size is

$$n = \left(\frac{1.96}{.01}\right)^2 \times .5(1 - .5) = 9{,}604$$

that is, about $n = 10{,}000$.
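The sample-size formula is easy to wrap in a function, which makes it simple to see how little the answer depends on the guessed value. This is an illustrative sketch with a hypothetical function name; the second call, with a guess of .3, is an added comparison that does not appear in the lecture.

```python
def required_sample_size(z_star, margin, p_guess=0.5):
    """Sample size n = (z*/m)^2 p*(1 - p*) for a desired margin of error m."""
    return (z_star / margin) ** 2 * p_guess * (1 - p_guess)

n_conservative = required_sample_size(1.96, 0.01)        # p* = .5, as in the example
n_if_30_percent = required_sample_size(1.96, 0.01, 0.3)  # illustrative alternative guess
print(round(n_conservative), round(n_if_30_percent))
```

The function returns the unrounded value of the formula; in practice one would round up to the next whole number of respondents.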


Two Proportions from Independent Samples

Notation for comparing proportions from independent samples drawn from two populations is similar to the notation that we employed for means:

Population   Population Proportion   Sample Proportion   Sample Size
1            $p_1$                   $\hat{p}_1$         $n_1$
2            $p_2$                   $\hat{p}_2$         $n_2$

We want to use the difference in sample proportions, $\hat{p}_1 - \hat{p}_2$, to draw inferences about the difference in population proportions $p_1 - p_2$. As usual, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ provides the basis for statistical inference.

Two Proportions from Independent Samples Sampling Distribution of the Difference in Proportions: Key Facts

The sample difference $\hat{p}_1 - \hat{p}_2$ is an unbiased estimator of the population difference $p_1 - p_2$. That is, $p_1 - p_2$ is the mean of the sampling distribution of $\hat{p}_1 - \hat{p}_2$.

The variance of $\hat{p}_1 - \hat{p}_2$ is the sum of the variances of $\hat{p}_1$ and $\hat{p}_2$:

$$\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}$$

The standard deviation of $\hat{p}_1 - \hat{p}_2$ is therefore

$$\mathrm{SD}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$

Important Point
It is the variances of $\hat{p}_1$ and $\hat{p}_2$ that add, not the standard deviations.
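A short derivation sketch shows why the variances add. It assumes only that the two samples are independent, so that the covariance between the two sample proportions is zero:

$$\begin{aligned}
\mathrm{Var}(\hat{p}_1 - \hat{p}_2) &= \mathrm{Var}(\hat{p}_1) + \mathrm{Var}(\hat{p}_2) - 2\,\mathrm{Cov}(\hat{p}_1, \hat{p}_2) \\
&= \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2} - 0
\end{aligned}$$

Since $\sqrt{a + b} < \sqrt{a} + \sqrt{b}$ for positive $a$ and $b$, adding the two standard deviations instead would overstate the variability of the difference.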


Two Proportions from Independent Samples Sampling Distribution of the Difference in Proportions: Key Facts

When the sample sizes $n_1$ and $n_2$ are sufficiently large, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately a normal distribution; that is

$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\right)$$

Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions

As before, we cannot apply the sampling distribution of $\hat{p}_1 - \hat{p}_2$ directly because the standard deviation of $\hat{p}_1 - \hat{p}_2$ depends upon the population proportions $p_1$ and $p_2$, which are unknown. The solution is to calculate the standard error of $\hat{p}_1 - \hat{p}_2$ by substituting $\hat{p}_1$ for $p_1$ and $\hat{p}_2$ for $p_2$:

$$\mathrm{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$


Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions

Then the confidence interval for $p_1 - p_2$ is

$$(\hat{p}_1 - \hat{p}_2) \pm z^* \, \mathrm{SE}(\hat{p}_1 - \hat{p}_2)$$

which has the usual form $\text{estimate} \pm z^* \, \mathrm{SE}(\text{estimate})$, and where $z^*$ is the critical value from the standard normal distribution corresponding to the level of confidence.

Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions: Example

We’ll use this procedure to construct a confidence interval for the change in support for Quebec sovereignty between the independent September 12 and October 25 polls conducted by the SOM polling firm. The relevant data are as follows:

Population   Population Description      Sample Proportion      Sample Size
1            Quebec Voters, Oct. 25      $\hat{p}_1 = .535$     $n_1 = 959$
2            Quebec Voters, Sept. 12     $\hat{p}_2 = .451$     $n_2 = 822$


Two Proportions from Independent Samples Confidence Interval for a Difference in Proportions: Example

Then, the standard error of $\hat{p}_1 - \hat{p}_2$ is

$$\mathrm{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{.535(1 - .535)}{959} + \frac{.451(1 - .451)}{822}} = .0237$$

and the 95 percent confidence interval for $p_1 - p_2$ is

$$(.535 - .451) \pm 1.96 \times .0237 = .084 \pm .046 = (.038, .130)$$

which is quite wide (but excludes 0).
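The interval can be reproduced with a few lines of code. This is a sketch with a hypothetical function name; it takes the rounded sample proportions and the critical value 1.96 used in the hand calculation, so that the output can be compared directly with the figures above.

```python
from math import sqrt

def diff_ci(p1_hat, n1, p2_hat, n2, z_star=1.96):
    """z confidence interval for p1 - p2 from independent samples (unpooled SE)."""
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff, se, (diff - z_star * se, diff + z_star * se)

# October 25 poll (population 1) versus September 12 poll (population 2)
diff, se, (lower, upper) = diff_ci(0.535, 959, 0.451, 822)
print(f"difference = {diff:.3f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```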

Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions

As in the case of a single proportion, estimating the standard deviation of $\hat{p}_1 - \hat{p}_2$ is a bit different for a hypothesis test. The usual null hypothesis for a difference of proportions is the hypothesis of no difference: that is, $H_0$: $p_1 - p_2 = 0$, or equivalently, $H_0$: $p_1 = p_2$. The alternative hypothesis can be nondirectional or directional, depending upon our expectation for the difference between $p_1$ and $p_2$.


Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions

Because the null hypothesis specifies that $p_1$ and $p_2$ are the same, we can do better by combining the data from the two samples to estimate this common population proportion (let us call it $p$) to get the standard error of $\hat{p}_1 - \hat{p}_2$, rather than using the separate estimates $\hat{p}_1$ and $\hat{p}_2$. The pooled estimate of $p$ is

$$\hat{p} = \frac{\text{number of successes in both samples combined}}{\text{number of observations in both samples combined}}$$

Then, the standard error of $\hat{p}_1 - \hat{p}_2$ is

$$\mathrm{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions

Finally, the test statistic for the null hypothesis $H_0$: $p_1 - p_2 = 0$ is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\mathrm{SE}(\hat{p}_1 - \hat{p}_2)}$$

which approximately follows a standard normal distribution if the hypothesis is true.


Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions: Example

For the referendum polling data, 513 of 959 respondents ($\hat{p}_1 = .535$) reported an intention to vote “yes” in the October 25 poll, while 371 of 822 respondents ($\hat{p}_2 = .451$) reported an intention to vote “yes” in the September 12 poll.

Let us test the null hypothesis of no change in voting intentions, $H_0$: $p_1 - p_2 = 0$, against the alternative hypothesis that there has been some change, $H_a$: $p_1 - p_2 \ne 0$.

Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions: Example

The pooled estimate of $p$ is

$$\hat{p} = \frac{513 + 371}{959 + 822} = .496$$

The standard error of $\hat{p}_1 - \hat{p}_2$ is

$$\mathrm{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{.496(1 - .496)\left(\frac{1}{959} + \frac{1}{822}\right)} = .0238$$

The test statistic for $H_0$: $p_1 - p_2 = 0$ is

$$z = \frac{.535 - .451}{.0238} = 3.53$$

for which the P-value is about $2 \times .0002 = .0004$. The evidence for change is therefore very strong.


Two Proportions from Independent Samples Hypothesis Test for a Difference in Proportions: Example

Thought Question
Had we performed a one-tail test predicting a rise in support for sovereignty, then what P-value would we have obtained?
A P = .0004, the same as the two-sided P-value.
B P = .0004/2 = .0002, half the size of the two-sided P-value.
C P = 2 × .0004 = .0008, twice the size of the two-sided P-value.
D P = 1 − .0004 = .9996, the complement of the two-sided P-value.
E I don’t know.

Two Proportions from Independent Samples Assumptions for z Tests and Intervals

For the normal approximation to the sampling distribution of $\hat{p}_1 - \hat{p}_2$ to be adequate, the following conditions must hold:

The two samples should be independent simple random samples from their respective populations.
Each population should be at least 10 times as large as the corresponding sample.
For a confidence interval, we should have at least 10 successes and 10 failures in each sample. Again, Moore describes an adjustment that produces more accurate results in small samples.
For a hypothesis test, we should have at least 5 successes and 5 failures in each sample.

Note that (with the exception of having SRSs) these conditions are easily met for the example.
