A remark on interpretation of pooling results

Wojciech Zieliński Department of Econometrics and Statistics SGGW Nowoursynowska 159, 02-787 Warszawa e-mail:[email protected]

Summary

A multinomial distribution (−1, π1), (0, π2), (1, π3), π1 + π2 + π3 = 1 is observed.

Suppose n1 values −1, n2 values 0 and n3 values 1 (n = n1 + n2 + n3) were observed.

The construction of the confidence region for the vector (π1, π2, π3) is given. Theoretical considerations are illustrated by an example of results of a pool.

Key words: multinomial distribution, confidence regions

Introduction

In there are several institutes making different kinds of pools. In the table below there are given results of gallup poll made by IPSOS in September 16, 2005. Preferences in presidental election 43% Lech Kaczyński 23% Włodzimierz Cimoszewicz 17% 7% 3% Jarosław Kalinowski 2% Maciej Giertych 2% Janusz Korwin-Mikke 1% 1% Stanisław Tymiński 1% Source: www.ipsos.pl/3 5 070.html To the above results there is given a comment: number of adults in sample is 959; for ana- logous random sample the statistical error of estimates is not greater than (+/−) 3.2% at the confidence level 0.95. Here appears a problem with population interpretation of results. For example, if one wants to estimate the real support for the last of the candidates, then due to the comment the interval (−2.2%, 4.2%) is obtained. It suggest that this candidate may have a negative support (?!). How then the results of a pooling should be generalized to the population? In what follows we will consider a pooling with three possible answers. Let X denotes the random variable describing answers. It may be assumed that X is multinomially distributed:

Pθ{X = −1} = π1,Pθ{X = 0} = π2,Pθ{X = 1} = π3,

where θ = (π1, π2, π3) and 0 ≤ π1, π2, π3 ≤ 1, π1 + π2 + π3 = 1. Values of X symbolize answers to questions in pooling.

Assume that in a sample of size n value −1 were observed n1 times, value 0 n2

times and value 1 n3 times. Of course n1 + n2 + n3 = n. It is known that maximum likelihood estimator of θ is (p1 = n1/n, p2 = n2/n, p3 = n3/n). The problem is in interval estimation of θ.

Confidence region

There are a lot of papers devoted to the problem of simultaneous confidence intervals for probabilities of multinomial distribution. Extensive review of construction methods may be found in [1], [2], [5]. General rule of construction is based on the set of inequalities

|p − π | i i ≤ c, i = 1, 2, 3, πi(1 − πi)

where c is constant such that the following equality holds   |pi − πi| Pθ ≤ c, i = 1, 2, 3 = 1 − α, ∀θ. πi(1 − πi)

Those confidence regions are easy to calculate. However, simultaneous confidence inte- rvals have two disadvantages. Firstly, obtained confidence intervals may go out of (0, 1) interval and secondly, in their construction the condition π1 + π2 + π3 = 1 was not exploit.

For example, let the following sample be given: n1 = 1, n2 = 1, n3 = 48. In the table below there are given limits of some of known simultaneous confidence intervals (1 − α = 0.95).

QH GM NB FS p1 = 0.02 0.0025 0.1402 0.0026 0.1361 −0.1493 0.1893 −0.1303 0.1703 p2 = 0.02 0.0025 0.1402 0.0026 0.1361 −0.1493 0.1893 −0.1303 0.1703 p3 = 0.96 0.8300 0.9916 0.8340 0.9914 0.7907 1.1293 0.8097 1.1103

Here QH denotes Quesenberry and Hurst [6] construction:

2 2 ni(pi − πi) ≤ χ (α, 2)πi(1 − πi), i = 1, 2, 3.

GM denotes Goodman [4] construction:

2 2 ni(pi − πi) ≤ χ (α/3, 1)πi(1 − πi), i = 1, 2, 3.

NB denotes naive binomial construction:   1 n (p − π )2 < χ2(α, 1) , i = 1, 2, 3. i i i 4

FS denotes Fitzpatrick and Scott [3] construction:

2 ni(pi − πi) < γ, i = 1, 2, 3,

where γ = 1 for α = 0.1, γ = 1.13 for α = 0.05 and γ = 1.40 for α = 0.01. Note that, left ends of some of calculated confidence intervals are negative or the sum of admissible probabilities is greater than one.

Another approach to the problem of construction of confidence region is application of Pearson statistic connected with classical chi–square test. This statistic is given by

formula   (p − π )2 (p − π )2 (p − π )2 n 1 1 + 2 2 + 3 3 . π1 π2 π3 Applying the conditions π1 + π2 + π3 = 1 and p1 + p2 + p3 = 1 we obtain:   (p − π )2 (p − π )2 ((p + p ) − (π + π ))2 n 1 1 + 2 2 + 1 2 1 2 . π1 π2 1 − π1 − π2

A confidence region is given by a solution with respect to (π1, π2) of inequality:   (p − π )2 (p − π )2 ((p + p ) − (π + π ))2 1 1 + 2 2 + 1 2 1 2 < χ, π1 π2 1 − π1 − π2 where χ = χ2(α; 2)/n. Let p 2 2 L χ(−1 + π1)π1 + π1 + p1 − 2π1(p1 + p2 − p1p2) − ∆(π1) p2 (π1) = − 2 , 2(π1 + χπ1 − p1) p 2 2 P χ(−1 + π1)π1 + π1 + p1 − 2π1(p1 + p2 − p1p2) + ∆(π1) p2 (π1) = − 2 , 2(π1 + χπ1 − p1) where

2 2 2 2 2 ∆(π1) = (χ(−1+π1)π1 +π1 +p1 +2π1(p1(−1+p2)−p2)) +4(−1+π1)π1(π1 +χπ1 −p1)p2

L P L P Let p1 i p1 be such numbers that ∆(p1 ) = 0 and ∆(p1 ) = 0: √ p √ p χ + 2p − χ χ + 4p − 4p2 χ + 2p + χ χ + 4p − 4p2 pL = 1 1 1 ; pP = 1 1 1 . 1 2(1 + χ) 1 2(1 + χ) The confidence region is given by:

 L P L P (π1, π2) ∈ (0, 1) × (0, 1) : p1 ≤ π1 ≤ p1 , p2 (π1) ≤ π2 ≤ p2 (π1) .

Value of π3 is calculated from the equality π1 + π2 + π3 = 1.

Example

Consider results of given in Introduction pooling in a modified version: Preferences in presidental election Donald Tusk 43% Lech Kaczyński 23% Other candidate 34%

In the picture there is shown the confidence region for probability π1 (x–axis) and

π2 (y–axis) of support for the first and second candidate respectively. Third probability

π3 is calculated as π3 = 1 − π1 − π2. By dot is marked values of point estimators of π1 and π2. Bibliography

[1] Biszof A., Mejza S. (2004): Jednoczesne przedziały ufności dla prawdopodobieństwa w rozkładzie wielomianowym, Colloquium Biometryczne, 34, 77–84. [2] Correa J. C. (2001): Interval Estimation of the Parameters of the Multinomial Di- stribution, ip.statjournals.net:2002/InterStat/ARTICLES/2001/articles/O01001.pdf [3] Fitzpatrick S., Scott A. (1987): Quick simultaneous confidence intervals for multino- mial proportions, Journal of the American Statistical Association, 82, 875–878 [4] Goodman L. A. (1965) On simultaneous confidence intervals for multinomial pro- portions, Technometrics, 7, 247–254 [5] May W. L., Johnson W. D. (1997): Properties of simultaneous confidence intervals for multinomial proportions, Communications in Statistics — Simulations, 26, 495–518 [6] Quesenberry C. P., Hurst D.C (1964): Large sample simultaneous confidence intervals for multinomial proportions, Technometrics, 6, 191–195 0.27 ...... 0.26 ...... 0.25 ...... 0.24 ...... 0.23 ...... • ...... 0.22 ...... 0.21 ...... 0.20 ......

0.19 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 Confidence region for the support of two first candidates