Exercise IV: Confidence intervals.

Suppose the distribution of some trait X in a population depends on some parameter Q (e.g. the mean and/or variance). It is assumed that the form of the density $f(x, Q)$ or of the probability function $P(X = x_k) = p_k(Q)$ is known.

Let $E = (X_1, X_2, \ldots, X_n)$ be a vector of observations. E is an n-dimensional random variable whose distribution depends on the parameter Q.

Let $\underline{U}(E)$ and $\overline{U}(E)$ be functions of the random variable E such that $\underline{U}(E) \le \overline{U}(E)$.

Let $\alpha$ be a real number, $0 < \alpha < 1$. If

$$P\left(\underline{U}(E) < Q < \overline{U}(E)\right) \ge 1 - \alpha$$

(for continuous random variables, $= 1 - \alpha$), the interval $\left(\underline{U}(E), \overline{U}(E)\right)$ is called a confidence interval for Q, and $1 - \alpha$ the confidence level. In theory, a confidence interval can be built for any parameter of the trait's distribution, but in practice confidence intervals are most often constructed for the mean (Q = m) and the variance (Q = Var(X)). Below we show the corresponding confidence intervals for both parameters.

Confidence interval for population mean (Q = m)

Model I. Suppose a trait in some population has a $N(m, \sigma)$ distribution. Assume that m is unknown, $\sigma$ is known, and the sample is small (n < 30). The estimate of the population mean is the sample mean $\bar{X}$. Under these conditions, $\bar{X}$ has a $N\left(m, \frac{\sigma}{\sqrt{n}}\right)$ distribution. Thus, the standardized random variable

$$U = \frac{\bar{X} - m}{\sigma / \sqrt{n}}$$

has a $N(0, 1)$ distribution.

[Figure: density f(u) of the N(0, 1) distribution; the central area between $-u_\alpha$ and $u_\alpha$ equals $1 - \alpha$, and each tail has area $\alpha/2$.]

Hence, the random variable U satisfies:

$$P(-u_\alpha < U < u_\alpha) = 1 - \alpha$$

The value $u_\alpha$ can be read from the tables of the standard normal distribution for a given $\alpha$: $\Phi(u_\alpha) = 1 - \frac{\alpha}{2}$. Substituting the definition of U gives

$$P\left(-u_\alpha < \frac{\bar{X} - m}{\sigma / \sqrt{n}} < u_\alpha\right) = 1 - \alpha,$$

and, after multiplying through by $\sigma / \sqrt{n}$ and solving for m,

$$P\left(\bar{X} - u_\alpha \frac{\sigma}{\sqrt{n}} < m < \bar{X} + u_\alpha \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$

Thus, for example, for $\alpha = 0.05$ the confidence interval is as follows ($u_\alpha$ from the table = 1.96):

$$\left(\bar{X} - \frac{1.96\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{1.96\,\sigma}{\sqrt{n}}\right)$$

which means that in 95 cases out of 100 an interval constructed in this way contains the estimated m. In other words, the error in estimation is not greater than $\frac{1.96\,\sigma}{\sqrt{n}}$ in 95% of cases.

Example. The waiting time for a tram has been studied and the following values obtained (in minutes): 12, 15, 14, 13, 15. Suppose that the waiting time for a tram has a normal distribution with unknown mean m and known standard deviation $\sigma = 2$. The construction of the confidence interval for the mean consists of the following steps:

Step 1 Calculate the sample mean: $\bar{x} = (12 + 15 + 14 + 13 + 15)/5 = 13.8$.

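Step 2 Calculate the radius of the confidence interval: $u_\alpha \sigma / \sqrt{n} = 1.96 \cdot 2 / \sqrt{5} \approx 1.753$.

Step 3 Construct the confidence interval: $(\bar{x} - 1.753;\ \bar{x} + 1.753) = (12.0469;\ 15.5530)$.

Thus, at a confidence level of $1 - \alpha = 0.95$, the population mean lies in the calculated confidence interval.

Model II. The trait X has a $N(m, \sigma)$ distribution in some population, where neither m nor $\sigma$ is known. To build a confidence interval for m, we use the t-statistic with n - 1 degrees of freedom:

$$t = \frac{\bar{X} - m}{S} \sqrt{n - 1}$$

Here S denotes the sample standard deviation computed with divisor n, $S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$, which is why $\sqrt{n - 1}$ rather than $\sqrt{n}$ appears.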

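[Figure: density f(t) of the Student t distribution; the central area between $-t_\alpha$ and $t_\alpha$ equals $1 - \alpha$, and each tail has area $\alpha/2$.]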

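The value $t_\alpha$ is read from the tables of the Student t distribution with n - 1 degrees of freedom:

$$P(-t_\alpha < t < t_\alpha) = 1 - \alpha$$

$$P\left(-t_\alpha < \frac{\bar{X} - m}{S / \sqrt{n - 1}} < t_\alpha\right) = 1 - \alpha,$$

$$P\left(\bar{X} - t_\alpha \frac{S}{\sqrt{n - 1}} < m < \bar{X} + t_\alpha \frac{S}{\sqrt{n - 1}}\right) = 1 - \alpha.$$

Thus, for example, for $\alpha = 0.05$ the confidence interval is as follows ($t_\alpha$ from tables; e.g. for n = 26, i.e. 25 degrees of freedom, this value is 2.060):

$$\left(\bar{X} - \frac{2.060\,S}{5};\ \bar{X} + \frac{2.060\,S}{5}\right)$$

which means that in 95 cases out of 100, m lies in this range. In other words, the error in estimation is not greater than $\frac{2.060\,S}{5}$ in 95% of cases.

Note: the length of this interval varies from sample to sample, since it depends on the value of S.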

Example. Suppose that in the previous example on the average waiting time for a tram there was no information on the standard deviation in the population. The confidence interval is calculated in 4 steps.

Step 1 Calculate the sample mean $\bar{x}$. This has not changed and is still 13.8.

Step 2 Calculate the sample standard deviation (divisor n, as in the t-statistic above): $S = \sqrt{6.8 / 5} \approx 1.16619$.

Step 3 Calculate the radius of the confidence interval: $t_\alpha S / \sqrt{n - 1} = 2.7764 \cdot 1.16619 / \sqrt{4} \approx 1.6189$, where $t_\alpha = 2.7764$ is the table value for 4 degrees of freedom.

Step 4 Construct the confidence interval: $(\bar{x} - 1.618931187;\ \bar{x} + 1.618931187) = (12.18107;\ 15.41893)$.

Thus, at a confidence level of $1 - \alpha = 0.95$, the population mean lies in the calculated confidence interval.
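Both tram examples (Model I with known standard deviation 2, and Model II with the standard deviation estimated from the sample) can be reproduced with a short Python sketch. The function names are illustrative and not part of the original exercise. The t value 2.776445 for 4 degrees of freedom is taken from standard tables, because Python's standard library has no Student t quantile function.

```python
from math import sqrt
from statistics import NormalDist, fmean


def mean_ci_known_sigma(data, sigma, alpha=0.05):
    """Model I: x_bar -/+ u_alpha * sigma / sqrt(n), with sigma known."""
    n = len(data)
    x_bar = fmean(data)
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Phi(u_alpha) = 1 - alpha/2
    radius = u_alpha * sigma / sqrt(n)
    return x_bar - radius, x_bar + radius


def mean_ci_unknown_sigma(data, t_alpha):
    """Model II: x_bar -/+ t_alpha * S / sqrt(n - 1).

    S is the standard deviation with divisor n; t_alpha must be read from
    a Student t table for n - 1 degrees of freedom.
    """
    n = len(data)
    x_bar = fmean(data)
    s_biased = sqrt(sum((x - x_bar) ** 2 for x in data) / n)
    radius = t_alpha * s_biased / sqrt(n - 1)
    return x_bar - radius, x_bar + radius


waiting_times = [12, 15, 14, 13, 15]

low1, high1 = mean_ci_known_sigma(waiting_times, sigma=2)
print(f"Model I:  ({low1:.4f}; {high1:.4f})")

# Tabulated two-sided 95% t value for 4 degrees of freedom.
low2, high2 = mean_ci_unknown_sigma(waiting_times, t_alpha=2.776445)
print(f"Model II: ({low2:.4f}; {high2:.4f})")
```

Passing the t value as an argument mirrors the text, where $t_\alpha$ is read from tables; a statistics package could supply it instead.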

Model III. For large samples (n > 30), the central limit theorem shows that $\bar{X}$ is approximately $N\left(m, \frac{\sigma}{\sqrt{n}}\right)$, while the law of large numbers shows that $S \approx \sigma$. Therefore, replacing the population standard deviation $\sigma$ in Model I by the sample standard deviation s, we get:

$$P\left(\bar{X} - u_\alpha \frac{s}{\sqrt{n}} < m < \bar{X} + u_\alpha \frac{s}{\sqrt{n}}\right) \approx 1 - \alpha$$

Example. Suppose that in the analysis of the average waiting time for a tram there was no information on the standard deviation in the population, but we managed to gather much more data.

14 13 15 15 12 15 14 13 15 15 12 15 14 13 15 15 12 15 14 13 15 15 12 15 14 13 15

The confidence interval can be calculated using the "Data Analysis / Descriptive Statistics" panel. Entering the data into this previously described panel, we get:

Mean                      13.94117647
Standard Error            0.202221074
Median                    14
Mode                      15
Standard Deviation        1.179141354
Sample Variance           1.390374332
Kurtosis                  -1.22463316
Skewness                  -0.58712867
Range                     3
Minimum                   12
Maximum                   15
Sum                       474
Count                     34
Confidence Level (95.0%)  0.411421868
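The "Confidence Level (95.0%)" entry is the radius (half-length) of the interval. As a rough check, the sketch below recomputes it from the reported standard deviation and count. It assumes, which the text does not state, that the panel uses a Student t quantile with 33 degrees of freedom (2.0345 from tables) rather than $u_\alpha$; the Model III radius based on $u_\alpha$ is printed for comparison.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics copied from the panel output.
mean = 13.94117647
std_dev = 1.179141354
count = 34

std_error = std_dev / sqrt(count)  # reported by the panel as "Standard Error"

# Model III: normal quantile u_alpha for a two-sided 95% interval.
u_alpha = NormalDist().inv_cdf(0.975)
radius_normal = u_alpha * std_error

# Assumption: tabulated two-sided 95% t value for 33 degrees of freedom.
t_alpha = 2.0345
radius_t = t_alpha * std_error

print(f"standard error:      {std_error:.6f}")
print(f"radius with u_alpha: {radius_normal:.5f}")
print(f"radius with t_alpha: {radius_t:.5f}")
print(f"interval (t):        ({mean - radius_t:.4f}; {mean + radius_t:.4f})")
```

For a sample of this size the two radii differ only in the second decimal place, which is why the normal approximation of Model III is usually considered acceptable for n > 30.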

Construct the confidence interval: $(\bar{x} - 0.41142;\ \bar{x} + 0.41142) = (13.5298;\ 14.3526)$. Thus, at a confidence level of $1 - \alpha = 0.95$, the population mean lies in the calculated confidence interval. Note that, in agreement with our intuition, the length of the interval has decreased significantly as a consequence of the larger number of observations; the estimation has become more accurate.

Model IV (confidence interval for a proportion). If we examine a population according to the presence or absence of a certain characteristic (e.g. quality control: products classed as good and bad, non-smokers and smokers, etc.), it can be described by a two-point distribution: $P(X = 1) = p$, $P(X = 0) = 1 - p$, where the random variable X takes the value 1 if the feature is present and 0 if it is absent. Thus, if the feature is observed m times in an n-element sample, an estimate of p is

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{m}{n}$$

We find that for $0.05 < \frac{m}{n} < 0.95$ and $n > 100$:

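$$P\left\{\frac{m}{n} - u_\alpha \sqrt{\frac{\frac{m}{n}\left(1 - \frac{m}{n}\right)}{n}} < p < \frac{m}{n} + u_\alpha \sqrt{\frac{\frac{m}{n}\left(1 - \frac{m}{n}\right)}{n}}\right\} \approx 1 - \alpha$$

Example. Estimation of the percentage of smokers among students. In a 1800-element sample, the number of smokers is m = 600. For $1 - \alpha = 0.95$, $u_\alpha = 1.96$, which can be read from the table of the standard normal cumulative distribution function (two-sided interval):

$$\frac{m}{n} = \frac{600}{1800} \approx 0.333, \qquad \sqrt{\frac{\frac{m}{n}\left(1 - \frac{m}{n}\right)}{n}} \approx 0.0111, \qquad u_\alpha \cdot 0.0111 \approx 0.0218.$$

Thus, the 95% confidence interval for the fraction of students who smoke is: 31.16% < p < 35.51%.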
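The smoker example can be checked with a few lines of Python. This is a minimal sketch under the Model IV approximation; `proportion_ci` is an illustrative name, not something defined in the exercise. Note that the radius of the interval is $u_\alpha$ times the standard error (about 0.011), not the standard error alone.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(m, n, alpha=0.05):
    """Model IV: m/n -/+ u_alpha * sqrt((m/n) * (1 - m/n) / n)."""
    p_hat = m / n
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = sqrt(p_hat * (1 - p_hat) / n)
    radius = u_alpha * std_error
    return p_hat - radius, p_hat + radius


low, high = proportion_ci(600, 1800)
print(f"{100 * low:.2f}% < p < {100 * high:.2f}%")
```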

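Confidence interval for standard deviation and variance (Q = σ). Assumption: the trait X has a normal distribution, or close to normal.

Model V: The population mean m and population standard deviation $\sigma$ are not known, the sample is large (n > 30). The statistic $Z = \frac{nS^2}{\sigma^2}$ has a $\chi^2$ distribution with n - 1 degrees of freedom:

[Figure: density of the $\chi^2$ distribution; the area to the left of $c_1$ and the area to the right of $c_2$ each equal $\frac{1}{2}\alpha$.]

$$P\left(c_1 < \frac{nS^2}{\sigma^2} < c_2\right) = 1 - \alpha$$

where $P\{\chi^2 < c_1\} = \frac{1}{2}\alpha$ and $P\{\chi^2 \ge c_2\} = \frac{1}{2}\alpha$. Note: most tables provide the value $P\{\chi^2 \ge a\}$. We obtain the following confidence interval for the variance:

$$P\left(\frac{nS^2}{c_2} < \sigma^2 < \frac{nS^2}{c_1}\right) = 1 - \alpha$$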

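Example. Consider again the data used in Model III.

Step I. We calculate the variance of the sample. The panel reports a sample variance of 1.390374, which is computed with divisor n - 1, so the sum of squared deviations, i.e. the observed value of $nS^2$ (the numerator of the statistic), amounts to 1.390374 · 33 = 45.8824.

Step II. We now calculate the values of $c_1$ and $c_2$ from the $\chi^2$ distribution with n - 1 = 33 degrees of freedom.

For $c_1$: $P\{\chi^2 < c_1\} = 0.025$ gives $c_1 = 19.0466$.

And for $c_2$: $P\{\chi^2 \ge c_2\} = 0.025$ gives $c_2 = 50.725$.

Step III. Hence the required confidence interval for the variance is:

$$\left(\frac{45.8824}{50.725};\ \frac{45.8824}{19.0466}\right) = (0.9045;\ 2.4090)$$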
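Step II needs two quantiles of the $\chi^2$ distribution with 33 degrees of freedom. The following is a minimal sketch using only the Python standard library. The helper names `chi2_cdf`, `chi2_quantile` and `variance_ci` are hypothetical, and the quantile is found by bisection on a series expansion of the distribution function, which is only one of several possible methods (a printed table or a statistics package would serve equally well).

```python
from math import exp, lgamma, log


def chi2_cdf(x, df):
    """P(chi2 <= x), from the series for the regularized incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = 1 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * exp(a * log(z) - z - lgamma(a))


def chi2_quantile(prob, df):
    """Solve chi2_cdf(x, df) = prob for x by bisection."""
    low, high = 0.0, 10.0 * df + 100.0
    for _ in range(200):
        mid = (low + high) / 2
        if chi2_cdf(mid, df) < prob:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def variance_ci(n_s2, n, alpha=0.05):
    """Model V: (n*S^2 / c2, n*S^2 / c1), with n - 1 degrees of freedom."""
    c1 = chi2_quantile(alpha / 2, n - 1)
    c2 = chi2_quantile(1 - alpha / 2, n - 1)
    return n_s2 / c2, n_s2 / c1


c1 = chi2_quantile(0.025, 33)
c2 = chi2_quantile(0.975, 33)
print(f"c1 = {c1:.4f}, c2 = {c2:.4f}")
```

`variance_ci` then divides the numerator from Step I by $c_2$ and $c_1$ to give the two ends of the interval. The series is adequate for moderate degrees of freedom; for very large values a library routine is the safer choice.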

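Model VI. The trait of interest has a normal or close to normal distribution, and the sample is large (n > 30):

$$P\left\{\frac{s}{1 + \frac{u_\alpha}{\sqrt{2n}}} < \sigma < \frac{s}{1 - \frac{u_\alpha}{\sqrt{2n}}}\right\} \approx 1 - \alpha$$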
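A sketch of the large-sample interval for the standard deviation, assuming the usual approximation that s is roughly $N\left(\sigma, \frac{\sigma}{\sqrt{2n}}\right)$ for large n, which leads to the bounds $s / (1 \pm u_\alpha / \sqrt{2n})$. The function name is illustrative, and the inputs are the standard deviation and count reported for the 34 tram observations.

```python
from math import sqrt
from statistics import NormalDist


def std_dev_ci_large_sample(s, n, alpha=0.05):
    """Approximate bounds s / (1 + u/sqrt(2n)) and s / (1 - u/sqrt(2n))."""
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = u_alpha / sqrt(2 * n)
    if shift >= 1:
        raise ValueError("sample too small for this approximation")
    return s / (1 + shift), s / (1 - shift)


low, high = std_dev_ci_large_sample(s=1.179141354, n=34)
print(f"{low:.4f} < sigma < {high:.4f}")
```

The interval is not symmetric about s: the upper bound lies further away than the lower one, because the same relative error is applied to the denominator.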

Sample Size Determination for interval estimates of the average with given confidence level.

Note that in each case the length of the confidence interval (the difference between its right and left ends) is given by an explicit formula. This allows us to determine the sample size needed to estimate the parameter with a predetermined precision at a specified confidence level. Task: what is the required sample size to obtain a confidence interval of given length (accuracy) at the chosen confidence level $1 - \alpha$? Let 2d be the required length of the interval.

Model I: Length of the interval: $2d = 2\frac{u_\alpha \sigma}{\sqrt{n}}$. Thus: $n = \frac{u_\alpha^2 \sigma^2}{d^2}$.

Model II: Length of the interval: $2d = 2\frac{t_\alpha s}{\sqrt{n - 1}}$. Thus: $n = \frac{t_\alpha^2 s^2}{d^2} + 1$.

It is necessary to draw an initial sample of size $n_0$ (s is calculated from this sample). If it turns out that $n > n_0$, an extra $n - n_0$ elements must be drawn.

Model III: As in model II.
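The sample-size formulas for Models I and II can be sketched as follows. The function names and the target half-length d = 1 minute are illustrative assumptions; the tram data serve as the initial sample of size 5, and the t value is assumed to be read from tables for 4 degrees of freedom (the initial sample size minus one).

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_known_sigma(sigma, d, alpha=0.05):
    """Model I: smallest n with u_alpha * sigma / sqrt(n) <= d."""
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((u_alpha * sigma / d) ** 2)


def sample_size_unknown_sigma(s, t_alpha, d):
    """Model II: n = t_alpha^2 * s^2 / d^2 + 1, rounded up.

    s comes from the initial sample (divisor n0); t_alpha comes from a
    Student t table for n0 - 1 degrees of freedom.
    """
    return ceil((t_alpha * s / d) ** 2 + 1)


# Model I: tram example, sigma = 2, hypothetical half-length d = 1 minute.
n_model_1 = sample_size_known_sigma(sigma=2, d=1)
print(n_model_1)

# Model II: initial tram sample of n0 = 5 observations with S^2 = 1.36.
n0 = 5
n_model_2 = sample_size_unknown_sigma(s=sqrt(1.36), t_alpha=2.776445, d=1)
extra = max(n_model_2 - n0, 0)
print(n_model_2, extra)
```

Rounding up with `ceil` guarantees that the achieved half-length does not exceed d.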

Model IV:
a) if we know the order of magnitude of p, the required sample size is $n = \frac{u_\alpha^2\, p(1 - p)}{d^2}$;
b) if we do not know the order of magnitude of p, we use the inequality $p(1 - p) \le \frac{1}{4}$. Thus: $n = \frac{u_\alpha^2}{4 d^2}$.

Additional problems.

1) Attendance at statistics lectures is as follows (in %): 72, 70, 58, 62, 67, 58, 90, 91, 56, 68, 68, 70, 71, 52, 69. Assuming this is a random sample of lectures, give a 95% confidence interval for the average percentage attendance.

2) According to the Wall Street Journal, an average of 44 tons of carbon dioxide will be saved per year if new, more efficient lamps are used. Assume that this average is based on a random sample of 25 test runs of the new lamps and that the sample standard deviation was 19 tons. Give a 90% confidence interval for the average annual savings.

3) A new optical disc system prototype was tested and is claimed to be able to record an average of 2.2 hours of HD TV. Assume n = 10 trials and σ = 0.2 hours. Give a 90% confidence interval for the mean recording time.

4) In a survey, Fortune rated companies on a 0 to 10 scale. A random sample of 10 firms and their scores is as follows: FedEx 8.94, Walt Disney 8.76, CHS 8.67, McDonald's 7.82, CVS 6.80, Safeway 6.57, Starbucks 8.09, Sysco 7.42, Staples 6.45, HNI 7.29. Construct a 95% confidence interval for the average rating of a company on Fortune's entire list.

6) Use the following random sample of gasoline prices to construct a 90% confidence interval for the average price of a litre (in PLN): 3.85, 3.95, 4.95, 4.19, 4.50, 4.25, 4.50, 4.32.

7) The percentage of defective pins in a large batch supplied by a vendor is to be estimated to within 1% at a 90% confidence level. The actual percentage of defective pins is guessed to be 4%.
a) What is the minimum sample size?
b) If the actual percentage of defective pins may be anywhere between 3% and 6%, tabulate the minimum sample size required for actual percentages from 3% to 6%.
c) If the cost of sampling and testing n pins is (25 + 6n) dollars, tabulate the cost for the same range of percentages as in part (b).

8) For all the above problems of interval estimation, determine the sample size required to obtain the required precision at a confidence level of not less than 99%.
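For the sample-size questions on proportions (as in problems 7 and 8), the two Model IV formulas can be wrapped in one small helper. This is a sketch with an illustrative function name. The usage lines take the smoker fraction of about 1/3 from the earlier example and a hypothetical precision of one percentage point (d = 0.01); they are not a solution to any of the problems above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_proportion(d, alpha=0.05, p=None):
    """Model IV: n = u^2 p(1-p) / d^2, or u^2 / (4 d^2) when p is unknown."""
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    spread = 0.25 if p is None else p * (1 - p)  # p(1 - p) never exceeds 1/4
    return ceil(u_alpha ** 2 * spread / d ** 2)


n_known_p = sample_size_proportion(d=0.01, p=1 / 3)
n_unknown_p = sample_size_proportion(d=0.01)
print(n_known_p, n_unknown_p)
```

The second call is the conservative case b): without any prior guess for p, the required sample is larger, since p(1 - p) is replaced by its maximum value 1/4.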