Avd. Matematisk statistik
SF2955 COMPUTER INTENSIVE METHODS
TOLERANCE INTERVALS WITH A COMPUTER ASSIGNMENT
2011
Timo Koski

1 Introduction: Statistical Intervals

Many practical problems are phrased in terms of individual measurements rather than parameters of distributions. We consider two such examples. The first one will require what is to be called a prediction interval, instead of a confidence interval.

    A consumer is considering buying a car. This person is far more interested in knowing whether a full tank on a particular automobile will suffice to carry her/him the 500 km to her/his destination than in learning that there is a 95% confidence interval for the mean mileage of the model, which can be used to project the average or total gasoline consumption for the manufactured fleet of such cars over their first 5000 kilometers of use.

A different situation appears in the following. This will require what is to be called a tolerance interval, instead of a confidence interval.

    A design engineer is charged with the problem of determining how large a tank the car model really needs to guarantee that 99% of the cars produced will have a cruising range of 500 kilometers.

What the engineer really needs is a tolerance interval for a fraction $100 \times \beta = 99\%$ of the mileages of such automobiles.

Prediction and tolerance intervals address problems of inference for (future) measurements.

2 Definitions of Other Intervals than Confidence Intervals

We must distinguish between two different questions (I and II) concerning inference for future values. Let $X_1, \ldots, X_n$ be I.I.D. with the (cumulative) distribution (function) $F$, and determine an interval $[L, U]$, $L < U$, such that either (I) or (II) holds:

I  For at least a $\gamma \times 100\%$ proportion of the time, the proportion $\beta \times 100\%$ of future observations $X_{n+1}, \ldots, X_{n+m}$ will fall in $[L, U]$.

II The probability that $X_{n+1}$ falls in $[L, U]$ is at least $\gamma$.

The question posed under I is that of tolerance intervals.
Tolerance intervals are meant to locate the bulk of an underlying distribution. The question under II is that of designing (average) prediction intervals for an individual observation.

2.1 I: Definition of Tolerance Interval and an Example

Let $X_1, \ldots, X_n$ be I.I.D. with the (cumulative) distribution (function) $F$. We have two statistics $L = L(X_1, \ldots, X_n)$ and $U = U(X_1, \ldots, X_n)$ such that

    $L(X_1, \ldots, X_n) \le U(X_1, \ldots, X_n)$.    (2.1)

Then we have the following concept, due to Walter Shewhart^1.

Definition 2.1 Let $\beta, \gamma \in (0, 1)$. Assume that $L$ and $U$ are such that

    $P\left(F(U) - F(L) \ge \beta\right) \ge \gamma$.    (2.2)

Then the interval $[L, U]$ is a $\beta$-content and $\gamma$-confidence tolerance interval.

^1 W. Shewhart: Economic Control of Quality of Manufactured Product. Van Nostrand Company, Inc., New York, 1931; republished as the 50th Anniversary Commemorative Reissue, 1981, by ASQC Quality Press, Milwaukee.

If possible, we should replace the last inequality in (2.2) with an equality. In words, we are proposing an interval within which, for at least $\gamma \times 100\%$ of the time, the proportion $\beta \times 100\%$ of future observations $X_{n+1}, \ldots, X_{n+m}$ falls.

Let us contrast this with the notion of a confidence interval. The method of confidence intervals creates intervals that cover a real-valued population parameter of a distribution (e.g., the mean or the variance) with some probability (giving a degree of confidence for a given interval). The bounds of a tolerance interval, in contrast, delimit a range of possible data values that represents a specified percentage of the population. In very simplified terms, a confidence interval characterizes what is known about a single quantity given a set of data, whereas a tolerance interval characterizes what is known about values across a collection of items. We shall try to clarify this.

In general, the distribution function $F$ is not known in contexts where a tolerance interval is desired. In such a case we should be familiar with the result in the following example.
Example 2.1 (The non-parametric tolerance interval^2) Let $F$ have a probability density, and let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics. Set

    $W = F\left(X_{(k_1 + k_2)}\right) - F\left(X_{(k_1)}\right)$.

Then $W \sim \mathrm{Beta}(k_2, n - k_2 + 1)$. For given $\beta > 0$ and $\gamma > 0$ we find $n$, $k_1$ and $k_2$ such that

    $P(W \ge \beta) \ge \gamma$;

then $[X_{(k_1)}, X_{(k_1 + k_2)}]$ is a $\beta$-content and $\gamma$-confidence tolerance interval. So the problem is solved! However, it has been found that tolerance intervals obtained by the non-parametric method tend to be wider than intervals designed specifically for, e.g., scale or location parameter families.

^2 See S.S. Wilks: Mathematical Statistics, John Wiley & Sons, New York, 1962, pp. 334-335.

2.2 II: Definition of Prediction or Average Tolerance Intervals

The prediction interval can be generically written as

    $P\left(X_{n+1} \in [L, U]\right) \ge \gamma$,    (2.3)

or

    $P\left(L(X_1, \ldots, X_n) \le X_{n+1} \le U(X_1, \ldots, X_n)\right) \ge \gamma$,

or

    $E_F\left(F(U(X_1, \ldots, X_n)) - F(L(X_1, \ldots, X_n))\right) \ge \gamma$.

Somebody has called prediction and tolerance intervals the 'most slippery of all concepts'. A prediction interval does not say that a fraction $\gamma$ of future observations will fall within $[L, U]$. It is the expected probability content of $[L, U]$ that is at least $\gamma$, and many samples will lead to intervals covering less than $\gamma \times 100\%$ of the underlying distribution. Let us note that the $\gamma \times 100\%$ prediction confidence relates to the whole process of generating $X_1, \ldots, X_n$ and $X_{n+1}$.

3 The Tolerance Interval for a Normal Distribution

3.1 k-Factor Tolerance Interval

Let $F \leftrightarrow N(m, \sigma^2)$, where $m$ and $\sigma$ are unknown, and let $X_1, \ldots, X_n$ be I.I.D. $\sim N(m, \sigma^2)$. We set

    $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$,    $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2$.

Then, as can be expected, we take two statistics $L$ and $U$ of the form

    $L = L(X_1, \ldots, X_n) = \bar{X} - kS$,
    $U = U(X_1, \ldots, X_n) = \bar{X} + kS$.

Here the values of $k$ are to be chosen such that the tolerance limits $L$ and $U$ satisfy (2.2) for given $\beta$ and $\gamma$.
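Before turning to the choice of $k$, note that the non-parametric construction of Example 2.1 is directly computable: $P(W \ge \beta)$ is a Beta survival function. The following is a minimal Python sketch, assuming SciPy is available; the helper names `conf_nonparam` and `min_n_for_range` are ours, not from the lecture notes.

```python
from scipy.stats import beta


def conf_nonparam(n: int, k1: int, k2: int, beta_content: float) -> float:
    """Confidence P(W >= beta_content) that [X_(k1), X_(k1+k2)] covers at
    least a beta_content-fraction of F; here W ~ Beta(k2, n - k2 + 1)."""
    return beta.sf(beta_content, k2, n - k2 + 1)


def min_n_for_range(beta_content: float, gamma: float) -> int:
    """Smallest n such that the sample range [X_(1), X_(n)] is a
    beta_content-content, gamma-confidence tolerance interval
    (k1 = 1, k2 = n - 1, so W ~ Beta(n - 1, 2))."""
    n = 2
    while conf_nonparam(n, 1, n - 1, beta_content) < gamma:
        n += 1
    return n


# A 0.99-content, 0.95-confidence interval from the sample range
# requires n = 473 observations (a classical distribution-free result).
n_req = min_n_for_range(0.99, 0.95)
```

This makes the drawback noted above concrete: guaranteeing a high content with high confidence non-parametrically demands a large sample, which is one reason to prefer intervals tailored to a parametric family when one is available.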
Such $k$'s are known as k-factors, and

    $[\bar{X} - kS, \bar{X} + kS]$

is known as the k-factor tolerance interval. Let us next check that such a $k$ can be found, and how this might be done.

We first write down $F(U) - F(L)$ (conditioned on $\bar{X} = \bar{x}$) in (2.2), whereby it is convenient to introduce the following auxiliary notation:

    $A(k, \bar{x}) = \frac{1}{\sigma\sqrt{2\pi}} \int_{\bar{x} - ks}^{\bar{x} + ks} e^{-\frac{1}{2\sigma^2}(t - m)^2} \, dt = F(U) - F(L) \mid \bar{X} = \bar{x}$.

Then we set

    $P_n\left(k, \beta \mid \bar{X}\right) = P\left(A(k, \bar{X}) \ge \beta \mid \bar{X}\right)$,

the conditional probability that $A(k, \bar{X}) \ge \beta$ given $\bar{X}$. Then iterated expectation gives

    $P_n(k, \beta) = E\left[P_n\left(k, \beta \mid \bar{X}\right)\right]$.

Thus $P_n(k, \beta)$ is the probability that the interval $[\bar{X} - kS, \bar{X} + kS]$ includes at least $\beta \times 100\%$ of the outcomes of a random variable with the distribution $N(m, \sigma^2)$. Then for each $n$, $\beta \in [0, 1]$ and $\gamma \in [0, 1]$ there exists a $k$ such that

    $P_n(k, \beta) = \gamma$.    (3.1)

Clearly eqn. (3.1) is the current instance of (2.2), and it must be solved w.r.t. $k$ by computational means to get the $\beta$-content and $\gamma$-confidence k-factor tolerance interval. There are known algorithms, like the Wald-Wolfowitz method, for the approximate solution of (3.1). A. Wald and J. Wolfowitz showed (in 1946) that

    $k \approx r \times u$,    (3.2)

where $r$ is a function of $n$ and $\beta$ determined from

    $\frac{1}{\sqrt{2\pi}} \int_{\frac{1}{\sqrt{n}} - r}^{\frac{1}{\sqrt{n}} + r} e^{-t^2/2} \, dt = \beta$,

e.g., by the Newton-Raphson method, and $u$ is defined by

    $u = \sqrt{\frac{f}{\chi^2_{1-\gamma}(f)}}$,

where $f = n - 1$ and $\chi^2_{1-\gamma}(f)$ is the $1 - \gamma$ percentile of the $\chi^2$ distribution with $f$ degrees of freedom. $k$-values computed according to (3.2) have been tabulated^3.

3.2 A Prediction Interval

We invoke some standard elementary theory of confidence intervals to create a prediction interval by means of a trick. Let

    $X_1, X_2, \ldots, X_{n_1} \sim N(\mu_1, \sigma^2)$,
    $Y_1, Y_2, \ldots, Y_{n_2} \sim N(\mu_2, \sigma^2)$,

where all the $X$'s and $Y$'s are independent. We want to find a confidence interval for $\mu_1 - \mu_2$. We estimate $\mu_1 - \mu_2$ by $\bar{X} - \bar{Y}$. It follows that

    $\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

has an $N(0, 1)$-distribution. Let us assume that $\sigma$ is unknown.
We estimate $\sigma^2$ by

    $S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$,

where $S_1^2$ and $S_2^2$ are the familiar unbiased estimates of $\sigma^2$ based on the $X$'s and the $Y$'s, respectively. Then it follows that

    $\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$    (3.3)

^3 E.g., in D.B. Owen: Handbook of Statistical Tables, Addison-Wesley, Palo Alto, 1962. Here one finds a table for $r$ and a table for $u$ for some choices of $\beta$ (= $P$ in the table) and $\gamma$. Disturbingly, these values are said to be unreliable.
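Since the tabulated k-values are reported to be unreliable, it is worth noting that the Wald-Wolfowitz approximation (3.2) is easy to compute directly: solve the integral equation for $r$ numerically and combine it with the $\chi^2$ percentile giving $u$. The following is a minimal Python sketch, assuming SciPy is available; the function name `k_factor` is ours (a root-bracketing solver is used in place of Newton-Raphson, which the notes mention as one option).

```python
from scipy.stats import norm, chi2
from scipy.optimize import brentq


def k_factor(n: int, beta: float, gamma: float) -> float:
    """Approximate Wald-Wolfowitz k-factor, k ~ r * u, for the
    beta-content, gamma-confidence interval [Xbar - k S, Xbar + k S]."""
    a = 1.0 / n ** 0.5
    # r solves Phi(1/sqrt(n) + r) - Phi(1/sqrt(n) - r) = beta,
    # i.e. the integral equation below (3.2).
    r = brentq(lambda rr: norm.cdf(a + rr) - norm.cdf(a - rr) - beta,
               1e-6, 20.0)
    # u = sqrt(f / chi2_{1-gamma}(f)) with f = n - 1 degrees of freedom;
    # chi2.ppf(1 - gamma, f) is the (1 - gamma) percentile of chi^2(f).
    f = n - 1
    u = (f / chi2.ppf(1.0 - gamma, f)) ** 0.5
    return r * u
```

For example, `k_factor(10, 0.95, 0.95)` gives a value close to 3.38, in line with standard tables for the two-sided normal tolerance factor; as $n$ grows, $k$ shrinks toward the normal quantile $z_{(1+\beta)/2} \approx 1.96$ for $\beta = 0.95$, since both the uncertainty in $\bar{X}$ and in $S$ vanish.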